k-nearest neighbor data imputation algorithm combined with locality sensitive hashing

doi:10.11772/j.issn.1001-9081.2016.02.0397

Journal of Computer Applications, 2016, Vol. 36, Issue (2): 397-401. DOI: 10.11772/j.issn.1001-9081.2016.02.0397

• The 3rd CCF Big Data Conference (CCF BigData 2015)

ZHENG Qibin¹, DIAO Xingchun², CAO Jianjun², ZHOU Xing¹, XU Yongping²

1. College of Command Information System, PLA University of Science and Technology, Nanjing Jiangsu 210007, China;
2. The 63rd Research Institute of PLA General Staff Headquarters, Nanjing Jiangsu 210007, China

Received: 2015-08-29 Revised: 2015-09-19 Online: 2016-02-10 Published: 2016-02-03
Corresponding author: DIAO Xingchun (born 1964), male, from Taixing, Jiangsu; research fellow, M.S.; main research interests: data engineering, data quality.
About the authors: ZHENG Qibin (born 1990), male, from Lanzhou, Gansu; M.S. candidate; main research interests: data mining, data quality. CAO Jianjun (born 1975), male, from Yuncheng, Shandong; engineer, Ph.D., CCF senior member; main research interests: data engineering, data quality, evolutionary algorithms. ZHOU Xing (born 1988), male, from Guang'an, Sichuan; Ph.D. candidate; main research interests: data engineering, data quality, pattern recognition. XU Yongping (born 1979), male, from Xuanhua, Hebei; engineer, Ph.D.; main research interests: data quality, information system effectiveness evaluation.
Supported by:
National Natural Science Foundation of China (61371196); China Postdoctoral Science Foundation Special Funding (201003797); Pre-research Foundation of PLA University of Science and Technology (20110604, 41150301).

Abstract

The k-Nearest Neighbor (kNN) algorithm is commonly used for missing data imputation, but it is slow because it must compute the similarity between every pair of records. To improve efficiency, a kNN data imputation algorithm combined with Locality Sensitive Hashing (LSH), named LSH-kNN, is proposed. First, all complete records (those with no missing values) are hashed with LSH to build an index for the later approximate nearest neighbor search. Second, corresponding LSH methods are put forward for nominal, numeric, and mixed-type incomplete data; each incomplete record is hashed with the appropriate method, and the records that share its hash values are taken as candidate similar records. Finally, the actual similarity between the incomplete record and each candidate is computed to find the k most similar records, which are then used to impute the incomplete record as in standard kNN. Experimental results on four real datasets show that LSH-kNN is significantly more efficient than classical kNN while keeping accuracy almost unchanged.

Key words: data quality, data integrity, data imputation, k-nearest neighbor (kNN) algorithm, Locality Sensitive Hashing (LSH)
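
The three steps described in the abstract (index the complete records, look up candidates by hash value, then run exact kNN on the candidates only) can be sketched for purely numeric data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `LshKnnImputer`, its parameters, the Gaussian random-projection hash, the mean-filling of missing entries when hashing an incomplete record, and the fallback to a full scan are all hypothetical choices made here, since the abstract does not specify how LSH-kNN handles these details.

```python
import numpy as np
from collections import defaultdict


class LshKnnImputer:
    """Toy LSH + kNN imputer for numeric data (illustrative only)."""

    def __init__(self, n_tables=8, n_proj=4, width=4.0, k=5, seed=0):
        self.n_tables, self.n_proj = n_tables, n_proj
        self.width, self.k = width, k
        self.rng = np.random.default_rng(seed)

    def _keys(self, x):
        # One bucket key per table: floor((a . x + b) / w) for each projection.
        codes = np.floor((self.A @ x + self.b) / self.width).astype(int)
        return [tuple(row) for row in codes]

    def fit(self, complete):
        # Step 1: index every complete record in all hash tables.
        self.data = np.asarray(complete, dtype=float)
        d = self.data.shape[1]
        self.means = self.data.mean(axis=0)
        # Gaussian projections are 2-stable, which suits Euclidean distance.
        self.A = self.rng.normal(size=(self.n_tables, self.n_proj, d))
        self.b = self.rng.uniform(0, self.width,
                                  size=(self.n_tables, self.n_proj))
        self.tables = [defaultdict(list) for _ in range(self.n_tables)]
        for i, row in enumerate(self.data):
            for table, key in zip(self.tables, self._keys(row)):
                table[key].append(i)
        return self

    def impute(self, record):
        x = np.asarray(record, dtype=float)
        missing = np.isnan(x)
        # Assumption: missing entries are mean-filled only to compute hash keys.
        probe = np.where(missing, self.means, x)
        # Step 2: collect candidates that share a bucket in any table.
        cand = set()
        for table, key in zip(self.tables, self._keys(probe)):
            cand.update(table.get(key, ()))
        # Assumption: fall back to a full scan if fewer than k candidates.
        if len(cand) >= self.k:
            idx = np.array(sorted(cand))
        else:
            idx = np.arange(len(self.data))
        # Step 3: exact distances on observed attributes, candidates only.
        diff = self.data[idx][:, ~missing] - x[~missing]
        dist = np.linalg.norm(diff, axis=1)
        nearest = idx[np.argsort(dist)[: self.k]]
        # Fill each missing attribute with the mean over the k neighbors.
        filled = x.copy()
        filled[missing] = self.data[nearest][:, missing].mean(axis=0)
        return filled


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    complete = rng.normal(size=(1000, 5))
    complete[:, 4] = complete[:, :4].sum(axis=1)  # last column is predictable
    imputer = LshKnnImputer().fit(complete)
    print(imputer.impute([0.1, -0.2, 0.3, 0.4, np.nan]))
```

The efficiency gain comes from step 3: exact distances are computed only against the records that landed in the same buckets, rather than against every complete record. In this sketch, more tables raise the chance of finding true neighbors at the cost of more candidates, while more projections per table or a smaller `width` make buckets narrower.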

CLC Number: TP391

Cite this article:

ZHENG Qibin, DIAO Xingchun, CAO Jianjun, ZHOU Xing, XU Yongping. k-nearest neighbor data imputation algorithm combined with locality sensitive hashing[J]. Journal of Computer Applications, 2016, 36(2): 397-401.

