计算机应用 ›› 2016, Vol. 36 ›› Issue (2): 397-401.DOI: 10.11772/j.issn.1001-9081.2016.02.0397

• 第三届CCF大数据学术会议(CCF BigData 2015) • 上一篇    下一篇

结合局部敏感哈希的k近邻数据填补算法

郑奇斌1, 刁兴春2, 曹建军2, 周星1, 许永平2   

  1. 1. 解放军理工大学 指挥信息系统学院, 南京 210007;
    2. 总参第六十三研究所, 南京 210007
  • 收稿日期:2015-08-29 修回日期:2015-09-19 出版日期:2016-02-10 发布日期:2016-02-03
  • 通讯作者: 刁兴春(1964-),男,江苏泰兴人,研究员,硕士,主要研究方向:数据工程、数据质量。
  • 作者简介:郑奇斌(1990-),男,甘肃兰州人,硕士研究生,主要研究方向:数据挖掘、数据质量;曹建军(1975-),男,山东郓城人,工程师,博士,CCF高级会员,主要研究方向:数据工程、数据质量、进化算法;周星(1988-),男,四川广安人,博士研究生,主要研究方向:数据工程、数据质量、模式识别;许永平(1979-),男,河北宣化人,工程师,博士,主要研究方向:数据质量、信息系统效能评估。
  • 基金资助:
    国家自然科学基金资助项目(61371196);中国博士后科学基金特别资助项目(201003797);解放军理工大学预研基金项目(20110604,41150301)。

k-nearest neighbor data imputation algorithm combined with locality sensitive Hashing

ZHENG Qibin1, DIAO Xingchun2, CAO Jianjun2, ZHOU Xing1, XU Yongping2   

  1. 1. College of Command Information System, PLA University of Science and Technology, Nanjing Jiangsu 210007, China;
    2. The 63rd Research Institute of PLA General Staff Headquarters, Nanjing Jiangsu 210007, China
  • Received:2015-08-29 Revised:2015-09-19 Online:2016-02-10 Published:2016-02-03

摘要: k近邻(kNN)算法是缺失数据填补的常用算法,但由于需要逐个计算所有记录对之间的相似度,因此其填补耗时较高。为提高算法效率,提出结合局部敏感哈希(LSH)的kNN数据填补算法LSH-kNN。首先,对不存在缺失的完整记录进行局部敏感哈希,为之后查找近似最近邻提供索引;其次,针对枚举型、数值型以及混合型缺失数据分别提出对应的局部敏感哈希方法,对每一条待填补的不完整记录进行局部敏感哈希,按得到的哈希值找到与其疑似相似的候选记录;最后在候选记录中通过逐个计算相似度来找到其中相似程度最高的k条记录,并按照kNN算法对不完整记录进行填补。通过在4个真实数据集上的实验表明,结合局部敏感哈希的kNN填补算法LSH-kNN相对经典的kNN算法能够显著提高填补效率,并且保持准确性基本不变。

关键词: 数据质量, 数据完整性, 数据填补, k近邻算法, 局部敏感哈希

Abstract: k-Nearest Neighbor (kNN) algorithm is commonly used in data imputation. It is of poor efficiency because of the similarity computation between every tow records. To solve the efficiency problem, an improved kNN data imputation algorithm combined with Locality Sensitive Hashing (LSH) named LSH-kNN was proposed. First, all the complete records were indexed in LSH way. Then corresponding LSH ways for nominal, numeric and mixed-type incomplete data were put forward, and LSH values for all the incomplete records were computed in the proposed way to find candidate similar records. Finally, the incomplete records' real distance to candidate similar records were calculated, and the top-k similar records for kNN imputation were found. The experimental results show that the proposed method LSH-kNN has higher efficiency than traditional kNN as well as keeping almost the same accuracy.

Key words: data quality, data integrity, data imputation, k-nearest neighbor(kNN)algorithm, Locality Sensitive Hashing(LSH)

中图分类号: