Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (08): 2208-2211.

• Database Technology •

Algorithm for detecting approximate duplicate records in massive data

ZHOU Dianrui, ZHOU Lianying

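  1. School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
  • Received: 2013-02-25 Revised: 2013-04-06 Published: 2013-08-01 Online: 2013-09-11
  • Corresponding author: ZHOU Dianrui
  • About the authors: ZHOU Dianrui (born 1987), male, from Tai'an, Shandong, master's student; main research interest: data cleaning;
    ZHOU Lianying (born 1964), female, from Taizhou, Jiangsu, professor, Ph.D.; main research interests: computer network performance analysis, information security, e-commerce, and networked information systems.
  • Funding:

    Jiangsu Province Science and Technology Support Program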

Algorithm for detecting approximate duplicate records in massive data

ZHOU Dianrui, ZHOU Lianying

  1. School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
  • Received: 2013-02-25 Revised: 2013-04-06 Online: 2013-09-11 Published: 2013-08-01
  • Contact: ZHOU Dianrui

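Abstract: To address the low precision and low efficiency of approximate duplicate record detection algorithms on massive data, an integrated weighted method and a filtration method based on string length are used to detect approximate duplicates in a dataset. The integrated weighted method computes the weight of each attribute by combining user experience with mathematical statistics. The filtration method based on string length uses the length difference between strings to terminate the edit distance computation early during similarity detection, which reduces the number of records to be matched. Experimental results show that the weight vector computed by the integrated weighted method reflects the importance of each attribute more comprehensively and accurately, and that the filtration method based on string length reduces the time spent comparing records, so the approach can effectively handle approximate duplicate record detection on massive data.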
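
The abstract does not give the formula used to merge the two sources of attribute weights, so the following is only a minimal sketch of one plausible reading: a linear blend of a user-supplied weight vector and a statistically derived one, normalized to sum to 1, then used to score a pair of records. The names `combine_weights` and `record_similarity` and the blending factor `alpha` are hypothetical and do not come from the paper; how the statistical weights are obtained is left to the caller.

```python
from typing import List, Sequence


def combine_weights(user_weights: Sequence[float],
                    stat_weights: Sequence[float],
                    alpha: float = 0.5) -> List[float]:
    """Blend two attribute-weight vectors and normalize the result.

    alpha is an assumed blending factor: 1.0 trusts only the user's
    weights, 0.0 trusts only the statistically derived weights.
    """
    if len(user_weights) != len(stat_weights):
        raise ValueError("weight vectors must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    blended = [alpha * u + (1.0 - alpha) * s
               for u, s in zip(user_weights, stat_weights)]
    total = sum(blended)
    if total <= 0:
        raise ValueError("blended weights must have a positive sum")
    return [w / total for w in blended]


def record_similarity(field_sims: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted sum of per-attribute similarities, each in [0, 1]."""
    if len(field_sims) != len(weights):
        raise ValueError("one similarity is needed per attribute weight")
    return sum(w * s for w, s in zip(weights, field_sims))
```

With weights that sum to 1 and per-attribute similarities in [0, 1], the record score also lies in [0, 1], so a single threshold can decide whether two records are flagged as approximate duplicates.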

Key words: massive data, approximate duplicate record, integrated weighted method, edit distance

Abstract: To address the low precision and poor time efficiency of approximate duplicate record detection in massive data, an integrated weighted method and a filtration method based on string length were adopted to detect approximate duplicate records in a dataset. The integrated weighted method combined user experience with mathematical statistics to calculate the weight of each attribute, putting the weight calculation on a more principled footing. The filtration method based on string length used the length difference between strings to terminate the edit distance computation early, which reduced the number of records to be matched during detection. The experimental results show that the weight vector calculated by the integrated weighted method reflects the importance of each field more comprehensively and accurately. The filtration method based on string length reduces the comparison time among records and effectively solves the problem of detecting approximate duplicate records in massive data.
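
The length-based filter rests on a standard property of edit distance: every insertion or deletion changes the string length by one, so the distance between two strings is never smaller than the difference of their lengths. The sketch below illustrates that idea under assumptions of my own; the function name `bounded_edit_distance` and the threshold parameter `max_dist` are hypothetical, and the paper's actual procedure may differ in detail.

```python
from typing import Optional


def bounded_edit_distance(a: str, b: str, max_dist: int) -> Optional[int]:
    """Return the Levenshtein distance of a and b if it is <= max_dist.

    Returns None as soon as the distance is known to exceed max_dist.
    """
    # Length filter: the distance is at least the length difference,
    # so such pairs are rejected without filling any table.
    if abs(len(a) - len(b)) > max_dist:
        return None
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        # Row minima never decrease, so once a whole row exceeds the
        # threshold the final distance must exceed it too.
        if min(current) > max_dist:
            return None
        previous = current
    return previous[-1] if previous[-1] <= max_dist else None
```

For example, `bounded_edit_distance("abc", "abcdefgh", 2)` returns `None` from the length check alone, because the two lengths differ by 5. On large record sets this cheap test can discard many candidate pairs before any dynamic-programming work is done.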

Key words: massive data, approximate duplicate record, integrated weighted method, edit distance

CLC number: