Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (09): 2493-2496. DOI: 10.11772/j.issn.1001-9081.2013.09.2493

• Database technology

Data deduplication in Web information integration

LIU Xueqiong, WU Gang, DENG Houping

  1. College of Information, Beijing Forestry University, Beijing 100083, China
  • Received: 2013-03-19 Revised: 2013-04-28 Online: 2013-10-18 Published: 2013-09-01
  • Contact: WU Gang

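  • About the authors: LIU Xueqiong (born 1986), female, from Shijiazhuang, Hebei, M.S. candidate; research interests: Web information integration, data mining;
    WU Gang (born 1962), male, from Beijing, professor, doctoral supervisor, Ph.D.; research interests: e-commerce, information integration, data mining;
    DENG Houping (born 1989), male, from Jingmen, Hubei, M.S. candidate; research interest: information integration.
  • Supported by:

    the Fundamental Research Funds for the Central Universities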

Abstract: Because traditional data deduplication methods suffer from low time efficiency and low detection accuracy, a Stepwise Clustering Data Elimination (SCDE) method is presented that exploits the characteristics of Web information integration. First, the whole record set is divided into small subsets by key-attribute partitioning followed by Canopy clustering; then the approximately duplicate records within each subset are detected precisely. For this detection step, a fuzzy entity matching strategy based on dynamic weights is proposed: weights are assigned dynamically to reduce the influence of missing attributes on the record similarity calculation, and company names receive special treatment to improve matching accuracy. Experimental results show that the method outperforms traditional algorithms in both time efficiency and detection accuracy, with precision improved by 12.6%. The method has been applied in a forestry yellow pages system and performs well.
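The stepwise idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the attribute names (`id`, `city`, `name`, `address`, `phone`), the base weights, the thresholds, and the similarity measures (token Jaccard for canopy formation, `difflib.SequenceMatcher` for field comparison) are all assumptions chosen for illustration. "Dynamic weight" is interpreted here as renormalizing the weights over the attributes that both records actually contain; the special handling of company names is not sketched because the abstract does not describe it.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical attribute weights: the abstract does not state the real values.
BASE_WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}


def cheap_similarity(r1, r2):
    """Jaccard overlap of name tokens, a cheap measure used only to form canopies."""
    t1 = set((r1.get("name") or "").lower().split())
    t2 = set((r2.get("name") or "").lower().split())
    return len(t1 & t2) / len(t1 | t2) if (t1 | t2) else 0.0


def canopies(records, loose=0.3, tight=0.7):
    """Group record indices into (possibly overlapping) canopies."""
    remaining = list(range(len(records)))
    result = []
    while remaining:
        center = remaining[0]
        sims = {i: cheap_similarity(records[center], records[i]) for i in remaining}
        # Everything loosely similar to the center joins its canopy.
        result.append([i for i in remaining if i == center or sims[i] >= loose])
        # The center and tightly similar records cannot seed further canopies.
        remaining = [i for i in remaining if i != center and sims[i] < tight]
    return result


def field_similarity(a, b):
    """String similarity in [0, 1]; SequenceMatcher is a stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, base_weights=BASE_WEIGHTS):
    """Weighted similarity over the attributes present in both records.

    Weights of missing attributes are dropped and the rest are renormalized
    to sum to 1, so a missing value is not counted as a mismatch.
    """
    present = [k for k in base_weights if r1.get(k) and r2.get(k)]
    total = sum(base_weights[k] for k in present)
    if total == 0:
        return 0.0
    return sum(base_weights[k] / total * field_similarity(r1[k], r2[k])
               for k in present)


def find_duplicates(records, key="city", threshold=0.85):
    """Partition by a key attribute, build canopies, then match pairs per canopy."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec.get(key)].append(rec)
    pairs = set()  # a set, because overlapping canopies can repeat a pair
    for block in blocks.values():
        for canopy in canopies(block):
            for i, j in combinations(canopy, 2):
                if record_similarity(block[i], block[j]) >= threshold:
                    pairs.add((block[i]["id"], block[j]["id"]))
    return sorted(pairs)
```

Only pairs that share a key-attribute block and a canopy are compared in detail, which is where the time saving over all-pairs comparison would come from. In this sketch, two records that agree on name and address still score 1.0 when one of them lacks a phone number, because the phone weight is redistributed instead of being counted as a mismatch.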

Key words: Web information integration, approximately duplicate record, dynamic weight, fuzzy entity matching

