Journal of Computer Applications (计算机应用) ›› 2005, Vol. 25 ›› Issue (06): 1362-1365. DOI: 10.3724/SP.J.1087.2005.1362

• Database and Data Mining •

Efficient approach for identifying approximately duplicate Chinese database records
(一种检测汉语相似重复记录的有效方法)

CHENG Guo-da (程国达), SU Hang-li (苏杭丽)

  1. College of Information Engineering, Nanjing University of Finance & Economics, Nanjing Jiangsu 210003, China
  • Online: 2011-04-06 Published: 2005-06-01

Abstract: Eliminating duplicate records improves data quality. A method is proposed for selecting sorting fields according to the number of distinct values each field takes. During detection of approximately duplicate records, the first sorting field is used to build a two-dimensional linked list that stores approximately duplicate records; the records in this list are then sorted and compared using the second and third sorting fields to improve detection. To match Chinese character strings correctly, the paper studies mismatches caused by abbreviations and input errors caused by characters that are similar in pronunciation or shape. Some of these input errors are resolved by looking up a "Similar Chinese Characters Table" (相似汉字表), and a similarity function is computed to decide whether two compared records are duplicates. Experiments show that the proposed method detects approximately duplicate Chinese records effectively.

Key words: approximately duplicate Chinese records, sorting field, 2-D linked list
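
The field selection and sort-then-compare steps summarized in the abstract can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say whether fields with more or fewer distinct values are preferred, so ranking by descending distinct-value count is an assumption; a list of lists stands in for the two-dimensional linked list; the `char_overlap` measure, the 0.6 threshold, the function names, and the sample records are all hypothetical.

```python
def rank_sort_fields(records, fields, k=3):
    """Rank fields by number of distinct values (assumed: more distinct = better key)."""
    counts = {f: len({r[f] for r in records}) for f in fields}
    # sorted() is stable, so tied fields keep the order given in `fields`.
    return sorted(fields, key=lambda f: -counts[f])[:k]

def char_overlap(a, b):
    """Placeholder similarity: shared characters over the smaller character set."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))

def build_rows(records, field, threshold=0.6):
    """Sort by `field`; chain a record onto the current row if it resembles
    the row's first record, otherwise start a new row (stand-in for the 2-D list)."""
    rows = []
    for rec in sorted(records, key=lambda r: r[field]):
        if rows and char_overlap(rows[-1][0][field], rec[field]) >= threshold:
            rows[-1].append(rec)
        else:
            rows.append([rec])
    return rows

def candidate_pairs(rows, fields):
    """Within each row, re-sort by each remaining field and pair up neighbours."""
    pairs = set()
    for row in rows:
        for f in fields:
            ordered = sorted(row, key=lambda r: r[f])
            for a, b in zip(ordered, ordered[1:]):
                pairs.add(tuple(sorted((a["id"], b["id"]))))
    return pairs

# Made-up sample records for illustration only.
records = [
    {"id": 1, "name": "南京财经大学", "city": "南京", "phone": "025-1111"},
    {"id": 2, "name": "南京财大", "city": "南京", "phone": "025-1111"},
    {"id": 3, "name": "苏州大学", "city": "苏州", "phone": "0512-2222"},
]
fields = rank_sort_fields(records, ["name", "city", "phone"])
rows = build_rows(records, fields[0])
pairs = candidate_pairs(rows, fields[1:])
print(pairs)  # {(1, 2)}
```

Only the pairs that survive this step would need to go through a full similarity comparison, which is where the savings of a sort-based approach come from.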

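
The matching step described in the abstract (a lookup table of similar characters followed by a similarity function) might look something like the sketch below. The abstract gives neither the table's contents nor the actual similarity function, so everything here is an assumption: the four confusion groups are a hypothetical excerpt, a longest-common-subsequence Dice score is swapped in as the similarity measure because it tolerates abbreviations that drop characters, and the field weights, the 0.8 threshold, and the sample records are made up.

```python
# Hypothetical excerpt of a similar-characters table: each group lists
# characters that are easily confused by shape or pronunciation.
SIMILAR_GROUPS = ["己已巳", "未末", "戊戌戍", "在再"]
CANON = {ch: group[0] for group in SIMILAR_GROUPS for ch in group}

def normalize(s):
    """Replace each character with the representative of its confusion group."""
    return "".join(CANON.get(ch, ch) for ch in s)

def lcs_len(a, b):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def field_similarity(a, b):
    """Dice score over the LCS of the normalized strings (assumed measure)."""
    a, b = normalize(a), normalize(b)
    if not a and not b:
        return 1.0
    return 2 * lcs_len(a, b) / (len(a) + len(b))

def record_similarity(r1, r2, weights):
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items()) / total

def is_duplicate(r1, r2, weights, threshold=0.8):
    return record_similarity(r1, r2, weights) >= threshold

# Made-up records: `b` abbreviates the name and mistypes 未 as 末.
a = {"name": "南京财经大学", "contact": "王未"}
b = {"name": "南京财大", "contact": "王末"}
c = {"name": "苏州大学", "contact": "李明"}
weights = {"name": 2, "contact": 1}

print(is_duplicate(a, b, weights))  # True
print(is_duplicate(a, c, weights))  # False
```

Normalizing before comparison means a confusable character costs nothing, while the subsequence-based score lets an abbreviation such as 南京财大 still score 0.8 against the full name.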