Journal of Computer Applications ›› 2005, Vol. 25 ›› Issue (06): 1362-1365. DOI: 10.3724/SP.J.1087.2005.1362
• Database and data mining •
CHENG Guo-da, SU Hang-li
程国达,苏杭丽
Abstract: Eliminating duplicate records improves data quality. An approach was proposed that selects the sorting fields according to the number of distinct values each field takes. When identifying approximately duplicate records, the first sorting field was used to build a 2-D-linked list that stores the approximately duplicate records; the second and third sorting fields were then used to sort and compare, pairwise, the records in that list, which improves detection. To match Chinese character strings correctly, the mismatches caused by customary abbreviations and the input errors caused by characters that are similar in pronunciation or shape were studied. Some input errors were resolved by looking up the “Similarity Chinese Characters Table”, and a similarity function was computed to determine whether two compared records were duplicates. The experimental results show that the approach effectively detects approximately duplicate Chinese database records.
Key words: approximately duplicate Chinese records, sorting field, 2-D-linked list
摘要: 消除重复记录可以提高数据质量。提出了按字段值种类数选择排序字段的方法。在相似重复记录的检测中,用第1个排序字段建立存储相似重复记录的二维链表,然后再用第2、第3个排序字段对二维链表中的记录进行排序—比较,以提高检测效果。为了正确地匹配汉字串,研究了由于缩写所造成的不匹配和读音、字型相似造成的输入错误。通过查找“相似汉字表”解决部分输入错误的问题,计算相似度函数判断被比较的记录是否是重复记录。实验表明,提出的方法能有效的检测汉语相似重复记录。
关键词: 汉语相似重复记录, 排序字段, 二维链表
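The abstract gives no implementation details, so the following is only a minimal sketch of two of the ideas it names: choosing sorting fields by their number of distinct values, and matching strings after consulting a similar-character table. Everything concrete here is an assumption made for illustration: the function names, the tiny `SIMILAR_GROUPS` table, the preference for fields with more distinct values, the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` as a stand-in for the paper's similarity function. The 2-D-linked list is not modelled.

```python
from difflib import SequenceMatcher

# Hypothetical similar-character table: each tuple groups characters that are
# easily confused by shape or pronunciation. The real table is not given.
SIMILAR_GROUPS = [("己", "已", "巳"), ("未", "末")]
CANONICAL = {ch: group[0] for group in SIMILAR_GROUPS for ch in group}


def rank_sorting_fields(records, fields, k=3):
    """Return up to k fields, ordered by number of distinct values (descending).

    Assumption: fields with more distinct values are better sorting keys.
    """
    counts = {f: len({r[f] for r in records}) for f in fields}
    return sorted(fields, key=lambda f: counts[f], reverse=True)[:k]


def normalize(text):
    """Map every character to the representative of its similar-character group."""
    return "".join(CANONICAL.get(ch, ch) for ch in text)


def field_similarity(a, b):
    """Similarity in [0, 1] between two strings after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_duplicate(r1, r2, fields, threshold=0.8):
    """Treat two records as duplicates if their mean field similarity is high.

    Assumption: an unweighted mean and a fixed threshold.
    """
    score = sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)
    return score >= threshold
```

In this sketch, records are plain dictionaries keyed by field name. A record whose name was mistyped with a look-alike character (for example 末 for 未) normalizes to the same string as the correct one, so the pair scores as identical on that field.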
CLC Number: TP391.4
CHENG Guo-da, SU Hang-li. Efficient approach for identifying approximately duplicate Chinese database records[J]. Journal of Computer Applications, 2005, 25(06): 1362-1365.
程国达,苏杭丽. 一种检测汉语相似重复记录的有效方法[J]. 计算机应用, 2005, 25(06): 1362-1365.
URL: http://www.joca.cn/EN/10.3724/SP.J.1087.2005.1362
http://www.joca.cn/EN/Y2005/V25/I06/1362