Efficient approach for identifying approximately duplicate Chinese database records

doi:10.3724/SP.J.1087.2005.1362

Abstract

Abstract: Eliminating duplicate records improves data quality. This paper proposes selecting the sorting fields according to the number of distinct values in each field. During detection, the first sorting field is used to build a two-dimensional (2-D) linked list that stores approximately duplicate records; the second and third sorting fields are then used to sort and compare the records within that list, which improves detection. To match Chinese character strings correctly, the paper examines mismatches caused by customary abbreviations and input errors caused by characters that are similar in pronunciation or shape. Some input errors are resolved by looking up a “Similar Chinese Characters Table”, and a similarity function is computed to decide whether two compared records are duplicates. Experimental results show that the approach detects approximately duplicate Chinese database records effectively.

Key words: approximately duplicate Chinese records, sorting field, 2-D-linked list
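
As an illustration of the field-selection step described in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes that fields with more distinct values are preferred as sorting fields (the abstract does not state the direction), and it uses a dictionary of lists as a stand-in for the 2-D linked list. The function names `rank_sorting_fields` and `build_groups` are hypothetical.

```python
from collections import defaultdict


def rank_sorting_fields(records, fields):
    """Order candidate fields by their number of distinct values, most first."""
    distinct = {f: len({rec[f] for rec in records}) for f in fields}
    # Assumption: more distinct values means a more discriminating sort key.
    # sorted() is stable, so tied fields keep their input order.
    return sorted(fields, key=lambda f: distinct[f], reverse=True)


def build_groups(records, first, second, third):
    """Group records on the first field, then sort each group on the next two.

    The dict of lists stands in for the paper's 2-D linked list: one
    dimension indexes the groups, the other the records inside a group.
    Grouping on exact equality of the first field is a simplification.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[first]].append(rec)
    for bucket in groups.values():
        bucket.sort(key=lambda rec: (rec[second], rec[third]))
    return dict(groups)
```

Sorting each group brings likely duplicates next to each other, so only neighbouring records need to be compared. The paper may apply the second and third fields in separate passes; the combined sort key here is only one possible reading.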

摘要： 消除重复记录可以提高数据质量。提出了按字段值种类数选择排序字段的方法。在相似重复记录的检测中,用第1个排序字段建立存储相似重复记录的二维链表,然后再用第2、第3个排序字段对二维链表中的记录进行排序—比较,以提高检测效果。为了正确地匹配汉字串,研究了由于缩写所造成的不匹配和读音、字型相似造成的输入错误。通过查找“相似汉字表”解决部分输入错误的问题,计算相似度函数判断被比较的记录是否是重复记录。实验表明,提出的方法能有效的检测汉语相似重复记录。

关键词: 汉语相似重复记录, 排序字段, 二维链表
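
The matching step in the abstract (similar-character lookup, abbreviation handling, and a similarity function) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the method from the paper: the table `SIMILAR_CHARS` is a tiny hypothetical sample, the abbreviation test is a plain ordered-subsequence check, the similarity score comes from Python's standard `difflib.SequenceMatcher` rather than the paper's own function, and the `0.8` threshold is arbitrary.

```python
from difflib import SequenceMatcher

# Hypothetical sample table: each easily confused character maps to one
# canonical representative. A real table would be far larger.
SIMILAR_CHARS = {"己": "已", "巳": "已", "未": "末"}


def normalize(text, table=SIMILAR_CHARS):
    """Replace every character that has a table entry with its representative."""
    return "".join(table.get(ch, ch) for ch in text)


def is_abbreviation(short, full):
    """True if the characters of `short` appear in `full` in the same order."""
    remaining = iter(full)
    # `ch in remaining` consumes the iterator, which enforces the ordering.
    return bool(short) and all(ch in remaining for ch in short)


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values after normalization."""
    a, b = normalize(a), normalize(b)
    if is_abbreviation(a, b) or is_abbreviation(b, a):
        return 1.0  # assumption: an abbreviation counts as a full match
    return SequenceMatcher(None, a, b).ratio()


def is_duplicate(rec_a, rec_b, fields, threshold=0.8):
    """Treat two records as duplicates when their mean field similarity is high."""
    scores = [field_similarity(rec_a[f], rec_b[f]) for f in fields]
    return sum(scores) / len(scores) >= threshold
```

Normalizing both strings before comparison means that an input error covered by the table does not lower the score, while errors outside the table are still handled approximately by the similarity ratio.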

CLC Number: TP391.4

CHENG Guo-da, SU Hang-li. Efficient approach for identifying approximately duplicate Chinese database records[J]. Journal of Computer Applications, 2005, 25(06): 1362-1365.

程国达，苏杭丽. 一种检测汉语相似重复记录的有效方法[J]. 计算机应用, 2005, 25(06): 1362-1365.

