Efficient approach for identifying approximately duplicate Chinese database records

Journal of Computer Applications, 2005, Vol. 25, Issue (06): 1362-1365. DOI: 10.3724/SP.J.1087.2005.1362

CHENG Guo-da, SU Hang-li

College of Information Engineering, Nanjing University of Finance & Economics, Nanjing Jiangsu 210003, China

Online: 2011-04-06 Published: 2005-06-01

Abstract: Eliminating duplicate records improves data quality. A method is proposed for selecting sorting fields according to the number of distinct values each field takes. When detecting approximately duplicate records, the first sorting field is used to build a 2-D linked list that stores approximately duplicate records; the second and third sorting fields are then used to sort and compare the records within that list, which improves detection. To match Chinese character strings correctly, two sources of mismatch are studied: customary abbreviations, and input errors caused by characters that are similar in pronunciation or in shape. Some of these input errors are resolved by looking up a "Similar Chinese Characters Table", and a similarity function is computed to decide whether two compared records are duplicates. Experimental results show that the proposed approach detects approximately duplicate Chinese database records effectively.
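The field-selection and grouping steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sample records, and the `id` key are hypothetical. The sketch also assumes that fields with more distinct values are ranked first, and it stands in for the 2-D linked list with a plain list of rows keyed by exact equality on the first sorting field; the abstract does not spell out either detail.

```python
from collections import defaultdict

# Hypothetical sample records; "id" is only used to report candidate pairs.
records = [
    {"id": 1, "name": "张伟", "city": "南京", "street": "中山路"},
    {"id": 2, "name": "张玮", "city": "南京", "street": "中山路"},
    {"id": 3, "name": "李娜", "city": "上海", "street": "南京路"},
    {"id": 4, "name": "王芳", "city": "南京", "street": "中山路"},
    {"id": 5, "name": "李娜", "city": "上海", "street": "南京东路"},
]


def rank_sort_fields(records, fields, k=3):
    """Rank fields by the number of distinct values they take, largest first."""
    counts = {f: len({r[f] for r in records}) for f in fields}
    # Assumption: a field with more distinct values is a better sort key.
    return sorted(fields, key=lambda f: counts[f], reverse=True)[:k]


def build_rows(records, first_field):
    """Group records by the first sorting field (stand-in for the 2-D list)."""
    rows = defaultdict(list)
    for r in records:
        rows[r[first_field]].append(r)
    return list(rows.values())


def candidate_pairs(rows, other_fields):
    """Within each row, sort by each remaining field and pair up neighbours."""
    pairs = set()
    for row in rows:
        for f in other_fields:
            ordered = sorted(row, key=lambda r: r[f])
            for a, b in zip(ordered, ordered[1:]):
                pairs.add(tuple(sorted((a["id"], b["id"]))))
    return pairs


first, *rest = rank_sort_fields(records, ["name", "city", "street"])
rows = build_rows(records, first)
pairs = candidate_pairs(rows, rest)
```

With this data, records 1 and 2 land in different rows because 伟 and 玮 differ, which is the kind of input error the similar-characters lookup is meant to absorb before grouping.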

Key words: approximately duplicate Chinese records, sorting field, 2-D-linked list
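The character-matching step described in the abstract might look roughly like the following Python sketch. Everything here is an assumption made for illustration: the three-entry `SIMILAR_CHARS` table is a hypothetical excerpt, the abbreviation test is a simple in-order subsequence check, the 0.8 threshold is arbitrary, and `difflib.SequenceMatcher` is swapped in for the paper's similarity function, which the abstract does not define.

```python
from difflib import SequenceMatcher

# Hypothetical excerpt of a similar-characters table: each character maps to
# a representative of its confusion class (similar sound or similar shape).
SIMILAR_CHARS = {"玮": "伟", "已": "己", "末": "未"}


def normalize(text):
    """Replace each character with its class representative, if it has one."""
    return "".join(SIMILAR_CHARS.get(ch, ch) for ch in text)


def is_abbreviation(short, full):
    """True if the characters of `short` occur in `full` in the same order."""
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values after normalization."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return 1.0
    shorter, longer = sorted((a, b), key=len)
    if shorter and is_abbreviation(shorter, longer):
        return 1.0  # assumption: an abbreviation counts as a full match
    # Stand-in metric, not the similarity function from the paper.
    return SequenceMatcher(None, a, b).ratio()


def is_duplicate(rec_a, rec_b, fields, threshold=0.8):
    """Average the per-field similarities and compare with a threshold."""
    total = sum(field_similarity(rec_a[f], rec_b[f]) for f in fields)
    return total / len(fields) >= threshold
```

For example, `is_duplicate({"name": "张玮", "org": "南京财大"}, {"name": "张伟", "org": "南京财经大学"}, ["name", "org"])` treats the two records as duplicates, because the name differs only by a table-listed character and the organisation is an in-order abbreviation. The subsequence test is deliberately loose; a real system would likely restrict which abbreviations are accepted.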

CLC number: TP391.4

CHENG Guo-da, SU Hang-li. Efficient approach for identifying approximately duplicate Chinese database records[J]. Journal of Computer Applications, 2005, 25(06): 1362-1365.

