A self-adaptive approach for information integration

doi:10.3724/SP.J.1087.2005.0666

Journal of Computer Applications, 2005, Vol. 25, Issue (03): 666-669.

CHENG Guo-da¹, ZOU Ya-hui², ZHU Jing³

1. College of Information Engineering, Nanjing University of Finance & Economics, Nanjing Jiangsu 210003, China; 2. Library, Nanjing University of Finance & Economics, Nanjing Jiangsu 210003, China

Online: 2005-03-01 Published: 2005-03-01

Abstract

Abstract: Detecting records that are approximate, but not exact, duplicates is one of the key tasks in information integration. Although various methods have been proposed for detecting such records, string matching lies at the core of all of them. The self-adaptive information integration algorithm presented in this paper measures the similarity between strings with a hybrid similarity that combines edit distance and token distance. To avoid mismatches caused by differences in expression, each string is split into individual words, which are then sorted by their first character. Word matching tolerates misspellings and abbreviations to a certain degree. The experimental results show that the self-adaptive approach achieves higher accuracy than approaches based on Smith-Waterman distance or Jaro distance.

Key words: approximately duplicate records, hybrid similarity, self-adaptive information integration, string matching
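
The abstract names the ingredients of the method (word splitting, sorting by first character, error-tolerant word matching, and a hybrid of edit distance and token distance) but not its formulas. The following Python sketch shows one way those pieces could fit together. It is an illustration under stated assumptions, not the authors' algorithm: the function names, the prefix rule for abbreviations, the 0.75 word-match threshold, and the Dice-style token score are all hypothetical choices.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_similarity(w1: str, w2: str) -> float:
    """Word-level similarity in [0, 1].

    Assumption: a word that is a prefix of the other is treated as its
    abbreviation and scores 1.0; otherwise the score is a normalized
    edit distance, which gives some tolerance to misspellings.
    """
    if not w1 or not w2:
        return 0.0
    short, long_ = sorted((w1, w2), key=len)
    if long_.startswith(short):
        return 1.0
    return 1.0 - edit_distance(w1, w2) / max(len(w1), len(w2))


def tokenize(s: str) -> list[str]:
    """Lowercase, split into words, and sort the words by first character."""
    words = re.findall(r"[a-z0-9]+", s.lower())
    return sorted(words, key=lambda w: w[0])


def hybrid_similarity(s1: str, s2: str, threshold: float = 0.75) -> float:
    """Token-level score built on edit-distance word matching.

    Assumption: words are matched greedily, and the final score is
    2 * matched / (total number of words), a Dice-style ratio.
    """
    t1, t2 = tokenize(s1), tokenize(s2)
    if not t1 or not t2:
        return 0.0
    unmatched = list(t2)
    matched = 0
    for w in t1:
        if not unmatched:
            break
        best = max(unmatched, key=lambda u: word_similarity(w, u))
        if word_similarity(w, best) >= threshold:
            matched += 1
            unmatched.remove(best)
    return 2 * matched / (len(t1) + len(t2))
```

With these assumptions, `hybrid_similarity("Nanjing University of Finance", "Finance Univ. of Nanjing")` treats the two strings as fully matching despite the different word order and the abbreviation, while strings with no similar words score 0.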

CLC Number: TP391.1

CHENG Guo-da, ZOU Ya-hui, ZHU Jing. A self-adaptive approach for information integration[J]. Journal of Computer Applications, 2005, 25(03): 666-669.

