Deep Web entity matching method based on twice-merging

Journal of Computer Applications, 2016, Vol. 36, Issue (8): 2139-2143. DOI: 10.11772/j.issn.1001-9081.2016.08.2139

The 6th China Conference on Data Mining (CCDM 2016)

CHEN Lijun

Network Communication Institute, Zhejiang Yuexiu University of Foreign Languages, Shaoxing Zhejiang 312000, China

Received: 2016-03-01 Revised: 2016-04-25 Online: 2016-08-10 Published: 2016-08-10
Corresponding author: CHEN Lijun
About the author: CHEN Lijun, born in 1979, female, from Yueqing, Zhejiang; lecturer, M.S. Her main research interests are Deep Web data mining, intelligent information processing, and educational information technology.
Foundation items: National Education Information Technology Research Project (136241401); Research Project of Zhejiang Yuexiu University of Foreign Languages (N201375).
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61263037), the Natural Science Foundation of Inner Mongolia Autonomous Region (2014BS0604, 2014MS0603).

Abstract

To address the limitations of the Weighted Edge Pruning (WEP) method in accuracy and matching efficiency, a Deep Web entity matching method based on twice-merging is proposed, built on the concepts of self-matching and merging. First, the attribute values of each object are extracted and the objects are regrouped by attribute value, so that objects sharing an attribute value are gathered together and the object set is partitioned into blocks. Second, the matching degree between each pair of objects within a block is computed and used for pruning, self-matching detection, and merging, which yields preliminary clusters. Finally, starting from these preliminary clusters, further matching relationships are discovered using the messages passed between objects within a cluster together with the objects' attribute similarity values, which triggers a new round of cluster merging and updating. Experimental results show that, compared with WEP, the proposed method improves matching accuracy: self-matching detection lets it distinguish matching relationships automatically and choose a suitable matching strategy, so the merging process is refined step by step. It also improves runtime efficiency, because blocking and pruning shrink the matching space.

Key words: twice-merging, Deep Web, entity matching, cluster, similarity value
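
The pipeline outlined in the abstract (block by attribute value, compare within blocks, prune, merge, then merge again at the cluster level) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the Jaccard measure over attribute values, both thresholds, the union-find merge, the sample records, and the function names `build_blocks`, `first_merge`, and `second_merge`. In particular, the second round simply pools the attribute values of each preliminary cluster; the paper's self-matching detection and message-passing rules are not reproduced.

```python
from collections import defaultdict
from itertools import combinations


def values_of(record):
    """Normalized set of a record's attribute values."""
    return {v.strip().lower() for v in record.values()}


def jaccard(sa, sb):
    """Jaccard similarity of two value sets (an assumed matching measure)."""
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_blocks(records):
    """Group record ids by shared attribute value; drop single-record blocks."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for value in values_of(rec):
            blocks[value].add(rid)
    return {v: ids for v, ids in blocks.items() if len(ids) > 1}


def first_merge(records, threshold=0.5):
    """Compare pairs inside each block, prune weak pairs, merge the rest."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    seen = set()
    for ids in build_blocks(records).values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) in seen:
                continue  # pair already compared in another block
            seen.add((a, b))
            score = jaccard(values_of(records[a]), values_of(records[b]))
            if score >= threshold:  # pairs below the threshold are pruned
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for rid in sorted(records):
        clusters[find(rid)].append(rid)
    return sorted(clusters.values())


def second_merge(records, clusters, threshold=0.35):
    """Assumed stand-in for the second round: merge clusters whose pooled
    attribute values overlap enough, repeating until nothing changes."""
    def pooled(cluster):
        return set().union(*(values_of(records[rid]) for rid in cluster))

    merged = [list(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(merged)), 2):
            if jaccard(pooled(merged[i]), pooled(merged[j])) >= threshold:
                merged[i] = sorted(merged[i] + merged[j])
                del merged[j]
                changed = True
                break  # indices shifted, so restart the scan
    return sorted(merged)


# Hypothetical records: r3 is sparse and resembles r1 or r2 only weakly.
records = {
    "r1": {"title": "Deep Web Matching", "author": "Chen", "year": "2016"},
    "r2": {"title": "deep web matching", "author": "Chen", "venue": "JCA"},
    "r3": {"year": "2016", "venue": "JCA", "pages": "2139-2143"},
    "r4": {"title": "Graph Pruning", "author": "Wang", "year": "2016"},
}
preliminary = first_merge(records)
final = second_merge(records, preliminary)
print(preliminary)  # [['r1', 'r2'], ['r3'], ['r4']]
print(final)        # [['r1', 'r2', 'r3'], ['r4']]
```

In this toy data, r3 shares only one value with r1 and one with r2, so both pairs are pruned in the first round. Once r1 and r2 form a cluster, their pooled values overlap r3 on two values, which is enough for the second round to merge it. This mirrors, in a simplified way, the idea that preliminary clusters expose matches that pairwise comparison misses.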

CLC Number:

TP391
TP311

CHEN Lijun. Deep Web entity matching method based on twice-merging[J]. Journal of Computer Applications, 2016, 36(8): 2139-2143.

