Entity alignment of Chinese heterogeneous encyclopedia knowledge base

Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (7): 1881-1886. DOI: 10.11772/j.issn.1001-9081.2016.07.1881

HUANG Junfu, LI Tianrui, JIA Zhen, JING Yunge, ZHANG Tao

School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 611756, China

Received: 2015-12-31 Revised: 2016-03-09 Online: 2016-07-10 Published: 2016-07-14
Corresponding author: JIA Zhen
About the authors: HUANG Junfu (born 1990), male, from Datong, Shanxi; M.S. candidate and CCF member; research interests: natural language processing and knowledge graphs. LI Tianrui (born 1969), male, from Putian, Fujian; professor, Ph.D. and CCF senior member; research interests: intelligent information processing, data mining, cloud computing and big data. JIA Zhen (born 1975), female, from Kaifeng, Henan; lecturer, Ph.D.; research interests: information extraction and knowledge graphs. JING Yunge (born 1970), male, from Yuncheng, Shanxi; Ph.D. candidate; research interest: rough sets. ZHANG Tao (born 1989), male, from Shangrao, Jiangxi; M.S. candidate; research interest: natural language processing.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61573292, 61572407) and the Fundamental Research Funds for the Central Universities (2682015CX070).

Abstract

Abstract: To address the poor performance of traditional entity alignment methods on heterogeneous Chinese online encyclopedias, an entity alignment method that combines entity attributes with context topic features was proposed. First, a Chinese heterogeneous encyclopedia knowledge base was constructed from Baidu encyclopedia and Hudong encyclopedia data, and a Resource Description Framework Schema (RDFS) vocabulary was built by statistical methods to normalize the entity attributes. Second, the context of each entity was extracted and segmented into Chinese words; the contexts were modeled with a topic model whose parameters were estimated by Gibbs sampling, and the topic-word probability matrix was computed to extract the feature word set and the corresponding feature matrix. Third, the Longest Common Subsequence (LCS) algorithm was used to compute the entity attribute similarity; when the similarity fell between the lower and the upper bounds, the context topic features of the encyclopedia entities were additionally used to make the decision. Finally, following a standard method, an entity alignment data set of heterogeneous Chinese encyclopedias was constructed for simulation experiments. Compared with the classical attribute similarity algorithm, the attribute weighting algorithm, the context term frequency feature model and the topic model algorithm, the proposed method achieved a precision, recall and F-score of 97.8%, 88.0% and 92.6% in the people domain and of 98.6%, 73.0% and 83.9% in the movie domain, a substantial improvement over the other methods. The results indicate that the proposed method effectively improves entity alignment when constructing a Chinese heterogeneous encyclopedia knowledge base, and that it can be applied to entity alignment tasks in which context information is available.
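The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that the first figure of each triple is precision and that the F-score is the balanced harmonic mean of precision and recall (F1); the abstract does not state the weighting, so this is an assumption rather than the authors' stated definition. The function name `f1` is illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (assumed weighting)."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, in percent: (first figure, recall, reported F).
reported = {
    "people": (97.8, 88.0, 92.6),
    "movie": (98.6, 73.0, 83.9),
}

for domain, (p, r, f_reported) in reported.items():
    f_computed = f1(p, r)
    print(f"{domain}: computed F1 = {f_computed:.1f}, reported F = {f_reported}")
```

Under these assumptions the computed values round to the reported F-scores in both domains, which is consistent with reading the first figure as precision.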

Key words: knowledge base, entity alignment, topic model, resource description framework schema, longest common subsequence algorithm
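The two-stage decision described in the abstract (LCS-based attribute similarity first, context topic features only for pairs between the two bounds) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The abstract does not give the normalization of the LCS score, the way attribute scores are aggregated, the threshold values, the handling of pairs outside the bounds, or the context similarity measure, so all of these are assumptions here: the score is `2 * LCS / (len(a) + len(b))`, attributes are averaged over shared keys, pairs at or above the upper bound are accepted, pairs at or below the lower bound are rejected, and topic vectors are compared by cosine similarity. All function names, parameter names and default values are hypothetical.

```python
import math
from typing import Dict, Sequence


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length normalized to [0, 1] (assumed normalization)."""
    if not a or not b:
        return 0.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def attribute_similarity(e1: Dict[str, str], e2: Dict[str, str]) -> float:
    """Mean LCS similarity over attributes present in both entities (assumed aggregation)."""
    shared = e1.keys() & e2.keys()
    if not shared:
        return 0.0
    return sum(lcs_similarity(e1[k], e2[k]) for k in shared) / len(shared)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two topic feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def is_same_entity(e1, e2, topics1, topics2,
                   lower=0.3, upper=0.8, context_threshold=0.5) -> bool:
    """Two-threshold decision; threshold values are placeholders, not from the paper."""
    s = attribute_similarity(e1, e2)
    if s >= upper:      # attributes alone are conclusive: aligned (assumption)
        return True
    if s <= lower:      # attributes alone are conclusive: not aligned (assumption)
        return False
    # Ambiguous band: fall back on the context topic features.
    return cosine(topics1, topics2) >= context_threshold


# Toy, made-up entities whose attribute similarity (0.5) falls in the ambiguous band.
a = {"name": "abcd"}
b = {"name": "abxy"}
print(is_same_entity(a, b, [0.9, 0.1], [0.8, 0.2]))
print(is_same_entity(a, b, [1.0, 0.0], [0.0, 1.0]))
```

Restricting the context comparison to the ambiguous band keeps the cheaper attribute comparison as the first filter, which matches the order of steps given in the abstract.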

CLC number:

TP391.1

HUANG Junfu, LI Tianrui, JIA Zhen, JING Yunge, ZHANG Tao. Entity alignment of Chinese heterogeneous encyclopedia knowledge base[J]. Journal of Computer Applications, 2016, 36(7): 1881-1886.
