Abstract: Traditional entity alignment algorithms can perform poorly on Chinese heterogeneous encyclopedia knowledge bases. To address this problem, an entity alignment method based on entity attributes and context topic features was proposed. First, a Chinese heterogeneous encyclopedia knowledge base was constructed from Baidu Encyclopedia and Hudong Encyclopedia data. Next, a Resource Description Framework Schema (RDFS) vocabulary was built to normalize the entity attributes. The context of each entity was then extracted and segmented with Chinese word segmentation. The contexts were modeled with a topic model whose parameters were estimated by Gibbs sampling, which yielded the topic-word probability matrix, the characteristic word set and the corresponding feature matrix. Finally, the Longest Common Subsequence (LCS) algorithm was used to compute the attribute similarity of entities; when that similarity fell between the lower and upper bounds, the topic features of the entities' contexts were also used to resolve the alignment. For the simulation experiments, an entity alignment data set of Chinese heterogeneous encyclopedias was constructed according to the standard method. The proposed method was compared with the traditional property similarity algorithm, the weighted-property algorithm, the context term frequency feature model and the topic model algorithm. It achieved 97.8% precision, 88.0% recall and 92.6% F-score on the people class, and 98.6% precision, 73.0% recall and 83.9% F-score on the movie class, outperforming the other entity alignment algorithms. The experimental results also indicate that the proposed method improves entity alignment when constructing a Chinese heterogeneous encyclopedia knowledge base, and that it can be applied to entity alignment tasks in which context information is available.
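The two-stage decision described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation evaluated in the paper: the normalization of the LCS score, the use of cosine similarity over topic feature vectors, the behaviour outside the two bounds, and all threshold values (`lower`, `upper`, `topic_threshold`) are hypothetical choices that the abstract does not specify.

```python
from math import sqrt
from typing import Sequence


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def attribute_similarity(a: str, b: str) -> float:
    """LCS-based similarity in [0, 1].

    Assumption: normalized as 2 * LCS / (len(a) + len(b)); the abstract
    does not state which normalization is used.
    """
    if not a and not b:
        return 0.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two topic feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def is_same_entity(
    attr_a: str,
    attr_b: str,
    topics_a: Sequence[float],
    topics_b: Sequence[float],
    lower: float = 0.3,            # hypothetical lower bound
    upper: float = 0.8,            # hypothetical upper bound
    topic_threshold: float = 0.5,  # hypothetical topic cutoff
) -> bool:
    """Decide alignment from attributes, consulting topics only when unsure."""
    sim = attribute_similarity(attr_a, attr_b)
    if sim >= upper:   # assumed: attributes alone are decisive
        return True
    if sim <= lower:   # assumed: attributes alone rule the pair out
        return False
    # Between the bounds: fall back on the context topic features.
    return cosine(topics_a, topics_b) >= topic_threshold
```

In this sketch the topic features are consulted only for pairs whose attribute similarity is ambiguous, so clearly matching or clearly different pairs are decided from the attributes alone.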