Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (7): 1881-1886. DOI: 10.11772/j.issn.1001-9081.2016.07.1881

• Artificial Intelligence •

Entity alignment of Chinese heterogeneous encyclopedia knowledge base

HUANG Junfu, LI Tianrui, JIA Zhen, JING Yunge, ZHANG Tao

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
  • Received: 2015-12-31 Revised: 2016-03-09 Online: 2016-07-10 Published: 2016-07-14
  • Corresponding author: JIA Zhen
  • About the authors: HUANG Junfu (1990-), male, native of Datong, Shanxi, M.S. candidate, CCF member; research interests: natural language processing, knowledge graphs. LI Tianrui (1969-), male, native of Putian, Fujian, Ph.D., professor, CCF senior member; research interests: intelligent information processing, data mining, cloud computing, big data. JIA Zhen (1975-), female, native of Kaifeng, Henan, Ph.D., lecturer; research interests: information extraction, knowledge graphs. JING Yunge (1970-), male, native of Yuncheng, Shanxi, Ph.D. candidate; research interest: rough sets. ZHANG Tao (1989-), male, native of Shangrao, Jiangxi, M.S. candidate; research interest: natural language processing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61573292, 61572407) and the Fundamental Research Funds for the Central Universities (2682015CX070).

Abstract: Traditional entity alignment methods perform poorly when aligning entities across heterogeneous Chinese online encyclopedias. To address this problem, an entity alignment method that combines entity attributes with context topic features was proposed. First, a Chinese heterogeneous encyclopedia knowledge base was constructed from Baidu encyclopedia and Hudong encyclopedia data, and a Resource Description Framework Schema (RDFS) vocabulary was built by statistical methods to normalize the entity attributes. Second, the entity context information was extracted and segmented into Chinese words; the contexts were modelled with a topic model whose parameters were estimated by Gibbs sampling, yielding the topic-word probability matrix, from which the feature word set and the corresponding feature matrix were extracted. Third, the Longest Common Subsequence (LCS) algorithm was used to compute the entity attribute similarity; when the similarity fell between the lower and upper bounds, the context topic features of the encyclopedia entities were additionally used to make the decision. Finally, an entity alignment data set of heterogeneous Chinese encyclopedias was constructed according to the standard method for simulation experiments. Compared with the classic attribute similarity algorithm, the attribute weighting algorithm, the context term frequency feature model and the topic model algorithm, the proposed method achieved precision, recall and F-score of 97.8%, 88.0% and 92.6% in the people domain and 98.6%, 73.0% and 83.9% in the movie domain, outperforming the other methods. The results indicate that the proposed method effectively improves entity alignment when constructing a Chinese heterogeneous encyclopedia knowledge base, and that it can be applied to entity alignment tasks in which context information is available.
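The two-stage decision described in the abstract (LCS-based attribute similarity first, context topic features only when the similarity lies between the two bounds) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the normalisation of the LCS length, the averaging over shared attributes, the threshold values and all function names are assumptions, and the context topic similarity is passed in as a precomputed number rather than derived from a topic model.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Scale the LCS length to [0, 1] (assumed normalisation)."""
    if not a or not b:
        return 0.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def attribute_similarity(e1: dict, e2: dict) -> float:
    """Average LCS similarity over the attributes both entities share (assumed)."""
    shared = e1.keys() & e2.keys()
    if not shared:
        return 0.0
    return sum(lcs_similarity(e1[k], e2[k]) for k in shared) / len(shared)


def is_same_entity(e1: dict, e2: dict, context_similarity: float,
                   lower: float = 0.3, upper: float = 0.8,
                   context_threshold: float = 0.5) -> bool:
    """Decide alignment; all threshold values here are hypothetical."""
    score = attribute_similarity(e1, e2)
    if score >= upper:      # attributes alone are convincing
        return True
    if score <= lower:      # attributes alone rule the pair out
        return False
    # Ambiguous band: fall back to the context topic similarity.
    return context_similarity >= context_threshold
```

With these assumed defaults, a pair whose attribute similarity is 0.5 is accepted or rejected purely on the supplied context similarity, while pairs at or above 0.8 or at or below 0.3 never consult the context.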

Key words: knowledge base, entity alignment, topic model, resource description framework schema, longest common subsequence algorithm
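
The three figures reported for each domain in the abstract can be cross-checked against one another. The sketch below assumes that the F-score is the balanced harmonic mean of precision and recall (the abstract does not spell out the formula) and reads the first figure of each triple as precision; the function name is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures taken from the abstract, in percent.
people = f_score(97.8, 88.0)
movies = f_score(98.6, 73.0)
print(round(people, 1), round(movies, 1))  # 92.6 83.9
```

Under that assumption the computed values agree with the reported F-scores of 92.6% and 83.9%.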

CLC number: