Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (3): 726-730. DOI: 10.11772/j.issn.1001-9081.2016.03.726

• Artificial Intelligence •

Personal relation extraction for text headlines

YAN Yang1, ZHAO Jiapeng1, LI Quangang1,2, ZHANG Yang1, LIU Tingwen1, SHI Jinqiao1

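  1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093;
  2. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054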
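  • Received: 2015-08-17 Revised: 2015-10-08 Online: 2016-03-10 Published: 2016-03-17
  • Corresponding author: LIU Tingwen
  • About the authors: YAN Yang (born 1986), male, from Jinan, Shandong, Ph.D. candidate; main research interests: natural language processing, information security, artificial intelligence. ZHAO Jiapeng (born 1990), male, from Baotou, Inner Mongolia, M.S.; main research interests: natural language processing, machine learning. LI Quangang (born 1986), male, from Weifang, Shandong, Ph.D. candidate; main research interests: computer network anomaly handling, machine learning.
  • Funding:
    Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200).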

Personal relation extraction based on text headline

YAN Yang1, ZHAO Jiapeng1, LI Quangang1,2, ZHANG Yang1, LIU Tingwen1, SHI Jinqiao1   

  1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China;
  2. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China
  • Received: 2015-08-17 Revised: 2015-10-08 Online: 2016-03-10 Published: 2016-03-17
  • Supported by:
    This work is partially supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200).

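Abstract: Personal relation extraction from text headlines faces three difficulties: interference from non-person entities, the selection of relation feature words, and the effect of multiple person entities in a headline on determining the relation between the target entities. To address them, this paper proposes person entity discrimination based on a decision tree, relation feature word generation based on minimum set cover, and a statistical method based on three-layer sentence pattern rules. First, 18 features are extracted from the descriptions of persons in the personal relation attribute files of the China Conference on Machine Learning (CCML) competition and fed to a C4.5 classifier, which achieves a recall of 98.2% and a precision of 92.6%; its output serves as a precondition for the subsequent relation judgment. Second, to keep the feature word set at a suitable size, a feature word covering algorithm based on minimum set cover is used; the results show that once the feature word set reaches a certain size, it covers all relation categories and can be used to determine the type of personal relation in a text headline. Finally, the three-layer sentence pattern rule statistical method generates sentence rules with the low-proportion rules filtered out, and further refines the sentence pattern rules according to the ratio of positive to negative relation instances, in order to judge whether a relation holds in a text headline. Experimental results show a recall of 82.9%, a precision of 74.4%, and an F1-measure of 78.4% on 19 types of personal relations. The proposed method can be effectively applied to personal relation extraction from news headlines in order to build a personal relation knowledge graph.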

Key words: personal relation extraction, text headline, minimum set cover, entity discrimination, syntactic rule
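The abstract's second step, choosing a small feature word set that still covers every relation category, can be illustrated with the standard greedy approximation to minimum set cover. This is only a minimal sketch: the abstract does not say which set cover algorithm the authors used, and the function name `greedy_feature_word_cover`, the example words, and the relation labels are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_feature_word_cover(word_to_relations: Dict[str, Set[str]]) -> List[str]:
    """Greedily pick feature words until every relation type is covered.

    Assumption: each candidate word maps to the set of relation types it
    signals. The greedy rule (take the word covering the most uncovered
    types) approximates a minimum set cover; it is not guaranteed optimal.
    """
    # The universe is every relation type signalled by at least one word.
    uncovered: Set[str] = set()
    for relations in word_to_relations.values():
        uncovered |= relations

    chosen: List[str] = []
    while uncovered:
        # Iterating in sorted order makes ties resolve alphabetically.
        best = max(
            sorted(word_to_relations),
            key=lambda w: len(word_to_relations[w] & uncovered),
        )
        chosen.append(best)
        uncovered -= word_to_relations[best]
    return chosen


# Hypothetical toy data, not taken from the paper.
example = {
    "wife": {"spouse"},
    "married": {"spouse"},
    "son": {"parent-child"},
    "family": {"spouse", "parent-child", "sibling"},
    "coach": {"teacher-student"},
}
cover = greedy_feature_word_cover(example)
print(cover)
```

On this toy input two words suffice to cover all four relation types, which mirrors the abstract's observation that a feature word set of moderate size can cover every relation category.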

Abstract: To overcome interference from non-person entities, the difficulty of selecting relation feature words, and the influence of multiple person entities on extracting the target personal relation, this paper proposes person entity judgment based on a decision tree, relation feature word generation based on minimum set cover, and a statistical approach based on three-layer sentence pattern rules. In the first step, 18 features were extracted from the attribute files of the China Conference on Machine Learning (CCML) competition 2015 and a C4.5 decision tree was used as the classifier, achieving a recall of 98.2% and a precision of 92.6%. The results of this step were used as the input to the next step. Next, an algorithm based on minimum set cover was used to keep the feature word set at a suitable size; once the set reaches a certain size it covers all the personal relation types, and it is then used to identify the relation type in a text headline. In the last step, a method based on statistics of three-layer sentence pattern rules was used to filter out rules with a small proportion and to refine the sentence pattern rules according to the proportions of positive and negative instances, in order to judge whether a personal relation holds. The experimental results show that the approach achieves a recall of 82.9%, a precision of 74.4%, and an F1-measure of 78.4%, so the proposed method can be applied to personal relation extraction from text headlines, which helps to construct a personal relation knowledge graph.

Key words: personal relation extraction, textual headline, minimum set cover, personal entity judgment, syntax rule
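As a quick consistency check on the figures in the abstract, the F1-measure can be recomputed from the reported precision and recall. The sketch below assumes the usual definition of F1 as the harmonic mean of precision and recall; the helper name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract: precision 74.4%, recall 82.9%.
f1 = f1_score(0.744, 0.829)
print(f"{f1:.3f}")  # prints 0.784
```

The result agrees with the 78.4% F1-measure stated in the abstract.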

CLC number: