Personal relation extraction based on text headline

doi:10.11772/j.issn.1001-9081.2016.03.726

Journal of Computer Applications, 2016, Vol. 36, Issue (3): 726-730.

YAN Yang¹, ZHAO Jiapeng¹, LI Quangang¹,², ZHANG Yang¹, LIU Tingwen¹, SHI Jinqiao¹

1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China;
2. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China

Received: 2015-08-17 Revised: 2015-10-08 Online: 2016-03-10 Published: 2016-03-17
Corresponding author: LIU Tingwen
About the authors: YAN Yang (1986-), male, born in Jinan, Shandong, Ph.D. candidate; main research interests: natural language processing, information security, artificial intelligence. ZHAO Jiapeng (1990-), male, born in Baotou, Inner Mongolia, M.S.; main research interests: natural language processing, machine learning. LI Quangang (1986-), male, born in Weifang, Shandong, Ph.D. candidate; main research interests: computer network anomaly handling, machine learning.
Supported by:
This work is partially supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200).

Abstract

Abstract: Extracting personal relations from text headlines faces three difficulties: interference from non-person entities, the selection of relation feature words, and the presence of multiple person entities in one headline, which complicates judging the relation of the target entities. To address them, this paper proposed person entity judgment based on a decision tree, relation feature word generation based on minimum set cover, and a statistical method based on three-layer sentence pattern rules. First, 18 features were extracted from the descriptions of persons in the personal relation attribute files of the China Conference on Machine Learning (CCML) 2015 competition, and a C4.5 decision tree was used as the classifier, achieving a recall of 98.2% and a precision of 92.6%; the results served as the input to the subsequent relation judgment. Second, to keep the feature word set at a suitable size, a feature word selection algorithm based on minimum set cover was adopted. The results showed that once the feature word set reaches a certain size it covers all relation categories, and the set was then used to identify the relation type in a headline. Finally, statistics over three-layer sentence pattern rules were used to filter out sentence rules that account for only a small proportion and to further subdivide the sentence pattern rules according to the ratio of positive to negative relation instances, in order to judge whether a headline actually expresses the relation. Experimental results on 19 types of personal relations showed a recall of 82.9%, a precision of 74.4% and an F1-measure of 78.4%. The proposed method can be applied effectively to personal relation extraction from news headlines and thus helps to construct a personal relation knowledge graph.
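As a quick consistency check on the reported figures, the F1-measure can be recomputed from the stated precision and recall. This assumes the standard definition of F1 as the harmonic mean of precision $P$ and recall $R$, which the abstract does not spell out:

```latex
$$
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.744 \times 0.829}{0.744 + 0.829}
    = \frac{1.233552}{1.573}
    \approx 0.784
$$
```

Under that assumption the result agrees with the 78.4% reported for the 19 relation types.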

Key words: personal relation extraction, textual headline, minimum set cover, personal entity judgment, syntax rule
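The feature word step can be illustrated with the classic greedy approximation for minimum set cover. This is only a minimal sketch: the abstract does not say which set cover algorithm the authors used or what exactly is being covered, so the greedy heuristic, the choice of labelled headline IDs as the universe, the function name `greedy_set_cover`, and the toy words below are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def greedy_set_cover(universe: Set[str], candidates: Dict[str, Set[str]]) -> List[str]:
    """Greedily choose candidate words until every element of `universe` is covered.

    `candidates` maps a word to the set of elements (here: hypothetical
    headline IDs) in which that word occurs.
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Pick the word covering the most still-uncovered elements.
        # Iterating in sorted order makes ties resolve alphabetically.
        best = max(
            sorted(candidates),
            key=lambda w: len(candidates[w] & uncovered),
            default=None,
        )
        gain = candidates[best] & uncovered if best is not None else set()
        if not gain:
            raise ValueError(f"elements cannot be covered: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy data (hypothetical): five labelled headlines and the words found in them.
headlines = {"h1", "h2", "h3", "h4", "h5"}
word_to_headlines = {
    "wife": {"h1", "h2"},
    "married": {"h2", "h3"},
    "husband": {"h3"},
    "son": {"h4"},
    "father": {"h4", "h5"},
}

feature_words = greedy_set_cover(headlines, word_to_headlines)
print(feature_words)
```

The greedy choice keeps the selected word set small without enumerating all subsets, which matches the stated goal of holding the feature word set at a suitable size, but it gives an approximation rather than a guaranteed minimum cover.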

CLC number: TP391

Cite this article: YAN Yang, ZHAO Jiapeng, LI Quangang, ZHANG Yang, LIU Tingwen, SHI Jinqiao. Personal relation extraction based on text headline[J]. Journal of Computer Applications, 2016, 36(3): 726-730.

