Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (7): 1875-1880. DOI: 10.11772/j.issn.1001-9081.2016.07.1875

• Artificial Intelligence •

Info-association topology based social relationship mining on Internet
(基于信息关联拓扑的互联网社交关系挖掘)

LIU Jinwen1,2, XING Kai1,2, RUI Weikang1,2, ZHANG Liping2, ZHOU Hui3

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei Anhui 230022, China;
  2. Suzhou Institute of Advanced Study, University of Science and Technology of China, Suzhou Jiangsu 215123, China;
  3. Suzhou Industrial Park Centers for Disease Control and Prevention, Suzhou Jiangsu 215123, China
  • Received: 2016-01-25 Revised: 2016-02-29 Online: 2016-07-10 Published: 2016-07-14
  • Corresponding author: XING Kai
  • About the authors: LIU Jinwen (born 1992), female, from Bengbu, Anhui, master's student; research interests: natural language processing, data mining, machine learning. XING Kai (born 1981), male, from Xi'an, Shaanxi, associate professor, Ph.D.; research interests: data mining, big data, cloud computing. RUI Weikang (born 1992), male, from Wuhu, Anhui, master's student; research interests: text mining, natural language processing. ZHANG Liping (born 1990), female, from Shangqiu, Henan, master's student; research interests: data mining, natural language processing. ZHOU Hui (born 1980), female, from Changzhou, Jiangsu, master's student; research interests: epidemic prevention and control.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61332004) and the Suzhou Industrial Technology Innovation Special Funded Project (Science and Technology for People's Livelihood) (SS201509).

Abstract: Supervised relation extraction methods require a large amount of labeled training data and predefined relation types. To address these problems, a personal relation extraction method was proposed that constructs a correlation network from word co-occurrence information and performs graph clustering analysis on that network. Firstly, 500 highly correlated person pairs were obtained from news title data for the relation extraction study. Secondly, the news data containing these person pairs were crawled and preprocessed, and Term Frequency-Inverse Document Frequency (TF-IDF) was used to obtain the keywords of the sentences in which the two persons of a pair co-occur. Thirdly, the correlations between words were derived from word co-occurrence information, and a keyword correlation network was built. Finally, the personal relations were obtained by graph clustering analysis on the correlation network. In the relation extraction experiments, compared with the traditional Chinese entity relation extraction method based on word co-occurrence and pattern matching, the proposed method improved precision, recall and F-score by 5.5, 3.7 and 4.4 percentage points respectively. The experimental results show that the proposed algorithm can effectively extract rich, high-quality personal relation data from news data without labeled training data.

Key words: social relation extraction, co-occurrence statistics, word correlation, correlation network, graph clustering
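
The pipeline summarized in the abstract (TF-IDF keywords, then a keyword co-occurrence network, then graph clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the correlation measure or the clustering algorithm, so the sketch assumes raw sentence-level co-occurrence counts as edge weights and uses connected components as a simple stand-in for graph clustering. The function names, the `top_k` and `min_count` parameters, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_keywords(sentences, top_k=2):
    """Return the top_k highest-scoring TF-IDF tokens of each tokenized sentence."""
    n = len(sentences)
    # Document frequency: number of sentences that contain each token.
    df = Counter(tok for sent in sentences for tok in set(sent))
    keywords = []
    for sent in sentences:
        tf = Counter(sent)
        scores = {t: (c / len(sent)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


def cooccurrence_graph(keyword_lists, min_count=1):
    """Link two keywords if they co-occur in at least min_count sentences (assumed measure)."""
    weights = Counter()
    for kws in keyword_lists:
        for a, b in combinations(sorted(set(kws)), 2):
            weights[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), w in weights.items():
        if w >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Stand-in for graph clustering: each connected component is one cluster."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters


# Toy, already-tokenized sentences in which a hypothetical person pair (A, B) co-occurs.
sentences = [
    ["A", "married", "B", "wedding"],
    ["A", "B", "wedding", "married", "ceremony"],
    ["A", "costars", "B", "film"],
    ["A", "B", "film", "costars", "premiere"],
]
clusters = connected_components(cooccurrence_graph(tfidf_keywords(sentences)))
print(clusters)
```

On this toy input the person names score zero under TF-IDF because they appear in every sentence, so only the relation-bearing words enter the network, and the two clusters correspond to two candidate relations for the pair. A real system would need Chinese word segmentation and a proper clustering algorithm in place of the connected-components shortcut.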

CLC Number: