Info-association topology based social relationship mining on Internet

doi:10.11772/j.issn.1001-9081.2016.07.1875

Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (7): 1875-1880.

LIU Jinwen^1,2, XING Kai^1,2, RUI Weikang^1,2, ZHANG Liping^2, ZHOU Hui^3

1. School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230022, China;
2. Suzhou Institute of Advanced Study, University of Science and Technology of China, Suzhou, Jiangsu 215123, China;
3. Suzhou Industrial Park Centers for Disease Control and Prevention, Suzhou, Jiangsu 215123, China

Received: 2016-01-25 Revised: 2016-02-29 Online: 2016-07-14 Published: 2016-07-10
Corresponding author: XING Kai
About the authors: LIU Jinwen (born 1992), female, from Bengbu, Anhui, M.S. candidate; research interests: natural language processing, data mining, machine learning. XING Kai (born 1981), male, from Xi'an, Shaanxi, associate professor, Ph.D.; research interests: data mining, big data, cloud computing. RUI Weikang (born 1992), male, from Wuhu, Anhui, M.S. candidate; research interests: text mining, natural language processing. ZHANG Liping (born 1990), female, from Shangqiu, Henan, M.S. candidate; research interests: data mining, natural language processing. ZHOU Hui (born 1980), female, from Changzhou, Jiangsu, M.S. candidate; research interests: epidemic prevention and control.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61332004) and the Suzhou Industrial Technology Innovation Special Funded Project (Science and Technology for People's Livelihood) (SS201509).

Abstract

Abstract: Relation extraction methods based on supervised learning require large amounts of labeled training data and predefined relation types. To address these problems, a personal relation extraction method was proposed that constructs a correlation network from word co-occurrence information and performs graph clustering analysis on that network. Firstly, 500 highly correlated person pairs were obtained from news title data for the relation extraction study. Secondly, the news data containing these person pairs were crawled and preprocessed, and Term Frequency-Inverse Document Frequency (TF-IDF) was used to obtain the keywords of the sentences in which each person pair co-occurred. Thirdly, the correlation between words was computed from word co-occurrence information, and a keyword correlation network was constructed. Finally, the personal relations were obtained by graph clustering analysis on the correlation network. In the relation extraction experiments, compared with the traditional Chinese entity relation extraction method based on word co-occurrence and pattern matching, the proposed method improved precision, recall and F-score by 5.5, 3.7 and 4.4 percentage points respectively. The experimental results show that the proposed algorithm can effectively extract abundant, high-quality personal relation data from news data without labeled training data.

Key words: social relation extraction, co-occurrence statistics, word correlation, correlation network, graph clustering
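
The steps summarized in the abstract (TF-IDF keywords, a keyword co-occurrence network, graph clustering) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the TF-IDF variant, the word-correlation measure, or the graph clustering algorithm, so the sketch assumes a plain tf × log(N/df) score, raw sentence co-occurrence counts with a hypothetical threshold `min_count`, and connected components as a stand-in for graph clustering. All function names and the toy English tokens are hypothetical; real input would be segmented Chinese news sentences.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_keywords(sentences, top_k):
    """Rank words by TF-IDF summed over all sentences; return the top_k words."""
    n = len(sentences)
    # Document frequency: number of sentences that contain each word.
    df = Counter(word for sent in sentences for word in set(sent))
    scores = defaultdict(float)
    for sent in sentences:
        for word, count in Counter(sent).items():
            scores[word] += (count / len(sent)) * math.log(n / df[word])
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def cooccurrence_network(sentences, keywords, min_count):
    """Link two keywords if they appear together in at least min_count sentences."""
    keyset = set(keywords)
    pair_counts = Counter()
    for sent in sentences:
        present = sorted(keyset & set(sent))
        pair_counts.update(combinations(present, 2))
    graph = {word: set() for word in keywords}
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Stand-in clustering (an assumption): one cluster per connected component."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Toy sentences in which the hypothetical person pair (alice, bob) co-occurs.
sentences = [
    ["alice", "bob", "married", "wedding"],
    ["alice", "bob", "wedding", "married"],
    ["alice", "bob", "film", "costar"],
    ["alice", "bob", "costar", "film"],
]
keywords = tfidf_keywords(sentences, top_k=4)
graph = cooccurrence_network(sentences, keywords, min_count=2)
clusters = connected_components(graph)
print(keywords)
print(clusters)
```

In this toy input the two names occur in every sentence, so their IDF is zero and they drop out of the keyword list; the remaining keywords fall into two clusters, each of which could be read as one candidate relation for the pair.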

CLC number: TP391.1

Cite this article: LIU Jinwen, XING Kai, RUI Weikang, ZHANG Liping, ZHOU Hui. Info-association topology based social relationship mining on Internet[J]. Journal of Computer Applications, 2016, 36(7): 1875-1880.


