Journal of Computer Applications, 2016, Vol. 36, Issue (2): 460-464. DOI: 10.11772/j.issn.1001-9081.2016.02.0460

• The 3rd CCF Big Data Conference (CCF BigData 2015) •

Micro-blog clustering and topic word extraction based on hashtag and forwarding relationship

SHU Jue1, CHENG Weiqing1,2, DENG Cong1

  1. School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu 210003, China;
  2. Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, Nanjing Jiangsu 211189, China
  • Received: 2015-08-29 Revised: 2015-09-13 Online: 2016-02-10 Published: 2016-02-03
  • Corresponding author: CHENG Weiqing (born 1972), female, from Huai'an, Jiangsu; professor, Ph.D., CCF member; main research interests: network measurement, pattern recognition.
  • About the authors: SHU Jue (born 1990), female, from Danyang, Jiangsu; master's student, CCF member; main research interest: data mining. DENG Cong (born 1993), male, from Nanjing, Jiangsu; main research interest: data mining.
  • Supported by:
    National Natural Science Foundation of China (61170322, 71171117, 61373065); Key Laboratory of Computer Network and Information Integration, Ministry of Education (K93-9-2014-04B).

Abstract: To address the low accuracy of micro-blog clustering, this work studies the characteristics of micro-blog data, uses micro-blog hashtags to enhance the vector space model, and uses the forwarding relationships between micro-blogs to improve clustering accuracy. The topic words of each cluster are then extracted using each micro-blog's forwarding count and comment count together with information about the user who posted it. In experiments on a Sina micro-blog dataset, the proposed clustering algorithm based on hashtags and forwarding relationships is compared with the k-means algorithm and with ICST-WSNB (an incremental clustering algorithm for short Chinese texts based on weighted semantics and Naive Bayes): its accuracy is 18.5% higher than that of k-means and 6.48% higher than that of ICST-WSNB (the source's English abstract reports 6.63%), and its recall and F-value also improve. The experimental results show that the proposed algorithm effectively improves the accuracy of micro-blog clustering and thereby yields more appropriate topic words.

Key words: micro-blog data, text mining, feature weight, micro-blog forwarding relationship, topic word extraction
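
The two clustering ideas named in the abstract (a hashtag-enhanced vector space model and the use of forwarding relationships) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the smoothed TF-IDF weighting, the multiplier `HASHTAG_BOOST`, the additive `FORWARD_BONUS`, and all function names below are assumptions chosen only for illustration.

```python
import math
from collections import Counter

HASHTAG_BOOST = 2.0  # assumed multiplier for terms that also occur in a post's hashtag
FORWARD_BONUS = 0.3  # assumed similarity bonus for two posts linked by a forward


def weighted_vectors(docs, hashtags, boost=HASHTAG_BOOST):
    """Build sparse TF-IDF vectors, boosting terms found in each post's hashtags.

    docs: list of token lists; hashtags: list of sets of hashtag tokens.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc, tags in zip(docs, hashtags):
        vec = {}
        for term, count in Counter(doc).items():
            idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed IDF (assumption)
            weight = (count / len(doc)) * idf
            vec[term] = weight * boost if term in tags else weight
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def linked_similarity(i, j, vectors, forwards, bonus=FORWARD_BONUS):
    """Cosine similarity, raised (capped at 1.0) when one post forwards the other."""
    sim = cosine(vectors[i], vectors[j])
    if (i, j) in forwards or (j, i) in forwards:
        sim = min(1.0, sim + bonus)
    return sim
```

Here `forwards` is a set of index pairs such as `{(1, 0)}`, meaning post 1 forwards post 0. A similarity defined this way could feed any distance-based clustering method; how the paper actually combines the two signals is not stated in the abstract.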

CLC number: