Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (11): 3071-3075.

• Database Technology •

On-line forum hot topic mining method based on topic cluster evaluation

JIANG Hao, CHEN Xingshu, DU Min

  1. School of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
  • Received: 2013-05-08 Revised: 2013-07-14 Published: 2013-11-01 Online: 2013-12-04
  • Corresponding author: CHEN Xingshu
  • About the authors: JIANG Hao (born 1989), male, from Xingtai, Hebei, master's student, research interest: data mining; CHEN Xingshu (born 1968), female, from Chengdu, Sichuan, associate professor, doctoral supervisor, Ph.D., research interests: information security, computer networks; DU Min (born 1987), male, from Baoji, Shaanxi, Ph.D. student, research interests: data mining, machine learning.
  • Supported by:
    Key Projects in the National Science & Technology Pillar Program

Abstract: Hot topic mining is an important technical foundation for public opinion monitoring. Existing forum hot topic mining methods do not address the heavy word noise in forum data, and they measure topic hotness in only one way. To address these problems, a hot topic mining method based on topic cluster evaluation is proposed. Forum text is modeled with the Latent Dirichlet Allocation (LDA) topic model. After topic noise is removed from the documents mapped into topic space, the documents are clustered with K-means++, a K-means variant that optimizes the selection of initial cluster centers. Finally, the clusters are evaluated on three criteria: topic burstiness, topic purity, and cluster attention degree. Experimental analysis shows that clustering quality and clustering speed are best when the topic noise threshold is set to 0.75 and the number of cluster centers is set to 50. Tests on real data sets show that the method effectively ranks clusters by their likelihood of containing a hot topic. A method for displaying hot topics is also designed.

Key words: Latent Dirichlet Allocation (LDA), topic model, K-means clustering, cluster evaluation, hot topic
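
The clustering step described in the abstract can be sketched as follows. This is a minimal, standard-library-only illustration of K-means++ seeding followed by ordinary K-means (Lloyd) iterations, not the authors' implementation. The `docs` vectors are made-up document-topic distributions standing in for LDA output, and all function names are hypothetical. The topic noise filter and the three cluster evaluation measures are left out because the abstract does not define them.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """Pick k initial centers with K-means++ seeding (D^2 weighting)."""
    # The first center is drawn uniformly at random.
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # every point coincides with an existing center
            centers.append(list(rng.choice(points)))
            continue
        # Draw the next center with probability proportional to d2.
        centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def kmeans(points, k, iters=50, seed=0):
    """Lloyd iterations started from K-means++ centers.

    Returns (labels, centers).
    """
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable, so stop early
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical document-topic vectors (3 topics), standing in for LDA output.
docs = [
    [0.90, 0.05, 0.05], [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05], [0.10, 0.85, 0.05],
    [0.05, 0.05, 0.90], [0.05, 0.10, 0.85],
]
labels, centers = kmeans(docs, k=3)
print(labels)
```

Because each new center is sampled in proportion to its squared distance from the centers already chosen, the initial centers tend to be spread across the topic space, which is the "optimized cluster center selection" that distinguishes K-means++ from plain K-means. In a setting like the one the abstract reports, `k` would be 50 and the input would be the noise-filtered LDA topic vectors.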

CLC Number: