Journal of Computer Applications (计算机应用) ›› 2014, Vol. 34 ›› Issue (8): 2332-2335. DOI: 10.11772/j.issn.1001-9081.2014.08.2332

• Artificial Intelligence •

Microblog bursty topic detection based on topic tree (基于主题树的微博突发话题检测)

QIU Yunfei (邱云飞)1, GUO Milun (郭弥纶)1, SHAO Liangshan (邵良杉)2

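  1. School of Software, Liaoning Technical University, Huludao, Liaoning 125100, China;
    2. System Engineering Institute, Liaoning Technical University, Huludao, Liaoning 125100, China
  • Received: 2014-02-12 Revised: 2014-04-24 Online: 2014-08-01 Published: 2014-08-10
  • Corresponding author: GUO Milun
  • About the authors: QIU Yunfei (born 1976), male, Mongol ethnicity, from Fuxin, Liaoning; professor, Ph.D., CCF member; main research interests: data mining and topic detection. GUO Milun (born 1989), male, Manchu ethnicity, from Fuxin, Liaoning; master's student; main research interests: data mining and topic detection. SHAO Liangshan (born 1961), male, from Fuxin, Liaoning; professor, Ph.D.; main research interest: data mining.
  • Supported by:

    National Natural Science Foundation of China; Liaoning Province Innovative Team Project; Liaoning Province Program for the Growth of Outstanding Young Scholars in Higher Education Institutions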

Microblog bursty topic detection based on topic tree

QIU Yunfei1,GUO Milun1,SHAO Liangshan2   

  1. School of Software, Liaoning Technical University, Huludao Liaoning 125100, China;
    2. System Engineering Institute, Liaoning Technical University, Huludao Liaoning 125100, China
  • Received: 2014-02-12 Revised: 2014-04-24 Online: 2014-08-01 Published: 2014-08-10
  • Contact: GUO Milun

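Abstract (translated from Chinese):

Traditional topic detection methods cope poorly with microblog text, which is marked by nonstandard wording, strong arbitrariness, ambiguous references, and heavy use of Internet slang. To address this, a topic tree detection method based on the Latent Dirichlet Allocation (LDA) model is proposed. First, related microblogs are organized into a topic tree using the information-entropy-increasing approach from Natural Language Processing (NLP). Together with a design in which the Dirichlet prior α and the empirical value β vary dynamically with the number of topics, and with the model's distinctive dual probability statistics, a "contribution degree" is computed for every word in the text, so that interfering information is handled in advance and the influence of junk data on topic detection is eliminated. Then, this contribution degree is used as the parameter value of an improved Vector Space Model (VSM) to compute inter-document similarity and extract bursty topics, with the aim of improving the precision of bursty topic detection. The proposed method was evaluated from two angles: comparison of F-values and manual inspection. The experimental data show that the algorithm not only detects bursty topics, but also obtains results 3% and 7% higher than the HowNet model and the TF-IDF algorithm, respectively, and that its results agree better with human judgment.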

Abstract:

A topic tree detection method based on the Latent Dirichlet Allocation (LDA) model is proposed to address problems in microblog text that traditional topic detection methods handle poorly: nonstandard wording, arbitrariness, ambiguous references, and a large volume of Internet slang. First, related microblogs are organized into a topic tree by the information-entropy-increasing approach from Natural Language Processing (NLP). Combined with a design in which the Dirichlet prior α and the empirical value β vary with the number of topics, and using the model's distinctive dual probability statistics, the contribution of every word in the text is computed, so that interfering information is handled in advance and the influence of junk data on topic detection is excluded. Then, with this contribution as the parameter value of an improved Vector Space Model (VSM), bursty topics are extracted by calculating the similarity between texts, in order to improve the precision of bursty topic detection. The proposed method was evaluated from two aspects: comparison of F-values and manual inspection. The experimental data show that the algorithm not only detects bursty topics but also yields results about 3% and 7% higher than the HowNet model and the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, respectively, and that its results agree better with human judgment than those of the traditional methods.
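
The abstract does not give the formula for a word's contribution, so the following Python sketch only illustrates the general idea of replacing TF-IDF weights in a VSM with LDA-derived weights. It assumes, purely for illustration, that the contribution of word w to document d is the sum over topics k of P(k | d) · P(w | k); the function names, the `min_contrib` threshold, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
import math


def word_contribution(doc_topic, topic_word):
    """Per-word contribution for one document.

    ASSUMPTION (not from the paper): contribution(w, d) is
    sum_k P(k | d) * P(w | k), one reading of "dual probability".
    doc_topic:  list of P(k | d), one entry per topic k
    topic_word: list of dicts, topic_word[k][w] = P(w | k)
    """
    contrib = {}
    for k, p_topic in enumerate(doc_topic):
        for word, p_word in topic_word[k].items():
            contrib[word] = contrib.get(word, 0.0) + p_topic * p_word
    return contrib


def weighted_vector(tokens, contrib, min_contrib=0.0):
    """Sparse VSM vector whose weights are accumulated contributions.

    Tokens whose contribution is not above min_contrib (a hypothetical
    noise threshold) are dropped before similarity is computed.
    """
    vec = {}
    for tok in tokens:
        c = contrib.get(tok, 0.0)
        if c > min_contrib:
            vec[tok] = vec.get(tok, 0.0) + c
    return vec


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy example with two topics; all probabilities are made up.
topic_word = [
    {"earthquake": 0.5, "rescue": 0.4, "lol": 0.1},
    {"concert": 0.6, "ticket": 0.3, "lol": 0.1},
]
post_a = weighted_vector(["earthquake", "rescue", "lol"],
                         word_contribution([0.9, 0.1], topic_word))
post_b = weighted_vector(["earthquake", "rescue"],
                         word_contribution([0.8, 0.2], topic_word))
post_c = weighted_vector(["concert", "ticket", "lol"],
                         word_contribution([0.1, 0.9], topic_word))
print(cosine_similarity(post_a, post_b), cosine_similarity(post_a, post_c))
```

In this toy setup the two earthquake posts score far closer to each other than either does to the concert post, because the shared filler word receives a small weight under every topic mixture. In practice the topic distributions would come from a trained LDA model rather than hand-written dictionaries.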

CLC number: