Journal of Computer Applications, 2017, Vol. 37, Issue (5): 1407-1412. DOI: 10.11772/j.issn.1001-9081.2017.05.1407

• Artificial Intelligence •

Biterm topic evolution model of microblog

SHI Qingwei, LIU Yushi, ZHANG Fengtian

  1. School of Software, Liaoning Technical University, Huludao Liaoning 125105, China
  • Received: 2016-10-12 Revised: 2016-12-31 Online: 2017-05-10 Published: 2017-05-16
  • Corresponding author: LIU Yushi
  • About the authors: SHI Qingwei (born 1973), male, from Fuxin, Liaoning, associate professor, Ph.D.; main research interest: intelligent data processing. LIU Yushi (born 1993), female, from Tieling, Liaoning, master's student; main research interest: intelligent data processing. ZHANG Fengtian (born 1991), male, from Shijiazhuang, Hebei, master's student; main research interests: big data and cloud computing.

Abstract: Traditional topic models ignore both the short-text nature of microblog posts and the dynamic evolution of text over time. To address this, a Biterm Topic over Time (BToT) model based on microblog text was proposed, and topic evolution analysis was carried out on a dataset with the proposed model. BToT introduces a continuous time variable into the text generation process to describe how topics evolve along the time dimension, and forms topic-sharing "biterm" structures within each document to enrich the sparse features of short text. The parameters of BToT were estimated by Gibbs sampling, and topic evolution was analyzed from the resulting topic-time distribution parameters. Experiments on a real microblog dataset show that BToT can describe the latent topic evolution patterns in the data and achieves lower perplexity than Latent Dirichlet Allocation (LDA), the Biterm Topic Model (BTM) and Topics over Time (ToT).

Key words: feature sparsity, topic evolution model, dynamic evolution, Gibbs sampling, microblog
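
The biterm structure with a time variable mentioned in the abstract can be pictured with a minimal sketch. It assumes, as in BTM, that any two words co-occurring in one post form an unordered biterm, and, as in ToT, that each timestamp has been normalized to [0, 1]; the abstract does not spell out either detail. The function name `extract_biterms` and the sample posts are hypothetical.

```python
from itertools import combinations


def extract_biterms(tokens, timestamp):
    """Pair every two words of one short post and tag each pair with the post's timestamp.

    Assumption: a whole post is a single co-occurrence window (BTM-style),
    and `timestamp` is already scaled to [0, 1] (ToT-style).
    """
    return [(w1, w2, timestamp) for w1, w2 in combinations(tokens, 2)]


# Hypothetical tokenized posts with normalized timestamps.
posts = [
    (["haze", "beijing", "alert"], 0.10),
    (["haze", "mask"], 0.12),
]

# Corpus-level biterm set: the unit a Gibbs sampler would assign topics to.
biterms = [b for tokens, t in posts for b in extract_biterms(tokens, t)]
```

A post of n words yields n(n-1)/2 biterms, which is how the pairing enlarges the sparse feature set of short text compared with counting single words per document.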

CLC number: