Microblog bursty topic detection based on topic tree

doi:10.11772/j.issn.1001-9081.2014.08.2332

Journal of Computer Applications, 2014, Vol. 34, Issue (8): 2332-2335. DOI: 10.11772/j.issn.1001-9081.2014.08.2332

QIU Yunfei¹, GUO Milun¹, SHAO Liangshan²

1. School of Software, Liaoning Technical University, Huludao Liaoning 125100, China;
2. System Engineering Institute, Liaoning Technical University, Huludao Liaoning 125100, China

Received: 2014-02-12 Revised: 2014-04-24 Online: 2014-08-01 Published: 2014-08-10
Contact: GUO Milun
About the authors: QIU Yunfei (born 1976), male (Mongolian ethnicity), from Fuxin, Liaoning, professor, Ph.D., CCF member; research interests: data mining and topic detection. GUO Milun (born 1989), male (Manchu ethnicity), from Fuxin, Liaoning, master's student; research interests: data mining and topic detection. SHAO Liangshan (born 1961), male, from Fuxin, Liaoning, professor, Ph.D.; research interest: data mining.
Supported by:
National Natural Science Foundation of China; Innovation Team Project of Liaoning Province; Growth Plan for Distinguished Young Scholars in Higher Education Institutions of Liaoning Province

Abstract:

Traditional topic detection methods cope poorly with microblog text, which is characterized by nonstandard and casual wording, ambiguous references, and a large amount of Internet slang. To address this, a topic tree detection method based on the Latent Dirichlet Allocation (LDA) model is proposed. First, related microblogs are organized into a topic tree using a method from Natural Language Processing (NLP) that increases information entropy. Together with a design in which the Dirichlet prior α and the empirical value β vary dynamically with the number of topics, and using the model's characteristic dual probability statistics, the "contribution" of every word in a text is computed, so that interfering information is handled in advance and the influence of garbage data on topic detection is excluded. Then, this contribution is used as the parameter value of an improved Vector Space Model (VSM) to calculate the similarity between documents and extract bursty topics, with the aim of improving the precision of bursty topic detection. The proposed method was evaluated from two perspectives: comparison of F values and manual detection. The experimental data show that the algorithm not only detects bursty topics, but also obtains results about 3% and 7% higher than those of the HowNet model and the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, respectively, and that its results agree better with human judgment.
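The abstract does not give the formula for the word "contribution" or for the improved VSM, so the following Python sketch is only one plausible reading, not the authors' method. It assumes that the "dual probability" refers to LDA's two distributions, P(topic | document) and P(word | topic), and defines a hypothetical contribution as their product summed over topics. These contributions then serve as vector weights in an ordinary cosine similarity. All function names and the toy distributions are illustrative assumptions; in practice the distributions would come from a trained LDA model.

```python
import math


def word_contribution(word, doc_topic, topic_word):
    """Hypothetical contribution of a word in one document:
    sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(p_topic * topic_word[t].get(word, 0.0)
               for t, p_topic in enumerate(doc_topic))


def contribution_vector(words, doc_topic, topic_word):
    """Sparse vector mapping each distinct word to its contribution."""
    return {w: word_contribution(w, doc_topic, topic_word) for w in set(words)}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy P(word | topic) for two topics (assumed values, not from the paper).
topic_word = [
    {"earthquake": 0.5, "rescue": 0.4, "lol": 0.1},
    {"concert": 0.6, "tickets": 0.3, "lol": 0.1},
]

# Toy microblogs with assumed P(topic | doc) distributions.
vec_a = contribution_vector(["earthquake", "rescue", "lol"], [0.9, 0.1], topic_word)
vec_b = contribution_vector(["rescue", "earthquake"], [0.8, 0.2], topic_word)
vec_c = contribution_vector(["concert", "tickets", "lol"], [0.1, 0.9], topic_word)

sim_ab = cosine_similarity(vec_a, vec_b)  # two posts on the same topic
sim_ac = cosine_similarity(vec_a, vec_c)  # posts on different topics
print(sim_ab, sim_ac)
```

In this toy setting the filler word "lol" gets a small weight in every vector, so it contributes little to the similarity between posts on different topics, which is the kind of noise suppression the abstract describes.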

CLC Number:

TP391
TP18

QIU Yunfei, GUO Milun, SHAO Liangshan. Microblog bursty topic detection based on topic tree[J]. Journal of Computer Applications, 2014, 34(8): 2332-2335.

