Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (08): 2346-2349.

• Typical applications •

BTopicMiner: a domain-specific hot-topic mining system for Chinese microblogs

LI Jin1,2, ZHANG Hua1, WU Hao-xiong1, XIANG Jun1

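  1. School of Information Engineering, Hubei University for Nationalities, Enshi, Hubei 445000
  2. Department of Information Management, Central China Normal University, Wuhan 430079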
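  • Received: 2012-02-15 Revised: 2012-03-30 Online: 2012-08-28 Published: 2012-08-01
  • Corresponding author: LI Jin
  • About the authors: LI Jin (1973-), male, born in Enshi, Hubei; associate professor, M.S., CCF member, Ph.D. candidate. Research interests: Internet-based data mining and data management, cloud-computing-oriented Web services and Web service composition;
    ZHANG Hua (1978-), male, born in Enshi, Hubei; lecturer, M.S. Research interests: information retrieval, distributed systems and integration;
    WU Hao-xiong (1979-), male, born in Jianshi, Hubei; engineer. Research interests: Web data mining, information security;
    XIANG Jun (1978-), male, born in Laifeng, Hubei; lecturer, Ph.D. Research interests: mobile computing, real-time database systems, software testing.
  • Supported by:
    National Natural Science Foundation of China (61040006); Natural Science Foundation of Hubei Province (2010CDZ027); Science and Technology Project of the Hubei Provincial Department of Education (B20101909)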

BTopicMiner: domain-specific topic mining system for Chinese microblog

LI Jin1,2, ZHANG Hua3, WU Hao-xiong3, XIANG Jun3

  1. Information Management Department, Central China Normal University, Wuhan Hubei 430079, China
  2. School of Information Engineering, Hubei University for Nationalities, Enshi Hubei 445000, China
  3. School of Information Engineering, Hubei University for Nationalities, Enshi Hubei 445000, China
  • Received: 2012-02-15 Revised: 2012-03-30 Online: 2012-08-28 Published: 2012-08-01
  • Contact: LI Jin

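Abstract: With the rapid growth of microblogging, automatically extracting the hot topics that interest users from massive volumes of microblog messages has become a challenging research problem. To address it, this paper proposes a hot-topic extraction algorithm for Chinese microblogs based on an extended topic model. To overcome the data sparsity inherent in microblog messages, the algorithm first uses text clustering to merge content-related messages into microblog documents. Then, on the assumption that follow-up (reply) relationships between microblogs imply topical relatedness, it extends the traditional Latent Dirichlet Allocation (LDA) topic model to model those relationships. Finally, it uses Mutual Information (MI) to compute the topic vocabulary of each extracted topic for hot-topic recommendation. To verify the effectiveness of the extended topic extraction model, a prototype domain-specific hot-topic mining system for Chinese microblogs, BTopicMiner, was implemented. Experimental results show that the extended topic model based on follow-up relationships extracts hot topics from microblogs more accurately, and that the semantic similarity between the topic vocabulary computed automatically with the MI measure and manually selected hot words exceeds 75%.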

Key words: data mining, information retrieval, microblog, topic model, text clustering, mutual information
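
The clustering step mentioned in the abstract (merging content-related short messages into longer synthetic documents to reduce sparsity) can be illustrated with a minimal sketch. The abstract does not say which clustering method or similarity measure the authors used, so the greedy single-pass scheme, the Jaccard similarity, the `threshold` value, and the function names below are all assumptions made for illustration. The sample messages are pre-tokenized English words; real Chinese microblog text would first need word segmentation.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_messages(messages: List[List[str]],
                     threshold: float = 0.3) -> List[List[str]]:
    """Greedy single-pass clustering (an illustrative choice, not the paper's).

    Each message joins the first cluster whose accumulated vocabulary is
    similar enough; otherwise it starts a new cluster. Returns one synthetic
    document (the concatenated tokens) per cluster.
    """
    vocab: List[Set[str]] = []       # vocabulary of each cluster
    documents: List[List[str]] = []  # merged tokens of each cluster
    for tokens in messages:
        token_set = set(tokens)
        for i, cluster_vocab in enumerate(vocab):
            if jaccard(token_set, cluster_vocab) >= threshold:
                cluster_vocab |= token_set   # grow the cluster vocabulary in place
                documents[i].extend(tokens)
                break
        else:
            # No existing cluster was similar enough: open a new one.
            vocab.append(set(token_set))
            documents.append(list(tokens))
    return documents


# Hypothetical, already-tokenized messages.
messages = [
    ["earthquake", "rescue", "team"],
    ["rescue", "team", "arrives"],
    ["phone", "launch", "price"],
    ["new", "phone", "price"],
]
synthetic_docs = cluster_messages(messages)
```

With these four messages the sketch yields two synthetic documents, one per subject, each long enough to give a topic model more word co-occurrence evidence than a single short message would.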

Abstract: As microblog applications grow rapidly, automatically extracting the popular topics that interest users from massive microblog information has become a challenging research problem. This paper studies and proposes a topic extraction algorithm for Chinese microblogs based on an extended topic model. To deal with the data sparsity of microblogs, content-related microblog texts are first clustered to generate synthetic documents. Based on the assumption that the posting relationship among microblogs implies topical correlation, the traditional Latent Dirichlet Allocation (LDA) topic model is extended to model that relationship. Finally, after topics are extracted with the extended LDA model, a Mutual Information (MI) measure is used to calculate the topic vocabulary for topic recommendation. Furthermore, a prototype domain-specific topic mining system, named BTopicMiner, was implemented to verify the effectiveness of the proposed algorithm. The experimental results show that the proposed algorithm extracts topics from microblogs more accurately. Meanwhile, the semantic similarity between the topic vocabulary calculated automatically with MI and the manually selected topic vocabulary exceeds 75%.

Key words: data mining, information retrieval, microblog, topic model, text clustering, Mutual Information (MI)
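
As a rough illustration of how an MI-style score can rank words for a topic, the sketch below computes the pointwise form log(P(w, t) / (P(w) P(t))) from document counts. This is only one common reading of "mutual information" in text mining; the abstract does not give the paper's exact formula. The hard topic label attached to each document, the function names `pmi_scores` and `top_words`, and the sample data are assumptions for illustration and are not taken from BTopicMiner.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def pmi_scores(docs: List[Tuple[int, List[str]]]) -> Dict[Tuple[str, int], float]:
    """Pointwise mutual information log(P(w, t) / (P(w) * P(t))).

    Probabilities are estimated from document counts; each document is
    assumed to carry exactly one topic label.
    """
    n_docs = len(docs)
    topic_count: Counter = Counter()
    word_count: Counter = Counter()
    joint_count: Counter = Counter()
    for topic, tokens in docs:
        topic_count[topic] += 1
        for word in set(tokens):  # count document presence, not term frequency
            word_count[word] += 1
            joint_count[(word, topic)] += 1
    scores: Dict[Tuple[str, int], float] = {}
    for (word, topic), joint in joint_count.items():
        p_joint = joint / n_docs
        p_word = word_count[word] / n_docs
        p_topic = topic_count[topic] / n_docs
        scores[(word, topic)] = math.log(p_joint / (p_word * p_topic))
    return scores


def top_words(scores: Dict[Tuple[str, int], float], topic: int, k: int = 3) -> List[str]:
    """Return the k highest-scoring words for one topic (ties broken alphabetically)."""
    ranked = sorted(
        ((score, word) for (word, t), score in scores.items() if t == topic),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [word for _, word in ranked[:k]]


# Hypothetical (topic label, tokens) pairs.
docs = [
    (0, ["earthquake", "rescue", "news"]),
    (0, ["earthquake", "relief", "news"]),
    (1, ["phone", "launch", "news"]),
    (1, ["phone", "price", "news"]),
]
scores = pmi_scores(docs)
```

In this toy data the word "news" appears in every document and scores 0 for both topics, while words confined to one topic score log 2, which is why an MI-style measure is a plausible way to pick topic-specific vocabulary.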

CLC number: