Journal of Computer Applications, 2014, Vol. 34, Issue (5): 1354-1359. DOI: 10.11772/j.issn.1001-9081.2014.05.1354

• Artificial intelligence •

News topic mining method based on weighted latent Dirichlet allocation model

LI Xiangdong1,2, BA Zhichao2, HUANG Li3

    1. Center for the Studies of Information Resources (CSIR), Wuhan University, Wuhan Hubei 430072, China;
    2. School of Information Management, Wuhan University, Wuhan Hubei 430072, China;
    3. Wuhan University Library, Wuhan University, Wuhan Hubei 430072, China
  • Received: 2013-11-12 Revised: 2013-12-27 Online: 2014-05-01 Published: 2014-05-30
  • Contact: LI Xiangdong

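  • About the authors: LI Xiangdong (born 1963), male, from Zhuanghe, Liaoning; associate professor, Ph.D.; research interests: information retrieval, data mining, automatic classification. BA Zhichao (born 1990), male, from Binzhou, Shandong; master's student; research interests: information retrieval, automatic classification. HUANG Li (born 1964), female, from Puning, Guangdong; associate research librarian, master's degree; research interests: scientific and technical literature management, literature resource development, information services.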

Abstract:

To address the low accuracy and poor topic interpretability of traditional news topic mining, a method based on a weighted Latent Dirichlet Allocation (LDA) model is proposed that exploits the structural characteristics of news reports. First, vocabulary weights are improved from several angles and combined into composite weights, which extend the process by which the LDA model generates feature words so that more expressive words are obtained. Second, the Category Distinguish Word (CDW) method is applied to optimize the word order of the modeling results, which reduces topic ambiguity and noise and improves topic interpretability. Finally, based on the mathematical properties of the model's topic probability distribution, topics are quantified in terms of the contribution of documents to topics and the topic weight probability in order to identify hot topics. Simulation experiments show that, compared with the traditional LDA model, the weighted LDA model reduces the miss rate (false negative rate) and the false alarm rate (false positive rate) by an average of 1.43% and 0.16%, respectively, and reduces the minimum normalized cost by an average of 2.68%, which confirms the feasibility and effectiveness of the method.
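
The three evaluation measures named in the abstract are the ones commonly used in topic detection and tracking (TDT) style evaluation. The following is a minimal sketch of how they are usually computed, not the authors' evaluation code. The function names are hypothetical, and the cost constants (`c_miss = 1.0`, `c_fa = 0.1`, `p_target = 0.02`) are the conventional TDT defaults, assumed here because the abstract does not state the values used.

```python
def detection_rates(missed, targets, false_alarms, non_targets):
    """Return (miss rate, false alarm rate) from raw decision counts."""
    p_miss = missed / targets if targets else 0.0
    p_fa = false_alarms / non_targets if non_targets else 0.0
    return p_miss, p_fa


def normalized_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Detection cost divided by the cost of the better trivial system.

    The default constants are an assumption (conventional TDT settings),
    not values taken from the abstract.
    """
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # A trivial system either rejects everything or accepts everything.
    baseline = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost / baseline


def min_normalized_cost(operating_points, **kwargs):
    """Smallest normalized cost over (p_miss, p_fa) pairs, one per threshold."""
    return min(normalized_cost(pm, pf, **kwargs) for pm, pf in operating_points)
```

For example, with 2 of 10 on-topic stories missed and 5 of 100 off-topic stories falsely accepted, `detection_rates(2, 10, 5, 100)` gives a miss rate of 0.2 and a false alarm rate of 0.05, and `normalized_cost(0.2, 0.05)` is about 0.445 under the assumed constants. A value below 1 means the system does better than either trivial decision rule.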

CLC Number: