News topic mining method based on weighted latent Dirichlet allocation model

doi:10.11772/j.issn.1001-9081.2014.05.1354

Abstract

To address the low accuracy and poor interpretability of traditional news topic mining, a method based on a weighted Latent Dirichlet Allocation (LDA) model is proposed that takes the structural characteristics of news reports into account. First, word weights are improved from several angles and combined into a composite weight, and the process by which the LDA model generates feature words is extended so that more expressive words are obtained. Second, the Category Distinguishing Word (CDW) method is applied to optimize the word order of the modeling results, which reduces topic ambiguity and noise and improves topic interpretability. Finally, based on the mathematical properties of the model's topic probability distribution, topics are quantified in terms of the contribution of documents to each topic and the topic weight probability in order to identify hot topics. Simulation experiments show that, compared with the traditional LDA model, the weighted LDA model reduces the false negative rate and the false positive rate by an average of 1.43% and 0.16%, respectively, and reduces the minimum normalized cost by an average of 2.68%, which confirms the feasibility and effectiveness of the method.
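The abstract reports three evaluation figures (false negative rate, false positive rate, and a minimum normalized cost) without defining them. The sketch below assumes they follow the detection-cost convention commonly used in topic detection and tracking evaluations, where a miss rate and a false-alarm rate are combined into one cost and divided by the cost of the better trivial system. The function names and the default constants (`c_miss=1.0`, `c_fa=0.1`, `p_target=0.02`) are illustrative assumptions, not values taken from the paper.

```python
def normalized_detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Combine a miss rate and a false-alarm rate into a normalized cost.

    All default constants are assumed, conventional values; the paper's
    abstract does not state which ones were used.
    """
    if not (0.0 <= p_miss <= 1.0 and 0.0 <= p_fa <= 1.0):
        raise ValueError("p_miss and p_fa must be rates in [0, 1]")
    # Expected cost of the system at this operating point.
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Cost of the better of the two trivial systems
    # (always answer "no" vs. always answer "yes").
    baseline = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost / baseline


def min_normalized_cost(operating_points, **cost_params):
    """Return the lowest normalized cost over (p_miss, p_fa) pairs,
    e.g. one pair per decision threshold."""
    return min(
        normalized_detection_cost(p_miss, p_fa, **cost_params)
        for p_miss, p_fa in operating_points
    )
```

Under this convention, a value of 1.0 means the system is no better than always answering "no", so lower values are better and a drop in the minimum cost indicates an improvement at the best threshold.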


CLC Number:

TP391

LI Xiangdong, BA Zhichao, HUANG Li. News topic mining method based on weighted latent Dirichlet allocation model[J]. Journal of Computer Applications, 2014, 34(5): 1354-1359.

李湘东, 巴志超, 黄莉. 基于加权隐含狄利克雷分配模型的新闻话题挖掘方法[J]. 计算机应用, 2014, 34(5): 1354-1359.
