Multi-source text topic mining model based on Dirichlet multinomial allocation model
XU Liyang1,2, HUANG Ruizhang1,2,3, CHEN Yanping1,2, QIAN Zhisen1,2, LI Wanying1,2
1. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China; 2. Guizhou Provincial Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China; 3. State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing Jiangsu 210093, China
Abstract: With the rapid growth in the number of text data sources, topic mining for multi-source text data has become a research focus of text mining. Since traditional topic models are mainly designed for a single source, applying them directly to multi-source data has many limitations. Therefore, considering that the word distribution of a topic differs across sources and that the Dirichlet Multinomial Allocation model (DMA) provides nonparametric clustering, a multi-source topic model based on DMA, namely MSDMA (Multi-Source Dirichlet Multinomial Allocation), was proposed. The main contributions of the proposed model are as follows: 1) it takes the characteristics of each source into account when modeling topics, and can learn a source-specific word distribution for each topic k; 2) it improves topic discovery on high-noise, low-information sources through knowledge sharing across sources; 3) it automatically learns the number of topics within each source, with no need for the number to be specified in advance. Experimental results on a synthetic dataset and two real datasets show that the proposed model extracts topic information more effectively and efficiently than state-of-the-art topic models.
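The paper's MSDMA sampler is not reproduced here; the following is a minimal single-source sketch of the DMA-style collapsed Gibbs sampling that MSDMA builds on, illustrating how starting from an upper bound of k_max candidate clusters lets the effective number of topics be learned rather than fixed. The function name dma_gibbs, the hyperparameter values, and the single-source simplification are illustrative assumptions, not the authors' implementation.

```python
# Sketch of collapsed Gibbs sampling for a DMA-style mixture model over
# documents. Assumption: documents are lists of integer token ids.
import random
from collections import Counter

def dma_gibbs(docs, vocab_size, k_max=20, alpha=0.1, beta=0.1, iters=30):
    """Return one cluster (topic) label per document.

    Sampling starts from k_max candidate clusters; clusters that empty out
    retain almost no posterior mass, so the effective number of topics is
    learned automatically, as in the DMA base model.
    """
    z = [random.randrange(k_max) for _ in docs]   # cluster assignment per doc
    m = Counter(z)                                # number of docs per cluster
    n = [Counter() for _ in range(k_max)]         # per-cluster word counts
    n_tot = [0] * k_max                           # per-cluster token totals
    for d, doc in enumerate(docs):
        n[z[d]].update(doc)
        n_tot[z[d]] += len(doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            k_old = z[d]                          # remove doc from its cluster
            m[k_old] -= 1
            n[k_old].subtract(doc)
            n_tot[k_old] -= len(doc)

            weights = []
            for k in range(k_max):                # collapsed conditional p(z_d = k | rest)
                w = m[k] + alpha                  # prior: cluster popularity
                seen = Counter()
                for i, t in enumerate(doc):       # doc likelihood under cluster k
                    w *= (n[k][t] + beta + seen[t]) / (n_tot[k] + beta * vocab_size + i)
                    seen[t] += 1
                weights.append(w)

            k_new = random.choices(range(k_max), weights=weights)[0]
            z[d] = k_new                          # add doc back to sampled cluster
            m[k_new] += 1
            n[k_new].update(doc)
            n_tot[k_new] += len(doc)
    return z
```

With token-id documents such as docs = [[0, 1, 1], [2, 3, 2], [0, 1]], calling dma_gibbs(docs, vocab_size=4) returns one label per document. In practice the per-document product should be computed in log space to avoid underflow on long documents, and MSDMA would further maintain source-specific word counts so each source contributes its own word distribution for a shared topic.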