Multi-source text topic mining model based on Dirichlet multinomial allocation model
XU Liyang1,2, HUANG Ruizhang1,2,3, CHEN Yanping1,2, QIAN Zhisen1,2, LI Wanying1,2
1. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China; 2. Guizhou Provincial Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China; 3. State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing Jiangsu 210093, China
Abstract: With the rapid growth in the number of text data sources, topic mining for multi-source text data has become a research focus of text mining. Since traditional topic models are mainly designed for a single source, applying them directly to multi-source data has many limitations. Therefore, considering that the word distribution of a topic differs across sources and that the Dirichlet Multinomial Allocation model (DMA) provides nonparametric clustering, a multi-source topic model based on DMA, namely MSDMA (Multi-Source Dirichlet Multinomial Allocation), was proposed. The main contributions of the proposed model are as follows: 1) it takes the characteristics of each source into account when modeling topics, and can learn a source-specific word distribution for each topic k; 2) it improves topic discovery on high-noise, low-information sources through knowledge sharing across sources; 3) it automatically learns the number of topics within each source, with no need for the number to be specified in advance. Experimental results on a synthetic dataset and two real datasets show that the proposed model extracts topic information more effectively and efficiently than state-of-the-art topic models.
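The paper's MSDMA sampler is not reproduced here; the following is a minimal single-source sketch of the DMA-style collapsed Gibbs sampling that MSDMA builds on, illustrating how starting from an upper bound of k_max candidate clusters lets the effective number of topics be learned rather than fixed. The function name dma_gibbs, the hyperparameter values, and the single-source simplification are illustrative assumptions, not the authors' implementation.

```python
# Sketch of collapsed Gibbs sampling for a DMA-style mixture model over
# documents. Assumption: documents are lists of integer token ids.
import random
from collections import Counter

def dma_gibbs(docs, vocab_size, k_max=20, alpha=0.1, beta=0.1, iters=30):
    """Return one cluster (topic) label per document.

    Sampling starts from k_max candidate clusters; clusters that empty out
    retain almost no posterior mass, so the effective number of topics is
    learned automatically, as in the DMA base model.
    """
    z = [random.randrange(k_max) for _ in docs]   # cluster assignment per doc
    m = Counter(z)                                # number of docs per cluster
    n = [Counter() for _ in range(k_max)]         # per-cluster word counts
    n_tot = [0] * k_max                           # per-cluster token totals
    for d, doc in enumerate(docs):
        n[z[d]].update(doc)
        n_tot[z[d]] += len(doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            k_old = z[d]                          # remove doc from its cluster
            m[k_old] -= 1
            n[k_old].subtract(doc)
            n_tot[k_old] -= len(doc)

            weights = []
            for k in range(k_max):                # collapsed conditional p(z_d = k | rest)
                w = m[k] + alpha                  # prior: cluster popularity
                seen = Counter()
                for i, t in enumerate(doc):       # doc likelihood under cluster k
                    w *= (n[k][t] + beta + seen[t]) / (n_tot[k] + beta * vocab_size + i)
                    seen[t] += 1
                weights.append(w)

            k_new = random.choices(range(k_max), weights=weights)[0]
            z[d] = k_new                          # add doc back to sampled cluster
            m[k_new] += 1
            n[k_new].update(doc)
            n_tot[k_new] += len(doc)
    return z
```

With token-id documents such as docs = [[0, 1, 1], [2, 3, 2], [0, 1]], calling dma_gibbs(docs, vocab_size=4) returns one label per document. In practice the per-document product should be computed in log space to avoid underflow on long documents, and MSDMA would further maintain source-specific word counts so each source contributes its own word distribution for a shared topic.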