Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (11): 3094-3099. DOI: 10.11772/j.issn.1001-9081.2018041359

• The 7th China Conference on Data Mining (CCDM 2018) •

Multi-source text topic mining model based on the Dirichlet multinomial allocation model

XU Liyang1,2, HUANG Ruizhang1,2,3, CHEN Yanping1,2, QIAN Zhisen1,2, LI Wanying1,2

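  1. College of Computer Science and Technology, Guizhou University, Guiyang 550025;
    2. Guizhou Provincial Key Laboratory of Public Big Data (Guizhou University), Guiyang 550025;
    3. State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210093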
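  • Received: 2018-05-29 Revised: 2018-06-15 Online: 2018-11-10 Published: 2018-11-10
  • Corresponding author: HUANG Ruizhang
  • About the authors: XU Liyang (born 1990), male, from Fuquan, Guizhou, M.S. candidate; main research interests: data mining, text mining, machine learning. HUANG Ruizhang (born 1979), female, from Tianjin, associate professor, Ph.D., CCF member; main research interests: data mining, text mining, machine learning, information retrieval. CHEN Yanping (born 1980), male, from Qiannan, Guizhou, associate professor, Ph.D.; main research interests: artificial intelligence, natural language processing. QIAN Zhisen (born 1994), male, from Tongren, Guizhou, M.S. candidate; main research interests: data mining, text mining, machine learning. LI Wanying (born 1992), female, from Guiyang, Guizhou, M.S. candidate; main research interests: data mining, text mining, machine learning.
  • Supported by:
    the National Natural Science Foundation of China (61462011); the Major Research Plan of the National Natural Science Foundation of China (91746116); the Major Applied Basic Research Program of Guizhou Province (Qiankehe JZ [2014] 2001); the Major Science and Technology Special Project of Guizhou Province (Qiankehe Major Special Project [2017] 3002); the Natural Science Foundation of Guizhou Province (Qiankehe Jichu [2018] 1035).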

Multi-source text topic mining model based on Dirichlet multinomial allocation model

XU Liyang1,2, HUANG Ruizhang1,2,3, CHEN Yanping1,2, QIAN Zhisen1,2, LI Wanying1,2   

  1. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China;
    2. Guizhou Provincial Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China;
    3. State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing Jiangsu 210093, China
  • Received: 2018-05-29 Revised: 2018-06-15 Online: 2018-11-10 Published: 2018-11-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61462011), the Major Research Program of the National Natural Science Foundation of China (91746116), the Major Applied Basic Research Program of Guizhou Province (JZ[2014]2001), the Major Special Science and Technology Project of Guizhou Province ([2017]3002), and the Science and Technology Project of Guizhou Province ([2018]1035).

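Abstract: As the channels through which text data arrive grow more diverse, topic mining over multi-source text data has become a research focus in text mining. Traditional topic models mainly target single-source text data, so applying them directly to multi-source text data is subject to many limitations. To address this problem, a multi-source text topic mining model based on the Dirichlet Multinomial Allocation (DMA) model is proposed, namely the Multi-Source Dirichlet Multinomial Allocation (MSDMA) model. By taking into account how a topic's word distribution differs across data sources, and by exploiting the nonparametric clustering property of the DMA model, the model addresses three problems: 1) it learns the source-specific word distribution of the same topic in each data source; 2) by sharing the topic space and the term space across data sources, it lets the sources complement one another's topic knowledge, which improves topic discovery on noisy, low-information sources; 3) it learns the number of topics within each data source automatically, without requiring the number to be specified in advance. Experimental results on simulated and real datasets show that the proposed model mines topic information from multi-source data more effectively than traditional topic models.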

Key words: multi-source text data, topic model, Gibbs sampling, Dirichlet Multinomial Allocation (DMA) model, text mining

Abstract: With the rapid increase in text data sources, topic mining for multi-source text data has become a research focus of text mining. Since traditional topic models are mainly oriented to single-source text data, applying them directly to multi-source text data is subject to many limitations. Therefore, a topic model for multi-source text data based on the Dirichlet Multinomial Allocation (DMA) model, namely Multi-Source Dirichlet Multinomial Allocation (MSDMA), was proposed; it considers the differences in a topic's word distribution across sources and exploits the nonparametric clustering property of DMA. The main contributions of the proposed model are as follows: 1) it takes the characteristics of each source into account when modeling topics, and can learn the source-specific word distribution of each topic; 2) it can improve topic discovery on sources with high noise and low information content through knowledge sharing among sources; 3) it can automatically learn the number of topics within each source without that number being given in advance. Experimental results on a simulated dataset and two real datasets indicate that the proposed model extracts topic information more effectively and efficiently than state-of-the-art topic models.

Key words: multi-source text data, topic model, blocked-Gibbs sampling, Dirichlet Multinomial Allocation (DMA), text mining
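
The generative idea summarized in the abstract (topics shared across sources, a source-specific word distribution for each topic, and a number of active topics that is not fixed in advance) can be illustrated with a toy simulation. The following is a minimal sketch in Python with NumPy, not the paper's MSDMA specification or its inference procedure. It assumes a DMA-style finite mixture in which each document is assigned a single topic, and a hypothetical coupling in which each source's word distribution is drawn from a Dirichlet centred on a shared prototype. The function name, the truncation level `max_topics`, and all parameters are illustrative assumptions.

```python
import numpy as np


def generate_multi_source_corpus(n_sources=2, n_docs=50, doc_len=30,
                                 vocab_size=20, max_topics=10,
                                 alpha=1.0, beta=0.5, concentration=50.0,
                                 seed=0):
    """Toy generator: shared topics with source-specific word distributions.

    All names and the coupling scheme are assumptions made for illustration.
    """
    rng = np.random.default_rng(seed)
    # One shared prototype word distribution per topic (shared topic space).
    shared_topics = rng.dirichlet(np.full(vocab_size, beta), size=max_topics)
    corpus = []
    for _ in range(n_sources):
        # DMA-style prior: symmetric Dirichlet(alpha / K) over K topic weights.
        # A small alpha / K concentrates mass on a few topics, so the number
        # of topics actually used is typically smaller than max_topics.
        weights = rng.dirichlet(np.full(max_topics, alpha / max_topics))
        # Source-specific variant of every topic, centred on its prototype
        # (hypothetical coupling; the small constant keeps parameters > 0).
        word_dists = np.vstack([
            rng.dirichlet(concentration * shared_topics[k] + 0.01)
            for k in range(max_topics)
        ])
        # Each document gets one topic, then a bag of doc_len words.
        topic_of_doc = rng.choice(max_topics, size=n_docs, p=weights)
        docs = np.vstack([
            rng.multinomial(doc_len, word_dists[z]) for z in topic_of_doc
        ])
        corpus.append({"topics": topic_of_doc, "docs": docs,
                       "word_dists": word_dists})
    return corpus


if __name__ == "__main__":
    for i, src in enumerate(generate_multi_source_corpus()):
        active = len(np.unique(src["topics"]))
        print(f"source {i}: {active} active topics out of 10 allowed")
```

Counting the distinct values in each source's `topics` array shows how many of the allowed topics that source actually uses. In a real model this quantity would be inferred from the data, for example by Gibbs sampling, rather than read off the simulation.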

CLC Number: