Journal of Computer Applications (《计算机应用》), 2023, Vol. 43, Issue (8): 2370-2375. DOI: 10.11772/j.issn.1001-9081.2022091354

• 19th CCF China Conference on Information Systems and Applications

DDDC: deep dynamic document clustering model

Hui LU1,2, Ruizhang HUANG1,2, Jingjing XUE1,2, Lina REN1,2, Chuan LIN1,2

  1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang, Guizhou 550025, China
  2. College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China
  • Received: 2022-09-06 Revised: 2022-10-26 Accepted: 2022-11-01 Online: 2022-12-12 Published: 2023-08-10
  • Contact: Ruizhang HUANG
  • About the authors: LU Hui, born in 1998, from Anshun, Guizhou, M. S. candidate, CCF member. His research interests include dynamic clustering and topic mining.
    XUE Jingjing, born in 1995, from Rizhao, Shandong, Ph. D. candidate, CCF member. Her research interests include deep document clustering.
    REN Lina, born in 1987, from Fuxin, Liaoning, lecturer, Ph. D. candidate, CCF member. Her research interests include natural language processing, document mining, and machine learning.
    LIN Chuan, born in 1975, from Zigong, Sichuan, associate professor, M. S. His research interests include document mining, machine learning, and big data management and applications.
  • Supported by:
    National Natural Science Foundation of China (62066007)

Abstract:

The rapid development of the Internet has led to explosive growth in news data. Capturing how the topics of currently popular events evolve across massive news collections has become an active research problem in document analysis. However, commonly used traditional dynamic clustering models are inflexible and inefficient on large-scale datasets, while existing deep document clustering models lack a general method for capturing topic evolution in time series data. To address these problems, a Deep Dynamic Document Clustering (DDDC) model was designed. Building on existing deep variational inference algorithms, the model captures, at each time slice, a topic distribution that incorporates the content of preceding time slices, and derives the evolution of event topics from these distributions through clustering. Experimental results on real news datasets show that, compared with the Dynamic Topic Model (DTM), Variational Deep Embedding (VaDE) and other algorithms, the DDDC model improves clustering accuracy by at least 4 percentage points and Normalized Mutual Information (NMI) by at least 3 percentage points in every time slice on the different datasets, verifying the effectiveness of the DDDC model.

Key words: dynamic document clustering, event topic evolution, topic distribution, time series data, deep variational inference
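
The abstract reports results as clustering accuracy and NMI per time slice. The following is a minimal sketch of how these two metrics are commonly computed, not the authors' evaluation code. It assumes that clustering accuracy means the best one-to-one matching between predicted clusters and true classes, and that NMI is normalized by the arithmetic mean of the two entropies; the abstract does not state either choice. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, pred_labels):
    """Accuracy under the best one-to-one mapping of clusters to classes.

    Brute force over permutations, so only practical for a few clusters.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    # Count how often each (cluster, class) pair co-occurs.
    pair_counts = Counter(zip(pred_labels, true_labels))
    # Pad with None so surplus clusters can stay unmatched.
    padded = classes + [None] * max(0, len(clusters) - len(classes))
    best_hits = 0
    for assignment in permutations(padded, len(clusters)):
        hits = sum(pair_counts[(c, t)] for c, t in zip(clusters, assignment))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(true_labels, pred_labels):
    """NMI with arithmetic-mean normalization (an assumption of this sketch)."""
    n = len(true_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    joint_counts = Counter(zip(true_labels, pred_labels))
    mutual_info = sum(
        (nij / n) * math.log(n * nij / (true_counts[t] * pred_counts[p]))
        for (t, p), nij in joint_counts.items()
    )
    mean_entropy = (entropy(true_labels) + entropy(pred_labels)) / 2
    # Both labelings are a single group: treat them as identical.
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


# Hypothetical toy data: (true labels, predicted clusters) for two time slices.
time_slices = {
    "t1": ([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]),
    "t2": ([0, 0, 1, 1], [0, 1, 0, 1]),
}

for name, (true_labels, pred_labels) in time_slices.items():
    acc = clustering_accuracy(true_labels, pred_labels)
    nmi = normalized_mutual_information(true_labels, pred_labels)
    print(f"{name}: accuracy={acc:.3f}, NMI={nmi:.3f}")
```

For more than a handful of clusters, the permutation search becomes too slow; the usual replacement is the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`.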

CLC number: