Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (1): 85-94. DOI: 10.11772/j.issn.1001-9081.2025010126

• Data science and technology •

Deep evolutionary topic clustering model

Ziyang CHENG 1,2, Ruizhang HUANG 1,2, Jingjing XUE 1,2

  1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China
  2. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China
  • Received: 2025-02-10; Revised: 2025-04-29; Accepted: 2025-04-29; Online: 2026-01-10; Published: 2026-01-10
  • Contact: Ruizhang HUANG
  • About author: CHENG Ziyang, born in 2000, M. S. candidate. His research interests include topic information mining and evolutionary clustering.
    XUE Jingjing, born in 1995, Ph. D. candidate. Her research interests include deep text clustering.
  • Supported by:
    National Natural Science Foundation of China (62066007); Guizhou Provincial Science and Technology Support Program (Qiankehe Support [2023] General 300; Qiankehe Support [2023] General 448)

深度演化主题聚类模型

程梓洋 1,2, 黄瑞章 1,2, 薛菁菁 1,2

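  1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang 550025
  2. College of Computer Science and Technology, Guizhou University, Guiyang 550025
  • Corresponding author: 黄瑞章 (Ruizhang HUANG)
  • About the authors: 程梓洋 (CHENG Ziyang), born in 2000, male, from Guiyang, Guizhou, M. S. candidate. Main research interests: topic information mining, evolutionary clustering.
    薛菁菁 (XUE Jingjing), born in 1995, female, from Rizhao, Shandong, Ph. D. candidate, CCF member. Main research interests: deep text clustering.
  • Supported by:
    National Natural Science Foundation of China (62066007); Guizhou Provincial Science and Technology Support Program (Qiankehe Support [2023] General 300; Qiankehe Support [2023] General 448)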

Abstract:

To address topic ambiguity and inaccurate topic alignment in existing deep document clustering methods when processing dynamic text data whose topics vary over time, a Deep Evolutionary Topic Clustering Model (DETCM) was proposed. In DETCM, the time-varying topic information of dynamic text was captured, and historical topic information was integrated with the document features of the current time slice, so that event topic trajectories were discovered and dynamic document topic representations were generated. Specifically, to solve the topic ambiguity problem that arises as topics evolve, a topic fusion learning module based on a hybrid encoder was designed, in which topic information from preceding time slices was used to enhance topic discrimination and feature extraction in the current time slice. Furthermore, a cross-time-slice topic inheritance module was designed to match and align topics across time slices, so that topic information from historical time slices was effectively incorporated into the cluster assignment process of the current time slice. Experimental results on the real-world arXiv evolving text document dataset demonstrate that, compared with the Deep Evolutionary Document Clustering model with Instance-level Mutual Attention Enhancement (DEDC-IMAE), DETCM achieves an average improvement of 3.08% (ranging from -0.37% to 5.43%) in Normalized Mutual Information (NMI) across all time slices. These results verify that DETCM tracks topic evolution better in dynamic scenarios, captures the temporal variation of topics more accurately, and achieves better clustering performance.

Key words: topic evolution, deep dynamic clustering, representation learning, topic mining, contrastive learning
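
The abstract reports clustering quality as Normalized Mutual Information (NMI) averaged over time slices. As a rough illustration of how such a figure can be computed, the sketch below implements NMI in plain Python and averages it over slices. It is not the authors' evaluation code: the abstract does not state which normalization is used, so the arithmetic mean of the two entropies is an assumption, and the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(truth, pred):
    """Mutual information between two labelings of the same documents."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    count_t = Counter(truth)
    count_p = Counter(pred)
    return sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )


def nmi(truth, pred):
    """NMI normalized by the arithmetic mean of the entropies (an assumption)."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("labelings must be non-empty and of equal length")
    h_t, h_p = entropy(truth), entropy(pred)
    if h_t == 0 and h_p == 0:
        return 1.0  # both labelings put every document in a single cluster
    return mutual_information(truth, pred) / ((h_t + h_p) / 2)


def mean_nmi(slices):
    """Average NMI over time slices; each slice is a (truth, pred) pair."""
    return sum(nmi(t, p) for t, p in slices) / len(slices)


# Hypothetical toy data: two time slices of four documents each.
toy_slices = [
    ([0, 0, 1, 1], [1, 1, 0, 0]),  # same partition, cluster ids permuted
    ([0, 0, 1, 1], [0, 1, 0, 1]),  # prediction independent of the truth
]
toy_average = mean_nmi(toy_slices)
```

NMI ignores how cluster ids are named, so a prediction that recovers the true partition under permuted ids still scores as a perfect match. That is why it suits comparing cluster assignments produced independently on each time slice.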

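Abstract:

To address the topic confusion and inaccurate alignment that existing deep document clustering methods exhibit when processing dynamic document data whose topics evolve over time, a Deep Evolutionary Topic Clustering Model (DETCM) is proposed. DETCM captures the topic information of dynamic documents as it evolves over time, combines historical topic information with the document features of the current time slice, uncovers the evolutionary thread of event topics, and generates dynamic document topic representations. Specifically, to solve the topic confusion that arises as topics evolve, a topic fusion learning module based on a hybrid encoder is designed, which draws on topic information from preceding time slices to strengthen topic discrimination and feature extraction in the current time slice. In addition, a dynamic topic inheritance module across time slices is designed, which matches and aligns topics on different time slices and thereby effectively incorporates topic information from historical time slices into the cluster assignment process of the current time slice. This design allows DETCM to inherit and draw on topic information from historical time slices when learning topics, and to effectively track the continuous evolution of topics in dynamic text. Experimental results on the real-world arXiv evolving text document dataset show that, compared with the deep evolutionary clustering model DEDC-IMAE (Deep Evolutionary Document Clustering model with Instance-level Mutual Attention Enhancement), DETCM improves Normalized Mutual Information (NMI) by 3.08% on average over all time slices (-0.37% to 5.43%), verifying that DETCM tracks topic evolution better in dynamic scenarios, captures the temporal variation of topics more accurately, and achieves better clustering performance.

Key words: topic evolution, deep dynamic clustering, representation learning, topic mining, contrastive learning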

CLC Number: