Journal of Computer Applications (《计算机应用》), sole official website


深度演化文档主题聚类模型

CHENG Ziyang (程梓洋)1,2; HUANG Ruizhang (黄瑞章)1; XUE Jingjing (薛菁菁)3

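  1. Guizhou University (贵州大学)
  2. College of Computer Science and Technology (计算机科学与技术学院)
  3. College of Computer Science and Technology, Guizhou University (贵州大学计算机科学与技术学院)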
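  • Received: 2025-02-10 Revised: 2025-04-29 Online: 2025-05-26 Published: 2025-05-26
  • Corresponding author: CHENG Ziyang (程梓洋)
  • Supported by:
    National Natural Science Foundation of China; Research and Application of Key Technologies for Internet Content Risk Control Based on Text Computing and Industry Knowledge Graphs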

Deep evolutionary document topic clustering model

  • Received: 2025-02-10 Revised: 2025-04-29 Online: 2025-05-26 Published: 2025-05-26
  • Supported by:
    National Natural Science Foundation of China; Guizhou Provincial Science and Technology Support Program

摘要: 针对现有深度文档聚类方法应对动态文档数据时文档主题随时间演化过程中存在的主题混淆问题以及匹配对齐问题,提出了一种深度演化主题聚类模型(DETCM)。DETCM可以捕捉动态文档随时间演化的主题信息,结合历史主题信息与当前时间片文档特征,发掘事件主题演化脉络,生成动态文档主题表示。具体来说,为了解决主题随时间演变时的主题混淆问题,DETCM首先设计基于混合编码器的主题融合学习模块,借助前置时间片的主题信息,进一步明晰当前时间片的主题划分与主题提取。此外,DETCM设计了一种跨时间片的主题对比继承模块,通过将不同时间片上主题匹配对齐,巧妙地将历史时间片上的主题信息融入当前时间片的类簇划分过程中。这一设计使得DETCM学习主题时能够继承并借鉴历史时间片的主题信息,从而有效跟踪动态文本主题持续演化的过程。基于arXiv真实演化文本文档数据集的实验结果表明,相较于深度演化聚类模型DEDC-IMAE,DETCM模型在所有时间片上的标准化互信息(NMI)指标平均提升了约3.08%,验证了DETCM模型在动态场景中具有更好的主题演化追踪能力,能够更准确地捕捉主题的时序变化特征,从而实现了更优的聚类性能。

关键词: 主题演化, 深度动态聚类, 表示学习, 主题挖掘, 对比学习

Abstract: To address the topic confusion and the topic matching and alignment problems that arise in existing deep document clustering methods when document topics evolve over time in dynamic document data, a Deep Evolutionary Topic Clustering Model (DETCM) was proposed. DETCM captures the topic information of dynamic documents as it evolves, combining historical topic information with the document features of the current time slice to trace how event topics develop and to generate dynamic document-topic representations. Specifically, to reduce topic confusion as topics evolve, a topic fusion learning module based on a hybrid encoder was designed, in which topic information from preceding time slices is used to sharpen topic partitioning and topic extraction in the current time slice. In addition, a cross-time-slice topic contrastive inheritance module was designed, which matches and aligns topics across time slices so that topic information from historical time slices is incorporated into cluster assignment in the current time slice. This design lets DETCM inherit and draw on historical topic information while learning topics, and thereby track the continuous evolution of topics in dynamic text. Experimental results on a real-world evolving document dataset from arXiv show that, compared with the Deep Evolutionary Document Clustering model with Instance-level Mutual Attention Enhancement (DEDC-IMAE), DETCM improves Normalized Mutual Information (NMI) by about 3.08% on average across all time slices. This indicates that DETCM tracks topic evolution in dynamic scenarios more effectively, captures temporal changes in topics more accurately, and achieves better clustering performance.

Key words: topic evolution, deep dynamic clustering, representation learning, topic mining, contrastive learning
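
The abstract reports NMI averaged over all time slices. As a rough illustration of how such a figure might be computed, the following minimal Python sketch evaluates NMI per time slice and averages the results. It assumes arithmetic-mean normalization, 2·I(U;V) / (H(U) + H(V)); the abstract does not state which normalization the authors use. The function names `entropy`, `nmi`, and `mean_nmi` and the toy labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, pred_labels):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have equal length")
    n = len(true_labels)
    count_u = Counter(true_labels)
    count_v = Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))
    # Mutual information from the contingency counts.
    mi = sum(
        (c / n) * math.log(c * n / (count_u[u] * count_v[v]))
        for (u, v), c in joint.items()
    )
    denom = entropy(true_labels) + entropy(pred_labels)
    # Both partitions are a single cluster: treat as perfect agreement.
    return 1.0 if denom == 0 else 2.0 * mi / denom


def mean_nmi(slices):
    """Average NMI over (true_labels, pred_labels) pairs, one per time slice."""
    return sum(nmi(t, p) for t, p in slices) / len(slices)


# Toy data (hypothetical, not from the paper): two time slices.
slices = [
    ([0, 0, 1, 1], [1, 1, 0, 0]),  # same partition, cluster ids permuted
    ([0, 0, 1, 1], [0, 1, 0, 1]),  # partition independent of the ground truth
]
print(mean_nmi(slices))
```

Because NMI is invariant to how cluster ids are numbered, the first toy slice scores as a perfect match even though its labels are swapped; this is why NMI can be computed per slice without first aligning cluster ids across time slices.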

CLC number (中图分类号):