Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (8): 2457-2463. DOI: 10.11772/j.issn.1001-9081.2024081100

• The 21st CCF Conference on Web Information Systems and Applications (WISA 2024) •

Deep variational text clustering model based on distribution augmentation

Ao SHEN1,2,3, Ruizhang HUANG1,2,3, Jingjing XUE1,2,3, Yanping CHEN1,2,3, Yongbin QIN1,2,3

  1. Engineering Research Center of Ministry of Education for Text Computing and Cognitive Intelligence, Guizhou University, Guiyang Guizhou 550025, China
  2. State Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China
  3. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China
  • Received:2024-08-06 Revised:2024-08-25 Accepted:2024-09-02 Online:2025-08-15 Published:2025-08-10
  • Contact: Ruizhang HUANG
  • About authors: SHEN Ao, born in 2000, native of Jining, Shandong, M. S. candidate, CCF member. His research interests include natural language processing and text mining.
    XUE Jingjing, born in 1995, native of Rizhao, Shandong, Ph. D., CCF member. Her research interests include natural language processing and text mining.
    CHEN Yanping, born in 1980, native of Changshun, Guizhou, Ph. D., professor, CCF member. His research interests include artificial intelligence and natural language processing.
    QIN Yongbin, born in 1980, native of Yantai, Shandong, Ph. D., professor, CCF senior member. His research interests include big data management and application, and multi-source data fusion.
  • Supported by:
    National Natural Science Foundation of China (62066007); Guizhou Province Science and Technology Support Program (Qiankehe Support [2023] General 300)


Abstract:

To address the missing distribution information and distribution collapse that deep variational text clustering models encounter in practical applications, a Deep Variational text Clustering Model based on Distribution augmentation (DVCMD) is proposed. The model augments distribution information by integrating augmented latent semantic distributions into the original latent semantic distribution, which improves the completeness and accuracy of the information carried by the latent distribution. In addition, a distribution consistency constraint strategy encourages the model to learn consistent semantic representations, which strengthens the ability of the learned semantic distributions to express the true information in the data and, in turn, improves clustering performance. Experimental results show that, compared with existing deep clustering models and structural semantic-enhanced clustering models, DVCMD improves Normalized Mutual Information (NMI) by at least 0.16, 9.01, 2.30, and 2.72 percentage points on four real-world datasets (Abstract, BBC, Reuters-10k, and BBCSports, respectively), validating the effectiveness of the model.
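
For readers unfamiliar with the reported metric, the following is a minimal sketch of how NMI can be computed from ground-truth labels and predicted cluster assignments. It assumes normalization by the arithmetic mean of the two label entropies, which is one common convention; the abstract does not state which normalization the authors use, and the function names here are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information between two labelings of the same items."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) )
        mi += (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
    return mi


def nmi(true_labels, pred_labels):
    """NMI in [0, 1], normalized by the mean of the two entropies (assumed convention)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    denom = (entropy(true_labels) + entropy(pred_labels)) / 2.0
    if denom == 0.0:
        # Both labelings put every item in a single group.
        return 1.0
    return mutual_information(true_labels, pred_labels) / denom
```

NMI is invariant to how clusters are numbered, so `nmi([0, 0, 1, 1], [1, 1, 0, 0])` scores a perfect match. If the reported scores are on a 0 to 100 scale, a gain of 2.30 percentage points corresponds to an absolute increase of 0.0230 in the value this function returns.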

Key words: deep text clustering, distribution augmentation, Variational Auto-Encoder (VAE), semantic representation, distribution consistency constraint
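
The abstract does not specify the form of the distribution consistency constraint. As a purely illustrative sketch, one common way to compare two latent distributions in VAE-style models is the closed-form KL divergence between diagonal Gaussians. The code below assumes that form and a symmetrized penalty; the function names and the choice of penalty are assumptions, not the authors' formulation.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) for two diagonal Gaussians given means and variances."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def symmetric_consistency_penalty(mu_a, var_a, mu_b, var_b):
    """Hypothetical consistency penalty: average of the two KL directions.

    Returns 0 when the two distributions coincide and grows as they diverge.
    """
    forward = kl_diag_gaussians(mu_a, var_a, mu_b, var_b)
    backward = kl_diag_gaussians(mu_b, var_b, mu_a, var_a)
    return 0.5 * (forward + backward)
```

In a sketch like this, the penalty would be added to the training loss so that an original latent distribution and an augmented one for the same document are pushed toward agreement.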

