Journal of Computer Applications, 2025, Vol. 45, Issue (7): 2145-2152. DOI: 10.11772/j.issn.1001-9081.2024070931

• The 39th CCF National Conference of Computer Applications (CCF NCCA 2024)

Deep semi-supervised text clustering with intentional regularization

Le XU1,2,3, Ruizhang HUANG1,2,3, Ruina BAI1,2,3, Yongbin QIN1,2,3

  1. Engineering Research Center of Ministry of Education for Text Computing and Cognitive Intelligence (Guizhou University), Guiyang Guizhou 550025, China
    2. State Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China
    3. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China
  • Received: 2024-07-01 Revised: 2024-09-25 Accepted: 2024-10-09 Online: 2025-07-10 Published: 2025-07-10
  • Contact: Ruizhang HUANG
  • About author:XU Le, born in 1999, M. S. candidate. Her research interests include natural language processing, text mining, machine learning.
    HUANG Ruizhang, born in 1979, Ph. D., professor. Her research interests include data fusion analysis, text mining, network mining, knowledge discovery, machine learning.
    BAI Ruina, born in 1995, Ph. D. candidate. Her research interests include natural language processing, multi-view learning.
    QIN Yongbin, born in 1980, Ph. D., professor. His research interests include big data governance and application, multi-source data fusion, intelligent computing, machine learning, algorithm design.
  • Supported by:
    National Natural Science Foundation of China (62066007)

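Deep semi-supervised text clustering based on intent regularization

XU Le (徐乐)1,2,3, HUANG Ruizhang (黄瑞章)1,2,3, BAI Ruina (白瑞娜)1,2,3, QIN Yongbin (秦永彬)1,2,3

  1. Engineering Research Center of Ministry of Education for Text Computing and Cognitive Intelligence (Guizhou University), Guiyang 550025
    2. State Key Laboratory of Public Big Data (Guizhou University), Guiyang 550025
    3. College of Computer Science and Technology, Guizhou University, Guiyang 550025
  • Corresponding author: HUANG Ruizhang (黄瑞章)
  • About the authors: XU Le (born 1999), female, native of Luzhou, Sichuan, M. S. candidate, CCF member. Research interests: natural language processing, text mining, machine learning.
    HUANG Ruizhang (born 1979), female, native of Tianjin, professor, Ph. D., CCF member. Research interests: data fusion analysis, text mining, network mining, knowledge discovery, machine learning. Email: rzhuang@gzu.edu.cn
    BAI Ruina (born 1995), female, native of Baotou, Inner Mongolia, Ph. D. candidate. Research interests: natural language processing, multi-view learning.
    QIN Yongbin (born 1980), male, native of Yantai, Shandong, professor, Ph. D., CCF senior member. Research interests: big data governance and application, multi-source data fusion, intelligent computing, machine learning, algorithm design.
  • Supported by:
    National Natural Science Foundation of China (62066007)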

Abstract:

To address the problem that existing semi-supervised text clustering methods fail to consider user intent in representation learning and clustering simultaneously, a Deep Semi-supervised Text Clustering with Intentional Regularization (IRDSTC) model was proposed. By introducing an intent regularization strategy, two modules were designed: the Intention Regularized Representation Learning (IRRL) module and the Intention Regularized Clustering (IRC) module. Firstly, an intent matrix was constructed from the intent constraint information provided by the user to capture the user's expectations about the relationships between texts. Secondly, the matrix was applied to both the representation learning stage and the clustering stage. In the representation learning stage, the intermediate-layer representation extracted by the deep model was converted into a representation correlation matrix, which was combined with the intent matrix to construct a regularization term, so that user intent drives the representation learning. In the clustering stage, an allocation consistency matrix was constructed from the cluster allocation probabilities obtained during clustering iterations and was combined with the intent matrix to construct a regularization term, so that user intent guides the clustering process. Experimental results on the Reu-10k, BBC, ACM, and Abstract datasets show that the IRDSTC model outperforms the other clustering methods in clustering Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). Specifically, compared with the second-best model, Improved Deep Embedding Clustering (IDEC), the IRDSTC model increases NMI by 28.26%, 32.58%, 27.13%, and 34.94% on the four datasets, respectively, indicating a better clustering effect.
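
The abstract does not give the exact formulas, so the following is only a minimal NumPy sketch of how the three matrices could fit together. Everything specific in it is an assumption made for illustration: the +1/-1/0 encoding of must-link and cannot-link pairs in the intent matrix, cosine similarity as the representation correlation, the product of soft assignments as the allocation consistency, and a squared-error penalty restricted to constrained pairs. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def build_intent_matrix(n, must_link, cannot_link):
    """Assumed encoding: +1 for must-link pairs, -1 for cannot-link pairs, 0 if unconstrained."""
    M = np.zeros((n, n))
    for i, j in must_link:
        M[i, j] = M[j, i] = 1.0
    for i, j in cannot_link:
        M[i, j] = M[j, i] = -1.0
    return M

def representation_correlation(Z):
    """Cosine similarity between the rows of an (n x d) intermediate representation Z."""
    Zn = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)
    return Zn @ Zn.T

def allocation_consistency(Q):
    """Q is (n x k) with rows summing to 1; entry (i, j) is the probability
    that texts i and j are assigned to the same cluster."""
    return Q @ Q.T

def intent_regularizer(S, M):
    """Mean squared gap between S and the intent targets, on constrained pairs only.
    Assumed targets: 1 for must-link pairs, 0 for cannot-link pairs."""
    mask = M != 0
    if not mask.any():
        return 0.0
    target = (M > 0).astype(float)
    return float(np.mean((S[mask] - target[mask]) ** 2))

# Toy usage: three texts, one must-link pair and one cannot-link pair.
M = build_intent_matrix(3, must_link=[(0, 1)], cannot_link=[(0, 2)])
Z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # representations that agree with the intent
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # assignments that violate both constraints
loss_repr = intent_regularizer(representation_correlation(Z), M)
loss_clu = intent_regularizer(allocation_consistency(Q), M)
```

In a training loop, two such terms would be added, with weights, to the base reconstruction and clustering losses of the deep model; the weighting scheme is likewise not specified in the abstract.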

Key words: intent, regularization, semi-supervision, text clustering

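Abstract:

To address the problem that existing semi-supervised text clustering methods cannot consider user intent in both representation learning and clustering at the same time, a Deep Semi-supervised Text Clustering with Intentional Regularization (IRDSTC) model is proposed. By introducing an intent regularization strategy, the Intention Regularized Representation Learning (IRRL) module and the Intention Regularized Clustering (IRC) module are designed. First, an intent matrix is constructed from the intent constraint information provided by the user, to capture the user's expectations about the relationships between texts. Second, this matrix is applied to the representation learning stage and the clustering stage: in the representation learning stage, the intermediate-layer representation extracted by the deep model is converted into a representation correlation matrix, which is combined with the intent matrix to construct a regularization term so that user intent drives representation learning; in the clustering stage, an allocation consistency matrix is constructed from the cluster allocation probabilities obtained during clustering iterations and is combined with the intent matrix to construct a regularization term so that user intent guides the clustering process. Experimental results show that, on the Reu-10k, BBC, ACM, and Abstract datasets, the IRDSTC model performs better than the other clustering methods in clustering Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). Specifically, compared with the second-best model, Improved Deep Embedding Clustering (IDEC), the NMI of the IRDSTC model is higher by 28.26%, 32.58%, 27.13%, and 34.94%, respectively, indicating that the IRDSTC model has a better clustering effect.

Key words: intent, regularization, semi-supervision, text clustering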
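
For readers unfamiliar with the evaluation metrics, the sketch below shows one common way to compute clustering ACC: predicted cluster IDs are matched to ground-truth labels with the Hungarian algorithm before counting correct assignments. This is a generic illustration and not the authors' evaluation code; the function name `clustering_accuracy` is hypothetical. NMI and ARI are available in scikit-learn as `normalized_mutual_info_score` and `adjusted_rand_score`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Fraction of samples correctly clustered under the best one-to-one
    mapping between predicted cluster IDs and true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_ids, t = np.unique(y_true, return_inverse=True)
    pred_ids, p = np.unique(y_pred, return_inverse=True)
    # counts[a, b] = number of samples in predicted cluster a with true label b
    counts = np.zeros((len(pred_ids), len(true_ids)), dtype=np.int64)
    np.add.at(counts, (p, t), 1)
    # Hungarian matching that maximizes the total number of agreeing samples
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / len(y_true)

# Cluster IDs are arbitrary, so a pure relabeling still scores 1.0.
acc_perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# One sample of true class 1 is absorbed into the cluster dominated by class 0.
acc_partial = clustering_accuracy([0, 0, 1, 1], [0, 0, 0, 1])
```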
