Journal of Computer Applications


Deep semi-supervised text clustering with intentional regularization

  

  • Received: 2024-07-05 Revised: 2024-09-06 Online: 2024-11-19 Published: 2024-11-19
  • Supported by:
    Research on deep clustering approach for multi-view text documents


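Xu Le (徐乐)1, Huang Ruizhang (黄瑞章)2, Bai Ruina (白瑞娜)2, Qin Yongbin (秦永彬)3

  1. College of Computer Science and Technology, Guizhou University
  2. Guizhou University
  3. College of Computer Science and Information, Guizhou University
  • Corresponding author: Xu Le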

Abstract: Existing semi-supervised text clustering methods fail to consider user intent in representation learning and clustering simultaneously. To address this problem, a Deep Semi-supervised Text Clustering model with Intentional Regularization (IRDSTC) was proposed. An intention regularization strategy was introduced, and an Intention Regularized Representation Learning (IRRL) module and an Intention Regularized Clustering (IRC) module were designed. Firstly, an intent matrix was constructed from the intent constraint information provided by the user, to capture the user's expectations about the relationships between texts. Secondly, this matrix was applied to both the representation learning stage and the clustering stage. In the representation learning stage, the intermediate-layer representation extracted by the deep model was converted into a representation correlation matrix, which was combined with the intent matrix to construct a regularization term, so that user intent drives representation learning. In the clustering stage, an assignment consistency matrix was constructed from the cluster assignment probabilities obtained in the clustering iterations, and was combined with the intent matrix to construct a regularization term, so that user intent guides the clustering process. Finally, experimental results on the Reu-10k, BBC, ACM and Abstract datasets show that the proposed model outperforms the other clustering methods compared in clustering accuracy, Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI). Compared with the second-best method, SDEC (Semi-supervised Deep Embedded Clustering), the NMI of IRDSTC increased by 36.39%, 67.56%, 28.95% and 20.76% on the four datasets, respectively, indicating that IRDSTC achieves a better clustering effect.

Key words: intention, regularization, semi-supervised, text clustering
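
The abstract describes two regularization terms that share one intent matrix. The sketch below illustrates the idea in plain Python and is not the authors' implementation. Everything specific in it is an assumption made for illustration: user intent is encoded as must-link (+1) and cannot-link (-1) pairs, the representation correlation matrix is taken to be cosine similarity, the assignment consistency matrix is taken to be the product of soft assignments with their transpose, and the penalty is a mean squared gap on constrained pairs. The function names are hypothetical.

```python
import math


def intent_matrix(n, must_link, cannot_link):
    """Assumed encoding: +1 for must-link pairs, -1 for cannot-link pairs, 0 if unknown."""
    M = [[0] * n for _ in range(n)]
    for i, j in must_link:
        M[i][j] = M[j][i] = 1
    for i, j in cannot_link:
        M[i][j] = M[j][i] = -1
    return M


def representation_correlation(Z):
    """Pairwise cosine similarity of the row embeddings in Z (an assumed choice)."""
    def cos(u, v):
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        if nu == 0 or nv == 0:
            return 0.0
        return sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return [[cos(u, v) for v in Z] for u in Z]


def assignment_consistency(Q):
    """Entry (i, j) is the inner product of the soft assignments of texts i and j."""
    return [[sum(a * b for a, b in zip(p, q)) for q in Q] for p in Q]


def intent_regularizer(S, M):
    """Mean squared gap between S and the intent target, over constrained pairs only."""
    gaps = []
    for i, row in enumerate(M):
        for j, m in enumerate(row):
            if m != 0:
                target = 1.0 if m > 0 else 0.0  # must-link -> 1, cannot-link -> 0
                gaps.append((S[i][j] - target) ** 2)
    return sum(gaps) / len(gaps) if gaps else 0.0


# Texts 0 and 1 should be together; texts 0 and 2 should be apart.
M = intent_matrix(3, must_link=[(0, 1)], cannot_link=[(0, 2)])
Z = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]   # toy intermediate-layer embeddings
Q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy cluster assignment probabilities
loss_repr = intent_regularizer(representation_correlation(Z), M)
loss_clus = intent_regularizer(assignment_consistency(Q), M)
```

In this toy case both penalties are zero because the embeddings and the assignments already agree with the constraints. In a trainable model, terms of this kind would be added with weighting coefficients to the representation learning and clustering losses; the weights and exact loss forms are not given in the abstract.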

