Journal of Computer Applications, 2023, Vol. 43, Issue (8): 2364-2369. DOI: 10.11772/j.issn.1001-9081.2022091356

• The 19th CCF China Conference on Information Systems and Applications

Structured deep text clustering model based on multi-layer semantic fusion (Chinese title: 基于多层语义融合的结构化深度文本聚类模型)

MA Shengwei (马胜位)1,2, HUANG Ruizhang (黄瑞章)1,2, REN Lina (任丽娜)1,2, LIN Chuan (林川)1,2

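  1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang 550025
  2. College of Computer Science and Technology, Guizhou University, Guiyang 550025
  • Received: 2022-09-12 Revised: 2022-10-13 Accepted: 2022-10-17 Online: 2022-12-26 Published: 2023-08-10
  • Corresponding author: HUANG Ruizhang (黄瑞章)
  • About the authors: MA Shengwei (马胜位), born in 1999, female, native of Ziyun, Guizhou, M. S. candidate, CCF member. Her research interests include natural language processing and deep clustering.
    REN Lina (任丽娜), born in 1987, female, native of Fuxin, Liaoning, lecturer, Ph. D. candidate, CCF member. Her research interests include natural language processing, text mining, and machine learning.
    LIN Chuan (林川), born in 1975, male, native of Zigong, Sichuan, associate professor, M. S. His research interests include text mining, machine learning, and big data management and applications.
  • Funding:
    National Natural Science Foundation of China (62066007)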

Structured deep text clustering model based on multi-layer semantic fusion

Shengwei MA1,2, Ruizhang HUANG1,2, Lina REN1,2, Chuan LIN1,2

  1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang, Guizhou 550025, China
  2. College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China
  • Received: 2022-09-12 Revised: 2022-10-13 Accepted: 2022-10-17 Online: 2022-12-26 Published: 2023-08-10
  • Contact: Ruizhang HUANG
  • About the authors: MA Shengwei, born in 1999, M. S. candidate. Her research interests include natural language processing and deep clustering.
    REN Lina, born in 1987, Ph. D. candidate, lecturer. Her research interests include natural language processing, text mining, and machine learning.
    LIN Chuan, born in 1975, M. S., associate professor. His research interests include text mining, machine learning, and big data management and applications.
  • Supported by:
    National Natural Science Foundation of China (62066007)

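Abstract (translated from Chinese):

In recent years, owing to the advantages that the structural information captured by Graph Neural Networks (GNNs) brings to machine learning, researchers have begun to incorporate GNNs into deep text clustering. Existing GNN-based deep text clustering algorithms overlook the important role of the decoder in semantic complementation when fusing text semantic information, which leads to a loss of semantic information in the data generation part. To address this problem, a Structured Deep text Clustering Model based on multi-layer Semantic fusion (SDCMS) is proposed. The model uses a GNN to integrate structural information into the decoder, enhances the representation of text data through layer-by-layer semantic complementation, and obtains better network parameters through a triple self-supervision mechanism. Experimental results on five real-world datasets (Citeseer, ACM, Reuters, DBLP, and Abstract) show that, compared with the current best-performing Attention-driven Graph Clustering Network (AGCN) model, SDCMS improves accuracy, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) by up to 5.853%, 9.922%, and 8.142%, respectively.

Keywords: deep text clustering, layer-by-layer semantic enhancement, text semantic information, graph neural network, self-supervised learning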

Abstract:

In recent years, because the structural information captured by Graph Neural Networks (GNNs) has proven advantageous in machine learning, GNNs have begun to be incorporated into deep text clustering. Current GNN-based deep text clustering algorithms ignore the important role of the decoder in semantic complementation when fusing text semantic information, resulting in a loss of semantic information in the data generation part. To address this problem, a Structured Deep text Clustering Model based on multi-layer Semantic fusion (SDCMS) was proposed. In this model, a GNN was used to integrate structural information into the decoder, the representation of text data was enhanced through layer-by-layer semantic complementation, and better network parameters were obtained through a triple self-supervision mechanism. Experimental results on five real-world datasets (Citeseer, ACM, Reuters, DBLP, and Abstract) show that, compared with the state-of-the-art Attention-driven Graph Clustering Network (AGCN) model, SDCMS improves accuracy, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) by up to 5.853%, 9.922%, and 8.142%, respectively.
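For readers unfamiliar with two of the reported clustering metrics, the following is a minimal pure-Python sketch of NMI and ARI (the latter is commonly expanded as Adjusted Rand Index). It is not the authors' evaluation code: the function names are hypothetical, and the NMI normalization by the arithmetic mean of the two entropies is an assumption, since the abstract does not say which normalization was used. Clustering accuracy is omitted because it additionally requires an optimal matching between predicted cluster ids and true labels.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting agreement between two partitions, corrected for chance."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # convention for trivial inputs (assumption)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of sample pairs placed together by both / by each partition.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_info(labels_true, labels_pred):
    """Mutual information divided by the arithmetic mean of the entropies.

    The choice of arithmetic-mean normalization is an assumption.
    """
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    mutual_info = sum(
        (c / n) * log(c * n / (rows[t] * cols[p]))
        for (t, p), c in cells.items()
    )
    entropy_true = -sum((c / n) * log(c / n) for c in rows.values())
    entropy_pred = -sum((c / n) * log(c / n) for c in cols.values())

    denom = (entropy_true + entropy_pred) / 2
    if denom == 0:
        return 1.0  # both partitions consist of a single cluster
    return mutual_info / denom


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    predicted = [0, 0, 1, 2]  # toy labels, not data from the paper
    print("ARI:", adjusted_rand_index(truth, predicted))
    print("NMI:", normalized_mutual_info(truth, predicted))
```

Both functions are invariant to how cluster ids are named, so a prediction that merely permutes the label ids of the ground truth still scores 1.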

Key words: deep text clustering, layer-by-layer semantic enhancement, text semantic information, graph neural network, self-supervised learning

CLC number: