Journal of Computer Applications, 2026, Vol. 46, Issue 5: 1441-1449. DOI: 10.11772/j.issn.1001-9081.2025050716

• Artificial intelligence

Dual-channel feature fusion representation method for short-text clustering based on large language model

Qianfei WANG1, Yang LI2, Deyu LI1,3, Suge WANG1,3

  1. School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006, China
  2. School of Finance, Shanxi University of Finance and Economics, Taiyuan, Shanxi 030006, China
  3. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan, Shanxi 030006, China
  • Received: 2025-06-30; Revised: 2025-07-23; Accepted: 2025-07-31; Online: 2025-09-15; Published: 2026-05-10
  • Contact: Yang LI
  • About the authors: WANG Qianfei, born in 1999, M.S. candidate. His research interests include data mining.
    LI Deyu, born in 1965, Ph.D., professor. His research interests include data mining.
    WANG Suge, born in 1964, Ph.D., professor. Her research interests include natural language processing, text mining, and sentiment analysis.
  • Supported by:
    National Natural Science Foundation of China (62473241)

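Short-text clustering method with dual-channel feature fusion representation based on large language model

WANG Qianfei1, LI Yang2, LI Deyu1,3, WANG Suge1,3

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006
  2. School of Finance, Shanxi University of Finance and Economics, Taiyuan 030006
  3. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan 030006
  • Corresponding author: LI Yang
  • About the authors: WANG Qianfei (born 1999), male, from Yuncheng, Shanxi, M.S. candidate. Research interest: data mining.
    LI Deyu (born 1965), male, from Quwo, Shanxi, professor, Ph.D., CCF member. Research interest: data mining.
    WANG Suge (born 1964), female, from Dingzhou, Hebei, professor, Ph.D., CCF member. Research interests: natural language processing, text mining, and sentiment analysis.
  • Funding:
    National Natural Science Foundation of China (62473241); National Natural Science Foundation of China (62376143)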

Abstract:

To address insufficient global semantic representation and weak local discriminability in current short-text clustering methods, a Dual-Channel Feature Fusion representation method for short-text clustering based on a Large Language Model (LLM), named DCFF, was proposed. From the global perspective, a semantic-enhanced pseudo-label contrastive learning module was established, in which LLM-generated keyword phrases were dynamically weighted and fused with the original texts to enrich their representations. High-confidence pseudo-labels were then produced via self-adaptive optimal transport, and intra-cluster compactness and inter-cluster separation constraints were integrated into end-to-end training to obtain globally consistent embeddings. From the local perspective, a triplet representation optimization module based on entropy and discrepancy was established: highly informative samples were selected by entropy and discrepancy, and the embedding model was fine-tuned with a confidence-weighted loss and a denoising mechanism to generate vector representations with strong local discriminability. Finally, the global and local representations were fused using a self-attention mechanism, and the fused representation was fed directly to clustering algorithms. In comparative experiments against mainstream baselines on eight public short-text clustering datasets, DCFF outperformed the baselines in accuracy on all datasets, with an improvement of at least 3.19 percentage points on the GoogleNews-T dataset; in Normalized Mutual Information (NMI), DCFF outperformed the baselines on six datasets, with an improvement of at least 3.46 percentage points on the SearchSnippets dataset. These results indicate that DCFF is well suited to clustering tasks across a variety of scenarios.
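
The pseudo-labeling step mentioned above can be illustrated with a minimal sketch. The abstract does not specify how the self-adaptive optimal transport is formulated, so the code below substitutes plain Sinkhorn-Knopp iterations with uniform cluster marginals, which is a common way to obtain balanced soft assignments. The function names, the temperature `epsilon`, the iteration count, and the toy scores are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def sinkhorn_pseudo_labels(scores, epsilon=0.05, n_iters=50):
    """Balanced soft cluster assignments via Sinkhorn-Knopp iterations.

    scores: (n_samples, n_clusters) array of similarities, e.g. cosine
    similarity between text embeddings and cluster centroids.
    Assumption: clusters are treated as equally sized (uniform marginal);
    the paper's self-adaptive variant presumably relaxes this.
    """
    n, k = scores.shape
    # Gibbs kernel; subtracting the max keeps the exponentials bounded.
    q = np.exp((scores - scores.max()) / epsilon)
    row_marginal = np.full(n, 1.0 / n)  # every sample carries equal mass
    col_marginal = np.full(k, 1.0 / k)  # every cluster receives equal mass
    for _ in range(n_iters):
        q *= (col_marginal / q.sum(axis=0))[None, :]  # balance clusters
        q *= (row_marginal / q.sum(axis=1))[:, None]  # balance samples
    # Rows sum to 1/n after the last step; rescale so each row sums to 1.
    return q * n


def assignment_entropy(p):
    """Shannon entropy (in nats) of each row of a soft-assignment matrix."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


# Toy scores: six samples, three clusters, two samples per cluster.
true_labels = np.array([0, 0, 1, 1, 2, 2])
toy_scores = np.eye(3)[true_labels]          # 1.0 for the own cluster, else 0.0
soft = sinkhorn_pseudo_labels(toy_scores)
pseudo_labels = soft.argmax(axis=1)          # hard pseudo-labels
confidence = soft.max(axis=1)                # per-sample confidence
entropy = assignment_entropy(soft)           # low entropy = confident sample
```

In a pipeline of this kind, `confidence` could serve to keep only high-confidence pseudo-labels for the global channel, while `entropy` could rank samples by informativeness for the local channel; how the paper actually combines entropy with its discrepancy measure is not stated in the abstract.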

Key words: Large Language Model (LLM), short-text clustering, global semantic enhancement, local discriminative optimization, self-attention mechanism

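Abstract:

To address insufficient global semantic representation and weak local discriminability in current short-text clustering, a short-text clustering method based on dual-channel feature fusion representation with a Large Language Model (LLM), named DCFF, is proposed. From the global perspective, the method builds a semantic-enhanced pseudo-label contrastive learning module: keywords generated by the LLM are fused with the original text through dynamic weighting to enrich the semantic representation; self-adaptive optimal transport then generates high-confidence pseudo-labels, which are combined with intra-class compactness and inter-class separability constraints in end-to-end training to build globally semantically consistent vector representations. From the local perspective, the method builds a triplet representation optimization module based on entropy and discrepancy: highly informative samples are selected by entropy and discrepancy, and the embedding model is fine-tuned with a confidence-weighted loss and a denoising mechanism to produce vector representations with strong local discriminability. Finally, a self-attention-based semantic representation that fuses global and local information is built and used directly by clustering algorithms. In comparative experiments against mainstream baselines on 8 public short-text clustering datasets, DCFF achieves higher accuracy than the baselines on all datasets, with an improvement of at least 3.19 percentage points on the GoogleNews-T dataset; on 6 datasets its Normalized Mutual Information (NMI) exceeds that of the baselines, with an improvement of at least 3.46 percentage points on the SearchSnippets dataset. The experimental results show that DCFF is suitable for clustering tasks across multiple scenarios.

Key words: large language model, short-text clustering, global semantic enhancement, local discriminative optimization, self-attention mechanism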

CLC Number: