Journal of Computer Applications


Dual-channel fusion representation method for short text clustering based on large language models

  

  • Received: 2025-06-30  Revised: 2025-07-23  Accepted: 2025-07-31  Online: 2025-09-15  Published: 2025-09-15


WANG Qianfei1, LI Yang2, LI Deyu1, WANG Suge1

  1. School of Computer and Information Technology, Shanxi University
    2. School of Finance, Shanxi University of Finance and Economics
  • Corresponding author: LI Yang
  • Supported by:
    National Natural Science Foundation of China; National Natural Science Foundation of China

Abstract: To address the problems of insufficient global semantic representation and weak local discriminability in current short text clustering, a dual-channel fusion representation method for short text clustering based on large language models is proposed. From the global perspective, a semantic-enhanced pseudo-label contrastive learning module was established: keyword phrases generated by a Large Language Model (LLM) were dynamically weighted and fused with the original texts to enrich semantic representations, high-confidence pseudo-labels were produced via adaptive optimal transport, and intra-cluster compactness and inter-cluster separability constraints were integrated into end-to-end training to obtain globally consistent embeddings. From the local perspective, an entropy-difference-based triplet refinement module was designed: high-information samples were selected by an entropy-difference metric, and the embedding model was then fine-tuned with a confidence-weighted loss and a denoising mechanism to generate representations with strong local discriminability. Finally, the global and local representations were fused through a self-attention mechanism and used directly for clustering. The proposed method was compared with mainstream baselines on eight public short text clustering datasets. The experimental results show that the proposed method surpasses the baselines in accuracy (ACC) on all datasets, with an average improvement of 1.85% and a maximum improvement of 3.19% on GoogleNews-T; in Normalized Mutual Information (NMI), it outperforms the baselines on six datasets, with an average improvement of 1.53% and a maximum improvement of 3.46% on SearchSnippets. These results demonstrate that the proposed method adapts well to clustering tasks across diverse scenarios.
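The optimal-transport step that produces balanced, high-confidence pseudo-labels can be illustrated with a plain Sinkhorn-Knopp normalization. This is a minimal sketch under stated assumptions, not the paper's adaptive variant: the `scores` matrix, the temperature `eps`, the uniform cluster marginal, and the confidence threshold are all illustrative choices.

```python
import numpy as np

def sinkhorn_soft_assignments(scores, eps=0.05, n_iter=50):
    """Turn embedding-to-centre similarity scores (n, k) into a balanced
    soft assignment matrix: rows sum to 1, columns carry roughly equal mass.
    Illustrative stand-in for the paper's adaptive optimal transport."""
    q = np.exp(scores / eps)                          # Gibbs kernel
    n, k = q.shape
    for _ in range(n_iter):
        q /= q.sum(axis=1, keepdims=True)             # each row sums to 1
        q /= q.sum(axis=0, keepdims=True) / (n / k)   # each column sums to n/k
    return q / q.sum(axis=1, keepdims=True)           # final row normalization

rng = np.random.default_rng(1)
scores = rng.normal(size=(6, 3))       # toy data: 6 texts, 3 clusters
q = sinkhorn_soft_assignments(scores)
pseudo = q.argmax(axis=1)              # hard pseudo-labels
keep = q.max(axis=1) > 0.6             # high-confidence filter (threshold assumed)
```

The balancing constraint prevents the degenerate solution in which every sample collapses into one cluster, which is why optimal-transport assignment is a common choice for pseudo-label generation.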
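The final dual-channel fusion can be sketched as a tiny self-attention over the two channel "tokens" (global and local) of each text. This is illustrative only: the untrained identity projections stand in for learned query/key/value weights, and the embedding shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_dual_channel(global_emb, local_emb):
    """Fuse per-text global and local embeddings with self-attention
    over a length-2 channel sequence (minimal sketch)."""
    x = np.stack([global_emb, local_emb], axis=1)       # (n, 2, d)
    d = x.shape[-1]
    # Identity projections stand in for learned W_q, W_k, W_v.
    q, k, v = x, x, x
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d), axis=-1)  # (n, 2, 2)
    fused = attn @ v                                    # (n, 2, d)
    return fused.mean(axis=1)                           # pool channels -> (n, d)

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 8))   # hypothetical global-channel embeddings
l = rng.normal(size=(4, 8))   # hypothetical local-channel embeddings
z = fuse_dual_channel(g, l)   # fused representation fed to the clusterer
```

Letting attention weight the two channels per sample, rather than concatenating or averaging them with a fixed weight, allows the fused vector to lean on whichever channel is more informative for a given text.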

Key words: Large language model, Short text clustering, Global semantic enhancement, Local discriminative optimization, Self-attention mechanism



CLC Number: