Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (10): 3101-3110. DOI: 10.11772/j.issn.1001-9081.2024101464

• Artificial Intelligence •


Unsupervised contrastive learning for Chinese with mutual information and prompt learning

Peng HUANG1, Jiayu LIN2, Zuhong LIANG1,3

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, Guangdong 510006, China
    2. Library, Guangdong University of Technology, Guangzhou, Guangdong 510006, China
    3. Experimental Teaching Department, Guangdong University of Technology, Guangzhou, Guangdong 510006, China
  • Received: 2024-10-21  Revised: 2025-01-17  Accepted: 2025-01-22  Online: 2025-10-14  Published: 2025-10-10
  • Contact: Jiayu LIN
  • About authors: HUANG Peng, born in 2001 in Shenzhen, Guangdong, M. S. candidate, CCF member. His research interests include sentence embedding, data augmentation, and data mining.
    LIN Jiayu, born in 1981 in Huizhou, Guangdong, M. S., associate research librarian. Her research interests include information retrieval analysis and data mining.
    LIANG Zuhong, born in 1980 in Huizhou, Guangdong, Ph. D., professor-level senior engineer. His research interests include machine learning and intelligent computing.
  • Supported by:
    Industry-University Collaborative Education Project of the Ministry of Education (220901229305933); 2024 Guangzhou Basic Research Plan, City-University/Institute-Enterprise Joint Funding Thematic Project (2024A03J1199)

Abstract:

Unsupervised contrastive learning for Chinese faces multiple challenges: first, Chinese sentence structure is flexible and highly variable, and semantic ambiguity is high, which makes it difficult for models to capture deep semantic features accurately; second, on small-scale datasets, the feature expression ability of contrastive learning models is insufficient, so effective semantic representations are hard to learn fully; third, the data augmentation process may introduce redundant noise, further aggravating training instability. Together, these issues limit model performance in Chinese semantic understanding. To address these problems, an unsupervised contrastive learning method for Chinese with Mutual Information (MI) and Prompt Learning (CMIPL) was proposed. Firstly, a prompt learning based data augmentation approach was adopted to construct the sample pairs required for contrastive learning, which increased text diversity while preserving all text information and word order, standardized the input structure of samples, and provided prompt templates as context for the input samples, guiding the model to learn fine-grained semantics more deeply. Secondly, on the basis of the output representations of the pre-trained language model, a prompt template denoising method was used to remove the redundant noise introduced by data augmentation. Finally, the structural information of positive samples was incorporated into model training: the MI of the attention tensors of the augmented views was calculated and introduced into the loss function. By minimizing the loss function, the attention distribution of the model was optimized and the structural alignment of the augmented views was maximized, enabling the model to better narrow the distance between positive pairs. Comparison experiments were conducted on few-shot data constructed from three public Chinese text similarity datasets: ATEC, BQ, and PAWSX. The results show that the proposed method achieves the best average performance, especially when the training set is small: with 1% and 10% of the samples, compared with the baseline contrastive learning model SimCSE (Simple Contrastive learning of Sentence Embeddings), CMIPL improves the average accuracy and Spearman's Rank correlation coefficient (SR) by 3.45, 4.07 and 1.64, 2.61 percentage points, respectively, verifying the effectiveness of CMIPL in few-shot unsupervised contrastive learning for Chinese.
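
To make the first two steps concrete, the sketch below builds two prompt-template views per sentence as a positive pair and subtracts the template-only representation as denoising. It is a minimal illustration in the spirit of prompt-based sentence embedding work (e.g., PromptBERT): the backbone model, the two Chinese templates, and pooling at the [MASK] position are all assumptions, not the paper's exact design.

```python
# Minimal sketch of steps 1-2 (prompt augmentation + template denoising).
# Backbone, templates, and [MASK]-position pooling are illustrative
# assumptions; the abstract does not specify the paper's exact choices.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-chinese"  # assumed backbone
tok = AutoTokenizer.from_pretrained(MODEL)
enc = AutoModel.from_pretrained(MODEL)

# Two hypothetical Chinese templates; "{s}" is replaced by the input sentence.
TEMPLATES = [
    "“{s}”，这句话的意思是：[MASK]。",
    "“{s}”，它表达的含义是：[MASK]。",
]

def mask_hidden(texts):
    """Last-layer hidden state at each [MASK] position, shape (B, H)."""
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    h = enc(**batch).last_hidden_state                  # (B, L, H)
    return h[batch["input_ids"] == tok.mask_token_id]   # one [MASK] per text

def denoised_views(sentences):
    """Two prompt-augmented views per sentence, with template bias removed."""
    views = []
    for t in TEMPLATES:
        filled = mask_hidden([t.format(s=s) for s in sentences])
        empty = mask_hidden([t.format(s="")] * len(sentences))  # template only
        views.append(filled - empty)  # denoising: subtract the template bias
    return views  # [z1, z2]: positive pairs for contrastive learning
```

Because the template is identical for every sentence within a view, subtracting the template-only [MASK] representation removes a shared bias while keeping the sentence-specific semantics.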
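For the third step, the abstract states that the MI of the attention tensors of the augmented views is introduced into the loss, but does not name an estimator. The sketch below therefore uses a MINE-style Donsker-Varadhan lower bound with a small critic network as one plausible stand-in; the critic architecture, the attention pooling, the temperature, and the weight `lam` are all assumptions.

```python
# Sketch of a CMIPL-style objective: SimCSE/InfoNCE over sentence embeddings
# plus a MINE-style lower bound on the MI between the two views' attention
# tensors. Estimator choice, pooling, and `lam` are illustrative assumptions.
import math
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.05):
    """SimCSE-style contrastive loss: in-batch negatives, temperature tau."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / tau                                   # (B, B)
    return F.cross_entropy(sim, torch.arange(z1.size(0), device=z1.device))

class Critic(torch.nn.Module):
    """Scores pairs of pooled attention features for the DV bound."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(2 * dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, 1))
    def forward(self, a1, a2):
        return self.net(torch.cat([a1, a2], dim=-1)).squeeze(-1)

def attention_mi(critic, a1, a2):
    """Donsker-Varadhan bound: E[T(joint)] - log E[exp(T(marginals))]."""
    joint = critic(a1, a2).mean()
    perm = a2[torch.randperm(a2.size(0), device=a2.device)]  # break pairing
    marg = torch.logsumexp(critic(a1, perm), 0) - math.log(a1.size(0))
    return joint - marg

def cmipl_style_loss(z1, z2, att1, att2, critic, lam=0.1):
    """att1/att2: attention tensors (B, heads, L, L) from the two views."""
    a1 = att1.mean(dim=(1, 2))  # crude pooling to (B, L); fixed padded L assumed
    a2 = att2.mean(dim=(1, 2))
    # Minimizing the total loss maximizes the MI bound, aligning the
    # attention structure of the two augmented views.
    return info_nce(z1, z2) - lam * attention_mi(critic, a1, a2)
```

In training, the critic would be optimized jointly with the encoder so that the bound tightens while the encoder pulls positive pairs together and aligns their attention structure.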
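Finally, a short routine of the kind implied by the reported metrics: cosine similarities of sentence-pair embeddings are scored against gold labels by accuracy and Spearman's rank correlation. The 0.5 decision threshold is our assumption, not taken from the paper.

```python
# Sketch of the evaluation implied by the abstract (ATEC/BQ/PAWSX-style
# binary similarity pairs). The 0.5 threshold is an assumption.
import numpy as np
from scipy.stats import spearmanr

def evaluate(emb_a, emb_b, labels):
    """emb_a, emb_b: (N, H) arrays; labels: (N,) gold 0/1 similarity."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = (a * b).sum(axis=1)                  # cosine similarity per pair
    sr, _ = spearmanr(cos, labels)             # Spearman's rank correlation
    acc = ((cos > 0.5) == (labels > 0.5)).mean()
    return acc, sr
```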

Key words: prompt learning, contrastive learning, Mutual Information (MI), attention tensor, denoising, unsupervised
