Journal of Computer Applications (official website) ›› 2023, Vol. 43 ›› Issue (10): 3062-3069. DOI: 10.11772/j.issn.1001-9081.2022091449

Special topic: Artificial intelligence


Sentence embedding optimization based on manifold learning

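Mingyue WU1,2, Dong ZHOU1, Wenyu ZHAO1,2, Wei QU1,2

  1. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201
  2. Hunan Key Laboratory for Service Computing and Novel Software Technology (Hunan University of Science and Technology), Xiangtan, Hunan 411201
  • Received: 2022-09-30 Revised: 2023-01-24 Accepted: 2023-02-01 Online: 2023-02-28 Published: 2023-10-10
  • Corresponding author: Dong ZHOU
  • About the authors: WU Mingyue (born 1999), male, from Loudi, Hunan; M.S. candidate, CCF member. Research interests: natural language processing, deep learning.
    ZHOU Dong (born 1979), male, from Changsha, Hunan; professor, Ph.D., CCF senior member. Research interests: information retrieval, natural language processing. dongzhou1979@hotmail.com
    ZHAO Wenyu (born 1993), female, from Hengyang, Hunan; Ph.D. candidate, CCF member. Research interests: information retrieval, natural language processing.
    QU Wei (born 1991), female, from Xiangtan, Hunan; M.S. candidate, CCF member. Research interests: source code summarization, natural language processing.
  • Supported by:
    National Natural Science Foundation of China (61876062); Natural Science Foundation of Hunan Province (2022JJ30020); Scientific Research Project of Hunan Provincial Education Department (21A0319)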

Sentence embedding optimization based on manifold learning

Mingyue WU1,2, Dong ZHOU1, Wenyu ZHAO1,2, Wei QU1,2

  1. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan Hunan 411201, China
  2. Hunan Key Laboratory for Service Computing and Novel Software Technology (Hunan University of Science and Technology), Xiangtan Hunan 411201, China
  • Received: 2022-09-30 Revised: 2023-01-24 Accepted: 2023-02-01 Online: 2023-02-28 Published: 2023-10-10
  • Contact: Dong ZHOU
  • About the authors: WU Mingyue, born in 1999, M.S. candidate. His research interests include natural language processing and deep learning.
    ZHOU Dong, born in 1979, Ph.D., professor. His research interests include information retrieval and natural language processing.
    ZHAO Wenyu, born in 1993, Ph.D. candidate. Her research interests include information retrieval and natural language processing.
    QU Wei, born in 1991, M.S. candidate. Her research interests include source code summarization and natural language processing.
  • Supported by:
    National Natural Science Foundation of China (61876062); Natural Science Foundation of Hunan Province (2022JJ30020); Scientific Research Project of Hunan Provincial Education Department (21A0319)

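Abstract (from the Chinese version):

Sentence embeddings are one of the core technologies of natural language processing and affect the quality and performance of natural language processing systems. However, existing methods cannot efficiently infer the global semantic relationships between sentences, so measuring the semantic similarity of sentences in Euclidean space remains problematic. To address this problem, a sentence embedding optimization method based on manifold learning is proposed, starting from the local geometric structure of sentences. The method uses Locally Linear Embedding (LLE) to perform two weighted local linear combinations of each sentence and its semantically similar sentences. This not only preserves the local geometric information between sentences but also helps infer global geometric information, so that the semantic similarity of sentences in Euclidean space is closer to human semantic judgment. Experimental results on seven semantic textual similarity tasks show that the average Spearman's Rank Correlation Coefficient (SRCC) of the proposed method is 1.21 percentage points higher than that of the contrastive learning-based method SimCSE (Simple Contrastive learning of Sentence Embeddings). In addition, when the proposed method is applied to mainstream pre-trained models, the average SRCC of the optimized models is 3.32 to 7.70 percentage points higher than that of the original pre-trained models.

Key words: manifold learning, pre-trained model, contrastive learning, sentence embedding, natural language processing, Locally Linear Embedding (LLE)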

Abstract:

As one of the core technologies of natural language processing, sentence embedding affects the quality and performance of natural language processing systems. However, existing methods cannot efficiently infer the global semantic relationships between sentences, so the measurement of semantic similarity between sentences in Euclidean space remains problematic. To address this issue, a sentence embedding optimization method based on manifold learning was proposed, starting from the local geometric structure of sentences. In the method, Locally Linear Embedding (LLE) was used to perform two weighted local linear combinations of each sentence and its semantically similar sentences, thereby preserving the local geometric information between sentences and helping to infer the global geometric information. As a result, the semantic similarity of sentences in Euclidean space was brought closer to human semantic judgment. Experimental results on seven semantic textual similarity tasks show that the average Spearman's Rank Correlation Coefficient (SRCC) of the proposed method is 1.21 percentage points higher than that of the contrastive learning-based method SimCSE (Simple Contrastive learning of Sentence Embeddings). In addition, the proposed method was applied to mainstream pre-trained models. The results show that the average SRCC of the models optimized by the proposed method is 3.32 to 7.70 percentage points higher than that of the original pre-trained models.

Key words: manifold learning, pre-trained model, contrastive learning, sentence embedding, natural language processing, Locally Linear Embedding (LLE)
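
The abstract describes rebuilding each sentence vector as a weighted local linear combination of its semantically similar sentences, applied twice. The following is a minimal, dependency-free sketch of that general idea, not the authors' implementation. The abstract does not specify how neighbors are chosen, how many are used, or how the weights are regularized, so the cosine-similarity neighbor search, the neighbor count `k`, the regularizer `reg`, and all function names here are assumptions made for illustration. Vectors are assumed to be non-zero lists of floats.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Assumes neither vector is all zeros.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lle_weights(x, neighbors, reg=1e-3):
    """Standard LLE reconstruction weights: they sum to 1 and minimise
    ||x - sum_j w_j * neighbors[j]||^2 (with a small ridge term, assumed here)."""
    diffs = [[xi - ni for xi, ni in zip(x, n)] for n in neighbors]
    k = len(neighbors)
    gram = [[dot(diffs[i], diffs[j]) for j in range(k)] for i in range(k)]
    trace = sum(gram[i][i] for i in range(k))
    eps = reg * trace if trace > 0 else reg
    for i in range(k):
        gram[i][i] += eps  # keeps the local Gram matrix invertible
    w = solve(gram, [1.0] * k)
    total = sum(w)
    return [wi / total for wi in w]


def refine(embeddings, k=2, reg=1e-3):
    """One pass: rebuild every vector from its k most cosine-similar neighbours."""
    refined = []
    for i, x in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: cosine(x, embeddings[j]), reverse=True)
        nbrs = [embeddings[j] for j in others[:k]]
        w = lle_weights(x, nbrs, reg)
        refined.append(
            [sum(wj * n[d] for wj, n in zip(w, nbrs)) for d in range(len(x))]
        )
    return refined


def refine_twice(embeddings, k=2, reg=1e-3):
    """Hypothetical reading of 'two weighted local linear combinations'."""
    return refine(refine(embeddings, k, reg), k, reg)
```

With real sentence embeddings, `refine_twice` would take the encoder's output vectors in place of toy lists; a library such as NumPy would be the natural choice for anything beyond a small illustration.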

CLC number: