Sentence embedding optimization based on manifold learning

Journal of Computer Applications, 2023, Vol. 43, Issue (10): 3062-3069. DOI: 10.11772/j.issn.1001-9081.2022091449

Special topic: Artificial intelligence

Mingyue WU¹,², Dong ZHOU¹, Wenyu ZHAO¹,², Wei QU¹,²

1. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201, China
2. Hunan Key Laboratory for Service Computing and Novel Software Technology (Hunan University of Science and Technology), Xiangtan, Hunan 411201, China

Received: 2022-09-30 Revised: 2023-01-24 Accepted: 2023-02-01 Online: 2023-02-28 Published: 2023-10-10
Contact: Dong ZHOU
About the authors:
WU Mingyue, born in 1999 in Loudi, Hunan, M.S. candidate, CCF member. His research interests include natural language processing and deep learning.
ZHOU Dong, born in 1979 in Changsha, Hunan, Ph.D., professor, CCF senior member. His research interests include information retrieval and natural language processing. E-mail: dongzhou1979@hotmail.com
ZHAO Wenyu, born in 1993 in Hengyang, Hunan, Ph.D. candidate, CCF member. Her research interests include information retrieval and natural language processing.
QU Wei, born in 1991 in Xiangtan, Hunan, M.S. candidate, CCF member. Her research interests include source code summarization and natural language processing.
Supported by:
National Natural Science Foundation of China (61876062); Natural Science Foundation of Hunan Province (2022JJ30020); Scientific Research Project of Hunan Provincial Education Department (21A0319)

Abstract

Sentence embedding is one of the core technologies of natural language processing and affects the quality and performance of natural language processing systems. However, existing methods cannot efficiently infer the global semantic relationships between sentences, so measuring the semantic similarity of sentences in Euclidean space remains problematic. To address this issue, a sentence embedding optimization method based on manifold learning is proposed, starting from the local geometric structure of sentences. The method uses Locally Linear Embedding (LLE) to perform two weighted local linear combinations over each sentence and its semantically similar sentences. This preserves the local geometric information between sentences and also helps infer the global geometric information, so that the semantic similarity of sentences in Euclidean space is closer to human semantic judgments. Experimental results on seven semantic textual similarity tasks show that the average Spearman's Rank Correlation Coefficient (SRCC) of the proposed method is 1.21 percentage points higher than that of SimCSE (Simple Contrastive learning of Sentence Embeddings), a contrastive learning-based method. In addition, when the proposed method is applied to mainstream pre-trained models, the average SRCC of the optimized models is 3.32 to 7.70 percentage points higher than that of the original pre-trained models.

Key words: manifold learning, pre-trained model, contrastive learning, sentence embedding, natural language processing, Locally Linear Embedding (LLE)
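
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the standard LLE reconstruction step it builds on: each sentence vector is expressed as a weighted combination of its nearest neighbors, with weights that sum to 1. The function names (`lle_weights`, `refine`), the cosine-based neighbor selection, the regularization constant, and the choice to simply apply the step twice are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Standard LLE weights: reconstruct x from `neighbors` (one per row), weights sum to 1."""
    diffs = neighbors - x                       # (k, d) offsets from x to each neighbor
    gram = diffs @ diffs.T                      # (k, k) local Gram matrix
    # Regularize so the system stays solvable when neighbors are nearly collinear.
    gram = gram + reg * (np.trace(gram) + 1e-12) * np.eye(len(neighbors))
    w = np.linalg.solve(gram, np.ones(len(neighbors)))
    return w / w.sum()

def refine(embeddings, k=3, reg=1e-3):
    """One illustrative pass: rebuild every vector from its k most similar vectors."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T                         # cosine similarity (assumed neighbor criterion)
    np.fill_diagonal(sim, -np.inf)              # a sentence is not its own neighbor
    refined = np.empty_like(embeddings)
    for i, x in enumerate(embeddings):
        idx = np.argsort(sim[i])[-k:]           # indices of the k nearest neighbors
        w = lle_weights(x, embeddings[idx], reg)
        refined[i] = w @ embeddings[idx]        # weighted local linear combination
    return refined

# Toy data standing in for sentence embeddings from a pre-trained model.
rng = np.random.default_rng(0)
sentences = rng.normal(size=(20, 8))
once = refine(sentences)
twice = refine(once)                            # second pass, purely illustrative
```

Because the weights depend only on a sentence's local neighborhood, a step like this can be applied on top of any fixed set of embeddings without retraining the underlying model.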

CLC number: TP391.1

Mingyue WU, Dong ZHOU, Wenyu ZHAO, Wei QU. Sentence embedding optimization based on manifold learning[J]. Journal of Computer Applications, 2023, 43(10): 3062-3069.

Figures/Tables: 7

Category	Model	STS12	STS13	STS14	STS15	STS16	STS-B	SICK-R	Average
Generation models	BERT	54.71	54.52	58.81	67.36	68.18	53.88	62.06	59.93
	Skip_Thoughts	44.32	34.56	41.65	46.52	55.45	74.28	79.21	53.71
	InferSent_FastText	58.16	55.29	58.53	67.34	68.29	72.45	78.34	65.48
	USE_TF	64.21	68.54	67.96	77.12	76.86	77.94	80.73	73.33
	ConSERT	64.64	78.49	69.42	79.72	75.95	73.97	67.31	72.78
	SBERT	66.21	74.21	74.43	77.27	73.86	74.16	78.29	74.06
	SimCSE	70.14	79.56	75.91	81.46	79.07	76.85	72.55	76.50
Optimization models	Glove_WR	57.13	68.24	65.31	72.25	70.16	64.26	70.43	66.82
	BERT_flow	58.40	67.10	60.85	75.16	71.22	68.66	64.47	66.55
	BERT_whitening	57.83	66.90	60.90	75.08	71.31	68.24	63.73	66.28
	SimMSE	72.00	80.10	76.03	83.03	81.04	78.18	73.52	77.71
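
The scores in this table appear to be Spearman rank correlations scaled by 100, the metric named in the abstract. As a rough illustration of how such a score is obtained, the sketch below ranks predicted similarities and human ratings (averaging tied ranks) and takes the Pearson correlation of the ranks. The helper names and the toy numbers are hypothetical; in practice the predicted values would typically be cosine similarities between sentence embeddings, which is an assumption here.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(predicted, gold):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(gold))

# Hypothetical toy data: model similarities and human ratings for four sentence pairs.
predicted = [0.9, 0.6, 0.2, 0.5]
gold = [5.0, 3.2, 1.0, 4.1]
score = spearman(predicted, gold)
print(f"SRCC x 100 = {100 * score:.2f}")
```

Only the ordering of the predicted similarities matters for this metric, so any monotonic rescaling of the similarity scores leaves the result unchanged.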


Model	STS12	STS13	STS14	STS15	STS16	STS-B	SICK-R	Average
BERT(1)	21.02	20.12	16.77	20.14	27.43	6.43	30.11	20.28
BERT(2)	54.71	54.50	58.81	67.36	68.17	53.87	62.06	59.92
BERT(3)	50.07	52.91	54.91	63.37	64.94	47.29	58.22	55.95
BERT_MFL	60.65	64.53	63.40	74.11	72.69	62.21	64.27	65.98
Roberta(1)	45.27	36.88	47.71	53.84	59.50	39.13	61.76	49.15
Roberta(2)	57.59	48.98	59.36	66.87	64.20	58.56	61.63	59.59
Roberta(3)	53.82	46.58	56.64	64.96	63.62	55.40	62.02	57.57
Roberta_MFL	59.86	61.06	64.98	70.59	69.91	61.43	66.08	64.84
XLNET(1)	47.91	33.92	44.54	57.67	49.94	41.21	49.23	46.34
XLNET(2)	37.15	20.83	27.27	35.05	35.65	31.62	37.01	32.08
XLNET(3)	37.40	20.73	27.14	34.91	35.55	31.49	36.19	31.91
XLNET_MFL	53.45	49.14	52.42	62.07	54.65	49.62	52.11	53.35
GPT-2(1)	44.37	16.51	24.23	35.27	44.40	22.74	42.33	32.83
GPT-2(2)	36.80	24.76	31.24	33.75	37.41	27.06	43.26	33.46
GPT-2(3)	36.18	23.80	30.46	32.94	36.84	26.27	42.72	32.74
GPT-2_MFL	48.87	32.87	40.24	43.91	42.26	38.76	43.97	41.16
BART(1)	59.78	53.70	61.54	71.01	69.64	60.92	61.77	62.62
BART(2)	53.08	45.14	53.86	65.75	63.94	52.46	53.16	55.34
BART(3)	51.34	50.50	55.57	67.07	64.59	51.16	54.91	56.44
BART_MFL	60.80	63.58	64.74	73.03	72.26	63.87	63.12	65.94
T5(1)	37.52	32.88	39.91	44.02	47.96	36.42	37.13	39.40
T5(2)	66.01	71.04	73.45	73.74	64.83	62.77	60.18	67.43
T5(3)	59.01	64.19	69.87	68.87	83.17	60.35	60.84	66.61
T5_MFL	68.25	74.56	80.26	75.94	84.37	68.35	63.78	73.64
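
The 3.32 to 7.70 percentage-point range quoted in the abstract can be reproduced from the Average column above if each _MFL row is compared with the strongest of the three baseline variants of the same model. That pairing is an inference from the numbers rather than something this page states, and the variable names in the short check below are illustrative.

```python
# Reported averages copied from the table: variants (1), (2), (3) and the _MFL row.
reported_avg = {
    "BERT":    ([20.28, 59.92, 55.95], 65.98),
    "Roberta": ([49.15, 59.59, 57.57], 64.84),
    "XLNET":   ([46.34, 32.08, 31.91], 53.35),
    "GPT-2":   ([32.83, 33.46, 32.74], 41.16),
    "BART":    ([62.62, 55.34, 56.44], 65.94),
    "T5":      ([39.40, 67.43, 66.61], 73.64),
}

# Gain of the _MFL row over the best baseline variant (assumed comparison).
gains = {
    model: round(mfl - max(baselines), 2)
    for model, (baselines, mfl) in reported_avg.items()
}

for model, gain in gains.items():
    print(f"{model}: +{gain:.2f}")
print(f"range: {min(gains.values()):.2f} to {max(gains.values()):.2f}")
```

Under this reading, the smallest gain comes from BART and the largest from GPT-2.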


Model	None	Random sampling	Rejection sampling	Sentence-frequency sampling
BERT	53.87	46.63	49.61	62.21
Roberta	58.56	58.67	59.75	61.43
XLNET	31.62	28.68	32.65	49.62
GPT-2	27.06	29.84	30.26	38.76
BART	52.46	51.56	50.68	63.87
T5	62.77	58.67	60.84	68.35
