Information retrieval method based on multi-granularity semantic fusion

doi:10.11772/j.issn.1001-9081.2023050646

Journal of Computer Applications, 2024, Vol. 44, Issue (6): 1775-1780. DOI: 10.11772/j.issn.1001-9081.2023050646

Topic: Artificial intelligence

Zhengyu ZHAO¹, Jing LUO¹, Xinhui TU²

¹ School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan Hubei 430065, China
² School of Computer Science, Central China Normal University, Wuhan Hubei 430079, China

Received: 2023-05-24 Revised: 2023-09-24 Accepted: 2023-10-11 Online: 2023-10-17 Published: 2024-06-10
Contact: Jing LUO
About the authors: ZHAO Zhengyu, born in 1999 in Jingzhou, Hubei, M.S. candidate. His research interests include information retrieval and natural language processing.
TU Xinhui, born in 1979 in Yingcheng, Hubei, Ph.D., associate professor, CCF member. His research interests include information retrieval and natural language processing.
Supported by: Key Project of National Language Commission of China (ZDI145-22); Humanities and Social Sciences Research Project of Education Department of Hubei Province (18Q028)

Abstract

Information Retrieval (IR) is the process of organizing and processing information with specific techniques and methods to meet users' information needs. In recent years, dense retrieval methods based on pre-trained models have achieved significant success. However, these methods use only text-level and word-level vector representations to calculate the relevance between a query and a document, ignoring semantic information at the phrase level. To address this issue, an IR method called MSIR (Multi-Scale Information Retrieval) was proposed, which improves IR performance by fusing semantic information of different granularities from the query and the document. First, semantic units of three granularities (word, phrase, and text) were constructed from the query and the document. Then, a pre-trained model was used to encode these three kinds of semantic units separately to obtain their semantic representations. Finally, these representations were used to calculate the relevance between the query and the document. Comparison experiments were conducted on three classic datasets of different sizes: Corvid-19, TREC2019 and Robust04. Compared with ColBERT (a ranking model based on Contextualized late interaction over BERT (Bidirectional Encoder Representation from Transformers)), MSIR achieved an improvement of approximately 8% in P@10, P@20, NDCG@10 and NDCG@20 on the Robust04 dataset, along with some improvements on the Corvid-19 and TREC2019 datasets. The experimental results show that MSIR can effectively fuse semantic information of multiple granularities and thereby improve retrieval accuracy.

Key words: semantic fusion, Information Retrieval (IR), dense retrieval, pre-trained model, text retrieval
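
The three steps in the abstract (build word, phrase and text units, encode them, score relevance) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: `toy_encode` is a hypothetical stand-in for a pre-trained encoder, phrases are approximated by word bigrams, word and phrase matching uses a ColBERT-style "best match per query unit" rule (averaged rather than summed), and the three scores are fused with equal weights. The abstract does not specify the actual phrase extraction or fusion formula. Queries and documents are assumed to be non-empty.

```python
import hashlib

import numpy as np

DIM = 64  # embedding size of the toy encoder (arbitrary choice)


def toy_encode(unit: str) -> np.ndarray:
    """Hypothetical stand-in for a pre-trained encoder.

    Maps a string to a deterministic unit-length vector so the sketch
    runs without downloading any model.
    """
    digest = hashlib.sha256(unit.lower().encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "big")
    vec = np.random.default_rng(seed).standard_normal(DIM)
    return vec / np.linalg.norm(vec)


def words(text: str) -> list[str]:
    """Word-level units: lowercase whitespace tokens."""
    return text.lower().split()


def phrases(text: str, n: int = 2) -> list[str]:
    """Phrase-level units, approximated here by word n-grams (assumption)."""
    toks = words(text)
    grams = [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return grams or [" ".join(toks)]  # fall back to the whole text if too short


def max_sim(q_units: list[str], d_units: list[str]) -> float:
    """For each query unit take its best-matching document unit, then average."""
    q = np.stack([toy_encode(u) for u in q_units])
    d = np.stack([toy_encode(u) for u in d_units])
    return float((q @ d.T).max(axis=1).mean())


def msir_like_score(query: str, doc: str,
                    weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Fuse word-, phrase- and text-level similarities with fixed weights (assumption)."""
    word_score = max_sim(words(query), words(doc))
    phrase_score = max_sim(phrases(query), phrases(doc))
    text_score = float(toy_encode(query) @ toy_encode(doc))
    w_word, w_phrase, w_text = weights
    return w_word * word_score + w_phrase * phrase_score + w_text * text_score


if __name__ == "__main__":
    query = "dense retrieval models"
    for doc in ["dense retrieval models", "pasta cooking recipes"]:
        print(doc, round(msir_like_score(query, doc), 3))
```

With a real encoder the three similarity terms would carry semantic signal; with the toy encoder only exact string overlap scores highly, which is enough to show how the three granularities are combined.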

CLC number: TP391.3

Zhengyu ZHAO, Jing LUO, Xinhui TU. Information retrieval method based on multi-granularity semantic fusion[J]. Journal of Computer Applications, 2024, 44(6): 1775-1780.

	Corvid-19				Robust04				TREC2019			
Model	P@10	P@20	NDCG@10	NDCG@20	P@10	P@20	NDCG@10	NDCG@20	P@10	P@20	NDCG@10	NDCG@20
BM25	0.3721	0.3342	0.3121	0.2832	0.3710	0.3082	0.3772	0.3602	0.5860	0.5151	0.4816	0.4663
SimCSE	0.2182	0.2053	0.1946	0.1870	0.2573	0.1745	0.3689	0.2412	0.4073	0.4245	0.4089	0.3563
ColBERT	0.6621	0.6151	0.6131	0.5754	0.3819	0.3146	0.4005	0.3713	0.7814	0.6814	0.6898	0.6595
MSIR (BERT)	0.6601	0.6162	0.6036	0.5690	0.3895	0.3168	0.4083	0.3762	0.7767	0.6895	0.6902	0.6640
MSIR (T5)	0.7100	0.6470	0.6401	0.5959	0.3943	0.3206	0.4091	0.3779	0.7837	0.6965	0.6887	0.6676
MSIR (SimCSE)	0.6880	0.6510	0.6385	0.6044	0.3851	0.3176	0.4019	0.3793	0.7884	0.6872	0.7009	0.6657
MSIR (SBERT)	0.5640	0.5900	0.4747	0.5031	0.4197	0.3403	0.4350	0.4010	0.7953	0.6977	0.6980	0.6691
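
The "approximately 8%" improvement on Robust04 reported in the abstract can be checked against the table with a short calculation. The sketch below assumes the figure refers to the MSIR (SBERT) row, which is the best MSIR variant on Robust04; that reading is an inference from the table, not something the page states. The helper name `relative_gain` is illustrative.

```python
# Robust04 scores copied from the ColBERT and MSIR (SBERT) rows of the table.
colbert = {"P@10": 0.3819, "P@20": 0.3146, "NDCG@10": 0.4005, "NDCG@20": 0.3713}
msir_sbert = {"P@10": 0.4197, "P@20": 0.3403, "NDCG@10": 0.4350, "NDCG@20": 0.4010}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Percentage gain per metric, rounded to one decimal place.
gains = {metric: round(relative_gain(msir_sbert[metric], colbert[metric]), 1)
         for metric in colbert}

if __name__ == "__main__":
    for metric, gain in gains.items():
        print(f"{metric}: +{gain}%")
```

Under this reading the relative gains fall between roughly 8% and 10% across the four metrics, which is consistent with the abstract's rounded figure.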