Journal of Computer Applications, 2024, Vol. 44, Issue (6): 1775-1780. DOI: 10.11772/j.issn.1001-9081.2023050646

Special topic: Artificial Intelligence

Information retrieval method based on multi-granularity semantic fusion

Zhengyu ZHAO¹, Jing LUO¹, Xinhui TU²

  1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei 430065, China
  2. School of Computer Science, Central China Normal University, Wuhan, Hubei 430079, China
  • Received: 2023-05-24  Revised: 2023-09-24  Accepted: 2023-10-11  Online: 2023-10-17  Published: 2024-06-10
  • Corresponding author: Jing LUO
  • About the authors: ZHAO Zhengyu, born in 1999 in Jingzhou, Hubei, M.S. candidate. His research interests include information retrieval and natural language processing.
    TU Xinhui, born in 1979 in Yingcheng, Hubei, Ph.D., associate professor, CCF member. His research interests include information retrieval and natural language processing.
  • Supported by:
    Key Project of the National Language Commission of China (ZDI145-22); Humanities and Social Sciences Research Project of the Education Department of Hubei Province (18Q028)

Abstract:

Information retrieval (IR) is the process of organizing and processing information with specific techniques and methods to meet users' information needs. In recent years, dense retrieval methods based on pre-trained models have achieved great success. However, these methods compute query-document relevance using only text-level and word-level vector representations, ignoring semantic information at the phrase level. To address this problem, an IR method named MSIR (Multi-Scale Information Retrieval) is proposed, which improves IR performance by fusing semantic information at multiple granularities from the query and the document. First, semantic units at three granularities (word, phrase, and text) are constructed for the query and the document. Second, a pre-trained model encodes the units at each granularity separately to obtain their semantic representations. Finally, these representations are used to compute the relevance between the query and the document. Comparison experiments were conducted on three classic datasets of different sizes: Corvid-19, TREC2019 and Robust04. Compared with ColBERT, a ranking model based on contextualized late interaction over BERT (Bidirectional Encoder Representations from Transformers), MSIR improves P@10, P@20, NDCG@10 and NDCG@20 on the Robust04 dataset by approximately 8%, and also achieves some improvement on the Corvid-19 and TREC2019 datasets. The experimental results show that MSIR can effectively fuse semantic information at multiple granularities and thereby improve retrieval accuracy.
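
The three steps described in the abstract (build units, encode them, score relevance) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how phrases are extracted, which pre-trained model is used, or how the per-granularity scores are fused. The sketch assumes word bigrams as phrases, a hash-based `toy_encode` as a stand-in for a pre-trained encoder, ColBERT-style maximum-similarity matching for words and phrases, and a weighted sum for fusion. All function names and parameters are hypothetical.

```python
import hashlib
import math

DIM = 16  # toy embedding size (assumption; a real encoder would use e.g. 768)


def toy_encode(unit: str) -> list[float]:
    """Stand-in for a pre-trained encoder (assumption): maps a string to a
    deterministic unit-length vector derived from its SHA-256 hash."""
    digest = hashlib.sha256(unit.lower().encode("utf-8")).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:DIM]]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def words(text: str) -> list[str]:
    """Word-level units: whitespace tokens."""
    return text.lower().split()


def phrases(text: str, n: int = 2) -> list[str]:
    """Phrase-level units: word n-grams (assumption; the paper may use a parser)."""
    w = words(text)
    return [" ".join(w[i:i + n]) for i in range(len(w) - n + 1)]


def max_sim(query_units: list[str], doc_units: list[str]) -> float:
    """Late interaction: match each query unit to its most similar document unit.
    The result is averaged (not summed) so all granularities share one scale."""
    if not query_units or not doc_units:
        return 0.0  # e.g. a one-word query has no bigram phrases
    q_vecs = [toy_encode(u) for u in query_units]
    d_vecs = [toy_encode(u) for u in doc_units]
    return sum(max(dot(q, d) for d in d_vecs) for q in q_vecs) / len(q_vecs)


def multi_granularity_score(query: str, doc: str,
                            weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Fuse word-, phrase-, and text-level relevance with a weighted sum
    (the fusion rule and weights are assumptions)."""
    w_word, w_phrase, w_text = weights
    word_score = max_sim(words(query), words(doc))
    phrase_score = max_sim(phrases(query), phrases(doc))
    text_score = dot(toy_encode(query), toy_encode(doc))
    return w_word * word_score + w_phrase * phrase_score + w_text * text_score


if __name__ == "__main__":
    query = "dense retrieval"
    for doc in ("dense retrieval with pre-trained models",
                "retrieval models dense with pre-trained"):
        print(f"{multi_granularity_score(query, doc):.3f}  {doc}")
```

Both example documents contain every query word, so the word-level score cannot tell them apart. Only the first contains the query's exact bigram, so the phrase-level score is what separates them. This is the kind of signal the abstract says word-only and text-only representations miss.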

Key words: semantic fusion, Information Retrieval (IR), dense retrieval, pre-trained model, text retrieval

CLC Number: