Journal of Computer Applications, 2023, Vol. 43, Issue (4): 1036-1042. DOI: 10.11772/j.issn.1001-9081.2022030480

• Artificial intelligence •

Pseudo relevance feedback method for dense retrieval

Wenhao HU1, Jing LUO1, Xinhui TU2

  1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan Hubei 430081, China
  2. School of Computer Science, Central China Normal University, Wuhan Hubei 430079, China
  • Received:2022-04-13 Revised:2022-07-02 Accepted:2022-07-11 Online:2023-01-11 Published:2023-04-10
  • Contact: Xinhui TU
  • About author:HU Wenhao, born in 1998, M. S. candidate. His research interests include information retrieval, natural language processing.
    LUO Jing, born in 1978, Ph. D., associate professor. Her research interests include information retrieval, natural language processing.
  • Supported by:
    Humanities and Social Sciences Research Project of the Department of Education of Hubei Province (18Q028)

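Pseudo relevance feedback method for dense retrieval (Chinese title: 面向稠密检索的伪相关反馈方法)

Wenhao HU (胡文浩)1, Jing LUO (罗景)1, Xinhui TU (涂新辉)2

  1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430081
  2. School of Computer Science, Central China Normal University, Wuhan 430079
  • Corresponding author: Xinhui TU (涂新辉)
  • About the authors: HU Wenhao (born 1998), male, from Huanggang, Hubei, M. S. candidate. Main research interests: information retrieval, natural language processing.
    LUO Jing (born 1978), female, from Wuhan, Hubei, associate professor, Ph. D., CCF member. Main research interests: information retrieval, natural language processing.
  • Funding:
    Humanities and Social Sciences Research Project of the Department of Education of Hubei Province (18Q028)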

Abstract:

The Pseudo Relevance Feedback (PRF) mechanism is an automated Query Expansion (QE) technique that uses the original query and the information contained in the top N documents of the initial retrieval to build more accurate queries, which can further improve the performance of retrieval systems. However, existing PRF methods for dense retrieval have two problems: loss of semantic information due to text truncation, and high time complexity in the retrieval stage. To address these problems, a PRF method based on paragraph-level granularity that can be used in dense retrieval for long texts, namely Dense-PRF, was proposed. Firstly, the embeddings of relevant paragraphs from the top N documents of the initial retrieval were obtained by semantic distance calculation. Secondly, the QE term embeddings were obtained by average pooling of the relevant paragraph embeddings. Thirdly, new query embeddings were constructed by combining the original query embeddings and the QE term embeddings according to their weights. Finally, the final retrieval results were obtained according to the new query embeddings. In experiments comparing Dense-PRF with baseline models on two classic long-text test datasets, Robust04 and WT2G, Dense-PRF improved the accuracy and Normalized Discounted Cumulative Gain (NDCG) of the top 20 documents by 1.66 and 1.32 percentage points and by 2.30 and 1.91 percentage points, respectively, compared with the RepBERT+BM25 model. Experimental results demonstrate that Dense-PRF can effectively alleviate the vocabulary mismatch between queries and documents and improve retrieval accuracy.
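
The four steps in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the semantic distance, the number of paragraphs kept, the weighting scheme, or the scoring function, so cosine similarity, the hypothetical parameters `top_m` and `alpha`, and inner-product scoring are all assumptions made for illustration. The function names are likewise hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norm, 1e-12, None)


def dense_prf_query(query_emb, paragraph_embs, top_m=2, alpha=0.5):
    """Build an expanded query embedding from feedback paragraphs.

    query_emb:      (d,) embedding of the original query.
    paragraph_embs: (p, d) embeddings of paragraphs taken from the
                    top N documents of the initial retrieval.
    top_m, alpha:   hypothetical parameters (paragraphs kept, query weight).
    """
    # Step 1: pick the paragraphs closest to the query.
    # Assumption: cosine similarity stands in for "semantic distance".
    sims = l2_normalize(paragraph_embs) @ l2_normalize(query_emb)
    top_idx = np.argsort(-sims)[:top_m]

    # Step 2: average pooling of the selected paragraph embeddings
    # gives the QE term embedding.
    expansion = paragraph_embs[top_idx].mean(axis=0)

    # Step 3: weighted combination of original query and QE embedding.
    # Assumption: a simple linear interpolation with weight alpha.
    return alpha * query_emb + (1.0 - alpha) * expansion


def retrieve(query_emb, doc_embs, k):
    """Step 4: rank documents by inner product with the new query."""
    scores = doc_embs @ query_emb
    return np.argsort(-scores)[:k]


# Toy data (made up for illustration only).
query = np.array([1.0, 0.0])
paragraphs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
docs = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

new_query = dense_prf_query(query, paragraphs, top_m=2, alpha=0.5)
ranking = retrieve(new_query, docs, k=2)
```

In this toy run the second paragraph is orthogonal to the query and is dropped, so the expanded query shifts only toward the two paragraphs that resemble it. In a real system the embeddings would come from a dense encoder and the final ranking from an approximate nearest-neighbour index rather than a full matrix product.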

Key words: Pseudo Relevance Feedback (PRF), Query Expansion (QE), information retrieval, dense retrieval, long text

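Abstract (translated from Chinese):

The Pseudo Relevance Feedback (PRF) mechanism is an automated Query Expansion (QE) technique that uses the original query and the information contained in the top N documents of the initial retrieval to build more accurate queries, thereby further improving the performance of information retrieval systems. However, existing PRF methods for dense retrieval tend to lose semantic information because they truncate text, and their space complexity in the retrieval stage is high. To address these problems, Dense-PRF, a PRF method based on paragraph-level granularity and suitable for dense retrieval of long texts, is proposed. Firstly, the embeddings of relevant paragraphs are obtained from the top N documents of the initial retrieval by computing semantic distance. Secondly, the relevant paragraph embeddings are average-pooled to obtain the QE term embedding. Then, a new query embedding is constructed by combining the original query embedding and the QE term embedding according to their weights. Finally, the final retrieval results are obtained with the new query embedding. Dense-PRF was compared with baseline models on two classic long-text test sets, Robust04 and WT2G. Compared with the RepBERT+BM25 model, Dense-PRF improved the accuracy and Normalized Discounted Cumulative Gain (NDCG) of the top 20 documents by 1.66 and 1.32 percentage points and by 2.30 and 1.91 percentage points, respectively. Experimental results show that Dense-PRF can effectively alleviate the vocabulary mismatch between queries and documents and improve retrieval accuracy.

Key words (translated from Chinese): pseudo relevance feedback, query expansion, information retrieval, dense retrieval, long text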

CLC Number: