Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (2): 386-394.DOI: 10.11772/j.issn.1001-9081.2025030275

• Artificial intelligence •

Benchmark dataset for retrieval-augmented generation on long documents

Yixin LIU1, Xianggen LIU1, Wen LIU2, Hongbo DENG2, Ziye ZHANG1, Hua MU3()   

  1. College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
    2. Southwest China Research Institute of Electronic Equipment, Chengdu, Sichuan 610036, China
    3. Wuhan KM Information Technology Company Limited, Wuhan, Hubei 430076, China
  • Received:2025-03-17 Revised:2025-06-02 Accepted:2025-06-04 Online:2025-08-06 Published:2026-02-10
  • Contact: Hua MU
  • About author:LIU Yixin, born in 1999, M. S. candidate. Her research interests include natural language processing and artificial intelligence.
    LIU Xianggen, born in 1993, Ph. D., associate professor. His research interests include artificial intelligence and natural language processing.
    LIU Wen, born in 1987, M. S., senior engineer. His research interests include artificial intelligence.
    DENG Hongbo, born in 1982, M. S., senior engineer. His research interests include artificial intelligence.
    ZHANG Ziye, born in 2003. His research interests include natural language processing.
    MU Hua, born in 1976, senior engineer. His research interests include artificial intelligence. Email: muhua008@163.com
  • Supported by:
    National Natural Science Foundation of China (62206192); National Key Research and Development Program of China (2024YFB3312503); Natural Science Foundation of Sichuan Province (2024NSFTD0048)

Benchmark dataset for retrieval-augmented generation on long documents

Yixin LIU1, Xianggen LIU1, Wen LIU2, Hongbo DENG2, Ziye ZHANG1, Hua MU3()   

  1. College of Computer Science, Sichuan University, Chengdu 610065, China
    2. Southwest China Research Institute of Electronic Equipment, Chengdu 610036, China
    3. Wuhan KM Information Technology Company Limited, Wuhan 430076, China
  • Corresponding author: Hua MU
  • About author: LIU Yixin (1999—), female, born in Deyang, Sichuan, M. S. candidate. Her research interests include natural language processing and artificial intelligence.
    LIU Xianggen (1993—), male, born in Chengdu, Sichuan, Ph. D., associate professor. His research interests include artificial intelligence and natural language processing.
    LIU Wen (1987—), male, born in Chengdu, Sichuan, M. S., senior engineer. His research interests include artificial intelligence.
    DENG Hongbo (1982—), male, born in Chengdu, Sichuan, M. S., senior engineer. His research interests include artificial intelligence.
    ZHANG Ziye (2003—), male, born in Chongqing. His research interests include natural language processing.
    MU Hua (1976—), male, born in Wuhan, Hubei, senior engineer. His research interests include artificial intelligence. Email: muhua008@163.com
  • Supported by:
    National Natural Science Foundation of China (62206192); National Key Research and Development Program of China (2024YFB3312503); Natural Science Foundation of Sichuan Province (2024NSFTD0048)

Abstract:

With the development of Pretrained Language Models (PLM), Retrieval-Augmented Generation (RAG) has attracted wide attention as an emerging task. A comprehensive and objective evaluation of RAG is essential to reveal the limitations of existing methods and to indicate future research directions. However, systematic evaluation benchmarks for RAG are lacking, especially in the context of long documents. To address this issue, an automatic question-answer construction strategy based on focused fragments was proposed, aiming to build large-scale QA datasets efficiently and accurately. Based on this strategy, LoRAG, the first bilingual RAG evaluation benchmark dataset for long documents, was constructed, covering English-Chinese bilingual documents from multiple domains such as law, finance, and literature, with an average document length of 57 000 tokens in English and 76 000 tokens in Chinese. Using the LoRAG dataset, systematic experiments were conducted on the two key stages of RAG: retrieval and generation. In the retrieval stage, multiple mainstream embedding models, including text-embedding-ada-002, the bge-large series, bge-m3, and Multilingual-E5-large-instruct, were evaluated, and the reranking model bge-reranker-v2-m3 was introduced for performance optimization and comparison. In the generation stage, representative Large Language Models (LLM), including Vicuna-13B, ChatGLM2-6B, Llama2-7B, and Claude2, were tested comprehensively. Experimental results show that LoRAG reveals the positioning challenges faced by current embedding methods in long-document retrieval, as well as the limitations of LLMs in balancing relevance and conciseness during generation, providing clear research directions for future method improvements.
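The focused-fragment construction strategy described in the abstract can be sketched in pure Python. The fragment size, the record fields, and the `make_question` stub (which stands in for an LLM prompt) are illustrative assumptions, not the paper's actual implementation; the key idea shown is that each generated QA pair keeps provenance back to the fragment it was grounded in, so retrieval can later be scored against that gold evidence.

```python
# Sketch of a focused-fragment QA construction pipeline (assumed design,
# not the paper's implementation): split a long document into fragments,
# pick one as the focus, and build a QA record with provenance.

def split_into_fragments(text, size=200):
    """Split a long document into fixed-size character fragments."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def make_question(fragment):
    """Placeholder for an LLM call that writes a question grounded in
    the fragment; here we just echo the fragment's opening words."""
    stub = " ".join(fragment.split()[:8])
    return f"According to the document, what is stated about: '{stub}...'?"

def build_qa_record(doc_id, text, focus_index, size=200):
    """Build one QA record whose answer span and provenance both point
    back at the chosen focus fragment."""
    fragments = split_into_fragments(text, size)
    focus = fragments[focus_index]
    return {
        "doc_id": doc_id,
        "question": make_question(focus),
        "answer_span": focus,               # gold evidence for the answer
        "focus_fragment_id": focus_index,   # provenance for retrieval scoring
    }

record = build_qa_record("law_0001", "Article 1. " + "x" * 1000,
                         focus_index=0, size=200)
```

Because every record carries `focus_fragment_id`, retrieval quality can be measured directly as whether the retriever returns the focus fragment, without any human relevance judgments.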

Key words: Retrieval-Augmented Generation (RAG), Large Language Model (LLM), long-document processing, benchmark dataset, automatic question-answer construction
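The two-stage retrieval pipeline evaluated in the abstract, embedding-based retrieval followed by reranking, can be illustrated with a toy sketch. The bag-of-words `embed` and the overlap-based `rerank` below stand in for the real models named in the abstract (e.g. bge-m3 and bge-reranker-v2-m3) and are assumptions for illustration only.

```python
# Toy retrieve-then-rerank sketch over long-document chunks. Stage 1 ranks
# all chunks cheaply by embedding similarity; stage 2 re-scores only the
# shortlist with a finer (here: token-overlap) scorer.
from collections import Counter
from math import sqrt

def embed(text):
    """Toy 'embedding': a bag-of-words Counter (real systems use dense vectors)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=3):
    """Stage 1: rank every chunk by similarity to the query, keep top-k ids."""
    q = embed(query)
    order = sorted(range(len(chunks)),
                   key=lambda i: cosine(q, embed(chunks[i])), reverse=True)
    return order[:k]

def rerank(query, chunks, candidate_ids, k=1):
    """Stage 2: re-score only the candidates with a finer relevance signal."""
    q_tokens = set(query.lower().split())
    def overlap(i):
        return len(q_tokens & set(chunks[i].lower().split())) / len(q_tokens)
    return sorted(candidate_ids, key=overlap, reverse=True)[:k]

chunks = [
    "the contract may be terminated by either party with notice",
    "annual revenue grew by twelve percent year over year",
    "the novel opens with a description of the river at dawn",
]
query = "when can the contract be terminated"
top = retrieve(query, chunks, k=2)      # cheap shortlist
best = rerank(query, chunks, top, k=1)  # refined final pick
```

The chunk ids returned by `rerank` can be compared against a QA record's gold fragment id to compute retrieval metrics such as recall@k, which is how a benchmark like LoRAG scores the retrieval stage.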

Abstract:

With the development of pretrained language models (PLM), Retrieval-Augmented Generation (RAG) has attracted wide attention as an emerging task. Evaluating RAG comprehensively and objectively can reveal the limitations of existing methods and point out research directions; however, existing studies lack systematic evaluation benchmarks for RAG, especially in long-document scenarios. To address this problem, an automatic question-answer construction strategy based on focused fragments was proposed to build large-scale QA datasets efficiently and accurately. Based on this strategy, LoRAG, the first bilingual RAG evaluation benchmark dataset dedicated to long documents, was constructed, covering English-Chinese bilingual documents from multiple domains such as law, finance, and literature; the English documents average 57 000 tokens in length and the Chinese documents 76 000 tokens. Using the LoRAG dataset, systematic experiments were conducted on the two key stages of RAG: retrieval and generation. In the retrieval stage, mainstream embedding models such as text-embedding-ada-002, the bge-large series, bge-m3, and Multilingual-E5-large-instruct were evaluated, and the reranking model bge-reranker-v2-m3 was introduced for performance optimization and comparison; in the generation stage, representative Large Language Models (LLM) including Vicuna-13B, ChatGLM2-6B, Llama2-7B, and Claude2 were tested comprehensively. Experimental results show that LoRAG effectively reveals the positioning difficulties of current embedding methods in long-document retrieval and the limitations of LLMs in balancing relevance and conciseness during generation, pointing out clear research directions for future method improvements.

Key words: retrieval-augmented generation, large language model, long-document processing, benchmark dataset, automatic question-answer construction
