With the development of Pretrained Language Models (PLMs), Retrieval-Augmented Generation (RAG) has attracted widespread attention as an emerging paradigm. A comprehensive and objective evaluation of RAG is essential for revealing the limitations of existing methods and indicating future research directions. However, systematic evaluation benchmarks for RAG are still lacking, especially in the context of long documents. To address this issue, an automatic question-answer construction strategy based on focused fragments was proposed, aiming to build large-scale QA datasets efficiently and accurately. Based on this strategy, LoRAG, the first bilingual RAG evaluation benchmark for long documents, was constructed; it covers English and Chinese documents from multiple domains such as law, finance, and literature, with an average document length of 57,000 tokens in English and 76,000 tokens in Chinese. Using LoRAG, systematic experiments were conducted on the two key stages of RAG, retrieval and generation. In the retrieval stage, several mainstream embedding models, including text-embedding-ada-002, the bge-large series, bge-m3, and Multilingual-E5-large-instruct, were evaluated, and the reranking model bge-reranker-v2-m3 was introduced for performance optimization and comparison. In the generation stage, representative Large Language Models (LLMs), including Vicuna-13B, ChatGLM2-6B, Llama2-7B, and Claude2, were tested comprehensively. Experimental results show that LoRAG exposes the positioning challenges current embedding methods face in long-document retrieval, as well as the limitations of LLMs in balancing relevance and conciseness during generation, providing clear directions for future method improvements.
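The two-stage retrieve-then-rerank pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual setup: the toy bag-of-words embedding stands in for dense models such as bge-m3 or text-embedding-ada-002, and the reranker placeholder reuses the same cosine score where the evaluated system would call a cross-encoder such as bge-reranker-v2-m3.

```python
import math
from collections import Counter


def embed(text):
    # Toy bag-of-words "embedding" (assumption for illustration only);
    # a real system would encode with a dense model such as bge-m3.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, chunks, k=3):
    # Stage 1: rank all document chunks by embedding similarity
    # and keep the top-k candidates.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def rerank(query, candidates, k=1):
    # Stage 2: placeholder for a cross-encoder reranker
    # (e.g. bge-reranker-v2-m3); here it simply re-scores with
    # the same cosine similarity to show where reranking slots in.
    q = embed(query)
    ranked = sorted(candidates, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

In a long-document setting the chunks would come from splitting a 57,000-token document, which is exactly where first-stage embedding models struggle to position the relevant fragment, motivating the second reranking pass.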