LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
In the traditional RAG framework, the basic retrieval units are normally short. Common retrievers like DPR normally work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the 'needle' unit, while the reader only needs to extract answers from the short retrieved units. This imbalanced design, with a 'heavy' retriever and a 'light' reader, can lead to sub-optimal performance. To alleviate the imbalance, we propose a new framework, LongRAG, consisting of a 'long retriever' and a 'long reader'. LongRAG processes the entire Wikipedia into 4K-token units, which are 30x longer than before. By increasing the unit size, we significantly reduce the total number of units from 22M to 600K. This greatly lowers the burden on the retriever and leads to remarkable retrieval scores: answer recall@1=71% on NQ (previously 52%) and answer recall@2=72% on HotpotQA (full-wiki) (previously 47%). We then feed the top-k retrieved units (≈ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ and 64.3% on HotpotQA (full-wiki), which is on par with the SoTA model.
Retrieval-Augmented Generation (RAG) methods have long been employed to enhance large language models (LLMs) (Mialon et al., 2023). By leveraging a standalone retrieval component over an external corpus, knowledge in the form of natural language can be offloaded from the parametric knowledge of LLMs. The existing RAG framework tends to use short retrieval units, such as 100-word passages in popular open-domain question answering tasks (Chen et al., 2017; Lewis et al., 2020; Karpukhin et al., 2020). The retriever is tasked with finding the "needle" (i.e., the precise tiny retrieval unit) from the "haystack" (i.e., the massive corpus with tens of millions of information units). Subsequently, the retrieved units are passed to the reader to generate the final response. In contrast, the reader only needs to extract answers from these retrievals, which is a fairly easy task. This imbalanced design, with a "heavy" retriever and a "light" reader, puts too much pressure on the retriever. Therefore, state-of-the-art RAG models (Izacard and Grave, 2020b) need to recall a huge number of units, such as the top-100 or even more, combined with an additional complex re-ranker, to achieve strong performance. Moreover, short retrieval units can be semantically incomplete due to document truncation, which causes information loss and ultimately restricts end performance. This traditional design choice of the RAG framework was made in an era when NLP models were heavily restricted in their ability to handle long contexts. With the recent advances in long-context language models, the reader can potentially handle up to 128K or even millions of tokens as input (Reid et al., 2024; Achiam et al., 2023). In this paper, we revisit this design choice for open-domain question answering and propose the LongRAG framework as a solution to balance the workload between the retriever and the reader, as illustrated in Figure 1.
Long Retrieval Unit: By using entire Wikipedia documents or grouping multiple related documents, we can construct long retrieval units with more than 4K tokens. This design also significantly reduces the corpus size (the number of retrieval units in the corpus), which makes the retriever's task much easier. Additionally, long retrieval units improve information completeness, avoiding ambiguity or confusion.
Long Retriever: The long retriever identifies coarse-grained relevant information for the given query by searching through all the long retrieval units in the corpus. The top 4 to 8 retrieval units are concatenated as the retrieved long context for the next step.
Long Reader: The long reader further extracts answers from the concatenation of retrievals, which is normally around 30K tokens. We simply prompt an existing long-context LLM (such as Gemini or GPT-4) with the question to produce the answers, as sketched below.
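To make the pipeline above concrete, the Python sketch below groups hyperlink-related Wikipedia documents into long retrieval units of roughly 4K tokens and then prompts a long-context reader with the concatenation of the retrieved units and the question. The greedy grouping heuristic, the `num_tokens` and `llm` callables, and the prompt wording are illustrative assumptions for exposition, not the exact implementation used in LongRAG.

```python
from typing import Callable, Dict, List

MAX_UNIT_TOKENS = 4096  # target size of a long retrieval unit (~4K tokens)


def build_long_units(docs: Dict[str, str],
                     links: Dict[str, List[str]],
                     num_tokens: Callable[[str], int]) -> List[str]:
    """Group hyperlink-related Wikipedia documents into long retrieval units.

    `docs` maps a title to its full text; `links` maps a title to related
    titles (e.g., hyperlinked articles). Each document is greedily merged
    with related documents until the unit approaches MAX_UNIT_TOKENS.
    """
    assigned, units = set(), []
    for title, text in docs.items():
        if title in assigned:
            continue
        unit, budget = [text], MAX_UNIT_TOKENS - num_tokens(text)
        assigned.add(title)
        for rel in links.get(title, []):
            if rel in assigned or rel not in docs:
                continue
            cost = num_tokens(docs[rel])
            if cost <= budget:
                unit.append(docs[rel])
                budget -= cost
                assigned.add(rel)
        units.append("\n\n".join(unit))
    return units


def long_reader(question: str, retrieved_units: List[str],
                llm: Callable[[str], str]) -> str:
    """Long reader: prompt a long-context LLM with the concatenation of the
    top retrieved units (roughly 30K tokens in total) plus the question."""
    context = "\n\n".join(retrieved_units)
    prompt = ("Answer the question based on the given passages.\n\n"
              f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:")
    return llm(prompt)  # any long-context model, e.g., Gemini or GPT-4
```

In this sketch the corpus-size reduction falls out of the grouping step: merging related documents into ~4K-token units is what shrinks 22M passages to roughly 600K retrieval units.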
Recent work has focused on improving the retriever (Karpukhin et al., 2020; Xiong et al., 2020a; Qu et al., 2020; Xiong et al., 2020b; Khalifa et al., 2023), enhancing the reader (Izacard and Grave, 2020b; Cheng et al., 2021; Yu et al., 2021; Borgeaud et al., 2022), fine-tuning the retriever and reader jointly (Yu, 2022; Izacard et al., 2022; Singh et al., 2021; Izacard and Grave, 2020a), and integrating the retriever with the black-box language model (Yu et al., 2023; Shi et al., 2023; Trivedi et al., 2022). However, the impact of document granularity on the effectiveness and efficiency of the retrieval-augmented generation pipeline remains underexplored.
Different approaches have been proposed to mitigate computational issues, including sliding memory windows and chunk segmentation (Hao et al., 2022; Ratner et al., 2023; Zhu et al., 2024b). FlashAttention (Dao et al., 2022), which reduces the memory overhead of computing exact attention, has also been a pivotal strategy.
To enable length extrapolation, the RoPE (Su et al., 2021) and ALiBi (Press et al., 2021) position encodings have shown potential and have been widely used in the literature. Recent endeavors have explored diverse strategies to tackle this challenge, mainly position reorganization (Jin et al., 2024; An et al., 2024) and position interpolation.
Our proposed LongRAG framework comprises two components: the Long Retriever and the Long Reader. An illustrative example of these two components is depicted in Figure 2.
3.1 Long Retriever
The traditional RAG framework employs smaller retrieval units and prioritizes retrieving the exact fine-grained short context containing the answer. In contrast, our proposed LongRAG framework places greater emphasis on recall, aiming to retrieve relevant context at a much coarser granularity. This design choice shifts more of the burden from the retriever to the reader, which must extract the exact answers from the relevant context.
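As an illustration of this recall-oriented, coarse-grained retrieval, the sketch below scores each long unit by the cosine similarity between the query and the unit's best-matching chunk, then returns the top-k units. The `embed` function and the max-over-chunks aggregation are assumptions made for exposition, not necessarily the exact scoring used in LongRAG.

```python
import numpy as np
from typing import Callable, List


def retrieve_top_units(query: str,
                       long_units: List[List[str]],
                       embed: Callable[[str], np.ndarray],
                       k: int = 4) -> List[int]:
    """Recall-oriented long retrieval: each long unit (given as a list of its
    chunks) is scored by the cosine similarity of its best-matching chunk to
    the query; the indices of the top-k units are returned."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = []
    for chunks in long_units:
        chunk_embs = np.stack([embed(c) for c in chunks])
        chunk_embs = chunk_embs / np.linalg.norm(chunk_embs, axis=1,
                                                 keepdims=True)
        scores.append(float(np.max(chunk_embs @ q)))  # best-matching chunk
    return list(np.argsort(scores)[::-1][:k])
```

Because only a handful of coarse units need to be recalled (rather than the top-100 short passages), the retrieved context can simply be concatenated and handed to the long reader without a re-ranking stage.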