Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

Paper · arXiv 2406.13121 · Published June 19, 2024

Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs’ ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs’ performance on in-context retrieval and reasoning. Our findings reveal LCLMs’ surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow.

By consolidating complex pipelines into a unified model, LCLMs ameliorate issues like cascading errors [8] and cumbersome optimization [30, 61], offering a streamlined end-to-end approach to model development. Moreover, techniques such as adding instructions [27, 59, 14], incorporating few-shot examples [12], and leveraging demonstrations via chain-of-thought prompting [44, 60] can be seamlessly integrated to optimize LCLMs for the task at hand.

However, realizing the full potential of LCLMs requires rigorous evaluation on truly long-context tasks useful in real-world applications. Existing benchmarks fall short in this regard. To address this, we introduce LOFT, a suite of six tasks consisting of 35 datasets which span text, visual, and audio modalities, to push LCLMs to their limits and gauge their real-world impact.

• Retrieval: LCLMs can directly ingest and retrieve information from a corpus, eliminating the need for separate dual-encoder models [26, 43, 31, 47]. This addresses long-standing challenges in retrieval systems such as multi-hop reasoning, instruction following, and few-shot task adaptation. We assess retrieval performance across text, visual, and audio modalities.

• Retrieval-Augmented Generation (RAG): LCLMs simplify RAG pipelines by directly reasoning over a corpus, overcoming challenges like query decomposition [46] and mitigating cascading errors due to retrieval misses [8, 38].

• SQL: We explore LCLMs’ capacity to process entire databases as text, enabling natural language database querying and bypassing conversion to a formal query language like SQL [69]. This potentially enables more expressive querying and handling of noisy or mixed-structured data. Importantly, it can also be seen as a case study representing the ability of LCLMs to subsume other types of structured data and complicated formal languages to query them, such as knowledge graphs that often require bespoke solutions.

• Many-Shot ICL: LCLMs can scale the number of examples from the tens in the traditional in-context learning setup to hundreds or thousands [65, 10], removing the need to find the optimal set of few-shot examples to use [39].
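The tasks above share a common "corpus-in-context" setup: rather than calling an external retriever or database engine, the entire corpus is serialized directly into the LCLM's prompt, followed by few-shot examples and the query. A minimal sketch of that prompt assembly is below; the document-ID markers and section headers are illustrative placeholders, not the exact LOFT prompt format.

```python
# Hypothetical corpus-in-context prompt builder. The "[doc_id]" convention and
# "== Corpus ==" / "== Examples ==" headers are assumptions for illustration.

def build_cic_prompt(instruction, corpus, few_shot_examples, query):
    """Assemble one long-context prompt from a corpus and a query.

    corpus: list of (doc_id, text) pairs
    few_shot_examples: list of (question, answer_doc_id) pairs
    """
    parts = [instruction, "== Corpus =="]
    for doc_id, text in corpus:
        parts.append(f"[{doc_id}] {text}")
    parts.append("== Examples ==")
    for question, answer_id in few_shot_examples:
        parts.append(f"Q: {question}\nA: [{answer_id}]")
    parts.append(f"Q: {query}\nA:")
    return "\n".join(parts)

prompt = build_cic_prompt(
    "Answer each question with the ID of the supporting document.",
    corpus=[("D1", "Paris is the capital of France."),
            ("D2", "Tokyo is the capital of Japan.")],
    few_shot_examples=[("What is the capital of Japan?", "D2")],
    query="What is the capital of France?",
)
print(prompt)
```

Because the corpus is part of the prompt itself, instructions, few-shot examples, and chain-of-thought demonstrations can all be varied without retraining any component.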

We evaluate SQL-like reasoning on Spider, a single-turn text-to-SQL dataset [67], and SparC, its multi-turn variant [68]. The corpus for each query is an associated database of one or more tables. To construct the corpus for each context length, we must select databases whose tables, taken together, fit within that context length. We do this by selecting the largest databases that fit. This necessarily means that the databases used for the 1M-token setting would not fit into the smaller context lengths.
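This corpus-construction step can be sketched as a simple filter-then-sort over serialized databases. The sketch below is illustrative: `token_len` is a crude whitespace proxy standing in for the model's real tokenizer.

```python
# Illustrative sketch of corpus construction: for a target context length, keep
# only databases whose serialized tables fit, then take the largest first.

def token_len(text):
    # Crude proxy; a real setup would count tokens with the model's tokenizer.
    return len(text.split())

def select_databases(databases, context_length):
    """databases: dict mapping database name -> serialized text of its tables."""
    fitting = {name: text for name, text in databases.items()
               if token_len(text) <= context_length}
    # Largest-first: the databases chosen for a 1M-token corpus would therefore
    # not fit into the smaller context lengths.
    return sorted(fitting, key=lambda n: token_len(fitting[n]), reverse=True)

dbs = {"tiny": "a b", "mid": "a b c d e", "big": "a " * 50}
print(select_databases(dbs, 10))  # ['mid', 'tiny']
```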

We find that placing the gold documents of the few-shot queries at the end of the corpus improves recall.

The traditional SQL pipeline uses a trained semantic parser to translate the natural language input into a SQL query. Then, a separate SQL interpreter is used to execute the SQL query over the database.
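The two-stage pipeline above can be sketched as follows, using sqlite3 as the interpreter. Here `semantic_parse` is a hard-coded placeholder for a trained text-to-SQL model (e.g. one trained on Spider), and the schema and data are invented for illustration.

```python
# Minimal sketch of the traditional text-to-SQL pipeline:
#   step 1: a semantic parser maps natural language to a SQL query
#   step 2: a separate SQL interpreter executes the query over the database
import sqlite3

def semantic_parse(question):
    # Placeholder for a learned parser; returns a fixed query for illustration.
    return "SELECT COUNT(*) FROM singers WHERE country = 'France'"

def answer(question, conn):
    sql = semantic_parse(question)       # step 1: NL -> SQL
    return conn.execute(sql).fetchall()  # step 2: execute over the database

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singers (name TEXT, country TEXT)")
conn.executemany("INSERT INTO singers VALUES (?, ?)",
                 [("Edith", "France"), ("Elvis", "USA")])
print(answer("How many singers are from France?", conn))  # [(1,)]
```

An error in step 1 (a mis-parsed query) propagates unchecked into step 2, which is exactly the cascading-error failure mode that reasoning directly over the serialized database avoids.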