Can long-context models resolve retriever-reader imbalance?
Traditional RAG systems forced retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense?
Standard RAG retrieves 100-word paragraphs. This forces the retriever to locate the precise passage containing the answer across a corpus of potentially 22 million units. The task is "find the needle." The reader then extracts the answer from the found passage — a relatively easy task. The retriever carries almost all the weight.
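To make the imbalance concrete, here is a minimal sketch of the fine-grained retrieval step. It is illustrative only: `embed` is a hashed bag-of-words placeholder standing in for whatever dense dual encoder a real system uses, and the function names are assumptions rather than any particular library's API.

```python
import numpy as np

def embed(texts, dim=256):
    """Placeholder encoder (hashed bag-of-words); a real retriever
    would use a trained dense dual encoder here."""
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    return vecs

def retrieve(query, passages, passage_vecs, k=1):
    """Fine-grained retrieval: rank every ~100-word passage by cosine
    similarity and hope the one answer-bearing passage lands in the top-k."""
    q = embed([query])[0]
    sims = passage_vecs @ q / (
        np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(q) + 1e-9
    )
    return [passages[i] for i in np.argsort(-sims)[:k]]

# Usage: passage_vecs = embed(passages) is computed once over the whole
# corpus (millions of rows); retrieve() must then pinpoint the exact passage.
```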
This design was rational in the era when language models had 512–2048 token context windows. Longer retrieval units were unusable because the reader could not process them. The retriever had to do the precision work because the reader could not.
LongRAG (2024) reassesses this design choice given long-context LLMs that handle 128K tokens. Instead of 100-word units, it retrieves 4K-token units constructed by grouping related documents. The corpus shrinks from 22M to 600K units, so the retriever's job becomes "find the right section" rather than "find the exact needle." Recall@1 on NQ improves from 52% to 71%, and Recall@2 on HotpotQA from 47% to 72%.
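A rough sketch of what building those long units could look like, assuming each document comes with a set of related-document ids (LongRAG derives relatedness from hyperlink structure; the greedy budget heuristic and names below are illustrative assumptions, not the paper's exact procedure):

```python
def group_into_long_units(docs, related, max_tokens=4096):
    """Greedily merge each document with related neighbours until the unit
    approaches the ~4K-token budget. `docs` maps id -> text, `related` maps
    id -> ids of linked documents; tokens are approximated by whitespace split."""
    units, assigned = [], set()
    for doc_id, text in docs.items():
        if doc_id in assigned:
            continue
        assigned.add(doc_id)
        unit, budget = [text], max_tokens - len(text.split())
        for nbr in related.get(doc_id, []):
            if nbr in assigned or nbr not in docs:
                continue
            cost = len(docs[nbr].split())
            if cost <= budget:
                unit.append(docs[nbr])
                assigned.add(nbr)
                budget -= cost
        units.append("\n\n".join(unit))
    return units
```

The point of the sketch is the corpus-size effect: grouping shrinks the number of retrieval units by more than an order of magnitude, which is what turns "find the right section" into an easier ranking problem than "find the exact needle."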
The reader then receives the top-k long units concatenated (~30K tokens) and performs zero-shot answer extraction. The LLM handles what it is good at (understanding language in rich context) while the retriever handles what it is good at (coarse relevance ranking).
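A sketch of the heavy-reader step under the same assumptions: `ranked_units` is whatever the coarse retriever returned, and `llm_complete` is a hypothetical stand-in for a long-context completion client (the prompt wording is illustrative, not LongRAG's exact prompt):

```python
def answer_with_long_reader(question, ranked_units, llm_complete, k=4):
    """Concatenate the top-k long units (on the order of 10-30K tokens)
    and ask a long-context LLM to extract the answer zero-shot."""
    context = "\n\n".join(ranked_units[:k])
    prompt = (
        "Answer the question using the passages below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return llm_complete(prompt)
```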
The broader principle: RAG architecture design assumptions were frozen at the constraints of their era. As those constraints lift (context windows, model capability, inference cost), the optimal design changes. "Best practices" based on 2020 constraints may be anti-patterns by 2025 standards.
Source: RAG
Related concepts in this collection
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
  Relation: the reader doing more with longer context is analogous to more compute at inference enabling more capable responses.
- Does limiting reasoning per turn improve multi-turn search quality?
  When language models engage in iterative search cycles, does capping reasoning at each turn, rather than just total compute, help preserve context for subsequent retrievals and improve overall search effectiveness?
  Relation: a related trade-off in the opposite direction: too much context per turn degrades iterative search, so LongRAG should be scoped to single-turn answering, not iterative research.
- Can one model compress all conversation memory and eliminate retrieval?
  Instead of storing and retrieving discrete memories, can a single LLM compress all past conversations into event recaps, user portraits, and relationship dynamics? This explores whether compression-based memory avoids the bottleneck of traditional retrieval systems.
  Relation: COMEDY takes the imbalance resolution further: rather than shifting burden to the reader, it eliminates the retriever entirely by merging retrieval and generation into a single compressive operation.
- Does reasoning ability actually degrade with longer inputs?
  Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
  Relation: challenges the burden-shifting thesis: FLenQA shows reasoning accuracy dropping from 0.92 to 0.68 with just 3,000 tokens of irrelevant content; shifting burden to the reader assumes the reader can handle longer inputs, but reasoning degrades with length even far below context-window limits.
- Can long-context LLMs replace retrieval-augmented generation systems?
  Explores whether loading entire corpora into LLM context windows can eliminate the need for separate retrieval systems, and what task types this approach handles well or poorly.
  Relation: LOFT validates the burden shift for semantic tasks (LCLMs rival RAG systems) while exposing its limits: compositional SQL-like tasks require structured query logic that attention-based reading cannot provide, bounding where the heavy-reader approach works.
- Can models precompute answers before users ask questions?
  Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
  Relation: a complementary burden shift in a different direction: LongRAG shifts work from retriever to reader at query time, while sleep-time compute shifts work from query time to pre-query time; both respond to the same insight that the retriever-does-everything assumption is an artifact of historical context constraints, not an architectural necessity.
Original note title: heavy retriever / light reader imbalance is a historical artifact — long-context LLMs resolve it by shifting burden to the reader