How should compute budgets be allocated across multi-stage RAG architectures?
This explores how to divide a fixed compute budget across the stages of a multi-step RAG pipeline (filtering, retrieval, reasoning, generation) — the corpus reframes the question from 'how much total compute' to 'where in the pipeline each dollar of compute earns the most.'
The corpus's strongest answer is that you shouldn't spend uniformly anywhere: not across stages and not across queries. The cleanest blueprint is hierarchical splitting: route the cheap, high-volume work (query reformulation, passage pruning, citation) to a small fast model and reserve the expensive model for final synthesis. "Can smaller models handle RAG filtering while larger models focus on synthesis?" shows this isn't a compromise: HiFi-RAG gets both lower cost *and* better answers than putting a big model everywhere, because most pipeline stages are filtering tasks that don't reward extra reasoning.
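As a rough illustration of the hierarchical split, the sketch below assigns each stage to a model tier and compares the bill against putting the large model everywhere. The prices, token counts, and names (`PRICE_PER_1K`, `STAGE_TIER`, `pipeline_cost`) are all made up for the example and do not come from HiFi-RAG. It only shows the cost side; it says nothing about answer quality.

```python
# Hypothetical price per 1K tokens, in arbitrary units (not real provider prices).
PRICE_PER_1K = {"small": 1, "large": 20}

# Stage -> model tier, following the split described above.
STAGE_TIER = {
    "query_reformulation": "small",
    "passage_pruning": "small",
    "citation": "small",
    "final_synthesis": "large",
}


def pipeline_cost(tokens_per_stage, stage_tier):
    """Sum the cost of every stage given its token count and assigned tier."""
    return sum(
        tokens_per_stage[stage] / 1000 * PRICE_PER_1K[stage_tier[stage]]
        for stage in tokens_per_stage
    )


# Illustrative token counts per stage for one query (assumed, not measured).
tokens = {
    "query_reformulation": 500,
    "passage_pruning": 6000,
    "citation": 1500,
    "final_synthesis": 2000,
}

uniform_large = {stage: "large" for stage in tokens}

split_cost = pipeline_cost(tokens, STAGE_TIER)
uniform_cost = pipeline_cost(tokens, uniform_large)

print(f"hierarchical split: {split_cost}")
print(f"large model everywhere: {uniform_cost}")
```

With these assumed numbers most tokens flow through the filtering stages, so moving those stages to the cheap tier removes most of the spend while leaving the synthesis budget untouched.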
The deeper principle underneath this is adaptive allocation by difficulty. "Can we allocate inference compute based on prompt difficulty?" and "How should we allocate compute budget at inference time?" both find that giving easy prompts less and hard prompts more beats a flat budget, and "Can inference compute replace scaling up model size?" shows inference compute can even substitute for a bigger model on hard queries. Translated to RAG: the budget question isn't 'how much per stage' but 'how hard is *this* query, and which stage is the bottleneck for it.' A factual lookup needs little reasoning; a compositional, multi-hop query needs the synthesis stage to think hard.
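One simple way to picture difficulty-adaptive allocation is a proportional split with a per-query floor. The sketch below is an assumption-laden toy, not the method from any of the notes: `allocate_budget` is a hypothetical helper, and the integer difficulty scores are assumed to come from some upstream estimator that is not shown.

```python
def allocate_budget(difficulties, total_tokens, floor=64):
    """Split total_tokens across queries in proportion to difficulty.

    difficulties: non-negative integer scores, one per query (assumed to be
    produced by some upstream difficulty estimator, not shown here).
    floor: minimum tokens every query receives.
    Integer division may leave a few tokens unspent.
    """
    n = len(difficulties)
    if n == 0:
        return []
    if floor * n > total_tokens:
        raise ValueError("budget too small for the per-query floor")
    spare = total_tokens - floor * n
    total_difficulty = sum(difficulties)
    if total_difficulty == 0:
        # No difficulty signal: fall back to a flat split.
        return [total_tokens // n] * n
    return [floor + spare * d // total_difficulty for d in difficulties]


# Two easy lookups and one hard multi-hop query sharing 3000 tokens.
adaptive = allocate_budget([1, 1, 8], total_tokens=3000, floor=100)
flat = allocate_budget([0, 0, 0], total_tokens=3000, floor=100)
print(adaptive)
print(flat)
```

The floor keeps easy queries from being starved entirely; everything above it follows the difficulty signal, so the quality of the estimator is what the scheme ultimately depends on.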
But there's a ceiling worth knowing before you over-invest in the generation stage. "Can non-reasoning models catch up with more compute?" finds that piling inference compute onto a model that wasn't trained to reason yields little, because the training regime caps the payoff. So the allocation decision interacts with model choice: a reasoning-trained synthesis model converts extra tokens into accuracy, while a non-reasoning one largely doesn't.
Where it gets interesting is that the most effective spending may not be 'more compute on generation' at all, but compute spent on getting the right structure to the model in the first place. "Can routing queries to task-matched structures improve RAG reasoning?" (StructRAG) routes each query to a task-matched knowledge structure (tables, graphs, algorithms, catalogues, or chunks) and beats uniform retrieval. "How should retrieval and reasoning integrate in RAG systems?" argues retrieval and reasoning should be tightly coupled via process-level supervision rather than treated as separate budget lines. "Why does retrieval-augmented generation fail in production?" warns that the real failure is single-pass architecture and embeddings that measure association rather than relevance: no amount of generation-stage compute fixes a retrieval stage feeding the model the wrong passages.
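The routing idea can be pictured as a function from a query to one of the structure types. The sketch below uses a toy keyword matcher purely as a stand-in: StructRAG's router is a trained model, and the cue words and their pairing with structures here are invented for illustration, as is the name `keyword_router`.

```python
from typing import Dict, List

# Hypothetical cue words per structure type. These are NOT from StructRAG;
# its router is a trained model, not a keyword table.
CUES: Dict[str, List[str]] = {
    "table": ["compare", "average", "how many"],
    "graph": ["related to", "connected", "path between"],
    "algorithm": ["steps to", "plan", "procedure"],
    "catalogue": ["summarize", "overview", "list all"],
}


def keyword_router(query: str) -> str:
    """Return the first structure whose cue appears in the query.

    Falls back to plain chunks when nothing matches.
    """
    q = query.lower()
    for structure, cues in CUES.items():
        if any(cue in q for cue in cues):
            return structure
    return "chunk"


for example in [
    "Compare revenue across 2021 and 2022",
    "Summarize the main findings",
    "Who founded Acme?",
]:
    print(example, "->", keyword_router(example))
```

The point of the sketch is only the shape of the decision: a cheap step runs before retrieval and determines what form of knowledge the expensive stage receives.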
The most provocative cross-domain reframe comes from decomposition. "Can extreme task decomposition enable reliable execution at million-step scale?" shows that when you break a task into minimal subtasks and vote at each step, *small non-reasoning models suffice*, inverting the instinct to spend big on hard problems. "Can byte-level models match tokenized performance with better efficiency?" makes the same move at the byte level, concentrating compute on high-entropy (surprising) regions. The unifying lesson across all of these: the best-spent RAG budget is the one that matches compute to where uncertainty actually lives, which is usually the retrieval and routing stages, not a uniformly expensive generator.
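The decompose-and-vote pattern can be sketched with plain majority voting over repeated samples of each subtask. This is a simplified stand-in, not MAKER's actual voting rule or error-flagging logic; `vote_step`, `run_decomposed`, and the deterministic `noisy` sampler (which plays the role of a small model that is sometimes wrong) are hypothetical names for the example.

```python
from collections import Counter
from typing import Callable, List


def vote_step(sample: Callable[[int], str], n_samples: int = 5) -> str:
    """Draw n_samples answers for one subtask and return the most common one."""
    answers = [sample(i) for i in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


def run_decomposed(steps: List[Callable[[int], str]], n_samples: int = 5) -> List[str]:
    """Run every minimal subtask independently, voting at each step."""
    return [vote_step(step, n_samples) for step in steps]


def noisy(correct: str, wrong: str) -> Callable[[int], str]:
    """Stand-in for a small model: wrong on every third draw, right otherwise."""
    return lambda i: wrong if i % 3 == 0 else correct


results = run_decomposed([noisy("A", "x"), noisy("B", "y")])
print(results)
```

Each sampler here is wrong on two of five draws, yet the per-step vote still recovers the correct answer, which is the intuition behind spending compute on many cheap samples of small subtasks instead of one expensive pass.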
Sources (10 notes)
HiFi-RAG demonstrates that routing query reformulation, passage pruning, and citation to cheaper models like Gemini Flash while reserving expensive models like Gemini Pro for final generation produces both lower cost and better answers than uniform deployment.
Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.
Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.
Research shows that tight coupling between retrieval and reasoning—via Markov Decision Processes and step-level feedback—substantially improves accuracy and efficiency. Graph-based retrieval and metacognitive monitoring address limitations of vector embeddings and prevent retrieval failures on compositional tasks.
RAG systems fail in production due to embedding inadequacy (measuring association not relevance), missing enterprise requirements (attribution, security, compliance), and single-pass architecture limitations. Known solutions exist but aren't implemented in demo systems.
MAKER solves million-step tasks with zero errors by decomposing them into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.
The Byte Latent Transformer (BLT) dynamically segments bytes into patches based on next-byte entropy, allocating more compute to high-entropy regions and less to predictable ones. At 8B parameters, BLT matches tokenized baselines while reducing inference cost and improving robustness to typos and cross-lingual transfer.