How should we allocate compute between reasoning and retrieval iterations?

This explores how to split a fixed inference-compute budget between 'thinking harder' (reasoning tokens) and 'looking things up more' (retrieval/search iterations) — and the corpus suggests the answer is to treat them as two adaptive dials on the same budget, not a fixed ratio.

The corpus reframes the question: reasoning (thinking harder) and retrieval (searching more) aren't separate resources to ration but two dials on a single test-time-compute budget, and both should move with task difficulty. The foundational move is recognizing that search behaves like reasoning. Just as reasoning tokens show diminishing returns as you add more, search iterations follow the same monotonic-to-diminishing curve, which means models can trade reasoning budget against search budget along one shared axis to optimize answer quality (Does search budget scale like reasoning tokens for answer quality?). So 'how to allocate' isn't a fixed ratio; it's a routing problem.
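To make the shared-axis idea concrete, here is a minimal sketch of trading the two dials against each other. It assumes, purely for illustration, that both dials follow a log-shaped quality curve and that one reasoning unit costs the same as one search iteration; the function names and the `scale` parameters are hypothetical and do not come from the cited notes.

```python
import math

def marginal_gain(units_spent: int, scale: float) -> float:
    """Quality gained by one more unit, under an ASSUMED log-shaped curve."""
    return scale * (math.log1p(units_spent + 1) - math.log1p(units_spent))

def allocate(budget: int, reason_scale: float, search_scale: float) -> dict:
    """Give each budget unit to whichever dial currently has the larger marginal gain."""
    spent = {"reasoning": 0, "search": 0}
    scales = {"reasoning": reason_scale, "search": search_scale}
    for _ in range(budget):
        # Diminishing returns mean the better dial changes as units are spent.
        best = max(spent, key=lambda dial: marginal_gain(spent[dial], scales[dial]))
        spent[best] += 1
    return spent

if __name__ == "__main__":
    print(allocate(10, 1.0, 1.0))  # equal curves
    print(allocate(10, 2.0, 1.0))  # reasoning pays off more per unit
```

With equal scales the split comes out even; when one dial pays off more per unit, it receives more of the budget but not all of it, because its returns flatten first. In practice the curves would have to be estimated per task rather than assumed.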

The routing principle is adaptivity. Spending compute uniformly wastes it: easy prompts get over-served and hard ones starved. Reallocating the same total budget by difficulty beats simply using a bigger model (Can we allocate inference compute based on prompt difficulty?; How should we allocate compute budget at inference time?). The same logic applies to the reasoning-vs-retrieval decision at the step level: DeepRAG frames each reasoning step as a choice between retrieving externally and trusting the model's own parametric knowledge, and learns when to do which. Its roughly 22% accuracy gain comes largely from *not* retrieving when internal knowledge suffices, which strips out noise from unnecessary lookups (When should language models retrieve external knowledge versus use internal knowledge?). The cheapest retrieval iteration is the one you correctly skip.
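The difficulty-based reallocation described above can be sketched as a proportional split of a fixed total. This is an illustrative assumption, not the method of the cited work: the difficulty scores are taken as given (a real system would need a difficulty estimator), and `allocate_by_difficulty` and its `floor` parameter are hypothetical names.

```python
def allocate_by_difficulty(difficulties: list[float], total_budget: int,
                           floor: int = 1) -> list[int]:
    """Split a fixed budget across prompts in proportion to estimated difficulty.

    Every prompt gets at least `floor` units; the rest is shared by difficulty.
    """
    n = len(difficulties)
    if n == 0:
        return []
    if total_budget < floor * n:
        raise ValueError("budget too small for the per-prompt floor")
    spare = total_budget - floor * n
    weight_sum = sum(difficulties)
    if weight_sum == 0:
        shares = [spare / n] * n  # no signal: fall back to a uniform split
    else:
        shares = [spare * d / weight_sum for d in difficulties]
    budgets = [floor + int(s) for s in shares]
    # Hand out units lost to rounding, largest fractional share first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets

if __name__ == "__main__":
    # Hypothetical difficulty scores for an easy, medium, and hard prompt.
    print(allocate_by_difficulty([1, 3, 6], total_budget=100, floor=5))
```

The total spent is unchanged from a uniform split; only its distribution moves toward the harder prompts.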

There's a subtle trap on the reasoning side that directly shapes allocation: reasoning and retrieval compete for the same context window. Letting an agent reason without limit inside a single turn burns the context it needs to absorb evidence from later search rounds. On long-horizon tasks you therefore want per-turn reasoning caps, not just a global time budget, to keep iterative retrieval productive (Does limiting reasoning per turn improve multi-turn search quality?). Reasoning isn't free to lavish on early turns; over-spending it starves the retrieval loop downstream. Stateful memory workspaces ease this tension by carrying findings across cycles instead of re-deriving them, letting iterative evidence-gathering resolve contradictions through depth rather than brute repetition (Can reasoning systems maintain memory across retrieval cycles?).
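One way to picture a per-turn cap is as context-window accounting: reserve room for the evidence that later search rounds will bring in, then divide what is left among the remaining turns. The sketch below is a simplified assumption, not the procedure from the cited note; the function name, the flat `evidence_per_turn` estimate, and the `answer_reserve` parameter are all hypothetical.

```python
def per_turn_reasoning_cap(context_window: int, tokens_used: int,
                           turns_remaining: int, evidence_per_turn: int,
                           answer_reserve: int) -> int:
    """Reasoning tokens allowed this turn so later turns can still fit their evidence."""
    if turns_remaining < 1:
        raise ValueError("turns_remaining must be at least 1")
    # Context still free after holding back space for the final answer.
    free = context_window - tokens_used - answer_reserve
    # Space that retrieved evidence is expected to need in the remaining turns.
    reserved_for_evidence = evidence_per_turn * turns_remaining
    reasoning_pool = free - reserved_for_evidence
    # Spread the remaining pool evenly; never return a negative cap.
    return max(0, reasoning_pool // turns_remaining)

if __name__ == "__main__":
    cap = per_turn_reasoning_cap(context_window=32000, tokens_used=4000,
                                 turns_remaining=4, evidence_per_turn=3000,
                                 answer_reserve=2000)
    print(cap)  # 3500
```

Recomputing the cap at the start of each turn lets it tighten automatically as the window fills, which a single global limit would not do.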

Where the budget goes also depends on *what kind* of retrieval you're doing. Routing queries to the knowledge structure that fits the task (tables, graphs, catalogues, plain chunks) gets more out of each retrieval iteration than uniform RAG, so structure-matching reduces the iterations needed (Can routing queries to task-matched structures improve RAG reasoning?). And separating query-planning from answer-synthesis into distinct components reduces interference on multi-hop questions, meaning the architecture itself changes how much compute each phase deserves (Do hierarchical retrieval architectures outperform flat ones on complex queries?). The broader RAG picture echoes this: retrieval should be tightly coupled to reasoning and triggered dynamically, not on a fixed schedule (How should systems retrieve and reason with external knowledge?).

One caveat the corpus is blunt about: you can't allocate your way out of a weak model. Extra inference compute only pays off when training has installed a reasoning protocol that makes additional tokens productive; non-reasoning models don't close the gap no matter how much budget you throw at them (Can non-reasoning models catch up with more compute?). The thing you didn't know you wanted to know: the best allocation may be a hybrid that doesn't choose at all. Balanced investment across complementary lookup-and-compute mechanisms follows a U-shaped curve where the middle beats either extreme (Can lookup memory and computation work together better than either alone?), and even reward evaluation improves when you let it reason before scoring (Can reward models benefit from reasoning before scoring?). Allocation, in the end, is less about dividing a pie than about teaching the system to decide, per prompt, per step, and per turn, which lever to pull next.

Sources 12 notes

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Can reasoning systems maintain memory across retrieval cycles?

ComoRAG demonstrates that iterative evidence acquisition with a persistent memory workspace outperforms stateless multi-step retrieval by detecting and resolving contradictions through deeper exploration, achieving up to 11% gains on complex queries.

Can routing queries to task-matched structures improve RAG reasoning?

StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router that chooses among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.

How should systems retrieve and reason with external knowledge?

Research shows retrieval should adapt dynamically rather than follow fixed patterns, reasoning and retrieval must integrate closely, and embedding-based retrieval has fundamental limits requiring architectural alternatives.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can lookup memory and computation work together better than either alone?

Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
