Can test-time scaling work through retrieval rather than reasoning?

This explores whether you can buy better answers at inference time by searching more, not just by thinking longer — treating retrieval as its own compute axis the way reasoning tokens are.

Test-time scaling is the idea that spending more compute at inference buys better answers. The question here is whether it can run through retrieval instead of (or alongside) chain-of-thought reasoning. The corpus says yes, and surprisingly cleanly: search budget follows the *same* scaling curve as reasoning tokens. Agentic deep-research systems improve monotonically with more search iterations, then hit diminishing returns, in a pattern that mirrors the reasoning-token relationship almost exactly (see "Does search budget scale like reasoning tokens for answer quality?" and "Do search steps follow the same scaling rules as reasoning tokens?"). The practical upshot is a new knob: you can trade reasoning budget against search budget to optimize answer quality, which reframes 'deep research' as a test-time scaling problem rather than a model-capability problem (see "How does search scale like reasoning in agent systems?").
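
To make the trade-off between the two budgets concrete, here is a minimal sketch. It assumes both axes follow a saturating exponential curve, which is only one convenient shape with monotonic gains and diminishing returns; the notes do not specify a functional form. The function names and the (ceiling, rate) numbers are hypothetical, chosen for illustration rather than measured.

```python
import math

def quality_gain(units: int, ceiling: float, rate: float) -> float:
    """Assumed saturating curve: monotonic, with diminishing returns."""
    return ceiling * (1.0 - math.exp(-rate * units))

def marginal_gain(units_spent: int, ceiling: float, rate: float) -> float:
    """Extra quality from spending one more unit on an axis."""
    return quality_gain(units_spent + 1, ceiling, rate) - quality_gain(units_spent, ceiling, rate)

def split_budget(total_units: int, reasoning=(0.5, 0.30), search=(0.4, 0.15)) -> dict:
    """Greedily give each budget unit to the axis with the larger marginal gain.

    `reasoning` and `search` are hypothetical (ceiling, rate) pairs,
    not values taken from any experiment.
    """
    params = {"reasoning": reasoning, "search": search}
    spent = {"reasoning": 0, "search": 0}
    for _ in range(total_units):
        best = max(spent, key=lambda axis: marginal_gain(spent[axis], *params[axis]))
        spent[best] += 1
    return spent

allocation = split_budget(20)
print(allocation)
```

Because both assumed curves are concave, handing out units one at a time by marginal gain is enough to find the best split. With real systems the curves would have to be estimated empirically, and the two axes may interact rather than add up independently.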

The cleanest way to see why this works is the field's main taxonomic split: *internal* scaling (training a model to reason autonomously) versus *external* scaling (search and verification bolted on at inference). These aren't rivals; internal builds capability while external extracts performance from capability you already have (see "How do internal and external test-time scaling compare?"). Retrieval lives squarely on the external side. That framing matters because there's evidence the *specific* external method barely matters: once you control for total compute, best-of-N and MCTS-style tree search converge in reasoning accuracy. What actually moves the needle is how much you spend, how wide the search scope is, and how reliable your reward function is, not the algorithm's cleverness (see "Does the choice of reasoning framework actually matter for test-time performance?").
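
A toy best-of-N loop illustrates the two levers named above, spend and scorer reliability. This is a sketch under strong assumptions: a "candidate" is just a number standing in for its true quality, and the "reward model" is that quality plus Gaussian noise. The helper names (`generate`, `noisy_score`, `mean_quality`) are hypothetical and do not come from any of the cited work.

```python
import random

rng = random.Random(0)  # fixed seed so runs are repeatable

def best_of_n(generate, score, n):
    """Sample n candidates and keep the one the scorer ranks highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

def generate():
    # Toy stand-in for a model sample: the candidate is its true quality in [0, 1).
    return rng.random()

def noisy_score(noise):
    # Toy reward model: true quality plus Gaussian noise with the given std dev.
    return lambda candidate: candidate + rng.gauss(0.0, noise)

def mean_quality(n, noise, trials=2000):
    """Average true quality of the selected candidate over many trials."""
    scorer = noisy_score(noise)
    return sum(best_of_n(generate, scorer, n) for _ in range(trials)) / trials

for n in (1, 4, 16):
    reliable = mean_quality(n, noise=0.05)
    unreliable = mean_quality(n, noise=1.0)
    print(n, round(reliable, 3), round(unreliable, 3))
```

In this toy setup, raising `n` helps a lot when the scorer is reliable and much less when it is noisy, which is the intuition behind "spend and scoring matter more than the algorithm". It says nothing about how real reward models behave.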

The interesting twist is that *how* you retrieve can substitute for raw search volume. StructRAG routes each query to a task-appropriate knowledge structure (a table, a graph, an algorithm, a catalogue, or a plain chunk) and beats standard retrieval on knowledge-intensive reasoning. It grounds the idea in cognitive-fit theory: match the representation to the problem and you need less search to get further (see "Can routing queries to task-matched structures improve RAG reasoning?"). That rhymes with the broader test-time-scaling finding that *adaptive* allocation beats uniform spending, because a fixed budget wastes compute on easy prompts and underserves hard ones, whether the budget is reasoning tokens or search steps (see "How should we allocate compute budget at inference time?" and "Can reasoning systems scale wider instead of only deeper?").
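
Adaptive allocation can be sketched as a proportional split of a fixed step budget. The sketch assumes some per-prompt difficulty estimate already exists; where that estimate comes from is outside its scope. The function name and the example difficulty values are hypothetical.

```python
def allocate_search_steps(difficulties, total_steps, min_steps=1):
    """Split a fixed step budget across prompts in proportion to estimated difficulty.

    `difficulties` holds non-negative, hypothetical difficulty estimates,
    one per prompt. Every prompt gets at least `min_steps`.
    """
    if not difficulties:
        return []
    count = len(difficulties)
    if total_steps < min_steps * count:
        raise ValueError("budget too small for the per-prompt minimum")
    spare = total_steps - min_steps * count
    weight_sum = sum(difficulties)
    if weight_sum == 0:
        shares = [spare / count] * count
    else:
        shares = [spare * d / weight_sum for d in difficulties]
    steps = [min_steps + int(share) for share in shares]
    # Hand out steps lost to rounding, largest fractional remainder first.
    leftover = total_steps - sum(steps)
    by_remainder = sorted(range(count), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_remainder[:leftover]:
        steps[i] += 1
    return steps

difficulties = [0.1, 0.2, 0.9]   # easy, easy-ish, hard (illustrative values)
adaptive = allocate_search_steps(difficulties, total_steps=12)
uniform = [12 // len(difficulties)] * len(difficulties)
print("adaptive:", adaptive, "uniform:", uniform)
```

The same total budget is spent either way; the adaptive version shifts steps from the easy prompts to the hard one. A proportional rule is only one possible policy, and the same function would work unchanged for reasoning tokens instead of search steps.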

There's a deeper reason retrieval-scaling is appealing, and it's a caution about pure reasoning. Longer reasoning traces don't reliably signal harder problems: trace length tracks how close a problem sits to the training distribution, not its actual difficulty, so 'just think longer' can mean recalling a memorized schema rather than computing (see "Does longer reasoning actually mean harder problems?"). The memorization worry is concrete: RLVR's apparent reasoning gains on some math benchmarks turn out to reflect dataset contamination and collapse on clean post-release tests (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). Retrieval offers an escape hatch: it brings in information the model never internalized, rather than scaling up recall of what it already memorized.

Worth knowing: the reasoning-versus-retrieval framing isn't even the only axis. You can scale in *width* by sampling parallel latent trajectories instead of going deeper serially (see "Can reasoning systems scale wider instead of only deeper?"), make reasoning *memoryless* by contracting problems into dependency graphs so each step ignores accumulated history (see "Can reasoning systems forget history without losing coherence?"), or even fold test-time-style thinking back into pretraining for 3x data efficiency (see "Can training data augmentation match test-time compute scaling benefits?"). The honest synthesis: 'reasoning vs. retrieval' is really one slice of a wider design space where the unifying principle is that inference compute is a budget you allocate, and search is a first-class place to spend it.

Sources (12 notes)

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

How does search scale like reasoning in agent systems?

Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Can routing queries to task-matched structures improve RAG reasoning?

StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router that chooses among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.
