Does test-time compute scaling work for agentic deep research tasks?
This explores whether throwing more inference-time compute at a research agent (letting it search and reason longer rather than swapping in a bigger model) actually buys better answers. The corpus answers yes, with an interesting twist: the thing that scales isn't just reasoning, it's *search itself*. Several notes converge on the finding that an agent's search budget follows the same scaling curve as its reasoning tokens: more search steps improve answer quality along the same monotonic-then-diminishing path that more chain-of-thought does (see "Does search budget scale like reasoning tokens for answer quality?" and "Do search steps follow the same scaling rules as reasoning tokens?"). The reframing is the payoff: search stops being plumbing and becomes a *compute axis* you can dial, trading reasoning budget against retrieval budget to optimize the result (see "How does search scale like reasoning in agent systems?").
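One way to picture search as a dialable compute axis is a toy model, sketched here in Python. Everything in it is an assumption made for illustration: the logarithmic `toy_quality` curve, its weights, and the `tokens_per_search` cost are hypothetical and not taken from the notes. The only property borrowed from them is that both axes improve quality monotonically with diminishing returns.

```python
import math
from typing import Tuple


def toy_quality(search_steps: int, reasoning_tokens: int,
                w_search: float = 1.0, w_reason: float = 1.0) -> float:
    """Hypothetical quality curve: rises with each axis, but flattens out."""
    return (w_search * math.log1p(search_steps)
            + w_reason * math.log1p(reasoning_tokens / 1000))


def best_split(total_tokens: int,
               tokens_per_search: int = 2000) -> Tuple[int, int, float]:
    """Sweep every affordable number of search steps under a fixed budget.

    Whatever is not spent on search goes to reasoning tokens.
    Returns (search_steps, reasoning_tokens, quality) for the best split.
    """
    best = (0, total_tokens, toy_quality(0, total_tokens))
    max_steps = total_tokens // tokens_per_search
    for steps in range(1, max_steps + 1):
        reasoning = total_tokens - steps * tokens_per_search
        quality = toy_quality(steps, reasoning)
        if quality > best[2]:
            best = (steps, reasoning, quality)
    return best


if __name__ == "__main__":
    steps, reasoning, quality = best_split(40_000)
    print(steps, reasoning, round(quality, 3))
```

Under these made-up numbers the best split is interior (10 search steps and 20,000 reasoning tokens) rather than all-search or all-reasoning, which is the qualitative point of treating the two budgets as tradable. Real curves would have to be measured per task.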
But 'just spend more tokens' comes with sharp caveats. At the multi-agent level, roughly 80% of performance variance is explained by token spend rather than by any clever coordination between agents, which is freeing (you know what knob to turn) and sobering (much of the apparent intelligence is just budget) (see "How does test-time scaling work at the agent level?"). And spending only helps if the model was trained to use it: non-reasoning models don't catch up to reasoning models no matter how much inference compute you pour in, because training instills a protocol that makes the extra tokens productive in the first place (see "Can non-reasoning models catch up with more compute?"). So test-time scaling extracts capability the model already has; it doesn't manufacture it. That's the internal-vs-external split worth knowing: internal scaling (training for autonomous reasoning) builds the capability, external scaling (inference-time search and verification) cashes it out, and the two complement rather than substitute (see "How do internal and external test-time scaling compare?").
The deeper trade is that inference compute and model size are not separate resources. On hard prompts, a smaller model given more thinking time can match a larger one: compute spent at inference can stand in for compute spent on parameters (see "Can inference compute replace scaling up model size?"). The catch is that uniform spending is wasteful: dumping a fixed budget on every query overpays for easy questions and starves the hard ones, so adaptive per-prompt allocation beats flat budgets (see "How should we allocate compute budget at inference time?"). For a research agent fielding a mix of trivial and gnarly sub-questions, that's the real lever: not 'more,' but 'more where it counts.'
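The difference between flat and adaptive budgets can be sketched in a few lines of Python. This is a minimal illustration, not the allocation method from the notes: the difficulty scores are assumed to come from some hypothetical upstream estimator, and the proportional rule with a per-query `floor` is just one simple choice among many.

```python
from typing import List, Sequence


def allocate_uniform(n_queries: int, total_budget: int) -> List[int]:
    """Flat baseline: every query gets the same token budget."""
    return [total_budget // n_queries] * n_queries


def allocate_adaptive(difficulties: Sequence[float], total_budget: int,
                      floor: int = 256) -> List[int]:
    """Give each query a small floor, then split the rest by difficulty.

    `difficulties` are non-negative scores from a hypothetical estimator;
    higher means harder. Rounding down may leave a few tokens unspent.
    """
    n = len(difficulties)
    spare = total_budget - floor * n
    if spare < 0:
        raise ValueError("total_budget is too small to cover the floor")
    total_difficulty = sum(difficulties)
    if total_difficulty == 0:
        # No difficulty signal at all: fall back to the flat split.
        return allocate_uniform(n, total_budget)
    return [floor + int(spare * d / total_difficulty) for d in difficulties]


if __name__ == "__main__":
    scores = [1, 1, 8]  # two easy sub-questions, one hard one
    print(allocate_uniform(len(scores), 10_000))
    print(allocate_adaptive(scores, 10_000))
```

With the same total budget, the adaptive split shifts most tokens to the hard sub-question while the easy ones keep only a modest allowance. How well this works in practice depends entirely on how good the difficulty estimate is.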
What the reader might not expect: scaling doesn't have to mean *deeper*. You can scale *wider* by sampling parallel trajectories instead of one long serial chain, sidestepping the latency tax of depth (see "Can reasoning systems scale wider instead of only deeper?"). And there's a striking inversion at the extreme: decompose a task into enough tiny verified subtasks with voting at each step, and small non-reasoning models execute million-step jobs error-free, no expensive reasoning model required (see "Can extreme task decomposition enable reliable execution at million-step scale?"). Two routes to reliability that don't look like 'bigger budget' at all.
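A back-of-the-envelope Python sketch shows why voting at each step matters so much for long chains. It uses plain majority voting as a stand-in and makes no claim to reproduce the voting scheme of the system described in the notes. It also assumes samples are independent, and counts a step as correct only when a strict majority of samples is correct. The independence assumption is exactly what correlated errors would break. The function names and the example accuracy figures are hypothetical.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most common answer among the sampled candidates."""
    return Counter(answers).most_common(1)[0][0]


def majority_correct_prob(p_correct: float, n_votes: int) -> float:
    """Chance that a strict majority of independent samples is correct."""
    if n_votes % 2 == 0:
        raise ValueError("use an odd number of votes to avoid ties")
    need = n_votes // 2 + 1
    return sum(
        comb(n_votes, k) * p_correct ** k * (1 - p_correct) ** (n_votes - k)
        for k in range(need, n_votes + 1)
    )


def chain_success(p_correct: float, n_votes: int, n_steps: int) -> float:
    """Chance that every step in a serial chain comes out correct."""
    return majority_correct_prob(p_correct, n_votes) ** n_steps


if __name__ == "__main__":
    print(majority_vote(["42", "41", "42"]))
    for votes in (1, 3, 5):
        print(votes, chain_success(0.99, votes, 1000))
```

In this toy setting a step that is right 99% of the time almost never survives 1,000 steps without voting, while five votes per step make the whole chain likely to succeed. That is the arithmetic behind spending compute on width and per-step verification rather than on a stronger single pass.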
So: test-time scaling genuinely works for agentic deep research, but the honest version is layered. Search scales like reasoning, most multi-agent gains are just spend, the spend only pays off on models trained to reason, and the smartest agents allocate adaptively, sometimes going wide or decomposing rather than simply spending more. If you want the structural picture of how all this fits together, the internal/external taxonomy is the doorway (see "How do internal and external test-time scaling compare?").
Sources (10 notes)
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.
Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.
Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.