Why do per-turn thinking budgets matter alongside iterative retrieval depth?
This explores why, in research agents that loop through multiple rounds of searching, it matters to cap how much an agent thinks *within each turn* — not just how deep it searches overall.
The short version: search depth and per-turn reasoning are two separate dials that interact, and turning one up blindly can starve the other. A deep-research agent improves as you let it search more times, but those gains follow the same diminishing-returns curve as adding more reasoning tokens. Both are now recognized as parallel axes of inference-time compute that you can trade against each other (see "Does search budget scale like reasoning tokens for answer quality?" and "Do search steps follow the same scaling rules as reasoning tokens?"). So the question isn't just "how deep do I search," it's "how do I split a finite budget between searching and thinking."
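To make the budget-splitting question concrete, here is a minimal sketch. It assumes, purely for illustration, that answer quality grows logarithmically along each axis; the function names, the weights, and the log-shaped utility are hypothetical stand-ins for "diminishing returns," not a curve measured in the cited notes.

```python
import math


def answer_quality(search_steps, reasoning_units, w_search=1.0, w_reason=1.0):
    """Toy utility: each axis helps, but with diminishing returns (assumed log shape)."""
    return w_search * math.log1p(search_steps) + w_reason * math.log1p(reasoning_units)


def best_split(total_units, w_search=1.0, w_reason=1.0):
    """Try every integer split of a fixed budget and return the best (search, reasoning) pair."""
    candidates = [(s, total_units - s) for s in range(total_units + 1)]
    return max(
        candidates,
        key=lambda sr: answer_quality(sr[0], sr[1], w_search, w_reason),
    )


if __name__ == "__main__":
    print(best_split(10))                 # equal weights
    print(best_split(10, w_search=2.0))   # search assumed twice as valuable
```

Under these assumptions the equal-weight case lands on an even split, (5, 5), and doubling the search weight shifts the optimum to (7, 3) without ever zeroing out reasoning. With concave returns on both axes, starving either dial is rarely optimal.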
The reason per-turn limits matter is mechanical: context is shared. If an agent burns an unrestricted amount of reasoning inside a single search turn, it eats the context window that later retrieval rounds need to absorb new evidence. Capping reasoning *per turn*, rather than only setting a global time or token ceiling, preserves room for the iterative loop to keep ingesting and incorporating what it finds, which keeps search quality from eroding over many cycles (see "Does limiting reasoning per turn improve multi-turn search quality?"). A per-turn budget is essentially a way of protecting the breadth of the whole investigation from the greed of any one step.
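The shared-context arithmetic can be sketched in a few lines. This is a simplified model, not a description of any particular agent framework: it assumes every turn's reasoning and retrieved evidence stay in the context window, and the function name and all token counts are hypothetical.

```python
from typing import Optional


def rounds_that_fit(
    context_window: int,
    desired_reasoning: int,
    evidence_per_round: int,
    per_turn_cap: Optional[int] = None,
) -> int:
    """Count how many search turns fit before the shared context is full.

    Assumption: each turn appends its reasoning tokens plus a fixed amount
    of retrieved evidence, and nothing is evicted between turns.
    """
    reasoning = desired_reasoning
    if per_turn_cap is not None:
        reasoning = min(desired_reasoning, per_turn_cap)

    cost = reasoning + evidence_per_round
    if cost <= 0:
        raise ValueError("each turn must consume at least one token")

    used = 0
    rounds = 0
    while used + cost <= context_window:
        used += cost
        rounds += 1
    return rounds


if __name__ == "__main__":
    # Hypothetical numbers: 32k window, 6k tokens of thinking, 2k of evidence per turn.
    print(rounds_that_fit(32_000, 6_000, 2_000))                      # uncapped
    print(rounds_that_fit(32_000, 6_000, 2_000, per_turn_cap=1_500))  # capped
```

With these illustrative numbers, the uncapped agent fits 4 retrieval rounds and the capped one fits 9. A global ceiling alone would not change the first figure, because the per-turn spend is what crowds out later evidence.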
There's also a quality argument independent of context economics: more thinking per turn isn't simply better. Chain-of-thought accuracy follows an inverted U: it peaks at an intermediate length and then declines, and the optimal length grows with task difficulty but shrinks as models get more capable (see "Why does chain of thought accuracy eventually decline with length?"). Left unbounded, reasoning models also tend to thrash, abandoning promising paths mid-stream; a decoding-time penalty on those premature switches improves accuracy without any retraining (see "Do reasoning models switch between ideas too frequently?"). A per-turn cap is a blunt but effective guardrail against both over-thinking and flailing: it nudges each turn toward a decisive, bounded contribution rather than a sprawling one.
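The switch penalty is easy to picture as a logit adjustment at decoding time. The sketch below is a rough illustration of that idea only: the function names, the penalty size, the minimum-thought-length window, and the example tokens are all assumptions, not parameters taken from the cited note.

```python
def penalize_switches(logits, switch_tokens, tokens_since_switch,
                      penalty=3.0, min_thought_len=200):
    """Return a copy of `logits` with thought-switch tokens made less likely.

    The penalty only applies while the current line of reasoning is still
    short (fewer than `min_thought_len` tokens), so the model can still
    change course once a thought has been given a fair run.
    """
    adjusted = dict(logits)
    if tokens_since_switch < min_thought_len:
        for tok in switch_tokens:
            if tok in adjusted:
                adjusted[tok] -= penalty
    return adjusted


def greedy_pick(logits):
    """Pick the highest-scoring token (greedy decoding, for illustration)."""
    return max(logits, key=logits.get)


if __name__ == "__main__":
    # Hypothetical next-token scores; "Alternatively" marks a thought switch.
    logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
    switch_tokens = {"Alternatively"}

    early = penalize_switches(logits, switch_tokens, tokens_since_switch=10)
    late = penalize_switches(logits, switch_tokens, tokens_since_switch=500)
    print(greedy_pick(early), greedy_pick(late))
```

Early in a thought the penalty tips greedy decoding away from the switch token; once the thought has run past the window, the original scores apply and the model is free to move on. No weights are touched, which is what makes this kind of intervention training-free.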
The deeper insight is about *where* exploration should live. When you have extra budget, spreading it across structured breadth (diverse abstractions or strategies) beats pouring it into depth-only reasoning chains, which fall into an "underthinking" failure mode (see "Can abstractions guide exploration better than depth alone?"). In a research agent, iterative retrieval *is* the breadth mechanism: each new search turn is a fresh exploratory probe. So a tight per-turn thinking budget plus many retrieval rounds enacts breadth-first exploration, while a fat per-turn budget with few rounds collapses into depth-only reasoning over thin evidence. The two dials encode a single strategic choice about explore-versus-exploit.
What you didn't know you wanted to know: there are training-free ways to claw back the per-turn budget without losing accuracy. Reasoning verbosity turns out to be a single steerable direction in activation space: one extracted vector can cut chain-of-thought length by about two-thirds (67%) while preserving accuracy and running roughly 2.7x faster (see "Can we steer reasoning toward brevity without retraining?"). That means "spend less per turn" doesn't have to mean "think worse"; it can mean compressing the same reasoning into fewer tokens, freeing the saved context for more retrieval depth. Per-turn budgets and retrieval depth aren't competing constraints so much as the two levers of a single compute-allocation problem.
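For intuition about what "a single steerable direction" means, here is a toy sketch using a difference-of-means vector. That extraction recipe is a common simplification assumed here for illustration and may differ from the cited method; the function names, the two-dimensional activations, and the `alpha` strength are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(concise_acts, verbose_acts):
    """Direction pointing from 'verbose' activations toward 'concise' ones.

    Assumption: the direction is the difference of the two group means,
    computed from paired concise/verbose examples.
    """
    concise_mean = mean_vector(concise_acts)
    verbose_mean = mean_vector(verbose_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden_state, direction, alpha=1.0):
    """Nudge a hidden state along the steering direction with strength alpha."""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]


if __name__ == "__main__":
    # Toy 2-dimensional activations from hypothetical paired examples.
    concise = [[1.0, 0.0], [3.0, 0.0]]
    verbose = [[0.0, 1.0], [0.0, 3.0]]
    direction = steering_vector(concise, verbose)
    print(direction)
    print(steer([0.0, 0.0], direction, alpha=0.5))
```

In a real model the same addition would be applied to a layer's hidden states during generation, and `alpha` becomes the knob that trades verbosity against the risk of degrading the reasoning. The weights themselves stay untouched.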
Sources (7 notes)
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.