What happens to iterative search quality when reasoning depth is unconstrained?

This explores what goes wrong when you let a reasoning agent think as long and deep as it wants inside a search loop — and why more reasoning depth often makes iterative search worse, not better.

This reads the question as being about a specific failure mode: when an agent doing multi-round, iterative search is given unlimited room to reason, the depth doesn't pay off the way you'd expect, and in several ways it actively hurts. The corpus is surprisingly consistent on this. The most direct answer comes from research on long-horizon research tasks, which finds that unrestricted reasoning *within a single search turn* burns through the context window the agent needs for later retrieval rounds. Letting it think freely on turn one starves turns two through five; imposing a per-turn reasoning budget (not just an overall time cap) preserves context and keeps search quality steady across iterations (see "Does limiting reasoning per turn improve multi-turn search quality?"). So the first thing that happens is mechanical: depth eats the resource that iteration depends on.
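The mechanical effect can be made concrete with a toy simulation. This is a minimal sketch, not the method from the cited research: the function `run_search`, the token counts, and the rule that evidence which does not fit is simply dropped are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnLog:
    turn: int
    reasoning_tokens: int
    evidence_tokens: int  # retrieved text that actually fit in context


def run_search(context_window: int,
               wanted_reasoning: List[int],
               evidence_per_turn: int,
               per_turn_cap: Optional[int] = None) -> List[TurnLog]:
    """Simulate context use across search turns (hypothetical model).

    Each turn the agent reasons first, then tries to append retrieved
    evidence. Whatever does not fit in the remaining window is dropped.
    """
    remaining = context_window
    logs = []
    for turn, wanted in enumerate(wanted_reasoning, start=1):
        # Apply the per-turn cap if one is set, then clip to what is left.
        allowed = wanted if per_turn_cap is None else min(wanted, per_turn_cap)
        reasoning = min(allowed, remaining)
        remaining -= reasoning
        evidence = min(evidence_per_turn, remaining)
        remaining -= evidence
        logs.append(TurnLog(turn, reasoning, evidence))
    return logs


# Assumed numbers: a 16k window, a long first turn, 2k of evidence per turn.
wanted = [6000, 3000, 3000, 3000, 3000]
uncapped = run_search(16000, wanted, evidence_per_turn=2000)
capped = run_search(16000, wanted, evidence_per_turn=2000, per_turn_cap=1000)

print([t.evidence_tokens for t in uncapped])  # [2000, 2000, 0, 0, 0]
print([t.evidence_tokens for t in capped])    # [2000, 2000, 2000, 2000, 2000]
```

Under these assumed numbers, the uncapped run has no room for new evidence from the third turn onward, while the capped run keeps absorbing evidence in every turn. An overall cap alone would not prevent this, since the first turn could still spend most of it.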

The second thing is behavioral, and this is where the corpus gets interesting. Unconstrained reasoning doesn't just go deep; it goes *wandering*. Reasoning models tend to abandon promising paths mid-exploration, a failure called underthinking, where the model switches ideas too frequently and wastes tokens on half-finished approaches (see "Do reasoning models switch between ideas too frequently?"). A companion line of work frames this as the model exploring "like a tourist, not a scientist": it combines invalid wandering with premature path-switching, two reinforcing failures of *structure* rather than of insufficient compute (see "Why do reasoning models abandon promising solution paths?"). The damning detail: decoding-level penalties on thought-switching improve accuracy, with no retraining needed. That means the depth was there; the model just couldn't organize it. More room to think gave it more room to wander.
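A decoding-level penalty of this kind can be sketched in a few lines. This is written in the spirit of the TIP strategy named in the source notes, but the penalty form, the `alpha` and `beta` parameters, the token list, and the function names are assumptions for illustration, not the published formulation.

```python
from typing import Dict, Set


def penalize_switch_tokens(logits: Dict[str, float],
                           switch_tokens: Set[str],
                           tokens_in_current_thought: int,
                           alpha: float = 3.0,
                           beta: int = 200) -> Dict[str, float]:
    """Down-weight thought-switching tokens early in a thought.

    alpha: how much to subtract from a switch token's logit (assumed).
    beta:  how many tokens a thought must run before switching is
           allowed without penalty (assumed).
    """
    if tokens_in_current_thought >= beta:
        return dict(logits)  # thought is mature: leave logits unchanged
    return {tok: (score - alpha if tok in switch_tokens else score)
            for tok, score in logits.items()}


def greedy_pick(logits: Dict[str, float]) -> str:
    """Pick the highest-scoring token (greedy decoding)."""
    return max(logits, key=logits.get)


# Hypothetical next-token scores; "Alternatively" marks a thought switch.
logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switches = {"Alternatively"}

early = penalize_switch_tokens(logits, switches, tokens_in_current_thought=40)
late = penalize_switch_tokens(logits, switches, tokens_in_current_thought=400)

print(greedy_pick(early))  # so
print(greedy_pick(late))   # Alternatively
```

The point of the sketch is that the intervention touches only token scores at sampling time. Nothing about the model's weights changes, which is why it counts as evidence that the failure is one of organization rather than capacity.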

Why does this compound in search specifically? Because unsystematic exploration degrades non-linearly. One analysis shows reasoning LLMs lack validity, effectiveness, and necessity in how they explore, and as a result success probability drops *exponentially* with problem depth: medium problems stay solvable while deep ones become catastrophically hard (see "Why do reasoning LLMs fail at deeper problem solving?"). Iterative search is exactly the regime where this bites, because each round inherits the disorganization of the last. There's even a cleaner curve underneath all this: optimal chain-of-thought length follows an inverted U, where accuracy peaks at intermediate length and, past the peak, more reasoning *lowers* accuracy. Tellingly, more capable models prefer shorter chains, and RL training drifts toward brevity as models improve (see "Why does chain of thought accuracy eventually decline with length?"). Unconstrained depth pushes you off the right side of that hill.
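To see why exponential decay separates medium from deep problems so sharply, consider a simplified model. It assumes each exploration step succeeds independently with the same probability $p$, which is an assumption made here for illustration and not a result taken from the cited analysis; the value $p = 0.9$ is likewise arbitrary.

```latex
% Illustrative model: d sequential steps, each succeeding
% independently with probability p (assumed, not from the source).
P_{\text{success}}(d) = p^{d}

% Worked values for p = 0.9:
%   d = 5   =>  0.9^{5}  \approx 0.59
%   d = 20  =>  0.9^{20} \approx 0.12
%   d = 50  =>  0.9^{50} \approx 0.005
```

Even with a 90% per-step success rate, a 5-step problem is solved more often than not, while a 50-step problem is solved about once in two hundred attempts under this model.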

The constructive flip side: if depth-only scaling is the problem, the corpus points toward *breadth* as the fix. RLAD trains models to generate reasoning abstractions that enforce breadth-first exploration, outperforming parallel solution-sampling at large budgets precisely because it prevents the underthinking trap of long depth-only chains (see "Can abstractions guide exploration better than depth alone?"). A different approach, GRAM, scales reasoning in *width* by sampling parallel latent trajectories, sidestepping the serial latency and variance problems of going deeper (see "Can reasoning systems scale wider instead of only deeper?"). The throughline across all of these: the lever that improves iterative search isn't unbounded thinking but structured allocation. Constrain depth per step, spend the saved budget on breadth, and the search holds together. Leave depth unconstrained, and you don't get a deeper thinker; you get a more elaborate wanderer who runs out of context before the answer arrives.

Sources 7 notes

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
