Why do long-horizon reasoning tasks need per-turn step limits rather than just compute budgets?

This explores why long, multi-turn reasoning tasks (like iterative research or search agents) need a cap on reasoning *per turn* — not just a total time or compute allowance — and what makes those two kinds of limits behave so differently.

The corpus points to a clear answer: a compute budget treats tokens as fungible, but in a multi-turn task the scarce resource is not compute, it is *context*. The most direct evidence is the finding that unrestricted reasoning inside a single search turn consumes the context window that later retrieval rounds need, degrading the agent's ability to take in new evidence; capping reasoning *per turn*, not just the total run time, prevents that context erosion across iterations (Does limiting reasoning per turn improve multi-turn search quality?). A global budget says 'you may spend N tokens total' and says nothing about *when*; the harm comes from spending most of them in one turn and starving the next.
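A toy simulation can make the 'when, not how much' point concrete. This is a minimal sketch under stated assumptions: the function `simulate`, its parameters, and every token count are hypothetical and not taken from the cited notes. It assumes context only grows during a run, with retrieved evidence and reasoning both appended to one fixed window.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunResult:
    turns_completed: int   # turns whose retrieved evidence fit in full
    evidence_dropped: int  # evidence tokens that no longer fit in the window


def simulate(
    context_window: int,
    reasoning_demand: List[int],   # tokens the model "wants" to reason per turn
    evidence_per_turn: int,        # tokens of retrieved evidence per turn
    per_turn_cap: Optional[int] = None,
    global_budget: Optional[int] = None,
) -> RunResult:
    used = 0        # tokens currently occupying the context window
    spent = 0       # reasoning tokens spent so far (for the global budget)
    completed = 0
    dropped = 0
    for want in reasoning_demand:
        # Evidence for this turn must fit in whatever window is left.
        taken = min(evidence_per_turn, context_window - used)
        dropped += evidence_per_turn - taken
        used += taken
        if taken == evidence_per_turn:
            completed += 1

        # Reasoning is limited by the per-turn cap, the global budget,
        # and the remaining window, whichever binds first.
        allowed = want
        if per_turn_cap is not None:
            allowed = min(allowed, per_turn_cap)
        if global_budget is not None:
            allowed = min(allowed, global_budget - spent)
        allowed = min(allowed, context_window - used)
        used += allowed
        spent += allowed
    return RunResult(completed, dropped)


# Same window, same demand, same global budget; only the per-turn cap differs.
demand = [7000, 1000, 1000, 1000]
global_only = simulate(10_000, demand, 1_500, global_budget=10_000)
capped = simulate(10_000, demand, 1_500, per_turn_cap=1_000, global_budget=10_000)
print(global_only, capped)
```

In this made-up setting the global budget is never exceeded in either run, yet the uncapped run loses evidence in later turns because the first turn's reasoning already filled most of the window.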

Seen this way, the per-turn limit is a context-allocation discipline, and several other lines in the corpus converge on the same problem from different angles. LLM Programs attack it by hiding step-irrelevant context: each call sees only what that step needs, which amounts to a per-step budget enforced architecturally rather than by token count (Can algorithms control LLM reasoning better than LLMs alone?). Markov-style 'memoryless' reasoning goes further, deliberately discarding accumulated history so each state depends only on the current subproblem, eliminating the historical baggage that bloats the window (Can reasoning systems forget history without losing coherence?). And recursive subtask trees with KV-cache pruning sustain accurate reasoning even past the nominal context limit by aggressively dropping stale cache entries (Can recursive subtask trees overcome context window limits?). All three make the same point as the per-turn limit: unbounded reasoning competes with the work it is supposed to serve.
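The information-hiding idea can be sketched in a few lines. This is an illustrative sketch only, not the LLM Programs authors' implementation: `run_program`, the `Step` layout, and `stub_llm` (a stand-in for a real model call) are all hypothetical names. Each step declares which state keys it reads, and the prompt is built from those keys alone.

```python
from typing import Callable, Dict, List, TypedDict


class Step(TypedDict):
    reads: List[str]   # state keys this step is allowed to see
    template: str      # prompt template using only those keys
    writes: str        # state key that receives the model output


def run_program(
    steps: List[Step],
    state: Dict[str, str],
    llm: Callable[[str], str],
) -> Dict[str, str]:
    state = dict(state)  # work on a copy so the caller's state is untouched
    for step in steps:
        # Information hiding: only the declared keys reach the prompt.
        visible = {key: state[key] for key in step["reads"]}
        prompt = step["template"].format(**visible)
        state[step["writes"]] = llm(prompt)
    return state


seen_prompts: List[str] = []


def stub_llm(prompt: str) -> str:
    # Hypothetical stand-in for a model call; records what it was shown.
    seen_prompts.append(prompt)
    return f"<{len(prompt)} chars summarized>"


steps: List[Step] = [
    {"reads": ["document"], "template": "Summarize: {document}", "writes": "summary"},
    {
        "reads": ["question", "summary"],
        "template": "Q: {question}\nNotes: {summary}\nAnswer:",
        "writes": "answer",
    },
]
final = run_program(
    steps, {"document": "lorem " * 500, "question": "What changed?"}, stub_llm
)
```

The second call never sees the raw document, only the short summary, so its context cost is fixed by the program's structure rather than by how much earlier steps produced.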

There's a second reason raw compute is the wrong dial: more reasoning tokens are not automatically *productive* ones. Models 'underthink', abandoning promising paths mid-exploration and burning tokens on incomplete approaches, and simply penalizing those thought-switches at decoding time improves accuracy without any retraining (Do reasoning models switch between ideas too frequently?). Other work shows you can delete up to three-quarters of reasoning steps (the verification and backtracking ones that get little downstream attention) and keep accuracy intact (Can reasoning steps be dynamically pruned without losing accuracy?). If most marginal reasoning tokens are low-value or actively harmful, then handing an agent a bigger compute budget mostly buys it more rope, whereas a per-turn limit forces it to commit and move on.
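A decoding-time switch penalty of this kind can be sketched as a logit adjustment. This is a simplified illustration, not the TIP authors' implementation: the function names, the `min_thought_len` window, the token strings, and the scores are all assumptions made for the example.

```python
from typing import Dict, Set


def penalize_switches(
    logits: Dict[str, float],
    switch_tokens: Set[str],
    penalty: float,
    tokens_in_thought: int,
    min_thought_len: int,
) -> Dict[str, float]:
    """Lower the score of thought-transition tokens while the current
    thought is still shorter than min_thought_len tokens."""
    if tokens_in_thought >= min_thought_len:
        return dict(logits)  # thought is long enough; switching is allowed
    return {
        tok: (score - penalty if tok in switch_tokens else score)
        for tok, score in logits.items()
    }


def greedy_pick(logits: Dict[str, float]) -> str:
    # Pick the highest-scoring token.
    return max(logits, key=logits.get)


# Hypothetical next-token scores where a switch word is narrowly preferred.
raw = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switches = {"Alternatively"}

early = greedy_pick(penalize_switches(raw, switches, 3.0, 10, 100))
late = greedy_pick(penalize_switches(raw, switches, 3.0, 150, 100))
```

Early in a thought the penalty pushes the model to continue the current line; once the thought has run past the window, the switch token competes on its original score.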

The deepest version of the point: compute and capability aren't the same axis. Non-reasoning models can't catch up to reasoning models no matter how much inference budget you give them, because the gap is about a trained reasoning *protocol*, not token count (Can non-reasoning models catch up with more compute?). And chain-of-thought quality collapses outside the training distribution regardless of how long the chain runs (Does chain-of-thought reasoning actually generalize beyond training data?). So 'spend more compute' is a blunt instrument that assumes tokens convert linearly into progress. The per-turn limit encodes a sharper model of long-horizon work: progress comes from preserving the ability to *incorporate the next piece of evidence*, and that ability is a structural budget on context, not a wallet of compute.

Sources 8 notes

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
