INQUIRING LINE

Why does more inference compute amplify wandering rather than solving it?

This explores why adding more inference compute (longer chains of thought, more tokens, more search steps) often makes reasoning models wander off and abandon good paths instead of converging on answers — and what the corpus says is actually going wrong.


The corpus's sharpest answer is that wandering is a *structural* failure, not a *budget* failure, so more budget feeds the structure that's broken. The clearest statement of this is the diagnosis that reasoning models explore like tourists, not scientists. They exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (switching away from promising paths before finishing them). Crucially, both can be mitigated at decoding time with thought-switching penalties, which need no fine-tuning and no extra compute (see "Why do reasoning models abandon promising solution paths?"). If a cheap decoding nudge recovers accuracy, then the missing ingredient was never compute; the paths were already there and were being abandoned.
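As a rough illustration of what a thought-switching penalty could look like at decoding time, the sketch below lowers the score of "switch" tokens while the current thought is still short. It is a minimal sketch, not the method from the cited note: the function names, the `penalty` and `min_thought_tokens` parameters, their default values, and the token ids are all hypothetical.

```python
# Hypothetical sketch: names, defaults, and token ids are illustrative assumptions,
# not taken from the cited note.

def penalize_thought_switches(logits, switch_token_ids, tokens_in_current_thought,
                              penalty=3.0, min_thought_tokens=200):
    """Return a copy of `logits` with switch tokens discouraged early in a thought.

    logits: dict mapping token id -> raw next-token score.
    switch_token_ids: ids of tokens that open a new line of attack
        (for example the id of "Alternatively"); chosen by the caller.
    tokens_in_current_thought: tokens generated since the last switch.
    """
    adjusted = dict(logits)  # never mutate the caller's scores
    if tokens_in_current_thought < min_thought_tokens:
        for tok in switch_token_ids:
            if tok in adjusted:
                adjusted[tok] -= penalty
    return adjusted


def greedy_pick(logits):
    """Pick the highest-scoring token id."""
    return max(logits, key=logits.get)


# Toy scores: token 1 stands in for a switch token such as "Alternatively".
toy_logits = {0: 2.0, 1: 2.5, 2: 1.0}
early = penalize_thought_switches(toy_logits, {1}, tokens_in_current_thought=10)
late = penalize_thought_switches(toy_logits, {1}, tokens_in_current_thought=500)
print(greedy_pick(early), greedy_pick(late))
```

The point of the design is that the penalty only reorders tokens the model was already scoring: it adds no tokens and no training, which is why it tests whether good paths existed but were being dropped.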

Once you see wandering as path abandonment, the non-monotonic scaling results stop being surprising. Accuracy doesn't climb steadily with tokens: it peaks at a task-specific threshold and then falls sharply (one study reports a drop from 87.3% to 70.3% as thinking tokens scale from about 1,100 to 16,000), because extended thinking inflates output variance and introduces self-revision errors (see "When does thinking too much actually hurt reasoning?"). More tokens mean more opportunities to second-guess a correct answer into a wrong one. The same dynamic shows up when the question itself is broken: given a problem with a missing premise, reasoning models churn out long, redundant chains instead of declaring it unanswerable, while plainer non-reasoning models simply reject it (see "Why do reasoning models overthink ill-posed questions?"). Training rewarded *producing reasoning steps* but never taught the model *when to stop*, so compute has no brake.

There's a deeper reason the brake is missing: the reasoning traces may not be doing the meaning-work we assume. Models trained on systematically irrelevant traces perform comparably to those trained on correct ones, which suggests that traces act as computational scaffolding rather than genuine logical steps (see "Do reasoning traces need to be semantically correct?"). If the chain is scaffolding rather than argument, then lengthening it doesn't deepen the argument; it just builds more scaffolding to get lost in. And the underlying competence ceiling is real, not a compute artifact: DeepSeek-R1 and o1-preview reach only 20–23.6% exact match on constraint-satisfaction problems that demand genuine backtracking (see "Can reasoning models actually sustain long-chain reflection?"), and non-reasoning models never close the gap with reasoning models no matter how large the inference budget. What matters is the training-instilled protocol that makes tokens productive in the first place (see "Can non-reasoning models catch up with more compute?").

So the corpus's reframing is that compute isn't the lever; *allocation and structure* are. Spending the same budget adaptively (less on easy prompts, more on hard ones) beats a bigger model under a flat budget (see "Can we allocate inference compute based on prompt difficulty?"), and the gains come from trace *quality*, not quantity: step-level confidence filtering catches breakdowns mid-chain and stops early, matching majority-vote accuracy with far fewer traces (see "Does step-level confidence outperform global averaging for trace filtering?"). The same limit appears on the search axis: deep-research agents improve with more search steps, but with the same diminishing returns seen for reasoning tokens (see "Do search steps follow the same scaling rules as reasoning tokens?").
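To make the contrast between step-level and global confidence concrete, here is a minimal sketch under stated assumptions. Each trace is a list of per-step confidence scores in [0, 1]; the sliding-window rule, the `window` and `threshold` values, and all function names are hypothetical choices for illustration, and the confidence numbers are invented toy inputs rather than measurements.

```python
from collections import Counter

# Hypothetical sketch: the window rule, thresholds, and toy numbers are assumptions.

def global_confidence(step_confidences):
    """Mean confidence over the whole trace (the baseline that can mask a dip)."""
    return sum(step_confidences) / len(step_confidences)


def first_breakdown(step_confidences, window=2, threshold=0.5):
    """Index of the first step whose trailing-window mean falls below threshold.

    Returns None if no breakdown is found. A generator could stop the trace at
    this index instead of finishing it, which is where the token savings come from.
    """
    for end in range(window, len(step_confidences) + 1):
        recent = step_confidences[end - window:end]
        if sum(recent) / window < threshold:
            return end - 1
    return None


def filter_and_vote(traces, window=2, threshold=0.5):
    """Majority vote over answers from traces that show no local breakdown.

    traces: list of (step_confidences, answer) pairs.
    """
    kept = [answer for confs, answer in traces
            if first_breakdown(confs, window, threshold) is None]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]


# Toy trace: two weak steps in the middle, strong steps everywhere else.
dipping = [0.9, 0.9, 0.3, 0.2, 0.9, 0.9, 0.9, 0.9]
print(global_confidence(dipping) > 0.5)  # the average still looks healthy
print(first_breakdown(dipping))          # the local window flags the dip
```

In this toy trace the global mean stays above the threshold while the window check flags the weak middle steps, which is the masking effect described above.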

The most interesting turn is that the fix isn't *less* exploration but *better-shaped* exploration. Rather than scaling depth (longer serial chains, more room to wander), scale width: sample parallel latent trajectories that probe the solution space independently without inflating variance (see "Can reasoning systems scale wider instead of only deeper?"), and make latent transitions stochastic so a model can actually *hold* uncertainty and represent multiple valid strategies instead of committing and then abandoning (see "Can stochastic latent reasoning help models explore multiple solutions?"). Or teach the model to route, deciding when to think hard and when to answer directly, so extended thinking is spent only where it pays (see "Can models learn when to think versus respond quickly?"). The throughline you might not have expected: wandering is what depth-scaling looks like when the model has no learned sense of when to stop, and the answer is to change the *shape* of the compute, not the *amount*.


Sources (12 notes)

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

When does thinking too much actually hurt reasoning?

Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
