Why does extended thinking increase output variance without improving reasoning quality?

This explores why making a model 'think' longer can spread its answers wider without making the underlying reasoning any better — and what that reveals about what extended thinking actually does.

The corpus offers a surprisingly clean mechanical answer: extended thinking mostly works by widening the model's output distribution, not by reasoning better. When a model thinks longer, it samples a broader range of possible answers, and a broader range covers the correct answer more often, so accuracy rises for a while (see: Does extended thinking actually improve reasoning or just increase variance?). But this is sampling coverage masquerading as reasoning. Past a critical thinking-token count the distribution becomes too diffuse, accuracy drops, and the model starts introducing self-revision errors, second-guessing correct answers into wrong ones (see: When does thinking too much actually hurt reasoning?). The numbers are stark: accuracy falls from 87.3% to 70.3%, a 17-point drop, as thinking tokens scale from ~1,100 to ~16,000 (see: Does more thinking time always improve reasoning accuracy?; Does more thinking time actually improve LLM reasoning?).
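The rise-then-fall pattern can be reproduced in a toy model. The sketch below is an illustration only, not something taken from the cited notes: it assumes the model's answer is a number drawn from a normal distribution whose centre sits a fixed distance (`bias`) away from the correct value, and that an answer counts as correct when it lands within `tolerance` of that value. The names `hit_probability`, `bias`, `tolerance`, and `spread` are all hypothetical. Under those assumptions, widening the spread first raises the chance of covering the correct answer and then lowers it again, with no change to where the distribution is centred.

```python
import math


def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def hit_probability(spread: float, bias: float = 1.0, tolerance: float = 0.1) -> float:
    """Chance that a sampled answer lands within `tolerance` of the target.

    Assumption (toy model): answers are Normal(target - bias, spread**2),
    so the distribution is centred on a wrong value and only its width varies.
    """
    upper = (bias + tolerance) / spread
    lower = (bias - tolerance) / spread
    return normal_cdf(upper) - normal_cdf(lower)


# Sweep the spread: the centre never moves, only the width of the distribution.
for spread in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    print(f"spread={spread:>4}: P(correct)={hit_probability(spread):.4f}")
```

In this toy, the intermediate spreads score best even though the centre of the distribution never gets closer to the target, which is the sense in which coverage can improve without the underlying estimate improving.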

The deeper reason the extra tokens don't buy real reasoning is that the form of reasoning and its substance are decoupled. On BIG-Bench Hard, logically invalid chain-of-thought exemplars perform about as well as valid ones: the model learns to imitate the shape of step-by-step reasoning, not to actually infer (see: Does logical validity actually drive chain-of-thought gains?). And when you move outside the training distribution, whether in task, length, or format, that imitation breaks down predictably: the model produces fluent, confident reasoning that is logically inconsistent (see: Does chain-of-thought reasoning actually generalize beyond training data?). If the thinking is performance rather than computation, then generating more of it can only stir the distribution, not sharpen it.

There's also a control problem. Models don't know when to stop. They overthink easy problems and underthink hard ones (see: Does more thinking time always improve reasoning accuracy?), and they lack the critical judgment to recognize an ill-posed or unanswerable question, generating page after page of redundant reasoning where a non-reasoning model would simply say 'this can't be answered' (see: Why do reasoning models overthink ill-posed questions?). Training optimizes for producing reasoning steps but never teaches disengagement, so the variance you see is partly the model failing to allocate its own thinking budget.

The more hopeful thread in the corpus is that none of this is intrinsic; it depends on what training rewards. The same thinking mechanism that induces counterproductive self-doubt in a vanilla model gets redirected by RL training into productive gap analysis: training mediates reasoning quality, not just quantity (see: Does extended thinking help or hurt model reasoning?). Relatedly, accuracy follows an inverted U in chain-of-thought length. The optimal length grows with task difficulty but shrinks with model capability, so more capable models naturally prefer shorter chains, and simplicity emerges from the reward signal as models improve (see: Why does chain of thought accuracy eventually decline with length?). You can even extract a single steering vector from 50 paired examples that cuts chain length by 67% while holding accuracy, suggesting verbosity is a tunable direction rather than a source of correctness (see: Can we steer reasoning toward brevity without retraining?).
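To make "a tunable direction" concrete, here is a minimal sketch of one common way to build a steering vector: take the difference between the mean activation on concise examples and the mean activation on verbose ones, then add a scaled copy of it to a hidden state. This is an assumption for illustration. The notes do not say that the cited method constructs its vector this way, and the function names, the tiny three-dimensional activations, and the `strength` parameter are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    count = len(rows)
    width = len(rows[0])
    return [sum(row[i] for row in rows) / count for i in range(width)]


def steering_vector(verbose_acts: Sequence[Sequence[float]],
                    concise_acts: Sequence[Sequence[float]]) -> Vector:
    """Difference of means, pointing from the verbose toward the concise examples."""
    verbose_mean = mean_vector(verbose_acts)
    concise_mean = mean_vector(concise_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden: Sequence[float], direction: Sequence[float],
          strength: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction (no retraining involved)."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Made-up activations standing in for paired verbose/concise reasoning traces.
verbose = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
concise = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

direction = steering_vector(verbose, concise)
steered = steer([1.0, 1.0, 1.0], direction, strength=0.5)
print(direction, steered)
```

The point of the sketch is only that the intervention is a single added vector with a scalar strength, which is why verbosity can be dialed down without touching the weights.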

The practical takeaway: if extra thinking is really just wider sampling, the fix isn't more tokens, it's grounding. Approaches that interleave reasoning with real external feedback (querying a tool or environment between steps) outperform pure chain-of-thought and reinforcement learning baselines by 10-34% absolute accuracy on knowledge-intensive and interactive tasks, precisely because they inject a correctness signal the model can't get from talking to itself (see: Can interleaving reasoning with real-world feedback prevent hallucination?). Variance without grounding is just noise dressed as deliberation.
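The interleaving pattern can be sketched as a short loop that alternates a reasoning step with a tool call and feeds each observation back into the next step. This is a schematic under stated assumptions, not the cited system's implementation: `grounded_loop`, the `policy` callable, the `lookup` tool, and the one-entry `FACTS` table are hypothetical stand-ins, and the scripted policy replaces what would be a language model in practice.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str, str]  # (thought, action name, action argument)
Policy = Callable[[str, List[str]], Step]
Tool = Callable[[str], str]


def grounded_loop(question: str, policy: Policy, tools: Dict[str, Tool],
                  max_steps: int = 5) -> Optional[str]:
    """Alternate reasoning with tool calls until the policy chooses to finish."""
    observations: List[str] = []
    for _ in range(max_steps):
        _thought, action, argument = policy(question, observations)
        if action == "finish":
            return argument
        # External feedback enters here, between reasoning steps.
        observations.append(tools[action](argument))
    return None  # step budget exhausted without an answer


# Hypothetical stand-in for an external knowledge source.
FACTS = {"capital of France": "Paris"}


def lookup(query: str) -> str:
    return FACTS.get(query, "no result")


def scripted_policy(question: str, observations: List[str]) -> Step:
    """Stand-in for a model: look something up first, then answer from it."""
    if not observations:
        return ("I should check rather than guess.", "lookup", question)
    return ("The tool answered, so report it.", "finish", observations[-1])


answer = grounded_loop("capital of France", scripted_policy, {"lookup": lookup})
print(answer)
```

The design choice worth noticing is that the final answer is built from an observation rather than from the policy's own earlier text, which is where the external correctness signal comes from. The `max_steps` cap also gives the loop an explicit stopping rule.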

Sources (11 notes)

Does extended thinking actually improve reasoning or just increase variance?

Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.

When does thinking too much actually hurt reasoning?

Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16,000 reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Does more thinking time actually improve LLM reasoning?

Accuracy drops from 87.3% to 70.3% as thinking tokens scale from 1,100 to 16,000, and bypassing explicit reasoning entirely matches or beats standard thinking at equal token budgets. The relationship is non-monotonic, not the linear improvement commonly assumed.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can interleaving reasoning with real-world feedback prevent hallucination?

ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.
