Can extended thinking genuinely improve reasoning or just increase variance?

This explores whether the accuracy gains from longer 'thinking' traces come from better reasoning — or from a statistical side effect where a wider net of guesses just catches the right answer more often.

The corpus leans hard toward the variance reading, with an important caveat about training. The most direct claim is that longer thinking traces improve accuracy through variance expansion: broader output distributions overlap the correct answer more often, not because the model reasons better (see "Does extended thinking actually improve reasoning or just increase variance?"). The tell is what happens past a threshold: push thinking too far and the distribution gets so diffuse that accuracy collapses (see "Does more thinking time always improve reasoning accuracy?"). If the mechanism were real reasoning, more of it shouldn't actively hurt.
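The variance-expansion argument can be illustrated with a toy model. The sketch below is not taken from the cited notes: it assumes a model's answers are normally distributed around a biased guess, and that an answer counts as correct when it lands within a small tolerance of the truth. The names `hit_probability`, `bias`, `spread`, and `tolerance` are hypothetical, chosen only for this illustration.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def hit_probability(bias: float, spread: float, tolerance: float) -> float:
    """Chance that a sample from N(bias, spread^2) lands within
    `tolerance` of the correct answer, which is placed at 0.

    Toy assumption: `bias` is how far the model's typical answer sits
    from the truth, and `spread` stands in for output variance.
    """
    upper = (tolerance - bias) / spread
    lower = (-tolerance - bias) / spread
    return normal_cdf(upper) - normal_cdf(lower)


if __name__ == "__main__":
    # Hypothetical settings: the typical answer is off by 2.0 units.
    for spread in (0.1, 0.5, 1.0, 2.0, 5.0, 20.0, 50.0):
        p = hit_probability(bias=2.0, spread=spread, tolerance=0.5)
        print(f"spread={spread:>5}: P(correct) = {p:.4f}")
```

Under these assumptions the hit rate starts near zero, rises once the spread is wide enough to reach the truth, and then falls again as probability mass drains away from every answer, correct ones included. The bias never changes, so nothing in this toy model "reasons better" at any point along the curve.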

That collapse is strikingly concrete. Scaling thinking tokens from ~1,100 to ~16,000 drops benchmark accuracy from 87.3% to 70.3%, and in some setups skipping explicit reasoning entirely matches or beats standard thinking at the same token budget (see "Does more thinking time actually improve LLM reasoning?"). The relationship is non-monotonic: an inverted U in which models overthink easy problems and underthink hard ones (see "When does thinking too much actually hurt reasoning?" and "Why does chain of thought accuracy eventually decline with length?"). Notably, the optimal length shrinks as models get more capable, and reinforcement learning naturally gravitates toward shorter chains as competence rises. Brevity, in other words, is a symptom of skill, not a constraint on it.
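To put those two data points in proportion, here is a small arithmetic sketch. It uses only the figures quoted above; the variable names are illustrative, and the approximate token counts are treated as exact for the calculation.

```python
# Figures quoted above: accuracy at ~1,100 vs ~16,000 thinking tokens.
short_tokens, long_tokens = 1_100, 16_000
short_acc, long_acc = 87.3, 70.3

point_drop = short_acc - long_acc              # percentage points lost
relative_drop = point_drop / short_acc * 100   # share of the original accuracy
token_multiple = long_tokens / short_tokens    # how much more thinking was spent

print(f"absolute drop: {point_drop:.1f} points")
print(f"relative drop: {relative_drop:.1f}%")
print(f"token budget:  {token_multiple:.1f}x")
```

That works out to roughly 14.5 times the thinking budget for a loss of 17 percentage points, or about a fifth of the original accuracy.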

But variance isn't the whole story, and this is where the question gets more interesting than a flat 'no.' Whether extended thinking helps depends on what the thinking is *doing*. In vanilla models, thinking mode can be actively counterproductive, inducing self-doubt that degrades performance, yet RL training turns the very same mechanism into productive gap analysis (see "Does extended thinking help or hurt model reasoning?"). So training mediates reasoning quality, not just quantity. The length of the trace is a poor proxy; the character of the trace is what moves the needle.

The corpus also names *why* longer traces so often misfire. Reasoning models lack a 'stop' signal: fed ill-posed questions with missing premises, they spiral into redundant length while non-reasoning models simply flag the question as unanswerable. Training optimizes for producing steps but never teaches models when to disengage (see "Why do reasoning models overthink ill-posed questions?"). And there's a deeper fragility underneath: chain-of-thought reasoning is distribution-bounded, producing fluent but logically inconsistent output once you shift task, length, or format (see "Does chain-of-thought reasoning actually generalize beyond training data?"). More tokens from a process that imitates the form of reasoning without valid logic just buy more confident-sounding error.

If you want the constructive flip side, the collection suggests the lever is *placement and steering*, not raw length. Verbosity turns out to be a single linear direction you can steer in activation space, cutting chain-of-thought length by 67% while holding accuracy, with no retraining (see "Can we steer reasoning toward brevity without retraining?"). And reasoning quality can be planted earlier: training that reconstructs the hidden thought processes behind expert texts (see "Can reconstructing expert thinking improve reasoning transfer?"), or that treats chain-of-thought as an exploratory action rewarded by information gain during pretraining (see "Can chain-of-thought reasoning be learned during pretraining itself?"), yields reasoning that genuinely transfers. The through-line: genuine improvement comes from *how* a model learns to think, while simply letting it think *longer* mostly just turns up the variance, until it turns up too much.
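The idea of steering along a single direction can be sketched in a few lines. This is a minimal illustration under loud assumptions: it uses the difference of mean activations between paired concise and verbose examples, which is one common way to extract a steering direction, and the notes do not confirm that Activation-Steered Compression computes its vector this way. The function names, the tiny three-dimensional "activations", and the `strength` parameter are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_difference(concise: List[Vector], verbose: List[Vector]) -> Vector:
    """Average of (concise - verbose) over paired activation vectors."""
    if not concise or len(concise) != len(verbose):
        raise ValueError("need equal, non-empty lists of paired activations")
    dim = len(concise[0])
    totals = [0.0] * dim
    for c, v in zip(concise, verbose):
        for i in range(dim):
            totals[i] += c[i] - v[i]
    return [t / len(concise) for t in totals]


def steer(hidden: Vector, direction: Vector, strength: float) -> Vector:
    """Shift a hidden state along the steering direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # Made-up 3-dimensional activations for two pairs, purely illustrative.
    concise = [[1.0, 0.0, 2.0], [3.0, 1.0, 2.0]]
    verbose = [[0.0, 0.0, 1.0], [1.0, 1.0, 2.0]]
    direction = mean_difference(concise, verbose)
    print(direction)
    print(steer([0.0, 0.0, 0.0], direction, strength=2.0))
```

In a real model the vectors would be hidden states read from a chosen layer, and the shift would be applied during generation. Which layer to use and how large `strength` should be are exactly the details this sketch leaves open.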

Sources (11 notes)

Does extended thinking actually improve reasoning or just increase variance?

Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16,000 reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Does more thinking time actually improve LLM reasoning?

Accuracy drops from 87.3% to 70.3% as thinking tokens scale from 1,100 to 16,000, and bypassing explicit reasoning entirely matches or beats standard thinking at equal token budgets. The relationship is non-monotonic, not the linear improvement commonly assumed.

When does thinking too much actually hurt reasoning?

Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can reconstructing expert thinking improve reasoning transfer?

Training on expert texts augmented with reconstructed thought processes (self-talk, knowledge recall, verification) produces reasoning skills that transfer across domains and adapt depth to problem difficulty, outperforming standard continual pretraining by up to 8 points on hard problems.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
