How does extended thinking affect variance in reasoning model outputs?
This explores whether longer thinking traces make a model reason better, or just spread its output distribution wider so it lands on correct answers more often by chance.
This reads the question as asking about a specific mechanism: does extended thinking sharpen reasoning, or does it mostly widen the spread of possible outputs? The corpus has a pointed answer: at least one strand argues the gains from extended thinking come from variance expansion, not better thinking. Longer traces broaden the output distribution so it covers correct answers more often, which looks like improved accuracy but is really improved sampling coverage [Does extended thinking actually improve reasoning or just increase variance?]. The tell is what happens at the extreme: past a critical point the distribution becomes too diffuse and accuracy drops, which is exactly what you'd expect from a coverage mechanism rather than a reasoning one.
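A toy calculation can make the coverage argument concrete. It is only a sketch and is not drawn from the cited notes: it assumes the model's outputs are a one-dimensional Gaussian centred on its default answer, with the correct answer sitting in a narrow interval some distance away. The names `hit_probability`, `offset`, and `half_width`, and all the numbers, are illustrative assumptions.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def hit_probability(sigma: float, offset: float = 2.0, half_width: float = 0.5) -> float:
    """Probability that a draw from N(0, sigma^2) lands in the 'correct' interval.

    Assumption: the correct answer occupies [offset - half_width, offset + half_width],
    away from the distribution's centre at 0.
    """
    lo = (offset - half_width) / sigma
    hi = (offset + half_width) / sigma
    return normal_cdf(hi) - normal_cdf(lo)


# Sweep the spread from very narrow to very diffuse.
sigmas = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
curve = [(s, hit_probability(s)) for s in sigmas]

# The spread that maximises coverage of the correct interval.
best_sigma = max(curve, key=lambda pair: pair[1])[0]

for s, p in curve:
    print(f"sigma={s:5.2f}  P(correct)={p:.4f}")
print("best sigma in sweep:", best_sigma)
```

In this toy setup the hit probability rises as the distribution widens enough to reach the correct interval, then falls as mass leaks past it. That is the same rise-then-fall shape the coverage account predicts, produced without any change in where the distribution is centred.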
That predicts a non-monotonic curve, and several notes support it from different angles. Pushing thinking tokens from ~1,100 to ~16K dropped benchmark accuracy from 87.3% to 70.3%, with models overthinking easy problems and underthinking hard ones [Does more thinking time always improve reasoning accuracy?]. The same result is framed elsewhere as a direct challenge to the 'more thinking is always better' assumption: bypassing explicit reasoning entirely can match or beat standard thinking at equal token budgets [Does more thinking time actually improve LLM reasoning?]. And the inverted-U holds across model capability: optimal chain-of-thought length rises with task difficulty but falls as models get more capable, so stronger models actually need less of it [Why does chain of thought accuracy eventually decline with length?].
Here's the part you might not have gone looking for: the variance isn't just a length artifact, it has internal structure. Reasoning models often fail not from too little compute but from disorganized exploration: wandering down invalid paths and abandoning promising ones prematurely (the 'tourist not scientist' pattern). Decoding-level nudges like thought-switching penalties recover accuracy without any retraining, which means the good answer was reachable but got lost in the spread [Why do reasoning models abandon promising solution paths?]. So extended thinking inflates variance partly by generating more chances to drift.
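As a rough illustration of what a decoding-level thought-switching penalty could look like, the sketch below lowers the logits of tokens that open a new line of thought while the current thought is still young. It is not the cited notes' implementation: the function name `apply_switch_penalty`, the switch-token set, and the `penalty` and `window` values are all hypothetical.

```python
import math


def apply_switch_penalty(logits, switch_tokens, steps_in_thought, penalty=3.0, window=5):
    """Return a copy of `logits` with switch tokens penalised early in a thought.

    Assumptions: `logits` maps token strings to raw scores, and a thought counts
    as 'young' for its first `window` decoding steps.
    """
    if steps_in_thought >= window:
        return dict(logits)  # thought is old enough, leave scores untouched
    return {
        tok: (score - penalty if tok in switch_tokens else score)
        for tok, score in logits.items()
    }


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical next-token scores where the model is inclined to switch approach.
logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switch_tokens = {"Alternatively"}

early = softmax(apply_switch_penalty(logits, switch_tokens, steps_in_thought=1))
late = softmax(apply_switch_penalty(logits, switch_tokens, steps_in_thought=10))

print("early in thought:", max(early, key=early.get))
print("late in thought: ", max(late, key=late.get))
```

The design point is that nothing about the model changes: the penalty only reshapes the sampling distribution so that a path already under way is less likely to be dropped in its first few steps.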
The interesting wrinkle is that variance isn't destiny: training can redirect it. The same thinking mechanism that induces self-doubt and degrades performance in a vanilla model gets transformed by RL into productive gap analysis; what changes is the quality of the trace, not its quantity [Does extended thinking help or hurt model reasoning?]. So whether longer thinking helps depends on whether the model was trained to spend those tokens well.
If you want to act on this rather than just understand it, there are two doorways. You can compress the spread without retraining at all: verbose and concise reasoning occupy distinct regions of activation space, and a single steering vector cuts chain-of-thought length ~67% while holding accuracy [Can we steer reasoning toward brevity without retraining?]. Or you can teach the model to decide when to think at all, routing between extended reasoning and direct answers so it stops paying the variance cost on problems that don't need it [Can models learn when to think versus respond quickly?].
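The steering-vector idea can be sketched in a few lines. The example below assumes the vector is the mean difference between activations of paired concise and verbose traces, which is a common construction but an assumption here, since the notes do not spell out the extraction step. The three-dimensional activations, `mean_difference`, `steer`, and `strength` are all made up for illustration.

```python
def mean_difference(concise, verbose):
    """Average of (concise - verbose) over paired activation vectors."""
    n = len(concise)
    dim = len(concise[0])
    return [
        sum(c[i] - v[i] for c, v in zip(concise, verbose)) / n
        for i in range(dim)
    ]


def steer(hidden, vector, strength=1.0):
    """Shift one hidden state along the steering vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Hypothetical paired activations: same prompts, verbose vs. concise reasoning.
verbose = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
concise = [[0.0, 1.0, 2.0], [2.0, 1.0, 2.0]]

vector = mean_difference(concise, verbose)

# Nudge a 'verbose' hidden state toward the concise region.
steered = steer([3.0, 0.0, 2.0], vector, strength=1.0)
print("steering vector:", vector)
print("steered state:  ", steered)
```

In a real model the shift would be applied to a chosen layer's hidden states during generation, and `strength` would need tuning so that brevity does not come at the cost of accuracy.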
Sources (8 notes)
Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Accuracy drops from 87.3% to 70.3% as thinking tokens scale from 1,100 to 16,000, and bypassing explicit reasoning entirely matches or beats standard thinking at equal token budgets. The relationship is non-monotonic, not the linear improvement commonly assumed.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.