
Does extended thinking actually improve reasoning or just increase variance?

When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The mechanistic explanation for why extended thinking first improves and then degrades accuracy: it acts as a variance dial on the output distribution, not as a reasoning-quality dial. As thinking tokens increase, the model's output distribution broadens. This initially helps because broader coverage increases the chance of landing on the correct answer. But beyond a point, the distribution becomes so diffuse that it overshoots the reward peak — the "dilution effect" — and accuracy drops.

Formally: there is a competition between a coverage effect (broadening the distribution increases its overlap with the reward region) and a dilution effect (broadening too far places most of the probability mass outside it). The balance between the two predicts the non-monotonic accuracy curve exactly.
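
A minimal sketch of this competition, under a toy model where sampled answers come from a one-dimensional Gaussian whose width sigma stands in for thinking length. The reward region, the biased mean, and every constant below are illustrative assumptions, not from the source:

```python
from scipy.stats import norm

# Toy model: sampled answers ~ N(mu, sigma^2); "thinking longer" widens sigma.
# The reward region is an interval around the true answer, and the model's
# point estimate mu is slightly biased away from it.
true_answer, half_width = 0.0, 0.5   # reward region: [-0.5, 0.5]
mu = 1.0                             # biased mean (illustrative assumption)

for sigma in [0.1, 0.3, 0.6, 1.0, 2.0, 4.0, 8.0]:
    # P(a sample lands in the reward region) under N(mu, sigma^2)
    p_correct = (norm.cdf(true_answer + half_width, mu, sigma)
                 - norm.cdf(true_answer - half_width, mu, sigma))
    print(f"sigma={sigma:4.1f}  P(correct)={p_correct:.3f}")
```

Running this shows the probability of a correct sample rising from near zero to a peak around sigma = 1 and then decaying again: coverage dominates first, dilution after.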

The critical insight is that the apparent gains aren't improvements in reasoning capability — they're improvements in sampling coverage. A model that draws from a wider distribution might hit the right answer more often even if its actual reasoning hasn't improved. This is an illusion because it conflates variance with competence.

This suggests that test-time scaling through extended thinking is not an effective use of inference budget, and it motivates the linked question "Why does parallel reasoning outperform single chain thinking?": parallel reasoning explicitly controls variance through independent sampling rather than letting it inflate through trace extension.
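
A minimal sketch of that independent-sampling strategy (self-consistency by majority vote). Here sample_answer and its 0.4 success probability are hypothetical stand-ins, not anything specified in the source:

```python
import random
from collections import Counter

def sample_answer(rng: random.Random) -> int:
    """Hypothetical stand-in for one short, independent reasoning chain.
    Assumed: the chain returns the correct answer (1) with probability 0.4
    and otherwise scatters uniformly over four distractors."""
    return 1 if rng.random() < 0.4 else rng.choice([2, 3, 4, 5])

def majority_vote(n_chains: int, seed: int) -> int:
    """Self-consistency: run n independent chains and return the modal
    answer. Variance is reduced by aggregation, not by longer traces."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n_chains))
    return votes.most_common(1)[0][0]

# Accuracy over 1000 trials rises sharply with the number of parallel chains.
for n in [1, 5, 25]:
    correct = sum(majority_vote(n, seed=s) == 1 for s in range(1000))
    print(f"{n:2d} chains -> accuracy {correct / 1000:.2f}")
```

Even though each chain is individually right only 40% of the time, aggregating independent draws concentrates the vote on the modal answer — exactly the variance control a single extended trace lacks.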

Theoretical grounding from robustness analysis: The CoT robustness bounds paper (analyzing perturbation propagation through reasoning chains) adds a theoretical dimension. Under Lipschitz continuity assumptions, longer CoT chains do dampen input perturbations, but never fully eliminate them: even an infinite chain leaves a non-zero robustness bound. For the Linear Self-Attention model (a simplified transformer), CoT robustness depends on the norms of the input embeddings and hidden state vectors, with higher norms giving less sensitivity to perturbations. So variance inflation at long chains is not just an empirical finding; it has a theoretical counterpart: perturbation resistance shows diminishing returns, and the residual sensitivity is set by model-level factors (embedding norms) rather than chain length. The practical upshot: there is a finite chain length beyond which extending the chain provides no meaningful additional robustness, which matches the threshold observed empirically.
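
A toy rendering of that diminishing-returns behaviour under the Lipschitz framing. The recurrence below (a per-step contraction L plus a fixed residual eps) is a simplification for illustration, not the paper's actual bound, and the constants are arbitrary:

```python
def robustness_bound(delta0: float, k: int, L: float = 0.8, eps: float = 0.05) -> float:
    """Illustrative bound: each reasoning step contracts the inherited
    perturbation by a Lipschitz factor L < 1 but injects a fixed residual
    sensitivity eps. Closed form after k steps (geometric series):
    L**k * delta0 + eps * (1 - L**k) / (1 - L)."""
    return L**k * delta0 + eps * (1 - L**k) / (1 - L)

floor = 0.05 / (1 - 0.8)  # non-zero asymptote: the bound never reaches 0
for k in [1, 5, 10, 20, 50]:
    print(f"k={k:3d}  bound={robustness_bound(1.0, k):.4f}  (floor={floor:.2f})")
```

The bound plateaus at eps / (1 - L) no matter how long the chain gets, mirroring the claim that past a finite chain length, additional steps buy no further robustness.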


Source: Test Time Compute

Original note title: extended thinking inflates output variance rather than improving reasoning quality