
Does extended thinking actually improve reasoning or just increase variance?

When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The mechanistic explanation for why extended thinking first improves and then degrades accuracy: it acts as a variance dial on the output distribution, not as a reasoning-quality dial. As thinking tokens increase, the model's output distribution broadens. This initially helps because broader coverage increases the chance of landing on the correct answer. But beyond a point, the distribution becomes so diffuse that it overshoots the reward peak — the "dilution effect" — and accuracy drops.

Formally: there is a competition between a coverage effect (broadening the distribution increases its overlap with the reward region) and a dilution effect (broadening too far places most of the probability mass outside it). The balance between the two predicts the non-monotonic accuracy curve exactly.
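
A minimal sketch of this competition, under a toy model where sampled answers come from a one-dimensional Gaussian whose width sigma stands in for thinking length. The reward region, the biased mean, and every constant below are illustrative assumptions, not from the source:

```python
from scipy.stats import norm

# Toy model: sampled answers ~ N(mu, sigma^2); "thinking longer" widens sigma.
# The reward region is an interval around the true answer, and the model's
# point estimate mu is slightly biased away from it.
true_answer, half_width = 0.0, 0.5   # reward region: [-0.5, 0.5]
mu = 1.0                             # biased mean (illustrative assumption)

for sigma in [0.1, 0.3, 0.6, 1.0, 2.0, 4.0, 8.0]:
    # P(a sample lands in the reward region) under N(mu, sigma^2)
    p_correct = (norm.cdf(true_answer + half_width, mu, sigma)
                 - norm.cdf(true_answer - half_width, mu, sigma))
    print(f"sigma={sigma:4.1f}  P(correct)={p_correct:.3f}")
```

Running this shows the probability of a correct sample rising from near zero to a peak around sigma = 1 and then decaying again: coverage dominates first, dilution after.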

The critical insight is that the apparent gains aren't improvements in reasoning capability — they're improvements in sampling coverage. A model that draws from a wider distribution might hit the right answer more often even if its actual reasoning hasn't improved. This is an illusion because it conflates variance with competence.

This suggests that test-time scaling through extended thinking is not an effective use of inference budget, and it motivates the linked question "Why does parallel reasoning outperform single chain thinking?": parallel reasoning explicitly controls variance through independent sampling rather than letting it inflate through trace extension.
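
A minimal sketch of that independent-sampling strategy (self-consistency by majority vote). Here sample_answer and its 0.4 success probability are hypothetical stand-ins, not anything specified in the source:

```python
import random
from collections import Counter

def sample_answer(rng: random.Random) -> int:
    """Hypothetical stand-in for one short, independent reasoning chain.
    Assumed: the chain returns the correct answer (1) with probability 0.4
    and otherwise scatters uniformly over four distractors."""
    return 1 if rng.random() < 0.4 else rng.choice([2, 3, 4, 5])

def majority_vote(n_chains: int, seed: int) -> int:
    """Self-consistency: run n independent chains and return the modal
    answer. Variance is reduced by aggregation, not by longer traces."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n_chains))
    return votes.most_common(1)[0][0]

# Accuracy over 1000 trials rises sharply with the number of parallel chains.
for n in [1, 5, 25]:
    correct = sum(majority_vote(n, seed=s) == 1 for s in range(1000))
    print(f"{n:2d} chains -> accuracy {correct / 1000:.2f}")
```

Even though each chain is individually right only 40% of the time, aggregating independent draws concentrates the vote on the modal answer — exactly the variance control a single extended trace lacks.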

Theoretical grounding from robustness analysis: The CoT robustness bounds paper (analyzing perturbation propagation through reasoning chains) adds a theoretical dimension. Under Lipschitz continuity assumptions, longer CoT chains do dampen input perturbations, but never fully eliminate them: even an infinite chain leaves a non-zero robustness bound. For the Linear Self-Attention model (a simplified transformer), CoT robustness depends on the norms of the input embeddings and hidden state vectors, with higher norms giving less sensitivity to perturbations. So variance inflation at long chains is not just an empirical finding; it has a theoretical counterpart: perturbation resistance shows diminishing returns, and the residual sensitivity is set by model-level factors (embedding norms) rather than chain length. The practical upshot: there is a finite chain length beyond which extending the chain provides no meaningful additional robustness, which matches the threshold observed empirically.
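
A toy rendering of that diminishing-returns behaviour under the Lipschitz framing. The recurrence below (a per-step contraction L plus a fixed residual eps) is a simplification for illustration, not the paper's actual bound, and the constants are arbitrary:

```python
def robustness_bound(delta0: float, k: int, L: float = 0.8, eps: float = 0.05) -> float:
    """Illustrative bound: each reasoning step contracts the inherited
    perturbation by a Lipschitz factor L < 1 but injects a fixed residual
    sensitivity eps. Closed form after k steps (geometric series):
    L**k * delta0 + eps * (1 - L**k) / (1 - L)."""
    return L**k * delta0 + eps * (1 - L**k) / (1 - L)

floor = 0.05 / (1 - 0.8)  # non-zero asymptote: the bound never reaches 0
for k in [1, 5, 10, 20, 50]:
    print(f"k={k:3d}  bound={robustness_bound(1.0, k):.4f}  (floor={floor:.2f})")
```

The bound plateaus at eps / (1 - L) no matter how long the chain gets, mirroring the claim that past a finite chain length, additional steps buy no further robustness.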


Source: Test Time Compute

Original note title: extended thinking inflates output variance rather than improving reasoning quality