Reinforcement Learning for LLMs

Why do reasoning models fail differently at training versus inference?

Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.

How should we allocate compute budget at inference time?

Two failure modes in the test-time scaling literature look unrelated but share the same underlying mechanism: a failed exploration-exploitation balance.

Policy entropy collapse (training time): When RL trains a reasoning model, policy entropy drops over time — the model converges to a narrow repertoire of reasoning paths, sacrificing diversity for short-term reward. The result is a model that's overfit to familiar problem types and struggles to explore novel solution strategies. The fix lives in training: entropy bonuses, diverse critique models, or curriculum design that maintains distributional breadth.
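A minimal sketch of the entropy-bonus fix, assuming a PyTorch-style REINFORCE loss over next-token logits; the function name, the `beta` coefficient, and the tensor shapes are illustrative assumptions, not details from the source:

```python
import torch.nn.functional as F

def pg_loss_with_entropy_bonus(logits, actions, advantages, beta=0.01):
    # Hypothetical shapes: logits [batch, vocab], actions [batch], advantages [batch].
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_term = -(advantages * chosen).mean()            # REINFORCE-style objective
    entropy = -(probs * log_probs).sum(dim=-1).mean()  # mean policy entropy H(pi)
    # Higher entropy lowers the loss, pushing back against collapse
    # onto a narrow repertoire of reasoning paths.
    return pg_term - beta * entropy
```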

Variance inflation (inference time): When a reasoning model's thinking budget is extended beyond its optimum, output variance inflates while quality fails to improve. The model doesn't converge on the right answer; it oscillates between candidates. The exploration mechanism that training instilled becomes runaway oscillation without the stabilizing feedback of a verifier. The fix lives in inference: parallel sampling instead of sequential extension, confidence-based filtering, or hard token budgets.
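The inference-side fixes compose naturally. A sketch of parallel sampling with a hard token budget and confidence filtering; `generate` is a hypothetical API assumed to return an (answer, confidence) pair per independent sample:

```python
from collections import Counter

def answer_with_parallel_sampling(generate, prompt, n=8, max_tokens=1024, min_conf=0.5):
    kept = []
    for _ in range(n):
        # Hard token budget per sample, instead of one long sequential chain.
        answer, confidence = generate(prompt, max_tokens=max_tokens)
        if confidence >= min_conf:  # confidence-based filtering
            kept.append(answer)
    if not kept:
        return None  # abstain rather than extend the thinking budget
    # Majority vote across independent samples averages out the oscillation
    # that sequential extension would accumulate.
    return Counter(kept).most_common(1)[0][0]
```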

Both failures are manifestations of the same underlying problem: the model is neither confidently right nor productively exploring — it's stuck in an uncertain middle state that wastes compute without generating signal. But because they occur at different timescales, the interventions are completely different:

| Failure | Timescale | Mechanism | Fix |
|---|---|---|---|
| Entropy collapse | Training | Policy over-narrows | Critique diversity, entropy bonuses |
| Variance inflation | Inference | Thinking over-extends | Parallel sampling, token limits |

The practical implication: optimizing inference alone (parallel vs sequential, budget allocation) cannot fix a training-time entropy problem. Conversely, training for exploration diversity cannot prevent inference-time variance inflation if the token budget is set too high. Both loops must be managed independently.

Historical vs batch exploration: The Outcome-based Exploration paper adds taxonomic precision to this dual problem. Historical exploration (visiting diverse states during training) improves pass@1 via expanded training signal — this is the training-time fix. Batch exploration (producing diverse outputs at test time) improves pass@k via broader solution coverage — this is the test-time fix. The mechanisms are structurally different: UCB-style bonuses over outcome space for historical exploration, within-batch repetition penalties for batch exploration. This maps the training/test-time dual directly onto concrete algorithmic prescriptions. See Does outcome-based RL diversity loss spread across unsolved problems?.
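A sketch of the two mechanisms as reward shaping, under loose assumptions: the function names and the `c` and `lam` coefficients are illustrative, not the paper's exact formulation:

```python
import math
from collections import Counter

def historical_bonus(outcome, outcome_counts, total_visits, c=1.0):
    # Training time: UCB-style bonus over outcome space. Outcomes seen
    # rarely across training history earn a larger reward bonus.
    n = outcome_counts.get(outcome, 0) + 1
    return c * math.sqrt(math.log(total_visits + 1) / n)

def batch_penalties(outcomes, lam=0.5):
    # Test time: within-batch repetition penalty. Each sample is penalized
    # in proportion to how many batch-mates reached the same outcome.
    counts = Counter(outcomes)
    return [lam * (counts[o] - 1) / len(outcomes) for o in outcomes]
```

The structural difference is visible in the signatures: the historical bonus needs cross-episode state (`outcome_counts`), while the batch penalty needs only the current batch, which is why one cannot substitute for the other.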

Training data format as an upstream entropy variable: Does training data format shape reasoning strategy more than domain? adds a third factor upstream of both. Multiple-choice training produces BFS-like (breadth-first, parallel-path) reasoning; free-form training produces DFS-like (depth-first, sequential) reasoning. Format shapes the default exploration profile before any RL training begins. This means entropy collapse is not solely a training-time problem — it can be seeded by format choices in the pre-RL training data. A model pre-trained on free-form data starts the RL phase with a depth-first, collapse-prone default strategy. A model pre-trained on multiple-choice data starts with a more diverse exploration strategy. The intervention sequence is thus: format decisions → exploration profile → RL collapse rate. Managing entropy requires attending to all three.


Source: Test Time Compute
