Reinforcement Learning for LLMs

Why do reasoning models fail differently at training versus inference?

Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.

How should we allocate compute budget at inference time?

Two failure modes in the test-time scaling literature look unrelated but share the same underlying mechanism: a failed exploration-exploitation balance.

Policy entropy collapse (training time): When RL trains a reasoning model, policy entropy drops over time — the model converges to a narrow repertoire of reasoning paths, sacrificing diversity for short-term reward. The result is a model that's overfit to familiar problem types and struggles to explore novel solution strategies. The fix lives in training: entropy bonuses, diverse critique models, or curriculum design that maintains distributional breadth.
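A minimal sketch of the entropy-bonus fix, assuming a PyTorch-style REINFORCE loss over next-token logits; the function name, the `beta` coefficient, and the tensor shapes are illustrative assumptions, not details from the source:

```python
import torch.nn.functional as F

def pg_loss_with_entropy_bonus(logits, actions, advantages, beta=0.01):
    # Hypothetical shapes: logits [batch, vocab], actions [batch], advantages [batch].
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_term = -(advantages * chosen).mean()            # REINFORCE-style objective
    entropy = -(probs * log_probs).sum(dim=-1).mean()  # mean policy entropy H(pi)
    # Higher entropy lowers the loss, pushing back against collapse
    # onto a narrow repertoire of reasoning paths.
    return pg_term - beta * entropy
```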

Variance inflation (inference time): When a reasoning model's thinking budget is extended beyond its optimum, output variance inflates while quality fails to improve. The model doesn't converge on the right answer; it oscillates between candidates. The exploration mechanism that training instilled becomes runaway oscillation without the stabilizing feedback of a verifier. The fix lives in inference: parallel sampling instead of sequential extension, confidence-based filtering, or hard token budgets.
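The inference-side fixes compose naturally. A sketch of parallel sampling with a hard token budget and confidence filtering; `generate` is a hypothetical API assumed to return an (answer, confidence) pair per independent sample:

```python
from collections import Counter

def answer_with_parallel_sampling(generate, prompt, n=8, max_tokens=1024, min_conf=0.5):
    kept = []
    for _ in range(n):
        # Hard token budget per sample, instead of one long sequential chain.
        answer, confidence = generate(prompt, max_tokens=max_tokens)
        if confidence >= min_conf:  # confidence-based filtering
            kept.append(answer)
    if not kept:
        return None  # abstain rather than extend the thinking budget
    # Majority vote across independent samples averages out the oscillation
    # that sequential extension would accumulate.
    return Counter(kept).most_common(1)[0][0]
```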

Both failures are manifestations of the same underlying problem: the model is neither confidently right nor productively exploring — it's stuck in an uncertain middle state that wastes compute without generating signal. But because they occur at different timescales, the interventions are completely different:

| Failure | Timescale | Mechanism | Fix |
|---|---|---|---|
| Entropy collapse | Training | Policy over-narrows | Critique diversity, entropy bonuses |
| Variance inflation | Inference | Thinking over-extends | Parallel sampling, token limits |

The practical implication: optimizing inference alone (parallel vs sequential, budget allocation) cannot fix a training-time entropy problem. Conversely, training for exploration diversity cannot prevent inference-time variance inflation if the token budget is set too high. Both loops must be managed independently.

Historical vs batch exploration: The Outcome-based Exploration paper adds taxonomic precision to this dual problem. Historical exploration (visiting diverse states during training) improves pass@1 via expanded training signal — this is the training-time fix. Batch exploration (producing diverse outputs at test time) improves pass@k via broader solution coverage — this is the test-time fix. The mechanisms are structurally different: UCB-style bonuses over outcome space for historical exploration, within-batch repetition penalties for batch exploration. This maps the training/test-time dual directly onto concrete algorithmic prescriptions. See Does outcome-based RL diversity loss spread across unsolved problems?.
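A sketch of the two mechanisms as reward shaping, under loose assumptions: the function names and the `c` and `lam` coefficients are illustrative, not the paper's exact formulation:

```python
import math
from collections import Counter

def historical_bonus(outcome, outcome_counts, total_visits, c=1.0):
    # Training time: UCB-style bonus over outcome space. Outcomes seen
    # rarely across training history earn a larger reward bonus.
    n = outcome_counts.get(outcome, 0) + 1
    return c * math.sqrt(math.log(total_visits + 1) / n)

def batch_penalties(outcomes, lam=0.5):
    # Test time: within-batch repetition penalty. Each sample is penalized
    # in proportion to how many batch-mates reached the same outcome.
    counts = Counter(outcomes)
    return [lam * (counts[o] - 1) / len(outcomes) for o in outcomes]
```

The structural difference is visible in the signatures: the historical bonus needs cross-episode state (`outcome_counts`), while the batch penalty needs only the current batch, which is why one cannot substitute for the other.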

Training data format as an upstream entropy variable: Does training data format shape reasoning strategy more than domain? adds a third factor upstream of both. Multiple-choice training produces BFS-like (breadth-first, parallel-path) reasoning; free-form training produces DFS-like (depth-first, sequential) reasoning. Format shapes the default exploration profile before any RL training begins. This means entropy collapse is not solely a training-time problem — it can be seeded by format choices in the pre-RL training data. A model pre-trained on free-form data starts the RL phase with a depth-first, collapse-prone default strategy. A model pre-trained on multiple-choice data starts with a more diverse exploration strategy. The intervention sequence is thus: format decisions → exploration profile → RL collapse rate. Managing entropy requires attending to all three.


Source: Test Time Compute
