Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does policy entropy collapse limit reasoning performance in RL?

As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?

Note · 2026-02-20 · sourced from Test Time Compute

Across large RL runs for reasoning, a consistent pattern emerges: policy entropy drops sharply early in training, and performance saturates soon after. The mechanism is predictable: high-probability actions with high advantage reduce entropy (the model becomes increasingly confident), while rare actions with high advantage would increase it; but rare actions are, by definition, rarely selected, so the entropy-reducing forces dominate.

This produces an empirical law: R = -a·exp(H) + b, where R is downstream performance and H is policy entropy. As H → 0, R → b - a (since exp(0) = 1): the performance ceiling is deterministic and visible before it is reached. The model isn't getting better; it's approaching a wall defined by the entropy it has already spent.
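
A minimal sketch of how this law can be used in practice, assuming logged (entropy, performance) checkpoints and standard SciPy curve fitting; the numbers below are illustrative, not from the source:

```python
# Fit R = -a * exp(H) + b to (entropy, performance) checkpoints and read off
# the ceiling R -> b - a as H -> 0. Data points are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def entropy_performance_law(H, a, b):
    """Empirical law relating policy entropy H to downstream performance R."""
    return -a * np.exp(H) + b

# Hypothetical (entropy, accuracy) pairs logged over a training run.
H_obs = np.array([1.20, 0.80, 0.50, 0.30, 0.15, 0.08])
R_obs = np.array([0.22, 0.34, 0.41, 0.45, 0.47, 0.48])

(a_fit, b_fit), _ = curve_fit(entropy_performance_law, H_obs, R_obs, p0=(0.1, 0.5))
ceiling = b_fit - a_fit  # predicted performance as H -> 0 (exp(0) = 1)
print(f"a={a_fit:.3f}, b={b_fit:.3f}, predicted ceiling={ceiling:.3f}")
```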

The implication: RL scaling for reasoning cannot continue indefinitely without entropy management. Two proposed interventions — Clip-Cov (clipping updates on high-covariance tokens) and KL-Cov (penalizing high-covariance tokens with KL divergence) — both work by restricting the update of the tokens most responsible for entropy reduction. This preserves the exploratory capacity the model needs to keep improving.
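
A hedged sketch of the covariance-based token selection both interventions rely on, in PyTorch; the token fraction, KL estimator, and loss wiring here are illustrative rather than the papers' exact recipes:

```python
# Identify the tokens whose (log-prob, advantage) covariance term is largest --
# these drive entropy down fastest -- and either stop their policy gradient
# (Clip-Cov-style) or add a divergence penalty on them (KL-Cov-style).
import torch

def covariance_scores(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Per-token contribution to Cov(log pi, A) within the batch."""
    centered_lp = logprobs - logprobs.mean()
    centered_adv = advantages - advantages.mean()
    return centered_lp * centered_adv

def clip_cov_mask(scores: torch.Tensor, frac: float = 0.002) -> torch.Tensor:
    """Boolean mask marking the top-`frac` highest-covariance tokens."""
    k = max(1, int(frac * scores.numel()))
    thresh = torch.topk(scores.flatten(), k).values.min()
    return scores >= thresh

def policy_loss_with_cov_control(logprobs, old_logprobs, advantages,
                                 mode="clip_cov", kl_coef=1.0):
    scores = covariance_scores(logprobs.detach(), advantages)
    ratio = torch.exp(logprobs - old_logprobs)
    pg_loss = -(ratio * advantages)
    mask = clip_cov_mask(scores)
    if mode == "clip_cov":
        # Drop the gradient for high-covariance tokens entirely.
        pg_loss = torch.where(mask, pg_loss.detach(), pg_loss)
    elif mode == "kl_cov":
        # Simple log-ratio penalty standing in for a per-token KL term,
        # pulling the flagged tokens back toward the old policy.
        log_ratio = logprobs - old_logprobs
        pg_loss = pg_loss + kl_coef * torch.where(mask, log_ratio,
                                                  torch.zeros_like(log_ratio))
    return pg_loss.mean()
```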

GPPO gradient-preserving approach: The Klear-Reasoner paper identifies a complementary mechanism: standard clipping in PPO/GRPO can set gradients to zero for tokens where the ratio exceeds the clip range, creating dead zones where the model receives no learning signal. GPPO (Gradient-Preserving Policy Optimization) modifies the clipping function to always preserve non-zero gradients, ensuring the model can still learn from tokens that would otherwise be clipped. This is a different lever on entropy management — rather than restricting which tokens get updated (Clip-Cov) or penalizing divergence (KL-Cov), GPPO ensures all tokens continue contributing to learning even when their probability ratios are extreme.
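
One plausible way to write a gradient-preserving clip in PyTorch (illustrative; the exact GPPO formulation in Klear-Reasoner may differ): the forward value is still bounded, but out-of-range tokens keep a scaled, non-zero gradient instead of falling into a dead zone.

```python
import torch

def gradient_preserving_clip(ratio: torch.Tensor,
                             eps_low: float = 0.2,
                             eps_high: float = 0.2) -> torch.Tensor:
    """Forward value matches standard clipping; backward keeps a scaled gradient."""
    lo, hi = 1.0 - eps_low, 1.0 + eps_high
    # ratio / ratio.detach() evaluates to the bound in the forward pass, but
    # its derivative is bound / ratio, so the gradient shrinks rather than vanishes.
    at_hi = ratio / ratio.detach() * hi
    at_lo = ratio / ratio.detach() * lo
    out = torch.where(ratio > hi, at_hi, ratio)
    return torch.where(out < lo, at_lo, out)

def surrogate_loss(logprobs, old_logprobs, advantages):
    ratio = torch.exp(logprobs - old_logprobs)
    # Plain surrogate with the preserving clip; PPO's pessimistic min could be
    # layered on top, but is omitted to keep the sketch small.
    return -(gradient_preserving_clip(ratio) * advantages).mean()
```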

The connection to Does extended thinking actually improve reasoning or just increase variance? is notable: at test time, the problem is too much variance; at training time, the problem is too little. Inference and training optimize against each other.

The format convergence mechanism: A controlled "Echo Chamber" study (pretraining from scratch with known data mixtures) reveals the specific form entropy collapse takes. RL does not merely reduce diversity in the abstract — it converges on producing outputs in the format of a single specific pretraining distribution while suppressing all others. Within the first epoch, the model shifts to generating answers in one distribution's format; this transition coincides with the largest accuracy gain. Which distribution "wins" is scale-dependent: smaller models favor simpler code-like formats, larger models shift to natural language. The degree of amplification depends on the KL penalty coefficient — looser KL allows more extreme format collapse. This gives practitioners a concrete control lever: KL penalty strength regulates not just how much the model diverges from its prior, but which pretraining pattern gets selected and amplified. See Does RL training collapse format diversity in pretrained models?.
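
For reference, the control lever described here is the β coefficient in the standard KL-regularized reward; a minimal sketch, with the estimator and β value illustrative (the study's exact formulation is not given in this note):

```python
# A smaller beta (looser KL penalty) lets the policy drift further from the
# pretrained prior, which in the Echo Chamber study corresponds to more
# extreme format collapse.
import torch

def kl_penalized_reward(task_reward, logp_policy, logp_ref, beta=0.05):
    # Per-token log-ratio estimate of KL between current policy and the
    # pretrained reference, summed over the sequence.
    kl_est = logp_policy - logp_ref
    return task_reward - beta * kl_est.sum(dim=-1)
```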

Diversity-aware RL as direct countermeasure: DARLING (Diversity-Aware Reinforcement Learning) addresses entropy collapse by explicitly optimizing for semantic diversity alongside quality. A learned partition function clusters rollouts into semantically distinct groups; the diversity signal is multiplied with quality reward, amplifying advantage for responses that are both high-quality and semantically novel. The counter-intuitive result: diversity optimization also improves quality, because it forces maintained exploration across distinct solution strategies the model would never reach through pure exploitation. See Can diversity optimization improve quality during language model training?.
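
A hedged sketch of the reward shaping described above; the clustering call stands in for DARLING's learned partition function, `embed` is a hypothetical sentence-embedding function, and the multiplicative combination below is one simple way to couple the two signals:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diversity_weighted_rewards(responses, quality_rewards, embed,
                               dist_threshold=0.3):
    X = np.stack([embed(r) for r in responses])          # (n, d) embeddings
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=dist_threshold,
        metric="cosine", linkage="average").fit_predict(X)
    n = len(responses)
    # Diversity signal: responses in small clusters are rarer, so score higher.
    cluster_sizes = np.bincount(labels)
    diversity = 1.0 - cluster_sizes[labels] / n
    # Multiplicative coupling: only high-quality AND novel responses get amplified.
    return np.asarray(quality_rewards) * (1.0 + diversity)
```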

Hidden-state resolution of the exploration-exploitation trade-off: The exploration-exploitation tension in RLVR may be partly an artifact of token-level measurement. Hidden-state analysis via Effective Rank (measuring the dimensionality of the representation space), ERV (variability of effective rank across tokens), and ERA (the aggregate exploration measure) shows near-zero correlation between exploration and exploitation metrics at the hidden-state level. Both can be simultaneously enhanced — exploration in representation space and exploitation in output quality — because the trade-off manifests at the token level but dissolves at the hidden-state level. See Is the exploration-exploitation trade-off actually fundamental?.
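
A hedged sketch of the effective-rank computation (the standard Roy–Vetterli definition); how ERV and ERA aggregate it across tokens is paraphrased here, so the windowing choice is illustrative:

```python
import torch

def effective_rank(hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (tokens, d) matrix of hidden states for one response."""
    s = torch.linalg.svdvals(hidden)
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy)  # 'dimensionality' of the representation space

def er_variability(hidden: torch.Tensor, window: int = 32) -> torch.Tensor:
    """ERV-style signal: how much effective rank varies across token windows.
    Assumes the response is longer than a couple of windows."""
    ranks = torch.stack([effective_rank(hidden[i:i + window])
                         for i in range(0, hidden.shape[0] - window, window)])
    return ranks.std()
```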

Capability boundary collapse as distinct from entropy collapse: RL-PLUS identifies a related but distinct mechanism: RLVR's on-policy constraint creates inward exploitation (toward solved problems) while neglecting outward exploration (toward unsolvable or novel problems). This narrows the model's problem-solving scope — a capability boundary collapse distinct from entropy collapse per se. RL-PLUS counteracts it via Multiple Importance Sampling (incorporating external data sources) and an Exploration-Based Advantage Function (rewarding behavior on currently-unsolvable problems). See Why does RLVR training narrow a model's problem solving ability?.
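
A hedged sketch of a multiple-importance-sampling weight for mixing on-policy rollouts with samples from an external source; the balance heuristic below is the textbook MIS combination rule, not necessarily RL-PLUS's exact estimator:

```python
import torch

def mis_weight(logp_policy: torch.Tensor, logp_external: torch.Tensor,
               mix: float = 0.5) -> torch.Tensor:
    """Balance-heuristic weight w(x) = pi(x) / (mix*pi(x) + (1-mix)*q(x)),
    treating the current policy pi as the target distribution."""
    p_pol = torch.exp(logp_policy)
    p_ext = torch.exp(logp_external)
    proposal_mixture = mix * p_pol + (1.0 - mix) * p_ext
    return p_pol / (proposal_mixture + 1e-12)
```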

Alignment-induced entropy collapse in creative tasks: The entropy collapse dynamic is not limited to reasoning RL. Debiasing and alignment training produces the same pattern in creative output: aligned models exhibit lower entropy in token predictions, form distinct clusters in the embedding space, and gravitate toward "attractor states" — indicating limited output diversity. This has direct implications for marketing applications (copywriting, ad creation, persona generation) where creative diversity is the product. The mechanism is the same: optimization pressure (whether for reasoning accuracy or safety/alignment) narrows the output distribution. See "Creativity Has Left the Chat" (Bozzone et al., 2024).

Outcome-space exploration bonuses: A complementary approach operates at the outcome level rather than the trajectory level. Since reasoning tasks admit only a limited set of distinct final answers, UCB-style exploration bonuses over outcome space are tractable. Historical exploration (UCB bonuses for rarely-observed answers) improves pass@1 by expanding training diversity; batch exploration (within-batch repetition penalties) improves pass@k by expanding test-time diversity. These require different mechanisms — a critical distinction because RL-induced diversity loss on solved problems transfers to unsolved ones via global policy sharpening. See Does outcome-based RL diversity loss spread across unsolved problems?.
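
A hedged sketch of both mechanisms; all constants are illustrative, since the note names the two mechanisms but not their exact form:

```python
import math
from collections import Counter

class OutcomeExplorer:
    def __init__(self, c: float = 1.0):
        self.counts = Counter()   # historical count of each final answer
        self.total = 0
        self.c = c

    def historical_bonus(self, answer: str) -> float:
        """UCB-style bonus: rarely-observed final answers earn a larger bonus."""
        n = self.counts[answer]
        return self.c * math.sqrt(math.log(self.total + 1) / (n + 1))

    def batch_penalty(self, batch_answers: list[str]) -> list[float]:
        """Within-batch repetition penalty: repeated answers are discounted."""
        seen = Counter()
        penalties = []
        for a in batch_answers:
            penalties.append(-0.1 * seen[a])  # hypothetical linear penalty
            seen[a] += 1
        return penalties

    def update(self, batch_answers: list[str]):
        self.counts.update(batch_answers)
        self.total += len(batch_answers)
```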


Source: Test Time Compute, Reward Models, RLVR

Original note title: policy entropy collapse is the primary bottleneck in RL scaling for reasoning