Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does policy entropy collapse limit reasoning performance in RL?

As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?

Note · 2026-02-20 · sourced from Test Time Compute

Across large RL runs for reasoning, a consistent pattern emerges: policy entropy drops sharply early in training, and performance saturates soon after. The mechanism is predictable: high-probability actions with high advantage reduce entropy (the model becomes increasingly confident), while rare actions with high advantage would increase it; but rare actions are, by definition, rarely selected, so the entropy-reducing forces dominate.

This produces an empirical law: R = -a·exp(H) + b, where R is downstream performance and H is policy entropy. As H → 0, R → b - a (since exp(0) = 1): the performance ceiling is deterministic and visible before it is reached. The model isn't getting better; it's approaching a wall defined by the entropy it has already spent.
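
A minimal sketch of how this law can be used in practice, assuming logged (entropy, performance) checkpoints and standard SciPy curve fitting; the numbers below are illustrative, not from the source:

```python
# Fit R = -a * exp(H) + b to (entropy, performance) checkpoints and read off
# the ceiling R -> b - a as H -> 0. Data points are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def entropy_performance_law(H, a, b):
    """Empirical law relating policy entropy H to downstream performance R."""
    return -a * np.exp(H) + b

# Hypothetical (entropy, accuracy) pairs logged over a training run.
H_obs = np.array([1.20, 0.80, 0.50, 0.30, 0.15, 0.08])
R_obs = np.array([0.22, 0.34, 0.41, 0.45, 0.47, 0.48])

(a_fit, b_fit), _ = curve_fit(entropy_performance_law, H_obs, R_obs, p0=(0.1, 0.5))
ceiling = b_fit - a_fit  # predicted performance as H -> 0 (exp(0) = 1)
print(f"a={a_fit:.3f}, b={b_fit:.3f}, predicted ceiling={ceiling:.3f}")
```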

The implication: RL scaling for reasoning cannot continue indefinitely without entropy management. Two proposed interventions — Clip-Cov (clipping updates on high-covariance tokens) and KL-Cov (penalizing high-covariance tokens with KL divergence) — both work by restricting the update of the tokens most responsible for entropy reduction. This preserves the exploratory capacity the model needs to keep improving.
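
A hedged sketch of the covariance-based token selection both interventions rely on, in PyTorch; the token fraction, KL estimator, and loss wiring here are illustrative rather than the papers' exact recipes:

```python
# Identify the tokens whose (log-prob, advantage) covariance term is largest --
# these drive entropy down fastest -- and either stop their policy gradient
# (Clip-Cov-style) or add a divergence penalty on them (KL-Cov-style).
import torch

def covariance_scores(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Per-token contribution to Cov(log pi, A) within the batch."""
    centered_lp = logprobs - logprobs.mean()
    centered_adv = advantages - advantages.mean()
    return centered_lp * centered_adv

def clip_cov_mask(scores: torch.Tensor, frac: float = 0.002) -> torch.Tensor:
    """Boolean mask marking the top-`frac` highest-covariance tokens."""
    k = max(1, int(frac * scores.numel()))
    thresh = torch.topk(scores.flatten(), k).values.min()
    return scores >= thresh

def policy_loss_with_cov_control(logprobs, old_logprobs, advantages,
                                 mode="clip_cov", kl_coef=1.0):
    scores = covariance_scores(logprobs.detach(), advantages)
    ratio = torch.exp(logprobs - old_logprobs)
    pg_loss = -(ratio * advantages)
    mask = clip_cov_mask(scores)
    if mode == "clip_cov":
        # Drop the gradient for high-covariance tokens entirely.
        pg_loss = torch.where(mask, pg_loss.detach(), pg_loss)
    elif mode == "kl_cov":
        # Simple log-ratio penalty standing in for a per-token KL term,
        # pulling the flagged tokens back toward the old policy.
        log_ratio = logprobs - old_logprobs
        pg_loss = pg_loss + kl_coef * torch.where(mask, log_ratio,
                                                  torch.zeros_like(log_ratio))
    return pg_loss.mean()
```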

GPPO gradient-preserving approach: The Klear-Reasoner paper identifies a complementary mechanism: standard clipping in PPO/GRPO can set gradients to zero for tokens where the ratio exceeds the clip range, creating dead zones where the model receives no learning signal. GPPO (Gradient-Preserving Policy Optimization) modifies the clipping function to always preserve non-zero gradients, ensuring the model can still learn from tokens that would otherwise be clipped. This is a different lever on entropy management — rather than restricting which tokens get updated (Clip-Cov) or penalizing divergence (KL-Cov), GPPO ensures all tokens continue contributing to learning even when their probability ratios are extreme.
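
One plausible way to write a gradient-preserving clip in PyTorch (illustrative; the exact GPPO formulation in Klear-Reasoner may differ): the forward value is still bounded, but out-of-range tokens keep a scaled, non-zero gradient instead of falling into a dead zone.

```python
import torch

def gradient_preserving_clip(ratio: torch.Tensor,
                             eps_low: float = 0.2,
                             eps_high: float = 0.2) -> torch.Tensor:
    """Forward value matches standard clipping; backward keeps a scaled gradient."""
    lo, hi = 1.0 - eps_low, 1.0 + eps_high
    # ratio / ratio.detach() evaluates to the bound in the forward pass, but
    # its derivative is bound / ratio, so the gradient shrinks rather than vanishes.
    at_hi = ratio / ratio.detach() * hi
    at_lo = ratio / ratio.detach() * lo
    out = torch.where(ratio > hi, at_hi, ratio)
    return torch.where(out < lo, at_lo, out)

def surrogate_loss(logprobs, old_logprobs, advantages):
    ratio = torch.exp(logprobs - old_logprobs)
    # Plain surrogate with the preserving clip; PPO's pessimistic min could be
    # layered on top, but is omitted to keep the sketch small.
    return -(gradient_preserving_clip(ratio) * advantages).mean()
```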

The connection to Does extended thinking actually improve reasoning or just increase variance? is notable: at test time, the problem is too much variance; at training time, the problem is too little. Inference and training optimize against each other.

The format convergence mechanism: A controlled "Echo Chamber" study (pretraining from scratch with known data mixtures) reveals the specific form entropy collapse takes. RL does not merely reduce diversity in the abstract — it converges on producing outputs in the format of a single specific pretraining distribution while suppressing all others. Within the first epoch, the model shifts to generating answers in one distribution's format; this transition coincides with the largest accuracy gain. Which distribution "wins" is scale-dependent: smaller models favor simpler code-like formats, larger models shift to natural language. The degree of amplification depends on the KL penalty coefficient — looser KL allows more extreme format collapse. This gives practitioners a concrete control lever: KL penalty strength regulates not just how much the model diverges from its prior, but which pretraining pattern gets selected and amplified. See Does RL training collapse format diversity in pretrained models?.
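
For reference, the control lever described here is the β coefficient in the standard KL-regularized reward; a minimal sketch, with the estimator and β value illustrative (the study's exact formulation is not given in this note):

```python
# A smaller beta (looser KL penalty) lets the policy drift further from the
# pretrained prior, which in the Echo Chamber study corresponds to more
# extreme format collapse.
import torch

def kl_penalized_reward(task_reward, logp_policy, logp_ref, beta=0.05):
    # Per-token log-ratio estimate of KL between current policy and the
    # pretrained reference, summed over the sequence.
    kl_est = logp_policy - logp_ref
    return task_reward - beta * kl_est.sum(dim=-1)
```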

Diversity-aware RL as direct countermeasure: DARLING (Diversity-Aware Reinforcement Learning) addresses entropy collapse by explicitly optimizing for semantic diversity alongside quality. A learned partition function clusters rollouts into semantically distinct groups; the diversity signal is multiplied with quality reward, amplifying advantage for responses that are both high-quality and semantically novel. The counter-intuitive result: diversity optimization also improves quality, because it forces maintained exploration across distinct solution strategies the model would never reach through pure exploitation. See Can diversity optimization improve quality during language model training?.
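
A hedged sketch of the reward shaping described above; the clustering call stands in for DARLING's learned partition function, `embed` is a hypothetical sentence-embedding function, and the multiplicative combination below is one simple way to couple the two signals:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diversity_weighted_rewards(responses, quality_rewards, embed,
                               dist_threshold=0.3):
    X = np.stack([embed(r) for r in responses])          # (n, d) embeddings
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=dist_threshold,
        metric="cosine", linkage="average").fit_predict(X)
    n = len(responses)
    # Diversity signal: responses in small clusters are rarer, so score higher.
    cluster_sizes = np.bincount(labels)
    diversity = 1.0 - cluster_sizes[labels] / n
    # Multiplicative coupling: only high-quality AND novel responses get amplified.
    return np.asarray(quality_rewards) * (1.0 + diversity)
```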

Hidden-state resolution of the exploration-exploitation trade-off: The exploration-exploitation tension in RLVR may be partly an artifact of token-level measurement. Hidden-state analysis via Effective Rank (measuring the dimensionality of the representation space), ERV (variability of effective rank across tokens), and ERA (the aggregate exploration measure) shows near-zero correlation between exploration and exploitation metrics at the hidden-state level. Both can be simultaneously enhanced — exploration in representation space and exploitation in output quality — because the trade-off manifests at the token level but dissolves at the hidden-state level. See Is the exploration-exploitation trade-off actually fundamental?.
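
A hedged sketch of the effective-rank computation (the standard Roy–Vetterli definition); how ERV and ERA aggregate it across tokens is paraphrased here, so the windowing choice is illustrative:

```python
import torch

def effective_rank(hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (tokens, d) matrix of hidden states for one response."""
    s = torch.linalg.svdvals(hidden)
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy)  # 'dimensionality' of the representation space

def er_variability(hidden: torch.Tensor, window: int = 32) -> torch.Tensor:
    """ERV-style signal: how much effective rank varies across token windows.
    Assumes the response is longer than a couple of windows."""
    ranks = torch.stack([effective_rank(hidden[i:i + window])
                         for i in range(0, hidden.shape[0] - window, window)])
    return ranks.std()
```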

Capability boundary collapse as distinct from entropy collapse: RL-PLUS identifies a related but distinct mechanism: RLVR's on-policy constraint creates inward exploitation (toward solved problems) while neglecting outward exploration (toward unsolvable or novel problems). This narrows the model's problem-solving scope — a capability boundary collapse distinct from entropy collapse per se. RL-PLUS counteracts it via Multiple Importance Sampling (incorporating external data sources) and an Exploration-Based Advantage Function (rewarding behavior on currently-unsolvable problems). See Why does RLVR training narrow a model's problem solving ability?.
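
A hedged sketch of a multiple-importance-sampling weight for mixing on-policy rollouts with samples from an external source; the balance heuristic below is the textbook MIS combination rule, not necessarily RL-PLUS's exact estimator:

```python
import torch

def mis_weight(logp_policy: torch.Tensor, logp_external: torch.Tensor,
               mix: float = 0.5) -> torch.Tensor:
    """Balance-heuristic weight w(x) = pi(x) / (mix*pi(x) + (1-mix)*q(x)),
    treating the current policy pi as the target distribution."""
    p_pol = torch.exp(logp_policy)
    p_ext = torch.exp(logp_external)
    proposal_mixture = mix * p_pol + (1.0 - mix) * p_ext
    return p_pol / (proposal_mixture + 1e-12)
```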

Alignment-induced entropy collapse in creative tasks: The entropy collapse dynamic is not limited to reasoning RL. Debiasing and alignment training produces the same pattern in creative output: aligned models exhibit lower entropy in token predictions, form distinct clusters in the embedding space, and gravitate toward "attractor states" — indicating limited output diversity. This has direct implications for marketing applications (copywriting, ad creation, persona generation) where creative diversity is the product. The mechanism is the same: optimization pressure (whether for reasoning accuracy or safety/alignment) narrows the output distribution. See "Creativity Has Left the Chat" (Bozzone et al., 2024).

Outcome-space exploration bonuses: A complementary approach operates at the outcome level rather than the trajectory level. Since reasoning tasks admit only a limited set of distinct final answers, UCB-style exploration bonuses over outcome space are tractable. Historical exploration (UCB bonuses for rarely-observed answers) improves pass@1 by expanding training diversity; batch exploration (within-batch repetition penalties) improves pass@k by expanding test-time diversity. These require different mechanisms — a critical distinction because RL-induced diversity loss on solved problems transfers to unsolved ones via global policy sharpening. See Does outcome-based RL diversity loss spread across unsolved problems?.
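
A hedged sketch of both mechanisms; all constants are illustrative, since the note names the two mechanisms but not their exact form:

```python
import math
from collections import Counter

class OutcomeExplorer:
    def __init__(self, c: float = 1.0):
        self.counts = Counter()   # historical count of each final answer
        self.total = 0
        self.c = c

    def historical_bonus(self, answer: str) -> float:
        """UCB-style bonus: rarely-observed final answers earn a larger bonus."""
        n = self.counts[answer]
        return self.c * math.sqrt(math.log(self.total + 1) / (n + 1))

    def batch_penalty(self, batch_answers: list[str]) -> list[float]:
        """Within-batch repetition penalty: repeated answers are discounted."""
        seen = Counter()
        penalties = []
        for a in batch_answers:
            penalties.append(-0.1 * seen[a])  # hypothetical linear penalty
            seen[a] += 1
        return penalties

    def update(self, batch_answers: list[str]):
        self.counts.update(batch_answers)
        self.total += len(batch_answers)
```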


Source: Test Time Compute, Reward Models, RLVR

Original note title: policy entropy collapse is the primary bottleneck in RL scaling for reasoning