Why does policy entropy collapse when scaling RL for reasoning?
This note explores why a model's range of exploratory behavior shrinks, which researchers call policy entropy collapse, as you scale reinforcement learning for reasoning, and what that shrinkage costs. The short version from the corpus: RL rewards a policy for finding strategies that maximize reward, so it keeps doubling down on whatever already works. Diversity is the price. The collapse isn't a bug in one method; it's a structural tendency of reward-maximization to converge on a narrow set of high-scoring moves and abandon the rest of the solution space.
The sharpest result here is that this isn't a vague worry; it's a measurable ceiling. One line of work fits an empirical law, R = -a·exp(H) + b, relating reasoning performance R to policy entropy H through fitted coefficients a and b. Under that fit, performance saturates at b - a as policy entropy approaches zero (see "Does policy entropy collapse limit reasoning performance in RL?"). In other words, once the policy stops exploring, it stops improving, and the entropy you burn early is performance you can't buy back later. That's why interventions like Clip-Cov, KL-Cov, and GPPO all target the same thing: slow the entropy drain so the policy keeps some exploratory capacity alive.
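To make the ceiling concrete, here is a minimal Python sketch of the fitted form. The coefficients are made-up illustrative values, not numbers fitted in the cited work, and `token_entropy` and `predicted_performance` are hypothetical helper names. It also assumes H is Shannon entropy in nats over a next-token distribution.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predicted_performance(entropy, a, b):
    """Fitted form from the text: R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

# Illustrative coefficients only (an assumption, not from the source).
A_COEF, B_COEF = 0.1, 0.8

# One policy spreads mass over four tokens; the other is nearly deterministic.
exploratory = [0.25, 0.25, 0.25, 0.25]
collapsed = [0.997, 0.001, 0.001, 0.001]

for name, dist in [("exploratory", exploratory), ("collapsed", collapsed)]:
    h = token_entropy(dist)
    r = predicted_performance(h, A_COEF, B_COEF)
    print(f"{name}: H={h:.3f} nats, predicted R={r:.3f}")

# At H = 0 the law gives R = b - a, the ceiling this fit predicts.
ceiling = predicted_performance(0.0, A_COEF, B_COEF)
print(f"ceiling at H=0: {ceiling:.3f}")
```

Because R decreases as H grows, the fit says each bit of entropy spent buys some performance, but only until H reaches zero. Nothing in the fitted form lets R exceed b - a, which is the sense in which collapse acts as a ceiling.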
What makes this interesting is how general the mechanism is. The same convergence on narrow strategies shows up in search agents, where RL squeezes behavioral diversity while supervised fine-tuning on varied demonstrations keeps exploration broad (see "Does reinforcement learning squeeze exploration diversity in search agents?"). It shows up in dialogue policies, which collapse to a single dominant action regardless of who they're talking to unless meta-learning forces them to stay variable (see "Can meta-learning prevent dialogue policies from collapsing?"). And it shows up as scale-dependent collapse in social reasoning, where models below a capacity threshold reach decent accuracy through brittle shortcuts rather than real belief-tracking (see "Does reinforcement learning on theory of mind collapse with model scale?"). Different domains, same gravitational pull toward the cheapest reward-maximizing behavior.
There's a deeper clue about the cause in two findings about what RL actually changes. RL updates only 5–30% of parameters, in sparse but nearly identical subnetworks across random seeds: a small, structured, repeatable edit, not a broad reshaping of the model (see "Does reinforcement learning update only a small fraction of parameters?"). And several results argue RL doesn't create reasoning so much as decide when to deploy capability the base model already has; hybrid models recover 91% of the gains just by routing tokens (see "Does RL post-training create reasoning or just deploy it?"). Read together, these suggest entropy collapse is what it looks like when a narrow optimizer sharpens a fixed underlying capability: there's little new to explore, so the policy concentrates rather than expands.
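One way to picture the "small, repeatable edit" claim is to compare checkpoints directly. The sketch below is a toy illustration on flat lists of numbers, not the measurement procedure from the cited work: `changed_indices`, `update_fraction`, and `mask_overlap` are hypothetical names, the tolerance `tol` is an assumption, and Jaccard overlap is just one reasonable choice for comparing which parameters two seeds touched.

```python
def changed_indices(before, after, tol=0.0):
    """Indices of parameters whose value moved by more than tol."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    return {i for i, (x, y) in enumerate(zip(before, after)) if abs(y - x) > tol}

def update_fraction(before, after, tol=0.0):
    """Fraction of parameters that the fine-tuning run actually changed."""
    if not before:
        raise ValueError("checkpoint has no parameters")
    return len(changed_indices(before, after, tol)) / len(before)

def mask_overlap(before, after_a, after_b, tol=0.0):
    """Jaccard overlap between the parameter sets updated by two runs."""
    set_a = changed_indices(before, after_a, tol)
    set_b = changed_indices(before, after_b, tol)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

# Toy checkpoints: a base model and two RL runs with different seeds.
base = [0.0] * 10
seed_a = list(base)
seed_a[1], seed_a[4] = 0.3, -0.2
seed_b = list(base)
seed_b[1], seed_b[4], seed_b[7] = 0.25, -0.15, 0.05

print("seed A update fraction:", update_fraction(base, seed_a))
print("seed B update fraction:", update_fraction(base, seed_b))
print("overlap of updated sets:", mask_overlap(base, seed_a, seed_b))
```

A low update fraction combined with a high overlap across seeds is the pattern the text describes: the optimizer keeps finding the same small subnetwork rather than wandering across the parameter space.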
The most useful surprise is what breaks the collapse. Numerical rewards carry almost no information about *why* an answer failed, so the policy has nothing to explore toward. Chain-of-thought critiques, by contrast, let models climb off plateaus they were stuck on, because language feedback restores direction (see "Can natural language feedback overcome numerical reward plateaus?"). There's even a two-phase pattern where entropy on planning tokens *rises* in a later strategic-exploration phase, suggesting collapse and exploration aren't a single dial but vary by what part of reasoning you're optimizing (see "Does RL training follow a predictable two-phase learning sequence?"). So the answer to 'why does entropy collapse' isn't only 'reward-maximization is greedy'; it's also 'scalar rewards are information-poor,' which points at a different fix than just clamping the entropy term.
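The "not a single dial" point suggests tracking entropy per token role instead of as one average. The following is a minimal sketch under stated assumptions: per-token entropies are already computed, and something upstream has labeled each token as planning or execution. How that labeling is done is not specified in the source, and `mean_entropy_by_role` is a hypothetical helper.

```python
def mean_entropy_by_role(entropies, is_planning):
    """Average per-token entropy separately for planning and execution tokens."""
    if len(entropies) != len(is_planning):
        raise ValueError("need exactly one role flag per token")

    def mean(values):
        # NaN signals that a role had no tokens in this sample.
        return sum(values) / len(values) if values else float("nan")

    planning = [h for h, flag in zip(entropies, is_planning) if flag]
    execution = [h for h, flag in zip(entropies, is_planning) if not flag]
    return {"planning": mean(planning), "execution": mean(execution)}

# Toy trace: two high-entropy planning tokens among low-entropy execution tokens.
entropies = [1.2, 0.1, 0.05, 0.9, 0.1]
is_planning = [True, False, False, True, False]

stats = mean_entropy_by_role(entropies, is_planning)
print(f"planning:  {stats['planning']:.3f} nats")
print(f"execution: {stats['execution']:.3f} nats")
```

Logged over training steps, two such curves could diverge even while the overall average falls, which is why a single aggregate entropy number can hide the later exploration phase.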
Sources (8 notes)
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Without MAML, hierarchical RL for Motivational Interviewing phases collapses to a dominant action regardless of user type. Meta-learning enables the master policy to maintain variability and adapt across diverse user profiles.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.