What happens to model reasoning when policy entropy collapses during RL?
This explores what actually happens to a model's reasoning when its exploratory diversity (policy entropy) shrinks toward zero during reinforcement learning — and whether that collapse is the thing capping reasoning gains.
Entropy collapse is the moment when a model stops exploring varied solution paths and converges on a narrow set of reward-maximizing moves. The short version the corpus suggests: entropy collapse is less a side effect than the main ceiling on how far RL can push reasoning, and it tends to make models *sharper at what they already do* rather than *capable of more*.
The most direct account comes from work showing that entropy collapse is the primary bottleneck in scaling RL for reasoning (see "Does policy entropy collapse limit reasoning performance in RL?"). That work fits a tidy empirical law, R = -a·exp(H) + b, under which performance R saturates as policy entropy H approaches zero, and it points to interventions (Clip-Cov, KL-Cov, GPPO) that deliberately slow entropy's decline to keep the model exploring. So the headline answer is: when entropy collapses, reasoning gains flatline at a predictable ceiling. Strikingly, the same mechanism shows up beyond pure reasoning: RL training on search agents squeezes their behavioral diversity through the same entropy-collapse dynamic, converging policies onto a few narrow strategies, while SFT on diverse demonstrations preserves exploration breadth (see "Does reinforcement learning squeeze exploration diversity in search agents?"). That cross-domain echo is the tell that this is a structural property of reward optimization, not a quirk of math problems.
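To make the saturation concrete, here is a minimal sketch that computes the entropy of a single next-token distribution and evaluates the law R = -a·exp(H) + b at high and low entropy. The coefficients `a` and `b` and the example logits are hypothetical values chosen for illustration, not fitted numbers from the cited work, and the helper names are made up for this note.

```python
import math

def policy_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one step's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predicted_performance(entropy, a, b):
    """Evaluate the empirical law R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

# Hypothetical coefficients, for illustration only.
a, b = 0.1, 0.7

uniform = policy_entropy([0.0, 0.0, 0.0, 0.0])  # maximally exploratory over 4 tokens
peaked = policy_entropy([10.0, 0.0, 0.0, 0.0])  # nearly deterministic policy

# At H = 0 the law gives R = b - a: the ceiling that cannot be exceeded
# by spending more entropy, because there is none left to spend.
ceiling = predicted_performance(0.0, a, b)

print("uniform:", uniform, predicted_performance(uniform, a, b))
print("peaked: ", peaked, predicted_performance(peaked, a, b))
print("ceiling:", ceiling)
```

Under this form, every bit of predicted performance is bought by reducing entropy, which is why interventions that slow the decline aim to spend that budget more carefully rather than remove the ceiling itself.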
Here's the part a curious reader might not expect: several notes argue the collapse doesn't *destroy* reasoning so much as expose what RL was doing all along. RLVR doesn't expand the boundary of what a model can solve. Pass@k analysis shows base models actually beat RLVR-tuned models at high k, meaning RL just concentrates sampling onto solutions the base model already had (see "Does RLVR actually expand what models can reason about?"). A companion view frames verifiable rewards as catalysts that surface pretrained strategies rather than teachers that build new ones, with updates that are structurally sparse and bounded by the prior (see "How does RL training reshape reasoning and what gets lost?"). And the 'RL teaches when, not how' result drives it home: base models already contain the reasoning in latent form, and RL is optimizing *deployment timing* (see "Does RL post-training create reasoning or just deploy it?"). Read together, entropy collapse is the visible signature of distribution-sharpening: the policy narrows toward known-good paths, which raises average accuracy while quietly shrinking the breadth of solutions it can still reach.
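The pass@k crossover described above can be illustrated with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The per-problem correct counts below are hypothetical, invented only to show the shape of the effect; they are not measurements from any of the cited notes.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical correct counts out of n = 100 samples on four problems.
n = 100
base_counts = [30, 20, 5, 2]   # broad but unreliable: every problem is solved sometimes
tuned_counts = [95, 90, 0, 0]  # sharp but narrow: two problems are never solved

for k in (1, 10, 100):
    base = mean_pass_at_k(base_counts, n, k)
    tuned = mean_pass_at_k(tuned_counts, n, k)
    print(f"k={k:>3}  base={base:.3f}  tuned={tuned:.3f}")
```

In this toy setup the sharpened model wins at k = 1 and loses at k = 100, which is the pattern the pass@k argument relies on: higher average accuracy alongside a smaller set of reachable solutions.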
The damage isn't only about diversity, though: it can corrupt the model's self-knowledge too. Binary correctness rewards (a common RL setup) provably degrade calibration, pushing models toward confident guessing because nothing penalizes a confident wrong answer; adding a Brier-score term restores calibration without a trade-off (see "Does binary reward training hurt model calibration?"). So 'collapse' has two faces: the policy gets narrow *and* overconfident. There's also nuance on *which* entropy collapses. A two-phase view finds that execution entropy stabilizes early while planning-token entropy keeps rising, suggesting that productive exploration migrates to strategic planning even as low-level execution locks in (see "Does RL training follow a predictable two-phase learning sequence?"). Collapse isn't uniform across the reasoning stack.
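The calibration point can be checked with a few lines of arithmetic. The sketch below assumes a combined reward of the form correctness - (confidence - correctness)^2, where correctness is 0 or 1 and confidence is the probability the model states for its own answer; that exact form, the function names, and the value of `p` are assumptions made for illustration, since the notes only say a Brier-score term is added.

```python
def expected_binary_reward(p_correct, confidence):
    """Expected reward when only correctness is scored.
    The stated confidence has no effect on the result."""
    return p_correct

def expected_brier_reward(p_correct, confidence):
    """Expected value of correctness - (confidence - correctness)^2."""
    reward_if_right = 1.0 - (confidence - 1.0) ** 2
    reward_if_wrong = 0.0 - (confidence - 0.0) ** 2
    return p_correct * reward_if_right + (1.0 - p_correct) * reward_if_wrong

p = 0.6  # hypothetical true probability that the answer is correct
grid = [i / 100 for i in range(101)]  # candidate stated confidences

# Binary reward: claiming 100% confidence costs nothing.
print(expected_binary_reward(p, 0.6), expected_binary_reward(p, 1.0))

# Brier-augmented reward: the expectation is maximized by reporting q = p.
best_confidence = max(grid, key=lambda q: expected_brier_reward(p, q))
print(best_confidence)
```

The derivative of the expected Brier-augmented reward with respect to the stated confidence q is 2(p - q), so the optimum sits at q = p for any fixed accuracy. That is the sense in which the extra term rewards honest confidence without asking the model to be less accurate.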
If the diagnosis is 'RL kills exploration,' the corpus also hints at sidesteps that avoid gradient-driven collapse entirely. Training-Free GRPO gets RL-like distribution shifts by distilling semantic advantages into the prompt as a token prior: no parameter updates, so no entropy to collapse (see "Can semantic knowledge shift model behavior like reinforcement learning does?"). Memory-based online RL pushes the same idea further, achieving continual adaptation purely through memory operations while leaving weights untouched (see "Can agents learn continuously from experience without updating weights?"). The throughline across all of this: entropy collapse is what makes RL good at exploitation and bad at expansion, and the frontier of the field is figuring out how to get the gains without paying the diversity tax.
Sources (9 notes)
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Training-Free GRPO distills semantic advantages from rollout groups into prompts, shifting output distributions toward better answers through in-context learning rather than gradient updates. With a few dozen training samples, it outperforms fine-tuned small LLMs and works with black-box APIs.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.