How does entropy loss enable exploration beyond a single training example?

This explores how entropy — the measure of how many different next-moves a model keeps alive — functions as the thing that lets a model branch out instead of collapsing onto one memorized path, and what happens when that entropy disappears during RL training.

Entropy here is not a loss term to minimize but the resource that keeps a model's options open, and the corpus is surprisingly consistent that when entropy drains away, exploration dies with it. The clearest statement comes from work showing that policy entropy collapse is the *primary* bottleneck in scaling RL for reasoning: performance follows the empirical law R = -a·exp(H) + b, in which reward saturates as policy entropy H approaches zero, so a model that has stopped hesitating has also stopped improving (Does policy entropy collapse limit reasoning performance in RL?). Entropy, in other words, is the budget the model spends on trying things that aren't the single highest-reward continuation it has already locked onto.
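To make the saturation law concrete, here is a minimal sketch of the relationship R = -a·exp(H) + b quoted in the source note. The function names and the coefficient values used in the usage note are hypothetical, chosen only for illustration; they are not fitted values from the cited work.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predicted_reward(entropy, a, b):
    """Empirical fit R = -a * exp(H) + b, as quoted in the source note.

    `a` and `b` are fit coefficients; any values passed here are
    illustrative assumptions, not numbers from the cited work.
    """
    return -a * math.exp(entropy) + b


def reward_ceiling(a, b):
    """Reward predicted by the fit once entropy has fully collapsed (H = 0)."""
    return predicted_reward(0.0, a, b)  # equals b - a
```

Under this fit with a > 0, every unit of entropy given up buys less additional reward than the last, and `reward_ceiling(a, b)` is the most the fit predicts once entropy reaches zero. For example, with the made-up coefficients `a = 0.1, b = 0.8`, the predicted ceiling is `b - a`.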

Why does that matter for going "beyond a single training example"? Because the entropy lives in a small minority of decisions. Only about 20% of tokens are high-entropy. These are the *forking points* where the reasoning could genuinely go several ways, and it turns out RLVR does almost all of its useful work precisely there; training on just those forking tokens matches or exceeds full-gradient updates (Do high-entropy tokens drive reasoning model improvements?). Exploration isn't spread evenly across a trajectory; it's concentrated at a few branch points, and entropy is what keeps those branches from prematurely fusing into one rote answer.
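The idea of restricting updates to forking tokens can be sketched as follows. This is an illustrative simplification, assuming the per-position next-token distributions are already available as plain lists; the function names, the per-sequence selection rule, and the default `keep_fraction=0.2` are assumptions for the sketch, not the cited work's exact procedure.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_token_mask(distributions, keep_fraction=0.2):
    """Mark the highest-entropy positions in a sequence.

    Returns a list of booleans, True for the top `keep_fraction` of
    positions ranked by entropy (at least one position is kept).
    """
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(entropies))]


def masked_mean_loss(per_token_losses, mask):
    """Average the loss over the selected positions only."""
    kept = [loss for loss, keep in zip(per_token_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0
```

In a real training loop the mask would gate which token positions contribute to the policy-gradient loss; here `masked_mean_loss` stands in for that step.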

The failure mode is visible from the other direction. Left unmanaged, RL doesn't expand behavior; it compresses it. In search agents, RL squeezes exploration diversity through the same entropy-collapse mechanism seen in reasoning, converging on narrow reward-maximizing strategies while SFT on diverse demonstrations preserves breadth (Does reinforcement learning squeeze exploration diversity in search agents?). More strikingly, RL tends to amplify a single dominant format from pretraining within the first epoch and suppress all the alternatives (Does RL training collapse format diversity in pretrained models?). So "a single training example" isn't a strawman: collapse toward one mode is the default gravity of reward optimization, and preserving entropy is the counter-force.

That reframes the interventions. Methods like Clip-Cov, KL-Cov, and GPPO exist specifically to manage *how fast* entropy falls rather than letting it crater, buying continued exploratory capacity (Does policy entropy collapse limit reasoning performance in RL?). The same instinct shows up in places that never mention entropy by name: Soft Thinking refuses to commit to one discrete token, carrying a probability-weighted superposition of reasoning paths forward and using entropy itself as the early-stopping signal, which gives exploration without collapsing the distribution (Can we explore multiple reasoning paths without committing to one token?). And RLAD finds that at large compute budgets, spending test-time compute on diverse *abstractions* enforces breadth-first exploration that beats simply sampling more solutions in parallel, a structural way of preserving the branching that entropy collapse would otherwise erase (Can abstractions guide exploration better than depth alone?).
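The Soft Thinking idea described above, a probability-weighted mix of token embeddings plus an entropy-based stop signal, can be sketched roughly like this. It is a toy illustration over plain Python lists, not the method's actual implementation; the function names and the `threshold` and `patience` parameters are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def concept_embedding(probs, embedding_table):
    """Probability-weighted average of token embeddings.

    Instead of picking one token and feeding its embedding forward,
    every token's embedding is mixed in proportion to its probability.
    `embedding_table[i]` is the embedding vector of token i.
    """
    dim = len(embedding_table[0])
    return [
        sum(p * row[j] for p, row in zip(probs, embedding_table))
        for j in range(dim)
    ]


def should_stop(entropy_history, threshold=0.1, patience=3):
    """Stop once entropy stays below `threshold` for `patience` steps.

    Both parameters are illustrative assumptions for this sketch.
    """
    if len(entropy_history) < patience:
        return False
    return all(h < threshold for h in entropy_history[-patience:])
```

A one-hot distribution reduces `concept_embedding` to ordinary discrete token selection, so the discrete case is a special case of the mix; the stop rule only fires once the model has been consistently confident for several steps.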

The thing you didn't know you wanted to know: entropy isn't noise the model has to overcome to reach the right answer. It's the only thing standing between a model that reasons and a model that has memorized one path and calls it confidence — and the whole craft of RL post-training is learning to spend that entropy slowly instead of all at once.

Sources 6 notes

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can we explore multiple reasoning paths without committing to one token?

Training-free method replaces discrete token selection with probability-weighted concept embeddings, preserving superposition of reasoning paths. Improves accuracy up to 2.48 points while reducing tokens 22.4% via entropy-based early stopping.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
