How does RL compress reasoning path diversity during training?
This explores the mechanism by which reinforcement learning narrows the range of reasoning paths a model explores during training — and what the corpus says about why it happens, where it spreads, and how to counteract it.
The question here is how RL training shrinks a model's repertoire of reasoning paths, not whether it raises accuracy. The corpus converges on a single mechanism with a clinical name: **entropy collapse**. When RL rewards only final-answer correctness, it sharpens the policy by piling probability mass onto the trajectories that already work, and the same dynamic shows up whether the model is reasoning through math, searching, or generating prose (see "Does outcome-based RL diversity loss spread across unsolved problems?" and "Does reinforcement learning squeeze exploration diversity in search agents?"). The most counterintuitive finding is that this loss isn't local: outcome-based RL transfers diversity loss from solved problems to unsolved ones, globally narrowing the policy even where it hasn't yet found an answer. Sharpening where you've succeeded quietly forecloses exploration where you haven't.
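The sharpening dynamic can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not a method taken from any of the cited notes: it treats a single problem as a choice among five hypothetical reasoning paths, two of which reach the correct answer, and applies the exact expected policy-gradient update for a softmax policy with a reward of 1 for correct paths and 0 otherwise. The path count, learning rate, step count, and initial preference are all invented for illustration.

```python
import math


def softmax(logits):
    """Convert logits to probabilities with the usual max-shift for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expected_pg_step(logits, rewards, lr):
    """One exact expected policy-gradient step for a softmax policy.

    For a softmax policy, d E[R] / d logit_i = p_i * (r_i - E[R]).
    """
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [
        logit + lr * p * (r - expected_reward)
        for logit, p, r in zip(logits, probs, rewards)
    ]


def simulate(steps=200, lr=1.0):
    """Track policy entropy while training on an outcome-only reward."""
    # Assumption: five hypothetical reasoning paths, path 0 slightly preferred.
    logits = [0.5, 0.0, 0.0, 0.0, 0.0]
    # Assumption: paths 0 and 1 reach the correct answer, the rest do not.
    rewards = [1.0, 1.0, 0.0, 0.0, 0.0]
    history = [entropy(softmax(logits))]
    for _ in range(steps):
        logits = expected_pg_step(logits, rewards, lr)
        history.append(entropy(softmax(logits)))
    return history, softmax(logits)


if __name__ == "__main__":
    history, final_probs = simulate()
    print("entropy at start:", round(history[0], 3))
    print("entropy at end:  ", round(history[-1], 3))
    print("final policy:    ", [round(p, 3) for p in final_probs])
```

Under these assumptions the entropy at the end of the run is lower than at the start, nearly all probability mass ends up on the two rewarded paths, and the path that began with a small head start keeps its lead. The toy does not model the transfer of diversity loss to unsolved problems, which would require parameters shared across problems.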
A second strand reframes what's actually being compressed. RL may not be destroying reasoning ability so much as collapsing onto a *format* that was already latent in pretraining: within the first epoch, RL amplifies one format distribution inherited from pretraining and suppresses the alternatives, and which format wins depends on model scale, not necessarily on which one performs best (see "Does RL training collapse format diversity in pretrained models?"). That dovetails with the argument that RL post-training teaches a model *when* to deploy reasoning it already has, rather than teaching it new ways to reason (see "Does RL post-training create reasoning or just deploy it?"). Read together, the picture is less 'RL invents a narrow skill' and more 'RL picks one path out of many the base model contained and prunes the rest.'
The compression isn't uniform across the run, either. RL training moves through two phases: first consolidating procedural execution, then shifting the bottleneck to strategic planning, with planning-token entropy rising even as execution entropy stabilizes (see "Does RL training follow a predictable two-phase learning sequence?"). And it isn't uniform across task types: structured domains (math, code) systematically *decrease* output entropy while creative, open-ended domains increase it, which means naively mixing them lets the structured tasks' collapse bleed over and damage open-ended capability unless you schedule training order to protect it (see "Does training order reshape how models handle different task types?").
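Tracking the two entropies separately is a small bookkeeping exercise once each generated token carries a role label. The sketch below is a minimal illustration, assuming the per-step next-token distributions are available and that some external procedure has already tagged each step as "planning" or "execution". The notes do not say how tokens are classified, so the labels and the function names here are hypothetical.

```python
import math
from collections import defaultdict


def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy_by_role(step_distributions, roles):
    """Average next-token entropy per role label.

    step_distributions: one probability list per generated token.
    roles: one label per generated token (hypothetical tagging,
           e.g. "planning" or "execution").
    """
    if len(step_distributions) != len(roles):
        raise ValueError("need exactly one role label per step")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for probs, role in zip(step_distributions, roles):
        totals[role] += token_entropy(probs)
        counts[role] += 1
    return {role: totals[role] / counts[role] for role in totals}


if __name__ == "__main__":
    # Made-up distributions purely to exercise the function.
    dists = [[0.5, 0.5], [1.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
    labels = ["planning", "execution", "planning"]
    print(mean_entropy_by_role(dists, labels))
```

Logged at each checkpoint, the two averages would show whether planning entropy rises while execution entropy flattens, which is the pattern the two-phase account describes.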
What you didn't know you wanted to know: diversity loss is reversible, and the fix isn't always 'do less RL.' SFT on diverse demonstrations preserves exploration breadth (see "Does reinforcement learning squeeze exploration diversity in search agents?"), and, more surprisingly, explicitly *rewarding* semantic diversity during RL doesn't trade off against quality; it catalyzes exploration and produces higher-quality outputs than quality-only baselines (see "Can diversity optimization improve quality during language model training?"). There's even a subtle distinction worth holding onto: preserving diversity during *training* (exploration bonuses) and recovering it at *test time* (repetition penalties, parallel sampling) are structurally different problems requiring different machinery (see "Does outcome-based RL diversity loss spread across unsolved problems?" and "Can reasoning systems scale wider instead of only deeper?").
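A rough sketch of what rewarding diversity during training can look like follows. It is not the method from the cited note, which uses a learned classifier to judge semantic similarity. As a stand-in, this version measures diversity with Jaccard distance over word sets, and it combines quality and diversity multiplicatively with a hypothetical `weight` parameter. Both choices are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over whitespace-separated word sets."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def diversity_scores(responses):
    """Mean distance from each response to the others in the same batch."""
    n = len(responses)
    if n < 2:
        return [0.0] * n
    return [
        sum(jaccard_distance(r, other)
            for j, other in enumerate(responses) if j != i) / (n - 1)
        for i, r in enumerate(responses)
    ]


def shaped_rewards(responses, quality, weight=0.5):
    """Scale each quality score up by the response's diversity.

    Assumption: multiplicative shaping, so a response with zero quality
    earns nothing no matter how unusual it is.
    """
    if len(responses) != len(quality):
        raise ValueError("need one quality score per response")
    diversity = diversity_scores(responses)
    return [q * (1.0 + weight * d) for q, d in zip(quality, diversity)]


if __name__ == "__main__":
    batch = [
        "add the two numbers",
        "add the two numbers",
        "factor then cancel terms",
    ]
    print(shaped_rewards(batch, quality=[1.0, 1.0, 1.0]))
```

With equal quality scores, the response that differs from the rest of the batch receives the largest shaped reward, so the policy gradient no longer pushes all mass onto one phrasing. A word-overlap metric is a crude proxy for semantic diversity; swapping in an embedding or classifier-based distance would only change `jaccard_distance`.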
The quiet warning underneath all of this: a collapsed policy looks confident and fluent, but chain-of-thought that imitates the *form* of reasoning without the underlying logic degrades predictably once you step outside the training distribution (see "Does chain-of-thought reasoning actually generalize beyond training data?"). Compressing the reasoning paths you keep is exactly what makes a model brittle on the paths it threw away.
Sources (9 notes)
RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.