Why does outcome-based RL specifically lose diversity during training?
This explores why reinforcement learning that rewards only the final answer (not the reasoning steps) tends to narrow a model's range of outputs as training proceeds — and what the corpus says the underlying mechanism is.
The corpus points to a single root cause: when the only signal is "was the answer right," the optimizer has nothing to push toward except piling probability mass onto whatever trajectories already succeed. The policy sharpens, and that sharpening is global, not local. Does outcome-based RL diversity loss spread across unsolved problems? shows the striking part: the collapse doesn't stay confined to problems the model has solved. It transfers, reducing diversity even on unsolved problems, where exploration is exactly what you'd want to preserve.
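The sharpening described above can be seen in a toy setting. The sketch below is an illustration only, not a method from the corpus: it assumes a softmax policy over six candidate answers (three of them correct), a binary outcome reward, and exact policy-gradient updates. The names `train_outcome_only`, `initial_logits`, and `rewards` are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def train_outcome_only(logits, rewards, steps=500, lr=1.0):
    """Gradient ascent on expected reward J = sum_i pi_i * r_i.

    For a softmax policy, dJ/dlogit_j = pi_j * (r_j - J), so each correct
    answer is reinforced in proportion to the mass it already has.
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [t + lr * p * (r - expected)
                  for t, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

# Assumption: answers 0-2 are correct, answer 0 starts slightly ahead.
initial_logits = [0.2, 0.0, 0.0, 0.0, 0.0, 0.0]
rewards = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

before = softmax(initial_logits)
after = train_outcome_only(initial_logits, rewards)

print("entropy before:", entropy(before))
print("entropy after: ", entropy(after))
print("final policy:  ", [round(p, 3) for p in after])
```

In this toy, the reward never distinguishes between the three correct answers, yet the one that starts ahead pulls further ahead, because the update for each answer scales with its current probability. Nothing in the objective pushes back toward spreading mass.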
Mechanically, this is the same entropy collapse documented across very different tasks, which suggests it's a property of outcome reward rather than of any one domain. Does reinforcement learning squeeze exploration diversity in search agents? finds search agents converge on narrow reward-maximizing strategies through the same mechanism seen in reasoning, while supervised fine-tuning on diverse demonstrations keeps exploration broad. Does RL training collapse format diversity in pretrained models? sharpens the picture further: within the first epoch, RL amplifies one output format inherited from pretraining and suppresses the alternatives, and which format wins depends on model scale, not necessarily on which one performs best. So part of the diversity loss is the optimizer arbitrarily committing to one mode early and never looking back.
The corpus also reveals that the collapse is not uniform — it depends on what the reward actually incentivizes. Does preference tuning always reduce diversity the same way? shows RLHF reduces lexical-syntactic variety in code but *increases* it in creative writing, because code rewards convergence on a correct solution while creative writing rewards distinctiveness. Does training order reshape how models handle different task types? makes this concrete: structured domains drive output entropy down while open-ended ones drive it up, and training the structured tasks first prevents the entropy collapse from spilling over and damaging the creative capabilities. Diversity loss, in other words, is what outcome reward looks like in domains with a single correct target.
There's a darker version of the same dynamic worth knowing about. Do overly hard RLVR samples actually harm model capabilities? shows that when problems are nearly impossible, group-relative normalization treats rare accidental successes as high-advantage trajectories, so the model collapses onto answer-repetition and computation-skipping shortcuts that then contaminate skills it already had. And Does binary reward training hurt model calibration? notes that pure binary correctness rewards push the model toward confident guessing because nothing penalizes a confident wrong answer. Both are diversity collapse with a direction: the policy doesn't just narrow, it narrows onto the wrong thing.
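To make the group-relative effect concrete, here is a minimal sketch. It assumes the normalization is of the common GRPO-style form, advantage = (reward - group mean) / (group std + eps); the corpus notes do not spell out the formula, so treat that form, the group size of 16, and the function name `group_relative_advantages` as assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (assumed GRPO-style form)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Nearly impossible problem: 1 accidental success out of 16 samples.
hard_group = [1.0] + [0.0] * 15
# Moderately hard problem: 8 successes out of 16 samples.
medium_group = [1.0] * 8 + [0.0] * 8
# Impossible problem: no successes, so there is no learning signal at all.
impossible_group = [0.0] * 16

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)
impossible_adv = group_relative_advantages(impossible_group)

print("advantage of the lone success:   ", hard_adv[0])
print("advantage of a success at 8 / 16:", medium_adv[0])
print("advantages when all samples fail:", set(impossible_adv))
```

Under these assumptions, the single lucky trajectory receives a much larger advantage than a success on a problem the model solves half the time. If that lucky trajectory reached its answer through a shortcut, the shortcut is what gets the strongest reinforcement.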
The interesting twist is that none of this is inevitable: it's a consequence of *what* you reward, not of RL itself. Can diversity optimization improve quality during language model training? shows that adding a semantic-diversity reward actually raises quality, because diversity catalyzes exploration rather than competing with quality. Do critique models improve diversity during training itself? shows that injecting step-level critique into the training loop keeps solution variety alive, counteracting the tail-narrowing. And Can reward vectors be the hidden source of solution diversity? dissolves the problem at its source: if you keep the reward as an unscalarized vector (per criterion, per test case, per persona), solutions naturally spread across a Pareto frontier instead of all chasing one scalar. The common thread: outcome-based RL loses diversity precisely because it compresses a many-dimensional notion of "good" into a single scalar the optimizer can only maximize one way.
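The contrast between a scalar reward and a reward vector can be sketched with a plain Pareto-dominance check. This is not Vector Policy Optimization itself, only an illustration of why keeping the vector preserves more than one "best" solution. The candidate names and their two-criterion scores are made up for the example.

```python
def dominates(a, b):
    """True if vector a is at least as good as b on every criterion
    and strictly better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, vec in candidates.items()
        if not any(dominates(other, vec)
                   for other_name, other in candidates.items()
                   if other_name != name)
    ]

# Hypothetical solutions scored on two criteria (e.g. two test cases).
candidates = {
    "A": (1.0, 0.2),  # specialist on criterion 1
    "B": (0.2, 1.0),  # specialist on criterion 2
    "C": (0.7, 0.7),  # balanced
    "D": (0.5, 0.5),  # dominated by C
    "E": (0.1, 0.1),  # dominated by everything
}

front = pareto_front(candidates)
scalar_winner = max(candidates, key=lambda name: sum(candidates[name]))

print("Pareto front:          ", sorted(front))
print("Sum-of-rewards winner: ", scalar_winner)
```

Summing the criteria leaves a single winner, while the non-dominated set keeps both specialists and the balanced solution. That is the sense in which scalarization discards diversity that the reward vector already contained.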
Sources (10 notes)
RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
Vector Policy Optimization shows that rewards decomposed per test-case, criterion, or persona provide an inherent diversity structure. Training solutions to span the Pareto frontier across these dimensions produces competent diversity grounded in real task trade-offs rather than external regularizers.
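The calibration note in the list above pairs binary correctness with a Brier score term. A small worked sketch shows why that changes the incentive. It assumes the combined reward is correctness minus a weighted squared error between stated confidence and correctness; the exact form used in the cited work may differ, and `expected_reward`, `best_confidence`, and `brier_weight` are hypothetical names.

```python
def expected_reward(p_correct, confidence, brier_weight):
    """Expected reward for an answer that is right with probability p_correct
    when the model reports the given confidence.

    Reward per sample (assumed form): correct - brier_weight * (confidence - correct)**2
    """
    expected_brier = (p_correct * (1.0 - confidence) ** 2
                      + (1.0 - p_correct) * confidence ** 2)
    return p_correct - brier_weight * expected_brier

def best_confidence(p_correct, brier_weight, grid=101):
    """Grid-search the reported confidence that maximizes expected reward."""
    options = [i / (grid - 1) for i in range(grid)]
    return max(options, key=lambda q: expected_reward(p_correct, q, brier_weight))

p = 0.7  # assumption: the answer is actually right 70% of the time

# Binary reward only: every reported confidence earns the same expected reward.
binary_values = [expected_reward(p, i / 100, 0.0) for i in range(101)]
binary_spread = max(binary_values) - min(binary_values)

# With the Brier term: the best reported confidence matches the true accuracy.
calibrated = best_confidence(p, brier_weight=1.0)

print("spread of expected reward under binary-only:", binary_spread)
print("best confidence with Brier term:", calibrated)
```

With the weight at zero, the expected reward does not depend on the reported confidence at all, so nothing discourages overconfidence. With the Brier term, the expected reward peaks where reported confidence equals the true probability of being correct.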