How does KL penalty strength affect the degree of format collapse during RL?
This reads the question as: does the KL-to-reference penalty, the knob that keeps an RL policy from drifting too far from its pretrained starting point, control how badly the model's output diversity collapses into a single dominant format during RL?
A caveat up front: the corpus documents format collapse vividly but contains no clean, controlled sweep that turns the KL knob up and down and measures the result. So the honest synthesis is lateral: what the collection *does* establish about the mechanism the KL penalty is supposed to restrain, and why that reframes the question.
The central finding is that RL collapse onto one format is fast and structural. Controlled experiments show RL converges on a single dominant *pretraining* format within the first epoch while suppressing the alternatives. Strikingly, the winning format depends on model scale, not necessarily on which format performs best (see: Does RL training collapse format diversity in pretrained models?). This matters for the KL question because the KL penalty pulls the policy toward exactly that pretrained distribution. The reservoir of formats the penalty anchors you to is itself the thing that gets winnowed; a stronger pull toward the prior doesn't obviously preserve diversity, because the prior's dominant mode is what RL amplifies.
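To make the anchoring argument concrete, here is a minimal sketch under a textbook assumption the notes do not themselves state: that the policy maximizes expected reward minus beta times the KL divergence to the reference, whose maximizer over a finite set is proportional to prior times exp(reward / beta). The three format names, their prior shares, and the reward values are hypothetical, chosen only for illustration. This is a toy calculation, not a substitute for the controlled sweep the corpus lacks.

```python
import math


def kl_regularized_optimum(prior, reward, beta):
    """Closed-form maximizer of E[reward] - beta * KL(pi || prior) on a finite set."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    max_r = max(reward.values())
    weights = {k: prior[k] * math.exp((reward[k] - max_r) / beta) for k in prior}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical format shares in the pretrained prior (illustrative numbers only).
prior = {"format_a": 0.6, "format_b": 0.3, "format_c": 0.1}

# Case 1: every format earns the same reward, so the optimum equals the prior
# at any beta, and the prior's dominant format stays dominant.
flat_reward = {"format_a": 1.0, "format_b": 1.0, "format_c": 1.0}

# Case 2: the already-dominant format earns slightly more reward (assumed values).
tilted_reward = {"format_a": 1.0, "format_b": 0.9, "format_c": 0.9}

for beta in (1.0, 0.1, 0.01):
    flat = kl_regularized_optimum(prior, flat_reward, beta)
    tilted = kl_regularized_optimum(prior, tilted_reward, beta)
    print(
        f"beta={beta}: flat a-share={flat['format_a']:.3f}, "
        f"tilted a-share={tilted['format_a']:.3f}"
    )
```

In this toy setting a large beta holds the policy near the prior, but the prior is already skewed toward one format; a small beta lets even a slight reward edge turn that skew into near-total dominance. Real training adds sampling noise and optimization dynamics that this closed form ignores.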
The deeper driver is that outcome-based reward sharpens the policy globally. Rewarding only final-answer correctness concentrates probability mass on winning trajectories, and that diversity loss *transfers* from solved problems to unsolved ones, meaning the collapse isn't local to where reward was applied (see: Does outcome-based RL diversity loss spread across unsolved problems?). A KL penalty is the standard brake on this sharpening, but the work here suggests the brake and the gas pedal are fighting over the same quantity (entropy, or mass concentration). That is why diversity restoration often needs *separate* mechanisms, such as UCB-style exploration bonuses during training and repetition penalties at test time, rather than just tuning regularization strength.
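The "same quantity" point can be sketched in a few lines. This is a minimal illustration, assuming two things the notes do not specify: that each sampled output has already been tagged with a format label by some hypothetical classifier, and that the KL penalty is applied in the common per-sample form, reward minus beta times the log-probability ratio. The function names and the example labels are made up for illustration.

```python
import math
from collections import Counter


def format_entropy(labels):
    """Shannon entropy (in nats) of the empirical format distribution."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


def kl_shaped_reward(task_reward, logp_policy, logp_ref, beta):
    """Task reward minus a per-sample KL estimate: beta * (log pi - log pi_ref)."""
    return task_reward - beta * (logp_policy - logp_ref)


# Hypothetical format labels for six sampled outputs, before and after RL.
before = ["python", "latex", "prose", "python", "latex", "prose"]
after = ["python"] * 6

print("entropy before:", format_entropy(before))
print("entropy after: ", format_entropy(after))

# A correct sample whose log-probability rose from -3.0 (reference) to -1.0
# (policy) is charged for that sharpening in proportion to beta.
for beta in (0.0, 0.1, 1.0):
    print(beta, kl_shaped_reward(1.0, logp_policy=-1.0, logp_ref=-3.0, beta=beta))
```

The outcome reward pays the policy for moving mass onto the winning sample, and the KL term charges it for the same move. Entropy over format labels is one way to watch which side is winning, though it says nothing about whether the surviving format is a good one.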
Two adjacent notes sharpen the picture. First, the pretrained prior, not the algorithm, sets the ceiling: critic-free PPO matches or surpasses more complex methods once you add advantage normalization and token-level loss aggregation, and most RL techniques are highly setup-sensitive (see: Can two simple techniques match complex RL algorithms?). That implies KL strength is one knob in a coupled system whose behavior won't generalize across setups, so "how much does it matter" likely has no single answer. Second, when normalization meets nearly impossible problems, you get degenerate collapse of a different flavor: group-relative normalization treats rare accidental successes as high-advantage trajectories, pushing models into shortcuts (answer repetition, skipped computation) that contaminate prior capabilities (see: Do overly hard RLVR samples actually harm model capabilities?). Collapse, in other words, has multiple causes, and the KL penalty only touches one of them.
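The techniques named here can be sketched as follows. The notes name advantage normalization, token-level loss aggregation, and group-relative normalization but give no formulas, so the forms below are assumptions: within-group standardization of rewards (a common group-relative choice) and two ways of averaging per-token losses. All function names and numbers are hypothetical.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def sequence_level_loss(token_losses):
    """Average each sequence's tokens first, then average across sequences."""
    per_seq = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_seq) / len(per_seq)


def token_level_loss(token_losses):
    """Pool every token in the batch so each token carries equal weight."""
    flat = [t for seq in token_losses for t in seq]
    return sum(flat) / len(flat)


# A nearly impossible prompt: one accidental success among eight samples.
rare = group_normalized_advantages([0, 0, 0, 0, 0, 0, 0, 1])
# A moderately hard prompt: half the samples succeed.
balanced = group_normalized_advantages([0, 1, 0, 1, 0, 1, 0, 1])
print("rare success advantage:    ", max(rare))
print("balanced success advantage:", max(balanced))

# One short sequence with high loss, one long sequence with zero loss.
batch = [[1.0, 1.0], [0.0] * 8]
print("sequence-level:", sequence_level_loss(batch))
print("token-level:   ", token_level_loss(batch))
```

Under this standardization the lone success on the hard prompt receives a larger advantage than any success on the balanced prompt, which is the amplification the hard-sample note warns about. The two aggregation functions differ only in whether short sequences get the same weight as long ones; which behaves better is, per the notes, setup-dependent.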
The thing you didn't know you wanted to know: format collapse may be less about *how hard you pull toward the pretrained prior* and more about *what the prior already prefers and how reward sharpens it*. The KL penalty regulates distance from a distribution that is itself the source of the dominant format, so treating it as the primary collapse dial may be pulling the wrong lever. If you want to dig into the actual mechanism, the format-convergence and diversity-transfer notes are the doorways.
Sources (4 notes)
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.
Advantage normalization and token-level loss aggregation allow critic-free PPO to surpass more complex algorithms. Systematic evaluation shows most RL techniques are setup-sensitive; the pretrained prior, not algorithm choice, sets the performance ceiling.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.