Why do zero-advantage rollouts destabilize training beyond just wasting compute?
This explores what happens in group-relative RL (GRPO-style) when every rollout in a group earns the same reward, so the advantage collapses toward zero, and why that's worse than just burning tokens. The corpus suggests the danger isn't the zero itself but what the normalization machinery does around it: when a group is nearly degenerate, the rare odd-one-out gets blown up into a high-magnitude signal, and the gradient that should have been silent instead steers the model somewhere bad.
The sharpest version of this is in overly hard sampling. When a problem is almost impossible, almost every rollout fails, producing a near-uniform group whose advantages sit near zero. But group-relative normalization divides by the within-group spread, so the one accidental success becomes a huge advantage, and the model dutifully reinforces whatever fluke produced it: answer repetition, computation-skipping, shortcut trajectories. Worse, those shortcuts don't stay quarantined to the hard problem; they contaminate capabilities the model already had (see: Do overly hard RLVR samples actually harm model capabilities?). That's the mechanism by which a 'wasted' rollout becomes a destabilizing one: near-zero spread plus normalization manufactures false confidence in noise.
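To make the amplification concrete, here is a minimal sketch of within-group z-score normalization. It assumes binary rewards, a group of 16 rollouts, the population standard deviation, and a small `eps` in the denominator; the function name `group_advantages` and all of these choices are illustrative assumptions, not any particular library's implementation.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Z-score rewards within one group (a GRPO-style normalization sketch).

    eps is an assumed stabilizer so a zero-spread group does not divide by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group's rewards
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical near-impossible prompt: one fluke success, 15 failures.
rewards = [1.0] + [0.0] * 15
adv = group_advantages(rewards)
lucky, failed = adv[0], adv[1]

# A fully uniform group, by contrast, produces all-zero advantages.
flat = group_advantages([0.0] * 16)

print(f"lucky rollout: {lucky:.2f}, each failure: {failed:.2f}")
```

Under these assumptions the lone success receives an advantage of roughly sqrt(G - 1) for a group of size G, while each failure is pushed down only slightly. So the fully flat group is silent, and the almost-flat group is where a single outlier dominates the update.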
This is exactly why some methods treat the within-group statistic as a filter, not just a weight. One approach (DRO) reuses cross-rollout variance at two levels, weighting tokens where there's real signal and discarding queries where the comparison is degenerate before it can pollute the update, and reports 2–3× faster, more stable training as a result (see: Can one statistical measure serve dual purposes in RL training?). The fact that filtering degenerate groups *improves* stability is the cleanest evidence that those groups were doing harm, not nothing. Relatedly, how you sample changes how often you fall into this trap: shared-prefix tree rollouts produce more distinct trajectories per token budget, which sharpens advantage estimation and reduces the all-same-outcome collapses that starve the signal (see: Can shared-prefix trees reduce redundancy in agent rollouts?).
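The query-level filtering idea can be sketched as a simple pre-update pass. This is not the statistic the cited method uses; it substitutes the plain standard deviation of per-rollout rewards as a stand-in, and the function name `keep_informative_groups`, the `min_std` threshold, and the toy batch are all hypothetical.

```python
from statistics import pstdev

def keep_informative_groups(groups, min_std=1e-3):
    """Drop prompts whose rollouts all earned (nearly) the same reward.

    groups maps a prompt id to the list of rewards for its rollouts.
    min_std is an assumed threshold, not a value from any paper.
    """
    return {
        prompt: rewards
        for prompt, rewards in groups.items()
        if pstdev(rewards) > min_std
    }

# Toy batch: two degenerate groups and one with a real contrast.
batch = {
    "too_hard": [0.0, 0.0, 0.0, 0.0],
    "too_easy": [1.0, 1.0, 1.0, 1.0],
    "mixed": [1.0, 0.0, 1.0, 0.0],
}
kept = keep_informative_groups(batch)
print(list(kept))
```

With a threshold this small, only fully flat groups are dropped. A group with a single fluke success still passes, so catching the near-degenerate case would need a higher threshold or a different criterion, which is a design choice this sketch leaves open.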
There's a deeper reason instability compounds: RL is already biased toward collapse, and bad gradients accelerate it. RL post-training tends to amplify a single dominant output format within the first epoch while suppressing alternatives, and the winner is determined by model scale, not necessarily by performance (see: Does RL training collapse format diversity in pretrained models?). Layer spurious high-advantage updates from degenerate groups on top of that narrowing tendency and you get premature convergence on the wrong mode. Binary/sparse reward schemes make this worse by rewarding confident guessing without penalizing confident errors, degrading calibration (see: Does binary reward training hurt model calibration?), so the model becomes both narrower and more sure of itself.
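The calibration point can be illustrated by comparing a binary reward with one that adds a Brier-style penalty, which is the remedy described in the source notes. The exact combination below (correctness minus squared confidence error) and the assumption that the model reports a confidence in [0, 1] are illustrative; the function names are hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """Binary correctness reward: the stated confidence never enters."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier-style penalty (assumed form).

    confidence is the model's self-reported probability of being correct.
    """
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# A wrong answer stated with high vs. moderate confidence.
overconfident = calibrated_reward(False, 0.95)
hedged = calibrated_reward(False, 0.5)
flat_binary = binary_reward(False)

print(overconfident, hedged, flat_binary)
```

Under the binary scheme both wrong answers score the same, so nothing discourages stating them confidently. With the penalty term, the overconfident error scores strictly worse than the hedged one.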
The timing matters too. RL training appears to move through two phases, execution correctness first and strategic planning later, with the bottleneck shifting over the run (see: Does RL training follow a predictable two-phase learning sequence?). A burst of zero-advantage groups early can starve the procedural-consolidation phase of the clean signal it needs, and spurious updates during the later planning phase concentrate optimization on exactly the tokens (planning/strategy) where a wrong push does the most damage. The takeaway you didn't know you wanted: in group-relative RL, a flat group isn't a harmless no-op. It's the precise condition under which the algorithm is most likely to amplify noise into a confident, capability-eroding mistake.
Sources (6 notes)
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
Tree-structured rollouts that branch from shared prefixes produce more distinct trajectories within a fixed token budget than independent chain sampling. This improves advantage estimation statistics and enables longer-horizon tasks within the same compute constraint.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.