INQUIRING LINE

Why do queries with low cross-rollout variance produce degenerate gradients?

This explores a failure mode in reinforcement-learning post-training: when a model's multiple attempts (rollouts) at the same query all turn out roughly the same, the learning signal derived from comparing them stops being useful — and why that produces broken or 'degenerate' gradient updates.


Low cross-rollout variance, where all of a model's sampled attempts at one query land on nearly the same outcome, does more than fail to help: it breaks the gradient signal in RL training. The short version: modern RL methods like group-relative optimization don't reward absolute correctness; they reward *differences between attempts on the same query.* Each rollout's advantage is its reward minus the group's average, often divided by the group's spread (its standard deviation). When every reward is identical, the numerator is zero and the signal flattens to nothing; when the spread is tiny but nonzero, the division inflates trivial reward differences into noise. Either way, the gradient carries no real information about what to reinforce. The corpus treats this directly: the note "Can one statistical measure serve dual purposes in RL training?" shows that DRO reuses one self-supervised statistic, cross-rollout variance, at two levels, weighting tokens *and* filtering out queries whose comparisons have collapsed, precisely because those degenerate comparisons contribute nothing but instability.
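The normalization described above can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: the function name `group_relative_advantages`, the `eps` stabilizer, and the choice of the population standard deviation are assumptions made for the example.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return (reward - group mean) / (group std + eps) for each rollout.

    `eps` is a hypothetical stabilizer that avoids division by zero.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# A group whose rollouts genuinely disagree.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A fully collapsed group: every rollout earns the same reward.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A nearly collapsed group: rewards differ only by 0.001.
noisy = group_relative_advantages([0.700, 0.701, 0.700, 0.699])
```

Under these assumptions the collapsed group yields all-zero advantages, while the nearly collapsed group turns reward differences of 0.001 into advantages larger in magnitude than those of the genuinely mixed group.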

The sharpest illustration of *how* low variance turns toxic comes from the note "Do overly hard RLVR samples actually harm model capabilities?" On near-impossible problems, almost every rollout fails: the variance is low, with rewards pinned at the bottom. Group-relative normalization then treats the rare accidental success as an enormous-advantage trajectory. So the model doesn't learn reasoning; it learns to repeat whatever lucky shortcut produced that one success, such as answer repetition or computation-skipping, and those shortcuts then bleed into capabilities the model already had. That's the mechanism behind 'degenerate gradients': a vanishing-variance group manufactures a spuriously huge advantage signal pointed at the wrong behavior.
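To make the size of that spurious signal concrete, here is a small sketch under stated assumptions: binary rewards (1 for success, 0 for failure), exactly one success in the group, and standard-score normalization with the population standard deviation. The helper name `lone_success_advantage` is hypothetical. Under these assumptions the lone success receives an advantage of sqrt(G - 1) for a group of size G, while each failure receives only -1 / sqrt(G - 1).

```python
from math import sqrt
from statistics import fmean, pstdev


def lone_success_advantage(group_size):
    """Advantages for one success among (group_size - 1) failures.

    Assumes binary rewards and population-std normalization.
    Returns (advantage_of_success, advantage_of_each_failure).
    """
    if group_size < 2:
        raise ValueError("need at least two rollouts to compare")
    rewards = [1.0] + [0.0] * (group_size - 1)
    mean, std = fmean(rewards), pstdev(rewards)
    return (1.0 - mean) / std, (0.0 - mean) / std


# The lucky rollout's advantage grows with the group size,
# while the penalty on each failed rollout shrinks.
advantages_by_group_size = {g: lone_success_advantage(g) for g in (4, 16, 64)}
```

For groups of 4, 16, and 64 rollouts, the single success gets an advantage of roughly 1.7, 3.9, and 7.9 respectively, so sampling more attempts on a near-impossible problem makes the one accidental success weigh more, not less.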

What makes this more than a numerical quirk is that the degeneration compounds. The note "Does RL training collapse format diversity in pretrained models?" shows RL collapsing format diversity onto one dominant pattern within the first epoch, with variance shrinking across the whole output distribution and not just per query. Once attempts stop diverging, there's progressively less signal to learn from, and the model narrows further. Low cross-rollout variance is both a *symptom* of this collapse and a *driver* of it: less diversity → weaker gradients → still less diversity.

The corpus also points to fixes, some of them indirectly. Filtering is the cheapest: discard zero-variance queries before they pollute the update, which is exactly what the note "Can one statistical measure serve dual purposes in RL training?" describes. Staying anchored to the base model is another lever: "Does staying close to the base model preserve learning ability?" finds that FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, which preserves the model's plasticity and prevents the kind of distributional collapse that drains variance in the first place. And "Does step-level confidence outperform global averaging for trace filtering?" makes a parallel point one level down: a single averaged signal hides where reasoning actually breaks, while a finer-grained per-step signal recovers the information that global averaging masks. It is the same lesson as not trusting a collapsed group statistic.
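The filtering fix can be sketched as a pre-update pass over per-query reward groups. This is an illustration of the general idea only, not DRO's actual filter: the function name `filter_degenerate_queries`, the `min_std` threshold and its value, and the dictionary layout are all assumptions.

```python
from statistics import pstdev


def filter_degenerate_queries(reward_groups, min_std=1e-3):
    """Split queries into those with usable reward spread and those without.

    reward_groups: dict mapping a query id to the list of rollout rewards.
    min_std: hypothetical threshold below which a group counts as collapsed.
    """
    kept, dropped = {}, []
    for query_id, rewards in reward_groups.items():
        # A group with fewer than two rollouts has nothing to compare.
        if len(rewards) < 2 or pstdev(rewards) < min_std:
            dropped.append(query_id)
        else:
            kept[query_id] = rewards
    return kept, dropped


batch = {
    "q_mixed": [1.0, 0.0, 1.0, 0.0],     # rollouts disagree: keep
    "q_all_fail": [0.0, 0.0, 0.0, 0.0],  # too hard: no spread
    "q_all_pass": [1.0, 1.0, 1.0, 1.0],  # too easy: no spread
}
kept, dropped = filter_degenerate_queries(batch)
```

Here only `q_mixed` survives. A variance threshold like this handles fully collapsed groups; a group with a single lucky success still has nonzero spread and would pass through, so it does not by itself address the hard-sample failure described earlier.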

The thing worth taking away: in these RL setups, *disagreement among attempts is the training signal itself.* A query on which every attempt agrees, whether trivially easy or impossibly hard, has nothing to teach, and forcing a gradient out of it does active harm rather than merely none. The useful design move isn't squeezing more signal from quiet queries; it's recognizing them and routing around them.


Sources (5 notes)

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.
