INQUIRING LINE

Can group-relative normalization be modified to resist shortcut trajectories?

This explores whether GRPO-style group-relative advantage normalization — which scores each trajectory against the average of its sampled group — can be reshaped so it stops rewarding lucky shortcuts instead of genuine reasoning.


The corpus is unusually direct about where the problem comes from. When you train on problems that are nearly impossible for the model, the rare accidental success is scored as a huge positive advantage relative to its group of mostly failed siblings, so the optimizer enthusiastically reinforces whatever produced that fluke, which tends to be answer repetition and computation-skipping rather than sound steps (see "Do overly hard RLVR samples actually harm model capabilities?"). The shortcut isn't a side effect of normalization; it's normalization doing exactly what it's told on the wrong distribution of problems. That reframes your question: the most reliable 'modification' may be at the data layer (don't feed it problems where the only successes are accidental) rather than in the math of the advantage estimator itself.
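To make the fluke effect concrete, here is a minimal sketch of group-relative normalization on binary rewards, plus a data-layer filter of the kind suggested above. It assumes advantages are standardized with the group mean and population standard deviation; `keep_group` and its `min_pass_rate` threshold are hypothetical illustrations, not something the source notes specify.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its group's mean and population std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def keep_group(rewards, min_pass_rate=0.2):
    """Hypothetical data-layer filter: drop prompts whose successes are too rare
    to be distinguishable from accidents. The threshold is an assumption."""
    pass_rate = sum(1 for r in rewards if r > 0) / len(rewards)
    return pass_rate >= min_pass_rate

# Near-impossible problem: one accidental success among eight samples.
fluke_group = [0.0] * 7 + [1.0]
# Moderately hard problem: half of the samples succeed.
balanced_group = [0.0] * 4 + [1.0] * 4

fluke_adv = group_relative_advantages(fluke_group)
balanced_adv = group_relative_advantages(balanced_group)

print(round(fluke_adv[-1], 3))     # advantage of the lone success
print(round(balanced_adv[-1], 3))  # advantage of a success in the balanced group
print(keep_group(fluke_group), keep_group(balanced_group))
```

Under these assumptions the lone success in the eight-sample group gets an advantage of about sqrt(7), roughly 2.6 times what a success earns in the balanced group, and the filter drops the fluke-only group before it reaches the optimizer.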

The more interesting lateral move in the collection is changing *what the reward attaches to*. Shortcuts thrive when the only signal is the final outcome, because any path to the right answer looks equally good. Several methods convert that sparse outcome reward into dense, per-step signals derived from the structure of the trajectory: Tree-GRPO uses tree topology, Supervised RL leans on expert-aligned actions, and ToolPO keys off tool-call positions, so the credit lands on the reasoning moves rather than the lucky landing (see "Can trajectory structure replace hand-annotated process rewards?"). This is the cleaner answer to 'can it be modified to resist shortcuts': you keep the group-relative comparison but give it richer trajectory structure to compare, so a step-skipping path can no longer collect the same advantage as a worked one.
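The contrast between outcome-only and step-level credit can be sketched as follows. This is an illustration under loud assumptions, not the Tree-GRPO, Supervised RL, or ToolPO algorithm: the `step_scores` values are hypothetical stand-ins for whatever structural signal a method derives, and all trajectories are assumed to have the same number of steps.

```python
import math

def standardize(values, eps=1e-8):
    """Group-relative normalization: subtract the mean, divide by population std."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / (std + eps) for v in values]

# Hypothetical group of four sampled trajectories for one prompt.
# step_scores are made-up per-step signals; outcome is the final 0/1 reward.
trajectories = [
    {"name": "worked",   "step_scores": [1.0, 1.0, 1.0], "outcome": 1.0},
    {"name": "shortcut", "step_scores": [0.0, 0.0, 1.0], "outcome": 1.0},
    {"name": "failed_a", "step_scores": [1.0, 0.0, 0.0], "outcome": 0.0},
    {"name": "failed_b", "step_scores": [0.0, 0.0, 0.0], "outcome": 0.0},
]

# Outcome-only: one advantage per trajectory, shared by every step in it.
outcome_adv = standardize([t["outcome"] for t in trajectories])

# Step-level: normalize each step position across the group separately.
num_steps = len(trajectories[0]["step_scores"])
per_step_adv = [
    standardize([t["step_scores"][k] for t in trajectories])
    for k in range(num_steps)
]

for i, t in enumerate(trajectories):
    steps = [round(per_step_adv[k][i], 2) for k in range(num_steps)]
    print(t["name"], round(outcome_adv[i], 2), steps)
```

With outcome-only reward the worked and shortcut trajectories receive identical advantages. With step-level normalization the shortcut's skipped early steps receive negative advantages while the worked trajectory's steps receive positive ones.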

There's a quieter failure mode worth knowing about. RL post-training tends to collapse onto a single dominant format inherited from pretraining within the first epoch, and the winning format depends on model scale, not necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). A shortcut trajectory is, in a sense, just a degenerate format that won the collapse. So any normalization fix that doesn't also preserve diversity risks locking in whichever cheap pattern happened to dominate early. The resistance you want is partly about keeping the group genuinely varied, not just rescaling advantages within it.

One caution the corpus raises against naive 'shortcut-proofing': the intuition that you fix shortcuts by stripping spurious cues doesn't always hold. In heuristic-override tasks, removing the misleading cues actually *hurts*, because the real difficulty is integrating conflicting signals, not ignoring distractors (see "Why does removing spurious cues sometimes hurt model performance?"). The lesson for reward design is that 'shortcut' and 'legitimate-but-cheap-looking reasoning' aren't always separable from the outside, so an aggressive penalty risks suppressing real capability. And if you'd rather widen the search than re-engineer the reward, sampling many parallel latent trajectories spreads exploration across the solution space without inflating variance, giving the group more honest candidates to normalize against (see "Can reasoning systems scale wider instead of only deeper?").

The takeaway you might not have gone looking for: the corpus quietly votes against patching the normalization formula in isolation. The shortcut problem lives at the boundaries of the method — the difficulty of the training samples, the granularity of the reward, and the diversity of the sampled group — and that's where the corpus puts its fixes.


Sources (5 notes)

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
