Reasoning and Learning Architectures

Do overly hard RLVR samples actually harm model capabilities?

Explores whether training on problems beyond a model's competence band causes active regression rather than mere learning failures. Investigates whether group-relative normalization amplifies accidental successes into harmful shortcuts.

Note · 2026-05-28 · sourced from RLVR
What does reward learning actually do to model reasoning?

The damage from over-hard RLVR samples is not merely "the model fails to improve." It is active regression. When almost every rollout on a problem fails, the rare success is unlikely to be a genuinely good solution — it is more often a shortcut, an answer reached by skipping necessary computation, or a lucky guess. Group-relative normalization then treats that one trajectory as the high-advantage exemplar of the group and reinforces it. The model learns the shortcut, not the reasoning.
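The amplification step can be made concrete with a small sketch. It assumes the common form of group-relative normalization used in GRPO-style training (subtract the group mean, divide by the group standard deviation) and binary verifier rewards. The function name `group_relative_advantages` and the `eps` constant are illustrative choices, not taken from the source.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumed form: (r - group mean) / (group std + eps), population std.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# One accidental success among eight rollouts on an over-hard problem.
sparse = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 1])
# Four successes among eight rollouts on a problem the model can often solve.
dense = group_relative_advantages([0, 1, 0, 1, 1, 0, 1, 0])

print(f"lone success advantage:             {sparse[-1]:.2f}")
print(f"each failure advantage (sparse):    {sparse[0]:.2f}")
print(f"success advantage at 50% pass rate: {dense[1]:.2f}")
```

Under these assumptions a lone success in a group of G rollouts gets an advantage of about sqrt(G - 1), roughly 2.65 for G = 8, while a success at a 50% pass rate gets about 1.0. The normalization cannot tell a sound solution from a lucky guess, so the rarer the success, the harder that single trajectory is pushed.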

The behavioral signature is concrete: answer repetition, skipping computation that the problem requires, and other degenerate patterns that look like reasoning collapse. More troubling, these effects do not stay local to the hard problems. They degrade the model's pre-existing capabilities, the things it could already do before training pushed it past its competence band. The internal-feature analysis corroborates this: hard problems activate reasoning-related features, but those features become useful only on the rare successful trajectory, so most of the gradient on hard samples reinforces the wrong activations.

Why it matters: it identifies a specific corruption channel rather than a generic "training instability." The villain is the interaction between a sparse-success reward landscape and group-relative normalization, which together turn statistical noise (an accidental success) into a learning target. This sharpens the case against naively harvesting hard examples and connects RLVR difficulty to the broader pattern where verifiable-reward training rewards trajectories that pass the check without doing the work. The counterpoint a defender might raise — that some hard problems are exactly where capability frontiers expand — only holds when successful trajectories are sampled densely enough to outvote the shortcuts, which over-hard samples by definition fail to provide.
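One way to act on the warning against naively harvesting hard examples is to set aside prompts whose empirical pass rate is too low for successes to be trusted. The sketch below is an illustration only, not a method described in the source: the function `flag_over_hard`, the `min_pass_rate` threshold of 0.2, and the prompt ids are all hypothetical.

```python
def flag_over_hard(group_rewards, min_pass_rate=0.2):
    """Split prompts into (kept, over_hard) by empirical pass rate.

    group_rewards maps a prompt id to the binary rewards of its rollouts.
    min_pass_rate is an arbitrary placeholder, not a value from the source.
    Assumes every prompt has at least one rollout.
    """
    kept, over_hard = [], []
    for prompt_id, rewards in group_rewards.items():
        pass_rate = sum(rewards) / len(rewards)
        if pass_rate < min_pass_rate:
            over_hard.append(prompt_id)  # successes too sparse to trust
        else:
            kept.append(prompt_id)
    return kept, over_hard

# Hypothetical batch: binary verifier rewards for eight rollouts per prompt.
batch = {
    "p1": [0, 0, 0, 0, 0, 0, 0, 1],  # one success in eight
    "p2": [1, 0, 1, 0, 0, 1, 0, 1],  # four successes in eight
    "p3": [0, 0, 0, 0, 0, 0, 0, 0],  # no successes
}
kept, over_hard = flag_over_hard(batch)
print("kept:", kept)
print("over-hard:", over_hard)
```

Where to put the threshold is an empirical question that this sketch does not answer. It only shows that the sparse-success condition can be checked per prompt before the group's advantages reach the optimizer.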


— "Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs", https://arxiv.org/abs/2605.28388

Original note title

overly hard rlvr samples induce degenerate behaviors and amplify shortcut trajectories degrading prior capability