Do overly hard RLVR samples actually harm model capabilities?
Explores whether training on problems beyond a model's competence band causes active regression rather than mere learning failures. Investigates whether group-relative normalization amplifies accidental successes into harmful shortcuts.
The damage from over-hard RLVR samples is not merely "the model fails to improve." It is active regression. When almost every rollout on a problem fails, the rare success is unlikely to be a genuinely good solution — it is more often a shortcut, an answer reached by skipping necessary computation, or a lucky guess. Group-relative normalization then treats that one trajectory as the high-advantage exemplar of the group and reinforces it. The model learns the shortcut, not the reasoning.
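The amplification can be made concrete with a small numerical sketch. It assumes the common GRPO-style form of group-relative normalization, in which each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. The note does not specify the exact formula, so the function name, the `eps` term, the use of population variance, and the group sizes below are illustrative assumptions rather than the paper's setup.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group: (r - mean) / (std + eps).

    Assumed GRPO-style normalization; population variance is an assumption.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Over-hard problem: one (possibly accidental) success among 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Medium-difficulty problem: half of the 16 rollouts succeed.
medium_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)

print(round(hard_adv[0], 3))    # lone success:  3.873 (about sqrt(15))
print(round(hard_adv[1], 3))    # each failure: -0.258
print(round(medium_adv[0], 3))  # each success:  1.0
```

With binary rewards and pass rate p, a success receives an advantage of roughly sqrt((1 - p) / p), so the rarer the success, the larger its weight in the update. The normalization cannot tell a sound solution from a shortcut or a lucky guess: whatever trajectory passed the check gets that weight.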
The behavioral signature is concrete: answer repetition, skipping computation that the problem requires, and other degenerate patterns that look like reasoning collapse. More troubling, these effects do not stay local to the hard problems: they degrade the model's pre-existing capabilities, the things it could already do before training pushed it past its competence band. The internal-feature analysis corroborates this. Hard problems activate reasoning-related features, but those features prove useful only on the rare successful trajectory, so most of the gradient on hard samples reinforces the wrong activations.
Why it matters: the finding identifies a specific corruption channel rather than a generic "training instability." The culprit is the interaction between a sparse-success reward landscape and group-relative normalization, which together turn statistical noise (an accidental success) into a learning target. This sharpens the case against naively harvesting hard examples and connects RLVR difficulty to the broader pattern in which verifiable-reward training rewards trajectories that pass the check without doing the work. A defender might counter that some hard problems are exactly where capability frontiers expand. That holds only when successful trajectories are sampled densely enough to outvote the shortcuts, a density that over-hard samples by definition fail to provide.
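One way to act on this, offered purely as an illustration, is to estimate each problem's pass rate from its rollouts and keep only problems inside a competence band. The helper names and the 0.2 and 0.8 bounds below are hypothetical choices for the sketch and are not taken from the paper.

```python
def pass_rate(rewards):
    """Fraction of rollouts in the group that the verifier accepted."""
    return sum(1 for r in rewards if r > 0) / len(rewards)

def in_competence_band(rewards, low=0.2, high=0.8):
    """True if the empirical pass rate lies in [low, high] (hypothetical bounds)."""
    return low <= pass_rate(rewards) <= high

# Toy rollout rewards per problem (binary verifier outcomes, 16 rollouts each).
rollout_rewards = {
    "over_hard": [1.0] + [0.0] * 15,     # 1/16: a lone, possibly accidental success
    "medium":    [1.0] * 8 + [0.0] * 8,  # 8/16: successes are densely sampled
    "too_easy":  [1.0] * 16,             # 16/16: no contrast within the group
}

kept = [name for name, rewards in rollout_rewards.items()
        if in_competence_band(rewards)]
print(kept)  # ['medium']
```

A filter like this drops the over-hard problem instead of letting its single success dominate the group. A real pipeline would need to re-estimate pass rates as the policy improves, since a problem that is over-hard early in training can move into the band later.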
— "Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs", https://arxiv.org/abs/2605.28388
Related concepts in this collection
-
Why do medium-difficulty problems teach reasoning better than hard ones?
Does harder always mean better for learning? This explores why easy and extremely hard samples produce weak training signals in RLVR, while medium-difficulty problems drive the strongest improvements.
the parent finding; this note details the downside arm of the inverted-U
-
Does RLVR actually improve mathematical reasoning or just coherence?
RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.
same gap between surface success and genuine reasoning; shortcut amplification is one mechanism producing coherent-but-invalid traces
-
Why does RLVR training narrow a model's problem solving ability?
RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.
the capability-erosion outcome at scale; over-hard samples are one driver of the boundary collapse
-
Do conversational recommender benchmarks actually measure recommendation skill?
Conversational recommender systems are evaluated against ground-truth items mentioned later in conversations. But does this metric distinguish between genuinely recommending new items versus simply repeating items users already discussed?
parallel shortcut-amplification dynamic in a different domain: the reward structure rewards a degenerate copy strategy
-
Why does RLVR work with completely random rewards?
RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.
counterpoint and complication: RLVR can work despite noisy reward, but this note shows the regime (over-hard samples) where reward noise becomes actively harmful
-
What reasoning features does each difficulty level reinforce?
When models train on problems of different difficulty, do they build the same internal reasoning machinery or different kinds? This matters because accuracy gains alone hide what's actually being learned.
same-paper companion: supplies the internal-feature mechanism — hard samples activate reasoning features that only the rare success rewards, so most gradient reinforces the wrong activations
Original note title
overly hard rlvr samples induce degenerate behaviors and amplify shortcut trajectories degrading prior capability