Why do medium-difficulty problems produce more stable learning gains?
This explores why problems of moderate difficulty — not too easy, not too hard — give the most reliable improvement when training models with reinforcement learning, and what's actually happening inside the model that makes that band productive.
The corpus has a clear answer with a memorable shape: learning across difficulty follows an inverted-U curve. Medium problems win because they balance enough successes to give the model a foothold with enough failures to be informative; the learning signal is strongest where outcomes are genuinely uncertain (see "Why do medium-difficulty problems teach reasoning better than hard ones?"). Easy samples lack variance (the model already wins, so there is nothing to learn), and hard samples are where things actively break.
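One way to see why the signal peaks where outcomes are uncertain is to look at the variance of the reward itself. The sketch below assumes a binary pass/fail reward (an assumption; the notes do not specify the reward form), in which case the variance is `p * (1 - p)` for pass rate `p`. The function name and the sample pass rates are illustrative only.

```python
def reward_variance(pass_rate: float) -> float:
    """Variance of a 0/1 reward when a rollout succeeds with probability pass_rate.

    Assumes a binary (Bernoulli) reward, which is an illustrative simplification.
    """
    return pass_rate * (1.0 - pass_rate)


# Hypothetical pass rates, from always failing to always succeeding.
pass_rates = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
variances = {p: reward_variance(p) for p in pass_rates}

# The pass rate with the most reward variance sits in the middle of the range.
best = max(variances, key=variances.get)
```

Under this simplification the variance is zero at both extremes and largest at a pass rate of 0.5, which matches the inverted-U shape described above.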
What makes this more than a tuning heuristic is what the corpus says happens *inside* the model at each difficulty. Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems only occasionally succeed, so deliberate reasoning gets rewarded rarely; medium difficulty is the one band that strengthens both shortcut-resistance and genuine reasoning at once (see "What reasoning features does each difficulty level reinforce?"). That is the deeper reason the gains are *stable* rather than just large: identical accuracy improvements can mask opposite internal changes, and only the medium band reinforces the durable kind.
The failure mode at the hard end is worth understanding because it explains the instability you avoid. Training on near-impossible problems does not just waste signal; it degrades the model. Because group-relative reward normalization treats a rare accidental success as a high-value trajectory, the model learns to repeat answers and skip computation, and these degenerate shortcuts contaminate capabilities it already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So the "stable gains" from medium problems are partly the absence of this active corruption.
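The effect of group-relative normalization on a rare success can be made concrete with a small calculation. This is a minimal sketch that assumes the common mean-and-standard-deviation form of normalization within a group of rollouts; the notes only say "group-relative", so the exact formula, the group size of 16, and the `eps` term are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts: (r - mean) / (std + eps).

    The mean/std form and the eps value are assumptions for illustration.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Near-impossible problem: 1 accidental success out of 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Medium problem: 8 successes out of 16 rollouts.
medium_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
medium_adv = group_relative_advantages(medium_group)

# Advantage assigned to a successful rollout in each group.
hard_success_adv = hard_adv[0]
medium_success_adv = medium_adv[0]
```

In this setup the lone success in the hard group receives a much larger advantage than any single success in the medium group, so whatever that one rollout happened to do, sound or not, is reinforced strongly.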
Here's the twist that makes 'medium' harder than it sounds: difficulty isn't a fixed property of a problem. A sample's teaching value depends on the interaction between its difficulty and the model's *current* ability, so the productive band drifts as the model improves. A problem that was medium at step 100 may be trivial by step 300 (see "How does model ability change what samples teach?"). Stable learning therefore comes not from picking medium problems once, but from continuously re-centering on the moving target.
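Re-centering can be sketched as re-estimating each problem's pass rate from recent rollouts and keeping only the problems that currently fall in a middle band. The band limits (0.2 and 0.8), the problem ids, and the rollout outcomes below are hypothetical values chosen for illustration, not figures from the notes.

```python
def pass_rate(outcomes):
    """Fraction of successful rollouts (1 = success, 0 = failure)."""
    return sum(outcomes) / len(outcomes)


def select_medium(rollout_outcomes, low=0.2, high=0.8):
    """Return problem ids whose current empirical pass rate lies in [low, high].

    The thresholds are illustrative assumptions.
    """
    return sorted(
        pid
        for pid, outcomes in rollout_outcomes.items()
        if low <= pass_rate(outcomes) <= high
    )


# Hypothetical outcomes of 8 rollouts per problem at two points in training.
step_100 = {
    "p1": [1, 1, 1, 1, 1, 1, 1, 0],  # already nearly solved
    "p2": [1, 0, 1, 0, 0, 1, 0, 0],  # medium for the current model
    "p3": [0, 0, 0, 0, 0, 0, 0, 0],  # out of reach
}
step_300 = {
    "p1": [1, 1, 1, 1, 1, 1, 1, 1],
    "p2": [1, 1, 1, 1, 1, 1, 1, 1],  # now trivial
    "p3": [0, 1, 0, 0, 1, 0, 0, 0],  # has become medium
}

band_at_100 = select_medium(step_100)
band_at_300 = select_medium(step_300)
```

The same three problems yield a different selected set at each checkpoint, which is the sense in which a difficulty label assigned once goes stale.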
The corpus also offers escape hatches for when you can't find that band. If even your hardest useful problems produce all-failure rollouts, step-wise expert-similarity rewards give a dense signal by scoring each move against an expert, so the model learns something even when no full attempt succeeds (see "Can step-wise expert rewards help small models learn hard reasoning?"). And in a striking counterpoint to the difficulty-curve framing, a single well-chosen example can activate latent reasoning and keep improving test accuracy long after training accuracy saturates (see "Can a single training example unlock mathematical reasoning?"), a reminder that the medium-difficulty story is about *signal quality*, not problem quantity.
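The step-wise expert-similarity idea mentioned above can be illustrated with a toy scorer that compares each model step to the corresponding expert step. This is only a sketch: the notes do not say how similarity is measured, so the use of `difflib.SequenceMatcher` as the similarity metric, the function name, and the example steps are all assumptions.

```python
from difflib import SequenceMatcher


def stepwise_rewards(model_steps, expert_steps):
    """Score each model step against the expert step at the same position.

    Returns one reward in [0, 1] per expert step. A missing model step is
    compared as an empty string. String similarity is a stand-in metric,
    not the method used in the cited work.
    """
    rewards = []
    for i, expert in enumerate(expert_steps):
        model = model_steps[i] if i < len(model_steps) else ""
        rewards.append(SequenceMatcher(None, model, expert).ratio())
    return rewards


# Hypothetical two-step solution: first step matches, final answer is wrong.
expert = ["x + 2 = 5", "x = 3"]
attempt = ["x + 2 = 5", "x = 4"]

rewards = stepwise_rewards(attempt, expert)
```

Even though the attempt's final answer is wrong, and an outcome-only reward would give it nothing, the first step still earns full credit and the second earns partial credit, which is what makes the signal dense.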
Sources (6 notes)
RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.
Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems activate reasoning features only on rare success; medium difficulty strengthens both simultaneously. Identical accuracy gains can reflect opposite internal changes.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.
Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.