How do difficulty metrics relate to the true value of training examples?
This explores whether a training example's difficulty score actually predicts how much a model learns from it — and the corpus says the relationship is real but unstable, conditional, and easy to misread.
The collection's answer is that difficulty is a useful but treacherous proxy for value. At the simplest level, difficulty clearly carries signal: ranking examples by metrics like EL2N, forgetting, or memorization and discarding the easy ones lets you prune large fractions of a dataset without losing accuracy, and can even beat the usual power-law scaling curves (see: Can we prune training data without hurting model performance?). So difficulty isn't noise; easy examples really are often redundant.
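To make score-based pruning concrete, here is a minimal sketch that computes an EL2N-style score (the L2 norm of the gap between predicted probabilities and the one-hot label) and keeps the hardest fraction of examples. The function names, the toy logits, and the 50% keep fraction are illustrative assumptions, and the score is computed from a single set of logits for brevity rather than averaged over several runs.

```python
import numpy as np

def el2n_scores(logits: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-example L2 norm of (softmax(logits) - one_hot(labels))."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    one_hot = np.eye(logits.shape[1])[labels]
    return np.linalg.norm(probs - one_hot, axis=1)

def keep_hardest(scores: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Return sorted indices of the highest-scoring (hardest) examples."""
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    hardest = np.argsort(scores)[-n_keep:]
    return np.sort(hardest)

# Toy data (hypothetical): 4 examples, 3 classes.
logits = np.array([
    [5.0, 0.0, 0.0],   # confident and correct: easy
    [0.2, 0.1, 0.0],   # unsure but leaning correct
    [0.0, 4.0, 0.0],   # confident and wrong: hard
    [0.0, 0.0, 0.1],   # unsure
])
labels = np.array([0, 0, 0, 2])

scores = el2n_scores(logits, labels)
kept = keep_hardest(scores, keep_fraction=0.5)
print(scores.round(3), kept)
```

The confident-and-correct example gets a score near zero and is the first to be dropped, which is the sense in which easy examples are treated as redundant.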
But harder is not simply better. In reinforcement learning with verifiable rewards (RLVR), value follows an inverted-U: medium-difficulty problems teach the most because they mix enough successes to give a signal with enough failures to be informative, while the hardest problems backfire (see: Why do medium-difficulty problems teach reasoning better than hard ones?). Near-impossible samples are actively harmful: the model stumbles onto a rare correct answer, group-relative normalization scores that lucky hit as a high-advantage trajectory, and training ends up reinforcing shortcuts and answer repetition that corrode reasoning the model already had (see: Do overly hard RLVR samples actually harm model capabilities?). Strikingly, different difficulty bands don't just teach more or less, they teach different things: easy problems entrench shortcuts and suppress deliberation, hard ones light up reasoning features only on the rare win, and the same headline accuracy gain can hide opposite internal changes (see: What reasoning features does each difficulty level reinforce?).
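The sources below attribute the "lucky hit" problem to group-relative normalization. A small numeric sketch shows the effect, assuming binary rewards, a group of 16 rollouts per problem, and the common mean/standard-deviation normalization; the group size and the epsilon term are illustrative assumptions, not values taken from the notes.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same problem."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def success_advantage(n_success, group_size=16):
    """Advantage assigned to a successful rollout when n_success of the group succeed."""
    rewards = [1.0] * n_success + [0.0] * (group_size - n_success)
    return group_relative_advantages(rewards)[0]

lucky = success_advantage(1)    # near-impossible problem: 1 success in 16
medium = success_advantage(8)   # medium problem: 8 successes in 16
easy = group_relative_advantages([1.0] * 16)  # every rollout succeeds

print(f"lucky hit advantage:    {lucky:.2f}")
print(f"medium success advantage: {medium:.2f}")
print(f"all-success advantages:  {easy}")
```

Under these assumptions the single accidental success receives an advantage several times larger than a success on a medium problem, while a problem the model always solves contributes no gradient signal at all, which matches the inverted-U picture.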
The deeper twist is that difficulty isn't even a fixed property of an example. A sample's true value depends on the gap between its difficulty and the model's current ability, so the productive 'medium' band keeps drifting as training proceeds, and a static difficulty label can go stale within a few steps (see: How does model ability change what samples teach?). The same relativity shows up in distillation: teacher-refined data that is objectively higher quality still hurts a student when it sits beyond the student's learning frontier, so students do better filtering refinements against their own profile than swallowing everything (see: Does teacher-refined data always improve student model performance?).
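The drifting band can be illustrated with a toy model. The sketch below assumes a logistic relationship between pass probability and the ability-minus-difficulty gap, and a 0.2 to 0.8 pass-rate window for "medium"; both are hypothetical choices for illustration. In practice the pass rate would be re-estimated from fresh rollouts of the current model rather than computed from a formula.

```python
import math

def pass_probability(ability: float, difficulty: float) -> float:
    """Hypothetical logistic model: success depends only on the ability-difficulty gap."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def medium_band(difficulties, ability, low=0.2, high=0.8):
    """Indices of samples whose estimated pass rate falls inside [low, high]."""
    return [
        i for i, d in enumerate(difficulties)
        if low <= pass_probability(ability, d) <= high
    ]

# Fixed difficulty labels, assigned once (hypothetical values).
difficulties = [-3.0, -1.0, 0.0, 1.0, 2.0, 4.0]

early = medium_band(difficulties, ability=0.0)  # model at the start of training
later = medium_band(difficulties, ability=2.0)  # same data, stronger model

print("productive samples early:", early)
print("productive samples later:", later)
```

The difficulty labels never change, yet the set of productive samples shifts toward harder items as ability grows, which is why a selection made once from static labels goes stale.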
Worth knowing if you want to leave this rabbit hole smarter: some of the things we casually read as 'difficulty' aren't difficulty at all. Longer chain-of-thought traces look like the model working harder on harder problems, but controlled maze experiments show trace length tracks how close a problem is to the training distribution, not its intrinsic difficulty, and the two decouple completely out of distribution (see: Does longer reasoning actually mean harder problems?). That's the same lesson the pruning and RLVR work keeps circling: a metric is only a stand-in for value, and the moment you treat the proxy as the thing itself you start optimizing the wrong target.
If you want to go one level out, the corpus has adjacent cautionary tales about training signals that look right but mislead: binary rewards that quietly wreck calibration by rewarding confident guessing (see: Does binary reward training hurt model calibration?), and utility-weighted losses that sharpen decisions while starving the model of the gradient it needs to actually learn features (see: Can utility-weighted training loss actually harm model performance?). The through-line across all of them: the number you're sorting by is rarely the value you care about.
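The calibration note in the sources describes adding a Brier score term to the binary reward. A minimal sketch of that idea follows, assuming the model reports a confidence in [0, 1] alongside its answer and that the Brier penalty is simply subtracted from the correctness reward; the exact combination and the function names are assumptions for illustration.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: confidence plays no role."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# A wrong answer costs nothing extra under the binary reward, however confident.
print(binary_reward(False))

# With the Brier term, a confident wrong answer is punished far more
# than a hedged wrong answer, and a confident right answer is rewarded most.
confident_wrong = calibrated_reward(False, 0.9)
hedged_wrong = calibrated_reward(False, 0.1)
confident_right = calibrated_reward(True, 0.9)
print(confident_wrong, hedged_wrong, confident_right)
```

Under the binary reward a guess stated with full confidence is never worse than a hedged one, which is the incentive the Brier term removes.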
Sources (9 notes)
Research shows that ranking training examples by difficulty (EL2N, forgetting, memorization) and removing easy ones beats power-law scaling laws. On CIFAR-10, 50% of data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.
RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems activate reasoning features only on rare success; medium difficulty strengthens both simultaneously. Identical accuracy gains can reflect opposite internal changes.
A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within a few steps.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Asymmetric loss functions give the right incentives for the final decision but degrade representation learning by shrinking the gradient signal needed for substantive feature acquisition. Training with a symmetric loss and then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.
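The post-hoc adjustment in the last note can be sketched for the binary case. This assumes a classifier trained with an ordinary symmetric loss that outputs reasonably calibrated probabilities, and applies the standard expected-cost decision rule afterwards; the cost values and function names are hypothetical, and the note itself does not specify the adjustment method.

```python
def decision_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Probability above which predicting positive has lower expected cost.

    Predict positive when p * cost_false_negative > (1 - p) * cost_false_positive.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)

def decide(probabilities, cost_false_positive, cost_false_negative):
    """Apply the cost-aware threshold to probabilities from a symmetrically trained model."""
    threshold = decision_threshold(cost_false_positive, cost_false_negative)
    return [p > threshold for p in probabilities]

# Hypothetical costs: missing a positive is four times worse than a false alarm.
threshold = decision_threshold(cost_false_positive=1.0, cost_false_negative=4.0)
decisions = decide([0.1, 0.25, 0.6], cost_false_positive=1.0, cost_false_negative=4.0)
print(threshold, decisions)
```

The asymmetry lives entirely in the decision step, so training still receives the full gradient signal from the symmetric loss.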