How does model ability change what samples teach?
Does a sample's learning value stay fixed, or does it shift as the model improves? Understanding whether informativeness is a moving target could explain why fixed difficulty filters underperform adaptive ones during training.
If medium-difficulty problems carry the strongest RLVR signal, the obvious question is: medium relative to what? The difficulty findings are stated against the model's current capability: a problem is "hard" because this model fails it now, not because of any intrinsic property. That makes informativeness a relational, moving quantity. A problem that is too hard at step zero (weak signal, shortcut amplification) can become medium-difficulty after the model improves, at which point it starts contributing the strongest gradient. Conversely, a problem that was medium-difficulty early on becomes easy and stops teaching.
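As a rough illustration of why "medium" is relative, consider a binary verifiable reward. The sketch below rests on two illustrative assumptions that are not taken from the cited paper: that a problem's pass rate follows a logistic function of capability minus difficulty, and that the Bernoulli reward variance p(1 - p) can stand in for signal strength. The function names and numbers are hypothetical.

```python
import math

def pass_rate(capability: float, difficulty: float) -> float:
    """Hypothetical logistic link: chance that one rollout solves the problem."""
    return 1.0 / (1.0 + math.exp(-(capability - difficulty)))

def reward_variance(p: float) -> float:
    """Variance of a 0/1 reward with success probability p; largest at p = 0.5."""
    return p * (1.0 - p)

difficulty = 2.0  # one fixed problem; only the model changes
for capability in (-2.0, 0.0, 2.0, 4.0, 6.0):  # assumed capability over training
    p = pass_rate(capability, difficulty)
    print(f"capability={capability:+.1f}  pass_rate={p:.3f}  "
          f"variance={reward_variance(p):.3f}")
```

Under these assumptions the same problem contributes almost nothing while capability is far below or far above its difficulty, and contributes the most when the two are matched.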
This is the open problem that static difficulty bucketing leaves unresolved. The one-sample dynamics show that which features a sample reinforces depends on whether successful trajectories are sampled, and the chance of sampling a success on a given problem changes as the policy moves. So the curriculum cannot be set once from a fixed difficulty estimate; the productive band drifts as the policy changes during training. A sample's value is co-determined by its difficulty and the model's evolving capability, and neither factor alone predicts informativeness.
Why it matters: this converts a clean prescriptive result ("train on medium-difficulty samples") into a control problem ("track which samples are currently in the productive band and re-rank continuously"). It also explains why fixed difficulty filters underperform adaptive schemes: the filter is correct only at the instant it was computed. The unresolved part is how cheaply current informativeness can be estimated online: re-estimating per-sample difficulty every few steps is expensive, and proxies (recent pass rate, reward variance) are noisy. The question is worth tracking because adaptive-curriculum RLVR depends on solving it efficiently.
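One inexpensive way to approximate the "track and re-rank" loop is to reuse the rewards of rollouts that training already generates. The sketch below is a hypothetical illustration, not the cited paper's method: `BandTracker`, its band thresholds, the decay factor, and the neutral prior are all assumed names and values. It keeps an exponential moving average of each sample's pass rate and keeps only samples whose estimate sits inside the band.

```python
from dataclasses import dataclass, field

@dataclass
class BandTracker:
    """Hypothetical tracker of per-sample pass rate (all defaults are assumptions)."""
    low: float = 0.2     # below this the sample is treated as too hard
    high: float = 0.8    # above this the sample is treated as too easy
    decay: float = 0.9   # weight kept on the previous estimate
    prior: float = 0.5   # estimate used for samples never seen before
    estimates: dict = field(default_factory=dict)

    def update(self, sample_id: str, rewards: list) -> None:
        """Blend the pass rate of the latest 0/1 rollout rewards into the estimate."""
        batch_rate = sum(rewards) / len(rewards)
        old = self.estimates.get(sample_id, self.prior)
        self.estimates[sample_id] = self.decay * old + (1.0 - self.decay) * batch_rate

    def in_band(self, sample_id: str) -> bool:
        p = self.estimates.get(sample_id, self.prior)
        return self.low <= p <= self.high

    def select(self, sample_ids: list) -> list:
        """Return the samples currently estimated to be in the productive band."""
        return [s for s in sample_ids if self.in_band(s)]

tracker = BandTracker(decay=0.5)
for _ in range(5):
    tracker.update("easy", [1, 1, 1, 1])
    tracker.update("mid", [1, 0, 1, 0])
    tracker.update("hard", [0, 0, 0, 0])
print(tracker.select(["easy", "mid", "hard"]))
```

The sketch adds no extra sampling cost, but it inherits the noise problem described above: an estimate built from a handful of rollouts is coarse, and the estimate for a sample that has been filtered out goes stale because that sample is no longer rolled out. Periodically re-probing excluded samples would be one way to address this, at some additional cost.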
— "Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs", https://arxiv.org/abs/2605.28388
Related concepts in this collection
-
Why do medium-difficulty problems teach reasoning better than hard ones?
Does harder always mean better for learning? This explores why easy and extremely hard samples produce weak training signals in RLVR, while medium-difficulty problems drive the strongest improvements.
the static finding this note dynamizes: the inverted-U is correct only relative to a fixed capability snapshot
-
Can a single training example unlock mathematical reasoning?
Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.
evidence that capability keeps moving even after apparent saturation, so the difficulty-capability relation never settles
-
Does RLVR actually expand what models can reason about?
Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
bounds the dynamic: if RLVR only reshapes sampling within fixed boundaries, the productive band drifts but cannot move past the base model's frontier
-
Can we prune training data without hurting model performance?
This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.
static-pruning counterpart; the dynamic-informativeness view argues the pruning criterion must be recomputed as capability evolves
-
What reasoning features does each difficulty level reinforce?
When models train on problems of different difficulty, do they build the same internal reasoning machinery or different kinds? This matters because accuracy gains alone hide what's actually being learned.
grounds: gives the feature-level content of "informativeness" — what a sample teaches, and thus its value, shifts with the band it currently occupies
-
Should training maximize diversity when models feed into search?
If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?
extends: a related control-problem reframe of RL training objectives, where what to optimize for changes with deployment regime rather than being fixed once
Original note title
sample informativeness is dynamic depending on the interaction between task difficulty and the model's evolving capability