Reasoning and Learning Architectures

How does model ability change what samples teach?

Does a sample's learning value stay fixed, or does it shift as the model improves? Understanding whether informativeness is a moving target could explain why fixed difficulty filters underperform adaptive ones during training.

Note · 2026-05-28 · sourced from RLVR
What does reward learning actually do to model reasoning?

If medium-difficulty problems carry the strongest RLVR signal, the obvious question is: medium relative to what? The difficulty findings are defined relative to the model's current capability: a problem is "hard" because this model fails it now, not because of any intrinsic property. That makes informativeness a relational, moving quantity. A problem that is too hard at step zero (weak signal, shortcut amplification) can become medium-difficulty once the model improves, at which point it starts contributing the strongest gradient. Conversely, a problem that was medium-difficulty early on becomes easy and stops teaching.
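
The drift described above can be illustrated with a toy model. This is a minimal sketch under two assumptions that are not from the source: pass rate follows a logistic function of the gap between a scalar "ability" and a scalar "difficulty", and the variance of a binary reward, p(1 - p), stands in for signal strength. The names `pass_rate`, `signal_proxy`, and `signal_over_training` are hypothetical.

```python
import math

def pass_rate(ability: float, difficulty: float) -> float:
    """Assumed logistic link: success probability depends only on the gap."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def signal_proxy(p: float) -> float:
    """Variance of a Bernoulli(p) reward; largest at p = 0.5."""
    return p * (1.0 - p)

def signal_over_training(difficulty: float, abilities: list) -> list:
    """Signal proxy for one fixed problem as the model's ability changes."""
    return [signal_proxy(pass_rate(a, difficulty)) for a in abilities]

# Hypothetical ability values at successive checkpoints.
abilities = [0.0, 1.5, 3.0, 4.5, 6.0]

# A problem that is too hard at step zero peaks later, then fades.
hard = signal_over_training(3.0, abilities)

# A problem that is medium at step zero only loses signal from there.
medium = signal_over_training(0.0, abilities)

for a, h, m in zip(abilities, hard, medium):
    print(f"ability={a:.1f}  hard-problem signal={h:.3f}  medium-problem signal={m:.3f}")
```

Under these assumptions the same fixed problem moves through weak, strong, and weak signal as ability passes its difficulty, so the problem's label is a property of the pair, not of the problem.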

This is the open problem that static difficulty bucketing leaves unresolved. The one-sample dynamics show that which features a sample reinforces depends on whether successful trajectories are sampled, and the probability of sampling a success on a given problem changes as the policy moves. So the curriculum cannot be set once from a fixed difficulty estimate; the productive band drifts as the policy changes during training. A sample's value is co-determined by its difficulty and the model's evolving capability, and neither factor alone predicts informativeness.
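
The dependence on sampled successes can be made concrete with a little arithmetic. The sketch below assumes binary rewards, k independent rollouts per problem, and a current pass rate p; it also assumes a group-mean baseline, under which a group whose rewards are all identical yields zero advantage. None of these specifics come from the source, and the function names are hypothetical.

```python
def p_any_success(p: float, k: int) -> float:
    """Probability that at least one of k independent rollouts succeeds."""
    return 1.0 - (1.0 - p) ** k

def p_mixed_group(p: float, k: int) -> float:
    """Probability that k rollouts contain both a success and a failure.

    Under the assumed group-mean baseline, only mixed groups give a
    nonzero learning signal.
    """
    return 1.0 - p ** k - (1.0 - p) ** k

k = 8  # hypothetical group size
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p={p:.2f}  any success={p_any_success(p, k):.3f}  mixed group={p_mixed_group(p, k):.3f}")
```

Because p for a given problem changes as the policy moves, the chance that the problem produces any usable group changes with it, which is why a difficulty estimate taken once goes stale.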

Why it matters: this converts a clean prescriptive result ("train on medium-difficulty samples") into a control problem ("track which samples are currently in the productive band and re-rank continuously"). It also explains why fixed difficulty filters underperform adaptive schemes: the filter is correct only at the instant it was computed. The unresolved part is how cheaply current informativeness can be estimated online. Re-estimating per-sample difficulty every few steps is expensive, and proxies (recent pass rate, reward variance) are noisy. The question is worth tracking because adaptive-curriculum RLVR depends on solving it efficiently.
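
One cheap way to approximate the tracking loop is to reuse rewards from rollouts that training already produces. The sketch below is an illustration, not the cited paper's method: it keeps an exponential moving average of each sample's pass rate and ranks samples by p(1 - p). The class `BandTracker`, its `decay` and `prior` parameters, and the sample ids are all hypothetical, and the estimate inherits the noise of the recent-pass-rate proxy mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence

@dataclass
class BandTracker:
    decay: float = 0.9   # weight on the previous estimate (assumed value)
    prior: float = 0.5   # pass-rate guess for samples not yet seen
    est: Dict[str, float] = field(default_factory=dict)

    def update(self, sample_id: str, rewards: Sequence[float]) -> None:
        """Fold the binary rewards from the latest rollouts into the estimate."""
        if not rewards:
            return
        batch_rate = sum(rewards) / len(rewards)
        old = self.est.get(sample_id, self.prior)
        self.est[sample_id] = self.decay * old + (1.0 - self.decay) * batch_rate

    def score(self, sample_id: str) -> float:
        """Reward-variance proxy p(1 - p) from the current estimate."""
        p = self.est.get(sample_id, self.prior)
        return p * (1.0 - p)

    def top(self, sample_ids: Sequence[str], n: int) -> List[str]:
        """The n samples currently estimated to be closest to the productive band."""
        return sorted(sample_ids, key=self.score, reverse=True)[:n]

ids = ["easy", "mid", "hard"]
tracker = BandTracker(decay=0.5)
tracker.update("easy", [1, 1, 1, 1])
tracker.update("mid", [1, 0, 1, 0])
tracker.update("hard", [0, 0, 0, 0])
first = tracker.top(ids, 1)

# Later in training the policy improves: "mid" is now always solved,
# while "hard" starts to be solved about half the time.
for _ in range(2):
    tracker.update("mid", [1, 1, 1, 1])
    tracker.update("hard", [1, 0, 1, 0])
second = tracker.top(ids, 1)

print(first, second)
```

The `decay` value sets the trade-off named above: a low value follows the policy quickly but is noisy, while a high value is smoother but lags behind the drifting band.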


— "Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs", https://arxiv.org/abs/2605.28388

Original note title

sample informativeness is dynamic depending on the interaction between task difficulty and the model's evolving capability