INQUIRING LINE

Does the productive difficulty band ever stabilize during training?

This explores whether the 'sweet spot' of medium-difficulty problems that teach a model best stays fixed during training, or whether it keeps moving as the model improves.


This explores whether the band of problems that actually teach a model — not too easy, not too hard — ever settles into a stable set during training, or whether it keeps shifting. The short version from the corpus: it doesn't sit still. The productive band drifts, and that drift is the whole point.

The clearest answer is that a sample's teaching value isn't a property of the sample at all — it's a property of the *gap* between the problem's difficulty and the model's current ability. As the model gets better, problems that were once in the productive zone become trivial (no learning signal), so the band slides toward harder material. One note puts it bluntly: static difficulty estimates go obsolete within steps, because the medium-difficulty zone is a moving target How does model ability change what samples teach?. This is why there's a productive band in the first place — learning follows an inverted-U across difficulty, where medium problems win because they balance enough successes with informative failures, while easy ones lack variance and hard ones get gamed Why do medium-difficulty problems teach reasoning better than hard ones?.

What happens if you ignore the drift and keep feeding problems that have drifted *out* of the band — specifically too-hard ones? The model doesn't just fail to learn; it actively degrades. Near-impossible problems push it toward degenerate shortcuts — answer repetition, skipping computation — and because group-relative normalization treats rare accidental successes as high-advantage, those shortcuts get amplified and contaminate skills the model already had Do overly hard RLVR samples actually harm model capabilities?. So a band that's allowed to go stale isn't neutral — it's harmful.

Here's the more interesting twist: even though the difficulty band moves, training itself has a stable *shape*. RL training across many models shows a consistent two-phase arc — first execution correctness drives gains, then strategic planning becomes the bottleneck, with planning-token entropy rising while execution entropy stabilizes Does RL training follow a predictable two-phase learning sequence?. SFT-then-RL shows its own predictable three-phase progression of disruption, readaptation, then overfitting Why does SFT-then-RL training follow a predictable three-phase pattern?. So the band of useful difficulty keeps moving, but it moves through phases you can anticipate — which is exactly what makes curriculum approaches viable rather than hopeless.

That reframes the practical question from 'where is the stable band?' to 'how do you track a moving one?' The corpus answers in two ways. Curricula deliberately ride the drift: dense step-wise expert-similarity rewards give signal even when every rollout fails, working best as a foundation *before* outcome-based refinement takes over Can step-wise expert rewards help small models learn hard reasoning?. And critique models fight the band's tendency to collapse — by maintaining solution diversity across self-training iterations, they prevent premature convergence, which is a more fundamental win than test-time accuracy Do critique models improve diversity during training itself?. The thing you didn't know you wanted to know: stability in training isn't found in a fixed difficulty level — it's found in the predictable *pattern* of how that level has to keep moving.


Sources 7 notes

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Why do medium-difficulty problems teach reasoning better than hard ones?

RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Why does SFT-then-RL training follow a predictable three-phase pattern?

CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Next inquiring lines