INQUIRING LINE

What features does a sample reinforce when it moves bands?

This reads 'bands' as the difficulty bands from curriculum-style training — where a problem that was once too hard (or too easy) drifts into the productive zone as the model's ability changes — and asks: when a sample crosses into that zone, what is it actually teaching the model?


This explores a real moving target in training: a sample's value isn't fixed by its difficulty but by how that difficulty meets the model's *current* ability, so the question is what a sample reinforces once it slides into the productive band. The cleanest anchor is the finding that sample informativeness is dynamic: the band of medium-difficulty problems that teach the most drifts during training, making any static difficulty label stale within steps (see: How does model ability change what samples teach?). So 'moving bands' isn't the sample changing; it's the model moving underneath it. The interesting twist is that what gets reinforced may not be a new skill at all but a *format* or *distribution* the model already had latent.
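The dependence on current ability can be made concrete with a small sketch. It assumes, purely for illustration, that "difficulty" is measured as a sample's pass rate over a handful of rollouts from the current model, and that the productive band is a fixed pass-rate interval; the thresholds `low` and `high`, the sample ids, and the rollout data are all hypothetical.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of sampled attempts that succeeded (0.0 if there are none)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def productive_band(
    rollouts: Dict[str, List[bool]], low: float = 0.2, high: float = 0.8
) -> List[str]:
    """Return ids of samples whose *current* pass rate lies in [low, high].

    The band limits are illustrative assumptions, not values from the notes.
    """
    return sorted(
        sid for sid, outcomes in rollouts.items()
        if low <= pass_rate(outcomes) <= high
    )


# Hypothetical rollouts for the same three samples at two training steps.
step_100 = {
    "a": [False] * 8,                       # too hard for the model right now
    "b": [True, False, False, True] + [False] * 4,
    "c": [True] * 8,                        # already solved every time
}
step_500 = {
    "a": [True, False, False, False, True] + [False] * 3,
    "b": [True] * 8,
    "c": [True] * 8,
}

print(productive_band(step_100))  # ['b']
print(productive_band(step_500))  # ['a']
```

The samples themselves never change between the two calls; only the model's outcomes on them do, which is why the selection has to be recomputed rather than cached as a static label.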

That's where the corpus gets surprising. When reinforcement learning runs, it doesn't broadly expand capability. It converges hard on one dominant output format inherited from pretraining and suppresses the alternatives, often within the first epoch, and the winning format depends on model scale, not necessarily on which format performs best (see: Does RL training collapse format diversity in pretrained models?). Read alongside the dynamic-band finding, this suggests a sample entering the productive band frequently reinforces *presentation and distributional habits*, such as a way of laying out a solution, more than it installs genuinely novel reasoning. The band shift is amplifying something already present.
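One hedged way to watch for this kind of collapse is to track the entropy of output formats over training. The sketch below assumes each sampled output has already been tagged with a format label by some classifier of your own (hypothetical here); the labels and counts are made up for illustration and are not results from the cited note.

```python
import math
from collections import Counter
from typing import List


def format_entropy(labels: List[str]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of format labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical format labels for sampled outputs before and after RL.
before_rl = ["prose", "code", "latex", "prose", "code", "latex", "prose", "code"]
after_rl = ["code"] * 7 + ["prose"]

h_before = format_entropy(before_rl)
h_after = format_entropy(after_rl)
print(f"entropy before: {h_before:.3f} bits, after: {h_after:.3f} bits")
```

A sharp drop in this number early in training would be consistent with one format being amplified while the others are suppressed; it says nothing on its own about whether accuracy improved.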

The SFT-then-RL trajectory makes the 'what' more concrete by showing it has phases. When expert data diverges from the model's policy, training moves through shift → readapt → overfit: first the new samples disrupt existing capability, then the model readapts toward the expert patterns, then it overfits to them (see: Why does SFT-then-RL training follow a predictable three-phase pattern?). So the same sample reinforces different things depending on *when* it lands relative to the model's state: disruption early, pattern-matching in the middle, memorization late. 'Moving bands' and 'moving phases' are two views of the same dependency on current ability.
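The source note for this pattern describes dynamically weighting SFT as an auxiliary objective within on-policy RL. A minimal sketch of that idea follows, with loud caveats: the convex mixing form, the linear decay schedule, and the `start`/`end` values are assumptions chosen for illustration, not the schedule CHORD actually uses.

```python
def sft_weight(step: int, total_steps: int, start: float = 0.9, end: float = 0.05) -> float:
    """Linearly decay the SFT weight from `start` to `end` (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (end - start) * frac


def mixed_loss(rl_loss: float, sft_loss: float, mu: float) -> float:
    """Combine the on-policy RL loss with an auxiliary SFT loss weighted by `mu`."""
    return (1.0 - mu) * rl_loss + mu * sft_loss


# Hypothetical loss values, only to show how the balance shifts over training.
for step in (0, 500, 1000):
    mu = sft_weight(step, total_steps=1000)
    print(step, round(mu, 3), round(mixed_loss(rl_loss=2.0, sft_loss=4.0, mu=mu), 3))
```

The design intent in this sketch is that expert data dominates early, when the policy is far from it, and fades before the overfit phase would set in.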

There's a quiet warning hiding here too. A model can hold all the linearly decodable features a task needs while its internal organization is fractured: perfect on the metric, brittle under perturbation (see: Can models be smart without organized internal structure?). That means when a sample 'reinforces a feature,' the surface signal (accuracy went up) can mask whether it reinforced robust structure or just a decodable shortcut. Pair that with the finding that some heuristic override tasks need the model to *compose* conflicting cues rather than filter them out (see: Why does removing spurious cues sometimes hurt model performance?), and the honest answer to 'what features' becomes: not necessarily the ones you intended.

The sharp takeaway you didn't ask for: the productive band isn't a property of your dataset — it's a property of the model at a given step, and what samples reinforce inside it skews toward amplifying existing formats and distributions over teaching new skills. If you're curating by difficulty, you're aiming at a target that has already moved.


Sources (5 notes)

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why does SFT-then-RL training follow a predictable three-phase pattern?

CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Next inquiring lines