Why do weaker models generate better training data than stronger models?
This explores a counterintuitive finding in the corpus: that data quality is relative to the learner, not absolute — so a weaker or more similar model often produces training data that a student can actually absorb, while a stronger model's 'better' data sits beyond reach.
'Better training data' turns out to be a property of the *relationship* between teacher and student, not of the data itself. The corpus repeatedly shows that objectively higher-quality data from a stronger model can fail, not because it is wrong but because it lands outside what the learner can currently absorb. Teacher-refined instruction data actually *degrades* a student when the refinements exceed its learning frontier; the fix is for the student to filter refinements through its own statistical profile and keep only the compatible ones (see: Does teacher-refined data always improve student model performance?). The most striking version is self-generated data beating data from a stronger external model outright: SEAL lifted QA accuracy from 33.5% to 47.0% on synthetic data the model made for itself, which suggests the restructuring matched the learner's own representational needs (see: Does self-generated training data improve model learning?).
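The filtering idea can be made concrete with a toy sketch. The corpus does not say how a student's "statistical profile" is computed, so everything here is an assumption for illustration: the student is stood in for by a smoothed unigram model, "compatibility" is read as average per-token negative log-likelihood (NLL), and the names `make_unigram_nll`, `select_targets`, and `max_nll` are hypothetical.

```python
import math
from collections import Counter

def make_unigram_nll(corpus_tokens):
    """Toy stand-in for a student model (assumption): returns a function that
    scores text by average per-token NLL under add-one-smoothed unigram counts."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def nll(text):
        tokens = text.split()
        if not tokens:
            return 0.0
        log_probs = (math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
        return -sum(log_probs) / len(tokens)

    return nll

def select_targets(pairs, student_nll, max_nll):
    """For each (original, refined) pair, keep the teacher's refinement only if
    the student finds it familiar enough (NLL <= max_nll); otherwise fall back
    to the original response."""
    return [
        refined if student_nll(refined) <= max_nll else original
        for original, refined in pairs
    ]

# Hypothetical data: the first refinement stays within the student's
# vocabulary, the second uses words the student has never seen.
student_nll = make_unigram_nll("the cat sat on the mat the dog sat on the rug".split())
pairs = [
    ("the cat sat", "the cat sat on the mat"),
    ("the dog sat", "canine repose eventuated"),
]
chosen = select_targets(pairs, student_nll, max_nll=2.5)
print(chosen)
```

With a real student, `student_nll` would presumably come from the model's own token log-probabilities, and the threshold would be calibrated on data the student already handles well.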
The mechanism becomes clearer once you stop thinking 'quality' and start thinking 'distance from where the learner already is.' In RLVR (reinforcement learning with verifiable rewards), gains follow an inverted-U over difficulty: medium-difficulty problems teach best because they mix wins with informative failures, easy problems offer too little variance to learn from, and too-hard problems, exactly the kind a stronger model would happily generate, collapse into degenerate shortcuts (see: Why do medium-difficulty problems teach reasoning better than hard ones?). Push that further and the harm is active, not just wasted: on nearly-impossible samples, group-relative normalization treats a rare accidental success as a high-advantage trajectory, so models learn answer-repetition and computation-skipping that then *contaminate* skills they already had (see: Do overly hard RLVR samples actually harm model capabilities?). A weaker generator tends to produce problems and demonstrations closer to the student's frontier, in the productive middle of that curve.
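A small worked example shows why rare successes on too-hard problems get outsized weight. The source note on overly hard samples attributes this to group-relative normalization; the sketch below assumes the common mean/standard-deviation form of that normalization with binary rewards, which may differ in detail from what the cited work uses. The function names and the group size of 16 are illustrative choices.

```python
import math

def group_advantages(rewards):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        # Every rollout got the same reward, so there is no relative signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

def success_advantage(pass_rate, group_size=16):
    """Advantage assigned to one successful rollout in a binary-reward group."""
    wins = round(pass_rate * group_size)
    if wins == 0:
        return 0.0
    rewards = [1.0] * wins + [0.0] * (group_size - wins)
    return group_advantages(rewards)[0]

def reward_variance(pass_rate):
    """Variance of a Bernoulli reward, used here as a rough proxy for how much
    learning signal a problem of this difficulty can provide."""
    return pass_rate * (1.0 - pass_rate)

for wins in (1, 8, 15):
    p = wins / 16
    print(wins, round(reward_variance(p), 4), round(success_advantage(p), 2))
```

Under these assumptions a lone success in a group of 16 receives an advantage of about 3.87 (the square root of 15), versus 1.0 when half the group succeeds, while the reward variance peaks at a pass rate of 0.5. That is the inverted-U in miniature: the middle has the most signal, and the hard end gives the most weight to trajectories that may have succeeded by accident.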
There's a second thread: staying close to the base distribution is itself valuable. Curriculum work reframes the whole game as managing distance from the pre-training distribution rather than pedagogical difficulty, feeding rare data first because rarity marks a distributional gap, not because it is conceptually hard (see: Does ordering training data by rarity actually improve language models?). And low KL (Kullback-Leibler) drift from the base model preserves the *plasticity* to keep learning at all; data that yanks a model far from its origin stalls future adaptation (see: Does staying close to the base model preserve learning ability?). Data from a nearby model keeps the student in the region where it can still move.
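To make "KL drift" tangible, here is a minimal sketch that measures it on a toy next-token distribution. The three-token vocabulary and the probability values are made up for illustration, and the direction KL(tuned || base) is an assumption; the corpus does not specify which direction the cited work reports.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same
    vocabulary. Assumes q is positive wherever p is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a three-token vocabulary.
base = [0.70, 0.20, 0.10]   # base model
near = [0.65, 0.23, 0.12]   # after training on data from a nearby model
far = [0.05, 0.05, 0.90]    # after training on data far from the base

drift_near = kl_divergence(near, base)
drift_far = kl_divergence(far, base)
print(f"near drift: {drift_near:.4f} nats, far drift: {drift_far:.4f} nats")
```

In practice the quantity would be averaged over many contexts and computed from model logits, but the comparison is the same: the nearby update barely moves the distribution, while the distant one pays a KL cost that is orders of magnitude larger.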
The counter-case is just as instructive: weaker isn't magically better. Recursive training on model-generated content causes irreversible collapse, with rare events and tails vanishing generation after generation (see: Does training on AI-generated content permanently degrade model quality?), and degradation even has a capability signature: weaker models visibly delete content while stronger ones silently corrupt it (see: Do frontier models fail differently than weaker models?). So the real lesson isn't 'use weaker teachers'; it's that compatibility with the learner beats raw teacher strength. That's also why students can overtake their teachers when fed enough teacher-*labeled* data over a broader input distribution, as when Walmart's BERT cross-encoders exceeded the LLM that taught them (see: Can smaller models outperform their LLM teachers with enough data?), and why self-play systems that generate their own calibrated curriculum can improve with no external teacher at all (see: Can language models improve themselves without any external training data?). The thing you didn't know you wanted to know: 'better data' is a two-body problem, and a teacher one notch above the student often teaches more than a genius does.
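The self-play case depends on scoring answers without ground truth. The SQLM source note says the solver learns via majority-vote verification; the sketch below shows one way that could work. It is an illustration only: the function name and the strict-majority rule for rejecting ambiguous groups are assumptions, not details taken from the cited work.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Treat the most common answer among sampled solutions as a pseudo-label
    and reward each sample that agrees with it.

    Assumption: if no answer holds a strict majority, the group is considered
    too ambiguous to learn from and every sample gets zero reward.
    """
    if not answers:
        return None, []
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count * 2 <= len(answers):
        return None, [0.0] * len(answers)
    return top_answer, [1.0 if a == top_answer else 0.0 for a in answers]

label, rewards = majority_vote_rewards(["42", "42", "41", "42"])
print(label, rewards)
```

A problem on which the solver's samples agree too easily or never agree yields little useful signal, which is one reason a proposer has an incentive to keep problems near the solver's frontier.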
Sources (10 notes)
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
SEAL demonstrates that models learn better from synthetic data they generate themselves than from data created by stronger external models. Self-generated data improved QA performance from 33.5% to 47.0%, suggesting that model-specific restructuring aligns with the learner's representational needs.
RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Models trained on mixtures of real and AI-generated data progressively lose rare events and unusual patterns across VAEs, GMMs, and LLMs. Each generation compounds the loss, making genuine human data increasingly valuable.
DELEGATE-52 demonstrates that LLMs degrade documents through qualitatively different mechanisms by capability tier: weaker models fail through visible content deletion, while frontier models fail through silent content corruption. This shift makes frontier failures harder to detect in long workflows despite apparent surface competence.
Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.