INQUIRING LINE

Why do weaker teacher models sometimes produce better training signals than stronger ones?

This explores why a weaker or smaller teacher model can sometimes give a student model a better training signal than a stronger one — and the corpus suggests the answer is less about teacher quality and more about the match between the signal and what the student can actually absorb.


The corpus reframes the puzzle: the best training signal is not the one from the most capable teacher but the one closest to the student's own learning frontier. Teacher-refined data that is objectively higher quality degrades the student's performance when it exceeds what the student can learn from, and students do best when they filter a teacher's refinements against their own statistical profile, keeping only the compatible ones (see: Does teacher-refined data always improve student model performance?). By that logic, a stronger teacher can be worse precisely because it is further from the student: it offers improvements the student cannot yet absorb.
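As a rough illustration of student-side filtering, the sketch below keeps a teacher refinement only when the student scores it nearly as well as the original text. Everything here is an assumption for illustration: `select_compatible`, `student_score`, and the `max_drop` threshold are hypothetical names, and the toy score (a count of words the student has never seen) merely stands in for whatever statistical profile a real student would use, such as its own mean token log-probability.

```python
from typing import Callable, List, Tuple

def select_compatible(
    pairs: List[Tuple[str, str]],
    student_score: Callable[[str], float],
    max_drop: float = 0.5,
) -> List[str]:
    """For each (original, refined) pair, keep the refined text only if the
    student's score for it is at most `max_drop` below the original's score.
    Otherwise fall back to the original."""
    kept = []
    for original, refined in pairs:
        if student_score(refined) >= student_score(original) - max_drop:
            kept.append(refined)
        else:
            kept.append(original)
    return kept

# Toy stand-in for a student profile (assumption): penalize unfamiliar words.
KNOWN = {"add", "the", "two", "numbers", "then", "check", "result"}

def toy_score(text: str) -> float:
    return -float(sum(word not in KNOWN for word in text.lower().split()))

pairs = [
    ("add the two numbers", "add the two numbers then check the result"),
    ("add the two numbers", "apply the commutative monoid homomorphism"),
]
selected = select_compatible(pairs, toy_score)
print(selected)
```

In this toy run the first refinement is kept because it stays inside the student's vocabulary, while the second is rejected in favor of the original, even though a stronger reader might judge it the more sophisticated rewrite.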

There's also a hidden cost in how confident strong teachers are. Teachers conditioned on correct answers and verifier output produce confident, concise reasoning traces, and students inherit that style. This optimizes in-domain accuracy but suppresses expressions of uncertainty and degrades generalization to out-of-distribution problems that need epistemic caution (see: Does richer teacher context hurt student generalization?). A 'better' teacher can therefore hand the student overconfidence as a side effect, while a rougher teacher that still shows hesitation may transmit a more robust habit of mind.

The same shape shows up in reinforcement learning without a teacher at all, which suggests this is a general principle about difficulty rather than about teachers specifically. RLVR gains follow an inverted-U across difficulty: medium-difficulty problems teach best because they mix enough success with informative failure, while easy problems lack variance and problems that are too hard provide almost no usable signal (see: Why do medium-difficulty problems teach reasoning better than hard ones?). Push past that and training actively backfires: near-impossible samples make models learn degenerate shortcuts that then contaminate skills they already had, because rare lucky successes get treated as high-value trajectories and reinforce answer repetition over real reasoning (see: Do overly hard RLVR samples actually harm model capabilities?). A too-strong teacher posing too-hard targets is the supervised analogue of that failure.
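A small numeric sketch makes the "rare lucky success" effect concrete. It assumes binary rewards (1 for a verified-correct rollout, 0 otherwise) and the common mean-and-standard-deviation form of group-relative normalization. The exact normalization used in the cited work may differ, and the function names and the group size of 16 are illustrative choices.

```python
import math
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and (population) std of its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def success_advantage(num_success: int, group_size: int) -> float:
    """Largest advantage in a group with `num_success` correct rollouts."""
    rewards = [1.0] * num_success + [0.0] * (group_size - num_success)
    return max(group_relative_advantages(rewards))

# Compare impossible, near-impossible, medium, and easy problems.
for k in (0, 1, 8, 15):
    print(f"{k}/16 correct -> advantage of a success: {success_advantage(k, 16):.3f}")
```

Under these assumptions a problem with no successes contributes zero advantage everywhere, which is the "no usable signal" case. A problem solved once in 16 tries gives that single rollout an advantage of roughly the square root of 15, far larger than the advantage of about 1 that a success earns on a half-solved problem. If the lone success came from a shortcut or a lucky guess, that is the trajectory being pushed hardest.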

The relationship also runs in the student's favor: a weaker teacher does not cap the student. Walmart's BERT cross-encoders beat their own LLM teachers once trained on enough teacher-labeled data, because the smoothed teacher predictions exposed the student to a broader input distribution and it generalized past what the teacher could do (see: Can smaller models outperform their LLM teachers with enough data?). The teacher's job there is to be a usable signal over the right distribution, not to be the ceiling.

The thread tying these together: a training signal is only as good as the receiver's ability to use it. So if you're choosing a teacher, the surprising takeaway is to optimize for the size of the gap: pick the teacher whose outputs sit just beyond what your student already does, not the most capable model you can find. If you want to follow this further, the curriculum-design angle is striking. Systems like asymmetric self-play deliberately generate problems calibrated to the solver's current ability rather than maximally hard ones (see: Can language models improve themselves without any external training data?).
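One simple way to act on "just beyond what the student already does" is to estimate the student's pass rate on each candidate problem and keep only those in a middle band. The sketch below is a hand-written filter, not the learned proposer used in self-play systems, and the names `select_frontier`, `low`, and `high` and the 0.2 to 0.8 band are assumptions chosen for illustration.

```python
from typing import Dict, List

def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of the student's attempts that were correct."""
    return sum(outcomes) / len(outcomes)

def select_frontier(
    attempts: Dict[str, List[bool]],
    low: float = 0.2,
    high: float = 0.8,
) -> List[str]:
    """Keep problem ids whose pass rate lies inside [low, high]."""
    return [
        problem_id
        for problem_id, outcomes in attempts.items()
        if outcomes and low <= pass_rate(outcomes) <= high
    ]

# Hypothetical attempt logs: eight student rollouts per problem.
attempts = {
    "easy": [True] * 8,
    "medium": [True, False, True, False, False, True, False, True],
    "hard": [False] * 8,
}
frontier = select_frontier(attempts)
print(frontier)
```

The same filter could in principle be applied to teacher outputs rather than problems, by scoring how often the student can reproduce or verify each one, though that extension is not something the cited notes describe.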


Sources (6 notes)

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Why do medium-difficulty problems teach reasoning better than hard ones?

RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can smaller models outperform their LLM teachers with enough data?

Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
