Does richer teacher context hurt student generalization?
When teachers are given more information during distillation, they produce confident but brittle students. Does this trade-off between in-domain wins and out-of-distribution robustness hold across different task distributions?
The self-distillation degradation finding has a clean causal story. When the teacher model is conditioned on richer information (the correct solution, access to a verifier, additional context that would not be available at inference time), the reasoning trajectories it produces become more confident and more concise. The teacher knows the answer, so it does not bother to express uncertainty mid-trace. The student, distilling toward these traces, inherits the confident style.
The pattern unfolds along two factors: information richness and task coverage. Richer teacher context → confident traces → suppressed epistemic verbalization → faster in-domain optimization. Limited task coverage means the in-domain wins are real and visible; the model gets better at the narrow distribution it was trained on. As task coverage broadens, the missing uncertainty channel becomes a liability — out-of-distribution problems benefit from expressing uncertainty and adjusting accordingly, and the confident-style student no longer has access to that adjustment mechanism.
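One way to make "suppressed epistemic verbalization" observable is to count uncertainty markers in a trace. The sketch below is illustrative only: the marker list and the `marker_density` function are assumptions introduced here, not something the source note defines, and a real pipeline would need a marker set tuned to its own traces.

```python
import re

# Hypothetical marker list (an assumption, not taken from the source note).
EPISTEMIC_MARKERS = (
    "wait",
    "hmm",
    "let me check",
    "let me verify",
    "not sure",
    "maybe",
    "perhaps",
    "alternatively",
)


def marker_density(trace: str, markers=EPISTEMIC_MARKERS) -> float:
    """Return the number of epistemic markers per 100 whitespace-separated tokens."""
    tokens = trace.split()
    if not tokens:
        return 0.0
    lowered = trace.lower()
    # Count whole-word (or whole-phrase) occurrences of each marker.
    hits = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", lowered))
        for marker in markers
    )
    return 100.0 * hits / len(tokens)


# Two toy traces written for illustration; they are not data from the paper.
confident_trace = "The answer is 12 because 3 times 4 equals 12."
hedged_trace = "Maybe 3 times 4 is 14? Wait, let me check: 3 times 4 equals 12."

print(marker_density(confident_trace))
print(marker_density(hedged_trace))
```

Comparing this density between traces from a richly conditioned teacher and a less privileged one would give a rough, surface-level proxy for how much of the uncertainty channel survives into the distillation data.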
This produces a counter-intuitive recommendation for distillation pipeline design. Standard intuition: give the teacher as much information as possible so it produces high-quality traces. The finding inverts this: the teacher's traces become too clean, optimized for cases where confidence is warranted, missing the uncertainty markers that help the student handle cases where confidence is not warranted.
A more robust approach lets the teacher operate with less privileged context, producing traces that include the natural pauses and self-corrections of reasoning under uncertainty. The resulting traces are messier, longer, less obviously "polished" — but they preserve the corrective signal that helps OOD performance.
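A minimal sketch of what "less privileged context" could look like at the prompt-construction step. The `Problem` fields, the `build_teacher_prompt` function, and the prompt wording are all hypothetical names chosen for illustration; the source note does not specify how the teacher is conditioned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    question: str
    # Privileged fields: hypothetical names for information the student
    # will not have at inference time.
    reference_solution: Optional[str] = None
    verifier_feedback: Optional[str] = None


def build_teacher_prompt(problem: Problem, privileged: bool) -> str:
    """Assemble the teacher prompt, optionally including privileged context."""
    parts = [f"Problem: {problem.question}"]
    if privileged:
        if problem.reference_solution is not None:
            parts.append(f"Reference solution: {problem.reference_solution}")
        if problem.verifier_feedback is not None:
            parts.append(f"Verifier feedback: {problem.verifier_feedback}")
    parts.append("Reason step by step, then state the final answer.")
    return "\n".join(parts)


problem = Problem("What is 3 * 4?", reference_solution="12")
# With privileged=False the teacher sees only the question, so its trace
# has to be produced under the same uncertainty the student will face.
print(build_teacher_prompt(problem, privileged=False))
```

Keeping the privileged fields on the record but gating them with a single flag makes it easy to generate both kinds of traces from the same problems and compare them.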
The deeper observation is that style transfer is part of distillation, not just correctness transfer. The student inherits the teacher's reasoning style, including how the teacher handles or hides uncertainty. Teacher conditioning shapes style, and style shapes generalization. Distillation pipelines that optimize teacher conditioning for correctness alone can end up optimizing against generalization without anyone noticing.
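If teacher conditioning is selected on in-domain correctness alone, the cost described above stays invisible. A small sketch, assuming per-example graded results are available for both an in-domain and an out-of-distribution evaluation set (the function names are hypothetical, introduced only for illustration):

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of graded examples marked correct."""
    if not correct:
        raise ValueError("need at least one graded example")
    return sum(correct) / len(correct)


def generalization_gap(in_domain: Sequence[bool], ood: Sequence[bool]) -> float:
    """In-domain accuracy minus out-of-distribution accuracy for one student."""
    return accuracy(in_domain) - accuracy(ood)
```

Reporting this gap for each teacher-conditioning variant, alongside in-domain accuracy, is one way to check whether a gain in correctness was bought with a loss in robustness.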
Related concepts in this collection
- Does self-distillation harm mathematical reasoning performance?
  Self-distillation usually improves models while shortening outputs, but mathematical reasoning shows a puzzling exception: performance drops up to 40%. What mechanism explains this counter-intuitive degradation?
  same paper, the mechanism this trade-off produces
- Can post-training objectives preserve reasoning style alongside correctness?
  Even mathematically sound training objectives may suppress reasoning behaviors like uncertainty expression without penalizing them. Does optimizing for answer correctness inadvertently degrade the stylistic features that enable generalization?
  same paper, the methodology implication
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
  adjacent: structural coherence drives learning more than content; here, structural uncertainty signals matter
- Can agents learn better from their failures than successes?
  Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
  partial tension: failures provide useful distillation signal; richer context may suppress visible failure modes
Original note title
richer teacher context produces more confident shorter student traces — fast in-domain optimization at the cost of OOD robustness