Does richer teacher context hurt student generalization?

When teachers are given more information during distillation, they produce confident but brittle students. Does this trade-off between in-domain wins and out-of-distribution robustness hold across different task distributions?

Note · 2026-05-18 · sourced from Training Fine Tuning

The self-distillation degradation finding has a clean causal story. When the teacher model is conditioned on richer information (the correct solution, access to a verifier, additional context that the student will not have at inference time), the reasoning trajectories it produces become more confident and more concise. The teacher knows the answer, so it does not bother to express uncertainty mid-trace. The student, distilling toward these traces, inherits the confident style.

The pattern depends on two factors: information richness and task coverage. Richer teacher context → confident traces → suppressed epistemic verbalization → faster in-domain optimization. When task coverage is limited, the in-domain wins are real and visible; the model gets better at the narrow distribution it was trained on. As task coverage broadens, the missing uncertainty channel becomes a liability: out-of-distribution (OOD) problems are where expressing uncertainty and adjusting course pays off, and the confident-style student no longer has access to that adjustment mechanism.
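One way to make "suppressed epistemic verbalization" measurable is to count hedging phrases per 100 words of a trace. The sketch below is only an illustration: the marker list and the function name `epistemic_marker_rate` are assumptions made here, not something the note or its source defines, and a real pipeline would likely need a more careful lexicon or a classifier.

```python
import re

# Hypothetical marker list, chosen for illustration only; the note does not define one.
EPISTEMIC_MARKERS = ("wait", "hmm", "maybe", "perhaps", "not sure", "let me check")


def epistemic_marker_rate(trace: str, markers=EPISTEMIC_MARKERS) -> float:
    """Return hedging-marker occurrences per 100 words (0.0 for an empty trace)."""
    # Lowercase and keep alphabetic tokens so punctuation does not block matches.
    words = re.findall(r"[a-z']+", trace.lower())
    if not words:
        return 0.0
    text = " ".join(words)
    # Count whole-phrase matches for each marker.
    hits = sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", text))
        for marker in markers
    )
    return 100.0 * hits / len(words)


confident_trace = "The answer is 12 because 3 times 4 is 12."
hedged_trace = "Hmm, maybe 3 times 4 is 12. Wait, let me check: yes."

# A trace written under uncertainty should score higher than a confident one.
print(epistemic_marker_rate(confident_trace), epistemic_marker_rate(hedged_trace))
```

Comparing this rate between traces generated with and without privileged teacher context gives a rough check on whether the uncertainty channel is being squeezed out before any student is trained.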

This produces a counter-intuitive recommendation for distillation pipeline design. Standard intuition: give the teacher as much information as possible so it produces high-quality traces. The finding inverts this: the teacher's traces become too clean, optimized for cases where confidence is warranted, missing the uncertainty markers that help the student handle cases where confidence is not warranted.

A more robust approach lets the teacher operate with less privileged context, producing traces that include the natural pauses and self-corrections of reasoning under uncertainty. The resulting traces are messier, longer, less obviously "polished" — but they preserve the corrective signal that helps OOD performance.
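A minimal sketch of what "less privileged context" could look like when building teacher prompts, assuming a hypothetical `Task` record and prompt layout (neither comes from the note). Privileged fields are left out by default, so the teacher has to reason toward the answer rather than from it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Hypothetical task record; the field names are illustrative assumptions."""
    problem: str
    # Privileged information: not available to the student at inference time.
    reference_solution: Optional[str] = None


def build_teacher_prompt(task: Task, include_privileged: bool = False) -> str:
    """Assemble the teacher prompt, withholding privileged context by default."""
    parts = [f"Problem:\n{task.problem}"]
    if include_privileged and task.reference_solution is not None:
        parts.append(f"Reference solution (privileged):\n{task.reference_solution}")
    parts.append("Reason step by step, then state the final answer.")
    return "\n\n".join(parts)


task = Task(problem="What is 3 * 4?", reference_solution="12")

# Default: the teacher sees only what the student will see.
limited_prompt = build_teacher_prompt(task)

# Opt-in: the richer conditioning that the note argues yields over-clean traces.
rich_prompt = build_teacher_prompt(task, include_privileged=True)
```

Making the privileged path opt-in keeps the two conditions easy to compare: the same tasks can be run both ways and the resulting students evaluated on held-out, out-of-distribution problems.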

The deeper observation is that style transfer is part of distillation, not just correctness transfer. The student inherits the teacher's reasoning style, including how the teacher handles or hides uncertainty. Teacher conditioning shapes style, and style shapes generalization. Distillation pipelines that optimize teacher conditioning for correctness alone optimize against generalization without realizing it.

Original note title

richer teacher context produces more confident shorter student traces — fast in-domain optimization at the cost of OOD robustness