Does self-distillation harm mathematical reasoning performance?

Self-distillation usually improves models while shortening outputs, but mathematical reasoning shows a puzzling exception: performance can drop by up to 40%. What mechanism explains this counterintuitive degradation?

Note · 2026-05-18 · sourced from Training Fine Tuning

Self-distillation has emerged as an effective post-training paradigm: it usually improves performance while shortening reasoning traces, which is a clean win. The paper "Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?" documents a counter-finding: in mathematical reasoning, self-distillation can shorten responses while degrading performance, with drops of up to 40% on Qwen3, DeepSeek-Distill-Qwen, and Olmo3.

The mechanism is the suppression of epistemic verbalization. Strong reasoning models such as DeepSeek-R1 frequently express uncertainty mid-trace with tokens like "Wait" or "Hmm." These tokens look like noise: they do not directly advance the argument, and they add length without obvious content. The standard intuition is that distilling toward shorter, more confident traces should be an improvement: same answers, less verbosity, lower inference cost.

The empirical finding contradicts this. Removing the uncertainty tokens removes the signal that a reasoning path may be flawed. When the student model is distilled away from epistemic verbalization, it loses the ability to flag and self-correct its own faulty reasoning paths. The shorter, more confident traces are correlated with worse performance on out-of-distribution problems where the model would have benefited from pausing to verbalize doubt.

This reframes "Wait" and "Hmm" tokens. They are not stylistic noise to be optimized away; they mark a corrective mechanism, the surface signature of the model noticing that something is off and adjusting course. Compressing the trace by removing them removes an internal control structure.
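To make the uncertainty channel observable, one option is to count epistemic markers per trace before and after distillation. The following is a minimal sketch, not the paper's measurement: the marker list contains only the two tokens this note names, and the function name `marker_stats` and the per-100-words normalization are assumptions made for illustration.

```python
import re

# Assumed marker list: only the two tokens named in this note.
EPISTEMIC_MARKERS = ("wait", "hmm")

# Word boundaries keep "waiting" or "Hmmm" from matching.
_MARKER_RE = re.compile(r"\b(" + "|".join(EPISTEMIC_MARKERS) + r")\b", re.IGNORECASE)


def marker_stats(trace: str) -> dict:
    """Count epistemic markers in a reasoning trace (hypothetical helper)."""
    n_words = len(trace.split())          # whitespace-delimited word count
    n_markers = len(_MARKER_RE.findall(trace))
    rate = 100.0 * n_markers / n_words if n_words else 0.0
    return {
        "words": n_words,
        "markers": n_markers,
        "markers_per_100_words": rate,
    }


if __name__ == "__main__":
    before = "Try x = 3. Wait, that gives 10, not 9. Hmm, so x = 2."
    after = "Try x = 3. So x = 3."
    print(marker_stats(before))
    print(marker_stats(after))
```

Comparing the rate for the same prompts under the base model and the distilled student would show whether a shorter trace came from trimming filler or from dropping the markers themselves.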

The implication for self-distillation design is sharp. Distillation from richly conditioned teachers produces confident, concise students. Such students do well on in-distribution problems, where confidence is warranted. They fail on out-of-distribution problems, where uncertainty would have been the right response. The distillation regime needs to preserve the uncertainty channel, not just optimize for shorter correct outputs.
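One way to picture "preserving the uncertainty channel" is as a constraint on which traces become distillation targets. The sketch below is an assumption for illustration only and is not the method of the paper: instead of taking the shortest correct trace, it takes the shortest correct trace whose marker rate stays above a fraction of the base model's rate. `Candidate`, `pick_target`, `baseline_rate`, and `keep_fraction` are all hypothetical names, and the marker list is limited to the two tokens this note names.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed marker list: only the two tokens named in this note.
MARKER_RE = re.compile(r"\b(wait|hmm)\b", re.IGNORECASE)


@dataclass
class Candidate:
    trace: str        # a sampled reasoning trace
    is_correct: bool  # whether its final answer was verified as correct


def marker_rate(trace: str) -> float:
    """Epistemic markers per whitespace-delimited word."""
    n_words = len(trace.split())
    return len(MARKER_RE.findall(trace)) / n_words if n_words else 0.0


def pick_target(
    candidates: Iterable[Candidate],
    baseline_rate: float,
    keep_fraction: float = 0.5,
) -> Optional[Candidate]:
    """Return the shortest correct trace that keeps enough markers, else None.

    baseline_rate is the marker rate of the model before distillation;
    keep_fraction is an illustrative knob, not a tuned value.
    """
    floor = keep_fraction * baseline_rate
    eligible = [
        c for c in candidates
        if c.is_correct and marker_rate(c.trace) >= floor
    ]
    return min(eligible, key=lambda c: len(c.trace.split()), default=None)
```

With `baseline_rate=0.0` the floor vanishes and the selection reduces to "shortest correct trace", which is the regime the note argues against.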

Original note title

self-distillation can degrade reasoning by suppressing epistemic verbalization — Wait and Hmm tokens carry uncertainty signal not noise