Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization, that is, the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression. This enables rapid in-domain optimization when task coverage is limited, but it harms out-of-distribution (OOD) performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.
Our analysis reveals a consistent pattern: the more informative the context provided to the teacher, the more concise and confident the resulting reasoning becomes, with substantially fewer expressions of uncertainty and, particularly in math reasoning, degraded performance. We trace this effect to the suppression of epistemic verbalization, whereby models explicitly verbalize and incorporate uncertainty during reasoning. Strong reasoning models such as DeepSeek-R1 frequently express uncertainty using tokens like "Wait" or "Hmm". Although these expressions may not directly advance the reasoning, removing them discards important signals that a reasoning path may be flawed, leading to significant performance drops.
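As a rough illustration of how expressions of uncertainty in a reasoning trace might be quantified, the following minimal sketch counts marker words per 100 words. Only "Wait" and "Hmm" come from the text above; the remaining markers, the function names, and the per-100-words normalization are assumptions made for illustration and are not the measurement procedure used in this work.

```python
import re

# "wait" and "hmm" are the markers named in the text; the others are
# hypothetical additions for illustration only.
EPISTEMIC_MARKERS = ("wait", "hmm", "perhaps", "maybe", "actually")

# Match whole words only, case-insensitively (so "Waiting" is not counted).
_MARKER_RE = re.compile(r"\b(" + "|".join(EPISTEMIC_MARKERS) + r")\b", re.IGNORECASE)
_WORD_RE = re.compile(r"\w+")


def count_epistemic_markers(trace: str) -> int:
    """Return the number of marker words found in a reasoning trace."""
    return len(_MARKER_RE.findall(trace))


def marker_rate(trace: str) -> float:
    """Return markers per 100 words, or 0.0 for a trace with no words."""
    n_words = len(_WORD_RE.findall(trace))
    if n_words == 0:
        return 0.0
    return 100.0 * count_epistemic_markers(trace) / n_words
```

Comparing `marker_rate` on traces sampled before and after self-distillation would give one crude proxy for the suppression described above. A keyword count cannot distinguish genuine uncertainty from incidental word use, so it should be read as a coarse signal only.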
To systematically understand when and why self-distillation suppresses epistemic verbalization, we conduct a comprehensive empirical study and identify two key factors: information richness and task coverage. When the teacher is conditioned on richer information, such as the correct solution, it produces reasoning trajectories with little expressed uncertainty, encouraging the student to imitate a confident reasoning style that presupposes information unavailable at inference time. When task coverage is limited, this compression enables rapid in-domain optimization. However, as coverage increases, the trained removal of epistemic verbalization can interfere with optimization across diverse tasks, degrading performance on more challenging or previously unseen problems.
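To make the role of the conditioning context concrete, the setup can be sketched in a generic form. The notation below is an assumption introduced for illustration: $x$ is a problem, $c$ is the extra context given only to the teacher (for example, the correct solution), $\pi_\theta$ is the student, and $\pi_{\bar\theta}$ is the same model acting as teacher without gradient flow. The per-token reverse-KL loss shown is one common choice for on-policy distillation and is not necessarily the exact objective used in this work.

```latex
% Generic on-policy self-distillation loss (assumed form, for illustration):
% the student samples y without access to c; the teacher scores the same
% prefix while additionally conditioned on c.
\mathcal{L}(\theta)
  = \mathbb{E}_{(x,c)}\,
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[
      \sum_{t=1}^{|y|}
      \mathrm{KL}\!\left(
        \pi_\theta(\cdot \mid x, y_{<t})
        \,\big\|\,
        \pi_{\bar\theta}(\cdot \mid x, c, y_{<t})
      \right)
    \right]

% Conditioning on additional context cannot increase entropy on average:
H(Y \mid X, C) \;\le\; H(Y \mid X),
\qquad
I(Y; C \mid X) \;=\; H(Y \mid X) - H(Y \mid X, C) \;\ge\; 0 .
```

Under this reading, a more informative $c$ corresponds to a larger $I(Y; C \mid X)$ and a lower-entropy teacher distribution, which is consistent with the observation that the student is pushed toward a confident style that presupposes information it does not have at inference time.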
More broadly, our results show that even when the training objective faithfully guides the model toward correct reasoning traces, the resulting reasoning style can quietly shift in ways that hurt generalization. The suppression of epistemic verbalization is not penalized by standard objectives, yet negatively impacts out-of-distribution (OOD) performance. This suggests that post-training objectives need to account not only for answer correctness, but also for eliciting and preserving uncertainty-aware reasoning behaviors. We believe these findings offer a useful step toward a deeper understanding of reasoning in self-distillation and post-training more broadly.
In this work, we provide an empirical analysis of on-policy self-distillation, motivated by an information-theoretic view of conditioning context richness. Our experiments suggest that the effectiveness of self-distillation is closely tied to how information is provided to the model and how the model expresses uncertainty during reasoning. We observe that self-distillation tends to produce answers with higher confidence and shorter reasoning traces. While this effect enables more compact reasoning and can quickly improve in-domain performance when task coverage is limited, it becomes less effective when task coverage is broad and may even harm OOD performance. Importantly, performance can degrade even when a mathematically sound objective function is designed to elicit correct chain-of-thought (CoT) reasoning. This suggests that the choice of optimization objective alone may not be sufficient for preserving robust reasoning, and that we need to pay closer attention to how training reshapes the model's reasoning behavior, beyond answer correctness.