How does self-distillation degrade reasoning by suppressing uncertainty signals?
This explores why training a model on its own polished outputs (self-distillation) can make it reason worse — specifically by erasing the hesitation markers and confidence cues that flag when reasoning is going wrong.
The corpus traces the damage to a single mechanism: the loss of the uncertainty signals a model needs to catch its own mistakes. The core finding, reported for mathematical reasoning, is that self-distillation strips out epistemic markers like "Wait" and "Hmm", the tokens where a model pauses and reconsiders a flawed path (see: Does self-distillation harm mathematical reasoning performance?). Those pauses look like noise if you optimize for confident, concise answers, but they're load-bearing: they enable self-correction on unfamiliar (out-of-distribution) problems. Remove them and you trade robustness for fluent overconfidence.
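One way to make "stripping out epistemic markers" measurable is to count how often such markers appear in a trace. The sketch below is only an illustration: the function name `marker_rate`, the two-word marker list, and the example traces are assumptions made here, not something the cited notes specify.

```python
import re

# Assumed marker list: the source names only "Wait" and "Hmm".
EPISTEMIC_MARKERS = ("wait", "hmm")


def marker_rate(trace: str, markers=EPISTEMIC_MARKERS) -> float:
    """Return the number of epistemic markers per 100 words of a trace."""
    # Alphabetic words only, lowercased; digits and punctuation are ignored.
    words = re.findall(r"[A-Za-z']+", trace.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in markers)
    return 100.0 * hits / len(words)


# Invented traces: one that hesitates, one that has been "polished".
before = "Let x = 3. Wait, that ignores the sign. Hmm, try x = -3 instead."
after = "Let x = 3. Therefore the answer is 9."

print(marker_rate(before), marker_rate(after))
```

A drop in this rate between a model's original traces and its self-distilled traces would be consistent with the mechanism described above, although a word count alone cannot show that self-correction was lost.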
What makes this more than a one-paper curiosity is that the same trade-off shows up under a different name in teacher–student distillation. When a teacher is conditioned on the correct answer and verifier output, it produces crisp, confident traces, and students inherit that confident style, gaining in-domain accuracy while losing the epistemic caution that generalization to hard, novel problems requires (see: Does richer teacher context hurt student generalization?). Self-distillation is essentially the model becoming its own over-confident teacher. The degradation isn't about losing knowledge; it's about losing the *expression* of doubt.
The deeper puzzle is that the uncertainty information doesn't fully disappear; it stops being verbalized. Post-trained models produce 3–4× lower output entropy on their own generations, driven by an internal representation of input surprise that quietly shapes the output distribution without ever surfacing as a word (see: Why do models produce less uncertain outputs on their own text?). So self-distillation pushes uncertainty from the visible reasoning trace down into silent internal states, where it can no longer trigger the explicit "wait, let me reconsider" behavior that self-correction depends on. There's a related architectural hint that models naturally *do* mark difficulty: hidden states sparsify under out-of-distribution load as an adaptive filter (see: Do language models sparsify their activations under difficult tasks?), which suggests the uncertainty machinery exists but gets muted rather than removed.
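For readers unfamiliar with the quantity involved, output entropy here can be read as the Shannon entropy of each next-token distribution, H = -sum_i p_i log p_i, averaged over a sequence. The following is a minimal sketch under that assumption; the helper names and the toy four-token distributions are invented for illustration and are not taken from the cited note.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    # Zero-probability entries contribute nothing and are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions):
    """Average per-token entropy over a generated sequence."""
    if not distributions:
        return 0.0
    return sum(token_entropy(d) for d in distributions) / len(distributions)


# Toy distributions over a 4-token vocabulary (made-up numbers).
hesitant = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
confident = [[0.97, 0.01, 0.01, 0.01], [0.9, 0.05, 0.03, 0.02]]

ratio = mean_entropy(hesitant) / mean_entropy(confident)
print(f"hesitant/confident entropy ratio: {ratio:.2f}")
```

Comparing this average on a model's own generations against text it did not write is one plausible way to observe the kind of gap described above. Note that the drop is silent: nothing in the generated words has to change for the distribution to sharpen.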
The constructive flip side: if suppressing confidence signals breaks reasoning, surfacing them can repair it. Using a model's own answer-span confidence as a reward signal restores calibration while strengthening step-by-step reasoning, the inverse of the distillation pathology (see: Can model confidence work as a reward signal for reasoning?). Confidence variance can even be read live to steer between overthinking and underthinking without any retraining (see: Can confidence patterns reveal overthinking versus underthinking?), and small models explicitly trained to hold and act on uncertainty (by abstaining when unsure) can match models ten times their size on conversation forecasting (see: Can models learn to abstain when uncertain about predictions?). Calibration, in other words, is a trainable capability that standard pipelines leave underdeveloped, and one that self-distillation actively erodes.
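The idea of ranking traces by answer-span confidence to create synthetic preferences can be sketched roughly as follows. This is not the cited method's implementation: defining confidence as the mean token probability over the answer span, the dictionary layout of a trace, and the function names are all assumptions made for this example.

```python
import math
from itertools import combinations


def answer_confidence(token_logprobs, answer_start):
    """Mean token probability over the answer span (an assumed definition)."""
    span = token_logprobs[answer_start:]
    if not span:
        raise ValueError("answer span is empty")
    return sum(math.exp(lp) for lp in span) / len(span)


def preference_pairs(traces):
    """Rank traces by answer confidence; return (preferred_id, rejected_id) pairs."""
    scored = sorted(
        ((answer_confidence(t["logprobs"], t["answer_start"]), t["id"]) for t in traces),
        reverse=True,
    )
    # Every strictly higher-confidence trace is preferred over a lower one.
    return [
        (hi_id, lo_id)
        for (hi, hi_id), (lo, lo_id) in combinations(scored, 2)
        if hi > lo
    ]


# Toy traces with invented log-probabilities; "answer_start" marks where
# the final answer tokens begin within each trace.
traces = [
    {"id": "a", "logprobs": [-0.1, -0.2, math.log(0.9), math.log(0.8)], "answer_start": 2},
    {"id": "b", "logprobs": [-0.1, math.log(0.5), math.log(0.6)], "answer_start": 1},
    {"id": "c", "logprobs": [math.log(0.3)], "answer_start": 0},
]

pairs = preference_pairs(traces)
print(pairs)
```

The resulting pairs could feed any standard preference-optimization step. The design point worth noticing is that no human label or external verifier appears anywhere in the pipeline; the only signal is the model's own confidence in its answer tokens.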
The surprising takeaway for a curious reader: longer or more confident reasoning is not automatically better reasoning. Accuracy peaks at an intermediate chain length and declines past it (see: Why does chain of thought accuracy eventually decline with length?), and a small minority of high-entropy "forking" tokens (roughly 20% of tokens), the very moments of expressed uncertainty, carry most of the learning signal in reasoning models (see: Do high-entropy tokens drive reasoning model improvements?). Self-distillation's quiet harm is that it smooths away exactly those forks, leaving a model that sounds more certain and reasons less safely.
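To illustrate what "training only on forking tokens" could look like mechanically, the sketch below builds a boolean mask over the highest-entropy positions in a sequence. The 20% default mirrors the figure quoted in the source notes; the function name, the tie-breaking rule, and the toy entropy values are assumptions for illustration only.

```python
def forking_token_mask(entropies, top_fraction=0.2):
    """Mark the highest-entropy positions; True means 'keep in the loss'."""
    if not entropies:
        return []
    # Keep at least one token, otherwise round to the nearest count.
    k = max(1, round(top_fraction * len(entropies)))
    # Indices of the k largest entropies; ties go to the earlier position.
    ranked = sorted(range(len(entropies)), key=lambda i: (-entropies[i], i))
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]


# Invented per-token entropies for a 10-token trace.
entropies = [0.05, 1.9, 0.1, 0.02, 0.3, 2.2, 0.01, 0.04, 0.2, 0.08]
mask = forking_token_mask(entropies)
print(mask)
```

In a training loop, a mask like this would typically multiply the per-token loss so that gradients flow only through the selected positions. A self-distilled model whose traces contain fewer high-entropy forks would, under this view, offer fewer positions worth selecting.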
Sources (9 notes)
Self-distillation reduces performance in mathematical reasoning by eliminating epistemic markers like "Wait" and "Hmm" tokens that flag flawed reasoning paths. These tokens enable self-correction on out-of-distribution problems, so removing them sacrifices robustness for confident brevity.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Post-trained models produce 3–4× lower output entropy on their own generations, driven by an internal representation of input surprise that causally modulates confidence. This implicit self-recognition signal appears without being verbalized, encoded directly in the output distribution.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.