Why does self-distillation suppress epistemic verbalization in student models?

This note explores why training a model on its own outputs (self-distillation) tends to strip out the verbal hesitation markers, such as "Wait," "Hmm," and second-guessing, that signal uncertainty. The short answer from the corpus: distillation optimizes for confident, concise traces, and the tokens that express doubt are exactly what get smoothed away. In math reasoning, removing those epistemic markers measurably degrades performance, because they aren't filler: they're the model flagging a flawed reasoning path so it can self-correct. Cut them, and you trade robustness on hard or out-of-distribution problems for fluent brevity on easy ones (see "Does self-distillation harm mathematical reasoning performance?").
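One way to make "the markers get smoothed away" concrete is to count them. The sketch below is a minimal illustration, not a method taken from the notes: the marker list contains only the two tokens the text names ("Wait" and "Hmm"), the function name `marker_rate` is hypothetical, and the two traces are made-up strings rather than real model output.

```python
import re

# Assumption: only the two markers named in the text; a real study
# would need a broader, validated list.
EPISTEMIC_MARKERS = ("wait", "hmm")


def marker_rate(trace: str, markers=EPISTEMIC_MARKERS) -> float:
    """Return epistemic markers per 100 whitespace-delimited tokens."""
    tokens = trace.split()
    if not tokens:
        return 0.0
    # Whole-word, case-insensitive match so "Wait," counts but "awaited" does not.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(m) for m in markers) + r")\b",
        re.IGNORECASE,
    )
    hits = len(pattern.findall(trace))
    return 100.0 * hits / len(tokens)


# Made-up traces for illustration only.
before_distillation = (
    "Compute 12 * 13. Wait, that is 156, not 146. "
    "Hmm, let me recheck: 12 * 13 = 156."
)
after_distillation = "Compute 12 * 13. The answer is 156."

if __name__ == "__main__":
    print("before:", round(marker_rate(before_distillation), 2))
    print("after: ", round(marker_rate(after_distillation), 2))
```

Comparing this rate across checkpoints would only show that the surface tokens disappeared. It says nothing on its own about whether self-correction ability went with them.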

The deeper mechanism shows up when you look at where this confident style comes from. Teachers that are conditioned on the correct answer (or on a verifier's output) produce traces that are short and sure of themselves. There was never any doubt to express, because the answer was known upfront. Students inherit that style wholesale, including its absence of caution. So suppression isn't a bug in self-distillation so much as a faithful copy of a teacher who had no reason to hesitate (see "Does richer teacher context hurt student generalization?"). The student learns the *surface form* of confidence without the underlying knowledge state that would justify it.

Here's the thing you might not expect: the verbalized doubt and the actual self-knowledge are separable. Models carry internal mechanisms, entity-recognition features that track whether they actually know a fact, that causally steer hallucination and refusal, and these persist through fine-tuning (see "Do models know what they don't know?"). But other work suggests reasoning itself can run in latent space without being spoken aloud at all, implying verbalization is partly a *training artifact* rather than a hard requirement of thinking (see "Can models reason without generating visible thinking tokens?"). Put those together and self-distillation looks like it's pruning the externalized trace while leaving the internal signal stranded: the model may still "know" it's unsure, but it no longer says so, and saying so was what enabled mid-stream correction.

That matters because a model's spoken self-reports are already a shaky proxy for its real state. LLM self-reports largely echo training distributions rather than genuine introspection (see "Can language models actually introspect about their own states?"), and models lack stable self-knowledge: they shift beliefs under conversational pressure, while users over-trust their confident outputs regardless of accuracy (see "How well do language models understand their own knowledge?"). Self-distillation pushes hard in the dangerous direction here: it makes the output *more* confident-sounding while making it a *worse* indicator of actual certainty. The reader who came in worried about a niche math-reasoning result should leave seeing the broader hazard: every training step that rewards confident brevity is quietly widening the gap between how sure a model sounds and how sure it has any right to be.
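The "gap between how sure a model sounds and how sure it has any right to be" can be written down as a simple number: mean stated confidence minus observed accuracy. The sketch below is only an illustration under assumptions. The notes do not say how a confidence score would be obtained, so `confidences` is a hypothetical per-answer score in [0, 1], and the name `overconfidence_gap` is invented here.

```python
from typing import Sequence


def overconfidence_gap(
    confidences: Sequence[float], correct: Sequence[bool]
) -> float:
    """Mean stated confidence minus accuracy.

    Positive values mean the answers sound surer than they are right.
    Assumption: each confidence is a score in [0, 1] obtained by some
    external means (not specified in the source notes).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one answer")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(1 for c in correct if c) / len(correct)
    return mean_confidence - accuracy


if __name__ == "__main__":
    # Made-up numbers: four answers, all stated at 0.9, half of them right.
    gap = overconfidence_gap([0.9, 0.9, 0.9, 0.9], [True, False, True, False])
    print(round(gap, 2))
```

A single average like this hides where the miscalibration sits. A binned measure would be more informative, but the difference is enough to show the direction of the hazard the paragraph describes.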

Sources (6 notes)

Does self-distillation harm mathematical reasoning performance?

Self-distillation reduces performance in mathematical reasoning by eliminating epistemic markers like "Wait" and "Hmm" tokens that flag flawed reasoning paths. These tokens enable self-correction on out-of-distribution problems, so removing them sacrifices robustness for confident brevity.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

How well do language models understand their own knowledge?

LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.
