How does self-distillation differ from standard fine-tuning approaches?
This explores what makes self-distillation — training a model on its own (or a teacher's) generated outputs — behave differently from ordinary supervised fine-tuning on external data, and what that difference costs.
This reads the question as asking what changes when the training signal comes from a model's own generations rather than from an external dataset. The corpus suggests the key difference isn't the mechanics (it's still gradient descent on token sequences) but *what gets quietly removed* in the process. The sharpest finding is that self-distillation can degrade mathematical reasoning by stripping out epistemic markers: the "Wait" and "Hmm" tokens that flag a shaky reasoning path. Standard fine-tuning on diverse human data tends to preserve those hesitation signals; self-distillation rewards confident brevity, and in doing so removes the very tokens that let a model catch its own mistakes on out-of-distribution problems (Does self-distillation harm mathematical reasoning performance?).
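One rough way to make the "stripped epistemic markers" claim checkable is to count how often those tokens appear in a reasoning trace before and after distillation. The sketch below is illustrative only: the function name `marker_rate`, the two example traces, and the decision to treat "wait" and "hmm" as the entire marker set are assumptions, not the method used in the cited note.

```python
import re

# "Wait" and "Hmm" are the markers named in the text above.
# Treating them as the complete marker set is an assumption for this sketch.
EPISTEMIC_MARKERS = ("wait", "hmm")


def marker_rate(trace: str) -> float:
    """Return the number of epistemic markers per 100 words in a trace."""
    words = re.findall(r"[A-Za-z']+", trace.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in EPISTEMIC_MARKERS)
    return 100.0 * hits / len(words)


# Hypothetical traces: one exploratory, one distilled to confident brevity.
exploratory = "Let x = 3. Wait, that breaks the constraint. Hmm, try x = 2 instead. So x = 2."
distilled = "Let x = 2. So x = 2."

print(f"exploratory: {marker_rate(exploratory):.1f} markers per 100 words")
print(f"distilled:   {marker_rate(distilled):.1f} markers per 100 words")
```

A drop in this rate between a base model's traces and its self-distilled traces would be consistent with the effect described here, though it would not by itself show that reasoning quality changed.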
The same trade shows up from the teacher's side. When a teacher is conditioned on the correct answer or a verifier's output, it produces shorter, more confident traces, and the student inherits that confidence. The student gains in-domain sharpness while losing the epistemic caution needed for problems unlike anything it trained on (Does richer teacher context hurt student generalization?). So the distinction is less "self vs. external data" and more "distilled-confident vs. exploratory-uncertain." Self-distillation compresses the distribution toward a single confident mode; that's a feature for speed and a bug for robustness.
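The "compresses toward a single confident mode" point can be pictured with a toy distribution over three candidate reasoning paths. The sketch below models the compression as raising each probability to a power and renormalizing, which is a simplifying assumption chosen for illustration; the cited notes do not say distillation acts this way, and the probabilities and the `sharpen` helper are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; lower means the mass sits on fewer options."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sharpen(probs, power=2.0):
    """Raise each probability to `power` and renormalize.

    With power > 1 the most likely option gains mass at the expense of the rest.
    """
    weights = [p ** power for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical distribution over three reasoning paths before distillation.
exploratory = [0.5, 0.3, 0.2]
confident = sharpen(exploratory, power=4.0)

print("entropy before:", round(entropy(exploratory), 3))
print("entropy after: ", round(entropy(confident), 3))
```

The lower entropy after sharpening is the "feature for speed": the model commits sooner. The alternatives that lost probability mass are the "bug for robustness", since they are what the model would have needed when the dominant path is wrong.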
There's a deeper structural reason these self-referential approaches behave differently. A model training on its own output is working inside the generation-verification gap: it can only reliably improve where it can already verify, so without an external check it tends to amplify what it already believes rather than learn anything new (What stops large language models from improving themselves?). This is the same loop that makes models over-trust answers they generated themselves (Why do models trust their own generated answers?). It is also why naively fine-tuning on self-generated correction traces collapses: the errors in the offline training traces don't match the errors the model makes at test time, so it learns one canned correction move instead of genuine self-correction (Why does self-correction training on offline data fail?).
The interesting wrinkle is that self-training isn't doomed; it just needs an external filter standing in for the missing verifier. Transformers that generate solutions, *keep only the correct ones*, and retrain on those achieve exponential length generalization, going from 10-digit to 100-digit addition over repeated rounds (Can transformers improve exponentially by learning from their own correct solutions?). Asymmetric self-play does the same trick without any human data by pairing a problem-proposer with a solver and using majority-vote agreement as the verification signal (Can language models improve themselves without any external training data?). The pattern: self-distillation differs from standard fine-tuning precisely by lacking an independent correctness signal, and it works only when you reintroduce one.
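The two filters described above, keeping only verified solutions and accepting an answer only when sampled attempts agree, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the implementation from either cited note: the helper names, the addition verifier, the hand-written samples, and the `min_agreement` threshold are all hypothetical, and the retraining step itself is left out.

```python
from collections import Counter


def keep_verified(samples, is_correct):
    """Keep only the (problem, answer) pairs that the verifier accepts."""
    return [(problem, answer) for problem, answer in samples
            if is_correct(problem, answer)]


def majority_vote(answers, min_agreement=0.5):
    """Return the most common answer if its share exceeds min_agreement.

    Returns None when no answer is agreed on strongly enough, so the
    problem contributes no training signal. The threshold is an assumption.
    """
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) > min_agreement else None


def addition_verifier(problem, answer):
    """Toy stand-in for a correctness check: exact integer addition."""
    a, b = problem
    return a + b == answer


# Hypothetical model outputs: two attempts at each of two addition problems.
samples = [((12, 30), 42), ((12, 30), 41), ((99, 1), 100), ((99, 1), 910)]
kept = keep_verified(samples, addition_verifier)
print(kept)

# With no ground truth available, agreement across samples plays the verifier's role.
print(majority_vote([42, 42, 41]))
print(majority_vote([1, 2, 3]))
```

In a full loop, `kept` (or the problems whose majority answer is not `None`) would become the next round's training set. The design point is that the filter, not the generator, decides what the model is allowed to learn from.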
Worth knowing alongside this: even ordinary RL fine-tuning has a hidden self-narrowing tendency. It collapses onto a single dominant pretraining format within the first epoch (Does RL training collapse format diversity in pretrained models?) and can sharpen memorized templates rather than install real reasoning procedures (Do fine-tuned language models actually learn optimization procedures?). So the "confidence-narrows-diversity" risk that self-distillation makes vivid isn't unique to it; it's a tax on any training loop that optimizes against signals the model can already produce.
Sources (9 notes)
Self-distillation reduces performance in mathematical reasoning by eliminating epistemic markers like "Wait" and "Hmm" tokens that flag flawed reasoning paths. These tokens enable self-correction on out-of-distribution problems, so removing them sacrifices robustness for confident brevity.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.