How do failure examples improve distillation compared to successful trajectories alone?
This explores why teaching a model from its mistakes — not just its wins — produces a stronger student, and what specifically failures add that clean successful trajectories can't.
This reads the question as being about distillation in the broad sense, meaning the transfer of reasoning from a teacher (or from past attempts) to a student, and asks what failures contribute that a diet of correct trajectories leaves out. The short version the corpus keeps circling: successes teach you the move, failures teach you the boundary, and a student trained only inside the boundary doesn't know where it is.
The most direct evidence is that failures and successes carry *different kinds* of information and should be processed differently. ReasoningBank shows that storing strategy-level hints from both self-judged wins and losses beats success-only memory, and also beats dumping raw trajectories (see "Can agents learn better from their failures than successes?"). SkillRL makes the asymmetry explicit: keep successes as concrete demonstrations, but abstract failures into lessons, because uniformly consolidating both actually degrades learning (see "Should successful and failed episodes be processed differently?"). And GRPO-RoC filters positive trajectories hard for quality while deliberately *preserving* diverse failures as negative signal. Positives selected only for a correct final answer quietly teach the model to tolerate the errors hiding inside otherwise-correct code traces; with the quality filter in place, a 14B model reached frontier math performance (see "Why do correct code trajectories teach models to tolerate errors?").
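The asymmetric selection described above can be sketched in a few lines. This is a minimal illustration, not the GRPO-RoC implementation: the `Trajectory` fields, the `tool_errors` quality proxy, and the `asymmetric_select` function are all hypothetical names introduced here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    text: str
    correct: bool      # final answer verified correct
    tool_errors: int   # hypothetical quality proxy: failed tool/code calls in the trace


def asymmetric_select(trajs, n_pos, n_neg, seed=0):
    """Keep the cleanest positives, but a uniform random sample of negatives."""
    rng = random.Random(seed)
    positives = [t for t in trajs if t.correct]
    negatives = [t for t in trajs if not t.correct]
    # Positives: strict quality filter, fewest in-trace errors first.
    kept_pos = sorted(positives, key=lambda t: t.tool_errors)[:n_pos]
    # Negatives: no quality filter, so varied failure modes survive.
    kept_neg = rng.sample(negatives, min(n_neg, len(negatives)))
    return kept_pos, kept_neg
```

The design choice worth noticing is that the two halves are treated differently on purpose: ranking negatives by "quality" would narrow the failure modes the student sees, which is the opposite of what the failures are there for.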
Why do clean trajectories alone fall short? Because the messy parts (the wrong turns, the backtracking, the hesitation) are themselves the skill being transferred. Stream of Search pretraining on full search processes, mistakes included, achieved 25% higher accuracy than training on optimal trajectories only; the model learns to explore and recover rather than to recite a fixed path (see "Does training on messy search processes improve reasoning?"). The flip side is a warning: self-distillation that polishes traces into confident brevity strips out the "Wait" and "Hmm" tokens that flag a flawed path, and removing those uncertainty markers wrecks robustness on out-of-distribution problems (see "Does self-distillation harm mathematical reasoning performance?"). A richer teacher, one conditioned on the correct answer, produces exactly these overconfident short traces: strong in-domain, brittle outside it (see "Does richer teacher context hurt student generalization?"). Failures are where the model learns epistemic caution; sand them away and you get a fluent student that can't tell when it's wrong.
The interesting catch, the thing you might not have known to ask, is that not all failures are equal. Failures only help when the model could plausibly have succeeded. Training on near-impossible problems backfires: group-relative normalization treats rare accidental wins as high-advantage, and the model learns degenerate shortcuts that then contaminate skills it already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So the value of a failure example is conditional on its being *recoverable* and *legible*. The systems that win don't just include failures; they route each one through a decision: what does this teach, and is it teachable? AutoResearchClaw's pivot-or-refine loop turns every failure into a structured next-attempt signal rather than a dead end (see "Can experiment failures drive progress instead of stopping it?").
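A small worked example shows why a rare accidental win looms so large under group-relative normalization. It assumes the common form of that normalization, reward minus group mean divided by group standard deviation, with binary rewards; the group size of 16 and the function name are illustrative choices, not taken from the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


hard = [1.0] + [0.0] * 15        # one accidental win in 16 rollouts
medium = [1.0] * 8 + [0.0] * 8   # half the group succeeds

adv_hard = group_relative_advantages(hard)
adv_medium = group_relative_advantages(medium)

# Advantage assigned to a winning rollout in each group.
print(round(adv_hard[0], 2), round(adv_medium[0], 2))  # prints: 3.87 1.0
```

With binary rewards and a group success rate p, the winner's advantage works out to sqrt((1 - p) / p), so it grows without bound as p shrinks. A lucky guess on a near-impossible problem is therefore reinforced far more strongly than a solid solution to a problem of moderate difficulty.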
So failures improve distillation along three axes successes can't cover: they mark the boundaries of competence, they preserve the exploration-and-recovery behavior that is the actual reasoning skill, and they keep the uncertainty signals that let a student self-correct off-distribution. The discipline is asymmetry — distill successes as demonstrations, failures as abstracted lessons, and discard the failures that were never winnable in the first place.
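The three-way discipline in that last sentence can be written as a routing rule. This is a sketch under stated assumptions: `pass_rate` (the fraction of sampled attempts that solved the problem) stands in for "winnable", and the `min_pass_rate` threshold of 0.05 is an arbitrary placeholder, not a value from any of the cited notes.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    DEMONSTRATION = "demonstration"  # keep the concrete successful trace
    LESSON = "lesson"                # abstract the failure into a strategy hint
    DISCARD = "discard"              # failure on a problem that was never winnable


@dataclass(frozen=True)
class Episode:
    problem_id: str
    success: bool
    pass_rate: float  # hypothetical winnability proxy for this problem


def route(ep, min_pass_rate=0.05):
    """Route an episode asymmetrically by outcome and winnability."""
    if ep.success:
        return Route.DEMONSTRATION
    if ep.pass_rate >= min_pass_rate:
        return Route.LESSON
    return Route.DISCARD
```

For instance, `route(Episode("p7", success=False, pass_rate=0.3))` returns `Route.LESSON`, while the same failure on a problem with `pass_rate=0.0` returns `Route.DISCARD`. A fuller version might also screen out accidental successes on near-impossible problems, which this sketch does not attempt.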
Sources (8 notes)
ReasoningBank shows that storing strategy-level reasoning hints from both self-judged successes and failures outperforms success-only memory and raw trajectory storage. Coupled with test-time scaling, memory and compute compound rather than substitute, creating a novel scaling law where accuracy improves through cumulative interaction history.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.
Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.
Self-distillation reduces performance in mathematical reasoning by eliminating epistemic markers like "Wait" and "Hmm" tokens that flag flawed reasoning paths. These tokens enable self-correction on out-of-distribution problems, so removing them sacrifices robustness for confident brevity.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
AutoResearchClaw's pivot-or-refine loop routes every failure through a decision process, making failure inform the next attempt rather than stop execution. Component ablation shows this mechanism drives completion and is distinct from reasoning or verification.