Why does self-correction training on offline data fail?
Can language models learn to correct their own mistakes through supervised training on correction examples? This note examines whether distribution mismatch and behavior collapse prevent self-correction from emerging under that training regime.
SCoRe (Self-Correction via Reinforcement Learning) starts from a stark baseline: "there is no major work showing successful intrinsic self-correction via prompting alone." Naively prompting LLMs for self-correction can degrade performance. The question is whether self-correction is an impossible capability or just one that requires the right training approach.
SFT on offline correction traces fails through two mechanisms (a toy sketch of both follows):
Distribution mismatch: the errors made by the data-collection policy (used to generate correction examples) don't match the errors the trained model will make at test time. The model learns corrections for someone else's mistakes, not its own. At test time, it encounters novel error patterns that the correction training never addressed.
Behavior collapse: SFT implicitly gravitates toward a single dominant correction mode — whichever pattern maximizes likelihood across training examples. This mode may work for some error types but fails to generalize. The model learns one way to correct rather than learning when and how to adapt correction strategy to the specific error encountered.
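As a toy illustration of both failure modes, the sketch below simulates a data-collection policy and a learner that make different kinds of mistakes. The error categories, their frequencies, and every helper name here are invented for this note, not taken from the SCoRe paper.

```python
from collections import Counter
import random

random.seed(0)

# Invented error categories: the offline data-collection policy mostly makes
# one kind of mistake, while the trained model makes different ones at test time.
DATA_POLICY_ERRORS = ["sign_flip", "sign_flip", "sign_flip", "dropped_term"]
LEARNER_ERRORS = ["off_by_one", "off_by_one", "unit_mismatch", "unit_mismatch"]

def offline_correction_traces(n: int = 10_000) -> Counter:
    """Errors paired with corrections in the SFT data come from the data policy."""
    return Counter(random.choice(DATA_POLICY_ERRORS) for _ in range(n))

def learner_errors_at_test_time(n: int = 10_000) -> Counter:
    """At deployment, the trained model faces its own error distribution."""
    return Counter(random.choice(LEARNER_ERRORS) for _ in range(n))

traces = offline_correction_traces()
test_errors = learner_errors_at_test_time()

# Distribution mismatch: how many of the model's own errors were ever seen,
# with a matching correction, during training?
covered = sum(count for err, count in test_errors.items() if traces[err] > 0)
print(f"test-time errors covered by offline correction data: "
      f"{covered / sum(test_errors.values()):.0%}")  # 0% in this toy setup

# Behavior collapse: maximum-likelihood training gravitates toward whichever
# correction pattern dominates the offline traces.
dominant_mode, _ = traces.most_common(1)[0]
print(f"correction mode SFT collapses toward: fixing '{dominant_mode}' errors")
```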
SCoRe addresses both by training under the model's own distribution of self-generated correction traces using multi-turn online RL. The model generates a first attempt, then generates a correction attempt, and the RL reward is based on whether the correction improved the outcome. Appropriate regularization steers learning toward genuinely effective correction behavior rather than merely fitting high-reward responses to the training prompts.
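As a rough sketch of the reward structure just described (not the paper's exact formulation), the snippet below combines a correctness reward for the second attempt with an improvement bonus and a KL-style regularization penalty. The function name, the coefficients `alpha` and `beta`, and the unit correctness rewards are assumptions for illustration only.

```python
def two_attempt_reward(correct_first: bool, correct_second: bool,
                       kl_to_reference: float,
                       alpha: float = 1.0, beta: float = 0.1) -> float:
    """Reward for a (first attempt, self-correction) pair under assumed shaping.

    - The base term pays for a correct second attempt.
    - The improvement bonus pays for flipping wrong -> right and penalizes
      flipping right -> wrong, so the policy can't score well by merely
      repeating or degrading its first attempt.
    - The KL penalty regularizes toward a reference model so training doesn't
      collapse onto degenerate high-reward responses.
    """
    r1, r2 = float(correct_first), float(correct_second)
    improvement_bonus = alpha * (r2 - r1)
    return r2 + improvement_bonus - beta * kl_to_reference

print(two_attempt_reward(False, True, kl_to_reference=0.2))   # wrong -> right:  1.98
print(two_attempt_reward(True, True, kl_to_reference=0.2))    # right -> right:  0.98
print(two_attempt_reward(True, False, kl_to_reference=0.2))   # right -> wrong: -1.02
```

The improvement bonus is the piece that encodes "reward based on whether the correction improved the outcome": a policy that simply repeats its first attempt gains nothing from the bonus, and one that degrades a correct first attempt is penalized, so high reward requires genuine correction.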
This connects to a broader pattern: Does supervised fine-tuning actually improve reasoning quality? documents the same SFT-vs-RL dynamic in domain specialization. SFT copies surface patterns; RL trains under the model's actual distribution. For self-correction specifically, this means the model must practice correcting its own mistakes, not someone else's — the same principle that makes deliberate practice effective for humans.
The implication for the self-revision literature is precise: Does self-revision actually improve reasoning in language models? and Does reflection in reasoning models actually correct errors? show that current models can't self-correct. SCoRe suggests this is a training problem, not a capability limit — but fixing it requires abandoning SFT in favor of online RL.
Source: Self Refinement Self Consistency Feedback — SCoRe, "Training Language Models to Self-Correct via Reinforcement Learning" (arXiv:2409.12917)
Related concepts in this collection
- Does self-revision actually improve reasoning in language models? When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability. SCoRe explains why: current models weren't trained to self-correct under their own error distribution.
- Does reflection in reasoning models actually correct errors? When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies. The confirmatory nature of reflection may be an SFT artifact; online RL could produce genuinely corrective reflection.
- Does supervised fine-tuning actually improve reasoning quality? While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making. Same SFT failure mode: surface-pattern copying without distributional grounding.
- Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges. SCoRe is a specific case: RL teaches when and how to correct, not just when to reason.
- Does revising your own reasoning actually help or hurt? Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves. SCoRe addresses the internal-vs-external dilemma: online RL under the model's own error distribution trains the model on its actual mistakes rather than someone else's, converting internal revision from a harmful default into a trained capability.
- Does a model improve by arguing with itself? When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models? SCoRe is designed to prevent degeneration-of-thought: training under the model's own error distribution, with RL rewards for genuine correction, builds the self-correction capacity that untrained self-revision lacks and addresses the confidence-amplification failure at its training-time root.
- How quickly do errors compound during model self-training? When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible. SCoRe's distribution-mismatch finding points to a root cause of error avalanching: self-training loops fail because corrections learned from one distribution don't apply to the model's own evolving errors. Online RL under the model's own error distribution is the principled fix for both single-generation self-correction (SCoRe) and multi-iteration self-training (avalanche prevention).
Original note title: SFT on model-generated correction traces fails due to distribution mismatch — multi-turn online RL under the model's own error distribution is required for self-correction