Reinforcement Learning for LLMs

Why does self-correction training on offline data fail?

Can language models learn to correct their own mistakes through supervised training on correction examples? This note examines how two failure modes, distribution mismatch and behavior collapse, prevent self-correction from emerging under that training regime.

Note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

SCoRe (Self-Correction via Reinforcement Learning) starts from a stark baseline: "there is no major work showing successful intrinsic self-correction via prompting alone." Naively prompting LLMs for self-correction can degrade performance. The question is whether self-correction is an impossible capability or just one that requires the right training approach.

SFT on offline correction traces fails through two mechanisms:

Distribution mismatch: the errors made by the data-collection policy (used to generate correction examples) don't match the errors the trained model will make at test time. The model learns corrections for someone else's mistakes, not its own. At test time, it encounters novel error patterns that the correction training never addressed.

Behavior collapse: SFT implicitly gravitates toward a single dominant correction mode — whichever pattern maximizes likelihood across training examples. This mode may work for some error types but fails to generalize. The model learns one way to correct rather than learning when and how to adapt correction strategy to the specific error encountered.
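
A minimal sketch may make both mechanisms concrete. Everything below (CorrectionExample, model.log_prob, the prompt template) is a hypothetical illustration of the failing SFT baseline, not code from the SCoRe paper; the point is only that the likelihood objective conditions on errors produced by a separate data-collection policy.

```python
# Hypothetical sketch: SFT on offline correction traces (the failing baseline).
# CorrectionExample, model.log_prob and the prompt template are illustrative
# assumptions, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class CorrectionExample:
    problem: str
    flawed_attempt: str        # produced by the data-collection policy, not the model being trained
    reference_correction: str  # the "gold" revision stored in the offline dataset

def sft_correction_loss(model, batch):
    """Average negative log-likelihood of reference corrections.

    Distribution mismatch: the flawed attempts are someone else's errors, so the
    model never practices fixing the mistakes it will actually make at test time.
    Behavior collapse: maximizing likelihood across the dataset pulls the model
    toward whichever single correction pattern dominates the examples.
    """
    total = 0.0
    for ex in batch:
        prompt = (
            f"{ex.problem}\n\n"
            f"Previous attempt:\n{ex.flawed_attempt}\n\n"
            f"Revised answer:"
        )
        total += -model.log_prob(ex.reference_correction, context=prompt)
    return total / len(batch)
```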

SCoRe addresses both by training under the model's own distribution of self-generated correction traces using multi-turn online RL. The model generates a first attempt, then generates a correction attempt, and the RL reward depends on whether the correction improved the outcome. Appropriate regularization steers learning toward correction behaviors that are genuinely effective at test time, rather than merely fitting high-reward responses for a given prompt.
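
As a rough illustration of that setup, here is a hedged sketch of a single two-turn episode with an improvement-shaped reward and KL regularization. policy.sample, is_correct, kl_to_reference, and the coefficients alpha and beta are assumptions chosen for clarity; the paper's actual staging and reward shaping differ in detail.

```python
# Sketch of one multi-turn online RL episode for self-correction, assuming a
# generic policy-gradient trainer. All names and coefficients are illustrative.
def self_correction_episode(policy, problem, alpha=1.0, beta=0.1):
    """Roll out attempt and correction under the *current* policy, then score."""
    attempt_1 = policy.sample(problem)                       # turn 1: first attempt
    attempt_2 = policy.sample(problem, previous=attempt_1)   # turn 2: revise own output

    r1 = float(is_correct(problem, attempt_1))
    r2 = float(is_correct(problem, attempt_2))

    # Reward the final answer, plus a shaped bonus for improving on turn 1,
    # so the policy cannot collect reward by repeating a good first attempt
    # or by sandbagging turn 1 to make the "correction" look useful.
    reward = r2 + alpha * (r2 - r1)

    # KL regularization toward a reference model keeps the first-turn behavior
    # from collapsing while the policy learns how to edit it.
    reward -= beta * kl_to_reference(policy, problem, attempt_1, attempt_2)

    return (attempt_1, attempt_2), reward
```

Because both attempts are sampled from the policy being trained, the errors it learns to correct are, by construction, its own, which is exactly the property the offline SFT sketch above lacks.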

This connects to a broader pattern: Does supervised fine-tuning actually improve reasoning quality? documents the same SFT-vs-RL dynamic in domain specialization. SFT copies surface patterns; RL trains under the model's actual distribution. For self-correction specifically, this means the model must practice correcting its own mistakes, not someone else's — the same principle that makes deliberate practice effective for humans.

The implication for the self-revision literature is precise: Does self-revision actually improve reasoning in language models? and Does reflection in reasoning models actually correct errors? show that current models can't self-correct. SCoRe suggests this is a training problem, not a capability limit — but fixing it requires abandoning SFT in favor of online RL.


Source: Self Refinement Self Consistency Feedback — SCoRe (arXiv:2409.12917)


SFT on model-generated correction traces fails due to distribution mismatch — multi-turn online RL under the model's own error distribution is required for self-correction