Why does online RL succeed where supervised training fails for self-correction?
This explores why teaching a model to fix its own mistakes works when it practices live (online RL) but breaks down when you simply show it pre-recorded examples of corrections (supervised fine-tuning).
The corpus has a sharp, direct answer to the core of this, and a set of adjacent findings that explain the deeper mechanism.
The heart of it is a distribution mismatch. When you train self-correction on offline correction traces, the errors in that training data aren't the errors the model actually makes at test time, so it learns to 'correct' mistakes it would never have produced, and it tends to collapse into a single rote correction move regardless of what went wrong (Why does self-correction training on offline data fail?). Online RL closes this gap by letting the model trip over its *own* live mistakes and practice recovering from them. The supervision is generated under the same conditions the model faces at inference, so there's nothing to mismatch.
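The mismatch described above can be made concrete with a toy measurement. The sketch below compares the error types found in an offline correction dataset with the error types a model produces in its own rollouts, using total variation distance. The error labels and counts are invented for illustration and do not come from any of the cited notes.

```python
from collections import Counter


def error_distribution(error_labels):
    """Turn a list of error-type labels into a probability distribution."""
    counts = Counter(error_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means the distributions are identical, 1.0 means they share no mass.
    """
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


# Hypothetical error labels, invented for illustration only.
offline_trace_errors = ["arithmetic"] * 70 + ["sign_flip"] * 30
model_rollout_errors = (
    ["misread_problem"] * 60 + ["arithmetic"] * 25 + ["dropped_case"] * 15
)

offline = error_distribution(offline_trace_errors)
online = error_distribution(model_rollout_errors)

mismatch = total_variation(offline, online)
# Share of the model's live errors that the offline data never shows how to fix.
uncovered = sum(prob for label, prob in online.items() if label not in offline)

print(f"total variation distance: {mismatch:.2f}")
print(f"uncovered share of live errors: {uncovered:.2f}")
```

In this toy setup most of the model's live errors never appear in the offline traces, which is the situation where imitation has nothing relevant to copy. Online training sidesteps the problem by construction, because the training errors are sampled from the rollout distribution itself.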
What makes this more than a data-hygiene story is what RL is actually doing under the hood. Several notes converge on the idea that RL doesn't install new reasoning; it surfaces and reweights capabilities already latent in the pretrained model. RL updates touch only 5–30% of parameters, in sparse subnetworks that are nearly full-rank and nearly identical across random seeds, suggesting RL makes targeted structural adjustments rather than rewriting the model (Does reinforcement learning update only a small fraction of parameters?). And verifiable rewards act as catalysts that activate existing pretraining strategies rather than teaching genuinely new ones (How does RL training reshape reasoning and what gets lost?). Self-correction is exactly the kind of skill the base model can already *do* but doesn't reliably *deploy*, so the right training signal is one that selectively reinforces deployment on real failures. Online practice provides that signal; a static imitation target cannot.
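To make the 5–30% figure tangible, here is a minimal sketch of how the fraction of updated parameters could be measured between a base checkpoint and its RL-tuned counterpart. The flat lists of floats, the `tol` threshold, and the function name `updated_fraction` are assumptions for illustration; real checkpoints would be tensors, and the cited note may define sparsity differently.

```python
def updated_fraction(before, after, tol=0.0):
    """Return the fraction of parameters whose value changed by more than `tol`."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    if not before:
        raise ValueError("checkpoints must not be empty")
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


# Toy "checkpoints": ten parameters, two of which move during RL (hypothetical values).
base_params = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
rl_params = list(base_params)
rl_params[2] += 0.05
rl_params[7] -= 0.02

fraction_updated = updated_fraction(base_params, rl_params)
print(f"fraction of parameters updated: {fraction_updated:.0%}")
```

A count like this captures only the sparsity claim. The nearly full-rank and seed-stability observations would need separate checks, such as the rank of each layer's update matrix and the overlap of changed positions across runs.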
There's a second reason supervised imitation underperforms here: it can only copy the modes present in the data, whereas online training is a feedback loop that can discover the recovery move itself. The corpus is full of variations on 'manufacture the missing supervision from the model's own behavior': agents treating the consequences of their own actions as the training signal with no external reward (Can agents learn from their own actions without external rewards?), tree search ranking a model's own solution paths to replace human annotation (Can tree search replace human feedback in LLM training?), self-play loops co-evolving skills against an internal judge (Can language models learn skills without human supervision?), and models learning to score their own outputs in the unused sequence space after the output (Can models learn to evaluate their own work during training?). The common thread: feedback grounded in the model's actual rollouts beats imitation of someone else's trace, because only the model's own rollouts reveal what *it* gets wrong.
But the corpus also warns against reading 'online RL wins' as 'online RL is magic.' The reward shape matters enormously. Binary correctness rewards quietly teach confident guessing because they never punish a confident wrong answer; adding a calibration term fixes this (Does binary reward training hurt model calibration?). Feeding RL problems that are too hard breeds degenerate shortcuts that then contaminate skills the model already had (Do overly hard RLVR samples actually harm model capabilities?). And RL tends to collapse onto a single dominant format within the first epoch, suppressing alternatives (Does RL training collapse format diversity in pretrained models?). So the honest version of the answer is: online RL succeeds at self-correction not because reinforcement is inherently smarter than supervision, but because it trains on the model's *own live error distribution* and reweights latent skills the model already has, and only when the reward is shaped to reinforce the right behavior.
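The calibration point can be illustrated with a small reward sketch. Under a binary reward, the model's stated confidence has no effect on the reward at all. Subtracting a Brier-style penalty makes the expected reward peak when the stated confidence equals the true probability of being correct. The exact form used here (correctness minus squared confidence error, equally weighted) is an assumption for illustration, not necessarily the formulation in the cited note.

```python
def binary_reward(correct):
    """Plain correctness reward: the stated confidence plays no role."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct, confidence):
    """Correctness reward minus a Brier-style penalty on the stated confidence.

    Assumed form: outcome - (confidence - outcome)^2, with outcome in {0, 1}.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_calibrated_reward(p_correct, confidence):
    """Expected calibrated reward when the answer is right with probability p_correct."""
    return (
        p_correct * calibrated_reward(True, confidence)
        + (1.0 - p_correct) * calibrated_reward(False, confidence)
    )


# A model that is right 30% of the time on some class of problems (hypothetical).
p_correct = 0.3
confidence_grid = [i / 10 for i in range(11)]
best_confidence = max(
    confidence_grid, key=lambda q: expected_calibrated_reward(p_correct, q)
)

print("reward for a wrong answer, binary:", binary_reward(False))
print("reward for a confident wrong answer, calibrated:", calibrated_reward(False, 1.0))
print("confidence that maximizes expected calibrated reward:", best_confidence)
```

The expected penalty works out to (confidence - p_correct)^2 + p_correct * (1 - p_correct), so it is smallest when the stated confidence matches the true accuracy. That is why the search over the grid lands on the honest value rather than on 1.0.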
Sources (10 notes)
SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior rather than by algorithmic sophistication.
Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.