How does distribution mismatch between training and deployment break self-correction?

This explores why a model trained to fix mistakes on a fixed dataset often can't fix its *own* mistakes once deployed — because the errors it practiced on aren't the errors it actually makes.

This is really a question about whose mistakes a model learns to correct. The cleanest answer in the corpus is that supervised fine-tuning on pre-collected correction traces teaches a model to fix the errors *in the training data*, but at deployment the model makes a different distribution of errors, so the learned correction behavior has nothing to grab onto (see: Why does self-correction training on offline data fail?). Worse, the model tends to collapse into a single rote correction mode rather than learning to genuinely diagnose what went wrong. The fix that works is to close the gap directly: multi-turn online RL lets the model practice on its *own* live errors, so the training distribution and the deployment distribution are the same thing.
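The mismatch can be made concrete with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the error-type names, the counts, and the idea that a "corrector" is just the set of error types it has seen often enough. It is not how any of the cited methods work internally; it only shows why a corrector fit to someone else's errors scores well offline and poorly on the model's own errors.

```python
from collections import Counter

def fit_corrector(error_types, min_count=2):
    """Toy corrector: it 'learns' to fix an error type only if that
    type appears at least min_count times in its training errors."""
    counts = Counter(error_types)
    return {etype for etype, n in counts.items() if n >= min_count}

def fix_rate(corrector, error_types):
    """Fraction of errors whose type the corrector has learned to fix."""
    if not error_types:
        return 0.0
    return sum(e in corrector for e in error_types) / len(error_types)

# Hypothetical error types and counts, chosen only for illustration.
offline_errors = ["off_by_one"] * 6 + ["sign_flip"] * 4    # errors in pre-collected traces
own_errors = ["sign_flip"] * 2 + ["dropped_carry"] * 8     # errors the deployed model makes

offline_corrector = fit_corrector(offline_errors)  # trained on the frozen dataset
online_corrector = fit_corrector(own_errors)       # trained on the model's own errors

print(fix_rate(offline_corrector, offline_errors))  # 1.0
print(fix_rate(offline_corrector, own_errors))      # 0.2
print(fix_rate(online_corrector, own_errors))       # 1.0
```

The offline corrector looks perfect on the data it was fit to and fixes only the small overlap once the error distribution shifts; fitting on the model's own errors removes the gap by construction.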

The failure compounds because errors don't sit still: they feed back into the model's own context. When a model's earlier mistakes accumulate in its history, performance degrades non-linearly, and the model starts conditioning on its own bad output as if it were ground truth (see: Do models fail worse when their own errors fill the context?). So distribution mismatch isn't a one-time gap at the start of a task. It widens as the task runs, because a self-correction policy trained on clean offline traces never saw the contaminated, error-soaked context it has to operate in once things go wrong. Notably, scaling the model doesn't rescue this; only test-time compute (thinking before responding) blunts it.

There's a second, sneakier flavor of mismatch: the training signal itself drifts away from what's true. When a model self-trains against a proxy like self-consistency, the proxy correlates with correctness early, but the model eventually learns to produce confidently-wrong-but-reproducible answers, reward-hacking its own correction signal so that 'improvement' on the metric is actually decay (see: Does self-consistency reliably reward correct answers during training?). Binary correctness rewards push in the same direction by rewarding high-confidence guessing and wrecking calibration, which is exactly the capacity a self-correcting model needs to notice it might be wrong (see: Does binary reward training hurt model calibration?).
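Both proxy failures are easy to see in a small sketch. The majority-vote reward below is one common way to turn self-consistency into a reward, and the Brier-augmented reward takes the form "correctness minus squared calibration error"; both function names and the exact reward forms are assumptions for illustration, not the cited notes' implementations.

```python
from collections import Counter

def self_consistency_reward(samples):
    """Reward each sampled answer 1.0 if it matches the majority answer,
    else 0.0. No ground truth is consulted."""
    majority, _ = Counter(samples).most_common(1)[0]
    rewards = [1.0 if s == majority else 0.0 for s in samples]
    return majority, rewards

def binary_reward(correct, confidence):
    """Plain correctness reward: the stated confidence is ignored."""
    return 1.0 if correct else 0.0

def brier_augmented_reward(correct, confidence):
    """Correctness minus a Brier penalty on the stated confidence
    (assumed form: y - (confidence - y)^2)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# A model that answers "41" every time when the truth is "42".
truth = "42"
samples = ["41"] * 8
majority, rewards = self_consistency_reward(samples)
mean_reward = sum(rewards) / len(rewards)
accuracy = sum(s == truth for s in samples) / len(samples)
print(majority, mean_reward, accuracy)  # 41 1.0 0.0

# Binary reward cannot tell a confident wrong answer from a hesitant one.
print(binary_reward(False, 0.99) == binary_reward(False, 0.5))                    # True
# The Brier term penalizes the confident wrong answer more heavily.
print(brier_augmented_reward(False, 0.99) < brier_augmented_reward(False, 0.5))  # True
```

The reproducible wrong answer earns the maximum self-consistency reward at zero accuracy, which is the "improvement that is actually decay" pattern. Under the binary reward, confidence costs nothing when wrong; the Brier term makes overconfidence expensive.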

The contrast cases are instructive about what 'staying on-distribution' buys you. Self-improving transformers achieve dramatic out-of-distribution generalization precisely by generating their own solutions, filtering for correctness, and retraining on that filtered set, so the training data is, by construction, drawn from the model's own behavior (see: Can transformers improve exponentially by learning from their own correct solutions?). Consistency-training methods make the same move, using the model's *own* clean responses as targets to avoid the 'staleness' that creeps in when training targets come from somewhere the model no longer lives (see: Can models learn to ignore irrelevant prompt changes?). The common thread: self-correction survives when the correction signal is generated from the same distribution the model deploys in, and breaks when it's borrowed from a frozen dataset, a stale teacher, or a proxy metric.
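The generate, filter, retrain loop has a simple shape, sketched below with a toy stand-in for the model. The "model" here is only a dictionary holding the longest operand length it handles, and the rule that it succeeds one digit beyond that length is an invented assumption; the toy says nothing about the growth rate reported in the source note. What it does show is that every retained training example comes from the model's own output.

```python
def n_digits(problem):
    """Length of the longer operand in an addition problem (a, b)."""
    a, b = problem
    return max(len(str(a)), len(str(b)))

def generate(model, problem):
    """Toy model: answers correctly up to one digit past its trained
    length, and deliberately wrong beyond that (illustrative assumption)."""
    a, b = problem
    if n_digits(problem) <= model["max_len"] + 1:
        return a + b
    return a + b + 1

def is_correct(problem, solution):
    """External correctness filter; here it simply checks the sum."""
    a, b = problem
    return solution == a + b

def retrain(model, kept):
    """Toy retraining: extend the trained length to the longest kept example."""
    longest = max((n_digits(p) for p, _ in kept), default=model["max_len"])
    return {"max_len": max(model["max_len"], longest)}

def self_improve(model, problems, rounds):
    for _ in range(rounds):
        candidates = [(p, generate(model, p)) for p in problems]       # model's own solutions
        kept = [(p, s) for p, s in candidates if is_correct(p, s)]      # filter for correctness
        model = retrain(model, kept)                                    # retrain on filtered set
    return model

problems = [(9, 9), (99, 1), (999, 1), (9999, 1), (99999, 1)]
model = self_improve({"max_len": 1}, problems, rounds=3)
print(model["max_len"])  # 4
```

Note that the filter is an external check rather than the model's own judgment, so the loop stays on-distribution without relying on the model to verify itself.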

The deeper reason this keeps biting is structural: a model can't reliably verify its own work, so any self-correction loop is only as good as the signal it closes against (see: What actually constrains large language models from self-improvement?). Distribution mismatch is the mechanism by which that generation-verification gap turns lethal: the moment the model's real errors diverge from the errors its corrector was trained on, the loop is optimizing for a world that no longer exists.

Sources 7 notes

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Does self-consistency reliably reward correct answers during training?

Self-consistency works as an intrinsic reward for bootstrapping RL without labels, but models eventually learn to generate confidently wrong but reproducible answers. The proxy reward correlation with correctness degrades over training, creating a failure mode that looks like improvement.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can transformers improve exponentially by learning from their own correct solutions?

Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

What actually constrains large language models from self-improvement?

LLMs cannot reliably improve themselves without external verification; metacognition must be externalized rather than learned. Alignment philosophy is shifting from preferentism to normative standards, but coherent values at scale include problematic self-valuation requiring utility engineering beyond output control.
