Why does filtering for correct examples prevent error compounding in self-training?
This explores why a simple filter — keep only the outputs a model got right, retrain on those, repeat — is enough to stop the runaway error amplification that otherwise wrecks self-training. The short answer the corpus points to: the filter isn't really about correctness, it's about supplying *external verification* the model can't generate from inside itself.
Start with the failure it prevents. When a model trains on its own raw output, small inaccuracies don't stay small. They avalanche, compounding within a few iterations and stalling improvement at an error floor set by the quality of the verification rather than by the model's actual capability (see "How quickly do errors compound during model self-training?"). The process spirals rather than self-corrects because of a structural bias: models systematically over-trust the answers they themselves generated, because high-probability outputs *feel* correct during evaluation (see "Why do models trust their own generated answers?"). So a model grading its own work tends to ratify its own mistakes, and each round of retraining bakes the mistake deeper.
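To make "an error floor set by verification" concrete, here is a minimal sketch of the arithmetic. It is an illustration, not a result from the notes. It assumes a hypothetical verifier described by two invented rates: `true_accept` (the chance a correct output passes) and `false_accept` (the chance a wrong output passes). The function name and every number are assumptions chosen for the example.

```python
def contamination_after_filter(error_rate, true_accept, false_accept):
    """Fraction of wrong examples among those the verifier lets through.

    All three arguments are hypothetical rates in [0, 1].
    """
    wrong_kept = error_rate * false_accept
    right_kept = (1.0 - error_rate) * true_accept
    kept = wrong_kept + right_kept
    if kept == 0.0:
        raise ValueError("verifier accepts nothing; no training data left")
    return wrong_kept / kept


raw_error = 0.20  # assumed share of wrong outputs before filtering

# A model grading itself and approving everything: nothing is removed.
self_graded = contamination_after_filter(raw_error, true_accept=1.0, false_accept=1.0)

# An imperfect external check: most wrong outputs are rejected.
external = contamination_after_filter(raw_error, true_accept=0.95, false_accept=0.05)

# A verifier that never passes a wrong output: the kept set is clean.
perfect = contamination_after_filter(raw_error, true_accept=0.95, false_accept=0.0)
```

Under these assumed rates, the share of wrong examples in the kept set depends on `false_accept`, which is a property of the verifier, not of the generator. That is one way to read the claim that the floor tracks verification quality rather than capability.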
Filtering for correct examples breaks the loop precisely because it inserts a judgment the model didn't make about itself. The cleanest demonstration is transformers learning addition: standard models generalize from 10-digit to 100-digit problems by generating solutions, *filtering for correctness*, and retraining, and the improvement is exponential rather than linear across rounds, with no saturation (see "Can transformers improve exponentially by learning from their own correct solutions?"). The filter is the load-bearing part. Without it you're amplifying noise; with it you're amplifying signal. The deeper principle is that self-improvement is formally bounded by a generation-verification gap: every reliable fix needs something external to validate it, and no amount of metacognition lets a model escape that on its own (see "What stops large language models from improving themselves?").
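The generate-and-filter step can be sketched in a few lines. This is a toy under stated assumptions, not the setup from the addition study: `noisy_adder` is a hypothetical stand-in for a model (it returns the right sum most of the time and is off by one otherwise), and the retraining step is left out entirely. The point is only that `is_correct` uses exact integer arithmetic, a check the generator has no say in.

```python
import random


def noisy_adder(a, b, rng, error_rate=0.3):
    """Hypothetical stand-in for a model: sometimes off by one."""
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.choice([-1, 1])
    return answer


def is_correct(a, b, answer):
    """External check: exact arithmetic, independent of the generator."""
    return a + b == answer


def filtered_round(problems, rng):
    """Generate one answer per problem and keep only verified examples."""
    kept = []
    for a, b in problems:
        answer = noisy_adder(a, b, rng)
        if is_correct(a, b, answer):
            kept.append((a, b, answer))
    return kept


rng = random.Random(0)  # seeded so the sketch is repeatable
problems = [(rng.randrange(10**10), rng.randrange(10**10)) for _ in range(1000)]
training_set = filtered_round(problems, rng)
# In a real loop, training_set would be used to retrain the model
# before the next round; that step is not shown here.
```

Every example that reaches `training_set` has passed the external check, so whatever the assumed `error_rate` of the generator, none of its mistakes are carried into the next round.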
What counts as "external" is more flexible than it sounds, and this is where the corpus gets interesting. A correctness filter is one form, but the same gating logic shows up wearing other clothes. Asymmetric self-play uses *majority vote* among multiple solver attempts as its verifier, letting a model bootstrap with no human labels at all (see "Can language models improve themselves without any external training data?"). Bidirectional RAG writes a generated answer back into its own corpus only if it passes entailment, attribution, and novelty checks, a gate that keeps hallucinations from polluting future retrievals (see "Can RAG systems safely learn from their own generated answers?"). All three are the same move: an admissions test that the model's own confidence cannot bribe.
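Both alternative gates reduce to small pieces of logic, sketched here under loud assumptions. `majority_vote` and `admit` are hypothetical helper names, the `min_share` threshold is invented, and the entailment, attribution, and novelty checks are represented by precomputed boolean flags rather than real models, since the notes do not describe how those checks are implemented.

```python
from collections import Counter


def majority_vote(answers, min_share=0.5):
    """Return the most common answer if it wins more than min_share
    of the votes; otherwise return None (no usable label)."""
    if not answers:
        return None
    winner, count = Counter(answers).most_common(1)[0]
    return winner if count / len(answers) > min_share else None


def admit(candidate, checks):
    """Admit a candidate only if every check passes."""
    return all(check(candidate) for check in checks)


# Majority vote over several solver attempts (illustrative strings).
agreed = majority_vote(["42", "42", "41", "42"])
split = majority_vote(["1", "2", "3"])

# Write-back gate with placeholder checks reading precomputed flags.
checks = [
    lambda c: c["entailed"],    # assumed output of an entailment check
    lambda c: c["attributed"],  # assumed output of a source-attribution check
    lambda c: c["novel"],       # assumed output of a novelty check
]
good = admit({"entailed": True, "attributed": True, "novel": True}, checks)
stale = admit({"entailed": True, "attributed": True, "novel": False}, checks)
```

In both cases the decision to keep an output is made by something other than the confidence of the single generation being judged: agreement across attempts in one case, a conjunction of independent checks in the other.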
Two cautions keep this from being a free lunch. First, the filter is only as honest as your verifier, and verifiers can be gamed. Train on problems that are too hard and the model learns degenerate shortcuts: group-relative normalization treats rare accidental "correct" answers as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of reasoning (see "Do overly hard RLVR samples actually harm model capabilities?"). A correctness filter that can be satisfied by luck or shortcuts re-opens the avalanche through the back door. Second, filtering picks *which* examples to learn from, but not *how* a model practices recovering. Fixing self-correction specifically needs online RL on the model's own live error distribution, because offline correction traces don't match the errors that actually show up at test time (see "Why does self-correction training on offline data fail?"). So filtering is necessary, but the surprising takeaway is that it's a proxy: the thing actually preventing compounding is a verification signal the model can't fake to itself.
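The first caution is easy to see numerically. The sketch below assumes group-relative normalization means "reward minus the group mean, divided by the group standard deviation", a common form in GRPO-style training. The notes do not spell out the formula, so treat the function, the `eps` term, and the reward groups as assumptions for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (assumed formula)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Nearly impossible problem: one lucky success out of eight attempts.
hard = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])

# Moderately hard problem: half of the attempts succeed.
moderate = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

lucky_advantage = hard[0]        # close to sqrt(7), about 2.65
typical_advantage = moderate[0]  # close to 1.0
```

Under this assumed formula, a success with group success rate p gets an advantage of roughly sqrt((1 - p) / p), so the rarer the success, the larger the update it drives. If that rare success came from a shortcut or a lucky guess, the shortcut is what gets reinforced most strongly.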
Sources (8 notes)
Small inaccuracies in model-generated training data amplify rapidly across iterations, degrading performance unless self-consistency checks filter outputs. The effect stalls improvement within a few steps, setting an error floor based on verification quality rather than actual capability.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.