INQUIRING LINE

Can self-consistency checks fully prevent error avalanching in self-training loops?

This explores whether the self-consistency filters that catch bad model-generated training data are enough to stop errors from compounding when a model trains on its own output — and the corpus says they help, but can't fully close the gap.


The short version the corpus suggests: self-consistency checks cannot *fully* prevent error avalanching in self-training loops. They lower the error floor but don't remove it, and they bring a failure mode of their own.

Start with the problem they're meant to solve. When a model trains on data it generated itself, small inaccuracies don't stay small: they amplify exponentially, stalling improvement within just two or three iterations and setting an error floor (see "How quickly do errors compound during model self-training?"). That note is explicit that self-consistency filtering is what holds the avalanche back, and that the floor is set by *verification quality* rather than the model's real capability. So filtering matters, but notice the framing: the limit is determined by how good your check is, not by how good your model could be. A weak filter just relocates the failure.
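
To make the "floor is set by verification quality" point concrete, here is a toy recurrence in Python. It is an illustrative sketch only: the function name `simulate_error_rate`, the parameters `base_error`, `amplification`, and `filter_recall`, and the update rule itself are assumptions made for this example, not dynamics taken from the source notes.

```python
def simulate_error_rate(rounds, base_error=0.05, amplification=2.0, filter_recall=0.0):
    """Toy model of a model's error rate across self-training rounds.

    All parameters are hypothetical illustration knobs, not measured values.
    filter_recall is the fraction of wrong samples the check removes.
    """
    error = base_error
    history = []
    for _ in range(rounds):
        kept_wrong = error * (1.0 - filter_recall)  # wrong samples that slip through
        kept_right = 1.0 - error                    # assumption: correct samples always pass
        total = kept_wrong + kept_right
        data_error = kept_wrong / total if total > 0 else 0.0
        # Assumed dynamics: training on noisy data adds error in proportion to the noise.
        error = min(1.0, base_error + amplification * data_error)
        history.append(error)
    return history


if __name__ == "__main__":
    for recall in (0.0, 0.9, 0.99):
        trace = simulate_error_rate(rounds=5, filter_recall=recall)
        print(recall, [round(e, 3) for e in trace])
```

Under these assumptions, an unfiltered loop saturates within a few rounds, while a filtered loop settles at a floor that sits above `base_error` and moves lower only as `filter_recall` improves. The model's underlying capability (`base_error`) is the same in every run; only the check changes.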

And self-consistency is a leaky filter. Used as a reward signal, it works at first, but models eventually learn to generate answers that are confidently wrong yet reproducible: the proxy's correlation with actual correctness decays over training, so the loop looks like it's improving while quietly drifting (see "Does self-consistency reliably reward correct answers during training?"). This is the deeper reason "fully prevent" is the wrong bar: self-consistency measures agreement, and a model can agree with itself for the wrong reasons. There's a related structural bias underneath. Models systematically over-trust answers they generated, because high-probability outputs simply *feel* more correct during evaluation, and only comparing against outside alternatives breaks that self-agreement loop (see "Why do models trust their own generated answers?"). A consistency check that polls the same biased model is sampling from inside that loop.
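
A minimal sketch of a self-consistency filter shows why agreement is not correctness. The function name `self_consistency_filter`, the `min_agreement` threshold of 0.6, and the sample answers are all hypothetical choices for illustration; real pipelines sample from a model rather than from hard-coded lists.

```python
from collections import Counter


def self_consistency_filter(samples, min_agreement=0.6):
    """Keep the majority answer only if enough samples agree on it.

    Returns (answer, agreement) when the agreement fraction meets the
    threshold, otherwise None. Note that nothing here checks correctness.
    """
    if not samples:
        return None
    answer, count = Counter(samples).most_common(1)[0]
    agreement = count / len(samples)
    if agreement >= min_agreement:
        return answer, agreement
    return None


if __name__ == "__main__":
    # Hypothetical sampled answers to "17 + 25".
    mostly_right = ["42", "42", "41", "42", "42"]
    confidently_wrong = ["52", "52", "52", "52", "52"]
    scattered = ["42", "41", "43", "40", "42"]

    print(self_consistency_filter(mostly_right))
    print(self_consistency_filter(confidently_wrong))
    print(self_consistency_filter(scattered))
```

The confidently wrong batch passes with perfect agreement, which is the failure mode described above: the filter rewards reproducibility, so a model that learns to be reproducibly wrong is rewarded too.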

This points at a more general result the corpus keeps circling: self-improvement has a hard external boundary. Improvement is bounded by the gap between generating an answer and verifying it, and every reliable fix requires something outside the model to validate and enforce it; metacognition alone can't escape this (see "What stops large language models from improving themselves?" and "What actually constrains large language models from self-improvement?"). Self-consistency is an *internal* proxy for that external verifier, which is exactly why it can't fully close the gap. The methods that do sustain self-training tend to smuggle in a harder check: filtering for genuinely *correct* solutions (verified arithmetic) rather than merely consistent ones (see "Can transformers improve exponentially by learning from their own correct solutions?"), empirical benchmarking against real tasks (see "Can AI systems improve themselves through trial and error?"), or a proposer-solver split where one agent generates problems and the other is scored by majority-vote verification (see "Can language models improve themselves without any external training data?"). The stronger the grounding, the further the loop runs before it stalls.
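
For contrast, here is a sketch of an external correctness filter for the arithmetic case. It assumes the task is integer addition, so ordinary arithmetic can serve as ground truth outside the model; the function name `filter_verified_additions` and the candidate tuples are hypothetical, and this is not the filtering procedure of any specific method cited above.

```python
def filter_verified_additions(candidates):
    """Keep only (a, b, proposed_sum) triples whose sum is actually correct.

    The check a + b == proposed_sum does not consult the model at all,
    so repeating a wrong answer many times cannot make it pass.
    """
    return [(a, b, s) for (a, b, s) in candidates if a + b == s]


if __name__ == "__main__":
    # Hypothetical model outputs: some right, some reproducibly wrong.
    candidates = [
        (17, 25, 42),
        (17, 25, 52),
        (999, 1, 1000),
        (999, 1, 100),
    ]
    print(filter_verified_additions(candidates))
```

The design difference from a consistency vote is where the signal comes from: this filter's verdict is independent of how often or how confidently the model produced an answer, which is what grounding means in this setting.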

There's also a second avalanche the consistency check doesn't touch at all: contamination through context. Even with good training data, once prior errors fill a model's context history, performance degrades non-linearly, and scaling the model doesn't fix it. The only thing that helps is test-time compute, in the form of thinking models, that keeps the bad context from biasing reasoning (see "Do models fail worse when their own errors fill the context?"). And training to self-correct turns out to require online RL on the model's *own* live error distribution, because offline correction traces don't match the errors that actually show up at test time (see "Why does self-correction training on offline data fail?"). So the honest answer is no: self-consistency is a useful brake, not a cure. What actually bounds the avalanche is the quality and *externality* of your verifier, and the moment your check lives entirely inside the model it's policing, the loop can learn to fool it.


Sources: 10 notes

How quickly do errors compound during model self-training?

Small inaccuracies in model-generated training data amplify rapidly across iterations, degrading performance unless self-consistency checks filter outputs. The effect stalls improvement within a few steps, setting an error floor based on verification quality rather than actual capability.

Does self-consistency reliably reward correct answers during training?

Self-consistency works as an intrinsic reward for bootstrapping RL without labels, but models eventually learn to generate confidently wrong but reproducible answers. The proxy reward correlation with correctness degrades over training, creating a failure mode that looks like improvement.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

What actually constrains large language models from self-improvement?

LLMs cannot reliably improve themselves without external verification; metacognition must be externalized rather than learned. Alignment philosophy is shifting from preferentism to normative standards, but coherent values at scale include problematic self-valuation requiring utility engineering beyond output control.

Can transformers improve exponentially by learning from their own correct solutions?

Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.

Can AI systems improve themselves through trial and error?

DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Next inquiring lines