
How quickly do errors compound during model self-training?

When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.


Self-training loops, in which a model generates its own training data, contain a structural failure mode that is rarely discussed alongside the variance and overthinking problems: error avalanching. The SECToR paper documents this precisely: during self-training on arithmetic, small inaccuracies in a model's output compound rapidly, because each iteration learns from the previous iteration's mistakes. The avalanche dynamic means that errors are not just preserved but amplified. Previous attempts at LLM self-learning, on addition and other tasks, confirm the pattern: improvement stagnates after only a few self-training steps.
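A minimal sketch of the dynamic (illustrative constants, not figures from SECToR): if each round of self-training inherits the previous round's label noise and amplifies it by a constant factor, the per-round error rate grows geometrically and saturates within a handful of rounds.

```python
# Toy model of error avalanching in a self-training loop.
# base_error and amplification are made-up illustrative constants,
# not measurements from the SECToR paper.

def avalanche(base_error=0.02, amplification=1.6, rounds=8):
    """Per-round error rate when each round trains on the previous
    round's noisy outputs and partially learns its mistakes."""
    error = base_error
    history = [error]
    for _ in range(rounds):
        # Fresh mistakes plus inherited mistakes, amplified because the
        # model now treats last round's errors as ground truth.
        error = min(1.0, base_error + amplification * error)
        history.append(error)
    return history

for t, e in enumerate(avalanche()):
    print(f"round {t}: error rate = {e:.3f}")
```

With these constants the error rate crosses 50% by round 5. Verification effectively lowers the amplification factor, which is why it extends the runway rather than removing the ceiling.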

This is a training-time analog to "Does a model improve by arguing with itself?", which operates at inference time. The mechanism differs (training vs. test-time), but the dynamic is the same: a model using only its own outputs as its information source turns small errors into catastrophic drift.

The fix is verification. SECToR partially mitigates error avalanching through self-consistency checks: multiple candidate outputs are generated and compared, filtering out outputs where the model disagrees with itself. This is not a complete solution — self-consistency can still miss systematic errors the model makes consistently — but it significantly slows avalanching. Eventually the process still terminates due to accumulated error, but the runway is longer.
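A sketch of that filtering step, assuming a hypothetical sample_fn(question) that draws one stochastic (temperature > 0) answer from the model; names and thresholds are illustrative, not from the paper:

```python
from collections import Counter

def self_consistency_filter(question, sample_fn, k=8, min_agreement=0.75):
    """Keep a self-generated label only if k samples mostly agree.

    sample_fn(question) -> answer string; a hypothetical stochastic
    call into the model being self-trained.
    """
    answers = [sample_fn(question) for _ in range(k)]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / k >= min_agreement:
        return top_answer  # confident enough to add to the training set
    return None            # the model disagrees with itself: discard

# Failure mode noted in the text: if the model makes the *same* mistake
# in most of the k samples, the wrong answer passes this filter anyway.
```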

The practical implication for self-improvement approaches: unsupervised self-training (no external ground truth, no cross-model verification) is inherently time-limited. The error floor is set by the quality of the self-consistency check, not by the model's actual capability. This matters for RL approaches that use outcome-only verification: outcome-level signals do catch many mistakes, but step-level errors that happen to produce correct final answers go uncorrected and can compound in later iterations.
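To make the outcome-only gap concrete, here is a sketch contrasting the two filters over a hypothetical trace format ({"steps": [...], "answer": ...}); step_ok stands in for whatever step-level verifier is assumed to be available:

```python
def outcome_only_filter(traces, final_answer_ok):
    """Outcome-only verification: keep any trace whose final answer
    checks out, regardless of whether its intermediate steps are sound."""
    return [t for t in traces if final_answer_ok(t["answer"])]

def step_level_filter(traces, final_answer_ok, step_ok):
    """Step-level verification additionally requires every intermediate
    step to pass a check (a verifier model or a programmatic rule)."""
    return [
        t for t in traces
        if final_answer_ok(t["answer"]) and all(step_ok(s) for s in t["steps"])
    ]

# A trace with a lucky-but-wrong derivation survives the first filter and
# re-enters the training set, seeding avalanching; the second removes it.
```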


Source: Reasoning Methods CoT ToT
