
How quickly do errors compound during model self-training?

When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.


Self-training loops, in which a model generates its own training data, contain a structural failure mode that is rarely discussed alongside the variance and overthinking problems: error avalanching. The SECToR paper documents this precisely: during self-training on arithmetic, small inaccuracies in a model's output compound rapidly, because each iteration learns from the previous iteration's mistakes. The avalanche dynamic means that errors are not just preserved but amplified. Previous attempts at LLM self-learning, on addition and other tasks, confirm the pattern: improvement stagnates after only a few self-training steps.
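A minimal sketch of the dynamic (illustrative constants, not figures from SECToR): if each round of self-training inherits the previous round's label noise and amplifies it by a constant factor, the per-round error rate grows geometrically and saturates within a handful of rounds.

```python
# Toy model of error avalanching in a self-training loop.
# base_error and amplification are made-up illustrative constants,
# not measurements from the SECToR paper.

def avalanche(base_error=0.02, amplification=1.6, rounds=8):
    """Per-round error rate when each round trains on the previous
    round's noisy outputs and partially learns its mistakes."""
    error = base_error
    history = [error]
    for _ in range(rounds):
        # Fresh mistakes plus inherited mistakes, amplified because the
        # model now treats last round's errors as ground truth.
        error = min(1.0, base_error + amplification * error)
        history.append(error)
    return history

for t, e in enumerate(avalanche()):
    print(f"round {t}: error rate = {e:.3f}")
```

With these constants the error rate crosses 50% by round 5. Verification effectively lowers the amplification factor, which is why it extends the runway rather than removing the ceiling.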

This is a training-time analog to "Does a model improve by arguing with itself?", which operates at inference time. The mechanism differs (training vs. test-time), but the dynamic is the same: a model using only its own outputs as its information source turns small errors into catastrophic drift.

The fix is verification. SECToR partially mitigates error avalanching through self-consistency checks: multiple candidate outputs are generated and compared, filtering out outputs where the model disagrees with itself. This is not a complete solution — self-consistency can still miss systematic errors the model makes consistently — but it significantly slows avalanching. Eventually the process still terminates due to accumulated error, but the runway is longer.
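A sketch of that filtering step, assuming a hypothetical sample_fn(question) that draws one stochastic (temperature > 0) answer from the model; names and thresholds are illustrative, not from the paper:

```python
from collections import Counter

def self_consistency_filter(question, sample_fn, k=8, min_agreement=0.75):
    """Keep a self-generated label only if k samples mostly agree.

    sample_fn(question) -> answer string; a hypothetical stochastic
    call into the model being self-trained.
    """
    answers = [sample_fn(question) for _ in range(k)]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / k >= min_agreement:
        return top_answer  # confident enough to add to the training set
    return None            # the model disagrees with itself: discard

# Failure mode noted in the text: if the model makes the *same* mistake
# in most of the k samples, the wrong answer passes this filter anyway.
```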

The practical implication for self-improvement approaches: unsupervised self-training (no external ground truth, no cross-model verification) is inherently time-limited. The error floor is set by the quality of the self-consistency check, not by the model's actual capability. This matters for RL approaches that use outcome-only verification: outcome-level signals do catch many mistakes, but step-level errors that happen to produce correct final answers go uncorrected and can compound in later iterations.
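To make the outcome-only gap concrete, here is a sketch contrasting the two filters over a hypothetical trace format ({"steps": [...], "answer": ...}); step_ok stands in for whatever step-level verifier is assumed to be available:

```python
def outcome_only_filter(traces, final_answer_ok):
    """Outcome-only verification: keep any trace whose final answer
    checks out, regardless of whether its intermediate steps are sound."""
    return [t for t in traces if final_answer_ok(t["answer"])]

def step_level_filter(traces, final_answer_ok, step_ok):
    """Step-level verification additionally requires every intermediate
    step to pass a check (a verifier model or a programmatic rule)."""
    return [
        t for t in traces
        if final_answer_ok(t["answer"]) and all(step_ok(s) for s in t["steps"])
    ]

# A trace with a lucky-but-wrong derivation survives the first filter and
# re-enters the training set, seeding avalanching; the second removes it.
```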


Source: Reasoning Methods CoT ToT
