Reinforcement Learning for LLMs

Can transformers improve exponentially by learning from their own correct solutions?

Can standard transformers achieve extreme length generalization by iteratively filtering and training on their own correct outputs? This note examines whether self-correction loops enable unbounded out-of-distribution improvement without architectural changes.

Note · 2026-02-22 · sourced from LLM Architecture
Related: How should we allocate compute budget at inference time? · What kind of thing is an LLM really? · How should researchers navigate LLM reasoning research?

"Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges" (2502.01612) demonstrates that standard transformer architectures can achieve extreme out-of-distribution generalization through a self-improvement loop: generate solutions, filter for correctness, train on the correct ones, repeat.

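As a concrete reading of that loop, here is a minimal Python sketch. Only the shape of the loop (sample, check correctness, keep, retrain) follows the paper; the `generate` interface and problem format are hypothetical stand-ins, and the addition verifier is one example of an automated correctness check.

```python
from typing import Callable, List, Tuple

def verify_addition(problem: str, answer: str) -> bool:
    """Perfect verifier for 'a+b' problems: exact match against ground truth."""
    a, b = problem.split("+")
    return answer.strip() == str(int(a) + int(b))

def self_improvement_round(
    generate: Callable[[str], str],      # hypothetical model-sampling interface
    verify: Callable[[str, str], bool],  # automated correctness check
    problems: List[str],
    k_samples: int = 16,
) -> List[Tuple[str, str]]:
    """One round: sample candidates, keep only verified (problem, solution) pairs."""
    kept = []
    for problem in problems:
        for _ in range(k_samples):
            candidate = generate(problem)
            if verify(problem, candidate):  # the filter: unverified outputs are discarded
                kept.append((problem, candidate))
                break
    return kept

# Outer loop (sketch): fine-tune on `kept`, raise the difficulty (e.g. one more
# digit of addition), and repeat, so each round's frontier feeds the next.
```
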
The results across arithmetic, string manipulation, and maze solving show generalization far beyond the training distribution: from 10-digit training problems to 100-digit addition without apparent saturation. The critical mechanism: filtering for correct self-generated examples produces exponential improvement in OOD performance across training rounds. Not linear. Exponential.
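
To make "exponential across rounds" concrete, here is a toy compounding calculation (the growth factor below is an assumption for illustration, not a number from the paper): if each round multiplies the reachable problem length instead of adding a fixed increment, ten rounds take 10 digits to roughly 100.

```python
# Toy illustration only; r is assumed, not measured in the paper.
reachable_digits = 10.0  # solved after the initial supervised phase
r = 1.26                 # hypothetical per-round multiplicative growth factor
for t in range(1, 11):
    reachable_digits *= r
    print(f"round {t:2d}: ~{reachable_digits:.0f} digits")
# 10 * 1.26**10 ≈ 101, i.e. multiplicative growth rather than +k digits per round
```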

This is achieved without any modification to the base transformer architecture. No external verifiers beyond a correctness check. No curriculum design. No reward models. The model's own ability to occasionally solve harder problems (via sampling variance) provides the training signal for the next round. The correctness filter is what distinguishes this from the failure mode in "How quickly do errors compound during model self-training?": without verification, small errors compound exponentially in the wrong direction; with verification, correct solutions compound exponentially in the right direction.
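
A back-of-the-envelope contrast between the two regimes, with an assumed per-round error rate (purely illustrative): without a filter, label noise accumulates in the training set round over round; with a perfect filter, the kept data stays clean by construction.

```python
eps = 0.05           # assumed per-round generation error rate, for illustration
clean_fraction = 1.0
for t in range(1, 6):
    clean_fraction *= (1 - eps)  # unfiltered: each round inherits prior errors
    print(f"round {t}: unfiltered training data ≈ {clean_fraction:.3f} correct")
# With a perfect correctness filter the kept data is 1.000 correct every round,
# so compounding works for the model instead of against it.
```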

The finding directly challenges "What limits how much models can improve themselves?". The generation-verification gap argument says self-improvement is bounded because the model cannot verify better than it generates. But for tasks with automated verification (arithmetic, string manipulation), verification is perfect and the gap vanishes. This is exactly the class of tasks where self-improvement works unboundedly.

Compared with "Can language models improve themselves without any external training data?", the self-improving transformer uses a different but related mechanism: the model serves as both proposer (generating candidate solutions at harder scales) and solver (learning from its own correct solutions). The asymmetry comes from the fact that generating one correct solution to a harder problem is easier than reliably solving all harder problems.
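
The asymmetry is just sampling arithmetic: if the per-sample success rate on a frontier problem is p, the chance of harvesting at least one verified solution from k samples is 1 - (1 - p)^k. The value of p below is assumed for illustration.

```python
p = 0.02  # assumed per-sample success rate on frontier problems
for k in (1, 16, 64, 256):
    print(f"k={k:3d}: P(>=1 verified solution) = {1 - (1 - p)**k:.3f}")
# k=256 gives ≈0.994: a model that almost never solves the harder problem
# still reliably yields at least one correct example to train on.
```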

The exponential improvement finding may explain the result in "Can a single training example unlock mathematical reasoning?": if a single correct example at the boundary can seed an exponential self-improvement cascade, then the minimal signal needed for activation is genuinely minimal.

