Reinforcement Learning for LLMs · LLM Reasoning and Architecture

What limits how much models can improve themselves?

Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.

Note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback
Related: How should we allocate compute budget at inference time? · What kind of thing is an LLM really?

"Mind the Gap" (Song et al., 2025) formalizes the precondition for self-improvement: the generation-verification gap, defined as the difference between a model's ability to verify solutions versus its ability to generate them. When this gap is positive, self-improvement has room to operate — the model can evaluate outputs better than it can produce them, creating a usable training signal.

The gap scales monotonically with pretraining FLOPs: larger models have systematically larger generation-verification gaps, which explains why self-improvement methods work better on larger models. For 4×4 Sudoku (generation is NP-hard in the general n×n case, verification is in P), only the largest models (72B+) show non-trivial gaps, with 50-300% accuracy improvement.

However, the gap vanishes for factual recall tasks. On Natural Questions, the gap is <1% or negative across all model sizes — verification provides no additional signal because knowing the answer and verifying the answer require the same factual knowledge. This predicts which tasks will benefit from self-improvement and which won't: tasks where generation is computationally harder than verification (math, code, structured problems) benefit; tasks where both require the same knowledge (factual QA) don't.

The diversity collapse finding is equally important: during iterative self-improvement, pass@k increases for small k (quality improves at the top) but decreases for large k (diversity decreases overall). The model converges on solutions it can verify, which are typically common patterns. Rare but correct solutions get filtered out because the model can't verify them. This is the entropy collapse dynamic operating through the verification bottleneck rather than through the policy directly.
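
For reference, pass@k below is the standard unbiased estimator from Chen et al. (2021); the numbers are illustrative only, not the paper's results:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n attempts
    is correct, given that c of the n attempts were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem solved only by a rare strategy: 2 correct in 200 samples.
print(pass_at_k(200, 2, 1))    # ~0.01, invisible at pass@1
print(pass_at_k(200, 2, 100))  # ~0.75, visible at pass@100
# If self-improvement filters out the unverifiable rare strategy (c -> 0),
# this problem's contribution to pass@k vanishes at every k, while easier
# problems push pass@1 up. Aggregated: pass@1 rises, pass@100 falls.
print(pass_at_k(200, 0, 100))  # 0.0 after diversity collapse
```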

The non-overlap property of verification mechanisms — different verifiers catch different errors despite functional similarity — suggests that compositional verification (combining multiple verification approaches) could substantially extend the ceiling. This is architecturally distinct from the temporal anchoring solution in Why does self-rewarding training collapse when responses improve? — one fixes the preference signal, the other expands the verification surface.
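
A hypothetical sketch of what compositional verification could look like under the non-overlap assumption (the verifier names in the comment are invented placeholders):

```python
def compositional_verify(solution, verifiers) -> bool:
    """Accept a solution only if every verifier passes it. Since different
    verifiers catch largely disjoint error classes (the non-overlap
    property), each added verifier expands the set of errors caught
    beyond what any single verifier provides."""
    return all(verifier(solution) for verifier in verifiers)

# e.g. verifiers = [passes_unit_tests, type_checks, llm_judge_approves]
```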

Promptbreeder as a practical bound-pusher for prompt optimization: Promptbreeder (Fernando et al., 2023) demonstrates a practical way to push against these bounds for prompt optimization specifically. It overcomes APE's "diminishing returns after three rounds" with a diversity-maintaining evolutionary algorithm in which mutation-prompts (instructions for modifying task-prompts) evolve alongside the task-prompts themselves — self-referential self-improvement run entirely through the LLM. Promptbreeder outperforms CoT and Plan-and-Solve on arithmetic and commonsense reasoning. The self-improvement is still bounded by the LLM's generation capability, however: mutation-prompts can only express modifications the model can articulate, and fitness evaluation depends on the model's own outputs. This makes Promptbreeder a concrete instantiation of the gap framework: the generation-verification gap sets the ceiling, and the evolutionary diversity mechanism delays the diversity collapse without eliminating it. Source: Prompts Prompting.
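
A heavily simplified sketch of one Promptbreeder-style step (binary tournament plus self-referential mutation; `llm` and `fitness` are assumed callables, and the hyper-mutation rate is an invented placeholder):

```python
import random

def evolve_step(population, llm, fitness):
    """One simplified iteration: a binary tournament in which the loser is
    overwritten by a mutated copy of the winner's task-prompt."""
    a, b = random.sample(population, 2)
    winner, loser = (a, b) if fitness(a["task"]) >= fitness(b["task"]) else (b, a)
    # First-order mutation: the mutation-prompt rewrites the task-prompt.
    loser["task"] = llm(loser["mutation"] + "\n\n" + winner["task"])
    # Hyper-mutation: occasionally rewrite the mutation-prompt itself --
    # this is what makes the process self-referential.
    if random.random() < 0.1:
        loser["mutation"] = llm("Improve this instruction: " + loser["mutation"])
    return population
```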

Empirical validation via evolutionary self-improvement (DGM): The Darwin Gödel Machine replaces formal self-improvement proofs with empirical validation — an evolutionary archive of past modifications, population-based search through code-level self-modifications, and fitness measured by benchmark performance. DGM improved its coding agent from 20.3% to 50.0% on SWE-bench Verified through iterative self-modification. This sidesteps the generation-verification gap by changing what "verification" means: instead of the model verifying its own outputs against a fixed standard, verification is empirical (does performance improve?) and historical (does the archive contain precedents?). The gap framework predicts this should work: empirical testing is a stronger verifier than self-evaluation, and evolutionary archives provide external reference points that prevent the diversity collapse that pure self-improvement suffers. See Can AI systems improve themselves through trial and error?.
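
A hypothetical skeleton of one such loop (not the released DGM code; the selection weighting and interfaces are invented):

```python
import random

def dgm_step(archive, self_modify, benchmark_score):
    """One Darwin-Gödel-Machine-style iteration. Verification is empirical
    (a benchmark score) rather than the agent's own judgment."""
    agents, scores = zip(*archive)
    # Score-weighted parent selection from the archive of past variants.
    parent = random.choices(agents, weights=[s + 1e-6 for s in scores])[0]
    child = self_modify(parent)      # the agent edits its own code
    score = benchmark_score(child)   # external, empirical verifier
    archive.append((child, score))   # keep even mediocre variants: the
    return archive                   # archive's stepping stones preserve diversity
```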

The generator-discriminator-critique gap provides concrete evidence. Saunders et al. (2022) fine-tune large language models to write natural language critiques of model outputs. On topic-based summarization, model-written critiques help humans find flaws they would have missed. However, "we failed to find a clear trend showing critique performance catching up to discriminator performance, implying that larger models still have relevant knowledge they don't articulate as critiques." This is a direct instantiation of the generation-verification gap: the model can discriminate quality (verification) better than it can explain what's wrong (generation of critique). The gap persists at scale, suggesting it is structural rather than a matter of insufficient training. Source: Arxiv/Evaluations.


Source: Self Refinement Self Consistency Feedback — Mind the Gap (arXiv 2412.02674)

Original note title

self-improvement is bounded by the generation-verification gap — a formal quantity that scales with pretraining compute and vanishes for factual tasks