Can a model evaluate its own improvements without degrading over iterations?
This explores whether a model can act as its own judge — scoring and refining its own work across rounds — without quietly getting worse each time it does so.
The corpus gives a split answer: self-evaluation works when something keeps it honest, and degrades predictably when nothing does. The cleanest statement of the limit is the generation-verification gap: a model can only improve itself in domains where it judges solutions better than it produces them, and that margin shrinks toward zero on factual tasks (What limits how much models can improve themselves?; What stops large language models from improving themselves?). So "evaluate its own improvements" isn't one capability: it depends entirely on whether the model verifies more reliably than it generates for the task at hand.
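One way to make the gap concrete, as a minimal sketch: treat it as the accuracy gained by keeping only the answers the model itself accepts. The `Sample` record and the function names below are hypothetical, and this way of measuring the gap is a simplifying assumption for illustration, not necessarily the definition used in the cited notes.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    is_correct: bool      # ground truth: was the generated answer right?
    judged_correct: bool  # the same model's verdict on that answer


def generation_accuracy(samples):
    """Fraction of generated answers that are actually correct."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(s.is_correct for s in samples) / len(samples)


def accuracy_after_self_filter(samples):
    """Accuracy among only the answers the model accepted as correct."""
    kept = [s for s in samples if s.judged_correct]
    if not kept:
        return 0.0
    return sum(s.is_correct for s in kept) / len(kept)


def gv_gap(samples):
    """Assumed proxy for the generation-verification gap:
    how much accuracy improves when the model filters its own outputs."""
    return accuracy_after_self_filter(samples) - generation_accuracy(samples)


# A judge that accepts 4 of 5 correct answers and 1 of 5 wrong ones.
discerning = (
    [Sample(True, True)] * 4
    + [Sample(True, False)]
    + [Sample(False, True)]
    + [Sample(False, False)] * 4
)

# A judge that accepts every answer it generated.
rubber_stamp = [Sample(True, True)] * 5 + [Sample(False, True)] * 5
```

Under this proxy, the judge that accepts everything has a gap of exactly zero: filtering by its verdicts changes nothing, so there is no signal to improve on. The discerning judge has a positive gap because the answers it keeps are right more often than the answers it generates.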
The degradation isn't hypothetical, and it has a specific shape. A model asked to revise based on its own prior reasoning tends to grow *more* confident in wrong answers, not less, a failure mode distinct enough to have its own name: degeneration of thought (Does a model improve by arguing with itself?). The root cause is a structural bias: models over-trust answers they generated themselves, because a high-probability output simply *feels* correct when the same model grades it (Why do models trust their own generated answers?). Stack iterations on top of that and the errors compound: prior mistakes sitting in the context history bias the next step, producing sharp non-linear decay over long-horizon tasks (Do models fail worse when their own errors fill the context?). Iterative refinement can even reproduce "overthinking", with more rounds accumulating noise without guaranteed gains (Do iterative refinement methods suffer from overthinking?).
What breaks the spiral, consistently, is *diversity*: comparing against something other than yourself. Multi-agent debate between genuinely different models reverses the confidence-in-errors pattern and improves calibration (Does a model improve by arguing with itself?). Self-detection improves the moment a model compares its answer against broader alternatives instead of agreeing with itself (Why do models trust their own generated answers?). And the survey of reliable self-improvement methods makes the trick explicit: the ones that work all smuggle in an external anchor (a past model version, a third-party judge, user corrections, or tool feedback) even when they're marketed as "pure" self-improvement (Can models reliably improve themselves without external feedback?).
The surprise is how *much* a model can improve over iterations once you supply a verifier it can't fool. Transformers learning only from their own correct solutions (filtered for correctness, an external check) jump from 10-digit to 100-digit addition with exponential, non-saturating gains (Can transformers improve exponentially by learning from their own correct solutions?). Asymmetric self-play has a proposer invent problems and a solver learn via majority-vote verification, both improving through RL with no human labels (Can language models improve themselves without any external training data?). SERL alternates a model between answering and judging, deriving reward from ranking consistency (Can models learn to judge themselves without external rewards?). The Darwin Gödel Machine keeps an evolutionary archive and validates variants by benchmarking rather than self-belief (Can AI systems improve themselves through trial and error?).
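The "learn only from your own correct solutions" pattern can be sketched as a generate-then-filter step. This is a toy under stated assumptions: `noisy_add` is a hypothetical stand-in for a model that sometimes errs, `external_check` stands in for whatever verifier the generator cannot influence, and the retraining step that would consume the kept examples is omitted. It is not the procedure from the cited notes.

```python
import random


def noisy_add(a, b, error_rate, rng):
    """Hypothetical stand-in for a model: returns a + b, sometimes off by one."""
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.choice([-1, 1])
    return answer


def external_check(a, b, answer):
    """Verifier outside the generator's control: exact integer arithmetic."""
    return answer == a + b


def collect_verified(num_problems, digits, error_rate, seed=0):
    """Generate answers, then keep only those that pass the external check.

    The returned (a, b, answer) triples are what a next training round
    would learn from; rejected attempts never enter the dataset.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(num_problems):
        a = rng.randrange(10 ** digits)
        b = rng.randrange(10 ** digits)
        answer = noisy_add(a, b, error_rate, rng)
        if external_check(a, b, answer):
            kept.append((a, b, answer))
    return kept


verified = collect_verified(num_problems=200, digits=10, error_rate=0.3)
```

The design point is that the filter never asks the generator whether it believes its answer. Wrong attempts are dropped by a check that does not share the generator's biases, so they cannot contaminate the next round.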
The through-line worth taking away: the thing that makes self-evaluation degrade isn't iteration itself, it's *self-agreement*, a model grading its own output with its own biases. Every method that iterates without collapsing has quietly replaced "do I think this is better?" with a check the model can't talk itself out of: a correctness filter, a vote, a different model, a benchmark. Notably, raw scale doesn't rescue you here: bigger models still self-condition on their errors, and only spending compute at test time to keep contaminated context from biasing the next step reliably helps (Do models fail worse when their own errors fill the context?).
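Of the checks listed above, a vote is the simplest to sketch. The function below is a hypothetical illustration: it accepts an answer only when a strict majority of independent samples agree, and abstains otherwise. The name `majority_vote` and the `min_agreement` threshold are assumptions, not details taken from the cited notes.

```python
from collections import Counter


def majority_vote(answers, min_agreement=0.5):
    """Return the most common answer if its share exceeds min_agreement.

    Returns None when there are no answers or no answer is common enough,
    so a split vote produces no training signal rather than a guess.
    """
    if not answers:
        return None
    winner, count = Counter(answers).most_common(1)[0]
    if count / len(answers) > min_agreement:
        return winner
    return None
```

For example, `majority_vote(["12", "12", "13", "12"])` returns `"12"`, while `majority_vote(["1", "2", "3"])` returns `None` because no answer reaches a majority. Abstaining on disagreement is the relevant design choice: the verdict comes from agreement across samples, not from one sample's confidence in itself.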
Sources (11 notes)
Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
Sequential revision methods share the same failure architecture as token-level overthinking: they accumulate noise without guaranteed improvement. Progressive Draft Refinement avoids this by compressing memory between iterations, outperforming longer reasoning traces at matched compute.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.
DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.