Why does external verification stop error amplification but internal self-assessment enable it?

This explores why a model checking its own work tends to compound mistakes, while a separate verifier breaks that loop — and where the line between them actually blurs.

Put differently: what is it about an *external* signal that halts runaway errors, when a model grading itself seems to accelerate them? The corpus has a clean mechanistic answer, and then a set of papers that complicate it.

The core mechanism is a trust bias. Models systematically over-value answers they generated themselves, because a high-probability output simply *feels* correct when the same model evaluates it (see "Why do models trust their own generated answers?"). So self-assessment isn't a neutral check; it's a feedback loop that confirms whatever the model already leaned toward. Two papers show what that loop does over time. In self-training, small inaccuracies in model-generated data amplify rapidly and stall improvement within a few rounds (see "How quickly do errors compound during model self-training?"). Within a single long task, once prior mistakes fill the context, the model conditions on them and degrades non-linearly, and scaling the model up does not fix it (see "Do models fail worse when their own errors fill the context?"). The reflection meant to catch errors turns out to be mostly theater: across eight models, reflections rarely change the initial answer and the traces don't faithfully explain the reasoning (see "Can we actually trust reasoning model outputs?").

Why external verification breaks the loop is the deeper structural claim. Self-improvement is formally bounded by a *generation-verification gap*: a model can only validate what it can already reliably judge, so it cannot lift itself past its own ceiling through introspection alone (see "What stops large language models from improving themselves?"). Every method that *looks* like pure self-improvement actually smuggles in an external anchor: a past model version, a third-party judge, user corrections, or tool feedback (see "Can models reliably improve themselves without external feedback?"). The external signal works precisely because it isn't correlated with the model's own confidence. It can disagree, and disagreement is what stops amplification.
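The difference between a judge that shares the generator's confidence and one that can disagree can be made concrete with a toy recurrence. This is a minimal sketch under invented assumptions, not a model taken from any of the cited papers: the function names, the `drift` term (fresh error added each round), and every acceptance rate are hypothetical and chosen only for illustration. A "self" judge accepts wrong answers as readily as correct ones, so it filters nothing; an "external" judge accepts wrong answers less often.

```python
def next_error_rate(error, accept_correct, accept_wrong, drift=0.05):
    """One round of training on your own accepted outputs (toy model).

    All parameters are hypothetical illustration values.
    """
    # Generation adds a little fresh error on top of what the model inherited.
    generated_error = error + drift * (1.0 - error)
    # The judge keeps a share of the wrong answers and a share of the correct ones.
    kept_wrong = generated_error * accept_wrong
    kept_correct = (1.0 - generated_error) * accept_correct
    # The next model inherits the error rate of the data that was kept.
    return kept_wrong / (kept_wrong + kept_correct)


def run(rounds, accept_correct, accept_wrong, start=0.10):
    """Return the error rate before training and after each round."""
    history = [start]
    for _ in range(rounds):
        history.append(next_error_rate(history[-1], accept_correct, accept_wrong))
    return history


# Correlated judge: equally happy with right and wrong answers, so no filtering.
self_judged = run(5, accept_correct=0.9, accept_wrong=0.9)
# Independent judge: rejects most wrong answers because it can disagree.
external = run(5, accept_correct=0.9, accept_wrong=0.2)

print("self-judged:", [round(e, 3) for e in self_judged])
print("external:   ", [round(e, 3) for e in external])
```

In this toy setup the self-judged error rate rises every round, because nothing is ever filtered out, while the externally judged rate falls below its starting point. The only thing that differs between the two runs is whether the judge's acceptance depends on correctness.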

Here's the twist the corpus hands you: the boundary isn't really internal-vs-external, it's *correlated-vs-independent*. Several papers get self-assessment to work, but only by manufacturing independence inside the model. SERL has the model alternate between answering and judging, deriving reward from *consistency across rankings* rather than raw confidence, and it improves without external signals (see "Can models learn to judge themselves without external rewards?"). RLPR and INTUITOR use the model's own token probabilities as reward and extend RL reasoning to domains with no verifier at all (see "Can model confidence alone replace external answer verification?"). Self-correction can even be trained, but only with online RL on the model's *actual* mistakes; train it on offline correction traces and it collapses, because the errors it practices on don't match the ones it makes (see "Why does self-correction training on offline data fail?").
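To illustrate what a consistency-derived signal might look like, here is a minimal sketch. It is not SERL's actual reward formula, which the notes here do not spell out; the function names and the example judgments are hypothetical. The idea shown is only this: ask the judge to compare each pair of candidate answers in both presentation orders, and trust a verdict only when the two orders agree.

```python
from itertools import combinations


def order_consistency(judgments, n):
    """Fraction of answer pairs judged the same way in both presentation orders.

    judgments[(i, j)] holds the index of the winner when answer i is shown
    first and answer j second (hypothetical data layout).
    """
    pairs = list(combinations(range(n), 2))
    agree = sum(1 for i, j in pairs if judgments[(i, j)] == judgments[(j, i)])
    return agree / len(pairs)


def consistent_wins(judgments, n):
    """Count wins per answer, using only pairs where both orders agree."""
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        winner = judgments[(i, j)]
        if winner == judgments[(j, i)]:
            wins[winner] += 1
    return wins


# Three candidate answers. The judge flips its verdict on the pair (0, 2)
# depending on which answer is shown first, so that pair is discarded.
judgments = {
    (0, 1): 0, (1, 0): 0,
    (0, 2): 0, (2, 0): 2,
    (1, 2): 1, (2, 1): 1,
}

consistency = order_consistency(judgments, 3)
wins = consistent_wins(judgments, 3)
print("consistency:", consistency)
print("wins from consistent pairs:", wins)
```

The signal here comes from agreement between separate judgments, not from how confident the judge was in any single one, which is the sense in which consistency manufactures a degree of independence.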

The sharpest reframing is *where* you verify. Checking the final answer is where self-assessment's trust bias bites hardest, because the model just re-endorses its conclusion. But verifying the *process* (intermediate states and policy compliance during generation) catches failures that final scoring misses entirely, raising task success from 32% to 87% (see "Where do reasoning agents actually fail during long traces?"), and asynchronous verifiers can do this alongside generation at near-zero latency cost (see "Can verifiers monitor reasoning without slowing generation down?"). So the real answer isn't that internal assessment is doomed. Internal assessment amplifies error whenever the judge shares the generator's biases, and stops it whenever you engineer independence: a separate verifier, a consistency constraint, online error distributions, or a shift from grading the answer to auditing the steps.
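The gap between grading the answer and auditing the steps can be shown with a small sketch. Everything in it is hypothetical: the account-balance trace, the "never go negative" policy, and the function names are invented for illustration and are not drawn from the cited work. The trace ends on the expected value, so a final-answer check passes, while a per-step check finds the policy violation in the middle.

```python
def final_answer_ok(trace, expected):
    """Outcome check: only the last state is compared to the expected result."""
    return trace[-1] == expected


def first_violation(trace, invariant):
    """Process check: return the index of the first state that breaks the
    invariant, or None if every intermediate state complies."""
    for step, state in enumerate(trace):
        if not invariant(state):
            return step
    return None


# Hypothetical agent trace: an account balance after each action.
balances = [100, 40, -20, 30]


def never_negative(balance):
    # Hypothetical policy the agent is supposed to respect at every step.
    return balance >= 0


outcome_passes = final_answer_ok(balances, expected=30)
violation_step = first_violation(balances, never_negative)

print("final answer accepted:", outcome_passes)
print("first policy violation at step:", violation_step)
```

A verifier that only scores the last state would accept this run. One that inspects each state can stop at the violating step, before the bad state becomes context for later steps.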

Sources (11 notes)

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

How quickly do errors compound during model self-training?

Small inaccuracies in model-generated training data amplify rapidly across iterations, degrading performance unless self-consistency checks filter outputs. The effect stalls improvement within a few steps, setting an error floor based on verification quality rather than actual capability.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.

Can model confidence alone replace external answer verification?

RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Can verifiers monitor reasoning without slowing generation down?

Decoupling verification from generation lets verifiers run alongside a single trace, forking to extract verifiable state and intervening only on violations. On correct runs the latency penalty is near-zero; interwhen matches or beats CoT across benchmarks at similar token budgets.
