Does deliberate self-revision introduce different errors than passive context contamination?

This explores whether the mistakes a model makes when it actively reconsiders its own work (self-revision) are a different kind of failure than the mistakes it makes when bad earlier outputs simply pile up in its context window (passive contamination).

The corpus suggests the two do fail in distinct ways, though they share a common root: a model's structural tendency to trust itself. The two failure modes look different on the surface. Passive contamination is a drift: once errors enter the context history, they bias everything that follows, and performance degrades non-linearly as the errors accumulate. The model isn't deciding anything; it's just being dragged down by what's already on the page (see: Do models fail worse when their own errors fill the context?). Deliberate self-revision is an active failure: the model looks back at its own answer, decides to reconsider it, and rarely makes it better. Most revisions keep wrong answers wrong, smaller models frequently flip correct answers to incorrect ones, and longer revision chains correlate with lower accuracy (see: Does self-revision actually improve reasoning in language models?).
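To make the "drift" picture concrete, here is a toy sketch, not a model taken from any of the cited notes. It assumes, purely for illustration, that the chance of a mistake at each step equals a base rate plus a penalty for every error already sitting in the context. The function name `step_accuracy` and the parameters `base_error` and `contamination` are hypothetical.

```python
def step_accuracy(horizon, base_error, contamination):
    """Expected accuracy at each step of a toy self-conditioning process.

    Assumption (illustrative only): the probability of an error at a step is
    base_error + contamination * (number of errors already in context),
    capped at 1.0.
    """
    # dist[k] = probability that exactly k errors are in the context so far
    dist = [1.0]
    accuracies = []
    for _ in range(horizon):
        next_dist = [0.0] * (len(dist) + 1)
        expected_correct = 0.0
        for k, prob in enumerate(dist):
            p_err = min(1.0, base_error + contamination * k)
            expected_correct += prob * (1.0 - p_err)
            next_dist[k] += prob * (1.0 - p_err)   # step was correct
            next_dist[k + 1] += prob * p_err       # step added one more error
        accuracies.append(expected_correct)
        dist = next_dist
    return accuracies


# Same base error rate, with and without sensitivity to earlier errors.
clean = step_accuracy(horizon=20, base_error=0.05, contamination=0.0)
drift = step_accuracy(horizon=20, base_error=0.05, contamination=0.10)

for step in (0, 9, 19):
    print(step + 1, round(clean[step], 3), round(drift[step], 3))
```

With `contamination=0.0` the per-step accuracy stays flat; with any positive value it falls at every step, because each error raises the odds of the next one. The specific numbers mean nothing beyond the toy assumption.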

But here's the thing you might not expect: a lot of what looks like self-revision isn't even active. Analysis of 8 reasoning models shows that 'reflection' is mostly theater: the reconsideration steps rarely change the answer and mostly serve to confirm the first one. Training on longer reflection chains improves the quality of the first answer, not the model's ability to actually correct itself (see: Is reflection in reasoning models actually fixing mistakes?). So one of the 'errors' deliberate revision introduces is illusory work: motion that feels corrective but is really post-hoc confirmation.

Underneath both modes sits the same engine: a model is structurally biased toward validating its own outputs, because a high-probability answer it already generated simply feels more correct when the model re-evaluates it (see: Why do models trust their own generated answers?). That's why passive contamination compounds (the model trusts the bad context) and why active revision amplifies confidence in wrong answers rather than fixing them. The decisive variable isn't whether revision is active or passive; it's where the corrective signal comes from. Revision guided by an external model improves accuracy, while a model revising its own uncertain output degrades it (see: Does revising your own reasoning actually help or hurt?). This is formalized as the generation-verification gap: reliable self-improvement is bounded, and every dependable fix needs something external to validate it. Metacognition alone can't escape the loop (see: What stops large language models from improving themselves?).
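The point about where the corrective signal comes from can be sketched as a revision loop whose acceptance rule is pluggable. This is a minimal illustration under stated assumptions, not an implementation from the cited notes: `revise`, `flaky_propose`, `self_accept`, and `external_check` are all hypothetical names, and the "model" is a stub that simply second-guesses an arithmetic answer.

```python
from typing import Callable


def revise(answer: str,
           propose: Callable[[str], str],
           accept: Callable[[str, str], bool],
           max_rounds: int = 3) -> str:
    """Run up to max_rounds of revision, keeping a change only if `accept` approves it."""
    current = answer
    for _ in range(max_rounds):
        candidate = propose(current)
        if candidate == current:
            break  # the proposer has nothing new to offer
        if accept(current, candidate):
            current = candidate
    return current


def flaky_propose(current: str) -> str:
    # Hypothetical stand-in for a model that always second-guesses itself
    # on the question "what is 17 * 24?" (the correct answer is 408).
    return "418" if current == "408" else "408"


def self_accept(old: str, new: str) -> bool:
    # Self-revision: the model trusts whatever it generated most recently.
    return True


def external_check(old: str, new: str) -> bool:
    # External verification: an independent check that does not rely on the model.
    return int(new) == 17 * 24


print(revise("408", flaky_propose, self_accept, max_rounds=1))
print(revise("408", flaky_propose, external_check, max_rounds=1))
print(revise("418", flaky_propose, external_check))
```

The revision step is identical in every call; only the acceptance rule changes. With `self_accept`, a correct starting answer gets flipped to a wrong one. With `external_check`, correct answers survive and wrong ones get repaired.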

There's a hopeful wrinkle, though, that sharpens the distinction. Self-correction *can* be trained, but only when the model practices on its own real mistakes through multi-turn online reinforcement learning. Supervised fine-tuning on offline correction traces fails because the errors in training don't match the errors at test time, and the model collapses into a single rote correction move (see: Why does self-correction training on offline data fail?). And for passive contamination specifically, the fix is different again: scaling the model doesn't help, but test-time 'thinking' compute reduces the effect by preventing the error-laden context from biasing reasoning in the first place (see: Do models fail worse when their own errors fill the context?). So the two failure modes don't just differ in mechanism; they respond to different remedies. Contamination is treated by insulating reasoning from poisoned context; bad self-revision is treated by importing an external verifier or training on the model's authentic error distribution. The shared lesson is that a model left alone with its own outputs, whether it's passively reading them or actively second-guessing them, tends to dig in rather than recover.

Sources (7 notes)

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
