How does self-revision on wrong answers increase model confidence further?

This explores why a model that re-examines its own wrong answer tends to dig in harder rather than fix the mistake — and what the corpus says is actually happening when a model 'reconsiders.'

This explores the strange dynamic where asking a model to revise a wrong answer often makes it *more* sure of being wrong, not less. The corpus traces this back to a structural bias: models systematically over-trust answers they generated themselves, because a high-probability output simply *feels* correct when the same model evaluates it Why do models trust their own generated answers?. Self-revision feeds that loop. When a model reconsiders based only on its own prior reasoning, it isn't sampling a fresh perspective — it's re-encountering the same confident error and ratifying it, a failure mode sharp enough to have its own name: degeneration of thought Does a model improve by arguing with itself?.

The clearest finding in the collection is that the *source* of the critique, not the act of revising, decides the outcome. Revision guided by an external model improves accuracy; a model revising its own uncertain output usually amplifies confidence in the wrong answer instead of correcting it Does revising your own reasoning actually help or hurt?. Several papers converge on this from different angles: across QwQ, R1, and LIMO, most revisions *retain* the wrong answer, and smaller models even flip correct answers to incorrect — with longer revision chains correlating with lower accuracy, not higher Does self-revision actually improve reasoning in language models?. An analysis of eight reasoning models goes further, suggesting much of what looks like 'reflection' is theater: reflections rarely change the answer and mostly serve as post-hoc confirmation of the first guess Is reflection in reasoning models actually fixing mistakes?.

There's a mechanical reason the spiral worsens over a long chain. Once an error sits in the context window, it biases everything that follows — models degrade non-linearly when their own prior mistakes contaminate their working history, and a wrong step in context actively conditions the next wrong step Do models fail worse when their own errors fill the context?Can model confidence work as a reward signal for reasoning?. So self-revision doesn't just fail to correct; it can manufacture new evidence (the model's own confident-but-wrong reasoning) that the model then treats as support. The thing you'd hope breaks the loop — looking again — is exactly what tightens it.

What actually reverses it is introducing genuine difference. Multi-agent debate between *different* models breaks the self-agreement loop and improves both accuracy and calibration, because the disagreement supplies the outside view a single model can't generate for itself Does a model improve by arguing with itself?. And training-time approaches that work tend to make the model practice on its *actual* mistakes under reinforcement learning, rather than rehearse idealized correction traces — offline self-correction data fails precisely because the model's training-time errors don't match the errors it makes at test time Why does self-correction training on offline data fail?.

The quietly unsettling takeaway: confidence and correctness come apart here in a way that's easy to miss. A confident model is more robust to prompt rephrasing Does model confidence predict robustness to prompt changes?, and confidence can even be harnessed as a useful reward signal that *restores* calibration when wired in deliberately Can model confidence work as a reward signal for reasoning?Can model confidence alone replace external answer verification? — but left to revise itself unsupervised, a model's growing confidence is just as likely to be a measure of how thoroughly it has talked itself into the wrong answer.

Sources 10 notes

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does a model improve by arguing with itself?

Models that reconsider answers based on their own previous reasoning become more confident in errors, not less. Multi-agent debate with genuinely different models reverses this pattern, improving both accuracy and calibration.

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Can model confidence alone replace external answer verification?

RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.

How does self-revision on wrong answers increase model confidence further?

Sources 10 notes

Next inquiring lines