Does self-revision actually improve reasoning in large language models?
This explores whether an LLM checking and rewriting its own reasoning actually makes its answers more correct — and the corpus suggests it usually doesn't, unless the correction signal comes from outside the model.
The short answer from the corpus is counterintuitive: self-revision, left to the model itself, tends to leave accuracy flat or make it worse, not better. Direct measurements on o1-style reasoning models (QwQ, R1, LIMO) show that most revisions keep a wrong answer wrong, and smaller models frequently flip a correct answer to an incorrect one mid-revision. Longer chains with more second-guessing actually correlate with *lower* accuracy (see "Does self-revision actually improve reasoning in language models?"). So more deliberation is not automatically more truth.
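To make the measurement concrete, here is a minimal sketch of how revision outcomes could be tallied. It assumes each revision step has already been graded for correctness before and after; the record format, the function names `tally_revisions` and `net_benefit`, and the toy data are all hypothetical and are not taken from the cited studies.

```python
from collections import Counter

# Human-readable label for each (correct_before, correct_after) pair.
TRANSITIONS = {
    (False, False): "wrong->wrong",
    (False, True): "wrong->correct",
    (True, False): "correct->wrong",
    (True, True): "correct->correct",
}

def tally_revisions(records):
    """Count how each revision step changed correctness.

    records: iterable of (correct_before, correct_after) booleans,
    one pair per revision step (an assumed format for this sketch).
    """
    return Counter(TRANSITIONS[(bool(b), bool(a))] for b, a in records)

def net_benefit(counts):
    """Fixes minus breakages; a negative value means revision hurt overall."""
    return counts["wrong->correct"] - counts["correct->wrong"]

# Made-up toy records for illustration only.
toy_records = [
    (False, False),
    (False, False),
    (True, False),
    (False, True),
    (True, True),
]
counts = tally_revisions(toy_records)
print(dict(counts))
print("net benefit:", net_benefit(counts))
```

Separating the four transitions matters because a flat overall accuracy can hide revisions that fix some answers while breaking others.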
The pivotal variable turns out to be the *source* of the critique, not the act of revising. When an external model guides the revision, accuracy improves; when a model audits its own uncertain output, it typically amplifies its confidence in the wrong answer rather than catching the mistake (see "Does revising your own reasoning actually help or hurt?"). There's a clean mechanistic reason for this self-sabotage: models carry a structural bias toward trusting answers they generated themselves, because their own high-probability tokens simply *feel* more correct during evaluation. The fix is comparison against outside alternatives, which breaks the self-agreement loop (see "Why do models trust their own generated answers?"). This connects to a deeper formal limit: self-improvement is bounded by the generation-verification gap, meaning a model can't reliably validate a fix without something external to check it against (see "What stops large language models from improving themselves?").
But 'external' doesn't have to mean a human or a bigger model standing over its shoulder. The more interesting thread in the collection is that self-correction *can* be trained in; it just can't be improvised at inference time. Supervised fine-tuning on pre-recorded correction traces fails, because the errors in training don't match the errors at test time and models collapse into a single canned 'correction' move. What works is multi-turn online reinforcement learning on the model's *own* live mistakes, so it practices fixing the errors it actually makes (see "Why does self-correction training on offline data fail?"). Related work shows models can even learn to compute their own reward in the unused sequence space after their answer, internalizing self-evaluation during training at zero inference cost (see "Can models learn to evaluate their own work during training?"). Proposer-solver self-play, in turn, can manufacture an external-feeling verification signal without human labels (see "Can language models improve themselves without any external training data?").
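The sources list below describes the proposer-solver setup as verifying the solver by majority vote. As a rough illustration of how a vote can stand in for a missing ground-truth label, the sketch below rewards each sampled answer that agrees with the most common one. The function name `majority_vote_rewards`, the 0/1 reward values, and the sample answers are assumptions made for this example; the source does not specify the actual reward design.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Return (majority_answer, rewards) for a batch of sampled answers.

    Each answer gets reward 1.0 if it matches the most common answer,
    else 0.0. No ground-truth label is consulted (assumed reward scheme).
    """
    if not answers:
        return None, []
    majority, _count = Counter(answers).most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]

# Hypothetical final answers sampled from a solver for one problem.
samples = ["12", "12", "15", "12"]
majority, rewards = majority_vote_rewards(samples)
print(majority, rewards)
```

The vote only measures agreement among samples, so it can reward a consistently repeated wrong answer; that is one reason to treat it as an approximate signal rather than a verifier.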
It's also worth zooming out on what 'reasoning' is doing in the first place, because it reframes why revision underdelivers. Frontier reasoning models that look fluent at reflection, such as DeepSeek-R1 and o1-preview, reach only about 20-23.6% exact match on constraint-satisfaction problems that demand genuine backtracking; the appearance of careful reflection doesn't translate to competence on unfamiliar structures (see "Can reasoning models actually sustain long-chain reflection?"). And failures track instance *novelty*, not problem complexity: models lean on pattern-matched instances rather than general algorithms, so revising a chain doesn't help when the underlying approach was a memorized template that doesn't fit (see "Do language models fail at reasoning due to complexity or novelty?"). If the model never had the right method, re-reading its own work won't conjure one.
The less obvious takeaway: the corpus quietly dissolves the romantic picture of an AI 'thinking harder' and getting wiser. Real gains in reasoning seem to come from a small set of high-entropy 'forking' decisions where the model commits to a direction (see "Do high-entropy tokens drive reasoning model improvements?") and from baking verification into training, far more than from after-the-fact self-revision. Reflection that isn't anchored to an external check is, at best, theater; at worst, it's a confidence machine for wrong answers.
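For readers who want to see what a high-entropy 'forking' position means operationally, here is a minimal sketch that computes the Shannon entropy of each next-token distribution and flags the highest-entropy fraction of positions. The 20% default echoes the figure in the sources list; the function names, the tie handling, and the toy distributions are assumptions for illustration, not the procedure used in the cited work.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_mask(distributions, top_fraction=0.2):
    """Flag the positions whose entropy falls in the top `top_fraction`.

    distributions: one probability list per token position.
    Ties at the cutoff are all kept, so slightly more than the
    requested fraction can be flagged (a choice made for this sketch).
    """
    if not distributions:
        return []
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(top_fraction * len(entropies)))
    cutoff = sorted(entropies, reverse=True)[k - 1]
    return [h >= cutoff for h in entropies]

# Toy distributions over a two-token vocabulary, one per position.
toy_positions = [
    [1.0, 0.0],    # fully confident
    [0.5, 0.5],    # maximally uncertain: a fork
    [0.9, 0.1],
    [0.99, 0.01],
    [0.8, 0.2],
]
mask = forking_mask(toy_positions)
print(mask)
```

In a training setting, a mask like this could be used to restrict updates to the flagged positions, which is the idea the source attributes to training exclusively on forking tokens.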
Sources (10 notes)
Does self-revision actually improve reasoning in language models?
Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect ones during revision, and longer chains with more revisions correlate with lower accuracy.
Does revising your own reasoning actually help or hurt?
Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.
Why do models trust their own generated answers?
LLMs exhibit a structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
What stops large language models from improving themselves?
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Why does self-correction training on offline data fail?
SFT on offline correction traces fails because training errors don't match test errors and models collapse into a single correction mode. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
Can models learn to evaluate their own work during training?
Post-Completion Learning exploits the unused sequence space after the model's output to teach self-assessment during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Can language models improve themselves without any external training data?
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
Can reasoning models actually sustain long-chain reflection?
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
Do language models fail at reasoning due to complexity or novelty?
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
Do high-entropy tokens drive reasoning model improvements?
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.