Does self-revision actually improve reasoning in language models?
When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
Self-revision in o1-like models — prompted by tokens like "Wait" or "Alternatively" — does not reliably fix errors. The evidence from QwQ, R1, and LIMO shows:
- Most revisions retain the original (wrong) answer rather than correcting it
- Smaller models (R1-Distill-1.5B, QwQ) show a higher propensity to revise correct answers to incorrect ones than vice versa
- Longer CoTs contain more self-revisions, which helps explain why longer traces correlate with incorrect answers (a measurement sketch follows this list)
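One minimal way to check these claims on your own traces is to segment each CoT at revision markers and tally how the running answer changes across revisions. The sketch below is illustrative, not the papers' protocol: it assumes answers are stated as \boxed{...} and that "Wait" and "Alternatively" are the only revision markers; `extract_answer` and the marker list are hypothetical choices to adjust to the traces at hand.

```python
import re
from collections import Counter

# Assumed revision markers and answer format; adjust both to the traces
# actually being analyzed (not the papers' exact protocol).
REVISION_MARKERS = re.compile(r"\b(Wait|Alternatively)\b")
BOXED = re.compile(r"\\boxed\{([^}]*)\}")

def extract_answer(segment: str) -> str | None:
    """Return the last boxed answer stated in a reasoning segment, if any."""
    matches = BOXED.findall(segment)
    return matches[-1].strip() if matches else None

def revision_transitions(trace: str, gold: str) -> Counter:
    """Split a CoT at revision markers and tally how the running answer changes
    across each revision: right->right, right->wrong, wrong->right, wrong->wrong."""
    parts = REVISION_MARKERS.split(trace)
    segments = [p for p in parts if p and not REVISION_MARKERS.fullmatch(p)]
    transitions: Counter = Counter()
    prev = None
    for seg in segments:
        ans = extract_answer(seg) or prev  # carry the answer forward if none is restated
        if prev is not None and ans is not None:
            transitions[(
                "right" if prev == gold else "wrong",
                "right" if ans == gold else "wrong",
            )] += 1
        prev = ans
    return transitions
```

Summed over a batch of traces, a ("right", "wrong") count exceeding ("wrong", "right") is exactly the degradation pattern described in the bullets above.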
The irony is that self-revision is framed as a feature — the model reflecting on its own reasoning. But empirically, the reflection is often noise that introduces additional errors rather than catching existing ones. The model's capacity to evaluate its own correctness is limited, so its "reflection" is more likely to perturb a right answer than to save a wrong one.
This has implications for inference strategy: forcing models to self-revise (by suppressing the end-of-thinking token and appending "Wait") is more likely to degrade a good answer than to improve a bad one. The better alternative is to spend the budget on independent parallel chains; see "Why does parallel reasoning outperform single chain thinking?".
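For concreteness, here is a rough sketch of that forcing intervention: whenever the model tries to close its thinking block, the delimiter is cut and "Wait" is appended so generation continues. The model name, the "</think>" delimiter string, and the chat-template details are assumptions that depend on the specific o1-like model; this illustrates the intervention being cautioned against, not a recommended recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model and end-of-thinking delimiter; both depend on which
# o1-like model is being probed.
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
END_THINK = "</think>"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def force_revision(question: str, extra_rounds: int = 1, max_new_tokens: int = 2048) -> str:
    """Each time the model tries to close its thinking block, cut the delimiter,
    append 'Wait', and let it keep generating (the intervention discussed above)."""
    text = tok.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False, add_generation_prompt=True,
    )
    for round_idx in range(extra_rounds + 1):
        inputs = tok(text, return_tensors="pt", add_special_tokens=False).to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=False)
        if round_idx == extra_rounds or END_THINK not in text:
            return text  # final pass, or the model never closed its thinking block
        # Suppress the end-of-thinking delimiter and force another revision pass.
        text = text.split(END_THINK)[0] + "\nWait"
    return text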
The Degeneration-of-Thought finding (ReConcile) adds the mechanism: when a model is challenged by its own previous reasoning reframed as external criticism, it neither maintains its position nor improves; it capitulates with increasing confidence, ending more certain of the wrong answer than it started. This is the acute form: self-revision at the token level degrades accuracy, while self-revision at the model-vs-model level collapses calibration. The contrast between diverse multi-agent debate (which helps) and same-model challenge (which harms) confirms that the key variable is not revision depth but the source of the challenge. "Does a model improve by arguing with itself?" documents this contrastive finding.
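A sketch of that same-model challenge loop, assuming a hypothetical `chat` wrapper around whatever model client is in use: the model's own previous reasoning is fed back to it labeled as another agent's objection, and each round's answer and self-reported confidence are recorded to watch for the capitulation-with-rising-confidence pattern described above.

```python
# `chat` is a hypothetical wrapper around whatever model client is in use;
# it takes a message list and returns the assistant's reply text.
def chat(messages: list[dict]) -> str:
    raise NotImplementedError("plug in your model client here")

def self_challenge(question: str, rounds: int = 3) -> list[str]:
    """Ask once, then repeatedly present the model's own previous reasoning back
    to it as another agent's challenge, recording each answer and self-reported
    confidence to see whether it holds its position or capitulates."""
    history = [{
        "role": "user",
        "content": f"{question}\nGive your answer, your reasoning, and a confidence from 0 to 1.",
    }]
    replies = []
    for _ in range(rounds):
        reply = chat(history)
        replies.append(reply)
        history += [
            {"role": "assistant", "content": reply},
            # Reframe the model's own reasoning as external criticism.
            {"role": "user", "content": (
                "Another agent challenges your answer and argues:\n"
                f"{reply}\n"
                "Reconsider, then restate your final answer and a confidence from 0 to 1."
            )},
        ]
    return replies
```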
Source: Test Time Compute
Related concepts in this collection
- Why do correct reasoning traces contain fewer tokens?
In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
length-correctness correlation that follows from this
- Why does parallel reasoning outperform single chain thinking?
Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
the alternative that doesn't rely on self-revision
- Why do LLMs generate more novel research ideas than experts?
LLM-generated research ideas are statistically more novel than those from 100+ expert researchers, but the mechanisms behind this advantage and its practical implications remain unclear. Understanding this paradox could reshape how we use AI in creative knowledge work.
parallel self-assessment failure in a different domain: LLMs cannot evaluate the quality of their own generated research ideas, just as self-revision cannot reliably detect and fix its own reasoning errors
- Does a model improve by arguing with itself?
When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
extends with the mechanism: same-model challenge causes confidence collapse in wrong answers
- Do prior errors in context history amplify future errors?
When a language model makes mistakes early in a task, do those errors contaminate subsequent predictions? We explore whether error accumulation degrades long-horizon performance through passive context pollution rather than capability limits.
the passive counterpart: self-revision is active error injection through deliberate re-examination, while self-conditioning is passive error accumulation through context contamination — both degrade long-horizon reasoning but through different mechanisms
- How quickly do errors compound during model self-training?
When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
the training-time analog: self-revision compounds errors within a single generation by switching correct answers to incorrect ones, while error avalanching compounds errors across self-training iterations by learning from previous mistakes — both demonstrate that a model's own outputs are an unreliable correction signal
- Why does self-rewarding training collapse when responses improve?
Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?
self-revision failure at training scale: self-rewarding training uses the model's own judgment to create preference pairs, but gradient collapse when outputs converge is the same dynamic as self-revision degradation — the model cannot reliably distinguish better from worse among its own outputs, whether at inference-time (self-revision) or training-time (self-rewarding)
- Does reflection in reasoning models actually correct errors?
When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
refines the picture: self-revision does not just degrade — most "revision" tokens never genuinely revise, they confirm; the original claim applies to a small fraction of actual reflection while the majority is performative confirmation
- Is reflection in reasoning models actually fixing mistakes?
Do the thinking steps that appear after a model's first answer represent genuine self-correction, or are they mostly confirming what the model already concluded? Understanding this matters for how we train and deploy reasoning systems.
sharpens the implication: the value of training on reflection-style traces comes from improving the first answer, not from teaching genuine self-correction; self-revision's degradation is one tail of a distribution where the bulk of reflection produces no change
- Do reasoning traces actually cause correct answers?
Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
names the underlying error: framing self-revision as "the model reflecting on its own reasoning" anthropomorphizes a token-stylistic process; revision tokens are not metacognition, they are continued autoregressive generation that happens to use reflective vocabulary
Original note title: self-revision degrades reasoning accuracy in o1-like models