Does self-revision actually improve reasoning in language models?
When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
Self-revision in o1-like models — prompted by tokens like "Wait" or "Alternatively" — does not reliably fix errors. The evidence from QwQ, R1, and LIMO shows:
- Most revisions retain the original (wrong) answer rather than correcting it
- Smaller models (R1-Distill-1.5B, QwQ) show a higher propensity to revise correct answers to incorrect ones than vice versa
- Longer CoTs contain more self-revisions, which helps explain why longer traces correlate with incorrect answers (a measurement sketch follows this list)
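One minimal way to check these claims on your own traces is to segment each CoT at revision markers and tally how the running answer changes across revisions. The sketch below is illustrative, not the papers' protocol: it assumes answers are stated as \boxed{...} and that "Wait" and "Alternatively" are the only revision markers; `extract_answer` and the marker list are hypothetical choices to adjust to the traces at hand.

```python
import re
from collections import Counter

# Assumed revision markers and answer format; adjust both to the traces
# actually being analyzed (not the papers' exact protocol).
REVISION_MARKERS = re.compile(r"\b(Wait|Alternatively)\b")
BOXED = re.compile(r"\\boxed\{([^}]*)\}")

def extract_answer(segment: str) -> str | None:
    """Return the last boxed answer stated in a reasoning segment, if any."""
    matches = BOXED.findall(segment)
    return matches[-1].strip() if matches else None

def revision_transitions(trace: str, gold: str) -> Counter:
    """Split a CoT at revision markers and tally how the running answer changes
    across each revision: right->right, right->wrong, wrong->right, wrong->wrong."""
    parts = REVISION_MARKERS.split(trace)
    segments = [p for p in parts if p and not REVISION_MARKERS.fullmatch(p)]
    transitions: Counter = Counter()
    prev = None
    for seg in segments:
        ans = extract_answer(seg) or prev  # carry the answer forward if none is restated
        if prev is not None and ans is not None:
            transitions[(
                "right" if prev == gold else "wrong",
                "right" if ans == gold else "wrong",
            )] += 1
        prev = ans
    return transitions
```

Summed over a batch of traces, a ("right", "wrong") count exceeding ("wrong", "right") is exactly the degradation pattern described in the bullets above.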
The irony is that self-revision is framed as a feature — the model reflecting on its own reasoning. But empirically, the reflection is often noise that introduces additional errors rather than catching existing ones. The model's capacity to evaluate its own correctness is limited, so its "reflection" is more likely to perturb a right answer than to save a wrong one.
This has implications for inference strategy: forcing models to self-revise (by suppressing the end-of-thinking token and appending "Wait") is more likely to degrade a good answer than to improve a bad one. The better alternative is to spend the budget on independent parallel chains; see "Why does parallel reasoning outperform single chain thinking?".
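For concreteness, here is a rough sketch of that forcing intervention: whenever the model tries to close its thinking block, the delimiter is cut and "Wait" is appended so generation continues. The model name, the "</think>" delimiter string, and the chat-template details are assumptions that depend on the specific o1-like model; this illustrates the intervention being cautioned against, not a recommended recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model and end-of-thinking delimiter; both depend on which
# o1-like model is being probed.
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
END_THINK = "</think>"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def force_revision(question: str, extra_rounds: int = 1, max_new_tokens: int = 2048) -> str:
    """Each time the model tries to close its thinking block, cut the delimiter,
    append 'Wait', and let it keep generating (the intervention discussed above)."""
    text = tok.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False, add_generation_prompt=True,
    )
    for round_idx in range(extra_rounds + 1):
        inputs = tok(text, return_tensors="pt", add_special_tokens=False).to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=False)
        if round_idx == extra_rounds or END_THINK not in text:
            return text  # final pass, or the model never closed its thinking block
        # Suppress the end-of-thinking delimiter and force another revision pass.
        text = text.split(END_THINK)[0] + "\nWait"
    return text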
The Degeneration-of-Thought finding (ReConcile) adds the mechanism: when a model is challenged by its own previous reasoning reframed as external criticism, it neither maintains its position nor improves; it capitulates with increasing confidence, ending more certain of the wrong answer than it started. This is the acute form: self-revision at the token level degrades accuracy, while self-revision at the model-vs-model level collapses calibration. The contrast between diverse multi-agent debate (which helps) and same-model challenge (which harms) confirms that the key variable is not revision depth but the source of the challenge. "Does a model improve by arguing with itself?" documents this contrastive finding.
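A sketch of that same-model challenge loop, assuming a hypothetical `chat` wrapper around whatever model client is in use: the model's own previous reasoning is fed back to it labeled as another agent's objection, and each round's answer and self-reported confidence are recorded to watch for the capitulation-with-rising-confidence pattern described above.

```python
# `chat` is a hypothetical wrapper around whatever model client is in use;
# it takes a message list and returns the assistant's reply text.
def chat(messages: list[dict]) -> str:
    raise NotImplementedError("plug in your model client here")

def self_challenge(question: str, rounds: int = 3) -> list[str]:
    """Ask once, then repeatedly present the model's own previous reasoning back
    to it as another agent's challenge, recording each answer and self-reported
    confidence to see whether it holds its position or capitulates."""
    history = [{
        "role": "user",
        "content": f"{question}\nGive your answer, your reasoning, and a confidence from 0 to 1.",
    }]
    replies = []
    for _ in range(rounds):
        reply = chat(history)
        replies.append(reply)
        history += [
            {"role": "assistant", "content": reply},
            # Reframe the model's own reasoning as external criticism.
            {"role": "user", "content": (
                "Another agent challenges your answer and argues:\n"
                f"{reply}\n"
                "Reconsider, then restate your final answer and a confidence from 0 to 1."
            )},
        ]
    return replies
```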
Source: Test Time Compute
Related concepts in this collection
- Why do correct reasoning traces contain fewer tokens?
In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
length-correctness correlation that follows from this
- Why does parallel reasoning outperform single chain thinking?
Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
the alternative that doesn't rely on self-revision
- Why do LLMs generate more novel research ideas than experts?
LLM-generated research ideas are statistically more novel than those from 100+ expert researchers, but the mechanisms behind this advantage and its practical implications remain unclear. Understanding this paradox could reshape how we use AI in creative knowledge work.
parallel self-assessment failure in a different domain: LLMs cannot evaluate the quality of their own generated research ideas, just as self-revision cannot reliably detect and fix its own reasoning errors
- Does a model improve by arguing with itself?
When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
extends with the mechanism: same-model challenge causes confidence collapse in wrong answers
- Do prior errors in context history amplify future errors?
When a language model makes mistakes early in a task, do those errors contaminate subsequent predictions? We explore whether error accumulation degrades long-horizon performance through passive context pollution rather than capability limits.
the passive counterpart: self-revision is active error injection through deliberate re-examination, while self-conditioning is passive error accumulation through context contamination — both degrade long-horizon reasoning but through different mechanisms
- How quickly do errors compound during model self-training?
When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
the training-time analog: self-revision compounds errors within a single generation by switching correct answers to incorrect ones, while error avalanching compounds errors across self-training iterations by learning from previous mistakes — both demonstrate that a model's own outputs are an unreliable correction signal
- Why does self-rewarding training collapse when responses improve?
Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?
self-revision failure at training scale: self-rewarding training uses the model's own judgment to create preference pairs, but gradient collapse when outputs converge is the same dynamic as self-revision degradation — the model cannot reliably distinguish better from worse among its own outputs, whether at inference-time (self-revision) or training-time (self-rewarding)
- Does reflection in reasoning models actually correct errors?
When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
refines the picture: self-revision does not just degrade — most "revision" tokens never genuinely revise, they confirm; the original claim applies to a small fraction of actual reflection while the majority is performative confirmation
- Is reflection in reasoning models actually fixing mistakes?
Do the thinking steps that appear after a model's first answer represent genuine self-correction, or are they mostly confirming what the model already concluded? Understanding this matters for how we train and deploy reasoning systems.
sharpens the implication: the value of training on reflection-style traces comes from improving the first answer, not from teaching genuine self-correction; self-revision's degradation is one tail of a distribution where the bulk of reflection produces no change
- Do reasoning traces actually cause correct answers?
Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
names the underlying error: framing self-revision as "the model reflecting on its own reasoning" anthropomorphizes a token-stylistic process; revision tokens are not metacognition, they are continued autoregressive generation that happens to use reflective vocabulary
Original note title: self-revision degrades reasoning accuracy in o1-like models