Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does self-revision actually improve reasoning in language models?

When o1-like models revise their own reasoning through tokens like "Wait" or "Alternatively", does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

Self-revision in o1-like models — prompted by tokens like "Wait" or "Alternatively" — does not reliably fix errors. The evidence from QwQ, R1, and LIMO shows:

  1. Most revisions retain the original (wrong) answer rather than correcting it
  2. Smaller models (R1-Distill-1.5B, QwQ) are more likely to revise a correct answer into an incorrect one than the reverse (see the measurement sketch after this list)
  3. Longer CoTs contain more self-revisions, which helps explain why longer traces correlate with incorrect answers
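
Claims 1 and 2 are mechanical enough to sketch a measurement for: split each trace at the revision markers and tabulate answer-state transitions across the boundaries. A minimal sketch; the \boxed{} answer format, the marker list, and the helper names are assumptions for illustration, not from the source.

```python
# Minimal sketch, assuming answers appear as \boxed{...} in the trace;
# markers, regex, and helper names are illustrative, not from the source.
import re
from collections import Counter

SPLIT_AT_MARKERS = re.compile(r"(?=\bWait\b|\bAlternatively\b)")
BOXED = re.compile(r"\\boxed\{([^}]*)\}")

def current_answer(prefix: str) -> str | None:
    """The answer the model holds at this point: the last \\boxed{} so far."""
    hits = BOXED.findall(prefix)
    return hits[-1] if hits else None

def revision_transitions(cot: str, gold: str) -> Counter:
    """Tabulate answer-state transitions across each self-revision boundary,
    including retained answers (e.g. 'incorrect->incorrect')."""
    counts: Counter = Counter()
    prefix, prev = "", None
    for segment in SPLIT_AT_MARKERS.split(cot):
        prefix += segment
        answer = current_answer(prefix)
        if answer is None:
            continue
        state = "correct" if answer == gold else "incorrect"
        if prev is not None:
            counts[f"{prev}->{state}"] += 1
        prev = state
    return counts

# One revision that flips a right answer to a wrong one:
trace = r"... so \boxed{12}. Wait, that can't be right ... so \boxed{15}."
print(revision_transitions(trace, gold="12"))  # Counter({'correct->incorrect': 1})
```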

The irony is that self-revision is framed as a feature — the model reflecting on its own reasoning. But empirically, the reflection is often noise that introduces additional errors rather than catching existing ones. The model's capacity to evaluate its own correctness is limited, so its "reflection" is more likely to perturb a right answer than to save a wrong one.

This has implications for inference strategy: forcing models to self-revise (by suppressing the end-of-thinking token and appending "Wait") is more likely to degrade a good answer than to improve a bad one. The better alternative is to parallelize instead; see Why does parallel reasoning outperform single chain thinking?.
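
To make the contrast concrete, here is a sketch of the two strategies, assuming a generic `generate(prompt, **kwargs) -> str` sampling helper, a "</think>" end-of-thinking delimiter, and an `extract_answer` parser; all three are illustrative names, not a specific API.

```python
# Sketch: budget-forced self-revision vs. parallel sampling with majority vote.
from collections import Counter

def forced_revision(generate, prompt: str, forced_rounds: int = 2) -> str:
    """Budget-forced self-revision: stop at the end-of-thinking delimiter,
    drop it, append 'Wait', and let the model keep going."""
    trace = generate(prompt, stop=["</think>"])
    for _ in range(forced_rounds):
        trace += "\nWait"
        trace += generate(prompt + trace, stop=["</think>"])
    return trace

def parallel_vote(generate, extract_answer, prompt: str, n: int = 8) -> str:
    """Parallel alternative: n independent chains, majority vote on answers."""
    answers = [extract_answer(generate(prompt, temperature=0.8)) for _ in range(n)]
    return Counter(a for a in answers if a is not None).most_common(1)[0][0]
```

The point of the contrast: forced_revision spends its extra tokens perturbing a single chain, while parallel_vote spends them on independent chains whose errors are less correlated.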

The Degeneration-of-Thought finding (ReConcile) supplies the mechanism: when a model is challenged with its own previous reasoning reframed as external criticism, it neither maintains its position nor improves; it capitulates with increasing confidence, ending more certain of the wrong answer than it started. This is the acute form of the same failure: self-revision at the token level degrades accuracy, and self-revision at the model-vs-model level collapses calibration. The contrast between diverse multi-agent debate (which helps) and same-model challenge (which harms) confirms that the key variable is not revision depth but the source of the challenge. See Does a model improve by arguing with itself? for the contrastive finding.
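
One way this protocol could be instrumented, as a sketch: sample a counterargument from the same model, hand it back as an outside reviewer's criticism, and log whether the answer flips and how the stated confidence moves. The `ask(prompt) -> (text, confidence)` helper and all prompt strings are hypothetical.

```python
# Illustrative instrumentation of the same-model challenge described above.
def self_challenge(ask, question: str, rounds: int = 3) -> list[dict]:
    """Feed the model's own counterargument back as external criticism and
    record whether it flips and how its stated confidence moves."""
    answer, confidence = ask(question)
    history = [{"answer": answer, "flipped": False, "confidence": confidence}]
    for _ in range(rounds):
        # The same model writes the critique, then receives it as if from outside.
        critique, _ = ask(f'Argue that "{answer}" is the wrong answer to: {question}')
        new_answer, new_confidence = ask(
            f"{question}\nA reviewer argues: {critique}\n"
            "Reconsider, then give your final answer and a confidence in [0, 1]."
        )
        history.append({"answer": new_answer,
                        "flipped": new_answer != answer,
                        "confidence": new_confidence})
        answer = new_answer
    return history

# Degeneration-of-Thought shows up as flips paired with rising confidence.
```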


Source: Test Time Compute

Original note title: self-revision degrades reasoning accuracy in o1-like models