Is reflection in reasoning models actually fixing mistakes?
Do the thinking steps that appear after a model's first answer represent genuine self-correction, or are they mostly confirming what the model already concluded? Understanding this matters for how we train and deploy reasoning systems.
The Hook
We've been watching reasoning models think and assuming the reflection is where the work happens. It isn't. The cognitive labor occurs before the first answer. The reflection tokens that follow are mostly the model telling us it was already right.
The Finding
First Try Matters analyzes rollouts from 8 reasoning models on 5 mathematical datasets. The result: reflections — the reasoning that occurs after a candidate answer is produced — are predominantly confirmatory. They rarely change the answer.
The more counterintuitive result: training on longer reflection chains doesn't improve self-correction capability; it improves first-answer quality. The model gets better at being right the first time, not at catching itself when it's wrong.
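To make "confirmatory" concrete: a reflection counts as confirmatory when the first candidate answer the model produces is already the answer it ends on. The paper's actual extraction pipeline isn't reproduced here, but a minimal sketch of that measurement, assuming candidate answers appear as \boxed{...} spans and using illustrative function names, might look like this:

```python
import re

def first_candidate_answer(rollout: str) -> str | None:
    """Return the first boxed answer that appears in a reasoning rollout.

    Assumes candidate answers are emitted as \\boxed{...} spans, which is
    common on math benchmarks; a real pipeline would need sturdier parsing.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", rollout)
    return match.group(1).strip() if match else None

def classify_reflection(rollout: str, final_answer: str) -> str:
    """Label the post-answer reflection as confirmatory or corrective.

    Confirmatory: the first candidate already equals the final answer, so the
    reflection only re-derived it. Corrective: the final answer differs.
    """
    first = first_candidate_answer(rollout)
    if first is None:
        return "no-candidate"
    return "confirmatory" if first == final_answer.strip() else "corrective"
```

Run over many rollouts, the "predominantly confirmatory" result is simply the confirmatory bucket dwarfing the corrective one.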
What This Means
The visible reflection is post-hoc. The model has already reasoned to a conclusion through the invisible pre-answer chain. The reflection loop is mostly generating confirmation that the conclusion it reached is correct. When the first answer is right, this looks like careful double-checking. When the first answer is wrong, the confirmation loop typically reinforces the error rather than catching it.
This reframes the entire reflection-training literature. We've been optimizing for training data with more reflection steps under the assumption that reflection = self-correction. The finding says: reflection ≈ confirmation. More reflection training = better first answers that need less correction, not better correction capability.
The Evidence from Efficiency
Early stopping — cutting reflection after the first plausible candidate answer appears — saves 24.5% of inference tokens with only 2.9% accuracy loss. If the reflection tokens after the first answer were doing substantive work, cutting them would cost more accuracy. They aren't.
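What that early stopping can look like at inference time, as a rough sketch: stream the decode, watch for the first candidate answer, and stop there. The streaming interface and the answer regex below are placeholders, not any particular library's API.

```python
import re

def generate_with_early_stop(token_stream, answer_pattern=r"\\boxed\{[^}]*\}"):
    """Accumulate streamed text and stop at the first candidate answer.

    `token_stream` is any iterable of text chunks from a decoding loop; the
    regex for spotting a candidate answer is an assumption suited to
    math-style outputs, not a universal detector.
    """
    text = ""
    for chunk in token_stream:
        text += chunk
        if re.search(answer_pattern, text):
            break  # skip generating the post-answer reflection entirely
    return text
```

The only real design decision is how strict the "first plausible candidate answer" detector is: a stricter detector saves fewer tokens but is less likely to cut the rare reflection that actually corrects something.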
The Connection
This joins "Does self-revision actually improve reasoning in language models?" in a cluster that challenges the "more reflection = better reasoning" assumption. That note says revision actively hurts. This note says revision mostly doesn't happen at all; it's confirmation theater. Together: the reflection loop is at best neutral and at worst harmful.
The architectural implication: if you want genuine self-correction, you need external critique (see "Does revising your own reasoning actually help or hurt?"). Internal reflection by the same model on its own outputs produces confirmation, not correction.
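A hedged sketch of that architecture, with `solver` and `critic` standing in for two independent models (the names, signatures, and stop signal are all illustrative, not anything from the source):

```python
def solve_with_external_critique(problem, solver, critic, max_rounds=2):
    """Route drafts through a separate critic instead of self-reflection.

    `solver` and `critic` are placeholder callables for two independent
    models or endpoints; the "no issues" stop condition is a made-up
    convention, not a real protocol.
    """
    draft = solver(problem)
    for _ in range(max_rounds):
        feedback = critic(problem, draft)
        if feedback.strip().lower() == "no issues":  # hypothetical stop signal
            return draft
        draft = solver(f"{problem}\n\nCritique of your previous attempt:\n{feedback}")
    return draft
```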
Post Angle
Platform: Medium (~1000 words). Hook: "We've been watching models think. The thinking isn't where we think it is." Evidence: 8 models, 5 datasets, predominantly confirmatory reflections. Implication: what we're calling self-correction is actually self-confirmation; training on reflection is training better first-pass reasoning. Practical: 24.5% token efficiency win from early stopping.
Source: Reasoning by Reflection
Related concepts in this collection
- Does reflection in reasoning models actually correct errors? When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies. (the core insight)
- Does self-revision actually improve reasoning in language models? When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability. (parallel finding: revision doesn't work; this note adds the mechanism — it doesn't happen in any substantive sense)
- Does revising your own reasoning actually help or hurt? Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves. (the fix: external critique, not more internal reflection)
Original note title: the first answer was right — why reflection in reasoning models is mostly theater