Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Is reflection in reasoning models actually fixing mistakes?

Do the thinking steps that appear after a model's first answer represent genuine self-correction, or are they mostly confirming what the model already concluded? Understanding this matters for how we train and deploy reasoning systems.

Note · 2026-02-22 · sourced from Reasoning by Reflection
How should we allocate compute budget at inference time? · How should researchers navigate LLM reasoning research?

The Hook

We've been watching reasoning models think and assuming the reflection is where the work happens. It isn't. The cognitive labor occurs before the first answer. The reflection tokens that follow are mostly the model telling us it was already right.

The Finding

First Try Matters analyzes rollouts from 8 reasoning models on 5 mathematical datasets. The result: reflections — the reasoning that occurs after a candidate answer is produced — are predominantly confirmatory. They rarely change the answer.

More counterintuitively: training on longer reflection chains doesn't improve self-correction capability. It improves first-answer quality. The model gets better at being right the first time, not at catching when it's wrong.
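
To pin down what "predominantly confirmatory" means operationally, here is a minimal sketch of the two measurements, assuming candidate answers are marked with \boxed{...} as is common in math reasoning traces. The function names and the extraction regex are illustrative, not the paper's actual pipeline.

```python
import re

# Assumption: candidate answers appear as \boxed{...}. The paper's
# extraction pipeline may differ.
ANSWER_RE = re.compile(r"\\boxed\{([^{}]+)\}")

def candidate_answers(rollout: str) -> list[str]:
    """All candidate answers in a rollout, in order of appearance."""
    return [m.group(1).strip() for m in ANSWER_RE.finditer(rollout)]

def first_vs_final_accuracy(rollouts: list[str], golds: list[str]) -> tuple[float, float]:
    """Accuracy of the first candidate answer vs. the final one.

    If reflection were genuinely corrective, final accuracy would beat
    first-answer accuracy by a wide margin.
    """
    first_hits = final_hits = scored = 0
    for rollout, gold in zip(rollouts, golds):
        answers = candidate_answers(rollout)
        if not answers:
            continue  # no extractable answer; skip this rollout
        scored += 1
        first_hits += answers[0] == gold
        final_hits += answers[-1] == gold
    return first_hits / max(scored, 1), final_hits / max(scored, 1)

def confirmation_rate(rollouts: list[str]) -> float:
    """Fraction of rollouts where no reflection ever changes the first answer."""
    confirmed = scored = 0
    for rollout in rollouts:
        answers = candidate_answers(rollout)
        if not answers:
            continue
        scored += 1
        confirmed += all(a == answers[0] for a in answers[1:])
    return confirmed / max(scored, 1)
```

If the finding holds on your own rollouts, first_vs_final_accuracy returns two nearly identical numbers and confirmation_rate comes out high: reflection rarely moves the answer.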

What This Means

The visible reflection is post hoc. The model has already reasoned its way to a conclusion through the invisible pre-answer chain; the reflection loop mostly generates confirmation of that conclusion. When the first answer is right, this looks like careful double-checking. When the first answer is wrong, the loop typically reinforces the error rather than catching it.

This reframes the entire reflection-training literature. We've been optimizing for training data with more reflection steps under the assumption that reflection = self-correction. The finding says: reflection ≈ confirmation. More reflection training = better first answers that need less correction, not better correction capability.

The Evidence from Efficiency

Early stopping — cutting reflection after the first plausible candidate answer appears — saves 24.5% of inference tokens with only 2.9% accuracy loss. If the reflection tokens after the first answer were doing substantive work, cutting them would cost more accuracy. They aren't.
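
Here is a sketch of what that early stopping could look like at decode time. model.stream is a hypothetical streaming interface (substitute your serving API), and treating a closed \boxed{...} span as the "first plausible candidate answer" is one reasonable stop criterion, not necessarily the paper's exact one.

```python
def generate_with_early_stop(model, prompt: str, max_tokens: int = 4096) -> str:
    """Stop decoding once the first plausible candidate answer is complete.

    `model.stream` is a hypothetical token iterator; replace it with your
    serving API. Cutting the post-answer reflection trades a small accuracy
    loss for a large token saving, per the numbers above.
    """
    pieces: list[str] = []
    for token in model.stream(prompt, max_tokens=max_tokens):
        pieces.append(token)
        text = "".join(pieces)
        # A candidate answer counts as complete once a \boxed{ span has closed.
        start = text.rfind("\\boxed{")
        if start != -1 and "}" in text[start:]:
            break  # skip the (mostly confirmatory) reflection that follows
    return "".join(pieces)
```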

The Connection

This joins Does self-revision actually improve reasoning in language models? in a cluster that challenges the "more reflection = better reasoning" assumption. That note says revision actively hurts. This note says revision mostly doesn't happen at all — it's confirmation theater. Together: the reflection loop is at best neutral and at worst harmful.

The architectural implication: if you want genuine self-correction, you need external critique (see: Does revising your own reasoning actually help or hurt?). Internal reflection by the same model on its own outputs produces confirmation, not correction.
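
For contrast, a sketch of the external-critique pattern. generator and critic are hypothetical prompt-to-text callables; the key design choice is that the critic judges only the problem and the proposed answer, never the generator's own chain of thought, so it cannot simply confirm it.

```python
def critique_loop(generator, critic, problem: str, max_rounds: int = 2) -> str:
    """Revise an answer using an external critic instead of self-reflection.

    `generator` and `critic` are hypothetical callables (prompt -> str).
    The critic sees only the problem and the proposed answer, not the
    generator's reasoning trace.
    """
    answer = generator(problem)
    for _ in range(max_rounds):
        verdict = critic(
            f"Problem:\n{problem}\n\nProposed answer:\n{answer}\n\n"
            "Point out a concrete error, or reply exactly 'OK' if correct."
        )
        if verdict.strip() == "OK":
            break  # the critic found nothing to fix
        answer = generator(
            f"Problem:\n{problem}\n\nPrevious answer:\n{answer}\n\n"
            f"Critique:\n{verdict}\n\nWrite a corrected answer."
        )
    return answer
```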

Post Angle

Platform: Medium (~1000 words). Hook: "We've been watching models think. The thinking isn't where we think it is." Evidence: 8 models, 5 datasets, predominantly confirmatory reflections. Implication: what we're calling self-correction is actually self-confirmation; training on reflection is training better first-pass reasoning. Practical: 24.5% token efficiency win from early stopping.


Source: Reasoning by Reflection
