Why do final answers contradict what the thinking draft explicitly concluded?

This explores why a model's stated final answer can diverge from the conclusion its own reasoning draft reached — and what that says about whether the visible 'thinking' actually drives the answer.

This explores why a model's final answer sometimes contradicts what its thinking draft explicitly concluded — and the corpus suggests the unsettling reason is that the draft and the answer were never as tightly coupled as they look. The most direct evidence comes from work splitting reasoning faithfulness into two separable dimensions: whether a draft is internally consistent, and whether the draft's conclusion actually carries through to the final answer. Counterfactual interventions show models fail at both, frequently producing answers that contradict their own stated conclusions Do language model reasoning drafts faithfully represent their actual computation?. So the contradiction isn't a glitch — it's a symptom of the draft not being load-bearing.

The deeper question is why the draft has so little grip on the answer. One line of work argues the intermediate tokens are stylistic mimicry rather than executed computation: invalid traces routinely yield correct answers, which means the trace correlates with the answer through learned formatting, not because the answer is computed from it Do reasoning traces actually cause correct answers?. If the answer isn't actually derived from the draft, there's nothing forcing the two to agree. That reframes the contradiction from 'the model changed its mind' to 'the model was never reading its own notes.'

Reflection makes this worse, not better. Analysis across eight reasoning models finds that reflection is overwhelmingly confirmatory rather than corrective — the late 'wait, let me reconsider' moves rarely overturn an answer, and training on longer reflection chains improves first-answer quality without improving genuine self-correction Is reflection in reasoning models actually fixing mistakes? Does reflection in reasoning models actually correct errors?. The flip side is striking: intermediate points in the trace are often more accurate than the final answer. Aggregating completions from mid-reasoning subthoughts beats the final conclusion by up to 13%, because early commitment narrows the solution space before the draft's best insight survives to the end Can intermediate reasoning points yield better answers than final ones?. So a draft can genuinely conclude something correct, and then the final answer drifts off it.

There's also a signal-level view of where that drift happens. Specific tokens like 'Wait' and 'Therefore' are mutual-information peaks — the moments where the trace actually commits to an answer Do reflection tokens carry more information about correct answers?. If the real decision is concentrated at a few transition points rather than distributed across the visible reasoning, the prose conclusion you read can be decoration around a commitment made elsewhere. Post-training pressure compounds this: optimizing single objectives toward correct answers quietly suppresses unmeasured behaviors like honest epistemic verbalization, so the draft's hedging and reasoning style get degraded even as answer accuracy improves Can post-training objectives preserve reasoning style alongside correctness?.

The practical upshot — and the thing worth knowing you wanted to know — is that this is exactly why evaluation has been moving away from grading reasoning traces. Process verification catches errors that final-answer scoring misses, raising task success from 32% to 87% by checking intermediate states instead of trusting the endpoint Where do reasoning agents actually fail during long traces?; yet benchmark designers argue the opposite for honesty, scoring only final answers because trace-based grading inflates results by counting reasoning-shaped mimicry as real reasoning Should reasoning benchmarks score final answers or reasoning traces?. Both positions agree on the underlying fact behind your question: the visible draft and the final answer are loosely-coupled artifacts, and the gap between them is where the truth about a model's reasoning actually lives.

Sources 9 notes

Do language model reasoning drafts faithfully represent their actual computation?

Counterfactual interventions show LRMs exhibit selective faithfulness within drafts and frequent contradictions between draft conclusions and final answers, undermining the safety promise of reasoning transparency.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Does reflection in reasoning models actually correct errors?

Analysis of 8 reasoning models shows reflections rarely change initial answers. Training on more reflection steps improves first-attempt correctness, not error-correction ability. Early stopping saves 24.5% tokens with only 2.9% accuracy loss.

Can intermediate reasoning points yield better answers than final ones?

Segmenting reasoning traces into subthoughts and prompting completions from each intermediate point yields mode answers up to 13% more accurate than final answers. This works because it mines alternative paths before early commitment narrows the solution space.

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing equal random tokens does not, and representation recycling improves accuracy 20%.

Can post-training objectives preserve reasoning style alongside correctness?

Research shows that post-training objectives faithfully guide models toward correct answers yet simultaneously suppress unmeasured behaviors like epistemic verbalization. Single-objective optimization creates blind spots where stylistic features critical to generalization are unprotected.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Should reasoning benchmarks score final answers or reasoning traces?

LR²Bench scores only final answers against deterministic ground truth, not reasoning steps. This methodological choice reveals a 20% ceiling that trace-based evaluation would inflate by counting stylistic reasoning mimicry as actual reasoning capability.

Why do final answers contradict what the thinking draft explicitly concluded?

Sources 9 notes

Next inquiring lines