Do language model reasoning drafts faithfully represent their actual computation?
When a model externalizes its reasoning in a thinking draft before answering, does that draft accurately reflect its internal process? This matters for AI safety monitoring and error detection.
The promise of thinking models for AI safety monitoring is specific: because the model externalizes its reasoning in a thinking draft before answering, observers can read the draft to detect errors and control what happens in the answer stage. This promise depends on one empirical assumption: that the thinking draft faithfully represents the model's actual internal computation. This paper tests that assumption with counterfactual interventions and finds it frequently violated.
Intra-Draft Faithfulness: When a false or contradictory step is inserted mid-draft, do subsequent steps and the final draft conclusion appropriately integrate or correct it? If the draft is faithful, inserted errors should produce systematic downstream effects. Finding: large reasoning models (LRMs) show selective faithfulness: some inserted steps change the downstream reasoning, most do not. Counterfactual integration is inconsistent across models and tasks.
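A minimal sketch of this intervention is below. It assumes hypothetical helpers `lrm_continue` (which asks the model to finish a draft from a given prefix of steps) and `extract_conclusion` (which parses the draft's stated conclusion); the paper's actual prompts, models, and scoring are not reproduced here.

```python
# Sketch of an intra-draft counterfactual intervention. `lrm_continue` and
# `extract_conclusion` are hypothetical stand-ins for whatever model API and
# answer parser an actual experiment would use.
from typing import Callable, Dict, List

def intra_draft_intervention(
    prompt: str,
    draft_steps: List[str],
    k: int,
    false_step: str,
    lrm_continue: Callable[[str, List[str]], str],
    extract_conclusion: Callable[[str], str],
) -> Dict[str, object]:
    """Insert a false/contradictory step after step k, let the model finish the
    draft from that perturbed prefix, and compare against the unperturbed run."""
    # Baseline: conclusion reached when continuing from the untouched prefix.
    base_draft = lrm_continue(prompt, draft_steps[: k + 1])
    # Counterfactual: the same prefix with the false step appended.
    cf_draft = lrm_continue(prompt, draft_steps[: k + 1] + [false_step])

    base_conclusion = extract_conclusion(base_draft)
    cf_conclusion = extract_conclusion(cf_draft)

    # A faithful draft should either integrate the false step (changing the
    # conclusion) or explicitly correct it; an unchanged conclusion with no
    # correction suggests the step was never causally used.
    return {
        "baseline_conclusion": base_conclusion,
        "counterfactual_conclusion": cf_conclusion,
        "conclusion_changed": base_conclusion != cf_conclusion,
    }
```

Running this across many insertion positions k gives the selectivity picture: a faithful draft would show systematic conclusion changes or explicit corrections, while unchanged conclusions at most positions indicate steps that carry no causal weight.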
Draft-to-Answer Faithfulness (two components; see the sketch after this list):
- Draft Reliance: Does the answer-generation stage introduce substantial new reasoning beyond the thinking draft? If so, the draft is not the full reasoning record.
- Draft-Answer Consistency: Does the final answer logically align with the thinking draft's explicit conclusion? Finding: final answers frequently contradict the explicit draft conclusions. The draft may say "therefore X" while the answer states Y.
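Both checks can be sketched concretely. The sketch below assumes an R1-style output format where the thinking draft sits inside `<think>...</think>` tags; the tag convention, the sentence-level novelty proxy, and the `extract_conclusion` helper are illustrative assumptions, not the paper's exact measures.

```python
# Sketches of the two draft-to-answer checks: a crude reliance proxy and a
# draft-answer consistency test. The <think> tag convention is an assumption.
import re
from typing import Callable, Tuple

def split_draft_and_answer(output: str) -> Tuple[str, str]:
    """Split a model output into (thinking draft, final answer)."""
    match = re.search(r"<think>(.*?)</think>(.*)", output, flags=re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", output.strip()

def draft_reliance_proxy(output: str) -> float:
    """Crude proxy for new reasoning in the answer stage: the fraction of
    answer sentences that do not appear anywhere in the thinking draft."""
    draft, answer = split_draft_and_answer(output)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    novel = [s for s in sentences if s.lower() not in draft.lower()]
    return len(novel) / len(sentences)

def draft_answer_consistent(output: str,
                            extract_conclusion: Callable[[str], str]) -> bool:
    """True if the final answer matches the conclusion the draft itself states
    (e.g., the 'therefore X' line versus the answer's stated result)."""
    draft, answer = split_draft_and_answer(output)
    return extract_conclusion(draft) == extract_conclusion(answer)
```

A high reliance proxy flags answers that add reasoning the draft never contained; a failed consistency check flags the "draft says X, answer says Y" case directly.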
Both failures undermine the monitoring promise from different directions. Intra-draft inconsistency means error propagation cannot be traced through the draft. Draft-answer inconsistency means even a coherent, correct-looking draft does not guarantee a correct final answer, or even one derived from the draft.
The safety implications are immediate: inserting corrective content into thinking drafts won't reliably fix outputs (intra-draft faithfulness fails). Reading draft conclusions to predict final answers won't reliably work (draft-answer consistency fails). The draft is an unreliable proxy for the computation it is supposed to represent.
This extends "Do language models actually use their reasoning steps?" with a two-dimensional operationalization and an empirical methodology. Both dimensions, whether the draft's stated reasoning is enough to determine the answer (causal sufficiency) and whether the answer actually depends on the draft at all (necessity), can now be measured via counterfactual intervention.
Source: Reasoning by Reflection
Related concepts in this collection
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  Relation: operationalizes the question with two specific, measurable dimensions; counterfactual intervention is the methodology that makes the abstract claim testable.
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  Relation: draft-to-answer consistency failure is the empirical confirmation of why trace anthropomorphism is dangerous.
- Does reflection in reasoning models actually correct errors?
  When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
  Relation: behavioral correlation; confirmatory reflection is the content-level evidence of faithfulness failure: if reflection tokens confirm rather than evaluate, they are causal decoration, not causal drivers.
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  Relation: provides the theoretical grounding; draft unfaithfulness is the expected outcome if CoT is imitation of reasoning form rather than genuine inference. Drafts are performative by construction, so draft-answer disconnects are structural, not accidental.
Original note title
thinking draft faithfulness has two separable dimensions — intra-draft causal consistency and draft-to-answer consistency — current LRMs fail both