Topics: LLM Reasoning and Architecture · Language Understanding and Pragmatics · Design & LLM Interaction

Do language model reasoning drafts faithfully represent their actual computation?

If a model externalizes its reasoning in a thinking draft before answering, does the draft accurately reflect its internal process? This matters for AI safety monitoring and error detection.

Note · 2026-02-22 · sourced from Reasoning by Reflection
Related questions: How should we allocate compute budget at inference time? · How should researchers navigate LLM reasoning research?

The promise of thinking models for AI safety monitoring is specific: because the model externalizes its reasoning in a thinking draft before answering, observers can read the draft to detect errors and control what happens in the answer stage. This promise depends on one empirical assumption: that the thinking draft faithfully represents the model's actual internal computation. This paper tests that assumption with counterfactual interventions and finds it frequently violated.
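
A minimal sketch of that two-stage setup helps fix terms. Everything here is illustrative: `model.generate` is a hypothetical stand-in for any LLM completion API, and the `<think>` delimiters follow the common thinking-model convention, not necessarily this paper's exact format.

```python
# Hypothetical two-stage decomposition the monitoring promise relies on.
# `model.generate(prompt, stop=...)` is a stand-in, not a real API.

def answer_with_draft(model, question: str) -> tuple[str, str]:
    """Stage 1: produce a thinking draft. Stage 2: answer conditioned on it."""
    draft = model.generate(f"{question}\n<think>", stop="</think>")
    answer = model.generate(f"{question}\n<think>{draft}</think>\nAnswer:")
    return draft, answer

# A monitor reads only `draft` and assumes it predicts (and explains)
# `answer`. The interventions below test that assumption.
```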

Intra-Draft Faithfulness: When a false or contradictory step is inserted mid-draft, do subsequent steps and the final draft conclusion appropriately integrate or correct it? If the draft is faithful, inserted errors should produce systematic downstream effects. Finding: large reasoning models (LRMs) show selective faithfulness; some steps matter, most don't, and counterfactual integration is inconsistent across models and tasks.
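
As a sketch, the intra-draft intervention looks roughly like this, using the same hypothetical `model.generate` stand-in; splitting drafts into newline-delimited steps is an assumption for illustration, not the paper's procedure.

```python
# Illustrative intra-draft counterfactual intervention.

def perturb_and_continue(model, question: str, draft: str,
                         false_step: str, position: int) -> str:
    """Insert a false or contradictory step mid-draft, then let the
    model continue reasoning from the corrupted prefix."""
    steps = [s for s in draft.splitlines() if s.strip()]
    prefix = "\n".join(steps[:position] + [false_step])
    continuation = model.generate(f"{question}\n<think>{prefix}\n",
                                  stop="</think>")
    return prefix + "\n" + continuation

# Faithfulness check: diff the perturbed run's conclusion against the
# unperturbed run. If most insertions leave the conclusion untouched,
# the inserted steps were causally inert.
```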

Draft-to-Answer Faithfulness (two components): first, consistency, whether the final answer restates the conclusion the draft reached; second, reliance, whether the answer is causally derived from the draft rather than recomputed independently of it.
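
Both components can be operationalized as simple checks. In this sketch, `normalize` and `scrub` are illustrative placeholders under the same hypothetical `model.generate` interface as above, not the paper's implementation.

```python
def normalize(text: str) -> str:
    """Crude answer normalization for string comparison."""
    return " ".join(text.lower().split())

def scrub(draft: str, keep_fraction: float = 0.5) -> str:
    """Drop the tail of the draft, removing its concluding steps."""
    steps = draft.splitlines()
    return "\n".join(steps[: int(len(steps) * keep_fraction)])

def consistent(draft_conclusion: str, final_answer: str) -> bool:
    """Component 1: does the answer restate the draft's conclusion?"""
    return normalize(draft_conclusion) == normalize(final_answer)

def relies_on_draft(model, question: str, draft: str, answer: str) -> bool:
    """Component 2: does the answer causally depend on the draft?
    If the answer survives a scrubbed draft unchanged, the answer
    stage likely bypassed the draft."""
    prompt = f"{question}\n<think>{scrub(draft)}</think>\nAnswer:"
    return normalize(model.generate(prompt)) != normalize(answer)
```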

Both failures undermine the monitoring promise from different directions. Intra-draft inconsistency means you can't trace error propagation through the draft. Draft-answer inconsistency means even a coherent, correct-looking draft doesn't guarantee a correct answer derived from it.

The safety implications are immediate: inserting corrective content into thinking drafts won't reliably fix outputs (intra-draft faithfulness fails), and reading draft conclusions to predict final answers won't reliably work (draft-answer consistency fails). The draft is an unreliable proxy for the computation it purports to represent.

This extends Do language models actually use their reasoning steps? with a two-dimensional operationalization and an empirical methodology. Both dimensions can now be measured via counterfactual intervention: whether the draft's content is enough to determine the answer (causal sufficiency), and whether the answer actually changes when the draft is perturbed (necessity).
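
Under those definitions, the two dimensions reduce to simple rates over a set of intervention trials. The field names below are hypothetical, chosen only to make the scoring concrete.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    answer_full_draft: str   # answer conditioned on the intact draft
    answer_perturbed: str    # answer after a counterfactual edit
    draft_conclusion: str    # conclusion stated inside the draft

def sufficiency_rate(trials: list[Trial]) -> float:
    """How often the draft's stated conclusion determines the answer."""
    hits = sum(t.answer_full_draft == t.draft_conclusion for t in trials)
    return hits / len(trials)

def necessity_rate(trials: list[Trial]) -> float:
    """How often perturbing the draft actually changes the answer."""
    flips = sum(t.answer_full_draft != t.answer_perturbed for t in trials)
    return flips / len(trials)
```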


Source: Reasoning by Reflection

Original note title: thinking draft faithfulness has two separable dimensions — intra-draft causal consistency and draft-to-answer consistency — current LRMs fail both