Do language models actually use their reasoning steps?
Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
CoT evaluation typically asks: does the model produce valid-looking reasoning steps? But validity of form is not faithfulness. A reasoning chain is faithful only if it is causally responsible for the output.
Two distinct criteria define faithfulness:
Causal sufficiency: the chain should actually drive the answer. Removing or perturbing the chain should change the output; if the model reaches the same conclusion either way, the chain was doing no causal work and the answer was computed by other means.
Causal necessity: every step should contribute. The chain should contain no spurious steps; if deleting an individual step leaves the output unchanged, that step was not necessary to the reasoning.
Trustworthy CoT requires both. A chain can be causally sufficient (the answer tracks the chain) without being causally necessary (it is padded with spurious steps the conclusion never depended on). And a chain can look necessary on the surface (every step seems to follow from the last) without being sufficient (ablating the chain leaves the output unchanged, suggesting the conclusion was pre-determined).
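To make the two tests concrete, here is a minimal ablation sketch. Everything in it is an assumption for illustration: `generate` stands in for any model call that maps a prompt to a final answer, and the prompt format is a placeholder, not a method from any of the papers discussed below.

```python
# Minimal sketch of the two ablation tests. `generate` is a hypothetical
# stand-in for any model call mapping a prompt to a final answer.
from typing import Callable, List

def answer_with_chain(generate: Callable[[str], str],
                      question: str, steps: List[str]) -> str:
    """Condition the model on the question plus a (possibly edited) chain."""
    chain = "\n".join(steps)
    return generate(f"{question}\n\nReasoning:\n{chain}\n\nAnswer:")

def sufficiency_test(generate: Callable[[str], str],
                     question: str, steps: List[str]) -> bool:
    """Chain-level test: does the answer change when the chain is removed?
    If the answer survives full ablation, the chain was not driving it."""
    with_chain = answer_with_chain(generate, question, steps)
    without_chain = generate(f"{question}\n\nAnswer:")
    return with_chain != without_chain  # True => the chain is doing causal work

def spurious_steps(generate: Callable[[str], str],
                   question: str, steps: List[str]) -> List[int]:
    """Step-level test: which single steps can be deleted without changing
    the answer? Any such step fails causal necessity."""
    baseline = answer_with_chain(generate, question, steps)
    return [i for i in range(len(steps))
            if answer_with_chain(generate, question,
                                 steps[:i] + steps[i + 1:]) == baseline]
```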
LLMs satisfy neither criterion consistently. Do language models actually use their encoded knowledge? describes the sufficiency problem: knowledge encoded in representations may not causally influence outputs. CoT steps can be generated as performative reasoning — the form of inference without the function. Why do correct reasoning traces contain fewer tokens? reveals the necessity failure from the opposite direction: correct traces are systematically shorter, suggesting that incorrect traces are padded with spurious steps that contributed nothing to the reasoning.
Do hedging markers actually signal careful thinking in AI? adds a behavioral signal: the surface markers of uncertainty cluster in chains that are already going wrong — the hedging is decorative, not functional.
The implication is uncomfortable: most CoT evaluation is measuring output quality and reasoning form, not causal faithfulness. Models can generate reasoning that looks like it produced the answer while the answer was determined by other means. We have limited tools for distinguishing the two.
The institutional dimension: "Chain-of-Thought Is Not Explainability" (2025) finds that ~25% of recent arXiv papers incorporating CoT also treat it as an interpretability technique — roughly 244 of 1,000 sampled papers. This is a documented methodological error at scale. The "illusion of transparency" framing applies here: models can reorder multiple-choice options and their CoT explanations never mention this influence, rationalizing whatever answer was selected. Models also make errors in intermediate steps but still produce correct final answers via computational pathways not captured in the trace. Fixing this requires causal validation methods — activation patching, counterfactual interventions, verifier models — that ground explanations in model internals rather than surface output form. See Do chain of thought traces actually help humans understand reasoning?.
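The option-reordering result suggests a simple behavioral probe that anyone can run: ask the same multiple-choice question under two option orders and flag cases where the selected content changes while the CoT stays silent about ordering. A hedged sketch, with `ask` as a hypothetical call returning the CoT text and the chosen option letter:

```python
# Behavioral probe for the reordering effect described above. `ask` is a
# hypothetical function: prompt -> (cot_text, choice_letter).
import random

def reorder_probe(ask, question, options, seed=0):
    shuffled = options[:]
    random.Random(seed).shuffle(shuffled)

    def run(opts):
        labels = "ABCDEFGH"[:len(opts)]
        prompt = question + "\n" + "\n".join(
            f"{label}. {opt}" for label, opt in zip(labels, opts))
        cot, letter = ask(prompt)
        return cot, opts[labels.index(letter)]  # map letter back to content

    cot1, pick1 = run(options)
    cot2, pick2 = run(shuffled)
    order_sensitive = pick1 != pick2
    mentions_order = any(word in (cot1 + cot2).lower()
                         for word in ("order", "position", "listed"))
    # The unfaithful pattern: the answer flips with ordering while neither
    # CoT ever mentions ordering as a factor.
    return order_sensitive and not mentions_order
```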
Causal mediation analysis quantifies the gap across model types. The FRODO paper (Making Reasoning Matter) applies causal mediation analysis across 12 LLMs on math, commonsense, and causal reasoning. Key finding: "GPT-4 only changes its answer 30% of the time when conditioned on perturbed counterfactual reasoning chains" — meaning that 70% of the time the answer does not causally depend on the chain. Instruction-tuned models (GPT-3.5-Instruct, Mistral-Instruct) show a stronger causal effect of reasoning traces on outputs than RLHF models (ChatGPT, Llama-2-Chat), suggesting that RLHF training specifically weakens the link between CoT and output. See also the position paper "Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!", which argues this anthropomorphization "isn't a harmless metaphor and instead is quite dangerous — it confuses the nature of these models." As Does chain-of-thought reasoning reflect genuine thinking or performance? argues, the 30% causal influence rate likely reflects the task mix: easy tasks contribute zero causal influence while hard tasks contribute genuine reasoning, averaging out to the observed 30%.
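The measurement behind the 30% figure reduces to an answer-flip rate under counterfactual chains. The sketch below is not FRODO's code; `generate` and `perturb` (for example, swapping an intermediate quantity or negating a premise) are hypothetical stand-ins:

```python
# Counterfactual-chain measurement: condition the model on a perturbed
# chain and count how often the final answer flips. A low flip rate means
# the answer does not causally depend on the chain. `generate` and
# `perturb` are hypothetical.

def causal_effect_rate(generate, perturb, dataset):
    flips = 0
    for question, chain in dataset:  # chain: the model's original CoT text
        original = generate(f"{question}\n\nReasoning:\n{chain}\n\nAnswer:")
        counterfactual = perturb(chain)
        flipped = generate(
            f"{question}\n\nReasoning:\n{counterfactual}\n\nAnswer:")
        flips += original != flipped
    return flips / len(dataset)  # ~0.30 for GPT-4 per the FRODO finding
```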
The stylistic mimicry argument adds new evidence for necessity failure: The "How do reasoning models reason?" paper documents that a significant fraction of R1's pre-answer traces are judged invalid by the original search algorithm that generated them — yet these invalid traces still produce correct final answers. If traces were causally necessary, invalid traces should produce wrong answers. They don't. This is the most direct evidence available that many trace steps are causally unnecessary — the answer was determined by other means, and the trace is a performance. See Do reasoning traces actually cause correct answers?.
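This claim is checkable as a simple contingency table over trace validity and answer correctness; if traces were causally necessary, the invalid-trace/correct-answer cell should be nearly empty. A sketch, assuming a hypothetical list of `(trace_is_valid, answer_is_correct)` records:

```python
# Cross-tabulate trace validity against answer correctness. `records` is a
# hypothetical list of (trace_is_valid, answer_is_correct) boolean pairs,
# e.g. from re-judging traces with the procedure that generated them.
from collections import Counter

def validity_correctness_table(records):
    table = Counter(records)
    n_invalid = table[(False, True)] + table[(False, False)]
    if n_invalid:
        # A high value here is direct evidence the trace was not necessary:
        # the answer came out right despite an invalid trace.
        print("P(correct | invalid trace) =", table[(False, True)] / n_invalid)
    return table
```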
Agentic pipelines amplify the failure (Thoughts without Thinking): in multi-LLM agentic pipelines implementing real-world perceptive task-guidance systems, reviewer scores for CoT thoughts are only weakly correlated with reviewer scores for the responses. This is an empirical measurement of faithfulness failure in production: the chain does not predict whether the output will be correct. Incorrect outputs follow plausible-looking chains, and incorrect chains do not reliably predict incorrect responses. The Einstellung effect is the specific mechanism: chains gravitate toward statistically common token sequences (e.g., real dump-truck components when reasoning about a toy) even when those sequences contradict the task — and the chain appears coherent throughout. See Does chain of thought reasoning actually explain model decisions?.
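At bottom, the pipeline-level failure is a weak correlation between two score series. A minimal illustration with placeholder ratings (the real study uses reviewer scores from the deployed pipeline):

```python
# Correlate reviewer scores for CoT thoughts with reviewer scores for the
# final responses. A faithful chain should predict its response, so r
# should be high; the paper reports it is weak. Ratings below are
# placeholders, not the paper's data.
from statistics import correlation  # Pearson's r, Python 3.10+

thought_scores = [4, 2, 5, 3, 1, 4]   # placeholder reviewer ratings
response_scores = [2, 3, 3, 5, 2, 1]  # placeholder reviewer ratings

r = correlation(thought_scores, response_scores)
print(f"thought-response correlation r = {r:.2f}")
```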
The propositional interpretability framework (Chalmers, Propositional Interpretability in Artificial Intelligence) positions CoT as the most accessible method for understanding AI systems in terms of their propositional attitudes — their beliefs, desires, and intentions. Chain-of-thought outputs are "pre-interpreted" in natural language, which makes them appear directly readable as propositional logs. But the systematic unfaithfulness identified above means CoT fails at this propositional-interpretability function: it reports attitudes the model did not actually use to reach its conclusion, omits attitudes that were causally active, and generally misrepresents the reasoning process. Sparse autoencoders (SAEs) may offer an alternative path: feature activations could serve as concept logs that track which representations were active, without relying on the model's potentially unfaithful natural-language self-report. The finding in Can language models describe their own learned behaviors? suggests that some propositional self-knowledge is accessible from the model; the faithfulness problem is that CoT may not be accessing it reliably.
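The concept-log idea can be sketched directly: record which sparse features fire at each token, independent of what the CoT text claims. The encoder, layer choice, and feature names below are all assumptions; real SAEs are trained per layer on residual-stream activations:

```python
# Hypothetical SAE "concept log": per-token record of active features,
# read from internals rather than from the model's verbal self-report.
import torch

def concept_log(sae_encoder, feature_names, residual_stream, top_k=3):
    # residual_stream: [seq_len, d_model] activations from one layer
    acts = torch.relu(sae_encoder(residual_stream))  # [seq_len, n_features]
    log = []
    for t in range(acts.shape[0]):
        top = torch.topk(acts[t], top_k)
        log.append([(feature_names[int(i)], v.item())
                    for i, v in zip(top.indices, top.values) if v > 0])
    return log  # per-token list of (concept, strength), independent of CoT
```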
The three-hypothesis framework formalizes the faithfulness problem. A position paper ("LLM Reasoning Is Latent, Not the Chain of Thought," 2604.15726) reframes the entire debate: the question is not "does CoT help?" but "what is CoT help evidence of?" Three hypotheses: H2 (reasoning is primarily the surface CoT), H0 (gains are generic serial compute), H1 (reasoning is primarily latent-state trajectories). H2 requires surface traces to have the most stable causal leverage — but the faithfulness failures documented above show they don't. H0 requires matched compute to explain gains — but it can't explain why specific internal states predict reasoning. H1 best fits: reasoning is a latent process that CoT only partially verbalizes. The practical recommendation: treat latent-state dynamics as the default object of study and design evaluations that explicitly disentangle surface traces, latent states, and serial compute. See Where does LLM reasoning actually happen during generation?.
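One concrete disentangling evaluation compares a real chain against a length-matched filler chain (same serial compute, no content) and against no chain at all. H0 predicts the filler matches the real chain; H2 predicts the real chain wins by a wide margin. A sketch with a hypothetical `accuracy(condition)` harness and an arbitrary illustrative threshold:

```python
# Separate trace content from serial compute. `accuracy` is a hypothetical
# harness that runs a benchmark under one prompting condition.

def compare_hypotheses(accuracy, margin=0.02):  # margin is illustrative
    acc_chain = accuracy("real_cot")        # surface trace + serial compute
    acc_filler = accuracy("filler_tokens")  # serial compute only, e.g. "..."
    acc_none = accuracy("direct_answer")    # neither
    if acc_chain - acc_filler < margin:
        return "gains look like generic serial compute (H0)"
    if acc_filler - acc_none < margin:
        return "trace content carries signal beyond compute (H2, or H1 partially verbalized)"
    return "mixed: both content and compute contribute"
```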
The faithfulness-plausibility conflation is a field-wide methodological problem. Jacovi and Goldberg (2020) argue that the distinction between plausibility (how convincing an interpretation is to humans) and faithfulness (how accurately it reflects the model's true reasoning) is widely conflated in the interpretability literature. The conflation is dangerous: in high-stakes applications like recidivism prediction, a plausible but unfaithful interpretation may be the worst-case scenario. Furthermore, "inherently interpretable" methods like attention have been shown to lack faithfulness despite their apparent transparency — the claim of interpretability requires independent verification. The authors propose replacing the binary notion of faithfulness with a graded one, arguing that the current binary standard sets "a potentially unrealistic bar" and a gradient would have greater practical utility. Source: Arxiv/Evaluations.
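A graded score falls out naturally from the step-ablation test sketched earlier: report the fraction of steps whose removal changes the answer, rather than a binary verdict for the whole chain. A sketch reusing the hypothetical `answer_with_chain` helper from above:

```python
# Graded faithfulness in the spirit of Jacovi and Goldberg: the share of
# steps that are load-bearing under ablation. Reuses the hypothetical
# `answer_with_chain` helper defined in the first sketch.

def graded_faithfulness(generate, question, steps):
    baseline = answer_with_chain(generate, question, steps)
    load_bearing = sum(
        answer_with_chain(generate, question, steps[:i] + steps[i + 1:])
        != baseline
        for i in range(len(steps)))
    return load_bearing / max(len(steps), 1)  # 1.0 = every step matters
```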
Source: Argumentation
Related concepts in this collection
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  Relation: encoding ≠ causal influence; same principle applied to CoT steps.
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
  Relation: necessity failure from behavior; correct chains contain fewer spurious steps.
- Do hedging markers actually signal careful thinking in AI?
  Explores whether linguistic markers like "alternatively" and "however" in model outputs correlate with accuracy or uncertainty. This matters because users often interpret such language as a sign of trustworthy reasoning.
  Relation: surface markers of reasoning quality are decorative in failing chains.
- Why do LLMs accept logical fallacies more than humans?
  LLMs fall for persuasive but invalid arguments at much higher rates than humans. This explores whether reasoning models genuinely evaluate logic or simply mimic argument structure.
  Relation: CoT provides no resistance because the reasoning steps are not causally driving acceptance.
- Do language model reasoning drafts faithfully represent their actual computation?
  If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.
  Relation: extends with empirical methodology; counterfactual interventions make the abstract faithfulness claim measurable, and two-dimensional operationalization shows that intra-draft and draft-answer faithfulness fail independently.
- Does reflection in reasoning models actually correct errors?
  When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
  Relation: behavioral confirmation; if most reflection is confirmatory post-hoc, faithfulness failure at the causal level is the expected mechanism, and visible reasoning tokens are not driving the answer.
- Where do memorization errors arise in chain-of-thought reasoning?
  Explores whether memorization in language model reasoning can be localized to specific token sources and which sources dominate error patterns during long generations.
  Relation: STIM provides the token-level mechanism for faithfulness failure; memorized tokens are neither causally sufficient (pattern-matching replaces reasoning) nor causally necessary (memorized continuations happen to produce correct answers without reasoning).
- Can a model be truthful without actually being honest?
  Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
  Relation: unfaithful CoT undermines the ability to evaluate honesty; if reasoning steps do not causally drive the answer, the model's apparent expression of its beliefs through CoT is unreliable as evidence of its actual internal state.
- Does RLVR actually improve mathematical reasoning or just coherence?
  RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.
  Relation: RLVR-specific instance; FOL analysis shows RLVR improves local coherence (adjacent steps follow plausibly) without establishing causal sufficiency or necessity, i.e. structural improvement without faithfulness improvement.
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  Relation: provides the mechanistic theory for why faithfulness fails; if CoT is constrained imitation of reasoning form rather than genuine inference, the chain is performative by design, and faithfulness failure is the expected behavior rather than a surprising defect.
- What mechanism enables models to retrieve from long context?
  Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?
  Relation: retrieval heads provide the mechanistic substrate for CoT causal sufficiency; reasoning steps can only causally influence subsequent steps if retrieval heads attend to them, and the sparsity of retrieval heads (<5% of attention heads) explains why many CoT steps may be generated without being computationally integrated into downstream reasoning.
- Where does LLM reasoning actually happen during generation?
  Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.
  Relation: provides the theoretical superstructure; the H1/H2/H0 framework formalizes what faithfulness failure means for the field's default assumptions.
Original note title: cot faithfulness requires both causal sufficiency and causal necessity — llms currently satisfy neither consistently