LLM Reasoning and Architecture · Language Understanding and Pragmatics

Do language models actually use their reasoning steps?

Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.

Note · 2026-02-21 · sourced from Argumentation
Related: How should we allocate compute budget at inference time? · What kind of thing is an LLM really? · How should researchers navigate LLM reasoning research?

CoT evaluation typically asks: does the model produce valid-looking reasoning steps? But validity of form is not faithfulness. A reasoning chain is faithful only if it is causally responsible for the output.

Two distinct criteria define faithfulness:

Causal sufficiency: The chain should be enough to produce the answer. The stated steps, taken on their own, should determine the conclusion; if the model reaches its answer through computation the chain never describes, the chain is not sufficient: it has the form of inference without the function.

Causal necessity: The answer should depend on the chain. Removing or perturbing steps should change the output; a step whose removal changes nothing was spurious, and a chain that can be corrupted wholesale without flipping the answer was not needed at all.

Trustworthy CoT requires both. A chain can be causally sufficient (the stated steps would support the conclusion on their own) without being causally necessary (the model would have produced the same answer without them, so the steps are decorative). A chain can be causally necessary (perturbing it shifts the answer) without being sufficient (the steps alone do not account for the conclusion; hidden computation fills the gap).
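
Both criteria can be phrased as concrete tests. A minimal sketch, assuming a hypothetical `ask_model(question, steps)` helper that returns the model's final answer when conditioned on a chain, and a separate `follow_chain` reader that sees only the stated steps (both names are illustrative, not from any cited paper):

```python
from typing import Callable, List

def spurious_steps(question: str,
                   steps: List[str],
                   answer: str,
                   ask_model: Callable[[str, List[str]], str]) -> List[int]:
    """Necessity check: indices of steps whose removal does NOT change
    the model's answer. Every such step was causally unnecessary."""
    spurious = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]           # drop one step
        if ask_model(question, ablated) == answer:    # answer unchanged
            spurious.append(i)
    return spurious

def chain_is_sufficient(question: str,
                        steps: List[str],
                        answer: str,
                        follow_chain: Callable[[str, List[str]], str]) -> bool:
    """Sufficiency check: does the chain, taken on its own, determine
    the answer? `follow_chain` is a reader (a separate model or
    verifier) that sees only the question and the stated steps, with no
    access to the original model's hidden computation."""
    return follow_chain(question, steps) == answer
```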

LLMs satisfy neither criterion consistently. Do language models actually use their encoded knowledge? describes the sufficiency problem: knowledge encoded in representations may not causally influence outputs. CoT steps can be generated as performative reasoning — the form of inference without the function. Why do correct reasoning traces contain fewer tokens? reveals the necessity failure from the opposite direction: correct traces are shorter, which means incorrect traces contain spurious steps that did not contribute to correct reasoning.

Do hedging markers actually signal careful thinking in AI? adds a behavioral signal: the surface markers of uncertainty cluster in chains that are already going wrong — the hedging is decorative, not functional.

The implication is uncomfortable: most CoT evaluation is measuring output quality and reasoning form, not causal faithfulness. Models can generate reasoning that looks like it produced the answer while the answer was determined by other means. We have limited tools for distinguishing the two.

The institutional dimension: "Chain-of-Thought Is Not Explainability" (2025) finds that ~25% of recent arXiv papers incorporating CoT also treat it as an interpretability technique — roughly 244 of 1,000 sampled papers. This is a documented methodological error at scale. The "illusion of transparency" framing applies here: models can reorder multiple-choice options and their CoT explanations never mention this influence, rationalizing whatever answer was selected. Models also make errors in intermediate steps but still produce correct final answers via computational pathways not captured in the trace. Fixing this requires causal validation methods — activation patching, counterfactual interventions, verifier models — that ground explanations in model internals rather than surface output form. See Do chain of thought traces actually help humans understand reasoning?.
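
As a rough illustration of what "grounding in model internals" looks like, here is a minimal activation-patching sketch in PyTorch. It assumes a hypothetical decoder whose block at `layer` returns a single tensor, a cached activation from a clean run, and keyword-style inputs; none of this reflects a specific paper's protocol.

```python
import torch

def run_with_patched_activation(model, layer, corrupt_inputs, clean_activation):
    """Re-run the model on a corrupted input while splicing in an
    activation recorded from a clean run at one layer.

    If the patch moves the output back toward the clean answer, that
    activation site is causally implicated in the behavior, which is
    evidence the CoT text alone cannot provide.
    """
    def hook(_module, _inputs, _output):
        # Returning a value from a forward hook replaces the layer's output.
        return clean_activation

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            return model(**corrupt_inputs)
    finally:
        handle.remove()
```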

Causal mediation analysis quantifies the gap across model types. The FRODO paper (Making Reasoning Matter) applies causal mediation analysis across 12 LLMs on math, commonsense, and causal reasoning. Key finding: "GPT-4 only changes its answer 30% of the time when conditioned on perturbed counterfactual reasoning chains" — meaning that 70% of the time, the reasoning chain is causally unnecessary. Instruction-tuned models (GPT-3.5-Instruct, Mistral-Instruct) show a stronger causal effect of reasoning traces on outputs than RLHF models (ChatGPT, Llama-2-Chat), suggesting RLHF training specifically weakens the link between CoT and output. See also the position paper "Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!", which argues this anthropomorphization "isn't a harmless metaphor and instead is quite dangerous — it confuses the nature of these models." As Does chain-of-thought reasoning reflect genuine thinking or performance? argues, the 30% causal-influence rate likely reflects the task mix: easy tasks contribute zero causal influence while hard tasks contribute genuine reasoning, averaging out to the observed 30%.
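
The headline statistic reduces to a flip rate over a dataset. A sketch, with `ask_model` and `perturb_chain` as hypothetical stand-ins rather than FRODO's actual protocol:

```python
def causal_effect_rate(examples, ask_model, perturb_chain):
    """Fraction of examples whose answer changes when the model is
    conditioned on a perturbed counterfactual reasoning chain.

    A low rate (e.g., ~30%) means the chain is causally inert for most
    inputs: the answer survives even when the stated reasoning is broken.
    """
    flips = 0
    for ex in examples:
        original = ask_model(ex["question"], ex["chain"])
        counterfactual = ask_model(ex["question"], perturb_chain(ex["chain"]))
        flips += (original != counterfactual)
    return flips / len(examples)
```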

The stylistic mimicry argument adds new evidence for necessity failure: The "How do reasoning models reason?" paper documents that a significant fraction of R1's pre-answer traces are judged invalid by the original search algorithm that generated them — yet these invalid traces still produce correct final answers. If traces were causally necessary, invalid traces should produce wrong answers. They don't. This is the most direct evidence available that many trace steps are causally unnecessary — the answer was determined by other means, and the trace is a performance. See Do reasoning traces actually cause correct answers?.

Agentic pipelines amplify the failure (Thoughts without Thinking): In multi-LLM agentic pipelines implementing real perceptive task guidance systems, reviewer scores for CoT thoughts are weakly correlated with reviewer scores for responses. This is an empirical measurement of faithfulness failure in production: the chain doesn't predict whether the output will be correct. Incorrect outputs follow plausible-looking chains; incorrect chains don't reliably predict incorrect responses. The Einstellung effect is the specific mechanism: chains gravitate toward statistically common token sequences (e.g., real dump truck components when reasoning about a toy) even when those sequences contradict the task — and the chain appears coherent throughout. See Does chain of thought reasoning actually explain model decisions?.
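
The measurement itself is simple to reproduce in outline; the scores below are made-up placeholders, not numbers from the paper:

```python
from scipy.stats import spearmanr

# Hypothetical per-example reviewer scores (1-5) for the CoT "thought"
# and for the final response in an agentic pipeline.
thought_scores  = [5, 4, 5, 3, 4, 5, 2, 4]
response_scores = [2, 5, 3, 4, 1, 3, 4, 2]

rho, p_value = spearmanr(thought_scores, response_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A rho near zero means chain quality tells you little about whether the
# final response is any good: a faithfulness failure in production.
```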

The propositional interpretability framework (Chalmers, Propositional Interpretability in Artificial Intelligence) positions CoT as the most accessible method for understanding AI systems in terms of their propositional attitudes — their beliefs, desires, and intentions. Chain-of-thought outputs are "pre-interpreted" in natural language, which makes them appear directly readable as propositional logs. But the systematic unfaithfulness identified above means CoT fails at this propositional-interpretability function: it reports attitudes the model did not actually use to reach its conclusion, omits attitudes that were causally active, and generally misrepresents the reasoning process. Sparse autoencoders (SAEs) may offer an alternative path: feature activations could serve as concept logs that track which representations were active without relying on the model's potentially unfaithful natural-language self-report. The finding in Can language models describe their own learned behaviors? suggests that some propositional self-knowledge is accessible from the model; the faithfulness problem is that CoT may not be accessing it reliably.
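
A minimal sketch of the "concept log" idea, assuming residual-stream activations and the encoder weights of an already-trained SAE; `w_enc` and `b_enc` are hypothetical placeholders, not a reference to any published SAE:

```python
import torch

def concept_log(hidden_states, w_enc, b_enc, top_k=5):
    """Record which sparse-autoencoder features fire at each token.

    hidden_states: (seq_len, d_model) residual-stream activations.
    w_enc, b_enc:  encoder weights of a (hypothetical) trained SAE.
    Returns a list of (token_index, feature_ids) pairs, a concept log
    that does not rely on the model's natural-language self-report.
    """
    acts = torch.relu(hidden_states @ w_enc + b_enc)   # (seq_len, n_features)
    log = []
    for t in range(acts.shape[0]):
        candidates = torch.topk(acts[t], top_k).indices.tolist()
        log.append((t, [f for f in candidates if acts[t, f] > 0]))
    return log
```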

The three-hypothesis framework formalizes the faithfulness problem. A position paper ("LLM Reasoning Is Latent, Not the Chain of Thought," 2604.15726) reframes the entire debate: the question is not "does CoT help?" but "what is CoT help evidence of?" Three hypotheses: H2 (reasoning is primarily the surface CoT), H0 (gains are generic serial compute), H1 (reasoning is primarily latent-state trajectories). H2 requires surface traces to have the most stable causal leverage — but the faithfulness failures documented above show they don't. H0 requires matched compute to explain gains — but it can't explain why specific internal states predict reasoning. H1 best fits: reasoning is a latent process that CoT only partially verbalizes. The practical recommendation: treat latent-state dynamics as the default object of study and design evaluations that explicitly disentangle surface traces, latent states, and serial compute. See Where does LLM reasoning actually happen during generation?.
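
One way to act on that recommendation is a matched-compute control: compare accuracy when the model is given its own chain, a filler prefix of roughly the same length, and no prefix at all. A rough sketch, assuming a hypothetical `ask_with_prefix(example, prefix)` helper that returns the model's final answer:

```python
def compute_matched_comparison(examples, ask_with_prefix):
    """Separate 'the trace helps' from 'extra serial compute helps'.

    Three conditions per example: the model's own CoT, a filler prefix of
    roughly matched length, and no prefix. If filler recovers most of the
    CoT gain, H0 (generic serial compute) explains the benefit; if neither
    surface condition tracks the gain, the action is in latent-state
    trajectories (H1).
    """
    scores = {"cot": 0, "filler": 0, "none": 0}
    for ex in examples:
        filler = "... " * len(ex["chain"].split())
        scores["cot"]    += ask_with_prefix(ex, ex["chain"]) == ex["answer"]
        scores["filler"] += ask_with_prefix(ex, filler) == ex["answer"]
        scores["none"]   += ask_with_prefix(ex, "") == ex["answer"]
    n = len(examples)
    return {condition: hits / n for condition, hits in scores.items()}
```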

The faithfulness-plausibility conflation is a field-wide methodological problem. Jacovi and Goldberg (2020) argue that the distinction between plausibility (how convincing an interpretation is to humans) and faithfulness (how accurately it reflects the model's true reasoning) is widely conflated in the interpretability literature. The conflation is dangerous: in high-stakes applications like recidivism prediction, a plausible but unfaithful interpretation may be the worst-case scenario. Furthermore, "inherently interpretable" methods like attention have been shown to lack faithfulness despite their apparent transparency — the claim of interpretability requires independent verification. The authors propose replacing the binary notion of faithfulness with a graded one, arguing that the current binary standard sets "a potentially unrealistic bar" and a gradient would have greater practical utility. Source: Arxiv/Evaluations.


Source: Argumentation

Original note title: cot faithfulness requires both causal sufficiency and causal necessity — llms currently satisfy neither consistently