Does chain-of-thought reasoning actually explain model decisions?
Chain-of-thought is deployed to make AI systems transparent and auditable. But does the quality of the reasoning chain actually track the correctness of the output, or does it just create an illusion of explainability?
Post angle — Medium
The pitch for CoT in production systems: by generating reasoning steps before answers, you get transparency into the model's decision-making process. You can audit the reasoning, catch errors, build user trust.
The empirical finding from "Thoughts without Thinking": in agentic multi-LLM pipelines, reviewer scores for CoT thoughts are only weakly correlated with reviewer scores for the corresponding responses. The quality of the reasoning chain doesn't predict whether the output will be correct: incorrect outputs can follow plausible-looking chains, and flawed chains don't reliably produce incorrect outputs.
This is not just academic. The CoT explainability promise is used to justify deploying agentic AI in high-stakes settings — because "you can see the reasoning." If the reasoning doesn't causally produce the output, this justification is hollow.
The deeper problem: CoT generates more material for post-hoc analysis, not better explainability. There's a difference between "I can analyze what went wrong" (what CoT provides) and "I can understand what the system will do" (what explainability requires). The former requires significant analytical effort and may actively mislead by appearing coherent.
The Einstellung Paradigm finding makes this concrete: the chain quickly gravitates toward statistically common token sequences, even when they contradict the task. The chain doesn't reveal this deviation — it looks fluent throughout.
Connections: Does chain of thought reasoning actually explain model decisions?, Do reasoning traces actually cause correct answers?, Do language models actually use their reasoning steps?
Source: Reasoning Architectures
Original note title: the explainability illusion — why cot in agentic pipelines produces chains that don't explain anything