Chain-of-Thought Is Not Explainability
We argue that CoT rationales can be misleading and are neither necessary nor sufficient for trustworthy interpretability. Synthesising evidence from previous studies, we analyse faithfulness in terms of whether CoTs are not only human-interpretable but also reflect the model's underlying reasoning in a way that supports responsible use. We show that verbalised chains are frequently unfaithful: they diverge from the hidden computations that actually drive a model's predictions and thus give an incorrect picture of how the model arrives at its conclusions. Despite this, CoT is increasingly relied upon in high-stakes domains such as medicine, law, and autonomous systems; our analysis of 1,000 recent CoT-centric papers finds that ~25% explicitly treat CoT as an interpretability technique. Building on prior work in interpretability, we make three proposals: (i) stop treating CoT as sufficient for interpretability without additional verification, while continuing to use it for its communicative benefits; (ii) adopt rigorous methods that assess faithfulness before CoT is used for downstream decision-making; and (iii) develop causal validation methods (e.g., activation patching, counterfactual interventions, verifier models) to ground explanations in model internals.
Researchers have increasingly, and often unduly, treated CoT as revealing what models "think" [17, 52, 58, 34]. According to our estimates (derived in Appendix B), almost 25% (244 out of 1,000) of the research papers from the past year that appeared on arXiv and incorporate CoT in their model design or dataset construction also regard CoT as a technique for enhancing model interpretability.
The Unfaithfulness Problem. Despite their intuitive appeal, growing evidence shows that chain-of-thought outputs often fail to meet these criteria [75, 4, 5]. CoT explanations frequently diverge from models' real decision processes, as models may rely on shortcuts or latent knowledge that is never expressed in their verbalised reasoning [4, 5]. In such cases, the CoT reads as a plausible but untrustworthy explanation [37]. Two illustrative cases follow:
• Prompt bias influence. As a violation of causality and completeness, Turpin et al. [75] showed that reordering multiple-choice options can cause models to choose different answers in up to 36% of cases, yet their CoT explanations never mention this influence and instead rationalise whichever answer was selected (a minimal probe of this kind is sketched after this list).
• Silent error correction. As a violation of soundness, Lanham et al. [47] and Arcuschin et al. [5] both documented cases where models make errors in intermediate reasoning steps yet still produce correct final answers, indicating that the models rely on computational pathways not revealed in their verbalised steps.
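To make the first case concrete, the following is a minimal sketch (not the protocol of Turpin et al. [75]) of an order-sensitivity probe: the same multiple-choice question is posed with the options in two orders, and we record whether the chosen answer flips while neither CoT acknowledges option order. The helper `query_model` is a hypothetical stand-in for an LLM call returning the CoT text and the final answer letter, and the keyword check for order mentions is a crude proxy for a proper annotation step.

```python
from typing import Callable, Dict, List, Tuple


def format_prompt(question: str, options: List[str]) -> str:
    """Render a multiple-choice prompt with lettered options."""
    lettered = "\n".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))
    return f"{question}\n{lettered}\nThink step by step, then answer with a single letter."


def probe_order_sensitivity(
    query_model: Callable[[str], Tuple[str, str]],  # prompt -> (cot_text, answer_letter); hypothetical
    question: str,
    options: List[str],
) -> Dict[str, object]:
    """Ask the same question with original and reversed option order and compare."""
    orders = {"original": options, "reversed": list(reversed(options))}
    results: Dict[str, Dict[str, object]] = {}
    for name, opts in orders.items():
        cot, letter = query_model(format_prompt(question, opts))
        # Map the letter back to the option text so answers are comparable across orders.
        idx = ord(letter.strip().upper()) - ord("A")
        results[name] = {
            "answer_text": opts[idx] if 0 <= idx < len(opts) else None,
            # Crude proxy for whether the CoT acknowledges option order at all.
            "cot_mentions_order": any(kw in cot.lower() for kw in ("order", "position", "listed first")),
        }
    answer_changed = results["original"]["answer_text"] != results["reversed"]["answer_text"]
    return {
        "answer_changed": answer_changed,
        # Unfaithfulness signal: the answer flips with option order, but no CoT mentions it.
        "silent_bias": answer_changed and not any(r["cot_mentions_order"] for r in results.values()),
        "per_order": results,
    }
```

In practice such a probe would be run over a full benchmark and aggregated, so that the rate of silent, order-induced answer flips can be compared against how often the accompanying CoTs acknowledge any positional influence.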