Chain-of-Thought Is Not Explainability
We argue that CoT rationales can be misleading and are neither necessary nor sufficient for trustworthy interpretability. Synthesising evidence from previous studies, we analyse faithfulness in terms of whether CoTs are not only human-interpretable but also reflect the model's underlying reasoning in a way that supports responsible use. We show that verbalised chains are frequently unfaithful: they diverge from the hidden computations that actually drive a model's predictions and thus give an incorrect picture of how the model arrives at its conclusions. Despite this, CoT is increasingly relied upon in high-stakes domains such as medicine, law, and autonomous systems; our analysis of 1,000 recent CoT-centric papers finds that ~25% explicitly treat CoT as an interpretability technique. Building on prior work in interpretability, we make three proposals: (i) stop treating CoT as sufficient for interpretability without additional verification, while continuing to use it for its communicative benefits; (ii) adopt rigorous methods that assess faithfulness before CoT is used for downstream decision-making; and (iii) develop causal validation methods (e.g., activation patching, counterfactual interventions, verifier models) to ground explanations in model internals.
Researchers have increasingly, and often unduly, treated CoT as revealing what models "think" [17, 52, 58, 34]. According to our estimates (derived in Appendix B), almost 25% (244 out of 1,000) of the research papers from the past year that appeared on arXiv and incorporate CoT in their model design or dataset construction also regard CoT as a technique for enhancing model interpretability.
The Unfaithfulness Problem. Despite their intuitive appeal, growing evidence shows that chain-of-thought outputs often fail to meet these criteria [75, 4, 5]. CoT explanations frequently diverge from models' real decision processes, as models may rely on shortcuts or latent knowledge that is never expressed in their verbalised reasoning [4, 5]. In such cases, the CoT reads as a plausible but untrustworthy explanation [37]. Two illustrative cases follow:
• Prompt bias influence. As a violation of causality and completeness, Turpin et al. [75] showed that reordering multiple-choice options can cause models to choose different answers in up to 36% of cases, yet their CoT explanations never mention this influence and instead rationalise whichever answer was selected (a minimal probe of this kind is sketched after this list).
• Silent error correction. As a violation of soundness, Lanham et al. [47] and Arcuschin et al. [5] both documented cases where models make errors in intermediate reasoning steps yet still produce correct final answers, indicating that the models rely on computational pathways not revealed in their verbalised steps.
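To make the first case concrete, the following is a minimal sketch (not the protocol of Turpin et al. [75]) of an order-sensitivity probe: the same multiple-choice question is posed with the options in two orders, and we record whether the chosen answer flips while neither CoT acknowledges option order. The helper `query_model` is a hypothetical stand-in for an LLM call returning the CoT text and the final answer letter, and the keyword check for order mentions is a crude proxy for a proper annotation step.

```python
from typing import Callable, Dict, List, Tuple


def format_prompt(question: str, options: List[str]) -> str:
    """Render a multiple-choice prompt with lettered options."""
    lettered = "\n".join(f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options))
    return f"{question}\n{lettered}\nThink step by step, then answer with a single letter."


def probe_order_sensitivity(
    query_model: Callable[[str], Tuple[str, str]],  # prompt -> (cot_text, answer_letter); hypothetical
    question: str,
    options: List[str],
) -> Dict[str, object]:
    """Ask the same question with original and reversed option order and compare."""
    orders = {"original": options, "reversed": list(reversed(options))}
    results: Dict[str, Dict[str, object]] = {}
    for name, opts in orders.items():
        cot, letter = query_model(format_prompt(question, opts))
        # Map the letter back to the option text so answers are comparable across orders.
        idx = ord(letter.strip().upper()) - ord("A")
        results[name] = {
            "answer_text": opts[idx] if 0 <= idx < len(opts) else None,
            # Crude proxy for whether the CoT acknowledges option order at all.
            "cot_mentions_order": any(kw in cot.lower() for kw in ("order", "position", "listed first")),
        }
    answer_changed = results["original"]["answer_text"] != results["reversed"]["answer_text"]
    return {
        "answer_changed": answer_changed,
        # Unfaithfulness signal: the answer flips with option order, but no CoT mentions it.
        "silent_bias": answer_changed and not any(r["cot_mentions_order"] for r in results.values()),
        "per_order": results,
    }
```

In practice such a probe would be run over a full benchmark and aggregated, so that the rate of silent, order-induced answer flips can be compared against how often the accompanying CoTs acknowledge any positional influence.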