Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines

Paper · arXiv 2505.00875 · Published May 1, 2025
Tags: Reasoning Architectures · Flaws · Reasoning Critiques

Agentic pipelines present novel challenges and opportunities for human-centered explainability. The HCXAI community is still grappling with how best to make the inner workings of LLMs transparent in actionable ways. Agentic pipelines consist of multiple LLMs working in cooperation with minimal human control. In this research paper, we present early findings from an agentic pipeline implementation of a perceptive task guidance system. Through quantitative and qualitative analyses, we examine how Chain-of-Thought (CoT) reasoning, a common vehicle for explainability in LLMs, operates within agentic pipelines. We demonstrate that CoT reasoning alone does not lead to better outputs, nor does it offer explainability, as it tends to produce explanations without explainability: explanations that do not improve the ability of end users to better understand systems or achieve their goals.
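To make the setting concrete, the sketch below shows one way such a pipeline can be wired together. It is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical wrapper around any chat-completion client, and the three agent roles (retrieval, CoT reasoning, review) are assumptions chosen to mirror the kind of multi-LLM cooperation described above.

```python
# Minimal sketch of a multi-agent pipeline with CoT reasoning.
# Assumptions: `call_llm` is a hypothetical wrapper around a chat-completion
# API; the agent roles below are illustrative, not the paper's exact design.

def call_llm(system: str, user: str) -> str:
    """Hypothetical single LLM call; replace with your provider's client."""
    raise NotImplementedError

def run_pipeline(task: str, documents: list[str]) -> dict:
    # Agent 1: retrieve task-relevant context (stand-in for a RAG agent).
    context = call_llm("Select only the passages relevant to the task.",
                       f"Task: {task}\nDocuments: {documents}")
    # Agent 2: produce chain-of-thought reasoning followed by an answer.
    answer = call_llm("Think step by step, then give a final answer.",
                      f"Task: {task}\nContext: {context}")
    # Agent 3: review the answer with minimal human involvement.
    review = call_llm("Rate the answer's helpfulness from 1 to 5.",
                      f"Task: {task}\nAnswer: {answer}")
    return {"context": context, "answer": answer, "review": review}
```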

Reviewer scores for the answers in reasoning models are weakly correlated with the thoughts

Interestingly, we find that the reviewer scores for the thoughts are only weakly correlated with the reviewer scores for the responses. This indicates that the thoughts do not necessarily guide the models to the correct responses. If the thoughts were strongly correlated with the answers, we might infer that the "correctness" or "incorrectness" of the answers could be explained by the thoughts.
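As a concrete illustration of the kind of check this finding rests on, the snippet below computes a rank correlation between reviewer scores for thoughts and reviewer scores for responses. The score lists are hypothetical placeholders, not the paper's data; a weak correlation corresponds to the finding described above.

```python
# Hedged sketch: correlate reviewer scores for CoT "thoughts" with reviewer
# scores for the final responses. The numbers below are made-up placeholders.
from scipy.stats import spearmanr

thought_scores = [3, 4, 2, 5, 1, 4, 3, 2]   # reviewer ratings of the CoT traces
response_scores = [2, 5, 4, 3, 2, 3, 5, 1]  # reviewer ratings of the answers

rho, p_value = spearmanr(thought_scores, response_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A weak rho would indicate that thought quality does not explain answer quality.
```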

The data presented above indicates that CoT can lead to incorrect answers, hampering explainability. To investigate these implications, we also conducted a qualitative content analysis (QCA) [16, 22] of the CoT reasoning. QCA can act as a form of "reverse engineering" [31] that allows analysts to focus on the entire system that produced the outputs. Take, for example, prompt-CoT-output tuple A in Appendix Table 1. Given the input, the output is unhelpful and incorrect: it references components (e.g., a transmission system, a clutch) that are neither part of the toy dump truck nor mentioned in the assembly instructions. Inspecting the CoT does not offer users an explanation for why the output is so unhelpful, but it does show how quickly the CoT reasoning went astray.

A QCA of the CoT, however, does provide some clues as to what may have led it astray by pointing to larger elements than the tuple itself. A qualitative reading of the CoT shows how quickly the text moves away from references to "disassembling and reassembling a toy" (as provided by the assembly instructions) and toward references to what is more "typically" associated with a dump truck: an actual machine. Here, the CoT pulls in a range of tokens related to machine components such as a "clutch", "transmission", or "gears"; while it continues to mention the "toy", nouns and verbs related to actual machines predominate. Put together, one hypothesis generated by the QCA points toward the tendency of LLMs to fall victim to the "Einstellung Paradigm" [23], in which a focus on familiar or common approaches leads away from the correct one. Here, the CoT focuses on the tokens the LLM more commonly associates with dump trucks (machine-related) rather than the tokens germane to the task (toy-related and provided through RAG). Another hypothesis generated by the QCA is that the RAG agent struggled to fill the context window and therefore drew on the foundation model, which returned text about dump trucks, generically, as complex machinery. Crucially, while the QCA has pointed us in potentially helpful directions for troubleshooting this unhelpful output, such detailed analysis of the CoT does not lead to explainability. Rather, considerably more effort is required to make sense of inputs and outputs, even if the CoT generates more material with which to engage in such sensemaking. Tuple B in Appendix Table 1 also evinces a susceptibility to the Einstellung paradigm.
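One way to make the Einstellung hypothesis inspectable is a crude lexical check of whether a CoT trace drifts from task vocabulary toward generic machine vocabulary. The sketch below is illustrative only: the word lists and the example trace are hypothetical, not drawn from the study's data.

```python
# Illustrative sketch: count task-related vs. machine-related terms in a CoT
# trace. Word lists and the sample text are hypothetical, not the paper's data.
import re
from collections import Counter

TOY_TERMS = {"toy", "assembly", "instructions", "snap", "piece", "reassemble"}
MACHINE_TERMS = {"clutch", "transmission", "gears", "hydraulic", "engine"}

def vocabulary_drift(cot_text: str) -> dict:
    tokens = Counter(re.findall(r"[a-z]+", cot_text.lower()))
    return {
        "toy_terms": sum(tokens[t] for t in TOY_TERMS),
        "machine_terms": sum(tokens[t] for t in MACHINE_TERMS),
    }

print(vocabulary_drift("First, engage the clutch and check the transmission "
                       "gears before reassembling the toy."))
# {'toy_terms': 1, 'machine_terms': 3}  -> machine vocabulary predominates
```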