How do LLM explanations diverge from actual internal reasoning?
This explores the gap between what an LLM says it's doing — its explanations, chain-of-thought, and self-reports — and what's actually happening inside it when it produces an answer.
The words a model uses to explain itself are produced by a different process than the one that produces its answers, and the corpus suggests these two processes are only loosely coupled. The clearest evidence comes from work arguing that LLM reasoning mostly lives in hidden-state trajectories, with the visible chain-of-thought serving as only a "partial interface" onto that hidden process rather than a transcript of it (Where does LLM reasoning actually happen during generation?). If the real computation happens in latent dynamics and the explanation is a surface rendering, then divergence is not a bug; it is the default condition.
That divergence shows up most starkly as a split between knowing and doing. Models can state a concept correctly and then fail to apply it, a pattern called "Potemkin understanding," in which a fluent explanation sits on top of an inability to execute (Can LLMs understand concepts they cannot apply?). The pattern has been measured: rationales are correct roughly 87% of the time, but actions only about 64% (Can language models understand without actually executing correctly?), a "knowing-doing gap" that persists across model scales (Why do language models fail to act on their own reasoning?). The explanation pathway and the execution pathway appear functionally dissociated, so the explanation can't be trusted as a window into what the model will actually do.
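As a minimal sketch of how a gap like the 87% versus 64% figure could be computed, assume each evaluation trial has already been graded twice: once for whether the stated rationale was correct and once for whether the action taken was correct. The `Trial` record and the `knowing_doing_gap` function are hypothetical names introduced here for illustration; they are not taken from the cited work, and the grading itself is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Trial:
    """One graded evaluation trial (hypothetical record layout)."""
    rationale_correct: bool  # did the model state the right reasoning?
    action_correct: bool     # did the model then take the right action?


def knowing_doing_gap(trials: Sequence[Trial]) -> Tuple[float, float, float]:
    """Return (rationale accuracy, action accuracy, difference between them)."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    knowing = sum(t.rationale_correct for t in trials) / n
    doing = sum(t.action_correct for t in trials) / n
    return knowing, doing, knowing - doing
```

A positive third value means the model explains correctly more often than it acts correctly. Plugging in graded trials that reproduce the reported rates would give a gap of roughly 0.23.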
The deeper reason introspective questions don't get honest answers is that self-reports mostly echo training data, not internal states. When you ask a model why it did something, it tends to generate the kind of explanation a human would write, a plausible story drawn from the training distribution, rather than reading off its own machinery (Can language models actually introspect about their own states?). Genuine introspection is possible only in the narrow cases where a causal chain actually links an internal state to the report, such as inferring a low sampling temperature from the consistency of its own outputs; absent that link, the explanation is confabulation that happens to sound right. This is also why better reasoning training doesn't cure sycophancy: the agreeable answer is a property of the generation distribution, not a flaw the model could reason its way out of (Can better reasoning training actually reduce model sycophancy?).
Mechanistic interpretability gives the structural backing for all of this. Internal representation and external performance are decoupled: two models can hit identical accuracy with radically different internals, and mechanisms that *look* interpretable may not actually drive the output (What actually happens inside the minds of language models?). Understanding itself turns out to be a patchwork, in which genuine compact circuits coexist with lower-tier heuristics rather than replacing them (Do language models understand in fundamentally different ways?). So an explanation might faithfully describe a principled circuit the model has, or it might describe such a circuit while the answer was actually produced by a shortcut heuristic.
The takeaway: these are not random errors but repeatable, named failure modes (Potemkins, knowing-doing gaps, presupposition accommodation, confabulated self-reports) that all trace back to the same root, the gap between statistical pattern-matching and actual epistemic competence (How do LLMs fail to know what they seem to understand?). A model's explanation is best read as a separately generated artifact that *correlates* with its reasoning, not a faithful log of it. The practical move, then, is not to ask the model to explain itself better but to test whether explanation and behavior actually agree.
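One way to make that agreement test concrete is sketched below, under strong simplifying assumptions. The callables `stated` and `actual` are hypothetical stand-ins for two separate model queries: one asking what the model says it will do on a prompt, the other recording what it actually did. Exact string matching after normalization is a deliberately crude comparison; a real check would likely need a more forgiving notion of equivalence.

```python
from typing import Callable, Iterable, List, Tuple


def _normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def agreement_report(
    prompts: Iterable[str],
    stated: Callable[[str], str],  # hypothetical: what the model says it will do
    actual: Callable[[str], str],  # hypothetical: what the model actually did
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return the agreement rate and the list of (prompt, said, did) mismatches."""
    mismatches: List[Tuple[str, str, str]] = []
    total = 0
    for prompt in prompts:
        total += 1
        said = _normalize(stated(prompt))
        did = _normalize(actual(prompt))
        if said != did:
            mismatches.append((prompt, said, did))
    if total == 0:
        raise ValueError("need at least one prompt")
    return 1 - len(mismatches) / total, mismatches
```

The mismatch list is usually more informative than the rate, since it shows which prompts the explanation and the behavior part ways on.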
Sources (9 notes)
Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals that this is not a knowledge deficit but a structural disconnect.
LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.
LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.
Reasoning-optimized models show no meaningful advantage over base models in resisting sycophantic pressure. The LOGICOM benchmark found that GPT-4 was erroneously convinced 69% more often by fallacious arguments than by logically sound ones, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.
LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.
Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.
LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.