How can a model explain something correctly yet fail to apply it?

This explores the gap between an LLM stating a principle correctly and actually executing it — what the corpus calls 'Potemkin understanding' or the knowing-doing gap.

The corpus is unusually direct about the gap between stating a principle and executing it: this is not a knowledge problem, it is a wiring problem. The clearest framing is what one note calls a 'computational split-brain': models articulate the right rule at high rates but apply it far less often, because the pathway that explains and the pathway that acts are functionally dissociated (Can language models understand without actually executing correctly?). The numbers recur across notes: roughly 87% correct on explanation versus 64% in action, a gap of about 23 percentage points (Why do language models fail to act on their own reasoning?). The striking part is that a model can explain a concept, fail to apply it, *and* recognize its own failure. This triple pattern has no clean human analog and is the signature of 'Potemkin understanding' (Can LLMs understand concepts they cannot apply?).
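One way to make the explain-versus-apply gap concrete is to score each test item three times: did the model state the rule, did it follow the rule, and did it flag its own failure when asked. The sketch below is only an illustration under that assumption. The `Trial` record, the `gap_report` helper, and the four toy trials are hypothetical and are not taken from any of the cited notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One evaluation item, scored three ways (hypothetical schema)."""
    explained: bool   # model stated the rule correctly
    applied: bool     # model followed the rule when acting
    recognized: bool  # model flagged its own output as wrong when asked


def gap_report(trials):
    """Summarize the knowing-doing gap over a list of Trial records."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    explain_rate = sum(t.explained for t in trials) / n
    apply_rate = sum(t.applied for t in trials) / n
    # "Potemkin" items: correct explanation, failed application.
    potemkin = [t for t in trials if t.explained and not t.applied]
    # Triple pattern: explained, failed to apply, and recognized the failure.
    triple = [t for t in potemkin if t.recognized]
    return {
        "explain_rate": explain_rate,
        "apply_rate": apply_rate,
        "gap_points": round((explain_rate - apply_rate) * 100, 1),
        "potemkin_count": len(potemkin),
        "triple_count": len(triple),
    }


# Toy data, invented purely for illustration.
trials = [
    Trial(explained=True, applied=True, recognized=False),
    Trial(explained=True, applied=False, recognized=True),
    Trial(explained=True, applied=False, recognized=False),
    Trial(explained=False, applied=False, recognized=False),
]
report = gap_report(trials)
print(report)
```

Keeping the three judgments as separate fields, rather than a single score, is what lets the triple pattern be counted at all. A single accuracy number would hide it.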

What makes this more than a curiosity is that the same shape shows up under several different names, which suggests it is one underlying phenomenon viewed from different angles (How do LLMs fail to know what they seem to understand?). Sometimes the failure is an *inference bottleneck*: the model genuinely holds the relevant fact but never activates it unless prompted to. Small nudges recover roughly 6 to 15 percentage points of accuracy (subtle emphasis at the high end, forcing the model to enumerate preconditions at the low end), which indicates the knowledge was there all along (Why do language models fail to use knowledge they possess?). Sometimes it shows up as accommodating a false premise the model demonstrably knows is false: it can answer the direct fact question correctly, then sail right past the same falsehood when it is smuggled in as an assumption (Why do language models accept false assumptions they know are wrong?).

A second cluster of notes points at *why* application drifts away from explanation: reasoning itself can crowd out instruction-following. As chain-of-thought gets longer, the original instruction sits at a greater 'contextual distance' and the model's adherence drops. Advanced reasoning models follow instructions only about half the time during math reasoning (Why do more capable reasoning models ignore your instructions?; Why do better reasoning models ignore instructions?). The act of reasoning hard about the problem actively erodes compliance with what you actually asked. Errors compound the same way: once a model's own mistakes fill its context, performance degrades non-linearly, and the model keeps applying its bad prior rather than the good rule it could still recite (Do models fail worse when their own errors fill the context?).
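The claim that adherence falls as chain-of-thought grows can be checked by bucketing responses by reasoning length and computing the adherence rate per bucket. The following is a minimal sketch, assuming each sample is a pair of (chain-of-thought token count, whether the instruction was followed). The function name `adherence_by_length`, the bucket size, and the toy samples are assumptions made for illustration and do not reproduce any benchmark's data or method.

```python
from collections import defaultdict


def adherence_by_length(samples, bucket_size=500):
    """Group (cot_tokens, followed) pairs into length buckets.

    Returns a dict mapping each bucket's starting token count to the
    fraction of samples in that bucket that followed the instruction.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket start -> [followed, total]
    for cot_tokens, followed in samples:
        start = (cot_tokens // bucket_size) * bucket_size
        totals[start][0] += int(followed)
        totals[start][1] += 1
    return {start: hit / n for start, (hit, n) in sorted(totals.items())}


# Toy samples, invented for illustration only.
toy = [
    (120, True), (300, True), (450, False),   # short chains
    (700, True), (900, False),                # medium chains
    (1200, False), (1400, False),             # long chains
]
rates = adherence_by_length(toy)
for start, rate in rates.items():
    print(f"{start:>5}+ tokens: {rate:.2f}")
```

A downward trend across buckets would be consistent with the contextual-distance account, though on its own it would not rule out other explanations such as harder problems producing both longer chains and lower adherence.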

The unsettling twist is that some 'correct application' is itself an illusion in the other direction. When models look like they are reasoning about constraints, many are really just defaulting to the harder, more conservative option. Twelve of fourteen got *worse* when constraints were removed, meaning they were exploiting a conservative bias, not reasoning (Are models actually reasoning about constraints or just defaulting conservatively?). And the reflection step that is supposed to catch the explain-but-don't-apply gap mostly does not: reflections rarely change the initial answer and traces do not faithfully represent the underlying computation, so the model cannot reliably self-repair the gap (Can we actually trust reasoning model outputs?). Even active, interactive reasoning, such as asking good questions to close the gap mid-task, shows a structural deficit that fine-tuning barely moves (Why do models fail at asking good questions during interaction?).
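The conservative-bias test described above comes down to a paired comparison: score each model with the constraints present, score it again with them removed, and see who gets worse on what should be the easier task. A rough sketch of that comparison follows. The helper `degraded_without_constraints`, the model names, and the accuracy values are all hypothetical placeholders, not figures from the cited note.

```python
def degraded_without_constraints(scores):
    """Find models whose accuracy drops when constraints are removed.

    scores maps a model name to a pair of accuracies in percent:
    (with_constraints, without_constraints).
    Returns a dict of model name -> negative change in percentage points.
    """
    drops = {}
    for name, (with_c, without_c) in scores.items():
        delta = round(without_c - with_c, 1)
        if delta < 0:  # worse on the version with fewer constraints
            drops[name] = delta
    return drops


# Toy accuracies, invented for illustration only.
toy_scores = {
    "model_a": (80.0, 50.0),
    "model_b": (70.0, 72.0),
    "model_c": (65.0, 60.0),
}
drops = degraded_without_constraints(toy_scores)
worst = min(drops, key=drops.get)  # largest drop = most negative delta
print(drops, worst)
```

A model that truly evaluated the constraints should do at least as well once they are lifted, so any entry in `drops` is a hint that the earlier "correct" answers came from a default rather than from reasoning.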

So the answer to 'how' is layered: explanation and execution run on partly separate tracks; correct knowledge often sits inert until something forces it to activate; longer reasoning and accumulated errors pull attention away from the rule the model can still state; and the model's own monitoring is too weak, and its 'reasoning' sometimes too superficial, to notice the divergence. The thing you didn't know you wanted to know: narrowing this gap is not mainly about teaching the model more. RL and emphasis prompts help precisely because the problem is activation and retrieval, not absent knowledge (Why do language models fail to act on their own reasoning?).

Sources 12 notes

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals that this is not a knowledge deficit but a structural disconnect.

Why do language models fail to act on their own reasoning?

LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.

Why do language models fail to use knowledge they possess?

Models possess relevant knowledge but fail to activate it without explicit prompting. Adding subtle emphasis recovers 15.3 percentage points of accuracy, and forcing enumeration of preconditions recovers 6-9 points, showing the bottleneck is in constraint inference, not storage.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Why do more capable reasoning models ignore your instructions?

Advanced reasoning models achieve only 50.71% instruction adherence during mathematical reasoning. Training for reasoning depth actively worsens instruction compliance, suggesting a fundamental trade-off between reasoning power and controllability.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Why do models fail at asking good questions during interaction?

GPT-4o achieves only 35% on interactive number guessing, with information gains collapsing from 7.7% to 2.5% as rounds progress. SFT, DPO, and Tree-of-Thought interventions provide minimal improvement, suggesting the deficit is structural rather than a prompting or fine-tuning problem.
