Can LLMs understand concepts they cannot apply?
Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.
The Potemkin understanding paper identifies a failure pattern that is categorically different from ordinary LLM error. A model correctly explains an ABAB rhyme scheme, then fails to generate a poem that follows it, then correctly recognizes that its own output does not follow the scheme. That triple combination is not just wrong; it is incoherent. No human who could give that explanation would behave that way: the combination is irreconcilable with any human cognitive pattern.
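As a concrete illustration, here is a minimal sketch of that three-step probe, assuming only a generic `ask(prompt)` callable that wraps whatever chat model is being evaluated. The function name and prompt wording are illustrative, not the paper's exact protocol.

```python
from typing import Callable

def triple_test(ask: Callable[[str], str], concept: str, task: str) -> dict:
    """Run the explain / apply / recognize probe with a single model-call function.

    `ask` is an assumed stand-in: it sends one prompt to the model under test
    and returns its reply.
    """
    # 1. Explain: can the model state the concept correctly?
    explanation = ask(f"Explain the concept of {concept}.")
    # 2. Apply: can it use the concept it just explained?
    attempt = ask(task)
    # 3. Recognize: can it tell whether its own attempt succeeded?
    verdict = ask(
        f"Here is an attempt at the task '{task}':\n{attempt}\n"
        f"Does the attempt correctly follow {concept}? Answer yes or no, then explain why."
    )
    return {"explanation": explanation, "attempt": attempt, "self_verdict": verdict}

# Illustrative usage with the running ABAB example:
#   result = triple_test(my_model_call,
#                        concept="an ABAB rhyme scheme",
#                        task="Write a four-line poem with an ABAB rhyme scheme.")
# Potemkin understanding is the case where the explanation is correct, the attempt
# fails, and the self-verdict correctly flags that failure.
```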
This is worth separating from other LLM failure types because the mechanism matters for diagnosis and repair:
- Ordinary errors (fabrication, factual mistakes) — the model lacks information or generates plausible-but-false continuations. Fix: better retrieval, grounding, training data.
- Surface generalizations — the model learned correlations that worked in training but don't generalize structurally. Fix: better training curriculum, structural probing.
- Potemkin understanding — the model produces a correct explanation, fails to apply it, and recognizes the failure. This combination implies that explanation-generation and concept-application are functionally disconnected. No single epistemic fix addresses both.
The "Potemkin" framing (after Potemkin villages — facades with nothing behind) is precise: the model passes benchmark tests designed to detect understanding because those benchmarks test the same cognitive operations as humans. The tests only work as diagnostics if LLMs misunderstand concepts the same way humans do. But Potemkin understanding means the model can perform at the surface without the underlying integration that tests were designed to probe.
Benchmarks used to evaluate LLMs are also used to evaluate people. They are valid tests only if LLMs fail in human-compatible ways. Potemkin understanding shows that this assumption fails — LLMs can fail in ways that no human cognitive model predicts.
The three-domain evidence (literary techniques, game theory, psychological biases) shows this is not domain-specific. Across domains: near-perfect explanation accuracy, significant application failure, model recognition of failure. The incoherence is stable.
The "computational split-brain syndrome" diagnosis. "Comprehension Without Competence" provides the architectural analysis underlying Potemkin understanding. Through controlled experiments, the authors demonstrate that instruction and action pathways are geometrically and functionally dissociated — a phenomenon they term computational split-brain syndrome. The failure is not in knowledge access but in computational execution. LLMs function as powerful pattern completion engines but lack the architectural scaffolding for principled, compositional reasoning. This diagnosis also clarifies why mechanistic interpretability findings may reflect training-specific pattern coordination rather than universal computational principles. The geometric separation between instruction and execution pathways represents a structural limitation, not a knowledge limitation.
The Explain-Query-Test (EQT) framework provides direct empirical measurement of the explanation-comprehension gap. In EQT, a model (1) generates an explanation of a topic, (2) generates question-answer pairs from that explanation, and (3) answers those same questions without access to its own explanation. The finding: models consistently fail questions derived from their own explanations. The EQT gap correlates strongly with MMLU-Pro benchmark performance, making EQT a benchmark-free evaluation method that uses only the model's own outputs as ground truth. Critically, the gap is domain-specific: biology and psychology (domains where models initially perform well) show the largest EQT drops, while law and engineering (lower baselines) show smaller drops. This suggests Potemkin understanding is worst precisely where surface performance is highest, a counterintuitive result that demands explanation. High benchmark performance may mask explanation-comprehension disconnection rather than reveal genuine understanding.
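A minimal sketch of that loop, using the same generic `ask(prompt)` interface as the earlier sketch; the prompt wording and the substring-matching scorer are stand-ins, and the paper's actual grading procedure may differ.

```python
from typing import Callable

def eqt_gap(ask: Callable[[str], str], topic: str, n_questions: int = 5) -> float:
    """Fraction of self-generated questions the model gets wrong once it can no
    longer see its own explanation (higher = larger explanation-comprehension gap)."""
    # 1. Explain: the model writes its own reference passage.
    explanation = ask(f"Explain {topic} in a short, self-contained passage.")

    # 2. Query: the model turns that passage into question/answer pairs.
    qa_text = ask(
        f"From the passage below, write {n_questions} factual questions, "
        "each on its own line in the form 'Q: ... || A: ...'.\n\n" + explanation
    )
    pairs = []
    for line in qa_text.splitlines():
        if "Q:" in line and "||" in line:
            q, a = line.split("||", 1)
            pairs.append((q.replace("Q:", "").strip(), a.replace("A:", "").strip()))

    # 3. Test: ask each question fresh, without showing the explanation.
    wrong = 0
    for question, reference in pairs:
        answer = ask(f"Answer briefly: {question}")
        # Placeholder scorer: substring match against the model's own reference
        # answer; a real implementation would use a stricter grading step.
        if reference.lower() not in answer.lower():
            wrong += 1
    return wrong / max(len(pairs), 1)
```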
Source: Philosophy Subjectivity; enriched from Reasoning Methods CoT ToT
Related concepts in this collection
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  Relation: knowledge can be present without causally influencing behavior; Potemkin extends this to a more observable test (explanation vs. application).
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  Relation: Potemkin understanding adds the recognition-of-failure component that surface-generalization accounts don't predict.
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  Relation: faithful reasoning would prevent Potemkin understanding, because the explanation would causally constrain the application.
- Can identical outputs hide broken internal representations?
  Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
  Relation: FER provides the mechanistic cause of Potemkin understanding: the internal representation is fractured across arbitrary subdomains and entangled across unrelated computations, which is why explanation-generation and concept-application are functionally disconnected despite identical surface performance.
- What limits how much models can improve themselves?
  Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
  Relation: the generation-verification gap formalizes why Potemkin understanding paradoxically enables self-improvement: when explanation exceeds application, that gap is a usable training signal, since the model's verification ability can supervise its generation ability (a minimal sketch of this loop appears after the list).
- Why do language models fail to act on their own reasoning?
  LLMs generate correct step-by-step reasoning 87% of the time but only follow through with matching actions 64% of the time. What drives this gap between knowing and doing?
  Relation: a quantified instance; the 87%/64% gap between correct rationales and correct actions in sequential decision-making is the most precisely measured example of Potemkin understanding, and RL fine-tuning narrows the gap, suggesting the facade is partially trainable.
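A minimal sketch of the verification-supervises-generation loop referenced in the self-improvement item above, assuming generic `ask` (generator) and `verify` (self-grading) callables; both are placeholders for however the model is wrapped, and the thresholds are arbitrary.

```python
from typing import Callable, List, Tuple

def verification_supervised_round(
    ask: Callable[[str], str],
    verify: Callable[[str, str], float],
    prompts: List[str],
    samples_per_prompt: int = 4,
    keep_threshold: float = 0.8,
) -> List[Tuple[str, str]]:
    """One round of rejection sampling where the model's verifier filters its generator.

    For each prompt, sample several candidate answers, score each with the verifier
    (typically the same model prompted to grade rather than to produce), and keep the
    best candidate only if it clears the threshold. The kept pairs become fine-tuning
    data; the loop only gains ground while verification is more reliable than generation.
    """
    kept = []
    for prompt in prompts:
        candidates = [ask(prompt) for _ in range(samples_per_prompt)]
        scored = [(c, verify(prompt, c)) for c in candidates]
        best, best_score = max(scored, key=lambda pair: pair[1])
        if best_score >= keep_threshold:
            kept.append((prompt, best))
    return kept
```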
Original note title
potemkin understanding is a distinct failure mode where correct explanation combined with failed application is incoherent not merely wrong