Potemkin Understanding in Large Language Models

Paper · arXiv 2506.21521 · Published June 26, 2025
Tags: Philosophy · Subjectivity · Reasoning · Critiques · Flaws

This paper first introduces a formal framework to address a basic question: what justifies making inferences about an LLM's capabilities from its answers to a curated set of benchmark questions? The key observation is that the benchmarks used to test LLMs, such as AP exams, are also those used to test people. This carries an implication: these benchmarks are valid tests only if LLMs misunderstand concepts in the same ways humans do. Otherwise, success on benchmarks demonstrates only potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark spanning three domains, the other a general procedure that provides a lower bound on their prevalence.

Figure 1 illustrates a potemkin. When an LLM is asked to explain the ABAB rhyming scheme, its response is clear and correct (top panel). At first glance, it may appear that the LLM understands the concept in the same way a human who gave that explanation would. However, when asked to generate text in an ABAB rhyming scheme, the LLM fails, producing non-rhyming words (middle panel). Moreover, the LLM seems to recognize that its own output does not rhyme (bottom panel). This specific combination of correct and incorrect answers is irreconcilable with any pattern of answers a human would give.
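To make the probe concrete, here is a minimal sketch of the explain/generate/self-grade sequence from Figure 1. The `ask` callable is a hypothetical stand-in for any chat-completion client; it is not an API from the paper, and the prompts are illustrative.

```python
from typing import Callable

def probe_abab(ask: Callable[[str], str]) -> dict[str, str]:
    """Run the three-step Figure 1 probe against one model."""
    explanation = ask("Explain what an ABAB rhyming scheme is.")
    attempt = ask("Write a four-line poem that follows an ABAB rhyming scheme.")
    self_grade = ask(
        "Does the following poem follow an ABAB rhyming scheme? "
        "Answer yes or no.\n\n" + attempt
    )
    # A potemkin appears as: correct explanation, failed generation, and a
    # self-grade that (as in the bottom panel) correctly flags the failure.
    return {"explanation": explanation, "attempt": attempt, "self_grade": self_grade}
```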

Potemkins occur when an LLM performs well on tasks that would indicate conceptual understanding if a human completed them, yet that performance does not reflect understanding in the LLM. This paper develops two procedures for measuring the prevalence of potemkins. The first targets a specific kind of potemkin: the divide between an LLM's ability to explain a concept and its ability to apply it. We collect a benchmark dataset across three domains (literary techniques, game theory, and psychological biases) designed to measure the prevalence of this type of potemkin.
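One way this explain-versus-apply gap could be scored is sketched below: condition on the model explaining a concept correctly, then measure how often it fails to apply that same concept. The `Record` dataclass and its field names are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass

@dataclass
class Record:
    explain_correct: bool  # did the model define the concept correctly?
    apply_correct: bool    # did it then use the concept correctly?

def potemkin_rate(records: list[Record]) -> float:
    """Fraction of application failures among correctly explained concepts."""
    explained = [r for r in records if r.explain_correct]
    if not explained:
        return float("nan")
    return sum(not r.apply_correct for r in explained) / len(explained)
```

Under this framing, a rate near zero matches human-like patterns of understanding, while a high rate signals potemkins.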

Despite defining concepts in each domain of our benchmark dataset near-perfectly, models struggle to apply these concepts accurately. We find that potemkins do not arise merely from incorrect understanding of concepts, but from incoherence: a model's stated grasp of a concept conflicts with its own use of that concept.
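The second, more general procedure can be sketched in the same style: ask a model to generate an instance of a concept, then ask the same model to judge its own output. Self-disagreement is incoherence by construction, so the observed rate lower-bounds the prevalence of potemkins. As before, `ask` is a hypothetical client call and the prompts are assumptions, not the paper's exact wording.

```python
from typing import Callable

def self_disagrees(ask: Callable[[str], str], concept: str) -> bool:
    """One trial of the self-consistency check; True counts toward the lower bound."""
    instance = ask(f"Produce an example of the concept: {concept}.")
    verdict = ask(
        f"Is the following an example of {concept}? Answer yes or no.\n\n{instance}"
    )
    # A model rejecting its own generation is irreconcilable with any
    # single, coherent interpretation of the concept.
    return verdict.strip().lower().startswith("no")
```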