Do language models understand in fundamentally different ways?
Does mechanistic evidence reveal distinct tiers of understanding in LLMs—from concept recognition to factual knowledge to principled reasoning? And do these tiers coexist rather than replace each other?
This paper synthesizes mechanistic interpretability findings into a philosophical framework that moves beyond the binary "does AI understand?" debate. The framework proposes three hierarchical tiers:
Tier 1: Conceptual understanding — arises when a model forms "features": directions in latent space that unify diverse manifestations of a single entity or property. This is the representational foundation: the model has learned that different surface forms map to the same underlying concept. Mechanistic interpretability (MI) evidence: sparse autoencoder (SAE) features, linear probing, and representation-geometry studies all demonstrate this, as in the probe sketch below.
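As a minimal illustration of the probing evidence, the sketch below recovers a planted concept direction with a logistic-regression probe. All activations here are synthetic stand-ins; with a real model they would be hidden states collected at a fixed layer over contrastive inputs.

```python
# Minimal sketch of linear probing for a concept direction, assuming we
# already have hidden-state activations labeled by concept presence.
# All data here is synthetic; nothing comes from an actual model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim, n_samples = 256, 1000

# Pretend a single direction in latent space encodes the concept.
concept_direction = rng.normal(size=hidden_dim)
concept_direction /= np.linalg.norm(concept_direction)

labels = rng.integers(0, 2, size=n_samples)        # concept present or absent
noise = rng.normal(size=(n_samples, hidden_dim))
activations = noise + 3.0 * labels[:, None] * concept_direction

probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print("probe accuracy:", probe.score(activations, labels))

# The probe's weight vector approximately recovers the concept direction.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("cosine with true direction:", float(w @ concept_direction))
```

A high-accuracy probe whose weights align with a single direction is the kind of evidence tier 1 rests on: the concept is linearly readable from the representation, not scattered across unrelated surface patterns.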
Tier 2: State-of-the-world understanding — arises when the model learns contingent factual connections between features and dynamically tracks changes in that state. "Michael Jordan is a basketball player" is then not just a high-probability string but a reflection of an internal model linking the Michael Jordan concept to the basketball-player concept. This goes beyond association to structured knowledge representation (see the sketch below).
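One way MI work operationalizes this tier is by testing whether a factual relation is decodable as a simple, often linear, map between entity representations, rather than as memorized string co-occurrence. The sketch below is a synthetic stand-in for that test: the vectors and the relation are fabricated, and with a real model the subject and object representations would come from hidden states.

```python
# Sketch of testing for structured factual linkage, assuming facts are
# encoded as an (approximately) linear relation between entity features:
# object_vec ≈ subject_vec @ W. Everything here is synthetic/hypothetical.
import numpy as np

rng = np.random.default_rng(1)
dim, n_facts = 64, 200

true_relation = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # hidden linear relation
subjects = rng.normal(size=(n_facts, dim))                  # e.g. person features
objects = subjects @ true_relation + 0.05 * rng.normal(size=(n_facts, dim))

# Fit the relation on half the facts, test generalization on the rest:
# a structured representation should transfer to held-out entities.
train, test = slice(0, 100), slice(100, None)
W, *_ = np.linalg.lstsq(subjects[train], objects[train], rcond=None)

pred = subjects[test] @ W
err = np.linalg.norm(pred - objects[test]) / np.linalg.norm(objects[test])
print(f"relative error on held-out facts: {err:.3f}")
```

If a single map fit on some facts predicts the representations of unseen facts, the model is plausibly encoding the relation itself, not a lookup table of strings.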
Tier 3: Principled understanding — arises when the model discovers compact "circuits" that connect facts via general rules rather than memorizing each fact individually. This is the shift from "knowing that" to "knowing why." The grokking literature provides the clearest evidence: models that transition from memorization to generalization develop circuits implementing genuine algorithmic rules (e.g., modular addition computed via Fourier components, sketched below).
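To make "modular addition via Fourier transforms" concrete, here is a minimal sketch of the trigonometric "clock" algorithm that grokked networks have been found to implement (as in Nanda et al.'s modular-addition analysis). The frequency set here is an arbitrary illustration; a real network distributes this computation across embeddings, attention, and MLP layers.

```python
# Sketch of the Fourier "clock" algorithm for modular addition.
# By the angle-addition identity, cos(w(a+b-c)) peaks exactly when
# c == (a+b) mod p, so a few frequencies compute the answer by rule,
# with no memorized table of facts.
import numpy as np

p = 113
freqs = [1, 7, 23]  # illustrative; for prime p any nonzero frequency works

def mod_add_via_fourier(a: int, b: int) -> int:
    c = np.arange(p)
    logits = np.zeros(p)
    for k in freqs:
        w = 2 * np.pi * k / p
        # cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc) == cos(w(a+b-c))
        logits += (np.cos(w * (a + b)) * np.cos(w * c)
                   + np.sin(w * (a + b)) * np.sin(w * c))
    return int(np.argmax(logits))

assert all(mod_add_via_fourier(a, b) == (a + b) % p
           for a in range(p) for b in range(0, p, 11))
print("Fourier circuit reproduces (a + b) mod p")
```

Because every term peaks only at c ≡ a + b (mod p), the argmax implements the general rule for all inputs at once, which is exactly what separates a tier-3 circuit from a tier-2 store of memorized facts.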
The critical insight is that higher-tier mechanisms coexist with lower-tier heuristics rather than replacing them. A model can have principled understanding of arithmetic in one circuit while relying on pattern-matching heuristics in another. This heterogeneity means understanding is not a single binary property but a patchwork: principled in some domains, merely conceptual in others, and purely heuristic in yet others.
This has direct implications for trust and deployment. The fact that a model demonstrates principled understanding in one domain gives no guarantee that it operates at the same tier in adjacent domains. The coexistence of understanding tiers also explains why models can be simultaneously impressive and brittle: the principled circuits work reliably, but the heuristic patches fail unpredictably.
Source: MechInterp
Related concepts in this collection
- What happens inside models when they suddenly generalize?
Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
Grokking is the mechanistic signature of the transition from tier 2 (state-of-the-world, memorized facts) to tier 3 (principled, circuit-based understanding)
- Can LLMs understand concepts they cannot apply?
Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.
Potemkin understanding maps to cases where the model has tier-1 conceptual understanding (can explain) but lacks tier-3 principled understanding (cannot apply)
- Can AI pass every test while understanding nothing?
Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.
FER/imposter intelligence is a case where performance metrics cannot distinguish between tiers of understanding
- Can a model be truthful without actually being honest?
Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
The three-tier framework clarifies why they can diverge: honesty requires tier-2 state-of-the-world understanding (tracking what the model itself believes), while truthfulness only requires that outputs match facts, regardless of internal tier
Original note title: mechanistic interpretability evidence supports three hierarchical varieties of LLM understanding — conceptual then state-of-world then principled — each tied to a distinct computational organization