Mechanistic Indicators of Understanding in Large Language Models

Paper · arXiv 2507.08017 · Published July 7, 2025

Abstract: Large language models (LLMs) are often portrayed as merely imitating linguistic patterns without genuine understanding. We argue that recent findings in mechanistic interpretability (MI), the emerging field probing the inner workings of LLMs, render this picture increasingly untenable, but only once those findings are integrated within a theoretical account of understanding. We propose a tiered framework for thinking about understanding in LLMs and use it to synthesize the most relevant findings to date. The framework distinguishes three hierarchical varieties of understanding, each tied to a corresponding level of computational organization: conceptual understanding emerges when a model forms “features” as directions in latent space, learning connections between diverse manifestations of a single entity or property; state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world; principled understanding emerges when a model ceases to rely on memorized facts and discovers a compact “circuit” that connects them. Across these tiers, MI uncovers internal organizations that can underwrite understanding-like unification, although these organizations also diverge from human cognition in their parallel exploitation of heterogeneous mechanisms. Fusing philosophical theory with mechanistic evidence thus allows us to transcend binary debates over whether AI understands, paving the way for a comparative, mechanistically grounded epistemology that explores how AI understanding aligns with, and diverges from, our own.

Building on this epistemological tradition, we propose to distinguish three varieties of understanding that can be ascribed to LLMs, each grounded in a computational mechanism in the upper portion of the hierarchy:

  1. Conceptual Understanding: This foundational form involves the model developing internal representations (“features”) that are functionally analogous to human concepts. As Mitchell and Krakauer note, concepts are “the fundamental units of understanding in human cognition” (2023, p. 3). By forming features, the model replicates a core function of concepts: registering the connections between diverse manifestations of an entity or property by subsuming them under a single, unifying representation (see the feature-direction sketch after this list).

  2. State-of-the-World Understanding: Building upon conceptual understanding, this involves forming an internal representation of the state of the world by grasping contingent empirical connections between features. This allows the model to articulate specific facts (e.g., “Michael Jordan is a basketball player”) not merely as high-probability strings, but as reflections of an internal model linking the concept of Michael Jordan to that of basketball player (see the relation-map sketch after this list).

  3. Principled Understanding: At the apex of this hierarchy lies the ability to grasp the underlying principles or rules that unify a diverse array of facts. This resonates with philosophical accounts of explanatory understanding, which require the subsumption of disparate data points under general principles (a toy contrast appears after the following paragraph).
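
To make tier 1 concrete, the sketch below applies the standard difference-in-means recipe for locating a “feature” as a direction in latent space. The activations are synthetic, and every name and dimension is an illustrative assumption, not code from the paper or from any particular model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden-state width

# Synthetic stand-ins for hidden activations gathered from prompts that do
# (vs. do not) evoke a target concept.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
concept_acts = rng.normal(size=(200, d_model)) + 3.0 * true_dir
control_acts = rng.normal(size=(200, d_model))

# Difference-in-means probe: the candidate feature direction is the
# normalized gap between the two activation clusters.
feature_dir = concept_acts.mean(axis=0) - control_acts.mean(axis=0)
feature_dir /= np.linalg.norm(feature_dir)

# Projections onto the direction separate the two classes, which is the
# operational sense in which a "feature" is a direction in latent space.
print("mean concept projection:", (concept_acts @ feature_dir).mean())  # ~3
print("mean control projection:", (control_acts @ feature_dir).mean())  # ~0
```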
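
Tier 2 admits a similar toy sketch. Some MI results suggest that certain factual relations are decodable as approximately linear maps between feature vectors; the sketch below simply hard-codes such a map. The subjects, the relation matrix W, and the helper recall_sport are all hypothetical stand-ins, not anything extracted from a real model:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32

# Hypothetical feature vectors for two subjects, plus a linear map W standing
# in for the "plays-sport" relation (all synthetic).
subjects = {
    "Michael Jordan": rng.normal(size=d),
    "Serena Williams": rng.normal(size=d),
}
W = rng.normal(size=(d, d)) / np.sqrt(d)
attributes = {
    "basketball": W @ subjects["Michael Jordan"],
    "tennis": W @ subjects["Serena Williams"],
}

def recall_sport(name: str) -> str:
    """Apply the relation map to a subject feature and return the
    attribute feature it lands closest to (cosine similarity)."""
    q = W @ subjects[name]
    cos = lambda u, v: (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(attributes, key=lambda a: cos(attributes[a], q))

print(recall_sport("Michael Jordan"))   # -> basketball
print(recall_sport("Serena Williams")) # -> tennis
```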

By examining the emerging mechanistic evidence for these three hierarchically structured kinds of understanding (moving from concepts to facts to principles), we argue that LLMs possess the internal organization necessary to underwrite understanding-like unification. At the same time, this does not imply that lower-tier mechanisms disappear. As we will show, MI suggests that higher-tier understanding mechanisms coexist with lower-tier heuristics—a coexistence that has significant implications for the epistemic trustworthiness of these models.
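
A deliberately simple toy can make the tier-3 contrast, and the coexistence just mentioned, concrete. The modular-addition task loosely echoes the “grokking” literature; both functions below are illustrative assumptions rather than mechanisms recovered from a model:

```python
# Training "facts": sums memorized only for small operands.
P = 97
seen = {(a, b): (a + b) % P for a in range(50) for b in range(50)}

def memorizer(a, b):
    # Lower-tier heuristic: lookup over memorized facts;
    # fails silently outside the memorized range.
    return seen.get((a, b))

def principled(a, b):
    # Higher-tier mechanism: one compact rule that subsumes every case.
    return (a + b) % P

print(memorizer(3, 4), principled(3, 4))      # 7 7     (a memorized fact)
print(memorizer(90, 90), principled(90, 90))  # None 83 (only the rule generalizes)
```

The point, developed in what follows, is not that LLMs literally house lookup tables, but that MI finds rule-like circuits and memorized shortcuts operating side by side.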