Can we understand LLM mechanisms with only representational analysis?
Explores whether mapping what information a model encodes is sufficient for mechanistic understanding, or whether causal verification is equally necessary to claim genuine mechanism.
The implementation-level argument in Levels of Analysis for LLMs is that representational analysis and causal analysis are partners, not alternatives. Representational analysis maps what information a model encodes: which features, circuits, and attention heads carry which signals. Causal analysis tests whether the encoded information actually drives behavior, using interventions such as ablations and activation patching. Either method alone produces an incomplete account: a representation that is encoded but causally inert is a curiosity, and a causal effect with no representational characterization is unexplained.
The synergy matters because each method can mislead when used alone. Representational analysis can identify features that correlate with behavior without showing that they cause it, the classic correlation-versus-causation confound. Causal analysis can demonstrate that intervening on a component changes behavior without revealing what that component encodes: the lesion shows damage but not function. Combining them, with representational analysis locating candidates and causal analysis testing their functional role, is what produces mechanistic claims rather than descriptive ones.
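The way the two methods come apart can be shown on a toy model. The sketch below is an illustration under stated assumptions, not the method of any particular study: the three hidden units, the readout weights `W_OUT`, and the helper names are all hypothetical. Correlation with the input stands in for representational analysis, and mean ablation stands in for causal analysis.

```python
import random
import statistics

# Hypothetical readout weights: unit 1 is causally inert by construction.
W_OUT = [1.0, 0.0, 1.0]

def hidden(x, noise):
    """Toy hidden layer with three hand-built units (assumed for illustration)."""
    h0 = x      # encodes the signal x and feeds the output
    h1 = x      # encodes x equally well but has zero output weight
    h2 = noise  # carries no information about x but feeds the output
    return [h0, h1, h2]

def output(h):
    return sum(w * a for w, a in zip(W_OUT, h))

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(500)]
noises = [rng.gauss(0, 1) for _ in range(500)]
acts = [hidden(x, n) for x, n in zip(xs, noises)]

# Representational analysis: how strongly does each unit track the signal x?
encoding = [abs(pearson([a[i] for a in acts], xs)) for i in range(3)]

# Causal analysis: replace each unit with its mean activation (mean ablation)
# and measure the average absolute change in the output.
means = [statistics.fmean([a[i] for a in acts]) for i in range(3)]

def ablate(h, i):
    h = list(h)
    h[i] = means[i]
    return h

effect = [
    statistics.fmean([abs(output(a) - output(ablate(a, i))) for a in acts])
    for i in range(3)
]

for i in range(3):
    print(f"unit {i}: encoding score {encoding[i]:.2f}, ablation effect {effect[i]:.2f}")
```

In this setup unit 1 scores as high as unit 0 on the representational measure yet has no ablation effect, while unit 2 has a clear ablation effect with almost no encoding of the probed signal. Only unit 0 would support a mechanistic claim about how the signal reaches the output.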
This has methodological consequences for interpretability research. Studies that report only feature visualizations or only activation patches contribute, but they do not close the loop. The convergent evidence comes from pairs: locate a candidate feature representationally, then verify it causally; identify a causal component, then map its representation. The literature on attention circuits, induction heads, and feature dictionaries has been moving toward this pairing.
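The locate-then-verify direction of this pairing can be sketched with activation patching on a hand-built toy model. Everything here is assumed for illustration: the `forward` function, its weights, the clean and corrupted inputs, and the 0.5 recovery threshold are hypothetical and do not come from the paper.

```python
def forward(x, patch=None):
    """Toy model (hypothetical). `patch` maps a unit index to a forced activation."""
    # Units 0 and 1 both encode x[0]; unit 2 encodes x[1].
    h = [2.0 * x[0], 2.0 * x[0], x[1]]
    if patch:
        for i, v in patch.items():
            h[i] = v
    # Only units 0 and 2 reach the output; unit 1 has zero readout weight.
    y = 1.5 * h[0] + 0.0 * h[1] + 0.5 * h[2]
    return h, y

# The two inputs differ only in feature x[0].
clean, corrupted = [1.0, 1.0], [0.0, 1.0]
h_clean, y_clean = forward(clean)
h_corr, y_corr = forward(corrupted)

# Step 1 (representational): candidates are units whose activation
# changes when the feature of interest changes.
candidates = [i for i in range(3) if h_clean[i] != h_corr[i]]

# Step 2 (causal): patch each candidate's clean activation into the
# corrupted run and measure the fraction of the output gap restored.
recovered = {}
for i in candidates:
    _, y_patched = forward(corrupted, patch={i: h_clean[i]})
    recovered[i] = (y_patched - y_corr) / (y_clean - y_corr)

# Keep only candidates that restore most of the gap (threshold is arbitrary).
verified = [i for i in candidates if recovered[i] > 0.5]
print("candidates:", candidates, "recovered:", recovered, "verified:", verified)
```

Both units 0 and 1 pass the representational step, but only unit 0 survives patching. The reverse direction would start from a unit found to matter causally and then ask what its activation encodes.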
For LLM understanding specifically, this pairing explains why some claimed "mechanisms" have not held up. They were either representational without causal verification (a feature that looked like task encoding but did not drive task behavior) or causal without representational characterization (an intervention that changed behavior but left the affected component's content uncharacterized). The discipline imported from cognitive neuroscience is to demand both.
Related concepts in this collection
- Can cognitive science methods unlock how LLMs actually work?
  Does Marr's three-level framework, developed to understand biological minds, offer interpretability researchers the structured methodology they need to decode opaque language models?
  same paper, the framework
- Can we predict where language models will fail?
  Does characterizing the abstract computational problem an LLM solves, as a probability machine over sequences, let us predict which tasks it will struggle with systematically, before running experiments?
  same paper, computational level companion
- Can indirect psychology tests reveal what LLMs conceal about bias?
  Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?
  same paper, algorithmic level companion
- Do language model reasoning drafts faithfully represent their actual computation?
  If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.
  adjacent: dual-dimension methodology in CoT
Original note title
mechanistic understanding of LLMs requires both representational analysis and causal analysis — either alone is insufficient