Can we predict where language models will fail?
Does characterizing the abstract computational problem an LLM solves—as a probability machine over sequences—let us predict which tasks it will struggle with systematically, before running experiments?
The "Levels of Analysis for LLMs" argument carries a specific empirical payoff: a characterization of the abstract computational problem an LLM solves predicts where the model will fail. The "embers of autoregression" line of work (McCoy et al.) is the worked example. By framing LLMs at Marr's computational level, as systems that have learned an autoregressive distribution over text, the researchers could derive in advance that tasks whose target response has low probability under the pretraining distribution would be systematically harder, even when the task itself is logically trivial.
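For reference, the autoregressive characterization can be written as the standard chain-rule factorization of a sequence probability. The symbols are notation assumed for this note rather than taken from the source: x is the prompt, y is the target response of length T, and θ stands for the model parameters.

```latex
\log p_\theta(y \mid x) = \sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
```

Read this way, a "low-probability target" is a response for which this sum is strongly negative, regardless of how simple the rule that defines the response is.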
The prediction is non-obvious. From a behavioral standpoint, you might expect difficulty to track task complexity. From the computational-level standpoint, you expect difficulty to track target probability, because the system is fundamentally a probability machine over sequences. Tasks like "write the alphabet backwards" or "count uppercase letters" can be logically simple yet require generating sequences to which the pretraining distribution assigns little probability. The framework predicted these would be hard before the experiments were run, and they were.
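The contrast between task complexity and target probability can be illustrated with a toy model. The sketch below is not the method used by McCoy et al.; it is a minimal character-bigram model with add-one smoothing, trained on a hypothetical corpus (`CORPUS`) in which the forward alphabet is far more common than the reversed one. All names here (`train_bigram`, `log_prob`, `CORPUS`) are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical stand-in for a pretraining corpus (an assumption for this
# sketch): forward alphabet runs are common, the reversed run is rare.
CORPUS = ["abcdefghijklmnopqrstuvwxyz"] * 50 + ["zyxwvutsrqponmlkjihgfedcba"]
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
START = "^"  # start-of-sequence symbol


def train_bigram(corpus):
    """Count character bigrams and how often each context character occurs."""
    pair_counts = Counter()
    context_counts = Counter()
    for text in corpus:
        prev = START
        for ch in text:
            pair_counts[(prev, ch)] += 1
            context_counts[prev] += 1
            prev = ch
    return pair_counts, context_counts


def log_prob(target, pair_counts, context_counts):
    """Sum add-one-smoothed log p(ch | prev) over the target string."""
    total = 0.0
    prev = START
    for ch in target:
        numerator = pair_counts[(prev, ch)] + 1
        denominator = context_counts[prev] + len(ALPHABET)
        total += math.log(numerator / denominator)
        prev = ch
    return total


pair_counts, context_counts = train_bigram(CORPUS)

# Both targets follow an equally simple rule; only their probability differs.
forward_lp = log_prob("abcdefg", pair_counts, context_counts)
backward_lp = log_prob("gfedcba", pair_counts, context_counts)

print(f"log p(forward)  = {forward_lp:.2f}")
print(f"log p(backward) = {backward_lp:.2f}")
```

In this toy setting the reversed string receives a much lower log-probability than the forward one, although reversing is no harder to specify. A real analysis would need probability estimates from an actual language model rather than a bigram counter, so this only shows the shape of the argument.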
This is a working example of why a level-of-analysis approach is useful. Without it, the failure modes look like random capability gaps that need to be patched one by one. With it, the gaps look like predictable consequences of a particular kind of system, and they can be enumerated systematically by examining the computational characterization. The researcher who knows what problem the system is actually solving knows where to look for failure.
For interpretability research broadly, this is a template. Find the right computational-level characterization, derive its predictions about where the system should be brittle, and the brittle spots become a research program rather than an exception list.
Related concepts in this collection
- Can cognitive science methods unlock how LLMs actually work?
  Does Marr's three-level framework, developed to understand biological minds, offer interpretability researchers the structured methodology they need to decode opaque language models?
  same paper, the framework this instantiates
- Can indirect psychology tests reveal what LLMs conceal about bias?
  Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?
  same paper, the algorithmic-level companion
- Does chain-of-thought reasoning actually generalize beyond training data?
  Explores whether CoT's strong performance on benchmarks reflects genuine reasoning ability or merely learned patterns tied to specific distributions. Tests how CoT behaves when tasks, formats, or reasoning length shift away from training data.
  adjacent: another distribution-bounded failure mode
Original note title
the computational level predicts where LLMs fail — embers of autoregression anticipated low-probability target failures