
Can we predict where language models will fail?

If we characterize an LLM by the abstract computational problem it solves, as a probability machine over sequences, can we predict which tasks it will systematically struggle with before running any experiments?

Note · 2026-05-18 · sourced from Philosophy Subjectivity

The "Levels of Analysis for LLMs" argument carries a specific empirical payoff: characterizing the abstract computational problem an LLM solves predicts where it will fail. The "embers of autoregression" line of work (McCoy et al.) is the worked example. By framing LLMs at Marr's computational level — as systems that have learned an autoregressive distribution over text — the researchers could derive in advance that tasks whose target response has low probability under the pretraining distribution would be systematically harder, even when the task itself is logically trivial.
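The computational-level description can be written down compactly. The notation below is illustrative and not taken from the source: $p_\theta$ stands for the learned distribution, $x_t$ for the token at position $t$, $c$ for a prompt, and $y$ for the target response.

```latex
% Autoregressive factorization of a sequence x_1, ..., x_T
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})

% Log-probability of a target response y = (y_1, ..., y_n) given a prompt c
\log p_\theta(y \mid c) = \sum_{i=1}^{n} \log p_\theta(y_i \mid c, y_{<i})
```

On this reading, the claim is that expected accuracy on a task falls as the probability of its target response under the pretraining distribution falls, even when the mapping from prompt to target is logically trivial.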

The prediction is non-obvious. From a behavioral standpoint, you might expect difficulty to track task complexity. From the computational-level standpoint, you expect difficulty to track the probability of the target output, because the system is fundamentally a probability machine over sequences. Tasks like "write the alphabet backwards" or "count uppercase letters" can be logically simple yet require generating sequences to which the pretraining distribution assigns low probability. The framework predicted these would be hard before the experiments were run, and they were.
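A toy illustration of "difficulty tracks target probability" follows. It is a minimal sketch, not the method used by McCoy et al.: a character bigram model with add-one smoothing stands in for an LLM, and a tiny synthetic corpus stands in for pretraining data. The names `train_bigram` and `sequence_log_prob` are hypothetical helpers introduced here.

```python
import math
from collections import Counter


def train_bigram(corpus: str):
    """Count character contexts and bigrams in a toy 'pretraining' corpus."""
    contexts = Counter(corpus[:-1])             # every character that has a successor
    bigrams = Counter(zip(corpus, corpus[1:]))  # adjacent character pairs
    vocab = set(corpus)
    return contexts, bigrams, vocab


def sequence_log_prob(text: str, model) -> float:
    """Sum of log p(next | prev) over the text, with add-one smoothing."""
    contexts, bigrams, vocab = model
    v = len(vocab)
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        # Unseen pairs get a small but nonzero probability.
        total += math.log((bigrams[(prev, nxt)] + 1) / (contexts[prev] + v))
    return total


# Assumption: the corpus contains the alphabet only in forward order,
# mimicking text in which the forward order is common and the reverse is rare.
corpus = "abcdefghijklmnopqrstuvwxyz " * 50
model = train_bigram(corpus)

forward = "abcdefghijklmnopqrstuvwxyz"
backward = forward[::-1]

# Both targets have the same length, so the comparison is not a length effect.
print("forward :", sequence_log_prob(forward, model))
print("backward:", sequence_log_prob(backward, model))
```

Both targets are equally easy to specify, but the reversed alphabet scores far lower under the toy distribution. That gap is the quantity the computational-level account says should predict difficulty.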

This is a worked example of why a level-of-analysis approach is useful. Without it, the failure modes look like random capability gaps to be patched one by one. With it, the gaps are predictable consequences of the kind of system an LLM is, and they can be enumerated systematically from the computational characterization. A researcher who knows what problem the system is actually solving knows where to look for failure.

For interpretability research broadly, this is a template. Find the right computational-level characterization, derive its predictions about where the system should be brittle, and the brittle spots become a research program rather than an exception list.


Original note title

the computational level predicts where LLMs fail — embers of autoregression anticipated low-probability target failures
