Do foundation models learn world models or task-specific shortcuts?
When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
The inductive bias probe paper distinguishes what foundation models learn to predict from what they learn to be. A transformer trained on planetary orbital mechanics can predict trajectories across solar systems it has never seen. But when fine-tuned to predict force vectors — a cornerstone of Newtonian mechanics — it produces nonsensical laws of gravitation, different laws depending on which slice of data it is applied to.
The test is precise: a world model (here, Newtonian mechanics) implies a specific inductive bias. If the model has internalized that world model, fine-tuning on a small dataset should leverage it — the model should extrapolate using Newtonian state. The probe reveals it does not. The inductive bias is not toward Newtonian mechanics; it is toward task-specific heuristics that work locally but do not generalize as a unified world model would.
The pattern holds across domains: Othello game positions, lattice models, orbital mechanics. In each case, models learn to predict legal next states without developing an inductive bias toward the underlying state structure. They succeed on prediction tasks because they recover "coarsened state representations or non-parsimonious representations" — shortcuts that suffice for next-state prediction but are not the world model.
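To make the probe's logic concrete, here is a minimal sketch of the idea (my own illustration, not the paper's code): recover the force law a fine-tuned model's predictions imply, separately for each data slice, and check whether the slices agree. The function name and the log-linear fitting form are assumptions for illustration.

```python
# Hypothetical sketch of the inductive-bias check: which law of gravitation do
# a model's force predictions imply, and is it the same law on every slice?
import numpy as np

def implied_force_exponent(masses, radii, predicted_forces):
    """Fit log F = log G + log m - k log r and return the recovered exponent k.

    A model that had internalized Newtonian mechanics should yield k ≈ 2
    regardless of which slice of orbits the predictions come from.
    """
    X = np.column_stack([np.log(masses), np.log(radii), np.ones_like(masses)])
    coef, *_ = np.linalg.lstsq(X, np.log(predicted_forces), rcond=None)
    return -coef[1]

# The probe's finding: exponents recovered from different slices disagree,
# i.e. the fine-tuned model commits to no single law of gravitation.
```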
The no-free-lunch theorem grounds this. Every learning algorithm has an inductive bias — the functions it tends to learn when extrapolating from limited data. A world model is a restriction on possible functions; a learning algorithm with that world model should extrapolate within it. Sequence prediction does not impose this restriction. The model finds other functions that fit the training distribution without committing to the world model's structure.
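A toy illustration of that point (mine, not the paper's): many functions fit the same training data equally well, so what a learner does off-distribution is determined by its inductive bias rather than by its training fit. Here the "world model" is a simple linear law and the two hypothesis classes are polynomial fits of different degree.

```python
# Two hypothesis classes fit the same noisy linear data; only the one whose
# inductive bias matches the underlying law extrapolates with it.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=10)
y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, 0.05, size=10)  # true law: y = 2x + 1

linear = np.polyfit(x_train, y_train, deg=1)    # bias restricted to the law's form
flexible = np.polyfit(x_train, y_train, deg=7)  # fits the training data just as well

x_test = 3.0  # far outside the training range [0, 1]
print(np.polyval(linear, x_test))    # close to the law's answer, 7.0
print(np.polyval(flexible, x_test))  # typically far from 7.0: same fit, different bias
```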
"Reasoning or Reciting?" provides systematic evidence from a different angle. By constructing counterfactual variants of 11 standard tasks — variants that deviate from default assumptions — the paper shows that LLMs exhibit nontrivial performance on counterfactual versions but consistently degrade compared to default conditions. The degradation is not task-specific: it appears across all 11 tasks, suggesting a general reliance on narrow, non-transferable procedures rather than abstract reasoning. This is the behavioral signature of task-specific heuristics: they work on default (training-distribution-aligned) cases but fail when the task is logically equivalent but distributionally shifted.
Circuit-level mechanistic evidence: "Arithmetic Without Algorithms" (arXiv:2410.21272) provides the most granular evidence yet for the heuristics claim. Using causal analysis to identify the arithmetic circuit in LLMs, the authors discover a sparse set of important neurons that implement simple heuristics — each neuron activates when an operand falls within a certain numerical range and outputs corresponding answers. The unordered combination of these heuristic types explains most of the model's arithmetic accuracy. The model is not running an addition algorithm. It is combining pattern-matching rules — a bag of heuristics that produces correct answers for common cases without any generalizable procedure.
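A caricature of that mechanism (my illustration; the specific rules and ranges are invented, not the neurons the paper identifies): each rule fires on a pattern in the operands and votes for a band of answers, and the unordered combination of overlapping votes is what produces correct sums on well-covered inputs.

```python
# Bag-of-heuristics addition: no algorithm, just overlapping pattern rules.
from collections import Counter

HEURISTICS = [
    # Range rule: both operands in the low 50s, so the sum lies roughly in 100..114.
    (lambda a, b: 50 <= a < 58 and 50 <= b < 58,
     lambda a, b: range(100, 115)),
    # Ones-digit rule: operands ending in 5 and 2 vote for answers ending in 7.
    (lambda a, b: a % 10 == 5 and b % 10 == 2,
     lambda a, b: [x for x in range(200) if x % 10 == 7]),
]

def bag_of_heuristics_add(a: int, b: int) -> int:
    votes = Counter()
    for fires, candidates in HEURISTICS:
        if fires(a, b):
            votes.update(candidates(a, b))
    # The most-voted answer wins; there is no carry logic anywhere.
    return votes.most_common(1)[0][0] if votes else -1

print(bag_of_heuristics_add(55, 52))  # 107: the rules overlap on the right answer
print(bag_of_heuristics_add(73, 18))  # -1: no rule covers this case, so it fails
```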
This creates an apparent tension with "Can large language models develop genuine world models without direct environmental contact?" — that note claims text training does extract world structure. The resolution may be level of analysis: coarse semantic regularities (the note) vs. precise generative-mechanistic structure (the probe). Or it may be a genuine tension requiring empirical resolution.
The familiar-vs.-novel dimension. François Chollet and Subbarao Kambhampati's exchange clarifies the boundary: it's not complexity per se but familiarity at the instance level that determines whether heuristics suffice. Large reasoning models (LRMs) can handle arbitrarily complex tasks as long as those tasks were covered during training — but show them an unfamiliar task, even a simple one requiring just a handful of reasoning steps, and they fail. Scaling up problem variables is a "roundabout way to generate novelty" — the complexity increase forces the model into unfamiliar territory where heuristics break. Kambhampati's rejoinder sharpens this: "we showed that LRMs do indeed lose accuracy as the size of familiar instances grow — they don't learn algorithms." Both agree transformers fit instance-based patterns, not generalizable algorithms. The delineation matters for evaluation: testing familiar problem types at increasing scale conflates two distinct effects, the novelty of the test instances and the absence of a learned algorithm.
Compositional tasks provide the clearest evidence. "Faith and Fate" (Dziri et al., 2023) shows that on multi-digit multiplication, logic grid puzzles, and dynamic programming problems, transformers solve compositional tasks by reducing multi-step reasoning to linearized subgraph matching. When test problems share computation subgraphs with training data, models succeed; when the composition is novel, they fail. Training yields near-perfect performance at low complexity but "fails drastically" outside the training distribution. Errors in early steps compound and prevent correct solutions at high complexity. As the related note "Do transformers actually learn systematic compositional reasoning?" argues, the heuristic IS subgraph matching — and it works well enough within distribution to create the illusion of systematic reasoning. Source: Arxiv/Evaluations.
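A toy rendering of that failure mode (my sketch, not Dziri et al.'s analysis code; the lookup table and function are hypothetical): a "solver" that can only reuse sub-computations it has literally seen succeeds exactly when a test problem's computation subgraph overlaps training, and fails on novel compositions of the same skills.

```python
# Subgraph matching instead of an algorithm: answers are stitched together
# from memorized sub-computations, so coverage, not competence, decides success.
TRAINING_SUBGRAPHS = {("mul", 3, 4): 12, ("mul", 2, 4): 8, ("add", 12, 80): 92}

def multiply_by_matching(a: int, b: int):
    """Answer a two-digit-by-one-digit product only from memorized pieces."""
    tens, ones = divmod(a, 10)
    try:
        partial_ones = TRAINING_SUBGRAPHS[("mul", ones, b)]
        partial_tens = TRAINING_SUBGRAPHS[("mul", tens, b)] * 10
        return TRAINING_SUBGRAPHS[("add", partial_ones, partial_tens)]
    except KeyError:
        return None  # a genuinely compositional solver would still succeed here

print(multiply_by_matching(23, 4))  # 92: every required subgraph was seen in training
print(multiply_by_matching(23, 5))  # None: same skill, novel composition
```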
Source: Philosophy Subjectivity; enriched from LLM Architecture, Flaws
Related concepts in this collection
- Can large language models develop genuine world models without direct environmental contact?
  Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess.
  (apparent tension: that note claims world-structure extraction; this probe finds task-specific heuristics; the level of analysis may resolve it, or the conflict may be genuine)
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  (parallel structure: encoding doesn't imply use; prediction accuracy doesn't imply world model internalization)
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  (same pattern in the linguistic domain: correct output without structural learning)
- Do large language models reason symbolically or semantically?
  Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.
  (semantic dependence IS the heuristic mechanism: when commonsense semantics align with the task, heuristics produce correct answers; when they conflict, the model cannot override them)
- Why do neural networks fail at compositional generalization?
  Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.
  (heuristics may be the network's solution to the binding problem: rather than dynamically binding entities into compositional structures, which requires solving segregation, representation, and composition, the model bypasses binding entirely by developing task-specific shortcuts that pattern-match without composing)
- Can identical outputs hide broken internal representations?
  Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
  (FER, fractured entangled representation, is what task-specific heuristics look like at the representation level: fractured solutions that work locally within arbitrary subdomains but lack the unified principles that a genuine world model would provide)
- Can neural networks learn compositional skills without symbolic mechanisms?
  Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
  (creates a tension: scaling can produce linearly decodable compositional features, but whether these constitute genuine generalization or scaled heuristics remains open; the heuristics-vs-world-models probe suggests that even compositionally organized representations may lack the inductive bias needed for true world model behavior)
- Can language models solve ToM benchmarks without real reasoning?
  Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.
  (ToM is a specific domain where task-specific heuristics masquerade as genuine capability: SFT matches RL on ToM benchmarks because the benchmarks contain exploitable structural patterns rather than requiring true mental state reasoning)
- Do large language models genuinely simulate mental states?
  This explores whether LLMs perform authentic theory of mind reasoning or rely on surface-level pattern matching. The distinction matters because evaluation format—multiple-choice versus open-ended—reveals very different capability levels.
  (open-ended ToM evaluation confirms the heuristic pattern: models default to surface strategies that work on structured benchmarks but fail when task scaffolding is removed, precisely as the heuristics-vs-world-models distinction predicts)
Original note title: foundation models develop task-specific heuristics rather than world models even when sequence prediction accuracy is high