Do transformers actually learn systematic compositional reasoning?
Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.
"Faith and Fate: Limits of Transformers on Compositionality" (Dziri et al., 2023) provides the clearest empirical decomposition of how transformers actually handle compositional tasks — and why they fail.
The test bed is three representative tasks: multi-digit multiplication, logic grid puzzles (Einstein's puzzle), and a classic dynamic programming problem. Each is formulated as a computation graph with measurable complexity. The results are devastating for systematic reasoning claims: training on task-specific data leads to near-perfect performance on in-distribution instances at low compositional complexity, but "fails drastically on instances outside of this region."
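The computation-graph framing is concrete enough to sketch in code. Below is a minimal illustration (my simplification, not the authors' implementation) of how n-digit multiplication becomes a DAG of single-step operations, with graph depth as one measurable axis of compositional complexity:

```python
# Minimal sketch (my simplification, not the authors' code) of the paper's
# computation-graph framing: n-digit multiplication as a DAG of single-step
# operations, with depth as a measurable axis of compositional complexity.

def multiplication_graph(a_digits, b_digits):
    """Toy computation DAG as {node: [parent nodes]}."""
    graph = {}
    for i in range(a_digits):                 # one single-digit product
        for j in range(b_digits):             # per digit pair (the leaves)
            graph[f"mul_{i}{j}"] = []
    # Combine partial products with a simplified chain of additions.
    terms = sorted(graph)
    prev = terms[0]
    for k, term in enumerate(terms[1:]):
        graph[f"add_{k}"] = [prev, term]
        prev = f"add_{k}"
    return graph

def depth(graph):
    """Longest leaf-to-node path: a proxy for reasoning depth."""
    memo = {}
    def d(node):
        if node not in memo:
            parents = graph[node]
            memo[node] = 0 if not parents else 1 + max(d(p) for p in parents)
        return memo[node]
    return max(d(n) for n in graph)

for n in (1, 2, 3, 4):
    g = multiplication_graph(n, n)
    print(f"{n}x{n} digits: {len(g)} nodes, depth {depth(g)}")
# Node count and depth grow rapidly with digit count -- the complexity
# axis along which Dziri et al. observe the collapse.
```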
The mechanism: transformers solve compositional tasks by reducing multi-step reasoning to linearized subgraph matching. When a test problem's computation subgraph was seen during training (or closely resembles one), the model succeeds. When the composition is novel, requiring the model to apply familiar computational rules to unseen combinations, it fails. This is shortcut learning, which "may yield fast correct answers when similar compositional patterns are available during training but does not allow for robust generalization to uncommon or complex examples."
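To make "linearized subgraph matching" concrete, here is a toy model of the claimed shortcut. The n-gram overlap criterion and every name below are my assumptions for illustration, not the paper's method; the point is that success tracks path coverage, not rule knowledge:

```python
# Toy model of the shortcut (my illustration, not the paper's code): the
# "model" succeeds iff the test problem's linearized step sequence has high
# n-gram overlap with step sequences memorized from training.

def step_ngrams(steps, n=3):
    """All length-n windows of a reasoning path."""
    return {tuple(steps[i:i + n]) for i in range(len(steps) - n + 1)}

def shortcut_model(test_steps, training_paths, n=3, threshold=0.8):
    """Return (success, coverage): succeed only on well-covered paths."""
    seen = set().union(*(step_ngrams(path, n) for path in training_paths))
    test = step_ngrams(test_steps, n)
    coverage = len(test & seen) / max(len(test), 1)
    return coverage >= threshold, coverage

# In-distribution: the 2x2-digit reasoning path was memorized verbatim.
train = [["mul_00", "mul_01", "add_0", "mul_10", "mul_11", "add_1", "carry"]]
print(shortcut_model(train[0], train))            # (True, 1.0)

# Novel composition: same rules, deeper 4x4-digit path, never seen whole.
deep = [f"mul_{i}{j}" for i in range(4) for j in range(4)] \
     + [f"add_{k}" for k in range(15)]
print(shortcut_model(deep, train))                # (False, 0.0)
```

Nothing about the "rules" changes between the two calls; only training coverage of the path changes, and with it success.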
The error analysis is particularly revealing. Models can memorize single-step operations, yet fail to compose them into correct reasoning paths. The failures are not random but systematic, suggesting "predictions based on shallow, rote learning rather than a deep, holistic task understanding." Error propagation makes this worse: mistakes in early steps compound through subsequent ones, imposing an inherent accuracy ceiling on complex compositional tasks.
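A back-of-envelope calculation shows why compounding creates a ceiling. Assuming (for illustration; the paper argues this empirically) that each node in the computation graph is executed correctly and independently with probability p, whole-path accuracy decays exponentially with depth:

```python
# Back-of-envelope error-propagation ceiling. The independence assumption
# is mine for illustration; the paper makes the compounding argument
# empirically rather than with this closed form.
for p in (0.99, 0.95, 0.90):          # per-step (per-node) accuracy
    for k in (5, 10, 20, 40):         # reasoning-path length in steps
        print(f"p={p:.2f}, steps={k:2d}: full-path accuracy ~ {p**k:.3f}")
```

Even 99% per-step accuracy leaves only about 67% at depth 40; at 90% per step, depth 40 is effectively unsolvable.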
This supplies the task-specific mechanism for what Do foundation models learn world models or task-specific shortcuts? describes at a higher level: the heuristic is linearized subgraph matching, and it works well enough within the training distribution to create the illusion of systematic reasoning. To the question in Can neural networks learn compositional skills without symbolic mechanisms?, the Faith and Fate finding adds a critical qualifier: scaling helps only insofar as it increases training coverage of computation subgraphs; genuinely novel compositions remain unsolved.
The implication for chain-of-thought: read alongside Does logical validity actually drive chain-of-thought gains?, this suggests CoT may work not because it enables systematic reasoning but because it decomposes problems into subgraphs the model has already seen. CoT as subgraph decomposition rather than logical inference.
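If that reading is right, a CoT trace earns its keep by rewriting one rare problem as many frequent ones. A hypothetical decomposition (illustrative only; cot_decompose is my invention, not anything from the paper or the CoT literature):

```python
# Hypothetical sketch (cot_decompose is my invention, not from the paper):
# a chain of thought rewrites one rare problem, 23 x 47, as a sequence of
# subproblems that are individually frequent in training data.

def cot_decompose(a: int, b: int):
    """Emit multiplication as single-digit products plus a running sum."""
    steps, total = [], 0
    for i, da in enumerate(reversed(str(a))):
        for j, db in enumerate(reversed(str(b))):
            partial = int(da) * int(db) * 10 ** (i + j)
            steps.append(f"{da} * {db} * 10^{i + j} = {partial}")
            total += partial
    steps.append(f"sum of partials = {total}")
    return steps

for line in cot_decompose(23, 47):
    print(line)   # ends with: sum of partials = 1081
# Each line is a tiny, heavily rehearsed subgraph; the open question is
# whether chaining them survives genuinely novel compositions.
```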
Source: Evaluations
Related concepts in this collection
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  → subgraph matching is the specific heuristic for compositional tasks
- Can neural networks learn compositional skills without symbolic mechanisms?
  Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
  → scaling helps by covering more subgraphs, not by creating systematic reasoning
- Why do neural networks fail at compositional generalization?
  Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.
  → theoretical explanation for why linearized matching fails on novel compositions
- Does logical validity actually drive chain-of-thought gains?
  What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.
  → CoT may succeed via subgraph decomposition, not logical validity
Original note title: compositional reasoning in transformers reduces to linearized subgraph matching — success depends on training exposure to similar computation subgraphs, not systematic problem-solving