Where do neural networks still fail at compositional generalization despite scaling?
This explores the specific places where bigger models and more data still don't deliver true compositional generalization — combining known pieces in new ways — and why scaling papers over the gap rather than closing it.
The corpus has a sharp answer: scaling buys you coverage, not systematicity. The most direct finding is that transformers often succeed by memorizing the computation subgraphs they saw in training and stitching them together (what one analysis calls linearized subgraph matching) rather than learning the underlying rule. On in-distribution combinations they look fluent; on genuinely novel compositions they fail hard, with errors compounding step by step across a reasoning chain (see "Do transformers actually learn systematic compositional reasoning?"). So the failure isn't random: it is concentrated exactly where a new combination falls outside the training distribution's coverage.
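To make the memorize-versus-rule distinction concrete, here is a minimal sketch, not the method of the cited analysis. It assumes a deliberately crude stand-in for subgraph matching: a lookup table keyed on whole operation chains. The primitive names (`inc`, `dbl`, `neg`, `sqr`), the `SubgraphMemorizer` class, and the held-out pairs are all hypothetical. The `chain_success` helper assumes per-step errors are independent, which real models need not satisfy.

```python
from itertools import product

# Hypothetical primitive operations on integers (illustrative names only).
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: x * 2,
    "neg": lambda x: -x,
    "sqr": lambda x: x * x,
}

def run_chain(ops, x):
    """Ground-truth rule: apply each primitive in order."""
    for name in ops:
        x = PRIMITIVES[name](x)
    return x

class SubgraphMemorizer:
    """Stores (ops, input) -> output pairs seen in training; learns no rule."""

    def __init__(self):
        self.table = {}

    def fit(self, examples):
        for ops, x in examples:
            self.table[(ops, x)] = run_chain(ops, x)

    def predict(self, ops, x):
        # Returns None when this exact composition was never seen.
        return self.table.get((ops, x))

def accuracy(model, examples):
    hits = sum(model.predict(ops, x) == run_chain(ops, x) for ops, x in examples)
    return hits / len(examples)

def chain_success(per_step_accuracy, num_steps):
    """Probability that every step is right, assuming independent step errors."""
    return per_step_accuracy ** num_steps

all_chains = list(product(PRIMITIVES, repeat=2))   # 16 two-step compositions
held_out = {("neg", "sqr"), ("sqr", "dbl")}        # never shown in training
inputs = range(5)

train = [(c, x) for c in all_chains if c not in held_out for x in inputs]
novel = [(c, x) for c in held_out for x in inputs]

model = SubgraphMemorizer()
model.fit(train)

in_dist_acc = accuracy(model, train)   # 1.0: every seen composition is recalled
novel_acc = accuracy(model, novel)     # 0.0: unseen compositions have no entry
```

Every primitive in the held-out chains appears in training, yet the memorizer still scores zero on them, which is the coverage gap in miniature. Under the independence assumption, `chain_success(0.95, 10)` is roughly 0.60, so a model that is 95% reliable per step gets a ten-step chain fully right only about six times in ten.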
That reframes what scaling actually does. One line of work shows plain MLPs *can* generalize compositionally with enough data and size, but only when the training distribution sufficiently covers the combinations of task pieces (see "Can neural networks learn compositional skills without symbolic mechanisms?"). Read alongside the subgraph-matching result, these agree more than they disagree: scaling works by densely tiling the space of combinations, so "novel" compositions become rare. Push to combinations the data never spanned and the gap reopens. The deeper diagnosis is the binding problem: networks struggle to dynamically bind distributed features into reusable structures, to keep entities separate, and to recombine learned parts in new arrangements (see "Why do neural networks fail at compositional generalization?"). Scaling can let compositional representations *emerge*, but it doesn't install the binding mechanism that would make recombination reliable by construction.
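The "tiling" picture can be put in rough numbers. The sketch below assumes training examples are drawn uniformly at random, with replacement, from a space of `options_per_slot ** num_slots` combinations. That is a simplifying assumption (real corpora are heavily skewed), and the function name and parameters are hypothetical.

```python
def expected_coverage(num_slots, options_per_slot, num_samples):
    """Expected fraction of distinct combinations seen at least once,
    assuming uniform sampling with replacement over the full space."""
    total = options_per_slot ** num_slots
    return 1.0 - (1.0 - 1.0 / total) ** num_samples

# Same data budget, two combination spaces of very different size.
small_space = expected_coverage(num_slots=2, options_per_slot=10, num_samples=1000)
large_space = expected_coverage(num_slots=6, options_per_slot=10, num_samples=1000)
```

With 100 possible combinations, 1000 samples cover almost all of them, so held-out compositions are rare. With a million possible combinations, the same 1000 samples cover about a tenth of a percent. More data moves a task from the second regime toward the first, but the space grows exponentially in the number of slots, so deeper compositions keep falling outside what was seen.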
The optimistic counterweight is real and worth holding: modern networks do exhibit genuine compositional behavior (complex syntax, multi-step logic, original code), which retires the old claim that connectionism simply can't compose (see "Can neural networks actually achieve compositional generalization?"). Networks even self-organize: pruning reveals they decompose tasks into isolated modular subnetworks, with pretraining making that modularity more consistent (see "Do neural networks naturally learn modular compositional structure?"). The honest synthesis is that the question has shifted from *whether* they compose to *how robustly*, and the robustness still tracks coverage, not principle.
Here's the part you might not expect: identical performance can hide broken internals. Networks trained by gradient descent can reproduce outputs perfectly while carrying fractured, entangled representations, with internal structure so tangled that it can't transfer to new contexts or recombine creatively, unlike cleaner evolved representations (see "Can identical outputs hide broken internal representations?"). This is the mechanism beneath the behavioral failure: a model can ace the benchmark and still lack the clean, factored parts that compositional generalization requires. There is also a predictive lens for *where* it will fail: treating LLMs as autoregressive probability machines correctly forecasts that logically trivial tasks become hard when the target is low-probability, like counting letters or reversing the alphabet (see "Can we predict where language models will fail?").
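The low-probability effect can be illustrated with a toy autoregressive model. This is a sketch under strong assumptions, not how any LLM is built: it uses a character bigram model with add-one smoothing, trained on a made-up corpus that contains only the forward alphabet, and it scores transitions only (the first character's probability is ignored). All names are hypothetical.

```python
import math
import string
from collections import Counter

def train_bigram(corpus, vocab, alpha=1.0):
    """Return log P(next | prev) for a smoothed character bigram model."""
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in corpus:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    vocab_size = len(vocab)

    def log_prob(prev, nxt):
        numerator = pair_counts[(prev, nxt)] + alpha
        denominator = prev_counts[prev] + alpha * vocab_size
        return math.log(numerator / denominator)

    return log_prob

def sequence_log_prob(seq, log_prob):
    """Chain rule over transitions: sum of log P(seq[i+1] | seq[i])."""
    return sum(log_prob(prev, nxt) for prev, nxt in zip(seq, seq[1:]))

vocab = string.ascii_lowercase
corpus = [vocab] * 50                      # toy corpus: forward alphabet only
log_prob = train_bigram(corpus, vocab)

forward_score = sequence_log_prob(vocab, log_prob)
backward_score = sequence_log_prob(vocab[::-1], log_prob)
```

Reversing the alphabet is logically no harder than reciting it, but `backward_score` comes out far lower than `forward_score` because every backward transition is one the model has never seen. A system that generates by following its own conditional probabilities has to work against them at each step of such a task.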
The most pointed challenge to scaling-as-the-answer: a 7M-parameter two-layer network that *recurses on its own latent reasoning state* beats DeepSeek R1, o3-mini, and Gemini 2.5 Pro on ARC-AGI puzzles, which are abstraction-and-composition benchmarks, using less than 0.01% of their parameters (see "Can tiny recursive networks outperform massive language models?"). If recursion on latent state, not scale, drives the generalization gain, then the places scaling still fails may be precisely the places where the missing ingredient is architectural, a mechanism for reusing structure, rather than more parameters and more data.
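What "recursing on a latent state" means structurally can be shown with a scalar toy. This is an illustration of weight-tied recursion only, not the architecture from the cited note: the update rule, the three weights, and the function names are assumptions chosen to keep the example small.

```python
import math

NUM_PARAMS = 3  # w_z, w_x, bias: fixed no matter how many steps are run

def make_step(w_z, w_x, bias):
    """Build one shared update: z <- tanh(w_z * z + w_x * x + bias)."""
    def step(z, x):
        return math.tanh(w_z * z + w_x * x + bias)
    return step

def recurse(step, x, z0=0.0, num_steps=8):
    """Apply the same step repeatedly, feeding the latent state back in."""
    z = z0
    trajectory = [z]
    for _ in range(num_steps):
        z = step(z, x)
        trajectory.append(z)
    return trajectory

step = make_step(w_z=0.5, w_x=1.0, bias=-0.2)
shallow = recurse(step, x=1.0, num_steps=1)    # one pass over the input
deep = recurse(step, x=1.0, num_steps=16)      # sixteen passes, same weights
```

The point of the sketch is the decoupling: `deep` performs sixteen sequential refinements of the latent state while the parameter count stays at three, because each pass reuses the same function. A standard feed-forward stack would need new weights for every extra layer of depth.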
Sources (8 notes)
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.
DNNs and LLMs now demonstrate sophisticated compositional processing—complex syntax, logical reasoning chains, original code generation—challenging the classical Fodor-Pylyshyn argument that connectionism cannot support compositionality. The debate shifts from whether neural nets can compose to how they do so without explicit constituent structure.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed this on tasks such as reciting the alphabet backwards and counting letters.
A single 7M-parameter two-layer network recursing on its latent reasoning state achieves 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating DeepSeek R1, o3-mini, and Gemini 2.5 Pro with less than 0.01% of their parameters. Recursion on latent state, not scale or hierarchy, drives the generalization gain.