Can neural networks learn compositional skills without symbolic mechanisms?
Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
The question: do neural networks need explicit symbolic mechanisms to achieve compositionality, or does scaling suffice?
The answer: scaling data and model size leads to compositional generalization in standard MLPs, without architectural modifications, but with a critical condition: the training distribution must sufficiently cover the task space. Individual modules need not appear in isolation, but they must appear in enough distinct combinations for the model to extract them.
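A minimal, hypothetical sketch of that coverage condition (the module counts, task size, and held-out fraction below are invented for illustration, not taken from the source): every module appears in many training combinations, while some whole combinations are reserved for testing.

```python
# Hypothetical illustration of the training-coverage condition: tasks are
# compositions of a few modules from a fixed pool. No module ever appears
# alone, but every module occurs in many training combinations; some
# combinations are held out entirely to test compositional generalization.
import itertools
import random

NUM_MODULES = 8       # size of the module pool (illustrative)
MODULES_PER_TASK = 3  # each task composes 3 modules (illustrative)
HELDOUT_FRACTION = 0.2

all_tasks = list(itertools.combinations(range(NUM_MODULES), MODULES_PER_TASK))
random.seed(0)
random.shuffle(all_tasks)

n_heldout = int(HELDOUT_FRACTION * len(all_tasks))
test_tasks = all_tasks[:n_heldout]
train_tasks = all_tasks[n_heldout:]

# Coverage check: each module should still occur in several training
# combinations even though the held-out combinations never appear.
coverage = {m: sum(m in t for t in train_tasks) for m in range(NUM_MODULES)}
print(f"{len(train_tasks)} train / {len(test_tasks)} test compositions")
print("per-module training coverage:", coverage)
```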
Three key contributions:
Proof of representational capacity. MLPs can approximate a general class of compositional task families (hyperteachers) to arbitrary precision using only a linear number of neurons relative to the number of task modules. Memorizing all tasks requires exponential capacity; the compositional solution is fundamentally more efficient.
Linear decodability as a compositionality signature. When a network generalizes compositionally, the task constituents can be linearly decoded from its hidden activations (a probe sketch follows this list). This metric predicts failures in text-to-image models: when concepts cannot be linearly decoded, the model fails to compose them.
Scaling limits. Despite progress, performance deteriorates as the number of composed concepts grows. The multiplicative nature of compositionality means even scaled models hit composition limits — the exponential growth eventually exceeds any finite training distribution.
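A minimal sketch of such a linear probe, assuming you already have hidden activations and per-sample constituent labels from a trained network. The arrays below are random placeholders, and the logistic-regression probe is one reasonable choice, not necessarily the source's exact protocol.

```python
# Hypothetical linear-decodability probe: fit one linear classifier per task
# constituent on hidden activations and measure held-out accuracy. High
# accuracy is the compositionality signature; near-chance accuracy predicts
# failures to compose. Shapes and labels below are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5000, 256))         # placeholder activations (N, d)
constituents = rng.integers(0, 2, (5000, 8))  # placeholder: which of 8 modules is active

scores = []
for m in range(constituents.shape[1]):
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden, constituents[:, m], test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(probe.score(X_te, y_te))

print("per-constituent linear decodability:", np.round(scores, 2))
```

In practice the probe would be fit on activations from the model under study; near-chance decodability for a constituent predicts that compositions involving it will fail.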
This directly addresses "Why do neural networks fail at compositional generalization?": the binding problem is solvable through scaling when training covers the task space, but remains unsolved for arbitrary novel compositions. The failure mode is not an inability to learn compositional structure but insufficient exposure to the combinatorial space.
The practical implication for LLMs: compositional generalization in language (novel sentence structures, new concept combinations) should improve with scale — but the tails of the combinatorial space will always remain sparsely covered, predicting continued failures on truly novel compositions.
SKiC prompting: unlocking compositional generalization with few examples
Skills-in-Context (SKiC) prompting shows that compositional generalization can be unlocked with remarkably few examples, as few as two exemplars, when the prompt structure explicitly grounds each reasoning step on foundational skills. The SKiC prompt has three blocks: (1) the skills with instructions, (2) compositional examples showing how to combine the skills, (3) the problem. This one-stage approach achieves near-perfect systematic generalization and is more general than decomposition-based methods, since it handles complex computation graphs that cannot be linearly decomposed. Intriguingly, SKiC also unlocks "latent potential": pre-existing internal skills from pretraining that standard prompting fails to activate. This confirms the training-coverage condition from a different angle: the model has compositional capacity from pretraining, but the prompt must explicitly invoke the skill-grounding structure to surface it. Source: Prompts Prompting.
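A hedged sketch of that three-block layout; the skill names, instructions, and exemplar text below are invented placeholders. Only the overall structure (skills, then compositional examples, then the problem) follows the SKiC description above.

```python
# Sketch of the three-block SKiC prompt structure. Skill descriptions and
# examples are placeholders, not the paper's actual prompts.
SKILLS = """\
Skill 1 (split): break the input into its parts, one per line.
Skill 2 (solve_part): solve each part independently and show the result.
Skill 3 (combine): merge the per-part results into the final answer."""

COMPOSITIONAL_EXAMPLES = """\
Example: <input A>
Use split -> <parts of A>
Use solve_part on each part -> <partial results>
Use combine -> <final answer for A>"""

def skic_prompt(problem: str) -> str:
    """Assemble a single-stage SKiC-style prompt: skills, then worked
    compositions of those skills, then the new problem."""
    return (
        "Basic skills:\n" + SKILLS + "\n\n"
        "How to compose the skills:\n" + COMPOSITIONAL_EXAMPLES + "\n\n"
        "Now solve, grounding each step on the skills above:\n" + problem
    )

print(skic_prompt("<new problem requiring an unseen combination of skills>"))
```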
Source: MechInterp
Related concepts in this collection
- Why do neural networks fail at compositional generalization?
Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.
binding failure is solvable through scaling but only with sufficient training coverage; explains both successes and persistent failures
- Do foundation models learn world models or task-specific shortcuts?
When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
FER tension: are scaled compositions genuine generalizations or scaled heuristics?
- Can identical outputs hide broken internal representations?
Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
tension: scaling may produce compositionality in outputs while FER persists in representations
- How does multi-hop reasoning develop during transformer training?
Does implicit multi-hop reasoning emerge gradually through distinct phases? This explores whether transformers move from memorization to compositional generalization, and what internal mechanisms enable that shift.
mechanistic detail for the training-coverage condition: second-hop generalization requires query-level compositional exposure, confirming that compositional generalization depends on the training distribution covering the specific compositional structure, not just individual components
- Can agents learn continuously without forgetting old skills?
Can lifelong learning systems retain previously acquired skills while acquiring new ones? This explores whether externalizing learned behaviors as retrievable code programs rather than parameter updates solves catastrophic forgetting.
VOYAGER's skill library is an external implementation of compositional generalization: complex skills are synthesized from primitives, achieving the efficient linear-scaling solution rather than exponential memorization; the ever-growing library progressively covers the combinatorial task space that the training-coverage condition requires
- Can language help agents imagine goals they've never seen?
How might compositional language enable artificial agents to target outcomes beyond their training experience? This matters because it could unlock open-ended exploration without hand-coded reward functions.
IMAGINE leverages the compositionality that this note documents: familiar words recombine to describe unfamiliar outcomes, enabling agents to target goals outside their training distribution; this is compositional generalization applied to goal specification rather than task execution
Original note title: compositional generalization emerges from scaling data and model size without explicit symbolic mechanisms