Scaling can lead to compositional generalization

Paper · arXiv 2507.07207 · Published July 9, 2025
MechInterp · Cognitive Models · Latent World Models

Can neural networks systematically capture discrete, compositional task structure despite their continuous, distributed nature? The impressive capabilities of large-scale neural networks suggest that the answer to this question is yes. However, even the most capable models still exhibit frequent failure cases that raise doubts about their compositionality. Here, we seek to understand what it takes for a standard neural network to generalize over tasks that share compositional structure. We find that simply scaling data and model size leads to compositional generalization. We show that this holds across different task encodings as long as the training distribution sufficiently covers the task space. In line with this finding, we prove that standard multilayer perceptrons can approximate a general class of compositional task families to arbitrary precision using only a linear number of neurons with respect to the number of task modules. Finally, we uncover that if networks successfully compositionally generalize, the constituents of a task can be linearly decoded from their hidden activations. We show that this metric correlates with failures of text-to-image generation models to compose known concepts.

The ability to understand and produce novel combinations from familiar constituents is a key faculty of intelligence. It has been debated for decades whether neural networks are ever able to truly achieve such compositional generalization [1]. Regardless of these theoretical considerations, scaling neural networks continues to result in increasingly capable models [2–4]. Naturally, as models are scaled up, their capacity to memorize grows, and it is perhaps unsurprising that as a result of training on ever larger datasets their ability to recall more information grows too [5]. However, compositionality by its nature implies exponential growth in the number of possible combinations, and ultimately any attempt to exhaustively capture this breadth by scaling the training data will be confronted with physical constraints.

Many works therefore advocate that neural network architectures should be explicitly endowed with compositional structure [e.g., 6–9] to allow making infinite use of their finite means [10, 11]. Capturing the underlying compositional procedure of the data is a more efficient pathway to generalization. In particular, the algorithmic complexity of this generalizing solution is much smaller than the complexity of the memorizing solution [12]. But does this mean that architectures need to explicitly factorize according to the data’s underlying compositional mechanisms [9]? For instance, monolithic networks have been shown to discover modular subnetworks, which may enable compositionality without specialized symbolic mechanisms [13]. Perhaps simply scaling the data and size of neural networks is then enough to achieve compositionality. Here, we attempt to answer this question:

Do neural networks compositionally generalize at scale?

Our main contributions are as follows:

• We demonstrate that standard multilayer perceptrons compositionally generalize on a variety of tasks as data and model size are scaled, across different task encodings, provided the training distribution sufficiently covers the task space.

• We prove that multilayer perceptrons can approximate a general class of compositional task families to arbitrary precision using only a linear number of neurons with respect to the number of task modules.

• We show that task constituents can be (linearly) decoded from the hidden activations of models that compositionally generalize, and demonstrate that this metric correlates with failures of image generation models to compose known concepts (see the probing sketch after this list).
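
To make this decoding metric concrete, the following is a minimal sketch of a linear probe on hidden activations. All shapes, the random placeholder data, and the multi-hot constituent vectors are illustrative assumptions, not the paper's actual experimental setup.

```python
import numpy as np

# Hypothetical setup (placeholder shapes and data, not the paper's experiments):
# hidden activations of a trained model and the multi-hot constituent vectors z
# that define each sample's task.
rng = np.random.default_rng(0)
n_samples, d_hidden, n_modules = 2000, 256, 8
hidden_activations = rng.normal(size=(n_samples, d_hidden))
z = (rng.random((n_samples, n_modules)) < 0.3).astype(float)

# Fit a linear decoder by least squares: z_hat = [H, 1] @ W.
H = np.concatenate([hidden_activations, np.ones((n_samples, 1))], axis=1)
W, *_ = np.linalg.lstsq(H, z, rcond=None)

# Decodability metric: accuracy of the thresholded linear readout per module
# (a proper evaluation would use held-out samples).
z_hat = (H @ W) > 0.5
accuracy = (z_hat == z.astype(bool)).mean()
print(f"constituent decoding accuracy: {accuracy:.2f}")
```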

Specifically, we will consider compositional task families that specify a generative procedure over tasks with shared compositional structure. In a similar vein to [14], our definition uses algorithmic complexity theory, in particular the notion of Kolmogorov complexity; see [15] for a formal treatment.

Condition (i) ensures that all task components functionally enter the composition, while condition (ii) excludes the case where compositions are purely context-sensitive, ensuring that there is shared structure between tasks. For a more detailed discussion of this definition, please refer to [16].

The notion of a task is used in a general sense here and allows us to capture different types of compositional data. For instance, a task could refer to a visual scene, where modules are the set of possible objects and the composition operator renders a selection of such objects into a scene. Similarly, a task could refer to a behavior policy, where modules consist of different reward functions, a subset of which is combined by the composition operator to induce an optimal policy.

Before we can continue to define compositional generalization using Definition 2.1, we must first specify how to present the model with information about its current task, as captured by the task constituents z. In practice, such a task description might not be the task constituents themselves, but rather some encoding thereof. For example, a task could be described through a natural language instruction or by presenting example data points.
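
As a rough illustration of this distinction, the sketch below contrasts an explicit multi-hot encoding of the constituents z with an implicit, example-based encoding. The dimensions and the stand-in `task_fn` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_modules, d_in = 6, 3

# Encoding 1: present the constituents directly as a multi-hot vector
# indicating which modules make up the current task.
z = np.zeros(n_modules)
z[[1, 4]] = 1.0  # a task composed of modules 1 and 4

# Encoding 2: describe the task only implicitly through a few example
# input/output pairs produced by the (unknown) composition; `task_fn` is a
# placeholder for the true composed task function.
def task_fn(x, z):
    return x * z.sum()

x_examples = rng.normal(size=(5, d_in))
y_examples = np.stack([task_fn(x, z) for x in x_examples])
demonstration = np.concatenate([x_examples, y_examples], axis=1)  # shape (5, 6)
```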

2.4 Hyperteachers: A general class of compositional task families

For the purpose of this study, it will be useful to instantiate a concrete but nevertheless general class of compositional task families according to Definition 2.1. Given that neural networks are flexible function approximators and thus able to cover a wide range of behaviors, we parameterize both the composition operator as well as the task functions using neural networks. The resulting system, a composable neural network that generates another neural network, can be interpreted as a hypernetwork [17]. Indeed, hypernetworks have previously been used to study compositional generalization [18].
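
A minimal sketch of this idea follows: module parameters are combined according to the constituents z, and the composed parameters define a target MLP that generates the task's data. The summation-based composition operator and all dimensions are placeholder choices, not necessarily the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_modules, d_in, d_hidden, d_out = 8, 4, 16, 2

# Each module contributes a slice of parameters; the composition operator here
# is simply a z-weighted sum of the selected slices (an illustrative choice).
module_W1 = rng.normal(size=(n_modules, d_in, d_hidden)) / np.sqrt(d_in)
module_W2 = rng.normal(size=(n_modules, d_hidden, d_out)) / np.sqrt(d_hidden)

def hyperteacher(z, x):
    """Generate a target MLP from the constituents z and evaluate it on x."""
    W1 = np.einsum("m,mij->ij", z, module_W1)  # composed first-layer weights
    W2 = np.einsum("m,mij->ij", z, module_W2)  # composed second-layer weights
    return np.maximum(x @ W1, 0.0) @ W2        # ReLU MLP defined by (W1, W2)

z = np.zeros(n_modules)
z[[0, 3, 5]] = 1.0                  # a task composed of three modules
x = rng.normal(size=(32, d_in))     # inputs for this task
y = hyperteacher(z, x)              # labels generated by the hyperteacher
```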

Memorizing all tasks of a compositional task family by definition requires exponential network capacity. Intuitively, a solution that captures the underlying compositional structure and thus generalizes should be more efficient. A priori, however, it is not clear whether such a solution exists for a finite-sized multilayer perceptron. As we have argued before, hyperteachers can be regarded as a general class of compositional task families. It is therefore instructive to consider whether a finite-sized multilayer perceptron can implement any hyperteacher without having to memorize the exponential number of possible tasks. The following theorem answers this question in the affirmative.
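
A back-of-the-envelope count (with assumed notation, not the paper's exact statement) makes the contrast explicit:

```latex
% Assumed notation: $n$ modules, tasks formed by selecting subsets of modules.
\[
  \#\{\text{tasks}\} \;\le\; 2^{n}
  \quad\Longrightarrow\quad
  \text{memorizing capacity} = \Omega\!\left(2^{n}\right),
  \qquad
  \text{compositional MLP} = \mathcal{O}(n)\ \text{neurons.}
\]
```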

In principle, it is easy to come up with degenerate training distributions that make compositional generalization impossible. For instance, if a module is consistently absent from all training tasks, the model has no opportunity to learn this module and will generally fail to generalize to tasks that contain this module. In this sense, the support of the training distribution ptrain(z) needs to sufficiently cover the full constituent space for compositional generalization to succeed. In this section we investigate how various conditions over the training support affect compositional generalization.
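
The sketch below illustrates this distinction on a toy task space of module pairs: one split holds out specific combinations while every module still appears during training, whereas a degenerate split removes one module from the training support entirely. The task space and the held-out pairs are illustrative assumptions.

```python
import itertools
import numpy as np

n_modules = 6

# Illustrative task space: every task is an unordered pair of modules.
all_tasks = list(itertools.combinations(range(n_modules), 2))

# Held-out combinations for testing compositional generalization. Every
# individual module still appears in some training task, so the support of
# p_train(z) covers all constituents even though these pairs are unseen.
held_out = {(0, 1), (2, 3), (4, 5)}
train_tasks = [t for t in all_tasks if t not in held_out]

# A degenerate split, by contrast, drops module 5 from training entirely,
# leaving the model no opportunity to learn it.
degenerate_train = [t for t in all_tasks if 5 not in t]

assert all(any(m in t for t in train_tasks) for m in range(n_modules))
assert not any(5 in t for t in degenerate_train)
```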

The study of compositional generalization in neural networks has a long and rich history, with early critiques highlighting the challenges connectionist models face in exhibiting systematicity and compositionality [e.g., 1, 31–33] and numerous works that in response explored mechanisms for representing and processing structured information using distributed representations [e.g., 34, 35]. In recent years, theoretical progress has been made in showing that compositional generalization can provably be achieved with neural networks in specific settings [18, 22, 36–39]. This typically requires constraining the model architecture and the data generating process. Consistent with our results, the statistics of the training data play a crucial role in enabling compositional generalization [18, 21, 22, 39].

We aim to complement this work by showing how scaling generic neural networks can lead to compositional generalization in the absence of stronger architectural constraints. This is motivated by the finding that scaling neural networks can break the curse of dimensionality [40] and consistently results in improvements in model performance [2, 41], with new capabilities emerging as models are scaled up [3, 4, 42]. Compositional abilities in particular therefore seem to have come within reach in practice [43–48]. However, even at larger scales, models often display a compositional generalization gap that does not close as scale increases [49–57], despite standard transformers showing compositionality in controlled settings [58–60].

Image generation models in particular have made impressive leaps in their ability to create novel image compositions [29, 30, 61, 62]. Compositional abilities have been shown to emerge in such diffusion-based models on synthetic tasks in an order determined by the underlying data-generating processes, with performance emerging suddenly due to multiplicative dependencies [24, 63]. Consistent with our finding that task constituents are linearly decodable in models that successfully compositionally generalize, [64] find that diffusion image generation models learn factorized representations on a number of synthetic tasks. However, as the number of concepts that need to be composed grows, the performance of image generation models starts to deteriorate, revealing the limits of their productivity on arbitrarily complex compositions [27, 28].