Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Paper · arXiv 2605.29548 · Published May 28, 2026
Reinforcement Learning

Larger models learn tasks that smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests a larger model will learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high-frequency or low-complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck and find that it traces to a reduced-interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, so they do not overwrite rare-task features as those features slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity.
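The resource-competition picture can be illustrated with a deliberately crude toy, which is not the paper's synthetic setup. Everything in it is an assumption made for illustration: the `Task` fields, the task names and numbers, and the greedy rule in `learned_tasks`, which lets tasks claim a fixed neuron budget in order of frequency per neuron required.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    frequency: float  # share of the data mixture (hypothetical values)
    complexity: int   # hypothetical number of neurons needed to solve the task


def learned_tasks(tasks, n_neurons):
    """Greedy toy allocation: tasks with the highest frequency per
    required neuron claim capacity first; a task is learned only if
    its full neuron requirement still fits in the remaining budget."""
    order = sorted(tasks, key=lambda t: t.frequency / t.complexity, reverse=True)
    learned, remaining = [], n_neurons
    for task in order:
        if task.complexity <= remaining:
            learned.append(task.name)
            remaining -= task.complexity
    return learned


# Illustrative mixture; none of these numbers come from the paper.
TASKS = [
    Task("common_simple", 0.70, 4),
    Task("common_complex", 0.25, 16),
    Task("rare_simple", 0.04, 4),
    Task("rare_complex", 0.01, 32),
]

for n in (8, 32, 64):
    print(n, learned_tasks(TASKS, n))
```

In this toy, the 32-neuron model has exactly enough capacity to express `rare_complex` on its own, yet it does not learn it because more frequent or simpler tasks claim the neurons first; only the 64-neuron model picks it up. The sketch captures the allocation bottleneck only, not the gradient-interference dynamics described above.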

Introduction. Modern machine learning is celebrated for its massive generalist models, which are capable of handling arbitrary inputs in diverse and complex environments [1–10]. Based on the empirical finding that larger models often excel where smaller models show random-chance performance, prior work has claimed that the ability to solve certain critical tasks only emerges in larger models [11–19]. Such arguments have fueled the drive towards increased scaling. However, given the large training and inference costs that large models impose, it is worth identifying precisely what marginal benefits are unlocked by larger models and whether scaling parameters is the sole way of realizing those benefits. Our argument begins from the observation that power-law scaling [20–22] already suggests that there is a regime in which a smaller model fails to learn parts of a data mixture that a larger model succeeds on, even under asymptotic training (Fig. 1, Sec. 2).
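One way to make this observation concrete is the following sketch, which is not necessarily the argument of Sec. 2. It assumes a standard power-law form for the infinite-data loss and a loss that decomposes over the tasks in the mixture; the symbols $E$, $A$, $\alpha$, $p_k$, and $\ell_k$ are introduced here for illustration only.

```latex
% Assumed infinite-data loss of a model with N parameters:
\begin{align}
  L_\infty(N) &= E + A\,N^{-\alpha}, \qquad A > 0,\ \alpha > 0, \\
% Gap between a smaller model N_1 and a larger model N_2 (N_1 < N_2):
  L_\infty(N_1) - L_\infty(N_2) &= A\left(N_1^{-\alpha} - N_2^{-\alpha}\right) > 0 . \\
% Assumed decomposition over tasks k with mixture weights p_k:
  L_\infty(N) &= \sum_k p_k\,\ell_k(N)
  \;\Longrightarrow\;
  \sum_k p_k\left[\ell_k(N_1) - \ell_k(N_2)\right] > 0 .
\end{align}
```

Because the weighted sum is strictly positive, at least one task $k$ with $p_k > 0$ satisfies $\ell_k(N_1) > \ell_k(N_2)$: under these assumptions, some part of the mixture remains worse-learned by the smaller model even with unlimited data.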

Discussion / Conclusion. We develop a data-centric account of why larger models can learn tasks that smaller models fail to learn. Specifically, we show that larger models can learn rare tasks from the data mixture, and that this phenomenon is explained by learning dynamics (competition for resources and retention of memories) together with task frequency and complexity. Our perspective highlights that understanding scaling requires thinking beyond model expressivity: we need to understand how learning dynamics interact with task frequency and complexity. It also points toward more intentional design of data mixtures to better elicit target capabilities. For example, simply increasing the frequency of a target task might be a more efficient way to learn that task than increasing the model size.