Why do larger models learn rare tasks better?

Does model size enable learning of infrequent, complex tasks through greater representational capacity, or through some other mechanism? Understanding this matters for deciding whether scaling or data design is the more efficient lever.

Synthesis note · 2026-06-03 · sourced from Reinforcement Learning

The standard story for why larger models acquire capabilities smaller ones lack is expressivity: bigger models can represent functions that smaller ones cannot. This paper argues that the real cause is usually different. A phenomenological argument shows that power-law scaling already implies a regime in which a smaller model fails to learn part of a data mixture that a larger model learns successfully, even with infinite training data. The gap is therefore not about whether a solution is representable.
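One way to see the shape of that argument is sketched below. It assumes a common parametric scaling form with separate model-size and data terms; that form is an assumption of this sketch and may differ from the one the paper uses. The symbols E, A, B, α, β, p_k, and L_k are introduced here for illustration only.

```latex
% Assumed scaling form: N = parameter count, D = amount of training data.
L(N, D) = E + A N^{-\alpha} + B D^{-\beta}, \qquad A, B, \alpha, \beta > 0

% Infinite-data limit: the data term vanishes, the size term does not.
L_\infty(N) = \lim_{D \to \infty} L(N, D) = E + A N^{-\alpha}

% For N_s < N_\ell the smaller model keeps a strictly positive excess loss.
L_\infty(N_s) - L_\infty(N_\ell) = A \left( N_s^{-\alpha} - N_\ell^{-\alpha} \right) > 0

% If the mixture loss is a frequency-weighted sum over tasks k,
L_\infty(N) = \sum_k p_k \, L_{k,\infty}(N), \qquad p_k > 0, \quad \sum_k p_k = 1,
% then a positive total gap forces a positive gap on at least one task:
\exists\, k : \; L_{k,\infty}(N_s) > L_{k,\infty}(N_\ell)
```

Under these assumptions the residual gap survives any amount of data, but the sketch says nothing about which tasks carry it or why.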

The mechanism is reduced interference, traced through a controlled synthetic mixture and validated by pretraining OLMo models (4M–4B) on tasks of varying frequency and complexity. Smaller models face a data-induced competition over neurons: they allocate resources to high-frequency, low-complexity tasks and learn solutions that perform poorly on rare, complex tasks, even when an expressible solution exists. A larger model circumvents this because, once enough capacity is allocated to the common tasks, the gradient updates for those tasks become weak, so they stop overwriting the rare-task features that accumulate slowly over training.
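The gradient-weakening step can be illustrated with a little bookkeeping. The sketch below is not the paper's experiment: it assumes a binary cross-entropy loss, for which the gradient with respect to a logit has magnitude |sigmoid(logit) - label|, and it uses hypothetical logits as stand-ins for how well each task is currently fit. The function names and the example numbers are illustrative assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def logit_grad_magnitude(logit: float, label: int) -> float:
    """Magnitude of d(binary cross-entropy)/d(logit), i.e. |sigmoid(logit) - label|."""
    return abs(sigmoid(logit) - label)


def rare_gradient_share(p_rare: float, common_logit: float, rare_logit: float) -> float:
    """Fraction of the frequency-weighted gradient magnitude that comes from
    the rare task, when both tasks have a positive label.

    p_rare is the sampling frequency of the rare task; the two logits are
    hypothetical stand-ins for how well each task is currently fit.
    """
    p_common = 1.0 - p_rare
    g_common = p_common * logit_grad_magnitude(common_logit, 1)
    g_rare = p_rare * logit_grad_magnitude(rare_logit, 1)
    return g_rare / (g_common + g_rare)


# Hypothetical "small model": the common task is still poorly fit (logit 0),
# so its gradient dominates every update even though the rare task is unlearned.
small_share = rare_gradient_share(p_rare=0.01, common_logit=0.0, rare_logit=-2.0)

# Hypothetical "large model": the common task is fit confidently (logit 8),
# so its gradient is nearly zero and the rare task drives most of the update.
large_share = rare_gradient_share(p_rare=0.01, common_logit=8.0, rare_logit=-2.0)

print(f"rare-task share, common task unfit: {small_share:.3f}")
print(f"rare-task share, common task fit:   {large_share:.3f}")
```

The rare task's frequency is the same in both cases; only the residual error on the common task changes, which is the sense in which added capacity reduces interference in this simplified picture.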

The key implication overturns the "just scale parameters" reflex: understanding scaling requires thinking beyond expressivity to learning dynamics, that is, to how task frequency and complexity interact with capacity. It also suggests a cheaper lever: intentional data-mixture design. Simply up-weighting the frequency of a target rare task may teach it more efficiently than scaling model size. This connects to "What limits reasoning capability beyond math and code?": both relocate capability from model size toward data composition.
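Up-weighting a rare task amounts to changing its sampling probability and rescaling everything else. The following is a minimal sketch under stated assumptions: the helper `upweight_task`, the task names, and the frequencies are hypothetical and do not come from the paper, and the sketch only computes a new mixture; it makes no claim about how much up-weighting is needed.

```python
from typing import Dict


def upweight_task(mixture: Dict[str, float], task: str, target: float) -> Dict[str, float]:
    """Return a new sampling mixture in which `task` has probability `target`
    and all other tasks keep their relative proportions.

    `mixture` maps task name to a non-negative weight; weights need not sum to 1.
    """
    if task not in mixture:
        raise KeyError(f"unknown task: {task}")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    rest = sum(w for name, w in mixture.items() if name != task)
    if rest <= 0.0:
        raise ValueError("the other tasks must have positive total weight")
    scale = (1.0 - target) / rest  # shrink the other tasks to make room
    return {
        name: (target if name == task else w * scale)
        for name, w in mixture.items()
    }


# Hypothetical mixture: two common tasks and one rare task at 1%.
original = {"common_a": 0.70, "common_b": 0.29, "rare": 0.01}

# Raise the rare task to 10% of samples; the common tasks shrink proportionally.
reweighted = upweight_task(original, "rare", 0.10)
print(reweighted)
```

Keeping the other tasks in their original proportions is one design choice among several; a real mixture might instead take the extra probability mass from a single over-represented source.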

Related concepts in this collection 3

What limits reasoning capability beyond math and code? Can scaling reasoning to open-ended domains like economics and social sciences be solved by better training methods, or does the real bottleneck lie elsewhere? This explores what actually constrains broader reasoning.
both shift the lever from model size to data composition
Why aren't bigger models better for generating diverse outputs? When generating many unique outputs within a fixed budget, does model size actually matter? Exploring whether the conventional wisdom of using larger models holds for diversity-focused tasks.
another non-monotonic, capacity-vs-task account that resists "bigger is simply better"
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
reframes emergence as access/interference rather than absent capability
