Why do larger models learn rare tasks better?
Does model size enable learning of infrequent, complex tasks through greater representational capacity, or through some other mechanism? Understanding this matters for deciding whether scaling or data design is the more efficient lever.
The standard explanation for why larger models acquire capabilities that smaller ones lack is expressivity: bigger models can represent functions that smaller ones cannot. This paper argues the real cause is usually different. A phenomenological argument shows that power-law scaling already implies a regime in which a smaller model fails to learn part of a data mixture that a larger model learns, even with infinite training data. On this account, the gap is not about whether a solution is representable.
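One way to see why infinite data need not close the gap is to write down a saturating power law. The sketch below assumes a Chinchilla-style form, loss = E + A / N^alpha + B / D^beta, which is not taken from the paper. The function names and all constants are hypothetical and chosen only for illustration. As the token count D grows without bound, the data term vanishes and the loss settles on a floor that still depends on the parameter count N.

```python
# Illustrative only: an assumed Chinchilla-style scaling form, not the paper's derivation.
# E, A, B, alpha, beta are made-up constants.

def loss(n_params, n_tokens, E=1.0, A=10.0, B=10.0, alpha=0.5, beta=0.5):
    """Assumed loss: irreducible term + capacity term + data term."""
    return E + A / n_params**alpha + B / n_tokens**beta


def infinite_data_floor(n_params, E=1.0, A=10.0, alpha=0.5):
    """Limit of `loss` as n_tokens goes to infinity: the data term drops out."""
    return E + A / n_params**alpha


if __name__ == "__main__":
    small, large = 1e6, 1e9  # hypothetical parameter counts
    for n_tokens in (1e9, 1e12, float("inf")):
        print(n_tokens, loss(small, n_tokens), loss(large, n_tokens))
    # More data moves both models toward their floors, but the floors differ.
    print(infinite_data_floor(small) - infinite_data_floor(large))
```

Under these assumptions the residual gap between the two floors comes entirely from the capacity term, so no amount of extra data removes it. The sketch says nothing about which mechanism produces that term, which is the question the note goes on to address.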
The mechanism is reduced interference, traced through a controlled synthetic mixture and validated by pretraining OLMo models (4M–4B) on tasks of varying frequency and complexity. Smaller models face a data-induced competition over neurons: they allocate resources to high-frequency, low-complexity tasks and learn solutions that perform poorly on rare, complex tasks — even when an expressible solution exists. A larger model circumvents this because, with enough capacity allocated to common tasks, the gradient updates for those tasks become weak — so they stop overwriting the rare-task features that accumulate slowly over training.
The key implication overturns the "just scale parameters" reflex: understanding scaling requires thinking beyond expressivity to learning dynamics, that is, how task frequency and complexity interact with capacity. It also suggests a cheaper lever: intentional data-mixture design. Simply up-weighting the frequency of a target rare task may teach it more efficiently than scaling model size. This connects to "What limits reasoning capability beyond math and code?": both relocate capability from model size toward data composition.
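The up-weighting lever can be sketched as a small change to the sampling weights of a training mixture. The example below is a minimal illustration, not the paper's procedure: the `upweight` helper, the task names, and the weights are all hypothetical. It multiplies one task's sampling weight by a factor and renormalizes so the weights still sum to 1.

```python
# Hypothetical helper: re-weight one task in a data mixture and renormalize.

def upweight(mixture, task, factor):
    """Return a new mixture with `task`'s weight multiplied by `factor`.

    `mixture` maps task name -> sampling probability. The result is
    renormalized so the probabilities sum to 1.
    """
    if task not in mixture:
        raise KeyError(f"unknown task: {task}")
    if factor <= 0:
        raise ValueError("factor must be positive")
    scaled = {
        name: weight * (factor if name == task else 1.0)
        for name, weight in mixture.items()
    }
    total = sum(scaled.values())
    return {name: weight / total for name, weight in scaled.items()}


if __name__ == "__main__":
    # Made-up mixture: one common task and one rare task.
    mixture = {"common": 0.99, "rare": 0.01}
    print(upweight(mixture, "rare", 10))
```

Because of the renormalization, a 10x factor on a task with weight 0.01 yields a share slightly below 0.10, and every other task's share shrinks proportionally. Whether a given factor teaches the rare task more cheaply than a larger model is an empirical question the sketch does not answer.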
Inquiring lines that use this note as a source
This note is a source for these synthesized inquiries.
- Does pretraining data size matter less than base model scale for finetuning?
- How do task frequency and complexity interact with model capacity during training?
- Can intentional data-mixture design replace model scaling for rare task learning?
- Do rare cultural concepts fail predictably as model scale increases?
Related concepts in this collection
- What limits reasoning capability beyond math and code?
  Can scaling reasoning to open-ended domains like economics and social sciences be solved by better training methods, or does the real bottleneck lie elsewhere? This explores what actually constrains broader reasoning.
  both shift the lever from model size to data composition
- Why aren't bigger models better for generating diverse outputs?
  When generating many unique outputs within a fixed budget, does model size actually matter? Exploring whether the conventional wisdom of using larger models holds for diversity-focused tasks.
  another non-monotonic, capacity-vs-task account that resists "bigger is simply better"
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  reframes emergence as access/interference rather than absent capability
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
- Provable Benefits of In-Tool Learning for Large Language Models
- 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
- Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
- Language Models are Pragmatic Speakers
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- Adam's Law: Textual Frequency Law on Large Language Models
Original note title
larger models learn rare tasks through reduced interference not greater expressivity — capacity weakens common-task gradients so they stop overwriting rare-task features