Are LLM emergent abilities real or measurement artifacts?
Do large language models develop sudden new capabilities at certain scales, or do discontinuous metrics just make gradual improvements look sudden? This matters because it changes how we predict and interpret model behavior.
The sharp, unpredictable transitions that define "emergent abilities" — capabilities appearing suddenly at certain model scales — are artifacts of the researcher's choice of metric rather than fundamental changes in model behavior.
The argument: nonlinear or discontinuous metrics (like exact string match) produce apparent emergent abilities, while linear or continuous metrics (like token edit distance) applied to the same model outputs show smooth, continuous, predictable changes with scale. The "emergence" lives in the measurement, not the model.
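A minimal toy simulation makes the mechanism concrete. The setup below is an illustration under assumed numbers, not the paper's actual data: each hypothetical model gets a per-token accuracy that improves smoothly with scale. Scoring the very same outputs with exact string match (all tokens must be right) produces what looks like a sudden jump, while an edit-distance-style metric changes smoothly.

```python
import numpy as np

# Assumption for illustration: per-token accuracy improves smoothly and
# log-linearly with parameter count (0.50 at 1e7 params, 0.95 at 1e11).
scales = np.logspace(7, 11, 9)                              # hypothetical model sizes
per_token_acc = 0.5 + 0.45 * (np.log10(scales) - 7) / 4     # smooth improvement

answer_len = 10  # suppose the task requires a 10-token answer, scored as a whole

# Discontinuous metric: exact string match needs every token correct, so
# accuracy is p**L. It sits near zero, then rises sharply -- looks "emergent".
exact_match = per_token_acc ** answer_len

# Continuous metric: expected number of wrong tokens (a proxy for token edit
# distance) changes smoothly and predictably with scale.
expected_edit_distance = answer_len * (1 - per_token_acc)

for n, em, ed in zip(scales, exact_match, expected_edit_distance):
    print(f"{n:.0e} params  exact match {em:5.3f}   expected edit distance {ed:4.2f}")
```

Running this, exact-match accuracy stays below 0.01 for the smaller models and then climbs steeply, while expected edit distance falls in a straight, predictable line: the same smooth underlying improvement, two very different-looking curves.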
Three complementary validations:
- InstructGPT/GPT-3 family — tasks with claimed emergent abilities show smooth improvement under continuous metrics
- BIG-Bench meta-analysis — claimed emergent abilities evaporate with different metrics or better statistics
- Vision tasks — the same metric manipulation produces never-before-seen "emergent abilities" across diverse deep networks, confirming the mechanism is metric-dependent, not domain-specific
This doesn't mean models don't improve with scale — they do, continuously. What it challenges is the narrative of sudden capability transitions that implies qualitative changes in what models can do. The practical implication: scaling predictions become much more tractable if improvements are smooth rather than discontinuous.
This connects to the question "Do foundation models learn world models or task-specific shortcuts?" Both challenge the narrative of fundamental capability leaps. Heuristics improve gradually with more data; emergence would require qualitative shifts. The metric-artifact finding supports the heuristics interpretation.
Source: Flaws
Related concepts in this collection
- Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic, or are they exploiting narrow patterns that fail under distribution shift? Relation: gradual heuristic improvement vs. sudden capability emergence; both support the same underlying picture.
- Can neural networks learn compositional skills without symbolic mechanisms? Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale. Relation: an apparent conflict; compositional generalization does emerge at scale, but may do so smoothly rather than suddenly.
- How much of LLM few-shot ability comes from training data? Do large language models genuinely learn from a few examples, or are they mostly recognizing patterns from their training data? This matters for understanding what LLMs can actually do. Relation: a triple challenge to the capabilities narrative: metric artifacts inflate emergence claims, task contamination inflates baselines, and prompting techniques don't replicate.
Original note title: emergent abilities of LLMs are metric artifacts not fundamental scaling behavior changes