Does depth matter more than width for tiny language models?
Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.
Kaplan et al.'s scaling laws find that performance is governed mainly by total parameter count (alongside data and compute), with the depth-to-width ratio mattering comparatively little; in practice, width tends to grow faster than depth as models scale. MobileLLM demonstrates that this guidance breaks down at the sub-billion-parameter scale relevant for on-device deployment: a deep-and-thin model structure outperforms balanced or wide-and-shallow alternatives of the same size, producing 2.7 percent and 4.3 percent accuracy boosts over the preceding 125M and 350M state-of-the-art models respectively. The reason offered is that depth captures abstract concepts by composing simpler features into hierarchical representations across more layers; at small scale the model has fewer raw parameters to spend, so making each one work harder through compositional depth pays back more than spreading them across wider layers.
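A minimal back-of-the-envelope sketch of the budget trade-off (the layer counts, widths, vocabulary size, and FFN multiplier below are illustrative assumptions, not MobileLLM's exact configurations): roughly the same parameter budget can be spent on a deep-and-thin stack or a wide-and-shallow one.

```python
# Approximate decoder-only transformer parameter count, ignoring norms and biases.
# All concrete numbers here are illustrative assumptions, not MobileLLM's settings.

def transformer_params(n_layers: int, d_model: int,
                       vocab_size: int = 32_000, ffn_mult: int = 4) -> int:
    attn = 4 * d_model * d_model                # Q, K, V, and output projections
    ffn = 2 * ffn_mult * d_model * d_model      # up- and down-projections
    embedding = vocab_size * d_model            # tied input/output embedding
    return n_layers * (attn + ffn) + embedding

deep_thin = transformer_params(n_layers=30, d_model=576)     # deep-and-thin
wide_shallow = transformer_params(n_layers=12, d_model=864)  # wide-and-shallow

print(f"deep-and-thin (30 x 576):    {deep_thin / 1e6:.0f}M params")
print(f"wide-and-shallow (12 x 864): {wide_shallow / 1e6:.0f}M params")
```

Both configurations land near the same budget; MobileLLM's finding is that, held at that budget, the deeper allocation wins on downstream accuracy.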
This matters because it shows that scaling laws are regime-dependent rather than universal. The Kaplan results were derived from larger models where both width and depth are abundant; at the small scale where mobile deployment lives, the trade-offs reverse. The implication is that the architectural recipe for on-device LLMs is genuinely different from the recipe for cloud-scale LLMs: not just smaller, but structurally different. "Can architecture choices improve inference efficiency without sacrificing accuracy?" makes the same point at the inference-economics layer: vanilla scaling laws say nothing about deployment regimes.
The deeper lesson is methodological: scaling laws should always be qualified by the regime in which they were derived, and recommendations for sub-billion-parameter design should not be extrapolated downward from billion-plus-parameter studies. The right architecture for a 350M-parameter model is not a scaled-down version of a 70B-parameter model; it is a deep-and-thin design derived from the constraints of the small-scale regime. "Can parallel architectures solve fundamentally sequential problems?" gives a complementary reason to favor depth: some computations require sequential composition that width cannot supply at any scale.
Source: MobileLLM
Related concepts in this collection
- What actually limits language models on mobile phones?
  Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.
  extends: same MobileLLM source; this note answers WHY sub-billion is the regime, depth-vs-width answers HOW to design within it
- Does recomputing weights cost less than moving them on mobile?
  Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.
  extends: same MobileLLM paper; depth wins partly because depth-with-shared-weights can be deeper than depth-with-distinct-weights at fixed parameter count; the two design moves compound (a minimal sketch of this sharing pattern follows the list below)
- Can architecture choices improve inference efficiency without sacrificing accuracy?
  Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
  extends: both reject regime-blind scaling laws; this note shows depth-width trade-offs flip in the small regime; conditional scaling laws formalize how architecture variables modulate the law
- Can parallel architectures solve fundamentally sequential problems?
  Explores whether pure parallel computation, like Transformers, can tackle problems requiring long chains of dependent reasoning, or whether serial depth is theoretically necessary for certain classes of problems.
  extends: gives a theoretical reason to prefer depth (serial composition) over width (parallel breadth) for capability-bounded models
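On the weight-sharing point in the second item above, here is a minimal PyTorch sketch of immediate block-wise sharing, using nn.TransformerEncoderLayer as a generic stand-in block rather than MobileLLM's actual decoder block: applying each block twice in a row roughly doubles effective depth while keeping the parameter count of a model half as deep.

```python
import torch
import torch.nn as nn

class SharedDepthStack(nn.Module):
    """Immediate block-wise weight sharing (illustrative sketch).

    Each block is applied twice in succession, so effective depth is
    2 * n_blocks while the parameter count stays at n_blocks blocks' worth.
    The motivation is that the second application reuses weights that are
    already resident in fast memory instead of fetching another block.
    """

    def __init__(self, n_blocks: int = 15, d_model: int = 576, n_heads: int = 9):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True)
            for _ in range(n_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)  # first application of this block's weights
            x = block(x)  # second application: same parameters, extra depth
        return x

x = torch.randn(1, 16, 576)  # (batch, sequence, d_model)
y = SharedDepthStack()(x)    # 15 parameter blocks, 30 block applications
```

This is why the two design moves compound: at a fixed parameter count, sharing lets a deep-and-thin model run even deeper than its distinct-weight layer count suggests.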
Original note title: depth beats width for sub-billion parameter LLMs — contradicting Kaplan scaling laws because deep-and-thin captures abstract concepts better at small scale