LLM Reasoning and Architecture

Does depth matter more than width for tiny language models?

Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.

Note · 2026-05-03 · sourced from Mobile

Kaplan et al.'s scaling laws find that loss depends chiefly on total parameter count and only weakly on model shape, so common practice has been to grow depth and width roughly together, with width growth often dominating at typical model sizes. MobileLLM demonstrates that this guidance breaks down at the sub-billion-parameter scale relevant for on-device deployment: a deep-and-thin structure outperforms balanced or wide-and-shallow alternatives, yielding 2.7 percent and 4.3 percent accuracy boosts over the preceding 125M and 350M state-of-the-art models respectively. The reason offered is that depth captures abstract concepts by composing simpler features into hierarchical representations across more layers; at small scale the model has fewer raw parameters to spend, so making each one work harder through compositional depth pays back more than spreading parameters across wider layers.
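
As a back-of-the-envelope sketch (mine, not the source's): at a fixed budget, depth and width trade off roughly as layers times d_model squared. The counting below ignores biases, norms, and the exact FFN multiplier, assumes a 32k vocabulary with tied embeddings, and pits a 30-layer, 576-wide shape close to MobileLLM-125M's reported configuration against a hypothetical 12-layer, 896-wide alternative.

```python
# Rough non-embedding parameter count for one pre-LN transformer block:
# attention (Q, K, V, O projections) ~ 4 * d^2, FFN with d_ff = 4d ~ 8 * d^2.
def block_params(d_model: int) -> int:
    return 4 * d_model**2 + 8 * d_model**2  # biases and norms ignored

def model_params(n_layers: int, d_model: int, vocab: int = 32_000) -> int:
    return n_layers * block_params(d_model) + vocab * d_model  # tied embeddings

# Two shapes at a comparable budget: one deep and thin, one wide and shallow.
deep_thin    = model_params(n_layers=30, d_model=576)  # MobileLLM-125M-like shape
wide_shallow = model_params(n_layers=12, d_model=896)  # hypothetical comparison

print(f"deep-and-thin    (30 x 576): {deep_thin / 1e6:6.1f}M params")
print(f"wide-and-shallow (12 x 896): {wide_shallow / 1e6:6.1f}M params")
```

Both shapes land within a few percent of the same budget; everything that differs between them is how that budget is spent.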

This matters because it shows that scaling laws are regime-dependent rather than universal. The Kaplan results were derived from larger models where both width and depth are abundant; at the small scale where mobile deployment lives, the trade-offs reverse. The implication is that the architectural recipe for on-device LLMs is genuinely different from the recipe for cloud-scale LLMs, not just smaller but structurally different. The note "Can architecture choices improve inference efficiency without sacrificing accuracy?" makes the same point at the inference-economics layer: vanilla scaling laws say nothing about deployment regimes.
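
For reference, the Kaplan parameter-scaling fit has the following form (constants as reported in Kaplan et al., 2020, for non-embedding parameter count N). Nothing in it mentions depth or width: the fitted law is shape-blind by construction, so extrapolating it into a new regime cannot reveal a shape trade-off that only appears there.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076,
\quad N_c \approx 8.8 \times 10^{13}
```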

The deeper lesson is methodological: scaling laws should always be qualified by the regime in which they were derived, and recommendations for sub-billion-parameter design should not be extrapolated downward from billion-plus-parameter studies. The right architecture for a 350M-parameter model is not a scaled-down version of a 70B-parameter model; it is a deep-and-thin design derived from the constraints of the small-scale regime. The note "Can parallel architectures solve fundamentally sequential problems?" gives a complementary reason to favor depth: some computations require sequential composition that width cannot supply at any scale.
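
A minimal sketch of that last point, using an iterated nonlinear map as a toy stand-in (my example, not the linked note's): for generic parameters the k-th iterate has no known closed form, so each step must consume the previous step's output, and a feed-forward circuit computing it needs on the order of k layers no matter how wide each layer is.

```python
import numpy as np

# Toy stand-in for a "fundamentally sequential" computation: iterating the
# logistic map. For generic r there is no known closed form for the k-th
# iterate, so the k steps cannot be collapsed into one parallel layer.
def logistic_step(x: np.ndarray, r: float = 3.9) -> np.ndarray:
    return r * x * (1.0 - x)

def iterate(x0: np.ndarray, k: int) -> np.ndarray:
    x = x0
    for _ in range(k):  # k sequential compositions: this is depth, not width
        x = logistic_step(x)
    return x

# Evaluating many starting points at once parallelizes easily (width),
# but the k steps along any single trajectory remain inherently sequential.
print(iterate(np.linspace(0.1, 0.9, 5), k=8))
```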

