Why do scaling laws fail to predict optimal architectures at small parameter counts?

This explores why classic scaling laws — which predict performance from raw parameter count — break down at small model sizes, where *how* you arrange those parameters starts to matter more than how many you have.

The standard scaling-law story (loss falls predictably as you add parameters and data) stops being a good guide once models get small, and the corpus points at a single root cause: classic laws treat parameters as fungible, but at small scale their *arrangement* dominates. The clearest direct evidence is "Does depth matter more than width for tiny language models?", where deep-and-thin networks beat balanced ones by 2.7–4.3% accuracy at the 125M–350M range. Kaplan-style laws predict that, at a fixed parameter count, shape should barely matter. Yet it does, because deep stacks let the model *compose* abstract concepts through layers rather than smear capacity across width. When you only have a few hundred million parameters, that compositional structure is the whole game. On this reading the difference washes out at billions of parameters, which is why the laws look 'true' at large scale and fail at small.
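A minimal sketch of why a parameter-only law cannot register the depth-versus-width result. It assumes the common approximation that a standard transformer has about 12 * n_layers * d_model^2 non-embedding parameters (with a 4x MLP expansion). The power-law constants are illustrative placeholders, and the two shapes are hypothetical, not MobileLLM's actual configurations.

```python
def non_embedding_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameter count of a standard transformer.

    Per layer: about 4 * d_model^2 for the attention projections plus
    8 * d_model^2 for an MLP with 4x expansion (biases and norms ignored).
    """
    return 12 * n_layers * d_model ** 2


def parameter_only_loss(n_params: int, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """One-variable power law L(N) = (n_c / N) ** alpha.

    n_c and alpha are illustrative placeholders (assumptions), not values
    taken from the notes summarized here.
    """
    return (n_c / n_params) ** alpha


# Two hypothetical shapes that spend the same parameter budget differently.
deep_thin = non_embedding_params(n_layers=32, d_model=512)
shallow_wide = non_embedding_params(n_layers=8, d_model=1024)

print(deep_thin, shallow_wide)
print(parameter_only_loss(deep_thin) == parameter_only_loss(shallow_wide))
```

Both shapes receive the same predicted loss because depth and width never enter the formula separately; only the total count does. Any measured gap between them is therefore invisible to a law of this form.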

The deeper issue is that standard scaling laws bake in no architectural variables. "Can architecture choices improve inference efficiency without sacrificing accuracy?" makes this concrete from the other direction: once you *add* hidden size, MLP-to-attention ratio, and grouped-query attention (GQA) configuration to the law, you can predict and optimize architecture, yielding up to 42% greater throughput and up to 2.1% higher accuracy than LLaMA-3.2 under the same training budget. The implication is that the original laws weren't wrong so much as blind; they marginalized away the very knobs that decide which architecture is optimal at a given size.
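To make those knobs concrete, here is a rough sketch of how hidden size, MLP width, and the number of key/value heads move parameters around inside one transformer layer. The `LayerShape` class and its field names are hypothetical, the MLP is assumed to be a plain two-matrix block, and this is per-layer bookkeeping only, not the augmented scaling law the note describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerShape:
    d_model: int      # hidden size
    n_heads: int      # query heads
    n_kv_heads: int   # key/value heads (equal to n_heads without GQA)
    d_ff: int         # MLP hidden width

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads

    def attention_params(self) -> int:
        # Query and output projections are d_model x d_model each.
        q_and_out = 2 * self.d_model * self.d_model
        # Key and value projections shrink with fewer KV heads.
        k_and_v = 2 * self.d_model * self.n_kv_heads * self.head_dim
        return q_and_out + k_and_v

    def mlp_params(self) -> int:
        # Assumed two-matrix MLP (up and down projection), biases ignored.
        return 2 * self.d_model * self.d_ff

    def total_params(self) -> int:
        return self.attention_params() + self.mlp_params()

    def mlp_to_attention_ratio(self) -> float:
        return self.mlp_params() / self.attention_params()


# Same per-layer budget, two different allocations (hypothetical shapes).
mha = LayerShape(d_model=1024, n_heads=16, n_kv_heads=16, d_ff=4096)
gqa = LayerShape(d_model=1024, n_heads=16, n_kv_heads=4, d_ff=4864)

print(mha.total_params(), gqa.total_params())
print(mha.mlp_to_attention_ratio(), gqa.mlp_to_attention_ratio())
```

The two layers have identical parameter counts, yet the GQA variant shifts capacity from key/value projections into the MLP. A law that takes only the total as input cannot distinguish them.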

Small models also escape the assumptions the laws are built on. "Can recurrent hierarchies achieve reasoning that transformers cannot?" shows a 27M-parameter model solving Sudoku and mazes that defeat much larger chain-of-thought systems, by using recurrence to break past the fixed-depth complexity ceiling that constrains ordinary transformers. A scaling law fit to fixed-depth transformers simply has no term for 'effective computational depth from recurrence', so it can't see why a tiny recurrent design outperforms a bigger conventional one. The same lesson shows up in "Can reasoning systems scale wider instead of only deeper?" and in the broader claim of "Has memory architecture replaced parameter count as the scaling frontier?", where returns increasingly come from restructuring memory and computation rather than counting parameters.
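The missing term can be illustrated with simple bookkeeping. The sketch below compares a fixed-depth stack with a generic weight-tied block that is applied several times. All names and numbers are hypothetical, and this is not the Hierarchical Reasoning Model's two-timescale design, only the general idea that reusing weights adds sequential computation without adding parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepthBudget:
    params: int            # distinct trainable parameters
    sequential_steps: int  # layer applications performed in sequence


def fixed_depth_stack(n_layers: int, params_per_layer: int) -> DepthBudget:
    """Each layer has its own weights and runs exactly once."""
    return DepthBudget(n_layers * params_per_layer, n_layers)


def weight_tied_recurrence(
    n_layers: int, params_per_layer: int, n_iterations: int
) -> DepthBudget:
    """The same block of layers is reused n_iterations times."""
    return DepthBudget(n_layers * params_per_layer, n_layers * n_iterations)


# Hypothetical sizes chosen only to make the contrast visible.
plain = fixed_depth_stack(n_layers=4, params_per_layer=1_000_000)
looped = weight_tied_recurrence(n_layers=4, params_per_layer=1_000_000, n_iterations=16)

print(plain)
print(looped)
```

Both designs report the same parameter count, so a parameter-only law assigns them the same predicted loss, even though the recurrent one performs sixteen times as many sequential layer applications.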

There's a subtler boundary worth knowing: scaling laws sometimes *do* work when the task space is well covered. "Can neural networks learn compositional skills without symbolic mechanisms?" finds plain MLPs generalize compositionally through scale alone, no architectural tricks needed, *as long as training covers the combinations*. That's the flip side of the small-parameter failure: at large scale with enough data, architecture stops mattering and the law holds; at small scale with sparse coverage, architecture is the only lever you have, and the law goes silent. Two more notes sharpen the picture. "Can inference compute replace scaling up model size?" shows parameters and inference compute aren't independent axes (so a one-dimensional parameter law was always incomplete), and "Do larger language models solve constrained optimization better?" shows some ceilings don't move with scale *or* architecture at all.

The thing you didn't know you wanted to know: scaling laws don't 'fail' at small parameter counts so much as reveal what they quietly assumed — that architecture is noise. At small scale, architecture is signal, and the laws were never measuring it.

Sources (8 notes)

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Has memory architecture replaced parameter count as the scaling frontier?

Three converging signals in late-2025 research—taxonomy maturation, memory-aware test-time scaling loops, and hybrid sparsity laws—show that returns from restructuring memory now exceed returns from adding parameters. The design bottleneck has shifted from compute to memory structure.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
