Which architectural choices matter most when a model must fit one billion parameters?

This explores what design decisions actually move the needle when you're building a small model — roughly a billion parameters or under — rather than scaling up freely, which is exactly the regime forced by phones and other tight hardware budgets.

The first thing the corpus does is reframe the question: at this size you're usually not choosing to be small, you're forced to be. Smartphone DRAM and battery budgets make sub-billion-parameter models the only sustainable option: a 7B model drains a 50 kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same phone (see "What actually limits language models on mobile phones?"). So the real question becomes: given a fixed, small parameter budget, where do you spend it?
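The battery claim can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: the 50 kJ battery comes from the text, but the energy cost of about 0.1 J per token per billion parameters and the decode rate of 10 tokens per second are assumptions that the notes do not state.

```python
# Assumed figures (NOT stated in the notes): energy per token scales at
# roughly 0.1 J per billion parameters, and chat decoding runs at 10 tokens/s.
JOULES_PER_TOKEN_PER_BILLION = 0.1
TOKENS_PER_SECOND = 10.0
BATTERY_JOULES = 50_000.0  # the 50 kJ battery quoted in the text


def hours_of_generation(params_billion: float) -> float:
    """Hours of continuous decoding before the battery is empty."""
    joules_per_token = JOULES_PER_TOKEN_PER_BILLION * params_billion
    watts = joules_per_token * TOKENS_PER_SECOND  # J/token * token/s = J/s
    return BATTERY_JOULES / watts / 3600.0


for name, size in [("7B", 7.0), ("350M", 0.35)]:
    print(f"{name}: {hours_of_generation(size):.1f} h of continuous generation")
```

Under these assumptions the 7B model lasts just under two hours of nonstop generation and the 350M model lasts well over a day, which matches the shape of the claim: battery life scales inversely with parameter count.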

The sharpest single answer is shape over size. MobileLLM found that at the 125M–350M scale, deep-and-thin networks beat balanced width-vs-depth designs by 2.7–4.3% in accuracy, because stacking more layers lets the model compose abstract concepts through depth rather than spreading the same parameters thinly across width (see "Does depth matter more than width for tiny language models?"). Notably, this contradicts the classic Kaplan scaling laws, which treated depth and width as roughly interchangeable. The lesson: scaling-law intuitions calibrated on huge models don't transfer down, and below a billion parameters how you arrange the parameters matters as much as how many you have.
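To make "same parameters, different shape" concrete, here is a minimal sketch that counts the weights in a plain transformer stack and finds a narrower width that lets a deeper stack fit the same budget. It assumes a standard block with four attention projections and a 4x feed-forward expansion, and it ignores embeddings, biases, and normalization layers. The two shapes (12 layers at width 768, and 30 layers) are hypothetical examples, not MobileLLM's actual configurations.

```python
import math


def block_params(d_model: int, ffn_mult: int = 4) -> int:
    """Weights in one transformer block (assumed standard layout)."""
    attention = 4 * d_model * d_model                # Q, K, V and output projections
    feed_forward = 2 * ffn_mult * d_model * d_model  # up- and down-projection
    return attention + feed_forward


def stack_params(n_layers: int, d_model: int) -> int:
    """Weights in a stack of identical blocks."""
    return n_layers * block_params(d_model)


def width_for_budget(budget: int, n_layers: int) -> int:
    """Largest d_model whose n_layers-deep stack stays within the budget."""
    per_unit_width_sq = block_params(1)  # weights per layer per d_model^2
    return math.isqrt(budget // (n_layers * per_unit_width_sq))


balanced = stack_params(n_layers=12, d_model=768)  # hypothetical balanced shape
deep_width = width_for_budget(balanced, n_layers=30)
deep = stack_params(n_layers=30, d_model=deep_width)

print(f"balanced 12 x 768: {balanced:,} weights")
print(f"deep     30 x {deep_width}: {deep:,} weights")
```

Both shapes spend almost exactly the same number of weights, so any accuracy difference between them comes from arrangement rather than size.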

The more surprising move is to stop treating the parameter count as the whole budget at all. Inference-time compute trades off against parameters: a smaller model given more thinking time at inference can match a larger one, especially on hard prompts (see "Can inference compute replace scaling up model size?"). Architecturally, that means a billion-parameter model is a better bet if you design it to lean on test-time compute rather than raw capacity. Two related patterns push the same way. First, freezing a pretrained backbone and bolting on a small auxiliary module preserves the backbone's knowledge while adding new reasoning ability without retraining the whole thing (see "Can continuous reasoning avoid forgetting in instruction-tuned models?"). Second, splitting a monolith into a separate planner and solver outperforms one undifferentiated model, with the decomposition skill even transferring across domains (see "Does separating planning from execution improve reasoning accuracy?").
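One common way to spend inference-time compute is best-of-N sampling: draw several candidate answers and keep the one a scorer likes best. The sketch below is a toy illustration of that idea, not the specific procedure from the cited note. The generator and scorer are hypothetical stand-ins (a random-number "model" and a distance-to-target "verifier").

```python
import random
from typing import Callable


def best_of_n(generate: Callable[[str], str],
              score: Callable[[str], float],
              prompt: str,
              n: int) -> str:
    """Sample n candidates and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)


def make_toy_generator(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a small model: guesses an integer in [0, 100]."""
    rng = random.Random(seed)
    return lambda prompt: str(rng.randint(0, 100))


def toy_score(answer: str) -> float:
    """Hypothetical stand-in for a verifier: closer to 42 scores higher."""
    return -abs(int(answer) - 42)


prompt = "What is 6 * 7?"
one = best_of_n(make_toy_generator(0), toy_score, prompt, n=1)
many = best_of_n(make_toy_generator(0), toy_score, prompt, n=16)
print("1 sample  :", one, "score", toy_score(one))
print("16 samples:", many, "score", toy_score(many))
```

Sampling n candidates costs roughly n times the decoding compute of a single answer, so the trade is explicit: the model stays the same size and the extra budget is paid at inference time.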

There's also a counterintuitive case for staying small on purpose. For generating diverse outputs (say, synthetic training data), models around 500M parameters produce more unique samples per draw than larger ones, because bigger models concentrate probability mass on their favorite answers and collapse variety (see "Why aren't bigger models better for generating diverse outputs?"). And at the system level, routing queries across several small specialized models can beat a single frontier model: ten 7B models with smart routing surpassed GPT-4.1, suggesting selection is a stronger lever than scale (see "Can routing beat building one better model?").
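The diversity argument can be illustrated with a small calculation. If an output with probability p is drawn independently k times, it shows up at least once with probability 1 - (1 - p)^k, so the expected number of distinct outputs is the sum of that quantity over all outputs. The two distributions below are toy examples invented for illustration, not measurements from any model.

```python
from typing import Sequence


def expected_unique(probs: Sequence[float], draws: int) -> float:
    """Expected number of distinct outputs seen in `draws` independent samples."""
    return sum(1.0 - (1.0 - p) ** draws for p in probs)


# Toy distributions over ten possible outputs (illustrative assumptions).
flat = [0.1] * 10                # mass spread evenly
peaked = [0.91] + [0.01] * 9     # mass concentrated on one favorite answer

for name, dist in [("flat", flat), ("peaked", peaked)]:
    print(f"{name:>6}: {expected_unique(dist, draws=10):.2f} distinct outputs in 10 draws")
```

With the same number of draws, the flat distribution yields several times as many distinct outputs as the peaked one, which is the mechanism the paragraph describes: concentrated probability mass wastes samples on repeats.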

The thread tying these together: once parameters are scarce, the highest-leverage choices stop being "add capacity" and become structural — depth over width, frozen-plus-auxiliary over end-to-end retraining, planner-solver separation over monoliths, and inference compute or routing over a single bigger network. The thing you didn't know you wanted to know: for a model this size, the parameter count is one of the *least* informative numbers about how well it will perform.

Sources 7 notes

What actually limits language models on mobile phones?

Smartphones' DRAM budgets and battery capacity make sub-billion-parameter models the only sustainable option for mobile deployment. A 7B model drains a 50kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same device.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.
