What constraints force mobile deployments to operate in the sub-billion parameter regime?

This explores the hardware realities — not quality tradeoffs — that cap on-device models at under a billion parameters, and the design tricks that make that cap workable.

Phones force language models below a billion parameters for reasons that are physical rather than a matter of taste. The corpus is direct: smartphones run into two walls, DRAM budgets and battery capacity, long before they run into a quality ceiling (What actually limits language models on mobile phones?). The energy math is the vivid part. A 7B model drains a 50 kJ battery in under two hours, while a 350M model can hold a conversation for a full day on the same charge. So the sub-billion regime isn't a compromise model-makers settled for; it's the only size that survives contact with a battery you carry around.
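The battery arithmetic can be sketched in a few lines. The 50 kJ battery and the two model sizes come from the text; the energy cost per token and the decoding rate are assumptions chosen for illustration, and real figures depend on hardware and quantization.

```python
BATTERY_JOULES = 50_000          # 50 kJ battery, from the text
JOULES_PER_TOKEN_PER_B = 0.1     # ASSUMPTION: energy per token per billion parameters
TOKENS_PER_SECOND = 10           # ASSUMPTION: steady decoding rate


def hours_of_decoding(params_billions: float) -> float:
    """Hours of continuous generation before the battery is empty."""
    joules_per_token = JOULES_PER_TOKEN_PER_B * params_billions
    watts = joules_per_token * TOKENS_PER_SECOND  # J/token * token/s = J/s
    return BATTERY_JOULES / watts / 3600


for name, size in [("7B", 7.0), ("350M", 0.35)]:
    print(f"{name}: {hours_of_decoding(size):.1f} h of continuous decoding")
```

Under these assumptions the 7B model lasts just under two hours of continuous decoding, while the 350M model lasts well over a day. Because energy per token is assumed to scale linearly with parameter count, battery life scales inversely with model size.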

What's interesting is that the binding constraint is memory *movement*, not memory *size*. Mobile hardware is memory-bound, meaning the costly part is fetching weights from memory, not the arithmetic itself. That inverts an intuition: MobileLLM shows it's actually cheaper in latency to run one transformer block twice than to fetch a separate set of weights for a second block, so sharing weights between adjacent blocks gains accuracy with no extra parameters (Does recomputing weights cost less than moving them on mobile?). Once you see that the bottleneck is data shuttling rather than capacity, the whole design space reshapes around minimizing movement.
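A toy latency model makes the trade concrete. This is a sketch under stated assumptions, not MobileLLM's implementation: `CostModel`, its two fields, and the millisecond values are hypothetical, and the model assumes a block's weights must be fetched again whenever a different block runs.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    fetch_ms: float    # HYPOTHETICAL: time to bring one block's weights from DRAM
    compute_ms: float  # HYPOTHETICAL: time to run one block once weights are loaded


def latency_unshared(n_blocks: int, cost: CostModel) -> float:
    """Every block has its own weights: one fetch and one compute per block."""
    return n_blocks * (cost.fetch_ms + cost.compute_ms)


def latency_shared(n_blocks: int, cost: CostModel) -> float:
    """Adjacent pairs share weights: fetch once, then compute twice in a row."""
    pairs, leftover = divmod(n_blocks, 2)
    paired = pairs * (cost.fetch_ms + 2 * cost.compute_ms)
    return paired + leftover * (cost.fetch_ms + cost.compute_ms)


def unique_blocks(n_blocks: int, shared: bool) -> int:
    """Number of distinct weight sets, a proxy for parameter count."""
    return (n_blocks + 1) // 2 if shared else n_blocks


cost = CostModel(fetch_ms=4.0, compute_ms=1.0)  # memory-bound: fetch dominates
depth = 30
print("unshared:", latency_unshared(depth, cost), "ms,", unique_blocks(depth, False), "weight sets")
print("shared:  ", latency_shared(depth, cost), "ms,", unique_blocks(depth, True), "weight sets")
```

At equal depth, the shared stack skips half the fetches and holds half the weight sets. Read the other way, a 15-weight-set model can run at depth 30 for only the extra compute time, which is small whenever `fetch_ms` dominates `compute_ms`.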

The same memory-vs-compute reframing shows up elsewhere in the corpus, which suggests it's a general principle rather than a mobile quirk. The long-context bottleneck, for instance, turns out to be compute, specifically the work of consolidating context into internal state, rather than raw storage (Is long-context bottleneck really about memory or compute?). Naming the real bottleneck (movement, consolidation) instead of the apparent one (size) is what unlocks the clever workarounds in both cases.

Here's what you might not expect to find: being small isn't much of a loss for the work phones actually do. Small models handle the repetitive, well-defined language tasks that make up most agent work at 10–30× lower cost (Can small language models handle most agent tasks?), and when a hard prompt does come along, you can spend extra inference-time compute to let a small model match a larger one rather than shipping the larger model (Can inference compute replace scaling up model size?). Routing makes this a system-level decision rather than a single-device one: predict query difficulty up front and escalate to a big model only when needed, cutting cost by 40–50% (Can routers select the right model before generation happens?).
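The routing idea can be sketched as a single decision made before any tokens are generated. Everything here is hypothetical: `predict_difficulty` is a keyword-and-length heuristic standing in for a learned predictor, and it is not the method used by RouteLLM or Hybrid-LLM; `small` and `large` are placeholder callables for an on-device model and a remote one.

```python
from typing import Callable, Tuple

Model = Callable[[str], str]


def predict_difficulty(query: str) -> float:
    """HYPOTHETICAL stand-in for a learned difficulty predictor; returns 0..1."""
    hard_markers = ("prove", "derive", "step by step", "why")
    hits = sum(marker in query.lower() for marker in hard_markers)
    length_score = min(len(query.split()) / 50, 1.0)
    return min(1.0, 0.3 * hits + 0.5 * length_score)


def route(query: str, small: Model, large: Model, threshold: float = 0.5) -> Tuple[str, str]:
    """Pick one model before generation starts, then call only that model."""
    if predict_difficulty(query) >= threshold:
        return "large", large(query)
    return "small", small(query)


def on_device(query: str) -> str:   # placeholder for the sub-billion model
    return f"[small] {query}"


def remote(query: str) -> str:      # placeholder for the escalation target
    return f"[large] {query}"


for q in ["Set a timer for ten minutes",
          "Prove step by step why the sum of two odd numbers is even"]:
    print(route(q, on_device, remote)[0], "<-", q)
```

Because the choice is made from the query alone, only one model ever runs per request. A cascade, by contrast, generates with the small model first and then evaluates the response before deciding whether to escalate. The threshold is the knob that trades cost against quality.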

Put together, the corpus tells a quietly subversive story: the sub-billion ceiling is dictated by DRAM and joules, the real enemy is moving weights rather than storing them, and the architecture that wins is heterogeneous — a small model living on the phone that knows when to phone home. The constraint isn't something to apologize for; it's the thing that forces the better design.

Sources 6 notes

What actually limits language models on mobile phones?

Smartphones' DRAM budgets and battery capacity make sub-billion-parameter models the only sustainable option for mobile deployment. A 7B model drains a 50 kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same device.

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks, so that one block is computed twice, incurs less latency than fetching separate weights for each block and gains accuracy with no parameter increase.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can routers select the right model before generation happens?

RouteLLM and Hybrid-LLM both achieve 40–50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.
