What actually limits language models on mobile phones?
Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.
The push to build sub-billion-parameter LLMs is often framed as a quality-cost trade-off, but MobileLLM's framing is sharper: at current device specs the larger models simply cannot run sustainably on phones. Modern smartphones carry 6 to 12 GB of DRAM (the iPhone 15 has 6 GB, the Pixel 8 Pro has 12 GB), and any single app should use no more than about 10 percent of it because memory is shared with the OS and other apps. An 8-bit-quantized LLaMA 7B needs roughly 7 GB for weights alone, far beyond that budget even on a 12 GB phone. Energy is the second binding constraint: at roughly 0.1 joules per token per billion parameters, a 7B model consumes 0.7 J/token, so a fully charged iPhone with about 50 kJ of energy can sustain that model for less than two hours of conversation at 10 tokens per second, with every 64 tokens draining 0.2 percent of the battery.
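A minimal back-of-envelope sketch of these two constraints, assuming the figures quoted above (roughly 0.1 J per token per billion parameters, a roughly 50 kJ battery, 10 tokens per second of conversational decoding, and the 10 percent-of-DRAM rule of thumb); the helper names are illustrative, not from the paper:

```python
# Back-of-envelope check of the mobile constraints described above.
# Assumed figures: ~0.1 J/token per billion parameters, ~50 kJ battery,
# 10 tokens/s decoding, and an app budget of ~10% of DRAM.

def energy_per_token(params_billion: float, j_per_token_per_b: float = 0.1) -> float:
    """Joules consumed to decode one token."""
    return j_per_token_per_b * params_billion

def sustained_hours(params_billion: float, battery_j: float = 50_000,
                    tokens_per_s: float = 10) -> float:
    """Hours of continuous decoding before the battery is empty."""
    watts = energy_per_token(params_billion) * tokens_per_s
    return battery_j / watts / 3600

def fits_dram_budget(params_billion: float, bits: int = 8,
                     dram_gb: float = 12, app_share: float = 0.10) -> bool:
    """Do the quantized weights fit in the app's share of DRAM?"""
    weight_gb = params_billion * bits / 8  # 1 byte per parameter at 8-bit
    return weight_gb <= dram_gb * app_share

print(sustained_hours(7.0))     # ~1.98 hours for a 7B model
print(fits_dram_budget(7.0))    # False: ~7 GB of weights vs a ~1.2 GB budget
print(sustained_hours(0.35))    # ~39.7 hours for a 350M model
```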
These numbers reframe sub-billion LLMs as the only practical regime for mobile deployment rather than as a compromise. A 350M 8-bit model at 0.035 J/token can support conversational use for a full day on the same battery, and a 125M model can decode at 50 tokens per second on-device, versus 3 to 6 tokens per second for LLaMA 7B running through MLC Chat. The decoding speed advantage compounds the energy advantage: faster generation means less time the system spends in its high-power inference state.
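To see how the two advantages compound, here is a rough per-response comparison under stated assumptions: a 100-token reply, 5 tokens per second as the midpoint of the reported 3 to 6 tokens per second for LLaMA 7B via MLC Chat, and the same 0.1 J/token per billion parameters rule of thumb; real numbers will vary by device and runtime.

```python
# Rough per-response comparison: a 100-token reply from a 125M model at
# 50 tokens/s versus a 7B model at 5 tokens/s. Energy uses the same
# ~0.1 J/token per billion parameters assumption as above.
REPLY_TOKENS = 100

def per_response(params_billion: float, tokens_per_s: float):
    seconds = REPLY_TOKENS / tokens_per_s
    joules = REPLY_TOKENS * 0.1 * params_billion
    return seconds, joules

print(per_response(0.125, 50))  # (2.0 s, 1.25 J)
print(per_response(7.0, 5))     # (20.0 s, 70.0 J)
# The small model answers ~10x sooner and uses ~56x less energy per reply,
# so the device also spends far less time in its high-power inference state.
```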
The macro-scale argument is also striking: serving GPT-4-class models for the daily AI usage of every individual would require around 100 million H100 GPUs at 60 TFLOPS each, equivalent to roughly 160 Meta-scale companies. Mobile inference is not just a UX preference; it is the energy-feasible path to ubiquitous LLM use. The constraint flips the design question: instead of asking "how big can we make this model," the operative question becomes the one that "Does depth matter more than width for tiny language models?" and "Does recomputing weights cost less than moving them on mobile?" both answer: how should a model under one billion parameters be architected for the regime it must run in?
Source: MobileLLM
Related concepts in this collection
- Does depth matter more than width for tiny language models?
  Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.
  extends: this note establishes WHY sub-billion is the operative regime; depth-vs-width answers HOW to architect within it
- Does recomputing weights cost less than moving them on mobile?
  Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.
  extends: weight sharing is the design move that addresses the DRAM bandwidth constraint named here
- Can small language models handle most agent tasks?
  Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
  extends: complementary economic argument from the agent side; even where the device has compute headroom, SLMs are economically and architecturally preferable for most subtasks
- Can architecture choices improve inference efficiency without sacrificing accuracy?
  Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
  extends: gives the formal frame for inference-cost-aware scaling; the DRAM and battery facts here are precisely the variables that conditional scaling laws should incorporate
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
  extends: opens an escape hatch for the constraint; small models can recover capability at inference via test-time compute, partially neutralizing the parameter ceiling
Original note title
sub-billion parameter LLMs are forced by mobile DRAM and battery constraints not by quality preference — a 7B model drains a phone in under two hours