LLM Reasoning and Architecture

What actually limits language models on mobile phones?

Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.

Note · 2026-05-03 · sourced from Mobile

The push to build sub-billion-parameter LLMs is often framed as a quality-cost trade-off, but MobileLLM's framing is sharper: at current device specs the larger models simply cannot run sustainably on phones. Modern smartphones carry 6 to 12 GB of DRAM (the iPhone 15 has 6 GB, the Pixel 8 Pro 12 GB), and any single app should use no more than about 10 percent of it, since memory is shared with the OS and other apps. An 8-bit-quantized LLaMA 7B needs roughly 7 GB for its weights alone, far beyond that 0.6 to 1.2 GB budget. Energy is the second binding constraint: at roughly 0.1 joules per token per billion parameters, a 7B model consumes 0.7 J/token, and a fully charged iPhone with about 50 kJ of energy can sustain it for less than two hours of conversation at 10 tokens per second; every 64 tokens drains 0.2 percent of the battery.
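A back-of-the-envelope sketch makes both budgets concrete. The constants are the figures quoted above (10 percent DRAM share, 0.1 J/token per billion parameters, 50 kJ battery); the helper names and the 12 GB reference device are illustrative choices, not from the source:

```python
# Back-of-the-envelope mobile LLM feasibility check.
# Constants are the figures quoted in the note, not measurements;
# the 12 GB reference device is an illustrative Pixel-8-Pro-class choice.

DRAM_BUDGET_FRACTION = 0.10   # an app should claim at most ~10% of device DRAM
ENERGY_PER_TOKEN_PER_B = 0.1  # joules per token, per billion parameters
BATTERY_J = 50_000            # ~50 kJ in a fully charged iPhone

def weight_bytes(params_b: float, bits: int = 8) -> float:
    """Memory for weights alone, ignoring KV cache and activations."""
    return params_b * 1e9 * bits / 8

def battery_hours(params_b: float, tokens_per_s: float = 10.0) -> float:
    """Hours of continuous decoding before the battery is drained."""
    watts = ENERGY_PER_TOKEN_PER_B * params_b * tokens_per_s
    return BATTERY_J / watts / 3600

for params_b in (7.0, 0.35, 0.125):
    fits = weight_bytes(params_b) <= DRAM_BUDGET_FRACTION * 12e9
    print(f"{params_b:5.3f}B params: {weight_bytes(params_b) / 1e9:.2f} GB weights, "
          f"fits budget: {fits}, battery life: {battery_hours(params_b):.2f} h")
```

Running it reproduces the headline numbers: the 7B model blows the memory budget even on a 12 GB phone and drains the battery in under two hours, while the 350M model fits comfortably and decodes for well over a day.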

These numbers reframe sub-billion LLMs as the only practical regime for mobile deployment rather than as a compromise. A 350M 8-bit model at 0.035 J/token can support conversational use for a full day on the same battery, and a 125M model can run at 50 tokens per second on-device, versus 3 to 6 tokens per second for LLaMA 7B running through MLC Chat. The decoding-speed advantage compounds the energy advantage: faster generation means less time the system spends in its high-power inference state.
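A tiny illustration of that compounding, assuming a fixed 1 W platform draw (SoC, screen, radios) for as long as decoding runs; the overhead figure and the 5 tokens-per-second midpoint for the 7B model are hypothetical, and only the per-token energies follow from the 0.1 J/token-per-billion rule above:

```python
# Why decode speed compounds the energy win: total energy for a response
# includes a fixed platform draw for the whole decode duration.
# OVERHEAD_W = 1.0 is a hypothetical figure for illustration only.
OVERHEAD_W = 1.0

def response_energy_j(j_per_token: float, tokens_per_s: float,
                      n_tokens: int = 1000) -> float:
    """Energy for one n_tokens response: per-token cost plus platform overhead."""
    decode_s = n_tokens / tokens_per_s
    return n_tokens * j_per_token + OVERHEAD_W * decode_s

print("LLaMA 7B via MLC Chat:", response_energy_j(0.7, 5.0), "J")     # 900.0
print("125M on-device       :", response_energy_j(0.0125, 50.0), "J")  # 32.5
```

The slow model pays twice: more joules per token, and an extra 200 seconds of overhead per thousand-token response.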

The macro-scale argument is also striking: deploying GPT-4-class models for the daily AI usage of every individual would require around 100 million H100 GPUs at 60 TFLOPS each, equivalent to roughly 160 Meta-scale companies. Mobile inference is not just a UX preference; it is the energy-feasible path to ubiquitous LLM use. The constraint flips the design question: instead of asking "how big can we make this model," the right question becomes the one that both "Does depth matter more than width for tiny language models?" and "Does recomputing weights cost less than moving them on mobile?" answer: how should a model under one billion parameters be architected for the regime it must run in?
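A quick arithmetic check of the macro claim, using only the figures quoted above; the reading of a "Meta-scale company" as roughly 625,000 GPUs simply falls out of the division and is not an independent source:

```python
# Sanity check on the macro-scale claim using only the note's figures.
H100_COUNT = 100_000_000
TFLOPS_PER_GPU = 60  # effective per-GPU throughput assumed in the note

aggregate_flops = H100_COUNT * TFLOPS_PER_GPU * 1e12
print(f"aggregate compute: {aggregate_flops:.1e} FLOP/s")       # 6.0e+21
print(f"GPUs per Meta-scale company: {H100_COUNT / 160:,.0f}")  # 625,000
```

Six zettaFLOPS of sustained server-side compute is the scale ubiquitous cloud inference would demand, which is the point the note uses to motivate on-device models.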


Source: Mobile
