Does recomputing weights cost less than moving them on mobile?
Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.
On mobile hardware, the latency bottleneck for transformer inference is often not arithmetic but memory movement: fetching weights from DRAM into the compute units takes longer than the computation they feed. MobileLLM exploits this asymmetry with immediate block-wise weight sharing: where a deeper model would store two adjacent transformer blocks with separate weights, it stores one block's weights and applies that block twice in sequence. The total weight footprint stays that of the shallower model, the effective depth doubles, and because the same weights serve two consecutive applications, the second weight fetch is avoided.
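A minimal sketch of the sharing pattern in PyTorch, assuming a generic pre-norm transformer block; this illustrates the idea, not the MobileLLM implementation itself:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """A generic pre-norm transformer block (stand-in for the real thing)."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

class SharedPair(nn.Module):
    """One stored block executed twice in a row: two layers of effective
    depth for one block's weight footprint and one DRAM weight fetch."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.block = Block(dim, n_heads)

    def forward(self, x):
        x = self.block(x)      # first application
        return self.block(x)   # immediate reuse: same weights, new input

# Four shared pairs give 8 block applications with only 4 blocks' worth of weights.
model = nn.Sequential(*[SharedPair(dim=512, n_heads=8) for _ in range(4)])
x = torch.randn(1, 16, 512)   # (batch, sequence, hidden)
y = model(x)
```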
The latency overhead is minimal: relative to the unshared model, the only added cost is compute, the resource a memory-bound device has to spare, while each shared pair needs just one trip through DRAM for its weights. Crucially, the approach produces accuracy gains with no increase in model size. The shared block contributes representational capacity comparable to two distinct blocks because the second application operates on the output of the first, producing a functionally different transformation even with identical parameters. This differs from cross-layer sharing schemes that tie weights between non-adjacent layers and give up more capacity.
The general principle is hardware-shaped architecture design. On compute-bound systems the optimization target is FLOP efficiency; on memory-bound systems it is memory-movement efficiency, and the two can favor opposite architectural choices. Block-wise weight sharing makes sense on phones precisely because it spends compute, the resource that is abundant relative to memory bandwidth on mobile silicon, to avoid moving weights. The same model on a different hardware target might benefit from the opposite trade. "Can architecture choices improve inference efficiency without sacrificing accuracy?" formalizes this regime-dependence: inference-cost-aware scaling laws make architectural choices like weight sharing first-class variables.
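A back-of-envelope comparison makes the asymmetry concrete. The numbers below are illustrative assumptions (a hypothetical phone SoC and block size), not measurements from the paper:

```python
# Assumed, illustrative numbers: ~2 TFLOP/s sustained compute, ~50 GB/s DRAM
# bandwidth, a 20M-parameter fp16 transformer block, single-token decode.
flops_per_s = 2e12        # assumed sustained on-device compute
bytes_per_s = 50e9        # assumed sustained DRAM bandwidth
params = 20e6             # parameters in one block
bytes_per_param = 2       # fp16

# Decoding one token touches each weight once: roughly 2 FLOPs per parameter.
recompute_time = 2 * params / flops_per_s               # extra compute for reuse
refetch_time = params * bytes_per_param / bytes_per_s   # second weight fetch

print(f"recompute the block:  {recompute_time * 1e6:.0f} us")   # ~20 us
print(f"re-fetch its weights: {refetch_time * 1e6:.0f} us")     # ~800 us
```

Under these assumed numbers the second weight fetch costs roughly 40x more than the extra compute, which is why recomputation wins here; on a compute-bound accelerator with far more bandwidth per FLOP, the comparison can flip.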
Source: MobileLLM
Related concepts in this collection
- Does depth matter more than width for tiny language models?
  Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.
  extends: same MobileLLM paper; depth-favoring architecture and weight sharing are complementary moves: sharing lets the deep-and-thin model be even deeper at the same parameter budget
- What actually limits language models on mobile phones?
  Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.
  supports: weight sharing addresses precisely the DRAM bandwidth bottleneck that motivates the sub-billion regime; the constraint named there is the constraint exploited here
- Can architecture choices improve inference efficiency without sacrificing accuracy?
  Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
  extends: the regime-dependence of architecture choice is exactly what conditional scaling laws formalize; weight sharing is the kind of architectural variable they incorporate
Original note title
immediate block-wise weight sharing exploits memory-movement bottlenecks on device — recomputing a block twice costs less than moving its weights twice