LLM Reasoning and Architecture

Does recomputing weights cost less than moving them on mobile?

Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.

Note · 2026-05-03 · sourced from Mobile

On mobile hardware, the latency bottleneck for transformer inference is often not arithmetic but memory movement — fetching weights from DRAM into compute is slower than the compute itself. MobileLLM exploits this asymmetry with immediate block-wise weight sharing: rather than storing two adjacent transformer blocks with separate weights, it stores one block's weights and computes the block twice in sequence. The total weight footprint stays the same, but the same weights are reused for two consecutive forward passes, avoiding the second weight fetch.
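The execution pattern can be sketched in a few lines. This is a toy stand-in, not MobileLLM's code: `Block` here is a 1-D affine map, and `forward_shared` with its `repeats` parameter is a hypothetical name for the "apply each stored block twice in sequence" schedule.

```python
class Block:
    """Stand-in for a transformer block that owns one set of weights."""
    def __init__(self, weight):
        self.weight = weight  # fetching this from DRAM is the costly step on device

    def __call__(self, x):
        # placeholder for attention + MLP; here just an affine map
        return self.weight * x + 1.0

def forward_shared(blocks, x, repeats=2):
    """Apply each stored block `repeats` times in immediate succession.

    The input sees a depth of len(blocks) * repeats, but the weight
    footprint -- and the number of distinct weight fetches -- is that
    of len(blocks) blocks, since the second application reuses weights
    already resident in fast memory.
    """
    for block in blocks:
        for _ in range(repeats):
            x = block(x)  # no second DRAM fetch for the repeat
    return x

blocks = [Block(2.0), Block(0.5)]   # two stored blocks, four applied layers
y = forward_shared(blocks, 1.0)
```

Note that the second application of each block operates on the output of the first, so the composed function differs from a single application even though the parameters are identical.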

The latency overhead is small because on memory-bound hardware the extra compute largely overlaps with time that would otherwise be spent waiting on weight fetches, while the savings in weight traffic are concrete. Crucially, this approach produces accuracy gains with no increase in model size: the shared block contributes representational capacity comparable to two distinct blocks, because the second application operates on the output of the first and therefore realizes a functionally different transformation even with shared parameters. This differs from across-layer sharing schemes that tie weights between non-adjacent layers and lose more capacity.
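A tiny illustration of the capacity argument, under an obvious simplification (a 1-D "block" made of an affine map and a ReLU-style nonlinearity; the function `block` is hypothetical, not from the source): applying the same block twice realizes a function that one application of those weights cannot.

```python
def block(x, w=0.5, b=1.0):
    """Toy 1-D 'block': affine map followed by a ReLU-style nonlinearity."""
    return max(0.0, w * x + b)

single = block(-1.0)               # one application of the weights
shared_twice = block(block(-1.0))  # same weights, applied to their own output
```

Because the nonlinearity sits between the two applications, `shared_twice` is not reproducible by any single application of the same `(w, b)`, which is why the shared pair behaves more like two layers than one.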

The general principle is hardware-shaped architecture design. On compute-bound systems the optimization target is FLOP efficiency; on memory-bound systems it is memory-movement efficiency, and the two can favor opposite architectural choices. Block-wise weight sharing makes sense on phones precisely because it trades compute, the resource that is abundant on mobile silicon, for memory bandwidth, the resource that is scarce. The same model on a different hardware target might benefit from the opposite trade. The note "Can architecture choices improve inference efficiency without sacrificing accuracy?" formalizes this regime-dependence: inference-cost-aware scaling laws make architectural choices like weight sharing first-class variables.
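A back-of-envelope check makes the regime concrete. The numbers below are assumptions chosen to loosely resemble a phone SoC (2 TFLOP/s usable compute, 50 GB/s DRAM bandwidth, fp16 weights), not measurements; `block_times` is a hypothetical helper.

```python
PEAK_FLOPS = 2e12   # usable compute, FLOP/s (assumed, order of magnitude)
DRAM_BW    = 50e9   # DRAM bandwidth, bytes/s (assumed, order of magnitude)

def block_times(params, bytes_per_param=2, batch_tokens=1):
    """Rough per-block costs for single-token decode.

    Returns (weight-fetch time, compute time) in seconds, using the
    standard ~2 FLOPs per weight per token for a matmul-dominated block.
    """
    fetch_s   = params * bytes_per_param / DRAM_BW
    compute_s = 2 * params * batch_tokens / PEAK_FLOPS
    return fetch_s, compute_s

fetch, compute = block_times(params=10e6)  # a 10M-parameter block
# Under these assumptions fetch >> compute at batch size 1: reapplying
# resident weights costs ~compute, while reloading them costs ~fetch.
```

With these (assumed) numbers the weight fetch is tens of times slower than the arithmetic, which is the asymmetry the design exploits; on a compute-bound accelerator with high bandwidth the inequality can flip, and with it the right architecture.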

