MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Paper · arXiv 2402.14905 · Published February 22, 2024

This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight sharing approach with no increase in model size and only marginal latency overhead.

Consider a scenario characterized by widespread human reliance on LLMs in both front-end conversational interfaces and back-end operations such as recommendation systems, equating to ∼5% of individuals’ daily time. In this hypothetical scenario, serving GPT-4 at a processing rate of 50 tokens/s would entail deploying around one hundred million H100 GPUs, each capable of 60 TFLOPs/s. This computation scale, excluding communication and data transmission overhead, is on par with 160 Meta-scale companies. The ensuing energy consumption and carbon dioxide emissions would present staggering environmental challenges.

Furthermore, considerations of portability and computational cost make it necessary to deploy LLMs on smartphones and mobile devices. In the current landscape of mobile technology, integrating an LLM such as LLaMA v2 7B (Touvron et al., 2023b) with 8-bit weights is prohibitively expensive due to limitations in main-memory (DRAM) capacity. A prevalent memory hierarchy in mobile devices is depicted in Figure 2. With DRAM capacities ranging from 6 GB on the iPhone 15 to 12 GB on the Google Pixel 8 Pro (Hristov, 2022; Google, 2023), a mobile app should not exceed 10% of the DRAM, since DRAM is shared with the operating system and other applications (Malladi et al., 2012). This motivates deploying sub-billion-parameter LLMs. Additionally, factoring in LLM energy consumption (0.1 J/token per billion model parameters (Han et al., 2016; Malladi et al., 2012)), a 7B-parameter LLM consumes 0.7 J/token. A fully charged iPhone, with approximately 50 kJ of energy, can sustain this model in conversation for less than 2 hours at a rate of 10 tokens/s, with every 64 tokens draining 0.2% of the battery.

By utilizing a sub-billion-parameter model, such as a 350M 8-bit model consuming only 0.035 J/token, an iPhone can support conversational use for an entire day. Moreover, the decoding speed can be significantly enhanced, as exemplified by the benchmark results of our 125M model, which is capable of operating at 50 tokens/s, compared with the state-of-the-art iPhone app MLC Chat running the LLaMA 7B model at 3∼6 tokens/s.
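
The back-of-envelope arithmetic above can be reproduced directly. The sketch below only reuses the figures quoted in the text (0.1 J/token per billion parameters, a ∼50 kJ phone battery, a 10 tokens/s conversation rate) and treats them as rough estimates rather than measurements; the function name is ours.

```python
# Rough battery-life estimate for on-device LLM decoding, using the
# per-token energy model quoted above (0.1 J/token per billion parameters).

def battery_hours(params_billions: float,
                  battery_joules: float = 50_000.0,   # ~fully charged iPhone
                  tokens_per_second: float = 10.0) -> float:
    """Hours of continuous decoding before the battery is drained."""
    joules_per_token = 0.1 * params_billions          # 0.1 J/token per 1B params
    watts = joules_per_token * tokens_per_second      # power draw while decoding
    return battery_joules / watts / 3600.0

# LLaMA v2 7B at 0.7 J/token: just under 2 hours of continuous conversation.
print(f"7B   model: {battery_hours(7.0):.1f} h")
# A 350M model at 0.035 J/token: ~40 hours of continuous decoding, i.e.,
# comfortably a full day of intermittent conversational use.
print(f"350M model: {battery_hours(0.35):.1f} h")
```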

We make the following contributions to building the most accurate LLMs to date with fewer than 1 billion parameters.

• Contrary to the scaling law (Kaplan et al., 2020), we demonstrate that depth is more important than width for small LLMs. A deep-and-thin model structure excels in capturing abstract concepts, resulting in superior final performance (see the parameter-count sketch after this list).

• We revisit embedding sharing methods (Zhang et al., 2022) and implement grouped-query attention (Ainslie et al., 2023) in small LLMs to maximize weight utilization (see the embedding-sharing and grouped-query attention sketch after this list).

• We propose immediate block-wise weight sharing. In scenarios where memory movement is the latency bottleneck, sharing weights between two adjacent blocks avoids moving the weights a second time: the block is simply computed twice, incurring only minimal latency overhead (see the weight-sharing sketch at the end of this list).
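
To make the depth-versus-width trade-off in the first bullet concrete, the parameter-count sketch below counts non-embedding parameters of a vanilla transformer block (multi-head attention plus a 4x feed-forward layer, ignoring norms, biases, and GQA). The two shapes are illustrative placeholders, not the MobileLLM configurations: they only show that a much deeper, thinner network can match the parameter budget of a conventional 12-layer design.

```python
# Non-embedding parameter count for a vanilla transformer decoder:
# attention (Q, K, V, output projections) ~ 4*d^2, 4x-expanded FFN ~ 8*d^2,
# so roughly 12*d^2 per block.

def transformer_params(num_layers: int, dim: int) -> int:
    return num_layers * 12 * dim * dim

wide_shallow = transformer_params(num_layers=12, dim=768)  # GPT-2/OPT-125M-like shape
deep_thin    = transformer_params(num_layers=30, dim=480)  # illustrative deep-thin shape

print(f"wide & shallow (12 x 768): {wide_shallow / 1e6:.1f} M params")
print(f"deep & thin    (30 x 480): {deep_thin / 1e6:.1f} M params")
# Both land near ~83-85 M non-embedding parameters, yet the finding is that
# the deeper, thinner shape performs better at this scale.
```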
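
The weight-utilization techniques in the second bullet can be sketched in a few lines of PyTorch. The embedding-sharing and grouped-query attention sketch below ties the input embedding and output classifier weights, and implements grouped-query attention by giving the key/value projections fewer heads and repeating them to match the query heads. Dimensions, head counts, and class names here are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Minimal grouped-query attention: n_kv_heads < n_heads, and each KV head
    is shared across a group of query heads."""
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so a group of query heads shares it.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

class TinyDecoder(nn.Module):
    """One-block toy decoder illustrating input/output embedding sharing."""
    def __init__(self, vocab: int = 32000, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.attn = GroupedQueryAttention(dim, n_heads=8, n_kv_heads=2)
        self.lm_head = nn.Linear(dim, vocab, bias=False)
        self.lm_head.weight = self.embed.weight   # embedding sharing: one weight matrix

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(ids)
        h = h + self.attn(h)        # single residual attention block
        return self.lm_head(h)      # logits over the shared vocabulary
```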
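
The third bullet's immediate block-wise weight sharing can be expressed as a forward pass in which each stored transformer block is executed twice in a row, so the effective depth doubles while the stored weights (and hence the memory traffic) stay those of the smaller model. The weight-sharing sketch below uses a stand-in block; only the repeated execution in the loop is the point.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in transformer block (placeholder for attention + FFN)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ffn(self.norm(x))

class ImmediateBlockwiseSharing(nn.Module):
    """Each physical block is applied twice back-to-back: 2x effective depth,
    no extra weights, and each block's weights are read from DRAM only once."""
    def __init__(self, num_blocks: int, dim: int, repeats: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(num_blocks)])
        self.repeats = repeats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            for _ in range(self.repeats):   # reuse the weights while still resident in cache
                x = block(x)
        return x

model = ImmediateBlockwiseSharing(num_blocks=4, dim=64)
print(model(torch.randn(1, 8, 64)).shape)   # torch.Size([1, 8, 64])
```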