Language Modeling is Compression

Paper · arXiv 2309.10668 · Published September 19, 2023

Information theory and machine learning are inextricably linked and have even been referred to as “two sides of the same coin” (MacKay, 2003). One particularly elegant connection is the essential equivalence between probabilistic models of data and lossless compression. The source coding theorem (Shannon, 1948) is the fundamental theorem describing this idea, i.e., the expected message length in bits of an optimal entropy encoder is equal to the negative log2-likelihood of the data under the statistical model. In other words, maximizing the log2-likelihood (of the data) is equivalent to minimizing the number of bits required per message. Indeed, lossless compression with a probabilistic model can be achieved in a variety of different ways, including Huffman coding (Huffman, 1952), arithmetic coding (Pasco, 1977; Rissanen, 1976), and asymmetric numeral systems (Duda, 2009).
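To make the correspondence concrete, here is a minimal Python sketch (a toy example of ours, not from the paper) that computes the ideal code length of a string under an empirical symbol model; an arithmetic coder driven by the same probabilities would come within roughly two bits of this total.

```python
import math
from collections import Counter

def ideal_code_length(sequence, probs):
    """Shannon's bound: an optimal entropy coder spends -log2 p(x) bits
    per symbol, so the total equals the negative log2-likelihood."""
    return sum(-math.log2(probs[symbol]) for symbol in sequence)

data = "abracadabra"
# Empirical (maximum-likelihood) model over the observed symbols.
model = {s: c / len(data) for s, c in Counter(data).items()}

bits = ideal_code_length(data, model)
print(f"{bits:.2f} bits total, {bits / len(data):.3f} bits/symbol")
# A plain 8-bit encoding of the same 11 characters would take 88 bits.
```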

This Work We advocate for using (lossless) compression to study foundation models. To that end, we conduct an extensive empirical investigation of the offline (in-context) compression capabilities of large language models, with the rationale that they have recently become readily available (Hoffmann et al., 2022; Touvron et al., 2023) and can thus be used for compression without the training overhead. We empirically demonstrate that these models, while (meta-)trained primarily on text, also achieve state-of-the-art compression rates across different data modalities, using their context to condition a general-purpose compressor to excel at a particular task. Moreover, we shed new light on scaling laws (Kaplan et al., 2020), showing that they also hold true for compression but that measuring the compression rates instead of the log loss adds a twist: Scaling beyond a certain point will deteriorate the compression performance since the model parameters need to be accounted for in the compressed output. Finally, we advocate for framing (self-supervised) prediction through the lens of compression as it encompasses generalization: a model that compresses well generalizes well (Hutter, 2006).
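To illustrate the “twist”, the following hypothetical sketch contrasts the raw compression rate with an adjusted rate that also charges for the model's parameters; the function name, the two-bytes-per-parameter assumption, and the toy numbers are ours, not the paper's exact accounting.

```python
def compression_rates(compressed_bytes, raw_bytes, num_params, bytes_per_param=2):
    """Raw rate ignores the model; the adjusted rate also counts the bytes
    needed to store its parameters (assumed float16 here, i.e. 2 bytes each)."""
    raw_rate = compressed_bytes / raw_bytes
    adjusted_rate = (compressed_bytes + num_params * bytes_per_param) / raw_bytes
    return raw_rate, adjusted_rate

# Toy numbers: a 1B-parameter model squeezing 1 GB of data down to 300 MB.
raw, adjusted = compression_rates(300e6, 1e9, 1e9)
print(f"raw: {raw:.2f}, adjusted: {adjusted:.2f}")  # raw: 0.30, adjusted: 2.30
# The adjusted rate exceeds 1: on this dataset the large model "compresses"
# worse than no compression at all, which is why scaling only pays off once
# the data source is large enough to amortize the parameter count.
```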

Foundation Models Are General-Purpose Compressors A lossless compressor induces an injective function over bit sequences, meaning that we cannot compress all sequences equally well (by the pigeonhole principle). Consequently, in practice, compressors are often tailored to a particular setting, e.g., FLAC for audio or PNG for images, and thus fail to compress other data modalities well (see Table 1). In contrast, general-purpose compressors, such as gzip, offer good performance on a wide range of data sources. Surprisingly, Chinchilla models, while trained primarily on text, also appear to be general-purpose compressors, as they outperform all other compressors, even on image and audio data (see Table 1). Note that Chinchilla models have not been trained on this kind of data according to Appendix A of Hoffmann et al. (2022), which states that the training dataset consists of a mix of internet text data (Wikipedia, websites, GitHub) and books. However, it is still possible (but unlikely) that some images or audio samples were encoded into text on some websites. Thus, Chinchilla models achieve their impressive compression performance by conditioning a (meta-)trained model to a particular task at hand via in-context learning (Genewein et al., 2023). In contrast, smaller Transformers, trained manually on enwik8, only achieve good compression rates on similar Wikipedia data, i.e., enwik9. However, larger models’ stronger in-context compression (or in-context learning) comes at a price: the number of parameters, which has to be offset with increasingly large data sources when computing the adjusted compression rate (see Section 3.3). Finally, note that since Chinchilla has been trained on Wikipedia, the enwik9 results are in-distribution.
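To sketch how a language model acts as a compressor in practice: its conditional next-token probabilities drive an arithmetic coder, so the resulting code length is (up to about two bits) the negative log2-likelihood of the sequence. The `cond_prob` interface below is a hypothetical stand-in for an LLM's next-token distribution, not an actual API.

```python
import math

def lm_code_length(tokens, cond_prob):
    """Ideal code length in bits for `tokens` under an autoregressive model.
    An arithmetic coder fed the same conditional probabilities achieves
    this length, losslessly, to within roughly two bits."""
    return sum(-math.log2(cond_prob(tokens[:t], tok))
               for t, tok in enumerate(tokens))

# Hypothetical stand-in for a model: uniform over a 256-symbol byte alphabet,
# i.e. no compression at all (exactly 8 bits per byte).
uniform = lambda context, token: 1.0 / 256

data = list(b"hello world")
print(lm_code_length(data, uniform))  # 88.0 bits for 11 bytes
```

Replacing `uniform` with an LLM's next-token distribution, conditioned on the bytes seen so far, is in essence the offline in-context compression setup evaluated in the paper.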