Can text-trained models compress images better than specialized tools?
Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
The source coding theorem (Shannon, 1948) makes this equivalence formal: maximizing log-likelihood is the same as minimizing the expected code length in bits. A probabilistic model IS a compressor, and a compressor IS a probabilistic model. Delétang et al. (arXiv:2309.10668) take this from theory to practice by measuring the offline compression capabilities of large language models across data modalities.
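To make the equivalence concrete, here is a minimal sketch (the three-symbol model and message are made up for illustration) showing that the ideal code length an entropy coder assigns to a message is exactly the model's negative log-likelihood:

```python
import math

# Toy fixed probabilistic model over three symbols (illustrative only).
model = {"a": 0.5, "b": 0.25, "c": 0.25}

def ideal_code_length(message, p):
    """Bits an ideal entropy coder needs: -log2 of the message probability."""
    return -sum(math.log2(p[s]) for s in message)

msg = "aabac"
print(f"{ideal_code_length(msg, model):.2f} bits")  # 7.00 bits
# The same number read two ways: code length in bits, or negative
# log-likelihood. Improving one improves the other, symbol for symbol.
```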
The striking finding: Chinchilla 70B, trained exclusively on internet text (Wikipedia, websites, GitHub, books), compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating the domain-specific compressors PNG (58.5%) and FLAC (30.3%). This is not what you'd expect: those compressors are engineered for their modality, and a text-trained model shouldn't outperform them on non-text data.
The mechanism is in-context learning functioning as conditioning. The model doesn't learn image or audio representations during training. Instead, at compression time, it uses its context window to condition itself as a task-specific compressor. General-purpose compression via adaptation, not specialization.
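A sketch of what this looks like as a coding scheme, assuming a hypothetical `model_logprob(symbol, context)` interface (returning log2 probabilities) rather than any particular model API; a real compressor would feed these probabilities into an arithmetic coder to emit an actual bitstream:

```python
import math
from typing import Callable, Sequence

def compressed_bits(data: Sequence[int],
                    model_logprob: Callable[[int, Sequence[int]], float]) -> float:
    """Ideal code length of `data` in bits under an autoregressive model.

    Each symbol costs -log2 p(symbol | previous symbols). The growing
    context slice is where in-context learning acts: the model conditions
    itself on the data seen so far, adapting into a modality-specific
    compressor without any weight updates.
    """
    return sum(-model_logprob(sym, data[:i]) for i, sym in enumerate(data))

# Dummy stand-in model: uniform over 256 byte values, so every byte costs
# exactly 8 bits and nothing is saved. A real LLM's conditional
# probabilities are sharper, pushing the rate well below 1.0.
uniform = lambda sym, ctx: math.log2(1 / 256)

data = list(b"hello hello hello")
rate = compressed_bits(data, uniform) / (8 * len(data))
print(f"compression rate: {rate:.2f}")  # 1.00 under the uniform model
```

The conditioning on `data[:i]` is the whole mechanism: later symbols become cheaper as the context window fills with same-modality data.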
However, this comes with a scaling caveat that inverts the typical scaling-law narrative. When measuring the adjusted compression rate (which counts the model's parameters as part of the compressed output), scaling beyond a certain point deteriorates compression performance: the parameters themselves become overhead. Smaller models trained specifically on the target data can achieve better adjusted compression than massive general-purpose models. Connecting to "Why does reasoning training help math but hurt medical tasks?": the adjusted-compression overhead may concentrate in the deeper reasoning layers, the same layers that show redundancy under pruning.
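A back-of-the-envelope illustration of why this happens; the accounting below (2 bytes per float16 parameter, added to the payload) is an assumption for the sketch, and the paper's exact bookkeeping may differ:

```python
def adjusted_rate(compressed_bytes: int, raw_bytes: int,
                  n_params: int, bytes_per_param: float = 2.0) -> float:
    """Compression rate once the model's own size is counted as output."""
    return (compressed_bytes + n_params * bytes_per_param) / raw_bytes

# A 70B-parameter model carries ~140 GB of parameter overhead: on a 1 GB
# dataset the adjusted rate exceeds 100x even if the payload compresses
# to nothing, which is why scaling eventually hurts adjusted compression.
print(adjusted_rate(0, 10**9, 70 * 10**9))  # ~140.0
```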
The deeper principle: a model that compresses well generalizes well (Hutter, 2006). This reframes generalization as a compression problem rather than a learning problem. Connecting to "Do foundation models learn world models or task-specific shortcuts?": the compression framing suggests those shortcut heuristics are efficient compression strategies, good enough to compress the data but not sufficient for genuine world modeling.
Source: LLM Architecture
Related concepts in this collection
- Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic, or exploiting narrow patterns that fail under distribution shift? Link: compression may explain why heuristics suffice; they compress well enough.
- Can large language models develop genuine world models without direct environmental contact? Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess. Link: compression-as-generalization offers an alternative framing for how world models emerge.
- Why does reasoning training help math but hurt medical tasks? Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains. Link: the compression framing maps onto the layer separation, with lower layers compressing facts (memorization-heavy, document-specific) and higher layers compressing procedures (generalizable across instances); the scaling caveat where adjusted compression deteriorates for larger models may reflect redundancy in deeper reasoning layers.
- Does reasoning rely on procedural knowledge or factual memorization? Explores whether LLMs learn reasoning through general procedural patterns across documents or through memorizing specific facts, a distinction that matters for training-data strategy. Link: procedural knowledge compresses better than factual knowledge because one procedure covers many instances, directly explaining why compression = generalization holds more strongly for reasoning than for factual recall.
- Are neural network optimizers actually memory systems? Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning. Link: Nested Learning operationalizes the compression principle at the component level, treating every NN component (including optimizers) as an associative memory compressing its context flow, so compression = generalization applies recursively at every nesting level.
Original note title
language modeling is equivalent to lossless compression — LLMs trained on text outperform domain-specific compressors on images and audio via in-context conditioning