LLM Reasoning and Architecture · Language Understanding and Pragmatics

Can text-trained models compress images better than specialized tools?

Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.

Note · 2026-02-22 · sourced from LLM Architecture

The source coding theorem (Shannon, 1948) makes this equivalence formal: maximizing log-likelihood is equivalent to minimizing the expected number of bits per message. A probabilistic model IS a compressor, and a compressor IS a probabilistic model. Delétang et al. (arXiv:2309.10668) take this from theory to practice by measuring the lossless compression capabilities of large language models across data modalities.
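The identity is easy to see numerically. A minimal sketch (the toy 3-symbol distribution is hypothetical, chosen only to make the arithmetic clean): the negative log2-likelihood of a message under a model is exactly the number of bits an ideal entropy coder driven by that model would spend on it.

```python
import math

# A toy "language model": fixed unigram probabilities over a 3-symbol alphabet.
# (Hypothetical distribution, chosen so the code lengths are whole bits.)
model = {"a": 0.5, "b": 0.25, "c": 0.25}

def log_likelihood_bits(message, p):
    """Negative log2-likelihood of the message under model p."""
    return -sum(math.log2(p[s]) for s in message)

msg = "aabac"
bits = log_likelihood_bits(msg, model)
# Shannon's source coding theorem: an entropy coder (e.g. arithmetic coding)
# driven by this model can encode msg in ~this many bits:
# 1 + 1 + 2 + 1 + 2 = 7.0
print(bits)
```

Raising the probability of any symbol in the message lowers `bits` one-for-one with log-likelihood, which is why training a language model and building a compressor are the same optimization.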

The striking finding: Chinchilla models, trained exclusively on internet text (Wikipedia, websites, GitHub, books), achieve state-of-the-art compression rates on image and audio data — beating domain-specific compressors like FLAC and PNG. This is not what you'd expect. Domain-specific compressors are engineered for their modality. A text-trained model shouldn't outperform them on non-text data.

The mechanism is in-context learning functioning as conditioning. The model doesn't learn image or audio representations during training. Instead, at compression time, it uses its context window to condition itself as a task-specific compressor. General-purpose compression via adaptation, not specialization.
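A crude stand-in for this adaptation effect (Laplace-smoothed byte counts rather than a transformer, so purely illustrative): a predictor conditioned on everything seen so far in the "context" beats a fixed code on redundant data, without ever having been trained on that modality.

```python
import math
from collections import Counter

def adaptive_code_length(data):
    """Bits to code data when each byte is predicted from everything seen
    so far (Laplace-smoothed counts). A toy analogue of in-context
    conditioning: the model specializes as the context grows."""
    counts, alphabet = Counter(), 256
    bits = 0.0
    for b in data:
        p = (counts[b] + 1) / (sum(counts.values()) + alphabet)
        bits += -math.log2(p)
        counts[b] += 1
    return bits

uniform = 8 * 64                              # naive 8 bits/byte for 64 bytes
adapted = adaptive_code_length(b"\x00" * 64)  # highly redundant input
print(adapted < uniform)  # adaptation beats the fixed 8-bit code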

However, this comes with a scaling caveat that inverts the typical scaling-law narrative. When measuring adjusted compression rate (which counts the model's parameters as part of the compressed output), scaling beyond a certain point deteriorates compression performance: the parameters themselves become overhead. Smaller models trained specifically on the target data can achieve better adjusted compression than massive general-purpose models. Connecting to "Why does reasoning training help math but hurt medical tasks?", the adjusted-compression overhead may concentrate in the deeper reasoning layers, the same layers that show redundancy under pruning.
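The adjusted metric can be sketched directly. All numbers below are hypothetical (as is the fp16 2-bytes-per-parameter assumption); the point is only the shape of the trade-off: once the weights are charged to the bill, a large model's better raw rate can still be a net loss.

```python
def adjusted_rate(raw_bytes, compressed_bytes, param_count, bytes_per_param=2):
    """Adjusted compression rate: (compressed output + model weights) / raw size.
    bytes_per_param=2 assumes fp16 weights; values below are illustrative."""
    model_bytes = param_count * bytes_per_param
    return (compressed_bytes + model_bytes) / raw_bytes

raw = 1_000_000_000  # 1 GB of data to compress (hypothetical)
# A large model compresses better in raw terms, but its parameters dominate:
big   = adjusted_rate(raw, int(raw * 0.30), param_count=70_000_000_000)
small = adjusted_rate(raw, int(raw * 0.45), param_count=200_000_000)
print(big > 1.0, small < 1.0)  # the big model's weights make it net-worse
```

A rate above 1.0 means the "compressed" artifact (output plus weights) is larger than the input, which is why the adjusted metric favors small specialized models unless the data volume is enormous.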

The deeper principle: a model that compresses well generalizes well (Hutter, 2006). This reframes generalization as a compression problem rather than a learning problem. Connecting to "Do foundation models learn world models or task-specific shortcuts?", the compression framing suggests that task-specific heuristics are efficient compression shortcuts: good enough to compress, but not sufficient for genuine world modeling.




language modeling is equivalent to lossless compression — LLMs trained on text outperform domain-specific compressors on images and audio via in-context conditioning