Why do naive pruning and quantization destroy LLM performance so easily?

This looks at why simply removing weights (pruning) or lowering numerical precision (quantization) tends to wreck an LLM. The corpus doesn't address compression methods head-on, so the honest answer is a lateral one, built from what the collection *does* say about how models store and use information.

First, a caveat worth stating plainly: this collection has no paper that directly tests pruning or quantization, so no retrieved note names this failure mode. But several notes circle the same underlying question, *how is capability actually distributed across an LLM's weights?*, and that is what naive compression collides with.

The sharpest angle comes from work framing learning itself as compression. One result derives optimal training from a lossless-compression objective and finds a 'Learning Law' where, in the optimal process, every training example contributes equally (see 'Does optimal language model learning maximize data compression?'). Read that backwards and you get an intuition for why post-hoc compression is brutal: if a well-trained model has already squeezed its data near a compression limit with information spread evenly, there's little redundant slack left to cut. Naive pruning assumes some weights are 'spare.' A model that learned by maximizing compression has comparatively few spare weights to give.
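To make 'naive' concrete, here is a minimal sketch of the two operations in question. It assumes that naive pruning means global magnitude pruning and that naive quantization means symmetric per-tensor round-to-nearest; the collection doesn't define either, and the function names `magnitude_prune` and `quantize_rtn` are hypothetical.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the fraction `sparsity` of entries with the smallest |w|.

    Assumption: 'naive pruning' = one-shot magnitude pruning, no retraining.
    """
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    flat = w.copy().ravel()
    # argpartition puts the k smallest magnitudes in the first k slots.
    idx = np.argpartition(np.abs(flat), k - 1)[:k]
    flat[idx] = 0.0
    return flat.reshape(w.shape)

def quantize_rtn(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor round-to-nearest, returned in dequantized form.

    Assumption: 'naive quantization' = a single scale for the whole tensor.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))      # stand-in for one weight matrix
w_pruned = magnitude_prune(w, 0.5)  # half the entries forced to zero
w_q4 = quantize_rtn(w, 4)           # at most 15 distinct values remain
```

Both functions treat every weight as interchangeable with its neighbors: the only signal they use is magnitude. That is the 'spare weights' assumption in code form.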

The MobileLLM result pushes this further. At sub-billion scale, deep-and-thin architectures beat balanced ones because capability comes from *composing* abstract concepts across layers, not from spreading parameters across width (see 'Does depth matter more than width for tiny language models?'). If a capability is a chain through many layers rather than a localized lump, then knocking out weights or coarsening precision anywhere along the chain can break the whole composition, which is exactly the 'falls off a cliff' behavior naive compression produces. There's nothing graceful to degrade; you're snapping a link.
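The chain intuition can be poked at with a toy experiment. The sketch below is not MobileLLM or any real model: it assumes a stack of random `tanh` layers, quantizes every layer with a hypothetical per-tensor round-to-nearest helper, and measures how far the output drifts from the full-precision output as the stack gets deeper. The widths, depths, and bit-widths are arbitrary choices for illustration.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor round-to-nearest (illustrative assumption)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def forward(weights, x):
    """Run x through a chain of tanh layers, one weight matrix per layer."""
    h = x
    for w in weights:
        h = np.tanh(h @ w)
    return h

rng = np.random.default_rng(0)
width, depth = 64, 24
# Random weights scaled so activations neither explode nor vanish quickly.
weights = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
           for _ in range(depth)]
x = rng.normal(size=(8, width))

def relative_error(n_layers: int, bits: int) -> float:
    """Relative output drift after quantizing the first n_layers layers."""
    ref = forward(weights[:n_layers], x)
    approx = forward([quantize_rtn(w, bits) for w in weights[:n_layers]], x)
    return float(np.linalg.norm(approx - ref) / np.linalg.norm(ref))

for n in (1, 4, 12, 24):
    print(f"layers={n:2d}  4-bit err={relative_error(n, 4):.3f}  "
          f"8-bit err={relative_error(n, 8):.3f}")
```

Every layer's small rounding error becomes the next layer's corrupted input, so the drift is a property of the whole chain rather than of any single matrix. How fast it grows in a trained model is a separate empirical question this toy cannot answer.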

There's also a clue that models already manage their own sparsity dynamically. Hidden states sparsify in a localized, systematic way as tasks get harder or unfamiliar, and this acts as a *stabilizing* selective filter, not a defect (see 'Do language models sparsify their activations under difficult tasks?'). That reframes the whole problem: the model already decides what to zero out, conditioned on the input. Naive pruning overrides that with a fixed, input-blind mask. You're freezing a decision the model needed to make per-example, which is most damaging precisely on the rare, hard, out-of-distribution inputs.
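The contrast between a fixed mask and a per-input one fits in a few lines. This is a toy with a single random linear layer; using top-k on activation magnitude as the 'input-conditioned' rule is an assumption made for illustration, not the mechanism the cited note describes, and `static_mask` and `dynamic_mask` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, keep = 16, 32, 8
w = rng.normal(size=(d_in, d_hidden))  # stand-in for one layer's weights

def static_mask(w: np.ndarray, keep: int) -> np.ndarray:
    """Input-blind: keep the `keep` hidden units with the largest weight norm."""
    norms = np.linalg.norm(w, axis=0)
    mask = np.zeros(w.shape[1], dtype=bool)
    mask[np.argsort(norms)[-keep:]] = True
    return mask

def dynamic_mask(h: np.ndarray, keep: int) -> np.ndarray:
    """Input-conditioned: keep the `keep` largest |activations| for this input."""
    mask = np.zeros(h.shape[0], dtype=bool)
    mask[np.argsort(np.abs(h))[-keep:]] = True
    return mask

x_a, x_b = rng.normal(size=d_in), rng.normal(size=d_in)
h_a, h_b = x_a @ w, x_b @ w

fixed = static_mask(w, keep)      # chosen once, reused for every input
dyn_a = dynamic_mask(h_a, keep)   # chosen separately for input a
dyn_b = dynamic_mask(h_b, keep)   # chosen separately for input b

def kept_energy(h: np.ndarray, mask: np.ndarray) -> float:
    """Fraction of the activation's squared norm that survives the mask."""
    return float(np.sum(h[mask] ** 2) / np.sum(h ** 2))

print("input a: fixed", kept_energy(h_a, fixed), "dynamic", kept_energy(h_a, dyn_a))
print("input b: fixed", kept_energy(h_b, fixed), "dynamic", kept_energy(h_b, dyn_b))
```

With the same budget of kept units, the per-input mask can never retain less activation energy than the fixed one, because it picks the largest entries for that specific input. The fixed mask only matches it when an input happens to light up the units that were chosen in advance.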

The quiet kicker is that these effects hide until they don't. Models can corrupt a quarter of a document over a long workflow without ever plateauing or signaling trouble (see 'Do frontier LLMs silently corrupt documents in long workflows?'), and they routinely fail on low-probability targets that are logically trivial (see 'Can we predict where language models will fail?'). A compressed model can look fine on common prompts and quietly collapse on the rare ones, the same long tail where that adaptive sparsity and layer-wise composition mattered most. So the deeper lesson the corpus points to: 'destroys performance so easily' may be less about compression being violent and more about capability being more distributed, compositional, and input-conditional than a one-size mask or fixed bit-width can respect.
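A bit of arithmetic shows how an aggregate score can mask a long-tail collapse. Every number below is hypothetical and chosen only to illustrate the effect; none comes from the cited notes or from any measurement.

```python
def aggregate_accuracy(share_common: float, acc_common: float, acc_rare: float) -> float:
    """Overall accuracy as a traffic-weighted mix of common and rare prompts."""
    return share_common * acc_common + (1.0 - share_common) * acc_rare

# Hypothetical: 95% of prompts are 'common', 5% are 'rare'.
before = aggregate_accuracy(0.95, acc_common=0.99, acc_rare=0.90)
# Hypothetical compressed model: common prompts unchanged, rare ones collapse.
after = aggregate_accuracy(0.95, acc_common=0.99, acc_rare=0.40)

print(f"before: {before:.4f}  after: {after:.4f}  drop: {before - after:.4f}")
```

In this made-up case, rare-prompt accuracy falls by 50 points while the headline number moves by 2.5. An evaluation that reports only the aggregate would show the compressed model as nearly intact.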

Sources 5 notes

Does optimal language model learning maximize data compression?

Research shows that optimal LM training can be derived from a lossless compression objective, yielding a Learning Law where all examples contribute equally in the optimal process. This approach improves scaling law coefficients, not just constants.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
