What compression explains why syntax fits in low-dimensional subspaces?
This explores why grammatical structure ends up living in just a few dimensions of a model's internal space — and what kind of compression makes that possible.
Syntax, the structural relationships between words, turns out to occupy only a small number of dimensions inside a language model rather than being smeared across the whole representation. The short version the corpus suggests: compression isn't an accident here, it's the training objective itself. Language modeling and lossless compression are two views of the same problem (see: Can text-trained models compress images better than specialized tools?). Any predictive model can drive an arithmetic coder, and the resulting code length is essentially the model's log-loss, so a model that predicts well is, in effect, finding an economical encoding of the regularities in text. Grammar is among the most regular things in language, since the same clause structures recur endlessly, so it's exactly what an efficient compressor would factor out into a compact, reusable code rather than memorize case by case.
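The link between prediction and code length can be made concrete with a small sketch. It assumes an ideal entropy coder that spends -log2 p bits on a symbol predicted with probability p; the two toy models and the names `code_length_bits`, `uniform_model`, and `alternation_model` are hypothetical and not taken from the cited note.

```python
import math

ALPHABET = "ab"

def code_length_bits(text, prob_next):
    """Ideal total code length: sum of -log2 p(symbol | preceding context)."""
    return sum(-math.log2(prob_next(text[:i], text[i])) for i in range(len(text)))

def uniform_model(context, symbol):
    # Knows nothing: every symbol in the alphabet is equally likely.
    return 1.0 / len(ALPHABET)

def alternation_model(context, symbol):
    # Has learned one regularity: symbols usually alternate.
    if not context:
        return 0.5
    return 0.1 if symbol == context[-1] else 0.9

text = "ab" * 8  # a perfectly regular toy "corpus" of 16 symbols
uniform_bits = code_length_bits(text, uniform_model)
learned_bits = code_length_bits(text, alternation_model)
print(f"uniform: {uniform_bits:.2f} bits, learned: {learned_bits:.2f} bits")
```

The model that has picked up the alternation pattern needs far fewer bits for the same text, which is the sense in which learning a regularity and compressing it are the same act.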
What does that compact code look like geometrically? Two notes give it concrete shape. The Polar Probe work shows that syntactic relations are encoded in something like *polar coordinates*: the distance between two word embeddings and the angle between them carry different information (roughly, distance signals whether two words are linked, and angle signals the type and direction of the link), so a single low-dimensional geometric trick stores several facts at once (see: How do language models encode syntactic relations geometrically?). That's compression in the most literal sense: reusing the same few axes to express many distinctions. Relatedly, the leading eigenvectors of embedding Gram matrices split meaning coarse-to-fine, mirroring a taxonomy tree level by level (see: Do embedding eigenvectors organize taxonomy from coarse to fine?). A handful of dominant directions capture the broadest, most frequent structural splits first, which is why so much of the signal collapses into a low-dimensional subspace.
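Both geometric ideas can be illustrated on toy data. The sketch below is not the cited probe or analysis: the 2-D coordinates, the four-word taxonomy, and the helper name `polar_readout` are all made up for illustration. It shows two word pairs that share a distance but differ in angle, and a Gram matrix whose leading eigenvector picks out the coarsest split.

```python
import numpy as np

def polar_readout(head, dependent):
    """Distance and angle (in degrees) of the offset from head to dependent."""
    offset = np.asarray(dependent, dtype=float) - np.asarray(head, dtype=float)
    distance = float(np.linalg.norm(offset))
    angle = float(np.degrees(np.arctan2(offset[1], offset[0])))
    return distance, angle

# Hypothetical 2-D probe space: two dependents equally far from the verb.
verb, subject, obj = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
subj_dist, subj_angle = polar_readout(verb, subject)
obj_dist, obj_angle = polar_readout(verb, obj)

# Toy taxonomy: one strong branch axis (animal vs. vehicle), two weak leaf axes.
words = ["cat", "dog", "car", "bus"]
X = np.array([
    [ 2.0,  1.0,  0.0],   # cat
    [ 2.0, -1.0,  0.0],   # dog
    [-2.0,  0.0,  1.0],   # car
    [-2.0,  0.0, -1.0],   # bus
])
gram = X @ X.T
eigenvalues, eigenvectors = np.linalg.eigh(gram)  # eigenvalues in ascending order
leading = eigenvectors[:, -1]                     # direction of the largest eigenvalue
share = eigenvalues[-1] / eigenvalues.sum()       # fraction of the spectrum it carries

print(subj_dist, subj_angle, obj_dist, obj_angle)
print(dict(zip(words, np.round(leading, 2))), round(float(share), 2))
```

In this toy setup distance alone cannot tell the two dependents apart, while the (distance, angle) pair can. The leading eigenvector gives cat and dog one sign and car and bus the other, and it accounts for 80% of the spectrum, so one direction already captures the broadest split.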
There's a deeper 'why' underneath, too. Models trained to compress aggressively keep the broad category structure and throw away fine-grained nuance (see: Do LLMs compress concepts more aggressively than humans do?). Syntax is precisely the broad, high-frequency scaffolding that survives that squeeze, so it gets a privileged, compact home in the representation while subtler distinctions get blurred. Depth helps here as well: deep-and-thin networks compose abstract structure across layers rather than spreading it across width (see: Does depth matter more than width for tiny language models?), which is plausibly the kind of architecture that would build a low-dimensional structural code by stages.
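For reference, the trade-off that rate-distortion theory formalizes can be written as a single objective. This is the generic textbook form, not necessarily the exact objective used in the cited note; the symbols $X$ (inputs), $C$ (compressed codes), $d$ (a distortion measure), and $\beta$ (the trade-off weight) are notation assumed here.

```latex
\min_{p(c \mid x)} \; I(X; C) \;+\; \beta \, \mathbb{E}\big[\, d(X, C) \,\big]
```

When $\beta$ is small the rate term $I(X; C)$ dominates, so only distinctions that recur often enough to justify their bit cost, such as broad categories, are kept. Raising $\beta$ buys back finer distinctions at a higher rate.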
But here's the twist worth carrying away: low-dimensional doesn't mean *deep* understanding. Models that ace tasks can still hold fractured, brittle internal representations beneath a tidy decodable surface (see: Can models be smart without organized internal structure?), and they make systematic grammatical errors that get predictably worse as sentences nest deeper (see: Why do large language models fail at complex linguistic tasks?). So the compact syntactic subspace may be capturing the *statistics* of structure, meaning the frequent, shallow patterns, rather than the full recursive rule system. The compression that makes syntax fit so neatly may be the same compression that quietly drops the hard cases.
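Checking a claim like "errors grow with nesting" first requires a depth measure. A minimal sketch, assuming sentences are available as bracketed parses: the function name `max_nesting_depth` and the two hand-written parses are hypothetical, and bracket depth is a crude proxy that counts every constituent, not only embedded clauses.

```python
def max_nesting_depth(bracketed_parse):
    """Maximum bracket depth of a parse such as '(S (NP ...) (VP ...))'."""
    depth = 0
    deepest = 0
    for ch in bracketed_parse:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parse: unexpected ')'")
    if depth != 0:
        raise ValueError("unbalanced parse: missing ')'")
    return deepest

# Hand-written example parses, not drawn from any evaluation set.
simple = "(S (NP the dog) (VP barked))"
nested = "(S (NP the dog (SBAR that (S (NP the cat) (VP chased)))) (VP barked))"

print(max_nesting_depth(simple), max_nesting_depth(nested))
```

Grouping a model's parsing errors by this value would show whether accuracy falls as depth rises, which is the pattern the cited note reports.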
Sources (7 notes)
Chinchilla models trained primarily on text achieve better compression rates on images and audio than PNG and FLAC, respectively, by using their context window to adapt as task-specific compressors. This suggests that generalization operates through compression, not specialization.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.