Does the linear representation hypothesis reflect networks or reflect our analysis tools?
This note explores whether the well-known finding that neural networks store concepts as straight-line directions is a real property of the networks, or an illusion created by the fact that the tools we use to look inside them can only see straight lines.
Does the linear representation hypothesis describe networks, or does it describe our microscopes? The corpus comes down hard on one worry: the methods themselves are biased witnesses. The most direct evidence is that standard interpretability tools (PCA, linear regression, RSA) systematically over-report simple linear features and under-report equally important nonlinear ones (see: Do standard analysis methods hide nonlinear features in neural networks?). The striking demonstration there is homomorphic encryption: a network can compute perfectly while its activation patterns show no interpretable structure at all, which shows that what we *see* in representations and what the network actually *computes* can be fully decoupled. So at minimum, a linear-only lens can only ever report linear structure, and that is a property of the lens, not of the network.
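The lens effect is easy to reproduce in a toy setting. The sketch below is an illustration only and is not taken from the cited note: it assumes hypothetical 2-D "activations" drawn from a Gaussian and a made-up feature equal to the product of the two coordinates. A linear readout explains almost none of the feature, while the same readout given the product term recovers it essentially exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "activations": 2000 samples, 2 dimensions.
acts = rng.standard_normal((2000, 2))
# Hypothetical nonlinear feature: the product of the two coordinates.
feature = acts[:, 0] * acts[:, 1]


def r_squared(inputs, target):
    """In-sample R^2 of a least-squares linear readout with an intercept."""
    design = np.column_stack([inputs, np.ones(len(inputs))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return 1.0 - residual.var() / target.var()


# Linear lens: regress the feature on the raw coordinates.
r2_linear = r_squared(acts, feature)

# Same readout, but the inputs now include the product term.
expanded = np.column_stack([acts, acts[:, 0] * acts[:, 1]])
r2_expanded = r_squared(expanded, feature)

print(f"linear readout R^2:   {r2_linear:.3f}")
print(f"expanded readout R^2: {r2_expanded:.3f}")
```

The feature is fully present in both cases; only the readout changes. A low score from the linear readout therefore says nothing about whether the information is there.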
It gets worse for naive readings. Linear decodability, the ability to read a feature off with a linear probe, turns out to be a weak signal of real organization. A model trained with SGD can contain every linearly decodable feature a task needs while its internal structure is fundamentally fractured, leaving it brittle to perturbation and distribution shift in ways standard metrics never catch (see: Can models be smart without organized internal structure?). In other words, 'we can linearly decode it' does not license 'the network represents it linearly.' A successful probe tells you what the probe can extract, not how the network organizes it.
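A toy construction makes the gap between "decodable" and "well organized" concrete. This is a minimal sketch under invented assumptions, not the setup of the cited note: hypothetical activations carry a binary feature on one axis at a tiny scale (0.01) next to 19 unrelated axes at scale 10. A least-squares probe decodes the feature perfectly, yet noise that is small relative to the typical activation scale pushes the same probe to near chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_dims = 1000, 20

# Hypothetical binary feature with values -1 or +1.
labels = rng.choice([-1.0, 1.0], size=n_samples)

# Hypothetical activations: large unrelated variation everywhere,
# with the feature written onto axis 0 at a very small scale.
acts = 10.0 * rng.standard_normal((n_samples, n_dims))
acts[:, 0] = 0.01 * labels


def fit_probe(x, y):
    """Least-squares linear probe with an intercept term."""
    design = np.column_stack([x, np.ones(len(x))])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights


def probe_accuracy(weights, x, y):
    """Fraction of samples whose sign the probe predicts correctly."""
    design = np.column_stack([x, np.ones(len(x))])
    return float(np.mean(np.sign(design @ weights) == y))


probe = fit_probe(acts, labels)
clean_acc = probe_accuracy(probe, acts, labels)

# Perturbation with std 0.5, about 5% of the typical activation scale.
perturbed = acts + 0.5 * rng.standard_normal(acts.shape)
perturbed_acc = probe_accuracy(probe, perturbed, labels)

print(f"probe accuracy, clean:     {clean_acc:.3f}")
print(f"probe accuracy, perturbed: {perturbed_acc:.3f}")
```

The clean accuracy alone would suggest a crisp linear feature. Only the perturbation test reveals how fragile the encoding is, which is the kind of failure a decodability metric cannot register on its own.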
But the corpus doesn't let you collapse into pure tool-skepticism, because there are cases where genuine structure shows up that the simplest linear story *misses*. The Polar Probe finds that LLMs encode syntactic relations using both distance *and* angle, a polar-coordinate geometry that nearly doubles accuracy over distance-only (i.e. flat-linear) methods (see: How do language models encode syntactic relations geometrically?). That's evidence of real, spontaneously learned geometric structure that's richer than a single direction. Similarly, the leading eigenvectors of embedding Gram matrices peel taxonomy apart coarse-to-fine, tracking the WordNet hypernym tree level by level (see: Do embedding eigenvectors organize taxonomy from coarse to fine?), and static embeddings carry measurable semantic content like valence and concreteness before attention even runs (see: Do transformer static embeddings actually encode semantic meaning?). These aren't artifacts of choosing a linear tool; they're structure the network put there, and it survives scrutiny.
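Why angle can carry information that distance discards can be shown with a toy example. The sketch below is not the Polar Probe itself; it assumes two hypothetical relation types whose head-to-dependent offsets have the same length but point in different directions in a made-up 2-D embedding space. A readout that sees only the distance is near chance, while one that sees the angle separates the two types.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 200

# Hypothetical relation types 0 and 1, one per (head, dependent) pair.
relation = rng.integers(0, 2, size=n_pairs)

# Assumption: each type shifts the dependent by a unit-length offset,
# along the x axis for type 0 and along the y axis for type 1.
offsets = np.array([[1.0, 0.0], [0.0, 1.0]])
heads = rng.standard_normal((n_pairs, 2))
noise = 0.05 * rng.standard_normal((n_pairs, 2))
dependents = heads + offsets[relation] + noise

diff = dependents - heads
distance = np.linalg.norm(diff, axis=1)       # what a distance-only readout sees
angle = np.arctan2(diff[:, 1], diff[:, 0])    # what an angular readout adds


def best_threshold_accuracy(score, y):
    """Best in-sample accuracy of any single threshold on a 1-D score."""
    best = 0.0
    for threshold in score:
        pred = (score > threshold).astype(int)
        acc = float(np.mean(pred == y))
        # Allow either orientation of the threshold rule.
        best = max(best, acc, 1.0 - acc)
    return best


acc_distance = best_threshold_accuracy(distance, relation)
acc_angle = best_threshold_accuracy(angle, relation)

print(f"distance-only accuracy: {acc_distance:.3f}")
print(f"angle accuracy:         {acc_angle:.3f}")
```

Both readouts get the same generous in-sample threshold search, so the gap comes from the geometry rather than from the classifier.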
The resolution the corpus points to: it's both, and the interesting question is *when*. Networks really do consolidate structure: they grow dense representations for familiar data and stay sparse for unfamiliar inputs (see: Is representational sparsity learned or intrinsic to neural networks?), and they spontaneously carve compositional tasks into isolated modular subnetworks (see: Do neural networks naturally learn modular compositional structure?). But how cleanly that structure reads out depends heavily on how the network was *trained*, not just how it's analyzed: force weight sparsity and you get compact circuits where single neurons map to single concepts (see: Can sparse weight training make neural networks interpretable by design?). That last result is the tell. If interpretable, near-linear structure can be *manufactured* by changing the training objective, then in ordinary networks linearity is partly real, partly a default the architecture drifts toward, and partly an echo of our looking. The honest answer is that the linear representation hypothesis is a claim about the *intersection* of a network and a probe, and the field's cleanest move is to stop asking 'is it linear?' and start asking 'what does this tool make impossible to see?'
Sources (8 notes)
PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.