Do feature extraction methods systematically miss computationally important complex features?
This explores whether the tools we use to inspect what a neural network has learned — PCA, linear probes, similarity analysis — are blind to exactly the complex, nonlinear features that do the real computational work.
The corpus says yes, and unusually directly: the methods most analysts reach for are systematically biased toward simple features. The note "Do standard analysis methods hide nonlinear features in neural networks?" shows that PCA, linear regression, and representational similarity analysis (RSA) over-represent linearly decodable structure while under-representing equally important nonlinear features. The sharpest demonstration is an existence proof: a homomorphically encrypted network computes perfectly with no interpretable activation structure at all, which means a representation pattern and the computation it supposedly explains can be completely decoupled. If a network can compute well while showing analysts nothing, then 'nothing visible' tells you nothing about what's being computed.
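A minimal sketch of the linear-probe blind spot, assuming synthetic data rather than real network activations. The two 'activation' dimensions, the XOR-style label, and the helper linear_probe_accuracy are all hypothetical and chosen only for illustration. The label depends on the product of the two dimensions, so a least-squares linear probe should land near chance, while the same probe does far better once the interaction term is handed to it explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for two activation dimensions of a network.
n = 2000
acts = rng.uniform(-1.0, 1.0, size=(n, 2))

# XOR-style feature: present when the two dimensions share a sign.
# No single linear direction in `acts` separates it.
labels = (acts[:, 0] * acts[:, 1] > 0).astype(float)


def linear_probe_accuracy(features, targets):
    """Fit a least-squares linear probe (with bias) and return training accuracy."""
    design = np.column_stack([features, np.ones(len(features))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    predictions = (design @ weights > 0.5).astype(float)
    return float((predictions == targets).mean())


# Probe on the raw dimensions: the feature is there but not linearly decodable.
linear_acc = linear_probe_accuracy(acts, labels)

# Same probe, with the pairwise interaction added as an explicit input.
with_interaction = np.column_stack([acts, acts[:, 0] * acts[:, 1]])
interaction_acc = linear_probe_accuracy(with_interaction, labels)

print(f"linear probe on raw dimensions:   {linear_acc:.2f}")
print(f"linear probe with interaction:    {interaction_acc:.2f}")
```

The point of the comparison is that the feature never changed between the two runs; only the probe's access to the interaction did. A low probe score on the raw dimensions would say nothing about whether the feature is present or used.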
The quieter danger is the inverse: a clean-looking representation that is actually broken underneath. The note "Can models be smart without organized internal structure?" finds models that contain all the linearly decodable features a task needs while their internal organization is fractured, leaving them fragile to perturbation and distribution shift in ways that accuracy and linear-probe scores never reveal. So the bias cuts both ways: simple-feature methods can miss real computation that's there, and can certify organization that isn't. Either way, what's legible to the probe is not what's load-bearing in the model.
Where does the missing complexity actually live? Two notes suggest it hides in interactions rather than in any single direction. The note "Can verification separate structural near-misses from topical matches?" shows that a verifier reading full token-to-token similarity maps catches structural near-misses that compressed, pooled vectors cannot. The signal exists, but only in the interaction pattern, which is precisely what dimensionality reduction throws away. And the note "Which tokens in reasoning chains actually matter most?" shows models internally rank tokens by functional role, preserving symbolic-computation tokens while discarding grammar and filler. The importance structure is real and recoverable, but only if you look at the right granularity, not a global summary.
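A minimal sketch of why pooling hides structural near-misses, assuming random vectors in place of real token embeddings. The near-miss here is simply the query's tokens in a different order, and diagonal_alignment is a hypothetical, hand-written stand-in for the small learned verifier described in the sources, not that verifier itself. Mean-pooled cosine and a MaxSim-style score both see a perfect match, while the token-to-token similarity map shows that the matches sit in the wrong positions.

```python
import numpy as np

rng = np.random.default_rng(0)


def normalize_rows(matrix):
    """Scale each token vector to unit length."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def pooled_cosine(a, b):
    """Cosine similarity between the mean-pooled vectors of two sequences."""
    pa, pb = a.mean(axis=0), b.mean(axis=0)
    return float(pa @ pb / (np.linalg.norm(pa) * np.linalg.norm(pb)))


def similarity_map(a, b):
    """Cosine similarity of every token in a (rows) against every token in b (columns)."""
    return normalize_rows(a) @ normalize_rows(b).T


def maxsim(a, b):
    """MaxSim-style score: best match in b for each token of a, averaged."""
    return float(similarity_map(a, b).max(axis=1).mean())


def diagonal_alignment(a, b):
    """Hypothetical structural check: how well tokens match position by position."""
    return float(np.trace(similarity_map(a, b)) / len(a))


# Hypothetical embeddings: 5 tokens, 16 dimensions each.
query = rng.normal(size=(5, 16))
# Structural near-miss: identical tokens, cyclically shifted by one position.
near_miss = np.roll(query, 1, axis=0)

pooled = pooled_cosine(query, near_miss)
late = maxsim(query, near_miss)
aligned_exact = diagonal_alignment(query, query)
aligned_near_miss = diagonal_alignment(query, near_miss)

print(f"pooled cosine:                 {pooled:.3f}")
print(f"MaxSim-style score:            {late:.3f}")
print(f"diagonal alignment, exact:     {aligned_exact:.3f}")
print(f"diagonal alignment, near-miss: {aligned_near_miss:.3f}")
```

Reordering is only one kind of structural near-miss, and a real verifier would learn which patterns in the map matter rather than reading the diagonal. The sketch only illustrates that the distinguishing information survives in the full map and is gone after pooling.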
There's a cross-domain echo worth noticing. The note "Why do large language models fail at complex linguistic tasks?" finds that LLM errors get predictably worse as syntactic structure deepens: the model captures surface patterns but not the compositional rule. That mirrors the analysis problem one level up: complexity that arises from composition and nesting is exactly what both the models and the tools that inspect them tend to flatten. Relatedly, the note "Why does removing spurious cues sometimes hurt model performance?" reframes a failure as integrating conflicting signals rather than filtering distractors. It is a reminder that the interesting behavior is often a composition of features, not a selection among them, and composition is what simple extraction methods are worst at seeing.
The takeaway you didn't know you wanted: 'we found a clean linear feature' and 'we understand the computation' are nearly independent claims. The corpus suggests that better interpretability may depend less on finding tidier directions and more on learning to read interaction structure — token-token maps, functional rankings, compositional depth — at the granularity where the hard features actually live.
Sources (6 notes)
PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.