Could probing methods miss computationally important features in neural networks?


This explores whether the tools we use to read what neural networks are 'thinking' (probes, PCA, linear classifiers, representational similarity analysis) can systematically overlook the features that actually drive a network's computation. The corpus says yes, and the reason is sharper than 'our tools are imperfect': representation and computation can come apart entirely. Standard analysis methods are biased toward simple, linear features and under-represent equally important nonlinear ones (see 'Do standard analysis methods hide nonlinear features in neural networks?'). The clinching demonstration is homomorphic encryption: a network can compute perfectly well while having no interpretable activation structure at all. If a probe finds nothing legible, that absence is not evidence the computation isn't there.
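The linear bias can be made concrete with a small synthetic sketch. Everything here is an assumption for illustration, not taken from the cited notes: the 'activations' are two noisy sign-valued units, the feature of interest is their XOR, and the probe is a plain least-squares linear readout. Under those assumptions, the linear probe should land near chance (it cannot exceed roughly 75% on XOR-style data), while the same probe given one extra nonlinear term (the product of the two units) should decode the feature almost perfectly.

```python
import numpy as np

def fit_linear_probe(X, y):
    """Least-squares linear probe: regress a +/-1 target on activations plus a bias."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, 2.0 * y - 1.0, rcond=None)
    return w

def probe_accuracy(w, X, y):
    """Fraction of examples where the sign of the probe output matches the label."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((Xb @ w > 0) == (y == 1)))

rng = np.random.default_rng(0)
n = 4000

# Hypothetical activations: two units that each encode a noisy sign.
signs = rng.choice([-1.0, 1.0], size=(n, 2))
acts = signs + 0.1 * rng.standard_normal((n, 2))

# The feature of interest is the XOR of the two signs (nonlinear in acts).
xor_feature = (signs[:, 0] * signs[:, 1] < 0).astype(int)

train, test = slice(0, 3000), slice(3000, None)

# A purely linear probe on the raw activations.
w_lin = fit_linear_probe(acts[train], xor_feature[train])
linear_acc = probe_accuracy(w_lin, acts[test], xor_feature[test])

# The same probe after appending one nonlinear term, the product of the units.
acts_aug = np.hstack([acts, acts[:, :1] * acts[:, 1:2]])
w_aug = fit_linear_probe(acts_aug[train], xor_feature[train])
nonlinear_acc = probe_accuracy(w_aug, acts_aug[test], xor_feature[test])

print(f"linear probe accuracy:      {linear_acc:.2f}")
print(f"probe with product feature: {nonlinear_acc:.2f}")
```

The feature is fully present in the activations in both cases; only the probe's function class changes. A null result from the first probe says something about the probe, not about the representation.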

The deeper problem is that a network's outward behavior tells you almost nothing about whether its internals are organized the way your probe assumes. Two networks can produce identical outputs on every input while one has clean structure and the other is a tangle. This is the 'fractured entangled representation' result, where SGD-trained networks match evolved networks on performance but hide radically different, brittle internal organization (see 'Can identical outputs hide broken internal representations?'). A model can pass every benchmark and still be internally incoherent in ways no standard test detects (see 'Can AI pass every test while understanding nothing?'). Probing inherits this blindness: if you measure behavior or surface-level activations, you can be confidently wrong about what's being computed underneath.

What makes a feature easy to probe is also not fixed; it is an artifact of training. Networks develop dense, structured activations for data they've seen a lot of and fall back to sparse representations for unfamiliar inputs (see 'Is representational sparsity learned or intrinsic to neural networks?'). So a probe's success can track familiarity rather than importance: the computationally critical machinery for a rare case might live in exactly the sparse, hard-to-read regime your method handles worst. Layer depth matters too. Uncertainty signals dominate early transformer blocks while empowerment-style signals only emerge mid-network (see 'Why do large language models explore less effectively than humans?'), so a probe reading the wrong layer can miss a feature that genuinely steers behavior.

There's a constructive flip side worth knowing. The reason probes miss features is that ordinary training scatters computation across entangled weights, so if you change how the network is built, you change what's visible. Training transformers with sparse weights forces modularity, producing compact circuits where individual neurons map to clear concepts, with ablations confirming those circuits are necessary and sufficient (see 'Can sparse weight training make neural networks interpretable by design?'). Even normally-trained networks sometimes hide clean modular subnetworks that only pruning reveals (see 'Do neural networks naturally learn modular compositional structure?'). Interpretability may be less a property you discover with a better probe than one you have to bake in during training.

The thing you didn't know you wanted to know: 'the probe found a feature' and 'this feature does the computational work' are two separate claims, and the gap between them is not noise — it's structural. A probe can light up on a representation the network barely uses, and stay dark on the machinery that actually decides the output. That's why ablation and necessity-testing keep recurring in this corpus as the corrective: not 'can I read it?' but 'does removing it break the computation?'
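The gap between 'can I read it?' and 'does removing it break the computation?' can be sketched with a toy model. All of this is assumed for illustration and does not come from the cited notes: the activations are synthetic, the 'network' is a single fixed linear readout that only looks at direction 0, and decodability is measured with a fixed-direction sign readout rather than a trained probe. Direction 1 carries a stronger copy of the label than direction 0, but the readout ignores it.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 8

# Hypothetical setup: a binary label written into two activation directions.
label = rng.choice([-1.0, 1.0], size=n)
acts = 0.5 * rng.standard_normal((n, d))
acts[:, 0] += label        # weaker signal, but the toy readout uses it
acts[:, 1] += 3.0 * label  # stronger signal, but the toy readout ignores it

# Toy "network head": the output depends only on direction 0.
readout = np.zeros(d)
readout[0] = 1.0

def model_output(h):
    return h @ readout > 0

def ablate(h, direction):
    """Project the given direction out of every activation vector."""
    unit = direction / np.linalg.norm(direction)
    return h - np.outer(h @ unit, unit)

def decodability(h, direction, y):
    """Accuracy of reading the label from the sign along one direction."""
    return float(np.mean((h @ direction > 0) == (y > 0)))

def output_agreement(h, direction):
    """Fraction of outputs left unchanged after ablating the direction."""
    return float(np.mean(model_output(h) == model_output(ablate(h, direction))))

e0, e1 = np.eye(d)[0], np.eye(d)[1]

probe_used = decodability(acts, e0, label)
probe_unused = decodability(acts, e1, label)
keep_after_ablating_used = output_agreement(acts, e0)
keep_after_ablating_unused = output_agreement(acts, e1)

print(f"decodability  used: {probe_used:.3f}  unused: {probe_unused:.3f}")
print(f"outputs kept  used: {keep_after_ablating_used:.3f}  "
      f"unused: {keep_after_ablating_unused:.3f}")
```

In this sketch the readout scores the unused direction at least as high as the used one, yet ablating the unused direction leaves every output unchanged, while ablating the used direction collapses the outputs to a single class. The decodability ranking and the necessity ranking disagree, which is the structural gap described above.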

Sources 7 notes

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Why do large language models explore less effectively than humans?

SAE decomposition shows uncertainty values dominate early transformer blocks while empowerment representations emerge only in middle blocks. This temporal mismatch causes models to commit to decisions before long-term exploration signals can influence them. Reasoning-trained o1 overcomes this by extending computation time.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
