Can identical model performance mask fundamentally broken internal representations?

This explores whether two models that score the same on benchmarks can be built completely differently inside — one coherent, one a tangled mess — with standard tests blind to the difference.

The corpus says yes, emphatically. The central finding is the Fractured Entangled Representation hypothesis: networks trained with ordinary stochastic gradient descent (SGD) can reproduce the right output on every input while their internal structure is a tangled knot, nothing like the clean, organized representations that evolved or hand-designed networks develop (see: Can identical outputs hide broken internal representations?). Perturb the weights slightly and the fracture shows: the model breaks in ways that benchmarks never see coming, because the benchmark only ever asked for the output, not the reasoning behind it (see: Can AI pass every test while understanding nothing?).
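The perturbation idea can be made concrete with a toy sketch in plain Python. This is not the experimental setup behind the hypothesis. The two hand-built networks, the names `CLEAN` and `ENTANGLED`, the noise scale `sigma`, and the drift measure are all assumptions chosen for illustration. Both networks compute |x| on the input grid, but one of them relies on a pair of large weights that cancel exactly, so small weight noise moves its outputs much further.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def forward(net, x):
    """One-hidden-layer ReLU net; each unit is (w_in, bias, w_out)."""
    return sum(w_out * relu(w_in * x + b) for (w_in, b, w_out) in net)

# Hypothetical "clean" network: two units that together compute |x|.
CLEAN = [(1.0, 0.0, 1.0), (-1.0, 0.0, 1.0)]

# Hypothetical "entangled" network: the same two units plus two large
# units whose contributions cancel exactly, so the output is still |x|.
ENTANGLED = CLEAN + [(1.0, 3.0, 100.0), (1.0, 3.0, -100.0)]

# Stand-in for a benchmark: multiples of 1/8 in [-2, 2].
INPUTS = [i / 8.0 for i in range(-16, 17)]

def max_output_gap(net_a, net_b):
    """Largest output difference between two networks on the benchmark."""
    return max(abs(forward(net_a, x) - forward(net_b, x)) for x in INPUTS)

def perturb(net, sigma, rng):
    """Add Gaussian noise to input and output weights (biases untouched)."""
    return [(w_in + rng.gauss(0.0, sigma), b, w_out + rng.gauss(0.0, sigma))
            for (w_in, b, w_out) in net]

def mean_drift(net, sigma=0.01, trials=200, seed=0):
    """Average absolute output change under random weight noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = perturb(net, sigma, rng)
        total += sum(abs(forward(noisy, x) - forward(net, x))
                     for x in INPUTS) / len(INPUTS)
    return total / trials

if __name__ == "__main__":
    print("output gap on benchmark:", max_output_gap(CLEAN, ENTANGLED))
    print("drift, clean network:   ", mean_drift(CLEAN))
    print("drift, entangled network:", mean_drift(ENTANGLED))
```

The benchmark comparison sees no difference between the two networks, while the drift measure separates them. Real networks are not built this way by hand; the sketch only shows why an output-only test cannot distinguish the two cases.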

The sharpest version of the warning is that a model can hold all the 'linearly decodable' features a task needs, meaning a linear probe can read the right answer straight out of the activations, while the underlying organization is still broken. Perfect accuracy and a readable internal signal can coexist with fragility to perturbation and distribution shift that no standard metric detects (see: Can models be smart without organized internal structure?). So 'the information is in there' is not the same as 'the model is built well.' Test performance measures whether the answer can be extracted, not whether the structure that produced it would survive a new context.
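A minimal sketch of what 'linearly decodable' means, using only the Python standard library. The synthetic activations, the mixing pattern in `make_activations`, and the use of a perceptron as the probe are assumptions for illustration; probes in practice are often logistic regressions fitted to real hidden states. Here the label is mixed with nuisance signals in every coordinate, so no single unit encodes it, yet a linear read-out recovers it perfectly on this data.

```python
import random

def make_activations(n, seed=0):
    """Synthetic 'hidden states': the label s is mixed with nuisance
    signals u and v in every coordinate (hypothetical mixing pattern)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = rng.choice([-1.0, 1.0])
        u = rng.uniform(-3.0, 3.0)
        v = rng.uniform(-3.0, 3.0)
        data.append(([s + u, s - u, v + s, v - s], s))
    return data

def dot(w, h):
    return sum(wi * hi for wi, hi in zip(w, h))

def train_probe(data, max_epochs=100):
    """Perceptron probe: a purely linear read-out of the activations."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for h, s in data:
            if s * dot(w, h) <= 0.0:          # misclassified: update
                w = [wi + s * hi for wi, hi in zip(w, h)]
                mistakes += 1
        if mistakes == 0:                     # every example is correct
            break
    return w

def probe_accuracy(w, data):
    return sum(1 for h, s in data if s * dot(w, h) > 0.0) / len(data)

def best_single_unit_accuracy(data):
    """Best accuracy from the sign of any one coordinate (either polarity)."""
    best = 0.0
    for i in range(len(data[0][0])):
        acc = sum(1 for h, s in data if s * h[i] > 0.0) / len(data)
        best = max(best, acc, 1.0 - acc)
    return best

if __name__ == "__main__":
    data = make_activations(200)
    w = train_probe(data)
    print("probe accuracy:      ", probe_accuracy(w, data))
    print("best single unit acc:", best_single_unit_accuracy(data))
```

A perfect probe score here says only that the label can be extracted. It says nothing about how the representation is organized or how it behaves under perturbation, which is the gap the paragraph above describes.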

What makes this more than a curiosity is the adjacent work on what good internal structure looks like, and how hard it is to get. Train transformers with sparse weights and you can force compact, interpretable circuits in which neurons map to simple concepts and ablations confirm the circuits are necessary and sufficient for the task, but scaling this past tens of millions of parameters while keeping the interpretability remains unsolved (see: Can sparse weight training make neural networks interpretable by design?). Left to their own devices, networks do sometimes decompose tasks into clean modular subnetworks, and pretraining makes that modularity more reliable (see: Do neural networks naturally learn modular compositional structure?). The fractured-representation work is essentially the failure case of that same story: when the modularity doesn't form, you get entanglement that performance metrics can't see.
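The ablation logic behind 'necessary and sufficient' can be sketched on a toy network. Everything here is hypothetical: the three hand-set units, the `CIRCUITS` assignment, and the choice of zero-ablation (dropping a unit's contribution entirely) are assumptions for illustration, not the procedure used in the cited work. A circuit is treated as necessary if removing it changes the task output, sufficient if keeping only it leaves the task output unchanged, and isolated if removing it leaves the other task alone.

```python
def relu(v):
    return v if v > 0.0 else 0.0

# Each hidden unit: (w_in, bias, (w_out_task0, w_out_task1)).
UNITS = [
    (1.0, 0.0, (1.0, 0.0)),    # units 0 and 1 serve task 0: |x|
    (-1.0, 0.0, (1.0, 0.0)),
    (1.0, -1.0, (0.0, 1.0)),   # unit 2 serves task 1: relu(x - 1)
]
CIRCUITS = {0: {0, 1}, 1: {2}}      # hypothetical task -> unit indices
ALL = set(range(len(UNITS)))
INPUTS = [i / 4.0 for i in range(-8, 9)]

def forward(x, active):
    """Run the network with only the units in `active` switched on."""
    out = [0.0, 0.0]
    for i, (w_in, b, w_out) in enumerate(UNITS):
        if i not in active:
            continue                 # zero-ablated: contributes nothing
        h = relu(w_in * x + b)
        out[0] += w_out[0] * h
        out[1] += w_out[1] * h
    return out

def task_error(task, active):
    """Mean absolute change in one task's output versus the full network."""
    return sum(abs(forward(x, active)[task] - forward(x, ALL)[task])
               for x in INPUTS) / len(INPUTS)

def is_necessary(task):
    return task_error(task, ALL - CIRCUITS[task]) > 0.0

def is_sufficient(task):
    return task_error(task, CIRCUITS[task]) == 0.0

def is_isolated(task):
    others = [t for t in CIRCUITS if t != task]
    return all(task_error(t, ALL - CIRCUITS[task]) == 0.0 for t in others)

if __name__ == "__main__":
    for task in CIRCUITS:
        print(task, is_necessary(task), is_sufficient(task), is_isolated(task))
```

In an entangled network the same checks would tend to fail: no small set of units would be sufficient for a task, and ablating any one of them would disturb several functions at once.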

The twist worth carrying away is that not every messy-looking internal state is broken. Models sparsify their activations under unfamiliar, out-of-distribution tasks, and that sparsification turns out to be an adaptive filter that stabilizes performance, not a sign of damage (see: Do language models sparsify their activations under difficult tasks?). How dense a representation is turns out to be learned: activations are dense for familiar training data and sparse for unfamiliar inputs, so sparsity is not an intrinsic flaw (see: Is representational sparsity learned or intrinsic to neural networks?). So the real lesson isn't 'weird internals = bad model.' It's that the surface score tells you almost nothing about which case you have: clean structure, adaptive filtering, and genuine fracture can all produce the same number on the test. If you want to know which one you've got, you have to look inside, and most evaluation never does.
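Looking inside starts with something measurable. The notes do not say which sparsity measure the cited work uses, so the two below are assumptions: a simple fraction of near-zero entries (with a hypothetical tolerance `rel_tol`) and the Hoyer measure, which is 0 for a vector with all entries equal in magnitude and 1 for a vector with a single non-zero entry.

```python
import math

def fraction_near_zero(h, rel_tol=0.01):
    """Share of entries whose magnitude is at most rel_tol times the peak."""
    peak = max(abs(v) for v in h)
    if peak == 0.0:
        return 1.0                      # an all-zero vector is fully sparse
    return sum(1 for v in h if abs(v) <= rel_tol * peak) / len(h)

def hoyer_sparsity(h):
    """Hoyer measure: (sqrt(n) - L1/L2) / (sqrt(n) - 1), in [0, 1]."""
    n = len(h)
    l1 = sum(abs(v) for v in h)
    l2 = math.sqrt(sum(v * v for v in h))
    if n < 2 or l2 == 0.0:
        raise ValueError("need at least two entries and a non-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)

if __name__ == "__main__":
    dense_state = [1.0, 1.0, 1.0, 1.0]    # every unit equally active
    sparse_state = [0.0, 0.0, 5.0, 0.0]   # one unit carries everything
    for name, h in [("dense", dense_state), ("sparse", sparse_state)]:
        print(name, fraction_near_zero(h), hoyer_sparsity(h))
```

Either number only describes how concentrated the activations are. Telling adaptive filtering apart from genuine fracture still needs a behavioral check, such as how outputs respond to perturbation or ablation.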

Sources 7 notes

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis holds that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
