Can fractured entangled representations hide undetected by standard analysis methods?

This explores whether a network can carry broken, tangled internal structure — the kind that hurts transfer and creativity — while the usual ways we inspect models (benchmarks, linear probes, PCA) report nothing wrong.

The corpus says yes, repeatedly and from several angles. The core claim is the Fractured Entangled Representation hypothesis: networks trained with SGD can reproduce the target outputs perfectly while their internal organization is radically different from that of a cleanly structured network, and that disorganization surfaces only under weight perturbation or distribution shift, not under ordinary evaluation (see "Can identical outputs hide broken internal representations?"). Two models can post identical accuracy on every test you run and still be wired completely differently inside, which means the test was never measuring the thing that breaks later (see "Can AI pass every test while understanding nothing?").
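A toy sketch of how weight perturbation can separate two models that ordinary evaluation cannot tell apart. This is not the experimental setup from the cited notes: the two hand-built linear networks, the cancellation scale `K`, and the helper `perturbed_error` are assumptions made up for illustration. Both networks compute the same function, but one does it through large hidden weights that cancel.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w = rng.normal(size=d)            # the function both networks compute: x -> w . x
X = rng.normal(size=(256, d))     # evaluation inputs

# Network A (assumed "clean"): a single hidden unit holding w directly.
W_a = w[None, :]
v_a = np.array([1.0])

# Network B (assumed "tangled"): two large hidden units whose difference is w.
u = rng.normal(size=d)
K = 100.0                         # hypothetical cancellation scale
W_b = np.stack([w + K * u, K * u])
v_b = np.array([1.0, -1.0])

def forward(W, v, X):
    """Linear hidden layer followed by a fixed linear readout."""
    return (X @ W.T) @ v

def perturbed_error(W, v, X, rel_noise=0.01, trials=20):
    """Mean squared output change under noise proportional to each weight."""
    clean = forward(W, v, X)
    errs = []
    for _ in range(trials):
        noise = rng.normal(size=W.shape) * rel_noise * np.abs(W)
        errs.append(np.mean((forward(W + noise, v, X) - clean) ** 2))
    return float(np.mean(errs))

same_outputs = bool(np.allclose(forward(W_a, v_a, X), forward(W_b, v_b, X)))
err_a = perturbed_error(W_a, v_a, X)
err_b = perturbed_error(W_b, v_b, X)
print(same_outputs, err_a, err_b)
```

Under these assumptions the two networks agree on every evaluation input, yet the same relative weight noise moves the cancelling network's outputs far more. Only the perturbation exposes the difference.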

The reason this hides so well is that our detection tools are biased toward exactly the kind of structure that survives. A model can hold all the linearly decodable features a task needs while its underlying organization is fractured, so a linear probe lights up green even though the representation is brittle (see "Can models be smart without organized internal structure?"). The bias is built into the method itself: PCA, linear regression, and RSA systematically over-represent simple linear features and under-represent equally important nonlinear ones. The sharpest demonstration is homomorphic encryption: a network can compute a task perfectly with no interpretable activation structure at all, which shows that representation patterns and the actual computation can be fully decoupled (see "Do standard analysis methods hide nonlinear features in neural networks?"). So 'I probed it and the features are there' is not evidence the internals are sound; it's evidence your probe only sees what it was built to see.
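The linear-method bias can be seen in a minimal sketch, assuming synthetic two-dimensional "activations" in which the feature of interest is encoded as an XOR (the product of two sign-valued units). The data and the helper `linear_probe_accuracy` are illustrative assumptions, not taken from the cited notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations: two units taking values in {-1, +1}.
acts = rng.choice([-1.0, 1.0], size=(2000, 2))
# The feature is present in the activations, but only nonlinearly (XOR).
label = acts[:, 0] * acts[:, 1]

def linear_probe_accuracy(features, target):
    """Least-squares linear probe with a bias term, read out by sign."""
    X = np.hstack([features, np.ones((len(features), 1))])
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.mean(np.sign(X @ w) == target))

# Probe on the raw activations: the probe is linear in the two units.
linear_acc = linear_probe_accuracy(acts, label)

# Same probe after adding the product term, i.e. a probe that can see
# the nonlinear encoding.
product = acts[:, :1] * acts[:, 1:]
nonlinear_acc = linear_probe_accuracy(np.hstack([acts, product]), label)

print(linear_acc, nonlinear_acc)
```

The raw-activation probe cannot recover the label, because no linear readout of two units computes XOR, while the same probe recovers it once the nonlinear term is made available. The information was there all along; the method decided whether it was visible.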

What's interesting is that the corpus doesn't just diagnose the problem; it points at structural fixes that work because they don't rely on after-the-fact analysis at all. Instead of inspecting a trained network and hoping the structure is clean, you can force clean structure during training: sparse-weight transformers grow compact, human-readable circuits in which individual neurons map to simple concepts, and ablations confirm those circuits are necessary and sufficient for the task. That is disentanglement by construction rather than by inspection, although scaling it beyond tens of millions of parameters remains unsolved (see "Can sparse weight training make neural networks interpretable by design?"). There's also evidence that networks already tend toward modular structure on their own: pruning reveals compositional subroutines living in isolated subnetworks, and pretraining makes that modularity more reliable (see "Do neural networks naturally learn modular compositional structure?"). The fractured-representation work and the modularity work are two readings of the same phenomenon: how clean the structure is varies widely across training runs, and whether you get the clean version or the tangled version isn't something a benchmark will tell you.
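What "necessary and sufficient" means for a circuit can be sketched with an ablation check. The network below is hand-built for illustration, not a trained sparse transformer; its weights, the two "circuits", and the masks are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Hypothetical modular network: hidden units 0-1 form circuit A and feed
# output 0; hidden units 2-3 form circuit B and feed output 1.
W_in = np.eye(4)                              # each hidden unit reads one input
W_out = np.array([[1.0, 1.0, 0.0, 0.0],       # output 0 reads circuit A only
                  [0.0, 0.0, 1.0, -1.0]])     # output 1 reads circuit B only

def forward(X, hidden_mask):
    """ReLU hidden layer; hidden_mask zeroes out ablated units."""
    h = np.maximum(X @ W_in.T, 0.0) * hidden_mask
    return h @ W_out.T

mask_all = np.ones(4)
mask_without_a = np.array([0.0, 0.0, 1.0, 1.0])   # ablate circuit A
mask_only_a = np.array([1.0, 1.0, 0.0, 0.0])      # keep circuit A alone

full = forward(X, mask_all)
without_a = forward(X, mask_without_a)
only_a = forward(X, mask_only_a)

# Necessity: removing circuit A breaks output 0.
necessary = not np.allclose(without_a[:, 0], full[:, 0])
# Isolation: removing circuit A leaves output 1 untouched.
isolated = bool(np.allclose(without_a[:, 1], full[:, 1]))
# Sufficiency: circuit A alone reproduces output 0.
sufficient = bool(np.allclose(only_a[:, 0], full[:, 0]))

print(necessary, isolated, sufficient)
```

In a tangled network the same ablation would typically degrade several functions at once, which is the contrast the modularity findings draw.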

The less obvious connection is that the same theme shows up far outside interpretability: a structural prior beats raw capacity, and the right constraint matters more than the right score. In collaborative filtering, a shallow linear model with a single architectural constraint (items can't predict themselves) beats deep neural baselines on most datasets, because the constraint forces generalization through item relationships rather than memorized self-reference (see "Can simpler models beat deep networks for recommendation systems?" and "Can a linear model beat deep collaborative filtering?"). The lesson rhymes with the fractured-representation findings: performance metrics are a poor guide to whether a model's internal organization is the kind that will hold up, and the durable wins come from imposing the right structure, not from chasing the leaderboard.
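The zero-diagonal constraint described above has a closed-form solution in the EASE model named in the source notes. A minimal sketch, assuming a tiny hand-made interaction matrix and a hypothetical regularization value `lam = 1.0`; the function name `fit_ease` is made up here.

```python
import numpy as np

def fit_ease(X, lam=1.0):
    """Item-item weights B minimizing ||X - XB||^2 + lam * ||B||^2
    subject to diag(B) = 0 (an item may not predict itself)."""
    G = X.T @ X                          # item-item Gram matrix
    G[np.diag_indices_from(G)] += lam    # L2 regularization on the diagonal
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))                # B[i, j] = -P[i, j] / P[j, j]
    B[np.diag_indices_from(B)] = 0.0     # enforce the self-prediction ban
    return B

# Toy data (assumption): 4 users who interacted with items 0-2 and
# 4 users who interacted with items 3-5.
X = np.array([[1, 1, 1, 0, 0, 0]] * 4 + [[0, 0, 0, 1, 1, 1]] * 4, dtype=float)
B = fit_ease(X)

# Score unseen items for a new user who has interacted with items 0 and 1.
new_user = np.array([1, 1, 0, 0, 0, 0], dtype=float)
scores = new_user @ B
print(np.round(scores, 3))
```

Because the diagonal is forced to zero, item 2 can be recommended only through its learned relationship to items 0 and 1, and it outranks the items from the other group.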

Sources (8 notes)

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis holds that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do standard analysis methods hide nonlinear features in neural networks?

PCA, linear regression, and RSA over-represent simple linear features while under-representing equally important nonlinear features. Homomorphic encryption demonstrates that networks can compute perfectly well with no interpretable activation structure, proving representation patterns and computation can be entirely decoupled.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can simpler models beat deep networks for recommendation systems?

EASE, a shallow linear item-item weight matrix with diagonal constrained to zero, beats deep neural baselines on most datasets. The constraint forces generalization by forbidding self-prediction, while learned negative weights capture item dissimilarity—a structural prior more valuable than model capacity.

Can a linear model beat deep collaborative filtering?

ESLER, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential—structural bias matters more than model capacity.
