Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis

Paper · arXiv 2505.11581 · Published May 16, 2025

Much of the excitement in modern AI is driven by the observation that scaling up existing systems leads to better performance. But does better performance necessarily imply better internal representations? While the representational optimist assumes it must, this position paper challenges that view. We compare neural networks evolved through an open-ended search process to networks trained via conventional stochastic gradient descent (SGD) on the simple task of generating a single image. This minimal setup offers a unique advantage: each hidden neuron’s full functional behavior can be easily visualized as an image, thus revealing how the network’s output behavior is internally constructed neuron by neuron. The result is striking: while both networks produce the same output behavior, their internal representations differ dramatically. The SGD-trained networks exhibit a form of disorganization that we term fractured entangled representation (FER). Interestingly, the evolved networks largely lack FER, even approaching a unified factored representation (UFR). In large models, FER may be degrading core model capacities like generalization, creativity, and (continual) learning. Therefore, understanding and mitigating FER could be critical to the future of representation learning.

A neural network’s knowledge, generalization, creativity, continual learning ability, and overall potential are ultimately determined by its internal representations. These representations—how it models the world—are encoded through the structure of its neural circuitry (connection weights). Even as researchers increasingly probe the internal representations of neural networks (NNs) (Elhage et al., 2022; Lindsey et al., 2025; Nguyen et al., 2016; Olah et al., 2017; Olsson et al., 2022; Yosinski et al., 2015), an implicit philosophy of representational optimism has emerged—rarely stated outright but carrying profound implications. The hope of the representational optimist is that, as data and compute are scaled, good representations naturally develop on their own.

While scaling clearly improves the performance of the representations on downstream tasks (Bubeck et al., 2023; Kaplan et al., 2020), less attention has been given to understanding the nature of these representations as models grow. Of course, even representational optimists still want to understand internal representations (e.g. through mechanistic interpretability as in Sharkey et al. 2025), but the motivation stems more from a desire to understand how the networks work (Olah et al., 2017; Yosinski et al., 2015), how they can be controlled, and their level of alignment (Bereska and Gavves, 2024; Lindsey et al., 2025) than from any concern that their representations may be fundamentally flawed.

At the heart of representational optimism is the belief that representation improves with scaling in deep learning (Hoffmann et al., 2022; Huh et al., 2024; Kaplan et al., 2020). For example, the performance leaps from GPT-2 (Radford et al., 2019) to GPT-3 (Brown et al., 2020) and GPT-4 (Achiam et al., 2023) suggest that the underlying representation should also be improving. As Brown et al. (2020) state in the introduction to the technical report introducing GPT-3, where they place focus on the implications for representation (boldface added for emphasis):

Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors and fed to task-specific architectures, then RNNs with multiple layers of representations and contextual state were used to form stronger representations (though still applied to task-specific architectures), and more recently pre-trained recurrent or transformer language models...

Yet what if something important is actually missing in the learned representation? The main contribution of this position paper is to raise the possibility that, in practice, good representation might not simply work itself out through scaling after all (Figure 1).

To address the question of whether representations may be flawed even when benchmark performance is good, we shift the focus from studying individual neurons to studying the broader, holistic representational strategy across the entire network. By doing so, we are able to make a distinction between an adequate representation that solves a task and an ideal representation of that solution.

  1. FER potentially impacts generalization wherever coverage is sparse in the training data. Where there are insufficient training examples, a neural network or LLM has to interpolate or extrapolate. If general principles from outside the area of sparsity can be applied within that area, then the interpolation can still succeed. However, if those principles are fractured and therefore only selectively applied to narrow and arbitrary subdomains (and entangled and therefore yielding unintended side effects), then interpolation will not be based on those more fundamental regularities, diminishing its power. Because it is more likely to be seen in areas of sparsity, the impact of FER on generalization is particularly problematic in the borderlands of human knowledge, where little is yet written or known. A failure of LLMs to extend or apply relevant regularities to these borderlands would be expected to surface as a clumsiness in grappling with novel material. If that is the case, it is a particularly unfortunate deficit, because the very place where AI can potentially make the most exciting contributions is at the borderlands of knowledge (Amodei, 2024).

  2. Creativity is an entire discipline (Boden, 2004) to which the brief commentary here cannot do justice. Nevertheless, it is still important to highlight that the ability to imagine a new artifact of a particular type requires understanding the regularities of that type. The iPhone was once a new idea, but it extended many of the regularities of the concepts of a computer and a phone (and blended them very well). A fractured and entangled representation of either phones or computers may not have enabled extending those concepts so far from their incarnations at that time. Even if the creative act is intentionally to break a regularity, other regularities should still be preserved to the extent possible in alignment with that broken regularity: in a butterfly where one wing is smaller, it still might make sense to warp and compress the same pattern as in the larger wing, preserving some of the symmetry and thereby creating a plausible version of something that has never existed. For these reasons, the toll of FER on creativity is likely steep. The weight-sweep studies in this paper confirm this intuition, showing a dramatic advantage in the ability to vary underlying concepts for networks closer to UFR than to FER.

  3. Finally, FER is relevant to learning because learning involves moving through the weight space. Learning is easier, and more likely to settle on deeper truths, if nearby/adjacent points in weight space respect fundamental regularities. In contrast, if nearby points overwhelmingly break regularities, learning is stifled from building and elaborating on deep discoveries. For example, if different modules perform arithmetic in the service of calculus versus physics, then learning a new, more efficient method to do arithmetic in the context of calculus would not apply to physics. The tax imposed by such fracture would likely compound even further in a continual learning scenario, which is one of the next frontiers for the field.

 This reproduction looks identical because SGD succeeds at reproducing the output. In more detail, beginning with the same neural architecture as the layerized version of the Picbreeder CPPN, we initialize a new random set of weights and train them with SGD to match the Picbreeder CPPN’s output. We provide each (x, y, d) instance and its corresponding (h, s, v) label as a regression target. As an (imperfect) analogy, the Picbreeder CPPN is like a preexisting “human” brain, whereas this new conventional SGD CPPN is analogous to an LLM trained to mimic the brain’s output behavior for a vast set of inputs (analogous to current LLM internet training). Complete details of the SGD training process are in Appendix C.
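As a concrete illustration, the following is a minimal sketch of this regression setup in PyTorch. The architecture, hyperparameters, and the `picbreeder_cppn` target function are all placeholders rather than the paper’s actual configuration, which is described in Appendix C.

```python
# Minimal sketch of the SGD regression setup (illustrative only; the
# paper's exact architecture and hyperparameters are in Appendix C).
import torch
import torch.nn as nn

def coordinate_inputs(size=256):
    """Build an (x, y, d) input row for every pixel, with d the distance from center."""
    xs = torch.linspace(-1.0, 1.0, size)
    x, y = torch.meshgrid(xs, xs, indexing="xy")
    d = torch.sqrt(x**2 + y**2)
    return torch.stack([x, y, d], dim=-1).reshape(-1, 3)  # (size*size, 3)

# Generic layered stand-in for the layerized Picbreeder CPPN architecture,
# which in reality may mix several activation functions.
model = nn.Sequential(
    nn.Linear(3, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 3),  # one (h, s, v) output per pixel
)

inputs = coordinate_inputs()
with torch.no_grad():
    targets = picbreeder_cppn(inputs)  # hypothetical handle to the evolved network

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
for step in range(10_000):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
```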

Notice that the skull is symmetric, even if not perfectly so. This observation raises the question: does the CPPN that generates this skull “know” that it is symmetric? In other words, does the CPPN explicitly represent this symmetry internally (in its neural circuits), i.e. in a unified way such that the two sides of the skull are not fractured into separate chains of computation?

One possible response is that it does not actually matter whether the network “knows” about the symmetry, as long as the output appears symmetric. After all, if the skull looks correct, why should it matter how the network processes or understands its symmetry?

One key initial insight is that, as it turns out, the CPPN behind the Picbreeder skull in Figure 4a does know that it is symmetric. That is, substantial evidence supports the idea that this CPPN fundamentally understands and represents the skull’s symmetry.
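That evidence comes from visualizing each hidden neuron’s behavior over the full input space. A hedged sketch of such a visualization follows, reusing `coordinate_inputs` from the earlier sketch; all names are illustrative, and the same function applies whether the network passed in is the evolved Picbreeder CPPN or the SGD-trained one.

```python
# Sketch: render one hidden neuron's activation over every (x, y, d)
# coordinate, so its full functional behavior is visible as an image.
import torch
import matplotlib.pyplot as plt

def neuron_activation_map(model, layer_index, neuron_index, size=256):
    """Capture a hidden layer's output via a forward hook and image one neuron."""
    captured = {}
    layer = list(model.children())[layer_index]
    handle = layer.register_forward_hook(
        lambda module, inp, out: captured.update(act=out.detach()))
    with torch.no_grad():
        model(coordinate_inputs(size))  # from the earlier sketch
    handle.remove()
    return captured["act"][:, neuron_index].reshape(size, size).numpy()

plt.imshow(neuron_activation_map(model, layer_index=1, neuron_index=0), cmap="gray")
plt.show()
```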

The underlying pathology of FER is also easily observed simply by sweeping the conventional SGD CPPN’s weights, as shown in Figure 6b. In the case of the Picbreeder CPPN we saw that the underlying, overall symmetry is usually preserved under perturbation (and when violated, as in winking, that occurs in a regular, smooth, natural way). Yet in the SGD CPPN, though its output appears identical, symmetry is almost always broken incoherently—the opposite behavior.
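A weight sweep of this kind can be sketched as follows: perturb one connection weight over a range of values, render the output at each setting, and inspect whether regularities like symmetry survive. This is an illustrative reconstruction under the same assumed names as before, not the paper’s code; the indices and sweep range are arbitrary.

```python
# Sketch of a single-weight sweep (illustrative; indices and range are arbitrary).
import copy
import torch

def weight_sweep(model, layer_index, i, j, deltas, size=256):
    """Render the output image for each perturbation of one weight."""
    inputs = coordinate_inputs(size)  # from the earlier sketch
    frames = []
    for delta in deltas:
        perturbed = copy.deepcopy(model)
        with torch.no_grad():
            list(perturbed.children())[layer_index].weight[i, j] += delta
            frames.append(perturbed(inputs).reshape(size, size, 3))
    # Viewed side by side: coherent variation suggests UFR,
    # incoherent breakage suggests FER.
    return frames

frames = weight_sweep(model, layer_index=0, i=5, j=2,
                      deltas=torch.linspace(-2.0, 2.0, steps=9))
```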

One potential reason to question this conclusion is the possibility that the feature space is simply being viewed in the wrong vector basis, and that a simple rotation might reveal more unified, holistic features. However, this scenario appears unlikely given the high degree of disorder in the conventional SGD skull representation—particularly the presence of arbitrary high-frequency patterns that bear little resemblance to the skull itself. Additional empirical evidence addressing this concern is presented in Appendix F, where PCA helps to visualize an alternative feature basis, and weight sweeps agnostic to the choice of basis continue to show evidence of FER (while, as expected, the Picbreeder skull maintains meaningful variation even along random directions in weight space).
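The PCA check described above can be sketched along these lines (again illustrative, reusing the earlier names; the actual analysis is in Appendix F): project a hidden layer’s activation maps onto their principal components and render each component as an image.

```python
# Sketch: visualize an alternative (rotated) feature basis via PCA.
import torch

def pca_component_maps(model, layer_index, size=256, k=4):
    """Image the top-k principal components of one hidden layer's activations."""
    captured = {}
    layer = list(model.children())[layer_index]
    handle = layer.register_forward_hook(
        lambda module, inp, out: captured.update(act=out.detach()))
    with torch.no_grad():
        model(coordinate_inputs(size))  # from the earlier sketch
    handle.remove()
    acts = captured["act"]                           # (pixels, neurons)
    _, _, v = torch.pca_lowrank(acts, q=k)           # principal directions
    centered = acts - acts.mean(dim=0)
    return (centered @ v).T.reshape(k, size, size)   # k component images
```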

This problem is deeper than just inefficiency: it limits the ability of the model to build anything new (like a new skull or new face) that would require an understanding of faces. Even though it draws a perfect skull, it does not understand the underlying regularities of what it is drawing, nor any modular decomposition of it. Recall that more examples are given in Appendix D, which also show the same general FER phenomenon. Appendix E shows that FER persists even when doing SGD with standard ReLU networks. In effect, the learned skull is an imposter: its external appearance implies an authentic underlying representation, but it is not the real thing under the hood.

5 Imposter Intelligence

We often think of individual images generated by large models as a single instance of generation. For that reason, it is important to recall that, though they appear superficially similar, the CPPN images in this paper are not analogous to a single large model generation step. Instead, these CPPN images are analogous to an entire space of inputs and outputs: each (x, y, d) pixel coordinate is an input and the corresponding pixel (h, s, v) is the output. In this way, CPPN images are actually a metaphor for the entire behavior and intelligence of a neural network (analogous to a comprehensive snapshot of the intelligence and personality of a network such as an LLM).
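In code terms, the claim is simply that the image is the network’s entire input-output behavior, produced by querying it at every coordinate. A minimal sketch with the same illustrative names as before:

```python
# Sketch: the rendered image is the CPPN's complete behavior over its input space.
import torch

def render(cppn, size=256):
    """Query the network at every (x, y, d) coordinate to obtain the full image."""
    with torch.no_grad():
        hsv = cppn(coordinate_inputs(size))  # one (h, s, v) output per pixel
    return hsv.reshape(size, size, 3)        # the whole input-output map, as an image
```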

In that context, the imposter skull is a simple metaphor for a vastly more complex “imposter intelligence.” It shows the danger of judging a book by its cover, except here, the “cover” is the sum total of all its behaviors. No matter what input we give the imposter skull CPPN, it outputs an unassailable match to our hopes and expectations, even though underneath the hood it looks nothing like it should.

Now is therefore a good time to revisit the question of whether it matters how information is represented internally as long as a model works well. In effect, the deeper question is whether an LLM (or any foundation model) can be analogous to the skull. Can it appear healthy on the surface (where “the surface” is now an almost inconceivably vast span of abilities) but beneath the hood be suffering from pervasive FER? And does that even matter?

First, note that it is at least theoretically possible to have FER in a large model: it simply means that in cases where the same information could have been reused effectively, instead it was not reused and the underlying principle is represented twice or more. Furthermore, capabilities that should be separate in principle may be entangled in subtle ways. For example, if the computational mechanism involved in counting bricks was fundamentally disjoint from the mechanism for counting apples, and the nature of the objects being counted interfered with counting itself, the concept of counting would exhibit FER. The skull’s simplicity makes it a useful conceptual model for understanding this idea: in the skull, we can literally see the symmetry in the internal representations of the pattern and in how outputs respond to weight perturbations. Through conventional SGD, by contrast, the representation usually ends up redundantly fractured and entangled, with two uncorrelated yet entangled pathways representing the two sides of the skull (evident in the weight perturbations in Figure 6b).

To transfer these observations and intuitions to an LLM, we will draw an analogy between the symmetry of the skull and the vast amount of conceptual regularities that occur throughout the breadth of human faculties.