Language Understanding and Pragmatics · LLM Reasoning and Architecture

Can models be smart without organized internal structure?

Explores whether linear feature decodability proves genuine compositional reasoning or merely indicates that the right features are present but poorly organized. Critical for understanding what performance metrics actually certify.

Note · 2026-02-23 · sourced from MechInterp
What actually happens inside the minds of language models?

Two findings from mechanistic interpretability appear contradictory but operate at different levels of representational analysis:

Fractured Entangled Representations (FER): As explored in "Can identical outputs hide broken internal representations?", SGD-trained models fail catastrophically under perturbation or distribution shift in ways that well-organized representations would not. The pathology is invisible to standard evaluation.

Compositional generalization at scale: Scaling data and model size produces representations where compositional features are linearly decodable — separable task constituents can be independently identified and manipulated. This has been taken as evidence for genuine compositional understanding.

The resolution: Linear decodability tests for the presence of features, not their organization. A fractured representation can contain every linearly decodable feature while the relations among those features remain broken. The compositional parts are present, but their composition is broken.
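The distinction can be made concrete with a toy sketch (illustrative only, not from any cited paper): two synthetic representations carry the same two binary task constituents, and a linear probe decodes each constituent perfectly from both, yet only one representation keeps the constituents on separate axes that can be manipulated independently.

```python
# Toy sketch: identical linear decodability, different organization.
# The constituent names and the mixing construction are illustrative
# assumptions, not drawn from the note or from any specific model.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a = rng.integers(0, 2, n)  # constituent 1, e.g. "shape"
b = rng.integers(0, 2, n)  # constituent 2, e.g. "color"

# Organized representation: each constituent gets its own axis.
organized = np.stack([a, b], axis=1).astype(float)

# Entangled representation: the same information, smeared across
# both axes by a random invertible mixing matrix.
M = rng.normal(size=(2, 2))
fractured = organized @ M

def probe_accuracy(X, y):
    """Closed-form least-squares linear probe with a bias term."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return ((Xb @ w > 0.5) == y).mean()

# Presence test: every feature is perfectly linearly decodable
# from BOTH representations, so a probe cannot tell them apart.
print(probe_accuracy(organized, a), probe_accuracy(fractured, a))

# Organization test: flipping axis 0 of `organized` changes only `a`
# and leaves `b` untouched. The same axis-0 edit on `fractured`
# shifts the decoded value of `b` as well, because `b`'s probe
# direction has weight on axis 0 after mixing.
```

The probe scores are identical (1.0 in both cases for this linear mixing), which is the point: a decodability benchmark certifies that the features exist, while the axis-edit behavior, which the benchmark never measures, is where the organization differs.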

This connects directly to the "imposter intelligence" post angle: "Can LLMs understand concepts they cannot apply?", "Does supervised fine-tuning actually improve reasoning quality?", and "Do foundation models learn world models or task-specific shortcuts?". All describe the same meta-pattern: surface metrics certify capability that internal structure analysis would disqualify.

The practical implication for model evaluation: passing compositional generalization tests does not guarantee robust compositional reasoning. Evaluation under distribution shift, perturbation, and novel recombination is required to distinguish genuine compositionality from fractured representations that happen to contain the right features.
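One way to operationalize the "novel recombination" requirement is a holdout split that tests only constituent pairs never seen together in training, while guaranteeing each individual constituent was seen. A minimal sketch, assuming tasks are labeled with two constituents (the names below are hypothetical):

```python
# Hedged sketch of a novel-recombination eval split. Constituent
# names ("count", "red", ...) are illustrative placeholders.
from itertools import product

def recombination_split(constituents_a, constituents_b, holdout_pairs):
    """Train on most (a, b) pairs; test only on pairs whose individual
    constituents appear in training but never together."""
    all_pairs = set(product(constituents_a, constituents_b))
    test = set(holdout_pairs)
    train = all_pairs - test
    # Sanity check: every held-out constituent occurs in some training
    # pair, so failure on `test` isolates broken composition rather
    # than a missing feature.
    seen_a = {a for a, _ in train}
    seen_b = {b for _, b in train}
    assert all(a in seen_a and b in seen_b for a, b in test)
    return sorted(train), sorted(test)

train, test = recombination_split(
    ["count", "filter"],
    ["red", "blue", "green"],
    holdout_pairs=[("count", "green")],
)
print(test)  # [('count', 'green')]
```

A model with organized compositional features should transfer to the held-out pair; a fractured representation that merely contains the right features may not, even though it passes any probe run on the training distribution.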

Takeaway: identical performance metrics can mask fundamentally different internal representations. Linear decodability of features does not guarantee representational organization.