How do internal representations compare to human cognitive structures?

This explores whether the way LLMs organize information inside themselves resembles human cognition — and the corpus answers in two directions at once: surprisingly human-like on some axes, fundamentally alien on others.

The collection's most interesting move is refusing a single answer. On the dimensions where meaning gets organized, models look startlingly human. When researchers unpacked the semantic structure inside LLM embeddings, twenty-eight semantic axes collapsed into three principal components that match the human 'EPA' structure (evaluation, potency, activity) psychologists have long used to describe how people judge concepts (see: Do LLM semantic features organize along human evaluation dimensions?). In reasoning, the convergence goes further: humans and LLMs succeed and fail along the *same* content-sensitivity axis on classic tests like the Wason task, which suggests 'content-independent' logic was never the thing separating real reasoning from pattern-matching in the first place (see: Do language models fail reasoning tests that humans pass?).
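The idea of many semantic axes collapsing into a few principal components can be illustrated with a minimal sketch. It assumes a matrix of concept scores on 28 axes and uses purely synthetic data built from three hidden factors, so it shows the analysis technique only and says nothing about real LLM embeddings. The function name `explained_variance_ratio` and all sizes are illustrative choices, not taken from the cited note.

```python
import numpy as np


def explained_variance_ratio(scores: np.ndarray) -> np.ndarray:
    """Return the share of total variance captured by each principal component."""
    centered = scores - scores.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Synthetic stand-in data (NOT real embeddings): 500 hypothetical concepts
# scored on 28 axes, generated from 3 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 28))
scores = latent @ loadings + 0.1 * rng.normal(size=(500, 28))

ratio = explained_variance_ratio(scores)
top3_share = float(ratio[:3].sum())
print(f"variance explained by the first three components: {top3_share:.3f}")
```

Because the toy data is constructed from three factors, the first three components dominate by design. On real embeddings, the number of components that matter is an empirical question.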

But the resemblance is partly an illusion of vantage point. One note borrows Habermas's distinction between observing a system from outside and participating in discourse with it: from the observer's seat, humans and LLMs are categorically different kinds of thing; from inside a shared conversation, both draw on the same symbolic substrate, making the gap structural rather than absolute (see: Do humans and LLMs differ fundamentally or just superficially?). A related framing argues that models operationalize Saussure's *langue*: they learn meaning purely as relational compression of text, with no external referents or embodied grounding at all (see: Can language models learn meaning without engaging the world?). So where human concepts are anchored to bodies and a world, the model's are anchored only to other words. Same shape, different ground.

Look closer at the wiring and the picture splits again. Networks can produce *identical* outputs while harboring radically different internal structure: SGD-trained models develop 'fractured, entangled' representations that block the kind of clean transfer and recombination evolved networks manage (see: Can identical outputs hide broken internal representations?). This decoupling of internal structure from external behavior is a recurring warning in the corpus: performance metrics alone hide how a model actually works (see: What actually happens inside the minds of language models?; What actually happens inside a language model?). Yet against that messiness, models also show human-reminiscent organization: they spontaneously decompose compositional tasks into isolated modular subnetworks, much like the functional specialization cognitive scientists find in brains (see: Do neural networks naturally learn modular compositional structure?).
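A toy sketch can make the decoupling of outputs from internals concrete. It assumes two tiny linear networks (hypothetical, hand-constructed, not trained with SGD) whose hidden layers differ by a rotation. This is the simplest case in which behavior is identical while hidden representations are not, and it does not model the 'fractured, entangled' structure described in the cited note.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 4))   # 5 example inputs with 4 features each

# Network A: input -> 3 hidden units -> 2 outputs (linear, for simplicity).
w_in_a = rng.normal(size=(4, 3))
w_out_a = rng.normal(size=(3, 2))

# Network B: rotate the hidden space with a random orthogonal matrix,
# then undo that rotation in the output weights.
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
w_in_b = w_in_a @ rotation
w_out_b = rotation.T @ w_out_a

hidden_a = inputs @ w_in_a
hidden_b = inputs @ w_in_b
out_a = hidden_a @ w_out_a
out_b = hidden_b @ w_out_b

print("same outputs:      ", np.allclose(out_a, out_b))
print("same hidden states:", np.allclose(hidden_a, hidden_b))
```

An evaluation that scores only `out_a` against `out_b` cannot tell the two networks apart. Any difference shows up only when the hidden states themselves are inspected.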

Where the comparison gets genuinely strange is metacognition, the mind watching itself. Models build causal mechanisms for tracking whether they actually *know* something about an entity, and that self-knowledge signal steers both hallucination and refusal (see: Do models know what they don't know?). They even sparsify their activations adaptively when a task drifts out of distribution, a localized filter that stabilizes performance under unfamiliarity rather than a breakdown (see: Do language models sparsify their activations under difficult tasks?). But genuine introspection is mostly absent: when a model 'reports' on its own states, it's usually echoing human descriptions from training data, not reading its own internals. True introspection only flickers in when a real causal chain links an internal state to the report (see: Can language models actually introspect about their own states?).
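One simple way to put a number on activation sparsity is sketched below. It assumes sparsity is measured as the fraction of hidden-state entries whose magnitude falls under a small threshold. That metric, the function name `activation_sparsity`, and the threshold value are assumptions for illustration, since the cited note does not specify how sparsity was measured. The two vectors are made-up examples, not model activations.

```python
import numpy as np


def activation_sparsity(hidden, threshold: float = 1e-2) -> float:
    """Fraction of entries with absolute value below `threshold`."""
    hidden = np.asarray(hidden, dtype=float)
    return float(np.mean(np.abs(hidden) < threshold))


# Made-up hidden states for illustration only.
dense_state = np.array([0.8, -1.2, 0.3, 0.5, -0.7, 1.1, 0.9, -0.4])
sparse_state = np.array([0.0, -1.2, 0.0, 0.001, 0.0, 1.1, 0.0, -0.005])

print("dense state sparsity: ", activation_sparsity(dense_state))
print("sparse state sparsity:", activation_sparsity(sparse_state))
```

Comparing this kind of score across in-distribution and out-of-distribution prompts would be one way to probe the effect. Other sparsity measures could give a different picture.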

The thing you might not have known you wanted to know: the human-likeness and the alien-ness aren't competing theories, they're the same finding seen at different layers. At the level of *how meaning is organized* (evaluation dimensions, content effects, modular structure, self-knowledge signals), models converge on human-shaped solutions, plausibly because language itself carries that structure. At the level of *what the representations are made of and grounded in* (entangled weights, relational-only meaning, borrowed self-reports), they diverge sharply. Models can even be trained to internalize their own evaluation rather than lean on external judges (see: Can models learn to evaluate their own work during training?), pushing the metacognitive resemblance further still. The honest comparison isn't 'like a mind' or 'not like a mind'; it's that they rediscover the human map while standing on entirely different ground.

Sources (12 notes)

Do LLM semantic features organize along human evaluation dimensions?

Twenty-eight semantic axes in LLM embeddings reduce to three principal components matching human EPA structure. Intervening on one feature predictably shifts aligned features proportionally, creating unavoidable off-target effects that reflect how meaning is fundamentally organized.

Do language models fail reasoning tests that humans pass?

Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.

Do humans and LLMs differ fundamentally or just superficially?

This note applies Habermas's observer/participant distinction to AI: from outside, humans and LLMs are utterly different; from within shared discourse, both draw on the same symbolic substrate, making the difference structural rather than absolute.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

What actually happens inside the minds of language models?

LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.

What actually happens inside a language model?

Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
