INQUIRING LINE

Why do language models capture individual differences in cognitive behavior?

This explores how — and whether — LLMs come to represent the differences between individual people's thinking, and what mechanism lets them do it.


The corpus is split on whether language models really encode individual differences in how people think. The strongest yes comes from work showing that models fine-tuned on raw psychology-experiment data predict human decisions more accurately than the theory-driven cognitive models that have dominated the field for decades and, crucially, pick up person-to-person variation in their embeddings without anyone designing a feature for it (Can language models learn to model human decision making?). The likely 'why' is almost mundane: a model trained to compress huge amounts of behavioral data may find that representing each person as a position in a continuous space is the most efficient way to predict what they'll do next. On this reading, individuality falls out as a byproduct of good compression, not as a designed capability.
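To make the compression argument concrete, here is a toy sketch. It is not the cited study's method, which fine-tunes an LLM on experiment data. It assumes a plain logistic choice model on synthetic data, and every name in it (`hidden_trait`, `person_pos`, the stimulus `x`) is hypothetical. Unlike the LLM case, the per-person slot is provided explicitly here; the point is only that its values are never hand-set and emerge from minimizing prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic participants: each has a hidden risk tolerance the model never sees.
n_people, n_trials = 20, 200
hidden_trait = np.linspace(-2.0, 2.0, n_people)

# x is a hypothetical per-trial stimulus (how attractive the risky option looks).
x = rng.normal(size=(n_people, n_trials))
p_risky = 1.0 / (1.0 + np.exp(-(x + hidden_trait[:, None])))
chose_risky = (rng.random((n_people, n_trials)) < p_risky).astype(float)

# Model: one shared weight plus a learned 1-D position per person.
w = 0.0
person_pos = np.zeros(n_people)

learning_rate = 1.0
for _ in range(300):
    logits = w * x + person_pos[:, None]
    pred = 1.0 / (1.0 + np.exp(-logits))
    err = pred - chose_risky  # gradient of the log-loss with respect to the logits
    w -= learning_rate * (err * x).mean()
    person_pos -= learning_rate * err.mean(axis=1)

# How well do the learned positions line up with the hidden traits?
recovery = np.corrcoef(person_pos, hidden_trait)[0, 1]
print(f"shared weight: {w:.2f}, trait recovery (correlation): {recovery:.2f}")
```

The only training signal is next-choice prediction, yet the learned positions end up ordered like the hidden traits. That is the sense in which individuality can be a byproduct of fitting behavior rather than a designed feature.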

But the same trick that captures *who someone is* doesn't extend to capturing *how someone reasons over time*. Models largely fail to track individualized reasoning styles as a person's strategy evolves: GPT-4o leans on surface lexical cues instead of anchoring to a changing trajectory, and although DeepSeek-R1 shows early promise, dynamic style adaptation remains insufficient across all models tested (Can models recognize how individuals reason differently?). So the corpus draws a sharp line: a static snapshot of a person is learnable; a dynamic, adapting style is, so far, not. That should make you suspicious of any claim that these models 'understand' individuals. What they capture may be a frozen statistical fingerprint rather than a living model of a mind.

The mechanism behind that fingerprint shows up vividly in two other places. Behavioral traits can transmit between models through data that has *no semantic relationship* to the trait at all: the signal rides on statistical signatures rather than meaning, and it is so model-specific that it fails across different architectures (Can language models transmit hidden behavioral traits through unrelated data?). And researchers have located these traits as actual linear directions in activation space, 'persona vectors' for things like sycophancy and hallucination, that can be monitored and steered (Can we track and steer personality shifts during model finetuning?). Put together, these suggest individual differences are captured the same way: as geometric structure in the model's internal space, not as anything resembling explicit knowledge about a person.
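What "a linear direction that can be monitored and steered" means can be sketched in a few lines. This is an illustration under stated assumptions, not the cited work's procedure: the activations are synthetic random vectors rather than real model activations, the direction is estimated with a simple difference of means (an assumed, common choice), and the function names are hypothetical.

```python
import numpy as np

def persona_direction(acts_with_trait, acts_without_trait):
    """Unit vector from the mean 'without' activation to the mean 'with' activation."""
    diff = acts_with_trait.mean(axis=0) - acts_without_trait.mean(axis=0)
    return diff / np.linalg.norm(diff)

def trait_score(activations, direction):
    """Monitoring: scalar projection of each activation onto the direction."""
    return activations @ direction

def steer(activations, direction, strength):
    """Steering: shift activations along the direction (negative strength suppresses)."""
    return activations + strength * direction

# Synthetic stand-in for hidden states: the trait shifts one hidden axis.
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[3] = 1.0
with_trait = rng.normal(size=(200, dim)) + 2.0 * true_dir
without_trait = rng.normal(size=(200, dim))

direction = persona_direction(with_trait, without_trait)
mean_with = trait_score(with_trait, direction).mean()
mean_without = trait_score(without_trait, direction).mean()

# Pushing against the direction lowers the monitored score by exactly the strength.
steered = steer(with_trait, direction, strength=-2.0)
mean_steered = trait_score(steered, direction).mean()
```

Because the direction has unit length, adding `strength * direction` changes the projected score by exactly `strength`. That is why a single vector can serve as both a monitor and a control knob in this simplified picture.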

There's a real tension worth sitting with, though. While fine-tuned models flexibly encode the differences between *other* people, the models themselves are remarkably bad at *being* different on demand. Most open LLMs stubbornly retain their trained default personality and resist being prompted into another one (Can open language models adopt different personalities through prompting?), and system prompts and RLHF training tend to lock a model into a single communicative identity that can't switch register the way human pragmatics requires (Can language models adapt communication style to different contexts?). So the picture is asymmetric: a model can model your individuality far better than it can vary its own.

The thing you might not have expected to learn: 'capturing individual differences' and 'understanding individuals' are not the same achievement. The corpus suggests models excel at the former precisely because individuality compresses into geometry, and stumble at the deeper version (tracking a person as they change, or genuinely shifting their own identity) for the same reason. If you want to follow the skeptical thread further, the work on whether reasoning traces reflect real computation or just persuasive mimicry (Do reasoning traces show how models actually think?) is the natural next stop.


Sources (7 notes)

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Can models recognize how individuals reason differently?

LLMs struggle to anchor reasoning in temporal gameplay and adapt to evolving strategies. GPT-4o relies on surface lexical cues while DeepSeek-R1 shows early promise, but dynamic style adaptation remains largely insufficient across all models tested.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Can language models adapt communication style to different contexts?

System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
