Psychology and Social Cognition · LLM Reasoning and Architecture · Language Understanding and Pragmatics

Do large language models genuinely simulate mental states?

This note explores whether LLMs perform genuine theory of mind (ToM) reasoning or rely on surface-level pattern matching. The distinction matters because evaluation format—multiple-choice versus open-ended—reveals very different capability levels.

Note · 2026-02-22 · sourced from Theory of Mind
How should researchers navigate LLM reasoning research? Why do LLMs excel at social norms yet fail at theory of mind?

The evaluation format determines what you learn about ToM capability. Multiple-choice and short-answer tasks allow models to succeed through pattern matching and elimination — selecting the most plausible option without genuinely simulating another agent's mental state. Open-ended scenarios strip away these scaffolds.
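
A minimal sketch of that contrast, posing the same Sally–Anne-style false-belief scenario in both formats. The helper names and option labels are illustrative, not drawn from any specific benchmark:

```python
# Same scenario, two formats. Multiple-choice scaffolds the answer;
# open-ended forces the model to generate the belief attribution itself.

SCENARIO = (
    "Sally puts her ball in the basket and leaves. "
    "Anne moves the ball to the box while Sally is away."
)

def as_multiple_choice(scenario: str) -> str:
    # Options permit elimination: a model can pattern-match
    # "last seen location" cues without simulating Sally's belief.
    return (
        f"{scenario}\nWhere will Sally look for her ball?\n"
        "A) the basket  B) the box  C) the table"
    )

def as_open_ended(scenario: str) -> str:
    # No options to eliminate: the model must produce the belief
    # attribution and explain why Sally's belief is now outdated.
    return (
        f"{scenario}\nDescribe what Sally believes about the ball's "
        "location when she returns, and explain why."
    )
```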

The ChangeMyView evaluation (Reddit persuasion data requiring nuanced social reasoning) reveals "clear disparities in ToM reasoning capabilities" between humans and LLMs, including the most advanced models. Incorporating human intentions and emotions through prompt tuning improves performance but "still falls short of fully achieving human-like reasoning." The gap persists because the task demands genuine perspective-taking — crafting a persuasive response requires modeling the other person's beliefs, values, and emotional state simultaneously.

The FANTOM benchmark confirms this in conversational contexts: GPT-4, Llama 2, Falcon, and Mistral all show "significant challenges" maintaining ToM reasoning performance compared to humans, even with chain-of-thought reasoning or fine-tuning. The consistency problem is key — models don't fail uniformly but "often default to surface-level reasoning strategies rather than engaging in deep, robust ToM reasoning."

The ATOMS taxonomy (Abilities in Theory of Mind Space) identifies seven component abilities: Intentions, Percepts, Beliefs, Emotions, Knowledge, Desires, and Non-literal Communication. Current benchmarks typically test only a few of these. Open-ended evaluation forces models to integrate multiple components simultaneously, which is where the breakdown occurs.
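
As a sketch of how the taxonomy can be used in practice, the snippet below audits which ATOMS components a benchmark's items touch. The category names follow the taxonomy above; the benchmark annotations are hypothetical placeholders:

```python
# Audit benchmark coverage against the seven ATOMS abilities.
ATOMS = {
    "intentions", "percepts", "beliefs", "emotions",
    "knowledge", "desires", "non_literal_communication",
}

# Hypothetical annotations: which abilities each suite's items probe.
benchmark_coverage = {
    "example_false_belief_suite": {"beliefs", "knowledge"},
    "example_sarcasm_suite": {"non_literal_communication", "intentions"},
}

for name, abilities in benchmark_coverage.items():
    missing = ATOMS - abilities
    print(f"{name}: covers {len(abilities)}/7, missing {sorted(missing)}")
```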

The practical implication for evaluation design: if you only test ToM with structured questions, you will overestimate capability. The format gap between structured and open-ended tasks is itself a measurement of how much ToM performance depends on task scaffolding rather than genuine mental state simulation.
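
One way to make that measurement concrete is to report the structured-versus-open-ended accuracy difference directly. A minimal sketch, with placeholder numbers rather than reported results:

```python
# The "format gap": how much accuracy depends on task scaffolding.
def format_gap(acc_structured: float, acc_open_ended: float) -> float:
    # A large positive gap suggests performance leans on scaffolding
    # rather than on genuine mental-state inference.
    return acc_structured - acc_open_ended

print(format_gap(acc_structured=0.85, acc_open_ended=0.40))  # ≈ 0.45
```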

Hybrid Bayesian architecture as structural fix. LAIP (LLM-Augmented Inverse Planning, introduced in "Towards Machine Theory of Mind with LLM-Augmented Inverse Planning") addresses the surface-strategy default by combining LLM hypothesis generation with Bayesian inverse planning. The LLM generates prior hypotheses about agent preferences and likelihood functions for different actions; a Bayesian model computes posterior probabilities given observed actions. This hybrid outperforms LLM-alone and CoT prompting, even with smaller LLMs that typically fail ToM tasks. The architecture forces genuine mental state inference: the Bayesian backbone requires explicit probability updates over preference hierarchies rather than allowing pattern-matched shortcuts. In the paper's example, when the Japanese restaurant is closed, the model correctly infers the agent's preference ordering from the subsequent action sequence — exactly the kind of dynamic belief tracking on which pure LLM approaches fall back to surface strategies.
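
A minimal sketch of an LAIP-style update loop, under the assumption that the LLM supplies both the prior over preference hypotheses and a per-action likelihood. The hypothesis names and numbers are illustrative stand-ins, not the paper's implementation:

```python
# Bayesian inverse planning step: LLM-proposed prior and likelihoods,
# exact Bayes' rule for the belief update over preference hypotheses.

def bayesian_update(prior: dict[str, float],
                    likelihood: dict[str, float]) -> dict[str, float]:
    # posterior ∝ prior × P(action | hypothesis), then normalize.
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# Assumed LLM-generated prior over the agent's preference ordering.
prior = {"prefers_japanese": 0.5, "prefers_italian": 0.3, "prefers_thai": 0.2}

# Assumed LLM-scored likelihood of the observed action — heading to the
# Thai place after finding the Japanese restaurant closed — under each
# hypothesis (e.g., high if Thai is the agent's first or second choice).
likelihood = {"prefers_japanese": 0.6, "prefers_italian": 0.1, "prefers_thai": 0.9}

posterior = bayesian_update(prior, likelihood)
print(posterior)
```

The explicit posterior over hypotheses is the point of the hybrid design: each observed action must revise the belief state, which is what blocks the pattern-matched shortcuts that structured formats otherwise permit.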


Source: Theory of Mind

Original note title: llm theory of mind defaults to surface-level strategies rather than genuine mental state simulation — open-ended scenarios expose what structured questions hide