Why do reasoning models perform worse on theory of mind tasks?
This explores why the newest reasoning-optimized models (the ones built to think step-by-step) actually do *worse* at reading minds — tracking what other people believe, want, or falsely assume — than older, less specialized models.
The corpus points to a surprising answer: the very training that makes these models better at math and logic seems to actively interfere with social cognition. On benchmarks like Decrypto, Claude 3.7 Sonnet and o1 score worse than both humans and simple word-embedding baselines on tasks involving false beliefs and perspective change [Why do reasoning models fail at theory of mind tasks?] [Why do advanced reasoning models fail at understanding minds?]. Cranking up reasoning effort doesn't help: it produces longer chains of thought that don't generalize to similar scenarios, and it sometimes makes things worse.
The most compelling explanation is that social reasoning and formal reasoning are *different kinds of cognition that want different machinery* [Why do reasoning models struggle with theory of mind tasks?]. Formal reasoning is sequential: derive step B from step A, march toward an answer. But tracking minds means holding several incompatible models of the world open *at once*: what I know, what you know, what you wrongly think I know. Approaches that succeed (like ThoughtTracing's short Bayesian hypothesis tracking, or the MetaMind framework, which splits the job across three specialized agents for hypothesis generation, moral filtering, and response validation) work precisely because they maintain multiple parallel hypotheses rather than grinding out one long deductive trace [Can AI decompose social reasoning into distinct cognitive stages?]. Long sequential reasoning may be the wrong tool entirely. Even in ordinary tasks, accuracy follows an inverted U as chains get longer, and more capable models naturally prefer *shorter* chains [Why does chain of thought accuracy eventually decline with length?].
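To make "holding several hypotheses open at once" concrete, here is a minimal sketch of Bayesian hypothesis tracking over what another agent believes. It is not ThoughtTracing's implementation. The hypothesis names, the Sally-Anne style scenario, and every probability are assumptions chosen purely for illustration.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, float]


def normalize(weights: Weights) -> Weights:
    """Rescale weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("every hypothesis has zero weight")
    return {h: w / total for h, w in weights.items()}


def update(weights: Weights, likelihoods: Weights) -> Weights:
    """One Bayes step: scale each hypothesis by P(observation | hypothesis)."""
    return normalize({h: w * likelihoods[h] for h, w in weights.items()})


def track(prior: Weights, observations: List[Tuple[str, Weights]]) -> Weights:
    """Fold observations into the prior while keeping every hypothesis alive."""
    weights = normalize(prior)
    for _description, likelihoods in observations:
        weights = update(weights, likelihoods)
    return weights


# Competing hypotheses about where Sally BELIEVES the marble is
# (the marble is really in the box). All numbers are invented.
prior = {"sally_thinks_basket": 0.5, "sally_thinks_box": 0.5}
observations = [
    ("Sally was out of the room when the marble was moved",
     {"sally_thinks_basket": 0.9, "sally_thinks_box": 0.1}),
    ("Sally walks toward the basket",
     {"sally_thinks_basket": 0.8, "sally_thinks_box": 0.2}),
]

posterior = track(prior, observations)
for hypothesis, weight in posterior.items():
    print(f"{hypothesis}: {weight:.3f}")
```

With these made-up likelihoods, the false-belief hypothesis ends up with roughly 97% of the weight. The point of the design is that the competing hypothesis is never discarded: it is only down-weighted, so a later observation could still revive it, which a single committed deductive trace cannot do.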
There's a twist worth knowing, though: part of the "failure" may be a measurement problem. Several notes argue that current ToM benchmarks can be beaten by pattern-matching without any genuine mental-state reasoning: supervised fine-tuning matches reinforcement learning, and models exploit templated artifacts and distribution biases [Can language models solve ToM benchmarks without real reasoning?]. So when a reasoning model scores low, it may be that it has stopped relying on the cheap surface shortcut the benchmark rewards without having built real belief-tracking to replace it. Reinforcement learning on these tasks also behaves strangely with scale: 7B models develop explicit, transferable belief-tracking, while smaller ones reach comparable accuracy through shortcuts that only show up when you inspect the step-by-step outputs [Does reinforcement learning on theory of mind collapse with model scale?].
Zoom out and a broader pattern in the corpus reframes the whole question. Even in non-social domains, reasoning-model failures often aren't failures of reasoning at all. They're failures of *novelty* (models break on unfamiliar instances, not complex ones) [Do language models fail at reasoning due to complexity or novelty?] or of *execution* (the model knows the algorithm but can't carry it out in text alone) [Are reasoning model collapses really failures of reasoning?]. Theory of mind may be the purest case of a task that resists the deductive, single-thread style these models were optimized into. And the social deficit isn't limited to ToM benchmarks: frontier models that solve problems alone get *worse* when made to collaborate, agreeing more than 90% of the time regardless of correctness. Self-play preference training improves outcomes by 16.7%, hinting that the social gap is trainable, not permanent [Why do language models fail at collaborative reasoning?].
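One way to read "agreeing regardless of correctness" operationally is to compare how often a model agrees when its partner is right with how often it agrees when its partner is wrong. The sketch below is an assumption about how such a measurement could be set up, not the cited note's protocol. The `Turn` record and the toy data are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Turn(NamedTuple):
    partner_correct: bool  # was the partner's proposal actually right?
    model_agreed: bool     # did the model go along with it?


def agreement_rate(turns: List[Turn], partner_correct: Optional[bool] = None) -> float:
    """Share of turns where the model agreed, optionally filtered by partner correctness."""
    selected = [
        t for t in turns
        if partner_correct is None or t.partner_correct == partner_correct
    ]
    if not selected:
        raise ValueError("no turns match the filter")
    return sum(t.model_agreed for t in selected) / len(selected)


# Invented toy log: 4 turns with a correct partner, 4 with an incorrect one.
turns = [
    Turn(True, True), Turn(True, True), Turn(True, True), Turn(True, True),
    Turn(False, True), Turn(False, True), Turn(False, True), Turn(False, False),
]

overall = agreement_rate(turns)
when_right = agreement_rate(turns, partner_correct=True)
when_wrong = agreement_rate(turns, partner_correct=False)
print(overall, when_right, when_wrong)
```

A collaborator that is actually evaluating proposals should show a large gap between the last two numbers. A small gap, with both rates high, is the indiscriminate-agreement pattern the passage describes.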
The thing you might not have known you wanted to know: there's real evidence the gap is *architectural*, not just a training shortfall. Hybrid Bayesian architectures that force explicit belief-tracking outperform LLMs left to their own devices, which suggests these models default to surface social heuristics and can't be reasoned into genuine mental simulation by sheer effort [Do large language models genuinely simulate mental states?]. More chain-of-thought doesn't buy you a theory of mind, and might cost you one.
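What "forcing explicit belief-tracking" might look like can be sketched with a small symbolic tracker that stores each agent's beliefs separately from the true world state and updates them only for agents who witness an event. This is an illustrative assumption, not the architecture from the cited note. The `BeliefTracker` class and its methods are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class BeliefTracker:
    world: Dict[str, str] = field(default_factory=dict)               # object -> true location
    beliefs: Dict[str, Dict[str, str]] = field(default_factory=dict)  # agent -> object -> believed location
    present: Set[str] = field(default_factory=set)                    # agents currently in the room

    def enter(self, agent: str) -> None:
        # Entering does not reveal hidden objects, so beliefs are left unchanged.
        self.present.add(agent)
        self.beliefs.setdefault(agent, {})

    def leave(self, agent: str) -> None:
        self.present.discard(agent)

    def move(self, obj: str, location: str) -> None:
        # The world always changes; only witnesses update their beliefs.
        self.world[obj] = location
        for agent in self.present:
            self.beliefs[agent][obj] = location

    def believed_location(self, agent: str, obj: str) -> Optional[str]:
        return self.beliefs.get(agent, {}).get(obj)


# Classic false-belief setup.
tracker = BeliefTracker()
tracker.enter("sally")
tracker.enter("anne")
tracker.move("marble", "basket")  # both see this
tracker.leave("sally")
tracker.move("marble", "box")     # only Anne sees this
tracker.enter("sally")

print(tracker.world["marble"])                       # true location
print(tracker.believed_location("sally", "marble"))  # Sally's (false) belief
print(tracker.believed_location("anne", "marble"))   # Anne's belief
```

Because the world state and each agent's beliefs live in separate structures, the false belief cannot be overwritten by the narrator's own knowledge. In a hybrid setup, a language model would presumably handle parsing the story into `enter`, `leave`, and `move` events, though that division of labor is also an assumption here.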
Sources (11 notes)
Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.
Claude 3.7 Sonnet and o1 underperform older models on ToM benchmarks like Decrypto. Increased reasoning effort does not improve social cognition and may actively interfere with it.
Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.
The MetaMind framework—using three specialized agents for hypothesis generation, moral filtering, and response validation—achieved 35.7% improvement on real social scenarios and matched average human performance on theory-of-mind tasks, with ablations confirming all stages are necessary.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Frontier LLMs that solve problems alone fail when collaborating, achieving >90% agreement regardless of correctness. Self-play preference training improves outcomes by 16.7%, suggesting social skills for effective disagreement can be trained.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.