Why do reasoning models perform poorly at theory of mind tasks?
This explores why models specifically optimized for step-by-step reasoning (like o1 and Claude 3.7 Sonnet) actually get *worse* at inferring what other people think, want, or believe — and what that reveals about the difference between logical reasoning and social reasoning.
Theory of mind tasks require tracking what someone else believes, including false beliefs and counterfactuals. The striking finding across the corpus is that the gap is not small: on benchmarks like Decrypto, Claude 3.7 Sonnet and o1 score *worse* than older models, worse than humans, and even worse than simple word-embedding baselines [Why do reasoning models fail at theory of mind tasks?] [Why do advanced reasoning models fail at understanding minds?]. The unsettling part is the direction of the effect: more reasoning effort doesn't help and may actively interfere.
The leading explanation is that social reasoning is a categorically different kind of cognition from formal reasoning, not just a harder version of it. Formal reasoning is sequential: derive step B from step A and chain forward. Tracking minds instead seems to require holding *several* candidate belief-states in play at once: what I know, what you know, what you think I know. When models are pushed to produce long deductive chains, the traces get longer but accuracy does not improve, and performance does not generalize to similar scenarios [Why do reasoning models struggle with theory of mind tasks?]. Tellingly, approaches such as ThoughtTracing that use *shorter* Bayesian hypothesis-tracking, maintaining multiple models in parallel rather than deriving one answer linearly, outperform the long-chain reasoners. Hybrid architectures that force explicit belief-tracking beat the LLM-alone setup, which suggests the deficit is structural rather than something more training alone would fix [Do large language models genuinely simulate mental states?].
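To make the contrast between sequential derivation and parallel hypothesis-tracking concrete, here is a minimal Python sketch of a classic false-belief scenario. It is a toy illustration only, not the ThoughtTracing method or any cited architecture: the names `Event`, `update_belief`, `observe_search`, and `track`, and the probabilities used, are all assumptions made up for this example. The point it shows is that the tracker keeps a weighted set of candidate beliefs for the other agent, separate from the true world state, instead of committing to one answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A move of the object (hypothetical structure for this sketch)."""
    new_location: str   # where the object ends up
    p_agent_saw: float  # assumed probability the tracked agent witnessed the move


def update_belief(belief, event):
    """Split every hypothesis into 'agent saw the move' and 'agent missed it'."""
    posterior = {}
    for location, prob in belief.items():
        # Branch 1: the agent saw the move, so her belief jumps to the new location.
        saw = prob * event.p_agent_saw
        posterior[event.new_location] = posterior.get(event.new_location, 0.0) + saw
        # Branch 2: the agent missed the move, so her belief stays where it was.
        missed = prob * (1.0 - event.p_agent_saw)
        posterior[location] = posterior.get(location, 0.0) + missed
    return posterior


def observe_search(belief, searched, p_consistent=0.9):
    """Bayes update: reweight hypotheses by how well they explain where the agent looked."""
    weighted = {
        loc: prob * (p_consistent if loc == searched else 1.0 - p_consistent)
        for loc, prob in belief.items()
    }
    total = sum(weighted.values())
    return {loc: w / total for loc, w in weighted.items()}


def track(initial_location, events):
    """Return the true world state and the distribution over the agent's belief."""
    world = initial_location
    belief = {initial_location: 1.0}  # the agent placed the object herself
    for event in events:
        world = event.new_location    # the world always changes
        belief = update_belief(belief, event)  # the agent's belief only might
    return world, belief


# Sally puts the marble in the basket; Anne moves it to the box while Sally
# is (probably) out of the room. The 0.1 is an assumed value.
world, belief = track("basket", [Event("box", p_agent_saw=0.1)])
print("world:", world)
print("Sally's belief:", belief)

# Later Sally is seen searching the basket, which strengthens the
# false-belief hypothesis.
belief_after = observe_search(belief, searched="basket")
print("after watching her search:", belief_after)
```

The world state and the belief distribution diverge after a single unobserved move, and every update is one short step over all hypotheses at once. A purely linear chain of deductions has no equivalent place to keep the alternatives it has not chosen.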
There's a second, more deflationary thread worth sitting with: maybe the models were never really doing theory of mind in the first place, and the benchmarks let them fake it. Current ToM benchmarks turn out to be solvable through pattern matching alone. Supervised fine-tuning matches reinforcement learning on them, and templated artifacts and distribution biases let surface-level recognition score well without any genuine mental-state modeling [Can language models solve ToM benchmarks without real reasoning?]. If that's true, then 'reasoning models get worse' may partly mean that the reasoning process disrupts a shortcut older models were quietly relying on, exposing that neither kind of model was truly mind-reading.
This connects to a broader pattern in how these models fail socially. They accommodate false presuppositions even when they demonstrably know the correct facts [Why do language models accept false assumptions they know are wrong?], and frontier models that solve problems alone collapse when they have to collaborate, converging on agreement regardless of correctness [Why do language models fail at collaborative reasoning?]. The common thread: knowing a fact and modeling another agent's relationship to that fact are different competencies, and optimizing hard for the first doesn't deliver the second. Scale interacts with this strangely too. Under RL training on social tasks, larger models (7B in the cited note) develop transferable belief-tracking, while smaller ones learn shortcuts that match the accuracy without the reasoning and stay invisible unless the step-by-step outputs are inspected [Does reinforcement learning on theory of mind collapse with model scale?].
A less obvious payoff: this debate doubles as a probe into what 'reasoning' even means. One camp argues these collapses aren't reasoning failures at all but *execution* failures: text-only models can't carry out long procedures even when they know the algorithm, and giving them tools fixes it [Are reasoning model collapses really failures of reasoning?]. Another shows that failures track instance *novelty*, not task complexity, because models fit patterns from similar examples rather than learning general algorithms [Do language models fail at reasoning due to complexity or novelty?]. Theory of mind may be the cleanest place to see this, because you can't pattern-match your way to genuinely tracking what someone else believes, and that's exactly where the reasoning models break.
Sources (10 notes)
Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.
Claude 3.7 Sonnet and o1 underperform older models on ToM benchmarks like Decrypto. Increased reasoning effort does not improve social cognition and may actively interfere with it.
Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.
Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
Frontier LLMs that solve problems alone fail when collaborating, achieving >90% agreement regardless of correctness. Self-play preference training improves outcomes by 16.7%, suggesting social skills for effective disagreement can be trained.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. They fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds as long as the model was trained on similar instances.