Why do reasoning models struggle with theory of mind tasks?
Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.
ThoughtTracing, a sequential Monte Carlo (SMC)-inspired algorithm for mental state tracking, produces its most important finding not through its own performance but through what it reveals about existing reasoning models on theory of mind (ToM) tasks.
Four behavioral patterns emerge:
1. Reasoning models don't consistently outperform vanilla LLMs prompted with chain-of-thought. The extended reasoning training that dramatically improves math and coding does not transfer to social cognition.
2. They fail to generalize across similar scenarios. A reasoning model that correctly tracks mental states in one ToM scenario fails on structurally similar ones, suggesting pattern matching rather than a generalizable mental-state-tracking mechanism.
3. They produce significantly longer reasoning traces for ToM than for factual questions. The model "knows" social reasoning is hard and allocates more tokens to it, but the effort is unproductive.
4. Reasoning effort (output length) does not correlate with performance. More thinking does not help. This is the sharpest contrast with formal domains, where longer chains generally improve accuracy up to a threshold. A minimal way to check this on an evaluation log is sketched just below.
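A minimal sketch of the check referenced in pattern 4, assuming per-question evaluation records with hypothetical field names; SciPy's `pearsonr` is one standard choice, and with a binary correctness outcome it reduces to the point-biserial correlation:

```python
from scipy.stats import pearsonr

# Hypothetical per-question records: reasoning-trace length in tokens,
# plus whether the final answer was correct (1) or not (0).
records = [
    {"trace_tokens": 812, "correct": 0},
    {"trace_tokens": 341, "correct": 1},
    {"trace_tokens": 905, "correct": 0},
    {"trace_tokens": 477, "correct": 1},
]

lengths = [r["trace_tokens"] for r in records]
correct = [r["correct"] for r in records]

# The finding above is that no such correlation exists for ToM:
# more reasoning tokens do not buy more accuracy on mental-state questions.
r, p = pearsonr(lengths, correct)
print(f"length-accuracy correlation: r={r:.2f}, p={p:.3f}")
```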
These patterns suggest social reasoning is "a different category" from mathematical or programming reasoning "where reasoning models typically excel." The authors explicitly position this as a domain where inference-time reasoning research has been neglected.
The ThoughtTracing algorithm itself offers a clue about what social reasoning requires that formal reasoning doesn't: hypothesis-driven Bayesian tracking of multiple evolving mental-state possibilities, weighted by observation likelihood. This is structurally different from a derivational chain: social reasoning requires maintaining multiple simultaneous models of what different agents believe, not sequentially deriving conclusions from premises. The algorithm outperforms reasoning models (including o3-mini and R1) while using "significantly shorter reasoning traces", suggesting that the efficiency comes from the right structure, not from more tokens.
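To make the structural contrast concrete, below is a minimal particle-filter-style sketch of what hypothesis-driven Bayesian tracking looks like. It is an illustration under stated assumptions, not ThoughtTracing itself: the actual algorithm uses an LLM to propose and weight natural-language hypotheses, whereas the `toy_likelihood` scorer, the particle count, and all names below are invented for exposition.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    belief: str    # candidate description of an agent's mental state
    weight: float  # how well it explains the observations so far

def trace_mental_states(
    initial_beliefs: list[str],
    observations: list[str],
    likelihood: Callable[[str, str], float],
    n_particles: int = 8,
) -> list[Hypothesis]:
    """SMC-style tracking: keep several mental-state hypotheses alive in
    parallel, reweighting and resampling them as observations arrive."""
    particles = [Hypothesis(b, 1.0 / len(initial_beliefs)) for b in initial_beliefs]
    for obs in observations:
        # Reweight: hypotheses that explain the new observation gain mass.
        for p in particles:
            p.weight *= likelihood(p.belief, obs)
        total = sum(p.weight for p in particles) or 1.0
        # Resample: prune implausible hypotheses, duplicate plausible ones,
        # so effort concentrates on beliefs consistent with the evidence.
        particles = random.choices(
            particles, weights=[p.weight / total for p in particles], k=n_particles
        )
        particles = [Hypothesis(p.belief, 1.0 / n_particles) for p in particles]
    return particles

def toy_likelihood(belief: str, observation: str) -> float:
    # Stand-in for the LLM judge: score only the hypothesized location word.
    location = belief.split()[-1]
    return 1.0 if location in observation else 0.2

# Sally-Anne-style false-belief story: only events Sally witnesses should
# update the model of *her* belief, so unwitnessed events are filtered out.
events = [
    ("Sally puts the marble in the basket", True),   # Sally sees this
    ("Anne moves the marble to the box", False),     # Sally is out of the room
]
beliefs = ["Sally thinks the marble is in the basket",
           "Sally thinks the marble is in the box"]
witnessed = [e for e, seen in events if seen]
print(trace_mental_states(beliefs, witnessed, toy_likelihood))
```

The design point is the loop structure: several candidate mental states stay alive in parallel and are reweighted as evidence arrives, and events the agent never witnesses simply don't update her belief. A chain-of-thought trace, by contrast, commits to a single line of derivation and can only grow longer, not broader.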
Source: Theory of Mind
Related concepts in this collection
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges the assumption that longer reasoning traces indicate better reasoning and raises the question of what length actually signals.
  The ToM finding inverts the usual pattern: on ToM, reasoning models produce longer traces that don't help, while ThoughtTracing uses shorter traces that do.
- Why do reasoning models fail at theory of mind tasks?
  Recent LLMs optimized for formal reasoning dramatically underperform at social reasoning tasks like false belief and recursive belief modeling. This explores whether reasoning optimization actively degrades the ability to track other agents' mental states.
  Independent confirmation from Decrypto: the formal-reasoning ↔ social-reasoning tension is robust across multiple benchmarks.
- Do large language models use one reasoning style or many?
  Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.
  Game-based strategic reasoning is similarly fragmented; social/ToM reasoning is yet another non-transferable domain.
- When does explicit reasoning actually help model performance?
  Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
  ToM extends the domain taxonomy: formal (reasoning helps) vs. nuanced judgment (reasoning hurts) vs. social (reasoning is irrelevant).
- Do iterative refinement methods suffer from overthinking?
  Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
  The ToM finding is a cross-domain instance of the overthinking pattern: reasoning models allocate more tokens to social reasoning, but the additional effort is unproductive, confirming that sequential token extension fails outside derivational domains.
- Can language models track how minds change during persuasion?
  Do LLMs understand evolving mental states in persuasive dialogue, or do they only capture fixed attitudes? This explores whether models can update their reasoning as a person's beliefs shift across conversation turns.
  The static/dynamic split provides a finer-grained taxonomy: social reasoning is not uniformly hard but splits into static (near-human) and dynamic (significantly worse), with CoT helping strategy prediction but not dynamic mental state tracking.
Original note title: social reasoning differs categorically from formal reasoning — reasoning effort does not correlate with ToM performance