Why does reasoning volume fail to improve theory of mind performance?
This explores why making models 'think harder' — longer chains of thought, more reasoning tokens, reasoning-tuned architectures — doesn't help (and often hurts) at tasks that require tracking what other people believe and intend.
The corpus's blunt answer: social reasoning is not a longer version of formal reasoning. It is a different shape of computation, and the machinery that scales math and logic can actively work against it. Several papers document the inversion directly: advanced reasoning models like Claude 3.7 Sonnet and o1 score *worse* than older models, humans, and even simple word-embedding baselines on theory-of-mind benchmarks like Decrypto, with more reasoning effort interfering rather than helping (Why do advanced reasoning models fail at understanding minds?, Why do reasoning models fail at theory of mind tasks?).
The most useful explanation is architectural, not quantitative. Formal reasoning is sequential derivation — one step licenses the next. Mind-reading instead demands holding several incompatible models of the world in parallel: what I know, what you know, what you think I know. One analysis shows reasoning models produce *longer but unhelpful* traces and fail to generalize to similar scenarios, while a lightweight method (ThoughtTracing) that does short Bayesian hypothesis-tracking succeeds — suggesting the task wants simultaneous multiple-model maintenance, not a deeper single chain (Why do reasoning models struggle with theory of mind tasks?). That same theme recurs elsewhere: hybrid Bayesian architectures that *force* explicit belief tracking beat LLM-alone approaches, which otherwise default to surface strategies that look like perspective-taking but aren't (Do large language models genuinely simulate mental states?).
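To make "simultaneous multiple-model maintenance" concrete, here is a minimal Python sketch of per-agent belief tracking on a Sally-Anne style false-belief scenario. It is a toy illustration only, not ThoughtTracing's actual algorithm or any cited paper's method. The class `BeliefTracker`, the function `bayes_update`, the two locations, and the 0.95 perception reliability are all hypothetical choices made for this example.

```python
# Toy sketch (assumption: not the method of any cited paper).
# Each agent gets its own probability distribution over where the object is,
# and only agents who witness an event have their distribution updated.

LOCATIONS = ("basket", "box")  # hypothetical scenario locations


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def bayes_update(prior, seen_at, reliability=0.95):
    """Posterior over locations after noisily seeing the object at `seen_at`.

    `reliability` is an assumed probability that perception is correct.
    """
    miss = (1.0 - reliability) / (len(LOCATIONS) - 1)
    likelihood = {loc: reliability if loc == seen_at else miss for loc in LOCATIONS}
    return normalize({loc: prior[loc] * likelihood[loc] for loc in LOCATIONS})


class BeliefTracker:
    """Holds one belief distribution per agent, all maintained in parallel."""

    def __init__(self, agents):
        self.uniform = {loc: 1.0 / len(LOCATIONS) for loc in LOCATIONS}
        self.beliefs = {agent: dict(self.uniform) for agent in agents}

    def object_moved(self, new_location, witnesses):
        for agent in witnesses:
            # Predict: a witnessed move invalidates the old belief (reset to uniform).
            # Update: condition on seeing the object at its new location.
            self.beliefs[agent] = bayes_update(self.uniform, new_location)
        # Agents who did not witness the move keep their old, possibly false, belief.

    def most_likely(self, agent):
        belief = self.beliefs[agent]
        return max(belief, key=belief.get)


tracker = BeliefTracker(["sally", "anne"])
tracker.object_moved("basket", witnesses=["sally", "anne"])  # both see the placement
tracker.object_moved("box", witnesses=["anne"])              # Sally is out of the room

print(tracker.most_likely("sally"))  # Sally's belief is stale
print(tracker.most_likely("anne"))   # Anne's belief matches reality
```

The design point is that nothing here is a long derivation. The work lies in keeping separate, mutually inconsistent world models and gating each update on who actually observed the event, which is the structure the paragraph above attributes to mind-reading tasks.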
There's also a measurement trap hiding underneath. Some of the 'reasoning helps' results may be illusory because the benchmarks themselves are beatable by pattern matching: supervised fine-tuning matches reinforcement learning, and models exploit templated artifacts and distribution biases rather than genuinely reasoning about beliefs (Can language models solve ToM benchmarks without real reasoning?). Relatedly, scale interacts strangely: smaller models can reach comparable accuracy through shortcuts that leave no interpretable reasoning trace, while 7B models develop explicit, transferable belief-tracking under RL. Accuracy alone therefore hides whether any real reasoning is happening (Does reinforcement learning on theory of mind collapse with model scale?).
What makes this more than a quirk of social tasks is that the corpus shows 'more reasoning ≠ better' as a general pattern that ToM just expresses sharply. Accuracy peaks and then falls as thinking tokens climb: one study saw it drop from 87.3% to 70.3% as tokens went from ~1,100 to ~16K, with models overthinking easy problems and underthinking hard ones (Does more thinking time always improve reasoning accuracy?). Optimal chain length follows an inverted U: it grows with task difficulty but shrinks with model capability, so more capable models actually do best with *shorter* chains (Why does chain of thought accuracy eventually decline with length?). In multimodal perception, verbose chains hurt because the real bottleneck is visual attention, not verbalization, so the optimization targets the wrong policy (Does verbose chain-of-thought actually help multimodal perception tasks?). Read together, the pattern is the same each time: when the bottleneck isn't 'derive more steps,' adding steps optimizes a quantity that doesn't matter and crowds out the one that does.
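One way to see how an inverted U can arise is a toy model in which each extra step covers more of the problem but also adds another chance to make an error. The Python sketch below is an illustrative assumption, not the model used in any of the cited studies. The function names, the exponential coverage term, and the 1% per-step error rate are all hypothetical.

```python
import math

# Toy model (assumption, for illustration only):
#   accuracy(n) = coverage(n) * survival(n)
#   coverage(n) = 1 - exp(-n * capability / difficulty)  # more steps cover more of the task
#   survival(n) = (1 - step_error) ** n                  # every step risks derailing the chain


def toy_accuracy(n_steps, difficulty, capability=1.0, step_error=0.01):
    """Probability of a correct answer under the toy model above."""
    tau = difficulty / capability  # steps needed to cover most of the task
    coverage = 1.0 - math.exp(-n_steps / tau)
    survival = (1.0 - step_error) ** n_steps
    return coverage * survival


def best_length(difficulty, capability=1.0, step_error=0.01, max_steps=400):
    """Chain length that maximizes toy accuracy, found by exhaustive search."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: toy_accuracy(n, difficulty, capability, step_error),
    )


for difficulty in (20, 40, 80):
    for capability in (1.0, 2.0):
        n_star = best_length(difficulty, capability)
        print(
            f"difficulty={difficulty} capability={capability} "
            f"best_length={n_star} accuracy={toy_accuracy(n_star, difficulty, capability):.3f}"
        )
```

Under these assumptions accuracy rises, peaks, and then decays as steps are added, and the best length grows with difficulty and shrinks with capability. That matches the qualitative shape described above, though the specific numbers carry no empirical weight.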
The deeper takeaway you might not have gone looking for: reasoning quality is mediated by *what training shapes the thinking toward*, not by how much thinking there is. RL can flip the very same 'thinking mode' from counterproductive self-doubt into productive gap-analysis (Does extended thinking help or hurt model reasoning?), and base models already carry latent reasoning that post-training merely *selects* rather than creates (Do base models already contain hidden reasoning ability?). So the fix for theory of mind isn't more volume — it's giving the model the right structure (parallel belief models, Bayesian tracking) and the right training signal to elicit it.
Sources (11 notes)
Claude 3.7 Sonnet and o1 underperform older models on ToM benchmarks like Decrypto. Increased reasoning effort does not improve social cognition and may actively interfere with it.
Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.
Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.
Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.