Do longer reasoning traces actually improve theory of mind accuracy?
This explores whether spending more reasoning tokens — longer chains of thought — actually buys better theory-of-mind performance, or whether social reasoning resists the more-thinking-is-better intuition that holds (sometimes) for math and logic.
The corpus answer to whether longer reasoning traces help models track what other people believe, want, or falsely assume is unusually direct: no, and sometimes the opposite. Reasoning-optimized models fail to outperform, and on some tasks *underperform*, plainer models on theory-of-mind benchmarks. On the Decrypto tasks for representational change, false belief, and counterfactual reasoning, Claude 3.7 Sonnet and o1 score worse than humans and worse than simple word-embedding baselines, suggesting that optimizing a model for formal step-by-step reasoning can actively degrade its social reasoning (see "Why do reasoning models fail at theory of mind tasks?"). A companion finding sharpens the why: reasoning models produce *longer but unhelpful* traces on theory-of-mind tasks and show no generalization, because social cognition seems to demand holding several candidate mental models in mind at once rather than deriving one answer in a sequence (see "Why do reasoning models struggle with theory of mind tasks?").
That lands inside a broader pattern the collection documents repeatedly: more thinking is not monotonically better. Accuracy follows an inverted-U as traces lengthen. It peaks at some intermediate length and then declines, with the optimal length growing with task difficulty but actually *shrinking* as models get more capable (see "Why does chain of thought accuracy eventually decline with length?"). One striking measurement found accuracy falling from 87.3% to 70.3% as thinking tokens grew from ~1,100 to ~16K, with models overthinking easy problems (see "Does more thinking time always improve reasoning accuracy?"). So the premise that 'longer = more careful = more accurate' is shaky even before you get to the special difficulty of social reasoning.
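As a rough illustration of what "peaks at an intermediate length" means operationally, here is a minimal Python sketch that takes a sweep of (thinking-token budget, accuracy) pairs and reports where accuracy peaks. The function names `best_length` and `peaks_in_interior` are hypothetical, and the sweep values are invented purely to show the shape; they are not measurements from any of the notes.

```python
from typing import List, Tuple

Sweep = List[Tuple[int, float]]  # (thinking-token budget, accuracy)

def best_length(sweep: Sweep) -> Tuple[int, float]:
    """Return the (budget, accuracy) pair with the highest accuracy."""
    return max(sweep, key=lambda point: point[1])

def peaks_in_interior(sweep: Sweep) -> bool:
    """True if the accuracy peak is strictly between the shortest and
    longest budgets, i.e. the curve is not monotonic in length."""
    accuracies = [acc for _, acc in sorted(sweep)]
    peak_index = accuracies.index(max(accuracies))
    return 0 < peak_index < len(accuracies) - 1

# HYPOTHETICAL numbers, invented for illustration only.
sweep = [(500, 0.78), (1000, 0.86), (2000, 0.88),
         (4000, 0.84), (8000, 0.77), (16000, 0.70)]

print(best_length(sweep))        # (2000, 0.88)
print(peaks_in_interior(sweep))  # True
```

A check like this only says where the peak sits in one sweep; it says nothing about why accuracy declines past it.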
Here's the thing you might not expect: trace length may not even be measuring reasoning effort. One controlled maze study found that trace length tracks problem difficulty only on familiar in-distribution problems and decouples completely off-distribution; long traces reflect recall of training schemas, not adaptive computation (see "Does longer reasoning actually mean harder problems?"). And a sharper claim still: the intermediate tokens carry no special execution semantics. Invalid traces frequently still produce correct answers, so the trace is learned formatting that correlates with the answer rather than a causal mechanism producing it (see "Do reasoning traces actually cause correct answers?"). If a longer trace is stylistic mimicry of reasoning (see "What makes chain-of-thought reasoning actually work?"), lengthening it wouldn't reliably help any task, and would especially flail on theory of mind, where the right move is parallel belief-tracking, not a longer derivation.
The corpus does leave an important door open: the failure looks more architectural and training-mediated than length-bound. The same mechanism (extended thinking) can flip from harmful to helpful depending on how a model was trained. RL training redirected 'thinking' from counterproductive self-doubt into productive gap analysis, which says quality of reasoning is trainable, not just quantity (see "Does extended thinking help or hurt model reasoning?"). On theory of mind specifically, RL produced genuine, transferable belief-tracking in 7B models, while smaller ones faked it through shortcuts. Crucially, accuracy alone hid that difference; you had to inspect the steps (see "Does reinforcement learning on theory of mind collapse with model scale?"). Approaches that force *explicit* belief tracking, whether hybrid Bayesian architectures or short Bayesian hypothesis-tracing like ThoughtTracing, beat both LLM-alone and longer-trace approaches (see "Do large language models genuinely simulate mental states?").
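To make "explicit belief tracking" concrete, here is a minimal Python sketch of the general idea: keep several candidate hypotheses about another agent's belief alive at once and reweight them with Bayes' rule as observations arrive. This is not ThoughtTracing's actual algorithm or any hybrid architecture from the notes; the Sally-Anne style scenario, the hypothesis labels, the function names, and every likelihood value are assumptions chosen for illustration.

```python
from typing import Dict, List

Beliefs = Dict[str, float]  # hypothesis label -> probability

def bayes_update(prior: Beliefs, likelihood: Beliefs) -> Beliefs:
    """One Bayesian step: posterior is proportional to prior * likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: w / total for h, w in unnormalized.items()}

def track_beliefs(prior: Beliefs, observations: List[Beliefs]) -> Beliefs:
    """Fold a sequence of observation likelihoods into the prior,
    keeping all hypotheses weighted in parallel throughout."""
    beliefs = dict(prior)
    for likelihood in observations:
        beliefs = bayes_update(beliefs, likelihood)
    return beliefs

# Hypotheses about where Sally BELIEVES the marble is (not where it is).
prior = {"sally_thinks_basket": 0.5, "sally_thinks_box": 0.5}

# P(observation | hypothesis); all values are ASSUMED for illustration.
observations = [
    # Sally was out of the room when the marble was moved to the box.
    {"sally_thinks_basket": 0.9, "sally_thinks_box": 0.1},
    # On returning, Sally walks toward the basket.
    {"sally_thinks_basket": 0.8, "sally_thinks_box": 0.2},
]

posterior = track_beliefs(prior, observations)
print(posterior)  # sally_thinks_basket ends up near 0.97
```

The design point is that no hypothesis is discarded early: the false-belief reading and its alternative are both carried and reweighted, which is a short computation rather than a long derivation.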
So the surprising takeaway: for theory of mind, what helps isn't *more* reasoning but a *different shape* of reasoning — maintaining multiple mental models in parallel rather than chaining one longer and longer. Length is close to the wrong knob.
Sources (10 notes)
Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.
Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.