INQUIRING LINE

Why does additional reasoning effort not improve theory of mind performance?

This explores why pouring more 'thinking' into a model (longer chains of thought, reasoning-optimized architectures) fails to make it better at tracking what other minds believe. The corpus suggests the problem isn't the quantity of reasoning but the kind of reasoning being applied.


Theory of mind (ToM) is the ability to track what other agents believe, want, or falsely assume. The short version the corpus keeps circling back to: social reasoning and formal reasoning are different cognitive shapes, and the machinery we've built to scale one appears to interfere with the other. The most pointed evidence is that reasoning-optimized models like Claude 3.7 Sonnet and o1 score *worse* than older, plainer models on ToM benchmarks like Decrypto, and worse than both humans and simple word-embedding baselines (see "Why do reasoning models fail at theory of mind tasks?" and "Why do advanced reasoning models fail at understanding minds?"). Effort doesn't just fail to help here; it appears to degrade the capability.

The deeper 'why' is architectural. Formal reasoning is sequential derivation: one step licenses the next toward an answer. But tracking minds means holding several incompatible models of the world open at once (what I know, what you know, what you falsely believe I believe). One corpus note frames this directly: reasoning models fail to outperform vanilla LLMs, produce *longer but unhelpful* traces, and show no generalization to similar scenarios, while a method called ThoughtTracing succeeds with much shorter Bayesian hypothesis-tracking. The implication is that social reasoning demands simultaneous maintenance of multiple belief-models, not a long linear chain (see "Why do reasoning models struggle with theory of mind tasks?"). Stretching the chain just gives the model more rope to over-commit to a single line of derivation when the task actually requires juggling several.
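As a rough illustration of what holding several belief-models open at once could look like, here is a minimal Python sketch of Bayesian hypothesis tracking over one agent's belief. It is a toy and not ThoughtTracing's actual algorithm: the hypothesis names, the Sally-and-marble scenario, and every likelihood value below are assumptions invented for this example.

```python
from typing import Dict


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale hypothesis weights so they sum to 1."""
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def update(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """One Bayesian step: posterior is proportional to prior times likelihood."""
    return normalize({h: prior[h] * likelihood[h] for h in prior})


# Competing hypotheses about where Sally thinks the marble is (hypothetical names).
prior = {"sally_believes_basket": 0.5, "sally_believes_box": 0.5}

# Illustrative likelihoods of each observation under each hypothesis (made-up values).
# Observation 1: Sally was out of the room when the marble was moved to the box.
obs_absent_during_move = {"sally_believes_basket": 0.9, "sally_believes_box": 0.1}
# Observation 2: on returning, Sally walks toward the basket.
obs_walks_to_basket = {"sally_believes_basket": 0.8, "sally_believes_box": 0.2}

# Every hypothesis stays alive throughout; observations only reweight them.
posterior = update(prior, obs_absent_during_move)
posterior = update(posterior, obs_walks_to_basket)

for hypothesis, weight in posterior.items():
    print(f"{hypothesis}: {weight:.3f}")
```

The design choice worth noticing is that no hypothesis is discarded early. Each observation only shifts weight between candidate belief-models, which is the contrast with a single derivation chain that commits to one reading and keeps extending it.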

This connects to a more general finding that more thinking isn't monotonically better. Accuracy follows an inverted-U: pushing thinking tokens from ~1,100 to ~16K dropped benchmark accuracy from 87.3% to 70.3%, with models overthinking easy problems and underthinking hard ones (see "Does more thinking time always improve reasoning accuracy?"), and the optimal chain length actually *shrinks* as models get more capable (see "Why does chain of thought accuracy eventually decline with length?"). ToM may simply sit on the steep downslope of that curve, where extra reasoning is mostly self-interference. There's even evidence that the apparent gains from chain-of-thought come from its *form* rather than genuine inference: logically invalid reasoning chains perform nearly as well as valid ones (see "Does logical validity actually drive chain-of-thought gains?"). So the 'reasoning' a model adds isn't necessarily doing real mental-state inference at all.

Here's the thing you might not expect: on structured ToM benchmarks, models can look fine, but that's partly because the benchmarks are gameable. Models default to surface strategies rather than genuine mental simulation, passing templated tasks through pattern-matching while failing open-ended perspective-taking (see "Do large language models genuinely simulate mental states?"). Supervised fine-tuning also matches reinforcement learning on these tasks, a sign that models exploit structural artifacts instead of building real belief-tracking (see "Can language models solve ToM benchmarks without real reasoning?"). So adding reasoning effort can't improve something the model was never really doing; it just elaborates the shortcut.
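To make the gameability point concrete, here is a small hedged Python sketch contrasting a surface shortcut with explicit belief tracking on a false-belief story. Everything in it is hypothetical: the `Event` type, the "answer with the first placement" heuristic, and both toy stories are assumptions for illustration and are not taken from any benchmark or from the cited notes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    kind: str           # one of "put", "move", "leave", "enter"
    actor: str          # who performs the action
    location: str = ""  # where the object ends up (for "put" and "move")


def shortcut_answer(events: List[Event]) -> str:
    """Surface heuristic: always answer with the first place the object was put."""
    for event in events:
        if event.kind == "put":
            return event.location
    return ""


def tracked_belief(events: List[Event], agent: str) -> str:
    """Update the agent's belief only from events the agent is present to see."""
    present = True
    belief = ""
    for event in events:
        if event.actor == agent and event.kind == "leave":
            present = False
        elif event.actor == agent and event.kind == "enter":
            present = True
        elif event.kind in ("put", "move") and present:
            belief = event.location
    return belief


# Templated story: Sally leaves before the move, so she holds a false belief.
templated = [
    Event("put", "Sally", "basket"),
    Event("leave", "Sally"),
    Event("move", "Anne", "box"),
]

# Variant: Sally watches the move and only then leaves, so her belief is correct.
variant = [
    Event("put", "Sally", "basket"),
    Event("move", "Anne", "box"),
    Event("leave", "Sally"),
]

for name, story in (("templated", templated), ("variant", variant)):
    print(name, shortcut_answer(story), tracked_belief(story, "Sally"))
```

On the templated story the two functions agree, so a score on that template alone cannot tell them apart. Reordering two events is enough to separate them, which is the sense in which a templated benchmark can reward a shortcut.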

The hopeful counterpoint worth knowing: the gap may be elicitation, not absence. When reinforcement learning is applied directly to social reasoning, models at the ~7B scale develop explicit, *transferable* belief-tracking, while smaller ones reach comparable accuracy through uninterpretable shortcuts (see "Does reinforcement learning on theory of mind collapse with model scale?"). This echoes the broader result that base models already contain latent reasoning that the right training selects rather than creates (see "Do base models already contain hidden reasoning ability?"). The lesson across these notes is consistent: ToM doesn't respond to *more* reasoning; it responds to the *right kind*, meaning parallel belief-tracking, an architecture that forces explicit belief tracking, and training that targets the capability. Generic effort applied to the wrong shape of problem can make things worse, not better.


Sources (10 notes)

Why do reasoning models fail at theory of mind tasks?

Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.

Why do advanced reasoning models fail at understanding minds?

Claude 3.7 Sonnet and o1 underperform older models on ToM benchmarks like Decrypto. Increased reasoning effort does not improve social cognition and may actively interfere with it.

Why do reasoning models struggle with theory of mind tasks?

Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Can language models solve ToM benchmarks without real reasoning?

Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
