LLM Reasoning and Architecture · Reinforcement Learning for LLMs · Language Understanding and Pragmatics

Can language models solve ToM benchmarks without real reasoning?

Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.

Note · 2026-02-22 · sourced from Theory of Mind

The dominant narrative around ToM benchmarks assumes that high performance indicates genuine mental state reasoning. This paper systematically challenges that assumption by comparing RL-trained and SFT-trained models across multiple ToM datasets.

The key finding: SFT alone — which optimizes models to reproduce desired outputs from examples without any reasoning-process optimization — achieves "competitive and generalizable performance on these benchmarks, often matching or exceeding RL models in accuracy." If SFT can match RL without any explicit reasoning training, the benchmarks may not be testing what they claim to test.
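For concreteness, "no reasoning-process optimization" means the SFT training signal is plain next-token cross-entropy on the reference answer; nothing scores the intermediate reasoning. A minimal sketch, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`, with prompt positions masked to `-100` in `labels`:

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """One supervised fine-tuning step: cross-entropy on target tokens.
    The model is pushed to reproduce the reference output; the reasoning
    process that produced it is never scored or rewarded."""
    logits = model(input_ids).logits  # (batch, seq, vocab)
    # Shift so position t predicts token t+1; -100 masks prompt tokens.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```

This contrasts with RL training (e.g., policy-gradient methods over sampled chains of thought), where the optimization target depends on the generated reasoning trace itself.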

Several structural vulnerabilities emerge:

Distribution bias. In ExploreToM, 22% of questions have "yes" as the correct answer while only 4% have "no." This skew is a strong prior that models can exploit without understanding the content: on the yes/no subset alone, always answering "yes" is correct roughly 85% of the time (22/26), far better than chance.
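A quick way to quantify this kind of exploitability is a majority-class baseline. The sketch below is illustrative only; the hypothetical `labels` list mirrors the reported 22%/4% split rather than the actual ExploreToM data:

```python
from collections import Counter

def majority_baseline(answers):
    """Accuracy of always predicting the most frequent answer.
    A high value means the answer distribution alone is exploitable,
    independent of any mental state reasoning."""
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)

# Hypothetical distribution mirroring the reported split:
# 22% "yes", 4% "no", remainder non-binary answers.
labels = ["yes"] * 22 + ["no"] * 4 + ["other"] * 74
print(majority_baseline(labels))  # ('other', 0.74) over the full set
print(majority_baseline([a for a in labels if a in ("yes", "no")]))
# ('yes', ~0.85) on the yes/no subset
```

Benchmark accuracy should be read against this baseline, not against uniform chance.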

Templated generation artifacts. The datasets may contain "exploitable patterns, such as surface-level correlations between narrative elements and answers, possibly introduced by templated generation." The logical structure of the stories, even when made more naturalistic through infilling, remains predictable.
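One standard way to probe for such artifacts (a common diagnostic, not a method from the paper) is to check whether a classifier with no capacity for belief tracking, such as bag-of-words logistic regression, can predict answers above the majority baseline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def surface_pattern_probe(stories, answers):
    """Cross-validated accuracy of a bag-of-words classifier that sees
    only word/bigram counts. Accuracy well above the majority baseline
    suggests answers correlate with surface features of the narrative,
    e.g. template vocabulary, rather than requiring belief tracking."""
    features = CountVectorizer(ngram_range=(1, 2)).fit_transform(stories)
    return cross_val_score(LogisticRegression(max_iter=1000),
                           features, answers, cv=5).mean()
```

If templated generation leaks answer-correlated vocabulary, even this shallow model will score well, and so will an LLM that merely pattern-matches.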

Pretraining as hidden capability. General pretraining may equip models with reasoning skills that SFT merely activates, making it impossible to distinguish "learned ToM reasoning" from "pattern matching on familiar narrative structures."
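A rough way to separate these (an assumption-laden sketch, not from the paper) is to score the same items with both the base checkpoint and its SFT variant: if the base model already performs well under likelihood scoring, SFT is activating a pretrained skill rather than teaching a new one. Model names and the `items` format here are hypothetical:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def choice_accuracy(model_name, items):
    """Score each candidate answer by mean log-likelihood of the full
    prompt+answer sequence and pick the argmax. `items` is a list of
    (prompt, candidate_answers, correct_index) tuples."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    correct = 0
    for prompt, candidates, gold in items:
        scores = []
        for cand in candidates:
            ids = tok(prompt + " " + cand, return_tensors="pt").input_ids
            scores.append(-model(ids, labels=ids).loss.item())
        correct += int(max(range(len(scores)), key=scores.__getitem__) == gold)
    return correct / len(items)

# base_acc = choice_accuracy("base-checkpoint", tom_items)  # hypothetical names
# sft_acc  = choice_accuracy("sft-checkpoint", tom_items)
```

A small gap between the two scores would support the "activation" reading; a large gap would suggest SFT contributed something beyond pretraining.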

The generalization finding is particularly striking: SFT models generalize to 4th-order ToM and to infilled (more naturalistic) stories nearly as well as RL models. Increasing the complexity or naturalism of the stories therefore does not, by itself, differentiate genuine reasoning from structural exploitation "if the underlying logical structure remains predictable."

This presents a Kosinski-style dilemma: either accept the measures as valid, which implies LLMs have ToM, or deny that LLMs understand mental states, which requires reevaluating the measures themselves. The SFT evidence supports the latter: the measures may be testing structural pattern recognition, not mental state inference.


Source: Theory of Mind

Original note title: current ToM benchmarks may be solvable without explicit mental state reasoning; SFT matches RL, suggesting exploitable structural patterns