Can language models solve ToM benchmarks without real reasoning?
Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.
The dominant narrative around ToM benchmarks assumes that high performance indicates genuine mental state reasoning. This paper systematically challenges that assumption by comparing RL-trained and SFT-trained models across multiple ToM datasets.
The key finding: SFT alone — which optimizes models to reproduce desired outputs from examples without any reasoning-process optimization — achieves "competitive and generalizable performance on these benchmarks, often matching or exceeding RL models in accuracy." If SFT can match RL without any explicit reasoning training, the benchmarks may not be testing what they claim to test.
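To make the distinction concrete, here is a schematic contrast of the two objectives (a toy sketch with random tensors standing in for model outputs, not the paper's actual training setup): SFT applies cross-entropy toward gold answer tokens, while outcome-reward RL samples an answer, scores it for correctness, and reinforces whatever was sampled. Neither objective directly supervises the reasoning process.

```python
import torch
import torch.nn.functional as F

vocab_size = 8
answer_ids = torch.tensor([3, 5])                        # gold answer tokens (hypothetical)
logits = torch.randn(2, vocab_size, requires_grad=True)  # stand-in for a model's output

# SFT: cross-entropy toward the gold answers. The model is pushed to reproduce
# the target output; nothing constrains how it arrived there.
sft_loss = F.cross_entropy(logits, answer_ids)

# Outcome-reward RL (REINFORCE-style schematic): sample an answer, score it
# 1/0 for correctness, and reinforce the log-probability of the sampled tokens.
probs = logits.softmax(dim=-1)
sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
reward = (sampled == answer_ids).float()
log_prob = torch.log(probs.gather(1, sampled.unsqueeze(-1)).squeeze(-1))
rl_loss = -(reward * log_prob).mean()
```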
Several structural vulnerabilities emerge:
Distribution bias. In ExploreToM, 22% of all questions have "yes" as the correct answer while only 4% have "no." This creates a strong prior that models can exploit without understanding the content: among the yes/no questions, always answering "yes" is already far better than chance.
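A back-of-the-envelope check, using only the percentages quoted above rather than the dataset itself, shows what always answering "yes" buys on the binary subset:

```python
# Reported ExploreToM answer distribution over all questions (figures from the note above).
p_yes, p_no = 0.22, 0.04

# Restrict to the yes/no subset and score a "classifier" that always answers "yes".
binary_share = p_yes + p_no
always_yes_acc = p_yes / binary_share
print(f"always-'yes' accuracy on yes/no questions: {always_yes_acc:.1%}")  # ~84.6% vs. 50% chance
```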
Templated generation artifacts. The datasets may contain "exploitable patterns, such as surface-level correlations between narrative elements and answers, possibly introduced by templated generation." The logical structure of the stories, even when made more naturalistic through infilling, remains predictable.
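One way to probe for such artifacts (a sketch under assumed data, not the paper's method) is to train a shallow classifier on surface features of the story alone, withholding the question and any mental-state content, and check whether it predicts answers above the label prior. The stories and labels below are invented placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: in practice, (story, answer) pairs would come from the benchmark.
stories = [
    "Anna moves the ball to the basket while Tom is away.",
    "Tom watches Anna move the ball to the box.",
    "Sara leaves before the keys are moved to the drawer.",
    "Sara sees the keys being moved to the drawer.",
]
answers = ["no", "yes", "no", "yes"]

# Bag-of-words features ignore mental-state semantics entirely; if this probe
# beats the majority-class baseline, the benchmark leaks surface-level
# correlations between narrative elements and answers.
X = CountVectorizer().fit_transform(stories)
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, answers, cv=2)
print("shortcut-probe accuracy:", scores.mean())
```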
Pretraining as hidden capability. General pretraining may equip models with reasoning skills that SFT merely activates, making it impossible to distinguish "learned ToM reasoning" from "pattern matching on familiar narrative structures."
The generalization finding is particularly striking: SFT models generalize to 4th-order ToM and to infilled (more naturalistic) stories nearly as well as RL models. Increasing the complexity or naturalism of the stories therefore does not separate genuine reasoning from structural exploitation "if the underlying logical structure remains predictable."
This presents a Kosinski dilemma: either accept that these measures are valid (implying LLMs have ToM) or reject that LLMs understand mental states (requiring us to reevaluate the measures themselves). The SFT evidence supports the latter — the measures may be testing structural pattern recognition, not mental state inference.
Source: Theory of Mind
Related concepts in this collection
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  The ToM benchmark finding is a specific instance: models develop task-specific heuristics for ToM-shaped problems rather than genuine mental state reasoning.
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  The same pattern in a different domain: correct performance does not entail the intended mechanism.
- Does supervised fine-tuning improve reasoning or just answers?
  Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
  SFT on ToM follows the same pattern: scores go up without reasoning quality following.
Original note title: current ToM benchmarks may be solvable without explicit mental state reasoning — SFT matches RL suggesting exploitable structural patterns