Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?

Paper · arXiv 2504.01698 · Published April 2, 2025
Tags: Theory of Mind · Personas · Personality · Psychology · Chatbots · Conversation · Role Play · Philosophy · Subjectivity

Recent advances in Large Language Models (LLMs) have shown promising performance on Theory of Mind (ToM) benchmarks, raising the question: Do these benchmarks necessitate explicit human-like reasoning processes, or can models succeed through alternative strategies? We investigate this question empirically by applying Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to LLMs of varying scales (0.5B to 7B parameters) and evaluating them across multiple ToM datasets.

Our work suggests that current ToM benchmarks may be solvable without requiring the explicit, human-like simulation of mental states they were designed to probe. LLMs, particularly when scale is limited or training signals focus solely on output correctness, may instead exploit alternative rules that fit the structure of benchmark data.

While LLMs have shown promising results on these datasets [16, 20], a critical question arises: Do these benchmarks truly require explicit human-like reasoning processes, such as simulating agents' mental states step by step, or can models achieve high performance by exploiting alternative strategies, potentially leveraging structural patterns inherent in the data? Answering this question is vital for accurately assessing AI's progress toward genuine social intelligence. To investigate this challenge empirically, we employ two prominent post-training methodologies, Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), on LLMs of different scales. RL, particularly with rule-based reward signals, has proven effective at enhancing structured reasoning in formal domains such as mathematics and coding by reinforcing correct outputs and process adherence [12, 7, 24]. While this suggests RL could foster structured mental-state reasoning [5], ToM involves context-sensitive social commonsense that is less amenable to rigid rules. SFT, on the other hand, directly optimizes models to reproduce desired outputs from provided examples. By comparing the performance and, crucially, the nature of the reasoning elicited by RL and SFT across models of varying capacities, we can gain insight into the strategies LLMs employ to solve current ToM tasks and whether these strategies align with explicit belief tracking.
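To make the contrast between the two training signals concrete, the sketch below shows each in miniature: a rule-based reward that scores only format adherence and final-answer correctness, and the token-level imitation loss used by SFT. The `<think>`/`<answer>` template, reward values, and function names are our illustrative assumptions, not the paper's exact specification.

```python
import re

import torch
import torch.nn.functional as F

# Completions are assumed to follow a <think>...</think><answer>...</answer>
# template; the tag names and reward magnitudes are illustrative.
ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rl_reward(completion: str, gold: str) -> float:
    """Rule-based RL signal: format check plus final-answer exact match.
    No step-level supervision; only the end result is scored."""
    m = ANSWER_RE.search(completion)
    if m is None:
        return -1.0  # malformed output is penalized
    return 1.0 if m.group(1).strip().lower() == gold.strip().lower() else -0.5

def sft_loss(logits: torch.Tensor, gold_ids: torch.Tensor) -> torch.Tensor:
    """SFT signal: token-level cross-entropy against a reference completion,
    i.e., pure imitation of the provided example."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), gold_ids.view(-1))
```

Note that neither signal inspects the content of the reasoning trace itself, which is exactly why high scores under either regime need not imply explicit belief tracking.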

Our study yields several key findings that shed light on the nature of reasoning on current ToM benchmarks. First, we find that while RL significantly boosts accuracy across models of different sizes, its impact on reasoning quality is scale-dependent. In larger models (7B), RL induces high-quality, interpretable, and transferable belief-tracking behaviors. However, in smaller models (≤3B), RL leads to reasoning collapse: models achieve high accuracy and generalization but produce drastically shortened, less meaningful responses, suggesting they rely on implicit rather than explicit structured reasoning. Second, and perhaps most strikingly, we demonstrate that SFT alone achieves competitive and generalizable performance on these benchmarks, often matching or exceeding RL models in accuracy, despite not being explicitly optimized for the reasoning process. These results highlight a critical discrepancy between achieving high scores on current ToM benchmarks and demonstrating explicit human-like reasoning. Our work suggests that existing benchmarks may be solvable without requiring genuine, step-by-step simulation of mental states. LLMs, especially those with limited capacity or trained with output-focused signals, may internalize alternative rules or patterns that are effective for the specific structures found in benchmark datasets.

This paper makes the following key contributions:

• We achieve state-of-the-art performance and reveal a scale-dependent effect of RL on LLMs in ToM: it promotes explicit reasoning in 7B models but leads to reasoning collapse (high accuracy without meaningful reasoning) in smaller models, exposing a crucial mismatch between performance and reasoning quality.

• We show that SFT achieves competitive and generalizable performance on current ToM benchmarks, providing empirical evidence that these datasets may not require explicit human-like mental state reasoning.

• We highlight the need for future ToM evaluation methods and benchmarks that assess not just answer accuracy but the depth and nature of underlying reasoning.

Consider a classic ToM scenario: Sally places her marble in a basket and leaves; Anne moves it to a box. Where will Sally search for her marble when she returns? The correct answer depends not on physical laws, but on understanding Sally's false belief. This kind of ability is critical to human social interaction [9, 20]. The complexity of these tasks can be increased by nesting beliefs, moving from first-order questions (e.g., "Where does Anne think the marble is?") to higher-order ones (e.g., the second-order "Where does Anne think Sally thinks the marble is?").
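The explicit belief tracking such scenarios are meant to probe can be written down directly. The sketch below is a minimal world model for the Sally-Anne story; the `World` class and event API are our own illustrative construction, not code from the paper or the benchmarks.

```python
# Minimal explicit belief tracking for the Sally-Anne scenario: agents update
# their beliefs only about events they witness.
class World:
    def __init__(self):
        self.location = {}   # true object locations
        self.present = set() # agents currently in the room
        self.belief = {}     # belief[agent][obj] = where agent thinks obj is

    def enter(self, agent):
        self.present.add(agent)
        self.belief.setdefault(agent, {})

    def leave(self, agent):
        self.present.discard(agent)

    def move(self, agent, obj, place):
        assert agent in self.present
        self.location[obj] = place
        # Only agents who witness the move update their belief.
        for a in self.present:
            self.belief[a][obj] = place

# Sally places her marble in the basket, leaves, and Anne moves it to the box.
w = World()
w.enter("Sally"); w.enter("Anne")
w.move("Sally", "marble", "basket")
w.leave("Sally")
w.move("Anne", "marble", "box")

print(w.belief["Sally"]["marble"])  # basket (Sally's false belief)
print(w.location["marble"])         # box    (ground truth)
```

Higher-order questions nest this structure (e.g., a second-order query reads off what Anne believes Sally believes); the benchmarks express the same logic as natural-language narratives rather than as code.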

Rule-based reinforcement learning (RL) has proven effective in enhancing large language models (LLMs) beyond standard supervised fine-tuning. This approach uses structured reward signals to guide model behavior without requiring explicit step-by-step supervision. Notably, it enables the emergence of reasoning-like behavior through relatively simple feedback mechanisms. Recent work has shown that such rewards can encourage models to internalize structured thinking. For instance, DeepSeek-R1 [7] demonstrated that answer-level rewards led to gradual increases in both response length and accuracy, suggesting the emergence of reasoning dynamics. Logic-RL [24] applied format-constrained rewards to synthetic logic puzzles, training a 7B model that learned to reflect on and verify its answers, skills that transferred to real-world math benchmarks such as AIME. SWE-RL [22] extended similar principles to software engineering, achieving state-of-the-art performance on coding tasks and exhibiting cross-domain generalization. These successes highlight RL's potential to activate latent reasoning skills through reward design. However, prior successes have largely occurred in formal domains, where rules and ground truths are well defined. It remains an open question whether such methods generalize to social reasoning tasks, which require interpreting mental states and hidden commonsense. The extent to which rule-based RL can elicit human-like mental-state inference in LLMs remains underexplored.
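As one concrete instance of such reward design, the sketch below computes group-relative advantages in the style of GRPO, the recipe behind DeepSeek-R1 [7]: several completions are sampled per prompt, scored by a rule-based reward, and normalized within the group, so no learned value model is needed. This is an illustrative assumption; the exact algorithms in the works cited here may differ.

```python
# Group-relative advantage estimation (GRPO-style): rewards for G sampled
# completions of one prompt are standardized within the group.
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled answers to one ToM question, scored by a rule-based reward.
print(group_advantages([1.0, -0.5, -0.5, 1.0]))  # correct answers get positive advantage
```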

Reasoning collapse: the pitfall of RL in smaller models The beneficial effect of RL on reasoning quality was not uniform across model sizes: reasoning collapse emerged in smaller models (≤3B). Despite achieving accuracy gains comparable to those of larger models on many benchmarks after RL training, these models failed to generate interpretable, structured reasoning traces. Instead, they appeared to rely on shorter, potentially memorized patterns or rules optimized directly for the final answer, rather than explicitly tracking agents' mental states. This phenomenon underscores a crucial mismatch between achieving high accuracy on benchmark questions and possessing genuine, human-like reasoning capabilities. Simple rule-based rewards, while effective at optimizing for correctness, may inadvertently encourage shortcut learning in models whose capacity is limited relative to the complexity of the underlying task logic.
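Collapse of this kind can be surfaced by tracking trace length alongside accuracy. A minimal sketch, assuming the same `<think>` template as above (the paper does not specify this exact script):

```python
# Diagnostic for reasoning collapse: mean token length of the reasoning trace.
# A collapsed model keeps (or raises) accuracy while this statistic plummets.
import re

def mean_think_length(completions: list[str]) -> float:
    """Average whitespace-token length of the <think> span per completion."""
    lengths = []
    for c in completions:
        m = re.search(r"<think>(.*?)</think>", c, re.DOTALL)
        lengths.append(len(m.group(1).split()) if m else 0)
    return sum(lengths) / max(len(lengths), 1)

# e.g., mean_think_length(after_rl) << mean_think_length(before_rl)
# while accuracy stays high signals collapse rather than improved reasoning.
```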

Effectiveness of SFT and implications for benchmark validity The surprisingly strong performance of models trained solely with SFT further supports the notion that existing ToM benchmarks, such as Hi-ToM and ExploreToM, while valuable, may not be sufficiently challenging to exclusively probe deep mental-state inference. SFT models achieved accuracy comparable to, or even slightly better than, their RL-trained counterparts on several datasets, including those designed to test generalization (4th-order ToM, infilled stories). One possible explanation is that the datasets contain exploitable patterns, such as surface-level correlations between narrative elements and answers, possibly introduced by templated generation. For example, in ExploreToM [16], 22% of questions have "yes" as the correct answer, while only 4% are "no," introducing a strong prior. Additionally, general pretraining might equip models with reasoning skills that SFT merely activates. The generalization seen with SFT also implies that simply increasing the complexity or naturalism of the stories (as in infilled ExploreToM) might not be enough to overcome such implicit strategies if the underlying logical structure remains predictable.
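Priors of this kind can be surfaced with a simple audit of the gold-label distribution. The sketch below, with illustrative input format, computes label frequencies and the accuracy of a model that simply plays the prior.

```python
# Audit a benchmark's answer distribution for exploitable priors.
from collections import Counter

def audit_answer_prior(gold_answers: list[str]) -> None:
    counts = Counter(a.strip().lower() for a in gold_answers)
    n = sum(counts.values())
    for label, c in counts.most_common():
        print(f"{label}: {c / n:.1%}")
    print(f"majority-class baseline accuracy: {counts.most_common(1)[0][1] / n:.1%}")

# On ExploreToM, such an audit would surface the 22% "yes" vs. 4% "no" skew
# noted above (the remaining questions have non-binary answers).
```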