A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks

Paper · arXiv 2502.08796 · Published February 12, 2025
Theory of Mind · Evaluations · Reasoning Critiques

This systematic review synthesizes current efforts to assess LLMs’ ability to perform ToM tasks—an essential aspect of human cognition involving the attribution of mental states to oneself and others. Despite notable advancements, the proficiency of LLMs in ToM remains a contentious issue. By categorizing benchmarks and tasks through a taxonomy rooted in cognitive science, this review critically examines evaluation techniques, prompting strategies, and the inherent limitations of LLMs in replicating human-like mental state reasoning. A recurring theme in the literature reveals that while LLMs demonstrate emerging competence in ToM tasks, significant gaps persist in their emulation of human cognitive abilities.

Recently, various text-based and non-linguistic Theory of Mind tests and benchmarks have been developed to assess the cognitive capabilities of large language models (Kosinski, 2024; Ullman, 2023; Sclar et al., 2022). Despite these results, much of the work has been met with skepticism, with subsequent studies often challenging the conclusions of earlier research (Ullman, 2023). Some attribute the so-called successes of LLMs in ToM tasks to imitation of these cognitive abilities rather than genuine possession of them, a long-standing debate tracing back to Turing (1950). Others suggest that models capable of solving these tasks often rely on memorization and shallow heuristics (Shapira et al., 2023a), a phenomenon sometimes referred to as the “Clever Hans” effect (Pfungst, 1911) or the “Stochastic Parrot” problem (Bender et al., 2021).

Kosinski (2024) presents a dilemma in the ToM evaluation of large language models, concisely summarized by Ullman (2023): We must either accept the validity of current ToM measures, which would imply that large language models possess Theory of Mind, or reject the assertion that LLMs understand others’ mental states, necessitating a comprehensive reevaluation and possibly a dismissal of these measures.

Ullman (2023) suggests an alternative perspective, claiming that while we can acknowledge the validity of the ToM measures, we should remain skeptical of a model that passes them. As the debate intensifies, our primary motivation for this study is to provide a systematic review of the evaluation methods used to assess the Theory of Mind in large language models. By providing a comprehensive review, we hope to clarify the current state of research, highlight areas needing further investigation, and contribute to a more nuanced understanding of how LLMs relate to human-like mental state reasoning. This systematic review will be a valuable resource for researchers and practitioners seeking to navigate the complexities of evaluating the Theory of Mind in artificial intelligence.

Task Classification. For the classification of tasks, we decided to use the following categories: Multiple Choice, True/False, Natural Language Generation, Question Answering, Inference, Text Completion, and Multi-agent Collaboration. Multiple Choice refers to tasks where the possible answers are listed as options for the model. This format is commonly used in benchmarks and task formulations (Ma et al., 2023b; Mireshghallah et al., 2024; Kim et al., 2023; Gandhi et al., 2023). True/False consists of questions where the model is expected to respond with true/yes/1 or false/no/0. This question format is also preferred due to its simplicity (van Duijn et al., 2023; Ma et al., 2023b). Natural Language Generation tasks expect longer, open-ended responses, where the generated output is usually evaluated by comparing it to a set of reference texts. One example of a Natural Language Generation (NLG) task in Mireshghallah et al. (2024) involves prompting the model to answer a question while considering privacy norms. Leakage in the response is then assessed using two methods: (a) exact string matching for X’s name, and (b) determining whether a proxy model can recover the private information solely from the given response. Question Answering tasks require the model to respond to a question with the most appropriate answer. This category typically involves extracting information from a given text or reasoning about the content to generate the correct response. It differs from Multiple Choice tasks, as it does not provide predefined options, and from Natural Language Generation tasks, as it usually expects a concise answer, often a single word or a short phrase, rather than a longer, open-ended response. Examples of this category can be found in Ma et al. (2023b); Xu et al. (2024a); Chan et al. (2024). Text Completion involves tasks where the model is expected to complete a missing part of a sentence, and is often used for base language models (van Duijn et al., 2023). Multi-agent Collaboration consists of multiplayer tasks where agents work together to solve a specific problem; these tasks may have different objective functions depending on the nature of the game (Bianchi et al., 2024; Li et al., 2023; Bara et al., 2021; Guo et al., 2023; Sclar et al., 2022). Lastly, Inference refers to tasks where the model is expected to make logical inferences, which may involve Natural Language Inference (Cohen, 2021) or predictions using methods like logistic regression (Eysenbach et al., 2016).
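To make the simpler answer formats concrete, the following is a minimal sketch in Python of how outputs for Multiple Choice and True/False items might be normalized and scored. The function names and normalization rules are illustrative assumptions, not the procedure of any particular benchmark cited above.

```python
import re

def score_multiple_choice(model_output: str, gold_option: str) -> bool:
    """Extract the first standalone option letter (A-D) and compare it to the gold option."""
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return match is not None and match.group(1) == gold_option.upper()

def score_true_false(model_output: str, gold_label: bool) -> bool:
    """Map free-form answers such as 'true'/'yes'/'1' or 'false'/'no'/'0' onto booleans."""
    text = model_output.strip().lower()
    if text.startswith(("true", "yes", "1")):
        return gold_label is True
    if text.startswith(("false", "no", "0")):
        return gold_label is False
    return False  # unparsable answers are scored as incorrect

if __name__ == "__main__":
    print(score_multiple_choice("The answer is (B): Sally looks in the basket.", "B"))  # True
    print(score_true_false("No, John does not know the chocolate was moved.", False))  # True
```

Open-ended formats such as Natural Language Generation cannot be scored by this kind of string normalization and instead rely on reference comparisons or auxiliary models, as in the leakage checks described above.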

Another problem mentioned by Li et al. (2023) is the long-horizon context, where a model tends to forget information about the room and details in the inquiry text when they appear far from the planning question at the very end.

Wu et al. (2023) lists five error types observed in their experiments: insufficient reasoning depth, commonsense errors, hallucinations, temporal ignorance, and spurious causal inference.

Insufficient reasoning depth refers to an LLM skipping essential reasoning steps and answering a lower-order question in place of the higher-order question asked (e.g., responding as if asked ’Where is Anne?’ when the question is ’Where would John think Anne is?’). Commonsense errors occur when the model generates a continuation that defies common sense; for instance, the authors provide an example where someone leaves a closed space yet can still see what is happening inside. Hallucinations involve the model introducing details that are not present in the given context. Temporal ignorance occurs when the LLM ignores or confuses the order of events. Spurious causal inference refers to incorrectly attributing a cause-and-effect relationship between events that are not actually related in that way.
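As an illustration of the distinction behind insufficient reasoning depth, the sketch below encodes a hypothetical false-belief item (not taken from Wu et al., 2023) with both a reality question and a belief question; the error corresponds to answering the belief question as if it were the reality question.

```python
# Hypothetical false-belief item (illustrative only; not an actual item from Wu et al., 2023).
story = (
    "John puts the chocolate in the drawer and leaves the kitchen. "
    "While John is away, Anne moves the chocolate to the basket."
)

item = {
    "story": story,
    # Lower-order question: about the actual state of the world.
    "reality_question": {"text": "Where is the chocolate?", "answer": "basket"},
    # Higher-order question: about John's (now false) belief.
    "belief_question": {"text": "Where would John think the chocolate is?", "answer": "drawer"},
}

def is_correct(model_answer: str, gold: str) -> bool:
    """Lenient containment check; real benchmarks may use stricter matching."""
    return gold.lower() in model_answer.lower()

# Insufficient reasoning depth: the model answers the belief question with the
# reality answer ("basket") instead of tracking John's outdated belief ("drawer").
print(is_correct("John would think the chocolate is in the drawer.", item["belief_question"]["answer"]))  # True
print(is_correct("The chocolate is in the basket.", item["belief_question"]["answer"]))                   # False
```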

Our findings suggest the following narrative: As language models become more advanced—incorporating more parameters and larger training datasets—they tend to achieve higher scores on Theory of Mind tasks compared to their predecessors (van Duijn et al., 2023; Shapira et al., 2023a; Hou et al., 2024a; Kosinski, 2024). However, in most cases, they still fall short of human performance (Mireshghallah et al., 2024; Liu et al., 2024; Zhou et al., 2023a; Sap et al., 2023). Consequently, we align with the skeptical perspective, suggesting that while LLMs may exhibit some enhanced ToM abilities, these capabilities remain limited and may often rely on spurious correlations rather than robust understanding. With this review of the current state of Machine Theory of Mind literature, we aim to pave the way for future research focused on enhancing the ToM capabilities of LLMs and expanding their applications across diverse domains.