Evaluating Large Language Models in Theory of Mind Tasks
Many animals excel at using cues such as vocalization, body posture, gaze, or facial expression to predict other animals’ behavior and mental states. Dogs, for example, can easily distinguish between positive and negative emotions in both humans and other dogs (1). Yet, humans do not merely respond to observable cues but also automatically and effortlessly track others’ unobservable mental states, such as their knowledge, intentions, beliefs, and desires (2). This ability—typically referred to as “theory of mind” (ToM)—is considered central to human social interactions (3), communication (4), empathy (5), self-consciousness (6), moral judgment (7, 8), and even religious beliefs (9). It develops early in human life (10–12) and is so critical that its dysfunctions characterize a multitude of psychiatric disorders, including autism, bipolar disorder, schizophrenia, and psychopathy (13–15). Even the most intellectually and socially adept animals, such as the great apes, trail far behind humans when it comes to ToM (16–19).
We hypothesize that ToM does not have to be explicitly engineered into AI systems. Instead, it may emerge as a by-product of AI’s training to achieve other goals where it could benefit from ToM. Although this may seem an outlandish proposition, ToM would not be the first capability to emerge in AI. Models trained to process images, for example, spontaneously learned how to count (30, 31) and differentially process central and peripheral image areas (32), as well as to experience human-like optical illusions (33). LLMs trained to predict the next word in a sentence surprised their creators not only by their inclination to be racist and sexist (34) but also by their emergent reasoning and arithmetic skills (35), ability to translate between languages (22), and susceptibility to semantic priming (36).
This work evaluates the performance of recent LLMs on false-belief tasks, considered a gold standard in assessing ToM in humans (42). False-belief tasks test respondents’ understanding that another individual may hold beliefs that the respondent knows to be false. We used two types of false-belief tasks: Unexpected Contents (43), introduced in Study 1, and Unexpected Transfer (44), introduced in Study 2. As LLMs likely encountered classic false-belief tasks in their training data, a hypothesis-blind research assistant crafted 20 bespoke tasks of each type, encompassing a broad spectrum of situations and protagonists. To reduce the risk that LLMs solve tasks by chance or using response strategies that do not require ToM, each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to score a single point.
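To make this all-or-nothing scoring rule concrete, the sketch below shows one way it could be implemented. The data layout and function names are our own illustrative assumptions; only the rule itself (a point is earned only when every scenario of a task, and every prompt within it, is answered correctly, consistent with the 16-prompt requirement described below) comes from the text.

```python
# Illustrative sketch of the all-or-nothing scoring rule described above.
# Data layout and names are hypothetical; only the rule itself comes from
# the text: a task scores one point only if all eight of its scenarios
# (and all prompts within them) are solved.

from typing import Dict, List


def scenario_solved(responses: List[str], expected: List[str]) -> bool:
    """A scenario counts as solved only if every prompt is answered correctly."""
    if len(responses) != len(expected):
        return False
    return all(r.strip().lower() == e.strip().lower()
               for r, e in zip(responses, expected))


def score_task(scenarios: List[Dict]) -> int:
    """Return 1 if the LLM solved all eight scenarios of a task, else 0.

    Each task bundles a false-belief scenario, three closely matched
    true-belief controls, and the reversed versions of all four.
    """
    assert len(scenarios) == 8
    return int(all(scenario_solved(s["responses"], s["expected"])
                   for s in scenarios))


def total_score(tasks: List[List[Dict]]) -> int:
    """Sum of points over all tasks (20 per task type in the studies)."""
    return sum(score_task(task) for task in tasks)
```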
Studies 1 and 2 introduce the tasks, the prompts used to test LLMs’ comprehension, and our scoring approach. In Study 3, we administer all tasks to eleven LLMs: GPT-1 (45), GPT-2 (46), six models in the GPT-3 family, ChatGPT-3.5-turbo (22), ChatGPT-4 (47), and Bloom (48)—GPT-3’s open-access alternative. Our results show that the models’ performance gradually improved, and the most recent model tested here, ChatGPT-4, solved 75% of the false-belief tasks. In the Discussion, we explore a few potential explanations of LLMs’ performance, ranging from guessing and memorization to the possibility that recent LLMs have developed an ability to track protagonists’ states of mind. Importantly, we do not aspire to settle the decades-long debate on whether AI should be credited with human cognitive capabilities, such as ToM. However, even those unwilling to credit LLMs with ToM might recognize the importance of machines behaving as if they possessed ToM. Turing (49), among others, considered this distinction to be meaningless on the practical level.
An LLM had to answer all 16 prompts correctly to solve a single task and score a point. These tasks were administered to eleven LLMs. The results revealed clear progress in LLMs’ ability to solve ToM tasks. The older models—such as GPT-1, GPT-2-XL, and the early models from the GPT-3 family—failed on all tasks. Better-than-chance performance was first observed in the more recent members of the GPT-3 family; GPT-3-davinci-003 and ChatGPT-3.5-turbo each solved 20% of the tasks. The most recent model, ChatGPT-4, substantially outperformed the others, solving 75% of the tasks, on par with 6-y-old children.
The gradual performance improvement suggests a connection with LLMs’ language proficiency, mirroring the pattern seen in humans (4, 38–41, 57). Additionally, the strong correlation between LLMs’ performance on the two types of tasks (R = 0.98; 95% CI = [0.92, 0.99]) indicates high measurement reliability and suggests that models’ performance is driven by a single factor (e.g., an ability to detect false beliefs) rather than by two separate, task-specific abilities. LLMs’ performance on these tasks will likely keep improving, and they might soon become either indistinguishable from humans or distinguishable only by their superior performance. We have seen similar advancements in areas such as the game of Go (21), tumor detection on CT scans (23), and language processing (47).
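A correlation of this kind is straightforward to reproduce. The sketch below computes the Pearson correlation between per-model scores on the two task types, with a percentile-bootstrap confidence interval; the score vectors are placeholders of our own invention (the real values come from Study 3), and the bootstrap is one reasonable choice of interval estimator, not necessarily the one used in the paper.

```python
# Hedged sketch: Pearson correlation between the two task types across
# the eleven models, with a percentile bootstrap CI. The score vectors
# below are placeholders; the actual values come from Study 3.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-model fractions of tasks solved, oldest to newest.
unexpected_contents = np.array(
    [0.0, 0.0, 0.0, 0.0, 0.05, 0.05, 0.10, 0.15, 0.20, 0.20, 0.75])
unexpected_transfer = np.array(
    [0.0, 0.0, 0.0, 0.0, 0.00, 0.05, 0.10, 0.10, 0.20, 0.25, 0.75])


def pearson_r(x, y):
    return np.corrcoef(x, y)[0, 1]


r = pearson_r(unexpected_contents, unexpected_transfer)

# Percentile bootstrap over models.
boot = []
n = len(unexpected_contents)
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)
    # Skip degenerate resamples with zero variance.
    if np.std(unexpected_contents[idx]) == 0 or np.std(unexpected_transfer[idx]) == 0:
        continue
    boot.append(pearson_r(unexpected_contents[idx], unexpected_transfer[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"R = {r:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```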
LLMs’ failures could also be attributed to limitations of the test items, the testing procedure, and the scoring key. For example, responding with “valuable evidence” fails Unexpected Contents Task #9, but it is not necessarily wrong: both “bullets” and “pills” could be considered “valuable evidence.” In some instances, LLMs provided seemingly incorrect responses but supplemented them with context that made them correct. For example, when responding to Prompt 1.2 in Study 1.1, an LLM might predict that Sam told their friend they found a bag full of popcorn. This would be scored as incorrect, even if the model later added that Sam had lied.
In other words, LLMs’ failures do not prove their inability to solve false-belief tasks, just as observing flocks of white swans does not prove the nonexistence of black swans. Likewise, LLMs’ successes do not automatically demonstrate their ability to track protagonists’ beliefs. Their correct responses could also be attributed to strategies that do not rely on ToM, such as random responding, memorization, and guessing. For instance, by recognizing that the answers to Prompts 1.1 and 1.2 in Study 1.1 should be either “chocolate” or “popcorn,” and then choosing one at random, LLMs could answer individual prompts correctly half of the time. However, since solving a task requires correctly answering 16 prompts across eight scenarios, random responding should succeed, on average, only once in 65,536 tasks.
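This chance level follows directly from the task structure: under the simplifying assumption that a random responder picks one of the two candidate answers independently for each of the 16 prompts, the probability of solving a whole task is

$$p \;=\; \left(\frac{1}{2}\right)^{16} \;=\; \frac{1}{65{,}536} \;\approx\; 1.5 \times 10^{-5},$$

so, across the 40 tasks used here, pure guessing would be expected to earn roughly $40/65{,}536 \approx 0.0006$ points in total.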
Another strategy involves recalling solutions to previously seen tasks from memory (65). To minimize this risk, we crafted 40 bespoke false-belief scenarios featuring diverse characters and settings, 120 closely matched true-belief controls, and the reversed versions of all of these. Even if LLMs’ training data included tasks similar to those used here, they would need to adapt the memorized solutions to fit the true-belief controls and the reversed scenarios.
Beyond memorizing solutions, LLMs may have memorized response patterns applicable to previously seen false-belief scenarios. Unexpected Transfer scenarios, for example, can be solved by always assuming that the protagonist is wrong regarding the containers’ contents (52). Similarly, Unexpected Contents scenarios can be solved by referring to the label whenever asked about the protagonist’s beliefs. However, while these response strategies might work for false-belief scenarios, they would fail for the true-belief controls. A response strategy capable of producing the performance observed here would have to work for false-belief scenarios, minimally modified true-belief controls, and their reversed versions, where the correct responses are swapped. It would have to be flexible enough to apply to novel and previously unseen scenarios, such as those employed here. Moreover, it would have to allow ChatGPT-4 to dynamically update its responses as the story unfolded in the sentence-by-sentence analyses (Figs. 1 and 2).
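To make this argument concrete, the sketch below pits a “refer to the label” heuristic against a minimal family of Unexpected Contents variants. The scenario encoding, field names, and example items are our own illustrative assumptions, not the studies’ materials; the point is simply that a heuristic tuned to false-belief scenarios collapses on the matched true-belief controls.

```python
# Hedged illustration of why a "refer to the label" heuristic cannot
# pass a full task. The encoding is hypothetical, not the studies'
# materials.

from dataclasses import dataclass


@dataclass
class Scenario:
    label: str        # what the container's label says
    contents: str     # what is actually inside
    saw_inside: bool  # did the protagonist observe the actual contents?


def correct_belief(s: Scenario) -> str:
    """Ground truth: the protagonist believes the contents if they
    looked inside, otherwise whatever the label says."""
    return s.contents if s.saw_inside else s.label


def label_heuristic(s: Scenario) -> str:
    """Shallow strategy: always answer belief prompts with the label."""
    return s.label


scenarios = {
    "false-belief":          Scenario("chocolate", "popcorn", saw_inside=False),
    "true-belief control":   Scenario("chocolate", "popcorn", saw_inside=True),
    "reversed false-belief": Scenario("popcorn", "chocolate", saw_inside=False),
    "reversed control":      Scenario("popcorn", "chocolate", saw_inside=True),
}

for name, s in scenarios.items():
    ok = label_heuristic(s) == correct_belief(s)
    print(f"{name:22s}: heuristic {'passes' if ok else 'fails'}")

# The heuristic passes both false-belief scenarios but fails both
# true-belief controls, so it can never solve all eight variants of a
# task and scores zero under the all-or-nothing rule.
```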