Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses
Despite advancements, the extent to which large language models (LLMs) truly exhibit Theory of Mind (ToM) reasoning, and how closely that reasoning aligns with human ToM reasoning, remains underexplored in open-ended scenarios. Motivated by this gap, we assess the ability of LLMs to perceive human intentions and emotions and to integrate them into their ToM reasoning when answering open-ended questions. Our study utilizes posts from Reddit’s ChangeMyView platform, where crafting a persuasive response demands nuanced social reasoning. Comparing semantic similarity and lexical overlap metrics between human- and LLM-generated responses, our analysis reveals clear disparities in ToM reasoning on open-ended questions, with even the most advanced models showing notable limitations. To enhance LLM capabilities, we implement a prompt-tuning method that incorporates human intentions and emotions, which improves ToM reasoning performance. Even with these gains, however, the models still fall short of human-like reasoning.
ToM entails the capacity to attribute mental states, such as intentions, emotions, and beliefs, to oneself and others, and to understand that these states may differ from one’s own [21]. This cognitive ability is fundamental to human social interaction and crucial for effective communication. Large language models (LLMs), which have achieved impressive success in various natural language processing tasks [12, 44, 60], are now being pushed to the frontier of social reasoning to see whether they can mimic this quintessentially human trait, especially when interacting with humans.
Many studies have demonstrated this limitation of LLMs using multiple-choice and short-answer questions [14, 18, 57], but not open-ended ones. We aim to bridge this gap by rigorously evaluating the ability of LLMs to perform zero-shot ToM reasoning in open-ended scenarios, assessing how closely their performance aligns with human capabilities on ToM reasoning tasks. In particular, we aim to answer the following research questions:
• RQ1: To what degree are LLMs capable of zero-shot ToM reasoning on open-ended questions?
• RQ2: To what extent are human and LLM social reasoning capabilities aligned in addressing open-ended questions?
• RQ3: How does accounting for human mental states affect the performance of LLMs in ToM reasoning on open-ended questions? (See the illustrative sketch after this list.)
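As a concrete illustration of RQ3, the following is a minimal sketch of how a prompt could incorporate an author’s inferred mental states before querying an LLM. The build_tom_prompt helper, its field names, and its wording are hypothetical; this is not the paper’s actual prompt-tuning template.

    # Hypothetical prompt template for RQ3: 'intention' and 'emotion' would be
    # inferred or annotated upstream; the wording below is illustrative, not
    # the paper's actual prompt-tuning template.
    def build_tom_prompt(post: str, intention: str, emotion: str) -> str:
        return (
            "You are replying to a Reddit ChangeMyView post.\n"
            f"The author's likely intention: {intention}\n"
            f"The author's likely emotional state: {emotion}\n"
            "Taking these mental states into account, write a persuasive, "
            "empathetic response that addresses the author's view:\n\n"
            f"{post}"
        )

    print(build_tom_prompt(
        post="CMV: Remote work makes teams less creative.",
        intention="to test whether their view survives counter-evidence",
        emotion="frustrated but open-minded",
    ))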
Through comparative analyses of semantic similarity and lexical overlap scores between human and LLM responses, we observed significant disparities in reasoning capabilities in open-ended scenarios, mirroring those previously reported for non-open-ended questions. These findings reveal considerable limitations in even the most advanced LLMs.
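To make the comparison concrete, the following is a minimal sketch of how such pairwise scores can be computed, assuming a sentence-transformers embedding model for semantic similarity and ROUGE-L for lexical overlap; the specific model ("all-MiniLM-L6-v2") and metric variants are illustrative choices, not the study’s exact configuration.

    # Minimal sketch: semantic similarity via sentence embeddings and lexical
    # overlap via ROUGE-L. Model and metric choices are illustrative
    # assumptions, not the study's exact configuration.
    from sentence_transformers import SentenceTransformer, util
    from rouge_score import rouge_scorer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    lexical_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    def score_pair(human_response: str, llm_response: str) -> dict:
        # Cosine similarity between sentence embeddings (semantic similarity).
        emb = embedder.encode([human_response, llm_response], convert_to_tensor=True)
        semantic = util.cos_sim(emb[0], emb[1]).item()
        # ROUGE-L F1 between the two responses (lexical overlap).
        lexical = lexical_scorer.score(human_response, llm_response)["rougeL"].fmeasure
        return {"semantic_similarity": semantic, "lexical_overlap": lexical}

    print(score_pair(
        "You should consider how the policy affects renters, not just owners.",
        "Think about the impact on renters as well as property owners.",
    ))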
These studies suggest that the consistency with which LLMs demonstrate ToM abilities remains questionable; the models often default to surface-level reasoning strategies rather than engaging in deep, robust ToM reasoning [43]. Moreover, Kim et al. [14] introduced the FANTOM benchmark to rigorously assess ToM abilities in conversational contexts. The benchmark reveals significant challenges for state-of-the-art LLMs such as GPT-4, Llama 2, Falcon, and Mistral, which fall short of human performance on ToM reasoning tasks even with chain-of-thought reasoning or fine-tuning.
For instance, the Foresee and Reflect (FaR) framework offers a reasoning structure that encourages LLMs to anticipate future challenges and reason about potential actions; its analysis demonstrates the effectiveness of incorporating mental states into reasoning [64]. Additionally, Ma et al. [30] develop a comprehensive taxonomy for ToM in LLMs, known as Abilities in Theory of Mind Space (ATOMS), which categorizes crucial components such as Intentions, Percepts, Beliefs, Emotions, Knowledge, Desires, and Non-literal Communication. This framework aims to provide a structured approach for assessing and systematically enhancing ToM capabilities.
The BigToM benchmark has been developed specifically to assess LLMs’ social reasoning capabilities, focusing on aspects such as beliefs, percepts, desires, and user actions [9]. Another tool that targets mental states in ToM is SymbolicToM, which enhances ToM capabilities in reading comprehension tasks by explicitly representing entities’ beliefs and facilitating higher-order reasoning. This approach has shown promise in providing a deeper understanding of belief states and their implications for ToM [41].
In contrast to earlier research, which typically relied on multiple-choice formats or short-answer questions to evaluate the ToM capabilities of LLMs [57], our study adopts a more nuanced approach by utilizing open-ended questions. This method allows for a broader range of responses, providing deeper insights into how LLMs interpret and respond to complex scenarios. By shifting from structured to more exploratory questioning, we aim to uncover subtler aspects of ToM reasoning in LLMs.