Towards Machine Theory of Mind with Large Language Model-Augmented Inverse Planning
We propose a hybrid approach to machine Theory of Mind (ToM) that pairs large language models (LLMs), used to generate hypotheses and likelihood functions, with a Bayesian inverse planning model that computes posterior probabilities over an agent’s likely mental states given its actions. Bayesian inverse planning models can accurately predict human reasoning on a variety of ToM tasks, but they struggle to scale these predictions to scenarios with a large number of possible hypotheses and actions. Conversely, LLM-based approaches have recently shown promise on ToM benchmarks, but can exhibit brittleness and fail on reasoning tasks even when they pass otherwise structurally identical versions. By combining these two methods, our approach leverages the strengths of each component, closely matching optimal results on a task inspired by prior inverse planning models and improving performance relative to models that use LLMs alone or with chain-of-thought prompting, even with smaller LLMs that typically perform poorly on ToM tasks. We also demonstrate the model’s potential to predict mental states on open-ended tasks, offering a promising direction for future development of ToM models and the creation of socially intelligent generative agents.
In this paper, we present a hybrid approach, LLM-AUGMENTED INVERSE PLANNING (LAIP), that exploits the complementary strengths of Bayesian inverse planning models and LLMs. By integrating the generative capabilities of LLMs, inverse planning models are, in principle, unbounded in the number of hypotheses they can entertain about an agent’s beliefs and desires, or about its actions given its state, in any given situation. Conversely, by explicitly formalizing the process of inverse planning, we show that this hybrid model is less susceptible to zero-shot reasoning errors than LLMs without specific prompting or with generic chain-of-thought (CoT) prompting.
Inspired by probabilistic models of human cognition (e.g., Chater et al., 2006; Tenenbaum & Griffiths, 2001), work by Verma & Rao (2005), Baker et al. (2009; 2011), and Rafferty et al. (2015) formalized the understanding of others’ beliefs, desires, and intentions as an instance of Bayesian reasoning within a partially observable Markov decision process (POMDP). Within this framework, an observer engages in inverse planning, inverting its own process of generating an action policy from beliefs and desires in order to reason about the unobserved internal states that give rise to another agent’s behaviours. These models have been extended to account for both children’s and adults’ commonsense reasoning that others will act according to a naive form of expected utility, maximizing expected rewards and minimizing costs (Jara-Ettinger et al., 2016; 2020; Lucas et al., 2014). Within this broader framework, ToM can be thought of as equivalent to inverse reinforcement learning (IRL; Jara-Ettinger, 2019; Ruiz-Serra & Harré, 2023), recovering an agent’s reward structure from actions that are assumed to be generated by an optimal policy given the agent’s beliefs.
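In schematic form (the notation here is ours, for illustration; the cited models differ in their exact parameterizations), inverse planning amounts to inverting a generative model of approximately rational action via Bayes’ rule:

\[
P(r, b \mid a_{1:T}, s_{1:T}) \;\propto\; P(r, b)\,\prod_{t=1}^{T} P(a_t \mid r, b, s_t),
\qquad
P(a_t \mid r, b, s_t) \;\propto\; \exp\!\big(\beta\, Q(s_t, a_t; r, b)\big),
\]

where $r$ and $b$ denote the agent’s rewards (desires) and beliefs, $a_{1:T}$ and $s_{1:T}$ are the observed actions and states, and the softmax likelihood encodes the assumption that the agent plans approximately rationally, with $\beta$ controlling how noisy that planning is.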
An overview of the architecture of the LAIP model is presented in Figure 1 (see also Algorithm 1 in Appendix A.1). Broadly, the model performs Bayesian inverse planning to reason about a target agent’s preferences given its actions. After first generating a prior over possible hypotheses about the agent’s preferences, the LLM is given, at each time step of a task, the agent’s situation and the agent’s observation of the environment. The LLM then simulates the agent’s perspective on the task, reasoning about the agent’s likely choices given that state, and from this reasoning produces the likelihood of each possible action under each hypothesis. After the agent acts, the LLM updates the posterior distribution over hypotheses given the action chosen by the agent.
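A minimal sketch of one such time step is shown below. The helper llm_action_likelihoods and its interface are hypothetical stand-ins for the LLM calls described above, not the paper’s actual implementation, and the Bayesian update is written out explicitly here for clarity.

import numpy as np

def laip_update(hypotheses, prior, state, observation, actions,
                observed_action, llm_action_likelihoods):
    """One LAIP time step: LLM-simulated likelihoods plus a Bayesian update.

    hypotheses: list of natural-language hypotheses about the agent's preferences.
    prior: np.ndarray of prior probabilities over those hypotheses.
    llm_action_likelihoods: hypothetical callable that prompts the LLM to simulate
        the agent's perspective under one hypothesis and return a probability for
        each entry in `actions`, i.e. P(action | hypothesis, state, observation).
    """
    # For each hypothesis, have the LLM reason from the agent's perspective
    # and assign a likelihood to every available action.
    likelihoods = np.array([
        llm_action_likelihoods(h, state, observation, actions)
        for h in hypotheses
    ])  # shape: (n_hypotheses, n_actions)

    # After the agent acts, apply Bayes' rule on the chosen action.
    a = actions.index(observed_action)
    posterior = prior * likelihoods[:, a]
    return posterior / posterior.sum()

The posterior returned at one time step serves as the prior at the next, so evidence about the agent’s preferences accumulates over the agent’s trajectory.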
When the Japanese restaurant is closed, the agent’s actions are consistent with a strong preference for the Japanese restaurant, followed by the Chinese restaurant, followed by the Mexican restaurant. Since the agent cannot observe whether the Japanese restaurant is open until reaching Room 3, moving away from the Japanese restaurant after observing that it is closed reveals nothing about the agent’s preferences; the fact that it then moves towards the Chinese restaurant, however, indicates a preference for the Chinese restaurant over the Mexican restaurant. When the Japanese restaurant is open, by contrast, the agent’s actions are not consistent with any strong preference hierarchy and may reflect weak or inconsistent preferences. Thus, in the closed condition, a model that reasons about the agent’s preferences from its actions together with a representation of the agent’s belief states should infer a strong preference for the Japanese restaurant from the agent’s policy.
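As a toy illustration of this inference (the numbers below are invented for exposition, not taken from the model’s outputs): the step away from the Japanese restaurant taken after the agent has seen that it is closed is roughly equally likely under every preference ordering and so leaves the posterior unchanged, whereas the subsequent step towards the Chinese restaurant is far more likely under orderings that rank Chinese above Mexican.

import numpy as np

# Illustrative preference orderings (J = Japanese, C = Chinese, M = Mexican).
hypotheses = ["J>C>M", "J>M>C", "C>J>M"]
posterior = np.array([1/3, 1/3, 1/3])

# Hypothetical action likelihoods under each ordering.
step_away_after_seeing_closed = np.array([0.5, 0.5, 0.5])  # uninformative: equal under all
step_towards_chinese          = np.array([0.8, 0.2, 0.8])  # favours orderings with C above M

for lik in (step_away_after_seeing_closed, step_towards_chinese):
    posterior = posterior * lik
    posterior = posterior / posterior.sum()

print(dict(zip(hypotheses, posterior.round(2))))  # mass shifts away from "J>M>C"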