Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models

Paper · arXiv 2502.11881 · Published February 17, 2025
Theory of Mind · Psychology · Chatbots · Conversation · Philosophy · Subjectivity

Existing LLM reasoning methods have shown impressive capabilities across various tasks, such as solving math and coding problems. However, applying these methods to scenarios without ground-truth answers or rule-based verification methods—such as tracking the mental states of an agent—remains challenging. Inspired by the sequential Monte Carlo algorithm, we introduce ThoughtTracing, an inference-time reasoning algorithm designed to trace the mental states of specific agents by generating hypotheses and weighting them based on observations, without relying on ground-truth solutions to questions in datasets. Our algorithm is inspired by the Bayesian theory-of-mind framework, using LLMs to approximate probabilistic inference over agents' evolving mental states based on their perceptions and actions. We evaluate ThoughtTracing on diverse theory-of-mind benchmarks, demonstrating significant performance improvements compared to baseline LLMs.

As LLM-powered AI agents increasingly interact with humans (Collins et al., 2024), the ability to track and infer others’ mental states from open-ended textual input will become crucial for broader applications and data synthesis in the social interaction domain.

To account for the inherent uncertainty in this task, we follow the high-level structure of sequential Monte Carlo inference (SMC; Del Moral et al., 2006; Lew et al., 2023b), tracking multiple weighted hypotheses about the agent's mental states. Importantly, however, these hypotheses are represented in open-ended natural language, and are both generated and weighted by LLMs. By using LLMs in this way, our method offers a generalizable approach to uncovering the underlying thoughts that drive agent behavior, without requiring the explicit probabilistic models assumed by BToM algorithms, and without relying on question-answer annotations or benchmark-specific assumptions.

Although ThoughtTracing is not specifically designed for solving benchmark questions, we evaluate the algorithm using state-of-the-art LLMs on four theory-of-mind benchmarks to assess whether the traced thoughts can enhance downstream question-answering performance. Experiments show that (1) ThoughtTracing consistently improves performance of all tested models across all benchmarks, (2) additional inference-time compute (e.g., chain-of-thought) further helps models process contexts interleaved with mental states, and (3) models with ThoughtTracing outperform reasoning models (e.g., o3-mini and R1) despite using significantly shorter reasoning traces.

Furthermore, our results reveal interesting behavioral patterns in existing reasoning models on theory-of-mind tasks: (1) they do not consistently outperform vanilla LLMs using chain-of-thought reasoning, (2) they fail to generalize to similar scenarios, (3) they produce significantly longer reasoning traces for theory-of-mind questions than for factual questions, and (4) reasoning effort (e.g., output length) does not correlate with performance. These findings highlight that social reasoning differs from mathematical or programming reasoning, areas where reasoning models typically excel.

These results suggest that ThoughtTracing represents a promising step towards more robust inference-time ToM reasoning. We aim to spark new discussions on inference-time reasoning in the social domain, contrasting with the predominant focus on math and coding.

Bayesian Theory of Mind (Baker et al., 2017) frames mental state attribution as probabilistic inference over a generative model of a rational agent. It focuses on several important roles that beliefs and goals play in theory of mind: the agent's perceptions and their prior beliefs jointly determine the current belief, and beliefs and goals are the causes of the agent's actions. As such, beliefs and goals can be inferred in various ways: forward simulation of beliefs from the agent's perceptions and prior beliefs, backward inference from the agent's observed actions, or through a joint integration of all available information.
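The backward-inference direction can be illustrated with a toy discrete example. The scenario, states, and probabilities below are invented for this sketch; an observer applies Bayes' rule to infer where an agent believes an object is from the agent's observed action:

```python
# Toy Bayesian theory-of-mind inference (illustrative; all numbers are
# assumptions for this sketch). The observer infers the agent's belief
# about an object's location ("A" or "B") from the observed action:
#   P(belief | action) ∝ P(action | belief) * P(belief)

prior = {"A": 0.5, "B": 0.5}  # observer's prior over the agent's belief

# Likelihood of the agent searching location A under each possible belief;
# a rational agent usually searches where it believes the object is.
likelihood_search_A = {"A": 0.9, "B": 0.1}

# Backward inference from the observed action "search A":
unnorm = {b: likelihood_search_A[b] * prior[b] for b in prior}
z = sum(unnorm.values())
posterior = {b: w / z for b, w in unnorm.items()}

print(posterior)  # → {'A': 0.9, 'B': 0.1}
```

Observing a search of location A shifts the observer's estimate of the agent's belief toward A, exactly the "beliefs cause actions" link described above.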

ThoughtTracing follows this framework’s structure without using explicit models of rational belief updating and action selection. Instead, we use LLMs to simulate how an agent is likely to update their mental states in response to their perceptions, and to evaluate the likelihood of an agent’s actions given their mental states. This enables greater generality, decomposing social inference into simpler tasks: forward simulation and likelihood evaluation by LLMs.

Sequential Monte Carlo (SMC) refers to a family of algorithms designed for incremental inference over sequences of posterior distributions (Del Moral et al., 2006), such as posterior inference over latent dynamics from time series data. SMC uses a collection of weighted hypotheses (called particles) to approximate each distribution in the sequence. Given particles for step t − 1 in the sequence, SMC generates particles for step t via propagation (extending each previous particle with new latent states that exist at step t) and reweighting (weighting each particle by the likelihood of the observed data under that particle's latent states). The particles are then resampled according to their weights (focusing samples on regions of high posterior probability) and optionally rejuvenated with Markov chain Monte Carlo (MCMC) to increase particle diversity (Chopin, 2002). To improve inference quality, SMC can make use of custom proposals for propagation or rejuvenation (Lew et al., 2023a), using data-driven cues to generate more plausible hypotheses (Perov et al., 2015).
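The propagate–reweight–resample cycle can be sketched as a minimal bootstrap particle filter for a 1D Gaussian random walk. This is a standard SMC illustration, not the paper's algorithm; the model and noise scales are assumptions chosen for the sketch:

```python
import math
import random

# Minimal bootstrap particle filter for a 1D Gaussian random walk with
# Gaussian observation noise (both standard deviations assumed to be 1).

def propagate(x):
    # Extend each particle with a new latent state for step t.
    return x + random.gauss(0.0, 1.0)

def log_likelihood(obs, x):
    # Log-likelihood of the observation under this particle's latent state.
    return -0.5 * (obs - x) ** 2

def resample(particles, weights):
    # Focus samples on regions of high posterior probability.
    return random.choices(particles, weights=weights, k=len(particles))

def smc(observations, n_particles=200):
    particles = [0.0] * n_particles
    for obs in observations:
        particles = [propagate(x) for x in particles]          # propagation
        logw = [log_likelihood(obs, x) for x in particles]     # reweighting
        m = max(logw)
        weights = [math.exp(lw - m) for lw in logw]            # stabilize
        particles = resample(particles, weights)               # resampling
    return particles

random.seed(0)
posterior = smc([0.5, 1.0, 1.8, 2.5])
print(sum(posterior) / len(posterior))  # posterior mean tracks the data
```

Rejuvenation (an MCMC move after resampling) is omitted here for brevity; it would perturb each particle while preserving the target distribution.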

To infer hidden mental states that change over time, ThoughtTracing follows an SMC-like structure, using LLMs as proposals for propagating hypotheses about agents’ mental states, and weighting these hypotheses by likelihood scores generated by LLMs. However, for simplicity, ThoughtTracing does not compute full importance weights for each particle, since this would require accessing LLM log probabilities not provided by most APIs.
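The SMC-like structure just described can be sketched as follows. The function names (`propose_mental_states`, `score_action_likelihood`) and the scoring scheme are illustrative assumptions, not the paper's API; in a real implementation both calls would be backed by an LLM, and here they are stubbed so the control flow runs standalone:

```python
import random

def propose_mental_states(hypothesis, perception):
    # LLM proposal: forward-simulate how the agent's beliefs and goals
    # update given a new perception (stubbed with string concatenation;
    # a real system would prompt an LLM here).
    return f"{hypothesis} | after perceiving: {perception}"

def score_action_likelihood(hypothesis, action):
    # LLM-generated likelihood score: how plausible is the observed action
    # under the hypothesized mental states? Stubbed with a random score in
    # lieu of a model call; no log-probabilities are needed.
    return random.uniform(1.0, 5.0)

def thought_tracing(events, n_hypotheses=4):
    # Each hypothesis is an open-ended natural-language description of the
    # agent's mental states, playing the role of an SMC particle.
    hypotheses = ["initial mental state"] * n_hypotheses
    for perception, action in events:
        # Propagation: extend each hypothesis with the new perception.
        hypotheses = [propose_mental_states(h, perception) for h in hypotheses]
        # Reweighting: score each hypothesis against the observed action.
        scores = [score_action_likelihood(h, action) for h in hypotheses]
        # Resampling: keep the hypotheses the scores favor.
        hypotheses = random.choices(hypotheses, weights=scores, k=n_hypotheses)
    return hypotheses
```

The key departure from classical SMC, as noted above, is that the weights are LLM-generated plausibility scores rather than full importance weights, since most LLM APIs do not expose the log-probabilities the latter would require.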