MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind

Paper · arXiv 2507.04415 · Published July 6, 2025

Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MOMENTS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MOMENTS includes 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI’s multimodal understanding of human behavior.

Throughout our lives, we continuously generate hypotheses about other people’s emotions, knowledge, and a range of other mental states; these hypotheses guide how we understand and interact with others. This ability, known as Theory of Mind (ToM) (Premack and Woodruff, 1978), is essential for interpreting behavior at the individual level and fundamental to coherent human social interaction (Byom and Mutlu, 2013).

Humans rely on more than just language to express their mental states. Gaze, facial expressions, body posture, gestures, and vocal cues all play an important role in communicating how we feel and what we think. This combination of verbal and non-verbal cues provides relevant multimodal information to infer mental states of others (Byom and Mutlu, 2013; Bayliss and Tipper, 2006; De Sonneville et al., 2002).

For artificial agents, this information can serve as multimodal input that enhances socially intelligent behavior, empowering users across a wide range of applications: from facilitating communication and enhancing collaboration to offering companionship. A robust ToM enables such systems to anticipate intentions, understand desires and emotions, and detect knowledge gaps, allowing them to adapt their behavior to support users more effectively (Oguntola et al., 2021). Importantly, this requires not only inferring individual mental states, but doing so in context: accurately "reading the room" by processing these signals to interpret human behavior in socially situated settings (Williams et al., 2022).

Existing benchmarks for measuring ToM in artificial agents predominantly center on belief-tracking tasks within text-based narratives or simplified multimodal settings (Chen et al., 2025a). While these approaches evaluate models’ ability to reason about who knows or believes what, they frequently neglect the interplay of emotions, intentions, pragmatic communication, and social contexts that characterize genuine human interactions. Consequently, a clear gap exists between existing evaluations and the richer, socially grounded reasoning required in realistic scenarios.

To support the development of socially intelligent multimodal agents and assess current models’ ToM in realistic, socially grounded scenarios, we introduce MOMENTS (Multimodal Mental States), a comprehensive multimodal video question-answering benchmark designed to evaluate ToM across seven abilities derived from the ATOMS taxonomy (Beaudoin et al., 2020): Intentions, Desires, Beliefs, Knowledge, Percepts, Non-literal Communication, and Emotions. The dataset comprises 2,344 human-annotated questions and 9,376 candidate answers sourced from 168 long-form videos, annotated with short and long context windows, multimodal cue markers, and adversarially-generated distractors to minimize biases.
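
To make this annotation structure concrete, here is a minimal sketch of how a single benchmark item might be represented. This is an assumption about the schema, not the released format: the field names, types, and all example values are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class MomentsItem:
    """Hypothetical record for one MOMENTS question (schema assumed, not released)."""
    video_id: str                        # source short film
    category: str                        # one of the seven ATOMS abilities
    question: str                        # human-annotated multiple-choice question
    choices: list[str]                   # four candidate answers: 1 correct + 3 distractors
    answer_idx: int                      # index of the correct answer in `choices`
    short_window: tuple[float, float]    # (start, end) seconds of the short context window
    full_window: tuple[float, float]     # (start, end) seconds of the full context window
    cue_markers: list[str] = field(default_factory=list)  # e.g. ["visual", "auditory"]

# Invented example values, purely for illustration.
item = MomentsItem(
    video_id="film_042",
    category="Non-literal Communication",
    question="Why does the character say 'great, just great' after reading the letter?",
    choices=["She is genuinely pleased", "She is being sarcastic",
             "She misread the letter", "She is quoting the letter"],
    answer_idx=1,
    short_window=(312.0, 358.5),
    full_window=(0.0, 358.5),
    cue_markers=["visual", "auditory"],
)
```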

Within the longer Full Context Window, models answer Knowledge, Desires, and Non-literal Communication (NLC) questions relatively well, suggesting that longer context helps in understanding characters’ backgrounds and effectively answering these questions. Under both context settings, Percepts and Beliefs remain the most challenging abilities. Future work should investigate how context window length affects human performance on this task.

Precise Vision–Speech Alignment. Answering "Who said what, when?" requires time-synchronized links between each utterance, the speaking character, and the surrounding visual context. Without such alignment, models cannot track which speakers possess which knowledge, nor can they exploit gaze, facial expressions, or body language that modulate dialogue meaning. The small gains we observe from adding vision (Table 3), and the limited improvements on questions marked as reliant on visual cues (Table 4), indicate that existing pipelines underutilize this channel.
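
As a minimal sketch of what such alignment could look like in code, the snippet below links a timestamped, speaker-attributed utterance to the visual frames sampled during the same interval. The data structures and the example values (including the character name) are our own assumptions, not part of the benchmark's release.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    start: float   # seconds from video start
    end: float
    speaker: str   # resolved character identity
    text: str

def frames_for_utterance(utt: Utterance, fps: float = 1.0) -> list[float]:
    """Timestamps of the visual frames that co-occur with an utterance.

    Pairing spoken content with the frames sampled during the same interval
    is what lets a model ground dialogue in gaze, facial expressions, and
    body language at the moment of speech.
    """
    frames, t = [], utt.start
    while t <= utt.end:
        frames.append(round(t, 3))
        t += 1.0 / fps
    return frames

# Invented example: sample co-occurring frames at 2 fps during one line.
utt = Utterance(start=12.4, end=15.1, speaker="Mara", text="I had no idea.")
print(frames_for_utterance(utt, fps=2.0))  # [12.4, 12.9, 13.4, 13.9, 14.4, 14.9]
```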

Our evaluation addresses three research questions: (i) How well do these models perform overall and across different ToM abilities? (ii) To what degree do visual information and context length impact performance? (iii) How effective is our LLM-in-the-loop distractor creation platform at mitigating answer set biases?
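
Regarding (iii), the section above does not detail the platform's internals; the following is a minimal sketch, assuming one plausible LLM-in-the-loop pattern: candidate distractors are regenerated until a text-only probe can no longer identify the correct answer from the shuffled answer set alone. Both `generate` and `answer_only_guess` are hypothetical callables, not the paper's actual interface.

```python
import random

def create_distractors(question, correct, generate, answer_only_guess, max_rounds=5):
    """LLM-in-the-loop adversarial distractor creation (hypothetical sketch).

    generate(question, correct)  -> list of three candidate distractor strings.
    answer_only_guess(choices)   -> index an LLM picks when shown ONLY the
                                    shuffled answer set, with no question or
                                    video; success means the set leaks a bias.
    """
    for _ in range(max_rounds):
        distractors = generate(question, correct)
        choices = [correct] + distractors
        random.shuffle(choices)
        if choices[answer_only_guess(choices)] != correct:
            return distractors   # probe failed to spot the answer: keep this set
    return distractors           # give up after max_rounds; flag for human review
```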

Video input improves performance in most cases. However, the gains are modest, indicating that current models may underutilize visual cues. Performance tends to drop when using the longer Full Context Window; we attribute this to the fact that long video understanding remains challenging for open models.