DPMT: Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration

Paper · arXiv 2507.14088 · Published July 18, 2025

Real-time human-artificial intelligence (AI) collaboration is crucial yet challenging, especially when AI agents must adapt to diverse and unseen human behaviors in dynamic scenarios. Existing large language model (LLM) agents often fail to accurately model complex human mental characteristics such as domain intentions, especially in the absence of direct communication. To address this limitation, we propose a novel dual process multi-scale theory of mind (DPMT) framework, drawing inspiration from cognitive science’s dual process theory. Our DPMT framework incorporates a multi-scale theory of mind (ToM) module to facilitate robust human partner modeling through mental characteristic reasoning. Experimental results demonstrate that DPMT significantly enhances human-AI collaboration, and ablation studies further validate the contributions of our multi-scale ToM in the slow system.

Leveraging their advanced natural language understanding, LLM agents can interpret human commands and formulate subsequent action plans (Liu et al., 2023). By adjusting their collaboration strategies through mutual communication with human partners, LLM agents significantly improve the overall collaboration performance of human-AI teams (S. Zhang et al., 2024).

Unlike tasks that an LLM agent can perform independently, collaborative tasks like Overcooked (Ghost Town Games, 2016) require the agent to work with diverse partners, including humans, to efficiently complete a series of complex sub-tasks.

However, improved communication alone does not fully address the challenges of real-time scenarios. Current LLM agents lack the human-like cognitive ability known as “theory of mind” (ToM), which allows humans to understand and predict others’ mental beliefs based on observed behaviors in social environments (Astington & Jenkins, 1995). This ability facilitates efficient collaboration in tasks without direct communication. The absence of ToM hinders LLM agents’ performance in complex, real-time human-AI collaborative tasks. Although recent studies have explored ToM modeling to improve agent prediction (Rabinowitz et al., 2018; X. Li et al., 2023), these approaches rely on high-quality trajectories and prior knowledge of partners. As a result, their interpretability and generalization are limited, restricting their applicability in real collaborative scenarios.

Inspired by dual process theory (Vaisey, 2009; Lizardo et al., 2016), we propose a cognitive dual process multi-scale theory of mind (DPMT) framework to improve the interpretability and efficiency of human partner modeling in real-time human-AI collaboration. Our DPMT framework distinguishes between two decision-making systems for real-time human-AI collaboration: a fast system for automatic decisions and a slow system for modeling higher-level cognitive abilities. The core contribution of this work is a multi-scale theory of mind module that simulates the slow system: it interprets human partners’ behavioral trajectories and reasons about their mental characteristics, facilitating more effective collaboration. The ToM module follows a three-tiered reasoning process, which progresses from domain knowledge to cognitive style, and ultimately to domain intention, as illustrated in Figure 1. Experimental results from collaborative tasks in Overcooked demonstrate the effectiveness of DPMT in improving real-time human-AI collaboration.

Partner modeling. In multi-agent reinforcement learning (MARL) tasks, partners often exhibit dynamic and diverse strategies, introducing significant challenges in non-stationarity and generalization. To address these challenges in multi-agent collaboration, several studies (Carroll et al., 2019; Shih & Sawhney, 2021) have explored partner modeling, which predicts partners’ behaviors to enable more efficient collaboration in complex MARL scenarios.

The fast system focuses on quick, intuitive decision-making at each step, selecting a macro-action $m_t$ from a predefined macro-action set. In contrast, the slow system emphasizes cognitive multi-scale ToM reasoning, modeling the partner’s mental characteristics $k_t$, $y_t$, $n_t$ to assist the fast system in making macro-action decisions.

The action decoding module decomposes the current macro-action $m_t$ into atomic actions $a_t$ and executes them at a much higher frequency until $m_t$ is completed. Once $m_t$ is finished, the fast system determines the subsequent $m_{t+1}$ based on the multi-scale partner reasoning $k_t$, $y_t$, $n_t$ from the slow reasoning system. This hierarchical approach ensures a seamless integration of intuitive decision-making and cognitive reasoning for efficient human-AI collaboration.
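The interplay between macro-action selection and atomic-action execution described here can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `PartnerEstimate`, `run_loop`, `toy_slow`, `toy_fast`, and the `MACROS` table are hypothetical, and the toy decision rules are arbitrary placeholders for the reasoning the framework actually performs.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple


@dataclass(frozen=True)
class PartnerEstimate:
    """Slow-system output (knowledge, style, intention); contents are placeholders."""
    knowledge: str
    style: str
    intention: str


def run_loop(
    fast_system: Callable[[PartnerEstimate], str],
    slow_system: Callable[[List[str]], PartnerEstimate],
    decode: Callable[[str], List[str]],
    partner_stream: Iterator[str],
    num_macro_actions: int,
) -> List[Tuple[str, List[str]]]:
    """Alternate macro-action selection with higher-frequency atomic execution."""
    partner_history: List[str] = []  # observed partner actions so far
    log: List[Tuple[str, List[str]]] = []
    for _ in range(num_macro_actions):
        estimate = slow_system(partner_history)  # partner reasoning
        macro = fast_system(estimate)            # next macro-action
        atoms = decode(macro)                    # atomic actions for this macro-action
        for _atom in atoms:
            # One environment tick per atomic action; here a tick only
            # records what the partner was observed doing.
            partner_history.append(next(partner_stream))
        log.append((macro, atoms))
    return log


# Hypothetical macro-action table; the atomic sequences are arbitrary.
MACROS = {
    "Chop Tomato": ["LEFT", "UP"],
    "Plate Soup": ["RIGHT", "RIGHT", "DOWN"],
}


def toy_slow(history: List[str]) -> PartnerEstimate:
    # Placeholder rule: guess an intention once two partner actions are seen.
    intention = "Chop Tomato" if len(history) >= 2 else "unknown"
    return PartnerEstimate("unknown", "unknown", intention)


def toy_fast(estimate: PartnerEstimate) -> str:
    # Placeholder rule: avoid duplicating the partner's inferred goal.
    return "Plate Soup" if estimate.intention == "Chop Tomato" else "Chop Tomato"


result = run_loop(toy_fast, toy_slow, MACROS.__getitem__, iter(["LEFT"] * 10), 2)
```

In this sketch the slow system is queried once per macro-action, which is the simplest schedule consistent with the description; a real system might run it asynchronously.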

These studies categorize various mental characteristics that influence individual behavior into three key dimensions: domain knowledge, cognitive style, and domain intention.

This system organizes these mental characteristics into a hierarchical architecture, with domain knowledge forming the foundation and domain intentions at the top, progressively enhancing interpretability. Building on this, we propose a multi-scale ToM model as our slow system, simulating the slow cognitive process in the dual process theory. This ToM model comprises multiple stages of human partner mental characteristic reasoning: the human domain knowledge reasoning stage $\text{ToM}_{\text{knowledge}}$, the human cognitive style reasoning stage $\text{ToM}_{\text{style}}$, and the human domain intention reasoning stage $\text{ToM}_{\text{intention}}$.
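The staged structure can be illustrated as a chain of three functions in which each stage consumes the output of the one before it. The sketch below is hypothetical: `multi_scale_tom`, `ToMResult`, and the three stub stages are illustrative names, and the stub rules are arbitrary stand-ins with no basis in the paper, which does not specify the stage internals in this section.

```python
from typing import Callable, List, NamedTuple


class ToMResult(NamedTuple):
    knowledge: str   # output of the domain knowledge stage
    style: str       # output of the cognitive style stage
    intention: str   # output of the domain intention stage


def multi_scale_tom(
    trajectory: List[str],
    state: str,
    tom_knowledge: Callable[[List[str]], str],
    tom_style: Callable[[List[str], str], str],
    tom_intention: Callable[[str, str], str],
) -> ToMResult:
    """Run the three stages in order: knowledge, then style, then intention."""
    knowledge = tom_knowledge(trajectory)
    style = tom_style(trajectory, knowledge)  # style builds on knowledge
    intention = tom_intention(style, state)   # intention builds on style and state
    return ToMResult(knowledge, style, intention)


# Arbitrary stub stages, for illustration only.
def stub_knowledge(trajectory: List[str]) -> str:
    return "novice" if len(trajectory) < 3 else "experienced"


def stub_style(trajectory: List[str], knowledge: str) -> str:
    return "field-independent" if knowledge == "experienced" else "field-dependent"


def stub_intention(style: str, state: str) -> str:
    if style == "field-dependent" and "tomato" in state:
        return "Chop Tomato"
    return "unknown"


tom_result = multi_scale_tom(
    ["LEFT", "UP"], "raw tomato on counter",
    stub_knowledge, stub_style, stub_intention,
)
```

Passing the stages as callables keeps the ordering explicit while leaving open how each stage is realized.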

Cognitive style reasoning stage. Cognitive style reflects a human partner’s decision-making preferences given their domain knowledge, including personality traits, behavioral strategy preferences, and risk preferences. In terms of personality traits, cognitive style can be categorized as field-independent (players who prefer to complete an entire order independently) or field-dependent (players who tend to collaborate by dividing an order into sub-tasks). Among field-dependent players, cognitive styles can be further categorized by behavioral strategy, which defines different task tendencies. For example, an ingredient-preparation-oriented style focuses primarily on retrieving and processing various ingredients. This style can also be divided into stable (consistently following a fixed strategy) and random according to the behavioral strategy. Accurate classification of cognitive styles is crucial for effective partner intention modeling.
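One way to make this taxonomy concrete is as a small typed record. The sketch below is an assumption-laden illustration: the class and field names (`Personality`, `Consistency`, `CognitiveStyle`, `task_tendency`) are hypothetical, and the rule that a task tendency applies only to field-dependent partners is one reading of the text rather than a stated constraint.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Personality(Enum):
    FIELD_INDEPENDENT = "field-independent"  # completes whole orders alone
    FIELD_DEPENDENT = "field-dependent"      # splits orders into sub-tasks


class Consistency(Enum):
    STABLE = "stable"  # consistently follows a fixed strategy
    RANDOM = "random"


@dataclass(frozen=True)
class CognitiveStyle:
    personality: Personality
    # Free-form label such as "ingredient-preparation-oriented".
    task_tendency: Optional[str] = None
    consistency: Optional[Consistency] = None

    def __post_init__(self) -> None:
        # Assumed constraint: task tendencies refine field-dependent styles only.
        if (
            self.personality is Personality.FIELD_INDEPENDENT
            and self.task_tendency is not None
        ):
            raise ValueError("task_tendency applies to field-dependent styles")


prep_style = CognitiveStyle(
    Personality.FIELD_DEPENDENT,
    task_tendency="ingredient-preparation-oriented",
    consistency=Consistency.STABLE,
)
```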

Domain intentions reasoning stage. Building upon the cognitive style and the current domain state, domain intentions include both short-term and long-term intentions of the human partner, which are crucial for achieving efficient human-AI collaboration. Short-term intention reasoning involves atomic action prediction, determining the human partner’s current action, such as UP, DOWN, LEFT, or RIGHT. In contrast, long-term intention reasoning focuses on predicting the human partner’s macro-action, specifying their primary goal for the current phase, such as Chop Tomato, Prepare Bob Ingredients, Cook Alice Soup, or Plate David Soup.
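The two time scales of intention can be represented as a pair of predictions, one atomic and one macro. The following sketch is illustrative only: `DomainIntention` and `short_term_accuracy` are hypothetical names, the atomic action set is limited to the four examples given in the text (the full set may be larger), and the accuracy helper is a generic check rather than a metric reported by the paper.

```python
from dataclasses import dataclass
from typing import List

# Only the four examples named in the text; the real action set may be larger.
ATOMIC_ACTIONS = ("UP", "DOWN", "LEFT", "RIGHT")


@dataclass(frozen=True)
class DomainIntention:
    short_term: str  # predicted atomic action for the current step
    long_term: str   # predicted macro-action, e.g. "Chop Tomato"

    def __post_init__(self) -> None:
        if self.short_term not in ATOMIC_ACTIONS:
            raise ValueError(f"unknown atomic action: {self.short_term}")


def short_term_accuracy(
    predicted: List[DomainIntention], observed: List[str]
) -> float:
    """Fraction of steps where the predicted atomic action matched the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        return 0.0
    hits = sum(p.short_term == o for p, o in zip(predicted, observed))
    return hits / len(predicted)


predictions = [
    DomainIntention("UP", "Chop Tomato"),
    DomainIntention("LEFT", "Chop Tomato"),
]
accuracy = short_term_accuracy(predictions, ["UP", "RIGHT"])
```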