Psychology and Social Cognition

What breaks when humans and AI models misunderstand each other?

Explores whether misalignment in mutual theory of mind between humans and AI creates only communication problems or produces material consequences in autonomous action and collaboration.

Note · 2026-02-22 · sourced from Theory of Mind

Design fictions probing operationalized mutual theory of mind (MToM) between humans and AI agents reveal that theory of mind in human-AI interaction is not a one-directional problem. Three layers of mutual modeling must be maintained simultaneously (a code sketch follows the list):

  1. Human's understanding of what the AI knows about them. Users need to interrogate the AI's theory of mind model — "what does it know about me?" — and this knowledge shapes how they interact with the system.

  2. AI's representation of the human's mental model of the AI. The AI must model not just the human but the human's model of the AI's capabilities. Problems arise "when a human's mental model of an AI's capabilities doesn't align with the AI's actual capabilities" — people misapply AI to domains it wasn't designed for.

  3. Bidirectional updating through interaction. Both parties must update their models as interaction progresses. The AI learns about the user through both "chat space" (conversation) and "artifact space" (work products). The human calibrates their trust through explanations of what the AI did and why.
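A minimal sketch of how these three layers could be held as state in an agent, assuming a simple event-driven design; every class, field, and function name here is a hypothetical illustration, not an interface from the source:

```python
from dataclasses import dataclass, field

@dataclass
class HumanModelOfAI:
    """Layer 1: what the human believes the AI knows about them and can do."""
    believed_ai_knowledge: dict = field(default_factory=dict)
    believed_capabilities: set = field(default_factory=set)

@dataclass
class AIModelOfHuman:
    """Layer 2: the AI's model of the human, nesting the human's model of the AI."""
    facts_about_user: dict = field(default_factory=dict)
    believed_user_view_of_ai: dict = field(default_factory=dict)

def update_from_chat(ai_model: AIModelOfHuman, utterance: str) -> None:
    """Layer 3, chat space: the AI revises its user model from conversation."""
    ai_model.facts_about_user["last_utterance"] = utterance

def update_from_artifact(ai_model: AIModelOfHuman, artifact: dict) -> None:
    """Layer 3, artifact space: the AI revises its user model from work products."""
    ai_model.facts_about_user.update(artifact)

def explain_action(ai_model: AIModelOfHuman) -> str:
    """Layer 3, human side: an explanation lets the user recalibrate Layer 1."""
    return f"I acted on these beliefs about you: {ai_model.facts_about_user}"

def capability_misalignment(human: HumanModelOfAI, actual: set) -> set:
    """The Layer-2 failure mode: capabilities the user expects but the AI lacks."""
    return human.believed_capabilities - actual

human = HumanModelOfAI(believed_capabilities={"code review", "legal advice"})
print(capability_misalignment(human, actual={"code review"}))  # -> {'legal advice'}
```

The last two lines show the failure mode quoted in layer 2: the gap between believed and actual capabilities is exactly the set of domains the user will misapply the AI to.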

When these layers misalign, the consequences are material, not just communicative. Design fictions show AI agents acting on users' behalf based on predictive models — writing code, responding to messages, executing workflows. A faulty MToM doesn't just cause miscommunication; it causes incorrect autonomous action.
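As a toy illustration of that distinction (the scenario, fields, and messages are hypothetical, not drawn from the design fictions themselves): an agent acting from a stale user model produces a wrong action, not merely a wrong utterance.

```python
from dataclasses import dataclass

@dataclass
class UserPrefs:
    """Agent's predictive model of the user (hypothetical fields)."""
    auto_reply_ok: bool   # "the user wants routine messages answered for them"
    tone: str

def act_on_message(prefs: UserPrefs, message: str) -> str:
    """Acts autonomously when the model says it may; otherwise defers."""
    if prefs.auto_reply_ok:
        return f"SENT ({prefs.tone} auto-reply) -> {message!r}"
    return f"DRAFTED for review -> {message!r}"

# Stale model: the user revoked auto-replies, but the agent never updated.
stale = UserPrefs(auto_reply_ok=True, tone="casual")
print(act_on_message(stale, "Please confirm the revised contract terms."))
# A casual auto-reply goes out on a high-stakes thread: the failure is an
# incorrect action taken on the user's behalf, not a misread message.
```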

The design implications are organizational as much as interactional. The wider adoption scenario (MToM within an organization) shows how these dynamics scale: MToM can "reshape work practices by streamlining communications and delivering the right information to the right people at the right time," but every efficiency gain depends on model accuracy, and every inaccuracy has downstream consequences.

Empirical evidence from a Bayesian item response theory (IRT) study of human-AI synergy (n = 667) provides quantitative grounding for MToM's importance: theory of mind predicts collaborative performance with AI but not solo performance. Users with stronger perspective-taking achieve superior collaboration, and critically, moment-to-moment fluctuations in ToM (not just stable individual differences) influence AI response quality within sessions. This confirms that MToM is not merely a design-fiction aspiration but a measurable cognitive mechanism with quantifiable effects on collaboration outcomes. See "Does theory of mind predict who thrives in AI collaboration?".
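The note does not reproduce the study's model, but a hypothetical simulation can show what "predicts collaborative but not solo performance" means under a simple Rasch-style item response assumption, where ToM shifts latent ability only in the collaborative condition. The effect size `beta_collab`, the item count, and the sampling details below are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_correct(theta, b):
    """Rasch item response curve: ability theta vs. item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

n, items = 667, 20                       # n matches the study; items is assumed
tom = rng.normal(0.0, 1.0, n)            # standardized perspective-taking score
base = rng.normal(0.0, 1.0, n)           # baseline ability, independent of ToM
beta_collab = 0.6                        # assumed ToM effect on collaborative ability
theta_solo = base                        # solo ability: no ToM contribution
theta_collab = base + beta_collab * tom  # collaborative ability: ToM shifts it

b = rng.normal(0.0, 1.0, items)          # item difficulties
solo = rng.random((n, items)) < p_correct(theta_solo[:, None], b[None, :])
collab = rng.random((n, items)) < p_correct(theta_collab[:, None], b[None, :])

# ToM correlates with collaborative accuracy but not with solo accuracy.
print("ToM vs solo:  ", round(np.corrcoef(tom, solo.mean(1))[0, 1], 2))
print("ToM vs collab:", round(np.corrcoef(tom, collab.mean(1))[0, 1], 2))
```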


Source: Theory of Mind

Original note title: mutual theory of mind between humans and AI requires bidirectional model updating and creates material consequences from misalignment