Psychology and Social Cognition

What breaks when humans and AI models misunderstand each other?

Explores whether misalignment in mutual theory of mind between humans and AI creates only communication problems or produces material consequences in autonomous action and collaboration.

Note · 2026-02-22 · sourced from Theory of Mind

Design fictions probing operationalized mutual theory of mind (MToM) between humans and AI agents reveal that theory of mind in human-AI interaction is not a one-directional problem. Three layers of mutual modeling must be maintained simultaneously (a code sketch follows the list):

  1. Human's understanding of what the AI knows about them. Users need to interrogate the AI's theory of mind model — "what does it know about me?" — and this knowledge shapes how they interact with the system.

  2. AI's representation of the human's mental model of the AI. The AI must model not just the human but the human's model of the AI's capabilities. Problems arise "when a human's mental model of an AI's capabilities doesn't align with the AI's actual capabilities" — people misapply AI to domains it wasn't designed for.

  3. Bidirectional updating through interaction. Both parties must update their models as interaction progresses. The AI learns about the user through both "chat space" (conversation) and "artifact space" (work products). The human calibrates their trust through explanations of what the AI did and why.
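A minimal sketch of how these three layers could be held as state in an agent, assuming a simple event-driven design; every class, field, and function name here is a hypothetical illustration, not an interface from the source:

```python
from dataclasses import dataclass, field

@dataclass
class HumanModelOfAI:
    """Layer 1: what the human believes the AI knows about them and can do."""
    believed_ai_knowledge: dict = field(default_factory=dict)
    believed_capabilities: set = field(default_factory=set)

@dataclass
class AIModelOfHuman:
    """Layer 2: the AI's model of the human, nesting the human's model of the AI."""
    facts_about_user: dict = field(default_factory=dict)
    believed_user_view_of_ai: dict = field(default_factory=dict)

def update_from_chat(ai_model: AIModelOfHuman, utterance: str) -> None:
    """Layer 3, chat space: the AI revises its user model from conversation."""
    ai_model.facts_about_user["last_utterance"] = utterance

def update_from_artifact(ai_model: AIModelOfHuman, artifact: dict) -> None:
    """Layer 3, artifact space: the AI revises its user model from work products."""
    ai_model.facts_about_user.update(artifact)

def explain_action(ai_model: AIModelOfHuman) -> str:
    """Layer 3, human side: an explanation lets the user recalibrate Layer 1."""
    return f"I acted on these beliefs about you: {ai_model.facts_about_user}"

def capability_misalignment(human: HumanModelOfAI, actual: set) -> set:
    """The Layer-2 failure mode: capabilities the user expects but the AI lacks."""
    return human.believed_capabilities - actual

human = HumanModelOfAI(believed_capabilities={"code review", "legal advice"})
print(capability_misalignment(human, actual={"code review"}))  # -> {'legal advice'}
```

The last two lines show the failure mode quoted in layer 2: the gap between believed and actual capabilities is exactly the set of domains the user will misapply the AI to.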

When these layers misalign, the consequences are material, not just communicative. Design fictions show AI agents acting on users' behalf based on predictive models — writing code, responding to messages, executing workflows. A faulty MToM doesn't just cause miscommunication; it causes incorrect autonomous action.
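As a toy illustration of that distinction (the scenario, fields, and messages are hypothetical, not drawn from the design fictions themselves): an agent acting from a stale user model produces a wrong action, not merely a wrong utterance.

```python
from dataclasses import dataclass

@dataclass
class UserPrefs:
    """Agent's predictive model of the user (hypothetical fields)."""
    auto_reply_ok: bool   # "the user wants routine messages answered for them"
    tone: str

def act_on_message(prefs: UserPrefs, message: str) -> str:
    """Acts autonomously when the model says it may; otherwise defers."""
    if prefs.auto_reply_ok:
        return f"SENT ({prefs.tone} auto-reply) -> {message!r}"
    return f"DRAFTED for review -> {message!r}"

# Stale model: the user revoked auto-replies, but the agent never updated.
stale = UserPrefs(auto_reply_ok=True, tone="casual")
print(act_on_message(stale, "Please confirm the revised contract terms."))
# A casual auto-reply goes out on a high-stakes thread: the failure is an
# incorrect action taken on the user's behalf, not a misread message.
```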

The design implications are organizational as much as interactional. The wider adoption scenario (MToM within an organization) shows how these dynamics scale: MToM can "reshape work practices by streamlining communications and delivering the right information to the right people at the right time," but every efficiency gain depends on model accuracy, and every inaccuracy has downstream consequences.

Empirical evidence from a Bayesian item response theory (IRT) study of human-AI synergy (n = 667) provides quantitative grounding for MToM's importance: theory of mind predicts collaborative performance with AI but not solo performance. Users with stronger perspective-taking achieve superior collaboration, and critically, moment-to-moment fluctuations in ToM (not just stable individual differences) influence AI response quality within sessions. This confirms that MToM is not merely a design-fiction aspiration but a measurable cognitive mechanism with quantifiable effects on collaboration outcomes. See "Does theory of mind predict who thrives in AI collaboration?".
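The note does not reproduce the study's model, but a hypothetical simulation can show what "predicts collaborative but not solo performance" means under a simple Rasch-style item response assumption, where ToM shifts latent ability only in the collaborative condition. The effect size `beta_collab`, the item count, and the sampling details below are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_correct(theta, b):
    """Rasch item response curve: ability theta vs. item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

n, items = 667, 20                       # n matches the study; items is assumed
tom = rng.normal(0.0, 1.0, n)            # standardized perspective-taking score
base = rng.normal(0.0, 1.0, n)           # baseline ability, independent of ToM
beta_collab = 0.6                        # assumed ToM effect on collaborative ability
theta_solo = base                        # solo ability: no ToM contribution
theta_collab = base + beta_collab * tom  # collaborative ability: ToM shifts it

b = rng.normal(0.0, 1.0, items)          # item difficulties
solo = rng.random((n, items)) < p_correct(theta_solo[:, None], b[None, :])
collab = rng.random((n, items)) < p_correct(theta_collab[:, None], b[None, :])

# ToM correlates with collaborative accuracy but not with solo accuracy.
print("ToM vs solo:  ", round(np.corrcoef(tom, solo.mean(1))[0, 1], 2))
print("ToM vs collab:", round(np.corrcoef(tom, collab.mean(1))[0, 1], 2))
```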


Source: Theory of Mind

Original note title: mutual theory of mind between humans and AI requires bidirectional model updating and creates material consequences from misalignment