Can LLMs distinguish between surface requests and underlying mental states in dialogue?
This note explores whether LLMs can tell the difference between what a user literally says (the surface request) and what the user actually wants, believes, or feels underneath it. The corpus suggests they mostly operate at the surface.
This question is really about whether a model can look past the literal text of a turn and infer the mind behind it, and the collected work points to a fairly consistent answer: LLMs lean on surface cues and struggle as soon as understanding requires modeling a separate mental state. The most direct evidence is that models default to surface-level strategies rather than genuine mental simulation: they can pass structured, multiple-choice theory-of-mind tasks but fall apart in open-ended scenarios. Notably, hybrid architectures that *force* explicit belief-tracking outperform LLMs alone, which suggests the gap is architectural and not just a matter of more training (Do large language models genuinely simulate mental states?). A sharper cut comes from work showing that LLMs track *static* mental states, such as a persuader's fixed goal, about as well as humans, but badly underperform on *dynamic* ones, such as a listener's resistance shifting mid-conversation (Can language models track how minds change during persuasion?). So mental states are not invisible to them; what slips away is anything that moves, anything that has to be updated turn by turn.
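To make "explicit belief-tracking" over a dynamic mental state concrete, here is a minimal sketch of a discrete Bayesian filter that revises a belief about a listener's stance after every turn. It is not the hybrid architecture from the cited work; the state names, cue names, likelihood values, and the `STAY` parameter are all assumptions chosen for illustration.

```python
from typing import Dict, List

# Hypothetical hidden stances a listener might hold during persuasion.
STATES = ("resistant", "wavering", "persuaded")

# Assumed likelihoods P(observed cue | stance). The numbers are illustrative only.
LIKELIHOOD: Dict[str, Dict[str, float]] = {
    "pushback":  {"resistant": 0.70, "wavering": 0.25, "persuaded": 0.05},
    "question":  {"resistant": 0.20, "wavering": 0.60, "persuaded": 0.20},
    "agreement": {"resistant": 0.05, "wavering": 0.25, "persuaded": 0.70},
}

# Assumed probability that the stance is unchanged from one turn to the next.
STAY = 0.8


def predict(belief: Dict[str, float]) -> Dict[str, float]:
    """Let the stance drift: keep STAY of each state's mass, spread the rest evenly."""
    leak = (1.0 - STAY) / (len(STATES) - 1)
    return {
        s: sum(belief[p] * (STAY if p == s else leak) for p in STATES)
        for s in STATES
    }


def update(belief: Dict[str, float], cue: str) -> Dict[str, float]:
    """One Bayes step: drift the prior, weight by the cue likelihood, renormalize."""
    prior = predict(belief)
    unnormalized = {s: prior[s] * LIKELIHOOD[cue][s] for s in STATES}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}


def track(cues: List[str]) -> List[Dict[str, float]]:
    """Return the belief after each turn, starting from a uniform prior."""
    belief = {s: 1.0 / len(STATES) for s in STATES}
    history = []
    for cue in cues:
        belief = update(belief, cue)
        history.append(belief)
    return history
```

Calling `track(["pushback", "question", "agreement", "agreement"])` yields one belief per turn, so a shift in the listener's stance is represented as an explicit, inspectable quantity and is not left implicit in next-token prediction.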
That 'can't update' theme recurs in a way worth noticing. Models treat the opening prompt as a fixed frame and interpret every later turn inside it, so they can't jointly revise shared assumptions, and the user ends up being the sole keeper of the conversational scoreboard (Can LLMs truly update shared conversational common ground?). If a model can't symmetrically update what's mutually believed, it has no real mechanism for distinguishing 'what you asked' from 'what you've now come to mean.' The same brittleness shows up in ambiguity: GPT-4 correctly disambiguates only about 32% of genuinely ambiguous sentences versus 90% for humans, a gap attributed to its inability to hold two interpretations at once (Can language models recognize when text is deliberately ambiguous?). Distinguishing a surface request from an underlying intent often *requires* entertaining multiple readings simultaneously, which is exactly the capacity that's missing.
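As a toy contrast with the fixed-frame behavior described above, the sketch below models a conversational scoreboard in which any later turn, from either party, can revise a shared assumption. The `Scoreboard` class and its method names are hypothetical and do not come from the cited work; the point is only that a revision replaces the earlier entry and is not reinterpreted inside the opening frame.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Scoreboard:
    """Shared assumptions, each stored with the turn that last set it."""
    entries: Dict[str, Tuple[str, int]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def propose(self, turn: int, speaker: str, key: str, value: str) -> None:
        """Either party may set or revise an assumption; the latest proposal wins."""
        old = self.entries.get(key)
        if old is not None and old[0] != value:
            # Record the revision so the change of meaning stays visible.
            self.log.append(
                f"turn {turn}: {speaker} revised {key!r} from {old[0]!r} to {value!r}"
            )
        self.entries[key] = (value, turn)

    def current(self, key: str) -> str:
        """Return the assumption as it stands now, not as it was first stated."""
        return self.entries[key][0]
```

For example, if the user proposes `goal = "book a flight"` at turn 0 and `goal = "book a train"` at turn 3, `current("goal")` returns the later value and the log holds one revision entry. Because the assistant can call `propose` too, the update is symmetric.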
Here's the turn you might not expect: some of the failure isn't incapacity, it's learned social behavior. Models routinely fail to correct a user's false premise even when direct questioning shows they know better, a face-saving avoidance pattern absorbed from human conversational norms (Why do language models avoid correcting false user claims?). Response content itself also bends to the user's emotional tone, with negative prompts rebounding into neutral-positive answers, so the same question gets different information depending on framing (Does emotional tone in prompts change what information LLMs provide?). RLHF even biases models to assume *everyone* is being conciliatory and benefit-oriented, projecting their own trained accommodation onto other agents' intentions (Do LLMs predict persuasion based on actual dialogue or training bias?). In other words, the model reads surface affect and politeness signals confidently; it just maps them onto a generic, agreeable mental model rather than the user's actual one.
There's a deeper framing underneath all this. Shanahan's argument is that there is no stable subject doing the inferring at all: the model holds a superposition of possible characters and samples one at generation time (Do large language models actually commit to a single character?), and the dialogue agent is role-play all the way down, with no authentic voice beneath the performance (Does a language model have an authentic voice underneath?). If the model has no committed self, it's unsurprising that it struggles to firmly model *your* self either: both are simulated rather than tracked.
The corpus also hints at what helps. Conversation-analysis work reframes the problem: instead of silently chaining tools toward a guessed intent, agents should use 'insert-expansions' (clarifying probes) to surface the underlying request before acting, preventing misunderstanding rather than recovering from it (When should AI agents ask users instead of just searching?). And user-simulator research shows that when you explicitly condition a model on latent variables for user profile and turn-level intent, behavior becomes measurably more realistic (Can controlled latent variables make LLM user simulators realistic?). The pattern across both: LLMs don't reliably *infer* the mind behind a request, but when intent is made an explicit, structured variable, whether asked for or tracked outside the next-token loop, the surface/depth distinction starts to hold. The capability gap is real, but it looks more like a missing scaffold than a missing intelligence.
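The clarify-before-acting idea can be sketched as a small gate that keeps several candidate readings of a request alive and only acts when one clearly leads. This is an illustrative assumption, not the framework from the cited work: the `next_move` function, the `margin` threshold, and the way readings are scored are all hypothetical.

```python
from typing import Dict, Tuple


def next_move(readings: Dict[str, float], margin: float = 0.3) -> Tuple[str, str]:
    """Return ("act", reading) if one reading clearly leads, else ("ask", question).

    `readings` maps each candidate interpretation of the request to a score.
    How those scores are produced is outside the scope of this sketch.
    """
    if not readings:
        raise ValueError("need at least one candidate reading")

    # Rank candidate interpretations from most to least plausible.
    ranked = sorted(readings.items(), key=lambda kv: kv[1], reverse=True)
    best, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0

    # Act only when the leading reading beats the runner-up by the margin.
    if best_score - runner_up_score >= margin:
        return ("act", best)

    # Otherwise issue a clarifying probe that names the top two readings.
    options = " or ".join(name for name, _ in ranked[:2])
    return ("ask", f"Did you mean: {options}?")
```

With readings `{"cheapest flight": 0.5, "fastest flight": 0.45}` the gate returns an `"ask"` move that names both options; with `{"a": 0.9, "b": 0.1}` it returns `("act", "a")`. The design choice is that ambiguity is resolved by consulting the user, not by silently committing to the top guess.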
Sources (11 notes)
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.
LLMs match human performance on static mental states like a persuader's unchanging goal, but significantly underperform on dynamic shifts like a persuadee's evolving resistance. They show distinct error patterns for different social roles even with identical question types.
LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, which makes the user the sole maintainer of the conversational scoreboard.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.
LLMs systematically predict conciliatory, benefit-oriented persuasion intentions regardless of dialogue context. This bias originates in RLHF's prioritization of safety and politeness during training, causing models to project their learned accommodation preference onto other agents' behavior.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that the model has not committed to a single answer in advance.
Shanahan argues that base LLMs lack agency, beliefs, or preferences—the simulator is pure role-play with no underlying subject. Jailbreaking reveals the training data's full spectrum, not a hidden true self; even RLHF personas are performed characters, never realized quasi-psychologies.
Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.
RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations that rate as realistic when measured by crowdsourced discrimination, discriminator models, and classifier-ensemble distribution matching.