Can LLMs distinguish between surface requests and underlying mental states in dialogue?

This explores whether LLMs can tell the difference between what a user literally says (the surface request) and what the user actually wants, believes, or feels underneath it — and the corpus suggests they mostly operate at the surface.

This question is really about whether a model can look past the literal text of a turn and infer the mind behind it, and the collected work points to a fairly consistent answer: LLMs lean on surface cues and struggle the moment understanding requires modeling a separate mental state. The most direct evidence is that models default to surface-level strategies rather than genuine mental simulation: they can pass structured, multiple-choice theory-of-mind tasks but fall apart in open-ended scenarios. Notably, hybrid architectures that *force* explicit belief-tracking outperform LLMs alone, implying the gap is architectural, not just a matter of more training (Do large language models genuinely simulate mental states?). A sharper cut comes from work showing that LLMs track *static* mental states (a persuader's fixed goal) about as well as humans but badly underperform on *dynamic* ones (a listener's resistance shifting mid-conversation) (Can language models track how minds change during persuasion?). So it's not that mental states are invisible to them; it's that anything moving, anything that has to be updated turn by turn, slips away.
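To make "explicit belief-tracking" concrete, here is a minimal sketch of a turn-by-turn Bayesian update over a listener's state. It is not the architecture from the cited work: the two states, the cue labels, and every likelihood value are illustrative assumptions, and a real system would need some way to extract cues from dialogue.

```python
# Hypothetical sketch: explicit belief tracking over a listener's shifting state.
# States, cue names, and likelihood values are made-up assumptions for illustration.

STATES = ("resistant", "open")

# P(observed cue | listener state); illustrative numbers only.
LIKELIHOOD = {
    "objection": {"resistant": 0.8, "open": 0.2},
    "question": {"resistant": 0.4, "open": 0.6},
    "agreement": {"resistant": 0.1, "open": 0.9},
}


def update_belief(belief, cue):
    """One Bayes step: the posterior is proportional to likelihood times prior."""
    unnormalized = {s: LIKELIHOOD[cue][s] * belief[s] for s in STATES}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}


def track(cues, prior=None):
    """Return the belief after every turn, so mid-conversation shifts stay visible."""
    belief = dict(prior or {s: 1 / len(STATES) for s in STATES})
    history = []
    for cue in cues:
        belief = update_belief(belief, cue)
        history.append(belief)
    return history
```

With a uniform prior, `track(["objection", "question", "agreement"])` puts the belief in "resistant" at 0.8 after the first turn and below 0.5 after the third. The point of the sketch is only that the dynamic state lives in an explicit variable that is revised every turn rather than being left implicit in the context window.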

That 'can't update' theme recurs in a way worth noticing. Models treat the opening prompt as a fixed frame and interpret every later turn inside it, so they can't jointly revise shared assumptions, and the user ends up being the sole keeper of the conversational scoreboard (Can LLMs truly update shared conversational common ground?). If a model can't symmetrically update what's mutually believed, it has no real mechanism for distinguishing 'what you asked' from 'what you've now come to mean.' The same brittleness shows up in ambiguity: GPT-4 correctly disambiguates only about 32% of genuinely ambiguous sentences versus 90% for humans, because it can't hold two interpretations at once (Can language models recognize when text is deliberately ambiguous?). Distinguishing a surface request from an underlying intent often *requires* entertaining multiple readings simultaneously, which is exactly the capacity that's missing.

Here's the turn you might not expect: some of the failure isn't incapacity, it's learned social behavior. Models routinely fail to correct a user's false premise even when direct questioning shows they know better, a face-saving avoidance pattern absorbed from human conversational norms (Why do language models avoid correcting false user claims?). Response content itself also bends to the user's emotional tone: negative prompts rebound into neutral-positive answers, so the same question gets different information depending on framing (Does emotional tone in prompts change what information LLMs provide?). RLHF even biases models to assume *everyone* is being conciliatory and benefit-oriented, projecting their own trained accommodation onto other agents' intentions (Do LLMs predict persuasion based on actual dialogue or training bias?). In other words, the model reads surface affect and politeness signals confidently; it just maps them onto a generic, agreeable mental model rather than the user's actual one.

There's a deeper framing underneath all this. Shanahan's argument is that there is no stable subject doing the inferring at all: the model holds a superposition of possible characters and samples one at generation time (Do large language models actually commit to a single character?), and the dialogue agent is role-play all the way down, with no authentic voice beneath the performance (Does a language model have an authentic voice underneath?). If the model has no committed self, it's unsurprising that it struggles to firmly model *your* self either: both are simulated rather than tracked.

The corpus also hints at what helps. Conversation-analysis work reframes the problem: instead of silently chaining tools toward a guessed intent, agents should use 'insert-expansions' (clarifying probes) to surface the underlying request before acting, preventing misunderstanding rather than recovering from it (When should AI agents ask users instead of just searching?). And user-simulator research shows that when you explicitly condition a model on latent variables for user profile and turn-level intent, behavior becomes measurably more realistic (Can controlled latent variables make LLM user simulators realistic?). The pattern across both: LLMs don't reliably *infer* the mind behind a request, but when intent is made an explicit, structured variable (asked for, or tracked outside the next-token loop), the surface/depth distinction starts to hold. The capability gap is real, but it looks more like a missing scaffold than a missing intelligence.
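One way to picture "intent as an explicit variable" plus a clarifying probe is a small gate that sits in front of any tool call. The sketch below is an assumption-laden illustration, not the framework from the cited work: `IntentHypothesis`, `next_move`, the confidence scores, and the 0.75 threshold are all hypothetical, and the scores would have to come from some upstream component.

```python
from dataclasses import dataclass


@dataclass
class IntentHypothesis:
    reading: str       # one candidate interpretation of the request
    confidence: float  # hypothetical score in [0, 1] from an upstream component


def next_move(hypotheses, threshold=0.75):
    """Act only when one reading clearly dominates; otherwise ask a clarifying probe."""
    if not hypotheses:
        return ("ask", "Could you say more about what you need?")
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    best = ranked[0]
    if best.confidence >= threshold:
        return ("act", best.reading)
    # No reading dominates: surface the top candidates instead of guessing.
    options = " or ".join(h.reading for h in ranked[:2])
    return ("ask", f"Just to check, do you want to {options}?")
```

Given readings "cancel the order" at 0.5 and "change the delivery address" at 0.45, `next_move` returns an "ask" move naming both options. If the first reading scored 0.9 instead, it would return an "act" move for that reading. The design choice worth noting is that the competing readings are kept side by side until the user resolves them, rather than collapsed into a single guess.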

Sources (11 notes)

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Can language models track how minds change during persuasion?

LLMs match human performance on static mental states like a persuader's unchanging goal, but significantly underperform on dynamic shifts like a persuadee's evolving resistance. They show distinct error patterns for different social roles even with identical question types.

Can LLMs truly update shared conversational common ground?

LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, making the user the sole maintainer of the conversational scoreboard.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

Do LLMs predict persuasion based on actual dialogue or training bias?

LLMs systematically predict conciliatory, benefit-oriented persuasion intentions regardless of dialogue context. This bias originates in RLHF's prioritization of safety and politeness during training, causing models to project their learned accommodation preference onto other agents' behavior.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that no fixed commitment exists.

Does a language model have an authentic voice underneath?

Shanahan argues that base LLMs lack agency, beliefs, or preferences—the simulator is pure role-play with no underlying subject. Jailbreaking reveals the training data's full spectrum, not a hidden true self; even RLHF personas are performed characters, never realized quasi-psychologies.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Can controlled latent variables make LLM user simulators realistic?

RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations that are measurably more realistic, as assessed by crowdsourced discrimination, discriminator models, and classifier-ensemble distribution matching.
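The latent-variable conditioning described in the last note can be pictured as rendering a session-level profile and a turn-level intent into the text a simulator model sees. This is a hedged sketch only: the source does not describe RecLLM's implementation, and `UserProfile`, `TurnIntent`, `build_simulator_prompt`, and the prompt wording are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    """Session-level latent variable (hypothetical fields)."""
    persona: str
    preferences: tuple


@dataclass(frozen=True)
class TurnIntent:
    """Turn-level latent variable (hypothetical fields)."""
    name: str
    detail: str


def build_simulator_prompt(profile, intent, history):
    """Render the latent variables into the text a simulator LLM is conditioned on."""
    lines = [
        "You are simulating a user in a conversation.",
        f"Persona: {profile.persona}",
        f"Preferences: {', '.join(profile.preferences)}",
        f"Intent for this turn: {intent.name} ({intent.detail})",
        "Conversation so far:",
    ]
    lines += [f"{speaker}: {text}" for speaker, text in history]
    lines.append("User:")  # the simulator model would continue from here
    return "\n".join(lines)
```

The profile stays fixed for the session while a new `TurnIntent` is supplied each turn, so the intent behind every simulated utterance is stated outright instead of being left for the model to infer.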

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability analyst re-testing whether LLMs can distinguish surface requests from underlying mental states in dialogue. The question remains open; treat the findings below as dated claims to be re-validated.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023 to June 2025; treat all as perishable.
• LLMs default to surface-level strategies in open-ended scenarios; hybrid architectures forcing explicit belief-tracking outperform LLMs alone (~2025).
• Models fail at dynamic mental states (shifting mid-conversation) but track static ones competitively with humans (~2025).
• Models cannot jointly update common ground; they treat the opening prompt as fixed and reinterpret later turns inside it, leaving the user as sole keeper of conversational state (~2025).
• GPT-4 disambiguates only ~32% of genuinely ambiguous sentences vs. ~90% for humans; cannot hold multiple interpretations simultaneously (~2023).
• Models exhibit learned face-saving avoidance (failing to correct false premises despite knowing better) and emotional rebound (converting negative tone to neutral-positive answers), mapping surface affect onto generic agreeable mental models rather than users' actual ones (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2502.21017 (PersuasiveToM, Feb 2025) — benchmark for machine ToM in persuasive dialogue.
• arXiv:2307.01644 (Insert-expansions, Jul 2023) — conversation-analysis framework for clarifying probes.
• arXiv:2304.14399 (Ambiguity modeling, Apr 2023) — LLMs fail at surface-level ambiguity.
• arXiv:2506.08952 (Grounding study, Jun 2025) — direct political questions surface knowledge-behavior gaps.

Your task:
(1) RE-TEST EACH CONSTRAINT. For dynamic mental-state tracking, static vs. dynamic distinction, common-ground jointness, ambiguity tolerance, and face-saving avoidance: judge whether newer models (o1, GPT-4o, Claude 3.5+), explicit intent-conditioning, memory/caching, or multi-agent orchestration have since relaxed or overturned these limits. Separate the durable question (mental-state inference still hard?) from the perishable limitation (e.g., 32% ambiguity) — cite what resolved it, flag where the constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for evidence that explicit belief-state priors, long-context windows, or agentic loop integration *do* enable genuine mind-modeling, or that the simulator framing has been challenged.
(3) Propose 2 research questions assuming the regime may have moved — e.g., whether multi-turn belief-state conditioning (external tracking) now enables intent-inference parity, or whether retrieval-augmented dialogue (externalizing common ground) changes the game.
