Do language models actually build shared understanding in conversation?
When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
The core finding from Grounding Gaps (Shaikh et al. 2023): compared to humans in equivalent conversational contexts, LLMs do not do the communicative work of establishing shared understanding. They proceed as if it is already established.
What is missing is not content but process. Human dialogue involves constant calibration: checking that what was said was understood, asking what the other person needs to know, acknowledging what has been confirmed, repairing when breakdown is detected. These grounding acts — quantified in the study using linguistically validated categories — appear 77.5% less frequently in LLM outputs than in human dialogue.
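As a rough illustration of how a gap like this can be quantified, a measurement sketch follows. This is not the study's classifier: the three category names echo the grounding acts described above, but the regex heuristics, example turns, and function names are assumptions made only to keep the toy runnable.

```python
import re

# Toy heuristics for three grounding-act categories. Illustrative only:
# the study uses linguistically validated categories, not keyword regexes.
GROUNDING_PATTERNS = {
    "clarification_request": re.compile(r"\b(do you mean|could you clarify|just to confirm)\b", re.I),
    "acknowledgement": re.compile(r"^(got it|i see|okay|right|makes sense)\b", re.I),
    "follow_up_question": re.compile(r"\?\s*$"),  # the turn ends by asking something back
}

def grounding_act_rate(turns):
    """Fraction of turns containing at least one grounding act."""
    if not turns:
        return 0.0
    hits = sum(
        1 for turn in turns
        if any(p.search(turn.strip()) for p in GROUNDING_PATTERNS.values())
    )
    return hits / len(turns)

# Tiny made-up samples standing in for human replies and LLM replies
# to the same conversational contexts.
human_turns = [
    "Got it. Do you mean the quarterly numbers or the annual ones?",
    "Okay, that helps. When do you need the report by?",
]
model_turns = [
    "Here is a complete report covering all the metrics you might need.",
    "The quarterly numbers show strong growth across every segment.",
]

human_rate = grounding_act_rate(human_turns)
model_rate = grounding_act_rate(model_turns)
relative_gap = 1 - model_rate / human_rate if human_rate else 0.0
print(f"human: {human_rate:.2f}, model: {model_rate:.2f}, gap: {relative_gap:.0%}")
```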
The absence is not random; it is systematic. LLMs were not trained to perform grounding acts; they were trained to generate fluent responses to inputs. As Does preference optimization damage conversational grounding in large language models? explores, the training that optimized LLMs for helpfulness specifically reduced grounding behavior, because clarifying questions and acknowledgments look less helpful in single-turn human preference evaluation.
The consequence is that LLM fluency can be mistaken for mutual understanding. A confident, grammatically correct, relevant-seeming response provides no evidence that the model understood what the user meant or that the user understood what the model said. The appearance of communication is produced without the verification processes that make communication reliable.
The gap has a specific genre form in social-media posts: assume common ground, do not construct it, and fall back on false punditry. Common ground is normally established over multiple rounds of conversation, through questions, clarifications, and shared reference points negotiated turn by turn. AI posts skip this entirely: they assume the common ground that a communicative exchange would have built, and because they cannot reach it through conversation, they compensate with matter-of-fact, authoritative framing. False punditry is what the gap looks like when the missing grounding work cannot be performed: instead of reaching common ground to legitimate its claims, the post proceeds as if the ground were already shared and substitutes an authoritative register for that legitimation.
This is a specific instantiation of Why do language models skip the calibration step?. LLMs are static grounders by training. The 77.5% gap is the quantified cost of this.
The FLEX Benchmark provides a harder test of the same failure: LLMs accommodate false presuppositions embedded in questions even when they have the correct information to reject them. This is not just failing to build common ground — it is failing to correct demonstrably false common ground. Why do language models accept false assumptions they know are wrong? shows that LLMs don't just presume shared understanding; they actively propagate false assumptions in the direction of accommodation, reinforcing incorrect common ground rather than repairing it.
Source: Linguistics, NLP, NLU
Related concepts in this collection
- Does preference optimization damage conversational grounding in large language models?
Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
why the gap exists
- Why do language models skip the calibration step?
Current LLMs assume shared understanding rather than building it through dialogue. This explores why that design choice persists and what breaks when it fails.
the structural distinction this instantiates
- Why do speakers need to actively calibrate shared reference?
Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.
what is being skipped
- Does AI-generated text lose core properties of human writing?
Can artificial text preserve the fundamental structural features that make natural language meaningful—dialogic exchange, embedded context, authentic authorship, and worldly grounding? This asks whether AI disruption is fixable or inherent.
dialogic symmetry is one of them; the grounding gap is a consequence
- Why do language models accept false assumptions they know are wrong?
Explores why LLMs fail to reject false presuppositions embedded in questions even when they possess correct knowledge about the topic. This matters because it reveals a grounding failure distinct from knowledge deficits.
harder failure: not just failing to build common ground but actively reinforcing false common ground through accommodation
- Does transformer attention architecture inherently favor repeated content?
Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
architectural mechanism: the structural pull toward prominent context content explains why LLMs run with stated context instead of verifying it — grounding failure is partly an attention property, not just a training artifact
- Why don't conversational AI systems mirror their users' word choices?
Explores whether current dialogue models exhibit lexical entrainment—the human tendency to align vocabulary with conversation partners—and what's needed to bridge this gap in AI communication.
lexical entrainment is a specific mechanism by which humans build common ground through shared vocabulary; its absence in LLMs is a concrete form of the grounding gap
- Why do language models fail in gradually revealed conversations?
Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
premature assumptions under underspecification are the presumption of common ground in action: the model fills in unspecified details with guesses rather than building shared understanding through clarification
- What breaks when humans and AI models misunderstand each other?
Explores whether misalignment in mutual theory of mind between humans and AI creates only communication problems or produces material consequences in autonomous action and collaboration.
MToM is the design-level architecture for externalized common-ground building: if models cannot build common ground naturally through grounding acts, the interaction design must provide explicit mechanisms for querying and correcting the AI's user model
- Do vector embeddings actually measure task relevance?
Vector embeddings rank semantic similarity, but RAG systems need topical relevance. When these diverge—as with king/queen versus king/ruler—does similarity-based retrieval fail in production?
retrieval systems exhibit the same presumption failure: vector DBs treat semantic similarity as equivalent to query intent without building any model of what the query actually needs; the king/queen problem is a retrieval system presuming common reference rather than establishing it (a minimal sketch of the similarity-versus-relevance mismatch follows this list)
- Why do time-based queries fail in conversational retrieval systems?
Conversational memory systems struggle with questions that reference when something was discussed rather than what was said. Standard vector databases lack temporal indexing to retrieve by metadata like date, speaker, or session order.
time-event queries ("what did we discuss last Tuesday?") fail because the retrieval system presumed temporal metadata was irrelevant: it built a semantic index, not a temporally grounded one. The retrieval failure is a technical consequence of presuming what users need rather than building a system that establishes it (see the metadata-filtering sketch after this list)
- Can we teach LLMs to form linguistic conventions in context?
Humans naturally shorten references as conversations progress, but LLMs don't adapt their language for efficiency even when they understand their partners do. Can training on coreference patterns teach this convention-forming behavior?
convention formation is a specific mechanism for building common ground through interaction: shortening shared references from verbose first mentions to concise re-mentions demonstrates that the model and user have established shared reference, addressing the grounding gap from the vocabulary-adaptation side
- When should AI agents ask users instead of just searching?
Explores whether tool-enabled LLMs should probe users for clarification when uncertain, rather than silently chaining tool calls that drift from intent. Examines conversation analysis patterns as a formal alternative.
insert-expansions are the pre-emptive mechanism for building common ground: probing the user to clarify intent before committing to a response is exactly the grounding act that LLMs skip when they presume understanding
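A minimal sketch of the similarity-versus-relevance mismatch flagged in the vector-embeddings note above. The three-dimensional vectors are invented for illustration (real embedding models use hundreds of dimensions and their exact geometry varies); the point is only that a similarity-ranked retriever orders candidates by semantic closeness, which need not match what the query actually needs.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy embeddings, chosen so "queen" sits closer to "king" than
# "ruler" does, mirroring the king/queen vs. king/ruler example.
emb = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.12]),  # semantically close: royalty, person
    "ruler": np.array([0.60, 0.30, 0.70]),  # topically relevant to a governance query
}

# A similarity-only retriever ranks candidates by distance to the query
# embedding and never asks what the query is actually for.
scores = {word: cosine(emb["king"], vec) for word, vec in emb.items() if word != "king"}
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(score, 3))
# Prints "queen" above "ruler": similarity order, not relevance order.
```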
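And a minimal sketch of what the time-based-queries note asks for: recall filtered by temporal metadata rather than by semantic similarity alone. The in-memory store, field names, and example data are assumptions, not any particular vector database's API.

```python
from datetime import date, datetime

# Hypothetical in-memory conversation log: each entry keeps a timestamp
# alongside the text, so "when" questions can filter by date before
# (or instead of) any semantic ranking.
memories = [
    {"text": "Discussed the launch checklist", "ts": datetime(2024, 6, 4, 14, 0)},
    {"text": "Reviewed the retrieval eval results", "ts": datetime(2024, 6, 11, 10, 30)},
]

def recall_by_day(store, day: date):
    """Return every memory whose timestamp falls on the given calendar day."""
    return [m["text"] for m in store if m["ts"].date() == day]

# "What did we discuss last Tuesday?" resolves to a date first, then filters.
print(recall_by_day(memories, date(2024, 6, 11)))
```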
Original note title: llms presume common ground rather than build it