Can conversational AI achieve mutual understanding if trained only on text?

This explores whether text-only training is enough for an AI to actually reach shared understanding with a person — or whether mutual understanding needs something text alone can't carry.

The corpus leans toward a clear answer: text training produces fluent output that *looks* like understanding while quietly skipping the work that builds it. The sharpest version of this comes from the argument that meaning requires a relation between words and the intentions behind them. Since models learn only form-to-form prediction, with no access to shared attention or intent, they can't reconstruct the meaning that grounds language in the first place (see "Can language models learn meaning from text patterns alone?"). A related framing treats text itself as a lossy abstraction, a kind of Plato's cave: it strips out the physics, geometry, and causality of the world, leaving the model to shuffle symbols without their source dynamics (see "Are text-only language models fundamentally limited by abstraction?").

But the more interesting thread isn't about meaning in the abstract. It's about the *mechanics* of two minds converging, which turn out to be missing too. Humans build shared understanding through constant small acts: clarifying, acknowledging, repairing misunderstandings. Models produce these grounding acts 77.5% less often than people do, instead presuming common ground and masking the gap with authoritative-sounding phrasing (see "Do language models actually build shared understanding in conversation?"). They also skip lexical entrainment, the way people drift toward each other's word choices to build rapport and clarity (see "Why don't conversational AI systems mirror their users' word choices?"), and the implicit maintenance moves like reference repair and topic hand-off that keep a conversation socially alive rather than merely informative (see "Why don't language models develop conversation maintenance skills?"). The reason these vanish is structural: training rewards next-turn helpfulness and information prediction, not the relational, multi-turn work of discovering what someone actually means (see "Why do language models respond passively instead of asking clarifying questions?" and "Why can't conversational AI agents take the initiative?").
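To make the "less often" comparison concrete, here is a minimal sketch of how a relative grounding-act rate could be computed. The label set, the toy annotations, and the helper names `grounding_rate` and `relative_reduction` are all assumptions made for illustration. They are not the annotation scheme or the data behind the reported figure.

```python
from typing import List

# Hypothetical label set: each turn is assumed to be pre-annotated
# with a single dialogue-act label.
GROUNDING_ACTS = {"clarify", "acknowledge", "repair"}


def grounding_rate(turn_labels: List[str]) -> float:
    """Fraction of turns whose label is a grounding act."""
    if not turn_labels:
        return 0.0
    hits = sum(1 for label in turn_labels if label in GROUNDING_ACTS)
    return hits / len(turn_labels)


def relative_reduction(human_rate: float, model_rate: float) -> float:
    """How much less often the model grounds, relative to the human rate."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return (human_rate - model_rate) / human_rate


# Toy annotations, invented for this example only.
human_turns = ["clarify", "inform", "acknowledge", "inform", "repair",
               "inform", "acknowledge", "inform", "inform", "inform"]
model_turns = ["inform", "inform", "inform", "acknowledge", "inform",
               "inform", "inform", "inform", "inform", "inform"]

human_rate = grounding_rate(human_turns)   # 4 of 10 turns are grounding acts
model_rate = grounding_rate(model_turns)   # 1 of 10 turns is a grounding act
reduction = relative_reduction(human_rate, model_rate)
print(f"human={human_rate:.2f} model={model_rate:.2f} reduction={reduction:.0%}")
```

The point of the sketch is only the shape of the metric: the reduction is measured against the human baseline, so a figure like 77.5% describes how much of the human grounding behavior is missing, not an absolute rate.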

There's a genuinely unsettling claim threaded through here: maybe the "understanding" you feel in a chat is something *you* are supplying. One note argues AI doesn't produce true utterances at all. It produces event-residue carrying communicative markers inherited from training, which humans then animate into a pseudo-exchange that only has structure on the human side (see "Does AI generate genuine utterances or just text patterns?"). That reframes the whole question: mutual understanding might be an illusion stabilized by the user's interpretive labor, not a state the two parties jointly reach. It connects to the finding that models have only surface-level self-knowledge: they describe their own behavior unreliably and shift their stated beliefs under conversational pressure (see "How well do language models understand their own knowledge?"). So there isn't a stable "other mind" doing the mutual part.

Here's the twist worth sitting with: several researchers think the missing pieces are *trainable*, not metaphysical. If understanding-building is a set of behaviors, you can add them back. Multi-turn-aware rewards teach a model to ask clarifying questions and discover intent over a whole conversation instead of optimizing each reply in isolation (see "Why do language models respond passively instead of asking clarifying questions?"). Conversation analysis offers a formal vocabulary, insert-expansions, for *when* an agent should pause and probe the user rather than silently chaining tools toward a guess (see "When should AI agents ask users instead of just searching?"). And information-theoretic frameworks like collaborative rational speech acts can track both speakers' beliefs across turns, modeling the actual progression from partial to shared understanding that token-level systems lack (see "Can dialogue systems track both speakers' beliefs across turns?"). Even simple proactivity, offering relevant information unasked, mirrors how humans cooperate and, in simulations of medium-complexity domains, cuts dialogue turns by 60%, yet it's nearly absent from the datasets models learn from (see "Could proactive dialogue make conversations dramatically more efficient?").
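The difference between scoring each reply in isolation and scoring the whole conversation can be sketched with a toy simulation. Everything here is an assumption made for illustration: the intent list, the `TURN_COST` penalty, the fixed per-reply scores, and the function names. It is not the reward used by any system mentioned above, only a Monte Carlo picture of why a clarifying question can lose on immediate helpfulness and still win over the full exchange.

```python
import random
from statistics import mean

INTENTS = ["refund", "exchange", "repair"]  # hypothetical hidden user intents
TURN_COST = 0.1                             # assumed penalty per extra turn


def single_turn_reward(action: str) -> float:
    """Stand-in for a per-reply helpfulness score (assumed values)."""
    # In isolation, a direct answer looks helpful and a question looks like a stall.
    return {"answer_now": 0.6, "ask_first": 0.2}[action]


def rollout(action: str, rng: random.Random) -> float:
    """Simulate one conversation to its end and score the final outcome."""
    true_intent = rng.choice(INTENTS)
    if action == "answer_now":
        guess = rng.choice(INTENTS)   # the model presumes an intent
        turns = 1
    else:
        guess = true_intent           # the simulated user states the intent when asked
        turns = 2
    success = 1.0 if guess == true_intent else 0.0
    return success - TURN_COST * (turns - 1)


def multi_turn_reward(action: str, n_samples: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of conversation-level value for a first reply."""
    rng = random.Random(seed)
    return mean(rollout(action, rng) for _ in range(n_samples))


for action in ("answer_now", "ask_first"):
    print(action, single_turn_reward(action), round(multi_turn_reward(action), 2))
```

Under these assumed numbers the per-reply score prefers `answer_now`, while the conversation-level estimate prefers `ask_first`, since a blind guess succeeds only about one time in three. Changing `TURN_COST` or the number of intents shifts where that crossover happens.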

So the corpus splits into two camps that are worth holding side by side. One says no: meaning and grounding require joint attention and a world beyond text, and no amount of text fixes that. The other says the failure is an artifact of *what we optimize for*: reward immediate helpfulness and you get a confident presumer of common ground; reward long-horizon collaboration and grounding behaviors reappear. The thing you didn't know you wanted to know: the debate isn't really "can text contain meaning?" It's whether mutual understanding is a property of the model at all, or a process that only exists in the back-and-forth. If it is the latter, the training signal, not the training data, may be the real bottleneck.

Sources (12 notes)

Can language models learn meaning from text patterns alone?

Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Do language models actually build shared understanding in conversation?

LLMs produce grounding acts—clarifications, acknowledgments, repairs—77.5% less frequently than humans. They generate fluent responses without verifying shared understanding, relying instead on authoritative framing that masks the absence of genuine communicative calibration.

Why don't conversational AI systems mirror their users' word choices?

Response generation models fail to adapt their vocabulary toward users' lexical choices, even though this entrainment is central to human rapport and clarity. Post-training via DPO on coreference-identified preferences can teach models in-context convention formation.

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off, which sustain the relationship rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why can't conversational AI agents take the initiative?

Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.

Does AI generate genuine utterances or just text patterns?

AI output carries communicative markers inherited from training data but lacks the event structure that produces actual utterances. Users supply the missing orientation through interpretive labor, creating a pseudo-event with structure only on the human side.

How well do language models understand their own knowledge?

LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Can dialogue systems track both speakers' beliefs across turns?

CRSA (collaborative rational speech acts) integrates rate-distortion theory with the rational speech acts (RSA) framework to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures the progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.

Could proactive dialogue make conversations dramatically more efficient?

Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.
