How does linguistic coordination build shared reference between conversational partners?
This explores how partners in a conversation gradually build a shared sense of what their words point to — through mirroring each other's language, negotiating meaning, and updating a common pool of assumptions — and what the corpus reveals about why machines struggle to do the same.
The corpus frames meaning the same things by the same words less as a property of language itself than as ongoing collaborative work. The starting insight is that shared reference isn't automatic: the same words can point to different things for different speakers, so partners have to actively calibrate how their language connects to the world rather than assume agreement (see "Why do speakers need to actively calibrate shared reference?"). Coordination is the machinery that does this calibration.
One visible mechanism is mirroring. Partners drift toward each other's word choices, a pattern called lexical entrainment, adopting shared terms for shared things, which is central to rapport and clarity in human dialogue (see "Why don't conversational AI systems mirror their users' word choices?"). This convergence can even be engineered: post-training models on preference pairs derived from coreference chains teaches them to spontaneously form ad-hoc conventions in context, shortening repeated mentions once a reference is jointly established (see "Can we teach LLMs to form linguistic conventions in context?"). Coordination is also not one single thing: lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive warmth and trust, so matching words and matching tone serve different ends (see "Do different types of alignment serve different conversational goals?"). A surprising corner: this same coordination intensifies during deception, where liars' and listeners' linguistic styles converge more than in honest talk, so the coordination itself becomes a detectable signal (see "Do liars and listeners coordinate their language during deception?").
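One rough way to make lexical entrainment concrete is to measure how many of a speaker's content words were already used by the partner. The sketch below is an illustration only, not a metric taken from the cited notes: the `STOPWORDS` list, the `entrainment_score` function, and the two toy dialogues are all assumptions made for the example.

```python
import re

# Small, hand-picked stopword list (an assumption; real studies use fuller lists).
STOPWORDS = {"the", "a", "an", "this", "that", "here", "yes", "no", "is", "it"}

def content_words(utterance):
    """Lowercase the utterance and keep only non-stopword tokens."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    return {t for t in tokens if t not in STOPWORDS}

def entrainment_score(dialogue):
    """Mean share of each turn's content words that the partner used earlier.

    dialogue: list of (speaker, utterance) pairs in order.
    Turns with no content words, or with no prior partner words, are skipped.
    """
    used = {}      # speaker -> set of content words used so far
    scores = []
    for speaker, utterance in dialogue:
        words = content_words(utterance)
        partner_words = set().union(
            *(ws for other, ws in used.items() if other != speaker)
        )
        if words and partner_words:
            scores.append(len(words & partner_words) / len(words))
        used.setdefault(speaker, set()).update(words)
    return sum(scores) / len(scores) if scores else 0.0

# Toy dialogues: B either adopts A's term ("cushion") or keeps its own ("pillow").
aligned = [
    ("A", "Pass the blue cushion"),
    ("B", "This blue cushion here"),
    ("A", "Yes, that cushion"),
]
unaligned = [
    ("A", "Pass the blue cushion"),
    ("B", "This navy pillow here"),
    ("A", "Yes, that cushion"),
]

print(entrainment_score(aligned), entrainment_score(unaligned))
```

A score near 1 means each speaker is mostly reusing the partner's vocabulary; a score near 0 means the two keep separate terms for the same thing. This captures only word overlap, so it says nothing about the emotional or prosodic alignment mentioned above.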
But mirroring alone isn't grounding. The deeper move is updating a shared scoreboard, the running pool of background assumptions both partners treat as established. Building reference means repairing misunderstandings, handing off topics, and checking understanding, which are social actions that sustain the relationship rather than transmit information (see "Why don't language models develop conversation maintenance skills?"). Even explanations work this way: understanding emerges from co-construction across turns, not one-directional delivery (see "What makes explanations work in real conversation?"). Formally, this is bidirectional: frameworks like collaborative rational speech acts (CRSA) track both speakers' evolving beliefs across turns, capturing the progression from partial to fully shared understanding (see "Can dialogue systems track both speakers' beliefs across turns?").
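For readers unfamiliar with the rational speech acts (RSA) family, the following is a minimal sketch of the basic single-turn RSA recursion on a toy referential game. It is not the collaborative, multi-turn CRSA framework the notes describe, and it leaves out the rate-distortion component entirely; the objects, utterances, uniform prior, and the `alpha` rationality parameter are assumptions chosen for illustration.

```python
# Toy referential game (hypothetical): three objects, four one-word utterances.
OBJECTS = ["blue_square", "blue_circle", "green_square"]
UTTERANCES = ["blue", "green", "square", "circle"]

def is_true(utterance, obj):
    """An utterance is literally true of an object if it names its color or shape."""
    return utterance in obj.split("_")

def normalize(weights):
    """Scale non-negative weights so they sum to 1 (all zeros stay zeros)."""
    total = sum(weights.values())
    return {k: (v / total if total else 0.0) for k, v in weights.items()}

def literal_listener(utterance):
    """L0: uniform belief over the objects the utterance is literally true of."""
    return normalize({o: 1.0 if is_true(utterance, o) else 0.0 for o in OBJECTS})

def pragmatic_speaker(obj, alpha=1.0):
    """S1: prefers utterances that lead the literal listener to the intended object."""
    return normalize({u: literal_listener(u)[obj] ** alpha for u in UTTERANCES})

def pragmatic_listener(utterance, alpha=1.0):
    """L1: reasons about which object would have made S1 choose this utterance."""
    return normalize({o: pragmatic_speaker(o, alpha)[utterance] for o in OBJECTS})

beliefs = pragmatic_listener("blue")
print(beliefs)
```

With these settings the pragmatic listener hearing "blue" leans toward the blue square (probability 0.6) over the blue circle (0.4), because a speaker who meant the circle had the more informative word "circle" available. CRSA, as summarized in the notes, extends this kind of reasoning so that both speakers' beliefs are tracked and updated across turns.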
What you might not expect: the corpus uses human coordination mostly as a mirror to show where LLMs fail. Models can't symmetrically update common ground. They read every later turn through the frame of the initial prompt, so the human ends up being the sole keeper of the shared scoreboard (see "Can LLMs truly update shared conversational common ground?"). Worse, the very training that makes models seem helpful erodes this: preference optimization rewards confident, fluent answers over clarifying questions and understanding checks, leaving models with 77.5% fewer grounding acts than humans (see "Does preference optimization harm conversational understanding?" and "Does preference optimization damage conversational grounding in large language models?"). One contrarian thread even argues the fix is to route around conversation entirely: multi-agent systems coordinate better through structured shared artifacts than through natural-language back-and-forth (see "Does structured artifact sharing outperform conversational coordination?"). So linguistic coordination builds shared reference through mirroring plus mutual updating, and the most striking finding in the collection is how much of that largely invisible work today's AI simply doesn't do.
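To see what a figure like "77.5% fewer grounding acts" means arithmetically, the sketch below computes a grounding-act rate from dialogue-act labels and the relative gap between two speakers. The label set in `GROUNDING_ACTS` is an assumption, and the toy counts are invented purely so the arithmetic lands on 77.5%; they are not data from the cited studies.

```python
# Hypothetical dialogue-act labels counted as grounding acts (an assumption).
GROUNDING_ACTS = {"clarification_request", "understanding_check", "acknowledgment", "repair"}

def grounding_rate(acts):
    """Share of turns whose dialogue-act label counts as a grounding act."""
    if not acts:
        return 0.0
    return sum(act in GROUNDING_ACTS for act in acts) / len(acts)

def relative_gap(human_rate, model_rate):
    """Fraction by which the model's rate falls below the human rate."""
    return (human_rate - model_rate) / human_rate

# Toy label sequences, 200 turns each (invented for illustration only).
human_acts = ["clarification_request"] * 80 + ["inform"] * 120
model_acts = ["understanding_check"] * 18 + ["inform"] * 182

human_rate = grounding_rate(human_acts)   # 80 / 200
model_rate = grounding_rate(model_acts)   # 18 / 200
print(f"gap: {relative_gap(human_rate, model_rate):.1%}")
```

Read this way, a 77.5% gap means the model performs grounding acts at a little under a quarter of the human rate, which is the sense in which the grounding work mostly falls to the human partner.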
Sources (12 notes)
The same words can mean different things to different speakers because referential grounding is person-specific. True communicative grounding demands collaborative negotiation of how language connects to the world, not mere surface-level word sharing.
Response generation models fail to adapt vocabulary toward users' lexical choices, a phenomenon central to human rapport and clarity. Post-training via DPO on coreference-identified preferences can teach models in-context convention formation.
Post-training with two types of preference pairs derived from TV scripts — one encouraging re-mention shortening, one preventing premature shortening — plus special [remention] tokens enables models to spontaneously form ad-hoc linguistic conventions during interaction without task-specific fine-tuning.
A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.
Research shows interlocutors' linguistic styles correlate more during false communication than truthful communication, especially when the speaker is motivated to deceive. This coordination serves as a detectable deception signal through the listener's adaptive behavior, not just the liar's language.
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.
Analysis of 399 daily-life explanations shows that topic relation, dialogue act, and explanation move jointly predict understanding success. Explanations are co-constructed through interaction patterns, not monological delivery—challenging how LLMs currently generate explanations.
CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.
LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, making the user the sole maintainer of the conversational scoreboard.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically leaves models producing 77.5% fewer grounding acts than humans, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.
MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.