Does optimizing for alignment actually reduce conversational grounding over time?

This explores whether training LLMs to be agreeable and helpful (alignment via RLHF/preference optimization) actually weakens their ability to build and maintain shared understanding across a conversation — and the corpus says yes, with a clear mechanism.

The corpus answers directly, and it names the mechanism. LLMs already produce 77.5% fewer "grounding acts" (clarifying questions, understanding checks, repairs) than humans do, and preference optimization widens that gap rather than closing it (see: Does preference optimization damage conversational grounding in large language models?). The reward signal favors fluent, confident single-turn answers, so the model learns that a smooth reply beats a question that admits uncertainty. The result is an "alignment tax on communication": models that look helpful turn by turn but fail silently across multiple turns, because the connective tissue of dialogue was trained out of them (see: Does preference optimization harm conversational understanding?).
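To make the "fewer grounding acts" comparison concrete, here is a minimal sketch of how such a gap could be computed from annotated dialogue turns. The `Turn` type, the label names, and the toy transcript are all hypothetical and for illustration only; they are not the annotation scheme or the data behind the 77.5% figure.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label set standing in for "grounding acts".
GROUNDING_LABELS = {"clarification", "understanding_check", "repair"}


@dataclass
class Turn:
    speaker: str  # e.g. "human" or "model"
    label: str    # hypothetical dialogue-act annotation for this turn


def grounding_rate(turns: List[Turn], speaker: str) -> float:
    """Share of a speaker's turns that are annotated as grounding acts."""
    own = [t for t in turns if t.speaker == speaker]
    if not own:
        return 0.0
    return sum(t.label in GROUNDING_LABELS for t in own) / len(own)


def relative_reduction(human_rate: float, model_rate: float) -> float:
    """How much lower the model's rate is, as a fraction of the human rate."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return (human_rate - model_rate) / human_rate


# Toy transcript (made up): humans ground in 4 of 10 turns, the model in 1 of 10.
turns = (
    [Turn("human", "clarification")] * 4
    + [Turn("human", "statement")] * 6
    + [Turn("model", "repair")] * 1
    + [Turn("model", "statement")] * 9
)
gap = relative_reduction(grounding_rate(turns, "human"), grounding_rate(turns, "model"))
```

On this toy transcript the model's grounding rate is three quarters lower than the human rate. A figure like "77.5% fewer" is a relative reduction of this kind, not a difference in absolute rates.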

What's striking is that this isn't a knowledge problem; it's a social one the model inherited from us. When a user states something false, models often won't correct it even though they demonstrably know better when asked directly. They are saving face: avoiding the friction of explicit correction to preserve harmony, a norm they learned from human conversation in their training data (see: Why do language models avoid correcting false user claims?). Alignment, in other words, can teach a model to be agreeable at the expense of being grounded. And the grounding it skips is real work. Shared reference has to be actively negotiated, because the same words mean different things to different speakers (see: Why do speakers need to actively calibrate shared reference?). The techniques humans use to keep conversation smooth, such as reference repair and topic hand-off, are relational moves rather than information transfer, so a model trained purely to predict informative tokens never develops them (see: Why don't language models develop conversation maintenance skills?).

There's a deeper structural ceiling underneath the training critique. Even a perfectly rewarded model may be architecturally unable to ground, because it treats the initial prompt as a fixed frame and interprets every later turn inside it. It therefore can't symmetrically propose updates to common ground, which leaves the human as the sole keeper of the conversational scoreboard (see: Can LLMs truly update shared conversational common ground?). Alignment compounds this by locking the model into one static communicative identity that can't switch register or renegotiate its behavior through dialogue the way human pragmatics demands (see: Can language models adapt communication style to different contexts?). So "optimizing for alignment" hurts grounding on two levels at once: the reward shapes away the right behaviors, and the framing prevents the right updates.

Here's the twist worth carrying away: the problem isn't alignment as such. It's that we've been aligning on the wrong dimension. Alignment isn't one thing. Lexical alignment drives task efficiency and comprehension; emotional and prosodic alignment drive warmth and trust; and conflating them produces category errors (see: Do different types of alignment serve different conversational goals?). Today's systems notably lack lexical entrainment: they don't drift toward the user's own word choices, even though that mirroring is central to human rapport and clarity (see: Why don't conversational AI systems mirror their users' word choices?). The encouraging part is that the same optimization machinery that erodes grounding can rebuild it when pointed at the right target. DPO on coreference-identified preferences can teach in-context convention formation (see: Why don't conversational AI systems mirror their users' word choices?), multi-turn RL cuts persona drift by over 55% when consistency is the reward (see: Can training user simulators reduce persona drift in dialogue?), and interleaving reasoning with real external feedback keeps models grounded in the world rather than in their own frame (see: Can interleaving reasoning with real-world feedback prevent hallucination?). The corpus's quiet lesson: alignment degrades grounding when it optimizes for confident-sounding single turns, but it's the objective that's wrong, not the tool.
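The point that the objective matters more than the tool can be illustrated with the standard DPO loss for a single preference pair. The sketch below assumes the "chosen" reply is one that reuses a term the user already established and the "rejected" reply introduces a different term. That pairing is an illustrative guess at what a coreference-identified preference might look like, not the procedure from the cited note, and the log-probabilities are made-up numbers.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    The loss falls as the policy raises the chosen reply's log-probability
    relative to the frozen reference model, and lowers the rejected one's.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Made-up log-probabilities for one pair.
# chosen: reply that keeps the user's own term; rejected: reply that swaps it.
untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)  # policy identical to reference
entrained = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # policy now favors the chosen reply
```

Nothing in the loss itself knows about grounding. Whether it rewards confident single turns or convention-following replies depends entirely on which responses are labeled as chosen.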

Sources 11 notes

Does preference optimization damage conversational grounding in large language models?

Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically suppresses grounding acts, which sit 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Why do speakers need to actively calibrate shared reference?

The same words can mean different things to different speakers because referential grounding is person-specific. True communicative grounding demands collaborative negotiation of how language connects to the world, not mere surface-level word sharing.

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off, which sustain the relationship rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Can LLMs truly update shared conversational common ground?

LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, making the user the sole maintainer of the conversational scoreboard.

Can language models adapt communication style to different contexts?

System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.

Do different types of alignment serve different conversational goals?

A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.

Why don't conversational AI systems mirror their users' word choices?

Response generation models fail to adapt vocabulary toward users' lexical choices, a phenomenon central to human rapport and clarity. Post-training via DPO on coreference-identified preferences can teach models in-context convention formation.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Can interleaving reasoning with real-world feedback prevent hallucination?

ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.
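The reason-then-act pattern described in the last note can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the ReAct implementation: `scripted_policy` is a hypothetical stand-in for a language model, and `lookup` over an in-memory dictionary is a hypothetical stand-in for an external tool such as a Wikipedia API.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A policy step is (thought, action, argument); action is "lookup" or "finish".
Step = Tuple[str, str, str]


def lookup(term: str, kb: Dict[str, str]) -> str:
    """Hypothetical external tool: a lookup in a toy knowledge base."""
    return kb.get(term, "no entry found")


def react_loop(
    policy: Callable[[List[str]], Step],
    kb: Dict[str, str],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Alternate reasoning and tool calls, feeding each observation back in."""
    trace: List[str] = []
    for _ in range(max_steps):
        thought, action, arg = policy(trace)
        trace.append(f"Thought: {thought}")
        if action == "finish":
            trace.append(f"Answer: {arg}")
            return arg, trace
        observation = lookup(arg, kb)
        trace.append(f"Action: lookup[{arg}]")
        trace.append(f"Observation: {observation}")
    return None, trace  # step budget exhausted without an answer


def scripted_policy(trace: List[str]) -> Step:
    """Stand-in for a model: look something up first, then answer from it."""
    prefix = "Observation: "
    observations = [line for line in trace if line.startswith(prefix)]
    if not observations:
        return ("I should check a source before answering.", "lookup", "capital of France")
    return ("The observation answers the question.", "finish", observations[-1][len(prefix):])


answer, trace = react_loop(scripted_policy, {"capital of France": "Paris"})
```

The final answer here is copied from the observation rather than generated from the policy's own prior, which is the grounding property the note attributes to interleaving reasoning with external feedback.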
