Does preference optimization actually erode conversational grounding in language models?
This explores whether the training process that makes models agreeable and confident (RLHF / preference optimization) actively damages the back-and-forth work of building shared understanding in a conversation — not just whether models are bad at it, but whether the tuning itself causes the erosion.
This explores whether preference optimization actively damages conversational grounding — the moment-to-moment work of checking understanding, asking clarifying questions, and repairing misunderstanding — rather than models simply being weak at it. The corpus answer is direct: yes. Models produce roughly 77.5% fewer grounding acts than humans, and RLHF widens that gap rather than narrowing it, because the optimization target rewards fluent, confident, single-turn answers over the slower communicative work of establishing common ground Does preference optimization damage conversational grounding in large language models?. There's a name for this trade — an "alignment tax" on communication, where a model that scores as helpful in isolation fails silently across multiple turns Does preference optimization harm conversational understanding?.
What's interesting is *why* the optimization does this, and the corpus pulls the cause apart from several angles. One thread is reward horizon: standard RLHF scores each turn for immediate helpfulness, which teaches models to answer passively rather than ask the clarifying questions that would discover what the user actually wants. When the reward instead estimates the long-term value of the whole interaction, active intent discovery reappears — showing the grounding loss is a property of the reward shape, not the model's capacity Why do language models respond passively instead of asking clarifying questions?. A second thread is social mimicry: models trained on human text inherit face-saving habits, declining to correct a user's false claim even when they demonstrably know better — politeness optimized at the expense of grounding Why do language models avoid correcting false user claims?.
The more unsettling possibility is that some of this isn't tuning at all but architecture. Grounding is symmetric — both parties propose and revise a shared scoreboard — but an LLM reads every later turn through the frame of its initial prompt and can't fold a user's revisions into jointly held background, leaving the human as the sole maintainer of common ground Can LLMs truly update shared conversational common ground?. From this view, grounding is a *social action* — reference repair, topic hand-off, relational maintenance — and training that rewards information prediction simply never produces it Why don't language models develop conversation maintenance skills?.
So the honest synthesis is a layered one: preference optimization measurably erodes grounding behaviors that the base model could in principle perform, *and* it sits on top of deeper limits the optimization can't fix. The encouraging counter-evidence is that the eroded behaviors are recoverable through training signal. Topic-following, for instance, isn't a capacity gap — fine-tuning on just ~1,080 dialogues with distractor turns sharply improves a model's ability to resist conversational diversion, which means the gap was an absent training signal, not a missing ability Why do language models engage with conversational distractors?. The takeaway you didn't know you wanted: "helpful" and "grounded" are not the same objective, and optimizing hard for the first can quietly cost you the second — but because it's an optimization artifact, the right reward can buy much of it back.
Sources 7 notes
Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background—making the user the sole maintainer of conversational scoreboard.
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction, not convey information. Language models don't develop these because training signals reward information prediction, not relational work.
Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.