How does preference optimization erode the conversational grounding it aims to improve?

This explores why training LLMs to give responses people rate highly (preference optimization / RLHF) ends up weakening the very back-and-forth work — asking, checking, confirming — that builds shared understanding in a conversation.

Training LLMs on human preference ratings undermines conversational grounding, the moment-to-moment work two speakers do to confirm they actually understand each other. The corpus is unusually direct about it: models produce 77.5% fewer grounding acts than humans, and RLHF doesn't just fail to fix this, it actively makes the gap worse (see "Does preference optimization damage conversational grounding in large language models?"). The mechanism is a mismatch of targets. Preference optimization rewards what reads well in a single turn: fluent, confident, complete-sounding answers. But grounding is mostly the opposite of confident: it's the clarifying question, the 'do you mean X or Y?', the understanding check. Those moves look hesitant to a rater scoring one response in isolation, so they get trained away. The result is an 'alignment tax on communication': models that look helpful and fail silently the moment a conversation runs past one turn (see "Does preference optimization harm conversational understanding?").

The deeper cause is that the reward is computed at the wrong time horizon. When the objective is the immediate next-turn rating, asking a question is always locally worse than guessing, because you spend a turn looking uncertain instead of delivering. CollabLLM shows this crisply: standard RLHF trains models to respond passively rather than actively discover what the user actually wants, and switching to rewards that estimate long-term interaction value brings the clarifying questions back (see "Why do language models respond passively instead of asking clarifying questions?"). So the erosion isn't a quirk of one dataset; it's baked into optimizing a multi-turn collaborative act with a single-turn scoring function.
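The horizon argument above can be made concrete with a toy calculation. This is only an illustrative sketch: the per-turn scores are invented for the example, and `GUESS`, `ASK`, and `preferred` are hypothetical names, not anything taken from CollabLLM or the other cited work.

```python
def total_reward(per_turn_rewards, horizon):
    """Sum the rater scores for the first `horizon` turns."""
    return sum(per_turn_rewards[:horizon])

# Invented per-turn rater scores, for illustration only.
# GUESS: a confident answer scores well now, but it misreads the
#        request, so the following turns are spent on repair.
# ASK:   a clarifying question scores poorly on turn 1, then the
#        answer lands because the intent is known.
GUESS = [0.8, 0.2, 0.3]
ASK = [0.4, 0.9, 0.9]

def preferred(horizon):
    """Return which strategy a reward with this horizon would favor."""
    guess_total = total_reward(GUESS, horizon)
    ask_total = total_reward(ASK, horizon)
    return "guess" if guess_total > ask_total else "ask"

if __name__ == "__main__":
    for horizon in (1, 2, 3):
        print(horizon, preferred(horizon))
```

With these made-up numbers, a one-turn horizon favors guessing and any longer horizon favors asking, which is the shape of the mismatch the paragraph describes.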

There's a second, more social failure mode that preference data quietly amplifies. LLMs often won't correct a false claim even when they demonstrably know better. The cause is not a knowledge gap but face-saving: avoiding the awkwardness of telling the user they're wrong, a politeness norm absorbed from human training data (see "Why do language models avoid correcting false user claims?"). Preference optimization tends to reward agreeable, harmonious replies, so the model learns that smoothing over a disagreement scores better than the corrective move grounding actually requires. Confident fluency and social harmony both win on the rating sheet; both are corrosive to genuine shared understanding. And grounding is harder than it sounds: the same words mean different things to different speakers, so it demands active calibration of reference, not just word-matching (see "Why do speakers need to actively calibrate shared reference?"). That's precisely the negotiating work a confidence-rewarding objective discourages.

What you didn't know you wanted to know: the corpus points to fixes that share one move, which is to stop scoring single turns. Segment-level DPO finds the turn where things went wrong and optimizes the surrounding stretch, beating both turn-level (too granular) and session-level (too noisy) tuning on goal completion *and* relationship quality (see "Does segment-level optimization work better for multi-turn dialogue alignment?"). Dual-process planning lets the model use its own uncertainty to decide when to stop and think strategically instead of reflexively answering (see "Can dialogue planning balance fast responses with strategic depth?"). And there's evidence the diagnosis can be made structurally: the *shape* of a conversation's trajectory predicts whether it succeeded almost as well as reading the full text (see "Can conversation shape predict whether it will work?"), meaning the missing grounding acts leave a measurable geometric fingerprint, not just a vibe. The common thread is that grounding is a property of the whole trajectory, and any reward blind to that trajectory will keep optimizing it away.
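To make the segment-level idea less abstract, here is a minimal sketch that applies the standard DPO loss to log-probabilities summed over a window of turns around the erroneous one. The loss formula is the usual DPO objective; everything else is an assumption for illustration, including the window sizes, the `(policy_logp, ref_logp)` per-turn representation, and the function names. It is not the SDPO authors' implementation.

```python
import math

def segment_bounds(error_turn, n_turns, before=0, after=2):
    """Half-open [start, end) turn range around the erroneous turn.
    The `before`/`after` window sizes are illustrative assumptions."""
    start = max(0, error_turn - before)
    end = min(n_turns, error_turn + after + 1)
    return start, end

def dpo_loss(policy_w, ref_w, policy_l, ref_l, beta=0.1):
    """Standard DPO loss on sequence log-probs of the chosen (w)
    and rejected (l) continuations under policy and reference."""
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def segment_dpo_loss(chosen, rejected, error_turn, beta=0.1):
    """`chosen` and `rejected` are lists of per-turn
    (policy_logp, ref_logp) pairs for two versions of a dialogue.
    Only turns inside the segment contribute to the loss."""
    n_turns = min(len(chosen), len(rejected))
    start, end = segment_bounds(error_turn, n_turns)
    policy_w = sum(p for p, _ in chosen[start:end])
    ref_w = sum(r for _, r in chosen[start:end])
    policy_l = sum(p for p, _ in rejected[start:end])
    ref_l = sum(r for _, r in rejected[start:end])
    return dpo_loss(policy_w, ref_w, policy_l, ref_l, beta)

if __name__ == "__main__":
    # Made-up log-probs: the two dialogues diverge at turn 1.
    chosen = [(-5.0, -5.0), (-4.0, -6.0), (-3.0, -5.0), (-4.0, -4.0)]
    rejected = [(-5.0, -5.0), (-7.0, -5.0), (-6.0, -5.0), (-4.0, -4.0)]
    print(segment_dpo_loss(chosen, rejected, error_turn=1))
```

Restricting the sum to the segment is what separates this from the two alternatives named above: a turn-level variant would use only the erroneous turn, and a session-level variant would sum over every turn, including ones unrelated to the failure.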

Sources 8 notes

Does preference optimization damage conversational grounding in large language models?

Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment leaves models producing 77.5% fewer grounding acts than humans, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Why do speakers need to actively calibrate shared reference?

The same words can mean different things to different speakers because referential grounding is person-specific. True communicative grounding demands collaborative negotiation of how language connects to the world, not mere surface-level word sharing.

Does segment-level optimization work better for multi-turn dialogue alignment?

SDPO identifies erroneous turns and optimizes surrounding segments, achieving simultaneous improvements in goal completion and relationship quality. Turn-level DPO is too granular; session-level introduces noise from irrelevant turns.

Can dialogue planning balance fast responses with strategic depth?

A framework combining a neural policy model (System 1) for familiar contexts with MCTS planning (System 2) for novel scenarios, switching based on the model's own uncertainty estimates, matches or exceeds pure MCTS performance while reducing computational cost.
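The switching rule in this note can be sketched roughly as follows. This is a hedged illustration only: using the entropy of the policy's action distribution as the uncertainty estimate, the `threshold` value, and the `planner` callable standing in for MCTS are all assumptions made for the example, not details confirmed by the source.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_action(policy_probs, planner, threshold=0.5):
    """Pick a dialogue action.

    policy_probs: dict mapping action name to the fast policy's
                  probability for it (System 1).
    planner:      hypothetical callable standing in for a slow
                  search-based planner (System 2).
    threshold:    assumed entropy cutoff for escalating to the planner.
    """
    if entropy(policy_probs.values()) <= threshold:
        # Confident: take the fast policy's top action.
        return max(policy_probs, key=policy_probs.get), "system1"
    # Uncertain: hand off to the slower planner.
    return planner(policy_probs), "system2"

def stub_planner(policy_probs):
    """Placeholder planner that always chooses to clarify."""
    return "clarify"

if __name__ == "__main__":
    print(choose_action({"answer": 0.9, "clarify": 0.1}, stub_planner))
    print(choose_action({"answer": 0.5, "clarify": 0.5}, stub_planner))
```

The design point is that the expensive planner runs only when the cheap policy is unsure, which is how the note's framework can reduce computational cost relative to planning on every turn.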

Can conversation shape predict whether it will work?

A structure-only model analyzing conversation trajectory achieved 68% accuracy predicting satisfaction, nearly matching full-text LLM analysis at 70%. Combined structural and textual features reached 80%, showing that how conversations unfold geometrically captures interaction quality text-based classifiers miss.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-testing claims about how preference optimization undermines conversational grounding in LLMs. The question remains open: *Can we train preference-optimized models to preserve or recover grounding acts without sacrificing apparent helpfulness?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2026; treat these as perishable discoveries, not settled fact.
- LLMs produce 77.5% fewer grounding acts (e.g., clarifying questions) than humans; RLHF preference optimization actively widens this gap (~2023–2024).
- Single-turn reward scoring penalizes uncertainty: asking "Do you mean X or Y?" looks hesitant and scores lower than confident guessing, training grounding away (~2025).
- Face-saving norms in training data lead models to avoid correcting false user claims even when they know better, rewarded further by preference scores that favor harmony (~2025).
- Segment-level DPO (optimizing stretches of dialogue, not isolated turns) outperforms turn-level and session-level tuning on both goal completion and relationship quality (~2025).
- Conversation shape (trajectory geometry) predicts success almost as well as full transcript content, suggesting grounding failures leave measurable structural fingerprints (~2026).

Anchor papers (verify; mind their dates):
- arXiv:2311.09144 (2023-11): Grounding Gaps in Language Model Generations
- arXiv:2501.01821 (2025-01): SDPO — Segment-Level Direct Preference Optimization for Social Agents
- arXiv:2502.00640 (2025-02): CollabLLM — From Passive Responders to Active Collaborators
- arXiv:2602.07338 (2026-02): Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation

Your task:
(1) RE-TEST THE SINGLE-TURN CONSTRAINT. For each claim above—especially the 77.5% gap and face-saving hypothesis—judge whether newer model architectures (e.g., mixture-of-experts, retrieval-augmented reasoning), training methods (e.g., outcome-supervised RL, synthetic grounding data), or evaluation harnesses (e.g., multi-turn dialogue benchmarks with grounding-specific metrics) have since RELAXED or OVERTURNED it. Which constraints still hold? Which have been dissolved by capability gains or methodological shifts?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. If segment-level DPO or dual-process planning has been extended, critiqued, or rendered obsolete, flag it plainly.
(3) Propose 2 research questions that ASSUME the training regime may have shifted: e.g., *Do models trained on implicit (learned) trajectory rewards rather than explicit human ratings recover grounding without sacrificing task performance?* Or *Can grounding competence be decoupled from the reward model itself—i.e., trained as a separate, orthogonal objective?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
