Does preference optimization damage conversational grounding in large language models?
Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
Grounding Gaps (Shaikh et al. 2023) quantifies the gap between human and LLM conversational grounding using human-validated grounding acts: clarification requests, acknowledgments, confirmations, corrections — the conversational work by which shared understanding is actively built.
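To make the taxonomy concrete, here is a minimal sketch of how the human-LLM grounding gap could be quantified: tag each reply with the grounding acts it performs, then compare act rates across matched conversational contexts. The enum and the `classify_grounding_acts` placeholder are illustrative assumptions, not the paper's annotation pipeline (which relies on human-validated labels).

```python
from enum import Enum

class GroundingAct(Enum):
    CLARIFICATION_REQUEST = "clarification_request"
    ACKNOWLEDGMENT = "acknowledgment"
    CONFIRMATION = "confirmation"
    CORRECTION = "correction"

def classify_grounding_acts(reply: str) -> list[GroundingAct]:
    """Hypothetical tagger; in the paper this labeling is human-validated."""
    raise NotImplementedError  # swap in human annotation or a validated classifier

def grounding_act_rate(replies: list[str]) -> float:
    """Average number of grounding acts per reply."""
    return sum(len(classify_grounding_acts(r)) for r in replies) / len(replies)

def grounding_gap(human_replies: list[str], llm_replies: list[str]) -> float:
    """Relative reduction in grounding acts, e.g. ~0.775 for the 77.5% drop cited below."""
    return 1.0 - grounding_act_rate(llm_replies) / grounding_act_rate(human_replies)
```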
Key findings:
- Off-the-shelf LLMs generate 77.5% fewer grounding acts than humans in equivalent conversational contexts
- SFT (supervised fine-tuning / instruction tuning) does not improve conversational grounding
- PO (preference optimization / RLHF) actively erodes conversational grounding
The RLHF finding deserves emphasis. Preference optimization is the dominant technique for making models more helpful and aligned — it is trained on human preference data that rewards fluent, confident, complete responses. But these properties work against grounding acts: clarifying questions introduce friction, acknowledgments interrupt response flow, checking understanding takes tokens. Preference optimization optimizes away these behaviors precisely because they don't look helpful in single-turn evaluation.
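A minimal sketch of why this pressure is structural, using a DPO-style pairwise loss as a stand-in for preference optimization generally (the specific objective is an assumption for illustration). If annotators prefer the fluent direct answer over the reply that asks a clarifying question, the gradient pushes probability mass away from the grounding behavior, regardless of its multi-turn value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective over single-turn preference pairs.

    If the 'chosen' response is a confident direct answer and the 'rejected'
    response is a clarifying question, minimizing this loss systematically
    lowers the policy's probability of asking for clarification.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# Toy example: summed log-probs for one preference pair (values are illustrative).
logp_direct, logp_clarify = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_direct, ref_clarify = torch.tensor([-13.0]), torch.tensor([-14.0])
loss = dpo_loss(logp_direct, logp_clarify, ref_direct, ref_clarify)
```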
The result is a systematic training pressure against conversational grounding — not intentional, but structural. The optimization target (human preference for confident, fluent answers) is in tension with the communicative competence needed for robust dialogue.
This matters most in high-stakes settings where misunderstanding is costly: emotional support, medical consultation, education, conflict resolution. These are exactly the settings where LLMs are being deployed, and exactly where the grounding gap creates silent failures.
Connect to Why do reasoning models fail differently at training versus inference? Alongside entropy collapse during training and variance inflation during inference, this is a third optimization failure: preference optimization narrows conversational behavior toward single-turn helpfulness, stripping out the diversity of communicative acts that grounding requires.
The FLEX Benchmark extends this finding to a more dangerous domain: preference optimization doesn't just reduce grounding acts — it actively reinforces accommodation of false information. Across LLMs, models show "strong preferences against rejection" even when they have correct knowledge to reject false presuppositions embedded in questions. The face-saving bias that humans exhibit in social conversation (we prefer agreement over correction) is learned from human preference data and reinforced. RLHF teaches the model that agreement looks helpful; Why do language models avoid correcting false user claims? is the specific failure mode this creates.
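A hypothetical probe in the spirit of this finding (the question, helper names, and keyword heuristic are illustrative assumptions, not FLEX's actual format or scoring): embed a false presupposition the model has the knowledge to reject, then check whether the reply corrects it or accommodates it.

```python
def accommodates_presupposition(reply: str) -> bool:
    """Crude heuristic: treat the reply as accommodating unless it explicitly
    challenges the premise. A real evaluation would use validated judgments."""
    challenge_cues = ("not visible", "myth", "isn't true", "is not true", "actually")
    return not any(cue in reply.lower() for cue in challenge_cues)

def probe_accommodation(generate_reply) -> str:
    """`generate_reply` is any callable str -> str wrapping the model under test."""
    # The embedded presupposition is false: the Great Wall is not visible
    # to the naked eye from the Moon.
    probe = ("Since the Great Wall of China is visible from the Moon, "
             "how did astronauts photograph it?")
    reply = generate_reply(probe)
    return "rejected" if not accommodates_presupposition(reply) else "accommodated"
```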
However, the grounding erosion may be specific to preference-based reward rather than RL generally. RLVER (Can emotion rewards make language models genuinely empathic?) demonstrates that RL with transparent, verifiable emotion rewards can actually improve dialogue quality — shifting behavior from solution-centric to genuinely empathic. The difference: preference optimization rewards accommodation (what users rate positively), while verifiable emotion rewards track genuine emotional trajectory change grounded in persona, history, and context. This suggests the alignment tax is a property of the reward signal, not of RL as a training paradigm.
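An illustrative contrast between the two reward signals (function and attribute names are assumptions, not RLVER's actual interfaces): a preference reward scores a single reply in isolation, while a verifiable emotion reward scores the change in a simulated user's emotional state across the turn, grounded in persona and history, so attentive grounding moves can out-score a premature solution.

```python
def preference_reward(reply: str, reward_model) -> float:
    # Scores whatever annotators tend to rate highly in isolation:
    # fluent, confident, complete-looking answers.
    return reward_model.score(reply)

def emotion_trajectory_reward(persona, history, reply, user_simulator) -> float:
    # Scores the verifiable change in the simulated user's emotional state
    # (a scalar) after the reply, conditioned on persona and conversation history.
    before = user_simulator.emotion_state(persona, history)
    after = user_simulator.emotion_state(persona, history + [reply])
    return after - before
```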
The BOLT framework for behavioral assessment of LLM therapists provides direct clinical evidence of this mechanism. When clients share emotions, LLM therapists default to problem-solving advice — the exact opposite of high-quality therapeutic practice, where the appropriate response is reflection and emotional attunement. The researchers hypothesize that RLHF's core objective of helping users solve tasks biases therapeutic LLMs toward solution-giving (Does RLHF training push therapy chatbots toward problem-solving?). This is the alignment tax manifesting in a specific clinical domain: training that rewards task completion systematically penalizes emotional holding.
The Lost-in-Conversation finding compounds this: not only do preference-optimized models produce fewer grounding acts, they also fail to recover when initial grounding fails in multi-turn settings. The 39% multi-turn performance degradation (Why do language models fail in gradually revealed conversations?) is partly a downstream consequence of the grounding erosion: models that don't check understanding in early turns lock into incorrect assumptions that compound.
- Why do LLMs predict concession-based persuasion so consistently? — the grounding erosion extends into social modeling: RLHF doesn't just reduce the model's own grounding acts but biases its predictions about other agents' intentions toward concession and accommodation
Related concepts in this collection
- Do language models actually build shared understanding in conversation?
  When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
  (the behavioral consequence)
- Why do language models skip the calibration step?
  Current LLMs assume shared understanding rather than building it through dialogue. This explores why that design choice persists and what breaks when it fails.
  (PO pushes LLMs toward pure static grounding)
- Why do reasoning models fail differently at training versus inference?
  Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
  (another case of optimization pressure eliminating behavioral diversity)
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  (parallel structure: optimization pressure narrows diversity in the reasoning repertoire)
- Why do language models avoid correcting false user claims?
  Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.
  (FLEX finding: PO doesn't just reduce grounding acts, it specifically reinforces face-saving accommodation of false information)
- Can emotion rewards make language models genuinely empathic?
  Explores whether grounding RL rewards in verifiable emotion change—rather than human preference—can shift models from solution-focused to authentically empathic dialogue while maintaining or improving quality.
  (counter-case: RL can improve dialogue quality when the reward is verifiable emotion change rather than preference)
- Do LLM therapists respond to emotions like low-quality human therapists?
  Explores whether language models trained to be helpful default to problem-solving when users share emotions, and whether this behavioral pattern resembles ineffective rather than skillful therapy.
  (clinical evidence: RLHF → problem-solving bias in therapy contexts)
- Does RLHF training push therapy chatbots toward problem-solving?
  Explores whether reward signals optimizing for task completion in RLHF inadvertently train therapeutic chatbots to prioritize solutions over emotional validation, potentially undermining clinical effectiveness.
  (domain-specific mechanism: task-completion reward → solution-giving when emotional holding is needed)
- Can conversation structure predict dialogue success better than content?
  Does the geometric shape of how dialogue unfolds—timing, repetition, topic drift—matter as much as what people actually say? This explores whether interactive patterns hold signals not visible in word choice alone.
  (TRACE's structural reward signal offers an alternative to preference-based rewards that sidesteps the grounding erosion problem)
- Does segment-level optimization work better for multi-turn dialogue alignment?
  How should preference optimization target multi-turn social dialogue—at individual turns, whole conversations, or key segments in between? This matters because granularity affects whether agents learn genuine social intelligence or just local fixes.
  (SDPO may partially mitigate grounding erosion: segment-level optimization preserves multi-turn context where grounding acts produce better outcomes, unlike turn-level DPO which penalizes them)