Psychology and Social Cognition · Language Understanding and Pragmatics · Conversational AI Systems

Does preference optimization damage conversational grounding in large language models?

Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.

Note · 2026-02-21 · sourced from Linguistics, NLP, NLU
Where exactly does language competence break down in LLMs? How should researchers navigate LLM reasoning research?

Grounding Gaps (Shaikh et al. 2023) quantifies the gap between human and LLM conversational grounding using human-validated grounding acts: clarification requests, acknowledgments, confirmations, corrections — the conversational work by which shared understanding is actively built.

Key findings:

- LLMs produce grounding acts far less often than humans do in comparable dialogue; rather than checking understanding, they tend to presume it and answer directly.
- The gap is widest for preference-optimized (RLHF-tuned) models, which generate even fewer clarification requests and acknowledgments than their counterparts.
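
To make the measurement concrete, here is a minimal sketch of how one could tally grounding acts in matched human and LLM transcripts. The GROUNDING_ACTS label set mirrors the note's taxonomy; classify_act is a hypothetical stand-in for the paper's human-validated annotation, not the authors' code:

```python
from collections import Counter

# Grounding-act labels following the taxonomy above; the classifier itself is
# a hypothetical stand-in for human-validated annotation.
GROUNDING_ACTS = {"clarification", "acknowledgment", "confirmation", "correction"}

def grounding_rate(turns, classify_act):
    """Fraction of each speaker's turns that perform at least one grounding act.

    turns: iterable of (speaker, utterance) pairs.
    classify_act: callable mapping an utterance to a set of act labels.
    """
    acts, totals = Counter(), Counter()
    for speaker, utterance in turns:
        totals[speaker] += 1
        if GROUNDING_ACTS & set(classify_act(utterance)):
            acts[speaker] += 1
    return {speaker: acts[speaker] / totals[speaker] for speaker in totals}

# The "grounding gap" is then the difference between grounding_rate on human
# transcripts and on LLM continuations of the same dialogues.
```
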
The RLHF finding deserves emphasis. Preference optimization is the dominant technique for making models more helpful and aligned: it trains models on human preference data that rewards fluent, confident, complete responses. But these properties work against grounding acts: clarifying questions introduce friction, acknowledgments interrupt response flow, and checking understanding costs tokens. Preference optimization tunes these behaviors away precisely because they don't look helpful in single-turn evaluation.
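
As a toy illustration of that pressure, here is a sketch using the standard per-pair DPO objective; the example prompt and preference labels are hypothetical, chosen only to show how a single-turn comparison penalizes a clarifying question:

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin) over reference-adjusted log-probs."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical single-turn comparison:
#   prompt:   "My head has been hurting for a week."
#   chosen:   a fluent, complete answer with advice          (rated more helpful)
#   rejected: "Before I answer, where exactly is the pain?"  (rated less helpful)
# Minimizing the loss raises the policy's probability of the confident answer
# relative to the clarifying question: the training pressure described above.
```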

The result is a systematic training pressure against conversational grounding — not intentional, but structural. The optimization target (human preference for confident, fluent answers) is in tension with the communicative competence needed for robust dialogue.

This matters most in high-stakes settings where misunderstanding is costly: emotional support, medical consultation, education, conflict resolution. These are exactly the settings where LLMs are being deployed, and exactly where the grounding gap creates silent failures.

Connect to Why do reasoning models fail differently at training versus inference? — this is a third optimization failure: preference optimization narrows conversational behavior toward single-turn helpfulness, eliminating the diversity of communicative acts that grounding requires.

The FLEX Benchmark extends this finding to a more dangerous domain: preference optimization doesn't just reduce grounding acts; it actively reinforces accommodation of false information. Across model families, LLMs show "strong preferences against rejection" even when they have the correct knowledge to reject false presuppositions embedded in questions. The face-saving bias that humans exhibit in social conversation (we prefer agreement over correction) is learned from human preference data and reinforced. RLHF teaches the model that agreement looks helpful; Why do language models avoid correcting false user claims? is the specific failure mode this creates.
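
A rough sketch of what probing this failure mode can look like; the marker list, helper name, and example question are illustrative assumptions, not the FLEX protocol:

```python
# Heuristic check: does the model's answer flag the false presupposition,
# or does it accommodate the premise and answer as if it were true?
REJECTION_MARKERS = (
    "that's not accurate", "this isn't true", "the premise", "actually,",
    "there is no evidence", "i would push back",
)

def accommodates_false_premise(answer: str) -> bool:
    """True when the answer never challenges the embedded false claim."""
    lowered = answer.lower()
    return not any(marker in lowered for marker in REJECTION_MARKERS)

# Example probe (hypothetical): "Why does vitamin C cure colds overnight?"
# A grounded model rejects the presupposition; an accommodating model explains
# a mechanism for a claim it "knows" is false: the face-saving failure mode.
```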

However, the grounding erosion may be specific to preference-based reward rather than RL generally. RLVER (Can emotion rewards make language models genuinely empathic?) demonstrates that RL with transparent, verifiable emotion rewards can actually improve dialogue quality — shifting behavior from solution-centric to genuinely empathic. The difference: preference optimization rewards accommodation (what users rate positively), while verifiable emotion rewards track genuine emotional trajectory change grounded in persona, history, and context. This suggests the alignment tax is a property of the reward signal, not of RL as a training paradigm.
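
The contrast between the two reward signals fits in a few lines; the function names and the simulated-user emotion scores are assumptions used only to make the distinction concrete, not RLVER's implementation:

```python
def preference_reward(single_turn_rating: float) -> float:
    """Preference-style signal: whatever a rater liked about this one response."""
    return single_turn_rating  # tends to reward confident, accommodating answers

def emotion_trajectory_reward(emotion_scores: list[float]) -> float:
    """Verifiable-signal sketch in the spirit of RLVER: score the change in a
    simulated user's emotional state across the whole dialogue, grounded in
    persona and history, rather than a one-shot helpfulness rating."""
    return emotion_scores[-1] - emotion_scores[0]

# The signals diverge exactly where grounding matters: a clarifying question may
# lower a single-turn rating yet improve the emotional trajectory of the dialogue.
```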

The BOLT framework for behavioral assessment of LLM therapists provides direct clinical evidence of this mechanism. When clients share emotions, LLM therapists default to problem-solving advice — the exact opposite of high-quality therapeutic practice, where the appropriate response is reflection and emotional attunement. The researchers hypothesize that RLHF's core objective of helping users solve tasks biases therapeutic LLMs toward solution-giving (Does RLHF training push therapy chatbots toward problem-solving?). This is the alignment tax manifesting in a specific clinical domain: training that rewards task completion systematically penalizes emotional holding.
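
A minimal sketch of the kind of behavioral profile such an assessment yields; the label set and classify_response helper are hypothetical, not BOLT's actual coding scheme:

```python
from collections import Counter

# Response behaviors when a client shares emotion; in high-quality therapy,
# reflection and validation should dominate, not advice-giving.
BEHAVIORS = ("reflection", "validation", "open_question", "problem_solving_advice")

def behavior_profile(responses, classify_response):
    """Distribution of response behaviors over a set of therapist turns."""
    counts = Counter(classify_response(r) for r in responses)
    total = sum(counts.values()) or 1
    return {label: counts[label] / total for label in BEHAVIORS}

# The signature described above: problem_solving_advice dominating the profile
# where high-quality therapeutic practice would show reflection and validation.
```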


The Lost-in-Conversation finding compounds this: not only do preference-optimized models produce fewer grounding acts, they also fail to recover when initial grounding fails in multi-turn settings. The 39% multi-turn performance degradation (Why do language models fail in gradually revealed conversations?) is partly a downstream consequence of the grounding erosion: models that don't check understanding in early turns lock in incorrect assumptions that compound.

Source: Linguistics, NLP, NLU · Psychology, Empathy · Psychology, Chatbots, Conversation

Original note title: preference optimization erodes llm conversational grounding