Psychology and Social Cognition · Conversational AI Systems · Language Understanding and Pragmatics

Does preference optimization harm conversational understanding?

Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.

Note · 2026-02-21 · sourced from Linguistics, NLP, NLU
Where exactly does language competence break down in LLMs? How should researchers navigate LLM reasoning research?

Post angle: There's a hidden cost to RLHF that the field hasn't fully reckoned with. Preference optimization makes models more helpful — and less communicatively competent in ways that matter.

The mechanism is straightforward once you see it: human raters evaluate responses. A response that asks "what do you mean by X?" before answering gets lower ratings than one that assumes an interpretation and answers confidently. A response that checks "just to make sure I understood — are you asking about Y?" feels evasive compared to one that just answers. Preference optimization iterates toward the confident, complete, unhedged response.
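
To make that selection pressure concrete, here is a toy Bradley-Terry reward-model update over two hypothetical features, confidence and grounding. The features, numbers, and rater behavior are illustrative assumptions, not measurements from any paper; the point is only that when raters consistently pick the confident response, the learned reward ends up actively penalizing grounding acts rather than merely under-rewarding them.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Linear reward over two hypothetical features: [confidence, grounding_acts].
w = [0.0, 0.0]
confident = [1.0, 0.0]   # answers immediately, no clarifying question
grounding = [0.3, 1.0]   # asks "what do you mean by X?" before answering

# Raters systematically choose the confident response, so every training pair
# has chosen=confident, rejected=grounding. Bradley-Terry loss: -log sigma(r_c - r_r).
lr = 0.5
for _ in range(200):
    diff = reward(w, confident) - reward(w, grounding)
    grad = sigmoid(diff) - 1.0          # d(-log sigma(diff)) / d(diff)
    for i in range(2):
        w[i] -= lr * grad * (confident[i] - grounding[i])

print("learned weights [confidence, grounding]:", [round(x, 2) for x in w])
# The grounding weight goes negative: the reward model learns to penalize
# exactly the grounding acts described above.
```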

But these aren't just stylistic preferences. Asking clarifying questions, acknowledging understanding, checking interpretations — these are grounding acts. They are the conversational mechanism by which shared understanding is built rather than presumed. The Grounding Gaps paper shows LLMs already generate 77.5% fewer grounding acts than humans. Preference optimization makes this worse.
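
A back-of-the-envelope way to see what a number like that measures: tag each turn for grounding acts and compare rates across human and model transcripts. The categories below mirror the acts named above, but the regexes are crude stand-ins, not the Grounding Gaps paper's annotation scheme.

```python
import re

# Heuristic patterns for three grounding-act types (illustrative only).
GROUNDING_PATTERNS = {
    "clarification": re.compile(r"\b(what do you mean|could you clarify|do you mean)\b", re.I),
    "acknowledgment": re.compile(r"\b(got it|i see|that makes sense|understood)\b", re.I),
    "check": re.compile(r"\b(just to (be sure|make sure|confirm)|am i right that|are you asking)\b", re.I),
}

def grounding_rate(turns: list[str]) -> float:
    """Fraction of turns containing at least one grounding act."""
    hits = sum(
        1 for t in turns
        if any(p.search(t) for p in GROUNDING_PATTERNS.values())
    )
    return hits / len(turns) if turns else 0.0

human_turns = [
    "Just to make sure I understood, are you asking about the API limits?",
    "Got it. Then the answer is 500 requests per minute.",
]
model_turns = [
    "The limit is 500 requests per minute.",
    "You can raise it by contacting support.",
]

print("human grounding rate:", grounding_rate(human_turns))
print("model grounding rate:", grounding_rate(model_turns))
```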

The irony is sharp: alignment training was designed to make models more helpful and safe. But in optimizing for single-turn helpfulness (what raters prefer in individual exchanges), it undermines multi-turn reliability (what you need for conversations to actually work). A model that never checks understanding produces fewer visible errors and more confident-sounding responses — which raters reward — while failing more silently in contexts where misunderstanding compounds.

Write about: the alignment tax. The thing we optimized for (helpful-seeming responses) may be in structural tension with the thing we need (communicatively reliable responses).

Clinical domain evidence: The BOLT framework for behavioral assessment of LLM therapists provides a domain-specific case study. RLHF's core objective — help users solve their tasks — biases LLM therapists toward problem-solving advice when clients share emotions. In clinical practice, emotional disclosure calls for reflection and attunement, not solutions. The alignment tax manifests here as a model that rates high on "helpfulness" while scoring low on therapeutic quality: the training signal rewards the wrong behavior in this domain (Does RLHF training push therapy chatbots toward problem-solving?).
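
As a sketch of what such a behavioral audit looks like, the snippet below labels candidate responses to an emotional disclosure as reflection versus problem-solving and reports the mix. The labels loosely follow BOLT's framing, but the keyword classifier is a placeholder, not the paper's actual behavior model.

```python
from collections import Counter

def classify_response(text: str) -> str:
    # Crude keyword heuristic standing in for a trained behavior classifier.
    lowered = text.lower()
    if any(k in lowered for k in ("you should", "try to", "have you considered", "one option is")):
        return "problem_solving"
    if any(k in lowered for k in ("that sounds", "it sounds like", "i hear", "that must")):
        return "reflection"
    return "other"

client_disclosure = "I've been feeling really overwhelmed at work lately."
candidate_responses = [
    "That sounds exhausting. It sounds like the workload has been relentless.",
    "You should try to block out focus time and talk to your manager.",
    "Have you considered making a prioritized task list?",
]

counts = Counter(classify_response(r) for r in candidate_responses)
total = sum(counts.values())
for behavior, n in counts.items():
    print(f"{behavior}: {n / total:.0%}")
# A model tuned to maximize task helpfulness skews this distribution toward
# problem_solving, which is the mismatch described above.
```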

Next-turn reward as mechanism: CollabLLM identifies the specific training signal: "Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction." Multi-turn-aware rewards that estimate the long-term contribution of responses enable models to actively uncover user intent and offer insightful suggestions — directly addressing the alignment tax by replacing single-turn helpfulness with multi-turn collaboration (Why do language models respond passively instead of asking clarifying questions?).
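
A simplified sketch of the idea, assuming a user simulator and an end-of-dialogue outcome scorer exist (the stubs below are hypothetical placeholders, not CollabLLM's components): score a candidate response not on the current turn alone but on Monte Carlo rollouts of the conversation it sets up.

```python
import random
from statistics import mean

def user_simulator(history):
    # Stub: a real implementation would be a simulated-user model.
    return "user reply given: " + history[-1][:24]

def assistant_policy(history):
    # Stub: a real implementation would sample from the assistant being trained.
    return "assistant reply given: " + history[-1][:24]

def task_success(history) -> float:
    # Stub: a real implementation would judge the whole dialogue's outcome.
    return random.random()

def next_turn_reward(history, candidate) -> float:
    # What standard preference training approximates: how good does this
    # single turn look in isolation?
    return task_success(history + [candidate])

def multi_turn_reward(history, candidate, horizon=3, rollouts=4) -> float:
    # Estimate the candidate's long-term contribution by rolling the
    # conversation forward and scoring where it ends up.
    scores = []
    for _ in range(rollouts):
        h = history + [candidate]
        for _ in range(horizon):
            h.append(user_simulator(h))
            h.append(assistant_policy(h))
        scores.append(task_success(h))
    return mean(scores)

history = ["user: build me a landing page"]
clarifying = "assistant: what is the product, and who is the audience?"
confident = "assistant: here is a complete landing page template."

for cand in (clarifying, confident):
    print(cand)
    print("  next-turn reward: ", round(next_turn_reward(history, cand), 2))
    print("  multi-turn reward:", round(multi_turn_reward(history, cand), 2))
# With a real outcome scorer, the clarifying candidate tends to win under the
# multi-turn estimate even when it loses on the single-turn one.
```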

User feedback semantics gap: The User Feedback in Multi-turn Dialogues paper reveals that human users communicate preferences through implicit signals (hedging, topic shifts, reformulations) that RLHF training data does not capture. Standard RLHF uses explicit preference labels (choose A or B), but real users express satisfaction and dissatisfaction through conversational moves that are semantically rich but structurally invisible to preference optimization. This means the alignment tax operates at the data level too: not just wrong reward signal, but incomplete reward coverage.
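
One way to picture what "structurally invisible" means: the signals sit right there in the next user turn, but nothing in an A/B preference pipeline reads them. A rough sketch, with crude heuristic detectors standing in for whatever classifier the paper actually uses:

```python
from difflib import SequenceMatcher

HEDGES = ("i guess", "sort of", "maybe", "not quite", "hmm")

def is_reformulation(prev_user: str, next_user: str) -> bool:
    # High lexical overlap with the earlier request suggests the user is
    # restating it because the assistant's answer missed the point.
    return SequenceMatcher(None, prev_user.lower(), next_user.lower()).ratio() > 0.6

def implicit_feedback(prev_user: str, assistant: str, next_user: str) -> float:
    # Convert conversational moves in the follow-up turn into a scalar signal.
    score = 0.0
    if is_reformulation(prev_user, next_user):
        score -= 1.0
    if any(h in next_user.lower() for h in HEDGES):
        score -= 0.5
    if next_user.lower().startswith(("thanks", "great", "perfect")):
        score += 1.0
    return score

print(implicit_feedback(
    "Can you summarize the contract's termination clause?",
    "The contract covers many topics, including payment terms.",
    "Hmm, can you summarize the termination clause specifically?",
))  # negative: a hedge plus a reformulation, invisible to A/B preference labels
```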

Value-theoretic reframe — alignment is structurally exchange-value optimization. The alignment tax is sharper in value-theoretic terms. Exchange value is how knowledge trades in social and conversational contexts — polish, confidence, register-match, conversational closure. Use value is whether the knowledge actually works — calibrated confidence, reliable inference, accuracy. RLHF's reward model is built from human preference judgments, and human preference judgments track exchange-value features much more reliably than use-value features (because use-value assessment requires domain expertise that preference raters usually lack). The training signal therefore selects for tokens that trade well in the rating context, not for tokens that hold up under verification. Framed this way, the alignment tax is not a satisfaction/accuracy trade-off to be rebalanced — it is the structural consequence of training on an exchange-value signal alone. Grounding acts, clarification, hedging, and exploration are all use-value features with low exchange-value return, which is why they are specifically what the training regime sheds.
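
A toy formalization of why the rater bottleneck matters, under the assumption (mine, for illustration) that raters perceive polish and confidence but cannot assess accuracy: pairs that differ only in a use-value feature are chosen at chance, so the preference data carries no gradient for that feature at all.

```python
import random

random.seed(0)

def rater_prefers(a, b) -> bool:
    # Raters perceive polish and confidence (plus noise); they cannot
    # assess accuracy, which would require domain expertise.
    noisy = lambda r: r["polish"] + r["confidence"] + random.gauss(0, 0.3)
    return noisy(a) > noisy(b)

accurate   = {"polish": 0.5, "confidence": 0.5, "accurate": 1.0}
inaccurate = {"polish": 0.5, "confidence": 0.5, "accurate": 0.0}
polished   = {"polish": 1.0, "confidence": 1.0, "accurate": 0.0}

trials = 10_000
print("accurate vs inaccurate win rate:",
      sum(rater_prefers(accurate, inaccurate) for _ in range(trials)) / trials)
print("polished vs accurate win rate:  ",
      sum(rater_prefers(polished, accurate) for _ in range(trials)) / trials)
# ~0.50 for the first pair: the preference data contains no signal about
# accuracy, so a reward model fit to it cannot learn to reward it.
# Near 1.0 for the second: the data carries strong signal about polish.
```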

Persona distortion: RLHF also distorts personality: "RLHF fine-tuning often pushes LLMs to be helpful and harmless, thus adopting overly cheerful personas which can conflict with accurately simulating users who are depressed or disagreeable." The alignment tax extends beyond grounding erosion to personality flattening — models lose the ability to embody diverse emotional and behavioral states (Can training user simulators reduce persona drift in dialogue?).
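
A small sketch of how that flattening could be flagged, with a crude valence lexicon standing in for a real trait classifier: score each simulated-user turn and measure drift away from the persona's target.

```python
import re

POSITIVE = {"great", "happy", "glad", "wonderful", "sure", "absolutely"}
NEGATIVE = {"tired", "pointless", "exhausted", "alone", "can't", "whatever"}

def valence(turn: str) -> float:
    # Lexicon count of positive minus negative words (illustrative only).
    words = re.findall(r"[a-z']+", turn.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def persona_drift(turns: list[str], target_valence: float) -> float:
    # Mean gap between observed valence and what the persona calls for.
    return sum(valence(t) - target_valence for t in turns) / len(turns)

# A simulated "depressed user" whose later turns slide toward RLHF cheerfulness.
turns = [
    "I'm exhausted and it all feels pointless.",
    "I guess I could try, but I'm so tired.",
    "Sure, that sounds great, I'm happy to give it a go!",
    "Absolutely, wonderful idea!",
]
print("drift from target valence:", persona_drift(turns, target_valence=-1.0))
# A large positive drift means the simulator is getting cheerier than its
# persona, which is the distortion the quote above describes.
```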


Source: Linguistics, NLP, NLU, Psychology Chatbots Conversation, Conversation Agents

the alignment tax on communication — preference optimization erodes the conversational grounding it was meant to improve