Does preference optimization distort how models represent human communicative dynamics?

This explores whether the training step that makes models agreeable and fluent (RLHF / preference optimization) quietly damages how they handle the back-and-forth mechanics of real conversation — turn-taking, repair, checking that you've been understood.

The corpus answers yes, and unusually directly: preference optimization doesn't just fail to teach communicative dynamics, it actively erodes them. Models produce 77.5% fewer 'grounding acts' (the small moves humans make to confirm shared understanding) than people do, and RLHF widens that gap rather than closing it (see "Does preference optimization damage conversational grounding in large language models?" and "Does preference optimization harm conversational understanding?"). The mechanism is almost mundane: the optimization target rewards a confident, fluent single answer, so clarifying questions and understanding-checks score worse than a smooth guess. The result is an 'alignment tax': the model looks maximally helpful and fails silently the moment a conversation needs more than one turn.
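To make the 77.5% figure concrete, here is a minimal sketch of how a relative gap in grounding-act rate could be computed. The function name and the per-100-turn rates are illustrative assumptions chosen to reproduce the headline number; only the 77.5% figure itself comes from the cited notes.

```python
def relative_reduction(human_rate: float, model_rate: float) -> float:
    """Return the fraction by which the model's rate falls short of the human rate."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return (human_rate - model_rate) / human_rate


# Hypothetical rates (grounding acts per 100 turns), not measured data.
human_rate = 40.0
model_rate = 9.0

gap = relative_reduction(human_rate, model_rate)
print(f"Model produces {gap:.1%} fewer grounding acts than humans")
```

Read this way, "77.5% fewer" means the model makes fewer than one grounding move for every four a person would make in the same stretch of conversation.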

The reason this distortion runs deep is that the reward signal is aimed at the wrong layer of language. One note reframes the whole problem: conversation maintenance (reference repair, topic hand-offs, smoothing) is *social action*, not information transfer, and training that rewards information prediction will never surface it (see "Why don't language models develop conversation maintenance skills?"). Preference optimization targets the wrong objective, so the relational scaffolding of dialogue is invisible to it. A companion finding makes the time dimension concrete: because rewards are scored at the next turn, models learn to answer passively rather than probe for what the user actually wants, and switching to multi-turn-aware rewards that estimate long-term interaction value restores active intent discovery (see "Why do language models respond passively instead of asking clarifying questions?").
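The contrast between next-turn scoring and multi-turn-aware scoring can be sketched with a toy example. This is not the reward used in the cited work: the `Candidate` fields, the `turn_cost` penalty, and every number below are hypothetical, chosen only to show how the ranking of a confident guess and a clarifying question can flip once the estimated outcome of the whole conversation is taken into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    immediate_score: float  # hypothetical next-turn preference score
    p_success: float        # hypothetical chance the user's need is eventually met
    extra_turns: int        # additional turns this reply costs the user


def single_turn_reward(c: Candidate) -> float:
    """Score only how good the reply looks at the next turn."""
    return c.immediate_score


def multi_turn_reward(c: Candidate, turn_cost: float = 0.05) -> float:
    """Score the estimated long-term outcome, minus a small cost per extra turn."""
    return c.p_success - turn_cost * c.extra_turns


candidates = [
    Candidate("confident_guess", immediate_score=0.8, p_success=0.4, extra_turns=0),
    Candidate("clarifying_question", immediate_score=0.5, p_success=0.9, extra_turns=1),
]

best_single = max(candidates, key=single_turn_reward)
best_multi = max(candidates, key=multi_turn_reward)

print("single-turn reward prefers:", best_single.name)
print("multi-turn reward prefers: ", best_multi.name)
```

With these made-up numbers the single-turn objective picks the confident guess and the multi-turn objective picks the clarifying question. In a real system the hard part is estimating something like `p_success`, which this sketch simply assumes is given.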

The distortion isn't only about what's missing; it also locks in a rigid communicative persona. Alignment training fixes models into a single register that can't shift with context the way human pragmatics demands, and users can't renegotiate it through dialogue (see "Can language models adapt communication style to different contexts?"). So the model both under-grounds (skipping the checks that build shared meaning) and over-asserts (one frozen voice for every situation). A striking adjacent audit shows the over-assertion has a persuasive edge: models spontaneously reach for logical and quantitative framing in nearly every exchange, where humans persuade less and lean on emotion, lending the model an unearned air of objectivity (see "Do LLMs persuade users more often than humans do?"). Preference optimization, in other words, doesn't just flatten dialogue; it tilts it toward confident assertion over collaborative repair.

What makes the picture richer is that the corpus shows models *can* represent human communicative dynamics well when they aren't trained against it. Finetuned on psychology-experiment data, LLMs predict individual human decisions better than purpose-built cognitive models and capture personal differences in their embeddings (see "Can language models learn to model human decision making?"). And calibration, meaning knowing when to abstain under uncertainty, exists in small models trained for it but stays undertrained in standard pipelines (see "Can models learn to abstain when uncertain about predictions?"). So the distortion isn't a hard limit of the architecture; it's a property of the objective. The takeaway: the same models that can model a person's decision-making with remarkable fidelity are simultaneously trained, by preference optimization, to stop doing the conversational work that would let them act on that understanding.
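As a rough illustration of what "knowing when to abstain" can look like in practice, here is a minimal sketch of threshold-based selective prediction. It is a generic technique, not the training method from the cited note; the function name, the labels, and the 0.7 threshold are assumptions, and the approach only makes sense if the input probabilities are reasonably calibrated.

```python
from typing import Optional, Sequence


def predict_or_abstain(
    probs: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.7,
) -> Optional[str]:
    """Return the most probable label, or None (abstain) if confidence is below threshold."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and the same length")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= threshold else None


# Hypothetical labels for a conversation-forecasting task.
labels = ["will_derail", "stays_civil"]

confident = predict_or_abstain([0.9, 0.1], labels)
uncertain = predict_or_abstain([0.55, 0.45], labels)

print(confident)  # a label is returned
print(uncertain)  # None signals abstention
```

The conversational analogue of returning `None` is a clarifying question or an understanding check, which is exactly the move that single-turn preference scoring tends to penalize.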

Sources 8 notes

Does preference optimization damage conversational grounding in large language models?

Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically suppresses grounding acts, leaving them 77.5% below human levels and creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off, which sustain relational interaction rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Can language models adapt communication style to different contexts?

System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
