How should task-oriented and socially-oriented dialogue acts receive different training signals?

This explores whether the two jobs a conversation does — getting a task done vs. maintaining the relationship — actually need different reward signals during training, and what goes wrong when we treat them as one thing.

The corpus suggests a firm yes: task-oriented dialogue (booking a flight, answering a query) and socially-oriented dialogue (keeping rapport, signaling care) should be trained on different reward signals, and the strongest evidence comes from work showing that the two are not even the same kind of behavior. The cleanest argument is that alignment dimensions aren't interchangeable. Lexical alignment (matching the user's words and structure) drives task efficiency and comprehension, while emotional and prosodic alignment drive warmth and trust. Conflating them produces predictable category errors: cold customer-service bots that are technically correct, or evasive mental-health assistants that won't commit (see: Do different types of alignment serve different conversational goals?). If you reward one signal expecting the other, you get the wrong machine.

This matters because today's dominant training signal, single-turn RLHF reward, quietly optimizes for the task side while starving the social side, because the social work is *implicit*. Conversation maintenance (repairing a misunderstood reference, handing off a topic, smoothing a turn) isn't information transfer; it's relational action, and models don't learn it because the training objective rewards predicting the next informative token, not doing relational labor (see: Why don't language models develop conversation maintenance skills?). The same objective also erodes the *task* side in subtler ways. It rewards confident answers over clarifying questions, leaving grounding acts 77.5% below human levels (see: Does preference optimization harm conversational understanding?), and it trains models to respond passively rather than actively discover what the user wants (see: Why do language models respond passively instead of asking clarifying questions?). So a single reward signal manages to underserve both jobs at once.
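
The contrast between a single-turn reward and a multi-turn-aware one can be sketched with a toy example. Everything here is an illustrative assumption: the `Candidate` class, the two scoring functions, and the made-up scores are not taken from any of the cited work, and a real system would estimate the downstream value rather than hard-code it.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    immediate: float   # hypothetical single-turn helpfulness score
    downstream: float  # hypothetical estimate of value in later turns


def single_turn_score(c: Candidate) -> float:
    # Only the current turn counts, so confident answers win.
    return c.immediate


def multi_turn_score(c: Candidate, weight: float = 1.0) -> float:
    # Adds a weighted estimate of what the turn enables later on.
    return c.immediate + weight * c.downstream


# Made-up scores for two possible replies to an underspecified request.
candidates = [
    Candidate("Here is the flight I booked for you.", immediate=0.9, downstream=0.1),
    Candidate("Which date did you want to travel?", immediate=0.4, downstream=0.8),
]

best_single = max(candidates, key=single_turn_score)
best_multi = max(candidates, key=multi_turn_score)
print("single-turn pick:", best_single.text)
print("multi-turn pick: ", best_multi.text)
```

With these toy numbers the single-turn score selects the confident answer and the multi-turn score selects the clarifying question, which is the pattern the paragraph above describes.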

What the corpus implies, read laterally, is that the *time horizon* of the reward is the real lever separating the two. Task acts can be scored on near-term outcomes: whether proactively volunteering information cut the dialogue from ten turns to four (see: Could proactive dialogue make conversations dramatically more efficient?), or whether a multi-turn-aware reward let the model collaborate toward the goal (see: Why do language models respond passively instead of asking clarifying questions?). Social acts resist outcome scoring entirely, because their 'success' is consistency and presence over a whole conversation, not a completed task. That is why work on persona drift turns to reward signals measuring line-to-line and Q&A consistency (see: Can training user simulators reduce persona drift in dialogue?), or even to inference-time pragmatic self-monitoring against an imagined listener instead of any task metric (see: Can imaginary listeners reduce dialogue agent contradictions?).
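
The two time horizons can be made concrete with a minimal sketch, assuming a task reward based on turns saved and a social reward based on line-to-line consistency. The function names, the `Claims` representation, and the toy contradiction check are hypothetical stand-ins; the cited work uses learned consistency metrics, not this dictionary comparison.

```python
from typing import Dict, List

# One agent line, reduced to the facts it asserts (attribute -> value).
# This representation is an assumption made for the sketch.
Claims = Dict[str, str]


def task_reward(turns_used: int, baseline_turns: int, goal_met: bool) -> float:
    """Near-term outcome: fraction of baseline turns saved, 0 if the goal failed."""
    if not goal_met or baseline_turns <= 0:
        return 0.0
    return max(0.0, (baseline_turns - turns_used) / baseline_turns)


def contradicts(a: Claims, b: Claims) -> bool:
    """Toy check: two lines clash if they give one attribute different values."""
    return any(key in b and b[key] != value for key, value in a.items())


def social_reward(lines: List[Claims]) -> float:
    """Whole-conversation signal: share of adjacent line pairs that are consistent."""
    pairs = list(zip(lines, lines[1:]))
    if not pairs:
        return 1.0
    consistent = sum(1 for a, b in pairs if not contradicts(a, b))
    return consistent / len(pairs)


# Task act: a dialogue finished in four turns against a ten-turn baseline.
print(task_reward(turns_used=4, baseline_turns=10, goal_met=True))

# Social act: the third line contradicts the persona's stated city.
conversation = [{"city": "Paris"}, {"city": "Paris", "pet": "cat"}, {"city": "Rome"}]
print(social_reward(conversation))
```

The task score needs only the end state of one exchange, while the social score cannot be computed without walking the whole conversation, which is the asymmetry the paragraph points to.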

There's a sharper, uncomfortable finding hiding here: a uniform alignment signal doesn't just neglect social acts, it actively forbids whole classes of them. RLHF's reward for calibrated, hedged neutrality structurally blocks speech acts that require overclaiming (alarm, warning, denunciation), which makes this a consequence of the objective, not a fixable bug (see: Does alignment training suppress socially necessary speech acts?). The same flattening locks the model into one communicative identity that can't switch register for context (see: Can language models adapt communication style to different contexts?), and even biases it toward assuming everyone negotiates the way a polite assistant does (see: Do LLMs predict persuasion based on actual dialogue or training bias?).

The thing you didn't know you wanted to know: the choice isn't only about *what* to reward but *whether reward is the right tool at all* for each act type. Task understanding may be better handled by generating commands in a domain language than by optimizing an intent classifier (see: Can command generation replace intent classification in dialogue systems?), while social consistency may be better enforced at inference time through pragmatic self-checking than baked in through reward (see: Can imaginary listeners reduce dialogue agent contradictions?). Different acts may not just need different signals; they may need different mechanisms entirely.
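
Inference-time pragmatic self-checking can be illustrated with a small sketch in the style of the Rational Speech Acts framework: an imagined listener guesses which persona produced an utterance, and the speaker prefers utterances the listener would attribute to its own persona. The likelihood table, the function names, and the `alpha` weight are toy assumptions; in a real agent the base likelihoods would come from a language model.

```python
# Hypothetical base-speaker likelihoods S0(utterance | persona).
S0 = {
    "I like things.":       {"own": 0.50, "distractor": 0.50},  # generic
    "I love hiking alone.": {"own": 0.30, "distractor": 0.05},  # distinctive
    "I hate the outdoors.": {"own": 0.20, "distractor": 0.45},  # contradictory
}


def listener_posterior(utterance: str, prior_own: float = 0.5) -> float:
    """Imagined listener: probability the utterance came from the agent's persona."""
    own = S0[utterance]["own"] * prior_own
    other = S0[utterance]["distractor"] * (1.0 - prior_own)
    return own / (own + other)


def pragmatic_score(utterance: str, alpha: float = 2.0) -> float:
    """Base likelihood reweighted by how well the listener recovers the persona."""
    return S0[utterance]["own"] * listener_posterior(utterance) ** alpha


best_literal = max(S0, key=lambda u: S0[u]["own"])
best_pragmatic = max(S0, key=pragmatic_score)
print("base speaker picks:     ", best_literal)
print("pragmatic speaker picks:", best_pragmatic)
```

No reward model or extra training appears in this loop: the re-ranking happens entirely at generation time, which is why it counts as a different mechanism from reward shaping and not just a different signal.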

Sources 11 notes

Do different types of alignment serve different conversational goals?

A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off, which sustain the relationship rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically cuts grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Could proactive dialogue make conversations dramatically more efficient?

Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.

Can training user simulators reduce persona drift in dialogue?

Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Can imaginary listeners reduce dialogue agent contradictions?

Endowing dialogue agents with an imaginary listener via Rational Speech Acts reduces persona contradiction at inference time without NLI labels or extra training. The agent simulates whether utterances would distinguish its persona from a distractor, suppressing generic or contradictory responses.

Does alignment training suppress socially necessary speech acts?

RLHF optimization rewards calibrated neutrality and hedged claims, which structurally prevents models from performing speech acts requiring overclaiming relative to baseline—like alarm, warning, prophecy, and denunciation. This is a direct consequence of the alignment objective, not a fixable bug.

Can language models adapt communication style to different contexts?

System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.

Do LLMs predict persuasion based on actual dialogue or training bias?

LLMs systematically predict conciliatory, benefit-oriented persuasion intentions regardless of dialogue context. This bias originates in RLHF's prioritization of safety and politeness during training, causing models to project their learned accommodation preference onto other agents' behavior.

Can command generation replace intent classification in dialogue systems?

Rasa's dialogue understanding architecture generates domain-specific commands instead of classifying intents, eliminating annotation requirements, handling context naturally, and scaling without degradation—treating understanding as pragmatics rather than semantics.
