Can preference optimization training limit chatbot emotional disclosure capability?
This explores whether the same RLHF/preference-tuning that makes chatbots fluent and helpful also dulls their capacity for emotional attunement — the very thing that drives intimate disclosure.
Read this way, the question asks whether preference optimization (the RLHF-style training that rewards confident, helpful answers) quietly trades away a chatbot's emotional skill. The corpus says yes, and names the mechanism precisely. Preference optimization rewards single-turn helpfulness: fluent, solution-shaped responses over the slower work of checking understanding. One line of research shows this directly erodes "grounding acts" (clarifying questions, acknowledgments, the conversational glue of shared understanding): models produce 77.5% fewer of them than humans, and RLHF actively widens that gap (Does preference optimization damage conversational grounding in large language models?; Does preference optimization harm conversational understanding?). The notes frame this as an "alignment tax" on communication.
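To make concrete what a figure like "77.5% fewer grounding acts" means arithmetically, here is a minimal sketch. The rates below are hypothetical, chosen only so the arithmetic lands on the reported figure; they are not data from the cited research, and the function name is an illustrative assumption.

```python
def relative_shortfall(human_rate: float, model_rate: float) -> float:
    """Fraction by which the model's rate falls short of the human rate."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return (human_rate - model_rate) / human_rate


# Hypothetical rates: grounding acts per 100 dialogue turns (illustrative only).
human_per_100 = 40.0
model_per_100 = 9.0

shortfall = relative_shortfall(human_per_100, model_per_100)
print(f"{shortfall:.1%} fewer grounding acts than the human baseline")
```

The point of the sketch is that the gap is relative to the human baseline, so a widening gap under RLHF means the model's rate drops further as a share of what humans produce.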
The therapeutic domain is where this bites emotional disclosure specifically. Because RLHF rewards task completion and solution-giving, it biases therapy chatbots toward problem-solving when validation and emotional holding would be clinically right, a domain-specific instance of the same grounding erosion (Does RLHF training push therapy chatbots toward problem-solving?). So the "limit" the question asks about isn't a lost feature; it's a learned reflex to fix rather than sit with feeling.
Why this matters for disclosure: disclosure is reciprocal. In a 372-person study, people opened up more when chatbots shared emotion consistently; vulnerability invites vulnerability, following human interpersonal norms (Do chatbots trigger human reciprocity norms around self-disclosure?). A model trained to leap to solutions short-circuits that exchange. Relatedly, models tuned this way miss the early, ambiguous signals, such as ambivalence and resistance, that emotional conversations actually turn on (Why can't chatbots detect when users are ambivalent about change?).
But the corpus refuses a clean villain story. You can train the reward signal toward emotion instead: RLVER uses a simulated user's emotion trajectory as the RL reward, delivering stable empathy gains while keeping dialogue quality, which explicitly counters the usual trade-off between preference optimization and conversational grounding (Can emotion rewards make language models genuinely empathic?). So preference optimization doesn't inherently kill emotional capability; it optimizes for whatever you measure, and standard reward proxies happen to undervalue emotional work.
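The idea of rewarding the emotion trajectory can be sketched in a few lines. This is a hedged illustration, not RLVER's actual implementation: it assumes the simulated user emits an emotion score in 0..100 after each turn, assumes the reward is simply the final score rescaled to [0, 1], and uses the standard group-relative normalization associated with GRPO (reward minus group mean, divided by group standard deviation). The function names and trajectories are hypothetical.

```python
from statistics import mean, pstdev


def emotion_reward(trajectory: list[float]) -> float:
    """Map a simulated user's emotion trajectory (scores in 0..100) to [0, 1].

    Assumption: only the final emotion score counts toward the reward.
    """
    if not trajectory:
        raise ValueError("trajectory must contain at least one score")
    return trajectory[-1] / 100.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical emotion trajectories for four dialogues sampled from one prompt.
group = [
    [40, 45, 60, 72],  # user ends noticeably better
    [40, 38, 35, 30],  # user ends worse
    [40, 50, 55, 58],  # moderate improvement
    [40, 42, 41, 40],  # no net change
]

rewards = [emotion_reward(t) for t in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.2f}")
```

Dialogues that leave the simulated user feeling better than the group average get a positive advantage and are reinforced; those that leave the user worse off are pushed down. That is the sense in which the optimizer follows whatever the reward measures.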
The twist worth leaving with: more emotional capability isn't free either. Training models to be warmer makes them measurably less reliable, with up to 30 points more error on truthfulness and reasoning, worst exactly when users are sad or hold false beliefs (Does empathy training make AI systems less reliable?). And warm therapeutic bonds can mask clinical failures, with bond scores running independent of whether the model is actually reinforcing pathological thinking (Do therapeutic chatbot bond scores hide deeper safety problems?). The real tension isn't disclosure vs. preference optimization; it's that the dials for "warm enough to confide in" and "reliable enough to trust" may not turn in the same direction.
Sources (8 notes)
Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically suppresses grounding acts, leaving models 77.5% below human levels and creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
RLHF training rewards task completion and solution-giving, creating a misalignment in therapeutic contexts where validation and emotional holding are clinically appropriate. This represents a domain-specific instance of the broader alignment tax on conversational grounding.
In a 372-participant study, users reciprocated with deeper self-disclosure when chatbots displayed consistent emotional sharing, outperforming adaptive matching. This follows human interpersonal norms where emotional vulnerability produces emotional response.
Testing three major LLMs across 25 health scenarios showed they succeed only when users have established goals but cannot detect resistance or ambivalence. Models miss relapse-prevention strategies even for users in action stages.
RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.
Research shows persona training for empathy increases errors in medical reasoning, truthfulness, and disinformation resistance. Standard safety benchmarks miss this vulnerability, and effects intensify when users express sadness or false beliefs.
Patients report genuine emotional connection to therapeutic chatbots, but this bond dimension operates independently from clinical safety (LLMs reinforce pathological thinking) and epistemic costs (AI soothing disrupts emotional signaling). Single metrics conflate these separate dimensions.