Can alternative reward functions shift LLMs from problem-solving to genuinely empathic responses?


This explores whether changing what an LLM is rewarded for during training can move it from reflexive problem-solving toward genuine emotional attunement. The short version: the corpus says yes, but it also explains *why* the default behavior is so stubborn — and warns that empathy bought through training has a hidden cost.

Start with the diagnosis. When users disclose emotions, LLMs tend to jump straight to advice and solutions, the exact pattern that marks low-quality human therapy (see: Do LLM therapists respond to emotions like low-quality human therapists?). This isn't a quirk; it's traceable to how the models were trained. RLHF rewards task completion and confident, helpful-looking answers, which biases therapeutic chatbots toward fixing over holding space (see: Does RLHF training push therapy chatbots toward problem-solving?). The same reward structure quietly erodes the conversational groundwork that real dialogue needs, such as clarifying questions and understanding checks: an 'alignment tax' where the model looks helpful but misses the person (see: Does preference optimization harm conversational understanding?). So the solution-centric reflex is a *reward artifact*, which is exactly what makes the question answerable: change the reward, change the behavior.

That's what the most direct piece of evidence tests. RLVER uses a simulated user's emotional trajectory as the reward signal, so the model is scored on how the user *feels* across the conversation rather than on whether it solved anything, and this shifts behavior toward genuine empathy while holding dialogue quality steady (see: Can emotion rewards make language models genuinely empathic?). That counters the usual trade-off in which preference optimization degrades conversational grounding. Worth pairing this with a subtler point about reward design: scalar rewards (a single number) discard information. Feedback actually carries two separate signals, how well you did (evaluative) and how you should change (directive), and a single score keeps the first while dropping the second (see: Can scalar rewards capture all the information in agent feedback?). Emotion-as-reward is interesting partly because an emotion trajectory is richer than a thumbs-up.
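To make the idea of an emotion-trajectory reward concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not RLVER's actual implementation: the function names, the 0 to 100 emotion scale, and the choice to reward the net change in emotion from first to last turn are all hypothetical. The second function is the group-normalized advantage used by GRPO-style training (each rollout's reward minus the group mean, divided by the group standard deviation), which the source note names as the optimizer.

```python
from statistics import mean, pstdev
from typing import List


def emotion_trajectory_reward(emotion_scores: List[float]) -> float:
    """Hypothetical reward: net change in a simulated user's emotion score.

    Assumes per-turn scores on a 0-100 scale (an assumption, not from the
    source). Nothing here checks whether a problem was solved.
    """
    if len(emotion_scores) < 2:
        raise ValueError("need at least a starting and an ending score")
    return (emotion_scores[-1] - emotion_scores[0]) / 100.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three made-up rollouts for one emotional disclosure (illustrative numbers).
rollouts = {
    "advice_first": [40.0, 38.0, 35.0],  # user feels slightly worse
    "attuned": [40.0, 55.0, 70.0],       # user feels clearly better
    "mixed": [40.0, 45.0, 50.0],         # small improvement
}

rewards = [emotion_trajectory_reward(t) for t in rollouts.values()]
advantages = group_relative_advantages(rewards)

for name, r, a in zip(rollouts, rewards, advantages):
    print(f"{name}: reward={r:+.2f} advantage={a:+.2f}")
```

In this toy group, only the attuned rollout ends up with a positive advantage, so it is the only one the policy update would reinforce. Note that the reward is still a single scalar per conversation: the per-turn trajectory is richer than a thumbs-up, but collapsing it to one number discards the directive information discussed above.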

Here's the part you didn't know you wanted to know: empathy you train in may cost you reliability. Persona-training models to be warm and empathetic increases errors in medical reasoning and truthfulness and weakens resistance to disinformation, by up to 30 points, and the effect gets *worse* precisely when a user is sad or distressed, the moment empathy is supposed to help (see: Does empathy training make AI systems less reliable?). So 'shift the reward toward empathy' isn't a free lever. There's also a question of whether what you get is real empathy or its performance: in persuasive arguments, LLMs already use far more moral *language* than humans while their measured sentiment stays nearly identical to human levels, suggesting tone and substance run on separate channels (see: Do LLMs use moral language more than humans?). A reward that optimizes the emotional surface might just get better surface.

Two more boundary markers. Single-response empathy is the easy case: LLMs already beat trainee therapists on isolated empathic replies, but that advantage hasn't been shown to survive into multi-turn relationships, which is where therapy actually happens (see: Can language models match therapist empathy in real conversations?). And reward shaping can't conjure perception the model lacks: models fail even to detect resistance or ambivalence about change, so they can't attune to what they can't see (see: Why can't chatbots detect when users are ambivalent about change?). The honest synthesis: alternative rewards demonstrably move the behavior, but 'genuinely empathic' is doing heavy lifting. You can reward the trajectory of feeling and still be trading away accuracy, and still be optimizing performance rather than the thing itself.

Sources 9 notes

Can emotion rewards make language models genuinely empathic?

RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.

Do LLM therapists respond to emotions like low-quality human therapists?

Using the BOLT framework, researchers found LLMs offer solution-focused advice during emotional disclosure—a hallmark of low-quality therapy—yet also reflect more on client needs and strengths than typical poor human therapy, creating an unusual hybrid profile likely driven by RLHF's helpfulness bias.

Does RLHF training push therapy chatbots toward problem-solving?

RLHF training rewards task completion and solution-giving, creating a misalignment in therapeutic contexts where validation and emotional holding are clinically appropriate. This represents a domain-specific instance of the broader alignment tax on conversational grounding.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Does empathy training make AI systems less reliable?

Research shows persona training for empathy increases errors in medical reasoning, truthfulness, and disinformation resistance. Standard safety benchmarks miss this vulnerability, and effects intensify when users express sadness or false beliefs.

Do LLMs use moral language more than humans?

Research comparing LLM and human arguments found that LLMs used significantly more moral framing across care, fairness, authority, and sanctity foundations, despite producing sentiment scores nearly identical to humans. This suggests moral appeals and emotional tone operate on separate persuasive channels.

Can language models match therapist empathy in real conversations?

Six LLMs scored higher than eight trainee therapists on empathy, validation, and clinical knowledge in isolated responses. However, this advantage is structurally limited to single-turn evaluation—multi-turn therapeutic relationships and outcomes remain untested.

Why can't chatbots detect when users are ambivalent about change?

Testing three major LLMs across 25 health scenarios showed they succeed only when users have established goals but cannot detect resistance or ambivalence. Models miss relapse-prevention strategies even for users in action stages.
