Can emotion rewards make language models genuinely empathic?
Explores whether grounding RL rewards in verifiable emotion change—rather than human preference—can shift models from solution-focused to authentically empathic dialogue while maintaining or improving quality.
RLVER (Reinforcement Learning with Verifiable Emotion Rewards) introduces a fundamentally different RL signal for dialogue: rather than human preference ratings (which optimize for accommodation), the reward is a transparent emotion score in [0, 1] from a Sentient Agent simulator. Each score change is deterministically derived through multi-hop reasoning grounded in the user's persona, dialogue history, conversational context, and goals.
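To make the reward mechanics concrete, here is a minimal Python sketch of how a verifiable emotion reward could drive a dialogue RL rollout. The `simulator` and `policy` interfaces are illustrative assumptions, not the paper's actual API; rewarding the final emotion score is just one simple choice, and per-turn emotion deltas would fit the same loop.

```python
# Hypothetical sketch: an episode reward derived from a simulated user's
# emotion score rather than from human preference ratings.

def rollout_with_emotion_reward(policy, simulator, max_turns=8):
    """Run one simulated dialogue and return (trajectory, scalar reward)."""
    dialogue, trajectory = [], []
    emotion = simulator.initial_emotion()      # assumed to start in [0, 1]

    for _ in range(max_turns):
        user_msg = simulator.next_user_message(dialogue)
        dialogue.append(("user", user_msg))

        reply = policy.generate(dialogue)      # the model being trained
        dialogue.append(("assistant", reply))

        # The simulator deterministically re-derives its emotion score from
        # persona, dialogue history, context, and goals after each reply.
        emotion = simulator.update_emotion(dialogue)
        trajectory.append((len(dialogue), emotion))

    # Simple choice: the whole episode is rewarded with the final emotion score.
    return trajectory, emotion
```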
The SAGE framework that generates these rewards instantiates each simulated user with four factors: a detailed persona, a dialogue background, an explicit conversation goal, and a hidden intention. At each turn, the simulated agent (see the sketch after this list):
- Simulates emotional change, assessing how the response made it feel and generating interpretable "inner thoughts" that justify the shift
- Generates a coherent reply based on its new emotional state, persona, and conversational goals
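A minimal sketch of what such a SAGE-style simulated user could look like, assuming hypothetical field names and LLM helper calls (`reason_about_emotion`, `generate_reply`) that are not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class SentientUser:
    # The four factors that instantiate a simulated user (names are illustrative).
    persona: str            # detailed persona description
    background: str         # dialogue background
    goal: str               # explicit conversation goal
    hidden_intention: str   # intention never revealed to the trained policy
    emotion: float = 0.5    # current emotion score in [0, 1]

    def step(self, dialogue, llm):
        """One simulated-user turn: update emotion, then reply."""
        # 1) Simulate emotional change, producing interpretable inner thoughts.
        thoughts, new_emotion = llm.reason_about_emotion(
            persona=self.persona,
            background=self.background,
            goal=self.goal,
            hidden_intention=self.hidden_intention,
            dialogue=dialogue,
        )
        self.emotion = new_emotion

        # 2) Generate a coherent reply conditioned on the new emotional state.
        reply = llm.generate_reply(
            persona=self.persona,
            goal=self.goal,
            emotion=self.emotion,
            inner_thoughts=thoughts,
            dialogue=dialogue,
        )
        return thoughts, self.emotion, reply
```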
Key findings:
- GRPO consistently delivers stable, balanced empathy improvements across capabilities (its group-relative advantage is sketched after this list)
- PPO can occasionally push the upper bounds of specific capabilities but is less stable
- The framework shifts model behavior from solution-centric to genuinely empathic in social-cognition space
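Because GRPO's stability is the headline result, a short sketch of its group-relative advantage may help: episode rewards (for instance, final emotion scores) from several rollouts against the same simulated user are normalized within the group, so no learned value network is required, unlike PPO. The numbers below are made up for illustration.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantage: standardize rewards within one rollout group."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four hypothetical rollouts against the same persona, rewarded by final emotion score.
print(grpo_advantages([0.35, 0.60, 0.80, 0.45]))  # higher-emotion rollouts get positive advantage
```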
This is a direct counter-case to "Does preference optimization damage conversational grounding in large language models?": RL can improve dialogue quality when the reward tracks verifiable emotion change rather than human preference. The difference: preference optimization rewards accommodation (what users rate positively), while emotion rewards track the genuine emotional trajectory (what actually moves the conversation forward emotionally).
The connection to reasoning RL is structural: just as in "Does the choice of RL algorithm actually matter for reasoning?", where algorithm choice matters less than the model's prior, GRPO's stability advantage here suggests the prior matters more than the algorithm for empathy training too.
Source: Psychology Empathy
Related concepts in this collection
- Does preference optimization damage conversational grounding in large language models?
Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
counter-case: RL with emotion rewards improves dialogue quality
- Does the choice of RL algorithm actually matter for reasoning?
Expert Iteration, PPO, and Return-Conditioned RL show similar performance on reasoning tasks. The question is whether algorithm differences are fundamentally irrelevant, or whether something deeper explains the convergence.
GRPO stability suggests prior-bounded ceiling may apply to empathy RL
- Does binary reward training hurt model calibration?
Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
RLVER's verifiable emotion score is a continuous, grounded reward that sidesteps the degradation binary rewards can cause
- Can meta-learning prevent dialogue policies from collapsing?
Hierarchical RL for structured dialogue phases risks converging on a single action across diverse users. Does meta-learning like MAML preserve policy flexibility and adaptability to different user types?
HRL for MI dialogue uses blunt graduated bonuses (+50 to +200 per phase); RLVER's emotion-grounded rewards could replace these with verifiable signals that track whether the patient's emotional state actually shifted during the evoking and planning phases, giving the sub-policies a finer-grained, causally meaningful reward (a toy comparison is sketched below)
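A toy comparison of the two reward schemes, with phase names, bonus values, and scaling that are illustrative assumptions rather than figures from either note:

```python
# Blunt graduated bonus: fires when a phase is marked complete, regardless of
# whether the patient's emotional state actually moved.
PHASE_BONUS = {"engaging": 50, "focusing": 100, "evoking": 150, "planning": 200}

def phase_bonus_reward(phase_completed):
    return PHASE_BONUS.get(phase_completed, 0)

# Emotion-grounded alternative: reward only the verifiable emotional shift
# measured by a simulated (or modeled) patient during that phase.
def emotion_grounded_reward(emotion_before, emotion_after, scale=200.0):
    return scale * (emotion_after - emotion_before)
```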
Original note title
Verifiable emotion rewards shift LLM behavior from solution-centric to genuinely empathic styles in social-cognition space