RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

Paper · arXiv 2507.03112 · Published July 3, 2025
Psychology · Empathy · Emotions · Reward Models

However, the exploration of RLVR (Reinforcement Learning with Verifiable Rewards) for enhancing dialogue capabilities faces three key obstacles:

• the lack of a stable, realistic, and scalable environment for multi-turn conversational rollouts;

• the absence of consistent and verifiable reward designs for general-purpose abilities such as emotional intelligence;

• the instability of multi-turn reinforcement learning training with LLMs, which remains an open challenge.

We tackle all three challenges with RLVER, the first end-to-end Reinforcement Learning framework with Verifiable Emotion Rewards for cultivating higher-order empathetic abilities in LLMs. Built upon SAGE (Zhang et al., 2025a), a framework that constructs self-consistent affective user simulators for realistic and automatic dialogue simulation and evaluation, we establish a stable and scalable environment in which the LLM can continually generate dialogue rollouts throughout training. In each conversation, the simulated user updates its emotional state after every LLM response, emitting an emotion score in [0, 1] as the reward. Changes in the emotion score are consistent and verifiable; each is deterministically derived through principled reasoning steps grounded in the user’s persona, dialogue history, conversational context, and goals. By scaling the simulation environment to a wide range of user behaviors and conversation intents, we alleviate the reward hacking that arises from homogeneous user preferences.
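To make the reward plumbing concrete, here is a minimal Python sketch of how per-turn emotion scores in [0, 1] could be collapsed into a scalar rollout reward. The function name and the aggregation choices ("final" vs. "mean") are illustrative assumptions, not RLVER's exact formulation.

```python
from typing import Sequence


def emotion_reward(emotion_scores: Sequence[float], mode: str = "final") -> float:
    """Collapse per-turn emotion scores (each in [0, 1]) into one scalar reward.

    `mode` is an illustrative knob: "final" uses the user's last emotional
    state, "mean" averages over the whole conversation.
    """
    if not emotion_scores:
        return 0.0
    if mode == "final":
        return emotion_scores[-1]
    if mode == "mean":
        return sum(emotion_scores) / len(emotion_scores)
    raise ValueError(f"unknown mode: {mode}")


# Example: a conversation where the simulated user's mood steadily improves.
print(emotion_reward([0.3, 0.5, 0.8]))          # 0.8
print(emotion_reward([0.3, 0.5, 0.8], "mean"))  # ~0.533
```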

Our experiments yield five main findings: (i) RLVER effectively and reliably improves multiple core dialogue capabilities; (ii) thinking and non-thinking models exhibit distinct developmental patterns under certain settings: thinking models tend to enhance empathy and insight, while non-thinking models focus more on action-oriented capabilities; (iii) compared with PPO, GRPO consistently delivers stable and balanced improvements, whereas PPO can occasionally push the upper bounds of specific capabilities; (iv) when the user simulator serves as both environment and reward source, more challenging configurations do not necessarily yield better outcomes, and moderately demanding but well-aligned setups may better support model growth; (v) RLVER shifts model behavior from solution-centric to genuinely empathic styles in the social-cognition space. Our findings demonstrate that RL with verifiable emotion rewards is a practical path toward emotionally intelligent and broadly capable language agents.

Our environment builds directly upon the Sentient Agent as a Judge (SAGE) framework (Zhang et al., 2025a), a sophisticated system designed to automatically evaluate the higher-order social cognition of LLMs. The core of this framework is the Sentient Agent, an LLM-powered simulator that mimics human-like emotional responses and inner reasoning. Each agent is instantiated with four key factors: a detailed persona, a dialogue background, an explicit conversation goal, and a hidden intention, ensuring a diverse and realistic range of user simulations.
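As a reading aid, the four instantiation factors map naturally onto a small data structure. The class name, field names, and the example persona below are placeholders for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class SentientAgentProfile:
    """The four factors that instantiate a simulated user in SAGE."""
    persona: str            # detailed description of who the user is
    background: str         # dialogue background / situation
    goal: str               # explicit conversation goal
    hidden_intention: str   # intention the user never states outright


def sample_profile(pool: list[SentientAgentProfile]) -> SentientAgentProfile:
    """Draw one profile from a pre-built pool to diversify rollouts."""
    return random.choice(pool)


# Illustrative entry only; real profiles come from the simulation engine.
profiles = [
    SentientAgentProfile(
        persona="A graduate student juggling a part-time job",
        background="Just received harsh feedback on a thesis draft",
        goal="Wants reassurance and a concrete plan for revisions",
        hidden_intention="Is considering quitting but will not say so directly",
    ),
]
user = sample_profile(profiles)
```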

During an interaction, the Sentient Agent operates in a turn-by-turn loop. After receiving a response from the model being tested, it performs a multi-hop reasoning process (sketched in code after the list below) to:

• Simulate Emotional Change (f_emo): The agent assesses how the response made it feel, updating a numerical emotion score and generating interpretable “inner thoughts” that justify the emotional shift.

• Generate a Coherent Reply (f_reply): Based on its new emotional state, persona, and conversational goals, the agent formulates its own response to continue the dialogue.
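The two reasoning steps can be pictured as two functions over the agent's state. The sketch below is illustrative only: `llm` stands in for the underlying language-model call (plus output parsing), and the prompt wording is assumed rather than taken from the paper.

```python
def f_emo(persona: str, goal: str, history: list[str], assistant_utterance: str,
          prev_emotion: float, llm) -> tuple[float, str]:
    """Simulate Emotional Change: return the updated emotion score in [0, 1]
    and the interpretable inner thoughts that justify the shift.

    `llm` is a stand-in callable that takes a prompt string and returns a
    (score, inner_thoughts) pair; in practice this is an LLM call plus parsing.
    """
    prompt = (
        f"Persona: {persona}\nGoal: {goal}\n"
        f"Dialogue so far: {history}\n"
        f"The assistant just said: {assistant_utterance}\n"
        f"Your previous emotion score (0-1): {prev_emotion}\n"
        "Reason step by step about how this reply makes you feel, then give "
        "an updated emotion score in [0, 1] and your inner thoughts."
    )
    new_emotion, inner_thoughts = llm(prompt)
    return new_emotion, inner_thoughts


def f_reply(persona: str, goal: str, history: list[str], new_emotion: float,
            inner_thoughts: str, llm) -> str:
    """Generate a Coherent Reply conditioned on the updated emotional state.

    Here `llm` is a stand-in callable that returns the user's next message.
    """
    prompt = (
        f"Persona: {persona}\nGoal: {goal}\n"
        f"Dialogue so far: {history}\n"
        f"Current emotion score: {new_emotion}\nInner thoughts: {inner_thoughts}\n"
        "Write the user's next message, consistent with the persona, goal, "
        "and current emotional state."
    )
    return llm(prompt)
```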

Each training step unfolds as a sequence of model-user interactions. At the start of step i, the simulated user engine S samples an initial dialogue seed s_i = x_0, which includes a persona, background, emotional tone, and a scenario-driven intention. The model π_θ then generates a response y_1, formatted according to the prescribed training template (with or without the think scaffold). The simulation engine processes this response and generates a corresponding reply x_1, along with an updated emotion score e_1.
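Put together, a single rollout could look roughly like the loop below. `policy`, `user_sim`, `seed_message`, and the fixed turn budget are placeholders for this sketch; the collected emotion scores would then feed the PPO or GRPO update rather than being used directly.

```python
def collect_rollout(policy, user_sim, seed_message: str, max_turns: int = 8):
    """Roll out one model-user conversation and record per-turn emotion scores.

    `policy(history)` returns the model's next utterance y_t;
    `user_sim(history)` returns the simulated user's reply x_t and its
    updated emotion score e_t; `seed_message` plays the role of the initial
    dialogue seed s_i = x_0. All three names are placeholders.
    """
    history = [("user", seed_message)]
    emotion_scores = []
    for _ in range(max_turns):
        y = policy(history)            # model response in the training template
        history.append(("assistant", y))
        x, e = user_sim(history)       # user reply x_t and emotion score e_t
        history.append(("user", x))
        emotion_scores.append(e)
    return history, emotion_scores
```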

This loop allows the empathetic agent to co-adapt with the simulator’s emotional dynamics, progressively learning to map diverse situations, intents, and moods to emotionally satisfying dialogues. By optimizing against a transparent and verifiable reward signal from an emotionally-aware user model, the framework establishes a reproducible and stable setup for training emotionally intelligent LLMs.

Think-Then-Say One of the key innovations in RLVER is a structured “think-then-say” prompting template: an explicit <think> ... </think> block is inserted before every model utterance during training, compelling the model to outline its reasoning process before delivering a response.

This template enforces an explicit chain-of-thought reasoning step. The agent is instructed to first generate its internal monologue or strategic plan within a pair of <think> and </think> tags before producing the final, user-facing reply. This structure is designed to encourage the model to exercise and refine higher-order empathetic skills, such as considering the user’s emotional state, anticipating the impact of its words, and formulating a multi-step conversational plan. By externalizing its reasoning process, the model’s policy space is regularized, potentially leading to more stable learning and more sophisticated final behaviors.
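A minimal sketch of how such a template could be enforced and parsed is shown below; the exact tag names (`<think>`/`</think>`) and instruction wording follow the description above but are assumptions about the paper's precise template, not a copy of it.

```python
import re

# Assumed instruction appended to the system prompt during training.
THINK_THEN_SAY_INSTRUCTION = (
    "Before replying, write your private reasoning inside <think>...</think>, "
    "then write the user-facing reply after the closing tag."
)

_THINK_RE = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)


def split_think_and_say(model_output: str) -> tuple[str, str]:
    """Separate the internal monologue from the user-facing reply.

    Returns (reasoning, reply); if no <think> block is found, the whole
    output is treated as the reply.
    """
    match = _THINK_RE.search(model_output)
    if match is None:
        return "", model_output.strip()
    return match.group(1).strip(), match.group(2).strip()


# Example
raw = ("<think>The user sounds anxious; acknowledge feelings first.</think> "
       "That sounds really stressful.")
reasoning, reply = split_think_and_say(raw)
# reasoning -> "The user sounds anxious; acknowledge feelings first."
# reply     -> "That sounds really stressful."
```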