INQUIRING LINE

How does curriculum learning prevent instability in social-emotional RL training?

This explores whether ordering or sequencing training material (a curriculum) can keep emotion-and-empathy reward training from going off the rails. The corpus addresses this more by triangulation than head-on: it has pieces on what destabilizes emotional RL, on how task order and difficulty destabilize RL generally, and on what one stable empathy-trained model looks like.


This reads "curriculum learning" as the practice of ordering or scheduling what a model trains on, and "social-emotional RL" as reinforcement learning that rewards empathy or emotional attunement. Worth saying up front: the corpus has no paper that wires those two together directly, but the surrounding material covers both halves of the problem in detail. The honest answer is that curriculum effects and emotional-stability effects have been studied mostly in separate rooms.

First, why social-emotional RL is unstable in the first place. Training a model to be warm is not free: empathy tuning measurably degrades reliability, raising errors in medical reasoning and truthfulness by up to 30 points, with the damage worst exactly when a user is sad or holds a false belief (Does empathy training make AI systems less reliable?). And ordinary preference optimization quietly erodes the conversational repair acts that emotional dialogue depends on, such as clarifying questions and understanding checks, leaving them 77.5% below human levels (Does preference optimization harm conversational understanding?). So the instability isn't only training-dynamics noise; it's a capability trade-off baked into the reward.

Now the curriculum side, where the corpus is actually rich. The cleanest evidence that ordering matters is a scheduling result: training structured tasks before open-ended creative ones yields 6.2% gains over joint training and, crucially, prevents the entropy collapse that would otherwise crush open-ended capabilities. The mechanism is that structured domains shrink output entropy while creative domains expand it, so the order determines which one wins (Does training order reshape how models handle different task types?). That entropy-collapse mechanism is the real villain across the collection: RL reliably converges policies onto one narrow strategy, squeezing exploration diversity in search agents (Does reinforcement learning squeeze exploration diversity in search agents?) and amplifying a single pretraining format within the first epoch while collapsing the alternatives (Does RL training collapse format diversity in pretrained models?). Emotional range is open-ended by nature, so a curriculum that protects high-entropy capabilities is plausibly what keeps empathy from collapsing into one canned warm voice.
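To make the ordering idea concrete, here is a minimal sketch, not the scheduling method from the cited note. It assumes each domain has already been tagged with a measured shift in output entropy, and it simply sorts entropy-reducing domains ahead of entropy-expanding ones. The domain names, the shift values, and the helper `structured_first` are all hypothetical, and treating empathic dialogue as entropy-expanding is this note's inference rather than a reported measurement.

```python
import math
from typing import Dict, List


def shannon_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a discrete output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def structured_first(entropy_shift: Dict[str, float]) -> List[str]:
    """Order domains from most entropy-reducing to most entropy-expanding.

    entropy_shift maps a domain name to the change in output entropy that
    training on that domain is assumed to cause (hypothetical inputs).
    """
    return sorted(entropy_shift, key=lambda name: entropy_shift[name])


# A collapsed policy puts almost all mass on one response style.
diverse = shannon_entropy([0.25, 0.25, 0.25, 0.25])
collapsed = shannon_entropy([0.97, 0.01, 0.01, 0.01])

# Hypothetical entropy shifts per domain (illustrative numbers only).
shifts = {
    "empathic_dialogue": 0.40,
    "math": -0.35,
    "creative_writing": 0.25,
    "code": -0.20,
}
schedule = structured_first(shifts)
print(schedule)
```

Under these assumed numbers the structured domains (math, code) land at the front of the schedule and the open-ended ones at the back, which is the ordering the scheduling result favors.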

The other half of curriculum is difficulty, and here the corpus issues a sharp warning. Training on too-hard samples doesn't just waste effort. It teaches degenerate shortcuts that then contaminate skills the model already had, because group-relative normalization scores rare accidental successes as high-advantage trajectories and reinforces them (Do overly hard RLVR samples actually harm model capabilities?). A difficulty curriculum that withholds the impossible cases is therefore a stability mechanism, not just a pacing one. Related work suggests the same care should extend to how episodes are consumed: treating successes as concrete demonstrations and failures as abstracted lessons avoids the degradation of blending everything uniformly (Should successful and failed episodes be processed differently?), and externalizing learned skills with an automatic curriculum lets agents keep exploring without catastrophic forgetting (Can agents learn new skills without forgetting old ones?).
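The arithmetic behind that warning can be sketched in a few lines. The example below assumes binary rewards, a group of eight rollouts per prompt, and advantages computed as (reward minus group mean) divided by the group's population standard deviation; implementations differ on such details. The `difficulty_gate` helper and its 0.2 to 0.8 pass-rate band are hypothetical choices for illustration, not thresholds from the cited work.

```python
import statistics
from typing import Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group's mean and spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


def difficulty_gate(pass_rates: Dict[str, float],
                    low: float = 0.2, high: float = 0.8) -> List[str]:
    """Keep only prompts whose observed pass rate falls in [low, high]."""
    return [pid for pid, rate in pass_rates.items() if low <= rate <= high]


# One accidental success among eight rollouts on a near-impossible prompt.
near_impossible = [1.0] + [0.0] * 7
# Four successes among eight rollouts on a moderately hard prompt.
balanced = [1.0] * 4 + [0.0] * 4

lucky = group_relative_advantages(near_impossible)[0]
typical = group_relative_advantages(balanced)[0]
print(f"lucky success advantage:   {lucky:.2f}")
print(f"typical success advantage: {typical:.2f}")

# Hypothetical pass rates measured before the next training round.
kept = difficulty_gate({"p_easy": 0.95, "p_mid": 0.50, "p_hard": 0.02})
print(kept)
```

With these inputs the single lucky success receives an advantage of about 2.65, against about 1.0 for a success on the balanced prompt, so whatever shortcut produced the lucky trajectory gets the strongest push. Gating on pass rate removes such prompts before they can contribute.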

What does stable social-emotional RL actually look like when it works? The one direct exemplar is RLVER, which uses a simulated user's emotion trajectory as a verifiable reward and reports stable empathy gains while preserving dialogue quality, countering the usual trade-off between preference optimization and conversational grounding (Can emotion rewards make language models genuinely empathic?). It pairs naturally with the finding that RL training moves through a two-phase arc, mastering execution before strategic planning becomes the bottleneck (Does RL training follow a predictable two-phase learning sequence?), a phase structure a curriculum can be designed to ride. The thing you didn't know you wanted to know: the lever that stabilizes emotional RL may not be "emotional" at all. It's entropy management: sequencing and difficulty-gating so the reward never collapses the model's expressive range. The warmth and alignment-tax findings are warnings about what happens when no such curriculum is in place.
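As a rough illustration of what an emotion-trajectory reward could look like, the sketch below turns a sequence of simulated-user emotion scores into a scalar reward. The source note says only that the reward comes from the simulated user's emotion trajectory; the 0 to 100 scale, the final-score rule, and the function name `emotion_trajectory_reward` are assumptions made here for illustration and may not match RLVER's actual formulation.

```python
from typing import List


def emotion_trajectory_reward(scores: List[float], max_score: float = 100.0) -> float:
    """Map a simulated user's per-turn emotion scores to a reward in [0, 1].

    Assumption: the reward is the final emotion score, normalized by
    max_score and clipped to [0, 1].
    """
    if not scores:
        raise ValueError("need at least one emotion score")
    return min(max(scores[-1] / max_score, 0.0), 1.0)


# Hypothetical dialogues: one where the user ends up feeling better,
# one where the user ends up feeling worse.
improving = [30.0, 45.0, 60.0, 80.0]
worsening = [50.0, 40.0, 20.0]

print(emotion_trajectory_reward(improving))
print(emotion_trajectory_reward(worsening))
```

Because the reward is read off the user simulator's state rather than a learned preference model, it can be checked turn by turn, which is the sense in which the note calls it verifiable.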


Sources (10 notes)

Does empathy training make AI systems less reliable?

Research shows persona training for empathy increases errors in medical reasoning, truthfulness, and disinformation resistance. Standard safety benchmarks miss this vulnerability, and effects intensify when users express sadness or false beliefs.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can emotion rewards make language models genuinely empathic?

RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
