Psychology and Social Cognition · Reinforcement Learning for LLMs

Why does RL succeed more on some tasks than others?

Reinforcement learning shows wildly different improvement rates across conversational tasks—from near-total capability unlock to modest gains. What determines whether RL will transform performance or produce incremental progress?

Note · 2026-03-31 · sourced from Conversation Agents · How does reinforcement learning reshape what models can reason about?

The two papers (one training proactive critical thinking, the other persona consistency) both use RL to train conversational capabilities, but the improvement magnitudes diverge dramatically.

Three factors explain the gap:

1. Reward signal verifiability. Proactive critical thinking has a clear binary reward: did the model correctly identify the missing variable and ask for it? Yes or no. Persona consistency requires LLM-as-a-Judge evaluation of whether an utterance is consistent with a persona description, a softer and more ambiguous signal. As argued in "Does the choice of RL algorithm actually matter for reasoning?": when the reward signal is clear, the algorithm barely matters; when the reward is fuzzy, everything matters. The contrast is sketched in code after this list.

2. Baseline differences. Proactive critical thinking starts from near zero: the capability is almost completely suppressed in vanilla models. Persona consistency starts from a partially functional baseline, since models already maintain some consistency. Unlocking a suppressed capability (going from 0 to 1) is a qualitatively different training problem from improving an already expressed capability (going from 0.5 to 0.8).

3. Task complexity. Detecting a missing variable is a bounded problem with a finite answer space. Maintaining consistent personality across an open-ended conversation is unbounded — the space of possible persona-relevant responses is vast and context-dependent.
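
To make the reward-signal contrast from factor 1 concrete, here is a minimal sketch, assuming a hypothetical `judge` callable that wraps an LLM-as-a-Judge; the string checks stand in for the papers' actual verifiers and are purely illustrative:

```python
def binary_reward(response: str, missing_variable: str) -> float:
    """Verifiable reward: did the model ask about the missing variable?

    Returns exactly 0.0 or 1.0, so the RL algorithm sees an unambiguous signal.
    """
    asks_a_question = "?" in response
    names_the_variable = missing_variable.lower() in response.lower()
    return 1.0 if (asks_a_question and names_the_variable) else 0.0


def judge_reward(response: str, persona: str, judge) -> float:
    """Judgment-based reward: an LLM-as-a-Judge scores persona consistency.

    `judge` is a hypothetical callable around a judge model; its output is
    a soft score in [0, 1] that carries the judge's own noise and biases.
    """
    prompt = (
        f"Persona: {persona}\n"
        f"Utterance: {response}\n"
        "Rate the utterance's consistency with the persona "
        "from 0 to 1. Answer with a single number."
    )
    return float(judge(prompt))  # noisy, model-dependent, hard to reproduce
```

The first function can be checked by string matching alone; the second inherits every failure mode of the judge model, which is exactly where the ambiguity enters.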

This pattern generalizes across the vault.

The principle: RL improvement magnitude tracks reward signal verifiability. Binary verification → dramatic improvement. Judgment-based evaluation → modest improvement. The training method is the same. The reward signal determines the ceiling.
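
One way to see why a fuzzy signal caps the ceiling is to model judge error as noise on pairwise comparisons. The toy simulation below (all numbers are illustrative assumptions, not measurements from either paper) counts how often judge noise inverts the true ranking of two responses; each inversion pushes a preference-style policy update in the wrong direction:

```python
import random

def flipped_ranking_rate(noise_sd: float, quality_gap: float = 0.2,
                         trials: int = 10_000) -> float:
    """Fraction of comparisons where scoring noise inverts the true ranking.

    Two responses whose true quality differs by `quality_gap` are each
    scored with additive Gaussian judge error of std `noise_sd`.
    """
    flips = 0
    for _ in range(trials):
        better = 0.6 + random.gauss(0.0, noise_sd)
        worse = (0.6 - quality_gap) + random.gauss(0.0, noise_sd)
        flips += worse > better
    return flips / trials

# A verifiable binary check has essentially zero scoring noise: no flips.
print(flipped_ranking_rate(noise_sd=0.0))  # 0.0
# A judge with moderate noise mis-ranks roughly a third of the pairs.
print(flipped_ranking_rate(noise_sd=0.3))  # ~0.32
```

Under these assumptions, no choice of RL algorithm can recover the updates lost to mis-ranked pairs, which is one reading of why the reward signal, not the training method, sets the ceiling.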


Source: Conversation Agents, promoted from ops/tensions/

Original note title: RL succeeds dramatically on tasks with verifiable binary rewards but only modestly on tasks requiring judgment-based evaluation