Why does RL succeed more on some tasks than others?
Reinforcement learning shows wildly different improvement rates across conversational tasks—from near-total capability unlock to modest gains. What determines whether RL will transform performance or produce incremental progress?
Both papers (each linked under the related concepts below) use RL to train conversational capabilities, but the improvement magnitudes diverge dramatically:
- Proactive critical thinking: 0.15% → 73.98% — near-total capability unlock
- Persona consistency: 55% inconsistency reduction — significant but not transformative
Three factors explain the gap:
1. Reward signal verifiability. Proactive critical thinking has a clear binary reward: did the model correctly identify the missing variable and ask for it? Yes or no. Persona consistency requires LLM-as-a-Judge evaluation of whether an utterance is consistent with a persona description — a softer, more ambiguous signal (see the sketch after this list). As the related note "Does the choice of RL algorithm actually matter for reasoning?" concludes: when the reward signal is clear, the algorithm barely matters; when the reward is fuzzy, everything matters.
2. Baseline differences. Proactive critical thinking starts from near-zero — the capability is completely suppressed in vanilla models. Persona consistency starts from a partially functional baseline — models already maintain some consistency. Unlocking a suppressed capability (going from 0 to 1) is qualitatively different from improving an expressed capability (going from 0.5 to 0.8).
3. Task complexity. Detecting a missing variable is a bounded problem with a finite answer space. Maintaining consistent personality across an open-ended conversation is unbounded — the space of possible persona-relevant responses is vast and context-dependent.
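To make the contrast in factor 1 concrete, here is a minimal sketch of the two reward regimes. All names and heuristics are illustrative assumptions, not the reward functions from either paper.

```python
# Minimal sketch of the two reward regimes discussed above.
# Names and checks are invented for illustration, not taken from either paper.

def binary_reward(response: str, missing_variable: str) -> float:
    """Verifiable reward: did the model ask about the missing variable?
    The outcome is mechanically checkable -- yes or no."""
    asks_question = "?" in response
    names_variable = missing_variable.lower() in response.lower()
    return 1.0 if (asks_question and names_variable) else 0.0


def judge_reward(response: str, persona: str, judge) -> float:
    """Judgment-based reward: an LLM judge scores persona consistency.
    The signal is graded, noisy, and only as reliable as the judge."""
    prompt = (
        f"Persona: {persona}\n"
        f"Utterance: {response}\n"
        "Rate consistency from 0 (contradicts persona) to 1 (fully consistent)."
    )
    return float(judge(prompt))  # model-dependent score in [0, 1]
```

The first reward can be verified exactly, so optimization pressure points in a trustworthy direction; the second inherits the judge's noise and bias, which caps how hard RL can optimize against it.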
This pattern generalizes across the vault:
- RLVER emotional rewards work because emotion categories are partially verifiable — empathy shifts are measurable through linguistic markers
- Checklist-based rewards (RLCF) work because sub-criteria can be independently verified (sketched after this list)
- Binary reward RL degrades calibration because forcing binary judgment onto graded reality introduces systematic distortion
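The checklist fix can be sketched in a few lines: decompose a fuzzy quality judgment into weighted yes/no checks and score the fraction passed. The specific checks and weights below are invented for illustration and are not RLCF's actual criteria.

```python
# Sketch of checklist-style reward decomposition: a fuzzy judgment is
# replaced by independently verifiable yes/no sub-criteria.
from typing import Callable

Check = Callable[[str], bool]

def checklist_reward(response: str, checks: list[tuple[Check, float]]) -> float:
    """Weighted fraction of verifiable sub-criteria the response satisfies."""
    total = sum(weight for _, weight in checks)
    passed = sum(weight for check, weight in checks if check(response))
    return passed / total

# Example: decomposing "is this a good clarifying reply?" into binary checks.
checks = [
    (lambda r: r.strip().endswith("?"), 1.0),  # asks a question
    (lambda r: "budget" in r.lower(), 1.0),    # names the missing variable
    (lambda r: len(r.split()) < 60, 0.5),      # stays concise
]
print(checklist_reward("What budget range are you considering?", checks))  # 1.0
```

Each check is individually verifiable, so the aggregate reward behaves more like the binary case even though the underlying task is judgment-heavy.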
The principle: RL improvement magnitude tracks reward signal verifiability. Binary verification → dramatic improvement. Judgment-based evaluation → modest improvement. The training method is the same. The reward signal determines the ceiling.
Source: Conversation Agents, promoted from ops/tensions/
Related concepts in this collection
- Can models learn to ask clarifying questions instead of guessing? (the dramatic success case: 0.15% → 73.98%). Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
- Can training user simulators reduce persona drift in dialogue? (the modest success case: 55% reduction). Explores whether inverting typical RL setups—training the simulated user for consistency rather than the task agent—can measurably reduce persona drift and improve experimental reliability in dialogue research.
- Does the choice of RL algorithm actually matter for reasoning? (algorithm interchangeability when reward is clear). Expert Iteration, PPO, and Return-Conditioned RL show similar performance on reasoning tasks. The question is whether algorithm differences are fundamentally irrelevant, or whether something deeper explains the convergence.
- Can breaking down instructions into checklists enable better reinforcement learning? (decomposition into verifiable sub-criteria as a fix for the judgment-reward problem). Explores whether decomposing instruction quality into verifiable yes/no criteria allows RL systems to improve on tasks that lack clear correctness signals, like creative writing or social reasoning.
- Does binary reward training hurt model calibration? (binary forcing on graded tasks as a specific failure mode). Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
- Can machines learn what makes research worth doing? (RLCF introduces a third reward category, community-level feedback, beyond the binary/judgment dichotomy). Can AI systems trained on community citation patterns learn to recognize high-impact research directions the way human scientists do? The research explores whether 'scientific taste'—judgment about what to pursue—is learnable from collective community signals.
Original note title: RL succeeds dramatically on tasks with verifiable binary rewards but only modestly on tasks requiring judgment-based evaluation