Does outcome-based RL diversity loss spread across unsolved problems?
When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?
Outcome-based RL (rewarding only final answer correctness) produces substantial accuracy gains but systematically reduces generation diversity. This is known. What is new: the diversity loss transfers across problems. Concentrating probability mass on correct answers for solved problems propagates to unsolved problems — the model's entire output distribution narrows, not just its distribution on problems it can solve.
The transfer mechanism: RL sharpens the policy globally, not per-problem. When the model learns to concentrate on correct trajectories for problems it has solved, the reduced diversity in its generative distribution also manifests as reduced diversity on problems it has not solved. This means RL can reduce effective diversity even on the training set relative to the base model.
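One way to make the transfer claim measurable, as a minimal sketch: compare the entropy of the final-answer distribution on solved versus unsolved problem sets, for the base model and the RL-tuned model. The `model.sample` hook and the sample count `k` are assumptions standing in for whatever sampling harness you use.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (nats) of the empirical final-answer distribution.

    Lower entropy means the samples concentrate on fewer distinct
    answers, i.e. less outcome diversity."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def diversity_report(model, problems, k=16):
    """Mean per-problem answer entropy over k samples each.

    model.sample(problem) is a hypothetical hook returning one
    final answer string; swap in your own sampling code."""
    entropies = [
        answer_entropy([model.sample(p) for _ in range(k)]) for p in problems
    ]
    return sum(entropies) / len(entropies)

# The transfer claim predicts BOTH gaps are positive:
#   diversity_report(base, solved)   > diversity_report(rl, solved)    (expected)
#   diversity_report(base, unsolved) > diversity_report(rl, unsolved)  (the transfer)
```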
The practical consequence: diversity is critical for test-time scaling. As "Why does parallel reasoning outperform single chain thinking?" argues, diverse parallel samples are more valuable than many copies of similar reasoning. And as "Why does majority voting outperform more complex inference methods?" shows, voting requires genuine diversity to work: voting over near-identical samples provides no signal.
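To see why, consider a toy majority vote (illustrative numbers, not from the paper): with diverse samples the correct answer can win as a plurality even at 40% per-sample accuracy, while collapsed samples just echo a single draw, so extra votes add nothing.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

# Diverse sampler: wrong answers scatter, so the correct answer "42"
# wins as a plurality even though only 4 of 10 samples are correct.
diverse = ["42", "17", "42", "9", "56", "42", "13", "42", "8", "31"]
assert majority_vote(diverse) == "42"

# Collapsed sampler: near-identical samples repeat one wrong answer;
# 10 votes carry no more signal than 1.
collapsed = ["17"] * 10
assert majority_vote(collapsed) == "17"
```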
The key conceptual contribution is distinguishing two forms of exploration:
- Historical exploration — visiting diverse states and actions during training. Improves pass@1 (single-attempt accuracy) because the model encounters more training signal. Does not guarantee test-time diversity.
- Batch exploration — producing diverse outputs at test time. Improves pass@k (k-attempt coverage) because outputs span more of the solution space. Does not improve training diversity.
These require different mechanisms. Historical exploration uses UCB-style bonuses over outcome space, tractable because reasoning tasks have a limited set of distinct final answers. Batch exploration uses within-batch repetition penalties. The distinction directly instantiates "Why do reasoning models fail differently at training versus inference?": historical/batch exploration maps onto training-time/test-time, with a concrete algorithmic prescription for each, as sketched below.
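A minimal sketch of both mechanisms as reward-shaping terms, following the note's description rather than the paper's exact formulas; the count tables, bonus scale `c`, and linear penalty weight `lam` are assumptions.

```python
import math
from collections import Counter, defaultdict

class OutcomeExplorationBonus:
    """Historical exploration: UCB-style bonus over final answers.

    Tractable because a reasoning problem has few distinct final
    answers; we count how often each (problem, answer) outcome has
    been visited during training and reward rarely seen outcomes."""

    def __init__(self, c=1.0):
        self.c = c
        self.counts = defaultdict(Counter)  # problem -> answer -> visits
        self.totals = Counter()             # problem -> total visits

    def bonus(self, problem, answer):
        self.counts[problem][answer] += 1
        self.totals[problem] += 1
        n = self.totals[problem]
        n_a = self.counts[problem][answer]
        return self.c * math.sqrt(math.log(n + 1) / n_a)

def batch_repetition_penalty(answers, lam=0.5):
    """Batch exploration: penalize outcomes repeated within one batch.

    Each sample pays lam times the number of earlier samples in the
    batch that reached the same final answer, pushing a group of
    parallel samples to spread over distinct outcomes."""
    seen = Counter()
    penalties = []
    for a in answers:
        penalties.append(-lam * seen[a])
        seen[a] += 1
    return penalties
```

In this framing the UCB bonus is added to each rollout's outcome reward during training, while the repetition penalty scores a batch of parallel samples at generation time; the two terms apply in different places, which is exactly the historical/batch split.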
Source: Outcome-based Exploration for LLM Reasoning (arXiv:2509.06941)
Related concepts in this collection
- Why do reasoning models fail differently at training versus inference? Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. This paper adds taxonomic precision: historical (training) vs. batch (test-time) exploration, each with distinct algorithms.
- Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Outcome-based exploration addresses the collapse with UCB-style bonuses at the outcome level.
- Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? Diversity is a prerequisite for parallel scaling; RL-induced diversity loss degrades it.
- Does RL training narrow search diversity the same way it does reasoning? Explores whether the entropy collapse pattern observed in reasoning RL also appears in search agent training, which helps identify whether diversity loss is a general RL property or domain-specific. The diversity transfer mechanism operates across domains.
- Does self-consistency reliably reward correct answers during training? Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? Self-consistency reward creates a specific diversity collapse pathway: optimizing for agreement among samples directly reduces the output diversity that makes self-consistency useful as a signal, creating a self-undermining reward dynamic.
- Why does RLVR training narrow a model's problem solving ability? RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Diversity loss and capability boundary collapse are the same dynamic at different levels: diversity loss transfers from solved to unsolved problems (this note), while capability boundary collapse describes the resulting scope narrowing; both require exploration mechanisms to counteract.
Original note title: outcome-based rl induces diversity loss that transfers from solved to unsolved problems — historical and batch exploration require separate mechanisms