Reinforcement Learning for LLMs

Does outcome-based RL diversity loss spread across unsolved problems?

When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?

Note · 2026-02-22 · sourced from Reward Models

Outcome-based RL (rewarding only final-answer correctness) produces substantial accuracy gains but systematically reduces generation diversity. This is known. What is new: the diversity loss transfers across problems. The narrowing induced by concentrating probability mass on correct answers for solved problems propagates to unsolved problems: the model's entire output distribution contracts, not just its distribution on problems it can already solve.

The transfer mechanism: RL sharpens the policy globally, not per problem. Because all problems share the same parameters, concentrating on correct trajectories for solved problems also contracts the generative distribution on problems the model has not solved. As a result, RL can push effective diversity below the base model's even on the training set.
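One way to make the transfer concrete is to measure answer-level diversity on solved and unsolved splits separately. A minimal measurement sketch (not from the paper; the `sampler` callable is a stand-in for whatever decoding stack is in use):

```python
from collections import Counter
from math import log

def answer_entropy(answers):
    """Shannon entropy (nats) of the empirical final-answer distribution."""
    n = len(answers)
    return -sum((c / n) * log(c / n) for c in Counter(answers).values())

def mean_answer_entropy(sampler, problems, k=32):
    """Mean per-problem answer entropy under a policy.

    sampler(problem, k) -> list of k final-answer strings; the caller
    wraps their own decoding stack. Run this for the base and the RL
    policy on solved and unsolved splits separately: if RL only
    sharpened per problem, the unsolved split would keep the base
    model's entropy, so a drop there is the cross-problem transfer.
    """
    return sum(answer_entropy(sampler(p, k)) for p in problems) / len(problems)
```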

The practical consequence: diversity is critical for test-time scaling. As argued in Why does parallel reasoning outperform single chain thinking?, diverse parallel samples are more valuable than many copies of similar reasoning. And as Why does majority voting outperform more complex inference methods? shows, voting requires genuine diversity to work: voting over near-identical samples provides no signal, as the toy example below illustrates.
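A toy illustration (not from the paper): when the sampling distribution collapses, k samples behave like one draw, and the vote share stops discriminating between correct and confidently wrong answers.

```python
from collections import Counter

def majority_vote(answers):
    """Most common final answer and its vote share."""
    (answer, count), = Counter(answers).most_common(1)
    return answer, count / len(answers)

# Diverse samples: the vote aggregates partly independent attempts,
# and the 0.6 share carries information about which answer to trust.
print(majority_vote(["42", "41", "42", "42", "7"]))   # ('42', 0.6)

# Collapsed distribution: five samples but effectively one draw; the
# unanimous share says nothing about whether "41" is actually correct.
print(majority_vote(["41", "41", "41", "41", "41"]))  # ('41', 1.0)
```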

The key conceptual contribution is distinguishing two forms of exploration:

- Historical exploration: maintaining diversity relative to everything the policy has generated earlier in training, so it keeps reaching outcomes it has not tried before.
- Batch exploration: maintaining diversity within a single batch of parallel samples, so test-time aggregation has genuinely distinct candidates to vote over.

These require different mechanisms. Historical exploration uses UCB-style bonuses over outcome space (tractable because reasoning tasks have a limited set of distinct final answers). Batch exploration uses within-batch repetition penalties. The distinction directly instantiates Why do reasoning models fail differently at training versus inference?: historical and batch exploration map onto training time and test time respectively, with concrete algorithmic prescriptions for each.
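A minimal sketch of both mechanisms as reward shaping over final answers, under assumed functional forms (a c/sqrt(1+N) historical bonus and a linear within-batch repetition penalty; the paper's exact shaping terms may differ):

```python
from collections import Counter, defaultdict
from math import sqrt

class OutcomeExploration:
    """Reward shaping over outcome (final-answer) space.

    Tractable because reasoning tasks have a limited set of distinct
    final answers, so counting outcomes stays cheap even when the
    reasoning trajectories themselves are long and varied.
    """

    def __init__(self, c_hist=0.5, c_batch=0.5):
        self.c_hist = c_hist    # historical bonus scale (assumed value)
        self.c_batch = c_batch  # batch penalty scale (assumed value)
        self.counts = defaultdict(Counter)  # problem -> answer -> visit count

    def shape(self, problem, batch_answers, rewards):
        """Shape one batch of outcome rewards for a single problem."""
        in_batch = Counter(batch_answers)
        shaped = []
        for ans, r in zip(batch_answers, rewards):
            # Historical exploration: UCB-style bonus that decays as an
            # answer accumulates visits across training history.
            bonus = self.c_hist / sqrt(1 + self.counts[problem][ans])
            # Batch exploration: penalize answers repeated within this
            # batch, pushing parallel samples toward distinct outcomes.
            penalty = self.c_batch * (in_batch[ans] - 1) / len(batch_answers)
            shaped.append(r + bonus - penalty)
        self.counts[problem].update(batch_answers)
        return shaped
```

Counting outcomes rather than trajectories is what keeps this tractable: the count table grows with the number of distinct final answers, not with sequence length or the number of distinct reasoning chains.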


Source: Reward Models · Outcome-based Exploration for LLM Reasoning (arXiv 2509.06941)


outcome-based rl induces diversity loss that transfers from solved to unsolved problems — historical and batch exploration require separate mechanisms