Does the choice of RL algorithm actually matter for reasoning?
Expert Iteration, PPO, and Return-Conditioned RL show similar performance on reasoning tasks. The question is whether algorithm differences are fundamentally irrelevant, or whether something deeper explains the convergence.
"Teaching Large Language Models to Reason with RL" tests Expert Iteration, PPO, and Return-Conditioned RL across multiple model sizes and initialization conditions with both sparse and dense rewards. Result: performance differences across algorithms are small and convergence behavior is similar. More strikingly, RL training does not improve pass@n scores beyond what light supervised fine-tuning achieves with the same rollout budget.
The mechanism: LLMs require a pretrained prior to navigate the high-dimensional text action space — without it, exploration would be computationally intractable. But this prior simultaneously constrains what gets explored. The model generates variations on what it already knows rather than discovering genuinely novel solutions. Regardless of which RL algorithm manages the update step, the same pretrained exploration prior shapes the solution distribution at convergence.
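A back-of-the-envelope illustration of why the prior bounds exploration (toy numbers, no LLM involved): every candidate the RL loop ever sees is a sample from the prior-shaped policy, so a solution the prior assigns probability p is unlikely to be proposed at all within a realistic rollout budget.

```python
# Toy numbers, purely illustrative: how hard is it to ever sample a solution
# that the pretrained prior considers unlikely?
budget = 10_000  # rollouts per problem; already a generous training budget

for p in (1e-2, 1e-4, 1e-6, 1e-9):       # prior probability of the solution
    p_seen = 1 - (1 - p) ** budget        # chance it is proposed at least once
    print(f"prior prob {p:.0e}: expected draws ~{1/p:.0e}, "
          f"P(proposed within {budget} rollouts) = {p_seen:.5f}")
```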
Additional SFT training before RL makes this worse. More SFT concentrates the prior distribution further — the model converges faster on familiar patterns, which means the RL exploration from that point is more constrained, not less. The result: more SFT → tighter prior → smaller effective exploration space → RL finds less.
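A toy sketch of that concentration effect, assuming nothing about any specific model: sharpening a categorical distribution (a crude stand-in for additional SFT) lowers its entropy, and its perplexity, roughly the number of continuations sampling will realistically reach, shrinks with it.

```python
import numpy as np

def effective_options(logits: np.ndarray, sharpen: float) -> float:
    """Perplexity of the sharpened distribution: roughly the number of
    distinct continuations that sampling will realistically reach."""
    p = np.exp(sharpen * logits)
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)   # stand-in for next-step preferences
for sharpen in (1.0, 2.0, 4.0):  # larger = more concentrated prior (more SFT, loosely)
    print(f"sharpen={sharpen}: ~{effective_options(logits, sharpen):.0f} effective options")
```

The direction of the numbers is the point: sharper prior, fewer reachable continuations, smaller effective exploration space.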
This reframes what RL training does in practice: it is primarily a selection mechanism, not a discovery mechanism. RL identifies which solutions already present in the pretrained distribution deserve reward. It rarely discovers solutions outside that distribution. The pretrained model contains most of what RL training will eventually "find."
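Expert Iteration makes the selection framing almost literal: sample, filter by reward, re-fit. A minimal sketch of one round, where sample_rollouts, reward, and finetune are hypothetical placeholders rather than the paper's implementation:

```python
from typing import Callable, List, Tuple

def expert_iteration_round(
    model,
    problems: List[str],
    sample_rollouts: Callable,  # placeholder: draw n candidate solutions from the current model
    reward: Callable,           # placeholder: returns 1.0 for a correct solution, else 0.0
    finetune: Callable,         # placeholder: supervised fine-tuning on (problem, solution) pairs
    n_rollouts: int = 32,
):
    """One Expert Iteration round: selection, not discovery. Every solution
    that survives the filter was already inside the model's own sample
    distribution; the round only reweights the model toward it."""
    kept: List[Tuple[str, str]] = []
    for problem in problems:
        candidates = sample_rollouts(model, problem, n=n_rollouts)
        kept.extend((problem, sol) for sol in candidates if reward(problem, sol) > 0)
    return finetune(model, kept)
```

PPO and Return-Conditioned RL differ in how they weight the update, but they score and reinforce the same kind of self-sampled candidates, which is the paper's explanation for why the three converge similarly.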
Connects to "Does policy entropy collapse limit reasoning performance in RL?": the algorithm-invariance result here supports the view that entropy, not optimizer choice, is the fundamental constraint. Connects to "Do base models already contain hidden reasoning ability?": if RL unlocks pre-existing capability rather than building new capability, the algorithm doing the unlocking is interchangeable.
Source: Reasoning by Reflection
Related concepts in this collection
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  Relation: the algorithm-invariance finding supports entropy as the binding constraint, not which optimizer is used.
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  Relation: if capability is pre-existing, the mechanism for unlocking it is less important than the prior it unlocks from.
- Does RL training narrow search diversity the same way it does reasoning?
  Explores whether the entropy collapse pattern observed in reasoning RL also appears in search agent training. Understanding this helps identify whether diversity loss is a general RL property or domain-specific.
  Relation (extends): entropy collapse as an architectural property is confirmed in the search domain; RL algorithm interchangeability in reasoning and RL collapse in search are two expressions of the same prior-bounded exploration ceiling.
- Can simple rewards alone teach complex domain reasoning?
  Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
  Relation (tension): the emergence framing suggests RL generates genuinely novel capabilities, while algorithm interchangeability suggests RL primarily selects from what the pretrained prior already contains; the two accounts apply at different scales of capability.
- Does RL training follow predictable scaling curves?
  Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.
  Relation (refines): algorithm choice is interchangeable within a recipe, but recipe-level choices (data, reward structure, training configuration) set different asymptotic ceilings; ScaleRL provides the empirical scaling framework that contextualizes algorithm-level findings.
Original note title
rl for reasoning algorithm choice is interchangeable because the exploration ceiling is set by the pretrained prior not the algorithm