Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does the choice of RL algorithm actually matter for reasoning?

Expert Iteration, PPO, and Return-Conditioned RL show similar performance on reasoning tasks. The question is whether algorithm differences are fundamentally irrelevant, or whether something deeper explains the convergence.

Note · 2026-02-22 · sourced from Reasoning by Reflection
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

"Teaching Large Language Models to Reason with RL" tests Expert Iteration, PPO, and Return-Conditioned RL across multiple model sizes and initialization conditions with both sparse and dense rewards. Result: performance differences across algorithms are small and convergence behavior is similar. More strikingly, RL training does not improve pass@n scores beyond what light supervised fine-tuning achieves with the same rollout budget.

The mechanism: LLMs need a pretrained prior to navigate the high-dimensional text action space; without it, exploration would be computationally intractable. But the same prior constrains what gets explored. The model generates variations on what it already knows rather than discovering genuinely novel solutions. Whichever RL algorithm manages the update step, the same pretrained exploration prior shapes the solution distribution at convergence.
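One common way to make this precise (a standard result for KL-regularized fine-tuning, not a derivation from this paper): when the RL objective penalizes divergence from the pretrained reference policy, the optimum is just a reweighting of that prior.

```latex
% KL-regularized objective, with \pi_{\mathrm{ref}} the pretrained (or SFT) policy:
\max_{\pi}\;\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big]
  \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)

% Closed-form optimum: a reweighting of the prior
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x,y)/\beta\big)
```

Since \(\pi^{*}\) inherits the zeros of \(\pi_{\mathrm{ref}}\), reward optimization can only redistribute probability mass within the prior's support; which algorithm approximates the reweighting is secondary.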

Additional SFT training before RL makes this worse. More SFT concentrates the prior distribution further — the model converges faster on familiar patterns, which means the RL exploration from that point is more constrained, not less. The result: more SFT → tighter prior → smaller effective exploration space → RL finds less.
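A toy illustration of that chain (my own sketch, not an experiment from the paper): sharpening a categorical distribution while holding the rollout budget fixed shrinks the number of distinct candidates sampling ever reaches.

```python
import numpy as np

rng = np.random.default_rng(0)

def distinct_solutions_covered(logits: np.ndarray, budget: int = 96) -> int:
    """Number of distinct candidate 'solutions' hit by `budget` samples
    from a categorical distribution over a toy solution space."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    samples = rng.choice(len(probs), size=budget, p=probs)
    return len(set(samples.tolist()))

# Toy solution space of 1,000 candidates; scaling the logits up mimics
# additional SFT concentrating mass on familiar patterns.
base_logits = rng.normal(size=1000)
for sharpen in (1.0, 3.0, 10.0):  # more SFT ~ sharper prior
    covered = distinct_solutions_covered(sharpen * base_logits)
    print(f"sharpness {sharpen:>4}: {covered} distinct solutions in 96 rollouts")
```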

This reframes what RL training does in practice: it is primarily a selection mechanism, not a discovery mechanism. RL identifies which solutions already present in the pretrained distribution deserve reward. It rarely discovers solutions outside that distribution. The pretrained model contains most of what RL training will eventually "find."
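Expert Iteration makes the selection framing explicit: sample from the current model, keep only the rollouts the reward accepts, and fine-tune on those. Nothing enters training that the model could not already generate. A minimal sketch, with `model.sample`, `reward`, and `model.finetune` as hypothetical placeholders:

```python
def expert_iteration(model, problems, rounds: int = 3, k: int = 32):
    """Minimal Expert Iteration loop: RL as selection over the model's own samples.
    `model.sample`, `reward`, and `model.finetune` are hypothetical placeholders."""
    for _ in range(rounds):
        accepted = []
        for problem in problems:
            for rollout in (model.sample(problem) for _ in range(k)):
                if reward(problem, rollout) > 0:
                    # keep only solutions already reachable under the current prior
                    accepted.append((problem, rollout))
        # reinforce the selected samples; no solutions outside the prior are created
        model = model.finetune(accepted)
    return model
```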

Connects to "Does policy entropy collapse limit reasoning performance in RL?": this paper's algorithm-invariance result is evidence that entropy, not algorithm choice, is the fundamental constraint. Connects to "Do base models already contain hidden reasoning ability?": if RL unlocks pre-existing capability rather than building new capability, the algorithm doing the unlocking is interchangeable.


Source: Reasoning by Reflection
