Can extended RL training discover reasoning strategies base models cannot?
Does reinforcement learning genuinely expand what models can reason about, or does it only optimize existing latent capabilities? ProRL tests this by running RL longer on diverse tasks with better training controls.
A fundamental debate in RL for reasoning: does RL truly expand capabilities, or does it merely optimize sampling efficiency over solutions already embedded in the base model? Several studies argued for the latter; as explored in the related note "Does RLVR actually expand what models can reason about?", pass@k analysis showed base-model performance eventually surpassing RL-trained models as k increases. ProRL directly challenges this conclusion.
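For concreteness, pass@k is usually computed with the unbiased estimator from Chen et al. (2021): draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k draws succeeds. A minimal sketch, with invented example counts:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn per problem
    c: number of those samples that were correct
    k: attempt budget being evaluated
    """
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k all-failing draws
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical counts: a model with 3/64 correct climbs with k, while one
# with 0/64 stays at zero for every k. The latter pattern is behind ProRL's
# claim that some tasks are unsolvable for the base model at any budget.
print(pass_at_k(64, 3, 8))   # ~0.33
print(pass_at_k(64, 0, 8))   # 0.0 at every k
```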
The challenge is methodological, not philosophical. ProRL identifies two limitations in prior studies: (1) overreliance on mathematics, a domain where models are already overtrained during pre-training and post-training, restricting exploration potential; and (2) premature termination of RL training before models can fully explore novel reasoning capabilities. The solution: KL divergence control to prevent entropy collapse, reference policy resetting to maintain exploration, and a diverse suite of tasks beyond math.
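A minimal sketch of how these two controls might compose in a policy-gradient loop. The k3-style KL estimator matches common GRPO-style implementations, but the function names, kl_coef, and reset_interval here are illustrative assumptions, not ProRL's published hyperparameters:

```python
import torch

def kl_regularized_loss(policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        kl_coef: float = 0.01) -> torch.Tensor:
    """Policy-gradient loss with a KL penalty toward a frozen reference.

    The penalty bounds drift from the reference policy, the mechanism
    ProRL uses to keep entropy from collapsing over long runs.
    """
    pg_loss = -(advantages * policy_logprobs).mean()
    # k3 estimator of KL(policy || ref): exp(r) - r - 1 >= 0,
    # where r = ref_logprob - policy_logprob per token
    log_ratio = ref_logprobs - policy_logprobs
    kl = (log_ratio.exp() - log_ratio - 1.0).mean()
    return pg_loss + kl_coef * kl

def maybe_reset_reference(policy: torch.nn.Module,
                          reference: torch.nn.Module,
                          step: int,
                          reset_interval: int = 200) -> torch.nn.Module:
    """Periodically re-anchor the KL penalty to the current policy."""
    if step > 0 and step % reset_interval == 0:
        reference.load_state_dict(policy.state_dict())
        for p in reference.parameters():
            p.requires_grad_(False)  # reference stays frozen between resets
    return reference
```

The interplay is the point: the KL term alone would eventually pin training near the original base model, while periodic resets move the anchor forward so prolonged training can keep exploring.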
The result is striking. RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. This is the critical distinction: not merely better sampling efficiency, but access to solution strategies the base model cannot produce at any k.
However, this finding is in tension with the related note "Does RL teach reasoning or just when to use it?". The resolution may be domain-conditional: on overtrained domains (mathematics, coding), where base models have been extensively exposed during pre-training, RL primarily teaches timing and selection. On genuinely novel reasoning tasks, where base models lack established solution patterns, sufficiently prolonged RL can expand the capability frontier.
This has practical implications for how long to run RL training and on what tasks. If the goal is genuinely new reasoning capabilities rather than just better deployment of existing ones, RL must be applied to diverse, non-overtrained domains with sufficient training duration.
Source: Reinforcement Learning
Related concepts in this collection
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  Tension: ProRL challenges this claim on novel (non-math) tasks.
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  Partially challenged: true for overtrained domains, not for genuinely novel tasks.
- Can simple rewards alone teach complex domain reasoning?
  Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
  Supports: prolonged training is the condition under which emergence happens.
- Does the choice of RL algorithm actually matter for reasoning?
  Expert Iteration, PPO, and Return-Conditioned RL show similar performance on reasoning tasks. The question is whether algorithm differences are fundamentally irrelevant, or whether something deeper explains the convergence.
  Partially challenged: ProRL shows the ceiling can be raised with sufficient training duration and diversity.
- Does RLVR actually expand what models can reason about?
  Explores whether reinforcement learning with verifiable rewards teaches models genuinely new reasoning capabilities or simply makes them more reliable at solving problems they already could solve.
  Direct tension: pass@k analysis shows RLVR narrows boundaries, but ProRL with sufficient duration and diversity on non-overtrained domains expands them; the resolution is domain-conditional.
Original note title: prolonged rl discovers genuinely novel reasoning strategies inaccessible to base models even under extensive sampling