Does RL training create new reasoning skills or activate existing ones?
Understanding whether reinforcement learning actually builds novel capabilities or simply teaches models when to use reasoning they already possess. This matters for predicting RL's value across different task types.
Two prominent claims about what RL post-training does appear contradictory:
The timing thesis: RL functions as a deployment optimizer. It does not teach reasoning so much as teach when to use it, because base models already contain hidden reasoning ability. Evidence: base models outperform RLVR-trained models at high pass@k, RL-trained models exhibit the same solution strategies as base models, and a single training example can unlock mathematical reasoning.
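The pass@k evidence above rests on the standard unbiased estimator (1 minus the chance that k samples drawn from n generations, c of them correct, are all wrong). A minimal sketch, with purely hypothetical counts chosen to illustrate the crossover the timing thesis cites, not real benchmark numbers:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, solves the problem."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical illustration: an RL-tuned model with a higher
# single-shot rate can still trail the base model at large k.
rlvr_pass1  = pass_at_k(n=200, c=60, k=1)    # RL model, pass@1
base_pass1  = pass_at_k(n=200, c=30, k=1)    # base model, pass@1
base_pass100 = pass_at_k(n=200, c=30, k=100)  # base model, pass@100
```

With these toy counts the RL model wins at k=1 (0.30 vs 0.15) while the base model's pass@100 approaches 1, which is the shape of the boundary-collapse argument.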
The capability thesis: extended RL training can discover reasoning strategies that base models cannot reach. Evidence: ProRL elicits strategies absent from base model samples regardless of sampling budget, and self-evolving curriculum RL breaks the boundary constraints identified by pass@k analysis, suggesting RLVR can actually expand what models can reason about.
The domain-conditional resolution: Both are correct under different conditions. For standard math/code reasoning where the problem structure is well-represented in pretraining data, RL activates latent capability (timing thesis). For complex tasks requiring multi-step planning, tool coordination, or novel strategy recombination, RL may create genuinely new capability through prolonged training (capability thesis).
Supporting evidence for the conditional view:
- RLVR pass@k boundary collapse occurs on standard benchmarks (MATH, GSM8K)
- ProRL novel strategy discovery occurs on problems requiring deep planning
- SWE-RL doubles baseline on long-horizon engineering tasks — beyond activation
- Duration matters: short RLVR narrows boundaries while prolonged RL pushes through them
The practical implication: RL training investment should be calibrated to the target domain. For standard reasoning, minimal RL (even one example) suffices. For complex agentic tasks, sustained RL investment with evolving curricula is justified.
RL capability creation is domain-conditional — standard reasoning activates latent capability while complex planning may generate genuinely novel strategies