Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach the model when to activate capabilities it already has? The answer matters for understanding where reasoning truly emerges.
The standard account of thinking models like DeepSeek R1 and OpenAI o1 attributes their reasoning gains to RL post-training — a story in which RL teaches models how to reason. The "Base Models Know How to Reason" paper inverts this: pre-training is when reasoning capability is acquired; RL teaches models when to deploy it.
The evidence is direct. A hybrid model that combines a base model's reasoning capabilities with a thinking model's deployment decisions — without any weight updates — recovers up to 91% of the performance gap between base and thinking models while steering only 12% of tokens. The steering uses activation-space vectors: directions in the base model's activation space that, when added at the right moments, induce reasoning behaviors like backtracking, uncertainty estimation, and subgoal-setting. The thinking model acts as a controller, deciding which steering vectors to activate and when.
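As a concrete (and heavily simplified) sketch of the mechanism, the snippet below steers a small open model by adding a vector to one block's hidden states at selected decoding steps. Everything here is an assumption for illustration: the layer index, the steering scale, the random stand-in vector, and the periodic trigger that stands in for the thinking-model controller. The paper's actual vectors are extracted from the base model's own activations, not sampled at random.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6                                    # which block to steer (assumption)
SCALE = 4.0                                  # steering strength (assumption)
# Stand-in for a learned "backtracking" direction.
steer_vec = torch.randn(model.config.n_embd)
steer_vec = steer_vec / steer_vec.norm()
control = {"on": False}                      # flipped by the controller each step

def hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    if control["on"]:
        hidden = output[0]
        # Nudge only the newest token position along the steering direction.
        hidden[:, -1, :] = hidden[:, -1, :] + SCALE * steer_vec
        return (hidden,) + output[1:]
    return output

handle = model.transformer.h[LAYER].register_forward_hook(hook)

ids = tok("Problem: is 391 prime? Think step by step.", return_tensors="pt").input_ids
for step in range(40):
    control["on"] = (step % 8 == 0)          # placeholder controller: ~12% of tokens
    with torch.no_grad():
        next_id = model(ids).logits[:, -1].argmax(-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=1)

handle.remove()
print(tok.decode(ids[0]))
```

Note that the base model's weights never change; the entire intervention is a per-token decision about whether to add a direction to the hidden state.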
This reframes what RL actually does. RL doesn't inject new reasoning skills; it biases token generation toward patterns that earn high reward. If base models already contain the execution-level skills (which they demonstrably do: sampled enough times, they produce the same reasoning behaviors found in thinking model outputs), RL is essentially training a deployment schedule: produce the right reasoning at the right moment.
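A toy illustration of that claim, under deliberately simplified assumptions (a bandit over four named "reasoning moves", nothing like a real LLM policy): a REINFORCE update can only reweight moves already in the policy's support; it cannot add a new one.

```python
import numpy as np

rng = np.random.default_rng(0)
moves = ["continue", "backtrack", "verify", "subgoal"]  # fixed repertoire
logits = np.zeros(len(moves))                           # base policy: uniform

def reward(move: str) -> float:
    # Hypothetical task where backtracking at this step is what gets rewarded.
    return 1.0 if move == "backtrack" else 0.0

for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(len(moves), p=probs)
    # REINFORCE for a softmax policy: grad of log pi(a) is onehot(a) - probs.
    grad = -probs
    grad[a] += 1.0
    logits += 0.1 * reward(moves[a]) * grad

probs = np.exp(logits) / np.exp(logits).sum()
print(dict(zip(moves, probs.round(3))))
# Probability mass concentrates on "backtrack"; the move set itself never grows.
```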
The implications are uncomfortable for the RL-is-essential narrative. Reasoning capability is largely a pre-training phenomenon; RL is a deployment optimizer, not a capability creator. This connects to *Can prompt optimization teach models knowledge they lack?* — the same principle operating at the training/inference boundary rather than purely at inference time.
Three RLVR findings reinforce this:

1. Pass@k analysis shows that RLVR models have narrower capability boundaries than base models: at high k, base models outperform all six tested RLVR algorithms. RLVR is a sampling-efficiency optimizer, not a capability expander. See *Does RLVR actually expand what models can reason about?*
2. 1-shot RLVR achieves a 37-point jump on MATH500 (36% → 73.6%) from a single training example, with generalization continuing for 1,400 steps after the model has perfectly memorized that one example. The data is exhausted but activation continues, because the training signal triggers a phase transition in the model's output distribution. See *Can a single training example unlock mathematical reasoning?*
3. Spurious rewards (random, incorrect, or format-only) improve Qwen2.5-Math nearly as much as ground-truth rewards, yet fail for Llama and OLMo. The differentiating variable is pretraining strategy, not reward-signal quality. See *Why do random rewards improve reasoning for some models but not others?*
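For reference, pass@k analyses of this kind typically use the unbiased estimator from Chen et al. (2021): with n samples per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). A minimal, numerically stable sketch (the example numbers are illustrative, not the paper's):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n, of which c are correct, succeeds."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# A problem the base model solves rarely (2 of 200 samples) still surfaces
# at high k, which is how wider base-model support shows up in pass@k curves.
print(pass_at_k(n=200, c=2, k=1))    # ~0.01
print(pass_at_k(n=200, c=2, k=100))  # ~0.75
```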
The practical implication for reasoning system design: targeted steering of base models may be a more efficient path to reasoning performance than full RL training, particularly in domains where an RLVR reward signal is hard to define.
Source: Reasoning Architectures
Related concepts in this collection
- **Can simple rewards alone teach complex domain reasoning?** Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively. Relation: challenges this note; if base models already have the capability, RL is not an emergence engine but an activation scheduler.
- **Does RL improve domain reasoning by adding knowledge or removing it?** When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? The distinction matters for how we design RL training. Relation: consistent; RL shapes *which* capabilities get expressed, not whether they exist.
- **Can non-reasoning models catch up with more compute?** Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this. Relation: partially qualified by this note; base models can close most of the gap with targeted activation, which changes what "non-reasoning model" means.
- **Can prompt optimization teach models knowledge they lack?** Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they are limited to activating existing training knowledge. Relation: this note extends the same activation principle to training-time dynamics.
- **Does RLVR actually expand what models can reason about?** Explores whether reinforcement learning with verifiable rewards teaches models genuinely new reasoning capabilities or simply makes them more reliable at solving problems they already could solve. Relation: pass@k evidence; RLVR narrows scope to a reliable subset.
- **Can a single training example unlock mathematical reasoning?** Does minimal data suffice to activate latent reasoning capabilities in language models? This explores whether one example can produce dramatic performance gains comparable to much larger datasets. Relation: 1-shot activation; a minimal signal triggers a phase transition.
- **Why do random rewards improve reasoning for some models but not others?** Spurious rewards boost Qwen's math reasoning by 16-25% but fail for Llama and OLMo. We explore whether reward quality matters, or if pretraining strategy determines what RLVR can unlock. Relation: pretraining determines RLVR effectiveness, not reward quality.
- **Does policy entropy collapse limit reasoning performance in RL?** As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling? Relation: if RL narrows toward a deployment-timing policy rather than building capability, entropy collapse is the natural consequence; the model converges on a single activation schedule and loses the diversity of timing strategies that would sustain continued improvement.
- **Can models learn to internalize search as reasoning?** Does training on linearized search traces teach models to implement search algorithms internally, expanding what they can discover during reasoning? This matters because it could unlock entirely new problem-solving modes beyond standard chain-of-thought. Relation: challenges the when-not-how framing; Meta-CoT proposes that search algorithms *are* trainable as the "how" component, suggesting RL may operate at two levels: timing (when to reason) and search internalization (how to reason).
- **Does reinforcement learning teach social reasoning or just shortcuts?** When RL optimizes for accuracy on theory-of-mind tasks, do models actually learn to track mental states, or do they find faster paths to correct answers? The distinction matters for genuine reasoning capability. Relation: adds a capacity caveat; RL teaches when-not-how only when the model has sufficient latent capability, and below a scale threshold in social reasoning, RL teaches shortcuts instead of activation timing.
- **Can models improve themselves on tasks without verifiable answers?** Most self-improvement methods require objective correctness signals, limiting them to math and code. Can models self-improve on open-ended instruction tasks where answers can't be automatically verified? Relation: catalyst data reinforces the when-not-how thesis through a different mechanism; 1,000 demonstrations teach the model to enrich its reasoning output format, not to reason, and the small data requirement confirms the capability is latent: the catalyst is an activation signal for articulating reasoning, not reasoning itself.
- **Can pretraining corpora themselves provide verifiable RL rewards?** Does framing next-token prediction as a reasoning task with ground-truth verification eliminate the need for human feedback or domain-specific rewards during language model pretraining? Relation: RPT enriches the latent capability that post-training activates; by embedding RL reasoning patterns during pretraining itself, it builds a richer foundation for the "when" decision that post-training teaches.
- **How does thinking emerge from policy selection in RL?** Explores whether thinking is fundamentally about selecting between existing sub-policies rather than building new reasoning from scratch. This matters for understanding how RL training unlocks latent capabilities in language models. Relation: provides the formal mechanism; the thought MDP formalizes "when to activate" as sub-policy selection within a rich policy initialization, so thinking is choosing which existing sub-policy to deploy, not building new capability.
Original note title: *rl post-training teaches models when to activate reasoning mechanisms not how to execute them*