Does RL teach models when to use reasoning or how to reason?
This explores whether reinforcement learning builds new reasoning ability in a model or mainly teaches it when to deploy reasoning it already has — and where the corpus splits on that question.
The corpus leans hard toward the second answer, though not unanimously, and the disagreement is the interesting part.
The dominant finding is that RL teaches *when*, not *how*. Base models appear to already carry reasoning strategies in latent form, and RL post-training optimizes when those strategies fire rather than creating them (see "Does RL post-training create reasoning or just deploy it?" and "Does RL teach reasoning or just when to use it?"). The striking evidence: a hybrid model that keeps the base model's reasoning and uses a thinking model only to decide *which* tokens to route recovered 91% of the performance gains using just 12% of the tokens, implying that RL acts as a deployment optimizer, not a capability creator. Mechanistic work backs this up: five independent techniques (RL steering, critique fine-tuning, decoding tweaks, SAE feature steering, RLVR) all elicit reasoning already sitting in base-model activations, suggesting the bottleneck is elicitation, not acquisition (see "Do base models already contain hidden reasoning ability?").
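To make the "91% of the gains with 12% of the tokens" claim concrete, here is a minimal sketch of the arithmetic. It assumes that "gain recovered" means the hybrid's improvement over the base model divided by the thinking model's improvement over the base model, and that the token figure is the share of generated tokens the thinking model touched. The notes do not spell out either definition, and the function names and accuracy values below are hypothetical, chosen only so the ratios come out near 0.91 and 0.12.

```python
def fraction_of_gain_recovered(base_acc: float, thinking_acc: float, hybrid_acc: float) -> float:
    """Share of the thinking model's improvement over the base that the hybrid keeps.

    Assumed definition (not confirmed by the source):
        (hybrid - base) / (thinking - base)
    """
    gain = thinking_acc - base_acc
    if gain <= 0:
        raise ValueError("thinking model must outperform the base model")
    return (hybrid_acc - base_acc) / gain


def fraction_of_tokens_routed(routed_tokens: int, total_tokens: int) -> float:
    """Share of generated tokens at which the thinking model intervened (assumed meaning)."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return routed_tokens / total_tokens


if __name__ == "__main__":
    # Hypothetical accuracies, for illustration only.
    recovered = fraction_of_gain_recovered(base_acc=0.40, thinking_acc=0.60, hybrid_acc=0.582)
    routed = fraction_of_tokens_routed(routed_tokens=120, total_tokens=1000)
    print(f"gain recovered: {recovered:.2f}, tokens routed: {routed:.2f}")
```

Under this reading, a high recovery ratio at a low routing ratio is what makes the "deployment optimizer" interpretation plausible: most of the improvement comes from a small number of decisions about where to intervene.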
The RLVR literature sharpens the same point from a different angle. Reward learning seems to activate strategies acquired in pretraining rather than teach new ones: a single training example can be enough to trigger it, and for a well-pretrained model even spurious rewards work nearly as well as correct ones (see "What does reward learning actually do to model reasoning?"). Pass@k analysis is the clincher: base models actually *beat* RLVR models at high k, meaning RL narrows sampling toward solutions already in the base distribution rather than expanding what is solvable (see "Does RLVR actually expand what models can reason about?"). By that account, RL teaches neither when nor how so much as *which answer to commit to faster*.
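The pass@k crossover can be illustrated with a small sketch. It uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the cited analysis used exactly this estimator is an assumption, and the per-problem counts are hypothetical toy data built to show the pattern: the "RL" model is reliable on half the problems and never solves the rest, while the "base" model is rarely right but can solve all of them.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples, so every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 100  # samples drawn per problem (hypothetical)
BASE_COUNTS = [2, 2, 2, 2]    # toy base model: rarely correct, but nonzero on every problem
RL_COUNTS = [90, 90, 0, 0]    # toy RL model: reliable on two problems, never solves the others

if __name__ == "__main__":
    for k in (1, 10, 100):
        base = mean_pass_at_k(BASE_COUNTS, N, k)
        rl = mean_pass_at_k(RL_COUNTS, N, k)
        print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

In this toy setup the RL model wins at k=1 and the base model wins at k=100, which is the "narrowing" signature described above. A model that led at every k would instead point to a genuinely larger set of solvable problems.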
But the corpus does not let the "when, not how" story win cleanly. Prolonged RL on diverse, non-mathematical tasks, with KL control and policy resetting, produced models that outperform the base across *all* pass@k levels, which is exactly the signature of genuinely expanded capability rather than better sampling (see "Can reinforcement learning discover reasoning strategies base models cannot?"). The reconciliation may be domain-dependent: RL re-routes existing skill where the base model already has established patterns (math), but can discover new strategies where it does not. There is also a third framing the question does not anticipate: RL teaching *how to reason about reasoning*. Process rewards on metacognitive tags (planning, exploration, reflection) cut repetitive actions by 31% relative to outcome-only rewards and generalize better than supervised fine-tuning alone, which is closer to shaping the reasoning process itself than to timing it (see "Can RL agents learn to reason better, not just succeed?").
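For readers unfamiliar with "KL control", the standard KL-regularized RL objective is sketched below. This is the generic textbook form, not necessarily the exact objective used in the cited work. The symbols are assumptions introduced here: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a reference policy, $r$ the reward, $\mathcal{D}$ the prompt distribution, and $\beta$ the penalty weight.

```latex
J(\theta) =
  \mathbb{E}_{x \sim \mathcal{D}}
  \Big[
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
    - \beta \, D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
  \Big]
```

One plausible reading of "policy resetting" is that the reference is periodically replaced by the current policy, $\pi_{\mathrm{ref}} \leftarrow \pi_\theta$, so the penalty constrains each training phase without pinning the model to the original base distribution forever. The notes do not spell this out, so treat it as an interpretation.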
Worth pulling on if you go further: the whole debate may rest on where reasoning comes from in the first place. Analysis of five million pretraining documents found that reasoning generalization is driven by broad, transferable *procedural* knowledge, not the narrow fact-memorization behind recall (see "Does procedural knowledge drive reasoning more than factual retrieval?"). If the procedures are laid down in pretraining, then "RL teaches when, not how" is almost a corollary. Scale also matters for whether RL produces real reasoning at all: on theory-of-mind tasks, larger (7B) models develop genuine, transferable belief-tracking under RL, while smaller ones reach comparable accuracy through shortcut learning with no interpretable trace. Matching accuracy can therefore hide whether any reasoning was learned (see "Does reinforcement learning on theory of mind collapse with model scale?").
Sources (9 notes)
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.