Does RL teach reasoning or teach when to use it?
Post-training RL gets credit for building reasoning into language models, but emerging evidence suggests base models already possess this capability. The question is whether RL creates new reasoning skills or simply teaches deployment timing.
Post angle — Medium/LinkedIn
The dominant story: DeepSeek-R1, OpenAI o1, and their successors acquire reasoning capability through RL post-training. RL teaches models to think step-by-step, to backtrack, to verify — capabilities they didn't have before.
The emerging counter-evidence is striking. A hybrid model using a base model's weights with a thinking model's deployment decisions — zero weight updates — recovers 91% of the performance gap to thinking models by steering only 12% of tokens. Base models already spontaneously produce reasoning traces identical to thinking model traces when sampled sufficiently. Single-problem CFT achieves RLVR-level reasoning gains. Activation-space vectors encoding "backtracking" and "uncertainty estimation" already exist in base model hidden states before any RL.
The reframe: pre-training is when reasoning capability is acquired; RL post-training teaches when to deploy it.
This is not a trivial distinction. "When" training is cheaper, less data-hungry, and less fragile than "how" training. If the capability already exists, elicitation methods (structured tool-calling, steering vectors, targeted fine-tuning on single problems) become much more attractive than full RL pipelines; a minimal steering-vector sketch follows below.
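To make "elicitation" concrete, here is a minimal sketch of the steering-vector idea: derive a "backtracking" direction from a base model's own hidden states by contrasting prompts, then inject it at generation time with zero weight updates. It assumes a HuggingFace-style, Llama-architecture checkpoint (hence the `model.model.layers` path); the contrast prompts, layer index, and scaling factor are illustrative placeholders, not the recipe from any of the papers referenced above.

```python
# Sketch of activation-space steering: extract a "backtracking" direction from a base
# model's hidden states and add it back while generating. No parameters are updated.
# Assumptions (not from the cited work): Llama-style module layout, toy contrast
# prompts, layer 16, and alpha = 4.0 are all illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # any base (pre-RL) checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 16  # which layer carries the direction is an empirical question

def mean_hidden(prompts):
    """Mean hidden state at LAYER over the final token of each prompt."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(vecs).mean(dim=0)

# Contrastive prompt sets (toy): traces that backtrack vs. traces that plough ahead.
backtracking = ["Wait, that can't be right. Let me re-check the previous step."]
straight     = ["The answer follows directly from the previous step."]
direction = mean_hidden(backtracking) - mean_hidden(straight)
direction = direction / direction.norm()

def steer(module, inputs, output, alpha=4.0):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] + alpha * direction.to(dtype=output[0].dtype, device=output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt_ids = tok("Q: 17 * 23 = ?\nA:", return_tensors="pt")
print(tok.decode(model.generate(**prompt_ids, max_new_tokens=128)[0]))
handle.remove()
```

The point of the sketch is the shape of the intervention: everything it touches (the direction, the hidden states, the base weights) exists before any RL is run.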
The hook for readers: "We've been crediting the locksmith for the key."
Connections: Does RL teach reasoning or just when to use it? · Do base models already contain hidden reasoning ability? · Can modular cognitive tools boost LLM reasoning without training?
Source: Reasoning Architectures
Related concepts in this collection
- Does extended thinking help or hurt model reasoning?
Explores whether activating thinking mode improves reasoning performance, and what role training plays in determining whether extended internal reasoning chains are productive or counterproductive.
extends the "when not how" claim: RL also manages the *quality direction* of thinking, redirecting extended reasoning from unproductive self-doubt toward productive gap analysis in conversational contexts
- Can dialogue planning balance fast responses with strategic depth?
Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.
dialogue-specific instantiation of "when not how": the policy model has dialogue capabilities from pretraining; the uncertainty-switching mechanism teaches when to deploy deep planning
- Can models learn when to think versus respond quickly?
Can a single LLM learn to adaptively choose between extended reasoning and concise responses based on task complexity? This matters because it could optimize compute efficiency without sacrificing accuracy on hard problems.
the strongest concrete implementation: Thinkless's control token design makes "when not how" architecturally explicit; RL optimizes a single routing token, not reasoning content
- Does reinforcement learning update only a small fraction of parameters?
Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
parametric evidence: if RL only touches 5–30% of parameters, the remaining 70–95% already encode the capability; sparse-but-full-rank updates are the physical signature of "when not how" (see the checkpoint-diff sketch after this list)
- Can extended RL training discover reasoning strategies base models cannot?
Does reinforcement learning genuinely expand what models can reason about, or does it only optimize existing latent capabilities? ProRL tests this by running RL longer on diverse tasks with better training controls.
TENSION: ProRL challenges the "when not how" framing on novel non-overtrained tasks; the resolution may be domain-conditional — timing-only on overtrained domains, genuine capability creation on novel tasks with sufficient RL duration
- Does RL training follow a predictable two-phase learning sequence?
This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
deepens: the two-phase dynamic decomposes "when" into a temporal structure — execution tokens are "how" (learned first), planning tokens are "when" (learned second); the "when not how" thesis applies specifically to the planning-token phase
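The "small fraction of parameters" claim above is straightforward to check: diff a base checkpoint against its RL-tuned counterpart and count how many entries actually moved. A minimal sketch, assuming two HuggingFace checkpoints that share an architecture; the checkpoint names and the change tolerance are illustrative, not values from the referenced study.

```python
# Sketch: quantify how much of the network an RL-tuned checkpoint actually changed
# relative to its base, and spot-check whether per-matrix updates are low-rank.
# "base-model" / "rl-tuned-model" and the 1e-6 tolerance are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM

base  = AutoModelForCausalLM.from_pretrained("base-model")       # hypothetical name
tuned = AutoModelForCausalLM.from_pretrained("rl-tuned-model")   # hypothetical name

base_sd, tuned_sd = base.state_dict(), tuned.state_dict()

changed, total = 0, 0
for name, w0 in base_sd.items():
    delta = tuned_sd[name].float() - w0.float()
    moved = delta.abs() > 1e-6                  # tolerance for "this entry changed"
    changed += moved.sum().item()
    total += moved.numel()
    if w0.ndim == 2 and ".layers.0." in name:   # spot-check one block's update rank
        rank = torch.linalg.matrix_rank(delta).item()
        print(f"{name}: {moved.float().mean().item():.1%} entries moved, "
              f"update rank {rank}/{min(w0.shape)}")

print(f"Fraction of parameters touched by RL post-training: {changed / total:.1%}")
```

If the "when not how" reading is right, the overall fraction should land well below one while the spot-checked update ranks stay near full, which is the sparse-but-full-rank signature the note above describes.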
Original note title
thinking models learn when not how — the case that rl post-training is a deployment optimizer not a capability creator