INQUIRING LINE

Does RL primarily teach when to use reasoning or how to reason?

This explores a genuine fault line in the corpus — whether reinforcement learning mostly tunes the *timing* of reasoning a model already knows how to do, or whether it actually builds new reasoning ability.


The collection is split on this question, which is the interesting part. The dominant cluster argues for "when." One line of work frames RL post-training as a deployment optimizer: pre-training installs the reasoning capability, and RL learns to fire it efficiently. The striking evidence is a hybrid model, combining base-model reasoning with thinking-model steering, that recovered 91% of the performance gains using only 12% of the tokens. It did so purely by routing, controlling *when* the thinking model engages rather than teaching the base model anything new (see "Does RL teach reasoning or just when to use it?" and "Does RL post-training create reasoning or just deploy it?"). A complementary finding shows that reward learning mostly raises *sampling efficiency* within the base model's existing capability boundaries: a single training example can suffice to activate a strategy, and for models with appropriate pretraining even spurious rewards work nearly as well as correct ones, which only makes sense if the skill was already latent (see "What does reward learning actually do to model reasoning?"). Push this further and even the optimizer choice stops mattering: PPO, Expert Iteration, and RC-RL perform comparably because the pretrained prior bounds what exploration can reach. RL is selection, not discovery (see "Does the choice of RL algorithm actually matter for reasoning?").

But the corpus doesn't let "when" win cleanly. Prolonged RL, trained with KL control, policy resetting, and tasks outside math where base models lack established patterns, produces models that beat the base at *every* pass@k level. That is the signature of genuinely expanded capability rather than reshuffled sampling: if RL had only sharpened what the base model could already do, the base model would be expected to close the gap as k grows (see "Can reinforcement learning discover reasoning strategies base models cannot?"). So the answer may hinge on the domain: where the base model already has patterns, RL optimizes deployment; where it doesn't, RL can find new ones.
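To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts, the names `BASE`, `SHARPENED`, and `EXPANDED`, and the helper functions are all hypothetical and invented for illustration; none of them come from the cited notes.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that k samples drawn without replacement are all wrong.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


def beats_base_at_every_k(rl_counts, base_counts, n, ks):
    """True only if the RL model is strictly ahead at every k in ks."""
    return all(
        mean_pass_at_k(rl_counts, n, k) > mean_pass_at_k(base_counts, n, k)
        for k in ks
    )


# Hypothetical correct counts per problem, out of N samples each (made up).
N = 16
KS = (1, 4, 16)
BASE = [8, 2, 1, 0]        # solves three problems, mostly unreliably
SHARPENED = [16, 8, 0, 0]  # more reliable, but covers fewer problems
EXPANDED = [16, 8, 4, 2]   # more reliable and covers a problem BASE never solves

if __name__ == "__main__":
    for name, counts in (("base", BASE), ("sharpened", SHARPENED), ("expanded", EXPANDED)):
        curve = {k: round(mean_pass_at_k(counts, N, k), 3) for k in KS}
        print(name, curve)
```

In this toy data the "sharpened" model wins at small k but falls behind the base model at k = 16, while the "expanded" model stays ahead at every k. Real evaluations would need many more problems and samples before drawing either conclusion.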

The more useful reframe is that "how to reason" isn't one thing. RL can improve reasoning by *removing* rather than adding: it prunes low-reward trajectories that invoke wrong domain facts, which produced a +12.4 point knowledge improvement in medical reasoning by suppressing bad knowledge rather than teaching new knowledge (see "Does RL improve domain reasoning by adding knowledge or removing it?"). And it can teach genuinely new *process* skill when you reward the process directly: rewarding structured meta-reasoning tags (planning, exploration, reflection, monitoring) cut repetitive actions by 31% versus outcome-only rewards (see "Can RL agents learn to reason better, not just succeed?").

What resolves the tension is timing inside a single training run. Across eight models, RL follows a two-phase arc: first it consolidates execution correctness (the "how" of getting steps right), then the bottleneck shifts to strategic planning, meaning *when* and *whether* to explore, with planning-token entropy rising while execution entropy stabilizes (see "Does RL training follow a predictable two-phase learning sequence?"). That's why curricula that imitate first and explore second beat either alone: the imitation phase builds reasonable rollouts so the reward signal in the RL phase actually becomes informative (see "Does sequencing imitation then exploration training improve reasoning?").
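The entropy signal in the two-phase arc can be sketched as follows. This is an illustrative toy, not the method from the cited note: it assumes each generated token already carries a role label ("planning" or "execution") and a next-token probability distribution. How tokens get labeled is left open here, and the `EARLY` and `LATE` checkpoints are made-up numbers.

```python
import math
from collections import defaultdict


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy_by_role(steps):
    """Average token entropy per role.

    steps: iterable of (role, probs) pairs, where role is a hypothetical
    label such as "planning" or "execution".
    """
    grouped = defaultdict(list)
    for role, probs in steps:
        grouped[role].append(token_entropy(probs))
    return {role: sum(vals) / len(vals) for role, vals in grouped.items()}


# Made-up distributions for two checkpoints of the same run (assumption).
EARLY = [
    ("planning", [0.9, 0.1]),
    ("execution", [0.6, 0.4]),
    ("execution", [0.7, 0.3]),
]
LATE = [
    ("planning", [0.5, 0.5]),
    ("execution", [0.65, 0.35]),
    ("execution", [0.7, 0.3]),
]

if __name__ == "__main__":
    for name, steps in (("early", EARLY), ("late", LATE)):
        print(name, mean_entropy_by_role(steps))
```

In this toy, planning entropy rises between checkpoints while execution entropy barely moves, which is the shape of the pattern described above. Whether a real run shows it depends entirely on how planning tokens are identified.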

Here's the thing you might not have known you wanted: the "how" you'd expect RL to teach may largely come from *pre-training* exposure to procedural documents, meaning broad, transferable reasoning patterns absorbed from diverse sources, as opposed to the narrow, document-specific memorization behind factual recall (see "Does procedural knowledge drive reasoning more than factual retrieval?"). If that's right, the whole framing tilts toward "when": the procedural how is laid down early, and RL's job is to decide when to use it, except in the frontier domains where the base model never saw the pattern at all. And scale matters for whether "how" is even real: under RL, 7B models develop explicit, transferable belief-tracking, while smaller models reach comparable accuracy through shortcut learning that lacks any interpretable reasoning trace. A model can therefore look like it learned *how* while having only learned *when to guess*, and the gap is invisible unless you inspect its step-by-step outputs (see "Does reinforcement learning on theory of mind collapse with model scale?").


Sources (11 notes)

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does the choice of RL algorithm actually matter for reasoning?

Expert Iteration, PPO, and RC-RL perform comparably on reasoning because exploration is constrained by the pretrained distribution, not the optimizer. RL functions as selection, not discovery—the prior contains most solutions the algorithm will find.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does RL improve domain reasoning by adding knowledge or removing it?

RL enhances medical reasoning by suppressing incorrect domain knowledge during reasoning—not by expanding what models know. Evidence shows RL achieves +12.4 point knowledge improvement by removing low-reward reasoning trajectories that invoke wrong facts.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
