INQUIRING LINE

Does reinforcement learning teach models how to reason or when to reason?

This explores whether reinforcement learning actually builds new reasoning ability into a model, or just teaches it to deploy reasoning the model already had — the 'how' vs 'when' question.


The corpus leans hard toward the second answer, that RL teaches models *when* to reason (deploying capability that's already there) rather than *how* (creating new capability), though with interesting cracks. The most direct claim is that RL post-training optimizes deployment timing rather than reasoning itself: one study found that a hybrid combining a base model's reasoning with thinking-model steering recovered 91% of the performance gains using only 12% of the tokens, which reads as RL acting like a deployment optimizer, not a capability creator [Does RL teach reasoning or just when to use it?]. The supporting evidence is that the reasoning was already latent: five independent techniques (RL steering, critique fine-tuning, decoding tweaks, SAE feature steering, and RLVR) all elicit reasoning already present in base-model activations, suggesting post-training *selects* rather than *creates* [Do base models already contain hidden reasoning ability?].
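
One way to read "recovered 91% of the performance gains" is as a gap-recovery ratio: how much of the distance between the base model and the thinking model the hybrid closes. The sketch below assumes that definition, which the note itself does not spell out, and the accuracy values are hypothetical numbers chosen only to reproduce the 91% figure.

```python
def gap_recovered(base_acc: float, thinking_acc: float, hybrid_acc: float) -> float:
    """Fraction of the base-to-thinking accuracy gap closed by the hybrid.

    Assumed definition: (hybrid - base) / (thinking - base).
    """
    gap = thinking_acc - base_acc
    if gap <= 0:
        raise ValueError("thinking model must outperform the base model")
    return (hybrid_acc - base_acc) / gap


# Hypothetical accuracies, not taken from the study.
base, thinking, hybrid = 0.40, 0.80, 0.764
recovered = gap_recovered(base, thinking, hybrid)
print(f"gap recovered: {recovered:.0%}")
```

Under this reading, a hybrid that closes most of the gap while intervening on a small share of tokens is what suggests the base model already carried the reasoning.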

The sharpest version of 'when, not how' comes from work on RLVR (reinforcement learning with verifiable rewards). Pass@k analysis shows base models actually *outperform* RLVR-trained models at high k, meaning RLVR narrows sampling toward solutions already in the base distribution rather than expanding the set of solvable problems [Does RLVR actually expand what models can reason about?]. The same picture shows up in the finding that a single training example can suffice to 'activate' a strategy, and that even spurious rewards work nearly as well as correct ones for a well-pretrained model, a strong sign that the reward is flipping a switch, not teaching a skill [What does reward learning actually do to model reasoning?]. Distillation, by contrast, *does* transfer genuinely new reasoning patterns, which sharpens the contrast: RL routes, distillation teaches.
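
The pass@k argument can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator (the chance that at least one of k samples drawn from n attempts is correct, given c correct attempts). The per-problem counts are hypothetical, chosen to show the crossover pattern described above rather than to reproduce any reported result.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: P(at least one of k samples is correct),
    sampling without replacement from n attempts of which c are correct."""
    if n - c < k:
        return 1.0  # fewer than k wrong attempts, so any k-subset has a hit
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 256
# Hypothetical data: the base model is rarely right but can reach every problem;
# the RL-tuned model is very reliable on some problems and never solves others.
base_correct = [4, 2, 1, 3, 2]
rlvr_correct = [200, 180, 0, 220, 0]

for k in (1, 16, 256):
    b = mean_pass_at_k(base_correct, N, k)
    r = mean_pass_at_k(rlvr_correct, N, k)
    print(f"k={k:>3}  base={b:.3f}  rlvr={r:.3f}")
```

In this toy setup the RL-tuned model wins at k=1 and the base model wins at k=256, which is the signature of narrowed sampling rather than an expanded set of solvable problems.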

But the 'when' camp has a second, subtler meaning worth separating. Some work is literally about *when* in the sense of routing: should the model think hard or answer fast? Thinkless trains a single model to choose between extended reasoning and direct responses, using a decoupled RL objective (DeGRPO) that separates the mode-selection decision from answer refinement [Can models learn when to think versus respond quickly?]. That's 'when to reason' in the most concrete sense, and it's a useful capability even if RL isn't growing the underlying reasoning.
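
To illustrate why decoupling might matter, here is a minimal sketch contrasting a single length-normalized objective with one that scores the mode-selection token separately. It is a simplified policy-gradient surrogate under stated assumptions, not the Thinkless implementation: the function names, the `alpha` weight, and the exact normalization are hypothetical.

```python
def coupled_objective(control_logp, response_logps, advantage):
    """One token-averaged term: the control token shares a normalizer with
    every response token, so its weight shrinks as responses get longer."""
    tokens = [control_logp] + list(response_logps)
    return advantage * sum(tokens) / len(tokens)


def decoupled_objective(control_logp, response_logps, advantage, alpha=1.0):
    """Two separately normalized terms (alpha is an assumed weighting):
    mode selection (one control token) and answer refinement (response tokens)."""
    mode_term = alpha * advantage * control_logp
    answer_term = advantage * sum(response_logps) / len(response_logps)
    return mode_term + answer_term


def control_weight_coupled(response_len: int) -> float:
    """Share of the coupled objective carried by the control token."""
    return 1.0 / (response_len + 1)


# Short "answer directly" rollout vs long "think first" rollout.
print(control_weight_coupled(10))    # control token weight with 10 response tokens
print(control_weight_coupled(1000))  # far smaller with 1000 response tokens
```

In the coupled form the mode decision gets a much weaker learning signal on long reasoning traces than on short answers; giving it a separate term keeps that signal independent of response length, which is one plausible reading of how the decoupling avoids collapsing into a single mode.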

The counter-current is real, though. Several notes argue RL *can* build something. RL is described as an 'emergence engine' where sophisticated domain reasoning arises from simple accuracy rewards on hard problems, no teacher distillation required [Can simple rewards alone teach complex domain reasoning?]. RLAG claims to embed domain knowledge more effectively than supervised fine-tuning by rewarding explanation quality, not just token correctness [Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?]. And a fascinating reframe is that the 'how vs when' dichotomy may be a training-stage artifact: if you push reasoning rewards into *pretraining*, either by treating chain-of-thought as an exploratory action with an information-gain reward or by reframing next-token prediction itself as a verifiable reasoning task, you may actually plant the capability earlier rather than just eliciting it later [Can chain-of-thought reasoning be learned during pretraining itself?] [Can next-token prediction become a reasoning task with RL?].
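
The information-gain reward mentioned above can be sketched in a few lines. This assumes the reward is the improvement in log-likelihood of the observed next token when the model conditions on its own chain-of-thought versus when it does not; the probabilities are hypothetical, and a real setup would read them from model outputs.

```python
import math


def information_gain_reward(p_with_thought: float, p_without_thought: float) -> float:
    """Log-likelihood improvement on the true next token from conditioning on a
    sampled chain-of-thought. Positive means the thought helped prediction."""
    return math.log(p_with_thought) - math.log(p_without_thought)


# Hypothetical probabilities assigned to the actual next token.
helpful = information_gain_reward(p_with_thought=0.6, p_without_thought=0.2)
harmful = information_gain_reward(p_with_thought=0.1, p_without_thought=0.2)
print(f"helpful thought reward: {helpful:+.3f}")
print(f"harmful thought reward: {harmful:+.3f}")
```

Because the signal comes from the corpus itself, no external verifier is needed: a thought is rewarded only to the extent that it makes the real continuation more predictable.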

The thing you might not have known you wanted: the answer may be *both, in sequence*. One study tracked eight models through RL training and found a consistent two-phase dynamic. First the model consolidates execution correctness (procedural 'how'); then the bottleneck shifts to strategic planning ('when and what to attempt'), with planning-token entropy rising as execution entropy stabilizes [Does RL training follow a predictable two-phase learning sequence?]. Metacognition work pushes further still, using process rewards on tagged planning, exploration, reflection, and monitoring steps to teach *better* reasoning behavior, not just deployment, and cutting repetitive actions by 31% relative to outcome-only methods [Can RL agents learn to reason better, not just succeed?]. So the honest synthesis: with verifiable rewards on a strong base model, RL mostly teaches *when*; but with richer reward shaping, process supervision, or relocation into pretraining, the line between 'when' and 'how' starts to blur.
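
The two-phase claim rests on tracking token-level entropy separately for planning and execution tokens. A minimal sketch of that measurement follows, assuming each generated token comes with a role label and a next-token distribution. The role labels (`"plan"`, `"exec"`) and the distributions are hypothetical, and how the study actually tags tokens is not specified here.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy_by_role(steps):
    """steps: iterable of (role, next_token_distribution) pairs.
    Returns the average entropy for each role."""
    buckets = defaultdict(list)
    for role, probs in steps:
        buckets[role].append(entropy(probs))
    return {role: sum(vals) / len(vals) for role, vals in buckets.items()}


# Hypothetical late-training snapshot: planning choices stay open,
# execution steps have become near-deterministic.
steps = [
    ("plan", [0.25, 0.25, 0.25, 0.25]),
    ("exec", [0.97, 0.01, 0.01, 0.01]),
    ("exec", [0.94, 0.02, 0.02, 0.02]),
    ("plan", [0.40, 0.30, 0.20, 0.10]),
]
print(mean_entropy_by_role(steps))
```

Logging these two averages across training checkpoints is one way the reported pattern, rising planning entropy alongside stable execution entropy, could be observed.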


Sources (11 notes)

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can simple rewards alone teach complex domain reasoning?

Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can next-token prediction become a reasoning task with RL?

Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.
