INQUIRING LINE

Can RL training teach models when to activate reasoning versus when to skip it?

This explores whether RL training's real job is timing — deciding *when* a model should engage in deliberate reasoning versus answer directly — rather than building new reasoning skill from scratch.


The corpus makes a surprisingly strong case for "yes, mostly." A cluster of work argues that base models already carry reasoning ability in latent form, and that RL doesn't manufacture it so much as decide when to fire it. The clearest demonstration: a hybrid model recovered 91% of a thinking model's performance gains while using only 12% of the tokens, by routing *which* tokens get the reasoning treatment rather than retraining the underlying capability (Does RL post-training create reasoning or just deploy it?; Does RL teach reasoning or just when to use it?). Five independent methods (RL steering, critique fine-tuning, decoding tweaks, SAE feature steering, and RLVR) all elicit reasoning that was already sitting in the base model's activations, which reframes the whole problem as one of *elicitation timing*, not skill acquisition (Do base models already contain hidden reasoning ability?).

If RL is a deployment switch, you'd expect to be able to flip it without retraining at all, and you can. Verbose and concise chain-of-thought turn out to occupy distinct, linearly separable regions of activation space, so a single steering vector extracted from 50 paired examples cuts reasoning length by 67% while holding accuracy (Can we steer reasoning toward brevity without retraining?). That's the "when to skip it" half of your question made concrete: brevity is a direction you can dial, not a behavior you must teach. The RLVR-dynamics work points the same way: reward learning improves *sampling efficiency* inside existing capability boundaries, a single training example suffices to activate it, and even spurious rewards work nearly as well as correct ones for models with appropriate pretraining. That pattern only makes sense if RL is selecting a pre-existing mode rather than instilling one (What does reward learning actually do to model reasoning?).
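To make the "direction you can dial" idea concrete, here is a minimal sketch of one common way to build a steering vector: take the difference between the mean activations of the two classes. This is an assumption for illustration, not necessarily the extraction procedure the cited note uses. The activations are synthetic random data, and the names `steering_vector`, `steer`, and `strength` are hypothetical.

```python
import numpy as np

def steering_vector(verbose_acts: np.ndarray, concise_acts: np.ndarray) -> np.ndarray:
    """Difference of class means, pointing from verbose toward concise."""
    return concise_acts.mean(axis=0) - verbose_acts.mean(axis=0)

def steer(hidden: np.ndarray, vector: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Shift hidden states along the steering direction."""
    return hidden + strength * vector

# Synthetic stand-in for activations from 50 paired examples (not real model data).
rng = np.random.default_rng(0)
dim = 16
axis = np.zeros(dim)
axis[0] = 1.0  # hypothetical "brevity" direction planted in the toy data
verbose = rng.normal(size=(50, dim)) - 2.0 * axis
concise = rng.normal(size=(50, dim)) + 2.0 * axis

v = steering_vector(verbose, concise)
steered = steer(verbose, v)  # verbose states pushed toward the concise region
```

With `strength=1.0` the mean of the steered verbose states lands on the concise mean; smaller or larger values move them part of the way or past it, which is the sense in which brevity behaves like a dial. In a real model the shift would be applied to a chosen layer's hidden states during generation, a step this sketch leaves out.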

But the corpus doesn't let the "just timing" story win cleanly, and that tension is the interesting part. Prolonged RL on diverse, non-mathematical tasks, with KL control and policy resetting, discovers novel strategies that base models fail to reach at any sampling budget tested, outperforming them across all pass@k levels (Can reinforcement learning discover reasoning strategies base models cannot?). So whether RL is a switch or a teacher seems to depend on the domain: where the base model already has patterns, RL optimizes deployment; where it doesn't, RL can expand the boundary. Reasoning can even be planted earlier, by treating chain-of-thought as an exploratory action *during pretraining* with an information-gain reward (Can chain-of-thought reasoning be learned during pretraining itself?).
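The switch-versus-teacher distinction rests on how pass@k behaves as k grows, so a small sketch of the metric may help. The function below uses the standard unbiased combinatorial estimator of pass@k from n samples with c correct. The notes do not say which estimator the cited studies use, so treat that choice, and the name `pass_at_k`, as assumptions.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given that c of n drawn samples were correct (requires k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Rarely solved problem: low pass@1, much higher pass@k at large k.
rare_at_1 = pass_at_k(100, 2, 1)
rare_at_50 = pass_at_k(100, 2, 50)

# Never solved problem: stays at zero at every sampling budget.
never_at_50 = pass_at_k(100, 0, 50)
```

A problem the base model solves rarely is one where RL can raise pass@1 without adding capability, because more sampling would have found the answer anyway. A problem with zero correct samples stays at zero for every k, so a trained model that solves it has gained something sampling alone could not supply.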

The deepest answer to "when to activate versus skip" comes from work that treats the decision as something the model itself can be trained to manage. Meta-reasoning rewards tag and reward the acts of planning, exploring, reflecting, and monitoring, cutting repetitive actions by 31% relative to outcome-only training by teaching the agent *when* each cognitive move is worth making (Can RL agents learn to reason better, not just succeed?). RL training even reverses the *effect* of thinking: vanilla models often use extended thinking counterproductively, spiraling into self-doubt, and RL redirects the same mechanism toward productive gap analysis, evidence that training governs *how thinking is used*, not just whether it happens (Does extended thinking help or hurt model reasoning?). And RL learning itself unfolds in two phases, first nailing execution and then shifting the bottleneck to strategic planning, which mirrors the activate-versus-skip distinction inside the training process: models master *how* to execute before they master *when* to deploy (Does RL training follow a predictable two-phase learning sequence?).

The thing you didn't know you wanted to know: the "when to skip reasoning" decision may be cheaper and more steerable than the reasoning itself — a single vector, a handful of examples, a routing rule — which suggests the next gains in efficient AI come less from smarter models and more from better-calibrated switches.


Sources (10 notes)

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving a 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as an exploratory action during pretraining, using log-likelihood improvement as a verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Next inquiring lines