INQUIRING LINE

Does reinforcement learning learn optimal per-turn reasoning discipline?

This explores whether RL actually teaches a model good judgment about *when and how much* to reason at each step — calibrated reasoning discipline — rather than just nudging it toward correct final answers.


This reads the question as asking whether RL instills genuine per-turn reasoning discipline (knowing when to think hard, when to stop, and how to plan) rather than simply rewarding outcomes. The corpus suggests RL *can* shape this discipline, but only partially, and rarely 'optimally.' The most pointed caution comes from work showing that reward-based training mostly sharpens what's already there: RLVR improves sampling efficiency within a model's existing capability boundary rather than expanding it, with base models actually outperforming RLVR models at high sampling budgets (Does RLVR actually expand what models can reason about?). One training example can suffice to 'activate' a strategy, and for models with appropriate pretraining even spurious rewards work nearly as well as correct ones (What does reward learning actually do to model reasoning?), which suggests RL is selecting pre-existing reasoning behavior more than teaching new discipline (Do base models already contain hidden reasoning ability?).
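The pass@k comparison behind that caution can be made concrete with a small sketch. The estimator below is the standard unbiased pass@k formula; the per-problem counts and the names `tuned_correct`, `base_correct`, and `mean_pass_at_k` are made up for illustration and are not results from the cited notes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical correct counts per problem out of N samples (illustrative only).
N = 100
tuned_correct = [90, 85, 0, 0]  # reliable on two problems, never solves two
base_correct = [30, 25, 2, 1]   # less reliable, but solves each one sometimes

def mean_pass_at_k(correct_counts, k, n=N):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

if __name__ == "__main__":
    for k in (1, 10, 100):
        print(k, mean_pass_at_k(tuned_correct, k), mean_pass_at_k(base_correct, k))
```

With these invented counts the tuned model leads at k = 1 and the base model leads at k = 100, which is the shape of the crossover the note describes: higher sampling efficiency, but no larger set of solvable problems.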

But 'discipline' isn't only about capability. It is also about routing and pacing, and here RL shows more genuine learning. Decoupled RL can teach a single model to choose between extended thinking and a quick answer without difficulty labels, self-calibrating when reasoning is even worth it (Can models learn when to think versus respond quickly?). And when you reward the *process* rather than just the answer, by tagging planning, exploration, reflection, and monitoring steps, agents take 31% fewer repetitive actions than under outcome-only methods while generalizing better than supervised fine-tuning alone (Can RL agents learn to reason better, not just succeed?). That's per-turn discipline being directly shaped by the reward signal, not as a side effect.
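One plausible way to quantify the "repetitive, wasteful actions" mentioned above is the share of steps that repeat an earlier (state, action) pair within an episode. This is only a sketch of such a metric: the function name `repeated_action_rate` and the episode format are assumptions, and the cited work may define repetition differently.

```python
from typing import Hashable, Sequence, Tuple

def repeated_action_rate(steps: Sequence[Tuple[Hashable, Hashable]]) -> float:
    """Fraction of steps whose (state, action) pair already occurred earlier."""
    seen = set()
    repeats = 0
    for pair in steps:
        if pair in seen:
            repeats += 1      # same action tried again in the same state
        else:
            seen.add(pair)
    return repeats / len(steps) if steps else 0.0

# Hypothetical episode: the second step repeats the first.
episode = [
    ("kitchen", "open fridge"),
    ("kitchen", "open fridge"),
    ("kitchen", "take milk"),
    ("hall", "go north"),
]

if __name__ == "__main__":
    print(repeated_action_rate(episode))
```

Comparing this rate between an outcome-only policy and a process-rewarded policy on the same tasks would give a relative reduction of the kind the note reports.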

The most interesting wrinkle is *where* the discipline lives across training. RL doesn't learn everything at once: across eight models, training moves through two phases. Execution correctness (getting the procedure right) drives learning first; then strategic planning becomes the bottleneck, with the biggest late gains coming from concentrating optimization on planning tokens (Does RL training follow a predictable two-phase learning sequence?). So 'per-turn discipline' is really two different skills learned in sequence, and a naive outcome reward may stall once it hits the planning phase.
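The source note for this finding describes the two phases in terms of token entropy: planning-token entropy rises while execution-token entropy stabilizes. A minimal sketch of how that split could be tracked is below. It assumes each position already carries a role label; the labels "planning" and "execution" and the function `mean_entropy_by_role` are hypothetical, and how tokens get assigned to roles is not specified here.

```python
import math
from typing import Dict, Sequence

def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy_by_role(
    dists: Sequence[Sequence[float]], roles: Sequence[str]
) -> Dict[str, float]:
    """Average next-token entropy separately for each role label."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for probs, role in zip(dists, roles):
        totals[role] = totals.get(role, 0.0) + entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}

# Toy distributions over a 2-token vocabulary (illustrative only).
dists = [[0.5, 0.5], [1.0, 0.0], [0.5, 0.5]]
roles = ["planning", "execution", "planning"]

if __name__ == "__main__":
    print(mean_entropy_by_role(dists, roles))
```

Logging these two averages at each checkpoint is one way to see when the planning curve starts to diverge from the execution curve.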

That stall is the limit of pure reward signals. Numerical rewards carry no information about *why* a step failed, so models plateau; feeding back natural-language critiques instead of scalars breaks through those plateaus (Can natural language feedback overcome numerical reward plateaus?). This is the strongest evidence against 'optimal': a bare reward can tell a model it was wrong, not how to be more disciplined next turn. Related work pushes the discipline earlier, rewarding reasoning during pretraining via information-gain or next-token reformulations (Can chain-of-thought reasoning be learned during pretraining itself?; Can next-token prediction become a reasoning task with RL?), or sidesteps RL entirely with modular cognitive tools that enforce step isolation through structure rather than reward (Can modular cognitive tools unlock reasoning without training?).
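The information-gain idea mentioned above, where log-likelihood improvement serves as a verifier-free reward, reduces to a difference of two log-probabilities. The sketch below shows only that arithmetic. The function name and the toy probabilities are assumptions, and the cited method's actual baseline and scoring details are not reproduced here.

```python
import math

def information_gain_reward(logp_with_thought: float, logp_without_thought: float) -> float:
    """Reward = how much a sampled chain of thought raises the
    log-likelihood of the observed next token."""
    return logp_with_thought - logp_without_thought

# Toy numbers (illustrative only): the thought lifts the probability
# of the true next token from 0.2 to 0.6.
helpful = information_gain_reward(math.log(0.6), math.log(0.2))
# A distracting thought that lowers it from 0.2 to 0.1 earns a negative reward.
harmful = information_gain_reward(math.log(0.1), math.log(0.2))

if __name__ == "__main__":
    print(helpful, harmful)
```

Because the signal comes from the corpus token itself, no external verifier is needed, though like any scalar it still says nothing about why a given thought helped or hurt.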

The surprise worth carrying away: the best per-turn discipline in this corpus doesn't come from better outcome rewards at all — it comes from changing *what you reward* (the reasoning process, the mode choice) or *what you feed back* (language, not numbers). RL learns discipline when the signal is itself disciplined; given only a thumbs-up on the final answer, it mostly polishes habits the base model already had.


Sources (10 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can next-token prediction become a reasoning task with RL?

Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
