What does RL post-training actually teach reasoning systems?

This explores what RL post-training (the reinforcement-learning step applied after pretraining, including reward-verified variants like RLVR) actually does to a reasoning model — whether it builds new reasoning ability or just reshapes what's already there.

The corpus's most striking move is to reject the obvious framing ("RL teaches the model to reason") and replace it with a sharper one: RL mostly teaches a model *when* to reason, not *how*. Several notes converge on the idea that the base model already carries reasoning strategies in latent form, and RL surfaces and routes them. One hybrid setup recovered 91% of the performance gains using only 12% of the tokens by steering the base model's existing reasoning rather than building anything new (see "Does RL teach reasoning or just when to use it?" and "Does RL post-training create reasoning or just deploy it?"). In this telling, verifiable rewards act as a catalyst that activates pretraining strategies, not as a teacher: a single training example can trigger the activation and, for models with appropriate pretraining, *spurious* rewards work nearly as well as correct ones (see "How does RL training reshape reasoning and what gets lost?" and "What does reward learning actually do to model reasoning?").

What makes this interesting is the disagreement underneath it. A second cluster argues RL does *not* expand what a model can solve: pass@k analysis shows base models overtaking RL-trained ones at high sampling budgets, meaning RL narrows the model toward solutions already in its distribution rather than discovering new ones. Distillation, by contrast, genuinely transfers new patterns (see "Does RLVR actually expand what models can reason about?"). But a third cluster pushes back hard: with KL control, policy resetting, and training on diverse non-mathematical tasks, *prolonged* RL discovers genuinely novel strategies and beats the base model at *every* pass@k level (see "Can reinforcement learning discover reasoning strategies base models cannot?"). The reconciling note suggests the answer is domain-conditional: for standard reasoning, RL just activates latent ability; for complex multi-step planning, it generates strategies the base model can't reach even with heavy sampling (see "Does reinforcement learning create new reasoning abilities or activate existing ones?").
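The pass@k comparison at the heart of this disagreement can be made concrete. The sketch below uses the standard unbiased estimator (with n samples per problem, c of them correct, pass@k = 1 - C(n-c, k) / C(n, k)). The per-problem counts are invented purely to illustrate the crossover pattern and are not taken from any of the notes; the names `BASE_CORRECT` and `RL_CORRECT` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples,
    drawn without replacement from n attempts with c correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 100, for four problems.
N_SAMPLES = 100
BASE_CORRECT = [2, 1, 3, 1]   # rarely right, but right somewhere on every problem
RL_CORRECT = [60, 0, 70, 0]   # often right on two problems, never on the others

if __name__ == "__main__":
    for k in (1, 8, 64):
        base = mean_pass_at_k(BASE_CORRECT, N_SAMPLES, k)
        rl = mean_pass_at_k(RL_CORRECT, N_SAMPLES, k)
        print(f"k={k:>2}  base={base:.3f}  rl={rl:.3f}")
```

With these toy counts the RL-like profile wins at k = 1 and the base-like profile wins at k = 64, which is the shape the second cluster reports. The third cluster's claim corresponds to an RL profile that stays ahead at every k.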

The mechanism work is where you'll learn something you didn't know to ask about. Inside the weights, RL turns out to be mostly *subtractive*: it sparsely updates only 5–30% of parameters, and its primary lever is suppressing wrong trajectories rather than amplifying right ones (see "What actually changes inside a model during RL training?"). A medical-reasoning study makes this concrete: RL produced a 12.4-point knowledge improvement not by adding knowledge but by *pruning* trajectories that invoked incorrect facts (see "Does RL improve domain reasoning by adding knowledge or removing it?"). So "teaching reasoning" often looks more like teaching the model to stop saying wrong things.
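One way to picture the sparsity claim is to compare a base checkpoint with its RL-tuned counterpart and count how many scalar parameters moved. The notes do not say how the 5–30% figure was measured, so the sketch below is only an assumption about what such a measurement could look like: checkpoints are represented as plain dictionaries of flat float lists, and the layer names, values, and `tol` threshold are all hypothetical.

```python
def update_fraction(base: dict[str, list[float]],
                    tuned: dict[str, list[float]],
                    tol: float = 0.0) -> float:
    """Fraction of scalar parameters whose value moved by more than tol."""
    changed = 0
    total = 0
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for {name}")
        # Count entries that differ by more than the tolerance.
        changed += sum(abs(t - b) > tol for b, t in zip(base_values, tuned_values))
        total += len(base_values)
    if total == 0:
        raise ValueError("no parameters to compare")
    return changed / total


# Toy checkpoints: hypothetical names and values, 10 parameters in total.
BASE = {"layer0.weight": [0.10, -0.20, 0.30, 0.40, 0.05],
        "layer1.weight": [0.00, 0.15, -0.25, 0.35, 0.45]}
TUNED = {"layer0.weight": [0.10, -0.20, 0.31, 0.40, 0.05],
         "layer1.weight": [0.00, 0.15, -0.25, 0.35, 0.40]}

if __name__ == "__main__":
    print(update_fraction(BASE, TUNED))
    print(update_fraction(BASE, TUNED, tol=0.02))
```

A real comparison would iterate over tensors rather than Python lists, and the choice of tolerance matters because floating-point noise can make untouched parameters look changed.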

There's also a temporal shape to it. Across eight models, RL follows a consistent two-phase arc: first it consolidates *procedural* correctness (getting execution right), then the bottleneck shifts to *strategic* planning, where planning-token entropy rises and concentrating optimization there yields the real gains (see "Does RL training follow a predictable two-phase learning sequence?"). This hints at why outcome-only rewards leave value on the table, and why some researchers reward the *process* directly: tagging planning, exploration, reflection, and monitoring as verifiable metacognitive steps cuts repetitive actions by 31% relative to outcome-only methods while generalizing better than supervised fine-tuning alone (see "Can RL agents learn to reason better, not just succeed?").
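The entropy signal in the two-phase account is ordinary Shannon entropy of the model's next-token distribution, averaged separately over planning and execution tokens. The sketch below shows that bookkeeping under two assumptions: the role label for each step is supplied by hand (the notes do not describe how tokens are classified), and the probabilities in `STEPS` are invented for illustration.

```python
from collections import defaultdict
from math import log


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return sum(-p * log(p) for p in probs if p > 0.0)


def mean_entropy_by_role(steps: list[tuple[str, list[float]]]) -> dict[str, float]:
    """Average entropy per role label; each step is (role, next-token probs)."""
    grouped = defaultdict(list)
    for role, probs in steps:
        grouped[role].append(entropy(probs))
    return {role: sum(vals) / len(vals) for role, vals in grouped.items()}


# Hypothetical decoding steps with hand-assigned role labels.
STEPS = [
    ("planning", [0.4, 0.3, 0.2, 0.1]),   # several plausible next moves
    ("planning", [0.5, 0.5]),
    ("execution", [0.97, 0.02, 0.01]),    # near-deterministic continuation
    ("execution", [1.0]),
]

if __name__ == "__main__":
    for role, value in mean_entropy_by_role(STEPS).items():
        print(f"{role}: {value:.3f} nats")
```

Tracking these two averages across training checkpoints is what would reveal the reported pattern of rising planning entropy alongside stable execution entropy.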

If you want a doorway into the methodological frontier, look at how the field is loosening RL's dependence on verification itself: VeriFree replaces answer-checking with the model's own probability of the reference answer given its reasoning trace, matching or surpassing verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA (see "Can reasoning improvement work without answer verification?"). Taken together, the corpus suggests RL post-training is less a reasoning *teacher* and more a deployment optimizer, a trajectory pruner, and, at the edges on hard planning, occasionally a genuine inventor.
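The verifier-free idea attributed to VeriFree above reduces to one quantity: how probable the reference answer is once the model has written its reasoning trace. The sketch below covers only that scoring step, assuming the per-token log-probabilities of the reference answer have already been obtained from the policy. It is not the method's full training objective, and the trace ids and numbers are hypothetical.

```python
from math import exp


def reference_answer_probability(answer_token_logprobs: list[float]) -> float:
    """p(reference answer | question, reasoning trace), computed as the
    exponential of the summed per-token log-probabilities."""
    if not answer_token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    return exp(sum(answer_token_logprobs))


def score_traces(traces: dict[str, list[float]]) -> dict[str, float]:
    """Map each trace id to its verifier-free reward."""
    return {tid: reference_answer_probability(lps) for tid, lps in traces.items()}


# Hypothetical log-probs the policy assigns to the reference answer's tokens
# after each sampled reasoning trace (invented numbers, not model output).
TRACES = {
    "trace_a": [-0.05, -0.10, -0.02],  # trace leads confidently to the reference
    "trace_b": [-1.20, -2.30, -0.90],  # trace makes the reference unlikely
}

if __name__ == "__main__":
    for trace_id, reward in score_traces(TRACES).items():
        print(f"{trace_id}: reward={reward:.4f}")
```

No answer parser or judge model appears anywhere in this loop, which is the point: the reward is a continuous probability rather than a pass/fail check.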

Sources (12 notes)

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does RL improve domain reasoning by adding knowledge or removing it?

RL enhances medical reasoning by suppressing incorrect domain knowledge during reasoning, not by expanding what models know. Evidence shows RL achieves a 12.4-point knowledge improvement by removing low-reward reasoning trajectories that invoke wrong facts.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
