INQUIRING LINE

Can reinforcement learning fix the reasoning gaps that supervised fine-tuning misses?

This explores whether reinforcement learning genuinely repairs the reasoning that supervised fine-tuning (SFT) tends to fake — and the corpus complicates the premise: RL fixes some gaps SFT misses, but mostly by *eliciting* reasoning the base model already has, not by inventing new reasoning.


Start with the diagnosis of the gap itself. SFT has an "accuracy trap": it lifts benchmark scores while cutting the quality of the reasoning steps, measured as Information Gain, by nearly 39%, because models learn to reach correct final answers through post-hoc rationalization rather than genuine inference (Does supervised fine-tuning improve reasoning or just answers?). So the thing SFT "misses" isn't accuracy; it's the inferential chain underneath. That reframes the whole question: can RL restore the reasoning that SFT hollows out?

The optimistic answer is yes. Reinforcement learning from augmented generation (RLAG) rewards explanation quality alongside answer accuracy, not just token-level correctness, so it internalizes coherent knowledge structures where SFT only memorizes (Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?). RL can also make complex domain reasoning *emerge* from simple accuracy rewards alone, with no teacher-distilled chain-of-thought required (Can simple rewards alone teach complex domain reasoning?). And richer feedback breaks through plateaus that purely numerical rewards can't: when a model gets stuck, natural-language critiques tell it *why* it failed and *how* to improve, information a scalar reward simply doesn't carry (Can natural language feedback overcome numerical reward plateaus?).

But here's the turn. A strong line of the corpus argues that RL isn't creating reasoning at all; it's selecting reasoning that already exists. Reinforcement learning with verifiable rewards (RLVR) doesn't expand the boundary of solvable problems: at high sampling counts the *base* model actually outperforms the RL-tuned one, meaning RL narrows the model toward solutions already in its distribution rather than adding new ones (Does RLVR actually expand what models can reason about?). The same picture appears from a different angle: a single training example, or even spurious rewards, can trigger the gains, because the work is activation, not teaching (What does reward learning actually do to model reasoning?). Five independent mechanisms converge on the conclusion that post-training *elicits* latent capability rather than acquiring it (Do base models already contain hidden reasoning ability?). Put bluntly: RL post-training teaches a model *when* to reason, not *how*; hybrid models recover 91% of the gains by routing tokens alone (Does RL post-training create reasoning or just deploy it?).
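The pass@k comparison behind this claim can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. The per-problem counts below are invented for illustration and do not come from the corpus; they are chosen only to show how a model that is reliable on fewer problems can lead at low k and trail at high k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N_SAMPLES = 100
# Hypothetical correct counts per problem (assumption, not measured data).
base_counts = [30, 20, 5, 2, 1]  # rarely right, but non-zero on every problem
rl_counts = [90, 80, 60, 0, 0]   # reliable on three problems, never solves two

for k in (1, 10, 100):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    rl = mean_pass_at_k(rl_counts, N_SAMPLES, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

With these made-up counts the RL-tuned model is ahead at k=1 and the base model is ahead at k=100, which is the shape of the crossover the note describes. Whether a real model pair behaves this way is an empirical question the sketch does not settle.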

So the honest synthesis is layered. If the "gap" is a capability the base model never had, RL won't conjure it: distillation transfers genuinely new reasoning patterns, and RL doesn't (Does RLVR actually expand what models can reason about?). But if the gap is reasoning that's present yet unreliably deployed, exactly the failure SFT's rationalization shortcut produces, RL is well-suited, because it works in two phases: first consolidating procedural execution, then optimizing the strategic planning that's the real bottleneck (Does RL training follow a predictable two-phase learning sequence?). Decomposing rewards into verifiable sub-criteria sharpens the signal further on subjective tasks, where holistic reward models overfit to surface artifacts (Can breaking down instructions into checklists improve AI reward signals?).
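To illustrate what "verifiable sub-criteria" might look like, here is a minimal sketch of a checklist-style reward. It is not the implementation of the RLCF or RaR methods named in the sources, which may score criteria quite differently; the `Criterion` type, the example checks, and the weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One independently checkable requirement (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]
    weight: float = 1.0


def checklist_reward(response: str, criteria: list[Criterion]) -> float:
    """Weighted fraction of criteria the response satisfies, in [0, 1]."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("criteria must carry positive total weight")
    earned = sum(c.weight for c in criteria if c.check(response))
    return earned / total


# Illustrative checklist for an instruction such as
# "Summarize in under 20 words, name the method, and cite a number."
criteria = [
    Criterion("under_20_words", lambda r: len(r.split()) < 20),
    Criterion("names_method", lambda r: "rlvr" in r.lower()),
    Criterion("cites_a_number", lambda r: any(ch.isdigit() for ch in r)),
]

response = "RLVR narrows sampling toward answers the base model already finds."
score = checklist_reward(response, criteria)
print(f"reward = {score:.3f}")
```

Each criterion can be inspected and failed on its own, so the score cannot be raised by surface polish that satisfies none of the checks. That is the property the decomposition is meant to buy over a single holistic score.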

The most interesting threads push the fix *earlier* than the SFT-vs-RL fight entirely. One argues reasoning should be planted during pretraining, treating chain-of-thought as an exploratory action rewarded by information gain, which lifts reasoning ~19% before any post-training (Can chain-of-thought reasoning be learned during pretraining itself?). Another notes the deeper ceiling: agents trained on static expert data, SFT's home turf, are capped by what curators imagined, because they never interact with an environment or learn from their own failures, which is precisely what RL's trial-and-error loop provides (Can agents learn beyond what their training data shows?). The takeaway: RL fixes the *deployment and elicitation* gaps SFT misses, but the boundary of what a model can reason about lives upstream, in the base model and in pretraining.
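The "rewarded by information gain" idea can be sketched in a few lines, assuming the reward is simply the improvement in log-likelihood of the observed next tokens when a chain-of-thought is placed in context. This is a simplification for illustration, not the RLP method's actual formulation; the function names and the probabilities are hypothetical.

```python
import math
from typing import Sequence


def log_likelihood(token_probs: Sequence[float]) -> float:
    """Sum of log-probabilities the model assigned to the observed tokens."""
    return sum(math.log(p) for p in token_probs)


def information_gain_reward(
    probs_with_cot: Sequence[float],
    probs_without_cot: Sequence[float],
) -> float:
    """Log-likelihood improvement from conditioning on a chain-of-thought.

    Positive when the thought made the observed tokens more predictable,
    negative when it made them less predictable. No verifier is needed.
    """
    if len(probs_with_cot) != len(probs_without_cot):
        raise ValueError("both passes must score the same observed tokens")
    return log_likelihood(probs_with_cot) - log_likelihood(probs_without_cot)


# Hypothetical per-token probabilities for the same two observed tokens.
helpful = information_gain_reward([0.5, 0.5], [0.25, 0.25])
unhelpful = information_gain_reward([0.1, 0.2], [0.25, 0.25])
print(f"helpful thought:   {helpful:+.3f}")
print(f"unhelpful thought: {unhelpful:+.3f}")
```

Because the signal comes from ordinary text rather than a graded answer, a reward of this shape can in principle be computed on any pretraining corpus, which is what lets the fix move ahead of post-training.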


Sources (12 notes)

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can simple rewards alone teach complex domain reasoning?

Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Next inquiring lines