How does Supervised RL bridge the gap between SFT and RLVR?
This explores how a middle training step — Supervised RL (SRL), an imitation phase that learns from worked examples but is shaped by reward — fixes the specific failure each of the two standard methods has when used alone: SFT copies surface form without reasoning, and RLVR can't get traction when the model never stumbles onto a correct answer to reward.
Supervised RL sits between plain supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), and the corpus frames the bridge as solving a chicken-and-egg problem that each pure method runs into. Start with what each end gets wrong on its own. SFT teaches a model what good answers look like, but the lesson stops at the surface: on optimization problems, fine-tuned models produce clean JSON, valid identifiers, and the right section headings while still violating the actual constraints. They learn the costume of a solution, not the reasoning to build one (see: Does supervised fine-tuning actually improve reasoning on optimization problems?). RLVR has the opposite shape of failure. It rewards only verifiably correct outcomes, so it works well when the model already lands on correct answers sometimes, and goes silent when it never does. Even when it works, it mostly sharpens sampling toward solutions already in the base model's repertoire rather than teaching genuinely new reasoning (see: Does RLVR actually expand what models can reason about?; What does reward learning actually do to model reasoning?).
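The "sharpens sampling" claim is typically measured with pass@k, as the source notes mention. The sketch below uses the standard unbiased pass@k estimator; the per-problem counts are hypothetical numbers chosen only to show how a model can win at k=1 and lose at large k, and are not results from the cited notes.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which were correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: 100 samples per problem, four problems.
N_SAMPLES = 100
base_counts = [5, 5, 5, 5]    # base model: rarely right, but on every problem
rlvr_counts = [90, 90, 0, 0]  # tuned model: reliable on two, never right on two

for k in (1, 64):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    rlvr = mean_pass_at_k(rlvr_counts, N_SAMPLES, k)
    print(f"k={k}: base={base:.3f} rlvr={rlvr:.3f}")
```

In this toy setup the tuned model is ahead at k=1 and behind at k=64, which is the crossover pattern the pass@k note describes.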
That gap is exactly where SRL lives. The curriculum result is the clearest statement: running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, beats either method used alone, because the imitation phase 'makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen' (see: Does sequencing imitation then exploration training improve reasoning?). In other words, SRL manufactures the precondition RLVR silently assumes. RLVR needs the model to occasionally succeed so that there's something to reward; SRL gets it producing plausible reasoning trajectories first, so the reward signal stops being all-zeros.
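A back-of-the-envelope sketch of why the all-zeros problem bites: assuming a binary verifiable reward and independent rollouts (both simplifying assumptions, as are the function names and the example probabilities), the chance that a batch of rollouts contains anything to reward depends sharply on the per-rollout success rate.

```python
import math


def prob_any_success(p_correct: float, num_rollouts: int) -> float:
    """Chance that at least one of num_rollouts independent rollouts is correct."""
    if not 0.0 <= p_correct <= 1.0 or num_rollouts < 1:
        raise ValueError("need 0 <= p_correct <= 1 and num_rollouts >= 1")
    return 1.0 - (1.0 - p_correct) ** num_rollouts


def rollouts_needed(p_correct: float, target: float = 0.5) -> int:
    """Smallest rollout count whose chance of any success reaches target."""
    if not 0.0 < p_correct < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p_correct and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_correct))


# Hypothetical success rates before and after an imitation phase.
for label, p in (("cold start", 1e-4), ("after imitation", 0.05)):
    print(label, prob_any_success(p, 8), rollouts_needed(p))
```

With a near-zero success rate, thousands of rollouts are needed for an even chance of a single rewarded sample; a modest success rate brings that down to a handful, which is the sense in which the imitation phase makes outcome rewards informative.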
The corpus also explains why you can't just do more of either. Pure RLVR tends to narrow rather than broaden: its on-policy nature pushes exploitation over exploration, collapsing the model's problem-solving scope ('capability boundary collapse'). Feeding it problems that are too hard makes this worse, since rare accidental wins get treated as high-value and the model learns shortcuts and answer repetition instead of reasoning (see: Why does RLVR training narrow a model's problem solving ability?; Do overly hard RLVR samples actually harm model capabilities?). Naively bolting SFT in front of RL isn't free either: when the expert data diverges from the model's own distribution, training goes through a destabilizing shift–readapt–overfit progression, which is why approaches like CHORD fold the supervised signal in as a dynamically weighted auxiliary objective inside on-policy RL rather than as a separate front-loaded stage (see: Why does SFT-then-RL training follow a predictable three-phase pattern?).
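To make the "rare accidental wins get treated as high-value" point concrete, here is a minimal sketch of group-relative advantage normalization with binary rewards. The function name, the use of population standard deviation, and the epsilon term are assumptions for illustration; real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 16 rollouts on a single prompt.
lucky_win = [1] + [0] * 15     # nearly impossible problem, one accidental success
even_split = [1] * 8 + [0] * 8 # problem the model solves half the time
all_fail = [0] * 16            # no success at all

print(group_relative_advantages(lucky_win)[0])   # advantage given to the lone win
print(group_relative_advantages(even_split)[0])  # advantage given to a routine win
print(set(group_relative_advantages(all_fail)))  # no learning signal
```

Under these assumptions the single lucky success receives an advantage several times larger than a success on a problem the model solves half the time, so whatever shortcut produced it is reinforced strongly. The all-failure group yields zero advantage everywhere.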
So the bridge isn't a compromise between two settings on a dial; it's a sequencing insight. SFT alone gives form without feasibility; RLVR alone needs feasibility before it can give anything. The supervised-reward middle does the unglamorous work of getting the model into the region where verifiable rewards become a usable teaching signal. One related finding: research suggests RL changes surprisingly little of the network, updating only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are consistent across seeds (see: Does reinforcement learning update only a small fraction of parameters?). That fits the picture of the RLVR phase as a precise sharpening operation on foundations laid earlier, not a wholesale rewrite.
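For readers wondering how an "only a fraction of parameters update" figure could be measured, the sketch below compares two checkpoints element by element. It is an illustration only: the flat-list representation, the function name, and the tolerance are assumptions, and the cited research's exact procedure is not described here.

```python
def update_fraction(before, after, tol: float = 0.0) -> float:
    """Fraction of parameters whose value changed by more than tol."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    if not before:
        raise ValueError("checkpoints must not be empty")
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


# Hypothetical flattened weights before and after an RL phase.
weights_before = [0.1] * 10
weights_after = list(weights_before)
weights_after[3] += 0.02   # only two entries move
weights_after[7] -= 0.01

print(update_fraction(weights_before, weights_after))
```

On a real model the same comparison would run over every tensor in the two checkpoints, and the choice of tolerance would matter for low-precision weights.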
Sources (8 notes)
Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.