Can combining SRL with RLVR outperform either method used alone?
This explores whether a two-stage recipe — imitation training (SRL) first to build reasoning foundations, then verifiable-reward training (RLVR) to sharpen — beats running either stage on its own.
The corpus has a direct answer, plus a deeper reason the recipe works. The headline result is that running Supervised RL (SRL) first to establish reasoning patterns, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation (Does sequencing imitation then exploration training improve reasoning?). The *why* is the interesting part: the imitation phase is more than a warm-up. It makes the outcome rewards informative by producing reasonable rollouts that the RL phase can then sharpen. Without that scaffolding, RLVR has little useful signal to work with.
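One way to make "informative rewards" concrete is a back-of-the-envelope calculation. The setup here is an assumption for illustration, not something the source specifies: each prompt gets a group of $G$ rollouts, each rollout is correct independently with probability $p$, rewards are binary, and the learning signal comes from comparing rollouts within the group. Under those assumptions a group is informative only when it contains both successes and failures:

```latex
\begin{align*}
P(\text{informative group}) &= 1 - p^{G} - (1 - p)^{G} \\
G = 8,\ p = 0.01 &: \quad 1 - 0.01^{8} - 0.99^{8} \approx 0.077 \\
G = 8,\ p = 0.30 &: \quad 1 - 0.30^{8} - 0.70^{8} \approx 0.942
\end{align*}
```

In this toy model, raising the per-rollout success rate from 1% to 30% (the kind of shift an imitation phase might plausibly produce; the numbers are hypothetical) moves the share of prompts with a usable comparison from roughly 8% to roughly 94%.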
That dependency makes sense once you look at what RLVR actually does. A recurring finding across the corpus is that RLVR doesn't teach new reasoning; it activates reasoning that is already latent in the model. It reportedly works nearly as well with random rewards as with correct ones, because it triggers a phase transition in the output distribution rather than instilling skills, and its effectiveness tracks pretraining quality, not reward quality (Why does RLVR work with completely random rewards?). The source notes below qualify this: on clean benchmarks, only correct rewards improve performance. The same theme shows up in the finding that RL functions as *selection, not discovery*: the pretrained prior bounds what exploration can reach, which is why the choice of RL algorithm barely matters (Does the choice of RL algorithm actually matter for reasoning?). If RLVR can only select and amplify what is already there, then an SRL imitation phase that plants better candidate behaviors is exactly the lever that raises the ceiling RLVR is working under.
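The "selection, not discovery" idea can be sketched with a toy model. This is an idealization, not the method from any cited note: it treats RL as reweighting a fixed prior over strategies by `exp(reward / beta)`, which is the standard closed form for KL-regularized reward maximization. The strategy names, probabilities, rewards, and `beta` are all hypothetical.

```python
import math


def reweight(prior, rewards, beta=0.5):
    """Return pi(y) proportional to prior(y) * exp(rewards[y] / beta)."""
    unnorm = {y: p * math.exp(rewards[y] / beta) for y, p in prior.items()}
    total = sum(unnorm.values())
    return {y: w / total for y, w in unnorm.items()}


# Hypothetical strategies and probabilities, for illustration only.
rewards = {"guess": 0.0, "partial_reasoning": 0.5, "full_reasoning": 1.0}

# Base model: the best strategy has zero prior mass.
base_prior = {"guess": 0.90, "partial_reasoning": 0.10, "full_reasoning": 0.0}

# After an imitation phase: the best strategy is rare but present.
srl_prior = {"guess": 0.60, "partial_reasoning": 0.30, "full_reasoning": 0.10}

after_rl_only = reweight(base_prior, rewards)
after_srl_then_rl = reweight(srl_prior, rewards)

for name, policy in [("RL only", after_rl_only), ("SRL then RL", after_srl_then_rl)]:
    print(name, {y: round(p, 3) for y, p in policy.items()})
```

In this sketch, reweighting can never give mass to a strategy the prior assigns zero probability, so "RL only" stays at zero on `full_reasoning`, while the same reweighting applied to the imitation-shaped prior amplifies it well above its starting 10%.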
The corpus also explains what goes wrong when you skip the foundation. RLVR fed problems that are too hard for the current model induces degenerate shortcuts such as answer repetition and computation-skipping, because rare accidental successes get treated as high-advantage trajectories and reinforced (Do overly hard RLVR samples actually harm model capabilities?). An SRL phase that first makes hard problems solvable converts those uninformative, all-or-nothing reward signals into a usable gradient. This is the same logic from a different angle: the imitation stage manufactures the 'reasonable rollouts' that keep the reward signal meaningful instead of degenerate.
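The "rare accidental success becomes a high-advantage trajectory" effect is easy to see numerically. The following is a minimal sketch of group-relative normalization in its common form (subtract the group mean, divide by the group standard deviation); the group size of 8, the `eps` constant, and the reward vectors are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std dev over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary rewards for 8 rollouts on one prompt.
all_fail = [0.0] * 8                                  # problem far too hard
one_lucky = [1.0] + [0.0] * 7                         # one accidental success
mixed = [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]      # solvable about half the time

adv_all_fail = group_relative_advantages(all_fail)
adv_one_lucky = group_relative_advantages(one_lucky)
adv_mixed = group_relative_advantages(mixed)

print("all fail :", [round(a, 2) for a in adv_all_fail])
print("one lucky:", [round(a, 2) for a in adv_one_lucky])
print("mixed    :", [round(a, 2) for a in adv_mixed])
```

Under these assumptions, the all-fail group yields zero advantage everywhere (no signal at all), the single lucky rollout receives an advantage of roughly 2.6, and successes in the balanced group receive roughly 1. The lucky rollout is therefore reinforced more strongly than a typical success on a solvable problem, regardless of whether its reasoning was sound.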
Two cautions are worth carrying into any claim of 'outperforms.' First, RLVR's gains are partly structural rather than semantic: it measurably improves coherence between adjacent reasoning steps without guaranteeing that the whole proof is valid (Does RLVR actually improve mathematical reasoning or just coherence?). Second, benchmark improvements can reflect memorization on contaminated datasets rather than genuine reasoning, and behavioral activation and benchmark scores are separable phenomena that can move independently (Does RLVR success on math benchmarks reflect genuine reasoning improvement?; Can genuine reasoning activation coexist with contaminated benchmarks?). So 'the curriculum wins' is most trustworthy when measured on clean, post-release benchmarks and on the activation of reasoning behavior, not just a leaderboard number.
The less obvious takeaway: combining the two methods isn't additive, it's enabling. SRL doesn't add a separate increment of skill on top of RLVR. It changes what RLVR is *able* to do by giving a selection-and-amplification process something worth selecting. That reframes the 'better training recipe' question as a sequencing question about when imitation makes reward signals legible.
Sources (7 notes)
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
RLVR works nearly as well with spurious rewards as correct ones because it catalyzes a phase transition in model output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.
Expert Iteration, PPO, and RC-RL perform comparably on reasoning because exploration is constrained by the pretrained distribution, not the optimizer. RL functions as selection, not discovery—the prior contains most solutions the algorithm will find.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.