Reasoning and Learning Architectures

Does sequencing imitation training before exploration training improve reasoning?

Can Supervised RL (expert imitation) followed by RLVR (outcome rewards) outperform either method alone on hard reasoning tasks? This note explores whether curriculum ordering unlocks capabilities that neither method achieves independently.

Note · 2026-05-18 · sourced from Training Fine Tuning

A clean curriculum-learning result from the SRL paper. Neither Supervised RL (SRL) alone nor RLVR (reinforcement learning with verifiable rewards) alone is the best training strategy for hard reasoning problems on small models. The strongest pipeline runs SRL first to establish a reasoning foundation, then RLVR to refine performance against verifiable rewards. The sequenced combination outperforms either method used on its own.

The two methods are complementary. SRL teaches the model to take reasoning actions that resemble expert demonstrations. This installs the basic structure of a competent reasoning rollout, even on problems where the model would never reach the correct answer on its own. RLVR can then refine performance: once the model produces reasonable rollouts some of the time, outcome rewards become informative. They separate the attempts that reach a correct answer from the ones that go off track, and push the model toward the former.
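As a rough illustration of what "resembling expert demonstrations" can mean at the level of a single step, here is a minimal sketch of a dense, step-wise similarity reward. The function name step_similarity_reward and the use of Python's difflib.SequenceMatcher as the similarity measure are assumptions made for this example; the note does not specify how the paper scores resemblance.

```python
from difflib import SequenceMatcher


def step_similarity_reward(model_step: str, expert_step: str) -> float:
    """Score one reasoning step in [0, 1] by character-level overlap with the expert step.

    Hypothetical helper: the similarity metric is an assumption, not the paper's definition.
    """
    return SequenceMatcher(None, model_step, expert_step).ratio()


# Made-up steps for illustration only.
expert = "Factor the quadratic: x^2 - 5x + 6 = (x - 2)(x - 3)"
close = "Factor x^2 - 5x + 6 as (x - 2)(x - 3)"
off_track = "Guess x = 10 and check"

# A step that follows the expert scores higher than an unrelated one,
# even though neither step has produced a final answer yet.
print(step_similarity_reward(close, expert))
print(step_similarity_reward(off_track, expert))
```

The point of a reward like this is that it is non-zero long before the model can solve the problem, which is what lets imitation make progress where an outcome-only reward would not.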

Without the SRL foundation, RLVR fails on hard problems because the success rate is effectively zero: every rollout receives the same reward, so there is nothing for the update to prefer. Without the RLVR refinement, SRL caps out at expert-step imitation and never learns to push past the demonstrations. Each method addresses a failure mode of the other.
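The zero-success failure can be made concrete with a small sketch. It assumes binary verifiable rewards and a group-normalized advantage of the kind used in GRPO-style training; the note does not say which RL algorithm the paper uses, so the function names and the normalization are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumed GRPO-style normalization: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prob_informative_group(p, group_size):
    """Probability that a group of independent rollouts contains both a success and a failure,
    given per-rollout success probability p."""
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


# All rollouts fail: every advantage is zero, so the update carries no signal.
print(group_advantages([0, 0, 0, 0]))

# One rollout succeeds: the success gets a positive advantage, the failures a negative one.
print(group_advantages([0, 0, 0, 1]))

# Fraction of groups with mixed outcomes at different success rates (group size 8).
for p in (0.0, 0.01, 0.1):
    print(p, prob_informative_group(p, 8))
```

Under these assumptions, raising the per-rollout success rate from zero to even a few percent changes the fraction of groups that produce any gradient at all, which is the role the imitation phase plays.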

This is a specific instance of a broader curriculum-learning template. Different training methods cover different failure modes: imitation methods fail when the demonstrations are unreachable from the student's starting point; outcome methods fail when success is too rare to provide signal. The right ordering uses the imitation method to make the outcome method viable, building up to the regime where the harder, more capability-stretching method can produce useful signal.

For practitioners, the operational guidance is: when training small models on hard problems, do not pick between SFT/SRL and RL — sequence them. Use the imitation phase to get the model into the regime where the RL phase becomes informative, then use the RL phase to push past what imitation alone can achieve. The combined pipeline is the production setting.
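One way to turn this guidance into a schedule is sketched below. It is only a toy: the switch criterion (a minimum pass rate), the function run_curriculum, and the stand-in training steps are all hypothetical and are not taken from the paper, which this note describes only as running SRL first and RLVR second.

```python
from typing import Callable, List


def run_curriculum(
    imitation_step: Callable[[], None],
    rl_step: Callable[[], None],
    pass_rate: Callable[[], float],
    min_pass_rate: float = 0.05,   # assumed threshold at which outcome rewards carry signal
    max_imitation_steps: int = 1000,
    rl_steps: int = 100,
) -> List[str]:
    """Run imitation until the model succeeds often enough, then switch to outcome-reward RL."""
    schedule: List[str] = []
    # Phase 1: imitation, until outcome rewards would be informative (or the budget runs out).
    while len(schedule) < max_imitation_steps and pass_rate() < min_pass_rate:
        imitation_step()
        schedule.append("imitation")
    # Phase 2: outcome-reward RL for a fixed number of steps.
    for _ in range(rl_steps):
        rl_step()
        schedule.append("rl")
    return schedule


# Toy stand-in for a model: counts how many of 100 held-out problems it solves.
state = {"solved": 0}


def toy_imitation() -> None:
    state["solved"] = min(30, state["solved"] + 1)   # imitation plateaus early


def toy_rl() -> None:
    state["solved"] = min(100, state["solved"] + 2)  # RL keeps improving past the plateau


schedule = run_curriculum(toy_imitation, toy_rl, lambda: state["solved"] / 100, rl_steps=10)
print(schedule.count("imitation"), schedule.count("rl"))
```

A threshold on pass rate is only one possible trigger; a fixed imitation budget or a plateau in imitation loss would fit the same template.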

The deeper observation is that "method choice" is often the wrong frame — "method sequence" frequently dominates. Curricula matter when the methods have different valid regimes.

Original note title

SRL-then-RLVR curriculum learning outperforms either method alone — imitation foundation then exploration refinement