Can continuous spectrum training outperform sequential SFT-then-RL stages?

This explores whether blending supervised and reinforcement signals into one continuous training process beats running them as discrete stages (SFT first, then RL) — and what the corpus knows about why staged training stumbles.

The corpus offers a fairly direct answer: the classic recipe of doing supervised fine-tuning first and bolting RL on afterward has a diagnosable failure pattern, and folding the stages together fixes it.

The clearest evidence is the discovery that SFT-then-RL training moves through three predictable phases when the expert data diverges from what the model already does: an initial disruption as the policy shifts, a readaptation to the expert's patterns, then overfitting (see "Why does SFT-then-RL training follow a predictable three-phase pattern?"). That progression is essentially the cost of treating SFT as a finished, separate step. The same work shows that dynamically weighting SFT as an auxiliary objective *inside* on-policy RL, rather than running it as a prior stage, smooths out the progression and stabilizes training. That's continuous-spectrum training outperforming the staged version on its home turf.
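To make "SFT as a weighted auxiliary objective" concrete, here is a minimal sketch of a blended loss. It assumes a single global weight `mu` that mixes the two terms and decays over training. The cosine schedule, the default values, and the function names are illustrative assumptions, not the formulation used in the cited work.

```python
import math


def sft_weight(step: int, total_steps: int,
               mu_start: float = 0.9, mu_end: float = 0.05) -> float:
    """Hypothetical cosine decay of the SFT weight from mu_start to mu_end."""
    # Clamp progress to [0, 1] so steps outside the schedule stay well-defined.
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return mu_end + 0.5 * (mu_start - mu_end) * (1.0 + math.cos(math.pi * progress))


def blended_loss(rl_loss: float, sft_loss: float, mu: float) -> float:
    """Convex mix: mu weights the supervised term, (1 - mu) the on-policy RL term."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return (1.0 - mu) * rl_loss + mu * sft_loss
```

With `mu = 1` this reduces to pure SFT and with `mu = 0` to pure RL, so the staged recipe is the special case where `mu` jumps from 1 to 0 at a single step instead of changing gradually.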

Why does merging help? A second thread points at plasticity. Models that drift less from their base distribution stay able to keep learning, while approaches that wander far stall once the task domain changes (see "Does staying close to the base model preserve learning ability?"). A hard SFT stage followed by hard RL is exactly the kind of two-step drift that burns plasticity; a continuous blend keeps the model closer to its base distribution and learning-ready throughout. There's a structural hint here too: RL only rewrites a small, consistent slice of parameters (see "Does reinforcement learning update only a small fraction of parameters?"), so the two phases aren't fighting over the whole network, which is part of why interleaving them is even feasible.
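Drift from the base distribution is typically measured with a KL divergence between the tuned policy's next-token distribution and the base model's. The sketch below computes it for toy three-token distributions; the probability values are made up for illustration, and the direction KL(policy || base) is an assumption about how drift is measured, not something the source specifies.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; eps guards against division by zero in q.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


# Toy next-token distributions (illustrative values only).
base = [0.5, 0.3, 0.2]
near_policy = [0.55, 0.27, 0.18]   # stays close to the base model
far_policy = [0.9, 0.05, 0.05]     # has wandered far from the base model

drift_near = kl_divergence(near_policy, base)
drift_far = kl_divergence(far_policy, base)
```

Here `drift_near` comes out much smaller than `drift_far`, which is the kind of gap the plasticity argument refers to.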

But "continuous beats staged" isn't unconditional: order and scheduling still matter enormously, which complicates the tidy story. Training order mechanically reshapes entropy dynamics. Structured tasks shrink output entropy while creative ones expand it, and training structured tasks first beat joint training by 6.2%, precisely because it stopped entropy collapse from wrecking open-ended skills (see "Does training order reshape how models handle different task types?"). So sequencing carries real information that a naive uniform blend can throw away. RL itself unfolds in phases too, with procedural mastery first and strategic exploration second (see "Does RL training follow a predictable two-phase learning sequence?"), meaning even "continuous" training has internal stages whether you design for them or not.
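"Output entropy" here is the Shannon entropy of the model's next-token distribution. A minimal sketch, using made-up four-token distributions to stand in for a structured task (one answer dominates) and a creative task (many continuations are plausible):

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    # Zero-probability tokens contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Illustrative distributions, not measured values.
structured_like = [0.97, 0.01, 0.01, 0.01]   # peaked: low entropy
creative_like = [0.25, 0.25, 0.25, 0.25]     # flat: maximum entropy for 4 tokens

h_structured = entropy(structured_like)
h_creative = entropy(creative_like)
```

Entropy collapse, in these terms, is a policy whose distributions all drift toward the peaked shape, which is harmless on structured tasks and damaging on open-ended ones.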

The synthesis: the dichotomy in the question is partly false. The winning approach isn't "abolish stages" so much as "stop treating SFT as a frozen prior step and let supervised signal flow through RL as a tunable, weighted ingredient", while still respecting that the *content's* difficulty and type want a schedule. The less obvious takeaway: the failure of staged training isn't that RL undoes SFT. It's a specific shift-readapt-overfit arc that you can detect and dissolve by making the boundary between the two permeable.

Sources 5 notes

Why does SFT-then-RL training follow a predictable three-phase pattern?

CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
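One simple way to quantify how much of a network RL touches is to compare a base checkpoint with its RL-tuned counterpart and count the parameters that changed. The sketch below does this on flat lists of floats; the function name, the `tol` threshold, and the toy values are hypothetical, and the cited work may define its measure differently.

```python
def updated_fraction(base_params, tuned_params, tol=0.0):
    """Fraction of parameters whose value changed by more than tol."""
    if len(base_params) != len(tuned_params):
        raise ValueError("checkpoints must have the same number of parameters")
    if not base_params:
        raise ValueError("checkpoints must not be empty")
    changed = sum(1 for b, t in zip(base_params, tuned_params) if abs(t - b) > tol)
    return changed / len(base_params)


# Toy flattened checkpoints (illustrative values only): one of five entries differs.
base_ckpt = [0.1, 0.2, 0.3, 0.4, 0.5]
tuned_ckpt = [0.1, 0.25, 0.3, 0.4, 0.5]

fraction = updated_fraction(base_ckpt, tuned_ckpt)
```

Comparing the set of changed indices across random seeds would be the natural next step for checking the "nearly identical across seeds" claim.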

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
