Why does SFT-then-RL training follow a predictable three-phase pattern?
When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.
The standard SFT-then-RL pipeline doesn't consistently outperform pure RL. CHORD's investigation reveals why: the learning curve follows a "shift-readapt-overfit" progression through three distinct phases. First, initial disruption — the sudden policy shift from expert data degrades existing capabilities. Second, readaptation — the model adapts to expert patterns and recovers performance. Third, overfitting — the model eventually overfits to the expert data, losing generalization.
This three-phase pattern appears specifically when expert data significantly diverges from the model's own established patterns. Expert data brings new capabilities but disrupts established ones, creating a fundamental tension in the SFT-then-RL approach.
CHORD's solution reframes SFT not as a separate tuning stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Two control mechanisms manage the expert data's influence: a global coefficient that guides the transition from off-policy imitation to on-policy exploration over the course of training, and a per-token weighting function that down-weights highly divergent off-policy tokens that could otherwise disrupt on-policy training.
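To make the mechanism concrete, here is a minimal PyTorch-style sketch of an SFT auxiliary loss blended into an RL objective with both a global coefficient and per-token weights. The function name `chord_style_loss`, the coefficient name `mu`, and the specific choice of per-token weight (the policy's own probability of the expert token) are illustrative assumptions, not CHORD's exact formulation.

```python
import torch
import torch.nn.functional as F

def chord_style_loss(policy_logits, expert_tokens, rl_loss, mu):
    """Blend an on-policy RL loss with a token-weighted SFT auxiliary loss.

    policy_logits: (batch, seq, vocab) logits of the current policy evaluated
                   on the expert (off-policy) sequences.
    expert_tokens: (batch, seq) token ids from the expert data.
    rl_loss:       scalar on-policy loss (e.g. a policy-gradient objective)
                   computed elsewhere on the model's own rollouts.
    mu:            global coefficient in [0, 1]; large = imitation-heavy,
                   small = exploration-heavy.
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    token_logp = log_probs.gather(-1, expert_tokens.unsqueeze(-1)).squeeze(-1)

    # Per-token weight: the policy's own probability of each expert token.
    # Tokens the policy finds very unlikely (highly divergent from its
    # learned patterns) get small weights, so they cannot yank the policy
    # around and disrupt the concurrent on-policy updates.
    with torch.no_grad():
        token_weight = token_logp.exp()

    sft_loss = -(token_weight * token_logp).sum() / token_weight.sum().clamp_min(1e-8)

    # One objective, two knobs: mu sets the global imitation/exploration mix,
    # token_weight modulates each expert token's contribution.
    return mu * sft_loss + (1.0 - mu) * rl_loss
```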
The insight connects to the broader SFT-RL dynamic. As explored in "Does supervised fine-tuning actually improve reasoning quality?", the disruption phase in CHORD's three-phase pattern may correspond to the reasoning-quality loss that SFT introduces. And as in "How quickly do errors compound during model self-training?", the overfitting phase can be read as a slower-timescale version of the same cumulative failure dynamic.
The practical implication: rather than treating SFT and RL as sequential stages with a hard boundary, integrating them as a continuous spectrum (from imitation-heavy to exploration-heavy) over training produces more stable and higher-performing results.
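One simple way to realize that continuous spectrum is to anneal the global imitation weight over training. The schedule below, including its linear shape and endpoint values, is a hypothetical illustration rather than a prescribed recipe.

```python
def mu_schedule(step, total_steps, mu_start=0.9, mu_end=0.05):
    """Anneal the global imitation weight from imitation-heavy (early)
    to exploration-heavy (late); linear shape chosen for illustration."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return mu_start + (mu_end - mu_start) * frac
```

Each update would then call something like `chord_style_loss(..., mu=mu_schedule(step, total_steps))`, so imitation fades smoothly instead of stopping at a hard SFT/RL boundary.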
Source: Reinforcement Learning
Related concepts in this collection
- Does supervised fine-tuning actually improve reasoning quality?
While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
connects: the disruption phase may correspond to SFT's reasoning quality degradation
- How quickly do errors compound during model self-training?
When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
extends: overfit phase is slow-timescale error compounding
- Does RL improve domain reasoning by adding knowledge or removing it?
When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
supports: the RL phase works by pruning the overfitting artifacts of the SFT phase
- Does supervised fine-tuning improve reasoning or just answers?
Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
extends: CHORD's disruption phase is the SFT accuracy trap in temporal progression — SFT raises accuracy while degrading reasoning quality, and CHORD shows this degradation is the first phase of a three-phase dynamic that RL can recover from if properly integrated
- Does training order reshape how models handle different task types?
Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.
Omni-Thinker's complementary entropy dynamics extend CHORD's temporal framework: CHORD shows SFT→RL follows shift-readapt-overfit within a single domain, while multi-task RL reveals that different domains pull entropy in opposite directions — making training order across domains a mechanistic variable that interacts with CHORD's within-domain phase progression
Original note title
SFT-then-RL training exhibits a shift-readapt-overfit progression when expert data diverges from model patterns