Reinforcement Learning for LLMs

Why does SFT-then-RL training follow a predictable three-phase pattern?

When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.

Note · 2026-02-22 · sourced from Reinforcement Learning

The standard SFT-then-RL pipeline doesn't consistently outperform pure RL. CHORD's investigation reveals why: the learning curve follows a "shift-readapt-overfit" progression through three distinct phases. First, initial disruption — the sudden policy shift from expert data degrades existing capabilities. Second, readaptation — the model adapts to expert patterns and recovers performance. Third, overfitting — the model eventually overfits to the expert data, losing generalization.

This three-phase pattern appears specifically when expert data significantly diverges from the model's own established patterns. Expert data brings new capabilities but disrupts established ones, creating a fundamental tension in the SFT-then-RL approach.

CHORD's solution reframes SFT not as a separate tuning stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Two control mechanisms manage the expert data influence: a global coefficient that guides the transition from off-policy imitation to on-policy exploration over training, and a per-token weighting function that down-weights highly divergent tokens from off-policy data that could disrupt on-policy training.
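To make the mechanism concrete, here is a minimal PyTorch-style sketch, with hypothetical names, of what a dynamically weighted SFT auxiliary term inside an on-policy RL update could look like. The global coefficient `mu` mixes the RL loss with an imitation term over expert tokens, and the per-token weight p·(1−p) is one plausible weighting that suppresses expert tokens the policy finds highly divergent; neither detail is claimed to be CHORD's exact formulation.

```python
import torch
import torch.nn.functional as F


def mixed_objective(policy_logits, expert_tokens, rl_loss, mu):
    """Hypothetical sketch: on-policy RL loss plus a dynamically weighted
    SFT auxiliary term over off-policy expert tokens.

    policy_logits : (seq_len, vocab_size) logits for the expert sequence
                    under the current policy
    expert_tokens : (seq_len,) expert token ids
    rl_loss       : scalar on-policy RL loss (e.g. a PPO/GRPO surrogate),
                    computed elsewhere
    mu            : global coefficient steering imitation vs. exploration
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    token_logp = log_probs.gather(-1, expert_tokens.unsqueeze(-1)).squeeze(-1)
    token_p = token_logp.exp()

    # Per-token weight: p * (1 - p) goes to zero both for tokens the policy
    # already matches and for highly divergent (near-zero probability) expert
    # tokens, so divergent tokens disrupt the on-policy update less.
    # This specific weighting is an assumption, not necessarily CHORD's.
    token_weight = (token_p * (1.0 - token_p)).detach()

    sft_aux = -(token_weight * token_logp).mean()  # weighted imitation loss
    return (1.0 - mu) * rl_loss + mu * sft_aux
```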

The insight connects to the broader SFT-RL dynamic. If SFT degrades reasoning quality (see "Does supervised fine-tuning actually improve reasoning quality?"), the disruption phase in CHORD's three-phase pattern may correspond to that loss. And if errors compound during model self-training (see "How quickly do errors compound during model self-training?"), the overfitting phase represents a slower-timescale version of the same cumulative failure dynamic.

The practical implication: rather than treating SFT and RL as sequential stages with a hard boundary, blending them along a continuous spectrum (from imitation-heavy early in training to exploration-heavy later) produces more stable and higher-performing results.
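Read operationally, "continuous spectrum" can mean annealing the global coefficient over training so early updates lean on imitation and later updates lean on exploration. The schedule below is a hypothetical illustration of that idea, not a published CHORD schedule; the shape and endpoints are assumptions.

```python
def mu_schedule(step, total_steps, mu_start=0.9, mu_end=0.05):
    """Hypothetical linear anneal of the global imitation coefficient:
    imitation-heavy early in training, exploration-heavy later."""
    frac = min(step / max(total_steps, 1), 1.0)
    return mu_start + frac * (mu_end - mu_start)
```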


Source: Reinforcement Learning

Original note title: SFT-then-RL training exhibits a shift-readapt-overfit progression when expert data diverges from model patterns