What makes supervised fine-tuning worsen RL exploration later?
This explores the worry that supervised fine-tuning (training a model to imitate good answers) narrows the behavioral range a model can later explore when reinforcement learning takes over — but the corpus complicates that premise more than it confirms it.
The question treats SFT as something that quietly shrinks the model's later room to explore, and the corpus shows a genuine tension on this point. The intuitive mechanism is imitation lock-in: when a model is trained to copy expert demonstrations, its competence is capped by whatever the dataset curators happened to imagine. The model never interacts with an environment and never learns from its own failures, so it generalizes poorly beyond the scenarios it was shown (see: Can agents learn beyond what their training data shows?). If that imitated repertoire is narrow, RL inherits a narrow starting point, and RL appears to be poor at inventing behaviors the model was not already capable of.
That last point is the crux. Several notes suggest RL mostly *selects and sharpens* what's already present rather than creating anything new. RL post-training seems to teach a model *when* to reason, not *how*: the reasoning strategies pre-exist as latent activations, and RL optimizes when they are deployed (see: Does RL post-training create reasoning or just deploy it?). It also collapses diversity: within the first epoch RL converges on a single dominant format and suppresses the alternatives, and the winning format is inherited from the pre-RL distribution rather than necessarily chosen for performance (see: Does RL training collapse format diversity in pretrained models?). So if SFT has already flattened the model's palette of formats and strategies, RL doesn't reopen it; it picks one survivor from a smaller pool. RL fine-tuning has even been shown to sharpen template-matching and memorization rather than install genuine procedures, with sharp accuracy drops on out-of-distribution variants (see: Do fine-tuned language models actually learn optimization procedures?).
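The "selects and sharpens" picture can be made concrete with a toy model. The sketch below is an illustration only, not the method of any note cited here: it assumes RL acts like a temperature-style reweighting of a fixed distribution over response formats, and the names `sharpen`, `beta`, `broad_prior`, and `narrow_prior` are all hypothetical. Under that assumption, sharpening lowers entropy, and a format the prior assigns zero probability can never be recovered.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def sharpen(p, beta):
    """Toy stand-in for RL: reweight p_i -> p_i**beta / Z.

    beta > 1 concentrates mass on whatever was already most likely.
    This is an assumed simplification, not a real RL update rule.
    """
    weights = [x ** beta for x in p]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical format distributions after SFT on broad vs. narrow data.
broad_prior = [0.4, 0.3, 0.2, 0.1]
narrow_prior = [0.9, 0.1, 0.0, 0.0]

broad_after = sharpen(broad_prior, beta=4.0)
narrow_after = sharpen(narrow_prior, beta=4.0)

print("broad:  entropy", round(entropy(broad_prior), 3), "->", round(entropy(broad_after), 3))
print("narrow: entropy", round(entropy(narrow_prior), 3), "->", round(entropy(narrow_after), 3))
print("formats still reachable from narrow prior:", sum(1 for x in narrow_after if x > 0))
```

In this toy setting the winner is always the format that started with the most mass, which mirrors the claim that the surviving format is inherited from the pre-RL distribution; real training dynamics are of course richer than a single exponent.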
But here's the surprise the corpus delivers: SFT is not the villain the question assumes; *narrow* SFT is. In search agents, RL squeezes exploration diversity through the same entropy-collapse mechanism seen in reasoning, while SFT on **diverse** demonstrations preserves exploration breadth (see: Does reinforcement learning squeeze exploration diversity in search agents?). A related comparison of knowledge-embedding methods points at the training objective rather than the mere presence of SFT: RL-style training that rewards reasoning quality internalizes more coherent structures than token-level SFT, because SFT optimizes surface correctness rather than the underlying reasoning (see: Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?). Read together, the failure isn't 'SFT happened'; it's *what the SFT data taught*. Imitating a thin band of expert trajectories hands RL a pre-collapsed search space; imitating a broad range keeps the door open.
There's a final wrinkle that reframes 'exploration' itself. RL's effects concentrate in a sparse but nearly full-rank subnetwork: only 5–30% of parameters move, and they move nearly identically across random seeds, suggesting RL has a fairly fixed lane it operates in regardless of where it starts (see: Does reinforcement learning update only a small fraction of parameters?). Its primary lever appears to be *suppressing* wrong trajectories rather than amplifying correct ones (see: What actually changes inside a model during RL training?), and training unfolds in two phases, procedural consolidation first and strategic exploration second (see: Does RL training follow a predictable two-phase learning sequence?). That second phase is exactly where a thin SFT prior bites hardest: if strategic exploration can only recombine strategies the model already carries, an over-pruned starting repertoire shows up as a ceiling late, not early. The thing you didn't know to ask: the danger isn't fine-tuning per se, it's fine-tuning on data narrow enough that, by the time RL's exploration phase arrives, there's nothing left to explore.
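To make the sparsity figure concrete, here is a minimal sketch of how the share of parameters that moved during RL could be measured from two checkpoints. It is an assumption-laden illustration: the function name `updated_fraction`, the tolerance `tol`, and the tiny hand-written "checkpoints" are hypothetical, and real checkpoints would be tensors rather than plain Python lists.

```python
def updated_fraction(before, after, tol=0.0):
    """Fraction of scalar parameters whose value changed by more than tol.

    before, after: dicts mapping a parameter name to a flat list of floats.
    Both dicts are assumed to share the same keys and list lengths.
    """
    changed = 0
    total = 0
    for name, old_values in before.items():
        new_values = after[name]
        for old, new in zip(old_values, new_values):
            total += 1
            if abs(old - new) > tol:
                changed += 1
    return changed / total

# Hypothetical pre-RL and post-RL weights: 10 scalars, 2 of which move.
pre_rl = {
    "layer1": [0.1, 0.2, 0.3, 0.4],
    "layer2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
}
post_rl = {
    "layer1": [0.1, 0.25, 0.3, 0.4],
    "layer2": [1.0, 2.0, 3.0, 4.0, 5.5, 6.0],
}

fraction = updated_fraction(pre_rl, post_rl)
print(f"{fraction:.0%} of parameters moved")
```

Repeating the same measurement across runs with different random seeds, and comparing which entries changed, would be one way to probe the claim that the updated subnetwork is nearly identical from seed to seed.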
Sources (9 notes)
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.