INQUIRING LINE

How does non-reasoning SFT prevent overfitting before RL training begins?

This explores whether a light, non-reasoning supervised fine-tuning (SFT) warm-up before reinforcement learning actually guards against overfitting — and the corpus pushes back on the premise, suggesting SFT is more often the *source* of overfitting than the cure.


This reads the question as asking how an SFT stage, done before RL begins, keeps a model from overfitting. The honest synthesis: the collection mostly documents the opposite. SFT tends to *introduce* the overfitting that RL later has to fight. One careful study of SFT-then-RL pipelines finds a predictable three-phase arc: the model is first disrupted by the policy shift, then readapts to the expert examples, then overfits to them (Why does SFT-then-RL training follow a predictable three-phase pattern?). So a long SFT warm-up doesn't prevent overfitting; it walks the model into it.

What SFT actually teaches is part of the problem. Two notes show it buys surface competence at the cost of substance. In one, fine-tuning raises final-answer accuracy while cutting reasoning informativeness by 38.9% on average, with models reaching answers through pattern-matched shortcuts rather than genuine inference (Does supervised fine-tuning actually improve reasoning quality?). In the other, on optimization problems, it makes outputs *look* right (clean JSON, valid structure) without making them physically feasible (Does supervised fine-tuning actually improve reasoning on optimization problems?). That is overfitting to form. A 'non-reasoning' SFT pass that only teaches formatting and answer-matching is exactly the kind that locks in shortcuts.

So why do people run SFT first at all? The better framing in the corpus is that the reasoning is already there. Five independent mechanisms are reported to elicit reasoning capability that is already latent in base models, so minimal training merely unlocks it (Do base models already contain hidden reasoning ability?). Relatedly, RL post-training teaches a model *when* to reason rather than *how*: hybrid setups recover 91% of the gains just by routing tokens (Does RL post-training create reasoning or just deploy it?). Under this view, a heavy reasoning-SFT stage risks overwriting capability the model already has; keeping SFT light and non-reasoning leaves the latent skills intact for RL to elicit rather than baking in imitation patterns.

The more compelling answer to overfitting isn't a clean SFT-then-RL handoff at all; it's dissolving the boundary. The same study that names the shift-readapt-overfit arc resolves it by folding SFT *into* on-policy RL as a dynamically weighted auxiliary objective, so the model never gets a chance to overfit to a frozen expert set before RL starts (Why does SFT-then-RL training follow a predictable three-phase pattern?). RL has its own overfitting traps to watch. It collapses onto a single dominant pretraining format within the first epoch (Does RL training collapse format diversity in pretrained models?), and overly hard samples breed degenerate shortcuts that contaminate existing skills (Do overly hard RLVR samples actually harm model capabilities?). So the goal is less 'prevent overfitting with SFT' than 'don't let either stage memorize at the other's expense.'
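The idea of treating SFT as a weighted auxiliary term inside on-policy RL can be sketched as a single blended loss. This is a minimal illustration, not the study's implementation: the linear decay schedule, the names `sft_weight` and `blended_loss`, and the default values are all assumptions made for the example. The source says only that the weight is dynamic.

```python
def sft_weight(step: int, total_steps: int,
               start: float = 0.9, end: float = 0.05) -> float:
    """Hypothetical schedule: linearly decay the SFT share from `start` to `end`."""
    if total_steps <= 1:
        return end
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def blended_loss(rl_loss: float, sft_loss: float, mu: float) -> float:
    """Convex combination of the two objectives; `mu` is the SFT share."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return (1.0 - mu) * rl_loss + mu * sft_loss


if __name__ == "__main__":
    # Illustrative loss values only; real losses would come from the trainer.
    for step in (0, 50, 99):
        mu = sft_weight(step, total_steps=100)
        print(step, round(mu, 3), round(blended_loss(2.0, 1.0, mu), 3))
```

Because both terms are present at every step, there is no separate phase in which the model trains only on the frozen expert set; the expert signal simply fades as the on-policy signal takes over.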

The thing you might not have expected: the safest role for SFT before RL is to do *less*, not more. A minimal, formatting-only pass that preserves the base model's latent reasoning — rather than a thorough reasoning-imitation stage — is what keeps RL's elicitation room open.


Sources (7 notes)

Why does SFT-then-RL training follow a predictable three-phase pattern?

CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
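The effect of group-relative normalization on a rare success can be seen with a small calculation. The sketch below assumes the common formulation in which each sampled response's reward is standardized against the mean and standard deviation of its own group; the function name `group_relative_advantages` and the `eps` term are illustrative choices, not taken from the note.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its group's mean and population std."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # One accidental success among eight attempts at a near-impossible problem.
    rewards = [0.0] * 7 + [1.0]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(reward, round(adv, 3))
```

With one success among n binary-reward samples, the standardized advantage of that sample works out to roughly sqrt(n - 1), while each failure receives about -1 / sqrt(n - 1). The rarer the success within the group, the more strongly that single trajectory is reinforced, whether or not the reasoning behind it was sound.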
