INQUIRING LINE

Why does SFT fail when expert demonstrations are too long for small models?

This explores why small models break down when supervised fine-tuning asks them to imitate long expert reasoning traces — and what the failure actually is underneath the symptom.


At bottom, the question is about imitation learning's core mechanism: SFT trains a model to reproduce an expert demonstration token by token, and that rigidity is exactly what snaps when the demonstration is long and the model is small. The clearest framing in the corpus comes from work distinguishing SFT from step-wise reward learning (Can step-wise expert rewards help small models learn hard reasoning?): SFT is "rigid token-by-token imitation," which means a small model is forced to match every step of a trace it has no internal capability to generate on its own. When the gap between the expert's path and the model's reach is large, and a long demonstration almost guarantees a large gap, the model can't actually learn the reasoning, so it learns the next best thing it can copy: the surface form.
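To make the length effect concrete, here is a toy calculation rather than anything taken from the cited notes. It assumes a hypothetical small model that assigns probability 0.9 to every expert token, independently of position; the function names and the two trace lengths are illustrative only.

```python
import math

def sft_token_loss(token_probs):
    """Mean negative log-likelihood of the expert tokens (the usual SFT objective)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def full_trace_probability(token_probs):
    """Probability that the model reproduces the whole expert trace exactly."""
    return math.prod(token_probs)

# Assumption: the model gives each expert token probability 0.9.
short_trace = [0.9] * 50     # a short demonstration
long_trace = [0.9] * 2000    # a long demonstration

for name, trace in [("short", short_trace), ("long", long_trace)]:
    print(name,
          "mean token loss:", round(sft_token_loss(trace), 4),
          "full-trace probability:", full_trace_probability(trace))
```

Under this assumption the per-token loss is the same for both traces, while the chance of following the expert's path end to end shrinks geometrically with length. That is one way to picture why a long trace widens the gap between what the loss rewards and what the model can actually generate.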

That surface-copying is the second thread, and it shows up across several notes. SFT reliably improves how outputs *look* (valid structure, proper formatting, the right-shaped sections) without improving whether the underlying solution is sound (Does supervised fine-tuning actually improve reasoning on optimization problems?). Measured directly, fine-tuning can raise final-answer accuracy while *reducing* reasoning informativeness by 38.9% on average, because the model reaches answers through pattern-matching shortcuts rather than genuine inference (Does supervised fine-tuning actually improve reasoning quality?). A long demonstration gives the model far more surface to mimic and far more reasoning depth it can't internalize, so it leans even harder on imitation of form.

There's a capability-and-length mismatch hiding here too, which is the part you might not expect. Accuracy as a function of chain-of-thought length follows an inverted U: the optimal length *rises* with task difficulty but *falls* with model capability, and smaller models do better with shorter chains (Why does chain of thought accuracy eventually decline with length?). So a long expert trace is doubly wrong for a small model: it's longer than the model would natively benefit from, and length itself isn't a reliable signal of difficulty. Trace length tracks difficulty only in-distribution and decouples from it out-of-distribution, because it mostly reflects recall of training schemas rather than adaptive computation (Does longer reasoning actually mean harder problems?). A long out-of-distribution demonstration is essentially asking the small model to recall schemas it never had.

Push a model toward material that's too far beyond it and the failure turns actively harmful rather than merely unhelpful. In RLVR, training on near-impossible problems makes models learn degenerate shortcuts (answer repetition, computation-skipping), and those shortcuts contaminate capabilities the model already had (Do overly hard RLVR samples actually harm model capabilities?). The same divergence dynamic appears in the SFT-then-RL training trajectory itself: when expert data diverges from the policy, training moves through a shift-readapt-overfit progression, with an initial capability disruption before any readaptation (Why does SFT-then-RL training follow a predictable three-phase pattern?).
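The source note on overly hard RLVR samples attributes the shortcut reinforcement to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. The sketch below illustrates that arithmetic. It assumes the common mean-and-standard-deviation normalization over one sampled group; the function name, the `eps` term, and the group of 16 rollouts are hypothetical choices, not details from the note.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (assumed formula)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]

# A near-impossible problem: 15 failed rollouts and one accidental success.
rewards = [0.0] * 15 + [1.0]
advantages = group_relative_advantages(rewards)

print("advantage of the lucky success:", round(advantages[-1], 3))
print("advantage of each failure:", round(advantages[0], 3))

# When every rollout fails, the group carries no signal at all.
print("all-fail group:", group_relative_advantages([0.0] * 16))
```

The single success receives an advantage several standard deviations above the rest of the group, regardless of whether it was reached by sound reasoning or by a shortcut such as answer repetition.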

The interesting turn is that the fix isn't shorter demonstrations; it's changing what the model is rewarded for. Instead of copying every token, give the model a dense step-wise signal for how closely each of its *own* steps aligns with the expert's, which provides a learning gradient even when every full attempt fails (Can step-wise expert rewards help small models learn hard reasoning?). That reframes the long demonstration from a script to be memorized into a source of per-step guidance the small model can follow at its own capability level. It is also consistent with the finding that verifying intermediate reasoning steps, rather than final answers, recovers so much performance on long traces: adding intermediate verification raised task success from 32% to 87% (Where do reasoning agents actually fail during long traces?).
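A minimal sketch of the contrast between outcome-only and step-wise reward follows. The notes do not specify how alignment with the expert is measured, so the string-similarity metric here (`difflib.SequenceMatcher`) is an assumption, as are the function names and the toy steps.

```python
from difflib import SequenceMatcher

def step_reward(model_step: str, expert_step: str) -> float:
    """Similarity in [0, 1] between one model step and the matching expert step."""
    return SequenceMatcher(None, model_step, expert_step).ratio()

def stepwise_rewards(model_steps, expert_steps):
    """One reward per step; zip() stops at the shorter of the two traces."""
    return [step_reward(m, e) for m, e in zip(model_steps, expert_steps)]

def outcome_reward(model_answer: str, expert_answer: str) -> float:
    """Sparse reward: 1.0 only if the final answer matches exactly."""
    return 1.0 if model_answer.strip() == expert_answer.strip() else 0.0

expert_steps = ["let x = 3", "compute 2 * x = 6", "answer: 6"]
model_steps = ["let x = 3", "compute 2 * x = 5", "answer: 5"]

print("outcome reward:", outcome_reward(model_steps[-1], expert_steps[-1]))
print("step rewards:", [round(r, 2) for r in stepwise_rewards(model_steps, expert_steps)])
```

The failed attempt earns nothing under the outcome reward but still receives graded credit for the steps that track the expert, which is the sense in which a step-wise signal supplies a gradient when every full rollout fails.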


Sources (8 notes)

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Why does SFT-then-RL training follow a predictable three-phase pattern?

CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
