Why does KTO skip supervised fine-tuning while DPO cannot?

This answer examines the premise that KTO (Kahneman-Tversky Optimization) can train directly on a base model while DPO (Direct Preference Optimization) needs a supervised fine-tuning warm-up first. The corpus doesn't address that mechanism head-on, so the honest answer is partly about what's missing and partly about what the surrounding notes reveal.

Straight answer first: the collection doesn't contain a note that directly compares KTO and DPO or explains why one skips supervised fine-tuning (SFT) and the other depends on it. So I can't hand you the mechanism you're asking for. What I can do is show you the conceptual territory around the question — what SFT actually contributes, and why a preference method might or might not need it as a foundation — because the corpus is rich there even if it never names KTO.

The usual story is that DPO needs an SFT'd starting point because it learns from *paired* comparisons (this response beats that one), and those pairs only carry signal if the model already lives in a sensible region of output space. KTO instead learns from single thumbs-up / thumbs-down labels, which is a looser requirement. The corpus sharpens *why* that starting point matters by showing how little SFT really installs. SFT 'improves response formatting but not physical feasibility': it teaches the surface shape of a good answer, not the reasoning to construct one [Does supervised fine-tuning actually improve reasoning on optimization problems?]. Even more bluntly, instruction tuning 'teaches output format distribution, not task understanding': models trained on deliberately wrong instructions perform about as well as those trained on correct ones [Does instruction tuning teach task understanding or output format?]. So the thing SFT buys a downstream method like DPO is mostly *format conditioning*, that is, getting the model to emit well-shaped candidates that preference pairs can then rank.

That reframes your question in a useful way: the dependence isn't really about SFT teaching skills, it's about SFT supplying a well-behaved output distribution to compare against. The LIMA finding pushes this further: 1,000 curated examples are enough because 'post-training activates existing capabilities rather than building new ones' [Can careful curation replace massive alignment datasets?]. If post-training is activation rather than construction, then how much pre-conditioning a preference method needs comes down to how tolerant its loss is of an unpolished base. On that reading, a paired-comparison loss is plausibly more brittle to garbage candidates than a per-sample reward loss, though no note in the corpus tests this directly.
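To make the paired versus per-sample distinction concrete, here is a minimal sketch of the two loss shapes. It is not drawn from the corpus. It follows my understanding of the published DPO and KTO objectives, simplified so that sequence log-probabilities arrive as plain floats. The function names, the default `beta`, and the fixed `kl_ref` reference point (which KTO estimates from a batch rather than fixing) are assumptions for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(policy_chosen: float, ref_chosen: float,
             policy_rejected: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Paired loss: needs a chosen AND a rejected response to the same prompt."""
    # Implicit rewards are log-ratios of policy to reference model.
    chosen_reward = policy_chosen - ref_chosen
    rejected_reward = policy_rejected - ref_rejected
    # Only the margin between the two responses enters the loss.
    return -math.log(sigmoid(beta * (chosen_reward - rejected_reward)))


def kto_style_loss(policy_logp: float, ref_logp: float, desirable: bool,
                   kl_ref: float = 0.0, beta: float = 0.1,
                   weight: float = 1.0) -> float:
    """Per-sample loss: one response plus a thumbs-up / thumbs-down label."""
    reward = policy_logp - ref_logp
    # kl_ref is a simplifying assumption: a fixed stand-in for the
    # batch-estimated reference point used in the KTO objective.
    if desirable:
        value = sigmoid(beta * (reward - kl_ref))
    else:
        value = sigmoid(beta * (kl_ref - reward))
    return weight * (1.0 - value)
```

The sketch only shows what each loss consumes: `dpo_loss` cannot be evaluated without two responses to rank, while `kto_style_loss` scores each response against a reference point on its own. It does not demonstrate that one is more robust to a weak base model than the other.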

The corpus also gives you DPO's known failure surfaces, which is the closest it comes to your two named methods. Standard RLHF and DPO 'produce collaborators that ignore partner interventions,' evaluating suggestions by surface plausibility rather than causal impact [Why do standard alignment methods ignore partner interventions?]. And there's reassurance against a common worry about preference tuning generally: it doesn't actually collapse diversity. Measured among quality-passing outputs, preference-tuned models are *more* diverse, because the base model only looked diverse by spanning incoherent space [Does preference tuning actually reduce the diversity of model outputs?].

If you want to go deeper into the broader question lurking under yours (*when does a fine-tuning stage actually help versus quietly corrupt the model?*), the strongest adjacent threads are two. Proxy-tuning closes 88–91% of the alignment gap while leaving base weights untouched, because 'direct fine-tuning corrupts knowledge storage in lower layers' [Can decoding-time tuning preserve knowledge better than weight fine-tuning?]. SoftCoT freezes the backbone entirely to avoid catastrophic forgetting [Can continuous reasoning avoid forgetting in instruction-tuned models?]. Those reframe the SFT-dependence question as a tradeoff: every stage you add before preference learning is a chance to either condition the model usefully or overwrite what it already knew.
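For a sense of what "leaving base weights untouched" can look like in practice, here is a minimal sketch of decoding-time logit arithmetic in the spirit of proxy-tuning. The note itself says only that the method 'applies distributional shifts'. The specific form below (base logits plus the difference between a tuned and an untuned small model) is my understanding of the method rather than something the corpus states, and the function and argument names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits: List[float],
                             tuned_small_logits: List[float],
                             untuned_small_logits: List[float]) -> List[float]:
    """Shift the large base model's next-token logits by what tuning
    changed in a small model; no weights of the base model are updated."""
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)
```

When the tuned and untuned small models agree, the offset is zero and the base model's distribution passes through unchanged, which is one way to see why the base model's stored knowledge is not overwritten by this kind of steering.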

Sources 7 notes

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, reaching 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that 1000 carefully curated examples fine-tuned on a strong pretrained model achieve competitive alignment performance with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Why do standard alignment methods ignore partner interventions?

Regularizing agents to maintain consistency when intervention pathways are nullified forces them to evaluate suggestions by causal impact rather than surface plausibility. Common ground alignment emerges as a byproduct without explicit reward.

Does preference tuning actually reduce the diversity of model outputs?

When diversity is measured among quality-passing outputs rather than all outputs, preference-tuned models generate greater semantic diversity than base models. Base models appear more diverse only because their variance spans incoherent space.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.
