Why does DPO outperform SFT specifically for function calling tasks?
This explores why preference-based training (DPO) beats standard supervised fine-tuning (SFT) on function calling — and the corpus points to one answer: function calling fails mostly on rigid output format, and DPO's negative examples target exactly that failure.
The corpus is unusually unanimous: the gap isn't about teaching the model more, it's about teaching it what *not* to do. The most direct evidence shows that small models trained with DPO on correct and incorrect function-calling pairs from a large teacher model match much larger models, precisely because DPO's explicit negative examples target the rigid output-format failures where SFT alone falls short (Can small models match large models on function calling?). SFT shows the model good examples; DPO also shows it the bad ones and pushes away from them, which matters when the failure mode is a malformed JSON call rather than a wrong idea.
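To make "pushes away from them" concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. It assumes you already have summed sequence log-probabilities for the chosen (valid) and rejected (malformed) call under both the policy being trained and a frozen reference model. The function names, argument names, and the default `beta` are illustrative assumptions, not taken from any of the cited notes.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response.
    beta (an assumed default here) scales how strongly the policy is
    pushed away from the reference model's preferences.
    """
    # Implicit reward: how much more the policy likes a response
    # than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z)
    return softplus(-logits)
```

The loss falls as the policy raises the valid call's likelihood relative to the malformed one, so the rejected example contributes gradient directly. An SFT loss over the same data would only ever see the chosen call.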
To see why that's the right lever, look at what SFT actually buys you. On structured tasks, SFT improves the *surface* of an answer without improving its substance: outputs get proper JSON structure, valid identifiers, and the expected sections, but they don't become physically feasible. The model learns the look of a solution, not the reasoning to construct a valid one (Does supervised fine-tuning actually improve reasoning on optimization problems?). A parallel finding shows SFT can raise final-answer accuracy while *reducing* reasoning informativeness by 38.9% on average, because the model reaches answers through pattern-matching shortcuts rather than genuine inference (Does supervised fine-tuning actually improve reasoning quality?). So SFT reliably teaches what a correct output looks like, but it does not teach what separates a valid output from a near-miss, and a function call either validates or fails.
What makes function calling distinctive is *where* it breaks. One analysis, from Floworks, identifies three independent failure points: unreliable vector-similarity retrieval at scale, bloated schemas that inflate prompts and degrade reasoning, and the core problem that LLMs trained on free text can't reliably emit rigid JSON (Where do traditional function calling systems actually break down?). That third failure is a formatting-discipline problem, and formatting discipline is exactly what contrastive preference training enforces well: penalize the near-miss malformed calls, reward the schema-clean ones. SFT has no signal for 'this looked almost right but was invalid'; DPO does.
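One way to picture "penalize the near-miss malformed calls, reward the schema-clean ones" is to turn a strict validator into a source of preference pairs. The sketch below is an illustration under stated assumptions: the `TOOLS` registry, the `get_weather` tool, and the `prompt`/`chosen`/`rejected` key names are all hypothetical. Labeling candidates with a validator is a stand-in here for whatever labeling the cited work used, which draws its correct and incorrect examples from a large teacher model.

```python
import json

# Hypothetical tool registry: tool name -> {parameter name: expected Python type}
TOOLS = {"get_weather": {"city": str, "unit": str}}


def is_valid_call(raw: str, tools: dict = TOOLS) -> bool:
    """True only if raw is JSON naming a known tool with exactly its parameters."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # malformed JSON, e.g. a trailing comma
    if not isinstance(call, dict):
        return False
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in tools:
        return False
    if not isinstance(args, dict):
        return False
    params = tools[name]
    if set(args) != set(params):
        return False  # missing or unexpected parameter
    return all(isinstance(args[key], typ) for key, typ in params.items())


def build_pairs(prompt: str, candidates: list) -> list:
    """Pair every valid candidate with every invalid one for the same prompt."""
    valid = [c for c in candidates if is_valid_call(c)]
    invalid = [c for c in candidates if not is_valid_call(c)]
    return [{"prompt": prompt, "chosen": v, "rejected": r}
            for v in valid for r in invalid]


good = '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}}'
trailing_comma = '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"},}'
missing_param = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'

pairs = build_pairs("What's the weather in Oslo?", [good, trailing_comma, missing_param])
```

Both rejected strings are near-misses that differ from the valid call by a few characters. That is the case where a contrastive signal carries information that positive-only training cannot.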
A further point for a curious reader: DPO is not the only way to supply what plain SFT is missing. Granite-20B-FunctionCalling decomposes function calling into seven granular subtasks (nested calls, chaining, parallel functions, function-name detection, parameter detection, next-best function, and response generation) and finds that multi-task training generalizes better than umbrella datasets such as ToolLLM (Can breaking function calling into subtasks improve model generalization?). Read together, these suggest two complementary fixes for the same root cause: DPO sharpens the boundary between valid and invalid output, while task decomposition gives the model explicit practice on each structural pattern. Both improve on plain SFT over an undifferentiated dataset because each injects a signal that setup lacks: DPO a sense of what failure looks like, decomposition targeted coverage of each way a call can be structured.
The corpus also warns against over-crediting any single training recipe: a systematic RL study finds that most technique gains are setup-sensitive and that the pretrained prior, not the algorithm, sets the performance ceiling (Can two simple techniques match complex RL algorithms?). So DPO's edge on function calling is best understood narrowly. It shines when the bottleneck is a rigid, verifiable output format that negative examples can directly police; it is not a universal upgrade over SFT.
Sources (6 notes)
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.
SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.
Floworks identifies three structural failures: vector similarity retrieval is unreliable at scale, full schemas inflate prompts and degrade reasoning, and LLMs trained on free text can't handle rigid JSON output. Fixing one axis doesn't fix the others.
Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.
Advantage normalization and token-level loss aggregation allow critic-free PPO to surpass more complex algorithms. Systematic evaluation shows most RL techniques are setup-sensitive; the pretrained prior, not algorithm choice, sets the performance ceiling.