Reinforcement Learning for LLMs · LLM Reasoning · Architecture Design & LLM Interaction

Can small models match large models on function calling?

Explores whether small language models fine-tuned with the right training method can achieve comparable performance to large models on structured reasoning tasks requiring precise function calls, and what training approach makes this possible.

Note · 2026-05-03 · sourced from Tool Computer Use

The insight in this paper is methodological: function-calling for reasoning tasks is a domain where DPO outperforms SFT for small models, because the failure modes are more about preferring the right format and call sequence than about generating any plausible text. The proposed framework uses an agent that, given a problem and a callable function set, queries a large LLM with injected function descriptions and examples, managing the calls in a step-by-step reasoning chain. The byproduct is a dataset of correct AND incorrect chat completions — preference pairs ready for DPO.
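A minimal sketch of what that generation-and-grading loop could look like. The message schema and all names here (run_chain, query_teacher, reference_call) are placeholders for illustration, not the paper's actual framework:

```python
import json
from typing import Callable

def run_chain(problem: dict, functions: dict, query_teacher: Callable, max_steps: int = 8) -> list:
    """Drive one step-by-step reasoning chain: the teacher sees the injected
    function descriptions, emits one JSON function call per turn, and we execute
    the call and feed the result back until it signals a final answer."""
    tool_docs = "\n".join(f"- {name}: {fn['doc']}" for name, fn in functions.items())
    messages = [{"role": "user",
                 "content": f"Available functions:\n{tool_docs}\n\nProblem: {problem['question']}"}]
    for _ in range(max_steps):
        reply = query_teacher(messages)              # call out to the large LLM
        messages.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)                 # rigid format: one strict-JSON call per turn
        except json.JSONDecodeError:
            break                                    # malformed output ends the chain (an incorrect sample)
        if not isinstance(call, dict) or call.get("name") == "final_answer" or call.get("name") not in functions:
            break
        result = functions[call["name"]]["impl"](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return messages

def is_correct(messages: list, problem: dict) -> bool:
    """Rigid-format grading: the last assistant turn must match the reference call
    exactly (function name, argument names, and argument values)."""
    last = next((m["content"] for m in reversed(messages) if m["role"] == "assistant"), None)
    try:
        return json.loads(last) == problem["reference_call"]
    except (TypeError, json.JSONDecodeError):
        return False
```

Because grading is an exact match against a rigid target, every chain lands cleanly in one of two buckets, which is what makes the incorrect completions usable as negatives rather than noise.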

Why DPO rather than SFT or PPO. SFT teaches the model to imitate good examples but provides no signal about what to avoid — and rigid output formats (precise variable names, JSON, argument values) punish near-misses harshly, so explicit negative examples matter. PPO would work but requires extensive human feedback to train a reward model, making it resource-intensive. DPO removes the reward-model step by incorporating preferences directly into the training objective, with demonstrated stability advantages over PPO.
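For reference, the standard DPO objective is short enough to state in code: each preference pair is scored by the gap between the policy's and a frozen reference model's log-probabilities, with no separate reward model. A sketch in PyTorch (tensor names and the beta default are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss. Inputs are summed log-probabilities of
    the chosen / rejected completions under the policy being trained and under the
    frozen reference model (typically the SFT checkpoint). The implicit reward is
    beta * (log pi_theta - log pi_ref), so no reward model is trained."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin): push the chosen completion's implicit reward above the rejected one's
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```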

The structural move is that a large LLM does double duty: it generates the candidate reasoning chains AND its successes/failures provide the preference labels for the small model's training. This is a teacher-distillation pattern but with both polarities — the small model learns what the large model gets right and what it gets wrong, not just to imitate the large model's right answers. The pattern fits the broader case for "Can small language models handle most agent tasks?": function-calling is exactly the kind of repetitive, scoped, format-rigid work where a fine-tuned small model can replace a large general-purpose one.
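Assembling the training set is then just a matter of grouping the teacher's graded chains by problem and pairing correct chains with incorrect ones. A hypothetical sketch, assuming the chain and grading schema from the earlier snippet:

```python
def build_preference_pairs(chains: list) -> list:
    """Pair the teacher's correct and incorrect chains for the same problem into
    DPO preference pairs. `chains` is assumed to be a list of dicts like
    {"problem_id": ..., "messages": [...], "correct": bool}."""
    by_problem = {}
    for chain in chains:
        buckets = by_problem.setdefault(chain["problem_id"], {"chosen": [], "rejected": []})
        buckets["chosen" if chain["correct"] else "rejected"].append(chain["messages"])

    pairs = []
    for buckets in by_problem.values():
        for good in buckets["chosen"]:
            for bad in buckets["rejected"]:
                pairs.append({"prompt": good[0]["content"],   # shared problem prompt
                              "chosen": good[1:],             # teacher's correct continuation
                              "rejected": bad[1:]})           # teacher's incorrect continuation
    return pairs
```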

The practical implication: when the output format is rigid and small-model deployment is the goal, the question is not "can SFT close the gap?" but "what is the cheapest source of preference signal?" Preference pairs generated from a strong teacher's own successes and failures are essentially free relative to human feedback.



DPO-trained small models can match large models on function-calling reasoning chains — preference data from a teacher beats SFT for the rigid output format