Reinforcement Learning for LLMs · LLM Reasoning · Architecture Design & LLM Interaction

Can small models match large models on function calling?

Explores whether small language models fine-tuned with the right training method can achieve comparable performance to large models on structured reasoning tasks requiring precise function calls, and what training approach makes this possible.

Note · 2026-05-03 · sourced from Tool Computer Use

The insight in this paper is methodological: function-calling for reasoning tasks is a domain where DPO outperforms SFT for small models, because the failure modes are more about preferring the right format and call sequence than about generating any plausible text. The proposed framework uses an agent that, given a problem and a callable function set, queries a large LLM with injected function descriptions and examples, managing the calls in a step-by-step reasoning chain. The byproduct is a dataset of correct AND incorrect chat completions — preference pairs ready for DPO.
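A minimal sketch of what that generation-and-grading loop could look like. The message schema and all names here (run_chain, query_teacher, reference_call) are placeholders for illustration, not the paper's actual framework:

```python
import json
from typing import Callable

def run_chain(problem: dict, functions: dict, query_teacher: Callable, max_steps: int = 8) -> list:
    """Drive one step-by-step reasoning chain: the teacher sees the injected
    function descriptions, emits one JSON function call per turn, and we execute
    the call and feed the result back until it signals a final answer."""
    tool_docs = "\n".join(f"- {name}: {fn['doc']}" for name, fn in functions.items())
    messages = [{"role": "user",
                 "content": f"Available functions:\n{tool_docs}\n\nProblem: {problem['question']}"}]
    for _ in range(max_steps):
        reply = query_teacher(messages)              # call out to the large LLM
        messages.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)                 # rigid format: one strict-JSON call per turn
        except json.JSONDecodeError:
            break                                    # malformed output ends the chain (an incorrect sample)
        if not isinstance(call, dict) or call.get("name") == "final_answer" or call.get("name") not in functions:
            break
        result = functions[call["name"]]["impl"](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return messages

def is_correct(messages: list, problem: dict) -> bool:
    """Rigid-format grading: the last assistant turn must match the reference call
    exactly (function name, argument names, and argument values)."""
    last = next((m["content"] for m in reversed(messages) if m["role"] == "assistant"), None)
    try:
        return json.loads(last) == problem["reference_call"]
    except (TypeError, json.JSONDecodeError):
        return False
```

Because grading is an exact match against a rigid target, every chain lands cleanly in one of two buckets, which is what makes the incorrect completions usable as negatives rather than noise.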

Why DPO rather than SFT or PPO. SFT teaches the model to imitate good examples but provides no signal about what to avoid — and rigid output formats (precise variable names, JSON, argument values) punish near-misses harshly, so explicit negative examples matter. PPO would work but requires extensive human feedback to train a reward model, making it resource-intensive. DPO removes the reward-model step by incorporating preferences directly into the training objective, with demonstrated stability advantages over PPO.
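For reference, the standard DPO objective is short enough to state in code: each preference pair is scored by the gap between the policy's and a frozen reference model's log-probabilities, with no separate reward model. A sketch in PyTorch (tensor names and the beta default are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss. Inputs are summed log-probabilities of
    the chosen / rejected completions under the policy being trained and under the
    frozen reference model (typically the SFT checkpoint). The implicit reward is
    beta * (log pi_theta - log pi_ref), so no reward model is trained."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid(margin): push the chosen completion's implicit reward above the rejected one's
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```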

The structural move is that a large LLM does double duty: it generates the candidate reasoning chains AND its successes/failures provide the preference labels for the small model's training. This is a teacher-distillation pattern but with both polarities — the small model learns what the large model gets right and what it gets wrong, not just to imitate the large model's right answers. The pattern fits the broader case for "Can small language models handle most agent tasks?": function-calling is exactly the kind of repetitive, scoped, format-rigid work where a fine-tuned small model can replace a large general-purpose one.
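Assembling the training set is then just a matter of grouping the teacher's graded chains by problem and pairing correct chains with incorrect ones. A hypothetical sketch, assuming the chain and grading schema from the earlier snippet:

```python
def build_preference_pairs(chains: list) -> list:
    """Pair the teacher's correct and incorrect chains for the same problem into
    DPO preference pairs. `chains` is assumed to be a list of dicts like
    {"problem_id": ..., "messages": [...], "correct": bool}."""
    by_problem = {}
    for chain in chains:
        buckets = by_problem.setdefault(chain["problem_id"], {"chosen": [], "rejected": []})
        buckets["chosen" if chain["correct"] else "rejected"].append(chain["messages"])

    pairs = []
    for buckets in by_problem.values():
        for good in buckets["chosen"]:
            for bad in buckets["rejected"]:
                pairs.append({"prompt": good[0]["content"],   # shared problem prompt
                              "chosen": good[1:],             # teacher's correct continuation
                              "rejected": bad[1:]})           # teacher's incorrect continuation
    return pairs
```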

The practical implication: when the output format is rigid and small-model deployment is the goal, the question is not "can SFT close the gap?" but "what is the cheapest source of preference signal?" Preference pairs generated from a strong teacher's own successes and failures are essentially free relative to human feedback.



DPO-trained small models can match large models on function-calling reasoning chains — preference data from a teacher beats SFT for the rigid output format