Does training on granular tasks beat training on the full function calling problem?
This explores whether breaking function calling into explicit sub-skills and training each one beats training a model on whole tool-use examples — and what the corpus says about why decomposition might help.
This explores whether splitting function calling into named sub-skills and training on each separately beats throwing whole tool-use examples at the model. The corpus gives a fairly direct yes, with an interesting wrinkle about *why* it works. The clearest evidence comes from Granite-20B-FunctionCalling, which treats function calling not as one problem but as seven: nested calls, chaining, parallel functions, function name detection, parameter detection, picking the next-best function, and generating the response. Training explicitly across these granular tasks generalized better than training on umbrella datasets like ToolLLM, and closed the gap with GPT, Claude, and Gemini (see "Can breaking function calling into subtasks improve model generalization?"). The umbrella dataset gives you volume; the decomposed curriculum gives you coverage of the specific failure modes.
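To make "coverage rather than volume" concrete, here is a minimal sketch of a balanced multi-task mixture. It is not Granite's actual data pipeline: the sub-task labels, the `task` field on each example, and the function names `build_balanced_mixture` and `coverage` are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Labels mirror the seven sub-tasks listed in the text; the exact strings are illustrative.
SUBTASKS = [
    "nested_calls",
    "chaining",
    "parallel_functions",
    "function_name_detection",
    "parameter_detection",
    "next_best_function",
    "response_generation",
]


def build_balanced_mixture(examples, per_task, seed=0):
    """Sample up to `per_task` examples for each sub-task, then shuffle them together.

    Each example is assumed to be a dict carrying a "task" label (hypothetical schema).
    """
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for ex in examples:
        if ex["task"] not in SUBTASKS:
            raise ValueError(f"unknown sub-task: {ex['task']}")
        by_task[ex["task"]].append(ex)

    mixture = []
    for task in SUBTASKS:
        pool = by_task[task]
        # Take everything if the pool is smaller than the quota.
        mixture.extend(rng.sample(pool, min(per_task, len(pool))))
    rng.shuffle(mixture)
    return mixture


def coverage(mixture):
    """Count how many examples each sub-task contributes, including zeros."""
    counts = {task: 0 for task in SUBTASKS}
    for ex in mixture:
        counts[ex["task"]] += 1
    return counts
```

Running `coverage` on a mixture makes under-represented sub-tasks visible as low or zero counts, which is the kind of gap an umbrella dataset can hide behind sheer size.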
There's a reason decomposition might be the natural grain to train at. Pruning studies show neural networks already tend to implement compositional tasks as isolated subnetworks (ablate one and only its corresponding function breaks), and pretraining makes this modular structure more consistent (see "Do neural networks naturally learn modular compositional structure?"). If the model is internally building separable skill modules anyway, training on granular tasks is arguably training *with* that grain rather than against it. The same instinct shows up at inference time in systems that compose task-specific expert vectors on the fly rather than relying on one monolithic fine-tune (see "Can models dynamically activate expert skills at inference time?"), and in skill-library agents that build complex behaviors from stored simpler ones (see "Can agents learn new skills without forgetting old ones?").
But here's the wrinkle worth sitting with. A separate line of work suggests that what fine-tuning on function calling actually teaches may be narrower than "understanding the task." Models trained on semantically empty or even deliberately wrong instructions match models trained on correct ones: what transfers is knowledge of the *output space*, not task meaning (see "Does instruction tuning teach task understanding or output format?"). Function calling is unusually format-bound (rigid JSON, exact parameter names), so the seven-task decomposition may be winning largely because each granular task drills a distinct slice of that output distribution. This reframes "granular beats whole" as "granular gives more thorough coverage of the format space," not necessarily deeper reasoning.
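What "format-bound" means in practice can be sketched with a small checker. The call shape (`{"name": ..., "arguments": {...}}`), the `spec` layout, and the function name `check_call` are assumptions chosen for illustration, not a format taken from any of the cited systems.

```python
import json


def check_call(raw_output, spec):
    """Return a list of format errors for one model output; an empty list means well-formed.

    `spec` is a hypothetical tool description with keys "name", "parameters", "required".
    """
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(call, dict):
        return ["top level is not an object"]

    errors = []
    if call.get("name") != spec["name"]:
        errors.append("wrong or missing function name")

    args = call.get("arguments")
    if not isinstance(args, dict):
        errors.append("arguments is not an object")
        return errors

    allowed = set(spec["parameters"])
    required = set(spec["required"])
    # Exact parameter names matter: a near-synonym counts as an error.
    for key in sorted(set(args) - allowed):
        errors.append(f"unknown parameter: {key}")
    for key in sorted(required - set(args)):
        errors.append(f"missing parameter: {key}")
    return errors
```

A call that passes `location` where the spec expects `city` gets the intent right and still fails twice here, once for the unknown name and once for the missing one. That is the sense in which output-space knowledge, rather than task meaning, decides success.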
That reframing predicts what fixes the *remaining* failures. Small models often fail function calling specifically on rigid output format, and DPO, which trains on correct *and* incorrect examples, beats plain supervised fine-tuning precisely because the negative examples target those format failures directly (see "Can small models match large models on function calling?"). So the fuller picture isn't just granular-vs-whole; it's that the most effective recipe combines decomposed task coverage with negative-example training that pins down the exact format the model keeps getting wrong.
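The negative-example idea can be sketched in two pieces: building a preference pair whose rejected answer carries a format error, and the standard DPO loss on summed log-probabilities. The helper `make_pair`, the call shape, and the default `beta=0.1` are illustrative assumptions. The cited work takes its examples from a teacher model, whereas this sketch corrupts a known-good call by hand.

```python
import json
import math


def make_pair(prompt, good_call, wrong_key, right_key):
    """Build one preference pair whose rejected answer misnames a parameter.

    Hypothetical helper: the rejected call is the chosen call with `right_key`
    renamed to `wrong_key`, a typical rigid-format failure.
    """
    bad_call = json.loads(json.dumps(good_call))  # deep copy via JSON round trip
    bad_call["arguments"][wrong_key] = bad_call["arguments"].pop(right_key)
    return {
        "prompt": prompt,
        "chosen": json.dumps(good_call),
        "rejected": json.dumps(bad_call),
    }


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given sequence log-probabilities.

    loss = -log sigmoid(beta * margin), where margin compares how much the policy
    prefers the chosen answer over the rejected one, relative to the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The loss falls below `log 2` only when the policy separates chosen from rejected more than the reference does. Since each pair here differs by a single parameter name, the gradient is focused on exactly that token-level format choice.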
One caution the corpus raises: decomposition is not magic generalization. Transformers tend to solve compositional problems by memorizing computation subgraphs and stitching them together, failing badly on genuinely novel combinations (see "Do transformers actually learn systematic compositional reasoning?"), and the order you train sub-skills in mechanically reshapes the result: structured-first curricula avoid the entropy collapse that joint training causes (see "Does training order reshape how models handle different task types?"). So granular training helps, but *which* granular tasks, in *what order*, and with *which negative examples* are the levers that decide whether it actually beats the monolithic approach.
Sources (8 notes)
Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, scoring 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.