Can granular sub-task training for function calling improve both open and proprietary models?

This explores whether breaking function calling into smaller, explicitly-trained sub-tasks actually lifts performance — and whether that lift reaches closed frontier models (GPT, Claude, Gemini) or only helps the open models doing the training.

The corpus is direct on the first half of that question and quieter on the second, which is itself worth knowing. The strongest evidence comes from Granite-20B-FunctionCalling, which decomposes the skill into seven granular sub-tasks (nested calls, chaining, parallel functions, function-name detection, parameter detection, next-best-function selection, and response generation) and trains across all of them rather than dumping everything into one umbrella dataset like ToolLLM (Can breaking function calling into subtasks improve model generalization?). The payoff there is framed as closing the gap with GPT, Claude, and Gemini, not as improving those proprietary models directly. So the honest answer to "both open and proprietary" is: granular training demonstrably improves the open model, and the way it bears on proprietary models is by making open models competitive with them, a different relationship than the question's symmetry implies.
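As a rough illustration of what "training across all seven sub-tasks" can look like at the data level, here is a minimal sketch of a balanced training mixture. The sub-task names come from the text above; the `build_mixture` helper, the pool format, and the even round-robin weighting are assumptions made for the example, not Granite's published recipe.

```python
import random
from collections import Counter

# Sub-task names follow the text; everything else here is illustrative.
SUBTASKS = [
    "nested_calls",
    "function_chaining",
    "parallel_functions",
    "function_name_detection",
    "parameter_detection",
    "next_best_function",
    "response_generation",
]

def build_mixture(pools, n_examples, seed=0):
    """Sample a training set that draws evenly from every sub-task pool.

    pools: dict mapping sub-task name -> non-empty list of examples
           (hypothetical format; any example object works).
    """
    missing = [name for name in SUBTASKS if not pools.get(name)]
    if missing:
        raise ValueError(f"no examples for sub-tasks: {missing}")
    rng = random.Random(seed)
    mixture = []
    for i in range(n_examples):
        task = SUBTASKS[i % len(SUBTASKS)]  # round-robin keeps sub-tasks balanced
        mixture.append((task, rng.choice(pools[task])))
    rng.shuffle(mixture)  # avoid presenting sub-tasks in a fixed order
    return mixture

# Toy pools with three placeholder examples per sub-task.
pools = {name: [f"{name}-example-{k}" for k in range(3)] for name in SUBTASKS}
mixture = build_mixture(pools, n_examples=14)
counts = Counter(task for task, _ in mixture)
```

The point of the explicit check for missing pools is the same as the point of the decomposition: an umbrella dataset can silently under-represent one of these skills, while a per-sub-task mixture makes the gap visible.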

Why does fine-grained decomposition work at all? A surprising clue is that instruction tuning for these tasks may not be teaching "understanding" so much as the shape of the output space: models trained on semantically empty or even wrong instructions perform nearly as well as those trained on correct ones, because what transfers is knowledge of the output format distribution (Does instruction tuning teach task understanding or output format?). Function calling is exactly the kind of task where rigid output format is the hard part, which is also why DPO, which trains on explicit correct-vs-incorrect call examples, beats plain supervised fine-tuning for small models and lets them match much larger ones on function-calling reasoning (Can small models match large models on function calling?). Read together, these suggest the granular sub-tasks work because each one isolates a distinct format-and-structure failure mode rather than hoping a monolithic dataset covers them all.
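To make the DPO point concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, where the "chosen" response is a well-formed function call and the "rejected" one is a malformed call. The log-probability values and the `beta` setting are made-up numbers for illustration; in practice they would come from the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Hypothetical numbers: the policy has shifted toward the well-formed call
# and away from the malformed one, relative to the reference model.
improved = dpo_loss(policy_chosen_logp=-4.0, policy_rejected_logp=-9.0,
                    ref_chosen_logp=-6.0, ref_rejected_logp=-6.0)

# The opposite shift: the policy now prefers the malformed call.
regressed = dpo_loss(policy_chosen_logp=-9.0, policy_rejected_logp=-4.0,
                     ref_chosen_logp=-6.0, ref_rejected_logp=-6.0)
```

Unlike supervised fine-tuning, which only raises the likelihood of correct calls, this objective is also lowered by pushing probability away from the specific wrong call, which is why it targets format failures so directly.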

There's a deeper, less obvious reason decomposition is a natural fit: networks seem to *want* to be modular. Pruning experiments show neural networks spontaneously implement compositional sub-routines in isolated sub-networks, where ablating one part affects only its corresponding function, and pretraining makes this modular structure more reliable (Do neural networks naturally learn modular compositional structure?). Granular sub-task training may be working *with* that grain rather than against it: you're supervising sub-networks the model was already inclined to carve out.

The corpus also points to where this idea goes once you leave weight-update training behind. Instead of baking sub-tasks into parameters, agents can extract reusable sub-task routines from their own experience. Agent Workflow Memory induces routines at finer granularity than whole tasks and compounds them, yielding relative gains of 24.6% on Mind2Web and 51.1% on WebArena that grow as the train-test gap widens (Can agents learn reusable sub-task routines from past experience?). VOYAGER pushes the same logic into an external, composable skill library that sidesteps catastrophic forgetting entirely (Can agents learn new skills without forgetting old ones?). This is the lateral payoff: "granular sub-task training" has a parameter-side version (train the seven sub-tasks in) and a memory-side version (let the agent accumulate sub-task routines at runtime), and the latter applies equally to a proprietary model you can't retrain, since it lives outside the weights.
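The memory-side version can be sketched without touching any model weights. The following is a toy routine store, assuming a hypothetical `Routine` record and plain word overlap for retrieval; VOYAGER's library is embedding-indexed and Agent Workflow Memory induces its routines automatically, so this only illustrates the shape of the idea, not either system's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Routine:
    name: str
    description: str
    steps: list  # ordered sub-task steps, stored as plain strings

@dataclass
class RoutineMemory:
    routines: list = field(default_factory=list)

    def add(self, routine):
        self.routines.append(routine)

    def retrieve(self, task, top_k=1):
        """Rank stored routines by word overlap with the task description.

        Word overlap is a stand-in for a real similarity search.
        """
        task_words = set(task.lower().split())
        scored = []
        for routine in self.routines:
            overlap = len(task_words & set(routine.description.lower().split()))
            if overlap:
                scored.append((overlap, routine))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [routine for _, routine in scored[:top_k]]

    def build_prompt(self, task, top_k=1):
        """Prepend retrieved routines to the task; the result goes to any model."""
        lines = []
        for routine in self.retrieve(task, top_k):
            lines.append(f"Known routine: {routine.name}")
            lines.extend(f"  {i}. {step}" for i, step in enumerate(routine.steps, 1))
        lines.append(f"Task: {task}")
        return "\n".join(lines)

memory = RoutineMemory()
memory.add(Routine("lookup_then_book", "search flights then book a flight",
                   ["call search_flights(origin, destination, date)",
                    "call book_flight(flight_id)"]))
memory.add(Routine("check_weather", "get weather forecast for a city",
                   ["call get_weather(city)"]))

prompt = memory.build_prompt("book a flight to Paris")
```

Because the routines live in the prompt rather than the parameters, the same store works whether the model behind it is an open checkpoint or a proprietary API.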

So the takeaway: decomposition reliably helps open models and is the mechanism by which they catch up to frontier models. But if you actually want to improve a proprietary model you can't fine-tune, the corpus's answer isn't sub-task *training*; it's sub-task *memory* layered on top.

Sources 6 notes

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions: 43% versus 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
