Topics: LLM Reasoning and Architecture · Reinforcement Learning for LLMs · Agentic and Multi-Agent Systems

Can breaking function calling into subtasks improve model generalization?

Does training on seven granular function-calling subtasks, rather than one umbrella objective, close the gap between open-source and proprietary models? This note explores whether decomposition surfaces hidden failure modes that unified training misses.

Note · 2026-05-03 · sourced from Tool Computer Use

The diagnosis behind Granite-20B-FunctionCalling is that "function calling" as a training target is too coarse. Models fine-tuned on umbrella function-calling datasets like ToolLLM, ToolAlpaca, and Gorilla underperform along three dimensions: they fail to generalize out-of-domain, they handle the granular sub-tasks poorly when isolated, and they trail proprietary models like GPT, Claude, and Gemini. The pattern suggests that what looks like one capability is actually seven that are loosely coupled.

Granite's response is to make the seven explicit and train across all of them as separate tasks: (1) Nested Function Calling — using one function's output as another's input; (2) Function Chaining — sequencing dependent calls; (3) Parallel Functions — invoking multiple independent calls; (4) Function Name Detection — picking the right function from a set; (5) Parameter-Value Pair Detection — slot filling against a schema; (6) Next-Best Function — selecting the next call given partial state; (7) Response Generation — composing the user-facing reply from tool outputs.
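To make the decomposition concrete, here is a minimal sketch of how a multi-task instruction-tuning mixture over the seven subtasks might be assembled. The task names follow the list above, but the sample shape, the example call, and the `build_mixture` helper are all illustrative assumptions, not Granite's actual data format.

```python
# Hypothetical sketch of a multi-task mixture over the seven subtasks.
# Sample format and example content are invented for illustration.
import random

SUBTASKS = [
    "nested_function_calling",
    "function_chaining",
    "parallel_functions",
    "function_name_detection",
    "parameter_value_pair_detection",
    "next_best_function",
    "response_generation",
]

def make_sample(task, instruction, target):
    """One training example tagged with its granular subtask."""
    return {"task": task, "instruction": instruction, "target": target}

# e.g. a nested-call sample: one function's output feeds another's input
nested = make_sample(
    "nested_function_calling",
    "Get the weather for the user's current city.",
    'weather(city=get_location(user="u1"))',
)

def build_mixture(samples_by_task, seed=0):
    """Interleave samples from every subtask so training sees all seven."""
    rng = random.Random(seed)
    mixture = [s for task in SUBTASKS for s in samples_by_task.get(task, [])]
    rng.shuffle(mixture)
    return mixture
```

The point of tagging each sample with its subtask is that per-task loss and per-task evaluation stay visible throughout training, rather than dissolving into one umbrella metric.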

The structural claim is that an instruction-tuning mixture across granular sub-tasks generalizes better than a single umbrella objective, because each sub-task surfaces different failure modes during training. A model that has explicitly practiced nested calls, parallel calls, and chaining understands their composition rather than emitting tokens that look like function calls without structural correctness.
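The distinction between "emits tokens that look like function calls" and "structurally correct" can be made mechanical. The sketch below parses a candidate call and validates it against a toy schema; the schema, function names, and parameters are invented for illustration.

```python
# Sketch: surface plausibility vs. structural correctness of a call.
# The schema maps each known function to its allowed keyword parameters.
import ast

SCHEMA = {"get_location": {"user"}, "weather": {"city"}}

def structurally_valid(call_str):
    """True iff the string parses as a (possibly nested) call whose
    function names and keyword parameters all match the schema."""
    try:
        node = ast.parse(call_str, mode="eval").body
    except SyntaxError:
        return False

    def check(n):
        if not isinstance(n, ast.Call) or not isinstance(n.func, ast.Name):
            return False
        params = SCHEMA.get(n.func.id)
        if params is None:
            return False  # unknown function
        for kw in n.keywords:
            if kw.arg not in params:
                return False  # hallucinated parameter
            if isinstance(kw.value, ast.Call) and not check(kw.value):
                return False  # invalid nested call
        return True

    return check(node)

structurally_valid('weather(city=get_location(user="u1"))')  # nested, valid
structurally_valid('weather(town="Boston")')                 # wrong parameter
```

A model trained only on the umbrella objective can score well on the first kind of check (plausible surface form) while failing the second (schema-level slot filling), which is exactly the failure the subtask split is meant to expose.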

The implication for capability evaluation: a single function-calling benchmark is misleading. Models can be strong on call-statement generation while failing on parameter slot-filling or next-best-function selection, and the average masks where the failure lies. The right unit of evaluation — and training — is the sub-task, not the umbrella. This connects directly to "Where do traditional function calling systems actually break down?": Floworks names three independent failure points; Granite implicitly says there are seven, all training-addressable through multi-task decomposition.
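A small numeric sketch of the masking effect described above. The per-subtask scores are invented numbers, not measurements; they only show how a respectable umbrella average can hide two failing subtasks.

```python
# Illustrative (invented) per-subtask scores for one model.
scores = {
    "nested_function_calling": 0.91,
    "function_chaining": 0.88,
    "parallel_functions": 0.90,
    "function_name_detection": 0.93,
    "parameter_value_pair_detection": 0.52,  # hidden weak spot
    "next_best_function": 0.55,              # hidden weak spot
    "response_generation": 0.89,
}

# The umbrella average looks fine (~0.80) ...
umbrella = sum(scores.values()) / len(scores)

# ... while the per-subtask view isolates exactly where the model fails.
weak = {task: s for task, s in scores.items() if s < 0.7}
```

The same breakdown applies to training: a loss averaged over the umbrella task cannot tell the optimizer which of the seven capabilities is lagging.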



Original note title: function calling decomposes into seven granular tasks; multi-task learning across them generalizes where umbrella training fails