Can breaking function calling into sub-tasks improve model generalization?
Does training on seven granular function-calling sub-tasks instead of one umbrella objective close the gap between open-source and proprietary models? This explores whether decomposition surfaces hidden failure modes that unified training misses.
The diagnosis behind Granite-20B-FunctionCalling is that "function calling" as a training target is too coarse. Models fine-tuned on umbrella function-calling datasets like ToolLLM, ToolAlpaca, and Gorilla underperform along three dimensions: they fail to generalize out-of-domain, they handle the granular sub-tasks poorly when isolated, and they trail proprietary models like GPT, Claude, and Gemini. The pattern suggests that what looks like one capability is actually seven that are loosely coupled.
Granite's response is to make the seven explicit and train across all of them as separate tasks (see the sketch after this list):
1. Nested Function Calling: using one function's output as another's input.
2. Function Chaining: sequencing dependent calls.
3. Parallel Functions: invoking multiple independent calls.
4. Function Name Detection: picking the right function from a set.
5. Parameter-Value Pair Detection: slot filling against a schema.
6. Next-Best Function: selecting the next call given partial state.
7. Response Generation: composing the user-facing reply from tool outputs.
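To make the decomposition concrete, here is a minimal sketch of what training instances for three of the seven sub-tasks might look like. The call format and the function names (get_user, get_weather, get_orders) are hypothetical illustrations, not examples drawn from the Granite training data.

```python
# Hypothetical training instances for three of the seven sub-tasks.
# Call syntax and function names are illustrative assumptions only.

nested_call = {
    "task": "nested_function_calling",
    "query": "What is the weather in the city where user 42 lives?",
    # The inner call's output feeds the outer call's argument.
    "target": 'get_weather(city=get_user(id=42)["city"])',
}

parallel_calls = {
    "task": "parallel_functions",
    "query": "What's the weather in Paris and in Rome?",
    # Two independent calls; neither depends on the other's result.
    "target": ['get_weather(city="Paris")', 'get_weather(city="Rome")'],
}

slot_filling = {
    "task": "parameter_value_pair_detection",
    "query": "Show me orders for user 42 from March",
    # Slot filling: map query spans onto the given parameter schema.
    "schema": {"get_orders": {"user_id": "int", "month": "str"}},
    "target": {"get_orders": {"user_id": 42, "month": "March"}},
}
```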
The structural claim is that an instruction-tuning mixture across granular sub-tasks generalizes better than a single umbrella objective, because each sub-task surfaces different failure modes during training. A model that has explicitly practiced nested calls, parallel calls, and chaining learns how the pieces compose, rather than merely emitting tokens that look like function calls but lack structural correctness.
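A rough sketch of what such a mixture could look like in training code: sample instruction examples across the seven sub-tasks, tagging each with its sub-task label so per-task behavior stays visible during training. The uniform default weights and the dataset shape are assumptions, not the paper's actual recipe.

```python
import random

# The seven sub-task names follow the list above; the mixture
# weights below are illustrative defaults, not Granite's proportions.
SUBTASKS = [
    "nested_function_calling",
    "function_chaining",
    "parallel_functions",
    "function_name_detection",
    "parameter_value_pair_detection",
    "next_best_function",
    "response_generation",
]

def sample_mixture(datasets, n, weights=None):
    """Draw n training examples across sub-task datasets.

    datasets maps sub-task name -> list of instruction examples (dicts).
    Each sampled example is tagged with its sub-task so per-task loss
    and accuracy can be tracked rather than averaged away.
    """
    weights = weights or {t: 1.0 for t in SUBTASKS}
    tasks = random.choices(
        SUBTASKS, weights=[weights[t] for t in SUBTASKS], k=n
    )
    return [dict(random.choice(datasets[t]), task=t) for t in tasks]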
The implication for capability evaluation: a single function-calling benchmark is misleading. Models can be strong on call-statement generation while failing on parameter slot-filling or next-best-function selection, and the average masks where the failure lies. The right unit of evaluation — and training — is the sub-task, not the umbrella. This connects directly to "Where do traditional function calling systems actually break down?": Floworks names three independent failure points; Granite implicitly says there are seven, all of them addressable in training through multi-task decomposition.
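A hedged sketch of sub-task-level scoring: group examples by their sub-task label and report the per-task breakdown instead of one average. The predict callable and the exact-match criterion are placeholders standing in for whatever inference and scoring a real harness would use.

```python
from collections import defaultdict

def evaluate_by_subtask(examples, predict):
    """Report accuracy per sub-task rather than one umbrella score.

    `predict` is a hypothetical model-inference callable; exact match
    against the target is a stand-in for a real scoring function.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        totals[ex["task"]] += 1
        if predict(ex["query"]) == ex["target"]:
            hits[ex["task"]] += 1
    # The per-task breakdown exposes the spread the average would hide.
    return {t: hits[t] / totals[t] for t in totals}
```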
Source: Tool Computer Use
Related concepts in this collection
- Where do traditional function calling systems actually break down?
  Function calling seems simple but fails in ways that aren't obvious. This explores three independent failure points — retrieval, context bloat, and output rigidity — that together explain why even the best models struggle.
  extends: Floworks names three structural failure points (retrieval, schema bloat, output format); Granite identifies seven sub-task failure modes that umbrella training conflates. Both argue function-calling is not one problem.
- Can small models match large models on function calling?
  Explores whether small language models fine-tuned with the right training method can achieve comparable performance to large models on structured reasoning tasks requiring precise function calls, and what training approach makes this possible.
  complements: Granite addresses the *what to train on* axis (granular sub-tasks); DPO-from-teacher addresses the *how to train* axis (preference vs SFT). Both target the open-vs-proprietary gap on function calling.
- Does training order reshape how models handle different task types?
  Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.
  complements: multi-task training surfaces complementary entropy dynamics; Granite's seven-task mixture should benefit from this — different sub-tasks have different entropy profiles, and training across them stabilizes.
- Does separating planning from execution improve reasoning accuracy?
  Explores whether modularizing decomposition and solution into separate models prevents interference and boosts performance compared to monolithic approaches.
  complements: same decomposition logic applied within function-calling — slot-filling, chaining, and response generation each warrant separate training signals because their error modes differ.
- Can small language models handle most agent tasks?
  Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
  complements: granular sub-task decomposition is what enables SLM-first deployment of function-calling — each sub-task is small enough for a fine-tuned SLM to handle.
Original note title: function calling decomposes into seven granular tasks — multi-task learning across them generalizes where umbrella training fails