Where do traditional function calling systems actually break down?
Function calling seems simple but fails in ways that aren't obvious. This note explores three independent failure points (retrieval, context bloat, and output rigidity) that together explain why even the best models struggle.
The Floworks analysis frames "traditional function calling" (the model receives a task plus the full set of function schemas and must emit a complete call in one shot) as failing at three independent points, which together explain why even GPT-4o and Claude-3 Opus struggle with it.
Inefficient function retrieval. When tool catalogues are large, picking the right function is delegated to vector similarity over schema descriptions. Vector similarity is a heuristic with known accuracy, scalability, and domain-specificity problems. The retrieval layer fails before the model gets to reason.
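A minimal sketch of that retrieval layer makes the fragility concrete. The `embed()` below is a toy hashed bag-of-words stand-in (a real system would call a sentence-embedding model), and the tool catalogue is invented for illustration:

```python
import math
from collections import Counter

def embed(text: str, dim: int = 256) -> list[float]:
    # Toy hashed bag-of-words; a real system would call an embedding model here.
    vec = [0.0] * dim
    for tok, count in Counter(text.lower().split()).items():
        vec[hash(tok) % dim] += count
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Invented catalogue: the retriever sees only these description strings.
TOOLS = {
    "create_event": "Create a calendar event with title, start time, and attendees.",
    "send_email": "Send an email message to one or more recipients.",
    "book_flight": "Search and book airline tickets between two cities.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(TOOLS, key=lambda name: cosine(q, embed(TOOLS[name])), reverse=True)
    return ranked[:k]

# "set up a meeting" overlaps with the correct description only on stopwords
# ("a", "with"), so the ranking rests on noise rather than meaning.
print(retrieve("set up a meeting with the design team"))
```

Paraphrase gaps like this one (meeting vs. calendar event) are exactly where similarity over schema descriptions fails, and the failure happens before the model sees anything.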
Excessive token lengths. Function schemas are verbose (argument names, types, descriptions, examples), and including every available schema in the prompt inflates context dramatically. This is not just a cost issue: the reasoning ability of LLMs falls drastically as active context length grows, so the schemas crowd out the cognitive bandwidth available for the actual task.
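A rough sketch shows the scale of the bloat. The schema shape mirrors common JSON function-schema conventions, and the 4-characters-per-token ratio is a crude approximation, not a real tokenizer:

```python
import json

def schema(name: str) -> dict:
    # One verbose-but-typical function schema: names, types, descriptions.
    return {
        "name": name,
        "description": f"Invoke the {name} operation with validated arguments.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text input describing the request."},
                "limit": {"type": "integer", "description": "Maximum number of results to return."},
                "verbose": {"type": "boolean", "description": "Include debug detail in the response."},
            },
            "required": ["query"],
        },
    }

def approx_tokens(obj) -> int:
    return len(json.dumps(obj)) // 4  # crude rule of thumb: ~4 chars per token

catalogue = [schema(f"tool_{i}") for i in range(200)]
print("all schemas:", approx_tokens(catalogue), "tokens")      # whole catalogue in-prompt
print("top-3 only: ", approx_tokens(catalogue[:3]), "tokens")  # selective injection
```

Even at this toy scale the full catalogue consumes tens of thousands of tokens, most of which are irrelevant to any single task.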
High output sensitivity. LLMs are trained on free-flowing text where near-misses are tolerable. Function calling demands rigid output: precise variable names, valid JSON structure, exact argument values. The training distribution is misaligned with the deployment requirement, and small format errors cause hard failures rather than degraded responses.
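A short sketch shows why a near-miss is a hard failure rather than a degraded response. The weather-tool signature is hypothetical; the validator is the kind of strict check a tool runtime applies:

```python
import json

# Assumed signature for a hypothetical weather tool.
EXPECTED = {"city": str, "units": str}

def parse_call(raw: str) -> dict:
    args = json.loads(raw)  # raises on malformed JSON (single quotes, trailing commas)
    for name, value in args.items():
        if name not in EXPECTED:
            raise ValueError(f"unknown argument: {name!r}")
        if not isinstance(value, EXPECTED[name]):
            raise TypeError(f"{name} must be {EXPECTED[name].__name__}")
    return args

for raw in ('{"city": "Oslo", "units": "metric"}',   # exact output: accepted
            '{"cty": "Oslo", "units": "metric"}'):   # one dropped letter: rejected
    try:
        print("ok:", parse_call(raw))
    except (ValueError, TypeError) as err:  # JSONDecodeError subclasses ValueError
        print("hard failure:", err)
```

In free text a reader would recover "cty" effortlessly; the tool runtime cannot, which is the distribution mismatch in miniature.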
The implication is that "function calling" is not one problem with one fix. Improvements at the retrieval layer (better-than-cosine matching), the context layer (schema compression or selective injection), and the output layer (constrained decoding or structure-aware training) compound rather than overlap. Anyone treating function-calling failure as a single bug to patch will under-invest in at least two of the three axes.
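As a sketch of that compounding, the hypothetical pipeline below stacks one deliberately simple intervention per layer: keyword retrieval, schema compression, and validate-with-retry. `fake_model` is a stand-in for any LLM client, and the catalogue is invented; swapping in a stronger technique at one layer leaves the other two guards untouched:

```python
import json

CATALOGUE = {
    "get_weather": {
        "description": "Fetch the current weather for a city.",
        "parameters": {"city": {"type": "string", "description": "City name."}},
    },
    "send_email": {
        "description": "Send an email to a recipient.",
        "parameters": {"to": {"type": "string", "description": "Recipient address."}},
    },
}

def select_tools(task: str, k: int = 1) -> dict:
    # Layer 1 (retrieval): crude keyword overlap; a better-than-cosine
    # matcher would slot in here without touching the other layers.
    def score(name: str) -> int:
        desc = CATALOGUE[name]["description"].lower()
        return sum(tok in desc for tok in task.lower().split())
    return {n: CATALOGUE[n] for n in sorted(CATALOGUE, key=score, reverse=True)[:k]}

def compress(schema: dict) -> dict:
    # Layer 2 (context): keep parameter names and types, drop the prose.
    return {p: spec["type"] for p, spec in schema["parameters"].items()}

def validated_call(model, tool: str, params: dict, retries: int = 2) -> dict:
    # Layer 3 (output): parse strictly, reject unknown arguments, retry.
    for _ in range(retries + 1):
        try:
            args = json.loads(model(tool, params))
            if set(args) <= set(params):
                return args
        except json.JSONDecodeError:
            pass
    raise RuntimeError(f"{tool}: no valid call after {retries + 1} attempts")

def fake_model(tool: str, params: dict) -> str:
    # Stand-in for an LLM client; always emits one well-formed argument.
    return json.dumps({next(iter(params)): "Oslo"})

for name, schema in select_tools("what is the weather in Oslo").items():
    print(name, validated_call(fake_model, name, compress(schema)))
```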
The three Floworks failure points connect to three different intervention papers in this cluster. "Can models decide better than retrievers which tools to use?" addresses the retrieval failure point by replacing passive vector-similarity retrieval with model-initiated proactive requests. "Can breaking function calling into subtasks improve model generalization?" addresses both retrieval (function name detection, next-best function) and output format (parameter slot-filling, structural composition) by training on granular sub-tasks. "Can small models match large models on function calling?" addresses the output-format failure point specifically by using a preference signal where SFT fails.
Source: Tool Computer Use
Related concepts in this collection
- Can models decide better than retrievers which tools to use?
Traditional retrieval picks tools upfront based on initial queries, but do models themselves make better decisions about tool needs as they reason? This explores whether authority over tool selection should move from external systems to the LLM.
extends: MCP-Zero is the targeted intervention against Floworks's retrieval failure point — replacing passive single-round retrieval with model-initiated iterative requests.
- Can breaking function calling into subtasks improve model generalization?
Does training on seven granular function-calling subtasks instead of one umbrella objective close the gap between open-source and proprietary models? This explores whether decomposition surfaces hidden failure modes that unified training misses.
extends: Granite's seven sub-tasks make explicit what gets trained, in place of the umbrella objective Floworks identifies as the structural problem; both reject the single-shot function-calling framing.
- Can small models match large models on function calling?
Explores whether small language models fine-tuned with the right training method can achieve comparable performance to large models on structured reasoning tasks requiring precise function calls, and what training approach makes this possible.
extends: a targeted intervention against Floworks's output-format failure point; DPO with negative examples teaches the model what to avoid when emitting rigid JSON.
- Can reasoning and tool execution run in parallel?
Standard LLM tool use halts for each response, creating redundant prompts and sequential delays. Do alternative architectures that separate reasoning from tool observation actually eliminate these costs?
complements: ReWOO/CoA address the schema-bloat failure point at the inference architecture level rather than at the training level.
- Why does random tool sampling produce unrealistic synthetic training data?
Tool-calling datasets generated through random sampling and single-turn framing lack the complexity and coherence of real deployment. This explores what structural choices in data synthesis determine whether models can learn realistic tool composition.
complements: ToolFlow's data-side critique pairs with Floworks's deployment-side critique; both argue that function-calling failure is structural, surfacing at the data-synthesis and deployment stages respectively.
Original note title
traditional function calling is monolithic and bottlenecked at three points — retrieval accuracy, schema bloat, and rigid output format