Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application programming interfaces (APIs) to complete complex tasks. These tasks are collectively termed function calling. Endowing LLMs with function calling abilities leads to a myriad of advantages, such as access to current and domain-specific information in databases and knowledge sources, and the ability to outsource tasks that can be reliably performed by tools, e.g., a Python interpreter or calculator. While there has been significant progress in function calling with LLMs, there is still a dearth of open models that perform on par with proprietary LLMs such as GPT, Claude, and Gemini. Therefore, in this work, we introduce the GRANITE-20B-FUNCTIONCALLING model under an Apache 2.0 license. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling: Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation.
For LLMs to serve as autonomous agents, they must excel at two fundamental capabilities: (a) reasoning and planning, and (b) function calling, which includes identifying, calling, and interacting with tools and APIs in external environments.
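To make this concrete, the following is a minimal sketch of the function-calling loop: the model reads a tool specification, emits a structured call, and the surrounding agent (not the model) executes it. The tool name get_weather, its schema, and the stubbed model are hypothetical illustrations, not artifacts of this work.

import json

# Hypothetical tool catalog in JSON-Schema style; get_weather and its
# schema are illustrative, not part of this work.
TOOLS = [{
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def call_model(query: str, tools: list) -> str:
    # Stand-in for the LLM: a real model would ground the function name
    # and argument values in the query and the tool specifications.
    return json.dumps({"name": "get_weather", "arguments": {"city": "Boston"}})

def execute(call: dict):
    # The agent runs the selected tool with the model-supplied arguments
    # and returns the observation to the model or the user.
    registry = {"get_weather": lambda city: {"city": city, "temp_c": 21}}
    return registry[call["name"]](**call["arguments"])

call = json.loads(call_model("What is the weather in Boston?", TOOLS))
print(execute(call))  # {'city': 'Boston', 'temp_c': 21}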
The importance of function calling has spurred several recent data generation efforts for fine-tuning (Basu et al., 2024; Guo et al., 2024; Qin et al., 2023; Yan et al., 2024; Tang et al., 2023) and for model evaluation (Li et al., 2023b; Muennighoff et al., 2023). Typically, however, models fine-tuned on datasets like ToolLLM (Qin et al., 2023), ToolAlpaca (Tang et al., 2023), and Gorilla (Patil et al., 2023) underperform in one (or more) of three key dimensions: (a) Generalizability: While the datasets are generated using diverse sets of APIs (e.g., ToolLLM uses RapidAPI, ToolAlpaca uses public APIs, and Gorilla uses TensorFlow Hub, PyTorch Hub, and Hugging Face Hub), Basu et al. (2024) have shown that models trained on these datasets have difficulty generalizing to out-of-domain datasets. (b) Granular tasks: Function calling, as an umbrella term, encompasses multiple granular sub-tasks such as function-name detection, slot filling (i.e., parameter-value pair detection), and detecting the ordered sequence of functions that need to be called; a concrete decomposition is sketched below. Existing models trained to perform function calling cannot handle these granular tasks independently and hence perform poorly on such sub-tasks.
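As a hypothetical illustration of these sub-tasks, consider how a single user turn decomposes into separate targets; the query, the function names, and the $-placeholder syntax for nested outputs are assumptions made for exposition, not this paper's format.

# Hypothetical user turn; all names below are illustrative.
query = "Convert 100 USD to EUR, then round the result to 2 decimal places."

# Function Name Detection: which functions are relevant (order-free).
function_names = ["convert_currency", "round_number"]

# Parameter-Value Pair Detection (slot filling): arguments per function.
parameter_values = {
    "convert_currency": {"amount": 100, "from_currency": "USD", "to_currency": "EUR"},
    "round_number": {"digits": 2},  # its value slot is produced by the first call
}

# Function Chaining / Next-Best Function: the ordered sequence of calls,
# where the second (nested) call consumes the output of the first.
plan = [
    {"name": "convert_currency",
     "arguments": {"amount": 100, "from_currency": "USD", "to_currency": "EUR"}},
    {"name": "round_number",
     "arguments": {"value": "$convert_currency.result", "digits": 2}},
]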
Our work is an instantiation of instruction tuning. It involves taking a large collection of NLP datasets, reformulating those datasets into a set of instruction-following tasks, and then fine-tuning an LLM on the modified data. While the earliest versions of instruction tuning straightforwardly combined large datasets, the most recent iterations use more sophisticated mixtures of tasks to achieve the best results (Li et al., 2024; Sudalairaj et al., 2024).
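As a minimal sketch of this reformulation step, the snippet below turns a plain supervised example into an instruction-following record; the field names and template wording are assumptions for illustration, not the exact format used in this work.

# Minimal sketch: reformulating a supervised NLP example into an
# instruction-following record. Field names and template wording are
# illustrative assumptions.
def to_instruction(example: dict, template: str) -> dict:
    return {
        "instruction": template.format(**example),
        "output": example["label"],
    }

sentiment_example = {"text": "The plot was gripping.", "label": "positive"}
template = ("Classify the sentiment of the following review as positive "
            "or negative.\nReview: {text}")

record = to_instruction(sentiment_example, template)
print(record["instruction"])
print(record["output"])  # positive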