Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks

Paper · arXiv 2410.18890 · Published October 24, 2024

This study introduces a novel framework for training smaller language models in function calling, focusing on specific logical and mathematical reasoning tasks. The approach aims to improve the performance of small-scale models on these tasks through function calling, ensuring a high level of accuracy. Our framework employs an agent that, given a problem and a set of callable functions, queries the LLM by injecting a description and examples of the usable functions into the prompt and managing their calls in a step-by-step reasoning chain. This process is used to create a dataset of correct and incorrect reasoning-chain chat completions from a large-scale LLM. The dataset is then used to train a smaller LLM with Reinforcement Learning from Human Feedback (RLHF), specifically the Direct Preference Optimization (DPO) technique.

This framework involves the use of an agent that, given a problem and a set of functions useful for its solution, queries a large-scale LLM by injecting the function descriptions and examples into the prompt and managing the function calls the model needs in order to reach the solution, all within a step-by-step reasoning chain. This procedure is then used to build a dataset of correct and incorrect chat completions. The generated dataset is used to train a smaller model with a Reinforcement Learning from Human Feedback (RLHF) [23]–[26] approach, namely Direct Preference Optimization (DPO) [27]. We present the methodology tested on two different reasoning tasks.
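To make the loop concrete, the following is a minimal Python sketch of how such an agent could be organized, under stated assumptions: the llm callable, the function registry, the JSON call convention, and the check_answer oracle are hypothetical placeholders for illustration, not the authors' implementation.

```python
import json

def build_prompt(problem, functions):
    """Inject function descriptions and usage examples into the prompt."""
    tool_section = "\n".join(
        f"- {f['name']}({', '.join(f['params'])}): {f['description']}\n"
        f"  example: {f['example']}"
        for f in functions
    )
    return (
        "Solve the problem step by step. At each step you may call one of the "
        "functions below by answering with JSON {\"call\": name, \"args\": [...]}.\n"
        f"Available functions:\n{tool_section}\n\nProblem: {problem}\n"
    )

def run_agent(llm, problem, functions, registry, max_steps=10):
    """Step-by-step reasoning chain: query the LLM, execute each requested
    function call, feed the result back, and record the full chat completion."""
    messages = [{"role": "user", "content": build_prompt(problem, functions)}]
    for _ in range(max_steps):
        reply = llm(messages)                       # completion from the large-scale LLM
        messages.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)
        except ValueError:
            break                                   # non-call reply: final answer reached
        result = registry[call["call"]](*call["args"])   # execute the requested function
        messages.append({"role": "user", "content": f"Result: {result}"})
    return messages

def collect_chains(llm, problems, functions, registry, check_answer):
    """Sort reasoning chains into correct/incorrect sets, from which
    preference pairs for DPO can later be formed."""
    correct, incorrect = [], []
    for p in problems:
        chain = run_agent(llm, p["question"], functions, registry)
        (correct if check_answer(chain, p["answer"]) else incorrect).append(chain)
    return correct, incorrect
```

In this sketch a chain is kept in full, so a correct and an incorrect completion for the same problem can be paired as the preferred and dispreferred responses required by DPO.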

While supervised fine-tuning is a common approach, alternative methods leveraging Reinforcement Learning from Human Feedback (RLHF) have gained prominence. One such method is Proximal Policy Optimization (PPO) [43], which integrates a reward model into the reinforcement learning framework for policy optimization. Despite its effectiveness, PPO’s requirement for extensive human feedback to train the reward model makes it resource-intensive and time-consuming. A more efficient and equally effective alternative is Direct Preference Optimization (DPO) [27]. DPO distinguishes itself by enabling the model to learn a policy directly from user preference data, eliminating the need for an explicit reward function. Furthermore, DPO has demonstrated superior stability compared to PPO. The DPO process begins with gathering human feedback. Assessors evaluate pairs of model-generated responses to identical prompts, creating a dataset of preference pairs. Unlike PPO, which trains a separate reward model, DPO incorporates these preferences directly into the training objective.
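Concretely, the objective that DPO optimizes over such preference pairs, as introduced in [27], can be written as

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],
\]

where x is the prompt, y_w and y_l are the preferred and dispreferred completions, \(\pi_{\mathrm{ref}}\) is a frozen reference copy of the model, \(\sigma\) is the logistic function, and \(\beta\) controls how far the learned policy may drift from the reference. Maximizing this implicit reward margin takes the place of the explicit reward model that PPO requires.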