Octopus v2: On-device language model for super agent

Paper · arXiv 2404.01744 · Published April 2, 2024
Agents · Assistants · Personalization

Current on-device models for function calling face issues with latency and accuracy. Our research presents a new method that empowers an on-device model with 2 billion parameters to surpass GPT-4 in both accuracy and latency, while decreasing the context length by 95%. Compared to Llama-7B with a RAG-based function-calling mechanism, our method improves latency by a factor of 35. This reduces latency to levels suitable for deployment across a variety of edge devices in production environments, meeting the performance requirements of real-world applications.

The initial step involves understanding a function’s description and its arguments, then using information from the user’s query to fill in the parameters of an executable call. A direct strategy might combine a classification model with a causal language model: we can view the N available functions as a selection pool, which turns the selection challenge into a softmax classification problem.
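As a concrete illustration, the sketch below frames selection over N candidate functions as a softmax classification; the pool size, hidden dimension, and module names are hypothetical choices for this example, not the paper's implementation.

```python
# A minimal sketch: treating function selection over a pool of N
# candidates as a softmax classification problem.
import torch
import torch.nn as nn

N_FUNCTIONS = 20    # hypothetical size of the function pool
HIDDEN_SIZE = 2048  # hypothetical encoder hidden size


class FunctionSelector(nn.Module):
    def __init__(self, hidden_size: int, n_functions: int):
        super().__init__()
        # One logit per candidate function in the pool.
        self.head = nn.Linear(hidden_size, n_functions)

    def forward(self, query_embedding: torch.Tensor) -> torch.Tensor:
        # Softmax over the N functions turns selection into classification.
        return torch.softmax(self.head(query_embedding), dim=-1)


selector = FunctionSelector(HIDDEN_SIZE, N_FUNCTIONS)
probs = selector(torch.randn(1, HIDDEN_SIZE))  # dummy query embedding
predicted_fn = probs.argmax(dim=-1)            # index of the chosen function
```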

One straightforward classification method is retrieval-based document selection, which identifies the function whose description most closely matches the user’s query by semantic similarity. Alternatively, a classification model can map the query to a specific function name, or an autoregressive model, such as a GPT model, can predict the correct function name from the user’s query given the candidate functions in context.
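A minimal sketch of the retrieval-based variant follows, ranking candidate function descriptions by cosine similarity with the query; the encoder checkpoint and the function descriptions are illustrative assumptions, not artifacts from the paper.

```python
# A minimal sketch of retrieval-based function selection by semantic
# similarity between the query and each function description.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

# Hypothetical function descriptions; real ones would come from the API docs.
function_descriptions = [
    "take_a_photo: captures an image with the device camera",
    "set_alarm: schedules an alarm for a given time",
    "send_message: sends a text message to a contact",
]
query = "wake me up at 7 am tomorrow"

doc_emb = model.encode(function_descriptions, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Pick the function whose description is closest to the query.
scores = util.cos_sim(query_emb, doc_emb)
best = scores.argmax().item()
print(function_descriptions[best])  # expected: the set_alarm entry
```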

When a language model is used to produce a function name, it must generate multiple tokens to form that name, which can lead to inaccuracies. To mitigate such errors, we propose designating each function as a unique functional token. For a pool of N available functions, we assign token names ranging from <nexa_0> to <nexa_N-1> to represent them. This transforms function-name prediction into a single-token classification among the N functional tokens, enhancing the accuracy of function name prediction while reducing the number of tokens that must be generated. To implement this, we introduce the new special tokens <nexa_0> through <nexa_N-1> into the tokenizer and modify the architecture of the pretrained model by expanding the language head by an additional N units.
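A minimal sketch of this tokenizer and vocabulary expansion, using the Hugging Face transformers API; the base checkpoint name is an assumption (the paper builds on a 2-billion-parameter model), and only the registration and resizing step is shown, not the subsequent fine-tuning.

```python
# A minimal sketch: register N functional tokens and grow the model's
# embedding matrix and language head to cover them.
from transformers import AutoModelForCausalLM, AutoTokenizer

N = 20  # hypothetical number of functions in the pool

# Checkpoint name is illustrative; any causal LM checkpoint works here.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

# Add <nexa_0> ... <nexa_N-1> as special tokens so each function name
# is predicted as a single token rather than a multi-token string.
functional_tokens = [f"<nexa_{i}>" for i in range(N)]
tokenizer.add_special_tokens({"additional_special_tokens": functional_tokens})

# Expand the input embeddings and the tied language head by N units
# so the new tokens have trainable rows.
model.resize_token_embeddings(len(tokenizer))
```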