RouteLLM: Learning to Route LLMs with Preference Data
Large language models (LLMs) excel at a wide range of tasks, but choosing the right model often involves balancing performance and cost: powerful models deliver better results but are expensive, while smaller models are more cost-effective but less capable. To address this trade-off, we introduce a training framework for learning efficient router models that dynamically select between a stronger and a weaker LLM during inference. Our framework leverages human preference data and employs data augmentation techniques to enhance performance. Evaluations on public benchmarks show that our approach can reduce costs by more than 2x without sacrificing response quality.
LLM routing (Ding et al., 2024; Hu et al., 2024) offers an effective solution by first processing each user query through a router, which then determines the most suitable LLM to handle the query. The router can direct simpler queries to smaller models and more complex ones to larger models, thereby balancing response quality with cost efficiency.
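To make the control flow concrete, here is a minimal sketch of this dispatch pattern; the router interface, model objects, and threshold are illustrative assumptions rather than part of our framework's API.

```python
# Minimal sketch of single-query LLM routing.
# The `score`/`generate` interfaces and the 0.5 threshold are
# illustrative assumptions, not a real API.

def route_query(query, router, weak_llm, strong_llm, threshold=0.5):
    """Dispatch `query` to exactly one of two LLMs based on the router's score."""
    # Hypothetical router output: estimated probability that the
    # strong model is needed to answer this query well.
    p_strong = router.score(query)
    model = strong_llm if p_strong >= threshold else weak_llm
    return model.generate(query)  # exactly one LLM call per query
```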
Achieving optimal LLM routing (maximizing quality within a cost constraint, or minimizing cost for a target quality) is challenging. An ideal LLM router must (1) optimize response quality while invoking a single LLM per query, minimizing cost and latency compared to multi-LLM approaches; (2) generalize to out-of-domain queries without needing separate routers for different domains; and (3) work across a broad range of LLMs without retraining, ensuring flexibility as the LLM landscape evolves.
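This trade-off can be stated compactly. Writing $R(q) \in \{m_{\text{weak}}, m_{\text{strong}}\}$ for the model a router $R$ assigns to query $q$, the first objective corresponds to (the notation here is our own shorthand, not a formalism from the works cited):

$$\max_{R}\;\mathbb{E}_{q}\!\left[\mathrm{quality}\big(R(q),\,q\big)\right] \quad \text{subject to} \quad \mathbb{E}_{q}\!\left[\mathrm{cost}\big(R(q),\,q\big)\right]\le B,$$

with the dual formulation minimizing expected cost subject to a quality floor.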
A key distinction exists between reward modeling (Ouyang et al., 2022) and LLM routing. Reward modeling assesses response quality after an LLM has generated it, whereas routing selects the appropriate LLM beforehand. Routing therefore requires a deep understanding of both the query's complexity and the specific capabilities of the available models.
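The timing difference is easiest to see side by side; the sketch below contrasts the two pipelines under assumed interfaces (none of these function names come from the works cited).

```python
# Illustrative contrast between reward modeling and routing.
# All interfaces here are assumptions for the sketch.

def reward_model_pipeline(query, llm, reward_model):
    # Reward modeling: a response is generated FIRST, then scored.
    response = llm.generate(query)
    score = reward_model.score(query, response)
    return response, score

def routing_pipeline(query, router, weak_llm, strong_llm):
    # Routing: the model is chosen BEFORE any response exists, so the
    # router must judge difficulty from the query alone.
    model = strong_llm if router.needs_strong(query) else weak_llm
    return model.generate(query)
```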
Several recent works have also examined cost-performance trade-offs in routing between different LLMs. LLM-Blender (Jiang et al., 2023) uses an ensemble framework that queries multiple LLMs during inference and selects the best response. FrugalGPT (Chen et al., 2023) follows a cascading approach, sequentially querying LLMs until a reliable response is obtained. AutoMix (Aggarwal et al., 2024) uses a smaller model to self-verify its response before potentially routing the query to a larger model. These methods rely on multiple LLM queries per input, which can increase latency. In contrast, our approach routes each query to a single LLM, addressing the latency constraint of an ideal LLM router.
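For illustration, a FrugalGPT-style cascade might look like the sketch below (the interfaces and acceptance threshold are our assumptions). Note that each loop iteration costs one LLM call, whereas the routing sketch above makes exactly one.

```python
# Sketch of a cascading approach (interfaces are assumptions).
# In the worst case every model in the chain is queried, adding latency.

def cascade(query, models, verifier, accept_threshold=0.8):
    for model in models:                  # ordered cheapest -> most capable
        response = model.generate(query)  # one LLM call per attempt
        if verifier.score(query, response) >= accept_threshold:
            return response               # stop at the first acceptable answer
    return response                       # fall back to the strongest model's answer
```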
Hybrid-LLM (Ding et al., 2024) shares some similarities with our framework but differs in key aspects: it uses synthetic preference labels from the MixInstruct dataset (Jiang et al., 2023) based on BARTScore (Yuan et al., 2021) and relies on a single BERT-based router. In contrast, we leverage human preference labels from Chatbot Arena (Chiang et al., 2024) and explore multiple router architectures, showing that data augmentation significantly boosts performance across all of them. Additionally, Hybrid-LLM evaluates only on the MixInstruct test split and lacks evidence of out-of-domain generalization, whereas we evaluate on several decontaminated public benchmarks to demonstrate such generalization.
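As a rough illustration of the general idea of learning a router from preference labels, the sketch below frames it as binary classification over query embeddings; the data fields, embedding function, and logistic-regression router are assumptions made for the sketch, not the architectures we actually study.

```python
# Rough sketch: train a router as a binary classifier on preference data.
# Each record is assumed to hold a query plus a human preference label
# indicating whether the strong model's answer was preferred ("strong_win").
# The embedding function and logistic-regression model are illustrative choices.

from sklearn.linear_model import LogisticRegression

def train_router(records, embed):
    """records: iterable of dicts like {"query": str, "strong_win": bool};
    embed: any function mapping a query string to a fixed-size feature vector."""
    X = [embed(r["query"]) for r in records]
    y = [int(r["strong_win"]) for r in records]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # clf.predict_proba(embed(q)) then estimates P(strong model needed),
    # which plays the role of the router score in the dispatch sketch above.
    return clf
```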