Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing

Paper · arXiv 2508.12631 · Published August 18, 2025

Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency trade-offs. The Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models—including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1—Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ∼90% of that performance at 63% lower cost.

In this work, we advance test-time routing to optimize the performance–efficiency trade-off. We build upon our earlier work Avengers [15]—which showed that a simple routing recipe using ten models (∼7B parameters each) surpasses GPT-4.1 and GPT-4.5 across 15 datasets—and introduce the Avengers-Pro. With a focus on the performance-efficiency trade-off, the Avengers-Pro operates through three lightweight operations: (i) embedding: encode queries using a text embedding model, (ii) clustering: group queries by semantic similarity, and (iii) scoring: evaluate models within each cluster based on a performance-efficiency score weighted by a trade-off parameter α. During inference, each query is embedded and mapped to its top-p nearest clusters. The model with the highest performance-efficiency score aggregated over those clusters is selected to generate the response.
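The scoring step can be illustrated with a minimal sketch. The exact form of the performance-efficiency score is not reproduced here; the sketch below assumes it is a convex combination of a model's per-cluster accuracy and its normalized cost, weighted by the trade-off parameter α (the function name `pe_score` and the normalization by `max_cost` are illustrative assumptions, not the paper's definition):

```python
# Hypothetical performance-efficiency score for routing.
# Assumption: score = alpha * accuracy - (1 - alpha) * normalized cost,
# so alpha = 1 ranks purely by performance and alpha = 0 purely by efficiency.
def pe_score(accuracy: float, cost: float, alpha: float,
             max_cost: float) -> float:
    """Higher is better; cost is scaled into [0, 1] before weighting."""
    norm_cost = cost / max_cost
    return alpha * accuracy - (1.0 - alpha) * norm_cost

# Example: at a balanced alpha = 0.5, a slightly less accurate but much
# cheaper model can outscore a stronger, pricier one on the same cluster.
score_big = pe_score(accuracy=0.90, cost=10.0, alpha=0.5, max_cost=10.0)
score_small = pe_score(accuracy=0.80, cost=1.0, alpha=0.5, max_cost=10.0)
print(score_small > score_big)
```

Sweeping α between 0 and 1 then traces out the performance-efficiency frontier described in the abstract.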

The Avengers-Pro uses a router to ensemble a set of heterogeneous LLMs of varying capability and efficiency. Appropriate routing depends on an accurate understanding of each model's capability and efficiency across different types of tasks or queries. To build this understanding, the router requires a set D of labeled query–answer pairs. Each query d ∈ D is first encoded into a semantic vector using a text embedding model. These embeddings are then grouped into k clusters using a clustering algorithm, producing a set C = {c1, . . . , ck}, where each cluster represents a semantically coherent query type.
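The clustering step can be sketched as follows. The paper does not specify the embedding model or clustering algorithm in this excerpt, so the sketch uses synthetic vectors in place of real query embeddings and a minimal k-means implementation as a stand-in:

```python
import numpy as np

# Toy stand-in for query embeddings; in practice each row would be the
# output of a text embedding model applied to one labeled query.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=-2.0, scale=0.3, size=(20, 8)),  # one query "type"
    rng.normal(loc=+2.0, scale=0.3, size=(20, 8)),  # another query "type"
])

def kmeans(x: np.ndarray, k: int, iters: int = 50) -> np.ndarray:
    """Minimal k-means: returns k centroids, one per semantic cluster."""
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # assign each embedding to its nearest centroid
        d = np.linalg.norm(x[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points,
        # keeping the old centroid if a cluster happens to be empty
        centroids = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids

centroids = kmeans(embeddings, k=2)
print(centroids.shape)  # (k, embedding_dim)
```

The resulting centroids define the clusters c1, . . . , ck against which incoming queries are matched at inference time.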

Following common practice in routing [3, 16, 15], we randomly split the data: 70% is used to fit the clustering model and estimate per-cluster statistics, and the remaining 30% is reserved for routing and evaluation. At inference time, we compute the embedding of the incoming query and retrieve the top-p nearest clusters (p = 4) in the embedding space.
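The inference-time selection rule can be sketched end to end. The centroid coordinates and per-model scores below are illustrative placeholders (not values from the paper); the assumed aggregation is a sum of per-cluster performance-efficiency scores over the top-p nearest clusters:

```python
import numpy as np

# Hypothetical per-cluster statistics estimated on the 70% training split:
# cluster centroids plus each model's performance-efficiency score per cluster.
centroids = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]])
pe_scores = {  # illustrative numbers only
    "model_a": np.array([0.7, 0.2, 0.5]),
    "model_b": np.array([0.4, 0.6, 0.3]),
}

def route(query_emb: np.ndarray, top_p: int = 2) -> str:
    """Return the model with the highest performance-efficiency score
    aggregated over the top-p clusters nearest to the query embedding."""
    dists = np.linalg.norm(centroids - query_emb, axis=1)
    nearest = np.argsort(dists)[:top_p]  # indices of the top-p clusters
    return max(pe_scores, key=lambda m: pe_scores[m][nearest].sum())

# A query embedded near the first cluster is routed by aggregating
# scores over its two nearest clusters.
print(route(np.array([0.2, 0.9])))
```

The paper sets p = 4; the toy setup above uses p = 2 only because it has just three clusters.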