Can routing beat building one better model?
Does directing queries to specialized models via semantic clustering outperform investing in a single frontier model? The question tests whether model improvement or model selection is the stronger lever for performance gains.
Avengers-Pro demonstrates that routing queries to different models based on semantic clustering can exceed the performance of any individual model in the pool — including frontier models. The mechanism: embed incoming queries, cluster by semantic similarity, evaluate per-cluster model performance-efficiency scores, and route each query to the highest-scoring model for its cluster.
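The inference-time mechanism can be sketched in a few lines. This is a minimal stand-in, not the paper's implementation: all names are hypothetical, the real system uses a learned text encoder and many more clusters, and the per-cluster scores come from the offline calibration step.

```python
def route(query_emb, centroids, cluster_scores):
    """Return the index of the best model for this query.

    query_emb      -- embedding of the incoming query (list of floats)
    centroids      -- cluster centroids fit offline (list of vectors)
    cluster_scores -- cluster_scores[c][m]: offline performance-efficiency
                      score of model m on queries falling in cluster c
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # 1. nearest-cluster lookup for the query embedding
    cluster = min(range(len(centroids)),
                  key=lambda c: sq_dist(query_emb, centroids[c]))
    # 2. score aggregation: pick the best-scoring model within that cluster
    return max(range(len(cluster_scores[cluster])),
               key=lambda m: cluster_scores[cluster][m])

# toy example: 2 clusters, 3 models (illustrative numbers)
centroids = [[0.0, 0.0], [1.0, 1.0]]
scores = [[0.9, 0.2, 0.5],   # model 0 scores best in cluster 0
          [0.1, 0.8, 0.3]]   # model 1 scores best in cluster 1
assert route([0.1, -0.1], centroids, scores) == 0
assert route([0.9, 1.2], centroids, scores) == 1
```

Everything expensive (fitting centroids, estimating scores) lives outside this function, which is why the router adds negligible latency per query.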
Three results establish the claim:
- Performance: +7% average accuracy over GPT-5-medium (the strongest individual model in the pool) across 6 benchmarks
- Efficiency at parity: matches GPT-5-medium accuracy at 27% lower cost
- Efficiency at near-parity: reaches ~90% of GPT-5-medium performance at 63% lower cost
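These operating points fall out of a single trade-off knob: the per-cluster score blends normalized accuracy and cost, and sweeping the blend weight traces the cost-accuracy frontier. A hedged sketch of that idea, with the weight name, the linear blend, and all numbers assumed for illustration rather than taken from the paper:

```python
def tradeoff_score(accuracy, cost, alpha):
    """Blend accuracy (higher is better) and cost (lower is better).

    alpha=1.0 optimizes accuracy only; alpha=0.0 optimizes cost only.
    Both inputs are assumed pre-normalized to [0, 1].
    """
    return alpha * accuracy + (1.0 - alpha) * (1.0 - cost)

# a cheap specialist vs a frontier model on one cluster (made-up numbers)
cheap = tradeoff_score(accuracy=0.80, cost=0.10, alpha=0.5)
frontier = tradeoff_score(accuracy=0.90, cost=0.80, alpha=0.5)

# at a balanced alpha the cheap model wins the cluster...
assert cheap > frontier
# ...while an accuracy-only alpha still prefers the frontier model
assert tradeoff_score(0.90, 0.80, 1.0) > tradeoff_score(0.80, 0.10, 1.0)
```

Dialing alpha down moves the router along the frontier: first matching frontier accuracy at lower cost, then trading a little accuracy for much larger savings.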
The earlier Avengers work made an even more striking claim: ten models of ~7B parameters each, with routing, surpassed GPT-4.1 and 4.5 across 15 datasets. This suggests the performance gain from optimal model selection can be comparable to the gap between model generations.
The architecture is lightweight: three operations at inference time (embedding, nearest-cluster lookup, score aggregation). The heavy work — fitting the clustering model and estimating per-cluster performance statistics — happens offline on a calibration set (70% for fitting, 30% for evaluation). This makes the approach deployable as a thin routing layer atop any model API ecosystem.
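The offline half can also be sketched: fit centroids on calibration queries, then estimate each model's per-cluster accuracy on held-out outcomes. This toy version uses a plain-Python Lloyd's k-means as a stand-in for a real clustering library, and every name and data point is illustrative, not from the paper.

```python
import random

def fit_centroids(embs, k, iters=20, seed=0):
    """Toy k-means: return k centroids for a list of embeddings."""
    rng = random.Random(seed)
    centroids = rng.sample(embs, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for e in embs:
            c = min(range(k), key=lambda i: sum((a - b) ** 2
                    for a, b in zip(e, centroids[i])))
            buckets[c].append(e)
        for i, b in enumerate(buckets):
            if b:  # keep the old centroid if a bucket goes empty
                centroids[i] = [sum(col) / len(b) for col in zip(*b)]
    return centroids

def per_cluster_scores(embs, outcomes, centroids, n_models):
    """outcomes[q][m] = 1 if model m answered calibration query q correctly.
    Returns score[c][m] = mean accuracy of model m within cluster c."""
    k = len(centroids)
    hits = [[0] * n_models for _ in range(k)]
    counts = [0] * k
    for e, out in zip(embs, outcomes):
        c = min(range(k), key=lambda i: sum((a - b) ** 2
                for a, b in zip(e, centroids[i])))
        counts[c] += 1
        for m in range(n_models):
            hits[c][m] += out[m]
    return [[h / counts[c] if counts[c] else 0.0 for h in hits[c]]
            for c in range(k)]

# toy calibration data: two separated query clusters, two models
embs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]]
outcomes = [[1, 0]] * 3 + [[0, 1]] * 3  # model 0 wins near origin, model 1 far away
cents = fit_centroids(embs, 2)
scores = per_cluster_scores(embs, outcomes, cents, n_models=2)
c0 = min(range(2), key=lambda i: sum(v * v for v in cents[i]))  # cluster nearest origin
assert scores[c0][0] > scores[c0][1]
assert scores[1 - c0][1] > scores[1 - c0][0]
```

In deployment the `scores` table is all the router needs at inference time; refreshing it as new models join the pool is an offline batch job, not a retraining run.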
Relative to "Can we allocate inference compute based on prompt difficulty?", Avengers-Pro adds a complementary optimization axis. Compute-optimal scaling asks "how much inference budget per query?" Routing asks "which model per query?" The two are independent: a routing layer could be composed with per-query compute allocation for a two-dimensional Pareto optimization. Relative to "Can inference compute replace scaling up model size?", routing extends the substitution argument: you don't need a bigger model or more compute, you need the right model for this specific query type.
The implication challenges the frontier model race: rather than building one model that dominates on everything, assembling a diverse pool of specialized-ish models with good routing may be both cheaper and more effective. This aligns with the heterogeneous architecture thesis in Can small language models handle most agent tasks? — routing makes the heterogeneous approach practical.
Source: Routers
Related concepts in this collection
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
  (complementary axis: compute allocation + model selection = two-dimensional optimization)
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
  (routing extends substitution: right model > bigger model)
- Can small language models handle most agent tasks?
  Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
  (routing is the mechanism enabling heterogeneous architectures)
- Can routers select the right model before generation happens?
  Explores whether LLMs can be matched to queries by estimating difficulty upfront, before any generation begins. This matters because routing could cut costs significantly while preserving response quality.
  (single-model routing as the base case this extends to multi-model pools)
Original note title: test-time model ensembling via embedding-cluster routing surpasses any individual frontier model — model selection is a stronger lever than model improvement