Can a single Elo ranking represent multidimensional model capability?
This explores whether one number — like the Elo score behind LLM leaderboards — can capture how good a model really is, or whether 'capability' is too many-sided to collapse into a single rank.
The corpus has a clear tension on this. On one side, large-scale pairwise voting genuinely works: Chatbot Arena's 240K+ crowdsourced preference votes produce credible rankings, because the questions are diverse and discriminating enough to separate models reliably and the crowd's judgments correlate with expert raters (see: Can crowdsourced votes reliably rank language models?). So a single Elo-style score is a *credible* signal when the thing you care about is averaged human preference across a broad question mix.
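To make "Elo-style score" concrete, here is a minimal sketch of the classic online Elo update applied to a stream of pairwise votes. The K-factor of 32, the starting rating of 1000, and the model names are illustrative assumptions; this is not a description of how any particular leaderboard fits its ratings.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def rate(votes, k: float = 32.0, start: float = 1000.0) -> dict:
    """Fold (winner, loser) votes into one rating per model.

    k and start are assumed defaults, chosen only for illustration.
    """
    ratings = {}
    for winner, loser in votes:
        ratings.setdefault(winner, start)
        ratings.setdefault(loser, start)
        # The winner gains what the loser gives up, scaled by how
        # surprising the result was under the current ratings.
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


# Hypothetical votes: model_a wins three times, model_b once.
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
ratings = rate(votes)
print(ratings)
```

Every vote, whatever the question was about, moves the same single number per model, which is why the result reflects preference averaged over the question mix.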
The problem is what that average hides. Agent capability decomposes into at least five separable axes (task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness), and models that top one axis often rank lower on another, which makes single-score evaluation 'systematically misleading' for real deployment (see: Does a single benchmark score actually predict agent readiness?). A scalar ranking implicitly fixes one set of weights over those axes; change the weighting and the order can change. Elo doesn't represent the multidimensionality; it picks one projection of it.
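The weighting point can be shown with a small sketch. The two models, their per-axis scores, and both weight vectors below are hypothetical numbers made up for illustration; only the axis names come from the text above.

```python
# Hypothetical per-axis scores (not measured results).
scores = {
    "model_x": {"task_success": 0.90, "privacy_compliance": 0.50},
    "model_y": {"task_success": 0.70, "privacy_compliance": 0.85},
}


def rank(scores: dict, weights: dict) -> list:
    """Collapse each capability vector to a weighted sum, then sort best-first."""
    def scalar(axes: dict) -> float:
        return sum(weights[axis] * axes[axis] for axis in weights)
    return sorted(scores, key=lambda model: scalar(scores[model]), reverse=True)


# Two assumed deployments that care about the same axes differently.
task_heavy = {"task_success": 0.8, "privacy_compliance": 0.2}
privacy_heavy = {"task_success": 0.2, "privacy_compliance": 0.8}

print(rank(scores, task_heavy))     # model_x first
print(rank(scores, privacy_heavy))  # model_y first
```

Neither ordering is wrong; each is a different projection of the same two capability vectors.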
Two subtler findings sharpen why scalar scores deceive. First, identical performance numbers can mask 'fundamentally different internal representations': a model can have every linearly decodable feature it needs yet be organized so badly that it shatters under perturbation or distribution shift, a weakness invisible to the metric (see: Can models be smart without organized internal structure?). Second, accuracy and calibration can diverge: binary-reward training produces confidently wrong models that score well on correctness while being badly miscalibrated, and you need a *second* reward term to optimize both at once (see: Does binary reward training hurt model calibration?). Of two models with the same Elo, one can be confidently wrong and the other honestly uncertain, a difference the rank erases entirely.
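A short sketch of the accuracy/calibration split, using the Brier score (mean squared gap between stated confidence and the 0/1 outcome). The two confidence profiles are hypothetical; both models get the same answers right and wrong, so a correctness-only metric cannot tell them apart.

```python
def accuracy(correct: list) -> float:
    """Fraction of answers that were right; confidence plays no role."""
    return sum(correct) / len(correct)


def brier(confidences: list, correct: list) -> float:
    """Mean squared error between stated confidence and the 0/1 outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(confidences, correct)) / len(correct)


# Same outcomes for both hypothetical models: three right, one wrong.
correct = [1, 1, 1, 0]

# Assumed confidence profiles, for illustration only.
overconfident = [0.99, 0.99, 0.99, 0.99]  # equally sure when wrong
hedged = [0.90, 0.90, 0.90, 0.40]         # lowers confidence on the miss

print(accuracy(correct))               # identical for both models
print(brier(overconfident, correct))   # large penalty for the confident miss
print(brier(hedged, correct))          # much smaller penalty
```

The accuracy figure is identical for both profiles, while the Brier score separates them, which is the kind of second signal a single rank has no room for.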
What's interesting is that the corpus suggests the multidimensionality isn't just a measurement headache; it's *exploitable*. If different models genuinely lead on different axes, you can route each query to its best specialist: Avengers-Pro achieves 7% higher accuracy than GPT-5-medium (or matches it at 27% lower cost) by sending queries to the best model per semantic cluster, and ten 7B models with routing previously surpassed GPT-4.1 and 4.5 (see: Can routing beat building one better model?). A single ranking would tell you to always pick the #1 model; the vector view tells you #1 is the wrong question. The same logic shows up in why training schedules matter: structured and creative tasks pull entropy in opposite directions, so one training order can't be optimal for all task types simultaneously (see: Does training order reshape how models handle different task types?).
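The routing argument reduces to a small comparison: the best single model versus the best model per cluster. This is a toy sketch, not the Avengers-Pro method; the cluster names, per-cluster accuracies, and traffic shares are hypothetical, cluster assignment is assumed to be given, and cost is ignored.

```python
# Hypothetical accuracy of each model on each query cluster.
per_cluster_accuracy = {
    "math":    {"model_a": 0.85, "model_b": 0.60},
    "writing": {"model_a": 0.55, "model_b": 0.75},
}
# Assumed share of traffic falling in each cluster.
cluster_share = {"math": 0.5, "writing": 0.5}


def overall(model: str) -> float:
    """Traffic-weighted accuracy if every query goes to one model."""
    return sum(cluster_share[c] * per_cluster_accuracy[c][model] for c in cluster_share)


def routing_table() -> dict:
    """Pick the strongest model separately for each cluster."""
    return {c: max(models, key=models.get) for c, models in per_cluster_accuracy.items()}


def routed_accuracy() -> float:
    """Traffic-weighted accuracy when each cluster uses its own best model."""
    table = routing_table()
    return sum(cluster_share[c] * per_cluster_accuracy[c][table[c]] for c in cluster_share)


models = ["model_a", "model_b"]
best_single = max(models, key=overall)
print(best_single, overall(best_single))   # the leaderboard-style answer
print(routing_table(), routed_accuracy())  # the per-cluster answer
```

Routing can only help when the per-cluster winners differ; if one model led every cluster, the routing table would collapse to the single-ranking answer.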
The honest answer: a single Elo faithfully represents *one* dimension — aggregate preference under a fixed question distribution — and it does that well. It cannot represent capability as a whole, because capability is a vector whose components trade off against each other, and any scalar is a weighted shadow of that vector. The leaderboard tells you who wins on average; it can't tell you who wins on *your* axis.
Sources (6 notes)
Can crowdsourced votes reliably rank language models?: Chatbot Arena's 240K+ crowdsourced preference votes produce credible model rankings because the underlying questions are diverse and discriminating, and crowd judgments correlate with expert raters, validating human preference as a scalable evaluation signal.
Does a single benchmark score actually predict agent readiness?: Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.
Can models be smart without organized internal structure?: Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
Does binary reward training hurt model calibration?: Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Can routing beat building one better model?: Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.
Does training order reshape how models handle different task types?: Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling (training structured tasks first) yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.