Which aggregation method best exploits diversity in generated solutions?

This explores how to combine many candidate solutions into a better answer — and which combination strategy actually gets value out of their differences, rather than washing them out.

The corpus doesn't crown a single winner so much as reveal a split between two families: methods that *select* (pick the best candidate or route to the best generator) and methods that *recombine* (search across candidates and merge their modes). The strongest signal points to recombination via search, but only when the upstream pool is genuinely diverse, which turns out to be the harder problem.

On the selection side, routing is the standout. Sending each query to the model best suited for it beats simply building one bigger model: a cluster-routing ensemble (Avengers-Pro) outperforms GPT-5-medium by about 7% in accuracy or matches it at 27% lower cost, and ten 7B models behind a router previously surpassed GPT-4.1 and 4.5 (Can routing beat building one better model?). Notably, the winning move is a *pre-generation* decision: estimate query difficulty and pick the model before any solution is generated (Can routers select the right model before generation happens?). That's selection at its leanest: it never aggregates multiple solutions at all, it just chooses the right source. It exploits diversity *across models* rather than diversity *within a candidate set*.
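A minimal sketch of that pre-generation choice, assuming each query already has an embedding and that per-cluster validation scores exist for every model. The centroids, model names, and scores below are hypothetical placeholders, not values from the cited notes, and this is not the actual Avengers-Pro or RouteLLM implementation.

```python
import math

# Hypothetical cluster centroids in a toy 2-D embedding space.
CENTROIDS = {0: (1.0, 0.0), 1: (0.0, 1.0)}

# Hypothetical validation accuracy of each model on each cluster.
CLUSTER_SCORES = {
    0: {"small-code": 0.81, "small-math": 0.42},
    1: {"small-code": 0.38, "small-math": 0.77},
}


def nearest_cluster(embedding):
    """Return the id of the centroid closest to the query embedding."""
    return min(CENTROIDS, key=lambda c: math.dist(embedding, CENTROIDS[c]))


def route(embedding):
    """Choose a model before any generation: the best scorer in the query's cluster."""
    scores = CLUSTER_SCORES[nearest_cluster(embedding)]
    return max(scores, key=scores.get)
```

Nothing is generated or compared here. The router only reads the query, which is why this family exploits differences between models rather than differences between candidate answers.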

The recombination side aims higher. Vector Policy Optimization trains a model to emit several distinct competent solutions instead of converging on one, specifically so that downstream search (evolutionary algorithms that explore and *combine* modes) can solve problems an entropy-collapsed policy can't reach at all (Should training maximize diversity when models feed into search?). This is the most direct answer to the literal question: the aggregation method that best exploits diversity is search-based mode combination, because it treats the spread of solutions as raw material to recombine, not noise to vote away.
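To make the recombination idea concrete, here is a toy sketch under strong assumptions: candidates are equal-length strings, a scorer (`fitness`, hypothetical) can grade partial solutions against a known target, and crossover is plain one-point splicing. It illustrates generic evolutionary recombination, not the search procedure used with Vector Policy Optimization.

```python
from itertools import combinations


def fitness(candidate, target):
    """Toy scorer: count positions where the candidate matches the target."""
    return sum(c == t for c, t in zip(candidate, target))


def crossover(a, b):
    """Yield every one-point crossover of two equal-length candidates."""
    for cut in range(1, len(a)):
        yield a[:cut] + b[cut:]
        yield b[:cut] + a[cut:]


def recombine(pool, target, generations=3, keep=4):
    """Cross all pairs each generation and keep the fittest unique candidates."""
    pool = list(pool)
    for _ in range(generations):
        children = [child for a, b in combinations(pool, 2)
                    for child in crossover(a, b)]
        ranked = sorted(set(pool + children),
                        key=lambda c: fitness(c, target), reverse=True)
        pool = ranked[:keep]
    return pool[0]


# Two candidates that each solve a different half of the problem.
diverse_pool = ["ABCxxx", "xxxDEF"]
# A collapsed pool: the same partial solution twice.
collapsed_pool = ["ABCxxx", "ABCxxx"]
```

With the diverse pool, `recombine(diverse_pool, "ABCDEF")` splices the two halves into a full solution that neither parent contains. With the collapsed pool there is nothing new to splice in, so the search stays stuck at the parents' score.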

But here's the catch the corpus keeps surfacing: most aggregation quietly fails because the diversity was never real. Ensembling many models assumes they disagree, yet 70+ models on open-ended queries collapse into an "Artificial Hivemind," producing near-identical outputs because of overlapping training data and alignment procedures (Do different AI models actually produce diverse outputs?). And the standard training recipe actively destroys the diversity that aggregation depends on: outcome-based RL sharpens the policy globally, draining variety even on unsolved problems (Does outcome-based RL diversity loss spread across unsolved problems?). Step-level critique during training counteracts this tail-narrowing and keeps solutions varied across self-training rounds (Do critique models improve diversity during training itself?). So "which aggregation method" is only half the question; the other half is whether anything diverse made it into the pool.
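One way to check whether a pool is worth aggregating is to measure how much its members actually differ before combining them. The sketch below uses mean pairwise Jaccard distance over lowercase word sets. That metric is an assumption chosen for simplicity: it is a crude lexical proxy, not the measure used in the cited notes, and it will miss outputs that are worded differently but say the same thing.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the overlap of the two texts' lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0  # two empty outputs are treated as identical
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def pool_diversity(outputs):
    """Mean pairwise Jaccard distance: 0.0 means collapsed, 1.0 means disjoint."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0  # a pool of zero or one output has no diversity to exploit
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

A score near zero suggests the pool has converged in the hivemind sense, so voting or recombining over it is unlikely to add much.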

The quietest finding may be the most useful: diversity without competence doesn't aggregate into quality, it aggregates into noise. Multi-agent teams beat a solo agent only when members hold genuine domain expertise; diverse-but-shallow teams underperform a single competent one, because stimulation without grounding produces process losses instead of insight (Does cognitive diversity alone improve multi-agent ideation quality?). The lesson across all of these: the best aggregation method is whichever one matches *where* your diversity actually lives. Route when your models differ, search-and-recombine when your candidates are both varied and competent, and don't bother aggregating a pool that quietly converged.

Sources (7 notes)

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.

Can routers select the right model before generation happens?

RouteLLM and Hybrid-LLM both achieve 40-50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.

Should training maximize diversity when models feed into search?

Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Does cognitive diversity alone improve multi-agent ideation quality?

Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior-level expertise. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.
