Reasoning and Learning Architectures

Should training maximize diversity when models feed into search?

If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?

Note · 2026-05-28 · sourced from Reinforcement Learning
How should we allocate compute budget at inference time?

The default post-training objective optimizes a single scalar reward, which pushes the policy toward a low-entropy distribution that concentrates probability on one mode. That is the right behavior if the model answers once and you take what it says. But increasingly the model is a component inside an inference-time search procedure — AlphaEvolve-style evolutionary search, best-of-k sampling, pass@k selection — that draws many rollouts and keeps the best. Here a model that always emits the same near-optimal answer is a liability: search has nothing to select among.
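A minimal sketch of why a collapsed generator gives a selector nothing to work with. The two policies, their scores, and the helper names (`best_of_k`, `mean_best`) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random
from typing import Callable

Sampler = Callable[[random.Random], float]

def best_of_k(sample: Sampler, k: int, rng: random.Random) -> float:
    """Draw k rollout scores and keep the best, as a selector would."""
    return max(sample(rng) for _ in range(k))

def collapsed_policy(rng: random.Random) -> float:
    # Assumption: a sharp policy that always emits the same near-optimal answer.
    return 0.8

def diverse_policy(rng: random.Random) -> float:
    # Assumption: a policy spread over several competent modes of varying quality.
    return rng.choice([0.5, 0.7, 0.9, 1.0])

def mean_best(sample: Sampler, k: int, trials: int = 2000, seed: int = 0) -> float:
    """Average best-of-k score over many independent search runs."""
    rng = random.Random(seed)
    return sum(best_of_k(sample, k, rng) for _ in range(trials)) / trials

# Print the search budget k, then the mean best score for each policy.
for k in (1, 4, 16):
    print(k, round(mean_best(collapsed_policy, k), 3),
          round(mean_best(diverse_policy, k), 3))
```

In this toy setup the collapsed policy scores the same at every budget, while the diverse policy starts slightly lower on average at k = 1 and overtakes it as k grows.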

Vector Policy Optimization (VPO) makes the consequence explicit. What the deployment loop rewards is not the quality of any single response but the quality of the best response in a sampled set, and the gap between diversity-trained and scalar-trained policies widens as the search budget grows. For evolutionary search the effect is categorical: VPO-trained models solve problems that GRPO-trained models cannot solve at all, because GRPO's collapsed distribution never proposes the seed variation that search needs to mutate from.

Why it matters: it inverts a tacit assumption. We tend to treat entropy reduction as evidence that training worked, that the model "knows the answer." But if the model is a generator feeding a selector, sharpness is overfitting to the wrong objective. The post-training target should match the inference-time objective, and when inference is search, that objective is coverage of competent modes. The tension is real: optimizing for set quality costs some single-shot pass@1 accuracy, so the choice depends on whether deployment samples once or many times.
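The trade-off can be made concrete with the standard identity for independent samples: if one rollout solves a problem with probability p, at least one of k rollouts solves it with probability 1 - (1 - p)^k. The sketch below applies that identity to two hypothetical policies. The per-problem success rates are invented for illustration and are not results from the paper.

```python
from typing import Sequence

def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k

def mean_pass_at_k(success_probs: Sequence[float], k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(p, k) for p in success_probs) / len(success_probs)

# Assumption: a sharp policy either always solves a problem or never does.
sharp = [1.0, 1.0, 0.0, 0.0]
# Assumption: a diverse policy is less reliable per sample but has
# nonzero success probability on every problem.
diverse = [0.6, 0.6, 0.3, 0.3]

# Print the budget k, then mean pass@k for the sharp and diverse policies.
for k in (1, 2, 8, 32):
    print(k, round(mean_pass_at_k(sharp, k), 3),
          round(mean_pass_at_k(diverse, k), 3))
```

With these numbers the sharp policy wins at k = 1 and stays flat afterwards, because extra samples cannot rescue a problem it solves with probability zero. The diverse policy gives up some pass@1 and gains with every increase in budget.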


— "Vector Policy Optimization: Training for Diversity Improves Test-Time Search", https://arxiv.org/abs/2605.22817

Original note title

when models run inside test-time search training should maximize diversity of competent solutions instead of converging on one best answer