Should training maximize diversity when models feed into search?
If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?
The default post-training objective optimizes a single scalar reward, which pushes the policy toward a low-entropy distribution that concentrates probability on one mode. That is the right behavior if the model answers once and you take what it says. But increasingly the model is a component inside an inference-time search procedure — AlphaEvolve-style evolutionary search, best-of-k sampling, pass@k selection — that draws many rollouts and keeps the best. Here a model that always emits the same near-optimal answer is a liability: search has nothing to select among.
Vector Policy Optimization (VPO) makes the consequence explicit. What the deployment loop rewards is not the quality of any single response but the quality of the best response in a sampled set, and the paper reports that the gap between diversity-trained and scalar-trained policies widens as the search budget grows. For evolutionary search the effect is categorical: VPO-trained models solve problems that models trained with GRPO (Group Relative Policy Optimization) cannot solve at all, because GRPO's collapsed distribution never proposes the seed variation that search needs to mutate from.
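The set-quality objective can be made concrete with a toy calculation. The sketch below is not taken from the paper: the two reward distributions and the helper name `expected_best_of_k` are illustrative assumptions. It computes the exact expected reward of the best of k independent samples for a collapsed policy and for a more varied one with a lower average reward.

```python
def expected_best_of_k(outcomes, k):
    """Exact expected maximum reward over k i.i.d. samples.

    outcomes: list of (reward, probability) pairs whose probabilities sum to 1.
    """
    expected = 0.0
    cdf_prev = 0.0
    for reward, prob in sorted(outcomes):
        cdf = cdf_prev + prob
        # P(max == reward) = F(reward)^k - F(just below reward)^k
        expected += reward * (cdf ** k - cdf_prev ** k)
        cdf_prev = cdf
    return expected


# Hypothetical policies (assumed numbers, chosen only for illustration).
sharp = [(0.8, 1.0)]                 # always the same near-optimal answer
diverse = [(0.5, 0.8), (1.0, 0.2)]   # usually worse, occasionally optimal

for k in (1, 4, 16, 64):
    s = expected_best_of_k(sharp, k)
    d = expected_best_of_k(diverse, k)
    print(f"k={k:>2}  sharp={s:.3f}  diverse={d:.3f}  gap={d - s:+.3f}")
```

In this toy setup the sharp policy wins at k=1 (0.8 against 0.6), the varied policy overtakes it from k=5, and the gap then grows toward a ceiling of 0.2. The ceiling is a property of the made-up distributions, not a claim about VPO.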
Why it matters: it inverts a tacit assumption. We tend to treat entropy reduction as evidence that training worked — the model "knows the answer." But if the model is a generator feeding a selector, sharpness is overfitting to the wrong objective. The post-training target should match the inference-time objective, and when inference is search, that objective is coverage of competent modes. The tension is real: optimizing for set-quality trades away single-shot pass@1, so the choice depends on whether deployment samples once or many times.
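The pass@1 versus pass@k trade-off can be sketched with per-problem solve probabilities. Everything here is an assumption for illustration: the five-problem benchmark, the probabilities, and the helper `pass_at_k`, which treats samples as independent. The collapsed policy solves a problem either always or never; the diverse policy solves every problem some of the time.

```python
def pass_at_k(solve_probs, k):
    """Mean probability that at least one of k independent samples succeeds.

    solve_probs: per-problem probability that a single sample is correct.
    """
    return sum(1.0 - (1.0 - p) ** k for p in solve_probs) / len(solve_probs)


# Hypothetical benchmark of five problems (assumed numbers).
collapsed = [1.0, 1.0, 1.0, 0.0, 0.0]   # deterministic: solved or never solved
diverse = [0.5, 0.5, 0.5, 0.3, 0.3]     # weaker per sample, but covers every problem

for k in (1, 2, 8, 32):
    c = pass_at_k(collapsed, k)
    d = pass_at_k(diverse, k)
    print(f"k={k:>2}  collapsed={c:.3f}  diverse={d:.3f}")
```

Under these assumptions the collapsed policy stays at 0.6 for every k, because resampling a deterministic policy adds nothing, while the diverse policy starts lower at 0.42 and passes it by k=2. The two problems with zero solve probability are the toy analogue of problems that search can never reach from a collapsed generator.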
— "Vector Policy Optimization: Training for Diversity Improves Test-Time Search", https://arxiv.org/abs/2605.22817
Related concepts in this collection
- Can diversity optimization improve quality during language model training?
Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?
converging evidence that diversity-as-objective need not cost quality during training; VPO extends the payoff to inference-time search
- Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
names the failure VPO routes around: scalar RL collapses entropy, starving downstream search of varied candidates
- Can evolutionary search beat sampling and revision at inference time?
Can LLMs evolve populations of solutions through recombination and selection to outperform simpler inference strategies? This matters because it could reveal whether biologically inspired search improves planning without formal problem definitions.
the deployment regime VPO trains for; evolutionary search is exactly where diversity-trained policies unlock otherwise-unsolvable problems
- Why do reasoning models fail differently at training versus inference?
Reasoning models exhibit two distinct failure modes, entropy collapse during training and variance inflation during inference, that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
frames the same train/test mismatch from the entropy side; VPO is one resolution that aligns the training objective with test-time sampling
- Why does majority voting outperform more complex inference methods?
Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?
counterpoint on which selector to pair with diverse generation; the value of trained diversity depends on the aggregation method at inference
Original note title
when models run inside test-time search training should maximize diversity of competent solutions instead of converging on one best answer