Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Paper · arXiv 2605.22817
Reinforcement Learning · Test-Time Compute · Inference-Time Scaling · Reasoning Model Architectures

Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, which often leads current LLMs to produce low-entropy response distributions and thus to struggle to display the diversity that inference-time search requires. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits the fact that rewards are often vector-valued in practice, for example per-test-case correctness in code generation, or scores from multiple user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions in which individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search metrics (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.

In many AI systems, the network is only one component of a larger pipeline. Especially for hard problems, language models are typically wrapped in some form of search, ranging from simple rejection sampling with a verifier to complex evolutionary methods like AlphaEvolve. In these settings, test-time search handles exploitation, which suggests that training should focus on providing the search with a rich and diverse pool of solutions to select from. However, existing RL post-training methods are poorly suited to this kind of diversity preservation. Policy gradient methods like GRPO drive the policy toward a narrow set of high-probability responses. After training, the diversity required for effective test-time search disappears, because additional samples become near-duplicates. In this work, we propose a shift in perspective. Rather than asking a single training algorithm to handle both exploration and exploitation, we separate the two responsibilities entirely by assuming a future test-time exploitation stage. In this setting, the role of RL post-training should not be to converge on a single best response, but to maximize the diversity of a set of competent solutions.

To train a policy that produces diverse yet competent solutions, we exploit the fact that, in many realistic tasks, rewards can be naturally decomposed into a vector of components: per-test-case correctness for code generation, per-criterion ratings for RLHF, or per-sub-question success in multi-hop reasoning. This decomposition provides a natural axis for diversity. Rather than collapsing these components into a single scalar and optimizing toward one peak, we can encourage the model to produce solutions that excel along different reward dimensions, covering the Pareto frontier rather than converging to a single point on it. We term this optimization scheme Vector Policy Optimization (VPO). Concretely, VPO combines multi-answer generation with stochastic reward scalarizations, so that the candidates in each generated set specialize to different trade-offs among the reward components.
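To make the Pareto-frontier framing concrete, here is a minimal sketch in plain Python. It is an illustration only, not the paper's implementation: the helper names (`dominates`, `pareto_front`, `best_under`) and the toy pass/fail vectors are hypothetical. It shows that different weightings of the same reward vector can select different non-dominated candidates.

```python
def dominates(a, b):
    """True if reward vector a is at least as good as b on every component
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(rewards):
    """Indices of reward vectors that no other vector in the set dominates."""
    return [
        i
        for i, r in enumerate(rewards)
        if not any(dominates(o, r) for j, o in enumerate(rewards) if j != i)
    ]


def best_under(weights, rewards):
    """Index of the candidate with the highest weighted sum of reward components."""
    scores = [sum(w * x for w, x in zip(weights, r)) for r in rewards]
    return max(range(len(rewards)), key=scores.__getitem__)


# Hypothetical per-test-case pass/fail vectors for four candidate programs.
candidates = [
    (1, 1, 0),  # passes tests 0 and 1
    (0, 1, 1),  # passes tests 1 and 2
    (0, 1, 0),  # dominated by both candidates above
    (1, 0, 0),  # dominated by the first candidate
]

front = pareto_front(candidates)                      # non-dominated candidates
pick_a = best_under((0.6, 0.3, 0.1), candidates)      # weighting that favors test 0
pick_b = best_under((0.1, 0.3, 0.6), candidates)      # weighting that favors test 2
```

In this toy example the two weightings select different members of the frontier. A policy that only ever produced one of them would leave a downstream search with nothing to choose between.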

We have argued that when language models are deployed inside pipelines with test-time search, the responsibilities of exploration and exploitation should be separated: training should produce a diverse pool of competent candidates, and the search procedure at test time should handle exploitation. VPO instantiates this by sampling scalarizations uniformly over the simplex and training the policy to emit sets that span the Pareto frontier of the underlying reward components. The change is a drop-in replacement for the GRPO advantage estimator. Across MuSiQue, EUREQA, Maze, and ToolRL, VPO improves best@k over scalar baselines, with the gap widening as the test-time budget grows. VPO has two limitations. First, it benefits from a vector-valued reward; when the reward is only a scalar, it reduces to more standard RL. Second, it sacrifices pass@1 for pass@k by training the policy to explore rather than to exploit. VPO is intended for the regime where test-time search is part of the system.
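The text above does not spell out the estimator, so the following is only a hedged sketch of what a scalarized, group-relative advantage could look like. It assumes one weight vector is drawn per group of rollouts and that advantages are standardized within the group, as in GRPO. How VPO actually assigns weights across the members of a generated set is not stated here, and the function names are hypothetical.

```python
import random
import statistics


def sample_simplex(dim, rng):
    """Draw weights uniformly from the probability simplex by normalizing
    independent Exp(1) draws (equivalent to a Dirichlet(1, ..., 1) sample)."""
    draws = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarized_group_advantages(reward_vectors, weights, eps=1e-8):
    """Scalarize each rollout's reward vector with the given weights, then
    standardize the scores within the group (GRPO-style normalization).
    Assumption: a single weight vector is shared by the whole group."""
    scores = [sum(w * r for w, r in zip(weights, vec)) for vec in reward_vectors]
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]


rng = random.Random(0)

# Hypothetical group of four rollouts, each scored on three reward components.
group = [(1.0, 0.0, 0.0), (0.0, 1.0, 1.0), (1.0, 1.0, 0.0), (0.0, 0.0, 0.0)]

weights = sample_simplex(3, rng)
advantages = scalarized_group_advantages(group, weights)

# With all weight on the first component, only that component matters.
first_only = scalarized_group_advantages(group, [1.0, 0.0, 0.0])
```

Resampling `weights` at each training step changes which rollouts receive positive advantage, which is the sense in which a stochastic scalarization can reward different trade-offs over time. If the reward has only one component, the simplex collapses to the single weight 1 and the sketch becomes the ordinary group-normalized scalar advantage.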