DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

Paper · arXiv 2605.25604
Reinforcement Learning · Reward Models · RL with Verifiable Rewards (RLVR) · Training and Fine-Tuning

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.

However, deploying LLMs in real-world scenarios rarely involves optimizing a single, isolated metric. Practical applications impose multi-objective requirements: a model must not only provide accurate answers but also adhere to length constraints, minimize bug rates in code generation, maintain a low hallucination rate, and keep the correct tool-calling format in tool use. Adapting GRPO to this multi-reward setting is non-trivial. The standard practice is scalarization: either linearly combining the raw rewards (Reward Combination) or independently normalizing the rewards and then combining their respective advantages (Advantage Combination). Despite their widespread use, both methods suffer from significant theoretical and practical drawbacks. As we demonstrate in this work, the Reward Combination method frequently generates advantages with much larger squared magnitudes than the Advantage Combination method, which translates to erratic policy gradients and training instability. Conversely, while the Advantage Combination method normalizes these magnitudes, it relies on static hyperparameters and completely isolates the objectives during normalization. This naive decoupling fails to capture the correlations, whether synergistic or antagonistic, between different objectives within a single rollout, often leading to suboptimal trade-offs.
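The two scalarization schemes described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes GRPO-style group normalization (subtract the group mean, divide by the group standard deviation), and the function names, the example rewards, and the equal weights are all hypothetical.

```python
import numpy as np


def group_normalize(x, eps=1e-8):
    """Center and scale values within one rollout group (GRPO-style, assumed)."""
    return (x - x.mean()) / (x.std() + eps)


def reward_combination(rewards, weights):
    """Scalarize the raw rewards first, then normalize the combined reward."""
    return group_normalize(rewards @ weights)


def advantage_combination(rewards, weights):
    """Normalize each objective on its own, then combine the advantages."""
    per_objective = np.stack(
        [group_normalize(rewards[:, k]) for k in range(rewards.shape[1])],
        axis=1,
    )
    return per_objective @ weights


# Hypothetical group of G=4 rollouts scored on K=2 objectives
# (column 0: accuracy, column 1: length score). The values are made up.
rewards = np.array([[1.0, 0.2], [0.0, 0.9], [1.0, 0.4], [0.0, 0.7]])
weights = np.array([0.5, 0.5])  # static hyperparameters, chosen by hand

adv_rc = reward_combination(rewards, weights)
adv_ac = advantage_combination(rewards, weights)
print("Reward Combination    squared magnitude:", float((adv_rc ** 2).sum()))
print("Advantage Combination squared magnitude:", float((adv_ac ** 2).sum()))
```

Printing the two squared magnitudes side by side makes it possible to compare the schemes on a given group. Note that `weights` is fixed in both functions, which is the static-hyperparameter dependence the text refers to.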

To address these fundamental limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO). DVAO bridges the gap between stability and objective synergy by dynamically adjusting the combination weights based on the empirical reward variance of each objective within the rollout group. This fully data-driven method up-weights objectives with higher variance, which indicates a stronger learning signal, while suppressing noisy, low-variance objectives. Crucially, we mathematically prove that DVAO not only bounds the advantage magnitude for stable training but also introduces a self-adaptive cross-objective regularization mechanism. In DVAO, the gradient contribution of a single objective is modulated by the overall multi-objective performance of that specific rollout, ensuring a holistic optimization trajectory.
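The idea of variance-based weighting can be illustrated with a short sketch. The exact DVAO weighting rule is not given in this excerpt, so the code below makes a loudly labeled assumption: each objective's weight is its empirical within-group reward variance divided by the sum of variances over all objectives. The function name and the example rewards are hypothetical, and the cross-objective regularization mechanism is not modeled.

```python
import numpy as np


def variance_adaptive_advantages(rewards, eps=1e-8):
    """Combine per-objective advantages with variance-based weights.

    rewards: array of shape (G, K), G rollouts in one group, K objectives.
    Returns (advantages of shape (G,), weights of shape (K,)).

    ASSUMPTION: weights are proportional to each objective's empirical
    variance within the group. This is an illustrative choice, not
    necessarily the rule used by DVAO.
    """
    mean = rewards.mean(axis=0)
    std = rewards.std(axis=0)
    per_objective_adv = (rewards - mean) / (std + eps)  # shape (G, K)

    variance = std ** 2
    total = variance.sum()
    if total < eps:
        # No objective distinguishes the rollouts: fall back to uniform weights.
        weights = np.full(rewards.shape[1], 1.0 / rewards.shape[1])
    else:
        weights = variance / total
    return per_objective_adv @ weights, weights


# Hypothetical group: accuracy varies a lot, the length score barely moves.
rewards = np.array([[1.0, 0.50], [0.0, 0.52], [1.0, 0.49], [0.0, 0.51]])
advantages, weights = variance_adaptive_advantages(rewards)
print("weights:", weights)
print("squared magnitude:", float((advantages ** 2).sum()))
```

Because the weights in this sketch are non-negative and sum to one, and each normalized objective has a squared magnitude of at most G, the combined advantages also have a squared magnitude of at most G by convexity of the squared norm. Raw variances depend on each reward's scale, so a practical variant would likely need rewards on comparable ranges.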

In this work, we identify the fundamental theoretical and practical limitations of the standard scalarization techniques for multi-reward GRPO, namely Reward Combination and Advantage Combination. To address the issues of magnitude explosion and objective isolation, we introduce Dynamic Variance-adaptive Advantage Optimization. By dynamically adjusting combination weights based on the empirical variance of each objective within a rollout group, DVAO explicitly up-weights learning signals from high-variance objectives while suppressing low-variance noise. Empirical evaluations across mathematical reasoning and tool-use benchmarks confirm that DVAO achieves a superior Pareto-optimal policy, balancing accuracy with length and format constraints without relying on fixed hyperparameters. Future work will explore scaling the DVAO framework to environments with a larger number of conflicting reward functions and extending the variance-adaptive mechanism to broader alignment paradigms.