How should multiple reward objectives be weighted during training?
When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This explores whether reward variance within rollouts reveals which objectives carry real learning signal.
When a single GRPO run optimizes several rewards at once (accuracy plus length plus format, say), the two standard approaches both fall short. Reward Combination sums the rewards before computing the advantage, which lets squared advantage magnitudes explode and destabilizes training. Advantage Combination computes an advantage per objective and then mixes them, but it relies on fixed hyperparameter weights and treats the objectives as independent, ignoring how they correlate within a rollout.
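To make the difference between the two baselines concrete, here is a minimal sketch. It assumes GRPO-style group normalization (subtract the group mean, divide by the group standard deviation); the function names, the `eps` constant, and the example rewards are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / (s + eps) for v in values]


def reward_combination(rewards):
    """Sum the objectives per response first, then normalize the summed reward."""
    summed = [sum(per_response) for per_response in zip(*rewards.values())]
    return group_normalize(summed)


def advantage_combination(rewards, weights):
    """Normalize each objective separately, then mix with fixed weights."""
    per_objective = {name: group_normalize(vals) for name, vals in rewards.items()}
    group_size = len(next(iter(rewards.values())))
    return [
        sum(weights[name] * per_objective[name][i] for name in rewards)
        for i in range(group_size)
    ]


# Hypothetical rollout group: four responses scored on two objectives.
rewards = {
    "accuracy": [1.0, 0.0, 1.0, 0.0],
    "format": [1.0, 1.0, 0.0, 0.0],
}
adv_sum = reward_combination(rewards)
adv_mix = advantage_combination(rewards, {"accuracy": 0.5, "format": 0.5})
```

The fixed `weights` dictionary passed to `advantage_combination` is exactly the hand-tuned part that the note argues should be replaced by something estimated from the data.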
DVAO's claim is that the right weighting signal is already in the data: the empirical reward variance of each objective within a rollout group. High variance means the group's responses spread out on that objective, so there is a gradient to learn from. Low variance means the responses barely differ on that objective, either because it is saturated or because it is otherwise uninformative for that group, so its contribution should shrink. Weighting by within-group variance therefore up-weights objectives carrying a real learning signal and down-weights the rest, automatically and without tuned constants.
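The weighting idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's formula: it assumes weights proportional to each objective's within-group variance, normalized to sum to 1, and it omits the cross-objective regularizer entirely. The function names and example rewards are hypothetical.

```python
from statistics import mean, pvariance


def variance_weights(rewards, eps=1e-8):
    """Weight each objective by its share of the total within-group variance."""
    variances = {name: pvariance(vals) for name, vals in rewards.items()}
    total = sum(variances.values())
    if total <= eps:
        # No objective separates the responses; fall back to uniform weights.
        return {name: 1.0 / len(variances) for name in variances}
    return {name: var / total for name, var in variances.items()}


def variance_weighted_advantages(rewards, eps=1e-8):
    """Mix per-objective normalized advantages using variance-derived weights."""
    weights = variance_weights(rewards, eps)
    group_size = len(next(iter(rewards.values())))
    advantages = [0.0] * group_size
    for name, vals in rewards.items():
        m = mean(vals)
        s = pvariance(vals) ** 0.5
        for i, v in enumerate(vals):
            advantages[i] += weights[name] * (v - m) / (s + eps)
    return weights, advantages


# Hypothetical group: accuracy still separates responses, format is saturated,
# and length varies only slightly.
rewards = {
    "accuracy": [1.0, 0.0, 1.0, 0.0],
    "format": [1.0, 1.0, 1.0, 1.0],
    "length": [0.5, 0.5, 0.6, 0.4],
}
weights, advantages = variance_weighted_advantages(rewards)
```

In this example the saturated `format` objective receives zero weight and `accuracy` dominates. Note that raw variance depends on each reward's scale, so a sketch like this only makes sense if the objectives are scored on comparable ranges.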
Why it matters: this reframes multi-objective scalarization as an estimation problem rather than a preference-setting problem. You are no longer asking "how much do I value format versus accuracy?" but "which objective currently has signal worth following?" The paper proves the scheme keeps advantage magnitudes bounded (the stability win) and folds in a cross-objective regularizer so each objective's gradient is modulated by the rollout's overall multi-objective performance. The counterpoint is that variance can be a misleading proxy — a noisy reward model also produces high variance — so the method presumes reward signals are clean enough that spread tracks learnability.
— "DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning", https://arxiv.org/abs/2605.25604
Related concepts in this collection
-
Can two simple techniques match complex RL algorithms?
Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This explores whether algorithmic complexity is necessary for effective LLM reasoning training.
both operate on the advantage-estimation machinery; DVAO adds a variance-adaptive weighting layer on top of the normalization tricks
-
Can full episode rewards per step enable better credit assignment?
Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.
complementary axis: that note handles credit across time, DVAO handles credit across competing objectives
-
Why does agent efficiency differ from model size reduction?
Explores why making models smaller doesn't solve agent cost problems. Agents loop recursively, compounding costs multiplicatively, so efficiency requires system-level design, not just parameter reduction.
both frame training as Pareto-frontier navigation rather than single-objective maximization
-
Can reward vectors be the hidden source of solution diversity?
Standard RL collapses multi-dimensional rewards into scalars before training, losing the natural structure that could drive diverse specialization. What if that vector structure itself is the diversity axis?
contrasts: both start from multi-dimensional rewards rather than a pre-scalarized one, but DVAO ultimately collapses the objectives into a single variance-weighted advantage, while vector rewards preserve the per-dimension structure to drive diversity
Original note title
multi-reward grpo should weight each objective by its empirical reward variance