Summary of "DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning"
Summary (Overview)
- Proposes DVAO: A novel method for multi-reward Reinforcement Learning (RL) in Large Language Model (LLM) alignment that dynamically adjusts combination weights based on the empirical variance of each reward objective within a rollout group.
- Addresses Key Limitations: Targets the training instability of Reward Combination (RC), which is caused by large advantage magnitudes, and the suboptimal trade-offs of Advantage Combination (AC), which stem from its reliance on static weights and its neglect of cross-objective correlations.
- Provides Theoretical Guarantees: Mathematically proves that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism.
- Demonstrates Superior Performance: Extensive experiments on mathematical reasoning and tool-use benchmarks show DVAO outperforms baselines (GRPO, RC, AC, GDPO), achieving a superior multi-objective Pareto frontier and robust training stability.
- Enables Synergistic Optimization: The method's dynamic weighting allows it to up-weight objectives with stronger learning signals (higher variance) and suppress noisy ones, promoting holistic policy improvement.
Introduction and Theoretical Foundation
Reinforcement Learning, particularly Group Relative Policy Optimization (GRPO) and its variants, has become a standard paradigm for aligning LLMs. However, real-world applications require optimizing multiple objectives simultaneously (e.g., accuracy, length constraints, low hallucination rates, correct tool-calling format). The standard scalarization practices for multi-reward GRPO are:
- Reward Combination (RC): Linearly combines raw rewards before advantage calculation.
- Advantage Combination (AC): Independently normalizes rewards to advantages and then linearly combines them.
Both methods have significant drawbacks. RC frequently generates advantages with excessively large squared magnitudes, leading to erratic policy gradients and training instability. AC, while normalizing magnitudes, relies on static hyperparameters and isolates objectives during normalization, failing to capture intricate cross-objective correlations and leading to suboptimal trade-offs.
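The difference between the two scalarizations is only the order of mixing and normalizing. The following is a minimal NumPy sketch of that difference, not the authors' code; the function names, the `eps` stabilizer, and the toy reward matrix are assumptions made for illustration.

```python
import numpy as np

def group_normalize(x, eps=1e-8):
    """Standardize values within one rollout group (zero mean, unit std)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)

def rc_advantage(rewards, weights):
    """Reward Combination: mix raw rewards first, then normalize once."""
    rewards = np.asarray(rewards, dtype=float)        # shape (G, K)
    combined = rewards @ np.asarray(weights, dtype=float)
    return group_normalize(combined)

def ac_advantage(rewards, weights):
    """Advantage Combination: normalize each objective, then mix."""
    rewards = np.asarray(rewards, dtype=float)        # shape (G, K)
    per_objective = np.stack(
        [group_normalize(rewards[:, k]) for k in range(rewards.shape[1])],
        axis=1,
    )
    return per_objective @ np.asarray(weights, dtype=float)

# Hypothetical group: G=4 rollouts, K=2 rewards (accuracy, length compliance).
rewards = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]])
weights = np.array([0.5, 0.5])                        # static, equal weights
a_rc = rc_advantage(rewards, weights)
a_ac = ac_advantage(rewards, weights)
```

In this toy group both variants rank the third rollout (correct and compliant) highest, but they assign different magnitudes to it, which is where the stability and trade-off behaviour described above comes from.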
This work proposes Dynamic Variance-adaptive Advantage Optimization (DVAO) to bridge the gap. DVAO dynamically adjusts combination weights based on the empirical reward variance of each objective within the rollout group. This data-driven method up-weights objectives with higher variance (indicating a stronger learning signal) and suppresses noisy, low-variance ones.
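Formal Preliminaries: Given a policy model $\pi_\theta$, for a query $q$ and response $o$, there are $K$ reward functions $r_1(q, o), \dots, r_K(q, o)$. The standard convex combination for scalarization is:
$$r(q, o) = \sum_{k=1}^{K} \lambda_k \, r_k(q, o), \qquad \lambda_k \ge 0, \quad \sum_{k=1}^{K} \lambda_k = 1.$$
In GRPO, the relative advantage for a group of $G$ rollouts $\{o_i\}_{i=1}^{G}$ is:
$$A_i = \frac{r(q, o_i) - \operatorname{mean}\big(\{r(q, o_j)\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r(q, o_j)\}_{j=1}^{G}\big)}.$$
The policy optimization objective is:
$$J(\theta) = \mathbb{E}\left[\frac{1}{G} \sum_{i=1}^{G} \min\Big(\rho_i(\theta) A_i,\ \operatorname{clip}\big(\rho_i(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big) A_i\Big)\right],$$
where $\rho_i(\theta) = \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q)$.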
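As a rough illustration of the clipped policy objective used in GRPO-style training, the sketch below evaluates the surrogate for one rollout group from sequence-level log-probabilities. It is a simplification under stated assumptions: the function name and the clip range `clip_eps=0.2` are hypothetical choices, and any KL penalty or token-level averaging is left out.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate over one rollout group (to be maximized)."""
    # Importance ratio between the current and the sampling policy.
    ratio = np.exp(np.asarray(logp_new, dtype=float)
                   - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # The elementwise minimum makes the objective pessimistic:
    # moving the ratio outside the clip range earns no extra credit.
    return float(np.mean(np.minimum(unclipped, clipped)))

# Ratio 1.5 with a positive advantage is capped at 1 + clip_eps.
capped = clipped_surrogate([np.log(1.5)], [0.0], [1.0])
# Ratio 0.5 with a negative advantage is floored at 1 - clip_eps.
floored = clipped_surrogate([np.log(0.5)], [0.0], [-1.0])
```

Because the advantage multiplies the ratio directly, an advantage with a large magnitude scales the gradient by the same factor, which is why the magnitude of the scalarized advantage matters for stability.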
Methodology
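DVAO replaces the fixed combination weights $\lambda_k$ with dynamic variance-adaptive weights $w_k$:
$$w_k = \frac{\lambda_k \, \sigma_k}{\sum_{j=1}^{K} \lambda_j \, \sigma_j},$$
where $\sigma_k$ is the group standard deviation for objective $k$. The DVAO advantage is computed as:
$$A_i^{\mathrm{DVAO}} = \sum_{k=1}^{K} w_k \, A_{i,k},$$
where $A_{i,k} = \big(r_k(q, o_i) - \mu_k\big) / \sigma_k$ is the normalized advantage for objective $k$ (as in AC), with $\mu_k$ the group mean of $r_k$.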
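A minimal sketch of variance-adaptive weighting follows. It assumes weights proportional to each objective's base weight times its group standard deviation, renormalized to sum to one. This form is an assumption, chosen because it is consistent with the properties stated in this summary; it is not taken from the authors' code, and the function name and `eps` stabilizer are illustrative.

```python
import numpy as np

def dvao_advantage(rewards, base_weights, eps=1e-8):
    """Variance-adaptive mix of per-objective normalized advantages."""
    rewards = np.asarray(rewards, dtype=float)        # shape (G, K)
    lam = np.asarray(base_weights, dtype=float)       # shape (K,)
    mu = rewards.mean(axis=0)
    sigma = rewards.std(axis=0)                       # group std per objective
    per_objective = (rewards - mu) / (sigma + eps)    # AC-style advantages
    # Assumed weighting: proportional to lam_k * sigma_k, renormalized.
    w = lam * sigma / (np.sum(lam * sigma) + eps)
    return per_objective @ w, w

# Hypothetical group where the second objective is already saturated:
# every rollout satisfies the length constraint, so its variance is zero.
rewards = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]])
adv, w = dvao_advantage(rewards, base_weights=[0.5, 0.5])
```

In this example the saturated objective receives zero weight and the advantage is driven entirely by the accuracy reward, even though the base weights are equal.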
Key Theoretical Properties:
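- Bounded Advantage Magnitude (Proposition 2): For a fixed query and rollout group, DVAO produces a pointwise smaller or equal advantage magnitude compared to RC: $|A_i^{\mathrm{DVAO}}| \le |A_i^{\mathrm{RC}}|$ for every rollout $i$. Equality holds only if all reward pairs are perfectly positively correlated. This mitigates the training instability of RC.
- Cross-Objective Regularization (Proposition 3): With base weights $\lambda_k$ and group standard deviations $\sigma_k$, the sensitivity of the DVAO advantage to a raw reward $r_k$ (holding the group statistics fixed) is $\partial A^{\mathrm{DVAO}} / \partial r_k \propto \lambda_k / \sum_{j} \lambda_j \sigma_j$. In contrast, for AC it is $\partial A^{\mathrm{AC}} / \partial r_k \propto \lambda_k / \sigma_k$. This shows DVAO's gradient contribution for an objective is modulated by the cross-term $\sum_{j} \lambda_j \sigma_j$, which depends on the model's overall multi-objective performance, introducing an implicit cross-objective regularization. AC's gradient depends only on the isolated objective's performance through $\sigma_k$.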
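The bounded-magnitude property can be checked numerically under one assumption: that the DVAO weights are proportional to base weight times group standard deviation. That form is an inference from this summary, not a quotation from the paper. Under it, RC and DVAO share the same numerator (the weighted sum of centered rewards) and differ only in the denominator, so the bound reduces to the fact that the standard deviation of a weighted sum never exceeds the weighted sum of standard deviations. The sketch below uses random hypothetical rewards; all names are illustrative.

```python
import numpy as np

def rc_and_dvao(rewards, lam):
    """Return (A_RC, A_DVAO) for one group under the assumed DVAO weights."""
    rewards = np.asarray(rewards, dtype=float)        # shape (G, K)
    lam = np.asarray(lam, dtype=float)
    centered = rewards - rewards.mean(axis=0)
    mixed = centered @ lam                            # shared numerator
    a_rc = mixed / (rewards @ lam).std()              # std of the weighted sum
    a_dvao = mixed / np.sum(lam * rewards.std(axis=0))  # weighted sum of stds
    return a_rc, a_dvao

rng = np.random.default_rng(0)
lam = np.full(3, 1.0 / 3.0)

# Generic case: weakly correlated rewards, so the inequality can be strict.
a_rc, a_dvao = rc_and_dvao(rng.normal(size=(8, 3)), lam)
bounded = bool(np.all(np.abs(a_dvao) <= np.abs(a_rc) + 1e-12))

# Perfectly positively correlated rewards: the two advantages coincide.
base = rng.normal(size=8)
corr_rc, corr_dvao = rc_and_dvao(np.stack([base, 2 * base, 3 * base], axis=1), lam)
tight = bool(np.allclose(corr_rc, corr_dvao))
```

The second case mirrors the equality condition in Proposition 2: when every objective is a positive multiple of the same signal, the two denominators agree.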
Empirical Validation / Results
Experiments were conducted on mathematical reasoning (AIME-2024, AIME-2025, MATH500, OlympiadBench, AMC23) and tool-use (Berkeley Function-Calling Leaderboard, BFCL-v4) benchmarks using Qwen3 and Qwen2.5 models. The paired objectives were accuracy and length compliance for math, and accuracy and format compliance for tool use.
Main Results (Key Tables):
Table 1: Performance on Mathematical Reasoning Tasks
| Method | AIME-2024 Acc.(%)/Len.(%) | AIME-2025 Acc.(%)/Len.(%) | MATH500 Acc.(%)/Len.(%) | OlympiadBench Acc.(%)/Len.(%) | AMC23 Acc.(%)/Len.(%) | Average Acc.(%)/Len.(%) |
|---|---|---|---|---|---|---|
| Qwen3-4B-Base | ||||||
| + GRPO | 17.91 / 62.08 | 10.20 / 68.33 | 78.92 / 93.04 | 42.88 / 82.82 | 49.62 / 82.91 | 39.91 / 77.84 |
| + RC | 14.58 / 92.50 | 9.38 / 95.42 | 78.31 / 98.91 | 41.18 / 97.43 | 51.50 / 97.67 | 38.99 / 96.39 |
| + AC | 16.25 / 91.04 | 9.38 / 95.21 | 77.65 / 98.69 | 41.02 / 98.08 | 49.47 / 98.12 | 38.75 / 96.23 |
| + GDPO | 2.08 / 95.83 | 3.75 / 96.46 | 30.06 / 99.52 | 16.56 / 98.29 | 14.60 / 98.95 | 13.41 / 97.81 |
| + DVAO | 16.87 / 100.0 | 13.54 / 99.79 | 81.36 / 99.94 | 45.63 / 99.96 | 53.53 / 99.85 | 42.19 / 99.91 |
| Qwen3-8B-Base | ||||||
| + GRPO | 29.58 / 40.00 | 24.58 / 52.08 | 87.93 / 89.61 | 55.57 / 70.81 | 65.21 / 64.83 | 52.57 / 63.47 |
| + RC | 21.04 / 97.08 | 16.25 / 99.38 | 84.97 / 99.59 | 49.73 / 98.48 | 59.33 / 99.02 | 46.26 / 98.71 |
| + AC | 20.41 / 97.71 | 15.62 / 98.96 | 84.42 / 99.58 | 48.52 / 98.93 | 58.13 / 99.02 | 45.42 / 98.84 |
| + GDPO | 1.67 / 100.0 | 0.00 / 100.0 | 35.07 / 100.0 | 9.15 / 99.96 | 27.56 / 100.0 | 14.69 / 99.99 |
| + DVAO | 21.87 / 100.0 | 18.33 / 100.0 | 86.10 / 99.99 | 50.62 / 99.76 | 60.54 / 99.85 | 47.49 / 99.92 |
Table 2: Performance on Tool-Use Task (BFCL-v4)
| Method | Live Acc.(%)/Format.(%) | Non-Live Acc.(%)/Format.(%) | Multi-Turn Acc.(%)/Format.(%) | Average Acc.(%)/Format.(%) |
|---|---|---|---|---|
| Qwen2.5-3B-Instruct | ||||
| + GRPO | 59.73 / 10.35 | 47.94 / 5.18 | 2.00 / 1.29 | 36.56 / 5.61 |
| + RC | 67.51 / 61.71 | 79.94 / 83.45 | 5.62 / 36.29 | 51.02 / 60.48 |
| + AC | 69.43 / 67.04 | 82.35 / 84.82 | 8.62 / 42.20 | 53.47 / 64.69 |
| + GDPO | 67.73 / 65.08 | 82.08 / 84.53 | 8.38 / 48.04 | 52.73 / 65.88 |
| + DVAO | 72.73 / 77.44 | 84.75 / 95.11 | 12.50 / 57.40 | 56.66 / 76.65 |
| Qwen2.5-7B-Instruct | ||||
| + GRPO | 68.76 / 0.0 | 81.65 / 0.0 | 6.38 / 0.0 | 52.26 / 0.00 |
| + RC | 75.06 / 87.58 | 85.33 / 96.11 | 14.75 / 45.56 | 58.38 / 76.42 |
| + AC | 63.80 / 67.39 | 56.21 / 85.40 | 12.75 / 51.33 | 44.25 / 68.04 |
| + GDPO | 76.17 / 68.90 | 86.73 / 85.83 | 17.50 / 49.63 | 60.13 / 68.12 |
| + DVAO | 79.68 / 77.93 | 87.06 / 95.11 | 22.25 / 64.58 | 63.00 / 79.21 |
Key Findings:
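- Among the multi-reward methods (RC, AC, GDPO, DVAO), DVAO achieves the highest average accuracy in every setting. It reaches near-perfect average length compliance on math (99.91% and 99.92%) and the highest average format compliance on tool use (76.65% and 79.21%). GRPO retains a higher average accuracy on Qwen3-8B-Base (52.57% vs. 47.49%), but with far lower length compliance (63.47%).
- Baselines typically sacrifice one dimension for the other. On the math benchmarks, RC and AC trade accuracy for compliance, while GDPO reaches near-perfect length compliance at the cost of very low accuracy (13.41% and 14.69% on average).
- DVAO's advantage is consistent even though all methods share the same equal-weight initialization.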
Training Dynamics & Pareto Frontiers:
- Training Stability: DVAO consistently achieves the highest accuracy reward with the lowest variance (standard deviation), and drives the length/format reward closest to the target (1.0) with the most pronounced variance collapse, which is consistent with its bounded advantage property.
- Pareto Superiority: When sweeping the accuracy weight, DVAO's Pareto frontier dominates those of all baselines (RC, AC, GDPO), maintaining high auxiliary compliance across the entire accuracy range. Baselines exhibit saturation, instability, or incoherent fluctuations.
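Pareto dominance can be made concrete with the averages reported in Tables 1 and 2. The sketch below treats each method's (average accuracy, average compliance) pair as a single operating point and keeps the non-dominated ones. This only illustrates the dominance relation on the tabulated averages; it does not reproduce the weight-sweep frontiers described above, and the helper names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every axis and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Names of the methods whose point is not dominated by any other."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )

# (Average Acc. %, Average Format %) from Table 2, Qwen2.5-3B-Instruct.
tool_3b = {"GRPO": (36.56, 5.61), "RC": (51.02, 60.48), "AC": (53.47, 64.69),
           "GDPO": (52.73, 65.88), "DVAO": (56.66, 76.65)}
# (Average Acc. %, Average Format %) from Table 2, Qwen2.5-7B-Instruct.
tool_7b = {"GRPO": (52.26, 0.00), "RC": (58.38, 76.42), "AC": (44.25, 68.04),
           "GDPO": (60.13, 68.12), "DVAO": (63.00, 79.21)}
# (Average Acc. %, Average Len. %) from Table 1, Qwen3-8B-Base.
math_8b = {"GRPO": (52.57, 63.47), "RC": (46.26, 98.71), "AC": (45.42, 98.84),
           "GDPO": (14.69, 99.99), "DVAO": (47.49, 99.92)}

front_tool_3b = pareto_front(tool_3b)   # ['DVAO']
front_tool_7b = pareto_front(tool_7b)   # ['DVAO']
front_math_8b = pareto_front(math_8b)   # ['DVAO', 'GDPO', 'GRPO']
```

On the tool-use averages DVAO alone is non-dominated. On the Qwen3-8B-Base math averages, GRPO and GDPO also remain on the front because each is best on one axis, while DVAO dominates both RC and AC.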
Theoretical and Practical Implications
- Theoretical: The paper provides a rigorous mathematical analysis of multi-reward GRPO scalarization, proving the instability of RC (Proposition 1) and the superior properties of DVAO (Propositions 2 & 3). DVAO formally guarantees bounded advantage magnitudes and introduces a novel, data-driven cross-objective regularization mechanism.
- Practical: DVAO offers a hyperparameter-free, dynamic weighting scheme that significantly improves the stability and effectiveness of multi-objective RLHF for LLMs. It enables the training of models that excel at primary tasks (e.g., reasoning accuracy) while robustly adhering to critical auxiliary constraints (e.g., response length, tool-calling format), which is essential for real-world deployment.
Conclusion
The paper identifies fundamental flaws in standard multi-reward scalarization techniques for GRPO: Reward Combination leads to training instability due to large advantage magnitudes, and Advantage Combination relies on static weights and ignores cross-objective correlations. To address these, the authors propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical variance of each objective's rewards. DVAO is proven to maintain bounded advantage magnitudes and to introduce an implicit cross-objective regularization. Extensive experiments demonstrate that DVAO achieves a superior Pareto frontier, balancing multiple objectives without manual tuning of the combination weights. Future work will explore scaling DVAO to environments with more reward functions and extending its mechanism to broader alignment paradigms.