Co-Evolving Policy Distillation (CoPD): A Unified Summary
Summary (Overview)
- Core Problem: Existing paradigms for consolidating multiple expert capabilities into a single model—mixed-data RLVR and the static RLVR-then-OPD pipeline—suffer from significant capability loss due to gradient conflicts and large behavioral gaps between teacher and student models.
- Key Insight: Effective On-Policy Distillation (OPD) requires the teacher and student to maintain behavioral similarity (measured by top-k token overlap). The standard pipeline fails because experts, trained to convergence in isolation, drift too far from the student, making their supervision hard to absorb.
- Proposed Method: Co-Evolving Policy Distillation (CoPD) introduces parallel training branches that co-evolve through alternating phases of branch-specific RLVR (to explore new knowledge) and cross-branch mutual OPD (to transfer knowledge while keeping behavioral patterns close).
- Main Results: CoPD consistently outperforms strong baselines (mixed RLVR, OPD, MOPD) in unifying text, image, and video reasoning capabilities. It achieves an "all-in-one" model that surpasses domain-specific experts, turning cross-domain trade-offs into mutual gains.
- Broader Implication: The parallel co-evolution training pattern suggests a new scaling paradigm: training multiple models in parallel and co-evolving them to broaden a single model's capabilities.
Introduction and Theoretical Foundation
Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard post-training paradigm for enhancing capabilities like text, image, and video reasoning in large models. However, training a single model on mixed-capability data often leads to capability divergence, where gains in one area come at the expense of another due to conflicting optimization directions.
The prevailing solution is a two-stage static OPD pipeline: 1) Train separate domain-specific experts via RLVR, and 2) Consolidate them into a unified policy via On-Policy Distillation (OPD). While this avoids gradient conflict, the authors identify a critical flaw: by the time distillation begins, the teacher (expert) has drifted too far in behavior from the student (base model), making its supervision difficult to absorb.
This is formalized through a unified utility analysis. Let $U$ denote the total useful optimization signal a single model extracts from two capability datasets $D_1$ and $D_2$, with per-domain gains $G_1$ and $G_2$:
- Mixed-data RLVR pays a capability divergence cost $C_{\mathrm{div}}$ from conflicting gradients: $U_{\mathrm{mix}} = G_1 + G_2 - C_{\mathrm{div}}$.
- The static OPD pipeline avoids $C_{\mathrm{div}}$ but operates with low absorption efficiency $\eta(\rho)$ because the teacher-student behavioral overlap $\rho$ is low by the time distillation starts: $U_{\mathrm{static}} = \eta(\rho_{\mathrm{low}})\,(G_1 + G_2)$.
- CoPD aims to keep absorption high by maintaining a moderate overlap throughout training: $U_{\mathrm{CoPD}} = \eta(\rho_{\mathrm{mod}})\,(G_1 + G_2)$ with $\eta(\rho_{\mathrm{mod}}) \gg \eta(\rho_{\mathrm{low}})$. A toy numeric comparison follows below.
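To make the trade-off concrete, here is a minimal numeric sketch in Python. The gain values, divergence cost, and efficiency curve are illustrative assumptions chosen only to show the structure of the comparison; they are not numbers from the paper.

```python
# Toy illustration of the utility analysis (all numbers are made up).
def absorption_efficiency(rho: float) -> float:
    """Assumed efficiency curve: distillation absorbs more signal when the
    teacher-student behavioral overlap rho is higher."""
    return max(0.0, min(1.0, rho))  # simplest possible monotone choice

G1, G2 = 1.0, 1.0            # per-domain gains available from D1 and D2 (assumed)
C_div = 0.8                  # divergence cost of mixed-data RLVR (assumed)
rho_low, rho_mod = 0.4, 0.9  # overlap after isolated expert training vs. under CoPD

U_mix    = G1 + G2 - C_div
U_static = absorption_efficiency(rho_low) * (G1 + G2)
U_copd   = absorption_efficiency(rho_mod) * (G1 + G2)

print(f"mixed RLVR: {U_mix:.2f}, static OPD: {U_static:.2f}, CoPD: {U_copd:.2f}")
# -> mixed RLVR: 1.20, static OPD: 0.80, CoPD: 1.80
```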
The Behavioral Consistency Hypothesis posits that OPD is more effective when teacher and student exhibit similar behavioral patterns. This is measured by the top-$k$ token overlap along the student's on-policy trajectories: at each decoding position, compare the teacher's and student's top-$k$ next-token sets and average their fractional intersection, $\rho_k = \mathbb{E}_t\!\left[\tfrac{1}{k}\,\bigl|\mathrm{Top}\text{-}k\bigl(\pi_T(\cdot \mid o_{<t})\bigr) \cap \mathrm{Top}\text{-}k\bigl(\pi_S(\cdot \mid o_{<t})\bigr)\bigr|\right]$.
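A minimal sketch of how such an overlap could be computed from per-token logits, using NumPy; the function names and the exact averaging are assumptions, not the paper's implementation.

```python
import numpy as np

def topk_ids(logits: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring tokens at each position, shape (T, k)."""
    return np.argpartition(-logits, k, axis=-1)[:, :k]

def topk_overlap(teacher_logits: np.ndarray, student_logits: np.ndarray, k: int = 10) -> float:
    """Average fraction of shared top-k tokens along one on-policy trajectory.
    Both logit arrays have shape (T, vocab_size) and are evaluated on the same
    student-generated rollout (on-policy for the student)."""
    t_ids, s_ids = topk_ids(teacher_logits, k), topk_ids(student_logits, k)
    per_position = [len(set(t) & set(s)) / k for t, s in zip(t_ids, s_ids)]
    return float(np.mean(per_position))

# Example with random logits and a toy vocabulary (stand-ins for real model outputs):
rng = np.random.default_rng(0)
T, V = 64, 100
print(topk_overlap(rng.normal(size=(T, V)), rng.normal(size=(T, V)), k=10))
```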
A pilot study confirms that OPD gain increases approximately linearly with $\rho_k$, and that standard RLVR training monotonically decreases $\rho_k$, pushing isolated experts into the low-efficiency regime for distillation. This motivates CoPD, which must: 1) perform distillation during expert training rather than after it, 2) keep teacher and student co-evolving, and 3) maintain an informative knowledge gap between them.
Methodology
CoPD maintains $K$ parallel training branches $\{\pi_{\theta_k}\}_{k=1}^{K}$, each initialized from a shared base model $\pi_{\theta_0}$ and associated with a capability dataset $D_k$ and verifiable reward $r_k$. Training proceeds in alternating cycles of two phases:
1. Branch-Specific RLVR Phase: Each branch $\pi_{\theta_k}$ independently performs Group Relative Policy Optimization (GRPO) on its own data $D_k$ to deepen expertise, maximizing the standard clipped GRPO objective (Eq. 7 in the paper): $\mathcal{J}_{\mathrm{GRPO}}(\theta_k) = \mathbb{E}\big[\tfrac{1}{G}\sum_{i=1}^{G}\tfrac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\big(w_{i,t}\hat{A}_i,\ \mathrm{clip}(w_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\big]$, where $w_{i,t}$ is the token-level importance ratio against the rollout policy and $\hat{A}_i$ is the group-relative advantage obtained by standardizing the verifiable rewards $r_k(o_i)$ within each group of $G$ rollouts.
This phase opens a behavioral/knowledge gap between branches.
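As a concrete illustration of the group-relative advantage used in this phase, here is a small sketch assuming scalar verifiable rewards per rollout; the function name and the epsilon guard against zero variance are my own additions.

```python
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: standardize each rollout's reward within its group.

    rewards: verifiable rewards r_k(o_1), ..., r_k(o_G) for G rollouts
             sampled from the same prompt.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 rollouts for one prompt, graded by a binary verifier.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approximately [ 1, -1, -1, 1 ]
```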
2. Mutual OPD Phase: Each branch $\pi_{\theta_k}$ generates rollouts on another branch's data $D_j$ ($j \neq k$) and receives token-level supervision from that branch. The teacher signal from branch $j$ to branch $k$, $\delta^{(k\leftarrow j)}_t$, is computed from the teacher branch $\pi_{\theta_j}$'s token-level likelihoods on the student branch's own rollouts (Eq. 8 in the paper).
The token-level advantage for the cross-branch update is $A^{(k)}_t = \beta_k\,\delta^{(k\leftarrow j)}_t$, where $\beta_k$ scales the strength of the distillation signal. This phase transfers knowledge and closes the behavioral gap, keeping branches within an "absorbable" range of one another.
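Below is a minimal sketch of one plausible instantiation of this cross-branch signal, assuming the teacher signal is the per-token log-likelihood ratio between teacher and student on the student's sampled tokens; the paper's exact Eq. 8 may differ.

```python
import numpy as np

def cross_branch_advantages(teacher_logprobs: np.ndarray,
                            student_logprobs: np.ndarray,
                            beta: float = 1.0) -> np.ndarray:
    """Token-level advantages A_t = beta * delta_t for one student rollout.

    teacher_logprobs / student_logprobs: log pi(o_t | q, o_<t) of the teacher
    branch pi_theta_j and the student branch pi_theta_k, both evaluated on the
    SAME tokens o_t that the student sampled (on-policy for the student).
    delta_t is assumed here to be the teacher-student log-likelihood ratio.
    """
    delta = teacher_logprobs - student_logprobs  # > 0 where the teacher prefers the token more
    return beta * delta

# Example: teacher mostly agrees but strongly dislikes the third sampled token.
t = np.array([-0.2, -0.1, -3.0, -0.4])
s = np.array([-0.3, -0.1, -0.5, -0.6])
print(cross_branch_advantages(t, s, beta=0.5))  # [ 0.05  0.   -1.25  0.1 ]
```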
Alternating Procedure & Scaling: Training alternates for $N$ cycles:
- Phase I: branch-specific RLVR for $S_{\mathrm{RL}}$ steps.
- Phase II: mutual OPD for $S_{\mathrm{OPD}}$ steps.
The hyperparameters $S_{\mathrm{RL}}$ and $S_{\mathrm{OPD}}$ control the rhythm between exploration and consolidation. With more than two branches, a hub-and-spoke topology is used (e.g., the text branch as hub) to avoid full pairwise distillation. Finally, the co-evolved branches are merged into a unified model, as summarized in Algorithm 1 below.
# Algorithm 1 CoPD: Co-Evolving Policy Distillation (Simplified)
Require: Base model π_θ0, K datasets {D_k}, rewards {r_k}, cycles N, steps S_RL, S_OPD
 1: Initialize K branches: θ_k ← θ0 for k = 1, ..., K
 2: for n = 1 to N do
 3:     # Phase I: Branch-specific RLVR
 4:     for each branch k in parallel do
 5:         Optimize θ_k on D_k with GRPO for S_RL steps            # Eq. 7
 6:     end for
 7:     # Phase II: Mutual OPD
 8:     for each branch k in parallel do
 9:         for s = 1 to S_OPD do
10:             Generate rollouts on D_k, update with GRPO (native)
11:             for each other branch j ≠ k do
12:                 Generate rollouts on D_j from π_θ_k
13:                 Compute teacher signal δ^(k←j) from π_θ_j       # Eq. 8
14:                 Set advantage A^(k) = β_k · δ^(k←j)
15:             end for
16:             Combine batches; update θ_k
17:         end for
18:     end for
19: end for
20: θ* ← Merge(θ_1, ..., θ_K)                                       # Final unified model
21: return θ*
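For readers who prefer code, here is a runnable Python skeleton of the same control flow with the model-specific steps stubbed out. All helper names (grpo_update, opd_update, and the plain parameter-averaging merge) are hypothetical stand-ins rather than the paper's implementation; in particular, the paper's exact merge operator is not specified here.

```python
from copy import deepcopy

import numpy as np

def grpo_update(theta, dataset, reward_fn, steps):
    """Stub for branch-specific RLVR (Eq. 7): GRPO on the branch's own data."""
    # ... sample rollouts, score with reward_fn, apply group-relative updates ...
    return theta

def opd_update(theta_student, teachers, own_dataset, other_datasets, reward_fn, steps, beta=1.0):
    """Stub for the mutual-OPD phase: native GRPO on own data plus cross-branch
    token-level supervision (Eq. 8) from each teacher branch."""
    # ... generate rollouts on other_datasets, score tokens with teachers,
    #     set advantages A = beta * delta, combine batches, update ...
    return theta_student

def merge(branch_params):
    """Hypothetical merge operator: plain parameter averaging across branches."""
    return {name: np.mean([p[name] for p in branch_params], axis=0)
            for name in branch_params[0]}

def copd(theta0, datasets, reward_fns, cycles, s_rl, s_opd):
    K = len(datasets)
    branches = [deepcopy(theta0) for _ in range(K)]  # all branches start from the base model
    for _ in range(cycles):
        # Phase I: branch-specific RLVR opens a knowledge gap.
        for k in range(K):
            branches[k] = grpo_update(branches[k], datasets[k], reward_fns[k], s_rl)
        # Phase II: mutual OPD closes the behavioral gap (all-pairs shown here;
        # a hub-and-spoke schedule would restrict the teacher set instead).
        new_branches = []
        for k in range(K):
            teachers = [branches[j] for j in range(K) if j != k]
            others = [datasets[j] for j in range(K) if j != k]
            new_branches.append(opd_update(branches[k], teachers, datasets[k],
                                           others, reward_fns[k], s_opd))
        branches = new_branches
    return merge(branches)

# Usage sketch (dummy parameters):
# theta0 = {"w": np.zeros((2, 2))}
# unified = copd(theta0, [D_text, D_image], [r_text, r_image], cycles=..., s_rl=..., s_opd=...)
```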
Empirical Validation / Results
Experiments were conducted using Qwen3-VL-4B-Instruct as the base model, evaluating on text (e.g., AIME, MATH-500), image (e.g., MMMU, MathVista), and video (e.g., MVBench, VideoMathQA) reasoning benchmarks.
Main Results: Two-Branch (Text & Image)
Table 1: Performance on Image and Text Reasoning Benchmarks
| Benchmark | Base | Image-Expert | Text-Expert | Mixed RLVR | OPD (V→T) | OPD (T→V) | CoPD |
|---|---|---|---|---|---|---|---|
| Image Reasoning Avg. | 54.00 | 55.76 | 54.88 | 55.69† | 55.99 | 56.44 | 56.97 |
| Text Reasoning Avg. | 55.78 | 55.51 | 57.89 | 55.48† | 56.23 | 56.09 | 58.76 |
| Overall Avg. | 54.74 | 55.65 | 56.13 | 55.60† | 56.09 | 56.29 | 57.71 |
Note: V→T = Image expert teaches Text branch; T→V = Text expert teaches Image branch. † marks worst result (excluding Base).
- Mixed RLVR shows a capability trade-off, weakening text reasoning compared to the Text-Expert.
- Static OPD (both directions) improves over Mixed RLVR but fails to fully transfer the teacher's strong capability, leaving a significant performance gap.
- CoPD achieves the best overall performance, surpassing both domain-specific experts simultaneously.
Main Results: Three-Branch (Text, Image & Video)
Table 2: Performance on Image, Text, and Video Reasoning Benchmarks
| Benchmark | Base | Image-Exp. | Text-Exp. | Video-Exp. | Mixed RLVR | MOPD | CoPD |
|---|---|---|---|---|---|---|---|
| Image Avg. | 54.00 | 55.76 | 54.88 | 54.71† | 56.17 | 56.37 | 57.12 |
| Text Avg. | 55.78 | 55.51 | 57.89 | 56.84 | 55.39† | 56.80 | 58.63 |
| Video Avg. | 56.22 | 58.27 | 55.54† | 58.75 | 59.62 | 58.32 | 59.21 |
| Overall Avg. | 55.11 | 56.31 | 55.98† | 56.39 | 56.79 | 56.99 | 58.12 |
- CoPD scales effectively, achieving the best overall performance and improving over Multi-teacher OPD (MOPD) across all three capability groups.
- MOPD underperforms the Video-Expert, confirming static multi-teacher distillation struggles with more branches.
- Mixed RLVR again shows trade-offs (high video, low text).
Analysis and Ablations
Table 3: Ablation Study on Two-Branch Setting
| Method | Image Reasoning Avg. | Text Reasoning Avg. | Overall Avg. |
|---|---|---|---|
| CoPD (Full) | 56.97 | 58.76 | 57.71 |
| w/o I-OPD (No distillation from Image) | 56.78 | 57.41 | 57.04 |
| w/o T-OPD (No distillation from Text) | 56.48 | 57.78 | 57.02 |
| Text-Branch Only (No merge) | 56.26 | 58.61 | 57.24 |
| Image-Branch Only (No merge) | 56.78 | 57.17 | 56.94 |
- Bidirectional distillation is necessary: Removing OPD in either direction degrades performance.
- Co-evolution alone is powerful: Even without merging, each single branch outperforms static OPD baselines.
- Merging consolidates strengths: The merged model achieves the best overall result.
Training Dynamics & Design Analysis:
- Behavioral Consistency: CoPD maintains a top-$k$ overlap above 0.9 and a low symmetric KL divergence between branches throughout training, while the static pipeline shows monotonic divergence (Figures 4a, 4b); see the sketch after this list.
- Phase Ratio: A moderate exploration-to-consolidation step ratio $S_{\mathrm{RL}} : S_{\mathrm{OPD}}$ yields the best performance, balancing sufficient specialization with effective alignment (Figure 4c).
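The top-$k$ overlap can be tracked with the sketch given earlier; for the second monitoring quantity, here is a small sketch of a per-position symmetric KL between the two branches' next-token distributions. It mirrors the metric named above but is not the paper's measurement code.

```python
import numpy as np

def symmetric_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Symmetric KL, KL(p||q) + KL(q||p), between two next-token distributions
    (each of shape (vocab_size,) and summing to 1)."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Example with two softmaxed random logit vectors standing in for the branches:
rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
p, q = np.exp(a) / np.exp(a).sum(), np.exp(b) / np.exp(b).sum()
print(symmetric_kl(p, q))  # lower values mean the branches behave more similarly here
```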
Theoretical and Practical Implications
- Theoretical: The paper provides a formal framework analyzing the loss mechanisms in existing consolidation paradigms (divergence cost vs. absorption inefficiency). It establishes behavioral overlap as a key measurable indicator for effective distillation.
- Practical: CoPD offers a scalable training paradigm that successfully unifies multiple advanced capabilities (text, image, video) into a single model that outperforms specialists. It turns the typical capability trade-off into a synergistic gain.
- Paradigm Shift: The method suggests moving from sequential expert training + distillation to parallel co-evolution, which could inspire new scaling laws and training strategies for developing generalist models.
Conclusion
Co-Evolving Policy Distillation (CoPD) addresses fundamental limitations in consolidating multiple expert capabilities. By interleaving branch-specific RLVR with cross-branch mutual OPD, it ensures experts co-evolve, maintaining the behavioral similarity needed for effective knowledge transfer while accumulating complementary knowledge. Empirical results demonstrate that CoPD achieves state-of-the-art "all-in-one" consolidation, surpassing strong baselines and even domain-specific experts. This work, part of the "Self-Taught RLVR" series, explores the idea of a parallel self and suggests that parallel co-evolution of models is a promising scaling paradigm for broadening the boundaries of model capabilities.