Co-Evolving Policy Distillation (CoPD): A Unified Summary

Summary (Overview)

  • Core Problem: Existing paradigms for consolidating multiple expert capabilities into a single model—mixed-data RLVR and the static RLVR-then-OPD pipeline—suffer from significant capability loss due to gradient conflicts and large behavioral gaps between teacher and student models.
  • Key Insight: Effective On-Policy Distillation (OPD) requires the teacher and student to maintain behavioral similarity (measured by top-k token overlap). The standard pipeline fails because experts, trained to convergence in isolation, drift too far from the student, making their supervision hard to absorb.
  • Proposed Method: Co-Evolving Policy Distillation (CoPD) introduces parallel training branches that co-evolve through alternating phases of branch-specific RLVR (to explore new knowledge) and cross-branch mutual OPD (to transfer knowledge while keeping behavioral patterns close).
  • Main Results: CoPD consistently outperforms strong baselines (mixed RLVR, OPD, MOPD) in unifying text, image, and video reasoning capabilities. It achieves an "all-in-one" model that surpasses domain-specific experts, turning cross-domain trade-offs into mutual gains.
  • Broader Implication: The parallel co-evolution training pattern suggests a new model-parallel scaling paradigm for broadening model capabilities.

Introduction and Theoretical Foundation

Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard post-training paradigm for enhancing capabilities like text, image, and video reasoning in large models. However, training a single model on mixed-capability data often leads to capability divergence, where gains in one area come at the expense of another due to conflicting optimization directions.

The prevailing solution is a two-stage static OPD pipeline: 1) Train separate domain-specific experts via RLVR, and 2) Consolidate them into a unified policy via On-Policy Distillation (OPD). While this avoids gradient conflict, the authors identify a critical flaw: by the time distillation begins, the teacher (expert) has drifted too far in behavior from the student (base model), making its supervision difficult to absorb.

This is formalized through a unified utility analysis. Let $X(D_1, D_2)$ denote the total optimization signal available from two capability datasets.

  • Mixed-data RLVR suffers from a capability divergence cost $\Phi$: $U_{\text{mix}} \approx X(D_1, D_2) - \Phi(D_1, D_2)$
  • The static OPD pipeline avoids $\Phi$ but operates with low absorption efficiency $\eta(O_{\text{low}})$ due to low teacher-student behavioral overlap $O_{\text{low}}$: $U_{\text{static}} \approx \eta(O_{\text{low}}) \cdot X(D_1, D_2)$, where $\eta(O_{\text{low}})$ is small
  • CoPD aims for high absorption by maintaining moderate overlap $O_{\text{mod}}$: $U_{\text{CoPD}} \approx \eta(O_{\text{mod}}) \cdot X(D_1, D_2)$, with $\eta(O_{\text{mod}}) \gg \eta(O_{\text{low}})$

The Behavioral Consistency Hypothesis posits that OPD is more effective when teacher and student exhibit similar behavioral patterns. This is measured by the top-$k$ token overlap $O_k$ along on-policy trajectories:

$$O_k(\pi_\theta, \pi_T) = \mathbb{E}_{x,\, y_{<t} \sim \mu_\theta} \left[ \frac{\left|\text{Top}_k(\pi_\theta(\cdot \mid x, y_{<t})) \cap \text{Top}_k(\pi_T(\cdot \mid x, y_{<t}))\right|}{k} \right]$$
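
This overlap can be estimated directly from the two policies' next-token logits along student rollouts. A minimal PyTorch sketch (hypothetical tensor shapes; not the paper's code):

```python
import torch

def topk_overlap(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 k: int = 10) -> torch.Tensor:
    """Average top-k token overlap O_k between student and teacher.

    Both tensors have shape (num_tokens, vocab_size) and are assumed to be
    gathered along the student's on-policy trajectories (y_{<t} ~ mu_theta).
    """
    student_topk = student_logits.topk(k, dim=-1).indices   # (T, k)
    teacher_topk = teacher_logits.topk(k, dim=-1).indices   # (T, k)
    # For each position t, check which of the student's top-k tokens also
    # appear in the teacher's top-k set; averaging gives |intersection| / k.
    match = (student_topk.unsqueeze(-1) == teacher_topk.unsqueeze(-2)).any(-1)
    return match.float().mean()  # scalar in [0, 1]
```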

A pilot study confirms that OPD gain increases linearly with $O_k$ ($r = 0.89$), and that standard RLVR training monotonically decreases $O_k$, pushing experts into the low-efficiency regime for distillation. This motivates CoPD, which must: 1) perform distillation during expert training, 2) keep teacher and student co-evolving, and 3) maintain an informative knowledge gap.

Methodology

CoPD maintains $K$ parallel training branches $\pi_{\theta_k}$, each initialized from a shared base model $\pi_0$ and associated with a capability dataset $D_k$. Training proceeds in alternating cycles of two phases:

1. Branch-Specific RLVR Phase. Each branch $k$ independently performs Group Relative Policy Optimization (GRPO) on its own data $D_k$ to deepen expertise. The objective is:

$$L^{(k)}_{\text{RLVR}}(\theta_k) = \mathbb{E}_{x \sim D_k} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\left( \rho_{i,t}^{(k)} \hat{A}^{\text{RL}}_i,\; \text{clip}\left(\rho_{i,t}^{(k)}, 1-\epsilon, 1+\epsilon\right) \hat{A}^{\text{RL}}_i \right) \right]$$

This phase opens a behavioral/knowledge gap between branches.
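
Concretely, the group-relative advantage is computed from verifiable rewards within a group of $G$ rollouts and plugged into a PPO-style clipped surrogate. A minimal PyTorch sketch of this objective (not the paper's implementation; the advantage normalization and masking details are assumptions):

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # (G, T) token log-probs, current policy
              logp_old: torch.Tensor,   # (G, T) token log-probs at rollout time
              rewards: torch.Tensor,    # (G,)   verifiable reward per rollout
              mask: torch.Tensor,       # (G, T) 1.0 for valid response tokens
              eps: float = 0.2) -> torch.Tensor:
    """Clipped GRPO surrogate with group-relative advantages (minimal sketch).

    Assumes one prompt with G sampled responses; the sequence-level advantage
    A_i = (r_i - mean(r)) / std(r) is broadcast to every token of response i.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)      # (G,)
    adv = adv.unsqueeze(-1)                                        # (G, 1)
    ratio = torch.exp(logp_new - logp_old)                         # rho_{i,t}
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    per_token = torch.minimum(unclipped, clipped) * mask
    # Length-normalised average over tokens, then mean over the group.
    per_seq = per_token.sum(-1) / mask.sum(-1).clamp(min=1)
    return -per_seq.mean()   # negate: optimisers minimise
```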

2. Mutual OPD Phase. Each branch generates rollouts on another branch's data ($x' \sim D_j$) and receives token-level supervision from that branch. The teacher signal from branch $j$ to branch $k$ is:

$$\delta_{i,t}^{(k \leftarrow j)} = \log \pi_{\theta_j}\!\left(y_{i,t}^{(k)} \mid x', y_{i,<t}^{(k)}\right) - \log \pi_{\theta_k}\!\left(y_{i,t}^{(k)} \mid x', y_{i,<t}^{(k)}\right)$$

The token-level advantage for the cross-branch update is $\hat{A}_{i,t}^{(k)} = \beta_k \, \delta_{i,t}^{(k \leftarrow j)}$. This phase transfers knowledge and closes the behavioral gap, keeping branches within an "absorbable" range.
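
In practice the teacher signal is just a difference of per-token log-probabilities evaluated on the student branch's own rollouts, scaled by $\beta_k$. A minimal sketch (hypothetical tensor shapes; not the paper's code):

```python
import torch

def opd_token_advantage(teacher_logp: torch.Tensor,  # (G, T) log pi_theta_j(y_t | x', y_<t)
                        student_logp: torch.Tensor,  # (G, T) log pi_theta_k(y_t | x', y_<t)
                        beta: float = 1.0) -> torch.Tensor:
    """Token-level cross-branch advantage A_{i,t}^{(k)} = beta_k * delta^{(k<-j)}_{i,t}.

    Both log-probs are evaluated on rollouts generated by the *student* branch k
    on the other branch's data D_j, so the supervision stays on-policy.
    """
    delta = teacher_logp - student_logp        # delta^{(k<-j)}_{i,t}
    return beta * delta
```

The resulting token-level advantages can then be dropped into the same clipped update sketched above in place of the sequence-level RL advantage.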

Alternating Procedure & Scaling. Training alternates for $N$ cycles:

  1. Phase I: $\theta_k^{(n,\text{I})} = \text{RLVR}(\theta_k^{(n-1)};\, D_k, r_k, S_{\text{RL}})$ for $S_{\text{RL}}$ steps.
  2. Phase II: $\theta_k^{(n)} = \text{OPD}(\theta_k^{(n,\text{I})};\, D_j, \pi_{\theta_j}, S_{\text{OPD}})$ for $S_{\text{OPD}}$ steps.

The hyperparameters $S_{\text{RL}}$ and $S_{\text{OPD}}$ control the rhythm between exploration and consolidation. For $K > 2$ branches, a hub-and-spoke topology is used (e.g., the text branch as hub) to avoid full pairwise distillation, as sketched below. Finally, the co-evolved branches are merged into a unified model.
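
The summary does not spell out the exact pairing schedule; one plausible reading of the hub-and-spoke topology, in a short Python sketch (the edge directions are an assumption):

```python
def hub_and_spoke_teachers(branches, hub="text"):
    """Return, for each branch, the list of branches it distills from.

    Spokes exchange knowledge only with the hub: each spoke is taught by the
    hub, and the hub is taught by every spoke (K-1 edges instead of K*(K-1)).
    """
    teachers = {}
    for b in branches:
        teachers[b] = [x for x in branches if x != hub] if b == hub else [hub]
    return teachers

# Example: three branches with the text branch as hub.
print(hub_and_spoke_teachers(["text", "image", "video"]))
# {'text': ['image', 'video'], 'image': ['text'], 'video': ['text']}
```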

# Algorithm 1 CoPD: Co-Evolving Policy Distillation (Simplified)
Require: Base model π_θ0, K datasets {D_k}, rewards {r_k}, cycles N, steps S_RL, S_OPD
1: Initialize K branches: θ_k ← θ0 for k = 1, ..., K
2: for n = 1 to N do
3:   # Phase I: Branch-specific RLVR
4:   for each branch k in parallel do
5:     Optimize θ_k on D_k with GRPO for S_RL steps  # Eq. 7
6:   end for
7:   # Phase II: Mutual OPD
8:   for each branch k in parallel do
9:     for s = 1 to S_OPD do
10:      Generate rollouts on D_k; compute native GRPO advantages  # on-domain batch
11:      for each other branch j != k do
12:        Generate rollouts on D_j from π_θ_k
13:        Compute teacher signal δ^(k←j) from π_θ_j  # Eq. 8
14:        Set advantage A^(k) = β_k * δ^(k←j)
15:      end for
16:      Combine batches; update θ_k
17:    end for
18:  end for
19: end for
20: θ* ← Merge(θ_1, θ_2, ..., θ_K)  # Final unified model
21: return θ*
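
The merge operator is not detailed in this summary; a minimal sketch assuming simple (optionally weighted) parameter averaging of the co-evolved branches:

```python
import torch

def merge_branches(state_dicts, weights=None):
    """Average the parameters of K co-evolved branches into one model.

    `state_dicts` is a list of branch state_dicts with identical keys/shapes;
    `weights` optionally gives per-branch mixing coefficients (default: uniform).
    Plain weight averaging is an assumption; other merge operators would also fit here.
    """
    k = len(state_dicts)
    weights = weights or [1.0 / k] * k
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged
```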

Empirical Validation / Results

Experiments were conducted using Qwen3-VL-4B-Instruct as the base model, evaluating on text (e.g., AIME, MATH-500), image (e.g., MMMU, MathVista), and video (e.g., MVBench, VideoMathQA) reasoning benchmarks.

Main Results: Two-Branch (Text & Image)

Table 1: Performance on Image and Text Reasoning Benchmarks

| Benchmark | Base | Image-Expert | Text-Expert | Mixed RLVR | OPD (V→T) | OPD (T→V) | CoPD |
|---|---|---|---|---|---|---|---|
| Image Reasoning Avg. | 54.00 | 55.76 | 54.88 | 55.69† | 55.99 | 56.44 | 56.97 |
| Text Reasoning Avg. | 55.78 | 55.51 | 57.89 | 55.48† | 56.23 | 56.09 | 58.76 |
| Overall Avg. | 54.74 | 55.65 | 56.13 | 55.60† | 56.09 | 56.29 | 57.71 |

Note: V→T = Image expert teaches Text branch; T→V = Text expert teaches Image branch. † marks worst result (excluding Base).
  • Mixed RLVR shows a capability trade-off, weakening text reasoning compared to the Text-Expert.
  • Static OPD (both directions) improves over Mixed RLVR but fails to fully transfer the teacher's strong capability, leaving a significant performance gap.
  • CoPD achieves the best overall performance, surpassing both domain-specific experts simultaneously.

Main Results: Three-Branch (Text, Image & Video)

Table 2: Performance on Image, Text, and Video Reasoning Benchmarks

| Benchmark | Base | Image-Exp. | Text-Exp. | Video-Exp. | Mixed RLVR | MOPD | CoPD |
|---|---|---|---|---|---|---|---|
| Image Avg. | 54.00 | 55.76 | 54.88 | 54.71† | 56.17 | 56.37 | 57.12 |
| Text Avg. | 55.78 | 55.51 | 57.89 | 56.84 | 55.39† | 56.80 | 58.63 |
| Video Avg. | 56.22 | 58.27 | 55.54† | 58.75 | 59.62 | 58.32 | 59.21 |
| Overall Avg. | 55.11 | 56.31 | 55.98† | 56.39 | 56.79 | 56.99 | 58.12 |

  • CoPD scales effectively, achieving the best overall performance and improving over Multi-teacher OPD (MOPD) across all three capability groups.
  • MOPD underperforms the Video-Expert, confirming static multi-teacher distillation struggles with more branches.
  • Mixed RLVR again shows trade-offs (high video, low text).

Analysis and Ablations

Table 3: Ablation Study on Two-Branch Setting

| Method | Image Reasoning Avg. | Text Reasoning Avg. | Overall Avg. |
|---|---|---|---|
| CoPD (Full) | 56.97 | 58.76 | 57.71 |
| w/o I-OPD (no distillation from Image) | 56.78 | 57.41 | 57.04 |
| w/o T-OPD (no distillation from Text) | 56.48 | 57.78 | 57.02 |
| Text-Branch Only (no merge) | 56.26 | 58.61 | 57.24 |
| Image-Branch Only (no merge) | 56.78 | 57.17 | 56.94 |

  • Bidirectional distillation is necessary: Removing OPD in either direction degrades performance.
  • Co-evolution alone is powerful: Even without merging, each single branch outperforms static OPD baselines.
  • Merging consolidates strengths: The merged model achieves the best overall result.

Training Dynamics & Design Analysis:

  • Behavioral Consistency: CoPD maintains top-$k$ overlap above 0.9 and low symmetric KL divergence between branches throughout training, while the static pipeline shows monotonic divergence (Figures 4a, 4b).
  • Phase Ratio: An exploration-to-consolidation ratio of $S_{\text{RL}} : S_{\text{OPD}} = 1.5:1$ yields the best performance, balancing sufficient specialization with effective alignment (Figure 4c).

Theoretical and Practical Implications

  • Theoretical: The paper provides a formal framework analyzing the loss mechanisms in existing consolidation paradigms (divergence cost vs. absorption inefficiency). It establishes behavioral overlap as a key measurable indicator for effective distillation.
  • Practical: CoPD offers a scalable training paradigm that successfully unifies multiple advanced capabilities (text, image, video) into a single model that outperforms specialists. It turns the typical capability trade-off into a synergistic gain.
  • Paradigm Shift: The method suggests moving from sequential expert training + distillation to parallel co-evolution, which could inspire new scaling laws and training strategies for developing generalist models.

Conclusion

Co-Evolving Policy Distillation (CoPD) addresses fundamental limitations in consolidating multiple expert capabilities. By interleaving branch-specific RLVR with cross-branch mutual OPD, it ensures that experts co-evolve, maintaining the behavioral similarity needed for effective knowledge transfer while accumulating complementary knowledge. Empirical results demonstrate that CoPD achieves state-of-the-art "all-in-one" consolidation, surpassing strong baselines and even the domain-specific experts. This work, part of the "Self-Taught RLVR" series, explores the idea of a "parallel self" and suggests that model-parallel co-evolution is a promising scaling paradigm for broadening the boundaries of model capabilities.