Co-Evolving Policy Distillation (CoPD): A Unified Summary

Summary (Overview)

  • Core Problem: Existing paradigms for consolidating multiple expert capabilities into a single model—mixed-data RLVR and the static RLVR-then-OPD pipeline—suffer from significant capability loss due to gradient conflicts and large behavioral gaps between teacher and student models.
  • Key Insight: Effective On-Policy Distillation (OPD) requires the teacher and student to maintain behavioral similarity (measured by top-k token overlap). The standard pipeline fails because experts, trained to convergence in isolation, drift too far from the student, making their supervision hard to absorb.
  • Proposed Method: Co-Evolving Policy Distillation (CoPD) introduces parallel training branches that co-evolve through alternating phases of branch-specific RLVR (to explore new knowledge) and cross-branch mutual OPD (to transfer knowledge while keeping behavioral patterns close).
  • Main Results: CoPD consistently outperforms strong baselines (mixed RLVR, OPD, MOPD) in unifying text, image, and video reasoning capabilities. It achieves an "all-in-one" model that surpasses domain-specific experts, turning cross-domain trade-offs into mutual gains.
  • Broader Implication: The parallel co-evolution training pattern suggests a new model-parallel scaling paradigm for broadening model capabilities.

Introduction and Theoretical Foundation

Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard post-training paradigm for enhancing capabilities like text, image, and video reasoning in large models. However, training a single model on mixed-capability data often leads to capability divergence, where gains in one area come at the expense of another due to conflicting optimization directions.

The prevailing solution is a two-stage static OPD pipeline: 1) Train separate domain-specific experts via RLVR, and 2) Consolidate them into a unified policy via On-Policy Distillation (OPD). While this avoids gradient conflict, the authors identify a critical flaw: by the time distillation begins, the teacher (expert) has drifted too far in behavior from the student (base model), making its supervision difficult to absorb.

This is formalized through a unified utility analysis. Let $X(D_1, D_2)$ denote the total optimization signal available from two capability datasets.

  • Mixed-data RLVR suffers from a capability divergence cost $\Phi$: $U_{\text{mix}} \approx X(D_1, D_2) - \Phi(D_1, D_2)$
  • The static OPD pipeline avoids $\Phi$ but operates with low absorption efficiency $\eta(O_{\text{low}})$ due to low teacher-student behavioral overlap $O_{\text{low}}$: $U_{\text{static}} \approx \eta(O_{\text{low}}) \cdot X(D_1, D_2)$, where $\eta(O_{\text{low}})$ is small
  • CoPD aims for high absorption by maintaining moderate overlap $O_{\text{mod}}$: $U_{\text{CoPD}} \approx \eta(O_{\text{mod}}) \cdot X(D_1, D_2)$, with $\eta(O_{\text{mod}}) \gg \eta(O_{\text{low}})$

The Behavioral Consistency Hypothesis posits that OPD is more effective when teacher and student exhibit similar behavioral patterns. This is measured by the top-$k$ token overlap $O_k$ along on-policy trajectories:

$$O_k(\pi_\theta, \pi_T) = \mathbb{E}_{x,\, y_{<t} \sim \mu_\theta} \left[ \frac{\left|\text{Top}_k(\pi_\theta(\cdot \mid x, y_{<t})) \cap \text{Top}_k(\pi_T(\cdot \mid x, y_{<t}))\right|}{k} \right]$$
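
This overlap can be estimated directly from the two policies' next-token logits along student rollouts. A minimal PyTorch sketch (hypothetical tensor shapes; not the paper's code):

```python
import torch

def topk_overlap(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 k: int = 10) -> torch.Tensor:
    """Average top-k token overlap O_k between student and teacher.

    Both tensors have shape (num_tokens, vocab_size) and are assumed to be
    gathered along the student's on-policy trajectories (y_{<t} ~ mu_theta).
    """
    student_topk = student_logits.topk(k, dim=-1).indices   # (T, k)
    teacher_topk = teacher_logits.topk(k, dim=-1).indices   # (T, k)
    # For each position t, check which of the student's top-k tokens also
    # appear in the teacher's top-k set; averaging gives |intersection| / k.
    match = (student_topk.unsqueeze(-1) == teacher_topk.unsqueeze(-2)).any(-1)
    return match.float().mean()  # scalar in [0, 1]
```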

A pilot study confirms that OPD gain increases linearly with $O_k$ ($r = 0.89$), and that standard RLVR training monotonically decreases $O_k$, pushing experts into the low-efficiency regime for distillation. This motivates CoPD, which must: 1) perform distillation during expert training, 2) keep teacher and student co-evolving, and 3) maintain an informative knowledge gap.

Methodology

CoPD maintains $K$ parallel training branches $\pi_{\theta_k}$, each initialized from a shared base model $\pi_0$ and associated with a capability dataset $D_k$. Training proceeds in alternating cycles of two phases:

1. Branch-Specific RLVR Phase. Each branch $k$ independently performs Group Relative Policy Optimization (GRPO) on its own data $D_k$ to deepen expertise. The objective is:

$$L^{(k)}_{\text{RLVR}}(\theta_k) = \mathbb{E}_{x \sim D_k} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\left( \rho_{i,t}^{(k)} \hat{A}^{\text{RL}}_i,\; \text{clip}\left(\rho_{i,t}^{(k)}, 1-\epsilon, 1+\epsilon\right) \hat{A}^{\text{RL}}_i \right) \right]$$

This phase opens a behavioral/knowledge gap between branches.
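
Concretely, the group-relative advantage is computed from verifiable rewards within a group of $G$ rollouts and plugged into a PPO-style clipped surrogate. A minimal PyTorch sketch of this objective (not the paper's implementation; the advantage normalization and masking details are assumptions):

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # (G, T) token log-probs, current policy
              logp_old: torch.Tensor,   # (G, T) token log-probs at rollout time
              rewards: torch.Tensor,    # (G,)   verifiable reward per rollout
              mask: torch.Tensor,       # (G, T) 1.0 for valid response tokens
              eps: float = 0.2) -> torch.Tensor:
    """Clipped GRPO surrogate with group-relative advantages (minimal sketch).

    Assumes one prompt with G sampled responses; the sequence-level advantage
    A_i = (r_i - mean(r)) / std(r) is broadcast to every token of response i.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)      # (G,)
    adv = adv.unsqueeze(-1)                                        # (G, 1)
    ratio = torch.exp(logp_new - logp_old)                         # rho_{i,t}
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    per_token = torch.minimum(unclipped, clipped) * mask
    # Length-normalised average over tokens, then mean over the group.
    per_seq = per_token.sum(-1) / mask.sum(-1).clamp(min=1)
    return -per_seq.mean()   # negate: optimisers minimise
```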

2. Mutual OPD Phase. Each branch generates rollouts on another branch's data ($x' \sim D_j$) and receives token-level supervision from that branch. The teacher signal from branch $j$ to branch $k$ is:

$$\delta_{i,t}^{(k \leftarrow j)} = \log \pi_{\theta_j}\!\left(y_{i,t}^{(k)} \mid x', y_{i,<t}^{(k)}\right) - \log \pi_{\theta_k}\!\left(y_{i,t}^{(k)} \mid x', y_{i,<t}^{(k)}\right)$$

The token-level advantage for the cross-branch update is $\hat{A}_{i,t}^{(k)} = \beta_k \, \delta_{i,t}^{(k \leftarrow j)}$. This phase transfers knowledge and closes the behavioral gap, keeping branches within an "absorbable" range.
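
In practice the teacher signal is just a difference of per-token log-probabilities evaluated on the student branch's own rollouts, scaled by $\beta_k$. A minimal sketch (hypothetical tensor shapes; not the paper's code):

```python
import torch

def opd_token_advantage(teacher_logp: torch.Tensor,  # (G, T) log pi_theta_j(y_t | x', y_<t)
                        student_logp: torch.Tensor,  # (G, T) log pi_theta_k(y_t | x', y_<t)
                        beta: float = 1.0) -> torch.Tensor:
    """Token-level cross-branch advantage A_{i,t}^{(k)} = beta_k * delta^{(k<-j)}_{i,t}.

    Both log-probs are evaluated on rollouts generated by the *student* branch k
    on the other branch's data D_j, so the supervision stays on-policy.
    """
    delta = teacher_logp - student_logp        # delta^{(k<-j)}_{i,t}
    return beta * delta
```

The resulting token-level advantages can then be dropped into the same clipped update sketched above in place of the sequence-level RL advantage.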

Alternating Procedure & Scaling. Training alternates for $N$ cycles:

  1. Phase I: $\theta_k^{(n,\text{I})} = \text{RLVR}(\theta_k^{(n-1)};\, D_k, r_k, S_{\text{RL}})$ for $S_{\text{RL}}$ steps.
  2. Phase II: $\theta_k^{(n)} = \text{OPD}(\theta_k^{(n,\text{I})};\, D_j, \pi_{\theta_j}, S_{\text{OPD}})$ for $S_{\text{OPD}}$ steps.

The hyperparameters $S_{\text{RL}}$ and $S_{\text{OPD}}$ control the rhythm between exploration and consolidation. For $K > 2$ branches, a hub-and-spoke topology is used (e.g., the text branch as hub) to avoid full pairwise distillation, as sketched below. Finally, the co-evolved branches are merged into a unified model.
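
The summary does not spell out the exact pairing schedule; one plausible reading of the hub-and-spoke topology, in a short Python sketch (the edge directions are an assumption):

```python
def hub_and_spoke_teachers(branches, hub="text"):
    """Return, for each branch, the list of branches it distills from.

    Spokes exchange knowledge only with the hub: each spoke is taught by the
    hub, and the hub is taught by every spoke (K-1 edges instead of K*(K-1)).
    """
    teachers = {}
    for b in branches:
        teachers[b] = [x for x in branches if x != hub] if b == hub else [hub]
    return teachers

# Example: three branches with the text branch as hub.
print(hub_and_spoke_teachers(["text", "image", "video"]))
# {'text': ['image', 'video'], 'image': ['text'], 'video': ['text']}
```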

# Algorithm 1 CoPD: Co-Evolving Policy Distillation (Simplified)
Require: Base model π_θ0, K datasets {D_k}, rewards {r_k}, cycles N, steps S_RL, S_OPD
1: Initialize K branches: θ_k ← θ0 for k = 1, ..., K
2: for n = 1 to N do
3:   # Phase I: Branch-specific RLVR
4:   for each branch k in parallel do
5:     Optimize θ_k on D_k with GRPO for S_RL steps  # Eq. 7
6:   end for
7:   # Phase II: Mutual OPD
8:   for each branch k in parallel do
9:     for s = 1 to S_OPD do
10:      Generate rollouts on D_k; compute native GRPO advantages  # on-domain batch
11:      for each other branch j != k do
12:        Generate rollouts on D_j from π_θ_k
13:        Compute teacher signal δ^(k←j) from π_θ_j  # Eq. 8
14:        Set advantage A^(k) = β_k * δ^(k←j)
15:      end for
16:      Combine batches; update θ_k
17:    end for
18:  end for
19: end for
20: θ* ← Merge(θ_1, θ_2, ..., θ_K)  # Final unified model
21: return θ*
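
The merge operator is not detailed in this summary; a minimal sketch assuming simple (optionally weighted) parameter averaging of the co-evolved branches:

```python
import torch

def merge_branches(state_dicts, weights=None):
    """Average the parameters of K co-evolved branches into one model.

    `state_dicts` is a list of branch state_dicts with identical keys/shapes;
    `weights` optionally gives per-branch mixing coefficients (default: uniform).
    Plain weight averaging is an assumption; other merge operators would also fit here.
    """
    k = len(state_dicts)
    weights = weights or [1.0 / k] * k
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged
```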

Empirical Validation / Results

Experiments were conducted using Qwen3-VL-4B-Instruct as the base model, evaluating on text (e.g., AIME, MATH-500), image (e.g., MMMU, MathVista), and video (e.g., MVBench, VideoMathQA) reasoning benchmarks.

Main Results: Two-Branch (Text & Image)

Table 1: Performance on Image and Text Reasoning Benchmarks

| Benchmark | Base | Image-Expert | Text-Expert | Mixed RLVR | OPD (V→T) | OPD (T→V) | CoPD |
|---|---|---|---|---|---|---|---|
| Image Reasoning Avg. | 54.00 | 55.76 | 54.88 | 55.69† | 55.99 | 56.44 | 56.97 |
| Text Reasoning Avg. | 55.78 | 55.51 | 57.89 | 55.48† | 56.23 | 56.09 | 58.76 |
| Overall Avg. | 54.74 | 55.65 | 56.13 | 55.60† | 56.09 | 56.29 | 57.71 |

Note: V→T = Image expert teaches Text branch; T→V = Text expert teaches Image branch. † marks worst result (excluding Base).
  • Mixed RLVR shows a capability trade-off, weakening text reasoning compared to the Text-Expert.
  • Static OPD (both directions) improves over Mixed RLVR but fails to fully transfer the teacher's strong capability, leaving a significant performance gap.
  • CoPD achieves the best overall performance, surpassing both domain-specific experts simultaneously.

Main Results: Three-Branch (Text, Image & Video)

Table 2: Performance on Image, Text, and Video Reasoning Benchmarks

| Benchmark | Base | Image-Exp. | Text-Exp. | Video-Exp. | Mixed RLVR | MOPD | CoPD |
|---|---|---|---|---|---|---|---|
| Image Avg. | 54.00 | 55.76 | 54.88 | 54.71† | 56.17 | 56.37 | 57.12 |
| Text Avg. | 55.78 | 55.51 | 57.89 | 56.84 | 55.39† | 56.80 | 58.63 |
| Video Avg. | 56.22 | 58.27 | 55.54† | 58.75 | 59.62 | 58.32 | 59.21 |
| Overall Avg. | 55.11 | 56.31 | 55.98† | 56.39 | 56.79 | 56.99 | 58.12 |

  • CoPD scales effectively, achieving the best overall performance and improving over Multi-teacher OPD (MOPD) across all three capability groups.
  • MOPD underperforms the Video-Expert, confirming static multi-teacher distillation struggles with more branches.
  • Mixed RLVR again shows trade-offs (high video, low text).

Analysis and Ablations

Table 3: Ablation Study on Two-Branch Setting

| Method | Image Reasoning Avg. | Text Reasoning Avg. | Overall Avg. |
|---|---|---|---|
| CoPD (Full) | 56.97 | 58.76 | 57.71 |
| w/o I-OPD (no distillation from Image) | 56.78 | 57.41 | 57.04 |
| w/o T-OPD (no distillation from Text) | 56.48 | 57.78 | 57.02 |
| Text-Branch Only (no merge) | 56.26 | 58.61 | 57.24 |
| Image-Branch Only (no merge) | 56.78 | 57.17 | 56.94 |

  • Bidirectional distillation is necessary: Removing OPD in either direction degrades performance.
  • Co-evolution alone is powerful: Even without merging, each single branch outperforms static OPD baselines.
  • Merging consolidates strengths: The merged model achieves the best overall result.

Training Dynamics & Design Analysis:

  • Behavioral Consistency: CoPD maintains top-$k$ overlap above 0.9 and low symmetric KL divergence between branches throughout training, while the static pipeline shows monotonic divergence (Figures 4a, 4b).
  • Phase Ratio: An exploration-to-consolidation ratio of $S_{\text{RL}} : S_{\text{OPD}} = 1.5:1$ yields the best performance, balancing sufficient specialization with effective alignment (Figure 4c).

Theoretical and Practical Implications

  • Theoretical: The paper provides a formal framework analyzing the loss mechanisms in existing consolidation paradigms (divergence cost vs. absorption inefficiency). It establishes behavioral overlap as a key measurable indicator for effective distillation.
  • Practical: CoPD offers a scalable training paradigm that successfully unifies multiple advanced capabilities (text, image, video) into a single model that outperforms specialists. It turns the typical capability trade-off into a synergistic gain.
  • Paradigm Shift: The method suggests moving from sequential expert training + distillation to parallel co-evolution, which could inspire new scaling laws and training strategies for developing generalist models.

Conclusion

Co-Evolving Policy Distillation (CoPD) addresses fundamental limitations in consolidating multiple expert capabilities. By interleaving branch-specific RLVR with cross-branch mutual OPD, it ensures that experts co-evolve, maintaining the behavioral similarity needed for effective knowledge transfer while accumulating complementary knowledge. Empirical results demonstrate that CoPD achieves state-of-the-art "all-in-one" consolidation, surpassing strong baselines and even the domain-specific experts. This work, part of the "Self-Taught RLVR" series, explores the idea of a "parallel self" and suggests that model-parallel co-evolution is a promising scaling paradigm for broadening the boundaries of model capabilities.