# Co-Evolving Policy Distillation

> CoPD trains parallel branches that alternate between RLVR and mutual distillation. Keeping the behavioral overlap between branches high makes knowledge transfer effective and yields an all-in-one model that surpasses the domain experts.

- **Source:** [arXiv](https://arxiv.org/abs/2604.27083)
- **Published:** 2026-05-02
- **Permalink:** https://picx.dev/p/ZgM3Uh
- **Whiteboard:** https://picx.dev/p/ZgM3Uh/image

## Summary

*   **Core Problem:** Existing paradigms for consolidating multiple expert capabilities into a single model, **mixed-data RLVR** and the static **RLVR-then-OPD pipeline**, suffer significant capability loss: the former from gradient conflicts between capabilities, the latter from a large behavioral gap between teacher and student.
*   **Key Insight:** Effective On-Policy Distillation (OPD) requires the teacher and student to maintain behavioral similarity (measured by top-*k* token overlap). The standard pipeline fails because experts, trained to convergence in isolation, drift too far from the student, making their supervision hard to absorb.
*   **Proposed Method:** **Co-Evolving Policy Distillation (CoPD)** introduces parallel training branches that **co-evolve** through alternating phases of branch-specific RLVR (to explore new knowledge) and cross-branch **mutual OPD** (to transfer knowledge while keeping behavioral patterns close).
*   **Main Results:** CoPD consistently outperforms strong baselines (mixed RLVR, OPD, MOPD) in unifying text, image, and video reasoning capabilities. It achieves an "all-in-one" model that **surpasses domain-specific experts**, turning cross-domain trade-offs into mutual gains.
*   **Broader Implication:** The parallel co-evolution training pattern suggests a novel **model-parallel training scaling paradigm** for broadening model capabilities, where "parallel" refers to several branches of the same base model trained side by side.

## Introduction and Theoretical Foundation
**Reinforcement Learning with Verifiable Rewards (RLVR)** has become the standard post-training paradigm for enhancing capabilities like text, image, and video reasoning in large models. However, training a single model on mixed-capability data often leads to **capability divergence**, where gains in one area come at the expense of another due to conflicting optimization directions.

The prevailing solution is a two-stage **static OPD pipeline**: 1) Train separate domain-specific experts via RLVR, and 2) Consolidate them into a unified policy via On-Policy Distillation (OPD). While this avoids gradient conflict, the authors identify a critical flaw: by the time distillation begins, the teacher (expert) has **drifted too far in behavior** from the student (base model), making its supervision difficult to absorb.

This is formalized through a **unified utility analysis**. Let $X(D_1, D_2)$ be the total optimization signal from two capability datasets.
*   **Mixed-data RLVR** suffers from a **capability divergence cost** $\Phi$:
    $$U_{\text{mix}} \approx X(D_1, D_2) - \Phi(D_1, D_2)$$
*   The **static OPD pipeline** avoids $\Phi$ but operates with low absorption efficiency $\eta(O_{\text{low}})$ due to low teacher-student behavioral overlap $O_{\text{low}}$:
    $$U_{\text{static}} \approx \eta(O_{\text{low}}) \cdot X(D_1, D_2), \quad \eta(O_{\text{low}}) \text{ is small}$$
*   **CoPD** aims to achieve high absorption by maintaining moderate overlap $O_{\text{mod}}$:
    $$U_{\text{CoPD}} \approx \eta(O_{\text{mod}}) \cdot X(D_1, D_2), \quad \eta(O_{\text{mod}}) \gg \eta(O_{\text{low}})$$
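
As a short worked consequence of these approximations (taking them at face value and assuming $X(D_1, D_2) > 0$, which the summary does not state explicitly), the three utilities can be compared with simple inequalities:

$$
U_{\text{static}} > U_{\text{mix}} \iff \eta(O_{\text{low}}) \cdot X(D_1, D_2) > X(D_1, D_2) - \Phi(D_1, D_2) \iff \Phi(D_1, D_2) > \bigl(1 - \eta(O_{\text{low}})\bigr) \cdot X(D_1, D_2)
$$

$$
U_{\text{CoPD}} - U_{\text{static}} \approx \bigl(\eta(O_{\text{mod}}) - \eta(O_{\text{low}})\bigr) \cdot X(D_1, D_2) > 0
$$

In words: the static pipeline only beats mixed-data RLVR when the divergence cost exceeds the signal lost to poor absorption, while CoPD's advantage over the static pipeline grows with the gap in absorption efficiency.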

The **Behavioral Consistency Hypothesis** posits that OPD is more effective when teacher and student exhibit similar behavioral patterns. Similarity is measured by the **top-$k$ token overlap** $O_k$, averaged over prompts $x$ and prefixes $y_{<t}$ drawn from the student's own (on-policy) rollout distribution $\mu_\theta$:
$$O_k(\pi_\theta, \pi_T) = \mathbb{E}_{x, y_{<t} \sim \mu_\theta} \left[ \frac{|\text{Top}_k(\pi_\theta(\cdot|x, y_{<t})) \cap \text{Top}_k(\pi_T(\cdot|x, y_{<t}))|}{k} \right]$$
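
The overlap metric can be illustrated with a minimal NumPy sketch. It assumes that next-token scores from both models are already available as arrays evaluated on the same student-generated prefixes; the function name `top_k_overlap` and the toy scores are illustrative, not taken from the paper.

```python
import numpy as np


def top_k_overlap(student_logits, teacher_logits, k):
    """Mean top-k token overlap between two policies.

    Both inputs have shape (num_positions, vocab_size) and hold next-token
    scores (logits or log-probs) evaluated on the SAME student-generated
    prefixes, so averaging over rows approximates the on-policy expectation.
    """
    student_logits = np.asarray(student_logits)
    teacher_logits = np.asarray(teacher_logits)
    # Indices of the k highest-scoring tokens per position (unordered).
    s_top = np.argpartition(-student_logits, k - 1, axis=-1)[:, :k]
    t_top = np.argpartition(-teacher_logits, k - 1, axis=-1)[:, :k]
    # |Top_k(student) ∩ Top_k(teacher)| / k for each position.
    overlaps = [len(set(s) & set(t)) / k for s, t in zip(s_top, t_top)]
    return float(np.mean(overlaps))


# Toy scores: 2 positions, vocabulary of 5 tokens.
student = np.array([[5.0, 4.0, 3.0, 0.0, -1.0],
                    [0.0, 1.0, 2.0, 3.0, 4.0]])
teacher = np.array([[5.0, 0.0, 3.0, 4.0, -1.0],
                    [4.0, 1.0, 2.0, 3.0, 0.0]])
overlap = top_k_overlap(student, teacher, k=2)  # one shared token per position
```

In this toy case each position shares one of its two top tokens, so the overlap is 0.5; identical models give 1.0.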

A **pilot study** confirms that OPD gain increases linearly with $O_k$ ($r=0.89$), and that standard RLVR training monotonically decreases $O_k$, pushing experts into the low-efficiency regime for distillation. This motivates **CoPD**, which must: 1) perform distillation *during* expert training, 2) keep teacher and student co-evolving, and 3) maintain an informative knowledge gap.

## Methodology
CoPD maintains $K$ parallel training branches $\pi_{\theta_k}$, each initialized from a shared base model $\pi_0$ and associated with a capability dataset $D_k$. Training proceeds in alternating cycles of two phases:

**1. Branch-Specific RLVR Phase**
Each branch $k$ independently performs Group Relative Policy Optimization (GRPO) on its own data $D_k$ to deepen expertise. The objective is:
$$
L^{(k)}_{\text{RLVR}}(\theta_k) = \mathbb{E}_{x \sim D_k} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\left( \rho_{i,t}^{(k)} \hat{A}^{\text{RL}}_i, \text{clip}(\rho_{i,t}^{(k)}, 1-\epsilon, 1+\epsilon) \hat{A}^{\text{RL}}_i \right) \right]
$$
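Here $G$ is the number of responses sampled per prompt, $\rho_{i,t}^{(k)}$ is the token-level importance ratio between the current policy and the rollout policy of branch $k$, $\hat{A}^{\text{RL}}_i$ is the advantage of response $y_i$ computed from the verifiable reward, and $\epsilon$ is the clipping range. This phase opens a behavioral and knowledge gap between branches.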
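
The clipped objective above can be sketched for a single prompt as follows. This is a minimal illustration, not the authors' implementation: it assumes the commonly used GRPO advantage (rewards standardized within the group), which the summary does not spell out, and the default `eps=0.2` is a placeholder value.

```python
import numpy as np


def grpo_surrogate(rewards, logp_new, logp_old, eps=0.2):
    """Clipped GRPO surrogate for ONE prompt with G sampled responses.

    rewards:  length-G sequence of scalar verifiable rewards.
    logp_new: list of G arrays, per-token log-probs under the current policy.
    logp_old: list of G arrays, per-token log-probs under the rollout policy.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Assumed group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    per_response = []
    for a, new, old in zip(adv, logp_new, logp_old):
        rho = np.exp(np.asarray(new) - np.asarray(old))  # per-token ratio
        unclipped = rho * a
        clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * a
        # Inner (1/|y_i|) sum over tokens of the min term.
        per_response.append(np.minimum(unclipped, clipped).mean())
    return float(np.mean(per_response))  # outer (1/G) sum over the group


# Two one-token responses; the rewarded one has doubled its probability.
logp_old = [np.log([0.25]), np.log([0.5])]
logp_new = [np.log([0.5]), np.log([0.5])]
value = grpo_surrogate([1.0, 0.0], logp_new, logp_old)
```

In the usage example the ratio of 2.0 on the rewarded response is clipped to 1.2, so the surrogate comes out at about 0.1 rather than 0.5, which shows how clipping limits the size of a single update.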

**2. Mutual OPD Phase**
Each branch generates rollouts on *another* branch's data ($x' \sim D_j$) and receives token-level supervision from that branch. The teacher signal from branch $j$ to branch $k$ is:
$$
\delta_{i,t}^{(k \leftarrow j)} = \log \pi_{\theta_j}(y_{i,t}^{(k)} | x', y_{i,<t}^{(k)}) - \log \pi_{\theta_k}(y_{i,t}^{(k)} | x', y_{i,<t}^{(k)})
$$
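The token-level advantage for the cross-branch update is $\hat{A}_{i,t}^{(k)} = \beta_k \delta_{i,t}^{(k \leftarrow j)}$, where $\beta_k$ scales the distillation signal for branch $k$. The signal is positive for tokens that the teacher finds more likely than the student does, and negative otherwise. This phase transfers knowledge and closes the behavioral gap, keeping branches within an "absorbable" range.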
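
The teacher signal and advantage can be computed directly from per-token log-probabilities. The sketch below assumes those log-probabilities have already been obtained by scoring the student's sampled tokens with both branches; the function name `opd_advantage` and the toy probabilities are illustrative.

```python
import numpy as np


def opd_advantage(teacher_logp, student_logp, beta):
    """Token-level OPD advantage: beta * (log pi_teacher - log pi_student).

    Both arrays hold log-probabilities of the tokens that the STUDENT
    (branch k) actually sampled, scored by branch j and by branch k.
    """
    delta = np.asarray(teacher_logp, dtype=float) - np.asarray(student_logp, dtype=float)
    return beta * delta


# Probabilities each branch assigns to three sampled tokens (toy values).
teacher_logp = np.log([0.50, 0.10, 0.30])
student_logp = np.log([0.25, 0.20, 0.30])
adv = opd_advantage(teacher_logp, student_logp, beta=1.0)
```

The first token gets a positive advantage (the teacher prefers it), the second a negative one, and the third zero. Averaged over student samples, $\delta$ is an estimate of the negative reverse KL divergence from student to teacher, so increasing it pulls the student toward the teacher.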

**Alternating Procedure & Scaling**
Training alternates for $N$ cycles:
1.  **Phase I:** $\theta_k^{(n,\text{I})} = \text{RLVR}(\theta_k^{(n-1)}; D_k, r_k, S_{\text{RL}})$ for $S_{\text{RL}}$ steps.
2.  **Phase II:** $\theta_k^{(n)} = \text{OPD}(\theta_k^{(n,\text{I})}; D_j, \pi_{\theta_j}, S_{\text{OPD}})$ for $S_{\text{OPD}}$ steps.

The hyperparameters $S_{\text{RL}}$ and $S_{\text{OPD}}$ control the rhythm between exploration and consolidation. For $K>2$ branches, a **hub-and-spoke** topology is used (e.g., text branch as hub) to avoid full pairwise distillation. Finally, the co-evolved branches are merged into a unified model.

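```python
# Algorithm 1 (simplified): CoPD, Co-Evolving Policy Distillation.
# Python rendering of the pseudocode. `ops` is a hypothetical, caller-supplied
# object (not an API from the paper) providing: grpo_step, grpo_batch, rollout,
# teacher_signal, update, and merge.
import copy


def copd(theta0, datasets, rewards, betas, N, S_RL, S_OPD, ops):
    """Run N CoPD cycles over K = len(datasets) branches; return merged model."""
    K = len(datasets)
    # Initialize K branches from the shared base model theta0.
    thetas = [copy.deepcopy(theta0) for _ in range(K)]
    for _ in range(N):
        # Phase I: branch-specific RLVR. Branches are independent here, so
        # this loop can run in parallel.
        for k in range(K):
            for _ in range(S_RL):
                # One GRPO step on D_k with reward r_k (RLVR objective above).
                thetas[k] = ops.grpo_step(thetas[k], datasets[k], rewards[k])
        # Phase II: mutual OPD. All branches step together, so each student
        # sees the current version of its teachers (synchronous stepping is
        # an assumption; the source only says "in parallel").
        for _ in range(S_OPD):
            updated = []
            for k in range(K):
                # Native batch: rollouts on D_k with GRPO advantages.
                batches = [ops.grpo_batch(thetas[k], datasets[k], rewards[k])]
                for j in range(K):
                    if j == k:
                        continue
                    # Rollouts on D_j sampled from the student branch k.
                    y = ops.rollout(thetas[k], datasets[j])
                    # Teacher signal delta^(k<-j) = log pi_j(y) - log pi_k(y).
                    delta = ops.teacher_signal(thetas[j], thetas[k], y)
                    # Token-level advantage A^(k) = beta_k * delta^(k<-j).
                    batches.append((y, betas[k] * delta))
                # Combine native and cross-branch batches; update theta_k.
                updated.append(ops.update(thetas[k], batches))
            thetas = updated
    # Merge the K co-evolved branches (not the base model) into theta*.
    return ops.merge(thetas)
```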
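
For the hub-and-spoke topology mentioned above for $K>2$, the saving over full pairwise distillation can be made concrete by enumerating teacher-student pairs. The sketch assumes that distillation runs in both directions between the hub and each spoke, which the summary does not state explicitly; the function names are illustrative.

```python
from itertools import permutations


def full_pairwise_edges(num_branches):
    """All directed (teacher, student) pairs among K branches."""
    return list(permutations(range(num_branches), 2))


def hub_and_spoke_edges(num_branches, hub=0):
    """Directed (teacher, student) pairs that always involve the hub."""
    spokes = [b for b in range(num_branches) if b != hub]
    # Assumption: hub teaches every spoke and every spoke teaches the hub.
    return [(hub, s) for s in spokes] + [(s, hub) for s in spokes]


branches = ["text", "image", "video"]  # text branch as hub
edges = hub_and_spoke_edges(len(branches), hub=0)
named = [(branches[t], branches[s]) for t, s in edges]
```

With three branches this gives 4 directed pairs instead of 6, and image and video never distill into each other directly. In general the count grows as $2(K-1)$ rather than $K(K-1)$.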

## Empirical Validation / Results
Experiments were conducted using **Qwen3-VL-4B-Instruct** as the base model, evaluating on text (e.g., AIME, MATH-500), image (e.g., MMMU, MathVista), and video (e.g., MVBench, VideoMathQA) reasoning benchmarks.

### Main Results: Two-Branch (Text & Image)
**Table 1: Performance on Image and Text Reasoning Benchmarks**
| Benchmark | Base | Image-Expert | Text-Expert | Mixed RLVR | OPD (V→T) | OPD (T→V) | **CoPD** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Image Reasoning Avg.** | 54.00 | 55.76 | 54.88 | 55.69† | 55.99 | 56.44 | **56.97** |
| **Text Reasoning Avg.** | 55.78 | 55.51 | 57.89 | 55.48† | 56.23 | 56.09 | **58.76** |
| **Overall Avg.** | 54.74 | 55.65 | 56.13 | 55.60† | 56.09 | 56.29 | **57.71** |
*Note: V→T = Image expert teaches Text branch; T→V = Text expert teaches Image branch. † marks worst result (excluding Base).*

*   **Mixed RLVR** shows a capability trade-off: its text reasoning average (55.48) falls below both the Text-Expert (57.89) and the base model (55.78).
*   **Static OPD** (both directions) improves over Mixed RLVR but does not fully transfer the teacher's capability: its text averages (56.23 and 56.09) stay well below the Text-Expert's 57.89.
*   **CoPD** achieves the best overall performance, **surpassing both domain-specific experts simultaneously**.
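
The margin by which CoPD exceeds the better expert in each domain follows from simple arithmetic on Table 1. The snippet below only re-derives differences from the reported averages; the dictionary layout is an illustrative choice.

```python
# Averages copied from Table 1 (Image-Expert, Text-Expert, and CoPD columns).
table1 = {
    "image": {"Image-Expert": 55.76, "Text-Expert": 54.88, "CoPD": 56.97},
    "text": {"Image-Expert": 55.51, "Text-Expert": 57.89, "CoPD": 58.76},
}

margins = {}
for domain, row in table1.items():
    # Compare CoPD with whichever expert is stronger in this domain.
    best_expert = max(row["Image-Expert"], row["Text-Expert"])
    margins[domain] = round(row["CoPD"] - best_expert, 2)

print(margins)  # {'image': 1.21, 'text': 0.87}
```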

### Main Results: Three-Branch (Text, Image & Video)
**Table 2: Performance on Image, Text, and Video Reasoning Benchmarks**
| Benchmark | Base | Image-Exp. | Text-Exp. | Video-Exp. | Mixed RLVR | MOPD | **CoPD** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Image Avg.** | 54.00 | 55.76 | 54.88 | 54.71† | 56.17 | 56.37 | **57.12** |
| **Text Avg.** | 55.78 | 55.51 | 57.89 | 56.84 | 55.39† | 56.80 | **58.63** |
| **Video Avg.** | 56.22 | 58.27 | 55.54† | 58.75 | **59.62** | 58.32 | 59.21 |
| **Overall Avg.** | 55.11 | 56.31 | 55.98† | 56.39 | 56.79 | 56.99 | **58.12** |

*   **CoPD** scales effectively: it achieves the best overall, image, and text averages and improves over **Multi-teacher OPD (MOPD)** on all three capability groups, although Mixed RLVR keeps the highest video average (59.62 vs. 59.21).
*   **MOPD** underperforms the Video-Expert on video reasoning (58.32 vs. 58.75), suggesting that static multi-teacher distillation struggles as branches are added.
*   **Mixed RLVR** again shows trade-offs (high video, low text).

### Analysis and Ablations
**Table 3: Ablation Study on Two-Branch Setting**
| Method | Image Reasoning Avg. | Text Reasoning Avg. | Overall Avg. |
| :--- | :---: | :---: | :---: |
| CoPD (Full) | **56.97** | **58.76** | **57.71** |
| w/o I-OPD (No distillation from Image) | 56.78 | 57.41 | 57.04 |
| w/o T-OPD (No distillation from Text) | 56.48 | 57.78 | 57.02 |
| Text-Branch Only (No merge) | 56.26 | 58.61 | 57.24 |
| Image-Branch Only (No merge) | 56.78 | 57.17 | 56.94 |

*   **Bidirectional distillation is necessary:** Removing OPD in either direction degrades performance.
*   **Co-evolution alone is powerful:** Even without merging, each single branch outperforms the static OPD baselines in overall average (57.24 and 56.94 vs. 56.09 and 56.29 in Table 1).
*   **Merging consolidates strengths:** The merged model achieves the best overall result.

**Training Dynamics & Design Analysis:**
*   **Behavioral Consistency:** CoPD maintains top-$k$ overlap >0.9 and low symmetric KL between branches throughout training, while the static pipeline shows monotonic divergence (Figures 4a, 4b).
*   **Phase Ratio:** An exploration-to-consolidation ratio of $S_{\text{RL}} : S_{\text{OPD}} = 1.5:1$ yields the best performance, balancing sufficient specialization with effective alignment (Figure 4c).
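
The alternating rhythm controlled by $S_{\text{RL}}$ and $S_{\text{OPD}}$ can be sketched as a step schedule. The step counts below (30 and 20 per cycle) are hypothetical values chosen only to realize the 1.5:1 ratio; the summary does not report the actual counts.

```python
def copd_schedule(num_cycles, s_rl, s_opd):
    """Yield (cycle, phase) for every training step of the alternating loop."""
    for n in range(1, num_cycles + 1):
        for _ in range(s_rl):
            yield n, "RLVR"  # Phase I: branch-specific exploration
        for _ in range(s_opd):
            yield n, "OPD"   # Phase II: cross-branch consolidation


# Hypothetical step counts that realize S_RL : S_OPD = 1.5 : 1.
steps = list(copd_schedule(num_cycles=2, s_rl=30, s_opd=20))
rl_steps = sum(1 for _, phase in steps if phase == "RLVR")
opd_steps = sum(1 for _, phase in steps if phase == "OPD")
```

Over two cycles this yields 60 RLVR steps and 40 OPD steps, with each cycle starting in the RLVR phase.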

## Theoretical and Practical Implications
*   **Theoretical:** The paper provides a formal framework analyzing the loss mechanisms in existing consolidation paradigms (divergence cost vs. absorption inefficiency). It establishes **behavioral overlap** as a key measurable indicator for effective distillation.
*   **Practical:** CoPD offers a **scalable training paradigm** that successfully unifies multiple advanced capabilities (text, image, video) into a single model that outperforms specialists. It turns the typical capability trade-off into a synergistic gain.
*   **Paradigm Shift:** The method suggests moving from sequential expert training + distillation to **parallel co-evolution**, which could inspire new scaling laws and training strategies for developing generalist models.

## Conclusion
Co-Evolving Policy Distillation (CoPD) addresses fundamental limitations in consolidating multiple expert capabilities. By **interleaving branch-specific RLVR with cross-branch mutual OPD**, it ensures experts co-evolve, maintaining behavioral similarity for effective knowledge transfer while accumulating complementary knowledge. Empirical results demonstrate that CoPD achieves state-of-the-art "all-in-one" consolidation, surpassing strong baselines and even domain-specific experts. This work, part of the "Self-Taught RLVR" series, explores the **parallel self** and suggests that the parallel co-evolution of model branches is a promising scaling paradigm for broadening model capabilities.
