# DOPD: Dual On-policy Distillation

> DOPD introduces an advantage-aware dual distillation that dynamically routes token supervision to prevent privilege illusion, achieving a 7.5-point gain over vanilla OPD.

- **Source:** [arXiv](https://arxiv.org/abs/2606.30626)
- **Published:** 2026-07-02
- **Permalink:** https://picx.dev/p/1c5HeT
- **Whiteboard:** https://picx.dev/p/1c5HeT/image

## Summary (Overview)

- **Identifies “privilege illusion”**: Injecting privileged information (e.g., reasoning hints, bounding boxes) into on‑policy distillation (OPD) can create an apparent teacher–student performance gap that conflates **transferable capability** with **unlearnable information asymmetry**, leading to entropy collapse and degraded distillation.
- **Proposes privilege advantage gap**: A token‑level metric $A = \lvert \log \Pi_T(y_n \mid x,p,y_{<n}) - \log \Pi_S(y_n \mid x,p,y_{<n}) \rvert$ that distinguishes tokens dominated by genuine capability gaps from those reflecting privileged shortcuts.
- **Introduces DOPD**: An advantage‑aware dual distillation framework that dynamically routes token‑level supervision **between a privileged teacher and a privileged student** according to the advantage gap and relative probabilities. Four distinct regimes are defined, each receiving tailored distillation (strong full‑vocabulary JS, light Top‑K reverse KL, weak self‑regularization).
- **Achieves state‑of‑the‑art results**: On LLM (Qwen3‑8B → Qwen3‑1.7B) and VLM (Qwen3‑VL‑8B → Qwen3‑VL‑2B) setups, DOPD outperforms Vanilla OPD by **7.5 points** (LLM) and **6.0 points** (VLM) on average across eight benchmarks. It also surpasses all standard, self‑distillation, and adaptive counterparts.
- **Demonstrates strong robustness and scalability**: DOPD maintains consistent gains across five teacher‑student model pairs (size ratio up to 13.3×), improves continual learning and out‑of‑distribution generalization, and exhibits stable entropy and performance trajectories.

---

## Introduction and Theoretical Foundation

**Background.** On‑policy distillation (OPD) transfers knowledge from a stronger teacher to a weaker student by supervising student‑sampled trajectories with dense token‑level signals. This mitigates distribution shift and yields higher efficiency than off‑policy distillation.

**Privilege illusion.** To push the performance frontier, many works equip the teacher (or student) with **privileged information** – e.g., verified reasoning hints for LLMs or bounding‑box annotations for VLMs. However, the paper identifies a failure mode: the apparent performance advantage from privileged inputs may stem from **information asymmetry** rather than genuine capability. Indiscriminately distilling such signals causes the student to mimic privileged outcomes instead of acquiring transferable ability, leading to entropy collapse and poor distillation.

**Non‑uniform token supervision.** Only a small subset of tokens carries pivotal capability‑bearing signals (e.g., decisive reasoning steps). Uniformly distilling all tokens from a monolithic source amplifies privilege illusion by forcing the student to fit easy‑to‑mimic privileged shortcuts.

**Key takeaway.** The paper motivates the need to disentangle the teacher‑student capability gap from the information‑asymmetry gap. The **privilege advantage gap** (Equation 1) serves as a proxy to identify tokens where the teacher’s advantage reflects true competence rather than privileged shortcuts.

---

## Methodology

### 1. Privilege Advantage Gap
For a given input $x$, privileged input $p$, and partial sequence $y_{<n}$, define:
\[
A = \lvert \log \Pi_T(y_n \mid x,p,y_{<n}) - \log \Pi_S(y_n \mid x,p,y_{<n}) \rvert = \left\lvert \log \frac{\Pi_T(y_n \mid x,p,y_{<n})}{\Pi_S(y_n \mid x,p,y_{<n})} \right\rvert
\tag{1}
\]
A **large** $A$ indicates a genuine capability discrepancy under identical privileged conditions; a **small** $A$ suggests the teacher’s advantage is mainly due to privileged information.
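
As a minimal sketch of Equation (1), the gap can be computed per token from the log-probabilities that the privileged teacher and privileged student assign to the sampled token. The function name `privilege_advantage_gap` and the example probabilities are illustrative assumptions, not values from the paper.

```python
import math

def privilege_advantage_gap(logp_teacher, logp_student):
    """Per-token |log pi_T(y_n) - log pi_S(y_n)| for one sampled trajectory."""
    if len(logp_teacher) != len(logp_student):
        raise ValueError("log-prob sequences must have the same length")
    return [abs(lt - ls) for lt, ls in zip(logp_teacher, logp_student)]

# Hypothetical probabilities of the sampled tokens under each privileged policy.
teacher_probs = [0.90, 0.50, 0.80]
student_probs = [0.85, 0.05, 0.80]

gaps = privilege_advantage_gap(
    [math.log(p) for p in teacher_probs],
    [math.log(p) for p in student_probs],
)
print([round(g, 3) for g in gaps])
```

Here the second token has the largest gap (the teacher is ten times more likely to emit it), so under the interpretation above it would be the strongest candidate for a genuine capability discrepancy.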

### 2. Divergence Objectives
Three common divergences are considered:

- **Forward KL**: $\text{KL}_{\text{forward}}(\Pi_T \parallel \Pi_S) = \mathbb{E}_{y \sim \Pi_T}[\log \frac{\Pi_T}{\Pi_S}]$ – support‑covering.
- **Reverse KL**: $\text{KL}_{\text{reverse}}(\Pi_S \parallel \Pi_T) = \mathbb{E}_{y \sim \Pi_S}[\log \frac{\Pi_S}{\Pi_T}]$ – mode‑seeking.
- **JS Divergence**: $\text{JS} = \frac12 \text{KL}(\Pi_T \parallel \Pi_M) + \frac12 \text{KL}(\Pi_S \parallel \Pi_M)$, where $\Pi_M = \frac12\Pi_T + \frac12\Pi_S$ – balanced.
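
The three divergences can be sketched for discrete next-token distributions as follows. This is an illustrative pure-Python version over small probability lists; the helper names and the `eps` clamp are assumptions made for numerical safety, not part of the paper.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def forward_kl(teacher, student):
    # Expectation under the teacher: support-covering.
    return kl(teacher, student)

def reverse_kl(teacher, student):
    # Expectation under the student: mode-seeking.
    return kl(student, teacher)

def js(teacher, student):
    # Symmetric divergence against the equal-weight mixture.
    mix = [0.5 * t + 0.5 * s for t, s in zip(teacher, student)]
    return 0.5 * kl(teacher, mix) + 0.5 * kl(student, mix)

# Hypothetical three-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]
print(forward_kl(teacher, student), reverse_kl(teacher, student), js(teacher, student))
```

Unlike the two KL directions, JS is symmetric in its arguments and bounded above by $\log 2$, which is one reason it is often described as the balanced choice.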

### 3. Vanilla OPD Objective
\[
\mathbb{E}_{x\sim D}\!\left[ \mathbb{E}_{y\sim\Pi_S}\!\left[ \frac{1}{|y|}\sum_{n=1}^{|y|} \mathcal{L}_n(y_n; t_{<n}) \right] \right]
\tag{5}
\]
Here $t$ includes inputs, previous tokens, and any auxiliary context; $\mathcal{L}$ is a token‑level divergence (e.g., reverse KL). Most OPD variants minimize this objective for all tokens uniformly.
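
A minimal sketch of Eq. (5), assuming reverse KL as the token-level divergence and a batch of already-sampled student trajectories: the loss is averaged over tokens within each trajectory, then over trajectories. The data layout (lists of `(student_dist, teacher_dist)` pairs) is a hypothetical simplification; a real implementation would operate on model logits.

```python
import math

def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) over one token's vocabulary distribution."""
    return sum(s * math.log(s / max(t, eps)) for s, t in zip(student, teacher) if s > 0.0)

def vanilla_opd_loss(batch):
    """batch: list of trajectories; each trajectory is a list of
    (student_dist, teacher_dist) pairs, one per generated token."""
    per_trajectory = []
    for trajectory in batch:
        token_losses = [reverse_kl(s, t) for s, t in trajectory]
        per_trajectory.append(sum(token_losses) / len(token_losses))  # 1/|y| * sum_n
    return sum(per_trajectory) / len(per_trajectory)  # empirical outer expectation

# Two hypothetical trajectories of different lengths.
batch = [
    [([0.5, 0.5], [0.9, 0.1]), ([0.8, 0.2], [0.8, 0.2])],
    [([0.6, 0.4], [0.7, 0.3])],
]
loss = vanilla_opd_loss(batch)
print(round(loss, 4))
```

Every token contributes with the same weight here, which is the uniform treatment that DOPD replaces with regime-specific losses.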

### 4. DOPD: Advantage‑Aware Dual Distillation

**Overview (Figure 5).** For each on‑policy trajectory $y$, the privileged teacher and privileged student policies each perform a forward pass. Their token‑level probabilities $q_S$ and $q_T$ (and log‑probabilities $\ell_S$, $\ell_T$) are computed. The advantage gap $A$ and the sum of probabilities $q_S+q_T$ are averaged (after outlier removal) to obtain thresholds $\bar{A}$ and $\overline{q_S+q_T}$.

**Four token regimes** (each with a different distillation loss):

| Condition | Interpretation | Distillation Loss |
|-----------|----------------|-------------------|
| **Low $A$, High $q_S$ & $q_T$** ($A_n < \bar{A} \land q_S+q_T \geq \overline{q_S+q_T}$) | Teacher and student agree; bottleneck is absence of privileged info, not capability. | **Light teacher distillation**: $\mathcal{L}_{\text{LH}} = \beta_l \,\text{KL}_{\text{reverse}}(\Pi_S \parallel \Pi_T)$ with Top‑K tokens. |
| **Low $A$, Low $q_S$ & $q_T$** ($A_n < \bar{A} \land q_S+q_T < \overline{q_S+q_T}$) | Both policies uncertain; token beyond reliable competence region. | **Weak self‑regularization**: $\mathcal{L}_{\text{LL}} = \beta_w \,\text{KL}_{\text{reverse}}(\Pi_S \parallel \operatorname{sg}[\Pi_S^{\text{priv}}])$ with Top‑K. |
| **High $A$, High $q_T$** ($A_n \geq \bar{A} \land q_T \geq q_S$) | Teacher is confident and shows clear advantage; token carries transferable capability. | **Strong full‑vocabulary teacher distillation**: $\mathcal{L}_{\text{HT}} = \text{JS}(\Pi_S \parallel \Pi_T)$. |
| **High $A$, High $q_S$** ($A_n \geq \bar{A} \land q_T < q_S$) | Student is confident but teacher is not; forcing teacher imitation may suppress exploration. | **Light self‑regularization**: $\mathcal{L}_{\text{HS}} = \beta_l \,\text{KL}_{\text{reverse}}(\Pi_S \parallel \operatorname{sg}[\Pi_S^{\text{priv}}])$ with Top‑K. |
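
The routing conditions in the table can be sketched as a small function. This is an illustration under stated assumptions: the thresholds are plain means over the trajectory (the outlier-removal step is omitted because its details are not given here), and the function name, the regime labels as strings, and the example probabilities are hypothetical.

```python
import math
from statistics import fmean

def route_tokens(q_student, q_teacher):
    """Assign each token to one of the four regimes: LH, LL, HT, HS."""
    adv = [abs(math.log(qt) - math.log(qs)) for qs, qt in zip(q_student, q_teacher)]
    q_sum = [qs + qt for qs, qt in zip(q_student, q_teacher)]
    a_bar = fmean(adv)      # threshold on the advantage gap (no outlier removal here)
    q_bar = fmean(q_sum)    # threshold on q_S + q_T
    regimes = []
    for a, s, qs, qt in zip(adv, q_sum, q_student, q_teacher):
        if a < a_bar:
            regimes.append("LH" if s >= q_bar else "LL")
        else:
            regimes.append("HT" if qt >= qs else "HS")
    return regimes

# Hypothetical probabilities of four sampled tokens under each privileged policy.
q_student = [0.85, 0.05, 0.10, 0.60]
q_teacher = [0.90, 0.50, 0.12, 0.05]
regimes = route_tokens(q_student, q_teacher)
print(regimes)  # -> ['LH', 'HT', 'LL', 'HS']
```

In this toy trajectory each token lands in a different regime: the first is a confident agreement, the second a teacher-led capability gap, the third a shared uncertainty, and the fourth a student-confident disagreement.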

**Total objective:**
\[
\mathcal{L}_{\text{DOPD}} = \mathbb{I}_{\text{LH}}\mathcal{L}_{\text{LH}} + \mathbb{I}_{\text{LL}}\mathcal{L}_{\text{LL}} + \mathbb{I}_{\text{HT}}\mathcal{L}_{\text{HT}} + \mathbb{I}_{\text{HS}}\mathcal{L}_{\text{HS}}
\tag{10}
\]
which is then applied within the outer expectation of Eq. (5).
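
A per-token sketch of Eq. (10) follows. Several details are assumptions made for illustration: Top-K is taken to mean restricting both distributions to the student's K most likely tokens and renormalizing, the stop-gradient $\operatorname{sg}[\cdot]$ is represented simply by treating the privileged-student distribution as a constant, and $\beta_l = 0.6$, $\beta_w = 0.3$ are the coefficients this summary reports as optimal. All function names are hypothetical.

```python
import math

BETA_L, BETA_W = 0.6, 0.3  # light / weak intensity coefficients reported in the summary

def _kl(p, q, eps=1e-12):
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def js(p, q):
    """Full-vocabulary Jensen-Shannon divergence."""
    mix = [0.5 * a + 0.5 * b for a, b in zip(p, q)]
    return 0.5 * _kl(p, mix) + 0.5 * _kl(q, mix)

def topk_reverse_kl(student, target, k):
    """Reverse KL on the student's top-k tokens, renormalized (an assumption)."""
    idx = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    s_mass = sum(student[i] for i in idx)
    t_mass = max(sum(target[i] for i in idx), 1e-12)
    s = [student[i] / s_mass for i in idx]
    t = [target[i] / t_mass for i in idx]
    return _kl(s, t)

def dopd_token_loss(regime, student, teacher_priv, student_priv, k=2):
    """Exactly one indicator in Eq. (10) is active per token."""
    if regime == "LH":   # light teacher distillation
        return BETA_L * topk_reverse_kl(student, teacher_priv, k)
    if regime == "LL":   # weak self-regularization
        return BETA_W * topk_reverse_kl(student, student_priv, k)
    if regime == "HT":   # strong full-vocabulary teacher distillation
        return js(student, teacher_priv)
    if regime == "HS":   # light self-regularization
        return BETA_L * topk_reverse_kl(student, student_priv, k)
    raise ValueError(f"unknown regime: {regime}")

# Hypothetical three-token vocabulary distributions.
student = [0.5, 0.3, 0.2]
teacher_priv = [0.7, 0.2, 0.1]
student_priv = [0.6, 0.3, 0.1]
losses = {r: dopd_token_loss(r, student, teacher_priv, student_priv)
          for r in ("LH", "LL", "HT", "HS")}
print({r: round(v, 4) for r, v in losses.items()})
```

Because the four regimes partition the tokens, summing the indicator-weighted terms is equivalent to dispatching each token to a single loss, which is what the function does.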

---

## Empirical Validation / Results

### Main Results

**LLM‑based OPD (Qwen3‑8B → Qwen3‑1.7B)** – average across 8 benchmarks:

| Method | Average | C‑Eval | LiveBench | MATH500 | AIME25 | ZebraLogic | AutoLogi | BFCLv3 | LCBv5 |
|--------|---------|--------|-----------|---------|--------|------------|----------|--------|-------|
| Teacher | 52.8 | 77.1 | 53.5 | 86.9 | 20.2 | 25.0 | 76.3 | 60.0 | 23.6 |
| Student | 39.1 | 60.4 | 35.4 | 72.7 | 9.5 | 12.1 | 59.8 | 51.9 | 11.3 |
| Vanilla OPD | 43.9 | 65.2 | 40.9 | 75.6 | 16.7 | 15.8 | 64.3 | 55.4 | 17.6 |
| ExOPD | 47.0 | 68.3 | 44.7 | 76.7 | 18.5 | 19.9 | 68.0 | 57.2 | 22.6 |
| Uni‑OPD | 46.6 | 66.5 | 42.3 | 77.5 | 20.0 | 22.3 | 67.2 | 56.1 | 20.8 |
| EOPD | 46.1 | 67.5 | 45.7 | 75.9 | 17.6 | 19.3 | 67.1 | 56.8 | 19.0 |
| **DOPD (Ours)** | **51.4** | **71.3** | **49.8** | **81.5** | **23.3** | **26.9** | **71.0** | **60.2** | **27.1** |

> DOPD recovers **89.8%** of the teacher‑student gap and surpasses the teacher on four benchmarks (AIME25, ZebraLogic, BFCLv3, LCBv5).

**VLM‑based OPD (Qwen3‑VL‑8B → Qwen3‑VL‑2B)** – average across 8 benchmarks:

| Method | Average | RealWorldQA | MMStar | MathVision | DynaMath | LogicVista | MMMU | MMMU‑Pro | VSI‑Bench |
|--------|---------|-------------|--------|------------|----------|------------|------|----------|-----------|
| Teacher | 62.9 | 71.3 | 70.7 | 53.8 | 67.6 | 55.0 | 69.6 | 55.8 | 59.7 |
| Student | 48.3 | 63.6 | 58.4 | 32.0 | 53.8 | 35.5 | 53.2 | 36.4 | 53.6 |
| Vanilla OPD | 52.4 | 64.7 | 61.8 | 37.1 | 56.2 | 40.2 | 58.0 | 46.7 | 54.1 |
| Uni‑OPD | 54.2 | 65.0 | 65.3 | 43.0 | 58.2 | 42.5 | 59.1 | 47.0 | 53.7 |
| Vision‑OPD | 55.6 | 66.2 | 66.4 | 38.0 | 57.6 | 43.1 | 64.9 | 52.3 | 56.1 |
| VA‑OPD * | 56.3 | 67.0 | 66.2 | 38.7 | 57.7 | 43.1 | 66.1 | **54.2** | 57.5 |
| **DOPD (Ours)** | **58.4** | **67.4** | **67.2** | **45.6** | **60.5** | **47.7** | **67.0** | 53.9 | **57.8** |

> DOPD recovers **69.2%** of the teacher‑student gap.
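
The gap-recovery percentages quoted for the LLM and VLM tables are consistent with the fraction of the teacher-student average gap closed by the distilled student. The formula below is inferred from the reported averages rather than quoted from the paper, and the function name is illustrative.

```python
def gap_recovered(teacher, student, distilled):
    """Percentage of the teacher-student gap closed by the distilled student."""
    return 100.0 * (distilled - student) / (teacher - student)

# Average scores taken from the two result tables (Teacher, Student, DOPD).
llm = gap_recovered(52.8, 39.1, 51.4)
vlm = gap_recovered(62.9, 48.3, 58.4)
print(f"LLM: {llm:.1f}%  VLM: {vlm:.1f}%")  # -> LLM: 89.8%  VLM: 69.2%
```

The same function applied to the Vanilla OPD averages (43.9 and 52.4) gives a sense of how much of the gap uniform distillation leaves open.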

### Robustness & Scalability (Table 3)

Experiments across five teacher‑student pairs (Qwen3‑8B/4B/1.7B → Qwen3‑0.6B/1.7B) show DOPD **consistently outperforms Vanilla OPD**, lifting the student by **11.1–14.1** points on average and closing **53.0–92.2%** of the teacher‑student gap, versus 13.2–42.6% for Vanilla OPD. Gains are largest when the teacher‑student size ratio is large (e.g., 8B→0.6B: +14.1 for DOPD vs. +3.5 for Vanilla OPD).

### Continual Learning & Out‑of‑Distribution (Figure 7)

- **Continual learning**: A three‑stage curriculum (general → reasoning → coding) shows DOPD accumulates capabilities with minimal forgetting, outperforming Vanilla OPD in both stability and final performance.
- **OOD evaluation**: Training on coding/reasoning and testing on the other domain yields **3.1–4.3 points** improvement over the best baselines (ExOPD, Uni‑OPD, EOPD).

### Training Stability (Figure 8)

DOPD achieves **superior performance by step‑80** (compared to step‑200 for baselines) and maintains healthy entropy – rising initially, then gradually decreasing to a steady state. In contrast, self‑distillation baselines suffer entropy collapse around step‑95.
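
Entropy trajectories of this kind are typically tracked as the mean Shannon entropy of the student's next-token distributions at each training step. The sketch below shows that measurement only; the function names and the example distributions are hypothetical, and it does not reproduce the paper's logging setup.

```python
import math

def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def mean_entropy(dists):
    """Average entropy over all token positions sampled at one training step."""
    return sum(token_entropy(d) for d in dists) / len(dists)

# Hypothetical distributions: a policy that still explores vs. a collapsed one.
healthy = [[0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
collapsed = [[0.997, 0.001, 0.001, 0.001], [1.0, 0.0, 0.0, 0.0]]
print(round(mean_entropy(healthy), 3), round(mean_entropy(collapsed), 3))
```

A value that falls toward zero early in training, as in the `collapsed` example, is the signature of entropy collapse described above.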

### Token & Divergence Analyses (Table 6, 7; Figure 9)

- **Token ablation**: Using only high‑advantage teacher‑confident tokens yields a 4.6‑point gain over uniform distillation; the full adaptive mechanism adds another 8+ points.
- **Divergence study**: JS divergence with full vocabulary performs best overall, but Top‑K strategies offer better computational efficiency.
- **Privileged information analysis**: Step‑wise hints without execution (LLM) and bounding boxes with object labels (VLM) are most effective; directly providing final answers hurts performance.

### Ablation & Sensitivity (Table 8, Figure 10)

- Removing privileged input or distillation from either policy degrades performance.
- Advantage‑aware routing, adaptive divergence objectives, and adaptive strategies each contribute significantly.
- Optimal intensity coefficients: $\beta_w = 0.3$, $\beta_l = 0.6$.

---

## Theoretical and Practical Implications

- **Theoretical**: DOPD introduces the first principled framework to **disentangle capability transfer from privileged‑information mimicry** in OPD. The privilege advantage gap provides a measurable quantity to identify genuinely transferable knowledge, and the dual‑source routing paradigm extends existing teacher‑only or self‑only distillation theories.
- **Practical**: DOPD offers a **drop‑in replacement** for standard OPD that yields substantial gains without architectural changes. It is especially beneficial when:
  - Teacher‑student size ratios are large (scalable distillation).
  - Reliable privileged information (e.g., automated reasoning hints, weak visual annotations) is available.
  - Continual learning or out‑of‑distribution generalization is desired.
- The method is **model‑agnostic** and has been validated on both LLMs and VLMs; it is expected to transfer to other domains (e.g., multimodal, code generation).

---

## Conclusion

DOPD addresses the fundamental limitation of on‑policy distillation under privileged contexts by identifying and mitigating **privilege illusion**. By leveraging the privilege advantage gap to route token‑level supervision dynamically between a privileged teacher and a privileged student, DOPD achieves:

- Superior performance across LLM and VLM benchmarks (+7.5 points on LLM and +6.0 points on VLM over Vanilla OPD).
- Excellent robustness, scalability, continual learning, and out‑of‑distribution generalization.
- Stable and efficient training with healthy entropy dynamics.

**Future directions** include:
- More cost‑effective methods for obtaining privileged information.
- Learnable or principled routing mechanisms (e.g., neural advantage estimators).
- Extension to other sequence‑generation tasks beyond LLMs/VLMs for interpretable and efficient selective capacity transfer.

