Visual Summary | DOPD: Dual On-policy Distillation

Summary (Overview)

Identifies “privilege illusion”: Injecting privileged information (e.g., reasoning hints, bounding boxes) into on‑policy distillation (OPD) can create an apparent teacher–student performance gap that conflates transferable capability with unlearnable information asymmetry, leading to entropy collapse and degraded distillation.
Proposes privilege advantage gap: A token‑level metric $A = \lvert \log \Pi_T(y_n \mid x,p,y_{<n}) - \log \Pi_S(y_n \mid x,p,y_{<n}) \rvert$ that distinguishes tokens dominated by genuine capability gaps from those reflecting privileged shortcuts.
Introduces DOPD: An advantage‑aware dual distillation framework that dynamically routes token‑level supervision between a privileged teacher and a privileged student according to the advantage gap and relative probabilities. Four distinct regimes are defined, each receiving tailored distillation (strong full‑vocabulary JS, light Top‑K reverse KL, weak self‑regularization).
Achieves state‑of‑the‑art results: On LLM (Qwen3‑8B → Qwen3‑1.7B) and VLM (Qwen3‑VL‑8B → Qwen3‑VL‑2B) setups, DOPD outperforms Vanilla OPD by 7.5 points (LLM) and 6.0 points (VLM) on average across eight benchmarks. It also surpasses all standard, self‑distillation, and adaptive counterparts.
Demonstrates strong robustness and scalability: DOPD maintains consistent gains across five teacher‑student model pairs (size ratio up to 13.3×), improves continual learning and out‑of‑distribution generalization, and exhibits stable entropy and performance trajectories.

Introduction and Theoretical Foundation

Background. On‑policy distillation (OPD) transfers knowledge from a stronger teacher to a weaker student by supervising student‑sampled trajectories with dense token‑level signals. This mitigates distribution shift and yields higher efficiency than off‑policy distillation.

Privilege illusion. To push the performance frontier, many works equip the teacher (or student) with privileged information – e.g., verified reasoning hints for LLMs or bounding‑box annotations for VLMs. However, the paper identifies a failure mode: the apparent performance advantage from privileged inputs may stem from information asymmetry rather than genuine capability. Indiscriminately distilling such signals causes the student to mimic privileged outcomes instead of acquiring transferable ability, leading to entropy collapse and poor distillation.

Non‑uniform token supervision. Only a small subset of tokens carries pivotal capability‑bearing signals (e.g., decisive reasoning steps). Uniformly distilling all tokens from a monolithic source amplifies privilege illusion by forcing the student to fit easy‑to‑mimic privileged shortcuts.

Key takeaway. The paper motivates the need to disentangle the teacher‑student capability gap from the information‑asymmetry gap. The privilege advantage gap (Equation 1) serves as a proxy to identify tokens where the teacher’s advantage reflects true competence rather than privileged shortcuts.

Methodology

1. Privilege Advantage Gap

For a given input $x$, privileged input $p$, and partial sequence $y_{<n}$, define:

\[ A = \lvert \log \Pi_T(y_n \mid x,p,y_{<n}) - \log \Pi_S(y_n \mid x,p,y_{<n}) \rvert = \left\lvert \log \frac{\Pi_T(y_n \mid x,p,y_{<n})}{\Pi_S(y_n \mid x,p,y_{<n})} \right\rvert \tag{1} \]

A large $A$ indicates a genuine capability discrepancy under identical privileged conditions; a small $A$ suggests the teacher’s advantage is mainly due to privileged information.
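
The gap in Eq. (1) only needs the log-probability each policy assigns to the sampled token. The following is a minimal sketch, assuming those per-token log-probabilities have already been gathered from both privileged forward passes; the function name and the example probabilities are hypothetical, not taken from the paper.

```python
import math


def privilege_advantage_gap(teacher_logprobs, student_logprobs):
    """Per-token gap |log pi_T(y_n | x, p, y_<n) - log pi_S(y_n | x, p, y_<n)|."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("both policies must score the same sampled tokens")
    return [abs(lt - ls) for lt, ls in zip(teacher_logprobs, student_logprobs)]


# Hypothetical probabilities that each privileged policy assigns to three sampled tokens.
teacher_lp = [math.log(0.9), math.log(0.5), math.log(0.2)]
student_lp = [math.log(0.3), math.log(0.5), math.log(0.4)]

gaps = privilege_advantage_gap(teacher_lp, student_lp)
for n, gap in enumerate(gaps, start=1):
    print(f"token {n}: A = {gap:.3f}")
```

Because of the absolute value, the gap is large both when the teacher is far more confident than the student (token 1) and when the student is more confident than the teacher (token 3); the sign of the difference is resolved later by comparing $q_T$ and $q_S$.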

2. Divergence Objectives

Three common divergences are considered:

Forward KL: $\text{KL}_{\text{forward}}(\Pi_T \parallel \Pi_S) = \mathbb{E}_{y \sim \Pi_T}[\log \frac{\Pi_T}{\Pi_S}]$ – support‑covering.
Reverse KL: $\text{KL}_{\text{reverse}}(\Pi_S \parallel \Pi_T) = \mathbb{E}_{y \sim \Pi_S}[\log \frac{\Pi_S}{\Pi_T}]$ – mode‑seeking.
JS Divergence: $\text{JS} = \frac12 \text{KL}(\Pi_T \parallel \Pi_M) + \frac12 \text{KL}(\Pi_S \parallel \Pi_M)$, where $\Pi_M = \frac12\Pi_T + \frac12\Pi_S$ – balanced.
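
To make the three divergences concrete, here is a small sketch over a toy three-token vocabulary. The distributions are made-up illustrative values, and the helper names `kl` and `js` are assumptions of this example.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js(p, q):
    """Jensen-Shannon divergence with the equal-weight mixture M = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


teacher = [0.7, 0.2, 0.1]  # toy next-token distribution of the teacher
student = [0.4, 0.4, 0.2]  # toy next-token distribution of the student

forward_kl = kl(teacher, student)  # expectation under the teacher
reverse_kl = kl(student, teacher)  # expectation under the student
js_div = js(teacher, student)      # symmetric and bounded by log(2)

print(f"forward KL = {forward_kl:.4f}")
print(f"reverse KL = {reverse_kl:.4f}")
print(f"JS         = {js_div:.4f}")
```

The two KL directions generally give different values for the same pair of distributions, whereas JS is symmetric in its arguments, which is one way to read the "balanced" label above.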

3. Vanilla OPD Objective

\[ \mathbb{E}_{x\sim D}\!\left[ \mathbb{E}_{y\sim\Pi_S}\!\left[ \frac{1}{|y|}\sum_{n=1}^{|y|} \mathcal{L}_n(y_n; t_{<n}) \right] \right] \tag{5} \]

Here $t_{<n}$ includes inputs, previous tokens, and any auxiliary context; $\mathcal{L}_n$ is a token‑level divergence (e.g., reverse KL). Most OPD variants minimize this objective for all tokens uniformly.
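
The structure of Eq. (5), a length-normalized sum of token losses averaged over sampled trajectories, can be sketched as follows. This is an illustration only: it assumes reverse KL as the token loss and that full next-token distributions are available at every position of each student-sampled trajectory. All names and numbers are hypothetical.

```python
import math


def reverse_kl(student, teacher):
    """Token-level reverse KL(student || teacher); assumes teacher > 0 wherever student > 0."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0.0)


def trajectory_loss(student_dists, teacher_dists, token_loss=reverse_kl):
    """(1 / |y|) * sum over positions n of the token-level loss."""
    losses = [token_loss(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(losses) / len(losses)


def vanilla_opd_objective(batch):
    """Monte Carlo estimate of Eq. (5): mean trajectory loss over sampled trajectories."""
    return sum(trajectory_loss(s, t) for s, t in batch) / len(batch)


# Each entry: (student distributions per position, teacher distributions per position).
traj_a = ([[0.5, 0.5], [0.9, 0.1]], [[0.5, 0.5], [0.9, 0.1]])  # policies agree everywhere
traj_b = ([[0.6, 0.4]], [[0.8, 0.2]])                          # policies disagree

objective = vanilla_opd_objective([traj_a, traj_b])
print(f"estimated objective = {objective:.4f}")
```

Every position contributes with the same weight here, which is the uniform treatment that the token-level routing described next is meant to replace.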

4. DOPD: Advantage‑Aware Dual Distillation

Overview (Figure 5). For each on‑policy trajectory $y$, the privileged teacher and the privileged student each perform a forward pass, yielding token‑level probabilities $q_T$ and $q_S$ (and log‑probabilities $\ell_T$, $\ell_S$). The advantage gap $A$ and the probability sum $q_S+q_T$ are averaged (after outlier removal) to obtain the thresholds $\bar{A}$ and $\overline{q_S+q_T}$.
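
The summary does not say how outliers are removed before averaging. The sketch below assumes a symmetric trimmed mean (dropping a fixed fraction of the smallest and largest values) purely for illustration; `trimmed_mean`, `routing_thresholds`, and the `trim_fraction` parameter are hypothetical names, not the paper's procedure.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping trim_fraction of the values from each tail (assumed outlier rule)."""
    if not values:
        raise ValueError("need at least one value")
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)


def routing_thresholds(gaps, q_student, q_teacher, trim_fraction=0.1):
    """Return (A_bar, mean of q_S + q_T) for one trajectory."""
    confidence = [qs + qt for qs, qt in zip(q_student, q_teacher)]
    return trimmed_mean(gaps, trim_fraction), trimmed_mean(confidence, trim_fraction)


# Hypothetical per-token values for a ten-token trajectory; the last gap is an outlier.
gaps = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5, 9.0]
q_student = [0.5] * 10
q_teacher = [0.4] * 10

gap_bar, conf_bar = routing_thresholds(gaps, q_student, q_teacher)
print(f"A_bar = {gap_bar:.3f}, confidence threshold = {conf_bar:.3f}")
```

Without trimming, the single gap of 9.0 would pull the mean above 1.0 and push almost every token into the low-advantage regimes; this sensitivity is presumably why some form of outlier removal is applied.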

Four token regimes (each with a different distillation loss):

Condition	Interpretation	Distillation Loss
Low $A$, high $q_S$ & $q_T$ ($A_n < \bar{A} \land q_S+q_T \geq \overline{q_S+q_T}$)	Teacher and student agree; bottleneck is absence of privileged info, not capability.	Light teacher distillation: $\mathcal{L}_{\text{LH}} = \beta_l \,\text{KL}_{\text{reverse}}(\Pi_S \parallel \Pi_T)$ with Top‑K tokens.
Low $A$, low $q_S$ & $q_T$ ($A_n < \bar{A} \land q_S+q_T < \overline{q_S+q_T}$)	Both policies uncertain; token beyond reliable competence region.	Weak self‑regularization: $\mathcal{L}_{\text{LL}} = \beta_w \,\text{KL}_{\text{reverse}}(\Pi_S \parallel \operatorname{sg}[\Pi_S^{\text{priv}}])$ with Top‑K tokens.
High $A$, high $q_T$ ($A_n \geq \bar{A} \land q_T \geq q_S$)	Teacher is confident and shows clear advantage; token carries transferable capability.	Strong full‑vocabulary teacher distillation: $\mathcal{L}_{\text{HT}} = \text{JS}(\Pi_S \parallel \Pi_T)$.
High $A$, high $q_S$ ($A_n \geq \bar{A} \land q_T < q_S$)	Student is confident but teacher is not; forcing teacher imitation may suppress exploration.	Light self‑regularization: $\mathcal{L}_{\text{HS}} = \beta_l \,\text{KL}_{\text{reverse}}(\Pi_S \parallel \operatorname{sg}[\Pi_S^{\text{priv}}])$ with Top‑K tokens.
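
The four conditions in the table amount to a two-level decision per token. A minimal sketch of that routing, using the regime labels LH, LL, HT, and HS from the loss subscripts, might look like this; the function name and the example values are hypothetical.

```python
def route_token(gap, q_s, q_t, gap_bar, conf_bar):
    """Assign one token to a regime following the table's conditions.

    gap      -- privilege advantage gap A_n for this token
    q_s, q_t -- probabilities the privileged student / teacher give the sampled token
    gap_bar  -- threshold on A; conf_bar -- threshold on q_S + q_T
    """
    if gap < gap_bar:
        # Low advantage: split by joint confidence.
        return "LH" if q_s + q_t >= conf_bar else "LL"
    # High advantage: split by which policy is more confident.
    return "HT" if q_t >= q_s else "HS"


# Hypothetical thresholds and tokens, one per regime.
GAP_BAR, CONF_BAR = 0.35, 0.9
examples = [
    (0.10, 0.50, 0.60),  # low gap, both confident
    (0.10, 0.10, 0.20),  # low gap, both uncertain
    (1.20, 0.20, 0.80),  # high gap, teacher confident
    (1.20, 0.80, 0.20),  # high gap, student confident
]
regimes = [route_token(a, qs, qt, GAP_BAR, CONF_BAR) for a, qs, qt in examples]
print(regimes)
```

The regimes are mutually exclusive and exhaustive, so every token receives exactly one of the four losses.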

Total objective:

\[ \mathcal{L}_{\text{DOPD}} = \mathbb{I}_{\text{LH}}\mathcal{L}_{\text{LH}} + \mathbb{I}_{\text{LL}}\mathcal{L}_{\text{LL}} + \mathbb{I}_{\text{HT}}\mathcal{L}_{\text{HT}} + \mathbb{I}_{\text{HS}}\mathcal{L}_{\text{HS}} \tag{10} \]

which is then applied within the outer expectation of Eq. (5).
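
A per-token version of Eq. (10) can be sketched as below. Several details are assumptions of this example rather than statements from the summary: the Top‑K set is taken to be the student's K most probable tokens, both distributions are renormalized over that set, and the stop-gradient $\operatorname{sg}[\cdot]$ is a no-op because plain Python floats carry no gradients. The coefficients use the values reported as optimal in the sensitivity study.

```python
import math

BETA_L = 0.6  # light intensity coefficient (reported optimum)
BETA_W = 0.3  # weak intensity coefficient (reported optimum)


def kl(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js(p, q):
    """Full-vocabulary Jensen-Shannon divergence."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def topk_reverse_kl(student, target, k):
    """Reverse KL over the student's top-k tokens, renormalized (assumed Top-K scheme)."""
    top = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    s_mass = sum(student[i] for i in top)
    t_mass = sum(target[i] for i in top)
    return kl([student[i] / s_mass for i in top], [target[i] / t_mass for i in top])


def dopd_token_loss(regime, student, teacher, student_priv, k=2):
    """Select the loss for one token; exactly one indicator in Eq. (10) is active."""
    if regime == "LH":
        return BETA_L * topk_reverse_kl(student, teacher, k)
    if regime == "LL":
        return BETA_W * topk_reverse_kl(student, student_priv, k)
    if regime == "HT":
        return js(student, teacher)
    if regime == "HS":
        return BETA_L * topk_reverse_kl(student, student_priv, k)
    raise ValueError(f"unknown regime: {regime}")


# Toy distributions over a three-token vocabulary (illustrative values only).
student = [0.5, 0.3, 0.2]
teacher = [0.7, 0.2, 0.1]
student_priv = [0.6, 0.3, 0.1]

losses = {r: dopd_token_loss(r, student, teacher, student_priv)
          for r in ("LH", "LL", "HT", "HS")}
for regime, value in losses.items():
    print(f"{regime}: {value:.5f}")
```

In a real training loop the same selection would be applied to logits inside an autograd framework, with the privileged-student target detached so that gradients flow only through the student being trained.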

Empirical Validation / Results

Main Results

LLM‑based OPD (Qwen3‑8B → Qwen3‑1.7B) – average across 8 benchmarks:

Method	Average	C‑Eval	LiveBench	MATH500	AIME25	ZebraLogic	AutoLogi	BFCLv3	LCBv5
Teacher	52.8	77.1	53.5	86.9	20.2	25.0	76.3	60.0	23.6
Student	39.1	60.4	35.4	72.7	9.5	12.1	59.8	51.9	11.3
Vanilla OPD	43.9	65.2	40.9	75.6	16.7	15.8	64.3	55.4	17.6
ExOPD	47.0	68.3	44.7	76.7	18.5	19.9	68.0	57.2	22.6
Uni‑OPD	46.6	66.5	42.3	77.5	20.0	22.3	67.2	56.1	20.8
EOPD	46.1	67.5	45.7	75.9	17.6	19.3	67.1	56.8	19.0
DOPD (Ours)	51.4	71.3	49.8	81.5	23.3	26.9	71.0	60.2	27.1

DOPD recovers 89.8% of the teacher‑student gap and surpasses the teacher on four benchmarks (AIME25, ZebraLogic, BFCLv3, LCBv5).
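
Both claims can be checked against the table above. The sketch assumes that "gap recovery" means (method − student) / (teacher − student) on the average score; that definition is inferred here because it reproduces the reported figure, and is not stated explicitly in this summary.

```python
BENCHMARKS = ["C-Eval", "LiveBench", "MATH500", "AIME25",
              "ZebraLogic", "AutoLogi", "BFCLv3", "LCBv5"]
TEACHER = [77.1, 53.5, 86.9, 20.2, 25.0, 76.3, 60.0, 23.6]  # Teacher row
DOPD = [71.3, 49.8, 81.5, 23.3, 26.9, 71.0, 60.2, 27.1]     # DOPD row


def gap_recovery(student_avg, teacher_avg, method_avg):
    """Percentage of the teacher-student gap closed by a method (assumed definition)."""
    return 100.0 * (method_avg - student_avg) / (teacher_avg - student_avg)


recovered = gap_recovery(student_avg=39.1, teacher_avg=52.8, method_avg=51.4)
beats_teacher = [name for name, t, d in zip(BENCHMARKS, TEACHER, DOPD) if d > t]

print(f"gap recovered: {recovered:.1f}%")
print("surpasses teacher on:", ", ".join(beats_teacher))
```

Under the same definition, Vanilla OPD (average 43.9) recovers about 35% of the gap.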

VLM‑based OPD (Qwen3‑VL‑8B → Qwen3‑VL‑2B) – average across 8 benchmarks:

Method	Average	RealWorldQA	MMStar	MathVision	DynaMath	LogicVista	MMMU	MMMU‑Pro	VSI‑Bench
Teacher	62.9	71.3	70.7	53.8	67.6	55.0	69.6	55.8	59.7
Student	48.3	63.6	58.4	32.0	53.8	35.5	53.2	36.4	53.6
Vanilla OPD	52.4	64.7	61.8	37.1	56.2	40.2	58.0	46.7	54.1
Uni‑OPD	54.2	65.0	65.3	43.0	58.2	42.5	59.1	47.0	53.7
Vision‑OPD	55.6	66.2	66.4	38.0	57.6	43.1	64.9	52.3	56.1
VA‑OPD *	56.3	67.0	66.2	38.7	57.7	43.1	66.1	54.2	57.5
DOPD (Ours)	58.4	67.4	67.2	45.6	60.5	47.7	67.0	53.9	57.8

DOPD recovers 69.2% of the teacher‑student gap.
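
As a worked check on the averages in the VLM table, assuming gap recovery is defined as (method − student) / (teacher − student), a definition inferred here rather than stated in the summary, DOPD and Vanilla OPD give respectively:

\[ \frac{58.4 - 48.3}{62.9 - 48.3} = \frac{10.1}{14.6} \approx 69.2\%, \qquad \frac{52.4 - 48.3}{62.9 - 48.3} = \frac{4.1}{14.6} \approx 28.1\%. \]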

Robustness & Scalability (Table 3)

Experiments across five teacher‑student pairs (Qwen3‑8B/4B/1.7B → Qwen3‑0.6B/1.7B) show DOPD consistently ahead of Vanilla OPD: it improves the student by 11.1–14.1 points on average and closes 53.0–92.2% of the teacher‑student gap, compared with 13.2–42.6% for Vanilla OPD. Gains are largest when the teacher‑student size ratio is large (e.g., 8B→0.6B: +14.1 for DOPD vs. +3.5 for Vanilla OPD).

Continual Learning & Out‑of‑Distribution (Figure 7)

Continual learning: A three‑stage curriculum (general → reasoning → coding) shows DOPD accumulates capabilities with minimal forgetting, outperforming Vanilla OPD in both stability and final performance.
OOD evaluation: Training on coding/reasoning and testing on the other domain yields 3.1–4.3 points improvement over the best baselines (ExOPD, Uni‑OPD, EOPD).

Training Stability (Figure 8)

DOPD reaches superior performance by step 80 (compared with step 200 for the baselines) and maintains healthy entropy, which rises initially and then gradually decreases to a steady state. In contrast, self‑distillation baselines suffer entropy collapse around step 95.
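
Entropy trajectories of this kind are typically obtained by averaging the Shannon entropy of the student's next-token distribution over sampled positions. The sketch below illustrates that quantity on toy distributions; how the paper actually measures entropy is not specified in this summary, so the helper names and numbers are assumptions.

```python
import math


def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def mean_entropy(dists):
    """Average token entropy over the positions of a trajectory."""
    return sum(token_entropy(d) for d in dists) / len(dists)


# Toy trajectories: a policy that still explores vs. one that has nearly collapsed.
healthy = [[0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
collapsed = [[0.997, 0.001, 0.001, 0.001], [1.0, 0.0, 0.0, 0.0]]

print(f"healthy   mean entropy = {mean_entropy(healthy):.3f}")
print(f"collapsed mean entropy = {mean_entropy(collapsed):.3f}")
```

A value approaching zero across training steps means the policy puts nearly all its mass on a single token at each position, which is the collapse pattern described above.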

Token & Divergence Analyses (Table 6, 7; Figure 9)

Token ablation: Using only high‑advantage, teacher‑confident tokens yields a 4.6‑point gain over uniform distillation; the full adaptive mechanism adds another 8+ points.
Divergence study: JS divergence with full vocabulary performs best overall, but Top‑K strategies offer better computational efficiency.
Privileged information analysis: Step‑wise hints without execution (LLM) and bounding boxes with object labels (VLM) are most effective; directly providing final answers hurts performance.

Ablation & Sensitivity (Table 8, Figure 10)

Removing privileged input or distillation from either policy degrades performance.
Advantage‑aware routing, adaptive divergence objectives, and adaptive strategies each contribute significantly.
Optimal intensity coefficients: $\beta_w = 0.3$, $\beta_l = 0.6$.

Theoretical and Practical Implications

Theoretical: DOPD introduces the first principled framework to disentangle capability transfer from privileged‑information mimicry in OPD. The privilege advantage gap provides a measurable quantity to identify genuinely transferable knowledge, and the dual‑source routing paradigm extends existing teacher‑only or self‑only distillation theories.
Practical: DOPD offers a drop‑in replacement for standard OPD that yields substantial gains without architectural changes. It is especially beneficial when:
- Teacher‑student size ratios are large (scalable distillation).
- Reliable privileged information (e.g., automated reasoning hints, weak visual annotations) is available.
- Continual learning or out‑of‑distribution generalization is desired.
The method is model‑agnostic and has been validated on both LLMs and VLMs; it is expected to transfer to other domains (e.g., multimodal, code generation).

Conclusion

DOPD addresses the fundamental limitation of on‑policy distillation under privileged contexts by identifying and mitigating privilege illusion. By leveraging the privilege advantage gap to route token‑level supervision dynamically between a privileged teacher and a privileged student, DOPD achieves:

Superior performance across LLM and VLM benchmarks (7.5 and 6.0 points over Vanilla OPD, respectively).
Excellent robustness, scalability, continual learning, and out‑of‑distribution generalization.
Stable and efficient training with healthy entropy dynamics.

Future directions include:

More cost‑effective methods for obtaining privileged information.
Learnable or principled routing mechanisms (e.g., neural advantage estimators).
Extension to other sequence‑generation tasks beyond LLMs/VLMs for interpretable and efficient selective capacity transfer.