Summary (Overview)
- Identifies “privilege illusion”: Injecting privileged information (e.g., reasoning hints, bounding boxes) into on‑policy distillation (OPD) can create an apparent teacher–student performance gap that conflates transferable capability with unlearnable information asymmetry, leading to entropy collapse and degraded distillation.
- Proposes privilege advantage gap: A token‑level metric that distinguishes tokens dominated by genuine capability gaps from those reflecting privileged shortcuts.
- Introduces DOPD: An advantage‑aware dual distillation framework that dynamically routes token‑level supervision between a privileged teacher and a privileged student according to the advantage gap and relative probabilities. Four distinct regimes are defined, each receiving tailored distillation (strong full‑vocabulary JS, light Top‑K reverse KL, weak self‑regularization).
- Achieves state‑of‑the‑art results: On LLM (Qwen3‑8B → Qwen3‑1.7B) and VLM (Qwen3‑VL‑8B → Qwen3‑VL‑2B) setups, DOPD outperforms Vanilla OPD by 7.5 points (LLM) and 6.0 points (VLM) on average across eight benchmarks. It also surpasses all standard, self‑distillation, and adaptive counterparts.
- Demonstrates strong robustness and scalability: DOPD maintains consistent gains across five teacher‑student model pairs (size ratio up to 13.3×), improves continual learning and out‑of‑distribution generalization, and exhibits stable entropy and performance trajectories.
Introduction and Theoretical Foundation
Background. On‑policy distillation (OPD) transfers knowledge from a stronger teacher to a weaker student by supervising student‑sampled trajectories with dense token‑level signals. This mitigates distribution shift and yields higher efficiency than off‑policy distillation.
Privilege illusion. To push the performance frontier, many works equip the teacher (or student) with privileged information – e.g., verified reasoning hints for LLMs or bounding‑box annotations for VLMs. However, the paper identifies a failure mode: the apparent performance advantage from privileged inputs may stem from information asymmetry rather than genuine capability. Indiscriminately distilling such signals causes the student to mimic privileged outcomes instead of acquiring transferable ability, leading to entropy collapse and poor distillation.
Non‑uniform token supervision. Only a small subset of tokens carries pivotal capability‑bearing signals (e.g., decisive reasoning steps). Uniformly distilling all tokens from a monolithic source amplifies privilege illusion by forcing the student to fit easy‑to‑mimic privileged shortcuts.
Key takeaway. The paper motivates the need to disentangle the teacher‑student capability gap from the information‑asymmetry gap. The privilege advantage gap (Equation 1) serves as a proxy to identify tokens where the teacher’s advantage reflects true competence rather than privileged shortcuts.
Methodology
1. Privilege Advantage Gap
For a given input \(x\), privileged input \(p\), and partial sequence \(y_{<n}\), define:
\[ A = \lvert \log \Pi_T(y_n \mid x,p,y_{<n}) - \log \Pi_S(y_n \mid x,p,y_{<n}) \rvert = \left\lvert \log \frac{\Pi_T(y_n \mid x,p,y_{<n})}{\Pi_S(y_n \mid x,p,y_{<n})} \right\rvert \tag{1} \]
A large \(A\) indicates a genuine capability discrepancy under identical privileged conditions; a small \(A\) suggests the teacher’s advantage is mainly due to privileged information.
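As a minimal sketch of Eq. (1), the gap can be computed directly from the log-probabilities that the two policies assign to the same sampled tokens. The function name and the toy probabilities below are illustrative assumptions, not taken from the paper.

```python
import math


def privilege_advantage_gap(teacher_logprobs, student_logprobs):
    """Per-token gap |log pi_T(y_n) - log pi_S(y_n)| as in Eq. (1).

    Both lists hold log-probabilities of the same sampled tokens, scored
    under the same privileged context (x, p, y_<n).
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same tokens")
    return [abs(t - s) for t, s in zip(teacher_logprobs, student_logprobs)]


# Hypothetical probabilities for a three-token continuation.
teacher_lp = [math.log(0.90), math.log(0.60), math.log(0.30)]
student_lp = [math.log(0.10), math.log(0.55), math.log(0.30)]

gaps = privilege_advantage_gap(teacher_lp, student_lp)
# The first token has a large gap (the teacher is far more confident),
# the second a small one, and the third none at all.
```

Because the gap is an absolute value, it does not say which policy is ahead; that direction has to be read from the probabilities themselves.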
2. Divergence Objectives
Three common divergences are considered:
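- Forward KL: \( D_{\mathrm{KL}}(\Pi_T \,\Vert\, \Pi_S) \) – support‑covering.
- Reverse KL: \( D_{\mathrm{KL}}(\Pi_S \,\Vert\, \Pi_T) \) – mode‑seeking.
- JS Divergence: \( \tfrac{1}{2} D_{\mathrm{KL}}(\Pi_T \,\Vert\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(\Pi_S \,\Vert\, M) \), where \( M = \tfrac{1}{2}(\Pi_T + \Pi_S) \) – balanced.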
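The three divergences can be compared on a toy next-token distribution. The sketch below uses the standard textbook definitions over a discrete vocabulary; the distributions are made-up values, and the code assumes the second argument of `kl` is strictly positive wherever the first is.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def forward_kl(teacher, student):
    """KL(teacher || student): penalises the student for missing teacher mass."""
    return kl(teacher, student)


def reverse_kl(teacher, student):
    """KL(student || teacher): penalises student mass where the teacher has little."""
    return kl(student, teacher)


def js(teacher, student):
    """Jensen-Shannon divergence with the equal-weight mixture M."""
    mix = [0.5 * (t + s) for t, s in zip(teacher, student)]
    return 0.5 * kl(teacher, mix) + 0.5 * kl(student, mix)


# Hypothetical distributions over a three-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]

print(forward_kl(teacher, student), reverse_kl(teacher, student), js(teacher, student))
```

Unlike either KL direction, the JS value is symmetric in its arguments and bounded above by log 2, which is one way to read the "balanced" label.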
3. Vanilla OPD Objective
\[ \mathbb{E}_{x\sim D}\!\left[ \mathbb{E}_{y\sim\Pi_S}\!\left[ \frac{1}{|y|}\sum_{n=1}^{|y|} \mathcal{L}_n(y_n; t_{<n}) \right] \right] \tag{5} \]
Here \( t_{<n} \) includes inputs, previous tokens, and any auxiliary context; \( \mathcal{L}_n \) is a token‑level divergence (e.g., reverse KL). Most OPD variants minimize this objective uniformly over all tokens.
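A rough sketch of the inner term of Eq. (5) for a single student-sampled trajectory, assuming reverse KL as the token-level divergence. The per-position distributions are toy values, and the outer expectations would in practice be approximated by averaging over a batch of prompts and samples.

```python
import math


def reverse_kl(student, teacher):
    """KL(student || teacher) over one next-token distribution."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0.0)


def vanilla_opd_loss(student_dists, teacher_dists):
    """Length-normalised sum of token-level divergences for one trajectory.

    Every position is weighted equally, which is the uniform treatment
    that vanilla OPD applies to all tokens.
    """
    n = len(student_dists)
    if n == 0 or n != len(teacher_dists):
        raise ValueError("need one teacher distribution per student position")
    total = sum(reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists))
    return total / n


# Hypothetical two-token trajectory over a three-token vocabulary.
student_dists = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
teacher_dists = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

loss = vanilla_opd_loss(student_dists, teacher_dists)
```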
4. DOPD: Advantage‑Aware Dual Distillation
Overview (Figure 5). For each on‑policy trajectory, the privileged teacher and the privileged student each perform a forward pass, yielding token‑level probabilities and log‑probabilities under both policies. The advantage gap \(A\) and the sum of the two probabilities are each averaged (after outlier removal) to obtain two thresholds, one for the gap and one for the probability sum.
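The threshold step can be illustrated with a trimmed mean. The summary does not say how outliers are removed, so the symmetric trimming below is an assumption, as are the function name, the `trim_fraction` parameter, and the toy gap values.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping the lowest and highest trim_fraction of values.

    ASSUMPTION: symmetric trimming stands in for the unspecified
    outlier-removal step; the actual rule may differ.
    """
    if not values:
        raise ValueError("no values to average")
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical per-token advantage gaps with one extreme value.
token_gaps = [0.1, 0.2, 0.3, 0.2, 9.0, 0.1, 0.2, 0.3, 0.2, 0.1]

gap_threshold = trimmed_mean(token_gaps)
plain_mean = sum(token_gaps) / len(token_gaps)
# The single extreme token pulls the plain mean far above the trimmed one.
```

The same helper would be applied to the per-token probability sums to obtain the second threshold.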
Four token regimes (each with a different distillation loss):
| Condition | Interpretation | Distillation Loss |
|---|---|---|
| Low \(A\), high probability sum | Teacher and student agree; the bottleneck is the absence of privileged information, not capability. | Light teacher distillation: \( \mathcal{L}_{\text{LH}} \), computed over Top‑K tokens. |
| Low \(A\), low probability sum | Both policies are uncertain; the token lies beyond the region of reliable competence. | Weak self‑regularization: \( \mathcal{L}_{\text{LL}} \), computed over Top‑K tokens. |
| High \(A\), teacher more confident | The teacher is confident and shows a clear advantage; the token carries transferable capability. | Strong full‑vocabulary teacher distillation: \( \mathcal{L}_{\text{HT}} \). |
| High \(A\), student more confident | The student is confident but the teacher is not; forcing teacher imitation may suppress exploration. | Light self‑regularization: \( \mathcal{L}_{\text{HS}} \), computed over Top‑K tokens. |
Total objective:
\[ \mathcal{L}_{\text{DOPD}} = \mathbb{I}_{\text{LH}}\mathcal{L}_{\text{LH}} + \mathbb{I}_{\text{LL}}\mathcal{L}_{\text{LL}} + \mathbb{I}_{\text{HT}}\mathcal{L}_{\text{HT}} + \mathbb{I}_{\text{HS}}\mathcal{L}_{\text{HS}} \tag{10} \]
which is then applied within the outer expectation of Eq. (5).
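The four-regime routing and the indicator sum in Eq. (10) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the inequality directions, the tie-breaking, and the rule that the high-gap regimes are split by comparing teacher and student probabilities are all assumptions, as are the function names and toy numbers.

```python
REGIMES = ("LH", "LL", "HT", "HS")


def route_token(gap, p_teacher, p_student, gap_threshold, sum_threshold):
    """Assign one token to a regime label.

    ASSUMPTION: low-gap tokens are split by the probability sum, and
    high-gap tokens by which policy assigns the higher probability.
    """
    if gap <= gap_threshold:
        return "LH" if (p_teacher + p_student) > sum_threshold else "LL"
    return "HT" if p_teacher >= p_student else "HS"


def dopd_token_loss(regime, regime_losses):
    """Indicator-weighted sum of Eq. (10): exactly one term is active."""
    return sum((1.0 if regime == r else 0.0) * regime_losses[r] for r in REGIMES)


# Hypothetical thresholds and per-regime loss values for one token.
gap_threshold, sum_threshold = 1.0, 0.8
regime_losses = {"LH": 0.10, "LL": 0.01, "HT": 0.50, "HS": 0.05}

regime = route_token(gap=2.2, p_teacher=0.9, p_student=0.1,
                     gap_threshold=gap_threshold, sum_threshold=sum_threshold)
token_loss = dopd_token_loss(regime, regime_losses)
```

In a real training loop the four losses would be computed as tensors and combined with boolean masks rather than a Python dictionary; the dictionary here only makes the selection logic visible.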
Empirical Validation / Results
Main Results
LLM‑based OPD (Qwen3‑8B → Qwen3‑1.7B) – average across 8 benchmarks:
| Method | Average | C‑Eval | LiveBench | MATH500 | AIME25 | ZebraLogic | AutoLogi | BFCLv3 | LCBv5 |
|---|---|---|---|---|---|---|---|---|---|
| Teacher | 52.8 | 77.1 | 53.5 | 86.9 | 20.2 | 25.0 | 76.3 | 60.0 | 23.6 |
| Student | 39.1 | 60.4 | 35.4 | 72.7 | 9.5 | 12.1 | 59.8 | 51.9 | 11.3 |
| Vanilla OPD | 43.9 | 65.2 | 40.9 | 75.6 | 16.7 | 15.8 | 64.3 | 55.4 | 17.6 |
| ExOPD | 47.0 | 68.3 | 44.7 | 76.7 | 18.5 | 19.9 | 68.0 | 57.2 | 22.6 |
| Uni‑OPD | 46.6 | 66.5 | 42.3 | 77.5 | 20.0 | 22.3 | 67.2 | 56.1 | 20.8 |
| EOPD | 46.1 | 67.5 | 45.7 | 75.9 | 17.6 | 19.3 | 67.1 | 56.8 | 19.0 |
| DOPD (Ours) | 51.4 | 71.3 | 49.8 | 81.5 | 23.3 | 26.9 | 71.0 | 60.2 | 27.1 |
DOPD recovers 89.8% of the teacher‑student gap and surpasses the teacher on four benchmarks (AIME25, ZebraLogic, BFCLv3, LCBv5).
VLM‑based OPD (Qwen3‑VL‑8B → Qwen3‑VL‑2B) – average across 8 benchmarks:
| Method | Average | RealWorldQA | MMStar | MathVision | DynaMath | LogicVista | MMMU | MMMU‑Pro | VSI‑Bench |
|---|---|---|---|---|---|---|---|---|---|
| Teacher | 62.9 | 71.3 | 70.7 | 53.8 | 67.6 | 55.0 | 69.6 | 55.8 | 59.7 |
| Student | 48.3 | 63.6 | 58.4 | 32.0 | 53.8 | 35.5 | 53.2 | 36.4 | 53.6 |
| Vanilla OPD | 52.4 | 64.7 | 61.8 | 37.1 | 56.2 | 40.2 | 58.0 | 46.7 | 54.1 |
| Uni‑OPD | 54.2 | 65.0 | 65.3 | 43.0 | 58.2 | 42.5 | 59.1 | 47.0 | 53.7 |
| Vision‑OPD | 55.6 | 66.2 | 66.4 | 38.0 | 57.6 | 43.1 | 64.9 | 52.3 | 56.1 |
| VA‑OPD * | 56.3 | 67.0 | 66.2 | 38.7 | 57.7 | 43.1 | 66.1 | 54.2 | 57.5 |
| DOPD (Ours) | 58.4 | 67.4 | 67.2 | 45.6 | 60.5 | 47.7 | 67.0 | 53.9 | 57.8 |
DOPD recovers 69.2% of the teacher‑student gap.
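The gap-recovery percentages quoted for both setups follow from the Average columns of the two tables. The helper below is a small sketch of that arithmetic; the function name is an assumption, while the scores are copied from the tables.

```python
def gap_recovery(student, distilled, teacher):
    """Fraction of the teacher-student gap closed by the distilled model."""
    if teacher == student:
        raise ValueError("teacher and student scores are identical")
    return (distilled - student) / (teacher - student)


# Average scores from the LLM and VLM result tables.
llm_dopd = gap_recovery(student=39.1, distilled=51.4, teacher=52.8)
vlm_dopd = gap_recovery(student=48.3, distilled=58.4, teacher=62.9)
llm_vanilla = gap_recovery(student=39.1, distilled=43.9, teacher=52.8)

print(f"LLM DOPD: {llm_dopd:.1%}")        # 89.8%
print(f"VLM DOPD: {vlm_dopd:.1%}")        # 69.2%
print(f"LLM Vanilla OPD: {llm_vanilla:.1%}")  # 35.0%
```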
Robustness & Scalability (Table 3)
Experiments across five teacher‑student pairs (Qwen3‑8B/4B/1.7B → Qwen3‑0.6B/1.7B) show that DOPD improves the average score by 11.1–14.1 points over the undistilled student and closes 53.0–92.2% of the teacher‑student gap, compared with 13.2–42.6% for Vanilla OPD. Gains are largest when the teacher‑student size ratio is large (e.g., 8B→0.6B: +14.1 for DOPD vs. +3.5 for Vanilla OPD).
Continual Learning & Out‑of‑Distribution (Figure 7)
- Continual learning: A three‑stage curriculum (general → reasoning → coding) shows DOPD accumulates capabilities with minimal forgetting, outperforming Vanilla OPD in both stability and final performance.
- OOD evaluation: Training on one domain (coding or reasoning) and testing on the other yields a 3.1–4.3 point improvement over the best of the baselines (ExOPD, Uni‑OPD, EOPD).
Training Stability (Figure 8)
DOPD achieves superior performance by step 80 (compared with step 200 for the baselines) and maintains healthy entropy, which rises initially and then gradually decreases to a steady state. In contrast, self‑distillation baselines suffer entropy collapse around step 95.
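Entropy collapse of the kind described here is usually tracked through the mean Shannon entropy of the student's next-token distributions. The sketch below shows that quantity on made-up distributions; it is a generic monitoring helper, not the paper's logging code.

```python
import math


def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def mean_entropy(dists):
    """Average entropy over the positions of a trajectory."""
    if not dists:
        raise ValueError("empty trajectory")
    return sum(token_entropy(d) for d in dists) / len(dists)


# Hypothetical distributions: a policy that still explores versus one
# that has concentrated almost all mass on a single token.
healthy = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
collapsed = [[0.97, 0.01, 0.01, 0.01], [0.99, 0.005, 0.003, 0.002]]

print(mean_entropy(healthy), mean_entropy(collapsed))
```

Logging this value at every training step gives the kind of trajectory that the stability comparison refers to.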
Token & Divergence Analyses (Table 6, 7; Figure 9)
- Token ablation: Using only high‑advantage, teacher‑confident tokens yields a 4.6‑point gain over uniform distillation; the full adaptive mechanism adds a further 8+ points.
- Divergence study: JS divergence with full vocabulary performs best overall, but Top‑K strategies offer better computational efficiency.
- Privileged information analysis: Step‑wise hints without execution (LLM) and bounding boxes with object labels (VLM) are most effective; directly providing final answers hurts performance.
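The efficiency trade-off in the divergence study comes from restricting the divergence to a few tokens instead of the whole vocabulary. One possible form of a Top-K reverse KL is sketched below. Which policy selects the Top-K set, and whether the restricted distributions are renormalised, are not stated in this summary, so selecting by teacher probability and renormalising are assumptions.

```python
import math


def top_k_indices(dist, k):
    """Indices of the k largest entries, highest first."""
    return sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)[:k]


def renormalise(dist, indices):
    """Restrict a distribution to the given indices and rescale to sum to 1."""
    mass = sum(dist[i] for i in indices)
    return [dist[i] / mass for i in indices]


def top_k_reverse_kl(student, teacher, k):
    """KL(student || teacher) restricted to the teacher's Top-K tokens.

    ASSUMPTION: the Top-K set comes from the teacher, and both policies
    have nonzero mass on it.
    """
    if not 1 <= k <= len(teacher):
        raise ValueError("k must be between 1 and the vocabulary size")
    idx = top_k_indices(teacher, k)
    s = renormalise(student, idx)
    t = renormalise(teacher, idx)
    return sum(si * math.log(si / ti) for si, ti in zip(s, t) if si > 0.0)


# Hypothetical distributions over a four-token vocabulary.
teacher = [0.5, 0.3, 0.15, 0.05]
student = [0.4, 0.35, 0.15, 0.10]

restricted = top_k_reverse_kl(student, teacher, k=2)
full = top_k_reverse_kl(student, teacher, k=len(teacher))
```

With `k` equal to the vocabulary size the function reduces to the ordinary reverse KL, while a small `k` touches only a handful of logits per token.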
Ablation & Sensitivity (Table 8, Figure 10)
- Removing privileged input or distillation from either policy degrades performance.
- Advantage‑aware routing, adaptive divergence objectives, and adaptive strategies each contribute significantly.
- The sensitivity study identifies optimal values for the two intensity coefficients.
Theoretical and Practical Implications
- Theoretical: DOPD introduces the first principled framework to disentangle capability transfer from privileged‑information mimicry in OPD. The privilege advantage gap provides a measurable quantity to identify genuinely transferable knowledge, and the dual‑source routing paradigm extends existing teacher‑only or self‑only distillation theories.
- Practical: DOPD offers a drop‑in replacement for standard OPD that yields substantial gains without architectural changes. It is especially beneficial when:
  - Teacher‑student size ratios are large (scalable distillation).
  - Reliable privileged information (e.g., automated reasoning hints, weak visual annotations) is available.
  - Continual learning or out‑of‑distribution generalization is desired.
- The method is model‑agnostic and has been validated on both LLMs and VLMs; it is expected to transfer to other domains (e.g., multimodal, code generation).
Conclusion
DOPD addresses the fundamental limitation of on‑policy distillation under privileged contexts by identifying and mitigating privilege illusion. By leveraging the privilege advantage gap to route token‑level supervision dynamically between a privileged teacher and a privileged student, DOPD achieves:
- Superior performance across LLM and VLM benchmarks (7.5 and 6.0 points over Vanilla OPD, respectively).
- Excellent robustness, scalability, continual learning, and out‑of‑distribution generalization.
- Stable and efficient training with healthy entropy dynamics.
Future directions include:
- More cost‑effective methods for obtaining privileged information.
- Learnable or principled routing mechanisms (e.g., neural advantage estimators).
- Extension to other sequence‑generation tasks beyond LLMs/VLMs for interpretable and efficient selective capacity transfer.