Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Summary (Overview)

  • Identifies two governing conditions for successful On-Policy Distillation (OPD):
    1. Thinking-pattern consistency: The student and teacher must share compatible reasoning patterns (e.g., high overlap in their top-$k$ token distributions).
    2. New knowledge: The teacher must provide genuinely new capabilities beyond what the student has already seen during training; higher benchmark scores alone are insufficient.
  • Reveals the token-level mechanism: Successful OPD is driven by progressive alignment on a small set of high-probability overlap tokens at student-visited states, which concentrate 97%–99% of the probability mass and provide the main gradient signal.
  • Proposes practical recovery strategies: Two methods to rescue failing OPD:
    1. Off-policy cold start: An initial SFT phase on teacher-generated rollouts to bridge the thinking-pattern gap.
    2. Teacher-aligned prompt selection: Using prompts from the teacher's post-training data to sharpen alignment.
  • Examines the cost of dense supervision: OPD's token-level reward quality degrades with trajectory depth, revealing a fundamental tension between supervision density and reliability for long-horizon tasks.

Introduction and Theoretical Foundation

On-Policy Distillation (OPD) has become a core post-training technique for Large Language Models (LLMs), complementing SFT and RL. Unlike off-policy distillation, which suffers from exposure bias, OPD has the student generate its own rollouts and uses the teacher's per-token log-probabilities as a dense reward signal on states the student actually visits.

Despite its success in industry pipelines (e.g., Qwen3, MiMo), OPD remains poorly understood and can fail unpredictably—a stronger teacher may fail to improve a student while a weaker one succeeds. This paper systematically investigates OPD's training dynamics, progressing from empirical conditions (Phenomenology) to token-level mechanisms and finally to practical recipes for recovery.

Theoretical Foundation: The standard OPD objective minimizes the reverse Kullback-Leibler (KL) divergence between the student ($\pi_\theta$) and teacher ($\pi_T$) distributions over student-generated trajectories.

The sequence-level reverse KL objective is:

$$\mathcal{L}_{\text{OPD}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}_x}\left[ D_{\text{KL}}\left( \pi_\theta(\cdot|x) \,\|\, \pi_T(\cdot|x) \right) \right] \tag{1}$$

This decomposes into an exact token-level sum:

$$\mathcal{L}_{\text{OPD}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}_x,\, \hat{y} \sim \pi_\theta(\cdot|x)}\left[ \sum_{t=1}^{T} D_{\text{KL}}(p_t \,\|\, q_t) \right] \tag{2}$$

where $p_t(v) \triangleq \pi_\theta(v|x, \hat{y}_{<t})$ and $q_t(v) \triangleq \pi_T(v|x, \hat{y}_{<t})$.

Common implementations vary in supervision granularity:

  1. Sampled-Token OPD: An unbiased single-sample estimator using only the student-sampled token $\hat{y}_t$:
     $$\mathcal{L}^{\text{sample}}_{\text{OPD}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}_x,\, \hat{y} \sim \pi_\theta(\cdot|x)}\left[ \sum_{t=1}^{T} \ell^{\text{sample}}_t \right], \quad \ell^{\text{sample}}_t = \log p_t(\hat{y}_t) - \log q_t(\hat{y}_t) \tag{3}$$
  2. Full-Vocabulary OPD: Computes the divergence over the entire vocabulary at each step (dense but expensive):
     $$\mathcal{L}^{\text{full}}_{\text{OPD}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}_x,\, \hat{y} \sim \pi_\theta(\cdot|x)}\left[ \sum_{t=1}^{T} D_{\text{KL}}(p_t \,\|\, q_t) \right] \tag{4}$$
  3. Top-$k$ OPD: Restricts the divergence computation to the student's top-$k$ tokens $S_t = \text{TopK}(p_t, k)$, renormalizing both distributions over this subset:
     $$\mathcal{L}^{\text{top-}k}_{\text{OPD}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}_x,\, \hat{y} \sim \pi_\theta(\cdot|x)}\left[ \sum_{t=1}^{T} D_{\text{KL}}\left( \bar{p}^{(S_t)}_t \,\|\, \bar{q}^{(S_t)}_t \right) \right] \tag{5}$$
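The three granularities above differ only in which token subset the per-step divergence is computed over. A minimal NumPy sketch of the per-step terms, under our own naming (a real implementation would operate on batched logits rather than single distributions):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sampled_token_term(p, q, tok):
    # Eq. (3) per-step term: log p(tok) - log q(tok) at the student-sampled token
    return float(np.log(p[tok]) - np.log(q[tok]))

def full_kl(p, q):
    # Eq. (4) per-step term: D_KL(p || q) over the entire vocabulary
    return float(np.sum(p * (np.log(p) - np.log(q))))

def topk_kl(p, q, k):
    # Eq. (5) per-step term: restrict to the student's top-k tokens
    # and renormalize both distributions over that subset
    S = np.argsort(p)[-k:]
    pb = p[S] / p[S].sum()
    qb = q[S] / q[S].sum()
    return float(np.sum(pb * (np.log(pb) - np.log(qb))))
```

With `k` equal to the vocabulary size, `topk_kl` reduces to `full_kl`, which is a handy sanity check on an implementation.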

Dynamic Metrics for analysis are defined as:

  • Overlap Ratio: $\mathcal{M}_{\text{overlap}} \triangleq \mathbb{E}_t\left[ \frac{|S^{(p)}_t \cap S^{(q)}_t|}{k} \right]$, where $S^{(p)}_t = \text{TopK}(p_t, k)$ and $S^{(q)}_t = \text{TopK}(q_t, k)$. (6)
  • Overlap-Token Advantage: $\mathcal{M}_{\text{adv}} \triangleq \mathbb{E}_t\left[ \frac{1}{|S^{(p)}_t \cap S^{(q)}_t|} \sum_{v \in S^{(p)}_t \cap S^{(q)}_t} A_t(v) \right]$, with $A_t(v) \triangleq \bar{p}_t(v)\left(\log \bar{q}_t(v) - \log \bar{p}_t(v)\right)$. (7)
  • Entropy Gap: $\Delta H_t = |H(q_t) - H(p_t)|$. (8)
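These metrics are computable per step from the two next-token distributions. A small NumPy sketch follows; function names are ours, and the renormalization convention for $A_t(v)$ is our reading of Eq. (7), with each distribution renormalized over its own top-$k$ set:

```python
import numpy as np

def top_k_set(dist, k):
    # indices of the k highest-probability tokens
    return set(np.argsort(dist)[-k:].tolist())

def overlap_ratio(p, q, k):
    # Eq. (6) for a single step: |TopK(p) ∩ TopK(q)| / k
    return len(top_k_set(p, k) & top_k_set(q, k)) / k

def entropy(dist):
    return float(-np.sum(dist * np.log(dist)))

def entropy_gap(p, q):
    # Eq. (8): absolute per-step entropy difference |H(q) - H(p)|
    return abs(entropy(q) - entropy(p))

def overlap_token_advantage(p, q, k):
    # Eq. (7) for a single step: mean of A(v) = p̄(v)(log q̄(v) - log p̄(v))
    # over the overlap tokens, with each distribution renormalized
    # over its own top-k support (our assumed convention)
    Sp, Sq = np.argsort(p)[-k:], np.argsort(q)[-k:]
    pb = np.zeros_like(p); pb[Sp] = p[Sp] / p[Sp].sum()
    qb = np.zeros_like(q); qb[Sq] = q[Sq] / q[Sq].sum()
    ov = np.intersect1d(Sp, Sq)
    return float(np.mean(pb[ov] * (np.log(qb[ov]) - np.log(pb[ov]))))
```

When student and teacher agree exactly, the overlap ratio is 1, and both the entropy gap and the overlap-token advantage are 0, matching the limiting behavior the paper's dynamics converge toward.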

Methodology

The study employs controlled OPD experiments across different model families (Qwen, DeepSeek) and sizes (1.5B to 14B parameters) on mathematical reasoning tasks, primarily using the DAPO-Math-17k dataset.

Key Experimental Setups:

  1. Thinking-Pattern Consistency: Compares OPD from two Qwen3-4B teachers (one base, one GRPO-trained) into a Qwen3-1.7B-Base student.
  2. New Knowledge vs. Scale: Compares OPD from same-pipeline teachers vs. teachers with additional RL post-training (e.g., DeepSeek-R1-Distill-7B vs. Skywork-OR1-Math-7B).
  3. Reverse Distillation: Uses a stronger student (JustRL-1.5B, obtained via RL) and distills from weaker teachers (its pre-RL checkpoint R1-Distill-1.5B and a larger same-family model R1-Distill-7B).
  4. Token-Level Mechanism: Compares dynamics (overlap, entropy gap, advantage) of successful (JustRL-1.5B → R1-Distill-1.5B) vs. failing (R1-Distill-7B → R1-Distill-1.5B) OPD runs.
  5. Ablation on Optimization Support: Tests whether optimizing only on the overlap tokens $S^{(p)}_t \cap S^{(q)}_t$ or only on the non-overlap tokens $S^{(p)}_t \,\triangle\, S^{(q)}_t$ suffices for successful distillation.
  6. Recovery Strategies:
    • Off-policy cold start: SFT on 200K teacher-generated rollouts before OPD.
    • Teacher-aligned prompts: Varying prompt template/content to match the teacher's training data.
  7. Reward Analysis: Investigates reward quality vs. trajectory depth and compares global reward informativeness between successful and failing teachers.

Evaluation: Models are evaluated on AIME 2024, AIME 2025, and AMC 2023 benchmarks, reporting average accuracy over 16 samples (avg@16) as the primary metric. Default OPD hyperparameters are used unless specified.

Empirical Validation / Results

1. Phenomenology: Conditions for Success

  • Thinking-Pattern Consistency: A teacher with a more compatible thinking pattern (Qwen3-4B-Base-GRPO) yields better OPD outcomes than a pattern-mismatched teacher (Qwen3-4B Non-thinking), even though the two teachers have comparable benchmark scores:
| Teacher                 | AIME 2024 | AIME 2025 | AMC 2023 |
|-------------------------|-----------|-----------|----------|
| Qwen3-4B (Non-thinking) | 0.212     | 0.204     | 0.700    |
| Qwen3-4B-Base-GRPO      | 0.210     | 0.242     | 0.599    |
  • New Knowledge, Not Just Scale: Teachers that have acquired additional capabilities through further RL post-training yield substantially stronger OPD gains and higher gap recovery rates than same-pipeline teachers of larger scale.

Gap Recovery Rate: $(\text{Acc}_{\text{after OPD}} - \text{Acc}_{\text{before OPD}}) \,/\, (\text{Acc}_{\text{teacher}} - \text{Acc}_{\text{before OPD}})$.
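As a quick illustration of the metric (the numbers in the usage comment are hypothetical, not taken from the paper):

```python
def gap_recovery_rate(acc_after, acc_before, acc_teacher):
    # fraction of the student-teacher accuracy gap closed by OPD:
    # (Acc_after - Acc_before) / (Acc_teacher - Acc_before)
    return (acc_after - acc_before) / (acc_teacher - acc_before)

# e.g., student 0.40 before OPD, 0.50 after, teacher 0.60:
# half of the original gap has been recovered
rate = gap_recovery_rate(0.50, 0.40, 0.60)
```

A rate of 1.0 means the student fully reaches the teacher; values above 1.0 are possible when OPD pushes the student past the teacher's score.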

Reverse Distillation Validation: Distilling the stronger JustRL-1.5B student back to its weaker pre-RL checkpoint (R1-Distill-1.5B) causes it to regress to the teacher's performance level. Strikingly, using a larger, higher-scoring same-family teacher (R1-Distill-7B) drives the student to the same regressed level, indicating identical local target distributions and that OPD learns thinking patterns independent of benchmark scores.

2. Mechanism: Progressive Alignment on Overlap Tokens

Successful OPD runs are characterized by progressive alignment on high-probability tokens:

  • Overlap Ratio rises steadily (e.g., from ~72% to >91%).
  • Entropy Gap narrows.
  • Overlap-Token Advantage improves toward zero.
  • The overlapping tokens carry 97%–99% of the total probability mass for both student and teacher.

Ablation Result: Optimizing only on the overlap tokens suffices to match the full performance of standard Top-$k$ OPD, while optimizing only on non-overlap tokens yields substantially weaker results. This confirms the overlap set is the principal locus of OPD's gradient signal.
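The ablation can be reproduced in miniature by restricting the renormalized KL to an arbitrary support set, either the overlap or the symmetric difference of the two top-$k$ sets. A hedged NumPy sketch under our own naming:

```python
import numpy as np

def masked_kl(p, q, support):
    # KL between p and q after renormalizing both over an arbitrary,
    # non-empty token subset (the optimization support of the ablation)
    idx = np.asarray(sorted(support))
    pb = p[idx] / p[idx].sum()
    qb = q[idx] / q[idx].sum()
    return float(np.sum(pb * (np.log(pb) - np.log(qb))))

def overlap_and_diff(p, q, k):
    # split the union of the two top-k sets into the overlap
    # (intersection) and the non-overlap (symmetric difference)
    Sp = set(np.argsort(p)[-k:].tolist())
    Sq = set(np.argsort(q)[-k:].tolist())
    return Sp & Sq, Sp ^ Sq
```

Comparing training driven by `masked_kl(p, q, overlap)` against `masked_kl(p, q, diff)` is the shape of the experiment; in the paper, only the overlap-supported loss recovers full Top-$k$ performance.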

3. Recipe: Recovering Failing OPD

  • Off-Policy Cold Start: An initial SFT phase on teacher-generated rollouts bridges the thinking-pattern gap, resulting in higher initial overlap, more stable training dynamics, and stronger final performance compared to pure OPD.
  • Teacher-Aligned Prompts: Using prompts (template or content) that match the teacher's post-training data improves downstream performance and sharpens alignment. However, using only such prompts can overly suppress student entropy, suggesting a mix with out-of-distribution prompts is beneficial.

4. The Cost of Dense Supervision

  • Reward Degrades with Depth: Teacher reward quality and continuation accuracy advantage systematically decrease as the student-generated prefix length increases. Instability in long-horizon OPD (e.g., 15K tokens) originates at later tokens and propagates backward.
  • Globally Informative but Locally Unexploitable: In failing OPD configurations, the teacher's reward signal can be globally correlated with rollout correctness (comparable AUROC to successful teachers) yet fail to provide locally exploitable gradients, possibly due to anisotropic reward landscapes.
  • Sampled-Token Reward is Sufficient: Sampled-token OPD performs comparably to Top-$k$ OPD for $k \ge 4$. The failure of Top-1 OPD stems from its biased, mode-concentrated selection rule, not from the number of supervised tokens.

Table: Effect of Support Size $k$ in Top-$k$ OPD (avg@16 Accuracy)

| Method            | AIME 2024 | AIME 2025 | AMC 2023 |
|-------------------|-----------|-----------|----------|
| Sampled-token OPD | 0.454     | 0.327     | 0.782    |
| Top-1 OPD         | 0.446     | 0.310     | 0.772    |
| Top-4 OPD         | 0.473     | 0.331     | 0.793    |
| Top-16 OPD        | 0.458     | 0.338     | 0.791    |
| Top-64 OPD        | 0.463     | 0.338     | 0.785    |

Theoretical and Practical Implications

Theoretical Implications:

  • Challenges the assumption that a stronger teacher always yields better distillation.
  • Decouples OPD dynamics from benchmark performance, emphasizing thinking patterns and knowledge novelty.
  • Provides a mechanistic explanation: OPD succeeds via a self-reinforcing cycle of progressive alignment on a small, high-probability overlap token set.
  • Reveals a fundamental tension: Dense token-level supervision comes at the cost of degraded reliability over long trajectories, questioning OPD's scalability to long-horizon reasoning.

Practical Implications:

  • Teacher Selection: Prioritize thinking-pattern compatibility and novel capabilities over raw benchmark scores.
  • Training Design: Implement off-policy cold start or use teacher-aligned prompts to recover failing OPD runs.
  • Task Length: Be mindful of a "sweet spot" in response length (3K-7K tokens in this study); very long horizons may induce instability.
  • Algorithm Choice: Sampled-token OPD is often sufficient and efficient; avoid the degenerate Top-1 setting.

Conclusion

This work provides a systematic analysis of On-Policy Distillation, establishing that its success is governed by two conditions: thinking-pattern consistency and new knowledge in the teacher. The core mechanism is progressive alignment on high-probability overlap tokens. When these conditions are unmet, practical strategies like off-policy cold start can recover performance. However, OPD's dense supervision has a cost: reward quality degrades with trajectory depth, revealing limitations for long-horizon tasks.

Future Directions:

  • Extending findings beyond mathematical reasoning to other domains (code, open-ended generation).
  • Investigating the impact of pre-training data divergence on OPD.
  • Analyzing dynamics in self-distillation settings.
  • Developing hybrid or curriculum strategies to overcome the long-horizon ceiling.