Summary (Overview)
- Trust Region On-Policy Distillation (TrOPD) is proposed to stabilize on-policy distillation (OPD) for large language models (LLMs) when teacher and student distributions diverge substantially.
- TrOPD partitions student-generated tokens into a trust region (where teacher supervision is reliable) and outliers (where it is unreliable), using an adaptive threshold based on the teacher-student decoding agreement ratio.
- For outlier tokens, TrOPD employs a top‑$k$ forward KL (FKL) estimator to preserve informative signals while avoiding unreliable policy gradients.
- Off‑policy guidance is introduced: the student continues generation from teacher‑generated prefixes, using forward KL to imitate teacher trajectories, encouraging exploration toward reliable regions.
- Extensive experiments on mathematical reasoning (AIME 2024/2025, AMC 2023), code generation (LiveCodeBench), instruction following (IFBench), and STEM (GPQA Diamond) show that TrOPD consistently outperforms state‑of‑the‑art OPD baselines (OPD, EOPD, REOPOLD) by +3.34 to +6.18 points on average.
Introduction and Theoretical Foundation
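On‑Policy Distillation (OPD) trains a student model on its own generated trajectories to mitigate the exposure bias inherent in off‑policy distillation. The typical objective is the reverse KL divergence (RKL) from student to teacher, evaluated on student samples. Writing $\pi_\theta$ for the student, $\pi_T$ for the teacher, $x$ for the prompt, and $y$ for the response:
$$\mathcal{L}_{\mathrm{RKL}}(\theta) = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t} D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_T(\cdot \mid x, y_{<t}) \big) \right]$$
whose gradient takes a policy‑gradient form: the student is rewarded for sequences assigned high probability by the teacher. However, when the teacher and student distributions diverge substantially, student‑generated tokens may fall into low‑probability regions of the teacher, leading to extreme policy‑gradient outliers (e.g., when $\pi_T(y_t \mid x, y_{<t}) \ll \pi_\theta(y_t \mid x, y_{<t})$). This destabilizes training and limits final performance.
For reasoning models, full‑vocabulary KL divergence is prohibitively expensive due to its $O(LV)$ memory ($L$ = sequence length, $V$ = vocabulary size). Recent work uses a single‑sample estimator for RKL, evaluated only at the sampled tokens:
$$\hat{D}_{\mathrm{RKL}} = \sum_{t} \big[ \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_T(y_t \mid x, y_{<t}) \big], \qquad y \sim \pi_\theta(\cdot \mid x)$$
which reduces memory to $O(L)$ but suffers from two key issues:
- Significant policy‑gradient outliers when $\pi_T(y_t \mid x, y_{<t}) \ll \pi_\theta(y_t \mid x, y_{<t})$.
- Low‑quality student generations that limit the effective optimization space.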
Existing mitigation strategies (e.g., entropy‑based token selection, reward clipping) provide only limited correction. This motivates the need for a principled approach to ensure reliable supervision.
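The outlier problem is easy to reproduce on a toy example. The sketch below assumes the common single‑sample form of the reverse‑KL signal (log student probability minus log teacher probability at the sampled token) and uses made‑up probabilities over a four‑token vocabulary. The function names are illustrative and are not taken from the paper.

```python
import math

def full_reverse_kl(student, teacher):
    """Exact KL(student || teacher), summed over the whole vocabulary."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def sampled_token_signal(student, teacher, token):
    """Single-sample signal at one token: log student - log teacher."""
    return math.log(student[token]) - math.log(teacher[token])

# Toy next-token distributions (made-up numbers, each sums to 1).
student = [0.70, 0.20, 0.05, 0.05]
teacher = [0.60, 0.30, 0.0999, 0.0001]

exact = full_reverse_kl(student, teacher)
signals = [sampled_token_signal(student, teacher, i) for i in range(4)]

# Averaged under the student, the per-token signal recovers the exact KL.
expected = sum(s * g for s, g in zip(student, signals))

# Token 3 is likely under the student but almost impossible under the teacher,
# so its signal is an order of magnitude larger than the others.
for i, g in enumerate(signals):
    print(f"token {i}: signal = {g:+.3f}")
print(f"exact KL = {exact:.4f}, expectation of signal = {expected:.4f}")
```

The estimator is unbiased in this toy setting, but a single draw of token 3 produces a signal above 6 while the other tokens stay below 1 in magnitude, which is the kind of outlier the trust region is meant to isolate.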
Methodology
TrOPD distinguishes three regions in the distillation process (see Table 2):
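| Region | Policy | Objective | Estimator | Memory |
|---|---|---|---|---|
| On‑Policy Trust Region | Student | Reverse KL | Single‑sample | |
| On‑Policy Outlier | Student | Forward KL | Top‑$k$ | |
| Off‑Policy Guidance | Teacher prefix | Forward KL | | |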
1. Adaptive Trust Region
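For each token sampled from the student, TrOPD assigns a probability of being in the trust region, inspired by speculative decoding’s acceptance probability: there, a draft token $x$ is accepted with probability $\min\left(1, p(x)/q(x)\right)$, where $p$ is the target distribution and $q$ is the draft distribution. In the distillation setting the teacher plays the role of the target and the student the role of the draft. Tokens that fall outside the trust region are considered outliers.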
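A minimal sketch of such a partition follows. It assumes the trust probability has the speculative‑decoding form, min(1, teacher probability / student probability), and that a token is an outlier when this value falls below a cutoff. The cutoff `tau`, its default value, and both function names are hypothetical; the adaptive threshold used by TrOPD may be defined differently.

```python
import math

def trust_probability(logp_teacher, logp_student):
    """min(1, p_teacher / p_student) for a sampled token, in log space."""
    return math.exp(min(0.0, logp_teacher - logp_student))

def split_tokens(logps_teacher, logps_student, tau=0.1):
    """Return (trust_indices, outlier_indices); tau is a hypothetical cutoff."""
    trust, outlier = [], []
    for i, (lt, ls) in enumerate(zip(logps_teacher, logps_student)):
        if trust_probability(lt, ls) >= tau:
            trust.append(i)
        else:
            outlier.append(i)
    return trust, outlier

# Log-probabilities of three student-sampled tokens (made-up numbers).
lp_student = [math.log(p) for p in (0.70, 0.20, 0.05)]
lp_teacher = [math.log(p) for p in (0.60, 0.30, 0.0001)]

trust, outlier = split_tokens(lp_teacher, lp_student)
print("trust region:", trust)
print("outliers:", outlier)
```

Working in log space avoids dividing by very small probabilities, and only the log‑probability of the sampled token is needed from each model.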
2. Outlier Estimation
For outlier tokens, instead of masking or clipping, TrOPD uses a top‑$k$ forward KL (FKL) objective, computed over a top‑$k$ subset of the vocabulary rather than the full distribution. This preserves informative teacher‑supported tokens while avoiding unreliable reverse‑KL gradients.
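The sketch below illustrates one plausible reading of a truncated forward KL. It assumes the subset is the teacher's k most likely tokens and that the teacher probabilities are not renormalized over that subset; both choices, and the function name, are assumptions rather than details confirmed by the source.

```python
import math

def topk_forward_kl(teacher, student, k):
    """Sum of p_T * (log p_T - log p_S) over the teacher's k most likely tokens."""
    top = sorted(range(len(teacher)), key=lambda v: teacher[v], reverse=True)[:k]
    return sum(
        teacher[v] * (math.log(teacher[v]) - math.log(student[v])) for v in top
    )

# Toy next-token distributions at one outlier position (made-up numbers).
teacher = [0.60, 0.30, 0.0999, 0.0001]
student = [0.70, 0.20, 0.05, 0.05]

kl_top2 = topk_forward_kl(teacher, student, k=2)
kl_full = topk_forward_kl(teacher, student, k=len(teacher))
print(f"top-2 forward KL: {kl_top2:.5f}")
print(f"full forward KL:  {kl_full:.5f}")
```

Only k teacher and student probabilities per outlier position enter the sum, so the cost grows with k rather than with the vocabulary size. Without renormalization the truncated sum is not guaranteed to be non‑negative, which is one reason an implementation might renormalize instead.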
3. Off‑Policy Guidance
To encourage the student to generate within teacher‑verifiable regions, the student continues generation from a teacher‑generated prefix. A forward KL term on the teacher‑generated tokens is added to the objective, weighted by a small coefficient (default 0.001). The off‑policy trajectory length is gradually annealed to zero via a cosine schedule, so generation becomes fully on‑policy by the end of training.
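The annealing can be sketched with a standard half‑cosine decay. The function name, the parameter names, and the initial prefix length of 512 tokens are hypothetical; the source states only that the length follows a cosine schedule and reaches zero by the end of training.

```python
import math

def prefix_length(step, total_steps, max_prefix_len):
    """Teacher-prefix length at a training step, decayed by a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    scale = 0.5 * (1.0 + math.cos(math.pi * progress))
    return int(round(max_prefix_len * scale))

# Schedule over 1000 steps, starting from a 512-token teacher prefix.
for step in (0, 250, 500, 750, 1000):
    print(step, prefix_length(step, total_steps=1000, max_prefix_len=512))
```

The decay is slow at the start, so early training sees long teacher prefixes, and flattens again near the end, where rollouts are almost entirely generated by the student.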
Unified Optimization Objective
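The overall TrOPD objective combines the three components above: the reverse KL term on trust‑region tokens, the top‑$k$ forward KL term on outlier tokens, and the off‑policy guidance term scaled by its small coefficient.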
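As a rough illustration of how the three terms of this section might be combined, the sketch below takes already‑computed per‑token values for each region and returns one scalar. The aggregation (a token mean over on‑policy terms, a separate mean over guidance tokens) and the names `tropd_loss` and `guidance_coef` are assumptions; only the default coefficient of 0.001 comes from the text.

```python
def tropd_loss(trust_terms, outlier_terms, guidance_terms, guidance_coef=0.001):
    """Combine per-token terms from the three regions into one scalar.

    trust_terms:    reverse-KL signals for trust-region tokens
    outlier_terms:  top-k forward-KL values for outlier tokens
    guidance_terms: forward-KL values on teacher-generated prefix tokens
    """
    n_on_policy = len(trust_terms) + len(outlier_terms)
    on_policy = (sum(trust_terms) + sum(outlier_terms)) / max(n_on_policy, 1)
    guidance = sum(guidance_terms) / max(len(guidance_terms), 1)
    return on_policy + guidance_coef * guidance

# Two trust-region tokens, one outlier token, two teacher-prefix tokens.
loss = tropd_loss([0.2, -0.4], [0.03], [2.0, 4.0])
print(f"combined loss: {loss:.6f}")
```

With the small coefficient, the guidance term contributes only a light pull toward teacher trajectories, and it disappears once the teacher prefix has been annealed to zero length.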
Empirical Validation / Results
Main Results
Table 3: Performance with DeepSeek‑R1‑Distill‑Qwen‑1.5B student (single‑domain teacher: Skywork‑OR1‑Math‑7B; multi‑domain teacher: Skywork‑OR1‑7B)
| Method | AIME 24 | AIME 25 | AMC 23 | LiveCodeBench v6 | GPQA Diamond | Avg. |
|---|---|---|---|---|---|---|
| DeepSeek‑Qwen2.5‑1.5B | 28.64 | 24.16 | 71.01 | 15.43 | 34.22 | 34.69 |
| Single‑Domain Distillation | | | | | | |
| Teacher | 66.14 | 51.87 | 92.34 | 34.86 | 47.22 | 58.48 |
| OPD | 35.83 | 29.16 | 75.39 | 17.14 | 28.03 | 37.11 |
| EOPD | 36.97 | 29.79 | 75.23 | 15.43 | 32.58 | 38.00 |
| Entropy OPD 20% | 35.52 | 29.06 | 73.82 | 14.29 | 31.82 | 36.90 |
| REOPOLD 2Stage | 34.47 | 29.89 | 73.35 | 16.57 | 30.18 | 36.89 |
| REOPOLD | 36.97 | 30.83 | 75.78 | 18.29 | 32.07 | 38.79 |
| TrOPD | 38.54 | 32.50 | 77.03 | 18.86 | 36.24 | 40.63 |
| Multi‑Domain Distillation | | | | | | |
| Teacher | 65.62 | 52.81 | 91.79 | 36.57 | 47.22 | 58.80 |
| OPD | 30.10 | 21.66 | 61.56 | 20.57 | 31.06 | 32.99 |
| REOPOLD | 34.27 | 25.83 | 63.90 | 19.43 | 34.47 | 35.58 |
| TrOPD | 36.04 | 27.60 | 70.93 | 22.29 | 31.19 | 37.61 |
Table 4: Performance with Qwen3‑SFT‑1.7B student (teacher: Qwen3‑Nemotron‑4B, multi‑domain)
| Method | AIME 24 | AIME 25 | AMC 23 | GPQA Diamond | MMLU‑Redux | IFBench | LiveCodeBench v6 | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen3‑SFT‑1.7B | 35.41 | 26.45 | 68.90 | 25.25 | 66.60 | 26.19 | 30.29 | 39.87 |
| OPD | 48.02 | 40.72 | 81.79 | 29.80 | 68.60 | 37.07 | 32.00 | 48.29 |
| EOPD | 47.08 | 40.83 | 81.32 | 33.84 | 68.26 | 36.39 | 34.29 | 48.86 |
| Entropy OPD | 43.54 | 42.70 | 79.53 | 29.92 | 68.51 | 38.78 | 33.71 | 48.10 |
| REOPOLD | 45.62 | 42.29 | 81.64 | 30.56 | 68.30 | 36.05 | 35.43 | 48.56 |
| TrOPD | 52.08 | 44.06 | 83.04 | 35.98 | 68.74 | 42.18 | 36.00 | 51.73 |
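TrOPD achieves the highest average score in every teacher–student configuration, with average gains of +3.44 to +4.62 points over OPD and +1.84 to +3.17 points over REOPOLD (computed from the Avg. columns of Tables 3 and 4). It is also the best distilled model on nearly every individual benchmark; the exception is GPQA Diamond in the multi‑domain setting of Table 3, where REOPOLD scores 34.47 against 31.19 for TrOPD.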
Ablation Studies
Table 5: Ablation on outlier estimation and off‑policy guidance (math domain)
| Method | Outlier Objective | AIME 24 | AIME 25 | AMC 23 | Avg. |
|---|---|---|---|---|---|
| DeepSeek‑1.5B | – | 28.64 | 24.16 | 71.01 | 41.27 |
| OPD | | 35.83 | 29.16 | 75.39 | 46.79 |
| + Mask Outlier | 0 | 37.08 | 30.62 | 75.46 | 47.72 |
| + Clip Outlier | | 36.97 | 30.83 | 75.78 | 47.86 |
| + Full FKL | | 0.00 | 0.00 | 4.21 | 1.40 |
| + FKL Outlier | | 39.16 | 29.89 | 77.96 | 49.00 |
| + Off‑Policy Guidance | | | | | |
| TrOPD Mask | 0 | 40.10 | 30.41 | 75.85 | 48.79 |
| TrOPD Clip | | 37.39 | 31.77 | 77.03 | 48.73 |
| TrOPD FKL | | 38.54 | 32.50 | 78.51 | 49.85 |
Key findings:
- Applying FKL only to outlier regions (FKL Outlier) outperforms masking or clipping.
- Off‑policy guidance further improves all variants.
- The full TrOPD (FKL outlier + off‑policy guidance) achieves the highest average score.
Comparison with Concurrent Work AOPD
Table 6: TrOPD vs. AOPD (math domain)
| Method | AIME 24 | AIME 25 | AMC 23 | LiveCodeBench v6 | GPQA Diamond | Avg. |
|---|---|---|---|---|---|---|
| DeepSeek‑Qwen2.5‑1.5B | 28.64 | 24.16 | 71.01 | 15.43 | 34.22 | 34.69 |
| OPD | 35.83 | 29.16 | 75.39 | 17.14 | 28.03 | 37.11 |
| AOPD | 39.89 | 30.00 | 77.18 | 20.57 | 31.31 | 39.79 |
| TrOPD | 38.54 | 32.50 | 77.03 | 18.86 | 36.24 | 40.63 |
| TrOPD + AOPD | 42.08 | 31.87 | 78.20 | 21.71 | 34.47 | 41.67 |
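On average, TrOPD outperforms AOPD (40.63 vs. 39.79), although AOPD is ahead on AIME 24, AMC 23, and LiveCodeBench v6. Combining the two yields a further gain (41.67 average), which suggests the methods are orthogonal.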
Theoretical and Practical Implications
- Theoretical significance: TrOPD provides a principled solution to the supervision reliability problem in OPD by partitioning the token space into trust and outlier regions based on the teacher‑student agreement ratio. This mitigates the instability of the single‑sample RKL estimator under distribution mismatch without discarding informative signals.
- Practical impact: TrOPD enables stable and efficient post‑training of small reasoning models (SRMs), achieving state‑of‑the‑art performance across multiple domains. The method is memory‑efficient (a single per‑token term for the trust region, with the larger top‑$k$ term needed only for outlier tokens) and works with long chain‑of‑thought reasoning.
- Complementarity: TrOPD is orthogonal to other OPD enhancements (e.g., AOPD), suggesting that combining trust‑region learning with other strategies is a promising direction.
Conclusion
TrOPD (Trust Region On‑Policy Distillation) is a reliable and stable framework for reasoning‑oriented OPD. It uses an adaptive trust region to suppress unreliable policy gradients, a top‑$k$ forward KL estimator to preserve informative outlier supervision, and off‑policy guidance to encourage exploration toward teacher‑supported trajectories. Extensive experiments on mathematics, code, instruction following, and STEM benchmarks show consistent and substantial improvements over existing OPD methods.
Limitations: The work focuses on post‑training with specific student models (DeepSeek‑Qwen2.5‑1.5B, Qwen3‑SFT‑1.7B) and does not explore pre‑training or mid‑training stages that could further boost reasoning performance. Future work should investigate multi‑stage training and practical deployment of small reasoning models.