Summary (Overview)

  • Trust Region On-Policy Distillation (TrOPD) is proposed to stabilize on-policy distillation (OPD) for large language models (LLMs) when teacher and student distributions diverge substantially.
  • TrOPD partitions student-generated tokens into a trust region (where teacher supervision is reliable) and outliers (where it is unreliable), using an adaptive threshold based on the decoding agreement ratio $P_{\text{trust}}(x) = \min(\pi_T(x)/\pi_S(x), 1)$.
  • For outlier tokens, TrOPD employs a top‑$k$ forward KL (FKL) estimator to preserve informative signals while avoiding unreliable policy gradients.
  • Off‑policy guidance is introduced: the student continues generation from teacher‑generated prefixes, using forward KL to imitate teacher trajectories, encouraging exploration toward reliable regions.
  • Extensive experiments on mathematical reasoning (AIME 2024/2025, AMC 2023), code generation (LiveCodeBench), instruction following (IFBench), and STEM (GPQA Diamond) show that TrOPD consistently outperforms state‑of‑the‑art OPD baselines (OPD, EOPD, REOPOLD) by +3.34 to +6.18 points on average.

Introduction and Theoretical Foundation

On‑Policy Distillation (OPD) trains a student model on its own generated trajectories to mitigate the exposure bias inherent in off‑policy distillation. The typical objective uses reverse KL divergence (RKL):

$$D_{\text{KL}}(\pi_S \parallel \pi_T) = \mathbb{E}_{x \sim \pi_S} \left[ \log \frac{\pi_S(x)}{\pi_T(x)} \right],$$

whose gradient takes a policy‑gradient form: the student is rewarded for sequences assigned high probability by the teacher. However, when the teacher and student distributions diverge substantially, student‑generated tokens may fall into low‑probability regions of the teacher, leading to extreme policy‑gradient outliers (e.g., $\nabla J \approx -\infty$ when $\pi_T(x) \to 0$). This destabilizes training and limits final performance.

For reasoning models, full‑vocabulary KL divergence is prohibitively expensive due to $O(n \cdot k)$ memory ($n$ = sequence length, $k$ = vocabulary size). Recent work uses the $K_1$ estimator for RKL:

$$J_{\text{KD}} = -\mathbb{E}_{x \sim \pi_S} \left[ \log \frac{\pi_S}{\pi_T} \right],$$

which reduces memory to $O(n)$ but suffers from two key issues:

  1. Significant policy‑gradient outliers when $\pi_T(x) \approx 0$.
  2. Low‑quality student generations that limit the effective optimization space.
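The first issue can be seen with a few numbers. The sketch below is a minimal illustration, not the authors' implementation: the function names and the three token probabilities are hypothetical, and it only evaluates the per-token $K_1$ signal $\log(\pi_T/\pi_S)$ on tokens assumed to be sampled from the student.

```python
import math

def k1_token_rewards(student_logps, teacher_logps):
    """Per-token K1 signal log(pi_T / pi_S) for tokens sampled from the student."""
    return [t - s for s, t in zip(student_logps, teacher_logps)]

def k1_objective(student_logps, teacher_logps):
    """Monte Carlo estimate of J_KD = -E[log(pi_S / pi_T)], averaged over tokens."""
    rewards = k1_token_rewards(student_logps, teacher_logps)
    return sum(rewards) / len(rewards)

# Hypothetical probabilities of three student-sampled tokens under each model.
student_probs = [0.60, 0.50, 0.40]
teacher_probs = [0.55, 0.45, 1e-9]  # the teacher almost rules out the third token

student_logps = [math.log(p) for p in student_probs]
teacher_logps = [math.log(p) for p in teacher_probs]

rewards = k1_token_rewards(student_logps, teacher_logps)
print([round(r, 2) for r in rewards])
print(round(k1_objective(student_logps, teacher_logps), 2))
```

The first two tokens contribute signals close to zero, while the third, where $\pi_T(x) \approx 0$, contributes a large negative value that dominates the average. This is the outlier behavior that the trust region is meant to contain.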

Existing mitigation strategies (e.g., entropy‑based token selection, reward clipping) provide only limited correction. This motivates the need for a principled approach to ensure reliable supervision.

Methodology

TrOPD distinguishes three regions in the distillation process (see Table 2):

| Region | Policy | Objective | Estimator | Memory |
|---|---|---|---|---|
| On‑Policy Trust Region | $x \sim \pi_S$ | $-\text{KL}(\pi_S \parallel \pi_T)$ | $\log \frac{\pi_T(x)}{\pi_S(x)}$ | $O(n)$ |
| On‑Policy Outlier | $x \sim \pi_S$ | $-\text{KL}(\pi_T \parallel \pi_S)$ | $\sum_{v \in V_T^{(k)}} \pi_T(v) \log \frac{\pi_S(v)}{\pi_T(v)}$ | $O(nk)$ |
| Off‑Policy Guidance | $x \sim \pi_T$ | $-\beta\,\text{KL}(\pi_T \parallel \pi_S)$ | $\beta \log \frac{\pi_S(x)}{\pi_T(x)}$ | $O(n)$ |

1. Adaptive Trust Region

For each token sampled from the student, the probability of being in the trust region is:

$$P_{\text{trust}}(x) = \min\left( \frac{\pi_T(x)}{\pi_S(x)}, 1 \right),$$

inspired by speculative decoding’s acceptance probability. Tokens with $P_{\text{trust}}(x) < 1$ are considered outliers.
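A minimal sketch of this partition follows, computed in log space for numerical safety. The function names and probabilities are hypothetical. `is_outlier` follows the deterministic rule stated above, while `sample_in_trust_region` is an assumed stochastic reading borrowed from speculative decoding (accept with probability $P_{\text{trust}}$). The text does not say which of the two the method uses.

```python
import math
import random

def trust_probability(teacher_logp, student_logp):
    """P_trust(x) = min(pi_T(x) / pi_S(x), 1), computed in log space."""
    return math.exp(min(teacher_logp - student_logp, 0.0))

def is_outlier(teacher_logp, student_logp):
    """Deterministic rule from the text: outlier when P_trust(x) < 1."""
    return trust_probability(teacher_logp, student_logp) < 1.0

def sample_in_trust_region(teacher_logp, student_logp, rng):
    """Assumed stochastic variant: accept the token with probability P_trust(x)."""
    return rng.random() < trust_probability(teacher_logp, student_logp)

# Hypothetical (teacher, student) probabilities for three student-sampled tokens.
pairs = [(0.50, 0.40), (0.30, 0.30), (0.02, 0.80)]
p_trust = [trust_probability(math.log(t), math.log(s)) for t, s in pairs]
outliers = [is_outlier(math.log(t), math.log(s)) for t, s in pairs]

print([round(p, 3) for p in p_trust])
print(outliers)

rng = random.Random(0)  # fixed seed so the stochastic variant is repeatable
print([sample_in_trust_region(math.log(t), math.log(s), rng) for t, s in pairs])
```

Only the third token, which the student prefers far more strongly than the teacher, falls below $P_{\text{trust}} = 1$ and is flagged as an outlier under the deterministic rule.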

2. Outlier Estimation

For outlier tokens, instead of masking or clipping, TrOPD uses a top‑$k$ forward KL (FKL) objective:

$$J_{\text{FKL}}^x = -\mathbb{M}_x \sum_{v \in V_T^k} \pi_T(v) \log \frac{\pi_T(v)}{\pi_S(v)},$$

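where $V_T^k = \text{TopK}(\pi_T)$ is the set of the teacher's $k$ most probable tokens and $\mathbb{M}_x$ is the mask that selects outlier tokens. This preserves informative teacher‑supported tokens while avoiding unreliable reverse‑KL gradients.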
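For a single outlier position, the top‑$k$ sum can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the function name and the four-token vocabulary are hypothetical, the distributions are plain Python lists, and the student is assumed to assign nonzero probability to every token (as a softmax output would).

```python
import math

def top_k_forward_kl_objective(teacher_probs, student_probs, k):
    """J_FKL at one position: -sum over TopK(pi_T) of pi_T(v) * log(pi_T(v) / pi_S(v))."""
    # Indices of the k tokens the teacher considers most likely.
    top_k = sorted(range(len(teacher_probs)),
                   key=lambda v: teacher_probs[v], reverse=True)[:k]
    total = 0.0
    for v in top_k:
        if teacher_probs[v] > 0.0:  # a zero-probability token contributes nothing
            total += teacher_probs[v] * math.log(teacher_probs[v] / student_probs[v])
    return -total

# Hypothetical 4-token vocabulary: the student puts most mass on token 2,
# which the teacher considers unlikely.
teacher = [0.70, 0.20, 0.05, 0.05]
student = [0.10, 0.10, 0.70, 0.10]

j_fkl = top_k_forward_kl_objective(teacher, student, k=2)
print(round(j_fkl, 4))
```

Maximizing this quantity raises the student's probability on the teacher's top‑$k$ tokens. The student's own sampled token does not appear in the sum unless it is also one of the teacher's top‑$k$, so no large negative reward is attached to it.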

3. Off‑Policy Guidance

To encourage the student to generate within teacher‑verifiable regions, the student continues generation from a teacher‑generated prefix $x_{[:l]}$:

$$J_x = -\beta\, \mathbb{I}[x \sim \pi_T] \log \frac{\pi_T}{\pi_S} + J_{\text{On}}^{x_{[l:]}},$$

where $\beta$ is a small coefficient (default 0.001). The off‑policy trajectory length is gradually annealed to zero via a cosine schedule, so generation becomes fully on‑policy by the end of training.
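The exact form of the schedule is not given here. The sketch below assumes a standard half-cosine decay from a hypothetical initial prefix length to zero. The function name `prefix_length` and the values 1024 and 1000 are illustrative only.

```python
import math

def prefix_length(step, total_steps, initial_length):
    """Assumed half-cosine decay of the teacher-prefix length from initial_length to 0."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    scale = 0.5 * (1.0 + math.cos(math.pi * progress))
    return int(round(initial_length * scale))

# Hypothetical run: 1000 training steps, prefixes start at 1024 tokens.
schedule = [prefix_length(step, 1000, 1024) for step in range(0, 1001, 100)]
print(schedule)
```

Under this assumption the prefix shrinks slowly at first, fastest around the midpoint, and reaches zero at the final step, at which point every token is sampled from the student.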

Unified Optimization Objective

The overall TrOPD objective is:

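$$J_{\text{TrOPD}}^x = -\mathbb{I}[x \sim \pi_S]\, \mathbb{M}_x \sum_{v \in V_T^k} \pi_{T,v} \log \frac{\pi_{T,v}}{\pi_{S,v}} - \mathbb{I}[x \sim \pi_S]\, (1 - \mathbb{M}_x) \log \frac{\pi_S}{\pi_T} - \beta\, \mathbb{I}[x \sim \pi_T] \log \frac{\pi_T}{\pi_S}.$$

The three terms correspond to the rows of Table 2: top‑$k$ forward KL on student‑sampled outlier tokens ($\mathbb{M}_x = 1$), the reverse‑KL estimator on student‑sampled trust‑region tokens ($\mathbb{M}_x = 0$), and $\beta$‑weighted forward KL on teacher‑sampled prefix tokens.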
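As a rough per-token illustration of how the three regions of Table 2 could be dispatched, consider the sketch below. It is one reading of the objective, not the authors' implementation. The function names, the `source` flag, and the toy distributions are hypothetical, and the trust test uses the deterministic rule $P_{\text{trust}}(x) < 1$.

```python
import math

def top_k_fkl(teacher_probs, student_probs, k):
    """-sum over the teacher's top-k tokens of pi_T(v) * log(pi_T(v) / pi_S(v))."""
    top_k = sorted(range(len(teacher_probs)),
                   key=lambda v: teacher_probs[v], reverse=True)[:k]
    return -sum(teacher_probs[v] * math.log(teacher_probs[v] / student_probs[v])
                for v in top_k if teacher_probs[v] > 0.0)

def tropd_token_objective(token, source, teacher_probs, student_probs, k=2, beta=0.001):
    """Objective (to maximize) for one token, following the three rows of Table 2."""
    p_t, p_s = teacher_probs[token], student_probs[token]
    if source == "teacher":
        # Off-policy guidance: beta * log(pi_S / pi_T) on teacher-sampled prefix tokens.
        return beta * math.log(p_s / p_t)
    if source != "student":
        raise ValueError("source must be 'teacher' or 'student'")
    if p_t >= p_s:
        # Trust region (P_trust = 1): K1 estimator log(pi_T / pi_S).
        return math.log(p_t / p_s)
    # Outlier (P_trust < 1): top-k forward KL instead of the K1 estimator.
    return top_k_fkl(teacher_probs, student_probs, k)

# Hypothetical 4-token vocabulary.
teacher = [0.70, 0.20, 0.05, 0.05]
student = [0.10, 0.10, 0.70, 0.10]

trust = tropd_token_objective(0, "student", teacher, student)    # teacher agrees
outlier = tropd_token_objective(2, "student", teacher, student)  # teacher disagrees
guided = tropd_token_objective(0, "teacher", teacher, student)   # teacher prefix token
print(round(trust, 4), round(outlier, 4), round(guided, 6))
```

In this toy case the outlier token receives a bounded top‑$k$ signal rather than the raw $\log(\pi_T/\pi_S)$ reward, and the guidance term is scaled down by $\beta$ so that it nudges rather than dominates the update.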

Empirical Validation / Results

Main Results

Table 3: Performance with DeepSeek‑R1‑Distill‑Qwen‑1.5B student (single‑domain teacher: Skywork‑OR1‑Math‑7B; multi‑domain teacher: Skywork‑OR1‑7B)

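| Method | AIME 24 | AIME 25 | AMC 23 | LiveCodeBench v6 | GPQA Diamond | Avg. |
|---|---|---|---|---|---|---|
| DeepSeek‑Qwen2.5‑1.5B | 28.64 | 24.16 | 71.01 | 15.43 | 34.22 | 34.69 |
| Single‑Domain Distillation | | | | | | |
| Teacher | 66.14 | 51.87 | 92.34 | 34.86 | 47.22 | 58.48 |
| OPD | 35.83 | 29.16 | 75.39 | 17.14 | 28.03 | 37.11 |
| EOPD | 36.97 | 29.79 | 75.23 | 15.43 | 32.58 | 38.00 |
| Entropy OPD 20% | 35.52 | 29.06 | 73.82 | 14.29 | 31.82 | 36.90 |
| REOPOLD 2Stage | 34.47 | 29.89 | 73.35 | 16.57 | 30.18 | 36.89 |
| REOPOLD | 36.97 | 30.83 | 75.78 | 18.29 | 32.07 | 38.79 |
| TrOPD | 38.54 | 32.50 | 77.03 | 18.86 | 36.24 | 40.63 |
| Multi‑Domain Distillation | | | | | | |
| Teacher | 65.62 | 52.81 | 91.79 | 36.57 | 47.22 | 58.80 |
| OPD | 30.10 | 21.66 | 61.56 | 20.57 | 31.06 | 32.99 |
| REOPOLD | 34.27 | 25.83 | 63.90 | 19.43 | 34.47 | 35.58 |
| TrOPD | 36.04 | 27.60 | 70.93 | 22.29 | 31.19 | 37.61 |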

Table 4: Performance with Qwen3‑SFT‑1.7B student (teacher: Qwen3‑Nemotron‑4B, multi‑domain)

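| Method | AIME 24 | AIME 25 | AMC 23 | GPQA dia. | MMLU red. | IFBench | LCB v6 | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen3‑SFT‑1.7B | 35.41 | 26.45 | 68.90 | 25.25 | 66.60 | 26.19 | 30.29 | 39.87 |
| OPD | 48.02 | 40.72 | 81.79 | 29.80 | 68.60 | 37.07 | 32.00 | 48.29 |
| EOPD | 47.08 | 40.83 | 81.32 | 33.84 | 68.26 | 36.39 | 34.29 | 48.86 |
| Entropy OPD | 43.54 | 42.70 | 79.53 | 29.92 | 68.51 | 38.78 | 33.71 | 48.10 |
| REOPOLD | 45.62 | 42.29 | 81.64 | 30.56 | 68.30 | 36.05 | 35.43 | 48.56 |
| TrOPD | 52.08 | 44.06 | 83.04 | 35.98 | 68.74 | 42.18 | 36.00 | 51.73 |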

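TrOPD achieves the highest average score in every teacher–student configuration, with average gains of +3.06 to +4.62 points over OPD and +1.84 to +3.17 points over REOPOLD. It is not the best on every individual benchmark: in the multi‑domain setting of Table 3, REOPOLD scores higher on GPQA Diamond (34.47 vs. 31.19).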
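The quoted ranges can be re-derived from the Avg. columns of Tables 3 to 5. The snippet below is only an arithmetic check. The dictionary keys are informal labels for the table blocks, and the numbers are copied from those tables.

```python
# Avg. columns copied from Tables 3, 4, and 5 (Table 5 has no REOPOLD row).
avg = {
    "Table 3 single-domain": {"OPD": 37.11, "REOPOLD": 38.79, "TrOPD": 40.63},
    "Table 3 multi-domain": {"OPD": 32.99, "REOPOLD": 35.58, "TrOPD": 37.61},
    "Table 4": {"OPD": 48.29, "REOPOLD": 48.56, "TrOPD": 51.73},
    "Table 5 math-only": {"OPD": 46.79, "TrOPD": 49.85},
}

def gains_over(baseline):
    """TrOPD average minus the baseline average, for every block that reports both."""
    return {name: round(row["TrOPD"] - row[baseline], 2)
            for name, row in avg.items() if baseline in row}

over_opd = gains_over("OPD")
over_reopold = gains_over("REOPOLD")
print(min(over_opd.values()), max(over_opd.values()))
print(min(over_reopold.values()), max(over_reopold.values()))
```

The smallest gain over OPD (+3.06) comes from the math-only averages in Table 5, while the largest (+4.62) comes from the multi‑domain block of Table 3.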

Ablation Studies

Table 5: Ablation on outlier estimation and off‑policy guidance (math domain)

| Method | Outlier Objective | AIME 24 | AIME 25 | AMC 23 | Avg. |
|---|---|---|---|---|---|
| DeepSeek‑1.5B | | 28.64 | 24.16 | 71.01 | 41.27 |
| OPD | $\log \pi_T / \pi_S$ | 35.83 | 29.16 | 75.39 | 46.79 |
| + Mask Outlier | 0 | 37.08 | 30.62 | 75.46 | 47.72 |
| + Clip Outlier | $\tau$ | 36.97 | 30.83 | 75.78 | 47.86 |
| + Full FKL | $\sum_{v \in V_T^k} \pi_{T,v} \log(\pi_{T,v}/\pi_{S,v})$ | 0.00 | 0.00 | 4.21 | 1.40 |
| + FKL Outlier | $\sum_{v \in V_T^k} \pi_{T,v} \log(\pi_{T,v}/\pi_{S,v})$ | 39.16 | 29.89 | 77.96 | 49.00 |
| + Off‑Policy Guidance | | | | | |
| TrOPD Mask | 0 | 40.10 | 30.41 | 75.85 | 48.79 |
| TrOPD Clip | $\tau$ | 37.39 | 31.77 | 77.03 | 48.73 |
| TrOPD FKL | $\sum_{v \in V_T^k} \pi_{T,v} \log(\pi_{T,v}/\pi_{S,v})$ | 38.54 | 32.50 | 78.51 | 49.85 |

Key findings:

  • Applying FKL only to outlier regions (FKL Outlier) outperforms masking or clipping.
  • Off‑policy guidance raises the average score of all three variants (mask, clip, and FKL).
  • The full TrOPD (FKL outlier + off‑policy guidance) achieves the highest average score.

Comparison with Concurrent Work AOPD

Table 6: TrOPD vs. AOPD (math domain)

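| Method | AIME 24 | AIME 25 | AMC 23 | LiveCodeBench v6 | GPQA Diamond | Avg. |
|---|---|---|---|---|---|---|
| DeepSeek‑Qwen2.5‑1.5B | 28.64 | 24.16 | 71.01 | 15.43 | 34.22 | 34.69 |
| OPD | 35.83 | 29.16 | 75.39 | 17.14 | 28.03 | 37.11 |
| AOPD | 39.89 | 30.00 | 77.18 | 20.57 | 31.31 | 39.79 |
| TrOPD | 38.54 | 32.50 | 77.03 | 18.86 | 36.24 | 40.63 |
| TrOPD + AOPD | 42.08 | 31.87 | 78.20 | 21.71 | 34.47 | 41.67 |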

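TrOPD achieves a higher average than AOPD (40.63 vs. 39.79), although AOPD leads on AIME 24, AMC 23, and LiveCodeBench v6. Combining the two yields the best average (41.67), which suggests the methods are complementary.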

Theoretical and Practical Implications

  • Theoretical significance: TrOPD provides a principled solution to the supervision reliability problem in OPD by partitioning the token space into trust and outlier regions based on the teacher‑student agreement ratio. This mitigates the instability of the $K_1$ estimator under distribution mismatch without discarding informative signals.
  • Practical impact: TrOPD enables stable and efficient post‑training of small reasoning models (SRMs), achieving state‑of‑the‑art performance across multiple domains. The method is memory‑efficient ($O(n)$ for the trust region, $O(nk)$ only for outlier tokens) and works with long‑chain‑of‑thought reasoning.
  • Complementarity: TrOPD is orthogonal to other OPD enhancements (e.g., AOPD), suggesting that combining trust‑region learning with other strategies is a promising direction.

Conclusion

TrOPD (Trust Region On‑Policy Distillation) is a reliable and stable framework for reasoning‑oriented OPD. It uses an adaptive trust region to suppress unreliable policy gradients, a top‑$k$ forward KL estimator to preserve informative outlier supervision, and off‑policy guidance to encourage exploration toward teacher‑supported trajectories. Extensive experiments on mathematics, code, instruction following, and STEM benchmarks show consistent and substantial improvements over existing OPD methods.

Limitations: The work focuses on post‑training with specific student models (DeepSeek‑Qwen2.5‑1.5B, Qwen3‑SFT‑1.7B) and does not explore pre‑training or mid‑training stages that could further boost reasoning performance. Future work should investigate multi‑stage training and practical deployment of small reasoning models.
