Visual Summary | Trust Region On-Policy Distillation

Summary (Overview)

Trust Region On-Policy Distillation (TrOPD) is proposed to stabilize on-policy distillation (OPD) for large language models (LLMs) when teacher and student distributions diverge substantially.
TrOPD partitions student-generated tokens into a trust region (where teacher supervision is reliable) and outliers (where it is unreliable), using an adaptive threshold based on the decoding agreement ratio $P_{\text{trust}}(x) = \min(\pi_T(x)/\pi_S(x), 1)$.
For outlier tokens, TrOPD employs a top‑$k$ forward KL (FKL) estimator to preserve informative signals while avoiding unreliable policy gradients.
Off‑policy guidance is introduced: the student continues generation from teacher‑generated prefixes, using forward KL to imitate teacher trajectories, encouraging exploration toward reliable regions.
Extensive experiments on mathematical reasoning (AIME 2024/2025, AMC 2023), code generation (LiveCodeBench), instruction following (IFBench), and STEM (GPQA Diamond) show that TrOPD consistently outperforms state‑of‑the‑art OPD baselines (OPD, EOPD, REOPOLD) by +3.34 to +6.18 points on average.

Introduction and Theoretical Foundation

On‑Policy Distillation (OPD) trains a student model on its own generated trajectories to mitigate the exposure bias inherent in off‑policy distillation. The typical objective uses reverse KL divergence (RKL):

D_{\text{KL}}(\pi_S \parallel \pi_T) = \mathbb{E}_{x \sim \pi_S} \left[ \log \frac{\pi_S(x)}{\pi_T(x)} \right],

whose gradient takes a policy‑gradient form: the student is rewarded for sequences assigned high probability by the teacher. However, when the teacher and student distributions diverge substantially, student‑generated tokens may fall into low‑probability regions of the teacher, leading to extreme policy‑gradient outliers (e.g., the per‑token reward $\log \frac{\pi_T(x)}{\pi_S(x)} \to -\infty$ as $\pi_T(x) \to 0$). This destabilizes training and limits final performance.
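The policy‑gradient form follows from the log‑derivative trick. A short derivation, writing $\pi_\theta$ for the student $\pi_S$ with parameters $\theta$ (notation introduced here for illustration):

```latex
\begin{aligned}
\nabla_\theta D_{\text{KL}}(\pi_\theta \parallel \pi_T)
&= \nabla_\theta \sum_x \pi_\theta(x) \log \frac{\pi_\theta(x)}{\pi_T(x)} \\
&= \sum_x \nabla_\theta \pi_\theta(x) \, \log \frac{\pi_\theta(x)}{\pi_T(x)}
   + \underbrace{\sum_x \nabla_\theta \pi_\theta(x)}_{=\, \nabla_\theta 1 \,=\, 0} \\
&= \mathbb{E}_{x \sim \pi_\theta} \left[ \log \frac{\pi_\theta(x)}{\pi_T(x)} \, \nabla_\theta \log \pi_\theta(x) \right].
\end{aligned}
```

Minimizing the reverse KL is therefore gradient ascent with per‑sample reward $\log \frac{\pi_T(x)}{\pi_\theta(x)}$, which is unbounded below as $\pi_T(x) \to 0$.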

For reasoning models, full‑vocabulary KL divergence is prohibitively expensive due to $O(n \cdot k)$ memory ($n$ = sequence length, $k$ = vocabulary size). Recent work uses the $K_1$ estimator for RKL:

J_{\text{KD}} = -\mathbb{E}_{x \sim \pi_S} \left[ \log \frac{\pi_S(x)}{\pi_T(x)} \right],

which reduces memory to $O(n)$ but suffers from two key issues:

Significant policy‑gradient outliers when $\pi_T(x) \approx 0$.
Low‑quality student generations that limit the effective optimization space.
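The first issue can be seen on a toy example. The sketch below uses a made‑up three‑token vocabulary (the probabilities and the function names `k1_reward` and `reverse_kl` are illustrative assumptions, not from the paper) to show that the $K_1$ reward matches the exact reverse KL in expectation, yet becomes strongly negative on a single token the teacher considers nearly impossible.

```python
import math

def k1_reward(p_teacher: float, p_student: float) -> float:
    """Per-token K1 reward log(pi_T(x) / pi_S(x)) for a token sampled from the student."""
    return math.log(p_teacher) - math.log(p_student)

def reverse_kl(student, teacher) -> float:
    """Exact full-vocabulary KL(pi_S || pi_T) over a toy vocabulary."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

# Toy distributions (made-up numbers): the student puts real mass on token 2,
# while the teacher assigns it almost none.
student = [0.5, 0.3, 0.2]
teacher = [0.6, 0.3999, 0.0001]

# One reward per token the student could sample.
rewards = [k1_reward(t, s) for s, t in zip(student, teacher)]

# Averaged under the student, the K1 reward equals -KL(pi_S || pi_T).
expected_reward = sum(s * r for s, r in zip(student, rewards))
```

Here `rewards[2]` is `log(0.0005)`, far below the other two rewards, so a single sampled token can dominate the update even though the expectation is well behaved.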

Existing mitigation strategies (e.g., entropy‑based token selection, reward clipping) provide only limited correction. This motivates the need for a principled approach to ensure reliable supervision.

Methodology

TrOPD distinguishes three regions in the distillation process (see Table 2):

Region	Policy	Objective	Estimator	Memory
On‑Policy Trust Region	$x \sim \pi_S$	$-\text{KL}(\pi_S \parallel \pi_T)$	$\log \frac{\pi_T(x)}{\pi_S(x)}$	$O(n)$
On‑Policy Outlier	$x \sim \pi_S$	$-\text{KL}(\pi_T \parallel \pi_S)$	$\sum_{v \in V_T^{(k)}} \pi_T(v) \log \frac{\pi_S(v)}{\pi_T(v)}$	$O(nk)$
Off‑Policy Guidance	$x \sim \pi_T$	$-\beta \text{KL}(\pi_T \parallel \pi_S)$	$\beta \log \frac{\pi_S(x)}{\pi_T(x)}$	$O(n)$

1. Adaptive Trust Region

For each token sampled from the student, the probability of being in the trust region is:

P_{\text{trust}}(x) = \min\left( \frac{\pi_T(x)}{\pi_S(x)}, 1 \right),

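inspired by speculative decoding’s acceptance probability. A token is kept in the trust region with probability $P_{\text{trust}}(x)$ and treated as an outlier otherwise, so tokens with $\pi_T(x) \geq \pi_S(x)$ are always trusted and only tokens with $P_{\text{trust}}(x) < 1$ can become outliers.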
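A minimal sketch of the trust‑region assignment, assuming membership is drawn per token in the same way speculative decoding accepts a draft token (accept with probability $P_{\text{trust}}$). The sampling step, the function names, and the example probabilities are assumptions for illustration; the text above only specifies the formula for $P_{\text{trust}}$.

```python
import random

def p_trust(p_teacher: float, p_student: float) -> float:
    """P_trust(x) = min(pi_T(x) / pi_S(x), 1) for a student-sampled token."""
    return min(p_teacher / p_student, 1.0)

def sample_outlier_mask(p_teacher, p_student, rng):
    """Return 1 for tokens treated as outliers and 0 for trust-region tokens.

    Assumption: each token is accepted into the trust region with
    probability P_trust, mirroring speculative-decoding acceptance.
    """
    mask = []
    for t, s in zip(p_teacher, p_student):
        accepted = rng.random() < p_trust(t, s)
        mask.append(0 if accepted else 1)
    return mask

rng = random.Random(0)  # fixed seed so the sketch is reproducible
teacher_probs = [0.40, 0.30, 0.0002]  # pi_T at each student-sampled token
student_probs = [0.25, 0.30, 0.2000]  # pi_S at the same tokens
mask = sample_outlier_mask(teacher_probs, student_probs, rng)
```

The first two tokens have $P_{\text{trust}} = 1$ and are always trusted; the third has $P_{\text{trust}} = 0.001$ and is almost always flagged as an outlier.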

2. Outlier Estimation

For outlier tokens, instead of masking or clipping, TrOPD uses a top‑$k$ forward KL (FKL) objective:

J_{\text{FKL}}^x = -\mathbb{M}_x \sum_{v \in V_T^k} \pi_T(v) \log \frac{\pi_T(v)}{\pi_S(v)},

where $V_T^k = \text{TopK}(\pi_T)$ is the set of the $k$ tokens the teacher rates most probable and $\mathbb{M}_x$ is the mask that selects outlier tokens. This preserves informative teacher‑supported tokens while avoiding unreliable reverse‑KL gradients.
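The top‑$k$ sum can be sketched in a few lines. This is an illustrative pure‑Python version over a toy vocabulary (the function name `topk_forward_kl` and the numbers are assumptions); it assumes the student probabilities are strictly positive, as they would be after a softmax.

```python
import math

def topk_forward_kl(teacher, student, k: int) -> float:
    """Sum of pi_T(v) * log(pi_T(v) / pi_S(v)) over the teacher's top-k tokens."""
    # Indices of the k tokens with the highest teacher probability.
    top = sorted(range(len(teacher)), key=lambda v: teacher[v], reverse=True)[:k]
    # Skip zero-probability teacher entries, which contribute nothing to the sum.
    return sum(
        teacher[v] * math.log(teacher[v] / student[v])
        for v in top
        if teacher[v] > 0
    )

# Toy next-token distributions at one outlier position (made-up numbers).
teacher = [0.70, 0.20, 0.05, 0.05]
student = [0.10, 0.20, 0.30, 0.40]

fkl_top2 = topk_forward_kl(teacher, student, k=2)
objective = -fkl_top2  # J_FKL for a token whose outlier mask is 1
```

Only the teacher's two most probable tokens enter the sum, so the signal pulls the student toward tokens the teacher actually supports and ignores the tail.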

3. Off‑Policy Guidance

To encourage the student to generate within teacher‑verifiable regions, the student continues generation from a teacher‑generated prefix $x_{[:l]}$:

J_x = -\beta \mathbb{I}[x \sim \pi_T] \log \frac{\pi_T(x)}{\pi_S(x)} + J_{\text{On}}^{x_{[l:]}},

where $\beta$ is a small coefficient (default 0.001). The off‑policy trajectory length is gradually annealed to zero via a cosine schedule, so generation becomes fully on‑policy by the end of training.
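The exact form of the cosine schedule is not given here. One common choice, shown as a sketch under that assumption, is half a cosine period that decays the teacher prefix length from a maximum to zero; the function name `off_policy_prefix_len` and the parameters `total_steps` and `max_len` are hypothetical.

```python
import math

def off_policy_prefix_len(step: int, total_steps: int, max_len: int) -> int:
    """Cosine-anneal the teacher prefix length from max_len (step 0) to 0 (final step)."""
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Half-cosine decay: 1 at progress 0, 0 at progress 1.
    scale = 0.5 * (1.0 + math.cos(math.pi * progress))
    return round(max_len * scale)

# Prefix length l at the start, midpoint, and end of a 100-step run.
schedule = [off_policy_prefix_len(s, total_steps=100, max_len=512) for s in (0, 50, 100)]
```

With this choice the prefix shrinks slowly at first, fastest around the midpoint, and reaches zero at the last step, so late training is fully on‑policy.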

Unified Optimization Objective

The overall TrOPD objective is:

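J_{\text{TrOPD}}^x = -\mathbb{I}[x \sim \pi_S] \mathbb{M}_x \sum_{v \in V_T^k} \pi_{T,v} \log \frac{\pi_{T,v}}{\pi_{S,v}} - \mathbb{I}[x \sim \pi_S] (1 - \mathbb{M}_x) \log \frac{\pi_S}{\pi_T} - \beta \mathbb{I}[x \sim \pi_T] \log \frac{\pi_T}{\pi_S}.

The three terms are, in order, the top‑$k$ FKL on on‑policy outliers ($\mathbb{M}_x = 1$), the reverse‑KL $K_1$ estimator on on‑policy trust‑region tokens ($\mathbb{M}_x = 0$), and the off‑policy guidance on teacher‑generated tokens.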
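As a rough illustration of how the three regions of Table 2 select a per‑token term, the sketch below returns the scalar value of the objective (to be maximized) for one token. The function name, argument names, and example numbers are assumptions. It only evaluates the terms; in training, the trust‑region term acts as a reward in a policy‑gradient update, which a plain scalar function does not capture.

```python
import math

def tropd_token_objective(p_teacher_x, p_student_x, from_teacher, is_outlier,
                          topk_teacher=None, topk_student=None, beta=0.001):
    """Per-token objective value for the three regions in Table 2."""
    if from_teacher:
        # Off-policy guidance on teacher-generated tokens: beta * log(pi_S(x) / pi_T(x)).
        return beta * math.log(p_student_x / p_teacher_x)
    if is_outlier:
        # On-policy outlier: negative forward KL over the teacher's top-k tokens.
        return -sum(
            t * math.log(t / s)
            for t, s in zip(topk_teacher, topk_student)
            if t > 0
        )
    # On-policy trust region: K1 reverse-KL estimator log(pi_T(x) / pi_S(x)).
    return math.log(p_teacher_x / p_student_x)

trust_value = tropd_token_objective(0.4, 0.2, from_teacher=False, is_outlier=False)
guide_value = tropd_token_objective(0.5, 0.25, from_teacher=True, is_outlier=False)
outlier_value = tropd_token_objective(
    0.0001, 0.2, from_teacher=False, is_outlier=True,
    topk_teacher=[0.7, 0.2], topk_student=[0.1, 0.2],
)
```

For the outlier token, the value depends only on the top‑$k$ distributions and not on the near‑zero teacher probability of the sampled token, which is what keeps it bounded.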

Empirical Validation / Results

Main Results

Table 3: Performance with DeepSeek‑R1‑Distill‑Qwen‑1.5B student (single‑domain teacher: Skywork‑OR1‑Math‑7B; multi‑domain teacher: Skywork‑OR1‑7B)

Method	AIME 24	AIME 25	AMC 23	LiveCodeBench v6	GPQA Diamond	Avg.
DeepSeek‑Qwen2.5‑1.5B	28.64	24.16	71.01	15.43	34.22	34.69
Single‑Domain Distillation
Teacher	66.14	51.87	92.34	34.86	47.22	58.48
OPD	35.83	29.16	75.39	17.14	28.03	37.11
EOPD	36.97	29.79	75.23	15.43	32.58	38.00
Entropy OPD 20%	35.52	29.06	73.82	14.29	31.82	36.90
REOPOLD 2Stage	34.47	29.89	73.35	16.57	30.18	36.89
REOPOLD	36.97	30.83	75.78	18.29	32.07	38.79
TrOPD	38.54	32.50	77.03	18.86	36.24	40.63
Multi‑Domain Distillation
Teacher	65.62	52.81	91.79	36.57	47.22	58.80
OPD	30.10	21.66	61.56	20.57	31.06	32.99
REOPOLD	34.27	25.83	63.90	19.43	34.47	35.58
TrOPD	36.04	27.60	70.93	22.29	31.19	37.61

Table 4: Performance with Qwen3‑SFT‑1.7B student (teacher: Qwen3‑Nemotron‑4B, multi‑domain)

Method	AIME 24	AIME 25	AMC 23	GPQA Diamond	MMLU‑Redux	IFBench	LiveCodeBench v6	Avg.
Qwen3‑SFT‑1.7B	35.41	26.45	68.90	25.25	66.60	26.19	30.29	39.87
OPD	48.02	40.72	81.79	29.80	68.60	37.07	32.00	48.29
EOPD	47.08	40.83	81.32	33.84	68.26	36.39	34.29	48.86
Entropy OPD	43.54	42.70	79.53	29.92	68.51	38.78	33.71	48.10
REOPOLD	45.62	42.29	81.64	30.56	68.30	36.05	35.43	48.56
TrOPD	52.08	44.06	83.04	35.98	68.74	42.18	36.00	51.73

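TrOPD achieves the highest average score among the distilled students in every teacher–student configuration, with average gains of +3.44 to +4.62 points over OPD and +1.84 to +3.17 points over REOPOLD. It is also the best student on every individual benchmark except GPQA Diamond in the multi‑domain setting of Table 3, where REOPOLD scores higher (34.47 vs. 31.19).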
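The average gains can be recomputed directly from the Avg. columns of Tables 3 and 4. The snippet below copies those averages and takes the differences; the dictionary layout and the helper name `gain_over` are illustrative.

```python
# Avg. column values copied from Tables 3 and 4.
averages = {
    "Table 3, single-domain": {"OPD": 37.11, "REOPOLD": 38.79, "TrOPD": 40.63},
    "Table 3, multi-domain": {"OPD": 32.99, "REOPOLD": 35.58, "TrOPD": 37.61},
    "Table 4, multi-domain": {"OPD": 48.29, "REOPOLD": 48.56, "TrOPD": 51.73},
}

def gain_over(baseline: str) -> dict:
    """TrOPD average minus the baseline average, per configuration."""
    return {
        setting: round(scores["TrOPD"] - scores[baseline], 2)
        for setting, scores in averages.items()
    }

gains_opd = gain_over("OPD")
gains_reopold = gain_over("REOPOLD")
```

With these inputs the gains over OPD range from 3.44 to 4.62 points and the gains over REOPOLD from 1.84 to 3.17 points.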

Ablation Studies

Table 5: Ablation on outlier estimation and off‑policy guidance (math domain)

Method	Outlier Objective	AIME 24	AIME 25	AMC 23	Avg.
DeepSeek‑1.5B	–	28.64	24.16	71.01	41.27
OPD	$\log \pi_T / \pi_S$	35.83	29.16	75.39	46.79
+ Mask Outlier	0	37.08	30.62	75.46	47.72
+ Clip Outlier	$\tau$	36.97	30.83	75.78	47.86
+ Full FKL	$\sum_{v \in V_T^k} \pi_{T,v} \log(\pi_{T,v}/\pi_{S,v})$	0.00	0.00	4.21	1.40
+ FKL Outlier	$\sum_{v \in V_T^k} \pi_{T,v} \log(\pi_{T,v}/\pi_{S,v})$	39.16	29.89	77.96	49.00
+ Off‑Policy Guidance
TrOPD Mask	0	40.10	30.41	75.85	48.79
TrOPD Clip	$\tau$	37.39	31.77	77.03	48.73
TrOPD FKL	$\sum_{v \in V_T^k} \pi_{T,v} \log(\pi_{T,v}/\pi_{S,v})$	38.54	32.50	78.51	49.85

Key findings:

Applying FKL only to outlier regions (FKL Outlier) outperforms masking or clipping on average (49.00 vs. 47.72 and 47.86), whereas the Full FKL variant collapses (1.40 average).
Off‑policy guidance further improves all variants.
The full TrOPD (FKL outlier + off‑policy guidance) achieves the highest average score.

Comparison with Concurrent Work AOPD

Table 6: TrOPD vs. AOPD (math domain)

Method	AIME 24	AIME 25	AMC 23	LiveCodeBench v6	GPQA Diamond	Avg.
DeepSeek‑Qwen2.5‑1.5B	28.64	24.16	71.01	15.43	34.22	34.69
OPD	35.83	29.16	75.39	17.14	28.03	37.11
AOPD	39.89	30.00	77.18	20.57	31.31	39.79
TrOPD	38.54	32.50	77.03	18.86	36.24	40.63
TrOPD + AOPD	42.08	31.87	78.20	21.71	34.47	41.67

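TrOPD outperforms AOPD on average (40.63 vs. 39.79), although AOPD is stronger on AIME 24, AMC 23, and LiveCodeBench v6. Their combination yields a further gain (41.67 average), suggesting that the two methods are complementary.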

Theoretical and Practical Implications

Theoretical significance: TrOPD provides a principled solution to the supervision reliability problem in OPD by partitioning the token space into trust and outlier regions based on the teacher‑student agreement ratio. This mitigates the instability of the $K_1$ estimator under distribution mismatch without discarding informative signals.
Practical impact: TrOPD enables stable and efficient post‑training of small reasoning models (SRMs), achieving state‑of‑the‑art performance across multiple domains. The method is memory‑efficient ($O(n)$ for trust region, $O(nk)$ only for outlier tokens) and works with long‑chain‑of‑thought reasoning.
Complementarity: TrOPD is orthogonal to other OPD enhancements (e.g., AOPD), suggesting that combining trust‑region learning with other strategies is a promising direction.

Conclusion

TrOPD (Trust Region On‑Policy Distillation) is a reliable and stable framework for reasoning‑oriented OPD. It uses an adaptive trust region to suppress unreliable policy gradients, a top‑$k$ forward KL estimator to preserve informative outlier supervision, and off‑policy guidance to encourage exploration toward teacher‑supported trajectories. Extensive experiments on mathematics, code, instruction following, and STEM benchmarks show consistent and substantial improvements over existing OPD methods.

Limitations: The work focuses on post‑training with specific student models (DeepSeek‑Qwen2.5‑1.5B, Qwen3‑SFT‑1.7B) and does not explore pre‑training or mid‑training stages that could further boost reasoning performance. Future work should investigate multi‑stage training and practical deployment of small reasoning models.