Trust-Region Behavior Blending for On-Policy Distillation

Summary (Overview)

Core Problem: On-policy distillation (OPD) trains a student on prefixes sampled from its own policy, but early student rollouts are often low quality, so teacher supervision on those prefixes carries little usable signal.
Proposed Solution: Trust-Region Behavior Blending (TRB), a warmup method that replaces early student rollouts with a teacher-guided behavior policy, constrained to stay within a student-centered KL trust region.
Key Mechanism: For each prefix, TRB selects the behavior policy $\mu^*$ that minimizes $D_{KL}(\mu \parallel \pi_T)$ subject to $D_{KL}(\mu \parallel \pi_S) \le \epsilon$ , with a closed-form solution: $\mu_\beta(a|h) \propto \pi_S(a|h)^{1-\beta} \pi_T(a|h)^\beta$ .
Implementation: The KL budget $\epsilon$ is linearly annealed to zero over a fixed warmup horizon $K$ , after which training reverts to pure student rollouts.
Empirical Result: TRB achieves the strongest average performance across two math-reasoning distillation settings, outperforming vanilla OPD and other baselines like SKD, Veto, and SFT warmup.

Introduction and Theoretical Foundation

Knowledge distillation transfers capability from a large teacher model to a smaller student. For Large Language Models (LLMs), standard offline distillation suffers from exposure bias: the student is trained on teacher-generated prefixes but must generate its own at inference time. On-Policy Distillation (OPD) addresses this by rolling out the current student and applying teacher supervision on the prefixes it actually visits.

However, early in training, the student's policy is weak, generating low-quality prefixes that may not carry usable teacher signal. This creates a tension: pure student rollouts preserve the target distribution but start poorly, while stronger teacher intervention (e.g., injecting teacher tokens) moves collection off-policy.

Trust-Region Behavior Blending (TRB) is proposed to address this early regime. It modifies the behavior policy used for rollout collection without changing the per-prefix distillation objective. The goal is to provide teacher-guided prefixes early on, while explicitly constraining deviations to remain close to the current student, and then anneal this guidance away.

The theoretical foundation is the standard reverse-KL OPD objective:

\mathcal{L}_{OPD}(\theta) = \mathbb{E}_{h \sim P_{\pi_S}}[D_{KL}(\pi_\theta(\cdot|h) \parallel \pi_T(\cdot|h))]

TRB keeps this loss fixed and only changes the behavior policy $\mu$ used to sample prefixes $h$ .
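The separation between the loss and the behavior policy can be illustrated with a toy sketch. Everything here is an assumption made for illustration: policies are hypothetical callables that map a prefix (a tuple of token ids) to a strictly positive probability vector, and `rollout_loss` is not a function from the paper. The per-prefix term is always the student-versus-teacher reverse KL; only the `behavior` argument decides which prefixes get visited.

```python
import numpy as np

def reverse_kl(student_probs, teacher_probs):
    """D_KL(student || teacher) at one prefix; assumes strictly positive probabilities."""
    return float(np.sum(student_probs * (np.log(student_probs) - np.log(teacher_probs))))

def rollout_loss(student, teacher, behavior, prompt, length, rng):
    """Sample one continuation from `behavior` and average the per-prefix
    reverse KL between student and teacher over the prefixes visited."""
    prefix = tuple(prompt)
    losses = []
    for _ in range(length):
        # The loss term never looks at `behavior`, only at student and teacher.
        losses.append(reverse_kl(student(prefix), teacher(prefix)))
        # The next token (and hence the next prefix) comes from the behavior policy.
        probs = behavior(prefix)
        token = int(rng.choice(len(probs), p=probs))
        prefix += (token,)
    return sum(losses) / len(losses)

# Hypothetical context-independent toy policies over a 3-token vocabulary.
s = np.array([0.5, 0.3, 0.2])
t = np.array([0.2, 0.3, 0.5])
student = lambda prefix: s
teacher = lambda prefix: t

rng = np.random.default_rng(0)
vanilla = rollout_loss(student, teacher, behavior=student, prompt=[0], length=5, rng=rng)
guided = rollout_loss(student, teacher, behavior=teacher, prompt=[0], length=5, rng=rng)
```

Passing `behavior=student` corresponds to vanilla OPD collection. Passing any other policy changes which prefixes are sampled but leaves the per-prefix objective untouched. With these context-independent toy policies the two averages coincide; with real models the visited prefixes, and therefore the averages, would differ.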

Methodology

4.1 Per-Prefix Behavior Policy

At a generation prefix $h$ , TRB defines the behavior policy $\mu^*(\cdot|h)$ as the solution to a constrained optimization problem that pulls sampling towards the teacher while staying near the student:

\mu^*(\cdot|h) = \arg\min_{\mu} D_{KL}(\mu \parallel \pi_T) \quad \text{s.t.} \quad D_{KL}(\mu \parallel \pi_S) \le \epsilon, \quad \sum_a \mu(a) = 1, \quad \mu(a) \ge 0. \tag{1}

Here, $\epsilon \ge 0$ is the allowed local deviation (KL budget).

4.2 Closed-Form Solution

Equation (1) has a closed-form solution belonging to a family of blended distributions:

\mu_\beta(a|h) = \frac{\pi_S(a|h)^{1-\beta} \pi_T(a|h)^\beta}{Z_\beta(h)}, \quad \beta \in [0, 1]. \tag{2}

The optimal coefficient $\beta^*(h)$ is the largest value in $[0, 1]$ such that $D_{KL}(\mu_\beta \parallel \pi_S) \le \epsilon$ . It is found via binary search, justified by the monotonicity of $D_{KL}(\mu_\beta \parallel \pi_S)$ in $\beta$ .

If $\epsilon = 0$ , then $\mu^* = \pi_S$ .
If $D_{KL}(\pi_T \parallel \pi_S) \le \epsilon$ , then $\mu^* = \pi_T$ .
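The closed form and the binary search over $\beta$ can be sketched for a single prefix as follows. This is a minimal illustration, not the authors' implementation: the function names, the iteration count, and the assumption that both distributions are dense NumPy vectors with strictly positive entries are all choices made here. The paper's training stack restricts the KL to a top-$k$ support, which this sketch does not model.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions; entries with p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def blend(log_s, log_t, beta):
    """mu_beta proportional to pi_S^(1-beta) * pi_T^beta, computed in log space."""
    logits = (1.0 - beta) * log_s + beta * log_t
    logits -= logits.max()  # stabilise the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()  # dividing by the sum plays the role of Z_beta

def solve_beta(pi_s, pi_t, eps, iters=50):
    """Largest beta in [0, 1] with D_KL(mu_beta || pi_S) <= eps (assumes full support)."""
    if eps <= 0.0:
        return 0.0  # boundary case: mu* = pi_S
    log_s, log_t = np.log(pi_s), np.log(pi_t)
    if kl(blend(log_s, log_t, 1.0), pi_s) <= eps:
        return 1.0  # boundary case: the teacher already lies inside the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # The KL to the student is monotone in beta, so bisection is valid.
        if kl(blend(log_s, log_t, mid), pi_s) <= eps:
            lo = mid
        else:
            hi = mid
    return lo  # lo always satisfies the constraint

# Illustrative 3-token distributions (not from the paper).
pi_s = np.array([0.5, 0.3, 0.2])
pi_t = np.array([0.2, 0.3, 0.5])
beta = solve_beta(pi_s, pi_t, eps=0.05)
mu = blend(np.log(pi_s), np.log(pi_t), beta)
```

Returning the lower bracket `lo` keeps the result feasible: the sampled policy never exceeds the KL budget, at the cost of being marginally more conservative than the exact optimum.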

4.3 Annealed Warmup

TRB is applied only during a warmup phase. The KL budget is annealed linearly to zero:

\epsilon_k = \epsilon_0 \left(1 - \frac{k}{K}\right), \quad k \le K, \tag{3}

where $K$ is the warmup horizon and $\epsilon_0$ is the initial budget. After $K$ steps, $\epsilon_k=0$ and training proceeds with pure student rollouts ( $\mu = \pi_S$ ).
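Equation (3) translates into a one-line schedule. The sketch below is a hedged illustration: `kl_budget`, `eps0`, and `warmup_steps` are names chosen here, and the values in the usage lines are placeholders, not settings reported in the paper.

```python
def kl_budget(step, eps0, warmup_steps):
    """Linearly annealed KL budget: eps0 * (1 - step / K) during warmup, 0 afterwards."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return 0.0  # pure student rollouts (mu = pi_S)
    return eps0 * (1.0 - step / warmup_steps)

# Placeholder values for illustration only.
start = kl_budget(0, eps0=0.1, warmup_steps=100)
halfway = kl_budget(50, eps0=0.1, warmup_steps=100)
after = kl_budget(250, eps0=0.1, warmup_steps=100)
```

Clamping to zero for `step >= warmup_steps` makes the same function usable for the whole run, so the rollout code does not need a separate branch once warmup ends.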

Training Configuration

All methods use a common OPD training stack. Key parameters are listed below:

Table 2: Common training configuration for the blend-based OPD sweeps.

Parameter	Value
Optimizer	AdamW
Learning rate	$1 \times 10^{-5}$
Global batch size	64
Rollouts per prompt	4
KL loss type	reverse KL
KL top- $k$ support	16 tokens
Rollout temperature	1.0

Table 3: Evaluation configuration.

Parameter	Value
Temperature	1.0
top-p	—
top-k	-1
Max response length	8192

Empirical Validation / Results

Experiments evaluate TRB on two math-reasoning OPD settings:

Qwen3-1.7B-Base distilled from Qwen3-8B
Qwen3-0.6B-Base distilled from Qwen3-4B

5.2 Benchmark Comparison

Methods are compared using a checkpoint-selection protocol (selecting the checkpoint with the highest mean score). Performance is measured by pass@1 on multiple benchmarks (MATH500, Olympiad, AMC, AIME, GSM8K).

Table 1: Benchmark pass@1 results. Bold marks the best result in each column; underline marks the second-best.

Method	Qwen3-1.7B-Base ← Qwen3-8B	Qwen3-0.6B-Base ← Qwen3-4B
	Avg	MATH500
Trust-Region Behavior Blending	33.2	69.7
Vanilla OPD	32.3	69.1
Veto	32.6	69.4
Interleaved teacher injection (SKD)	32.7	69.4
Temperature warmup	32.8	69.2
SFT warmup	32.2	67.6
Fixed- $\epsilon$ blending	32.6	69.2

Key Result: TRB attains the strongest average score in both settings. It also outperforms the persistent Fixed- $\epsilon$ blending variant, indicating the importance of annealing the guidance.

5.3 Early-Training Comparisons

Figure 2 shows training trajectories. TRB and other teacher-guided methods (SKD, fixed- $\epsilon$ ) start faster than vanilla OPD. However, persistent off-policy methods (fixed- $\epsilon$ , SFT warmup) do not achieve the highest final performance, while TRB does.
Figure 3 shows that during the TRB warmup phase, the teacher token-mean entropy on visited prefixes is lower (indicating more teacher-aligned, less random prefixes). After warmup, it aligns with vanilla OPD, yet the performance gap remains.
Figure 4 presents a controlled probe at training step 0: TRB-generated prefixes lead to higher success rates than vanilla OPD prefixes when continued by either the teacher or the student, across all tested prefix truncation lengths. This indicates TRB improves the quality of the initial learning states.

Extended Results (Appendix B)

Sweep-level comparisons (Figures 5 & 6) show that the best TRB configurations consistently outperform the best SKD configurations and vanilla OPD across both model-pair settings.

Theoretical and Practical Implications

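Theoretical Insight: TRB provides an efficient local trade-off. For a small KL budget $\epsilon$ , moving slightly away from the student reduces the teacher KL at first order in the blend coefficient $\beta$ , while the behavior-KL cost grows only at second order. The analysis shows $D_{KL}(p \parallel q) - D_{KL}(\mu_{\beta^*} \parallel q) = \sqrt{2\epsilon\sigma_p^2} + O(\epsilon)$ , where $p$ and $q$ appear to denote the student $\pi_S$ and the teacher $\pi_T$ at the prefix, and $\sigma_p^2$ the variance of their log-ratio under $p$ .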
Practical Implication: Teacher-guided off-policy behavior is most beneficial early in training when student and teacher are misaligned. Persistent guidance (fixed- $\epsilon$ ) or very direct intervention (SFT) may not yield the best final outcome. Annealing back to pure student rollouts is crucial.
Mechanism: TRB improves OPD by shifting the early prefix distribution toward regions where both teacher and student continuations are more successful, providing a better foundation for learning.
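
The square-root scaling quoted under Theoretical Insight can be motivated with a short calculation. This is a sketch under stated assumptions, not the paper's proof: it takes $p = \pi_S$ and $q = \pi_T$ at a fixed prefix $h$ , assumes both have full support, and introduces the log-ratio $\ell(a) = \log \frac{\pi_T(a|h)}{\pi_S(a|h)}$ with $\sigma_p^2 = \mathrm{Var}_{a \sim \pi_S}[\ell(a)]$ . With this notation, Equation (2) reads $\mu_\beta(a|h) = \pi_S(a|h) e^{\beta \ell(a)} / Z_\beta(h)$ , and differentiating in $\beta$ gives

\frac{d}{d\beta} D_{KL}(\mu_\beta \parallel \pi_S) = \beta \, \mathrm{Var}_{\mu_\beta}[\ell], \qquad \frac{d}{d\beta} D_{KL}(\mu_\beta \parallel \pi_T) = -(1-\beta) \, \mathrm{Var}_{\mu_\beta}[\ell].

The first derivative is non-negative on $[0, 1]$ , which is the monotonicity that justifies the binary search for $\beta^*$ . Expanding both quantities around $\beta = 0$ ,

D_{KL}(\mu_\beta \parallel \pi_S) = \tfrac{1}{2} \beta^2 \sigma_p^2 + O(\beta^3), \qquad D_{KL}(\pi_S \parallel \pi_T) - D_{KL}(\mu_\beta \parallel \pi_T) = \beta \sigma_p^2 + O(\beta^2).

Setting the behavior-KL cost equal to the budget $\epsilon$ gives $\beta^* \approx \sqrt{2\epsilon / \sigma_p^2}$ , and substituting into the second expansion yields a teacher-KL reduction of $\sqrt{2\epsilon\sigma_p^2} + O(\epsilon)$ . The gain therefore scales as $\sqrt{\epsilon}$ while the cost scales as $\epsilon$ , which is why a small budget buys a disproportionately large move toward the teacher.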

Conclusion

Trust-Region Behavior Blending (TRB) is an effective warmup method for stabilizing early on-policy distillation. By optimizing a teacher-guided behavior policy within a student-centered KL trust region and annealing it away, TRB improves the quality of initial rollouts without permanently altering the training distribution. Empirical results on math-reasoning tasks show that TRB achieves the strongest average performance among the compared methods, which include vanilla OPD, SKD, Veto, temperature warmup, SFT warmup, and fixed- $\epsilon$ blending.

Limitations and Future Work:

The study is scoped to math-reasoning with Qwen3-Base models. Transfer to other domains and model pairs requires verification.
TRB increases training-time cost during warmup due to online teacher decoding and co-residency with the student (extra memory for teacher weights and KV cache).
The method's hyperparameters ( $\epsilon_0$ , $K$ ) may need tuning for different settings.