Trust-Region Behavior Blending for On-Policy Distillation
Summary (Overview)
- Core Problem: On-policy distillation (OPD) trains a student on prefixes from its own policy, but early student rollouts can be of low quality, providing weak supervision from the teacher.
- Proposed Solution: Trust-Region Behavior Blending (TRB), a warmup method that replaces early student rollouts with a teacher-guided behavior policy, constrained to stay within a student-centered KL trust region.
- Key Mechanism: For each prefix, TRB selects the behavior policy that minimizes the KL divergence to the teacher subject to a KL budget around the student. The problem has a closed-form solution.
- Implementation: The KL budget is linearly annealed to zero over a fixed warmup horizon, after which training reverts to pure student rollouts.
- Empirical Result: TRB achieves the strongest average performance across two math-reasoning distillation settings, outperforming vanilla OPD and other baselines like SKD, Veto, and SFT warmup.
Introduction and Theoretical Foundation
Knowledge distillation transfers capability from a large teacher model to a smaller student. For Large Language Models (LLMs), standard offline distillation suffers from exposure bias: the student is trained on teacher-generated prefixes but must generate its own at inference time. On-Policy Distillation (OPD) addresses this by rolling out the current student and applying teacher supervision on the prefixes it actually visits.
However, early in training, the student's policy is weak, generating low-quality prefixes that may not carry usable teacher signal. This creates a tension: pure student rollouts preserve the target distribution but start poorly, while stronger teacher intervention (e.g., injecting teacher tokens) moves collection off-policy.
Trust-Region Behavior Blending (TRB) is proposed to address this early regime. It modifies the behavior policy used for rollout collection without changing the per-prefix distillation objective. The goal is to provide teacher-guided prefixes early on, while explicitly constraining deviations to remain close to the current student, and then anneal this guidance away.
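The theoretical foundation is the standard reverse-KL OPD objective: the expected per-prefix KL divergence from the student's next-token distribution to the teacher's, evaluated on the prefixes visited during rollout.
TRB keeps this loss fixed and only changes the behavior policy used to sample those prefixes.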
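For reference, a reverse-KL OPD objective of this kind is commonly written as below. The notation is an assumption made here, not taken from the paper: \pi_\theta is the student, \pi_T the teacher, x a prompt drawn from a dataset \mathcal{D}, y_{<t} a prefix, and \mu the behavior policy that generates the rollout (\mu = \pi_\theta in vanilla OPD). Whether the per-token terms are summed or averaged is also an assumption.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \mu(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      \mathrm{KL}\big( \pi_\theta(\cdot \mid x, y_{<t}) \,\big\|\, \pi_T(\cdot \mid x, y_{<t}) \big)
    \right]
```

In this reading, TRB leaves the bracketed per-prefix term untouched and only replaces \mu during warmup.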
Methodology
4.1 Per-Prefix Behavior Policy
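At each generation prefix, TRB defines the behavior policy as the solution to a constrained optimization problem (Equation (1)) that pulls sampling towards the teacher while staying near the student: it minimizes the KL divergence to the teacher subject to an upper bound on the KL divergence from the student.
Here, the bound is the allowed local deviation (the KL budget).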
4.2 Closed-Form Solution
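Equation (1) has a closed-form solution belonging to a one-parameter family of blended distributions, indexed by a blend coefficient.
The optimal coefficient is the largest admissible value for which the KL divergence between the blended policy and the student stays within the budget. It is found via binary search, which is justified because that KL divergence is monotone in the coefficient.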
- If , then .
- If , then .
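The coefficient search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the blend family is geometric (proportional to student^(1-lam) * teacher^lam), that the coefficient lies in [0, 1], and that the constraint is KL(blend || student) <= budget. The function names `geometric_blend`, `kl`, and `solve_blend_coefficient` are hypothetical, and the inputs are strictly positive next-token probability vectors.

```python
import math


def geometric_blend(student, teacher, lam):
    """Return q proportional to student**(1 - lam) * teacher**lam (assumed blend family)."""
    logits = [(1.0 - lam) * math.log(s) + lam * math.log(t)
              for s, t in zip(student, teacher)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in logits]
    z = sum(weights)
    return [w / z for w in weights]


def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def solve_blend_coefficient(student, teacher, budget, iters=50):
    """Largest lam in [0, 1] with KL(q_lam || student) <= budget, found by bisection."""
    if budget <= 0.0:
        return 0.0  # no deviation allowed: sample from the student
    if kl(teacher, student) <= budget:
        return 1.0  # the teacher itself is inside the trust region
    lo, hi = 0.0, 1.0  # invariant: lo is feasible, hi is infeasible
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_blend(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

Bisection is valid here because, for the assumed geometric family, KL(q_lam || student) is non-decreasing in lam on [0, 1]. A call such as `solve_blend_coefficient([0.7, 0.2, 0.1], [0.1, 0.2, 0.7], 0.1)` returns a coefficient strictly between 0 and 1 at which the constraint is (numerically) tight.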
4.3 Annealed Warmup
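TRB is applied only during a warmup phase. The KL budget is annealed linearly from its initial value to zero over the warmup horizon.
The schedule has two parameters: the warmup horizon (in training steps) and the initial budget. Once the warmup horizon has elapsed, the budget is zero and training proceeds with pure student rollouts.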
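A linear schedule of this kind can be sketched in a few lines. The function name `kl_budget` and its parameter names are hypothetical, and the exact step indexing (whether the budget reaches zero at the last warmup step or the step after) is an assumption.

```python
def kl_budget(step, initial_budget, warmup_steps):
    """Linearly anneal the KL budget from initial_budget at step 0 to 0 at warmup_steps."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return 0.0  # warmup is over: pure student rollouts
    step = max(step, 0)  # guard against negative step indices
    return initial_budget * (1.0 - step / warmup_steps)
```

With a zero budget the trust region collapses onto the student, so the behavior policy reduces to ordinary on-policy sampling without any special-casing in the training loop.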
Training Configuration
All methods use a common OPD training stack. Key parameters are listed below:
Table 2: Common training configuration for the blend-based OPD sweeps.
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | |
| Global batch size | 64 |
| Rollouts per prompt | 4 |
| KL loss type | reverse KL |
| KL top-k support | 16 tokens |
| Rollout temperature | 1.0 |
Table 3: Evaluation configuration.
| Parameter | Value |
|---|---|
| Temperature | 1.0 |
| top-p | — |
| top-k | -1 |
| Max response length | 8192 |
Empirical Validation / Results
Experiments evaluate TRB on two math-reasoning OPD settings:
- Qwen3-1.7B-Base distilled from Qwen3-8B
- Qwen3-0.6B-Base distilled from Qwen3-4B
5.2 Benchmark Comparison
Methods are compared using a checkpoint-selection protocol (selecting the checkpoint with the highest mean score). Performance is measured by pass@1 on multiple benchmarks (MATH500, Olympiad, AMC, AIME, GSM8K).
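The checkpoint-selection rule can be illustrated with a short sketch. The function name `select_checkpoint` and the data layout (a mapping from checkpoint name to per-benchmark pass@1 scores) are assumptions made for illustration, and the rule for breaking ties (keep the earliest checkpoint) is not specified in the source.

```python
def select_checkpoint(scores_by_checkpoint):
    """Return (checkpoint_name, mean_score) for the checkpoint with the highest mean score.

    scores_by_checkpoint maps a checkpoint name to a dict of benchmark -> pass@1.
    Ties keep the first checkpoint encountered (an assumption).
    """
    best = None
    for name, scores in scores_by_checkpoint.items():
        mean = sum(scores.values()) / len(scores)
        if best is None or mean > best[1]:
            best = (name, mean)
    return best
```

Applying the same rule to every method keeps the comparison symmetric, although selecting on the reported benchmarks themselves can favor methods with noisier training curves.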
Table 1: Benchmark pass@1 results. Bold marks the best result in each column; underline marks the second-best.
| Method | Avg | MATH500 |
|---|---|---|
| Trust-Region Behavior Blending (TRB) | **33.2** | **69.7** |
| Vanilla OPD | 32.3 | 69.1 |
| Veto | 32.6 | 69.4 |
| Interleaved teacher injection (SKD) | 32.7 | 69.4 |
| Temperature warmup | 32.8 | 69.2 |
| SFT warmup | 32.2 | 67.6 |
| Fixed blending | 32.6 | 69.2 |
Key Result: TRB attains the strongest average score in both settings. It also outperforms the persistent fixed-blending variant, indicating the importance of annealing the guidance.
5.3 Early-Training Comparisons
- Figure 2 shows training trajectories. TRB and other teacher-guided methods (SKD, fixed blending) start faster than vanilla OPD. However, persistent off-policy methods (fixed blending, SFT warmup) do not achieve the highest final performance, while TRB does.
- Figure 3 shows that during the TRB warmup phase, the teacher token-mean entropy on visited prefixes is lower (indicating more teacher-aligned, less random prefixes). After warmup, it aligns with vanilla OPD, yet the performance gap remains.
- Figure 4 presents a controlled probe at training step 0: TRB-generated prefixes lead to higher success rates than vanilla OPD prefixes when continued by either the teacher or the student, across all tested prefix truncation lengths. This indicates TRB improves the quality of the initial learning states.
Extended Results (Appendix B)
Sweep-level comparisons (Figures 5 & 6) show that the best TRB configurations consistently outperform the best SKD configurations and vanilla OPD across both model-pair settings.
Theoretical and Practical Implications
- Theoretical Insight: TRB provides an efficient local trade-off. The paper's analysis shows that, for a small KL budget, moving slightly away from the student yields a first-order reduction in teacher KL (high value) at a second-order behavior-KL cost.
- Practical Implication: Teacher-guided off-policy behavior is most beneficial early in training, when student and teacher are misaligned. Persistent guidance (fixed blending) or very direct intervention (SFT) may not yield the best final outcome. Annealing back to pure student rollouts is crucial.
- Mechanism: TRB improves OPD by shifting the early prefix distribution toward regions where both teacher and student continuations are more successful, providing a better foundation for learning.
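The first-order versus second-order trade-off can be made concrete with a short derivation. This is a sketch under an assumed geometric blend family, not necessarily the paper's own statement. The symbols are introduced here: \pi_S is the student, \pi_T the teacher, \lambda the blend coefficient, \varepsilon the KL budget, r = \log(\pi_T / \pi_S), and q_\lambda \propto \pi_S e^{\lambda r}.

```latex
\begin{aligned}
A(\lambda) &= \log \mathbb{E}_{\pi_S}\!\left[ e^{\lambda r} \right],
\qquad A'(\lambda) = \mathbb{E}_{q_\lambda}[r],
\qquad A''(\lambda) = \mathrm{Var}_{q_\lambda}[r], \\
\mathrm{KL}(q_\lambda \,\|\, \pi_S) &= \lambda A'(\lambda) - A(\lambda),
\qquad \frac{d}{d\lambda}\,\mathrm{KL}(q_\lambda \,\|\, \pi_S) = \lambda\, A''(\lambda), \\
\mathrm{KL}(q_\lambda \,\|\, \pi_T) &= (\lambda - 1) A'(\lambda) - A(\lambda),
\qquad \frac{d}{d\lambda}\,\mathrm{KL}(q_\lambda \,\|\, \pi_T) = -(1 - \lambda)\, A''(\lambda).
\end{aligned}
```

Writing \sigma^2 = \mathrm{Var}_{\pi_S}[r] and expanding around \lambda = 0:

```latex
\mathrm{KL}(q_\lambda \,\|\, \pi_S) \approx \tfrac{1}{2}\lambda^2 \sigma^2,
\qquad
\mathrm{KL}(q_\lambda \,\|\, \pi_T) \approx \mathrm{KL}(\pi_S \,\|\, \pi_T) - \lambda \sigma^2 .
```

Spending the whole budget, \tfrac{1}{2}\lambda^2\sigma^2 = \varepsilon, gives \lambda \approx \sqrt{2\varepsilon}/\sigma, so under these assumptions the teacher KL drops by roughly \sigma\sqrt{2\varepsilon}: a gain of order \sqrt{\varepsilon} for a behavior-KL cost of order \varepsilon.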
Conclusion
Trust-Region Behavior Blending (TRB) is an effective warmup method for stabilizing early on-policy distillation. By optimizing a teacher-guided behavior policy within a student-centered KL trust region and annealing it away, TRB improves the quality of initial rollouts without permanently altering the training distribution. Empirical results on math-reasoning tasks show that TRB achieves the strongest average performance among the compared stabilization techniques.
Limitations and Future Work:
- The study is scoped to math-reasoning with Qwen3-Base models. Transfer to other domains and model pairs requires verification.
- TRB increases training-time cost during warmup due to online teacher decoding and co-residency with the student (extra memory for teacher weights and KV cache).
- The method's hyperparameters (the initial KL budget and the warmup horizon) may need tuning for different settings.