Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

Summary (Overview)

Core Thesis: The efficiency of On-Policy Distillation (OPD) stems from a form of "foresight"—it establishes a stable and aligned update trajectory toward the final model early in training.
Key Finding 1 (Module Level): OPD exhibits Functional Redundancy Avoidance. It identifies modules with low marginal utility (e.g., embedding, bottom/top layers) and suppresses their updates, concentrating effective changes on reasoning-critical intermediate-layer MLPs.
Key Finding 2 (Update Direction Level): OPD exhibits Early Low-Rank Lock-in. Its parameter updates are more concentrated in a low-rank subspace than RL updates, and their dominant subspaces align closely with the final update subspace early in training, so little exploration and correction is required.
Practical Contribution: Based on these insights, the authors propose EffOPD, a plug-and-play acceleration method that performs adaptive linear extrapolation along the early predicted direction. It achieves an average 3× training speedup while maintaining comparable final performance.
Empirical Validation: Findings are validated across model scales (1.5B to 32B parameters), multiple RL algorithms (PPO, GRPO, DAPO), and tasks (mathematical reasoning, code generation).

Introduction and Theoretical Foundation

On-Policy Distillation (OPD) has emerged as an efficient post-training paradigm for Large Language Models (LLMs), achieving performance comparable to Reinforcement Learning (RL) with substantially reduced training time. Existing studies largely attribute this advantage to denser and more stable supervision from the teacher model. However, such macroscopic, optimization-centric explanations fail to capture the underlying parameter update dynamics.

This work argues that OPD's efficiency stems from a form of "foresight": it establishes stable and highly aligned update directions early in training, enabling rapid convergence with limited exploration. This foresight manifests at two levels:

Module-Allocation Level: OPD concentrates updates on modules critical to reasoning.
Update-Direction Level: OPD's dominant update subspaces align with the final solution early.

The theoretical foundation is analyzed through a local geometric view (Appendix F.5). Linearizing the student model around the base model, the OPD objective can be approximated as a convex quadratic minimization problem:

$$\min_{\Delta \theta} \frac{1}{2} \Delta \theta^\top A \Delta \theta - b^\top \Delta \theta$$

where $A = \mathbb{E}_c[J_c^\top F_c J_c]$ and $b = \mathbb{E}_c[J_c^\top F_c r_c]$. Here, $r_c = z^\star(c) - z_0(c)$ is the teacher-base logit residual. This formulation reveals that if the driving term $b$ is concentrated in a low-dimensional subspace (the top-$k$ eigenspace of $A$), the update $\Delta \theta$ remains confined to this subspace from the early stages of training, which explains the Early Low-Rank Lock-in property.
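
The confinement claim can be checked with a short worked derivation. This is a sketch under added assumptions that the source does not state: $A$ is symmetric positive definite with eigenpairs $(\lambda_i, u_i)$ sorted so that $\lambda_1 \ge \lambda_2 \ge \dots$, and the quadratic is minimized by plain gradient descent with step size $\eta$ starting from $\Delta\theta_0 = 0$.

```latex
% Gradient descent on (1/2) \Delta\theta^\top A \Delta\theta - b^\top \Delta\theta
\Delta\theta_{t+1} = \Delta\theta_t - \eta\,(A\,\Delta\theta_t - b), \qquad \Delta\theta_0 = 0

% Project onto eigenvector u_i of A (A u_i = \lambda_i u_i) and unroll the recursion
u_i^\top \Delta\theta_t = \frac{1 - (1 - \eta\lambda_i)^t}{\lambda_i}\; u_i^\top b

% If u_i^\top b = 0 for every i > k, those components stay zero for all t, so
\Delta\theta_t = \sum_{i=1}^{k} \frac{1 - (1 - \eta\lambda_i)^t}{\lambda_i}\,(u_i^\top b)\,u_i
\;\in\; \mathrm{span}(u_1, \dots, u_k)
```

For $0 < \eta\lambda_i < 2$ each coefficient converges to $(u_i^\top b)/\lambda_i$, the matching component of the minimizer $A^{-1} b$. When $\eta\lambda_i \le 1$, directions with larger $\lambda_i$ converge in fewer steps, which is consistent with the dominant directions being established early.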

Methodology

The paper employs a multi-faceted analytical approach to dissect the parameter dynamics of OPD compared to RL.

1. Experimental Setup:

Models: Experiments span scales from 1.5B to 32B parameters, using models like Qwen2.5, Qwen3, and their RL-tuned variants (see Table 2).
Training Paradigms:
- Reinforcement Learning (RL): The objective is $\max_\theta J_{RL}(\theta)$, with $J_{RL}(\theta) = \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot|x)}[r(x, y) - \beta D_{KL}(\pi_\theta \| \pi_{ref})]$.
- On-Policy Distillation (OPD): The objective is $\min_\theta J_{OPD}(\theta)$, with $J_{OPD}(\theta) = \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot|x)}[D_{KL}(\pi_\theta(y|x) \| \pi^\star(y|x))]$.
Analysis Focus: The parameter update matrix $\Delta W_{RL/OPD} = W_{RL/OPD} - W_{Base}$.
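
The OPD objective above is a reverse KL divergence evaluated on outputs sampled from the student. A minimal sketch of one common token-level form follows, assuming per-position logits are available from both models. The function names, toy shapes, and the choice to average over positions are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reverse_kl_per_token(student_logits, teacher_logits):
    """KL(student || teacher) at each position; inputs have shape (seq_len, vocab)."""
    log_p = log_softmax(student_logits)  # student log-probabilities (pi_theta)
    log_q = log_softmax(teacher_logits)  # teacher log-probabilities (pi_star)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Toy logits: 5 positions of a student-sampled sequence, vocabulary of 11 tokens.
rng = np.random.default_rng(0)
student = rng.normal(size=(5, 11))
teacher = student + 0.5 * rng.normal(size=(5, 11))

kl = reverse_kl_per_token(student, teacher)
opd_loss = kl.mean()  # assumption: average over sampled positions
```

Because the expectation is taken under the student's own distribution, the loss is computed on sequences the student actually generates, which is what makes the distillation on-policy.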

2. Analytical Techniques:

Scaling Analysis: Fixing the update direction $\Delta W$ and scaling its magnitude by a factor $\alpha$ to evaluate update efficiency: $W_{Base} + \alpha \Delta W_{RL/OPD}$.
Sliding-Window Intervention: Partitioning the model into consecutive layer blocks and injecting local OPD/RL updates into each block to evaluate their impact on reasoning performance, isolating functional contributions.
Spectral Analysis: Performing Singular Value Decomposition (SVD) on $\Delta W$ to characterize its geometric structure using metrics like Spectral Norm, Effective Rank, and Top-1% Subspace Norm Ratio.
Subspace Evolution Analysis: Tracking the cosine similarity between the dominant subspaces (e.g., top-$k$) of intermediate checkpoints and the final checkpoint to measure directional alignment over time.
Norm Scaling Intervention: For early OPD checkpoints, preserving the update direction within each module but rescaling its Frobenius norm to match the final checkpoint's norm, to disentangle the effects of direction formation vs. magnitude growth.
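
The scaling analysis and the norm scaling intervention are both simple operations on the update matrix. The sketch below illustrates them on a single toy weight matrix; the helper names and the synthetic "early" update are assumptions made for illustration, and a real analysis would apply the same operations module by module.

```python
import numpy as np

def scale_update(w_base, delta_w, alpha):
    """Scaling analysis: keep the direction of delta_w, scale its magnitude by alpha."""
    return w_base + alpha * delta_w

def rescale_to_final_norm(delta_early, delta_final, eps=1e-12):
    """Norm scaling intervention: early direction, final checkpoint's Frobenius norm."""
    early_norm = np.linalg.norm(delta_early)  # Frobenius norm for 2-D arrays
    final_norm = np.linalg.norm(delta_final)
    return delta_early * (final_norm / (early_norm + eps))

rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 6))
delta_final = rng.normal(size=(8, 6))
# Toy early update: small, roughly aligned with the final update.
delta_early = 0.1 * delta_final + 0.01 * rng.normal(size=(8, 6))

w_half = scale_update(w_base, delta_final, alpha=0.5)
delta_rescaled = rescale_to_final_norm(delta_early, delta_final)
w_intervened = w_base + delta_rescaled
```

Evaluating `w_intervened` against the final checkpoint separates the contribution of direction from that of magnitude: any remaining gap is attributable to the direction not yet being fully formed.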

Empirical Validation / Results

1. Functional Redundancy Avoidance (Module Level)

Update Efficiency: When updates are scaled to the same norm ($\alpha \|\Delta W\|$), OPD achieves substantially higher reasoning gains than RL (Figure 2a), indicating that RL updates contain components weakly correlated with task performance.
Training Dynamics: OPD consistently requires smaller parameter updates than RL to achieve the same reasoning accuracy throughout training (Figure 2b).
Locating Redundancy: Embedding layer updates contribute negligibly to reasoning performance (Figure 3a). Sliding-window intervention reveals an inverted U-shaped pattern: interventions in middle layers (especially MLPs) yield the largest gains, while bottom/top layers yield smaller improvements (Figure 3b).
Key Difference: While OPD and RL show similar sensitivity patterns across layers, RL accumulates substantially larger update norms in low-sensitivity regions (bottom/top layers). OPD suppresses these redundant updates and concentrates changes in high-contribution modules.
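
The sliding-window intervention used to locate redundancy can be sketched as follows. This is a toy illustration, not the authors' implementation: the per-layer weight lists, the `evaluate` callback, the non-overlapping default stride, and the synthetic scoring function are all assumptions, and the toy data is constructed by hand rather than measured.

```python
import numpy as np

def inject_window(base_layers, delta_layers, start, width):
    """Apply updates only to layers in [start, start + width); leave the rest at base."""
    patched = []
    for idx, (w, dw) in enumerate(zip(base_layers, delta_layers)):
        in_window = start <= idx < start + width
        patched.append(w + dw if in_window else w.copy())
    return patched

def sliding_window_scores(base_layers, delta_layers, width, evaluate, stride=None):
    """Score the model after injecting updates into each block of `width` layers."""
    stride = width if stride is None else stride  # assumption: non-overlapping blocks
    n_layers = len(base_layers)
    return {
        start: evaluate(inject_window(base_layers, delta_layers, start, width))
        for start in range(0, n_layers - width + 1, stride)
    }

# Toy 6-layer model whose middle layers carry the largest updates (hand-built).
base = [np.zeros((4, 4)) for _ in range(6)]
scales = [0.1, 0.5, 1.0, 1.0, 0.5, 0.1]
delta = [s * np.ones((4, 4)) for s in scales]
target = [w + dw for w, dw in zip(base, delta)]

def toy_evaluate(layers):
    """Hypothetical stand-in for a reasoning benchmark: closer to target scores higher."""
    return -sum(np.linalg.norm(w - t) for w, t in zip(layers, target))

scores = sliding_window_scores(base, delta, width=2, evaluate=toy_evaluate)
best_start = max(scores, key=scores.get)
```

In a real analysis `evaluate` would run a reasoning benchmark on the patched model, and the per-window scores would be compared against the per-window update norms to see where large updates buy little performance.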

2. Early Low-Rank Lock-in (Update Direction Level)

Spectral Concentration: OPD updates exhibit stronger low-rank structure than RL across all model scales, as shown by higher spectral-to-Frobenius norm ratios and lower effective ranks.

Table 1: Characterization of Parameter Update Geometry: OPD vs. RL Across Model Scales.

Metric	1.5B	4B	8B	14B
	RL	OPD	RL	OPD
Spectral Norm (↑)	0.094	0.113	0.007	0.009
Spectral / Frobenius Norm Ratio (↑)	33.2%	39.6%	19.7%	25.7%
Effective Rank (↓)	964	778	1908	1587
Top-1% Subspace Norm Ratio (↑)	78.1%	92.3%	79.2%	93.4%
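
The metrics in Table 1 can all be derived from the singular values of $\Delta W$. The sketch below shows one plausible way to compute them. The source does not spell out the exact definitions, so two choices here are assumptions: effective rank is taken as the exponential of the entropy of the normalized singular values, and the top-1% subspace norm ratio is taken as the Frobenius norm of the top 1% of singular directions divided by the full Frobenius norm.

```python
import numpy as np

def spectral_metrics(delta_w, top_frac=0.01):
    """Spectral summary of a nonzero 2-D update matrix."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    fro = np.sqrt((s ** 2).sum())                 # Frobenius norm of delta_w

    # Assumed definition: exp(Shannon entropy) of the normalized singular values.
    p = s / s.sum()
    p = p[p > 0]
    effective_rank = float(np.exp(-(p * np.log(p)).sum()))

    # Assumed definition: Frobenius norm captured by the top `top_frac` directions.
    k = max(1, int(np.ceil(top_frac * len(s))))
    top_ratio = float(np.sqrt((s[:k] ** 2).sum()) / fro)

    return {
        "spectral_norm": float(s[0]),
        "spectral_to_frobenius": float(s[0] / fro),
        "effective_rank": effective_rank,
        "top_subspace_norm_ratio": top_ratio,
    }

# Two extreme cases: a perfectly flat spectrum and a rank-1 update.
flat = spectral_metrics(np.eye(16))
rank_one = spectral_metrics(np.outer(np.arange(1.0, 9.0), np.arange(1.0, 7.0)))
```

The flat spectrum gives an effective rank equal to the matrix dimension and a low spectral-to-Frobenius ratio, while the rank-1 update gives an effective rank near 1 and a ratio near 1. A more low-rank update therefore moves in the direction the table marks as favorable (higher ratio, lower effective rank).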

Subspace Quality: Under equal norm budgets, OPD's top-$k\%$ principal subspace consistently outperforms RL's in recovering reasoning performance (Figure 4a), indicating higher directional quality.
Tail Subspace Utility: RL allocates more update energy to tail directions (bottom-$k\%$ subspace) but with low marginal performance return (Figure 4b).
Early Alignment: OPD's dominant subspaces show stronger and earlier alignment with the final subspaces than RL's (Figure 5b). t-SNE visualizations show OPD trajectories are more compact and smoother (Figure 5a).
Norm Scaling Recovery: Scaling the norm of an early OPD checkpoint (at 10% training progress) to match the final checkpoint's norm recovers ~80% of the final model's performance (Figure 5c), indicating that the effective directions are largely formed early and that later training mainly grows the update magnitude.
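
Early alignment can be quantified by comparing the dominant subspace of an intermediate update with that of the final update. The sketch below uses the mean squared cosine of the principal angles between the top-$k$ left singular subspaces. This specific score is an assumption, since the source only says "cosine similarity" between dominant subspaces, and the function names are illustrative.

```python
import numpy as np

def top_k_left_subspace(delta_w, k):
    """Orthonormal basis of the top-k left singular subspace of delta_w."""
    u, _, _ = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :k]

def subspace_alignment(delta_a, delta_b, k):
    """Mean squared cosine of principal angles between top-k subspaces, in [0, 1]."""
    ua = top_k_left_subspace(delta_a, k)
    ub = top_k_left_subspace(delta_b, k)
    # Singular values of ua^T ub are the cosines of the principal angles.
    cosines = np.linalg.svd(ua.T @ ub, compute_uv=False)
    return float((cosines ** 2).mean())

rng = np.random.default_rng(0)
delta_final = rng.normal(size=(12, 10))
same_direction = subspace_alignment(0.1 * delta_final, delta_final, k=3)

# Hand-built updates whose top-2 subspaces are orthogonal.
delta_a = np.diag([3.0, 2.0, 0.0, 0.0])
delta_b = np.diag([0.0, 0.0, 3.0, 2.0])
orthogonal = subspace_alignment(delta_a, delta_b, k=2)
```

The score is invariant to the overall scale of the update, so a small early checkpoint that already points in the final direction scores close to 1. Tracking it across checkpoints gives an alignment curve of the kind described for Figure 5b.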

3. Acceleration via EffOPD

Method: EffOPD triggers extrapolation at exponentially spaced checkpoints ($t = 2^n$). The local update direction $\Delta_n$ is estimated as the displacement between the current and the previous exponentially spaced checkpoint: $\Delta_n = W_{2^n} - W_{2^{n-1}}$. It then generates candidate parameters by extrapolating along this direction:

$$\widetilde{W}_{n,k} = W_{2^n} + 2^k \Delta_n$$

A lightweight validation set $\mathcal{D}_v$ (50 examples) is used to screen the candidates: an extrapolation is accepted only if it improves validation performance, so that harmful extrapolations are rejected.
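
The extrapolate-then-validate step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the range of $k$ (`max_k`), the rule of keeping the best-scoring candidate, the `evaluate` callback standing in for scoring on $\mathcal{D}_v$, and the choice to start triggering at $t = 2$ (the first step with a previous exponential checkpoint) are all hypothetical.

```python
import numpy as np

def is_trigger_step(t):
    """True when t is a power of two and at least 2."""
    return t >= 2 and (t & (t - 1)) == 0

def effopd_extrapolate(w_curr, w_prev, evaluate, max_k=3):
    """Try W_curr + 2^k * (W_curr - W_prev) for k = 0..max_k.

    Returns the best candidate only if it beats the un-extrapolated weights on the
    validation score; otherwise returns w_curr unchanged.
    """
    delta = w_curr - w_prev
    best_w, best_score = w_curr, evaluate(w_curr)
    for k in range(max_k + 1):
        candidate = w_curr + (2 ** k) * delta
        score = evaluate(candidate)
        if score > best_score:
            best_w, best_score = candidate, score
    return best_w, best_score

# Toy problem: validation score is the negative distance to a hypothetical optimum.
w_star = np.array([1.0, -2.0, 0.5])
evaluate = lambda w: -float(np.linalg.norm(w - w_star))

# Early in training: far from the optimum, extrapolation helps.
jumped_w, jumped_score = effopd_extrapolate(0.2 * w_star, np.zeros(3), evaluate)

# Near convergence: every candidate overshoots, so the current weights are kept.
kept_w, kept_score = effopd_extrapolate(w_star, 0.5 * w_star, evaluate)

trigger_steps = [t for t in range(1, 20) if is_trigger_step(t)]
```

The acceptance check is what makes the extrapolation safe to apply repeatedly: when the early direction is reliable the step length grows geometrically, and when it is not the training state is left untouched.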

Results: EffOPD achieves significant acceleration across model scales and tasks (code generation, mathematical reasoning). It typically converges within ~10 training steps, compared to 30–40 steps for vanilla OPD, yielding a >3× speedup while maintaining or even slightly improving final performance (Figure 6). Ablation studies show it is robust to learning rate choices and validation set difficulty (Figure 7).

Theoretical and Practical Implications

Theoretical Implications: The work provides a parameter-dynamics perspective for understanding OPD efficiency, moving beyond macroscopic claims of "denser supervision." The two identified properties (Functional Redundancy Avoidance and Early Low-Rank Lock-in) offer a geometric explanation for why distillation is easier to optimize.
Practical Implications:
- EffOPD is a simple, plug-and-play acceleration method requiring no extra trainable modules or complex hyperparameter tuning.
- It demonstrates that leveraging early directional stability is a viable and effective strategy for accelerating post-training.
- The analysis suggests metrics like directional alignment and spectral concentration could serve as diagnostic signals for monitoring training progress and stability.
- The findings are orthogonal to existing acceleration techniques, providing new insights for designing more interpretable and efficient post-training paradigms.

Conclusion

This work identifies and validates two key properties that constitute the "foresight" of On-Policy Distillation:

Functional Redundancy Avoidance: OPD suppresses updates in low-utility modules and concentrates them in reasoning-critical regions.
Early Low-Rank Lock-in: OPD's updates are low-rank concentrated and their dominant directions align with the final solution early, minimizing exploratory steps.

These properties explain OPD's efficiency at a parameter-dynamics level. Building on this insight, the proposed EffOPD method leverages early directional stability via adaptive extrapolation, achieving roughly 3× training speedup on average while preserving final performance. The findings offer a new perspective for understanding and accelerating post-training in large language models, emphasizing the importance of early directional stabilization and compact parameter allocation.