# On the Geometry of On-Policy Distillation

> On-policy distillation exhibits subspace locking, with cumulative updates confined to a persistent low-dimensional channel controlled by objective composition.

- **Source:** [arXiv](https://arxiv.org/abs/2606.07082)
- **Published:** 2026-06-10
- **Permalink:** https://picx.dev/p/q6d00h
- **Whiteboard:** https://picx.dev/p/q6d00h/image

## Summary

- On-policy distillation (OPD) occupies a **relaxed off-principal regime** in parameter space, lying between dense principal-aligned updates (SFT) and sparse geometry-preserving updates (RLVR), with a bias toward the RLVR side.
- OPD exhibits **subspace locking**: cumulative updates rapidly enter a narrow, persistent low-dimensional channel that is functionally sufficient for training, in the sense that projecting gradients onto the early subspace preserves performance.
- The subspace locking is **robust to token sparsification and off-policy rollouts** (runtime perturbations) but **sensitive to objective composition** (mixing with RLVR changes the rank dynamics).
- Key diagnostics (update sparsity, subspace rotation, spectral drift, update localization) consistently place OPD between SFT and RLVR; for example, bf16 update sparsity is 8.1% (SFT), 51.6% (OPD), 77.2% (RLVR).
- The findings suggest that OPD should be designed as **geometry control** rather than merely denser token supervision, with objective composition as the primary lever for regulating the update channel.

## Introduction and Theoretical Foundation

Large reasoning models (LRMs) have advanced complex reasoning in LLMs, driven by three post-training paradigms: supervised fine-tuning (SFT) on offline demonstrations, reinforcement learning with verifiable rewards (RLVR) from sparse outcome signals, and on-policy distillation (OPD), which trains a student on its own sampled trajectories with dense token-level guidance from a stronger teacher. While the geometric footprints of SFT and RLVR are known (dense/principal-aligned vs. sparse/off-principal), OPD’s parameter-space dynamics remain unclear. OPD combines SFT-like dense token-level distillation with RLVR-like on-policy sampling and policy-gradient optimization. The paper addresses three research questions:

- **RQ1**: Where does OPD lie in the parameter-space spectrum between SFT and RLVR?
- **RQ2**: What intrinsic update trajectory does OPD follow during training?
- **RQ3**: Which component of OPD controls this trajectory?

The theoretical foundation is built on a **relaxed Three-Gate view** (extending Zhu et al., 2025). The three gates are:
1. **Distributional anchor**: limited update norm under a local quadratic budget.
2. **Pretrained model geometry**: routes bounded updates away from dominant spectral directions.
3. **Precision realization** (bf16): determines which coordinates become visibly changed.

OPD preserves RLVR’s gated structure but relaxes it via dense token-level teacher supervision, broadening the active directions while keeping geometry-steered updates.

## Methodology

**Experimental setup**: Analysis focuses on Qwen3-family checkpoints (Qwen3-8B base, SFT, OPD, RLVR) with math-domain prompts. Controlled comparison ensures same SFT anchor and prompt distribution. Additional OPD variants vary teacher size, student size, data domain, and MoE teacher.

**Parameter-space diagnostics** (four metrics):

1. **Update sparsity** (bf16-aware): treat entry $(i,j)$ as unchanged, written $W_{+,ij} \approx_\eta W_{0,ij}$, if $|W_{+,ij} - W_{0,ij}| \leq \eta \max(|W_{0,ij}|, |W_{+,ij}|)$ with $\eta=10^{-3}$. Sparsity $= 1 - \frac{1}{n}\sum_{i,j} \mathbb{1}[W_{+,ij} \not\approx_\eta W_{0,ij}]$, where $n$ is the number of entries.

2. **Principal-angle rotation**: cosine of principal angles between top-$k$ singular subspaces:
   $$\cos\theta_i(U)=\sigma_i(U_{0,k}^\top U_{+,k}),\quad \cos\theta_i(V)=\sigma_i(V_{0,k}^\top V_{+,k})$$

3. **Spectral drift** (normalized spectral shift):
   $$\text{NSS}(W_0,W_+)=\frac{\|\sigma(W_+)-\sigma(W_0)\|_2}{\|\sigma(W_0)\|_2}$$

4. **Update–mask overlap**: compare bf16-visible update mask $M$ with principal mask $M_{\text{princ}}$ (top-$\alpha$ entries in rank-$k$ SVD reconstruction) and low-magnitude mask $M_{\text{low}}$ (bottom-$\alpha$ by $|W_0|$). Overlap $= |M_* \cap M|/|M|$.
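The first three diagnostics above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, weights are handled as float64 arrays with the relative tolerance $\eta$ standing in for bf16 visibility, and each function operates on a single weight matrix.

```python
import numpy as np

def update_sparsity(w0, w1, eta=1e-3):
    """Fraction of entries counted as unchanged under relative tolerance eta."""
    w0 = np.asarray(w0, dtype=np.float64)
    w1 = np.asarray(w1, dtype=np.float64)
    changed = np.abs(w1 - w0) > eta * np.maximum(np.abs(w0), np.abs(w1))
    return 1.0 - changed.mean()

def principal_angle_cosines(w0, w1, k):
    """Cosines of principal angles between top-k left/right singular subspaces."""
    u0, _, vt0 = np.linalg.svd(w0, full_matrices=False)
    u1, _, vt1 = np.linalg.svd(w1, full_matrices=False)
    # Singular values of U0^T U1 (resp. V0^T V1) are the principal-angle cosines.
    cos_u = np.linalg.svd(u0[:, :k].T @ u1[:, :k], compute_uv=False)
    cos_v = np.linalg.svd(vt0[:k] @ vt1[:k].T, compute_uv=False)
    return np.clip(cos_u, 0.0, 1.0), np.clip(cos_v, 0.0, 1.0)

def normalized_spectral_shift(w0, w1):
    """||sigma(W+) - sigma(W0)||_2 / ||sigma(W0)||_2."""
    s0 = np.linalg.svd(w0, compute_uv=False)
    s1 = np.linalg.svd(w1, compute_uv=False)
    return np.linalg.norm(s1 - s0) / np.linalg.norm(s0)

# Toy usage: perturb a single entry of a random matrix.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 32))
w_tuned = w_base.copy()
w_tuned[0, 0] *= 1.01  # a 1% relative change, above eta = 1e-3
sparsity = update_sparsity(w_base, w_tuned)
nss = normalized_spectral_shift(w_base, w_tuned)
```

A principal angle in degrees can be recovered with `np.degrees(np.arccos(cos_u))`, which is how figures such as "~1°" would be read off under this sketch.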

**Trajectory analysis**: cumulative update $\Delta W_t = W_t - W_0$. Stable rank: $\text{srank}(\Delta W_t) = \|\Delta W_t\|_F^2 / \|\Delta W_t\|_{\text{op}}^2$. Subspace similarity: $\text{Sim}_K(t,t_{\text{end}}) = \frac{1}{K}\|V_K(t)^\top V_K(t_{\text{end}})\|_F^2$. Functional sufficiency: project gradients $g \leftarrow g V_{16} V_{16}^\top$.
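The trajectory quantities can be written down directly from these definitions. The sketch below is illustrative only: it assumes $V_K$ holds the top-$K$ right singular vectors of the cumulative update (the summary does not state this explicitly), and the function names are hypothetical.

```python
import numpy as np

def stable_rank(delta):
    """srank = ||dW||_F^2 / ||dW||_op^2."""
    fro_sq = np.linalg.norm(delta, "fro") ** 2
    op = np.linalg.norm(delta, 2)  # largest singular value
    return fro_sq / op**2

def top_right_vectors(delta, k):
    """Top-k right singular vectors of delta as columns, shape (n_in, k)."""
    _, _, vt = np.linalg.svd(delta, full_matrices=False)
    return vt[:k].T

def subspace_similarity(v_a, v_b):
    """(1/K) * ||V_a^T V_b||_F^2; 1 for identical subspaces, 0 for orthogonal."""
    k = v_a.shape[1]
    return np.linalg.norm(v_a.T @ v_b, "fro") ** 2 / k

def project_gradient(grad, v):
    """g <- g V V^T: keep only the component in the row space spanned by V."""
    return grad @ v @ v.T

# Toy usage: compare an "early" and a "final" cumulative update.
rng = np.random.default_rng(0)
delta_early = rng.standard_normal((32, 16))
delta_final = delta_early + 0.1 * rng.standard_normal((32, 16))
v_early = top_right_vectors(delta_early, 4)
v_final = top_right_vectors(delta_final, 4)
sim = subspace_similarity(v_early, v_final)
grad_proj = project_gradient(rng.standard_normal((32, 16)), v_early)
```

In the functional-sufficiency experiment the basis would be extracted once from an early checkpoint and then held fixed, so `project_gradient` is applied with the same `v` at every subsequent step.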

**Control experiments** (RQ3): perturb token supervision density (top-KL or random 25%/50% retention), rollout policy (off-policy vs. on-policy), and objective composition (linear interpolation $A_i^{(\alpha)} = \alpha A_{i,\text{OPD}} + (1-\alpha)A_{i,\text{RLVR}}$).
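As a rough sketch of the three perturbations, the snippet below interpolates advantages and builds token-retention masks. It assumes that "top-KL" means keeping the tokens with the largest per-token teacher-student KL, which the summary does not spell out; the helper names and the 1-D per-token arrays are hypothetical.

```python
import numpy as np

def mix_advantages(adv_opd, adv_rlvr, alpha):
    """A^(alpha) = alpha * A_OPD + (1 - alpha) * A_RLVR, elementwise."""
    return alpha * np.asarray(adv_opd) + (1.0 - alpha) * np.asarray(adv_rlvr)

def top_kl_mask(token_kl, keep_frac):
    """Keep the keep_frac share of tokens with the largest KL (assumed meaning)."""
    token_kl = np.asarray(token_kl)
    n_keep = max(1, int(round(keep_frac * token_kl.size)))
    keep = np.zeros(token_kl.size, dtype=bool)
    keep[np.argsort(token_kl)[-n_keep:]] = True
    return keep

def random_mask(n_tokens, keep_frac, rng):
    """Keep a uniformly random keep_frac share of tokens."""
    n_keep = max(1, int(round(keep_frac * n_tokens)))
    keep = np.zeros(n_tokens, dtype=bool)
    keep[rng.choice(n_tokens, size=n_keep, replace=False)] = True
    return keep

# Toy usage on an 8-token rollout.
rng = np.random.default_rng(0)
kl = rng.random(8)
adv = mix_advantages(rng.standard_normal(8), np.full(8, 0.5), alpha=0.25)
masked_adv = np.where(top_kl_mask(kl, 0.5), adv, 0.0)  # dropped tokens contribute nothing
```

Zeroing the advantage of dropped tokens is one simple way to remove them from a policy-gradient loss; an actual training loop might instead mask the per-token loss directly.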

## Empirical Validation / Results

**1. OPD occupies a relaxed off-principal regime (RQ1)**

*Update sparsity*: In the controlled Qwen3-8B comparison, SFT leaves 8.1% of weights unchanged (bf16), RLVR 77.2%, and OPD 51.6%. OPD sparsity is stable across variants (48.6%–57.1%). Published reference points confirm the SFT–RLVR separation.

**Table 1: bf16-aware update sparsity**
| Base Model → Finetuned Model | Algorithm | Data | Sparsity |
|----------------------------|-----------|------|----------|
| Qwen3-8B-Base → Qwen3-8B-SFT | SFT | Math | 8.1% |
| Qwen3-8B-SFT → OPD-8B-T32B | OPD | Math | 51.6% |
| Qwen3-8B-SFT → RLVR-8B | GRPO | Math | 77.2% |
| ... OPD variants (4B/14B/8B, different teachers) | OPD | Math/Code | 48.6%–57.1% |
| Published reference (SFT/GRPO) | SFT/GRPO | Math+Code | 0.6%–79.9% |

*Subspace rotation and spectral drift*: SFT has principal angles >10° (single layer), RLVR <0.5°, OPD ~1°. NSS is on the order of $10^{-3}$ for SFT, $10^{-4}$ for OPD, and $10^{-5}$ for RLVR.

*Update localization*: Global update density: SFT 92.73%, OPD 46.28%, RLVR 27.76%. Principal-mask overlap: SFT 29.66%, OPD 27.31%, RLVR 26.65% (all near/below 30% random baseline). Low-magnitude overlap: SFT 31.88%, OPD 53.59%, RLVR 73.48%.
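The overlap numbers can be read against a simple baseline: if the update mask $M$ were independent of a reference mask covering a fraction $\alpha$ of entries, the expected overlap $|M_* \cap M|/|M|$ would equal $\alpha$. The quoted 30% baseline therefore suggests $\alpha \approx 0.3$, though that is an inference from the summary rather than a stated setting. The sketch below illustrates the masks and the baseline; the function names are hypothetical, and ranking the rank-$k$ reconstruction by absolute value is an assumption.

```python
import numpy as np

def top_fraction_mask(scores, frac):
    """Boolean mask over the frac share of entries with the largest scores."""
    flat = np.asarray(scores).ravel()
    n_keep = int(round(frac * flat.size))
    mask = np.zeros(flat.size, dtype=bool)
    mask[np.argsort(flat)[flat.size - n_keep:]] = True
    return mask.reshape(np.shape(scores))

def principal_mask(w0, k, frac):
    """Top-frac entries (by magnitude, an assumption) of the rank-k reconstruction."""
    u, s, vt = np.linalg.svd(w0, full_matrices=False)
    recon = (u[:, :k] * s[:k]) @ vt[:k]
    return top_fraction_mask(np.abs(recon), frac)

def low_magnitude_mask(w0, frac):
    """Bottom-frac entries by |W0|."""
    return top_fraction_mask(-np.abs(w0), frac)

def overlap(mask_ref, mask_update):
    """|M_ref AND M| / |M|: share of updated entries that fall in the reference mask."""
    return (mask_ref & mask_update).sum() / mask_update.sum()

# Baseline check: an update mask drawn independently of W0.
rng = np.random.default_rng(0)
w0 = rng.standard_normal((200, 200))
random_update = rng.random((200, 200)) < 0.5
baseline = overlap(low_magnitude_mask(w0, 0.3), random_update)  # close to 0.3
```

Under this reading, a low-magnitude overlap well above 0.3 (as reported for OPD and RLVR) indicates that updates concentrate on small-magnitude weights, while a principal-mask overlap near or below 0.3 indicates no preference for principal entries.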

**2. OPD exhibits subspace locking (RQ2)**

*Stable rank trajectory* (Fig. 4a): OPD stays in a narrow low-rank band (~20–30) throughout training. SFT expands (~10 to >60), RLVR contracts (~30 to ~15). OPD's trajectory is therefore not a temporal interpolation between the other two.

*Update scale* (Fig. 4b): OPD has substantially larger $\|\Delta W_t\|_F$ than RLVR, ruling out the trivial explanation that its updates are simply small. Hill tail estimates (Fig. 4c) confirm that OPD's spectrum evolves only mildly.

*Early subspace emergence* (Fig. 5): Top-16 subspace similarity to final update: OPD aligns from first checkpoint (~0.8), SFT and RLVR converge gradually.

*Functional sufficiency* (Fig. 6): Projecting gradients to rank-16 subspace (extracted around 20% training) preserves OPD performance on reasoning benchmarks (AIME 2024, MATH-500). Same constraint degrades SFT.

**3. Objective composition controls subspace locking (RQ3)**

*Token sparsification* (Fig. 7a): Top-KL 50%/25% and Random 50%/25% retention all preserve the OPD stable-rank trajectory. Even random 25% retention changes the update scale more than the spectral shape.

*Off-policy rollouts* (Fig. 7b): Off-policy OPD has modestly larger update norm but matched stable rank.

*Objective mixing* (Fig. 7c): Linear interpolation of OPD and RLVR advantage signals ($\alpha = 0.25, 0.05, 0.01$) shows a clear split: OPD-dominant mixtures retain the OPD-like trajectory, while mixtures with a weak OPD component depart from it.

**Mechanistic interpretation**: Token sparsification rescales second moment $\mathbb{E}[\tilde{g}\tilde{g}^\top] \approx c\mathbb{E}[g_{\text{OPD}}g_{\text{OPD}}^\top] + \text{noise}$, preserving leading spectral directions. Objective mixing changes gradient source: $\Sigma_\alpha \approx \alpha^2\Sigma_{\text{OPD}} + (1-\alpha)^2\Sigma_{\text{RLVR}} + \alpha(1-\alpha)\Sigma_{\text{cross}}$, altering dominant covariance geometry.
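The mixed-objective expression follows from expanding the second moment of the interpolated gradient. As a simplifying assumption not spelled out in the summary, take the gradient to be linear in the advantage, so that $g_\alpha = \alpha\, g_{\text{OPD}} + (1-\alpha)\, g_{\text{RLVR}}$. Then:

$$
\begin{aligned}
\Sigma_\alpha = \mathbb{E}[g_\alpha g_\alpha^\top]
&= \alpha^2\,\mathbb{E}[g_{\text{OPD}} g_{\text{OPD}}^\top]
 + (1-\alpha)^2\,\mathbb{E}[g_{\text{RLVR}} g_{\text{RLVR}}^\top] \\
&\quad + \alpha(1-\alpha)\left(\mathbb{E}[g_{\text{OPD}} g_{\text{RLVR}}^\top] + \mathbb{E}[g_{\text{RLVR}} g_{\text{OPD}}^\top]\right),
\end{aligned}
$$

so $\Sigma_{\text{cross}}$ is the symmetrized cross moment of the two gradient sources. For the tested mixing weights the coefficients on $(\Sigma_{\text{OPD}}, \Sigma_{\text{RLVR}}, \Sigma_{\text{cross}})$ are $(0.0625,\ 0.5625,\ 0.1875)$ at $\alpha = 0.25$ and $(0.0001,\ 0.9801,\ 0.0099)$ at $\alpha = 0.01$. Which term actually dominates also depends on the relative magnitudes of the three matrices, which the summary does not report.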

## Theoretical and Practical Implications

- OPD is **not merely an interpolation** between SFT and RLVR but induces its own update geometry: relaxed off-principal localization plus subspace locking.
- The **Three-Gate view** explains why OPD is geometry-steered like RLVR yet less selective: dense token-level supervision broadens active directions while update remains bounded and geometry-constrained.
- **Practical guideline**: Design OPD as **geometry control**: monitor the locked update channel, use objective composition as the primary lever, and tune token selection and rollout policy through their effect on update geometry.
- The results suggest that effective OPD recipes should regulate **objective-induced update geometry** rather than only token coverage or rollout generation.
- Subspace locking provides a **functional bottleneck**: early low-dimensional subspace suffices for OPD training, enabling potential efficiency gains (e.g., low-rank training).

## Conclusion

The paper provides a parameter-space account of on-policy distillation. Key takeaways:

- OPD occupies a **relaxed off-principal regime**: more selective than SFT, less constrained than RLVR.
- OPD exhibits **subspace locking**: cumulative updates rapidly enter a persistent low-dimensional channel that is functionally sufficient.
- The locked trajectory is **robust to runtime perturbations** (token sparsification, off-policy rollouts) but **sensitive to objective composition**.
- OPD should be viewed as **geometry control**, not merely denser token supervision.

**Limitations**: The analysis covers Qwen3-family reasoning settings; generalizability to other model families and modalities remains to be tested. Diagnostics are computed from stored checkpoints, and the Three-Gate view is a mechanistic explanation consistent with the evidence, not a complete causal theory.

**Future directions**: Geometry-aware OPD algorithms that exploit the locked update channel, adaptation for broader model classes and domains, and formal understanding of the Three-Gate mechanism.
