Summary (Overview)
- On-policy distillation (OPD) occupies a relaxed off-principal regime in parameter space, lying between dense principal-aligned updates (SFT) and sparse geometry-preserving updates (RLVR), with a bias toward the RLVR side.
- OPD exhibits subspace locking: cumulative updates rapidly enter a narrow, persistent low-dimensional channel that is functionally sufficient for training—projecting gradients onto the early subspace preserves performance.
- The subspace locking is robust to token sparsification and off-policy rollouts (runtime perturbations) but sensitive to objective composition (mixing with RLVR changes the rank dynamics).
- Key diagnostics (update sparsity, subspace rotation, spectral drift, update localization) consistently place OPD between SFT and RLVR; for example, bf16 update sparsity is 8.1% (SFT), 51.6% (OPD), 77.2% (RLVR).
- The findings suggest that OPD should be designed as geometry control rather than merely denser token supervision, with objective composition as the primary lever for regulating the update channel.
Introduction and Theoretical Foundation
Large reasoning models (LRMs) have advanced complex reasoning in LLMs, driven by post-training paradigms: supervised fine-tuning (SFT) on offline demonstrations, reinforcement learning with verifiable rewards (RLVR) from sparse outcome signals, and on-policy distillation (OPD)—training a student on its own sampled trajectories with dense token-level guidance from a stronger teacher. While the geometric footprints of SFT and RLVR are known (dense/principal-aligned vs. sparse/off-principal), OPD’s parameter-space dynamics remain unclear. OPD combines SFT-like dense token-level distillation with RLVR-like on-policy sampling and policy-gradient optimization. The paper addresses three research questions:
- RQ1: Where does OPD lie in the parameter-space spectrum between SFT and RLVR?
- RQ2: What intrinsic update trajectory does OPD follow during training?
- RQ3: Which component of OPD controls this trajectory?
The theoretical foundation is built on a relaxed Three-Gate view (extending Zhu et al., 2025). The three gates are:
- Distributional anchor: limited update norm under a local quadratic budget.
- Pretrained model geometry: routes bounded updates away from dominant spectral directions.
- Precision realization (bf16): determines which coordinates become visibly changed.
OPD preserves RLVR’s gated structure but relaxes it via dense token-level teacher supervision, broadening the active directions while keeping geometry-steered updates.
Methodology
Experimental setup: The analysis focuses on Qwen3-family checkpoints (Qwen3-8B base, SFT, OPD, RLVR) trained on math-domain prompts. The controlled comparison uses the same SFT anchor and prompt distribution for OPD and RLVR. Additional OPD variants vary teacher size, student size, and data domain, and include an MoE teacher.
Parameter-space diagnostics (four metrics):
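- Update sparsity (bf16-aware): treat a weight $w_i$ as unchanged if $|\hat{w}_i - w_i| \le \eta \max(|w_i|, |\hat{w}_i|)$ for a small relative tolerance $\eta$, where $\hat{w}_i$ is the fine-tuned value. Sparsity is the fraction of weights that are unchanged under this test.
- Principal-angle rotation: cosines of the principal angles between the top-$k$ singular subspaces of the base and fine-tuned weights, $\cos\theta_i = \sigma_i(U_{0,k}^\top U_{+,k})$.
- Spectral drift (normalized spectral shift, NSS): $\mathrm{NSS}(W) = \|\sigma(W_+) - \sigma(W_0)\|_2 / \|\sigma(W_0)\|_2$, where $\sigma(\cdot)$ is the vector of singular values.
- Update–mask overlap: compare the bf16-visible update mask with a principal mask (top-$\alpha$ entries of the rank-$k$ SVD reconstruction) and a low-magnitude mask (bottom-$\alpha$ entries by $|W_0|$). Overlap is the share of the update mask that falls inside each reference mask.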
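The first three diagnostics can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the function names are hypothetical, the tolerance `eta=1e-3` is an assumed placeholder value, and the NSS form (relative L2 change of the singular-value vector) is one plausible reading of "normalized spectral shift".

```python
import numpy as np

def bf16_aware_sparsity(w_base, w_tuned, eta=1e-3):
    """Fraction of weights treated as unchanged under a relative tolerance.

    eta is an assumed placeholder, not a value taken from the paper.
    """
    w_base = np.asarray(w_base, dtype=np.float64)
    w_tuned = np.asarray(w_tuned, dtype=np.float64)
    scale = np.maximum(np.abs(w_base), np.abs(w_tuned))
    unchanged = np.abs(w_tuned - w_base) <= eta * scale
    return float(unchanged.mean())

def principal_angles_deg(w_base, w_tuned, k):
    """Principal angles (degrees) between the top-k left singular subspaces."""
    u0 = np.linalg.svd(w_base, full_matrices=False)[0][:, :k]
    u1 = np.linalg.svd(w_tuned, full_matrices=False)[0][:, :k]
    # Singular values of U0^T U1 are the cosines of the principal angles.
    cosines = np.linalg.svd(u0.T @ u1, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

def normalized_spectral_shift(w_base, w_tuned):
    """Relative L2 change of the singular-value vector (assumed NSS form)."""
    s0 = np.linalg.svd(w_base, compute_uv=False)
    s1 = np.linalg.svd(w_tuned, compute_uv=False)
    return float(np.linalg.norm(s1 - s0) / np.linalg.norm(s0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w0 = rng.standard_normal((64, 32))
    w1 = w0.copy()
    w1[:16] += 0.1 * rng.standard_normal((16, 32))  # touch a quarter of the rows
    print("sparsity:", bf16_aware_sparsity(w0, w1))
    print("max principal angle (deg):", principal_angles_deg(w0, w1, k=8).max())
    print("NSS:", normalized_spectral_shift(w0, w1))
```

In practice these would be computed per weight matrix and then aggregated across layers; the aggregation rule is not specified in this summary.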
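Trajectory analysis: the cumulative update at step $t$ is $\Delta W_t = W_t - W_0$. Its stable rank is $\mathrm{sr}(\Delta W_t) = \|\Delta W_t\|_F^2 / \|\Delta W_t\|_2^2$. Subspace similarity compares the top-$k$ singular subspace of $\Delta W_t$ with that of the final update $\Delta W_T$. Functional sufficiency is tested by projecting gradients onto a fixed early rank-$k$ subspace and training only within it.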
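A small sketch of the two trajectory statistics may help. Stable rank uses its standard definition (squared Frobenius norm over squared spectral norm). The subspace similarity shown here, the normalized squared Frobenius norm of the product of two top-k bases, is an assumed choice, since this summary does not state the exact formula; all function names are hypothetical.

```python
import numpy as np

def stable_rank(delta_w):
    """Squared Frobenius norm divided by squared spectral norm."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    if s[0] == 0.0:
        return 0.0  # convention for an all-zero update
    return float((s ** 2).sum() / s[0] ** 2)

def top_k_left_basis(delta_w, k):
    """Orthonormal basis of the top-k left singular subspace."""
    return np.linalg.svd(delta_w, full_matrices=False)[0][:, :k]

def subspace_similarity(delta_a, delta_b, k):
    """Assumed metric: ||U_a^T U_b||_F^2 / k, which lies in [0, 1]."""
    ua = top_k_left_basis(delta_a, k)
    ub = top_k_left_basis(delta_b, k)
    return float(np.linalg.norm(ua.T @ ub, "fro") ** 2 / k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w0 = rng.standard_normal((48, 48))
    # Hypothetical checkpoints: a shared low-rank drift plus small noise.
    drift = rng.standard_normal((48, 4)) @ rng.standard_normal((4, 48))
    checkpoints = [w0 + t * drift + 0.05 * rng.standard_normal((48, 48))
                   for t in (0.2, 0.6, 1.0)]
    final_delta = checkpoints[-1] - w0
    for w_t in checkpoints:
        delta = w_t - w0
        print(stable_rank(delta), subspace_similarity(delta, final_delta, k=4))
```

A value of 1 means the two top-k subspaces coincide and 0 means they are orthogonal, so a curve that starts high and stays high corresponds to the early alignment described for OPD.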
Control experiments (RQ3): perturb token supervision density (top-KL or random retention at 25%/50%), rollout policy (off-policy vs. on-policy), and objective composition (linear interpolation between the OPD and RLVR advantage signals).
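The token-density perturbation can be pictured as a boolean mask over the tokens of a rollout. The sketch below assumes that "top-KL" means keeping the tokens with the largest per-token student-teacher KL; that reading, the function name, and the rounding rule are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def retention_mask(per_token_kl, keep_frac, mode, rng=None):
    """Boolean mask selecting which tokens keep their distillation loss.

    mode="top_kl" keeps the tokens with the largest KL (assumed meaning);
    mode="random" keeps a uniformly random subset of the same size.
    """
    per_token_kl = np.asarray(per_token_kl, dtype=np.float64)
    n = per_token_kl.shape[0]
    n_keep = max(1, int(round(keep_frac * n)))
    if mode == "top_kl":
        keep = np.argsort(per_token_kl)[-n_keep:]
    elif mode == "random":
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.choice(n, size=n_keep, replace=False)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    mask = np.zeros(n, dtype=bool)
    mask[keep] = True
    return mask

if __name__ == "__main__":
    kl = np.array([0.1, 0.9, 0.3, 0.7, 0.05, 0.4, 0.2, 0.6])
    print(retention_mask(kl, 0.5, "top_kl"))
    print(retention_mask(kl, 0.25, "random", rng=np.random.default_rng(0)))
```

The masked loss would then be the per-token loss multiplied by this mask; whether it is renormalized by the number of kept tokens is a separate design choice that affects update scale.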
Empirical Validation / Results
1. OPD occupies a relaxed off-principal regime (RQ1)
Update sparsity: In the controlled Qwen3-8B comparison, SFT leaves 8.1% of weights unchanged at bf16 precision, RLVR 77.2%, and OPD 51.6%. OPD sparsity is stable across variants (48.6%–57.1%). Published reference points confirm the SFT–RLVR separation.
Table 1: bf16-aware update sparsity
| Base Model → Finetuned Model | Algorithm | Data | Sparsity |
|---|---|---|---|
| Qwen3-8B-Base → Qwen3-8B-SFT | SFT | Math | 8.1% |
| Qwen3-8B-SFT → OPD-8B-T32B | OPD | Math | 51.6% |
| Qwen3-8B-SFT → RLVR-8B | GRPO | Math | 77.2% |
| ... OPD variants (4B/14B/8B, different teachers) | OPD | Math/Code | 48.6%–57.1% |
| Published reference (SFT/GRPO) | SFT/GRPO | Math+Code | 0.6%–79.9% |
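Subspace rotation and spectral drift: For a single layer, SFT shows principal angles above 10°, RLVR below 0.5°, and OPD about 1°. Normalized spectral shift (NSS) gives the same ordering: SFT drifts most, RLVR least, and OPD sits in between.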
Update localization: Global update density: SFT 92.73%, OPD 46.28%, RLVR 27.76%. Principal-mask overlap: SFT 29.66%, OPD 27.31%, RLVR 26.65% (all near/below 30% random baseline). Low-magnitude overlap: SFT 31.88%, OPD 53.59%, RLVR 73.48%.
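To make the overlap numbers concrete, here is a minimal sketch of the mask comparison. It assumes overlap is normalized by the size of the update mask, so a uniformly random update scores about `alpha`, and it uses `alpha=0.3` only because the reported random baseline is about 30%. The normalization, `alpha`, `eta`, and all function names are assumptions, not confirmed details of the paper.

```python
import numpy as np

def update_mask(w_base, w_tuned, eta=1e-3):
    """Coordinates whose change is visible under a relative tolerance."""
    scale = np.maximum(np.abs(w_base), np.abs(w_tuned))
    return np.abs(w_tuned - w_base) > eta * scale

def principal_mask(w_base, k, alpha):
    """Top-alpha entries (by magnitude) of the rank-k SVD reconstruction."""
    u, s, vt = np.linalg.svd(w_base, full_matrices=False)
    recon = np.abs((u[:, :k] * s[:k]) @ vt[:k])
    return recon >= np.quantile(recon, 1.0 - alpha)

def low_magnitude_mask(w_base, alpha):
    """Bottom-alpha entries of the base weights by absolute value."""
    mag = np.abs(w_base)
    return mag <= np.quantile(mag, alpha)

def overlap(update, reference):
    """Share of updated coordinates that fall inside the reference mask."""
    n_updated = int(update.sum())
    if n_updated == 0:
        return 0.0
    return float((update & reference).sum() / n_updated)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w0 = rng.standard_normal((40, 40))
    low = low_magnitude_mask(w0, alpha=0.3)
    w1 = w0.copy()
    w1[low] += 0.5  # a hypothetical update that only touches small weights
    upd = update_mask(w0, w1)
    print("low-magnitude overlap:", overlap(upd, low))
    print("principal overlap:", overlap(upd, principal_mask(w0, k=4, alpha=0.3)))
```

Under this normalization, an update that touches every coordinate has overlap equal to the mask fraction, which is why values near 30% carry little information about localization.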
2. OPD exhibits subspace locking (RQ2)
Stable rank trajectory (Fig. 4a): OPD stays in a narrow low-rank band (~20–30) throughout training. SFT expands (~10 to >60), RLVR contracts (~30 to ~15). OPD is not a temporal interpolation.
Update scale (Fig. 4b): OPD has a substantially larger cumulative update norm than RLVR, which rules out the trivial explanation that its narrow rank band comes from small updates. Hill tail estimates (Fig. 4c) confirm that OPD evolves only mildly.
Early subspace emergence (Fig. 5): Top-16 subspace similarity to final update: OPD aligns from first checkpoint (~0.8), SFT and RLVR converge gradually.
Functional sufficiency (Fig. 6): Projecting gradients onto a rank-16 subspace (extracted at around 20% of training) preserves OPD performance on reasoning benchmarks (AIME 2024, MATH-500). The same constraint degrades SFT.
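The projection test can be sketched as follows. This version projects a weight-matrix gradient onto the top-k left singular subspace of an early cumulative update; whether the paper projects on the left, the right, or both sides is not stated here, so the one-sided form and the function names are assumptions.

```python
import numpy as np

def early_subspace(delta_w_early, k):
    """Top-k left singular vectors of an early cumulative update."""
    u, _, _ = np.linalg.svd(delta_w_early, full_matrices=False)
    return u[:, :k]

def project_gradient(grad, basis):
    """Orthogonal projection of grad onto span(basis): U (U^T G)."""
    return basis @ (basis.T @ grad)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_out, d_in, k = 64, 32, 16
    # Hypothetical early update, e.g. taken at roughly 20% of training.
    delta_early = rng.standard_normal((d_out, d_in))
    basis = early_subspace(delta_early, k)
    grad = rng.standard_normal((d_out, d_in))
    grad_proj = project_gradient(grad, basis)
    kept = np.linalg.norm(grad_proj) / np.linalg.norm(grad)
    print("fraction of gradient norm kept:", kept)
```

In a training loop, the projected gradient would replace the raw gradient before the optimizer step, so every subsequent update stays inside the fixed subspace.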
3. Objective composition controls subspace locking (RQ3)
Token sparsification (Fig. 7a): Top-KL 50%/25% and Random 50%/25% retain OPD stable-rank trajectory. Even random 25% changes update scale more than spectral shape.
Off-policy rollouts (Fig. 7b): Off-policy OPD has modestly larger update norm but matched stable rank.
Objective mixing (Fig. 7c): Linear interpolation of the OPD and RLVR advantage signals shows a clear split: OPD-dominant mixtures retain the OPD-like trajectory, while mixtures with a weak OPD component depart from it.
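As a concrete picture of the mixing control, the sketch below blends a dense per-token OPD advantage with a sequence-level RLVR advantage. The convention that `lam` is the weight on the OPD term, the broadcasting of the RLVR advantage over tokens, and the function name are all illustrative assumptions.

```python
import numpy as np

def mix_advantages(adv_opd, adv_rlvr, lam):
    """Linear interpolation: lam * OPD + (1 - lam) * RLVR.

    adv_opd: per-token advantages, shape (T,).
    adv_rlvr: scalar or shape (T,); a scalar is broadcast over all tokens.
    lam: assumed to be the weight on the OPD term, in [0, 1].
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    adv_opd = np.asarray(adv_opd, dtype=np.float64)
    adv_rlvr = np.broadcast_to(np.asarray(adv_rlvr, dtype=np.float64),
                               adv_opd.shape)
    return lam * adv_opd + (1.0 - lam) * adv_rlvr

if __name__ == "__main__":
    token_adv = np.array([0.2, -0.1, 0.4, 0.0])  # dense teacher-derived signal
    outcome_adv = 1.0                            # one verifier-derived value
    for lam in (1.0, 0.75, 0.25, 0.0):
        print(lam, mix_advantages(token_adv, outcome_adv, lam))
```

With `lam` near 1 the per-token structure dominates; with `lam` near 0 every token receives nearly the same outcome-level signal.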
Mechanistic interpretation: Token sparsification mainly rescales the gradient second moment, preserving its leading spectral directions. Objective mixing changes the gradient source itself by blending OPD and RLVR terms, which alters the dominant covariance geometry.
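A toy simulation can illustrate this contrast. It is a synthetic sketch, not a reproduction of the paper's analysis: per-token "gradients" are drawn from two Gaussian sources with different dominant axes, random token dropping is modeled as zeroing contributions while averaging over the full batch, and all names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4000

# Two synthetic gradient sources with different dominant directions.
scales_opd = np.array([3.0] + [0.5] * (d - 1))        # dominant along axis 0
scales_rlvr = np.array([0.5, 3.0] + [0.5] * (d - 2))  # dominant along axis 1
g_opd = rng.standard_normal((n, d)) * scales_opd
g_rlvr = rng.standard_normal((n, d)) * scales_rlvr

def second_moment(g, keep=None):
    """Average of g g^T over the full batch; dropped tokens contribute zero."""
    if keep is None:
        keep = np.ones(len(g), dtype=bool)
    gk = g * keep[:, None]
    return gk.T @ gk / len(g)

def top_direction(m):
    """Eigenvector of the largest eigenvalue of a symmetric matrix."""
    return np.linalg.eigh(m)[1][:, -1]

m_full = second_moment(g_opd)

# Random 25% token retention: scale shrinks, leading direction barely moves.
keep = rng.random(n) < 0.25
m_sparse = second_moment(g_opd, keep)
trace_ratio = np.trace(m_sparse) / np.trace(m_full)
align_sparse = abs(top_direction(m_full) @ top_direction(m_sparse))

# Mixing in a different gradient source with a weak OPD weight.
lam = 0.2
g_mix = lam * g_opd + (1.0 - lam) * g_rlvr
align_mix = abs(top_direction(m_full) @ top_direction(second_moment(g_mix)))

print("trace ratio after sparsification:", trace_ratio)
print("top-direction alignment, sparsified:", align_sparse)
print("top-direction alignment, mixed:", align_mix)
```

In this toy setting the sparsified second moment is close to a scaled copy of the full one, while the weak-OPD mixture has a leading direction nearly orthogonal to the original.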
Theoretical and Practical Implications
- OPD is not merely an interpolation between SFT and RLVR but induces its own update geometry: relaxed off-principal localization plus subspace locking.
- The Three-Gate view explains why OPD is geometry-steered like RLVR yet less selective: dense token-level supervision broadens active directions while update remains bounded and geometry-constrained.
- Practical guideline: Design OPD as geometry control—monitor the locked update channel, use objective composition as primary lever, and tune token selection/rollout policy through their effect on update geometry.
- The results suggest that effective OPD recipes should regulate objective-induced update geometry rather than only token coverage or rollout generation.
- Subspace locking provides a functional bottleneck: early low-dimensional subspace suffices for OPD training, enabling potential efficiency gains (e.g., low-rank training).
Conclusion
The paper provides a parameter-space account of on-policy distillation. Key takeaways:
- OPD occupies a relaxed off-principal regime: more selective than SFT, less constrained than RLVR.
- OPD exhibits subspace locking: cumulative updates rapidly enter a persistent low-dimensional channel that is functionally sufficient.
- The locked trajectory is robust to runtime perturbations (token sparsification, off-policy rollouts) but sensitive to objective composition.
- OPD should be viewed as geometry control, not merely denser token supervision.
Limitations: The analysis covers Qwen3-family reasoning settings; generalizability to other model families and modalities remains to be tested. The diagnostics are computed from stored checkpoints, and the Three-Gate view is a mechanistic explanation consistent with the evidence, not a complete causal theory.
Future directions: Geometry-aware OPD algorithms that exploit the locked update channel, adaptation for broader model classes and domains, and formal understanding of the Three-Gate mechanism.