Self-Distilled RLVR: Summary

Summary (Overview)

  • Identifies a fundamental flaw in On-Policy Self-Distillation (OPSD): The information asymmetry between a teacher (with privileged information) and a student (without it) creates an irreducible mutual information gap $I(Y_t; R \mid X, Y_{<t}) > 0$ in the distribution-matching objective. This leads to privileged information leakage and eventual performance degradation.
  • Proposes RLSD (Reinforcement Learning with Self-Distillation): A new paradigm that repurposes self-distillation from a distribution-matching target to a token-level credit assignment mechanism. The environment reward determines the direction of updates (reinforce/penalize), while the privileged teacher's evidence ratio $P_T(y_t)/P_S(y_t)$ modulates the update magnitude per token.
  • Achieves superior performance and stability: RLSD unifies the reliable direction of RLVR (e.g., GRPO) with the dense, fine-grained signals of self-distillation. It achieves higher convergence ceilings, faster training, and avoids the leakage and collapse seen in OPSD, as validated on multimodal reasoning benchmarks.

Introduction and Theoretical Foundation

Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO train models using sparse, sequence-level rewards (e.g., answer correctness). On-Policy Distillation (OPD) complements this by using a stronger external teacher model to provide dense, token-level supervision, but incurs high computational cost. On-Policy Self-Distillation (OPSD) emerged as an efficient alternative, where a single model acts as both teacher (conditioned on privileged information $r$, like a reference answer) and student (conditioned only on the query $x$).

However, this paper demonstrates that OPSD suffers from systematic privileged information leakage—the model begins to reference invisible "reference solutions" during inference—and unstable long-term training, where performance peaks early then degrades.

The theoretical foundation explains this failure. In OPD, teacher and student are information-symmetric (same input). In OPSD, they are information-asymmetric: the teacher conditions on $r$, which the student cannot observe. This makes the distribution-matching objective ill-posed.

Theorem 1 (KL Decomposition) formalizes the problem. The OPSD objective $L_{\text{OPSD}}$ and the ideal objective $L^*$ (matching the teacher's marginal distribution) are related by:

$$L_{\text{OPSD}} = L^* + I(Y_t; R \mid X, Y_{<t})$$

where $I(Y_t; R \mid X, Y_{<t})$ is the conditional mutual information between the current token $Y_t$ and the privileged information $R$. This term is strictly positive and independent of the student's parameters $\theta$, creating an irreducible optimization gap that drives leakage.
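Theorem 1 can be checked numerically on a toy next-token problem. The sketch below uses illustrative distributions (not from the paper) to show that no student distribution can drive the OPSD-style expected KL below the mutual information term:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Toy instance of the asymmetry: privileged info r is a fair coin, and
# the teacher's next-token distribution depends on r, while the student's
# distribution q cannot (the student never sees r).
p_r0 = np.array([0.7, 0.2, 0.1])   # teacher given r = 0
p_r1 = np.array([0.1, 0.2, 0.7])   # teacher given r = 1
p_marg = 0.5 * (p_r0 + p_r1)       # teacher's marginal over r

# Mutual information I(Y; R) = E_r[ KL(p(.|r) || p_marg) ].
mi = 0.5 * kl(p_r0, p_marg) + 0.5 * kl(p_r1, p_marg)

def opsd_loss(q):
    """Expected KL a student q pays against the r-conditioned teacher."""
    return 0.5 * kl(p_r0, q) + 0.5 * kl(p_r1, q)

# The loss decomposes as KL(p_marg || q) + I(Y; R), so even the best
# possible student (q = p_marg) still pays the irreducible MI gap.
best = opsd_loss(p_marg)                       # equals mi exactly
worse = opsd_loss(np.array([0.5, 0.3, 0.2]))   # any other q pays more
```

The gap `mi` is fixed by the data-generating process, matching the theorem's claim that it does not depend on the student's parameters.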

Methodology

The core insight is to decouple the roles of the environment reward and the teacher signal:

  • Update Direction: Must be reliable and determined by the verifiable environment reward.
  • Update Magnitude: Benefits from being dense and fine-grained, provided by the teacher.

RLSD (Algorithm 1) implements this as follows:

  1. On-Policy Rollout & Sequence-Level Advantage: For a query $x$, sample $G$ responses $\{y^{(1)}, \dots, y^{(G)}\}$ from the student policy $\pi_\theta(\cdot \mid x)$. A verifier provides a binary reward $R(x, y^{(i)}) \in \{0, 1\}$. Compute a group-relative advantage for each response:

    $$A^{(i)} = \frac{R(x, y^{(i)}) - \mu_G}{\sigma_G}$$

    where $\mu_G, \sigma_G$ are the mean and standard deviation of rewards in the group.

  2. Token-Level Credit Assignment via Self-Distillation: For each token $y_t^{(i)}$ in a trajectory:

    • Compute the privileged information gain $\Delta_t = \text{sg}(\log P_T(y_t^{(i)}) - \log P_S(y_t^{(i)}))$, where $\text{sg}$ is the stop-gradient operator, $P_S(\cdot) = \pi_\theta(\cdot \mid x, y_{<t})$, and $P_T(\cdot) = \pi_\theta(\cdot \mid x, r, y_{<t})$.
    • Construct a direction-aware evidence weight $w_t = \exp(\text{sign}(A^{(i)}) \cdot \Delta_t) = \left( P_T(y_t^{(i)}) / P_S(y_t^{(i)}) \right)^{\text{sign}(A^{(i)})}$. This weight has a Bayesian interpretation as the belief-update ratio $P(r \mid x, y_{\le t}) / P(r \mid x, y_{<t})$.
    • Apply clipping for stability and interpolate with a uniform baseline: $\hat{A}_t^{(i)} = A^{(i)} \cdot \left( (1 - \lambda) + \lambda \cdot \text{clip}(w_t, 1 - \epsilon_w, 1 + \epsilon_w) \right)$. The mixing coefficient $\lambda$ decays from 0.5 to 0 over early training.
  3. Policy Update: Update the parameters $\theta$ by maximizing the RLSD objective:

    $$L_{\text{RLSD}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y^{(i)}|} \sum_{t=1}^{|y^{(i)}|} \min\left( w_t A^{(i)},\ \text{clip}(w_t, 1-\epsilon_w, 1+\epsilon_w)\, A^{(i)} \right) \right]$$
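The Bayesian reading of $w_t$ in Step 2 follows from a single application of Bayes' rule, identifying $P_S$ with the marginal next-token distribution given $(x, y_{<t})$ and $P_T$ with its $r$-conditioned counterpart:

$$\frac{P(r \mid x, y_{\le t})}{P(r \mid x, y_{<t})} = \frac{P(y_t \mid x, r, y_{<t})}{P(y_t \mid x, y_{<t})} = \frac{P_T(y_t)}{P_S(y_t)}$$

A token that makes the privileged information $r$ more plausible a posteriori therefore receives a weight above 1 (amplified credit, or amplified blame when $A^{(i)} < 0$).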

RLSD requires only one extra forward pass (for the teacher logits) and serves as a drop-in replacement for the uniform advantage in GRPO.
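Steps 1–3 above can be condensed into a short sketch. The code below is an illustrative numpy rendering of the advantage computation, not the authors' implementation; the log-probabilities are synthetic stand-ins for real model outputs, and `eps_w`/`lam` follow the paper's $\epsilon_w$ and $\lambda$:

```python
import numpy as np

def rlsd_token_advantages(rewards, logp_student, logp_teacher,
                          eps_w=0.2, lam=0.5):
    """Token-level advantages: GRPO direction, self-distilled magnitude.

    rewards      : (G,) binary verifier rewards, one per sampled response
    logp_student : list of (T_i,) arrays, log P_S(y_t | x, y_<t)
    logp_teacher : list of (T_i,) arrays, log P_T(y_t | x, r, y_<t)
    """
    rewards = np.asarray(rewards, dtype=float)
    # Step 1: group-relative sequence advantage (GRPO-style).
    mu, sigma = rewards.mean(), rewards.std()
    A = (rewards - mu) / (sigma + 1e-8)

    token_advs = []
    for A_i, lp_s, lp_t in zip(A, logp_student, logp_teacher):
        # Step 2: privileged information gain (stop-gradient at training
        # time; plain arithmetic here, since this sketch has no autograd).
        delta = lp_t - lp_s
        # Direction-aware evidence weight w_t = (P_T / P_S)^sign(A).
        w = np.exp(np.sign(A_i) * delta)
        # Clip for stability, then interpolate with a uniform baseline.
        w_hat = (1 - lam) + lam * np.clip(w, 1 - eps_w, 1 + eps_w)
        token_advs.append(A_i * w_hat)  # modulated per-token advantage
    return A, token_advs

# Usage: 4 rollouts, 2 correct; tokens the teacher finds more likely
# than the student get amplified credit (or blame, if A_i < 0).
G, T = 4, 5
rng = np.random.default_rng(0)
lp_s = [rng.normal(-2.0, 0.5, T) for _ in range(G)]
lp_t = [lp + rng.normal(0.2, 0.3, T) for lp in lp_s]
A, advs = rlsd_token_advantages([1, 1, 0, 0], lp_s, lp_t)
```

Because $w_t$ is clipped and mixed with the uniform baseline, every token's advantage keeps the sign of its sequence-level advantage $A^{(i)}$, which is how the method preserves the reliable direction of the environment reward.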

Empirical Validation / Results

Experiments were conducted on the Qwen3-VL-8B-Instruct model, trained on the challenging MMFineReason-123K dataset and evaluated on five multimodal reasoning benchmarks.

Table 2: Multimodal reasoning results on the Qwen3-VL-8B-Instruct model.

| Method | MMMU | MathVista | MathVision | ZeroBench | Wemath | Avg. |
|---|---|---|---|---|---|---|
| Base LLM | 62.44 | 73.80 | 47.37 | 19.76 | 54.10 | 51.49 |
| GRPO | 65.11 | 76.20 | 48.82 | 22.60 | 56.57 | 53.86 |
| OPSD | 63.82 | 75.10 | 47.53 | 21.06 | 54.95 | 52.49 |
| SDPO | 65.11 | 74.00 | 47.27 | 25.15 | 52.19 | 52.74 |
| GRPO+OPSD | 63.22 | 75.90 | 48.52 | 22.16 | 54.76 | 52.91 |
| RLSD (Ours) | 67.22 | 78.10 | 52.73 | 24.85 | 58.00 | 56.18 |

Key Findings:

  1. RLSD achieves the highest average accuracy (56.18%), outperforming the base LLM by +4.69 points and GRPO by +2.32 points.
  2. Training Dynamics (Figure 5): RLSD shows a steeper initial ascent and higher final reward than GRPO, while avoiding OPSD's late-stage collapse. It also maintains higher policy entropy than GRPO, indicating less uniform token suppression.
  3. Case Study (Figure 6): Token-level credit heatmaps show RLSD successfully assigns larger credit/blame to decisive reasoning steps (e.g., key calculations) and down-weights generic narration or neutral setup tokens.

Theoretical and Practical Implications

  • Theoretical: Provides a formal analysis of why information-asymmetric distribution matching (OPSD) fails, framing it as an ill-posed objective with an irreducible mutual information gap. Introduces the "impossibility trilemma" for shared-parameter self-distillation: objective stability, sustained improvement, and leakage-free training cannot all be achieved simultaneously under distribution matching. RLSD resolves this trilemma.
  • Practical: RLSD offers a computationally efficient, drop-in improvement for existing RLVR pipelines like GRPO. It requires only the final ground-truth answer as privileged information (no reasoning traces), provides fine-grained credit assignment without training auxiliary models, and ensures training stability anchored to environmental feedback.

Conclusion

This work diagnoses the fundamental limitations of on-policy self-distillation, proving that information asymmetry leads to an ill-posed objective and inevitable leakage. The proposed RLSD paradigm circumvents these issues by repurposing self-distillation from a generative target to a credit assignment modulator. By anchoring update directions to the environment reward and using the teacher's evidence ratio to control token-level magnitudes, RLSD unifies the strengths of RLVR and self-distillation, achieving higher performance, faster convergence, and robust training stability. Future work will explore RLSD's applicability to broader domains beyond multimodal reasoning.