Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Summary (Overview)

Identifies a critical flaw in standard on-policy self-distillation for mathematical reasoning. The per-token signal, interpreted as conditional Pointwise Mutual Information (PMI), rewards "shortcut" tokens implied by the privileged context (verified solution) and penalizes "deliberation" tokens (e.g., "Wait", "Let", "Maybe") essential for multi-step search.
Proposes Anti-Self-Distillation (AntiSD), a simple yet effective fix that reverses the gradient direction. Instead of descending, it ascends a Jensen-Shannon Divergence (JSD) between the student and teacher, flipping the per-token reward sign. An entropy-triggered gate stabilizes the process.
Demonstrates significant empirical gains across five models (4B to 30B parameters). AntiSD reaches the baseline GRPO's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points, while default self-distillation consistently underperforms.

Introduction and Theoretical Foundation

Reinforcement Learning from Verifiable Rewards (RLVR) is a dominant paradigm for improving reasoning in Large Language Models (LLMs). However, its reward is a sparse, trajectory-level scalar, which makes credit assignment to individual reasoning steps difficult. Two main approaches address this: training a separate Process Reward Model (PRM) or applying On-Policy Distillation (OPD) from a stronger external teacher.

On-policy self-distillation offers a promising alternative by using the model itself as the teacher, conditioned on "privileged context" (e.g., a verified solution). This provides a token-level signal without external resources. However, its gains on challenging mathematical reasoning benchmarks are inconsistent and often negative.

This paper traces the failure to the privileged context itself. Conditioning the teacher on the known solution turns it into an oracle, inflating its confidence on tokens that logically follow from the answer ("shortcut tokens") and deflating it on tokens that explore alternatives ("deliberation tokens"). Standard self-distillation pulls the student towards this biased teacher, reinforcing the wrong behaviors.

The core theoretical insight is that the per-token log-ratio between teacher and student is the conditional Pointwise Mutual Information (PMI):

$$u_t = \log \pi_T(y_t | x, y_{<t}) - \log \pi_S(y_t | x, y_{<t}) = \text{PMI}(y_t; c | x, y_{<t})$$

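where $c$ is the privileged context and the teacher $\pi_T$ is the same model additionally conditioned on $c$ (the conditioning on $c$ is implicit in the notation $\pi_T$). The default reward $\delta_t = +u_t$ thus rewards tokens whose probability is raised by knowing the answer ( $u_t > 0$ ) and penalizes those whose probability it lowers ( $u_t < 0$ ).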
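To make the sign convention concrete, here is a minimal sketch that computes $u_t$ from per-token log-probabilities. The token list and all probabilities are invented for illustration and do not come from the paper; `per_token_log_ratio` is a hypothetical helper name.

```python
import math


def per_token_log_ratio(teacher_logprobs, student_logprobs):
    """Return u_t = log pi_T(y_t | ...) - log pi_S(y_t | ...) for each sampled token."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same tokens")
    return [lt - ls for lt, ls in zip(teacher_logprobs, student_logprobs)]


# Toy probabilities, invented for illustration only (not from the paper).
tokens = ["Wait", ",", "so", "x", "=", "4"]
student_probs = [0.20, 0.90, 0.50, 0.30, 0.80, 0.25]  # pi_S: sees only the problem x
teacher_probs = [0.02, 0.90, 0.50, 0.45, 0.85, 0.95]  # pi_T: also sees the solution c

u = per_token_log_ratio(
    [math.log(p) for p in teacher_probs],
    [math.log(p) for p in student_probs],
)

for token, u_t in zip(tokens, u):
    # u_t > 0: knowing c raises the token's probability (shortcut side).
    # u_t < 0: knowing c lowers it (deliberation side).
    kind = "shortcut" if u_t > 0 else "deliberation" if u_t < 0 else "neutral"
    print(f"{token!r:8} u_t = {u_t:+.3f}  ({kind})")
```

In this toy rollout the teacher, which already knows the answer, finds "Wait" unlikely and "4" very likely, so the default reward $+u_t$ would push the student away from the former and towards the latter.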

Methodology

The standard on-policy self-distillation loss added to the GRPO objective is the per-token reverse KL divergence:

$$\mathcal{L}_{SD}(\theta) = \mathbb{E}_{x, y \sim \pi_S(\cdot|x)} \left[ \sum_{t=1}^{T} D_{KL}\left( \pi_S(\cdot | x, y_{<t}) \ \| \ \text{sg}[\pi_T(\cdot | x, y_{<t})] \right) \right]$$

where $\pi_S$ is the student (generating the rollout) and $\pi_T$ is the teacher (same network conditioned on $c$ ). The gradient of this loss yields a per-token advantage $\delta_t = +u_t$ .
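The step from the loss to the advantage can be sketched as follows. This derivation assumes the rollout prefix is held fixed (gradients through the sampling distribution are ignored) and uses $s_t = (x, y_{<t})$ and $u_t(v) = \log \pi_T(v | s_t) - \log \pi_S(v | s_t)$ as shorthand introduced here, with $v$ ranging over the vocabulary:

$$\nabla_\theta D_{KL}\left( \pi_S(\cdot | s_t) \ \| \ \text{sg}[\pi_T(\cdot | s_t)] \right) = \sum_{v} \nabla_\theta \pi_S(v | s_t) \left[ \log \pi_S(v | s_t) - \log \pi_T(v | s_t) \right] + \sum_{v} \nabla_\theta \pi_S(v | s_t)$$

The second sum vanishes because $\sum_v \pi_S(v | s_t) = 1$. Using $\nabla_\theta \pi_S = \pi_S \nabla_\theta \log \pi_S$ on the first sum gives

$$\nabla_\theta D_{KL} = -\mathbb{E}_{v \sim \pi_S(\cdot | s_t)} \left[ u_t(v) \, \nabla_\theta \log \pi_S(v | s_t) \right]$$

so a descent step on $\mathcal{L}_{SD}$ moves along $\mathbb{E}[u_t \nabla_\theta \log \pi_S]$, which has the form of a policy-gradient update with per-token advantage $+u_t$.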

Anti-Self-Distillation (AntiSD) modifies this approach with three key components:

Sign Reversal (Descent → Ascent): Based on observation (O1) that the default signal has the wrong polarity for reasoning, AntiSD ascends a divergence instead of descending it.
Jensen-Shannon Divergence (JSD) Ascent: Based on observation (O2) that deliberation tokens ( $u_t \ll 0$ ) are over-sampled and can take extreme values, AntiSD ascends a JSD, whose per-token term is bounded on one side only. The per-token advantage for ascending JSD is $A^{\text{AntiSD}}_t = -\phi(u_t)$ with $\phi(u) := \frac{1}{2}(\text{softplus}(u) - \log 2)$ . The function $\phi$ is monotonically increasing and sign-preserving, bounded below by $-\frac{1}{2}\log 2$ , and asymptotically linear as $u \to +\infty$ . The bonus for deliberation tokens is therefore capped at $+\frac{1}{2}\log 2$ , while the penalty for shortcut tokens remains linear (unbounded).
Entropy-Triggered Gate: Since ascent is not self-terminating, a gate disables the AntiSD term when the teacher's per-token entropy collapses, which indicates that the log-ratio $u_t$ is no longer informative. The gate state $g$ and mixing weight $\lambda$ are updated as $g \leftarrow \begin{cases} 1 & \text{if } g = 0 \text{ and } H \geq H_{\text{warm}} \\ 0 & \text{if } g = 1 \text{ and } H < \tau_{\text{down}} \\ g & \text{otherwise} \end{cases}, \quad \lambda = g \cdot \lambda_{\text{max}}$ , where $H$ is the median teacher entropy, $H_{\text{warm}}$ is a baseline from warmup, and $\tau_{\text{down}} = 0.93 H_{\text{warm}}$ is an auto-calibrated threshold. The gate thus closes when $H$ drops below $\tau_{\text{down}}$ and reopens only once $H$ recovers to $H_{\text{warm}}$ , so the two thresholds form a hysteresis band.
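The gate update above can be sketched as a small state machine. This is an illustrative reading of the update rule, not the authors' implementation: the class name `EntropyGate`, the choice to start with the gate open, and the toy entropy values are all assumptions.

```python
class EntropyGate:
    """Hysteresis gate on the AntiSD mixing weight lambda (illustrative sketch)."""

    def __init__(self, h_warm, lambda_max, down_ratio=0.93, initially_open=True):
        self.h_warm = h_warm                  # baseline entropy measured during warmup
        self.tau_down = down_ratio * h_warm   # tau_down = 0.93 * H_warm
        self.lambda_max = lambda_max
        # Assumption: the summary does not state the initial gate state.
        self.g = 1 if initially_open else 0

    def update(self, median_teacher_entropy):
        """Apply one gate update and return the mixing weight lambda = g * lambda_max."""
        h = median_teacher_entropy
        if self.g == 0 and h >= self.h_warm:
            self.g = 1   # entropy has recovered to the warmup baseline: re-enable
        elif self.g == 1 and h < self.tau_down:
            self.g = 0   # entropy collapsed below the threshold: disable AntiSD
        return self.g * self.lambda_max


# Toy entropy trace, invented for illustration.
gate = EntropyGate(h_warm=1.0, lambda_max=0.1)
entropies = [1.00, 0.95, 0.92, 0.96, 1.01, 0.94]
lambdas = [gate.update(h) for h in entropies]
print(lambdas)  # [0.1, 0.1, 0.0, 0.0, 0.1, 0.1]
```

Note that the value 0.96 does not reopen the gate even though it is above $\tau_{\text{down}}$: reopening requires reaching $H_{\text{warm}}$, which keeps the gate from flickering around a single threshold.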

The final combined advantage is: $A_{i,t} = A^{\text{seq}}_i - \lambda \cdot \phi(u_t)$ .
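A minimal numerical sketch of $\phi$ and the combined advantage follows. The function names, the sequence-level advantage of 1.0, the value $\lambda = 0.1$, and the log-ratios are illustrative assumptions; only the formulas for $\phi$ and $A_{i,t}$ come from the text.

```python
import math


def softplus(u):
    """Numerically stable log(1 + exp(u))."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))


def phi(u):
    """phi(u) = 0.5 * (softplus(u) - log 2); zero at u = 0, bounded below."""
    return 0.5 * (softplus(u) - math.log(2.0))


def combined_advantages(seq_advantage, log_ratios, lam):
    """A_{i,t} = A_i^seq - lam * phi(u_t) for every token of one rollout."""
    return [seq_advantage - lam * phi(u_t) for u_t in log_ratios]


# Toy values, invented for illustration.
log_ratios = [-20.0, -2.0, 0.0, 2.0, 20.0]
advantages = combined_advantages(seq_advantage=1.0, log_ratios=log_ratios, lam=0.1)

# Largest bonus any token can receive: lam * 0.5 * log 2.
max_bonus = 0.1 * 0.5 * math.log(2.0)

for u_t, a in zip(log_ratios, advantages):
    print(f"u_t = {u_t:+6.1f}  A = {a:+.4f}")
print(f"bonus cap = {max_bonus:.4f}")
```

The deliberation side ( $u_t \ll 0$ ) saturates just below `1.0 + max_bonus`, whereas the shortcut side ( $u_t \gg 0$ ) keeps decreasing roughly as $-\lambda u_t / 2$.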

Empirical Validation / Results

Experiments were conducted on five models (Qwen3 and Olmo-3 families, 4B-30B parameters) trained on the DAPO-Math-17k dataset and evaluated on math reasoning benchmarks (AIME 2024/25/26, HMMT 2025, MinervaMath).

Table 1: Main Results (Accuracy %)

Method	AIME24	AIME25	AIME26	HMMT25	Minerva	Average @ step	Speedup vs. GRPO
Qwen3-8B
+GRPO	73.5	65.2	64.2	39.2	45.1	57.4 @200	1.0x
+SD	40.1	30.5	26.9	14.9	40.7	30.6 @200	×
+AntiSD	78.4	73.4	73.7	54.4	48.5	65.7 @180	5.0x
Qwen3-4B-IT-2507
+GRPO	67.8	57.7	63.5	34.1	33.2	51.3 @200	1.0x
+SD	59.8	45.8	52.0	28.8	43.0	45.9 @10	×
+AntiSD	76.6	70.2	74.4	46.7	46.4	62.8 @100	10.0x
Olmo3-7B-IT
+GRPO	57.0	45.3	52.1	31.2	29.1	43.0 @190	1.0x
+SD	54.5	41.8	46.6	24.4	38.5	41.1 @10	×
+AntiSD	62.4	49.1	55.2	32.3	42.4	48.3 @200	9.5x
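As a quick arithmetic check on Table 1, the sketch below computes each method's change in average accuracy relative to GRPO. The numbers are copied from the Average column of the three models shown; the dictionary layout and helper name are illustrative. Note that the averages are reported at different points (the "@" values), so these are differences between reported figures rather than same-step comparisons.

```python
# Average accuracy (%) per method, copied from the Average column of Table 1.
reported_avg = {
    "Qwen3-8B":         {"GRPO": 57.4, "SD": 30.6, "AntiSD": 65.7},
    "Qwen3-4B-IT-2507": {"GRPO": 51.3, "SD": 45.9, "AntiSD": 62.8},
    "Olmo3-7B-IT":      {"GRPO": 43.0, "SD": 41.1, "AntiSD": 48.3},
}


def gain_over_grpo(model, method):
    """Difference in reported average accuracy vs. GRPO, in points."""
    row = reported_avg[model]
    return round(row[method] - row["GRPO"], 1)


gains = {
    model: {method: gain_over_grpo(model, method) for method in ("SD", "AntiSD")}
    for model in reported_avg
}

for model, g in gains.items():
    print(f"{model:18} SD: {g['SD']:+.1f}  AntiSD: {g['AntiSD']:+.1f}")
```

For these three models the AntiSD gains are +8.3, +11.5, and +5.3 points, and the SD deltas are all negative.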

Key Findings:

Speed & Final Performance: AntiSD reaches GRPO's peak accuracy in 2 to 10x fewer steps and improves final average accuracy by +2.1 to +11.5 points. Default Self-Distillation (SD) consistently underperforms GRPO, often dramatically.
Pass@k Analysis: AntiSD's lead over GRPO is sustained across sampling budgets (k=1 to k=32), indicating gains come from solving more problems, not just concentrating probability on a few.
Generalization to Code: On code reasoning benchmarks (HumanEval+, MBPP+), AntiSD also shows consistent, though smaller, improvements over GRPO.
Ablations: Sign reversal is the dominant factor. The JSD bound and entropy gate are crucial stabilizers, especially for models prone to entropy collapse. Removing the teacher (using only student log-prob) leads to rapid self-reinforcement collapse.
Continual Training: Applying AntiSD on top of a saturated GRPO checkpoint quickly recovers most of the gains, showing it provides complementary signal.
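For readers unfamiliar with the Pass@k metric mentioned above, here is a sketch of the commonly used unbiased estimator, which computes pass@k from $n$ samples per problem of which $c$ are correct. Whether the paper uses this exact estimator is an assumption; the summary does not say.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per problem, c: number of correct samples, k: budget (k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy example: 32 samples for one problem, 4 of them correct.
for k in (1, 8, 32):
    print(f"pass@{k} = {pass_at_k(32, 4, k):.3f}")
```

A method that merely concentrates probability on problems it already solves raises pass@1 but leaves pass@k at large k unchanged, which is why a lead that persists up to k=32 suggests more problems are being solved.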

Theoretical and Practical Implications

Theoretical Implications:

Provides a clear information-theoretic interpretation of the on-policy self-distillation signal as conditional PMI.
Explains the previously observed "shortening" of responses as a structural shortcut bias, not benign compression.
Frames the per-token advantage as a potential-based reward shaping term, which leaves the set of optimal policies invariant.
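For reference, potential-based reward shaping in its standard form (Ng, Harada, and Russell, 1999) adds to the environment reward a term built from a potential function $\Phi$ over states and a discount factor $\gamma$ :

$$F(s, a, s') = \gamma \Phi(s') - \Phi(s)$$

Along any trajectory the discounted shaping terms telescope, so every policy's return from a given start state shifts by the same amount and the ranking of policies is unchanged. This summary does not specify which potential the paper uses for the AntiSD term, so the symbols $\Phi$ , $\gamma$ , and $F$ here are generic notation rather than the paper's.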

Practical Implications:

AntiSD is a simple, drop-in replacement for default self-distillation that yields significantly faster convergence and higher final performance on reasoning tasks.
It enables scalable self-improvement without external teachers or annotated process rewards.
The method is compute-efficient, adding negligible overhead to the base GRPO training loop.

Conclusion

The paper identifies a fundamental flaw in on-policy self-distillation for reasoning: the privileged context biases the per-token PMI signal against the deliberation steps necessary for search. Anti-Self-Distillation corrects this by reversing the gradient direction and using a bounded divergence, stabilized by an entropy gate. This simple change leads to substantial gains in training efficiency and final accuracy across a range of models and benchmarks, opening a path for LLMs to more effectively bootstrap their own reasoning capabilities. Future work may explore extensions to multi-turn agentic settings and richer forms of privileged context.