Self-Distilled Agentic Reinforcement Learning (SDAR) - Summary

Summary (Overview)

Primary Contribution: Introduces SDAR, a method that integrates On-Policy Self-Distillation (OPSD) as a gated auxiliary objective to Reinforcement Learning (RL) for training multi-turn LLM agents. It addresses the instability of naive OPSD in long-horizon interactions.
Core Mechanism: Employs a token-level sigmoid gate ($g_t = \sigma(\beta \Delta_t)$) that adaptively modulates the distillation intensity based on the detached Teacher-Student log-probability gap ($\Delta_t$). This strengthens distillation on teacher-endorsed tokens (positive $\Delta_t$) and softly attenuates it on negative-gap tokens, which may arise from imperfect skill retrieval or utilization.
Key Findings: SDAR achieves substantial performance improvements over the GRPO baseline across three benchmarks (ALFWorld, Search-QA, WebShop) and multiple model scales (Qwen2.5-3B/7B, Qwen3-1.7B). It successfully internalizes skills (no need for skills at inference) and shows robust gains even with low-quality skill retrieval.
Stability: SDAR avoids the catastrophic instability observed in naive GRPO+OPSD combinations and other hybrid methods (RLSD, Skill-SD), maintaining stable optimization by preserving RL as the unbiased primary backbone.
Ablation Insights: Teacher-Student gap gating combined with reverse KL divergence performs best. Performance is sensitive to the distillation coefficient ($\lambda$) and the sigmoid sharpness ($\beta$); $\lambda=0.01$ and $\beta=5.0$ give the best results.

Introduction and Theoretical Foundation

Training LLMs to act as multi-turn agents is a central challenge. Two complementary paradigms exist: Reinforcement Learning (RL), which provides coarse, trajectory-level reward signals, and On-Policy Self-Distillation (OPSD), which provides dense, token-level guidance from a teacher branch augmented with privileged context (e.g., retrieved skills).

However, applying OPSD directly to multi-turn agents is problematic due to two key observations:

Multi-turn OPSD Instability: As the student policy drifts from the teacher-supported trajectory over multiple turns, the token-level supervision becomes increasingly unreliable, leading to compounding error, surging KL divergence, and catastrophic performance degradation.
Asymmetric Trust in Privileged Guidance: The teacher is the same policy augmented with privileged context (skills). A negative teacher-student gap ($\Delta_t < 0$) should be interpreted cautiously, as it may indicate a token to suppress or may arise from: (a) low-quality skill retrieval, (b) failure to utilize relevant skills, or (c) multi-turn drift. Analysis shows negative-gap tokens exceed 50% of all tokens, making this a pervasive issue.

This motivates the core philosophy of SDAR: RL should remain the primary, unbiased optimization backbone, while OPSD serves as a carefully controlled auxiliary objective. The control is implemented via an adaptive, token-level gating mechanism, creating a dynamic, self-paced curriculum at the finest granularity.

Methodology

2.1 Problem Setup

A multi-turn agent generates a sequence of response tokens $y = (y_1, \dots, y_T) \sim \pi_\theta(\cdot|x)$. At token $t$, the student context is $s_t = (x, y_{<t})$ and the teacher context is $s^+_t = (x, c^+, y_{<t})$, where $c^+$ is privileged training-only context (e.g., retrieved skills).

Skill Retrieval: Four strategies of varying quality are implemented:

UCB Retrieval: Treats retrieval as a multi-armed bandit, selecting the skill with the highest Upper Confidence Bound (UCB) score: $\text{score}(e) = \bar{r}(e) + c \sqrt{\frac{\ln N_{ucb}}{n(e)}}$ where $\bar{r}(e)$ is the mean reward, $N_{ucb}$ is total queries, $n(e)$ is selection count, and $c$ controls exploration.
Keyword Matching (KM)
Full Retrieval
Random Retrieval
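
A minimal Python sketch of the UCB score above. The skill names, the statistics, and the rule that never-selected skills are tried first are illustrative assumptions, not details taken from the paper.

```python
import math

def ucb_score(mean_reward, n_selected, n_total, c=1.0):
    """score(e) = mean_reward + c * sqrt(ln(n_total) / n_selected)."""
    if n_selected == 0:
        # Assumption: a skill that was never selected gets priority.
        return math.inf
    return mean_reward + c * math.sqrt(math.log(n_total) / n_selected)

def select_skill(stats, n_total, c=1.0):
    """stats maps skill id -> (mean reward, selection count); returns the argmax id."""
    return max(stats, key=lambda e: ucb_score(stats[e][0], stats[e][1], n_total, c))

# Hypothetical skill bank statistics after 75 retrieval queries.
stats = {
    "open_drawer": (0.80, 50),
    "heat_object": (0.60, 5),
    "find_item": (0.70, 20),
}
chosen = select_skill(stats, n_total=75, c=1.0)
greedy = select_skill(stats, n_total=75, c=0.0)
print(chosen, greedy)
```

With $c=1$ the rarely tried skill wins because of its large exploration bonus, while $c=0$ reduces the rule to picking the highest mean reward.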

2.2 Optimization Goals

The overall training objective combines the RL loss ($L_{GRPO}$) and the SDAR distillation loss ($L_{SDAR}$):

$$L(\theta) = L_{GRPO}(\theta) + \lambda_{SDAR} \cdot L_{SDAR}(\theta)$$

RL Optimization (GRPO): For a group of $G$ sampled responses $\{y^{(i)}\}_{i=1}^G$, the GRPO objective is:

$$L_{GRPO}(\theta) = -\frac{1}{G} \sum_{i=1}^G \text{Agg}\left[ \min\left( r^{(i)}_t A^{(i)}, \text{clip}(r^{(i)}_t, 1-\epsilon, 1+\epsilon) A^{(i)} \right) \right] + \beta \cdot \frac{1}{G} \sum_{i=1}^G \text{Agg}\left[ D_{KL}\left( \pi_\theta(\cdot|s^{(i)}_t) \| \pi_{ref}(\cdot|s^{(i)}_t) \right) \right]$$

where $r^{(i)}_t = \pi_\theta(y^{(i)}_t|s^{(i)}_t) / \pi_{\theta_{old}}(y^{(i)}_t|s^{(i)}_t)$ is the token-level importance ratio, $A^{(i)}$ is the group-relative advantage, and $\text{Agg}[\cdot]$ aggregates over the token positions $t$ of a response. The KL coefficient $\beta$ in this objective should not be confused with the gate sharpness $\beta$ of Section 2.3.
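
The clipped policy term can be sketched in plain Python as follows. Several choices here are assumptions rather than details from the summary: the advantage is standardized within the group (the usual GRPO convention), $\text{Agg}$ is taken to be a token mean, the KL penalty is omitted, and the rewards and ratios are toy values.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Assumed GRPO convention: standardize each reward within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_term(ratio, advantage, clip_eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single token."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

def grpo_policy_loss(ratios, advantages, clip_eps=0.2):
    """ratios[i] holds the per-token ratios of response i; Agg is a token mean here."""
    per_response = []
    for token_ratios, adv in zip(ratios, advantages):
        terms = [clipped_term(r, adv, clip_eps) for r in token_ratios]
        per_response.append(sum(terms) / len(terms))
    return -sum(per_response) / len(per_response)

# Toy group of G = 4 responses with binary trajectory rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
ratios = [[1.0, 1.5], [1.0, 0.5], [1.1, 0.9], [1.0, 1.0]]
loss = grpo_policy_loss(ratios, advantages)
print(advantages, loss)
```

Every token of a response shares the same trajectory-level advantage, which is the coarse credit assignment that the token-level distillation signal is meant to complement.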

OPSD Optimization: The per-token reverse KL divergence is estimated with a single sample, the student-sampled token $y_t$, which yields the Teacher-Student log-probability gap:

$$\Delta_t = \log \pi^+_\theta(y_t | s^+_t) - \log \pi_\theta(y_t | s_t)$$

where $\pi^+_\theta$ denotes the teacher, i.e., the same policy conditioned on the privileged context.
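
As a brief worked step (a sketch using only the definitions above, with the teacher written as $\pi^+_\theta(\cdot|s^+_t)$, the same policy conditioned on the privileged context), the reverse KL at position $t$ and its single-sample estimate relate to the gap as follows:

$$D_{KL}\left( \pi_\theta(\cdot|s_t) \| \pi^+_\theta(\cdot|s^+_t) \right) = \mathbb{E}_{y \sim \pi_\theta(\cdot|s_t)}\left[ \log \pi_\theta(y|s_t) - \log \pi^+_\theta(y|s^+_t) \right] \approx \log \pi_\theta(y_t|s_t) - \log \pi^+_\theta(y_t|s^+_t) = -\Delta_t$$

The single-sample estimate is therefore the negated gap. A positive $\Delta_t$ marks a sampled token to which the teacher assigns higher probability than the student, and a negative $\Delta_t$ marks the opposite case.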

2.3 Token-Level Gating

The key innovation is a token-level gate $g_t \in [0,1]$ that modulates the OPSD signal. Let $\Delta_t = \text{sg}(\log \pi^+_\theta(y_t|s^+_t) - \log \pi_\theta(y_t|s_t))$ be the detached gap, and $h_t = -\sum_{v \in \mathcal{V}} \pi_\theta(v|s_t) \log \pi_\theta(v|s_t)$ be the student entropy. Three gating strategies are instantiated using the logistic sigmoid $\sigma$ with sharpness $\beta>0$:

Entropy Gating: $g_t = \sigma(\beta h_t)$
Gap Gating: $g_t = \sigma(\beta \Delta_t)$
Soft-OR Gating: $g_t = \sigma(\beta [1 - (1-h_t)(1-\Delta_t)])$
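
The three gates can be computed directly from their definitions. The sketch below uses plain Python floats; the three-token student distribution and the gap values are made-up inputs for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def entropy(probs):
    """Student entropy h_t = -sum_v p(v) log p(v)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_gate(h, beta):
    return sigmoid(beta * h)

def gap_gate(delta, beta):
    return sigmoid(beta * delta)

def soft_or_gate(h, delta, beta):
    return sigmoid(beta * (1.0 - (1.0 - h) * (1.0 - delta)))

beta = 5.0
h = entropy([0.7, 0.2, 0.1])  # hypothetical student distribution over 3 tokens
for delta in (-1.0, 0.0, 1.0):
    print(delta,
          round(entropy_gate(h, beta), 4),
          round(gap_gate(delta, beta), 4),
          round(soft_or_gate(h, delta, beta), 4))
```

Because entropy is non-negative, the entropy gate never falls below 0.5, whereas the gap gate can approach 0 on strongly negative-gap tokens.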

The token-level SDAR loss and final objective are:

$$\ell^{SDAR}_t = g_t \cdot \left( \log \pi^+_\theta(y_t | s^+_t) - \log \pi_\theta(y_t | s_t) \right)$$

$$L_{SDAR} = \text{Agg}(\ell^{SDAR}_t)$$

Theoretical Properties (Appendix A): The design ensures stable optimization. The gate is detached, so gradients flow only through the student log-probability; the gradient below also implies that the teacher log-probability acts as a fixed target. Minimizing $L_{SDAR}$ is therefore equivalent to maximizing a gate-weighted log-likelihood of the student-sampled tokens, and the gradient is modulated by the bounded gate: $\nabla_\theta L_{SDAR} = -\text{Agg}(g_t \nabla_\theta \log \pi_\theta(y_t|s_t))$.
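
The role of detachment can be checked numerically for a single token. This is a toy sketch, not the paper's implementation: the student log-probability is treated as a free scalar, stop-gradient is imitated by passing a frozen copy to the gate, and the probabilities 0.6 and 0.3 are arbitrary.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def token_loss(logp_s, logp_t, beta, gate_logp_s):
    """g_t * (logp_t - logp_s), where the gate is computed from gate_logp_s."""
    gate = sigmoid(beta * (logp_t - gate_logp_s))
    return gate * (logp_t - logp_s)

def central_diff(f, x, eps=1e-6):
    """Finite-difference derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

logp_t, logp_s, beta = math.log(0.6), math.log(0.3), 5.0
gate = sigmoid(beta * (logp_t - logp_s))

# Detached gate: the gate sees a frozen copy of the student log-prob.
grad_detached = central_diff(lambda x: token_loss(x, logp_t, beta, logp_s), logp_s)
# Attached gate: the gate also moves with the student log-prob.
grad_attached = central_diff(lambda x: token_loss(x, logp_t, beta, x), logp_s)

print(gate, grad_detached, grad_attached)
```

With the detached gate, the derivative with respect to the student log-probability equals $-g_t$ and is bounded by 1 in magnitude. Letting gradients pass through the gate adds a term proportional to $\beta$ that breaks this bound in the example.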

Empirical Validation / Results

3.1 Main Results

Table 1 presents comprehensive results across ALFWorld, Search-QA, and WebShop for Qwen2.5-3B, 7B, and Qwen3-1.7B models.

Key Comparisons:

SDAR vs. GRPO: Substantial improvements: +9.4 points on ALFWorld (3B, 84.4 vs 75.0), +7.0 points on Search-QA (3B), and +10.2 points on WebShop-Acc (7B, 82.8 vs 72.6).
SDAR vs. Naive Hybrids: Avoids instability of GRPO+OPSD (which collapses on Qwen3-1.7B to 32.0 vs GRPO's 46.1) and consistently outperforms Skill-SD and RLSD.
Skills Internalization: SDAR does not require skills at inference yet surpasses skill-augmented Skill-GRPO* in most settings (e.g., ALFWorld-3B: 84.4 vs 80.5). Skill-GRPO shows massive drops when tested without skills.
Generalization: SDAR shows strong generalization, especially on the challenging Qwen3-1.7B model, where it achieves the highest score (53.9%) while Skill-GRPO drops to 21.1%.

Table 1: Performance on ALFWorld, Search-QA and WebShop tasks. (Excerpt for Qwen2.5-7B-Instruct on ALFWorld and WebShop-Acc)

Method	ALFWorld Avg	WebShop Acc
Vanilla	12.5	1.6
GRPO	81.2	72.6
Skill-GRPO*	88.3	81.2
GRPO+OPSD	80.4	76.5
Skill-SD	85.1	76.5
RLSD	82.0	77.3
SDAR (Ours)	85.9	82.8

3.2 Training Dynamics

Monitoring the Qwen2.5-7B model on ALFWorld reveals:

The mean Teacher-Student gap ($\bar{\Delta} = \mathbb{E}_t[\Delta_t]$) is consistently negative, confirming the asymmetric trust regime.
$\bar{\Delta}$ converges toward zero, showing the gate successfully identifies beneficial tokens.
The gate activation ratio (fraction of tokens with $g_t > 0.5$) starts low (<0.5) and gradually increases as the student policy improves, reflecting adaptive filtering.
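
The monitored quantities are simple to compute from a batch of per-token gaps. In the sketch below the two batches are invented toy values chosen only to mimic the described trend; they are not measurements from the paper.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gate_stats(deltas, beta=5.0):
    """Return (mean gap, gate activation ratio, fraction of negative-gap tokens)."""
    gates = [sigmoid(beta * d) for d in deltas]
    mean_gap = sum(deltas) / len(deltas)
    activation = sum(g > 0.5 for g in gates) / len(gates)
    negative_frac = sum(d < 0 for d in deltas) / len(deltas)
    return mean_gap, activation, negative_frac

# Hypothetical per-token gaps early and late in training.
early = [-1.2, -0.8, -0.5, 0.3, -0.1]
late = [-0.3, 0.1, -0.05, 0.2, 0.02]
early_stats = gate_stats(early)
late_stats = gate_stats(late)
print(early_stats, late_stats)
```

For any $\beta > 0$, $g_t > 0.5$ holds exactly when $\Delta_t > 0$, so the activation ratio is the fraction of positive-gap tokens.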

3.3 Robustness Analysis

Table 2 shows SDAR's performance with different skill retrieval methods, all outperforming the pure GRPO baseline (w/o OPSD).

Even Random Retrieval yields gains (+1.9 points on ALFWorld, +1.0 on WebShop-Acc).
Higher-quality retrieval (KM, UCB) amplifies benefits: UCB achieves the highest ALFWorld gain (+5.6 points) and KM the highest WebShop-Acc gain (+10.2 points).

Table 2: Robust Testing of different skill retrieval methods. (Qwen2.5-7B, ALFWorld & WebShop)

Method	ALFWorld (Gain)	WebShop-Acc (Gain)
UCB	86.8 (+5.6)	81.2 (+8.6)
KM	85.9 (+4.7)	82.8 (+10.2)
Full	83.2 (+2.0)	78.1 (+5.5)
Random	83.1 (+1.9)	73.6 (+1.0)
w/o OPSD	81.2	72.6

3.4 Ablation Studies

Gating Strategy: Teacher-Student Gap gating consistently outperforms Entropy and Soft-OR gating.
Sharpness $\beta$: $\beta=5.0$ performs best. Setting $\beta=0$ makes the gate a constant $g_t = \sigma(0) = 0.5$ (effectively no gating), which leads to instability; an overly large $\beta$ binarizes the gate, removing useful smooth modulation.
Distillation Coefficient $\lambda$: $\lambda=0.01$ performs best. $\lambda=0.1$ causes the distillation gradient to overwhelm the RL update, leading to performance decline.
Distillation Objective: Reverse KL outperforms Forward KL and Jensen-Shannon Divergence (JSD). Reverse KL's mode-seeking property naturally down-weights tokens with low teacher probability, complementing the gating mechanism.
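
The effect of the sharpness $\beta$ on the gap gate can be seen by evaluating $\sigma(\beta \Delta_t)$ on a few gap values. The gap values and the large setting $\beta=100$ are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gate_profile(beta, deltas):
    """Gap gate g_t = sigmoid(beta * delta_t) for each gap value."""
    return [sigmoid(beta * d) for d in deltas]

deltas = [-1.0, -0.1, 0.0, 0.1, 1.0]
flat = gate_profile(0.0, deltas)      # beta = 0: constant gate
smooth = gate_profile(5.0, deltas)    # beta = 5: graded modulation
hard = gate_profile(100.0, deltas)    # large beta: nearly a step function

for name, profile in (("beta=0", flat), ("beta=5", smooth), ("beta=100", hard)):
    print(name, [round(g, 3) for g in profile])
```

At $\beta=0$ every token receives the same weight of 0.5, at $\beta=5$ small gaps of either sign receive intermediate weights, and at $\beta=100$ even a gap of $\pm 0.1$ is pushed almost fully to 0 or 1.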

Theoretical and Practical Implications

Theoretical: SDAR provides a principled framework for combining RL and distillation. The token-level gating with detached signals ensures optimization stability (bounded gradients) and implements an online, self-paced curriculum. The theoretical analysis (Appendix A) formalizes these properties.
Practical: SDAR offers a robust and effective recipe for post-training LLM agents. It:
- Consistently improves agent performance across diverse benchmarks and model scales.
- Successfully internalizes privileged knowledge (skills), reducing inference-time dependencies.
- Is robust to the quality of auxiliary information (skill retrieval), degrading gracefully.
- Avoids the tuning complexity and instability of hand-crafted curricula or rigid hybrid methods.

Conclusion

SDAR reconciles RL and OPSD for multi-turn agent training by treating OPSD as a gated auxiliary objective. The core mechanism is a token-level sigmoid gate that lets each token autonomously regulate its distillation intensity based on the detached Teacher-Student gap. This preserves RL as the unbiased primary backbone while selectively extracting beneficial teacher signals and attenuating unreliable negative guidance.

Empirical results across three benchmarks (ALFWorld, WebShop, Search-QA) and three model scales (Qwen2.5-3B/7B, Qwen3-1.7B) confirm that SDAR delivers consistent gains over pure RL and hybrid baselines, avoids catastrophic instability, and demonstrates strong skill internalization and robustness. The method provides a stable and effective approach for advancing agentic reinforcement learning.