Self-Distilled Agentic Reinforcement Learning (SDAR) - Summary

Summary (Overview)

  • Primary Contribution: Introduces SDAR, a method that integrates On-Policy Self-Distillation (OPSD) as a gated auxiliary objective to Reinforcement Learning (RL) for training multi-turn LLM agents. It addresses the instability of naive OPSD in long-horizon interactions.
  • Core Mechanism: Employs a token-level sigmoid gate ($g_t = \sigma(\beta \Delta_t)$) that adaptively modulates the distillation intensity based on the detached Teacher-Student log-probability gap ($\Delta_t$). This strengthens distillation on teacher-endorsed tokens (positive $\Delta_t$) and softly attenuates it on negative-gap tokens, which may arise from imperfect skill retrieval or utilization.
  • Key Findings: SDAR achieves substantial performance improvements over the GRPO baseline across three benchmarks (ALFWorld, Search-QA, WebShop) and multiple model scales (Qwen2.5-3B/7B, Qwen3-1.7B). It successfully internalizes skills (no need for skills at inference) and shows robust gains even with low-quality skill retrieval.
  • Stability: SDAR avoids the catastrophic instability observed in naive GRPO+OPSD combinations and other hybrid methods (RLSD, Skill-SD), maintaining stable optimization by preserving RL as the unbiased primary backbone.
  • Ablation Insights: The Teacher-Student Gap gating strategy with reverse KL divergence performs best. Performance is sensitive to the distillation coefficient ($\lambda$) and the sigmoid sharpness ($\beta$), with $\lambda=0.01$ and $\beta=5.0$ found to be optimal.

Introduction and Theoretical Foundation

Training LLMs to act as multi-turn agents is a central challenge. Two complementary paradigms exist: Reinforcement Learning (RL), which provides coarse, trajectory-level reward signals, and On-Policy Self-Distillation (OPSD), which provides dense, token-level guidance from a teacher branch augmented with privileged context (e.g., retrieved skills).

However, applying OPSD directly to multi-turn agents is problematic due to two key observations:

  1. Multi-turn OPSD Instability: As the student policy drifts from the teacher-supported trajectory over multiple turns, the token-level supervision becomes increasingly unreliable, leading to compounding error, surging KL divergence, and catastrophic performance degradation.
  2. Asymmetric Trust in Privileged Guidance: The teacher is the same policy augmented with privileged context (skills). A negative teacher-student gap ($\Delta_t < 0$) should be interpreted cautiously: it may indicate a token to suppress, or it may arise from (a) low-quality skill retrieval, (b) failure to utilize relevant skills, or (c) multi-turn drift. Analysis shows that negative-gap tokens exceed 50% of all tokens, making this a pervasive issue.

This motivates the core philosophy of SDAR: RL should remain the primary, unbiased optimization backbone, while OPSD serves as a carefully controlled auxiliary objective. The control is implemented via an adaptive, token-level gating mechanism, creating a dynamic, self-paced curriculum at the finest granularity.

Methodology

2.1 Problem Setup

A multi-turn agent generates a sequence of response tokens $y = (y_1, \ldots, y_T) \sim \pi_\theta(\cdot|x)$. At token $t$, the student context is $s_t = (x, y_{<t})$ and the teacher context is $s^+_t = (x, c^+, y_{<t})$, where $c^+$ is privileged training-only context (e.g., retrieved skills).

Skill Retrieval: Four strategies of varying quality are implemented:

  1. UCB Retrieval: Treats retrieval as a multi-armed bandit, selecting the skill with the highest Upper Confidence Bound (UCB) score: $\text{score}(e) = \bar{r}(e) + c \sqrt{\frac{\ln N_{ucb}}{n(e)}}$, where $\bar{r}(e)$ is the mean reward of skill $e$, $N_{ucb}$ is the total number of queries, $n(e)$ is the selection count of $e$, and $c$ controls exploration.
  2. Keyword Matching (KM)
  3. Full Retrieval
  4. Random Retrieval
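
The UCB rule in item 1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the skill names, the statistics, and the choice to give never-selected skills an infinite score are all assumptions made here for the example.

```python
import math

def ucb_score(mean_reward, n_selected, n_total, c=1.0):
    """score(e) = mean_reward + c * sqrt(ln(n_total) / n_selected)."""
    if n_selected == 0:
        # Assumption for this sketch: untried skills are explored first.
        return math.inf
    return mean_reward + c * math.sqrt(math.log(n_total) / n_selected)

def select_skill(stats, n_total, c=1.0):
    """stats maps a skill id to (mean_reward, selection_count)."""
    return max(stats, key=lambda e: ucb_score(stats[e][0], stats[e][1], n_total, c))

# Hypothetical skill statistics, for illustration only.
stats = {
    "open_drawer": (0.8, 50),
    "heat_object": (0.6, 5),
    "clean_object": (0.7, 20),
}
explore_choice = select_skill(stats, n_total=75, c=1.0)
greedy_choice = select_skill(stats, n_total=75, c=0.0)
```

With $c = 1$ the rarely selected skill wins because its exploration bonus is large; with $c = 0$ the rule reduces to picking the highest mean reward.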

2.2 Optimization Goals

The overall training objective combines the RL loss ($L_{GRPO}$) and the SDAR distillation loss ($L_{SDAR}$):

$$L(\theta) = L_{GRPO}(\theta) + \lambda_{SDAR} \cdot L_{SDAR}(\theta)$$

RL Optimization (GRPO): For a group of $G$ sampled responses $\{y^{(i)}\}_{i=1}^G$, the GRPO objective is:

$$L_{GRPO}(\theta) = -\frac{1}{G} \sum_{i=1}^G \text{Agg}\left[ \min\left( r^{(i)}_t A^{(i)}, \text{clip}(r^{(i)}_t, 1-\epsilon, 1+\epsilon) A^{(i)} \right) \right] + \beta \cdot \frac{1}{G} \sum_{i=1}^G \text{Agg}\left[ D_{KL}\left( \pi_\theta(\cdot|s^{(i)}_t) \| \pi_{ref}(\cdot|s^{(i)}_t) \right) \right]$$

where $r^{(i)}_t = \pi_\theta(y^{(i)}_t|s^{(i)}_t) / \pi_{\theta_{old}}(y^{(i)}_t|s^{(i)}_t)$ is the importance ratio, $A^{(i)}$ is the group-relative advantage, and $\text{Agg}[\cdot]$ aggregates the per-token terms over $t$. Note that $\beta$ here is the coefficient of the KL penalty toward $\pi_{ref}$, a different role from the gate sharpness $\beta$ of Section 2.3.
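
The two ingredients of the GRPO term can be sketched as follows. The summary does not spell out the advantage formula, so the mean-and-standard-deviation normalization below is the commonly used GRPO form and should be read as an assumption; the reward values and the clip range of 0.2 are likewise illustrative.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Assumed form: (reward - group mean) / (group std + eps)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single token."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

# Hypothetical trajectory-level rewards for a group of G = 4 responses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A large ratio on a positive-advantage token is capped at 1 + eps.
capped = clipped_surrogate(ratio=1.5, advantage=1.0)
# A large ratio on a negative-advantage token is not capped (pessimistic bound).
uncapped = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

The same advantage $A^{(i)}$ is shared by every token of response $i$, which is why the RL signal is described as coarse and trajectory-level.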

OPSD Optimization: The per-token reverse KL divergence is estimated from a single sample, the student-sampled token $y_t$, which yields the Teacher-Student log-probability gap:

$$\Delta_t = \log \pi_T(y_t | s^+_t) - \log \pi_\theta(y_t | s_t)$$
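
As a small numeric illustration of how the sign of the gap reads, the sketch below evaluates $\Delta_t$ for two sampled tokens. The probabilities are made-up values, not figures from the paper.

```python
import math

def teacher_student_gap(teacher_prob, student_prob):
    """Delta_t = log pi_T(y_t | s_t^+) - log pi_theta(y_t | s_t) for one sampled token."""
    return math.log(teacher_prob) - math.log(student_prob)

# The teacher assigns the sampled token more probability than the student: positive gap.
endorsed = teacher_student_gap(teacher_prob=0.6, student_prob=0.3)

# The teacher assigns the sampled token less probability than the student: negative gap.
discouraged = teacher_student_gap(teacher_prob=0.1, student_prob=0.4)
```

Here `endorsed` equals $\ln 2$ and `discouraged` equals $-\ln 4$, so the magnitude of the gap grows with the ratio of the two probabilities.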

2.3 Token-Level Gating

The key innovation is a token-level gate $g_t \in [0,1]$ that modulates the OPSD signal. Let $\Delta_t = \text{sg}(\log \pi^+_\theta(y_t|s^+_t) - \log \pi_\theta(y_t|s_t))$ be the detached gap, where $\text{sg}$ is the stop-gradient operator and $\pi^+_\theta$ is the teacher branch (written $\pi_T$ in Section 2.2), and let $h_t = -\sum_{v \in \mathcal{V}} \pi_\theta(v|s_t) \log \pi_\theta(v|s_t)$ be the student entropy. Three gating strategies are instantiated using the logistic sigmoid $\sigma$ with sharpness $\beta>0$:

  1. Entropy Gating: $g_t = \sigma(\beta h_t)$
  2. Gap Gating: $g_t = \sigma(\beta \Delta_t)$
  3. Soft-OR Gating: $g_t = \sigma(\beta [1 - (1-h_t)(1-\Delta_t)])$
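
The three gates above translate directly into code. The sketch below is a plain-Python illustration on scalar inputs; the function names and the default $\beta = 5.0$ (the value the ablations report as best) are choices made for this example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def entropy(probs):
    """Student entropy h_t = -sum_v p(v) log p(v), skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_gate(h, beta=5.0):
    return sigmoid(beta * h)

def gap_gate(delta, beta=5.0):
    return sigmoid(beta * delta)

def soft_or_gate(h, delta, beta=5.0):
    return sigmoid(beta * (1.0 - (1.0 - h) * (1.0 - delta)))

# Hypothetical values: a positive gap opens the gate, a negative gap closes it softly.
open_gate = gap_gate(0.5)
closed_gate = gap_gate(-0.5)
uniform_entropy = entropy([0.5, 0.5])
```

Because the entropy is non-negative, the entropy gate never falls below 0.5, whereas the gap gate can approach 0 on strongly negative-gap tokens.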

The token-level SDAR loss and final objective are:

$$\ell^{SDAR}_t = g_t \cdot \left( \log \pi^+_\theta(y_t | s^+_t) - \log \pi_\theta(y_t | s_t) \right)$$

$$L_{SDAR} = \text{Agg}(\ell^{SDAR}_t)$$

Theoretical Properties (Appendix A): The design ensures stable optimization. The gate is detached so gradients flow only through the student log-probability. Minimizing $L_{SDAR}$ is equivalent to maximizing a token-weighted log-likelihood. The gradient is strictly modulated by the bounded gate: $\nabla_\theta L_{SDAR} = -\text{Agg}(g_t \nabla_\theta \log \pi_\theta(y_t|s_t))$.
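
The gradient identity can be checked numerically on a toy example. The sketch below assumes a student that is a plain softmax over three logits, treats the teacher log-probability and the gate as constants (mirroring the detachment described above), and compares the closed-form gradient with central finite differences. All numeric values are illustrative.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def sdar_token_loss(student_logits, y, teacher_logp, gate):
    """gate * (teacher_logp - student_logp); gate and teacher_logp are held constant."""
    return gate * (teacher_logp - log_softmax(student_logits)[y])

def analytic_grad(student_logits, y, gate):
    """-gate * d log pi(y) / d logits = -gate * (onehot(y) - softmax(logits))."""
    probs = [math.exp(lp) for lp in log_softmax(student_logits)]
    return [-gate * ((1.0 if v == y else 0.0) - p) for v, p in enumerate(probs)]

def numeric_grad(student_logits, y, teacher_logp, gate, h=1e-6):
    grad = []
    for v in range(len(student_logits)):
        up = list(student_logits)
        down = list(student_logits)
        up[v] += h
        down[v] -= h
        diff = sdar_token_loss(up, y, teacher_logp, gate) - sdar_token_loss(down, y, teacher_logp, gate)
        grad.append(diff / (2 * h))
    return grad

logits = [0.2, -0.4, 1.0]      # hypothetical student logits
y = 1                          # index of the sampled token
teacher_logp = math.log(0.5)   # hypothetical teacher log-probability of y
delta = teacher_logp - log_softmax(logits)[y]   # gap, computed once
gate = 1.0 / (1.0 + math.exp(-5.0 * delta))     # detached: fixed during differentiation

max_error = max(abs(a - n) for a, n in zip(analytic_grad(logits, y, gate),
                                           numeric_grad(logits, y, teacher_logp, gate)))
```

Since the gate lies strictly between 0 and 1, each token's distillation gradient is a scaled-down copy of the ordinary log-likelihood gradient, which is the boundedness property the paragraph refers to.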

Empirical Validation / Results

3.1 Main Results

Table 1 presents comprehensive results across ALFWorld, Search-QA, and WebShop for Qwen2.5-3B, 7B, and Qwen3-1.7B models.

Key Comparisons:

  • SDAR vs. GRPO: Substantial improvements, in absolute points: +9.4 on ALFWorld (3B, 84.4 vs 75.0), +7.0 on Search-QA (3B), +10.2 on WebShop-Acc (7B, 82.8 vs 72.6).
  • SDAR vs. Naive Hybrids: Avoids instability of GRPO+OPSD (which collapses on Qwen3-1.7B to 32.0 vs GRPO's 46.1) and consistently outperforms Skill-SD and RLSD.
  • Skills Internalization: SDAR does not require skills at inference yet surpasses skill-augmented Skill-GRPO* in most settings (e.g., ALFWorld-3B: 84.4 vs 80.5). Skill-GRPO shows massive drops when tested without skills.
  • Generalization: SDAR shows strong generalization, especially on the challenging Qwen3-1.7B model, where it achieves the highest score (53.9%) while Skill-GRPO drops to 21.1%.

Table 1: Performance on ALFWorld, Search-QA and WebShop tasks. (Excerpt for Qwen2.5-7B-Instruct on ALFWorld and WebShop-Acc)

Method          ALFWorld Avg    WebShop Acc
Vanilla         12.5            1.6
GRPO            81.2            72.6
Skill-GRPO*     88.3            81.2
GRPO+OPSD       80.4            76.5
Skill-SD        85.1            76.5
RLSD            82.0            77.3
SDAR (Ours)     85.9            82.8

3.2 Training Dynamics

Monitoring the Qwen2.5-7B model on ALFWorld reveals:

  • The mean Teacher-Student gap ($\bar{\Delta} = \mathbb{E}_t[\Delta_t]$) is consistently negative, confirming the asymmetric trust regime.
  • $\bar{\Delta}$ converges toward zero, showing the gate successfully identifies beneficial tokens.
  • The gate activation ratio (fraction of tokens with $g_t > 0.5$) starts low (<0.5) and gradually increases as the student policy improves, reflecting adaptive filtering.
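
The two monitored quantities are simple to compute from a batch of per-token gaps. The sketch below uses gap gating and two hand-picked batches that mimic an early and a late training stage; the numbers are invented for illustration and are not measurements from the paper.

```python
import math

def gap_gate(delta, beta=5.0):
    return 1.0 / (1.0 + math.exp(-beta * delta))

def mean_gap(deltas):
    return sum(deltas) / len(deltas)

def gate_activation_ratio(deltas, beta=5.0, threshold=0.5):
    """Fraction of tokens whose gate exceeds the threshold."""
    gates = [gap_gate(d, beta) for d in deltas]
    return sum(g > threshold for g in gates) / len(gates)

# Hypothetical per-token gaps at two points in training.
early = [-1.2, -0.8, -0.3, 0.1, -0.5]
late = [-0.3, 0.1, 0.2, -0.2, 0.1]

early_ratio = gate_activation_ratio(early)
late_ratio = gate_activation_ratio(late)
```

For gap gating with $\beta > 0$, $g_t > 0.5$ holds exactly when $\Delta_t > 0$, so the activation ratio is the fraction of positive-gap tokens. In this toy data the mean gap stays negative while moving toward zero and the ratio rises from 0.2 to 0.6.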

3.3 Robustness Analysis

Table 2 shows SDAR's performance with different skill retrieval methods, all outperforming the pure GRPO baseline (w/o OPSD).

  • Even Random Retrieval yields gains (+1.9 points on ALFWorld, +1.0 on WebShop-Acc).
  • Higher-quality retrieval (KM, UCB) amplifies the benefits, with KM achieving the highest WebShop-Acc gain (+10.2 points) and UCB the highest ALFWorld gain (+5.6 points).

Table 2: Robust Testing of different skill retrieval methods. (Qwen2.5-7B, ALFWorld & WebShop)

Method      ALFWorld (Gain)    WebShop-Acc (Gain)
UCB         86.8 (+5.6)        81.2 (+8.6)
KM          85.9 (+4.7)        82.8 (+10.2)
Full        83.2 (+2.0)        78.1 (+5.5)
Random      83.1 (+1.9)        73.6 (+1.0)
w/o OPSD    81.2               72.6
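
The gains in Table 2 are absolute differences from the "w/o OPSD" row. The short check below recomputes them from the scores in the table; only the dictionary layout is introduced here.

```python
# Scores copied from Table 2 (Qwen2.5-7B).
baseline = {"ALFWorld": 81.2, "WebShop-Acc": 72.6}
scores = {
    "UCB": {"ALFWorld": 86.8, "WebShop-Acc": 81.2},
    "KM": {"ALFWorld": 85.9, "WebShop-Acc": 82.8},
    "Full": {"ALFWorld": 83.2, "WebShop-Acc": 78.1},
    "Random": {"ALFWorld": 83.1, "WebShop-Acc": 73.6},
}

def gains(scores, baseline):
    """Absolute gain over the baseline, rounded to one decimal as in the table."""
    return {
        method: {task: round(value - baseline[task], 1) for task, value in row.items()}
        for method, row in scores.items()
    }

table_gains = gains(scores, baseline)
best_webshop = max(table_gains, key=lambda m: table_gains[m]["WebShop-Acc"])
best_alfworld = max(table_gains, key=lambda m: table_gains[m]["ALFWorld"])
```

The recomputed values match the parenthesized gains in the table, with KM leading on WebShop-Acc and UCB leading on ALFWorld.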

3.4 Ablation Studies

  • Gating Strategy: Teacher-Student Gap gating consistently outperforms Entropy and Soft-OR gating.
  • Sharpness $\beta$: $\beta=5.0$ is optimal. $\beta=0$ makes the gate a constant $g_t = 0.5$ (no adaptive gating) and leads to instability; an overly large $\beta$ binarizes the gate, removing useful smooth modulation.
  • Distillation Coefficient $\lambda$: $\lambda=0.01$ is optimal. $\lambda=0.1$ causes the distillation gradient to overwhelm the RL update, leading to performance decline.
  • Distillation Objective: Reverse KL outperforms Forward KL and Jensen-Shannon Divergence (JSD). Reverse KL's mode-seeking property naturally down-weights tokens with low teacher probability, complementing the gating mechanism.
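
The effect of the sharpness $\beta$ on the gap gate can be seen by sweeping it over a few gap values. This is an illustrative sketch: the gap values and the large setting $\beta = 100$ are chosen here only to make the limiting behaviour visible.

```python
import math

def gap_gate(delta, beta):
    """g_t = sigmoid(beta * delta), written in a numerically stable form."""
    z = beta * delta
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

deltas = [-0.5, -0.1, 0.1, 0.5]  # hypothetical per-token gaps

sweep = {beta: [gap_gate(d, beta) for d in deltas] for beta in (0.0, 5.0, 100.0)}

for beta, gates in sweep.items():
    print(beta, [round(g, 3) for g in gates])
```

At $\beta = 0$ every token receives the same weight of 0.5, so the gate no longer distinguishes positive from negative gaps. At $\beta = 100$ the gate is effectively a hard 0/1 switch, while $\beta = 5$ keeps intermediate weights for tokens with small gaps.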

Theoretical and Practical Implications

  • Theoretical: SDAR provides a principled framework for combining RL and distillation. The token-level gating with detached signals ensures optimization stability (bounded gradients) and implements an online, self-paced curriculum. The theoretical analysis (Appendix A) formalizes these properties.
  • Practical: SDAR offers a robust and effective recipe for post-training LLM agents. It:
    • Consistently improves agent performance across diverse benchmarks and model scales.
    • Successfully internalizes privileged knowledge (skills), reducing inference-time dependencies.
    • Is robust to the quality of auxiliary information (skill retrieval), degrading gracefully.
    • Avoids the tuning complexity and instability of hand-crafted curricula or rigid hybrid methods.

Conclusion

SDAR reconciles RL and OPSD for multi-turn agent training by treating OPSD as a gated auxiliary objective. The core mechanism is a token-level sigmoid gate that lets each token autonomously regulate its distillation intensity based on the detached Teacher-Student gap. This preserves RL as the unbiased primary backbone while selectively extracting beneficial teacher signals and attenuating unreliable negative guidance.

Empirical results across three benchmarks (ALFWorld, WebShop, Search-QA) and three model scales (Qwen2.5-3B/7B, Qwen3-1.7B) confirm that SDAR delivers consistent gains over pure RL and hybrid baselines, avoids catastrophic instability, and demonstrates strong skill internalization and robustness. The method provides a stable and effective approach for advancing agentic reinforcement learning.