Self-Distilled Agentic Reinforcement Learning (SDAR) - Summary
Summary (Overview)
- Primary Contribution: Introduces SDAR, a method that integrates On-Policy Self-Distillation (OPSD) as a gated auxiliary objective to Reinforcement Learning (RL) for training multi-turn LLM agents. It addresses the instability of naive OPSD in long-horizon interactions.
- Core Mechanism: Employs a token-level sigmoid gate that adaptively modulates the distillation intensity based on the detached Teacher-Student log-probability gap. This strengthens distillation on teacher-endorsed tokens (positive gap) and softly attenuates it on negative-gap tokens, which may arise from imperfect skill retrieval or utilization.
- Key Findings: SDAR achieves substantial performance improvements over the GRPO baseline across three benchmarks (ALFWorld, Search-QA, WebShop) and multiple model scales (Qwen2.5-3B/7B, Qwen3-1.7B). It successfully internalizes skills (no need for skills at inference) and shows robust gains even with low-quality skill retrieval.
- Stability: SDAR avoids the catastrophic instability observed in naive GRPO+OPSD combinations and other hybrid methods (RLSD, Skill-SD), maintaining stable optimization by preserving RL as the unbiased primary backbone.
- Ablation Insights: The Teacher-Student Gap gating strategy with reverse KL divergence is optimal. Performance is sensitive to both the distillation coefficient and the sigmoid sharpness.
Introduction and Theoretical Foundation
Training LLMs to act as multi-turn agents is a central challenge. Two complementary paradigms exist: Reinforcement Learning (RL), which provides coarse, trajectory-level reward signals, and On-Policy Self-Distillation (OPSD), which provides dense, token-level guidance from a teacher branch augmented with privileged context (e.g., retrieved skills).
However, applying OPSD directly to multi-turn agents is problematic due to two key observations:
- Multi-turn OPSD Instability: As the student policy drifts from the teacher-supported trajectory over multiple turns, the token-level supervision becomes increasingly unreliable, leading to compounding error, surging KL divergence, and catastrophic performance degradation.
- Asymmetric Trust in Privileged Guidance: The teacher is the same policy augmented with privileged context (skills). A negative teacher-student gap should be interpreted cautiously: it may indicate a token to suppress, or it may arise from (a) low-quality skill retrieval, (b) failure to utilize relevant skills, or (c) multi-turn drift. Analysis shows negative-gap tokens exceed 50% of all tokens, making this a pervasive issue.
This motivates the core philosophy of SDAR: RL should remain the primary, unbiased optimization backbone, while OPSD serves as a carefully controlled auxiliary objective. The control is implemented via an adaptive, token-level gating mechanism, creating a dynamic, self-paced curriculum at the finest granularity.
Methodology
2.1 Problem Setup
A multi-turn agent generates a sequence of response tokens. At each token position, the student conditions on its own context, while the teacher conditions on that same context augmented with privileged, training-only context (e.g., retrieved skills).
Skill Retrieval: Four strategies of varying quality are implemented:
- UCB Retrieval: Treats retrieval as a multi-armed bandit, selecting the skill with the highest Upper Confidence Bound (UCB) score. The score is computed from the skill's mean reward, the total number of queries, and the skill's selection count, with a coefficient that controls exploration.
- Keyword Matching (KM)
- Full Retrieval
- Random Retrieval
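The UCB retrieval strategy above can be illustrated with a minimal sketch. It assumes the standard UCB1 form (mean reward plus an exploration bonus of `c * sqrt(ln N / n)`); the exact score used by SDAR may differ, and the function names, skill names, and statistics below are hypothetical.

```python
import math

def ucb_score(mean_reward, total_queries, selection_count, c=1.0):
    """UCB1-style score (assumed form): mean reward plus exploration bonus."""
    if selection_count == 0:
        return math.inf  # a never-selected skill is always tried first
    bonus = c * math.sqrt(math.log(total_queries) / selection_count)
    return mean_reward + bonus

def select_skill(stats, total_queries, c=1.0):
    """Pick the skill with the highest score.

    stats maps a skill name to (mean_reward, selection_count).
    """
    return max(
        stats,
        key=lambda name: ucb_score(stats[name][0], total_queries, stats[name][1], c),
    )

# Hypothetical skill statistics, for illustration only.
stats = {
    "open_drawer": (0.80, 50),
    "heat_object": (0.60, 5),
    "clean_object": (0.70, 20),
}
explore_choice = select_skill(stats, total_queries=75, c=1.0)
greedy_choice = select_skill(stats, total_queries=75, c=0.0)
```

With `c=1.0` the rarely selected skill wins because its exploration bonus is large; with `c=0.0` the choice reduces to picking the highest mean reward.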
2.2 Optimization Goals
The overall training objective combines the RL loss with the SDAR distillation loss, weighted by a distillation coefficient.
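RL Optimization (GRPO): For a group of sampled responses, GRPO optimizes its surrogate objective using the group-relative advantage of each response, i.e., the response's reward measured relative to the other responses in its group.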
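As a rough illustration of a group-relative advantage, the sketch below uses the common GRPO formulation: each reward is centered on the group mean and divided by the group standard deviation. This summary does not state the exact normalization, so the standard-deviation scaling and the `eps` stabilizer are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses in its group.

    Assumed form: (r_i - mean) / (std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to the same task, with binary success rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Successful responses receive a positive advantage and failed ones a negative advantage. When every response in a group gets the same reward, all advantages are zero and the group contributes no policy-gradient signal.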
OPSD Optimization: The per-token reverse KL divergence is estimated from a single sample, the student-sampled token. This yields the Teacher-Student log-probability gap: the teacher's log-probability of that token minus the student's.
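One way to write the single-sample estimate in symbols is shown below. The notation is assumed here rather than taken from the source: $\pi_\theta$ is the policy, $s_t$ the student context, $c$ the privileged context, $y_t$ the student-sampled token, and $\Delta_t$ the Teacher-Student gap. The teacher is the same policy conditioned on the extra context $c$.

```latex
\mathrm{KL}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t, c)\big)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid s_t)}
    \big[\log \pi_\theta(y \mid s_t) - \log \pi_\theta(y \mid s_t, c)\big]
  \approx \log \pi_\theta(y_t \mid s_t) - \log \pi_\theta(y_t \mid s_t, c)
  = -\Delta_t,
\qquad
\Delta_t = \log \pi_\theta(y_t \mid s_t, c) - \log \pi_\theta(y_t \mid s_t).
```

Under this convention a positive $\Delta_t$ means the teacher assigns the sampled token a higher probability than the student does, which matches the description of positive-gap tokens as teacher-endorsed.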
2.3 Token-Level Gating
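The key innovation is a token-level gate that modulates the OPSD signal. The gate is computed from the detached gap, the student entropy, or both, passed through a logistic sigmoid with a sharpness parameter. Three gating strategies are instantiated:
- Entropy Gating: the gate is driven by the student entropy.
- Gap Gating: the gate is driven by the detached Teacher-Student gap.
- Soft-OR Gating: the gate combines the entropy and gap signals through a soft OR.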
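The three gating strategies can be sketched as follows. Only the gap gate, a sigmoid of sharpness times gap, follows directly from the description. The entropy gate's reference level `h_ref` and its direction, and the probabilistic-OR form of the Soft-OR gate, are assumptions made for illustration and may not match the paper's definitions.

```python
import math

def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def entropy(probs):
    """Shannon entropy (in nats) of a token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def gap_gate(gap, beta):
    """Gate from the detached Teacher-Student gap: sigmoid(beta * gap)."""
    return sigmoid(beta * gap)

def entropy_gate(h, beta, h_ref):
    """Assumed form: the gate opens as student entropy rises above h_ref."""
    return sigmoid(beta * (h - h_ref))

def soft_or_gate(g_a, g_b):
    """Assumed form: probabilistic OR of two gates in [0, 1]."""
    return 1.0 - (1.0 - g_a) * (1.0 - g_b)

# Hypothetical per-token gaps: teacher-endorsed, neutral, teacher-disfavored.
gaps = [2.0, 0.0, -2.0]
gates = [gap_gate(g, beta=1.0) for g in gaps]
h = entropy([0.5, 0.25, 0.25])
combined = soft_or_gate(gap_gate(-2.0, 1.0), entropy_gate(h, 1.0, h_ref=0.5))
```

A positive gap pushes the gate above 0.5 and a negative gap pushes it below, so negative-gap tokens are attenuated rather than discarded. A very large `beta` turns the gap gate into a near-binary switch.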
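The token-level SDAR loss weights each token's OPSD term by its gate. Aggregating over tokens gives the distillation loss that enters the final objective.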
Theoretical Properties (Appendix A): The design ensures stable optimization. The gate is detached, so gradients flow only through the student log-probability. Minimizing the SDAR loss is equivalent to maximizing a token-weighted log-likelihood. Because the gate is a sigmoid output bounded between 0 and 1, each token's gradient contribution is scaled by a bounded factor.
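The bounded-gradient property can be checked numerically with a small sketch. It assumes the token-level loss has the form `-gate * log p_student(token)`, which is one form consistent with the "token-weighted log-likelihood" statement; the paper's exact loss may differ. The names `lam` (distillation coefficient) and `beta` (sharpness) are placeholders, and the gradient is taken with respect to the student logits in closed form.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def gated_token_loss_and_grad(student_logits, teacher_logits, token, beta):
    """Return (loss, gradient w.r.t. student logits, gate) for one token."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    # The gap is treated as a constant (detached): no gradient flows through it.
    gap = math.log(p_t[token]) - math.log(p_s[token])
    gate = sigmoid(beta * gap)
    loss = -gate * math.log(p_s[token])
    # d/dz of -log softmax(z)[token] is softmax(z) - onehot(token);
    # the detached gate only rescales this vector.
    grad = [gate * (p - (1.0 if i == token else 0.0)) for i, p in enumerate(p_s)]
    return loss, grad, gate

def total_objective(rl_loss, sdar_loss, lam):
    """Assumed combination: RL loss plus lam times the distillation loss."""
    return rl_loss + lam * sdar_loss

# Hypothetical logits over a three-token vocabulary; the teacher favors token 0.
loss, grad, gate = gated_token_loss_and_grad(
    [1.0, 0.0, -1.0], [2.0, 0.0, -1.0], token=0, beta=1.0
)
```

Since the gate lies strictly between 0 and 1, every component of the gated gradient is no larger in magnitude than the corresponding ungated log-likelihood gradient.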
Empirical Validation / Results
3.1 Main Results
Table 1 presents comprehensive results across ALFWorld, Search-QA, and WebShop for Qwen2.5-3B, 7B, and Qwen3-1.7B models.
Key Comparisons:
- SDAR vs. GRPO: Substantial improvements: +9.4 points on ALFWorld (3B, 84.4 vs. 75.0), +7.0 points on Search-QA (3B), and +10.2 points on WebShop-Acc (7B, 82.8 vs. 72.6).
- SDAR vs. Naive Hybrids: Avoids instability of GRPO+OPSD (which collapses on Qwen3-1.7B to 32.0 vs GRPO's 46.1) and consistently outperforms Skill-SD and RLSD.
- Skills Internalization: SDAR does not require skills at inference yet surpasses skill-augmented Skill-GRPO* in most settings (e.g., ALFWorld-3B: 84.4 vs 80.5). Skill-GRPO shows massive drops when tested without skills.
- Generalization: SDAR shows strong generalization, especially on the challenging Qwen3-1.7B model, where it achieves the highest score (53.9%) while Skill-GRPO drops to 21.1%.
Table 1: Performance on ALFWorld, Search-QA and WebShop tasks. (Excerpt for Qwen2.5-7B-Instruct on ALFWorld and WebShop-Acc)
| Method | ALFWorld Avg | WebShop Acc |
|---|---|---|
| Vanilla | 12.5 | 1.6 |
| GRPO | 81.2 | 72.6 |
| Skill-GRPO* | 88.3 | 81.2 |
| GRPO+OPSD | 80.4 | 76.5 |
| Skill-SD | 85.1 | 76.5 |
| RLSD | 82.0 | 77.3 |
| SDAR (Ours) | 85.9 | 82.8 |
3.2 Training Dynamics
Monitoring the Qwen2.5-7B model on ALFWorld reveals:
- The mean Teacher-Student gap is consistently negative, confirming the asymmetric trust regime.
- converges toward zero, showing the gate successfully identifies beneficial tokens.
- The gate activation ratio (the fraction of tokens whose gate is active) starts low (<0.5) and gradually increases as the student policy improves, reflecting adaptive filtering.
3.3 Robustness Analysis
Table 2 shows SDAR's performance with different skill retrieval methods, all outperforming the pure GRPO baseline (w/o OPSD).
- Even Random Retrieval yields gains (+1.9 points on ALFWorld, +1.0 points on WebShop-Acc).
- Higher-quality retrieval (KM, UCB) amplifies the benefits: KM achieves the largest WebShop-Acc gain (+10.2 points) and UCB the largest ALFWorld gain (+5.6 points).
Table 2: Robustness testing across different skill retrieval methods. (Qwen2.5-7B, ALFWorld & WebShop)
| Method | ALFWorld (Gain) | WebShop-Acc (Gain) |
|---|---|---|
| UCB | 86.8 (+5.6) | 81.2 (+8.6) |
| KM | 85.9 (+4.7) | 82.8 (+10.2) |
| Full | 83.2 (+2.0) | 78.1 (+5.5) |
| Random | 83.1 (+1.9) | 73.6 (+1.0) |
| w/o OPSD | 81.2 | 72.6 |
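The gains in Table 2 are differences, in percentage points, between each retrieval method's score and the w/o OPSD baseline. The short check below recomputes them from the table values; the dictionary layout and function name are illustrative only.

```python
# Scores copied from Table 2 (Qwen2.5-7B).
baseline = {"ALFWorld": 81.2, "WebShop-Acc": 72.6}  # w/o OPSD row
scores = {
    "UCB": {"ALFWorld": 86.8, "WebShop-Acc": 81.2},
    "KM": {"ALFWorld": 85.9, "WebShop-Acc": 82.8},
    "Full": {"ALFWorld": 83.2, "WebShop-Acc": 78.1},
    "Random": {"ALFWorld": 83.1, "WebShop-Acc": 73.6},
}

def gains_over_baseline(scores, baseline):
    """Gain in percentage points for every method and benchmark."""
    return {
        method: {bench: round(value - baseline[bench], 1) for bench, value in row.items()}
        for method, row in scores.items()
    }

gains = gains_over_baseline(scores, baseline)
best_webshop = max(gains, key=lambda m: gains[m]["WebShop-Acc"])
best_alfworld = max(gains, key=lambda m: gains[m]["ALFWorld"])
```

The recomputed values agree with the gains listed in the table, and they show that the best retrieval method differs by benchmark.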
3.4 Ablation Studies
- Gating Strategy: Teacher-Student Gap gating consistently outperforms Entropy and Soft-OR gating.
- Sharpness: Performance peaks at an intermediate sharpness. Removing the gate leads to instability; an overly large sharpness binarizes the gate, removing useful smooth modulation.
- Distillation Coefficient: Performance is sensitive to this coefficient. Setting it too large causes the distillation gradient to overwhelm the RL update, leading to performance decline.
- Distillation Objective: Reverse KL outperforms Forward KL and Jensen-Shannon Divergence (JSD). Reverse KL's mode-seeking property naturally down-weights tokens with low teacher probability, complementing the gating mechanism.
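For reference, the three distillation objectives compared in the ablation can be computed on a toy distribution pair. The sketch follows the usual distillation convention, in which reverse KL is KL(student || teacher) and forward KL is KL(teacher || student). The two distributions are made up for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical next-token distributions over a three-token vocabulary.
student = [0.70, 0.20, 0.10]
teacher = [0.50, 0.49, 0.01]

reverse_kl = kl(student, teacher)  # expectation under the student
forward_kl = kl(teacher, student)  # expectation under the teacher
js = jsd(student, teacher)
```

Reverse KL takes its expectation under the student's own samples, which is why it pairs naturally with on-policy rollouts; forward KL would instead require sampling from the teacher.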
Theoretical and Practical Implications
- Theoretical: SDAR provides a principled framework for combining RL and distillation. The token-level gating with detached signals ensures optimization stability (bounded gradients) and implements an online, self-paced curriculum. The theoretical analysis (Appendix A) formalizes these properties.
- Practical: SDAR offers a robust and effective recipe for post-training LLM agents. It:
  - Consistently improves agent performance across diverse benchmarks and model scales.
  - Successfully internalizes privileged knowledge (skills), reducing inference-time dependencies.
  - Is robust to the quality of auxiliary information (skill retrieval), degrading gracefully.
  - Avoids the tuning complexity and instability of hand-crafted curricula or rigid hybrid methods.
Conclusion
SDAR reconciles RL and OPSD for multi-turn agent training by treating OPSD as a gated auxiliary objective. The core mechanism is a token-level sigmoid gate that lets each token autonomously regulate its distillation intensity based on the detached Teacher-Student gap. This preserves RL as the unbiased primary backbone while selectively extracting beneficial teacher signals and attenuating unreliable negative guidance.
Empirical results across three benchmarks (ALFWorld, WebShop, Search-QA) and three model scales (Qwen2.5-3B/7B, Qwen3-1.7B) confirm that SDAR delivers consistent gains over pure RL and hybrid baselines, avoids catastrophic instability, and demonstrates strong skill internalization and robustness. The method provides a stable and effective approach for advancing agentic reinforcement learning.