# OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

> OPID extracts hierarchical hindsight skills from on-policy trajectories for dense token-level self-distillation, consistently outperforming outcome-only RL across diverse agentic tasks.

- **Source:** [arXiv](https://arxiv.org/abs/2606.26790)
- **Published:** 2026-06-27
- **Permalink:** https://picx.dev/p/REHpgh
- **Whiteboard:** https://picx.dev/p/REHpgh/image

## Summary (Overview)

- **Proposes OPID (On-Policy Skill Distillation):** A framework that extracts hierarchical hindsight skills (episode-level and step-level) from completed on-policy trajectories and uses them for dense token-level self-distillation in agentic reinforcement learning (RL).
- **Critical-first skill routing:** Selects step-level skills at critical timesteps and falls back to episode-level skills otherwise, providing appropriate granularity of guidance.
- **Combines skill-based self-distillation with outcome-based RL:** The token-level skill advantage $A^{\text{skill}}_{\tau,t,\ell}$ is added to the group-relative episode advantage $A^{\text{ep}}_{\tau,t,\ell}$ to form the final advantage $A^{\text{OPID}}_{\tau,t,\ell}$, preserving outcome optimization as the primary objective.
- **Strong empirical results:** OPID outperforms outcome-only RL (GRPO) in most model–domain combinations and matches or surpasses existing skill-distillation baselines in aggregate on ALFWorld, WebShop, and Search-based QA across model scales (Qwen2.5-3B, Qwen2.5-7B, Qwen3-1.7B), with gains in success rate, sample efficiency, and cross-domain generalization.
- **No inference-time overhead:** Skills are used only during training; at inference, the policy acts from the ordinary interaction history without analyzer calls, skill retrieval, or privileged context.

---

## Introduction and Theoretical Foundation

### Background and Motivation
- Large language models are increasingly deployed as interactive agents for long-horizon tasks (embodied control, web navigation, search-augmented QA). Reinforcement learning (RL) is a natural post-training paradigm, and outcome-based methods such as GRPO provide stable, critic-free optimization over on-policy rollouts.
- **Sparse reward problem:** Outcome-based RL offers only trajectory-level rewards, providing no guidance on which intermediate decisions should be reinforced or suppressed. This is especially severe in long-horizon interaction, where a single early mistake may derail the episode.
- On-policy self-distillation provides dense token-level supervision by comparing the same policy under different contexts. Skill-conditioned variants use natural-language skills as privileged context, but often rely on external skill memories, which are costly to maintain and may be mismatched with the current policy's state distribution in multi-turn interaction.

### Core Idea of OPID
OPID extracts hindsight skills directly from completed on-policy trajectories, avoiding external skill libraries. Skills are hierarchical:
- **Episode-level skills ($s^{\text{ep}}_\tau$):** Capture global workflows or failure-avoidance rules for the entire trajectory.
- **Step-level skills ($s^{\text{step}}_{\tau,t}$):** Capture local decision knowledge at critical timesteps (e.g., avoiding repeated invalid actions, selecting the next object to inspect).

A **critical-first routing** mechanism selects step-level skills when critical decisions are identified and falls back to episode-level skills otherwise. The routed skill is injected into the interaction history, and the old policy re-scores the same sampled response under both original and skill-augmented contexts, producing a token-level self-distillation advantage.

---

## Methodology

### Problem Formulation
The agentic task is modeled as a partially observable Markov decision process $(\mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \gamma)$. At timestep $t$, the agent maintains an interaction history $h_t$ and generates a response $y_t \sim \pi_\theta(\cdot | h_t)$. A completed trajectory is $\tau = \{ (o_t, y_t, r_t) \}_{t=0}^{T-1}$, with outcome score $R(\tau)$.

Following GRPO, for each task prompt $q$, a group of $N$ trajectories is sampled from the current policy: $\mathcal{G}_q = \{ \tau^{(1)}, \ldots, \tau^{(N)} \}$.

### On-Policy Skill Extraction
An LLM-based analyzer $\mathcal{A}$ maps a trajectory $\tau$ to:
$$\mathcal{A}(\tau) = \left( s^{\text{ep}}_\tau, \{ s^{\text{step}}_{\tau,t} \}_{t \in \mathcal{C}_\tau} \right)$$
where $\mathcal{C}_\tau$ is the set of critical timesteps identified by the analyzer.

### Critical-First Skill Routing
For trajectory $\tau$ and timestep $t$, the routed skill is:
$$s_{\tau,t} = \begin{cases} s^{\text{step}}_{\tau,t}, & \text{if } t \in \mathcal{C}_\tau, \\ s^{\text{ep}}_\tau, & \text{otherwise}. \end{cases}$$
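The routing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AnalyzerOutput` container, the `route_skill` helper, and the example skill strings are hypothetical names and values, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AnalyzerOutput:
    """Hypothetical container for the analyzer result A(tau)."""
    episode_skill: str  # s^ep_tau
    # Maps each critical timestep t in C_tau to its step-level skill s^step_{tau,t}.
    step_skills: Dict[int, str] = field(default_factory=dict)


def route_skill(analysis: AnalyzerOutput, t: int) -> str:
    """Critical-first routing: step-level skill if t is critical, else episode-level."""
    if t in analysis.step_skills:
        return analysis.step_skills[t]
    return analysis.episode_skill


# Made-up skills for a 5-step trajectory in which only t = 3 is critical.
analysis = AnalyzerOutput(
    episode_skill="Locate the target object before visiting the receptacle.",
    step_skills={3: "Do not repeat an action the environment just rejected."},
)
routed = [route_skill(analysis, t) for t in range(5)]
```

Here the keys of `step_skills` play the role of $\mathcal{C}_\tau$, so every non-critical timestep receives the same episode-level skill.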

### Skill-Conditioned Self-Distillation
Let $\tilde{h}_{\tau,t} = H(h_{\tau,t}, s_{\tau,t})$ be the skill-augmented history. The old policy $\pi_{\theta_{\text{old}}}$ scores the same response $y_{\tau,t}$ under both contexts:
- $\ell^{\text{old}}_{\tau,t,\ell} = \log \pi_{\theta_{\text{old}}}(y_{\tau,t,\ell} | h_{\tau,t}, y_{\tau,t,<\ell})$
- $\ell^{\text{skill}}_{\tau,t,\ell} = \log \pi_{\theta_{\text{old}}}(y_{\tau,t,\ell} | \tilde{h}_{\tau,t}, y_{\tau,t,<\ell})$

The skill-based self-teacher advantage (masked by valid response token mask $m_{\tau,t,\ell}$) is:
$$A^{\text{skill}}_{\tau,t,\ell} = \left( \ell^{\text{skill}}_{\tau,t,\ell} - \ell^{\text{old}}_{\tau,t,\ell} \right) m_{\tau,t,\ell}$$
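As a minimal sketch of this formula, the snippet below computes the masked log-probability difference for one response. It assumes the two sets of per-token log-probabilities have already been obtained by scoring the same response under both contexts; the function name and the numbers are made up for illustration.

```python
from typing import List


def skill_advantage(logp_skill: List[float], logp_old: List[float], mask: List[int]) -> List[float]:
    """Per-token A^skill = (log-prob with skill - log-prob without skill) * mask."""
    if not (len(logp_skill) == len(logp_old) == len(mask)):
        raise ValueError("all inputs must have one entry per response token")
    return [(ls - lo) * m for ls, lo, m in zip(logp_skill, logp_old, mask)]


# Made-up log-probabilities for a 4-token response; the last position is padding.
logp_old = [-2.0, -1.0, -0.5, -3.0]
logp_skill = [-1.5, -1.0, -1.0, -0.1]
mask = [1, 1, 1, 0]
adv = skill_advantage(logp_skill, logp_old, mask)
```

A positive entry means the skill made the sampled token more likely, a negative entry means the skill made it less likely, and masked positions contribute nothing.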

### Policy Optimization with Skill Advantage
For each rollout group $\mathcal{G}_q$, the group mean and standard deviation of outcome rewards are computed:
$$\mu_q = \text{mean}(\{ R(\tau') | \tau' \in \mathcal{G}_q \}), \quad \sigma_q = \text{std}(\{ R(\tau') | \tau' \in \mathcal{G}_q \})$$

The episode-relative advantage is:
$$A^{\text{ep}}_\tau = \frac{R(\tau) - \mu_q}{\sigma_q}, \quad \tau \in \mathcal{G}_q$$
Broadcast to tokens: $A^{\text{ep}}_{\tau,t,\ell} = A^{\text{ep}}_\tau m_{\tau,t,\ell}$.
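The group normalization and token broadcast can be sketched as follows. Two details are assumptions rather than statements from the source: the use of the population standard deviation (the text only says "std"), and the small `eps` added to the denominator to avoid division by zero when all rewards in a group are equal.

```python
from statistics import mean, pstdev
from typing import List


def episode_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage (R - mean) / std for one rollout group G_q."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std
    # eps is an added guard (not in the source) for groups with identical rewards.
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(a_ep: float, mask: List[int]) -> List[float]:
    """A^ep_{tau,t,l} = A^ep_tau * m_{tau,t,l}."""
    return [a_ep * m for m in mask]


# Made-up binary outcomes for a group of N = 8 trajectories.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
adv = episode_advantages(rewards)
token_adv = broadcast_to_tokens(adv[0], mask=[1, 1, 1, 0])
```

Under this sketch, a group in which every trajectory receives the same reward yields an episode advantage of zero for all of its tokens.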

The final OPID advantage:
$$A^{\text{OPID}}_{\tau,t,\ell} = A^{\text{ep}}_{\tau,t,\ell} + \lambda_{\text{skill}} A^{\text{skill}}_{\tau,t,\ell}$$
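To give a feel for the relative scale of the two terms, here is a small sketch of the combination. The default `lambda_skill=0.001` is the value reported in the experimental setup; the helper name and the input numbers are illustrative assumptions.

```python
from typing import List


def opid_advantage(a_ep: List[float], a_skill: List[float], lambda_skill: float = 0.001) -> List[float]:
    """Per-token A^OPID = A^ep + lambda_skill * A^skill (both inputs already masked)."""
    if len(a_ep) != len(a_skill):
        raise ValueError("a_ep and a_skill must have the same number of tokens")
    return [e + lambda_skill * s for e, s in zip(a_ep, a_skill)]


# Made-up values: one successful trajectory with a padded final token.
a_ep = [1.2, 1.2, 1.2, 0.0]
a_skill = [0.5, 0.0, -0.5, 0.0]
a_opid = opid_advantage(a_ep, a_skill)
```

With these numbers the skill term shifts each token's advantage by at most 0.0005, so it reweights tokens within the trajectory while the outcome term sets the sign and overall magnitude.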

The policy is optimized with the clipped PPO objective:
$$\mathcal{L}_{\text{policy}}(\theta) = -\mathbb{E}_{\tau,t,\ell} \left[ \min\left( \rho_{\tau,t,\ell}(\theta) A^{\text{OPID}}_{\tau,t,\ell}, \; \text{clip}(\rho_{\tau,t,\ell}(\theta), 1-\epsilon, 1+\epsilon) A^{\text{OPID}}_{\tau,t,\ell} \right) \right] + \beta \mathcal{L}_{\text{KL}}(\theta)$$
where $\rho_{\tau,t,\ell}(\theta) = \exp(\log \pi_\theta(y_{\tau,t,\ell} | h_{\tau,t}, y_{\tau,t,<\ell}) - \log \pi_{\theta_{\text{old}}}(y_{\tau,t,\ell} | h_{\tau,t}, y_{\tau,t,<\ell}))$.
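The clipped surrogate can be sketched for a single response as below. This is a simplified illustration, not the authors' implementation: it omits the $\beta \mathcal{L}_{\text{KL}}$ term, works on plain Python lists instead of tensors, and averages over valid tokens, which is an assumed reduction.

```python
import math
from typing import List


def clipped_policy_loss(
    logp_new: List[float],
    logp_old: List[float],
    adv: List[float],
    mask: List[int],
    eps: float = 0.2,
) -> float:
    """Negative mean over valid tokens of min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    total, count = 0.0, 0
    for ln, lo, a, m in zip(logp_new, logp_old, adv, mask):
        if not m:
            continue  # skip padding / non-response tokens
        ratio = math.exp(ln - lo)  # rho = pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * a, clipped * a)
        count += 1
    return -total / max(count, 1)


# When the new and old policies agree, rho = 1 and the loss is minus the mean advantage.
loss_on_policy = clipped_policy_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, 3.0], [1, 1])
```

The `min` makes clipping one-sided: a large ratio is capped at `1 + eps` when the advantage is positive, but is left unclipped when the advantage is negative.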

---

## Empirical Validation / Results

### Experimental Setup
- **Benchmarks:** ALFWorld (embodied household), WebShop (e-commerce), Search-based QA (search-augmented question answering on NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle).
- **Baselines:** Vanilla, Skill-Prompt\*, GRPO, Skill-GRPO, Skill-GRPO\*, GRPO+OPSD, Skill-SD, RLSD, SDAR. An asterisk (\*) marks methods that are also given skills at validation time.
- **Backbones:** Qwen2.5-3B/7B-Instruct, Qwen3-1.7B-Instruct.
- **Implementation:** Training for 150 steps, batch size 16 for ALFWorld/WebShop, 128 for Search-based QA, group size $N=8$, $\lambda_{\text{skill}}=0.001$, $\epsilon=0.2$.

### Main Results (Table 1)
Performance comparison on ALFWorld (success rate %), Search-based QA (accuracy %), WebShop (Score / Succ. %). Best and second-best highlighted.

**Key findings:**
- OPID improves over GRPO in most model–domain combinations. E.g., on Qwen2.5-3B: ALFWorld +9.3 points (84.3 vs 75.0), Search-based QA +8.6 (45.0 vs 36.4), WebShop +10.9 (74.2 vs 63.3).
- OPID matches or surpasses strong hybrid baselines (SDAR, RLSD, Skill-GRPO*) in aggregate settings.
- OPID outperforms Skill-GRPO (without inference skills) by large margins, showing it internalizes skills rather than depending on them at inference.

| Method | ALFWorld Avg (succ. %) | Search Avg (acc. %) | WebShop Score | WebShop Succ. (%) |
|--------|-------------|------------|---------------|----------------|
| **Qwen2.5-3B** | | | | |
| GRPO | 75.0 | 36.4 | 79.8 | 63.3 |
| Skill-GRPO* | 80.5 | 36.1 | 76.3 | 66.4 |
| SDAR | 84.4 | 43.4 | 85.0 | 68.0 |
| **OPID** | **84.3** | **45.0** | **85.0** | **74.2** |
| **Qwen2.5-7B** | | | | |
| GRPO | 81.2 | 42.0 | 80.9 | 72.6 |
| Skill-GRPO* | 88.3 | 47.5 | 87.0 | 81.2 |
| SDAR | 85.9 | 49.0 | 89.4 | 82.8 |
| **OPID** | **90.0** | **49.2** | 85.3 | 79.7 |
| **Qwen3-1.7B** | | | | |
| GRPO | 46.1 | 40.8 | 67.3 | 38.3 |
| SDAR | 53.9 | 41.9 | 76.8 | 58.6 |
| **OPID** | **58.9** | 40.4 | **79.6** | **64.8** |
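As a quick arithmetic check of the "improves over GRPO" claim, the following sketch recomputes the OPID minus GRPO differences from the GRPO and OPID rows of the table above. The dictionary layout and variable names are illustrative; the numbers are copied from the table.

```python
# Columns: (ALFWorld avg, Search avg, WebShop score, WebShop succ.)
results = {
    "Qwen2.5-3B": {"GRPO": (75.0, 36.4, 79.8, 63.3), "OPID": (84.3, 45.0, 85.0, 74.2)},
    "Qwen2.5-7B": {"GRPO": (81.2, 42.0, 80.9, 72.6), "OPID": (90.0, 49.2, 85.3, 79.7)},
    "Qwen3-1.7B": {"GRPO": (46.1, 40.8, 67.3, 38.3), "OPID": (58.9, 40.4, 79.6, 64.8)},
}

# Round to one decimal to match the table's precision and hide float noise.
deltas = {
    model: tuple(round(o - g, 1) for o, g in zip(rows["OPID"], rows["GRPO"]))
    for model, rows in results.items()
}

for model, d in deltas.items():
    print(model, d)
```

The only negative difference is the Search average on Qwen3-1.7B (40.4 vs 40.8), which is why the key findings say "most" rather than "all" model–domain combinations.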

### Training Dynamics
- Figure 3 shows OPID diverging from GRPO mid-training and then maintaining a higher success rate while reducing the average episode length (15-16 steps vs 17-18 steps), which indicates more direct action sequences.

### Sample Efficiency (Figure 4)
- OPID consistently outperforms GRPO under reduced training data fractions. With 60% of the data, OPID reaches a 71.9% success rate (close to the 75.0% of GRPO trained on the full data); with 80% of the data, OPID exceeds full-data GRPO (78.9% vs 75.0%). Absolute gains range from +9.3 to +20.3 points.

### Cross-Domain Generalization (Figure 5)
- On ALFWorld unseen split, OPID achieves 78.6% success vs GRPO 70.9%, with large gains on Look (+26.7) and Heat (+18.5).

### Ablation Studies
- **Hierarchical skills (Table 2):** Removing episode-level skills drops ALFWorld avg from 84.3 to 74.1; removing step-level skills drops to 79.1. Both levels are complementary.
- **Critical-first routing (Table 3):** Without routing (superimposing both skills), ALFWorld avg drops from 84.3 to 77.5, confirming the importance of selective routing.

### Theoretical Analysis (Appendix A)
- Proposition 1: The unclipped OPID skill loss decomposes as $\mathcal{L}^{\text{unclip}}_{\text{skill}}(\theta) = \lambda_{\text{skill}} [\mathcal{L}_{\text{RKL}}(\theta) - D_b(\theta)]$, showing it is a relative-KL loss locally equivalent to reverse-KL distillation at the behavior policy.
- Proposition 2: On-policy occupancy matching eliminates outer context-distribution mismatch for distillation.
- Proposition 3: Critical-first routing recovers the oracle candidate-teacher selection under perfect criticality detection, with degradation controlled by detector error.

---

## Theoretical and Practical Implications

- **Theoretical insight:** OPID provides a principled way to convert trajectory-derived skills into dense token-level supervision that complements outcome rewards. The loss is locally equivalent to reverse-KL distillation, ensuring the policy is shaped toward the skill-conditioned teacher without drifting from the behavior policy.
- **Practical significance:** OPID eliminates the need for external skill libraries, retrieval mechanisms, or privileged context at inference time. The distribution-matched hindsight supervision improves sample efficiency, reduces repetitive/invalid actions, and enables better cross-domain generalization.
- **Behavioral improvement:** OPID agents learn more direct action sequences (shorter episode lengths) and avoid hallucinated targets or object substitution errors, as shown in qualitative examples (Figure 6).

---

## Conclusion

OPID is an on-policy skill distillation framework that turns completed agent trajectories into hierarchical hindsight supervision (episode-level and step-level skills). By using critical-first routing and combining skill-based self-distillation with outcome-based RL, OPID provides dense, distribution-matched token-level guidance without external skill libraries or inference-time overhead. Experiments across embodied, web, and search-based benchmarks demonstrate consistent improvements in performance, sample efficiency, and robustness.

**Future directions:** Evaluate in broader interactive environments (OdysseyArena, WebArena, VisualWebArena), enrich skill structure with higher-level reasoning abstractions, and improve training efficiency with speculative decoding methods.

