Visual Summary | OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

Summary (Overview)

Proposes OPID (On-Policy Skill Distillation): A framework that extracts hierarchical hindsight skills (episode-level and step-level) from completed on-policy trajectories and uses them for dense token-level self-distillation in agentic reinforcement learning (RL).
Critical-first skill routing: Selects step-level skills at critical timesteps and falls back to episode-level skills otherwise, providing appropriate granularity of guidance.
Combines skill-based self-distillation with outcome-based RL: The token-level skill advantage $A^{\text{skill}}_{\tau,t,\ell}$ is added to the group-relative episode advantage $A^{\text{ep}}_{\tau,t,\ell}$ to form the final advantage $A^{\text{OPID}}_{\tau,t,\ell}$, preserving outcome optimization as the primary objective.
Strong empirical results: OPID outperforms outcome-only RL (GRPO) and existing skill-distillation baselines in most settings on ALFWorld, WebShop, and Search-based QA across model scales (Qwen2.5-3B, Qwen2.5-7B, Qwen3-1.7B), with gains in success rate, sample efficiency, and cross-domain generalization.
No inference-time overhead: Skills are used only during training; at inference, the policy acts from the ordinary interaction history without analyzer calls, skill retrieval, or privileged context.

Introduction and Theoretical Foundation

Background and Motivation

Large language models are increasingly deployed as interactive agents for long-horizon tasks (embodied control, web navigation, search-augmented QA). Reinforcement learning (RL) is a natural post-training paradigm, with outcome-based methods like GRPO providing stable, critic-free optimization over on-policy rollouts.
Sparse reward problem: Outcome-based RL offers only trajectory-level rewards, providing no guidance on which intermediate decisions should be reinforced or suppressed. This is especially severe in long-horizon interaction, where a single early mistake may derail the episode.
On-policy self-distillation provides dense token-level supervision by comparing the same policy under different contexts. Skill-conditioned variants use natural-language skills as privileged context, but often rely on external skill memories, which are costly to maintain and may be mismatched with the current policy's state distribution in multi-turn interaction.

Core Idea of OPID

OPID extracts hindsight skills directly from completed on-policy trajectories, avoiding external skill libraries. Skills are hierarchical:

Episode-level skills ($s^{\text{ep}}_\tau$): Capture global workflows or failure-avoidance rules for the entire trajectory.
Step-level skills ($s^{\text{step}}_{\tau,t}$): Capture local decision knowledge at critical timesteps (e.g., avoiding repeated invalid actions, selecting the next object to inspect).

A critical-first routing mechanism selects step-level skills when critical decisions are identified and falls back to episode-level skills otherwise. The routed skill is injected into the interaction history, and the old policy re-scores the same sampled response under both original and skill-augmented contexts, producing a token-level self-distillation advantage.

Methodology

Problem Formulation

The agentic task is modeled as a partially observable Markov decision process $(\mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \gamma)$. At timestep $t$, the agent maintains an interaction history $h_t$ and generates a response $y_t \sim \pi_\theta(\cdot | h_t)$. A completed trajectory is denoted $\tau = \{ (o_t, y_t, r_t) \}_{t=0}^{T-1}$ and receives an outcome score $R(\tau)$.

Following GRPO, for each task prompt $q$, a group of $N$ trajectories is sampled from the current policy: $\mathcal{G}_q = \{ \tau^{(1)}, \ldots, \tau^{(N)} \}$.

On-Policy Skill Extraction

An LLM-based analyzer $\mathcal{A}$ maps a trajectory $\tau$ to:

$$\mathcal{A}(\tau) = \left( s^{\text{ep}}_\tau, \{ s^{\text{step}}_{\tau,t} \}_{t \in \mathcal{C}_\tau} \right)$$

where $\mathcal{C}_\tau$ is the set of critical timesteps identified by the analyzer.

Critical-First Skill Routing

For trajectory $\tau$ and timestep $t$, the routed skill is:

$$s_{\tau,t} = \begin{cases} s^{\text{step}}_{\tau,t}, & \text{if } t \in \mathcal{C}_\tau, \\ s^{\text{ep}}_\tau, & \text{otherwise}. \end{cases}$$
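
The routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `HindsightSkills` container, the `route_skill` function, and the example skill strings are all hypothetical names and values introduced here. The keys of `step_skills` play the role of the critical set $\mathcal{C}_\tau$.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HindsightSkills:
    """Hypothetical container for the analyzer output of one trajectory."""
    episode_skill: str  # s^ep_tau
    # Maps a critical timestep t to s^step_{tau,t}; the keys form C_tau.
    step_skills: Dict[int, str] = field(default_factory=dict)


def route_skill(skills: HindsightSkills, t: int) -> str:
    """Critical-first routing: step-level skill if t is critical, else episode-level."""
    return skills.step_skills.get(t, skills.episode_skill)


# Illustrative (made-up) skills for a 5-step trajectory with one critical step.
skills = HindsightSkills(
    episode_skill="Locate the target object before heading to the destination.",
    step_skills={3: "Do not repeat an action the environment just rejected."},
)
routed = [route_skill(skills, t) for t in range(5)]
```

In this sketch only timestep 3 receives the step-level skill; every other timestep falls back to the episode-level skill, so each timestep is guided by exactly one skill.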

Skill-Conditioned Self-Distillation

Let $\tilde{h}_{\tau,t} = H(h_{\tau,t}, s_{\tau,t})$ be the skill-augmented history. The old policy $\pi_{\theta_{\text{old}}}$ scores the same response $y_{\tau,t}$ under both contexts:

$\ell^{\text{old}}_{\tau,t,\ell} = \log \pi_{\theta_{\text{old}}}(y_{\tau,t,\ell} | h_{\tau,t}, y_{\tau,t,<\ell})$
$\ell^{\text{skill}}_{\tau,t,\ell} = \log \pi_{\theta_{\text{old}}}(y_{\tau,t,\ell} | \tilde{h}_{\tau,t}, y_{\tau,t,<\ell})$

The skill-based self-teacher advantage (masked by the valid response token mask $m_{\tau,t,\ell}$) is:

$$A^{\text{skill}}_{\tau,t,\ell} = \left( \ell^{\text{skill}}_{\tau,t,\ell} - \ell^{\text{old}}_{\tau,t,\ell} \right) m_{\tau,t,\ell}$$
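
As a hedged illustration of the formula above, the following sketch computes the token-level skill advantage for one response from two lists of per-token log-probabilities. The function name `skill_advantage` and the numeric values are assumptions made for the example; in practice the log-probabilities would come from scoring the same sampled tokens with the old policy under the original and the skill-augmented history.

```python
from typing import List


def skill_advantage(
    logp_skill: List[float],  # log pi_old(token | skill-augmented history, prefix)
    logp_old: List[float],    # log pi_old(token | original history, prefix)
    mask: List[int],          # 1 for valid response tokens, 0 otherwise
) -> List[float]:
    """Per-token A^skill = (logp_skill - logp_old) * mask."""
    if not (len(logp_skill) == len(logp_old) == len(mask)):
        raise ValueError("all inputs must have one entry per token")
    return [(ls - lo) * m for ls, lo, m in zip(logp_skill, logp_old, mask)]


# Made-up log-probabilities for a 4-token response whose last token is masked out.
logp_old = [-2.0, -1.0, -0.5, -3.0]
logp_skill = [-1.5, -1.0, -0.75, -0.25]
mask = [1, 1, 1, 0]
a_skill = skill_advantage(logp_skill, logp_old, mask)
```

A positive entry means the skill-conditioned context makes the sampled token more likely, a negative entry means it makes the token less likely, and masked tokens contribute zero.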

Policy Optimization with Skill Advantage

For each rollout group $\mathcal{G}_q$ , the group mean and standard deviation of outcome rewards are computed:

$$\mu_q = \text{mean}(\{ R(\tau') | \tau' \in \mathcal{G}_q \}), \quad \sigma_q = \text{std}(\{ R(\tau') | \tau' \in \mathcal{G}_q \})$$

The episode-relative advantage is:

$$A^{\text{ep}}_\tau = \frac{R(\tau) - \mu_q}{\sigma_q}, \quad \tau \in \mathcal{G}_q$$

Broadcast to tokens: $A^{\text{ep}}_{\tau,t,\ell} = A^{\text{ep}}_\tau m_{\tau,t,\ell}$.

The final OPID advantage:

$$A^{\text{OPID}}_{\tau,t,\ell} = A^{\text{ep}}_{\tau,t,\ell} + \lambda_{\text{skill}} A^{\text{skill}}_{\tau,t,\ell}$$
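
The advantage computation above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the function names are hypothetical, the rewards and skill advantages are made up, the standard deviation is taken as the population standard deviation, and a small `eps` is added to the denominator to avoid division by zero when all rewards in a group are equal (the formula in the text divides by $\sigma_q$ directly).

```python
import statistics
from typing import List


def episode_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage (R - mean) / (std + eps) for one rollout group.

    Assumptions: population std and the eps stabilizer are choices made here.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def opid_advantage(
    a_ep: float,            # episode advantage of this trajectory
    a_skill: List[float],   # per-token skill advantage (already masked)
    mask: List[int],        # valid response token mask
    lam_skill: float = 0.001,
) -> List[float]:
    """Per-token A^OPID = A^ep * mask + lambda_skill * A^skill."""
    return [a_ep * m + lam_skill * a for a, m in zip(a_skill, mask)]


# Made-up binary outcomes for a group of N = 8 trajectories.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
a_ep = episode_advantages(rewards)

# Made-up per-token skill advantages for the first (successful) trajectory.
a_skill = [0.5, 0.0, -0.25, 0.0]
mask = [1, 1, 1, 0]
a_opid = opid_advantage(a_ep[0], a_skill, mask)
```

With $\lambda_{\text{skill}}=0.001$, the skill term only nudges the per-token advantage around the episode-level value, which is consistent with the text's description of outcome optimization as the primary objective.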

The policy is optimized with the clipped PPO objective:

$$\mathcal{L}_{\text{policy}}(\theta) = -\mathbb{E}_{\tau,t,\ell} \left[ \min\left( \rho_{\tau,t,\ell}(\theta) A^{\text{OPID}}_{\tau,t,\ell}, \; \text{clip}(\rho_{\tau,t,\ell}(\theta), 1-\epsilon, 1+\epsilon) A^{\text{OPID}}_{\tau,t,\ell} \right) \right] + \beta \mathcal{L}_{\text{KL}}(\theta)$$

where $\rho_{\tau,t,\ell}(\theta) = \exp(\log \pi_\theta(y_{\tau,t,\ell} | h_{\tau,t}, y_{\tau,t,<\ell}) - \log \pi_{\theta_{\text{old}}}(y_{\tau,t,\ell} | h_{\tau,t}, y_{\tau,t,<\ell}))$.
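
To make the clipping behavior concrete, here is a minimal pure-Python sketch of the clipped surrogate term for a single response. It is an illustration only: the function name is hypothetical, the expectation is approximated by a mean over valid tokens (an assumption about the reduction), and the $\beta \mathcal{L}_{\text{KL}}$ term is omitted because its exact form is not given in this summary.

```python
import math
from typing import List


def clipped_policy_loss(
    logp_new: List[float],    # log pi_theta(token | history, prefix)
    logp_old: List[float],    # log pi_theta_old(token | history, prefix)
    advantages: List[float],  # per-token A^OPID
    mask: List[int],          # valid response token mask
    eps: float = 0.2,
) -> float:
    """Negative mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) over valid tokens."""
    total, count = 0.0, 0
    for ln, lo, adv, m in zip(logp_new, logp_old, advantages, mask):
        if not m:
            continue  # masked tokens do not contribute
        rho = math.exp(ln - lo)
        rho_clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        total += min(rho * adv, rho_clipped * adv)
        count += 1
    return -total / max(count, 1)


# When the new and old policies agree, rho = 1 and the loss is -mean(A).
loss_on_policy = clipped_policy_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -0.5], [1, 1])

# With rho = 2 and a positive advantage, the ratio is clipped at 1 + eps.
loss_clipped = clipped_policy_loss([math.log(2.0)], [0.0], [1.0], [1])
```

The `min` makes the objective pessimistic: a large ratio cannot inflate the gain on a positive-advantage token beyond the clipped value, while for a negative-advantage token the unclipped (more negative) term is kept.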

Empirical Validation / Results

Experimental Setup

Benchmarks: ALFWorld (embodied household), WebShop (e-commerce), Search-based QA (search-augmented question answering on NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle).
Baselines: Vanilla, Skill-Prompt*, GRPO, Skill-GRPO, Skill-GRPO*, GRPO+OPSD, Skill-SD, RLSD, SDAR. An asterisk (*) marks methods that are also given skills at validation (inference) time.
Backbones: Qwen2.5-3B/7B-Instruct, Qwen3-1.7B-Instruct.
Implementation: Training for 150 steps, batch size 16 for ALFWorld/WebShop and 128 for Search-based QA, group size $N=8$, $\lambda_{\text{skill}}=0.001$, $\epsilon=0.2$.

Main Results (Table 1)

Performance comparison on ALFWorld (success rate %), Search-based QA (accuracy %), WebShop (Score / Succ. %). Best and second-best highlighted.

Key findings:

OPID improves over GRPO in most model–domain combinations. E.g., on Qwen2.5-3B: ALFWorld +9.3 points (84.3 vs 75.0), Search-based QA +8.6 (45.0 vs 36.4), WebShop success rate +10.9 (74.2 vs 63.3).
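OPID matches or surpasses strong hybrid baselines (SDAR, RLSD, Skill-GRPO*) in most settings. In the table below, the exceptions are WebShop on Qwen2.5-7B, where SDAR and Skill-GRPO* score higher, and Search-based QA on Qwen3-1.7B, where SDAR leads (41.9 vs 40.4).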
OPID outperforms Skill-GRPO (without inference skills) by large margins, showing it internalizes skills rather than depending on them at inference.

Method	ALFWorld Avg	Search Avg	WebShop Score	WebShop Succ.
Qwen2.5-3B
GRPO	75.0	36.4	79.8	63.3
Skill-GRPO*	80.5	36.1	76.3	66.4
SDAR	84.4	43.4	85.0	68.0
OPID	84.3	45.0	85.0	74.2
Qwen2.5-7B
GRPO	81.2	42.0	80.9	72.6
Skill-GRPO*	88.3	47.5	87.0	81.2
SDAR	85.9	49.0	89.4	82.8
OPID	90.0	49.2	85.3	79.7
Qwen3-1.7B
GRPO	46.1	40.8	67.3	38.3
SDAR	53.9	41.9	76.8	58.6
OPID	58.9	40.4	79.6	64.8
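
The per-backbone gains of OPID over GRPO can be recomputed directly from the rows above. The snippet below is a small arithmetic check, not part of the paper: the dictionary layout and the function name `gains_over_grpo` are illustrative choices, and the numbers are copied from the table.

```python
from typing import Dict, Tuple

COLUMNS = ("ALFWorld Avg", "Search Avg", "WebShop Score", "WebShop Succ.")

# Values copied from the GRPO and OPID rows of the table above.
RESULTS: Dict[str, Dict[str, Tuple[float, ...]]] = {
    "Qwen2.5-3B": {"GRPO": (75.0, 36.4, 79.8, 63.3), "OPID": (84.3, 45.0, 85.0, 74.2)},
    "Qwen2.5-7B": {"GRPO": (81.2, 42.0, 80.9, 72.6), "OPID": (90.0, 49.2, 85.3, 79.7)},
    "Qwen3-1.7B": {"GRPO": (46.1, 40.8, 67.3, 38.3), "OPID": (58.9, 40.4, 79.6, 64.8)},
}


def gains_over_grpo(results):
    """Return OPID minus GRPO per column, rounded to one decimal place."""
    return {
        model: tuple(round(o - g, 1) for o, g in zip(rows["OPID"], rows["GRPO"]))
        for model, rows in results.items()
    }


gains = gains_over_grpo(RESULTS)
for model, deltas in gains.items():
    print(model, dict(zip(COLUMNS, deltas)))
```

Every difference is positive except Search-based QA on Qwen3-1.7B, where OPID trails GRPO slightly (40.4 vs 40.8), which is why the gains are described as holding in most rather than all model–domain combinations.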

Training Dynamics

Figure 3 shows OPID diverging from GRPO mid-training and then maintaining a higher success rate while reducing average episode length (15-16 steps vs. 17-18 steps for GRPO), indicating more direct action sequences.

Sample Efficiency (Figure 4)

OPID consistently outperforms GRPO under reduced training data fractions. With 60% of the data, OPID reaches a 71.9% success rate (close to full-data GRPO at 75.0%); with 80% of the data, OPID exceeds full-data GRPO (78.9% vs 75.0%). Absolute gains range from +9.3 to +20.3 points.

Cross-Domain Generalization (Figure 5)

On the ALFWorld unseen split, OPID achieves 78.6% success vs. 70.9% for GRPO, with large gains on the Look (+26.7) and Heat (+18.5) task types.

Ablation Studies

Hierarchical skills (Table 2): Removing episode-level skills drops the ALFWorld average from 84.3 to 74.1; removing step-level skills drops it to 79.1. The two levels are complementary.
Critical-first routing (Table 3): Without routing (superimposing both skills), ALFWorld avg drops from 84.3 to 77.5, confirming the importance of selective routing.

Theoretical Analysis (Appendix A)

Proposition 1: The unclipped OPID skill loss decomposes as $\mathcal{L}^{\text{unclip}}_{\text{skill}}(\theta) = \lambda_{\text{skill}} [\mathcal{L}_{\text{RKL}}(\theta) - D_b(\theta)]$, showing it is a relative-KL loss locally equivalent to reverse-KL distillation at the behavior policy.
Proposition 2: On-policy occupancy matching eliminates outer context-distribution mismatch for distillation.
Proposition 3: Critical-first routing recovers the oracle candidate-teacher selection under perfect criticality detection, with degradation controlled by detector error.

Theoretical and Practical Implications

Theoretical insight: OPID provides a principled way to convert trajectory-derived skills into dense token-level supervision that complements outcome rewards. The loss is locally equivalent to reverse-KL distillation, ensuring the policy is shaped toward the skill-conditioned teacher without drifting from the behavior policy.
Practical significance: OPID eliminates the need for external skill libraries, retrieval mechanisms, or privileged context at inference time. The distribution-matched hindsight supervision improves sample efficiency, reduces repetitive/invalid actions, and enables better cross-domain generalization.
Behavioral improvement: OPID agents learn more direct action sequences (shorter episode lengths) and avoid hallucinated targets or object substitution errors, as shown in qualitative examples (Figure 6).

Conclusion

OPID is an on-policy skill distillation framework that turns completed agent trajectories into hierarchical hindsight supervision (episode-level and step-level skills). By using critical-first routing and combining skill-based self-distillation with outcome-based RL, OPID provides dense, distribution-matched token-level guidance without external skill libraries or inference-time overhead. Experiments across embodied, web, and search-based benchmarks demonstrate consistent improvements in performance, sample efficiency, and robustness.

Future directions: Evaluate in broader interactive environments (OdysseyArena, WebArena, VisualWebArena), enrich skill structure with higher-level reasoning abstractions, and improve training efficiency with speculative decoding methods.