Summary (Overview)

  • Proposes OPID (On-Policy Skill Distillation): A framework that extracts hierarchical hindsight skills (episode-level and step-level) from completed on-policy trajectories and uses them for dense token-level self-distillation in agentic reinforcement learning (RL).
  • Critical-first skill routing: Selects step-level skills at critical timesteps and falls back to episode-level skills otherwise, providing appropriate granularity of guidance.
  • Combines skill-based self-distillation with outcome-based RL: The token-level skill advantage $A^{\text{skill}}_{\tau,t,\ell}$, scaled by $\lambda_{\text{skill}}$, is added to the group-relative episode advantage $A^{\text{ep}}_{\tau,t,\ell}$ to form the final advantage $A^{\text{OPID}}_{\tau,t,\ell}$, preserving outcome optimization as the primary objective.
  • Strong empirical results: OPID consistently outperforms outcome-only RL (GRPO) and existing skill-distillation baselines on ALFWorld, WebShop, and Search-based QA across model scales (Qwen2.5-3B, 7B, Qwen3-1.7B), with gains in success rate, sample efficiency, and cross-domain generalization.
  • No inference-time overhead: Skills are used only during training; at inference, the policy acts from the ordinary interaction history without analyzer calls, skill retrieval, or privileged context.

Introduction and Theoretical Foundation

Background and Motivation

  • Large language models are increasingly deployed as interactive agents for long-horizon tasks (embodied control, web navigation, search-augmented QA). Reinforcement learning (RL) is a natural post-training paradigm, with outcome-based methods like GRPO providing stable, critic-free optimization over on-policy rollouts.
  • Sparse reward problem: Outcome-based RL offers only trajectory-level rewards, providing no guidance on which intermediate decisions should be reinforced or suppressed. This is especially severe in long-horizon interaction, where a single early mistake may derail the episode.
  • On-policy self-distillation provides dense token-level supervision by comparing the same policy under different contexts. Skill-conditioned variants use natural-language skills as privileged context, but often rely on external skill memories, which are costly to maintain and may be mismatched with the current policy's state distribution in multi-turn interaction.

Core Idea of OPID

OPID extracts hindsight skills directly from completed on-policy trajectories, avoiding external skill libraries. Skills are hierarchical:

  • Episode-level skills ($s^{\text{ep}}_\tau$): Capture global workflows or failure-avoidance rules for the entire trajectory.
  • Step-level skills ($s^{\text{step}}_{\tau,t}$): Capture local decision knowledge at critical timesteps (e.g., avoiding repeated invalid actions, selecting the next object to inspect).

A critical-first routing mechanism selects step-level skills when critical decisions are identified and falls back to episode-level skills otherwise. The routed skill is injected into the interaction history, and the old policy re-scores the same sampled response under both original and skill-augmented contexts, producing a token-level self-distillation advantage.


Methodology

Problem Formulation

The agentic task is modeled as a partially observable Markov decision process $(\mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \gamma)$. At timestep $t$, the agent maintains an interaction history $h_t$ and generates a response $y_t \sim \pi_\theta(\cdot \mid h_t)$. A completed trajectory is $\tau = \{ (o_t, y_t, r_t) \}_{t=0}^{T-1}$, with outcome score $R(\tau)$.

Following GRPO, for each task prompt $q$, a group of $N$ trajectories is sampled from the current policy: $\mathcal{G}_q = \{ \tau^{(1)}, \ldots, \tau^{(N)} \}$.

On-Policy Skill Extraction

An LLM-based analyzer $\mathcal{A}$ maps a trajectory $\tau$ to:

$$\mathcal{A}(\tau) = \left( s^{\text{ep}}_\tau, \{ s^{\text{step}}_{\tau,t} \}_{t \in \mathcal{C}_\tau} \right)$$

where $\mathcal{C}_\tau$ is the set of critical timesteps identified by the analyzer.

Critical-First Skill Routing

For trajectory $\tau$ and timestep $t$, the routed skill is:

$$s_{\tau,t} = \begin{cases} s^{\text{step}}_{\tau,t}, & \text{if } t \in \mathcal{C}_\tau, \\ s^{\text{ep}}_\tau, & \text{otherwise}. \end{cases}$$
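
The routing rule above amounts to a dictionary lookup with a fallback. The following is a minimal sketch, not the authors' implementation: the `HindsightSkills` container, the `route_skill` function, and the example skill strings are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HindsightSkills:
    """Analyzer output for one trajectory (hypothetical container)."""

    episode_skill: str
    # Maps each critical timestep t in C_tau to its step-level skill text.
    step_skills: Dict[int, str] = field(default_factory=dict)


def route_skill(skills: HindsightSkills, t: int) -> str:
    """Critical-first routing: use the step-level skill if t is critical,
    otherwise fall back to the episode-level skill."""
    return skills.step_skills.get(t, skills.episode_skill)


# Illustrative skills only; real skills would come from the LLM analyzer.
skills = HindsightSkills(
    episode_skill="Locate the target object before visiting the destination.",
    step_skills={3: "Do not retry an action the environment just rejected."},
)
routed = [route_skill(skills, t) for t in range(5)]
```

Here only timestep 3 is marked critical, so it receives the step-level skill and every other timestep receives the episode-level one.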

Skill-Conditioned Self-Distillation

Let $\tilde{h}_{\tau,t} = H(h_{\tau,t}, s_{\tau,t})$ be the skill-augmented history. The old policy $\pi_{\theta_{\text{old}}}$ scores the same response $y_{\tau,t}$ under both contexts:

  • $\ell^{\text{old}}_{\tau,t,\ell} = \log \pi_{\theta_{\text{old}}}(y_{\tau,t,\ell} \mid h_{\tau,t}, y_{\tau,t,<\ell})$
  • $\ell^{\text{skill}}_{\tau,t,\ell} = \log \pi_{\theta_{\text{old}}}(y_{\tau,t,\ell} \mid \tilde{h}_{\tau,t}, y_{\tau,t,<\ell})$

The skill-based self-teacher advantage (masked by the valid response token mask $m_{\tau,t,\ell}$) is:

$$A^{\text{skill}}_{\tau,t,\ell} = \left( \ell^{\text{skill}}_{\tau,t,\ell} - \ell^{\text{old}}_{\tau,t,\ell} \right) m_{\tau,t,\ell}$$
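
As a rough illustration of the advantage above, the sketch below takes per-token log-probabilities that are assumed to have been computed already (by scoring the same response twice with the old policy) and applies the masked difference. The function name and the example numbers are hypothetical; obtaining the log-probabilities from a model is outside its scope.

```python
from typing import List


def skill_advantage(
    logp_skill: List[float],
    logp_old: List[float],
    mask: List[int],
) -> List[float]:
    """Token-level skill advantage: (log p_skill - log p_old) * mask.

    All three lists are indexed by token position within one response.
    """
    if not (len(logp_skill) == len(logp_old) == len(mask)):
        raise ValueError("all inputs must have the same length")
    return [(ls - lo) * m for ls, lo, m in zip(logp_skill, logp_old, mask)]


# Made-up log-probabilities for a 4-token response; the last token is padding.
logp_old = [-2.0, -1.5, -0.75, -3.0]
logp_skill = [-1.0, -1.5, -1.25, -0.5]
mask = [1, 1, 1, 0]
adv = skill_advantage(logp_skill, logp_old, mask)
```

A positive entry means the skill-augmented context makes that sampled token more likely, a negative entry means less likely, and masked positions contribute nothing.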

Policy Optimization with Skill Advantage

For each rollout group $\mathcal{G}_q$, the group mean and standard deviation of outcome rewards are computed:

$$\mu_q = \text{mean}(\{ R(\tau') \mid \tau' \in \mathcal{G}_q \}), \quad \sigma_q = \text{std}(\{ R(\tau') \mid \tau' \in \mathcal{G}_q \})$$

The episode-relative advantage is:

$$A^{\text{ep}}_\tau = \frac{R(\tau) - \mu_q}{\sigma_q}, \quad \tau \in \mathcal{G}_q$$

Broadcast to tokens: $A^{\text{ep}}_{\tau,t,\ell} = A^{\text{ep}}_\tau \, m_{\tau,t,\ell}$.
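
A small sketch of the group normalization and broadcast may help. Two details below are assumptions rather than facts from the source: the population standard deviation is used (the text does not say population or sample), and a small `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
import statistics
from typing import List


def episode_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage (R - mean) / std for one rollout group.

    Assumptions: population std, plus eps in the denominator for stability.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(adv: float, mask: List[int]) -> List[float]:
    """Copy one trajectory-level advantage onto every valid response token."""
    return [adv * m for m in mask]


# A group of N = 8 rollouts with binary success rewards (illustrative values).
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advs = episode_advantages(group_rewards)
token_advs = broadcast_to_tokens(advs[0], [1, 1, 0])
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, and every valid token in a trajectory shares the same value.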

The final OPID advantage:

$$A^{\text{OPID}}_{\tau,t,\ell} = A^{\text{ep}}_{\tau,t,\ell} + \lambda_{\text{skill}} A^{\text{skill}}_{\tau,t,\ell}$$
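
To see how the two terms interact at the reported setting $\lambda_{\text{skill}} = 0.001$, consider the sketch below. The function name and the numeric inputs are illustrative assumptions; only the formula and the value of $\lambda_{\text{skill}}$ come from the text.

```python
from typing import List


def opid_advantage(
    a_ep: List[float],
    a_skill: List[float],
    lambda_skill: float = 0.001,
) -> List[float]:
    """Per-token OPID advantage: A_ep + lambda_skill * A_skill."""
    if len(a_ep) != len(a_skill):
        raise ValueError("a_ep and a_skill must have the same length")
    return [e + lambda_skill * s for e, s in zip(a_ep, a_skill)]


# A failed trajectory (episode advantage -1 on every token) whose tokens
# receive different skill advantages (made-up numbers).
a_ep = [-1.0, -1.0, -1.0]
a_skill = [1.0, 0.0, -0.5]
combined = opid_advantage(a_ep, a_skill)
```

With these inputs every combined value stays close to -1: the skill term nudges individual tokens up or down without changing the sign set by the outcome reward, consistent with outcome optimization remaining the primary objective.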

The policy is optimized with the clipped PPO objective:

$$\mathcal{L}_{\text{policy}}(\theta) = -\mathbb{E}_{\tau,t,\ell} \left[ \min\left( \rho_{\tau,t,\ell}(\theta) A^{\text{OPID}}_{\tau,t,\ell}, \; \text{clip}(\rho_{\tau,t,\ell}(\theta), 1-\epsilon, 1+\epsilon) A^{\text{OPID}}_{\tau,t,\ell} \right) \right] + \beta \mathcal{L}_{\text{KL}}(\theta)$$

where $\rho_{\tau,t,\ell}(\theta) = \exp\left(\log \pi_\theta(y_{\tau,t,\ell} \mid h_{\tau,t}, y_{\tau,t,<\ell}) - \log \pi_{\theta_{\text{old}}}(y_{\tau,t,\ell} \mid h_{\tau,t}, y_{\tau,t,<\ell})\right)$.
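
The clipped surrogate can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the training code: the expectation is approximated by a mean over valid tokens, the $\beta \mathcal{L}_{\text{KL}}$ term is omitted, and no gradients are computed (a real implementation would use an autodiff framework).

```python
import math
from typing import List


def clipped_policy_loss(
    logp_new: List[float],
    logp_old: List[float],
    adv: List[float],
    mask: List[int],
    eps: float = 0.2,
) -> float:
    """Negative clipped surrogate, averaged over valid tokens.

    Assumption: the KL penalty term is left out of this sketch.
    """
    total, count = 0.0, 0
    for ln, lo, a, m in zip(logp_new, logp_old, adv, mask):
        if not m:
            continue  # skip padding / non-response tokens
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * a, clipped * a)
        count += 1
    return -total / max(count, 1)


# Ratio of about 2 with a positive advantage: the gain is capped at 1 + eps.
loss_up = clipped_policy_loss([math.log(2.0)], [0.0], [1.0], [1])
# Same ratio with a negative advantage: the unclipped (more pessimistic) term wins.
loss_down = clipped_policy_loss([math.log(2.0)], [0.0], [-1.0], [1])
```

The two calls show the asymmetry of the clipping: increasing the probability of a positively rewarded token is credited only up to the factor $1+\epsilon$, while increasing the probability of a negatively rewarded token is penalized in full.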


Empirical Validation / Results

Experimental Setup

  • Benchmarks: ALFWorld (embodied household), WebShop (e-commerce), Search-based QA (search-augmented question answering on NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle).
  • Baselines: Vanilla, Skill-Prompt*, GRPO, Skill-GRPO, Skill-GRPO*, GRPO+OPSD, Skill-SD, RLSD, SDAR. * indicates validation with skills.
  • Backbones: Qwen2.5-3B/7B-Instruct, Qwen3-1.7B-Instruct.
  • Implementation: Training for 150 steps, batch size 16 for ALFWorld/WebShop, 128 for Search-based QA, group size $N=8$, $\lambda_{\text{skill}}=0.001$, $\epsilon=0.2$.

Main Results (Table 1)

Performance comparison on ALFWorld (success rate %), Search-based QA (accuracy %), WebShop (Score / Succ. %). Best and second-best highlighted.

Key findings:

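  • OPID improves over GRPO in most model–domain combinations. E.g., on Qwen2.5-3B: ALFWorld +9.3 points (84.3 vs 75.0), Search-based QA +8.6 (45.0 vs 36.4), WebShop success rate +10.9 (74.2 vs 63.3). The one exception in the table is Search-based QA on Qwen3-1.7B (40.4 vs 40.8).
  • OPID matches or surpasses strong hybrid baselines (SDAR, RLSD, Skill-GRPO*) in aggregate, though not in every cell of the table: e.g., SDAR and Skill-GRPO* score higher on WebShop with Qwen2.5-7B.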
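  • OPID outperforms Skill-GRPO (without inference skills) by large margins, showing it internalizes skills rather than depending on them at inference.

| Backbone | Method | ALFWorld Avg | Search Avg | WebShop Score | WebShop Succ. |
|---|---|---|---|---|---|
| Qwen2.5-3B | GRPO | 75.0 | 36.4 | 79.8 | 63.3 |
| Qwen2.5-3B | Skill-GRPO* | 80.5 | 36.1 | 76.3 | 66.4 |
| Qwen2.5-3B | SDAR | 84.4 | 43.4 | 85.0 | 68.0 |
| Qwen2.5-3B | OPID | 84.3 | 45.0 | 85.0 | 74.2 |
| Qwen2.5-7B | GRPO | 81.2 | 42.0 | 80.9 | 72.6 |
| Qwen2.5-7B | Skill-GRPO* | 88.3 | 47.5 | 87.0 | 81.2 |
| Qwen2.5-7B | SDAR | 85.9 | 49.0 | 89.4 | 82.8 |
| Qwen2.5-7B | OPID | 90.0 | 49.2 | 85.3 | 79.7 |
| Qwen3-1.7B | GRPO | 46.1 | 40.8 | 67.3 | 38.3 |
| Qwen3-1.7B | SDAR | 53.9 | 41.9 | 76.8 | 58.6 |
| Qwen3-1.7B | OPID | 58.9 | 40.4 | 79.6 | 64.8 |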
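
The per-metric gains of OPID over GRPO can be recomputed directly from the GRPO and OPID rows of the table. The snippet below is only an arithmetic check; the dictionary layout and names are hypothetical, and the numbers are copied from the table.

```python
METRICS = ("ALFWorld Avg", "Search Avg", "WebShop Score", "WebShop Succ.")

# GRPO and OPID rows copied from the results table, keyed by backbone.
RESULTS = {
    "Qwen2.5-3B": {"GRPO": (75.0, 36.4, 79.8, 63.3), "OPID": (84.3, 45.0, 85.0, 74.2)},
    "Qwen2.5-7B": {"GRPO": (81.2, 42.0, 80.9, 72.6), "OPID": (90.0, 49.2, 85.3, 79.7)},
    "Qwen3-1.7B": {"GRPO": (46.1, 40.8, 67.3, 38.3), "OPID": (58.9, 40.4, 79.6, 64.8)},
}


def gains_over_grpo(results):
    """Return OPID minus GRPO for each backbone and metric, rounded to 0.1."""
    return {
        backbone: {
            metric: round(opid - grpo, 1)
            for metric, opid, grpo in zip(METRICS, rows["OPID"], rows["GRPO"])
        }
        for backbone, rows in results.items()
    }


gains = gains_over_grpo(RESULTS)
# Cells where OPID scores below GRPO.
regressions = [(b, m) for b, row in gains.items() for m, d in row.items() if d < 0]
```

On these rows the only negative gain is Search Avg on Qwen3-1.7B, which is why the improvement is described as holding in most, not all, model–domain combinations.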

Training Dynamics

  • Figure 3 shows OPID diverging from GRPO mid-training and maintaining higher success rate while reducing average episode length (15-16 steps vs 17-18 steps), indicating more direct action sequences.

Sample Efficiency (Figure 4)

  • OPID consistently outperforms GRPO under reduced training data fractions. With 60% data, OPID reaches 71.9 success (close to GRPO full data 75.0); with 80% data, OPID exceeds full-data GRPO (78.9 vs 75.0). Absolute gains range from +9.3 to +20.3 points.

Cross-Domain Generalization (Figure 5)

  • On the ALFWorld unseen split, OPID achieves 78.6% success vs 70.9% for GRPO, with the largest gains on the Look (+26.7 points) and Heat (+18.5 points) task types.

Ablation Studies

  • Hierarchical skills (Table 2): Removing episode-level skills drops ALFWorld avg from 84.3 to 74.1; removing step-level skills drops to 79.1. Both levels are complementary.
  • Critical-first routing (Table 3): Without routing (superimposing both skills), ALFWorld avg drops from 84.3 to 77.5, confirming the importance of selective routing.

Theoretical Analysis (Appendix A)

  • Proposition 1: The unclipped OPID skill loss decomposes as $\mathcal{L}^{\text{unclip}}_{\text{skill}}(\theta) = \lambda_{\text{skill}} [\mathcal{L}_{\text{RKL}}(\theta) - D_b(\theta)]$, showing it is a relative-KL loss locally equivalent to reverse-KL distillation at the behavior policy.
  • Proposition 2: On-policy occupancy matching eliminates outer context-distribution mismatch for distillation.
  • Proposition 3: Critical-first routing recovers the oracle candidate-teacher selection under perfect criticality detection, with degradation controlled by detector error.
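
The decomposition in Proposition 1 can be motivated by a short calculation. The sketch below fixes one context and token position, drops the indices and the mask, and writes $\pi_b = \pi_{\theta_{\text{old}}}(\cdot \mid h)$ for the behavior distribution and $\pi_s = \pi_{\theta_{\text{old}}}(\cdot \mid \tilde{h})$ for the skill-conditioned teacher. Reading $\mathcal{L}_{\text{RKL}}$ as $\mathrm{KL}(\pi_\theta \,\|\, \pi_s)$ and $D_b$ as $\mathrm{KL}(\pi_\theta \,\|\, \pi_b)$ is an assumption made here; the appendix definitions may differ in detail.

```latex
\begin{align}
\mathcal{L}^{\text{unclip}}_{\text{skill}}(\theta)
  &= -\lambda_{\text{skill}} \,
     \mathbb{E}_{y \sim \pi_b}\!\left[
       \frac{\pi_\theta(y)}{\pi_b(y)}
       \bigl( \log \pi_s(y) - \log \pi_b(y) \bigr)
     \right] \\
  &= -\lambda_{\text{skill}} \,
     \mathbb{E}_{y \sim \pi_\theta}\!\left[
       \log \pi_s(y) - \log \pi_b(y)
     \right] \\
  &= \lambda_{\text{skill}} \left(
       \mathbb{E}_{y \sim \pi_\theta}\!\left[ \log \frac{\pi_\theta(y)}{\pi_s(y)} \right]
       - \mathbb{E}_{y \sim \pi_\theta}\!\left[ \log \frac{\pi_\theta(y)}{\pi_b(y)} \right]
     \right) \\
  &= \lambda_{\text{skill}} \left[
       \mathrm{KL}(\pi_\theta \,\|\, \pi_s) - \mathrm{KL}(\pi_\theta \,\|\, \pi_b)
     \right].
\end{align}
```

The second line uses importance weighting (assuming $\pi_b(y) > 0$ wherever $\pi_\theta(y) > 0$), and the third adds and subtracts $\mathbb{E}_{y \sim \pi_\theta}[\log \pi_\theta(y)]$. At $\theta = \theta_{\text{old}}$ the second KL term is at its minimum, so its gradient vanishes and the loss gradient coincides with that of the reverse-KL term, which is one way to read "locally equivalent at the behavior policy".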

Theoretical and Practical Implications

  • Theoretical insight: OPID provides a principled way to convert trajectory-derived skills into dense token-level supervision that complements outcome rewards. The loss is locally equivalent to reverse-KL distillation, ensuring the policy is shaped toward the skill-conditioned teacher without drifting from the behavior policy.
  • Practical significance: OPID eliminates the need for external skill libraries, retrieval mechanisms, or privileged context at inference time. The distribution-matched hindsight supervision improves sample efficiency, reduces repetitive/invalid actions, and enables better cross-domain generalization.
  • Behavioral improvement: OPID agents learn more direct action sequences (shorter episode lengths) and avoid hallucinated targets or object substitution errors, as shown in qualitative examples (Figure 6).

Conclusion

OPID is an on-policy skill distillation framework that turns completed agent trajectories into hierarchical hindsight supervision (episode-level and step-level skills). By using critical-first routing and combining skill-based self-distillation with outcome-based RL, OPID provides dense, distribution-matched token-level guidance without external skill libraries or inference-time overhead. Experiments across embodied, web, and search-based benchmarks demonstrate consistent improvements in performance, sample efficiency, and robustness.

Future directions: Evaluate in broader interactive environments (OdysseyArena, WebArena, VisualWebArena), enrich skill structure with higher-level reasoning abstractions, and improve training efficiency with speculative decoding methods.
