Summary (Overview)
- Proposes OPID (On-Policy Skill Distillation): A framework that extracts hierarchical hindsight skills (episode-level and step-level) from completed on-policy trajectories and uses them for dense token-level self-distillation in agentic reinforcement learning (RL).
- Critical-first skill routing: Selects step-level skills at critical timesteps and falls back to episode-level skills otherwise, providing appropriate granularity of guidance.
- Combines skill-based self-distillation with outcome-based RL: The token-level skill advantage is added to the group-relative episode advantage to form the final advantage, preserving outcome optimization as the primary objective.
- Strong empirical results: OPID outperforms outcome-only RL (GRPO) in most settings and is competitive with or better than existing skill-distillation baselines on ALFWorld, WebShop, and Search-based QA across model scales (Qwen2.5-3B/7B, Qwen3-1.7B), with gains in success rate, sample efficiency, and cross-domain generalization.
- No inference-time overhead: Skills are used only during training; at inference, the policy acts from the ordinary interaction history without analyzer calls, skill retrieval, or privileged context.
Introduction and Theoretical Foundation
Background and Motivation
- Large language models are increasingly deployed as interactive agents for long-horizon tasks (embodied interaction, web navigation, search-augmented QA). Reinforcement learning (RL) is a natural post-training paradigm, with outcome-based methods such as GRPO providing stable, critic-free optimization from on-policy rollouts.
- Sparse reward problem: Outcome-based RL offers only trajectory-level rewards, providing no guidance on which intermediate decisions should be reinforced or suppressed. This is especially severe in long-horizon interaction, where a single early mistake may derail the episode.
- On-policy self-distillation provides dense token-level supervision by comparing the same policy under different contexts. Skill-conditioned variants use natural-language skills as privileged context, but often rely on external skill memories, which are costly to maintain and may be mismatched with the current policy's state distribution in multi-turn interaction.
Core Idea of OPID
OPID extracts hindsight skills directly from completed on-policy trajectories, avoiding external skill libraries. Skills are hierarchical:
- Episode-level skills ($s^{\text{ep}}$): Capture global workflows or failure-avoidance rules for the entire trajectory.
- Step-level skills ($s^{\text{step}}_{t}$): Capture local decision knowledge at critical timesteps (e.g., avoiding repeated invalid actions, selecting the next object to inspect).
A critical-first routing mechanism selects step-level skills when critical decisions are identified and falls back to episode-level skills otherwise. The routed skill is injected into the interaction history, and the old policy re-scores the same sampled response under both original and skill-augmented contexts, producing a token-level self-distillation advantage.
Methodology
Problem Formulation
The agentic task is modeled as a partially observable Markov decision process (POMDP). At timestep $t$, the agent maintains an interaction history $h_t$ and generates a response $y_t$. A completed trajectory $\tau = \{(h_t, y_t)\}_{t=1}^{T}$ receives an outcome score $R(\tau)$.
Following GRPO, for each task prompt $x$, a group of $G$ trajectories is sampled from the current policy: $\{\tau_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$.
On-Policy Skill Extraction
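An LLM-based analyzer $\mathcal{A}$ maps a trajectory $\tau_i$ to an episode-level skill $s_i^{\text{ep}}$ and a set of step-level skills $s_{i,t}^{\text{step}}$:
$$\mathcal{A}(\tau_i) = \left(s_i^{\text{ep}},\ \{s_{i,t}^{\text{step}}\}_{t \in \mathcal{C}_i}\right)$$
where $\mathcal{C}_i$ is the set of critical timesteps identified by the analyzer.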
Critical-First Skill Routing
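For trajectory $\tau_i$ and timestep $t$, the routed skill is the step-level skill $s_{i,t}^{\text{step}}$ when $t$ lies in the critical set $\mathcal{C}_i$, and the episode-level skill $s_i^{\text{ep}}$ otherwise:
$$\tilde{s}_{i,t} = \begin{cases} s_{i,t}^{\text{step}}, & t \in \mathcal{C}_i \\ s_i^{\text{ep}}, & \text{otherwise} \end{cases}$$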
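The routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `HindsightSkills` container, the `route_skill` and `augment_history` helpers, the injection template, and the example skill strings are all hypothetical names and values chosen for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HindsightSkills:
    """Analyzer output for one completed trajectory (hypothetical container)."""
    episode_skill: str
    # Maps each critical timestep index to its step-level skill text.
    step_skills: Dict[int, str] = field(default_factory=dict)


def route_skill(skills: HindsightSkills, t: int) -> str:
    """Critical-first routing: step-level skill if t is critical, else episode-level."""
    if t in skills.step_skills:
        return skills.step_skills[t]
    return skills.episode_skill


def augment_history(history: str, skill: str) -> str:
    """Inject the routed skill into the history (the template is an assumption)."""
    return f"{history}\n[Hindsight skill] {skill}"


# Illustrative skills only; timestep 3 is the single critical step.
skills = HindsightSkills(
    episode_skill="Locate the target object before visiting the destination.",
    step_skills={3: "Do not repeat an action the environment just rejected."},
)
routed = [route_skill(skills, t) for t in range(5)]
```

Keeping the critical timesteps as dictionary keys means the fallback to the episode-level skill needs no separate criticality flag.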
Skill-Conditioned Self-Distillation
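Let $\tilde{h}_{i,t} = (h_{i,t}, \tilde{s}_{i,t})$ be the skill-augmented history, i.e. the original history $h_{i,t}$ with the routed skill $\tilde{s}_{i,t}$ injected. The old policy scores the same response $y_{i,t}$, token by token (index $k$), under both contexts:
$$\ell_{i,t,k} = \log \pi_{\theta_{\text{old}}}\left(y_{i,t,k} \mid h_{i,t}, y_{i,t,<k}\right), \qquad \tilde{\ell}_{i,t,k} = \log \pi_{\theta_{\text{old}}}\left(y_{i,t,k} \mid \tilde{h}_{i,t}, y_{i,t,<k}\right)$$
The skill-based self-teacher advantage (masked by the valid response token mask $m_{i,t,k}$) is:
$$A^{\text{skill}}_{i,t,k} = m_{i,t,k}\left(\tilde{\ell}_{i,t,k} - \ell_{i,t,k}\right)$$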
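As a rough sketch of this step, the snippet below computes a masked per-token advantage from two lists of old-policy log-probabilities. It assumes the advantage is the log-probability gain under the skill-augmented context; the function name `skill_advantage` and all numbers are illustrative, not taken from the paper.

```python
from typing import List


def skill_advantage(
    logp_skill: List[float],
    logp_base: List[float],
    mask: List[int],
) -> List[float]:
    """Masked token-level gain in old-policy log-prob when the skill is in context."""
    if not (len(logp_skill) == len(logp_base) == len(mask)):
        raise ValueError("all inputs must have one entry per token")
    return [m * (ls - lb) for ls, lb, m in zip(logp_skill, logp_base, mask)]


# Hypothetical per-token log-probs of one sampled response under the old policy.
logp_base = [-2.0, -1.5, -0.7, -3.0]    # original history
logp_skill = [-1.0, -1.5, -0.9, -0.5]   # skill-augmented history
mask = [1, 1, 1, 0]                     # last position is not a valid response token

adv = skill_advantage(logp_skill, logp_base, mask)
```

Tokens that the skill makes more likely get a positive advantage, tokens it makes less likely get a negative one, and masked positions contribute nothing.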
Policy Optimization with Skill Advantage
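For each rollout group $\{\tau_i\}_{i=1}^{G}$, the group mean and standard deviation of outcome rewards are computed:
$$\mu = \frac{1}{G}\sum_{i=1}^{G} R(\tau_i), \qquad \sigma = \sqrt{\frac{1}{G}\sum_{i=1}^{G}\left(R(\tau_i) - \mu\right)^2}$$
The episode-relative advantage is:
$$A^{\text{ep}}_i = \frac{R(\tau_i) - \mu}{\sigma}$$
Broadcast to tokens: $A^{\text{ep}}_{i,t,k} = A^{\text{ep}}_i$ for every token $k$ of every timestep $t$ in trajectory $i$.
The final OPID advantage adds the token-level skill advantage $A^{\text{skill}}_{i,t,k}$:
$$A_{i,t,k} = A^{\text{ep}}_{i,t,k} + A^{\text{skill}}_{i,t,k}$$
The policy is optimized with the clipped PPO objective:
$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i,t,k} m_{i,t,k}} \sum_{i,t,k} m_{i,t,k}\, \min\left(\rho_{i,t,k}(\theta)\, A_{i,t,k},\ \operatorname{clip}\left(\rho_{i,t,k}(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_{i,t,k}\right)\right]$$
where $\rho_{i,t,k}(\theta) = \dfrac{\pi_{\theta}\left(y_{i,t,k} \mid h_{i,t}, y_{i,t,<k}\right)}{\pi_{\theta_{\text{old}}}\left(y_{i,t,k} \mid h_{i,t}, y_{i,t,<k}\right)}$ is the importance ratio, $m_{i,t,k}$ is the valid response token mask, and $\epsilon$ is the clip range.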
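The pieces of this section can be put together in a small pure-Python sketch: group-normalize outcome rewards, broadcast and add the skill advantage, then apply the clipped surrogate per token. The `eps` stabilizer, the `weight` parameter (1.0 gives the plain sum), the clip range of 0.2, and all numeric inputs are assumptions made for illustration rather than values reported in the paper.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize outcome rewards within one rollout group (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps avoids division by zero when every reward in the group is identical.
    return [(r - mean) / (std + eps) for r in rewards]


def opid_advantage(a_ep: float, a_skill: List[float], weight: float = 1.0) -> List[float]:
    """Broadcast the episode advantage to tokens and add the skill advantage.

    `weight` is an assumption of this sketch; weight=1.0 is the plain sum.
    """
    return [a_ep + weight * a for a in a_skill]


def clipped_surrogate(ratio: float, adv: float, clip_eps: float = 0.2) -> float:
    """Per-token clipped PPO term: min(r * A, clip(r, 1 - e, 1 + e) * A)."""
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * adv, clipped * adv)


rewards = [1.0, 0.0, 0.0, 1.0]                # hypothetical binary outcomes, G = 4
a_ep = group_relative_advantages(rewards)      # about [+1, -1, -1, +1]
a_tok = opid_advantage(a_ep[0], [1.0, 0.0, -0.2])
terms = [clipped_surrogate(1.5, a) for a in a_tok]  # ratio above 1 + clip_eps
```

Because the episode advantage is shared by every token of a trajectory, the skill term is the only part of the signal that differs from token to token.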
Empirical Validation / Results
Experimental Setup
- Benchmarks: ALFWorld (embodied household), WebShop (e-commerce), Search-based QA (search-augmented question answering on NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle).
- Baselines: Vanilla, Skill-Prompt*, GRPO, Skill-GRPO, Skill-GRPO*, GRPO+OPSD, Skill-SD, RLSD, SDAR. * indicates validation with skills.
- Backbones: Qwen2.5-3B/7B-Instruct, Qwen3-1.7B-Instruct.
- Implementation: Training for 150 steps, batch size 16 for ALFWorld/WebShop, 128 for Search-based QA, group size , , .
Main Results (Table 1)
Performance comparison on ALFWorld (success rate %), Search-based QA (accuracy %), WebShop (Score / Succ. %). Best and second-best highlighted.
Key findings:
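- OPID improves over GRPO in most model–domain combinations. E.g., on Qwen2.5-3B: ALFWorld +9.3 points (84.3 vs 75.0), Search-based QA +8.6 (45.0 vs 36.4), WebShop success rate +10.9 (74.2 vs 63.3). The one exception in the table is Search-based QA on Qwen3-1.7B, where OPID (40.4) trails GRPO (40.8).
- OPID matches or surpasses strong hybrid baselines (SDAR, RLSD, Skill-GRPO*) in most settings. On Qwen2.5-7B WebShop, however, both SDAR (89.4 / 82.8) and Skill-GRPO* (87.0 / 81.2) score higher than OPID (85.3 / 79.7), and SDAR leads on Qwen3-1.7B Search-based QA (41.9 vs 40.4).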
- OPID outperforms Skill-GRPO (without inference skills) by large margins, showing it internalizes skills rather than depending on them at inference.
| Method | ALFWorld Avg | Search Avg | WebShop Score | WebShop Succ. |
|---|---|---|---|---|
| Qwen2.5-3B | | | | |
| GRPO | 75.0 | 36.4 | 79.8 | 63.3 |
| Skill-GRPO* | 80.5 | 36.1 | 76.3 | 66.4 |
| SDAR | 84.4 | 43.4 | 85.0 | 68.0 |
| OPID | 84.3 | 45.0 | 85.0 | 74.2 |
| Qwen2.5-7B | | | | |
| GRPO | 81.2 | 42.0 | 80.9 | 72.6 |
| Skill-GRPO* | 88.3 | 47.5 | 87.0 | 81.2 |
| SDAR | 85.9 | 49.0 | 89.4 | 82.8 |
| OPID | 90.0 | 49.2 | 85.3 | 79.7 |
| Qwen3-1.7B | | | | |
| GRPO | 46.1 | 40.8 | 67.3 | 38.3 |
| SDAR | 53.9 | 41.9 | 76.8 | 58.6 |
| OPID | 58.9 | 40.4 | 79.6 | 64.8 |
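As a quick arithmetic check on the gains quoted in the key findings, the OPID and GRPO rows of the table can be differenced directly. The sketch below copies those rows verbatim; the dictionary layout and the `gains` helper are illustrative names, not part of the original work.

```python
# Rows copied from the table: (ALFWorld avg, Search avg, WebShop score, WebShop succ.)
table = {
    "Qwen2.5-3B": {"GRPO": (75.0, 36.4, 79.8, 63.3), "OPID": (84.3, 45.0, 85.0, 74.2)},
    "Qwen2.5-7B": {"GRPO": (81.2, 42.0, 80.9, 72.6), "OPID": (90.0, 49.2, 85.3, 79.7)},
    "Qwen3-1.7B": {"GRPO": (46.1, 40.8, 67.3, 38.3), "OPID": (58.9, 40.4, 79.6, 64.8)},
}


def gains(model: str) -> tuple:
    """OPID minus GRPO for each metric, rounded to one decimal place."""
    grpo, opid = table[model]["GRPO"], table[model]["OPID"]
    return tuple(round(o - g, 1) for o, g in zip(opid, grpo))


for model in table:
    print(model, gains(model))
```

The only negative entry is Search-based QA on Qwen3-1.7B, which is why the improvement over GRPO holds in most rather than all model–domain combinations.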
Training Dynamics
- Figure 3 shows OPID diverging from GRPO mid-training and then maintaining a higher success rate while reducing average episode length (15-16 steps vs 17-18 steps for GRPO), indicating more direct action sequences.
Sample Efficiency (Figure 4)
- OPID consistently outperforms GRPO under reduced training data fractions. With 60% of the data, OPID reaches 71.9% success (close to full-data GRPO at 75.0%); with 80% of the data, OPID exceeds full-data GRPO (78.9% vs 75.0%). Across data fractions, absolute gains over GRPO range from +9.3 to +20.3 points.
Cross-Domain Generalization (Figure 5)
- On the ALFWorld unseen split, OPID achieves 78.6% success vs 70.9% for GRPO, with large gains on the Look (+26.7) and Heat (+18.5) task types.
Ablation Studies
- Hierarchical skills (Table 2): Removing episode-level skills drops ALFWorld avg from 84.3 to 74.1; removing step-level skills drops to 79.1. Both levels are complementary.
- Critical-first routing (Table 3): Without routing (superimposing both skills), ALFWorld avg drops from 84.3 to 77.5, confirming the importance of selective routing.
Theoretical Analysis (Appendix A)
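- Proposition 1: The unclipped OPID skill loss decomposes as $\mathcal{L}_{\text{skill}}(\theta) = \mathrm{KL}\left(\pi_{\theta} \,\|\, \tilde{\pi}_{\theta_{\text{old}}}\right) - \mathrm{KL}\left(\pi_{\theta} \,\|\, \pi_{\theta_{\text{old}}}\right)$, where $\tilde{\pi}_{\theta_{\text{old}}}$ denotes the old policy conditioned on the skill-augmented history, showing it is a relative-KL loss locally equivalent to reverse-KL distillation at the behavior policy.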
- Proposition 2: On-policy occupancy matching eliminates outer context-distribution mismatch for distillation.
- Proposition 3: Critical-first routing recovers the oracle candidate-teacher selection under perfect criticality detection, with degradation controlled by detector error.
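One way to build intuition for Proposition 1 is a numeric check at a single token position. The sketch assumes the skill advantage is the log-probability difference between the skill-conditioned and plain old policy, in which case the unclipped importance-weighted loss equals KL(policy || skill-conditioned teacher) minus KL(policy || old policy). The three categorical distributions are made-up values, and the function names are illustrative.

```python
import math
from typing import List


def kl(p: List[float], q: List[float]) -> float:
    """KL(p || q) for categorical distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def unclipped_skill_loss(
    pi_theta: List[float], pi_old: List[float], pi_teacher: List[float]
) -> float:
    """-E_{y ~ pi_old}[ratio(y) * (log pi_teacher(y) - log pi_old(y))]."""
    return -sum(
        p_old * (p_theta / p_old) * (math.log(p_teach) - math.log(p_old))
        for p_theta, p_old, p_teach in zip(pi_theta, pi_old, pi_teacher)
    )


# Hypothetical next-token distributions over a three-token vocabulary.
pi_old = [0.5, 0.3, 0.2]        # behavior policy, original context
pi_teacher = [0.7, 0.2, 0.1]    # same policy, skill-augmented context
pi_theta = [0.6, 0.25, 0.15]    # policy being optimized

lhs = unclipped_skill_loss(pi_theta, pi_old, pi_teacher)
rhs = kl(pi_theta, pi_teacher) - kl(pi_theta, pi_old)
```

Here `lhs` and `rhs` agree up to floating-point error. At `pi_theta == pi_old` the second KL term has zero gradient, which is the sense in which the loss is locally a reverse-KL distillation toward the skill-conditioned teacher.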
Theoretical and Practical Implications
- Theoretical insight: OPID provides a principled way to convert trajectory-derived skills into dense token-level supervision that complements outcome rewards. The loss is locally equivalent to reverse-KL distillation, ensuring the policy is shaped toward the skill-conditioned teacher without drifting from the behavior policy.
- Practical significance: OPID eliminates the need for external skill libraries, retrieval mechanisms, or privileged context at inference time. The distribution-matched hindsight supervision improves sample efficiency, reduces repetitive/invalid actions, and enables better cross-domain generalization.
- Behavioral improvement: OPID agents learn more direct action sequences (shorter episode lengths) and avoid hallucinated targets or object substitution errors, as shown in qualitative examples (Figure 6).
Conclusion
OPID is an on-policy skill distillation framework that turns completed agent trajectories into hierarchical hindsight supervision (episode-level and step-level skills). By using critical-first routing and combining skill-based self-distillation with outcome-based RL, OPID provides dense, distribution-matched token-level guidance without external skill libraries or inference-time overhead. Experiments across embodied, web, and search-based benchmarks demonstrate consistent improvements in performance, sample efficiency, and robustness.
Future directions: Evaluate in broader interactive environments (OdysseyArena, WebArena, VisualWebArena), enrich skill structure with higher-level reasoning abstractions, and improve training efficiency with speculative decoding methods.