Summary (Overview)
- New Formulation: Introduces Playful Agentic Robot Learning, where an embodied coding agent autonomously acquires reusable skills via self-directed play before downstream tasks are provided, rather than only learning from explicit instructions.
- RAT(_S) System: Proposes Robotics Agent Teams (RAT(_S)) that use a structured three-team architecture (Task Proposer, Execution Team, Memory-Management Team) to propose novel-but-learnable tasks, execute Code-as-Policy programs with per-step verification and retry, and distill successful behaviors into a persistent skill library.
- Significant Performance Gains: On LIBERO-PRO, RAT(_S) achieves a 20.6 percentage point improvement over the CaP-Agent0 baseline (43.8% vs. 23.2%); on MolmoSpaces, a 17.0 point improvement (38.0% vs. 21.0%).
- Transferable Skill Library: Skills learned during play in LIBERO-PRO transfer to RoboSuite (+8.9 pp) and to real-world tasks (+8.8 pp) when plugged into CaP-Agent0, without any fine-tuning.
- Curiosity-Driven Play is Key: Ablations show that random play under the same budget yields minimal gains, while curiosity-driven play (Goldilocks objective) substantially improves performance, and play and improved test-time execution are complementary.
Introduction and Theoretical Foundation
Background and Motivation: Current agentic robot systems generate Code-as-Policy programs, observe feedback, and revise behavior, but they remain task-driven – reusable skills are only acquired after explicit instructions. In contrast, natural intelligence (e.g., children) acquires skills through self-directed play before goals are specified, exploring novel yet learnable interactions near the boundary of competence. This has inspired developmental robotics and intrinsic motivation models.
Key Insight in the Code-as-Policy Era: Unlike earlier systems that explored fixed sensorimotor or goal spaces, coding agents can express exploratory goals in language, execute them as programs, inspect outcomes, and save successful behaviors as callable code. This makes play a practical mechanism for continual skill acquisition.
Problem Formulation: In a standard Code-as-Policy (CaP) framework, an agent receives environment context (c), primitive functions (f), and a language instruction (l), and synthesizes an executable program (\pi). In the play-time setting, the external task instruction (l) is removed; the agent autonomously proposes and practices self-generated tasks (\tau_t) in a play environment (E_\text{play}), with the goal of acquiring a skill library (\mathcal{L}) that improves downstream task solving.
The objective is to obtain a library (\mathcal{L} = \mathcal{L}_0 \cup \mathcal{L}_\text{learned}) (where (\mathcal{L}_0 \equiv f)) that improves performance on unseen test tasks over using only the initial primitives.
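One way to write this objective as an inequality is sketched below. The symbols (\mathcal{T}_\text{test}), (\mathrm{Succ}), and (\mathrm{Agent}) are assumed notation introduced here for illustration; they do not appear in the summary above and may differ from the paper's own formulation.

```latex
% Assumed notation (not from the source):
%   \mathcal{T}_\text{test}   : distribution over held-out test instructions l
%   \mathrm{Agent}(c, \cdot, l): the CaP agent that synthesizes a program
%   \mathrm{Succ}(\pi) \in \{0, 1\}: whether program \pi solves its task
\pi_{\mathcal{L}}(l) = \mathrm{Agent}(c, \mathcal{L}, l),
\qquad
\mathbb{E}_{l \sim \mathcal{T}_\text{test}}
  \big[ \mathrm{Succ}\big(\pi_{\mathcal{L}}(l)\big) \big]
\; > \;
\mathbb{E}_{l \sim \mathcal{T}_\text{test}}
  \big[ \mathrm{Succ}\big(\pi_{\mathcal{L}_0}(l)\big) \big],
\qquad
\mathcal{L} = \mathcal{L}_0 \cup \mathcal{L}_\text{learned}.
```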
Methodology
RAT(_S) is organized into three collaborating teams operating during play-time (see Figure 2 in the paper), following Algorithm 1:
1. Task Proposer Team: Driven by intrinsic motivation, the proposer generates a candidate pool of task descriptions conditioned on current scene context, skill library, and failure memory. Task selection uses the Goldilocks principle to favor tasks that are novel yet learnable:
[\tau_t = \arg\max_{\tau \in \mathcal{T}_t} \big[ N(\tau) \cdot F(\tau) \big]]
Where:
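- Object-Skill Novelty: ( N(\tau) = \frac{1}{|O(\tau) \times S(\tau)|} \sum_{(o,s) \in O(\tau) \times S(\tau)} \frac{1}{\sqrt{N(o,s) + 1}} ), where (N(o,s)) counts how often the object-skill pair ((o,s)) has been tried so far – encourages exploration of rarely-tried object-skill combinations.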
- Competence Frontier: ( F(\tau) = 4\,\bar{r}(\tau)\,(1 - \bar{r}(\tau)) ), where (\bar{r}(\tau) = \frac{1}{|S(\tau)|} \sum_{s \in S(\tau)} \hat{r}(s)) is the average Wilson-bounded success rate of required skills. This peaks at (\bar{r} \approx 0.5), the sweet spot of learnability.
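The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-based task format, and the toy numbers are all assumptions, and the per-skill rates in `r_hat` are taken as already Wilson-bounded rather than computed here.

```python
import math
from itertools import product

def novelty(objects, skills, attempt_counts):
    """Mean of 1/sqrt(count + 1) over all object-skill pairs of a task."""
    pairs = list(product(objects, skills))
    total = sum(1.0 / math.sqrt(attempt_counts.get(p, 0) + 1) for p in pairs)
    return total / len(pairs)

def frontier(skills, r_hat):
    """4 * r_bar * (1 - r_bar): 1 at r_bar = 0.5, 0 at r_bar = 0 or 1."""
    r_bar = sum(r_hat[s] for s in skills) / len(skills)
    return 4.0 * r_bar * (1.0 - r_bar)

def select_task(candidates, attempt_counts, r_hat):
    """Return the candidate task that maximizes novelty * frontier."""
    def score(task):
        n = novelty(task["objects"], task["skills"], attempt_counts)
        return n * frontier(task["skills"], r_hat)
    return max(candidates, key=score)

# Hypothetical play state: (object, skill) attempt counts and
# per-skill success estimates (assumed to be Wilson-bounded already).
attempt_counts = {("mug", "pick"): 8}
r_hat = {"pick": 0.9, "open": 0.5, "insert": 0.0}

candidates = [
    {"name": "pick up the mug", "objects": ["mug"], "skills": ["pick"]},
    {"name": "open the drawer", "objects": ["drawer"], "skills": ["open"]},
    {"name": "insert the nut", "objects": ["nut"], "skills": ["insert"]},
]

chosen = select_task(candidates, attempt_counts, r_hat)
print(chosen["name"])
```

In this toy state the well-practiced mug task scores low on both factors, the nut task has a frontier of zero because its required skill never succeeds, and the untried drawer task at a 0.5 success estimate is selected.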
2. Execution Team: A Write-Execute-Verify-Diagnose loop with three roles:
- Planning Agent: Produces an ordered plan with retrieved skills.
- Execution Agents (Policy Writer): Converts plan to robot-control code and runs it. A SubAgent can practice isolated sub-actions persistently.
- Verification Agents: Planner Verifier (physical grounding), Quality Checker (code safety), Goal Verifier (final success), Per-Step Verifier (step-level verdicts), Failure Diagnoser (summarizes failure mode and suggests corrections).
The team retries with feedback up to a budget; on success, the loop stops and behavior is distilled.
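The retry-with-feedback behavior can be sketched as a plain loop. Everything below is a hypothetical stand-in: `write_policy`, `execute`, `verify`, and `diagnose` are placeholder callables for the Policy Writer, the robot runtime, the Goal Verifier, and the Failure Diagnoser, and `max_attempts` is an assumed budget parameter. The planning step and the per-step verifiers are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Attempt:
    code: str
    success: bool
    diagnosis: Optional[str] = None

def run_with_retries(
    task: str,
    write_policy: Callable[[str, List[str]], str],
    execute: Callable[[str], str],
    verify: Callable[[str, str], bool],
    diagnose: Callable[[str, str, str], str],
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[Attempt]]:
    """Write, execute, verify; on failure, diagnose and retry with feedback."""
    feedback: List[str] = []
    history: List[Attempt] = []
    for _ in range(max_attempts):
        code = write_policy(task, feedback)
        outcome = execute(code)
        if verify(task, outcome):
            history.append(Attempt(code, True))
            return code, history  # success: stop and hand code to distillation
        note = diagnose(task, code, outcome)
        feedback.append(note)  # the next write sees every earlier diagnosis
        history.append(Attempt(code, False, note))
    return None, history  # budget exhausted

# Toy stand-ins (purely illustrative, no robot involved).
def toy_write(task, feedback):
    return "grasp(height='high')" if feedback else "grasp(height='low')"

def toy_execute(code):
    return "lifted" if "high" in code else "slipped"

def toy_verify(task, outcome):
    return outcome == "lifted"

def toy_diagnose(task, code, outcome):
    return "gripper closed too low on the object"

final_code, history = run_with_retries(
    "lift the cube", toy_write, toy_execute, toy_verify, toy_diagnose
)
print(final_code, len(history))
```

Returning the full attempt history alongside the final code is a design choice of this sketch: failed attempts and their diagnoses are what a failure memory would consume, while the successful code is what a skill library would store.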
3. Memory-Management Team: Maintains two persistent stores:
- Skill Library (\mathcal{L}): Stores code skills extracted from successful executions. Skills start as experimental, become verified (prioritized retrieval) after repeated success, or deprecated after repeated failure.
- Failure Memory (\mathcal{M}): Stores compact lessons from failed attempts (e.g., missing preconditions).
- Periodic Curation (every (K=5) iterations): Merges near-duplicate skills, removes redundant lessons, and proposes candidate helper functions from existing primitives.
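The skill lifecycle described above (experimental, then verified or deprecated) could look roughly like the following. The thresholds, the use of cumulative rather than consecutive counts, and the rule that deprecation takes precedence are all assumptions made for this sketch; the summary only says "repeated" success or failure.

```python
from dataclasses import dataclass
from typing import List

VERIFY_AFTER = 3      # hypothetical: successes needed to become "verified"
DEPRECATE_AFTER = 3   # hypothetical: failures needed to become "deprecated"

@dataclass
class Skill:
    name: str
    code: str
    status: str = "experimental"
    successes: int = 0
    failures: int = 0

    def record(self, success: bool) -> None:
        """Update counters and status after one use of the skill."""
        if success:
            self.successes += 1
        else:
            self.failures += 1
        # Assumption: deprecation wins if both thresholds are reached.
        if self.failures >= DEPRECATE_AFTER:
            self.status = "deprecated"
        elif self.successes >= VERIFY_AFTER:
            self.status = "verified"

def retrieve(library: List[Skill]) -> List[Skill]:
    """Drop deprecated skills and list verified ones first."""
    rank = {"verified": 0, "experimental": 1}
    usable = [s for s in library if s.status != "deprecated"]
    return sorted(usable, key=lambda s: rank[s.status])

# Toy library with placeholder code strings.
push = Skill("push_object", "def push_object(): ...")
pour = Skill("pour_liquid", "def pour_liquid(): ...")
pick = Skill("pick_object", "def pick_object(): ...")

for _ in range(3):
    pick.record(True)    # reaches the verified threshold
    pour.record(False)   # reaches the deprecated threshold

ordered = [s.name for s in retrieve([push, pour, pick])]
print(ordered)
```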
Evaluation: At test time, intrinsic exploration and memory update are disabled. The learned library can be used (1) as a plug-and-play addition to CaP-Agent0, or (2) within the full RAT(_S) execution system where the planner draws on learned skills.
Empirical Validation / Results
Benchmarks and Setup:
- LIBERO-PRO: 60 held-out tasks (Object, Goal, Spatial generalization splits, each with “Pos.” and “Task” perturbations), 10 trials each (600 total).
- MolmoSpaces: 10 tasks per category (Open, Close, Pick, Pick-and-Place), 10 trials each (400 total).
- RoboSuite: Held-out from play, 7 tasks × 50 trials each.
- Real-world: 2 tasks × 40 trials each.
- Baselines: OpenVLA, (\pi_0), (\pi_{0.5}), CaP-Agent0 (no play).
- Play: 50 iterations per environment with gemini-3.1pro-preview.
In-Domain Evaluation:
Table 1: LIBERO-PRO in-domain evaluation (success rates %)
| Method | Object Pos. | Object Task | Goal Pos. | Goal Task | Spatial Pos. | Spatial Task | Avg. |
|---|---|---|---|---|---|---|---|
| OpenVLA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| (\pi_0) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| (\pi_{0.5}) | 17.0 | 1.0 | 38.0 | 0.0 | 20.0 | 1.0 | 12.8 |
| CaP-Agent0 | 27.0 | 31.0 | 29.0 | 16.0 | 13.0 | 23.0 | 23.2 |
| RAT(_S) | 61.0 | 63.0 | 43.0 | 36.0 | 29.0 | 31.0 | 43.8 |
Table 2: MolmoSpaces in-domain evaluation (success rates %)
| Method | Open | Close | Pick | Pick-and-Place | Avg. |
|---|---|---|---|---|---|
| CaP-Agent0 | 14.0 | 36.0 | 23.0 | 11.0 | 21.0 |
| RAT(_S) | 20.0 | 73.0 | 37.0 | 22.0 | 38.0 |
RAT(_S) outperforms all baselines, with the largest gains on the Object splits in LIBERO-PRO (+34.0 pp on Object Pos. and +32.0 pp on Object Task over CaP-Agent0) and on the Close category in MolmoSpaces (+37.0 pp).
Cross-Environment Transfer:
Table 3: Cross-environment transfer to RoboSuite (CaP-Agent0 + RAT(_S) skills from LIBERO-PRO play)
| Task | CaP-Agent0 | + RAT(_S) Skills | (\Delta) |
|---|---|---|---|
| Cube lifting | 34/50 (68.0%) | 42/50 (84.0%) | +16.0 pp |
| Cube restacking | 17/50 (34.0%) | 23/50 (46.0%) | +12.0 pp |
| Cube stacking | 23/50 (46.0%) | 30/50 (60.0%) | +14.0 pp |
| Nut assembly | 0/50 (0.0%) | 0/50 (0.0%) | 0.0 pp |
| Spill wiping | 50/50 (100.0%) | 50/50 (100.0%) | 0.0 pp |
| Two-arm handover | 12/50 (24.0%) | 10/50 (20.0%) | -4.0 pp |
| Two-arm lifting | 5/50 (10.0%) | 17/50 (34.0%) | +24.0 pp |
| Average | 141/350 (40.3%) | 172/350 (49.1%) | +8.9 pp |
Notably, two-arm lifting (cross-embodiment) shows the largest gain (+24.0 pp). Real-world transfer: +8.8 pp on average (Table 3, bottom).
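The Average row in Table 3 pools trials across tasks rather than averaging rounded percentages. The short check below recomputes it from the per-task success counts in the table; only the variable names are introduced here.

```python
TRIALS_PER_TASK = 50

# (CaP-Agent0 successes, + RAT(_S) skills successes), copied from Table 3.
results = {
    "Cube lifting": (34, 42),
    "Cube restacking": (17, 23),
    "Cube stacking": (23, 30),
    "Nut assembly": (0, 0),
    "Spill wiping": (50, 50),
    "Two-arm handover": (12, 10),
    "Two-arm lifting": (5, 17),
}

total_trials = TRIALS_PER_TASK * len(results)
base_total = sum(base for base, _ in results.values())
skill_total = sum(skill for _, skill in results.values())

base_rate = round(100 * base_total / total_trials, 1)
skill_rate = round(100 * skill_total / total_trials, 1)
# The delta is taken on the pooled counts, not on the rounded rates.
delta = round(100 * (skill_total - base_total) / total_trials, 1)

per_task_delta = {
    task: 100 * (skill - base) / TRIALS_PER_TASK
    for task, (base, skill) in results.items()
}
best_task = max(per_task_delta, key=per_task_delta.get)

print(base_total, skill_total, total_trials)
print(base_rate, skill_rate, delta)
print(best_task, per_task_delta[best_task])
```

Subtracting the two rounded rates (49.1 and 40.3) would give 8.8, so the reported +8.9 pp corresponds to the difference computed before rounding.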
Ablations on Play Strategy and Test-Time System:
Table 4: LIBERO-PRO ablation (success rates %)
| Test-Time System | Play-Time Skills | Object Pos. | Object Task | Goal Pos. | Goal Task | Spatial Pos. | Spatial Task | Avg. |
|---|---|---|---|---|---|---|---|---|
| CaP-Agent0 | No Play | 27.0 | 31.0 | 29.0 | 16.0 | 13.0 | 23.0 | 23.2 |
| CaP-Agent0 | Random Play | 20.0 | 28.0 | 32.0 | 16.0 | 20.0 | 32.0 | 24.7 |
| CaP-Agent0 | Curious Play | 51.0 | 47.0 | 34.0 | 20.0 | 19.0 | 23.0 | 32.3 |
| RAT(_S) Exec. | No Play | 54.0 | 58.0 | 32.0 | 24.0 | 20.0 | 30.0 | 36.3 |
| RAT(_S) Exec. | Random Play | 54.0 | 46.0 | 34.0 | 44.0 | 24.0 | 28.0 | 38.3 |
| RAT(_S) Exec. | Curious Play | 60.0 | 60.0 | 48.0 | 38.0 | 30.0 | 30.0 | 44.3 |
Key findings: (a) Random play under CaP-Agent0 gives negligible gain (+1.5 pp), while Curious Play gives +9.1 pp. (b) Play and test-time execution are complementary: improving only execution gains +13.1 pp, improving only play gains +9.1 pp, combining both yields +21.1 pp.
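The gains quoted in the key findings can be reproduced from the per-split numbers in Table 4. The sketch below recomputes each row average and its gain over the CaP-Agent0 / No Play baseline; gains are taken between the rounded averages, which is what matches the reported figures.

```python
# Per-split success rates (%) from Table 4, keyed by
# (test-time system, play-time skills).
rows = {
    ("CaP-Agent0", "No Play"): [27.0, 31.0, 29.0, 16.0, 13.0, 23.0],
    ("CaP-Agent0", "Random Play"): [20.0, 28.0, 32.0, 16.0, 20.0, 32.0],
    ("CaP-Agent0", "Curious Play"): [51.0, 47.0, 34.0, 20.0, 19.0, 23.0],
    ("RAT_S Exec.", "No Play"): [54.0, 58.0, 32.0, 24.0, 20.0, 30.0],
    ("RAT_S Exec.", "Random Play"): [54.0, 46.0, 34.0, 44.0, 24.0, 28.0],
    ("RAT_S Exec.", "Curious Play"): [60.0, 60.0, 48.0, 38.0, 30.0, 30.0],
}

# Row averages, rounded to one decimal as in the table.
avg = {key: round(sum(vals) / len(vals), 1) for key, vals in rows.items()}

# Gain of every configuration over the no-play CaP-Agent0 baseline.
baseline = avg[("CaP-Agent0", "No Play")]
gain = {key: round(value - baseline, 1) for key, value in avg.items()}

for key in rows:
    print(key, avg[key], gain[key])
```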
Theoretical and Practical Implications
Theoretical: RAT(_S) bridges classical developmental robotics (intrinsic motivation, Goldilocks principle) with modern Code-as-Policy agents, showing that play can be formalized as a task-proposal and skill-acquisition process in language and code space. The curiosity-driven task selection balances novelty and learnability, yielding a self-directed curriculum that expands the skill library along the competence frontier.
Practical:
- Plug-and-Play Skill Library: The learned library transfers to other CaP agents (CaP-Agent0) without any fine-tuning, providing a practical module for improving agentic robot systems.
- Cross-Environment and Sim-to-Real Transfer: Skills acquired in simulation transfer to different simulators (LIBERO-PRO → RoboSuite) and to real hardware, albeit on simple tasks.
- Complementary Improvements: Play and better test-time execution are complementary, suggesting that both autonomous exploration and improved reasoning/verification are valuable.
Conclusion
RAT(_S) introduces Playful Agentic Robot Learning, where an embodied coding agent acquires reusable Code-as-Policy skills through self-directed play before downstream tasks. By proposing novel yet learnable tasks, executing programs with dense verification and retry, and distilling successes into a skill library, RAT(_S) substantially outperforms CaP-Agent0 and VLA baselines on LIBERO-PRO and MolmoSpaces. The learned skills transfer across simulation environments and to real-world tasks, demonstrating plug-and-play capability.
Limitations identified by the authors:
- Evaluation is primarily simulation-based; larger-scale physical deployment needed.
- Play constrained by diversity of available simulation environments.
- Improper skill reuse can hurt performance if retrieved skills do not fit the downstream task.
- High inference cost and heavy reliance on VLM verification.
- Bounded by primitive-level control APIs, limiting dexterous manipulation.
Future directions include better retrieval and context-aware selection, lighter feedback mechanisms, and richer low-level controllers for more complex manipulation.