Summary (Overview)
- Bridging action representation: Proposes using relative wrist translation in the head-camera frame as a shared action space between humans and robots, avoiding noisy rotation estimates and mismatched contact patterns.
- Interleaved action tokens: A π₀-like VLA model with attention masking handles heterogeneous action components (3D-wrist, 6DoF end-effector, gripper) across data sources.
- Three-stage training: Pre-training on 600+ hours of human actions (Stage I), human-robot co-training with pick-and-place data (Stage II), and few-shot robot post-training (Stage III).
- Empirical results: The bridging action transfers manipulation skills to 15 bi-manual tasks more effectively than 6DoF human actions, improves data efficiency for few-shot robot data, and scales with large-scale human-only pre-training.
- Loss alignment: Human-only pre-training on wrist translations accelerates convergence of 6DoF end-effector and gripper losses during co-training, showing that the bridging objective aligns with the executable robot action space.
Introduction and Theoretical Foundation
The paper addresses the challenge of transferring manipulation skills from human action data to a bi-manual robot with parallel grippers. Human action data is cheap, abundant, and diverse, making it promising for scaling robot learning. However, prior work treats humans as just another 6DoF embodiment, suffering from:
- Noisy wrist rotations from hand-pose estimators.
- Fundamentally different contact patterns between human fingers and parallel grippers, making wrist rotations less semantically meaningful.
The key insight is that both humans and robots act upon what they perceive, so wrist translation in the head-camera frame serves as a shared, embodiment-agnostic action signal. This "bridging action" is:
- Physically meaningful under a shared observation perspective.
- Robust to noisy rotation estimates.
- Embodiment-agnostic by construction.
Methodology
4.1 Motion Bridging Action Representation
Why not 6DoF? Wrist rotations from human data are noisy and misaligned with gripper semantics.
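Bridging action (relative wrist translation in head-camera frame): Given the wrist pose in the world frame at time t and the head-camera pose, the wrist pose is first mapped into the camera frame. The bridging action over a multi-step future window is then the relative translation of the wrist in that frame: only the translation component of each camera-frame pose is kept, and the rotation is discarded. Concatenating both arms gives the bi-manual bridging action.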
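The construction above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the camera pose is a camera-to-world rotation matrix plus a position, that all future wrist positions are expressed in the head-camera frame of the current step, and that "relative" means subtracting the current wrist position. The function names `world_to_camera` and `bridging_action` are hypothetical.

```python
from typing import List, Sequence

Vec3 = List[float]
Mat3 = List[List[float]]


def world_to_camera(point_w: Sequence[float], cam_rot: Mat3, cam_pos: Sequence[float]) -> Vec3:
    """Express a world-frame point in the camera frame: R^T (p - c)."""
    d = [point_w[i] - cam_pos[i] for i in range(3)]
    # Multiplying by R^T is the same as dotting d with each column of R.
    return [sum(cam_rot[r][c] * d[r] for r in range(3)) for c in range(3)]


def bridging_action(wrist_pos_w: Sequence[Sequence[float]], cam_rot: Mat3,
                    cam_pos: Sequence[float], horizon: int) -> List[Vec3]:
    """Relative wrist translations over `horizon` future steps.

    wrist_pos_w: world-frame wrist positions for steps t, t+1, ..., t+horizon.
    cam_rot, cam_pos: head-camera pose at step t (assumed camera-to-world).
    """
    current = world_to_camera(wrist_pos_w[0], cam_rot, cam_pos)
    action = []
    for k in range(1, horizon + 1):
        future = world_to_camera(wrist_pos_w[k], cam_rot, cam_pos)
        # Translation only: wrist rotation never enters the action.
        action.append([future[i] - current[i] for i in range(3)])
    return action
```

For a bi-manual action, the same function would be called once per wrist and the two results concatenated.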
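Robot 6DoF end-effector action (relative pose in wrist frame): The relative end-effector pose is expressed in the wrist frame and converted to Cartesian coordinates and Euler angles, yielding three translation and three rotation values per arm.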
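As a rough illustration of a relative pose in the wrist frame, the sketch below computes the transform from the current end-effector pose to a future one and flattens it to translation plus Euler angles. The Euler convention (R = Rz(yaw) Ry(pitch) Rx(roll)) and all function names are assumptions made for this example; the summary does not state which convention the paper uses.

```python
import math
from typing import List

Mat3 = List[List[float]]
Vec3 = List[float]


def rot_z(angle: float) -> Mat3:
    """Rotation about the z-axis, used here only to build example poses."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transpose(m: Mat3) -> Mat3:
    return [[m[c][r] for c in range(3)] for r in range(3)]


def matmul(a: Mat3, b: Mat3) -> Mat3:
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)] for r in range(3)]


def matvec(m: Mat3, v: Vec3) -> Vec3:
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]


def rotation_to_euler(rot: Mat3) -> Vec3:
    """Return [roll, pitch, yaw] for R = Rz(yaw) Ry(pitch) Rx(roll).

    Assumes the pose is away from gimbal lock (|pitch| < 90 degrees).
    """
    pitch = -math.asin(max(-1.0, min(1.0, rot[2][0])))
    roll = math.atan2(rot[2][1], rot[2][2])
    yaw = math.atan2(rot[1][0], rot[0][0])
    return [roll, pitch, yaw]


def ee_action(rot_now: Mat3, pos_now: Vec3, rot_next: Mat3, pos_next: Vec3) -> Vec3:
    """6DoF action: the future pose expressed in the current wrist frame."""
    rot_inv = transpose(rot_now)  # inverse of a rotation matrix is its transpose
    rel_rot = matmul(rot_inv, rot_next)
    rel_pos = matvec(rot_inv, [pos_next[i] - pos_now[i] for i in range(3)])
    return rel_pos + rotation_to_euler(rel_rot)
```

Unlike the bridging action, this representation depends on wrist rotation, which is why it is reserved for robot data where rotation is measured reliably.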
Gripper action: a binary signal.
Unified action space: the concatenation of the bridging, 6DoF end-effector, and gripper actions. Different data sources supervise only the components that are reliably available:
Table 1: Per-embodiment action supervision
| Data source | Bridging (wrist translation) | 6DoF end-effector | Gripper |
|---|---|---|---|
| In-the-wild human action data (EgoDex + out-sourced) | ✓ | – | – |
| In-lab human action data | ✓ | – | ✓ |
| Robot tele-operation | ✓ | ✓ | ✓ |
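One way to read Table 1 is as a per-source loss mask. The sketch below assumes the three supervision columns correspond, in order, to the bridging action, the 6DoF end-effector action, and the gripper action; the source keys, component names, and `masked_loss` helper are hypothetical.

```python
from typing import Dict

# Which action components each data source supervises (assumed column order).
SUPERVISED: Dict[str, Dict[str, bool]] = {
    "in_the_wild_human": {"bridging": True, "ee_6dof": False, "gripper": False},
    "in_lab_human": {"bridging": True, "ee_6dof": False, "gripper": True},
    "robot_teleop": {"bridging": True, "ee_6dof": True, "gripper": True},
}


def masked_loss(per_component_loss: Dict[str, float], source: str) -> float:
    """Sum only the component losses that this data source supervises."""
    mask = SUPERVISED[source]
    return sum(loss for name, loss in per_component_loss.items() if mask[name])
```

With this layout, every sample contributes to the bridging loss, while only robot tele-operation contributes to the 6DoF loss.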
4.2 VLA Model with Interleaved Action Sequence
Architecture: A π₀-like vision-language-action model that takes a language instruction and observations (head camera plus two wrist cameras) as input. A pre-trained VLM processes the vision-language tokens, and an Action Transformer with flow matching generates the actions.
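Interleaved action tokens: Action tokens for the different components are interleaved so that bridging tokens precede the 6DoF end-effector tokens, with attention masking to handle components that are missing for a given data source. The ordering enables explicit knowledge transfer from bridging to 6DoF tokens via attention.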
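The masking idea can be illustrated with a small boolean mask builder. This is a sketch under stated assumptions, not the paper's masking scheme: it assumes a component order of bridging, then 6DoF, then gripper, lets each token attend only to components at or before its own position in that order, and hides components that the current sample does not provide. `COMPONENT_RANK` and `build_attention_mask` are hypothetical names.

```python
from typing import Iterable, List

# Assumed component order; bridging comes first so later tokens can read it.
COMPONENT_RANK = {"bridging": 0, "ee_6dof": 1, "gripper": 2}


def build_attention_mask(token_types: List[str], available: Iterable[str]) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    present = set(available)
    mask = []
    for q_type in token_types:
        row = [
            # A key is visible if its component exists in this sample and
            # does not come after the query's component in the assumed order.
            k_type in present and COMPONENT_RANK[k_type] <= COMPONENT_RANK[q_type]
            for k_type in token_types
        ]
        mask.append(row)
    return mask
```

Under these assumptions, 6DoF tokens always see the bridging tokens, while bridging tokens never depend on components that human data lacks.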
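Flow-matching objective: For a sampled flow time and a noise sample, the model denoises a noised action sequence (an interpolation between the noise and the ground-truth actions) by predicting its velocity. Inference uses Euler integration from the noise sample to the final action sequence with a fixed step size.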
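A minimal sketch of flow matching with Euler sampling follows. It assumes the common convention in which flow time 0 is pure noise and flow time 1 is the clean action, with a linear interpolation between them; the summary does not preserve the paper's exact formulation, and the default of 10 integration steps is a placeholder rather than a reported value.

```python
from typing import Callable, List

Vector = List[float]


def noisy_sample(action: Vector, noise: Vector, tau: float) -> Vector:
    """Linear interpolation: tau = 0 gives noise, tau = 1 gives the action."""
    return [tau * a + (1.0 - tau) * e for a, e in zip(action, noise)]


def target_velocity(action: Vector, noise: Vector) -> Vector:
    """Time derivative of the interpolation above, constant in tau."""
    return [a - e for a, e in zip(action, noise)]


def flow_matching_loss(predicted: Vector, action: Vector, noise: Vector) -> float:
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(action, noise)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn: Callable[[Vector, float], Vector],
                 noise: Vector, num_steps: int = 10) -> Vector:
    """Integrate the predicted velocity from tau = 0 to tau = 1."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real model `velocity_fn` would be the Action Transformer conditioned on the vision-language tokens; with a perfect velocity predictor, Euler integration recovers the clean action from the noise sample.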
Vision-language co-training: Co-trained with standard next-token prediction loss on vision-language data.
4.3 Training Strategies
- Stage I: Human-only pre-training – 600+ hours of human actions (EgoDex, out-sourced, in-lab). Only the bridging action is supervised.
- Stage II: Human-robot co-training – Co-train with 72 hours of robot pick-and-place data and task-specific human actions (~3 hrs/task, 15 tasks). On robot data, the bridging action is randomly added to, or substituted for, the 6DoF end-effector action to bind bridging to executable actions.
- Stage III: Few-shot robot post-training – 10 additional robot trajectories per task.
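The data budget implied by the three stages can be tallied with simple arithmetic. The sketch below only restates the figures given above; the `StageBudget` structure is hypothetical, "600+" is recorded as a lower bound, and `None` marks quantities the summary does not report.

```python
from dataclasses import dataclass
from typing import Optional

NUM_TASKS = 15  # number of evaluation tasks reported in the summary


@dataclass(frozen=True)
class StageBudget:
    name: str
    human_hours: Optional[float]        # None when no figure is reported
    robot_hours: Optional[float]
    robot_trajectories: Optional[int]


STAGES = (
    # "600+" hours of human data, stored as a lower bound; no robot data.
    StageBudget("I: human-only pre-training", 600.0, 0.0, None),
    # Roughly 3 hours of human data per task, plus 72 hours of pick-and-place.
    StageBudget("II: human-robot co-training", 3.0 * NUM_TASKS, 72.0, None),
    # 10 additional robot trajectories per task.
    StageBudget("III: few-shot robot post-training", None, None, 10 * NUM_TASKS),
)
```

Under these figures, Stage II uses about 45 hours of task-specific human data, and Stage III adds 150 robot trajectories in total.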
Empirical Validation / Results
5.1 Evaluation Setups
15 bi-manual manipulation tasks grouped by object (microwave, drawer, mug/cup, others). Two distinct test scenes per task, 8 trials total. Metrics: success rate and average progress (fine-grained per-task scoring, see Fig. 4 in paper).
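The two metrics can be computed from per-trial records as in the sketch below. It assumes each trial is scored with a progress value in [0, 1] and a boolean success flag; the actual per-task scoring rubric is defined in the paper's Fig. 4 and is not reproduced here. The `summarize` helper is hypothetical.

```python
from typing import List, Tuple

Trial = Tuple[float, bool]  # (progress in [0, 1], success flag)


def summarize(trials: List[Trial]) -> Tuple[float, float]:
    """Return (average progress %, success rate %) over a list of trials."""
    n = len(trials)
    progress = 100.0 * sum(p for p, _ in trials) / n
    success = 100.0 * sum(1 for _, ok in trials if ok) / n
    return progress, success
```

With 8 trials per task, each success moves the per-task success rate by 12.5 percentage points, which matches the granularity of the values reported in the tables below.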
5.2 Main Results (Fig. 5 & Fig. 6)
- Training only on pick-and-place robot data (green) yields very low performance (overall progress 21%, success 10%).
- Co-training with human actions via bridging action (orange) substantially improves performance (progress 45%, success 22%).
- Adding large-scale human-only pre-training (blue) further improves (progress 60%, success 38%).
- Few-shot robot post-training (purple) gives highest performance (progress 72%, success 55%).
5.3 Comparison to 6DoF Human Actions (Table 2)
| Human Actions | Microwave Prog/Succ | Drawer Prog/Succ | Mug/Cup Prog/Succ | Other Prog/Succ | Overall Prog/Succ |
|---|---|---|---|---|---|
| 6DoF | 25.00 / 4.17 | 55.00 / 31.25 | 28.13 / 0.00 | 49.17 / 33.33 | 34.67 / 12.50 |
| Bridging (Ours) | 38.02 / 25.00 | 49.06 / 31.25 | 48.13 / 3.13 | 50.00 / 37.50 | 44.58 / 22.50 |
Qualitatively, 6DoF human actions lead to distorted wrist poses, while bridging actions produce natural, task-aligned poses (Fig. 7 & Fig. 8).
5.4 Post-Training Data Efficiency (Table 3)
| Model | Overall Prog (%) | Overall Succ (%) |
|---|---|---|
| Stage III only (no human pretrain) | 53.79 | 35.83 |
| Stage I + III (with human pretrain) | 71.21 | 55.00 |
Human-only pre-training substantially improves few-shot post-training efficiency.
5.5 Ablation of Bridging Objective on Robot Data (Table 4)
| Robot Actions | Microwave Prog/Succ | Drawer Prog/Succ | Mug/Cup Prog/Succ | Other Prog/Succ | Overall Prog/Succ |
|---|---|---|---|---|---|
| w/o bridging action | 35.73 / 10.42 | 39.38 / 12.50 | 39.38 / 0.00 | 48.13 / 33.33 | 39.67 / 12.50 |
| w/ bridging action | 64.58 / 45.83 | 56.88 / 43.75 | 52.50 / 15.63 | 61.67 / 50.00 | 59.75 / 38.33 |
Supervising the bridging action on robot data is essential.
5.6 Loss Alignment (Fig. 9)
Human-only pre-training (400k iterations) yields lower training losses for the 6DoF end-effector and gripper actions during co-training, showing that the bridging objective landscape aligns with the executable robot action space.
5.7 Action Consistency (Fig. 10)
Visualization confirms that predicted bridging actions and 6DoF end-effector actions align closely across diverse tasks.
5.8 Upper Bound Analysis (Table 5)
Treating task-specific robot demonstrations as "human" data (no observation gap, less noise):
| Model | Microwave Prog/Succ | Drawer Prog/Succ | Mug/Cup Prog/Succ | Other Prog/Succ | Overall Prog/Succ |
|---|---|---|---|---|---|
| Default (Ours) | 64.58 / 45.83 | 56.88 / 43.75 | 52.50 / 15.63 | 61.67 / 50.00 | 59.75 / 38.33 |
| Upper Bound | 68.75 / 54.17 | 75.94 / 62.50 | 81.25 / 53.13 | 71.25 / 58.33 | 73.54 / 55.83 |
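The remaining gap to the upper bound can be read off per task group with a short calculation. The numbers below are copied from Table 5; the dictionary layout and variable names are only for illustration.

```python
# (progress %, success %) per task group, copied from Table 5.
DEFAULT = {
    "Microwave": (64.58, 45.83),
    "Drawer": (56.88, 43.75),
    "Mug/Cup": (52.50, 15.63),
    "Other": (61.67, 50.00),
    "Overall": (59.75, 38.33),
}
UPPER_BOUND = {
    "Microwave": (68.75, 54.17),
    "Drawer": (75.94, 62.50),
    "Mug/Cup": (81.25, 53.13),
    "Other": (71.25, 58.33),
    "Overall": (73.54, 55.83),
}

# Gap in percentage points between the upper bound and the default model.
gaps = {
    group: (
        round(UPPER_BOUND[group][0] - DEFAULT[group][0], 2),
        round(UPPER_BOUND[group][1] - DEFAULT[group][1], 2),
    )
    for group in DEFAULT
}

# Task group (excluding the overall row) with the largest progress gap.
largest_gap_group = max(
    (g for g in gaps if g != "Overall"), key=lambda g: gaps[g][0]
)
```

The largest gap is in the Mug/Cup group (28.75 points of progress, 37.5 points of success), while the Microwave group is already within about 4 points of progress of the upper bound.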
5.9 Failure Cases
Failures occur in tasks requiring precise rotational adjustments (e.g., "insert straw into cup," "open drawer"). The robot shows task intent but fails at critical steps due to the translation-only design.
Theoretical and Practical Implications
- Theoretical: Provides evidence that a translation-only bridging action is an effective signal for transferring manipulation skills across embodiments with fundamentally different end-effectors (fingers vs. parallel grippers), although the failure cases on rotation-critical tasks show that it is not sufficient on its own. The shared observation frame (head camera) reduces embodiment-specific noise.
- Practical: Enables scalable robot learning by leveraging abundant, cheap human action data, with task-specific robot demonstrations needed only for few-shot post-training (10 trajectories per task). The three-stage pipeline (pre-train, co-train, post-train) offers a practical recipe for building generalist manipulation policies.
- Scalability: The bridging representation scales with the amount of human data, validated by 600+ hours of pre-training and consistent improvements.
- Limitations: Translation-only actions limit performance on rotation-critical tasks; future work should incorporate limited, reliable rotation cues.
Conclusion
The paper demonstrates that relative wrist translation in the head-camera frame serves as an effective bridging action for transferring manipulation skills from humans to bi-manual robots. The proposed interleaved action token architecture with attention masking handles heterogeneous action components across data sources. Real-world experiments on 15 tasks show that the translation-only bridging action outperforms 6DoF human actions, scales with large-scale human pre-training, and improves data efficiency for few-shot robot fine-tuning. Key limitations include struggles with rotation-critical tasks and thin object grasping. Future directions include incorporating limited rotation cues and scaling to more diverse robot actions to further narrow the embodiment gap.