Summary (Overview)

  • Bridging action representation: Proposes using relative wrist translation in the head-camera frame as a shared action space between humans and robots, avoiding noisy rotation estimates and mismatched contact patterns.
  • Interleaved action tokens: A π₀-like VLA model with attention masking handles heterogeneous action components (3D-wrist, 6DoF end-effector, gripper) across data sources.
  • Three-stage training: Pre-training on 600+ hours of human actions (Stage I), human-robot co-training with pick-and-place data (Stage II), and few-shot robot post-training (Stage III).
  • Empirical results: The bridging action transfers manipulation skills to 15 bi-manual tasks more effectively than 6DoF human actions, improves data efficiency for few-shot robot data, and scales with large-scale human-only pre-training.
  • Loss alignment: Human-only pre-training on wrist translations accelerates convergence of 6DoF end-effector and gripper losses during co-training, showing that the bridging objective aligns with the executable robot action space.

Introduction and Theoretical Foundation

The paper addresses the challenge of transferring manipulation skills from human action data to a bi-manual robot with parallel grippers. Human action data is cheap, abundant, and diverse, making it promising for scaling robot learning. However, prior work treats humans as just another 6DoF embodiment, which causes two problems:

  1. Noisy wrist rotations from hand-pose estimators.
  2. Fundamentally different contact patterns between human fingers and parallel grippers, making wrist rotations less semantically meaningful.

The key insight is that both humans and robots act upon what they perceive, so wrist translation in the head-camera frame serves as a shared, embodiment-agnostic action signal. This "bridging action" is:

  • Physically meaningful under a shared observation perspective.
  • Robust to noisy rotation estimates.
  • Embodiment-agnostic by construction.

Methodology

4.1 Motion Bridging Action Representation

Why not 6DoF? Wrist rotations from human data are noisy and misaligned with gripper semantics.

Bridging action (relative wrist translation in head-camera frame): Let $W_w^t \in SE(3)$ denote the wrist pose in the world frame at time $t$, and $T_{w \leftarrow c}^t \in SE(3)$ the head-camera pose. Mapping the wrist pose into the camera frame $c_t$ yields $W_{c_t}^{t+i} = (T_{w \leftarrow c}^t)^{-1} W_w^{t+i}$. The bridging action over a $k$-step future window is:

$$a_{t+i}^{\text{3D-wrist}} = \Delta W_{3D} = \mathfrak{t}\big(W_{c_t}^{t+i}\big) - \mathfrak{t}\big(W_{c_t}^{t}\big), \quad i=1,\dots,k$$

where $\mathfrak{t}(\cdot)$ extracts the $3 \times 1$ translation component. Concatenating both arms gives $a_t^{\text{3D-wrist}} \in \mathbb{R}^{k \times 6}$.
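As a rough illustration of the definition above, the following NumPy sketch computes the bridging action from world-frame poses. The function names `bridging_action` and `make_pose`, the 4×4 homogeneous-matrix layout, and the toy poses are assumptions made for this example, not the authors' code.

```python
import numpy as np

def make_pose(R, p):
    """Build a 4x4 homogeneous transform from rotation R and translation p."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def bridging_action(wrist_poses_w, T_w_from_c, k):
    """Relative wrist translation expressed in the head-camera frame at time t.

    wrist_poses_w: (k+1, 4, 4) world-frame wrist poses W_w^t ... W_w^{t+k}
    T_w_from_c:    (4, 4) head-camera pose in the world frame at time t
    Returns (k, 3): t(W_{c_t}^{t+i}) - t(W_{c_t}^{t}) for i = 1..k.
    """
    T_c_from_w = np.linalg.inv(T_w_from_c)
    # Express every wrist pose in the camera frame c_t, which stays fixed at time t.
    wrist_poses_c = T_c_from_w @ wrist_poses_w[: k + 1]
    translations = wrist_poses_c[:, :3, 3]
    return translations[1:] - translations[0]

# Toy setup (hypothetical): camera rotated 90 degrees about z and offset in the world.
R_cam = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T_w_from_c = make_pose(R_cam, [1.0, 2.0, 3.0])

# Left wrist moves along world x, right wrist along world z.
left = np.stack([make_pose(np.eye(3), [0.1 * i, 0.0, 0.0]) for i in range(3)])
right = np.stack([make_pose(np.eye(3), [0.0, 0.0, 0.05 * i]) for i in range(3)])

a_left = bridging_action(left, T_w_from_c, k=2)
a_right = bridging_action(right, T_w_from_c, k=2)
a_3d_wrist = np.concatenate([a_left, a_right], axis=1)  # shape (k, 6)
```

Because only differences of translations are used, the camera's own position cancels out and only its orientation affects the result. The wrist rotation is never read, which is the point of the representation.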

Robot 6DoF end-effector action (relative pose in wrist frame):

$$a_{t+i}^{\text{6D-eef}} = \Delta W_{6D} = \big(W_w^t\big)^{-1} W_w^{t+i}, \quad i=1,\dots,k$$

This relative pose is converted to Cartesian coordinates and Euler angles, yielding $a_t^{\text{6D-eef}} \in \mathbb{R}^{k \times 12}$ for both arms.

Gripper action: binary signal $a_i^{\text{gripper}} \in \{0,1\}$ per arm, so $a_t^{\text{gripper}} \in \mathbb{R}^{k \times 2}$.
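The relative end-effector action can be sketched in the same style. The Euler convention below (R = Rz(yaw) Ry(pitch) Rx(roll)) is an assumption, since the summary does not state which convention the paper uses. The helper names and toy poses are likewise hypothetical.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix for a rotation of theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_pose(R, p):
    """Build a 4x4 homogeneous transform from rotation R and translation p."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def relative_pose_6d(wrist_poses_w):
    """(W_w^t)^{-1} W_w^{t+i} for i = 1..k as rows [x, y, z, roll, pitch, yaw]."""
    W_t_inv = np.linalg.inv(wrist_poses_w[0])
    rel = W_t_inv @ wrist_poses_w[1:]
    R = rel[:, :3, :3]
    xyz = rel[:, :3, 3]
    # Assumed convention: R = Rz(yaw) @ Ry(pitch) @ Rx(roll).
    roll = np.arctan2(R[:, 2, 1], R[:, 2, 2])
    pitch = -np.arcsin(np.clip(R[:, 2, 0], -1.0, 1.0))
    yaw = np.arctan2(R[:, 1, 0], R[:, 0, 0])
    return np.concatenate([xyz, np.stack([roll, pitch, yaw], axis=1)], axis=1)

# Toy trajectory (hypothetical): the wrist turns 30 degrees about z and shifts in world y.
poses = np.stack([
    make_pose(rot_z(np.pi / 2), [1.0, 0.0, 0.0]),
    make_pose(rot_z(np.pi / 2 + np.pi / 6), [1.0, 0.2, 0.0]),
])
one_arm = relative_pose_6d(poses)                   # shape (k, 6)
a_6d_eef = np.concatenate([one_arm, one_arm], axis=1)  # both arms: shape (k, 12)
a_gripper = np.array([[1, 0]])                      # one binary value per arm: shape (k, 2)
```

Note that the translation comes out in the wrist's own frame at time t (here a world-y shift becomes a wrist-x shift), unlike the bridging action, which is expressed in the head-camera frame.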

Unified action space: $a_t = (a_t^{\text{3D-wrist}}, a_t^{\text{6D-eef}}, a_t^{\text{gripper}})$. Different data sources supervise only the components that are reliably available:

Table 1: Per-embodiment action supervision

| Data source | $a^{\text{3D-wrist}}$ | $a^{\text{6D-eef}}$ | $a^{\text{gripper}}$ |
|---|---|---|---|
| In-the-wild human action data (EgoDex + out-sourced) | | | |
| In-lab human action data | | | |
| Robot tele-operation | | | |

4.2 VLA Model with Interleaved Action Sequence

Architecture: A $\pi_0$-like vision-language-action model $\pi_\theta(l, o_t)$ processes the language instruction $l$ and observations $o_t$ (head camera plus two wrist cameras). It uses a pre-trained VLM for vision-language tokens, followed by an Action Transformer with flow matching for action generation.

Interleaved action tokens: Action tokens are ordered as $a^{\text{3D-wrist}} \rightarrow a^{\text{6D-eef}} \rightarrow a^{\text{gripper}}$, with attention masking to handle missing components. The ordering enables explicit knowledge transfer from bridging to 6DoF tokens via attention.
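The summary does not give the exact masking pattern, so the sketch below shows only one plausible reading. It assumes each component group attends to itself and to earlier groups in the ordering, and that components missing from a sample are never used as keys. The function name and group labels are hypothetical.

```python
import numpy as np

GROUPS = ("3d_wrist", "6d_eef", "gripper")  # interleaving order from the text

def action_attention_mask(tokens_per_group, available):
    """Boolean mask M[q, k]: True where query token q may attend to key token k.

    Assumptions (not from the paper): attention is block-causal over the three
    groups, and groups absent from a sample are masked out as keys.
    """
    group_ids = np.repeat(np.arange(len(GROUPS)), tokens_per_group)
    present = np.array([g in available for g in GROUPS])[group_ids]
    # Block-causal over groups: key group index <= query group index.
    mask = group_ids[None, :] <= group_ids[:, None]
    # Drop keys that belong to missing components.
    mask &= present[None, :]
    return mask

# Human sample: only the bridging action is available.
mask_human = action_attention_mask(2, {"3d_wrist"})
# Robot sample: all three components are available.
mask_robot = action_attention_mask(2, set(GROUPS))
```

Under these assumptions the 6DoF tokens can always read the bridging tokens, which matches the stated goal of passing information from the bridging action to the executable action. Rows for missing groups still exist in the mask but would simply receive no loss.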

Flow-matching objective: For $\tau \in (0,1)$ and $\epsilon \sim \mathcal{N}(0,I)$, the model denoises $a_t^\tau = \tau \epsilon + (1-\tau)a_t$ by predicting the velocity $\hat{v}(a_t^\tau, o_t, l, \tau)$:

$$\mathcal{L}_{\text{FM}} = \big\| \hat{v}(a_t^\tau, o_t, l, \tau) - v^* \big\|_2^2$$

where $v^* = \epsilon - a_t$. Since $\tau = 1$ corresponds to pure noise under this interpolation, inference uses Euler integration from $\tau=1$ to $\tau=0$ with step size $\Delta\tau=0.2$ (five steps).
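A minimal NumPy sketch of this objective and sampler follows. Under the interpolation defined above, τ = 1 is pure noise and τ = 0 is the clean action, so the sampler steps from τ = 1 down to τ = 0. The function names, the window length of 4, and the oracle velocity function standing in for the trained network are all assumptions for illustration.

```python
import numpy as np

def noisy_action(a, eps, tau):
    """a^tau = tau * eps + (1 - tau) * a (tau = 1 is pure noise, tau = 0 is clean)."""
    return tau * eps + (1.0 - tau) * a

def flow_matching_loss(v_hat, a, eps):
    """Squared L2 distance between the predicted velocity and v* = eps - a."""
    return float(np.sum((v_hat - (eps - a)) ** 2))

def euler_sample(velocity_fn, eps, step=0.2):
    """Integrate da/dtau = v from tau = 1 (noise) down to tau = 0 (action)."""
    a_tau = eps.copy()
    n_steps = round(1.0 / step)
    for n in range(n_steps):
        tau = 1.0 - n * step
        a_tau = a_tau - step * velocity_fn(a_tau, tau)
    return a_tau

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 20))    # hypothetical window k = 4, action dim 6 + 12 + 2
eps = rng.normal(size=a.shape)

# Oracle velocity for this (a, eps) pair; a trained network would replace it.
def oracle_velocity(a_tau, tau):
    return eps - a

recovered = euler_sample(oracle_velocity, eps)
```

With the oracle velocity the path is a straight line, so five Euler steps of size 0.2 land on the clean action up to floating-point error. A learned velocity field would only approximate this.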

Vision-language co-training: The model is co-trained with a standard next-token prediction loss $\mathcal{L}_{\text{NTP}}$ on vision-language data.

4.3 Training Strategies

  • Stage I: Human-only pre-training – 600+ hours of human actions (EgoDex, out-sourced, in-lab). Only $\mathcal{L}_{\text{FM}}^{\text{3D-wrist}}$ is supervised.
  • Stage II: Human-robot co-training – Co-train with 72 hours of robot pick-and-place data and task-specific human actions (~3 hrs/task, 15 tasks). Randomly add or substitute $a^{\text{3D-wrist}}$ for $a^{\text{6D-eef}}$ on robot data to bind the bridging action to executable actions.
  • Stage III: Few-shot robot post-training – 10 additional robot trajectories per task.
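The Stage II rule of randomly adding or substituting the bridging action on robot data could be sketched as below. The probabilities `p_add` and `p_substitute` are hypothetical, since the summary gives no values. Keeping the gripper supervised in every robot sample is also an assumption.

```python
import random

def supervised_components(source, rng, p_add=0.25, p_substitute=0.25):
    """Pick which action components receive a flow-matching loss for one sample.

    p_add and p_substitute are hypothetical; the summary gives no values.
    Assumes the gripper is always supervised on robot data.
    """
    if source == "human":
        # Human data supervises only the bridging action.
        return {"3d_wrist"}
    if source != "robot":
        raise ValueError(f"unknown data source: {source!r}")
    u = rng.random()
    if u < p_substitute:
        # Substitute: the bridging action replaces the 6DoF target.
        return {"3d_wrist", "gripper"}
    if u < p_substitute + p_add:
        # Add: the bridging action is supervised alongside the 6DoF target.
        return {"3d_wrist", "6d_eef", "gripper"}
    # Default: plain robot supervision.
    return {"6d_eef", "gripper"}

rng = random.Random(0)
print(supervised_components("human", rng))
print(supervised_components("robot", rng))
```

The returned set would then decide which token groups are unmasked and which loss terms are summed for that sample.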

Empirical Validation / Results

5.1 Evaluation Setups

15 bi-manual manipulation tasks grouped by object (microwave, drawer, mug/cup, others). Each task is tested in two distinct scenes, for 8 trials per task in total. Metrics: success rate and average progress (fine-grained per-task scoring, see Fig. 4 in paper).
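The two metrics can be aggregated as in the short sketch below. It assumes 8 trials for each of the 15 tasks (120 trials overall) and per-trial progress scores normalized to [0, 1]. The success count and progress scores are made-up inputs, and the paper's actual per-task scoring rubric is not reproduced here.

```python
def success_rate(successes, trials):
    """Percentage of trials that fully completed the task."""
    return 100.0 * successes / trials

def average_progress(scores):
    """Mean per-trial progress as a percentage; scores are assumed to lie in [0, 1]."""
    return 100.0 * sum(scores) / len(scores)

TASKS, TRIALS_PER_TASK = 15, 8
total_trials = TASKS * TRIALS_PER_TASK

# Hypothetical inputs for illustration only.
overall_success = success_rate(27, total_trials)
example_progress = average_progress([1.0, 0.5, 0.0, 0.5])
```

Progress credits partially completed trials, so it is always at least as high as the success rate when a fully successful trial scores 1.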

5.2 Main Results (Fig. 5 & Fig. 6)

  • Training only on pick-and-place robot data (green) yields very low performance (overall progress 21%, success 10%).
  • Co-training with human actions via bridging action (orange) substantially improves performance (progress 45%, success 22%).
  • Adding large-scale human-only pre-training (blue) further improves (progress 60%, success 38%).
  • Few-shot robot post-training (purple) gives highest performance (progress 72%, success 55%).

5.3 Comparison to 6DoF Human Actions (Table 2)

| Human Actions | Microwave Prog/Succ (%) | Drawer Prog/Succ (%) | Mug/Cup Prog/Succ (%) | Other Prog/Succ (%) | Overall Prog/Succ (%) |
|---|---|---|---|---|---|
| $a^{\text{6D-eef}}$ | 25.00 / 4.17 | 55.00 / 31.25 | 28.13 / 0.00 | 49.17 / 33.33 | 34.67 / 12.50 |
| $a^{\text{3D-wrist}}$ (Ours) | 38.02 / 25.00 | 49.06 / 31.25 | 48.13 / 3.13 | 50.00 / 37.50 | 44.58 / 22.50 |

Qualitatively, 6DoF human actions lead to distorted wrist poses, while bridging actions produce natural, task-aligned poses (Fig. 7 & Fig. 8).

5.4 Post-Training Data Efficiency (Table 3)

| Model | Overall Prog (%) | Overall Succ (%) |
|---|---|---|
| Stage III only (no human pretrain) | 53.79 | 35.83 |
| Stage I + III (with human pretrain) | 71.21 | 55.00 |

Human-only pre-training substantially improves few-shot post-training efficiency.

5.5 Ablation of Bridging Objective on Robot Data (Table 4)

| Robot Actions | Microwave Prog/Succ (%) | Drawer Prog/Succ (%) | Mug/Cup Prog/Succ (%) | Other Prog/Succ (%) | Overall Prog/Succ (%) |
|---|---|---|---|---|---|
| w/o $a^{\text{3D-wrist}}$ | 35.73 / 10.42 | 39.38 / 12.50 | 39.38 / 0.00 | 48.13 / 33.33 | 39.67 / 12.50 |
| w/ $a^{\text{3D-wrist}}$ | 64.58 / 45.83 | 56.88 / 43.75 | 52.50 / 15.63 | 61.67 / 50.00 | 59.75 / 38.33 |

Supervising the bridging action on robot data is essential.

5.6 Loss Alignment (Fig. 9)

Human-only pre-training (400k iterations) yields lower training losses for $a^{\text{6D-eef}}$ and $a^{\text{gripper}}$ during co-training, suggesting that the bridging objective landscape aligns with the executable robot action space.

5.7 Action Consistency (Fig. 10)

Visualization confirms that predicted bridging actions $a^{\text{3D-wrist}}$ and 6DoF end-effector actions $a^{\text{6D-eef}}$ align closely across diverse tasks.

5.8 Upper Bound Analysis (Table 5)

Treating task-specific robot demonstrations as "human" data (no observation gap, less noise):

| Model | Microwave Prog/Succ (%) | Drawer Prog/Succ (%) | Mug/Cup Prog/Succ (%) | Other Prog/Succ (%) | Overall Prog/Succ (%) |
|---|---|---|---|---|---|
| Default (Ours) | 64.58 / 45.83 | 56.88 / 43.75 | 52.50 / 15.63 | 61.67 / 50.00 | 59.75 / 38.33 |
| Upper Bound | 68.75 / 54.17 | 75.94 / 62.50 | 81.25 / 53.13 | 71.25 / 58.33 | 73.54 / 55.83 |

5.9 Failure Cases

Failures occur in tasks requiring precise rotational adjustments (e.g., "insert straw into cup," "open drawer"). The robot shows task intent but fails at critical steps due to the translation-only design.

Theoretical and Practical Implications

  • Theoretical: Provides evidence that a translation-only bridging action is an effective signal for transferring manipulation skills across embodiments with fundamentally different end-effectors (fingers vs. parallel grippers), although it does not cover rotation-critical steps. Expressing actions in the shared observation frame (head camera) sidesteps the noisy rotation estimates of human data.
  • Practical: Enables scalable robot learning by leveraging abundant, cheap human action data, reducing the need for task-specific robot demonstrations to a few trajectories per task. The three-stage pipeline (pre-train, co-train, post-train) offers a practical recipe for building generalist manipulation policies.
  • Scalability: The bridging representation scales with the amount of human data, validated by 600+ hours of pre-training and consistent improvements.
  • Limitations: Translation-only actions limit performance on rotation-critical tasks; future work should incorporate limited, reliable rotation cues.

Conclusion

The paper demonstrates that relative wrist translation in the head-camera frame serves as an effective bridging action for transferring manipulation skills from humans to bi-manual robots. The proposed interleaved action token architecture with attention masking handles heterogeneous action components across data sources. Real-world experiments on 15 tasks show that the translation-only bridging action outperforms 6DoF human actions, scales with large-scale human pre-training, and improves data efficiency for few-shot robot fine-tuning. Key limitations include struggles with rotation-critical tasks and thin object grasping. Future directions include incorporating limited rotation cues and scaling to more diverse robot actions to further narrow the embodiment gap.
