Visual Summary | Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots

Summary (Overview)

Bridging action representation: Proposes using relative wrist translation in the head-camera frame as a shared action space between humans and robots, avoiding noisy rotation estimates and mismatched contact patterns.
Interleaved action tokens: A π₀-like VLA model with attention masking handles heterogeneous action components (3D-wrist, 6DoF end-effector, gripper) across data sources.
Three-stage training: Pre-training on 600+ hours of human actions (Stage I), human-robot co-training with pick-and-place data (Stage II), and few-shot robot post-training (Stage III).
Empirical results: The bridging action transfers manipulation skills to 15 bi-manual tasks more effectively than 6DoF human actions, improves data efficiency for few-shot robot data, and scales with large-scale human-only pre-training.
Loss alignment: Human-only pre-training on wrist translations accelerates convergence of 6DoF end-effector and gripper losses during co-training, showing that the bridging objective aligns with the executable robot action space.

Introduction and Theoretical Foundation

The paper addresses the challenge of transferring manipulation skills from human action data to a bi-manual robot with parallel grippers. Human action data is cheap, abundant, and diverse, making it promising for scaling robot learning. However, prior work treats humans as just another 6DoF embodiment, which causes two problems:

Noisy wrist rotations from hand-pose estimators.
Fundamentally different contact patterns between human fingers and parallel grippers, making wrist rotations less semantically meaningful.

The key insight is that both humans and robots act upon what they perceive, so wrist translation in the head-camera frame serves as a shared, embodiment-agnostic action signal. This "bridging action" is:

Physically meaningful under a shared observation perspective.
Robust to noisy rotation estimates.
Embodiment-agnostic by construction.

Methodology

4.1 Motion Bridging Action Representation

Why not 6DoF? Wrist rotations from human data are noisy and misaligned with gripper semantics.

Bridging action (relative wrist translation in head-camera frame): Let $W_w^t \in SE(3)$ denote the wrist pose in the world frame at time $t$, and $T_{w \leftarrow c}^t \in SE(3)$ the head-camera pose at time $t$. Mapping a future wrist pose into the camera frame $c_t$ yields $W_{c_t}^{t+i} = (T_{w \leftarrow c}^t)^{-1} W_w^{t+i}$. The bridging action over a $k$-step future window is:

$$a_{t+i}^{\text{3D-wrist}} = \Delta W_{3D} = \mathfrak{t}\big(W_{c_t}^{t+i}\big) - \mathfrak{t}\big(W_{c_t}^{t}\big), \quad i=1,\dots,k$$

where $\mathfrak{t}(\cdot)$ extracts the $3 \times 1$ translation component. Concatenating both arms gives $a_t^{\text{3D-wrist}} \in \mathbb{R}^{k \times 6}$.
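The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `bridging_action`, the 4×4 homogeneous-matrix layout, and the toy trajectory are all assumptions made for the example.

```python
import numpy as np

def bridging_action(wrist_world, cam_to_world_t):
    """Relative wrist translation expressed in the head-camera frame at time t.

    wrist_world:    (k+1, 4, 4) homogeneous wrist poses W_w^t, ..., W_w^{t+k}
    cam_to_world_t: (4, 4) head-camera pose T_{w<-c}^t at time t
    returns:        (k, 3); row i-1 is t(W_{c_t}^{t+i}) - t(W_{c_t}^t)
    """
    world_to_cam = np.linalg.inv(cam_to_world_t)
    wrist_cam = world_to_cam @ wrist_world   # matmul broadcasts over the k+1 poses
    trans = wrist_cam[:, :3, 3]              # translation component of each pose
    return trans[1:] - trans[0]

# Toy setup (hypothetical values): camera yawed 90 degrees about world z,
# wrist moving 0.1 m per step along world x.
k = 3
cam_to_world = np.eye(4)
cam_to_world[:3, :3] = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
cam_to_world[:3, 3] = [0.5, 0.0, 1.6]

wrist_poses = np.tile(np.eye(4), (k + 1, 1, 1))
wrist_poses[:, 0, 3] = 0.1 * np.arange(k + 1)

left = bridging_action(wrist_poses, cam_to_world)
right = bridging_action(wrist_poses, cam_to_world)   # same track reused for brevity
a_3d_wrist = np.concatenate([left, right], axis=1)   # both arms: shape (k, 6)
```

In this toy case, motion along world $+x$ shows up as motion along camera $-y$, and the camera's position drops out because the action is a difference of two translations in the same frame.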

Robot 6DoF end-effector action (relative pose in wrist frame):

$$a_{t+i}^{\text{6D-eef}} = \Delta W_{6D} = \big(W_w^t\big)^{-1} W_w^{t+i}, \quad i=1,\dots,k$$

The relative pose is converted to Cartesian coordinates and Euler angles (6 values per arm); concatenating both arms yields $a_t^{\text{6D-eef}} \in \mathbb{R}^{k \times 12}$.
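A minimal sketch of this relative-pose action follows. The Euler convention is not specified here, so the roll-pitch-yaw (ZYX) extraction below is an assumption, as are the helper names `make_pose`, `rotation_to_euler`, and `eef_action` and the toy trajectory.

```python
import numpy as np

def make_pose(yaw, translation):
    """Homogeneous transform with a rotation of `yaw` radians about z."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def rotation_to_euler(R):
    """(roll, pitch, yaw) for R = Rz(yaw) @ Ry(pitch) @ Rx(roll), |pitch| < 90 deg.

    Assumed convention; the source only says "Euler angles".
    """
    roll = np.arctan2(R[2, 1], R[2, 2])
    pitch = -np.arcsin(np.clip(R[2, 0], -1.0, 1.0))
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return np.array([roll, pitch, yaw])

def eef_action(pose_world):
    """pose_world: (k+1, 4, 4) poses W_w^t..W_w^{t+k}; returns (k, 6) xyz + Euler."""
    rel = np.linalg.inv(pose_world[0]) @ pose_world[1:]   # (W_w^t)^{-1} W_w^{t+i}
    xyz = rel[:, :3, 3]
    euler = np.stack([rotation_to_euler(T[:3, :3]) for T in rel])
    return np.concatenate([xyz, euler], axis=1)

# Toy trajectory (hypothetical): each step moves 5 cm along the local x axis
# and yaws 0.1 rad, starting from an arbitrary world pose.
k = 3
start = make_pose(0.5, [1.0, 2.0, 3.0])
poses = np.stack([start @ make_pose(0.1 * i, [0.05 * i, 0.0, 0.0])
                  for i in range(k + 1)])

one_arm = eef_action(poses)
a_6d_eef = np.concatenate([one_arm, one_arm], axis=1)   # both arms: shape (k, 12)
```

Because the action is expressed relative to the pose at time $t$, the arbitrary starting pose cancels and each row contains only the local displacement and yaw.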

Gripper action: a binary open/close signal per arm at each step, $a_{t+i}^{\text{gripper}} \in \{0,1\}$; concatenating both arms over the window gives $a_t^{\text{gripper}} \in \mathbb{R}^{k \times 2}$.

Unified action space: $a_t = (a_t^{\text{3D-wrist}}, a_t^{\text{6D-eef}}, a_t^{\text{gripper}})$. Each data source supervises only the components it can provide reliably:

Table 1: Per-embodiment action supervision

Data source	$a^{\text{3D-wrist}}$	$a^{\text{6D-eef}}$	$a^{\text{gripper}}$
In-the-wild human action data (EgoDex + out-sourced)	✓	–	–
In-lab human action data	✓	–	✓
Robot tele-operation	✓	✓	✓
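Table 1 can be read as a per-source loss mask over the concatenated action vector. The sketch below is one hypothetical way to encode it; the source keys, the 6 + 12 + 2 column layout, and the helper names `loss_mask` and `masked_mse` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Per-step width of each component for two arms (from the shapes given above).
COMPONENT_DIMS = {"3d_wrist": 6, "6d_eef": 12, "gripper": 2}

# Table 1 as a lookup: which components each data source supervises.
SUPERVISED = {
    "human_in_the_wild": {"3d_wrist"},
    "human_in_lab": {"3d_wrist", "gripper"},
    "robot_teleop": {"3d_wrist", "6d_eef", "gripper"},
}

def loss_mask(source):
    """Boolean mask of length 20 marking the supervised action dimensions."""
    parts = [np.full(dim, name in SUPERVISED[source], dtype=bool)
             for name, dim in COMPONENT_DIMS.items()]
    return np.concatenate(parts)

def masked_mse(pred, target, mask):
    """Mean squared error over supervised columns only; inputs are (k, 20)."""
    err = (pred - target) ** 2
    return float(err[:, mask].mean())

k = 4
pred = np.zeros((k, 20))
target = np.ones((k, 20))
target[:, 6:18] = 100.0   # garbage in the 6D-eef slot must not affect human data
human_loss = masked_mse(pred, target, loss_mask("human_in_the_wild"))
```

With the in-the-wild mask, only the six wrist-translation columns contribute, so unreliable 6DoF values never reach the loss.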

4.2 VLA Model with Interleaved Action Sequence

Architecture: A $\pi_0$-like vision-language-action model $\pi_\theta(l, o_t)$ processes the language instruction $l$ and observations $o_t$ (head camera plus two wrist cameras). A pre-trained VLM encodes the vision-language tokens, and an Action Transformer trained with flow matching generates the actions.

Interleaved action tokens: Action tokens are ordered as $a^{\text{3D-wrist}} \rightarrow a^{\text{6D-eef}} \rightarrow a^{\text{gripper}}$ with attention masking to handle missing components. The ordering enables explicit knowledge transfer from bridging to 6DoF tokens via attention.
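The exact token layout and masking rule are not spelled out here, so the following is only one plausible reading: tokens are grouped per component in the stated order, each group may attend to itself and to earlier groups, and components that a data source does not provide are hidden as keys. The function name `action_attention_mask` and the block layout are assumptions.

```python
import numpy as np

ORDER = ["3d_wrist", "6d_eef", "gripper"]   # ordering stated in the text

def action_attention_mask(available, tokens_per_component):
    """Boolean (query, key) mask; True means the query may attend to the key.

    Assumed rule: a component attends to itself and to earlier components,
    and any component missing from `available` is never used as a key.
    """
    n = tokens_per_component
    size = n * len(ORDER)
    mask = np.zeros((size, size), dtype=bool)
    for qi in range(len(ORDER)):
        for ki, key_name in enumerate(ORDER):
            allowed = ki <= qi and key_name in available
            mask[qi * n:(qi + 1) * n, ki * n:(ki + 1) * n] = allowed
    return mask

human_mask = action_attention_mask({"3d_wrist"}, tokens_per_component=2)
robot_mask = action_attention_mask({"3d_wrist", "6d_eef", "gripper"},
                                   tokens_per_component=2)
```

Under this reading, the 6DoF tokens can always read the bridging tokens (which every source in Table 1 provides), which is the path the text describes for transferring knowledge from bridging to 6DoF actions.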

Flow-matching objective: For $\tau \in (0,1)$ and $\epsilon \sim \mathcal{N}(0,I)$, the model denoises $a_t^\tau = \tau \epsilon + (1-\tau)a_t$ by predicting the velocity $\hat{v}(a_t^\tau, o_t, l, \tau)$:

$$\mathcal{L}_{\text{FM}} = \big\| \hat{v}(a_t^\tau, o_t, l, \tau) - v^* \big\|_2^2$$

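where $v^* = \epsilon - a_t$. Under this interpolation, $\tau=1$ is pure noise and $\tau=0$ is the clean action, so inference Euler-integrates the predicted velocity from $\tau=1$ down to $\tau=0$ with step size $\Delta\tau=0.2$ (five steps).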
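The objective and sampler can be sketched with NumPy as follows. With $a_t^\tau$ as defined above, $\tau=1$ is pure noise and $\tau=0$ is the clean action, so this sketch integrates from $\tau=1$ toward $\tau=0$. The `oracle_velocity` function stands in for the trained network and is exact only for this single-sample toy case; all names and shapes are assumptions.

```python
import numpy as np

def flow_matching_pair(action, eps, tau):
    """Noisy input a^tau and regression target v* for one training sample."""
    noisy = tau * eps + (1.0 - tau) * action
    target_velocity = eps - action
    return noisy, target_velocity

def fm_loss(pred_velocity, target_velocity):
    """Squared L2 error between predicted and target velocity."""
    return float(np.sum((pred_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, noise, step=0.2):
    """Integrate da/dtau = v from tau=1 (noise) down to tau=0 (action)."""
    a = noise.copy()
    n_steps = int(round(1.0 / step))
    for j in range(n_steps):
        tau = 1.0 - j * step
        a = a - step * velocity_fn(a, tau)
    return a

rng = np.random.default_rng(0)
target_action = rng.normal(size=(4, 20))   # hypothetical k=4 chunk, 6 + 12 + 2 dims
eps = rng.normal(size=target_action.shape)

# Stand-in for the trained network: the exact velocity field when the data
# distribution is a single action chunk (valid only for this toy example).
def oracle_velocity(a, tau):
    return (a - target_action) / tau

noisy, v_star = flow_matching_pair(target_action, eps, tau=0.6)
loss_at_optimum = fm_loss(oracle_velocity(noisy, 0.6), v_star)
sample = euler_sample(oracle_velocity, eps)
```

Because the interpolation path is a straight line, five Euler steps of size 0.2 recover the target chunk exactly in this toy setting; a learned velocity field would only approximate it.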

Vision-language co-training: Co-trained with standard next-token prediction loss $\mathcal{L}_{\text{NTP}}$ on vision-language data.

4.3 Training Strategies

Stage I: Human-only pre-training – 600+ hours of human actions (EgoDex, out-sourced, in-lab). Only $\mathcal{L}_{\text{FM}}^{\text{3D-wrist}}$ is supervised.
Stage II: Human-robot co-training – Co-train with 72 hours of robot pick-and-place data and task-specific human actions (~3 hrs/task, 15 tasks). Randomly add or substitute $a^{\text{3D-wrist}}$ for $a^{\text{6D-eef}}$ on robot data to bind bridging to executable actions.
Stage III: Few-shot robot post-training – 10 additional robot trajectories per task.
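The "add or substitute" rule in Stage II is described only briefly, so the sketch below shows one possible reading: for each robot sample, the bridging action is either supervised alongside the 6DoF action ("add") or in place of it ("substitute"). The probability `p_substitute`, the decision to always keep the gripper, and the function name are hypothetical.

```python
import random

def robot_supervision(rng, p_substitute=0.5):
    """Pick the supervised components for one robot sample in Stage II.

    Assumed reading of "add or substitute"; p_substitute is a made-up value.
    """
    if rng.random() < p_substitute:
        return {"3d_wrist", "gripper"}             # substitute: wrist replaces 6D-eef
    return {"3d_wrist", "6d_eef", "gripper"}       # add: wrist alongside 6D-eef

rng = random.Random(0)
draws = [robot_supervision(rng) for _ in range(1000)]
n_substitute = sum(1 for d in draws if "6d_eef" not in d)
```

Either way, every robot sample carries the bridging target, so the same tokens that were trained on human data in Stage I are also tied to executable robot actions.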

Empirical Validation / Results

5.1 Evaluation Setups

15 bi-manual manipulation tasks grouped by object (microwave, drawer, mug/cup, others). Each task is evaluated in two distinct test scenes, 8 trials per task in total. Metrics: success rate and average progress (fine-grained per-task scoring, see Fig. 4 in paper).
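As a small worked example of the two metrics, the snippet below aggregates per-trial results for one task. The trial values are invented for illustration, and it assumes progress is scored in [0, 1] per trial and that the 8 trials are per task, which matches the 1/120 granularity of the overall success rates reported in the tables (for example 27/120 = 22.5%).

```python
def summarize(trials):
    """trials: list of (progress in [0, 1], success flag); returns percentages."""
    n = len(trials)
    progress = 100.0 * sum(p for p, _ in trials) / n
    success = 100.0 * sum(1 for _, ok in trials if ok) / n
    return progress, success

# Hypothetical results for one task: 8 trials across two scenes.
task_trials = [
    (1.0, True), (1.0, True), (0.5, False), (0.5, False),
    (0.25, False), (0.0, False), (1.0, True), (0.75, False),
]
task_progress, task_success = summarize(task_trials)

# Granularity check: 15 tasks x 8 trials = 120 trials overall.
total_trials = 15 * 8
overall_success_example = 100.0 * 27 / total_trials
```

Average progress gives partial credit for trials that complete some steps, which is why it is always at least as high as the success rate.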

5.2 Main Results (Fig. 5 & Fig. 6)

Training only on pick-and-place robot data (green) yields very low performance (overall progress 21%, success 10%).
Co-training with human actions via bridging action (orange) substantially improves performance (progress 45%, success 22%).
Adding large-scale human-only pre-training (blue) further improves (progress 60%, success 38%).
Few-shot robot post-training (purple) gives the highest performance (progress 71%, success 55%; see Table 3).

5.3 Comparison to 6DoF Human Actions (Table 2)

Human Actions	Microwave Prog/Succ (%)	Drawer Prog/Succ (%)	Mug/Cup Prog/Succ (%)	Other Prog/Succ (%)	Overall Prog/Succ (%)
$a^{\text{6D-eef}}$	25.00 / 4.17	55.00 / 31.25	28.13 / 0.00	49.17 / 33.33	34.67 / 12.50
$a^{\text{3D-wrist}}$ (Ours)	38.02 / 25.00	49.06 / 31.25	48.13 / 3.13	50.00 / 37.50	44.58 / 22.50

Qualitatively, 6DoF human actions lead to distorted wrist poses, while bridging actions produce natural, task-aligned poses (Fig. 7 & Fig. 8).

5.4 Post-Training Data Efficiency (Table 3)

Model	Overall Prog (%)	Overall Succ (%)
Stage III only (no human pretrain)	53.79	35.83
Stage I + III (with human pretrain)	71.21	55.00

Human-only pre-training substantially improves few-shot post-training efficiency.

5.5 Ablation of Bridging Objective on Robot Data (Table 4)

Robot Actions	Microwave Prog/Succ (%)	Drawer Prog/Succ (%)	Mug/Cup Prog/Succ (%)	Other Prog/Succ (%)	Overall Prog/Succ (%)
w/o $a^{\text{3D-wrist}}$	35.73 / 10.42	39.38 / 12.50	39.38 / 0.00	48.13 / 33.33	39.67 / 12.50
w/ $a^{\text{3D-wrist}}$	64.58 / 45.83	56.88 / 43.75	52.50 / 15.63	61.67 / 50.00	59.75 / 38.33

Supervising the bridging action on robot data is essential.

5.6 Loss Alignment (Fig. 9)

Human-only pre-training (400k iterations) yields lower training losses for $a^{\text{6D-eef}}$ and $a^{\text{gripper}}$ during co-training, even though neither component is supervised in Stage I. This suggests that the bridging objective is aligned with the executable robot action space.

5.7 Action Consistency (Fig. 10)

Visualization confirms that predicted bridging actions $a^{\text{3D-wrist}}$ and 6DoF end-effector actions $a^{\text{6D-eef}}$ align closely across diverse tasks.

5.8 Upper Bound Analysis (Table 5)

Treating task-specific robot demonstrations as "human" data (no observation gap, less noise):

Model	Microwave Prog/Succ (%)	Drawer Prog/Succ (%)	Mug/Cup Prog/Succ (%)	Other Prog/Succ (%)	Overall Prog/Succ (%)
Default (Ours)	64.58 / 45.83	56.88 / 43.75	52.50 / 15.63	61.67 / 50.00	59.75 / 38.33
Upper Bound	68.75 / 54.17	75.94 / 62.50	81.25 / 53.13	71.25 / 58.33	73.54 / 55.83
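The remaining gap to the upper bound can be read off Table 5 with a few subtractions. The snippet below only re-does that arithmetic on the reported numbers; the variable names are illustrative.

```python
# (progress %, success %) per task group, copied from Table 5.
default = {
    "Microwave": (64.58, 45.83), "Drawer": (56.88, 43.75),
    "Mug/Cup": (52.50, 15.63), "Other": (61.67, 50.00),
    "Overall": (59.75, 38.33),
}
upper_bound = {
    "Microwave": (68.75, 54.17), "Drawer": (75.94, 62.50),
    "Mug/Cup": (81.25, 53.13), "Other": (71.25, 58.33),
    "Overall": (73.54, 55.83),
}

# Gap in percentage points for each group, rounded to two decimals.
gaps = {
    group: (round(upper_bound[group][0] - default[group][0], 2),
            round(upper_bound[group][1] - default[group][1], 2))
    for group in default
}

# Task group (excluding the overall row) with the largest progress gap.
largest_gap_group = max((g for g in gaps if g != "Overall"),
                        key=lambda g: gaps[g][0])
```

Overall, the default model trails the upper bound by 13.79 points in progress and 17.5 points in success, with the largest progress gap in the Mug/Cup group (28.75 points).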

5.9 Failure Cases

Failures occur in tasks requiring precise rotational adjustments (e.g., "insert straw into cup," "open drawer"). The robot shows task intent but fails at critical steps due to the translation-only design.

Theoretical and Practical Implications

Theoretical: Provides empirical evidence that a translation-only bridging action is an effective interface for transferring manipulation skills across embodiments with fundamentally different end-effectors (fingers vs. parallel grippers), although the failure cases show it is not sufficient for rotation-critical tasks. Expressing actions in the shared observation frame (head camera) reduces the impact of embodiment-specific noise such as unreliable wrist-rotation estimates.
Practical: Enables scalable robot learning by leveraging abundant, cheap human action data while requiring few task-specific robot demonstrations (none in Stage II, 10 trajectories per task in Stage III). The three-stage pipeline (pre-train, co-train, post-train) offers a practical recipe for building generalist manipulation policies.
Scalability: Adding large-scale human-only pre-training (600+ hours) on top of co-training raises overall progress from about 45% to about 60% and success from about 22% to about 38%, indicating that the bridging representation benefits from more human data.
Limitations: Translation-only actions limit performance on rotation-critical tasks; future work should incorporate limited, reliable rotation cues.

Conclusion

The paper demonstrates that relative wrist translation in the head-camera frame serves as an effective bridging action for transferring manipulation skills from humans to bi-manual robots. The proposed interleaved action token architecture with attention masking handles heterogeneous action components across data sources. Real-world experiments on 15 tasks show that the translation-only bridging action outperforms 6DoF human actions, scales with large-scale human pre-training, and improves data efficiency for few-shot robot fine-tuning. Key limitations include struggles with rotation-critical tasks and thin object grasping. Future directions include incorporating limited rotation cues and scaling to more diverse robot actions to further narrow the embodiment gap.