DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

Summary (Overview)

Comprehensive Benchmark: Introduces DexJoCo, a benchmark with 11 functionally grounded tasks designed to evaluate the unique capabilities of dexterous hands: tool-use, bimanual coordination, long-horizon execution, and reasoning.
Low-Cost Toolkit: Develops a low-cost (~$2,300) teleoperation system using Rokoko gloves and HTC Vive trackers, paired with a self-supervised retargeting algorithm (GeoRT) for efficient collection of 1.1K high-quality human demonstration trajectories.
Extensive Evaluation: Benchmarks modern policies (ACT, Diffusion Policy, $\pi_{0.5}$, GR00T N1.5) under diverse settings, revealing key limitations: poor robustness to visual randomization, failure in fine-grained actions/insertion, and lack of true language generalization in Vision-Language-Action (VLA) models.
Critical Insights: Identifies major research gaps: the need for dexterous-hand-centric foundation models, the limitations of vision-only policies for contact-rich manipulation, and the challenge of sim-to-real transfer.

Introduction and Theoretical Foundation

Achieving human-level robotic manipulation necessitates dexterous hands capable of fine-grained, contact-rich interactions. While progress has been made with manipulator-gripper systems, advancing dexterous hand learning requires standardized benchmarks for systematic evaluation. Existing dexterous benchmarks suffer from several limitations:

They often use hand-only setups, creating trajectories unrealistic for real-world manipulator-hand systems.
Their tasks (e.g., in-hand manipulation, pick-and-place) lack functional diversity and fail to highlight the distinct advantages of dexterous hands over simple grippers.
They lack reliable, user-friendly systems for collecting high-quality human demonstrations, often resorting to reinforcement learning or automated generation, both of which tend to yield unnatural behaviors.
They lack standardized language instructions and unified data formats compatible with modern VLA models.

DexJoCo is introduced to address these gaps. It provides a benchmark with functionally grounded tasks that require dexterous capabilities, a toolkit for low-cost data collection, and a dataset of human demonstrations to facilitate systematic training and evaluation of dexterous manipulation policies.

Comparison with Existing Benchmarks:

Benchmark	Hand	Tool-Use	Bimanual	Reasoning	Hand MoCap System	Trajectory Collection Methods
CALVIN						Motion Planning
LIBERO						Human Demonstration
RoboTwin 2.0	✓					Motion Planning
DexMimicGen	✓	✓				Few Human + MimicGen
Bi-DexHands	✓		✓			RL Policy
DexJoCo (ours)	✓	✓	✓	✓	✓	Human Demonstration

Methodology

1. Robot Setup and Observation State

Simulator: Built on the MuJoCo physics simulator.
Robot System: Comprises a Rethink Robotics mount, a Franka Panda manipulator, and an Allegro Hand.
Observations: Include third-person/wrist-mounted RGB and RGB-D images, object poses, robot motion states, end-effector pose, and hand joint angles.
Action Space: Manipulator actions are target absolute end-effector poses; hand actions are target absolute joint angles.
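
The action space above can be pictured as one flat vector per arm-hand pair. The following is a minimal sketch, not the benchmark's actual data format: the `DexAction` container, the `(w, x, y, z)` quaternion convention, and the ordering of fields are assumptions made for illustration. The 16 hand targets correspond to the Allegro Hand's four fingers with four joints each.

```python
from dataclasses import dataclass
from typing import List, Sequence

ALLEGRO_DOF = 16  # 4 fingers x 4 joints on the Allegro Hand


@dataclass(frozen=True)
class DexAction:
    """Absolute targets for one control step (hypothetical container)."""
    ee_position: Sequence[float]    # target end-effector xyz
    ee_quaternion: Sequence[float]  # target orientation, assumed (w, x, y, z)
    hand_joints: Sequence[float]    # target absolute hand joint angles

    def to_vector(self) -> List[float]:
        """Flatten to [xyz | quaternion | hand joints] = 3 + 4 + 16 values."""
        if len(self.ee_position) != 3 or len(self.ee_quaternion) != 4:
            raise ValueError("pose must be 3 position + 4 quaternion values")
        if len(self.hand_joints) != ALLEGRO_DOF:
            raise ValueError(f"expected {ALLEGRO_DOF} hand joint targets")
        return [*self.ee_position, *self.ee_quaternion, *self.hand_joints]


def from_vector(vec: Sequence[float]) -> DexAction:
    """Inverse of DexAction.to_vector for a 23-dimensional vector."""
    if len(vec) != 3 + 4 + ALLEGRO_DOF:
        raise ValueError("expected a 23-dimensional action vector")
    return DexAction(tuple(vec[:3]), tuple(vec[3:7]), tuple(vec[7:]))
```

Under this layout a bimanual task would simply carry two such vectors, one per arm-hand pair.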

2. Human Demonstration Data Collection System

Hardware: Uses Rokoko Smartgloves for hand pose capture and HTC Vive Trackers with Base Stations for wrist/end-effector tracking. Total cost ~$2,300.
Teleoperation Algorithm:
- Hand Retargeting: Employs GeoRT, a lightweight self-supervised method. The retargeting model $f$ maps human fingertip keypoints $x_H$ to robot joint positions $q_R = f(x_H)$ by minimizing a composite loss: $L = L_{dir} + \lambda_1 L_{cover} + \lambda_2 L_{flat} + \lambda_3 L_{pinch} + \lambda_4 L_{col}$ where $L_{dir}$ preserves motion direction, $L_{cover}$ enlarges workspace, $L_{flat}$ ensures uniform sensitivity, $L_{pinch}$ preserves pinch behaviors, and $L_{col}$ avoids self-collisions.
- Wrist Tracking: The tracker is fixed to align human wrist motion with the Franka end-effector. Actions are recorded as relative pose changes from an initial reference.
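
The wrist-tracking step records motion relative to an initial reference pose. The sketch below shows one way to turn that into an end-effector target using 4x4 homogeneous transforms. The function names and the choice to apply the relative motion in the end-effector's own reference frame are assumptions, and the tracker frame is assumed to be already aligned with the end-effector frame.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t: Matrix) -> Matrix:
    """Invert a rigid transform: rotation R -> R^T, translation p -> -R^T p."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[0] + [new_p[0]], rt[1] + [new_p[1]], rt[2] + [new_p[2]],
            [0.0, 0.0, 0.0, 1.0]]


def relative_pose(tracker_ref: Matrix, tracker_now: Matrix) -> Matrix:
    """Tracker motion since the reference, expressed in the reference frame."""
    return matmul(invert_rigid(tracker_ref), tracker_now)


def ee_target(ee_ref: Matrix, tracker_ref: Matrix, tracker_now: Matrix) -> Matrix:
    """Apply the tracker's relative motion to the end-effector reference pose."""
    return matmul(ee_ref, relative_pose(tracker_ref, tracker_now))
```

Because only the change from the reference is used, the operator can start a demonstration from any comfortable hand position without recalibrating absolute tracker coordinates.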

3. Task Design

Formalization: A task $\mathcal{T} = (\mathcal{O}, \mathcal{G})$ is defined by interactive objects $\mathcal{O} = \{o_1, o_2, ..., o_m\}$ and goal constraints $\mathcal{G} = \{g_{seq}, g_{pose}, g_{joint}, g_{contact}\}$ (temporal, pose, joint-state, and contact conditions).
Design Principles:
1. Functional Interaction: Tasks mimic everyday activities with explicit visual feedback.
2. Dexterity Dependency: Success requires fine-grained finger coordination, impossible for parallel grippers.
3. Long-Horizon Compositionality: Multi-stage execution with temporal dependencies.
4. Bimanual Coordination: Requires coordinated, asymmetric two-hand manipulation.
Task Categories & Examples:
- Tool-Use: Water Plant, Hammer Nail
- Bimanual: Unlock iPad, Hanoi, Assembly, Microwave Cook, Photograph
- Long-Horizon: Microwave Cook
- Reasoning: Hanoi (Tower of Hanoi)
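
The formalization $\mathcal{T} = (\mathcal{O}, \mathcal{G})$ can be sketched as a small success checker. This is an illustrative reading rather than the benchmark's implementation: the `Task` class, the flat dictionary state, and the split into ordered stage predicates (standing in for $g_{seq}$) and end-state predicates (standing in for $g_{pose}$, $g_{joint}$, $g_{contact}$) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, float]             # hypothetical flat simulator state
Predicate = Callable[[State], bool]  # one goal constraint


@dataclass
class Task:
    objects: List[str]                                           # O = {o_1, ..., o_m}
    staged_goals: List[Predicate] = field(default_factory=list)  # g_seq: must be met in order
    final_goals: List[Predicate] = field(default_factory=list)   # pose / joint / contact checks
    stage: int = field(default=0, init=False)                    # index of next unmet stage

    def update(self, state: State) -> None:
        """Advance past each staged goal the current state satisfies, in order."""
        while self.stage < len(self.staged_goals) and self.staged_goals[self.stage](state):
            self.stage += 1

    def success(self, state: State) -> bool:
        """True once all stages were passed in order and all final goals hold."""
        stages_done = self.stage == len(self.staged_goals)
        return stages_done and all(goal(state) for goal in self.final_goals)
```

For a task like Microwave Cook, the staged goals might be "door opened" followed by "object inside", with "door closed" as a final goal; a policy that places the object without ever opening the door would not advance past the first stage.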

Domain Randomizations

To evaluate policy robustness, domain randomization is applied via trajectory replay:

Visual: Randomizes third-person camera pose, lighting (direction/color), and table texture.
Physical: Randomizes object placement and table height.
Dynamics: Randomizes object mass, joint friction, and stiffness (for evaluation).
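
The three randomization groups can be sketched as a seeded per-episode sampler. Every range below is an illustrative placeholder rather than a value from the benchmark, the parameter names are hypothetical, and only a subset of the listed factors is modelled (for example, table texture and light color are omitted).

```python
import random
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

Range = Tuple[float, float]


@dataclass(frozen=True)
class RandomizationConfig:
    # Placeholder ranges for illustration only; not taken from the benchmark.
    camera_jitter_m: Range = (-0.05, 0.05)
    light_intensity: Range = (0.5, 1.5)
    table_height_offset_m: Range = (-0.03, 0.03)
    object_xy_offset_m: Range = (-0.05, 0.05)
    mass_scale: Range = (0.8, 1.2)
    friction_scale: Range = (0.8, 1.2)
    stiffness_scale: Range = (0.8, 1.2)


# Which parameters belong to which randomization group.
GROUPS = {
    "visual": ("camera_jitter_m", "light_intensity"),
    "physical": ("table_height_offset_m", "object_xy_offset_m"),
    "dynamics": ("mass_scale", "friction_scale", "stiffness_scale"),
}


def sample_episode(cfg: RandomizationConfig, enabled: Iterable[str], seed: int) -> Dict[str, float]:
    """Draw one value per parameter in the enabled groups, reproducibly per seed."""
    rng = random.Random(seed)
    enabled = set(enabled)
    params: Dict[str, float] = {}
    for group, names in GROUPS.items():  # fixed order keeps draws reproducible
        if group not in enabled:
            continue
        for name in names:
            low, high = getattr(cfg, name)
            params[name] = rng.uniform(low, high)
    return params
```

Seeding per episode matters when randomization is applied through trajectory replay: the same demonstration can be re-rendered under a reproducible set of perturbations.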

4. Imitation Learning Policy Evaluation

Baseline Models: ACT, Diffusion Policy (DP-T: Transformer, DP-C: CNN), $\pi_{0.5}$, GR00T N1.5.
Action Chunking: All baselines model the conditional distribution of a future $k$-step action chunk given $h$ frames of history: $P(a_{t:t+k-1} \mid s_{t-h+1:t}, l) = \pi_{\theta}(a_{t:t+k-1} \mid s_{t-h+1:t}, l)$, where $l$ is an optional language instruction.
Model Deployment: Uses an asynchronous inference mechanism to generate the next action chunk while executing the current one, improving reactivity.
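
The deployment idea, computing the next chunk while the current one is still executing, can be sketched with a background worker. This is one possible scheme and not necessarily the one used in the benchmark: the `lead` parameter (how many steps before the end of a chunk the next inference is requested) and the rule of discarding the first `lead` actions of the fresh chunk are assumptions, as are all function names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Obs = float
Action = float
Policy = Callable[[Sequence[Obs]], List[Action]]  # history -> k-step action chunk


def run_with_async_inference(policy: Policy, env_step: Callable[[Action], Obs],
                             first_obs: Obs, history_len: int,
                             total_steps: int, lead: int) -> List[Action]:
    """Execute chunks while the next chunk is inferred in a background thread."""
    history = [first_obs]
    executed: List[Action] = []
    chunk = policy(tuple(history[-history_len:]))  # first chunk is computed synchronously
    with ThreadPoolExecutor(max_workers=1) as pool:
        while len(executed) < total_steps:
            if not 1 <= lead < len(chunk):
                raise ValueError("need 1 <= lead < chunk length")
            pending = None
            for i, action in enumerate(chunk):
                if i == len(chunk) - lead:
                    # Request the next chunk from the latest h observations.
                    pending = pool.submit(policy, tuple(history[-history_len:]))
                history.append(env_step(action))
                executed.append(action)
                if len(executed) == total_steps:
                    break
            if len(executed) >= total_steps:
                break
            # The fresh chunk starts at the request time; its first `lead`
            # actions overlap with steps already executed, so drop them.
            chunk = pending.result()[lead:]
    return executed
```

The trade-off in this sketch is that each chunk after the first contributes only `k - lead` new actions, in exchange for not pausing the robot while the policy runs.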

Empirical Validation / Results

1. Benchmark Performance Under Randomization

The benchmark proves highly challenging. Performance drops sharply under full visual randomization ("rand-full").

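Table 2: Performance comparison on benchmark tasks. Mean success rate (%) ± std under the "rand-obj" and "rand-full" settings. Tasks marked /B are bimanual. Only the first five result columns appear below; the ACT rand-full, $\pi_{0.5}$, and GR00T N1.5 columns are not included.

Task	DP-T (rand-obj)	DP-T (rand-full)	DP-C (rand-obj)	DP-C (rand-full)	ACT (rand-obj)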
Hammer Nail	81.3 ± 3.1	18.7 ± 1.2	58.7 ± 4.2	19.3 ± 3.1	50.0 ± 7.2
Click Mouse	62.0 ± 2.0	25.3 ± 8.1	74.0 ± 5.3	34.7 ± 4.2	61.3 ± 3.1
Pick Bucket	83.3 ± 3.1	58.7 ± 15.0	70.0 ± 2.0	68.0 ± 3.5	64.0 ± 4.0
Pinch Tongs	22.7 ± 5.8	18.7 ± 3.1	57.3 ± 6.4	28.7 ± 11.7	31.3 ± 3.1
Fold Glasses	53.3 ± 3.1	11.3 ± 1.2	54.0 ± 15.9	15.3 ± 7.6	47.3 ± 11.0
Water Plant	84.0 ± 3.5	56.0 ± 8.7	63.3 ± 3.1	54.0 ± 5.3	47.3 ± 4.6
Unlock iPad /B	8.0 ± 2.0	2.0 ± 2.0	52.0 ± 2.0	12.0 ± 3.5	9.3 ± 3.1
Hanoi /B	24.7 ± 4.6	0.7 ± 1.2	12.7 ± 3.1	9.3 ± 6.1	6.0 ± 2.0
Assembly /B	4.7 ± 3.1	0.0 ± 0.0	3.3 ± 1.2	0.0 ± 0.0	0.0 ± 0.0
Microwave /B	73.3 ± 11.6	21.3 ± 4.6	54.0 ± 12.5	62.7 ± 6.4	66.0 ± 2.0
Photograph /B	56.7 ± 4.6	7.3 ± 1.2	24.0 ± 8.7	8.7 ± 4.2	7.3 ± 1.2
Avg.	50.4 ± 1.4	20.0 ± 1.4	47.6 ± 2.0	28.4 ± 1.5	35.5 ± 2.0
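
The size of the drop under full randomization can be made concrete from the "Avg." row. The short calculation below assumes the first four result columns are DP-T and DP-C under rand-obj and rand-full respectively, which is how the sub-header row reads; the helper name is hypothetical.

```python
# Average success rates (%) from the "Avg." row of Table 2.
# Assumed column order: DP-T rand-obj, DP-T rand-full, DP-C rand-obj, DP-C rand-full.
AVERAGES = {
    "DP-T": (50.4, 20.0),
    "DP-C": (47.6, 28.4),
}


def relative_drop(rand_obj: float, rand_full: float) -> float:
    """Percentage of rand-obj success that is lost under rand-full."""
    return 100.0 * (rand_obj - rand_full) / rand_obj


for name, (rand_obj, rand_full) in AVERAGES.items():
    drop = relative_drop(rand_obj, rand_full)
    print(f"{name}: {rand_obj:.1f} -> {rand_full:.1f} ({drop:.1f}% relative drop)")
```

Under that reading, DP-T loses roughly 60% of its average success rate and DP-C roughly 40%, so the CNN variant degrades less in relative terms even though it starts slightly lower.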

Key Findings:

$\pi_{0.5}$ achieves the highest average success, benefiting from large-scale pre-training.
The smaller DP-T (~100M params) trained from scratch is competitive, especially on bimanual tasks.
DP-C excels at precise operations (button pressing, hinge interaction), likely because it injects observations through FiLM conditioning, which provides stronger fine-grained visual perception.
Bimanual tasks (Unlock iPad, Hanoi, Assembly) are particularly difficult, with some policies never succeeding.

2. Failure Mode Analysis

Fine-grained Actions: Policies often locate objects but fail to perform precise interactions (e.g., clicking specific buttons).
Insertion: High failure rate in tasks like Assembly and Hanoi.
Temporal Memory: In Pinch Tongs, policies grasp but fail to execute the squeeze-release sequence.
Sequential Errors: In Microwave, policies often place the object correctly but then drag it back out while withdrawing the hand.

3. Multi-task, Dynamics, and Action-Head Evaluations

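Table 3: Multi-task, dynamics, and action-head evaluations. Success rate (%). Only the first three result columns appear below; the Rand-dynamics $\pi_{0.5}$ and Rand-AH ($\pi_{0.5}$) columns are not included.

Task	Multi-task (DP-T)	Multi-task ($\pi_{0.5}$)	Rand-dynamics (DP-T)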
Hammer Nail	58.7 ± 5.0	86.7 ± 3.1	77.3 ± 6.4
Click Mouse	38.7 ± 3.1	80.7 ± 3.1	0.0 ± 0.0
Pick Bucket	55.3 ± 7.6	83.3 ± 8.1	80.7 ± 3.1
Pinch Tongs	6.0 ± 5.3	45.3 ± 6.1	15.3 ± 4.2
Fold Glasses	11.3 ± 5.0	42.0 ± 6.0	40.7 ± 4.6
Water Plant	60.0 ± 6.9	84.0 ± 4.0	76.0 ± 6.0
Unlock iPad /B	0.0 ± 0.0	0.7 ± 1.2	0.7 ± 1.2
Hanoi /B	8.0 ± 2.0	6.0 ± 0.0	29.3 ± 2.3
Assembly /B	1.3 ± 2.3	3.3 ± 2.3	8.0 ± 5.3
Microwave /B	42.7 ± 6.4	39.3 ± 13.0	70.0 ± 9.2
Photograph /B	28.0 ± 6.0	29.3 ± 1.2	59.3 ±
Avg.	33.2 ± 2.4	45.5 ±