DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
Summary (Overview)
- Comprehensive Benchmark: Introduces DexJoCo, a benchmark with 11 functionally grounded tasks designed to evaluate the unique capabilities of dexterous hands: tool-use, bimanual coordination, long-horizon execution, and reasoning.
- Low-Cost Toolkit: Develops a low-cost (~$2,300) teleoperation system using Rokoko gloves and HTC Vive trackers, paired with a self-supervised retargeting algorithm (GeoRT) for efficient collection of 1.1K high-quality human demonstration trajectories.
- Extensive Evaluation: Benchmarks modern policies (ACT, Diffusion Policy, , GR00T N1.5) under diverse settings, revealing key limitations: poor robustness to visual randomization, failure in fine-grained actions/insertion, and lack of true language generalization in Vision-Language-Action (VLA) models.
- Critical Insights: Identifies major research gaps: the need for dexterous-hand-centric foundation models, the limitations of vision-only policies for contact-rich manipulation, and the challenge of sim-to-real transfer.
Introduction and Theoretical Foundation
Achieving human-level robotic manipulation necessitates dexterous hands capable of fine-grained, contact-rich interactions. While progress has been made with manipulator-gripper systems, advancing dexterous hand learning requires standardized benchmarks for systematic evaluation. Existing dexterous benchmarks suffer from several limitations:
- They often use hand-only setups, creating trajectories unrealistic for real-world manipulator-hand systems.
- Their tasks (e.g., in-hand manipulation, pick-and-place) lack functional diversity and fail to highlight the distinct advantages of dexterous hands over simple grippers.
- They lack reliable, user-friendly systems for collecting high-quality human demonstrations, often resorting to reinforcement learning or automated generation which yields unnatural behaviors.
- They lack standardized language instructions and unified data formats compatible with modern VLA models.
DexJoCo is introduced to address these gaps. It provides a benchmark with functionally grounded tasks that require dexterous capabilities, a toolkit for low-cost data collection, and a dataset of human demonstrations to facilitate systematic training and evaluation of dexterous manipulation policies.
Comparison with Existing Benchmarks:
| Benchmark | Hand | Tool-Use | Bimanual | Reasoning | Hand MoCap System | Trajectory Collection Methods |
|---|---|---|---|---|---|---|
| CALVIN | | | | | | Motion Planning |
| LIBERO | | | | | | Human Demonstration |
| RoboTwin 2.0 | | | ✓ | | | Motion Planning |
| DexMimicGen | ✓ | | ✓ | | | Few Human + MimicGen |
| Bi-DexHands | ✓ | | ✓ | | | RL Policy |
| DexJoCo (ours) | ✓ | ✓ | ✓ | ✓ | ✓ | Human Demonstration |
Methodology
1. Robot Setup and Observation State
- Simulator: Built on the MuJoCo physics simulator.
- Robot System: Comprises a Rethink Robotics mount, a Franka Panda manipulator, and an Allegro Hand.
- Observations: Include third-person/wrist-mounted RGB and RGB-D images, object poses, robot motion states, end-effector pose, and hand joint angles.
- Action Space: Manipulator actions are target absolute end-effector poses; hand actions are target absolute joint angles.
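A minimal sketch of how one control step in this action space might be packed into a flat vector. The class name, the field names, and the (w, x, y, z) quaternion encoding of the end-effector orientation are assumptions for illustration; the source only states that actions are absolute end-effector poses and absolute hand joint angles.

```python
from dataclasses import dataclass
from typing import List, Tuple

ALLEGRO_DOF = 16  # the Allegro Hand has 16 actuated joints


@dataclass
class ArmHandAction:
    """One control step with absolute targets (hypothetical container)."""
    ee_position: Tuple[float, float, float]            # target end-effector position
    ee_quaternion: Tuple[float, float, float, float]   # target orientation, (w, x, y, z) assumed
    hand_joint_targets: List[float]                    # target hand joint angles in radians

    def flatten(self) -> List[float]:
        """Concatenate arm and hand targets into a single action vector."""
        return [*self.ee_position, *self.ee_quaternion, *self.hand_joint_targets]


action = ArmHandAction(
    ee_position=(0.5, 0.0, 0.3),
    ee_quaternion=(1.0, 0.0, 0.0, 0.0),
    hand_joint_targets=[0.0] * ALLEGRO_DOF,
)
print(len(action.flatten()))  # 3 + 4 + 16 = 23
```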
2. Human Demonstration Data Collection System
- Hardware: Uses Rokoko Smartgloves for hand pose capture and HTC Vive Trackers with Base Stations for wrist/end-effector tracking. Total cost ~$2,300.
- Teleoperation Algorithm:
- Hand Retargeting: Employs GeoRT, a lightweight self-supervised method. The retargeting model maps human fingertip keypoints to robot joint positions by minimizing a composite loss whose terms, respectively, preserve motion direction, enlarge the workspace, ensure uniform sensitivity, preserve pinch behaviors, and avoid self-collisions.
- Wrist Tracking: The tracker is fixed to align human wrist motion with the Franka end-effector. Actions are recorded as relative pose changes from an initial reference.
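The wrist-tracking step can be illustrated with a small sketch: the tracker's pose change since a reference frame is computed and then applied to the robot's reference end-effector pose. The function names, the (w, x, y, z) quaternion convention, and the assumption that tracker and robot frames are already axis-aligned are illustrative choices, not details taken from the toolkit.

```python
def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def quat_conj(q):
    """Conjugate, which equals the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def relative_pose(ref_pos, ref_quat, cur_pos, cur_quat):
    """Tracker pose change since the reference frame, expressed in the world frame."""
    dpos = tuple(c - r for c, r in zip(cur_pos, ref_pos))
    dquat = quat_mul(cur_quat, quat_conj(ref_quat))  # rotation taking ref to cur
    return dpos, dquat


def apply_to_robot(robot_ref_pos, robot_ref_quat, dpos, dquat):
    """Add the human wrist's relative motion to the robot's reference pose."""
    target_pos = tuple(r + d for r, d in zip(robot_ref_pos, dpos))
    target_quat = quat_mul(dquat, robot_ref_quat)
    return target_pos, target_quat


# Example: the wrist moves 10 cm along x and yaws 90 degrees about z.
s = 0.5 ** 0.5
dpos, dquat = relative_pose((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0),
                            (0.1, 0.0, 0.0), (s, 0.0, 0.0, s))
target_pos, target_quat = apply_to_robot((0.5, 0.0, 0.3), (1.0, 0.0, 0.0, 0.0), dpos, dquat)
```

Working with relative changes means the operator does not need to start from a pose that matches the robot's; only the motion after the reference frame is transferred.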
3. Task Design
- Formalization: A task is defined by interactive objects and goal constraints (temporal, pose, joint-state, and contact conditions).
- Design Principles:
- Functional Interaction: Tasks mimic everyday activities with explicit visual feedback.
- Dexterity Dependency: Success requires fine-grained finger coordination, impossible for parallel grippers.
- Long-Horizon Compositionality: Multi-stage execution with temporal dependencies.
- Bimanual Coordination: Requires coordinated, asymmetric two-hand manipulation.
- Task Categories & Examples:
- Tool-Use: Water Plant, Hammer Nail
- Bimanual: Unlock iPad, Hanoi, Assembly, Microwave Cook, Photograph
- Long-Horizon: Microwave Cook
- Reasoning: Hanoi (Tower of Hanoi)
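The formalization above (objects plus ordered goal constraints) could be sketched as a list of predicates that must become true in sequence. The class names, the state keys, and the thresholds are hypothetical; they only illustrate how joint-state and pose conditions might combine with a temporal ordering in a task such as Microwave Cook.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, float]  # hypothetical flat simulator state


@dataclass
class GoalConstraint:
    name: str
    check: Callable[[State], bool]


@dataclass
class Task:
    instruction: str
    stages: List[GoalConstraint]           # must be satisfied in this order
    _next: int = field(default=0, init=False)

    def update(self, state: State) -> bool:
        """Advance past every stage satisfied by this state; True when all are done."""
        while self._next < len(self.stages) and self.stages[self._next].check(state):
            self._next += 1
        return self._next == len(self.stages)


microwave = Task(
    instruction="Put the food in the microwave and close the door.",
    stages=[
        GoalConstraint("door_open", lambda s: s["door_angle"] > 1.0),      # joint-state condition
        GoalConstraint("food_inside", lambda s: s["food_inside"] > 0.5),   # pose condition
        GoalConstraint("door_closed", lambda s: s["door_angle"] < 0.05),   # joint-state condition
    ],
)

trajectory = [
    {"door_angle": 0.0, "food_inside": 0.0},
    {"door_angle": 1.2, "food_inside": 0.0},
    {"door_angle": 1.2, "food_inside": 1.0},
    {"door_angle": 0.0, "food_inside": 1.0},
]
results = [microwave.update(s) for s in trajectory]
print(results)  # [False, False, False, True]
```

Because stages are consumed in order, a state that satisfies only the final condition (door closed with the food already inside) does not count as success unless the earlier stages were observed first.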
iii. Domain Randomizations
To evaluate policy robustness, domain randomization is applied via trajectory replay:
- Visual: Randomizes third-person camera pose, lighting (direction/color), and table texture.
- Physical: Randomizes object placement and table height.
- Dynamics: Randomizes object mass, joint friction, and stiffness (for evaluation).
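As a rough illustration of the three randomization groups, the sketch below samples per-episode parameters for whichever groups are enabled and leaves the rest at nominal values. All parameter names and ranges are made up for the example; the benchmark's actual ranges are not given here.

```python
import random
from typing import Dict, Set

# Hypothetical ranges and nominal values, chosen only for illustration.
RANGES = {
    "camera_shift_m": (-0.05, 0.05),
    "light_intensity": (0.5, 1.5),
    "object_shift_m": (-0.10, 0.10),
    "table_height_shift_m": (-0.03, 0.03),
    "mass_scale": (0.8, 1.2),
    "friction_scale": (0.8, 1.2),
}
NOMINAL = {
    "camera_shift_m": 0.0, "light_intensity": 1.0, "table_texture_id": 0,
    "object_shift_m": 0.0, "table_height_shift_m": 0.0,
    "mass_scale": 1.0, "friction_scale": 1.0,
}
GROUPS = {
    "visual": ["camera_shift_m", "light_intensity", "table_texture_id"],
    "physical": ["object_shift_m", "table_height_shift_m"],
    "dynamics": ["mass_scale", "friction_scale"],
}
NUM_TEXTURES = 5  # hypothetical size of the table-texture pool


def sample_episode_params(rng: random.Random, enabled: Set[str]) -> Dict[str, float]:
    """Sample parameters for enabled groups; disabled groups keep nominal values."""
    params = dict(NOMINAL)
    for group, keys in GROUPS.items():  # fixed iteration order keeps seeds reproducible
        if group not in enabled:
            continue
        for key in keys:
            if key == "table_texture_id":
                params[key] = rng.randrange(NUM_TEXTURES)
            else:
                low, high = RANGES[key]
                params[key] = rng.uniform(low, high)
    return params


params = sample_episode_params(random.Random(0), {"visual", "physical"})
```

In a MuJoCo environment, sampled values like these would typically be written into model fields (for example `body_mass` or `geom_friction` on `MjModel`) before the episode starts.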
4. Imitation Learning Policy Evaluation
- Baseline Models: ACT, Diffusion Policy (DP-T: Transformer, DP-C: CNN), , GR00T N1.5.
- Action Chunking: All baselines model the conditional distribution of a future $H$-step action chunk given the last $k$ frames of observation history, $\pi_\theta(a_{t:t+H-1} \mid o_{t-k+1:t}, \ell)$, where $\ell$ is an optional language instruction.
- Model Deployment: Uses an asynchronous inference mechanism to generate the next action chunk while executing the current one, improving reactivity.
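The asynchronous deployment idea can be sketched with a single background worker: while the current chunk is being executed, the next chunk is already being predicted. This is a simplified illustration with toy stand-ins (`CounterEnv`, `toy_policy`, and `run_with_async_inference` are hypothetical names); it is not the benchmark's deployment code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_with_async_inference(
    policy: Callable[[int], List[int]],
    get_obs: Callable[[], int],
    apply_action: Callable[[int], None],
    num_chunks: int,
) -> List[int]:
    """Execute action chunks while the next chunk is predicted in the background."""
    executed: List[int] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(policy, get_obs())
        for _ in range(num_chunks):
            chunk = pending.result()                  # wait for the prepared chunk
            pending = pool.submit(policy, get_obs())  # start predicting the next one now
            for action in chunk:                      # execution overlaps with inference
                apply_action(action)
                executed.append(action)
    return executed


class CounterEnv:
    """Toy environment whose observation is the number of steps taken so far."""
    def __init__(self) -> None:
        self.steps = 0

    def observe(self) -> int:
        return self.steps

    def step(self, action: int) -> None:
        self.steps += 1


def toy_policy(obs: int, horizon: int = 3) -> List[int]:
    """Toy policy: the chunk simply counts upward from the observation."""
    return [obs + i for i in range(horizon)]


env = CounterEnv()
log = run_with_async_inference(toy_policy, env.observe, env.step, num_chunks=3)
print(log)  # [0, 1, 2, 0, 1, 2, 3, 4, 5]
```

The repeated `0, 1, 2` in the output shows the cost of this naive scheme: the second chunk was predicted from an observation taken before the first chunk ran. Practical systems usually compensate, for instance by discarding the part of the new chunk that overlaps with actions already executed.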
Empirical Validation / Results
1. Benchmark Performance Under Randomization
The benchmark proves highly challenging. Performance drops sharply under full visual randomization ("rand-full").
Table 2: Performance comparison on benchmark tasks. Mean success rate (%) ± std.
| Task | DP-T (rand-obj) | DP-T (rand-full) | DP-C (rand-obj) | DP-C (rand-full) | ACT (rand-obj) |
|---|---|---|---|---|---|
| Hammer Nail | 81.3 ± 3.1 | 18.7 ± 1.2 | 58.7 ± 4.2 | 19.3 ± 3.1 | 50.0 ± 7.2 |
| Click Mouse | 62.0 ± 2.0 | 25.3 ± 8.1 | 74.0 ± 5.3 | 34.7 ± 4.2 | 61.3 ± 3.1 |
| Pick Bucket | 83.3 ± 3.1 | 58.7 ± 15.0 | 70.0 ± 2.0 | 68.0 ± 3.5 | 64.0 ± 4.0 |
| Pinch Tongs | 22.7 ± 5.8 | 18.7 ± 3.1 | 57.3 ± 6.4 | 28.7 ± 11.7 | 31.3 ± 3.1 |
| Fold Glasses | 53.3 ± 3.1 | 11.3 ± 1.2 | 54.0 ± 15.9 | 15.3 ± 7.6 | 47.3 ± 11.0 |
| Water Plant | 84.0 ± 3.5 | 56.0 ± 8.7 | 63.3 ± 3.1 | 54.0 ± 5.3 | 47.3 ± 4.6 |
| Unlock iPad /B | 8.0 ± 2.0 | 2.0 ± 2.0 | 52.0 ± 2.0 | 12.0 ± 3.5 | 9.3 ± 3.1 |
| Hanoi /B | 24.7 ± 4.6 | 0.7 ± 1.2 | 12.7 ± 3.1 | 9.3 ± 6.1 | 6.0 ± 2.0 |
| Assembly /B | 4.7 ± 3.1 | 0.0 ± 0.0 | 3.3 ± 1.2 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| Microwave /B | 73.3 ± 11.6 | 21.3 ± 4.6 | 54.0 ± 12.5 | 62.7 ± 6.4 | 66.0 ± 2.0 |
| Photograph /B | 56.7 ± 4.6 | 7.3 ± 1.2 | 24.0 ± 8.7 | 8.7 ± 4.2 | 7.3 ± 1.2 |
| Avg. | 50.4 ± 1.4 | 20.0 ± 1.4 | 47.6 ± 2.0 | 28.4 ± 1.5 | 35.5 ± 2.0 |
Key Findings:
- achieves the highest average success, benefiting from large-scale pre-training.
- The smaller DP-T (~100M params) trained from scratch is competitive, especially on bimanual tasks.
- DP-C excels at precise operations (button pressing, hinge interaction) likely due to its use of FiLM for observation injection, providing stronger fine-grained visual perception.
- Bimanual tasks (Unlock iPad, Hanoi, Assembly) are particularly difficult, with some policies never succeeding.
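To put the robustness gap in relative terms, the short calculation below uses the "Avg." row of Table 2. It assumes the first four numeric columns are DP-T and DP-C under rand-obj and rand-full, in that order, which is how the surviving header rows read.

```python
# Average success rates (%) from the "Avg." row of Table 2: (rand-obj, rand-full).
avg_success = {"DP-T": (50.4, 20.0), "DP-C": (47.6, 28.4)}


def relative_drop(rand_obj: float, rand_full: float) -> float:
    """Percentage of rand-obj success that is lost under rand-full."""
    return 100.0 * (rand_obj - rand_full) / rand_obj


drops = {model: round(relative_drop(*rates), 1) for model, rates in avg_success.items()}
print(drops)  # {'DP-T': 60.3, 'DP-C': 40.3}
```

Under that reading, DP-T loses about 60% of its average success when full visual randomization is enabled, while DP-C loses about 40%.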
2. Failure Mode Analysis
- Fine-grained Actions: Policies often locate objects but fail to perform precise interactions (e.g., clicking specific buttons).
- Insertion: High failure rate in tasks like Assembly and Hanoi.
- Temporal Memory: In Pinch Tongs, policies grasp but fail to execute the squeeze-release sequence.
- Sequential Errors: In Microwave, policies often place the object but then withdraw it with the hand.
3. Multi-task, Dynamics, and Action-Head Evaluations
Table 3: Multi-task, dynamics, and action-head evaluations. Success rate (%).
| Task | Multi-task | Rand-dynamics | Rand-AH () |
|---|---|---|---|
| DP-T | DP-T | ||
| Hammer Nail | 58.7 ± 5.0 | 86.7 ± 3.1 | 77.3 ± 6.4 |
| Click Mouse | 38.7 ± 3.1 | 80.7 ± 3.1 | 0.0 ± 0.0 |
| Pick Bucket | 55.3 ± 7.6 | 83.3 ± 8.1 | 80.7 ± 3.1 |
| Pinch Tongs | 6.0 ± 5.3 | 45.3 ± 6.1 | 15.3 ± 4.2 |
| Fold Glasses | 11.3 ± 5.0 | 42.0 ± 6.0 | 40.7 ± 4.6 |
| Water Plant | 60.0 ± 6.9 | 84.0 ± 4.0 | 76.0 ± 6.0 |
| Unlock iPad /B | 0.0 ± 0.0 | 0.7 ± 1.2 | 0.7 ± 1.2 |
| Hanoi /B | 8.0 ± 2.0 | 6.0 ± 0.0 | 29.3 ± 2.3 |
| Assembly /B | 1.3 ± 2.3 | 3.3 ± 2.3 | 8.0 ± 5.3 |
| Microwave /B | 42.7 ± 6.4 | 39.3 ± 13.0 | 70.0 ± 9.2 |
| Photograph /B | 28.0 ± 6.0 | 29.3 ± 1.2 | 59.3 ± |
| Avg. | 33.2 ± 2.4 | 45.5 ± |