DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

Summary (Overview)

  • Comprehensive Benchmark: Introduces DexJoCo, a benchmark with 11 functionally grounded tasks designed to evaluate the unique capabilities of dexterous hands: tool-use, bimanual coordination, long-horizon execution, and reasoning.
  • Low-Cost Toolkit: Develops a low-cost (~$2,300) teleoperation system using Rokoko gloves and HTC Vive trackers, paired with a self-supervised retargeting algorithm (GeoRT) for efficient collection of 1.1K high-quality human demonstration trajectories.
  • Extensive Evaluation: Benchmarks modern policies (ACT, Diffusion Policy, $\pi_{0.5}$, GR00T N1.5) under diverse settings, revealing key limitations: poor robustness to visual randomization, failure in fine-grained actions/insertion, and lack of true language generalization in Vision-Language-Action (VLA) models.
  • Critical Insights: Identifies major research gaps: the need for dexterous-hand-centric foundation models, the limitations of vision-only policies for contact-rich manipulation, and the challenge of sim-to-real transfer.

Introduction and Theoretical Foundation

Achieving human-level robotic manipulation necessitates dexterous hands capable of fine-grained, contact-rich interactions. While progress has been made with manipulator-gripper systems, advancing dexterous hand learning requires standardized benchmarks for systematic evaluation. Existing dexterous benchmarks suffer from several limitations:

  1. They often use hand-only setups, creating trajectories unrealistic for real-world manipulator-hand systems.
  2. Their tasks (e.g., in-hand manipulation, pick-and-place) lack functional diversity and fail to highlight the distinct advantages of dexterous hands over simple grippers.
  3. They lack reliable, user-friendly systems for collecting high-quality human demonstrations and often fall back on reinforcement learning or automated generation, which yields unnatural behaviors.
  4. They lack standardized language instructions and unified data formats compatible with modern VLA models.

DexJoCo is introduced to address these gaps. It provides a benchmark with functionally grounded tasks that require dexterous capabilities, a toolkit for low-cost data collection, and a dataset of human demonstrations to facilitate systematic training and evaluation of dexterous manipulation policies.

Comparison with Existing Benchmarks:

Attributes compared: Hand, Tool-Use, Bimanual, Reasoning, Hand MoCap System, Trajectory Collection Methods.

Benchmark        | Trajectory Collection Methods
CALVIN           | Motion Planning
LIBERO           | Human Demonstration
RoboTwin 2.0     | Motion Planning
DexMimicGen      | Few Human + MimicGen
Bi-DexHands      | RL Policy
DexJoCo (ours)   | Human Demonstration

Methodology

1. Robot Setup and Observation State

  • Simulator: Built on the MuJoCo physics simulator.
  • Robot System: Comprises a Rethink Robotics mount, a Franka Panda manipulator, and an Allegro Hand.
  • Observations: Include third-person/wrist-mounted RGB and RGB-D images, object poses, robot motion states, end-effector pose, and hand joint angles.
  • Action Space: Manipulator actions are target absolute end-effector poses; hand actions are target absolute joint angles.
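
The action layout described above can be sketched as a small container type. This is an illustrative sketch only: the field names, the position-plus-quaternion pose parameterization, and the `to_vector` helper are assumptions, not DexJoCo's actual API. The joint count reflects the Allegro Hand's 16 actuated joints.

```python
from dataclasses import dataclass
from typing import List

ALLEGRO_NUM_JOINTS = 16  # the Allegro Hand has 16 actuated joints (4 per finger)


@dataclass
class DexAction:
    """One control step: absolute targets for the arm and the hand."""
    ee_position: List[float]    # target end-effector position (x, y, z), metres
    ee_quaternion: List[float]  # target orientation (w, x, y, z); assumed layout
    hand_joints: List[float]    # target absolute joint angles, radians

    def __post_init__(self) -> None:
        if len(self.ee_position) != 3:
            raise ValueError("ee_position must have 3 entries")
        if len(self.ee_quaternion) != 4:
            raise ValueError("ee_quaternion must have 4 entries")
        if len(self.hand_joints) != ALLEGRO_NUM_JOINTS:
            raise ValueError(f"hand_joints must have {ALLEGRO_NUM_JOINTS} entries")

    def to_vector(self) -> List[float]:
        """Flatten to one vector of 3 + 4 + 16 = 23 values."""
        return [*self.ee_position, *self.ee_quaternion, *self.hand_joints]


action = DexAction([0.4, 0.0, 0.3], [1.0, 0.0, 0.0, 0.0], [0.0] * ALLEGRO_NUM_JOINTS)
print(len(action.to_vector()))
```

Under this assumed layout, a bimanual task would presumably carry one such action per arm-hand pair.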

2. Human Demonstration Data Collection System

  • Hardware: Uses Rokoko Smartgloves for hand pose capture and HTC Vive Trackers with Base Stations for wrist/end-effector tracking. Total cost ~$2,300.
  • Teleoperation Algorithm:
    • Hand Retargeting: Employs GeoRT, a lightweight self-supervised method. The retargeting model $f$ maps human fingertip keypoints $x_H$ to robot joint positions $q_R = f(x_H)$ by minimizing a composite loss $L = L_{dir} + \lambda_1 L_{cover} + \lambda_2 L_{flat} + \lambda_3 L_{pinch} + \lambda_4 L_{col}$, where $L_{dir}$ preserves motion direction, $L_{cover}$ enlarges the workspace, $L_{flat}$ ensures uniform sensitivity, $L_{pinch}$ preserves pinch behaviors, and $L_{col}$ avoids self-collisions.
    • Wrist Tracking: The tracker is fixed to align human wrist motion with the Franka end-effector. Actions are recorded as relative pose changes from an initial reference.
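
One way to read the wrist-tracking step is as composing the tracker's motion since a reference frame with the robot's reference end-effector pose, which turns a relative pose change into an absolute target. The sketch below assumes the tracker frame is already aligned with the end-effector frame and that poses are 4x4 homogeneous transforms. The function names are hypothetical and this is not the toolkit's code.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def translation(x: float, y: float, z: float) -> Matrix:
    """Build a pure-translation transform."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t: Matrix) -> Matrix:
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    rot_t = [[t[j][i] for j in range(3)] for i in range(3)]
    pos = [-sum(rot_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [rot_t[i] + [pos[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def retarget_wrist(tracker_ref: Matrix, tracker_now: Matrix, ee_ref: Matrix) -> Matrix:
    """Apply the tracker's motion since the reference frame to the robot's
    reference end-effector pose, giving an absolute target pose."""
    delta = matmul(invert_rigid(tracker_ref), tracker_now)  # relative pose change
    return matmul(ee_ref, delta)


# The wrist moves 10 cm along x, so the end-effector target also moves 10 cm along x.
target = retarget_wrist(translation(1.0, 2.0, 3.0),
                        translation(1.1, 2.0, 3.0),
                        translation(0.4, 0.0, 0.3))
print([round(row[3], 3) for row in target[:3]])
```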

3. Task Design

  • Formalization: A task $\mathcal{T} = (\mathcal{O}, \mathcal{G})$ is defined by interactive objects $\mathcal{O} = \{o_1, o_2, \ldots, o_m\}$ and goal constraints $\mathcal{G} = \{g_{seq}, g_{pose}, g_{joint}, g_{contact}\}$ (temporal, pose, joint-state, and contact conditions).
  • Design Principles:
    1. Functional Interaction: Tasks mimic everyday activities with explicit visual feedback.
    2. Dexterity Dependency: Success requires fine-grained finger coordination, impossible for parallel grippers.
    3. Long-Horizon Compositionality: Multi-stage execution with temporal dependencies.
    4. Bimanual Coordination: Requires coordinated, asymmetric two-hand manipulation.
  • Task Categories & Examples:
    • Tool-Use: Water Plant, Hammer Nail
    • Bimanual: Unlock iPad, Hanoi, Assembly, Microwave Cook, Photograph
    • Long-Horizon: Microwave Cook
    • Reasoning: Hanoi (Tower of Hanoi)
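
The formalization $\mathcal{T} = (\mathcal{O}, \mathcal{G})$ could be encoded as a list of objects plus an ordered list of goal predicates, where the ordering plays the role of $g_{seq}$. The sketch below is a hypothetical encoding: the `Task` class, the state keys, the stage breakdown of the microwave task, and all thresholds are invented for illustration and are not taken from the benchmark.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, float]              # hypothetical flat view of the simulator state
Constraint = Callable[[State], bool]  # one goal condition g in G


@dataclass
class Task:
    """T = (O, G): interactive objects plus ordered goal constraints."""
    objects: List[str]
    stages: List[Constraint]          # order encodes g_seq
    _next: int = field(default=0, init=False)

    def update(self, state: State) -> bool:
        """Advance past every stage the state satisfies, in order.
        Returns True once all stages have been completed."""
        while self._next < len(self.stages) and self.stages[self._next](state):
            self._next += 1
        return self._next == len(self.stages)


microwave = Task(
    objects=["microwave_door", "food", "start_button"],
    stages=[
        lambda s: s["door_angle"] > 1.0,         # g_joint: door opened
        lambda s: s["food_to_plate_m"] < 0.05,   # g_pose: food placed inside
        lambda s: s["door_angle"] < 0.05,        # g_joint: door closed
        lambda s: s["button_contact"] > 0.0,     # g_contact: start button pressed
    ],
)

trajectory = [
    {"door_angle": 0.0, "food_to_plate_m": 0.40, "button_contact": 1.0},  # pressed too early
    {"door_angle": 1.2, "food_to_plate_m": 0.40, "button_contact": 0.0},
    {"door_angle": 1.2, "food_to_plate_m": 0.02, "button_contact": 0.0},
    {"door_angle": 0.0, "food_to_plate_m": 0.02, "button_contact": 0.0},
    {"door_angle": 0.0, "food_to_plate_m": 0.02, "button_contact": 1.0},
]
progress = [microwave.update(state) for state in trajectory]
print(progress)
```

Because stages are checked in order, pressing the button before the door has been opened and closed does not count toward success, which is the kind of temporal dependency the long-horizon design principle describes.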

Domain Randomizations:

To evaluate policy robustness, domain randomization is applied via trajectory replay:

  • Visual: Randomizes third-person camera pose, lighting (direction/color), and table texture.
  • Physical: Randomizes object placement and table height.
  • Dynamics: Randomizes object mass, joint friction, and stiffness (for evaluation).
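
A per-episode randomization draw covering the three groups above might look like the following sketch. All ranges, texture names, and field names are placeholders chosen for illustration; the source does not state the actual sampling ranges, and the code does not touch the simulator.

```python
import random
from dataclasses import dataclass
from typing import Tuple

TABLE_TEXTURES = ("wood", "marble", "cloth")  # hypothetical texture names


@dataclass(frozen=True)
class SceneRandomization:
    camera_shift_m: Tuple[float, float, float]  # visual: third-person camera offset
    light_rgb: Tuple[float, float, float]       # visual: light color
    table_texture: str                          # visual: table texture
    object_shift_m: Tuple[float, float]         # physical: object placement offset
    table_height_shift_m: float                 # physical: table height offset
    mass_scale: float                           # dynamics: object mass multiplier
    friction_scale: float                       # dynamics: joint friction multiplier
    stiffness_scale: float                      # dynamics: stiffness multiplier


def sample_randomization(seed: int, visual: bool = True, physical: bool = True,
                         dynamics: bool = False) -> SceneRandomization:
    """Draw one scene configuration; disabled groups keep nominal values."""
    rng = random.Random(seed)

    def jitter(enabled: bool, nominal: float, spread: float) -> float:
        return nominal + rng.uniform(-spread, spread) if enabled else nominal

    return SceneRandomization(
        camera_shift_m=tuple(jitter(visual, 0.0, 0.05) for _ in range(3)),
        light_rgb=tuple(jitter(visual, 0.9, 0.1) for _ in range(3)),
        table_texture=rng.choice(TABLE_TEXTURES) if visual else TABLE_TEXTURES[0],
        object_shift_m=tuple(jitter(physical, 0.0, 0.03) for _ in range(2)),
        table_height_shift_m=jitter(physical, 0.0, 0.02),
        mass_scale=jitter(dynamics, 1.0, 0.2),
        friction_scale=jitter(dynamics, 1.0, 0.2),
        stiffness_scale=jitter(dynamics, 1.0, 0.2),
    )


sample = sample_randomization(seed=0, dynamics=True)
print(sample)
```

Seeding the generator per episode keeps each randomized scene reproducible, which matters when the same demonstration is replayed under several configurations.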

4. Imitation Learning Policy Evaluation

  • Baseline Models: ACT, Diffusion Policy (DP-T: Transformer, DP-C: CNN), $\pi_{0.5}$, GR00T N1.5.
  • Action Chunking: All baselines model the conditional probability of a future $k$-step action chunk given $h$ frames of history: $P(a_{t:t+k-1}) = \pi_{\theta}(a_{t:t+k-1} \mid s_{t-h+1:t}, l)$, where $l$ is an optional language instruction.
  • Model Deployment: Uses an asynchronous inference mechanism to generate the next action chunk while executing the current one, improving reactivity.
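
The deployment idea, predicting the next chunk while the current one executes, can be sketched with a single background worker. This is a toy illustration and not the benchmark's deployment code: `run_async`, the integer stand-ins for observations and actions, and the dummy policy are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Observation = int  # stand-in for the h-frame history s_{t-h+1:t} and instruction l
Action = int       # stand-in for one arm-plus-hand target
Policy = Callable[[Observation], List[Action]]


def run_async(policy: Policy,
              get_obs: Callable[[], Observation],
              apply_action: Callable[[Action], None],
              num_chunks: int) -> None:
    """Execute num_chunks action chunks, predicting each next chunk in a
    background thread while the current chunk is being executed."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(policy, get_obs())
        for i in range(num_chunks):
            chunk = pending.result()  # waits only if inference is slower than execution
            if i + 1 < num_chunks:
                # Start predicting the next chunk from the newest observation.
                pending = pool.submit(policy, get_obs())
            for action in chunk:
                apply_action(action)


# Toy usage: the "policy" repeats the observation it was given K times,
# and the "observation" is the number of actions executed so far.
K = 3
executed: List[Action] = []
run_async(policy=lambda obs: [obs] * K,
          get_obs=lambda: len(executed),
          apply_action=executed.append,
          num_chunks=3)
print(executed)
```

In this simple scheme every chunk after the first is conditioned on an observation taken one chunk earlier, so shorter chunks reduce staleness at the cost of more frequent inference.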

Empirical Validation / Results

1. Benchmark Performance Under Randomization

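The benchmark proves highly challenging: no reported average success rate in Table 2 exceeds roughly 50%. Performance drops sharply under full visual randomization ("rand-full"); for example, DP-T's average success rate falls from 50.4% under rand-obj to 20.0% under rand-full.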

Table 2: Performance comparison on benchmark tasks. Mean success rate (%) ± std. Tasks marked /B are bimanual. Models compared: DP-T, DP-C, ACT, $\pi_{0.5}$, GR00T N1.5; the columns below cover DP-T and DP-C under rand-obj and rand-full, and ACT under rand-obj.

Task           | DP-T rand-obj | DP-T rand-full | DP-C rand-obj | DP-C rand-full | ACT rand-obj
Hammer Nail    | 81.3 ± 3.1    | 18.7 ± 1.2     | 58.7 ± 4.2    | 19.3 ± 3.1     | 50.0 ± 7.2
Click Mouse    | 62.0 ± 2.0    | 25.3 ± 8.1     | 74.0 ± 5.3    | 34.7 ± 4.2     | 61.3 ± 3.1
Pick Bucket    | 83.3 ± 3.1    | 58.7 ± 15.0    | 70.0 ± 2.0    | 68.0 ± 3.5     | 64.0 ± 4.0
Pinch Tongs    | 22.7 ± 5.8    | 18.7 ± 3.1     | 57.3 ± 6.4    | 28.7 ± 11.7    | 31.3 ± 3.1
Fold Glasses   | 53.3 ± 3.1    | 11.3 ± 1.2     | 54.0 ± 15.9   | 15.3 ± 7.6     | 47.3 ± 11.0
Water Plant    | 84.0 ± 3.5    | 56.0 ± 8.7     | 63.3 ± 3.1    | 54.0 ± 5.3     | 47.3 ± 4.6
Unlock iPad /B | 8.0 ± 2.0     | 2.0 ± 2.0      | 52.0 ± 2.0    | 12.0 ± 3.5     | 9.3 ± 3.1
Hanoi /B       | 24.7 ± 4.6    | 0.7 ± 1.2      | 12.7 ± 3.1    | 9.3 ± 6.1      | 6.0 ± 2.0
Assembly /B    | 4.7 ± 3.1     | 0.0 ± 0.0      | 3.3 ± 1.2     | 0.0 ± 0.0      | 0.0 ± 0.0
Microwave /B   | 73.3 ± 11.6   | 21.3 ± 4.6     | 54.0 ± 12.5   | 62.7 ± 6.4     | 66.0 ± 2.0
Photograph /B  | 56.7 ± 4.6    | 7.3 ± 1.2      | 24.0 ± 8.7    | 8.7 ± 4.2      | 7.3 ± 1.2
Avg.           | 50.4 ± 1.4    | 20.0 ± 1.4     | 47.6 ± 2.0    | 28.4 ± 1.5     | 35.5 ± 2.0
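
As a quick arithmetic check, the Avg. row can be reproduced as the unweighted mean over the 11 tasks. The sketch assumes the first two data columns of Table 2 are DP-T under rand-obj and rand-full, as the sub-header order suggests, and that the first six rows are the single-hand tasks.

```python
from statistics import mean

# Success rates (%) per task, in Table 2's row order (assumed column mapping).
dp_t_rand_obj = [81.3, 62.0, 83.3, 22.7, 53.3, 84.0, 8.0, 24.7, 4.7, 73.3, 56.7]
dp_t_rand_full = [18.7, 25.3, 58.7, 18.7, 11.3, 56.0, 2.0, 0.7, 0.0, 21.3, 7.3]

avg_obj = round(mean(dp_t_rand_obj), 1)    # compare with the Avg. row, column 1
avg_full = round(mean(dp_t_rand_full), 1)  # compare with the Avg. row, column 2

# Split rand-obj results into single-hand tasks (first 6 rows) and /B tasks (last 5).
single_hand_avg = round(mean(dp_t_rand_obj[:6]), 1)
bimanual_avg = round(mean(dp_t_rand_obj[6:]), 1)

print(avg_obj, avg_full, single_hand_avg, bimanual_avg)
```

Both recomputed averages match the reported 50.4 and 20.0, and the split shows the gap between single-hand and bimanual tasks that the findings below describe.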

Key Findings:

  • $\pi_{0.5}$ achieves the highest average success, benefiting from large-scale pre-training.
  • The smaller DP-T (~100M params) trained from scratch is competitive, especially on bimanual tasks.
  • DP-C excels at precise operations (button pressing, hinge interaction) likely due to its use of FiLM for observation injection, providing stronger fine-grained visual perception.
  • Bimanual tasks (Unlock iPad, Hanoi, Assembly) are particularly difficult, with some policies never succeeding.

2. Failure Mode Analysis

  • Fine-grained Actions: Policies often locate objects but fail to perform precise interactions (e.g., clicking specific buttons).
  • Insertion: High failure rate in tasks like Assembly and Hanoi.
  • Temporal Memory: In Pinch Tongs, policies grasp but fail to execute the squeeze-release sequence.
  • Sequential Errors: In Microwave, policies often place the object but then withdraw it with the hand.

3. Multi-task, Dynamics, and Action-Head Evaluations

Table 3: Multi-task, dynamics, and action-head evaluations. Success rate (%). Settings compared: Multi-task, Rand-dynamics, Rand-AH ($\pi_{0.5}$); the columns below cover Multi-task (DP-T, $\pi_{0.5}$) and Rand-dynamics (DP-T).

Task           | Multi-task DP-T | Multi-task $\pi_{0.5}$ | Rand-dynamics DP-T
Hammer Nail    | 58.7 ± 5.0      | 86.7 ± 3.1             | 77.3 ± 6.4
Click Mouse    | 38.7 ± 3.1      | 80.7 ± 3.1             | 0.0 ± 0.0
Pick Bucket    | 55.3 ± 7.6      | 83.3 ± 8.1             | 80.7 ± 3.1
Pinch Tongs    | 6.0 ± 5.3       | 45.3 ± 6.1             | 15.3 ± 4.2
Fold Glasses   | 11.3 ± 5.0      | 42.0 ± 6.0             | 40.7 ± 4.6
Water Plant    | 60.0 ± 6.9      | 84.0 ± 4.0             | 76.0 ± 6.0
Unlock iPad /B | 0.0 ± 0.0       | 0.7 ± 1.2              | 0.7 ± 1.2
Hanoi /B       | 8.0 ± 2.0       | 6.0 ± 0.0              | 29.3 ± 2.3
Assembly /B    | 1.3 ± 2.3       | 3.3 ± 2.3              | 8.0 ± 5.3
Microwave /B   | 42.7 ± 6.4      | 39.3 ± 13.0            | 70.0 ± 9.2
Photograph /B  | 28.0 ± 6.0      | 29.3 ± 1.2             | 59.3 ±
Avg.           | 33.2 ± 2.4      | 45.5 ±                 |