Summary (Overview)
- DragMesh‑2 is a contact‑driven framework for dexterous hand–articulated‑object interaction. The policy controls only the hand; the target joint has no action channel and must move through physical hand–handle contact.
- PICA (Physically Informed Contact‑Aware) training injects observable physical signals into policy learning without tactile or force feedback, improving robustness under changing contact loads.
- The approach is evaluated on seven GAPartNet objects at three damping multipliers (×1 nominal, ×2 mild, ×4 OOD). PICA attains the highest mean success in all six mode × damping settings (deterministic and stochastic execution at each multiplier).
- A pure‑geometry dexterous interaction dataset (277 trajectories, 7 categories) is released to support future loco‑manipulation and humanoid hand–object interaction research.
- Key finding: nominal success alone is misleading. Longer training improves nominal success but collapses OOD robustness and drives action saturation toward 1.0, motivating a protocol that reports both OOD damping results and contact‑aware diagnostics.
Introduction and Theoretical Foundation
Dexterous hand interaction with articulated objects is important for household, assistive, and humanoid manipulation. Unlike static objects, articulated objects cannot be directly controlled – their motion must emerge through sustained hand–object contact. Prior work (including DragMesh 1) focused on object‑centric articulated generation, but the transition to hand‑driven physical interaction is non‑trivial: geometric trajectory replay or open‑loop execution does not model the contact dynamics required to move the articulated part.
Existing RL methods are typically trained under fixed dynamics and optimize only task completion. Without tactile or force feedback, policies overfit nominal dynamics and rely on “dynamics shortcuts” rather than stable contact behaviors. Success under nominal damping does not imply stable contact behavior under contact‑load shifts.
DragMesh‑2 formulates articulated‑object manipulation as a problem that must be completed through real hand–handle interaction. PICA addresses the robustness gap by explicitly injecting physically informed signals into policy learning through contact‑aware constraints and dynamics randomization.
Methodology
3.1 Contact‑Driven Task Formulation
- Hand: 51‑DoF SMPL‑X hand (6 virtual wrist DoFs + 45 finger joints). Policy outputs a 51‑dimensional increment to the PD target, clipped to joint limits.
- Object: No action channel; the target part moves only through hand–handle contact.
- Success threshold:
- Task progress (normalized per object):
- Observation: hand joint positions and velocities, handle pose, relative palm–handle geometry, target‑joint state, task‑scale features (progress, remaining distance). No RGB, depth, point clouds, force, or tactile signals.
- Reference trajectories are used only for grasp initialization, target motion scale, and tracking baseline – not as object‑control commands or expert labels.
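The action interface described in the Hand bullet can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `apply_increment`, the per-step scale `max_delta`, the normalized action range, and the joint limits are all hypothetical. The source only states that the policy outputs a 51-dimensional increment to the PD target, clipped to joint limits.

```python
import numpy as np

N_DOF = 51  # 6 virtual wrist DoFs + 45 finger joints (from the text)

def apply_increment(pd_target, action, lower, upper, max_delta=0.05):
    """Add a scaled policy increment to the PD target and clip to joint limits.

    max_delta is a hypothetical per-step scale; the source does not give one.
    """
    action = np.clip(action, -1.0, 1.0)          # assumed normalized action range
    new_target = pd_target + max_delta * action  # incremental, not absolute, command
    return np.clip(new_target, lower, upper)     # respect joint limits

# Example with made-up symmetric limits of +/- 1 rad on every DoF.
lower = -np.ones(N_DOF)
upper = np.ones(N_DOF)
target = np.zeros(N_DOF)
target = apply_increment(target, np.ones(N_DOF), lower, upper)
```

Because the command is an increment, a policy that keeps emitting saturated actions pushes the PD target steadily toward the joint limits, which is one way the saturation behavior discussed later in the summary could arise.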
3.2 Physically Informed Contact‑Aware Learning (PICA)
PICA augments PPO with physical signals at both environment and policy levels. Key components:
- History token:
- Causal‑window auxiliary head predicts four observable contact responses from the GLA‑encoded history:
  These four targets describe recent object response, maximum palm–handle distance, detachment risk, and tracking stress.
- Reward augments task progress with contact maintenance, action regularization, detachment handling, and successful termination:
- Training loss:
- Robustness evaluation over the damping set and execution mode:
- Diagnostics: clip099 (fraction of steps with action magnitude > 0.99) and detach_proxy (detachment‑failure rate) are reported alongside task success.
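The two diagnostics can be computed from logged rollouts roughly as follows. This is a sketch under assumptions: the source defines clip099 as the fraction of steps with action magnitude above 0.99 but does not say how a multi-dimensional action is reduced to one magnitude, so the code below treats a step as saturated when any component exceeds the threshold. The per-episode detachment flag is likewise hypothetical.

```python
import numpy as np

def clip099(actions, threshold=0.99):
    """Fraction of steps whose largest |action| component exceeds threshold.

    actions: array of shape (T, action_dim). Counting a step as saturated when
    any component exceeds the threshold is an assumption, not from the source.
    """
    actions = np.asarray(actions, dtype=float)
    saturated = np.abs(actions).max(axis=1) > threshold
    return float(saturated.mean())

def detach_proxy(detached_flags):
    """Fraction of episodes flagged as detachment failures (hypothetical flag)."""
    return float(np.mean(np.asarray(detached_flags, dtype=float)))

# Toy rollout: steps 2 and 3 contain a component above 0.99 in magnitude.
actions = np.array([[0.2, -0.5], [1.0, 0.1], [0.3, -0.995], [0.0, 0.0]])
sat = clip099(actions)
det = detach_proxy([True, False, False, False])
```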
3.3 Dataset
The dataset is generated heuristically from GAPartNet geometry, with no learning involved. Each trajectory is a phased interaction (approach, grasp, drag, release) stored as per‑frame wrist poses and finger configurations.
| Category | # Traj. |
|---|---|
| StorageFurniture | 256 |
| TrashCan | 7 |
| Dishwasher | 5 |
| Refrigerator | 4 |
| Oven | 3 |
| Microwave | 1 |
| TableObject | 1 |
| Total | 277 |
The dataset provides expert grasp initialization, motion‑scale normalization, and a non‑learned tracking baseline.
Empirical Validation / Results
Experimental Setup
- 7 GAPartNet objects across 3 categories (Dishwasher, StorageFurniture, Microwave) and 2 joint types (5 revolute doors, 2 prismatic drawers).
- Damping multipliers: ×1 (nominal), ×2 (mild shift), ×4 (strong OOD shift). Each (method, object, damping, mode) cell uses 20 episodes.
- Baselines: trajectory‑tracking replay, GT‑part‑pose parallel‑jaw primitive, state‑only PPO, flat‑history PPO, GRU‑PPO, Transformer‑PPO, and ablations (PICA w/o physical signals, PICA w/o GLA encoder).
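The per-cell protocol above (20 episodes for each method, object, damping, and mode) suggests a simple aggregation: compute a success rate per object, then average over objects within each damping/mode setting. The sketch below assumes that two-stage averaging. The function name, the dictionary layout, and the object identifiers are hypothetical, and the outcomes are synthetic rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean

def average_success(cells):
    """Average per-object success rates into one value per (damping, mode).

    cells maps (object_id, damping, mode) -> list of 0/1 episode outcomes.
    """
    per_setting = defaultdict(list)
    for (obj, damping, mode), outcomes in cells.items():
        per_setting[(damping, mode)].append(mean(outcomes))  # per-object rate
    return {k: mean(v) for k, v in per_setting.items()}

# Synthetic outcomes for two hypothetical objects, 20 episodes per cell.
cells = {
    ("obj_a", 1, "det"): [1] * 20,
    ("obj_b", 1, "det"): [1] * 10 + [0] * 10,
    ("obj_a", 4, "det"): [1] * 5 + [0] * 15,
    ("obj_b", 4, "det"): [0] * 20,
}
summary = average_success(cells)
```

Averaging per object first keeps every object equally weighted even if some cells end up with a different episode count.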
Main Comparison
Figure 2 summarizes average success over all objects. Key findings from Table 2:
- Trajectory tracking achieves 1.00 at ×1 but drops to 0.71 at ×2 and ×4 (two objects lose contact).
- Parallel‑jaw primitive succeeds on only one object (0.14 mean) and is damping‑invariant.
- PICA (Ours) attains the best mean in every damping/mode column: deterministic success is 0.89 at ×1, 0.80 at ×2, and 0.56 at ×4. At ×4, the learned baselines reach only 0.27 (state‑only PPO), 0.32 (flat‑history PPO), 0.30 (GRU‑PPO), and 0.09 (Transformer‑PPO) in deterministic mode.
- Adding richer temporal encoders alone does not close the gap; the combination of physical signals + GLA encoder is essential.
Table 2: Per‑object success (deterministic/stochastic) across damping multipliers.
| Method | Damp | Avg (det/stoch) |
|---|---|---|
| Traj. tracking | ×1 | 1.00 / – |
| Traj. tracking | ×2 | 0.71 / – |
| Traj. tracking | ×4 | 0.71 / – |
| Parallel‑Jaw | ×1 | 0.14 / – |
| State‑only PPO | ×1 | 0.58 / 0.44 |
| State‑only PPO | ×4 | 0.27 / 0.26 |
| Flat‑history PPO | ×4 | 0.32 / 0.21 |
| GRU‑PPO | ×4 | 0.30 / 0.28 |
| Transformer‑PPO | ×4 | 0.09 / 0.04 |
| PICA (Ours) | ×1 | 0.89 / 0.82 |
| PICA (Ours) | ×2 | 0.80 / 0.72 |
| PICA (Ours) | ×4 | 0.56 / 0.43 |
Ablation Study
| Method | ×1 (det) | ×4 (det) |
|---|---|---|
| w/o PICA (GLA only) | 0.65 | 0.36 |
| w/o GLA (PICA only, flat history) | 0.75 | 0.43 |
| PICA (full) | 0.89 | 0.56 |
The components are complementary: physical signals help more under nominal damping, the temporal encoder helps more under stochastic mid‑damping, and the full model exceeds either component by ≥0.13 at ×4.
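The margins quoted above follow directly from the deterministic columns of the ablation table. The short check below recomputes them. The dictionary keys and the helper name `margins` are illustrative choices, and only the numbers come from the table.

```python
# Deterministic success from the ablation table.
ablation = {
    "w/o PICA (GLA only)": {"x1": 0.65, "x4": 0.36},
    "w/o GLA (PICA only, flat history)": {"x1": 0.75, "x4": 0.43},
    "PICA (full)": {"x1": 0.89, "x4": 0.56},
}

def margins(table, full_key="PICA (full)"):
    """Gain of the full model over each ablated variant, per damping column."""
    full = table[full_key]
    return {
        name: {col: round(full[col] - row[col], 2) for col in full}
        for name, row in table.items()
        if name != full_key
    }

gains = margins(ablation)
for name, g in gains.items():
    print(name, g)
```

At ×4 the gains are 0.20 over the GLA-only variant and 0.13 over the PICA-only variant, so the smaller of the two matches the "≥0.13" figure in the text.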
Nominal Success Can Mask Saturation Collapse
Table 4: Training‑length study on a single object (base policy, no contact fine‑tuning).
| Training epochs | Succ. ×1 | Succ. ×4 | clip099 ×4 |
|---|---|---|---|
| 150 | 0.90 | 0.55 | 0.90 |
| 200 | 0.90 | 0.50 | 0.97 |
| 300 | 1.00 | 0.10 | 0.99 |
| 500 | 1.00 | 0.10 | 0.99 |
As training extends from 150 to 500 epochs, nominal success rises from 0.90 to 1.00, but ×4 success collapses from 0.55 to 0.10 and action saturation (clip099) approaches 1.0. This motivates reporting diagnostics alongside success and selecting checkpoints by OOD robustness rather than by training reward.
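The checkpoint-selection point can be made concrete with the four rows of the training-length table. The sketch below compares two selection rules. The tie-breaking choices (earliest checkpoint on nominal ties, lower clip099 on OOD ties) are assumptions for illustration and are not specified in the source.

```python
# Rows from the training-length table: (epochs, succ_x1, succ_x4, clip099_x4).
checkpoints = [
    (150, 0.90, 0.55, 0.90),
    (200, 0.90, 0.50, 0.97),
    (300, 1.00, 0.10, 0.99),
    (500, 1.00, 0.10, 0.99),
]

def select_by_nominal(rows):
    """Pick the checkpoint with the highest x1 success (earliest on ties)."""
    return max(rows, key=lambda r: r[1])[0]

def select_by_ood(rows):
    """Pick the highest x4 success; break ties by lower clip099 (assumed rule)."""
    return max(rows, key=lambda r: (r[2], -r[3]))[0]

nominal_pick = select_by_nominal(checkpoints)  # 300
ood_pick = select_by_ood(checkpoints)          # 150
```

Selecting on nominal success picks the 300-epoch checkpoint, whose ×4 success is 0.10, while selecting on ×4 success picks the 150-epoch checkpoint at 0.55.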
Theoretical and Practical Implications
- Theoretical: The work demonstrates that contact‑conditioned policy learning, even without force/tactile feedback, can improve robustness by injecting observable physical proxies (object response, detachment risk, tracking stress) into the training loss. This shifts learning from task‑progress‑only optimization toward contact‑conditioned interaction.
- Practical: DragMesh‑2 provides a reproducible evaluation protocol for dexterous articulated‑object manipulation, including contact‑aware diagnostics. The released dataset (277 trajectories) offers geometry‑guided grasp initialization, task‑scale normalization, and tracking references for future work.
- Limitation: The policy still relies on a position‑increment action interface and tends toward action saturation under strong contact load (success drops from 0.89 at ×1 to 0.56 at ×4). Per‑object heterogeneity remains. Without force or tactile feedback, contact state is inferred only from kinematic error, which is insufficient for stable light pulling at high damping.
Conclusion
DragMesh‑2 extends articulated interaction from object‑centric generation to hand‑driven physical interaction, showing that nominal task success does not guarantee stable contact behavior. PICA improves robustness by adding physically informed training signals, dynamics randomization, and temporal contact‑response modeling without force or tactile feedback. Across seven GAPartNet objects under nominal, moderate, and out‑of‑distribution damping, DragMesh‑2 achieves stronger robustness than competing methods. The paper releases a pure‑geometry interaction dataset for future whole‑body loco‑manipulation and humanoid hand–object interaction research. Future work should enrich the contact interface with force/tactile feedback and extend the approach to whole‑body control by coupling the upper‑body contact interaction with humanoid locomotion.