Visual Summary | DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects

Summary (Overview)

DragMesh‑2 is a contact‑driven framework for dexterous hand–articulated‑object interaction. The policy controls only the hand; the target joint has no action channel and must move through physical hand–handle contact.
PICA (Physically Informed Contact‑Aware) training injects observable physical signals into policy learning without tactile or force feedback, improving robustness under changing contact loads.
The approach is evaluated on seven GAPartNet objects across three damping multipliers (×1 nominal, ×2 mild shift, ×4 out‑of‑distribution, OOD). PICA attains the highest mean success in all six settings (two execution modes, deterministic and stochastic, at each of the three damping multipliers).
A pure‑geometry dexterous interaction dataset (277 trajectories, 7 categories) is released to support future loco‑manipulation and humanoid hand–object interaction research.
Key finding: nominal success alone is misleading. Longer training raises nominal success while OOD success collapses and action saturation approaches its maximum, motivating a protocol that reports OOD damping results and contact‑aware diagnostics alongside nominal success.

Introduction and Theoretical Foundation

Dexterous hand interaction with articulated objects is important for household, assistive, and humanoid manipulation. Unlike static objects, articulated objects cannot be directly controlled – their motion must emerge through sustained hand–object contact. Prior work (including DragMesh 1) focused on object‑centric articulated generation, but the transition to hand‑driven physical interaction is non‑trivial: geometric trajectory replay or open‑loop execution does not model the contact dynamics required to move the articulated part.

Existing RL methods are typically trained under fixed dynamics and optimize only task completion. Without tactile or force feedback, policies overfit nominal dynamics and rely on “dynamics shortcuts” rather than stable contact behaviors. Success under nominal damping does not imply stable contact behavior under contact‑load shifts.

DragMesh‑2 formulates articulated‑object manipulation as a task that can only be completed through physical hand–handle contact. PICA addresses the robustness gap by explicitly injecting physically informed signals into policy learning through contact‑aware constraints and dynamics randomization.

Methodology

3.1 Contact‑Driven Task Formulation

Hand: 51‑DoF SMPL‑X hand (6 virtual wrist DoFs + 45 finger joints). Policy outputs a 51‑dimensional increment to the PD target, clipped to joint limits.
Object: No action channel; the target part moves only through hand–handle contact.
Success threshold: $q_{\text{done}} = q_{\text{traj}}^{\min} + \rho\,(q_{\text{traj}}^{\max} - q_{\text{traj}}^{\min}) \qquad (1)$
Task progress (normalized per object): $p_t = \max\left(0,\ \frac{q_t^o - q^{\text{start}}}{q^{\text{goal}} - q^{\text{start}}}\right) \qquad (2)$
Observation: hand joint positions and velocities, handle pose, relative palm–handle geometry, target‑joint state, task‑scale features (progress, remaining distance). No RGB, depth, point clouds, force, or tactile signals.
Reference trajectories are used only for grasp initialization, the target motion scale, and the tracking baseline – not as object‑control commands or expert labels.
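Equations (1) and (2) and the increment‑style action interface can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the `action_scale` parameter, and the example value of ρ are assumptions that do not appear in the summary.

```python
from typing import List, Sequence


def success_threshold(q_traj: Sequence[float], rho: float) -> float:
    """Eq. (1): q_done = q_min + rho * (q_max - q_min) over the reference trajectory."""
    q_min, q_max = min(q_traj), max(q_traj)
    return q_min + rho * (q_max - q_min)


def task_progress(q_obj: float, q_start: float, q_goal: float) -> float:
    """Eq. (2): normalized progress, clamped below at 0 (no upper clamp in Eq. 2)."""
    return max(0.0, (q_obj - q_start) / (q_goal - q_start))


def step_pd_target(
    pd_target: Sequence[float],
    action: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
    action_scale: float = 0.05,  # hypothetical scale, not given in the summary
) -> List[float]:
    """Add a scaled increment to each PD target and clip it to its joint limits."""
    return [
        min(max(q + action_scale * a, lo), hi)
        for q, a, lo, hi in zip(pd_target, action, lower, upper)
    ]


# Example with made-up numbers: a joint whose reference trajectory spans [0, 1] rad.
q_done = success_threshold([0.0, 0.5, 1.0], rho=0.8)      # rho = 0.8 is illustrative only
progress = task_progress(q_obj=0.25, q_start=0.0, q_goal=1.0)
```

Note that the object joint never appears as an argument of `step_pd_target`: only the 51 hand targets are actuated, which is the point of the contact‑driven formulation.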

3.2 Physically Informed Contact‑Aware Learning (PICA)

PICA augments PPO with physical signals at both environment and policy levels. Key components:

History token:
$h_t = [e_t,\ a_{t-1}], \quad e_t = q_t^{\text{PD}} - q_t^h \qquad (3)$
Causal‑window auxiliary head predicts four observable contact responses from the GLA‑encoded history:
$y_t = \left[ q_t^o - q_{t-K}^o,\ \max_{\tau\in[t-K,t]} d_\tau,\ \mathbb{1}\!\left(\max_{\tau\in[t-K,t]} d_\tau > d_{\text{detach}}\right),\ \max_{\tau\in[t-K,t]} \|e_\tau\|_2\right] \qquad (4)$
These four targets describe recent object response, maximum palm–handle distance, detachment risk, and tracking stress.
Reward augments task progress with contact maintenance, action regularization, detachment handling, and successful termination:
$r_t = r_{\text{task}} + r_{\text{dist}} + r_{\text{act}} + r_{\text{time}} + r_{\text{detach}} + r_{\text{success}} + r_{\text{bound}} + r_{\text{contact}} \qquad (5)$
Training loss:
$\mathcal{L} = \mathcal{L}_{\text{PPO}} + c_v \mathcal{L}_V + c_b \mathcal{L}_{\text{bounds}} + w_{\text{aux}} \mathcal{L}_{\text{aux}} \qquad (6)$
Robustness evaluation over damping set $\mathcal{B}=\{\times1,\times2,\times4\}$ and execution mode $m\in\{\text{det},\text{stoch}\}$:
$\bar{S}_m = \frac{1}{|\mathcal{B}|}\sum_{b\in\mathcal{B}} S_{b,m}, \quad S_m^{\text{worst}} = \min_{b\in\mathcal{B}} S_{b,m} \qquad (7)$
Diagnostics: clip099 (fraction of steps with action magnitude > 0.99) and detach_proxy (detachment‑failure rate) are reported alongside task success.
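The history token of Eq. (3), the four auxiliary targets of Eq. (4), and the clip099 diagnostic can be computed from logged rollouts roughly as follows. This is a sketch under stated assumptions: the window is truncated at the episode start, and a step counts as saturated when any action dimension exceeds the threshold in magnitude; the summary specifies neither detail, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def history_token(pd_target: Sequence[float], q_hand: Sequence[float],
                  prev_action: Sequence[float]) -> List[float]:
    """Eq. (3): h_t = [e_t, a_{t-1}] with tracking error e_t = q_PD - q_hand."""
    e_t = [p - q for p, q in zip(pd_target, q_hand)]
    return e_t + list(prev_action)


def aux_targets(q_obj: Sequence[float], dist: Sequence[float],
                track_err: Sequence[Sequence[float]],
                t: int, K: int, d_detach: float) -> List[float]:
    """Eq. (4): object response, max palm-handle distance, detachment flag, tracking stress."""
    lo = max(0, t - K)  # assumption: truncate the window at the episode start
    max_dist = max(dist[lo:t + 1])
    max_stress = max(math.sqrt(sum(x * x for x in e)) for e in track_err[lo:t + 1])
    return [
        q_obj[t] - q_obj[lo],         # recent object response
        max_dist,                     # maximum palm-handle distance in the window
        float(max_dist > d_detach),   # detachment-risk indicator
        max_stress,                   # largest L2 tracking error in the window
    ]


def clip099(actions: Sequence[Sequence[float]], thresh: float = 0.99) -> float:
    """Fraction of steps whose largest action magnitude exceeds `thresh` (assumed definition)."""
    saturated = sum(1 for a in actions if max(abs(x) for x in a) > thresh)
    return saturated / len(actions)
```

All four targets depend only on quantities already present in the observation or the controller state, which is why they can supervise the auxiliary head without force or tactile sensing.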

3.3 Dataset

The dataset is generated heuristically from GAPartNet geometry, with no learning involved. Each trajectory is a phased interaction (approach, grasp, drag, release) stored as per‑frame wrist poses and finger configurations.

Category	# Traj.
StorageFurniture	256
TrashCan	7
Dishwasher	5
Refrigerator	4
Oven	3
Microwave	1
TableObject	1
Total	277

The dataset provides expert grasp initialization, motion‑scale normalization, and a non‑learned tracking baseline.
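A quick tally of the category counts above makes the class imbalance explicit. The snippet only re‑uses the numbers from the table; the variable names are illustrative.

```python
# Trajectory counts per category, copied from the dataset table.
counts = {
    "StorageFurniture": 256,
    "TrashCan": 7,
    "Dishwasher": 5,
    "Refrigerator": 4,
    "Oven": 3,
    "Microwave": 1,
    "TableObject": 1,
}

total = sum(counts.values())  # should match the reported total of 277
shares = {name: n / total for name, n in counts.items()}

for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {counts[name]:4d}  {share:6.1%}")
```

StorageFurniture accounts for roughly 92% of all trajectories, so per‑category statistics for the remaining six categories rest on very few samples.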

Empirical Validation / Results

Experimental Setup

7 GAPartNet objects across 3 categories (Dishwasher, StorageFurniture, Microwave) and 2 joint types (5 revolute doors, 2 prismatic drawers).
Damping multipliers: ×1 (nominal), ×2 (mild shift), ×4 (strong OOD shift). Each (method, object, damping, mode) cell uses 20 episodes.
Baselines: trajectory‑tracking replay, GT‑part‑pose parallel‑jaw primitive, state‑only PPO, flat‑history PPO, GRU‑PPO, Transformer‑PPO, and ablations (PICA w/o physical signals, PICA w/o GLA encoder).
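The size of the evaluation grid follows directly from the setup above. The sketch below enumerates the cells for one learned method; the object identifiers are placeholders, since the summary does not name the seven instances.

```python
from itertools import product

objects = [f"obj_{i}" for i in range(7)]  # placeholder IDs for the 7 GAPartNet objects
dampings = [1, 2, 4]                      # damping multipliers x1, x2, x4
modes = ["det", "stoch"]                  # deterministic / stochastic execution
EPISODES_PER_CELL = 20

# One cell per (object, damping, mode) combination for a given method.
cells = list(product(objects, dampings, modes))

episodes_per_method = len(cells) * EPISODES_PER_CELL    # 42 cells * 20 episodes
episodes_per_column = len(objects) * EPISODES_PER_CELL  # episodes behind one averaged entry

print(len(cells), episodes_per_method, episodes_per_column)
```

With 20 episodes per cell, per‑object success is resolved in steps of 0.05, and each averaged (damping, mode) entry pools 140 episodes, assuming all objects are weighted equally.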

Main Comparison

Figure 2 summarizes average success over all objects. Key findings from Table 2:

Trajectory tracking achieves 1.00 at ×1 but drops to 0.71 at ×2 and ×4 (two objects lose contact).
Parallel‑jaw primitive succeeds on only one object (0.14 mean) and is damping‑invariant.
PICA (Ours) attains the best mean in every damping/mode column: deterministic success 0.89 at ×1, 0.80 at ×2, 0.56 at ×4. This compares to state‑only PPO (0.27 at ×4), flat‑history PPO (0.32), GRU‑PPO (0.30), Transformer‑PPO (0.09).
Adding richer temporal encoders alone does not close the gap; the combination of physical signals + GLA encoder is essential.

Table 2 (averages only): Success averaged over objects (deterministic/stochastic) across damping multipliers.

Method	Damp	Avg (det/stoch)
Traj. tracking	×1	1.00 / –
	×2	0.71 / –
	×4	0.71 / –
Parallel‑Jaw	×1	0.14 / –
State‑only PPO	×1	0.58 / 0.44
	×4	0.27 / 0.26
Flat‑history PPO	×4	0.32 / 0.21
GRU‑PPO	×4	0.30 / 0.28
Transformer‑PPO	×4	0.09 / 0.04
PICA (Ours)	×1	0.89 / 0.82
	×2	0.80 / 0.72
	×4	0.56 / 0.43
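Applying the robustness aggregation of Eq. (7) to the PICA rows of Table 2 gives the mean and worst‑case success per execution mode. The helper name is hypothetical; the numbers are taken from the table.

```python
from typing import Dict, Tuple


def robustness(success_by_damping: Dict[int, float]) -> Tuple[float, float]:
    """Eq. (7): mean and worst-case success over the damping set B."""
    values = list(success_by_damping.values())
    return sum(values) / len(values), min(values)


# PICA averages from Table 2, keyed by damping multiplier.
pica = {
    "det": {1: 0.89, 2: 0.80, 4: 0.56},
    "stoch": {1: 0.82, 2: 0.72, 4: 0.43},
}

for mode, scores in pica.items():
    mean_s, worst_s = robustness(scores)
    print(f"{mode}: mean={mean_s:.2f} worst={worst_s:.2f}")
```

For PICA this yields a deterministic mean of 0.75 with a worst case of 0.56, and a stochastic mean of about 0.66 with a worst case of 0.43; in both modes the worst case occurs at ×4.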

Ablation Study

Method	×1 (det)	×4 (det)
w/o PICA (GLA only)	0.65	0.36
w/o GLA (PICA only, flat history)	0.75	0.43
PICA (full)	0.89	0.56

The components are complementary: physical signals help more under nominal damping, the temporal encoder helps more under stochastic mid‑damping, and the full model exceeds either component by ≥0.13 at ×4.
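The size of each component's contribution can be read off the ablation table by subtracting each ablated variant from the full model. The snippet re‑uses the deterministic numbers from the table; the dictionary layout is illustrative.

```python
# Deterministic success from the ablation table, keyed by damping multiplier.
full = {1: 0.89, 4: 0.56}
ablations = {
    "w/o PICA (GLA only)": {1: 0.65, 4: 0.36},
    "w/o GLA (PICA only, flat history)": {1: 0.75, 4: 0.43},
}

# Drop in success when a component is removed (rounded to table precision).
gains = {
    name: {d: round(full[d] - scores[d], 2) for d in full}
    for name, scores in ablations.items()
}

smallest_gain_x4 = min(g[4] for g in gains.values())
print(gains, smallest_gain_x4)
```

Removing the physical signals costs 0.24 at ×1 and 0.20 at ×4, while removing the GLA encoder costs 0.14 and 0.13 respectively, so the smallest margin of the full model at ×4 is 0.13.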

Nominal Success Can Mask Saturation Collapse

Table 4: Training‑length study on a single object (base policy, no contact fine‑tuning).

Training epochs	Succ. ×1	Succ. ×4	clip099 ×4
150	0.90	0.55	0.90
200	0.90	0.50	0.97
300	1.00	0.10	0.99
500	1.00	0.10	0.99

As training extends, nominal success improves but OOD robustness collapses and action saturation (clip099) approaches 1.0. This motivates reporting diagnostics alongside success and checkpoint selection by OOD robustness, not training reward.
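The effect of the selection criterion can be illustrated on the Table 4 numbers. The worst‑case rule below, with lower clip099 as a tie‑breaker, is one possible way to select by OOD robustness and is an assumption, not the authors' stated procedure.

```python
# Rows of Table 4: training epochs, success at x1 and x4, clip099 at x4.
checkpoints = [
    {"epochs": 150, "succ_x1": 0.90, "succ_x4": 0.55, "clip099_x4": 0.90},
    {"epochs": 200, "succ_x1": 0.90, "succ_x4": 0.50, "clip099_x4": 0.97},
    {"epochs": 300, "succ_x1": 1.00, "succ_x4": 0.10, "clip099_x4": 0.99},
    {"epochs": 500, "succ_x1": 1.00, "succ_x4": 0.10, "clip099_x4": 0.99},
]

# Selecting on nominal success alone (first checkpoint reaching the maximum).
by_nominal = max(checkpoints, key=lambda c: c["succ_x1"])

# Selecting on worst-case success over {x1, x4}; prefer lower saturation on ties.
by_worst_case = max(
    checkpoints,
    key=lambda c: (min(c["succ_x1"], c["succ_x4"]), -c["clip099_x4"]),
)

print(by_nominal["epochs"], by_worst_case["epochs"])
```

Nominal selection picks the 300‑epoch checkpoint, whose ×4 success is only 0.10, whereas worst‑case selection picks the 150‑epoch checkpoint, which gives up 0.10 of nominal success to retain 0.55 at ×4.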

Theoretical and Practical Implications

Theoretical: The work shows that, even without force or tactile feedback, policy learning becomes more robust when observable physical proxies (object response, detachment risk, tracking stress) are injected into the training loss. This shifts learning from task‑progress‑only optimization toward contact‑conditioned interaction.
Practical: DragMesh‑2 provides a reproducible evaluation protocol for dexterous articulated‑object manipulation, including contact‑aware diagnostics. The released dataset (277 trajectories) offers geometry‑guided grasp initialization, task‑scale normalization, and tracking references for future work.
Limitation: The policy still relies on a position‑increment action interface and tends toward action saturation under strong contact load (deterministic mean success drops from 0.89 at ×1 to 0.56 at ×4). Per‑object heterogeneity remains. Without force or tactile feedback, contact state is inferred only from kinematic error, which is insufficient for stable light pulling at high damping.

Conclusion

DragMesh‑2 extends articulated interaction from object‑centric generation to hand‑driven physical interaction, showing that nominal task success does not guarantee stable contact behavior. PICA improves robustness by adding physically informed training signals, dynamics randomization, and temporal contact‑response modeling without force or tactile feedback. Across seven GAPartNet objects under nominal (×1), mild (×2), and out‑of‑distribution (×4) damping, PICA attains the highest mean success among the compared methods. The paper releases a pure‑geometry interaction dataset for future whole‑body loco‑manipulation and humanoid hand–object interaction research. Future work should enrich the contact interface with force/tactile feedback and extend the approach to whole‑body control by coupling the upper‑body contact interaction with humanoid locomotion.