HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human–Scene Interactions - Summary

Summary (Overview)

  • Key Contribution: HSImul3R is the first framework to achieve simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures (sparse-view images or monocular videos), bridging the perception-simulation gap.
  • Core Method: A physically-grounded bi-directional optimization pipeline that uses a physics simulator as an active supervisor:
    • Forward-pass: Scene-targeted Reinforcement Learning (RL) refines human motion under supervision of motion fidelity and contact stability.
    • Reverse-pass: Direct Simulation Reward Optimization (DSRO) refines scene geometry using simulation feedback on gravitational stability and interaction success.
  • Novel Dataset: Introduces HSIBench, a new benchmark with 300 unique HSI instances, 19 objects, and multi-view captures for training and evaluation.
  • Key Results: HSImul3R significantly outperforms existing methods in simulation stability and reconstruction quality. The optimized motions can be seamlessly transferred to real-world humanoid robots for deployment.
  • Main Problem Addressed: Existing HSI reconstructions are visually plausible but often violate physical constraints, causing instability in physics engines and failure in embodied AI applications.

Introduction and Theoretical Foundation

Embodied AI requires agents that can perceive, reason, and act in real-world environments. A core challenge is modeling humanoid-scene interactions, which necessitates understanding human motion, spatial layouts, and interaction stability. Reconstructing HSI from images/videos can provide high-fidelity supervision for creating scalable, simulation-ready datasets.

However, a significant perception-simulation gap exists. Current methods fall into three fragmented categories:

  1. 3D Scene Reconstruction (e.g., NeRF, Gaussian Splatting): Focuses on environment geometry, ignoring human dynamics.
  2. Human Motion Estimation: Reconstructs motion in isolation, without modeling physical contact or environmental constraints.
  3. Interaction Modeling: Often relies on limited-scale SMPL-driven datasets lacking physical validation.

Unified frameworks like HOS-NeRF or HSfM optimize primarily in 2D image space, prioritizing visual alignment over geometric and physical validity. This results in reconstructions lacking metric/contact fidelity, making them unsuitable for simulation.

HSImul3R's Theoretical Basis: It formulates HSI reconstruction as a bi-directional physics-aware optimization problem. The physics simulator acts as an active supervisor, enabling closed-loop refinement between human motion and scene geometry to ensure physical plausibility and stability.

Methodology

The pipeline (Fig. 3) first reconstructs and aligns human and scene, then applies bi-directional physics-aware optimization.

1. Human-Scene Interaction Reconstruction and Alignment

  • Scene Reconstruction: Uses DUSt3R to recover 3D environment structure from uncalibrated images.
  • Human Motion Estimation: Uses SAM2 for detection/tracking, 4DHumans for 3D SMPL motion, and ViTPose for 2D keypoints.
  • Initial Alignment: Joint optimization via human-centric bundle adjustment and global human-scene alignment minimizes reprojection error.
  • Alignment via Explicit 3D Structural Prior: To fix structural artifacts and improve 3D geometric awareness, a pre-trained image-to-3D generative model (e.g., MIDI) is used to synthesize high-fidelity 3D object representations from the most prominent view: $R_{\text{scene}} := \{ \text{MIDI}(I_n[M_i]),\ i \in [0, O] \}$, where $R_{\text{scene}}$ is the refined 3D scene, $O$ is the number of objects, $I_n$ is the input image, and $M_i$ is the $i$-th object's segmentation mask from SAM.
  • 3D Explicit Constraints: With improved geometry, human-scene alignment is refined using 3D constraints to minimize penetration.
    • For non-contact scenarios, positions are optimized via a bidirectional nearest-neighbor distance: $$\ell_{\text{non-contact}} = \frac{1}{|H_p|} \sum_{i \in H_p} \min_{1 \le j \le N_o} \| \mu^h_i - \mu^o_j \|_2 + \frac{1}{N_o} \sum_{j=1}^{N_o} \min_{i \in H_p} \| \mu^o_j - \mu^h_i \|_2$$
    • For contact scenarios, penetration is penalized via: $$\ell_{\text{contact}} = \frac{1}{|H_p|} \sum_{i \in H_p} \max(0, -\delta(\mu^h_i))$$
    Here, $H_p$ is the set of vertices of the human body part closest to the object, $N_o$ is the number of object vertices, $\mu^h_i$ and $\mu^o_j$ are 3D vertex positions, and $\delta(\cdot)$ is the object's signed distance function (negative inside the object).
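The two alignment losses above can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas, not the paper's implementation; `sdf` is a hypothetical signed-distance callable standing in for $\delta(\cdot)$:

```python
import numpy as np

def non_contact_loss(mu_h, mu_o):
    """Bidirectional nearest-neighbor distance between human body-part
    vertices mu_h (|H_p| x 3) and object vertices mu_o (N_o x 3)."""
    d = np.linalg.norm(mu_h[:, None, :] - mu_o[None, :, :], axis=-1)  # |H_p| x N_o
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def contact_loss(mu_h, sdf):
    """Penalize human vertices inside the object: sdf(x) < 0 inside,
    so max(0, -sdf(x)) is the penetration depth at x."""
    return np.mean(np.maximum(0.0, -sdf(mu_h)))
```

The non-contact term pulls the two point sets toward a target separation from both directions, while the contact term is zero whenever every human vertex stays on or outside the object surface.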

2. Forward-Pass: Scene-Targeted Motion Optimization

To ensure stable dynamics in the simulator, a scene-targeted supervision signal is added to RL-based motion tracking (PHC). This encourages physically plausible contact by minimizing the distance between human contact keypoints and nearby object surfaces:

$$\ell_{\text{scene}} = \frac{1}{N_{\text{contact}} \cdot N_{\text{surf}}} \sum_{j=1}^{N_{\text{contact}}} \sum_{i=1}^{N_{\text{surf}}} \| \mu^o_i - k^h_j \|_2^2$$

where $N_{\text{contact}}$ is the number of human-scene contacts, $N_{\text{surf}}$ is the number of object surface points sampled in the contact region, and $k^h_j$ is the $j$-th human contact keypoint.
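A literal NumPy rendering of this supervision term (illustrative only; how contact keypoints and surface samples are selected is a detail of the paper's pipeline, not shown here):

```python
import numpy as np

def scene_loss(surf_pts, contact_kpts):
    """Mean squared distance between every sampled object surface point
    (N_surf x 3) and every human contact keypoint (N_contact x 3)."""
    d2 = np.sum((surf_pts[:, None, :] - contact_kpts[None, :, :]) ** 2, axis=-1)
    return d2.mean()  # averages over all N_surf * N_contact pairs
```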

3. Reverse-Pass: Simulator-Guided Object Refinement (DSRO)

To address instability caused by defective generated geometry, Direct Simulation Reward Optimization (DSRO) is introduced. It uses simulation outcomes as a supervision signal to refine the 3D object generation model.

  • The DSRO objective is: $$\ell_{\text{DSRO}} = - T \cdot \mathbb{E}_{I \sim \mathcal{I},\, x_0 \sim \mathcal{X}_I,\, t \sim \mathcal{U}(0,T),\, x_t \sim q(x_t \mid x_0)} \left[ w(t) \, (1 - 2\, l(x_0)) \, \| \epsilon - \epsilon_\theta(x_t, t) \|_2^2 \right]$$
  • The stability label $l(x_0)$ is determined from simulation feedback: $$l(x_0) = \begin{cases} 1, & \text{if stable} \\ 0, & \text{otherwise} \end{cases}$$ Stability requires that (1) the object is stable under gravity, (2) the HSI scene reaches a stable state, and (3) the final state preserves meaningful contact.
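One way to read the objective: for stable samples ($l = 1$) the factor $(1 - 2l)$ is $-1$, which cancels the leading minus sign and leaves the usual weighted denoising loss, while unstable samples get a negated loss that pushes the model away from generating them. A per-batch sketch under this reading (not the authors' code; `w_t` are hypothetical per-timestep weights):

```python
import numpy as np

def dsro_loss(eps, eps_pred, w_t, stable, T=1000):
    """DSRO objective over a batch: stable samples (l=1) contribute the
    standard denoising loss; unstable samples (l=0) contribute its
    negation, so gradient descent unlearns geometry that fails in
    simulation."""
    l = stable.astype(float)
    per_sample = w_t * (1.0 - 2.0 * l) * np.sum((eps - eps_pred) ** 2, axis=-1)
    return -T * per_sample.mean()
```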

4. Extension to Monocular Videos

For video input, MegaSAM and TRAM are used for scene and human motion reconstruction, respectively. SAM2 provides 2D bounding boxes to identify interactions, and dynamic 3D alignment is performed under a static-scene assumption.

Empirical Validation / Results

Extensive experiments on HSIBench evaluate reconstruction fidelity, simulation stability, and DSRO impact.

Quantitative Evaluation Metrics:

  • SP-3D: Scene Penetration ratio in 3D reconstruction.
  • Stability-HSI: % of simulations where HSI remains stable (considers gravity, scene stability, meaningful contact).
  • Human Motion Quality: W-MPJPE (world coordinate accuracy) and PA-MPJPE (local pose precision).
  • Object Geometry Quality: Chamfer Distance (CD) and F-Score.
  • Stability-Gravity (SG): % of objects stable under gravity alone.
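Chamfer Distance and F-Score are standard point-set metrics; a minimal NumPy version is below. The exact variant used in the paper (normalization, squared vs. unsquared distances, the F-Score threshold) is an assumption here:

```python
import numpy as np

def chamfer_and_fscore(pred, gt, tau=0.1):
    """Chamfer distance (mean bidirectional nearest-neighbor distance)
    and F-Score at threshold tau between point sets pred (N x 3) and
    gt (M x 3)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # N x M
    d_pg, d_gp = d.min(axis=1), d.min(axis=0)   # pred->gt, gt->pred
    cd = d_pg.mean() + d_gp.mean()
    precision = (d_pg < tau).mean()             # predicted points near gt
    recall = (d_gp < tau).mean()                # gt points covered by pred
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return cd, f
```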

Key Results:

  • HSI Reconstruction & Simulation (Table 1): HSImul3R significantly outperforms baseline HSfM and all ablated variants (V1-V4) across all metrics.

    | Method | Stability-HSI (%) ↑ (Easy / Medium / Hard) | SP-3D (%) ↓ | W-MPJPE ↓ | PA-MPJPE ↓ |
    |---|---|---|---|---|
    | HSfM [52] | 10.52 / 4.50 / 2.66 | 69.5 | 15.02 | 2.79 |
    | V1 (HSfM+MIDI) | 13.96 / 8.81 / 4.17 | 77.1 | 26.18 | 3.20 |
    | V2 (w/o $\ell_{\text{scene}}$) | 39.56 / 22.71 / 7.05 | – | 4.91 | 2.71 |
    | V3 (center-point dist.) | 42.57 / 23.84 / 10.18 | – | 4.60 | 2.42 |
    | V4 (w/o DSRO) | 29.56 / 16.62 / 5.17 | – | 4.57 | 2.39 |
    | Ours | 53.68 / 30.56 / 13.92 | 22.9 | 4.09 | 2.17 |
  • Image-to-3D Generation (Table 2): DSRO-fine-tuned model achieves best physical plausibility and geometric accuracy.

    | Method | Stability-HSI (%) ↑ (Easy / Medium / Hard) | SG (%) ↑ | CD ↓ | F-Score ↑ |
    |---|---|---|---|---|
    | MIDI [23] | 29.56 / 16.62 / 5.17 | 79.19 | 0.198 | 81.95 |
    | DSO* [31] | 38.75 / 25.91 / 7.88 | 87.23 | 0.191 | 86.26 |
    | Ours | 53.68 / 30.56 / 13.92 | 91.50 | 0.173 | 88.25 |
  • Input View Analysis (Table 3): Increasing input views slightly improves motion quality but has little impact on simulation stability or penetration handling.

  • Ablation Studies: Figure 7 shows that removing the scene-targeted loss $\ell_{\text{scene}}$ (Eq. 5) leads to unstable simulations with object displacement.

  • Real-World Deployment: The refined human motions were successfully retargeted and deployed on a Unitree G1 humanoid robot using a diffusion-guided RL policy, enabling robust robot-scene interactions (Fig. 6).

Theoretical and Practical Implications

Theoretical Implications:

  • Proposes a novel physics-in-the-loop paradigm for 3D reconstruction, where the simulator is an active supervisor rather than a passive validator.
  • Introduces DSRO, a method for using simulation outcomes as direct supervision for generative model fine-tuning, moving beyond reliance on human annotations.
  • Demonstrates the necessity of joint, physically-grounded optimization of human motion and scene geometry to bridge the perception-simulation gap.

Practical Implications:

  • Enables Simulation-Ready Assets: Produces HSI reconstructions that are directly usable in physics engines for robotics, gaming, and VR/AR.
  • Facilitates Embodied AI: Provides a scalable pipeline to create large, physically-validated HSI datasets from casual captures (e.g., YouTube videos) for training embodied AI models.
  • Robotics Deployment: The framework's output can be directly transferred to control real-world humanoid robots, as demonstrated with the Unitree G1.
  • Benchmarking: The release of HSIBench provides a valuable resource for training and evaluating future HSI and embodied AI methods.

Conclusion

HSImul3R presents the first framework for simulation-ready reconstruction of human-scene interactions from uncalibrated sparse views or monocular videos. Its core contributions are:

  1. A contact-aware interaction model to reduce penetration in 3D reconstruction.
  2. A scene-targeted RL strategy to promote stable simulator interactions.
  3. A DSRO scheme that uses simulation feedback to improve image-to-3D generation.
  4. The HSIBench dataset.

The method significantly outperforms existing techniques in achieving stable simulations and high-quality reconstructions. Future work could address limitations such as handling more complex multi-object interactions, improving success rates, and reducing biases in the fine-tuned generative model. This work establishes a foundational step towards physically-grounded, scalable HSI reconstruction for embodied AI.