# HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions

> HSImul3R introduces a physics-supervised bidirectional optimization framework that reconstructs physically stable, simulation-ready human-scene interactions from casual captures.

- **Source:** [arXiv](https://arxiv.org/abs/2603.15612)
- **Published:** 2026-03-18
- **Permalink:** https://picx.dev/p/59ASct
- **Whiteboard:** https://picx.dev/p/59ASct/image

## Summary


## Summary (Overview)
*   **Key Contribution:** HSImul3R is the first framework to achieve **simulation-ready 3D reconstruction** of human-scene interactions (HSI) from casual captures (sparse-view images or monocular videos), bridging the **perception-simulation gap**.
*   **Core Method:** A **physically-grounded bi-directional optimization pipeline** that uses a physics simulator as an active supervisor:
    *   **Forward-pass:** **Scene-targeted Reinforcement Learning (RL)** refines human motion under supervision of motion fidelity and contact stability.
    *   **Reverse-pass:** **Direct Simulation Reward Optimization (DSRO)** refines scene geometry using simulation feedback on gravitational stability and interaction success.
*   **Novel Dataset:** Introduces **HSIBench**, a new benchmark with 300 unique HSI instances, 19 objects, and multi-view captures for training and evaluation.
*   **Key Results:** On HSIBench, HSImul3R outperforms the HSfM baseline and all ablated variants in simulation stability and reconstruction quality (for example, 53.68% vs. 10.52% Stability-HSI on the Easy setting). The optimized motions were also retargeted to and deployed on a real Unitree G1 humanoid robot.
*   **Main Problem Addressed:** Existing HSI reconstructions are visually plausible but often violate physical constraints, causing instability in physics engines and failure in embodied AI applications.

## Introduction and Theoretical Foundation
Embodied AI requires agents that can perceive, reason, and act in real-world environments. A core challenge is modeling **humanoid-scene interactions**, which necessitates understanding human motion, spatial layouts, and **interaction stability**. Reconstructing HSI from images/videos can provide high-fidelity supervision for creating scalable, simulation-ready datasets.

However, a significant **perception-simulation gap** exists. Current methods fall into three fragmented categories:
1.  **3D Scene Reconstruction** (e.g., NeRF, Gaussian Splatting): Focuses on environment geometry, ignoring human dynamics.
2.  **Human Motion Estimation**: Reconstructs motion in isolation, without modeling physical contact or environmental constraints.
3.  **Interaction Modeling**: Often relies on limited-scale SMPL-driven datasets lacking physical validation.

Unified frameworks like HOSNeRF or HSfM optimize primarily in 2D image space, prioritizing visual alignment over **geometric and physical validity**. The resulting reconstructions lack metric and contact fidelity, which makes them unsuitable for simulation.

**HSImul3R's Theoretical Basis:** It formulates HSI reconstruction as a **bi-directional physics-aware optimization problem**. The physics simulator acts as an **active supervisor**, enabling closed-loop refinement between human motion and scene geometry to ensure physical plausibility and stability.

## Methodology
The pipeline (Fig. 3) first reconstructs and aligns human and scene, then applies bi-directional physics-aware optimization.

**1. Human-Scene Interaction Reconstruction and Alignment**
*   **Scene Reconstruction:** Uses **DUSt3R** to recover 3D environment structure from uncalibrated images.
*   **Human Motion Estimation:** Uses SAM2 for detection/tracking, 4DHumans for 3D SMPL motion, and ViTPose for 2D keypoints.
*   **Initial Alignment:** Joint optimization via human-centric bundle adjustment and global human-scene alignment minimizes reprojection error.
*   **Alignment via Explicit 3D Structural Prior:** To fix structural artifacts and improve 3D geometric awareness, a pre-trained image-to-3D generative model (e.g., MIDI) is used to synthesize high-fidelity 3D object representations from the most prominent view:
    $$R_{\text{scene}} := \{ \text{MIDI}( I_n [ M_i ]), i \in [0, O] \}$$
    where $R_{\text{scene}}$ is the refined 3D scene, $O$ is the number of objects, $I_n$ is the input image, and $M_i$ is the object's segmentation mask from SAM.
*   **3D Explicit Constraints:** With improved geometry, human-scene alignment is refined using 3D constraints to minimize penetration.
    *   For **non-contact** scenarios, optimize positions via:
        $$\ell_{\text{non-contact}} = \frac{1}{|H_p|} \cdot \sum_{i \in H_p} \min_{1 \le j \le N_o} \| \mu^h_i - \mu^o_j \|_2 + \frac{1}{N_o} \cdot \sum_{j=1}^{N_o} \min_{i \in H_p} \| \mu^o_j - \mu^h_i \|_2$$
    *   For **contact** scenarios, apply:
        $$\ell_{\text{contact}} = \frac{1}{|H_p|} \cdot \sum_{i \in H_p} \max(0, -\delta(\mu^h_i))$$
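    Here, $H_p$ is the set of vertices on the human body part closest to the object, $N_o$ is the number of object vertices, $\mu^h_i$ and $\mu^o_j$ are the 3D positions of human vertex $i$ and object vertex $j$, and $\delta(\cdot)$ is the signed distance to the object surface (taken as negative inside the object, so $\ell_{\text{contact}}$ penalizes only penetrating vertices).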
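
The two alignment terms can be illustrated with a minimal NumPy sketch. It assumes the non-contact term is read as a symmetric nearest-neighbour (Chamfer-style) distance and that signed distances are negative inside the object; the function names and array layouts are illustrative and not taken from the paper's code.

```python
import numpy as np

def non_contact_loss(human_pts, object_pts):
    """Symmetric nearest-neighbour distance between H_p and the object vertices.

    human_pts:  (|H_p|, 3) positions mu^h_i on the closest body part.
    object_pts: (N_o, 3) positions mu^o_j of the object vertices.
    """
    human_pts = np.asarray(human_pts, dtype=float)
    object_pts = np.asarray(object_pts, dtype=float)
    # Pairwise Euclidean distances, shape (|H_p|, N_o).
    d = np.linalg.norm(human_pts[:, None, :] - object_pts[None, :, :], axis=-1)
    human_to_object = d.min(axis=1).mean()  # averaged over i in H_p
    object_to_human = d.min(axis=0).mean()  # averaged over j in 1..N_o
    return float(human_to_object + object_to_human)

def contact_loss(signed_dist):
    """Mean penetration depth max(0, -delta) over H_p.

    signed_dist: (|H_p|,) signed distances delta(mu^h_i), assumed negative
    inside the object.
    """
    signed_dist = np.asarray(signed_dist, dtype=float)
    return float(np.maximum(0.0, -signed_dist).mean())
```

Vertices outside the object contribute zero to `contact_loss`, so only interpenetration is pushed back toward the surface.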

**2. Forward-Pass: Scene-Targeted Motion Optimization**
To ensure stable dynamics in the simulator, a **scene-targeted supervision signal** is added to RL-based motion tracking (PHC). This encourages physically plausible contact by minimizing the distance between human contact keypoints and nearby object surfaces:
$$\ell_{\text{scene}} = \frac{1}{N_{\text{contact}} \cdot N_{\text{surf}}} \cdot \sum_{j=1}^{N_{\text{contact}}} \sum_{i=1}^{N_{\text{surf}}} \| \mu^o_i - k^h_j \|_2^2$$
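where $N_{\text{contact}}$ is the number of human-scene contacts, $N_{\text{surf}}$ is the number of object surface points sampled in the contact region, $k^h_j$ is the $j$-th human contact keypoint, and $\mu^o_i$ is the $i$-th sampled surface point.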
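
As a rough illustration, the scene-targeted term is just a mean of squared keypoint-to-surface distances. The sketch below assumes a single shared set of sampled surface points for all contacts, as the formula is written; `scene_loss` and its argument names are hypothetical.

```python
import numpy as np

def scene_loss(contact_keypoints, surface_points):
    """Mean squared distance between contact keypoints and surface samples.

    contact_keypoints: (N_contact, 3) human contact keypoints k^h_j.
    surface_points:    (N_surf, 3) sampled object surface points mu^o_i.
    """
    k = np.asarray(contact_keypoints, dtype=float)
    s = np.asarray(surface_points, dtype=float)
    diff = s[None, :, :] - k[:, None, :]   # (N_contact, N_surf, 3)
    sq_dist = np.sum(diff ** 2, axis=-1)   # squared L2 norm per pair
    return float(sq_dist.mean())           # 1 / (N_contact * N_surf) * sum
```

In an RL tracker this quantity would typically enter the reward with a negative sign or through an exponential kernel; how the paper combines it with the PHC tracking reward is not specified here.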

**3. Reverse-Pass: Simulator-Guided Object Refinement (DSRO)**
To address instability caused by defective generated geometry, **Direct Simulation Reward Optimization (DSRO)** is introduced. It uses simulation outcomes as a supervision signal to refine the 3D object generation model.
*   The DSRO objective is:
    $$\ell_{\text{DSRO}} = - T \cdot \mathbb{E}_{I \sim \mathcal{I}, x_0 \sim \mathcal{X}_I, t \sim \mathcal{U}(0,T), x_t \sim q(x_t | x_0)} [ w(t) \cdot (1 - 2 \cdot l(x_0)) \| \epsilon - \epsilon_\theta(x_t, t) \|_2^2 ]$$
*   The **stability label** $l(x_0)$ is determined from simulation feedback:
    $$l(x_0) = \begin{cases} 1, & \text{if stable} \\ 0, & \text{otherwise} \end{cases}$$
    Stability requires: (1) object stable under gravity, (2) HSI scene reaches a stable state, (3) final state preserves meaningful contact.
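
The sign structure of the objective is easier to see in code. For a stable sample ($l = 1$) the factor $(1 - 2l)$ equals $-1$, so minimizing the loss reduces the denoising error on that sample; for an unstable one ($l = 0$) the factor is $+1$ and the error is pushed up. The following is a minimal NumPy sketch of a batch estimate, assuming precomputed noise predictions; the function names are illustrative, and actual fine-tuning would use an autodiff framework.

```python
import numpy as np

def stability_label(gravity_stable, scene_settles, contact_kept):
    """l(x_0) = 1 only if all three simulator checks pass."""
    return int(gravity_stable and scene_settles and contact_kept)

def dsro_loss(eps, eps_pred, labels, weights, T):
    """Batch estimate of the DSRO objective.

    eps, eps_pred: (B, ...) true and predicted noise for each sample.
    labels:        (B,) stability labels l(x_0) in {0, 1}.
    weights:       (B,) timestep weights w(t).
    T:             number of diffusion timesteps.
    """
    eps = np.asarray(eps, dtype=float)
    eps_pred = np.asarray(eps_pred, dtype=float)
    labels = np.asarray(labels, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Per-sample squared error, summed over all non-batch axes.
    err = ((eps - eps_pred) ** 2).reshape(len(eps), -1).sum(axis=1)
    sign = 1.0 - 2.0 * labels  # -1 for stable, +1 for unstable samples
    return float(-T * np.mean(weights * sign * err))
```

Because the labels come from simulation rollouts rather than human preference annotations, the only supervision needed is whether each generated object passes the three stability checks.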

**4. Extension to Monocular Videos**
For video input, MegaSAM and TRAM are used for scene and human motion reconstruction. SAM2 provides 2D bounding boxes to identify interactions and achieve dynamic 3D alignment, assuming a static scene.

## Empirical Validation / Results
Extensive experiments on **HSIBench** evaluate reconstruction fidelity, simulation stability, and DSRO impact.

**Quantitative Evaluation Metrics:**
*   **SP-3D:** Scene Penetration ratio in 3D reconstruction.
*   **Stability-HSI:** Percentage of simulations in which the HSI remains stable (object stable under gravity, scene reaches a stable state, meaningful contact preserved).
*   **Human Motion Quality:** W-MPJPE (world coordinate accuracy) and PA-MPJPE (local pose precision).
*   **Object Geometry Quality:** Chamfer Distance (CD) and F-Score.
*   **Stability-Gravity (SG):** Percentage of objects that remain stable under gravity alone.
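
For readers unfamiliar with the motion and geometry metrics listed above, here is a minimal NumPy sketch using common textbook definitions. The paper's exact protocol (units, any frame alignment used for W-MPJPE, the F-Score threshold, squared vs. unsquared Chamfer distance) is not stated in this summary, so those choices are assumptions.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for (J, 3) joint arrays."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred to gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last singular direction if needed to avoid a reflection.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt                                   # p @ R approximates g
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return mpjpe(scale * p @ R + mu_g, gt)

def _nn_dists(a, b):
    """Nearest-neighbour distances a->b and b->a for (N, 3) point sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1), d.min(axis=0)

def chamfer_distance(pred_pts, gt_pts):
    """Symmetric Chamfer distance (unsquared, mean-reduced variant)."""
    p2g, g2p = _nn_dists(pred_pts, gt_pts)
    return float(p2g.mean() + g2p.mean())

def f_score(pred_pts, gt_pts, tau):
    """F-Score in [0, 1] at distance threshold tau (multiply by 100 for %)."""
    p2g, g2p = _nn_dists(pred_pts, gt_pts)
    precision, recall = (p2g < tau).mean(), (g2p < tau).mean()
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))
```

PA-MPJPE removes global scale, rotation, and translation before measuring error, which is why it isolates local pose precision, while the unaligned error reflects world-coordinate accuracy.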

**Key Results:**
*   **HSI Reconstruction & Simulation (Table 1):** HSImul3R outperforms the HSfM baseline and all ablated variants (V1-V4) on every reported metric. SP-3D is reported only for HSfM, V1, and the full method; a dash marks entries that are not reported.

    | Method | Stability-HSI (%) ↑ (Easy/Medium/Hard) | SP-3D (%) ↓ | W-MPJPE ↓ | PA-MPJPE ↓ |
    | :--- | :--- | :--- | :--- | :--- |
    | HSfM [52] | 10.52 / 4.50 / 2.66 | 69.51 | 5.02 | 2.79 |
    | V1 (HSfM+MIDI) | 13.96 / 8.81 / 4.17 | 77.12 | 6.18 | 3.20 |
    | V2 (w/o $\ell_{\text{scene}}$) | 39.56 / 22.71 / 7.05 | - | 4.91 | 2.71 |
    | V3 (center-point dist.) | 42.57 / 23.84 / 10.18 | - | 4.60 | 2.42 |
    | V4 (w/o DSRO) | 29.56 / 16.62 / 5.17 | - | 4.57 | 2.39 |
    | **Ours** | **53.68 / 30.56 / 13.92** | **22.9** | **4.09** | **2.17** |

*   **Image-to-3D Generation (Table 2):** DSRO-fine-tuned model achieves best physical plausibility and geometric accuracy.

    | Method | Stability-HSI (%) ↑ (Easy/Medium/Hard) | SG (%) ↑ | CD ↓ | F-Score ↑ |
    | :--- | :--- | :--- | :--- | :--- |
    | MIDI [23] | 29.56 / 16.62 / 5.17 | 79.19 | 0.198 | 81.95 |
    | DSO* [31] | 38.75 / 25.91 / 7.88 | 87.23 | 0.191 | 86.26 |
    | **Ours** | **53.68 / 30.56 / 13.92** | **91.50** | **0.173** | **88.25** |

*   **Input View Analysis (Table 3):** Increasing input views slightly improves motion quality but has little impact on simulation stability or penetration handling.

*   **Ablation Studies:** Figure 7 shows that removing the scene-targeted loss $\ell_{\text{scene}}$ (Eq. 5) leads to unstable simulations with object displacement.
*   **Real-World Deployment:** The refined human motions were successfully retargeted and deployed on a Unitree G1 humanoid robot using a diffusion-guided RL policy, enabling robust robot-scene interactions (Fig. 6).
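
As a quick reading of Table 1, the contribution of DSRO can be estimated by subtracting the V4 (w/o DSRO) row from the full method. The snippet below only re-uses the Stability-HSI numbers quoted in the table; the variable names are illustrative.

```python
# Stability-HSI (%) from Table 1, ordered as (Easy, Medium, Hard).
stability_hsi = {
    "V4 (w/o DSRO)": (29.56, 16.62, 5.17),
    "Ours": (53.68, 30.56, 13.92),
}
levels = ("Easy", "Medium", "Hard")

# Absolute gain in percentage points attributable to adding DSRO.
gains = {
    level: round(ours - base, 2)
    for level, base, ours in zip(
        levels, stability_hsi["V4 (w/o DSRO)"], stability_hsi["Ours"]
    )
}
print(gains)  # {'Easy': 24.12, 'Medium': 13.94, 'Hard': 8.75}
```

The V4 row matches the MIDI row of Table 2, which is consistent with V4 using the generative model without DSRO fine-tuning.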

## Theoretical and Practical Implications
**Theoretical Implications:**
*   Proposes a novel **physics-in-the-loop** paradigm for 3D reconstruction, where the simulator is an active supervisor rather than a passive validator.
*   Introduces **DSRO**, a method for using simulation outcomes as direct supervision for generative model fine-tuning, moving beyond reliance on human annotations.
*   Demonstrates the necessity of **joint, physically-grounded optimization** of human motion and scene geometry to bridge the perception-simulation gap.

**Practical Implications:**
*   **Enables Simulation-Ready Assets:** Produces HSI reconstructions that are directly usable in physics engines for robotics, gaming, and VR/AR.
*   **Facilitates Embodied AI:** Provides a scalable pipeline to create large, physically-validated HSI datasets from casual captures (e.g., YouTube videos) for training embodied AI models.
*   **Robotics Deployment:** The framework's output can be directly transferred to control real-world humanoid robots, as demonstrated with the Unitree G1.
*   **Benchmarking:** The release of HSIBench provides a valuable resource for training and evaluating future HSI and embodied AI methods.

## Conclusion
HSImul3R presents the first framework for simulation-ready reconstruction of human-scene interactions from uncalibrated sparse views or monocular videos. Its core contributions are:
1.  A **contact-aware interaction model** to reduce penetration in 3D reconstruction.
2.  A **scene-targeted RL strategy** to promote stable simulator interactions.
3.  A **DSRO scheme** that uses simulation feedback to improve image-to-3D generation.
4.  The **HSIBench** dataset.

The method significantly outperforms existing techniques in achieving stable simulations and high-quality reconstructions. **Future work** could address limitations such as handling more complex multi-object interactions, improving success rates, and reducing biases in the fine-tuned generative model. This work establishes a foundational step towards physically-grounded, scalable HSI reconstruction for embodied AI.

