# PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual Worlds

> PhysForge generates physics-grounded 3D assets from a single image using a two-stage pipeline that first plans a hierarchical physical blueprint with a VLM and then jointly synthesizes geometry and kinematic parameters via a novel KineVoxel Injection mechanism.

- **Source:** [arXiv](https://arxiv.org/abs/2605.05163)
- **Published:** 2026-05-08
- **Permalink:** https://picx.dev/p/yeqYmk
- **Whiteboard:** https://picx.dev/p/yeqYmk/image

## Summary

*   **Problem & Goal:** Existing 3D generation methods produce static "hollow shell" assets, lacking the physics and functional properties needed for interaction in virtual worlds and embodied AI. PhysForge aims to generate physics-grounded, simulation-ready 3D assets from a single image.
*   **Core Framework:** A decoupled two-stage pipeline: (1) A VLM acts as a "physical architect" to plan a **Hierarchical Physical Blueprint**, and (2) A physics-grounded diffusion model realizes this blueprint via a novel **KineVoxel Injection (KVI)** mechanism to jointly generate geometry, texture, and precise kinematic parameters.
*   **Dataset:** Introduces **PhysDB**, a large-scale dataset of 150,000 assets annotated with a novel **four-tier physical system** (holistic, static, functional, interactive properties).
*   **Key Innovation:** The **KineVoxel Injection** mechanism encodes articulation parameters (origin, axis, limits) into a special voxel representation, enabling synergistic generation with geometry within the diffusion process.
*   **Validation:** Extensive experiments show state-of-the-art performance in part structure planning and physics property generation. Generated assets are directly applicable in robotic simulators and game engines.

## Introduction and Theoretical Foundation
Rapid progress in 3D generative models has made them a potential data engine for the soaring demand for 3D content in embodied AI and virtual worlds. However, a significant gap remains: existing methods focus solely on static geometry and textures, overlooking the **physics information** crucial for interaction. The generated assets cannot be manipulated, which makes them unsuitable for simulators or games that require realistic physics.

The core insight of PhysForge is that for an object to be physically interactive, its generation must be driven by its **functional logic and hierarchical physics**. An object's structure should be a manifestation of its intended physical functions (e.g., a cabinet door for opening, a button for pressing). Therefore, the research shifts the focus from holistic shape generation to **physics-centric synthesis**.

To achieve this, the authors propose a "planning-then-generation" paradigm, inspired by successes in 2D multimodal research. They leverage the complementary strengths of specialized models: **Vision-Language Models (VLMs)** possess the world knowledge for complex physical planning, while **diffusion models** excel at precise synthesis of geometry and kinematic parameters. By decoupling these processes, PhysForge ensures assets are both visually realistic and physically consistent.

## Methodology
PhysForge is a two-stage framework supported by the PhysDB dataset.

### 1. PhysDB: A Physics-Grounded Dataset
PhysDB is a large-scale dataset of 150k 3D objects sourced from Objaverse and annotated with a **four-tier physical system**:
1.  **Holistic Tier:** Object-level properties (real-world scale, category, usage scene).
2.  **Static Tier:** Part-level static attributes (semantic label, physical material, mass).
3.  **Functional Tier:** Part-level functional attributes (intrinsic function, state machine).
4.  **Interactive Tier:** Part-level interactive attributes (atomic affordances, kinematic definitions: parent part, joint type, joint parameters).
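
The four tiers map naturally onto a nested record. The following is a minimal sketch of one way to hold such an annotation in memory; every class name, field name, and example value is hypothetical and chosen for illustration, since the source does not describe the PhysDB file schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Kinematics:
    """Interactive-tier kinematic definition (hypothetical field names)."""
    parent: Optional[str]            # parent part label, None for the root part
    joint_type: str                  # e.g. "fixed", "revolute", "prismatic"
    origin: tuple = (0.0, 0.0, 0.0)  # joint origin
    axis: tuple = (0.0, 0.0, 1.0)    # joint axis
    limits: tuple = (0.0, 0.0)       # lower and upper motion limits


@dataclass
class PartAnnotation:
    # Static tier
    label: str
    material: str
    mass_kg: float
    # Functional tier
    function: str
    states: list = field(default_factory=list)
    # Interactive tier
    affordances: list = field(default_factory=list)
    kinematics: Optional[Kinematics] = None


@dataclass
class AssetAnnotation:
    # Holistic tier
    category: str
    scale_m: float
    usage_scene: str
    parts: list = field(default_factory=list)

    def children_of(self, parent_label):
        """Labels of parts whose kinematic parent is `parent_label`."""
        return [
            p.label
            for p in self.parts
            if p.kinematics is not None and p.kinematics.parent == parent_label
        ]


# Illustrative values only; they are not taken from PhysDB.
cabinet = AssetAnnotation(
    category="cabinet",
    scale_m=1.2,
    usage_scene="kitchen",
    parts=[
        PartAnnotation("body", "wood", 18.0, "storage",
                       kinematics=Kinematics(None, "fixed")),
        PartAnnotation("door", "wood", 3.0, "open/close access",
                       states=["closed", "open"], affordances=["pull"],
                       kinematics=Kinematics("body", "revolute",
                                             origin=(0.3, 0.0, 0.0),
                                             axis=(0.0, 0.0, 1.0),
                                             limits=(0.0, 1.57))),
    ],
)
print(cabinet.children_of("body"))
```

Keeping the parent reference inside each part's kinematic record makes the part hierarchy recoverable from a flat list, which is one plausible reading of the "parent part" attribute in the Interactive Tier.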

The annotation pipeline uses a human-in-the-loop process with a multimodal LLM for initial annotation followed by manual correction. To train the kinematic generation stage, the dataset is supplemented with PartNet-Mobility and Infinite-Mobility for ground-truth articulation parameters.

### 2. Stage 1: VLM as a Physical Blueprint Planner
A finetuned VLM (**Qwen2.5-VL**) acts as the planner. Its input includes:
*   A single image $I$.
*   An optional 2D part mask $M$ (for granularity control).
*   Generated 3D voxels $V$ (from the first stage of TRELLIS).

The 3D voxel features are extracted using a **PartField encoder** and a position-aware 3D ConvNet. The VLM is finetuned to autoregressively generate the **Hierarchical Physical Blueprint**. This includes:
*   **Bounding Box Layout:** Represented efficiently using 6 special tokens (`<boxs>`, `<boxe>`, and quantized coordinate tokens `<box0>...<box63>`).
*   **Detailed Physical Properties:** For each part (parent node, articulation type, material, function, etc.).
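
To make the token layout concrete, here is a minimal sketch of serializing one box into the special-token vocabulary named above. It assumes a box is written as six coordinates (min and max corners) normalized to $[0, 1]$ and uniformly quantized into 64 bins; the source names the tokens but not the exact quantization scheme, and the function names are hypothetical.

```python
NUM_BINS = 64  # matches the <box0>...<box63> vocabulary in the text


def quantize(value, num_bins=NUM_BINS):
    """Map a coordinate in [0, 1] to a bin index in [0, num_bins - 1]."""
    clamped = min(max(value, 0.0), 1.0)
    return min(int(clamped * num_bins), num_bins - 1)


def box_to_tokens(box):
    """Serialize (xmin, ymin, zmin, xmax, ymax, zmax) as a token list.

    Assumption: coordinates are already normalized to [0, 1].
    """
    coords = [f"<box{quantize(v)}>" for v in box]
    return ["<boxs>"] + coords + ["<boxe>"]


def tokens_to_box(tokens, num_bins=NUM_BINS):
    """Inverse mapping: recover bin centers from the coordinate tokens."""
    if tokens[0] != "<boxs>" or tokens[-1] != "<boxe>":
        raise ValueError("box must be delimited by <boxs> and <boxe>")
    bins = [int(tok[len("<box"):-1]) for tok in tokens[1:-1]]
    return tuple((b + 0.5) / num_bins for b in bins)


tokens = box_to_tokens((0.0, 0.25, 0.5, 0.5, 0.75, 1.0))
print(" ".join(tokens))
print(tokens_to_box(tokens))
```

Under this scheme the round trip loses at most half a bin width (1/128 of the normalized extent) per coordinate, which is the price paid for a compact discrete vocabulary that an autoregressive VLM can emit.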

A key discovery is that **physics-guided planning resolves part ambiguity**. Co-predicting physical properties alongside bounding boxes provides stronger semantic constraints, enabling reasonable part decompositions even without 2D mask guidance.

### 3. Stage 2: Diffusion-based Generation with KineVoxel Injection (KVI)
This stage realizes the VLM's blueprint by generating high-fidelity geometry and **precise kinematic parameters**. The challenge is generating continuous 3D values (joint origin, axis) within a geometry diffusion pipeline.

The solution is the novel **KineVoxel Injection (KVI)** mechanism:
*   **Articulation Parameter Representation:** For a part $i$, parameters are an 8D vector $P_i = (O_i, A_i, L_i)$, where $O_i \in \mathbb{R}^3$ is the joint origin, $A_i \in \mathbb{R}^3$ is the joint axis, and $L_i \in \mathbb{R}^2$ is the motion limits.
*   **KineVoxel Encoding:** This vector is encoded into a "KineVoxel" latent $z_{k,i}$ using a lightweight Kinematic Encoder $E_{kine}$:

    $$
    z_{k,i} = E_{kine}(\text{concat}(S_O \cdot O_i, S_A \cdot A_i, S_L \cdot L_i))
    $$

    where $S_O, S_A, S_L$ are scaling factors.
*   **Joint Injection:** The KineVoxel latent $z_{k,i}$ is concatenated with the sequence of geometry voxel latents $Z_g = \{z_{g,i}\}$ and fed into the main denoising transformer. A **joint type embedding** $E_{type}$ (from the VLM's planned type, e.g., "revolute") is added to $z_{k,i}$ to help the transformer distinguish latent types.
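*   **Training Objective:** The model is trained using Conditional Flow Matching (CFM) with a composite loss (written $\mathcal{L}$ here to avoid confusion with the motion limits $L_i$):

    $$
    \mathcal{L} = \mathbb{E}_{t,Z_0,c} [ \mathcal{L}_{geo} + \lambda_{kine} \cdot \mathcal{L}_{kine} ]
    $$

    where $c$ is the condition from the VLM blueprint, and the loss terms are:

    $$
    \mathcal{L}_{geo} = \| v_{g,t} - \hat{v}_{g,t} \|^2; \quad \mathcal{L}_{kine} = \| v_{k,t} - \hat{v}_{k,t} \|^2.
    $$

    Here $v$ denotes the target velocity of the flow and $\hat{v}$ the model's prediction, for the geometry ($g$) and KineVoxel ($k$) latents respectively. The weighting factor $\lambda_{kine} = 10$ emphasizes accurate articulation prediction.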
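
The three KVI steps and the composite loss can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: a random linear map stands in for the learned encoder $E_{kine}$, the scaling factors are set to 1 because the source does not give their values, and the latent width and joint type list are hypothetical.

```python
import random

S_O, S_A, S_L = 1.0, 1.0, 1.0   # assumed; the source does not state the values
LATENT_DIM = 8                  # hypothetical latent width, kept small
LAMBDA_KINE = 10.0              # weighting factor stated in the text
JOINT_TYPES = ["fixed", "revolute", "prismatic"]  # illustrative list


def make_linear(in_dim, out_dim, rng):
    """Random linear map standing in for the learned Kinematic Encoder."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def encode_kinevoxel(origin, axis, limits, joint_type, encoder, type_table):
    """z_k = E_kine(concat(S_O*O, S_A*A, S_L*L)) + E_type[joint_type]."""
    p = ([S_O * v for v in origin] + [S_A * v for v in axis]
         + [S_L * v for v in limits])
    if len(p) != 8:
        raise ValueError("expected 3 + 3 + 2 = 8 articulation parameters")
    z = apply_linear(encoder, p)
    return [a + b for a, b in zip(z, type_table[joint_type])]


def inject(geometry_latents, kinevoxel_latent):
    """Append the KineVoxel latent to the geometry sequence as one token."""
    return list(geometry_latents) + [kinevoxel_latent]


def squared_error(target, pred):
    """Squared L2 norm of the difference over a sequence of vectors."""
    return sum((a - b) ** 2
               for t, p in zip(target, pred) for a, b in zip(t, p))


def composite_loss(v_target, v_pred, num_geo):
    """L_geo + lambda_kine * L_kine; tokens after num_geo are KineVoxels."""
    l_geo = squared_error(v_target[:num_geo], v_pred[:num_geo])
    l_kine = squared_error(v_target[num_geo:], v_pred[num_geo:])
    return l_geo + LAMBDA_KINE * l_kine


rng = random.Random(0)
encoder = make_linear(8, LATENT_DIM, rng)
type_table = {name: [rng.uniform(-0.1, 0.1) for _ in range(LATENT_DIM)]
              for name in JOINT_TYPES}
z_k = encode_kinevoxel((0.3, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.57),
                       "revolute", encoder, type_table)
sequence = inject([[0.0] * LATENT_DIM for _ in range(4)], z_k)
print(len(sequence))  # four geometry tokens plus one KineVoxel token
```

Because the KineVoxel is a single token among many geometry tokens, an unweighted sum would let the geometry term dominate; the factor $\lambda_{kine} = 10$ counteracts that imbalance.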

## Empirical Validation / Results
Evaluation uses PartObjaverse-Tiny, the PhysXNet test set, and new test sets drawn from PhysDB and the articulated datasets.

### 1. Part Structure Planning
**Baselines:** OmniPart (first stage), PartField.
**Metrics:** BBox IoU, Voxel Recall, Voxel IoU.
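
For readers unfamiliar with these metrics, the sketch below shows common definitions for axis-aligned 3D boxes and occupied-voxel sets. The source does not spell out its exact formulas (for instance, how predicted parts are matched to ground-truth parts), so treat this as one plausible reading with hypothetical function names.

```python
def box_volume(box):
    """Volume of (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def bbox_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (box_volume(a) + box_volume(b) - inter)


def voxel_iou(pred, gt):
    """IoU of two sets of occupied voxel indices."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


def voxel_recall(pred, gt):
    """Fraction of ground-truth voxels covered by the prediction."""
    pred, gt = set(pred), set(gt)
    return len(pred & gt) / len(gt) if gt else 1.0


pred_voxels = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
gt_voxels = {(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)}
print(bbox_iou_3d((0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3)))
print(voxel_iou(pred_voxels, gt_voxels), voxel_recall(pred_voxels, gt_voxels))
```

Recall rewards covering the ground-truth part even when the prediction is too large, while IoU also penalizes the excess, which is why the two voxel metrics can rank methods differently.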

**Table 3. Quantitative results for bounding box generation (%) on PartObjaverse-Tiny.**

| Method | Voxel recall ↑ | Voxel IoU ↑ | Bbox IoU ↑ |
| :--- | :--- | :--- | :--- |
| PartField | 69.65 | 46.04 | 37.33 |
| OmniPart (SAM mask) | 68.33 | 43.34 | 34.33 |
| PhysForge-bbox (w/o mask) | 67.89 | 35.53 | 32.30 |
| **PhysForge (w/o mask)** | **73.63** | **47.66** | **36.32** |
| OmniPart | 73.79 | 52.92 | 41.66 |
| **PhysForge (Ours)** | **77.16** | **53.74** | **42.95** |

**Key Findings:**
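*   With mask guidance, PhysForge achieves the best result on all three metrics (77.16 / 53.74 / 42.95 versus 73.79 / 52.92 / 41.66 for OmniPart, in the table's column order).
*   **Physics-guided planning is crucial:** "PhysForge (w/o mask)" outperforms "PhysForge-bbox (w/o mask)" on every metric (for example, Voxel IoU of 47.66 vs. 35.53), showing that predicting physical properties enhances semantic understanding of part structures.
*   PhysForge without a mask outperforms OmniPart using SAM-generated masks on all three metrics, demonstrating robustness. PartField's Bbox IoU (37.33) remains slightly higher than that of the mask-free PhysForge (36.32).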

### 2. Physics-Grounded Generation
**Baselines for Properties:** PhysXGen, TRELLIS.
**Metrics:** Chamfer Distance (CD), F1-Score, MAE for scale/material/affordance, CLIP-Similarity for text properties.
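
The geometry metrics can be illustrated with a brute-force sketch over small point sets. Conventions for Chamfer Distance vary (squared vs. unsquared distances, sum vs. mean, scaling factors), and the source does not say which one it reports, so the definitions below are assumptions for illustration.

```python
import math


def nearest_distances(src, dst):
    """For each point in src, the distance to its nearest neighbor in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance using mean unsquared distances (assumed)."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    return sum(d_pred) / len(d_pred) + sum(d_gt) / len(d_gt)


def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    precision = sum(d <= threshold for d in d_pred) / len(d_pred)
    recall = sum(d <= threshold for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.2)]
print(chamfer_distance(pred_points, gt_points))
print(f_score(pred_points, gt_points, 0.1), f_score(pred_points, gt_points, 0.25))
```

The two F1 columns in the tables (F1-0.1 and F1-0.05) correspond to two such thresholds; a tighter threshold makes the score more sensitive to small surface deviations.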

**Table 1. Quantitative comparison of physics property generation on PhysXNet.**

| Method | CD ↓ | F1-0.1 ↑ | F1-0.05 ↑ | Abs. scale (cm) ↓ | Material ↓ | Affordance ↓ | Description ↑ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| TRELLIS | 10.10 | 86.53 | 72.47 | - | - | - | - |
| PhysXGen | 9.81 | 87.91 | 73.60 | 25.83 | 1.59 | 3.69 | 0.38 |
| **PhysForge (Ours)** | **9.21** | **89.24** | **75.43** | **11.04** | **0.81** | **1.22** | **0.87** |

**Table 2. Quantitative comparison on PhysDB.**

| Method | CD ↓ | F1-0.1 ↑ | F1-0.05 ↑ | Abs. scale (m) ↓ | Material ↓ | Function ↑ | Interaction ↑ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| TRELLIS | 24.32 | 68.19 | 53.28 | - | - | - | - |
| PhysXGen | 25.30 | 65.79 | 50.57 | 1.08 | 1.44 | 0.36 | 0.34 |
| **PhysForge (Ours)** | **22.89** | **70.51** | **55.38** | **0.37** | **0.43** | **0.83** | **0.96** |

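**Key Findings:** PhysForge outperforms both baselines in geometry quality and physics property accuracy on both benchmarks. Absolute scale error drops from 25.83 cm (PhysXGen) to 11.04 cm on PhysXNet and from 1.08 m to 0.37 m on PhysDB, while the Function and Interaction scores on PhysDB rise from 0.36 and 0.34 to 0.83 and 0.96. These gains are attributed to the VLM's world-knowledge prior.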

### 3. Kinematic Parameter Generation
**Baselines:** Articulate Anything, Singapo, URDFormer.
**Metrics:** CD, CLIP-Similarity, Joint Axis Error, Joint Pivot Error.
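
The two joint metrics can be sketched with definitions that are common in the articulated-object literature: the angle between predicted and ground-truth axes (ignoring sign), and the distance from the predicted pivot to the ground-truth axis line. The source does not define its formulas or units, so these are assumptions and the function names are hypothetical.

```python
import math


def axis_error(pred_axis, gt_axis):
    """Angle in radians between two joint axes, ignoring direction sign."""
    dot = sum(a * b for a, b in zip(pred_axis, gt_axis))
    norm = math.hypot(*pred_axis) * math.hypot(*gt_axis)
    cosine = min(1.0, abs(dot) / norm)  # clamp against rounding above 1
    return math.acos(cosine)


def pivot_error(pred_origin, gt_origin, gt_axis):
    """Distance from the predicted origin to the ground-truth axis line."""
    length = math.hypot(*gt_axis)
    unit = [c / length for c in gt_axis]
    delta = [p - g for p, g in zip(pred_origin, gt_origin)]
    along = sum(d * u for d, u in zip(delta, unit))
    perpendicular = [d - along * u for d, u in zip(delta, unit)]
    return math.hypot(*perpendicular)


print(axis_error((0.0, 0.0, 1.0), (0.0, 0.0, -1.0)))
print(pivot_error((1.0, 0.0, 5.0), (0.0, 0.0, 0.0), (0.0, 0.0, 2.0)))
```

Measuring the pivot against the axis line rather than against a single point reflects that any origin along a revolute axis describes the same hinge; a plain Euclidean distance between origins is a stricter alternative.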

**Table 4. Quantitative comparison of articulated object generation.**

| Method | CD ↓ | Clip-Sim ↑ | Joint-Axis-Err-5 ↓ | Joint-Pivot-Err-5 ↓ | Joint-Axis-Err-all ↓ | Joint-Pivot-Err-all ↓ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Articulate Anything | 23.31 | 0.87 | 0.608 | 0.257 | 0.694 | 0.197 |
| Singapo | 21.10 | 0.85 | 0.241 | 0.153 | - | - |
| URDFormer | 25.42 | 0.84 | 0.781 | 0.652 | - | - |
| PhysForge (w/o joint type emb) | 10.73 | 0.90 | 0.157 | 0.132 | 0.292 | 0.141 |
| PhysForge (w/o kinetics enc) | 11.31 | 0.89 | 0.158 | 0.117 | 0.204 | 0.120 |
| **PhysForge (Ours)** | **10.21** | **0.93** | **0.101** | **0.071** | **0.164** | **0.096** |

**Ablation Analysis:**
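*   Removing the **joint type embedding** (the interface between the two stages) degrades joint accuracy (Joint-Axis-Err-all rises from 0.164 to 0.292), confirming its importance for transferring physical common sense.
*   Removing the **independent kinematic encoder/decoder** also compromises precise constraint synthesis (Joint-Axis-Err-all of 0.204, Joint-Pivot-Err-all of 0.120) and yields the highest CD among the PhysForge variants (11.31).
*   The full PhysForge model achieves the best image consistency (CLIP similarity of 0.93) and the lowest joint error on every reported metric.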

### 4. Qualitative Results & Applications
*   **Figures 3, 4, and 5:** Show high-quality, part-aware, and articulated 3D assets generated from single images.
*   **Applications (Figure 6):** Generated assets are directly usable in:
    1.  **Robotic Simulation** (RoboTwin): Robotic manipulators interact with assets using detailed geometry and kinematics.
    2.  **Virtual Worlds** (Unity/Unreal Engine): Assets enable complex, physics-based interactions without manual rigging.
    3.  **Agent-Environment Interaction:** Embodied agents can query the model in natural language to receive a physical blueprint for task planning.
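
As an illustration of what "directly usable" could mean for a simulator, the sketch below turns one set of predicted kinematic parameters into a URDF `<joint>` element. URDF is an assumption here: the source does not state which export format PhysForge uses. The helper name, link names, and the `effort` and `velocity` defaults are hypothetical placeholders, and the joint origin is assumed to be expressed in the parent link's frame.

```python
import xml.etree.ElementTree as ET


def format_xyz(values):
    """Render a 3-vector as a space-separated URDF attribute string."""
    return " ".join(f"{v:g}" for v in values)


def joint_to_urdf(name, parent, child, joint_type, origin, axis, limits,
                  effort=10.0, velocity=1.0):
    """Build a URDF <joint> element from predicted kinematic parameters.

    effort and velocity are placeholder values; URDF requires them on the
    <limit> element of revolute and prismatic joints.
    """
    joint = ET.Element("joint", name=name, type=joint_type)
    ET.SubElement(joint, "parent", link=parent)
    ET.SubElement(joint, "child", link=child)
    ET.SubElement(joint, "origin", xyz=format_xyz(origin), rpy="0 0 0")
    if joint_type in ("revolute", "prismatic", "continuous"):
        ET.SubElement(joint, "axis", xyz=format_xyz(axis))
    if joint_type in ("revolute", "prismatic"):
        lower, upper = limits
        ET.SubElement(joint, "limit", lower=f"{lower:g}", upper=f"{upper:g}",
                      effort=f"{effort:g}", velocity=f"{velocity:g}")
    return joint


hinge = joint_to_urdf("door_hinge", "body", "door", "revolute",
                      origin=(0.3, 0.0, 0.0), axis=(0.0, 0.0, 1.0),
                      limits=(0.0, 1.57))
print(ET.tostring(hinge, encoding="unicode"))
```

The joint origin, axis, and limits correspond one-to-one to the 8D articulation vector described in the methodology, and the joint type and parent link come from the planned blueprint, so no manual rigging step is needed to fill in these fields.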

## Theoretical and Practical Implications
*   **Theoretical:** Proposes a novel formulation for 3D generation that is **physics-centric and function-driven**, moving beyond static geometry. The decoupled two-stage framework demonstrates the effective synergy between VLMs (for planning) and diffusion models (for realization).
*   **Practical:** Provides a foundational **data engine** for interactive 3D content creation.
    *   **For Embodied AI:** Supplies simulation-ready assets for training and testing robotic manipulation policies in diverse environments.
    *   **For Game Development:** Accelerates content creation for interactive virtual worlds by generating assets with built-in physics properties.
    *   **Dataset Contribution:** PhysDB fills a critical data gap with its large-scale, fine-grained physical annotations, enabling future research in physics-aware generation.

## Conclusion
PhysForge introduces a novel framework for generating interactive, physics-grounded 3D assets. Its core contributions are:
1.  A **decoupled two-stage architecture** (VLM Planning + Diffusion Realization) that generates a Hierarchical Physical Blueprint and then realizes it with geometry and precise kinematics.
2.  The **KineVoxel Injection (KVI)** mechanism, enabling synergistic generation of articulation parameters within a diffusion model.
3.  The **PhysDB** dataset, providing the necessary training data with a four-tier physical annotation system.

Extensive validation shows PhysForge achieves state-of-the-art performance in part planning and physics property generation. The generated assets are directly applicable in robotic simulators and interactive virtual worlds, paving the way for scalable creation of interactive 3D content.

