PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual Worlds

Summary (Overview)

  • Problem & Goal: Existing 3D generation methods produce static "hollow shell" assets, lacking the physics and functional properties needed for interaction in virtual worlds and embodied AI. PhysForge aims to generate physics-grounded, simulation-ready 3D assets from a single image.
  • Core Framework: A decoupled two-stage pipeline: (1) A VLM acts as a "physical architect" to plan a Hierarchical Physical Blueprint, and (2) A physics-grounded diffusion model realizes this blueprint via a novel KineVoxel Injection (KVI) mechanism to jointly generate geometry, texture, and precise kinematic parameters.
  • Dataset: Introduces PhysDB, a large-scale dataset of 150,000 assets annotated with a novel four-tier physical system (holistic, static, functional, interactive properties).
  • Key Innovation: The KineVoxel Injection mechanism encodes articulation parameters (origin, axis, limits) into a special voxel representation, enabling synergistic generation with geometry within the diffusion process.
  • Validation: Extensive experiments show state-of-the-art performance in part structure planning and physics property generation. Generated assets are directly applicable in robotic simulators and game engines.

Introduction and Theoretical Foundation

Rapid progress in 3D generative models positions them as a potential data engine for the soaring demand for 3D content in embodied AI and virtual worlds. However, a significant gap remains: existing methods focus solely on static geometry and textures, overlooking the physics information crucial for interaction. These generated assets cannot be manipulated, making them unsuitable for simulators or games requiring realistic physics.

The core insight of PhysForge is that for an object to be physically interactive, its generation must be driven by its functional logic and hierarchical physics. An object's structure should be a manifestation of its intended physical functions (e.g., a cabinet door for opening, a button for pressing). Therefore, the research shifts the focus from holistic shape generation to physics-centric synthesis.

To achieve this, the authors propose a "planning-then-generation" paradigm, inspired by successes in 2D multimodal research. They leverage the complementary strengths of specialized models: Vision-Language Models (VLMs) possess the world knowledge for complex physical planning, while diffusion models excel at precise synthesis of geometry and kinematic parameters. By decoupling these processes, PhysForge ensures assets are both visually realistic and physically consistent.

Methodology

PhysForge is a two-stage framework supported by the PhysDB dataset.

1. PhysDB: A Physics-Grounded Dataset

A novel, large-scale dataset of 150,000 3D objects sourced from Objaverse, annotated with a four-tier physical system (a schema sketch in code follows the list):

  1. Holistic Tier: Object-level properties (real-world scale, category, usage scene).
  2. Static Tier: Part-level static attributes (semantic label, physical material, mass).
  3. Functional Tier: Part-level functional attributes (intrinsic function, state machine).
  4. Interactive Tier: Part-level interactive attributes (atomic affordances, kinematic definitions: parent part, joint type, joint parameters).
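
To make the four tiers concrete, here is a minimal, hypothetical Python schema; the field names and types are illustrative assumptions about how such annotations could be organized, not PhysDB's actual storage format.

```python
from dataclasses import dataclass, field

@dataclass
class KinematicDef:
    parent_part: str                      # parent part in the hierarchy
    joint_type: str                       # e.g., "revolute", "prismatic", "fixed"
    origin: tuple[float, float, float]    # joint origin O_i
    axis: tuple[float, float, float]      # joint axis A_i
    limits: tuple[float, float]           # motion range L_i

@dataclass
class PartAnnotation:
    # Static tier: part-level static attributes
    semantic_label: str
    material: str
    mass_kg: float
    # Functional tier: part-level functional attributes
    function: str                         # intrinsic function, e.g., "door for opening"
    states: list[str]                     # simple state machine, e.g., ["closed", "open"]
    # Interactive tier: part-level interactive attributes
    affordances: list[str]                # atomic affordances, e.g., ["pull", "rotate"]
    kinematics: KinematicDef | None = None

@dataclass
class PhysDBAsset:
    # Holistic tier: object-level properties
    category: str
    real_world_scale_m: float
    usage_scene: str
    parts: list[PartAnnotation] = field(default_factory=list)
```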

The annotation pipeline uses a human-in-the-loop process with a multimodal LLM for initial annotation followed by manual correction. To train the kinematic generation stage, the dataset is supplemented with PartNet-Mobility and Infinite-Mobility for ground-truth articulation parameters.

2. Stage 1: VLM as a Physical Blueprint Planner

A finetuned VLM (Qwen2.5-VL) acts as the planner. Its input includes:

  • A single image $I$.
  • An optional 2D part mask $M$ (for granularity control).
  • Generated 3D voxels $V$ (from the first stage of TRELLIS).

The 3D voxel features are extracted using a PartField encoder and a position-aware 3D ConvNet. The VLM is finetuned to autoregressively generate the Hierarchical Physical Blueprint. This includes:

  • Bounding Box Layout: Each box is represented compactly as 6 quantized coordinate tokens drawn from <box0>...<box63>, delimited by the special tokens <boxs> and <boxe> (see the sketch after this list).
  • Detailed Physical Properties: For each part (parent node, articulation type, material, function, etc.).
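
The exact tokenizer is not spelled out here, but a minimal sketch of this style of coordinate quantization, assuming boxes are normalized to [0, 1] and each of the six coordinates (two opposite corners) maps to one of 64 bins, could look like this; the helper names are illustrative:

```python
def bbox_to_tokens(bbox, num_bins=64):
    """Quantize a normalized 3D bounding box (x0, y0, z0, x1, y1, z1)
    into 6 coordinate tokens wrapped by <boxs>/<boxe> delimiters."""
    tokens = ["<boxs>"]
    for coord in bbox:  # each coordinate assumed to lie in [0, 1]
        bin_idx = min(int(coord * num_bins), num_bins - 1)
        tokens.append(f"<box{bin_idx}>")
    tokens.append("<boxe>")
    return tokens

def tokens_to_bbox(tokens, num_bins=64):
    """Invert the quantization, recovering bin centers."""
    coords = []
    for tok in tokens[1:-1]:              # strip <boxs>/<boxe>
        bin_idx = int(tok[4:-1])          # "<box17>" -> 17
        coords.append((bin_idx + 0.5) / num_bins)
    return tuple(coords)

# Example: a drawer-sized box in normalized voxel coordinates.
print(bbox_to_tokens((0.10, 0.20, 0.00, 0.90, 0.45, 0.30)))
# ['<boxs>', '<box6>', '<box12>', '<box0>', '<box57>', '<box28>', '<box19>', '<boxe>']
```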

A key discovery is that physics-guided planning resolves part ambiguity. Co-predicting physical properties alongside bounding boxes provides stronger semantic constraints, enabling reasonable part decompositions even without 2D mask guidance.

3. Stage 2: Diffusion-based Generation with KineVoxel Injection (KVI)

This stage realizes the VLM's blueprint by generating high-fidelity geometry and precise kinematic parameters. The challenge is generating continuous 3D values (joint origin, axis) within a geometry diffusion pipeline.

The solution is the novel KineVoxel Injection (KVI) mechanism (a code sketch follows the list):

  • Articulation Parameter Representation: For a part $i$, the parameters form an 8D vector $P_i = (O_i, A_i, L_i)$, where $O_i \in \mathbb{R}^3$ is the joint origin, $A_i \in \mathbb{R}^3$ is the joint axis, and $L_i \in \mathbb{R}^2$ holds the motion limits.

  • KineVoxel Encoding: This vector is encoded into a "KineVoxel" latent $z_{k,i}$ using a lightweight Kinematic Encoder $E_{kine}$:

    $$z_{k,i} = E_{kine}\big(\text{concat}(S_O \cdot O_i,\; S_A \cdot A_i,\; S_L \cdot L_i)\big)$$

    where $S_O, S_A, S_L$ are scaling factors.

  • Joint Injection: The KineVoxel latent $z_{k,i}$ is concatenated with the sequence of geometry voxel latents $Z_g = \{z_{g,i}\}$ and fed into the main denoising transformer. A joint type embedding $E_{type}$ (derived from the VLM's planned joint type, e.g., "revolute") is added to $z_{k,i}$ to help the transformer distinguish the two latent types.

  • Training Objective: The model is trained using Conditional Flow Matching (CFM) with a composite loss:

    $$L = \mathbb{E}_{t, Z_0, c}\left[ L_{geo} + \lambda_{kine} \cdot L_{kine} \right]$$

    where $c$ is the condition from the VLM blueprint, and the loss terms are:

    $$L_{geo} = \| v_{g,t} - \hat{v}_{g,t} \|^2, \qquad L_{kine} = \| v_{k,t} - \hat{v}_{k,t} \|^2.$$

    The weighting factor $\lambda_{kine} = 10$ emphasizes accurate articulation prediction.
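
A minimal PyTorch sketch of how KVI encoding, injection, and the composite CFM objective could fit together. This is an illustration under stated assumptions, not the paper's implementation: the latent width, MLP shape, joint-type vocabulary, and the linear-interpolation flow-matching path are all assumed, and `transformer` stands in for the main denoising transformer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 512                                   # assumed latent width
JOINT_TYPES = ["fixed", "revolute", "prismatic"]   # illustrative vocabulary

class KinematicEncoder(nn.Module):
    """Lightweight E_kine: scaled 8D articulation vector -> KineVoxel latent,
    plus the joint type embedding E_type from the VLM's planned type."""
    def __init__(self, s_o=1.0, s_a=1.0, s_l=1.0):
        super().__init__()
        # Per-component scaling factors S_O, S_A, S_L (values assumed).
        self.register_buffer("scales",
                             torch.tensor([s_o] * 3 + [s_a] * 3 + [s_l] * 2))
        self.mlp = nn.Sequential(nn.Linear(8, 256), nn.SiLU(),
                                 nn.Linear(256, LATENT_DIM))
        self.type_emb = nn.Embedding(len(JOINT_TYPES), LATENT_DIM)

    def forward(self, params, joint_type_idx):
        # params: (B, 8) = concat(O_i, A_i, L_i)
        z_k = self.mlp(params * self.scales)
        return z_k + self.type_emb(joint_type_idx)

def kvi_cfm_loss(transformer, z_geo, p_kine, joint_type_idx, cond,
                 encoder, lambda_kine=10.0):
    """One CFM training step with the composite geometry + kinematics loss."""
    B, n_geo, _ = z_geo.shape
    z_kine = encoder(p_kine, joint_type_idx).unsqueeze(1)   # (B, 1, D)
    z0 = torch.cat([z_geo, z_kine], dim=1)                  # joint latent sequence
    noise = torch.randn_like(z0)
    t = torch.rand(B, 1, 1, device=z0.device)
    z_t = (1 - t) * noise + t * z0                          # linear interpolation path
    v_target = z0 - noise                                   # CFM velocity target
    v_pred = transformer(z_t, t.squeeze(), cond)            # denoising transformer
    loss_geo = F.mse_loss(v_pred[:, :n_geo], v_target[:, :n_geo])
    loss_kine = F.mse_loss(v_pred[:, n_geo:], v_target[:, n_geo:])
    return loss_geo + lambda_kine * loss_kine
```

The design point mirrored here is that the kinematic latent rides through the same denoising process as the geometry latents, so articulation and shape are generated synergistically rather than regressed after the fact.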

Empirical Validation / Results

Evaluation uses PartObjaverse-Tiny, the PhysXNet test set, and new test sets built from PhysDB and the articulated-object datasets.

1. Part Structure Planning

Baselines: OmniPart (first stage), PartField. Metrics: BBox IoU, Voxel Recall, Voxel IoU.
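
The paper's exact evaluation protocol is not reproduced here; a plausible minimal implementation of the three metrics (per-part averaging and matching details are assumptions) could look like this:

```python
import numpy as np

def voxel_metrics(pred, gt):
    """Per-part voxel agreement between predicted and ground-truth
    occupancy grids (boolean arrays of identical shape)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / max(union, 1)
    recall = inter / max(gt.sum(), 1)     # fraction of GT voxels recovered
    return iou, recall

def bbox_iou(a, b):
    """3D IoU of two axis-aligned boxes (x0, y0, z0, x1, y1, z1)."""
    lo = np.maximum(np.array(a[:3]), np.array(b[:3]))
    hi = np.minimum(np.array(a[3:]), np.array(b[3:]))
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda box: np.prod(np.array(box[3:]) - np.array(box[:3]))
    return inter / (vol(a) + vol(b) - inter)
```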

Table 3. Quantitative results for bounding box generation (%) on PartObjaverse-Tiny.

| Method | Voxel recall ↑ | Voxel IoU ↑ | Bbox IoU ↑ |
| --- | --- | --- | --- |
| PartField | 69.65 | 46.04 | 37.33 |
| OmniPart (SAM mask) | 68.33 | 43.34 | 34.33 |
| PhysForge-bbox (w/o mask) | 67.89 | 35.53 | 32.30 |
| PhysForge (w/o mask) | 73.63 | 47.66 | 36.32 |
| OmniPart | 73.79 | 52.92 | 41.66 |
| PhysForge (Ours) | 77.16 | 53.74 | 42.95 |

Key Findings:

  • PhysForge achieves state-of-the-art results.
  • Physics-guided planning is crucial: "PhysForge (w/o mask)" significantly outperforms "PhysForge-bbox (w/o mask)", showing that predicting physical properties enhances semantic understanding of part structures.
  • PhysForge without a mask outperforms OmniPart using SAM-generated masks, demonstrating robustness.

2. Physics-Grounded Generation

Baselines for Properties: PhysXGen, TRELLIS. Metrics: Chamfer Distance (CD), F1-Score, MAE for scale/material/affordance, CLIP-Similarity for text properties.
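
Chamfer Distance is the geometry metric throughout the results; the following is a brute-force sketch of the common symmetric formulation (the exact variant the authors use, e.g., squared distances or scaling, is not specified here):

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between point sets p (N, 3) and q (M, 3).
    Brute force for clarity; real evaluations use KD-trees or GPU kernels."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```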

Table 1. Quantitative comparison of physics property generation on the PhysXNet test set.

| Method | CD ↓ | F1-0.1 ↑ | F1-0.05 ↑ | Abs. scale (cm) ↓ | Material ↓ | Affordance ↓ | Description ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TRELLIS | 10.10 | 86.53 | 72.47 | - | - | - | - |
| PhysXGen | 9.81 | 87.91 | 73.60 | 25.83 | 1.59 | 3.69 | 0.38 |
| PhysForge (Ours) | 9.21 | 89.24 | 75.43 | 11.04 | 0.81 | 1.22 | 0.87 |

Table 2. Quantitative comparison on the PhysDB test set.

| Method | CD ↓ | F1-0.1 ↑ | F1-0.05 ↑ | Abs. scale (m) ↓ | Material ↓ | Function ↑ | Interaction ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TRELLIS | 24.32 | 68.19 | 53.28 | - | - | - | - |
| PhysXGen | 25.30 | 65.79 | 50.57 | 1.08 | 1.44 | 0.36 | 0.34 |
| PhysForge (Ours) | 22.89 | 70.51 | 55.38 | 0.37 | 0.43 | 0.83 | 0.96 |

Key Findings: PhysForge outperforms baselines in both geometry quality and physics property accuracy, benefiting from the VLM's world-knowledge prior.

3. Kinematic Parameter Generation

Baselines: Articulate Anything, Singapo, URDFormer. Metrics: CD, Clip-Similarity, Joint Axis Error, Joint Pivot Error.
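
These joint metrics are not defined in this summary; the sketch below uses the definitions common in the articulated-object literature, i.e., the sign-invariant angle between axes and the point-to-line pivot distance, which may differ in detail from the authors' protocol:

```python
import numpy as np

def joint_axis_error(axis_pred, axis_gt):
    """Angular error (radians) between predicted and ground-truth joint axes,
    ignoring sign since an axis and its negation describe the same joint."""
    a = axis_pred / np.linalg.norm(axis_pred)
    b = axis_gt / np.linalg.norm(axis_gt)
    return np.arccos(np.clip(abs(np.dot(a, b)), 0.0, 1.0))

def joint_pivot_error(origin_pred, origin_gt, axis_gt):
    """Distance from the predicted pivot to the ground-truth joint line,
    since any point along a revolute axis is an equivalent origin."""
    d = axis_gt / np.linalg.norm(axis_gt)
    v = origin_pred - origin_gt
    return np.linalg.norm(v - np.dot(v, d) * d)   # reject component along axis
```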

Table 4. Quantitative comparison of articulated object generation.

| Method | CD ↓ | Clip-Sim ↑ | Joint-Axis-Err-5 ↓ | Joint-Pivot-Err-5 ↓ | Joint-Axis-Err-all ↓ | Joint-Pivot-Err-all ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Articulate Anything | 23.31 | 0.87 | 0.608 | 0.257 | 0.694 | 0.197 |
| Singapo | 21.10 | 0.85 | 0.241 | 0.153 | - | - |
| URDFormer | 25.42 | 0.84 | 0.781 | 0.652 | - | - |
| PhysForge (w/o joint type emb) | 10.73 | 0.90 | 0.157 | 0.132 | 0.292 | 0.141 |
| PhysForge (w/o kinematic enc) | 11.31 | 0.89 | 0.158 | 0.117 | 0.204 | 0.120 |
| PhysForge (Ours) | 10.21 | 0.93 | 0.101 | 0.071 | 0.164 | 0.096 |

Ablation Analysis:

  • Removing the joint type embedding (the interface between the two stages) degrades joint accuracy, confirming its importance for transferring the VLM's physical common sense into the diffusion stage.
  • Removing the independent kinematic encoder/decoder further compromises precise constraint synthesis.
  • The full PhysForge model achieves superior image consistency and joint parameter accuracy.

4. Qualitative Results & Applications

  • Figures 3-5: Show high-quality, part-aware, articulated 3D assets generated from single images.
  • Applications (Figure 6): Generated assets are directly usable in the following settings (a URDF export sketch follows this list):
    1. Robotic Simulation (RoboTwin): Robotic manipulators interact with assets using detailed geometry and kinematics.
    2. Virtual Worlds (Unity/Unreal Engine): Assets enable complex, physics-based interactions without manual rigging.
    3. Agent-Environment Interaction: Embodied agents can query the model in natural language to receive a physical blueprint for task planning.
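
As a concrete illustration of how the generated kinematic parameters (origin, axis, limits) could map onto a simulator-ready format, here is a hedged sketch that writes one URDF joint; the paper's actual export path is not specified, and the effort/velocity values below are placeholders:

```python
import xml.etree.ElementTree as ET

def make_urdf_joint(name, parent, child, joint_type, origin, axis, limits):
    """Map one generated kinematic definition onto a URDF <joint> element."""
    joint = ET.Element("joint", name=name, type=joint_type)  # e.g., "revolute"
    ET.SubElement(joint, "parent", link=parent)
    ET.SubElement(joint, "child", link=child)
    ET.SubElement(joint, "origin", xyz=" ".join(map(str, origin)))
    ET.SubElement(joint, "axis", xyz=" ".join(map(str, axis)))
    lo, hi = limits
    ET.SubElement(joint, "limit", lower=str(lo), upper=str(hi),
                  effort="10.0", velocity="1.0")  # placeholder dynamics
    return joint

# Example: a cabinet door hinged about the vertical axis, opening up to ~120 deg.
door = make_urdf_joint("door_hinge", "cabinet_body", "cabinet_door",
                       "revolute", (0.3, 0.0, 0.4), (0.0, 0.0, 1.0),
                       (0.0, 2.09))
print(ET.tostring(door, encoding="unicode"))
```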

Theoretical and Practical Implications

  • Theoretical: Proposes a novel formulation for 3D generation that is physics-centric and function-driven, moving beyond static geometry. The decoupled two-stage framework demonstrates the effective synergy between VLMs (for planning) and diffusion models (for realization).
  • Practical: Provides a foundational data engine for interactive 3D content creation.
    • For Embodied AI: Supplies simulation-ready assets for training and testing robotic manipulation policies in diverse environments.
    • For Game Development: Accelerates content creation for interactive virtual worlds by generating assets with built-in physics properties.
    • Dataset Contribution: PhysDB fills a critical data gap with its large-scale, fine-grained physical annotations, enabling future research in physics-aware generation.

Conclusion

PhysForge introduces a novel framework for generating interactive, physics-grounded 3D assets. Its core contributions are:

  1. A decoupled two-stage architecture (VLM Planning + Diffusion Realization) that generates a Hierarchical Physical Blueprint and then realizes it with geometry and precise kinematics.
  2. The KineVoxel Injection (KVI) mechanism, enabling synergistic generation of articulation parameters within a diffusion model.
  3. The PhysDB dataset, providing the necessary training data with a four-tier physical annotation system.

Extensive validation shows PhysForge achieves state-of-the-art performance in part planning and physics property generation. The generated assets are directly applicable in robotic simulators and interactive virtual worlds, paving the way for scalable creation of interactive 3D content.