SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

Summary (Overview)

  • Unified MLLM Framework: SIMART is a novel Multimodal Large Language Model (MLLM) framework that jointly performs 3D geometric part decomposition and kinematic parameter prediction to transform static 3D meshes into "simulation-ready" articulated assets.
  • Sparse 3D VQ-VAE: Introduces a Sparse 3D VQ-VAE that reduces token counts by approximately 70% compared to dense voxel representations, enabling efficient processing of complex 3D assemblies and mitigating memory exhaustion.
  • State-of-the-Art Performance: Achieves superior results on the proposed SIMART-Bench (combining PartNet-Mobility and AIGC assets) and in-the-wild datasets, outperforming existing baselines in both articulation accuracy and geometric fidelity.
  • Direct Mesh Processing & Downstream Utility: The framework directly processes input meshes, preserving high-fidelity geometry, and demonstrates practical applications in physics-based robotic simulation and interactive VR/AR environments.

Introduction and Theoretical Foundation

High-quality articulated 3D assets are essential for embodied AI and physical simulation. However, the majority of existing 3D assets are static meshes, and manual creation of articulated ("sim-ready") objects is labor-intensive. Prior methods often use multi-stage pipelines that decouple part decomposition, joint estimation, and assembly, leading to error accumulation and physically invalid results. Alternatively, recent 3D generative models produce high-quality static meshes but lack kinematic metadata.

A unified MLLM paradigm offers a promising single-stage solution for understanding a 3D asset and directly generating per-part geometry along with kinematic specifications (e.g., URDF). However, existing 3D-native MLLMs are constrained by inefficient, dense volumetric tokenization, which wastes tokens on empty space, leading to prohibitive context lengths and memory overhead.

SIMART addresses this by integrating 3D geometric understanding and generation within a unified MLLM. The core challenge is developing an efficient 3D representation that supports both MLLM-based reasoning and scalable, high-fidelity part-level generation.

Methodology

The goal is to generate a simulation-ready asset $\mathcal{A}$ from multimodal inputs $\mathcal{I} = \{I_{vis}, G_{geo}, T_{txt}\}$ (visual observation, raw geometry, language instruction). The output is defined as $\mathcal{A} = (\mathcal{M}_{seg}, \mathcal{P}_{sim})$, where $\mathcal{M}_{seg} = \{m_1, m_2, \ldots, m_n\}$ are part-segmented meshes and $\mathcal{P}_{sim}$ is simulation metadata (joint parameters, physical properties).
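To make the formulation concrete, the output pair of part meshes and simulation metadata can be sketched as two containers. This is a minimal illustration only; the class and field names below are my assumptions, not the paper's actual data model:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Joint:
    """One joint in the kinematic metadata P_sim (illustrative fields)."""
    parent: str                                # parent part (link) name
    child: str                                 # child part (link) name
    joint_type: str                            # e.g. "revolute", "prismatic", "fixed"
    axis: Tuple[float, float, float]           # joint axis in the asset frame
    origin: Tuple[float, float, float]         # joint origin in the asset frame
    limits: Tuple[float, float] = (0.0, 0.0)   # lower/upper joint limits

@dataclass
class SimReadyAsset:
    """A = (M_seg, P_sim): part-segmented meshes plus simulation metadata."""
    part_meshes: List[object]                                 # M_seg = {m_1, ..., m_n}
    joints: List[Joint] = field(default_factory=list)         # kinematic chain
    physical: Dict[str, float] = field(default_factory=dict)  # e.g. density, friction
```

A downstream exporter would walk `joints` to emit the URDF kinematic tree and attach each part mesh to its link.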

3.2 Unified MLLM

The framework uses Qwen3-VL as the MLLM backbone. It processes a concatenated sequence of:

  • Visual tokens $F_{vis} \in \mathbb{R}^{N_v \times D}$ from a ViT encoder.
  • Geometric tokens $F_{geo} \in \mathbb{R}^{N_g \times D}$ from the Sparse 3D VQ-VAE.
  • Text tokens $F_{txt} \in \mathbb{R}^{N_t \times D}$.

The total sequence of length $L = N_v + N_g + N_t$ is fed into the Transformer. The MLLM is trained to output a hybrid sequence containing both geometric part tokens and structured URDF metadata.
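Assuming all three modalities are already projected to a shared embedding width $D$, the fusion step is plain sequence concatenation. The sizes below are placeholders, not the model's actual dimensions:

```python
import numpy as np

D = 1024                   # shared embedding width (placeholder)
Nv, Ng, Nt = 256, 516, 32  # visual / geometric / text token counts (placeholders)

F_vis = np.random.randn(Nv, D)  # ViT patch embeddings
F_geo = np.random.randn(Ng, D)  # Sparse 3D VQ-VAE tokens, projected to width D
F_txt = np.random.randn(Nt, D)  # text embeddings

# The MLLM consumes one concatenated sequence of length L = Nv + Ng + Nt.
seq = np.concatenate([F_vis, F_geo, F_txt], axis=0)
assert seq.shape == (Nv + Ng + Nt, D)
```

Because geometric tokens dominate $L$, shrinking $N_g$ (the point of the Sparse 3D VQ-VAE below) is what keeps the context length tractable.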

3.3 Sparse 3D VQ-VAE

This component is key to efficient 3D representation. A raw input mesh is voxelized into a $64^3$ grid. A 3D-UNet encoder maps this to a compact latent feature grid $Z \in \mathbb{R}^{16 \times 16 \times 16 \times C}$, which is then aggregated to an $8 \times 8 \times 8$ grid.

Unlike dense VQ-VAEs, this model leverages sparsity. It identifies unoccupied voxels and assigns them a specialized zero token ($e_{zero}$) from a learned codebook $\mathcal{C}$ of 4096 entries. Only features from occupied regions are vector-quantized. Formally, for each latent feature $z_i$:

$$\hat{z}_i = \begin{cases} e_{zero}, & \text{if voxel } i \text{ is unoccupied} \\ \arg\min_{e_j \in \mathcal{C}\setminus\{e_{zero}\}} \| z_i - e_j \|_2, & \text{otherwise} \end{cases}$$

This strategy reduces the informative token count by ~70%. To preserve structural topology, each occupied voxel is serialized into a triplet: ⟨voxel⟩ [xyz] [K]. Here, [K] is the geometry index from the codebook, and [xyz] is a linearized coordinate token computed as $xyz = 64x + 8y + z$, where $x, y, z \in [0, 7]$ are coordinates in the $8 \times 8 \times 8$ grid.
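A minimal sketch of the two operations above: zero-token quantization and coordinate linearization. Reserving codebook index 0 for $e_{zero}$ is my own convention for illustration; the paper does not specify the index layout:

```python
import numpy as np

def quantize_sparse(latents, occupancy, codebook):
    """Zero-token quantization over the flattened 8x8x8 latent grid.

    latents:   (N, C) latent features, one per grid cell
    occupancy: (N,) bool, True where the cell contains geometry
    codebook:  (K, C); index 0 is reserved for the zero token e_zero (assumption)
    Returns codebook indices, with 0 for all empty cells.
    """
    idx = np.zeros(len(latents), dtype=np.int64)  # default: e_zero
    occ = np.flatnonzero(occupancy)
    if occ.size:
        # nearest non-zero codebook entry under L2 distance
        d = np.linalg.norm(latents[occ, None, :] - codebook[None, 1:, :], axis=-1)
        idx[occ] = 1 + d.argmin(axis=1)
    return idx

def linearize(x, y, z):
    """Coordinate token for an 8x8x8 grid: xyz = 64x + 8y + z, each in [0, 7]."""
    return 64 * x + 8 * y + z
```

Only occupied cells then emit ⟨voxel⟩ [xyz] [K] triplets; empty cells contribute no sequence positions, which is where the ~70% token reduction comes from.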

3.4 Simulation-ready Asset Processing

  1. Part Segmentation: The part-specific voxel tokens output by the MLLM are decoded into sparse point clouds $S_p$. A graph-based surface segmentation algorithm maps these onto the input mesh $G_{geo}$. The initial probability of a vertex $v$ belonging to part $p$ is defined by a Gaussian kernel:

    P(v,p)exp(d(v,Sp)22σ2)P(v, p) \propto \exp\left(-\frac{d(v, S_p)^2}{2\sigma^2}\right)

    where $d(v, S_p)$ is the distance from $v$ to the nearest seed of part $p$, and $\sigma$ is a scale hyperparameter. Iterative graph smoothing and majority voting yield the final face labels $\mathcal{M}_{seg}$.

  2. URDF Generation: The MLLM directly outputs a structured URDF specification defining the kinematic chain (parent-child hierarchies, joint types, axes, limits) and physical attributes (material density, surface friction).
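The Gaussian-kernel seeding in step 1 can be sketched as follows. This is a reference sketch only: the paper's distance computation, $\sigma$ value, and subsequent graph smoothing are not reproduced here:

```python
import numpy as np

def part_probabilities(vertices, seeds_per_part, sigma=0.05):
    """Initial per-vertex part probabilities from decoded seed point clouds.

    Implements P(v, p) ∝ exp(-d(v, S_p)^2 / (2 σ^2)), where d(v, S_p)
    is the distance from vertex v to the nearest seed of part p.

    vertices:       (V, 3) mesh vertex positions
    seeds_per_part: list of (K_p, 3) sparse point clouds S_p, one per part
    sigma:          scale hyperparameter (value here is a placeholder)
    Returns a (V, P) row-normalized probability matrix.
    """
    scores = []
    for seeds in seeds_per_part:
        # distance from every vertex to its nearest seed of this part
        d = np.linalg.norm(vertices[:, None, :] - seeds[None, :, :], axis=-1).min(axis=1)
        scores.append(np.exp(-d ** 2 / (2 * sigma ** 2)))
    P = np.stack(scores, axis=1)
    return P / P.sum(axis=1, keepdims=True)
```

In the full pipeline these soft probabilities seed the iterative graph smoothing and majority voting that produce the final face labels.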
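For step 2, the structured output ultimately serializes to URDF XML. A hedged sketch of emitting one revolute joint with Python's standard `xml.etree.ElementTree`; the link names, limits, and effort/velocity values are made up for illustration:

```python
import xml.etree.ElementTree as ET

def make_revolute_joint(name, parent, child, axis, origin, lower, upper):
    """Build one URDF <joint> element (illustrative values only)."""
    j = ET.Element("joint", name=name, type="revolute")
    ET.SubElement(j, "parent", link=parent)
    ET.SubElement(j, "child", link=child)
    ET.SubElement(j, "origin", xyz=" ".join(map(str, origin)), rpy="0 0 0")
    ET.SubElement(j, "axis", xyz=" ".join(map(str, axis)))
    ET.SubElement(j, "limit", lower=str(lower), upper=str(upper),
                  effort="10", velocity="1")
    return j

# Hypothetical asset: a cabinet whose door swings about the z-axis.
robot = ET.Element("robot", name="cabinet")
robot.append(make_revolute_joint("door_hinge", "base", "door",
                                 (0, 0, 1), (0.3, 0.0, 0.0), 0.0, 1.57))
print(ET.tostring(robot, encoding="unicode"))
```

Physical attributes such as density and friction would attach to the corresponding `<link>` and contact entries in the same document.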

Empirical Validation / Results

Datasets: Training uses 39,600 3D objects from PhysXNet and PartNet-Mobility. For evaluation, the authors introduce SIMART-Bench, a high-fidelity benchmark combining In-Domain (ID) assets from PartNet-Mobility with Out-of-Distribution (OOD) AI-generated objects to test robustness.

Metrics:

  • Kinematic Awareness: Type Accuracy (Type ↑), Axis Error (Axis ↓), Origin Error (Origin ↓).
  • Geometric Decomposition: Intersection over Union (IoU ↑), Chamfer Distance (CD ↓).
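The two geometric metrics can be computed as in the sketch below, using one common convention (symmetric Chamfer Distance on point sets, IoU on boolean occupancy grids); the paper's exact evaluation protocol may differ:

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer Distance between point sets A (N, 3) and B (M, 3):
    mean nearest-neighbor distance in each direction, summed."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def voxel_iou(occ_a, occ_b):
    """Intersection over Union of two boolean occupancy grids."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return inter / union if union else 1.0
```

Lower CD means the predicted part surface hugs the ground truth more closely; higher IoU means the predicted part volume overlaps it more completely.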

4.1 Articulated Object and Kinematic Awareness

Table 1: Quantitative comparison of articulation accuracy and geometric fidelity.

| Method | ID Items | | | | | AI-generated Items | | | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | Type ↑ | Axis ↓ | Origin ↓ | IoU ↑ | CD ↓ | Type ↑ | Axis ↓ | Origin ↓ | IoU ↑ | CD ↓ |
| URDFormer [7] | 0.496 | 0.585 | 0.610 | 0.002 | 0.624 | 0.544 | 0.557 | 0.476 | 0.016 | 0.650 |
| Articulate-Anything [19] | 0.891 | 0.315 | 0.174 | 0.202 | 0.239 | 0.765 | 0.243 | 0.232 | 0.069 | 0.244 |
| PhysX-Anything [5] | 0.686 | 0.312 | 0.322 | 0.128 | 0.278 | 0.658 | 0.481 | 0.324 | 0.100 | 0.334 |
| Particulate [22] | 0.822 | 0.208 | 0.204 | 0.643 | 0.140 | 0.817 | 0.166 | 0.168 | 0.618 | 0.106 |
| SIMART (ours) | 0.928 | 0.080 | 0.111 | 0.690 | 0.087 | 0.831 | 0.136 | 0.145 | 0.777 | 0.079 |

SIMART achieves state-of-the-art performance across all metrics. Generative baselines (Articulate-Anything, PhysX-Anything) show poor geometric alignment (low IoU, high CD). SIMART significantly outperforms the mesh-processing baseline Particulate by leveraging MLLM reasoning.

4.2 3D Part Understanding

The task is to identify and reconstruct a specific object component based on a natural language description.

Table 2: Quantitative comparison of part grounding performance on AI-generated items.

| Method | IoU ↑ | CD ↓ |
| :--- | :---: | :---: |
| PhysX-Anything [5] | 0.067 | 0.347 |
| P3SAM [34] + Qwen3-VL | 0.507 | 0.234 |
| SIMART (ours) | 0.807 | 0.018 |

SIMART significantly outperforms baselines, effectively linking functional descriptions to physical coordinates via coordinate-aware tokenization and VLM knowledge.

4.3 Ablation Studies

Table 3: Ablation study on AI-generated items.

| Method | Type ↑ | Center ↓ | IoU ↑ | CD ↓ | Token Num ↓ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| + Dense token | OOM | OOM | OOM | OOM | 4138 |
| + Force Sparse | 0.661 | 0.157 | 0.678 | 0.100 | 862 |
| + Zero Sparse | 0.794 | 0.108 | 0.745 | 0.074 | 516 |
| + Vision (ours) | 0.937 | 0.074 | 0.832 | 0.055 | 516 |

Key Findings:

  1. Dense tokens cause Out-of-Memory (OOM) errors during training.
  2. Sparse representation (Force Sparse) reduces tokens and enables training.
  3. The zero-token mechanism (Zero Sparse) further improves performance with minimal tokens.
  4. Integrating visual features yields the best performance, highlighting their role in resolving geometric ambiguities.

Theoretical and Practical Implications

  • Paradigm Shift: SIMART demonstrates the feasibility of a unified, single-stage MLLM framework for sim-ready asset generation, moving beyond error-prone multi-stage pipelines.
  • Efficiency Breakthrough: The Sparse 3D VQ-VAE provides a blueprint for efficient 3D tokenization in MLLMs, solving a major scalability bottleneck for complex 3D tasks.
  • Benchmark Contribution: SIMART-Bench addresses the lack of diversity in existing datasets, providing a more rigorous testbed for generalization.
  • Practical Applications: The work enables:
    • Scalable generation of diverse training scenarios for embodied AI and robotic manipulation.
    • Interactive VR/AR asset creation through user-driven interfaces (e.g., click-to-functionalize).
    • Automated enrichment of virtual worlds with physically accurate, interactive objects.

Conclusion

SIMART presents a unified MLLM framework that effectively transforms static 3D meshes into functional, simulation-ready articulated assets. By introducing a Sparse 3D VQ-VAE, it achieves a 70% reduction in token redundancy, enabling efficient processing and high-fidelity generation. The model sets a new state-of-the-art on articulation tasks and demonstrates practical utility in physics-based simulation and interactive environments.

Limitations & Future Work: The primary limitation is the scarcity and inconsistent quality of existing articulated 3D datasets, which hinders open-world generalization. Future work will focus on using SIMART as a foundational tool to generate pre-verified articulation predictions, thereby accelerating the data-annotation loop and facilitating the creation of larger, more diverse datasets for enhanced generative capabilities.