# SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

> SIMART introduces a unified MLLM framework that efficiently decomposes static 3D meshes into articulated, simulation-ready assets by using a novel Sparse 3D VQ-VAE for compact representation.

- **Source:** [arXiv](https://arxiv.org/abs/2603.23386)
- **Published:** 2026-03-26
- **Permalink:** https://picx.dev/p/hDzPUF
- **Whiteboard:** https://picx.dev/p/hDzPUF/image

## Summary

*   **Unified MLLM Framework:** SIMART is a novel Multimodal Large Language Model (MLLM) framework that jointly performs 3D geometric part decomposition and kinematic parameter prediction to transform static 3D meshes into "simulation-ready" articulated assets.
*   **Sparse 3D VQ-VAE:** Introduces a Sparse 3D VQ-VAE that reduces token counts by approximately **70%** compared to dense voxel representations, enabling efficient processing of complex 3D assemblies and mitigating memory exhaustion.
*   **State-of-the-Art Performance:** Achieves superior results on the proposed SIMART-Bench (combining PartNet-Mobility and AIGC assets) and in-the-wild datasets, outperforming existing baselines in both articulation accuracy and geometric fidelity.
*   **Direct Mesh Processing & Downstream Utility:** The framework directly processes input meshes, preserving high-fidelity geometry, and demonstrates practical applications in physics-based robotic simulation and interactive VR/AR environments.

## Introduction and Theoretical Foundation
High-quality articulated 3D assets are essential for embodied AI and physical simulation. However, the majority of existing 3D assets are static meshes, and manual creation of articulated ("sim-ready") objects is labor-intensive. Prior methods often use multi-stage pipelines that decouple part decomposition, joint estimation, and assembly, leading to error accumulation and physically invalid results. Alternatively, recent 3D generative models produce high-quality static meshes but lack kinematic metadata.

A unified MLLM paradigm offers a promising single-stage solution for understanding a 3D asset and directly generating per-part geometry along with kinematic specifications (e.g., URDF). However, existing 3D-native MLLMs are constrained by inefficient, dense volumetric tokenization, which wastes tokens on empty space, leading to prohibitive context lengths and memory overhead.

**SIMART** addresses this by integrating 3D geometric understanding and generation within a unified MLLM. The core challenge is developing an efficient 3D representation that supports both MLLM-based reasoning and scalable, high-fidelity part-level generation.

## Methodology
The goal is to generate a simulation-ready asset $\mathcal{A}$ from multimodal inputs $\mathcal{I} = \{I_{vis}, G_{geo}, T_{txt}\}$ (visual observation, raw geometry, language instruction). The output is defined as $\mathcal{A} = (\mathcal{M}_{seg}, \mathcal{P}_{sim})$, where $\mathcal{M}_{seg} = \{m_1, m_2, ..., m_n\}$ are part-segmented meshes and $\mathcal{P}_{sim}$ is simulation metadata (joint parameters, physical properties).

### 3.2 Unified MLLM
The framework uses **Qwen3-VL** as the MLLM backbone. It processes a concatenated sequence of:
*   **Visual tokens** $F_{vis} \in \mathbb{R}^{N_v \times D}$ from a ViT encoder.
*   **Geometric tokens** $F_{geo} \in \mathbb{R}^{N_g \times D}$ from the Sparse 3D VQ-VAE.
*   **Text tokens** $F_{txt} \in \mathbb{R}^{N_t \times D}$.

The concatenated sequence, of total length $L = N_v + N_g + N_t$, is fed into the Transformer. The MLLM is trained to output a hybrid sequence containing both geometric part tokens and structured URDF metadata.

### 3.3 Sparse 3D VQ-VAE
This component is key to efficient 3D representation. A raw input mesh is voxelized into a $64^3$ grid. A 3D-UNet encoder maps this to a compact latent feature grid $Z \in \mathbb{R}^{16 \times 16 \times 16 \times C}$, which is then aggregated to an $8 \times 8 \times 8$ grid.

Unlike dense VQ-VAEs, this model leverages sparsity. It identifies unoccupied voxels and assigns them a specialized **zero token** ($e_{zero}$) from a learned codebook $\mathcal{C}$ of 4096 entries. Only features from occupied regions are vector-quantized. Formally, for each latent feature $z_i$:

$$
\hat{z}_i =
\begin{cases}
e_{zero}, & \text{if Voxel } i \text{ is unoccupied} \\
\arg\min_{e_j \in \mathcal{C} \setminus \{e_{zero}\}} \| z_i - e_j \|_2, & \text{otherwise}
\end{cases}
$$

This strategy reduces the token count by roughly 70% relative to dense tokenization. To preserve structural topology, each occupied voxel is serialized into a triplet: `⟨voxel⟩ [xyz] [K]`. Here, `[K]` is the geometry index from the codebook, and `[xyz]` is a linearized coordinate token computed as $xyz = 64x + 8y + z$, where $x, y, z \in [0, 7]$ are coordinates in the $8 \times 8 \times 8$ grid, so $xyz$ ranges over $[0, 511]$.
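
The quantization rule and the coordinate linearization above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional codebook, and the choice to simply omit unoccupied cells from the serialized sequence are assumptions made for illustration. Only the formula $xyz = 64x + 8y + z$ and the nearest-codeword rule come from the text.

```python
from typing import Dict, List, Sequence, Tuple

GRID = 8  # side length of the 8 x 8 x 8 latent grid described above


def linearize(x: int, y: int, z: int) -> int:
    """Map grid coordinates in [0, 7] to one index: 64x + 8y + z."""
    if not all(0 <= c < GRID for c in (x, y, z)):
        raise ValueError("coordinates must lie in [0, 7]")
    return GRID * GRID * x + GRID * y + z


def delinearize(idx: int) -> Tuple[int, int, int]:
    """Invert linearize(): recover (x, y, z) from the linear index."""
    x, rem = divmod(idx, GRID * GRID)
    y, z = divmod(rem, GRID)
    return x, y, z


def nearest_code(feat: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook entry closest to feat (squared L2 distance)."""
    def sqdist(entry: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(feat, entry))
    return min(range(len(codebook)), key=lambda j: sqdist(codebook[j]))


def serialize_sparse(
    occupied: Dict[Tuple[int, int, int], Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[Tuple[int, int]]:
    """Return sorted (xyz, K) pairs for occupied cells only.

    Assumption: `codebook` holds only the non-zero entries, and unoccupied
    cells (which would receive e_zero) are left out of the sequence.
    """
    return sorted(
        (linearize(*coord), nearest_code(feat, codebook))
        for coord, feat in occupied.items()
    )


# Toy data: a 3-entry codebook and three occupied cells (hypothetical values).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_latents = {(0, 0, 1): [0.9, 0.1], (7, 7, 7): [0.1, 0.8], (1, 2, 3): [0.2, 0.1]}
tokens = serialize_sparse(toy_latents, toy_codebook)
for xyz, k in tokens:
    print(f"⟨voxel⟩ [{xyz}] [{k}]")
```

Because only occupied cells are emitted, the sequence length scales with surface occupancy rather than with the full 512-cell grid, which is the source of the token savings the section describes.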

### 3.4 Simulator-ready Assets Process
1.  **Part Segmentation:** The part-specific voxel tokens output by the MLLM are decoded into sparse point clouds $S_p$, which act as seeds. A graph-based surface segmentation algorithm maps these seeds onto the input mesh $G_{geo}$. The initial probability of a vertex $v$ belonging to part $p$ is defined by a Gaussian kernel:

    $$
    P(v, p) \propto \exp\left(-\frac{d(v, S_p)^2}{2\sigma^2}\right)
    $$

    where $d(v, S_p)$ is the distance to the nearest seed of part $p$, and $\sigma$ is a scale hyperparameter. Iterative graph-smoothing and majority voting yield final face labels $\mathcal{M}_{seg}$.

2.  **URDF Generation:** The MLLM directly outputs a structured URDF specification defining the kinematic chain (parent-child hierarchies, joint types, axes, limits) and physical attributes (material density, surface friction).
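
The Gaussian seeding in step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names are hypothetical, the probabilities are normalized across parts (the text only gives proportionality), and the subsequent graph smoothing and majority voting are not shown.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def nearest_seed_distance(v: Point, seeds: List[Point]) -> float:
    """d(v, S_p): Euclidean distance from vertex v to its nearest seed."""
    return min(math.dist(v, s) for s in seeds)


def part_probabilities(
    v: Point, seeds_by_part: Dict[str, List[Point]], sigma: float
) -> Dict[str, float]:
    """Initial P(v, p) from the Gaussian kernel, normalized over parts."""
    weights = {
        part: math.exp(-nearest_seed_distance(v, seeds) ** 2 / (2 * sigma ** 2))
        for part, seeds in seeds_by_part.items()
    }
    total = sum(weights.values())
    if total == 0.0:
        # Every weight underflowed (vertex far from all seeds): fall back to uniform.
        return {part: 1.0 / len(weights) for part in weights}
    return {part: w / total for part, w in weights.items()}


# Hypothetical seeds for two parts and one vertex close to the "door" seed.
seeds = {"door": [(0.0, 0.0, 0.0)], "body": [(1.0, 0.0, 0.0)]}
probs = part_probabilities((0.1, 0.0, 0.0), seeds, sigma=0.5)
print(max(probs, key=probs.get))  # part with the highest initial probability
```

A smaller `sigma` makes the assignment sharper around each seed, while a larger one leaves more vertices ambiguous for the smoothing stage to resolve.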

## Empirical Validation / Results
**Datasets:** Training uses 39,600 3D objects from PhysXNet and PartNet-Mobility. For evaluation, the authors introduce **SIMART-Bench**, a high-fidelity benchmark combining In-Domain (ID) assets from PartNet-Mobility with Out-of-Distribution (OOD) AI-generated objects to test robustness.

**Metrics:**
*   **Kinematic Awareness:** Type Accuracy (`Type ↑`), Axis Error (`Axis ↓`), Origin Error (`Origin ↓`).
*   **Geometric Decomposition:** Intersection over Union (`IoU ↑`), Chamfer Distance (`CD ↓`).
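
For readers unfamiliar with the two geometric metrics, the sketch below shows one common way to compute them. The summary does not state the paper's exact conventions (for example squared versus unsquared distances, point sampling density, or whether IoU is taken over voxels or surface samples), so the voxel-set IoU and the unsquared symmetric Chamfer distance used here are assumptions.

```python
import math
from typing import List, Sequence, Set, Tuple

Voxel = Tuple[int, int, int]
Point = Sequence[float]


def voxel_iou(pred: Set[Voxel], gt: Set[Voxel]) -> float:
    """Intersection over Union of two occupied-voxel sets."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance a->b plus b->a."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(math.dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


# Toy inputs (hypothetical): two of four distinct voxels overlap.
pred_vox = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
gt_vox = {(0, 0, 0), (0, 0, 1), (1, 0, 0)}
print(voxel_iou(pred_vox, gt_vox))  # 0.5

pts_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pts_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(pts_a, pts_b))  # 1.0
```

This brute-force nearest-neighbour search is quadratic in the number of points. It is fine for small examples, but a spatial index would be needed for dense meshes.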

### 4.1 Articulated Object and Kinematic Awareness
**Table 1: Quantitative comparison of articulation accuracy and geometric fidelity.**

| Method | **ID Items** | | | | | **AI-generated Items** | | | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| | **Type ↑** | **Axis ↓** | **Origin ↓** | **IoU ↑** | **CD ↓** | **Type ↑** | **Axis ↓** | **Origin ↓** | **IoU ↑** | **CD ↓** |
| URDFormer [7] | 0.496 | 0.585 | 0.610 | 0.002 | 0.624 | 0.544 | 0.557 | 0.476 | 0.016 | 0.650 |
| Articulate-Anything [19] | 0.891 | 0.315 | 0.174 | 0.202 | 0.239 | 0.765 | 0.243 | 0.232 | 0.069 | 0.244 |
| PhysX-Anything [5] | 0.686 | 0.312 | 0.322 | 0.128 | 0.278 | 0.658 | 0.481 | 0.324 | 0.100 | 0.334 |
| Particulate [22] | 0.822 | 0.208 | 0.204 | 0.643 | 0.140 | 0.817 | 0.166 | 0.168 | 0.618 | 0.106 |
| **SIMART (ours)** | **0.928** | **0.080** | **0.111** | **0.690** | **0.087** | **0.831** | **0.136** | **0.145** | **0.777** | **0.079** |

*SIMART achieves the best score on every metric in both splits.* The generative baselines (Articulate-Anything, PhysX-Anything) show poor geometric alignment, with IoU at or below 0.202 and CD at or above 0.239. Against the strongest mesh-processing baseline, Particulate, SIMART cuts ID axis error from 0.208 to 0.080 and raises AI-generated IoU from 0.618 to 0.777, a gain attributed to MLLM reasoning.

### 4.2 3D Part Understanding
The task is to identify and reconstruct a specific object component based on a natural language description.

**Table 2: Quantitative comparison of part grounding performance on AI-generated items.**

| Method | **AI-generated Items** | |
| :--- | :---: | :---: |
| | **IoU ↑** | **CD ↓** |
| PhysX-Anything [5] | 0.067 | 0.347 |
| P3SAM [34] + Qwen3-VL | 0.507 | 0.234 |
| **SIMART (ours)** | **0.807** | **0.018** |

*SIMART outperforms both baselines by a wide margin (IoU 0.807 vs. 0.507 and CD 0.018 vs. 0.234 for the next-best method), linking functional descriptions to physical coordinates via coordinate-aware tokenization and VLM knowledge.*

### 4.3 Ablation Studies
**Table 3: Ablation study on AI-generated items.**

| Method | Type ↑ | Center ↓ | IoU ↑ | CD ↓ | Token Num ↓ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| + Dense token | OOM | | | | 4138 |
| + Force Sparse | 0.661 | 0.157 | 0.678 | 0.100 | 862 |
| + Zero Sparse | 0.794 | 0.108 | 0.745 | 0.074 | 516 |
| + Vision (ours) | **0.937** | **0.074** | **0.832** | **0.055** | 516 |

**Key Findings:**
1.  **Dense tokens** cause Out-of-Memory (OOM) errors during training.
2.  **Sparse representation** (`Force Sparse`) cuts the token count from 4138 to 862 and makes training feasible.
3.  The **zero-token mechanism** (`Zero Sparse`) improves every reported metric while lowering the token count further, to 516.
4.  Integrating **visual features** yields the best performance, highlighting their role in resolving geometric ambiguities.
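
The token savings in Table 3 can be checked with a few lines of arithmetic. The counts are taken from the table; the helper name is hypothetical. These ratios are specific to the ablation's token counts and need not match the approximate 70% figure quoted elsewhere, which may be measured over a different set of assets.

```python
# Token counts as reported in Table 3.
token_counts = {"Dense token": 4138, "Force Sparse": 862, "Zero Sparse": 516}


def reduction_vs(baseline: int, count: int) -> float:
    """Fraction of tokens removed relative to the baseline count."""
    return 1.0 - count / baseline


dense = token_counts["Dense token"]
for name, count in token_counts.items():
    print(f"{name}: {count} tokens, {reduction_vs(dense, count):.1%} fewer than dense")
```

On these counts, `Force Sparse` removes about 79% of the dense tokens and `Zero Sparse` about 88%.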

## Theoretical and Practical Implications
*   **Paradigm Shift:** SIMART demonstrates the feasibility of a **unified, single-stage MLLM framework** for sim-ready asset generation, moving beyond error-prone multi-stage pipelines.
*   **Efficiency Breakthrough:** The **Sparse 3D VQ-VAE** provides a blueprint for efficient 3D tokenization in MLLMs, solving a major scalability bottleneck for complex 3D tasks.
*   **Benchmark Contribution:** **SIMART-Bench** addresses the lack of diversity in existing datasets, providing a more rigorous testbed for generalization.
*   **Practical Applications:** The work enables:
    *   **Scalable generation of diverse training scenarios** for embodied AI and robotic manipulation.
    *   **Interactive VR/AR asset creation** through user-driven interfaces (e.g., click-to-functionalize).
    *   **Automated enrichment of virtual worlds** with physically accurate, interactive objects.

## Conclusion
SIMART presents a unified MLLM framework that transforms static 3D meshes into functional, simulation-ready articulated assets. By introducing a Sparse 3D VQ-VAE, it cuts the token count by roughly 70% relative to dense voxel representations, enabling efficient processing and high-fidelity generation. The model sets a new state of the art on articulation tasks and demonstrates practical utility in physics-based simulation and interactive environments.

**Limitations & Future Work:** The primary limitation is the scarcity and inconsistent quality of existing articulated 3D datasets, which hinders open-world generalization. Future work will focus on using SIMART as a foundational tool to generate pre-verified articulation predictions, thereby accelerating the data-annotation loop and facilitating the creation of larger, more diverse datasets for enhanced generative capabilities.

---

_Markdown view of https://picx.dev/p/hDzPUF, served by PicX — AI-generated visual whiteboard summaries of research papers._
