Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Summary (Overview)
- Paradigm Shift: Proposes a novel approach to equip Multimodal Large Language Models (MLLMs) with 3D spatial awareness by leveraging the implicit 3D and physical priors learned by large-scale video generation models, rather than relying on explicit 3D data or complex geometric supervision.
- VEGA-3D Framework: Introduces VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a frozen video diffusion model (e.g., Wan2.1) as a Latent World Simulator to extract spatiotemporal geometric features, which are fused with semantic features via an Adaptive Gated Fusion module.
- Key Finding: Identifies that multi-view feature consistency in a model's latent space is a strong predictor of its downstream 3D understanding capability. Diffusion Transformer (DiT)-based video generators exhibit high consistency and provide superior spatial priors.
- Superior Performance: Demonstrates state-of-the-art or competitive results across diverse benchmarks: 3D scene understanding (ScanRefer, ScanQA), spatial reasoning (VSI-Bench), and embodied manipulation (LIBERO), validating the effectiveness and generalizability of generative priors.
- Design Insights: Shows that the most informative geometric cues are extracted at intermediate noise levels and from intermediate layers of the generative model, and that generative and semantic features are complementary.
Introduction and Theoretical Foundation
Multimodal Large Language Models (MLLMs) exhibit "spatial blindness," struggling with fine-grained geometric reasoning. Existing solutions either depend on explicit 3D modalities (point clouds, depth) limited by data scarcity, or employ complex geometric scaffolding (e.g., reconstruction modules, 3D knowledge distillation).
This paper posits a paradigm shift: large-scale video generation models inherently learn robust 3D structural priors and physical laws as a byproduct of their training objective to synthesize temporally coherent and physically plausible videos. To generate consistent motion and occlusion, these models must align appearance with underlying 3D geometry. The core research question is whether these implicit physical priors can be extracted and repurposed to enhance downstream 3D visual understanding in MLLMs.
The authors hypothesize that the latent representations of video generators encode geometry-consistent structure, serving as a scalable source of 3D awareness without the need for explicit 3D supervision.
Methodology
The VEGA-3D framework enhances a standard MLLM pipeline with a parallel generative feature branch. The methodology has three stages:
1. 3D Awareness Analysis via Multi-view Feature Consistency
A novel metric, the Multi-view Correspondence Score, is introduced to quantify a model's implicit geometric understanding. For a 3D scene, features from different views are projected into a shared voxel grid using ground-truth camera poses and depth. The consistency for a voxel $v$ observed in views $i$ and $j$ is the cosine similarity of its projected features:

$$s_v(i, j) = \frac{\langle f_v^{(i)}, f_v^{(j)} \rangle}{\lVert f_v^{(i)} \rVert \, \lVert f_v^{(j)} \rVert}$$

The final score averages this similarity over all voxels and their cross-view pairs. A high correlation is found between this score and a Normalized Overall Score (NOS) on downstream tasks, confirming it as a reliable indicator of 3D capability.
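The metric above can be sketched in a few lines. This is a minimal illustration assuming per-view feature vectors have already been projected into voxels (the paper's full pipeline uses ground-truth poses and depth for that projection step, which is omitted here); `multiview_correspondence_score` and its inputs are hypothetical names.

```python
import numpy as np

def multiview_correspondence_score(view_feats, voxel_ids):
    """Average pairwise cosine similarity of features that different
    views project into the same voxel. `view_feats` is a list of
    (n_points, d) arrays, one per view; `voxel_ids` gives the voxel
    each point lands in. Simplified sketch, not the paper's exact code."""
    buckets = {}
    for feats, vids in zip(view_feats, voxel_ids):
        for f, v in zip(feats, vids):
            # L2-normalize so the dot product below is cosine similarity.
            buckets.setdefault(v, []).append(f / (np.linalg.norm(f) + 1e-8))
    sims = []
    for vecs in buckets.values():
        # All pairs of observations that share this voxel.
        for i in range(len(vecs)):
            for j in range(i + 1, len(vecs)):
                sims.append(float(vecs[i] @ vecs[j]))
    return float(np.mean(sims)) if sims else 0.0
```

A model whose features for the same 3D point agree across viewpoints scores near 1; view-inconsistent features pull the score toward 0.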
2. Latent World Simulation (Feature Extraction)
A pre-trained video diffusion model (e.g., Wan2.1-T2V) is used as a frozen Latent World Simulator. An input video $V$ is encoded into a clean latent $x_0$ by the model's VAE. To activate the model's structural reasoning, the latent is perturbed with Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ along the Flow Matching path:

$$x_t = (1 - t)\, x_0 + t\, \epsilon$$

The noisy latent $x_t$ is fed into the backbone with an empty text prompt $\varnothing$ to minimize semantic hallucination, and features are extracted from a specific intermediate DiT layer $l$:

$$h^{(l)} = \mathrm{DiT}^{(l)}(x_t,\, t,\, \varnothing)$$

After spatiotemporal pooling of these layer features, the generative representation $F_{\text{gen}}$ is obtained.
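The extraction stage can be sketched as follows. This assumes the standard rectified-flow interpolation for the noising step (Wan2.1's exact schedule is an assumption here), and `blocks` is a hypothetical stand-in for the frozen DiT's transformer blocks:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_latent(x0, t):
    """Perturb a clean VAE latent x0 along the rectified-flow path
    x_t = (1 - t) * x0 + t * eps, with eps ~ N(0, I).
    (Common Flow Matching formulation; assumed, not paper-verified.)"""
    eps = rng.standard_normal(x0.shape)
    return (1.0 - t) * x0 + t * eps

def extract_features(x_t, blocks, layer):
    """Run the noisy latent through the frozen backbone and tap the
    hidden state after an intermediate block, rather than using the
    model's final denoising prediction."""
    h = x_t
    for i, block in enumerate(blocks):
        h = block(h)
        if i == layer:
            return h  # intermediate feature h^(l)
    return h
```

The key design choice is tapping an intermediate layer of a partially noised latent: early layers are too low-level and the final output is specialized for denoising, whereas mid-network activations carry the structural cues the MLLM needs.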
3. Bridging the Generative and Semantic Gap via Adaptive Gated Fusion
Generative ($F_{\text{gen}}$) and semantic ($F_{\text{sem}}$) features live on different manifolds, so each is first projected into the LLM's embedding dimension by a learned projector:

$$\hat{F}_{\text{gen}} = \phi_g(F_{\text{gen}}), \qquad \hat{F}_{\text{sem}} = \phi_s(F_{\text{sem}})$$

An adaptive gating mechanism computes a token-level weight $g \in (0, 1)$ to balance the two streams:

$$g = \sigma\!\big(W\, [\hat{F}_{\text{gen}};\, \hat{F}_{\text{sem}}]\big)$$

where $\sigma$ is the sigmoid function and $[\cdot\,;\cdot]$ denotes concatenation. The final fused token is a convex combination:

$$F_{\text{fused}} = g \odot \hat{F}_{\text{gen}} + (1 - g) \odot \hat{F}_{\text{sem}}$$
This allows the model to dynamically prioritize geometric cues for spatial reasoning and semantic cues for recognition.
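The fusion step described above can be sketched with plain matrix projections; the weight names and shapes here are illustrative, not the paper's exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_gated_fusion(f_gen, f_sem, W_g, W_s, w_gate):
    """Project both streams to the LLM width, compute a token-level
    sigmoid gate from their concatenation, and mix them convexly.
    f_gen: (tokens, d_gen), f_sem: (tokens, d_sem); linear projectors
    stand in for whatever learned projectors the paper uses."""
    h_gen = f_gen @ W_g                      # (tokens, d_llm)
    h_sem = f_sem @ W_s                      # (tokens, d_llm)
    gate = sigmoid(np.concatenate([h_gen, h_sem], axis=-1) @ w_gate)  # (tokens, 1)
    # Convex combination: gate -> 1 favors geometry, gate -> 0 favors semantics.
    return gate * h_gen + (1.0 - gate) * h_sem
```

Because the gate is computed per token, a query about object identity can lean on the semantic stream while a query about relative position leans on the generative one.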
Empirical Validation / Results
Extensive experiments validate VEGA-3D's effectiveness across three domains.
1. 3D Scene Understanding
As shown in Table 1, VEGA-3D, built upon the Video-3D LLM baseline, achieves state-of-the-art or competitive performance across five benchmarks, excelling particularly in localization-centric tasks.
Table 1: Performance on 3D Scene Understanding Benchmarks (Selected Rows)
| Method | ScanRefer Acc@0.25 | Multi3DRefer F1@0.25 | Scan2Cap B-4@0.5 | ScanQA CIDEr | SQA3D EM |
|---|---|---|---|---|---|
| Baseline (Video-3D LLM) | 58.1 | 58.0 | 41.3 | 102.1 | 58.6 |
| 3DRS (NeurIPS 25) | 62.9 | 60.4 | 41.6 | 104.8 | 60.6 |
| VEGA-3D (Ours) | 63.2 | 60.8 | 42.2 | 106.3 | 61.3 |
2. Spatial Reasoning (VSI-Bench)
As shown in Table 2, augmenting the Qwen2.5VL-7B baseline with VEGA-3D yields consistent improvements, achieving an overall score of 50.5 and outperforming many larger spatial-enhanced models.
Table 2: Performance on VSI-Bench (Spatial Reasoning)
| Model | Overall Avg. |
|---|---|
| Qwen2.5VL-7B (Baseline) | 48.9 |
| 3DRS-7B | 45.9 |
| VG-LLM-8B | 50.1 |
| VEGA-3D (Ours) | 50.5 |
3. Embodied Manipulation (LIBERO)
Injecting VEGA-3D's priors into the OpenVLA-OFT policy improves the average success rate on the LIBERO benchmark from 97.0% to 97.3%, demonstrating transferability to active physical reasoning.
Ablation Studies & Analysis
- Feature Source: DiT-based generative models (e.g., Wan2.1) provide stronger spatial priors than UNet-based models or standard discriminative encoders, correlating with their higher multi-view consistency scores.
- Noise and Layer Dynamics: Performance peaks at intermediate noise levels and when using features from intermediate DiT layers (e.g., layer 20), as shown in Figure 6.
- Fusion Mechanism: The proposed Adaptive Gated Fusion outperforms simpler alternatives (addition, concatenation, cross-attention), effectively balancing semantic and geometric information (Table 5).
- Efficiency: By caching extracted generative features per scene, the inference overhead is substantially reduced (Figure 7).
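The per-scene caching trick from the efficiency ablation amounts to memoizing the frozen backbone's output, since it depends only on the scene and not on the question. A minimal sketch with a plain dict cache (the paper's exact caching scheme is not specified here; `scene_features` and `backbone_fn` are hypothetical names):

```python
_feature_cache = {}

def scene_features(scene_id, backbone_fn):
    """Run the expensive frozen video backbone once per scene and
    reuse the result for every subsequent query about that scene."""
    if scene_id not in _feature_cache:
        _feature_cache[scene_id] = backbone_fn(scene_id)
    return _feature_cache[scene_id]
```

With the generative branch amortized this way, per-query inference cost approaches that of the semantic-only baseline.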
Theoretical and Practical Implications
- Theoretical: Provides evidence that the objective of video generation forces models to learn transferable, geometry-consistent world models. This suggests that physical understanding can emerge as a latent capability in generative models trained on large-scale video data.
- Practical: Offers a scalable, plug-and-play solution to mitigate spatial blindness in MLLMs. It bypasses the need for expensive 3D annotations or complex multi-stage training pipelines. The framework is model-agnostic; advancements in video generation directly translate to stronger 3D understanding capabilities.
- Field Impact: Shifts the perspective for achieving 3D awareness in AI from "collecting more 3D data" to "unlocking latent priors in existing generative foundations."
Conclusion
VEGA-3D demonstrates that the implicit 3D and physical priors within large-scale video generation models are potent, transferable, and complementary to semantic knowledge. By repurposing these models as Latent World Simulators, the framework equips MLLMs with dense geometric awareness, leading to significant improvements across 3D understanding, reasoning, and manipulation tasks. The work opens a new pathway for scalable spatial intelligence, where advances in generative modeling naturally propel progress in discriminative 3D understanding.
Limitations & Future Work: The main limitation is increased inference cost due to the additional generative backbone. Future work will focus on distilling these priors into lighter encoders and extending the framework to dynamic scene understanding.