GEM: Generative Supervision Helps Embodied Intelligence - Summary

Summary (Overview)

Core Contribution: Introduces GEM, a Generative-supervised Embodied vision-language Model that bridges the gap between high-level semantic understanding and low-level physical grounding by integrating depth map generation as an auxiliary pre-training task.
Key Method: Proposes a novel hybrid architecture combining an autoregressive VLM backbone with a Diffusion Transformer (DiT)-based depth generator, trained via a progressive three-stage strategy to fuse semantic and structural features effectively.
New Dataset: Constructs and releases GEM-4M, a large-scale (4 million samples) embodied pre-training dataset covering grounding, reasoning, and planning tasks, paired with high-quality depth supervision.
State-of-the-Art Performance: GEM achieves top results on diverse embodied reasoning benchmarks. The derived action model, GEM-VLA, sets new records on the LIBERO simulation benchmark (96.1% success rate) and shows superior generalization in challenging real-world robotic manipulation tasks (43% average success rate).
Generative Supervision Benefit: Demonstrates that depth generative supervision is more effective than RGB reconstruction for learning spatial priors, leading to enhanced structural awareness in visual representations, which directly translates to improved embodied reasoning and action execution.

Introduction and Theoretical Foundation

Recent Vision-Language Models (VLMs) have shown promise for embodied intelligence within Vision-Language-Action (VLA) frameworks. However, a significant disconnect exists: standard VLM pre-training paradigms, which rely heavily on massive visual question-answering datasets, focus on high-level semantic comprehension (descriptive reasoning) but lack the low-level spatial and physical knowledge (e.g., geometry, distances, affordances) critical for successful task execution in real-world physical environments.

While some approaches attempt to inject spatial knowledge late into VLA pipelines, they often treat physical priors as separate from semantic learning, preventing the development of a unified, physically-grounded representation. The central research question addressed is: How can essential spatial and physical knowledge be seamlessly embedded into the foundational pre-training phase of VLMs to elevate both semantic reasoning and actionable operational intelligence?

To overcome this, the paper proposes GEM. The core theoretical insight is that generative supervision—specifically, training the model to generate depth maps from 2D observations—forces the VLM to learn fine-grained structural and geometric scene representations. This integrates physical grounding directly into the model's core visual features, bridging the semantic-physical divide from the ground up.

Methodology

3.1 Architecture

GEM augments a standard VLM backbone $M_\theta$ with a lightweight connector $C_\phi$ and a Diffusion Transformer (DiT)-based depth generative head $G_\psi$.

The VLM encodes an instruction $l$ and visual input $o$ into multimodal token representations: $h = (h_o, h_l) = M_\theta(o, l)$
The standard cross-entropy text generation loss is: $L_{CE} = -\sum_{i=1}^{T} \log p_\theta(y_i | y_{<i}, h_o, h_l)$
The visual tokens $h_o$ are projected via the connector: $c = C_\phi(h_o)$.
The generative head $G_\psi$ uses $c$ as a condition to reconstruct the observation's depth map $d$ via a flow matching objective: $L_{flow} = \mathbb{E}_{d, t \sim \mathcal{U}(0,1), \epsilon \sim \mathcal{N}(0,I)} \left[ \| v_t(x_t, c) - u_t(x_t | d) \|^2 \right]$, where $v_t(x_t, c)$ is the velocity predicted by $G_\psi$ and $u_t(x_t | d)$ is the ground-truth velocity field transforming the noised state $x_t$ into the target depth $d$.

The total training objective combines both losses: $L_{total} = L_{CE} + \lambda L_{flow}$, where $\lambda$ is a balancing weight.
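The objective above can be illustrated with a minimal sketch in plain Python. It assumes a linear interpolation path $x_t = (1 - t)\epsilon + t d$, whose target velocity is $u_t = d - \epsilon$; the summary does not state which path GEM uses, so this choice is an assumption. The names `flow_matching_loss`, `oracle_velocity`, `zero_velocity`, `total_loss`, and the default value of `lam` are hypothetical, and the toy "oracle" predictor is handed the true depth as its condition purely to show that a perfect velocity prediction drives the loss to zero.

```python
import random


def flow_matching_loss(depth, cond, velocity_fn, rng):
    """One-sample Monte Carlo estimate of L_flow for a flattened depth map."""
    t = rng.random()                                # t ~ U(0, 1)
    eps = [rng.gauss(0.0, 1.0) for _ in depth]      # eps ~ N(0, I)
    # Assumed linear path between noise and data: x_t = (1 - t) * eps + t * d
    x_t = [(1.0 - t) * e + t * d for e, d in zip(eps, depth)]
    # Target velocity for that path: u_t = d - eps
    u_t = [d - e for d, e in zip(depth, eps)]
    v_t = velocity_fn(x_t, t, cond)                 # stand-in for the DiT head
    return sum((v - u) ** 2 for v, u in zip(v_t, u_t))


def oracle_velocity(x_t, t, cond):
    """Toy predictor that is given the true depth as its condition."""
    return [(d - x) / (1.0 - t) for d, x in zip(cond, x_t)]


def zero_velocity(x_t, t, cond):
    """Toy predictor that ignores its inputs."""
    return [0.0 for _ in x_t]


def total_loss(l_ce, l_flow, lam=0.5):
    """L_total = L_CE + lambda * L_flow; lam=0.5 is an arbitrary placeholder."""
    return l_ce + lam * l_flow


depth = [0.2, 0.5, 0.9, 1.3]                        # made-up depth values
print(flow_matching_loss(depth, depth, oracle_velocity, random.Random(0)))
print(flow_matching_loss(depth, depth, zero_velocity, random.Random(0)))
```

In a real model the condition would be the connector output $c$ rather than the depth itself, and the expectation would be estimated over mini-batches.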

3.2 Progressive Training Recipe

To avoid modality interference and ensure stable convergence, a three-stage progressive strategy is adopted:

Stage 1: Connector Initialization: Freeze VLM and DiT head; train only the connector $C_\phi$ with $L_{flow}$ for preliminary feature alignment.
Stage 2: Generative Head Initialization: Freeze VLM; train connector $C_\phi$ and DiT head $G_\psi$ with $L_{flow}$ to equip the generator with basic depth generation ability.
Stage 3: Generative-Supervised Joint Training: Unfreeze all trainable parameters (VLM, connector, DiT) and train end-to-end with the combined loss $L_{total}$ to achieve synergy between semantic understanding and structural generation.
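The three stages can be summarized as a freeze schedule. The sketch below encodes only what the stage descriptions state (which modules are trained and which loss is used); the names `Stage`, `STAGES`, and `frozen_modules` are hypothetical, and no hyperparameters are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset   # modules that receive gradient updates
    loss: str              # objective used in this stage


MODULES = frozenset({"vlm", "connector", "dit_head"})

STAGES = (
    Stage("connector_init", frozenset({"connector"}), "L_flow"),
    Stage("generative_head_init", frozenset({"connector", "dit_head"}), "L_flow"),
    Stage("joint_training", MODULES, "L_CE + lambda * L_flow"),
)


def frozen_modules(stage):
    """Modules kept fixed during the given stage."""
    return MODULES - stage.trainable


for stage in STAGES:
    print(f"{stage.name}: train={sorted(stage.trainable)} "
          f"frozen={sorted(frozen_modules(stage))} loss={stage.loss}")
```

Each stage trains a superset of the previous stage's modules, which is one way to read "progressive" here.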

3.3 Dataset: GEM-4M

A large-scale, high-quality dataset is curated for embodied pre-training, comprising ~4 million question-answer pairs across three categories:

Embodied Grounding Data (~1.1M): For object detection, localization, and affordance recognition. Sources include PACO-LVIS, RoboPoint, RoboAfford, and annotations generated from robot action datasets using SAM3.
Physical & Spatial Reasoning Data (~1.1M): For 3D spatial reasoning and physical attribute perception. Sources include MindCube, ViCA, SPAR, VSI-590K, and manually annotated samples from 3D scene datasets (ScanNet, ScanNet++).
Spatiotemporal Planning Data (~50K): For sub-task and trajectory planning. Constructed from robot action videos by extracting frames, identifying manipulated objects with Qwen3, generating masks with SAM3, tracking with CoTracker3, and creating QA pairs.

3.4 Expanding to Vision-Language-Action Model (GEM-VLA)

The pre-trained GEM is extended into a VLA model for robotics. A DiT-based action expert $A_\omega$ is added to generate continuous actions from multimodal observation history.

Key-Value tokens from the VLM backbone's attention blocks are used as conditioning $c_{act}$ for the action expert.
The action is generated via a diffusion policy with a flow-matching loss: $L_{action} = \mathbb{E}_{O, a, \epsilon \sim \mathcal{N}(0,I), t \sim \mathcal{U}(0,1)} \left[ \| v_t(a_t, c_{act}) - u_t(a_t | a) \|_2^2 \right]$
The VLA is trained with a combined loss: $L_{total} = L_{action} + \lambda L_{flow}$, where $L_{flow}$ is the depth generation loss from Section 3.1.
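At inference time, a flow-matching policy typically produces an action by integrating the learned velocity field from noise to data. The sketch below uses plain Euler integration and assumes the same linear-path convention (noise at $t = 0$, action at $t = 1$); the summary does not specify the sampler, step count, or path used by GEM-VLA, so all of these are assumptions. `sample_action` and `oracle_velocity` are hypothetical names, and the oracle is a toy stand-in for the action expert $A_\omega$ that is handed the target action directly.

```python
import random


def sample_action(velocity_fn, cond, dim, steps=10, rng=None):
    """Euler-integrate da/dt = v_t(a_t, cond) from t = 0 (noise) to t = 1."""
    rng = rng or random.Random()
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # a_0 ~ N(0, I)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                                  # t stays strictly below 1
        v = velocity_fn(a, t, cond)
        a = [ai + dt * vi for ai, vi in zip(a, v)]
    return a


def oracle_velocity(a_t, t, target):
    """Ideal velocity on a linear path toward a single known target action."""
    return [(g - ai) / (1.0 - t) for g, ai in zip(target, a_t)]


target = [0.3, -0.1, 0.7]                           # made-up 3-D action
action = sample_action(oracle_velocity, target, dim=3, rng=random.Random(0))
print(action)
```

With the oracle field, the final Euler step lands on the target up to floating-point error; a learned field would only approximate it.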

Empirical Validation / Results

4.2 Evaluation on Embodied Reasoning Capacities

GEM was evaluated on spatial understanding and embodied grounding benchmarks. The 8B variant establishes new state-of-the-art or highly competitive performance against both general-purpose and specialist models.

Table 1: Performance on embodied reasoning benchmarks for spatial understanding.

Models	CV-Bench	VSI-Bench (All ↑)	MMSI-Bench (All ↑)	EmbSpatial (All ↑)
Gemini-3-Pro	82.5	53.0	45.9	81.0
Qwen3-VL-8B (Base)	85.1	57.9	27.7	77.7
GEM-8B (Ours)	86.6	70.6	35.3	79.4
Qwen3-VL-8B-SFT (Ablation)	85.6	68.6	32.8	78.3

Table 2: Performance on object placement and grounding spatial benchmarks.

Models	RefSpatial (All ↑)	Where2Place (All ↑)	RoboSpatial (All ↑)
Gemini-3-Pro	34.3	54.0	57.4
Qwen3-VL-8B (Base)	38.0	61.3	65.4
GEM-8B (Ours)	44.4	65.0	66.9
Qwen3-VL-8B-SFT (Ablation)	45.8	62.0	65.4

Key Findings:

GEM significantly improves over its base Qwen3-VL models (e.g., +12.7 points on VSI-Bench for the 8B model).
It outperforms the strong proprietary baseline Gemini-3-Pro by about 10 points on average across the three grounding benchmarks in Table 2.
The ablation (Qwen3-VL-8B-SFT), trained on GEM-4M but without depth supervision, trails the full GEM model on six of the seven benchmarks in Tables 1 and 2 (RefSpatial is the exception, 45.8 vs. 44.4), especially on distance-related questions. This supports the value of generative supervision.
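The headline gaps can be re-derived from the 8B rows of Tables 1 and 2. The short script below only repeats numbers already listed in those tables; the variable names are illustrative.

```python
# VSI-Bench scores from Table 1
vsi = {"Qwen3-VL-8B": 57.9, "GEM-8B": 70.6}

# RefSpatial, Where2Place, RoboSpatial scores from Table 2
grounding = {
    "Gemini-3-Pro": [34.3, 54.0, 57.4],
    "GEM-8B": [44.4, 65.0, 66.9],
    "Qwen3-VL-8B-SFT": [45.8, 62.0, 65.4],
}


def mean(xs):
    return sum(xs) / len(xs)


vsi_gain = round(vsi["GEM-8B"] - vsi["Qwen3-VL-8B"], 1)
gemini_gap = round(mean(grounding["GEM-8B"]) - mean(grounding["Gemini-3-Pro"]), 1)
sft_gaps = [round(g - s, 1)
            for g, s in zip(grounding["GEM-8B"], grounding["Qwen3-VL-8B-SFT"])]

print(vsi_gain)     # 12.7 points over the base model on VSI-Bench
print(gemini_gap)   # 10.2 points over Gemini-3-Pro, averaged over Table 2
print(sft_gaps)     # [-1.4, 3.0, 1.5]: GEM minus the SFT ablation per benchmark
```

The gaps are absolute differences in benchmark score, not relative percentages.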

4.3 Evaluation on Downstream VLA Tasks

Simulation (LIBERO Benchmark): GEM-VLA achieves a record-breaking 96.1% average success rate across four task suites (Spatial, Object, Goal, Long), outperforming all baselines including standard VLAs ($\pi_0$, OpenVLA) and spatially-enhanced VLAs (SpatialVLA, DepthVLA).

Table 3: Success rates (%) on the LIBERO benchmark.

Models	Spatial	Object	Goal	Long	Average ↑
$\pi_0$	96.8	98.8	95.8	85.2	94.2
DepthVLA	96.4	98.0	95.8	89.2	94.9
Qwen3VL-SFT-VLA	97.2	98.4	95.6	88.4	94.9
GEM-VLA (Ours)	99.0	98.8	97.1	89.3	96.1
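As a consistency check, the Average column of Table 3 can be recomputed from the four per-suite success rates. The script below assumes the average is an unweighted mean over suites, which the summary does not state explicitly, and allows for one-decimal rounding in the reported values.

```python
# Per-suite success rates (Spatial, Object, Goal, Long) from Table 3
libero = {
    "pi_0": [96.8, 98.8, 95.8, 85.2],
    "DepthVLA": [96.4, 98.0, 95.8, 89.2],
    "Qwen3VL-SFT-VLA": [97.2, 98.4, 95.6, 88.4],
    "GEM-VLA": [99.0, 98.8, 97.1, 89.3],
}
reported = {"pi_0": 94.2, "DepthVLA": 94.9, "Qwen3VL-SFT-VLA": 94.9, "GEM-VLA": 96.1}

# Unweighted mean over the four suites (assumed averaging scheme)
averages = {name: sum(scores) / len(scores) for name, scores in libero.items()}

# Reported values carry one decimal, so allow half a unit in the last place
consistent = {name: abs(averages[name] - reported[name]) <= 0.05 + 1e-9
              for name in libero}

for name in libero:
    print(f"{name}: mean={averages[name]:.3f} reported={reported[name]} "
          f"ok={consistent[name]}")
```

Under that assumption, all four reported averages agree with the per-suite numbers to within rounding.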

Real-World Evaluation: On challenging tasks (cloth folding, unzipping, table bussing), GEM-VLA achieves a 43.0% average success rate, a substantial improvement over the previous SOTA ($\pi_0$-FAST at 28.7%).

5. Ablation Studies

Depth vs. RGB Supervision: Replacing depth generation with RGB image reconstruction leads to inferior performance on all three benchmarks in Table 4, consistent with depth providing more explicit spatial cues.
Progressive Training: Direct end-to-end training underperforms the proposed three-stage strategy on all three benchmarks in Table 4, supporting its role in stable feature fusion.
Structural Priors: Visualizations show that features from GEM generate high-fidelity depth maps, while features from the SFT-only model produce blurry results, suggesting that depth supervision encodes structural information in the visual features.

Table 4: Ablation study on supervision type and training strategy.

Models	CV-Bench	VSI-Bench (All ↑)	RoboSpatial (All ↑)
RGB Supervision	80.9	60.0	44.6
Direct End-to-End Co-Training	79.7	57.6	44.0
Default Setting (GEM)	81.1	63.0	48.9

Theoretical and Practical Implications

Theoretical: Proposes a novel paradigm for embodied AI foundation models, demonstrating that generative objectives can serve as powerful, implicit supervisors for learning task-agnostic, physically-grounded representations. It shows that depth is a particularly effective modality for bridging 2D semantics and 3D physical reasoning.
Practical: GEM provides a scalable and effective method to enhance existing VLMs for embodied applications without requiring expensive 3D data collection. The released GEM-4M dataset and model checkpoints facilitate further research. The strong performance of GEM-VLA in both simulation and real-world settings indicates a clear path toward more robust and generalizable robot policies.

Conclusion

GEM successfully bridges the semantic-physical gap in embodied VLMs by integrating depth map generation as a generative supervision task during pre-training. Through a novel architecture, a progressive training strategy, and a comprehensive dataset, GEM learns unified representations that excel at both high-level reasoning and low-level spatial tasks. The model achieves state-of-the-art results across a wide range of embodied benchmarks, and its extension to a VLA framework demonstrates superior task execution in simulated and real robotic environments. This work underscores the potential of generative supervision as a key mechanism for building more capable and physically-aware embodied AI systems.