HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation - Summary

Summary (Overview)

  • Unified Framework: Proposes HERMES++, the first unified driving world model that integrates 3D scene understanding (via Large Language Models) and future geometry prediction (via point cloud generation) within a single cohesive framework.
  • Core Technical Innovations: Introduces four key designs: 1) BEV representation as a unified spatial-semantic interface; 2) LLM-enhanced World Queries for knowledge transfer; 3) Current-to-Future Link to bridge temporal and semantic gaps; 4) Joint Geometric Optimization strategy combining explicit and implicit constraints for structural integrity.
  • Superior Performance: Achieves state-of-the-art or highly competitive results on both generation and understanding tasks across multiple benchmarks (NuScenes, OmniDrive-nuScenes, NuScenes-QA, DriveLM), outperforming specialist models in each domain.
  • Effective Synergy: Demonstrates that joint training of understanding and generation tasks creates a positive feedback loop—semantic reasoning guides geometric prediction, and geometric constraints ground language generation.
  • Strong Generalization: The framework shows promising generalization to additional tasks like motion planning and is adaptable to different LLM architectures and scales, with performance improving alongside model capacity.

Introduction and Theoretical Foundation

Driving world models are pivotal for autonomous driving, simulating environmental dynamics to forecast risks. Existing research falls into two distinct paradigms with complementary limitations:

  1. Generation-centric World Models: Focus on predicting future scene evolution (e.g., future videos or 3D point clouds) but lack intrinsic mechanisms for semantic interpretation (e.g., Visual Question Answering, scene description).
  2. Language-centric Models (LLMs/VLMs): Demonstrate impressive reasoning capabilities for scene understanding but lack the capacity to predict future geometric evolution.

This creates a significant capability gap. A holistic autonomous system requires both contextual awareness of the present and the ability to anticipate future physical states.

Motivation & Core Hypothesis: The authors propose that a true world model should seamlessly integrate 3D scene understanding with accurate future geometry prediction. This requires solving two key challenges:

  1. A suitable 3D representation that consolidates multi-view observations, preserves geometric interactions, and remains compatible with token-based LLMs.
  2. An interaction mechanism to bridge understanding and generation, ensuring semantic understanding guides geometric evolution and geometric predictions ground language generation.

Theoretical Basis: The work builds upon the formal definition of a driving world model. Given an observation $O_t$ at time $t$ and an action $A_t$, the model predicts the subsequent observation $O_{t+1}$ through three components:

$$Z_t = E(O_t),\quad Z_{t+1} = M(Z_t, A_t),\quad \hat{O}_{t+1} = D(Z_{t+1})$$

where $E$ is an encoder, $M$ is a transition model, and $D$ is a decoder. HERMES++ instantiates this with multi-view images as $O_t$ and future point clouds as $\hat{O}_{t+1}$, leveraging the Bird's-Eye View (BEV) representation as the core spatial substrate $Z$.
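
As a concrete reading of this factorization, the minimal sketch below wires the three components together as generic PyTorch modules. The class name, tensor shapes, and single-step interface are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Generic E/M/D factorization: Z_t = E(O_t), Z_{t+1} = M(Z_t, A_t), O^_{t+1} = D(Z_{t+1})."""
    def __init__(self, encoder: nn.Module, transition: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder        # E: multi-view images -> latent state (here: BEV)
        self.transition = transition  # M: (latent state, action/ego-motion) -> next latent state
        self.decoder = decoder        # D: next latent state -> predicted observation (point cloud)

    def forward(self, o_t: torch.Tensor, a_t: torch.Tensor) -> torch.Tensor:
        z_t = self.encoder(o_t)               # Z_t = E(O_t)
        z_next = self.transition(z_t, a_t)    # Z_{t+1} = M(Z_t, A_t)
        return self.decoder(z_next)           # O^_{t+1} = D(Z_{t+1})
```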

Methodology

The overall pipeline (Fig. 2) integrates language-based reasoning with geometric generation. The key components are:

A. Visual Tokenizer and BEV-to-Point Render

  • BEV Tokenizer: Transforms multi-view images $\{I^i_t\}_{i=1}^N$ into a compressed, LLM-compatible format.
    1. A vision encoder extracts features, which are lifted to a BEV representation $F^{bev}_t \in \mathbb{R}^{w \times h \times c}$ via spatial cross-attention (inspired by BEVFormer).
    2. The BEV feature is downsampled and flattened into tokens: $F_t = \phi(\text{Flatten}(F^{down}_t)) \in \mathbb{R}^{L_{BEV} \times C}$.
  • BEV-to-Point Render ($\mathcal{R}$): A differentiable module that decodes BEV features back to 3D point clouds $P_t$ (sketched below, together with the tokenizer).
    1. BEV features are upsampled and expanded into a volumetric representation $\hat{V}_t \in \mathbb{R}^{w \times h \times z \times c'}$.
    2. Scene geometry is modeled as an implicit Signed Distance Function (SDF) field. For a LiDAR ray $r_k$, the rendered depth is a weighted sum over sampled points: $\tilde{d}(r_k) = \sum_{i=1}^{n} w_i d_i, \quad w_i = T_i \alpha_i$, where the opacity $\alpha_i$ is derived from the SDF values $s_i$.
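
A minimal sketch of both directions of this interface follows: flattening a downsampled BEV map into LLM tokens, and rendering a per-ray depth from SDF-derived weights. The pooling choice, the sigmoid mapping from SDF values to opacity, and all shapes are simplifying assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bev_to_tokens(f_bev: torch.Tensor, proj: nn.Linear, stride: int = 4) -> torch.Tensor:
    """f_bev: (B, C, H, W) BEV feature map -> (B, L_BEV, C') LLM token sequence."""
    f_down = F.avg_pool2d(f_bev, kernel_size=stride)    # downsample by the chosen factor
    tokens = f_down.flatten(2).transpose(1, 2)          # Flatten(F_down): (B, H'*W', C)
    return proj(tokens)                                 # phi(.): project to the LLM width

def render_depth(sdf: torch.Tensor, depths: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """sdf, depths: (R, n) samples along R rays -> (R,) rendered depths d~(r_k)."""
    alpha = torch.sigmoid(-sdf / beta)                  # opacity from SDF (one common heuristic)
    # Transmittance T_i = prod_{j<i} (1 - alpha_j); weights w_i = T_i * alpha_i.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1), dim=1)
    weights = trans * alpha
    return (weights * depths).sum(dim=1)                # d~(r_k) = sum_i w_i d_i
```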

B. Unification of Understanding and Generation

  • Language-based Scene Understanding: The LLM processes BEV tokens $F_t$ and user instruction tokens $T$ to generate textual responses, enriching its internal representations with semantic knowledge.
  • World Queries for Knowledge Transfer: Learnable queries $Q_w$ are injected into the LLM input to aggregate semantic context. They are initialized from BEV features and conditioned on future ego-motion embeddings $e_{t+i}$ and frame embeddings $FE$: $Q_w = \phi\left(\text{Concat}_{i=1}^{\Delta t}\left( (Q \oplus e_{t+i}) \oplus FE \right)\right)$. The causal attention mechanism allows these queries to absorb world knowledge from the LLM, becoming carriers ($Q_w^\epsilon$) of semantic priors for generation; a code sketch of this query construction and of Ego Modulation follows this list.
  • Current-to-Future Link: A transformer-based module that propagates the current encoded BEV feature $B_t$ to future states $\{B_{t+i}\}$, conditioned on world queries and text embeddings.
    • Textual Injection: Extracts text embeddings $\hat{T}$ from the LLM to provide explicit semantic conditioning in cross-attention layers.
    • Ego Modulation (EM): Adapts feature distributions based on future ego-motion to decouple camera motion from scene dynamics: $\text{EM}(x) = (\gamma + 1) \odot \text{LN}(x) + \beta$.
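
The sketch below illustrates two of the mechanisms above: building world queries from pooled BEV tokens plus future ego-motion and frame embeddings, and the Ego Modulation layer. It assumes $\oplus$ denotes element-wise addition, uses max pooling for query initialization (the best-performing choice per the ablations), and all dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

def build_world_queries(bev_tokens, ego_embeds, frame_embed, proj, n_queries=4):
    """bev_tokens: (B, L, C); ego_embeds: list of (B, C) vectors, one per future step."""
    # Initialize queries by max-pooling the BEV tokens.
    q = bev_tokens.max(dim=1, keepdim=True).values.expand(-1, n_queries, -1)  # (B, n, C)
    # (Q (+) e_{t+i}) (+) FE, taking (+) to be element-wise addition here.
    per_step = [q + e.unsqueeze(1) + frame_embed for e in ego_embeds]         # dt x (B, n, C)
    return proj(torch.cat(per_step, dim=1))                                   # (B, n*dt, C)

class EgoModulation(nn.Module):
    """EM(x) = (gamma + 1) * LN(x) + beta, with gamma, beta predicted from future ego-motion."""
    def __init__(self, dim: int, ego_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_gamma_beta = nn.Linear(ego_dim, 2 * dim)

    def forward(self, x: torch.Tensor, ego: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim) BEV feature tokens; ego: (B, ego_dim) future ego-motion embedding.
        gamma, beta = self.to_gamma_beta(ego).chunk(2, dim=-1)
        return (gamma.unsqueeze(1) + 1.0) * self.norm(x) + beta.unsqueeze(1)
```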

C. Joint Geometric Optimization Strategy

To combat structural ambiguity from rendering-only supervision, a dual-level constraint mechanism is proposed.

  • Explicit Geometric Constraints: A simple $L_1$ loss on rendered depths vs. ground truth: $\mathcal{L}_{\text{render}} = \sum_{i=0}^{\Delta t} \lambda_i \frac{1}{N_i} \sum_{k=1}^{N_i} \left| d(r_k) - \tilde{d}(r_k) \right|$
  • Implicit Geometric Regularization: Aligns predicted volumetric features $\hat{V}_t$ with geometry-aware priors $V_t$ from a frozen, pre-trained geometric feature extractor (a sketch of all three loss terms follows this list).
    • Cosine Similarity Loss: Enforces local voxel-wise consistency: $\mathcal{L}_{\text{cos}} = 1 - \frac{1}{whz} \sum_{i,j,k} \frac{\hat{V}_t(i,j,k) \cdot V_t(i,j,k)}{\|\hat{V}_t(i,j,k)\|_2 \, \|V_t(i,j,k)\|_2}$
    • Gram Matrix Loss: Enforces global structural patterns by matching feature correlations across spatial projections ($HW$, $HZ$, $WZ$): $\mathcal{L}_{\text{gram}} = \frac{1}{3} \sum_{d} \| G^d_t - \hat{G}^d_t \|^2_F, \quad d \in \{HW, HZ, WZ\}$, where $G^d_t = V^d_t {V^d_t}^{\top}$.
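
The sketch below computes the three loss terms on toy tensors. The channels-first volumetric layout, the mean-pooled projections, and the Gram normalization are simplifying assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def render_loss(d_gt: torch.Tensor, d_pred: torch.Tensor) -> torch.Tensor:
    """Explicit constraint for one frame: mean L1 error over rendered ray depths."""
    return (d_gt - d_pred).abs().mean()

def cosine_loss(v_pred: torch.Tensor, v_prior: torch.Tensor) -> torch.Tensor:
    """Voxel-wise cosine alignment between predicted and prior volumes, both (C, H, W, Z)."""
    cos = F.cosine_similarity(v_pred, v_prior, dim=0)    # (H, W, Z) per-voxel similarity
    return 1.0 - cos.mean()

def gram_loss(v_pred: torch.Tensor, v_prior: torch.Tensor) -> torch.Tensor:
    """Match channel-correlation (Gram) structure over the HW, HZ, and WZ projections."""
    loss = v_pred.new_zeros(())
    for reduce_dim in (3, 2, 1):                          # pool Z, W, H -> HW, HZ, WZ maps
        p = v_pred.mean(dim=reduce_dim).flatten(1)        # (C, S) projected features
        q = v_prior.mean(dim=reduce_dim).flatten(1)
        g_p, g_q = p @ p.T, q @ q.T                       # Gram matrices G = V V^T
        loss = loss + (g_p - g_q).pow(2).sum()            # squared Frobenius norm
    return loss / 3.0
```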

D. Training Objectives

The model is trained with a composite loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{lang}} + \mathcal{L}_{\text{gen}}$$

where $\mathcal{L}_{\text{lang}}$ is the standard next-token prediction loss for language, and $\mathcal{L}_{\text{gen}} = 10\,\mathcal{L}_{\text{render}} + \mathcal{L}_{\text{cos}} + \mathcal{L}_{\text{gram}}$.
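
A minimal sketch of how these terms combine, using the stated weight of 10 on the rendering loss (the individual terms are assumed to come from the loss sketches in Section C):

```python
def total_loss(l_lang, l_render, l_cos, l_gram):
    """L_total = L_lang + L_gen, with L_gen = 10 * L_render + L_cos + L_gram."""
    l_gen = 10.0 * l_render + l_cos + l_gram
    return l_lang + l_gen
```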

Training Stages:

  1. Geometry-aware Pre-training: Train and freeze the geometric feature extractor.
  2. Vision-Language Alignment: Pre-train the tokenizer/Render and align BEV features with the LLM using masked augmentation.
  3. Unified Joint Training: Integrate the Current-to-Future Link and train all components with $\mathcal{L}_{\text{total}}$.

Empirical Validation / Results

A. Main Results on Unified Tasks

Table II: Comparison with specialist models on OmniDrive-nuScenes.

| Method | Reference | Modality | Generation (CD 3s ↓) | Understanding (CIDEr ↑) |
|---|---|---|---|---|
| *Generation Specialists* | | | | |
| ViDAR | CVPR 24 | C → L | 1.73 | - |
| DriveX | ICCV 25 | C → L | 1.10 | - |
| *Understanding Specialists* | | | | |
| Omni-Q | CVPR 25 | C → T | - | 0.686 |
| ORION | ICCV 25 | C → T | - | 0.635 |
| *Unified Models* | | | | |
| HERMES (conf.) | ICCV 25 | C → T&L | 1.17 | 0.741 |
| HERMES++ (1.8B) | - | C → T&L | 1.01 | 0.749 |
| HERMES++ (3.8B) | - | C → T&L | 0.97 | 0.772 |
  • Generation: HERMES++ reduces the 3-second Chamfer Distance error by 8.2% compared to the leading specialist DriveX (1.01 vs. 1.10).
  • Understanding: HERMES++ outperforms the specialist Omni-Q by 9.2% on CIDEr (0.749 vs. 0.686), without any auxiliary detection supervision.
  • vs. Conference Version: The improved HERMES++ shows a 13.7% reduction in 3s generation error and consistent gains in understanding metrics (see the worked calculation below).
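
For reference, the relative changes quoted above follow directly from the table entries:

$$\frac{1.10 - 1.01}{1.10} \approx 8.2\%,\qquad \frac{0.749 - 0.686}{0.686} \approx 9.2\%,\qquad \frac{1.17 - 1.01}{1.17} \approx 13.7\%$$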

B. Ablation Studies

Key findings from systematic ablations:

1. BEV Input Representation is Critical:

  • Direct multi-view token input leads to spatial structural collapse, increasing 3s CD by ~32% compared to BEV input, despite similar understanding scores (Fig. 4).
  • Downsampling scale matters. A factor of 4 provides the best trade-off between geometric detail and token length (Tab. III).

2. Joint Geometric Optimization is Effective:

  • Table IV: Ablation on regularization losses.
    | $\mathcal{L}_{\text{cos}}$ | $\mathcal{L}_{\text{gram}}$ | Gen (CD 3s ↓) | Und (CIDEr ↑) |
    |---|---|---|---|
    | - | - | 1.637 | 0.722 |
    | ✓ | - | 1.441 | 0.717 |
    | - | ✓ | 1.544 | 0.717 |
    | ✓ | ✓ | 1.436 | 0.720 |
  • The combined strategy yields the best performance. Visualization (Fig. 5) shows it suppresses projection artifacts and central bias, leading to cleaner, geometry-faithful latent features.

3. Current-to-Future Link Components are Necessary:

  • Table V: Progressive ablation of the Link.
    | Modules | Gen (CD 3s ↓) | Und (CIDEr ↑) |
    |---|---|---|
    | w/o Link | 2.377 | 0.433 |
    | w/ Simple Link | 1.542 | 0.718 |
    | + Textual Injection | 1.506 | 0.717 |
    | + Ego Modulation | 1.442 | 0.711 |
    | + More blocks | 1.436 | 0.720 |
  • Each component contributes: Textual Injection provides semantic guidance, Ego Modulation decouples motion, and greater depth improves modeling of non-linear dynamics.

4. Task Interaction and World Queries are Beneficial:

  • Table VI: Joint training (✓✓) outperforms a "Separated unification" baseline (shared encoder but no deep interaction) by a large margin (CD 1.436 vs. 1.634).
  • Table VII: Processing world queries through the LLM (setting c) is superior to bypassing it (setting b), confirming the importance of infusing queries with LLM knowledge.

5. Hyperparameter Analysis:

  • World Query Initialization: Simple Max Pooling from BEV features works best, outperforming more complex parametric methods (Tab. VIIIa).
  • Number of Queries: $n=4$ queries per timestep strikes an optimal balance (Tab. VIIIb).
  • Model Scalability: Performance improves consistently with larger LLMs (Tab. XIIb), with the 3.8B model achieving CD 1.255 and CIDEr 0.742.

C. Generalization to Additional Tasks

  • NuScenes-QA (VQA): HERMES++ achieves 61.3% accuracy, setting a new state-of-the-art, surpassing camera-based Omni-Q (59.2%) and even LiDAR-based specialists (Tab. IX).
  • DriveLM (Graph VQA & Reasoning): Achieves a highly competitive Final Score of 0.59, matching the challenge winner, with leading prediction accuracy (0.83) and match score (0.43) (Tab. X).
  • Motion Planning: When extended with a lightweight trajectory head, HERMES++ achieves competitive open-loop planning results (Avg. L2: 0.37m, Collision Rate: 0.29%), demonstrating internalized actionable dynamics (Tab. XI).

Theoretical and Practical Implications

  • Theoretical: Demonstrates the feasibility and synergy of a truly unified world model that closes the loop between semantic interpretation and physical simulation. The BEV representation is proven as an effective unified spatial-semantic interface.
  • Practical: Provides a foundation for more interpretable and predictive autonomous driving systems. The model's ability to explain its reasoning and show the anticipated geometric consequences of that reasoning could enhance trust and debuggability.
  • Methodological: The Joint Geometric Optimization strategy offers a general approach for enforcing structural priors in neural rendering tasks. The World Query mechanism presents a novel design pattern for transferring knowledge between understanding and generative modules within a single model.

Conclusion

HERMES++ presents a significant step toward a unified driving world model that integrates 3D scene understanding and future geometry prediction. By leveraging a BEV representation, LLM-enhanced world queries, a Current-to-Future Link, and a Joint Geometric Optimization strategy, the framework achieves strong, synergistic performance on both tasks, outperforming specialist approaches. Extensive experiments validate its effectiveness and generalization capabilities.

Limitations & Future Work: The paper notes that further investigation is needed on how to best leverage semantic priors from pre-trained multi-modal models for BEV inputs. Expanding the generation paradigm to diverse modalities (e.g., video, occupancy) for comprehensive scene simulation is a promising future direction.