# HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

> HERMES++ introduces the first unified driving world model that jointly performs 3D scene understanding and future geometry prediction in a single framework, outperforming specialized models in both tasks.

- **Source:** [arXiv](https://arxiv.org/abs/2604.28196)
- **Published:** 2026-05-08
- **Permalink:** https://picx.dev/p/bXahf0
- **Whiteboard:** https://picx.dev/p/bXahf0/image

## Summary

## Summary (Overview)
*   **Unified Framework:** Proposes HERMES++, the first unified driving world model that integrates **3D scene understanding** (via Large Language Models) and **future geometry prediction** (via point cloud generation) within a single cohesive framework.
*   **Core Technical Innovations:** Introduces four key designs: 1) **BEV representation** as a unified spatial-semantic interface; 2) **LLM-enhanced World Queries** for knowledge transfer; 3) **Current-to-Future Link** to bridge temporal and semantic gaps; 4) **Joint Geometric Optimization** strategy combining explicit and implicit constraints for structural integrity.
*   **Superior Performance:** Achieves state-of-the-art or highly competitive results on both generation and understanding tasks across multiple benchmarks (nuScenes, OmniDrive-nuScenes, NuScenes-QA, DriveLM), outperforming specialist models in each domain.
*   **Effective Synergy:** Demonstrates that joint training of understanding and generation tasks creates a positive feedback loop—semantic reasoning guides geometric prediction, and geometric constraints ground language generation.
*   **Strong Generalization:** The framework shows promising generalization to additional tasks like motion planning and is adaptable to different LLM architectures and scales, with performance improving alongside model capacity.

## Introduction and Theoretical Foundation
Driving world models are pivotal for autonomous driving, simulating environmental dynamics to forecast risks. Existing research falls into two distinct paradigms with complementary limitations:
1.  **Generation-centric World Models:** Focus on predicting future scene evolution (e.g., future videos or 3D point clouds) but lack intrinsic mechanisms for **semantic interpretation** (e.g., Visual Question Answering, scene description).
2.  **Language-centric Models (LLMs/VLMs):** Demonstrate impressive **reasoning capabilities** for scene understanding but lack the capacity to predict **future geometric evolution**.

This creates a significant capability gap. A holistic autonomous system requires both contextual awareness of the present *and* the ability to anticipate future physical states.

**Motivation & Core Hypothesis:** The authors propose that a true world model should seamlessly integrate 3D scene understanding with accurate future geometry prediction. This requires solving two key challenges:
1.  A **suitable 3D representation** that consolidates multi-view observations, preserves geometric interactions, and remains compatible with token-based LLMs.
2.  An **interaction mechanism** to bridge understanding and generation, ensuring semantic understanding guides geometric evolution and geometric predictions ground language generation.

**Theoretical Basis:** The work builds upon the formal definition of a driving world model. Given an observation $O_t$ at time $t$ and an action $A_t$, the model predicts the subsequent observation $O_{t+1}$ through three components:
$$
Z_t = E(O_t),\quad Z_{t+1} = M(Z_t, A_t),\quad \hat{O}_{t+1} = D(Z_{t+1})
$$
where $E$ is an encoder, $M$ is a transition model, and $D$ is a decoder. HERMES++ instantiates this with multi-view images as $O_t$ and future point clouds as $\hat{O}_{t+1}$, leveraging the Bird's-Eye View (BEV) representation as the core spatial substrate $Z$.
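
The three-component formulation can be illustrated with a toy composition. This is only a minimal sketch: the random linear maps, the dimensions, and the function names are hypothetical stand-ins for the learned encoder, transition model, and decoder, not the architecture used in HERMES++.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
OBS_DIM, ACT_DIM, LATENT_DIM = 8, 2, 4

# Random linear maps stand in for the learned E, M, and D.
W_enc = rng.normal(size=(OBS_DIM, LATENT_DIM))
W_trans = rng.normal(size=(LATENT_DIM + ACT_DIM, LATENT_DIM))
W_dec = rng.normal(size=(LATENT_DIM, OBS_DIM))

def encode(obs):
    """Z_t = E(O_t): map an observation to a latent state."""
    return np.tanh(obs @ W_enc)

def transition(latent, action):
    """Z_{t+1} = M(Z_t, A_t): advance the latent state given an action."""
    return np.tanh(np.concatenate([latent, action]) @ W_trans)

def decode(latent):
    """O_hat_{t+1} = D(Z_{t+1}): map a latent state back to observation space."""
    return latent @ W_dec

def world_model_step(obs, action):
    """Compose E, M, and D into a one-step prediction."""
    return decode(transition(encode(obs), action))

obs_t = rng.normal(size=OBS_DIM)
act_t = rng.normal(size=ACT_DIM)
pred = world_model_step(obs_t, act_t)
print(pred.shape)  # (8,)
```

In the paper's instantiation, the observation would be multi-view images, the latent a BEV feature map, and the decoded output a future point cloud.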

## Methodology
The overall pipeline (Fig. 2) integrates language-based reasoning with geometric generation. The key components are:

### A. Visual Tokenizer and BEV-to-Point Render
*   **BEV Tokenizer:** Transforms multi-view images $\{I^i_t\}_{i=1}^N$ into a compressed, LLM-compatible format.
    1.  A vision encoder extracts features, which are lifted to a BEV representation $F^{bev}_t \in \mathbb{R}^{w \times h \times c}$ via spatial cross-attention (inspired by BEVFormer).
    2.  The BEV feature is downsampled and flattened into tokens: $F_t = \phi(\text{Flatten}(F^{down}_t)) \in \mathbb{R}^{L_{BEV} \times C}$.
*   **BEV-to-Point Render ($\mathcal{R}$):** A differentiable module that decodes BEV features back to 3D point clouds $P_t$.
    1.  BEV features are upsampled and expanded into a volumetric representation $\hat{V}_t \in \mathbb{R}^{w \times h \times z \times c'}$.
    2.  Scene geometry is modeled as an implicit Signed Distance Function (SDF) field. For a LiDAR ray $r_k$, the rendered depth is a weighted sum over sampled points:
        $$
        \tilde{d}(r_k) = \sum_{i=1}^{n} w_i d_i, \quad w_i = T_i \alpha_i
        $$
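        where $T_i$ is the accumulated transmittance along the ray (in standard volume rendering, $T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)$) and the opacity $\alpha_i$ is derived from the SDF values $s_i$.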
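
The depth rendering step can be sketched for a single ray as follows. The summary does not specify how opacity is derived from the SDF, so the NeuS-style conversion, the `sharpness` parameter, and the use of interval midpoints as sample depths are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sdf_to_alpha(sdf, sharpness=10.0):
    """Opacity per interval from consecutive SDF samples.

    NeuS-style conversion, assumed here; the paper's exact form may differ.
    """
    cdf = sigmoid(sharpness * sdf)                       # shape (n + 1,)
    alpha = (cdf[:-1] - cdf[1:]) / np.clip(cdf[:-1], 1e-8, None)
    return np.clip(alpha, 0.0, 1.0)                      # shape (n,)

def render_depth(depths, alpha):
    """Rendered depth as sum_i w_i * d_i with w_i = T_i * alpha_i."""
    # T_i = prod_{j < i} (1 - alpha_j): light surviving up to sample i.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = transmittance * alpha
    return float(np.sum(weights * depths)), weights

# Toy ray: a flat surface 5 m away, so the SDF along the ray is 5 - d.
sample_d = np.linspace(0.0, 10.0, 101)
sdf = 5.0 - sample_d
mid_d = 0.5 * (sample_d[:-1] + sample_d[1:])  # one depth per interval
depth, weights = render_depth(mid_d, sdf_to_alpha(sdf))
print(depth)  # close to 5.0, the true surface distance
```

The weights concentrate around the zero crossing of the SDF, which is why the weighted sum recovers the surface distance.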

### B. Unification of Understanding and Generation
*   **Language-based Scene Understanding:** The LLM processes BEV tokens $F_t$ and user instruction tokens $T$ to generate textual responses, enriching its internal representations with semantic knowledge.
*   **World Queries for Knowledge Transfer:** Learnable queries $Q_w$ are injected into the LLM input to aggregate semantic context. They are initialized from BEV features, conditioned on future ego-motion embeddings $e_{t+i}$ and frame embeddings $FE$:
    $$
    Q_w = \phi\left(\text{Concat}_{i=1}^{\Delta t}\left( (Q \oplus e_{t+i}) \oplus FE \right)\right)
    $$
    The causal attention mechanism allows these queries to absorb world knowledge from the LLM, becoming carriers ($Q_w^\epsilon$) of semantic priors for generation.
*   **Current-to-Future Link:** A transformer-based module that propagates the current encoded BEV feature $B_t$ to future states $\{B_{t+i}\}$, conditioned on world queries and text embeddings.
    *   **Textual Injection:** Extracts text embeddings $\hat{T}$ from the LLM to provide explicit semantic conditioning in cross-attention layers.
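    *   **Ego Modulation (EM):** Adapts feature distributions to the future ego-motion, decoupling camera motion from scene dynamics. A scale $\gamma$ and shift $\beta$, conditioned on the ego-motion, modulate the layer-normalized features $\text{LN}(x)$: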
        $$
        \text{EM}(x) = (\gamma + 1) \odot \text{LN}(x) + \beta
        $$
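
The Ego Modulation formula can be sketched as below. How $\gamma$ and $\beta$ are produced is not stated in this summary. The sketch assumes they come from linear projections of an ego-motion embedding; `w_gamma`, `w_beta`, and all dimensions are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last (channel) axis to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ego_modulation(x, ego_embedding, w_gamma, w_beta):
    """EM(x) = (gamma + 1) * LN(x) + beta.

    gamma and beta are assumed to be linear projections of the
    ego-motion embedding (an illustrative choice).
    """
    gamma = ego_embedding @ w_gamma   # shape (C,)
    beta = ego_embedding @ w_beta     # shape (C,)
    return (gamma + 1.0) * layer_norm(x) + beta

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))     # 6 BEV tokens, 16 channels (hypothetical)
ego = rng.normal(size=8)              # ego-motion embedding (hypothetical size)

# Zero projections give gamma = beta = 0, so EM reduces to plain LN.
w_zero = np.zeros((8, 16))
identity_out = ego_modulation(tokens, ego, w_zero, w_zero)

# Non-zero projections rescale and shift each channel.
w_gamma = 0.1 * rng.normal(size=(8, 16))
w_beta = 0.1 * rng.normal(size=(8, 16))
modulated = ego_modulation(tokens, ego, w_gamma, w_beta)
```

The `+ 1` offset means a zero-valued $\gamma$ leaves the normalized features unchanged, which is a common reason for writing modulation layers in this form.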

### C. Joint Geometric Optimization Strategy
To combat structural ambiguity from rendering-only supervision, a dual-level constraint mechanism is proposed.
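*   **Explicit Geometric Constraints:** An $L_1$ loss between the ground-truth depth $d(r_k)$ and the rendered depth $\tilde{d}(r_k)$, averaged over the $N_i$ rays of frame $i$ and weighted per frame by $\lambda_i$: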
    $$
    \mathcal{L}_{\text{render}} = \sum_{i=0}^{\Delta t} \lambda_i \frac{1}{N_i} \sum_{k=1}^{N_i} | d(r_k) - \tilde{d}(r_k) |
    $$
*   **Implicit Geometric Regularization:** Aligns predicted volumetric features $\hat{V}_t$ with geometry-aware priors $V_t$ from a **frozen pre-trained geometric feature extractor**.
    *   **Cosine Similarity Loss:** Enforces local voxel-wise consistency.
        $$
        \mathcal{L}_{\text{cos}} = 1 - \frac{1}{whz} \sum_{i,j,k} \frac{\hat{V}_t(i,j,k) \cdot V_t(i,j,k)}{\|\hat{V}_t(i,j,k)\|_2 \|V_t(i,j,k)\|_2}
        $$
    *   **Gram Matrix Loss:** Enforces global structural patterns by matching feature correlations across spatial projections (e.g., $HW$, $HZ$, $WZ$).
        $$
        \mathcal{L}_{\text{gram}} = \frac{1}{3} \sum_{d} \| G^d_t - \hat{G}^d_t \|^2_F, \quad d \in \{HW, HZ, WZ\}
        $$
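        where $G^d_t = V^d_t {V^d_t}^T$ is the Gram matrix of the reference features projected onto plane $d$, and $\hat{G}^d_t$ is computed in the same way from the predicted features $\hat{V}_t$.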
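
The two regularization terms can be sketched in NumPy. The summary does not say how the planar features $V^d_t$ are obtained from the volume. The sketch assumes mean-pooling over the dropped axis followed by flattening to a `(positions, channels)` matrix, and the function names are hypothetical.

```python
import numpy as np

def cosine_loss(v_pred, v_ref, eps=1e-8):
    """1 - mean voxel-wise cosine similarity; inputs have shape (w, h, z, c)."""
    dot = np.sum(v_pred * v_ref, axis=-1)
    norms = np.linalg.norm(v_pred, axis=-1) * np.linalg.norm(v_ref, axis=-1)
    return 1.0 - float(np.mean(dot / (norms + eps)))

def plane_gram(volume, drop_axis):
    """Gram matrix V V^T of features projected onto one plane.

    Projection by mean-pooling over the dropped axis is an assumption.
    """
    plane = volume.mean(axis=drop_axis)            # (a, b, c)
    flat = plane.reshape(-1, plane.shape[-1])      # (a * b, c)
    return flat @ flat.T                           # (a * b, a * b)

def gram_loss(v_pred, v_ref):
    """Average squared Frobenius distance over the HW, HZ, and WZ planes."""
    # With axes ordered (w, h, z, c): HW drops z, HZ drops w, WZ drops h.
    drop_axes = (2, 0, 1)
    total = 0.0
    for axis in drop_axes:
        diff = plane_gram(v_ref, axis) - plane_gram(v_pred, axis)
        total += float(np.sum(diff ** 2))
    return total / len(drop_axes)

rng = np.random.default_rng(0)
v_ref = rng.normal(size=(4, 5, 3, 8))              # toy reference volume
v_pred = v_ref + 0.1 * rng.normal(size=v_ref.shape)
print(cosine_loss(v_pred, v_ref), gram_loss(v_pred, v_ref))
```

The cosine term compares each voxel independently, while the Gram term compares pairwise correlations between positions, which is why the two are described as local and global constraints.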

### D. Training Objectives
The model is trained with a composite loss:
$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{lang}} + \mathcal{L}_{\text{gen}}
$$
where $\mathcal{L}_{\text{lang}}$ is the standard next-token prediction loss for language, and $\mathcal{L}_{\text{gen}} = 10\mathcal{L}_{\text{render}} + \mathcal{L}_{\text{cos}} + \mathcal{L}_{\text{gram}}$.
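
A small worked example shows how the terms combine. The depth values, frame weights, and individual loss magnitudes below are made up for illustration, and the function names are hypothetical. Only the weighting (10, 1, 1) comes from the text above.

```python
import numpy as np

def render_loss(gt_depths, pred_depths, frame_weights):
    """Weighted sum over frames of the mean absolute depth error per ray."""
    return sum(
        lam * float(np.mean(np.abs(np.asarray(gt) - np.asarray(pred))))
        for lam, gt, pred in zip(frame_weights, gt_depths, pred_depths)
    )

def generation_loss(l_render, l_cos, l_gram):
    """L_gen = 10 * L_render + L_cos + L_gram."""
    return 10.0 * l_render + l_cos + l_gram

def total_loss(l_lang, l_render, l_cos, l_gram):
    """L_total = L_lang + L_gen."""
    return l_lang + generation_loss(l_render, l_cos, l_gram)

# Two frames with made-up depths (metres): frame 0 has two rays, frame 1 has one.
gt = [[10.0, 20.0], [30.0]]
pred = [[11.0, 18.0], [33.0]]
weights = [1.0, 0.5]

l_render = render_loss(gt, pred, weights)       # 1.0 * 1.5 + 0.5 * 3.0 = 3.0
l_total = total_loss(2.0, l_render, 0.1, 0.2)   # 2.0 + 30.0 + 0.1 + 0.2 = 32.3
print(l_render, l_total)
```

The factor of 10 on the rendering term makes the explicit depth supervision dominate the generation loss unless the regularizers are large.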

**Training Stages:**
1.  **Geometry-aware Pre-training:** Train and freeze the geometric feature extractor.
2.  **Vision-Language Alignment:** Pre-train the BEV tokenizer and the BEV-to-Point Render, then align BEV features with the LLM using masked augmentation.
3.  **Unified Joint Training:** Integrate the Current-to-Future Link and train all components with $\mathcal{L}_{\text{total}}$.

## Empirical Validation / Results

### A. Main Results on Unified Tasks
**Table II: Comparison with specialist models on OmniDrive-nuScenes.**

| Method | Reference | Modality | Generation (CD 3s ↓) | Understanding (CIDEr ↑) |
| :--- | :--- | :--- | :--- | :--- |
| **Generation Specialists** | | | | |
| ViDAR | CVPR 24 | C → L | 1.73 | - |
| DriveX | ICCV 25 | C → L | 1.10 | - |
| **Understanding Specialists** | | | | |
| Omni-Q | CVPR 25 | C → T | - | 0.686 |
| ORION | ICCV 25 | C → T | - | 0.635 |
| **Unified Models** | | | | |
| HERMES (conf.) | ICCV 25 | C → T&L | 1.17 | 0.741 |
| **HERMES++ (1.8B)** | - | C → T&L | **1.01** | **0.749** |
| **HERMES++ (3.8B)** | - | C → T&L | **0.97** | **0.772** |

*   **Generation:** HERMES++ reduces the 3-second Chamfer Distance error by **8.2%** compared to the leading specialist DriveX (1.01 vs. 1.10).
*   **Understanding:** HERMES++ outperforms the specialist Omni-Q by **9.2%** on CIDEr (0.749 vs. 0.686), **without any auxiliary detection supervision**.
*   **vs. Conference Version:** The improved HERMES++ shows a **13.7% reduction** in 3s generation error and consistent gains in understanding metrics.
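
The percentages quoted above follow directly from the table entries for the 1.8B model. A short check, with a hypothetical helper name:

```python
def relative_change(baseline, new):
    """Signed percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0

# Values taken from Table II (HERMES++ 1.8B row).
cd_vs_drivex = relative_change(1.10, 1.01)      # Chamfer Distance, lower is better
cider_vs_omniq = relative_change(0.686, 0.749)  # CIDEr, higher is better
cd_vs_hermes = relative_change(1.17, 1.01)      # vs. conference version

print(f"{cd_vs_drivex:.1f}%")    # -8.2%
print(f"{cider_vs_omniq:.1f}%")  # 9.2%
print(f"{cd_vs_hermes:.1f}%")    # -13.7%
```

Negative values indicate a reduction, which is an improvement for Chamfer Distance.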

### B. Ablation Studies
Key findings from systematic ablations:

**1. BEV Input Representation is Critical:**
*   Direct multi-view token input leads to **spatial structural collapse**, increasing 3s CD by ~32% compared to BEV input, despite similar understanding scores (Fig. 4).
*   **Downsampling scale** matters. A factor of 4 provides the best trade-off between geometric detail and token length (Tab. III).

**2. Joint Geometric Optimization is Effective:**
*   **Table IV:** Ablation on regularization losses.
    | $\mathcal{L}_{\text{cos}}$ | $\mathcal{L}_{\text{gram}}$ | Gen (CD 3s ↓) | Und (CIDEr ↑) |
    | :---: | :---: | :---: | :---: |
    | - | - | 1.637 | 0.722 |
    | ✓ | - | 1.441 | 0.717 |
    | - | ✓ | 1.544 | 0.717 |
    | ✓ | ✓ | **1.436** | **0.720** |
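*   The combined strategy yields the best generation result (CD 1.436), while understanding stays essentially flat (CIDEr 0.720 vs. 0.722 without regularization). Visualization (Fig. 5) shows it suppresses projection artifacts and central bias, leading to cleaner, geometry-faithful latent features.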

**3. Current-to-Future Link Components are Necessary:**
*   **Table V:** Progressive ablation of the Link.
    | Modules | Gen (CD 3s ↓) | Und (CIDEr ↑) |
    | :--- | :--- | :--- |
    | w/o Link | 2.377 | 0.433 |
    | w/ Simple Link | 1.542 | 0.718 |
    | + Textual Injection | 1.506 | 0.717 |
    | + Ego Modulation | 1.442 | 0.711 |
    | + More blocks | **1.436** | **0.720** |
*   Each component contributes: Textual Injection provides semantic guidance, Ego Modulation decouples motion, and greater depth improves modeling of non-linear dynamics.

**4. Task Interaction and World Queries are Beneficial:**
*   **Table VI:** Joint training of both tasks outperforms a "Separated unification" baseline (shared encoder but no deep interaction) by a large margin (CD 1.436 vs. 1.634).
*   **Table VII:** Processing world queries through the LLM (setting c) is superior to bypassing it (setting b), confirming the importance of infusing queries with LLM knowledge.

**5. Hyperparameter Analysis:**
*   **World Query Initialization:** Simple **Max Pooling** from BEV features works best, outperforming more complex parametric methods (Tab. VIIIa).
*   **Number of Queries:** $n=4$ queries per timestep strikes an optimal balance (Tab. VIIIb).
*   **Model Scalability:** Performance improves consistently with larger LLMs (Tab. XIIb), with the 3.8B model achieving CD 1.255 and CIDEr 0.742.

### C. Generalization to Additional Tasks
*   **NuScenes-QA (VQA):** HERMES++ achieves **61.3%** accuracy, setting a new state-of-the-art, surpassing camera-based Omni-Q (59.2%) and even LiDAR-based specialists (Tab. IX).
*   **DriveLM (Graph VQA & Reasoning):** Achieves a highly competitive Final Score of **0.59**, matching the challenge winner, with leading prediction accuracy (0.83) and match score (0.43) (Tab. X).
*   **Motion Planning:** When extended with a lightweight trajectory head, HERMES++ achieves competitive open-loop planning results (Avg. L2: 0.37m, Collision Rate: 0.29%), demonstrating internalized actionable dynamics (Tab. XI).

## Theoretical and Practical Implications
*   **Theoretical:** Demonstrates the feasibility and synergy of a **truly unified world model** that closes the loop between semantic interpretation and physical simulation. The BEV representation is shown empirically to be an effective unified spatial-semantic interface.
*   **Practical:** Provides a foundation for more **interpretable and predictive autonomous driving systems**. The model's ability to explain its reasoning *and* show the anticipated geometric consequences of that reasoning could enhance trust and debuggability.
*   **Methodological:** The **Joint Geometric Optimization** strategy offers a general approach for enforcing structural priors in neural rendering tasks. The **World Query** mechanism presents a novel design pattern for transferring knowledge between understanding and generative modules within a single model.

## Conclusion
HERMES++ presents a significant step toward a unified driving world model that integrates 3D scene understanding and future geometry prediction. By leveraging a BEV representation, LLM-enhanced world queries, a Current-to-Future Link, and a Joint Geometric Optimization strategy, the framework achieves strong, synergistic performance on both tasks, outperforming specialist approaches. Extensive experiments validate its effectiveness and generalization capabilities.

**Limitations & Future Work:** The paper notes that further investigation is needed on how to best leverage semantic priors from pre-trained multi-modal models for BEV inputs. Expanding the generation paradigm to diverse modalities (e.g., video, occupancy) for comprehensive scene simulation is a promising future direction.

---

_Markdown view of https://picx.dev/p/bXahf0, served by PicX — AI-generated visual whiteboard summaries of research papers._
