HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation - Summary

Summary (Overview)

  • Unified Framework: Proposes HERMES++, the first unified driving world model that integrates 3D scene understanding (via Large Language Models) and future geometry prediction (via point cloud generation) within a single cohesive framework.
  • Core Technical Innovations: Introduces four key designs: 1) BEV representation as a unified spatial-semantic interface; 2) LLM-enhanced World Queries for knowledge transfer; 3) Current-to-Future Link to bridge temporal and semantic gaps; 4) Joint Geometric Optimization strategy combining explicit and implicit constraints for structural integrity.
  • Superior Performance: Achieves state-of-the-art or highly competitive results on both generation and understanding tasks across multiple benchmarks (NuScenes, OmniDrive-nuScenes, NuScenes-QA, DriveLM), outperforming specialist models in each domain.
  • Effective Synergy: Demonstrates that joint training of understanding and generation tasks creates a positive feedback loop—semantic reasoning guides geometric prediction, and geometric constraints ground language generation.
  • Strong Generalization: The framework shows promising generalization to additional tasks like motion planning and is adaptable to different LLM architectures and scales, with performance improving alongside model capacity.

Introduction and Theoretical Foundation

Driving world models are pivotal for autonomous driving, simulating environmental dynamics to forecast risks. Existing research falls into two distinct paradigms with complementary limitations:

  1. Generation-centric World Models: Focus on predicting future scene evolution (e.g., future videos or 3D point clouds) but lack intrinsic mechanisms for semantic interpretation (e.g., Visual Question Answering, scene description).
  2. Language-centric Models (LLMs/VLMs): Demonstrate impressive reasoning capabilities for scene understanding but lack the capacity to predict future geometric evolution.

This creates a significant capability gap. A holistic autonomous system requires both contextual awareness of the present and the ability to anticipate future physical states.

Motivation & Core Hypothesis: The authors propose that a true world model should seamlessly integrate 3D scene understanding with accurate future geometry prediction. This requires solving two key challenges:

  1. A suitable 3D representation that consolidates multi-view observations, preserves geometric interactions, and remains compatible with token-based LLMs.
  2. An interaction mechanism to bridge understanding and generation, ensuring semantic understanding guides geometric evolution and geometric predictions ground language generation.

Theoretical Basis: The work builds upon the formal definition of a driving world model. Given an observation $O_t$ at time $t$ and an action $A_t$, the model predicts the subsequent observation $O_{t+1}$ through three components:

$$Z_t = E(O_t),\quad Z_{t+1} = M(Z_t, A_t),\quad \hat{O}_{t+1} = D(Z_{t+1})$$

where $E$ is an encoder, $M$ is a transition model, and $D$ is a decoder. HERMES++ instantiates this with multi-view images as $O_t$ and future point clouds as $\hat{O}_{t+1}$, leveraging the Bird's-Eye View (BEV) representation as the core spatial substrate $Z$.
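
As a concrete reading of this factorization, the minimal sketch below wires the three components together as generic PyTorch modules. The class name, tensor shapes, and single-step interface are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Generic E/M/D factorization: Z_t = E(O_t), Z_{t+1} = M(Z_t, A_t), O^_{t+1} = D(Z_{t+1})."""
    def __init__(self, encoder: nn.Module, transition: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder        # E: multi-view images -> latent state (here: BEV)
        self.transition = transition  # M: (latent state, action/ego-motion) -> next latent state
        self.decoder = decoder        # D: next latent state -> predicted observation (point cloud)

    def forward(self, o_t: torch.Tensor, a_t: torch.Tensor) -> torch.Tensor:
        z_t = self.encoder(o_t)               # Z_t = E(O_t)
        z_next = self.transition(z_t, a_t)    # Z_{t+1} = M(Z_t, A_t)
        return self.decoder(z_next)           # O^_{t+1} = D(Z_{t+1})
```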

Methodology

The overall pipeline (Fig. 2) integrates language-based reasoning with geometric generation. The key components are:

A. Visual Tokenizer and BEV-to-Point Render

  • BEV Tokenizer: Transforms multi-view images $\{I^i_t\}_{i=1}^N$ into a compressed, LLM-compatible format.
    1. A vision encoder extracts features, which are lifted to a BEV representation $F^{bev}_t \in \mathbb{R}^{w \times h \times c}$ via spatial cross-attention (inspired by BEVFormer).
    2. The BEV feature is downsampled and flattened into tokens: $F_t = \phi(\text{Flatten}(F^{down}_t)) \in \mathbb{R}^{L_{BEV} \times C}$.
  • BEV-to-Point Render ($\mathcal{R}$): A differentiable module that decodes BEV features back to 3D point clouds $P_t$ (sketched below, together with the tokenizer).
    1. BEV features are upsampled and expanded into a volumetric representation $\hat{V}_t \in \mathbb{R}^{w \times h \times z \times c'}$.
    2. Scene geometry is modeled as an implicit Signed Distance Function (SDF) field. For a LiDAR ray $r_k$, the rendered depth is a weighted sum over sampled points: $\tilde{d}(r_k) = \sum_{i=1}^{n} w_i d_i, \quad w_i = T_i \alpha_i$, where the opacity $\alpha_i$ is derived from the SDF values $s_i$.
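
A minimal sketch of both directions of this interface follows: flattening a downsampled BEV map into LLM tokens, and rendering a per-ray depth from SDF-derived weights. The pooling choice, the sigmoid mapping from SDF values to opacity, and all shapes are simplifying assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bev_to_tokens(f_bev: torch.Tensor, proj: nn.Linear, stride: int = 4) -> torch.Tensor:
    """f_bev: (B, C, H, W) BEV feature map -> (B, L_BEV, C') LLM token sequence."""
    f_down = F.avg_pool2d(f_bev, kernel_size=stride)    # downsample by the chosen factor
    tokens = f_down.flatten(2).transpose(1, 2)          # Flatten(F_down): (B, H'*W', C)
    return proj(tokens)                                 # phi(.): project to the LLM width

def render_depth(sdf: torch.Tensor, depths: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """sdf, depths: (R, n) samples along R rays -> (R,) rendered depths d~(r_k)."""
    alpha = torch.sigmoid(-sdf / beta)                  # opacity from SDF (one common heuristic)
    # Transmittance T_i = prod_{j<i} (1 - alpha_j); weights w_i = T_i * alpha_i.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1), dim=1)
    weights = trans * alpha
    return (weights * depths).sum(dim=1)                # d~(r_k) = sum_i w_i d_i
```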

B. Unification of Understanding and Generation

  • Language-based Scene Understanding: The LLM processes BEV tokens $F_t$ and user instruction tokens $T$ to generate textual responses, enriching its internal representations with semantic knowledge.
  • World Queries for Knowledge Transfer: Learnable queries $Q_w$ are injected into the LLM input to aggregate semantic context. They are initialized from BEV features and conditioned on future ego-motion embeddings $e_{t+i}$ and frame embeddings $FE$: $Q_w = \phi\left(\text{Concat}_{i=1}^{\Delta t}\left( (Q \oplus e_{t+i}) \oplus FE \right)\right)$. The causal attention mechanism allows these queries to absorb world knowledge from the LLM, becoming carriers ($Q_w^\epsilon$) of semantic priors for generation; a code sketch of this query construction and of Ego Modulation follows this list.
  • Current-to-Future Link: A transformer-based module that propagates the current encoded BEV feature $B_t$ to future states $\{B_{t+i}\}$, conditioned on world queries and text embeddings.
    • Textual Injection: Extracts text embeddings $\hat{T}$ from the LLM to provide explicit semantic conditioning in cross-attention layers.
    • Ego Modulation (EM): Adapts feature distributions based on future ego-motion to decouple camera motion from scene dynamics: $\text{EM}(x) = (\gamma + 1) \odot \text{LN}(x) + \beta$.
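
The sketch below illustrates two of the mechanisms above: building world queries from pooled BEV tokens plus future ego-motion and frame embeddings, and the Ego Modulation layer. It assumes $\oplus$ denotes element-wise addition, uses max pooling for query initialization (the best-performing choice per the ablations), and all dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

def build_world_queries(bev_tokens, ego_embeds, frame_embed, proj, n_queries=4):
    """bev_tokens: (B, L, C); ego_embeds: list of (B, C) vectors, one per future step."""
    # Initialize queries by max-pooling the BEV tokens.
    q = bev_tokens.max(dim=1, keepdim=True).values.expand(-1, n_queries, -1)  # (B, n, C)
    # (Q (+) e_{t+i}) (+) FE, taking (+) to be element-wise addition here.
    per_step = [q + e.unsqueeze(1) + frame_embed for e in ego_embeds]         # dt x (B, n, C)
    return proj(torch.cat(per_step, dim=1))                                   # (B, n*dt, C)

class EgoModulation(nn.Module):
    """EM(x) = (gamma + 1) * LN(x) + beta, with gamma, beta predicted from future ego-motion."""
    def __init__(self, dim: int, ego_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_gamma_beta = nn.Linear(ego_dim, 2 * dim)

    def forward(self, x: torch.Tensor, ego: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim) BEV feature tokens; ego: (B, ego_dim) future ego-motion embedding.
        gamma, beta = self.to_gamma_beta(ego).chunk(2, dim=-1)
        return (gamma.unsqueeze(1) + 1.0) * self.norm(x) + beta.unsqueeze(1)
```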

C. Joint Geometric Optimization Strategy

To combat structural ambiguity from rendering-only supervision, a dual-level constraint mechanism is proposed.

  • Explicit Geometric Constraints: A simple $L_1$ loss on rendered depths vs. ground truth: $\mathcal{L}_{\text{render}} = \sum_{i=0}^{\Delta t} \lambda_i \frac{1}{N_i} \sum_{k=1}^{N_i} \left| d(r_k) - \tilde{d}(r_k) \right|$
  • Implicit Geometric Regularization: Aligns predicted volumetric features $\hat{V}_t$ with geometry-aware priors $V_t$ from a frozen, pre-trained geometric feature extractor (a sketch of all three loss terms follows this list).
    • Cosine Similarity Loss: Enforces local voxel-wise consistency: $\mathcal{L}_{\text{cos}} = 1 - \frac{1}{whz} \sum_{i,j,k} \frac{\hat{V}_t(i,j,k) \cdot V_t(i,j,k)}{\|\hat{V}_t(i,j,k)\|_2 \, \|V_t(i,j,k)\|_2}$
    • Gram Matrix Loss: Enforces global structural patterns by matching feature correlations across spatial projections ($HW$, $HZ$, $WZ$): $\mathcal{L}_{\text{gram}} = \frac{1}{3} \sum_{d} \| G^d_t - \hat{G}^d_t \|^2_F, \quad d \in \{HW, HZ, WZ\}$, where $G^d_t = V^d_t {V^d_t}^{\top}$.
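
The sketch below computes the three loss terms on toy tensors. The channels-first volumetric layout, the mean-pooled projections, and the Gram normalization are simplifying assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def render_loss(d_gt: torch.Tensor, d_pred: torch.Tensor) -> torch.Tensor:
    """Explicit constraint for one frame: mean L1 error over rendered ray depths."""
    return (d_gt - d_pred).abs().mean()

def cosine_loss(v_pred: torch.Tensor, v_prior: torch.Tensor) -> torch.Tensor:
    """Voxel-wise cosine alignment between predicted and prior volumes, both (C, H, W, Z)."""
    cos = F.cosine_similarity(v_pred, v_prior, dim=0)    # (H, W, Z) per-voxel similarity
    return 1.0 - cos.mean()

def gram_loss(v_pred: torch.Tensor, v_prior: torch.Tensor) -> torch.Tensor:
    """Match channel-correlation (Gram) structure over the HW, HZ, and WZ projections."""
    loss = v_pred.new_zeros(())
    for reduce_dim in (3, 2, 1):                          # pool Z, W, H -> HW, HZ, WZ maps
        p = v_pred.mean(dim=reduce_dim).flatten(1)        # (C, S) projected features
        q = v_prior.mean(dim=reduce_dim).flatten(1)
        g_p, g_q = p @ p.T, q @ q.T                       # Gram matrices G = V V^T
        loss = loss + (g_p - g_q).pow(2).sum()            # squared Frobenius norm
    return loss / 3.0
```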

D. Training Objectives

The model is trained with a composite loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{lang}} + \mathcal{L}_{\text{gen}}$$

where $\mathcal{L}_{\text{lang}}$ is the standard next-token prediction loss for language, and $\mathcal{L}_{\text{gen}} = 10\,\mathcal{L}_{\text{render}} + \mathcal{L}_{\text{cos}} + \mathcal{L}_{\text{gram}}$.
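
A minimal sketch of how these terms combine, using the stated weight of 10 on the rendering loss (the individual terms are assumed to come from the loss sketches in Section C):

```python
def total_loss(l_lang, l_render, l_cos, l_gram):
    """L_total = L_lang + L_gen, with L_gen = 10 * L_render + L_cos + L_gram."""
    l_gen = 10.0 * l_render + l_cos + l_gram
    return l_lang + l_gen
```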

Training Stages:

  1. Geometry-aware Pre-training: Train and freeze the geometric feature extractor.
  2. Vision-Language Alignment: Pre-train the tokenizer/Render and align BEV features with the LLM using masked augmentation.
  3. Unified Joint Training: Integrate the Current-to-Future Link and train all components with $\mathcal{L}_{\text{total}}$.

Empirical Validation / Results

A. Main Results on Unified Tasks

Table II: Comparison with specialist models on OmniDrive-nuScenes.

| Method | Reference | Modality | Generation (CD 3s ↓) | Understanding (CIDEr ↑) |
|---|---|---|---|---|
| *Generation Specialists* | | | | |
| ViDAR | CVPR 24 | C → L | 1.73 | - |
| DriveX | ICCV 25 | C → L | 1.10 | - |
| *Understanding Specialists* | | | | |
| Omni-Q | CVPR 25 | C → T | - | 0.686 |
| ORION | ICCV 25 | C → T | - | 0.635 |
| *Unified Models* | | | | |
| HERMES (conf.) | ICCV 25 | C → T&L | 1.17 | 0.741 |
| HERMES++ (1.8B) | - | C → T&L | 1.01 | 0.749 |
| HERMES++ (3.8B) | - | C → T&L | 0.97 | 0.772 |
  • Generation: HERMES++ reduces the 3-second Chamfer Distance error by 8.2% compared to the leading specialist DriveX (1.01 vs. 1.10).
  • Understanding: HERMES++ outperforms the specialist Omni-Q by 9.2% on CIDEr (0.749 vs. 0.686), without any auxiliary detection supervision.
  • vs. Conference Version: The improved HERMES++ shows a 13.7% reduction in 3s generation error and consistent gains in understanding metrics (see the worked calculation below).
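
For reference, the relative changes quoted above follow directly from the table entries:

$$\frac{1.10 - 1.01}{1.10} \approx 8.2\%,\qquad \frac{0.749 - 0.686}{0.686} \approx 9.2\%,\qquad \frac{1.17 - 1.01}{1.17} \approx 13.7\%$$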

B. Ablation Studies

Key findings from systematic ablations:

1. BEV Input Representation is Critical:

  • Direct multi-view token input leads to spatial structural collapse, increasing 3s CD by ~32% compared to BEV input, despite similar understanding scores (Fig. 4).
  • Downsampling scale matters. A factor of 4 provides the best trade-off between geometric detail and token length (Tab. III).

2. Joint Geometric Optimization is Effective:

  • Table IV: Ablation on regularization losses.
    | $\mathcal{L}_{\text{cos}}$ | $\mathcal{L}_{\text{gram}}$ | Gen (CD 3s ↓) | Und (CIDEr ↑) |
    |---|---|---|---|
    | - | - | 1.637 | 0.722 |
    | ✓ | - | 1.441 | 0.717 |
    | - | ✓ | 1.544 | 0.717 |
    | ✓ | ✓ | 1.436 | 0.720 |
  • The combined strategy yields the best performance. Visualization (Fig. 5) shows it suppresses projection artifacts and central bias, leading to cleaner, geometry-faithful latent features.

3. Current-to-Future Link Components are Necessary:

  • Table V: Progressive ablation of the Link.
    | Modules | Gen (CD 3s ↓) | Und (CIDEr ↑) |
    |---|---|---|
    | w/o Link | 2.377 | 0.433 |
    | w/ Simple Link | 1.542 | 0.718 |
    | + Textual Injection | 1.506 | 0.717 |
    | + Ego Modulation | 1.442 | 0.711 |
    | + More blocks | 1.436 | 0.720 |
  • Each component contributes: Textual Injection provides semantic guidance, Ego Modulation decouples motion, and greater depth improves modeling of non-linear dynamics.

4. Task Interaction and World Queries are Beneficial:

  • Table VI: Joint training (✓✓) outperforms a "Separated unification" baseline (shared encoder but no deep interaction) by a large margin (CD 1.436 vs. 1.634).
  • Table VII: Processing world queries through the LLM (setting c) is superior to bypassing it (setting b), confirming the importance of infusing queries with LLM knowledge.

5. Hyperparameter Analysis:

  • World Query Initialization: Simple Max Pooling from BEV features works best, outperforming more complex parametric methods (Tab. VIIIa).
  • Number of Queries: $n=4$ queries per timestep strikes an optimal balance (Tab. VIIIb).
  • Model Scalability: Performance improves consistently with larger LLMs (Tab. XIIb), with the 3.8B model achieving CD 1.255 and CIDEr 0.742.

C. Generalization to Additional Tasks

  • NuScenes-QA (VQA): HERMES++ achieves 61.3% accuracy, setting a new state-of-the-art, surpassing camera-based Omni-Q (59.2%) and even LiDAR-based specialists (Tab. IX).
  • DriveLM (Graph VQA & Reasoning): Achieves a highly competitive Final Score of 0.59, matching the challenge winner, with leading prediction accuracy (0.83) and match score (0.43) (Tab. X).
  • Motion Planning: When extended with a lightweight trajectory head, HERMES++ achieves competitive open-loop planning results (Avg. L2: 0.37m, Collision Rate: 0.29%), demonstrating internalized actionable dynamics (Tab. XI).

Theoretical and Practical Implications

  • Theoretical: Demonstrates the feasibility and synergy of a truly unified world model that closes the loop between semantic interpretation and physical simulation. The BEV representation is proven as an effective unified spatial-semantic interface.
  • Practical: Provides a foundation for more interpretable and predictive autonomous driving systems. The model's ability to explain its reasoning and show the anticipated geometric consequences of that reasoning could enhance trust and debuggability.
  • Methodological: The Joint Geometric Optimization strategy offers a general approach for enforcing structural priors in neural rendering tasks. The World Query mechanism presents a novel design pattern for transferring knowledge between understanding and generative modules within a single model.

Conclusion

HERMES++ presents a significant step toward a unified driving world model that integrates 3D scene understanding and future geometry prediction. By leveraging a BEV representation, LLM-enhanced world queries, a Current-to-Future Link, and a Joint Geometric Optimization strategy, the framework achieves strong, synergistic performance on both tasks, outperforming specialist approaches. Extensive experiments validate its effectiveness and generalization capabilities.

Limitations & Future Work: The paper notes that further investigation is needed on how to best leverage semantic priors from pre-trained multi-modal models for BEV inputs. Expanding the generation paradigm to diverse modalities (e.g., video, occupancy) for comprehensive scene simulation is a promising future direction.