# Grounding World Simulation Models in a Real-World Metropolis

> Seoul World Model generates realistic city-scale videos grounded in real-world locations by conditioning on street-view images with cross-temporal pairing and a virtual lookahead mechanism.

- **Source:** [arXiv](https://arxiv.org/abs/2603.15583)
- **Published:** 2026-03-18
- **Permalink:** https://picx.dev/p/9N6DpJ
- **Whiteboard:** https://picx.dev/p/9N6DpJ/image

## Summary

## Summary (Overview)
*   **Core Contribution:** Introduces **Seoul World Model (SWM)**, the first city-scale video world model that generates videos grounded in the actual geometry and appearance of a real city (Seoul), moving beyond purely imagined environments.
*   **Key Method:** Uses **retrieval-augmented generation**, conditioning an autoregressive video diffusion model on nearby street-view images to anchor the output to real-world locations.
*   **Main Innovations:**
    *   **Cross-temporal pairing** during training to disentangle persistent scene structure from transient objects (e.g., cars, pedestrians).
    *   A **Virtual Lookahead Sink** that dynamically retrieves a future street-view image to stabilize long-horizon generation and combat error accumulation.
    *   Complementary **geometric and semantic referencing** pathways to provide both spatial layout and appearance detail from retrieved images.
*   **Key Results:** SWM outperforms existing video world models in visual quality, camera motion adherence, and 3D structural fidelity on benchmarks from unseen cities (Busan, Ann Arbor), demonstrating successful cross-city generalization.
*   **Significance:** Opens a new direction for world simulation grounded in physically existing environments, with applications in urban planning, autonomous driving simulation, and location-based exploration.

## Introduction and Theoretical Foundation
Traditional video world simulation models generate dynamic, interactive environments but operate entirely within *imagined* worlds. This paper poses a novel question: **What if a world model could simulate a city that actually exists?** The authors formalize this goal as **real-world grounded video world simulation**.

The core idea is to leverage the vast availability of geotagged street-view imagery as a scalable source of location-specific visual references. By conditioning a generative world model on these references, the model can be anchored to the real geometric layout and appearance of a specific location. This enables applications like navigating familiar streets, visualizing urban planning scenarios, or generating hypothetical events (e.g., "a massive wave") in a real city context.

The key challenge is that naively using street-view images for conditioning introduces problems: temporal misalignment (references show a different moment in time), limited/sparse trajectory data, and long-horizon error accumulation in autoregressive generation. SWM is designed to address these specific challenges.

## Methodology
SWM is built upon a pretrained autoregressive video Diffusion Transformer (DiT) [1, 33]. It generates video chunk-by-chunk, conditioned on a text prompt $P^{(i)}$, a target camera trajectory $C^{(i)} = \{c_t\}_{t=0}^{T-1}$, and self-generated history latents $Z_{hist}^{(i)}$.
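
The chunk-by-chunk loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `rollout`, `toy_sampler`, and `history_len` are hypothetical names, latents are reduced to plain lists of floats, and the sampler is a toy function standing in for the DiT.

```python
from typing import Callable, List, Sequence

Latents = List[float]  # toy stand-in for a chunk of video latents


def rollout(
    sample_chunk: Callable[[str, Sequence[float], Latents], Latents],
    prompts: Sequence[str],
    trajectories: Sequence[Sequence[float]],
    history_len: int,
) -> List[Latents]:
    """Generate chunks in order, conditioning each on the latest history."""
    history: Latents = []  # Z_hist is empty before the first chunk
    chunks: List[Latents] = []
    for prompt, cameras in zip(prompts, trajectories):
        z = sample_chunk(prompt, cameras, history)  # hypothetical DiT sampler
        chunks.append(z)
        # Keep only the most recent `history_len` latents as the next Z_hist.
        history = (history + z)[-history_len:] if history_len > 0 else []
    return chunks


def toy_sampler(prompt: str, cameras: Sequence[float], history: Latents) -> Latents:
    """Toy sampler: continues from the last history value along the camera path."""
    start = history[-1] if history else 0.0
    return [start + c for c in cameras]


chunks = rollout(toy_sampler, ["street", "street"], [[1.0, 2.0], [1.0, 2.0]], history_len=2)
```

Because each chunk depends on self-generated history, any error in one chunk is carried into the next, which is the accumulation problem the later components target.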

### 1. Data Construction
*   **Real Street-View Dataset:** 440K images from Seoul. A key innovation is **cross-temporal pairing**: target video sequences and their conditioning reference images are captured at *different timestamps*. This forces the model to learn persistent scene structure, ignoring transient objects that differ between reference and target.
*   **Synthetic Dataset:** Created with the CARLA simulator [10] to provide diverse camera trajectories (pedestrian, vehicle, free-camera) that street-view data covers poorly, improving model robustness.
*   **View Interpolation Pipeline:** Street-view images are spatially sparse. An **intermittent freeze-frame strategy** is proposed to synthesize coherent training videos from sparse keyframes, ensuring compatibility with the temporal compression of the 3D VAE.
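
Cross-temporal pairing can be illustrated with a small selection routine. The sketch below assumes each capture carries a local position in meters and a timestamp; the `Capture` fields and the `max_dist_m` and `min_gap_s` thresholds are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Capture:
    image_id: str
    x_m: float        # local east coordinate in meters (assumed)
    y_m: float        # local north coordinate in meters (assumed)
    timestamp: float  # capture time in seconds (assumed)


def pick_cross_temporal_reference(
    target: Capture,
    candidates: Sequence[Capture],
    max_dist_m: float = 20.0,     # hypothetical search radius
    min_gap_s: float = 86400.0,   # hypothetical minimum time gap (one day)
) -> Optional[Capture]:
    """Return the nearest capture that was taken at a different time."""
    best: Optional[Capture] = None
    best_dist = max_dist_m
    for cand in candidates:
        if abs(cand.timestamp - target.timestamp) < min_gap_s:
            continue  # same capture session: transient objects would match
        dist = math.hypot(cand.x_m - target.x_m, cand.y_m - target.y_m)
        if dist <= best_dist:
            best, best_dist = cand, dist
    return best


target = Capture("t", 0.0, 0.0, 0.0)
pool = [
    Capture("same_time", 1.0, 0.0, 10.0),
    Capture("near", 5.0, 0.0, 1.0e7),
    Capture("far", 50.0, 0.0, 1.0e7),
]
reference = pick_cross_temporal_reference(target, pool)
```

Here the closest capture is skipped because it shares the target's session, so the reference differs from the target in cars and pedestrians while still showing the same buildings and road layout.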

### 2. Street-View Retrieval
For a target chunk, nearby street-view panoramas are retrieved from a geo-indexed database. They are rendered into pinhole views aligned with the target camera's viewing direction. Each retrieved reference $x_{ref,k}^{(i)}$ comes with an estimated camera pose $c_{ref,k}^{(i)}$ and depth map $d_{ref,k}^{(i)}$.
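
A geo-indexed lookup of this kind can be sketched with a great-circle distance and a linear scan. This is an assumption-laden toy: the database is a list of `(image_id, lat, lon)` tuples with made-up entries, and `retrieve_nearest` is a hypothetical helper; a real system would likely use a spatial index rather than sorting every entry.

```python
import math
from typing import List, Sequence, Tuple

Entry = Tuple[str, float, float]  # (image_id, latitude_deg, longitude_deg)

EARTH_RADIUS_M = 6371000.0


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def retrieve_nearest(db: Sequence[Entry], lat: float, lon: float, k: int = 2) -> List[Entry]:
    """Return the k database entries closest to the query location."""
    return sorted(db, key=lambda e: haversine_m(lat, lon, e[1], e[2]))[:k]


# Illustrative entries only; the coordinates are not from the paper's dataset.
db = [
    ("a", 37.5665, 126.9780),
    ("b", 37.5670, 126.9780),
    ("c", 37.6000, 127.0000),
]
nearest = retrieve_nearest(db, 37.5666, 126.9780, k=2)
spacing_ab = haversine_m(37.5665, 126.9780, 37.5670, 126.9780)
```

The retrieved panoramas would then be rendered into pinhole views facing the target camera's direction, a step this sketch leaves out.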

### 3. Virtual Lookahead Sink
To combat error accumulation in long-horizon generation (hundreds of meters), SWM introduces a **Virtual Lookahead (VL) Sink**. Instead of using a static first frame as an attention sink [28, 40], it dynamically retrieves the street-view image nearest to the *endpoint* of the current chunk's target trajectory. This retrieved image is encoded as a latent $z_{VL}^{(i)}$ and placed at a future temporal position $H+L+\Delta_{VL}$ within the input sequence.
$$
Z_{seq}^{(i)} = [Z_{hist}^{(i)}; Z^{(i)}; z_{VL}^{(i)}], \quad p_{seq}^{(i)} = [\underbrace{1,...,H}_{\text{history}}; \underbrace{H+1,...,H+L}_{\text{target}}; \underbrace{H+L+\Delta_{VL}}_{\text{sink}}]
$$
This acts as a "virtual destination," providing a stable, error-free anchor relevant to the upcoming location.
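
The position layout $p_{seq}^{(i)}$ and the endpoint-based retrieval can be written out directly. The sketch below follows the formula above; `build_positions` and `lookahead_index` are hypothetical names, locations are simplified to 2D points in meters, and the value chosen for `delta_vl` is illustrative only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # simplified 2D location in meters (assumed)


def build_positions(H: int, L: int, delta_vl: int) -> List[int]:
    """Temporal positions for [history; target; virtual lookahead sink]."""
    history = list(range(1, H + 1))
    target = list(range(H + 1, H + L + 1))
    sink = [H + L + delta_vl]  # placed beyond the last target position
    return history + target + sink


def lookahead_index(trajectory: Sequence[Point], references: Sequence[Point]) -> int:
    """Index of the reference closest to the endpoint of the chunk's trajectory."""
    end = trajectory[-1]
    return min(range(len(references)), key=lambda k: math.dist(end, references[k]))


positions = build_positions(H=2, L=3, delta_vl=4)
sink_ref = lookahead_index(
    [(0.0, 0.0), (0.0, 10.0)],
    [(0.0, 1.0), (0.0, 12.0), (5.0, 30.0)],
)
```

With `H=2`, `L=3`, and `delta_vl=4`, the sink sits at position 9, four steps past the final target position 5, and the selected reference is the one nearest the trajectory's endpoint rather than its start.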

### 4. Geometric and Semantic Referencing
Retrieved images are used for two complementary conditioning pathways:
*   **Geometric Referencing:** The nearest reference image is **warped** into each target frame's viewpoint via depth-based forward splatting to provide explicit spatial layout cues.
    $$
    x_{warp,t}^{(i)} = \text{Render}(\text{Unproj}(x_{ref,j}^{(i)}, d_{ref,j}^{(i)}), c_{ref,j \to t}^{(i)})
    $$
*   **Semantic Referencing:** The original reference image latents are injected into the transformer's sequence, allowing the model to attend to them for appearance details. Cross-temporal pairing encourages attention to persistent structures (see Fig. 6).
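
The unproject-then-render step in the warping formula can be illustrated for a single pixel. This sketch assumes a pinhole camera with intrinsics `fx, fy, cx, cy` and a relative pose `(R, t)` that maps reference-camera coordinates to target-camera coordinates; these conventions and all function names are assumptions, and a full forward-splatting renderer would additionally resolve occlusions with a depth buffer.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift a pixel with known depth to a 3D point in the reference camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(p: Vec3, R: List[List[float]], t: Vec3) -> Vec3:
    """Apply the relative pose: p_target = R @ p_ref + t."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def project(p: Vec3, fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Project a 3D point to pixel coordinates; None if it lies behind the camera."""
    x, y, z = p
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def warp_pixel(u: float, v: float, depth: float, intrinsics: Tuple[float, float, float, float],
               R: List[List[float]], t: Vec3) -> Optional[Tuple[float, float]]:
    """Warp one reference pixel into the target view."""
    fx, fy, cx, cy = intrinsics
    return project(transform(unproject(u, v, depth, fx, fy, cx, cy), R, t), fx, fy, cx, cy)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (100.0, 100.0, 320.0, 240.0)
# Target camera is 1 m ahead of the reference, so points move 1 m closer in z.
warped = warp_pixel(420.0, 240.0, 2.0, K, identity, (0.0, 0.0, -1.0))
```

Moving the camera forward pushes the off-center point further from the image center, which is the parallax cue the warped reference supplies to the model.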

## Empirical Validation / Results
**Evaluation Setup:** SWM is evaluated on two benchmarks from cities *unseen* during training: **Busan-City-Bench** and **Ann-Arbor-City-Bench** (from MARS [26]). Each contains 30 test sequences (~100 m each). Metrics assess visual quality (FID, FVD, Image Quality), camera-following accuracy (Rotation Error, Translation Error), and 3D adherence to static scene regions (masked PSNR, masked LPIPS).
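
As a rough illustration of a masked metric, the sketch below computes PSNR over only the pixels flagged as static. It assumes flattened pixel lists with values in `[0, 1]` and a binary mask; the exact masking procedure used in the paper is not described here, so `masked_psnr` is a hypothetical helper rather than the benchmark's definition.

```python
import math
from typing import Sequence


def masked_psnr(pred: Sequence[float], target: Sequence[float],
                mask: Sequence[int], max_val: float = 1.0) -> float:
    """PSNR in dB computed only where mask is nonzero (static scene regions)."""
    squared_error = 0.0
    count = 0
    for p, t, m in zip(pred, target, mask):
        if m:
            squared_error += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical inside the mask
    return 10.0 * math.log10(max_val ** 2 / mse)


# The middle pixel differs strongly but is masked out (e.g. a moving car).
score = masked_psnr([0.5, 0.0, 1.0], [0.4, 1.0, 1.0], [1, 0, 1])
```

Masking keeps transient objects from penalizing a model that reproduces the persistent structure correctly.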

**Baselines:** Compared against recent video world models: Aether [63], DeepVerse [7], Yume1.5 [32], HY-World1.5 [20], FantasyWorld [8], and Lingbot [46].

**Key Results:**
*   **Quantitative Superiority:** As shown in Table 1, SWM outperforms the baselines on nearly every metric on both benchmarks. The Teacher-Forcing (TF) variant leads most columns, including FVD, rotation error, masked PSNR, and masked LPIPS, while the Self-Forcing (SF) variant attains the best Ann-Arbor FID (43.97). Together these indicate stronger visual quality, trajectory adherence, and structural fidelity to the real locations.

**Table 1: Quantitative comparison with other methods.** Values are reported as Busan-City-Bench / Ann-Arbor-City-Bench.
| Method | FID ↓ | FVD ↓ | Img.Q. ↑ | RotErr ↓ | TransErr ↓ | mPSNR ↑ | mLPIPS ↓ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Aether [63] | 141.24/132.77 | 1096.50/1214.84 | 0.55/0.51 | 0.030/0.078 | 0.083/0.192 | 11.10/13.03 | 0.671/0.635 |
| DeepVerse [7] | 130.32/182.95 | 892.63/1524.97 | 0.53/0.46 | 0.062/0.251 | 0.103/0.469 | 12.20/13.43 | 0.679/0.727 |
| Yume1.5 [32] | 54.82/85.62 | 425.24/993.62 | 0.73/0.61 | 0.153/0.326 | 0.104/0.271 | 12.09/14.15 | 0.667/0.623 |
| HY-World1.5 [20] | 49.63/67.02 | 544.04/864.76 | 0.78/0.54 | 0.044/0.193 | 0.079/0.221 | 11.87/14.26 | 0.588/0.575 |
| FantasyWorld [8] | 83.51/67.72 | 783.11/917.57 | 0.63/0.49 | 0.056/0.215 | 0.141/0.302 | 10.01/11.97 | 0.654/0.592 |
| Lingbot [46] | 62.14/57.99 | 717.44/1039.50 | 0.75/0.60 | 0.081/0.269 | 0.073/0.239 | 10.48/12.51 | 0.645/0.641 |
| **SWM (TF)** | **28.43**/56.61 | **301.76**/**640.17** | 0.78/**0.66** | **0.020**/**0.055** | **0.015**/0.154 | **14.56**/**15.18** | **0.392**/**0.481** |
| **SWM (SF)** | 32.50/**43.97** | 325.87/779.94 | **0.77**/0.57 | 0.028/0.217 | 0.033/0.208 | 13.52/14.20 | 0.478/0.573 |

*   **Qualitative Capabilities:** Figure 7 shows SWM can:
    1.  Generate diverse, text-prompted scenarios (e.g., tsunami, sunset) while preserving the underlying city layout.
    2.  Follow diverse camera trajectories (including pedestrian paths).
    3.  Generate stable, long-horizon videos over trajectories of hundreds of meters.
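*   **Ablation Studies:** Table 2 supports the contribution of each component:
    *   Removing **cross-temporal pairing** causes the largest degradation in visual quality and camera accuracy: FID rises from 28.43 to 44.74, FVD from 301.76 to 487.87, and translation error from 0.015 to 0.123.
    *   **Geometric** and **semantic referencing** are complementary; removing either harms results.
    *   The **Virtual Lookahead Sink** is more effective than static sink alternatives (first-frame or first-position): in Table 2 it reaches a lower FID (28.43) than a first-frame sink (32.71) or no sink (33.06), and it maintains lower FID over long sequences (Fig. 10).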

**Table 2: Ablation study on Busan-City-Bench.**
| Variant | FID ↓ | FVD ↓ | Img.Q. ↑ | RotErr ↓ | TransErr ↓ | mPSNR ↑ | mLPIPS ↓ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Full model** | **28.43** | **301.76** | 0.78 | **0.020** | **0.015** | **14.56** | **0.392** |
| w/o cross-temporal pairing | 44.74 | 487.87 | 0.77 | 0.057 | 0.123 | 12.54 | 0.519 |
| w/o synthetic data | 27.74 | 365.24 | 0.78 | 0.021 | 0.020 | 13.52 | 0.427 |
| w/o geometric referencing | 33.01 | 398.74 | **0.79** | 0.036 | 0.051 | 12.33 | 0.525 |
| w/o semantic referencing | 30.27 | 326.18 | 0.78 | 0.032 | 0.022 | 14.08 | 0.442 |
| w/o any attention sink | 33.06 | 342.81 | 0.78 | 0.021 | 0.016 | 14.16 | 0.406 |
| w/ first frame attention sink | 32.71 | 378.92 | 0.78 | 0.018 | 0.018 | 14.25 | 0.388 |
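
To make the size of each ablation effect easier to compare, the FID and FVD columns of Table 2 can be converted to relative changes against the full model. The numbers below are copied from the table; the helper name `relative_change_pct` is illustrative.

```python
full = {"FID": 28.43, "FVD": 301.76}

ablations = {
    "w/o cross-temporal pairing": {"FID": 44.74, "FVD": 487.87},
    "w/o synthetic data": {"FID": 27.74, "FVD": 365.24},
    "w/o geometric referencing": {"FID": 33.01, "FVD": 398.74},
    "w/o semantic referencing": {"FID": 30.27, "FVD": 326.18},
    "w/o any attention sink": {"FID": 33.06, "FVD": 342.81},
    "w/ first frame attention sink": {"FID": 32.71, "FVD": 378.92},
}


def relative_change_pct(baseline: float, value: float) -> float:
    """Percentage change relative to the full model (positive means worse for FID/FVD)."""
    return (value - baseline) / baseline * 100.0


changes = {
    name: {metric: relative_change_pct(full[metric], value) for metric, value in row.items()}
    for name, row in ablations.items()
}
worst_fid = max(changes, key=lambda name: changes[name]["FID"])

for name, row in changes.items():
    print(f"{name}: FID {row['FID']:+.1f}%, FVD {row['FVD']:+.1f}%")
```

On these figures, dropping cross-temporal pairing raises both FID and FVD by more than half, while dropping synthetic data slightly lowers FID but raises FVD.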

## Theoretical and Practical Implications
**Theoretical Implications:**
*   Demonstrates that **retrieval-augmentation** is a powerful paradigm for grounding generative models in external, real-world knowledge bases (in this case, a geographic database).
*   Shows that **cross-temporal pairing** is an effective, simple technique for teaching a model to distinguish persistent scene geometry from transient dynamics when using multi-temporal references.
*   Proposes a novel **dynamic anchoring mechanism** (Virtual Lookahead Sink) for long-horizon autoregressive generation, which may be applicable to other sequential generation tasks.

**Practical Implications:**
*   **Urban Planning & Visualization:** Allows stakeholders to visualize proposed changes (new buildings, parks, traffic patterns) within the authentic context of an existing city.
*   **Autonomous Driving Simulation:** Enables the generation of vast, realistic, and variable driving scenarios grounded in real road networks and scenery, for training and testing perception systems.
*   **Location-Based Entertainment & Exploration:** Users can "visit" or create interactive stories in realistic digital twins of real cities.
*   **Cross-City Generalization:** The model's ability to perform well on unseen cities suggests the approach could scale to ground world models in many real-world locations globally.

## Conclusion
The Seoul World Model (SWM) represents a significant step towards **grounding generative world simulation in physically existing environments**. By integrating retrieval-augmented conditioning, cross-temporal training, synthetic data augmentation, and a novel Virtual Lookahead Sink, SWM successfully generates spatially faithful, temporally coherent, and long-horizon videos of real cities. It outperforms existing world models not designed for this task and demonstrates effective cross-city generalization. This work opens a new research direction, encouraging the development of world models that interact with and simulate the real world, with broad potential applications.
