Summary of "Grounding World Simulation Models in a Real-World Metropolis"
Summary (Overview)
- Core Contribution: Introduces Seoul World Model (SWM), the first city-scale video world model that generates videos grounded in the actual geometry and appearance of a real city (Seoul), moving beyond purely imagined environments.
- Key Method: Uses retrieval-augmented generation, conditioning an autoregressive video diffusion model on nearby street-view images to anchor the output to real-world locations.
- Main Innovations:
  - Cross-temporal pairing during training to disentangle persistent scene structure from transient objects (e.g., cars, pedestrians).
  - A Virtual Lookahead Sink that dynamically retrieves a future street-view image to stabilize long-horizon generation and combat error accumulation.
  - Complementary geometric and semantic referencing pathways to provide both spatial layout and appearance detail from retrieved images.
- Key Results: SWM outperforms existing video world models in visual quality, camera motion adherence, and 3D structural fidelity on benchmarks from unseen cities (Busan, Ann Arbor), demonstrating successful cross-city generalization.
- Significance: Opens a new direction for world simulation grounded in physically existing environments, with applications in urban planning, autonomous driving simulation, and location-based exploration.
Introduction and Theoretical Foundation
Traditional video world simulation models generate dynamic, interactive environments but operate entirely within imagined worlds. This paper poses a novel question: What if a world model could simulate a city that actually exists? The authors formalize this goal as real-world grounded video world simulation.
The core idea is to leverage the vast availability of geotagged street-view imagery as a scalable source of location-specific visual references. By conditioning a generative world model on these references, the model can be anchored to the real geometric layout and appearance of a specific location. This enables applications like navigating familiar streets, visualizing urban planning scenarios, or generating hypothetical events (e.g., "a massive wave") in a real city context.
The key challenge is that naively using street-view images for conditioning introduces problems: temporal misalignment (references show a different moment in time), limited/sparse trajectory data, and long-horizon error accumulation in autoregressive generation. SWM is designed to address these specific challenges.
Methodology
SWM is built upon a pretrained autoregressive video Diffusion Transformer (DiT) [1, 33]. It generates video chunk-by-chunk, conditioned on a text prompt, a target camera trajectory, and self-generated history latents.
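The chunk-by-chunk autoregressive loop can be sketched as below. All names (`generate_video`, `model.denoise`, `chunk_len`) are illustrative stand-ins, not the paper's actual API; the point is only that each chunk's output latents are fed back as history conditioning for the next chunk.

```python
def generate_video(model, prompt, trajectory, chunk_len=16):
    """Generate a long video one chunk at a time, feeding each
    chunk's latents back in as history for the next chunk.
    A hypothetical sketch of the autoregressive scheme."""
    history = []   # self-generated history latents
    chunks = []
    for start in range(0, len(trajectory), chunk_len):
        cams = trajectory[start:start + chunk_len]   # target camera poses
        latents = model.denoise(
            prompt=prompt,
            cameras=cams,
            history=history,
        )
        history = history + [latents]   # condition future chunks on the past
        chunks.append(latents)
    return chunks
```

Because each chunk conditions on self-generated history rather than ground truth, errors can compound over long horizons, which is the failure mode the Virtual Lookahead Sink later addresses.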
1. Data Construction
- Real Street-View Dataset: 440K images from Seoul. A key innovation is cross-temporal pairing: target video sequences and their conditioning reference images are captured at different timestamps. This forces the model to learn persistent scene structure, ignoring transient objects that differ between reference and target.
- Synthetic Dataset: Created using CARLA [10] simulator to provide diverse camera trajectories (pedestrian, vehicle, free-camera) not well-covered by street-view data, improving model robustness.
- View Interpolation Pipeline: Street-view images are spatially sparse. An intermittent freeze-frame strategy is proposed to synthesize coherent training videos from sparse keyframes, ensuring compatibility with the temporal compression of the 3D VAE.
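The two data-construction ideas above can be sketched in a few lines. Both functions are simplified illustrations under assumptions (distinct capture timestamps per location; a fixed `hold` count), not the paper's pipeline.

```python
import random


def cross_temporal_pair(captures_by_location, loc, rng=random):
    """Cross-temporal pairing (sketch): sample a target and a reference
    capture of the same place taken at different timestamps, so only the
    persistent scene structure is shared between them.
    Assumes captures at a location have distinct timestamps."""
    captures = captures_by_location[loc]        # [(timestamp, image), ...]
    target, ref = rng.sample(captures, 2)       # two distinct captures
    return target, ref


def freeze_frame_video(keyframes, hold=4):
    """Freeze-frame interpolation (one plausible reading): hold each
    sparse keyframe for `hold` timesteps so the clip length matches the
    3D VAE's temporal compression stride. `hold=4` is an assumed value."""
    video = []
    for frame in keyframes:
        video.extend([frame] * hold)
    return video
```

In training, the reference from `cross_temporal_pair` conditions the model while the target supervises it; any transient content (parked cars, pedestrians) differs between the two, so copying it from the reference is penalized.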
2. Street-View Retrieval
For a target chunk, nearby street-view panoramas are retrieved from a geo-indexed database. They are rendered into pinhole views aligned with the target camera's viewing direction. Each retrieved reference comes with an estimated camera pose and depth map.
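A minimal version of the geo-indexed lookup is nearest-neighbor search over panorama coordinates, e.g. with the haversine distance. This is a generic sketch, not the paper's retrieval system; a production database would use a spatial index (e.g., an R-tree) rather than the linear scan shown here.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def retrieve_nearest(db, lat, lon, k=2):
    """Return the k street-view entries closest to (lat, lon).
    `db` is a list of (lat, lon, panorama_id) tuples."""
    return sorted(db, key=lambda e: haversine_m(lat, lon, e[0], e[1]))[:k]
```

The retrieved panoramas would then be rendered into pinhole views matching the target camera's orientation before conditioning.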
3. Virtual Lookahead Sink
To combat error accumulation in long-horizon generation (hundreds of meters), SWM introduces a Virtual Lookahead (VL) Sink. Instead of using a static first frame as an attention sink [28, 40], it dynamically retrieves the street-view image nearest to the endpoint of the current chunk's target trajectory. This retrieved image is encoded as a latent and placed at a future temporal position within the input sequence.
This acts as a "virtual destination," providing a stable, error-free anchor relevant to the upcoming location.
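The sink construction reduces to a few steps: take the endpoint pose of the chunk's target trajectory, retrieve the nearest street view, encode it, and append it at a future temporal slot. The sketch below is illustrative only; `retrieve`, `encode`, and the sequence layout are hypothetical stand-ins for the paper's components.

```python
def build_chunk_inputs(history_latents, chunk_cams, db, encode, retrieve):
    """Assemble per-chunk conditioning with a Virtual Lookahead sink:
    retrieve the street view nearest the chunk's endpoint, encode it,
    and place it at a future temporal position so attention can anchor
    on a clean, location-relevant reference instead of a possibly
    degraded first frame. All names here are illustrative."""
    end_pose = chunk_cams[-1]            # endpoint of the target trajectory
    sink_img = retrieve(db, end_pose)    # street view nearest to the endpoint
    sink_latent = encode(sink_img)       # error-free latent (not self-generated)
    # history latents first, then the lookahead sink at a future slot
    return history_latents + [("sink", sink_latent)]
```

Unlike a static first-frame sink, the anchor here is re-retrieved per chunk, so it stays relevant as the camera travels hundreds of meters from the starting point.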
4. Geometric and Semantic Referencing
Retrieved images are used for two complementary conditioning pathways:
- Geometric Referencing: The nearest reference image is warped into each target frame's viewpoint via depth-based forward splatting to provide explicit spatial layout cues.
- Semantic Referencing: The original reference image latents are injected into the transformer's sequence, allowing the model to attend to them for appearance details. Cross-temporal pairing encourages attention to persistent structures (see Fig. 6).
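The geometric pathway can be illustrated with a toy forward-splatting routine: unproject each reference pixel using its depth, transform the 3D point into the target camera, reproject, and scatter colors into the target view. This is a minimal sketch (nearest-pixel scattering, no z-buffering or hole filling), not the paper's implementation; `K` is the shared pinhole intrinsics and `T_ref_to_tgt` the assumed 4x4 relative pose.

```python
import numpy as np


def forward_splat(ref_img, ref_depth, K, T_ref_to_tgt):
    """Depth-based forward splatting (toy version): warp a reference
    image into the target viewpoint via its depth map. Pixels that
    project outside the target frame, or behind the camera, are dropped;
    unfilled target pixels stay zero."""
    h, w = ref_depth.shape
    out = np.zeros_like(ref_img)
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(float)
    # unproject to 3D points in the reference camera frame
    pts = (np.linalg.inv(K) @ pix.T) * ref_depth.reshape(1, -1)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    # transform into the target camera frame and reproject
    tgt = K @ (T_ref_to_tgt @ pts_h)[:3]
    u = np.round(tgt[0] / tgt[2]).astype(int)
    v = np.round(tgt[1] / tgt[2]).astype(int)
    ok = (tgt[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out[v[ok], u[ok]] = ref_img.reshape(-1, ref_img.shape[-1])[ok]
    return out
```

The splatted image supplies explicit layout cues, while the semantic pathway lets attention pull appearance detail from the unwarped reference latents; the two are complementary, as the ablations below confirm.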
Empirical Validation / Results
Evaluation Setup: SWM is evaluated on two benchmarks from cities unseen during training: Busan-City-Bench and Ann-Arbor-City-Bench (from MARS [26]). Each contains 30 test sequences (~100m each). Metrics assess visual quality (FID, FVD, Image Quality), camera-following accuracy (Rotation Error, Translation Error), and 3D adherence to static scene regions (masked PSNR, masked LPIPS).
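The masked-PSNR idea, scoring only static scene regions so transient objects do not penalize structural fidelity, can be written down directly. This is the common PSNR formulation restricted by a mask; the benchmark's exact masking procedure may differ.

```python
import numpy as np


def masked_psnr(pred, gt, mask, max_val=1.0):
    """PSNR computed only over pixels where mask is True (static scene
    regions), ignoring transient content such as vehicles/pedestrians.
    Assumes the mask selects at least one pixel and images share range
    [0, max_val]."""
    err = (pred - gt)[mask]          # residuals on static pixels only
    mse = np.mean(err ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Masked LPIPS follows the same principle with a learned perceptual distance in place of mean squared error.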
Baselines: Compared against recent video world models: Aether [63], DeepVerse [7], Yume1.5 [32], HY-World1.5 [20], FantasyWorld [8], and Lingbot [46].
Key Results:
- Quantitative Superiority: As shown in Table 1, SWM (both Teacher-Forcing/TF and Self-Forcing/SF variants) outperforms all baselines across almost all metrics on both benchmarks, demonstrating superior visual quality, trajectory adherence, and structural fidelity to the real locations.
Table 1: Quantitative comparison with other methods. Values are reported as Busan-City-Bench / Ann-Arbor-City-Bench.
| Method | FID ↓ | FVD ↓ | Img.Q. ↑ | RotErr ↓ | TransErr ↓ | mPSNR ↑ | mLPIPS ↓ |
|---|---|---|---|---|---|---|---|
| Aether [63] | 141.24/132.77 | 1096.50/1214.84 | 0.55/0.51 | 0.030/0.078 | 0.083/0.192 | 11.10/13.03 | 0.671/0.635 |
| DeepVerse [7] | 130.32/182.95 | 892.63/1524.97 | 0.53/0.46 | 0.062/0.251 | 0.103/0.469 | 12.20/13.43 | 0.679/0.727 |
| Yume1.5 [32] | 54.82/85.62 | 425.24/993.62 | 0.73/0.61 | 0.153/0.326 | 0.104/0.271 | 12.09/14.15 | 0.667/0.623 |
| HY-World1.5 [20] | 49.63/67.02 | 544.04/864.76 | 0.78/0.54 | 0.044/0.193 | 0.079/0.221 | 11.87/14.26 | 0.588/0.575 |
| FantasyWorld [8] | 83.51/67.72 | 783.11/917.57 | 0.63/0.49 | 0.056/0.215 | 0.141/0.302 | 10.01/11.97 | 0.654/0.592 |
| Lingbot [46] | 62.14/57.99 | 717.44/1039.50 | 0.75/0.60 | 0.081/0.269 | 0.073/0.239 | 10.48/12.51 | 0.645/0.641 |
| SWM (TF) | 28.43/56.61 | 301.76/640.17 | 0.78/0.66 | 0.020/0.055 | 0.015/0.154 | 14.56/15.18 | 0.392/0.481 |
| SWM (SF) | 32.50/43.97 | 325.87/779.94 | 0.77/0.57 | 0.028/0.217 | 0.033/0.208 | 13.52/14.20 | 0.478/0.573 |
- Qualitative Capabilities: Figure 7 shows SWM can:
  - Generate diverse, text-prompted scenarios (e.g., tsunami, sunset) while preserving the underlying city layout.
  - Follow diverse camera trajectories (including pedestrian paths).
  - Generate stable, long-horizon videos over several kilometers.
- Ablation Studies: Table 2 confirms the importance of each component:
  - Removing cross-temporal pairing causes the largest degradation.
  - Geometric and semantic referencing are complementary; removing either harms results.
  - The Virtual Lookahead Sink is more effective than static sink alternatives (first-frame or first-position), maintaining lower FID over long sequences (Fig. 10).
Table 2: Ablation study on Busan-City-Bench.
| Variant | FID ↓ | FVD ↓ | Img.Q. ↑ | RotErr ↓ | TransErr ↓ | mPSNR ↑ | mLPIPS ↓ |
|---|---|---|---|---|---|---|---|
| Full model | 28.43 | 301.76 | 0.78 | 0.020 | 0.015 | 14.56 | 0.392 |
| w/o cross-temporal pairing | 44.74 | 487.87 | 0.77 | 0.057 | 0.123 | 12.54 | 0.519 |
| w/o synthetic data | 27.74 | 365.24 | 0.78 | 0.021 | 0.020 | 13.52 | 0.427 |
| w/o geometric referencing | 33.01 | 398.74 | 0.79 | 0.036 | 0.051 | 12.33 | 0.525 |
| w/o semantic referencing | 30.27 | 326.18 | 0.78 | 0.032 | 0.022 | 14.08 | 0.442 |
| w/o any attention sink | 33.06 | 342.81 | 0.78 | 0.021 | 0.016 | 14.16 | 0.406 |
| w/ first frame attention sink | 32.71 | 378.92 | 0.78 | 0.018 | 0.018 | 14.25 | 0.388 |
Theoretical and Practical Implications
Theoretical Implications:
- Demonstrates that retrieval-augmentation is a powerful paradigm for grounding generative models in external, real-world knowledge bases (in this case, a geographic database).
- Shows that cross-temporal pairing is an effective, simple technique for teaching a model to distinguish persistent scene geometry from transient dynamics when using multi-temporal references.
- Proposes a novel dynamic anchoring mechanism (Virtual Lookahead Sink) for long-horizon autoregressive generation, which may be applicable to other sequential generation tasks.
Practical Implications:
- Urban Planning & Visualization: Allows stakeholders to visualize proposed changes (new buildings, parks, traffic patterns) within the authentic context of an existing city.
- Autonomous Driving Simulation: Enables the generation of vast, realistic, and variable driving scenarios grounded in real road networks and scenery, for training and testing perception systems.
- Location-Based Entertainment & Exploration: Users can "visit" or create interactive stories in realistic digital twins of real cities.
- Cross-City Generalization: The model's ability to perform well on unseen cities suggests the approach could scale to ground world models in many real-world locations globally.
Conclusion
The Seoul World Model (SWM) represents a significant step towards grounding generative world simulation in physically existing environments. By integrating retrieval-augmented conditioning, cross-temporal training, synthetic data augmentation, and a novel Virtual Lookahead Sink, SWM successfully generates spatially faithful, temporally coherent, and long-horizon videos of real cities. It outperforms existing world models not designed for this task and demonstrates effective cross-city generalization. This work opens a new research direction, encouraging the development of world models that interact with and simulate the real world, with broad potential applications.