Visual Summary | Latent Spatial Memory for Video World Models

Summary (Overview)

Introduces latent spatial memory — a persistent 3D cache that stores scene information directly in the diffusion latent space (VAE latent tokens) instead of RGB point clouds, eliminating the costly pixel-space round trip.
Presents Mirage, a video world model built around this memory, with depth-guided back-projection for construction, latent-resolution occlusion-aware readout, and autoregressive chunk-wise update with dynamic object filtering.
Achieves the highest average score on WorldScore (70.36) and competitive novel-view synthesis on RealEstate10K (18.38 PSNR, 0.779 SSIM), with the best closed-loop PSNR and SSIM among the compared baselines.
Delivers 10.57× faster end-to-end generation and 55× lower GPU memory usage than RGB point-cloud baselines, because the per-step conditioning no longer performs rasterization and VAE re-encoding.
Ablation studies confirm that operating in latent space outperforms explicit RGB caches, feature upsampling at pixel resolution, and single-stage training; the method is robust to depth-estimator choice.

Introduction and Theoretical Foundation

The paper addresses the challenge of 3D spatial consistency in video world models — large-scale video diffusion models that generate plausible future frames conditioned on camera trajectories. Without explicit memory, even powerful generators accumulate geometric drift, producing frames that are individually realistic but collectively inconsistent in a shared world coordinate system.

Prior work attaches a persistent RGB point cloud $M_{rgb}=\{(p_i, c_i)\}$ (Eq. 1), constructed by lifting frames into 3D using depth. To condition the generator, it renders target-view images from this cloud and re-encodes them into latents:

$$\hat{z}_t = \mathcal{E}(\text{Rasterise}(M_{rgb}; E_t, K_t)) \quad \text{(Eq. 2)}$$

This pixel-space round trip introduces two fundamental bottlenecks:

Computational: Rendering millions of colored points at full resolution and re-encoding them through the VAE dominates wall-clock time and grows with cache size.
Representational: Re-encoding rendered images through the VAE degrades the signal through reconstruction error, rasterization artifacts, visibility holes, and distribution mismatch, and it discards the rich latent features the model already produced.

Latent spatial memory avoids both problems by storing the diffusion model’s own latent tokens at world-space locations, and reading them back through a single latent-resolution projection — no pixel-space detour.

Methodology

Mirage maintains a persistent cache $M = \{(p_i, f_i)\}, p_i \in \mathbb{R}^3, f_i \in \mathbb{R}^C$ (Eq. 3). The pipeline consists of three steps repeated over overlapping chunks of latent frames.

Initialization

Given the initial frame $I_0$, Mirage encodes it to a latent $z \in \mathbb{R}^{C \times h \times w}$, downsamples the depth map $D$ to latent resolution, and back-projects each latent cell $(u,v)$ to world space:

$$p_{uv} = \pi^{-1}(u, v, D(u,v); K, E), \quad F_{uv} = z[:, v, u] \quad \text{(Eq. 4)}$$

One memory element per latent cell seeds the cache.
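The back-projection in Eq. 4 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera whose intrinsics `K` have already been rescaled to the latent grid, a world-to-camera extrinsic matrix `E`, z-depth, and a half-cell offset for cell centres. The function name `back_project_latent` is hypothetical.

```python
import numpy as np

def back_project_latent(z, depth, K, E):
    """Lift each latent cell to a world-space point paired with its feature.

    z:     (C, h, w) latent produced by the VAE encoder
    depth: (h, w) z-depth, already downsampled to latent resolution
    K:     (3, 3) pinhole intrinsics rescaled to the latent grid (assumption)
    E:     (4, 4) world-to-camera extrinsics (assumed convention)
    Returns points of shape (h*w, 3) and features of shape (h*w, C).
    """
    C, h, w = z.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Homogeneous coordinates of cell centres (the +0.5 offset is an assumption).
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=float)], axis=-1)
    pix = pix.reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T            # camera-space rays with z = 1
    cam = rays * depth.reshape(-1, 1)          # scale by depth -> camera-space points
    cam_h = np.concatenate([cam, np.ones((cam.shape[0], 1))], axis=1)
    world = (cam_h @ np.linalg.inv(E).T)[:, :3]  # camera -> world
    feats = z.reshape(C, -1).T                 # row k holds z[:, v, u] with k = v*w + u
    return world, feats
```

Each of the `h*w` rows returned is one memory element $(p_{uv}, F_{uv})$, so the initial cache holds exactly one entry per latent cell.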

Latent-Space Memory Readout

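For a target view $(E_t, K_t)$, all memory points are projected onto the target camera grid at latent resolution, and the frontmost point in each cell is retained via z-buffering, where $\Omega_t(u,v)$ is the set of memory points that project into cell $(u,v)$: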

$$i_t(u,v) = \arg\min_{i \in \Omega_t(u,v)} [E_t p_i]_z, \quad \hat{z}_t(u,v) = F_{i_t(u,v)} \quad \text{(Eq. 5)}$$

Unseen cells are zero-filled; a binary visibility mask $m_t$ is produced. Readouts $\hat{z}_t$ and $m_t$ are concatenated and fed through a ControlNet-style side branch into the video diffusion backbone — no bridging encoder needed.
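The readout in Eq. 5 can be sketched in NumPy as below. It is an illustration under the same assumptions as before (pinhole `K` at latent resolution, world-to-camera `E`), and the function name `read_memory` is hypothetical. The z-buffer is implemented by sorting points by cell and then by depth, and keeping the first point of each cell.

```python
import numpy as np

def read_memory(points, feats, K, E, h, w):
    """Project memory onto an (h, w) latent grid, keeping the nearest point per cell.

    points: (N, 3) world-space positions, feats: (N, C) latent features.
    Returns z_hat of shape (C, h, w) and a boolean visibility mask of shape (h, w).
    """
    C = feats.shape[1]
    z_hat = np.zeros((C, h, w))                # unseen cells stay zero-filled
    mask = np.zeros((h, w), dtype=bool)

    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (pts_h @ E.T)[:, :3]                 # world -> camera
    idx = np.nonzero(cam[:, 2] > 1e-6)[0]      # drop points behind the camera

    proj = cam[idx] @ K.T
    u = np.floor(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.floor(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx, u, v = idx[inside], u[inside], v[inside]
    depth = cam[idx, 2]

    # Sort by cell, then by depth; the first entry of each cell is the frontmost point.
    cell = v * w + u
    order = np.lexsort((depth, cell))
    first = np.ones(len(order), dtype=bool)
    first[1:] = cell[order][1:] != cell[order][:-1]
    winners = order[first]

    z_hat[:, v[winners], u[winners]] = feats[idx[winners]].T
    mask[v[winners], u[winners]] = True
    return z_hat, mask
```

The whole operation touches only `h*w` output cells and never decodes to pixels, which is the source of the efficiency argument made for latent-resolution readout.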

Autoregressive 3D Cache Update

After a chunk is denoised, its frames are decoded, depth and camera are re-estimated, the frames are re-encoded to clean latents $\tilde{z}_t = \mathcal{E}(I_t)$, and these latents are back-projected into the cache:

$$M \leftarrow M \cup \{(p_{uv}, F_{uv})\}_{(u,v) \in \Lambda_t} \quad \text{(Eq. 6)}$$

Only cells outside dynamic objects and sky (the set $\Lambda_t$, detected by an open-vocabulary entity extractor and a video segmenter) are added to the cache. The previous chunk's latents are carried forward as short-term context.
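The filtered update in Eq. 6 can be sketched as follows. This is illustrative only: the helper names `static_cells` and `update_cache` are hypothetical, the dynamic/sky mask is assumed to arrive at pixel resolution from an external segmenter, and the rule "a latent cell is static only if none of its pixels is flagged" is a conservative assumption rather than something the summary specifies.

```python
import numpy as np

def static_cells(dynamic_pixels, patch=16):
    """Reduce a pixel-level dynamic/sky mask (H, W) to a latent-level static mask.

    A latent cell counts as static only if no pixel in its patch is flagged
    (conservative assumption). patch=16 matches the 16x16 spatial compression.
    """
    H, W = dynamic_pixels.shape
    h, w = H // patch, W // patch
    blocks = dynamic_pixels[: h * patch, : w * patch].reshape(h, patch, w, patch)
    return ~blocks.any(axis=(1, 3))

def update_cache(cache_pts, cache_feats, new_pts, new_feats, static_mask):
    """Append only the static cells (the set Lambda_t) to the cache.

    new_pts and new_feats have one row per latent cell in row-major order,
    matching static_mask.reshape(-1).
    """
    keep = static_mask.reshape(-1)
    pts = np.concatenate([cache_pts, new_pts[keep]], axis=0)
    feats = np.concatenate([cache_feats, new_feats[keep]], axis=0)
    return pts, feats
```

In this sketch the cache grows monotonically; any merging or pruning of redundant points would be an additional design choice not described here.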

Efficient Adaptation

Mirage is obtained by two-stage fine-tuning of a pretrained camera-controllable video diffusion transformer (Wan2.2-TI2V-5B, VAE compression $4 \times 16 \times 16$, latent channels $C=48$):

Stage 1: Freeze backbone and VAE, train only the ControlNet side branch.
Stage 2: Attach rank-64 LoRA adapters to self-attention projections and jointly train with side branch. Both stages use the flow-matching objective on target frames.
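To make the Stage 2 adapter concrete, the sketch below shows a generic LoRA-augmented linear projection in NumPy. Only the rank of 64 comes from the summary; the class name `LoRALinear`, the scaling factor `alpha`, and the initialization (small random `A`, zero `B`) are common conventions assumed here for illustration, not details confirmed by the source.

```python
import numpy as np

class LoRALinear:
    """Frozen linear projection plus a trainable low-rank update (generic LoRA)."""

    def __init__(self, weight, rank=64, alpha=64.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                              # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))    # trainable down-projection
        self.B = np.zeros((out_dim, rank))                # trainable up-projection, zero init
        self.scale = alpha / rank                         # assumed scaling convention

    def __call__(self, x):
        # Equivalent to x @ (W + scale * B @ A).T without forming the full matrix.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained projection exactly, so Stage 2 begins from the model produced by Stage 1.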

Empirical Validation / Results

Datasets, Baselines, and Metrics

Training: RealEstate10K videos with depth and camera poses from a feed-forward reconstructor, with dynamic regions removed.
WorldScore: 10 metrics covering controllability, consistency, quality, and motion. Compared against RGB point-cloud scene generators (WonderJourney, InvisibleStitch, WonderWorld, Voyager, FlashWorld, LucidDreamer, Spatia) and general video generators (VideoCrafter2, EasyAnimate, Allegro, CogVideoX-I2V, Vchitect-2.0, LTX-Video, Wan2.1).
RealEstate10K: Novel-view synthesis (PSNR, SSIM, LPIPS) and closed-loop metrics (PSNR$_C$, SSIM$_C$, LPIPS$_C$) following Spatia. Compared against SEVA, VMem, ViewCrafter, FlexWorld, Voyager, and Spatia.
Efficiency: Wall-clock time and peak GPU memory for one cache read versus rollout length on a single NVIDIA H100.

Main Results

Table 1: WorldScore Results

Method	Average Score	Static Score	Dynamic Score	3D Cons	Photo Cons
Mirage (Ours)	70.36	73.60	67.11	92.21	93.95
Spatia	69.73	72.63	66.82	86.40	89.10
Voyager	66.08	77.62	54.53	81.56	85.99
CogVideoX-I2V	60.64	62.15	59.12	86.21	88.12
Wan2.1	55.21	57.56	52.85	78.74	78.36

Mirage achieves the highest Average Score and Dynamic Score and leads on both 3D consistency and photometric consistency; Voyager scores higher on the static partition (77.62 vs. 73.60). Figure 4 shows qualitative comparisons on out-of-domain prompts where Mirage maintains 3D coherence while baselines exhibit drift.

Table 2: RealEstate10K Results

Method	PSNR ↑	SSIM ↑	LPIPS ↓	PSNR$_C$ ↑	SSIM$_C$ ↑	LPIPS$_C$ ↓
Mirage (Ours)	18.38	0.779	0.250	20.05	0.825	0.228
Spatia	18.58	0.646	0.254	19.38	0.579	0.213
Voyager	17.79	0.636	0.297	17.66	0.540	0.380
VMem	14.62	0.522	0.426	-	-	-

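In the closed-loop (return-trajectory) setting, Mirage achieves the best PSNR$_C$ and SSIM$_C$, demonstrating strong long-horizon stability; Spatia retains a slightly better LPIPS$_C$ (0.213 vs. 0.228) and open-loop PSNR (18.58 vs. 18.38).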

Efficiency Scaling

From Figure 5 (reported as linear-scale bar chart):

After initial chunk, Mirage: 0.25 s/frame, cache footprint ~2.25 MiB after 5 chunks.
Spatia (RGB cache): ~7.25 s/frame, ~78.8 MiB.
Gen3C: ~7.63 s/frame, ~124 MiB.
VMem: ~4.07 s/frame (grows with stored views), ~23.4 MiB.

End-to-end speedup: 10.57×; memory reduction: 55× over RGB pipelines. Gap widens with longer rollouts.
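As a quick sanity check, the per-frame and cache figures listed above can be turned into ratios. The snippet below only does arithmetic on the numbers in this section; the function name `ratios_vs_mirage` is hypothetical.

```python
# Per-frame time (s) and cache footprint (MiB) as listed above.
MIRAGE = (0.25, 2.25)
BASELINES = {
    "Spatia": (7.25, 78.8),
    "Gen3C": (7.63, 124.0),
    "VMem": (4.07, 23.4),
}

def ratios_vs_mirage():
    """Return {method: (time ratio, cache ratio)} relative to Mirage."""
    t0, m0 = MIRAGE
    return {name: (t / t0, m / m0) for name, (t, m) in BASELINES.items()}

for name, (t_ratio, m_ratio) in ratios_vs_mirage().items():
    print(f"{name}: {t_ratio:.1f}x time per frame, {m_ratio:.1f}x cache size")
```

On these numbers the 55× memory figure coincides with the Gen3C cache ratio (124 / 2.25 ≈ 55.1), while the Spatia ratio is about 35×. The per-frame time ratios (roughly 16× to 31×) are larger than the reported 10.57× end-to-end speedup; one plausible reading, which is an inference rather than something stated here, is that the end-to-end figure also includes costs outside the cache read.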

Ablation Studies

Table 3: Ablation on WorldScore Split

Variant	Avg ↑	Static ↑	Dynamic ↑	3D Cons ↑	Photo Cons ↑
Mirage (full)	70.36	73.60	67.11	92.21	93.95
Explicit RGB Point Cloud	67.71	70.49	64.93	90.75	91.10
Feature Upsample, Pixel Lift	60.85	62.41	59.28	84.90	79.81
No Dynamic Object Filter	61.20	62.69	59.70	80.88	76.10
Single Stage Training	63.18	65.15	61.20	87.11	84.47

Table 4: Depth Source Sensitivity

Depth Source	Avg ↑	Static ↑	Dynamic ↑	3D Cons ↑	Photo Cons ↑
DepthAnything 3 (default)	70.36	73.60	67.11	92.21	93.95
MapAnything	69.66	72.78	66.53	91.89	93.32
UniDepth	69.13	72.15	66.10	91.63	92.79

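The ablations support each design choice. Replacing latent memory with an explicit RGB point cloud lowers the average from 70.36 to 67.71, and lifting upsampled features at pixel resolution lowers it to 60.85. Removing the dynamic-object filter lowers it to 61.20, with the largest drops in 3D and photometric consistency. Single-stage training reaches only 63.18, consistent with two-stage training stabilizing convergence. Swapping the depth estimator changes the average by about 1.2 points at most (70.36 vs. 69.13), indicating robustness to the depth source.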

Theoretical and Practical Implications

Computational efficiency: By eliminating pixel-space rendering and VAE re-encoding from the per-step critical path, latent spatial memory makes world-consistent video generation practical for long trajectories under limited GPU budgets.
Representational fidelity: Storing latent tokens preserves the model’s native conditioning features, avoiding information loss from VAE reconstruction, rasterization artifacts, and distribution mismatch — enabling better 3D and photometric consistency.
Generality: The approach is presented as backbone-agnostic, although it is demonstrated only with Wan2.2, and it is robust to the choice of depth estimator, suggesting applicability to other latent video diffusion models that require a persistent 3D scene representation.
Limitation: Dynamic scene content (moving objects, sky) is excluded from persistent memory because its geometry is unreliable. Scenes dominated by pervasive motion therefore benefit less from the cache; persisting dynamic content across chunks is identified as future work.

Conclusion

Latent spatial memory stores video diffusion model latent features at world-space points, avoiding the pixel-space round trip of RGB point-cloud caches. Mirage, built around this representation, operates entirely within the VAE latent manifold: it constructs the cache via depth-guided back-projection, reads it via latent-resolution z-buffered projection, and updates it autoregressively while excluding dynamic content. Mirage achieves the highest average score on WorldScore and competitive results on RealEstate10K while generating videos 10.57× faster and using 55× less GPU memory than RGB-cache baselines. The key limitation is the exclusion of dynamic actors; future work may explore persisting dynamic content across chunks to handle scenes with pervasive motion.