# Latent Spatial Memory for Video World Models

> Latent spatial memory stores features in VAE latent space, achieving state-of-the-art video consistency with a 10.57× speedup and 55× less GPU memory.

- **Source:** [arXiv](https://arxiv.org/abs/2606.09828)
- **Published:** 2026-06-10
- **Permalink:** https://picx.dev/p/mIBc4g
- **Whiteboard:** https://picx.dev/p/mIBc4g/image

## Summary (Overview)

* Introduces **latent spatial memory** — a persistent 3D cache that stores scene information **directly in the diffusion latent space** (VAE latent tokens) instead of RGB point clouds, eliminating the costly pixel-space round trip.
* Presents **Mirage**, a video world model built around this memory, with depth-guided back-projection for construction, latent-resolution occlusion-aware readout, and autoregressive chunk-wise update with dynamic object filtering.
* Achieves a **state-of-the-art** average score on WorldScore (70.36) and competitive novel-view synthesis on RealEstate10K (18.38 PSNR, 0.779 SSIM), with the best closed-loop PSNR and SSIM among the compared baselines.
* Delivers **10.57× faster** end-to-end generation and **55× lower** GPU memory usage than RGB point-cloud baselines, because the per-step conditioning no longer performs rasterization and VAE re-encoding.
* Ablation studies confirm that operating in latent space outperforms explicit RGB caches, feature upsampling at pixel resolution, and single-stage training; the method is robust to depth-estimator choice.

## Introduction and Theoretical Foundation

The paper addresses the challenge of **3D spatial consistency** in video world models — large-scale video diffusion models that generate plausible future frames conditioned on camera trajectories. Without explicit memory, even powerful generators accumulate geometric drift, producing frames that are individually realistic but collectively inconsistent in a shared world coordinate system.

Prior work attaches a persistent **RGB point cloud** $M_{rgb}=\{(p_i, c_i)\}$ (Eq. 1) constructed by lifting frames into 3D using depth, and then rendering target-view images and re-encoding them into latents:
$$\hat{z}_t = \mathcal{E}(\text{Rasterise}(M_{rgb}; E_t, K_t)) \quad \text{(Eq. 2)}$$

This pixel-space round trip introduces two fundamental bottlenecks:
1. **Computational**: Rendering millions of colored points at full resolution and re-encoding them through the VAE dominates wall-clock time and grows with cache size.
2. **Representational**: Re-encoding the rendered image degrades the conditioning signal through VAE reconstruction error, rasterization artifacts, visibility holes, and distribution mismatch, and it discards the rich features the latents originally carried.

**Latent spatial memory** avoids both problems by storing the diffusion model’s own latent tokens at world-space locations, and reading them back through a single latent-resolution projection — no pixel-space detour.
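To get a feel for why a latent-resolution cache is so much smaller, the following back-of-the-envelope sketch counts the elements that one frame contributes under each scheme. The frame size (704×1280) and the storage layouts (float32 positions, uint8 colors, float16 features) are illustrative assumptions, not values from the paper; the 16×16 spatial compression and $C=48$ are the figures this summary quotes for the Wan2.2 VAE.

```python
# Back-of-the-envelope element counts for one frame (sizes are assumptions).
HEIGHT, WIDTH = 704, 1280      # hypothetical frame size, divisible by 16
SPATIAL_STRIDE = 16            # spatial VAE compression quoted for Wan2.2
LATENT_CHANNELS = 48           # latent channels quoted for Wan2.2

pixels = HEIGHT * WIDTH
cells = (HEIGHT // SPATIAL_STRIDE) * (WIDTH // SPATIAL_STRIDE)

# Assumed layouts: float32 xyz + uint8 rgb versus float32 xyz + float16 features.
rgb_bytes = pixels * (3 * 4 + 3 * 1)
latent_bytes = cells * (3 * 4 + LATENT_CHANNELS * 2)

print(f"points per frame: {pixels} (RGB) vs {cells} (latent)")
print(f"bytes per frame:  {rgb_bytes} (RGB) vs {latent_bytes} (latent)")
```

Under these assumptions each latent element is larger than an RGB point, but there are far fewer of them, so the projection step touches a much smaller point set.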

## Methodology

Mirage maintains a persistent cache $M = \{(p_i, f_i)\}, p_i \in \mathbb{R}^3, f_i \in \mathbb{R}^C$ (Eq. 3). The pipeline consists of three steps repeated over overlapping chunks of latent frames.

### Initialization
Given the initial frame $I_0$, Mirage encodes it to a latent $z \in \mathbb{R}^{C \times h \times w}$, downsamples the depth map $D$ to latent resolution, and back-projects each latent cell $(u,v)$ to world space:
$$p_{uv} = \pi^{-1}(u, v, D(u,v); K, E), \quad F_{uv} = z[:, v, u] \quad \text{(Eq. 4)}$$
One memory element per latent cell seeds the cache.
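A minimal NumPy sketch of this back-projection step is shown below. It is not the authors' code: the function name, the cell-centre convention (`+ 0.5`), and the assumption that `K` has already been rescaled to the latent grid and that `E` is a 4×4 world-to-camera matrix are all illustrative choices.

```python
import numpy as np

def backproject_latent_cells(z, depth, K, E):
    """Lift every latent cell to a world-space point plus its feature vector.

    z:     (C, h, w) latent from the VAE encoder
    depth: (h, w) depth map already downsampled to latent resolution
    K:     (3, 3) intrinsics, assumed rescaled to the latent grid
    E:     (4, 4) world-to-camera extrinsics (assumed convention)
    """
    C, h, w = z.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Homogeneous coordinates of the cell centres, flattened row-major.
    pix = np.stack([u + 0.5, v + 0.5, np.ones((h, w))], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T              # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)        # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((h * w, 1))], axis=1)
    points = (pts_h @ np.linalg.inv(E).T)[:, :3]  # camera -> world
    feats = z.reshape(C, -1).T                   # row v*w+u holds z[:, v, u]
    return points, feats

# Toy usage with a random latent and a flat depth map.
z = np.random.default_rng(0).normal(size=(48, 4, 6))
depth = np.full((4, 6), 2.0)
K = np.array([[5.0, 0.0, 3.0], [0.0, 5.0, 2.0], [0.0, 0.0, 1.0]])
points, feats = backproject_latent_cells(z, depth, K, np.eye(4))
```

Each row of `points` pairs with the same row of `feats`, giving one $(p_{uv}, F_{uv})$ element per latent cell.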

### Latent-Space Memory Readout
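For a target view $(E_t, K_t)$, all memory points are projected onto the target camera grid at latent resolution, and z-buffering retains the frontmost point in each cell. With $\Omega_t(u,v)$ denoting the set of cached points that project into cell $(u,v)$: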
$$i_t(u,v) = \arg\min_{i \in \Omega_t(u,v)} [E_t p_i]_z, \quad \hat{z}_t(u,v) = F_{i_t(u,v)} \quad \text{(Eq. 5)}$$
Unseen cells are zero-filled; a binary visibility mask $m_t$ is produced. Readouts $\hat{z}_t$ and $m_t$ are concatenated and fed through a ControlNet-style side branch into the video diffusion backbone — **no bridging encoder needed**.
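The readout can be sketched in a few lines of NumPy. This is an illustrative reimplementation under stated assumptions rather than Mirage's actual kernel: the function name is hypothetical, `K` is assumed to be expressed in latent-grid units, `E` is assumed to be a 4×4 world-to-camera matrix, and the z-buffer is emulated with a sort instead of a GPU rasterizer.

```python
import numpy as np

def read_latent_memory(points, feats, K, E, h, w):
    """Project cached points to an (h, w) latent grid, nearest point per cell."""
    C = feats.shape[1]
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (pts_h @ E.T)[:, :3]                   # world -> camera
    front = cam[:, 2] > 1e-6                     # drop points behind the camera
    proj = cam[front] @ K.T
    u = np.floor(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.floor(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(front)[inside]          # indices into the cache
    cell = v[inside] * w + u[inside]
    depth = cam[idx, 2]
    # Sort by cell, then by depth, so the first entry of each cell is nearest.
    order = np.lexsort((depth, cell))
    cell_sorted = cell[order]
    first = np.ones(len(cell_sorted), dtype=bool)
    first[1:] = cell_sorted[1:] != cell_sorted[:-1]
    winners = order[first]
    readout = np.zeros((h * w, C))               # unseen cells stay zero
    mask = np.zeros(h * w, dtype=bool)
    readout[cell[winners]] = feats[idx[winners]]
    mask[cell[winners]] = True
    return readout.T.reshape(C, h, w), mask.reshape(h, w)

# Two points on the same ray (far one listed first) and one behind the camera.
points = np.array([[0.0, 0.0, 3.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
feats = np.array([[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]])
K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
z_hat, m = read_latent_memory(points, feats, K, np.eye(4), h=2, w=2)
```

In the toy call, both visible points fall into the same cell and the nearer one supplies the feature, which is the behaviour Eq. 5 describes; `z_hat` and `m` play the roles of $\hat{z}_t$ and $m_t$.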

### Autoregressive 3D Cache Update
After denoising a chunk, frames are decoded, depth and camera are re-estimated, frames are re-encoded to clean latents $\tilde{z}_t = \mathcal{E}(I_t)$, and back-projected:
$$M \leftarrow M \cup \{(p_{uv}, F_{uv})\}_{(u,v) \in \Lambda_t} \quad \text{(Eq. 6)}$$
Here $\Lambda_t$ is the set of latent cells lying outside dynamic objects and sky, as detected by an open-vocabulary entity extractor and a video segmenter; only these cells are added to the cache. The previous chunk's latents are carried over as short-term context.
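The filtered update can be illustrated as follows. The helper names are hypothetical, and the rule for reducing a pixel-resolution mask to the latent grid (a cell is kept only if none of its pixels is flagged) is a conservative assumption; the summary does not say how Mirage performs this reduction.

```python
import numpy as np

def latent_static_mask(dynamic_px, stride=16):
    """Mark a latent cell static only if none of its pixels is dynamic or sky."""
    H, W = dynamic_px.shape
    h, w = H // stride, W // stride
    blocks = dynamic_px[:h * stride, :w * stride].reshape(h, stride, w, stride)
    return ~blocks.any(axis=(1, 3))

def update_cache(cache_points, cache_feats, new_points, new_feats, static_mask):
    """Append only the cells selected by static_mask (the set Lambda_t)."""
    keep = static_mask.reshape(-1)
    points = np.concatenate([cache_points, new_points[keep]], axis=0)
    feats = np.concatenate([cache_feats, new_feats[keep]], axis=0)
    return points, feats

# Toy example: a 32x32 frame gives a 2x2 latent grid; one flagged pixel
# in the top-left patch removes that cell from the update.
dynamic_px = np.zeros((32, 32), dtype=bool)
dynamic_px[3, 5] = True
mask = latent_static_mask(dynamic_px)
cache_p, cache_f = np.zeros((0, 3)), np.zeros((0, 48))
new_p, new_f = np.arange(12.0).reshape(4, 3), np.ones((4, 48))
cache_p, cache_f = update_cache(cache_p, cache_f, new_p, new_f, mask)
```

Because the cache only grows by concatenation, its size is bounded by the number of static latent cells seen so far.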

### Efficient Adaptation
Two-stage fine-tuning of a pretrained camera-controllable video diffusion transformer (Wan2.2-TI2V-5B, VAE compression $4 \times 16 \times 16$, latent channels $C=48$):
1. **Stage 1**: Freeze backbone and VAE, train only the ControlNet side branch.
2. **Stage 2**: Attach rank-64 LoRA adapters to self-attention projections and jointly train with side branch.

Both stages use the flow-matching objective on target frames.
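For readers unfamiliar with LoRA, the sketch below shows what attaching a rank-64 adapter to one projection matrix means. It is a generic NumPy illustration, not the training code used for Mirage: the class name, the hidden size of 256, the scaling factor `alpha`, and the initialization are assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W + (alpha/r) * B @ A."""

    def __init__(self, weight, rank=64, alpha=64.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                  # frozen backbone weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))    # trainable
        self.B = np.zeros((d_out, rank))                      # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def num_trainable(self):
        return self.A.size + self.B.size

# Hypothetical 256-dimensional projection; real hidden sizes are much larger.
W = np.random.default_rng(1).normal(size=(256, 256))
layer = LoRALinear(W, rank=64)
x = np.ones((2, 256))
y = layer(x)
```

With `B` initialized to zero the adapted layer starts out identical to the frozen one, so Stage 2 begins from the Stage 1 solution and only the low-rank factors need gradients.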

## Empirical Validation / Results

### Datasets, Baselines, and Metrics
- **Training**: RealEstate10K videos, with depth and camera poses from a feed-forward reconstructor and dynamic regions removed.
- **WorldScore**: 10 metrics covering controllability, consistency, quality, and motion. Compared against RGB point-cloud scene generators (WonderJourney, InvisibleStitch, WonderWorld, Voyager, FlashWorld, LucidDreamer, Spatia) and general video generators (VideoCrafter2, EasyAnimate, Allegro, CogVideoX-I2V, Vchitect-2.0, LTX-Video, Wan2.1).
- **RealEstate10K**: Novel-view synthesis (PSNR, SSIM, LPIPS) and closed-loop metrics (PSNR<sub>C</sub>, SSIM<sub>C</sub>, LPIPS<sub>C</sub>) following Spatia. Compared against SEVA, VMem, ViewCrafter, FlexWorld, Voyager, and Spatia.
- **Efficiency**: Wall-clock time and peak GPU memory for one cache read as a function of rollout length, measured on a single NVIDIA H100.

### Main Results

**Table 1: WorldScore Results**

| Method | Average Score | Static Score | Dynamic Score | 3D Const | Photo Const |
|---|---|---|---|---|---|
| **Mirage (Ours)** | **70.36** | 73.60 | **67.11** | **92.21** | **93.95** |
| Spatia | 69.73 | 72.63 | 66.82 | 86.40 | 89.10 |
| Voyager | 66.08 | **77.62** | 54.53 | 81.56 | 85.99 |
| CogVideoX-I2V | 60.64 | 62.15 | 59.12 | 86.21 | 88.12 |
| Wan2.1 | 55.21 | 57.56 | 52.85 | 78.74 | 78.36 |

Mirage achieves the highest Average Score and leads on Dynamic Score, 3D consistency, and photometric consistency, while Voyager posts the highest Static Score among the listed methods (77.62 vs. 73.60). Figure 4 shows qualitative comparisons on out-of-domain prompts where Mirage maintains 3D coherence while baselines exhibit drift.

**Table 2: RealEstate10K Results**

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | PSNR<sub>C</sub> ↑ | SSIM<sub>C</sub> ↑ | LPIPS<sub>C</sub> ↓ |
|---|---|---|---|---|---|---|
| **Mirage (Ours)** | 18.38 | **0.779** | **0.250** | **20.05** | **0.825** | 0.228 |
| Spatia | **18.58** | 0.646 | 0.254 | 19.38 | 0.579 | **0.213** |
| Voyager | 17.79 | 0.636 | 0.297 | 17.66 | 0.540 | 0.380 |
| VMem | 14.62 | 0.522 | 0.426 | - | - | - |

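In the closed-loop setting (a trajectory that returns to its starting view), Mirage achieves the best PSNR<sub>C</sub> and SSIM<sub>C</sub>, while Spatia retains a slightly better LPIPS<sub>C</sub> (0.213 vs. 0.228). This indicates strong long-horizon stability.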
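As a reminder of what the PSNR columns measure, here is a small sketch of the metric. The closed-loop usage shown, comparing the starting frame with the frame generated once the camera returns to the starting pose, is an assumption about the protocol; the exact evaluation procedure follows Spatia and is not spelled out in this summary.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Hypothetical closed-loop check: first frame versus the frame generated
# after the camera returns to the starting pose.
start_frame = np.full((8, 8, 3), 0.5)
returned_frame = start_frame + 0.1     # uniform error of 0.1 on a [0, 1] scale
loop_psnr = psnr(start_frame, returned_frame)
```

A uniform error of 0.1 on a unit intensity scale corresponds to roughly 20 dB, which gives a sense of scale for the values in Table 2.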

### Efficiency Scaling
From Figure 5 (a linear-scale bar chart):
- **Mirage**: 0.25 s/frame after the initial chunk, with a cache footprint of ~2.25 MiB after 5 chunks.
- **Spatia** (RGB cache): ~7.25 s/frame, ~78.8 MiB.
- **Gen3C**: ~7.63 s/frame, ~124 MiB.
- **VMem**: ~4.07 s/frame (grows with stored views), ~23.4 MiB.

End-to-end speedup: **10.57×**; memory reduction: **55×** over RGB pipelines. The gap widens with longer rollouts.
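The per-read ratios implied by the Figure 5 values can be checked with a few lines of arithmetic. The numbers below are the approximate values listed above; the helper name is hypothetical.

```python
# Approximate values read from Figure 5, as listed in this section.
per_frame_s = {"Mirage": 0.25, "Spatia": 7.25, "Gen3C": 7.63, "VMem": 4.07}
cache_mib = {"Mirage": 2.25, "Spatia": 78.8, "Gen3C": 124.0, "VMem": 23.4}

def ratios_vs(table, reference="Mirage"):
    """Ratio of each baseline's value to the reference method's value."""
    return {name: round(value / table[reference], 2)
            for name, value in table.items() if name != reference}

time_ratios = ratios_vs(per_frame_s)
memory_ratios = ratios_vs(cache_mib)
print("per-frame time ratios:", time_ratios)
print("cache size ratios:   ", memory_ratios)
```

The cache-size ratio against Gen3C comes out close to the quoted 55×. The per-frame read ratios are larger than 10.57×, which is consistent with that headline figure being an end-to-end number that presumably also includes the time spent outside the cache read; this summary does not state which baseline or rollout length the headline figures refer to.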

### Ablation Studies

**Table 3: Ablation on WorldScore Split**

| Variant | Avg ↑ | Static ↑ | Dynamic ↑ | 3D Cons ↑ | Photo Cons ↑ |
|---|---|---|---|---|---|
| **Mirage (full)** | **70.36** | **73.60** | **67.11** | **92.21** | **93.95** |
| Explicit RGB Point Cloud | 67.71 | 70.49 | 64.93 | 90.75 | 91.10 |
| Feature Upsample, Pixel Lift | 60.85 | 62.41 | 59.28 | 84.90 | 79.81 |
| No Dynamic Object Filter | 61.20 | 62.69 | 59.70 | 80.88 | 76.10 |
| Single Stage Training | 63.18 | 65.15 | 61.20 | 87.11 | 84.47 |

**Table 4: Depth Source Sensitivity**

| Depth Source | Avg ↑ | Static ↑ | Dynamic ↑ | 3D Cons ↑ | Photo Cons ↑ |
|---|---|---|---|---|---|
| DepthAnything 3 (default) | **70.36** | **73.60** | **67.11** | **92.21** | **93.95** |
| MapAnything | 69.66 | 72.78 | 66.53 | 91.89 | 93.32 |
| UniDepth | 69.13 | 72.15 | 66.10 | 91.63 | 92.79 |

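The ablations support each design choice: the latent memory outperforms an explicit RGB point cloud; upsampling features and lifting them at pixel resolution harms consistency; removing the dynamic object filter causes the largest drop in 3D and photometric consistency; and single-stage training scores below the two-stage schedule. Swapping the depth estimator changes the Average Score by at most about 1.2 points (70.36 to 69.13), so the method is robust to the depth source.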

## Theoretical and Practical Implications

- **Computational efficiency**: By eliminating pixel-space rendering and VAE re-encoding from the per-step critical path, latent spatial memory makes world-consistent video generation **practical for long trajectories** under limited GPU budgets.
- **Representational fidelity**: Storing latent tokens preserves the model’s native conditioning features, avoiding information loss from VAE reconstruction, rasterization artifacts, and distribution mismatch — enabling better 3D and photometric consistency.
- **Generality**: The approach is **backbone-agnostic** (demonstrated with Wan2.2) and robust to depth-estimator noise, suggesting wide applicability to any latent video diffusion model that requires persistent 3D scene representation.
- **Limitation**: Dynamic scene content (moving objects, sky) is excluded from persistent memory because its geometry is unreliable. Scenes dominated by pervasive motion therefore benefit less from the cache; persisting dynamic content across chunks is identified as future work.

## Conclusion

Latent spatial memory stores video diffusion model latent features at world-space points, avoiding the pixel-space round trip of RGB point-cloud caches. Mirage, built around this representation, operates entirely within the VAE latent manifold: it constructs the cache via depth-guided back-projection, reads via latent-resolution z-buffered projection, and updates autoregressively with dynamic content exclusion. On WorldScore and RealEstate10K, Mirage achieves state-of-the-art quality while generating videos **10.57× faster** and using **55× less GPU memory** than RGB-cache baselines. The key limitation is the exclusion of dynamic actors; future work may explore persisting dynamic content across chunks to handle scenes with pervasive motion.

