Summary (Overview)
- Introduces latent spatial memory — a persistent 3D cache that stores scene information directly in the diffusion latent space (VAE latent tokens) instead of RGB point clouds, eliminating the costly pixel-space round trip.
- Presents Mirage, a video world model built around this memory, with depth-guided back-projection for construction, latent-resolution occlusion-aware readout, and autoregressive chunk-wise update with dynamic object filtering.
- Achieves state-of-the-art average score on WorldScore (70.36) and competitive novel-view synthesis on RealEstate10K (18.38 PSNR, 0.779 SSIM), with the best closed-loop PSNR and SSIM among the compared baselines.
- Delivers 10.57× faster end-to-end generation and 55× lower GPU memory usage than RGB point-cloud baselines, because the per-step conditioning no longer performs rasterization and VAE re-encoding.
- Ablation studies confirm that operating in latent space outperforms explicit RGB caches, feature upsampling at pixel resolution, and single-stage training; the method is robust to depth-estimator choice.
Introduction and Theoretical Foundation
The paper addresses the challenge of 3D spatial consistency in video world models — large-scale video diffusion models that generate plausible future frames conditioned on camera trajectories. Without explicit memory, even powerful generators accumulate geometric drift, producing frames that are individually realistic but collectively inconsistent in a shared world coordinate system.
Prior work attaches a persistent RGB point cloud (Eq. 1), constructed by lifting frames into 3D using depth, and conditions generation by rendering target-view images from it and re-encoding them into latents.
This pixel-space round trip introduces two fundamental bottlenecks:
- Computational: Rendering millions of colored points at full resolution and re-encoding them through the VAE dominates wall-clock time and grows with cache size.
- Representational: Rendering and then re-encoding through the VAE degrades the signal through reconstruction error, rasterization artifacts, visibility holes, and distribution mismatch, and it discards the rich latent features the model had already produced.
Latent spatial memory avoids both problems by storing the diffusion model’s own latent tokens at world-space locations, and reading them back through a single latent-resolution projection — no pixel-space detour.
Methodology
Mirage maintains a persistent cache (Eq. 3). The pipeline consists of three steps repeated over overlapping chunks of latent frames.
Initialization
Given the initial frame, encode it to a latent, downsample its depth map to latent resolution, and back-project each latent cell to world space.
One memory element per latent cell seeds the cache.
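The initialization step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `backproject_latent`, the pinhole camera model, the cell-center depth sampling, and the `stride` argument (standing in for the spatial VAE compression factor) are all assumptions made for illustration.

```python
import numpy as np

def backproject_latent(latent, depth, K, cam_to_world, stride):
    """Lift every latent cell to a world-space point (hypothetical helper).

    latent:       (C, h, w) latent tokens of one frame
    depth:        (h * stride, w * stride) pixel-resolution depth map
    K:            (3, 3) pixel-space pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world pose
    stride:       spatial downsampling factor between pixels and latent cells
    """
    C, h, w = latent.shape
    # Assumption: downsample depth by sampling near each cell center.
    depth_lat = depth[stride // 2::stride, stride // 2::stride][:h, :w]
    # Rescale the intrinsics so they address the latent grid instead of pixels.
    K_lat = K.astype(float).copy()
    K_lat[:2] /= stride
    # Homogeneous coordinates of the cell centers, in latent-grid units.
    v, u = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Unproject to camera space, then move to world space.
    rays = pix @ np.linalg.inv(K_lat).T
    pts_cam = rays * depth_lat.reshape(-1, 1)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    pts_world = pts_cam @ R.T + t
    feats = latent.reshape(C, -1).T  # one C-dimensional feature per point
    return pts_world, feats
```

The returned pair holds one point and one feature vector per latent cell, so the cache is smaller than a per-pixel RGB point cloud by roughly the square of the stride.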
Latent-Space Memory Readout
For a target view, project all memory points onto the target camera grid at latent resolution and retain the frontmost point per cell via z-buffering.
Unseen cells are zero-filled, and a binary visibility mask is produced. The readout and the mask are concatenated and fed through a ControlNet-style side branch into the video diffusion backbone; no bridging encoder is needed.
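A z-buffered readout at latent resolution might look like the sketch below. It assumes a pinhole camera whose intrinsics `K_lat` are already scaled to the latent grid, and it treats each memory element as a single point with no splatting. The name `read_memory` and its signature are hypothetical.

```python
import numpy as np

def read_memory(pts_world, feats, K_lat, world_to_cam, h, w):
    """Project cached points into an (h, w) latent grid, keeping the nearest per cell."""
    C = feats.shape[1]
    R, t = world_to_cam[:3, :3], world_to_cam[:3, 3]
    pts_cam = pts_world @ R.T + t
    z = pts_cam[:, 2]
    front = z > 1e-6                      # drop points behind the camera
    uvw = pts_cam[front] @ K_lat.T
    col = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    row = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (col >= 0) & (col < w) & (row >= 0) & (row < h)
    cell = row[inside] * w + col[inside]  # flat cell index of each visible point
    z_in = z[front][inside]
    f_in = feats[front][inside]
    # Z-buffer: sort by cell, then by depth, and keep the first (nearest) entry per cell.
    order = np.lexsort((z_in, cell))
    uniq, first = np.unique(cell[order], return_index=True)
    winners = order[first]
    readout = np.zeros((h * w, C), dtype=feats.dtype)  # unseen cells stay zero
    mask = np.zeros(h * w, dtype=bool)
    readout[uniq] = f_in[winners]
    mask[uniq] = True
    return readout.T.reshape(C, h, w), mask.reshape(h, w)
```

The explicit sort is used instead of a scatter with repeated indices because NumPy does not guarantee which write wins when an index appears more than once in an advanced assignment.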
Autoregressive 3D Cache Update
After a chunk is denoised, its frames are decoded, depth and camera pose are re-estimated, and the frames are re-encoded to clean latents, which are then back-projected into world space.
Only cells outside dynamic objects and sky (detected by an open-vocabulary entity extractor and a video segmenter) are added to the cache. The previous chunk's latents are carried as short-term context.
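The filtering step of the update can be sketched as follows. The sketch assumes the segmenter yields a pixel-resolution boolean mask of dynamic and sky pixels, and that a latent cell is excluded whenever any pixel in its block is flagged. That pooling rule, along with the names `mask_to_latent` and `update_cache`, is an illustrative assumption rather than something the source specifies.

```python
import numpy as np

def mask_to_latent(mask_px, stride):
    """Pool a (H, W) boolean exclusion mask to latent resolution (any-pooling)."""
    H, W = mask_px.shape
    h, w = H // stride, W // stride
    blocks = mask_px[:h * stride, :w * stride].reshape(h, stride, w, stride)
    return blocks.any(axis=(1, 3))

def update_cache(cache_pts, cache_feats, new_pts, new_feats, exclude_lat):
    """Append only the static, non-sky cells of a new frame to the cache.

    new_pts / new_feats are ordered row-major over the latent grid, so they
    line up with exclude_lat.reshape(-1).
    """
    keep = ~exclude_lat.reshape(-1)
    pts = np.concatenate([cache_pts, new_pts[keep]], axis=0)
    feats = np.concatenate([cache_feats, new_feats[keep]], axis=0)
    return pts, feats
```

Any-pooling is the conservative choice here: a cell that straddles a moving object is dropped entirely rather than stored with partly unreliable geometry.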
Efficient Adaptation
Two-stage fine-tuning of a pretrained camera-controllable video diffusion transformer (Wan2.2-TI2V-5B, VAE compression , latent channels ):
- Stage 1: Freeze the backbone and VAE; train only the ControlNet side branch.
- Stage 2: Attach rank-64 LoRA adapters to the self-attention projections and train them jointly with the side branch. Both stages use the flow-matching objective on target frames.
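To make the Stage 2 adapters concrete, here is a minimal NumPy sketch of a LoRA-augmented linear projection. Only the rank (64) comes from the text above. The scaling factor `alpha`, the initialization scheme, and the class name `LoRALinear` are common conventions assumed for illustration.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=64, alpha=64.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen base projection
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                    # trainable, zero-initialized
        self.scale = alpha / rank                             # alpha is an assumed value

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen projection exactly, so Stage 2 begins from the model obtained at the end of Stage 1.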
Empirical Validation / Results
Datasets, Baselines, and Metrics
- Training: RealEstate10K videos with depth and camera estimates from a feed-forward reconstructor, with dynamic regions removed.
- WorldScore: 10 metrics covering controllability, consistency, quality, motion. Compare against RGB point-cloud scene generators (WonderJourney, InvisibleStitch, WonderWorld, Voyager, FlashWorld, LucidDreamer, Spatia) and general video generators (VideoCrafter2, EasyAnimate, Allegro, CogVideoX-I2V, Vchitect-2.0, LTX-Video, Wan2.1).
- RealEstate10K: Novel-view synthesis (PSNR, SSIM, LPIPS) and closed-loop metrics (PSNR<sub>C</sub>, SSIM<sub>C</sub>, LPIPS<sub>C</sub>) following Spatia. Compare against SEVA, VMem, ViewCrafter, FlexWorld, Voyager, Spatia.
- Efficiency: Wall-clock time and peak GPU memory for one cache read vs. rollout length on a single NVIDIA H100.
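For reference, PSNR in the metrics above follows the standard definition, 10 · log10(MAX² / MSE). The helper below is a generic sketch, not the benchmark's evaluation code. It assumes images scaled to [0, 1], and it further assumes that the closed-loop variant applies the same formula to frames at revisited viewpoints, which the source does not spell out.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01 and therefore 20 dB, which puts the 18 to 20 dB values reported for RealEstate10K in context.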
Main Results
Table 1: WorldScore Results

| Method | Average Score | Static Score | Dynamic Score | 3D Const | Photo Const |
|---|---|---|---|---|---|
| Mirage (Ours) | 70.36 | 73.60 | 67.11 | 92.21 | 93.95 |
| Spatia | 69.73 | 72.63 | 66.82 | 86.40 | 89.10 |
| Voyager | 66.08 | 77.62 | 54.53 | 81.56 | 85.99 |
| CogVideoX-I2V | 60.64 | 62.15 | 59.12 | 86.21 | 88.12 |
| Wan2.1 | 55.21 | 57.56 | 52.85 | 78.74 | 78.36 |
Mirage achieves the highest Average Score and leads on Dynamic Score, 3D consistency, and photometric consistency, while Voyager retains the best Static Score (77.62 vs. 73.60). Figure 4 shows qualitative comparisons on out-of-domain prompts where Mirage maintains 3D coherence while baselines exhibit drift.
Table 2: RealEstate10K Results

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | PSNR<sub>C</sub> ↑ | SSIM<sub>C</sub> ↑ | LPIPS<sub>C</sub> ↓ |
|---|---|---|---|---|---|---|
| Mirage (Ours) | 18.38 | 0.779 | 0.250 | 20.05 | 0.825 | 0.228 |
| Spatia | 18.58 | 0.646 | 0.254 | 19.38 | 0.579 | 0.213 |
| Voyager | 17.79 | 0.636 | 0.297 | 17.66 | 0.540 | 0.380 |
| VMem | 14.62 | 0.522 | 0.426 | - | - | - |
In the closed-loop (return-trajectory) setting, Mirage achieves the best PSNR<sub>C</sub> and SSIM<sub>C</sub>, demonstrating strong long-horizon stability; Spatia has a slightly lower LPIPS<sub>C</sub> (0.213 vs. 0.228).
Efficiency Scaling
From Figure 5 (reported as linear-scale bar chart):
- After initial chunk, Mirage: 0.25 s/frame, cache footprint ~2.25 MiB after 5 chunks.
- Spatia (RGB cache): ~7.25 s/frame, ~78.8 MiB.
- Gen3C: ~7.63 s/frame, ~124 MiB.
- VMem: ~4.07 s/frame (grows with stored views), ~23.4 MiB.
End-to-end speedup: 10.57×; memory reduction: 55× over RGB pipelines. Gap widens with longer rollouts.
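The per-method figures listed above can be turned into ratios with a few lines of arithmetic. The snippet uses only the approximate numbers quoted in this section. Reading the 55× figure as the Gen3C cache footprint relative to Mirage is an inference from those numbers, not a statement from the source.

```python
# Approximate values quoted above (seconds per frame, cache size in MiB).
per_frame_s = {"Mirage": 0.25, "Spatia": 7.25, "Gen3C": 7.63, "VMem": 4.07}
cache_mib = {"Mirage": 2.25, "Spatia": 78.8, "Gen3C": 124.0, "VMem": 23.4}

def ratios(table, ref="Mirage"):
    """Return each baseline's value divided by the reference method's value."""
    return {name: value / table[ref] for name, value in table.items() if name != ref}

time_ratio = ratios(per_frame_s)
mem_ratio = ratios(cache_mib)

for name in time_ratio:
    print(f"{name}: {time_ratio[name]:.1f}x time, {mem_ratio[name]:.1f}x cache")
```

The Gen3C cache ratio comes out at about 55.1, in line with the 55× figure, and the Spatia ratios are about 29 for time and 35 for cache size. These per-frame time ratios are larger than the 10.57× end-to-end speedup, presumably because the end-to-end measurement also includes cost common to all methods, such as denoising.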
Ablation Studies
Table 3: Ablation on WorldScore Split

| Variant | Avg ↑ | Static ↑ | Dynamic ↑ | 3D Cons ↑ | Photo Cons ↑ |
|---|---|---|---|---|---|
| Mirage (full) | 70.36 | 73.60 | 67.11 | 92.21 | 93.95 |
| Explicit RGB Point Cloud | 67.71 | 70.49 | 64.93 | 90.75 | 91.10 |
| Feature Upsample, Pixel Lift | 60.85 | 62.41 | 59.28 | 84.90 | 79.81 |
| No Dynamic Object Filter | 61.20 | 62.69 | 59.70 | 80.88 | 76.10 |
| Single Stage Training | 63.18 | 65.15 | 61.20 | 87.11 | 84.47 |

Table 4: Depth Source Sensitivity

| Depth Source | Avg ↑ | Static ↑ | Dynamic ↑ | 3D Cons ↑ | Photo Cons ↑ |
|---|---|---|---|---|---|
| DepthAnything 3 (default) | 70.36 | 73.60 | 67.11 | 92.21 | 93.95 |
| MapAnything | 69.66 | 72.78 | 66.53 | 91.89 | 93.32 |
| UniDepth | 69.13 | 72.15 | 66.10 | 91.63 | 92.79 |
Ablations confirm that each component contributes: latent memory outperforms an explicit RGB point cloud; feature upsampling at pixel resolution harms consistency; dynamic object filtering is critical for long-horizon stability; two-stage training stabilizes convergence; and results are robust to the depth source, with the Average Score varying by at most 1.23 points across the three estimators.
Theoretical and Practical Implications
- Computational efficiency: By eliminating pixel-space rendering and VAE re-encoding from the per-step critical path, latent spatial memory makes world-consistent video generation practical for long trajectories under limited GPU budgets.
- Representational fidelity: Storing latent tokens preserves the model’s native conditioning features, avoiding information loss from VAE reconstruction, rasterization artifacts, and distribution mismatch — enabling better 3D and photometric consistency.
- Generality: The approach is designed to be backbone-agnostic (although it is demonstrated only with Wan2.2) and is robust to depth-estimator noise, suggesting applicability to other latent video diffusion models that require a persistent 3D scene representation.
- Limitation: Dynamic scene content (moving objects, sky) is excluded from persistent memory because its geometry is unreliable. Scenes dominated by pervasive motion benefit less from the cache; persisting dynamic content across chunks is identified as future work.
Conclusion
Latent spatial memory stores video diffusion model latent features at world-space points, avoiding the pixel-space round trip of RGB point-cloud caches. Mirage, built around this representation, operates entirely within the VAE latent manifold: it constructs the cache via depth-guided back-projection, reads via latent-resolution z-buffered projection, and updates autoregressively with dynamic content exclusion. On WorldScore and RealEstate10K, Mirage achieves state-of-the-art quality while generating videos 10.57× faster and using 55× less GPU memory than RGB-cache baselines. The key limitation is the exclusion of dynamic actors; future work may explore persisting dynamic content across chunks to handle scenes with pervasive motion.