MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
Summary (Overview)
- Hybrid Spatial Memory: Introduces Mosaic Memory (MosaicMem), a novel spatial memory mechanism that combines the geometric precision of explicit 3D methods (via patch lifting and warping) with the dynamic flexibility of implicit memory (via native model conditioning).
- Enhanced Camera Control: Integrates Projective Positional Encoding (PRoPE) as a principled camera-conditioning interface for Diffusion Transformer (DiT) architectures, significantly improving viewpoint controllability and motion adherence.
- Improved Performance: Demonstrates superior performance over both explicit and implicit memory baselines, achieving more accurate camera motion than implicit methods and better handling of dynamic objects than explicit methods.
- New Benchmark: Introduces MosaicMem-World, a new dataset designed to stress-test memory retrieval under complex camera motions and scene revisits, including dynamic objects.
- Advanced Capabilities: Enables minute-level navigation with persistent memory, memory-based scene editing (e.g., stitching, duplication), and autoregressive video generation (Mosaic Forcing) at real-time speeds.
Introduction and Theoretical Foundation
Recent video diffusion models are evolving into world simulators that require long-term consistency under camera motion, revisits, and interventions. A core challenge is spatial memory—the mechanism for preserving and reusing scene structure across time.
- Explicit Memory (e.g., point clouds, 3D Gaussians): Builds an external 3D geometric cache. It provides strong geometric consistency for static scenes but struggles with dynamic, moving objects and can be brittle for long-horizon updates.
- Implicit Memory (e.g., posed frames in latent space): Stores world state within the model's latent representations. It is flexible for dynamics but often suffers from camera drift even with correct poses and is inefficient due to frame-by-frame storage.
MosaicMem is proposed as a hybrid solution. It uses patches as the fundamental memory unit:
- Explicit-style Lifting: Uses an off-the-shelf 3D estimator to lift image patches into 3D for precise localization.
- Implicit-style Conditioning: Retrieves and warps these patches to the target view, then conditions the generation model via its native attention mechanisms, allowing it to decide between faithful reconstruction and synthesizing new, prompt-driven content.
This "patch-and-compose" approach, akin to assembling a mosaic, selectively fills persistent content while letting the model inpaint evolving elements.
Methodology
The method builds upon text+image-to-video (TI2V) models trained via Flow Matching. The generative process follows a probability-flow ODE:
$\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t \mid I, c, P, M)$, where $x_t$ is the video state, $I$ is an input image, $c$ are text prompts, $P$ are camera poses, and $M$ is the MosaicMem spatial memory.
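As a toy sketch of this sampling process, Euler integration of the conditional probability-flow ODE looks like the following (the hand-written velocity field below is a stand-in for the learned DiT, and `cond` stands in for the image/text/pose/memory conditioning):

```python
import numpy as np

def toy_velocity(x, t, cond):
    """Stand-in for the learned velocity field v_theta(x, t | cond).
    It pulls x toward the conditioning target along straight-line paths,
    the kind of trajectory Flow Matching trains toward."""
    return (cond["target"] - x) / max(1.0 - t, 1e-6)

def sample_ode(x0, cond, n_steps=50):
    """Euler integration of dx/dt = v(x, t | cond) from t=0 to t=1."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * toy_velocity(x, i * dt, cond)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)          # noise sample (the "video state" stand-in)
cond = {"target": np.ones(4)}        # hypothetical conditioning target
x1 = sample_ode(x0, cond)
print(np.allclose(x1, cond["target"], atol=1e-2))
```

For this linear toy field the Euler path lands exactly on the target; the real model replaces `toy_velocity` with the DiT evaluated under the full conditioning set.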
Mosaic Memory Pipeline
- Patch Lifting: For a source patch at pixel $u$ with depth $d$ and source camera parameters $(K_s, E_s)$, lift it into 3D world coordinates: $X = E_s^{-1}\big(d \cdot K_s^{-1}\tilde{u}\big)$, with $\tilde{u}$ the homogeneous form of $u$.
- Patch Retrieval & Warping: For a target camera view $(K_t, E_t)$, reproject the 3D patch location to get target coordinates $u' = \pi(K_t E_t X)$, where $\pi$ denotes perspective projection.
- Memory Alignment: Two complementary warping mechanisms ensure geometric consistency:
- Warped RoPE: Applies the reprojected coordinates directly to the Rotary Position Embedding (RoPE) of the memory tokens.
- Warped Latent: Uses bilinear sampling to spatially warp the source latent features to the reprojected target coordinates.
- Conditioning: The retrieved and warped memory patches are flattened and concatenated to the input token sequence as conditioning context for the DiT.
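The lift–reproject–warp steps above can be sketched in a few lines of numpy. This is a toy pinhole-camera example with made-up intrinsics and poses (the paper obtains depth from an off-the-shelf 3D estimator); `bilinear_sample` plays the role of the Warped Latent step:

```python
import numpy as np

def lift(uv, depth, K, c2w):
    """Lift pixel coordinates (u, v) with depth into 3D world coordinates."""
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])  # camera-frame ray
    cam = depth * ray                                       # 3D point, camera frame
    return (c2w @ np.append(cam, 1.0))[:3]                  # to world frame

def reproject(X, K, w2c):
    """Project a 3D world point into a target camera; returns (u', v')."""
    cam = (w2c @ np.append(X, 1.0))[:3]
    img = K @ cam
    return img[:2] / img[2]                                 # perspective division

def bilinear_sample(feat, uv):
    """Bilinearly sample an (H, W, C) feature map at continuous coords (u, v)."""
    u, v = uv
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    u1, v1 = min(u0 + 1, feat.shape[1] - 1), min(v0 + 1, feat.shape[0] - 1)
    return ((1 - du) * (1 - dv) * feat[v0, u0] + du * (1 - dv) * feat[v0, u1]
            + (1 - du) * dv * feat[v1, u0] + du * dv * feat[v1, u1])

K = np.array([[100.0, 0, 32.0], [0, 100.0, 32.0], [0, 0, 1.0]])
c2w = np.eye(4)                              # source camera at the origin
X = lift((40.0, 32.0), 2.0, K, c2w)          # lift a source patch center
w2c_tgt = np.eye(4)
w2c_tgt[0, 3] = -0.5                         # target camera shifted along x
uv_tgt = reproject(X, K, w2c_tgt)
print(uv_tgt)                                # reprojected patch center: [15. 32.]
```

The reprojected coordinates can then drive either Warped RoPE (as positional indices for memory tokens) or Warped Latent (as sampling locations for `bilinear_sample`).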
PRoPE for Camera Control
To provide fine-grained, frame-accurate camera guidance, Projective Positional Encoding (PRoPE) is integrated. It encodes relative camera geometry between views via a projective transform and injects it into self-attention using a GTA-style transformed attention mechanism:
$\mathrm{Attn}(q,k,v)_i = \mathbf{T}_i^{-1} \sum_j \alpha_{ij}\,\mathbf{T}_j v_j$, where $\alpha_{ij} = \operatorname{softmax}_j\big((\mathbf{T}_i q_i)^\top(\mathbf{T}_j k_j)/\sqrt{d}\big)$ and $\mathbf{T}_i$ is derived from frame $i$'s full projection matrix $K_i E_i$. For temporally compressed latents (temporal factor 4), the camera matrices for the four original frames underlying each latent frame are broadcast into the attention operation.
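A minimal sketch of GTA-style transformed attention, with per-token matrices `T` standing in for the camera-derived transforms (how the paper builds these from the projection matrices is not spelled out here, so the construction is an assumption):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gta_attention(q, k, v, T):
    """GTA-style attention: token i's query/key/value are mapped through its
    matrix T[i], and the output is mapped back with T[i]^-1, so the attention
    scores depend only on the relative transforms between frames."""
    d = q.shape[-1]
    qt = np.einsum("nij,nj->ni", T, q)
    kt = np.einsum("nij,nj->ni", T, k)
    vt = np.einsum("nij,nj->ni", T, v)
    out = softmax(qt @ kt.T / np.sqrt(d)) @ vt
    return np.einsum("nij,nj->ni", np.linalg.inv(T), out)

rng = np.random.default_rng(0)
n, d = 3, 4
q, k, v = rng.standard_normal((3, n, d))
eye = np.stack([np.eye(d)] * n)
plain = softmax(q @ k.T / np.sqrt(d)) @ v
print(np.allclose(gta_attention(q, k, v, eye), plain))  # True: identity T is vanilla attention
```

One useful property to note: left-multiplying every `T[i]` by a shared orthogonal matrix leaves the output unchanged, which is what makes the encoding relative rather than absolute.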
Empirical Validation / Results
The model is fine-tuned from Wan 2.2 (5B parameters). Evaluation uses four metric groups: visual quality (FID, FVD), camera accuracy (RotErr, TransErr), motion dynamics (Average Optical Flow), and memory retrieval consistency (SSIM, PSNR, LPIPS within corresponding regions).
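The retrieval-consistency metrics are computed only within corresponding regions. A hedged sketch of what such a region-restricted metric might look like, using masked PSNR (the mask marking revisited pixels is assumed to come from the memory's reprojected coverage):

```python
import numpy as np

def masked_psnr(pred, ref, mask, max_val=1.0):
    """PSNR restricted to pixels where mask is True, i.e. regions the memory
    says should be re-observed (co-visible across generations)."""
    diff = (pred - ref)[mask]
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
ref = rng.random((8, 8, 3))
pred = ref.copy()
pred[:, 4:] = rng.random((8, 4, 3))   # corrupt the non-revisited half
mask = np.zeros((8, 8), dtype=bool)
mask[:, :4] = True                    # evaluate only the revisited half
print(masked_psnr(pred, ref, mask))   # inf: the revisited region is identical
```

Masked SSIM and LPIPS would follow the same pattern, restricting the comparison to the co-visible support before aggregating.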
Quantitative Comparisons
Table 1: Quantitative comparison across memory paradigms and ablations. MosaicMem (full) achieves the best performance.
| Method | RotErr (↓) | TransErr (↓) | FID (↓) | FVD (↓) |
|---|---|---|---|---|
| **Explicit Memory** | | | | |
| VMem [21] | 1.59 | 0.14 | 77.12 | 363.34 |
| GEN3C [28] | 1.61 | 0.13 | 77.41 | 372.08 |
| SEVA [48] | 1.42 | 0.12 | 74.67 | 301.77 |
| VWM [37] | 1.50 | 0.13 | 75.83 | 323.67 |
| **Implicit Memory** | | | | |
| WorldMem [39] | 5.87 | 0.49 | 85.72 | 403.50 |
| CaM [41] | 4.65 | 0.43 | 85.32 | 392.11 |
| **Ablations** | | | | |
| ControlMLP alone | 6.51 | 0.52 | 89.17 | 458.45 |
| PRoPE alone | 4.91 | 0.36 | 86.44 | 412.85 |
| MosaicMem w/o PRoPE | 0.79 | 0.11 | 73.18 | 250.84 |
| PRoPE + Warped Latent | 0.66 | 0.08 | 75.46 | 268.13 |
| PRoPE + Warped RoPE | 0.70 | 0.09 | 71.89 | 243.59 |
| **MosaicMem (full)** | **0.51** | **0.06** | **65.67** | **232.95** |
Key Findings:
- vs. Explicit Memory: MosaicMem achieves comparable or better consistency metrics while generating significantly more dynamic content (higher motion dynamics, measured as average optical flow).
- vs. Implicit Memory: MosaicMem drastically improves camera control accuracy (RotErr/TransErr) and consistency scores.
- Ablations: Both PRoPE and MosaicMem components are crucial. The full model with both warping strategies performs best.
Qualitative Results & Advanced Applications
- Dynamic Object Generation: MosaicMem successfully generates prompt-driven dynamic objects (e.g., a knight riding a horse), while explicit baselines produce static scenes (see Fig. 4).
- Long-Horizon Navigation: Enables generation of 2-minute coherent videos with consistent revisits, significantly outperforming implicit baselines, which suffer from drift and artifact accumulation (see Fig. 6).
- Memory Manipulation & Scene Editing: By manipulating the 3D locations of stored patches, MosaicMem enables scene stitching (e.g., connecting medieval and modern environments) and creative edits like creating "Inception"-style inverted scenes (see Fig. 7).
- Autoregressive Generation (Mosaic Forcing): Distilled into a causal model, it achieves 16 FPS generation. It outperforms other AR systems (RELIC, Matrix-Game) in quality and consistency, especially under large camera motions (see Table 2, Fig. 8).
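Because memory lives in explicit 3D patch locations, scene editing reduces to transforming those locations. A minimal sketch, assuming a hypothetical memory layout in which each entry stores a latent patch together with its 3D center (the paper's actual storage format is not specified here):

```python
import numpy as np

# Hypothetical memory layout: each entry pairs a latent patch with its 3D center.
memory = [
    {"latent": np.zeros(16), "center": np.array([1.0, 0.0, 4.0])},
    {"latent": np.ones(16),  "center": np.array([-2.0, 0.0, 6.0])},
]

def edit_memory(memory, R, t):
    """Apply a rigid transform X -> R @ X + t to every stored patch center.
    A pure translation stitches a second scene next to the first; a 180-degree
    flip yields an 'Inception'-style inverted scene."""
    return [{"latent": m["latent"], "center": R @ m["center"] + t} for m in memory]

flip = np.diag([1.0, -1.0, -1.0])            # 180-degree rotation about the x-axis
inverted = edit_memory(memory, flip, np.array([0.0, 5.0, 0.0]))
print(inverted[0]["center"])                 # [ 1.  5. -4.]
```

The latents themselves are untouched; only the geometry the retrieval step sees is edited, after which generation proceeds as usual.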
Table 2: Quantitative comparison on autoregressive video generation (Mosaic Forcing).
| Method | Quality Score (Total, ↑) | Subject Consistency (↑) | Background Consistency (↑) |
|---|---|---|---|
| Matrix-Game | 75.11 | 82.40 | 87.92 |
| RELIC | 79.08 | 86.21 | 91.08 |
| MosaicMem-WRoPE | 77.81 | 85.03 | 90.41 |
| **MosaicMem (full)** | **81.11** | **88.32** | **93.40** |
Theoretical and Practical Implications
- Theoretical: Proposes a principled hybrid framework that bridges the gap between geometry-heavy and learning-heavy approaches to spatial memory in generative world models. The patch-based unit offers a new granularity for memory representation.
- Practical: Unlocks a suite of controllable video generation capabilities essential for building interactive world simulators:
- Precise Camera Control: Critical for applications like virtual cinematography, robotics simulation, and VR/AR content creation.
- Long-Term Consistency: Enables the creation of explorable, persistent digital environments.
- Direct Scene Editing: Provides an intuitive interface for content creators to manipulate scenes geometrically.
- Efficient Long-Form Generation: The autoregressive variant (Mosaic Forcing) makes real-time, long-horizon simulation feasible.
Conclusion
MosaicMem presents a hybrid spatial memory paradigm that effectively combines the strengths of explicit and implicit approaches. By lifting patches into 3D for localization and using attention-based conditioning for synthesis, it achieves superior camera control, dynamic scene generation, and long-term consistency. Integrated with PRoPE and evaluated on a new revisit-focused benchmark, MosaicMem advances the state of controllable video world models, enabling minute-level navigation, memory editing, and autoregressive generation. Future work may explore more efficient patch storage/retrieval and integration with planning algorithms for agent training.