MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Summary (Overview)

  • Hybrid Spatial Memory: Introduces Mosaic Memory (MosaicMem), a novel spatial memory mechanism that combines the geometric precision of explicit 3D methods (via patch lifting and warping) with the dynamic flexibility of implicit memory (via native model conditioning).
  • Enhanced Camera Control: Integrates Projective Positional Encoding (PRoPE) as a principled camera-conditioning interface for Diffusion Transformer (DiT) architectures, significantly improving viewpoint controllability and motion adherence.
  • Improved Performance: Demonstrates superior performance over both explicit and implicit memory baselines, achieving more accurate camera motion than implicit methods and better handling of dynamic objects than explicit methods.
  • New Benchmark: Introduces MosaicMem-World, a new dataset designed to stress-test memory retrieval under complex camera motions and scene revisits, including dynamic objects.
  • Advanced Capabilities: Enables minute-level navigation with persistent memory, memory-based scene editing (e.g., stitching, duplication), and autoregressive video generation (Mosaic Forcing) at real-time speeds.

Introduction and Theoretical Foundation

Recent video diffusion models are evolving into world simulators that require long-term consistency under camera motion, revisits, and interventions. A core challenge is spatial memory—the mechanism for preserving and reusing scene structure across time.

  • Explicit Memory (e.g., point clouds, 3D Gaussians): Builds an external 3D geometric cache. It provides strong geometric consistency for static scenes but struggles with dynamic, moving objects and can be brittle for long-horizon updates.
  • Implicit Memory (e.g., posed frames in latent space): Stores world state within the model's latent representations. It handles dynamics flexibly, but often exhibits camera drift even when conditioned on correct poses, and its frame-by-frame storage is inefficient.

MosaicMem is proposed as a hybrid solution. It uses patches as the fundamental memory unit:

  1. Explicit-style Lifting: Uses an off-the-shelf 3D estimator to lift image patches into 3D for precise localization.
  2. Implicit-style Conditioning: Retrieves and warps these patches to the target view, then conditions the generation model via its native attention mechanisms, allowing it to decide between faithful reconstruction and synthesizing new, prompt-driven content.

This "patch-and-compose" approach, akin to assembling a mosaic, selectively fills persistent content while letting the model inpaint evolving elements.

Methodology

The method builds upon text+image-to-video (TI2V) models trained via Flow Matching. The generative process follows a probability-flow ODE:

$$\frac{dX_{\lambda}}{d\lambda} = u_{\theta}\left( X_{\lambda}, \lambda \mid I, L, C, M \right), \qquad X_1 = X_0 + \int_{0}^{1} u_{\theta}\left( X_{\lambda}, \lambda \mid I, L, C, M \right) d\lambda$$

where $X_{\lambda}$ is the video state, $I$ is an input image, $L$ is the text prompt, $C$ are the camera poses, and $M$ is the MosaicMem spatial memory.
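
The probability-flow ODE above can be integrated with any numerical solver; a minimal sketch using fixed-step Euler, where `u_theta` and `cond` are stand-ins for the actual DiT velocity network and the conditioning tuple $(I, L, C, M)$:

```python
import numpy as np

def sample_flow_ode(u_theta, x0, cond, n_steps=50):
    """Integrate dX/dlambda = u_theta(X, lambda | cond) from lambda = 0 to 1
    with fixed-step Euler; a stand-in for the real flow-matching sampler."""
    x = x0
    dl = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dl * u_theta(x, k * dl, cond)
    return x

# Toy velocity field: constant velocity along a straight path, matching
# the rectified-flow training target u = x1 - x0 for linear interpolants.
x0 = np.zeros(2)
target = np.array([1.0, 2.0])
u = lambda x, lam, cond: cond - x0
x1 = sample_flow_ode(u, x0, target)
```

With a constant velocity field, Euler integration recovers the endpoint exactly; the real sampler trades `n_steps` against generation quality.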

Mosaic Memory Pipeline

  1. Patch Lifting: For a source patch $P$ with depth $D$ and camera parameters $(K_i, T_i)$, lift it into 3D world coordinates.
  2. Patch Retrieval & Warping: For a target camera view $(K_j, T_j)$, reproject the 3D patch location to obtain target coordinates $(u', v')$: $(u', v') = \Pi\left( K_j T_j T_i^{-1} K_i^{-1} (u, v, D) \right)$, where $\Pi(\cdot)$ is perspective projection.
  3. Memory Alignment: Two complementary warping mechanisms ensure geometric consistency:
    • Warped RoPE: Applies the reprojected coordinates $(j, u', v')$ directly to the Rotary Position Embedding (RoPE) of the memory tokens.
    • Warped Latent: Uses bilinear sampling to spatially warp the source latent features based on $(u', v')$.
  4. Conditioning: The retrieved and warped memory patches are flattened and concatenated to the input token sequence as conditioning context for the DiT.
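
The lifting/reprojection in steps 1–2 and the bilinear sampling in step 3 reduce to standard pinhole-camera algebra. A minimal NumPy sketch, assuming $T_i$, $T_j$ are 4×4 world-to-camera extrinsics (the paper's exact pose convention is not stated):

```python
import numpy as np

def lift_and_reproject(u, v, depth, K_i, T_i, K_j, T_j):
    """(u', v') = Pi(K_j T_j T_i^{-1} K_i^{-1} (u, v, D)):
    lift pixel (u, v) with depth D out of source view i, reproject into view j."""
    # Back-project the homogeneous pixel, scaled by depth, into camera-i space.
    p_cam_i = np.linalg.inv(K_i) @ (np.array([u, v, 1.0]) * depth)
    # Camera i -> world (assumes T maps world to camera coordinates).
    p_world = np.linalg.inv(T_i) @ np.append(p_cam_i, 1.0)
    # World -> camera j, then perspective divide.
    uvw = K_j @ (T_j @ p_world)[:3]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

def bilinear_sample(feat, u, v):
    """Sample a feature map feat[H, W, C] at fractional (u, v) (Warped Latent)."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * feat[v0, u0] + du * (1 - dv) * feat[v0, u0 + 1]
            + (1 - du) * dv * feat[v0 + 1, u0] + du * dv * feat[v0 + 1, u0 + 1])
```

As a sanity check, reprojecting into the same view returns the original pixel coordinates.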

PRoPE for Camera Control

To provide fine-grained, frame-accurate camera guidance, Projective Positional Encoding (PRoPE) is integrated. It encodes relative camera geometry between views via a projective transform $\tilde{P}_{i_1} \tilde{P}_{i_2}^{-1}$ and injects it into self-attention using a GTA-style transformed attention mechanism:

$$\text{Attn}_{\text{PRoPE}}(Q, K, V) = D \odot \text{Attn}\left( D^{\top} \odot Q,\ D^{-1} \odot K,\ D^{-1} \odot V \right)$$

where $D_t^{\text{PRoPE}} = \begin{bmatrix} D_t^{\text{Proj}} & 0 \\ 0 & D_t^{\text{RoPE}} \end{bmatrix}$ and $D_t^{\text{Proj}} = I_{d/8} \otimes \tilde{P}_{i(t)}$. For temporally compressed latents (compression factor $s = 4$), the camera matrices $\{ \tilde{P}_{\ell,k} \}_{k=0}^{3}$ of the four original frames are broadcast into the attention operation.
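A sketch of the GTA-style transformed attention, where $\odot$ applies each token's transform matrix to its feature vector; the per-token `D` (assembled from the projective and RoPE blocks in the real model) and all shapes here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prope_attention(Q, K, V, D):
    """D ⊙ Attn(Dᵀ ⊙ Q, D⁻¹ ⊙ K, D⁻¹ ⊙ V) with per-token transforms.
    Q, K, V: [T, d] token features; D: [T, d, d] block-diagonal transforms."""
    Dt = np.transpose(D, (0, 2, 1))
    Dinv = np.linalg.inv(D)
    q = np.einsum('tij,tj->ti', Dt, Q)      # Dᵀ ⊙ Q
    k = np.einsum('tij,tj->ti', Dinv, K)    # D⁻¹ ⊙ K
    v = np.einsum('tij,tj->ti', Dinv, V)    # D⁻¹ ⊙ V
    scores = softmax(q @ k.T / np.sqrt(Q.shape[-1]))
    return np.einsum('tij,tj->ti', D, scores @ v)  # D ⊙ output
```

When every token's transform is the identity, this reduces to vanilla scaled dot-product attention; non-trivial blocks make attention scores depend on relative camera geometry.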

Empirical Validation / Results

The model is fine-tuned from Wan 2.2 (5B parameters). Evaluation uses four metric groups: visual quality (FID, FVD), camera accuracy (RotErr, TransErr), motion dynamics (Average Optical Flow), and memory retrieval consistency (SSIM, PSNR, LPIPS within corresponding regions).
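
The paper does not spell out how the camera-accuracy metrics are computed; a common formulation of rotation error (geodesic distance between rotation matrices, in degrees) and translation error is:

```python
import numpy as np

def rot_err_deg(R_pred, R_gt):
    """Geodesic rotation error in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def trans_err(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth camera positions."""
    return float(np.linalg.norm(t_pred - t_gt))
```

In practice, predicted poses are typically recovered from generated frames with a structure-from-motion or pose-estimation tool before comparison against the conditioning trajectory.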

Quantitative Comparisons

Table 1: Quantitative comparison across memory paradigms and ablations. MosaicMem (full) achieves the best performance.

| Method | RotErr (↓) | TransErr (↓) | FID (↓) | FVD (↓) |
|---|---|---|---|---|
| *Explicit Memory* | | | | |
| VMem [21] | 1.59 | 0.14 | 77.12 | 363.34 |
| GEN3C [28] | 1.61 | 0.13 | 77.41 | 372.08 |
| SEVA [48] | 1.42 | 0.12 | 74.67 | 301.77 |
| VWM [37] | 1.50 | 0.13 | 75.83 | 323.67 |
| *Implicit Memory* | | | | |
| WorldMem [39] | 5.87 | 0.49 | 85.72 | 403.50 |
| CaM [41] | 4.65 | 0.43 | 85.32 | 392.11 |
| *Ablations* | | | | |
| ControlMLP alone | 6.51 | 0.52 | 89.17 | 458.45 |
| PRoPE alone | 4.91 | 0.36 | 86.44 | 412.85 |
| MosaicMem w/o PRoPE | 0.79 | 0.11 | 73.18 | 250.84 |
| PRoPE + Warped Latent | 0.66 | 0.08 | 75.46 | 268.13 |
| PRoPE + Warped RoPE | 0.70 | 0.09 | 71.89 | 243.59 |
| **MosaicMem (full)** | **0.51** | **0.06** | **65.67** | **232.95** |

Key Findings:

  • vs. Explicit Memory: MosaicMem achieves comparable or better consistency metrics while generating significantly more dynamic content (higher Dynamic Score).
  • vs. Implicit Memory: MosaicMem drastically improves camera control accuracy (RotErr/TransErr) and consistency scores.
  • Ablations: Both PRoPE and MosaicMem components are crucial. The full model with both warping strategies performs best.

Qualitative Results & Advanced Applications

  • Dynamic Object Generation: MosaicMem successfully generates prompt-driven dynamic objects (e.g., a knight riding a horse), while explicit baselines produce static scenes (see Fig. 4).
  • Long-Horizon Navigation: Enables generation of 2-minute coherent videos with consistent revisits, significantly outperforming implicit baselines which suffer from drift and artifact accumulation (see Fig. 6).
  • Memory Manipulation & Scene Editing: By manipulating the 3D locations of stored patches, MosaicMem enables scene stitching (e.g., connecting medieval and modern environments) and creative edits like creating "Inception"-style inverted scenes (see Fig. 7).
  • Autoregressive Generation (Mosaic Forcing): Distilled into a causal model, it achieves 16 FPS generation. It outperforms other AR systems (RELIC, Matrix-Game) in quality and consistency, especially under large camera motions (see Table 2, Fig. 8).

Table 2: Quantitative comparison on autoregressive video generation (Mosaic Forcing).

| Method | Quality Total (↑) | Subject Consist. (↑) | Bg Consist. (↑) |
|---|---|---|---|
| Matrix-Game | 75.11 | 82.40 | 87.92 |
| RELIC | 79.08 | 86.21 | 91.08 |
| MosaicMem-WRoPE | 77.81 | 85.03 | 90.41 |
| **MosaicMem (full)** | **81.11** | **88.32** | **93.40** |

Theoretical and Practical Implications

  • Theoretical: Proposes a principled hybrid framework that bridges the gap between geometry-heavy and learning-heavy approaches to spatial memory in generative world models. The patch-based unit offers a new granularity for memory representation.
  • Practical: Unlocks a suite of controllable video generation capabilities essential for building interactive world simulators:
    • Precise Camera Control: Critical for applications like virtual cinematography, robotics simulation, and VR/AR content creation.
    • Long-Term Consistency: Enables the creation of explorable, persistent digital environments.
    • Direct Scene Editing: Provides an intuitive interface for content creators to manipulate scenes geometrically.
    • Efficient Long-Form Generation: The autoregressive variant (Mosaic Forcing) makes real-time, long-horizon simulation feasible.

Conclusion

MosaicMem presents a hybrid spatial memory paradigm that effectively combines the strengths of explicit and implicit approaches. By lifting patches into 3D for localization and using attention-based conditioning for synthesis, it achieves superior camera control, dynamic scene generation, and long-term consistency. Integrated with PRoPE and evaluated on a new revisit-focused benchmark, MosaicMem advances the state of controllable video world models, enabling minute-level navigation, memory editing, and autoregressive generation. Future work may explore more efficient patch storage/retrieval and integration with planning algorithms for agent training.