MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Summary (Overview)

  • Hybrid Spatial Memory: Introduces Mosaic Memory (MosaicMem), a novel spatial memory mechanism that combines the geometric precision of explicit 3D methods (via patch lifting and warping) with the dynamic flexibility of implicit memory (via native model conditioning).
  • Enhanced Camera Control: Integrates Projective Positional Encoding (PRoPE) as a principled camera-conditioning interface for Diffusion Transformer (DiT) architectures, significantly improving viewpoint controllability and motion adherence.
  • Improved Performance: Demonstrates superior performance over both explicit and implicit memory baselines, achieving more accurate camera motion than implicit methods and better handling of dynamic objects than explicit methods.
  • New Benchmark: Introduces MosaicMem-World, a new dataset designed to stress-test memory retrieval under complex camera motions and scene revisits, including dynamic objects.
  • Advanced Capabilities: Enables minute-level navigation with persistent memory, memory-based scene editing (e.g., stitching, duplication), and autoregressive video generation (Mosaic Forcing) at real-time speeds.

Introduction and Theoretical Foundation

Recent video diffusion models are evolving into world simulators that require long-term consistency under camera motion, revisits, and interventions. A core challenge is spatial memory—the mechanism for preserving and reusing scene structure across time.

  • Explicit Memory (e.g., point clouds, 3D Gaussians): Builds an external 3D geometric cache. It provides strong geometric consistency for static scenes but struggles with dynamic, moving objects and can be brittle for long-horizon updates.
  • Implicit Memory (e.g., posed frames in latent space): Stores world state within the model's latent representations. It handles dynamics flexibly, but often exhibits camera drift even when conditioned on correct poses, and its frame-by-frame storage is inefficient.

MosaicMem is proposed as a hybrid solution. It uses patches as the fundamental memory unit:

  1. Explicit-style Lifting: Uses an off-the-shelf 3D estimator to lift image patches into 3D for precise localization.
  2. Implicit-style Conditioning: Retrieves and warps these patches to the target view, then conditions the generation model via its native attention mechanisms, allowing it to decide between faithful reconstruction and synthesizing new, prompt-driven content.

This "patch-and-compose" approach, akin to assembling a mosaic, selectively fills persistent content while letting the model inpaint evolving elements.

Methodology

The method builds upon text+image-to-video (TI2V) models trained via Flow Matching. The generative process follows a probability-flow ODE:

$$\frac{dX_{\lambda}}{d\lambda} = u_{\theta}\left( X_{\lambda}, \lambda \mid I, L, C, M \right), \qquad X_1 = X_0 + \int_{0}^{1} u_{\theta}\left( X_{\lambda}, \lambda \mid I, L, C, M \right) d\lambda$$

where $X_{\lambda}$ is the video state, $I$ is an input image, $L$ is the text prompt, $C$ are the camera poses, and $M$ is the MosaicMem spatial memory.
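
The probability-flow ODE above can be integrated with any numerical solver; a minimal sketch using fixed-step Euler, where `u_theta` and `cond` are stand-ins for the actual DiT velocity network and the conditioning tuple $(I, L, C, M)$:

```python
import numpy as np

def sample_flow_ode(u_theta, x0, cond, n_steps=50):
    """Integrate dX/dlambda = u_theta(X, lambda | cond) from lambda = 0 to 1
    with fixed-step Euler; a stand-in for the real flow-matching sampler."""
    x = x0
    dl = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dl * u_theta(x, k * dl, cond)
    return x

# Toy velocity field: constant velocity along a straight path, matching
# the rectified-flow training target u = x1 - x0 for linear interpolants.
x0 = np.zeros(2)
target = np.array([1.0, 2.0])
u = lambda x, lam, cond: cond - x0
x1 = sample_flow_ode(u, x0, target)
```

With a constant velocity field, Euler integration recovers the endpoint exactly; the real sampler trades `n_steps` against generation quality.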

Mosaic Memory Pipeline

  1. Patch Lifting: For a source patch $P$ with depth $D$ and camera parameters $(K_i, T_i)$, lift it into 3D world coordinates.
  2. Patch Retrieval & Warping: For a target camera view $(K_j, T_j)$, reproject the 3D patch location to obtain target coordinates $(u', v')$: $(u', v') = \Pi\left( K_j T_j T_i^{-1} K_i^{-1} (u, v, D) \right)$, where $\Pi(\cdot)$ is perspective projection.
  3. Memory Alignment: Two complementary warping mechanisms ensure geometric consistency:
    • Warped RoPE: Applies the reprojected coordinates $(j, u', v')$ directly to the Rotary Position Embedding (RoPE) of the memory tokens.
    • Warped Latent: Uses bilinear sampling to spatially warp the source latent features based on $(u', v')$.
  4. Conditioning: The retrieved and warped memory patches are flattened and concatenated to the input token sequence as conditioning context for the DiT.
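
The lifting/reprojection in steps 1–2 and the bilinear sampling in step 3 reduce to standard pinhole-camera algebra. A minimal NumPy sketch, assuming $T_i$, $T_j$ are 4×4 world-to-camera extrinsics (the paper's exact pose convention is not stated):

```python
import numpy as np

def lift_and_reproject(u, v, depth, K_i, T_i, K_j, T_j):
    """(u', v') = Pi(K_j T_j T_i^{-1} K_i^{-1} (u, v, D)):
    lift pixel (u, v) with depth D out of source view i, reproject into view j."""
    # Back-project the homogeneous pixel, scaled by depth, into camera-i space.
    p_cam_i = np.linalg.inv(K_i) @ (np.array([u, v, 1.0]) * depth)
    # Camera i -> world (assumes T maps world to camera coordinates).
    p_world = np.linalg.inv(T_i) @ np.append(p_cam_i, 1.0)
    # World -> camera j, then perspective divide.
    uvw = K_j @ (T_j @ p_world)[:3]
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

def bilinear_sample(feat, u, v):
    """Sample a feature map feat[H, W, C] at fractional (u, v) (Warped Latent)."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * feat[v0, u0] + du * (1 - dv) * feat[v0, u0 + 1]
            + (1 - du) * dv * feat[v0 + 1, u0] + du * dv * feat[v0 + 1, u0 + 1])
```

As a sanity check, reprojecting into the same view returns the original pixel coordinates.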

PRoPE for Camera Control

To provide fine-grained, frame-accurate camera guidance, Projective Positional Encoding (PRoPE) is integrated. It encodes relative camera geometry between views via a projective transform $\tilde{P}_{i_1} \tilde{P}_{i_2}^{-1}$ and injects it into self-attention using a GTA-style transformed attention mechanism:

$$\text{Attn}_{\text{PRoPE}}(Q, K, V) = D \odot \text{Attn}\left( D^{\top} \odot Q,\ D^{-1} \odot K,\ D^{-1} \odot V \right)$$

where $D_t^{\text{PRoPE}} = \begin{bmatrix} D_t^{\text{Proj}} & 0 \\ 0 & D_t^{\text{RoPE}} \end{bmatrix}$ and $D_t^{\text{Proj}} = I_{d/8} \otimes \tilde{P}_{i(t)}$. For temporally compressed latents (compression factor $s = 4$), the camera matrices $\{ \tilde{P}_{\ell,k} \}_{k=0}^{3}$ of the four original frames are broadcast into the attention operation.
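A sketch of the GTA-style transformed attention, where $\odot$ applies each token's transform matrix to its feature vector; the per-token `D` (assembled from the projective and RoPE blocks in the real model) and all shapes here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prope_attention(Q, K, V, D):
    """D ⊙ Attn(Dᵀ ⊙ Q, D⁻¹ ⊙ K, D⁻¹ ⊙ V) with per-token transforms.
    Q, K, V: [T, d] token features; D: [T, d, d] block-diagonal transforms."""
    Dt = np.transpose(D, (0, 2, 1))
    Dinv = np.linalg.inv(D)
    q = np.einsum('tij,tj->ti', Dt, Q)      # Dᵀ ⊙ Q
    k = np.einsum('tij,tj->ti', Dinv, K)    # D⁻¹ ⊙ K
    v = np.einsum('tij,tj->ti', Dinv, V)    # D⁻¹ ⊙ V
    scores = softmax(q @ k.T / np.sqrt(Q.shape[-1]))
    return np.einsum('tij,tj->ti', D, scores @ v)  # D ⊙ output
```

When every token's transform is the identity, this reduces to vanilla scaled dot-product attention; non-trivial blocks make attention scores depend on relative camera geometry.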

Empirical Validation / Results

The model is fine-tuned from Wan 2.2 (5B parameters). Evaluation uses four metric groups: visual quality (FID, FVD), camera accuracy (RotErr, TransErr), motion dynamics (Average Optical Flow), and memory retrieval consistency (SSIM, PSNR, LPIPS within corresponding regions).
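
The paper does not spell out how the camera-accuracy metrics are computed; a common formulation of rotation error (geodesic distance between rotation matrices, in degrees) and translation error is:

```python
import numpy as np

def rot_err_deg(R_pred, R_gt):
    """Geodesic rotation error in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def trans_err(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth camera positions."""
    return float(np.linalg.norm(t_pred - t_gt))
```

In practice, predicted poses are typically recovered from generated frames with a structure-from-motion or pose-estimation tool before comparison against the conditioning trajectory.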

Quantitative Comparisons

Table 1: Quantitative comparison across memory paradigms and ablations. MosaicMem (full) achieves the best performance.

| Method | RotErr (↓) | TransErr (↓) | FID (↓) | FVD (↓) |
|---|---|---|---|---|
| *Explicit Memory* | | | | |
| VMem [21] | 1.59 | 0.14 | 77.12 | 363.34 |
| GEN3C [28] | 1.61 | 0.13 | 77.41 | 372.08 |
| SEVA [48] | 1.42 | 0.12 | 74.67 | 301.77 |
| VWM [37] | 1.50 | 0.13 | 75.83 | 323.67 |
| *Implicit Memory* | | | | |
| WorldMem [39] | 5.87 | 0.49 | 85.72 | 403.50 |
| CaM [41] | 4.65 | 0.43 | 85.32 | 392.11 |
| *Ablations* | | | | |
| ControlMLP alone | 6.51 | 0.52 | 89.17 | 458.45 |
| PRoPE alone | 4.91 | 0.36 | 86.44 | 412.85 |
| MosaicMem w/o PRoPE | 0.79 | 0.11 | 73.18 | 250.84 |
| PRoPE + Warped Latent | 0.66 | 0.08 | 75.46 | 268.13 |
| PRoPE + Warped RoPE | 0.70 | 0.09 | 71.89 | 243.59 |
| **MosaicMem (full)** | **0.51** | **0.06** | **65.67** | **232.95** |

Key Findings:

  • vs. Explicit Memory: MosaicMem achieves comparable or better consistency metrics while generating significantly more dynamic content (higher Dynamic Score).
  • vs. Implicit Memory: MosaicMem drastically improves camera control accuracy (RotErr/TransErr) and consistency scores.
  • Ablations: Both PRoPE and MosaicMem components are crucial. The full model with both warping strategies performs best.

Qualitative Results & Advanced Applications

  • Dynamic Object Generation: MosaicMem successfully generates prompt-driven dynamic objects (e.g., a knight riding a horse), while explicit baselines produce static scenes (see Fig. 4).
  • Long-Horizon Navigation: Enables generation of 2-minute coherent videos with consistent revisits, significantly outperforming implicit baselines which suffer from drift and artifact accumulation (see Fig. 6).
  • Memory Manipulation & Scene Editing: By manipulating the 3D locations of stored patches, MosaicMem enables scene stitching (e.g., connecting medieval and modern environments) and creative edits like creating "Inception"-style inverted scenes (see Fig. 7).
  • Autoregressive Generation (Mosaic Forcing): Distilled into a causal model, it achieves 16 FPS generation. It outperforms other AR systems (RELIC, Matrix-Game) in quality and consistency, especially under large camera motions (see Table 2, Fig. 8).

Table 2: Quantitative comparison on autoregressive video generation (Mosaic Forcing).

| Method | Quality Total (↑) | Subject Consist. (↑) | Bg Consist. (↑) |
|---|---|---|---|
| Matrix-Game | 75.11 | 82.40 | 87.92 |
| RELIC | 79.08 | 86.21 | 91.08 |
| MosaicMem-WRoPE | 77.81 | 85.03 | 90.41 |
| **MosaicMem (full)** | **81.11** | **88.32** | **93.40** |

Theoretical and Practical Implications

  • Theoretical: Proposes a principled hybrid framework that bridges the gap between geometry-heavy and learning-heavy approaches to spatial memory in generative world models. The patch-based unit offers a new granularity for memory representation.
  • Practical: Unlocks a suite of controllable video generation capabilities essential for building interactive world simulators:
    • Precise Camera Control: Critical for applications like virtual cinematography, robotics simulation, and VR/AR content creation.
    • Long-Term Consistency: Enables the creation of explorable, persistent digital environments.
    • Direct Scene Editing: Provides an intuitive interface for content creators to manipulate scenes geometrically.
    • Efficient Long-Form Generation: The autoregressive variant (Mosaic Forcing) makes real-time, long-horizon simulation feasible.

Conclusion

MosaicMem presents a hybrid spatial memory paradigm that effectively combines the strengths of explicit and implicit approaches. By lifting patches into 3D for localization and using attention-based conditioning for synthesis, it achieves superior camera control, dynamic scene generation, and long-term consistency. Integrated with PRoPE and evaluated on a new revisit-focused benchmark, MosaicMem advances the state of controllable video world models, enabling minute-level navigation, memory editing, and autoregressive generation. Future work may explore more efficient patch storage/retrieval and integration with planning algorithms for agent training.