# MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

> MosaicMem introduces a hybrid spatial memory that combines explicit 3D patch lifting with implicit model conditioning, enabling superior camera control and handling of dynamic objects in video generation.

- **Source:** [arXiv](https://arxiv.org/abs/2603.17117)
- **Published:** 2026-03-20
- **Permalink:** https://picx.dev/p/gGZktc
- **Whiteboard:** https://picx.dev/p/gGZktc/image

## Summary (Overview)
*   **Hybrid Spatial Memory:** Introduces Mosaic Memory (MosaicMem), a novel spatial memory mechanism that combines the **geometric precision of explicit 3D methods** (via patch lifting and warping) with the **dynamic flexibility of implicit memory** (via native model conditioning).
*   **Enhanced Camera Control:** Integrates **Projective Positional Encoding (PRoPE)** as a principled camera-conditioning interface for Diffusion Transformer (DiT) architectures, significantly improving viewpoint controllability and motion adherence.
*   **Improved Performance:** Demonstrates superior performance over both explicit and implicit memory baselines, achieving **more accurate camera motion** than implicit methods and **better handling of dynamic objects** than explicit methods.
*   **New Benchmark:** Introduces **MosaicMem-World**, a new dataset designed to stress-test memory retrieval under complex camera motions and scene revisits, including dynamic objects.
*   **Advanced Capabilities:** Enables **minute-level navigation** with persistent memory, **memory-based scene editing** (e.g., stitching, duplication), and **autoregressive video generation (Mosaic Forcing)** at real-time speeds.

## Introduction and Theoretical Foundation
Recent video diffusion models are evolving into **world simulators** that require long-term consistency under camera motion, revisits, and interventions. A core challenge is **spatial memory**: the mechanism for preserving and reusing scene structure across time.

*   **Explicit Memory (e.g., point clouds, 3D Gaussians):** Builds an external 3D geometric cache. It provides strong geometric consistency for static scenes but **struggles with dynamic, moving objects** and can be brittle for long-horizon updates.
*   **Implicit Memory (e.g., posed frames in latent space):** Stores world state within the model's latent representations. It is flexible for dynamics but often suffers from **camera drift** even with correct poses and is inefficient due to frame-by-frame storage.

**MosaicMem** is proposed as a **hybrid solution**. It uses **patches** as the fundamental memory unit:
1.  **Explicit-style Lifting:** Uses an off-the-shelf 3D estimator to lift image patches into 3D for precise localization.
2.  **Implicit-style Conditioning:** Retrieves and warps these patches to the target view, then conditions the generation model via its native attention mechanisms, allowing it to decide between faithful reconstruction and synthesizing new, prompt-driven content.

This "**patch-and-compose**" approach, akin to assembling a mosaic, selectively fills persistent content while letting the model inpaint evolving elements.

## Methodology
The method builds upon text+image-to-video (TI2V) models trained via **Flow Matching**. The generative process follows a probability-flow ODE:

$$
\frac{d X_{\lambda}}{d\lambda} = u_{\theta}\left( X_{\lambda}, \lambda \mid I, L, C, M \right), \quad X_1 = X_0 + \int_{0}^{1} u_{\theta}\left( X_{\lambda}, \lambda \mid I, L, C, M \right) d\lambda
$$

where $X_{\lambda}$ is the video state, $I$ is an input image, $L$ are text prompts, $C$ are camera poses, and $M$ is the **MosaicMem** spatial memory.
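
To make the integral above concrete, here is a minimal sketch of sampling by fixed-step Euler integration of the probability-flow ODE. The `velocity` callable stands in for $u_{\theta}$ with the conditioning $(I, L, C, M)$ already bound; the function name `euler_sample`, the step count, and the toy velocity field are assumptions for illustration, not the solver or network used in the paper.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps=50):
    """Integrate dX/dlambda = velocity(X, lambda) from lambda=0 to lambda=1
    with fixed-step forward Euler. `velocity` is a hypothetical stand-in
    for the trained network u_theta with its conditioning already bound."""
    x = np.array(x0, dtype=float)
    dlam = 1.0 / num_steps
    for step in range(num_steps):
        lam = step * dlam
        x = x + dlam * velocity(x, lam)
    return x

# Toy check: for a straight-line path from x0 to `target`, the velocity is
# the constant (target - x0), so Euler integration lands on the target.
x0 = np.zeros(3)
target = np.array([1.0, -2.0, 0.5])
x1 = euler_sample(lambda x, lam: target - x0, x0, num_steps=10)
```

In a real model the velocity depends on both the state and $\lambda$, so the step count trades sampling speed against integration error.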

### Mosaic Memory Pipeline
1.  **Patch Lifting:** For a source patch $P$ with depth $D$ and camera parameters $(K_i, T_i)$, lift it into 3D world coordinates.
2.  **Patch Retrieval & Warping:** For a target camera view $(K_j, T_j)$, reproject the 3D patch location to get target coordinates $(u', v')$:
    $$(u', v') = \Pi\left( K_j T_j T_i^{-1} K_i^{-1} (u, v, D) \right)$$
    where $\Pi(\cdot)$ is perspective projection.
3.  **Memory Alignment:** Two complementary warping mechanisms ensure geometric consistency:
    *   **Warped RoPE:** Applies the reprojected coordinates $(j, u', v')$ directly to the Rotary Position Embedding (RoPE) of the memory tokens.
    *   **Warped Latent:** Uses bilinear sampling to spatially warp the source latent features based on $(u', v')$.
4.  **Conditioning:** The retrieved and warped memory patches are flattened and concatenated to the input token sequence as conditioning context for the DiT.
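
The lifting and reprojection steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the paper's implementation: $K$ is taken to be a 3x3 pinhole intrinsics matrix, $T$ a 4x4 world-to-camera extrinsic matrix (so $T_i^{-1}$ maps camera $i$ back to the world frame), and $D$ a z-depth, with the pixel written in homogeneous form $(u, v, 1)$ and scaled by $D$. The function names and the visibility test are hypothetical. `bilinear_sample` is a generic lookup at fractional coordinates, the kind of operation a latent warp relies on.

```python
import numpy as np

def lift_to_world(uv, depth, K_i, T_i):
    """Back-project pixels uv (N, 2) with z-depth (N,) into world points (N, 3).
    Assumes T_i is a 4x4 world-to-camera matrix."""
    n = uv.shape[0]
    pix_h = np.hstack([uv, np.ones((n, 1))]).T        # (3, N) homogeneous pixels
    cam = (np.linalg.inv(K_i) @ pix_h) * depth        # rays with z=1, scaled by depth
    cam_h = np.vstack([cam, np.ones((1, n))])         # (4, N)
    return (np.linalg.inv(T_i) @ cam_h)[:3].T

def project_to_view(world, K_j, T_j, width, height):
    """Project world points (N, 3) into camera j; returns (u', v') and a visibility mask."""
    n = world.shape[0]
    cam = (T_j @ np.hstack([world, np.ones((n, 1))]).T)[:3]   # (3, N) in camera j
    pix = K_j @ cam
    in_front = cam[2] > 1e-6
    safe_z = np.where(in_front, pix[2], 1.0)          # avoid dividing by z <= 0
    uv = (pix[:2] / safe_z).T
    visible = (in_front
               & (uv[:, 0] >= 0) & (uv[:, 0] <= width - 1)
               & (uv[:, 1] >= 0) & (uv[:, 1] <= height - 1))
    return uv, visible

def bilinear_sample(feat, uv):
    """Sample a feature map feat (H, W, C) at fractional pixel coords uv (N, 2)."""
    h, w, _ = feat.shape
    x = np.clip(uv[:, 0], 0, w - 1)
    y = np.clip(uv[:, 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy
```

With identical source and target cameras the reprojected coordinates equal the originals, which is a quick sanity check for the chosen conventions.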

### PRoPE for Camera Control
To provide fine-grained, frame-accurate camera guidance, **Projective Positional Encoding (PRoPE)** is integrated. It encodes relative camera geometry between views via a projective transform $\tilde{P}_{i_1} \tilde{P}_{i_2}^{-1}$ and injects it into self-attention using a GTA-style transformed attention mechanism:

$$
\text{Attn}_{\text{PRoPE}}(Q, K, V) = D \odot \text{Attn}\left( D^{\top} \odot Q, D^{-1} \odot K, D^{-1} \odot V \right)
$$

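Here $D$ collects one block-diagonal matrix per token $t$, $D_t^{\text{PRoPE}} = \begin{bmatrix} D_t^{\text{Proj}} & 0 \\ 0 & D_t^{\text{RoPE}} \end{bmatrix}$ with $D_t^{\text{Proj}} = I_{d/8} \otimes \tilde{P}_{i(t)}$, where $i(t)$ is the camera index of token $t$ and $\odot$ applies each token's matrix to that token's feature vector. This $D$ is unrelated to the patch depth $D$ in the Mosaic Memory Pipeline. Because queries are multiplied by $D^{\top}$ and keys by $D^{-1}$, the attention logit between tokens $t_1$ and $t_2$ depends only on the relative transform $\tilde{P}_{i(t_1)} \tilde{P}_{i(t_2)}^{-1}$. For temporally compressed latents (factor $s=4$), each latent frame $\ell$ covers four original frames, and their camera matrices $\{ \tilde{P}_{\ell,k} \}_{k=0}^{3}$ are broadcast into the attention operation.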
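
The transformed-attention formula can be illustrated with a small single-head NumPy sketch. It is a simplification under stated assumptions, not the paper's implementation: only the projective block is modeled (the RoPE block is omitted), so each token's matrix is $I_{d/4} \otimes \tilde{P}_{i(t)}$ over the whole feature dimension, and the camera matrices are arbitrary invertible 4x4 arrays. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Plain single-head scaled dot-product attention; q, k, v are (T, d)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def apply_per_token(mats, x):
    """Apply (I kron mats[t]) to token x[t]: each consecutive 4-vector chunk
    of the feature is multiplied by that token's 4x4 matrix.
    mats: (T, 4, 4); x: (T, d) with d divisible by 4."""
    t, d = x.shape
    chunks = x.reshape(t, d // 4, 4)
    return np.einsum("tij,tbj->tbi", mats, chunks).reshape(t, d)

def prope_attention(q, k, v, P):
    """GTA-style attention: D . Attn(D^T . Q, D^-1 . K, D^-1 . V),
    with D built only from the per-token camera matrices P (T, 4, 4)."""
    P_inv = np.linalg.inv(P)                 # batched inverse over tokens
    P_t = np.transpose(P, (0, 2, 1))         # per-token transpose
    out = attention(apply_per_token(P_t, q),
                    apply_per_token(P_inv, k),
                    apply_per_token(P_inv, v))
    return apply_per_token(P, out)
```

One property worth checking with this sketch: right-multiplying every camera matrix by the same invertible matrix (a change of world frame) leaves the output unchanged, since only the products $\tilde{P}_{i_1} \tilde{P}_{i_2}^{-1}$ enter the computation.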

## Empirical Validation / Results
The model is fine-tuned from **Wan 2.2 (5B parameters)**. Evaluation uses four metric groups: visual quality (FID, FVD), camera accuracy (RotErr, TransErr), motion dynamics (Average Optical Flow), and memory retrieval consistency (SSIM, PSNR, LPIPS within corresponding regions).
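
For readers unfamiliar with the camera and consistency metrics, the sketch below shows common textbook definitions: the geodesic angle between rotations, the Euclidean distance between translations, and PSNR restricted to a region mask. The summary does not state the exact formulas, units, or pose-estimation procedure used in the paper, so the function names, the choice of degrees, and the mask convention are assumptions.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle (in degrees) between two 3x3 rotation matrices."""
    R_rel = R_gt.T @ R_est
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and reference camera translations."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

def masked_psnr(pred, ref, mask, max_val=1.0):
    """PSNR over pixels where the boolean mask (H, W) is True.
    pred and ref are (H, W, C) images with values in [0, max_val]."""
    diff = (pred - ref)[mask]
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Restricting PSNR to a mask mirrors the idea of scoring only the corresponding (revisited) regions rather than the whole frame.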

### Quantitative Comparisons
**Table 1:** Quantitative comparison across memory paradigms and ablations. MosaicMem (full) achieves the best performance.

| Method | RotErr (↓) | TransErr (↓) | FID (↓) | FVD (↓) | SSIM (↑) | PSNR (↑) | LPIPS (↓) | Dynamic Score (↑) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Explicit Memory** | | | | | | | | |
| VMem [21] | 1.59 | 0.14 | 77.12 | 363.34 | 0.64 | 21.64 | 0.17 | 1.18 |
| GEN3C [28] | 1.61 | 0.13 | 77.41 | 372.08 | 0.64 | 21.58 | 0.17 | 1.21 |
| SEVA [48] | 1.42 | 0.12 | 74.67 | 301.77 | 0.66 | 22.01 | 0.15 | 1.22 |
| VWM [37] | 1.50 | 0.13 | 75.83 | 323.67 | 0.65 | 21.86 | 0.16 | 1.41 |
| **Implicit Memory** | | | | | | | | |
| WorldMem [39] | 5.87 | 0.49 | 85.72 | 403.50 | 0.47 | 15.34 | 0.46 | 1.67 |
| CaM [41] | 4.65 | 0.43 | 85.32 | 392.11 | 0.49 | 15.78 | 0.42 | 1.72 |
| **Ablations** | | | | | | | | |
| ControlMLP alone | 6.51 | 0.52 | 89.17 | 458.45 | 0.37 | 13.55 | 0.56 | 1.84 |
| PRoPE alone | 4.91 | 0.36 | 86.44 | 412.85 | 0.45 | 14.32 | 0.52 | 1.75 |
| MosaicMem w/o PRoPE | 0.79 | 0.11 | 73.18 | 250.84 | 0.68 | 22.33 | 0.14 | 2.11 |
| PRoPE + Warped Latent | 0.66 | 0.08 | 75.46 | 268.13 | 0.65 | 21.49 | 0.15 | 1.98 |
| PRoPE + Warped RoPE | 0.70 | 0.09 | 71.89 | 243.59 | 0.69 | 22.80 | 0.12 | 2.24 |
| **MosaicMem (full)** | **0.51** | **0.06** | **65.67** | **232.95** | **0.75** | **23.57** | **0.11** | **2.58** |

**Key Findings:**
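*   **vs. Explicit Memory:** MosaicMem improves every consistency metric (SSIM 0.75 vs. 0.64 to 0.66, PSNR 23.57 vs. 21.58 to 22.01, LPIPS 0.11 vs. 0.15 to 0.17) while generating far more dynamic content (Dynamic Score 2.58 vs. 1.18 to 1.41).
*   **vs. Implicit Memory:** MosaicMem cuts RotErr from 4.65 to 5.87 down to 0.51 and TransErr from 0.43 to 0.49 down to 0.06, and raises SSIM from 0.47 to 0.49 up to 0.75.
*   **Ablations:** Both PRoPE and the MosaicMem components matter. Without memory, camera control is weak (RotErr 6.51 for ControlMLP alone, 4.91 for PRoPE alone); MosaicMem without PRoPE reaches 0.79; and the full model with both warping strategies performs best (0.51), ahead of PRoPE with only Warped Latent (0.66) or only Warped RoPE (0.70).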

### Qualitative Results & Advanced Applications
*   **Dynamic Object Generation:** MosaicMem successfully generates prompt-driven dynamic objects (e.g., a knight riding a horse), while explicit baselines produce static scenes (see Fig. 4).
*   **Long-Horizon Navigation:** Enables generation of **2-minute coherent videos** with consistent revisits, significantly outperforming implicit baselines which suffer from drift and artifact accumulation (see Fig. 6).
*   **Memory Manipulation & Scene Editing:** By manipulating the 3D locations of stored patches, MosaicMem enables scene stitching (e.g., connecting medieval and modern environments) and creative edits like creating "Inception"-style inverted scenes (see Fig. 7).
*   **Autoregressive Generation (Mosaic Forcing):** Distilled into a causal model, it achieves **16 FPS** generation. It outperforms other AR systems (RELIC, Matrix-Game) in quality and consistency, especially under large camera motions (see Table 2, Fig. 8).
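
The memory-manipulation bullet above comes down to applying rigid (or other affine) transforms to the stored 3D patch locations before retrieval. The sketch below illustrates that idea with a hypothetical `PatchMemory` container holding world-space patch centers and their features. The data layout and function names are assumptions for illustration, not the paper's data structure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PatchMemory:
    """Hypothetical container: one 3D location and one feature vector per patch."""
    positions: np.ndarray   # (N, 3) patch centers in world coordinates
    features: np.ndarray    # (N, C) patch features

def transform_memory(mem, T):
    """Return a copy of the memory with a 4x4 transform T applied to every patch location."""
    n = mem.positions.shape[0]
    pts_h = np.hstack([mem.positions, np.ones((n, 1))])
    moved = (T @ pts_h.T).T[:, :3]
    return PatchMemory(moved, mem.features.copy())

def stitch(a, b):
    """Merge two memories into one scene by concatenating their patches."""
    return PatchMemory(np.vstack([a.positions, b.positions]),
                       np.vstack([a.features, b.features]))

def translation(dx, dy, dz):
    """4x4 homogeneous translation matrix."""
    T = np.eye(4)
    T[:3, 3] = [dx, dy, dz]
    return T

# Duplication: place a shifted copy of a scene next to the original.
scene = PatchMemory(np.array([[0.0, 0.0, 1.0], [1.0, 2.0, 3.0]]), np.ones((2, 4)))
duplicated = stitch(scene, transform_memory(scene, translation(5.0, 0.0, 0.0)))

# Inversion: flip the vertical axis (assumed here to be y) to mirror a scene upside down.
inverted = transform_memory(scene, np.diag([1.0, -1.0, 1.0, 1.0]))
```

Because retrieval works from patch locations, an edited memory of this kind would be reprojected into the target view like any other stored content.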

**Table 2:** Quantitative comparison on autoregressive video generation (Mosaic Forcing). Columns Total through Imaging Quality form the Quality Score group, PSNR and SSIM the Consistency group, and RotErr and TransErr the Camera Control group.

| Method | Total (↑) | Subject Consist (↑) | Bg Consist (↑) | Motion Smooth (↑) | Temporal Flicker (↑) | Aesthetic Quality (↑) | Imaging Quality (↑) | PSNR (↑) | SSIM (↑) | RotErr (↓) | TransErr (↓) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Matrix-Game | 75.11 | 82.40 | 87.92 | 88.35 | 89.10 | 43.12 | 59.77 | 18.57 | 0.524 | 5.32 | 0.38 |
| RELIC | 79.08 | 86.21 | 91.08 | 94.12 | 92.05 | 47.01 | 64.02 | 20.23 | 0.591 | 4.99 | 0.36 |
| MosaicMem-WRoPE | 77.81 | 85.03 | 90.41 | 92.73 | 91.22 | 45.88 | 61.60 | 19.01 | 0.566 | 1.63 | 0.16 |
| **MosaicMem (full)** | **81.11** | **88.32** | **93.40** | **96.58** | **94.21** | **48.15** | **65.97** | **21.57** | **0.652** | **0.89** | **0.11** |

## Theoretical and Practical Implications
*   **Theoretical:** Proposes a principled hybrid framework that bridges the gap between geometry-heavy and learning-heavy approaches to spatial memory in generative world models. The patch-based unit offers a new granularity for memory representation.
*   **Practical:** Unlocks a suite of controllable video generation capabilities essential for building interactive world simulators:
    *   **Precise Camera Control:** Critical for applications like virtual cinematography, robotics simulation, and VR/AR content creation.
    *   **Long-Term Consistency:** Enables the creation of explorable, persistent digital environments.
    *   **Direct Scene Editing:** Provides an intuitive interface for content creators to manipulate scenes geometrically.
    *   **Efficient Long-Form Generation:** The autoregressive variant (Mosaic Forcing) makes real-time, long-horizon simulation feasible.

## Conclusion
MosaicMem presents a **hybrid spatial memory** paradigm that effectively combines the strengths of explicit and implicit approaches. By **lifting patches into 3D for localization** and using **attention-based conditioning for synthesis**, it achieves superior camera control, dynamic scene generation, and long-term consistency. Integrated with **PRoPE** and evaluated on a new **revisit-focused benchmark**, MosaicMem advances the state of controllable video world models, enabling **minute-level navigation, memory editing, and autoregressive generation**. Future work may explore more efficient patch storage/retrieval and integration with planning algorithms for agent training.

---

_Markdown view of https://picx.dev/p/gGZktc, served by PicX — AI-generated visual whiteboard summaries of research papers._