Summary of "Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models"
Summary (Overview)
- Identifies a critical limitation: Existing video world models treat environments as static canvases and fail to track dynamic subjects (e.g., people, animals) when they temporarily exit the camera's field of view, leading to frozen, distorted, or vanishing subjects upon re-entry.
- Proposes a new paradigm: Introduces Hybrid Memory, which requires models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring both appearance and motion consistency during out-of-view intervals.
- Constructs a dedicated benchmark: Introduces HM-World, the first large-scale video dataset (59K clips) designed for hybrid memory research, featuring diverse scenes, subjects, and meticulously designed camera motions to induce exit-and-re-entry events.
- Develops a novel architecture: Proposes HyDRA (Hybrid Dynamic Retrieval Attention), a memory mechanism that compresses context into spatiotemporal tokens and uses a dynamic, affinity-based retrieval to selectively recall relevant motion and appearance cues of hidden subjects.
- Demonstrates superior performance: Extensive experiments show HyDRA significantly outperforms state-of-the-art methods in preserving dynamic subject consistency and overall generation quality on the HM-World benchmark.
Introduction and Theoretical Foundation
Video world models show immense potential for simulating physical environments but face a fundamental consistency challenge. Current memory mechanisms in these models are designed primarily for static scenes, excelling at memorizing and reconstructing motionless backgrounds. However, the real world is dynamic, populated by subjects with independent motion logic.
The Core Problem: When a dynamic subject moves outside the camera's view (e.g., due to camera panning), existing models lose track of it. Upon the subject's re-entry, these models often fail, producing outputs where the subject is frozen, distorted, or has vanished entirely. This failure stems from a "static canvas" assumption.
To address this, the paper introduces the Hybrid Memory paradigm. This paradigm demands that a model perform two concurrent cognitive tasks:
- Static Memory: Precisely memorize and reconstruct static background elements from different viewpoints.
- Dynamic Memory: Continuously track and predict the independent motion and appearance of dynamic subjects, even when they are out of sight.
As illustrated in Figure 1 of the paper, hybrid memory requires the model to mentally simulate a subject's unseen trajectory so it can reappear at a plausible location with consistent motion. This is challenging due to:
- Spatiotemporal Decoupling: The model must untangle the camera's ego-motion from the subject's independent trajectory.
- Out-of-View Extrapolation: The model must simulate subject movement without direct visual evidence.
- Feature Entanglement: In standard diffusion latents, static and dynamic features are coupled, making it difficult to retrieve historical context without causing subjects to "freeze" into the background.
Methodology
The proposed framework consists of two main components: the HM-World dataset and the HyDRA memory architecture.
1. HM-World Dataset Construction
Since natural videos with perfect, unoccluded exit-and-re-entry events are scarce, the dataset is synthetically rendered using Unreal Engine 5. The generation pipeline combines four dimensions to ensure diversity and targeted evaluation:
- Scenes: 17 stylistically diverse 3D environments.
- Subjects: 49 distinct subjects (humans and animals).
- Subject Trajectories: 10 predefined motion paths.
- Camera Trajectories: 28 deliberately designed paths with back-and-forth motions to actively induce subjects exiting and re-entering the frame.
After procedural combination and filtering, the final dataset contains 59,225 high-fidelity video clips. Each sample includes the video, a caption, camera poses, per-frame subject positions, and timestamps for exit/entry events.
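As a sanity check on the procedural combination described above, the raw cross product of the four dimensions can be computed directly; the assumption here is a full Cartesian product before filtering, which the paper's pipeline then prunes down to the released 59,225 clips:

```python
# Raw number of (scene, subject, subject-trajectory, camera-trajectory)
# combinations before filtering, per the four dimensions listed above.
scenes, subjects, subj_trajs, cam_trajs = 17, 49, 10, 28
raw_combinations = scenes * subjects * subj_trajs * cam_trajs
print(raw_combinations)  # 233240 raw combinations; filtering yields 59,225 clips
```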
Table 1: Comparison between existing datasets and HM-World.
| Dataset | Reference | Dynamic Subject | Subject Exit-Enter | Subject Pose | Camera Movable | Total Num. |
|---|---|---|---|---|---|---|
| WorldScore | ICCV 25 | ✓ | ✗ | ✗ | ✓ | 3K |
| Context-As-Memory | SIGGRAPH Asia 25 | ✗ | ✗ | ✗ | ✓ | 10K |
| Multi-Cam Video | ICCV 25 | ✓ | ✗ | ✗ | ✓ | 136K |
| 360°-Motion | ICLR 25 | ✓ | ✗ | ✓ | ✗ | 159.4K |
| HM-World (Ours) | - | ✓ | ✓ | ✓ | ✓ | 59K |
2. HyDRA (Hybrid Dynamic Retrieval Attention)
The goal is to predict future frames $X_{tgt}$ given context frames $X_{ctx}$ and a full camera trajectory $P$, while preserving static background and dynamic subject consistency.
Base Architecture: Built upon a full-sequence video diffusion model (Causal 3D VAE + Diffusion Transformer) trained with Flow Matching. The loss function is:

$$\mathcal{L} = \mathbb{E}\left[ \left\| u(z_t, t) - v_t \right\|^2 \right],$$

where $z_t$ is the noised latent at timestep $t$, $v_t = z_0 - z_1$ is the ground-truth velocity, and $u$ is the model predicting $v_t$.
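A minimal numpy sketch of the flow-matching objective, assuming the linear interpolation $z_t = (1-t)\,z_1 + t\,z_0$ (which makes the path velocity equal $v_t = z_0 - z_1$ as defined above); the predictor `u` is a placeholder for the DiT, not the paper's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latents: z0 = clean latent, z1 = Gaussian noise (frames x channels).
z0 = rng.normal(size=(4, 16))
z1 = rng.normal(size=(4, 16))
t = 0.3  # sampled timestep in [0, 1]

# Interpolation consistent with v_t = z0 - z1:
# z_t moves from z1 (t=0) toward z0 (t=1), so dz_t/dt = z0 - z1.
z_t = (1 - t) * z1 + t * z0
v_t = z0 - z1  # ground-truth velocity

def u(z, t):
    """Placeholder for the DiT velocity predictor (here: a trivial guess)."""
    return np.zeros_like(z)

# Flow-matching loss: mean squared error between prediction and velocity.
loss = float(np.mean((u(z_t, t) - v_t) ** 2))
print(loss)
```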
Camera Injection: Camera poses $P = \{(R_i, t_i)\}_{i=1}^f$ are flattened, encoded by an MLP encoder $E_{cam}$, and added element-wise to the latent features:

$$z' = z + E_{cam}(P).$$
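The camera-injection step can be sketched with a small random-weight MLP standing in for $E_{cam}$; the 12-dimensional flattened pose (a $3{\times}3$ rotation plus a 3-vector translation) and the layer widths are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
f, d = 4, 32  # number of frames, latent channel dimension

# Per-frame pose: 3x3 rotation R and 3-vector t, flattened to 12 numbers.
poses = rng.normal(size=(f, 12))

# Minimal 2-layer ReLU MLP standing in for E_cam (random weights).
W1, b1 = rng.normal(size=(12, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.normal(size=(64, d)) * 0.1, np.zeros(d)

def e_cam(p):
    """Encode flattened camera poses to the latent channel dimension."""
    return np.maximum(p @ W1 + b1, 0) @ W2 + b2

z = rng.normal(size=(f, d))   # per-frame latent features
z_cond = z + e_cam(poses)     # element-wise addition of camera embedding
print(z_cond.shape)
```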
Memory Tokenization: Instead of using raw memory latents $Z_{mem}$, a 3D-convolution-based Memory Tokenizer $T_{mem}$ compresses them into spatiotemporally-aware tokens $M$:

$$M = T_{mem}(Z_{mem}).$$
The 3D convolution expands the receptive field to capture long-duration motion information, creating richer representations for retrieval.
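A simplified, non-learned stand-in for $T_{mem}$: average pooling over $(k_t \times k_s \times k_s)$ blocks plays the role of the strided 3D convolution, compressing a $(T, H, W, C)$ latent volume into a flat list of spatiotemporal tokens. The kernel sizes and latent shape here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Memory latents: (T, H, W, C) = time, height, width, channels.
Z_mem = rng.normal(size=(8, 16, 16, 4))

def tokenize(z, kt=2, ks=4):
    """Compress latents into spatiotemporal tokens by pooling over
    (kt x ks x ks) blocks -- a non-learned stand-in for the 3D-conv T_mem."""
    T, H, W, C = z.shape
    z = z[: T - T % kt, : H - H % ks, : W - W % ks]  # crop to block multiples
    z = z.reshape(T // kt, kt, H // ks, ks, W // ks, ks, C)
    tokens = z.mean(axis=(1, 3, 5))   # pool each block -> (T/kt, H/ks, W/ks, C)
    return tokens.reshape(-1, C)      # flatten to a token list

M = tokenize(Z_mem)
print(M.shape)  # (4 * 4 * 4, 4) = (64, 4)
```

A temporal kernel of 2 (vs. 1) lets each token mix information across adjacent frames, which is exactly the long-duration motion capture the paper's ablation shows to matter.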
Dynamic Retrieval Attention: This module replaces standard 3D self-attention in the DiT blocks. For each target frame query $q_i$, the process is:
- Compute Affinity: The query $q_i$ is spatially pooled to align with the memory token keys $k_{mem,j}$. A spatiotemporal affinity score $S_{i,j}$ is computed via a channel-wise inner product across spatial dimensions.
- Top-K Retrieval: The indices $I_i$ of the $K$ memory tokens with the highest affinity are selected.
- Attention with Local Context: The retrieved keys/values $K_{sel}, V_{sel}$ are concatenated with keys/values $K_{loc}, V_{loc}$ from a local temporal window of the target sequence, and standard attention is computed over the combined set.
This mechanism allows the model to selectively attend to the most relevant historical motion and appearance cues when a subject is about to re-enter the frame.
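The three retrieval steps above can be sketched for a single pooled query; dimensions, token counts, and $K$ are illustrative, and random vectors stand in for the real DiT queries/keys/values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mem, n_loc, K = 16, 32, 8, 5

q = rng.normal(size=(d,))            # pooled query for one target frame
k_mem = rng.normal(size=(n_mem, d))  # memory token keys
v_mem = rng.normal(size=(n_mem, d))  # memory token values
k_loc = rng.normal(size=(n_loc, d))  # local temporal-window keys
v_loc = rng.normal(size=(n_loc, d))  # local temporal-window values

# 1) Affinity: inner product between the pooled query and each memory key.
S = k_mem @ q                        # shape (n_mem,)

# 2) Top-K retrieval: keep only the K most relevant memory tokens.
idx = np.argsort(S)[-K:]
k_sel, v_sel = k_mem[idx], v_mem[idx]

# 3) Standard attention over retrieved memory plus the local window.
k_all = np.concatenate([k_sel, k_loc])
v_all = np.concatenate([v_sel, v_loc])
logits = k_all @ q / np.sqrt(d)
w = np.exp(logits - logits.max())
w /= w.sum()                         # softmax attention weights
out = w @ v_all                      # attended output, shape (d,)
print(out.shape)
```

The design point is that only $K + n_{loc}$ tokens enter the attention, so memory can grow long without the quadratic cost of attending to every historical latent.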
Empirical Validation / Results
Evaluation Setup: A test set of 1000 unseen samples from HM-World is used. Metrics include:
- General Fidelity: PSNR, SSIM, LPIPS.
- Frame-level Consistency: Subject Consistency and Background Consistency from VBench.
- Dynamic Subject Consistency (DSC): A newly proposed metric. It crops subject regions from the predicted, ground-truth (GT), and context videos, extracts CLIP features $f_{pred}, f_{GT}, f_{ctx}$, and computes cosine similarity:

$$\text{DSC}_{GT} = \cos(f_{pred}, f_{GT}), \qquad \text{DSC}_{ctx} = \cos(f_{pred}, f_{ctx}).$$

$\text{DSC}_{GT}$ measures fidelity to the true future state, while $\text{DSC}_{ctx}$ measures consistency with historical appearance.
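The DSC computation reduces to cosine similarity between feature vectors; the sketch below uses random 512-dimensional vectors as stand-ins for CLIP features of the cropped subject regions (a real implementation would run a CLIP image encoder on the crops):

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for CLIP features of cropped subject regions.
f_pred = rng.normal(size=512)
f_gt = f_pred + 0.1 * rng.normal(size=512)  # close to the prediction
f_ctx = rng.normal(size=512)                # unrelated appearance

dsc_gt = cosine(f_pred, f_gt)    # high: prediction matches the true future
dsc_ctx = cosine(f_pred, f_ctx)  # near zero: no shared appearance
print(round(dsc_gt, 3), round(dsc_ctx, 3))
```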
Main Results:
Table 2: Quantitative comparison with other methods on HM-World.
| Method | Reference | PSNR | SSIM | LPIPS | $\text{DSC}_{ctx}$ | $\text{DSC}_{GT}$ | Subj. Cons. | Bg. Cons. |
|---|---|---|---|---|---|---|---|---|
| Baseline | - | 18.696 | 0.517 | 0.356 | 0.812 | 0.837 | 0.903 | 0.925 |
| DFoT | ICML 25 | 17.693 | 0.482 | 0.410 | 0.803 | 0.826 | 0.893 | 0.913 |
| Context-as-Memory | SIGGRAPH Asia 25 | 18.921 | 0.530 | 0.342 | 0.816 | 0.839 | 0.911 | 0.922 |
| HyDRA (Ours) | - | 20.357 | 0.606 | 0.289 | 0.827 | 0.849 | 0.926 | 0.932 |
Table 3: Comparison against the commercial model WorldPlay (zero-shot).
| Method | PSNR | SSIM | LPIPS | $\text{DSC}_{ctx}$ | $\text{DSC}_{GT}$ | Subject Consistency | Background Consistency |
|---|---|---|---|---|---|---|---|
| WorldPlay | 14.855 | 0.355 | 0.500 | 0.822 | 0.832 | 0.910 | 0.925 |
| HyDRA (Ours) | 20.357 | 0.606 | 0.289 | 0.827 | 0.849 | 0.926 | 0.932 |
- Key Findings: HyDRA outperforms all compared methods across all metrics. It shows significant gains in reconstruction fidelity (PSNR +1.7 over the baseline) and, most importantly, in dynamic subject consistency ($\text{DSC}_{GT}$ +0.012). It also surpasses the strong zero-shot performance of WorldPlay, indicating the effectiveness of its specialized design.
Qualitative Results: Visual comparisons (Fig. 6 in paper) show that baseline and existing methods suffer from severe subject distortion, vanishing, or incoherent motion during re-entry events. In contrast, HyDRA successfully maintains the subject's identity and motion coherence.
Ablation Studies:
- Memory Tokenizer Kernel Size: A temporal kernel size of 2 is crucial. Reducing it to 1 (no temporal interaction) causes a significant performance drop (PSNR -1.281), validating the need for capturing long-term dynamics.
- Number of Retrieved Tokens: Retrieving too few tokens (5) leads to information loss and artifacts. Settings of 10 or 15 yield optimal and stable performance.
- Retrieval Approach: The proposed dynamic affinity-based retrieval outperforms a static Field-of-View (FOV) overlap method across all metrics, especially Subject Consistency (+0.018). Qualitative analysis shows FOV retrieval can select empty frames, while dynamic affinity successfully retrieves keyframes with subject details.
Theoretical and Practical Implications
- Paradigm Shift: The work moves beyond the "static canvas" assumption prevalent in video world models, formally defining and addressing the challenge of Hybrid Memory. This sets a new direction for research in consistent dynamic world simulation.
- Benchmark Resource: The release of HM-World provides a much-needed, large-scale, and controlled benchmark for the community to train and evaluate models on hybrid memory capabilities, filling a gap in existing datasets.
- Architectural Innovation: HyDRA demonstrates the effectiveness of combining spatiotemporal tokenization with dynamic, content-aware retrieval for managing memory in complex dynamic scenes. This approach can inspire future memory designs for video generation and other sequential tasks requiring long-term consistency.
- Practical Applications: Robust hybrid memory is essential for downstream applications like autonomous driving (tracking occluded pedestrians), embodied AI (maintaining object permanence), and interactive video/game generation, where the world must remain coherent and believable over time and camera movements.
Conclusion
This paper identifies a critical flaw in current video world models—their inability to track dynamic subjects out of view—and proposes a comprehensive solution.
- Main Takeaways: The Hybrid Memory paradigm is necessary for realistic world simulation. The HM-World dataset enables rigorous evaluation of this capability. The HyDRA architecture, with its memory tokenizer and dynamic retrieval attention, effectively implements hybrid memory, significantly improving dynamic subject consistency and overall generation quality.
- Limitations and Future Work: Performance can degrade in highly complex scenes with three or more subjects or severe occlusions. Future work will focus on developing more robust memory mechanisms for multi-subject dynamics and scaling the approach to unconstrained real-world environments.