Summary (Overview)

  • Decoupled Perception and Reasoning: MemDreamer introduces a novel paradigm that separates video perception (streaming hierarchical graph construction) from reasoning (agentic tool-augmented retrieval), overcoming token explosion and attention dilution inherent in end-to-end long-video VLMs.
  • Hierarchical Graph Memory (HGM): A three-tier (Video Root → Super Events → Macro Events) coarse-to-fine abstraction with local subgraphs encoding entities, micro-events, and spatiotemporal-causal edges—enabling efficient semantic navigation.
  • Agentic Tool-Augmented Retrieval: A multi-step Observation-Reason-Action loop using three tool categories (Navigation, Search, Graph Traversal) to actively explore the memory, avoiding passive full-context ingestion.
  • State-of-the-Art Results: Achieves SOTA on four benchmarks (LVBench 90.7, LongVideoBench 92.9, Video-MME 92.1, EgoSchema 88.2), narrowing the gap to human experts to 3.7 points on LVBench. Context window reduced to ~2% of full-video ingestion (5.9–6.3K vs. 240–784K tokens), with a 12.5-point absolute gain over the strongest end-to-end baseline.
  • Key Insight: Establishes a strong positive linear correlation (Pearson (R=0.897), (p<0.01)) between a VLM’s agentic reasoning ability (AIME 2025) and long-video performance, shifting the optimization target from brute-force token scaling to agentic capability scaling.

Introduction and Theoretical Foundation

Background: Understanding hours-long videos is a frontier for Vision-Language Models (VLMs) and is critical for embodied intelligence. Current VLMs process videos end-to-end by flattening frames into massive token streams, which causes two problems:

  • Token explosion: A 2-hour video at 1 FPS yields over 1.6M tokens.
  • Attention dilution: Redundant tokens cause “lost in the middle” (Liu et al., 2024) and degrade long-range reasoning.
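
The token-explosion figure can be sanity-checked with simple arithmetic. The sketch below uses only the numbers quoted above (2 hours, 1 FPS, 1.6M tokens); the per-frame token budget it derives is an implied lower bound, not a value stated in the source.

```python
# Back-of-envelope check of the token-explosion figure quoted above.
DURATION_S = 2 * 60 * 60      # 2-hour video, in seconds
FPS = 1                       # sampling rate quoted in the text
TOTAL_TOKENS = 1_600_000      # "over 1.6M tokens" (lower bound from the text)

frames = DURATION_S * FPS
# Implied minimum visual tokens per frame; derived here, not stated in the source.
implied_tokens_per_frame = TOTAL_TOKENS / frames

print(frames)                           # 7200
print(round(implied_tokens_per_frame))  # 222
```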

Motivation: Human comprehension is hierarchical (coarse plot → scenes → events → actions) and relational (spatiotemporal and causal connections). Existing flat or chunk-based memory systems lose global perspective and sever causal links, causing decoupled reasoning to degenerate into myopic exhaustive retrieval.

Theoretical Basis: MemDreamer formalizes long-video understanding as a two-stage decoupled process:

  1. Memory Construction: A perception model (P) incrementally builds a structured, purely textual Hierarchical Graph Memory (\mathcal{G}) from the video stream (V).
  2. Agentic Retrieval: A reasoning model (R) equipped with a tool bank (\mathcal{T}) actively explores (\mathcal{G}) via an Observation-Reason-Action loop, extracting concise task-relevant clues (\mathcal{C}) to answer query (Q): (A = R(Q, \mathcal{C})).

This decoupling shifts the problem from passive token consumption to active multi-step exploration, bypassing context limits and enabling the direct transfer of agentic reasoning capabilities to long-video tasks.

Methodology

3.1 Hierarchical Graph Memory Construction

Streaming Adaptive Segmentation: Instead of fixed-length chunks, the system uses a semantic boundary detector with a sliding window of maximum duration (\tau = 10) minutes. At each iteration (k), the perception model resolves a set of complete Macro Events (\{e_1,\dots,e_N\}) within window (W_k = [t_k^{\text{start}}, t_k^{\text{start}}+\tau]), and the end of the last event becomes the next window's start.
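
The window-advance rule can be illustrated with a minimal sketch. The names `segment_stream` and `toy_detector` are hypothetical, the detector is a stand-in for the perception model, and the fallback used when no complete event is found is an assumption added only to guarantee progress; the source does not describe that case.

```python
from typing import Callable, List, Tuple

Event = Tuple[float, float]  # (start_sec, end_sec) of one Macro Event

def segment_stream(duration: float, tau: float,
                   detect_events: Callable[[float, float], List[Event]]) -> List[Event]:
    """Slide a window of at most `tau` seconds; restart at the last complete event's end."""
    events: List[Event] = []
    start = 0.0
    while start < duration:
        window_end = min(start + tau, duration)
        complete = detect_events(start, window_end)
        if not complete or complete[-1][1] <= start:
            # Assumption (not from the source): treat the whole window as one
            # event so the loop always moves forward.
            complete = [(start, window_end)]
        events.extend(complete)
        start = complete[-1][1]  # end of the last event starts the next window
    return events

def toy_detector(boundaries: List[float]) -> Callable[[float, float], List[Event]]:
    """Hypothetical detector: returns events that lie fully inside the window."""
    def detect(start: float, end: float) -> List[Event]:
        return [(a, b) for a, b in zip(boundaries, boundaries[1:])
                if a >= start and b <= end]
    return detect

# Toy boundaries in seconds; tau = 600 s corresponds to the 10-minute window.
events = segment_stream(1800, 600, toy_detector([0, 240, 500, 900, 1300, 1800]))
print(events)
```

In this toy run the event spanning 500 s to 900 s is incomplete in the first window, so it is deferred and resolved in the second window rather than being cut at the 600 s mark.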

Downward Subgraph Extraction: For each Macro Event (e_i), a local subgraph (g_i = (V_i, E_i)) is constructed with two node types:

  • (V_i^E): Entities (Person, Object, Location, Group) with attributes.
  • (V_i^M): Micro-events (atomic actions with temporal extent).

Edges fall into three categories:

  • Spatial-attribute ((V_i^E \leftrightarrow V_i^E)): LOCATED_IN, NEXT_TO, etc.
  • Subject-object ((V_i^E \leftrightarrow V_i^M)): PERFORMS, RECEIVES, etc.
  • Temporal-causal ((V_i^M \rightarrow V_i^M)): BEFORE, CAUSES, PREVENTS, etc.
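
One way to picture the local subgraph is as a typed node and edge store that rejects edges violating the schema above. This is an illustrative sketch, not the authors' data structure: the `Subgraph` class, the node-kind strings, and the fixed source-to-target direction for each relation are assumptions, and only relations named in the text are included.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Relation -> (source node kind, target node kind). The direction chosen for
# each relation is an assumption made for this sketch.
EDGE_SCHEMA: Dict[str, Tuple[str, str]] = {
    "LOCATED_IN": ("entity", "entity"),
    "NEXT_TO": ("entity", "entity"),
    "PERFORMS": ("entity", "micro_event"),
    "RECEIVES": ("entity", "micro_event"),
    "BEFORE": ("micro_event", "micro_event"),
    "CAUSES": ("micro_event", "micro_event"),
    "PREVENTS": ("micro_event", "micro_event"),
}

@dataclass
class Subgraph:
    nodes: Dict[str, str] = field(default_factory=dict)               # node id -> kind
    edges: List[Tuple[str, str, str]] = field(default_factory=list)   # (src, relation, dst)

    def add_node(self, node_id: str, kind: str) -> None:
        if kind not in ("entity", "micro_event"):
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        expected = EDGE_SCHEMA.get(relation)
        if expected is None:
            raise ValueError(f"unknown relation: {relation}")
        actual = (self.nodes[src], self.nodes[dst])
        if actual != expected:
            raise ValueError(f"{relation} expects {expected}, got {actual}")
        self.edges.append((src, relation, dst))

# Toy Macro Event: a chef drops a pan, which sets off an alarm.
g = Subgraph()
g.add_node("chef", "entity")
g.add_node("kitchen", "entity")
g.add_node("drops_pan", "micro_event")
g.add_node("alarm_rings", "micro_event")
g.add_edge("chef", "LOCATED_IN", "kitchen")
g.add_edge("chef", "PERFORMS", "drops_pan")
g.add_edge("drops_pan", "CAUSES", "alarm_rings")
```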

Upward Hierarchical Aggregation: Macro Event summaries are clustered bottom-up by temporal adjacency and semantic affinity to form Super Events, which are further aggregated into a single Video Root node. Cross-tier hierarchical edges ((v_{Mc} \rightarrow v_S \rightarrow v_R)) are established. See Table 1 for the complete schema.
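
The source does not specify the clustering algorithm, so the sketch below substitutes a simple stand-in: a greedy pass that merges a Macro Event into the current Super Event when its summary embedding is similar enough to the temporally preceding one. The function names, the cosine threshold, and the 2-D toy embeddings are all assumptions for illustration.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def group_adjacent(embeddings: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Greedily group temporally adjacent Macro Events with similar summaries."""
    groups: List[List[int]] = []
    for i, emb in enumerate(embeddings):
        # Compare against the last member of the current group (temporal adjacency).
        if groups and cosine(embeddings[groups[-1][-1]], emb) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

# Toy summary embeddings for five Macro Events, in temporal order.
macro_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 0.0)]
super_events = group_adjacent(macro_embeddings, threshold=0.8)

# Cross-tier edges: Video Root -> Super Event -> Macro Event indices.
hierarchy = {"video_root": {f"super_{s}": members for s, members in enumerate(super_events)}}
print(hierarchy)
```

Because grouping only ever looks at the neighbouring event, the last Macro Event stays separate from the first two even though their embeddings are identical, which is the behaviour temporal adjacency implies.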

3.2 Agentic Tool-Augmented Retrieval

Multi-Dimensional Tool Bank (Table 2):

  1. Hierarchical Navigation: GetSummary, GetSuperEvent, GetMacroEvent, GetSubgraph – for coarse-to-fine hierarchical traversal.
  2. Precise Search: SearchNodes (dense embedding similarity), SearchByTime (temporal localization) – for rapid node localization.
  3. Graph Traversal: GetRelationGraph – for multi-hop causal/entity tracing along topological edges.
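
Multi-hop tracing along topological edges can be sketched as a bounded breadth-first search. The function below borrows the tool name GetRelationGraph, but its signature, the edge-tuple layout, and the hop limit are assumptions; the source does not give the tool's actual interface.

```python
from collections import deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (src, relation, dst)

def get_relation_graph(edges: List[Edge], start: str, max_hops: int) -> List[Edge]:
    """Return edges reachable from `start` within `max_hops`, following edge direction."""
    outgoing: Dict[str, List[Edge]] = {}
    for e in edges:
        outgoing.setdefault(e[0], []).append(e)

    seen = {start}
    frontier = deque([(start, 0)])
    found: List[Edge] = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # hop budget exhausted on this branch
        for e in outgoing.get(node, []):
            found.append(e)
            if e[2] not in seen:
                seen.add(e[2])
                frontier.append((e[2], depth + 1))
    return found

# Toy causal chain of micro-events.
toy_edges = [
    ("drops_pan", "CAUSES", "alarm_rings"),
    ("alarm_rings", "CAUSES", "guests_leave"),
    ("guests_leave", "BEFORE", "chef_cleans"),
]
two_hop = get_relation_graph(toy_edges, "drops_pan", max_hops=2)
print(two_hop)
```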

Observation-Reason-Action Loop: At step (t), the reasoning model (R) selects action (a_t):

[ a_t = R(Q, H_{t-1}) \tag{1} ]

where (H_{t-1}) is the historical execution trajectory. Executing (a_t) produces observation (o_t) from (\mathcal{G}). Then (R) distills task-relevant clues:

[ c_t = R(o_t, Q) \tag{2} ]

The trajectory is updated:

[ H_t = H_{t-1} \cup \{(a_t, c_t)\} \tag{3} ]

This selective compression prevents context pollution. The loop terminates when sufficient evidence is gathered (max 12 rounds).
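
Equations (1) to (3) map onto a short control loop. In the sketch below the reasoning model is replaced by two hypothetical callables, `select_action` and `distill`, and the tools by plain functions over a toy memory; none of these stubs reflect the actual prompts or tool signatures. The trajectory is kept as an ordered list rather than a set so that step order is preserved.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = Tuple[str, str]              # (tool name, argument)
History = List[Tuple[Action, str]]    # [(a_t, c_t), ...]

def run_loop(query: str,
             select_action: Callable[[str, History], Optional[Action]],
             distill: Callable[[str, str], str],
             tools: Dict[str, Callable[[str], str]],
             max_rounds: int = 12) -> History:
    history: History = []
    for _ in range(max_rounds):
        action = select_action(query, history)   # Eq. (1): a_t = R(Q, H_{t-1})
        if action is None:                        # model judges the evidence sufficient
            break
        name, arg = action
        observation = tools[name](arg)            # o_t, read from the memory
        clue = distill(observation, query)        # Eq. (2): c_t = R(o_t, Q)
        history.append((action, clue))            # Eq. (3): only (a_t, c_t) is kept
    return history

# Hypothetical stubs standing in for the reasoning model and the tool bank.
toy_tools = {
    "GetSummary": lambda _: "A dinner party goes wrong. Many other details follow.",
    "SearchNodes": lambda q: f"{q}_rings at 00:41:10. Caused by drops_pan.",
}

def scripted_policy(query: str, history: History) -> Optional[Action]:
    script = [("GetSummary", ""), ("SearchNodes", "alarm")]
    return script[len(history)] if len(history) < len(script) else None

def first_sentence(observation: str, query: str) -> str:
    return observation.split(".")[0]  # crude stand-in for clue distillation

trajectory = run_loop("Why did the alarm ring?", scripted_policy, first_sentence, toy_tools)
print(trajectory)
```

Only the distilled clue enters the trajectory, not the raw observation, which is how the loop keeps the reasoning context small.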

Empirical Validation / Results

Experimental Setup: Evaluated on four benchmarks: LVBench (103 videos, 30min–2hr, 1,549 QA), LongVideoBench (753 videos, 1,337 QA), Video-MME long split (300 videos, 900 QA), EgoSchema (egocentric reasoning). Perception model: Gemini-3.1-Pro; reasoning models: Gemini-3.1-Pro, Gemini-2.5-Pro, Qwen3-VL-235B-A22B-Thinking. Embedding: Qwen3-Embedding. Max tool calls = 12.

Main Results (Table 3):

Method                    | Reason Model      | LVBench | LongVideoBench | Video-MME (Long) | EgoSchema
MemDreamer                | Gemini-3.1-Pro    | 90.7    | 92.9           | 92.1             | 87.8
MemDreamer                | Qwen3-VL-Thinking | 84.8    | 86.3           | 86.2             | 87.4
End-to-End Gemini-3.1-Pro | -                 | 78.2    | 78.6           | 80.3             | 76.4
Human Expert              | -                 | 94.4    | -              | -                | -

MemDreamer with Gemini-3.1-Pro achieves absolute gains of +12.5 (LVBench), +14.3 (LongVideoBench), and +11.8 (Video-MME) over the strongest end-to-end baseline. On LVBench, the remaining gap to the human expert score (94.4) is 3.7 points.

Context Window Reduction (Table 4):

Method                    | Context Window | LVBench
End-to-End Gemini-3.1-Pro | 265K tokens    | 78.2
MemDreamer (Reasoning)    | 6.2K tokens    | 90.7
Reduction factor          | ~43×           | -

MemDreamer uses only ~2% of the full-video context window while improving accuracy by 12.5 points.
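
The reduction factor and the "~2%" figure follow directly from the two context sizes in Table 4. The short check below uses only values quoted in this section.

```python
# Values copied from Table 4 and the main results above.
END_TO_END_TOKENS = 265_000   # end-to-end Gemini-3.1-Pro context window
MEMDREAMER_TOKENS = 6_200     # MemDreamer reasoning context window

reduction = END_TO_END_TOKENS / MEMDREAMER_TOKENS
fraction = MEMDREAMER_TOKENS / END_TO_END_TOKENS
gain = 90.7 - 78.2            # LVBench accuracy difference

print(f"{reduction:.1f}x")    # 42.7x
print(f"{fraction:.1%}")      # 2.3%
print(f"{gain:.1f}")          # 12.5
```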

Correlation with Agentic Reasoning (Table 5): Pearson correlation between AIME 2025 (agentic reasoning benchmark) and LVBench:

  • End-to-end: (R = 0.702), (p = 0.052) (not significant)
  • With MemDreamer: (R = 0.897), (p < 0.01) (strong, significant)
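
For readers unfamiliar with the statistic, the sketch below shows how a Pearson coefficient and its t-statistic are computed. The score lists are toy placeholders, not values from Table 5, and the number of models in the original analysis is not stated in this summary, so no attempt is made to reproduce the reported (R) or (p) values.

```python
import math
from typing import Sequence

def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r: float, n: int) -> float:
    """t-statistic for testing r != 0 with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Toy placeholder scores (NOT from the paper): one pair per hypothetical model.
reasoning_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
video_scores = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(reasoning_scores, video_scores)
t = t_statistic(r, len(reasoning_scores))
print(round(r, 3), round(t, 3))
```

The p-value is then read from a Student t distribution with n - 2 degrees of freedom, which is why the same (R) can be significant or not depending on how many models are compared.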

Ablation Studies:

  • Memory Construction (Table 7): Full Hierarchical-Graph = 90.7 vs. Flat-Chunk = 77.4; the gain comes from the hierarchy and causal edges that the flat variant lacks.
  • Retrieval Strategy (Table 8): Agentic Full Tools = 90.7 vs. Vanilla Embedding Similarity = 70.5 vs. Full Memory Context = 78.9.
  • Tool Categories (Table 11): Full toolkit outperforms any subset; Graph Traversal gives the largest single boost (+6.6).
  • Round Budget (Table 9): Optimal at (T_{\max}=12) (90.7); average rounds used ~3.0, showing self-termination.
  • Search top-(k) (Table 10): Optimal at (k=10) (88.7 with (T_{\max}=8)); too few misses evidence, too many dilutes relevance.

Theoretical and Practical Implications

Theoretical:

  • Establishes that long-video understanding is fundamentally a reasoning problem, not just a perception scaling problem. The strong correlation (Pearson (R=0.897)) between agentic reasoning (AIME) and long-video performance suggests optimizing for intrinsic reasoning ability—rather than context length—is a more effective scaling direction.
  • The decoupled paradigm reveals that perception and reasoning can be separated: the perception model only needs to process short clips (<10 min), and its quality has minimal impact on final performance (Table 6: <1.5 point fluctuation when swapping perception backbone).
  • Provides design principles for multimodal memory systems: hierarchical abstraction with topological edges (spatiotemporal/causal) outperforms flat or purely sequential storage.

Practical:

  • MemDreamer is plug-and-play: it can be applied to any VLM backbone without retraining and immediately boosts performance (e.g., Qwen3-VL from 63.6 to 84.8 on LVBench).
  • Dramatically reduces computational cost during inference: only 5.9–6.3K context tokens instead of 240–784K, enabling hours-long video understanding on resource-constrained systems.
  • The tool-based agentic retrieval mechanism is interpretable: each action step is logged, allowing debugging and trust in the reasoning process.
  • Potential applications: video surveillance, long-form content analysis, embodied AI (e.g., navigating visual histories in robotics).

Limitations:

  • Additional latency from multi-step agentic loops (though mitigated by small context sizes).
  • Dependence on a high-quality perception model for initial memory construction; while robust (Table 6), extreme cases may suffer.

Conclusion

MemDreamer introduces a paradigm shift for long-video understanding by decoupling perception from reasoning through a Hierarchical Graph Memory and tool-augmented agentic retrieval. Key takeaways:

  • The hierarchical three-tier structure (Video Root → Super Events → Macro Events with local subgraphs) captures both global narrative and fine-grained spatiotemporal-causal relations, enabling efficient semantic navigation.
  • The agentic tool bank (Navigation, Search, Graph Traversal) and Observation-Reason-Action loop transform passive token consumption into active exploratory reasoning.
  • Achieves SOTA on four major benchmarks, with a 12.5-point absolute gain on LVBench over the strongest end-to-end baseline while using only about 2% of the context window, and narrows the gap to human experts on LVBench to 3.7 points.
  • First to demonstrate a strong, statistically significant linear correlation between a VLM’s agentic reasoning capability and its long-video understanding performance ((R=0.897), (p<0.01)), establishing agentic capability scaling as a new paradigm for multimodal comprehension.

Future directions include extending the framework to real-time video streams, incorporating multi-modal signals (audio, speech, text overlay), and further scaling the agentic reasoning loop for more complex interactive tasks.
