Summary (Overview)
- Decoupled Perception and Reasoning: MemDreamer introduces a novel paradigm that separates video perception (streaming hierarchical graph construction) from reasoning (agentic tool-augmented retrieval), overcoming token explosion and attention dilution inherent in end-to-end long-video VLMs.
- Hierarchical Graph Memory (HGM): A three-tier (Video Root → Super Events → Macro Events) coarse-to-fine abstraction with local subgraphs encoding entities, micro-events, and spatiotemporal-causal edges—enabling efficient semantic navigation.
- Agentic Tool-Augmented Retrieval: A multi-step Observation-Reason-Action loop using three tool categories (Navigation, Search, Graph Traversal) to actively explore the memory, avoiding passive full-context ingestion.
- State-of-the-Art Results: Achieves SOTA on four benchmarks (LVBench 90.7, LongVideoBench 92.9, Video-MME 92.1, EgoSchema 88.2), narrowing the gap to human experts to 3.7 points on LVBench. Context window reduced to ~2% of full-video ingestion (5.9–6.3K vs. 240–784K tokens), with a 12.5-point absolute gain over the strongest end-to-end baseline.
- Key Insight: Establishes a strong positive linear correlation (Pearson \(R = 0.897\), \(p < 0.01\)) between a VLM’s agentic reasoning ability (AIME 2025) and long-video performance, shifting the optimization target from brute-force token scaling to agentic capability scaling.
Introduction and Theoretical Foundation
Background: Understanding hours-long video is a frontier for Vision-Language Models (VLMs) and is critical for embodied intelligence. Current VLMs process videos end-to-end by flattening frames into massive token streams, an approach that suffers from:
- Token explosion: A 2-hour video at 1 FPS yields over 1.6M tokens.
- Attention dilution: Redundant tokens cause “lost in the middle” (Liu et al., 2024) and degrade long-range reasoning.
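As a rough check on the token-explosion figure, the arithmetic can be sketched as below. The per-frame token count is not stated in this summary; the value computed here is only what the 1.6M figure implies and should be read as an assumption rather than a reported number.

```python
# Back-of-the-envelope check of the token-explosion claim.
duration_s = 2 * 60 * 60        # 2-hour video, in seconds
fps = 1                         # sampling rate quoted in the text
frames = duration_s * fps       # number of sampled frames

claimed_tokens = 1_600_000      # "over 1.6M tokens" from the text
# Implied visual tokens per frame (an inference, not a reported value).
implied_tokens_per_frame = claimed_tokens / frames

print(frames)                            # 7200
print(round(implied_tokens_per_frame))   # 222
```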
Motivation: Human comprehension is hierarchical (coarse plot → scenes → events → actions) and relational (spatiotemporal and causal connections). Existing flat or chunk-based memory systems lose global perspective and sever causal links, causing decoupled reasoning to degenerate into myopic exhaustive retrieval.
Theoretical Basis: MemDreamer formalizes long-video understanding as a two-stage decoupled process:
- Memory Construction: A perception model \(P\) incrementally builds a structured, purely textual Hierarchical Graph Memory \(\mathcal{G}\) from the video stream \(V\).
- Agentic Retrieval: A reasoning model \(R\) equipped with a tool bank \(\mathcal{T}\) actively explores \(\mathcal{G}\) via an Observation-Reason-Action loop, extracting concise task-relevant clues \(\mathcal{C}\) to answer query \(Q\): \(A = R(Q, \mathcal{C})\).
This decoupling shifts the problem from passive token consumption to active multi-step exploration, bypassing context limits and enabling the direct transfer of agentic reasoning capabilities to long-video tasks.
Methodology
3.1 Hierarchical Graph Memory Construction
Streaming Adaptive Segmentation: Instead of fixed-length chunks, the system uses a semantic boundary detector with a sliding window of maximum duration \(\tau = 10\) minutes. At each iteration \(k\), the perception model resolves a set of complete Macro Events \(\{e_1,\dots,e_N\}\) within window \(W_k = [t_k^{\text{start}}, t_k^{\text{start}}+\tau]\), and the end of the last event becomes the next window’s start.
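The windowing logic above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `detect_events` is a hypothetical stand-in for the perception model, and the fallback for windows with no complete event is an assumption added so the loop always advances.

```python
from typing import Callable, List, Tuple

Event = Tuple[float, float]  # (start_s, end_s) of one complete Macro Event


def segment_stream(
    video_len_s: float,
    detect_events: Callable[[float, float], List[Event]],
    tau_s: float = 600.0,  # tau = 10 minutes, as in the text
) -> List[Event]:
    """Slide a window of at most tau_s seconds over the video.

    detect_events(start, end) is a hypothetical callable that returns the
    complete Macro Events found inside [start, end], in temporal order.
    """
    events: List[Event] = []
    start = 0.0
    while start < video_len_s:
        end = min(start + tau_s, video_len_s)
        found = detect_events(start, end)
        if not found:
            # Assumed fallback: treat the whole window as one event.
            found = [(start, end)]
        events.extend(found)
        # The end of the last complete event starts the next window.
        next_start = found[-1][1]
        # Guard (assumption): never stall if the detector misbehaves.
        start = next_start if next_start > start else end
    return events


def toy_detector(start: float, end: float, event_len: float = 240.0) -> List[Event]:
    """Toy detector: every event lasts 4 minutes; only whole events count."""
    out: List[Event] = []
    t = start
    while t + event_len <= end:
        out.append((t, t + event_len))
        t += event_len
    return out


events = segment_stream(1200.0, toy_detector)
print(events)
```

Because each window restarts at the end of the last complete event, an event that straddles a window boundary is not cut in half; it is simply picked up whole by the next window.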
Downward Subgraph Extraction: For each Macro Event \(e_i\), a local subgraph \(g_i = (V_i, E_i)\) is constructed with two node types:
- \(V_i^E\): Entities (Person, Object, Location, Group) with attributes.
- \(V_i^M\): Micro-events (atomic actions with temporal extent).
Three edge categories:
- Spatial-attribute (\(V_i^E \leftrightarrow V_i^E\)): LOCATED_IN, NEXT_TO, etc.
- Subject-object (\(V_i^E \leftrightarrow V_i^M\)): PERFORMS, RECEIVES, etc.
- Temporal-causal (\(V_i^M \rightarrow V_i^M\)): BEFORE, CAUSES, PREVENTS, etc.
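One way to picture the local subgraph is as a pair of node dictionaries plus a list of typed edges. The sketch below is illustrative only: the class names, field names, and the example scene are assumptions, while the node kinds and relation labels are taken from the lists above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entity:
    node_id: str
    kind: str  # "Person", "Object", "Location", or "Group"
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class MicroEvent:
    node_id: str
    description: str
    start_s: float  # temporal extent of the atomic action
    end_s: float


@dataclass
class Subgraph:
    entities: Dict[str, Entity] = field(default_factory=dict)
    micro_events: Dict[str, MicroEvent] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (src, relation, dst)

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        """Add a typed edge, rejecting endpoints that are not known nodes."""
        known = self.entities.keys() | self.micro_events.keys()
        if src not in known or dst not in known:
            raise KeyError(f"unknown node in edge {src} -{relation}-> {dst}")
        self.edges.append((src, relation, dst))


# Hypothetical scene: a chef drops a pan in a kitchen.
g = Subgraph()
g.entities["chef"] = Entity("chef", "Person", {"clothing": "white apron"})
g.entities["kitchen"] = Entity("kitchen", "Location")
g.micro_events["m1"] = MicroEvent("m1", "chef drops the pan", 12.0, 13.5)
g.micro_events["m2"] = MicroEvent("m2", "pan hits the floor", 13.5, 14.0)

g.add_edge("chef", "LOCATED_IN", "kitchen")  # spatial-attribute
g.add_edge("chef", "PERFORMS", "m1")         # subject-object
g.add_edge("m1", "CAUSES", "m2")             # temporal-causal
```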
Upward Hierarchical Aggregation: Macro Event summaries are clustered bottom-up by temporal adjacency and semantic affinity to form Super Events, which are further aggregated into a single Video Root node. Cross-tier hierarchical edges (\(v_{Mc} \rightarrow v_S \rightarrow v_R\)) are established. See Table 1 for the complete schema.
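The summary does not specify the clustering procedure, so the following is only one plausible reading: a greedy pass that merges each Macro Event into the current Super Event when its summary embedding is similar enough to the previous event's. The function names, the cosine-similarity criterion, the threshold, and the toy embeddings are all assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def group_macro_events(
    embeddings: List[Sequence[float]], threshold: float = 0.8
) -> List[List[int]]:
    """Greedily group temporally ordered Macro Events into Super Events.

    Only consecutive events can share a group (temporal adjacency), and an
    event joins the current group only if it is similar enough to the
    previous event (semantic affinity).
    """
    groups: List[List[int]] = []
    for i, emb in enumerate(embeddings):
        if groups and cosine(embeddings[groups[-1][-1]], emb) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


# Toy 2-D "summary embeddings" for four consecutive Macro Events.
toy_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
super_events = group_macro_events(toy_embeddings)
print(super_events)  # [[0, 1], [2, 3]]
```

Each resulting group would become one Super Event node, and all Super Events would then hang off the single Video Root.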
3.2 Agentic Tool-Augmented Retrieval
Multi-Dimensional Tool Bank (Table 2):
- Hierarchical Navigation: GetSummary, GetSuperEvent, GetMacroEvent, GetSubgraph – for coarse-to-fine hierarchical traversal.
- Precise Search: SearchNodes (dense embedding similarity), SearchByTime (temporal localization) – for rapid node localization.
- Graph Traversal: GetRelationGraph – for multi-hop causal/entity tracing along topological edges.
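To make the Graph Traversal category concrete, here is a minimal sketch of what a multi-hop relation lookup could look like. Only the tool name GetRelationGraph comes from the text; the function signature, the edge-triple representation, and the `max_hops` parameter are assumptions for illustration.

```python
from collections import deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (src, relation, dst)


def get_relation_graph(edges: List[Edge], start: str, max_hops: int = 2) -> List[Edge]:
    """Collect edges reachable from `start` within `max_hops`, following edge direction."""
    out_edges: Dict[str, List[Edge]] = {}
    for edge in edges:
        out_edges.setdefault(edge[0], []).append(edge)

    seen = {start}
    found: List[Edge] = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # hop budget exhausted on this branch
        for edge in out_edges.get(node, []):
            found.append(edge)
            nxt = edge[2]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return found


# Hypothetical causal chain between micro-events.
toy_edges = [
    ("m1", "CAUSES", "m2"),
    ("m2", "CAUSES", "m3"),
    ("m3", "BEFORE", "m4"),
]
chain = get_relation_graph(toy_edges, "m1", max_hops=2)
print(chain)  # [('m1', 'CAUSES', 'm2'), ('m2', 'CAUSES', 'm3')]
```

A breadth-first walk keeps the returned context bounded by the hop budget, which fits the goal of avoiding full-context ingestion.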
Observation-Reason-Action Loop: At step \(t\), the reasoning model \(R\) selects action \(a_t\):
\[ a_t = R(Q, H_{t-1}) \tag{1} \]
where \(H_{t-1}\) is the historical execution trajectory. Executing \(a_t\) produces observation \(o_t\) from \(\mathcal{G}\). Then \(R\) distills task-relevant clues:
\[ c_t = R(o_t, Q) \tag{2} \]
The trajectory is updated:
\[ H_t = H_{t-1} \cup \{(a_t, c_t)\} \tag{3} \]
This selective compression prevents context pollution. The loop terminates when sufficient evidence is gathered (max 12 rounds).
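Equations (1) to (3) map naturally onto a short control loop. The sketch below is an assumption-laden illustration: the three callables stand in for the reasoning model and the tool executor, returning `None` is an assumed way to signal that enough evidence has been gathered, the history is an ordered list rather than a set, and the toy tool arguments are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, dict]            # (tool name, arguments)
History = List[Tuple[Action, str]]   # ordered (a_t, c_t) pairs


def run_retrieval_loop(
    query: str,
    choose_action: Callable[[str, History], Optional[Action]],
    execute: Callable[[Action], str],
    distill: Callable[[str, str], str],
    max_rounds: int = 12,
) -> History:
    history: History = []
    for _ in range(max_rounds):
        action = choose_action(query, history)   # Eq. (1): a_t = R(Q, H_{t-1})
        if action is None:                       # assumed stop signal
            break
        observation = execute(action)            # o_t, read from the memory
        clue = distill(observation, query)       # Eq. (2): c_t = R(o_t, Q)
        history.append((action, clue))           # Eq. (3): only (a_t, c_t) is kept
    return history


# Toy stand-ins for the reasoning model and the tool executor.
def toy_policy(query: str, history: History) -> Optional[Action]:
    plan = [("GetSummary", {}), ("SearchNodes", {"text": query})]
    return plan[len(history)] if len(history) < len(plan) else None


def toy_execute(action: Action) -> str:
    name, args = action
    return f"raw output of {name} with {args} plus unrelated detail"


def toy_distill(observation: str, query: str) -> str:
    return observation.split(" plus ")[0]  # drop the irrelevant tail


trajectory = run_retrieval_loop("Who dropped the pan?", toy_policy, toy_execute, toy_distill)
print(len(trajectory))  # 2
```

Note that the raw observation never enters the history; only the distilled clue does, which is the "selective compression" the text describes.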
Empirical Validation / Results
Experimental Setup: Evaluated on four benchmarks: LVBench (103 videos, 30min–2hr, 1,549 QA), LongVideoBench (753 videos, 1,337 QA), Video-MME long split (300 videos, 900 QA), EgoSchema (egocentric reasoning). Perception model: Gemini-3.1-Pro; reasoning models: Gemini-3.1-Pro, Gemini-2.5-Pro, Qwen3-VL-235B-A22B-Thinking. Embedding: Qwen3-Embedding. Max tool calls = 12.
Main Results (Table 3):
| Method | Reason Model | LVBench | LongVideoBench | Video-MME (Long) | EgoSchema |
|---|---|---|---|---|---|
| MemDreamer | Gemini-3.1-Pro | 90.7 | 92.9 | 92.1 | 87.8 |
| MemDreamer | Qwen3-VL-Thinking | 84.8 | 86.3 | 86.2 | 87.4 |
| End-to-End Gemini-3.1-Pro | — | 78.2 | 78.6 | 80.3 | 76.4 |
| Human Expert | — | 94.4 | — | — | — |
MemDreamer with Gemini-3.1-Pro achieves absolute gains of +12.5 (LVBench), +14.3 (LongVideoBench), +11.8 (Video-MME) over the strongest end-to-end baseline. Gap to human expert: only 3.7 points.
Context Window Reduction (Table 4):
| Method | Context Window | LVBench |
|---|---|---|
| End-to-End Gemini-3.1-Pro | 265K tokens | 78.2 |
| MemDreamer (Reasoning) | 6.2K tokens | 90.7 |
| Reduction factor | ~43× | — |
MemDreamer uses only ~2% of the full-video context window while improving accuracy by 12.5 points.
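The reduction factor and the "~2%" figure follow directly from the two context sizes in Table 4, as this short check shows (the variable names are illustrative):

```python
# Context sizes from Table 4, in tokens.
end_to_end_tokens = 265_000
memdreamer_tokens = 6_200

reduction_factor = end_to_end_tokens / memdreamer_tokens
fraction_percent = 100 * memdreamer_tokens / end_to_end_tokens

print(round(reduction_factor, 1))  # 42.7, reported as ~43x
print(round(fraction_percent, 1))  # 2.3, reported as ~2%
```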
Correlation with Agentic Reasoning (Table 5): Pearson correlation between AIME 2025 scores (used here as the measure of agentic reasoning ability) and LVBench accuracy:
- End-to-end: \(R = 0.702\), \(p = 0.052\) (not significant)
- With MemDreamer: \(R = 0.897\), \(p < 0.01\) (strong, significant)
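For readers who want to reproduce this kind of analysis on their own model scores, the Pearson coefficient can be computed with a few lines of standard-library Python. The per-model scores behind Table 5 are not given in this summary, so the inputs below are toy values chosen only to exercise the function; they are not the paper's data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy, perfectly linear data (NOT from the paper).
toy_reasoning_scores = [1.0, 2.0, 3.0, 4.0]
toy_video_scores = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(toy_reasoning_scores, toy_video_scores)
print(r)  # 1.0
```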
Ablation Studies:
- Memory Construction (Table 7): Full Hierarchical-Graph = 90.7 vs. Flat-Chunk = 77.4 (the full structure adds causal edges and hierarchy).
- Retrieval Strategy (Table 8): Agentic Full Tools = 90.7 vs. Vanilla Embedding Similarity = 70.5 vs. Full Memory Context = 78.9.
- Tool Categories (Table 11): Full toolkit outperforms any subset; Graph Traversal gives the largest single boost (+6.6).
- Round Budget (Table 9): Optimal at \(T_{\max} = 12\) (90.7); average rounds used ~3.0, showing self-termination.
- Search top-\(k\) (Table 10): Optimal at \(k = 10\) (88.7 with \(T_{\max} = 8\)); too few results miss evidence, too many dilute relevance.
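The top-\(k\) trade-off is easier to see with a toy version of embedding search. The sketch below ranks nodes by cosine similarity to a query vector and keeps the best `k`; the function signature, the dictionary-of-vectors store, and the 2-D toy vectors are assumptions, and a real system would obtain vectors from an embedding model such as the Qwen3-Embedding model named in the setup.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search_nodes(
    query_vec: Sequence[float],
    node_vecs: Dict[str, Sequence[float]],
    k: int = 10,
) -> List[str]:
    """Return the ids of the k nodes most similar to the query, best first."""
    ranked = sorted(
        node_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [node_id for node_id, _ in ranked[:k]]


# Toy node embeddings (hypothetical).
toy_nodes = {"chef": (1.0, 0.0), "pan": (0.8, 0.6), "garden": (0.0, 1.0)}
hits = search_nodes((1.0, 0.1), toy_nodes, k=2)
print(hits)  # ['chef', 'pan']
```

With a small `k` a relevant but lower-ranked node can be cut off; with a large `k` weakly related nodes such as `garden` enter the context, which matches the dilution effect noted above.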
Theoretical and Practical Implications
Theoretical:
- Establishes that long-video understanding is fundamentally a reasoning problem, not just a perception scaling problem. The strong correlation (Pearson \(R = 0.897\)) between agentic reasoning (AIME) and long-video performance suggests that optimizing for intrinsic reasoning ability, rather than context length, is a more effective scaling direction.
- The decoupled paradigm reveals that perception and reasoning can be separated: the perception model only needs to process short clips (<10 min), and its quality has minimal impact on final performance (Table 6: <1.5 point fluctuation when swapping perception backbone).
- Provides design principles for multimodal memory systems: hierarchical abstraction with topological edges (spatiotemporal/causal) outperforms flat or purely sequential storage.
Practical:
- MemDreamer is plug-and-play: can be applied to any VLM backbone without retraining, immediately boosting performance (e.g., Qwen3-VL from 63.6 to 84.8 on LVBench).
- Dramatically reduces computational cost during inference: only 5.9–6.3K context tokens instead of 240–784K, enabling hours-long video understanding on resource-constrained systems.
- The tool-based agentic retrieval mechanism is interpretable: each action step is logged, allowing debugging and trust in the reasoning process.
- Potential applications: video surveillance, long-form content analysis, embodied AI (e.g., navigating visual histories in robotics).
Limitations:
- Additional latency from multi-step agentic loops (though mitigated by small context sizes).
- Dependence on a high-quality perception model for initial memory construction; while robust (Table 6), extreme cases may suffer.
Conclusion
MemDreamer introduces a paradigm shift for long-video understanding by decoupling perception from reasoning through a Hierarchical Graph Memory and tool-augmented agentic retrieval. Key takeaways:
- The hierarchical three-tier structure (Video Root → Super Events → Macro Events with local subgraphs) captures both global narrative and fine-grained spatiotemporal-causal relations, enabling efficient semantic navigation.
- The agentic tool bank (Navigation, Search, Graph Traversal) and Observation-Reason-Action loop transform passive token consumption into active exploratory reasoning.
- Achieves SOTA on four major benchmarks with a 12.5-point absolute gain over end-to-end baselines, using only 2% of the context window, while narrowing the gap to human experts to 3.7 points.
- First to demonstrate a strong, statistically significant linear correlation between a VLM’s agentic reasoning capability and its long-video understanding performance (\(R = 0.897\), \(p < 0.01\)), establishing agentic capability scaling as a new paradigm for multimodal comprehension.
Future directions include extending the framework to real-time video streams, incorporating multi-modal signals (audio, speech, text overlay), and further scaling the agentic reasoning loop for more complex interactive tasks.