Visual Summary | MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Summary (Overview)

Decoupled Perception and Reasoning: MemDreamer introduces a novel paradigm that separates video perception (streaming hierarchical graph construction) from reasoning (agentic tool-augmented retrieval), overcoming token explosion and attention dilution inherent in end-to-end long-video VLMs.
Hierarchical Graph Memory (HGM): A three-tier (Video Root → Super Events → Macro Events) coarse-to-fine abstraction with local subgraphs encoding entities, micro-events, and spatiotemporal-causal edges—enabling efficient semantic navigation.
Agentic Tool-Augmented Retrieval: A multi-step Observation-Reason-Action loop using three tool categories (Navigation, Search, Graph Traversal) to actively explore the memory, avoiding passive full-context ingestion.
State-of-the-Art Results: Achieves SOTA on four benchmarks (LVBench 90.7, LongVideoBench 92.9, Video-MME 92.1, EgoSchema 88.2), narrowing the gap to human experts to 3.7 points on LVBench. Context window reduced to ~2% of full-video ingestion (5.9–6.3K vs. 240–784K tokens), with a 12.5-point absolute gain over the strongest end-to-end baseline.
Key Insight: Establishes a strong positive linear correlation (Pearson \(R = 0.897\), \(p < 0.01\)) between a VLM’s agentic reasoning ability (AIME 2025) and long-video performance, shifting the optimization target from brute-force token scaling to agentic capability scaling.

Introduction and Theoretical Foundation

Background: Understanding hours-long videos is a frontier for Vision-Language Models (VLMs) and is critical for embodied intelligence. Current VLMs process videos end-to-end by flattening frames into massive token streams, which causes two problems:

Token explosion: A 2-hour video at 1 FPS yields over 1.6M tokens.
Attention dilution: Redundant tokens cause “lost in the middle” (Liu et al., 2024) and degrade long-range reasoning.
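The token-explosion figure can be checked with a few lines of arithmetic. The sketch below takes the duration, frame rate, and 1.6M total from the text; the per-frame token count is derived from those numbers and is not a value reported in the source.

```python
# Back-of-envelope token count for end-to-end ingestion.
# duration, fps and the 1.6M total come from the text;
# tokens_per_frame is derived here, not reported in the source.
duration_s = 2 * 60 * 60      # 2-hour video, in seconds
fps = 1                       # sampling rate
total_tokens = 1_600_000      # lower bound quoted in the text

frames = duration_s * fps
tokens_per_frame = total_tokens / frames

print(frames)                   # 7200
print(round(tokens_per_frame))  # 222
```

In other words, the quoted total implies a budget of a little over 200 tokens per sampled frame.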

Motivation: Human comprehension is hierarchical (coarse plot → scenes → events → actions) and relational (spatiotemporal and causal connections). Existing flat or chunk-based memory systems lose global perspective and sever causal links, causing decoupled reasoning to degenerate into myopic exhaustive retrieval.

Theoretical Basis: MemDreamer formalizes long-video understanding as a two-stage decoupled process:

Memory Construction: A perception model \(P\) incrementally builds a structured, purely textual Hierarchical Graph Memory \(\mathcal{G}\) from the video stream \(V\).
Agentic Retrieval: A reasoning model \(R\) equipped with a tool bank \(\mathcal{T}\) actively explores \(\mathcal{G}\) via an Observation-Reason-Action loop, extracting concise task-relevant clues \(\mathcal{C}\) to answer query \(Q\): \(A = R(Q, \mathcal{C})\).

This decoupling shifts the problem from passive token consumption to active multi-step exploration, bypassing context limits and enabling the direct transfer of agentic reasoning capabilities to long-video tasks.

Methodology

3.1 Hierarchical Graph Memory Construction

Streaming Adaptive Segmentation: Instead of fixed-length chunks, the system uses a semantic boundary detector with a sliding window of maximum duration \(\tau = 10\) minutes. At each iteration \(k\), the perception model resolves a set of complete Macro Events \(\{e_1, \dots, e_N\}\) within the window \(W_k = [t_k^{\text{start}}, t_k^{\text{start}} + \tau]\), and the end of the last event becomes the next window’s start.
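The windowing rule can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `segment_stream`, `detect_events`, and `toy_detector` are hypothetical names, the detector stands in for the perception model, and the fallback for a window with no complete event is an assumption added to keep the loop well defined.

```python
from typing import Callable, List, Tuple

Event = Tuple[float, float]  # (start_s, end_s) of one complete Macro Event

def segment_stream(
    video_len_s: float,
    detect_events: Callable[[float, float], List[Event]],  # hypothetical perception call
    tau_s: float = 600.0,                                  # tau = 10 minutes, from the text
) -> List[Event]:
    """Slide a window of at most tau_s seconds; restart at the last event's end."""
    events: List[Event] = []
    t_start = 0.0
    while t_start < video_len_s:
        t_end = min(t_start + tau_s, video_len_s)
        found = detect_events(t_start, t_end)
        if not found:
            # Assumption: if no complete event fits, keep the window as one event.
            found = [(t_start, t_end)]
        events.extend(found)
        next_start = found[-1][1]
        if next_start <= t_start:  # guard against a detector that makes no progress
            next_start = t_end
        t_start = next_start
    return events

def toy_detector(lo: float, hi: float, step: float = 240.0) -> List[Event]:
    """Stand-in detector: returns only the 240-second events that fit completely."""
    out: List[Event] = []
    t = lo
    while t + step <= hi:
        out.append((t, t + step))
        t += step
    return out

events = segment_stream(1000.0, toy_detector)
print(events)
```

With the toy detector, an event that would straddle the window boundary is not emitted in that window; the next window starts where the last complete event ended, so the straddling event is picked up whole on the following iteration.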

Downward Subgraph Extraction: For each Macro Event \(e_i\), a local subgraph \(g_i = (V_i, E_i)\) is constructed with two node types:

\(V_i^E\): Entities (Person, Object, Location, Group) with attributes.
\(V_i^M\): Micro-events (atomic actions with temporal extent).

Three edge categories:

Spatial-attribute (\(V_i^E \leftrightarrow V_i^E\)): LOCATED_IN, NEXT_TO, etc.
Subject-object (\(V_i^E \leftrightarrow V_i^M\)): PERFORMS, RECEIVES, etc.
Temporal-causal (\(V_i^M \rightarrow V_i^M\)): BEFORE, CAUSES, PREVENTS, etc.
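One possible in-memory layout for such a local subgraph is sketched below. The class and field names (`Entity`, `MicroEvent`, `Subgraph`, `add_edge`) are illustrative assumptions; only the node types and the relation labels come from the text, and the kitchen scene is made-up example content.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Entity:
    node_id: str
    kind: str  # "Person", "Object", "Location", or "Group"
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class MicroEvent:
    node_id: str
    description: str
    start_s: float  # temporal extent of the atomic action
    end_s: float

@dataclass
class Subgraph:
    entities: Dict[str, Entity] = field(default_factory=dict)
    micro_events: Dict[str, MicroEvent] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (src, relation, dst)

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        """Append an edge, rejecting endpoints that are not nodes of this subgraph."""
        known = self.entities.keys() | self.micro_events.keys()
        if src not in known or dst not in known:
            raise KeyError("both endpoints must already be nodes")
        self.edges.append((src, relation, dst))

# Illustrative content only; not taken from the paper.
g = Subgraph()
g.entities["p1"] = Entity("p1", "Person", {"role": "chef"})
g.entities["k1"] = Entity("k1", "Location", {"name": "kitchen"})
g.micro_events["m1"] = MicroEvent("m1", "chef picks up a knife", 12.0, 14.5)
g.micro_events["m2"] = MicroEvent("m2", "chef slices an onion", 14.5, 20.0)
g.add_edge("p1", "LOCATED_IN", "k1")  # spatial-attribute
g.add_edge("p1", "PERFORMS", "m1")    # subject-object
g.add_edge("m1", "BEFORE", "m2")      # temporal-causal
```

Storing edges as plain `(src, relation, dst)` triples keeps the memory purely textual, which matches the description of the graph as a text-only structure.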

Upward Hierarchical Aggregation: Macro Event summaries are clustered bottom-up by temporal adjacency and semantic affinity to form Super Events, which are further aggregated into a single Video Root node. Cross-tier hierarchical edges (\(v_{Mc} \rightarrow v_S \rightarrow v_R\)) are established. See Table 1 for the complete schema.
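The text does not specify how temporal adjacency and semantic affinity are combined. One simple way to respect both, shown here purely as an assumption, is a greedy pass that merges each Macro Event into the preceding group when the cosine similarity of their summary embeddings exceeds a threshold. The function names, the threshold value, and the toy 2-D embeddings are all hypothetical.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def group_macro_events(
    embeddings: Sequence[Sequence[float]],  # one summary embedding per Macro Event, in time order
    threshold: float = 0.8,                 # assumed affinity cutoff
) -> List[List[int]]:
    """Greedily merge temporally adjacent events whose summaries are similar."""
    groups: List[List[int]] = []
    for i, emb in enumerate(embeddings):
        if groups and cosine(embeddings[groups[-1][-1]], emb) >= threshold:
            groups[-1].append(i)  # same Super Event as the previous Macro Event
        else:
            groups.append([i])    # start a new Super Event
    return groups

toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
super_events = group_macro_events(toy)
print(super_events)  # [[0, 1], [2, 3]]
```

Because only neighbouring events are compared, every resulting group is a contiguous time span, so a Super Event never mixes distant parts of the video.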

3.2 Agentic Tool-Augmented Retrieval

Multi-Dimensional Tool Bank (Table 2):

Hierarchical Navigation: GetSummary, GetSuperEvent, GetMacroEvent, GetSubgraph – for coarse-to-fine hierarchical traversal.
Precise Search: SearchNodes (dense embedding similarity), SearchByTime (temporal localization) – for rapid node localization.
Graph Traversal: GetRelationGraph – for multi-hop causal/entity tracing along topological edges.
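The two Precise Search tools can be illustrated with a small pure-Python sketch. The function names mirror the tool names in the list above, but the signatures, the dictionary-based storage, and the toy vectors are assumptions made for illustration; the source only states that SearchNodes uses dense embedding similarity and SearchByTime does temporal localization.

```python
import math
from typing import Dict, List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search_nodes(
    query_vec: Sequence[float],
    node_vecs: Dict[str, Sequence[float]],  # node id -> embedding (hypothetical store)
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Return the k nodes most similar to the query, best first."""
    scored = [(nid, cosine(query_vec, vec)) for nid, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def search_by_time(t_s: float, spans: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return ids of nodes whose [start, end] span contains the timestamp."""
    return [nid for nid, (lo, hi) in spans.items() if lo <= t_s <= hi]

hits = search_nodes([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}, k=2)
at_30s = search_by_time(30.0, {"m1": (0.0, 60.0), "m2": (60.0, 120.0)})
print([nid for nid, _ in hits])  # ['a', 'c']
print(at_30s)                    # ['m1']
```

A linear scan is enough for a sketch; a real memory with many nodes would more likely sit behind a vector index.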

Observation-Reason-Action Loop: At step \(t\), the reasoning model \(R\) selects an action \(a_t\):

\[ a_t = R(Q, H_{t-1}) \tag{1} \]

where \(H_{t-1}\) is the historical execution trajectory. Executing \(a_t\) produces an observation \(o_t\) from \(\mathcal{G}\). Then \(R\) distills task-relevant clues:

\[ c_t = R(o_t, Q) \tag{2} \]

The trajectory is updated:

\[ H_t = H_{t-1} \cup \{(a_t, c_t)\} \tag{3} \]

This selective compression prevents context pollution. The loop terminates when sufficient evidence is gathered (max 12 rounds).
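Equations (1) to (3) map onto a short control loop. The sketch below is an illustration under stated assumptions: `retrieve`, `choose_action`, `execute`, and `distill` are hypothetical names, the three callables stand in for calls to the reasoning model and the tool bank, and returning `None` from the policy is an assumed way to signal that enough evidence has been gathered.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, dict]           # (tool name, arguments)
History = List[Tuple[Action, str]]  # H_t: pairs (a_t, c_t)

def retrieve(
    query: str,
    choose_action: Callable[[str, History], Optional[Action]],  # Eq. (1); None means stop
    execute: Callable[[Action], str],                           # runs a tool on the memory
    distill: Callable[[str, str], str],                         # Eq. (2)
    max_rounds: int = 12,                                       # round budget from the text
) -> History:
    """Run the Observation-Reason-Action loop and return the trajectory."""
    history: History = []
    for _ in range(max_rounds):
        action = choose_action(query, history)
        if action is None:
            break
        observation = execute(action)
        clue = distill(observation, query)
        history.append((action, clue))  # Eq. (3): the raw observation is not stored
    return history

# Stub components for demonstration; the argument key "text" is made up.
script = [("GetSummary", {}), ("SearchNodes", {"text": "red car"})]

def scripted_policy(query: str, history: History) -> Optional[Action]:
    return script[len(history)] if len(history) < len(script) else None

trace = retrieve(
    "Who drove the red car?",
    scripted_policy,
    execute=lambda action: f"raw output of {action[0]}",
    distill=lambda observation, query: observation.upper(),
)
print(len(trace))  # 2
```

Only the distilled clue is appended to the history, so the context seen by the policy grows with the number of rounds rather than with the size of each tool output.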

Empirical Validation / Results

Experimental Setup: Evaluated on four benchmarks: LVBench (103 videos, 30min–2hr, 1,549 QA), LongVideoBench (753 videos, 1,337 QA), Video-MME long split (300 videos, 900 QA), EgoSchema (egocentric reasoning). Perception model: Gemini-3.1-Pro; reasoning models: Gemini-3.1-Pro, Gemini-2.5-Pro, Qwen3-VL-235B-A22B-Thinking. Embedding: Qwen3-Embedding. Max tool calls = 12.

Main Results (Table 3):

Method	Reason Model	LVBench	LongVideoBench	Video-MME (Long)	EgoSchema
MemDreamer	Gemini-3.1-Pro	90.7	92.9	92.1	87.8
MemDreamer	Qwen3-VL-Thinking	84.8	86.3	86.2	87.4
End-to-End Gemini-3.1-Pro	—	78.2	78.6	80.3	76.4
Human Expert	—	94.4	—	—	—

MemDreamer with Gemini-3.1-Pro achieves absolute gains of +12.5 (LVBench), +14.3 (LongVideoBench), +11.8 (Video-MME) over the strongest end-to-end baseline. Gap to human expert: only 3.7 points.
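The quoted gains follow directly from the Table 3 rows above. The snippet below simply recomputes them from those values; the variable names are arbitrary.

```python
# Scores copied from the Table 3 rows above (Gemini-3.1-Pro as reasoning model).
memdreamer = {"LVBench": 90.7, "LongVideoBench": 92.9, "Video-MME (Long)": 92.1}
end_to_end = {"LVBench": 78.2, "LongVideoBench": 78.6, "Video-MME (Long)": 80.3}
human_lvbench = 94.4

# Round to one decimal to hide floating-point noise in the subtraction.
gains = {name: round(memdreamer[name] - end_to_end[name], 1) for name in memdreamer}
human_gap = round(human_lvbench - memdreamer["LVBench"], 1)

print(gains)      # {'LVBench': 12.5, 'LongVideoBench': 14.3, 'Video-MME (Long)': 11.8}
print(human_gap)  # 3.7
```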

Context Window Reduction (Table 4):

Method	Context Window	LVBench
End-to-End Gemini-3.1-Pro	265K tokens	78.2
MemDreamer (Reasoning)	6.2K tokens	90.7
Reduction factor	~43×	—

MemDreamer uses only ~2% of the full-video context window while improving accuracy by 12.5 points.
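For reference, the reduction factor and the percentage are two views of the same ratio, computed from the Table 4 values above:

```latex
\[
\frac{265\text{K}}{6.2\text{K}} \approx 42.7 \;(\text{about } 43\times),
\qquad
\frac{6.2\text{K}}{265\text{K}} \approx 0.023 = 2.3\%.
\]
```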

Correlation with Agentic Reasoning (Table 5): Pearson correlation between AIME 2025 scores (a math-competition benchmark, used here as the proxy for agentic reasoning ability) and LVBench accuracy:

End-to-end: \(R = 0.702\), \(p = 0.052\) (not significant)
With MemDreamer: \(R = 0.897\), \(p < 0.01\) (strong, significant)
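For readers who want to reproduce this kind of analysis, the Pearson coefficient itself is straightforward to compute. The sketch below implements the standard formula; the input values are placeholders chosen to be perfectly linear and are not the per-model scores from Table 5.

```python
import math
from typing import Sequence

def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

# Placeholder data (y = 2x + 1), NOT values from the paper.
reasoning_scores = [1.0, 2.0, 3.0, 4.0]
video_scores = [3.0, 5.0, 7.0, 9.0]
r = pearson_r(reasoning_scores, video_scores)
print(round(r, 6))  # 1.0
```

The significance level additionally depends on the number of models compared, which is why a coefficient of 0.702 can still fall short of the 0.05 threshold on a small sample.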

Ablation Studies:

Memory Construction (Table 7): Full Hierarchical-Graph = 90.7 vs. Flat-Chunk = 77.4; the gain is attributed to the causal edges and the hierarchy that the flat memory lacks.
Retrieval Strategy (Table 8): Agentic Full Tools = 90.7 vs. Vanilla Embedding Similarity = 70.5 vs. Full Memory Context = 78.9.
Tool Categories (Table 11): Full toolkit outperforms any subset; Graph Traversal gives the largest single boost (+6.6).
Round Budget (Table 9): Optimal at \(T_{\max} = 12\) (90.7); the agent uses about 3.0 rounds on average, showing that it terminates on its own well before the budget is exhausted.
Search top-\(k\) (Table 10): Optimal at \(k = 10\) (88.7 with \(T_{\max} = 8\)); too few results miss evidence, too many dilute relevance.
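Since Graph Traversal gives the largest single boost in the tool ablation, it is worth making the idea of multi-hop tracing concrete. The sketch below is one plausible reading of a GetRelationGraph-style tool: a breadth-first walk over outgoing edges restricted to chosen relation types. The signature, the `max_hops` parameter, and the toy edges are assumptions, not the paper's interface.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (src, relation, dst)

def get_relation_graph(
    edges: List[Edge],
    start: str,
    relations: Set[str],  # relation types to follow, e.g. {"CAUSES"}
    max_hops: int = 2,    # assumed depth limit
) -> List[Edge]:
    """Breadth-first trace of outgoing edges with an allowed relation."""
    outgoing: Dict[str, List[Edge]] = {}
    for edge in edges:
        outgoing.setdefault(edge[0], []).append(edge)

    seen = {start}
    frontier = deque([(start, 0)])
    found: List[Edge] = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for src, rel, dst in outgoing.get(node, []):
            if rel not in relations:
                continue
            found.append((src, rel, dst))
            if dst not in seen:
                seen.add(dst)
                frontier.append((dst, depth + 1))
    return found

toy_edges = [
    ("m1", "CAUSES", "m2"),
    ("m2", "CAUSES", "m3"),
    ("m3", "CAUSES", "m4"),
    ("m1", "BEFORE", "m5"),
]
chain = get_relation_graph(toy_edges, "m1", {"CAUSES"}, max_hops=2)
print(chain)  # [('m1', 'CAUSES', 'm2'), ('m2', 'CAUSES', 'm3')]
```

A similarity search can only return nodes that resemble the query, whereas this kind of walk reaches events that are linked to a hit by explicit edges even when their text looks unrelated.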

Theoretical and Practical Implications

Theoretical:

Establishes that long-video understanding is fundamentally a reasoning problem, not just a perception scaling problem. The strong correlation (Pearson \(R = 0.897\)) between agentic reasoning (AIME) and long-video performance suggests that optimizing for intrinsic reasoning ability, rather than context length, is a more effective scaling direction.
The decoupled paradigm reveals that perception and reasoning can be separated: the perception model only needs to process short clips (at most 10 minutes each), and the choice of perception backbone has minimal impact on final performance (Table 6: <1.5 point fluctuation when swapping the perception backbone).
Provides design principles for multimodal memory systems: hierarchical abstraction with topological edges (spatiotemporal/causal) outperforms flat or purely sequential storage.

Practical:

MemDreamer is plug-and-play: it can be applied to any VLM backbone without retraining and immediately boosts performance (e.g., Qwen3-VL from 63.6 to 84.8 on LVBench).
Dramatically reduces computational cost during inference: only 5.9–6.3K context tokens instead of 240–784K, enabling hours-long video understanding on resource-constrained systems.
The tool-based agentic retrieval mechanism is interpretable: each action step is logged, allowing debugging and trust in the reasoning process.
Potential applications: video surveillance, long-form content analysis, embodied AI (e.g., navigating visual histories in robotics).

Limitations:

Additional latency from multi-step agentic loops (though mitigated by small context sizes).
Dependence on a high-quality perception model for initial memory construction; while robust (Table 6), extreme cases may suffer.

Conclusion

MemDreamer introduces a paradigm shift for long-video understanding by decoupling perception from reasoning through a Hierarchical Graph Memory and tool-augmented agentic retrieval. Key takeaways:

The hierarchical three-tier structure (Video Root → Super Events → Macro Events with local subgraphs) captures both global narrative and fine-grained spatiotemporal-causal relations, enabling efficient semantic navigation.
The agentic tool bank (Navigation, Search, Graph Traversal) and Observation-Reason-Action loop transform passive token consumption into active exploratory reasoning.
Achieves SOTA on four major benchmarks with a 12.5-point absolute gain over end-to-end baselines, using only 2% of the context window, while narrowing the gap to human experts to 3.7 points.
First to demonstrate a strong, statistically significant linear correlation between a VLM’s agentic reasoning capability and its long-video understanding performance (\(R = 0.897\), \(p < 0.01\)), establishing agentic capability scaling as a new paradigm for multimodal comprehension.

Future directions include extending the framework to real-time video streams, incorporating multi-modal signals (audio, speech, text overlay), and further scaling the agentic reasoning loop for more complex interactive tasks.