# MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

> MemDreamer shows that agentic reasoning, not token scaling, drives long-video understanding, achieving SOTA with 2% context via hierarchical graph memory and tool-augmented retrieval.

- **Source:** [arXiv](https://arxiv.org/abs/2606.07512)
- **Published:** 2026-06-11
- **Permalink:** https://picx.dev/p/sQDXZZ
- **Whiteboard:** https://picx.dev/p/sQDXZZ/image

## Summary

- **Decoupled Perception and Reasoning**: MemDreamer introduces a novel paradigm that separates video perception (streaming hierarchical graph construction) from reasoning (agentic tool-augmented retrieval), overcoming token explosion and attention dilution inherent in end-to-end long-video VLMs.
- **Hierarchical Graph Memory (HGM)**: A three-tier (Video Root → Super Events → Macro Events) coarse-to-fine abstraction with local subgraphs encoding entities, micro-events, and spatiotemporal-causal edges—enabling efficient semantic navigation.
- **Agentic Tool-Augmented Retrieval**: A multi-step Observation-Reason-Action loop using three tool categories (Navigation, Search, Graph Traversal) to actively explore the memory, avoiding passive full-context ingestion.
- **State-of-the-Art Results**: Achieves SOTA on four benchmarks (LVBench 90.7, LongVideoBench 92.9, Video-MME 92.1, EgoSchema 88.2), narrowing the gap to human experts to 3.7 points on LVBench. Context window reduced to ~2% of full-video ingestion (5.9–6.3K vs. 240–784K tokens), with a 12.5-point absolute gain over the strongest end-to-end baseline.
- **Key Insight**: Establishes a strong positive linear correlation (Pearson \(R=0.897\), \(p<0.01\)) between a VLM’s agentic reasoning ability (AIME 2025) and long-video performance, shifting the optimization target from brute-force token scaling to agentic capability scaling.

## Introduction and Theoretical Foundation

**Background**: Long video understanding (hours-long) is a frontier for Vision-Language Models (VLMs), critical for embodied intelligence. Current VLMs process videos end-to-end by flattening frames into massive token streams, which suffers from:
- **Token explosion**: A 2-hour video at 1 FPS yields over 1.6M tokens.
- **Attention dilution**: Redundant tokens cause “lost in the middle” (Liu et al., 2024) and degrade long-range reasoning.
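
The token-explosion figure can be sanity-checked with simple arithmetic. The per-frame token count is not stated here, so the sketch below derives the value implied by the quoted 1.6M total rather than assuming any particular tokenizer.

```python
# Frames in a 2-hour video sampled at 1 frame per second.
hours = 2
fps = 1
frames = hours * 3600 * fps

# "Over 1.6M tokens" is the figure quoted above; the per-frame budget
# below is derived from it, not taken from the paper.
total_tokens = 1_600_000
implied_tokens_per_frame = total_tokens / frames

print(frames)                           # 7200
print(round(implied_tokens_per_frame))  # 222
```

So the quoted total corresponds to a little over 200 visual tokens per frame, before any text prompt is added.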

**Motivation**: Human comprehension is hierarchical (coarse plot → scenes → events → actions) and relational (spatiotemporal and causal connections). Existing flat or chunk-based memory systems lose global perspective and sever causal links, causing decoupled reasoning to degenerate into myopic exhaustive retrieval.

**Theoretical Basis**: MemDreamer formalizes long-video understanding as a two-stage decoupled process:
1. **Memory Construction**: A perception model \(P\) incrementally builds a structured, purely textual **Hierarchical Graph Memory** \(\mathcal{G}\) from the video stream \(V\).
2. **Agentic Retrieval**: A reasoning model \(R\) equipped with a tool bank \(\mathcal{T}\) actively explores \(\mathcal{G}\) via an Observation-Reason-Action loop, extracting concise task-relevant clues \(\mathcal{C}\) to answer query \(Q\): \(A = R(Q, \mathcal{C})\).

This decoupling shifts the problem from passive token consumption to active multi-step exploration, bypassing context limits and enabling the direct transfer of agentic reasoning capabilities to long-video tasks.

## Methodology

### 3.1 Hierarchical Graph Memory Construction

**Streaming Adaptive Segmentation**: Instead of fixed-length chunks, the system uses a semantic boundary detector with a sliding window of maximum duration \(\tau = 10\) minutes. At each iteration \(k\), the perception model resolves a set of complete **Macro Events** \(\{e_1,\dots,e_N\}\) within window \(W_k = [t_k^{\text{start}}, t_k^{\text{start}}+\tau]\), and the end of the last event becomes the next window’s start.
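
The windowing rule can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `detect_events` is a hypothetical stand-in for the perception model, `toy_detector` is a placeholder that cuts fixed 4-minute events, and the fallback for an empty window is an assumption added so the loop always terminates.

```python
from dataclasses import dataclass
from typing import Callable, List

TAU = 10 * 60  # maximum window duration in seconds (tau = 10 minutes)


@dataclass
class MacroEvent:
    start: float  # seconds
    end: float    # seconds
    summary: str


def segment_stream(
    video_duration: float,
    detect_events: Callable[[float, float], List[MacroEvent]],
    tau: float = TAU,
) -> List[MacroEvent]:
    """Slide a window of at most `tau` seconds over the video.

    `detect_events(start, end)` (hypothetical) returns the complete
    Macro Events found inside [start, end].
    """
    events: List[MacroEvent] = []
    t_start = 0.0
    while t_start < video_duration:
        t_end = min(t_start + tau, video_duration)
        found = detect_events(t_start, t_end)
        if not found:
            # Assumption: skip past a window with no complete event.
            t_start = t_end
            continue
        events.extend(found)
        # The end of the last complete event becomes the next window's start.
        next_start = found[-1].end
        t_start = next_start if next_start > t_start else t_end
    return events


def toy_detector(start: float, end: float) -> List[MacroEvent]:
    """Placeholder detector: emits back-to-back 4-minute events that fit."""
    out, t = [], start
    while t + 240 <= end:
        out.append(MacroEvent(t, t + 240, f"event@{int(t)}"))
        t += 240
    return out


events = segment_stream(1200, toy_detector)
print([(e.start, e.end) for e in events])
```

Because each window restarts at the end of the last complete event, an event that straddles a window boundary is re-examined in the next window instead of being cut in half.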

**Downward Subgraph Extraction**: For each Macro Event \(e_i\), a local subgraph \(g_i = (V_i, E_i)\) is constructed with two node types:
- \(V_i^E\): Entities (Person, Object, Location, Group) with attributes.
- \(V_i^M\): Micro-events (atomic actions with temporal extent).

Three edge categories:
- **Spatial-attribute** (\(V_i^E \leftrightarrow V_i^E\)): LOCATED_IN, NEXT_TO, etc.
- **Subject-object** (\(V_i^E \leftrightarrow V_i^M\)): PERFORMS, RECEIVES, etc.
- **Temporal-causal** (\(V_i^M \rightarrow V_i^M\)): BEFORE, CAUSES, PREVENTS, etc.
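
One way to picture the local subgraph is as a small typed graph that rejects edges connecting the wrong node kinds. The class names, field names, and category keys below are illustrative assumptions; only the node types, edge categories, and relation labels come from the schema above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ENTITY_TYPES = {"Person", "Object", "Location", "Group"}

# Allowed (source kind, target kind) pairs for each edge category.
EDGE_RULES = {
    "spatial_attribute": {("entity", "entity")},
    "subject_object": {("entity", "micro_event"), ("micro_event", "entity")},
    "temporal_causal": {("micro_event", "micro_event")},
}


@dataclass
class Node:
    node_id: str
    kind: str   # "entity" or "micro_event"
    label: str  # entity type or a short action description


@dataclass
class Subgraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Tuple[str, str, str, str]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        if node.kind == "entity" and node.label not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {node.label}")
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, relation: str, dst: str, category: str) -> None:
        pair = (self.nodes[src].kind, self.nodes[dst].kind)
        if pair not in EDGE_RULES[category]:
            raise ValueError(f"{category} edge cannot link {pair}")
        self.edges.append((src, relation, dst, category))


g = Subgraph()
g.add_node(Node("chef", "entity", "Person"))
g.add_node(Node("kitchen", "entity", "Location"))
g.add_node(Node("chop", "micro_event", "chops onions"))
g.add_node(Node("fry", "micro_event", "fries onions"))
g.add_edge("chef", "LOCATED_IN", "kitchen", "spatial_attribute")
g.add_edge("chef", "PERFORMS", "chop", "subject_object")
g.add_edge("chop", "BEFORE", "fry", "temporal_causal")
print(len(g.edges))
```

Validating the endpoint kinds at insertion time keeps temporal-causal edges restricted to micro-events, which is what later multi-hop causal tracing relies on.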

**Upward Hierarchical Aggregation**: Macro Event summaries are clustered bottom-up by temporal adjacency and semantic affinity to form **Super Events**, which are further aggregated into a single **Video Root** node. Cross-tier hierarchical edges (\(v_{Mc} \rightarrow v_S \rightarrow v_R\)) are established. See Table 1 for the complete schema.
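
The bottom-up step can be sketched as a greedy pass over time-ordered Macro Events. The exact clustering procedure is not described here, so the cosine-similarity rule, the 0.8 threshold, and the toy embeddings are all assumptions chosen only to illustrate "temporal adjacency plus semantic affinity".

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def group_super_events(
    embeddings: List[Sequence[float]], threshold: float = 0.8
) -> List[List[int]]:
    """Merge each Macro Event into the previous group when its summary
    embedding is similar enough to its temporal predecessor."""
    if not embeddings:
        return []
    groups = [[0]]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def hierarchy_edges(groups: List[List[int]]) -> List[Tuple[str, str]]:
    """Cross-tier edges: macro -> super -> root."""
    edges = []
    for s, members in enumerate(groups):
        edges.extend((f"macro_{m}", f"super_{s}") for m in members)
        edges.append((f"super_{s}", "root"))
    return edges


# Toy 2-D embeddings in temporal order (placeholders, not real model output).
toy = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
groups = group_super_events(toy)
edges = hierarchy_edges(groups)
print(groups)
print(len(edges))
```

Only neighbouring events are compared, so every Super Event covers a contiguous time span.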

### 3.2 Agentic Tool-Augmented Retrieval

**Multi-Dimensional Tool Bank** (Table 2):
1. **Hierarchical Navigation**: GetSummary, GetSuperEvent, GetMacroEvent, GetSubgraph – for coarse-to-fine hierarchical traversal.
2. **Precise Search**: SearchNodes (dense embedding similarity), SearchByTime (temporal localization) – for rapid node localization.
3. **Graph Traversal**: GetRelationGraph – for multi-hop causal/entity tracing along topological edges.
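
Two of these tools are simple enough to sketch directly. The function signatures, the `max_hops` parameter, and the toy data below are assumptions for illustration; the tool interfaces in the paper's Table 2 may differ.

```python
from collections import deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)


def search_by_time(events: Dict[str, Tuple[float, float]], t: float) -> List[str]:
    """Return ids of events whose [start, end] interval contains time t."""
    return [eid for eid, (start, end) in events.items() if start <= t <= end]


def get_relation_graph(edges: List[Edge], start: str, max_hops: int = 2) -> List[Edge]:
    """Breadth-first walk along directed edges, up to `max_hops` from `start`."""
    adjacency: Dict[str, List[Tuple[str, str]]] = {}
    for src, rel, dst in edges:
        adjacency.setdefault(src, []).append((rel, dst))

    seen = {start}
    queue = deque([(start, 0)])
    reached: List[Edge] = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for rel, dst in adjacency.get(node, []):
            reached.append((node, rel, dst))
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, depth + 1))
    return reached


toy_events = {"e1": (0, 240), "e2": (240, 480)}
toy_edges = [
    ("spill", "CAUSES", "slip"),
    ("slip", "CAUSES", "fall"),
    ("fall", "BEFORE", "help"),
]
print(search_by_time(toy_events, 300))
print(get_relation_graph(toy_edges, "spill", max_hops=2))
```

Bounding the traversal by hop count keeps each observation small, which fits the goal of returning concise clues rather than the whole graph.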

**Observation-Reason-Action Loop**:
At step \(t\), the reasoning model \(R\) selects action \(a_t\):
\[
a_t = R(Q, H_{t-1})
\tag{1}
\]
where \(H_{t-1}\) is the historical execution trajectory. Executing \(a_t\) produces observation \(o_t\) from \(\mathcal{G}\). Then \(R\) distills task-relevant clues:
\[
c_t = R(o_t, Q)
\tag{2}
\]
The trajectory is updated:
\[
H_t = H_{t-1} \cup \{(a_t, c_t)\}
\tag{3}
\]
This selective compression prevents context pollution. The loop terminates when sufficient evidence is gathered (max 12 rounds).
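
Equations (1) to (3) map onto a short control loop. The sketch below is a minimal illustration under stated assumptions: `choose_action` and `distill` stand in for calls to the reasoning model, returning `None` is an assumed way to signal that enough evidence has been gathered, and the scripted reasoner and lambda tools are placeholders. The trajectory is kept as an ordered list rather than a set.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_ROUNDS = 12  # round budget quoted above

Action = Tuple[str, str]        # (tool name, argument)
History = List[Tuple[str, str]]  # [(action, distilled clue), ...]


def run_retrieval(
    question: str,
    choose_action: Callable[[str, History], Optional[Action]],
    tools: Dict[str, Callable[[str], str]],
    distill: Callable[[str, str], str],
    max_rounds: int = MAX_ROUNDS,
) -> History:
    history: History = []
    for _ in range(max_rounds):
        action = choose_action(question, history)       # Eq. (1)
        if action is None:                              # assumed stop signal
            break
        tool_name, arg = action
        observation = tools[tool_name](arg)             # query the memory
        clue = distill(observation, question)           # Eq. (2)
        history.append((f"{tool_name}({arg})", clue))   # Eq. (3)
    return history


# Placeholder reasoner: follows a fixed two-step plan, then stops.
def scripted_reasoner(question: str, history: History) -> Optional[Action]:
    plan = [("GetSummary", "root"), ("SearchNodes", "red car")]
    return plan[len(history)] if len(history) < len(plan) else None


# Placeholder tools returning verbose text, and a distiller that trims it.
toy_tools = {
    "GetSummary": lambda arg: f"summary of {arg}: long raw text ...",
    "SearchNodes": lambda arg: f"nodes matching {arg}: long raw text ...",
}


def keep_prefix(observation: str, question: str) -> str:
    return observation.split(":")[0]


history = run_retrieval("Who drove the red car?", scripted_reasoner, toy_tools, keep_prefix)
print(history)
```

Only the distilled clue is stored in the history, never the raw observation, which is how the context stays small across rounds.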

## Empirical Validation / Results

**Experimental Setup**: Evaluated on four benchmarks: LVBench (103 videos, 30min–2hr, 1,549 QA), LongVideoBench (753 videos, 1,337 QA), Video-MME long split (300 videos, 900 QA), EgoSchema (egocentric reasoning). Perception model: Gemini-3.1-Pro; reasoning models: Gemini-3.1-Pro, Gemini-2.5-Pro, Qwen3-VL-235B-A22B-Thinking. Embedding: Qwen3-Embedding. Max tool calls = 12.

**Main Results (Table 3)**:

| Method | Reason Model | LVBench | LongVideoBench | Video-MME (Long) | EgoSchema |
|--------|--------------|---------|----------------|------------------|-----------|
| **MemDreamer** | Gemini-3.1-Pro | **90.7** | **92.9** | **92.1** | 87.8 |
| **MemDreamer** | Qwen3-VL-Thinking | **84.8** | **86.3** | **86.2** | **87.4** |
| End-to-End Gemini-3.1-Pro | — | 78.2 | 78.6 | 80.3 | 76.4 |
| Human Expert | — | 94.4 | — | — | — |

With Gemini-3.1-Pro as the reasoning model, MemDreamer gains +12.5 (LVBench), +14.3 (LongVideoBench), and +11.8 (Video-MME) points over end-to-end Gemini-3.1-Pro, the strongest end-to-end baseline. On LVBench, the remaining gap to the human expert score is 3.7 points (94.4 vs. 90.7).

**Context Window Reduction (Table 4)**:

| Method | Context Window | LVBench |
|--------|---------------|---------|
| End-to-End Gemini-3.1-Pro | 265K tokens | 78.2 |
| **MemDreamer** (Reasoning) | **6.2K tokens** | **90.7** |
| Reduction factor | ~43× | — |

MemDreamer uses only ~2% of the full-video context window while improving accuracy by 12.5 points.
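
The headline ratios follow directly from the Table 4 figures, as this short check shows. All inputs are the values quoted in the table; nothing else is assumed.

```python
end_to_end_tokens = 265_000  # End-to-End Gemini-3.1-Pro context
memdreamer_tokens = 6_200    # MemDreamer reasoning context

reduction_factor = end_to_end_tokens / memdreamer_tokens
context_share_pct = 100 * memdreamer_tokens / end_to_end_tokens
accuracy_gain = 90.7 - 78.2  # LVBench points

print(round(reduction_factor))      # 43
print(round(context_share_pct, 1))  # 2.3
print(round(accuracy_gain, 1))      # 12.5
```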

**Correlation with Agentic Reasoning (Table 5)**:
Pearson correlation between each model's AIME 2025 score (a competition-math benchmark, used here as a proxy for agentic reasoning ability) and its LVBench accuracy:
- End-to-end: \(R = 0.702\), \(p = 0.052\) (not significant)
- With MemDreamer: \(R = 0.897\), \(p < 0.01\) (strong, significant)
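
For readers who want to reproduce this kind of analysis, the Pearson coefficient and its test statistic can be computed as follows. The per-model scores behind Table 5 are not listed here, so the inputs below are arbitrary placeholder numbers, not the paper's data; converting the statistic to a p-value additionally needs a Student-t distribution with `n - 2` degrees of freedom, which is omitted.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two equal-length samples with n >= 3")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Placeholder values only: NOT the benchmark scores from the paper.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
t = t_statistic(r, len(xs))
print(round(r, 4), round(t, 4))
```

With few models in the sample, the same coefficient can fall on either side of the significance threshold, which is why the p-value is reported alongside each correlation above.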

**Ablation Studies**:
- **Memory Construction (Table 7)**: Full Hierarchical-Graph = 90.7 vs. Flat-Chunk = 77.4, a 13.3-point gap attributed to the hierarchy and the causal edges that flat chunking lacks.
- **Retrieval Strategy (Table 8)**: Agentic Full Tools = 90.7 vs. Vanilla Embedding Similarity = 70.5 vs. Full Memory Context = 78.9.
- **Tool Categories (Table 11)**: Full toolkit outperforms any subset; Graph Traversal gives the largest single boost (+6.6).
- **Round Budget (Table 9)**: Optimal at \(T_{\max}=12\) (90.7); the agent uses about 3.0 rounds on average, indicating that it terminates on its own well before reaching the cap.
- **Search top-\(k\) (Table 10)**: Optimal at \(k=10\) (88.7 with \(T_{\max}=8\)); too few misses evidence, too many dilutes relevance.

## Theoretical and Practical Implications

**Theoretical**:  
- Establishes that long-video understanding is fundamentally a reasoning problem, not just a perception scaling problem. The strong correlation (Pearson \(R=0.897\)) between agentic reasoning (AIME) and long-video performance suggests optimizing for intrinsic reasoning ability—rather than context length—is a more effective scaling direction.
- The decoupled paradigm shows that perception and reasoning can be separated: the perception model only needs to process short windows (at most 10 minutes), and swapping the perception backbone changes final performance by less than 1.5 points (Table 6).
- Provides design principles for multimodal memory systems: hierarchical abstraction with topological edges (spatiotemporal/causal) outperforms flat or purely sequential storage.

**Practical**:  
- MemDreamer is plug-and-play: can be applied to any VLM backbone without retraining, immediately boosting performance (e.g., Qwen3-VL from 63.6 to 84.8 on LVBench).
- Dramatically reduces computational cost during inference: only 5.9–6.3K context tokens instead of 240–784K, enabling hours-long video understanding on resource-constrained systems.
- The tool-based agentic retrieval mechanism is interpretable: each action step is logged, allowing debugging and trust in the reasoning process.
- Potential applications: video surveillance, long-form content analysis, embodied AI (e.g., navigating visual histories in robotics).

**Limitations**:  
- Additional latency from multi-step agentic loops (though mitigated by small context sizes).
- Dependence on a high-quality perception model for initial memory construction; while robust (Table 6), extreme cases may suffer.

## Conclusion

MemDreamer introduces a paradigm shift for long-video understanding by **decoupling perception from reasoning** through a **Hierarchical Graph Memory** and **tool-augmented agentic retrieval**. Key takeaways:

- The hierarchical three-tier structure (Video Root → Super Events → Macro Events with local subgraphs) captures both global narrative and fine-grained spatiotemporal-causal relations, enabling efficient semantic navigation.
- The agentic tool bank (Navigation, Search, Graph Traversal) and Observation-Reason-Action loop transform passive token consumption into active exploratory reasoning.
- Achieves SOTA on four major benchmarks with a **12.5-point absolute gain** over end-to-end baselines, using only **2% of the context window**, while narrowing the gap to human experts to **3.7 points**.
- First to demonstrate a **strong, statistically significant linear correlation** between a VLM’s agentic reasoning capability and its long-video understanding performance (\(R=0.897\), \(p<0.01\)), establishing **agentic capability scaling** as a new paradigm for multimodal comprehension.

Future directions include extending the framework to real-time video streams, incorporating multi-modal signals (audio, speech, text overlay), and further scaling the agentic reasoning loop for more complex interactive tasks.

