Visual Summary | Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

Summary (Overview)

Active memory reconstruction paradigm: MRAgent transforms memory access from a static one-shot retrieval to an iterative, multi-step reconstruction process driven by LLM reasoning over accumulated evidence.
Cue–Tag–Content memory graph: A structured associative graph decoupling retrieval into two stages—tag-level semantic routing and content-level access—enabling guided, noise-reduced traversal.
Theoretical expressivity proof: Active retrieval policies (adaptive, stateful) are proven strictly more powerful than passive (non-adaptive) policies for any retrieval budget $T \geq 2$, formally establishing the advantage of active reconstruction.
Empirical results: On LoCoMo and LongMemEval benchmarks, MRAgent achieves up to 23% relative improvement in LLM-Judge scores over strong baselines, while reducing token consumption by up to 85% compared to graph-based baselines.
Computational efficiency: Selective, on-demand memory access reduces average per-sample token cost to 118k (vs. 632k for A-Mem) and runtime to 586s (vs. over 1,100s for A-Mem).

Introduction and Theoretical Foundation

LLMs exhibit excellent reasoning but struggle with long-term memory due to limited context windows. Prior memory systems (RAG, graph-based memories, hierarchical stores) all follow a passive retrieval paradigm: memory access is a fixed function of the query and does not adapt to intermediate evidence. Cognitive neuroscience, however, views recall as an active reconstruction process: contextual cues propagate through associative representations, progressively reconstructing coherent experiences (Rugg & Renoult, 2025; Frankland & Josselyn, 2019).

This motivates two challenges:

Active Reconstruction: Transform one-shot retrieval into multi-step reasoning that dynamically refines search based on accumulated evidence.
Associative Memory Structure: Organize memory with semantic tags that guide exploration and prune irrelevant branches.

The paper formalizes the active memory access problem. Let the memory be $M = \{v_1,...,v_N\}$ and the query be $x$. A passive policy $\pi_p$ selects all $T$ units from $x$ alone:

\{v^{(1)},...,v^{(T)}\} = \pi_p(x)

An active policy $\pi_a^{(t)}$ selects each unit conditioned on the evolving evidence $S^{(t-1)}$:

v^{(t)} = \pi_a^{(t)}(x, S^{(t-1)}), \quad S^{(t)} = S^{(t-1)} \cup \{v^{(t)}\}

Existing passive paradigms include:

Similarity-based retrieval (e.g., MemoryBank, Mem0): $\pi_{sim}(x) = \text{TopK}(\{\text{sim}(x,v)\}_{v \in V}, k)$
Graph-based retrieval (e.g., A-Mem, Zep): $\pi_{graph}(x) = V_{sim} \cup \text{Neighbor}(V_{sim})$, where $V_{sim} = \pi_{sim}(x)$ is the similarity-retrieved seed set

Both fail when the relevant evidence is neither directly similar to the query nor connected to it via predefined links, and neither can revise its strategy mid-retrieval.
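The two passive policies above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the word-overlap `sim` function stands in for an embedding similarity, and the memory items, `links`, and query are invented. It is not the implementation used by any of the cited systems.

```python
from typing import Dict, List, Set


def sim(query: str, text: str) -> float:
    """Toy similarity: Jaccard overlap of lowercase word sets (stand-in for embeddings)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def pi_sim(query: str, memory: Dict[str, str], k: int) -> List[str]:
    """Top-k memory ids ranked by similarity to the query only."""
    ranked = sorted(memory, key=lambda vid: sim(query, memory[vid]), reverse=True)
    return ranked[:k]


def pi_graph(query: str, memory: Dict[str, str],
             links: Dict[str, Set[str]], k: int) -> Set[str]:
    """Top-k seeds plus their one-hop neighbours over predefined links."""
    seeds = pi_sim(query, memory, k)
    selected = set(seeds)
    for vid in seeds:
        selected |= links.get(vid, set())
    return selected


# Invented example: the answer lives in v2, which shares no words with the query.
memory = {
    "v1": "alice reports to bob",
    "v2": "bob moved to paris",
    "v3": "carol moved to rome",
}
links = {"v1": {"v2"}}
query = "where does the manager of alice live"

print(pi_sim(query, memory, 1))           # similarity alone finds only v1
print(pi_graph(query, memory, links, 1))  # v2 is reached only via the predefined link
print(pi_graph(query, memory, {}, 1))     # without that link, v2 is missed
```

In both cases the selection is fixed once the query is known: nothing retrieved in `v1` can redirect the search toward `v2`.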

Methodology

Associative Memory Graph: Cue–Tag–Content (CTC)

The memory is a heterogeneous graph $\mathcal{M} = (\mathcal{C}, \mathcal{V}, \mathcal{R})$ where:

Cues $c \in \mathcal{C}$: fine-grained keywords (entities, attributes)
Contents $v \in \mathcal{V}$: specific memory items (episodes, semantic facts)
Tags $g \in \mathcal{G}$: labels summarizing associative links, stored as relations $\mathcal{R} \subseteq \mathcal{C} \times \mathcal{G} \times \mathcal{V}$

Two-stage retrieval operators:

\phi_{c \to g}(c) \triangleq \{ g \mid (c,g,\cdot) \in \mathcal{R}\}

\phi_{(c,g) \to v}(c,g) \triangleq \{ v \mid (c,g,v) \in \mathcal{R}\}

These decouple tag-level reasoning (selecting relevant associative directions) from content-level retrieval (accessing full episodic text), avoiding combinatorial explosion.
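A minimal sketch of the two-stage lookup, assuming the relations are indexed in two dictionaries. The class name `CTCGraph`, its method names, and the stored content strings are hypothetical; only the tag labels echo the case study later in this summary.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class CTCGraph:
    """Hypothetical in-memory index over (cue, tag, content) relations."""

    def __init__(self) -> None:
        self._tags: Dict[str, Set[str]] = defaultdict(set)
        self._contents: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add(self, cue: str, tag: str, content: str) -> None:
        """Register one relation (c, g, v)."""
        self._tags[cue].add(tag)
        self._contents[(cue, tag)].add(content)

    def tags_for(self, cue: str) -> Set[str]:
        """Stage 1, phi_{c->g}: tags attached to a cue; no content is read."""
        return set(self._tags.get(cue, set()))

    def contents_for(self, cue: str, tag: str) -> Set[str]:
        """Stage 2, phi_{(c,g)->v}: contents reached through one chosen tag."""
        return set(self._contents.get((cue, tag), set()))


graph = CTCGraph()
graph.add("Jonna", "November 2024 submission", "e1: invented episode text")
graph.add("Jonna", "late rejection", "e2: invented episode text")

candidate_tags = graph.tags_for("Jonna")             # cheap: short labels only
chosen = graph.contents_for("Jonna", "late rejection")  # full text for one tag
print(sorted(candidate_tags), chosen)
```

The design point is that a reasoner can inspect the short tag labels first and pay for full content only along the tags it selects.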

Memory is organized into three multi-granular layers:

Episodic Layer: Event-specific memories $e_i \in \mathcal{V}_e$, indexed by cues and tags, with temporal ordering.
Semantic Layer: Stable knowledge $s_i \in \mathcal{V}_s$ (preferences, attributes), anchored to cues via aspect tags.
Abstraction Layer: Topic nodes $\tau \in \mathcal{V}_\tau$ summarizing recurring patterns across episodes.

Memory population uses LLM extraction: for each episode $e_i$, an LLM produces a tag $g_i = F_{LLM}^{tag}(e_i)$ and a cue set $C_i = F_{LLM}^{cue}(e_i)$, and topic nodes are generated from related episodes.
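The population step can be sketched as below, under strong simplifying assumptions. The extractors `toy_tag` and `toy_cues` are rule-based stand-ins for the LLM calls $F_{LLM}^{tag}$ and $F_{LLM}^{cue}$, the "label: text" episode format is invented, and topic generation is omitted.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (cue, tag, content)


def populate(episodes: List[str],
             extract_tag: Callable[[str], str],
             extract_cues: Callable[[str], Set[str]]) -> Set[Triple]:
    """Build relations by linking every cue of an episode to it via the episode's tag."""
    relations: Set[Triple] = set()
    for episode in episodes:
        tag = extract_tag(episode)           # g_i
        for cue in extract_cues(episode):    # each c in C_i
            relations.add((cue, tag, episode))
    return relations


def toy_tag(episode: str) -> str:
    """Stand-in for the LLM tagger: use the text before the first colon."""
    return episode.split(":", 1)[0].strip()


def toy_cues(episode: str) -> Set[str]:
    """Stand-in for the LLM cue extractor: capitalised words after the colon."""
    body = episode.split(":", 1)[-1]
    return {w.strip(".,") for w in body.split() if w[:1].isupper()}


episodes = ["paper submission: Jonna sent the draft to Mia."]
relations = populate(episodes, toy_tag, toy_cues)
print(sorted(relations))
```

Passing the extractors as callables keeps the graph-building logic independent of whichever model or prompt produces the tags and cues.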

MRAgent: Active Reconstruction Process

The reconstruction state at step $t$ is:

S^{(t)} = (Z^{(t)}, H^{(t)})

where $Z^{(t)}$ is the active set (candidates for next traversal) and $H^{(t)}$ is the accumulated context.

Traversal actions $A = \{\Pi_1,...,\Pi_m\}$:

Forward: $\Pi_{c \to g}(C^{(t)}) = \bigcup_{c' \in C^{(t)}} \phi_{c \to g}(c')$, $\Pi_{(c,g) \to v}(C^{(t)}, G^{(t)}) = \bigcup_{c' \in C^{(t)},\, g' \in G^{(t)}} \phi_{(c,g) \to v}(c',g')$
Reverse: $\Pi_{v \to (c,g)}(V^{(t)}) = \{(c',g') \mid \exists v' \in V^{(t)}, (c',g',v') \in \mathcal{R}\}$
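The forward and reverse actions can be written as set comprehensions over the relation set. This is a sketch that assumes the relations are held as a plain Python set of triples; the function names and the sample triples are invented for illustration.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (cue, tag, content)


def cues_to_tags(relations: Set[Triple], cues: Set[str]) -> Set[str]:
    """Forward, Pi_{c->g}: all tags attached to any active cue."""
    return {g for (c, g, _) in relations if c in cues}


def cue_tags_to_contents(relations: Set[Triple],
                         cues: Set[str], tags: Set[str]) -> Set[str]:
    """Forward, Pi_{(c,g)->v}: contents reachable through active cue/tag pairs."""
    return {v for (c, g, v) in relations if c in cues and g in tags}


def contents_to_cue_tags(relations: Set[Triple],
                         contents: Set[str]) -> Set[Tuple[str, str]]:
    """Reverse, Pi_{v->(c,g)}: cue/tag pairs that point at any active content."""
    return {(c, g) for (c, g, v) in relations if v in contents}


relations = {
    ("Jonna", "submission", "e1"),
    ("Jonna", "rejection", "e2"),
    ("Mia", "rejection", "e2"),
}

tags = cues_to_tags(relations, {"Jonna"})
contents = cue_tags_to_contents(relations, {"Jonna"}, {"rejection"})
back = contents_to_cue_tags(relations, contents)
print(sorted(tags), contents, sorted(back))
```

In this toy data the reverse step surfaces the cue "Mia", which was not part of the starting set, so a later forward step can explore from it.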

The iterative loop:

LLM reasoning & action selection: $A^{(t)} = f^{select}(x, H^{(t)}, Z^{(t)})$
Controlled traversal: $\tilde{Z}^{(t+1)} = \bigcup_{a \in A^{(t)}} \Pi_a(Z^{(t)})$
LLM routing & state update: $Z^{(t+1)} = f^{route}(x, H^{(t)}, \tilde{Z}^{(t+1)})$, $H^{(t+1)} = H^{(t)} \cup Z^{(t+1)}$

The LLM decides when to terminate based on whether $H^{(t+1)}$ is sufficient to answer the query.
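The three-step loop above can be sketched as a small driver function. This is an illustrative skeleton, not the MRAgent code: `select`, `route`, and `is_sufficient` are plain callables standing in for the LLM calls, while the `max_steps` cap, the stop-on-empty-set rule, and the toy `EDGES` graph are assumptions added here.

```python
from typing import Callable, Dict, List, Set

Node = str
Action = Callable[[Set[Node]], Set[Node]]


def reconstruct(query: str,
                seeds: Set[Node],
                actions: Dict[str, Action],
                select: Callable[[str, Set[Node], Set[Node]], List[str]],
                route: Callable[[str, Set[Node], Set[Node]], Set[Node]],
                is_sufficient: Callable[[str, Set[Node]], bool],
                max_steps: int = 5) -> Set[Node]:
    """Iterate select -> traverse -> route until the context suffices."""
    Z: Set[Node] = set(seeds)   # active set Z^(t)
    H: Set[Node] = set()        # accumulated context H^(t)
    for _ in range(max_steps):
        chosen = select(query, H, Z)               # A^(t)
        candidates: Set[Node] = set()
        for name in chosen:
            candidates |= actions[name](Z)         # controlled traversal
        Z = route(query, H, candidates)            # keep only routed nodes
        H = H | Z                                  # state update
        if not Z or is_sufficient(query, H):
            break
    return H


# Toy graph: cue -> tag -> episode (all names invented).
EDGES = {"Jonna": {"tag:late rejection"}, "tag:late rejection": {"episode:e2"}}


def expand(nodes: Set[Node]) -> Set[Node]:
    """Single traversal action: follow outgoing edges from the active set."""
    out: Set[Node] = set()
    for n in nodes:
        out |= EDGES.get(n, set())
    return out


result = reconstruct(
    query="What happened to Jonna's paper?",
    seeds={"Jonna"},
    actions={"expand": expand},
    select=lambda q, H, Z: ["expand"],
    route=lambda q, H, cand: set(cand),
    is_sufficient=lambda q, H: any(n.startswith("episode:") for n in H),
)
print(sorted(result))
```

Here the loop stops after two steps, once an episode node has entered the accumulated context.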

Theoretical Analysis

Theorem 4.1 (Active retrieval is strictly more powerful than passive retrieval)
For any retrieval budget $T \geq 2$, the passive hypothesis class is strictly contained in the active hypothesis class:

\mathcal{H}_{passive}^{LM}(T) \subsetneq \mathcal{H}_{active}^{LM}(T)

Intuition: Active retrieval can implement any passive strategy, but there exist functions (e.g., those requiring conditional branching on intermediate retrieved content) that passive retrieval cannot realize because it must commit to all $T$ retrievals upfront. Full proof in Appendix C.
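The intuition can be made concrete with a toy pointer-chasing task. This is an illustration only, not the paper's proof or construction: the memory layout, the key names, and the budget of 2 are all assumptions. A "start" unit names the key that holds the answer, and the same query is asked against three memories that differ only in that pointer.

```python
from itertools import combinations

KEYS = ["start", "a", "b", "c"]
TARGETS = ["a", "b", "c"]


def make_memory(target: str) -> dict:
    """Memory whose 'start' unit points at the key holding the answer."""
    return {"start": target, "a": "red", "b": "green", "c": "blue"}


def active_answer(memory: dict) -> str:
    """Budget T=2: the second retrieval is conditioned on the first."""
    pointer = memory["start"]   # step 1: chosen from the query alone
    return memory[pointer]      # step 2: chosen from the retrieved evidence


def passive_can_answer(keys: tuple, target: str) -> bool:
    """A fixed key set can answer only if it already includes the target key."""
    return target in keys


# The active policy succeeds for every possible pointer value.
active_ok = all(active_answer(make_memory(t)) == make_memory(t)[t] for t in TARGETS)

# No fixed pair of keys (a passive policy for this one query) covers all targets.
passive_ok = any(
    all(passive_can_answer(keys, t) for t in TARGETS)
    for keys in combinations(KEYS, 2)
)

print(active_ok, passive_ok)  # True False
```

Because a passive policy sees only the query, it must pick the same two keys for all three memories, and no pair of keys contains all three possible targets. The check is even generous to the passive side, since holding the right unit does not tell it which retrieved unit is the answer.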

Empirical Validation / Results

Benchmarks and Baselines

LoCoMo: long conversational memory understanding; questions by type: multi-hop, temporal, open-domain, single-hop.
LongMemEval: multi-session, temporal-reasoning, and preference questions with longer histories.
Baselines: RAG, A-Mem, MemoryOS, LangMem, Mem0. Backbones: Gemini-2.5-Flash, Claude-Sonnet-4.5. Metrics: F1, LLM-Judge (J, scored by GPT-4o-mini), and evidence recall.

Main Results (RQ1)

Table 1: Performance on LoCoMo

Model	Method	Multi-hop F1 ↑	Multi-hop J ↑	Temporal F1 ↑	Temporal J ↑	Open Domain F1 ↑	Open Domain J ↑	Single hop F1 ↑	Single hop J ↑	Overall J ↑
Gemini	RAG	34.89	58.16	43.52	49.22	25.68	41.67	53.69	69.20	61.30
	A-Mem	33.81	53.54	40.22	49.53	12.49	33.33	46.39	61.83	55.97
	MemoryOS	41.42	63.82	35.91	47.04	23.43	41.66	54.82	71.90	63.35
	LangMem	40.67	61.34	44.70	53.58	20.49	38.54	48.20	69.68	62.86
	Mem0	45.17	68.79	58.19	61.68	26.24	41.66	54.37	73.72	68.31
	MRAgent	43.69	75.17	67.66	80.37	32.51	68.75	64.08	90.48	84.21
Claude	RAG	34.53	57.45	43.39	48.29	26.56	43.75	53.66	69.20	61.10
	A-Mem	42.45	71.67	47.73	55.48	22.02	47.57	55.19	74.71	68.45
	MemoryOS	32.94	60.99	39.14	51.09	18.29	48.95	45.46	66.49	61.18
	LangMem	44.37	70.92	56.64	80.68	22.66	54.71	54.36	83.12	78.61
	Mem0	48.66	75.88	49.50	53.58	28.58	56.25	54.43	74.07	69.02
	MRAgent	56.72	90.19	69.82	85.34	34.67	71.57	68.62	91.10	88.32

MRAgent achieves a 23.3% relative improvement in overall J over the best baseline on Gemini (Mem0: 84.21 vs 68.31), and 12.4% over the best baseline on Claude (LangMem: 88.32 vs 78.61).
Gains are especially large on temporal questions with Gemini (J: 61.68 → 80.37 over Mem0) and on multi-hop questions with Claude (J: 75.88 → 90.19 over Mem0).
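The relative-improvement figures follow directly from the Overall J column of Table 1. The short check below recomputes them; the dictionary layout and function name are just one convenient way to hold the numbers.

```python
def relative_improvement(new: float, base: float) -> float:
    """Relative gain of `new` over `base`, in percent."""
    return (new - base) / base * 100


# Overall J scores copied from Table 1.
overall_j = {
    "Gemini": {"RAG": 61.30, "A-Mem": 55.97, "MemoryOS": 63.35,
               "LangMem": 62.86, "Mem0": 68.31, "MRAgent": 84.21},
    "Claude": {"RAG": 61.10, "A-Mem": 68.45, "MemoryOS": 61.18,
               "LangMem": 78.61, "Mem0": 69.02, "MRAgent": 88.32},
}

gains = {}
for backbone, scores in overall_j.items():
    baselines = {m: s for m, s in scores.items() if m != "MRAgent"}
    best = max(baselines, key=baselines.get)  # strongest baseline per backbone
    gains[backbone] = (best, round(relative_improvement(scores["MRAgent"], scores[best]), 1))

print(gains)  # {'Gemini': ('Mem0', 23.3), 'Claude': ('LangMem', 12.4)}
```

Note that the strongest baseline differs by backbone: Mem0 on Gemini and LangMem on Claude.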

Table 2: LongMemEval (LLM-Judge ↑)

Method	Multi-session	Single-session-user	Temporal-reasoning	Single-session-preference	Overall
RAG	54.89	85.71	42.86	33.33	54.65
A-Mem	42.85	90.00	45.11	46.43	52.98
MemoryOS	56.39	87.14	38.35	46.67	54.92
LangMem	52.63	78.57	45.71	36.67	53.77
Mem0	50.38	78.57	45.11	40.00	53.01
MRAgent	68.42	92.85	68.42	66.67	72.95
MRAgent* (Claude retrieval)	86.46	92.85	85.71	78.57	86.76

32.8% relative improvement over the best baseline (MemoryOS 54.92 → MRAgent 72.95).

Cost Analysis (RQ2)

Table 3: Token consumption and runtime on LongMemEval (Gemini backbone)

Method	Token Consumption	Runtime(s)
A-Mem	632k	1,122.23
MemoryOS	273k	3,135.54
LangMem	3,268k	1,209.57
Mem0	245k	533.29
MRAgent	118k	586.11

MRAgent reduces token cost by ~81% vs A-Mem, and runtime is competitive with the most efficient baseline (Mem0) despite multi-step reasoning.
Efficiency stems from lightweight construction and query-time selective retrieval via tags.
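The cost comparison can be recomputed from Table 3. The snippet below is a simple arithmetic check on the reported figures, with token counts in thousands.

```python
# Values copied from Table 3 (tokens in thousands, runtime in seconds).
tokens_k = {"A-Mem": 632, "MemoryOS": 273, "LangMem": 3268, "Mem0": 245, "MRAgent": 118}
runtime_s = {"A-Mem": 1122.23, "MemoryOS": 3135.54, "LangMem": 1209.57,
             "Mem0": 533.29, "MRAgent": 586.11}

# Token reduction of MRAgent relative to A-Mem, in percent.
reduction_vs_amem = round((1 - tokens_k["MRAgent"] / tokens_k["A-Mem"]) * 100, 1)

# Runtime of MRAgent as a multiple of the fastest baseline.
fastest_baseline = min((m for m in runtime_s if m != "MRAgent"), key=runtime_s.get)
runtime_ratio = round(runtime_s["MRAgent"] / runtime_s[fastest_baseline], 2)

print(reduction_vs_amem, fastest_baseline, runtime_ratio)  # 81.3 Mem0 1.1
```

So MRAgent uses about 81% fewer tokens than A-Mem while running roughly 10% longer than Mem0, the fastest baseline.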

Ablation Study (RQ3)

Figure 5 (summary): Ablations on LoCoMo multi-hop questions (Claude backbone):

Structural variants: CE (Cue → Episode), CTE (Cue–Tag–Episode), CTC (Cue–Tag–Content).
With vs without reasoning: All structures benefit from active multi-step reasoning (blue bars > green bars).
CTC + reasoning gives highest Recall and J score.
Removing semantic memory degrades performance, confirming complementarity of episodic and semantic layers.

Multi-turn Reasoning Analysis (RQ4)

Multi-hop queries improve recall by >30% across successive reasoning steps (red line in Figure 6a).
Average Turns ≈ Max Valid Turns (Figure 6b), indicating the LLM effectively decides when to stop, minimizing redundancy.
Increasing parallel retrieval budget cannot replace deeper sequential reasoning (Appendix D.6).

Case Study (RQ5)

Figure 7 shows MRAgent traversing from cue "Jonna" through multiple tags (e.g., "November 2024 submission", "late rejection") to retrieve both episodic and semantic memories, then reasoning over higher-level topics to align evidence for a temporally complex query.

Theoretical and Practical Implications

Theoretical: Formal proof that active retrieval is strictly more expressive than passive retrieval. This provides a theoretical foundation for designing adaptive memory systems.
Practical: MRAgent demonstrates that deferring relational reasoning to retrieval time, rather than encoding all dependencies during construction, yields both higher accuracy and lower computational cost—a counterintuitive result that challenges existing memory system design.
Design principle: Separating memory into cue-tag-content layers allows LLMs to reason over abstract associations before committing to expensive content retrieval, enabling selective, on-demand access.

Conclusion and Discussion

Main contributions:

Active memory reconstruction paradigm that integrates LLM reasoning directly into multi-step memory access.
Cue–Tag–Content graph enabling two-stage associative retrieval with semantic routing.
Theoretical proof of superiority of active over passive retrieval.
Empirical gains of 12–23% in overall LLM-Judge score on LoCoMo over the strongest baselines, with an 81% reduction in token cost relative to A-Mem.

Limitations:

Deeper exploration steps increase latency compared to one-shot retrieval.
Static construction: memory grows monotonically, with no updating or forgetting mechanisms.

Future directions: Adaptive memory construction, lightweight memory maintenance, and robust traversal policies that extend active reconstruction to broader long-horizon deployments.

Code available at: https://github.com/Ji-shuo/MRAgent