Summary (Overview)

  • Active memory reconstruction paradigm: MRAgent transforms memory access from a static one-shot retrieval to an iterative, multi-step reconstruction process driven by LLM reasoning over accumulated evidence.
  • Cue–Tag–Content memory graph: A structured associative graph decoupling retrieval into two stages—tag-level semantic routing and content-level access—enabling guided, noise-reduced traversal.
  • Theoretical expressivity proof: Active retrieval policies (adaptive, stateful) are proven strictly more powerful than passive (non-adaptive) policies for any retrieval budget $T \geq 2$, formally establishing the advantage of active reconstruction.
  • Empirical results: On LoCoMo and LongMemEval benchmarks, MRAgent achieves up to 23% relative improvement in LLM-Judge scores over strong baselines, while reducing token consumption by up to 85% compared to graph-based baselines.
  • Computational efficiency: Selective, on-demand memory access reduces average per-sample token cost to 118k (vs. 632k for A-Mem) and runtime to 586s (vs. over 1,100s for A-Mem).

Introduction and Theoretical Foundation

LLMs exhibit excellent reasoning but struggle with long-term memory due to limited context windows. Prior memory systems (RAG, graph-based memories, hierarchical stores) all follow a passive retrieval paradigm: memory access is a fixed function of the query and does not adapt to intermediate evidence. Cognitive neuroscience, however, views recall as an active reconstruction process: contextual cues propagate through associative representations, progressively reconstructing coherent experiences (Rugg & Renoult, 2025; Frankland & Josselyn, 2019).

This motivates two challenges:

  1. Active Reconstruction: Transform one-shot retrieval into multi-step reasoning that dynamically refines search based on accumulated evidence.
  2. Associative Memory Structure: Organize memory with semantic tags that guide exploration and prune irrelevant branches.

The paper formalizes the active memory access problem. Let the memory be $M = \{v_1, \dots, v_N\}$ and the query be $x$. A passive policy $\pi_p$ selects units from $x$ alone:

$$\{v^{(1)}, \dots, v^{(T)}\} = \pi_p(x)$$

An active policy $\pi_a^{(t)}$ selects units conditioned on the evolving evidence $S^{(t-1)}$:

$$v^{(t)} = \pi_a^{(t)}(x, S^{(t-1)}), \quad S^{(t)} = S^{(t-1)} \cup \{v^{(t)}\}$$

Existing passive paradigms include:

  • Similarity-based retrieval (e.g., MemoryBank, Mem0): $\pi_{sim}(x) = \text{TopK}(\{\text{sim}(x, v)\}_{v \in V}, k)$
  • Graph-based retrieval (e.g., A-Mem, Zep): $\pi_{graph}(x) = V_{sim} \cup \text{Neighbor}(V_{sim})$

Both fail when the relevant evidence is not directly similar to the query or is not connected through predefined links, and neither can revise its strategy mid-retrieval.
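
The two passive policies above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the 2-d embeddings, the `edges` table, and the helper names `cosine`, `pi_sim`, and `pi_graph` are all made up for the example, and cosine similarity is only one possible choice for sim.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pi_sim(query_vec, memory, k):
    """Passive similarity retrieval: top-k unit ids ranked by similarity to the query."""
    ranked = sorted(memory, key=lambda uid: cosine(query_vec, memory[uid]), reverse=True)
    return ranked[:k]

def pi_graph(query_vec, memory, edges, k):
    """Passive graph retrieval: similarity seeds plus their one-hop neighbors."""
    seeds = pi_sim(query_vec, memory, k)
    neighbors = {n for s in seeds for n in edges.get(s, ())}
    return set(seeds) | neighbors

# Toy data (hypothetical): four memory units and one predefined link.
memory = {"v1": (1.0, 0.0), "v2": (0.9, 0.1), "v3": (0.0, 1.0), "v4": (-1.0, 0.0)}
edges = {"v1": ["v3"]}
query = (1.0, 0.0)

top2 = pi_sim(query, memory, k=2)
expanded = pi_graph(query, memory, edges, k=2)
print(top2, sorted(expanded))
```

In this toy setup `v4` is neither similar to the query nor linked to a seed, so both policies miss it regardless of what the retrieved units contain. That is the failure mode described above.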


Methodology

Associative Memory Graph: Cue–Tag–Content (CTC)

The memory is a heterogeneous graph $\mathcal{M} = (\mathcal{C}, \mathcal{V}, \mathcal{R})$ where:

  • Cues $c \in \mathcal{C}$: fine-grained keywords (entities, attributes)
  • Contents $v \in \mathcal{V}$: specific memory items (episodes, semantic facts)
  • Tags $g \in \mathcal{G}$: relations $\mathcal{R} \subseteq \mathcal{C} \times \mathcal{G} \times \mathcal{V}$, summarizing associative links

Two-stage retrieval operators:

$$\phi_{c \to g}(c) \triangleq \{ g \mid (c, g, \cdot) \in \mathcal{R} \}$$

$$\phi_{(c,g) \to v}(c, g) \triangleq \{ v \mid (c, g, v) \in \mathcal{R} \}$$

These decouple tag-level reasoning (selecting relevant associative directions) from content-level retrieval (accessing full episodic text), avoiding combinatorial explosion.
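
As a rough sketch of how the two operators could be backed by a data structure, the class below stores relation triples in two dictionaries. The class name `CTCGraph`, its method names, and the sample triples are assumptions made for illustration; the paper's actual storage layout is not described here.

```python
from collections import defaultdict

class CTCGraph:
    """Toy Cue-Tag-Content store holding relation triples (cue, tag, content)."""

    def __init__(self):
        self._tags = defaultdict(set)      # cue -> {tag}
        self._contents = defaultdict(set)  # (cue, tag) -> {content}

    def add(self, cue, tag, content):
        """Register one triple (c, g, v) of the relation set R."""
        self._tags[cue].add(tag)
        self._contents[(cue, tag)].add(content)

    def tags_of(self, cue):
        """phi_{c->g}: tags attached to a cue (cheap, no content is read)."""
        return set(self._tags.get(cue, ()))

    def contents_of(self, cue, tag):
        """phi_{(c,g)->v}: contents stored under a (cue, tag) pair."""
        return set(self._contents.get((cue, tag), ()))

# Hypothetical sample data.
graph = CTCGraph()
graph.add("alice", "job change", "ep1: Alice joined a robotics startup in March.")
graph.add("alice", "hobbies", "ep2: Alice started rock climbing.")
graph.add("bob", "hobbies", "ep3: Bob paints on weekends.")

# Stage 1: inspect tags only; stage 2: open content for the chosen tag.
alice_tags = graph.tags_of("alice")
job_contents = graph.contents_of("alice", "job change")
print(sorted(alice_tags), job_contents)
```

Because stage 1 returns short tags rather than full episodes, a router can discard the "hobbies" branch before any episodic text is loaded.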

Memory is organized into three multi-granular layers:

  • Episodic Layer: Event-specific memories $e_i \in \mathcal{V}_e$, indexed by cues and tags, with temporal ordering.
  • Semantic Layer: Stable knowledge $s_i \in \mathcal{V}_s$ (preferences, attributes), anchored to cues via aspect tags.
  • Abstraction Layer: Topic nodes $\tau \in \mathcal{V}_\tau$ summarizing recurring patterns across episodes.

Memory population uses LLM extraction: $g_i = F_{LLM}^{tag}(e_i)$, $C_i = F_{LLM}^{cue}(e_i)$, and topic generation from related episodes.
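
The population step can be pictured as follows. This sketch covers only the episodic layer, and it replaces the two LLM extractors with crude string heuristics so that it runs offline; `extract_cues`, `extract_tag`, and `populate` are hypothetical names, and the heuristics are stand-ins, not the paper's prompts.

```python
import re

def extract_cues(episode):
    """Stand-in for F_LLM^cue: treat capitalized words as cues (crude heuristic)."""
    return sorted(set(re.findall(r"\b[A-Z][a-z]+\b", episode)))

def extract_tag(episode):
    """Stand-in for F_LLM^tag: use the first four words, lowercased, as the tag."""
    return " ".join(episode.lower().rstrip(".").split()[:4])

def populate(episodes):
    """Build relation triples (cue, tag, content), one tag per episode."""
    relations = set()
    for episode in episodes:
        tag = extract_tag(episode)
        for cue in extract_cues(episode):
            relations.add((cue, tag, episode))
    return relations

# Hypothetical episodes.
episodes = ["Maria adopted a dog named Rex.", "Maria moved to Lisbon in May."]
R = populate(episodes)
for triple in sorted(R):
    print(triple)
```

Each episode is read once at construction time; relational reasoning across episodes is left to query time.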

MRAgent: Active Reconstruction Process

The reconstruction state at step $t$ is:

$$S^{(t)} = (Z^{(t)}, H^{(t)})$$

where $Z^{(t)}$ is the active set (candidates for next traversal) and $H^{(t)}$ is the accumulated context.

Traversal actions $A = \{\Pi_1, \dots, \Pi_m\}$:

  • Forward: $\Pi_{c \to g}(C^{(t)}) = \bigcup_{c' \in C^{(t)}} \phi_{c \to g}(c')$, $\Pi_{(c,g) \to v}(C^{(t)}, G^{(t)}) = \bigcup_{c' \in C^{(t)},\, g' \in G^{(t)}} \phi_{(c,g) \to v}(c', g')$
  • Reverse: $\Pi_{v \to (c,g)}(V^{(t)}) = \{(c', g') \mid \exists v' \in V^{(t)}, (c', g', v') \in \mathcal{R}\}$
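
The forward and reverse actions are set-valued lifts of the single-node operators, which the following sketch makes concrete over a plain set of triples. The function names and the toy relation set are assumptions for illustration.

```python
# Hypothetical relation set R of (cue, tag, content) triples.
R = {
    ("alice", "job change", "ep1"),
    ("alice", "hobbies", "ep2"),
    ("bob", "hobbies", "ep2"),
}

def forward_tags(cues, relations):
    """Pi_{c->g}: union of tags over a set of cues."""
    return {g for (c, g, _) in relations if c in cues}

def forward_contents(cues, tags, relations):
    """Pi_{(c,g)->v}: contents reachable from any (cue, tag) pair in cues x tags."""
    return {v for (c, g, v) in relations if c in cues and g in tags}

def reverse_pairs(contents, relations):
    """Pi_{v->(c,g)}: (cue, tag) pairs that point at any of the given contents."""
    return {(c, g) for (c, g, v) in relations if v in contents}

tags = forward_tags({"alice"}, R)
contents = forward_contents({"alice"}, {"hobbies"}, R)
pairs = reverse_pairs(contents, R)
print(sorted(tags), contents, sorted(pairs))
```

Here the reverse step from `ep2` surfaces the cue `bob`, which was not in the starting cue set. That is how reverse traversal can open new directions for later steps.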

The iterative loop:

  1. LLM reasoning & action selection: $A^{(t)} = f^{select}(x, H^{(t)}, Z^{(t)})$
  2. Controlled traversal: $\tilde{Z}^{(t+1)} = \bigcup_{a \in A^{(t)}} \Pi_a(Z^{(t)})$
  3. LLM routing & state update: $Z^{(t+1)} = f^{route}(x, H^{(t)}, \tilde{Z}^{(t+1)})$, $H^{(t+1)} = H^{(t)} \cup Z^{(t+1)}$

The LLM decides when to terminate based on sufficiency of $H^{(t+1)}$.
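
The three-step loop can be sketched as below. To keep the example runnable without a model, the LLM calls are replaced by simple rules: `f_select` picks the traversal that matches the node kinds in the active set, `f_route` keeps candidates sharing a word with the query, and `is_sufficient` stops once any content has been read. All of these names and rules are assumptions, not the paper's implementation.

```python
import re

def tokens(text):
    """Lowercased word set used by the toy routing rule."""
    return set(re.findall(r"\w+", text.lower()))

# Nodes are tuples: ("cue", c), ("pair", c, g), or ("content", v).
def cue_to_tag(Z, R):
    cues = {n[1] for n in Z if n[0] == "cue"}
    return {("pair", c, g) for (c, g, _) in R if c in cues}

def pair_to_content(Z, R):
    pairs = {(n[1], n[2]) for n in Z if n[0] == "pair"}
    return {("content", v) for (c, g, v) in R if (c, g) in pairs}

def content_to_pair(Z, R):
    contents = {n[1] for n in Z if n[0] == "content"}
    return {("pair", c, g) for (c, g, v) in R if v in contents}

TRAVERSALS = {"cue_to_tag": cue_to_tag, "pair_to_content": pair_to_content,
              "content_to_pair": content_to_pair}
ACTION_FOR_KIND = {"cue": "cue_to_tag", "pair": "pair_to_content",
                   "content": "content_to_pair"}

def f_select(query, H, Z):
    """Stand-in for LLM action selection (step 1)."""
    return sorted({ACTION_FOR_KIND[n[0]] for n in Z})

def f_route(query, H, candidates):
    """Stand-in for LLM routing (step 3): keep unseen candidates that overlap the query."""
    q = tokens(query)
    return {n for n in candidates if n not in H and q & tokens(n[-1])}

def is_sufficient(query, H):
    """Stand-in for the LLM's termination decision."""
    return any(n[0] == "content" for n in H)

def reconstruct(query, start_cues, R, max_steps=4):
    Z, H = {("cue", c) for c in start_cues}, set()
    for _ in range(max_steps):
        actions = f_select(query, H, Z)                  # step 1
        candidates = set()
        for a in actions:                                # step 2
            candidates |= TRAVERSALS[a](Z, R)
        Z = f_route(query, H, candidates)                # step 3: route
        H |= Z                                           # step 3: H <- H ∪ Z
        if not Z or is_sufficient(query, H):
            break
    return H

# Hypothetical memory and query.
R = {("alice", "work", "Alice works at Acme."),
     ("alice", "hobbies", "Alice climbs on Sundays.")}
H = reconstruct("Where does Alice work?", ["alice"], R)
retrieved = {n[1] for n in H if n[0] == "content"}
print(retrieved)
```

In this run the "hobbies" branch is pruned at the tag level, so the climbing episode is never opened. A real system would replace the three stand-in functions with LLM calls over `x`, `H`, and `Z`.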

Theoretical Analysis

Theorem 4.1 (Active retrieval is strictly more powerful than passive retrieval)
For any retrieval budget $T \geq 2$, the passive hypothesis class is strictly contained in the active hypothesis class:

$$\mathcal{H}_{passive}^{LM}(T) \subsetneq \mathcal{H}_{active}^{LM}(T)$$

Intuition: Active retrieval can implement any passive strategy, but there exist functions (e.g., those requiring conditional branching on intermediate retrieved content) that passive retrieval cannot realize because it must commit to all $T$ retrievals upfront. Full proof in Appendix C.
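
A small pointer-chasing example illustrates the intuition for $T = 2$. It is an illustration only, not the proof from Appendix C: the memory layout (unit 0 stores a pointer to the unit holding the answer) and all function names are invented. The check uses a necessary condition for success, namely that the retrieved set contains the target unit.

```python
from itertools import combinations

def make_memory(pointer):
    """Unit 0 stores a pointer; units 1..3 store candidate answers (toy instance)."""
    return {0: pointer, 1: "answer-1", 2: "answer-2", 3: "answer-3"}

def active_retrieve(memory):
    """T = 2 adaptive reads: the second index depends on what the first read returned."""
    first = memory[0]
    return [0, first]

def passive_covers_all(indices):
    """A fixed index set can only work if it contains the target for every pointer value."""
    return all(pointer in indices for pointer in (1, 2, 3))

# Same query, three memories that differ only in where unit 0 points.
instances = [make_memory(p) for p in (1, 2, 3)]
active_ok = all(m[0] in active_retrieve(m) for m in instances)
passive_ok = any(passive_covers_all(pair) for pair in combinations(range(4), 2))
print(active_ok, passive_ok)  # True False
```

Any fixed pair of indices covers at most two of the three possible targets, so every passive choice fails on some instance, while the adaptive policy follows the pointer and always reaches the target.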


Empirical Validation / Results

Benchmarks and Baselines

  • LoCoMo: long conversational memory understanding; questions by type: multi-hop, temporal, open-domain, single-hop.
  • LongMemEval: multi-session, temporal-reasoning, and preference questions with longer histories.
  • Baselines: RAG, A-Mem, MemoryOS, LangMem, Mem0. Backbones: Gemini-2.5-Flash, Claude-Sonnet-4.5. Metrics: F1, LLM-Judge (J, scored by GPT-4o-mini), and evidence recall.

Main Results (RQ1)

Table 1: Performance on LoCoMo

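| Model | Method | Multi-hop F1 ↑ | Multi-hop J ↑ | Temporal F1 ↑ | Temporal J ↑ | Open Domain F1 ↑ | Open Domain J ↑ | Single hop F1 ↑ | Single hop J ↑ | Overall J ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Gemini | RAG | 34.89 | 58.16 | 43.52 | 49.22 | 25.68 | 41.67 | 53.69 | 69.20 | 61.30 |
| Gemini | A-Mem | 33.81 | 53.54 | 40.22 | 49.53 | 12.49 | 33.33 | 46.39 | 61.83 | 55.97 |
| Gemini | MemoryOS | 41.42 | 63.82 | 35.91 | 47.04 | 23.43 | 41.66 | 54.82 | 71.90 | 63.35 |
| Gemini | LangMem | 40.67 | 61.34 | 44.70 | 53.58 | 20.49 | 38.54 | 48.20 | 69.68 | 62.86 |
| Gemini | Mem0 | 45.17 | 68.79 | 58.19 | 61.68 | 26.24 | 41.66 | 54.37 | 73.72 | 68.31 |
| Gemini | MRAgent | 43.69 | 75.17 | 67.66 | 80.37 | 32.51 | 68.75 | 64.08 | 90.48 | 84.21 |
| Claude | RAG | 34.53 | 57.45 | 43.39 | 48.29 | 26.56 | 43.75 | 53.66 | 69.20 | 61.10 |
| Claude | A-Mem | 42.45 | 71.67 | 47.73 | 55.48 | 22.02 | 47.57 | 55.19 | 74.71 | 68.45 |
| Claude | MemoryOS | 32.94 | 60.99 | 39.14 | 51.09 | 18.29 | 48.95 | 45.46 | 66.49 | 61.18 |
| Claude | LangMem | 44.37 | 70.92 | 56.64 | 80.68 | 22.66 | 54.71 | 54.36 | 83.12 | 78.61 |
| Claude | Mem0 | 48.66 | 75.88 | 49.50 | 53.58 | 28.58 | 56.25 | 54.43 | 74.07 | 69.02 |
| Claude | MRAgent | 56.72 | 90.19 | 69.82 | 85.34 | 34.67 | 71.57 | 68.62 | 91.10 | 88.32 |
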
  • MRAgent achieves a 23.3% relative improvement in overall J over the best baseline on Gemini (Mem0: 84.21 vs 68.31) and 12.4% on Claude (LangMem: 88.32 vs 78.61).
  • Gains are especially large on temporal questions (Gemini J: 61.68 for Mem0 → 80.37) and multi-hop questions (Claude J: 75.88 for Mem0 → 90.19).

Table 2: LongMemEval (LLM-Judge ↑)

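| Method | Multi-session | Single-session-user | Temporal-reasoning | Single-session-preference | Overall |
|---|---|---|---|---|---|
| RAG | 54.89 | 85.71 | 42.86 | 33.33 | 54.65 |
| A-Mem | 42.85 | 90.00 | 45.11 | 46.43 | 52.98 |
| MemoryOS | 56.39 | 87.14 | 38.35 | 46.67 | 54.92 |
| LangMem | 52.63 | 78.57 | 45.71 | 36.67 | 53.77 |
| Mem0 | 50.38 | 78.57 | 45.11 | 40.00 | 53.01 |
| MRAgent | 68.42 | 92.85 | 68.42 | 66.67 | 72.95 |
| MRAgent* (Claude retrieval) | 86.46 | 92.85 | 85.71 | 78.57 | 86.76 |
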
  • About 33% relative improvement over the best baseline (MemoryOS 54.92 → MRAgent 72.95; 18.03 / 54.92 ≈ 32.8%).
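
The relative improvements quoted for both benchmarks follow from the overall LLM-Judge scores in Tables 1 and 2. The short check below recomputes them; the helper name `relative_improvement` is just a label for the usual (new − baseline) / baseline formula.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Overall J scores copied from Tables 1 and 2.
locomo_gemini = relative_improvement(84.21, 68.31)  # MRAgent vs Mem0
locomo_claude = relative_improvement(88.32, 78.61)  # MRAgent vs LangMem
longmemeval = relative_improvement(72.95, 54.92)    # MRAgent vs MemoryOS

print(f"{locomo_gemini:.1f} {locomo_claude:.1f} {longmemeval:.1f}")  # 23.3 12.4 32.8
```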

Cost Analysis (RQ2)

Table 3: Token consumption and runtime on LongMemEval (Gemini backbone)

| Method | Token Consumption | Runtime (s) |
|---|---|---|
| A-Mem | 632k | 1,122.23 |
| MemoryOS | 273k | 3,135.54 |
| LangMem | 3,268k | 1,209.57 |
| Mem0 | 245k | 533.29 |
| MRAgent | 118k | 586.11 |

  • MRAgent reduces token cost by ~81% vs A-Mem, and runtime is competitive with the most efficient baseline (Mem0) despite multi-step reasoning.
  • Efficiency stems from lightweight construction and query-time selective retrieval via tags.
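
For readers who want the cost comparison against every baseline rather than A-Mem alone, the figures in Table 3 can be turned into percentage reductions as sketched below. The dictionary and function names are illustrative; the numbers are copied from the table.

```python
# Per-sample token consumption (thousands) and runtime (seconds) from Table 3.
tokens_k = {"A-Mem": 632, "MemoryOS": 273, "LangMem": 3268, "Mem0": 245, "MRAgent": 118}
runtime_s = {"A-Mem": 1122.23, "MemoryOS": 3135.54, "LangMem": 1209.57,
             "Mem0": 533.29, "MRAgent": 586.11}

def reduction_pct(ours, theirs):
    """Percentage by which `ours` is lower than `theirs`."""
    return 100.0 * (1.0 - ours / theirs)

token_reduction = {
    name: reduction_pct(tokens_k["MRAgent"], cost)
    for name, cost in tokens_k.items() if name != "MRAgent"
}
runtime_ratio_vs_mem0 = runtime_s["MRAgent"] / runtime_s["Mem0"]

for name, pct in token_reduction.items():
    print(f"{name}: {pct:.1f}% fewer tokens")
print(f"runtime vs Mem0: {runtime_ratio_vs_mem0:.2f}x")
```

On these figures the token saving ranges from roughly half (against Mem0) to over 95% (against LangMem), while runtime is about 10% higher than Mem0's.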

Ablation Study (RQ3)

Figure 5 (summary): Ablations on LoCoMo multi-hop questions (Claude backbone):

  • Structural variants: CE (Cue → Episode), CTE (Cue–Tag–Episode), CTC (Cue–Tag–Content).
  • With vs without reasoning: All structures benefit from active multi-step reasoning (blue bars > green bars).
  • CTC + reasoning gives highest Recall and J score.
  • Removing semantic memory degrades performance, confirming complementarity of episodic and semantic layers.

Multi-turn Reasoning Analysis (RQ4)

  • On multi-hop queries, recall improves by >30% across successive reasoning steps (red line in Figure 6a).
  • Average Turns ≈ Max Valid Turns (Figure 6b), indicating the LLM effectively decides when to stop, minimizing redundancy.
  • Increasing parallel retrieval budget cannot replace deeper sequential reasoning (Appendix D.6).

Case Study (RQ5)

Figure 7 shows MRAgent traversing from cue "Jonna" through multiple tags (e.g., "November 2024 submission", "late rejection") to retrieve both episodic and semantic memories, then reasoning over higher-level topics to align evidence for a temporally complex query.


Theoretical and Practical Implications

  • Theoretical: Formal proof that active retrieval is strictly more expressive than passive retrieval. This provides a theoretical foundation for designing adaptive memory systems.
  • Practical: MRAgent demonstrates that deferring relational reasoning to retrieval time, rather than encoding all dependencies during construction, yields both higher accuracy and lower computational cost—a counterintuitive result that challenges existing memory system design.
  • Design principle: Separating memory into cue-tag-content layers allows LLMs to reason over abstract associations before committing to expensive content retrieval, enabling selective, on-demand access.

Conclusion and Discussion

Main contributions:

  1. Active memory reconstruction paradigm that integrates LLM reasoning directly into multi-step memory access.
  2. Cue–Tag–Content graph enabling two-stage associative retrieval with semantic routing.
  3. Theoretical proof of superiority of active over passive retrieval.
  4. Relative LLM-Judge gains of 12–23% over the best baselines on LoCoMo, with an 81% reduction in token cost relative to A-Mem.

Limitations:

  • Deeper exploration steps increase latency compared to one-shot retrieval.
  • Static construction → memory grows monotonically; no updating or forgetting mechanisms.

Future directions: adaptive memory construction, lightweight memory maintenance, and robust traversal policies, to extend active reconstruction to broader long-horizon deployments.

Code available at: https://github.com/Ji-shuo/MRAgent
