Summary (Overview)

  • Framework Decomposition: The paper is the first to systematically decompose agent memory systems into four core data management modules: memory representation and storage, extraction, retrieval and routing, and maintenance. A structured taxonomy covering 12 representative systems is provided.
  • No Universal Winner: End-to-end evaluation across 5 workloads (11 datasets) shows that no single architecture dominates. Graph-based methods excel at single-hop factual recall and updates, composite hybrid systems lead on conversational QA, and trace-preserving memories are best for stateful execution.
  • Retrieval Fidelity Degrades with Temporal Distance: Retrieval accuracy drops sharply as the gap between evidence and query increases. Explicit query planning and balanced hybrid search mitigate this, but flat similarity search fails for long-range evidence.
  • Robustness Under Dynamic Updates: Graph-based systems (e.g., Zep, Cognee) handle knowledge updates most reliably. Append-only stores and fact-extraction plugins struggle, leading to "hallucinations of the past."
  • Operational Cost Trade-offs: Localized maintenance (e.g., LightMem, MemTree) is far more cost-efficient than global reorganization (e.g., Zep, Cognee). Highly structured systems incur orders-of-magnitude higher latency without proportional accuracy gains.

Introduction and Theoretical Foundation

The paper addresses the rapid evolution of LLM agent memory from simple retrieval-augmented mechanisms into full-fledged data management systems that support persistent storage, retrieval, update, consolidation, and lifecycle governance. Despite this progress, existing evaluations treat agent memory as a monolithic black box, focusing on end-to-end task success metrics (F1, BLEU) and neglecting critical system-level concerns: operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates.

The authors define agent memory M as a persistent data management object that maintains cumulative state beyond a single inference step. An agent memory system is formalized as a tuple of four modules:

M_{sys} = \langle R, S, Q, U \rangle

where:

  • R: Memory Representation and Storage – logical format (token sequence, graph/tree, composite) and physical backend (transient register, single-engine, multi-engine)
  • S: Memory Extraction – how raw input streams are transformed into logical primitives (raw concatenation, schema-free semantic, schema-constrained structured)
  • Q: Memory Retrieval and Routing – how relevant subsets are identified (native attention, semantic KNN, subgraph traversal, agentic routing, hybrid execution)
  • U: Memory Maintenance – policies for conflict resolution, capacity management, and semantic consolidation
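
As a rough illustration of the tuple above, the four modules can be sketched as swappable parts of one object. This is a minimal sketch, not the paper's implementation: the names `MemorySystem`, `extract_raw`, `retrieve_overlap`, and `evict_oldest` are hypothetical, the store is a plain list of strings, and word overlap stands in for semantic KNN.

```python
from dataclasses import dataclass, field
from typing import Callable

Store = list[str]  # R: simplest possible representation, a list of text items


def extract_raw(turn: str) -> list[str]:
    """S: raw concatenation, i.e. keep the turn verbatim."""
    return [turn]


def retrieve_overlap(store: Store, query: str, k: int) -> list[str]:
    """Q: rank items by word overlap with the query (a stand-in for semantic KNN)."""
    q = set(query.lower().split())
    ranked = sorted(store, key=lambda item: len(q & set(item.lower().split())), reverse=True)
    return ranked[:k]


def evict_oldest(store: Store, capacity: int) -> Store:
    """U: capacity management by dropping the oldest items (assumes capacity > 0)."""
    return store[-capacity:]


@dataclass
class MemorySystem:
    """Hypothetical container for M_sys = <R, S, Q, U>; each module is swappable."""
    extract: Callable[[str], list[str]] = extract_raw
    retrieve: Callable[[Store, str, int], list[str]] = retrieve_overlap
    maintain: Callable[[Store, int], Store] = evict_oldest
    capacity: int = 100
    store: Store = field(default_factory=list)

    def write(self, turn: str) -> None:
        # Extraction feeds the store, then maintenance enforces the capacity policy.
        self.store.extend(self.extract(turn))
        self.store = self.maintain(self.store, self.capacity)

    def read(self, query: str, k: int = 3) -> list[str]:
        return self.retrieve(self.store, query, k)
```

Swapping one callable while holding the others fixed mirrors the one-module-at-a-time comparisons the paper reports later.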

The paper distinguishes agent memory from stateless RAG and from traditional database workloads. Unlike RAG, agent memory is persistent, updatable, and governs the full lifecycle. Unlike OLTP/OLAP, agent memory access is semantic, contents evolve under conflicting observations, and workloads combine long-context synthesis, episodic recall, temporal reasoning, and streaming updates.

Methodology

The authors propose an analytical framework decomposing agent memory into four core modules, each with a structured taxonomy:

  1. Memory Representation & Storage: Token-level sequences (explicit discrete text or implicit continuous vectors), graph/tree topologies (temporal knowledge graphs, hierarchical trees), and heterogeneous composites (combining text, metadata, embeddings, and links across multiple backends).

  2. Memory Extraction: Raw sequence concatenation, schema-free semantic extraction (isolating standalone facts), schema-constrained structured extraction (populating rigid schemas like triplets or JSON fields).

  3. Memory Retrieval & Query Routing: Native attention, semantic dense retrieval, topological subgraph traversal, autonomous agentic routing (function call invocation or generative query expansion), and multi-stage hybrid execution (sequential filtering or parallel ensemble fusion).

  4. Memory Maintenance: Timestamp-based multi-versioning, capacity-driven physical eviction (constraint-based or score-based), LLM-driven semantic consolidation (inline compaction or tool-driven CRUD), and continuous parametric optimization.
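
Of the maintenance policies listed above, timestamp-based multi-versioning is the easiest to illustrate. The sketch below assumes integer timestamps and a key-value fact model; `VersionedMemory` and its methods are hypothetical names, not taken from any of the evaluated systems.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VersionedFact:
    value: str
    valid_from: int  # hypothetical integer timestamp


@dataclass
class VersionedMemory:
    """Keeps every version of a fact; a read returns the version valid at a given time."""
    facts: dict[str, list[VersionedFact]] = field(default_factory=dict)

    def upsert(self, key: str, value: str, ts: int) -> None:
        # Never overwrite: append the new version and keep versions ordered by time.
        versions = self.facts.setdefault(key, [])
        versions.append(VersionedFact(value, ts))
        versions.sort(key=lambda v: v.valid_from)

    def get(self, key: str, as_of: Optional[int] = None) -> Optional[str]:
        versions = self.facts.get(key, [])
        if as_of is not None:
            versions = [v for v in versions if v.valid_from <= as_of]
        return versions[-1].value if versions else None
```

Because old versions are retained rather than overwritten, the same store can answer both "what is true now" and "what was true at time t".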

The experimental design includes:

  • 12 representative memory systems across all taxonomy categories plus 2 reference baselines (Long Context and Embedding RAG)
  • 5 benchmark workloads: LoCoMo (long-conversation QA), LongMemEval (multi-session long memory), DB-Bench (procedural execution from LifelongAgentBench), LongBench (context-length robustness), and LongMemEval temporal slices
  • 5 research questions: RQ1 (overall effectiveness), RQ2 (retrieval fidelity), RQ3 (dynamic update robustness), RQ4 (long-horizon stability), RQ5 (operational cost)
  • Fine-grained ablation: Controlled variants modifying one module at a time to isolate contributions
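
The one-module-at-a-time ablation can be pictured as enumerating configurations that differ from a base system in exactly one module. The sketch below is an assumption about how such variants might be generated; the function name and the option labels are hypothetical.

```python
def single_module_variants(
    base: dict[str, str], options: dict[str, list[str]]
) -> list[dict[str, str]]:
    """Return configs that differ from `base` in exactly one module."""
    variants = []
    for module, choices in options.items():
        for choice in choices:
            if choice != base[module]:
                # Copy the base and change only this module.
                variants.append({**base, module: choice})
    return variants


# Hypothetical labels for the four modules R, S, Q, U.
base = {"R": "text", "S": "raw", "Q": "dense", "U": "none"}
options = {"S": ["raw", "semantic"], "Q": ["dense", "hybrid", "graph"]}
variants = single_module_variants(base, options)
```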

Empirical Validation / Results

RQ1: Overall Effectiveness

Figure 7 (summarized): No system dominates across all workloads. Structure-aware systems lead on LongMemEval (Zep: 48.0 LLM Judge Accuracy, Cognee: 35.3 ROUGE-L F1). Hybrid filtering is strongest on LoCoMo exactness (MemOS: 11.5 EM). Trace-preserving memories are strongest on DB-Bench (Long Context: 48.2 EM, MemoChat: 55.4 Task Success Rate). Systems with full workload coverage (MemoryOS, MemOS) remain closest to the frontier.

Finding 1 (Workload-Aligned Memory): Strong memory is defined by alignment with the workload bottleneck: relation- and time-aware retrieval for dispersed cross-session reasoning; coarse-to-fine filtering for long but coherent dialogues; trace preservation for stateful execution.

RQ2: Retrieval Fidelity

Figure 8: Semantic-based retrieval yields high early hits (SimpleMem: 39.0 Recall@1), but graph- and hierarchy-based systems are stronger for evidence completeness (A-MEM: 69.5/85.9 Recall@5/@10; MemTree: 59.7/80.5). Recall@10 degrades sharply as the evidence distance gap grows: Embedding RAG drops from ~60% (1-5 bins) to <20% (26-31 bins), while A-MEM and MemOS remain above 60%.
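
To make the distance-binned measurement concrete, the sketch below computes Recall@k per query and averages it within evidence-gap bins. It assumes Recall@k is the fraction of gold evidence items found in the top k results and uses a hypothetical fixed bin width of 5; the paper's exact definition and binning may differ.

```python
from collections import defaultdict


def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold evidence ids that appear in the top-k retrieved ids."""
    if not gold:
        return 0.0
    return len(set(retrieved[:k]) & gold) / len(gold)


def recall_by_gap(records, k: int, bin_width: int = 5) -> dict[int, float]:
    """records: iterable of (gap, retrieved_ids, gold_ids), with gap >= 1.

    Returns the mean Recall@k per bin, where gaps 1..bin_width map to bin 0,
    the next bin_width gaps to bin 1, and so on.
    """
    sums: dict[int, float] = defaultdict(float)
    counts: dict[int, int] = defaultdict(int)
    for gap, retrieved, gold in records:
        b = (gap - 1) // bin_width
        sums[b] += recall_at_k(retrieved, gold, k)
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}
```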

Finding 2 (Evidence-Centric Organization): Retrieval quality depends more on organizing evidence for reconstruction than on top-1 ranking. Explicit structure (links/hierarchy) is most valuable for scattered or temporally distant evidence; flat similarity search is only effective for short-range access.

RQ3: Dynamic Update Robustness

Table 2: Robustness over Memory Update Settings

Method          LoCoMo Temporal   LongMemEval Knowledge Update   LongMemEval Temporal Reasoning
                (EM)              (Ans. F1)                      (Substr. EM)
Long Context    8.1               26.9                           20.0
Embedding RAG   1.6               7.9                            20.0
Cognee          4.0               28.1                           37.8
Zep             4.8               18.1                           44.4
MemOS           8.9               28.0                           28.9
MemoryOS        3.2               22.7                           35.6
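
The three metrics in Table 2 (EM, Ans. F1, Substr. EM) are not defined in this summary. The sketch below uses the common QA-evaluation conventions (lowercasing, stripping punctuation and English articles, token-overlap F1), which is an assumption about what the paper computes. Scores here are on a 0-1 scale, whereas the table appears to report percentages.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def substring_em(pred: str, gold: str) -> float:
    """1.0 if the normalized gold answer appears anywhere inside the prediction."""
    g = normalize(gold)
    return float(bool(g) and g in normalize(pred))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions a verbose but correct answer scores 0 on EM and 1 on Substr. EM, which is one reason a system's ranking can differ across the three columns.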

Graph-based methods (Zep, Cognee) lead on fact revision; hybrid filtered memory (MemOS) excels on latest-state grounding. The backbone robustness ablation (Figure 9) shows that stronger LLMs improve answer realization once evidence has been localized, but do not compensate for weak temporal grounding.

Finding 3 (Temporal Update Fidelity): Revisability must be built into the representation; query-time selectivity should match the workload bottleneck; LLM scaling is valuable only after grounding succeeds.

RQ4: Long-Horizon Stability

Figure 10: On LongBench, SimpleMem stays nearly unchanged (35.2→34.9 Accuracy) while Long Context drops sharply (42.6→19.0). On LoCoMo, Embedding RAG falls from 37.1 to 7.4 Answer F1 as evidence gap widens, while graph/consolidated systems (Cognee, MemOS, MemoryOS) remain substantially higher. Graph-organized memory preserves entity–event–time relations; hierarchical organization preserves session-level structure.
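
The size of these drops is easier to compare as a relative change. The short calculation below uses only the before/after scores quoted above; the helper name `relative_drop` is illustrative.

```python
def relative_drop(before: float, after: float) -> float:
    """Fraction of the original score that was lost."""
    return (before - after) / before


# (before, after) pairs as quoted from Figure 10 in the text above.
scores = {
    "SimpleMem, LongBench Accuracy": (35.2, 34.9),
    "Long Context, LongBench Accuracy": (42.6, 19.0),
    "Embedding RAG, LoCoMo Answer F1": (37.1, 7.4),
}

for name, (before, after) in scores.items():
    print(f"{name}: {relative_drop(before, after):.1%} relative drop")
```

SimpleMem loses under 1% of its score, Long Context loses roughly 55%, and Embedding RAG loses roughly 80%.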

Finding 4 (Horizon-Structured Memory): The main challenge is choosing the right abstraction: multi-view filtering for long inputs with distractors; relation-aware indexing for distant facts; coarse-to-fine summarization for session identification before local detail resolution.

RQ5: Operational Cost

Figure 11: LightMem (48.3 Normalized Utility at 3.67s) and MemTree (63.5 at 15.9s) are most efficient. Higher-utility structured systems are expensive: MemoryOS reaches 82.0 utility only at 28.6s; Zep exceeds 84 utility only after 155.1s. On LongBench, LightMem remains at 17.3s while Mem0, MemoChat, MemoryOS, A-MEM rise to 374–552s.

Finding 5 (Operational Scaling Rule): Efficiency is governed by maintenance scope, not structure alone. Localized update/search yields the best cost-utility balance; broader recomputation offsets gains.
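
One way to read the Figure 11 numbers is as marginal utility gained per additional second of latency when moving from one system to the next most expensive one. The sketch below uses the four (latency, utility) pairs quoted above and assumes they were measured on the same workload; Zep's utility is entered as 84.0, a lower bound, since the text only says it exceeds 84.

```python
def marginal_gains(points: list[tuple[str, float, float]]) -> list[tuple[str, str, float]]:
    """points: (name, latency_seconds, utility).

    Returns (cheaper, costlier, utility gained per extra second) for
    consecutive systems ordered by latency.
    """
    ordered = sorted(points, key=lambda p: p[1])
    gains = []
    for (n0, t0, u0), (n1, t1, u1) in zip(ordered, ordered[1:]):
        gains.append((n0, n1, (u1 - u0) / (t1 - t0)))
    return gains


systems = [
    ("LightMem", 3.67, 48.3),
    ("MemTree", 15.9, 63.5),
    ("MemoryOS", 28.6, 82.0),
    ("Zep", 155.1, 84.0),  # utility is a lower bound ("exceeds 84")
]

for cheaper, costlier, gain in marginal_gains(systems):
    print(f"{cheaper} -> {costlier}: {gain:.3f} utility per extra second")
```

Under these assumptions the step from MemoryOS to Zep buys about two utility points for more than two extra minutes of latency, which is the diminishing return that Finding 5 describes.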

Fine-Grained Component Analysis

(M1) Representation & Storage (Table 3): LightMem User-Only Raw achieves the best factual recall (LoCoMo EM: 24.2, Ans. F1: 38.9; LongMemEval Substr. EM: 26.0). Compression preserves reasoning but weakens exact matching. Deeper hierarchy does not restore removed content.

(M2) Extraction (Table 4): Coverage-preserving extraction (e.g., MemoChat Heuristic Topic, MemOS Fast Memorize, LightMem Hybrid Raw) provides the best balance. Selective extraction yields modest factual retrieval gains but degrades reasoning.

(M3) Retrieval & Routing (Table 5): Balanced hybrid fusion (A-MEM Hybrid-Balanced: 24.6 Ans. F1, 27.5 Substr. EM) outperforms sparse-leaning fusion. Explicit planning (SimpleMem Planning Only: 20.7 Ans. F1, 90.6 Strict Recall) improves over no planning, but adding reflection does not help further.
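
A common way to build a "balanced" hybrid retriever is to min-max normalize dense and sparse scores and combine them with equal weight. The sketch below shows that generic technique; it is not A-MEM's actual implementation, and the function names and the `alpha` parameter are assumptions introduced for illustration.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_rank(dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5) -> list[str]:
    """Rank ids by alpha * dense + (1 - alpha) * sparse after normalization.

    alpha = 0.5 is the balanced setting; alpha < 0.5 leans sparse.
    Ids missing from one retriever contribute 0 for that retriever.
    """
    d, s = min_max(dense), min_max(sparse)
    fused = {i: alpha * d.get(i, 0.0) + (1 - alpha) * s.get(i, 0.0) for i in set(d) | set(s)}
    # Sort by descending fused score, breaking ties by id for determinism.
    return sorted(fused, key=lambda i: (-fused[i], i))
```

An item ranked well by both retrievers can overtake one ranked first by only a single retriever, which is the behavior a balanced setting is meant to encourage.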

(M4) Maintenance (Figure 12): Conservative merging (MemoryOS Conservative-Merge: 23.5 Ans. F1, 22.8 Substr. EM) outperforms delayed flushing and overly coarse summarization. Raw context (Long Context: 23.7 Substr. EM) is still best for exact phrasing.
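
The effect of a conservative merge threshold can be illustrated with a greedy grouping over embedding similarity. This is a minimal sketch under stated assumptions: the two-dimensional vectors are made up, `merge_similar` is a hypothetical helper, and MemoryOS's real consolidation logic is not described in this summary.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_similar(entries: list[tuple[str, list[float]]], threshold: float) -> list[list[str]]:
    """Greedy grouping: an entry joins the first group whose representative
    vector has cosine similarity >= threshold; otherwise it starts a new group."""
    groups: list[tuple[list[float], list[str]]] = []
    for text, vec in entries:
        for rep, members in groups:
            if cosine(rep, vec) >= threshold:
                members.append(text)
                break
        else:
            groups.append((vec, [text]))
    return [members for _, members in groups]


# Toy entries with made-up embeddings.
entries = [
    ("likes tea", [1.0, 0.0]),
    ("likes green tea", [0.9, 0.1]),
    ("likes coffee", [0.6, 0.8]),
]
```

With a threshold of 0.95 only the two tea entries merge. Lowering it to 0.5 folds the coffee entry into the same group, the kind of overly coarse consolidation that loses distinct facts.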

Theoretical and Practical Implications

Theoretical Implications: The paper provides the first systematic data management perspective on agent memory, formalizing it as a tuple of lifecycle modules. It reveals that memory effectiveness is not about a single universal representation but about alignment with workload bottlenecks. The findings challenge the assumption that richer structure always improves performance: maintenance scope and temporal organization matter more.

Practical Implications:

  • System Designers: Choose memory architecture based on workload: graph/time-aware for dispersed cross-session reasoning; coarse-to-fine filtering for long coherent dialogues; trace-preserving for stateful execution. Localized maintenance (LightMem, MemTree) is recommended for cost-utility balance.
  • Developers: Conservative extraction (preserve raw text, include both user and assistant turns) and conservative consolidation (higher similarity thresholds) are preferred over aggressive filtering or delayed flushing.
  • Evaluation: Exact Match is insufficient; metrics should include retrieval fidelity, update robustness, and operational costs. Evidence-level tracking is critical for diagnosing failures.

Conclusion

The paper concludes that we are not yet fully ready for an agent-native memory system. Key takeaways:

  • No single architecture dominates; effectiveness depends on workload-memory alignment.
  • Representational fidelity and temporal organization are more important than abstraction or hierarchy.
  • Retrieval degrades with temporal distance; explicit query planning and hybrid search help but do not fully solve this.
  • Update robustness requires graph structures and lifecycle management.
  • Operational costs vary by orders of magnitude; localized maintenance is far more efficient than global reorganization.
  • Conservative strategies in both extraction and maintenance outperform aggressive alternatives.

Future directions include designing memory systems that adapt to workload bottlenecks, developing evidence-level retrieval metrics, and creating cost-model-aware architectures that balance utility with latency. The code and testbed are publicly released to facilitate further research.
