MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Summary (Overview)

  • Problem: Memory systems in LLM agents are essential for long-horizon reasoning but are unreliable and difficult to debug. Failures can originate from earlier operations (e.g., construction, update) and only manifest later, making causal attribution challenging.
  • Solution: Proposes a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow.
  • Benchmark: Constructs MemTraceBench, a diagnostic benchmark with 160 human-annotated failure cases from four memory systems (Long-Context, RAG, Mem0, EverMemOS) and three datasets.
  • Method: Introduces MemTrace, an automatic attribution method that iteratively traces operation subgraphs to pinpoint the decisive error set (root cause).
  • Impact: Analysis reveals systematic memory failures from operation-level issues (e.g., information loss, retrieval misalignment). Attribution signals can guide automatic prompt optimization, boosting end-task performance by up to 7.62%.

Introduction and Theoretical Foundation

Memory systems enable LLM agents to become stateful, supporting long-horizon tasks and continual learning. However, as these systems grow complex, diagnosing failures becomes a significant challenge. Unlike stateless agents where failures are localized to the current execution, memory-augmented agents maintain persistent states across interactions. A failure (e.g., an incorrect answer) may originate from a faulty memory operation (construction, update, deletion) that occurred much earlier in the interaction history.

The core problem is a traceability gap: while failures are observable, the specific faulty operations, when they were introduced, and how the error propagated through the memory pipeline remain difficult to identify. Existing linear execution logs lack the structured representation needed to track how memory variables are created, modified, overwritten, and propagated.

This paper addresses the problem of error tracing and attribution in LLM memory systems. The key idea is to expose memory-system execution as a unified operation-variable graph that explicitly captures information flow dependencies across turns and sessions.

Formal Problem Definition: The system is denoted as $M$, which processes a trajectory $\tau$ and answers a question $q$ with a prediction $\hat{a}$ (golden answer $a$). Its execution involves memory updates $U_M$, memory reads $R_M$, and answer generation $Q$.

Instrumentation produces an execution graph $G = (V, O, E)$, a directed acyclic bipartite graph where:

  • $V$: Set of variables (artifacts like raw messages, memory units, prompts).
  • $O$: Set of operations (computation steps like LLM inference, retrieval, parsing).
  • $E$: Directed edges capturing information flow. An operation $o \in O$ has inputs $\text{In}(o) \subset V$ and outputs $\text{Out}(o) \subset V$.

A binary outcome indicator $Z(G) \in \{0, 1\}$ marks success ($0$) or failure ($1$).

The goal is to identify the Decisive Error Set $O^*$, the earliest and minimal causal cut-set of faulty operations that caused the failure. Formally, $O^*$ must satisfy:

  1. Every operation in $O^*$ is faulty.
  2. All operations in its strictly upstream ancestor set $\text{Anc}_G(O^*)$ are correct.
  3. It is causally sufficient: replacing the faulty outputs of $O^*$ with correct ones (assuming ideal downstream execution) rescues the failure, i.e., $Z(G(O^*, *)) = 0$.
  4. It is minimal: no proper subset $O' \subset O^*$ is also causally sufficient.

$$O^* \in \{\, O_c \in \mathcal{F}(G) \mid \nexists\, O' \in \mathcal{F}(G) \ \text{s.t.} \ O' \subset O_c \,\}.$$

Here $\mathcal{F}(G)$ is presumably the family of operation sets that satisfy conditions 1 to 3, so $O^*$ is a minimal element of that family under set inclusion.

This reduces failure attribution to identifying a minimal topological frontier of faulty operations.
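
The frontier idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `ExecutionGraph` class, the `faulty_frontier` helper, and the operation names are all hypothetical, and the sketch assumes each variable is produced by at most one operation. It only checks conditions 1 and 2 (faulty operations with no faulty ancestor); causal sufficiency and minimality would still need counterfactual re-execution.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionGraph:
    """Bipartite operation-variable graph: edges run variable -> op -> variable."""
    inputs: dict[str, set[str]] = field(default_factory=dict)   # op -> In(o)
    outputs: dict[str, set[str]] = field(default_factory=dict)  # op -> Out(o)

    def add_op(self, op, ins, outs):
        self.inputs[op] = set(ins)
        self.outputs[op] = set(outs)

    def ancestors(self, op):
        """All operations strictly upstream of `op`."""
        # Assumption: every variable has at most one producing operation.
        producer = {v: o for o, outs in self.outputs.items() for v in outs}
        seen, stack = set(), [op]
        while stack:
            current = stack.pop()
            for var in self.inputs[current]:
                parent = producer.get(var)
                if parent is not None and parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


def faulty_frontier(graph, faulty):
    """Faulty operations that have no faulty ancestor (conditions 1 and 2)."""
    return {o for o in faulty if not (graph.ancestors(o) & faulty)}


# Toy pipeline with hypothetical operation names.
g = ExecutionGraph()
g.add_op("extract", ["m1"], ["fact1"])
g.add_op("update", ["fact1"], ["mem1"])
g.add_op("retrieve", ["mem1", "q"], ["ctx"])
g.add_op("answer", ["ctx", "q"], ["a_hat"])

# "answer" is faulty only because it inherits the error from "update".
frontier = faulty_frontier(g, {"update", "answer"})
print(frontier)
```

In this toy pipeline the frontier contains only `update`, since `answer` sits downstream of another faulty operation.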

Methodology

MemTrace is proposed as an automatic attribution method, casting the problem as an agentic graph exploration task. The agent iteratively inspects local operation subgraphs in $G$ to locate the decisive error $o^*$ (assuming a singleton set for this work).

Key Components:

  1. Initialization of Starting Point: Rather than starting from all input messages, which would make the search space large, MemTrace runs hybrid retrieval (dense + sparse, fused via Reciprocal Rank Fusion) over the raw message set $\{m_i\}_{i=1}^n$, using a query that concatenates the question $q$ and the golden answer $a$. The top-$\lfloor N/2 \rfloor$ retrieved messages, together with $q$, form the initial to-explore list $L_0$, a priority queue of variable nodes ordered by earliest timestamp $t_v$. The list size is $N = 16$.

  2. Execution Graph Exploration: At iteration $j$:

    • Selects the variable $v_t$ with the earliest timestamp from list $L_{j-1}$.
    • Retrieves all operations directly involving it: $O(v_t) = \{ o \in O \mid v_t \in \text{In}(o) \cup \text{Out}(o) \}$.
    • For each $o \in O(v_t)$, converts the operation subgraph $G_o$ into a textual representation (name, category, inputs, outputs, dependencies).
    • The agent judges whether $o$ is the decisive error. If not, it adds relevant downstream variables to the list: $L_j = (L_{j-1} \setminus \{v_t\}) \cup A_j$, where $A_j \subseteq V$ is the set of new variables to explore.
    • The process repeats until $o^*$ is found or a maximum iteration limit (200) is reached.
  3. Working Context Management: Handles large graphs via:

    • Preview mode: Omits concrete variable values initially.
    • Targeted access: Pagination and regex search for large values.
    • Automatic summarization: Triggered when the context exceeds a safety threshold $T$ (272,000 tokens).
  4. Search-Based Variant (MemTrace-OBS): For weakly structured graphs (e.g., long-context), compresses each operation subgraph into a textual operation block (removing dependency edges/IDs, keeping inputs/outputs/attributes). All blocks are sorted by timestamp into a weakly structured log. The agent is equipped with a global operation-search tool (regex-based) to navigate this log efficiently.
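
The initialization and exploration steps above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' code: the function names, the toy graph, and the timestamps are hypothetical; the RRF constant `k=60` is a commonly used default that the source does not specify; the `judge` callable is a deterministic stand-in for the LLM agent; and skipping operations that were already judged is a simplification added here.

```python
import heapq


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def explore(ops, timestamps, start_vars, judge, max_iters=200):
    """Earliest-timestamp-first exploration of an operation-variable graph.

    ops: op name -> (input variables, output variables)
    judge: callable(op) -> (is_decisive, variables to explore next)
    """
    queue = [(timestamps[v], v) for v in set(start_vars)]
    heapq.heapify(queue)
    queued = set(start_vars)
    judged = set()
    for _ in range(max_iters):
        if not queue:
            break
        _, var = heapq.heappop(queue)
        # O(v_t): every operation that reads or writes the selected variable.
        touching = [o for o, (ins, outs) in ops.items() if var in ins or var in outs]
        for op in sorted(touching):
            if op in judged:
                continue
            judged.add(op)
            is_decisive, next_vars = judge(op)
            if is_decisive:
                return op
            for nxt in next_vars:
                if nxt not in queued:
                    queued.add(nxt)
                    heapq.heappush(queue, (timestamps[nxt], nxt))
    return None


# Toy example (all names and timestamps are made up for illustration).
dense = ["m3", "m1", "m7", "m2"]
sparse = ["m1", "m3", "m2", "m9"]
fused = reciprocal_rank_fusion([dense, sparse])

N = 4
start = fused[: N // 2] + ["q"]

ops = {
    "extract": (["m1"], ["f1"]),
    "update": (["f1"], ["mem1"]),
    "retrieve": (["mem1", "q"], ["ctx"]),
    "answer": (["ctx", "q"], ["a_hat"]),
}
timestamps = {"m1": 1, "m3": 3, "f1": 4, "mem1": 5, "q": 10, "ctx": 11, "a_hat": 12}

# Stub judge: pretend "update" is the decisive error; otherwise follow outputs.
result = explore(ops, timestamps, start, lambda op: (op == "update", ops[op][1]))
print(fused, result)
```

Ordering the queue by timestamp means upstream operations are judged before their consumers, which matches the goal of finding the earliest faulty operation rather than a downstream symptom.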

Supporting Toolkit: smartcomment, a lightweight tracing package, is developed to instrument memory systems for recording operations, variables, and dependencies, enabling the construction of execution graphs.

Empirical Validation / Results

Experimental Setup:

  • Benchmark: MemTraceBench (160 failure cases from Long-Context, RAG, Mem0, EverMemOS systems on LoCoMo, LongMemEval, RealMem datasets).
  • Backbones: GPT-4.1 mini and GPT-5.4.
  • Metrics: Error Type prediction Accuracy (ETA), Faulty Operation Identification Accuracy (OIA), token cost, runtime.

Main Results:

Table 1: Failure Attribution Accuracy on MemTraceBench

Backbone      Method        Long-Context      RAG               Mem0
                            ETA      OIA      ETA      OIA      ETA
GPT-4.1 mini  MemTrace-OBS   9.17     3.33    25.83    17.5     33.33
              MemTrace      20.83     4.17    41.67    26.67    35.83
GPT-5.4       MemTrace-OBS   7.50     7.50    87.50    87.50    60.00
              MemTrace      20.00    20.00    72.50    65.83    70.00

(The Mem0 OIA, EverMemOS, and Overall columns are not available here.)

  • Graph-based exploration (MemTrace) improves ETA, especially for smaller LLMs (GPT-4.1 mini: 20.00% → 36.46%). It constrains the search scope, forcing the agent to follow information flow.
  • Operation identification (OIA) is significantly harder than error type prediction (ETA), with best overall OIA at 46.25%.
  • Long-context memory is the most challenging setting for attribution.

Table 2: Average Inference Cost (Tokens in thousands, Time in minutes)

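Backbone      Method        Long-Context            RAG                     Mem0
                            Tokens (k)  Time (min)  Tokens (k)  Time (min)  Tokens (k)
GPT-4.1 mini  MemTrace-OBS    692.79      2.45        684.95      1.65      1,077.10
              MemTrace      4,471.10      7.06        839.48      3.84        830.85
GPT-5.4       MemTrace-OBS    373.89      0.95        277.32      0.67        210.00
              MemTrace      2,524.81      5.11      1,477.03      3.41        846.09

(The Mem0 time, EverMemOS, and Overall columns are not available here.)
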
  • MemTrace-OBS substantially reduces cost, especially on weakly structured traces (long-context). It uses only 15.25% of the tokens and 27.94% of the runtime of MemTrace on the long-context subset.
  • Despite higher cost, MemTrace is still substantially faster and cheaper than manual human attribution.
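
The long-context percentages quoted above can be reproduced from the Long-Context columns of Table 2. The short check below assumes the figures are pooled over both backbones (summing MemTrace-OBS cost and dividing by summed MemTrace cost); that pooling is an inference from the numbers, not something the summary states.

```python
# Long-context columns of Table 2: (tokens in thousands, minutes) per backbone.
obs = {"GPT-4.1 mini": (692.79, 2.45), "GPT-5.4": (373.89, 0.95)}
full = {"GPT-4.1 mini": (4471.10, 7.06), "GPT-5.4": (2524.81, 5.11)}

# Pool both backbones, then take the OBS / MemTrace ratio.
token_ratio = sum(t for t, _ in obs.values()) / sum(t for t, _ in full.values())
time_ratio = sum(m for _, m in obs.values()) / sum(m for _, m in full.values())

print(f"{token_ratio:.2%} of tokens, {time_ratio:.2%} of runtime")
# prints "15.25% of tokens, 27.94% of runtime"
```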

Key Analyses:

  1. Error Distribution: Analysis of MemTraceBench reveals distinct failure patterns per system:
    • RAG: No extraction errors (no extraction module), but frequent retrieval and response errors.
    • Mem0 & EverMemOS: Both have extraction modules, but EverMemOS's is more robust (very few extraction errors).
    • Long-Context: No retrieval errors (by design), but suffers from other error types.
    • All systems exhibit response errors, indicating a challenge in effectively using retrieved memories.
  2. Auxiliary Information: Providing source evidence (golden supporting messages) and prior knowledge (system pipeline description) to MemTrace improves OIA and reduces cost.
  3. LLM-as-a-Judge Reliability: When the LLM judge identifies an error, its verdict is almost always correct. Disagreements with human annotators are typically due to the LLM being overly strict.

Theoretical and Practical Implications

Theoretical Implications:

  • Provides a formal, graph-based framework for modeling and attributing errors in stateful, non-parametric memory systems, addressing a significant traceability gap.
  • Identifies that memory failures are systematic and stem from fundamental operation-level issues like information loss and retrieval misalignment, not just random LLM errors.

Practical Implications:

  1. Diagnostic Reports for Memory Systems: MemTrace enables operation-level aggregation of failures, automatically generating summaries that pinpoint bottlenecks in specific pipeline components (e.g., Mem0's extraction module dropping fine details, EverMemOS's reranker failures).
  2. Automatic Optimization of Memory Systems: Establishes a closed-loop optimization pipeline:
    • smartcomment records the execution graph.
    • MemTrace performs credit assignment to localize the decisive faulty operation.
    • An off-the-shelf prompt optimizer is invoked only on the prompts for that localized operation, sidestepping the challenge of optimizing over long, multi-session traces.
    • Result: Applied to Mem0 on LoCoMo, three optimization rounds improved performance on a held-out test split by 7.62%, demonstrating that even imperfect attribution provides useful optimization signals.
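
As a rough illustration of how attribution output could feed both the diagnostic report and the optimization loop, the sketch below aggregates per-case attributions by operation and picks the operation whose prompt an optimizer would tune next. Everything here is hypothetical: the record format, the function names, the operation names, and the toy records are invented for illustration and are not results from the paper.

```python
from collections import Counter


def diagnostic_report(attributions):
    """Count failures per (operation, error type), most frequent first."""
    return Counter((op, err) for _, op, err in attributions).most_common()


def pick_optimization_target(attributions):
    """Operation blamed most often; its prompt is what an optimizer would tune."""
    counts = Counter(op for _, op, _ in attributions)
    return counts.most_common(1)[0][0]


# Toy attribution records: (case id, attributed operation, error type).
attributions = [
    ("case-1", "extract", "information loss"),
    ("case-2", "extract", "information loss"),
    ("case-3", "retrieve", "retrieval misalignment"),
]

report = diagnostic_report(attributions)
target = pick_optimization_target(attributions)
print(report[0], target)
```

Restricting the optimizer to the prompts of `target` keeps each optimization round local to one operation, which is the point of the credit-assignment step.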

Conclusion

This work introduces the problem of error tracing and attribution in LLM memory systems. The core contributions are:

  1. A novel framework that models memory pipelines as executable evolution graphs for fine-grained tracing.
  2. MemTraceBench, a diagnostic benchmark with annotated failures.
  3. MemTrace, an automatic attribution method that explores these graphs to locate root causes.
  4. Demonstration that attribution signals can guide automatic system optimization, leading to measurable performance gains.

Limitations & Future Work:

  • Benchmark Scale: MemTraceBench can be expanded to include broader memory forms (task, multimodal).
  • Error Set Cardinality: Current work focuses on singleton decisive error sets; extending to multi-error scenarios is important.
  • Method Improvement: Combining global search (MemTrace-OBS) with local graph exploration (MemTrace) is a promising direction.
  • Generality: The graph-based diagnosis approach could be applied to other compound agentic systems.

Ethics Statement: MemTrace is for diagnostic support and transparency. Use on real-world systems requires careful data governance (consent, anonymization, secure storage). Its diagnoses should be treated as assistive evidence, not definitive judgments, requiring human review.