MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems
Summary (Overview)
- Problem: Memory systems in LLM agents are essential for long-horizon reasoning but are unreliable and difficult to debug. Failures can originate from earlier operations (e.g., construction, update) and only manifest later, making causal attribution challenging.
- Solution: Proposes a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow.
- Benchmark: Constructs MemTraceBench, a diagnostic benchmark with 160 human-annotated failure cases from four memory systems (Long-Context, RAG, Mem0, EverMemOS) and three datasets.
- Method: Introduces MemTrace, an automatic attribution method that iteratively traces operation subgraphs to pinpoint the decisive error set (root cause).
- Impact: Analysis reveals systematic memory failures from operation-level issues (e.g., information loss, retrieval misalignment). Attribution signals can guide automatic prompt optimization, boosting end-task performance by up to 7.62%.
Introduction and Theoretical Foundation
Memory systems enable LLM agents to become stateful, supporting long-horizon tasks and continual learning. However, as these systems grow complex, diagnosing failures becomes a significant challenge. Unlike stateless agents where failures are localized to the current execution, memory-augmented agents maintain persistent states across interactions. A failure (e.g., an incorrect answer) may originate from a faulty memory operation (construction, update, deletion) that occurred much earlier in the interaction history.
The core problem is a traceability gap: while failures are observable, the specific faulty operations, when they were introduced, and how the error propagated through the memory pipeline remain difficult to identify. Existing linear execution logs lack the structured representation needed to track how memory variables are created, modified, overwritten, and propagated.
This paper addresses the problem of error tracing and attribution in LLM memory systems. The key idea is to expose memory-system execution as a unified operation-variable graph that explicitly captures information flow dependencies across turns and sessions.
Formal Problem Definition: The memory system processes an interaction trajectory and answers a question with a prediction, which is compared against a golden answer. Its execution involves memory updates, memory reads, and answer generation.
Instrumentation produces an execution graph, a directed acyclic bipartite graph made up of:
- Variables: artifacts such as raw messages, memory units, and prompts.
- Operations: computation steps such as LLM inference, retrieval, and parsing.
- Directed edges: information flow between the two, linking each operation to its input variables and output variables.
A binary outcome indicator marks each execution as a success or a failure.
The goal is to identify the Decisive Error Set, the earliest and minimal causal cut-set of faulty operations that caused the failure. Formally, the set must satisfy four conditions:
- Every operation in the set is faulty.
- All operations in its strictly upstream ancestor set are correct.
- It is causally sufficient: replacing the faulty outputs of the set with correct ones (assuming ideal downstream execution) rescues the failure, turning the outcome into a success.
- It is minimal: no proper subset is also causally sufficient.
This reduces failure attribution to identifying a minimal topological frontier of faulty operations.
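The frontier idea can be made concrete with a small sketch. The `ExecutionGraph` class, the `faulty_frontier` helper, and the toy pipeline below are hypothetical names introduced only for illustration; they are not taken from the paper. The sketch checks just the first two conditions (faulty, with all strict ancestors correct). Causal sufficiency and minimality would require counterfactual re-execution, which is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionGraph:
    """Bipartite DAG: operations read input variables and write output variables."""
    inputs: dict = field(default_factory=dict)   # operation -> set of input variables
    outputs: dict = field(default_factory=dict)  # operation -> set of output variables

    def add_op(self, op, inputs, outputs):
        self.inputs[op] = set(inputs)
        self.outputs[op] = set(outputs)

    def producers(self, var):
        """Operations that wrote `var`."""
        return {op for op, outs in self.outputs.items() if var in outs}

    def ancestors(self, op):
        """All operations strictly upstream of `op`."""
        seen, stack = set(), [op]
        while stack:
            current = stack.pop()
            for var in self.inputs[current]:
                for parent in self.producers(var):
                    if parent not in seen:
                        seen.add(parent)
                        stack.append(parent)
        return seen


def faulty_frontier(graph, faulty_ops):
    """Faulty operations that have no faulty strict ancestor."""
    return {op for op in faulty_ops if not (graph.ancestors(op) & faulty_ops)}


# Toy pipeline (illustrative): extract -> update -> retrieve -> answer.
g = ExecutionGraph()
g.add_op("extract", ["msg_1"], ["fact_1"])
g.add_op("update", ["fact_1"], ["memory_1"])
g.add_op("retrieve", ["memory_1", "question"], ["context"])
g.add_op("answer", ["context", "question"], ["prediction"])

# Suppose both the update and the final answer are judged faulty.
frontier = faulty_frontier(g, {"update", "answer"})
```

Here only `update` survives in `frontier`, because the faulty `answer` operation sits downstream of it and may simply have inherited the error.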
Methodology
MemTrace is proposed as an automatic attribution method that casts the problem as an agentic graph exploration task. The agent iteratively inspects local operation subgraphs in the execution graph to locate the decisive error (assumed to be a singleton set in this work).
Key Components:
- Initialization of Starting Point: Instead of starting from all input messages (a large search space), MemTrace runs hybrid retrieval (dense + sparse, fused via Reciprocal Rank Fusion) over the raw message set, using a query that concatenates the question and the golden answer. The top-k retrieved messages, along with , form the initial to-explore list (a priority queue of variable nodes, prioritized by earliest timestamp). List size .
- Execution Graph Exploration: At each iteration, the agent:
  - Selects the variable with the earliest timestamp from the to-explore list.
  - Retrieves all operations that directly involve that variable.
  - Converts the subgraph of each retrieved operation into a textual representation (name, category, inputs, outputs, dependencies).
  - Judges whether the operation is the decisive error. If not, it adds the relevant downstream variables to the to-explore list.
  - Repeats until the decisive error is found or the maximum iteration limit (200) is reached.
- Working Context Management: Handles large graphs via:
  - Preview mode: omits concrete variable values initially.
  - Targeted access: pagination and regex search for large values.
  - Automatic summarization: triggered when the context exceeds a safety threshold (272,000 tokens).
- Search-Based Variant (MemTrace-OBS): For weakly structured graphs (e.g., long-context), compresses each operation subgraph into a textual operation block (removing dependency edges/IDs, keeping inputs/outputs/attributes). All blocks are sorted by timestamp into a weakly structured log. The agent is equipped with a global operation-search tool (regex-based) to navigate this log efficiently.
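The seeding and exploration steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the RRF constant `k=60` (a conventional default, not given in the source), the toy identifiers, and the dictionary-backed `judge` stub that stands in for the LLM agent are all hypothetical.

```python
import heapq


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by identifier for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def explore(seeds, timestamp, ops_touching, judge, max_iters=200):
    """Visit variables earliest-first until `judge` flags an operation.

    judge(op) returns (is_decisive, downstream_variables_to_explore).
    """
    queue = [(timestamp[v], v) for v in seeds]
    heapq.heapify(queue)
    queued, checked = set(seeds), set()
    for _ in range(max_iters):
        if not queue:
            break
        _, var = heapq.heappop(queue)  # earliest timestamp first
        for op in ops_touching.get(var, []):
            if op in checked:
                continue
            checked.add(op)
            is_decisive, downstream = judge(op)
            if is_decisive:
                return op
            for nxt in downstream:
                if nxt not in queued:
                    queued.add(nxt)
                    heapq.heappush(queue, (timestamp[nxt], nxt))
    return None


# Toy data (illustrative): two retrievers rank raw messages for the query.
dense = ["m3", "m1", "m7"]
sparse = ["m3", "m9", "m1"]
seeds = reciprocal_rank_fusion([dense, sparse])[:2]

timestamp = {"m1": 1, "m3": 3, "fact_a": 4}
ops_touching = {"m1": ["extract_1"], "m3": ["extract_3"], "fact_a": ["update_a"]}
verdicts = {
    "extract_1": (False, ["fact_a"]),  # correct; follow its output downstream
    "extract_3": (False, []),
    "update_a": (True, []),            # flagged as the decisive error
}


def judge(op):
    """Stub for the LLM judgment on one operation subgraph."""
    return verdicts[op]


decisive = explore(seeds, timestamp, ops_touching, judge)
```

Ordering the queue by timestamp means upstream operations are examined before the downstream ones that may merely inherit their errors, which matches the "earliest faulty operation" reading of the decisive error.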
Supporting Toolkit: smartcomment, a lightweight tracing package, is developed to instrument memory systems for recording operations, variables, and dependencies, enabling the construction of execution graphs.
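To give a feel for what such instrumentation records, here is a toy tracer. It is not the smartcomment API, which this summary does not describe; the `Var` and `Tracer` classes, the decorator form, and the example functions are assumptions made purely for illustration.

```python
from dataclasses import dataclass
import functools
import itertools


@dataclass(frozen=True)
class Var:
    """A traced variable: an identifier plus the underlying value."""
    vid: str
    value: object


class Tracer:
    """Toy recorder of operations, variables, and dependency edges."""

    def __init__(self):
        self.operations = []
        self._counter = itertools.count(1)

    def source(self, value):
        """Register an external input such as a raw message."""
        return Var(f"v{next(self._counter)}", value)

    def op(self, category):
        """Decorator that logs each call as one operation node."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args):
                result = fn(*(a.value for a in args))
                out = Var(f"v{next(self._counter)}", result)
                self.operations.append({
                    "name": fn.__name__,
                    "category": category,
                    "inputs": [a.vid for a in args],
                    "outputs": [out.vid],
                })
                return out
            return wrapper
        return decorator


tracer = Tracer()


@tracer.op("extraction")
def extract(message):
    # Keep the text after the speaker prefix.
    return message.split(":", 1)[1].strip()


@tracer.op("update")
def update(memory, fact):
    return memory + [fact]


msg = tracer.source("user: I moved to Berlin")
mem = tracer.source([])
fact = extract(msg)
new_mem = update(mem, fact)
```

After these calls, `tracer.operations` holds two records whose input and output identifiers are enough to rebuild the operation-variable edges of a small execution graph.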
Empirical Validation / Results
Experimental Setup:
- Benchmark: MemTraceBench (160 failure cases from Long-Context, RAG, Mem0, EverMemOS systems on LoCoMo, LongMemEval, RealMem datasets).
- Backbones: GPT-4.1 mini and GPT-5.4.
- Metrics: Error Type prediction Accuracy (ETA), Faulty Operation Identification Accuracy (OIA), token cost, runtime.
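As a rough illustration of the two accuracy metrics, the sketch below scores predictions by exact match against annotations. The exact-match rule, the field names, and the four toy cases are assumptions for illustration; the source does not spell out the scoring procedure.

```python
def accuracy(predicted, gold):
    """Percentage of positions where the prediction equals the annotation."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


# Toy annotated cases (illustrative labels and operation ids).
cases = [
    {"gold_type": "extraction", "gold_op": "op_12", "pred_type": "extraction", "pred_op": "op_12"},
    {"gold_type": "retrieval", "gold_op": "op_40", "pred_type": "retrieval", "pred_op": "op_38"},
    {"gold_type": "response", "gold_op": "op_77", "pred_type": "response", "pred_op": "op_77"},
    {"gold_type": "update", "gold_op": "op_05", "pred_type": "extraction", "pred_op": "op_03"},
]

eta = accuracy([c["pred_type"] for c in cases], [c["gold_type"] for c in cases])
oia = accuracy([c["pred_op"] for c in cases], [c["gold_op"] for c in cases])
```

In the second toy case the error type is right but the operation is wrong, which shows how OIA can fall below ETA: naming the category of a failure is a coarser task than pinpointing the exact operation.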
Main Results:
Table 1: Failure Attribution Accuracy on MemTraceBench (%)
| Backbone | Method | Long-Context ETA | Long-Context OIA | RAG ETA | RAG OIA | Mem0 ETA |
|---|---|---|---|---|---|---|
| GPT-4.1 mini | MemTrace-OBS | 9.17 | 3.33 | 25.83 | 17.50 | 33.33 |
| GPT-4.1 mini | MemTrace | 20.83 | 4.17 | 41.67 | 26.67 | 35.83 |
| GPT-5.4 | MemTrace-OBS | 7.50 | 7.50 | 87.50 | 87.50 | 60.00 |
| GPT-5.4 | MemTrace | 20.00 | 20.00 | 72.50 | 65.83 | 70.00 |
- Graph-based exploration (MemTrace) improves ETA, especially for smaller LLMs (GPT-4.1 mini: 20.00% → 36.46%). It constrains the search scope, forcing the agent to follow information flow.
- Operation identification (OIA) is significantly harder than error type prediction (ETA), with best overall OIA at 46.25%.
- Long-context memory is the most challenging setting for attribution.
Table 2: Average Inference Cost (tokens in thousands, time in minutes)
| Backbone | Method | Long-Context Tokens (K) | Long-Context Time (min) | RAG Tokens (K) | RAG Time (min) | Mem0 Tokens (K) |
|---|---|---|---|---|---|---|
| GPT-4.1 mini | MemTrace-OBS | 692.79 | 2.45 | 684.95 | 1.65 | 1,077.10 |
| GPT-4.1 mini | MemTrace | 4,471.10 | 7.06 | 839.48 | 3.84 | 830.85 |
| GPT-5.4 | MemTrace-OBS | 373.89 | 0.95 | 277.32 | 0.67 | 210.00 |
| GPT-5.4 | MemTrace | 2,524.81 | 5.11 | 1,477.03 | 3.41 | 846.09 |
- MemTrace-OBS substantially reduces cost, especially on weakly structured traces (long-context). It uses only 15.25% of the tokens and 27.94% of the runtime of MemTrace on the long-context subset.
- Despite higher cost, MemTrace is still substantially faster and cheaper than manual human attribution.
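The long-context percentages quoted above can be reproduced from Table 2 if the first two numeric columns are read as long-context tokens and minutes and the two backbones are pooled. The pooling is an inference from the reported numbers, not something the source states.

```python
# Long-context figures from Table 2, ordered as [GPT-4.1 mini, GPT-5.4].
obs_tokens = [692.79, 373.89]         # MemTrace-OBS, thousands of tokens
memtrace_tokens = [4471.10, 2524.81]  # MemTrace, thousands of tokens
obs_minutes = [2.45, 0.95]            # MemTrace-OBS, minutes
memtrace_minutes = [7.06, 5.11]       # MemTrace, minutes

# Share of MemTrace's cost that MemTrace-OBS needs, pooled over backbones.
token_share = round(100 * sum(obs_tokens) / sum(memtrace_tokens), 2)
time_share = round(100 * sum(obs_minutes) / sum(memtrace_minutes), 2)

print(f"tokens: {token_share}%  runtime: {time_share}%")
```

Computed this way, the pooled shares come out to 15.25% of the tokens and 27.94% of the runtime, matching the figures in the text.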
Key Analyses:
- Error Distribution: Analysis of MemTraceBench reveals distinct failure patterns per system:
  - RAG: No extraction errors (no extraction module), but frequent retrieval and response errors.
  - Mem0 & EverMemOS: Both have extraction modules, but EverMemOS's is more robust (very few extraction errors).
  - Long-Context: No retrieval errors (by design), but suffers from other error types.
  - All systems exhibit response errors, indicating a challenge in effectively using retrieved memories.
- Auxiliary Information: Providing source evidence (golden supporting messages) and prior knowledge (system pipeline description) to MemTrace improves OIA and reduces cost.
- LLM-as-a-Judge Reliability: When the LLM judge identifies an error, its verdict is almost always correct. Disagreements with human annotators are typically due to the LLM being overly strict.
Theoretical and Practical Implications
Theoretical Implications:
- Provides a formal, graph-based framework for modeling and attributing errors in stateful, non-parametric memory systems, addressing a significant traceability gap.
- Identifies that memory failures are systematic and stem from fundamental operation-level issues like information loss and retrieval misalignment, not just random LLM errors.
Practical Implications:
- Diagnostic Reports for Memory Systems: MemTrace enables operation-level aggregation of failures, automatically generating summaries that pinpoint bottlenecks in specific pipeline components (e.g., Mem0's extraction module dropping fine details, EverMemOS's reranker failures).
- Automatic Optimization of Memory Systems: Establishes a closed-loop optimization pipeline:
  - smartcomment records the execution graph.
  - MemTrace performs credit assignment to localize the decisive faulty operation.
  - An off-the-shelf prompt optimizer is invoked only on the prompts for that localized operation, sidestepping the challenge of optimizing over long, multi-session traces.
- Result: Applied to Mem0 on LoCoMo, three optimization rounds improved performance on a held-out test split by 7.62%, demonstrating that even imperfect attribution provides useful optimization signals.
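The aggregation and targeting steps behind both practical uses can be sketched as below. The record fields, the error-type labels, the `prompt_of` mapping, and the choice to target the most frequently blamed operation are illustrative assumptions; the prompt optimizer itself is left out.

```python
from collections import Counter


def diagnostic_report(attributions):
    """Count attributed failures per (operation, error type) pair."""
    return Counter((a["operation"], a["error_type"]) for a in attributions)


def prompts_to_optimize(attributions, prompt_of, top_n=1):
    """Pick the prompts of the operations blamed most often."""
    per_op = Counter(a["operation"] for a in attributions)
    return [prompt_of[op] for op, _ in per_op.most_common(top_n) if op in prompt_of]


# Toy attribution output for three failed cases (illustrative).
attributions = [
    {"operation": "extract", "error_type": "information_loss"},
    {"operation": "extract", "error_type": "information_loss"},
    {"operation": "rerank", "error_type": "retrieval_misalignment"},
]
prompt_of = {"extract": "extraction_prompt", "rerank": "rerank_prompt"}

report = diagnostic_report(attributions)
targets = prompts_to_optimize(attributions, prompt_of)
```

Restricting optimization to the returned prompts keeps the search local to one pipeline component, which is the point of doing credit assignment before invoking an optimizer.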
Conclusion
This work introduces the problem of error tracing and attribution in LLM memory systems. The core contributions are:
- A novel framework that models memory pipelines as executable evolution graphs for fine-grained tracing.
- MemTraceBench, a diagnostic benchmark with annotated failures.
- MemTrace, an automatic attribution method that explores these graphs to locate root causes.
- Demonstration that attribution signals can guide automatic system optimization, leading to measurable performance gains.
Limitations & Future Work:
- Benchmark Scale: MemTraceBench can be expanded to include broader memory forms (task, multimodal).
- Error Set Cardinality: Current work focuses on singleton decisive error sets; extending to multi-error scenarios is important.
- Method Improvement: Combining global search (MemTrace-OBS) with local graph exploration (MemTrace) is a promising direction.
- Generality: The graph-based diagnosis approach could be applied to other compound agentic systems.
Ethics Statement: MemTrace is for diagnostic support and transparency. Use on real-world systems requires careful data governance (consent, anonymization, secure storage). Its diagnoses should be treated as assistive evidence, not definitive judgments, requiring human review.