MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Summary (Overview)

Problem: Memory systems in LLM agents are essential for long-horizon reasoning but are unreliable and difficult to debug. Failures can originate from earlier operations (e.g., construction, update) and only manifest later, making causal attribution challenging.
Solution: Proposes a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow.
Benchmark: Constructs MemTraceBench, a diagnostic benchmark with 160 human-annotated failure cases from four memory systems (Long-Context, RAG, Mem0, EverMemOS) and three datasets.
Method: Introduces MemTrace, an automatic attribution method that iteratively traces operation subgraphs to pinpoint the decisive error set (root cause).
Impact: Analysis reveals systematic memory failures from operation-level issues (e.g., information loss, retrieval misalignment). Attribution signals can guide automatic prompt optimization, boosting end-task performance by up to 7.62%.

Introduction and Theoretical Foundation

Memory systems enable LLM agents to become stateful, supporting long-horizon tasks and continual learning. However, as these systems grow complex, diagnosing failures becomes a significant challenge. Unlike stateless agents where failures are localized to the current execution, memory-augmented agents maintain persistent states across interactions. A failure (e.g., an incorrect answer) may originate from a faulty memory operation (construction, update, deletion) that occurred much earlier in the interaction history.

The core problem is a traceability gap: while failures are observable, the specific faulty operations, when they were introduced, and how the error propagated through the memory pipeline remain difficult to identify. Existing linear execution logs lack the structured representation needed to track how memory variables are created, modified, overwritten, and propagated.

This paper addresses the problem of error tracing and attribution in LLM memory systems. The key idea is to expose memory-system execution as a unified operation-variable graph that explicitly captures information flow dependencies across turns and sessions.

Formal Problem Definition: The memory system is denoted $M$. It processes a trajectory $\tau$ and answers a question $q$ with a prediction $\hat{a}$, where $a$ is the golden answer. Its execution involves memory updates $U_M$, memory reads $R_M$, and answer generation $Q$.

Instrumentation produces an execution graph $G = (V, O, E)$, a directed acyclic bipartite graph where:

$V$: Set of variables (artifacts like raw messages, memory units, prompts).
$O$: Set of operations (computation steps like LLM inference, retrieval, parsing).
$E$: Directed edges capturing information flow. An operation $o \in O$ has inputs $\text{In}(o) \subset V$ and outputs $\text{Out}(o) \subset V$.

A binary outcome indicator $Z(G) \in \{0, 1\}$ marks success ($0$) or failure ($1$).
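The bipartite structure of $G$ can be made concrete with a small data structure. The following is a minimal sketch, not the paper's implementation: the class names (`Variable`, `Operation`, `ExecutionGraph`), the method names, and the toy pipeline are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variable:
    """An artifact in V: a raw message, memory unit, prompt, etc."""
    vid: str
    timestamp: float


@dataclass(frozen=True)
class Operation:
    """A computation step in O with explicit In(o) and Out(o)."""
    oid: str
    category: str        # e.g. "llm_inference", "retrieval", "parsing"
    inputs: frozenset    # In(o): ids of the variables it reads
    outputs: frozenset   # Out(o): ids of the variables it writes


@dataclass
class ExecutionGraph:
    """Hypothetical container for G = (V, O, E); E is derived from O."""
    variables: dict = field(default_factory=dict)
    operations: dict = field(default_factory=dict)

    def add_variable(self, vid, timestamp):
        self.variables[vid] = Variable(vid, timestamp)

    def add_operation(self, oid, category, inputs, outputs):
        unknown = [v for v in (*inputs, *outputs) if v not in self.variables]
        if unknown:
            raise KeyError(f"unknown variables: {unknown}")
        self.operations[oid] = Operation(
            oid, category, frozenset(inputs), frozenset(outputs)
        )

    def edges(self):
        """E: variable -> operation for inputs, operation -> variable for outputs."""
        result = set()
        for op in self.operations.values():
            result |= {(v, op.oid) for v in op.inputs}
            result |= {(op.oid, v) for v in op.outputs}
        return result

    def ops_touching(self, vid):
        """Ids of operations that read or write the given variable."""
        return sorted(
            op.oid for op in self.operations.values()
            if vid in op.inputs or vid in op.outputs
        )


# Toy pipeline (illustrative only): extract -> retrieve -> respond.
g = ExecutionGraph()
for i, vid in enumerate(["m1", "mem1", "q", "ctx", "answer"]):
    g.add_variable(vid, float(i))
g.add_operation("extract", "llm_inference", ["m1"], ["mem1"])
g.add_operation("retrieve", "retrieval", ["mem1", "q"], ["ctx"])
g.add_operation("respond", "llm_inference", ["ctx", "q"], ["answer"])

print(g.ops_touching("mem1"))  # ['extract', 'retrieve']
```

Every edge connects a variable to an operation or the reverse, never two nodes of the same kind, which is what makes the graph bipartite.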

The goal is to identify the Decisive Error Set $O^*$, the earliest and minimal causal cut-set of faulty operations that caused the failure. Formally, $O^*$ must satisfy:

Every operation in $O^*$ is faulty.
All operations in its strictly upstream ancestor set $\text{Anc}_G(O^*)$ are correct.
It is causally sufficient: replacing the faulty outputs of $O^*$ with correct ones (assuming ideal downstream execution) rescues the failure, i.e., $Z(G(O^*, *)) = 0$.
It is minimal: no proper subset $O' \subset O^*$ is also causally sufficient.

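Writing $\mathcal{F}(G)$ for the family of candidate operation sets (read here as those satisfying the first three conditions), minimality is expressed as:

$$O^* \in \{ O_c \in \mathcal{F}(G) \mid \nexists\, O' \in \mathcal{F}(G) \ \text{s.t.} \ O' \subset O_c \}.$$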

This reduces failure attribution to identifying a minimal topological frontier of faulty operations.
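The "topological frontier" can be illustrated on a toy operation DAG. The sketch below covers only the first two conditions (faulty, with no faulty strict ancestor). Causal sufficiency and minimality would need counterfactual re-execution, which is not modeled here. The function names, the `parents` encoding, and the example fault labels are assumptions made for illustration.

```python
def ancestors(parents, op):
    """All strict upstream ancestors of `op`; `parents` maps op -> set of parent ops."""
    seen = set()
    stack = list(parents.get(op, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def faulty_frontier(parents, faulty):
    """Faulty operations whose strict ancestors are all correct."""
    return {op for op in faulty if not (ancestors(parents, op) & faulty)}


# Toy chain (illustrative): extract -> update -> retrieve -> respond.
parents = {
    "extract": set(),
    "update": {"extract"},
    "retrieve": {"update"},
    "respond": {"retrieve"},
}
# Suppose annotators marked two operations as faulty.
faulty = {"update", "respond"}

print(faulty_frontier(parents, faulty))  # {'update'}
```

Here `respond` is faulty but sits downstream of the faulty `update`, so only `update` remains on the frontier.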

Methodology

MemTrace is proposed as an automatic attribution method, casting the problem as an agentic graph exploration task. The agent iteratively inspects local operation subgraphs in $G$ to locate the decisive error $o^*$ (assuming a singleton set for this work).

Key Components:

Initialization of Starting Point: Starting from all input messages would make the search space large, so MemTrace instead runs hybrid retrieval (dense + sparse, fused via Reciprocal Rank Fusion) over the raw message set $\{m_i\}_{i=1}^n$, using a query that concatenates the question $q$ and the golden answer $a$. The top-$\lfloor N/2 \rfloor$ retrieved messages, together with $q$, form the initial to-explore list $L_0$, a priority queue of variable nodes ordered by earliest timestamp $t_v$. The list size is $N=16$.
Execution Graph Exploration: At iteration $t$:
- Selects the variable $v_t$ with the earliest timestamp from list $L_{t-1}$.
- Retrieves all operations directly involving it: $O(v_t) = \{ o \in O \mid v_t \in \text{In}(o) \cup \text{Out}(o) \}$.
- For each $o \in O(v_t)$, converts the operation subgraph $G_o$ into a textual representation (name, category, inputs, outputs, dependencies).
- The agent judges whether $o$ is the decisive error. If not, it adds relevant downstream variables to the list: $L_t = (L_{t-1} \setminus \{v_t\}) \cup A_t$, where $A_t \subseteq V$ are new variables to explore.
- The process repeats until $o^*$ is found or a maximum iteration limit (200) is reached.
Working Context Management: Handles large graphs via:
- Preview mode: Omits concrete variable values initially.
- Targeted access: Pagination and regex search for large values.
- Automatic summarization: Triggered when the working context exceeds a safety threshold $T$ (272,000 tokens).
Search-Based Variant (MemTrace-OBS): For weakly structured graphs (e.g., long-context), compresses each operation subgraph into a textual operation block (removing dependency edges/IDs, keeping inputs/outputs/attributes). All blocks are sorted by timestamp into a weakly structured log. The agent is equipped with a global operation-search tool (regex-based) to navigate this log efficiently.
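The initialization and exploration steps can be sketched in a few lines. This is a simplified illustration under several assumptions, not the authors' code. The RRF constant `k=60` is a conventional default rather than a value stated here. The `judge` callable is a stub standing in for the LLM agent. The `queued` and `judged` sets are added bookkeeping that guarantees termination. All function and variable names are hypothetical.

```python
import heapq


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def explore(start_vars, timestamp, ops_touching, judge, max_iters=200):
    """Pop the earliest variable, judge its operations, enqueue follow-ups."""
    heap = [(timestamp[v], v) for v in set(start_vars)]
    heapq.heapify(heap)
    queued = {v for _, v in heap}   # assumption: never enqueue a variable twice
    judged = set()                  # assumption: judge each operation once
    for _ in range(max_iters):
        if not heap:
            return None
        _, var = heapq.heappop(heap)
        for op in ops_touching(var):
            if op in judged:
                continue
            judged.add(op)
            is_decisive, follow_ups = judge(op)
            if is_decisive:
                return op
            for nxt in follow_ups:
                if nxt not in queued:
                    queued.add(nxt)
                    heapq.heappush(heap, (timestamp[nxt], nxt))
    return None


# Initialization: fuse a dense and a sparse ranking, keep the top floor(N/2), add q.
N = 16
dense = ["m3", "m1", "m2"]
sparse = ["m1", "m4", "m3"]
fused = rrf_fuse([dense, sparse])
print(fused)  # ['m1', 'm3', 'm4', 'm2']
start = fused[: N // 2] + ["q"]

# Toy graph: extract(m1 -> mem1), retrieve(mem1, q -> ctx), respond(ctx, q -> ans).
timestamp = {"m1": 1, "m2": 1.5, "m3": 1.6, "m4": 1.7, "mem1": 2, "ctx": 3, "q": 4, "ans": 5}
touching = {
    "m1": ["extract"], "mem1": ["extract", "retrieve"],
    "q": ["retrieve", "respond"], "ctx": ["retrieve", "respond"], "ans": ["respond"],
}
verdicts = {"extract": (False, ["mem1"]), "retrieve": (True, []), "respond": (False, [])}

found = explore(start, timestamp, lambda v: touching.get(v, []), verdicts.__getitem__)
print(found)  # retrieve
```

Ordering the queue by timestamp means the earliest candidate variables are inspected first, which matches the goal of finding the earliest faulty operation rather than a downstream symptom.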

Supporting Toolkit: smartcomment, a lightweight tracing package, is developed to instrument memory systems for recording operations, variables, and dependencies, enabling the construction of execution graphs.

Empirical Validation / Results

Experimental Setup:

Benchmark: MemTraceBench (160 failure cases from Long-Context, RAG, Mem0, EverMemOS systems on LoCoMo, LongMemEval, RealMem datasets).
Backbones: GPT-4.1 mini and GPT-5.4.
Metrics: Error Type prediction Accuracy (ETA), Faulty Operation Identification Accuracy (OIA), token cost, runtime.
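Both accuracy metrics can be read as per-case match rates. The sketch below assumes exact-match scoring against the human annotation, with one predicted error type and one predicted operation id per case. The labels and operation ids are made up for illustration and are not benchmark data.

```python
def accuracy_pct(predicted, gold):
    """Percentage of cases where the prediction equals the annotation."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equally long")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


# Illustrative annotations for four failure cases (not real benchmark entries).
gold_types = ["extraction", "retrieval", "response", "retrieval"]
pred_types = ["extraction", "response", "response", "retrieval"]
gold_ops = ["op_12", "op_40", "op_77", "op_41"]
pred_ops = ["op_12", "op_39", "op_77", "op_52"]

eta = accuracy_pct(pred_types, gold_types)  # error type accuracy
oia = accuracy_pct(pred_ops, gold_ops)      # faulty operation identification accuracy
print(eta, oia)  # 75.0 50.0
```

In this toy example the fourth case gets the error type right but names the wrong operation, so OIA comes out lower than ETA.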

Main Results:

Table 1: Failure Attribution Accuracy on MemTraceBench

Backbone	Method	Long-Context ETA	Long-Context OIA	RAG ETA	RAG OIA	Mem0 ETA
GPT-4.1 mini	MemTrace-OBS	9.17	3.33	25.83	17.50	33.33
GPT-4.1 mini	MemTrace	20.83	4.17	41.67	26.67	35.83
GPT-5.4	MemTrace-OBS	7.50	7.50	87.50	87.50	60.00
GPT-5.4	MemTrace	20.00	20.00	72.50	65.83	70.00

Graph-based exploration (MemTrace) improves overall ETA over MemTrace-OBS, especially for the smaller backbone (GPT-4.1 mini: 20.00% → 36.46%). It constrains the search scope, forcing the agent to follow information flow.
Operation identification (OIA) is significantly harder than error type prediction (ETA), with best overall OIA at 46.25%.
Long-context memory is the most challenging setting for attribution.

Table 2: Average Inference Cost (Tokens in thousands, Time in minutes)

Backbone	Method	Long-Context Tokens (k)	Long-Context Time (min)	RAG Tokens (k)	RAG Time (min)	Mem0 Tokens (k)
GPT-4.1 mini	MemTrace-OBS	692.79	2.45	684.95	1.65	1,077.10
GPT-4.1 mini	MemTrace	4,471.10	7.06	839.48	3.84	830.85
GPT-5.4	MemTrace-OBS	373.89	0.95	277.32	0.67	210.00
GPT-5.4	MemTrace	2,524.81	5.11	1,477.03	3.41	846.09

MemTrace-OBS substantially reduces cost, especially on weakly structured traces (long-context). Averaged over both backbones, it uses only 15.25% of the tokens and 27.94% of the runtime of MemTrace on the long-context subset.
Despite higher cost, MemTrace is still substantially faster and cheaper than manual human attribution.
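The long-context cost ratios can be checked against Table 2. The sketch assumes the first two numeric columns of each row are the Long-Context token count (in thousands) and runtime (in minutes), and that the reported percentages pool both backbones. The variable and function names are illustrative.

```python
# Long-Context columns of Table 2: (tokens in thousands, time in minutes).
obs = {"GPT-4.1 mini": (692.79, 2.45), "GPT-5.4": (373.89, 0.95)}
full = {"GPT-4.1 mini": (4471.10, 7.06), "GPT-5.4": (2524.81, 5.11)}


def pooled_pct(index):
    """MemTrace-OBS cost as a percentage of MemTrace cost, summed over backbones."""
    numerator = sum(row[index] for row in obs.values())
    denominator = sum(row[index] for row in full.values())
    return 100.0 * numerator / denominator


token_pct = pooled_pct(0)
time_pct = pooled_pct(1)
print(f"{token_pct:.2f}% of tokens, {time_pct:.2f}% of runtime")
# 15.25% of tokens, 27.94% of runtime
```

Under that reading, the pooled ratios reproduce the 15.25% and 27.94% figures quoted above.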

Key Analyses:

Error Distribution: Analysis of MemTraceBench reveals distinct failure patterns per system:
- RAG: No extraction errors (no extraction module), but frequent retrieval and response errors.
- Mem0 & EverMemOS: Both have extraction modules, but EverMemOS's is more robust (very few extraction errors).
- Long-Context: No retrieval errors (by design), but suffers from other error types.
- All systems exhibit response errors, indicating a challenge in effectively using retrieved memories.
Auxiliary Information: Providing source evidence (golden supporting messages) and prior knowledge (system pipeline description) to MemTrace improves OIA and reduces cost.
LLM-as-a-Judge Reliability: When the LLM judge identifies an error, its verdict is almost always correct. Disagreements with human annotators are typically due to the LLM being overly strict.

Theoretical and Practical Implications

Theoretical Implications:

Provides a formal, graph-based framework for modeling and attributing errors in stateful, non-parametric memory systems, addressing a significant traceability gap.
Identifies that memory failures are systematic and stem from fundamental operation-level issues like information loss and retrieval misalignment, not just random LLM errors.

Practical Implications:

Diagnostic Reports for Memory Systems: MemTrace enables operation-level aggregation of failures, automatically generating summaries that pinpoint bottlenecks in specific pipeline components (e.g., Mem0's extraction module dropping fine details, EverMemOS's reranker failures).
Automatic Optimization of Memory Systems: Establishes a closed-loop optimization pipeline:
- smartcomment records the execution graph.
- MemTrace performs credit assignment to localize the decisive faulty operation.
- An off-the-shelf prompt optimizer is invoked only on the prompts for that localized operation, sidestepping the challenge of optimizing over long, multi-session traces.
- Result: Applied to Mem0 on LoCoMo, three optimization rounds improved performance on a held-out test split by 7.62%, demonstrating that even imperfect attribution provides useful optimization signals.
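The operation-level aggregation behind such a diagnostic report, and the choice of which prompt to hand to an optimizer, can be sketched as a simple count. The attribution records below are invented for illustration. Picking the most frequently blamed operation is an assumed selection rule, not one stated by the paper.

```python
from collections import Counter

# Hypothetical attribution output: (failure case id, operation judged decisive).
attributions = [
    ("case_1", "extract"),
    ("case_2", "extract"),
    ("case_3", "rerank"),
    ("case_4", "extract"),
    ("case_5", "respond"),
]


def bottleneck_report(records):
    """Count how often each operation was localized as the decisive error."""
    counts = Counter(op for _, op in records)
    return counts.most_common()


def select_optimization_target(records):
    """Assumed rule: optimize the prompt of the most frequently blamed operation."""
    return bottleneck_report(records)[0][0]


report = bottleneck_report(attributions)
target = select_optimization_target(attributions)
print(report[0], target)  # ('extract', 3) extract
```

Restricting the optimizer to the prompt of one localized operation keeps each optimization step short, which is the point of the credit-assignment stage in the loop described above.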

Conclusion

This work introduces the problem of error tracing and attribution in LLM memory systems. The core contributions are:

A novel framework that models memory pipelines as executable evolution graphs for fine-grained tracing.
MemTraceBench, a diagnostic benchmark with annotated failures.
MemTrace, an automatic attribution method that explores these graphs to locate root causes.
Demonstration that attribution signals can guide automatic system optimization, leading to measurable performance gains.

Limitations & Future Work:

Benchmark Scale: MemTraceBench can be expanded to include broader memory forms (task, multimodal).
Error Set Cardinality: Current work focuses on singleton decisive error sets; extending to multi-error scenarios is important.
Method Improvement: Combining global search (MemTrace-OBS) with local graph exploration (MemTrace) is a promising direction.
Generality: The graph-based diagnosis approach could be applied to other compound agentic systems.

Ethics Statement: MemTrace is for diagnostic support and transparency. Use on real-world systems requires careful data governance (consent, anonymization, secure storage). Its diagnoses should be treated as assistive evidence, not definitive judgments, requiring human review.