δ-mem: Efficient Online Memory for Large Language Models

Summary (Overview)

  • Core Contribution: δ-mem is a lightweight memory mechanism that augments a frozen full-attention LLM backbone with a compact online state of associative memory (OSAM), enabling dynamic maintenance and use of historical information without full fine-tuning or architectural changes.
  • Key Mechanism: It compresses past information into a fixed-size state matrix (e.g., 8×8) updated via delta-rule learning, and uses its readout to generate low-rank corrections to the backbone's attention computation (query and output sides) during generation.
  • Performance: With only an 8×8 memory state, δ-mem improves the average score to 1.10× that of the frozen backbone and 1.15× that of the strongest non-δ-mem baseline. Gains are larger on memory-heavy benchmarks: 1.31× on MemoryAgentBench and 1.20× on LoCoMo.
  • Efficiency: Provides effective associative memory without extending explicit context, heavy external retrieval modules, or replacing the backbone architecture.
  • Design Flexibility: Explores three writing granularities: Token-State Write (TSW), Sequence-State Write (SSW), and Multi-State Write (MSW).

Introduction and Theoretical Foundation

Large language models (LLMs) are increasingly deployed in memory-heavy scenarios like long-term personalized assistants and agent systems, requiring the accumulation, updating, and reuse of historical information over extended interactions. Simply expanding the context window is costly (quadratic attention cost) and ineffective due to context degradation/rot.

Existing memory mechanisms fall into three paradigms with limitations:

  1. Textual Memory Mechanisms (TMMs): Store memory as text injected via input context. Suffer from context-window limits, retrieval noise, and compaction loss.
  2. Outside-channel Memory Mechanisms (OMMs): Keep memory in external modules interacting via separate pathways. Introduce overhead, integration complexity, and potential misalignment.
  3. Parametric Memory Mechanisms (PMMs): Encode memory into parameters (e.g., prefixes, adapters). Are efficient but static, limiting adaptation to dynamically evolving information.

δ-mem's Motivation: A need for a mechanism that maintains a compact, dynamically evolving memory state while steering the backbone through a pathway tightly aligned with its internal attention computation.

Theoretical Foundation: The online state update is formulated as optimizing an online regression loss using SGD. Given a memory key $k_t \in \mathbb{R}^r$ and value $v_t \in \mathbb{R}^r$ at position $t$, the prediction from the previous state $S_{t-1}$ is:

$$\hat{v}_t = S_{t-1} k_t.$$

The update minimizes the loss:

$$L_t(S) = \frac{1}{2} \| S k_t - v_t \|^2,$$

leading to the delta-rule update:

$$S_t = S_{t-1} - \beta_t \nabla_{S_{t-1}} L_t(S_{t-1}) = S_{t-1} + \beta_t (v_t - S_{t-1} k_t) k_t^\top.$$

Inspired by gated retention, a forget gate $\lambda_t$ is introduced for stable long-range evolution:

$$S_t = \lambda_t S_{t-1} + \beta_t (v_t - S_{t-1} k_t) k_t^\top. \tag{3}$$

Here $\lambda_t$ controls retention of previous memory, and $\beta_t$ controls the strength of the residual write.
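As a minimal NumPy sketch of the gated delta-rule step in Eq. 3 (the state size, key, and value here are illustrative stand-ins, not the paper's trained quantities):

```python
import numpy as np

def delta_update(S, k, v, beta, lam):
    """Gated delta rule (Eq. 3): S_t = lam * S_{t-1} + beta * (v - S_{t-1} k) k^T."""
    residual = v - S @ k                  # prediction error (v_t - S_{t-1} k_t)
    return lam * S + beta * np.outer(residual, k)

r = 8                                     # memory rank, as in the paper's 8x8 state
rng = np.random.default_rng(0)
S = np.zeros((r, r))                      # empty memory state
k = rng.standard_normal(r)
k /= np.linalg.norm(k)                    # unit-norm key
v = rng.standard_normal(r)

S = delta_update(S, k, v, beta=1.0, lam=1.0)
# With a unit-norm key, beta = 1, and no forgetting, the readout S k
# exactly recovers the stored value v.
```

This makes the "associative" behavior concrete: one write binds a value to a key, and a later read with the same key retrieves it from the fixed-size state.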

Methodology

δ-mem operates in a sequence: read associative signals from the old state, steer attention with low-rank corrections, then write current information into the state. The backbone remains frozen.

1. Memory Projections

The hidden state $x_t \in \mathbb{R}^d$ is projected into a low-dimensional associative memory space ($r \ll d$, e.g., $r = 8$):

$$\begin{aligned} q^m_t &= \mathrm{L2\_norm}\left(\tanh(W^m_q x_t)\right), \\ k^m_t &= \mathrm{L2\_norm}\left(\tanh(W^m_k x_t)\right), \\ v^m_t &= W^m_v x_t, \end{aligned} \tag{4}$$

where $q^m_t, k^m_t, v^m_t \in \mathbb{R}^r$. Normalizing the query and key reduces scale drift. The write gate ($\beta_t$) and retention gate ($\lambda_t$) are also derived from $x_t$:

$$\beta_t = \sigma(W_\beta x_t + b), \quad \lambda_t = 1 - \beta_t. \tag{5}$$
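Eqs. 4-5 can be sketched as follows; the random matrices stand in for the trained projections, and the dimensions `d = 64`, `r = 8` are illustrative:

```python
import numpy as np

d, r = 64, 8                              # hidden size, memory rank (illustrative)
rng = np.random.default_rng(1)
Wq, Wk, Wv = (0.1 * rng.standard_normal((r, d)) for _ in range(3))
Wb, b = 0.1 * rng.standard_normal((r, d)), np.zeros(r)

def l2_norm(x, eps=1e-8):
    return x / (np.linalg.norm(x) + eps)

def project(x):
    q = l2_norm(np.tanh(Wq @ x))          # Eq. 4: normalized memory query
    k = l2_norm(np.tanh(Wk @ x))          # Eq. 4: normalized memory key
    v = Wv @ x                            # Eq. 4: memory value (unnormalized)
    beta = 1.0 / (1.0 + np.exp(-(Wb @ x + b)))  # Eq. 5: sigmoid write gate
    lam = 1.0 - beta                      # Eq. 5: retention gate
    return q, k, v, beta, lam

q, k, v, beta, lam = project(rng.standard_normal(d))
```

Note that the gates come out as length-$r$ vectors here, matching the dimension-wise $\text{Diag}(\cdot)$ gating used later in Eq. 10.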

2. Reading from Online State

Before writing, the model reads context-relevant associative signals by querying the old state:

$$r_t = S_{t-1} q^m_t. \tag{6}$$

The read vector $r_t \in \mathbb{R}^r$ provides history-dependent steering signals, at a cost independent of history length.

3. Steering Attention via Low-Rank Corrections

The read signal $r_t$ is transformed into corrections for the backbone's attention:

$$\begin{aligned} \Delta q_t &= W^\Delta_q r_t, \\ \Delta o_t &= W^\Delta_o r_t. \end{aligned} \tag{7}$$

These are added to the original query and attention output:

$$q^0_t = W_Q x_t, \quad \tilde{q}_t = q^0_t + \alpha_r \Delta q_t, \tag{8}$$

$$a_t = \text{Attn}(\tilde{q}_t, K_{\leq t}, V_{\leq t}), \quad \tilde{y}_t = a_t + \alpha_r \Delta o_t. \tag{9}$$

The corrections are low-rank and dynamic because $r_t$ comes from the evolving state $S_{t-1}$.
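Steps 2-3 together, as a hedged sketch: the state, projection weights, and $\alpha_r$ value below are stand-ins, and the backbone's query and attention output are mocked by random vectors rather than a real attention call:

```python
import numpy as np

d, r = 64, 8
rng = np.random.default_rng(2)

S_prev = 0.1 * rng.standard_normal((r, r))   # online state S_{t-1}
q_m = rng.standard_normal(r)
q_m /= np.linalg.norm(q_m)                   # normalized memory query

# Read (Eq. 6): O(r^2), no matter how much history has been written
r_t = S_prev @ q_m

# Low-rank corrections (Eq. 7); W_dq, W_do stand in for trained weights
W_dq = 0.1 * rng.standard_normal((d, r))
W_do = 0.1 * rng.standard_normal((d, r))
delta_q, delta_o = W_dq @ r_t, W_do @ r_t

# Steering (Eqs. 8-9): additive corrections to the frozen backbone
alpha_r = 0.5                                # illustrative scaling
q0 = rng.standard_normal(d)                  # original query W_Q x_t (mock)
a_t = rng.standard_normal(d)                 # attention output (mock)
q_tilde = q0 + alpha_r * delta_q
y_tilde = a_t + alpha_r * delta_o
```

Each correction is a rank-1-style path ($d \times r$ matrix applied to an $r$-vector), which is what keeps the steering cheap relative to the backbone's own $d \times d$ projections.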

4. Writing into Online State

After attention, the current information $(k^m_t, v^m_t)$ is written using the gated delta rule (dimension-wise):

$$S_t = \text{Diag}(\lambda_t) S_{t-1} + \text{Diag}(\beta_t) (v^m_t - S_{t-1} k^m_t) (k^m_t)^\top. \tag{10}$$

Expanded, this becomes:

$$S_t = \text{Diag}(\lambda_t) S_{t-1} - \text{Diag}(\beta_t) S_{t-1} k^m_t (k^m_t)^\top + \text{Diag}(\beta_t) v^m_t (k^m_t)^\top. \tag{11}$$

Row-wise, for the $i$-th row $s^{(i)}_t$:

$$s^{(i)}_t = \lambda_{t,i} s^{(i)}_{t-1} + \beta_{t,i} \left( v^m_{t,i} - s^{(i)}_{t-1} k^m_t \right) (k^m_t)^\top. \tag{12}$$
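A quick numerical check, with random stand-in values, that the matrix form in Eq. 10 and the row-wise form in Eq. 12 coincide:

```python
import numpy as np

r = 8
rng = np.random.default_rng(3)
S = rng.standard_normal((r, r))              # previous state S_{t-1}
k = rng.standard_normal(r)
k /= np.linalg.norm(k)
v = rng.standard_normal(r)
beta = rng.uniform(0.1, 0.9, size=r)         # per-dimension write gates
lam = 1.0 - beta                             # per-dimension retention gates

# Matrix form (Eq. 10): the Diag(.) gates broadcast across rows
S_mat = lam[:, None] * S + beta[:, None] * np.outer(v - S @ k, k)

# Row-wise form (Eq. 12): each row runs its own scalar-gated delta rule
S_row = np.stack([lam[i] * S[i] + beta[i] * (v[i] - S[i] @ k) * k
                  for i in range(r)])
```

The two agree because row $i$ of $\text{Diag}(\beta_t)(v^m_t - S_{t-1}k^m_t)(k^m_t)^\top$ is exactly $\beta_{t,i}(v^m_{t,i} - s^{(i)}_{t-1}k^m_t)(k^m_t)^\top$.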

5. Writing Granularity Strategies

  • Token-State Write (TSW): Updates the state at every token position: $S_t = \text{Update}(S_{t-1}, x_t)$. Captures fine-grained details but is susceptible to noise.
  • Sequence-State Write (SSW): Updates once per message/segment. Hidden states within segment $M^{(j)}$ are averaged, $\bar{x}^{(j)} = \frac{1}{|M^{(j)}|} \sum_{t \in M^{(j)}} x_t$, then $S^{(j)} = \text{Update}(S^{(j-1)}, \bar{x}^{(j)})$. Reduces redundancy and smooths state evolution.
  • Multi-State Write (MSW): Uses $N$ parallel sub-states $S_t = \{S^{(1)}_t, \ldots, S^{(N)}_t\}$, each updated independently; readouts are concatenated as $r_t = \text{Concat}(r^{(1)}_t, \ldots, r^{(N)}_t)$. Reduces interference by separating memory types.
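The first two granularities can be contrasted in a few lines; the constant gates and random projections below are simplifications of the learned components:

```python
import numpy as np

d, r = 64, 8
rng = np.random.default_rng(4)
Wk, Wv = 0.1 * rng.standard_normal((r, d)), 0.1 * rng.standard_normal((r, d))

def update(S, x, beta=0.5, lam=0.5):
    # simplified delta-rule write with constant gates
    k = np.tanh(Wk @ x)
    k /= np.linalg.norm(k)
    v = Wv @ x
    return lam * S + beta * np.outer(v - S @ k, k)

tokens = rng.standard_normal((20, d))        # one message of 20 hidden states

# TSW: one write per token (20 updates)
S_tsw = np.zeros((r, r))
for x in tokens:
    S_tsw = update(S_tsw, x)

# SSW: a single write for the whole segment, using its mean hidden state
S_ssw = update(np.zeros((r, r)), tokens.mean(axis=0))

# MSW would keep N such states in parallel and concatenate their readouts
```

Both end with an $r \times r$ state, but TSW performs one update per token while SSW collapses the segment into one write, which is the redundancy-reduction trade-off described above.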

6. Training Objective

Trained with the standard supervised fine-tuning (SFT) loss. Context tokens are written into the online state, producing $S_C$, but are not replayed as explicit backbone input during prediction. The backbone receives only the query $Q$ and response $Y$, steered by the stored state:

$$L_{\text{SFT}} = -\sum_{j=1}^{|Y|} \log p_{\phi,\theta}(y_j \mid Q, y_{<j}, S_C), \tag{17}$$

where $\theta$ denotes the trainable δ-mem parameters and $\phi$ the frozen backbone parameters.
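Concretely, Eq. 17 is ordinary next-token cross-entropy restricted to response tokens; a toy sketch (the vocabulary, logits, and target ids below are fabricated, standing in for what the steered backbone would produce given $Q$, $y_{<j}$, and $S_C$):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

vocab = 10
rng = np.random.default_rng(5)
y = np.array([3, 7, 1])                      # response token ids y_1 .. y_|Y|
logits = rng.standard_normal((len(y), vocab))  # stand-in backbone outputs

logp = log_softmax(logits)
# Eq. 17: negative log-probability summed over response tokens only;
# context tokens contribute solely through the state S_C, not the loss.
loss = -logp[np.arange(len(y)), y].sum()
```

The key point is in the indexing: only the $|Y|$ response positions enter the sum, so the gradient flows into the δ-mem parameters $\theta$ through the state-steered predictions rather than through any replayed context.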

Empirical Validation / Results

Main Results Across Memory Mechanisms (Table 1)

Backbone: Qwen3-4B-Instruct. Key Comparison: δ-mem outperforms all baseline memory mechanisms (Textual, Parametric, Outside-channel) on the final average score.

| Model | IFEval | HotpotQA (EM / F1) | GPQA-D | MemoryAgentBench (Avg.) | LoCoMo (Avg.) | Avg. |
|---|---|---|---|---|---|---|
| Qwen3-4B-Instruct | 81.89 | 42.35 / 56.00 | 39.39 | 29.54 | 40.79 | 46.79 |
| + BM25 RAG (TMM) | - | 40.35 / 52.83 | - | 24.49 | 36.68 | 44.56 |
| + LLMLingua-2 (TMM) | - | 36.93 / 50.03 | - | 15.63 | 40.98 | 42.96 |
| + MemoryBank (TMM) | - | - / - | - | 17.65 | 38.14 | 43.88 |
| + Context2LoRA (PMM) | 76.71 | 37.85 / 50.88 | 29.29 | 32.53 | 48.11 | 44.90 |
| + MemGen (PMM) | 39.37 | 5.36 / 16.27 | 38.89 | 29.61 | 40.05 | 30.66 |
| + MLP Memory (OMM) | 24.95 | 10.94 / 25.83 | 22.73 | 28.80 | 26.85 | 22.85 |
| + δ-Mem (SSW) | 81.70 | 49.22 / 63.43 | 41.41 | 37.84 | 47.05 | 51.44 |
| + δ-Mem (TSW) | 82.99 | 49.41 / 63.66 | 40.40 | 36.48 | 46.53 | 51.66 |
| + δ-Mem (MSW) | 81.52 | 46.86 / 60.47 | 37.37 | 38.85 | 49.12 | 50.74 |

Key Findings:

  • δ-mem (TSW) achieves the best average score of 51.66%, a +4.87 point improvement over the frozen backbone (46.79%) and +6.76 points over the strongest baseline Context2LoRA (44.90%).
  • Largest gains on memory-heavy tasks: MemoryAgentBench average improves from 29.54% to 38.85% (MSW), and LoCoMo from 40.79% to 49.12% (MSW). The TTL subtask on MemoryAgentBench nearly doubles from 26.14 to 50.50 (SSW).
  • Preserves general capabilities: IFEval and GPQA-D scores remain strong.

Results Across Different Backbone Models (Table 2)

δ-mem provides consistent improvements across backbones of varying sizes (Qwen3-4B-Instruct, Qwen3-8B, SmolLM3-3B).

| Model | Avg. Score (Backbone) | Avg. Score (δ-Mem Best Variant) | Improvement |
|---|---|---|---|
| Qwen3-4B-Instruct | 46.79% | 51.66% (TSW) | +4.87 pts |
| Qwen3-8B | 47.20% | 50.86% (SSW) | +3.66 pts |
| SmolLM3-3B | 26.08% | 36.96% (MSW) | +10.88 pts |

Observations:

  • Larger models (Qwen3-8B): Benefit more from SSW, which smooths token-level noise.
  • Smaller models (SmolLM3-3B): Exhibit a substantial performance leap with MSW, indicating that separating memory into multiple states is crucial to minimize interference when inherent capacity is lower.

Ablation Studies

1. Context Recovery (Figure 2): Evaluated under a no-context setting where explicit historical context is removed and only the compressed memory state is used. δ-mem consistently recovers useful information:

  • On HotpotQA: overall EM increases from 0.08% to 6.48%; overall F1 from 8.27% to 15.20%.
  • On LoCoMo: the overall average increases from 3.49% to 8.05%.

This demonstrates that the online state stores context-relevant signals that remain usable even without explicit context replay.

2. Heads Ablation (Table 3): Studies where to inject memory corrections within the attention block.

  • The qo (query & output) configuration offers a strong performance-efficiency trade-off (Avg. 47.97%).
  • The full qkvo configuration yields the best average score (48.05%) but with marginal gain over qo.
  • Single-branch variants show the output (o) branch is most effective alone (47.05%).

3. Insertion Depth Ablation (Table 4): Studies which model layers to apply δ-mem.

  • Applying to All Layers achieves the best overall performance (Avg. 47.97%).
  • Applying to Middle layers performs best among partial-layer variants (Avg. 46.66%), indicating intermediate layers balance semantic abstraction and task-specific computation effectively.

Theoretical and Practical Implications

Theoretical Implications:

  • Demonstrates that effective memory can be realized through an extremely compact online state (8×8 matrix) directly coupled with core attention computation, challenging the notion that memory requires extensive parameterization or external storage.
  • Provides a unified framework (memory state + memory steering) for analyzing memory mechanisms.
  • Shows delta-rule learning with gated updates can form a stable basis for online associative memory in sequence models.

Practical Implications:

  • Efficiency: Offers a lightweight, plug-and-play memory augmentation for frozen LLMs, requiring minimal trainable parameters and no architectural changes.
  • Scalability: Memory state size is fixed, so cost is independent of interaction history length, avoiding quadratic attention scaling.
  • Applicability: Well-suited for long-term assistants and agent systems where information must be accumulated and reused across extended interactions.
  • Flexibility: Different writing strategies (TSW, SSW, MSW) allow adaptation to various granularities of interaction (tokens, messages, multi-type memory).

Conclusion

δ-mem introduces a novel, efficient online memory mechanism for LLMs. Its core innovation is maintaining a compact, dynamically updated associative memory state that directly steers a frozen backbone's attention via low-rank corrections. Empirical results show it significantly enhances performance on memory-heavy tasks while preserving general capabilities, outperforming existing memory paradigms. Crucially, it achieves this with an extremely small state, proving that effective memory does not necessitate extending explicit context or employing heavy external modules.

Future Directions:

  • Exploring optimal memory state sizes and dimensions for different model scales and tasks.
  • Integrating δ-mem with other efficient attention mechanisms.
  • Applying the framework to multimodal models and other sequential architectures.
  • Investigating more advanced state organization and update rules for even longer-term memory retention.