Visual Summary | GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Summary (Overview)

New benchmark for shared-memory governance: G ATE M EM evaluates LLM agents on three coupled dimensions—utility (authorized recall), access control (withholding from unauthorized requesters), and active forgetting (non-recovery after deletion)—in multi-principal shared environments (medical, office, education, household).
Dataset scale and composition: 91 long-form multi-party episodes with 2,218 hidden checkpoints, balanced across utility (728), access control (727), and active forgetting (763). Each checkpoint includes hidden annotations (expected action, judge spec, leak targets).
Key finding—no method excels on all three axes: Across 7 baselines (long-context, naive RAG, policy RAG, A-MEM, Mem0, ReMeM-I, ReMeM-S) and 6 backbone LLMs, long-context prompting achieves the highest Memory Governance Score (MGS) but at high token cost; retrieval and external-memory methods reduce cost but leak unauthorized or deleted information.
Governance failures persist: Access-control violations arise from soft-overreach (indirect inference, cross-patient confusion) and active-forgetting failures from indirect confirmation or update-delete conflicts. Even policy-aware retrieval trades utility for safety, with over-refusal rates up to 63%.
Multiplicative MGS metric: MGS = U · (1 − A) · (1 − F) enforces that a system cannot compensate for security failures with high utility, nor for utility failures with perfect security.

Introduction and Theoretical Foundation

Large language model (LLM) agents are evolving from stateless chatbots to persistent assistants that maintain memory across interactions. Most existing memory benchmarks (e.g., LoCoMo, LongMemEval, MemBench) assume a single-principal private-memory setting where maximum recall is the sole objective. However, real-world deployments—hospitals, workplaces, campuses, households—involve multi-principal shared memory: multiple users (principals) write to and query a common memory pool under different roles, scopes, and relationships.

In these settings, high recall without governance is a security vulnerability. The paper formulates memory evaluation as a coupled governance problem with three requirements:

Utility: authorized requesters obtain current, in-scope answers.
Access Control: unauthorized or over-scoped requesters are refused or given redacted answers.
Active Forgetting: after an explicit deletion request, the agent cannot recover, confirm, or reconstruct the deleted information (interface-level forgetting, not certified physical erasure).

The benchmark introduces a multiplicative Memory Governance Score (MGS) to reflect the strict requirement that a system must be simultaneously useful, secure, and forgetful.

Methodology

Episode and Memory State

Each episode $e$ is an independent evaluation unit:

e = (S_e, E_e)

where $S_e = (D_e, P_e, R_e, G_e)$ defines the domain, principals, roles/relationships, and initial access rules. The interaction trace $E_e = (\tau_1, \tau_2, \ldots, \tau_T)$ is a temporally ordered sequence of turns:

\tau_t = (p_t, r_t, z_t, u_t)

with speaker $p_t$ , timestamp $r_t$ , turn type $z_t$ (dialogue, note update, lab result, deletion request, etc.), and utterance $u_t$ . The agent incrementally ingests turns:

M_t^{(e)} = \text{Ingest}(M_{t-1}^{(e)}, \tau_t, S_e)

Checkpoints and Governance Categories

Hidden checkpoints $H = \{(c_n, y_n)\}_{n=1}^N$ are inserted at selected turn boundaries. The visible input is:

c_n = (e_n, t_n, p_n^{\text{req}}, x_n)

where $p_n^{\text{req}}$ is the authenticated requester and $x_n$ the query. The hidden annotation is:

y_n = (q_n, a_n^\star, J_n, \Lambda_n)

with category $q_n$ (utility, access control, active forgetting), expected action $a_n^\star$ (answer, answer_redacted, refuse, no_memory), judge spec $J_n$ , and leak targets $\Lambda_n$ .

Evaluation Metrics

Effective Utility (for $n \in C_u$ ): $U = \frac{1}{N_u} \sum_{n \in C_u} \mathbf{1}[\hat{a}_n = a_n^\star \land \text{Satisfies}(\hat{r}_n, J_n)]$
Access-Control Violation (for $n \in C_a$ , $R = \{\text{refuse}, \mathrm{answer\_redacted}\}$ ): $A = \frac{1}{N_a} \sum_{n \in C_a} \mathbf{1}[\text{Leaks}(\hat{r}_n, \Lambda_n) \lor \hat{a}_n \notin R]$
Active Forgetting Failure (for $n \in C_f$ , $a_\emptyset = \mathrm{no\_memory}$ ): $F = \frac{1}{N_f} \sum_{n \in C_f} \mathbf{1}[\text{Recovered}(\hat{r}_n, \Lambda_n) \lor \hat{a}_n \neq a_\emptyset]$
Memory Governance Score: $\text{MGS} = U \cdot (1 - A) \cdot (1 - F)$
Efficiency: Sec/ckpt and Tok/ckpt averaged over checkpoints.

Dataset Construction and Quality Control

The benchmark covers four domains with 91 episodes and 2,218 checkpoints. A four-stage quality control pipeline ensures schema consistency, evidence support for gold answers, deletion-chain closure, and manual leak-target inspection.

Domain	Ep.	Turns/ep.	Tokens/turn	Princ./ep.	Roles/ep.	Ckpts./ep.	# U	# A	# F	Total
Medical	21	204.5	16.4	15.0	11.0	27.6	210	192	177	579
Office	17	241.2	28.9	17.8	14.8	32.2	154	171	222	547
Education	30	224.9	24.4	12.6	11.6	18.0	180	180	180	540
Household	23	224.0	24.7	9.8	9.6	24.0	184	184	184	552
Total	91	223.0	23.5	13.4	11.6	24.4	728	727	763	2,218

Empirical Validation / Results

Main Results (Table 3)

Experiments compare 7 baselines (Long-Context, Naive RAG, Policy RAG, A-MEM, Mem0, ReMeM-I, ReMeM-S) across 6 backbone LLMs (GPT-5.4, Deepseek-V4-Pro, Llama-4-Maverick, GPT-5-mini, GPT-4o-mini, Gemini-2.5-Flash-Lite).

Key findings:

Long-context prompting achieves the highest MGS in most backbone–domain blocks (e.g., GPT-5.4 Medical: MGS=80.1%; Deepseek-V4-Pro Education: 71.0%) but suffers non-negligible leakage (A up to 33.9%, F up to 64.9% on weaker backbones).
Policy RAG reduces access-control violations compared to Naive RAG but often lowers utility due to over-refusal (e.g., GPT-5.4 Medical: U=37.1% vs Long-Context 91.4%).
External-memory systems (A-MEM, Mem0, ReMeM) do not consistently outperform simpler baselines on MGS; they often leak protected or deleted information.
Backbone strength matters: Stronger models (GPT-5.4, Deepseek-V4-Pro) achieve higher MGS, but even they fail to simultaneously satisfy all three governance dimensions.
Efficiency trade-off: Long-Context is fastest (4.22s/ckpt on Medical) but most token-intensive (4.04k tok/ckpt); ReMeM reduces tokens (~1k/ckpt) but incurs high latency (up to 267s/ckpt).

Method (GPT-5.4)	Medical U↑ A↓ F↓ MGS↑	Office U↑ A↓ F↓ MGS↑	Education U↑ A↓ F↓ MGS↑	Household U↑ A↓ F↓ MGS↑
Long-Context	91.4 10.4 2.3 80.1	89.6 33.9 4.5 56.5	85.6 12.8 7.8 68.8	73.4 16.8 11.4 54.0
Naive RAG	64.8 25.0 7.9 44.7	74.0 29.8 9.5 47.0	32.8 12.8 32.8 19.2	51.1 19.0 10.9 36.9
Policy RAG	37.1 10.9 4.0 31.8	76.0 19.9 6.3 57.0	22.2 9.4 16.1 16.9	39.1 14.7 14.1 28.7
A-MEM	65.7 24.0 6.8 46.6	79.2 31.0 11.7 48.3	32.2 15.0 37.2 17.2	51.1 20.1 10.9 36.4
Mem0	38.1 28.1 5.6 25.8	40.3 16.4 14.4 28.8	27.2 8.9 15.0 21.1	25.5 10.3 9.2 20.8
ReMeM-I	56.7 28.6 9.0 36.8	59.7 28.6 6.3 40.0	16.7 13.3 33.9 9.5	32.6 15.2 15.2 23.4
ReMeM-S	54.1 26.3 8.7 36.4	58.6 29.6 6.9 38.4	16.3 13.9 34.1 9.2	31.6 16.8 20.1 21.0

(Table 3 excerpt for GPT-5.4; full table in paper includes all backbones.)

Diagnostic and Failure Analysis

Retrieval-depth sensitivity (Fig. 3a): Utility scales with top- $k$ , but Policy RAG maintains higher access and forgetting safety across all depths than Naive RAG.
Over-refusal (Fig. 3b): Policy RAG has up to 63.3% over-refusal rate on legitimate utility queries, confirming the utility–safety trade-off.
Attack-type breakdown (Fig. 4): Access-control failures are driven by soft-overreach (cross-patient, indirect inference, unassigned clinician). Active-forgetting failures are triggered by indirect confirmation (yes/no probes) and update-delete conflicts.
Qualitative examples (Table 5): Models may take a restrictive action (refuse/answer_redacted) but still leak protected content in the response text; or they may confirm deleted information despite the expected no_memory action.

Theoretical and Practical Implications

Memory as governed state, not recall resource: The paper reframes memory evaluation from "how much can the agent recall" to "does the agent recall the right thing for the right person at the right time". This has direct implications for deploying LLM agents in institutional settings with privacy regulations (HIPAA, GDPR, FERPA).
Multiplicative metric enforces joint optimization: The MGS = U·(1−A)·(1−F) formulation prevents systems from compensating for security failures with high utility, or utility failures with perfect security. This incentivizes balanced agent design.
Practical deployment challenges: Current agents remain far from reliable shared institutional deployment. Long-context prompting is the strongest approach but computationally expensive; retrieval methods reduce cost but leak information. Future work must co-optimize governance, latency, and token cost.
Interface-level forgetting is insufficient: The paper focuses on behavioral non-recoverability, but certified physical erasure from vector stores, caches, and model parameters remains an open problem.

Conclusion

G ATE M EM is a benchmark for evaluating memory governance in multi-principal shared-memory agents across utility, access control, and active forgetting. The dataset comprises 91 long-form episodes and 2,218 hidden checkpoints across four domains. Experiments reveal that no current method simultaneously achieves strong performance on all three dimensions. Long-context prompting offers the best governance trade-off but at high computational cost, while retrieval and external-memory baselines remain vulnerable to unauthorized disclosure and post-deletion recovery. These results indicate that future agents must treat memory not merely as a recall resource, but as a governed shared state with reliable access and deletion semantics. The benchmark, code, dataset, and leaderboard are publicly available.