Summary (Overview)
- New benchmark for shared-memory governance: GateMem evaluates LLM agents on three coupled dimensions in multi-principal shared environments (medical, office, education, household): utility (authorized recall), access control (withholding from unauthorized requesters), and active forgetting (non-recovery after deletion).
- Dataset scale and composition: 91 long-form multi-party episodes with 2,218 hidden checkpoints, balanced across utility (728), access control (727), and active forgetting (763). Each checkpoint includes hidden annotations (expected action, judge spec, leak targets).
- Key finding (no method excels on all three axes): Across 7 baselines (long-context, naive RAG, policy RAG, A-MEM, Mem0, ReMeM-I, ReMeM-S) and 6 backbone LLMs, long-context prompting achieves the highest Memory Governance Score (MGS) in most backbone–domain settings but at high token cost; retrieval and external-memory methods reduce cost but leak unauthorized or deleted information.
- Governance failures persist: Access-control violations arise from soft-overreach (indirect inference, cross-patient confusion) and active-forgetting failures from indirect confirmation or update-delete conflicts. Even policy-aware retrieval trades utility for safety, with over-refusal rates up to 63%.
- Multiplicative MGS metric: MGS = U · (1 − A) · (1 − F) enforces that a system cannot compensate for security failures with high utility, nor for utility failures with perfect security.
Introduction and Theoretical Foundation
Large language model (LLM) agents are evolving from stateless chatbots to persistent assistants that maintain memory across interactions. Most existing memory benchmarks (e.g., LoCoMo, LongMemEval, MemBench) assume a single-principal private-memory setting where maximum recall is the sole objective. However, real-world deployments—hospitals, workplaces, campuses, households—involve multi-principal shared memory: multiple users (principals) write to and query a common memory pool under different roles, scopes, and relationships.
In these settings, high recall without governance is a security vulnerability. The paper formulates memory evaluation as a coupled governance problem with three requirements:
- Utility: authorized requesters obtain current, in-scope answers.
- Access Control: unauthorized or over-scoped requesters are refused or given redacted answers.
- Active Forgetting: after an explicit deletion request, the agent cannot recover, confirm, or reconstruct the deleted information (interface-level forgetting, not certified physical erasure).
The benchmark introduces a multiplicative Memory Governance Score (MGS) to reflect the strict requirement that a system must be simultaneously useful, secure, and forgetful.
Methodology
Episode and Memory State
Each episode is an independent evaluation unit that pairs a configuration with an interaction trace. The configuration defines the domain, principals, roles/relationships, and initial access rules. The interaction trace is a temporally ordered sequence of turns, each carrying a speaker, a timestamp, a turn type (dialogue, note update, lab result, deletion request, etc.), and an utterance. The agent ingests turns incrementally, updating its memory state after each one.
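The episode structure described here can be sketched as a pair of small record types plus a replay loop. This is a minimal illustration, not the benchmark's actual schema: the class names, field names, and the `AppendOnlyMemory` store are all hypothetical, and timestamps are assumed to be ISO-8601 strings so that string order matches time order.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Turn:
    speaker: str
    timestamp: str   # assumed ISO-8601, so lexical order equals temporal order
    turn_type: str   # e.g. "dialogue", "note_update", "lab_result", "deletion_request"
    utterance: str


@dataclass
class Episode:
    domain: str
    principals: dict[str, str]   # hypothetical layout: principal name -> role
    access_rules: list[str]      # initial access rules, kept as free text here
    trace: list[Turn] = field(default_factory=list)


class AppendOnlyMemory:
    """Toy memory state: keeps every ingested turn in arrival order."""

    def __init__(self) -> None:
        self.turns: list[Turn] = []

    def ingest(self, turn: Turn) -> None:
        # Enforce the temporal ordering of the interaction trace.
        if self.turns and turn.timestamp < self.turns[-1].timestamp:
            raise ValueError("turns must arrive in temporal order")
        self.turns.append(turn)


def replay(episode: Episode, upto: int) -> AppendOnlyMemory:
    """Ingest the first `upto` turns one at a time and return the memory state."""
    memory = AppendOnlyMemory()
    for turn in episode.trace[:upto]:
        memory.ingest(turn)
    return memory
```

A real agent would replace `AppendOnlyMemory` with its own store (retrieval index, external memory, or a long context buffer); the point of the sketch is only that memory at a checkpoint depends on exactly the turns seen so far.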
Checkpoints and Governance Categories
Hidden checkpoints are inserted at selected turn boundaries. The visible input at a checkpoint consists of the authenticated requester and the query. The hidden annotation consists of the category (utility, access control, active forgetting), the expected action (answer, answer_redacted, refuse, no_memory), a judge spec, and leak targets.
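The split between what the agent sees and what only the evaluator sees can be made concrete with a small data model. The sketch below is an assumption-laden illustration: the category and action values come from the text, but the type names, the `turn_index` field, and the `agent_view` helper are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    UTILITY = "utility"
    ACCESS_CONTROL = "access_control"
    ACTIVE_FORGETTING = "active_forgetting"


class Action(Enum):
    ANSWER = "answer"
    ANSWER_REDACTED = "answer_redacted"
    REFUSE = "refuse"
    NO_MEMORY = "no_memory"


@dataclass(frozen=True)
class VisibleInput:
    requester: str   # authenticated requester identity
    query: str


@dataclass(frozen=True)
class HiddenAnnotation:
    category: Category
    expected_action: Action
    judge_spec: str                      # free-text grading instructions (assumed format)
    leak_targets: tuple[str, ...] = ()   # content that must not appear in the response


@dataclass(frozen=True)
class Checkpoint:
    turn_index: int   # hypothetical: turn boundary where the checkpoint is inserted
    visible: VisibleInput
    hidden: HiddenAnnotation


def agent_view(checkpoint: Checkpoint) -> dict[str, str]:
    """Return only the fields the agent is allowed to see."""
    return {
        "requester": checkpoint.visible.requester,
        "query": checkpoint.visible.query,
    }
```

Keeping the hidden annotation in a separate object makes it harder to pass evaluator-only fields to the agent by accident.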
Evaluation Metrics
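- Effective Utility (U), computed over utility checkpoints.
- Access-Control Violation (A), computed over access-control checkpoints.
- Active Forgetting Failure (F), computed over active-forgetting checkpoints.
- Memory Governance Score: MGS = U · (1 − A) · (1 − F).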
- Efficiency: Sec/ckpt and Tok/ckpt averaged over checkpoints.
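As a rough illustration of how the three rates combine into MGS, the sketch below aggregates per-checkpoint outcomes. It assumes each of U, A, and F is the mean of a binary judge outcome over its own checkpoint category, which this summary does not state explicitly. The `governance_scores` function and the record format are hypothetical.

```python
from collections import defaultdict


def governance_scores(records):
    """Aggregate per-checkpoint outcomes into U, A, F and MGS.

    `records` is an iterable of (category, outcome) pairs with category in
    {"U", "A", "F"} and outcome in {0, 1}. For "U", 1 means the authorized
    requester got a useful answer; for "A" and "F", 1 means a violation or
    failure occurred (assumed binary judging).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, outcome in records:
        totals[category] += 1
        hits[category] += outcome

    missing = {"U", "A", "F"} - set(totals)
    if missing:
        raise ValueError(f"no checkpoints for categories: {sorted(missing)}")

    u = hits["U"] / totals["U"]
    a = hits["A"] / totals["A"]
    f = hits["F"] / totals["F"]
    # Multiplicative score: any single weak factor pulls the product down.
    return {"U": u, "A": a, "F": f, "MGS": u * (1 - a) * (1 - f)}
```

Because the score is a product, a system with U = 0 or with A = 1 scores 0 regardless of the other two factors.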
Dataset Construction and Quality Control
The benchmark covers four domains with 91 episodes and 2,218 checkpoints. A four-stage quality control pipeline ensures schema consistency, evidence support for gold answers, deletion-chain closure, and manual leak-target inspection.
| Domain | Ep. | Turns/ep. | Tokens/turn | Princ./ep. | Roles/ep. | Ckpts./ep. | # U | # A | # F | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Medical | 21 | 204.5 | 16.4 | 15.0 | 11.0 | 27.6 | 210 | 192 | 177 | 579 |
| Office | 17 | 241.2 | 28.9 | 17.8 | 14.8 | 32.2 | 154 | 171 | 222 | 547 |
| Education | 30 | 224.9 | 24.4 | 12.6 | 11.6 | 18.0 | 180 | 180 | 180 | 540 |
| Household | 23 | 224.0 | 24.7 | 9.8 | 9.6 | 24.0 | 184 | 184 | 184 | 552 |
| Total | 91 | 223.0 | 23.5 | 13.4 | 11.6 | 24.4 | 728 | 727 | 763 | 2,218 |
Empirical Validation / Results
Main Results (Table 3)
Experiments compare 7 baselines (Long-Context, Naive RAG, Policy RAG, A-MEM, Mem0, ReMeM-I, ReMeM-S) across 6 backbone LLMs (GPT-5.4, Deepseek-V4-Pro, Llama-4-Maverick, GPT-5-mini, GPT-4o-mini, Gemini-2.5-Flash-Lite).
Key findings:
- Long-context prompting achieves the highest MGS in most backbone–domain blocks (e.g., GPT-5.4 Medical: MGS=80.1%; Deepseek-V4-Pro Education: 71.0%) but suffers non-negligible leakage (A up to 33.9%, F up to 64.9% on weaker backbones).
- Policy RAG reduces access-control violations compared to Naive RAG but often lowers utility due to over-refusal (e.g., GPT-5.4 Medical: U=37.1% vs Long-Context 91.4%).
- External-memory systems (A-MEM, Mem0, ReMeM) do not consistently outperform simpler baselines on MGS; they often leak protected or deleted information.
- Backbone strength matters: Stronger models (GPT-5.4, Deepseek-V4-Pro) achieve higher MGS, but even they fail to simultaneously satisfy all three governance dimensions.
- Efficiency trade-off: Long-Context is fastest (4.22s/ckpt on Medical) but most token-intensive (4.04k tok/ckpt); ReMeM reduces tokens (~1k/ckpt) but incurs high latency (up to 267s/ckpt).
| Method (GPT-5.4) | Medical U↑ A↓ F↓ MGS↑ | Office U↑ A↓ F↓ MGS↑ | Education U↑ A↓ F↓ MGS↑ | Household U↑ A↓ F↓ MGS↑ |
|---|---|---|---|---|
| Long-Context | 91.4 10.4 2.3 80.1 | 89.6 33.9 4.5 56.5 | 85.6 12.8 7.8 68.8 | 73.4 16.8 11.4 54.0 |
| Naive RAG | 64.8 25.0 7.9 44.7 | 74.0 29.8 9.5 47.0 | 32.8 12.8 32.8 19.2 | 51.1 19.0 10.9 36.9 |
| Policy RAG | 37.1 10.9 4.0 31.8 | 76.0 19.9 6.3 57.0 | 22.2 9.4 16.1 16.9 | 39.1 14.7 14.1 28.7 |
| A-MEM | 65.7 24.0 6.8 46.6 | 79.2 31.0 11.7 48.3 | 32.2 15.0 37.2 17.2 | 51.1 20.1 10.9 36.4 |
| Mem0 | 38.1 28.1 5.6 25.8 | 40.3 16.4 14.4 28.8 | 27.2 8.9 15.0 21.1 | 25.5 10.3 9.2 20.8 |
| ReMeM-I | 56.7 28.6 9.0 36.8 | 59.7 28.6 6.3 40.0 | 16.7 13.3 33.9 9.5 | 32.6 15.2 15.2 23.4 |
| ReMeM-S | 54.1 26.3 8.7 36.4 | 58.6 29.6 6.9 38.4 | 16.3 13.9 34.1 9.2 | 31.6 16.8 20.1 21.0 |
(Table 3 excerpt for GPT-5.4. All values are percentages, and each domain cell lists U, A, F, and MGS in that order. The full table in the paper includes all backbones.)
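The reported MGS values can be checked against MGS = U · (1 − A) · (1 − F) using the U, A, and F columns of the excerpt. The sketch below copies six cells from the table; the helper names are hypothetical. Recomputed values may differ from the reported ones in the last digit, plausibly because the inputs are already rounded.

```python
# (U, A, F, reported MGS) in percent, copied from the GPT-5.4 excerpt.
REPORTED = {
    "Long-Context": {"Medical": (91.4, 10.4, 2.3, 80.1), "Office": (89.6, 33.9, 4.5, 56.5)},
    "Naive RAG":    {"Medical": (64.8, 25.0, 7.9, 44.7), "Office": (74.0, 29.8, 9.5, 47.0)},
    "Policy RAG":   {"Medical": (37.1, 10.9, 4.0, 31.8), "Office": (76.0, 19.9, 6.3, 57.0)},
}


def mgs_percent(u, a, f):
    """MGS = U * (1 - A) * (1 - F), with inputs and output in percent."""
    return 100.0 * (u / 100.0) * (1.0 - a / 100.0) * (1.0 - f / 100.0)


def recompute(table):
    """Map (method, domain) -> (recomputed MGS, reported MGS)."""
    result = {}
    for method, domains in table.items():
        for domain, (u, a, f, reported) in domains.items():
            result[(method, domain)] = (mgs_percent(u, a, f), reported)
    return result


def best_method(table, domain):
    """Method with the highest recomputed MGS in the given domain."""
    return max(table, key=lambda m: mgs_percent(*table[m][domain][:3]))


if __name__ == "__main__":
    for (method, domain), (calc, rep) in recompute(REPORTED).items():
        print(f"{method:12s} {domain:8s} recomputed={calc:5.1f} reported={rep:5.1f}")
```

Among these three methods, `best_method` picks Long-Context for Medical and Policy RAG for Office, matching the reported MGS columns. Office is one of the blocks where Long-Context is not the top method.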
Diagnostic and Failure Analysis
- Retrieval-depth sensitivity (Fig. 3a): Utility scales with top-k, but Policy RAG maintains higher access-control and forgetting safety than Naive RAG across all depths.
- Over-refusal (Fig. 3b): Policy RAG has up to 63.3% over-refusal rate on legitimate utility queries, confirming the utility–safety trade-off.
- Attack-type breakdown (Fig. 4): Access-control failures are driven by soft-overreach (cross-patient, indirect inference, unassigned clinician). Active-forgetting failures are triggered by indirect confirmation (yes/no probes) and update-delete conflicts.
- Qualitative examples (Table 5): Models may take a restrictive action (refuse/answer_redacted) but still leak protected content in the response text; or they may confirm deleted information despite the expected no_memory action.
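The qualitative failure mode above, where the declared action is restrictive but the response text still leaks, suggests checking the text separately from the action label. The sketch below uses plain case-insensitive substring matching against the leak targets. This is a simplification and an assumption: the benchmark grades with a judge spec, and substring matching would miss paraphrases and indirect confirmations. All names and the example strings are invented for illustration.

```python
import re

RESTRICTIVE_ACTIONS = {"refuse", "answer_redacted", "no_memory"}


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def leaked_targets(response_text: str, leak_targets) -> list[str]:
    """Return the leak targets that appear verbatim (after normalization) in the response."""
    haystack = _normalize(response_text)
    found = []
    for target in leak_targets:
        needle = _normalize(target)
        if needle and needle in haystack:   # skip empty targets
            found.append(target)
    return found


def audit(action: str, response_text: str, leak_targets) -> dict:
    """Flag a response as a violation if any leak target surfaces,
    even when the declared action is restrictive."""
    leaks = leaked_targets(response_text, leak_targets)
    return {
        "restrictive_action": action in RESTRICTIVE_ACTIONS,
        "leaks": leaks,
        "violation": bool(leaks),
    }


if __name__ == "__main__":
    # Invented example: a refusal that still repeats the protected value.
    report = audit(
        "refuse",
        "I can't share that record, but an HbA1c of 9.2% should be discussed with the care team.",
        ["HbA1c of 9.2%"],
    )
    print(report)
```

Scoring the response text rather than the action label is what lets a checkpoint count as a failure even when the agent nominally refused.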
Theoretical and Practical Implications
- Memory as governed state, not recall resource: The paper reframes memory evaluation from "how much can the agent recall" to "does the agent recall the right thing for the right person at the right time". This has direct implications for deploying LLM agents in institutional settings with privacy regulations (HIPAA, GDPR, FERPA).
- Multiplicative metric enforces joint optimization: The MGS = U·(1−A)·(1−F) formulation prevents systems from compensating for security failures with high utility, or utility failures with perfect security. This incentivizes balanced agent design.
- Practical deployment challenges: Current agents remain far from reliable shared institutional deployment. Long-context prompting is the strongest approach but computationally expensive; retrieval methods reduce cost but leak information. Future work must co-optimize governance, latency, and token cost.
- Interface-level forgetting is insufficient: The paper focuses on behavioral non-recoverability, but certified physical erasure from vector stores, caches, and model parameters remains an open problem.
Conclusion
GateMem is a benchmark for evaluating memory governance in multi-principal shared-memory agents across utility, access control, and active forgetting. The dataset comprises 91 long-form episodes and 2,218 hidden checkpoints across four domains. Experiments reveal that no current method simultaneously achieves strong performance on all three dimensions. Long-context prompting offers the best governance trade-off but at high computational cost, while retrieval and external-memory baselines remain vulnerable to unauthorized disclosure and post-deletion recovery. These results indicate that future agents must treat memory not merely as a recall resource, but as a governed shared state with reliable access and deletion semantics. The benchmark, code, dataset, and leaderboard are publicly available.