Visual Summary | EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Summary (Overview)

EvoArena: A new benchmark suite evaluating LLM agents under persistent environment evolution across three domains – terminal workflows (Terminal-Bench-Evo), software engineering (SWE-Chain-Evo), and social preferences (PersonaMem-Evo). It measures both step-level and chain-level accuracy.
Current agents struggle: Average step accuracy of 39.6% across EvoArena; chain-level accuracy is substantially lower (e.g., 21.5% on Terminal-Bench-Evo, 10.0% on SWE-Chain-Evo).
EvoMem: A patch-based memory paradigm that records memory updates (change, rationale, evidence) as an append-only history, enabling agents to retrieve versioned evidence at inference time.
EvoMem improves robustness: Average gains of 1.5% in step-level and 3.7% in chain-level accuracy on EvoArena, plus 6.5% on GAIA and 3.3% on LoCoMo. Gains are larger at chain level than at step level.
Mechanistic analysis: EvoMem helps when retrieved patches are operationalized (Terminal), reduces regressions on previous behavior (SWE), and improves complete evidence preservation for temporal/dispersed preference reasoning (PersonaMem).

Introduction and Theoretical Foundation

Current LLM agent benchmarks (e.g., WebArena, SWE-bench, GAIA, AgentBench) assume static environments – interfaces, rules, code states, and task conditions are fixed once constructed. In real-world deployment, environments continuously evolve: APIs change, codebases accumulate milestones, user preferences shift. Existing dynamic evaluations (refresh tasks, asynchronous events, self-evolving instances) do not test persistent environment evolution where the same setting changes across versions and the agent must know what changed, what still holds, and how to act under the current version.

Key failure mode: state collapse – most memory-based agents maintain a single latest memory state. When newer information overwrites older but still-valid knowledge (e.g., a workflow permission update that should only apply to a new release, not an older one), the agent loses the previous behavior and its contextual validity period.
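
To make state collapse concrete, the following minimal sketch contrasts a single latest-state memory with one that keeps validity metadata. The key names, release numbers, and the `lookup` helper are hypothetical illustrations, not part of EvoArena or of any agent's actual memory.

```python
# Hypothetical illustration of state collapse; names and values are invented.

# Single latest-state memory: a later update overwrites the earlier rule.
latest_only = {}
latest_only["deploy_permission"] = "maintainers"       # rule learned under release 1
latest_only["deploy_permission"] = "release-managers"  # rule learned under release 2

# Versioned memory: each rule is kept together with the release it applies from.
versioned = [
    {"key": "deploy_permission", "value": "maintainers", "valid_from": 1},
    {"key": "deploy_permission", "value": "release-managers", "valid_from": 2},
]

def lookup(history, key, release):
    """Return the most recent value for `key` that was valid at `release`."""
    candidates = [h for h in history if h["key"] == key and h["valid_from"] <= release]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h["valid_from"])["value"]

# A task that targets release 1 gets the newer rule from the collapsed memory...
collapsed_answer = latest_only["deploy_permission"]
# ...but the older, still-valid rule when the validity period is preserved.
versioned_answer = lookup(versioned, "deploy_permission", release=1)
print(collapsed_answer, versioned_answer)
```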

Methodology

EvoArena Construction

Three evolution regimes, each with a version chain:

Terminal-Bench-Evo (executable workflow evolution): Starting from 89 initial Terminal-Bench tasks, 5-version chains are constructed through workflow-state analysis, evolution-plan design, inherited version realization, and quality control. This yields 441 task instances in total (352 evolved + 89 initial). Evolution types include I/O changes (49.1%), workspace/module (13.4%), CLI/API (10.5%), dependency (8.0%), and semantic/policy (4.6%).
SWE-Chain-Evo (software evolution): 50 evolution chains drawn from 12 repositories (Go, Python), covering 145 unique milestones and 493 chain-step instances. Chains are 5–15 steps long. Evaluation uses oracle-state progression: the reference milestone patch is applied between steps, so that each step measures adaptation to the evolving codebase rather than errors compounding from earlier steps.
PersonaMem-Evo (social intelligence evolution): 10 persona conversations, 313 preference evolution chains, and 505 multiple-choice questions. Question types: single-pattern transfer (130), multi-pattern synthesis (129), temporal trajectory (129), and conflict resolution (117). Histories are long-context (median 174.7K tokens, 597 turns).
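
The two metrics reported over these chains can be sketched as follows. This assumes step-level accuracy is the fraction of individual steps solved and that a chain counts as solved only when every step in it is solved; the summary does not spell out the exact definition, so the all-steps rule, the function names, and the example outcomes are assumptions.

```python
from typing import Dict, List

def step_accuracy(chains: Dict[str, List[bool]]) -> float:
    """Fraction of individual steps solved, pooled over all chains."""
    outcomes = [ok for steps in chains.values() for ok in steps]
    return sum(outcomes) / len(outcomes)

def chain_accuracy(chains: Dict[str, List[bool]]) -> float:
    """Fraction of chains in which every step was solved (assumed definition)."""
    return sum(all(steps) for steps in chains.values()) / len(chains)

# Invented outcomes for three 5-version chains (True = step solved).
results = {
    "chain_a": [True, True, True, True, True],
    "chain_b": [True, True, False, True, True],
    "chain_c": [True, False, False, True, False],
}
print(f"step:  {step_accuracy(results):.3f}")
print(f"chain: {chain_accuracy(results):.3f}")
```

Under this assumed definition a single failed step sinks the whole chain, which is one plausible reason chain-level numbers sit well below step-level ones.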

EvoMem: Patch-Based Memory Evolution

Formal Definition: Let $x_t$ be the input at time $t$ and $M_{t-1}$ the base memory state before that input. The base updater produces $M_t = \mathcal{U}(M_{t-1}, x_t)$.

EvoMem monitors the transition and computes:

$$\Delta_t = \text{Diff}(M_{t-1}, M_t)$$

creating a patch only for non-additive updates (revisions, overwrites, reinterpretations).
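
A minimal sketch of the diff step, assuming memory is a flat key-value store. The `diff` function, the `MemoryDiff` type, and the rule that only changed or removed keys count as non-additive are illustrative assumptions; the paper's $\text{Diff}$ may operate on richer memory structures.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class MemoryDiff:
    added: Dict[str, str] = field(default_factory=dict)      # keys new in M_t
    removed: Dict[str, str] = field(default_factory=dict)    # keys dropped from M_{t-1}
    changed: Dict[str, tuple] = field(default_factory=dict)  # key -> (old, new)

    @property
    def is_non_additive(self) -> bool:
        """True when existing knowledge was revised or removed."""
        return bool(self.removed or self.changed)

def diff(prev: Dict[str, str], curr: Dict[str, str]) -> MemoryDiff:
    """Compare two memory states key by key."""
    d = MemoryDiff()
    for key, value in curr.items():
        if key not in prev:
            d.added[key] = value
        elif prev[key] != value:
            d.changed[key] = (prev[key], value)
    for key, value in prev.items():
        if key not in curr:
            d.removed[key] = value
    return d

m_prev = {"build_cmd": "make all"}
m_additive = {"build_cmd": "make all", "test_cmd": "pytest"}
m_revised = {"build_cmd": "cmake --build .", "test_cmd": "pytest"}

print(diff(m_prev, m_additive).is_non_additive)     # purely additive update
print(diff(m_additive, m_revised).is_non_additive)  # existing entry overwritten
```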

Each patch stored in append-only history:

$$p_t = \langle \tau_t, C_t^-, C_t^+, r_t, z_t, e_t \rangle$$

where $\tau_t$ = temporal metadata, $C_t^-$ = pre-update memory, $C_t^+$ = post-update memory, $r_t$ = update rationale, $z_t$ = concise summary, $e_t$ = supporting evidence (triggering interaction, task context, execution feedback, environment snapshot).

Patch history: $P_{1:t} = \{p_1, \ldots, p_t\}$.
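
One way to represent the patch tuple and its append-only history is sketched below. Field names mirror the symbols above; the `Patch` and `PatchHistory` classes, the string types, and the example values are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)  # frozen: a stored patch is never edited in place
class Patch:
    tau: str        # temporal metadata, e.g. a timestamp or version label
    c_minus: str    # pre-update memory content (C_t^-)
    c_plus: str     # post-update memory content (C_t^+)
    rationale: str  # why the update happened (r_t)
    summary: str    # concise summary (z_t)
    evidence: str   # supporting evidence (e_t)

class PatchHistory:
    """Append-only store: patches can be added and read, but not modified or removed."""

    def __init__(self) -> None:
        self._patches: List[Patch] = []

    def append(self, patch: Patch) -> None:
        self._patches.append(patch)

    def all(self) -> Tuple[Patch, ...]:
        # Return a tuple so callers cannot mutate the stored history.
        return tuple(self._patches)

history = PatchHistory()
history.append(Patch(
    tau="v2",
    c_minus="output written to results.csv",
    c_plus="output written to results.json",
    rationale="task version 2 changed the expected output format",
    summary="output format: CSV -> JSON",
    evidence="checker rejected results.csv in version 2",
))
print(len(history.all()), history.all()[0].summary)
```

Keeping both `c_minus` and `c_plus` in each record is what lets a later query recover the earlier behavior even after the base memory has moved on.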

Retrieval: Given a query $q$ at inference time (after $T$ updates):

Latest memory retrieval: $c_{\text{mem}} = \mathcal{R}_{\text{mem}}(q, M_T)$
Patch retrieval: $P_q = \mathcal{R}_{\text{patch}}(q, P_{1:T})$
Final context: $c(q) = \text{Concat}(c_{\text{mem}}, P_q)$
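
The retrieval step can be sketched as below, using plain word overlap as a stand-in for both retrievers. The scoring function, `top_k`, and the example memory and patch strings are assumptions; an actual agent would plug in its own $\mathcal{R}_{\text{mem}}$ and $\mathcal{R}_{\text{patch}}$ (for example, embedding search).

```python
from typing import List

def overlap(query: str, text: str) -> int:
    """Number of distinct lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query: str, items: List[str], top_k: int) -> List[str]:
    """Return up to top_k items with non-zero overlap, best match first."""
    scored = [(overlap(query, item), item) for item in items]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [item for _, item in ranked[:top_k]]

def build_context(query: str, memory: List[str], patches: List[str], top_k: int = 2) -> str:
    c_mem = retrieve(query, memory, top_k)   # R_mem(q, M_T)
    p_q = retrieve(query, patches, top_k)    # R_patch(q, P_{1:T})
    return "\n".join(["[memory]"] + c_mem + ["[patches]"] + p_q)  # Concat(c_mem, P_q)

memory = ["output format is json", "tests run with pytest"]
patches = [
    "v2: output format changed from csv to json because the checker was updated",
    "v3: dependency pin moved from numpy 1.x to 2.x",
]
context = build_context("which output format did version 1 use", memory, patches)
print(context)
```

In this toy case the latest memory alone would suggest JSON, while the retrieved patch exposes that the earlier version used CSV.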

Instantiation: EvoMem is instantiated on four agents: Terminus2 (terminal), OpenHands (software), A-Mem (conversational memory), and Memento-Skill (tool use). Each agent defines its base memory $M_T$, its triggers for non-additive updates, and its patch retrieval function.
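
Because the patch layer only observes memory before and after the base update, it can in principle wrap an existing updater without changing it. A hedged sketch of that composition follows; `EvoMemWrapper`, `kv_update`, and `overwrote_existing` are hypothetical names, and the toy dictionary memory stands in for an agent's real memory.

```python
from typing import Callable, Dict, List

Memory = Dict[str, str]

class EvoMemWrapper:
    """Wraps a base updater U and logs a patch whenever the trigger fires."""

    def __init__(self,
                 base_update: Callable[[Memory, str], Memory],
                 is_non_additive: Callable[[Memory, Memory], bool]) -> None:
        self.base_update = base_update
        self.is_non_additive = is_non_additive
        self.memory: Memory = {}
        self.patches: List[dict] = []  # append-only patch log

    def observe(self, x_t: str, step: int) -> None:
        prev = dict(self.memory)
        self.memory = self.base_update(prev, x_t)    # M_t = U(M_{t-1}, x_t)
        if self.is_non_additive(prev, self.memory):  # agent-specific trigger
            self.patches.append({"tau": step, "before": prev,
                                 "after": dict(self.memory), "evidence": x_t})

# Toy base updater: inputs look like "key=value" and set that key.
def kv_update(memory: Memory, x_t: str) -> Memory:
    key, value = x_t.split("=", 1)
    return {**memory, key: value}

# Toy trigger: fire when any existing key changed value or disappeared.
def overwrote_existing(prev: Memory, curr: Memory) -> bool:
    return any(curr.get(k) != v for k, v in prev.items())

agent_memory = EvoMemWrapper(kv_update, overwrote_existing)
for step, x in enumerate(["shell=bash", "editor=vim", "shell=zsh"], start=1):
    agent_memory.observe(x, step)

print(agent_memory.memory)
print(len(agent_memory.patches))
```

Only the third input overwrites an existing entry, so only that step produces a patch; the two purely additive inputs leave the patch log untouched.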

Empirical Validation / Results

Main Results on EvoArena (Table 3)

Benchmark	Agent	Model	Base Step	+EvoMem Step	∆ Step	Base Chain	+EvoMem Chain	∆ Chain
Terminal-Bench-Evo	Terminus2	GPT-5.5	62.8	65.1	+2.3	31.8	45.5	+13.7
Terminal-Bench-Evo	Terminus2	Gemini-3.1-Pro	53.8	56.5	+2.7	39.3	44.1	+4.8
Terminal-Bench-Evo	Terminus2	Average (8 models)	43.6	46.0	+2.4	21.5	27.6	+6.1
SWE-Chain-Evo	OpenHands	GPT-5.5	49.7	50.9	+1.2	12.2	16.8	+4.6
SWE-Chain-Evo	OpenHands	Average (8 models)	27.9	28.3	+0.4	10.0	12.1	+2.1
PersonaMem-Evo	A-Mem	GPT-5.5	40.0	43.8	+3.8	37.5	41.2	+3.7
PersonaMem-Evo	A-Mem	Average (8 models)	47.3	49.0	+1.7	40.0	43.2	+3.2
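
The headline averages quoted in the summary can be re-derived from the three "Average (8 models)" rows above by taking an unweighted mean over benchmarks. The unweighted mean is an assumption about how the headline figures were aggregated, and the variable names are illustrative.

```python
# "Average (8 models)" rows from Table 3: (base step, step delta, chain delta)
rows = {
    "Terminal-Bench-Evo": (43.6, 2.4, 6.1),
    "SWE-Chain-Evo": (27.9, 0.4, 2.1),
    "PersonaMem-Evo": (47.3, 1.7, 3.2),
}

def mean_column(index: int) -> float:
    """Unweighted mean of one column across the three benchmarks, to one decimal."""
    values = [row[index] for row in rows.values()]
    return round(sum(values) / len(values), 1)

base_step = mean_column(0)
step_gain = mean_column(1)
chain_gain = mean_column(2)
print(base_step, step_gain, chain_gain)
```

This reproduces the 39.6% base step accuracy and the 1.5% step gain. The chain gain comes out at 3.8 rather than the reported 3.7, a gap that would be consistent with the per-benchmark deltas having been rounded before being listed.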

Main Results on Standard Benchmarks (Table 4)

Benchmark	Agent	Model	Base	+EvoMem	∆
GAIA	Memento-Skill	Average (6 models)	65.8	72.3	+6.5
LoCoMo	A-Mem	Average (6 models)	39.7	43.0	+3.3

Terminal-Bench-Evo Mechanism Analysis (Table 5)

EvoMem helps most when patches are operationalized. Tasks with patch uptake $\geq 1$ gain +8.3% (vs. +2.6% with no uptake); command-level uptake gives +6.2% (vs. +3.1%); high evolution-requirement coverage gives +5.3% (vs. +2.1% for low coverage); and patch example retrieval gives +6.5% (vs. +3.1% with none).

SWE-Chain-Evo Regression Analysis (Table 6)

Pass_to_Pass failure rate (lower is better):

Model	Base	+EvoMem	∆
Qwen3.6-27B	9.01%	6.73%	-2.28%
Kimi-K2.6	7.14%	3.33%	-3.81%
Gemini-3.1-Pro	11.11%	8.89%	-2.22%
Average	9.09%	6.32%	-2.77%
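
For reference, a Pass_to_Pass failure rate can be computed as sketched below. This assumes the SWE-bench-style meaning of Pass_to_Pass (tests that passed before the change and are expected to keep passing) and defines the rate as the share of those tests that fail after the agent's patch. The test names and outcomes are invented, and the paper may aggregate differently (for instance per instance rather than per test).

```python
from typing import Dict, Iterable

def pass_to_pass_failure_rate(p2p_tests: Iterable[str],
                              after_patch: Dict[str, bool]) -> float:
    """Share of previously passing tests that no longer pass after the patch.

    A test missing from `after_patch` (e.g. it could not be run) counts as failed.
    """
    tests = list(p2p_tests)
    if not tests:
        return 0.0
    failed = sum(not after_patch.get(name, False) for name in tests)
    return failed / len(tests)

# Invented example: four tests that passed before the agent's change.
p2p = ["test_parse", "test_render", "test_cli_help", "test_cache"]
outcomes = {"test_parse": True, "test_render": False, "test_cli_help": True}

rate = pass_to_pass_failure_rate(p2p, outcomes)
print(f"{rate:.2%}")
```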

PersonaMem-Evo Breakdown (Table 7 & 8)

By question type:

Temporal trajectory: +5.2% (46.6→51.7)
Multi-pattern synthesis: +5.2% (38.8→44.0)
Conflict resolution: -0.9% (29.5→28.6)
Single-pattern transfer: -1.8% (46.2→44.4)

Evidence capture:

Clause-level capture: 89.4→90.3% (+0.9%)
Row-level capture: 72.5→74.9% (+2.4%)
Largest gain on temporal trajectory (row-level +4.4%) and multi-pattern synthesis (row-level +3.5%)

Theoretical and Practical Implications

Theoretical: Environment evolution requires version-aware state tracking – memory should not collapse to a single latest state. EvoMem formalizes memory as an evolving history of grounded updates, connecting each state to prior states, rationales, and evidence.
Practical: EvoMem is lightweight (patch-based, with no need to redesign the base memory) and general (applied across terminal, software, conversational, and skill agents). It improves results on both evolving and static benchmarks, suggesting that preserving update history is beneficial even when explicit evolution is not present.
Chain-level gains: EvoMem is especially useful for sustained reliability across dependent task sequences, relevant for deployment scenarios requiring multi-step adaptation.
Efficiency–accuracy trade-off: Longer trajectories and larger token budgets do not guarantee higher accuracy (e.g., GPT-5.5 reaches high accuracy but also has the highest token usage). Joint evaluation of accuracy and inference cost is essential.

Conclusion

This paper introduces EvoArena, a benchmark for evaluating LLM agents under persistent environment evolution across terminal workflows, software codebases, and social preferences. Current agents perform poorly (average 39.6% step accuracy), especially on chain-level evaluation (e.g., 10% on SWE-Chain-Evo). The proposed EvoMem patch-based memory paradigm records memory updates as structured histories ( $p_t = \langle \tau_t, C_t^-, C_t^+, r_t, z_t, e_t \rangle$ ), enabling agents to retrieve versioned evidence. EvoMem consistently improves performance across EvoArena (+1.5% step, +3.7% chain), GAIA (+6.5%), and LoCoMo (+3.3%). Mechanistic analysis shows EvoMem helps by operationalizing patch content (Terminal), reducing regressions (SWE), and preserving complete evolving evidence (PersonaMem). The work highlights the importance of treating memory as an evolving, inspectable history for reliable agent deployment in dynamic real-world environments.