Summary (Overview)
- EvoArena: A new benchmark suite evaluating LLM agents under persistent environment evolution across three domains – terminal workflows (Terminal-Bench-Evo), software engineering (SWE-Chain-Evo), and social preferences (PersonaMem-Evo). It measures both step-level and chain-level accuracy.
- Current agents struggle: Average step accuracy of 39.6% across EvoArena; chain-level accuracy is substantially lower (e.g., 21.5% on Terminal-Bench-Evo, 10.0% on SWE-Chain-Evo).
- EvoMem: A patch-based memory paradigm that records memory updates (change, rationale, evidence) as an append-only history, enabling agents to retrieve versioned evidence at inference time.
- EvoMem improves robustness: On EvoArena, average gains are 1.5% in step-level accuracy and 3.7% in chain-level accuracy, so gains are larger at chain level than step level. On standard benchmarks, average gains are 6.5% on GAIA and 3.3% on LoCoMo.
- Mechanistic analysis: EvoMem helps when retrieved patches are operationalized (Terminal), reduces regressions on previous behavior (SWE), and improves complete evidence preservation for temporal/dispersed preference reasoning (PersonaMem).
Introduction and Theoretical Foundation
Current LLM agent benchmarks (e.g., WebArena, SWE-bench, GAIA, AgentBench) assume static environments – interfaces, rules, code states, and task conditions are fixed once constructed. In real-world deployment, environments continuously evolve: APIs change, codebases accumulate milestones, user preferences shift. Existing dynamic evaluations (refresh tasks, asynchronous events, self-evolving instances) do not test persistent environment evolution where the same setting changes across versions and the agent must know what changed, what still holds, and how to act under the current version.
Key failure mode: state collapse – most memory-based agents maintain a single latest memory state. When newer information overwrites older but still-valid knowledge (e.g., a workflow permission update that should only apply to a new release, not an older one), the agent loses the previous behavior and its contextual validity period.
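The state-collapse failure described above can be illustrated with a minimal sketch. It contrasts a memory that keeps only the latest value with one that also keeps the version each value applies from. The class names, the `deploy_permission` rule, and the release numbers are hypothetical and are not taken from the paper.

```python
class LatestOnlyMemory:
    """Single latest state: each write replaces the previous value."""

    def __init__(self):
        self.state = {}

    def write(self, key, value, since_version):
        # The version is dropped, so older-but-still-valid knowledge is lost.
        self.state[key] = value

    def read(self, key, version):
        return self.state.get(key)


class VersionedMemory:
    """Keeps every write together with the version from which it applies."""

    def __init__(self):
        self.history = {}

    def write(self, key, value, since_version):
        self.history.setdefault(key, []).append((since_version, value))

    def read(self, key, version):
        # Pick the newest entry that was already in force at `version`.
        valid = [entry for entry in self.history.get(key, []) if entry[0] <= version]
        return max(valid, key=lambda entry: entry[0])[1] if valid else None


def answer_for_release(memory, version):
    # Hypothetical rule and release numbers, used only for illustration.
    memory.write("deploy_permission", "any maintainer", since_version=1)
    memory.write("deploy_permission", "release manager only", since_version=2)
    return memory.read("deploy_permission", version)


print(answer_for_release(LatestOnlyMemory(), version=1))  # release manager only
print(answer_for_release(VersionedMemory(), version=1))   # any maintainer
```

When asked about the older release, the latest-only memory returns the rule that applies only to the newer one, while the versioned memory still recovers the earlier behavior.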
Methodology
EvoArena Construction
Three evolution regimes, each with a version chain:
- Terminal-Bench-Evo (executable workflow evolution): Starting from 89 initial Terminal-Bench tasks, 5-version chains are constructed through workflow-state analysis, evolution-plan design, inherited version realization, and quality control. Total: 441 task instances (352 evolved + 89 initial). Evolution types include I/O changes (49.1%), workspace/module (13.4%), CLI/API (10.5%), dependency (8.0%), and semantic/policy (4.6%).
- SWE-Chain-Evo (software evolution): 50 evolution chains from 12 repositories (Go, Python), with 145 unique milestones and 493 chain-step instances. Chain lengths range from 5 to 15 steps. Oracle-state progression: the reference milestone patch is applied between steps, which isolates adaptation to the evolving codebase from compounding errors.
- PersonaMem-Evo (social intelligence evolution): 10 persona conversations, 313 preference evolution chains, 505 multiple-choice questions. Question types: single-pattern transfer (130), multi-pattern synthesis (129), temporal trajectory (129), conflict resolution (117). Long-context histories (median 174.7K tokens, 597 turns).
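Because every regime is organized as version chains, results can be scored per step or per chain. The sketch below shows one way the two metrics could relate. It assumes that a chain counts as solved only when every step in it is solved; this summary does not spell out the exact chain-level definition, so treat that rule and the toy outcomes as assumptions.

```python
def step_accuracy(chains):
    """Fraction of individual steps solved, pooled over all chains."""
    steps = [solved for chain in chains for solved in chain]
    return sum(steps) / len(steps)


def chain_accuracy(chains):
    """Fraction of chains solved end to end.

    Assumption: a chain is solved only if all of its steps are solved.
    """
    return sum(all(chain) for chain in chains) / len(chains)


# Toy outcomes for three 5-step chains (True = step solved).
chains = [
    [True, True, True, True, True],
    [True, True, False, True, True],
    [True, False, True, True, False],
]

print(step_accuracy(chains))   # 0.8
print(chain_accuracy(chains))  # 0.3333333333333333
```

Under this assumption a single failed step voids its whole chain, which is one reason chain-level accuracy can sit far below step-level accuracy.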
EvoMem: Patch-Based Memory Evolution
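Formal definition: At each time step the agent receives an input and holds a base memory state. The base updater maps the current memory state and the new input to the next memory state.
EvoMem monitors each transition from the pre-update to the post-update memory state and computes a patch that describes the change, creating a patch only for non-additive updates (revisions, overwrites, reinterpretations).
Each patch is stored in an append-only history and consists of six fields: temporal metadata, pre-update memory, post-update memory, update rationale, concise summary, and supporting evidence (triggering interaction, task context, execution feedback, environment snapshot).
Patch history: the time-ordered sequence of all patches recorded so far.
Retrieval: Given a query:
- Latest memory retrieval: relevant entries are retrieved from the current base memory state.
- Patch retrieval: relevant patches are retrieved from the patch history.
- Final context: the retrieved latest memory combined with the retrieved patches.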
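The patch record and the two-part retrieval described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the key-value base memory, the keyword-overlap scoring, and all names (`Patch`, `EvoMemStore`, `tokens`) are assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


def tokens(text):
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass(frozen=True)
class Patch:
    time: int        # temporal metadata
    before: str      # pre-update memory
    after: str       # post-update memory
    rationale: str   # why the update happened
    summary: str     # concise description of the change
    evidence: str    # supporting evidence, e.g. an execution log


@dataclass
class EvoMemStore:
    memory: dict = field(default_factory=dict)   # base memory (latest state)
    history: list = field(default_factory=list)  # append-only patch history
    clock: int = 0

    def update(self, key, value, rationale, evidence):
        self.clock += 1
        old = self.memory.get(key)
        self.memory[key] = value  # assumed base updater: plain overwrite
        # Only non-additive updates (an existing entry changes) yield a patch.
        if old is not None and old != value:
            summary = f"{key}: {old} -> {value}"
            self.history.append(
                Patch(self.clock, old, value, rationale, summary, evidence))

    def retrieve(self, query, top_k=2):
        q = tokens(query)
        # Latest memory retrieval: entries sharing a token with the query.
        latest = {k: v for k, v in self.memory.items() if q & tokens(f"{k} {v}")}
        # Patch retrieval: rank patches by token overlap, newest first on ties.
        scored = [(len(q & tokens(f"{p.summary} {p.rationale}")), p)
                  for p in self.history]
        ranked = sorted((sp for sp in scored if sp[0] > 0),
                        key=lambda sp: (-sp[0], -sp[1].time))
        # Final context: latest memory plus the retrieved patches.
        return {"latest": latest, "patches": [p for _, p in ranked[:top_k]]}


store = EvoMemStore()
store.update("build_command", "make all", "initial setup", "README")
store.update("test_command", "make test", "initial setup", "README")
store.update("build_command", "ninja", "build system migrated in v2", "CI log")

context = store.retrieve("which build command applies to v1?")
print(context["latest"]["build_command"])  # ninja
print(context["patches"][0].before)        # make all
```

The first two updates are additive and leave no patch. The third overwrites an existing entry, so the earlier value stays recoverable through the patch history even though the base memory holds only the latest state.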
Instantiation: EvoMem is instantiated across four agents: Terminus2 (terminal), OpenHands (software), A-Mem (conversational memory), and Memento-Skill (tool use). Each agent defines its base memory, its triggers for non-additive updates, and its patch retrieval function.
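One way to picture the three per-agent components is as a small adapter interface wrapped by a generic update step. The sketch below is an assumption about how such a plug-in boundary might look. `EvoMemAdapter`, `SkillDictAdapter`, and `evomem_step` are hypothetical names, and the toy adapter is not the actual Memento-Skill integration.

```python
from typing import Any, Protocol


class EvoMemAdapter(Protocol):
    """Hypothetical per-agent hooks; the names are not taken from the paper."""

    def base_update(self, memory: Any, observation: Any) -> Any: ...
    def is_non_additive(self, before: Any, after: Any) -> bool: ...
    def retrieve_patches(self, query: str, patches: list) -> list: ...


class SkillDictAdapter:
    """Toy adapter: base memory is a dict of skill name -> instructions."""

    def base_update(self, memory, observation):
        name, instructions = observation
        return {**memory, name: instructions}

    def is_non_additive(self, before, after):
        # Additive updates only add keys; anything that changes or drops
        # an existing entry is treated as non-additive.
        return any(after.get(name) != text for name, text in before.items())

    def retrieve_patches(self, query, patches):
        return [p for p in patches if any(name in query for name in p["changed"])]


def evomem_step(adapter: EvoMemAdapter, memory, patches, observation):
    """Run the base updater, then append a patch if the update was non-additive."""
    updated = adapter.base_update(memory, observation)
    if adapter.is_non_additive(memory, updated):
        changed = [name for name in memory if updated.get(name) != memory[name]]
        patches = patches + [{"before": memory, "after": updated, "changed": changed}]
    return updated, patches


adapter = SkillDictAdapter()
memory, patches = {}, []
observations = [("search", "use engine A"), ("download", "use curl"),
                ("search", "use engine B")]
for obs in observations:
    memory, patches = evomem_step(adapter, memory, patches, obs)

print(memory["search"])                                          # use engine B
print(len(adapter.retrieve_patches("how to search?", patches)))  # 1
```

Keeping the base updater behind the adapter means the agent's existing memory is left unchanged, and only the trigger and retrieval hooks need to be supplied per agent.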
Empirical Validation / Results
Main Results on EvoArena (Table 3)
| Benchmark | Agent | Model | Base Step | +EvoMem Step | ∆ Step | Base Chain | +EvoMem Chain | ∆ Chain |
|---|---|---|---|---|---|---|---|---|
| Terminal-Bench-Evo | Terminus2 | GPT-5.5 | 62.8 | 65.1 | +2.3 | 31.8 | 45.5 | +13.7 |
| Terminal-Bench-Evo | Terminus2 | Gemini-3.1-Pro | 53.8 | 56.5 | +2.7 | 39.3 | 44.1 | +4.8 |
| Terminal-Bench-Evo | Terminus2 | Average (8 models) | 43.6 | 46.0 | +2.4 | 21.5 | 27.6 | +6.1 |
| SWE-Chain-Evo | OpenHands | GPT-5.5 | 49.7 | 50.9 | +1.2 | 12.2 | 16.8 | +4.6 |
| SWE-Chain-Evo | OpenHands | Average (8 models) | 27.9 | 28.3 | +0.4 | 10.0 | 12.1 | +2.1 |
| PersonaMem-Evo | A-Mem | GPT-5.5 | 40.0 | 43.8 | +3.8 | 37.5 | 41.2 | +3.7 |
| PersonaMem-Evo | A-Mem | Average (8 models) | 47.3 | 49.0 | +1.7 | 40.0 | 43.2 | +3.2 |
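The headline EvoArena figures can be checked against the "Average (8 models)" rows of the table above. The sketch assumes the headline numbers are unweighted means over the three benchmarks; the summary does not state the aggregation rule, so that is an assumption.

```python
# "Average (8 models)" rows copied from the table above (accuracy in %).
rows = {
    "Terminal-Bench-Evo": {"base_step": 43.6, "evo_step": 46.0,
                           "base_chain": 21.5, "evo_chain": 27.6},
    "SWE-Chain-Evo":      {"base_step": 27.9, "evo_step": 28.3,
                           "base_chain": 10.0, "evo_chain": 12.1},
    "PersonaMem-Evo":     {"base_step": 47.3, "evo_step": 49.0,
                           "base_chain": 40.0, "evo_chain": 43.2},
}


def macro_mean(column):
    """Unweighted mean of one column across the three benchmarks."""
    return sum(row[column] for row in rows.values()) / len(rows)


base_step = macro_mean("base_step")
step_gain = macro_mean("evo_step") - base_step
chain_gain = macro_mean("evo_chain") - macro_mean("base_chain")

print(round(base_step, 1))   # 39.6
print(round(step_gain, 1))   # 1.5
print(round(chain_gain, 1))  # 3.8
```

The step-level values reproduce the reported 39.6% baseline and +1.5% gain. The chain-level gain computed from these rounded table entries is about 3.8, close to the reported 3.7; the small gap would be consistent with the reported figure being derived from unrounded scores.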
Main Results on Standard Benchmarks (Table 4)
| Benchmark | Agent | Model | Base | +EvoMem | ∆ |
|---|---|---|---|---|---|
| GAIA | Memento-Skill | Average (6 models) | 65.8 | 72.3 | +6.5 |
| LoCoMo | A-Mem | Average (6 models) | 39.7 | 43.0 | +3.3 |
Terminal-Bench-Evo Mechanism Analysis (Table 5)
EvoMem helps most when retrieved patches are operationalized. Gains are +8.3% with patch uptake versus +2.6% without; +6.2% with command-level uptake versus +3.1% without; +5.3% when evolution-requirement coverage is high versus +2.1% when it is low; and +6.5% when patch examples are retrieved versus +3.1% when they are not.
SWE-Chain-Evo Regression Analysis (Table 6)
Pass_to_Pass failure rate (lower is better):
| Model | Base | +EvoMem | ∆ |
|---|---|---|---|
| Qwen3.6-27B | 9.01% | 6.73% | -2.28% |
| Kimi-K2.6 | 7.14% | 3.33% | -3.81% |
| Gemini-3.1-Pro | 11.11% | 8.89% | -2.22% |
| Average | 9.09% | 6.32% | -2.77% |
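A Pass_to_Pass failure rate of this kind can be computed as sketched below. The sketch assumes the usual reading of the term: Pass_to_Pass tests are those that passed before the agent's change and are expected to keep passing, and the rate is the share of them that fail afterwards. The function name, the test names, and the rule that a missing test counts as a failure are assumptions for illustration.

```python
def pass_to_pass_failure_rate(before, after):
    """Share of previously passing tests that no longer pass.

    `before` and `after` map test name -> True if the test passed.
    Assumption: a test missing from `after` counts as a failure.
    """
    p2p = [name for name, passed in before.items() if passed]
    if not p2p:
        return 0.0
    failed = [name for name in p2p if not after.get(name, False)]
    return len(failed) / len(p2p)


# Hypothetical test outcomes around one milestone step.
before = {"test_parse": True, "test_render": True, "test_cache": True,
          "test_cli": True, "test_new_feature": False}
after = {"test_parse": True, "test_render": False, "test_cache": True,
         "test_cli": True, "test_new_feature": True}

print(pass_to_pass_failure_rate(before, after))  # 0.25
```

Here the step fixes the target test but breaks one of four previously passing tests, which is the kind of regression on earlier behavior that this metric is meant to expose.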
PersonaMem-Evo Breakdown (Table 7 & 8)
By question type:
- Temporal trajectory: +5.2% (46.6→51.7)
- Multi-pattern synthesis: +5.2% (38.8→44.0)
- Conflict resolution: -0.9% (29.5→28.6)
- Single-pattern transfer: -1.8% (46.2→44.4)
Evidence capture:
- Clause-level capture: 89.4→90.3% (+0.9%)
- Row-level capture: 72.5→74.9% (+2.4%)
- Largest gain on temporal trajectory (row-level +4.4%) and multi-pattern synthesis (row-level +3.5%)
Theoretical and Practical Implications
- Theoretical: Environment evolution requires version-aware state tracking – memory should not collapse to a single latest state. EvoMem formalizes memory as an evolving history of grounded updates, connecting each state to prior states, rationales, and evidence.
- Practical: EvoMem is lightweight (patch-based, no need to redesign base memory) and general (across terminal, software, conversational, skill agents). It improves both evolving and static benchmarks, suggesting that preserving update history is beneficial even when explicit evolution is not present.
- Chain-level gains: EvoMem is especially useful for sustained reliability across dependent task sequences, relevant for deployment scenarios requiring multi-step adaptation.
- Efficiency–accuracy trade-off: Longer trajectories and larger token budgets do not guarantee higher accuracy (e.g., GPT-5.5 achieves high accuracy but also has the highest token usage). Joint evaluation of accuracy and inference cost is essential.
Conclusion
This paper introduces EvoArena, a benchmark for evaluating LLM agents under persistent environment evolution across terminal workflows, software codebases, and social preferences. Current agents perform poorly (average 39.6% step accuracy), especially on chain-level evaluation (e.g., 10.0% on SWE-Chain-Evo). The proposed EvoMem patch-based memory paradigm records memory updates as structured, append-only patch histories, enabling agents to retrieve versioned evidence. EvoMem consistently improves performance across EvoArena (+1.5% step, +3.7% chain), GAIA (+6.5%), and LoCoMo (+3.3%). Mechanistic analysis shows EvoMem helps by operationalizing patch content (Terminal), reducing regressions (SWE), and preserving complete evolving evidence (PersonaMem). The work highlights the importance of treating memory as an evolving, inspectable history for reliable agent deployment in dynamic real-world environments.