Summary (Overview)
- Proposes AgenticSTS, a bounded-memory testbed for long-horizon LLM agents, where every decision prompt is freshly composed from five typed knowledge layers ((L_1)–(L_5)) rather than appending raw cross-decision transcripts.
- Instantiates the contract in Slay the Spire 2, a closed-rule stochastic deck-building game with a long horizon (~80 min, 67 strategic LLM calls per run). The task is hard (human win rate at A0: 16%; frontier LLMs in public benchmarks: 0% at A0) but unsaturated.
- Within a balanced fixed-A0 ablation ((N = 10) per cell), the largest observed win-rate separation is between no-scaffold (3/10) and scaffolded cells with (L_5) skills (6/10). This difference is directional (Fisher exact (p \approx 0.37)) rather than statistically significant; Wilson 95% CIs overlap.
- Releases a reusable archive of 298 completed trajectories with condition tags, SHA-anchored (L_4/L_5) snapshots, prompt records, and analysis scripts, enabling community study of memory contracts.
Introduction and Theoretical Foundation
Background and Motivation
Memory for a long-horizon LLM agent is not merely a store of text; it is a contract about what each future decision is allowed to see. Two dominant contracts exist:
- Accumulating transcript (e.g., ReAct, Reflexion [47, 28]): past observations, tool calls, and reflections are appended to each prompt, leading to unbounded context growth.
- Typed retrieval (e.g., MemGPT, structured memory systems [25, 7, 17]): prior experience is distilled into typed records and only relevant pieces are retrieved per decision.
The paper argues that this design choice determines what evidence the model sees, what stale information can re-enter decisions, and which components can be ablated. The goal is to make the memory interface bounded, inspectable, and ablatable rather than treating context growth as an implicit default.
Theoretical Basis
The contract formalizes memory as a per-decision composition function:
[ u_d = \text{compose}\big(L_1, L_2(s_d), L_3(s_d), L_4(s_d), L_5(s_d)\big) \tag{2} ]
where (u_d) is the user message for decision (d) at state (s_d), and each (L_i) is a typed knowledge layer with different mutability and role. This design enforces bounded context, enables layer-wise ablation, and supports reproducible evaluation.
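As a minimal sketch of Eq. (2), the composition can be written as a function that rebuilds the user message from scratch at every decision. Everything below is illustrative rather than the paper's implementation: the `Contract` class, the dict-based `State`, and the lambda retrievers are hypothetical names and shapes assumed for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types: the source only says that each layer is typed and that
# L2..L5 are functions of the current state s_d.
State = Dict[str, str]
Layer = Callable[[State], List[str]]


@dataclass
class Contract:
    l1: List[str]             # immutable operator prompts (state-independent)
    layers: Dict[str, Layer]  # "L2".."L5": state -> retrieved items
    enabled: Dict[str, bool]  # per-layer ablation switches

    def compose(self, state: State) -> str:
        """Build the user message u_d afresh for one decision."""
        parts = list(self.l1)
        for name in ("L2", "L3", "L4", "L5"):
            if self.enabled.get(name, True):
                parts.extend(self.layers[name](state))
        return "\n".join(parts)


# Illustrative usage with placeholder content and L4 ablated.
contract = Contract(
    l1=["ROLE: play Slay the Spire 2"],
    layers={
        "L2": lambda s: [f"SCHEMA: {s['screen']}"],
        "L3": lambda s: [f"KNOWLEDGE: {s['enemy']}"],
        "L4": lambda s: [f"EPISODE: past fights vs {s['enemy']}"],
        "L5": lambda s: [f"SKILL: tactic for {s['enemy']}"],
    },
    enabled={"L2": True, "L3": True, "L4": False, "L5": True},
)
u_d = contract.compose({"screen": "combat", "enemy": "example_enemy"})
```

Because `compose` reads only the layers and the current state, nothing from earlier prompts can re-enter a decision unless a layer explicitly retrieves it.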
The Slay the Spire 2 Testbed
The game is chosen for four properties:
- (P1) Closed, enumerable rule space: 576 cards, 293 relics, 115 monsters, etc., all text-readable.
- (P2) Long horizon: median ~80 min wall-clock, 67 strategic LLM calls per run.
- (P3) Multi-axis stochasticity: random draws, rewards, enemy placements preclude trajectory memorization.
- (P4) State-conditioned combat math: requires calculation from current state, not web-like recall.
The Ascension ladder ((A_0)–(A_{10})) provides an ordinal difficulty scale. A derived analysis score is defined by Eq. (1):
[ s = \begin{cases} 100, & \text{if victory}, \\ \text{floor} + \frac{52}{3} \cdot \text{bosses}, & \text{otherwise}, \end{cases} \tag{1} ]
where bosses counts cleared act bosses (0/1/2 for non-victory, 3 for victory). The goal is to make memory evaluation systematic rather than ad hoc.
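The score in Eq. (1) can be sketched directly as a small function. The function name, argument names, and the example inputs are assumptions made for illustration; only the formula itself comes from the text.

```python
def analysis_score(victory: bool, floor: int, bosses: int) -> float:
    """Derived analysis score from Eq. (1).

    A victory scores 100; otherwise the score is the floor reached plus
    52/3 points per cleared act boss (0, 1, or 2 for a non-victory run).
    """
    if victory:
        return 100.0
    if bosses not in (0, 1, 2):
        raise ValueError("a non-victory run clears 0, 1, or 2 act bosses")
    return floor + (52 / 3) * bosses


# Hypothetical runs, for illustration only.
won = analysis_score(victory=True, floor=48, bosses=3)
lost_mid = analysis_score(victory=False, floor=33, bosses=1)
```

Under this formula a non-victory run that reaches floor 48 with two bosses cleared scores about 82.7, so every defeat stays strictly below the victory score of 100.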
Methodology
The Bounded-Memory Contract (Architecture)
The agent uses typed retrieval to compose each decision prompt from five layers:
| Layer | Content | Mutability | Experimental role |
|---|---|---|---|
| (L_1) | Operator prompts (role, protocol) | Immutable | Fixed |
| (L_2) | State-typed schemas, legal action formats | Immutable | Fixed |
| (L_3) | Game knowledge (cards, relics, enemies) | Patch-refreshed | Filterable |
| (L_4) | Episodic summaries (postrun, per character × ascension × act × enemy class) | Writable postrun | Disable/freeze/update |
| (L_5) | Triggered strategic skills (scenario-class tactics with explicit triggers) | Writable postrun | Disable/freeze/update |
Boundedness: With capped top-(k) retrieval and item sizes, the configured prompt size is (O(|\text{sys}| + s_{\text{thread}} + \sum_i k_i \cdot s_i)), independent of run length. This contrasts with transcript-accumulating designs that have (\Omega(d \cdot \bar{s})) growth for (d) decisions.
Ablatability: Each layer can be toggled on/off independently, allowing attribution of performance differences to specific memory components.
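The boundedness argument can be checked numerically with a short sketch. All sizes below are hypothetical token counts chosen for illustration, and both function names are assumptions; the point is only that the first expression has no dependence on the decision index while the second grows linearly with it.

```python
from typing import List, Tuple


def bounded_prompt_size(sys_size: int, thread_size: int,
                        layers: List[Tuple[int, int]]) -> int:
    """Upper bound |sys| + s_thread + sum_i k_i * s_i.

    `layers` holds (k_i, s_i) pairs: the retrieval cap and the maximum
    item size for each retrieved layer.
    """
    return sys_size + thread_size + sum(k * s for k, s in layers)


def transcript_prompt_size(decision_index: int, mean_step_size: int,
                           sys_size: int = 0) -> int:
    """Size of an accumulating transcript after `decision_index` decisions."""
    return sys_size + decision_index * mean_step_size


# Hypothetical configuration: two retrieved layers with caps 3 and 2.
bound = bounded_prompt_size(1000, 500, [(3, 200), (2, 400)])
early = transcript_prompt_size(5, 300, sys_size=1000)
late = transcript_prompt_size(67, 300, sys_size=1000)
```

With these made-up numbers the bounded prompt stays at 2,900 tokens for every decision, while the transcript design grows from 2,500 tokens at decision 5 to 21,100 tokens at decision 67.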
Combat Truncation and Routing
- A dispatcher routes decisions to four model tiers: fast (trivial combat plans), strategic (ordinary decisions), analysis (postrun memory extraction), evolution (skill distillation).
- Combat uses a local conversation object with at most three messages per round; earlier rounds are summarized through the typed state.
- This yields a median of 67 strategic LLM calls per run instead of one call per in-game action.
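The four-tier routing described above can be sketched as a lookup table. The tier names come from the text; the decision-kind strings and the `route` helper are hypothetical labels introduced here, and the actual dispatcher logic is not specified in the source.

```python
from enum import Enum


class Tier(Enum):
    FAST = "fast"            # trivial combat plans
    STRATEGIC = "strategic"  # ordinary decisions
    ANALYSIS = "analysis"    # postrun memory extraction
    EVOLUTION = "evolution"  # skill distillation


# Hypothetical decision kinds mapped to the tiers named in the text.
ROUTES = {
    "trivial_combat_plan": Tier.FAST,
    "ordinary_decision": Tier.STRATEGIC,
    "postrun_memory_extraction": Tier.ANALYSIS,
    "skill_distillation": Tier.EVOLUTION,
}


def route(kind: str) -> Tier:
    """Return the model tier for a decision kind, failing loudly if unknown."""
    try:
        return ROUTES[kind]
    except KeyError:
        raise ValueError(f"unknown decision kind: {kind!r}") from None


tier = route("ordinary_decision")
```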
Skill Discovery
- Mistake-driven discovery (self-evolve): reads combat losses, runs A/B checks ((B=3) resample, strict 2/3 plus zero-harmful), and applies a four-level write gate (cosine, Jaccard, LLM judge, optional reap).
- Stub-template-filled authoring (Mode B): fills five character-parametric templates under namespace isolation and a library lock.
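One possible reading of the acceptance rule and the Jaccard level of the write gate is sketched below. This is an interpretation, not the paper's code: "strict 2/3 plus zero-harmful" is assumed to mean at least two of the three resamples are helpful and none is harmful, the outcome labels are hypothetical, and the 0.8 duplicate threshold is an invented placeholder.

```python
from typing import Iterable, List


def passes_ab_check(outcomes: List[str]) -> bool:
    """Accept a candidate skill from B=3 resampled A/B outcomes.

    Assumed rule: at least 2 of 3 outcomes are 'helpful' and none is
    'harmful'. Labels are 'helpful', 'neutral', or 'harmful'.
    """
    helpful = sum(o == "helpful" for o in outcomes)
    harmful = sum(o == "harmful" for o in outcomes)
    return len(outcomes) == 3 and helpful >= 2 and harmful == 0


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_near_duplicate(candidate: str, library: Iterable[str],
                      threshold: float = 0.8) -> bool:
    """Reject a candidate that overlaps too much with an existing skill.

    The threshold is a placeholder, not a value from the source.
    """
    return any(jaccard(candidate, item) >= threshold for item in library)


accepted = passes_ab_check(["helpful", "helpful", "neutral"])
rejected = passes_ab_check(["helpful", "helpful", "harmful"])
```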
Experimental Design
Three empirical questions guide the evaluation:
- Fixed-A0 ablation (5-condition decomposition):
  - baseline-strict: stripped prompt, no memory/skills/thread, strict knowledge filter.
  - prompt-only: full prompt helpers, no (L_4)/(L_5).
  - mode-a: human-authored (L_5) seeds.
  - mode-b-frozen: stub-template (L_5) bodies.
  - full-frozen: mode-a + frozen (L_4) store.
  - All postrun/evolution writes are disabled; frozen stores are pinned at SHA 1888a62.
- Cross-backbone probe: same frozen (L_4)+(L_5) stack (trained on Gemini 3.1 Pro) tested on Qwen 3.6-27B and DeepSeek V4 Pro at A0.
- Auto-mode ladder: after victory at (A_n), attempt (A_{n+1}); after defeat, retry (A_n). Measures climb endpoint.
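The auto-mode ladder rule can be sketched as a simple state update over a sequence of run outcomes. The function name and the outcome sequence are hypothetical, and treating the "climb endpoint" as the highest Ascension level attempted is an assumption about how the endpoint is measured.

```python
from typing import List, Tuple


def run_ladder(outcomes: List[bool], start: int = 0,
               max_level: int = 10) -> Tuple[List[int], int]:
    """Apply the ladder rule to a sequence of win/loss outcomes.

    After a victory at A_n the next run attempts A_{n+1} (capped at
    max_level); after a defeat the next run retries A_n.
    Returns the levels attempted and the highest level attempted.
    """
    level = start
    attempted = []
    for won in outcomes:
        attempted.append(level)
        if won and level < max_level:
            level += 1
    endpoint = max(attempted) if attempted else start
    return attempted, endpoint


# Hypothetical outcome stream: win, loss, win, win, loss.
levels, endpoint = run_ladder([True, False, True, True, False])
```

For this made-up stream the agent attempts A0, A1, A1, A2, A3, so the endpoint is A3 even though the final run is a defeat.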
Statistical Protocol
- Cell-level win rates: Wilson 95% confidence intervals [43].
- Continuous scores: 5,000-bootstrap 95% intervals [9].
- Pooled scaffolded row: exact Clopper–Pearson interval.
- Headline fixed-A0: first ten completed games per condition (balanced 50-game subset). Additional completed runs remain in diagnostic streams.
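For reference, the Wilson score interval used for cell-level win rates can be computed in a few lines. This is the standard textbook formula rather than the paper's analysis script; the function name is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 3/10 -> roughly (0.108, 0.603), matching the reported [10.8%, 60.3%].
low, high = wilson_interval(3, 10)
```

The interval is wide at N = 10, which is why adjacent cells in the ablation overlap even when their point estimates differ by several wins.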
Empirical Validation / Results
Public Difficulty Calibration
- AGI-Eval (May 2026): zero A0 victories across five frontier-model configurations [1].
- Mega Crit: player-side A0 win rate 16% across 240M community runs [23].
- Under the AgenticSTS harness, baseline-strict wins 3/10 (Wilson 95% CI [10.8%, 60.3%]; mean score 70.4). The task is hard but not saturated.
Within-Harness Ablation (Fixed A0)
Table 2: Fixed-A0 ablation ((N=10) per cell). Wilson 95% CIs (in %) are [11, 60], [17, 69], and [31, 83] for win counts of 3/10, 4/10, and 6/10, respectively.
| Cell | (L_5) | (L_4) | Win | Score |
|---|---|---|---|---|
| No scaffold | – | – | 3/10 | 70.4 |
| Prompt only | – | – | 4/10 | 69.6 |
| Hand skills A | A | – | 6/10 | 85.5 |
| Template skills B | B | – | 6/10 | 83.3 |
| Skills+episodes | A | ✓ | 6/10 | 82.1 |
The largest separation is between no-scaffold and skill-scaffolded rows. Using Eq. (3):
[ \Delta_{L_\ell} = \hat{p}_{\text{with-}\ell} - \hat{p}_{\text{without-}\ell}, \tag{3} ]
gives (\Delta_{\text{prompt}} = +1/10) (prompt-only 4/10 vs. no-scaffold 3/10) and (\Delta_{L_5} = +2/10) (skill cells 6/10 vs. prompt-only 4/10, at the same prompt setup). Fisher exact test: 3/10 vs. 6/10 gives (p \approx 0.37); pooling the three skill-scaffolded cells against the two cells without skills (18/30 vs. 7/20) gives (p \approx 0.148). Both results are directional, not statistically significant.
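The layer deltas and Fisher exact p-values quoted above can be reproduced with the standard library alone. This is a sketch of the usual two-sided test (summing all tables no more probable than the observed one), not the released analysis script; the function name is an assumption.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x successes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against float ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Layer deltas from Eq. (3), using the win counts in Table 2.
delta_prompt = Fraction(4, 10) - Fraction(3, 10)  # prompt-only vs. no scaffold
delta_l5 = Fraction(6, 10) - Fraction(4, 10)      # skills vs. prompt-only

# Rows are (wins, losses): no scaffold 3/10 vs. a skill cell 6/10.
p_cells = fisher_exact_two_sided(3, 7, 6, 4)      # ~0.37
# Pooled: skill cells 18/30 vs. cells without skills 7/20.
p_pooled = fisher_exact_two_sided(18, 12, 7, 13)  # ~0.148
```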
Auto-Mode Ascension Ladder
- Runs with postrun-active memory attempt (A_6)–(A_8); no-postrun streams stop at (A_2)–(A_4).
- The ladder complements the fixed-A0 matrix by measuring climb endpoint rather than fixed-difficulty reliability.
Cross-Backbone Transfer
Table 3: Cross-backbone transfer of Gemini-trained (L_4)+(L_5) stack ((N=5)/cell, Gemini (N=10)). Full-frozen 95% score CIs: [13.8,41.9]/[21.7,45.9]/[63.1,96.5] for Qwen/DeepSeek/Gemini. Qwen and DeepSeek wins = 0/5 in both columns; (\Delta%) is a score-only signal. † Gemini floor-48 endpoints include victories.
| Backbone | Baseline→Full-frozen | Wins (both) | Score (baseline→full) | (\Delta%) | Floor (baseline→full) |
|---|---|---|---|---|---|
| Qwen 3.6-27B | 0/5 → 0/5 | 0/5 | 14.6 → 26.9 | +84.5% | 17 → 33 |
| DeepSeek V4-Pro | 0/5 → 0/5 | 0/5 | 41.3 → 33.8 | –18.1% | 37 → 33 |
| Gemini 3.1-Pro | 3/10 → 6/10 | 3/10 → 6/10 | 70.4 → 82.1 | +16.6% | 48 → 48† |
The Gemini-trained stack lifts Qwen’s mean score but reduces DeepSeek’s; transfer is an empirical property, not a premise. Wins remain 0/5 for Qwen and DeepSeek.
Comparison with Open-Source Accumulating-Context Agents
- Competitors: STS2MCP [11] and CharTyr [6] – both use a single accumulating chat transcript re-sent on every decision.
- Effect: Both competitors win 0/5 at A0; AgenticSTS full-frozen wins 6/10, even baseline-strict wins 3/10.
- Speed: 9.9/8.5 minutes per floor (competitors) vs. 2.3 minutes (AgenticSTS).
- Cost: the competitors consume 66–90× more fresh (non-cached) LLM tokens per score point than AgenticSTS.
- Prompt growth: the competitors' per-call prompt grows from ~9k to 500k tokens within a single run, whereas the bounded contract holds the strategic user message flat at a median of ~5k tokens.
- This is a system-level comparison, not a controlled ablation of the memory contract, because the systems also differ in game patch, routing, thinking effort, and decision batching.
Theoretical and Practical Implications
Making Memory an Object of Evaluation
- Separating memory into typed slots makes attribution tractable: gains can be traced to a specific layer (e.g., (L_5) skills) rather than to "more context."
- The bounded contract **decouples interface design from