# AgentSPEX: An Agent SPecification and EXecution Language

> AgentSPEX is a declarative YAML language that explicitly specifies LLM-agent workflows with control flow and modularity, outperforming CoT and ReAct on diverse benchmarks.

- **Source:** [arXiv](https://arxiv.org/abs/2604.13346)
- **Published:** 2026-04-23
- **Permalink:** https://picx.dev/p/2oOdWI
- **Whiteboard:** https://picx.dev/p/2oOdWI/image

## Summary

# AgentSPEX: An Agent SPecification and EXecution Language

## Summary (Overview)
*   **Introduces AgentSPEX**: A declarative language (YAML-based) for specifying LLM-agent workflows with explicit control flow, modular structure, and an accompanying agent harness for execution.
*   **Addresses Limitations of Existing Paradigms**: Moves beyond reactive prompting (implicit control) and Python-based orchestration frameworks (tight coupling, steep learning curve) to improve controllability, reproducibility, and accessibility.
*   **Core Features**: Includes typed steps (`task`, `step`), control flow (`if`, `while`, `for_each`), parallel execution, reusable submodules, explicit context management via variables, a visual editor, and a durable harness with checkpointing.
*   **Empirical Validation**: Outperforms Chain-of-Thought (CoT) and ReAct baselines on 7 diverse benchmarks (science, math, writing, paper understanding, software engineering), achieving state-of-the-art or competitive results.
*   **User Study Findings**: Perceived as more interpretable and accessible for authoring workflows from scratch compared to a framework like LangGraph, though the latter is seen as more suitable for highly complex workflows.

## Introduction and Theoretical Foundation
The rapid advancement of AI agents for complex tasks (e.g., resolving GitHub issues, scientific research) has led to a rich ecosystem of development frameworks. Two dominant paradigms exist:
1.  **Reactive Prompting (e.g., ReAct)**: A single instruction guides the model through an open-ended sequence of reasoning and tool calls. Control flow and intermediate state are implicit in the conversation history, leading to potential issues with performance, cost, reproducibility, and controllability on long-horizon tasks.
2.  **Orchestration Frameworks (e.g., LangGraph, DSPy, CrewAI)**: Impose structure through explicit workflow definitions but tightly couple the logic with Python code. This creates steep learning curves and makes agents difficult to maintain, modify, and share with non-programmers.

**AgentSPEX** is introduced to bridge this gap. Its design philosophy is guided by two principles:
1.  **Expressiveness**: Capture common agent invocation patterns (branching, loops, composition) without requiring modifications to execution source code.
2.  **Accessibility**: Remain simple enough for users to author, inspect, and modify agent behavior with minimal overhead.

The framework's theoretical contribution lies in making **control flow, composition, and context management explicit and declarative**, shifting these responsibilities from the LLM's implicit reasoning to a structured, user-controlled specification.

## Methodology
AgentSPEX consists of a **specification language** and an **execution harness**.

### 1. AgentSPEX Language
Workflows are specified in declarative, human-readable YAML files. A workflow has a common structure:
```yaml
name: "research_assistant"
goal: "Research a topic and write a summary"
config:
  model: "gpt-5.4"
  enabled_tools: ["web_search", "file_write"]
parameters:
  topic: "Enhancing LLM reasoning via RLHF"
  file_path: "outputs/report.md"
workflow:
  - task:
      instruction: "Generate a list of search queries for {{ topic }}"
      save_as: "search_queries"
  - call:
      module: "modules/search_and_summarize.yaml"
      parameters:
        queries: "{{ search_queries }}"
      save_as: "paper_summary"
  - task:
      instruction: "Write a report at {{ file_path }} based on these findings: {{ paper_summary }}"
```

**Core Language Constructs** (from Table 1):
| Construct | Category | Description |
| :--- | :--- | :--- |
| `task` | Invocation | Start a new conversation |
| `step` | Invocation | Continue a persistent conversation |
| `if` / `switch` | Control flow | Conditional branching |
| `while` | Control flow | Loop with configurable iteration limit |
| `for_each` | Control flow | Iterate over a list |
| `call` | Composition | Invoke another workflow as a sub-module |
| `parallel` / `gather` | Concurrency | Execute operations concurrently |
| `set_variable` | State | Assign a value to a context variable |
| `increment` | State | Increment a numeric variable |
| `input` | State | Prompt the user for input |
| `return` | State | Return a value to the calling workflow |

**State Management & Composition**:
*   **Context Variables**: Steps reference variables using Mustache-style templates (`{{variable}}`) and save outputs via `save_as`.
*   **`task` vs. `step`**: `task` starts a fresh conversation; `step` accumulates history across turns, giving authors direct control over information flow.
*   **Unified Composition**: Any workflow can invoke another as a submodule via `call`. Workflows can also be registered as tools for dynamic invocation.

### 2. Visual Editor
A bidirectional visual editor provides synchronized graph-based and YAML-based views (Figure 3). Users can edit via drag-and-drop or direct text modification.

### 3. Agent Harness
The harness executes the specification:
*   **Interpreter**: Validates workflow, resolves parameters, expands templates, and dispatches operations with hierarchical step IDs (e.g., `3.2.1`).
*   **Executor**: Runs the multi-turn LLM-tool interaction loop, terminating on final response or limits. Uses a Model Context Protocol (MCP) client for tool execution.
*   **Execution Environment**: Docker-based sandbox with isolated access to 50+ tools (file ops, web search, code execution, browser automation).
*   **Observability Dashboard**: Live logs of agent actions and reasoning steps.
*   **Durability System**:
    *   **Checkpointing**: Saves state after each step (context, metrics, sandbox). Enables resume from interruptions.
    *   **Execution Tracing & Selective Replay**: Records full trace. Allows replay from a prior trace to isolate the effect of prompt/flow changes.
*   **Formal Verification Potential**: Explicit control flow and variable dependencies enable the definition of pre-/post-conditions for steps, allowing verification using formal languages (Lean, Isabelle).

## Empirical Validation / Results
The framework is demonstrated with three ready-to-use agents and evaluated on 7 benchmarks.

### Agent Demos
1.  **Deep Research**: Takes a query, implements a multi-level (breadth/depth) search strategy, and generates a comprehensive Markdown report.
2.  **AI Scientist**: Two-stage pipeline (Thinker/Writer) that generates a novel academic research proposal, including safety checks, related work retrieval, and parallel citation insertion.
3.  **AI Advisor**: Takes a research proposal/paper and produces a rubric-based review with actionable feedback.

### Benchmark Evaluation
**Table 2: Evaluation results on seven different benchmarks.**
| Agent | Model | Domain | Score |
| :--- | :--- | :--- | :--- |
| **SciBench** (Wang et al., 2024a) | | | |
| CoT | GPT-5 | Science | 85.92% |
| ReAct | GPT-5 | Science | 87.79% |
| **AgentSPEX (Ours)** | **GPT-5** | **Science** | **90.61%** |
| **StemEZ** (Wang et al., 2024b) | | | |
| CoT | GPT-5 | Science | 82.87% |
| ReAct | GPT-5 | Science | 84.72% |
| **AgentSPEX (Ours)** | **GPT-5** | **Science** | **86.57%** |
| **ChemBench** (Mirza et al., 2025) | | | |
| CoT | GPT-5* | Science | 78.90% |
| ReAct | GPT-5* | Science | 77.80% |
| **AgentSPEX (Ours)** | **GPT-5*** | **Science** | **83.30%** |
| **AIME 2025** (Art of Problem Solving, 2026) | | | |
| CoT (OpenAI, 2025a) | GPT-5 (without tools) | Mathematics | 94.60% |
| CoT (OpenAI, 2025a) | GPT-5 (with Python) | Mathematics | 99.60% |
| **AgentSPEX (Ours)** | **GPT-5** | **Mathematics** | **100.0%** |
| **ELAIPBench** (Dai et al., 2026) | | | |
| CoT | GPT-5* | Paper Understanding | 37.22% |
| ReAct | GPT-5* | Paper Understanding | 33.80% |
| **AgentSPEX (Ours)** | **GPT-5*** | **Paper Understanding** | **43.70%** |
| **WritingBench** (Wu et al., 2025) | | | |
| CoT | Claude-Sonnet-4.5-Thinking | Writing | 79.90% |
| ReAct | Claude-Sonnet-4.5-Thinking | Writing | 80.30% |
| **AgentSPEX (Ours)** | **Claude-Sonnet-4.5-Thinking** | **Writing** | **81.00%** |
| **SWE-Bench Verified** (Jimenez et al., housed) | | | |
| mini-SWE-agent (Yang et al., 2024) | Claude-Opus-4.5*/4.6* | Software Engineering | 76.20% |
| Live-SWE-agent (Xia et al., 2025) | Claude-Opus-4.5*/4.6* | Software Engineering | 74.60% |
| **AgentSPEX (Ours)** | **Claude-Opus-4.5*/4.6*** | **Software Engineering** | **77.10%** |

*Denotes use of high-reasoning effort.

**Key Findings**:
*   AgentSPEX achieves the highest score on all 7 benchmarks.
*   **Significant Gains**: +2.8% (SciBench), +5.5% (ChemBench), +6.5% (ELAIPBench) over the stronger baseline. Perfect score on AIME 2025.
*   **Pattern Analysis**: The ReAct baseline (same workflow in prompt, no enforcement) sometimes underperforms CoT (e.g., -3.4% on ELAIPBench). This suggests that offloading control flow logic to the interpreter (AgentSPEX) alleviates the model's burden of simultaneously interpreting structure and reasoning.
*   **Larger improvements** are seen on benchmarks requiring processing of substantial input or multi-step coordination (ChemBench, ELAIPBench), likely benefiting from explicit context management that prevents context degradation.
*   **Model-Robustness**: On SWE-Bench Verified, AgentSPEX shows minimal performance drop (-0.2%) when upgrading from Claude-Opus-4.5 to 4.6, compared to larger drops for other agents (-1.2% to -6.8%).

### User Study
A study with 23 participants compared AgentSPEX and LangGraph workflows implementing the same behavior.

**Qualitative Results**:
*   **AgentSPEX** was favored for **readability**, **clarity of prompting**, and **ease of starting a new workflow from scratch**. Described as "accessible to non-coders" and "easier to understand."
*   **LangGraph** was preferred for constructing **complex, multi-step workflows**. Described as "customizable" and "more rigorous."
This suggests AgentSPEX is perceived as more approachable, but its ability to handle complexity was initially less apparent (addressed by the provided demos).

## Theoretical and Practical Implications
**Theoretical Implications**:
*   **Declarative Agent Specification**: Proposes a shift from implicit, model-managed control flow to explicit, user-specified workflows, enabling formal reasoning about agent behavior.
*   **Context Management**: Provides a structured mechanism to combat "context rot" and performance degradation in long-horizon tasks by explicitly controlling the information each step receives.
*   **Verification**: The explicit structure opens the door for formal verification of agent plans and execution trajectories, a step towards more reliable and verifiable agentic systems.

**Practical Implications**:
*   **Accessibility**: Lowers the barrier to entry for agent development, enabling domain experts and non-programmers to author and modify workflows via YAML and a visual editor.
*   **Maintainability & Reproducibility**: Self-contained YAML files are easy to version-control, diff, and share. Explicit steps enhance reproducibility.
*   **Production Readiness**: The durable harness (checkpointing, tracing, replay) supports robust, long-running workflows. The framework supports complex, production-ready agents (as demonstrated).
*   **Performance**: Enforces efficient execution patterns that can outperform both unstructured (ReAct) and single-prompt (CoT) approaches, especially on longer, more structured tasks.

**Framework Comparison** (from Table 3):
| Approach | Natural Language | Explicit Context | Visual Editor |
| :--- | :--- | :--- | :--- |
| AutoGen (Wu et al., 2023) | ✗ | ✗ | ✗ |
| DSPy (Khattab et al., 2024) | ✗ | ✗ | ✗ |
| CrewAI (CrewAI, 2026) | Partial | ✗ | ✗ |
| LangGraph w/ LangFlow (Langflow AI, 2026) | ✗ | ✗ | ✓ |
| n8n (n8n-io, 2026) | ✗ | ✗ | ✓ |
| ADL (Zeng & Yan, 2025) | ✓ | ✗ | ✗ |
| PDL (Vaziri et al., 2024) | ✓ | Partial | ✗ |
| **AgentSPEX (Ours)** | **✓** | **✓** | **✓** |

## Conclusion
AgentSPEX introduces a structured, declarative language and harness for building LLM agents that improves upon the limitations of both reactive prompting and Python-based orchestration frameworks. Its key contributions are:
1.  **Expressive & Accessible Specification**: YAML-based workflows with explicit control flow, state, and composition.
2.  **Robust Execution Harness**: Featuring sandboxed tools, observability, and durability mechanisms (checkpointing, tracing).
3.  **Visual Development**: A bidirectional editor for drag-and-drop authoring.
4.  **Empirical Effectiveness**: Demonstrated superior performance across diverse benchmarks and provided ready-to-use complex agents.
5.  **User-Validated Usability**: Perceived as more interpretable and accessible for workflow authoring.

**Future Work** includes advancing formal verification, training models to automatically write/use workflows, incorporating end-to-end agentic training pipelines, and enhancing support for multi-agent orchestration and long-context reasoning.

---

_Markdown view of https://picx.dev/p/2oOdWI, served by PicX — AI-generated visual whiteboard summaries of research papers._
