AgentSPEX: An Agent SPecification and EXecution Language
Summary (Overview)
- Introduces AgentSPEX: A declarative language (YAML-based) for specifying LLM-agent workflows with explicit control flow, modular structure, and an accompanying agent harness for execution.
- Addresses Limitations of Existing Paradigms: Moves beyond reactive prompting (implicit control) and Python-based orchestration frameworks (tight coupling, steep learning curve) to improve controllability, reproducibility, and accessibility.
- Core Features: Includes typed steps (`task`, `step`), control flow (`if`, `while`, `for_each`), parallel execution, reusable submodules, explicit context management via variables, a visual editor, and a durable harness with checkpointing.
- Empirical Validation: Outperforms Chain-of-Thought (CoT) and ReAct baselines on seven diverse benchmarks (science, math, writing, paper understanding, software engineering), achieving state-of-the-art or competitive results.
- User Study Findings: Perceived as more interpretable and accessible for authoring workflows from scratch compared to a framework like LangGraph, though the latter is seen as more suitable for highly complex workflows.
Introduction and Theoretical Foundation
The rapid advancement of AI agents for complex tasks (e.g., resolving GitHub issues, scientific research) has led to a rich ecosystem of development frameworks. Two dominant paradigms exist:
- Reactive Prompting (e.g., ReAct): A single instruction guides the model through an open-ended sequence of reasoning and tool calls. Control flow and intermediate state are implicit in the conversation history, leading to potential issues with performance, cost, reproducibility, and controllability on long-horizon tasks.
- Orchestration Frameworks (e.g., LangGraph, DSPy, CrewAI): Impose structure through explicit workflow definitions but tightly couple the logic with Python code. This creates steep learning curves and makes agents difficult to maintain, modify, and share with non-programmers.
AgentSPEX is introduced to bridge this gap. Its design philosophy is guided by two principles:
- Expressiveness: Capture common agent invocation patterns (branching, loops, composition) without requiring modifications to execution source code.
- Accessibility: Remain simple enough for users to author, inspect, and modify agent behavior with minimal overhead.
The framework's theoretical contribution lies in making control flow, composition, and context management explicit and declarative, shifting these responsibilities from the LLM's implicit reasoning to a structured, user-controlled specification.
Methodology
AgentSPEX consists of a specification language and an execution harness.
1. AgentSPEX Language
Workflows are specified in declarative, human-readable YAML files. A workflow has a common structure:
name: "research_assistant"
goal: "Research a topic and write a summary"
config:
model: "gpt-5.4"
enabled_tools: ["web_search", "file_write"]
parameters:
topic: "Enhancing LLM reasoning via RLHF"
file_path: "outputs/report.md"
workflow:
- task:
instruction: "Generate a list of search queries for {{ topic }}"
save_as: "search_queries"
- call:
module: "modules/search_and_summarize.yaml"
parameters:
queries: "{{ search_queries }}"
save_as: "paper_summary"
- task:
instruction: "Write a report at {{ file_path }} based on these findings: {{ paper_summary }}"
Core Language Constructs (from Table 1):
| Construct | Category | Description |
|---|---|---|
| `task` | Invocation | Start a new conversation |
| `step` | Invocation | Continue a persistent conversation |
| `if` / `switch` | Control flow | Conditional branching |
| `while` | Control flow | Loop with a configurable iteration limit |
| `for_each` | Control flow | Iterate over a list |
| `call` | Composition | Invoke another workflow as a sub-module |
| `parallel` / `gather` | Concurrency | Execute operations concurrently |
| `set_variable` | State | Assign a value to a context variable |
| `increment` | State | Increment a numeric variable |
| `input` | State | Prompt the user for input |
| `return` | State | Return a value to the calling workflow |
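To show how these constructs compose, here is a minimal sketch of a bounded critique-and-revise loop. Only the construct names come from Table 1; the condition syntax, the `max_iterations` field, and the step wording are assumptions for illustration.

```yaml
workflow:
  - task:
      instruction: "Draft an answer to {{ question }}"
      save_as: "draft"

  - set_variable:
      name: "attempts"
      value: 0

  # `while` loops carry a configurable iteration limit (Table 1);
  # the exact field name and condition syntax are assumed here.
  - while:
      condition: "{{ attempts }} < 3"
      max_iterations: 3
      body:
        - task:
            instruction: "Critique this draft: {{ draft }}"
            save_as: "critique"
        - task:
            instruction: "Revise the draft {{ draft }} to address: {{ critique }}"
            save_as: "draft"
        - increment:
            name: "attempts"
```

Because both the loop bound and the condition live in the specification, the interpreter, not the model, decides when iteration stops.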
State Management & Composition:
- Context Variables: Steps reference variables using Mustache-style templates (`{{ variable }}`) and save outputs via `save_as`.
- `task` vs. `step`: `task` starts a fresh conversation; `step` accumulates history across turns, giving authors direct control over information flow.
- Unified Composition: Any workflow can invoke another as a submodule via `call` (a sketch of such a submodule follows this list). Workflows can also be registered as tools for dynamic invocation.
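As an illustration of the `call`/`return` pair, the submodule invoked in the earlier example might look roughly like this. The internal steps, and the assumption that `save_as` accumulates across `for_each` iterations, are ours, not the paper's.

```yaml
# modules/search_and_summarize.yaml -- hypothetical body for the submodule
# invoked via `call` in the main example.
name: "search_and_summarize"

parameters:
  queries: []   # bound by the caller's `parameters:` block

workflow:
  - for_each:
      items: "{{ queries }}"
      as: "query"
      body:
        - task:
            instruction: "Search the web for {{ query }} and summarize the top results"
            save_as: "summaries"   # accumulation across iterations is assumed

  - task:
      instruction: "Merge these summaries into a single briefing: {{ summaries }}"
      save_as: "briefing"

  # Hands the value back to the caller, which stores it under its own save_as.
  - return:
      value: "{{ briefing }}"
```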
2. Visual Editor
A bidirectional visual editor provides synchronized graph-based and YAML-based views (Figure 3). Users can edit via drag-and-drop or direct text modification.
3. Agent Harness
The harness executes the specification:
- Interpreter: Validates the workflow, resolves parameters, expands templates, and dispatches operations with hierarchical step IDs (e.g., `3.2.1`).
- Executor: Runs the multi-turn LLM-tool interaction loop, terminating on a final response or on configured limits. Uses a Model Context Protocol (MCP) client for tool execution.
- Execution Environment: Docker-based sandbox with isolated access to 50+ tools (file ops, web search, code execution, browser automation).
- Observability Dashboard: Live logs of agent actions and reasoning steps.
- Durability System:
- Checkpointing: Saves state after each step (context, metrics, sandbox). Enables resume from interruptions.
- Execution Tracing & Selective Replay: Records full trace. Allows replay from a prior trace to isolate the effect of prompt/flow changes.
- Formal Verification Potential: Explicit control flow and variable dependencies enable the definition of pre-/post-conditions for steps, allowing verification using formal languages (Lean, Isabelle); a schematic reading follows below.
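On this reading, a step with explicit pre- and post-conditions over the variable context is a Hoare triple (our notation, not the paper's):

$$\{\,P(\mathit{ctx})\,\}\ \mathit{step}_i\ \{\,Q(\mathit{ctx}')\,\}$$

For the running example, $P$ might assert that `search_queries` is a non-empty list and $Q$ that `paper_summary` is bound in the updated context; a prover such as Lean or Isabelle would then discharge these obligations against the declared control flow.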
Empirical Validation / Results
The framework is demonstrated with three ready-to-use agents and evaluated on seven benchmarks.
Agent Demos
- Deep Research: Takes a query, implements a multi-level (breadth/depth) search strategy, and generates a comprehensive Markdown report (a sketch of this pattern follows the list).
- AI Scientist: Two-stage pipeline (Thinker/Writer) that generates a novel academic research proposal, including safety checks, related work retrieval, and parallel citation insertion.
- AI Advisor: Takes a research proposal/paper and produces a rubric-based review with actionable feedback.
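A rough sketch of how the Deep Research breadth/depth pattern could be expressed in AgentSPEX. The module name, the `depth` parameter, the condition syntax, and the template arithmetic are all assumptions for illustration, not the shipped agent's source:

```yaml
# Hypothetical skeleton of a recursive breadth/depth research loop.
name: "deep_research"

parameters:
  query: "..."
  depth: 2          # remaining recursion levels (assumed parameter)

workflow:
  - task:
      instruction: "Generate breadth-first sub-queries for {{ query }}"
      save_as: "sub_queries"

  - if:
      condition: "{{ depth }} > 0"        # condition syntax assumed
      then:
        - for_each:
            items: "{{ sub_queries }}"
            as: "q"
            body:
              # Recurse one level deeper via composition.
              - call:
                  module: "modules/deep_research.yaml"
                  parameters:
                    query: "{{ q }}"
                    depth: "{{ depth - 1 }}"   # template arithmetic assumed
                  save_as: "findings"

  - task:
      instruction: "Write a comprehensive Markdown report from {{ findings }}"
```

Breadth comes from the fan-out over `sub_queries`; depth comes from the recursive `call` with a decremented bound.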
Benchmark Evaluation
Table 2: Evaluation results on seven different benchmarks.
| Agent | Model | Domain | Score |
|---|---|---|---|
| SciBench (Wang et al., 2024a) | | | |
| CoT | GPT-5 | Science | 85.92% |
| ReAct | GPT-5 | Science | 87.79% |
| AgentSPEX (Ours) | GPT-5 | Science | 90.61% |
| StemEZ (Wang et al., 2024b) | | | |
| CoT | GPT-5 | Science | 82.87% |
| ReAct | GPT-5 | Science | 84.72% |
| AgentSPEX (Ours) | GPT-5 | Science | 86.57% |
| ChemBench (Mirza et al., 2025) | | | |
| CoT | GPT-5* | Science | 78.90% |
| ReAct | GPT-5* | Science | 77.80% |
| AgentSPEX (Ours) | GPT-5* | Science | 83.30% |
| AIME 2025 (Art of Problem Solving, 2026) | | | |
| CoT (OpenAI, 2025a) | GPT-5 (without tools) | Mathematics | 94.60% |
| CoT (OpenAI, 2025a) | GPT-5 (with Python) | Mathematics | 99.60% |
| AgentSPEX (Ours) | GPT-5 | Mathematics | 100.0% |
| ELAIPBench (Dai et al., 2026) | | | |
| CoT | GPT-5* | Paper Understanding | 37.22% |
| ReAct | GPT-5* | Paper Understanding | 33.80% |
| AgentSPEX (Ours) | GPT-5* | Paper Understanding | 43.70% |
| WritingBench (Wu et al., 2025) | | | |
| CoT | Claude-Sonnet-4.5-Thinking | Writing | 79.90% |
| ReAct | Claude-Sonnet-4.5-Thinking | Writing | 80.30% |
| AgentSPEX (Ours) | Claude-Sonnet-4.5-Thinking | Writing | 81.00% |
| SWE-Bench Verified (Jimenez et al., 2024) | | | |
| mini-SWE-agent (Yang et al., 2024) | Claude-Opus-4.5*/4.6* | Software Engineering | 76.20% |
| Live-SWE-agent (Xia et al., 2025) | Claude-Opus-4.5*/4.6* | Software Engineering | 74.60% |
| AgentSPEX (Ours) | Claude-Opus-4.5/4.6** | Software Engineering | 77.10% |
*Denotes use of high-reasoning effort.
Key Findings:
- AgentSPEX achieves the highest score on all seven benchmarks.
- Significant Gains: +2.8% on SciBench and +6.5% on ELAIPBench over the stronger baseline; on ChemBench, +4.4% over CoT and +5.5% over ReAct. A perfect score on AIME 2025.
- Pattern Analysis: The ReAct baseline (same workflow in prompt, no enforcement) sometimes underperforms CoT (e.g., -3.4% on ELAIPBench). This suggests that offloading control flow logic to the interpreter (AgentSPEX) alleviates the model's burden of simultaneously interpreting structure and reasoning.
- Larger improvements are seen on benchmarks requiring processing of substantial input or multi-step coordination (ChemBench, ELAIPBench), likely benefiting from explicit context management that prevents context degradation.
- Model-Robustness: On SWE-Bench Verified, AgentSPEX shows minimal performance drop (-0.2%) when upgrading from Claude-Opus-4.5 to 4.6, compared to larger drops for other agents (-1.2% to -6.8%).
User Study
A study with 23 participants compared AgentSPEX and LangGraph workflows implementing the same behavior.
Qualitative Results:
- AgentSPEX was favored for readability, clarity of prompting, and ease of starting a new workflow from scratch. Described as "accessible to non-coders" and "easier to understand."
- LangGraph was preferred for constructing complex, multi-step workflows. Described as "customizable" and "more rigorous." This suggests AgentSPEX is perceived as more approachable, but its ability to handle complexity was initially less apparent (addressed by the provided demos).
Theoretical and Practical Implications
Theoretical Implications:
- Declarative Agent Specification: Proposes a shift from implicit, model-managed control flow to explicit, user-specified workflows, enabling formal reasoning about agent behavior.
- Context Management: Provides a structured mechanism to combat "context rot" and performance degradation in long-horizon tasks by explicitly controlling the information each step receives.
- Verification: The explicit structure opens the door for formal verification of agent plans and execution trajectories, a step towards more reliable and verifiable agentic systems.
Practical Implications:
- Accessibility: Lowers the barrier to entry for agent development, enabling domain experts and non-programmers to author and modify workflows via YAML and a visual editor.
- Maintainability & Reproducibility: Self-contained YAML files are easy to version-control, diff, and share. Explicit steps enhance reproducibility.
- Production Readiness: The durable harness (checkpointing, tracing, replay) supports robust, long-running workflows. The framework supports complex, production-ready agents (as demonstrated).
- Performance: Enforces efficient execution patterns that can outperform both unstructured (ReAct) and single-prompt (CoT) approaches, especially on longer, more structured tasks.
Framework Comparison (from Table 3):
| Approach | Natural-Language Spec | Explicit Context Management | Visual Editor |
|---|---|---|---|
| AutoGen (Wu et al., 2023) | ✗ | ✗ | ✗ |
| DSPy (Khattab et al., 2024) | ✗ | ✗ | ✗ |
| CrewAI (CrewAI, 2026) | Partial | ✗ | ✗ |
| LangGraph w/ LangFlow (Langflow AI, 2026) | ✗ | ✗ | ✓ |
| n8n (n8n-io, 2026) | ✗ | ✗ | ✓ |
| ADL (Zeng & Yan, 2025) | ✓ | ✗ | ✗ |
| PDL (Vaziri et al., 2024) | ✓ | Partial | ✗ |
| AgentSPEX (Ours) | ✓ | ✓ | ✓ |
Conclusion
AgentSPEX introduces a structured, declarative language and harness for building LLM agents that improves upon the limitations of both reactive prompting and Python-based orchestration frameworks. Its key contributions are:
- Expressive & Accessible Specification: YAML-based workflows with explicit control flow, state, and composition.
- Robust Execution Harness: Featuring sandboxed tools, observability, and durability mechanisms (checkpointing, tracing).
- Visual Development: A bidirectional editor for drag-and-drop authoring.
- Empirical Effectiveness: Demonstrated superior performance across diverse benchmarks and provided ready-to-use complex agents.
- User-Validated Usability: Perceived as more interpretable and accessible for workflow authoring.
Future Work includes advancing formal verification, training models to automatically write/use workflows, incorporating end-to-end agentic training pipelines, and enhancing support for multi-agent orchestration and long-context reasoning.