AgentSPEX: An Agent SPecification and EXecution Language
Summary (Overview)
- Introduces AgentSPEX: A declarative language (YAML-based) for specifying LLM-agent workflows with explicit control flow, modular structure, and an accompanying agent harness for execution.
- Addresses Limitations of Existing Paradigms: Moves beyond reactive prompting (implicit control) and Python-based orchestration frameworks (tight coupling, steep learning curve) to improve controllability, reproducibility, and accessibility.
- Core Features: Includes typed steps (`task`, `step`), control flow (`if`, `while`, `for_each`), parallel execution, reusable submodules, explicit context management via variables, a visual editor, and a durable harness with checkpointing.
- Empirical Validation: Outperforms Chain-of-Thought (CoT) and ReAct baselines on seven diverse benchmarks (science, math, writing, paper understanding, software engineering), achieving state-of-the-art or competitive results.
- User Study Findings: Perceived as more interpretable and accessible for authoring workflows from scratch compared to a framework like LangGraph, though the latter is seen as more suitable for highly complex workflows.
Introduction and Theoretical Foundation
The rapid advancement of AI agents for complex tasks (e.g., resolving GitHub issues, scientific research) has led to a rich ecosystem of development frameworks. Two dominant paradigms exist:
- Reactive Prompting (e.g., ReAct): A single instruction guides the model through an open-ended sequence of reasoning and tool calls. Control flow and intermediate state are implicit in the conversation history, leading to potential issues with performance, cost, reproducibility, and controllability on long-horizon tasks.
- Orchestration Frameworks (e.g., LangGraph, DSPy, CrewAI): Impose structure through explicit workflow definitions but tightly couple the logic with Python code. This creates steep learning curves and makes agents difficult to maintain, modify, and share with non-programmers.
AgentSPEX is introduced to bridge this gap. Its design philosophy is guided by two principles:
- Expressiveness: Capture common agent invocation patterns (branching, loops, composition) without requiring modifications to execution source code.
- Accessibility: Remain simple enough for users to author, inspect, and modify agent behavior with minimal overhead.
The framework's theoretical contribution lies in making control flow, composition, and context management explicit and declarative, shifting these responsibilities from the LLM's implicit reasoning to a structured, user-controlled specification.
Methodology
AgentSPEX consists of a specification language and an execution harness.
1. AgentSPEX Language
Workflows are specified in declarative, human-readable YAML files. A workflow has a common structure:
name: "research_assistant"
goal: "Research a topic and write a summary"
config:
model: "gpt-5.4"
enabled_tools: ["web_search", "file_write"]
parameters:
topic: "Enhancing LLM reasoning via RLHF"
file_path: "outputs/report.md"
workflow:
- task:
instruction: "Generate a list of search queries for {{ topic }}"
save_as: "search_queries"
- call:
module: "modules/search_and_summarize.yaml"
parameters:
queries: "{{ search_queries }}"
save_as: "paper_summary"
- task:
instruction: "Write a report at {{ file_path }} based on these findings: {{ paper_summary }}"
Core Language Constructs (from Table 1):
| Construct | Category | Description |
|---|---|---|
| `task` | Invocation | Start a new conversation |
| `step` | Invocation | Continue a persistent conversation |
| `if` / `switch` | Control flow | Conditional branching |
| `while` | Control flow | Loop with a configurable iteration limit |
| `for_each` | Control flow | Iterate over a list |
| `call` | Composition | Invoke another workflow as a sub-module |
| `parallel` / `gather` | Concurrency | Execute operations concurrently |
| `set_variable` | State | Assign a value to a context variable |
| `increment` | State | Increment a numeric variable |
| `input` | State | Prompt the user for input |
| `return` | State | Return a value to the calling workflow |
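To show how these constructs compose, here is a minimal sketch of a bounded critique-and-revise loop. Only the construct names come from Table 1; the condition syntax, the `max_iterations` field, and the step wording are assumptions for illustration.

```yaml
workflow:
  - task:
      instruction: "Draft an answer to {{ question }}"
      save_as: "draft"

  - set_variable:
      name: "attempts"
      value: 0

  # `while` loops carry a configurable iteration limit (Table 1);
  # the exact field name and condition syntax are assumed here.
  - while:
      condition: "{{ attempts }} < 3"
      max_iterations: 3
      body:
        - task:
            instruction: "Critique this draft: {{ draft }}"
            save_as: "critique"
        - task:
            instruction: "Revise the draft {{ draft }} to address: {{ critique }}"
            save_as: "draft"
        - increment:
            name: "attempts"
```

Because both the loop bound and the condition live in the specification, the interpreter, not the model, decides when iteration stops.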
State Management & Composition:
- Context Variables: Steps reference variables using Mustache-style templates (`{{ variable }}`) and save outputs via `save_as`.
- `task` vs. `step`: `task` starts a fresh conversation; `step` accumulates history across turns, giving authors direct control over information flow.
- Unified Composition: Any workflow can invoke another as a submodule via `call` (a sketch of such a submodule follows this list). Workflows can also be registered as tools for dynamic invocation.
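As an illustration of the `call`/`return` pair, the submodule invoked in the earlier example might look roughly like this. The internal steps, and the assumption that `save_as` accumulates across `for_each` iterations, are ours, not the paper's.

```yaml
# modules/search_and_summarize.yaml -- hypothetical body for the submodule
# invoked via `call` in the main example.
name: "search_and_summarize"

parameters:
  queries: []   # bound by the caller's `parameters:` block

workflow:
  - for_each:
      items: "{{ queries }}"
      as: "query"
      body:
        - task:
            instruction: "Search the web for {{ query }} and summarize the top results"
            save_as: "summaries"   # accumulation across iterations is assumed

  - task:
      instruction: "Merge these summaries into a single briefing: {{ summaries }}"
      save_as: "briefing"

  # Hands the value back to the caller, which stores it under its own save_as.
  - return:
      value: "{{ briefing }}"
```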
2. Visual Editor
A bidirectional visual editor provides synchronized graph-based and YAML-based views (Figure 3). Users can edit via drag-and-drop or direct text modification.
3. Agent Harness
The harness executes the specification:
- Interpreter: Validates the workflow, resolves parameters, expands templates, and dispatches operations with hierarchical step IDs (e.g., `3.2.1`).
- Executor: Runs the multi-turn LLM-tool interaction loop, terminating on a final response or on configured limits. Uses a Model Context Protocol (MCP) client for tool execution.
- Execution Environment: Docker-based sandbox with isolated access to 50+ tools (file ops, web search, code execution, browser automation).
- Observability Dashboard: Live logs of agent actions and reasoning steps.
- Durability System:
- Checkpointing: Saves state after each step (context, metrics, sandbox). Enables resume from interruptions.
- Execution Tracing & Selective Replay: Records full trace. Allows replay from a prior trace to isolate the effect of prompt/flow changes.
- Formal Verification Potential: Explicit control flow and variable dependencies enable the definition of pre-/post-conditions for steps, allowing verification using formal languages (Lean, Isabelle); a schematic reading follows below.
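On this reading, a step with explicit pre- and post-conditions over the variable context is a Hoare triple (our notation, not the paper's):

$$\{\,P(\mathit{ctx})\,\}\ \mathit{step}_i\ \{\,Q(\mathit{ctx}')\,\}$$

For the running example, $P$ might assert that `search_queries` is a non-empty list and $Q$ that `paper_summary` is bound in the updated context; a prover such as Lean or Isabelle would then discharge these obligations against the declared control flow.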
Empirical Validation / Results
The framework is demonstrated with three ready-to-use agents and evaluated on seven benchmarks.
Agent Demos
- Deep Research: Takes a query, implements a multi-level (breadth/depth) search strategy, and generates a comprehensive Markdown report (a sketch of this pattern follows the list).
- AI Scientist: Two-stage pipeline (Thinker/Writer) that generates a novel academic research proposal, including safety checks, related work retrieval, and parallel citation insertion.
- AI Advisor: Takes a research proposal/paper and produces a rubric-based review with actionable feedback.
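A rough sketch of how the Deep Research breadth/depth pattern could be expressed in AgentSPEX. The module name, the `depth` parameter, the condition syntax, and the template arithmetic are all assumptions for illustration, not the shipped agent's source:

```yaml
# Hypothetical skeleton of a recursive breadth/depth research loop.
name: "deep_research"

parameters:
  query: "..."
  depth: 2          # remaining recursion levels (assumed parameter)

workflow:
  - task:
      instruction: "Generate breadth-first sub-queries for {{ query }}"
      save_as: "sub_queries"

  - if:
      condition: "{{ depth }} > 0"        # condition syntax assumed
      then:
        - for_each:
            items: "{{ sub_queries }}"
            as: "q"
            body:
              # Recurse one level deeper via composition.
              - call:
                  module: "modules/deep_research.yaml"
                  parameters:
                    query: "{{ q }}"
                    depth: "{{ depth - 1 }}"   # template arithmetic assumed
                  save_as: "findings"

  - task:
      instruction: "Write a comprehensive Markdown report from {{ findings }}"
```

Breadth comes from the fan-out over `sub_queries`; depth comes from the recursive `call` with a decremented bound.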
Benchmark Evaluation
Table 2: Evaluation results on seven different benchmarks.
| Agent | Model | Domain | Score |
|---|---|---|---|
| SciBench (Wang et al., 2024a) | | | |
| CoT | GPT-5 | Science | 85.92% |
| ReAct | GPT-5 | Science | 87.79% |
| AgentSPEX (Ours) | GPT-5 | Science | 90.61% |
| StemEZ (Wang et al., 2024b) | | | |
| CoT | GPT-5 | Science | 82.87% |
| ReAct | GPT-5 | Science | 84.72% |
| AgentSPEX (Ours) | GPT-5 | Science | 86.57% |
| ChemBench (Mirza et al., 2025) | | | |
| CoT | GPT-5* | Science | 78.90% |
| ReAct | GPT-5* | Science | 77.80% |
| AgentSPEX (Ours) | GPT-5* | Science | 83.30% |
| AIME 2025 (Art of Problem Solving, 2026) | | | |
| CoT (OpenAI, 2025a) | GPT-5 (without tools) | Mathematics | 94.60% |
| CoT (OpenAI, 2025a) | GPT-5 (with Python) | Mathematics | 99.60% |
| AgentSPEX (Ours) | GPT-5 | Mathematics | 100.0% |
| ELAIPBench (Dai et al., 2026) | | | |
| CoT | GPT-5* | Paper Understanding | 37.22% |
| ReAct | GPT-5* | Paper Understanding | 33.80% |
| AgentSPEX (Ours) | GPT-5* | Paper Understanding | 43.70% |
| WritingBench (Wu et al., 2025) | | | |
| CoT | Claude-Sonnet-4.5-Thinking | Writing | 79.90% |
| ReAct | Claude-Sonnet-4.5-Thinking | Writing | 80.30% |
| AgentSPEX (Ours) | Claude-Sonnet-4.5-Thinking | Writing | 81.00% |
| SWE-Bench Verified (Jimenez et al., 2024) | | | |
| mini-SWE-agent (Yang et al., 2024) | Claude-Opus-4.5*/4.6* | Software Engineering | 76.20% |
| Live-SWE-agent (Xia et al., 2025) | Claude-Opus-4.5*/4.6* | Software Engineering | 74.60% |
| AgentSPEX (Ours) | Claude-Opus-4.5/4.6** | Software Engineering | 77.10% |
*Denotes use of high-reasoning effort.
Key Findings:
- AgentSPEX achieves the highest score on all seven benchmarks.
- Significant Gains: +2.8% on SciBench and +6.5% on ELAIPBench over the stronger baseline; on ChemBench, +4.4% over CoT and +5.5% over ReAct. A perfect score on AIME 2025.
- Pattern Analysis: The ReAct baseline (same workflow in prompt, no enforcement) sometimes underperforms CoT (e.g., -3.4% on ELAIPBench). This suggests that offloading control flow logic to the interpreter (AgentSPEX) alleviates the model's burden of simultaneously interpreting structure and reasoning.
- Larger improvements are seen on benchmarks requiring processing of substantial input or multi-step coordination (ChemBench, ELAIPBench), likely benefiting from explicit context management that prevents context degradation.
- Model-Robustness: On SWE-Bench Verified, AgentSPEX shows minimal performance drop (-0.2%) when upgrading from Claude-Opus-4.5 to 4.6, compared to larger drops for other agents (-1.2% to -6.8%).
User Study
A study with 23 participants compared AgentSPEX and LangGraph workflows implementing the same behavior.
Qualitative Results:
- AgentSPEX was favored for readability, clarity of prompting, and ease of starting a new workflow from scratch. Described as "accessible to non-coders" and "easier to understand."
- LangGraph was preferred for constructing complex, multi-step workflows. Described as "customizable" and "more rigorous." This suggests AgentSPEX is perceived as more approachable, but its ability to handle complexity was initially less apparent (addressed by the provided demos).
Theoretical and Practical Implications
Theoretical Implications:
- Declarative Agent Specification: Proposes a shift from implicit, model-managed control flow to explicit, user-specified workflows, enabling formal reasoning about agent behavior.
- Context Management: Provides a structured mechanism to combat "context rot" and performance degradation in long-horizon tasks by explicitly controlling the information each step receives.
- Verification: The explicit structure opens the door for formal verification of agent plans and execution trajectories, a step towards more reliable and verifiable agentic systems.
Practical Implications:
- Accessibility: Lowers the barrier to entry for agent development, enabling domain experts and non-programmers to author and modify workflows via YAML and a visual editor.
- Maintainability & Reproducibility: Self-contained YAML files are easy to version-control, diff, and share. Explicit steps enhance reproducibility.
- Production Readiness: The durable harness (checkpointing, tracing, replay) supports robust, long-running workflows. The framework supports complex, production-ready agents (as demonstrated).
- Performance: Enforces efficient execution patterns that can outperform both unstructured (ReAct) and single-prompt (CoT) approaches, especially on longer, more structured tasks.
Framework Comparison (from Table 3):
| Approach | Natural-Language Spec | Explicit Context Management | Visual Editor |
|---|---|---|---|
| AutoGen (Wu et al., 2023) | ✗ | ✗ | ✗ |
| DSPy (Khattab et al., 2024) | ✗ | ✗ | ✗ |
| CrewAI (CrewAI, 2026) | Partial | ✗ | ✗ |
| LangGraph w/ LangFlow (Langflow AI, 2026) | ✗ | ✗ | ✓ |
| n8n (n8n-io, 2026) | ✗ | ✗ | ✓ |
| ADL (Zeng & Yan, 2025) | ✓ | ✗ | ✗ |
| PDL (Vaziri et al., 2024) | ✓ | Partial | ✗ |
| AgentSPEX (Ours) | ✓ | ✓ | ✓ |
Conclusion
AgentSPEX introduces a structured, declarative language and harness for building LLM agents that improves upon the limitations of both reactive prompting and Python-based orchestration frameworks. Its key contributions are:
- Expressive & Accessible Specification: YAML-based workflows with explicit control flow, state, and composition.
- Robust Execution Harness: Featuring sandboxed tools, observability, and durability mechanisms (checkpointing, tracing).
- Visual Development: A bidirectional editor for drag-and-drop authoring.
- Empirical Effectiveness: Demonstrated superior performance across diverse benchmarks and provided ready-to-use complex agents.
- User-Validated Usability: Perceived as more interpretable and accessible for workflow authoring.
Future Work includes advancing formal verification, training models to automatically write/use workflows, incorporating end-to-end agentic training pipelines, and enhancing support for multi-agent orchestration and long-context reasoning.