# From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents

> This survey introduces Agentic Computation Graphs as a unifying abstraction and organizes LLM agent workflow optimization by when and how their structure is determined.

- **Source:** [arXiv](https://arxiv.org/abs/2603.22386)
- **Published:** 2026-03-26
- **Permalink:** https://picx.dev/p/oq0RNK
- **Whiteboard:** https://picx.dev/p/oq0RNK/image

## Summary

*   **Workflow-Centered Formulation:** This survey introduces **Agentic Computation Graphs (ACGs)** as a unifying abstraction for LLM agent workflows. It distinguishes three key objects: **reusable templates**, **run-specific realized graphs**, and **execution traces**.
*   **Taxonomy of Structure Determination:** The literature is organized by **when workflow structure is determined**, using the concepts of **Graph Determination Time (GDT)** and **Graph Plasticity Mode (GPM)**. This distinguishes **static** methods (fixed reusable templates) from **dynamic** methods (run-specific generation, selection, or editing).
*   **Cross-Cutting Synthesis:** Methods are synthesized along three axes: **optimization target** (node, graph, or joint), **feedback evidence** (metric, verifier, preference, or trace), and **update mechanism**. This clarifies what is changed, why, and how.
*   **Evaluation Protocol:** The survey advocates **structure-aware evaluation**: alongside downstream task metrics, papers should report **graph-level properties**, **execution cost**, **robustness**, and **structural variation** across inputs.
*   **Design Guidance:** Based on surveyed patterns, practical guidance is provided: use static optimization for stable, repetitive tasks; start with selection/pruning before full generation for dynamic needs; and reserve in-execution editing for highly interactive environments.

# Introduction and Theoretical Foundation

Large Language Model (LLM) systems are evolving from single-prompt chatbots to complex, executable **workflows** that coordinate multiple actions (LLM calls, tool use, retrieval, code execution, verification, etc.). The **workflow structure**—the components, their dependencies, and information flow—critically impacts both effectiveness and efficiency.

This survey positions itself within the broader literature by focusing specifically on **workflow optimization** as a primary design problem, distinct from adjacent topics like general agent planning, tool learning, or multi-agent collaboration surveys (see Table 1 in the paper).

**Core Abstraction: Agentic Computation Graph (ACG)**
An ACG is a directed graph where nodes perform atomic actions and edges encode control, data, or communication dependencies. A node can be described by the tuple $\langle \text{Instruction, Context, Tools, Model/Decoding} \rangle$.

**Key Distinctions:**
1.  **ACG Template ($\bar{G}$):** A reusable executable specification $\bar{G} = (V, E, \Phi, \Sigma, A)$, where $V$ is nodes, $E$ is edges, $\Phi$ are node parameters (prompts, tools), $\Sigma$ is a scheduling policy, and $A$ is admissible edit actions.
2.  **Realized Graph ($G_{run}$):** The workflow structure actually used for a particular run, which may be a subgraph or edited version of the template.
3.  **Execution Trace ($\tau$):** The sequence of states, actions, observations, and costs produced by executing $G_{run}$: $\tau = \{ (s_t, a_t, o_t, c_t) \}_{t=1}^T$.
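
The three objects above can be sketched as plain data structures. This is a minimal illustration, not the survey's reference implementation: the class names, the `realize` helper, and the placeholder model name are assumptions, and the template omits $\Phi$, $\Sigma$, and $A$ for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One atomic action, loosely <Instruction, Context, Tools, Model/Decoding>."""
    name: str
    instruction: str
    tools: tuple[str, ...] = ()
    model: str = "hypothetical-model"  # placeholder, not a real model id


@dataclass
class RealizedGraph:
    """Run-specific structure actually used for one input."""
    nodes: dict[str, Node]
    edges: set[tuple[str, str]]


@dataclass
class ACGTemplate:
    """Reusable specification holding only V and E in this sketch."""
    nodes: dict[str, Node]
    edges: set[tuple[str, str]]

    def realize(self, keep: set[str]) -> RealizedGraph:
        """Select the subgraph induced by the node names in `keep`."""
        nodes = {n: v for n, v in self.nodes.items() if n in keep}
        edges = {(u, v) for (u, v) in self.edges if u in keep and v in keep}
        return RealizedGraph(nodes, edges)


@dataclass
class TraceStep:
    """One (state, action, observation, cost) tuple of the execution trace."""
    state: str
    action: str
    observation: str
    cost: float


template = ACGTemplate(
    nodes={n: Node(n, f"do {n}") for n in ("plan", "solve", "verify")},
    edges={("plan", "solve"), ("solve", "verify")},
)
g_run = template.realize(keep={"plan", "solve"})
trace = [
    TraceStep("s0", "plan", "outline", 0.2),
    TraceStep("s1", "solve", "answer", 0.5),
]
total_cost = sum(step.cost for step in trace)
```

Keeping the template, the realized graph, and the trace as separate types makes it explicit which object an optimizer edits and which object an evaluator measures.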

**Optimization Formulation:**
Workflow optimization is framed as balancing task quality against execution cost:
$$
\max \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{G_{run} | x} \left[ \mathbb{E}_{\tau | G_{run}, x} [ R(\tau; x) - \lambda C(\tau) ] \right] \right]
$$
where $R(\tau; x)$ is a task-quality score, $C(\tau)$ is execution cost, and $\lambda$ controls the trade-off.
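
As a rough illustration of this objective, the expectation can be estimated by averaging $R(\tau; x) - \lambda C(\tau)$ over sampled inputs. The sketch below assumes a toy deterministic executor with two hypothetical workflows (`"short"` and `"long"`); all names, rewards, and costs are invented for the example. With a stochastic executor, each input would need several runs.

```python
from statistics import mean


def estimate_objective(inputs, choose_graph, execute, lam):
    """Average of R(tau; x) - lam * C(tau) over the given inputs."""
    values = []
    for x in inputs:
        g_run = choose_graph(x)            # realized graph for this input
        reward, cost = execute(g_run, x)   # summary of the trace tau
        values.append(reward - lam * cost)
    return mean(values)


def execute(g_run, x):
    """Toy executor: the long workflow always succeeds but costs more."""
    cost = {"short": 1.0, "long": 3.0}[g_run]
    reward = 1.0 if (g_run == "long" or x == "easy") else 0.0
    return reward, cost


inputs = ["easy", "hard"]

# Static policy: the same workflow for every input.
always_long = estimate_objective(inputs, lambda x: "long", execute, lam=0.1)

# Dynamic policy: pick the cheaper workflow when the input is easy.
routed = estimate_objective(
    inputs, lambda x: "short" if x == "easy" else "long", execute, lam=0.1
)
```

In this toy setting the routed policy scores higher because it keeps the same reward while paying less on easy inputs, which is the kind of trade-off $\lambda$ is meant to expose.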

**Organizing Principle: Structure Determination**
*   **Static:** Deployed structure is a reusable template fixed after training/search.
*   **Dynamic:** Part of the realized graph is constructed, selected, or edited at inference time.
    *   **Graph Determination Time (GDT):** *Offline, Pre-execution, In-execution*.
    *   **Graph Plasticity Mode (GPM):** *None, Select, Generate, Edit*.
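
The two axes can be encoded as small enumerations. The sketch below is illustrative only: the string values mirror the `GDT / GPM` notation used in the comparison tables of this summary, while `parse_setting` and the `is_dynamic` rule (anything other than offline determination with no plasticity counts as dynamic) are assumptions of this example.

```python
from enum import Enum


class GDT(Enum):
    """Graph Determination Time."""
    OFFLINE = "offline"
    PRE_EXECUTION = "pre-exec"
    IN_EXECUTION = "in-exec"


class GPM(Enum):
    """Graph Plasticity Mode."""
    NONE = "none"
    SELECT = "select"
    GENERATE = "generate"
    EDIT = "edit"


def parse_setting(cell: str) -> tuple[GDT, GPM]:
    """Parse a table cell such as 'pre-exec / select'."""
    gdt_text, gpm_text = (part.strip() for part in cell.split("/"))
    return GDT(gdt_text), GPM(gpm_text)


def is_dynamic(gdt: GDT, gpm: GPM) -> bool:
    """Assumed rule: static means offline determination with no plasticity."""
    return not (gdt is GDT.OFFLINE and gpm is GPM.NONE)


static_setting = parse_setting("offline / none")
dynamic_setting = parse_setting("in-exec / edit")
```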

# Methodology

The survey methodology involves a structured analysis of 77 in-scope works (39 core, 7 adjacent, 31 background), including preprints, conference papers, and benchmark resources. A **compact comparison card** is used to classify methods along stable dimensions (see Table 9 in the Appendix):

| Field | Meaning |
| :--- | :--- |
| **Setting** | Static/Dynamic, GDT, GPM |
| **Optimized Level** | Node, Graph, or Joint |
| **Representation** | Code, Text, DSL, Graph IR, etc. |
| **Feedback / Evidence** | Metric, Verifier, Preference, Trace, etc. |
| **Update Mechanism** | Search, Generator, RL, Repair/Edit, etc. |
| **Cost Handling** | None, Soft Objective, Hard Constraint |
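
One way to make the comparison card machine-readable is a small record type. The sketch below is an assumption about how such a card might be stored, not the survey's own tooling. The field names follow the table above, the three example entries are copied from Tables 2 and 3 of this summary, and `group_by` is a hypothetical helper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ComparisonCard:
    method: str
    setting: str            # "GDT / GPM", e.g. "offline / none"
    optimized_level: str    # "node", "graph", or "joint"
    representation: str
    feedback: str
    update_mechanism: str
    cost_handling: str


cards = [
    ComparisonCard("AFlow", "offline / none", "graph", "typed operator graph",
                   "metric: task score", "search: MCTS", "soft objective"),
    ComparisonCard("ADAS", "offline / none", "joint", "runnable code",
                   "metric: task score", "search: archive meta-search", "none"),
    ComparisonCard("AgentConductor", "in-exec / edit", "graph",
                   "YAML / DAG topology", "verifier: validity + execution",
                   "RL: topology revision", "hard constraint (budget)"),
]


def group_by(items, field_name):
    """Group cards by the value of one field, returning method names."""
    groups = defaultdict(list)
    for card in items:
        groups[getattr(card, field_name)].append(card.method)
    return dict(groups)


by_level = group_by(cards, "optimized_level")
by_setting = group_by(cards, "setting")
```

Storing cards this way lets the same records be sliced by setting, level, or cost handling without re-reading the tables.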

# Empirical Validation / Results

The survey synthesizes results by categorizing methods and their characteristics. Key findings are presented in comparative tables.

## Static Optimization of Agent Workflows
Static methods optimize a reusable template before deployment. The resulting workflows are easier to inspect and benchmark but may be brittle under distribution shift.

**Table 2: Representative Core Static Workflow-Optimization Methods (Summary)**

| Method | Setting (GDT/GPM) | Level | Representation | Feedback / Evidence | Update Mechanism | Cost Handling |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| AFlow (Zhang et al., 2025e) | offline / none | graph | typed operator graph | metric: task score | search: MCTS | soft objective (\$) |
| ADAS (Hu et al., 2025a) | offline / none | joint | runnable code | metric: task score | search: archive meta-search | none |
| A²Flow (Zhao et al., 2025) | offline / none | graph | abstract operator graph | supervision: demos + execution | hybrid: operator learning + search | none |
| Multi-Agent Design (Zhou et al., 2025) | offline / none | joint | topology + prompts | metric: task score | hybrid: staged alternation | none |
| Optima (Chen et al., 2025) | offline / none | node | fixed scaffold + trajectories | metric: quality–efficiency reward | hybrid: generate-rank-select-train | soft objective (efficiency) |
| VFlow (Wei et al., 2025) | offline / none | graph | domain workflow graph | verifier: multi-level checks | hybrid: MCTS + cooperative evolution | soft objective (resource) |
| Maestro (Wang et al., 2025a) | offline / none | joint | typed stochastic graph | trace: reflective text + score | hybrid: alternating graph/config updates | soft objective (budget) |

**Node-Level Optimization inside Fixed Scaffolds:** Methods like **DSPy**, **OPRO**, **EvoPrompt**, **CAPO**, and **GEPA** optimize local parameters (prompts, demonstrations) within a fixed graph structure, offering a practical and fast path to improvement.

## Dynamic Optimization and Runtime Adaptation
Dynamic methods determine some or all of the workflow structure at inference time, which offers flexibility for heterogeneous tasks.

**Table 3: Representative Core Dynamic Workflow-Optimization Methods (Summary)**

| Method | Setting (GDT/GPM) | Level | Representation | Feedback / Evidence | Update Mechanism | Cost Handling |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Pre-execution Generation/Selection** | | | | | | |
| Difficulty-Aware Agentic Orchestration (Su et al., 2025a) | pre-exec / select | joint | modular operator workflow | proxy: difficulty estimate | controller: router + allocator | soft objective (cost) |
| Assemble Your Crew (Li et al., 2025b) | pre-exec / generate | graph | query-conditioned DAG | supervision: task labels | generator: autoregressive DAG | none |
| G-Designer (Zhang et al., 2025d) | pre-exec / generate | graph | generated communication graph | metric: quality score | generator: VGAE | soft objective (cost) |
| ScoreFlow (Wang et al., 2025e) | pre-exec / generate | graph | workflow generator | preference: score-aware pairs | preference optimization | none |
| FlowReasoner (Gao et al., 2025a) | pre-exec / generate | graph | operator-library program | metric: reward | RL: meta-controller | soft objective (cost) |
| **In-execution Editing** | | | | | | |
| DyFlow (Wang et al., 2025c) | in-exec / edit | joint | designer–executor workflow | trace: intermediate feedback | hybrid: online planning + execution | none |
| AgentConductor (Wang et al., 2026b) | in-exec / edit | graph | YAML / DAG topology | verifier: validity + execution | RL: topology revision | hard constraint (budget) |
| Aime (Shi et al., 2025) | in-exec / edit | graph | planner + dynamic actor graph | trace: runtime outcomes | controller: actor instantiation | none |
| MetaGen (Wang et al., 2026c) | in-exec / edit | joint | dynamic role pool + graph | trace: running feedback | repair/edit: training-free evolution | soft objective (cost) |
| ProAgent (Ye et al., 2023) | in-exec / edit | graph | structured JSON process graph | verifier: tests | repair/edit: incremental repair | none |

**Lightweight Dynamic Adaptation (Adjacent Methods):** Techniques like **Adaptive Graph Pruning**, **DAGP**, **AgentDropout**, **DyLAN**, and **MasRouter** perform runtime selection or pruning over a fixed super-graph, offering cost savings with inherited validity.
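
To make the idea of pruning over a fixed super-graph concrete, the sketch below drops low-scoring edges and falls back to the full super-graph if the sink becomes unreachable. It is a generic illustration and does not reproduce any of the named methods: the edge scores, the threshold, and the fallback rule are assumptions. Because the result is always a subset of the super-graph's edges, it stays acyclic whenever the super-graph is, which is one sense in which validity is inherited.

```python
def reachable(edges, start):
    """Nodes reachable from `start` by following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def prune(super_edges, scores, threshold, source, sink):
    """Keep edges scoring at least `threshold`; fall back if the sink is cut off."""
    kept = {e for e in super_edges if scores[e] >= threshold}
    live = reachable(kept, source)
    if sink not in live:
        return set(super_edges)  # pruning broke the workflow, so keep everything
    # Drop edges whose tail can no longer be reached from the source.
    return {(u, v) for (u, v) in kept if u in live}


super_edges = {("in", "a"), ("in", "b"), ("a", "b"), ("a", "out"), ("b", "out")}
scores = {
    ("in", "a"): 0.9, ("in", "b"): 0.2, ("a", "b"): 0.1,
    ("a", "out"): 0.8, ("b", "out"): 0.6,
}

pruned = prune(super_edges, scores, threshold=0.5, source="in", sink="out")
too_aggressive = prune(super_edges, scores, threshold=0.95, source="in", sink="out")
```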

## Feedback Signals and Update Mechanisms
The evidence used to guide optimization is tightly coupled to the update mechanism.

1.  **Metric- and Score-Driven Optimization:** Uses scalar task metrics (success, accuracy). Drives black-box search (e.g., AFlow, ADAS) and RL-based generators (e.g., FlowReasoner).
2.  **Verifier-Driven Optimization:** Uses constraints like unit tests, schema checks, or functional correctness (e.g., VFlow, MermaidFlow, AgentConductor). Enables aggressive mutation with cheap validation.
3.  **Preference and Ranking Signals:** Compares workflows or traces instead of using absolute rewards (e.g., ScoreFlow, RobustFlow, Optima). Useful when rewards are noisy.
4.  **Trace-Derived Textual Feedback:** Uses semantic critiques from execution logs to propose changes (e.g., GEPA, MetaGen, Maestro). Offers rich feedback but requires coupling with validators.
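
The coupling between evidence and update mechanism can be illustrated with a verifier-gated search: a cheap structural check filters mutated candidates so that the (notionally expensive) metric is only computed for valid workflows. Everything in this sketch is a toy assumption, including the operator names, the verifier rule, the scoring function, and the simple hill-climbing loop. It is not the procedure of any surveyed method.

```python
import random

OPERATORS = ["retrieve", "solve", "verify", "answer"]


def verifier(workflow):
    """Cheap structural check: non-empty, at most 5 steps, ends with 'answer'."""
    return 0 < len(workflow) <= 5 and workflow[-1] == "answer"


def metric(workflow):
    """Toy stand-in for a task score: reward variety, penalize length."""
    return len(set(workflow)) - 0.1 * len(workflow)


def mutate(workflow, rng):
    """Apply one random insert, delete, or replace edit to a copy."""
    candidate = list(workflow)
    op = rng.choice(["insert", "delete", "replace"])
    if op == "insert":
        candidate.insert(rng.randrange(len(candidate) + 1), rng.choice(OPERATORS))
    elif op == "delete" and candidate:
        del candidate[rng.randrange(len(candidate))]
    elif candidate:
        candidate[rng.randrange(len(candidate))] = rng.choice(OPERATORS)
    return candidate


def search(start, steps=200, seed=0):
    """Hill-climb over workflows, scoring only candidates the verifier accepts."""
    rng = random.Random(seed)
    best, best_score, evaluations = start, metric(start), 0
    for _ in range(steps):
        candidate = mutate(best, rng)
        if not verifier(candidate):
            continue  # rejected cheaply; the metric is never called
        evaluations += 1
        score = metric(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score, evaluations


start = ["solve", "answer"]
best, best_score, evaluations = search(start)
```

The number of metric evaluations is bounded by the number of candidates that pass the verifier, which is why cheap validation makes aggressive mutation affordable.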

# Theoretical and Practical Implications

**Theoretical Implications:**
*   **Credit Assignment Problem:** It remains difficult to attribute performance gains to specific structural changes (e.g., a new edge vs. more compute).
*   **Expressivity vs. Verifiability Trade-off:** Expressive workflows are powerful but hard to validate and compare. Constrained IRs improve reproducibility but may limit solutions.
*   **Need for Theory:** The field lacks a theory for when dynamic generation is necessary, when static templates suffice, and how sample complexity scales with structural plasticity.

**Practical Implications and Design Guidance:**
*   **When Static is Enough:** For stable APIs, strong verification, and repetitive workloads, a well-searched static template is often superior due to lower cost and easier debugging.
*   **Selection vs. Generation vs. Editing:** Choose the minimum plasticity required:
    *   **Selection/Pruning:** When tasks vary mainly in difficulty or required compute.
    *   **Pre-execution Generation:** When tasks require genuinely different decomposition or communication patterns.
    *   **In-execution Editing:** For interactive environments where runtime observations fundamentally change the plan.
*   **Graph vs. Prompt Optimization:** If errors arise from missing verification, poor decomposition, or incorrect control flow, **graph-level optimization** is a higher-leverage intervention than prompt tuning.
*   **Value of Verifiers:** Verifiers pay off most when they are **cheap and semantically meaningful** (e.g., unit tests, schema checks). Placement and invocation frequency are key design choices.
*   **A Practical Hybrid Recipe:**
    1.  Start with a constrained static scaffold and optimize node-level prompts.
    2.  Add graph-level search if trace analysis reveals structural failures.
    3.  For heterogeneous tasks, prefer runtime selection before full generation.
    4.  Use in-execution editing only for high environmental uncertainty.
    5.  Compress/prune communication for efficiency after finding a capable design.
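
The "minimum plasticity" guidance above can be read as a simple decision rule. The function below is one hedged encoding of it: the three boolean inputs and the function name are assumptions made for illustration, and real projects would need to judge these conditions from trace analysis rather than set them by hand.

```python
def recommend_plasticity(varies_in_difficulty: bool,
                         needs_different_decomposition: bool,
                         observations_change_plan: bool) -> str:
    """Return the least plastic 'GDT / GPM' setting that covers the stated needs."""
    if observations_change_plan:
        # Runtime observations fundamentally change the plan.
        return "in-exec / edit"
    if needs_different_decomposition:
        # Tasks need genuinely different decompositions or communication patterns.
        return "pre-exec / generate"
    if varies_in_difficulty:
        # Tasks differ mainly in difficulty or required compute.
        return "pre-exec / select"
    # Stable, repetitive workload: a searched static template is enough.
    return "offline / none"


stable_workload = recommend_plasticity(False, False, False)
mixed_difficulty = recommend_plasticity(True, False, False)
interactive_env = recommend_plasticity(True, True, True)
```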

# Conclusion

This survey provides a workflow-centered view of LLM agent systems, unifying them under the **Agentic Computation Graph (ACG)** abstraction. By distinguishing **static** from **dynamic** structure determination and analyzing methods along the axes of **optimization target**, **feedback evidence**, and **update mechanism**, it offers a framework for comparing and designing workflow optimization techniques.

A key conclusion is that **workflow structure should be a first-class design object**, with evaluation reporting not just final answers but also the **workflow used, its variation, and its cost**. Future work must address **structural credit assignment**, **continual adaptation under drift**, **improved benchmarks**, and the development of a **theoretical foundation** for the field.

**Table 5: A Proposed Minimum Reporting Protocol for Workflow-Optimization Papers**

| Dimension | What should be reported | Why it matters |
| :--- | :--- | :--- |
| **Workflow representation** | code, DSL, graph IR, schema constraints, executable interpreter, available operators and tools | Determines what can be searched, validated, or edited |
| **Structural setting** | static or dynamic, GDT, GPM, admissible edits, routing policy, stopping rules | Clarifies what kind of structural variation the method actually allows |
| **Model and tool configuration** | base models, decoding settings, tool registry, verifier placement, memory policy | Separates workflow effects from backbone or tool effects |
| **Online inference cost** | tokens, LLM calls, tool calls, latency, wall-clock time, dollars, cost-per-success | Makes quality–cost trade-offs scientifically comparable |
| **Graph-level metrics** | node count, depth, width, communication volume, edit count, structural variance | Treats the workflow as a first-class output rather than an invisible implementation detail |
| **Robustness tests** | paraphrases, noisy retrieval, tool failure injection, API drift, unseen tools, strict budget caps | Checks whether the workflow policy is stable outside nominal conditions |

