Summary (Overview)
- Workflow-Centered Formulation: This survey introduces Agentic Computation Graphs (ACGs) as a unifying abstraction for LLM agent workflows. It distinguishes three key objects: reusable templates, run-specific realized graphs, and execution traces.
- Taxonomy of Structure Determination: The literature is organized by when workflow structure is determined, using the concepts of Graph Determination Time (GDT) and Graph Plasticity Mode (GPM). This distinguishes static methods (fixed reusable templates) from dynamic methods (run-specific generation, selection, or editing).
- Cross-Cutting Synthesis: Methods are synthesized along three axes: optimization target (node, graph, or joint), feedback evidence (metric, verifier, preference, or trace), and update mechanism. This clarifies what is changed, why, and how.
- Evaluation Protocol: The survey advocates for structure-aware evaluation, moving beyond just downstream task metrics to also report graph-level properties, execution cost, robustness, and structural variation across inputs.
- Design Guidance: Based on surveyed patterns, practical guidance is provided: use static optimization for stable, repetitive tasks; start with selection/pruning before full generation for dynamic needs; and reserve in-execution editing for highly interactive environments.
Introduction and Theoretical Foundation
Large Language Model (LLM) systems are evolving from single-prompt chatbots to complex, executable workflows that coordinate multiple actions (LLM calls, tool use, retrieval, code execution, verification, etc.). The workflow structure—the components, their dependencies, and information flow—critically impacts both effectiveness and efficiency.
This survey positions itself within the broader literature by focusing specifically on workflow optimization as a primary design problem, distinct from adjacent topics like general agent planning, tool learning, or multi-agent collaboration surveys (see Table 1 in the paper).
Core Abstraction: Agentic Computation Graph (ACG)
An ACG is a directed graph whose nodes perform atomic actions and whose edges encode control, data, or communication dependencies. Each node can be described by a tuple comprising its action and its local parameters (e.g., prompt or tool binding).
Key Distinctions:
- ACG Template (T): A reusable executable specification T = (V, E, Θ, π, A), where V is the node set, E the edge set, Θ the node parameters (prompts, tools), π a scheduling policy, and A the set of admissible edit actions.
- Realized Graph (G): The workflow structure actually used for a particular run, which may be a subgraph or edited version of the template T.
- Execution Trace (τ): The sequence of states, actions, observations, and costs produced by executing G: τ = (s_0, a_0, o_0, c_0, ..., s_T).
Optimization Formulation: Workflow optimization is framed as balancing task quality against execution cost:
maximize E[Q(τ) − λ · C(τ)] over the template and its parameters, where Q(τ) is a task-quality score, C(τ) is execution cost, and λ ≥ 0 controls the trade-off.
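As a minimal sketch of this trade-off, the objective can be used as a scoring rule to compare candidate workflows. The names `Trace`, `objective`, and `best_workflow` are illustrative, not from the survey; quality and cost are assumed to be scalar summaries of a run.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """Illustrative summary of an execution trace: two scalars."""
    quality: float  # task-quality score Q(tau), e.g. accuracy on held-out checks
    cost: float     # execution cost C(tau), e.g. tokens or dollars

def objective(trace: Trace, lam: float = 0.1) -> float:
    """Quality-cost trade-off: Q(tau) - lambda * C(tau)."""
    return trace.quality - lam * trace.cost

def best_workflow(candidates: dict[str, Trace], lam: float = 0.1) -> str:
    """Pick the candidate workflow whose trace maximizes the objective."""
    return max(candidates, key=lambda name: objective(candidates[name], lam))
```

Note that the choice of λ changes the winner: with λ = 0 the highest-quality workflow always wins, while larger λ increasingly favors cheaper designs.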
Organizing Principle: Structure Determination
- Static: Deployed structure is a reusable template fixed after training/search.
- Dynamic: Part of the realized graph is constructed, selected, or edited at inference time.
- Graph Determination Time (GDT): Offline, Pre-execution, In-execution.
- Graph Plasticity Mode (GPM): None, Select, Generate, Edit.
Methodology
The survey methodology involves a structured analysis of 77 in-scope works (39 core, 7 adjacent, 31 background), including preprints, conference papers, and benchmark resources. A compact comparison card is used to classify methods along stable dimensions (see Table 9 in the Appendix):
| Field | Meaning |
|---|---|
| Setting | Static/Dynamic, GDT, GPM |
| Optimized Level | Node, Graph, or Joint |
| Representation | Code, Text, DSL, Graph IR, etc. |
| Feedback / Evidence | Metric, Verifier, Preference, Trace, etc. |
| Update Mechanism | Search, Generator, RL, Repair/Edit, etc. |
| Cost Handling | None, Soft Objective, Hard Constraint |
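A comparison card of this shape is straightforward to represent as a record; the sketch below uses free-text fields, with AFlow's row from Table 2 transcribed as an example (the `ComparisonCard` type is illustrative, not an artifact of the survey).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComparisonCard:
    """One row of the survey's method-comparison card (fields as free text)."""
    setting: str         # e.g. "static, offline / none"
    level: str           # "node", "graph", or "joint"
    representation: str  # e.g. "code", "DSL", "graph IR"
    feedback: str        # e.g. "metric: task score"
    update: str          # e.g. "search: MCTS"
    cost_handling: str   # "none", "soft objective", or "hard constraint"

# Example: AFlow's row from Table 2, transcribed as a card.
aflow = ComparisonCard(
    setting="static, offline / none",
    level="graph",
    representation="typed operator graph",
    feedback="metric: task score",
    update="search: MCTS",
    cost_handling="soft objective",
)
```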
Empirical Validation / Results
The survey synthesizes results by categorizing methods and their characteristics. Key findings are presented in comparative tables.
Static Optimization of Agent Workflows
These methods optimize a reusable template before deployment. They are easier to inspect and benchmark, but may be brittle under distribution shift.
Table 2: Representative Core Static Workflow-Optimization Methods (Summary)
| Method | Setting (GDT/GPM) | Level | Representation | Feedback / Evidence | Update Mechanism | Cost Handling |
|---|---|---|---|---|---|---|
| AFlow (Zhang et al., 2025e) | offline / none | graph | typed operator graph | metric: task score | search: MCTS | soft objective ($) |
| ADAS (Hu et al., 2025a) | offline / none | joint | runnable code | metric: task score | search: archive meta-search | none |
| A²Flow (Zhao et al., 2025) | offline / none | graph | abstract operator graph | supervision: demos + execution | hybrid: operator learning + search | none |
| Multi-Agent Design (Zhou et al., 2025) | offline / none | joint | topology + prompts | metric: task score | hybrid: staged alternation | none |
| Optima (Chen et al., 2025) | offline / none | node | fixed scaffold + trajectories | metric: quality–efficiency reward | hybrid: generate-rank-select-train | soft objective (efficiency) |
| VFlow (Wei et al., 2025) | offline / none | graph | domain workflow graph | verifier: multi-level checks | hybrid: MCTS + cooperative evolution | soft objective (resource) |
| Maestro (Wang et al., 2025a) | offline / none | joint | typed stochastic graph | trace: reflective text + score | hybrid: alternating graph/config updates | soft objective (budget) |
Node-Level Optimization inside Fixed Scaffolds: Methods like DSPy, OPRO, EvoPrompt, CAPO, and GEPA optimize local parameters (prompts, demonstrations) within a fixed graph structure, offering a practical and fast path to improvement.
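The node-level pattern these systems share can be sketched as greedy search over prompt variants inside a fixed graph; this is a simplified illustration under stated assumptions, not the algorithm of any named system, and `mutate`/`evaluate` are hypothetical hooks.

```python
import random

def optimize_prompt(base_prompt, mutate, evaluate, rounds=10, seed=0):
    """Greedy node-level search inside a fixed scaffold.

    `mutate(prompt, rng)` proposes a variant; `evaluate(prompt)` runs the
    fixed workflow with the candidate prompt and returns a scalar score.
    The graph structure never changes; only the node parameter does.
    """
    rng = random.Random(seed)
    best, best_score = base_prompt, evaluate(base_prompt)
    for _ in range(rounds):
        cand = mutate(best, rng)
        score = evaluate(cand)
        if score > best_score:  # keep the variant only if it improves
            best, best_score = cand, score
    return best, best_score
```

In practice `evaluate` is the expensive step (a full workflow run on a validation set), which is why these methods are fast relative to graph-level search: each trial changes one local parameter rather than re-validating a new structure.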
Dynamic Optimization and Runtime Adaptation
These methods determine workflow structure at inference time, offering flexibility for heterogeneous tasks.
Table 3: Representative Core Dynamic Workflow-Optimization Methods (Summary)
| Method | Setting (GDT/GPM) | Level | Representation | Feedback / Evidence | Update Mechanism | Cost Handling |
|---|---|---|---|---|---|---|
| **Pre-execution Generation/Selection** | | | | | | |
| Difficulty-Aware Agentic Orchestration (Su et al., 2025a) | pre-exec / select | joint | modular operator workflow | proxy: difficulty estimate | controller: router + allocator | soft objective (cost) |
| Assemble Your Crew (Li et al., 2025b) | pre-exec / generate | graph | query-conditioned DAG | supervision: task labels | generator: autoregressive DAG | none |
| G-Designer (Zhang et al., 2025d) | pre-exec / generate | graph | generated communication graph | metric: quality score | generator: VGAE | soft objective (cost) |
| ScoreFlow (Wang et al., 2025e) | pre-exec / generate | graph | workflow generator | preference: score-aware pairs | preference optimization | none |
| FlowReasoner (Gao et al., 2025a) | pre-exec / generate | graph | operator-library program | metric: reward | RL: meta-controller | soft objective (cost) |
| **In-execution Editing** | | | | | | |
| DyFlow (Wang et al., 2025c) | in-exec / edit | joint | designer–executor workflow | trace: intermediate feedback | hybrid: online planning + execution | none |
| AgentConductor (Wang et al., 2026b) | in-exec / edit | graph | YAML / DAG topology | verifier: validity + execution | RL: topology revision | hard constraint (budget) |
| Aime (Shi et al., 2025) | in-exec / edit | graph | planner + dynamic actor graph | trace: runtime outcomes | controller: actor instantiation | none |
| MetaGen (Wang et al., 2026c) | in-exec / edit | joint | dynamic role pool + graph | trace: running feedback | repair/edit: training-free evolution | soft objective (cost) |
| ProAgent (Ye et al., 2023) | in-exec / edit | graph | structured JSON process graph | verifier: tests | repair/edit: incremental repair | none |
Lightweight Dynamic Adaptation (Adjacent Methods): Techniques like Adaptive Graph Pruning, DAGP, AgentDropout, DyLAN, and MasRouter perform runtime selection or pruning over a fixed super-graph, offering cost savings with inherited validity.
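The selection/pruning idea can be sketched as a greedy budgeted choice over the super-graph's optional nodes; this is a minimal illustration of the pattern (names and the utility-per-cost heuristic are assumptions, not taken from any listed method).

```python
def prune_supergraph(nodes, utility, cost, budget):
    """Runtime selection over a fixed super-graph.

    Greedily keeps the nodes with the best predicted utility per unit cost
    while staying within a cost budget. Validity is inherited from the
    super-graph because we only drop optional nodes, never add edges.
    `utility` and `cost` map node name -> float.
    """
    chosen, spent = [], 0.0
    for n in sorted(nodes, key=lambda n: utility[n] / cost[n], reverse=True):
        if spent + cost[n] <= budget:
            chosen.append(n)
            spent += cost[n]
    return chosen
```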
Feedback Signals and Update Mechanisms
The evidence used to guide optimization is tightly coupled to the update mechanism.
- Metric- and Score-Driven Optimization: Uses scalar task metrics (success, accuracy). Drives black-box search (e.g., AFlow, ADAS) and RL-based generators (e.g., FlowReasoner).
- Verifier-Driven Optimization: Uses constraints like unit tests, schema checks, or functional correctness (e.g., VFlow, MermaidFlow, AgentConductor). Enables aggressive mutation with cheap validation.
- Preference and Ranking Signals: Compares workflows or traces instead of using absolute rewards (e.g., ScoreFlow, RobustFlow, Optima). Useful when rewards are noisy.
- Trace-Derived Textual Feedback: Uses semantic critiques from execution logs to propose changes (e.g., GEPA, MetaGen, Maestro). Offers rich feedback but requires coupling with validators.
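The verifier-driven pattern above can be sketched as a mutation loop with a cheap validity gate in front of the expensive metric; `propose`, `verify`, and `score` are hypothetical hooks, and this is a generic sketch rather than the procedure of any surveyed system.

```python
import random

def verifier_guided_search(graph, propose, verify, score, steps=20, rng=None):
    """Verifier-gated workflow mutation.

    Mutate aggressively, but accept an edit only if cheap checks pass
    (`verify`: schema/unit tests -> bool) and the expensive task metric
    (`score`) does not regress.
    """
    rng = rng or random.Random(0)
    best, best_score = graph, score(graph)
    for _ in range(steps):
        cand = propose(best, rng)
        if not verify(cand):   # cheap gate filters invalid structures
            continue           # before paying for full evaluation
        s = score(cand)
        if s >= best_score:
            best, best_score = cand, s
    return best
```

The design point this illustrates: because `verify` is cheap, `propose` can be aggressive; most invalid candidates are rejected before the costly `score` call ever runs.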
Theoretical and Practical Implications
Theoretical Implications:
- Credit Assignment Problem: It remains difficult to attribute performance gains to specific structural changes (e.g., a new edge vs. more compute).
- Expressivity vs. Verifiability Trade-off: Expressive workflows are powerful but hard to validate and compare. Constrained IRs improve reproducibility but may limit solutions.
- Need for Theory: The field lacks a theory for when dynamic generation is necessary, when static templates suffice, and how sample complexity scales with structural plasticity.
Practical Implications and Design Guidance:
- When Static is Enough: For stable APIs, strong verification, and repetitive workloads, a well-searched static template is often superior due to lower cost and easier debugging.
- Selection vs. Generation vs. Editing: Choose the minimum plasticity required:
- Selection/Pruning: When tasks vary mainly in difficulty or required compute.
- Pre-execution Generation: When tasks require genuinely different decomposition or communication patterns.
- In-execution Editing: For interactive environments where runtime observations fundamentally change the plan.
- Graph vs. Prompt Optimization: If errors arise from missing verification, poor decomposition, or incorrect control flow, graph-level optimization is the higher-leverage intervention over prompt tuning.
- Value of Verifiers: Verifiers pay off most when they are cheap and semantically meaningful (e.g., unit tests, schema checks). Placement and invocation frequency are key design choices.
- A Practical Hybrid Recipe:
- Start with a constrained static scaffold and optimize node-level prompts.
- Add graph-level search if trace analysis reveals structural failures.
- For heterogeneous tasks, prefer runtime selection before full generation.
- Use in-execution editing only for high environmental uncertainty.
- Compress/prune communication for efficiency after finding a capable design.
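The staged recipe can be summarized as an escalation driver that adds structural plasticity only when evidence demands it; every callable here is an illustrative hook standing in for a whole optimization stage, not an API from any surveyed system.

```python
def hybrid_optimize(scaffold, tune_prompts, trace_shows_structural_failure,
                    graph_search, needs_per_input_structure, add_router):
    """Staged escalation: node-level first, structure only when needed."""
    workflow = tune_prompts(scaffold)             # 1. optimize node-level prompts
    if trace_shows_structural_failure(workflow):
        workflow = graph_search(workflow)         # 2. add graph-level search
    if needs_per_input_structure(workflow):
        workflow = add_router(workflow)           # 3. runtime selection/routing
    return workflow
```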
Conclusion
This survey provides a workflow-centered view of LLM agent systems, unifying them under the Agentic Computation Graph (ACG) abstraction. By distinguishing static from dynamic structure determination and analyzing methods along the axes of optimization target, feedback evidence, and update mechanism, it offers a framework for comparing and designing workflow optimization techniques.
A key conclusion is that workflow structure should be a first-class design object, with evaluation reporting not just final answers but also the workflow used, its variation, and its cost. Future work must address structural credit assignment, continual adaptation under drift, improved benchmarks, and the development of a theoretical foundation for the field.
Table 5: A Proposed Minimum Reporting Protocol for Workflow-Optimization Papers
| Dimension | What should be reported | Why it matters |
|---|---|---|
| Workflow representation | code, DSL, graph IR, schema constraints, executable interpreter, available operators and tools | Determines what can be searched, validated, or edited |
| Structural setting | static or dynamic, GDT, GPM, admissible edits, routing policy, stopping rules | Clarifies what kind of structural variation the method actually allows |
| Model and tool configuration | base models, decoding settings, tool registry, verifier placement, memory policy | Separates workflow effects from backbone or tool effects |
| Online inference cost | tokens, LLM calls, tool calls, latency, wall-clock time, dollars, cost-per-success | Makes quality–cost trade-offs scientifically comparable |
| Graph-level metrics | node count, depth, width, communication volume, edit count, structural variance | Treats the workflow as a first-class output rather than an invisible implementation detail |
| Robustness tests | paraphrases, noisy retrieval, tool failure injection, API drift, unseen tools, strict budget caps | Checks whether the workflow policy is stable outside nominal conditions |