Summary (Overview)
- Workflow-Centered Formulation: This survey introduces Agentic Computation Graphs (ACGs) as a unifying abstraction for LLM agent workflows. It distinguishes three key objects: reusable templates, run-specific realized graphs, and execution traces.
- Taxonomy of Structure Determination: The literature is organized by when workflow structure is determined, using the concepts of Graph Determination Time (GDT) and Graph Plasticity Mode (GPM). This distinguishes static methods (fixed reusable templates) from dynamic methods (run-specific generation, selection, or editing).
- Cross-Cutting Synthesis: Methods are synthesized along three axes: optimization target (node, graph, or joint), feedback evidence (metric, verifier, preference, or trace), and update mechanism. This clarifies what is changed, why, and how.
- Evaluation Protocol: The survey advocates for structure-aware evaluation, moving beyond just downstream task metrics to also report graph-level properties, execution cost, robustness, and structural variation across inputs.
- Design Guidance: Based on surveyed patterns, practical guidance is provided: use static optimization for stable, repetitive tasks; start with selection/pruning before full generation for dynamic needs; and reserve in-execution editing for highly interactive environments.
Introduction and Theoretical Foundation
Large Language Model (LLM) systems are evolving from single-prompt chatbots to complex, executable workflows that coordinate multiple actions (LLM calls, tool use, retrieval, code execution, verification, etc.). The workflow structure—the components, their dependencies, and information flow—critically impacts both effectiveness and efficiency.
This survey positions itself within the broader literature by focusing specifically on workflow optimization as a primary design problem, distinct from adjacent topics like general agent planning, tool learning, or multi-agent collaboration surveys (see Table 1 in the paper).
Core Abstraction: Agentic Computation Graph (ACG)
An ACG is a directed graph whose nodes perform atomic actions and whose edges encode control, data, or communication dependencies. Each node can be described by a tuple comprising its action and its local parameters (e.g., prompt or tool binding).
Key Distinctions:
- ACG Template (T): A reusable executable specification T = (V, E, Θ, π, A), where V is the node set, E the edge set, Θ the node parameters (prompts, tools), π a scheduling policy, and A the set of admissible edit actions.
- Realized Graph (G): The workflow structure actually used for a particular run, which may be a subgraph or edited version of the template T.
- Execution Trace (τ): The sequence of states, actions, observations, and costs produced by executing G: τ = (s_0, a_0, o_0, c_0, ..., s_T).
Optimization Formulation: Workflow optimization is framed as balancing task quality against execution cost:
maximize E[Q(τ) − λ · C(τ)] over the template and its parameters, where Q(τ) is a task-quality score, C(τ) is execution cost, and λ ≥ 0 controls the trade-off.
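As a minimal sketch of this trade-off, the objective can be used as a scoring rule to compare candidate workflows. The names `Trace`, `objective`, and `best_workflow` are illustrative, not from the survey; quality and cost are assumed to be scalar summaries of a run.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """Illustrative summary of an execution trace: two scalars."""
    quality: float  # task-quality score Q(tau), e.g. accuracy on held-out checks
    cost: float     # execution cost C(tau), e.g. tokens or dollars

def objective(trace: Trace, lam: float = 0.1) -> float:
    """Quality-cost trade-off: Q(tau) - lambda * C(tau)."""
    return trace.quality - lam * trace.cost

def best_workflow(candidates: dict[str, Trace], lam: float = 0.1) -> str:
    """Pick the candidate workflow whose trace maximizes the objective."""
    return max(candidates, key=lambda name: objective(candidates[name], lam))
```

Note that the choice of λ changes the winner: with λ = 0 the highest-quality workflow always wins, while larger λ increasingly favors cheaper designs.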
Organizing Principle: Structure Determination
- Static: Deployed structure is a reusable template fixed after training/search.
- Dynamic: Part of the realized graph is constructed, selected, or edited at inference time.
- Graph Determination Time (GDT): Offline, Pre-execution, In-execution.
- Graph Plasticity Mode (GPM): None, Select, Generate, Edit.
Methodology
The survey methodology involves a structured analysis of 77 in-scope works (39 core, 7 adjacent, 31 background), including preprints, conference papers, and benchmark resources. A compact comparison card is used to classify methods along stable dimensions (see Table 9 in the Appendix):
| Field | Meaning |
|---|---|
| Setting | Static/Dynamic, GDT, GPM |
| Optimized Level | Node, Graph, or Joint |
| Representation | Code, Text, DSL, Graph IR, etc. |
| Feedback / Evidence | Metric, Verifier, Preference, Trace, etc. |
| Update Mechanism | Search, Generator, RL, Repair/Edit, etc. |
| Cost Handling | None, Soft Objective, Hard Constraint |
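A comparison card of this shape is straightforward to represent as a record; the sketch below uses free-text fields, with AFlow's row from Table 2 transcribed as an example (the `ComparisonCard` type is illustrative, not an artifact of the survey).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComparisonCard:
    """One row of the survey's method-comparison card (fields as free text)."""
    setting: str         # e.g. "static, offline / none"
    level: str           # "node", "graph", or "joint"
    representation: str  # e.g. "code", "DSL", "graph IR"
    feedback: str        # e.g. "metric: task score"
    update: str          # e.g. "search: MCTS"
    cost_handling: str   # "none", "soft objective", or "hard constraint"

# Example: AFlow's row from Table 2, transcribed as a card.
aflow = ComparisonCard(
    setting="static, offline / none",
    level="graph",
    representation="typed operator graph",
    feedback="metric: task score",
    update="search: MCTS",
    cost_handling="soft objective",
)
```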
Empirical Validation / Results
The survey synthesizes results by categorizing methods and their characteristics. Key findings are presented in comparative tables.
Static Optimization of Agent Workflows
These methods optimize a reusable template before deployment. They are easier to inspect and benchmark, but may be brittle under distribution shift.
Table 2: Representative Core Static Workflow-Optimization Methods (Summary)
| Method | Setting (GDT/GPM) | Level | Representation | Feedback / Evidence | Update Mechanism | Cost Handling |
|---|---|---|---|---|---|---|
| AFlow (Zhang et al., 2025e) | offline / none | graph | typed operator graph | metric: task score | search: MCTS | soft objective ($) |
| ADAS (Hu et al., 2025a) | offline / none | joint | runnable code | metric: task score | search: archive meta-search | none |
| A²Flow (Zhao et al., 2025) | offline / none | graph | abstract operator graph | supervision: demos + execution | hybrid: operator learning + search | none |
| Multi-Agent Design (Zhou et al., 2025) | offline / none | joint | topology + prompts | metric: task score | hybrid: staged alternation | none |
| Optima (Chen et al., 2025) | offline / none | node | fixed scaffold + trajectories | metric: quality–efficiency reward | hybrid: generate-rank-select-train | soft objective (efficiency) |
| VFlow (Wei et al., 2025) | offline / none | graph | domain workflow graph | verifier: multi-level checks | hybrid: MCTS + cooperative evolution | soft objective (resource) |
| Maestro (Wang et al., 2025a) | offline / none | joint | typed stochastic graph | trace: reflective text + score | hybrid: alternating graph/config updates | soft objective (budget) |
Node-Level Optimization inside Fixed Scaffolds: Methods like DSPy, OPRO, EvoPrompt, CAPO, and GEPA optimize local parameters (prompts, demonstrations) within a fixed graph structure, offering a practical and fast path to improvement.
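The node-level pattern these systems share can be sketched as greedy search over prompt variants inside a fixed graph; this is a simplified illustration under stated assumptions, not the algorithm of any named system, and `mutate`/`evaluate` are hypothetical hooks.

```python
import random

def optimize_prompt(base_prompt, mutate, evaluate, rounds=10, seed=0):
    """Greedy node-level search inside a fixed scaffold.

    `mutate(prompt, rng)` proposes a variant; `evaluate(prompt)` runs the
    fixed workflow with the candidate prompt and returns a scalar score.
    The graph structure never changes; only the node parameter does.
    """
    rng = random.Random(seed)
    best, best_score = base_prompt, evaluate(base_prompt)
    for _ in range(rounds):
        cand = mutate(best, rng)
        score = evaluate(cand)
        if score > best_score:  # keep the variant only if it improves
            best, best_score = cand, score
    return best, best_score
```

In practice `evaluate` is the expensive step (a full workflow run on a validation set), which is why these methods are fast relative to graph-level search: each trial changes one local parameter rather than re-validating a new structure.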
Dynamic Optimization and Runtime Adaptation
These methods determine workflow structure at inference time, offering flexibility for heterogeneous tasks.
Table 3: Representative Core Dynamic Workflow-Optimization Methods (Summary)
| Method | Setting (GDT/GPM) | Level | Representation | Feedback / Evidence | Update Mechanism | Cost Handling |
|---|---|---|---|---|---|---|
| **Pre-execution Generation/Selection** | | | | | | |
| Difficulty-Aware Agentic Orchestration (Su et al., 2025a) | pre-exec / select | joint | modular operator workflow | proxy: difficulty estimate | controller: router + allocator | soft objective (cost) |
| Assemble Your Crew (Li et al., 2025b) | pre-exec / generate | graph | query-conditioned DAG | supervision: task labels | generator: autoregressive DAG | none |
| G-Designer (Zhang et al., 2025d) | pre-exec / generate | graph | generated communication graph | metric: quality score | generator: VGAE | soft objective (cost) |
| ScoreFlow (Wang et al., 2025e) | pre-exec / generate | graph | workflow generator | preference: score-aware pairs | preference optimization | none |
| FlowReasoner (Gao et al., 2025a) | pre-exec / generate | graph | operator-library program | metric: reward | RL: meta-controller | soft objective (cost) |
| **In-execution Editing** | | | | | | |
| DyFlow (Wang et al., 2025c) | in-exec / edit | joint | designer–executor workflow | trace: intermediate feedback | hybrid: online planning + execution | none |
| AgentConductor (Wang et al., 2026b) | in-exec / edit | graph | YAML / DAG topology | verifier: validity + execution | RL: topology revision | hard constraint (budget) |
| Aime (Shi et al., 2025) | in-exec / edit | graph | planner + dynamic actor graph | trace: runtime outcomes | controller: actor instantiation | none |
| MetaGen (Wang et al., 2026c) | in-exec / edit | joint | dynamic role pool + graph | trace: running feedback | repair/edit: training-free evolution | soft objective (cost) |
| ProAgent (Ye et al., 2023) | in-exec / edit | graph | structured JSON process graph | verifier: tests | repair/edit: incremental repair | none |
Lightweight Dynamic Adaptation (Adjacent Methods): Techniques like Adaptive Graph Pruning, DAGP, AgentDropout, DyLAN, and MasRouter perform runtime selection or pruning over a fixed super-graph, offering cost savings with inherited validity.
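The selection/pruning idea can be sketched as a greedy budgeted choice over the super-graph's optional nodes; this is a minimal illustration of the pattern (names and the utility-per-cost heuristic are assumptions, not taken from any listed method).

```python
def prune_supergraph(nodes, utility, cost, budget):
    """Runtime selection over a fixed super-graph.

    Greedily keeps the nodes with the best predicted utility per unit cost
    while staying within a cost budget. Validity is inherited from the
    super-graph because we only drop optional nodes, never add edges.
    `utility` and `cost` map node name -> float.
    """
    chosen, spent = [], 0.0
    for n in sorted(nodes, key=lambda n: utility[n] / cost[n], reverse=True):
        if spent + cost[n] <= budget:
            chosen.append(n)
            spent += cost[n]
    return chosen
```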
Feedback Signals and Update Mechanisms
The evidence used to guide optimization is tightly coupled to the update mechanism.
- Metric- and Score-Driven Optimization: Uses scalar task metrics (success, accuracy). Drives black-box search (e.g., AFlow, ADAS) and RL-based generators (e.g., FlowReasoner).
- Verifier-Driven Optimization: Uses constraints like unit tests, schema checks, or functional correctness (e.g., VFlow, MermaidFlow, AgentConductor). Enables aggressive mutation with cheap validation.
- Preference and Ranking Signals: Compares workflows or traces instead of using absolute rewards (e.g., ScoreFlow, RobustFlow, Optima). Useful when rewards are noisy.
- Trace-Derived Textual Feedback: Uses semantic critiques from execution logs to propose changes (e.g., GEPA, MetaGen, Maestro). Offers rich feedback but requires coupling with validators.
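The verifier-driven pattern above can be sketched as a mutation loop with a cheap validity gate in front of the expensive metric; `propose`, `verify`, and `score` are hypothetical hooks, and this is a generic sketch rather than the procedure of any surveyed system.

```python
import random

def verifier_guided_search(graph, propose, verify, score, steps=20, rng=None):
    """Verifier-gated workflow mutation.

    Mutate aggressively, but accept an edit only if cheap checks pass
    (`verify`: schema/unit tests -> bool) and the expensive task metric
    (`score`) does not regress.
    """
    rng = rng or random.Random(0)
    best, best_score = graph, score(graph)
    for _ in range(steps):
        cand = propose(best, rng)
        if not verify(cand):   # cheap gate filters invalid structures
            continue           # before paying for full evaluation
        s = score(cand)
        if s >= best_score:
            best, best_score = cand, s
    return best
```

The design point this illustrates: because `verify` is cheap, `propose` can be aggressive; most invalid candidates are rejected before the costly `score` call ever runs.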
Theoretical and Practical Implications
Theoretical Implications:
- Credit Assignment Problem: It remains difficult to attribute performance gains to specific structural changes (e.g., a new edge vs. more compute).
- Expressivity vs. Verifiability Trade-off: Expressive workflows are powerful but hard to validate and compare. Constrained IRs improve reproducibility but may limit solutions.
- Need for Theory: The field lacks a theory for when dynamic generation is necessary, when static templates suffice, and how sample complexity scales with structural plasticity.
Practical Implications and Design Guidance:
- When Static is Enough: For stable APIs, strong verification, and repetitive workloads, a well-searched static template is often superior due to lower cost and easier debugging.
- Selection vs. Generation vs. Editing: Choose the minimum plasticity required:
- Selection/Pruning: When tasks vary mainly in difficulty or required compute.
- Pre-execution Generation: When tasks require genuinely different decomposition or communication patterns.
- In-execution Editing: For interactive environments where runtime observations fundamentally change the plan.
- Graph vs. Prompt Optimization: If errors arise from missing verification, poor decomposition, or incorrect control flow, graph-level optimization is the higher-leverage intervention over prompt tuning.
- Value of Verifiers: Verifiers pay off most when they are cheap and semantically meaningful (e.g., unit tests, schema checks). Placement and invocation frequency are key design choices.
- A Practical Hybrid Recipe:
- Start with a constrained static scaffold and optimize node-level prompts.
- Add graph-level search if trace analysis reveals structural failures.
- For heterogeneous tasks, prefer runtime selection before full generation.
- Use in-execution editing only for high environmental uncertainty.
- Compress/prune communication for efficiency after finding a capable design.
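The staged recipe can be summarized as an escalation driver that adds structural plasticity only when evidence demands it; every callable here is an illustrative hook standing in for a whole optimization stage, not an API from any surveyed system.

```python
def hybrid_optimize(scaffold, tune_prompts, trace_shows_structural_failure,
                    graph_search, needs_per_input_structure, add_router):
    """Staged escalation: node-level first, structure only when needed."""
    workflow = tune_prompts(scaffold)             # 1. optimize node-level prompts
    if trace_shows_structural_failure(workflow):
        workflow = graph_search(workflow)         # 2. add graph-level search
    if needs_per_input_structure(workflow):
        workflow = add_router(workflow)           # 3. runtime selection/routing
    return workflow
```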
Conclusion
This survey provides a workflow-centered view of LLM agent systems, unifying them under the Agentic Computation Graph (ACG) abstraction. By distinguishing static from dynamic structure determination and analyzing methods along the axes of optimization target, feedback evidence, and update mechanism, it offers a framework for comparing and designing workflow optimization techniques.
A key conclusion is that workflow structure should be a first-class design object, with evaluation reporting not just final answers but also the workflow used, its variation, and its cost. Future work must address structural credit assignment, continual adaptation under drift, improved benchmarks, and the development of a theoretical foundation for the field.
Table 5: A Proposed Minimum Reporting Protocol for Workflow-Optimization Papers
| Dimension | What should be reported | Why it matters |
|---|---|---|
| Workflow representation | code, DSL, graph IR, schema constraints, executable interpreter, available operators and tools | Determines what can be searched, validated, or edited |
| Structural setting | static or dynamic, GDT, GPM, admissible edits, routing policy, stopping rules | Clarifies what kind of structural variation the method actually allows |
| Model and tool configuration | base models, decoding settings, tool registry, verifier placement, memory policy | Separates workflow effects from backbone or tool effects |
| Online inference cost | tokens, LLM calls, tool calls, latency, wall-clock time, dollars, cost-per-success | Makes quality–cost trade-offs scientifically comparable |
| Graph-level metrics | node count, depth, width, communication volume, edit count, structural variance | Treats the workflow as a first-class output rather than an invisible implementation detail |
| Robustness tests | paraphrases, noisy retrieval, tool failure injection, API drift, unseen tools, strict budget caps | Checks whether the workflow policy is stable outside nominal conditions |