Summary (Overview)

  • Workflow-Centered Formulation: This survey introduces Agentic Computation Graphs (ACGs) as a unifying abstraction for LLM agent workflows. It distinguishes three key objects: reusable templates, run-specific realized graphs, and execution traces.
  • Taxonomy of Structure Determination: The literature is organized by when workflow structure is determined, using the concepts of Graph Determination Time (GDT) and Graph Plasticity Mode (GPM). This distinguishes static methods (fixed reusable templates) from dynamic methods (run-specific generation, selection, or editing).
  • Cross-Cutting Synthesis: Methods are synthesized along three axes: optimization target (node, graph, or joint), feedback evidence (metric, verifier, preference, or trace), and update mechanism. This clarifies what is changed, why, and how.
  • Evaluation Protocol: The survey advocates for structure-aware evaluation, moving beyond just downstream task metrics to also report graph-level properties, execution cost, robustness, and structural variation across inputs.
  • Design Guidance: Based on surveyed patterns, practical guidance is provided: use static optimization for stable, repetitive tasks; start with selection/pruning before full generation for dynamic needs; and reserve in-execution editing for highly interactive environments.

Introduction and Theoretical Foundation

Large Language Model (LLM) systems are evolving from single-prompt chatbots to complex, executable workflows that coordinate multiple actions (LLM calls, tool use, retrieval, code execution, verification, etc.). The workflow structure—the components, their dependencies, and information flow—critically impacts both effectiveness and efficiency.

This survey positions itself within the broader literature by focusing specifically on workflow optimization as a primary design problem, distinct from adjacent topics like general agent planning, tool learning, or multi-agent collaboration surveys (see Table 1 in the paper).

Core Abstraction: Agentic Computation Graph (ACG). An ACG is a directed graph whose nodes perform atomic actions and whose edges encode control, data, or communication dependencies. A node can be described by the tuple $\langle \text{Instruction, Context, Tools, Model/Decoding} \rangle$.

Key Distinctions:

  1. ACG Template ($\bar{G}$): A reusable executable specification $\bar{G} = (V, E, \Phi, \Sigma, A)$, where $V$ is the node set, $E$ the edge set, $\Phi$ the node parameters (prompts, tools), $\Sigma$ a scheduling policy, and $A$ the set of admissible edit actions.
  2. Realized Graph ($G_{run}$): The workflow structure actually used for a particular run, which may be a subgraph or edited version of the template.
  3. Execution Trace ($\tau$): The sequence of states, actions, observations, and costs produced by executing $G_{run}$: $\tau = \{(s_t, a_t, o_t, c_t)\}_{t=1}^{T}$.
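The three objects above can be made concrete as plain record types. This is an illustrative sketch, not an implementation from the survey; all class and field names (`ACGTemplate`, `RealizedGraph`, `TraceStep`, etc.) are assumptions chosen to mirror the tuple definitions.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """One atomic action, parameterized per the node tuple."""
    instruction: str          # prompt / role description
    context: str = ""         # retrieved or inherited context
    tools: tuple = ()         # callable tool names
    model: str = "default"    # model + decoding config (simplified)

@dataclass
class ACGTemplate:
    """Reusable spec (V, E, Phi, Sigma, A)."""
    nodes: dict                      # V, with per-node parameters Phi
    edges: list                      # E: (src, dst) dependencies
    scheduler: str = "topological"   # Sigma: scheduling policy
    edit_actions: tuple = ()         # A: admissible edit actions

@dataclass
class RealizedGraph:
    """G_run: the structure actually used for one input x."""
    template: ACGTemplate
    active_nodes: set                # subgraph selected/edited at run time

@dataclass
class TraceStep:
    """One (s_t, a_t, o_t, c_t) tuple of an execution trace."""
    state: str
    action: str
    observation: str
    cost: float

template = ACGTemplate(
    nodes={"plan": Node("Decompose the task."),
           "solve": Node("Answer each subtask."),
           "verify": Node("Check the answer.")},
    edges=[("plan", "solve"), ("solve", "verify")],
)
# A realized graph may be a strict subgraph of the template:
run = RealizedGraph(template, active_nodes={"plan", "solve"})
print(len(run.active_nodes))  # 2
```

The separation matters for evaluation: metrics attach to traces, plasticity attaches to the gap between template and realized graph.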

Optimization Formulation: Workflow optimization is framed as balancing task quality against execution cost:

$$\max \; \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{G_{run} \mid x} \left[ \mathbb{E}_{\tau \mid G_{run}, x} \left[ R(\tau; x) - \lambda C(\tau) \right] \right] \right]$$

where $R(\tau; x)$ is a task-quality score, $C(\tau)$ is execution cost, and $\lambda$ controls the trade-off.
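The nested expectations can be estimated empirically by sampling inputs, realizing a graph per input, and averaging penalized rewards over traces. The sketch below is a toy Monte Carlo estimator under assumed stand-in `sample_trace` and `pick_graph` functions; it is not code from any surveyed method.

```python
LAMBDA = 0.01  # quality-cost trade-off lambda

def sample_trace(x: int, g_id: int):
    """Stand-in for executing G_run on input x: returns (R, C)."""
    reward = 1.0 if (x + g_id) % 3 else 0.0   # toy task-quality score
    cost = 50 + 10 * g_id                      # toy token cost
    return reward, cost

def estimate_objective(inputs, pick_graph, n_samples=4):
    """Monte Carlo estimate of E_x E_{G|x} E_{tau|G,x}[R - lambda*C]."""
    total = 0.0
    for x in inputs:                       # E over x ~ D
        g = pick_graph(x)                  # E over G_run | x
        for _ in range(n_samples):         # E over tau | G_run, x
            r, c = sample_trace(x, g)
            total += r - LAMBDA * c
    return total / (len(inputs) * n_samples)

score = estimate_objective(inputs=range(10), pick_graph=lambda x: x % 2)
print(round(score, 3))  # 0.15
```

Note that the graph policy `pick_graph` sits inside the outer expectation: for dynamic methods it is itself the object being optimized, while for static methods it is a constant.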

Organizing Principle: Structure Determination

  • Static: Deployed structure is a reusable template fixed after training/search.
  • Dynamic: Part of the realized graph is constructed, selected, or edited at inference time.
    • Graph Determination Time (GDT): Offline, Pre-execution, In-execution.
    • Graph Plasticity Mode (GPM): None, Select, Generate, Edit.
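The two axes form a small, closed vocabulary, so they can be encoded directly as enums; this is a hedged sketch for framing only, and the `is_static` predicate is an assumed helper, not a definition from the survey.

```python
from enum import Enum

class GDT(Enum):
    """Graph Determination Time: when structure is fixed."""
    OFFLINE = "offline"
    PRE_EXECUTION = "pre-execution"
    IN_EXECUTION = "in-execution"

class GPM(Enum):
    """Graph Plasticity Mode: what structural change is allowed."""
    NONE = "none"
    SELECT = "select"
    GENERATE = "generate"
    EDIT = "edit"

def is_static(gdt: GDT, gpm: GPM) -> bool:
    # Static methods fix a reusable template offline with no run-time plasticity.
    return gdt is GDT.OFFLINE and gpm is GPM.NONE

print(is_static(GDT.OFFLINE, GPM.NONE))          # True  (e.g. an AFlow-style setting)
print(is_static(GDT.PRE_EXECUTION, GPM.SELECT))  # False (runtime selection is dynamic)
```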

Methodology

The survey methodology involves a structured analysis of 77 in-scope works (39 core, 7 adjacent, 31 background), including preprints, conference papers, and benchmark resources. A compact comparison card is used to classify methods along stable dimensions (see Table 9 in the Appendix):

| Field | Meaning |
|---|---|
| Setting | Static/Dynamic, GDT, GPM |
| Optimized Level | Node, Graph, or Joint |
| Representation | Code, Text, DSL, Graph IR, etc. |
| Feedback / Evidence | Metric, Verifier, Preference, Trace, etc. |
| Update Mechanism | Search, Generator, RL, Repair/Edit, etc. |
| Cost Handling | None, Soft Objective, Hard Constraint |
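A comparison card is naturally a fixed-shape record. The sketch below is illustrative (the `ComparisonCard` class is an assumption, not an artifact of the survey); the example values follow the AFlow row of Table 2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComparisonCard:
    """One method classified along the six stable dimensions."""
    setting: str         # Static/Dynamic + GDT + GPM, e.g. "offline / none"
    level: str           # "node", "graph", or "joint"
    representation: str  # code, text, DSL, graph IR, ...
    feedback: str        # metric, verifier, preference, trace, ...
    update: str          # search, generator, RL, repair/edit, ...
    cost: str            # "none", "soft objective", "hard constraint"

aflow = ComparisonCard(
    setting="offline / none",
    level="graph",
    representation="typed operator graph",
    feedback="metric: task score",
    update="search: MCTS",
    cost="soft objective",
)
print(aflow.level)  # graph
```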

Empirical Validation / Results

The survey synthesizes results by categorizing methods and their characteristics. Key findings are presented in comparative tables.

Static Optimization of Agent Workflows

Methods that optimize a reusable template before deployment. They are easier to inspect and benchmark but may be brittle to distribution shift.

Table 2: Representative Core Static Workflow-Optimization Methods (Summary)

| Method | Setting (GDT/GPM) | Level | Representation | Feedback / Evidence | Update Mechanism | Cost Handling |
|---|---|---|---|---|---|---|
| AFlow (Zhang et al., 2025e) | offline / none | graph | typed operator graph | metric: task score | search: MCTS | soft objective ($) |
| ADAS (Hu et al., 2025a) | offline / none | joint | runnable code | metric: task score | search: archive meta-search | none |
| A²Flow (Zhao et al., 2025) | offline / none | graph | abstract operator graph | supervision: demos + execution | hybrid: operator learning + search | none |
| Multi-Agent Design (Zhou et al., 2025) | offline / none | joint | topology + prompts | metric: task score | hybrid: staged alternation | none |
| Optima (Chen et al., 2025) | offline / none | node | fixed scaffold + trajectories | metric: quality–efficiency reward | hybrid: generate-rank-select-train | soft objective (efficiency) |
| VFlow (Wei et al., 2025) | offline / none | graph | domain workflow graph | verifier: multi-level checks | hybrid: MCTS + cooperative evolution | soft objective (resource) |
| Maestro (Wang et al., 2025a) | offline / none | joint | typed stochastic graph | trace: reflective text + score | hybrid: alternating graph/config updates | soft objective (budget) |

Node-Level Optimization inside Fixed Scaffolds: Methods like DSPy, OPRO, EvoPrompt, CAPO, and GEPA optimize local parameters (prompts, demonstrations) within a fixed graph structure, offering a practical and fast path to improvement.
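This family can be summarized as search over node parameters while holding the graph fixed. Below is a minimal greedy sketch in that spirit; it is not the API of DSPy, OPRO, or any other named system, and `evaluate` is a stand-in for scoring the fixed workflow on a dev set.

```python
# Candidate local edits to one node's instruction; the graph never changes.
CANDIDATE_EDITS = [
    "",                       # keep the base prompt
    " Think step by step.",
    " Answer concisely.",
    " Cite evidence.",
]

def evaluate(prompt: str) -> float:
    """Stand-in: run the FIXED workflow with this prompt, score on a dev set."""
    return 0.6 + 0.3 * ("step by step" in prompt)

base = "Solve the task."
best, best_score = base, evaluate(base)
for suffix in CANDIDATE_EDITS:      # search only node parameters Phi;
    cand = base + suffix            # topology (V, E) is untouched
    score = evaluate(cand)
    if score > best_score:
        best, best_score = cand, score

print(round(best_score, 2))  # 0.9
```

Because the search space is local and the scaffold is fixed, each candidate is cheap to validate, which is why this is often the first optimization applied in practice.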

Dynamic Optimization and Runtime Adaptation

Methods that determine workflow structure at inference time, offering flexibility for heterogeneous tasks.

Table 3: Representative Core Dynamic Workflow-Optimization Methods (Summary)

| Method | Setting (GDT/GPM) | Level | Representation | Feedback / Evidence | Update Mechanism | Cost Handling |
|---|---|---|---|---|---|---|
| *Pre-execution Generation/Selection* | | | | | | |
| Difficulty-Aware Agentic Orchestration (Su et al., 2025a) | pre-exec / select | joint | modular operator workflow | proxy: difficulty estimate | controller: router + allocator | soft objective (cost) |
| Assemble Your Crew (Li et al., 2025b) | pre-exec / generate | graph | query-conditioned DAG | supervision: task labels | generator: autoregressive DAG | none |
| G-Designer (Zhang et al., 2025d) | pre-exec / generate | graph | generated communication graph | metric: quality score | generator: VGAE | soft objective (cost) |
| ScoreFlow (Wang et al., 2025e) | pre-exec / generate | graph | workflow generator | preference: score-aware pairs | preference optimization | none |
| FlowReasoner (Gao et al., 2025a) | pre-exec / generate | graph | operator-library program | metric: reward | RL: meta-controller | soft objective (cost) |
| *In-execution Editing* | | | | | | |
| DyFlow (Wang et al., 2025c) | in-exec / edit | joint | designer–executor workflow | trace: intermediate feedback | hybrid: online planning + execution | none |
| AgentConductor (Wang et al., 2026b) | in-exec / edit | graph | YAML / DAG topology | verifier: validity + execution | RL: topology revision | hard constraint (budget) |
| Aime (Shi et al., 2025) | in-exec / edit | graph | planner + dynamic actor graph | trace: runtime outcomes | controller: actor instantiation | none |
| MetaGen (Wang et al., 2026c) | in-exec / edit | joint | dynamic role pool + graph | trace: running feedback | repair/edit: training-free evolution | soft objective (cost) |
| ProAgent (Ye et al., 2023) | in-exec / edit | graph | structured JSON process graph | verifier: tests | repair/edit: incremental repair | none |

Lightweight Dynamic Adaptation (Adjacent Methods): Techniques like Adaptive Graph Pruning, DAGP, AgentDropout, DyLAN, and MasRouter perform runtime selection or pruning over a fixed super-graph, offering cost savings with inherited validity.
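The pruning pattern can be sketched as follows: score each node of a validated super-graph for the current query and keep only high-relevance nodes, so every realized graph is a sub-DAG that inherits the template's validity. The node names, scores, and `relevance` router below are illustrative assumptions, not any paper's actual components.

```python
SUPER_GRAPH = {                 # node -> downstream dependents
    "plan":       ["coder", "researcher"],
    "coder":      ["verify"],
    "researcher": ["verify"],
    "verify":     [],
}

def relevance(node: str, query: str) -> float:
    """Stand-in for a learned router scoring node utility for this query."""
    return 1.0 if node in ("plan", "coder", "verify") or node in query else 0.2

def prune(query: str, keep_threshold: float = 0.5) -> dict:
    kept = {n for n in SUPER_GRAPH if relevance(n, query) >= keep_threshold}
    # Drop edges touching removed nodes; the result stays a valid sub-DAG.
    return {n: [d for d in deps if d in kept]
            for n, deps in SUPER_GRAPH.items() if n in kept}

run_graph = prune("write a sorting function")
print(sorted(run_graph))  # ['coder', 'plan', 'verify'] -- researcher pruned
```

The appeal is exactly the trade-off named above: selection spends no budget on structure synthesis, yet still adapts per-query compute.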

Feedback Signals and Update Mechanisms

The evidence used to guide optimization is tightly coupled to the update mechanism.

  1. Metric- and Score-Driven Optimization: Uses scalar task metrics (success, accuracy). Drives black-box search (e.g., AFlow, ADAS) and RL-based generators (e.g., FlowReasoner).
  2. Verifier-Driven Optimization: Uses constraints like unit tests, schema checks, or functional correctness (e.g., VFlow, MermaidFlow, AgentConductor). Enables aggressive mutation with cheap validation.
  3. Preference and Ranking Signals: Compares workflows or traces instead of using absolute rewards (e.g., ScoreFlow, RobustFlow, Optima). Useful when rewards are noisy.
  4. Trace-Derived Textual Feedback: Uses semantic critiques from execution logs to propose changes (e.g., GEPA, MetaGen, Maestro). Offers rich feedback but requires coupling with validators.
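Pattern 2 above (verifier-driven optimization) has a characteristic control loop: mutate aggressively, but accept a candidate only if a cheap check passes. The sketch below illustrates that loop under assumed stand-in `mutate` and `verifier` functions; it is not code from VFlow, MermaidFlow, or AgentConductor.

```python
import random

random.seed(7)

def mutate(graph: list) -> list:
    """Randomly add or remove an edge in a small edge-list workflow."""
    pool = [("plan", "solve"), ("solve", "verify"), ("plan", "verify")]
    edge = random.choice(pool)
    return [e for e in graph if e != edge] if edge in graph else graph + [edge]

def verifier(graph: list) -> bool:
    """Cheap structural check: the workflow must end in a verify step."""
    return any(dst == "verify" for _, dst in graph)

graph = [("plan", "solve"), ("solve", "verify")]
for _ in range(50):
    cand = mutate(graph)
    if verifier(cand):      # reject invalid mutations before any costly run
        graph = cand

print(verifier(graph))  # True: the invariant is preserved throughout
```

Because invalid candidates are filtered before any LLM call, the verifier converts an expensive reward query into a near-free feasibility gate, which is what makes aggressive mutation affordable.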

Theoretical and Practical Implications

Theoretical Implications:

  • Credit Assignment Problem: It remains difficult to attribute performance gains to specific structural changes (e.g., a new edge vs. more compute).
  • Expressivity vs. Verifiability Trade-off: Expressive workflows are powerful but hard to validate and compare. Constrained IRs improve reproducibility but may limit solutions.
  • Need for Theory: The field lacks a theory for when dynamic generation is necessary, when static templates suffice, and how sample complexity scales with structural plasticity.

Practical Implications and Design Guidance:

  • When Static is Enough: For stable APIs, strong verification, and repetitive workloads, a well-searched static template is often superior due to lower cost and easier debugging.
  • Selection vs. Generation vs. Editing: Choose the minimum plasticity required:
    • Selection/Pruning: When tasks vary mainly in difficulty or required compute.
    • Pre-execution Generation: When tasks require genuinely different decomposition or communication patterns.
    • In-execution Editing: For interactive environments where runtime observations fundamentally change the plan.
  • Graph vs. Prompt Optimization: If errors arise from missing verification, poor decomposition, or incorrect control flow, graph-level optimization is the higher-leverage intervention over prompt tuning.
  • Value of Verifiers: Verifiers pay off most when they are cheap and semantically meaningful (e.g., unit tests, schema checks). Placement and invocation frequency are key design choices.
  • A Practical Hybrid Recipe:
    1. Start with a constrained static scaffold and optimize node-level prompts.
    2. Add graph-level search if trace analysis reveals structural failures.
    3. For heterogeneous tasks, prefer runtime selection before full generation.
    4. Use in-execution editing only for high environmental uncertainty.
    5. Compress/prune communication for efficiency after finding a capable design.
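The recipe's ordering amounts to choosing the cheapest intervention that matches the observed failure mode. A minimal decision helper, with illustrative diagnostic flags that are assumptions rather than measured quantities, might look like:

```python
def next_intervention(structural_failures: bool,
                      heterogeneous_tasks: bool,
                      high_env_uncertainty: bool) -> str:
    """Pick the minimum-plasticity step suggested by the hybrid recipe."""
    if high_env_uncertainty:
        return "in-execution editing"            # step 4
    if heterogeneous_tasks:
        return "runtime selection"               # step 3, before full generation
    if structural_failures:
        return "graph-level search"              # step 2
    return "node-level prompt optimization"      # step 1, the default start

print(next_intervention(False, False, False))  # node-level prompt optimization
print(next_intervention(True, False, False))   # graph-level search
```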

Conclusion

This survey provides a workflow-centered view of LLM agent systems, unifying them under the Agentic Computation Graph (ACG) abstraction. By distinguishing static from dynamic structure determination and analyzing methods along the axes of optimization target, feedback evidence, and update mechanism, it offers a framework for comparing and designing workflow optimization techniques.

A key conclusion is that workflow structure should be a first-class design object, with evaluation reporting not just final answers but also the workflow used, its variation, and its cost. Future work must address structural credit assignment, continual adaptation under drift, improved benchmarks, and the development of a theoretical foundation for the field.

Table 5: A Proposed Minimum Reporting Protocol for Workflow-Optimization Papers

| Dimension | What should be reported | Why it matters |
|---|---|---|
| Workflow representation | code, DSL, graph IR, schema constraints, executable interpreter, available operators and tools | Determines what can be searched, validated, or edited |
| Structural setting | static or dynamic, GDT, GPM, admissible edits, routing policy, stopping rules | Clarifies what kind of structural variation the method actually allows |
| Model and tool configuration | base models, decoding settings, tool registry, verifier placement, memory policy | Separates workflow effects from backbone or tool effects |
| Online inference cost | tokens, LLM calls, tool calls, latency, wall-clock time, dollars, cost-per-success | Makes quality–cost trade-offs scientifically comparable |
| Graph-level metrics | node count, depth, width, communication volume, edit count, structural variance | Treats the workflow as a first-class output rather than an invisible implementation detail |
| Robustness tests | paraphrases, noisy retrieval, tool failure injection, API drift, unseen tools, strict budget caps | Checks whether the workflow policy is stable outside nominal conditions |