Summary of "CRAFTER: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs"

Summary (Overview)

Novel harness framework: Introduces CRAFTER, a multi-agent orchestration layer that wraps an image generator with planning, verification, and structured revision, enabling generalization across three figure types (academic diagrams, posters, infographics) and four input conditions (text-only, mask completion, sketch, key-element composition) without architectural changes.
End-to-end editable pipeline: CRAFTEDITOR applies the same harness pattern to convert raster outputs into editable SVGs, enabling element-level edits (swap icons, resize components) that no prior system supports.
Comprehensive benchmark: CRAFTBENCH contains 279 curated samples with human quality annotation, spanning three figure types and four input conditions, paired with a VLM-as-judge evaluation protocol that removes position bias and adapts to diverse tasks.
State-of-the-art results: CRAFTER achieves a 50.2% overall win-rate on CRAFTBENCH (using the Nano Banana 2 backbone), outperforming the same generator used standalone (19.9%) and the PaperBanana agentic baseline with the same backbone (28.0%); CRAFTEDITOR surpasses all baselines on raster-to-SVG conversion. Ablations show that removing any single mechanism lowers the score by 5.04–8.90 points.

Introduction and Theoretical Foundation

Scientific figure generation remains labor-intensive because a figure is a structured composition of discrete semantic components (labeled boxes, arrows, icons, annotations) with precise spatial relationships. Existing systems are narrow in scope, targeting a single figure type under text-only input, and produce raster outputs that cannot be locally revised. Code-generation methods (e.g., TikZ) yield editable but visually poorer outputs.

The paper identifies three core difficulties that prevent reliable generation:

High output variance on structured layouts: generators produce different localized errors (garbled labels, misaligned connectors) across seeds.
Prompt degradation: accumulating free-text corrections introduces contradictions that degrade faithfulness.
Absence of structured feedback: scalar scores (e.g., 5/10) provide no actionable corrections.

The key insight is that scientific figures require not a more powerful backbone but a harness—an orchestration layer that wraps an existing engine with structured specification memory, enabling targeted correction and closed-loop verification.

Methodology

The Harness Abstraction

The harness is formalized as a four-role loop over a shared, evolving specification $S$. At each round $t$:

$$\mathbf{p}_t = D(\text{input}, S_{t-1}), \quad \mathbf{a}_t = E(\mathbf{p}_t)$$

$$\mathbf{d}_t = V(\mathbf{a}_t, \text{input}, S_{t-1}), \quad S_t = R(\mathbf{d}_t, S_{t-1})$$

where:

$D$ (Designer) produces an actionable plan $\mathbf{p}_t$
$E$ (Executor) renders $\mathbf{p}_t$ into artifact $\mathbf{a}_t$
$V$ (Verifier) emits a directive diagnostic $\mathbf{d}_t$ (per-dimension scores, identified defects, suggested corrections)
$R$ (Reviser) applies typed edits (structured operations: add layout constraint, ban artifact category, resize named element) to $S_{t-1}$

Table 1: Harness role assignments for CRAFTER and CRAFTEDITOR

Role	CRAFTER	CRAFTEDITOR
$D$ Designer	Plan generator	SVG skeleton generator
$E$ Executor	Image-generation backend	Element-injection code
$V$ Verifier	Multi-dim. critic	Hybrid critic (VLM + programmatic)
$R$ Reviser	Specification refiner	SVG editor
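
The four-role loop can be sketched in a few lines of Python. This is only a minimal illustration of the update equations, not the authors' implementation: the `Spec` and `Diagnostic` classes, the `run_harness` function, and the toy role functions are all hypothetical names, and the specification is reduced to a list of constraint strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Spec:
    """Shared specification S, reduced here to a list of constraints (assumption)."""
    constraints: List[str] = field(default_factory=list)


@dataclass
class Diagnostic:
    """Verifier output d_t: a score plus suggested corrections (assumption)."""
    score: float
    corrections: List[str]


def run_harness(
    user_input: str,
    designer: Callable[[str, Spec], str],
    executor: Callable[[str], str],
    verifier: Callable[[str, str, Spec], Diagnostic],
    reviser: Callable[[Diagnostic, Spec], Spec],
    max_rounds: int = 3,
) -> Tuple[str, Spec]:
    """Run the D -> E -> V -> R loop for a fixed number of rounds."""
    spec = Spec()  # S_0
    artifact = ""
    for _ in range(max_rounds):
        plan = designer(user_input, spec)            # p_t = D(input, S_{t-1})
        artifact = executor(plan)                    # a_t = E(p_t)
        diag = verifier(artifact, user_input, spec)  # d_t = V(a_t, input, S_{t-1})
        spec = reviser(diag, spec)                   # S_t = R(d_t, S_{t-1})
    return artifact, spec


# Toy stand-ins for the four roles, used only to exercise the loop.
def toy_designer(user_input: str, spec: Spec) -> str:
    return user_input + " | " + "; ".join(spec.constraints)


def toy_executor(plan: str) -> str:
    return f"<render of: {plan}>"


def toy_verifier(artifact: str, user_input: str, spec: Spec) -> Diagnostic:
    n = len(spec.constraints)
    return Diagnostic(score=float(n), corrections=[f"fix-{n + 1}"])


def toy_reviser(diag: Diagnostic, spec: Spec) -> Spec:
    return Spec(constraints=spec.constraints + diag.corrections)
```

Calling `run_harness("pipeline diagram", toy_designer, toy_executor, toy_verifier, toy_reviser)` runs three rounds; the point of the sketch is that corrections accumulate in the specification rather than in the prompt text.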

CRAFTER: Multi-Agent Figure Generation

Five cooperating agents plus an image-generation backend implement the four roles:

Intent reasoner: Analyzes context and instruction, seeds initial specification $S_0$
Plan generator ($D$): Proposes $K$ candidate visual plans in parallel (diversity-driven exploration)
Image-generation backend ($E$): Renders each plan to raster
Critic ($V$): Issues per-dimension scores along six axes, identifies defects, suggests corrections
Specification refiner ($R$): Converts diagnostics into typed edits on $S$
Convergence judge: Governs loop (accept, refine, or revert to best-so-far)
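
The convergence judge's accept / refine / revert behavior might look roughly like the sketch below. It assumes a single scalar score per round and a hypothetical `accept_threshold`; the summary does not state how the actual gate is defined, so the function name, the threshold value, and the scripted scores are illustrative only.

```python
from typing import Callable, List, Optional, Tuple


def verify_then_refine(
    run_round: Callable[[int], Tuple[str, float]],
    accept_threshold: float = 8.0,  # hypothetical early-exit gate
    max_rounds: int = 3,            # the summary reports up to T = 3 rounds
) -> Tuple[Optional[str], float, int]:
    """Return (best artifact, its score, rounds used)."""
    best_artifact: Optional[str] = None
    best_score = float("-inf")
    rounds_used = 0
    for t in range(1, max_rounds + 1):
        rounds_used = t
        artifact, score = run_round(t)
        # Best-so-far reversion: a worse later round never replaces a better one.
        if score > best_score:
            best_artifact, best_score = artifact, score
        # Early exit: accept as soon as the gate is passed.
        if score >= accept_threshold:
            break
    return best_artifact, best_score, rounds_used


def scripted_rounds(scores: List[float]) -> Callable[[int], Tuple[str, float]]:
    """Toy round function that replays a fixed list of scores."""
    def run_round(t: int) -> Tuple[str, float]:
        return f"figure_v{t}", scores[t - 1]
    return run_round
```

With scripted scores `[6.0, 7.5, 5.0]` the loop uses all three rounds and reverts to the second figure; with `[6.0, 9.0, 9.5]` it exits after round two.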

Three mechanisms address the key failure modes:

Diversity-driven plan exploration ($K$ adaptive candidates) escapes fundamentally unsuitable compositions before refining.
Structured corrective layer replaces free-text accumulation with typed edits on $S$, keeping the specification consistent.
Verify-then-refine with directive critic: iterative loop (up to $T=3$ rounds) with early-exit gate and best-so-far reversion.
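
To make the idea of typed edits concrete, here is a small sketch using the three example operations named earlier (add layout constraint, ban artifact category, resize named element). The class names, field names, and the `apply_edit` function are assumptions made for illustration; the paper's actual schema is not given in this summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Union


@dataclass
class FigureSpec:
    """Hypothetical structured specification S."""
    layout_constraints: List[str] = field(default_factory=list)
    banned_artifacts: Set[str] = field(default_factory=set)
    element_scales: Dict[str, float] = field(default_factory=dict)


@dataclass(frozen=True)
class AddLayoutConstraint:
    text: str


@dataclass(frozen=True)
class BanArtifactCategory:
    category: str


@dataclass(frozen=True)
class ResizeElement:
    name: str
    scale: float


TypedEdit = Union[AddLayoutConstraint, BanArtifactCategory, ResizeElement]


def apply_edit(spec: FigureSpec, edit: TypedEdit) -> FigureSpec:
    """Return a new spec; each edit type touches exactly one field."""
    new = FigureSpec(
        list(spec.layout_constraints),
        set(spec.banned_artifacts),
        dict(spec.element_scales),
    )
    if isinstance(edit, AddLayoutConstraint):
        # Repeating the same constraint is a no-op, so the spec cannot bloat.
        if edit.text not in new.layout_constraints:
            new.layout_constraints.append(edit.text)
    elif isinstance(edit, BanArtifactCategory):
        new.banned_artifacts.add(edit.category)
    elif isinstance(edit, ResizeElement):
        new.element_scales[edit.name] = new.element_scales.get(edit.name, 1.0) * edit.scale
    else:
        raise TypeError(f"unsupported edit: {edit!r}")
    return new
```

Because every edit lands in a dedicated field, a repeated or conflicting correction overwrites or deduplicates state instead of appending another sentence to a prompt, which is the failure mode the structured corrective layer is meant to avoid.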

CRAFTEDITOR: Raster-to-SVG Conversion

Three phases:

Extraction: Instruction-driven canvas cleaning: a VLM designer authors a keep/delete plan, an editor executes it at the pixel level, and a verifier inspects the result (up to $T=3$ rounds).
Processing: Each element is captioned, grounded (spatial coordinates), and classified (vector vs. raster).
Composition: Iterative SVG assembly: the designer proposes two candidate SVG skeletons (at different decoding temperatures); the executor splices in the extracted assets; a hybrid critic (VLM plus programmatic checkers for text overflow, arrow-endpoint accuracy, overlap, and missing components) drives refinement (up to $T=4$ rounds, with best-so-far reversion).
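
Two of the programmatic checks named above (overlap and text overflow) are simple enough to sketch directly. The version below assumes axis-aligned bounding boxes and a crude fixed-width estimate of text size; the `Box` class, the function names, and the `char_width_ratio` value are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one SVG element (hypothetical layout)."""
    name: str
    x: float
    y: float
    w: float
    h: float


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not intersect."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return dx * dy if dx > 0 and dy > 0 else 0.0


def find_overlaps(boxes: List[Box], min_area: float = 0.0) -> List[Tuple[str, str]]:
    """Report every pair of elements whose boxes overlap by more than min_area."""
    return [
        (a.name, b.name)
        for a, b in combinations(boxes, 2)
        if overlap_area(a, b) > min_area
    ]


def text_overflows(label: str, box: Box, font_size: float,
                   char_width_ratio: float = 0.6) -> bool:
    """Rough check: estimated text width (chars * font size * ratio) vs. box width."""
    return len(label) * font_size * char_width_ratio > box.w
```

Checks of this kind return named defects (which two elements collide, which label overflows), which is what lets a critic hand the reviser a directive correction rather than a bare score.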

Empirical Validation / Results

Benchmarks

PaperBanana-Bench: 100 text-to-image academic diagram samples [Zhu et al., 2026a]
CRAFTBENCH: 279 samples (179 text-to-image, 30 mask-completion, 40 sketch, 30 key-element) across academic (140), poster (109), infographic (30) styles; all with human quality annotation.

Evaluation Protocol

A VLM judge (Gemini 3.5 Flash) scores each output independently on a 0–10 scale per aspect; because outputs are never presented side by side, position bias is avoided. The aspects are content faithfulness, readability, and task-specific format fidelity. A weighted mean of the aspect scores yields a total score, and the relative margin between the totals determines a verdict in {Model, Tie, Human} under a calibrated tie band. The benchmark-level score is the average of the verdicts mapped to 100/50/0. A human study confirms that the protocol aligns with human preference.
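
The scoring pipeline can be illustrated with a short sketch. The aspect weights, the tie-band width, and the exact definition of the relative margin are not given in this summary, so the values and formulas below are placeholders chosen only to show the shape of the computation.

```python
from typing import Dict, List

# Assumed mapping of verdicts to points (Model win = 100, Tie = 50, Human win = 0).
VERDICT_POINTS = {"Model": 100.0, "Tie": 50.0, "Human": 0.0}


def weighted_total(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-aspect scores (weights are hypothetical)."""
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in weights) / total_weight


def verdict(model_total: float, human_total: float, tie_band: float = 0.05) -> str:
    """Map the relative margin to a verdict; this margin formula is an assumption."""
    denom = max(model_total, human_total, 1e-9)
    margin = (model_total - human_total) / denom
    if abs(margin) <= tie_band:
        return "Tie"
    return "Model" if margin > 0 else "Human"


def bench_score(verdicts: List[str]) -> float:
    """Benchmark-level score: mean of the mapped verdict points."""
    return sum(VERDICT_POINTS[v] for v in verdicts) / len(verdicts)
```

For example, `bench_score(["Model", "Tie", "Human", "Model"])` averages 100, 50, 0, and 100 to give 62.5, which is how a percentage-style score arises from per-sample verdicts.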

Main Results

Table 2: Results (%) on PaperBanana-Bench and CRAFTBENCH. Bold marks column-best; $\Delta$ is the gap between CRAFTER and its standalone generator.

Method	PaperBanana-Bench (Overall)	CraftBench (Overall)
Standalone generators
GLM-Image	0.00	0.00
Qwen-Image	0.00	0.40
GPT-Image-2*	1.37	15.80
Nano Banana 2	11.13	19.90
Nano Banana Pro	22.43	22.40
Agentic frameworks
AutoFigure (w/ Nano Banana 2)	1.37	2.20
PaperBanana (w/ Nano Banana 2)	33.73	28.00
PaperBanana (w/ Nano Banana Pro)	35.96	29.00
CRAFTER (w/ Nano Banana 2)	50.34	50.20
$\Delta$ vs. Nano Banana 2	+39.21	+30.30
CRAFTER (w/ Nano Banana Pro)	50.00	52.30
$\Delta$ vs. Nano Banana Pro	+27.57	+29.90

*GPT-Image-2 returned valid outputs for 260/279 inputs on CraftBench due to instability and content-safety refusals.

Key findings:

CRAFTER with Nano Banana 2 achieves 50.34% on PaperBanana-Bench (vs. PaperBanana's 33.73%) and 50.20% on CRAFTBENCH (vs. PaperBanana's 28.00%).
Gains are consistent across all sub-tasks (T2I: 48.3%, Mask: 45.0%, Sketch: 70.0%, KeyEl: 40.0%).
Ablation study: removing any single mechanism (e.g., diversity exploration, structured corrective layer, directive critic) causes a 5.04–8.90 point drop, confirming each component's independent contribution.
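
As a quick consistency check on the reported numbers, the sketch below recomputes the CRAFTBENCH overall score from the sub-task scores and the $\Delta$ rows of Table 2. It assumes that the sub-task figures belong to CRAFTER with Nano Banana 2 and that the overall score is a sample-weighted mean over the four input conditions; neither assumption is stated explicitly in this summary.

```python
# (sample count, reported sub-task score in %) per input condition
SUBTASKS = {
    "T2I": (179, 48.3),
    "Mask": (30, 45.0),
    "Sketch": (40, 70.0),
    "KeyEl": (30, 40.0),
}


def sample_weighted_mean(subtasks):
    """Mean of sub-task scores weighted by the number of samples in each."""
    total_n = sum(n for n, _ in subtasks.values())
    return sum(n * score for n, score in subtasks.values()) / total_n


total_samples = sum(n for n, _ in SUBTASKS.values())  # should match the 279 samples
overall = sample_weighted_mean(SUBTASKS)              # compare with the reported 50.20

# Delta rows: CRAFTER score minus the standalone generator's score.
delta_nb2_paperbanana = 50.34 - 11.13  # reported as +39.21
delta_nb2_craftbench = 50.20 - 19.90   # reported as +30.30
delta_pro_paperbanana = 50.00 - 22.43  # reported as +27.57
delta_pro_craftbench = 52.30 - 22.40   # reported as +29.90
```

Under these assumptions the weighted mean comes out within a rounding step of the reported 50.20, and all four $\Delta$ values match the table.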

CRAFTEDITOR Evaluation

Using a three-VLM ensemble (GPT-4o, Gemini 2.5 Pro, Gemini 3.1 Pro) to judge SVG fidelity, CRAFTEDITOR achieves the highest quality scores among the compared baselines; ablated variants degrade significantly when the iterative refinement loop or the hybrid critic is removed.

Theoretical and Practical Implications

Harness Pattern: The paper demonstrates that a four-role orchestration loop with a shared structured specification can substantially improve results on structured generation tasks. The principle may generalize beyond scientific figures to other domains requiring precise layout and editability (e.g., diagrams, architectural plans, UI mockups).
Editable Outputs: CRAFTEDITOR bridges the gap between high-quality raster generation and practical editability, enabling researchers to locally fix labels, swap icons, or adjust layouts without regenerating from scratch.
Cross-Type/Condition Generalization: Because all task-specific behavior resides in agent prompts, the same harness architecture handles diverse figure types and input conditions without model retraining—a significant step toward practical scientific illustration tools.
Benchmarking: CRAFTBENCH and its VLM-as-judge protocol provide a standardized evaluation method that captures the multi-faceted nature of real-world figure generation, encouraging future work on conditional generation beyond text-only inputs.
Empirical Insights: The ablations show that iterative refinement alone is not sufficient; the combination of diversity planning, structured corrections, and directive feedback is crucial, and each component contributes independently.

Conclusion

CRAFTER and CRAFTEDITOR form the first end-to-end generation-to-editing pipeline for scientific figures. CRAFTER uses a multi-agent harness with three targeted mechanisms (diversity-driven exploration, structured corrective layer, verify-then-refine iteration) to substantially outperform both standalone generators and the strongest agentic baselines on two benchmarks across three figure types and four input conditions. CRAFTEDITOR extends the same harness pattern to convert raster outputs into editable SVGs, outperforming all baselines. The paper also introduces CRAFTBENCH, a 279-sample benchmark with human annotation, and a VLM-as-judge evaluation protocol that adapts to diverse tasks.

Future directions include: extending the harness to support additional modalities (e.g., animated figures, interactive dashboards); integrating more flexible user interaction; and exploring the harness pattern for other structured generation tasks. Code and benchmark are publicly available.