Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification - Summary

Summary (Overview)

  • Introduces Vision2Web, a hierarchical benchmark for evaluating multimodal coding agents on visual website development, spanning three complexity levels: static UI-to-code, interactive multi-page frontend, and long-horizon full-stack development.
  • Constructs a realistic dataset of 193 tasks from real-world websites, comprising 918 prototype images and 1,255 test cases across 16 website categories, ensuring diversity and avoiding data leakage.
  • Proposes a workflow-based agent verification paradigm combining a GUI agent verifier (for functional correctness) and a VLM-based judge (for visual fidelity) to enable reproducible, flexible, and holistic evaluation.
  • Reveals substantial performance gaps in state-of-the-art models (e.g., Claude-Opus-4.5, GPT-5, Gemini-3-Pro), with performance degrading systematically as task complexity increases and on smaller device form factors.
  • Highlights systematic weaknesses in agents' capabilities for complex operations like state management, CRUD, and cross-page coordination, even for top-performing models.

Introduction and Theoretical Foundation

Recent advances in Large Language Models (LLMs) have enhanced the capabilities of autonomous coding agents for end-to-end software development. However, existing evaluation benchmarks are limited:

  1. Limited task formulation: Benchmarks like SWE-Bench focus on incremental code edits, not holistic, end-to-end engineering.
  2. Misaligned multimodal coverage: While text-only benchmarks explore end-to-end development, multimodal benchmarks (e.g., Design2Code) are largely restricted to static webpage reproduction.
  3. Insufficient verification mechanisms: Reliably assessing complex, long-horizon system outcomes remains challenging due to underspecified tasks and insufficient verification.

Website development is an ideal testbed as it spans the full software lifecycle and requires coordinated understanding of visual prototypes, textual requirements, and codebases. To address these gaps, Vision2Web is designed around three core principles:

  • Capability disentanglement: Organizing tasks into three progressively harder levels for systematic diagnosis.
  • Verifiable task construction: Curating tasks from real-world websites via a rigorous pipeline.
  • Reliable automated evaluation: Adopting a workflow-based agent verification paradigm for reproducible assessment.

Methodology

1. Hierarchical Task Formulation

Vision2Web formalizes tasks via a three-level framework:

  • Level 1: Static Webpage: Evaluate the ability to interpret UI prototypes (desktop, tablet, mobile) and generate device-responsive, executable static code.
  • Level 2: Interactive Frontend: Evaluate the ability to generate a fully interactive multi-page frontend from multiple prototypes and textual descriptions of inter-page logic.
  • Level 3: Full-Stack Website: Evaluate comprehensive end-to-end capabilities, requiring interpretation of structured requirement documents, management of application states, integrated debugging, and delivery of cohesive full-stack systems.
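The three levels above can be pictured as variants of one task record. The sketch below is purely illustrative: the field names and example values are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a Vision2Web task record; field names are
# illustrative, not taken from the benchmark's real data format.
@dataclass
class Task:
    level: int                      # 1 = static, 2 = interactive, 3 = full-stack
    prototypes: list                # UI prototype images (desktop/tablet/mobile)
    requirements: str = ""          # textual requirements (levels 2-3)
    test_cases: list = field(default_factory=list)

# Level 1 needs only prototypes; level 3 adds a requirements document.
static_task = Task(level=1, prototypes=["home_desktop.png", "home_mobile.png"])
fullstack_task = Task(level=3, prototypes=["dashboard.png"],
                      requirements="Users can create, edit, and delete notes.")
```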

2. Dataset Construction Pipeline

A multi-stage pipeline refines web corpora into evaluation tasks:

  1. Structural Assessment: Filter based on DOM-level properties (HTML tag distribution, tree depth, token length) from the C4 validation set, reducing candidates to 63,515.
  2. Content Screening: Use VLM-based scoring to filter for functional richness and visual coherence, retaining 7,391 pages.
  3. Manual Review: Annotators manually review websites for consistency, implementation difficulty, and interaction clarity, ensuring balanced category coverage.
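The structural-assessment stage can be sketched as a predicate over DOM-level statistics. All thresholds below are invented placeholders; the paper does not report its actual cutoffs.

```python
from collections import Counter

# Illustrative stage-1 filter in the spirit of the paper's pipeline;
# every threshold here is an assumption, not the paper's value.
def passes_structural_filter(html_tags, tree_depth, token_length,
                             min_depth=5, max_depth=40,
                             min_tokens=500, max_tokens=20000,
                             min_distinct_tags=10):
    """Keep pages whose DOM-level statistics fall in a plausible range."""
    if not (min_depth <= tree_depth <= max_depth):
        return False
    if not (min_tokens <= token_length <= max_tokens):
        return False
    # Require enough distinct tags to suggest a non-trivial layout.
    return len(Counter(html_tags)) >= min_distinct_tags

# A trivially shallow page is rejected:
print(passes_structural_filter(["html", "body", "p"], tree_depth=2, token_length=50))
```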

3. Workflow-Based Agent Verification Paradigm

End-to-end evaluation is formalized as a directed dependency graph, where nodes are verification sub-procedures and edges encode sequential dependencies. This is instantiated into test workflows—agent-executable subgraphs—following principles of decoupling dependent nodes and integrating related ones.
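Concretely, the dependency graph can be held as a predecessor map, and a test workflow is any agent-executable linearization of it. The node names below are made up for illustration.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Minimal sketch of the dependency-graph view: nodes are verification
# sub-procedures, edges encode "must run before". Names are illustrative.
deps = {
    "login":        set(),               # no prerequisites
    "create_item":  {"login"},           # requires a logged-in session
    "edit_item":    {"create_item"},     # requires an existing item
    "check_layout": set(),               # visual check, independent
}

# A test workflow is one valid sequential ordering of the graph.
workflow = list(TopologicalSorter(deps).static_order())
```

Any ordering produced this way respects the edges, which is what makes the resulting workflow executable by a verification agent step by step.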

Two complementary verifiers are used:

  • Functional Verification Nodes (GUI Agent Verifier): Assess functional correctness. Each node is a 3-tuple n_i = ⟨O_i, A_i, V_i⟩, where O_i is the objective, A_i defines guided actions, and V_i encodes the validation criteria. The Functional Score (FS) is the proportion of passed nodes.
  • Visual Verification Nodes (VLM-Based Judge): Assess visual fidelity by comparing rendered pages to prototypes via component-level comparisons. The Visual Score (VS) is the average score across all prototypes.
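The two aggregate scores are simple to state in code. This is a direct transcription of the definitions above; input formats are assumptions.

```python
# FS: fraction of functional verification nodes that pass.
def functional_score(node_results):
    """node_results: one boolean per functional verification node."""
    return sum(node_results) / len(node_results)

# VS: mean of the judge's per-prototype visual scores.
def visual_score(prototype_scores):
    """prototype_scores: one judge score per prototype (assumed 0-100 scale)."""
    return sum(prototype_scores) / len(prototype_scores)

fs = functional_score([True, True, False, True])   # 3 of 4 nodes pass -> 0.75
vs = visual_score([80.0, 60.0, 70.0])              # mean over prototypes -> 70.0
```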

The verification process is outlined in Algorithm 1:

Algorithm 1 Workflow-Based Agent Verification
input Workflow W = (n_1 → ... → n_t), initial state S_0
output Aggregate functional and visual scores (F, V)
H, F, V ← ∅
for n_i ∈ W do
    if n_i is Functional verification then
        (F_i, S_{i+1}) ← GUIAgentVerifier(H, O_i, A_i, V_i, S_i)
        F ← F ∪ {F_i}; H ← H ∪ {(O_i, A_i)}
    else if n_i is Visual verification then
        (V_i, S_{i+1}) ← VLMBasedJudge(P_i, S_i)
        V ← V ∪ {V_i}
    end if
end for
return (F, V)
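Algorithm 1 translates almost line for line into Python. The two verifiers are passed in as callables, since in the benchmark they are full agents; the node dictionary keys are assumptions about how a workflow might be serialized.

```python
# Direct transcription of Algorithm 1. The verifier callables are stand-ins
# for the GUI agent verifier and the VLM-based judge described above.
def run_workflow(workflow, s0, gui_agent_verifier, vlm_based_judge):
    history, F, V = [], [], []   # H, F, V <- empty
    state = s0
    for node in workflow:
        if node["kind"] == "functional":
            f_i, state = gui_agent_verifier(history, node["O"], node["A"],
                                            node["V"], state)
            F.append(f_i)
            history.append((node["O"], node["A"]))  # H <- H ∪ {(O_i, A_i)}
        elif node["kind"] == "visual":
            v_i, state = vlm_based_judge(node["P"], state)
            V.append(v_i)
    return F, V
```

Note how the state threads through both verifier calls: each node observes the environment state its predecessors left behind, which is what makes the workflow sequential rather than a bag of independent checks.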

4. Agent-Assisted Annotation

Test cases are annotated through expert-AI collaboration:

  • Static Webpage: Lightweight, resolution-specific visual verification.
  • Interactive Frontend: Largely automated, with Claude Code inferring navigation structures.
  • Full-Stack Website: Expert-in-the-loop strategy. Experts draft high-level workflows; Claude Code refines them into executable sequences.

5. Experimental Settings

  • Models Evaluated: Eight state-of-the-art multimodal models: Claude-Opus-4.5, Claude-Sonnet-4.5, GPT-5, Gemini-3-Pro-Preview, Gemini-3-Flash-Preview, Seed-1.8-VL, Qwen3-VL-32B/8B-Instruct.
  • Frameworks: Integrated into OpenHands and Claude Code coding agent frameworks.
  • Environment: Containerized with frontend/backend/database dependencies. Agents generate a startup script; a deployment that errors out or takes longer than 10 minutes counts as a failure.
  • Verifiers: GUI agent verifier instantiated with GLM-4.6V; VLM judge uses Gemini-3-Pro-Preview.
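The deployment rule can be sketched as a small harness: run the agent-generated startup script, and fail on a non-zero exit or on exceeding the 10-minute budget. The script path is an assumption about the task layout.

```python
import subprocess

# Illustrative deployment check for the rule above: an error or a run
# longer than 10 minutes counts as a failed deployment.
def deploy(script_path="start.sh", timeout_s=600):
    try:
        result = subprocess.run(["bash", script_path],
                                capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # exceeded the 10-minute budget
    return result.returncode == 0  # non-zero exit means a failed launch
```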

Empirical Validation / Results

Main Results (Table 3)

The comprehensive results are shown in Table 3. Key findings:

Finding 1: Performance degrades with task complexity. Under OpenHands, Gemini-3-Pro-Preview scores 63.3 (desktop), 55.8 (tablet), and 48.3 (mobile) on static pages, but drops sharply on full-stack tasks (VS = 11.7, FS = 22.6). Claude-Opus-4.5 maintains strong performance but also declines on full-stack tasks.

Finding 2: Performance degrades on smaller devices and complex images. Static page tasks show 10–20% lower scores on tablet/mobile vs. desktop. Figure 4 shows larger, denser prototypes induce additional performance declines.

Finding 3: Claude-Opus-4.5 consistently achieves strongest performance. Under OpenHands, Claude-Opus-4.5 attains VS=58.9 (desktop), VS/FS=46.5/66.7 (frontend), VS/FS=38.4/57.6 (full-stack). Seed-1.8-VL fails entirely on full-stack (VS=0, FS=0); Qwen models largely cannot complete tasks.

Finding 4: Performance varies across frameworks. For most models (Claude excepted), performance under OpenHands tends to be higher than under Claude Code.

Finding 5: Full-stack performance varies across website categories (Table 4).

| Website Category | Opus-4.5 (VS / FS) | Sonnet-4.5 (VS / FS) | GPT-5 (VS / FS) |
|------------------|--------------------|----------------------|------------------|
| Content          | 37.1 / 61.2        | 9.3 / 16.1           | 20.7 / 53.5      |
| Transaction      | 43.2 / 64.9        | 10.8 / 14.3          | 13.4 / 50.6      |
| SaaS Platform    | 22.9 / 39.9        | 21.7 / 42.8          | 16.7 / 40.5      |
| Public Service   | 56.9 / 60.0        | 41.2 / 52.0          | 27.4 / 56.0      |

Public Service websites (simple structures) perform best. SaaS platforms (complex interactions) yield weakest results.

Finding 6: Agents exhibit weaknesses in state-dependent operations (Table 5).

| Test Case Category      | Opus-4.5 | Sonnet-4.5 | GPT-5 |
|-------------------------|----------|------------|-------|
| Navigation & Routing    | 66.3     | 25.9       | 53.9  |
| State Management        | 43.2     | 16.1       | 41.5  |
| Form Interaction        | 49.2     | 23.7       | 56.8  |
| …                       | …        | …          | …     |
| File & Media Operations | 33.3     | 0.0        | 16.7  |

Navigation & Routing and Authentication are most reliable. Performance drops on State Management, CRUD, and File Operations.

Analysis of Failure Modes

  • Fine-Grained Visual Alignment: Misaligned layouts, incorrect sizes, color mismatches, fragile asset handling.
  • Cross-Module Visual Understanding: Visual fidelity degrades on subsequent pages; missing components, broken navigation.
  • System-Level Planning and Execution: Deficiencies in long-horizon planning and autonomous verification; projects may fail to launch or crash.

Validation of the Agent Verifier

  • GUI Agent Verifier: 87.2% agreement (218/250 nodes) with human annotations on sampled workflows.
  • VLM-Based Judge: Achieves an average Spearman rank correlation ρ = 0.66 (median 0.80) with human judgments, indicating substantial alignment (human inter-annotator agreement: ρ = 0.78).
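For reference, Spearman's ρ (the agreement statistic used for the judge) is just the Pearson correlation of the two score lists' ranks. A pure-stdlib sketch, with ties given average ranks:

```python
# Spearman's rho: rank both lists, then correlate the ranks.
def _ranks(xs):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                     # extend over the tied block
        avg = (i + j) / 2 + 1          # mean rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    ra, rb = _ranks(a), _ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Perfectly monotone agreement gives rho ≈ 1.0:
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
```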

Theoretical and Practical Implications

  • Benchmark Design: Vision2Web demonstrates the need for hierarchical, progressively challenging task designs to systematically diagnose agent capabilities across development stages.
  • Evaluation Paradigm: The workflow-based agent verification paradigm provides a blueprint for reproducible, implementation-agnostic evaluation of complex, interactive systems, balancing flexibility with control.
  • Agent Development: The results reveal critical limitations in current multimodal coding agents, especially in cross-modal reasoning, long-horizon planning, and multi-page coordination. This highlights directions for future research, such as improving visual grounding, state tracking, and self-verification mechanisms.
  • Real-World Relevance: By grounding the benchmark in real-world websites and categories, the findings reflect practical challenges in automated software development, guiding the development of more robust and capable agents.

Conclusion

Vision2Web is a comprehensive hierarchical benchmark for evaluating multimodal coding agents in visual website development. Its three-level task design enables systematic assessment under increasing complexity. The proposed workflow-based agent verification paradigm, combining a GUI agent verifier and a VLM judge, allows for reproducible, holistic measurement of functional and visual correctness. Experiments reveal that strong performance on isolated tasks does not reliably transfer to end-to-end system construction, exposing systematic deficiencies in handling structural complexity and state reasoning. These findings underscore the need for hierarchical task designs and principled autonomous evaluation as a foundation for advancing and rigorously assessing coding agents.