Summary of "DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios"

Summary (Overview)

  • Introduces DV-World, a comprehensive benchmark of 260 tasks designed to evaluate Data Visualization (DV) agents across the full professional lifecycle in real-world scenarios, moving beyond idealized code-sandbox settings.
  • Spans three core domains: DV-Sheet for native spreadsheet chart creation, diagnostic repair, and dashboard synthesis; DV-Evolution for cross-framework logic evolution and adaptation; and DV-Interact for proactive multi-turn interaction to align with ambiguous user intent.
  • Proposes a hybrid evaluation framework integrating Table-value Alignment for numerical precision and MLLM-as-a-Judge with expert-designed rubrics for semantic-visual assessment, demonstrating strong alignment with human judgment.
  • Reveals significant performance gaps: State-of-the-art models (e.g., Gemini-3-Pro, GPT-5.2) achieve less than 50% overall performance, exposing critical deficits in handling native object models, cross-paradigm evolution, and iterative intent alignment.
  • Provides a realistic testbed to steer development towards the versatile, integrated expertise required in enterprise visualization workflows, highlighting the need for a shift from one-shot generation to comprehensive lifecycle management.

Introduction and Theoretical Foundation

Data Visualization (DV) is a critical bridge between data and human decisions. While LLM- and MLLM-based DV agents have shown impressive code-generation abilities in standardized sandboxes, current benchmarks fail to capture the complexity of real-world professional workflows. The paper identifies three gaps in existing evaluation paradigms:

  1. Environmental Decoupling: Overlooking native spreadsheet-centric workflows (e.g., Excel's object models, data-to-chart bindings) in favor of developer-style code generation.
  2. Creation-Only Myopia: Focusing on one-shot chart construction while under-testing the evolutionary work required to adapt visualizations to new data and requirements across diverse frameworks.
  3. Perfect-Intent Assumptions: Building benchmarks on fully specified prompts, ignoring the ambiguity in real user requests and the need for proactive clarification and dialogue.

To bridge these gaps, DV-World is introduced as a benchmark designed to evaluate DV agents across the full lifecycle, emphasizing native environmental grounding, cross-platform evolution, and proactive intent alignment.

Methodology

1. Benchmark Construction & Task Definition

DV-World comprises 260 tasks curated from real-world sources (e.g., ExcelForum, Kaggle) and manually adapted by 18 visualization specialists. It is structured into three domains:

  • DV-Sheet: Evaluates native spreadsheet manipulation. An agent $\pi_{\text{sheet}}$ performs:

    • Create ($E^* = \pi_{\text{sheet}}(I, E_0)$): Generate a native chart with dynamic range bindings $f$ (illustrated in the sketch after this list).
    • Fix: Diagnose and repair a defective chart $C_{\text{err}}$ into $C_{\text{fix}}$.
    • Dash: Compose a professional dashboard by arranging multiple charts $\{C_i\}$ and tables $\{T_j\}$.
  • DV-Evolution: Evaluates cross-modal logic evolution. An agent $\pi_{\text{evol}}$ synthesizes

    $$\sigma = \pi_{\text{evol}}(I, V, D, L)$$

    Given a reference image $V$, a new dataset $D$, and requirements $I$, it must produce executable code $\sigma = \langle C^*, T^* \rangle$ in a target language $L$ (e.g., Python, D3.js, Vega-Lite).

  • DV-Interact: Evaluates proactive iterative interaction under ambiguity. An agent $\pi_{\text{int}}$ interacts with a dual-stage user simulator to clarify a task $q_0$ within an environment $E = \{D, L\}$, generating code $C^*$ that aligns with the user's latent objectives.
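To make the DV-Sheet setting concrete, here is a minimal Python sketch of what the Create task demands: emitting a native chart object with live data-to-chart range bindings rather than plotting code that renders a static image. openpyxl serves as an illustrative stand-in for Excel's object model; this is not the benchmark's actual harness.

```python
# Sketch of DV-Sheet "Create": build a native chart bound to worksheet ranges.
# openpyxl is an illustrative stand-in for Excel's object model.
from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference

wb = Workbook()
ws = wb.active
ws.append(["Region", "Sales"])  # header row
for row in [("North", 120), ("South", 95), ("East", 143)]:
    ws.append(row)

chart = BarChart()
chart.title = "Sales by Region"
# Dynamic range bindings f: the chart references worksheet ranges, so it
# stays live when the underlying cells change (the behavior DV-Sheet tests).
data = Reference(ws, min_col=2, min_row=1, max_row=ws.max_row)
labels = Reference(ws, min_col=1, min_row=2, max_row=ws.max_row)
chart.add_data(data, titles_from_data=True)
chart.set_categories(labels)
ws.add_chart(chart, "D2")
wb.save("report.xlsx")
```

The Fix and Dash tasks operate on the same object model: repairing a broken binding or encoding, and arranging several such charts and tables into a dashboard layout.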

2. Evaluation Metrics

A hybrid framework combines quantitative data fidelity checks with qualitative rubric-based assessment via MLLM-as-a-Judge.

  • Table Coverage (TC): Measures data integrity via tolerance-aware matching of generated cell values against the ground truth:

    $$S_{\text{TC}} = \frac{1}{N_{\text{valid}}} \sum_{c \in C} \mathbb{I}\big(\text{match}(v_{\text{gen}}, v_{\text{gt}})\big)$$

  • Rubric Score: For tasks like DVSheet-Create and DV-Evolution, expert rubrics evaluate dimensions (e.g., Reliability, Consistency, Aesthetics). The score is:

    $$S_{\text{rubric}}(O, R) = \frac{\sum_{k=1}^{N} s_k}{\sum_{k=1}^{N} w_k}, \quad s_k = \Lambda(c_k, O) \in [0, w_k]$$

  • Composite Scores: For creation and evolution tasks, the final score combines rubric and TC:

    $$S_{\text{crea/evol}} = w \cdot S_{\text{rubric}} + (1 - w) \cdot S_{\text{TC}} \quad (\text{default } w = 0.5)$$

  • Success Rates: For repair and interactive tasks:

    • DVSheet-Fix: $SR_{\text{DVSheet-Fix}} = \mathbb{I}\left[\forall f \in F_{\text{must}}: \text{Sim}(C_f, G_f) \ge \tau\right]$, with $\tau \ge 0.95$.
    • DV-Interact: Uses an Interaction Success Rate (ISR), $ISR = (1 - \lambda) + \lambda \cdot \frac{N_{\text{success}} - N_{\text{ref}}}{N_{\text{req}} + 1}$ with $\lambda = 0.5$; the final score is $S_{\text{final}} = S_{\text{rubric}} \cdot ISR$. (A sketch of how these metrics compose follows this list.)
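To ground these definitions, the following Python sketch transcribes the formulas directly. The function signatures, the positional cell pairing in table_coverage, and the relative-tolerance rule are illustrative assumptions; only the formulas themselves come from the paper.

```python
# Hedged transcription of DV-World's scoring formulas (see definitions above).

def table_coverage(gen: list[float], gt: list[float], tol: float = 1e-6) -> float:
    """S_TC: fraction of valid ground-truth cells matched within tolerance.
    Pairing cells by position is a simplification of match(v_gen, v_gt)."""
    if not gt:
        return 0.0
    hits = sum(abs(g - t) <= tol * max(1.0, abs(t)) for g, t in zip(gen, gt))
    return hits / len(gt)

def rubric_score(judged: list[tuple[float, float]]) -> float:
    """S_rubric: `judged` holds (s_k, w_k) pairs, where s_k in [0, w_k] is the
    MLLM judge's award Lambda(c_k, O) for criterion c_k with weight w_k."""
    total_w = sum(w for _, w in judged)
    return sum(s for s, _ in judged) / total_w if total_w else 0.0

def composite(s_rubric: float, s_tc: float, w: float = 0.5) -> float:
    """S_crea/evol = w * S_rubric + (1 - w) * S_TC."""
    return w * s_rubric + (1 - w) * s_tc

def isr(n_success: int, n_ref: int, n_req: int, lam: float = 0.5) -> float:
    """Interaction Success Rate; transcribes the ISR formula above (the exact
    semantics of N_success, N_ref, and N_req are defined in the paper)."""
    return (1 - lam) + lam * (n_success - n_ref) / (n_req + 1)

# Worked example for a DV-Interact task: S_final = S_rubric * ISR.
s_rub = rubric_score([(4.0, 5.0), (2.5, 3.0)])     # 6.5 / 8 = 0.8125
print(s_rub * isr(n_success=3, n_ref=1, n_req=4))  # 0.8125 * 0.7 = 0.56875
```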

3. Experimental Setup

  • Models Evaluated: A wide range of state-of-the-art LLMs/MLLMs, including open-source (Qwen3, GLM-4.7, DeepSeek-V3.1) and proprietary (Gemini, GPT, Grok) families.
  • Baseline Agents: SheetCopilot (for DV-Sheet), OpenHands (for DV-Evolution), and a unified DV-World-Agent (ReAct-based, with tools such as bash, load_image, render_chart, and ask_user; a minimal sketch of such a tool loop follows this list).
  • Evaluation Protocol: 4 independent runs per task; GPT-5-Mini serves as the user simulator for DV-Interact, and Gemini-2.5-Flash serves as the primary MLLM judge.
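For context on the agent side, here is a hypothetical minimal sketch of a ReAct-style tool loop in the spirit of the unified DV-World-Agent. The four tool names come from the paper; the JSON action format, the dispatch logic, and the call_llm callable are assumptions made for illustration.

```python
# Hypothetical minimal ReAct tool loop; not the actual DV-World-Agent code.
import json
import shlex
import subprocess

def bash(cmd: str) -> str:
    """Run a shell command and return its combined output."""
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return r.stdout + r.stderr

TOOLS = {
    "bash": bash,
    "load_image": lambda path: f"<image:{path}>",  # stand-in: a real agent feeds pixels to the MLLM
    "render_chart": lambda code: bash(f"python -c {shlex.quote(code)}"),
    "ask_user": lambda q: input(q),  # the benchmark substitutes the user simulator here
}

def run_agent(task: str, call_llm, max_steps: int = 20) -> str:
    """ReAct loop: the model alternates reasoning and tool calls until it finishes."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # Assumed model output format: {"thought": ..., "action": ..., "input": ...}
        step = json.loads(call_llm(history))
        if step["action"] == "finish":
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])
        history.append({"role": "assistant", "content": json.dumps(step)})
        history.append({"role": "user", "content": f"Observation: {observation}"})
    return "max steps exceeded"
```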

Empirical Validation / Results

Main Performance Results

Table 1: Comparison of DV-World with existing benchmarks.

| Benchmark | # Tasks | Source | Env. | Input Format | Final Output | Interactive Agency | Open-ended | Evaluation Method |
|---|---|---|---|---|---|---|---|---|
| SpreadsheetBench | 912 | Real | Nat. | Sheet+NL | Sheet | | | Rule-based |
| Bird-Interact | 600 | Manual | Prog. | DB+NL | 1 SQL | | | Execution-based |
| OSWorld | 369 | Real+Manual | OS | Actions+NL | Actions | | | Execution-based |
| DAComp-DA | 210 | Real+Manual | Prog. | Table+NL | Report+Chart | | | LLM-judge (rubrics) |
| DV-World (Ours) | 260 | Real+Manual | Nat. & Prog. | Sheet+NL & Table+I+NL | Fin. Table + 1/N Chart | Both | | Rule-based & MLLM-judge (rubrics) |

Table 3 (excerpt): DV-Sheet Results (Top Models)

| Method | Create (Overall) | Fix (SR %) | Dashboards (Overall) | Score |
|---|---|---|---|---|
| Gemini-3-Pro | 36.07 (±2.18) | 48.00 | 35.29 (±2.64) | 40.48 (±2.12) |
| GPT-5.2 | 34.43 (±2.42) | 42.00 | 33.98 (±2.31) | 37.24 (±1.88) |
| DeepSeek-V3.2 | 28.31 (±1.74) | 36.00 | 36.35 (±3.02) | 33.12 (±2.45) |
| Human | 80.81 | 88.00 | 87.34 | — |

Table 4 (excerpt): DV-Evolution Results (Top Models)

| Method | Python | Apache ECharts | Vega-Lite | D3.js | Plotly.js | Score |
|---|---|---|---|---|---|---|
| Gemini-3-Pro | 60.36 (±1.62) | 44.45 (±2.31) | 46.30 (±1.48) | 56.34 (±2.45) | 49.76 (±4.72) | 51.44 (±2.18) |
| Gemini-3-Flash | 58.54 (±1.44) | 46.01 (±2.15) | 45.39 (±3.32) | 49.83 (±2.21) | 47.54 (±1.58) | 49.46 (±2.05) |
| Grok-4 | 53.84 (±1.82) | 44.99 (±2.24) | 50.68 (±3.18) | 49.44 (±1.75) | 45.11 (±1.52) | 48.81 (±2.14) |
| Human | 85.23 | 82.11 | 88.46 | 85.21 | 84.44 | — |

Table 5 (excerpt): DV-Interact Results (Top Models)

| Method | MLLM-Score (Overall) | ISR (%) | Score | User Cost |
|---|---|---|---|---|
| Grok-4 | 51.10 (±2.24) | 79.57 | 40.43 (±1.95) | $0.051 |
| DeepSeek-V3.2 | 51.30 (±2.18) | 74.05 | 37.94 (±2.05) | $0.032 |
| GPT-5.2 | 50.58 (±1.92) | 69.25 | 35.09 (±1.88) | $0.021 |
| Human | 79.60 | — | — | — |

Key Findings:

  1. Performance Ceiling: Even top models struggle, with peak scores below 52% across all domains, far below human baselines (~80-88%).
  2. DV-Sheet Challenge: Native spreadsheet manipulation is particularly difficult (peak score 40.48%), highlighting deficits in managing object models and dynamic bindings.
  3. Framework Variance: In DV-Evolution, performance varies significantly by library, with stronger results in Python and Vega-Lite than in the more complex D3.js and Plotly.js.
  4. Interactive Bottleneck: In DV-Interact, the primary challenge is the interactive process itself; models fail to efficiently bridge underspecified intent to complex visualization logic.

Detailed Analysis

  • Native Grounding (DV-Sheet): A positive correlation exists between Table Coverage and Visual Aesthetics; models perform better aesthetically as data grounding improves. Agents excel at simple logic repairs but struggle with precise geometric mapping (axis scaling, encoding errors). Performance declines as table size increases.
  • Cross-Paradigm Evolution (DV-Evolution): Performance decays as the target Lines of Code (LOC) increase, indicating a "verbosity tax" for frameworks like D3.js. The load_image tool is critical for maintaining semantic fidelity; removing it leads to universal performance drops (e.g., -7.69% for Gemini-3-Pro in D3.js).
  • Interactive Agency (DV-Interact): Transitioning to interactive alignment offers significant benefits (e.g., a +23.0% gain for Gemini-3-Pro), but gains depend on the quality of proactive reasoning, not just interaction frequency.

Table 2: Key Statistics for DV-World.

| Metric | Value |
|---|---|
| Total Tasks | 260 (100%) |
| - DV-Sheet (Create/Fix/Dash) | 50 / 50 / 30 (50%) |
| - DV-Evolution | 80 (30.8%) |
| - DV-Interact | 50 (19.2%) |
| DV-Sheet: Avg. Columns / Rows | 36.53 / 11,583.36 |
| DV-Evolution: Avg. Columns / Rows | 58.98 / 52,584.58 |
| DV-Interact: Ambiguities per Task | 3.17 |

Error Analysis

  • DV-Sheet: Errors are dominated by Data Accuracy (over 50% in Create, ~69% in Fix), followed by Layout Readability and Visual Design issues.
  • DV-Evolution: Errors are primarily Layout & Readability (42.43% on average) and Data Consistency (31.98%), with Visual Style being less problematic (25.59%).
  • DV-Interact: Failures are primarily driven by the Cognitive-Execution Gap (38.44% avg), where clarified intent fails to produce grounded results. Interactive Avoidance and Inquiry Deficit are also major bottlenecks.

Theoretical and Practical Implications

  • Theoretical: DV-World establishes a new paradigm for evaluating DV agents, shifting focus from isolated code generation to integrated lifecycle management encompassing environmental mastery, semantic portability, and proactive alignment. It provides a structured taxonomy of real-world challenges (e.g., fix types, ambiguity points).
  • Practical: The benchmark exposes critical weaknesses in current agents, guiding research and development towards:
    1. Improving native object model understanding and data binding reliability.
    2. Enhancing cross-framework semantic transfer and design preservation.
    3. Developing robust interactive reasoning and clarification strategies.
    In addition, the hybrid evaluation framework (MLLM judge plus data fidelity checks) offers a reproducible, human-aligned method for assessing complex, open-ended visualization tasks.

Conclusion

DV-World provides a comprehensive, realistic benchmark for evaluating Data Visualization agents across the full professional lifecycle. Results demonstrate that even state-of-the-art models are ill-equipped for the complexities of real-world visualization, struggling with error correction, faithful data binding, consistent evolution, and intent alignment. By combining rubric-based judgments with quantitative checks and interaction metrics, DV-World enables detailed diagnosis and progress tracking. It establishes a standardized yardstick to quantify and accelerate progress toward reliable, versatile DV agents capable of handling enterprise-level workflows. Future work involves expanding the benchmark and exploring agent architectures that integrate the environmental, evolutionary, and interactive capabilities highlighted as essential by DV-World.