# DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

> DV-World introduces a comprehensive real-world benchmark showing that state-of-the-art AI agents perform below 50% on tasks requiring native spreadsheet manipulation, cross-framework adaptation, and proactive user interaction.

- **Source:** [arXiv](https://arxiv.org/abs/2604.25914)
- **Published:** 2026-04-30
- **Permalink:** https://picx.dev/p/wcII9c
- **Whiteboard:** https://picx.dev/p/wcII9c/image

## Summary

## Summary (Overview)
*   **Introduces DV-World**, a comprehensive benchmark of 260 tasks designed to evaluate Data Visualization (DV) agents across the full professional lifecycle in **real-world scenarios**, moving beyond idealized code-sandbox settings.
*   **Spans three core domains**: **DV-Sheet** for native spreadsheet chart creation, diagnostic repair, and dashboard synthesis; **DV-Evolution** for cross-framework logic evolution and adaptation; and **DV-Interact** for proactive multi-turn interaction to align with ambiguous user intent.
*   **Proposes a hybrid evaluation framework** integrating **Table-value Alignment** for numerical precision and **MLLM-as-a-Judge** with expert-designed rubrics for semantic-visual assessment, demonstrating strong alignment with human judgment.
*   **Reveals significant performance gaps**: State-of-the-art models (e.g., Gemini-3-Pro, GPT-5.2) achieve less than 50% overall performance, exposing critical deficits in handling native object models, cross-paradigm evolution, and iterative intent alignment.
*   **Provides a realistic testbed** to steer development towards the versatile, integrated expertise required in enterprise visualization workflows, highlighting the need for a shift from one-shot generation to comprehensive lifecycle management.

## Introduction and Theoretical Foundation
Data Visualization (DV) is a critical bridge between data and human decisions. While LLM- and MLLM-based DV agents have shown impressive code generation in standardized sandboxes, current benchmarks fail to capture the complexity of real-world professional workflows. The paper identifies three critical gaps in existing evaluation paradigms:
1.  **Environmental Decoupling**: Overlooking native spreadsheet-centric workflows (e.g., Excel's object models, data-to-chart bindings) in favor of developer-style code generation.
2.  **Creation-Only Myopia**: Focusing on one-shot chart construction while under-testing the evolutionary work required to adapt visualizations to new data and requirements across diverse frameworks.
3.  **Perfect-Intent Assumptions**: Building benchmarks on fully specified prompts, ignoring the ambiguity in real user requests and the need for proactive clarification and dialogue.

To bridge these gaps, **DV-World** is introduced as a benchmark designed to evaluate DV agents across the full lifecycle, emphasizing **native environmental grounding**, **cross-platform evolution**, and **proactive intent alignment**.

## Methodology

### 1. Benchmark Construction & Task Definition
DV-World comprises 260 tasks curated from real-world sources (e.g., ExcelForum, Kaggle) and manually adapted by 18 visualization specialists. It is structured into three domains:

*   **DV-Sheet**: Evaluates native spreadsheet manipulation. An agent $\pi_{\text{sheet}}$ performs:
    *   **Create ($E^* = \pi_{\text{sheet}}(I, E_0)$)**: Generate a native chart with dynamic range bindings $f$.
    *   **Fix**: Diagnose and repair a defective chart $C_{\text{err}}$ into $C_{\text{fix}}$.
    *   **Dash**: Compose a professional dashboard by arranging multiple charts $\{C_i\}$ and tables $\{T_j\}$.

*   **DV-Evolution**: Evaluates cross-modal logic evolution. An agent $\pi_{\text{evol}}$ synthesizes:
    $$\sigma = \pi_{\text{evol}}(I, V, D, L)$$
    Given a reference image $V$, new dataset $D$, and requirements $I$, it must produce executable code $\sigma = \langle C^*, T^* \rangle$ in target language $L$ (e.g., Python, D3.js, Vega-Lite).

*   **DV-Interact**: Evaluates proactive iterative interaction under ambiguity. An agent $\pi_{\text{int}}$ interacts with a dual-stage user simulator to clarify a task $q_0$ within an environment $E = \{D, L\}$, generating code $C^*$ that aligns with the user's latent objectives.

### 2. Evaluation Metrics
A hybrid framework combines quantitative data fidelity checks with qualitative rubric-based assessment via MLLM-as-a-Judge.

*   **Table Coverage (TC)**: Measures data integrity via tolerance-aware matching.
    $$S_{\text{TC}} = \frac{1}{N_{\text{valid}}} \sum_{c \in C} \mathbb{I}\left(\text{match}\left(v_{\text{gen}}^{(c)}, v_{\text{gt}}^{(c)}\right)\right)$$

*   **Rubric Score**: For tasks like DVSheet-Create and DV-Evol, expert rubrics evaluate dimensions (e.g., Reliability, Consistency, Aesthetics). The score is:
    $$S_{\text{rubric}}(O, R) = \frac{\sum_{k=1}^{N} s_k}{\sum_{k=1}^{N} w_k}, \quad s_k = \Lambda(c_k, O) \in [0, w_k]$$

*   **Composite Scores**: For creation/evolution tasks, the final score combines rubric and TC:
    $$S_{\text{crea/evol}} = w \cdot S_{\text{rubric}} + (1 - w) \cdot S_{\text{TC}} \quad (\text{default } w=0.5)$$

*   **Success Rates**: For repair and interactive tasks:
    *   **DVSheet-Fix**: $SR_{\text{DVSheet-Fix}} = \mathbb{I}[\forall f \in F_{\text{must}}: \text{Sim}(C_f, G_f) \ge \tau], \tau \ge 0.95$
    *   **DV-Inter**: Uses an **Interaction Success Rate (ISR)**:
        $$ISR = (1 - \lambda) + \lambda \cdot \frac{N_{\text{success}} - N_{\text{ref}}}{N_{\text{req}} + 1}, \quad \lambda = 0.5$$
        The final score is $S_{\text{final}} = S_{\text{rubric}} \cdot ISR$.
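The scoring formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's implementation: the dictionary-of-cells table layout, the relative tolerance of 1%, the string comparison for non-numeric cells, and the choice of $N_{\text{valid}}$ as the number of ground-truth cells are all assumptions made for the example. ISR is taken as an input rather than computed.

```python
import math

def table_coverage(generated, ground_truth, rel_tol=0.01):
    """S_TC: share of ground-truth cells matched by the generated table.

    Both tables are dicts keyed by a cell identifier (assumed layout).
    """
    if not ground_truth:
        return 0.0
    hits = 0
    for cell, expected in ground_truth.items():
        actual = generated.get(cell)
        if actual is None:
            continue  # a missing cell counts as a miss
        if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
            # Tolerance-aware numeric match (the tolerance value is an assumption).
            hits += math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=1e-9)
        else:
            hits += str(actual).strip() == str(expected).strip()
    return hits / len(ground_truth)

def rubric_score(items):
    """S_rubric: items are (awarded, weight) pairs; awarded is clipped to [0, weight]."""
    total_weight = sum(w for _, w in items)
    awarded = sum(min(max(s, 0.0), w) for s, w in items)
    return awarded / total_weight

def composite_score(s_rubric, s_tc, w=0.5):
    """S_crea/evol = w * S_rubric + (1 - w) * S_TC."""
    return w * s_rubric + (1 - w) * s_tc

def fix_success(similarities, tau=0.95):
    """DVSheet-Fix: 1 only if every must-fix feature reaches similarity tau."""
    return int(all(sim >= tau for sim in similarities))

def interactive_final(s_rubric, isr):
    """S_final = S_rubric * ISR, with ISR supplied as a fraction in [0, 1]."""
    return s_rubric * isr

# Hypothetical data for illustration only.
ground_truth = {"Q1": 100.0, "Q2": 250.0, "Q3": 400.0, "Q4": 80.0}
generated = {"Q1": 100.4, "Q2": 250.0, "Q3": 390.0}  # Q3 is off by 2.5%, Q4 is missing

s_tc = table_coverage(generated, ground_truth)
s_rubric = rubric_score([(2, 2), (1, 3), (0.5, 1)])
print(s_tc, s_rubric, composite_score(s_rubric, s_tc))
print(fix_success([0.97, 0.99]), fix_success([0.97, 0.90]))
```

Because the DVSheet-Fix check is an all-or-nothing indicator, a single must-fix feature below $\tau$ fails the whole task, whereas the creation and evolution scores degrade gradually with partial matches.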

### 3. Experimental Setup
*   **Models Evaluated**: A wide range of state-of-the-art LLMs/MLLMs, including open-source (Qwen3, GLM-4.7, DeepSeek-V3.1) and proprietary (Gemini, GPT, Grok) families.
*   **Baseline Agents**: **SheetCopilot** (for DV-Sheet), **OpenHands** (for DV-Evol), and a unified **DV-World-Agent** (ReAct-based with tools like `bash`, `load_image`, `render_chart`, `ask_user`).
*   **Evaluation Protocol**: Each task is run 4 independent times. GPT-5-Mini serves as the user simulator for DV-Inter, and Gemini-2.5-Flash serves as the primary MLLM judge.
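To make the ReAct-style setup more concrete, the following is a rough sketch of a tool-dispatch loop using the four tool names listed above. Everything else is hypothetical: the tool bodies are stubs, `scripted_policy` stands in for the LLM, and the canned `ask_user` reply stands in for the user simulator. It is not the DV-World-Agent implementation.

```python
# Stub tools: the names come from the summary, the bodies are placeholders.
def bash(arg):
    return f"ran: {arg}"

def load_image(arg):
    return f"loaded image: {arg}"

def render_chart(arg):
    return f"rendered: {arg}"

def ask_user(arg):
    return "Use monthly totals."  # canned reply standing in for the user simulator

TOOLS = {
    "bash": bash,
    "load_image": load_image,
    "render_chart": render_chart,
    "ask_user": ask_user,
}

def scripted_policy(history):
    """Hypothetical stand-in for the LLM: picks the next (action, argument) pair."""
    plan = [
        ("ask_user", "Which time granularity should the chart use?"),
        ("bash", "python plot.py"),
        ("render_chart", "chart.png"),
        ("finish", "chart.png"),
    ]
    return plan[len(history)]

def run_agent(policy, tools, max_steps=8):
    """Alternate between choosing an action and observing the tool result."""
    history = []
    for _ in range(max_steps):
        action, arg = policy(history)
        if action == "finish":
            return arg, history
        observation = tools[action](arg)
        history.append((action, arg, observation))
    return None, history  # step budget exhausted without a final answer

result, trace = run_agent(scripted_policy, TOOLS)
for step in trace:
    print(step)
print("final:", result)
```

In this sketch the clarification question comes first, which mirrors the idea that an agent should resolve ambiguity before writing plotting code; a real policy would decide when (and whether) to call `ask_user` from the conversation state.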

## Empirical Validation / Results

### Main Performance Results
**Table 1: Comparison of DV-World with existing benchmarks.**
| Benchmark | # Tasks | Source | Env. | Input Format | Final Output | Interactive Agency | Open-ended | Evaluation Method |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| SpreadsheetBench | 912 | Real | Nat. | Sheet+NL | Sheet | ✗ | ✗ | Rule-based |
| Bird-Interact | 600 | Manual | Prog. | DB+NL | 1 SQL | ✓ | ✗ | Execution-based |
| OSWorld | 369 | Real+Manual | OS | Actions+NL | Actions | ✗ | ✗ | Execution-based |
| DAComp-DA | 210 | Real+Manual | Prog. | Table+NL | Report+Chart | ✗ | ✓ | LLM-judge(rubrics) |
| **DV-World (Ours)** | **260** | **Real + Manual** | **Nat. & Prog.** | **Sheet+NL & Table+I+NL** | **Fin.Table+1/N Chart** | **✓** | **Both** | **Rule-based & MLLM-judge (rubrics)** |

**Table 3 (excerpt): DV-Sheet Results (Top Models)**
| Method | Create (Overall) | Fix (SR %) | Dashboards (Overall) | **Score** |
| :--- | :--- | :--- | :--- | :--- |
| Gemini-3-Pro | 36.07 (±2.18) | 48.00 | 35.29 (±2.64) | **40.48 (±2.12)** |
| GPT-5.2 | 34.43 (±2.42) | 42.00 | 33.98 (±2.31) | 37.24 (±1.88) |
| DeepSeek-V3.2 | 28.31 (±1.74) | 36.00 | 36.35 (±3.02) | 33.12 (±2.45) |
| Human | 80.81 | 88.00 | 87.34 | - |
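The Score column in the DV-Sheet table appears to be consistent with a mean of the three sub-task results weighted by task count (50 Create, 50 Fix, 30 Dash, per the benchmark statistics). This weighting is an inference from the reported numbers, not something the summary states, and the quick check below is only a sketch of that assumption.

```python
# Task counts per DV-Sheet sub-task, taken from the benchmark statistics.
TASK_COUNTS = {"create": 50, "fix": 50, "dash": 30}

# Sub-task results copied from the DV-Sheet table.
SHEET_ROWS = {
    "Gemini-3-Pro": {"create": 36.07, "fix": 48.00, "dash": 35.29},
    "GPT-5.2": {"create": 34.43, "fix": 42.00, "dash": 33.98},
    "DeepSeek-V3.2": {"create": 28.31, "fix": 36.00, "dash": 36.35},
}

def weighted_overall(scores, task_counts):
    """Task-count-weighted mean of the per-sub-task scores."""
    total = sum(task_counts.values())
    return sum(scores[name] * n for name, n in task_counts.items()) / total

sheet_scores = {
    model: round(weighted_overall(scores, TASK_COUNTS), 2)
    for model, scores in SHEET_ROWS.items()
}
print(sheet_scores)
# {'Gemini-3-Pro': 40.48, 'GPT-5.2': 37.24, 'DeepSeek-V3.2': 33.12}
```

An unweighted mean would give 39.79 for Gemini-3-Pro rather than the reported 40.48, which is why the task-weighted reading seems the more plausible one.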

**Table 4 (excerpt): DV-Evolution Results (Top Models)**
| Method | Python | Apache ECharts | Vega-Lite | D3.js | Plotly.js | **Score** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Gemini-3-Pro | 60.36 (±1.62) | 44.45 (±2.31) | 46.30 (±1.48) | 56.34 (±2.45) | 49.76 (±4.72) | **51.44 (±2.18)** |
| Gemini-3-Flash | 58.54 (±1.44) | 46.01 (±2.15) | 45.39 (±3.32) | 49.83 (±2.21) | 47.54 (±1.58) | 49.46 (±2.05) |
| Grok-4 | 53.84 (±1.82) | 44.99 (±2.24) | 50.68 (±3.18) | 49.44 (±1.75) | 45.11 (±1.52) | 48.81 (±2.14) |
| Human | 85.23 | 82.11 | 88.46 | 85.21 | 84.44 | - |
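For DV-Evolution, the Score column matches a plain unweighted mean of the five per-framework results. As with any reverse-engineered aggregation, this is an observation about the reported numbers rather than a documented rule; the check below is a small sketch.

```python
from statistics import mean

# Per-framework results copied from the DV-Evolution table, in column order:
# Python, Apache ECharts, Vega-Lite, D3.js, Plotly.js.
EVOL_ROWS = {
    "Gemini-3-Pro": [60.36, 44.45, 46.30, 56.34, 49.76],
    "Gemini-3-Flash": [58.54, 46.01, 45.39, 49.83, 47.54],
    "Grok-4": [53.84, 44.99, 50.68, 49.44, 45.11],
}

evol_scores = {model: round(mean(vals), 2) for model, vals in EVOL_ROWS.items()}
print(evol_scores)
# {'Gemini-3-Pro': 51.44, 'Gemini-3-Flash': 49.46, 'Grok-4': 48.81}
```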

**Table 5 (excerpt): DV-Interact Results (Top Models)**
| Method | MLLM-Score (Overall) | ISR (%) | **Score** | User Cost |
| :--- | :--- | :--- | :--- | :--- |
| Grok-4 | 51.10 (±2.24) | 79.57 | **40.43 (±1.95)** | $0.051 |
| DeepSeek-V3.2 | 51.30 (±2.18) | 74.05 | 37.94 (±2.05) | $0.032 |
| GPT-5.2 | 50.58 (±1.92) | 69.25 | 35.09 (±1.88) | $0.021 |
| Human | 79.60 | - | - | - |

**Key Findings:**
1.  **Performance Ceiling**: Even top models struggle, with peak scores below 52% across all domains, far below human baselines (~80-88%).
2.  **DV-Sheet Challenge**: Native spreadsheet manipulation is particularly difficult (peak score 40.48%), highlighting deficits in managing object models and dynamic bindings.
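3.  **Framework Variance**: In DV-Evol, performance varies significantly by library. Among the top models shown, Python is consistently the strongest target (53.84-60.36), while the ECharts, Vega-Lite, D3.js, and Plotly.js scores mostly fall between 44 and 51, with Gemini-3-Pro's 56.34 on D3.js the main exception.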
4.  **Interactive Bottleneck**: In DV-Inter, the primary challenge is the interactive process itself; models fail to efficiently bridge underspecified intent to complex visualization logic.

### Detailed Analysis
*   **Native Grounding (DV-Sheet)**: A positive correlation exists between **Table Coverage** and **Visual Aesthetics**; models perform better aesthetically as data grounding improves. Agents excel at simple logic repairs but struggle with precise geometric mapping (axis scaling, encoding errors). Performance declines as table size increases.
*   **Cross-Paradigm Evolution (DV-Evol)**: Performance decays as the target Lines of Code (LOC) increase, indicating a "verbosity tax" for frameworks like D3.js. The `load_image` tool is critical for maintaining semantic fidelity; its removal leads to universal performance drops (e.g., -7.69% for Gemini-3-Pro in D3.js).
*   **Interactive Agency (DV-Inter)**: Transitioning to interactive alignment offers significant benefits (e.g., +23.0% gain for Gemini-3-Pro), but gains depend on **proactive reasoning quality**, not just interaction frequency.

**Table 2: Key Statistics for DV-World.**
| Metric | Value |
| :--- | :--- |
| Total Tasks | 260 (100%) |
| - DV-Sheet (Create/Fix/Dash) | 50 / 50 / 30 (50%) |
| - DV-Evolution | 80 (30.8%) |
| - DV-Interact | 50 (19.2%) |
| DV-Sheet: Avg Columns / Rows | 36.53 / 11,583.36 |
| DV-Evolution: Avg Columns / Rows | 58.98 / 52,584.58 |
| DV-Interact: Ambiguities per task | 3.17 |

### Error Analysis
*   **DV-Sheet**: Errors are dominated by **Data Accuracy** (over 50% in Create, ~69% in Fix), followed by Layout Readability and Visual Design issues.
*   **DV-Evol**: Errors are primarily **Layout & Readability** (42.43% on average) and **Data Consistency** (31.98%), with **Visual Style** being less problematic (25.59%).
*   **DV-Inter**: Failures are primarily driven by the **Cognitive-Execution Gap** (38.44% avg), where clarified intent fails to produce grounded results. **Interactive Avoidance** and **Inquiry Deficit** are also major bottlenecks.

## Theoretical and Practical Implications
*   **Theoretical**: DV-World establishes a new paradigm for evaluating DV agents, shifting focus from isolated code generation to integrated **lifecycle management** encompassing environmental mastery, semantic portability, and proactive alignment. It provides a structured taxonomy of real-world challenges (e.g., fix types, ambiguity points).
*   **Practical**: The benchmark exposes critical weaknesses in current agents, guiding research and development towards:
    1.  Improving **native object model** understanding and **data binding** reliability.
    2.  Enhancing **cross-framework semantic transfer** and **design preservation**.
    3.  Developing robust **interactive reasoning** and **clarification** strategies.
    4.  Adopting the hybrid evaluation framework (MLLM-judge + data fidelity checks) as a reproducible, human-aligned method for assessing complex, open-ended visualization tasks.

## Conclusion
DV-World provides a comprehensive, realistic benchmark for evaluating Data Visualization agents across the full professional lifecycle. Results demonstrate that even state-of-the-art models are ill-equipped for the complexities of real-world visualization, struggling with error correction, faithful data binding, consistent evolution, and intent alignment. By combining rubric-based judgments with quantitative checks and interaction metrics, DV-World enables detailed diagnosis and progress tracking. It establishes a standardized yardstick to quantify and accelerate progress toward reliable, versatile DV agents capable of handling enterprise-level workflows. Future work involves expanding the benchmark and exploring agent architectures that integrate the environmental, evolutionary, and interactive capabilities highlighted as essential by DV-World.

---

_Markdown view of https://picx.dev/p/wcII9c, served by PicX — AI-generated visual whiteboard summaries of research papers._