Summary of "DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios"
Summary (Overview)
- Introduces DV-World, a comprehensive benchmark of 260 tasks designed to evaluate Data Visualization (DV) agents across the full professional lifecycle in real-world scenarios, moving beyond idealized code-sandbox settings.
- Spans three core domains: DV-Sheet for native spreadsheet chart creation, diagnostic repair, and dashboard synthesis; DV-Evolution for cross-framework logic evolution and adaptation; and DV-Interact for proactive multi-turn interaction to align with ambiguous user intent.
- Proposes a hybrid evaluation framework integrating Table-value Alignment for numerical precision and MLLM-as-a-Judge with expert-designed rubrics for semantic-visual assessment, demonstrating strong alignment with human judgment.
- Reveals significant performance gaps: State-of-the-art models (e.g., Gemini-3-Pro, GPT-5.2) achieve less than 50% overall performance, exposing critical deficits in handling native object models, cross-paradigm evolution, and iterative intent alignment.
- Provides a realistic testbed to steer development towards the versatile, integrated expertise required in enterprise visualization workflows, highlighting the need for a shift from one-shot generation to comprehensive lifecycle management.
Introduction and Theoretical Foundation
Data Visualization (DV) is a critical bridge between data and human decisions. While LLM- and MLLM-based DV agents have shown impressive code-generation ability in standardized sandboxes, current benchmarks fail to capture the complexity of real-world professional workflows. The paper identifies three critical gaps in existing evaluation paradigms:
- Environmental Decoupling: Overlooking native spreadsheet-centric workflows (e.g., Excel's object models, data-to-chart bindings) in favor of developer-style code generation.
- Creation-Only Myopia: Focusing on one-shot chart construction while under-testing the evolutionary work required to adapt visualizations to new data and requirements across diverse frameworks.
- Perfect-Intent Assumptions: Building benchmarks on fully specified prompts, ignoring the ambiguity in real user requests and the need for proactive clarification and dialogue.
To bridge these gaps, DV-World is introduced as a benchmark designed to evaluate DV agents across the full lifecycle, emphasizing native environmental grounding, cross-platform evolution, and proactive intent alignment.
Methodology
1. Benchmark Construction & Task Definition
DV-World comprises 260 tasks curated from real-world sources (e.g., ExcelForum, Kaggle) and manually adapted by 18 visualization specialists. It is structured into three domains:
- DV-Sheet: Evaluates native spreadsheet manipulation (a brief native-chart sketch follows this list). An agent performs three sub-tasks:
  - Create: Generate a native chart with dynamic range bindings.
  - Fix: Diagnose a defective chart and repair it.
  - Dash: Compose a professional dashboard by arranging multiple charts and tables.
- DV-Evolution: Evaluates cross-framework logic evolution. Given a reference image, a new dataset, and a set of requirements, the agent must synthesize executable code in a target framework (e.g., Python, D3.js, Vega-Lite).
- DV-Interact: Evaluates proactive iterative interaction under ambiguity. The agent converses with a dual-stage user simulator to clarify the task within its environment, generating code that aligns with the user's latent objectives.
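To make "native object models and dynamic range bindings" concrete, here is a minimal sketch of DV-Sheet-style chart creation using openpyxl; the data and layout are invented for illustration, and this is not the benchmark's actual harness:

```python
# Minimal illustration of native chart creation with range bindings
# (openpyxl; illustrative only, not DV-World's evaluation harness).
from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference

wb = Workbook()
ws = wb.active
for row in [("Month", "Sales"), ("Jan", 120), ("Feb", 90), ("Mar", 150)]:
    ws.append(row)

chart = BarChart()
chart.title = "Monthly Sales"
# Reference objects bind the chart to cell ranges, not to copied values:
# the chart updates when the underlying cells change.
data = Reference(ws, min_col=2, min_row=1, max_row=ws.max_row)
cats = Reference(ws, min_col=1, min_row=2, max_row=ws.max_row)
chart.add_data(data, titles_from_data=True)
chart.set_categories(cats)
ws.add_chart(chart, "E2")
wb.save("sales.xlsx")
```

The point of DV-Sheet is that charts live as objects bound to cell ranges, so correctness depends on manipulating the spreadsheet's object model, not just emitting plotting code.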
2. Evaluation Metrics
A hybrid framework combines quantitative data fidelity checks with qualitative rubric-based assessment via MLLM-as-a-Judge.
- Table Coverage (TC): Measures data integrity via tolerance-aware matching of table values (a sketch follows this list).
- Rubric Score: For tasks like DVSheet-Create and DV-Evol, expert-designed rubrics score multiple dimensions (e.g., Reliability, Consistency, Aesthetics), which are aggregated into a single rubric score.
- Composite Scores: For creation and evolution tasks, the final score combines the rubric score with TC.
- Success Rates: For repair and interactive tasks:
  - DVSheet-Fix: Reported as a repair Success Rate (SR).
  - DV-Inter: Uses an Interaction Success Rate (ISR); the final score is the MLLM score scaled by ISR (Score ≈ MLLM-Score × ISR, consistent with the DV-Interact results below).
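The paper's exact TC formula is not reproduced in this summary; below is a minimal sketch of tolerance-aware value matching, assuming positionally aligned tables and illustrative tolerance defaults (`rtol` and `atol` are placeholders, not the benchmark's settings):

```python
import math

def table_coverage(reference, produced, rtol=1e-3, atol=1e-6):
    """Fraction of reference cells matched in the produced table.
    Numeric cells compare within tolerance; other cells compare as strings."""
    matched, total = 0, 0
    for ref_row, out_row in zip(reference, produced):
        for ref_val, out_val in zip(ref_row, out_row):
            total += 1
            if isinstance(ref_val, (int, float)) and isinstance(out_val, (int, float)):
                matched += math.isclose(out_val, ref_val, rel_tol=rtol, abs_tol=atol)
            else:
                matched += (str(ref_val).strip() == str(out_val).strip())
    return matched / total if total else 0.0

# Example: one value out of four is off by more than the tolerance.
print(table_coverage([("Jan", 120.0), ("Feb", 90.0)],
                     [("Jan", 120.0), ("Feb", 91.0)]))  # -> 0.75
```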
3. Experimental Setup
- Models Evaluated: A wide range of state-of-the-art LLMs/MLLMs, including open-source (Qwen3, GLM-4.7, DeepSeek-V3.2) and proprietary (Gemini, GPT, Grok) families.
- Baseline Agents: SheetCopilot (for DV-Sheet), OpenHands (for DV-Evol), and a unified DV-World-Agent (ReAct-based, with tools `bash`, `load_image`, `render_chart`, and `ask_user`); a minimal sketch of such a loop appears after this list.
- Evaluation Protocol: 4 independent runs per task; GPT-5-Mini serves as the user simulator for DV-Inter, and Gemini-2.5-Flash as the primary MLLM judge.
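For orientation, a minimal sketch of a ReAct-style loop over the four tools listed above; `call_llm` is a hypothetical stand-in for any chat-completion client, and the tool bodies are simplified stubs rather than the DV-World-Agent's actual implementation:

```python
import json
import subprocess

def bash(cmd: str) -> str:
    """Run a shell command and return combined stdout/stderr."""
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return r.stdout + r.stderr

def render_chart(code: str) -> str:
    """Write chart code to a file and execute it (simplified stub)."""
    with open("chart.py", "w") as f:
        f.write(code)
    return bash("python chart.py")

TOOLS = {
    "bash": bash,
    "load_image": lambda path: f"<image {path} attached to context>",  # stub
    "render_chart": render_chart,
    "ask_user": lambda question: input(question),  # a simulator in the benchmark
}

def react_loop(task: str, call_llm, max_steps: int = 20) -> str:
    """Alternate model actions and tool observations until a final answer."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(history)  # expected: final text, or a JSON tool call
        try:
            action = json.loads(reply)  # e.g. {"tool": "bash", "args": {"cmd": "ls"}}
        except json.JSONDecodeError:
            return reply  # not a tool call: treat as the final answer
        if not isinstance(action, dict) or "tool" not in action:
            return reply  # plain text that happened to parse as JSON
        obs = TOOLS[action["tool"]](**action["args"])
        history += [{"role": "assistant", "content": reply},
                    {"role": "user", "content": f"Observation: {obs}"}]
    return "max steps exceeded"
```

In DV-Interact, `ask_user` is answered by the GPT-5-Mini user simulator, so interaction cost and ISR can be measured automatically.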
Empirical Validation / Results
Main Performance Results
Table 1: Comparison of DV-World with existing benchmarks.
| Benchmark | # Tasks | Source | Env. | Input Format | Final Output | Interactive Agency | Open-ended | Evaluation Method |
|---|---|---|---|---|---|---|---|---|
| SpreadsheetBench | 912 | Real | Nat. | Sheet+NL | Sheet | ✗ | ✗ | Rule-based |
| Bird-Interact | 600 | Manual | Prog. | DB+NL | 1 SQL | ✓ | ✗ | Execution-based |
| OSWorld | 369 | Real+Manual | OS | Actions+NL | Actions | ✗ | ✗ | Execution-based |
| DAComp-DA | 210 | Real+Manual | Prog. | Table+NL | Report+Chart | ✗ | ✓ | LLM-judge(rubrics) |
| DV-World (Ours) | 260 | Real + Manual | Nat. & Prog. | Sheet+NL & Table+I+NL | Fin. Table + 1/N Chart | ✓ | Both | Rule-based & MLLM-judge (rubrics) |

Abbreviations: Nat. = native application environment; Prog. = programmatic; NL = natural-language instruction; I = reference image; Fin. = final.
Table 3 (excerpt): DV-Sheet Results (Top Models)
| Method | Create (Overall) | Fix (SR %) | Dashboards (Overall) | Score |
|---|---|---|---|---|
| Gemini-3-Pro | 36.07 (±2.18) | 48.00 | 35.29 (±2.64) | 40.48 (±2.12) |
| GPT-5.2 | 34.43 (±2.42) | 42.00 | 33.98 (±2.31) | 37.24 (±1.88) |
| DeepSeek-V3.2 | 28.31 (±1.74) | 36.00 | 36.35 (±3.02) | 33.12 (±2.45) |
| Human | 80.81 | 88.00 | 87.34 | - |
Table 4 (excerpt): DV-Evolution Results (Top Models)
| Method | Python | Apache ECharts | Vega-Lite | D3.js | Plotly.js | Score |
|---|---|---|---|---|---|---|
| Gemini-3-Pro | 60.36 (±1.62) | 44.45 (±2.31) | 46.30 (±1.48) | 56.34 (±2.45) | 49.76 (±4.72) | 51.44 (±2.18) |
| Gemini-3-Flash | 58.54 (±1.44) | 46.01 (±2.15) | 45.39 (±3.32) | 49.83 (±2.21) | 47.54 (±1.58) | 49.46 (±2.05) |
| Grok-4 | 53.84 (±1.82) | 44.99 (±2.24) | 50.68 (±3.18) | 49.44 (±1.75) | 45.11 (±1.52) | 48.81 (±2.14) |
| Human | 85.23 | 82.11 | 88.46 | 85.21 | 84.44 | - |
Table 5 (excerpt): DV-Interact Results (Top Models)
| Method | MLLM-Score (Overall) | ISR (%) | Score | User Cost |
|---|---|---|---|---|
| Grok-4 | 51.10 (±2.24) | 79.57 | 40.43 (±1.95) | $0.051 |
| DeepSeek-V3.2 | 51.30 (±2.18) | 74.05 | 37.94 (±2.05) | $0.032 |
| GPT-5.2 | 50.58 (±1.92) | 69.25 | 35.09 (±1.88) | $0.021 |
| Human | 79.60 | - | - | - |
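A consistency check on the aggregate columns (derived from the tables above; these formulas are not quoted explicitly in this summary): the DV-Sheet overall score matches a task-count-weighted average of its three sub-tasks (50/50/30 tasks), the DV-Evolution score matches the unweighted mean over the five frameworks, and the DV-Interact score is approximately the MLLM score scaled by ISR:

$$
\text{Score}_{\text{Sheet}} = \frac{50\,\text{Create} + 50\,\text{Fix} + 30\,\text{Dash}}{130}, \qquad
\text{Score}_{\text{Evol}} = \frac{1}{5}\sum_{f \in \text{frameworks}} \text{Score}_{f}, \qquad
\text{Score}_{\text{Inter}} \approx \text{MLLM-Score} \times \text{ISR}
$$

For example, Gemini-3-Pro's DV-Sheet row gives (50·36.07 + 50·48.00 + 30·35.29)/130 ≈ 40.48, matching its reported overall score.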
Key Findings:
- Performance Ceiling: Even top models struggle, with peak scores below 52% across all domains, far below human baselines (~80-88%).
- DV-Sheet Challenge: Native spreadsheet manipulation is particularly difficult (peak score 40.48%), highlighting deficits in managing object models and dynamic bindings.
- Framework Variance: In DV-Evol, performance varies significantly by library, with stronger results in Python/Vega-Lite than in more complex D3.js/Plotly.js.
- Interactive Bottleneck: In DV-Inter, the primary challenge is the interactive process itself; models fail to efficiently bridge underspecified intent to complex visualization logic.
Detailed Analysis
- Native Grounding (DV-Sheet): A positive correlation exists between Table Coverage and Visual Aesthetics; models perform better aesthetically as data grounding improves. Agents excel at simple logic repairs but struggle with precise geometric mapping (axis scaling, encoding errors). Performance declines as table size increases.
- Cross-Paradigm Evolution (DV-Evol): Performance decays as the target Lines of Code (LOC) increase, indicating a "verbosity tax" for verbose frameworks like D3.js. The `load_image` tool is critical for maintaining semantic fidelity; its removal leads to universal performance drops (e.g., -7.69% for Gemini-3-Pro in D3.js).
- Interactive Agency (DV-Inter): Transitioning to interactive alignment offers significant benefits (e.g., a +23.0% gain for Gemini-3-Pro), but gains depend on proactive reasoning quality, not just interaction frequency.
Table 2: Key Statistics for DV-World.
| Metric | Value |
|---|---|
| Total Tasks | 260 (100%) |
| - DV-Sheet (Create/Fix/Dash) | 50 / 50 / 30 (50%) |
| - DV-Evolution | 80 (30.8%) |
| - DV-Interact | 50 (19.2%) |
| DV-Sheet: Avg Columns / Rows | 36.53 / 11,583.36 |
| DV-Evolution: Avg Columns / Rows | 58.98 / 52,584.58 |
| DV-Interact: Ambiguities per task | 3.17 |
Error Analysis
- DV-Sheet: Errors are dominated by Data Accuracy (over 50% in Create, ~69% in Fix), followed by Layout Readability and Visual Design issues.
- DV-Evol: Errors are primarily Layout & Readability (42.43% on average) and Data Consistency (31.98%), with Visual Style being less problematic (25.59%).
- DV-Inter: Failures are primarily driven by the Cognitive-Execution Gap (38.44% avg), where clarified intent fails to produce grounded results. Interactive Avoidance and Inquiry Deficit are also major bottlenecks.
Theoretical and Practical Implications
- Theoretical: DV-World establishes a new paradigm for evaluating DV agents, shifting focus from isolated code generation to integrated lifecycle management encompassing environmental mastery, semantic portability, and proactive alignment. It provides a structured taxonomy of real-world challenges (e.g., fix types, ambiguity points).
- Practical: The benchmark exposes critical weaknesses in current agents, guiding research and development towards:
- Improving native object model understanding and data binding reliability.
- Enhancing cross-framework semantic transfer and design preservation.
- Developing robust interactive reasoning and clarification strategies.
- The hybrid evaluation framework (MLLM-judge + data fidelity checks) offers a reproducible and human-aligned method for assessing complex, open-ended visualization tasks.
Conclusion
DV-World provides a comprehensive, realistic benchmark for evaluating Data Visualization agents across the full professional lifecycle. Results demonstrate that even state-of-the-art models are ill-equipped for the complexities of real-world visualization, struggling with error correction, faithful data binding, consistent evolution, and intent alignment. By combining rubric-based judgments with quantitative checks and interaction metrics, DV-World enables detailed diagnosis and progress tracking. It establishes a standardized yardstick to quantify and accelerate progress toward reliable, versatile DV agents capable of handling enterprise-level workflows. Future work involves expanding the benchmark and exploring agent architectures that integrate the environmental, evolutionary, and interactive capabilities highlighted as essential by DV-World.