Summary of "DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios"
Summary (Overview)
- Introduces DV-World, a comprehensive benchmark of 260 tasks designed to evaluate Data Visualization (DV) agents across the full professional lifecycle in real-world scenarios, moving beyond idealized code-sandbox settings.
- Spans three core domains: DV-Sheet for native spreadsheet chart creation, diagnostic repair, and dashboard synthesis; DV-Evolution for cross-framework logic evolution and adaptation; and DV-Interact for proactive multi-turn interaction to align with ambiguous user intent.
- Proposes a hybrid evaluation framework integrating Table-value Alignment for numerical precision and MLLM-as-a-Judge with expert-designed rubrics for semantic-visual assessment, demonstrating strong alignment with human judgment.
- Reveals significant performance gaps: State-of-the-art models (e.g., Gemini-3-Pro, GPT-5.2) achieve less than 50% overall performance, exposing critical deficits in handling native object models, cross-paradigm evolution, and iterative intent alignment.
- Provides a realistic testbed to steer development towards the versatile, integrated expertise required in enterprise visualization workflows, highlighting the need for a shift from one-shot generation to comprehensive lifecycle management.
Introduction and Theoretical Foundation
Data Visualization (DV) is a critical bridge between data and human decisions. While LLM- and MLLM-based DV agents have shown impressive code-generation ability in standardized sandboxes, current benchmarks fail to capture the complexity of real-world professional workflows. The paper identifies three critical gaps in existing evaluation paradigms:
- Environmental Decoupling: Overlooking native spreadsheet-centric workflows (e.g., Excel's object models, data-to-chart bindings) in favor of developer-style code generation.
- Creation-Only Myopia: Focusing on one-shot chart construction while under-testing the evolutionary work required to adapt visualizations to new data and requirements across diverse frameworks.
- Perfect-Intent Assumptions: Building benchmarks on fully specified prompts, ignoring the ambiguity in real user requests and the need for proactive clarification and dialogue.
To bridge these gaps, DV-World is introduced as a benchmark designed to evaluate DV agents across the full lifecycle, emphasizing native environmental grounding, cross-platform evolution, and proactive intent alignment.
Methodology
1. Benchmark Construction & Task Definition
DV-World comprises 260 tasks curated from real-world sources (e.g., ExcelForum, Kaggle) and manually adapted by 18 visualization specialists. It is structured into three domains:
- DV-Sheet: Evaluates native spreadsheet manipulation (a brief native-chart sketch follows this list). An agent performs three sub-tasks:
  - Create: Generate a native chart with dynamic range bindings.
  - Fix: Diagnose a defective chart and repair it.
  - Dash: Compose a professional dashboard by arranging multiple charts and tables.
- DV-Evolution: Evaluates cross-framework logic evolution. Given a reference image, a new dataset, and a set of requirements, the agent must synthesize executable code in a target framework (e.g., Python, D3.js, Vega-Lite).
- DV-Interact: Evaluates proactive iterative interaction under ambiguity. The agent converses with a dual-stage user simulator to clarify the task within its environment, generating code that aligns with the user's latent objectives.
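To make "native object models and dynamic range bindings" concrete, here is a minimal sketch of DV-Sheet-style chart creation using openpyxl; the data and layout are invented for illustration, and this is not the benchmark's actual harness:

```python
# Minimal illustration of native chart creation with range bindings
# (openpyxl; illustrative only, not DV-World's evaluation harness).
from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference

wb = Workbook()
ws = wb.active
for row in [("Month", "Sales"), ("Jan", 120), ("Feb", 90), ("Mar", 150)]:
    ws.append(row)

chart = BarChart()
chart.title = "Monthly Sales"
# Reference objects bind the chart to cell ranges, not to copied values:
# the chart updates when the underlying cells change.
data = Reference(ws, min_col=2, min_row=1, max_row=ws.max_row)
cats = Reference(ws, min_col=1, min_row=2, max_row=ws.max_row)
chart.add_data(data, titles_from_data=True)
chart.set_categories(cats)
ws.add_chart(chart, "E2")
wb.save("sales.xlsx")
```

The point of DV-Sheet is that charts live as objects bound to cell ranges, so correctness depends on manipulating the spreadsheet's object model, not just emitting plotting code.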
2. Evaluation Metrics
A hybrid framework combines quantitative data fidelity checks with qualitative rubric-based assessment via MLLM-as-a-Judge.
- Table Coverage (TC): Measures data integrity via tolerance-aware matching of table values (a sketch follows this list).
- Rubric Score: For tasks like DVSheet-Create and DV-Evol, expert-designed rubrics score multiple dimensions (e.g., Reliability, Consistency, Aesthetics), which are aggregated into a single rubric score.
- Composite Scores: For creation and evolution tasks, the final score combines the rubric score with TC.
- Success Rates: For repair and interactive tasks:
  - DVSheet-Fix: Reported as a repair Success Rate (SR).
  - DV-Inter: Uses an Interaction Success Rate (ISR); the final score is the MLLM score scaled by ISR (Score ≈ MLLM-Score × ISR, consistent with the DV-Interact results below).
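The paper's exact TC formula is not reproduced in this summary; below is a minimal sketch of tolerance-aware value matching, assuming positionally aligned tables and illustrative tolerance defaults (`rtol` and `atol` are placeholders, not the benchmark's settings):

```python
import math

def table_coverage(reference, produced, rtol=1e-3, atol=1e-6):
    """Fraction of reference cells matched in the produced table.
    Numeric cells compare within tolerance; other cells compare as strings."""
    matched, total = 0, 0
    for ref_row, out_row in zip(reference, produced):
        for ref_val, out_val in zip(ref_row, out_row):
            total += 1
            if isinstance(ref_val, (int, float)) and isinstance(out_val, (int, float)):
                matched += math.isclose(out_val, ref_val, rel_tol=rtol, abs_tol=atol)
            else:
                matched += (str(ref_val).strip() == str(out_val).strip())
    return matched / total if total else 0.0

# Example: one value out of four is off by more than the tolerance.
print(table_coverage([("Jan", 120.0), ("Feb", 90.0)],
                     [("Jan", 120.0), ("Feb", 91.0)]))  # -> 0.75
```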
3. Experimental Setup
- Models Evaluated: A wide range of state-of-the-art LLMs/MLLMs, including open-source (Qwen3, GLM-4.7, DeepSeek-V3.2) and proprietary (Gemini, GPT, Grok) families.
- Baseline Agents: SheetCopilot (for DV-Sheet), OpenHands (for DV-Evol), and a unified DV-World-Agent (ReAct-based, with tools `bash`, `load_image`, `render_chart`, and `ask_user`); a minimal sketch of such a loop appears after this list.
- Evaluation Protocol: 4 independent runs per task; GPT-5-Mini serves as the user simulator for DV-Inter, and Gemini-2.5-Flash as the primary MLLM judge.
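For orientation, a minimal sketch of a ReAct-style loop over the four tools listed above; `call_llm` is a hypothetical stand-in for any chat-completion client, and the tool bodies are simplified stubs rather than the DV-World-Agent's actual implementation:

```python
import json
import subprocess

def bash(cmd: str) -> str:
    """Run a shell command and return combined stdout/stderr."""
    r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=120)
    return r.stdout + r.stderr

def render_chart(code: str) -> str:
    """Write chart code to a file and execute it (simplified stub)."""
    with open("chart.py", "w") as f:
        f.write(code)
    return bash("python chart.py")

TOOLS = {
    "bash": bash,
    "load_image": lambda path: f"<image {path} attached to context>",  # stub
    "render_chart": render_chart,
    "ask_user": lambda question: input(question),  # a simulator in the benchmark
}

def react_loop(task: str, call_llm, max_steps: int = 20) -> str:
    """Alternate model actions and tool observations until a final answer."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(history)  # expected: final text, or a JSON tool call
        try:
            action = json.loads(reply)  # e.g. {"tool": "bash", "args": {"cmd": "ls"}}
        except json.JSONDecodeError:
            return reply  # not a tool call: treat as the final answer
        if not isinstance(action, dict) or "tool" not in action:
            return reply  # plain text that happened to parse as JSON
        obs = TOOLS[action["tool"]](**action["args"])
        history += [{"role": "assistant", "content": reply},
                    {"role": "user", "content": f"Observation: {obs}"}]
    return "max steps exceeded"
```

In DV-Interact, `ask_user` is answered by the GPT-5-Mini user simulator, so interaction cost and ISR can be measured automatically.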
Empirical Validation / Results
Main Performance Results
Table 1: Comparison of DV-World with existing benchmarks.
| Benchmark | # Tasks | Source | Env. | Input Format | Final Output | Interactive Agency | Open-ended | Evaluation Method |
|---|---|---|---|---|---|---|---|---|
| SpreadsheetBench | 912 | Real | Nat. | Sheet+NL | Sheet | ✗ | ✗ | Rule-based |
| Bird-Interact | 600 | Manual | Prog. | DB+NL | 1 SQL | ✓ | ✗ | Execution-based |
| OSWorld | 369 | Real+Manual | OS | Actions+NL | Actions | ✗ | ✗ | Execution-based |
| DAComp-DA | 210 | Real+Manual | Prog. | Table+NL | Report+Chart | ✗ | ✓ | LLM-judge(rubrics) |
| DV-World (Ours) | 260 | Real + Manual | Nat. & Prog. | Sheet+NL & Table+I+NL | Fin. Table + 1/N Chart | ✓ | Both | Rule-based & MLLM-judge (rubrics) |

Abbreviations: Nat. = native application environment; Prog. = programmatic; NL = natural-language instruction; I = reference image; Fin. = final.
Table 3 (excerpt): DV-Sheet Results (Top Models)
| Method | Create (Overall) | Fix (SR %) | Dashboards (Overall) | Score |
|---|---|---|---|---|
| Gemini-3-Pro | 36.07 (±2.18) | 48.00 | 35.29 (±2.64) | 40.48 (±2.12) |
| GPT-5.2 | 34.43 (±2.42) | 42.00 | 33.98 (±2.31) | 37.24 (±1.88) |
| DeepSeek-V3.2 | 28.31 (±1.74) | 36.00 | 36.35 (±3.02) | 33.12 (±2.45) |
| Human | 80.81 | 88.00 | 87.34 | - |
Table 4 (excerpt): DV-Evolution Results (Top Models)
| Method | Python | Apache ECharts | Vega-Lite | D3.js | Plotly.js | Score |
|---|---|---|---|---|---|---|
| Gemini-3-Pro | 60.36 (±1.62) | 44.45 (±2.31) | 46.30 (±1.48) | 56.34 (±2.45) | 49.76 (±4.72) | 51.44 (±2.18) |
| Gemini-3-Flash | 58.54 (±1.44) | 46.01 (±2.15) | 45.39 (±3.32) | 49.83 (±2.21) | 47.54 (±1.58) | 49.46 (±2.05) |
| Grok-4 | 53.84 (±1.82) | 44.99 (±2.24) | 50.68 (±3.18) | 49.44 (±1.75) | 45.11 (±1.52) | 48.81 (±2.14) |
| Human | 85.23 | 82.11 | 88.46 | 85.21 | 84.44 | - |
Table 5 (excerpt): DV-Interact Results (Top Models)
| Method | MLLM-Score (Overall) | ISR (%) | Score | User Cost |
|---|---|---|---|---|
| Grok-4 | 51.10 (±2.24) | 79.57 | 40.43 (±1.95) | $0.051 |
| DeepSeek-V3.2 | 51.30 (±2.18) | 74.05 | 37.94 (±2.05) | $0.032 |
| GPT-5.2 | 50.58 (±1.92) | 69.25 | 35.09 (±1.88) | $0.021 |
| Human | 79.60 | - | - | - |
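A consistency check on the aggregate columns (derived from the tables above; these formulas are not quoted explicitly in this summary): the DV-Sheet overall score matches a task-count-weighted average of its three sub-tasks (50/50/30 tasks), the DV-Evolution score matches the unweighted mean over the five frameworks, and the DV-Interact score is approximately the MLLM score scaled by ISR:

$$
\text{Score}_{\text{Sheet}} = \frac{50\,\text{Create} + 50\,\text{Fix} + 30\,\text{Dash}}{130}, \qquad
\text{Score}_{\text{Evol}} = \frac{1}{5}\sum_{f \in \text{frameworks}} \text{Score}_{f}, \qquad
\text{Score}_{\text{Inter}} \approx \text{MLLM-Score} \times \text{ISR}
$$

For example, Gemini-3-Pro's DV-Sheet row gives (50·36.07 + 50·48.00 + 30·35.29)/130 ≈ 40.48, matching its reported overall score.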
Key Findings:
- Performance Ceiling: Even top models struggle, with peak scores below 52% across all domains, far below human baselines (~80-88%).
- DV-Sheet Challenge: Native spreadsheet manipulation is particularly difficult (peak score 40.48%), highlighting deficits in managing object models and dynamic bindings.
- Framework Variance: In DV-Evol, performance varies significantly by library, with stronger results in Python/Vega-Lite than in more complex D3.js/Plotly.js.
- Interactive Bottleneck: In DV-Inter, the primary challenge is the interactive process itself; models fail to efficiently bridge underspecified intent to complex visualization logic.
Detailed Analysis
- Native Grounding (DV-Sheet): A positive correlation exists between Table Coverage and Visual Aesthetics; models perform better aesthetically as data grounding improves. Agents excel at simple logic repairs but struggle with precise geometric mapping (axis scaling, encoding errors). Performance declines as table size increases.
- Cross-Paradigm Evolution (DV-Evol): Performance decays as the target Lines of Code (LOC) increase, indicating a "verbosity tax" for verbose frameworks like D3.js. The `load_image` tool is critical for maintaining semantic fidelity; its removal leads to universal performance drops (e.g., -7.69% for Gemini-3-Pro in D3.js).
- Interactive Agency (DV-Inter): Transitioning to interactive alignment offers significant benefits (e.g., a +23.0% gain for Gemini-3-Pro), but gains depend on proactive reasoning quality, not just interaction frequency.
Table 2: Key Statistics for DV-World.
| Metric | Value |
|---|---|
| Total Tasks | 260 (100%) |
| - DV-Sheet (Create/Fix/Dash) | 50 / 50 / 30 (50%) |
| - DV-Evolution | 80 (30.8%) |
| - DV-Interact | 50 (19.2%) |
| DV-Sheet: Avg Columns / Rows | 36.53 / 11,583.36 |
| DV-Evolution: Avg Columns / Rows | 58.98 / 52,584.58 |
| DV-Interact: Ambiguities per task | 3.17 |
Error Analysis
- DV-Sheet: Errors are dominated by Data Accuracy (over 50% in Create, ~69% in Fix), followed by Layout Readability and Visual Design issues.
- DV-Evol: Errors are primarily Layout & Readability (42.43% on average) and Data Consistency (31.98%), with Visual Style being less problematic (25.59%).
- DV-Inter: Failures are primarily driven by the Cognitive-Execution Gap (38.44% avg), where clarified intent fails to produce grounded results. Interactive Avoidance and Inquiry Deficit are also major bottlenecks.
Theoretical and Practical Implications
- Theoretical: DV-World establishes a new paradigm for evaluating DV agents, shifting focus from isolated code generation to integrated lifecycle management encompassing environmental mastery, semantic portability, and proactive alignment. It provides a structured taxonomy of real-world challenges (e.g., fix types, ambiguity points).
- Practical: The benchmark exposes critical weaknesses in current agents, guiding research and development towards:
- Improving native object model understanding and data binding reliability.
- Enhancing cross-framework semantic transfer and design preservation.
- Developing robust interactive reasoning and clarification strategies.
- The hybrid evaluation framework (MLLM-judge + data fidelity checks) offers a reproducible and human-aligned method for assessing complex, open-ended visualization tasks.
Conclusion
DV-World provides a comprehensive, realistic benchmark for evaluating Data Visualization agents across the full professional lifecycle. Results demonstrate that even state-of-the-art models are ill-equipped for the complexities of real-world visualization, struggling with error correction, faithful data binding, consistent evolution, and intent alignment. By combining rubric-based judgments with quantitative checks and interaction metrics, DV-World enables detailed diagnosis and progress tracking. It establishes a standardized yardstick to quantify and accelerate progress toward reliable, versatile DV agents capable of handling enterprise-level workflows. Future work involves expanding the benchmark and exploring agent architectures that integrate the environmental, evolutionary, and interactive capabilities highlighted as essential by DV-World.