Visual Summary | Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Summary (Overview)

Data2Story is a multi-agent framework that orchestrates seven specialized roles (Detective, Analyst, Editor, Designer, Programmer, Auditor, Inspector) into a virtual newsroom, transforming raw data into verifiable, multimodal articles.
Key innovation 1: Evidence-grounded claims – The Inspector binds every sentence, chart, and asset to its upstream provenance (specific code lines, data sources, or external URLs), making the article fully auditable.
Key innovation 2: Multimodal generative storytelling – The system reasons about audience needs and deploys interactive maps, audio, video, and charts tailored to the data topic, rather than producing static text-and-charts.
Evaluation: On 18 paired articles (vs. human-written pieces from The Economist, The Pudding, TidyTuesday), Data2Story receives favorable ratings from 53 human participants across five rubric dimensions, with significantly higher transparency and auditability. Human articles retain an edge in editorial angle, creative design, and informative presentation.
Positioning: Data2Story is a collaborator, not a replacement – it augments newsroom workflows with scalable analysis, multimedia assets, and built-in provenance tracing.

Introduction and Theoretical Foundation

Data journalism turns raw data into stories that non-expert audiences can understand and trust. Producing a high-quality news feature typically takes a newsroom team weeks, involving context hunting, statistical analysis, angle selection, and visual design. While recent AI agents excel at individual steps (data science agents [1–4], visualization agents [5–8]), no prior system can serve as an end-to-end data journalist.

A critical challenge in AI-generated journalism is lack of verification and traceability: readers and editors cannot confirm where a number came from, whether a chart reflects the underlying data, or whether a claim was hallucinated. Data2Story directly addresses this gap by grounding nearly every statistic, visual asset, and factual claim in executable code or a verifiable source URL.

The theoretical foundation draws on the concept of a virtual newsroom – a coordinated team of specialized agents, each with distinct expertise. The paper builds on prior work in deep search agents, data visualization agents, data science agents, and data journalism systems, as summarized in Table 1.

Table 1 | Comparison with related works. (Key aspects: Ext. Search, Narr. Angle, Multimodal Image/Video/Audio/Interactive, Evidence Source/Code/Grounded)

System	Inputs	Outputs	Ext. Search	Narr. Angle	Multimodal (I/V/A/Interact)	Evidence (Src/Cd/Grd)
Search Agents (MindSearch, MMSearch, DR Tulu)	Query	Report	✓	✗	✗/✗/✗/✗	✓/✗/✓
Vis Agents (MatplotAgent, LIDA, CoDA)	Query+Data	Infographic	✗	✗	✓/✗/✗/✗	✗/✓/✓
Data Science Agents (DSGym, Data Interpreter, AI Scientist)	Query+Data	Report/Score	✓/✗	✗/✓	✓/✗/✗/✗	✓/✓/✓
Data Journalist Agents (LLM writer, Human writer)	Data	Article	✓	✓	✓/✓/✓/✓	✓/✗/✓
Data2Story (Ours)	Data	Article	✓	✓	✓/✓/✓/✓	✓/✓/✓

Methodology

The Virtual Newsroom

Given raw data $D$, Data2Story produces an article $U$ through seven roles:

Detective: Gathers external context $\tilde{D}$ via web search, augmenting $D$ with categorized context items, source URLs, and reference media.
Analyst: Enumerates all possible statistics by profiling every column and executing code, producing results $R = \{r_i\}$ and supporting code $C = \{c_i\}$ where $r_i \xleftarrow{c_i} D \cup \tilde{D}$.
Editor: Decides the narrative arc – ranks findings by priority, selects items to keep, and drafts a paragraph-level outline $F \xleftarrow{\text{LLM}} R$, with each $f_i$ annotated with upstream items.
Designer: For each finding $f_i$, reasons about reader preferences and selects the best medium (map, audio, video, interactive widget), producing per-section visual assets $V \xleftarrow{\text{Tool}} F$.
Programmer: Renders the final HTML page $U$ in assembly mode ($U \leftarrow \{F, V\}$) or revision mode ($U \leftarrow \{U, S\}$).
Auditor: Reviews the rendered page, flags visual/structural defects, and returns suggestions $S \leftarrow U$ to the Programmer.
Inspector: Closes the evidence loop by decomposing the page into partial findings $U = \{u_m\}$ and binding each $u_m$ to evidence entries $E = D \cup R \cup C \cup F \cup V$. Two evidence types:
- Code evidence: claim traces back to the specific script and line that produced it.
- Reference evidence: contextual claim grounded in an external URL.
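
The data flow among the roles can be sketched as plain function composition. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name, the stub bodies, the `Newsroom` container, and the `max_rounds` parameter are hypothetical. Only the symbols $D$, $\tilde{D}$, $R$, $C$, $F$, $V$, $U$, $S$ come from the description above. The Inspector's binding step is left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Newsroom:
    """Hypothetical container for the intermediate artifacts of one run."""
    data: list                                      # D: raw data rows
    context: list = field(default_factory=list)     # D~: external context (Detective)
    results: list = field(default_factory=list)     # R: statistics (Analyst)
    code: list = field(default_factory=list)        # C: code that produced R (Analyst)
    findings: list = field(default_factory=list)    # F: outline (Editor)
    visuals: list = field(default_factory=list)     # V: per-finding assets (Designer)
    page: str = ""                                  # U: rendered HTML (Programmer)


def detective(n):
    # Stub: a real Detective would run web searches.
    n.context = [{"note": "placeholder context", "url": "https://example.org"}]


def analyst(n):
    # r_i <- c_i(D ∪ D~): keep each result next to the code that produced it.
    n.code = ["sum(row['value'] for row in data)"]
    n.results = [sum(row["value"] for row in n.data)]


def editor(n):
    # Each finding f_i is annotated with the upstream result it relies on.
    n.findings = [{"text": f"Total value is {r}", "from_result": i}
                  for i, r in enumerate(n.results)]


def designer(n):
    # Stub: always picks a chart; a real Designer would reason about the medium.
    n.visuals = [{"finding": i, "medium": "chart"} for i in range(len(n.findings))]


def programmer(n, suggestions=None):
    if suggestions is None:
        # Assembly mode: U <- {F, V}
        body = "".join(f"<p>{f['text']}</p>" for f in n.findings)
        n.page = f"<article>{body}</article>"
    else:
        # Revision mode: U <- {U, S}
        for fix in suggestions:
            n.page = fix(n.page)


def auditor(n):
    # S <- U: return a list of page-rewriting suggestions (empty if none).
    suggestions = []
    if "<h1>" not in n.page:
        suggestions.append(
            lambda page: page.replace("<article>", "<article><h1>Untitled</h1>", 1))
    return suggestions


def run(data, max_rounds=3):
    n = Newsroom(data=data)
    for role in (detective, analyst, editor, designer):
        role(n)
    programmer(n)
    for _ in range(max_rounds):
        suggestions = auditor(n)
        if not suggestions:
            break
        programmer(n, suggestions)
    return n


newsroom = run([{"value": 2}, {"value": 3}])
print(newsroom.page)
```

The loop between `auditor` and `programmer` mirrors the assembly and revision modes: the page is first assembled from $F$ and $V$, then revised until the Auditor returns no further suggestions or the round limit is reached.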

Evidence Binding

The Inspector links each page fragment to the upstream items that support it:

$$u_m \sim (d_i, r_j, c_j, f_k, v_l)$$

where each $u_m$ is an HTML fragment (sentence, chart, interactive element) and $d_i$, $r_j$, $c_j$, $f_k$, $v_l$ are drawn from the five evidence sets $D$, $R$, $C$, $F$, $V$. This enables full auditability: every claim can be followed back through the Programmer, Designer, and Analyst to the original data file or source reference.
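
One way to picture a binding is as a small record attached to each fragment id. The sketch below is an assumption-laden illustration: the `Evidence` class, the `bind` and `trace` helpers, the script name `analysis.py`, the line number, and the example URL are all hypothetical and are not taken from the paper. It only shows how the two evidence types (code and reference) could be resolved to something a reader can follow.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Evidence:
    kind: str                      # "code" or "reference"
    script: Optional[str] = None   # code evidence: file that produced the value
    line: Optional[int] = None     # code evidence: line within that file
    url: Optional[str] = None      # reference evidence: external source


def bind(fragments: Dict[str, str],
         index: Dict[str, Evidence]) -> Dict[str, Optional[Evidence]]:
    """Attach an evidence entry to every fragment id, or None if nothing matches."""
    return {fid: index.get(fid) for fid in fragments}


def trace(evidence: Optional[Evidence]) -> str:
    """Render a binding as a pointer the reader can follow."""
    if evidence is None:
        return "unbound"
    if evidence.kind == "code":
        return f"{evidence.script}:{evidence.line}"
    return evidence.url or "unbound"


# Placeholder fragments u_m and a hypothetical evidence index.
fragments = {
    "u1": "a statistic computed from the data",
    "u2": "a contextual claim from an outside source",
    "u3": "a sentence with no recorded provenance",
}
index = {
    "u1": Evidence(kind="code", script="analysis.py", line=42),
    "u2": Evidence(kind="reference", url="https://example.org/source"),
}

bindings = bind(fragments, index)
for fid, ev in bindings.items():
    print(fid, "->", trace(ev))
```

Keeping unbound fragments in the mapping (as `None`) rather than dropping them makes it straightforward to count how many claims lack provenance.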

Empirical Validation / Results

Setting

18 articles from three sources: The Economist (analytical), The Pudding (long-form interactive), TidyTuesday (community-driven with code).
Data2Story uses Claude Code with claude-opus-4.7.
Human study: 53 participants (Prolific platform), blind scoring on 5 rubric dimensions (1–7 scale): Visual Design, Narrative & Pacing, Data & Method Transparency, Claim–Data Alignment, Insight Value.

Key Results

1. Textual Distribution

The agent writes $1.45\times$ as many sentences as the human authors, but each sentence is only $0.77\times$ as long.

2. Claim Coverage

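Human-in-Agent coverage (share of human claims that also appear in the agent article): $P(\text{Agent}|\text{Human}) = 50.4\%$ overall.
Agent-in-Human coverage (share of agent claims that also appear in the human article): $P(\text{Human}|\text{Agent}) = 35.1\%$ overall.
The gap is widest on The Economist: $73.0\%$ vs $39.5\%$.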
Defined as: $P(\text{Agent}|\text{Human}) = \frac{|\text{Human} \cap \text{Agent}|}{|\text{Human}|}, \quad P(\text{Human}|\text{Agent}) = \frac{|\text{Human} \cap \text{Agent}|}{|\text{Agent}|}$
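
The two coverage ratios share a numerator and differ only in the denominator, which a small worked example makes concrete. The sketch below assumes that claims from both articles have already been matched to shared identifiers; how the paper performs that matching is not described here. The claim ids and the `coverage` helper are made up for illustration.

```python
def coverage(human_claims, agent_claims):
    """Return (P(Agent|Human), P(Human|Agent)) for two collections of claim ids."""
    human, agent = set(human_claims), set(agent_claims)
    if not human or not agent:
        raise ValueError("both articles need at least one claim")
    shared = human & agent
    return len(shared) / len(human), len(shared) / len(agent)


# Illustrative ids only: 4 human claims, 5 agent claims, 2 in common.
human = ["c1", "c2", "c3", "c4"]
agent = ["c1", "c2", "a1", "a2", "a3"]

human_in_agent, agent_in_human = coverage(human, agent)
print(human_in_agent, agent_in_human)  # 0.5 0.4
```

Because the agent article contains more claims in this toy case, the same overlap yields a lower Agent-in-Human ratio, the same direction as the asymmetry reported above.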

3. Human Study (n=53)

Overall rubric score: Data2Story $4.21$ vs Human $3.38$ ($p<.001$).
Largest gap: Transparency ($\Delta = +1.49$, $p<.001$); smallest: Visual ($\Delta = +0.51$, $p=.015$).
Source breakdown: Economist ($\Delta=+1.02$), TidyTuesday ($\Delta=+1.20$) favor agent; Pudding is a tie ($\Delta=+0.11$, $p=.699$).
Pairwise preference: 74% prefer Data2Story, 25% human, 2% tie.

4. Computer-Use Agent as Judge

Judge: gpt-5.5-xhigh with browser-use, a model from a different family than the claude-opus-4.7 model that Data2Story uses.
With Inspector: overall mean $5.10$; without Inspector: $4.60$ (human articles: $3.87$).
The Inspector effect concentrates on Transparency ($+1.67$ on the 1–7 scale).
The agent judge's ranking correlates moderately with the human ranking (Spearman $\rho = 0.44$, $p=.009$), at a fraction of the cost.
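
Spearman's $\rho$ is the Pearson correlation computed on ranks, so it measures whether two raters order the articles similarly rather than whether their scores match. The sketch below is a dependency-free illustration of that definition; the function names and the score lists are invented for the example and are not the study's data.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-article scores from two raters (not the paper's data).
human_scores = [3.0, 4.5, 2.0, 5.0]
judge_scores = [3.5, 3.0, 2.5, 5.5]

print(round(spearman(human_scores, judge_scores), 3))
```

In this toy case the two raters swap the order of one pair of articles out of four, which already lowers $\rho$ from 1 to 0.8; a value of 0.44 therefore indicates agreement that is clearly positive but far from a one-to-one ranking.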

5. Verifiability (Auditability)

Data2Story: 93% of claims have traceable binding (code or reference evidence).
Human articles: 25% (no code shipped by default; verifier must guess).
Per-role contribution: Editor $99.3\%$ , Detective $95.1\%$ , Analyst $74.1\%$ , Designer $29.0\%$ .
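
The percentages above can be read as simple ratios over page fragments. The sketch below shows one plausible reading, stated as an assumption: the binding rate is the share of fragments with at least one evidence entry, and a role's contribution is the share of fragments whose evidence includes an item from that role. The paper's exact definitions may differ, and the fragment ids and role sets are invented.

```python
from collections import Counter


def binding_rate(bindings):
    """Share of fragments that have at least one supporting role."""
    bound = sum(1 for roles in bindings.values() if roles)
    return bound / len(bindings)


def per_role_share(bindings):
    """Share of all fragments whose evidence includes each role (assumed definition)."""
    counts = Counter(role for roles in bindings.values() for role in roles)
    return {role: count / len(bindings) for role, count in counts.items()}


# Invented example: fragment id -> set of roles that contributed evidence.
bindings = {
    "u1": {"Editor", "Analyst"},
    "u2": {"Editor", "Detective"},
    "u3": {"Editor"},
    "u4": set(),          # no provenance recorded
}

print(binding_rate(bindings))     # 0.75
print(per_role_share(bindings))
```

Under this reading the per-role shares need not sum to 100%, since a single fragment can draw on several roles at once, which is consistent with the reported figures.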

6. Inspector Usefulness (Human Feedback)

66% of participants found the Inspector helpful; 25% found it not helpful or distracting (density of provenance traces was a concern).

Qualitative Human Advantages

Editorial angle: In the Repair Cafés article, the human author frames repair as a question of manufacturer accountability, supported by expert testimony; the agent ranks broken items but misses the cause.
Creative design: The Pudding's Stand-Up Comedy article turns a transcript into an interactive interface; the agent uses a static YouTube thumbnail and standard charts.
Informative presentation: The human space-race chart overlays state vs commercial, failure rates, and annotations in a single view; the agent spreads the same information across single-variable charts.

Theoretical and Practical Implications

Theoretical Contributions

Evidence-grounding as a design principle: The Inspector demonstrates that multi-agent systems can enforce end-to-end traceability without sacrificing narrative quality – a new paradigm for trustworthy AI content generation.
Multimodal reasoning: The system shows that an agent can autonomously decide which medium (map, audio, interactive) best serves the data and audience, not just what to say.
Computational journalism framework: Formalizes the newsroom workflow as a pipeline of specialized, auditable agents, enabling reproducible and transparent data storytelling.

Practical Implications

Augmenting newsroom workflows: Data2Story can handle labor-intensive data analysis, graphics design, and niche dataset exploration, freeing human journalists for editorial judgment and creative framing.
Opening specialized stories: The system can surface overlooked datasets (e.g., 2026 World Cup climate risk, arXiv discipline shift, time-use gender gaps) that newsrooms lack bandwidth to investigate.
Auditability as a feature: The Inspector provides a formalized, machine-checkable provenance that human articles rarely offer, enabling readers and editors to independently verify claims.
Cost-effective evaluation: Computer-use agents can serve as a lower-cost proxy for ranking article quality, showing moderate agreement with human judgments (Spearman $\rho=0.44$).

Conclusion

Data2Story is a multi-agent framework that orchestrates seven specialized roles into a virtual newsroom for end-to-end data journalism. It contributes two key properties: (i) evidence-grounded claims via an Inspector that binds each output to upstream code or reference, and (ii) multimodal generative storytelling that reasons about audience needs before deploying appropriate tools. Across 18 paired articles with human references, Data2Story receives favorable ratings from human participants and computer-use agents, with the Inspector specifically improving transparency.

The authors position Data2Story as a collaborator for human journalists – handling analysis, multimedia generation, and auditability, while humans provide editorial angle, creative design, and informative presentation. Future directions include:

Incorporating human feedback in the loop for revision.
Exploring how agents can interpret reader feedback and adjust professionally.
Directly comparing the depth of written angles between human and agent.

A central finding is that Data2Story’s greatest advantage lies in auditability: making the evidentiary basis of each claim explicit and measurable – something even carefully crafted human articles rarely provide natively. This moves toward a trustworthy agentic data system where readers can follow every claim back to its source.