Summary (Overview)
- Proposes DR³-Eval: A new benchmark designed for realistic, reproducible, and multimodal evaluation of Deep Research Agents (DRAs) focused on long-form, citation-grounded report generation. It addresses the tension between realism and reproducibility by using a static, per-task "sandbox corpus" that simulates the open web.
- Introduces a novel construction methodology: Tasks are reverse-constructed from authentic, user-provided multimodal materials (text, image, video, audio) and a curated sandbox containing supportive, distractor, and noise documents. This ensures tasks have a verifiable solution path grounded in real-world complexity.
- Develops a multi-dimensional evaluation framework: The benchmark assesses performance across five metrics: Information Recall (IR), Citation Coverage (CC), Factual Accuracy (FA), Instruction Following (IF), and Depth Quality (DQ), validated for alignment with human judgment.
- Reveals critical challenges for state-of-the-art models: Experiments with the developed DR³-Agent system and multiple LLMs (Claude Sonnet 4, GLM-4.7, Gemini-2.5-Pro, etc.) show the benchmark is highly challenging. Performance degrades with larger, noisier sandboxes, and the primary failure mode across models is hallucination, highlighting issues with evidence grounding in long-horizon tasks.
Introduction and Theoretical Foundation
Recent advances in Large Language Models (LLMs) have enabled the development of Deep Research Agents (DRAs). These systems aim to autonomously perform complex, long-horizon research tasks involving planning, iterative information retrieval, multimodal understanding, and the synthesis of structured, citation-grounded reports. Unlike traditional QA systems, DRAs must operate under uncertainty, reason over heterogeneous and noisy information, and integrate evidence into coherent analytical outputs.
However, evaluating these capabilities poses a significant challenge. Existing benchmarks suffer from key limitations:
- Real-time web access benchmarks (e.g., DeepResearch Bench) offer realism but lack reproducibility due to temporal volatility and evaluation ambiguity.
- Curated document benchmarks (e.g., DRBench) improve structure but often omit the noisy, misleading information common in real research.
- Sandbox-based benchmarks (e.g., DeepResearchGym) ensure reproducibility but simplify the context to "clean," text-only data, lacking multimodal grounding and authentic user workflows.
This creates a gap between real-world research complexity—involving multimodal user materials, noisy information, and implicit research intent—and the environments used for evaluation. DR³-Eval is introduced to bridge this gap by reconciling realism, controllability, and reproducibility.
Methodology
1. Dataset Construction Pipeline
The benchmark is constructed through a rigorous five-stage pipeline to ensure realism and verifiability.
Stage 1: Grounding in Real-World Needs. 100 authentic, multimodal document sets (50 English, 50 Chinese) were collected from paid volunteers across Technology, Economy, and Humanities domains (13 sub-fields). All materials underwent a two-stage sanitization protocol to remove Personally Identifiable Information (PII).
Stage 2: Distilling Search Paths. A "divergent-convergent" process uses an LLM (Gemini-2.5-Pro) to analyze source files and generate keywords.
- Divergent: Generate 10 initial candidate keywords covering diverse facets.
- Convergent: Classify keywords into:
- Signal Keywords: Point toward the core solution path.
- Noise Keywords: Thematically related but lead to irrelevant/misleading information.
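The divergent-convergent split above can be sketched as a simple partition. This is a hypothetical data structure for illustration only: in the paper the classification judgment itself is made by an LLM (Gemini-2.5-Pro), whereas here the per-keyword verdict is assumed to already be attached.

```python
from dataclasses import dataclass

@dataclass
class Keyword:
    text: str
    on_solution_path: bool  # verdict produced by the LLM classifier (assumed given here)

def divergent_convergent(candidates):
    """Partition candidate keywords into signal (solution-path) and noise sets."""
    signal = [k.text for k in candidates if k.on_solution_path]
    noise = [k.text for k in candidates if not k.on_solution_path]
    return signal, noise
```

The signal set later drives retrieval of supportive and distractor pages, while the noise set populates the irrelevant portion of the sandbox.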
Stage 3: Building Research Sandbox. A static, per-task sandbox corpus is built to ensure reproducibility. For each keyword, up to 100 web results are retrieved, deduplicated, and cleaned. Documents are categorized into three types:
- Supportive Web Pages: High-relevance results from signal keywords, manually verified to provide necessary and sufficient evidence.
- Distractor Web Pages: Also from signal keywords, but content is outdated, one-sided, or inaccurate.
- Noise Web Pages: Results from noise keywords.
A fine-grained difficulty scaling strategy creates sandboxes of five context lengths: 32k, 64k, 128k, 256k, and 512k tokens. All settings include the full set of supportive pages. The number of distractors increases with context length, and the remaining quota is filled with noise pages.
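The scaling strategy can be sketched as a budgeted composition routine. The exact distractor quota policy is not specified in the source, so the `distractor_frac` parameter below is an assumption; the invariants that do come from the source are that every supportive page is always included and the remaining budget is split between distractors and noise.

```python
def build_sandbox(supportive, distractors, noise, budget_tokens, distractor_frac=0.5):
    """Assemble one per-task sandbox under a token budget.

    supportive/distractors/noise: lists of {"tokens": int, ...} page dicts.
    distractor_frac: assumed share of the leftover budget reserved for distractors.
    """
    # Invariant: the full supportive set is included at every context length.
    sandbox = list(supportive)
    used = sum(p["tokens"] for p in sandbox)

    # Add distractors up to a size-dependent share of the remaining budget.
    d_budget = int((budget_tokens - used) * distractor_frac)
    for page in distractors:
        if page["tokens"] <= d_budget:
            sandbox.append(page)
            d_budget -= page["tokens"]
            used += page["tokens"]

    # Fill the remaining quota with noise pages.
    for page in noise:
        if used + page["tokens"] <= budget_tokens:
            sandbox.append(page)
            used += page["tokens"]
    return sandbox
```

Running this with budgets of 32k through 512k tokens yields the five difficulty settings, with the distractor count growing alongside the context length.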
Stage 4: Constructing the Query. Using a reverse-construction method, the final user query is synthesized from the pre-determined evidential documents (supportive web pages and user files) and integrated with the signal keywords. This ensures each query has a definitive, verifiable answer that requires joint reasoning across sources.

Stage 5: Quality Control. A four-dimensional validation protocol (Implicit Guidance, Synthesis Necessity, Insight Novelty, Interpretative Unambiguity) filters candidate tasks. From 280 initial candidates, 100 high-quality tasks remained (35.7% pass rate).
2. The DR³-Agent System
To demonstrate the benchmark's utility, the authors developed DR³-Agent, a multi-agent system adapted for the closed-world setting. It is based on the MiroFlow framework and features:
- A Main Agent with integrated perception tools (for audio, video, etc.) that acts as a reasoning hub, running a "Plan-Act-Observe" loop.
- Two specialized sub-agents:
- RAG Search Sub-Agent: Performs iterative dense retrieval within the static sandbox using the ReAct paradigm and query refinement, simulating heuristic web exploration.
- File Reader Sub-Agent: Parses long-text user files with fine-grained queries.
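The Main Agent's "Plan-Act-Observe" loop can be sketched as follows. The `llm` and `tools` interfaces are hypothetical simplifications: the real system dispatches to perception tools and the two sub-agents, while here any callable keyed by name stands in for them.

```python
def plan_act_observe(task, tools, llm, max_turns=20):
    """Minimal main-agent loop: the LLM plans, picks a tool, observes, repeats.

    llm(history) is assumed to return either ("final", report) when done,
    or (tool_name, args) to invoke one of the registered tools/sub-agents.
    """
    history = [("task", task)]
    for _ in range(max_turns):
        kind, payload = llm(history)
        if kind == "final":
            return payload  # finished, citation-grounded report
        observation = tools[kind](payload)  # e.g. RAG search or file-reader sub-agent
        history.append((kind, observation))
    return None  # turn budget exhausted without a final report
```

The RAG Search Sub-Agent would plug in as one `tools` entry, running its own inner ReAct loop against the static sandbox before returning a condensed observation.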
3. Evaluation Metrics
A multi-dimensional framework assesses performance along two axes: Information Seeking and Report Generation. GPT-5.1 (denoted Φ) serves as the primary evaluator LLM, with Gemini-2.5-Pro assisting on multimodal content.
Information Seeking Metrics:
- Information Recall (IR): Measures coverage of the key insights extracted from user files ($\mathcal{I}_{UF}$) and the sandbox corpus ($\mathcal{I}_{SC}$): $\mathrm{IR} = |\mathcal{I}_{\text{hit}}| / |\mathcal{I}|$, where $\mathcal{I} = \mathcal{I}_{UF} \cup \mathcal{I}_{SC}$ and $\mathcal{I}_{\text{hit}} \subseteq \mathcal{I}$ is the subset of insights covered by the report.
- Citation Coverage (CC): Measures the recall of the documents strictly necessary for answering the query ($\mathcal{D}^{*}$): $\mathrm{CC} = |\mathcal{D}_R \cap \mathcal{D}^{*}| / |\mathcal{D}^{*}|$, where $\mathcal{D}_R$ is the set of documents cited in report $R$.
Report Generation Metrics:
- Factual Accuracy (FA): Verifies each claim-source pair $(c_i, s_i)$ in the report against its source: $\mathrm{FA} = \frac{1}{|C|} \sum_{i=1}^{|C|} v(c_i, s_i)$, where $v(c_i, s_i) = 1$ if source $s_i$ supports claim $c_i$, and $0$ otherwise.
- Instruction Following (IF): Checks adherence to a checklist $K = \{k_1, \dots, k_{|K|}\}$ derived from the query: $\mathrm{IF} = \frac{1}{|K|} \sum_{j=1}^{|K|} \mathbb{1}[k_j \text{ is satisfied}]$.
- Depth Quality (DQ): An LLM judge scores the analytical substance and logical rigor of the report, conditioned on the query $Q$ and a grading rubric $\mathcal{R}$.
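The set-based metrics above reduce to straightforward ratios once the per-item judgments are available. In the benchmark those judgments come from the evaluator LLM; the sketch below assumes they are already given and only shows the aggregation.

```python
def information_recall(hit_insights, all_insights):
    """IR: fraction of key insights (user-file + sandbox) covered by the report."""
    return len(set(hit_insights) & set(all_insights)) / len(all_insights)

def citation_coverage(cited_docs, necessary_docs):
    """CC: recall of the documents strictly necessary for the query."""
    return len(set(cited_docs) & set(necessary_docs)) / len(necessary_docs)

def factual_accuracy(verdicts):
    """FA: mean over claim-source verdicts (1 = source supports claim, 0 = not)."""
    return sum(verdicts) / len(verdicts)

def instruction_following(checklist_results):
    """IF: fraction of query-derived checklist items satisfied by the report."""
    return sum(checklist_results) / len(checklist_results)
```

DQ has no closed form here; it is a rubric-conditioned score assigned directly by the judge model.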
Empirical Validation / Results
Experiments evaluated multiple state-of-the-art LLMs within the DR³-Agent framework across different sandbox corpus sizes.
Main Results
Table 2: Evaluation results on DR³-Agent.
| Models | IR_UF | IR_SC | CC | FA | IF | DQ | Avg. |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | 58.8 | 60.4 | 60.8 | 55.3 | 46.6 | 41.8 | 64.7 |
| GLM-4.7 | 55.7 | 55.0 | 57.1 | 53.1 | 47.6 | 42.1 | 65.4 |
| GLM-4.6 | 53.4 | 52.6 | 50.3 | 49.5 | 43.9 | 39.8 | 58.2 |
| Gemini-2.5-Pro | 43.9 | 45.7 | 42.9 | 37.7 | 35.1 | 30.8 | 54.3 |
| GPT-4.1 | 40.7 | 42.5 | 41.3 | 30.9 | 29.4 | 29.2 | 37.2 |
| Qwen3-235B-A22B | 37.4 | 36.0 | 39.7 | 35.7 | 29.8 | 28.8 | 40.6 |
| Qwen3-32B | 33.2 | 36.6 | 35.4 | 26.5 | 25.3 | 24.7 | 34.2 |
| Qwen3-30B-A3B | 30.9 | 38.2 | 34.1 | 23.2 | 25.7 | 23.5 | 26.6 |
Key Observations:
- DR³-Eval is challenging: Claude Sonnet 4 performs best, but even top models show significant room for improvement. Scaling laws hold within model families.
- Longer contexts lower performance: As the sandbox grows from 64k to 512k tokens, performance (Avg., IR_SC, CC) drops across all models, indicating difficulty with noise and distraction.
- Instruction Following ≠ Factual Accuracy: Some models (e.g., Qwen3-235B, GPT-4.1) achieve good IF scores but very low FA, suggesting they generate "complete-looking" but unfounded reports.
- Performance varies by domain: As shown in Figure 4, model strengths differ across domains like Physics, Industry, and Finance.
Further Analysis
- Evaluation Stability: Bootstrap analysis and low score variance (e.g., an SD of 0.83 for Claude Sonnet 4) confirm the benchmark's reliability. The difference between the top models is statistically significant.
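A stability check of this kind can be sketched with a percentile bootstrap over per-task scores. The resample count, seed, and confidence level below are illustrative choices, not values reported in the source.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean benchmark score."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Non-overlapping intervals for two models' mean scores are one simple way to argue their gap is not resampling noise.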
- Sandbox vs. Real Web: Experiments replacing the sandbox with real-time web search show close performance (Table 3), indicating the sandbox preserves core information difficulty and is a reliable substitute.
- Sandbox Design Effectiveness: Ablation studies (Figure 5) confirm that distractor documents effectively increase task difficulty, and the sandbox contains no other effective information beyond the designated supportive documents.
- LLM-as-Judge Alignment: The automated scoring shows strong correlation with human evaluation (Table 4).
Table 4: LLM-as-judge vs. human evaluation.

| Method | Pearson r | Spearman ρ | Agreement |
|---|---|---|---|
| DR³-Eval (Ours) | 0.78 | 0.73 | 0.89 |
| Inter-Human | 0.83 | 0.76 | 0.91 |

- Retrieval and Error Analysis:
- Different Retrievers: Vector-based retrieval (OpenAI text-embedding-3-small) outperforms lexical methods (BM25) (Table 5).
- Agentic-RAG Turns: Increasing maximum iteration turns for the RAG sub-agent improves performance, but excessive turns can cause a decline (Table 6).
- Error Attribution: Analysis of 100 reports per model (Figure 8) reveals that hallucination is the primary failure mode for most models (48%-77% of errors), indicating the key challenge is evidence grounding, not just evidence acquisition.
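The error-attribution step amounts to tallying judged failure categories across reports and identifying the dominant mode. The category names below are illustrative placeholders; the source specifies only that hallucination dominates for most models.

```python
from collections import Counter

def attribute_errors(error_labels):
    """Tally per-report error categories and return (primary mode, share per mode)."""
    counts = Counter(error_labels)
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    primary = max(shares, key=shares.get)
    return primary, shares
```

Applied to 100 judged reports per model, a hallucination share in the 48%-77% range is what marks evidence grounding, rather than acquisition, as the bottleneck.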
Theoretical and Practical Implications
Theoretical Implications:
- Advances Benchmark Design: DR³-Eval provides a principled framework for evaluating long-horizon research capabilities, balancing the often conflicting goals of realism, controllability, and reproducibility.
- Highlights Model Limitations: The benchmark systematically exposes critical failure modes in current LLMs, particularly hallucination under information overload and retrieval robustness in noisy environments. It shows that scaling alone is insufficient for complex research tasks.
- Validates Multi-dimensional Evaluation: The proposed metric suite, validated against human judgment, demonstrates the necessity of moving beyond single-score assessments to understand different facets of research agent performance.
Practical Implications:
- For DRA Development: Offers a reproducible testbed for diagnosing and improving agents, especially in areas of retrieval strategy, citation grounding, and hallucination mitigation.
- For Sustainable AI Research: By using a static sandbox instead of live web crawling for every evaluation, the framework promotes more computationally efficient and environmentally sustainable benchmarking.
- For Ethical AI Development: The focus on Factual Accuracy and Citation Coverage steers the field towards developing verifiable and reliable systems, potentially mitigating risks from persuasive but unfounded AI-generated content.
Conclusion
DR³-Eval addresses key limitations in evaluating Deep Research Agents by introducing a benchmark grounded in authentic user scenarios, constructed with a controlled yet web-like sandbox, and employing a reverse-construction method to eliminate ambiguity. Experimental results demonstrate that the benchmark poses substantial challenges to state-of-the-art LLMs, with performance degrading in larger, noisier environments and hallucination emerging as the predominant failure mode. This indicates that the main bottleneck for current models lies in the stability of evidence utilization during long-form generation, not merely evidence acquisition.
The work provides a foundation for more rigorous, reproducible, and diagnostically useful evaluation of autonomous research systems, guiding future development towards greater reliability and safety.