MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
Summary (Overview)
- Holistic Benchmark: Introduces MiroEval, a benchmark and evaluation framework for deep research agents, comprising 100 tasks (70 text-only, 30 multimodal) grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates.
- Multi-Dimensional Evaluation: Proposes a three-layer evaluation suite: (1) Adaptive Synthesis Quality with task-specific rubrics, (2) Agentic Factuality Verification via active retrieval over web and multimodal attachments, and (3) Process-Centric Evaluation auditing search, reasoning, and refinement.
- Key Findings: Evaluation of 13 systems reveals: the three dimensions capture complementary capabilities; process quality is a reliable predictor of overall outcome and reveals weaknesses invisible to output metrics; multimodal tasks pose substantially greater challenges, causing performance drops of 3-10 points.
- Top Performers: The MiroThinker series (particularly MiroThinker-H1) achieves the most balanced performance, ranking highest overall in both text-only (77.5) and multimodal (74.5) settings.
- Validation: Human verification confirms benchmark quality (92.0% precision). Robustness experiments and a human ranking study (Kendall’s τ = 0.91) validate the evaluation framework's reliability.
Introduction and Theoretical Foundation
The rapid advancement of Large Language Models (LLMs) has enabled a shift from passive text generation to agentic systems capable of autonomous planning and execution. Deep research, defined as the autonomous, multi-step process of investigating complex information needs through iterative search, evidence gathering, verification, and synthesis, has become a prominent paradigm.
As these systems are adopted in high-stakes domains (finance, healthcare, legal analysis), users demand more than fluent reports: they need factually reliable answers, grounded in thorough, traceable investigation, and capable of incorporating multimodal materials (images, PDFs, spreadsheets) common in real-world queries.
Existing benchmarks have limitations:
- Evaluate only the final report, not the underlying research process.
- Offer limited multimodal coverage beyond short-form QA.
- Rely on synthetic or academic queries that don't capture real-world complexity.
- Are static, risking obsolescence as knowledge evolves.
To address these gaps, MiroEval is introduced as a holistic diagnostic tool for the next generation of deep research agents, focusing on real user needs, multimodal support, and process-level assessment.
Methodology
1. Benchmark Construction (Query Collection)
The benchmark comprises 100 queries (70 text-only, 30 multimodal) built via two complementary paths (Figure 2), enabling a live and evolving setting.
A. User-Derived Query Curation (65 queries):
- Source: Inspired by query patterns from a closed internal testing phase (text and multimodal with attachments).
- Privacy: No original user queries appear. Strict protocols include automated filtering of sensitive content and systematic replacement of named entities.
- Process: An LLM classifies each anonymized query along dimensions (attachment type, complexity, target evaluation features). Queries are then routed to one of 6 rewriting strategies (Table 9) spanning three difficulty tiers (Easy, Medium, Hard) based on constraints, feature matching, quota bonuses, and usage decay.
- Strategies target specific evaluation features (Table 8), such as search, multimodal understanding, error correction, and planning.
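The routing step above can be sketched as a scoring rule over candidate strategies. Everything here beyond the four stated signals (constraints, feature matching, quota bonuses, usage decay) is a hypothetical assumption: the strategy names, weights, and the additive scoring form are illustrative, not the paper's exact rule.

```python
# Hypothetical sketch of strategy routing. The additive score and the decay
# constant are assumptions; only the four signal names come from the source.

def route_strategy(query_features, strategies, usage_counts, decay=0.1):
    """Pick the rewriting strategy with the best score for a classified query.

    Score = feature overlap + quota bonus for under-used strategies,
    minus a usage-decay penalty so strategies rotate across the dataset.
    """
    best, best_score = None, float("-inf")
    for s in strategies:
        if not s["constraints"](query_features):   # hard constraints first
            continue
        match = len(set(s["targets"]) & set(query_features["targets"]))
        bonus = 1.0 if usage_counts[s["name"]] < s["quota"] else 0.0
        score = match + bonus - decay * usage_counts[s["name"]]
        if score > best_score:
            best, best_score = s["name"], score
    return best
```

A query classified as targeting `planning` would then be routed to whichever strategy lists that feature, provided its constraints and quota permit.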
B. Automated Query Generation (35 text-only queries):
- Source: Grounded in real-time web trends (via Serper API) across 12 topics with 3 subtopics each (Table 10).
- Process: An LLM generates 15 candidate queries per topic, conditioned on trends and anonymized seed exemplars.
- Three-Stage Filtering:
- Search Validation: Requires ≥ 3 results from ≥ 2 distinct domains.
- Deep-Research Necessity: An LLM evaluates if the query demands external investigation (confidence ≥ 0.7).
- Inverse Quality Assessment: Retains only queries for which a baseline answer generated without search access is inadequate. The joint condition combines three signals: a continuous quality score, a categorical adequacy label, and a binary `requires_search` flag.
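The three-stage filter can be sketched as a single predicate. The thresholds stated in the source (at least 3 results from at least 2 domains, LLM confidence at least 0.7) are used as given; the 0.5 quality cutoff and the disjunctive form of the stage-3 condition are assumptions for illustration.

```python
# Minimal sketch of the three-stage candidate filter. The 0.5 quality cutoff
# and the OR-combination in stage 3 are assumptions, not the paper's rule.

def passes_filter(search_results, necessity_conf, baseline):
    """Return True if a candidate query survives all three stages.

    search_results: list of (domain, snippet) pairs from a search API
    necessity_conf: LLM confidence that the query needs external research
    baseline: quality signals for the no-search baseline answer
    """
    # Stage 1: search validation -- >= 3 results over >= 2 distinct domains.
    domains = {d for d, _ in search_results}
    if len(search_results) < 3 or len(domains) < 2:
        return False
    # Stage 2: deep-research necessity -- LLM confidence threshold.
    if necessity_conf < 0.7:
        return False
    # Stage 3: inverse quality -- keep only if the no-search baseline fails.
    return (baseline["quality"] < 0.5
            or baseline["label"] == "inadequate"
            or baseline["requires_search"])
```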
Benchmark Overview (Figure 3):
- Domains: Covers 12 domains (Tech: 20, Finance: 17, Science: 13, etc.).
- Task Types: 10 types (Decision & Recommendation: 17, Comparative Analysis: 16, Fact Enumeration & Verification: 15, etc.).
- Quality Verification: Three expert annotators achieved an overall precision of 92.0% (Table 2).
2. Evaluation Framework (Three Complementary Layers)
The evaluation suite assesses systems along three dimensions (Figure 4).
A. Comprehensive Adaptive Synthesis Quality Evaluation (§3.1)
- Adaptive Dimensions: For a query with instruction $I$ and optional attachments $A$, the framework constructs a tailored dimension space $\mathcal{D} = \mathcal{D}_{\text{univ}} \cup \mathcal{D}_{\text{task}}$.
- $\mathcal{D}_{\text{univ}}$: Universal aspects (Coverage, Insight, Instruction-following, Clarity).
- $\mathcal{D}_{\text{task}}$: For text-only queries ($A = \varnothing$), generates 1–3 task-specific expertise dimensions. For attachment-augmented queries ($A \neq \varnothing$), adds a Grounding dimension.
- Key Facts Extraction: For multimodal tasks, an upstream module extracts verifiable factual anchors from raw attachments to generate precise, attachment-specific grounding criteria.
- Dynamic Scoring: Dimension weights $w_d$ and criterion weights $v_c$ are dynamically assigned. The evaluator scores each criterion as $s_c$, and the final quality score is the doubly weighted sum $S = \sum_{d \in \mathcal{D}} w_d \sum_{c \in d} v_c \, s_c$, with weights normalized.
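The dynamic scoring step can be sketched as a normalized, doubly weighted aggregation. The data layout and the normalization scheme here are assumptions; the source only states that dimension and criterion weights are dynamically assigned.

```python
# Sketch of weighted score aggregation; the dict layout and per-dimension
# normalization are assumptions made for illustration.

def synthesis_score(dimensions):
    """Aggregate criterion scores into one quality score.

    dimensions: list of dicts, each with a dimension "weight" and a list of
    "criteria", each criterion carrying a "weight" and a judge "score".
    Returns sum_d w_d * (weighted mean of that dimension's criterion scores),
    with dimension weights normalized to sum to one.
    """
    total_w = sum(d["weight"] for d in dimensions)
    score = 0.0
    for d in dimensions:
        cw = sum(c["weight"] for c in d["criteria"])
        dim_score = sum(c["weight"] * c["score"] for c in d["criteria"]) / cw
        score += (d["weight"] / total_w) * dim_score
    return score
```

With two dimensions weighted 2:1 scoring 80 and 50, this yields (2/3)·80 + (1/3)·50 = 70.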
B. Agentic Factuality Evaluation (§3.2)
- Claim Decomposition: The report is decomposed into a set of verifiable statements $\{s_1, \dots, s_n\}$.
- Evidence Retrieval: For each statement $s_i$, an evaluation agent retrieves evidence from two sources:
- $E_{\text{web}}$: From external web search.
- $E_{\text{att}}$: From task-provided attachments using native multimodal processing (for images, PDFs) or retrieval-augmented processing (for spreadsheets, slides).
- Consistency Assessment: The agent assigns each statement a factuality label based on its agreement with the retrieved evidence. The CONFLICT label explicitly captures disagreements between heterogeneous sources.
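The consistency assessment can be sketched as follows. Only the CONFLICT label is named in the source; the other label names and the pluggable `judge` function (standing in for the LLM-based evaluation agent) are assumptions.

```python
# Sketch of per-claim consistency assessment. Label names other than CONFLICT
# and the judge interface are assumptions for illustration.

def assess_claim(claim, web_evidence, att_evidence, judge):
    """Label one claim against web and attachment evidence.

    judge(claim, evidence) -> "support", "refute", or "neutral".
    CONFLICT is emitted when the two evidence sources disagree.
    """
    verdicts = set()
    for ev in (web_evidence, att_evidence):
        if ev is not None:
            verdicts.add(judge(claim, ev))
    verdicts.discard("neutral")          # neutral evidence is uninformative
    if verdicts == {"support", "refute"}:
        return "CONFLICT"
    if verdicts == {"support"}:
        return "SUPPORTED"
    if verdicts == {"refute"}:
        return "REFUTED"
    return "UNVERIFIED"
```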
C. Process-Centric Evaluation (§3.3)
- Process Representation: Raw process logs are transformed into a structured sequence of atomic steps (information acquisition, evidence inspection, synthesis, etc.) to recover dependencies and extract key process findings.
- Intrinsic Process Quality: Evaluated across five dimensions:
- Search Breadth: Explores wide range of sources/perspectives.
- Analytical Depth: Conducts multi-step reasoning and in-depth analysis.
- Progressive Refinement: Iteratively improves understanding.
- Critical Thinking: Evaluates source reliability and handles conflicts.
- Efficiency: Avoids unnecessary redundancy.
- Process-Report Alignment: Evaluates consistency between process findings and the final report in two directions:
- Process → Report (P→R): Checks if major process findings are realized in the report.
- Report → Process (R→P): Checks if report conclusions are traceable to sufficient process support.
- Contradiction Detection (Contr): Evaluates handling of conflicting evidence.
- Overall Process Score: Defined as a weighted combination of intrinsic process quality, process-report alignment, and contradiction handling.
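As a hedged illustration, the overall process score can be sketched as a weighted mean of the Intrinsic and Alignment averages. Equal weights are an assumption, though they are consistent with the Table 5 excerpt, where each model's Overall equals the mean of its two averages (e.g. (70.4 + 78.9) / 2 = 74.65 ≈ 74.7 for MiroThinker-H1); contradiction handling may be folded into the alignment term.

```python
# Equal weights are an assumption; they happen to reproduce the Table 5 excerpt.

def process_score(intrinsic_avg, alignment_avg, w_intr=0.5, w_align=0.5):
    """Weighted combination of intrinsic quality and process-report alignment."""
    return w_intr * intrinsic_avg + w_align * alignment_avg
```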
Empirical Validation / Results
Evaluation was conducted across 13 leading deep research systems (Table 3), including OpenAI Deep Research, Gemini-3.1-Pro, Claude, and three MiroThinker variants.
Main Results (Text-Only Setting)
Overall Performance Tiers (Text-Only):
- Top Tier: MiroThinker-H1 (77.5), OpenAI Deep Research (76.7), MiroThinker-1.7 (75.5).
- Middle Tier: Gemini-3.1-Pro (69.9), Kimi-K2.5 (68.4), Claude-Opus-4.6 (67.7), MiniMax-M2.5 (67.4).
- Lower Tier: Qwen-3.5-Plus (64.7), Manus-1.6-Max (64.0), Doubao (60.7), Grok (60.2).
Key Findings:
- Dimensions are Complementary: System rankings shift substantially across dimensions. For example, Kimi-K2.5 has the highest Synthesis score (75.7) among non-MiroThinker systems but a low Factuality score (65.4). Manus has the lowest Synthesis score (55.4) but a competitive Factuality score (72.6).
- Process Predicts Outcome: Process quality is broadly predictive of overall outcome quality. The top systems on Process are also the top on overall outcome.
- Multimodal Challenge: Performance drops by 3 to 10 points in the multimodal setting. MiroThinker-H1 is most resilient (-3.0), while Qwen-3.5-Plus suffers the largest drop (-8.6).
Outcome-Level Analysis (Table 4, Figure 5)
- Synthesis Sub-Metrics: Specificity is the universal bottleneck (lowest-scoring sub-metric). Insight is the most discriminative capability (scores range from 54.8 to 80.3).
- Factual Claims – Precision-Volume Trade-off: Systems exhibit different strategies. For example:
- ChatGLM Agent: High volume (4,096 correct claims) but lower precision (68.6 Right Ratio).
- OpenAI Deep Research: Lower volume (3,335 correct) but high precision (83.3 Right Ratio).
- MiroThinker Series: Achieves a balance—MiroThinker-H1 has high volume (3,746 correct) and high precision (81.1 Right Ratio) with the lowest absolute error count (161 wrong claims).
Process-Level Analysis (Table 5)
- Intrinsic Quality: Systems achieve reasonable Search Breadth but substantially lower Analytical Depth, making Depth the most discriminative intrinsic metric. Efficiency is a universal weakness, indicating substantial redundancy in research processes.
- Alignment Asymmetry: Process → Report (P→R) scores are generally high (MiroThinker-H1: 87.0), while Report → Process (R→P) scores are dramatically lower (MiroThinker-H1: 63.3), revealing a significant traceability gap: report content often cannot be traced back to the documented research process.
- Correlation: Process quality shows a strong Pearson correlation (0.88) with the combined outcome score, confirming its predictive value.
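The Pearson correlation is straightforward to compute. The paper's 0.88 figure is over all evaluated systems; the five models appearing in both table excerpts here merely illustrate the computation (and also correlate strongly on their own).

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Outcome (Table 4) vs. process (Table 5) overall scores for the five models
# appearing in both excerpts, in the same order.
outcome = [78.9, 78.6, 76.9, 71.3, 68.6]
process = [74.7, 73.1, 72.7, 67.1, 66.0]
```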
Further Analysis
- User-Derived vs. Auto-Generated Queries (Table 6): User-derived queries are consistently harder, but system rankings remain stable across both sources, validating the automated construction pipeline.
- Evaluation Robustness (Appendix D): Results are robust across repeated runs (std. dev. < 0.6), alternative judge models (Gemini), and prompt modifications. A human ranking study with 5 experts showed strong agreement with MiroEval rankings (Kendall’s τ = 0.91).
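Kendall's τ, used in the human ranking study, measures rank agreement via concordant versus discordant pairs. A minimal tau-a implementation (assuming no ties, which is an assumption about the study's rankings):

```python
def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two tie-free rankings of the same items.

    rank_a, rank_b: dicts mapping item -> rank position.
    tau = (concordant pairs - discordant pairs) / total pairs.
    """
    items = list(rank_a)
    conc = disc = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a = rank_a[items[i]] - rank_a[items[j]]
            b = rank_b[items[i]] - rank_b[items[j]]
            if a * b > 0:
                conc += 1      # pair ordered the same way in both rankings
            elif a * b < 0:
                disc += 1      # pair ordered oppositely
    n = len(items)
    return (conc - disc) / (n * (n - 1) / 2)
```

A τ of 0.91 thus means the vast majority of system pairs are ordered identically by MiroEval and the human experts.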
Performance Comparison Tables:
Table 3: Overall Performance Comparison
| Model | Text-Only Overall | MultiModal Overall |
|---|---|---|
| MiroThinker-H1 | 77.5 | 74.5 |
| OpenAI Deep Research | 76.7 | 70.2 |
| MiroThinker-1.7 | 75.5 | 71.6 |
| Gemini-3.1-Pro | 69.9 | 68.1 |
| Claude-Opus-4.6 | 67.7 | 66.4 |
| MiniMax-M2.5 | 67.4 | 63.3 |
| Kimi-K2.5 | 68.4 | – |
| ChatGLM Agent | 65.8 | 63.6 |
| Qwen-3.5-Plus | 64.7 | 56.1 |
| Manus-1.6-Max | 64.0 | 62.0 |
| Doubao | 60.7 | – |
| Grok | 60.2 | 60.5 |
Table 4: Synthesis and Factuality Breakdown (Text-Only, Excerpt)
| Model | Synthesis Avg | Factuality Ratio | Overall |
|---|---|---|---|
| MiroThinker-H1 | 76.7 | 81.1 | 78.9 |
| OpenAI Deep Research | 73.8 | 83.3 | 78.6 |
| MiroThinker-1.7 | 74.3 | 79.4 | 76.9 |
| Gemini-3.1-Pro | 71.2 | 71.3 | 71.3 |
| Kimi-K2.5 | 75.7 | 65.4 | 70.6 |
| Claude-Opus-4.6 | 67.3 | 69.8 | 68.6 |
Table 5: Process Evaluation Breakdown (Text-Only, Excerpt)
| Model | Intrinsic Avg | Alignment Avg | Overall |
|---|---|---|---|
| MiroThinker-H1 | 70.4 | 78.9 | 74.7 |
| OpenAI Deep Research | 72.0 | 74.1 | 73.1 |
| MiroThinker-1.7 | 70.1 | 75.2 | 72.7 |
| Gemini-3.1-Pro | 68.2 | 66.0 | 67.1 |
| Claude-Opus-4.6 | 64.8 | 67.2 | 66.0 |
Theoretical and Practical Implications
- Holistic System Diagnosis: MiroEval moves beyond final-report evaluation to provide a multi-dimensional diagnostic tool, revealing complementary strengths and weaknesses (e.g., synthesis vs. factuality trade-offs).
- Importance of Process Evaluation: Demonstrates that process quality is a reliable predictor of overall outcome and uncovers critical weaknesses—like the traceability gap (R→P) and low Analytical Depth—that are invisible to output-level metrics. This validates process-centric evaluation as essential for assessing thorough investigation.
- Multimodal as a Key Challenge: The significant performance drop on multimodal tasks highlights that current systems struggle to integrate and reason over visual/content materials effectively, pointing to a crucial area for future development.
- Benchmark Design Principles: The dual-path construction (user-derived + auto-generated) ensures tasks are grounded in real needs while enabling temporal refresh, providing a model for creating live, evolving benchmarks.
- Guidance for Improvement: The analysis identifies specific bottlenecks: improving Specificity in reports, enhancing Analytical Depth and Efficiency in processes, and closing the traceability gap between reports and their supporting research.
Conclusion
MiroEval provides a comprehensive benchmark and evaluation framework for deep research systems, addressing gaps in existing evaluations by incorporating real user needs, multimodal support, and process-centric assessment.
The evaluation of 13 systems yields three principal findings:
- Synthesis quality, factual precision, and process rigor are complementary dimensions.
- Process quality reliably predicts overall outcome and reveals critical weaknesses like insufficient analytical depth and report-process traceability gaps.
- Multimodal tasks pose substantially greater challenges, with most systems declining by 3-10 points.
The MiroThinker series, particularly MiroThinker-H1, demonstrates the most balanced performance across all dimensions. Human verification and robustness experiments confirm the benchmark's quality and the evaluation framework's reliability.
Limitations and Future Work: Process evaluation requires systems to expose intermediate traces, limiting applicability to fully closed-source systems. Factuality evaluation flags conflicts (CONFLICT) but does not resolve them. Future work will leverage the refreshable pipeline to keep MiroEval temporally relevant as a live benchmark.