MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Summary (Overview)

  • Holistic Benchmark: Introduces MiroEval, a benchmark and evaluation framework for deep research agents, comprising 100 tasks (70 text-only, 30 multimodal) grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates.
  • Multi-Dimensional Evaluation: Proposes a three-layer evaluation suite: (1) Adaptive Synthesis Quality with task-specific rubrics, (2) Agentic Factuality Verification via active retrieval over web and multimodal attachments, and (3) Process-Centric Evaluation auditing search, reasoning, and refinement.
  • Key Findings: Evaluation of 13 systems reveals: the three dimensions capture complementary capabilities; process quality is a reliable predictor of overall outcome and reveals weaknesses invisible to output metrics; multimodal tasks pose substantially greater challenges, causing performance drops of 3-10 points.
  • Top Performers: The MiroThinker series (particularly MiroThinker-H1) achieves the most balanced performance, ranking highest overall in both text-only (77.5) and multimodal (74.5) settings.
  • Validation: Human verification confirms benchmark quality (92.0% precision). Robustness experiments and a human ranking study (Kendall’s τ = 0.91) validate the evaluation framework's reliability.

Introduction and Theoretical Foundation

The rapid advancement of Large Language Models (LLMs) has enabled a shift from passive text generation to agentic systems capable of autonomous planning and execution. Deep research, defined as the autonomous, multi-step process of investigating complex information needs through iterative search, evidence gathering, verification, and synthesis, has become a prominent paradigm.

As these systems are adopted in high-stakes domains (finance, healthcare, legal analysis), users demand more than fluent reports: they need factually reliable answers, grounded in thorough, traceable investigation, and capable of incorporating multimodal materials (images, PDFs, spreadsheets) common in real-world queries.

Existing benchmarks have limitations:

  • Evaluate only the final report, not the underlying research process.
  • Offer limited multimodal coverage beyond short-form QA.
  • Rely on synthetic or academic queries that don't capture real-world complexity.
  • Are static, risking obsolescence as knowledge evolves.

To address these gaps, MiroEval is introduced as a holistic diagnostic tool for the next generation of deep research agents, focusing on real user needs, multimodal support, and process-level assessment.

Methodology

1. Benchmark Construction (Query Collection)

The benchmark comprises 100 queries (70 text-only, 30 multimodal) built via two complementary paths (Figure 2), enabling a live and evolving setting.

A. User-Derived Query Curation (65 queries):

  • Source: Inspired by query patterns from a closed internal testing phase (text and multimodal with attachments).
  • Privacy: No original user queries appear. Strict protocols include automated filtering of sensitive content and systematic replacement of named entities.
  • Process: An LLM classifies each anonymized query along several dimensions (attachment type, complexity, target evaluation features). Each query is then routed to one of six rewriting strategies (Table 9) spanning three difficulty tiers (Easy, Medium, Hard); routing accounts for constraint satisfaction, feature matching, quota bonuses, and usage decay.
  • Strategies target specific evaluation features (Table 8), such as search, multimodal understanding, error correction, and planning.

B. Automated Query Generation (35 text-only queries):

  • Source: Grounded in real-time web trends (via Serper API) across 12 topics with 3 subtopics each (Table 10).
  • Process: An LLM generates 15 candidate queries per topic, conditioned on trends and anonymized seed exemplars.
  • Three-Stage Filtering:
    1. Search Validation: Requires ≥ 3 results from ≥ 2 distinct domains.
    2. Deep-Research Necessity: An LLM evaluates if the query demands external investigation (confidence ≥ 0.7).
    3. Inverse Quality Assessment: Retains only queries where a baseline answer generated without search access is inadequate. The joint retention condition is Q_gen = { q | σ(q) ≤ 0.75 ∧ ℓ(q) ≠ high ∧ requires_search(q) }, where σ(q) is a continuous quality score of the baseline answer, ℓ(q) is a categorical quality label, and requires_search(q) is a binary flag.
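The three-stage filter above can be sketched as a single predicate. This is a minimal illustration under the stated thresholds; the helper names and data shapes are assumptions, standing in for the Serper-backed search check and the two LLM judges.

```python
# Sketch of the three-stage query filter. Helper names and inputs are
# hypothetical stand-ins for the search API and LLM judges in the text.

def passes_search_validation(results):
    """Stage 1: at least 3 results drawn from at least 2 distinct domains."""
    domains = {r["domain"] for r in results}
    return len(results) >= 3 and len(domains) >= 2

def passes_necessity(confidence):
    """Stage 2: LLM confidence that external investigation is needed."""
    return confidence >= 0.7

def passes_inverse_quality(sigma, label, requires_search):
    """Stage 3: keep q only if the search-free baseline is inadequate:
    sigma(q) <= 0.75, label != 'high', and requires_search is set."""
    return sigma <= 0.75 and label != "high" and requires_search

def keep_query(results, confidence, sigma, label, requires_search):
    return (passes_search_validation(results)
            and passes_necessity(confidence)
            and passes_inverse_quality(sigma, label, requires_search))

# Three results over two domains, confident necessity, weak baseline: kept.
results = [{"domain": "a.com"}, {"domain": "b.com"}, {"domain": "a.com"}]
print(keep_query(results, 0.9, 0.6, "medium", True))   # True
print(keep_query(results, 0.9, 0.9, "high", True))     # False: baseline too strong
```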

Benchmark Overview (Figure 3):

  • Domains: Covers 12 domains (Tech: 20, Finance: 17, Science: 13, etc.).
  • Task Types: 10 types (Decision & Recommendation: 17, Comparative Analysis: 16, Fact Enumeration & Verification: 15, etc.).
  • Quality Verification: Three expert annotators achieved an overall precision of 92.0% (Table 2).

2. Evaluation Framework (Three Complementary Layers)

The evaluation suite assesses systems along three dimensions (Figure 4).

A. Comprehensive Adaptive Synthesis Quality Evaluation (§3.1)

  • Adaptive Dimensions: For a query Q = (I, A) (instruction I, optional attachments A), the framework constructs a tailored dimension space D = D_fixed ∪ D_dynamic(Q).
    • D_fixed: Universal aspects (Coverage, Insight, Instruction-following, Clarity).
    • D_dynamic(Q): For text-only queries (A = ∅), generates 1–3 task-specific expertise dimensions. For attachment-augmented queries (A ≠ ∅), adds a Grounding dimension.
  • Key Facts Extraction: For multimodal tasks, an upstream module extracts verifiable factual anchors from raw attachments to generate precise, attachment-specific grounding criteria.
  • Dynamic Scoring: Dimension weights W_d and criterion weights w_{d,c} are dynamically assigned. The evaluator scores each criterion s_{d,c} ∈ [0, 10], and the final quality score is S_quality = Σ_{d ∈ D} W_d Σ_c w_{d,c} · s_{d,c}.
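As a concrete instance of the scoring rule, a minimal sketch of S_quality with made-up dimensions and weights (the actual weights are assigned dynamically per query and are not published in this summary):

```python
# Minimal sketch of S_quality = sum_d W_d * sum_c w_{d,c} * s_{d,c},
# with illustrative dimensions and weights, not the paper's actual values.

def synthesis_score(dimensions):
    """dimensions: {dim: (W_d, {criterion: (w_dc, s_dc)})},
    with s_dc in [0, 10]; weights are assumed to sum to 1 at each level."""
    total = 0.0
    for W_d, criteria in dimensions.values():
        total += W_d * sum(w * s for (w, s) in criteria.values())
    return total

dims = {
    "coverage": (0.5, {"breadth": (0.6, 8.0), "recency": (0.4, 6.0)}),
    "clarity":  (0.5, {"structure": (1.0, 9.0)}),
}
print(round(synthesis_score(dims), 2))  # 0.5*(0.6*8 + 0.4*6) + 0.5*9 = 8.1
```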

B. Agentic Factuality Evaluation (§3.2)

  • Claim Decomposition: The report R is decomposed into a set of verifiable statements S(Q, R) = {s_1, ..., s_n}.
  • Evidence Retrieval: For each statement s, an evaluation agent retrieves evidence from two sources: E(s) = E_search(s) ∪ E_attach(s).
    • E_search(s): From external web search.
    • E_attach(s): From task-provided attachments using native multimodal processing (for images, PDFs) or retrieval-augmented processing (for spreadsheets, slides).
  • Consistency Assessment: The agent assigns a factuality label y(s) ∈ {RIGHT, WRONG, CONFLICT, UNKNOWN}. The CONFLICT label explicitly captures disagreements between heterogeneous sources.
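The per-statement labels aggregate naturally into a precision-style metric like the Right Ratio reported later. The sketch below stubs out the retrieval and judging agent, and assumes the ratio is RIGHT labels over all decomposed statements, which this summary does not spell out.

```python
from collections import Counter

# Aggregating per-statement factuality labels; the retrieval/judging
# agent itself is stubbed out. Denominator choice is an assumption.
LABELS = {"RIGHT", "WRONG", "CONFLICT", "UNKNOWN"}

def right_ratio(labels):
    """Fraction of decomposed statements judged RIGHT."""
    assert all(l in LABELS for l in labels)
    counts = Counter(labels)
    return counts["RIGHT"] / len(labels) if labels else 0.0

labels = ["RIGHT", "RIGHT", "WRONG", "CONFLICT",
          "RIGHT", "UNKNOWN", "RIGHT", "RIGHT"]
print(right_ratio(labels))  # 5/8 = 0.625
```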

C. Process-Centric Evaluation (§3.3)

  • Process Representation: Raw process logs PP are transformed into a structured sequence of atomic steps (information acquisition, evidence inspection, synthesis, etc.) to recover dependencies and extract key process findings.
  • Intrinsic Process Quality: Evaluated across five dimensions:
    1. Search Breadth: Explores wide range of sources/perspectives.
    2. Analytical Depth: Conducts multi-step reasoning and in-depth analysis.
    3. Progressive Refinement: Iteratively improves understanding.
    4. Critical Thinking: Evaluates source reliability and handles conflicts.
    5. Efficiency: Avoids unnecessary redundancy.
  • Process-Report Alignment: Evaluates consistency between process findings and the final report in two directions:
    • Process → Report (P→R): Checks if major process findings are realized in the report.
    • Report → Process (R→P): Checks if report conclusions are traceable to sufficient process support.
    • Contradiction Detection (Contr): Evaluates handling of conflicting evidence.
  • Overall Process Score: Defined as a weighted combination S_process = α · S_intrinsic(P) + (1 − α) · S_align(P, R).
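The combination can be written out directly. The text does not state α, so α = 0.5 below is an assumption (it matches the Table 5 excerpt, where the Overall column is the mean of the intrinsic and alignment averages).

```python
# S_process = alpha * S_intrinsic(P) + (1 - alpha) * S_align(P, R).
# alpha is not stated in the text; 0.5 is assumed, consistent with the
# Table 5 excerpt where Overall averages the two components.

def process_score(s_intrinsic, s_align, alpha=0.5):
    assert 0.0 <= alpha <= 1.0
    return alpha * s_intrinsic + (1.0 - alpha) * s_align

# MiroThinker-H1's Table 5 row: intrinsic 70.4, alignment 78.9.
print(round(process_score(70.4, 78.9), 2))  # 74.65, reported as 74.7
```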

Empirical Validation / Results

Evaluation was conducted across 13 leading deep research systems (Table 3), including OpenAI Deep Research, Gemini-3.1-Pro, Claude, and three MiroThinker variants.

Main Results (Text-Only Setting)

Overall Performance Tiers (Text-Only):

  • Top Tier: MiroThinker-H1 (77.5), OpenAI Deep Research (76.7), MiroThinker-1.7 (75.5).
  • Middle Tier: Gemini-3.1-Pro (69.9), Kimi-K2.5 (68.4), Claude (67.7), MiniMax-M2.5 (67.4).
  • Lower Tier: Qwen (64.7), Manus (64.0), Doubao (60.7), Grok (60.2).

Key Findings:

  1. Dimensions are Complementary: System rankings shift substantially across dimensions. For example, Kimi-K2.5 has the highest Synthesis score (75.7) among non-MiroThinker systems but a low Factuality score (65.4). Manus has the lowest Synthesis score (55.4) but a competitive Factuality score (72.6).
  2. Process Predicts Outcome: Process quality is broadly predictive of overall outcome quality. The top systems on Process are also the top on overall outcome.
  3. Multimodal Challenge: Performance drops by 3 to 10 points in the multimodal setting. MiroThinker-H1 is most resilient (-3.0), while Qwen-3.5-Plus suffers the largest drop (-8.6).

Outcome-Level Analysis (Table 4, Figure 5)

  • Synthesis Sub-Metrics: Specificity is the universal bottleneck (lowest-scoring sub-metric). Insight is the most discriminative capability (scores range from 54.8 to 80.3).
  • Factual Claims – Precision-Volume Trade-off: Systems exhibit different strategies. For example:
    • ChatGLM Agent: High volume (4,096 correct claims) but lower precision (68.6 Right Ratio).
    • OpenAI Deep Research: Lower volume (3,335 correct) but high precision (83.3 Right Ratio).
    • MiroThinker Series: Achieves a balance—MiroThinker-H1 has high volume (3,746 correct) and high precision (81.1 Right Ratio) with the lowest absolute error count (161 wrong claims).

Process-Level Analysis (Table 5)

  • Intrinsic Quality: Systems achieve reasonable Search Breadth but substantially lower Analytical Depth, making Depth the most discriminative intrinsic metric. Efficiency is a universal weakness, indicating substantial redundancy in research processes.
  • Alignment Asymmetry: Process → Report (P→R) scores are generally high (MiroThinker-H1: 87.0), while Report → Process (R→P) scores are dramatically lower (MiroThinker-H1: 63.3), revealing a significant traceability gap: report content often cannot be traced back to the documented research process.
  • Correlation: Process quality shows a strong Pearson correlation (0.88) with the combined outcome score, confirming its predictive value.
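As a sanity check on the reported correlation, Pearson's r can be computed directly over the Overall columns of the Table 4 and Table 5 excerpts. Note this covers only the five systems present in both excerpts; the paper's r = 0.88 is computed over all 13 systems.

```python
import math

# Pearson correlation between process Overall (Table 5 excerpt) and
# outcome Overall (Table 4 excerpt). Only 5 of the 13 systems appear
# in both excerpts, so this is not the paper's full r = 0.88.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

process = [74.7, 73.1, 72.7, 67.1, 66.0]  # Table 5 Overall
outcome = [78.9, 78.6, 76.9, 71.3, 68.6]  # Table 4 Overall
print(round(pearson(process, outcome), 2))  # strongly positive on this excerpt
```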

Further Analysis

  • User-Derived vs. Auto-Generated Queries (Table 6): User-derived queries are consistently harder, but system rankings remain stable across both sources, validating the automated construction pipeline.
  • Evaluation Robustness (Appendix D): Results are robust across repeated runs (std. dev. < 0.6), alternative judge models (Gemini), and prompt modifications. A human ranking study with 5 experts showed strong agreement with MiroEval rankings (Kendall’s τ = 0.91).

Performance Comparison Tables:

Table 3: Overall Performance Comparison

| Model | Text-Only Overall | Multimodal Overall |
| --- | --- | --- |
| MiroThinker-H1 | 77.5 | 74.5 |
| OpenAI Deep Research | 76.7 | 70.2 |
| MiroThinker-1.7 | 75.5 | 71.6 |
| Gemini-3.1-Pro | 69.9 | 68.1 |
| Claude-Opus-4.6 | 67.7 | 66.4 |
| MiniMax-M2.5 | 67.4 | 63.3 |
| Kimi-K2.5 | 68.4 | – |
| ChatGLM Agent | 65.8 | 63.6 |
| Qwen-3.5-Plus | 64.7 | 56.1 |
| Manus-1.6-Max | 64.0 | 62.0 |
| Doubao | 60.7 | – |
| Grok | 60.2 | 60.5 |

Table 4: Synthesis and Factuality Breakdown (Text-Only, Excerpt)

| Model | Synthesis Avg | Factuality Ratio | Overall |
| --- | --- | --- | --- |
| MiroThinker-H1 | 76.7 | 81.1 | 78.9 |
| OpenAI Deep Research | 73.8 | 83.3 | 78.6 |
| MiroThinker-1.7 | 74.3 | 79.4 | 76.9 |
| Gemini-3.1-Pro | 71.2 | 71.3 | 71.3 |
| Kimi-K2.5 | 75.7 | 65.4 | 70.6 |
| Claude-Opus-4.6 | 67.3 | 69.8 | 68.6 |

Table 5: Process Evaluation Breakdown (Text-Only, Excerpt)

| Model | Intrinsic Avg | Alignment Avg | Overall |
| --- | --- | --- | --- |
| MiroThinker-H1 | 70.4 | 78.9 | 74.7 |
| OpenAI Deep Research | 72.0 | 74.1 | 73.1 |
| MiroThinker-1.7 | 70.1 | 75.2 | 72.7 |
| Gemini-3.1-Pro | 68.2 | 66.0 | 67.1 |
| Claude-Opus-4.6 | 64.8 | 67.2 | 66.0 |

Theoretical and Practical Implications

  • Holistic System Diagnosis: MiroEval moves beyond final-report evaluation to provide a multi-dimensional diagnostic tool, revealing complementary strengths and weaknesses (e.g., synthesis vs. factuality trade-offs).
  • Importance of Process Evaluation: Demonstrates that process quality is a reliable predictor of overall outcome and uncovers critical weaknesses—like the traceability gap (R→P) and low Analytical Depth—that are invisible to output-level metrics. This validates process-centric evaluation as essential for assessing thorough investigation.
  • Multimodal as a Key Challenge: The significant performance drop on multimodal tasks highlights that current systems struggle to integrate and reason over visual and document materials effectively, pointing to a crucial area for future development.
  • Benchmark Design Principles: The dual-path construction (user-derived + auto-generated) ensures tasks are grounded in real needs while enabling temporal refresh, providing a model for creating live, evolving benchmarks.
  • Guidance for Improvement: The analysis identifies specific bottlenecks: improving Specificity in reports, enhancing Analytical Depth and Efficiency in processes, and closing the traceability gap between reports and their supporting research.

Conclusion

MiroEval provides a comprehensive benchmark and evaluation framework for deep research systems, addressing gaps in existing evaluations by incorporating real user needs, multimodal support, and process-centric assessment.

The evaluation of 13 systems yields three principal findings:

  1. Synthesis quality, factual precision, and process rigor are complementary dimensions.
  2. Process quality reliably predicts overall outcome and reveals critical weaknesses like insufficient analytical depth and report-process traceability gaps.
  3. Multimodal tasks pose substantially greater challenges, with most systems declining by 3-10 points.

The MiroThinker series, particularly MiroThinker-H1, demonstrates the most balanced performance across all dimensions. Human verification and robustness experiments confirm the benchmark's quality and the evaluation framework's reliability.

Limitations and Future Work: Process evaluation requires systems to expose intermediate traces, limiting applicability to fully closed-source systems. Factuality evaluation flags conflicts (CONFLICT) but does not resolve them. Future work will leverage the refreshable pipeline to keep MiroEval temporally relevant as a live benchmark.