MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
Summary (Overview)
- Holistic Benchmark: Introduces MiroEval, a benchmark and evaluation framework for deep research agents, comprising 100 tasks (70 text-only, 30 multimodal) grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates.
- Multi-Dimensional Evaluation: Proposes a three-layer evaluation suite: (1) Adaptive Synthesis Quality with task-specific rubrics, (2) Agentic Factuality Verification via active retrieval over web and multimodal attachments, and (3) Process-Centric Evaluation auditing search, reasoning, and refinement.
- Key Findings: Evaluation of 13 systems reveals: the three dimensions capture complementary capabilities; process quality is a reliable predictor of overall outcome and reveals weaknesses invisible to output metrics; multimodal tasks pose substantially greater challenges, causing performance drops of 3-10 points.
- Top Performers: The MiroThinker series (particularly MiroThinker-H1) achieves the most balanced performance, ranking highest overall in both text-only (77.5) and multimodal (74.5) settings.
- Validation: Human verification confirms benchmark quality (92.0% precision). Robustness experiments and a human ranking study (Kendall’s τ = 0.91) validate the evaluation framework's reliability.
Introduction and Theoretical Foundation
The rapid advancement of Large Language Models (LLMs) has enabled a shift from passive text generation to agentic systems capable of autonomous planning and execution. Deep research, defined as the autonomous, multi-step process of investigating complex information needs through iterative search, evidence gathering, verification, and synthesis, has become a prominent paradigm.
As these systems are adopted in high-stakes domains (finance, healthcare, legal analysis), users demand more than fluent reports: they need factually reliable answers, grounded in thorough, traceable investigation, and capable of incorporating multimodal materials (images, PDFs, spreadsheets) common in real-world queries.
Existing benchmarks have limitations:
- Evaluate only the final report, not the underlying research process.
- Offer limited multimodal coverage beyond short-form QA.
- Rely on synthetic or academic queries that don't capture real-world complexity.
- Are static, risking obsolescence as knowledge evolves.
To address these gaps, MiroEval is introduced as a holistic diagnostic tool for the next generation of deep research agents, focusing on real user needs, multimodal support, and process-level assessment.
Methodology
1. Benchmark Construction (Query Collection)
The benchmark comprises 100 queries (70 text-only, 30 multimodal) built via two complementary paths (Figure 2), enabling a live and evolving setting.
A. User-Derived Query Curation (65 queries):
- Source: Inspired by query patterns from a closed internal testing phase (text and multimodal with attachments).
- Privacy: No original user queries appear. Strict protocols include automated filtering of sensitive content and systematic replacement of named entities.
- Process: An LLM classifies each anonymized query along dimensions (attachment type, complexity, target evaluation features). Queries are then routed to one of 6 rewriting strategies (Table 9) spanning three difficulty tiers (Easy, Medium, Hard) based on constraints, feature matching, quota bonuses, and usage decay.
- Strategies target specific evaluation features (Table 8), such as search, multimodal understanding, error correction, and planning.
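The routing step above can be sketched as a scoring rule over candidate strategies. Everything here beyond the four stated signals (constraints, feature matching, quota bonuses, usage decay) is a hypothetical assumption: the strategy names, weights, and the additive scoring form are illustrative, not the paper's exact rule.

```python
# Hypothetical sketch of strategy routing. The additive score and the decay
# constant are assumptions; only the four signal names come from the source.

def route_strategy(query_features, strategies, usage_counts, decay=0.1):
    """Pick the rewriting strategy with the best score for a classified query.

    Score = feature overlap + quota bonus for under-used strategies,
    minus a usage-decay penalty so strategies rotate across the dataset.
    """
    best, best_score = None, float("-inf")
    for s in strategies:
        if not s["constraints"](query_features):   # hard constraints first
            continue
        match = len(set(s["targets"]) & set(query_features["targets"]))
        bonus = 1.0 if usage_counts[s["name"]] < s["quota"] else 0.0
        score = match + bonus - decay * usage_counts[s["name"]]
        if score > best_score:
            best, best_score = s["name"], score
    return best
```

A query classified as targeting `planning` would then be routed to whichever strategy lists that feature, provided its constraints and quota permit.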
B. Automated Query Generation (35 text-only queries):
- Source: Grounded in real-time web trends (via Serper API) across 12 topics with 3 subtopics each (Table 10).
- Process: An LLM generates 15 candidate queries per topic, conditioned on trends and anonymized seed exemplars.
- Three-Stage Filtering:
- Search Validation: Requires ≥ 3 results from ≥ 2 distinct domains.
- Deep-Research Necessity: An LLM evaluates if the query demands external investigation (confidence ≥ 0.7).
- Inverse Quality Assessment: Retains only queries for which a baseline answer generated without search access is inadequate. The joint condition combines three signals: a continuous quality score, a categorical adequacy label, and a binary `requires_search` flag.
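The three-stage filter can be sketched as a single predicate. The thresholds stated in the source (at least 3 results from at least 2 domains, LLM confidence at least 0.7) are used as given; the 0.5 quality cutoff and the disjunctive form of the stage-3 condition are assumptions for illustration.

```python
# Minimal sketch of the three-stage candidate filter. The 0.5 quality cutoff
# and the OR-combination in stage 3 are assumptions, not the paper's rule.

def passes_filter(search_results, necessity_conf, baseline):
    """Return True if a candidate query survives all three stages.

    search_results: list of (domain, snippet) pairs from a search API
    necessity_conf: LLM confidence that the query needs external research
    baseline: quality signals for the no-search baseline answer
    """
    # Stage 1: search validation -- >= 3 results over >= 2 distinct domains.
    domains = {d for d, _ in search_results}
    if len(search_results) < 3 or len(domains) < 2:
        return False
    # Stage 2: deep-research necessity -- LLM confidence threshold.
    if necessity_conf < 0.7:
        return False
    # Stage 3: inverse quality -- keep only if the no-search baseline fails.
    return (baseline["quality"] < 0.5
            or baseline["label"] == "inadequate"
            or baseline["requires_search"])
```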
Benchmark Overview (Figure 3):
- Domains: Covers 12 domains (Tech: 20, Finance: 17, Science: 13, etc.).
- Task Types: 10 types (Decision & Recommendation: 17, Comparative Analysis: 16, Fact Enumeration & Verification: 15, etc.).
- Quality Verification: Three expert annotators achieved an overall precision of 92.0% (Table 2).
2. Evaluation Framework (Three Complementary Layers)
The evaluation suite assesses systems along three dimensions (Figure 4).
A. Comprehensive Adaptive Synthesis Quality Evaluation (§3.1)
- Adaptive Dimensions: For a query with instruction $I$ and optional attachments $A$, the framework constructs a tailored dimension space $\mathcal{D} = \mathcal{D}_{\text{univ}} \cup \mathcal{D}_{\text{task}}$.
- $\mathcal{D}_{\text{univ}}$: Universal aspects (Coverage, Insight, Instruction-following, Clarity).
- $\mathcal{D}_{\text{task}}$: For text-only queries ($A = \varnothing$), generates 1–3 task-specific expertise dimensions. For attachment-augmented queries ($A \neq \varnothing$), adds a Grounding dimension.
- Key Facts Extraction: For multimodal tasks, an upstream module extracts verifiable factual anchors from raw attachments to generate precise, attachment-specific grounding criteria.
- Dynamic Scoring: Dimension weights $w_d$ and criterion weights $v_c$ are dynamically assigned. The evaluator scores each criterion as $s_c$, and the final quality score is the doubly weighted sum $S = \sum_{d \in \mathcal{D}} w_d \sum_{c \in d} v_c \, s_c$, with weights normalized.
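The dynamic scoring step can be sketched as a normalized, doubly weighted aggregation. The data layout and the normalization scheme here are assumptions; the source only states that dimension and criterion weights are dynamically assigned.

```python
# Sketch of weighted score aggregation; the dict layout and per-dimension
# normalization are assumptions made for illustration.

def synthesis_score(dimensions):
    """Aggregate criterion scores into one quality score.

    dimensions: list of dicts, each with a dimension "weight" and a list of
    "criteria", each criterion carrying a "weight" and a judge "score".
    Returns sum_d w_d * (weighted mean of that dimension's criterion scores),
    with dimension weights normalized to sum to one.
    """
    total_w = sum(d["weight"] for d in dimensions)
    score = 0.0
    for d in dimensions:
        cw = sum(c["weight"] for c in d["criteria"])
        dim_score = sum(c["weight"] * c["score"] for c in d["criteria"]) / cw
        score += (d["weight"] / total_w) * dim_score
    return score
```

With two dimensions weighted 2:1 scoring 80 and 50, this yields (2/3)·80 + (1/3)·50 = 70.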
B. Agentic Factuality Evaluation (§3.2)
- Claim Decomposition: The report is decomposed into a set of verifiable statements $\{s_1, \dots, s_n\}$.
- Evidence Retrieval: For each statement $s_i$, an evaluation agent retrieves evidence from two sources:
- $E_{\text{web}}$: From external web search.
- $E_{\text{att}}$: From task-provided attachments using native multimodal processing (for images, PDFs) or retrieval-augmented processing (for spreadsheets, slides).
- Consistency Assessment: The agent assigns each statement a factuality label based on its agreement with the retrieved evidence. The CONFLICT label explicitly captures disagreements between heterogeneous sources.
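The consistency assessment can be sketched as follows. Only the CONFLICT label is named in the source; the other label names and the pluggable `judge` function (standing in for the LLM-based evaluation agent) are assumptions.

```python
# Sketch of per-claim consistency assessment. Label names other than CONFLICT
# and the judge interface are assumptions for illustration.

def assess_claim(claim, web_evidence, att_evidence, judge):
    """Label one claim against web and attachment evidence.

    judge(claim, evidence) -> "support", "refute", or "neutral".
    CONFLICT is emitted when the two evidence sources disagree.
    """
    verdicts = set()
    for ev in (web_evidence, att_evidence):
        if ev is not None:
            verdicts.add(judge(claim, ev))
    verdicts.discard("neutral")          # neutral evidence is uninformative
    if verdicts == {"support", "refute"}:
        return "CONFLICT"
    if verdicts == {"support"}:
        return "SUPPORTED"
    if verdicts == {"refute"}:
        return "REFUTED"
    return "UNVERIFIED"
```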
C. Process-Centric Evaluation (§3.3)
- Process Representation: Raw process logs are transformed into a structured sequence of atomic steps (information acquisition, evidence inspection, synthesis, etc.) to recover dependencies and extract key process findings.
- Intrinsic Process Quality: Evaluated across five dimensions:
- Search Breadth: Explores wide range of sources/perspectives.
- Analytical Depth: Conducts multi-step reasoning and in-depth analysis.
- Progressive Refinement: Iteratively improves understanding.
- Critical Thinking: Evaluates source reliability and handles conflicts.
- Efficiency: Avoids unnecessary redundancy.
- Process-Report Alignment: Evaluates consistency between process findings and the final report in two directions:
- Process → Report (P→R): Checks if major process findings are realized in the report.
- Report → Process (R→P): Checks if report conclusions are traceable to sufficient process support.
- Contradiction Detection (Contr): Evaluates handling of conflicting evidence.
- Overall Process Score: Defined as a weighted combination of intrinsic process quality, process-report alignment, and contradiction handling.
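As a hedged illustration, the overall process score can be sketched as a weighted mean of the Intrinsic and Alignment averages. Equal weights are an assumption, though they are consistent with the Table 5 excerpt, where each model's Overall equals the mean of its two averages (e.g. (70.4 + 78.9) / 2 = 74.65 ≈ 74.7 for MiroThinker-H1); contradiction handling may be folded into the alignment term.

```python
# Equal weights are an assumption; they happen to reproduce the Table 5 excerpt.

def process_score(intrinsic_avg, alignment_avg, w_intr=0.5, w_align=0.5):
    """Weighted combination of intrinsic quality and process-report alignment."""
    return w_intr * intrinsic_avg + w_align * alignment_avg
```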
Empirical Validation / Results
Evaluation was conducted across 13 leading deep research systems (Table 3), including OpenAI Deep Research, Gemini-3.1-Pro, Claude, and three MiroThinker variants.
Main Results (Text-Only Setting)
Overall Performance Tiers (Text-Only):
- Top Tier: MiroThinker-H1 (77.5), OpenAI Deep Research (76.7), MiroThinker-1.7 (75.5).
- Middle Tier: Gemini-3.1-Pro (69.9), Kimi-K2.5 (68.4), Claude-Opus-4.6 (67.7), MiniMax-M2.5 (67.4).
- Lower Tier: Qwen-3.5-Plus (64.7), Manus-1.6-Max (64.0), Doubao (60.7), Grok (60.2).
Key Findings:
- Dimensions are Complementary: System rankings shift substantially across dimensions. For example, Kimi-K2.5 has the highest Synthesis score (75.7) among non-MiroThinker systems but a low Factuality score (65.4). Manus has the lowest Synthesis score (55.4) but a competitive Factuality score (72.6).
- Process Predicts Outcome: Process quality is broadly predictive of overall outcome quality. The top systems on Process are also the top on overall outcome.
- Multimodal Challenge: Performance drops by 3 to 10 points in the multimodal setting. MiroThinker-H1 is most resilient (-3.0), while Qwen-3.5-Plus suffers the largest drop (-8.6).
Outcome-Level Analysis (Table 4, Figure 5)
- Synthesis Sub-Metrics: Specificity is the universal bottleneck (lowest-scoring sub-metric). Insight is the most discriminative capability (scores range from 54.8 to 80.3).
- Factual Claims – Precision-Volume Trade-off: Systems exhibit different strategies. For example:
- ChatGLM Agent: High volume (4,096 correct claims) but lower precision (68.6 Right Ratio).
- OpenAI Deep Research: Lower volume (3,335 correct) but high precision (83.3 Right Ratio).
- MiroThinker Series: Achieves a balance—MiroThinker-H1 has high volume (3,746 correct) and high precision (81.1 Right Ratio) with the lowest absolute error count (161 wrong claims).
Process-Level Analysis (Table 5)
- Intrinsic Quality: Systems achieve reasonable Search Breadth but substantially lower Analytical Depth, making Depth the most discriminative intrinsic metric. Efficiency is a universal weakness, indicating substantial redundancy in research processes.
- Alignment Asymmetry: Process → Report (P→R) scores are generally high (MiroThinker-H1: 87.0), while Report → Process (R→P) scores are dramatically lower (MiroThinker-H1: 63.3), revealing a significant traceability gap: report content often cannot be traced back to the documented research process.
- Correlation: Process quality shows a strong Pearson correlation (0.88) with the combined outcome score, confirming its predictive value.
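The Pearson correlation is straightforward to compute. The paper's 0.88 figure is over all evaluated systems; the five models appearing in both table excerpts here merely illustrate the computation (and also correlate strongly on their own).

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Outcome (Table 4) vs. process (Table 5) overall scores for the five models
# appearing in both excerpts, in the same order.
outcome = [78.9, 78.6, 76.9, 71.3, 68.6]
process = [74.7, 73.1, 72.7, 67.1, 66.0]
```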
Further Analysis
- User-Derived vs. Auto-Generated Queries (Table 6): User-derived queries are consistently harder, but system rankings remain stable across both sources, validating the automated construction pipeline.
- Evaluation Robustness (Appendix D): Results are robust across repeated runs (std. dev. < 0.6), alternative judge models (Gemini), and prompt modifications. A human ranking study with 5 experts showed strong agreement with MiroEval rankings (Kendall’s τ = 0.91).
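Kendall's τ, used in the human ranking study, measures rank agreement via concordant versus discordant pairs. A minimal tau-a implementation (assuming no ties, which is an assumption about the study's rankings):

```python
def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two tie-free rankings of the same items.

    rank_a, rank_b: dicts mapping item -> rank position.
    tau = (concordant pairs - discordant pairs) / total pairs.
    """
    items = list(rank_a)
    conc = disc = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a = rank_a[items[i]] - rank_a[items[j]]
            b = rank_b[items[i]] - rank_b[items[j]]
            if a * b > 0:
                conc += 1      # pair ordered the same way in both rankings
            elif a * b < 0:
                disc += 1      # pair ordered oppositely
    n = len(items)
    return (conc - disc) / (n * (n - 1) / 2)
```

A τ of 0.91 thus means the vast majority of system pairs are ordered identically by MiroEval and the human experts.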
Performance Comparison Tables:
Table 3: Overall Performance Comparison
| Model | Text-Only Overall | MultiModal Overall |
|---|---|---|
| MiroThinker-H1 | 77.5 | 74.5 |
| OpenAI Deep Research | 76.7 | 70.2 |
| MiroThinker-1.7 | 75.5 | 71.6 |
| Gemini-3.1-Pro | 69.9 | 68.1 |
| Claude-Opus-4.6 | 67.7 | 66.4 |
| MiniMax-M2.5 | 67.4 | 63.3 |
| Kimi-K2.5 | 68.4 | – |
| ChatGLM Agent | 65.8 | 63.6 |
| Qwen-3.5-Plus | 64.7 | 56.1 |
| Manus-1.6-Max | 64.0 | 62.0 |
| Doubao | 60.7 | – |
| Grok | 60.2 | 60.5 |
Table 4: Synthesis and Factuality Breakdown (Text-Only, Excerpt)
| Model | Synthesis Avg | Factuality Ratio | Overall |
|---|---|---|---|
| MiroThinker-H1 | 76.7 | 81.1 | 78.9 |
| OpenAI Deep Research | 73.8 | 83.3 | 78.6 |
| MiroThinker-1.7 | 74.3 | 79.4 | 76.9 |
| Gemini-3.1-Pro | 71.2 | 71.3 | 71.3 |
| Kimi-K2.5 | 75.7 | 65.4 | 70.6 |
| Claude-Opus-4.6 | 67.3 | 69.8 | 68.6 |
Table 5: Process Evaluation Breakdown (Text-Only, Excerpt)
| Model | Intrinsic Avg | Alignment Avg | Overall |
|---|---|---|---|
| MiroThinker-H1 | 70.4 | 78.9 | 74.7 |
| OpenAI Deep Research | 72.0 | 74.1 | 73.1 |
| MiroThinker-1.7 | 70.1 | 75.2 | 72.7 |
| Gemini-3.1-Pro | 68.2 | 66.0 | 67.1 |
| Claude-Opus-4.6 | 64.8 | 67.2 | 66.0 |
Theoretical and Practical Implications
- Holistic System Diagnosis: MiroEval moves beyond final-report evaluation to provide a multi-dimensional diagnostic tool, revealing complementary strengths and weaknesses (e.g., synthesis vs. factuality trade-offs).
- Importance of Process Evaluation: Demonstrates that process quality is a reliable predictor of overall outcome and uncovers critical weaknesses—like the traceability gap (R→P) and low Analytical Depth—that are invisible to output-level metrics. This validates process-centric evaluation as essential for assessing thorough investigation.
- Multimodal as a Key Challenge: The significant performance drop on multimodal tasks highlights that current systems struggle to integrate and reason over visual/content materials effectively, pointing to a crucial area for future development.
- Benchmark Design Principles: The dual-path construction (user-derived + auto-generated) ensures tasks are grounded in real needs while enabling temporal refresh, providing a model for creating live, evolving benchmarks.
- Guidance for Improvement: The analysis identifies specific bottlenecks: improving Specificity in reports, enhancing Analytical Depth and Efficiency in processes, and closing the traceability gap between reports and their supporting research.
Conclusion
MiroEval provides a comprehensive benchmark and evaluation framework for deep research systems, addressing gaps in existing evaluations by incorporating real user needs, multimodal support, and process-centric assessment.
The evaluation of 13 systems yields three principal findings:
- Synthesis quality, factual precision, and process rigor are complementary dimensions.
- Process quality reliably predicts overall outcome and reveals critical weaknesses like insufficient analytical depth and report-process traceability gaps.
- Multimodal tasks pose substantially greater challenges, with most systems declining by 3-10 points.
The MiroThinker series, particularly MiroThinker-H1, demonstrates the most balanced performance across all dimensions. Human verification and robustness experiments confirm the benchmark's quality and the evaluation framework's reliability.
Limitations and Future Work: Process evaluation requires systems to expose intermediate traces, limiting applicability to fully closed-source systems. Factuality evaluation flags conflicts (CONFLICT) but does not resolve them. Future work will leverage the refreshable pipeline to keep MiroEval temporally relevant as a live benchmark.