MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification - Technical Report Summary

Summary (Overview)

  • New Research Agent Family: Introduces MiroThinker-1.7, a research agent for complex long-horizon reasoning, and MiroThinker-H1, which extends it with heavy-duty verification-centric reasoning for more reliable multi-step problem solving.
  • Agentic Mid-Training: A key contribution is an agentic mid-training stage that strengthens atomic capabilities (planning, reasoning, tool use, summarization) at each interaction step, enabling more effective scaling of reasoning trajectories.
  • Verification-Centric Reasoning (H1): MiroThinker-H1 incorporates Local and Global Verifiers to audit reasoning during inference, allowing for step refinement and selection of the best-supported final answer.
  • State-of-the-Art Performance: MiroThinker-H1 achieves top results on benchmarks like BrowseComp (88.2), GAIA (88.5), and FrontierScience-Olympiad (79.0), outperforming leading commercial and open-source agents.
  • Open-Source Release: Releases MiroThinker-1.7 and a smaller variant, MiroThinker-1.7-mini, as open-source models, providing competitive agent capabilities with improved efficiency.

Introduction and Theoretical Foundation

Recent Large Language Models (LLMs) excel at conversational tasks but struggle with real-world problems requiring long chains of reasoning, iterative information gathering, and verification (e.g., scientific analysis, financial research). While agentic AI systems that interact with tools have emerged, simply scaling the length of reasoning trajectories often degrades performance due to error propagation.

This paper argues that improving long-horizon reasoning requires scaling effective interaction, which depends on:

  1. Strong atomic agentic capabilities at each step (planning, reasoning, tool execution).
  2. Reliable mechanisms to verify and refine reasoning trajectories during problem-solving.

Motivated by this, the authors introduce MiroThinker-1.7, which focuses on strengthening step-level capabilities, and MiroThinker-H1, which adds a verification-centric reasoning mode for more reliable long-horizon problem-solving.

Methodology

1. Agentic Workflow & Formulation

MiroThinker builds on the ReAct paradigm, extending it with context management within a dual-loop structure:

  • Step Loop: Manages individual reasoning steps within an episode.
  • Episode Loop: Handles trajectory-level restarts if an episode fails.
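The dual-loop control flow can be sketched as follows. This is a minimal illustration, not the released implementation; the names `run_agent`, `policy`, and `execute`, and the failure criterion (step budget exhausted), are our assumptions:

```python
def run_agent(question, policy, execute, max_episodes=3, max_steps=50):
    """Dual-loop skeleton: an outer episode loop that restarts the trajectory
    when an episode fails, and an inner step loop of thought/action/observation."""
    for _ in range(max_episodes):               # episode loop: trajectory-level restarts
        history = []                            # fresh trajectory log for this episode
        for _ in range(max_steps):              # step loop: one reasoning step each
            thought, action = policy(question, history)
            observation = execute(action)       # run the chosen tool call
            history.append((thought, action, observation))
            if action == "final_answer":        # episode succeeded
                return observation
        # step budget exhausted: treat the episode as failed and restart
    return None                                 # every episode failed
```

The restart in the outer loop is what distinguishes this from plain ReAct: a derailed episode is abandoned wholesale rather than extended.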

The agent accumulates a trajectory log $H_t^{(e)} = \{(T_1, A_1, O_1), \ldots, (T_{t-1}, A_{t-1}, O_{t-1})\}$, where $T_i$, $A_i$, and $O_i$ are the thought, action, and observation at step $i$. A context operator $\Phi_t$ transforms this log into an effective context $C_t^{(e)}$ that fits within token limits using a sliding window and truncation:

$$S_t(K) = \{\, i \in \{1, \ldots, t-1\} \mid i \ge t - K \,\}$$

$$\Phi_t(O_i) = \begin{cases} \text{Trunc}_L(O_i), & i \in S_t(K) \\ \emptyset, & \text{otherwise} \end{cases}$$

$$C_t^{(e)} = \{(T_i, A_i, \Phi_t(O_i))\}_{i=1}^{t-1}$$

The agent's reasoning and action selection operate on this managed view: $T_t = f_\theta(q, C_t^{(e)})$ and $A_t = \pi_\theta(C_t^{(e)}, T_t)$.
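Under these definitions, the context operator amounts to keeping every thought/action pair while windowing and truncating the observations. A minimal sketch following the formulas above, with character-level truncation standing in for the report's token-level $\text{Trunc}_L$ (the function name and defaults are ours):

```python
def build_context(history, K=5, L=500):
    """Context operator Phi_t: keep every (thought, action) pair, but retain
    only the last K observations, each truncated to at most L characters.

    `history` is the trajectory log [(T_1, A_1, O_1), ...].
    """
    t = len(history) + 1                               # current step index (1-based)
    window = {i for i in range(1, t) if i >= t - K}    # S_t(K)
    context = []
    for i, (thought, action, obs) in enumerate(history, start=1):
        kept = obs[:L] if i in window else ""          # Phi_t(O_i); "" plays the empty set
        context.append((thought, action, kept))
    return context
```

Reasoning then conditions on this managed view rather than the raw log, so the prompt stays within the token budget no matter how long the episode runs.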

2. High-Quality QA Construction

A dual-pipeline framework generates training data:

  • Corpus-based Pipeline: Efficient, large-scale generation from structured knowledge graphs (e.g., Wikipedia) for breadth.
  • Web-Augmented Multi-hop Pipeline (WebHop): Generates precisely calibrated questions with verified multi-hop structure and open-web grounding for depth. It uses structured reasoning trees, web-based semantic expansion, and hierarchical solvability verification.

3. Training Pipeline

A four-stage pipeline based on Qwen3 MoE checkpoints:

  1. Agentic Mid-training: Strengthens atomic capabilities using large-scale supervision for single-turn planning and interleaved reasoning/summarization. The objective is:

    $$\mathcal{L}_{\text{mid}}(\theta) = -\mathbb{E}_{(C_{<k},\, y_k) \sim \mathcal{D}_{\text{mid}}} \left[ \log \pi_\theta(y_k \mid C_{<k}) \right]$$

    where $y_k$ is the target output at step $k$.

  2. Agentic Supervised Fine-Tuning (SFT): Trains the model to replicate expert multi-step trajectories. The objective is:

    $$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, H)} \left[ \sum_{t=1}^{T_H} \log \pi_\theta(T_t, A_t \mid x, H_{<t}) \right]$$
  3. Agentic Preference Optimization (DPO): Uses Direct Preference Optimization with an auxiliary SFT loss on preferred trajectories to improve decision-making.

    $$\mathcal{L}_{\text{DPO}}(x, H^+, H^-) = -\log \sigma\left( \beta \left[ \left( \log \pi_\theta(H^+ \mid x) - \log \pi_\theta(H^- \mid x) \right) - \left( \log \pi_{\text{ref}}(H^+ \mid x) - \log \pi_{\text{ref}}(H^- \mid x) \right) \right] \right)$$

    $$\mathcal{L}_{\text{PO}}(\theta) = \mathbb{E}_{(x, H^+, H^-)} \left[ \mathcal{L}_{\text{DPO}}(x, H^+, H^-) \right] + \lambda\, \mathcal{L}_{\text{SFT}}^{(+)}(\theta)$$
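The DPO term and its combination with the auxiliary SFT loss can be computed directly from trajectory log-probabilities. A minimal sketch assuming scalar sequence log-probs per trajectory; `dpo_loss` and `po_objective` are illustrative names, and the `beta`/`lam` defaults are ours:

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO term for one (H+, H-) pair: -log sigmoid(beta * margin), where the
    margin is the policy's log-odds of H+ over H- relative to the reference."""
    margin = (logp_pos - logp_neg) - (ref_logp_pos - ref_logp_neg)
    return math.log1p(math.exp(-beta * margin))  # stable form of -log sigmoid

def po_objective(pairs, sft_loss_preferred, beta=0.1, lam=0.1):
    """Full preference objective: mean DPO loss over pairs plus the auxiliary
    SFT loss on preferred trajectories, weighted by lambda."""
    dpo = sum(dpo_loss(*p, beta=beta) for p in pairs) / len(pairs)
    return dpo + lam * sft_loss_preferred
```

When the policy matches the reference, the margin is zero and the loss sits at $\log 2$; preferring $H^+$ more strongly than the reference drives it lower.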
  4. Agentic Reinforcement Learning (RL): Uses Group Relative Policy Optimization (GRPO) for online trial-and-error refinement in live environments, with a targeted entropy control mechanism.

    $$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{H \sim \pi_\theta} \left[ \hat{A}(x, H) \log \pi_\theta(H \mid x) - \sum_{t=1}^{|H|} \beta_{\text{KL}}(t, H)\, D_{\text{KL}}\left( \pi_\theta(\cdot \mid s_t) \,\|\, \pi_{\text{ref}}(\cdot \mid s_t) \right) \right]$$
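The summary does not spell out the advantage estimate $\hat{A}(x, H)$. In standard GRPO it is the group-normalized reward: each trajectory's reward is normalized against the mean and standard deviation of all rollouts sampled for the same prompt. A sketch under that assumption:

```python
def group_relative_advantages(rewards):
    """Group-relative advantage: normalize each sampled trajectory's reward by
    the mean and standard deviation of its group (all rollouts for one prompt)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:                 # identical rewards carry no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

Because the baseline is the group mean, no learned value function is needed: trajectories better than their siblings get positive advantage, worse ones negative.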

4. Heavy-Duty Reasoning Mode (MiroThinker-H1)

Extends MiroThinker-1.7 with verification integrated into the reasoning process:

  • Local Verifier: Evaluates and refines intermediate reasoning steps (planning, tool calls) during inference to explore alternatives and correct errors early.
  • Global Verifier: Audits the complete reasoning trajectory, comparing candidate solution paths to select the final answer supported by the most coherent evidence chain.
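Taken together, the two verifiers suggest an inference loop of the following shape. This is our sketch, not the released implementation; `propose_step`, `local_verify`, and `global_verify` are hypothetical interfaces, and the candidate/refinement budgets are illustrative:

```python
def heavy_duty_solve(question, propose_step, local_verify, global_verify,
                     n_candidates=3, max_refinements=2, max_steps=20):
    """Verification-centric inference: a local verifier audits and refines each
    step; a global verifier scores complete trajectories to pick the answer."""
    trajectories = []
    for _ in range(n_candidates):                    # explore alternative solution paths
        history = []
        for _ in range(max_steps):
            step = propose_step(question, history)
            for _ in range(max_refinements):         # local verification: audit the step
                ok, feedback = local_verify(question, history, step)
                if ok:
                    break
                step = propose_step(question, history, feedback)  # correct errors early
            history.append(step)
            if step.get("final_answer") is not None:
                break
        trajectories.append(history)
    # global verification: keep the trajectory with the best-supported evidence chain
    best = max(trajectories, key=lambda h: global_verify(question, h))
    return best[-1].get("final_answer")
```

The key design point is that verification happens during inference, not after: a bad tool call is refined before it pollutes the rest of the trajectory.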

Empirical Validation / Results

The models were evaluated on a diverse set of agentic and domain-specific benchmarks. Key results are summarized below.

Table 1: Performance on Agentic Benchmarks

| Model | BrowseComp | BrowseComp-ZH | Humanity's Last Exam | GAIA | xbench-DeepSearch-2510 | SEAL-0 | DeepSearchQA |
|---|---|---|---|---|---|---|---|
| Qwen3.5-397B | 78.6 | 70.3 | 48.3 | 46.9 | – | – | – |
| GPT-5.4 | 82.7 | – | 52.1 | – | – | – | – |
| Gemini-3.1-Pro | 85.9 | – | 51.4 | – | – | – | – |
| Claude-4.6-Opus | 84.0 | – | 53.1 | 91.3 | – | – | – |
| MiroThinker-1.7-mini | 67.9 | 72.3 | 36.4 | 80.3 | 57.2 | 48.2 | 67.9 |
| MiroThinker-1.7 | 74.0 | 75.3 | 42.9 | 82.7 | 62.0 | 53.0 | 72.1 |
| MiroThinker-H1 | 88.2 | 84.4 | 47.7 | 88.5 | 72.0 | 61.3 | 80.6 |

Table 2: Performance on Professional-Domain Benchmarks

| Model | FrontierSci-Olympiad | SUPERChem (text) | FinSearchComp (T2/T3) | MedBrowseComp |
|---|---|---|---|---|
| GPT-5.2-high | 77.1 | 58.0 | 73.8 | – |
| Gemini-3-Pro | 76.1 | 63.2 | 52.7 | – |
| MiroThinker-1.7-mini | 67.9 | 36.8 | 62.6 | 48.2 |
| MiroThinker-1.7 | 71.5 | 42.1 | 67.9 | 54.2 |
| MiroThinker-H1 | 79.0 | 51.3 | 73.9 | 56.5 |

Key Findings:

  1. State-of-the-Art Performance: MiroThinker-H1 achieves top scores on BrowseComp (88.2), GAIA (88.5, +12.1 over GPT-5), and SEAL-0 (61.3).
  2. Strong Domain Expertise: MiroThinker-H1 leads on three of four professional-domain benchmarks, including FrontierSci-Olympiad (79.0) and FinSearchComp (73.9).
  3. Effective Interaction Scaling: MiroThinker-1.7-mini (30B params) outperforms its predecessor MiroThinker-1.5 (30B) by 16.7% on average while using 43.0% fewer interaction rounds, validating the focus on step quality over trajectory length.
  4. Impact of Verification:
    • Local Verifier: On a hard BrowseComp subset, MiroThinker-H1 with only local verification improved Pass@1 by +26.4 points while reducing steps by ~82% (from 1185.2 to 210.8).
    • Global Verifier: Provided consistent gains across all benchmarks, especially on search-intensive tasks like BrowseComp (+14.2) and SEAL-0 (+8.3).
    • Compute Scaling: Accuracy on BrowseComp scales with compute, reaching 85.9 at 16× budget and 88.2 at 64×.
  5. Long Report Quality: In an evaluation of 50 deep research queries, MiroThinker-H1 achieved the highest Report Quality score (76.8) and strong Factuality (79.1), competitive with top commercial research agents.

Theoretical and Practical Implications

  • Paradigm Shift in Agent Design: The work demonstrates that effective interaction scaling—improving the reliability of each atomic step—is more crucial for long-horizon reasoning than simply increasing the number of steps. The agentic mid-training stage provides a concrete methodology for achieving this.
  • Verification as a Core Component: Integrating verification (local and global) directly into the reasoning process is shown to be a powerful technique for improving reliability, reducing wasted computation, and correcting errors, moving beyond post-hoc checking.
  • Open-Source Advancements: The release of MiroThinker-1.7 and its mini variant provides the community with highly capable, efficient research agents, lowering the barrier to entry for advanced agentic AI research and applications.
  • Practical Applicability: Strong performance across specialized domains (science, finance, medicine) indicates the framework's potential for building reliable AI assistants in knowledge-intensive professional fields.

Conclusion

MiroThinker-1.7 and H1 represent a significant advance in building research agents capable of complex, long-horizon reasoning. By combining agent-native training (focusing on atomic step capabilities) with a verification-centric reasoning mode, the systems achieve state-of-the-art performance across diverse and challenging benchmarks. The results validate the core thesis that improving the quality and verifiability of each interaction step is key to effective scaling. The release of open-source models fosters further research and development in agentic AI. Future work may explore more advanced verification mechanisms, broader tool integration, and applications in even more complex real-world scenarios.