OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis
Summary (Overview)
- Open and Reproducible Synthesis Pipeline: Introduces OpenResearcher, a scalable pipeline that decouples one-time online corpus bootstrapping from multi-turn trajectory synthesis, executing the search-and-browse loop entirely offline. This eliminates reliance on costly and unstable live web APIs.
- Explicit Browser Abstraction: Models realistic browsing behavior with three minimal primitives: `search` (retrieves results), `open` (fetches a full document), and `find` (locates evidence within a document). This enables systematic, multi-scale information discovery.
- Effective Trajectory-Based Training: Synthesizes over 97,000 long-horizon trajectories (including many with 100+ tool calls) using GPT-OSS-120B as a teacher model. Supervised Fine-Tuning (SFT) of a 30B-A3B backbone on these trajectories yields a model that significantly outperforms its base version and rivals proprietary systems on key benchmarks.
- Controlled Analysis and Insights: The offline environment enables detailed studies of pipeline design, revealing that: final-answer correctness is not a dominant filtering signal for SFT; one-time online bootstrapping for corpus coverage is essential; and retrieving ("hitting") a gold document does not guarantee a correct final answer.
- Strong Empirical Performance: The trained OpenResearcher-30B-A3B model achieves 54.8% accuracy on BrowseComp-Plus (a +34.0 point improvement over the base model) and remains competitive on live-web benchmarks (BrowseComp, GAIA, xbench-DeepSearch), demonstrating effective generalization from offline training.
Introduction and Theoretical Foundation
Training capable deep research agents—systems that perform iterative search, evidence aggregation, and multi-step reasoning—is bottlenecked by the scarcity of high-quality, long-horizon trajectories that reflect realistic web browsing behavior. Existing approaches often rely on proprietary live web APIs (e.g., Google Search), making large-scale synthesis expensive, unstable over time, and difficult to reproduce. Furthermore, these live environments hinder controlled analysis, as internal search events depend on a constantly changing web.
This work addresses the central question: How can we synthesize high-quality, long-horizon deep research trajectories in a scalable, low-cost, reproducible, and analytically useful manner?
The proposed solution, OpenResearcher, is built on two core ideas:
- Decouple corpus construction from trajectory generation: Perform a one-time online bootstrapping step to seed answer-supporting ("gold") documents, build a large offline corpus with distractors, and then run the multi-turn synthesis loop entirely locally.
- Model browsing explicitly with minimal primitives: Use the tools `search`, `open`, and `find` to teach the model not just what to retrieve, but how to inspect documents and localize specific evidence, mirroring human research behavior.
The interaction process is formalized following a ReAct-style paradigm. Given a query $q$, system prompt $p$, and tool metadata $M$, the model generates a trajectory $\tau$ as a sequence of reasoning–action–observation triplets:

$$\tau = (t_1, a_1, o_1, \dots, t_n, a_n, o_n, \text{ans}),$$

where $t_i$, $a_i$, and $o_i$ denote the reasoning chain, action (tool call), and observation at step $i$, respectively, and $\text{ans}$ is the final answer. The agent's policy $\pi$ generates the current thought and action based on the history:

$$(t_i, a_i) \sim \pi(\cdot \mid q, p, M, H_{i-1}).$$

The environment executes the action and returns an observation:

$$o_i = \text{Env}(a_i),$$

updating the trajectory as $H_i = H_{i-1} \cup \{(t_i, a_i, o_i)\}$. This loop continues until termination.
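This thought–action–observation loop can be sketched in plain Python. Here `teacher_step` and `execute_tool` are hypothetical stand-ins for the teacher policy and the offline environment (the paper's actual components are GPT-OSS-120B and the FAISS-backed search engine); only the control flow reflects the formalization above.

```python
def teacher_step(query, history):
    """Hypothetical policy: return (thought, action) given the history.

    A real implementation would prompt the teacher model; this toy
    version searches twice, then answers.
    """
    if len(history) < 2:
        return "I should search for more evidence.", ("search", query)
    return "I have enough evidence to answer.", ("answer", "final answer")

def execute_tool(action):
    """Hypothetical environment: execute an action, return an observation."""
    name, arg = action
    return f"results for {arg!r}" if name == "search" else ""

def synthesize_trajectory(query, max_turns=100):
    """Run the thought -> action -> observation loop until termination."""
    history = []
    for _ in range(max_turns):
        thought, action = teacher_step(query, history)
        if action[0] == "answer":           # terminal action ends the loop
            history.append((thought, action, None))
            return history
        observation = execute_tool(action)  # o_i = Env(a_i)
        history.append((thought, action, observation))
    return history
```

With the toy policy, each trajectory contains two search turns followed by a terminal answer turn.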
Methodology
The OpenResearcher pipeline consists of three main stages, as illustrated in Figure 2 of the paper.
1. QA Question Collection
To ensure questions require deep research, the authors select challenging, long-horizon queries from MiroVerse-v0.1, a dataset requiring multi-hop reasoning over heterogeneous evidence. A random 10% sample yields roughly 6K question-answer (QA) pairs. Answers are normalized into concise, verifiable forms. Existing MiroVerse trajectories are not used; all trajectories are regenerated from scratch from the clean QA pairs alone.
2. Offline Search Engine Construction
To make trajectory synthesis meaningful, the relevant evidence must be retrievable. This is ensured via a one-time, coverage-oriented bootstrapping process.
- Gold Document Retrieval: For each of the 6K QA pairs, a search query is constructed by concatenating the question and reference answer. Web content is retrieved via the Serper API, cleaned, and deduplicated, yielding ~10K "gold" documents that contain sufficient evidence for the answer.
- Offline Corpus Construction: To simulate web-scale complexity, 15 million documents (~10 trillion tokens) from FineWeb are merged with the gold documents. The FineWeb documents act as realistic distractors.
- Corpus Indexing: Each document is embedded using Qwen3-Embedding-8B and indexed with FAISS for efficient dense retrieval, simulating a web search API.
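The retrieval core of this offline search engine can be sketched as normalized-embedding inner-product search. The NumPy code below is a stand-in: random vectors replace Qwen3-Embedding-8B outputs, and the brute-force matrix product plays the role that FAISS's `IndexFlatIP` fills at 15M-document scale.

```python
import numpy as np

# Toy corpus embeddings; in the paper these come from Qwen3-Embedding-8B.
rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 64)).astype("float32")
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit-normalize so that
                                                     # inner product = cosine

def dense_search(query_vec, k=5):
    """Return indices and scores of the top-k documents by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = docs @ q                    # one inner product per document
    top = np.argsort(-scores)[:k]        # highest-scoring k documents
    return top, scores[top]

ids, scores = dense_search(docs[0])      # query with a known document
```

Because document 0 is its own nearest neighbor under cosine similarity, it should rank first with a score of ~1.0.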
3. Explicit Browser Primitives for Trajectory Synthesis
Instead of treating search as simple retrieval, OpenResearcher models browsing with three explicit tools (Figure 3):
- `search`: Returns top-K results (title, URL, snippet) for a query. Enables broad information retrieval.
- `open`: Fetches the full content of a document from a URL. Mirrors clicking into a webpage.
- `find`: Locates exact string matches within an opened document. Critical for named-entity lookup and factual verification.
These tools enable progressive focus: corpus → document → evidence.
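A minimal sketch of the three primitives over an in-memory corpus is below. The naive keyword-overlap scoring and the tiny `CORPUS` dictionary are illustrative assumptions (the paper uses dense retrieval over the FAISS index), and `open` is renamed `open_doc` to avoid shadowing Python's builtin.

```python
# Hypothetical two-document corpus: url -> (title, body).
CORPUS = {
    "doc://1": ("Alan Turing", "Alan Turing proposed the Turing test in 1950."),
    "doc://2": ("FAISS", "FAISS is a library for similarity search."),
}

def search(query, k=5):
    """Return (url, title, snippet) for the top-k matching documents.

    Scoring here is keyword overlap; the paper scores with dense embeddings.
    """
    terms = set(query.lower().split())
    def score(item):
        _, (title, body) = item
        return len(terms & set((title + " " + body).lower().split()))
    ranked = sorted(CORPUS.items(), key=score, reverse=True)[:k]
    return [(url, title, body[:60]) for url, (title, body) in ranked]

def open_doc(url):
    """Fetch the full content of a document, like clicking into a page."""
    title, body = CORPUS[url]
    return f"{title}\n\n{body}"

def find(url, needle):
    """Locate exact string matches of `needle` inside an opened document."""
    text = open_doc(url)
    return [i for i in range(len(text)) if text.startswith(needle, i)]
```

The progressive narrowing is visible in the return types: `search` yields snippets, `open_doc` a full document, and `find` character offsets of the exact evidence string.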
4. Trajectory Generation Procedure
Using GPT-OSS-120B as the teacher model, integrated with the three browser tools and the offline search engine, trajectories are synthesized. The model is prompted to:
- Use only the provided tools.
- Reason step-by-step before each tool call.
- Terminate only when confident in a final answer.

The teacher model does not have access to the reference answer during generation.
Lightweight filtering removes malformed or overly long trajectories. This process yields over 97,000 trajectories with a broad range of reasoning horizons.
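The lightweight filter can be sketched as a simple predicate over each trajectory. The dictionary schema and the 256-call cap are assumptions for illustration; the paper states only that malformed or overly long trajectories are removed.

```python
def keep_trajectory(traj, max_calls=256):
    """Keep a trajectory only if it is well-formed and not overly long.

    `traj` is assumed to be {"final_answer": str, "steps": [dict, ...]},
    where each step carries a thought, an action, and an observation.
    The max_calls threshold is a placeholder, not the paper's value.
    """
    has_answer = bool(traj.get("final_answer"))
    well_formed = all(
        {"thought", "action", "observation"} <= step.keys()
        for step in traj.get("steps", [])
    )
    return has_answer and well_formed and len(traj["steps"]) <= max_calls
```

Applied over the raw synthesis output, a filter of this shape yields the 97K+ retained trajectories.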
Empirical Validation / Results
Experimental Setup
- Training: The base model Nemotron-3-Nano-30B-A3B is supervised fine-tuned (SFT) on a curated subset of ~55K trajectories that yield correct final answers (rejection sampling). Training uses Megatron-LM on 8 H100 GPUs for ~8 hours. Sequences are pre-packed to a 256K token context to preserve full reasoning chains.
- Evaluation: Benchmarks include:
- Closed-web: BrowseComp-Plus (uses the official offline corpus).
- Open-web: BrowseComp, GAIA, xbench-DeepSearch (use the Serper API for live search).
- Baselines: Compared against proprietary foundation models (GPT-4.1, Claude-4-Opus, etc.) and open-source deep research agents (Tongyi DeepResearch, ASearcher, etc.).
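The pre-packing step in the training setup above can be sketched as bin-packing trajectory token counts into fixed 256K-token sequences without ever splitting a trajectory, so full reasoning chains survive intact. The greedy first-fit strategy below is an assumption; the paper states only that sequences are pre-packed.

```python
def pack_sequences(lengths, context=262_144):
    """Greedy first-fit packing of trajectory token lengths into bins.

    Each bin holds whole trajectories whose lengths sum to <= context,
    preserving every reasoning chain unbroken. Returns a list of bins,
    each a list of trajectory indices.
    """
    bins = []   # trajectory indices per packed sequence
    free = []   # remaining token capacity per packed sequence
    for i, n in enumerate(lengths):
        for b, cap in enumerate(free):
            if n <= cap:                 # fits in an existing sequence
                bins[b].append(i)
                free[b] -= n
                break
        else:                            # open a new 256K sequence
            bins.append([i])
            free.append(context - n)
    return bins
```

For example, trajectories of 200K, 60K, and 100K tokens pack into two sequences: the first two share a bin, the third starts a new one.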
Main Results
Table 1: Performance comparison on Deep Research benchmarks.
| METHODS | BrowseComp-Plus |
|---|---|
| Foundation Models with Tools | |
| GPT-4.1 | 36.4 |
| Claude-4-Opus | 36.8 |
| Gemini-2.5-Pro | 29.5 |
| Kimi-K2 | 35.4 |
| DeepSeek-R1 | 16.4 |
| Nemotron-3-Nano | 20.8 |
| Deep Research Agents | |
| Tongyi DeepResearch | 44.5 |
| CutBill-30B-A3B | 30.3 |
| Ours | |
| OpenResearcher | 54.8 |
| METHODS | BrowseComp | GAIA | xbench |
|---|---|---|---|
| Foundation Models with Tools | |||
| OpenAI o4-mini | 28.3 | 55.8 | 67.0 |
| Claude-4-Sonnet | 12.2 | 57.6 | 64.0 |
| Kimi-K2 | 14.1 | 57.7 | 50.0 |
| DeepSeek-R1 | 8.9 | 30.3 | 55.0 |
| Nemotron-3-Nano | 10.6 | 50.5 | 55.0 |
| Deep Research Agents | |||
| ASearcher-QwQ-32B | 5.2 | 52.8 | 42.0 |
| WebDancer-QwQ-32B | 3.8 | 51.5 | 39.0 |
| WebSailor-72B | 12.0 | 55.4 | 55.0 |
| DeepMiner-32B | 21.2 | 54.4 | 53.0 |
| Ours | |||
| OpenResearcher | 26.3 | 64.1 | 65.0 |
Key Insights:
- On BrowseComp-Plus, OpenResearcher achieves 54.8% accuracy, a +34.0 point improvement over the base Nemotron-3-Nano model (20.8%), and outperforms strong proprietary baselines.
- On open-web benchmarks, the model generalizes effectively without any live-web training, remaining competitive with frontier models and substantially outperforming existing open-source deep research agents.
In-Depth Analysis of Synthesized Trajectories
Table 2: Statistics of synthesized trajectories.
| Metric | Success | Failure | All |
|---|---|---|---|
| Rate | 56.7% | 43.3% | 100% |
| Avg. tool calls | 38.4 | 71.7 | 52.8 |
| Avg. searches | 22.1 | 48.8 | 33.6 |
| Max tool calls | 172 | 185 | 185 |
| Max searches | 109 | 119 | 119 |
- Failed trajectories require nearly twice as many tool calls on average (71.7 vs. 38.4), indicating failure stems from inefficient or misdirected search strategies, not insufficient exploration.
- The excess calls are primarily driven by `search` operations (48.8 vs. 22.1 on average), suggesting successful trajectories converge on relevant documents earlier.
- Pass@k analysis shows solution diversity: Pass@1 is 0.567, rising to 0.792 at Pass@16. The solve-rate distribution is bimodal, with ~20% of questions near 0% pass rate (extremely hard) and ~30% near 100% (robustly solvable).
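Pass@k figures like those above are conventionally computed with the unbiased estimator $1 - \binom{n-c}{k}/\binom{n}{k}$, where $n$ is the number of sampled trajectories per question and $c$ the number that are correct. Whether the paper uses exactly this estimator is not stated; a minimal implementation:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k trajectories
    drawn (without replacement) from n samples, c of them correct,
    contains a correct answer."""
    if n - c < k:          # too few failures to fill k draws: certain success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 8 of 16 samples correct, pass@1 is 0.5; as k approaches n, pass@k approaches 1 whenever any sample is correct.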
Cost Efficiency
Table 3: Estimated cost breakdown comparison for synthesizing 5.76M search requests.
| Method | Price per 1K requests | Total Cost |
|---|---|---|
| Serper API | $1 | $5,760 |
| SerpAPI | $5 | $28,800 |
| Offline retriever (ours) | $0 | $0 |
The offline design offers major cost savings, no rate limits, deterministic behavior, and no dependency on proprietary infrastructure.
Theoretical and Practical Implications
Ablation Studies and Research Questions (RQ)
The controlled offline environment enables targeted analyses that yield practical insights for deep research pipeline design.
RQ1: Is final-answer correctness a necessary filtering signal for trajectory SFT?
- Finding: No. Training on correct-only, incorrect-only, and all trajectories yielded nearly identical downstream accuracy on BrowseComp-Plus (all within 0.6 points).
- Implication: Even failed trajectories provide valuable supervision about search structure, tool-use order, and stopping behavior.
RQ2: Is one-time online bootstrapping for corpus coverage necessary?
- Finding: Yes, it is essential. Removing gold documents from the corpus caused a severe collapse in performance.
- Gold-document hit rate dropped from 29.54% to 1.73%.
- Trajectory accuracy dropped from 56.86% to 43.81%.
- Downstream BrowseComp-Plus accuracy collapsed from 54.81% to 6.35%.
RQ3: How much turn budget is enough?
- Finding: Both accuracy and gold hit rate improve steadily with increased turn budget but begin to plateau beyond roughly 100 turns, indicating diminishing returns after sufficient opportunity to locate evidence.
RQ4: Do explicit browser tools matter? Table 5 (Left): Browser Tool Ablation on BrowseComp-Plus.
| Tools | Acc. ↑ | Gold Hit ↑ | 1st Hit ↓ | Calls ↓ | AvgTok ↓ |
|---|---|---|---|---|---|
| Search only | 43.86 | 1.45 | 41.00 | 70.57 | 80511.69 |
| Search + Open | 56.39 | 51.20 | 20.60 | 53.56 | 58094.04 |
| All three tools | 62.17 | 53.37 | 17.23 | 49.97 | 52248.64 |
- Finding: Yes, they are critical. The full suite (`search` + `open` + `find`) performs best. Adding `open` provides the largest jump, making evidence access reliable. Adding `find` further improves accuracy, brings the first gold hit earlier, and reduces tool calls and token usage.
RQ5: Does retrieving gold documents guarantee a correct final answer? Table 5 (Right): Gold hit vs. correctness.
| Statistic | Value (%) |
|---|---|
| P(correct \| search-hit) | 61.84 |
| P(correct \| open-hit) | 86.72 |
| P(search-hit \| correct) | 99.38 |
| P(open-hit \| correct) | 95.01 |
- Finding: No. Merely surfacing a gold document in search results (a search-hit) is a weaker predictor of correctness (61.84%) than explicitly opening it (an open-hit, 86.72%).
- Implication: Evidence exposure is necessary for success (99.38% of correct trajectories have a search-hit) but not sufficient. The gap between search-hit and open-hit probabilities highlights the distinction between retrieval failure and reasoning failure.
Conclusion
OpenResearcher presents a reproducible pipeline for synthesizing long-horizon deep research trajectories by moving the expensive search-and-browse loop to a controllable offline environment. Its explicit browser abstraction (search, open, find) effectively models realistic information-seeking behavior.
The synthesized trajectories prove highly effective for post-training open-weight agents, as demonstrated by strong performance on both closed-corpus and live-web benchmarks. The analyses provide valuable insights into deep research pipeline design, clarifying the roles of data filtering, corpus coverage, agent configuration, and the relationship between retrieval success and final answer accuracy.
Future directions include exploring more advanced training algorithms (e.g., reinforcement learning) on top of the SFT foundation, scaling the corpus and trajectory diversity further, and applying the pipeline to other domains requiring long-horizon, tool-augmented reasoning.