# OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
## Summary (Overview)
- OpenSeeker is the first fully open-source search agent (model and data) that achieves frontier-level performance through strategic data synthesis.
- Core Innovations: Two technical methods: (1) Fact-grounded scalable controllable QA synthesis that reverse-engineers web graphs to generate complex multi-hop reasoning tasks, and (2) Denoised trajectory synthesis that uses retrospective summarization to produce high-quality actions.
- Performance: Trained on only 11.7k synthesized samples via simple SFT, OpenSeeker achieves state-of-the-art results on multiple benchmarks (BrowseComp: 29.5%, BrowseComp-ZH: 48.4%, xbench-DeepSearch: 74.0%, WideSearch: 59.4%), surpassing industrial competitors like Tongyi DeepResearch.
- Democratization: The work aims to break the corporate "data moat" by fully open-sourcing the complete training dataset and model weights to foster transparent research.
- Academic Achievement: This represents the first work by a purely academic team to achieve SOTA performance on frontier search benchmarks while fully open-sourcing training data.
## Introduction and Theoretical Foundation
The paper addresses a critical problem in AI research: the development of high-performance search agents has been dominated by industrial giants due to a lack of transparent, high-quality training data. This data scarcity has hindered progress in the broader research community. While corporate entities have produced capable proprietary agents (OpenAI Deep Research, Kimi-Researcher, Gemini Deep Research) and some have released open-weight models (Kimi K2 series, GLM, MiniMax M2), none have disclosed their training data, creating a "data moat."
The authors argue that to train effective deep search agents, two pivotal challenges must be addressed:
- High-difficulty QA: Only sufficiently complex queries compel the system to engage in rigorous multi-turn interaction cycles ("Reasoning → Tool Call → Tool Response"), generating long-horizon trajectories.
- High-quality trajectories: Synthesis of solution paths must rely on stable methods to ensure training signals represent "correct and generalizable" strategies.
OpenSeeker is introduced as a solution to democratize frontier search intelligence by providing the complete synthesis pipeline and high-fidelity training data.
## Methodology
The methodology consists of two core technical innovations:
### 1. Fact-Grounded Scalable Controllable QA Synthesis
This framework constructs question–answer pairs $(q, a)$ directly from the web graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes web pages and $\mathcal{E}$ denotes hyperlinks. The pipeline operates in two phases:
**Generative Construction:**
- Graph Expansion: Samples a seed node $v_0 \in \mathcal{V}$ and traverses outgoing edges to gather connected nodes, forming a local dependency subgraph $\mathcal{G}_{\text{sub}}$.
- Entity Extraction: Identifies the central theme of $\mathcal{G}_{\text{sub}}$ and distills key entities from the pages in $\mathcal{G}_{\text{sub}}$ into a condensed Entity Subgraph $\mathcal{G}_E$, removing textual noise.
- Question Generation: Uses a generator to synthesize an initial question $q_0$ conditioned on $\mathcal{G}_E$, imposing a hard structural constraint that deriving the answer $a$ must necessitate traversing multiple edges within $\mathcal{G}_E$.
- Entity Obfuscation: Applies an obfuscation operator $\mathcal{O}$ to entity nodes $e_i$, mapping them to vague references $\tilde{e}_i$ and yielding a Fuzzy Entity Subgraph $\tilde{\mathcal{G}}_E$.
- Question Obfuscation: Generates the final question $q$ by rewriting $q_0$ to incorporate ambiguous descriptions from $\tilde{\mathcal{G}}_E$ while preserving the reasoning logic.
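The generative-construction phase above can be sketched in code. This is an illustrative toy implementation, not the paper's pipeline: the function names (`expand_subgraph`, `obfuscate`), the toy web graph, and the LLM-free obfuscation stub are all assumptions for demonstration.

```python
def expand_subgraph(web_graph, seed, k):
    """Graph expansion: breadth-first traversal of outgoing hyperlinks from a
    seed page until a local dependency subgraph of k nodes is collected."""
    visited, frontier = {seed}, [seed]
    while frontier and len(visited) < k:
        page = frontier.pop(0)
        for neighbor in web_graph.get(page, []):
            if neighbor not in visited and len(visited) < k:
                visited.add(neighbor)
                frontier.append(neighbor)
    return visited

def obfuscate(entity):
    """Entity obfuscation stub: map a concrete entity to a vague reference.
    The real pipeline would use an LLM; this only abstracts the type prefix."""
    return f"a certain {entity.split(':')[0].lower()}"

# Toy web graph: page -> outgoing hyperlinks.
web_graph = {
    "Person:Ada Lovelace": ["Work:Analytical Engine notes", "Person:Charles Babbage"],
    "Person:Charles Babbage": ["Work:Difference Engine"],
}

subgraph = expand_subgraph(web_graph, "Person:Ada Lovelace", k=3)
fuzzy = {e: obfuscate(e) for e in subgraph}  # Fuzzy Entity Subgraph analogue
```

The subgraph size `k` is the same knob the paper uses to calibrate question difficulty: larger subgraphs force questions whose answers span more hops.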
**Dual-Criteria Verification via Rejection Sampling:** Two indicator functions filter synthesized pairs $(q, a)$:
- Criterion 1 (Difficulty): $\mathcal{M}(q) \neq a$, where $\mathcal{M}$ is a strong foundation model queried in a closed-book setting. If $\mathcal{M}$ answers correctly, the question is discarded as too easy.
- Criterion 2 (Solvability): $\mathcal{M}(q \mid \mathcal{G}_{\text{sub}}) = a$, where the model is provided the source subgraph as context. If $\mathcal{M}$ fails even with the evidence, the sample is rejected as unsolvable.
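A minimal sketch of this dual-criteria rejection-sampling filter, with stub lambdas standing in for the strong foundation model (all names here are hypothetical, not the paper's implementation):

```python
def rejection_sample(candidates, ask_closed_book, ask_with_context):
    """Dual-criteria verification: keep only (q, a) pairs that are both hard
    (the model fails closed-book) and solvable (it succeeds when given the
    source subgraph as context)."""
    kept = []
    for q, a, subgraph in candidates:
        if ask_closed_book(q) == a:             # Criterion 1: too easy, discard
            continue
        if ask_with_context(q, subgraph) != a:  # Criterion 2: unsolvable, reject
            continue
        kept.append((q, a, subgraph))
    return kept

# Stub calls standing in for a strong foundation model.
closed_book = lambda q: "unknown"               # always fails without context
with_context = lambda q, g: g.get(q, "unknown") # reads the answer off the subgraph

candidates = [
    ("q1", "a1", {"q1": "a1"}),  # hard and solvable -> kept
    ("q2", "a2", {}),            # unsolvable -> rejected
]
kept = rejection_sample(candidates, closed_book, with_context)
```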
This paradigm offers three strengths:
- Factual grounding: Anchored in real web topology, mitigating hallucination.
- Scalability: Leverages TB-scale web archives as an inexhaustible source.
- Controllability: Difficulty is calibrated by tuning the subgraph size $|\mathcal{G}_{\text{sub}}|$.
### 2. Denoised Trajectory Synthesis
This method synthesizes high-quality search trajectories $\tau = \{(r_t, a_t, o_t)\}_{t=1}^{T}$, where $r_t$ is the reasoning, $a_t$ is the action (tool call), and $o_t$ is the observation (tool response).
**Synthesis via Dynamic Context Denoising:** The context construction follows a "Summarized History + Raw Recent" protocol:

$$c_t = \big(q,\; s_1, \ldots, s_{t-2},\; o_{t-1}\big),$$

where $s_i$ is a compressed summary of observation $o_i$, and the most recent observation $o_{t-1}$ is kept raw.
The mechanism operates in a two-phase cycle:
- Decision phase: The agent generates $(r_t, a_t)$ based on $c_t$, which includes the full raw observation $o_{t-1}$.
- Compression phase: After step $t$, the system retrospectively compresses $o_t$ into the summary $s_t$, which replaces $o_t$ in the long-term history once a newer observation arrives.
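The two-phase cycle can be sketched as below. This is an illustrative loop under stated assumptions: the `summarize` stub stands in for an LLM-based compressor, and the agent/tool lambdas are placeholders, not the paper's implementation.

```python
def summarize(observation):
    """Retrospective compression stub; the real pipeline would use an LLM."""
    return observation[:10] + "..."

def run_episode(question, agent_step, tool, num_steps):
    """'Summarized History + Raw Recent': older observations live in the
    context as summaries; only the newest observation stays raw."""
    history = [f"question: {question}"]
    trajectory = []
    for t in range(num_steps):
        # Decision phase: act on the current context (summaries + raw recent).
        reasoning, action = agent_step(list(history))
        observation = tool(action)
        trajectory.append((reasoning, action, observation))
        # Compression phase: the previously raw observation is retrospectively
        # compressed and replaced in long-term history; the new one stays raw.
        if len(history) > 1:
            history[-1] = summarize(history[-1])
        history.append(observation)
    return trajectory, history

# Stub agent and tool standing in for the real LLM and search backend.
agent = lambda ctx: ("think about it", f"search@{len(ctx)}")
tool = lambda action: f"RAW_RESULT_FOR_{action}"

trajectory, history = run_episode("who wrote X?", agent, tool, num_steps=3)
```

After the run, every observation in `history` except the last is a compressed summary, mirroring the "Summarized History + Raw Recent" context shape.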
**Asymmetric Context Training for Robust Denoising:**
- Synthesis data (Teacher): uses the clean, denoised context containing summaries.
- Training data (Student): uses the noisy context built from raw observations. The student is supervised to predict the optimal $(r_t, a_t)$ given this noisy context, forcing it to learn denoising capabilities.
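A hedged sketch of how one such asymmetric training example might be assembled; the helper `build_sft_sample` and its field names are hypothetical, intended only to show the clean-context/noisy-context pairing.

```python
def build_sft_sample(question, raw_observations, summaries, t, target_action):
    """Asymmetric context training (illustrative): the target action was
    synthesized by the teacher under the clean, summarized context, but the
    student is trained to reproduce it from the raw, noisy context."""
    teacher_context = [question] + summaries[:t]         # what the teacher saw
    student_context = [question] + raw_observations[:t]  # what the student sees
    return {
        "input": student_context,        # noisy context the student must denoise
        "target": target_action,         # teacher-produced (r_t, a_t)
        "teacher_context": teacher_context,
    }

raw = ["long noisy page dump 1", "long noisy page dump 2"]
clean = ["summary 1", "summary 2"]
sample = build_sft_sample("who wrote X?", raw, clean, t=2,
                          target_action="search('author of X')")
```

The asymmetry is the point: because the supervision target came from a clean context but the input is noisy, the gradient pushes the student to filter the noise itself.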
## Empirical Validation / Results
**Experimental Setup:**
- Model: OpenSeeker initialized from Qwen3-30B-A3B-Thinking-2507 (30B total parameters, 3B activated).
- Training: Single SFT run on 11.7k samples (10.3k English, 1.4k Chinese) without heuristic filtering or hyperparameter optimization.
- Benchmarks: BrowseComp, BrowseComp-ZH, xbench-DeepSearch, WideSearch.
**Key Results:**
Table 1: Comparisons among OpenSeeker and other search agents
| Model Name | # Samples | # OS Samples | Training | Academic | BrowseComp | BrowseComp-ZH | xbench | WideSearch |
|---|---|---|---|---|---|---|---|---|
| OpenSeeker-v1-30B-SFT | 11.7k | 11.7k | SFT | ✓ | 29.5% | 48.4% | 74.0% | 59.4% |
| DeepDive-32B | 4.1k | 4.1k | SFT+RL | × | 15.3% | 29.7% | 51.8% | - |
| MiroThinker-32B-v0.1 | 147k | 147k | SFT | × | 10.6% | 13.8% | - | - |
| WebSailor-V2-30B-SFT | ? | 0 | SFT | × | 24.4% | 28.3% | 61.7% | - |
| WebLeaper-30B | 15k | 0 | SFT | × | 27.7% | - | 66.0% | 44.1% |
| Tongyi DeepResearch | ? | 0 | CPT+SFT+RL | × | 43.4% | 46.7% | 75.0% | - |
| OpenAI-o3 | ? | 0 | ? | × | 49.1% | 68.7% | - | 60.0% |
Table 2: Performance comparison of different models trained via SFT
| Model | # Samples | # OS Samples | Academic | BrowseComp | BrowseComp-ZH | xbench | WideSearch-EN |
|---|---|---|---|---|---|---|---|
| OpenSeeker-v1-30B-SFT | 11.7k | 11.7k | ✓ | 29.5% | 48.4% | 74.0% | 59.4% |
| DeepDive-32B | 4.1k | 4.1k | × | 9.5% | 23.0% | 48.5% | - |
| MiroThinker-32B-v0.1 | 147k | 147k | × | 10.6% | 13.8% | - | - |
| WebSailor-V2-30B | ? | 0 | × | 24.4% | 28.3% | 61.7% | - |
| WebLeaper-30B | 15k | 0 | × | 27.7% | - | 66.0% | 44.1% |
Table 3: Performance comparison under comparable data volumes
| Data | # Samples | # OS Samples | Developer | BrowseComp | xbench | WideSearch-EN |
|---|---|---|---|---|---|---|
| OpenSeeker-v1-Data-11.7k | 11.7k | 11.7k | Academic | 29.50% | 74.00% | 59.40% |
| WebSailor-V2-10k | 10k | 0 | Tongyi | 24.50% | 62.67% | 38.91% |
| WebSailor-V2-5k + WebLeaper-Basic-5k | 10k | 0 | Tongyi | 20.67% | 58.33% | 32.26% |
| WebSailor-V2-5k + WebLeaper-Union-5k | 10k | 0 | Tongyi | 27.50% | 62.33% | 41.70% |
| WebSailor-V2-5k + WebLeaper-Reverse-Union-10k | 15k | 0 | Tongyi | 27.67% | 66.00% | 44.07% |
**Key Findings:**
- Outperforming resource-intensive baselines: OpenSeeker achieves 48.4% on BrowseComp-ZH, surpassing Tongyi DeepResearch (46.7%) which uses CPT+SFT+RL.
- Superior performance under identical SFT setup: Among ~30B models trained only with SFT, OpenSeeker outperforms WebSailor-V2 by roughly 20 percentage points on BrowseComp-ZH (48.4% vs. 28.3%).
- Superior performance with comparable data volume: With a comparable sample budget (11.7k vs. 10k–15k for the baseline combinations), OpenSeeker outperforms the best baselines by 8 percentage points on xbench (74.0% vs. 66.0%) and roughly 15 points on WideSearch (59.4% vs. 44.1%).
- Data difficulty analysis: The synthesized Chinese data averages 46.35 tool calls and 76.1k tokens per trajectory, significantly more complex than BrowseComp-ZH (26.98 tool calls, 15.1k tokens).
## Theoretical and Practical Implications
- Breaking Corporate Data Monopoly: OpenSeeker dismantles the "data moat" held by industrial corporations, providing the academic community with resources to replicate industrial-grade capabilities.
- Data Quality over Quantity: The results demonstrate that high-fidelity, complex data (even with limited volume) is more effective than large volumes of lower-quality data.
- Democratization of Frontier AI: By fully open-sourcing the synthesis pipeline, training dataset, and model weights, this work fosters a more inclusive, transparent, and collaborative ecosystem for search agent research.
- Methodological Advancements: The fact-grounded QA synthesis and denoised trajectory synthesis techniques provide scalable, controllable frameworks for generating high-quality training data for complex reasoning tasks.
## Conclusion
OpenSeeker represents a significant breakthrough in democratizing frontier search agent development. Through two innovative data synthesis methods, it produces high-fidelity training data that enables state-of-the-art performance with only 11.7k samples and simple SFT. The work:
- Achieves competitive results against industrial models trained with extensive resources.
- Demonstrates that data quality is paramount over quantity.
- Fully open-sources all components (data, model, pipeline) to break down barriers.
- Provides a foundation for future research to optimize data distributions, implement quality filtering, and generate even more complex data.
The authors emphasize that their current results represent a lower bound due to resource constraints (single training run, no hyperparameter optimization), leaving substantial room for future improvement. OpenSeeker aims to catalyze a more open, collaborative development of autonomous agents.