# OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
## Summary (Overview)
- OpenSeeker is the first fully open-source search agent (model and data) that achieves frontier-level performance through strategic data synthesis.
- Core Innovations: Two technical methods: (1) Fact-grounded scalable controllable QA synthesis that reverse-engineers web graphs to generate complex multi-hop reasoning tasks, and (2) Denoised trajectory synthesis that uses retrospective summarization to produce high-quality actions.
- Performance: Trained on only 11.7k synthesized samples via simple SFT, OpenSeeker achieves state-of-the-art results on multiple benchmarks (BrowseComp: 29.5%, BrowseComp-ZH: 48.4%, xbench-DeepSearch: 74.0%, WideSearch: 59.4%), surpassing industrial competitors like Tongyi DeepResearch.
- Democratization: The work aims to break the corporate "data moat" by fully open-sourcing the complete training dataset and model weights to foster transparent research.
- Academic Achievement: This represents the first work by a purely academic team to achieve SOTA performance on frontier search benchmarks while fully open-sourcing training data.
## Introduction and Theoretical Foundation
The paper addresses a critical problem in AI research: the development of high-performance search agents has been dominated by industrial giants due to a lack of transparent, high-quality training data. This data scarcity has hindered progress in the broader research community. While corporate entities have produced capable proprietary agents (OpenAI Deep Research, Kimi-Researcher, Gemini Deep Research) and some have released open-weight models (Kimi K2 series, GLM, MiniMax M2), none have disclosed their training data, creating a "data moat."
The authors argue that to train effective deep search agents, two pivotal challenges must be addressed:
- High-difficulty QA: Only sufficiently complex queries compel the system to engage in rigorous multi-turn interaction cycles ("Reasoning → Tool Call → Tool Response"), generating long-horizon trajectories.
- High-quality trajectories: Synthesis of solution paths must rely on stable methods to ensure training signals represent "correct and generalizable" strategies.
OpenSeeker is introduced as a solution to democratize frontier search intelligence by providing the complete synthesis pipeline and high-fidelity training data.
## Methodology
The methodology consists of two core technical innovations:
### 1. Fact-Grounded Scalable Controllable QA Synthesis
This framework constructs question–answer pairs $(q, a)$ directly from the web graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes web pages and $\mathcal{E}$ denotes hyperlinks. The pipeline operates in two phases:
**Generative Construction:**
- Graph Expansion: Samples a seed node $v_0 \in \mathcal{V}$ and traverses outgoing edges to gather connected nodes, forming a local dependency subgraph $\mathcal{G}_{\text{sub}}$.
- Entity Extraction: Identifies the central theme of $\mathcal{G}_{\text{sub}}$ and distills key entities from the pages in $\mathcal{G}_{\text{sub}}$ into a condensed Entity Subgraph $\mathcal{G}_E$, removing textual noise.
- Question Generation: Uses a generator to synthesize an initial question $q_0$ conditioned on $\mathcal{G}_E$, imposing a hard structural constraint that deriving the answer $a$ must necessitate traversing multiple edges within $\mathcal{G}_E$.
- Entity Obfuscation: Applies an obfuscation operator $\mathcal{O}$ to entity nodes $e_i$, mapping them to vague references $\tilde{e}_i$ and yielding a Fuzzy Entity Subgraph $\tilde{\mathcal{G}}_E$.
- Question Obfuscation: Generates the final question $q$ by rewriting $q_0$ to incorporate ambiguous descriptions from $\tilde{\mathcal{G}}_E$ while preserving the reasoning logic.
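The generative-construction phase above can be sketched in code. This is an illustrative toy implementation, not the paper's pipeline: the function names (`expand_subgraph`, `obfuscate`), the toy web graph, and the LLM-free obfuscation stub are all assumptions for demonstration.

```python
def expand_subgraph(web_graph, seed, k):
    """Graph expansion: breadth-first traversal of outgoing hyperlinks from a
    seed page until a local dependency subgraph of k nodes is collected."""
    visited, frontier = {seed}, [seed]
    while frontier and len(visited) < k:
        page = frontier.pop(0)
        for neighbor in web_graph.get(page, []):
            if neighbor not in visited and len(visited) < k:
                visited.add(neighbor)
                frontier.append(neighbor)
    return visited

def obfuscate(entity):
    """Entity obfuscation stub: map a concrete entity to a vague reference.
    The real pipeline would use an LLM; this only abstracts the type prefix."""
    return f"a certain {entity.split(':')[0].lower()}"

# Toy web graph: page -> outgoing hyperlinks.
web_graph = {
    "Person:Ada Lovelace": ["Work:Analytical Engine notes", "Person:Charles Babbage"],
    "Person:Charles Babbage": ["Work:Difference Engine"],
}

subgraph = expand_subgraph(web_graph, "Person:Ada Lovelace", k=3)
fuzzy = {e: obfuscate(e) for e in subgraph}  # Fuzzy Entity Subgraph analogue
```

The subgraph size `k` is the same knob the paper uses to calibrate question difficulty: larger subgraphs force questions whose answers span more hops.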
**Dual-Criteria Verification via Rejection Sampling:** Two indicator functions filter synthesized pairs $(q, a)$:
- Criterion 1 (Difficulty): $\mathcal{M}(q) \neq a$, where $\mathcal{M}$ is a strong foundation model queried in a closed-book setting. If $\mathcal{M}$ answers correctly, the question is discarded as too easy.
- Criterion 2 (Solvability): $\mathcal{M}(q \mid \mathcal{G}_{\text{sub}}) = a$, where the model is provided the source subgraph as context. If $\mathcal{M}$ fails even with the evidence, the sample is rejected as unsolvable.
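A minimal sketch of this dual-criteria rejection-sampling filter, with stub lambdas standing in for the strong foundation model (all names here are hypothetical, not the paper's implementation):

```python
def rejection_sample(candidates, ask_closed_book, ask_with_context):
    """Dual-criteria verification: keep only (q, a) pairs that are both hard
    (the model fails closed-book) and solvable (it succeeds when given the
    source subgraph as context)."""
    kept = []
    for q, a, subgraph in candidates:
        if ask_closed_book(q) == a:             # Criterion 1: too easy, discard
            continue
        if ask_with_context(q, subgraph) != a:  # Criterion 2: unsolvable, reject
            continue
        kept.append((q, a, subgraph))
    return kept

# Stub calls standing in for a strong foundation model.
closed_book = lambda q: "unknown"               # always fails without context
with_context = lambda q, g: g.get(q, "unknown") # reads the answer off the subgraph

candidates = [
    ("q1", "a1", {"q1": "a1"}),  # hard and solvable -> kept
    ("q2", "a2", {}),            # unsolvable -> rejected
]
kept = rejection_sample(candidates, closed_book, with_context)
```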
This paradigm offers three strengths:
- Factual grounding: Anchored in real web topology, mitigating hallucination.
- Scalability: Leverages TB-scale web archives as an inexhaustible source.
- Controllability: Difficulty is calibrated by tuning the subgraph size $|\mathcal{G}_{\text{sub}}|$.
### 2. Denoised Trajectory Synthesis
This method synthesizes high-quality search trajectories $\tau = \{(r_t, a_t, o_t)\}_{t=1}^{T}$, where $r_t$ is the reasoning, $a_t$ is the action (tool call), and $o_t$ is the observation (tool response).
**Synthesis via Dynamic Context Denoising:** The context construction follows a "Summarized History + Raw Recent" protocol:

$$c_t = \big(q,\; s_1, \ldots, s_{t-2},\; o_{t-1}\big),$$

where $s_i$ is a compressed summary of observation $o_i$, and the most recent observation $o_{t-1}$ is kept raw.
The mechanism operates in a two-phase cycle:
- Decision phase: The agent generates $(r_t, a_t)$ based on $c_t$, which includes the full raw observation $o_{t-1}$.
- Compression phase: After step $t$, the system retrospectively compresses $o_t$ into the summary $s_t$, which replaces $o_t$ in the long-term history once a newer observation arrives.
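The two-phase cycle can be sketched as below. This is an illustrative loop under stated assumptions: the `summarize` stub stands in for an LLM-based compressor, and the agent/tool lambdas are placeholders, not the paper's implementation.

```python
def summarize(observation):
    """Retrospective compression stub; the real pipeline would use an LLM."""
    return observation[:10] + "..."

def run_episode(question, agent_step, tool, num_steps):
    """'Summarized History + Raw Recent': older observations live in the
    context as summaries; only the newest observation stays raw."""
    history = [f"question: {question}"]
    trajectory = []
    for t in range(num_steps):
        # Decision phase: act on the current context (summaries + raw recent).
        reasoning, action = agent_step(list(history))
        observation = tool(action)
        trajectory.append((reasoning, action, observation))
        # Compression phase: the previously raw observation is retrospectively
        # compressed and replaced in long-term history; the new one stays raw.
        if len(history) > 1:
            history[-1] = summarize(history[-1])
        history.append(observation)
    return trajectory, history

# Stub agent and tool standing in for the real LLM and search backend.
agent = lambda ctx: ("think about it", f"search@{len(ctx)}")
tool = lambda action: f"RAW_RESULT_FOR_{action}"

trajectory, history = run_episode("who wrote X?", agent, tool, num_steps=3)
```

After the run, every observation in `history` except the last is a compressed summary, mirroring the "Summarized History + Raw Recent" context shape.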
**Asymmetric Context Training for Robust Denoising:**
- Synthesis data (Teacher): uses the clean, denoised context containing summaries.
- Training data (Student): uses the noisy context built from raw observations. The student is supervised to predict the optimal $(r_t, a_t)$ given this noisy context, forcing it to learn denoising capabilities.
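A hedged sketch of how one such asymmetric training example might be assembled; the helper `build_sft_sample` and its field names are hypothetical, intended only to show the clean-context/noisy-context pairing.

```python
def build_sft_sample(question, raw_observations, summaries, t, target_action):
    """Asymmetric context training (illustrative): the target action was
    synthesized by the teacher under the clean, summarized context, but the
    student is trained to reproduce it from the raw, noisy context."""
    teacher_context = [question] + summaries[:t]         # what the teacher saw
    student_context = [question] + raw_observations[:t]  # what the student sees
    return {
        "input": student_context,        # noisy context the student must denoise
        "target": target_action,         # teacher-produced (r_t, a_t)
        "teacher_context": teacher_context,
    }

raw = ["long noisy page dump 1", "long noisy page dump 2"]
clean = ["summary 1", "summary 2"]
sample = build_sft_sample("who wrote X?", raw, clean, t=2,
                          target_action="search('author of X')")
```

The asymmetry is the point: because the supervision target came from a clean context but the input is noisy, the gradient pushes the student to filter the noise itself.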
## Empirical Validation / Results
**Experimental Setup:**
- Model: OpenSeeker initialized from Qwen3-30B-A3B-Thinking-2507 (30B total parameters, 3B activated).
- Training: Single SFT run on 11.7k samples (10.3k English, 1.4k Chinese) without heuristic filtering or hyperparameter optimization.
- Benchmarks: BrowseComp, BrowseComp-ZH, xbench-DeepSearch, WideSearch.
**Key Results:**
Table 1: Comparisons among OpenSeeker and other search agents
| Model Name | # Samples | # OS Samples | Training | Academic | BrowseComp | BrowseComp-ZH | xbench | WideSearch |
|---|---|---|---|---|---|---|---|---|
| OpenSeeker-v1-30B-SFT | 11.7k | 11.7k | SFT | ✓ | 29.5% | 48.4% | 74.0% | 59.4% |
| DeepDive-32B | 4.1k | 4.1k | SFT+RL | × | 15.3% | 29.7% | 51.8% | - |
| MiroThinker-32B-v0.1 | 147k | 147k | SFT | × | 10.6% | 13.8% | - | - |
| WebSailor-V2-30B-SFT | ? | 0 | SFT | × | 24.4% | 28.3% | 61.7% | - |
| WebLeaper-30B | 15k | 0 | SFT | × | 27.7% | - | 66.0% | 44.1% |
| Tongyi DeepResearch | ? | 0 | CPT+SFT+RL | × | 43.4% | 46.7% | 75.0% | - |
| OpenAI-o3 | ? | 0 | ? | × | 49.1% | 68.7% | - | 60.0% |
Table 2: Performance comparison of different models trained via SFT
| Model | # Samples | # OS Samples | Academic | BrowseComp | BrowseComp-ZH | xbench | WideSearch-EN |
|---|---|---|---|---|---|---|---|
| OpenSeeker-v1-30B-SFT | 11.7k | 11.7k | ✓ | 29.5% | 48.4% | 74.0% | 59.4% |
| DeepDive-32B | 4.1k | 4.1k | × | 9.5% | 23.0% | 48.5% | - |
| MiroThinker-32B-v0.1 | 147k | 147k | × | 10.6% | 13.8% | - | - |
| WebSailor-V2-30B | ? | 0 | × | 24.4% | 28.3% | 61.7% | - |
| WebLeaper-30B | 15k | 0 | × | 27.7% | - | 66.0% | 44.1% |
Table 3: Performance comparison under comparable data volumes
| Data | # Samples | # OS Samples | Developer | BrowseComp | xbench | WideSearch-EN |
|---|---|---|---|---|---|---|
| OpenSeeker-v1-Data-11.7k | 11.7k | 11.7k | Academic | 29.50% | 74.00% | 59.40% |
| WebSailor-V2-10k | 10k | 0 | Tongyi | 24.50% | 62.67% | 38.91% |
| WebSailor-V2-5k + WebLeaper-Basic-5k | 10k | 0 | Tongyi | 20.67% | 58.33% | 32.26% |
| WebSailor-V2-5k + WebLeaper-Union-5k | 10k | 0 | Tongyi | 27.50% | 62.33% | 41.70% |
| WebSailor-V2-5k + WebLeaper-Reverse-Union-10k | 15k | 0 | Tongyi | 27.67% | 66.00% | 44.07% |
**Key Findings:**
- Outperforming resource-intensive baselines: OpenSeeker achieves 48.4% on BrowseComp-ZH, surpassing Tongyi DeepResearch (46.7%) which uses CPT+SFT+RL.
- Superior performance under identical SFT setup: Among ~30B models trained only with SFT, OpenSeeker outperforms WebSailor-V2 by roughly 20 percentage points on BrowseComp-ZH (48.4% vs. 28.3%).
- Superior performance with comparable data volume: With a comparable sample budget (11.7k vs. 10k–15k for the baseline combinations), OpenSeeker outperforms the best baselines by 8 percentage points on xbench (74.0% vs. 66.0%) and roughly 15 points on WideSearch (59.4% vs. 44.1%).
- Data difficulty analysis: The synthesized Chinese data averages 46.35 tool calls and 76.1k tokens per trajectory, significantly more complex than BrowseComp-ZH (26.98 tool calls, 15.1k tokens).
## Theoretical and Practical Implications
- Breaking Corporate Data Monopoly: OpenSeeker dismantles the "data moat" held by industrial corporations, providing the academic community with resources to replicate industrial-grade capabilities.
- Data Quality over Quantity: The results demonstrate that high-fidelity, complex data (even with limited volume) is more effective than large volumes of lower-quality data.
- Democratization of Frontier AI: By fully open-sourcing the synthesis pipeline, training dataset, and model weights, this work fosters a more inclusive, transparent, and collaborative ecosystem for search agent research.
- Methodological Advancements: The fact-grounded QA synthesis and denoised trajectory synthesis techniques provide scalable, controllable frameworks for generating high-quality training data for complex reasoning tasks.
## Conclusion
OpenSeeker represents a significant breakthrough in democratizing frontier search agent development. Through two innovative data synthesis methods, it produces high-fidelity training data that enables state-of-the-art performance with only 11.7k samples and simple SFT. The work:
- Achieves competitive results against industrial models trained with extensive resources.
- Demonstrates that data quality is paramount over quantity.
- Fully open-sources all components (data, model, pipeline) to break down barriers.
- Provides a foundation for future research to optimize data distributions, implement quality filtering, and generate even more complex data.
The authors emphasize that their current results represent a lower bound due to resource constraints (single training run, no hyperparameter optimization), leaving substantial room for future improvement. OpenSeeker aims to catalyze a more open, collaborative development of autonomous agents.