# OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

> OpenSeeker democratizes frontier search agents by fully open-sourcing its training data and model, achieving state-of-the-art performance through novel data synthesis methods.

- **Source:** [arXiv](https://arxiv.org/abs/2603.15594)
- **Published:** 2026-03-18
- **Permalink:** https://picx.dev/p/e2EAIc
- **Whiteboard:** https://picx.dev/p/e2EAIc/image

## Summary

## Summary (Overview)
* **OpenSeeker** is the first fully open-source search agent (model and data) that achieves frontier-level performance through strategic data synthesis.
* **Core Innovations**: Two technical methods: (1) Fact-grounded scalable controllable QA synthesis that reverse-engineers web graphs to generate complex multi-hop reasoning tasks, and (2) Denoised trajectory synthesis that uses retrospective summarization to produce high-quality actions.
* **Performance**: Trained on only 11.7k synthesized samples via simple SFT, OpenSeeker scores 29.5% on BrowseComp, 48.4% on BrowseComp-ZH, 74.0% on xbench-DeepSearch, and 59.4% on WideSearch. These are the best results among the SFT-only ∼30B agents compared, and on BrowseComp-ZH it surpasses the industrial Tongyi DeepResearch (46.7%).
* **Democratization**: The work aims to break the corporate "data moat" by fully open-sourcing the complete training dataset and model weights to foster transparent research.
* **Academic Achievement**: This represents the first work by a purely academic team to achieve SOTA performance on frontier search benchmarks while fully open-sourcing training data.

## Introduction and Theoretical Foundation
The paper addresses a critical problem in AI research: the development of high-performance search agents has been dominated by industrial giants due to a lack of transparent, high-quality training data. This data scarcity has hindered progress in the broader research community. While corporate entities have produced capable proprietary agents (OpenAI Deep Research, Kimi-Researcher, Gemini Deep Research) and some have released open-weight models (Kimi K2 series, GLM, MiniMax M2), none have disclosed their training data, creating a "data moat."

The authors argue that to train effective deep search agents, two pivotal challenges must be addressed:
1. **High-difficulty QA**: Only sufficiently complex queries compel the system to engage in rigorous multi-turn interaction cycles ("Reasoning → Tool Call → Tool Response"), generating long-horizon trajectories.
2. **High-quality trajectories**: Synthesis of solution paths must rely on stable methods to ensure training signals represent "correct and generalizable" strategies.

OpenSeeker is introduced as a solution to democratize frontier search intelligence by providing the complete synthesis pipeline and high-fidelity training data.

## Methodology
The methodology consists of two core technical innovations:

### 1. Fact-Grounded Scalable Controllable QA Synthesis
This framework constructs question-answer pairs $(q, y)$ directly from the web graph $G = (V, E)$, where $V$ denotes web pages and $E$ denotes hyperlinks. The pipeline operates in two phases:

**Generative Construction**:
* **Graph Expansion**: Samples a seed node $v_{seed} \sim V$ and traverses outgoing edges to gather $k$ connected nodes, forming a local dependency subgraph $G_{sub} = \{v_{seed}\} \cup \{v_i | (v_{seed}, v_i) \in E\}_k$.
* **Entity Extraction**: Identifies the central theme $y_{theme}$ of $v_{seed}$ and distills key entities from across $G_{sub}$ into a condensed **Entity Subgraph** $G_{entity}$, removing textual noise.
* **Question Generation**: Uses a generator $P_{gen}$ to synthesize an initial question $q_{init}$ conditioned on $G_{entity}$, imposing a hard structural constraint that deriving $y_{theme}$ must necessitate traversing multiple edges within $G_{entity}$.
* **Entity Obfuscation**: Applies an obfuscation operator $\Phi$ to entity nodes $e$ to map them to vague references $\tilde{e} = \Phi(e)$, yielding a **Fuzzy Entity Subgraph** $\tilde{G}_{entity}$.
* **Question Obfuscation**: Generates the final question $\tilde{q}$ by rewriting $q_{init}$ to incorporate ambiguous descriptions from $\tilde{G}_{entity}$ while preserving reasoning logic.
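The graph expansion and entity obfuscation steps above can be sketched in a few lines. This is only an illustrative toy, not the authors' pipeline: the adjacency dictionary, the function names `expand_subgraph` and `obfuscate`, and the lookup table standing in for $\Phi$ are all assumptions (in the paper these steps presumably run over real web archives with an LLM doing the rewriting).

```python
import random

def expand_subgraph(edges, seed, k, rng):
    """Graph expansion: return the seed page plus up to k pages it links to."""
    neighbors = sorted(edges.get(seed, ()))
    chosen = rng.sample(neighbors, min(k, len(neighbors)))
    return [seed] + chosen

def obfuscate(entities, vague_names):
    """Stand-in for the operator Phi: map each entity to a vague reference.

    Entities without a known vague form are left unchanged.
    """
    return {entity: vague_names.get(entity, entity) for entity in entities}

# Hypothetical toy web graph: page -> set of pages it links to.
toy_edges = {
    "page_a": {"page_b", "page_c", "page_d"},
    "page_b": {"page_a"},
}

rng = random.Random(0)
g_sub = expand_subgraph(toy_edges, "page_a", k=2, rng=rng)

# Hypothetical entities and vague descriptions, for illustration only.
fuzzy = obfuscate(
    ["Marie Curie", "Sorbonne"],
    {"Marie Curie": "a physicist who won two Nobel Prizes"},
)
```

Here `k` plays the role of the subgraph size that the summary describes as the difficulty knob: a larger `k` gives the question generator more pages to chain together.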

**Dual-Criteria Verification via Rejection Sampling**:
Two indicator functions filter synthesized pairs $(\tilde{q}, y)$:
* **Criterion 1 (Difficulty)**: $I[\pi_{base}(\tilde{q}) \neq y]$, where $\pi_{base}$ is a strong foundation model in a closed-book setting. If the model answers correctly, the question is too easy and is discarded.
* **Criterion 2 (Solvability)**: $I[\pi_{base}(\tilde{q}|G_{entity}) = y]$, where the model is given $G_{entity}$ as context. If the model still fails, the sample is rejected as unsolvable.
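A minimal sketch of this two-criterion filter is shown below. It assumes exact string matching to judge correctness, which may differ from how the paper grades answers, and `toy_model` is a hypothetical stand-in for $\pi_{base}$ that "knows" one memorized fact and can otherwise only read answers out of the supplied context.

```python
from typing import Callable, Dict, Optional

# A base model takes a question and optional context and returns an answer string.
BaseModel = Callable[[str, Optional[Dict[str, str]]], str]

def keep_sample(question: str, answer: str,
                entity_context: Dict[str, str],
                base_model: BaseModel) -> bool:
    """Return True only if the pair passes both criteria."""
    # Criterion 1 (difficulty): the closed-book answer must be wrong.
    if base_model(question, None) == answer:
        return False
    # Criterion 2 (solvability): with the entity subgraph as context,
    # the model must recover the answer.
    return base_model(question, entity_context) == answer

# Hypothetical stand-in for pi_base.
MEMORIZED = {"capital of France?": "Paris"}

def toy_model(question: str, context: Optional[Dict[str, str]]) -> str:
    if question in MEMORIZED:
        return MEMORIZED[question]
    if context is not None and question in context:
        return context[question]
    return "unknown"

too_easy = keep_sample("capital of France?", "Paris", {}, toy_model)
good = keep_sample("hard question", "x", {"hard question": "x"}, toy_model)
unsolvable = keep_sample("hard question", "x", {}, toy_model)
```

Only `good` survives: the first sample is answered closed-book, and the third cannot be answered even with context.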

This paradigm offers three strengths:
1. **Factual grounding**: Anchored in real web topology, mitigating hallucination.
2. **Scalability**: Leverages TB-scale web archives as an inexhaustible source.
3. **Controllability**: Difficulty calibrated by tuning subgraph size $(k)$.

### 2. Denoised Trajectory Synthesis
This method synthesizes high-quality search trajectories $\tau = [q, (r_1, a_1, o_1), ..., (r_T, a_T, o_T), y]$, where $r_t$ is reasoning, $a_t$ is action (tool call), and $o_t$ is observation (tool response).

**Synthesis via Dynamic Context Denoising**:
The context construction follows a "Summarized History + Raw Recent" protocol:
$$H_t = \{q, (r_1, a_1, s_1), ..., (r_{t-2}, a_{t-2}, s_{t-2})\} \cup \{(r_{t-1}, a_{t-1}, o_{t-1})\}$$
where $s_i = \mathrm{Summarize}(o_i \mid \text{context})$ is a compressed summary of observation $o_i$.

The mechanism operates in a two-phase cycle:
1. **Decision phase**: Agent generates $(r_t, a_t)$ based on $H_t$, which includes the full raw observation $o_{t-1}$.
2. **Compression phase**: After step $t$, the system retrospectively compresses $o_{t-1}$ into $s_{t-1}$, which replaces $o_{t-1}$ in the long-term history used to build $H_{t+1}$.
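The "Summarized History + Raw Recent" bookkeeping can be illustrated with a short sketch. The `Step` class, the function names, and the truncating `truncate` summarizer are assumptions made for illustration; the paper's summarizer is conditioned on context and is presumably an LLM call rather than a string slice.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    reasoning: str
    action: str
    observation: str
    summary: Optional[str] = None  # filled in by the compression phase

def build_context(question, steps):
    """Build H_t: summaries for older steps, raw observation for the latest one."""
    context = [question]
    for i, step in enumerate(steps):
        is_latest = i == len(steps) - 1
        shown = step.observation if is_latest else step.summary
        context.append((step.reasoning, step.action, shown))
    return context

def compress_previous(steps, summarize):
    """Compression phase: summarize every observation except the most recent."""
    for step in steps[:-1]:
        if step.summary is None:
            step.summary = summarize(step.observation)

def truncate(observation):
    # Hypothetical stand-in for Summarize(o_i | context).
    return observation[:20]

steps = []
for t in range(1, 4):
    ctx = build_context("q", steps)  # the decision phase would condition on ctx
    steps.append(Step(f"r{t}", f"a{t}", f"raw observation {t} " * 5))
    compress_previous(steps, truncate)

final_ctx = build_context("q", steps)
```

After three steps, `final_ctx` holds the question, two summarized steps, and one step with its full raw observation, matching the shape of $H_t$ above.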

**Asymmetric Context Training for Robust Denoising**:
* **Synthesis data (Teacher)**: Uses clean, denoised context $H_t$ containing summaries.
* **Training data (Student)**: Uses noisy, raw context $H^{train}_t = \{q, (r_1, a_1, o_1), ..., (r_{t-1}, a_{t-1}, o_{t-1})\}$.
The student is supervised to predict optimal $(r_t, a_t)$ given noisy context, forcing it to learn denoising capabilities.
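One way to picture the asymmetry: the targets $(r_t, a_t)$ come from the teacher's trajectory, but the inputs are rebuilt from raw observations only. The sketch below assumes one training example per step and a simple dictionary format; the summary does not say how the paper actually packs examples (for instance, as a single sequence with a loss mask), and the final answer $y$ is omitted for brevity.

```python
def make_training_examples(question, trajectory):
    """Turn one synthesized trajectory into per-step student examples.

    trajectory: list of (reasoning, action, observation) triples, where the
    reasoning and action were generated by the teacher on denoised context.
    """
    examples = []
    for t, (reasoning, action, _observation) in enumerate(trajectory):
        # Student input: the question plus all earlier steps with RAW
        # observations (no summaries), i.e. the noisy context.
        raw_context = [question] + list(trajectory[:t])
        examples.append({"input": raw_context, "target": (reasoning, action)})
    return examples

# Hypothetical three-step trajectory.
trajectory = [
    ("r1", "a1", "long raw page 1"),
    ("r2", "a2", "long raw page 2"),
    ("r3", "a3", "long raw page 3"),
]
examples = make_training_examples("q", trajectory)
```

The student therefore never sees a summary $s_i$ at training time, yet is asked to reproduce decisions that were made with the cleaner view.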

## Empirical Validation / Results
**Experimental Setup**:
* **Model**: OpenSeeker initialized from Qwen3-30B-A3B-Thinking-2507 (30B total parameters, 3B activated).
* **Training**: Single SFT run on 11.7k samples (10.3k English, 1.4k Chinese) without heuristic filtering or hyperparameter optimization.
* **Benchmarks**: BrowseComp, BrowseComp-ZH, xbench-DeepSearch, WideSearch.

**Key Results**:

**Table 1: Comparisons among OpenSeeker and other search agents**
| Model Name | # Samples | # OS Samples | Training | Academic | BrowseComp | BrowseComp-ZH | xbench | WideSearch |
|------------|-----------|--------------|----------|----------|------------|----------------|--------|------------|
| **OpenSeeker-v1-30B-SFT** | 11.7k | 11.7k | SFT | ✓ | **29.5%** | **48.4%** | **74.0%** | **59.4%** |
| DeepDive-32B | 4.1k | 4.1k | SFT+RL | × | 15.3% | 29.7% | 51.8% | - |
| MiroThinker-32B-v0.1 | 147k | 147k | SFT | × | 10.6% | 13.8% | - | - |
| WebSailor-V2-30B-SFT | ? | 0 | SFT | × | 24.4% | 28.3% | 61.7% | - |
| WebLeaper-30B | 15k | 0 | SFT | × | 27.7% | - | 66.0% | 44.1% |
| Tongyi DeepResearch | ? | 0 | CPT+SFT+RL | × | 43.4% | 46.7% | 75.0% | - |
| OpenAI-o3 | ? | 0 | ? | × | 49.1% | 68.7% | - | 60.0% |

**Table 2: Performance comparison of different models trained via SFT**
| Model | # Samples | # OS Samples | Academic | BrowseComp | BrowseComp-ZH | xbench | WideSearch-EN |
|------|-----------|--------------|----------|------------|----------------|--------|----------------|
| **OpenSeeker-v1-30B-SFT** | 11.7k | 11.7k | ✓ | **29.5%** | **48.4%** | **74.0%** | **59.4%** |
| DeepDive-32B | 4.1k | 4.1k | × | 9.5% | 23.0% | 48.5% | - |
| MiroThinker-32B-v0.1 | 147k | 147k | × | 10.6% | 13.8% | - | - |
| WebSailor-V2-30B | ? | 0 | × | 24.4% | 28.3% | 61.7% | - |
| WebLeaper-30B | 15k | 0 | × | 27.7% | - | 66.0% | 44.1% |

**Table 3: Performance comparison under comparable data volumes**
| Data | # Samples | # OS Samples | Developer | BrowseComp | xbench | WideSearch-EN |
|------|-----------|--------------|-----------|------------|--------|----------------|
| **OpenSeeker-v1-Data-11.7k** | 11.7k | 11.7k | Academic | **29.50%** | **74.00%** | **59.40%** |
| WebSailor-V2-10k | 10k | 0 | Tongyi | 24.50% | 62.67% | 38.91% |
| WebSailor-V2-5k + WebLeaper-Basic-5k | 10k | 0 | Tongyi | 20.67% | 58.33% | 32.26% |
| WebSailor-V2-5k + WebLeaper-Union-5k | 10k | 0 | Tongyi | 27.50% | 62.33% | 41.70% |
| WebSailor-V2-5k + WebLeaper-Reverse-Union-10k | 15k | 0 | Tongyi | 27.67% | 66.00% | 44.07% |

**Key Findings**:
1. **Outperforming resource-intensive baselines**: OpenSeeker achieves 48.4% on BrowseComp-ZH, surpassing Tongyi DeepResearch (46.7%), which uses CPT+SFT+RL.
2. **Superior performance under identical SFT setup**: Among ∼30B models trained only with SFT, OpenSeeker outperforms WebSailor-V2-SFT by about 20 percentage points on BrowseComp-ZH (48.4% vs. 28.3%).
3. **Superior performance with comparable data volume**: With a comparable number of samples (11.7k vs. 10k-15k), OpenSeeker outperforms the best baseline combination by 8 percentage points on xbench (74.00% vs. 66.00%) and about 15 percentage points on WideSearch (59.40% vs. 44.07%).
4. **Data difficulty analysis**: The synthesized Chinese data averages **46.35 tool calls** and **76.1k tokens** per trajectory, significantly more complex than BrowseComp-ZH (26.98 tool calls, 15.1k tokens).
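The gaps quoted in these findings are differences in percentage points, and they can be recomputed directly from the tables above. The snippet below is only a sanity check; the dictionary layout and the `gap` helper are illustrative.

```python
# Scores in percent, copied from Tables 1 and 3 above.
openseeker = {"BrowseComp-ZH": 48.4, "xbench": 74.00, "WideSearch": 59.40}
tongyi_deepresearch = {"BrowseComp-ZH": 46.7}
websailor_v2_sft = {"BrowseComp-ZH": 28.3}
best_mix_15k = {"xbench": 66.00, "WideSearch": 44.07}  # WebSailor-V2-5k + WebLeaper-Reverse-Union-10k

def gap(ours, theirs, benchmark):
    """Difference in percentage points, rounded to two decimals."""
    return round(ours[benchmark] - theirs[benchmark], 2)

gap_vs_tongyi = gap(openseeker, tongyi_deepresearch, "BrowseComp-ZH")
gap_vs_websailor = gap(openseeker, websailor_v2_sft, "BrowseComp-ZH")
gap_xbench = gap(openseeker, best_mix_15k, "xbench")
gap_widesearch = gap(openseeker, best_mix_15k, "WideSearch")
```

These come out to 1.7, 20.1, 8.0, and 15.33 points respectively.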

## Theoretical and Practical Implications
* **Breaking Corporate Data Monopoly**: OpenSeeker dismantles the "data moat" held by industrial corporations, providing the academic community with resources to replicate industrial-grade capabilities.
* **Data Quality over Quantity**: The results demonstrate that high-fidelity, complex data (even with limited volume) is more effective than large volumes of lower-quality data.
* **Democratization of Frontier AI**: By fully open-sourcing the synthesis pipeline, training dataset, and model weights, this work fosters a more inclusive, transparent, and collaborative ecosystem for search agent research.
* **Methodological Advancements**: The fact-grounded QA synthesis and denoised trajectory synthesis techniques provide scalable, controllable frameworks for generating high-quality training data for complex reasoning tasks.

## Conclusion
OpenSeeker represents a significant breakthrough in democratizing frontier search agent development. Through two innovative data synthesis methods, it produces high-fidelity training data that enables state-of-the-art performance with only 11.7k samples and simple SFT. The work:
* **Achieves competitive results** against industrial models trained with extensive resources.
* **Demonstrates that data quality** is paramount over quantity.
* **Fully open-sources** all components (data, model, pipeline) to break down barriers.
* **Provides a foundation** for future research to optimize data distributions, implement quality filtering, and generate even more complex data.

The authors emphasize that their current results represent a lower bound due to resource constraints (single training run, no hyperparameter optimization), leaving substantial room for future improvement. OpenSeeker aims to catalyze a more open, collaborative development of autonomous agents.

---

_Markdown view of https://picx.dev/p/e2EAIc, served by PicX — AI-generated visual whiteboard summaries of research papers._
