OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories - Summary

Summary (Overview)

  • Simple SFT Surpasses Complex Pipelines: The paper demonstrates that a simple Supervised Fine-Tuning (SFT) approach, when fueled with high-quality data, can outperform the typical industrial recipe (Continual Pre-Training + SFT + Reinforcement Learning) for training frontier search agents.
  • Core Data Synthesis Modifications: Three key modifications to data synthesis are introduced: 1) Scaling graph size for richer exploration and multi-hop reasoning, 2) Expanding the tool set for broader functionality, and 3) Strict low-step filtering, which discards trajectories with too few tool-call steps to enforce a high difficulty floor.
  • State-of-the-Art Performance: Trained on only 10.6k data points, OpenSeeker-v2 (30B) achieves new SOTA results on four benchmarks: 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity’s Last Exam (HLE), and 78.0% on xbench, surpassing models like Tongyi DeepResearch.
  • Academic and Open-Source Milestone: OpenSeeker-v2 is presented as the first SOTA search agent within its model scale and paradigm (ReAct) developed by a purely academic team using only SFT, with fully open-sourced model weights.
  • Emphasis on Data Quality: The work highlights that meticulously designed, high-difficulty, and information-rich synthetic data is a critical and scalable path for advancing agent capabilities, potentially reducing reliance on massive compute.

Introduction and Theoretical Foundation

The development of high-performance deep search agents for Large Language Models (LLMs) has been dominated by well-funded corporate entities. The standard industry pipeline is highly resource-intensive, involving stages of Continual Pre-Training (CPT) on massive corpora, Supervised Fine-Tuning (SFT), and complex Reinforcement Learning (RL). This creates a significant barrier for academic and open-source innovation.

Building upon the initial OpenSeeker work, this paper challenges the necessity of such complex pipelines. Its central hypothesis is that the quality of training trajectories—specifically their difficulty and informational richness—is paramount. The research question is: Can a straightforward SFT approach, powered by superior data, rival or exceed the performance of heavy industrial training recipes?

Methodology

The OpenSeeker-v2 framework is based on SFT, with a focus on synthesizing a high-quality dataset D_v2. The methodology centers on three modifications to the data generation pipeline:

  1. Scaling Graph Size for Richer Exploration: The source knowledge graph G = (V, E) is expanded more aggressively during task synthesis. For a seed node v_seed, a larger evidence subgraph is constructed by increasing the expansion budget from k to K (where K > k):

    G^{(K)}_{sub} = \text{Expand}(G, v_{seed}, K).

    A synthetic query q is then generated conditioned on this enriched context: q ∼ P_{gen}(q | G^{(K)}_{sub}). This forces the generation of questions that require multi-hop evidence aggregation.

  2. Expanding the Tool Set for Broader Functionality: The agent is equipped with an expanded set of tools A (larger than in OpenSeeker-v1). This enables the generation of more diverse and functionally rich ReAct-style trajectories:

    τ = (r_1, a_1, o_1, r_2, a_2, o_2, \ldots, r_T, a_T, o_T, r_{T+1}, y),

    where a_t ∈ A is a tool call, o_t is the observation, and r_t is the reasoning trace.

  3. Strict Low-Step Filtering: To ensure high difficulty, a filter is applied to the raw dataset D_raw:

    D_{v2} = \{(q, τ) ∈ D_{raw} \mid T(τ) ≥ T_{min}\}.

    Here, T(τ) is the number of tool-call steps in trajectory τ, and T_min is a predefined minimum threshold. This discards simple queries solvable by direct lookup.
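Taken together, the three modifications amount to a simple expand–generate–filter pipeline over the knowledge graph. A minimal sketch is given below; function names and the data layout are illustrative, not from the paper, and the query/trajectory generation step (which requires an LLM) is omitted:

```python
from collections import deque

def expand(graph, v_seed, budget):
    """Breadth-first expansion of an evidence subgraph around a seed node.

    `graph` maps each node to its neighbors; `budget` plays the role of the
    expansion budget K: the number of nodes admitted beyond the seed.
    Raising it from k to K > k yields the richer subgraph G^{(K)}_{sub}.
    """
    visited = {v_seed}
    queue = deque([v_seed])
    while queue and len(visited) - 1 < budget:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
                if len(visited) - 1 >= budget:
                    break
    return visited

def low_step_filter(raw_dataset, t_min):
    """Keep only (query, trajectory) pairs with at least t_min tool calls.

    Mirrors D_v2 = {(q, τ) ∈ D_raw | T(τ) ≥ T_min}, where T(τ) is taken
    here as the number of recorded actions in the trajectory.
    """
    return [(q, traj) for q, traj in raw_dataset
            if len(traj["actions"]) >= t_min]
```

The filter runs last because trajectory length is only known after generation; queries whose answers are reachable by direct lookup produce short trajectories and are discarded.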

The final model is trained with a standard SFT objective on the filtered dataset D_v2. The base model is Qwen3-30B-A3B-Thinking-2507 (30B total parameters, 3B activated for inference).
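The SFT objective is not spelled out above; a standard token-level formulation, assuming (as is common for ReAct-style training) that the loss is computed only on model-generated tokens and masked on tool observations o_t, would be:

```latex
\mathcal{L}_{\text{SFT}}(\theta)
  = -\,\mathbb{E}_{(q,\tau)\sim D_{v2}}
    \sum_{x_i \in \tau \setminus \{o_1,\dots,o_T\}}
    \log \pi_\theta\!\left(x_i \mid q,\, x_{<i}\right),
```

where x_i ranges over the tokens of the reasoning traces r_t, tool calls a_t, and final answer y, conditioned on the query and the full preceding context (including observations).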

Empirical Validation / Results

OpenSeeker-v2 is evaluated on four challenging benchmarks for deep research tasks: BrowseComp, BrowseComp-ZH, Humanity’s Last Exam (HLE), and xbench-DeepSearch.

Table 1: Performance comparison of OpenSeeker-v2 against other search agents (ReAct-based, ~30B scale).

| Model | # Samples | Training | Academic | BrowseComp | BC-ZH | HLE | xbench |
|---|---|---|---|---|---|---|---|
| OpenSeeker-v2-30B-SFT | 10.6k | SFT | ✓ | 46.0 | 58.1 | 34.6 | 78.0 |
| Tongyi DeepResearch | ? | CPT + SFT + RL | × | 43.4 | 46.7 | 32.9 | 75.0 |
| RedSearcher-30B | ? | CPT + SFT + RL | × | 42.1 | 49.8 | 34.3 | – |
| OpenSeeker-v1-30B-SFT | 11.7k | SFT | ✓ | 29.5 | 48.4 | – | 74.0 |
| WebSailor-V2-30B-RL | ? | SFT + RL | × | 35.3 | 44.1 | 30.6 | 73.7 |

Key Findings:

  • Surpassing Heavier Pipelines: OpenSeeker-v2, using only SFT, outperforms strong industrial models (Tongyi DeepResearch, RedSearcher) that employ full CPT+SFT+RL pipelines across all four benchmarks.
  • Substantial Improvement over v1: Compared to its predecessor (OpenSeeker-v1) under the same SFT-only setup, OpenSeeker-v2 shows large gains (e.g., +16.5 points on BrowseComp, +9.7 points on BrowseComp-ZH), demonstrating the scaling potential of improved data quality.
  • Higher Data Difficulty: Analysis of the training data shows that OpenSeeker-v2 trajectories are significantly longer and more complex. The average tool-call steps per trajectory are 64.67 for OpenSeeker-v2, compared to 46.97 for OpenSeeker-v1 and 36.01 for RedSearcher (see Figure 2 in paper). This confirms the success of the low-step filtering and graph scaling strategies in creating high-difficulty training data.

Theoretical and Practical Implications

  • Data Quality over Pipeline Complexity: The results strongly suggest that for training long-horizon search agents, investing in high-quality, high-difficulty synthetic data can be more effective than deploying increasingly complex and expensive multi-stage training pipelines.
  • Democratization of Agent Research: By proving that SOTA performance is achievable with a simple SFT approach and a relatively small (10.6k), carefully curated dataset, the work lowers the barrier to entry. The full open-sourcing of model weights provides a reproducible baseline for the academic community.
  • Scalable Pathway for Advancement: The consistent gains from OpenSeeker-v1 to v2 indicate that the approach of scaling data quality (richness, difficulty, diversity) is not saturated and represents a promising and scalable direction for future improvements in agent capabilities.

Conclusion

OpenSeeker-v2 demonstrates that a search agent trained with simple SFT on a small but high-quality dataset can rival and surpass agents trained with extensive, resource-heavy industrial pipelines. The core innovation lies in three straightforward data synthesis modifications—scaling graph size, expanding the tool set, and applying strict low-step filtering—which collectively produce informative and high-difficulty training trajectories.

The work underscores the critical role of data quality in unlocking agent performance. The authors plan to continue pushing the limits of search agents by further scaling up the quantity, quality, and diversity of synthesized data, following this promising path.