Visual Summary | SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Summary (Overview)

Problem: Long-horizon agent tasks generate unbounded context demands, but LLM context windows are inherently finite. Existing passive context management (truncation, summarization) lacks prior planning.
Proposal: Introduce delegation intelligence, in which the main agent decomposes tasks, dispatches bounded subtasks to subagents via call_sub_agent, and integrates their condensed reports. A harness steers the model toward high-quality delegation at inference time.
Data Synthesis: Harness-guided trajectories that encode correct delegation decisions are used as supervised fine-tuning (SFT) data to internalize delegation behavior into model weights.
Results: SearchSwarm-30B-A3B achieves 68.1 on BrowseComp, 73.3 on BrowseComp-ZH, 82.5 on GAIA, and 80.8 on xbench-DeepSearch — best among all models of comparable scale and competitive with models over 10× larger.
Open Source: Harness, model weights, and training data are released to facilitate future research.

Introduction and Theoretical Foundation

Large language models (LLMs) increasingly act as agents for complex, multi-step tasks whose information needs grow without bound. Yet context windows remain finite, creating a fundamental tension. Early context management strategies — summarization after length thresholds, discarding old tool outputs — are passive: they react after budget exhaustion without prior planning.

An alternative paradigm is delegation: a main agent decomposes the task in advance, dispatches bounded subtasks to subagents, and receives only condensed results, actively preserving context budget. The required capability is termed delegation intelligence: the ability to decompose, determine when and what to delegate, and integrate results.

Naturally occurring text rarely exhibits explicit multi-agent coordination, making training data for delegation intelligence scarce. The paper presents a structured recipe to synthesize such data in the deep research domain, a representative long-horizon agent task. The core idea: design a harness that elicits high-quality delegation at inference time, then use the resulting trajectories as SFT data to internalize the behavior.

Methodology

Formulation

The deep research task is modelled as a multi-turn ReAct (Yao et al., 2022) interaction. At step ( t ):

Thought ( \tau_t ): internal reasoning
Action ( a_t ): tool call (including call_sub_agent)
Observation ( o_t ): environment result

Trajectory:

H_T = \langle q, (\tau_0, a_0, o_0), \ldots, (\tau_T, a_T, o_T), y \rangle \tag{1}

Policy:

\tau_t, a_t \sim \pi(\cdot \mid q, H_{t-1}) \tag{2}

When ( a_t = \mathrm{call\_sub\_agent}(b) ), the brief ( b ) triggers an independent sub-trajectory:

H_{\text{sub}} = \langle b, (\tau^s_0, a^s_0, o^s_0), \ldots, (\tau^s_S, a^s_S, o^s_S), r \rangle \tag{3}

and the main agent receives ( o_t = r ) (the condensed report). Intermediate steps are not visible.
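The formulation above can be illustrated with a minimal sketch of the rollout loop. This is not the paper's implementation: `run_agent`, the scripted policies, the `finish` pseudo-tool, and the toy `search` tool are all hypothetical names introduced for illustration. The point it shows is that the main trajectory records only the report r as the observation for a call_sub_agent action, never the subagent's intermediate steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str, str]  # (thought, tool name, tool argument, observation)
# A policy maps (query, history so far) to (thought, tool name, tool argument).
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


@dataclass
class Trajectory:
    query: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


def run_agent(query: str, policy: Policy, tools: Dict[str, Callable[[str], str]],
              sub_policy: Policy = None, max_steps: int = 8) -> Trajectory:
    """Roll out one ReAct trajectory; the hypothetical `finish` tool ends it."""
    traj = Trajectory(query)
    for _ in range(max_steps):
        thought, tool, arg = policy(query, traj.steps)
        if tool == "finish":
            traj.answer = arg
            break
        if tool == "call_sub_agent" and sub_policy is not None:
            # The brief starts an independent sub-trajectory. Passing
            # sub_policy=None to the subagent means it cannot delegate further.
            sub = run_agent(arg, sub_policy, tools, sub_policy=None, max_steps=max_steps)
            obs = sub.answer  # only the condensed report is returned, not sub.steps
        elif tool in tools:
            obs = tools[tool](arg)
        else:
            obs = f"error: tool '{tool}' unavailable"
        traj.steps.append((thought, tool, arg, obs))
    return traj


# Scripted toy policies standing in for an LLM (illustrative only).
def main_policy(query, steps):
    if not steps:
        return ("delegate the lookup", "call_sub_agent", "Find the capital of France.")
    return ("integrate the report", "finish", steps[-1][3])


def sub_policy(brief, steps):
    if not steps:
        return ("search first", "search", brief)
    return ("write the report", "finish", f"Report: {steps[-1][3]}")


toy_tools = {"search": lambda q: "Paris [1]"}
main = run_agent("What is the capital of France?", main_policy, toy_tools, sub_policy)
```

In this toy run the main trajectory holds a single step whose observation is the subagent's report; the subagent's search step lives only in its own trajectory.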

Harness Design

The harness consists of a tool set and system prompts that guide an LLM toward high-quality delegation via four principles:

Principle	Description
Encouraging delegation	Main agent should offload token‑expensive but cognitively shallow information gathering to subagents, reserving its own context for high‑level coordination and verification.
Comprehensive briefing	Each brief must include subtask description, why it matters, what is already known, what remains uncertain, and directions already ruled out — so subagents operate with full context.
Main agent retains core judgment	Only the main agent has a complete view across subtasks; it must independently decide hypotheses, termination, and adjudication. Subagents focus on evidence gathering.
Citation‑grounded reporting	Subagent reports attach inline citations to every important conclusion. The main agent propagates these to its final answer, ensuring end‑to‑end traceability.

Tools include search, visit, google_scholar, python, and call_sub_agent. Delegation is single-level: subagents cannot delegate further.
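As a rough illustration of the comprehensive-briefing principle, the five elements a brief should contain can be modelled as a small data structure. The `Brief` class, its field names, and the rendered text layout are assumptions made for this sketch; the source does not specify how briefs are represented.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Brief:
    """Hypothetical container for the five elements a brief should carry."""
    subtask: str            # what the subagent should do
    why_it_matters: str     # how the result feeds the overall task
    known: List[str]        # facts already established
    uncertain: List[str]    # open questions
    ruled_out: List[str]    # directions already eliminated

    def missing_fields(self) -> List[str]:
        """Names of elements left empty, so an incomplete brief can be flagged."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def render(self) -> str:
        """Flatten the brief into the text that would be passed to call_sub_agent."""
        def bullets(items: List[str]) -> str:
            return "\n".join(f"- {item}" for item in items) or "- (none)"

        return "\n".join([
            f"Subtask: {self.subtask}",
            f"Why it matters: {self.why_it_matters}",
            "Already known:\n" + bullets(self.known),
            "Still uncertain:\n" + bullets(self.uncertain),
            "Ruled out:\n" + bullets(self.ruled_out),
        ])


# Placeholder content, purely illustrative.
brief = Brief(
    subtask="Identify the venue where the described paper was published.",
    why_it_matters="The venue narrows the candidate authors for the final answer.",
    known=["The paper appeared in 2019."],
    uncertain=["Whether it was a workshop or main-track paper."],
    ruled_out=["Journal-only publications."],
)
incomplete = Brief("Check a date.", "Needed for verification.", [], ["Exact year."], [])
```

Here `brief.missing_fields()` is empty, while `incomplete.missing_fields()` reports the two elements that were left blank, which a harness could use to reject or repair the brief before dispatch.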

Supervised Fine‑Tuning

Data collection: Queries from RedSearcher and OpenSeeker datasets. The model executes deep research under harness guidance. Two configurations:

Same model as main and subagent – both trajectories retained.
Stronger main agent + weaker subagent – only main agent trajectories retained (more deliberate decomposition and verification).

Filtering: Retain only main trajectories with correct final answers, then discard those containing repeated tool calls, hallucinated citations, or tool misuse.
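The filtering step could be sketched as a single predicate over trajectories. Everything concrete here is an assumption: the dictionary layout, exact-match answer checking, treating "repeated" as an identical (tool, argument) pair, treating "hallucinated" as a citation to a source never observed, and treating "misuse" as a call to a tool outside the harness tool set. The source does not say how each check is implemented.

```python
from typing import Dict, List

ALLOWED_TOOLS = {"search", "visit", "google_scholar", "python", "call_sub_agent"}


def keep_trajectory(traj: Dict) -> bool:
    """Return True if a main-agent trajectory passes all four illustrative checks."""
    calls = [(c["tool"], c["args"]) for c in traj["calls"]]
    # 1. Final answer must be correct (exact match is a simplifying assumption).
    if traj["answer"].strip().lower() != traj["gold"].strip().lower():
        return False
    # 2. No repeated tool calls (same tool with the same argument).
    if len(set(calls)) != len(calls):
        return False
    # 3. No tool misuse (here: calling a tool outside the allowed set).
    if any(tool not in ALLOWED_TOOLS for tool, _ in calls):
        return False
    # 4. No hallucinated citations (every cited source must have been observed).
    if not set(traj["cited"]) <= set(traj["seen_sources"]):
        return False
    return True


def filter_trajectories(trajs: List[Dict]) -> List[Dict]:
    return [t for t in trajs if keep_trajectory(t)]


# Toy trajectories, invented for illustration.
good = {"answer": "Paris", "gold": "paris",
        "calls": [{"tool": "call_sub_agent", "args": "brief A"},
                  {"tool": "visit", "args": "url-1"}],
        "cited": ["url-1"], "seen_sources": ["url-1", "url-2"]}
repeated = dict(good, calls=[{"tool": "search", "args": "q"},
                             {"tool": "search", "args": "q"}])
hallucinated = dict(good, cited=["url-9"])
wrong = dict(good, answer="Lyon")

kept = filter_trajectories([good, repeated, hallucinated, wrong])
```

Only the first toy trajectory survives; each of the others fails exactly one check.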

Training objective: next‑token prediction with environment masking:

\mathcal{L} = -\sum_{t=1}^{T} \sum_{j=1}^{|a_t|} \log p_\theta \left( a_t^{(j)} \mid a_t^{(<j)}, \tau_{<t} \right) \tag{5}

Loss computed only on model outputs ( a_t ); environment returns ( o_t ) are masked.
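Environment masking can be illustrated with a small, framework-free sketch: per-token log-probabilities are summed only where a mask marks the token as model output. The segment labels (`"model"`, `"env"`), the helper names, and the toy probabilities are assumptions for illustration; a real training setup would compute the same quantity over tensors.

```python
import math
from typing import List, Tuple


def build_loss_mask(segments: List[Tuple[str, int]]) -> List[int]:
    """Expand (source, token_count) segments into a per-token 0/1 mask.

    Tokens produced by the model get 1; tokens returned by the environment
    (tool observations) get 0 and therefore contribute nothing to the loss.
    """
    mask: List[int] = []
    for source, n_tokens in segments:
        mask.extend([1 if source == "model" else 0] * n_tokens)
    return mask


def masked_nll(token_logprobs: List[float], loss_mask: List[int]) -> float:
    """Negative log-likelihood summed over unmasked tokens only."""
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("log-probabilities and mask must have the same length")
    return -sum(lp for lp, m in zip(token_logprobs, loss_mask) if m)


# Toy sequence: two model tokens, one observation token, one model token.
segments = [("model", 2), ("env", 1), ("model", 1)]
mask = build_loss_mask(segments)
logprobs = [math.log(0.5), math.log(0.25), math.log(0.1), math.log(0.5)]
loss = masked_nll(logprobs, mask)
```

In the toy sequence the observation token's probability of 0.1 is ignored, so the loss equals -(log 0.5 + log 0.25 + log 0.5) = log 16.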

Empirical Validation / Results

Benchmarks and Baselines

Evaluated on BrowseComp, BrowseComp-ZH, GAIA, and xbench-DeepSearch-2505. Compared against closed‑source (GPT‑5, Claude‑4.5, Gemini‑3.0, etc.), large open‑source (DeepSeek V3.2, GLM‑4.7, etc.), and lightweight open‑source models (Tongyi DeepResearch, RedSearcher, LongSeeker, MiroThinker).

Main Results

Model	Size	BrowseComp	BrowseComp‑ZH	GAIA	xbench-DeepSearch-2505
Closed‑source
GPT‑5.2‑Thinking	–	65.8	76.1	–	–
Claude‑4.5‑Opus	–	67.8	62.4	71.5	–
Seed‑2.0‑Pro	–	77.3*	82.4*	78.6	–
Open‑source large
DeepSeek V3.2	671B‑A37B	67.6*	65.0*	75.1	78.0
GLM‑4.7	355B‑A32B	67.5*	66.6*	–	72.0
Step‑3.5‑Flash	196B‑A11B	69.0*	66.9	84.5	83.7
Lightweight open‑source
Tongyi DeepResearch	30B‑A3B	43.4	46.7	70.9	75.0
RedSearcher	30B‑A3B	57.4*	58.2*	80.1	–
LongSeeker	30B‑A3B	61.5*	62.5*	77.7*	78.0*
MiroThinker‑1.7‑mini	30B‑A3B	67.9*	72.3*	80.3*	–
SearchSwarm (Ours)	30B‑A3B	68.1*	73.3*	82.5*	80.8*

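* indicates results obtained with context-management techniques. SearchSwarm's scores are the best among the lightweight models listed on all four benchmarks.

SearchSwarm surpasses all comparable-scale models and is competitive with frontier models that are much larger.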

Harness Ablation (200‑question BrowseComp subset)

Configuration	Score
Base Tongyi framework	47.7
+ `call_sub_agent` tool (schema only)	50.0 (+2.3)
+ full harness (principles)	57.7 (+10.0)

Full harness substantially improves performance over simply providing the tool.

Training from a Different Base Model

Fine-tuning Qwen3-30B-A3B-Thinking-2507 on the same data yields 66.5 on the 200-question BrowseComp subset and 64.0 on BrowseComp-ZH, exceeding RedSearcher and LongSeeker. This suggests the gains come from the quality of the synthesized data rather than from a particular base model.

Generalization to Single‑Agent Setting

Without the call_sub_agent tool, SearchSwarm achieves 52.0 on the BrowseComp subset and 53.3 on BrowseComp-ZH, versus 43.5 and 46.5 for Tongyi DeepResearch, suggesting that delegation training also improves general problem solving.

Generalization to Open‑Ended Deep Research

Model	ScholarQA‑v2	HealthBench	ResearchQA	DeepResearchBench	Average
Tongyi DeepResearch	46.5	46.2	66.7	40.6	50.0
Dr.Tulu	88.3	52.8	75.7	45.4	65.6
SearchSwarm (Ours)	79.2	52.8	80.2	44.4	64.2

SearchSwarm outperforms its base model by 14.2 points on average and approaches closed-source DeepResearch systems, despite being trained only on short-answer queries.
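The Average column and the 14.2-point gain can be reproduced from the per-benchmark scores in the table. The sketch below uses `Decimal` with half-up rounding, which is an assumption about how the averages were rounded; it happens to match every reported value.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-benchmark scores copied from the table:
# ScholarQA-v2, HealthBench, ResearchQA, DeepResearchBench.
scores = {
    "Tongyi DeepResearch": ["46.5", "46.2", "66.7", "40.6"],
    "Dr.Tulu": ["88.3", "52.8", "75.7", "45.4"],
    "SearchSwarm": ["79.2", "52.8", "80.2", "44.4"],
}


def rounded_mean(values):
    """Mean of decimal strings, rounded half-up to one decimal place."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


averages = {model: rounded_mean(vals) for model, vals in scores.items()}
gain = averages["SearchSwarm"] - averages["Tongyi DeepResearch"]
```

The unrounded SearchSwarm mean is 64.15, so the reported 14.2-point gain is the difference of the rounded averages (64.2 and 50.0).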

Behavioral Analysis

The main agent primarily orchestrates: call_sub_agent accounts for over 70% of its tool calls on BrowseComp. Its direct tool usage is verification-oriented and dominated by visit, whereas subagents perform exploratory retrieval dominated by search. Python usage tracks computational needs and appears mainly on GAIA and xbench.
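A tool-usage breakdown of this kind can be computed by counting tool names over a set of trajectories. The helper below is a hypothetical sketch, and the call list is made-up toy data, not the paper's measurements.

```python
from collections import Counter
from typing import Dict, List


def tool_share(calls: List[str]) -> Dict[str, float]:
    """Fraction of calls going to each tool; empty input yields an empty dict."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()} if total else {}


# Toy main-agent call log (illustrative only).
main_calls = ["call_sub_agent", "call_sub_agent", "visit", "call_sub_agent"]
shares = tool_share(main_calls)
```

Running the same function separately on main-agent and subagent call logs would give the kind of per-role comparison described above.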

Theoretical and Practical Implications

Context Management via Delegation: The paradigm reinterprets delegation as an intelligent, content‑aware compression mechanism that frees the main agent’s context budget for high‑level reasoning, in contrast to fixed‑rule truncation or summarization.
Lightweight Competitiveness: A 30B‑A3B model with trained delegation intelligence can match or exceed models 10× larger on long‑horizon research tasks, suggesting that structural intelligence (decomposition, coordination) can compensate for raw parameter count.
Generalizable Skills: Delegation training improves performance even in single‑agent settings and on open‑ended tasks, indicating that the learned patterns of systematic decomposition and verification are broadly beneficial.
Open‑Source Facilitation: Releasing the harness, model weights, and training data provides a complete recipe for others to study and extend delegation intelligence in agentic workflows.

Conclusion

SearchSwarm presents a preliminary yet effective exploration of training delegation intelligence for long-horizon agent tasks, demonstrated on deep research. A purpose-built harness elicits high-quality delegation behavior at inference time; its trajectories, used as supervised fine-tuning data, internalize that behavior into model weights. The resulting 30B-A3B model achieves state-of-the-art performance among comparable-scale models on four challenging benchmarks and remains competitive with models over 10× larger. The acquired capabilities generalize to single-agent settings and open-ended research tasks. Future work may extend the paradigm to deeper delegation hierarchies, other task domains, and reinforcement-learning-based training of the main agent's delegation policy.