Summary (Overview)
- Problem: Long-horizon agent tasks generate unbounded context demands, but LLM context windows are inherently finite. Existing passive context management (truncation, summarization) lacks prior planning.
- Proposal: Introduce delegation intelligence, in which the main agent decomposes tasks, dispatches bounded subtasks to subagents via call_sub_agent, and integrates condensed reports. A harness guides optimal delegation behavior at inference.
- Data Synthesis: Harness-guided trajectories that encode correct delegation decisions are used as supervised fine-tuning (SFT) data to internalize delegation behavior into model weights.
- Results: SearchSwarm-30B-A3B achieves 68.1 on BrowseComp, 73.3 on BrowseComp-ZH, 82.5 on GAIA, and 80.8 on xbench-DeepSearch — best among all models of comparable scale and competitive with models over 10× larger.
- Open Source: Harness, model weights, and training data are released to facilitate future research.
Introduction and Theoretical Foundation
Large language models (LLMs) increasingly act as agents for complex, multi-step tasks whose information needs grow without bound. Yet context windows remain finite, creating a fundamental tension. Early context management strategies — summarization after length thresholds, discarding old tool outputs — are passive: they react after budget exhaustion without prior planning.
An alternative paradigm is delegation: a main agent decomposes the task in advance, dispatches bounded subtasks to subagents, and receives only condensed results, actively preserving context budget. The required capability is termed delegation intelligence: the ability to decompose, determine when and what to delegate, and integrate results.
Naturally occurring text rarely exhibits explicit multi-agent coordination, making training data for delegation intelligence scarce. The paper presents a structured recipe to synthesize such data in the deep research domain, a representative long-horizon agent task. The core idea: design a harness that elicits high-quality delegation at inference time, then use the resulting trajectories as SFT data to internalize the behavior.
Methodology
Formulation
The deep research task is modelled as a multi-turn ReAct (Yao et al., 2022) interaction. At step ( t ):
- Thought ( \tau_t ): internal reasoning
- Action ( a_t ): tool call (including call_sub_agent)
- Observation ( o_t ): environment result
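Trajectory: the ordered sequence of thought, action, and observation triples ( (\tau_t, a_t, o_t) ) accumulated over the steps.
Policy: at each step the model produces ( \tau_t ) and ( a_t ) conditioned on the history so far.
When ( a_t = \mathrm{call\_sub\_agent}(b) ), the brief ( b ) triggers an independent sub‑trajectory, and the main agent receives ( o_t = r ) (the condensed report). The subagent's intermediate steps are not visible to the main agent.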
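One way to write the formulation above explicitly is sketched below. The symbols ( \mathcal{H}_t ), ( \pi_\theta ), ( q ) (the user query), ( K ), and the primed sub‑trajectory variables are notational assumptions introduced here for illustration; the paper's own notation may differ.

```latex
% Main-agent trajectory after T steps (q is the user query; notation assumed)
\[
\mathcal{H}_T = \bigl(q,\ \tau_1, a_1, o_1,\ \ldots,\ \tau_T, a_T, o_T\bigr)
\]

% Policy: thought and action are sampled given the history so far
\[
(\tau_t, a_t) \sim \pi_\theta\bigl(\cdot \mid \mathcal{H}_{t-1}\bigr)
\]

% Delegation: a brief b starts an independent sub-trajectory ending in a report r;
% only r is returned to the main agent as its observation
\[
a_t = \mathrm{call\_sub\_agent}(b)
\;\Longrightarrow\;
\mathcal{H}^{\mathrm{sub}} = \bigl(b,\ \tau'_1, a'_1, o'_1,\ \ldots,\ \tau'_K, a'_K, o'_K,\ r\bigr),
\qquad o_t = r
\]
```

Under this reading, the main history grows by the length of ( r ) rather than by the length of the whole sub‑trajectory, which is where the context saving comes from.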
Harness Design
The harness consists of a tool set and system prompts that guide an LLM toward high-quality delegation via four principles:
| Principle | Description |
|---|---|
| Encouraging delegation | Main agent should offload token‑expensive but cognitively shallow information gathering to subagents, reserving its own context for high‑level coordination and verification. |
| Comprehensive briefing | Each brief must include subtask description, why it matters, what is already known, what remains uncertain, and directions already ruled out — so subagents operate with full context. |
| Main agent retains core judgment | Only the main agent has a complete view across subtasks; it must independently decide hypotheses, termination, and adjudication. Subagents focus on evidence gathering. |
| Citation‑grounded reporting | Subagent reports attach inline citations to every important conclusion. The main agent propagates these to its final answer, ensuring end‑to‑end traceability. |
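The comprehensive-briefing principle can be made concrete with a small data structure. The sketch below is illustrative only: the `Brief` class, its field names, and the `render` format are hypothetical and are not taken from the released harness; the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Brief:
    """Hypothetical container for the five elements a brief should carry."""
    subtask: str                                         # what the subagent should do
    why_it_matters: str                                  # how it serves the overall task
    known: List[str] = field(default_factory=list)       # facts already established
    uncertain: List[str] = field(default_factory=list)   # questions still open
    ruled_out: List[str] = field(default_factory=list)   # directions already excluded

    def render(self) -> str:
        """Serialize the brief into plain text for the subagent's prompt."""
        def bullets(items: List[str]) -> str:
            # An empty section is made explicit instead of being dropped.
            return "\n".join(f"- {item}" for item in items) or "- (none)"

        return (
            f"Subtask: {self.subtask}\n"
            f"Why it matters: {self.why_it_matters}\n"
            f"Already known:\n{bullets(self.known)}\n"
            f"Still uncertain:\n{bullets(self.uncertain)}\n"
            f"Ruled out:\n{bullets(self.ruled_out)}"
        )


# Invented example values, for illustration only.
brief = Brief(
    subtask="Find the publication year of the candidate paper.",
    why_it_matters="The year disambiguates two candidate authors.",
    known=["The paper is about graph sparsification."],
    ruled_out=["Candidates published before 2010."],
)
text = brief.render()
```

Keeping all five sections present, even when empty, makes it visible to the subagent which kinds of context the main agent had nothing to say about.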
Tools include search, visit, google_scholar, python, and call_sub_agent. Delegation is single‑level: subagents cannot delegate further.
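A minimal sketch of how single-level delegation could be enforced is to hide call_sub_agent from the subagent's tool list. Only the five tool names come from the text; `tools_for`, `run_sub_agent`, `main_step`, and `stub_policy` are hypothetical helpers, and the stub stands in for an LLM-driven subagent.

```python
from typing import Callable, List

# Tool names listed in the text; no real tool implementations are provided here.
MAIN_TOOLS = ["search", "visit", "google_scholar", "python", "call_sub_agent"]

Policy = Callable[[str, List[str]], str]


def tools_for(role: str) -> List[str]:
    """Return the tool names exposed to a role ("main" or "sub")."""
    if role == "main":
        return list(MAIN_TOOLS)
    if role == "sub":
        # Single-level delegation: subagents never see call_sub_agent.
        return [name for name in MAIN_TOOLS if name != "call_sub_agent"]
    raise ValueError(f"unknown role: {role!r}")


def run_sub_agent(brief: str, policy: Policy) -> str:
    """Run a subagent in a fresh context and return only its condensed report."""
    return policy(brief, tools_for("sub"))


def main_step(context: List[str], brief: str, policy: Policy) -> List[str]:
    """Delegate one subtask; only the call and the report enter the main context."""
    report = run_sub_agent(brief, policy)
    return context + [f"call_sub_agent({brief!r})", report]


def stub_policy(brief: str, tools: List[str]) -> str:
    """Stand-in for a subagent rollout; returns a short report with a citation tag."""
    return f"Report for: {brief} [1]"


context = main_step(["user query"], "find the founding year", stub_policy)
```

However many steps the subagent takes internally, the main context grows by exactly two entries per delegation in this sketch: the call and the report.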
Supervised Fine‑Tuning
Data collection: Queries from RedSearcher and OpenSeeker datasets. The model executes deep research under harness guidance. Two configurations:
- Same model as main and subagent – both trajectories retained.
- Stronger main agent + weaker subagent – only main agent trajectories retained (more deliberate decomposition and verification).
Filtering: Retain only main trajectories with correct final answers. Remove repeated tool calls, hallucinated citations, tool misuse.
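The filtering step might look roughly like the sketch below for main-agent trajectories. The trajectory schema (`final_answer`, `gold_answer`, `tool_calls`, `cited_urls`, `observed_urls`) is hypothetical, exact string matching stands in for whatever correctness check the authors use, and "tool misuse" is approximated as calling a tool outside an allowed set; none of these details come from the source.

```python
from typing import Any, Dict, List, Set, Tuple

ALLOWED = {"search", "visit", "google_scholar", "python", "call_sub_agent"}


def has_repeated_calls(tool_calls: List[Tuple[str, str]]) -> bool:
    """True if the same (tool, argument) pair is issued more than once."""
    return len(set(tool_calls)) < len(tool_calls)


def has_hallucinated_citation(traj: Dict[str, Any]) -> bool:
    """True if a cited URL never appeared in any observation."""
    return any(url not in traj["observed_urls"] for url in traj["cited_urls"])


def keep(traj: Dict[str, Any], allowed_tools: Set[str]) -> bool:
    """Apply the filters in order; every check must pass."""
    # Assumption: correctness is approximated by case-insensitive exact match.
    if traj["final_answer"].strip().lower() != traj["gold_answer"].strip().lower():
        return False
    if has_repeated_calls(traj["tool_calls"]):
        return False
    if has_hallucinated_citation(traj):
        return False
    # Assumption: tool misuse means calling a tool that is not in the allowed set.
    return all(name in allowed_tools for name, _ in traj["tool_calls"])


# Invented toy trajectories, for illustration only.
good = {
    "final_answer": "1998",
    "gold_answer": "1998",
    "tool_calls": [("search", "q1"), ("visit", "u1")],
    "cited_urls": ["u1"],
    "observed_urls": {"u1", "u2"},
}
candidates = [
    good,
    dict(good, final_answer="1999"),                               # wrong answer
    dict(good, tool_calls=[("search", "q1"), ("search", "q1")]),   # repeated call
    dict(good, cited_urls=["u9"]),                                 # unseen citation
]
kept = [t for t in candidates if keep(t, ALLOWED)]
```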
Training objective: next‑token prediction with environment masking. The loss is computed only on model outputs ( a_t ); environment returns ( o_t ) are masked out.
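Environment masking can be illustrated with a framework-free sketch: tokens from environment returns get a mask of 0 and contribute nothing to the loss. The helper names, the segment representation, and the choice to average over unmasked tokens are assumptions made for this example, not details from the source.

```python
import math
from typing import List, Tuple


def build_mask(segments: List[Tuple[str, int]]) -> List[int]:
    """Build a per-token loss mask from (source, n_tokens) pairs.

    source is "model" for model outputs and "env" for environment returns.
    """
    mask: List[int] = []
    for source, n_tokens in segments:
        mask.extend([1 if source == "model" else 0] * n_tokens)
    return mask


def masked_nll(token_log_probs: List[float], loss_mask: List[int]) -> float:
    """Mean negative log-likelihood over positions where loss_mask == 1."""
    if len(token_log_probs) != len(loss_mask):
        raise ValueError("log-probs and mask must have the same length")
    n_kept = sum(loss_mask)
    if n_kept == 0:
        return 0.0
    total = -sum(lp for lp, m in zip(token_log_probs, loss_mask) if m)
    return total / n_kept


# Toy sequence: 2 model tokens, 3 environment tokens, 1 model token.
mask = build_mask([("model", 2), ("env", 3), ("model", 1)])
log_probs = [math.log(0.5)] * 6
loss = masked_nll(log_probs, mask)
```

In this toy case only three of the six positions are scored, so the loss equals the negative log of 0.5 regardless of what probability the model assigns to the environment tokens.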
Empirical Validation / Results
Benchmarks and Baselines
Evaluated on BrowseComp, BrowseComp-ZH, GAIA, and xbench-DeepSearch-2505. Compared against closed‑source (GPT‑5, Claude‑4.5, Gemini‑3.0, etc.), large open‑source (DeepSeek V3.2, GLM‑4.7, etc.), and lightweight open‑source models (Tongyi DeepResearch, RedSearcher, LongSeeker, MiroThinker).
Main Results
| Model | Size | BrowseComp | BrowseComp‑ZH | GAIA | xbench-DeepSearch-2505 |
|---|---|---|---|---|---|
| Closed‑source | |||||
| GPT‑5.2‑Thinking | – | 65.8 | 76.1 | – | – |
| Claude‑4.5‑Opus | – | 67.8 | 62.4 | 71.5 | – |
| Seed‑2.0‑Pro | – | 77.3* | 82.4* | 78.6 | – |
| Open‑source large | |||||
| DeepSeek V3.2 | 671B‑A37B | 67.6* | 65.0* | 75.1 | 78.0 |
| GLM‑4.7 | 355B‑A32B | 67.5* | 66.6* | – | 72.0 |
| Step‑3.5‑Flash | 196B‑A11B | 69.0* | 66.9 | 84.5 | 83.7 |
| Lightweight open‑source | |||||
| Tongyi DeepResearch | 30B‑A3B | 43.4 | 46.7 | 70.9 | 75.0 |
| RedSearcher | 30B‑A3B | 57.4* | 58.2* | 80.1 | – |
| LongSeeker | 30B‑A3B | 61.5* | 62.5* | 77.7* | 78.0* |
| MiroThinker‑1.7‑mini | 30B‑A3B | 67.9* | 72.3* | 80.3* | – |
| SearchSwarm (Ours) | 30B‑A3B | **68.1\*** | **73.3\*** | **82.5\*** | **80.8\*** |
\* indicates context‑management techniques. Bold: best among lightweight models.
SearchSwarm surpasses all comparable‑scale models and is competitive with much larger frontier models.
Harness Ablation (200‑question BrowseComp subset)
| Configuration | Score |
|---|---|
| Base Tongyi framework | 47.7 |
| + call_sub_agent tool (schema only) | 50.0 (+2.3) |
| + full harness (principles) | 57.7 (+10.0) |
Full harness substantially improves performance over simply providing the tool.
Training from a Different Base Model
Fine‑tuning Qwen3‑30B‑A3B‑Thinking‑2507 on the same data yields 66.5 (200‑question BrowseComp) and 64.0 (BrowseComp‑ZH), exceeding RedSearcher and LongSeeker and indicating that the gains come from the quality of the data rather than from one particular base model.
Generalization to Single‑Agent Setting
Without the call_sub_agent tool, SearchSwarm achieves 52.0 / 53.3 on the BrowseComp subset and BrowseComp‑ZH, versus 43.5 / 46.5 for Tongyi DeepResearch, suggesting that training on delegation also improves single‑agent problem solving.
Generalization to Open‑Ended Deep Research
| Model | ScholarQA‑v2 | HealthBench | ResearchQA | DeepResearchBench | Average |
|---|---|---|---|---|---|
| Tongyi DeepResearch | 46.5 | 46.2 | 66.7 | 40.6 | 50.0 |
| Dr.Tulu | 88.3 | 52.8 | 75.7 | 45.4 | 65.6 |
| SearchSwarm (Ours) | 79.2 | 52.8 | 80.2 | 44.4 | 64.2 |
SearchSwarm outperforms its base model (Tongyi DeepResearch) by 14.2 points on average and approaches closed‑source DeepResearch systems, despite being trained only on short‑answer queries.
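As a sanity check, the Average column can be recomputed from the four per-benchmark scores in the table. The sketch assumes the average is an unweighted mean, which the source does not state explicitly.

```python
# Scores copied from the table: ScholarQA-v2, HealthBench, ResearchQA, DeepResearchBench.
scores = {
    "Tongyi DeepResearch": [46.5, 46.2, 66.7, 40.6],
    "Dr.Tulu": [88.3, 52.8, 75.7, 45.4],
    "SearchSwarm": [79.2, 52.8, 80.2, 44.4],
}

# Assumption: "Average" is the unweighted mean of the four benchmark scores.
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Gap between SearchSwarm and its base model, before rounding.
gap = averages["SearchSwarm"] - averages["Tongyi DeepResearch"]
```

Under that assumption the means come out at about 50.0, 65.55, and 64.15, which match the reported 50.0, 65.6, and 64.2 to one decimal place, and the unrounded gap of about 14.15 is consistent with the stated 14.2 points.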
Behavioral Analysis
The main agent primarily orchestrates: call_sub_agent accounts for more than 70% of its tool calls on BrowseComp. Its direct tool usage is verification‑oriented and dominated by visit, whereas subagents perform exploratory retrieval dominated by search. Python usage reflects computational needs (notably on GAIA and xbench).
Theoretical and Practical Implications
- Context Management via Delegation: The paradigm reinterprets delegation as an intelligent, content‑aware compression mechanism that frees the main agent’s context budget for high‑level reasoning, in contrast to fixed‑rule truncation or summarization.
- Lightweight Competitiveness: A 30B‑A3B model with trained delegation intelligence can match or exceed models 10× larger on long‑horizon research tasks, suggesting that structural intelligence (decomposition, coordination) can compensate for raw parameter count.
- Generalizable Skills: Delegation training improves performance even in single‑agent settings and on open‑ended tasks, indicating that the learned patterns of systematic decomposition and verification are broadly beneficial.
- Open‑Source Facilitation: Releasing the harness, model weights, and training data provides a complete recipe for others to study and extend delegation intelligence in agentic workflows.
Conclusion
SearchSwarm presents a preliminary yet effective exploration of training delegation intelligence for long‑horizon agent tasks, demonstrated on deep research. A specially designed harness elicits optimal delegation behavior at inference; its trajectories, when used as supervised fine‑tuning data, internalize that intelligence into model weights. The resulting 30B‑A3B model achieves state‑of‑the‑art performance among comparable‑scale models on four challenging benchmarks and remains competitive with models over 10× larger. The acquired capabilities generalize to single‑agent settings and open‑ended research tasks. Future work may extend the paradigm to deeper hierarchies of delegation, other task domains, and reinforcement learning based training of the main agent’s delegation policy.