# SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

> Training a 30B-A3B model on harness-elicited delegation trajectories yields state-of-the-art results among comparable-scale models on long-horizon benchmarks and is competitive with models over 10× larger.

- **Source:** [arXiv](https://arxiv.org/abs/2606.09730)
- **Published:** 2026-06-11
- **Permalink:** https://picx.dev/p/aFopgw
- **Whiteboard:** https://picx.dev/p/aFopgw/image

## Summary (Overview)

- **Problem**: Long-horizon agent tasks generate unbounded context demands, but LLM context windows are inherently finite. Existing passive context management (truncation, summarization) lacks prior planning.
- **Proposal**: Introduce *delegation intelligence* — the main agent decomposes tasks, dispatches bounded subtasks to subagents via `call_sub_agent`, and integrates condensed reports. A harness guides optimal delegation behavior at inference.
- **Data Synthesis**: Harness-guided trajectories that encode correct delegation decisions are used as supervised fine-tuning (SFT) data to internalize delegation behavior into model weights.
- **Results**: SearchSwarm-30B-A3B achieves 68.1 on BrowseComp, 73.3 on BrowseComp-ZH, 82.5 on GAIA, and 80.8 on xbench-DeepSearch — best among all models of comparable scale and competitive with models over 10× larger.
- **Open Source**: Harness, model weights, and training data are released to facilitate future research.

## Introduction and Theoretical Foundation

Large language models (LLMs) increasingly act as agents for complex, multi-step tasks whose information needs grow without bound. Yet context windows remain finite, creating a fundamental tension. Early context management strategies — summarization after length thresholds, discarding old tool outputs — are *passive*: they react after budget exhaustion without prior planning.

An alternative paradigm is **delegation**: a main agent decomposes the task in advance, dispatches bounded subtasks to subagents, and receives only condensed results, actively preserving context budget. The required capability is termed **delegation intelligence**: the ability to decompose, determine when and what to delegate, and integrate results.

Naturally occurring text rarely exhibits explicit multi-agent coordination, making training data for delegation intelligence scarce. The paper presents a structured recipe to synthesize such data in the **deep research** domain, a representative long-horizon agent task. The core idea: design a *harness* that elicits high-quality delegation at inference time, then use the resulting trajectories as SFT data to internalize the behavior.

## Methodology

### Formulation

The deep research task is modelled as a multi-turn ReAct (Yao et al., 2022) interaction. At step \( t \):

- **Thought** \( \tau_t \): internal reasoning
- **Action** \( a_t \): tool call (including `call_sub_agent`)
- **Observation** \( o_t \): environment result

Trajectory:
$$
H_T = \langle q, (\tau_0, a_0, o_0), \ldots, (\tau_T, a_T, o_T), y \rangle  \tag{1}
$$

Policy:
$$
\tau_t, a_t \sim \pi(\cdot \mid q, H_{t-1})  \tag{2}
$$

When \( a_t = \text{call\_sub\_agent}(b) \), the brief \( b \) triggers an independent sub‑trajectory:
$$
H_{\text{sub}} = \langle b, (\tau^s_0, a^s_0, o^s_0), \ldots, (\tau^s_S, a^s_S, o^s_S), r \rangle  \tag{3}
$$
and the main agent receives \( o_t = r \) (the condensed report). Intermediate steps are not visible.
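The formulation above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `Step`, `Trajectory`, `rollout`, `make_call_sub_agent`, the `"finish"` action, and the stub policies are all hypothetical names, and the stubs stand in for the LLM policy \( \pi \).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    thought: str      # tau_t: internal reasoning
    action: str       # a_t: name of the tool that was called
    argument: str     # tool input; for call_sub_agent this is the brief b
    observation: str  # o_t: what the environment returned


@dataclass
class Trajectory:
    query: str                                   # q (main agent) or b (subagent)
    steps: List[Step] = field(default_factory=list)
    answer: str = ""                             # y (main agent) or r (subagent)


# A policy maps (query, history so far) to (thought, action, argument).
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]
Tool = Callable[[str], str]


def rollout(query: str, policy: Policy, tools: Dict[str, Tool],
            max_steps: int = 8) -> Trajectory:
    """Run a ReAct loop until the policy emits the hypothetical 'finish' action."""
    traj = Trajectory(query)
    for _ in range(max_steps):
        thought, action, argument = policy(query, traj.steps)
        if action == "finish":
            traj.answer = argument
            break
        observation = tools[action](argument)
        traj.steps.append(Step(thought, action, argument, observation))
    return traj


def make_call_sub_agent(sub_policy: Policy, sub_tools: Dict[str, Tool]):
    sub_runs: List[Trajectory] = []  # kept for inspection only

    def call_sub_agent(brief: str) -> str:
        sub = rollout(brief, sub_policy, sub_tools)
        sub_runs.append(sub)
        return sub.answer  # only the condensed report r crosses the boundary

    return call_sub_agent, sub_runs


# Stub policies standing in for the LLM (assumptions for illustration).
def sub_policy(brief, steps):
    if not steps:
        return ("look it up", "search", brief)
    return ("enough evidence", "finish", f"report: {steps[-1].observation}")


def main_policy(query, steps):
    if not steps:
        return ("delegate retrieval", "call_sub_agent", f"find: {query}")
    return ("integrate report", "finish", steps[-1].observation)


sub_tools = {"search": lambda q: f"results for '{q}'"}
call_sub_agent, sub_runs = make_call_sub_agent(sub_policy, sub_tools)
main = rollout("who wrote X?", main_policy, {"call_sub_agent": call_sub_agent})
print(main.steps[0].observation)
```

The main trajectory records a single `call_sub_agent` step whose observation is the subagent's report; the subagent's own `search` step lives only in `sub_runs`, mirroring the statement that intermediate steps are not visible to the main agent.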

### Harness Design

The harness consists of a tool set and system prompts that guide an LLM toward high-quality delegation via four principles:

| Principle | Description |
|-----------|-------------|
| **Encouraging delegation** | Main agent should offload token‑expensive but cognitively shallow information gathering to subagents, reserving its own context for high‑level coordination and verification. |
| **Comprehensive briefing** | Each brief must include subtask description, why it matters, what is already known, what remains uncertain, and directions already ruled out — so subagents operate with full context. |
| **Main agent retains core judgment** | Only the main agent has a complete view across subtasks; it must independently decide hypotheses, termination, and adjudication. Subagents focus on evidence gathering. |
| **Citation‑grounded reporting** | Subagent reports attach inline citations to every important conclusion. The main agent propagates these to its final answer, ensuring end‑to‑end traceability. |

The tool set comprises `search`, `visit`, `google_scholar`, `python`, and `call_sub_agent`. Delegation is single-level: subagents cannot delegate further.
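As an illustration of the comprehensive-briefing principle and the single-level restriction, a brief could be modelled as a small structured record. The sketch below is an assumption for clarity only: the harness described here works through system prompts, and the `Brief` class, its field names, and the `MAIN_TOOLS`/`SUB_TOOLS` lists are hypothetical rather than the released tool schema.

```python
from dataclasses import dataclass, field
from typing import List

MAIN_TOOLS = ["search", "visit", "google_scholar", "python", "call_sub_agent"]
# Single-level delegation: subagents get every tool except call_sub_agent.
SUB_TOOLS = [tool for tool in MAIN_TOOLS if tool != "call_sub_agent"]


@dataclass
class Brief:
    """The five elements a brief should carry (hypothetical field names)."""
    subtask: str
    why_it_matters: str
    known: List[str] = field(default_factory=list)
    uncertain: List[str] = field(default_factory=list)
    ruled_out: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of fields that were left empty."""
        return [name for name, value in vars(self).items() if not value]

    def render(self) -> str:
        """Serialize the brief into the text handed to the subagent."""
        def bullets(items: List[str]) -> str:
            return "\n".join(f"- {item}" for item in items) or "- (none)"

        return "\n".join([
            f"Subtask: {self.subtask}",
            f"Why it matters: {self.why_it_matters}",
            "Already known:", bullets(self.known),
            "Still uncertain:", bullets(self.uncertain),
            "Ruled out:", bullets(self.ruled_out),
        ])


brief = Brief(
    subtask="Find the publication year of the paper introducing ReAct.",
    why_it_matters="The main agent needs the year to order events in its answer.",
    known=["The first author is Yao."],
    uncertain=["Whether the preprint and conference years differ."],
)
print(brief.missing_fields())
print(brief.render())
```

A caller could use `missing_fields()` to warn when a brief omits one of the five elements; here it flags `ruled_out`, which was left empty.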

### Supervised Fine‑Tuning

**Data collection**: Queries are drawn from the RedSearcher and OpenSeeker datasets, and the model carries out deep research on them under harness guidance. Two configurations are used:
1. Same model as main and subagent – both trajectories retained.
2. Stronger main agent + weaker subagent – only main agent trajectories retained (more deliberate decomposition and verification).

**Filtering**: Retain only main trajectories with correct final answers. Remove repeated tool calls, hallucinated citations, and tool misuse.
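One possible reading of the filtering step is sketched below, assuming that a trajectory showing any of the listed defects is dropped whole. The trajectory layout, the helper names, and the exact-match answer check are all assumptions made for illustration; the source does not specify how correctness or each defect is detected.

```python
from typing import Dict, List, Set

ALLOWED_TOOLS = {"search", "visit", "google_scholar", "python", "call_sub_agent"}


def has_repeated_calls(calls: List[Dict[str, str]]) -> bool:
    """True if the same tool is called twice with an identical argument."""
    seen = set()
    for call in calls:
        key = (call["tool"], call["argument"])
        if key in seen:
            return True
        seen.add(key)
    return False


def has_tool_misuse(calls: List[Dict[str, str]]) -> bool:
    """Assumed proxy for misuse: calling a tool that is not in the tool set."""
    return any(call["tool"] not in ALLOWED_TOOLS for call in calls)


def has_hallucinated_citation(cited: List[str], observed: Set[str]) -> bool:
    """True if a cited URL never appeared in any observation."""
    return any(url not in observed for url in cited)


def keep(traj: dict, gold_answer: str) -> bool:
    """Keep a trajectory only if it is correct and free of the listed defects."""
    correct = traj["answer"].strip().lower() == gold_answer.strip().lower()
    return (
        correct
        and not has_repeated_calls(traj["calls"])
        and not has_tool_misuse(traj["calls"])
        and not has_hallucinated_citation(traj["citations"], set(traj["observed_urls"]))
    )


# Toy trajectories (made up for the example).
good = {
    "answer": "Paris",
    "calls": [
        {"tool": "search", "argument": "capital of France"},
        {"tool": "visit", "argument": "https://example.org/france"},
    ],
    "citations": ["https://example.org/france"],
    "observed_urls": ["https://example.org/france"],
}
bad = dict(good, citations=["https://example.org/never-visited"])

kept = [t for t in (good, bad) if keep(t, "paris")]
print(len(kept))
```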

**Training objective**: next‑token prediction with environment masking:
$$
\mathcal{L} = -\sum_{t=1}^{T} \sum_{j=1}^{|a_t|} \log p_\theta \left( a_t^{(j)} \mid a_t^{(<j)}, \tau_{<t} \right)  \tag{5}
$$
Loss computed only on model outputs \( a_t \); environment returns \( o_t \) are masked.
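The effect of environment masking can be shown with a toy calculation. The sketch below assumes each token already comes with the probability the model assigns to it and a role label; the `Segment` layout, the role names, and `masked_nll` are hypothetical, and a real training loop would apply the same mask to per-token cross-entropy inside a deep-learning framework.

```python
import math
from typing import List, Sequence, Tuple

# (role, probabilities the model assigns to the ground-truth tokens of that segment)
Segment = Tuple[str, List[float]]


def masked_nll(segments: List[Segment],
               trainable_roles: Sequence[str] = ("model",)) -> float:
    """Sum -log p over tokens in trainable segments; all other tokens are masked."""
    loss = 0.0
    for role, probs in segments:
        if role in trainable_roles:
            loss -= sum(math.log(p) for p in probs)
    return loss


segments = [
    ("query", [0.5, 0.5]),             # q: context only, no loss
    ("model", [0.5, 0.25]),            # model output at step 0
    ("environment", [0.1, 0.1, 0.1]),  # o_0: masked
    ("model", [0.5]),                  # model output at step 1
]
loss = masked_nll(segments)
print(loss)
```

Only the three model-generated tokens contribute, so the result equals \( -\log(0.5 \cdot 0.25 \cdot 0.5) = \log 16 \); the low-probability environment tokens do not affect the loss, although they remain in the context the model conditions on.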

## Empirical Validation / Results

### Benchmarks and Baselines

Evaluated on BrowseComp, BrowseComp-ZH, GAIA, and xbench-DeepSearch-2505. Compared against closed‑source (GPT‑5, Claude‑4.5, Gemini‑3.0, etc.), large open‑source (DeepSeek V3.2, GLM‑4.7, etc.), and lightweight open‑source models (Tongyi DeepResearch, RedSearcher, LongSeeker, MiroThinker).

### Main Results

| Model | Size | BrowseComp | BrowseComp‑ZH | GAIA | xbench-DeepSearch-2505 |
|-------|------|-----------|--------------|------|------------------------|
| **Closed‑source** | | | | | |
| GPT‑5.2‑Thinking | – | 65.8 | 76.1 | – | – |
| Claude‑4.5‑Opus | – | 67.8 | 62.4 | 71.5 | – |
| Seed‑2.0‑Pro | – | 77.3* | 82.4* | 78.6 | – |
| **Open‑source large** | | | | | |
| DeepSeek V3.2 | 671B‑A37B | 67.6* | 65.0* | 75.1 | 78.0 |
| GLM‑4.7 | 355B‑A32B | 67.5* | 66.6* | – | 72.0 |
| Step‑3.5‑Flash | 196B‑A11B | 69.0* | 66.9 | 84.5 | 83.7 |
| **Lightweight open‑source** | | | | | |
| Tongyi DeepResearch | 30B‑A3B | 43.4 | 46.7 | 70.9 | 75.0 |
| RedSearcher | 30B‑A3B | 57.4* | 58.2* | 80.1 | – |
| LongSeeker | 30B‑A3B | 61.5* | 62.5* | 77.7* | 78.0* |
| MiroThinker‑1.7‑mini | 30B‑A3B | 67.9* | 72.3* | 80.3* | – |
| **SearchSwarm (Ours)** | **30B‑A3B** | **68.1*** | **73.3*** | **82.5*** | **80.8*** |

\* marks results obtained with context-management techniques. **Bold**: best among lightweight models.

SearchSwarm surpasses all comparable-scale models and is competitive with much larger frontier models.

### Harness Ablation (200‑question BrowseComp subset)

| Configuration | Score |
|---------------|-------|
| Base Tongyi framework | 47.7 |
| + `call_sub_agent` tool (schema only) | 50.0 (+2.3) |
| + full harness (principles) | 57.7 (+10.0) |

Full harness substantially improves performance over simply providing the tool.

### Training from a Different Base Model

Fine-tuning Qwen3-30B-A3B-Thinking-2507 on the same data yields 66.5 on the 200-question BrowseComp subset and 64.0 on BrowseComp-ZH, exceeding RedSearcher and LongSeeker and pointing to the quality of the synthesized data.

### Generalization to Single‑Agent Setting

Without the `call_sub_agent` tool, SearchSwarm achieves 52.0 on the BrowseComp subset and 53.3 on BrowseComp-ZH, versus 43.5 and 46.5 for Tongyi DeepResearch, suggesting that training on delegation also improves general problem-solving.

### Generalization to Open‑Ended Deep Research

| Model | ScholarQA‑v2 | HealthBench | ResearchQA | DeepResearchBench | Average |
|-------|-------------|-------------|------------|-------------------|---------|
| Tongyi DeepResearch | 46.5 | 46.2 | 66.7 | 40.6 | 50.0 |
| Dr.Tulu | 88.3 | 52.8 | 75.7 | 45.4 | 65.6 |
| **SearchSwarm (Ours)** | **79.2** | **52.8** | **80.2** | **44.4** | **64.2** |

SearchSwarm outperforms its base model by 14.2 points on average and approaches closed-source DeepResearch systems, despite being trained only on short-answer queries.
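The averages in the table can be re-derived from the four per-benchmark scores. The snippet below is only an arithmetic check on the numbers reported above; the variable names are arbitrary.

```python
# Per-benchmark scores copied from the table:
# ScholarQA-v2, HealthBench, ResearchQA, DeepResearchBench.
scores = {
    "Tongyi DeepResearch": [46.5, 46.2, 66.7, 40.6],
    "Dr.Tulu": [88.3, 52.8, 75.7, 45.4],
    "SearchSwarm": [79.2, 52.8, 80.2, 44.4],
}

averages = {name: sum(values) / len(values) for name, values in scores.items()}
gain = averages["SearchSwarm"] - averages["Tongyi DeepResearch"]

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
print(f"gain over base model: {gain:.2f}")
```

The unrounded means are 50.00, 65.55, and 64.15, consistent with the table's one-decimal values; the reported 14.2-point gain is the difference of the rounded averages, while the unrounded difference is 14.15.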

### Behavioral Analysis

The main agent primarily orchestrates: `call_sub_agent` accounts for over 70% of its tool calls on BrowseComp. Its direct tool usage is verification-oriented (dominated by `visit`), whereas subagents perform exploratory retrieval (dominated by `search`). Python usage reflects computational needs (GAIA, xbench).

## Theoretical and Practical Implications

- **Context Management via Delegation**: The paradigm reinterprets delegation as an intelligent, content‑aware compression mechanism that frees the main agent’s context budget for high‑level reasoning, in contrast to fixed‑rule truncation or summarization.
- **Lightweight Competitiveness**: A 30B‑A3B model with trained delegation intelligence can match or exceed models 10× larger on long‑horizon research tasks, suggesting that structural intelligence (decomposition, coordination) can compensate for raw parameter count.
- **Generalizable Skills**: Delegation training improves performance even in single‑agent settings and on open‑ended tasks, indicating that the learned patterns of systematic decomposition and verification are broadly beneficial.
- **Open‑Source Facilitation**: Releasing the harness, model weights, and training data provides a complete recipe for others to study and extend delegation intelligence in agentic workflows.

## Conclusion

SearchSwarm presents a preliminary yet effective exploration of training **delegation intelligence** for long‑horizon agent tasks, demonstrated on deep research. A specially designed harness elicits optimal delegation behavior at inference; its trajectories, when used as supervised fine‑tuning data, internalize that intelligence into model weights. The resulting 30B‑A3B model achieves state‑of‑the‑art performance among comparable‑scale models on four challenging benchmarks and remains competitive with models over 10× larger. The acquired capabilities generalize to single‑agent settings and open‑ended research tasks. Future work may extend the paradigm to deeper hierarchies of delegation, other task domains, and reinforcement learning based training of the main agent’s delegation policy.
