Summary (Overview)

  • Agents-A1 is a 35B Mixture-of-Experts (MoE) agentic model that achieves performance competitive with trillion-parameter models (e.g., Kimi-K2.6, DeepSeek-V4-Pro) by scaling the agent horizon rather than the number of parameters.
  • The authors build a Long-Horizon Knowledge-Action Infrastructure that connects external knowledge, actions, observations, and verifier outcomes, producing agentic trajectories with an average length of 45K tokens.
  • A three-stage training recipe is employed: (1) full-domain supervised fine-tuning (SFT) for broad agentic behaviors, (2) domain-level teacher models trained for specialized expertise, and (3) multi-teacher domain-routed on-policy distillation (OPD) with Salient Vocabulary Alignment (SVA) to unify six heterogeneous domains into one student model.
  • Agents-A1 achieves leading results on long-horizon agent benchmarks including SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8), and remains highly competitive on SciCode (44.3), HLE (47.6), and BrowseComp (75.5).
  • The work provides a practical path for scaling the horizon with a small model (35B) to match or surpass the performance of much larger models (1T) on long-horizon tasks.

Introduction and Theoretical Foundation

Recent progress in LLMs pushes AI toward autonomous agents that plan, use tools, interact, and improve over long horizons (e.g., software engineering, scientific research, decision making). Two scaling routes exist:

  1. Parameter scaling (e.g., DSV4, Kimi-K2.6): relies on large model scale to internalize reasoning patterns, tool-use, and domain knowledge. Effective but computationally expensive.
  2. Horizon scaling: makes the intermediate decision process explicit – knowledge acquisition, action execution, observation interpretation, and verification become trainable supervision. However, two bottlenecks exist:
    • Knowledge infrastructure: long-horizon trajectory training requires a unified environment connecting external knowledge, actions, observations, and verification signals.
    • Domain integration: scaling the horizon requires integrating heterogeneous abilities (e.g., information retrieval, tool use, executable iteration, constraint tracking) that emerge unevenly across domains.

Agents-A1 addresses these by building a knowledge-action graph (KAG) defined as:

$$\mathcal{G}_d = (\mathcal{C}_d, \mathcal{A}_d, \mathcal{O}_d, \mathcal{V}_d)$$

where $\mathcal{C}_d$ is the domain corpus (evidence chunks, entities, facts), $\mathcal{A}_d$ is the action space (tool calls, retrieval, code edits), $\mathcal{O}_d$ is the observation space (tool returns, execution states), and $\mathcal{V}_d$ is the verifier set (correctness, evidence support checks).

The graph is populated by linked action records $(s_t, a_t, o_t, v_t)$ with $s_t \subseteq \mathcal{C}_d \cup \mathcal{O}_{<t}$, $a_t \in \mathcal{A}_d$, $o_t \in \mathcal{O}_d$, $v_t \in \mathcal{V}_d$.
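One way to picture these definitions is the minimal Python sketch below. The class and method names (`ActionRecord`, `KnowledgeActionGraph`, `add_record`) and the use of plain strings for graph items are illustrative assumptions, not the authors' implementation. The sketch only enforces the membership constraints stated above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ActionRecord:
    """One linked record (s_t, a_t, o_t, v_t); all names here are hypothetical."""
    state: frozenset   # s_t: subset of corpus items and earlier observations
    action: str        # a_t: must belong to the action space A_d
    observation: str   # o_t: tool return or execution state, stored in O_d
    verifier: str      # v_t: must belong to the verifier set V_d


@dataclass
class KnowledgeActionGraph:
    """G_d = (C_d, A_d, O_d, V_d) for a single domain d (illustrative only)."""
    corpus: set
    actions: set
    verifiers: set
    observations: list = field(default_factory=list)
    records: list = field(default_factory=list)

    def add_record(self, state, action, observation, verifier):
        # s_t may only cite corpus items or observations from earlier steps (O_{<t}).
        allowed = self.corpus | set(self.observations)
        if not set(state) <= allowed:
            raise ValueError("state must be a subset of C_d and O_{<t}")
        if action not in self.actions:
            raise ValueError("action is not in A_d")
        if verifier not in self.verifiers:
            raise ValueError("verifier is not in V_d")
        record = ActionRecord(frozenset(state), action, observation, verifier)
        self.observations.append(observation)
        self.records.append(record)
        return record
```

A record whose state cites an observation that has not been produced yet is rejected, which mirrors the $\mathcal{O}_{<t}$ restriction on $s_t$.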

To generate high-quality trajectories, a proposer–solver–verifier game is used: $\pi_\text{P}$ samples graph regions to propose tasks, $\pi_\text{S}$ solves them with tools, and $\pi_\text{V}$ verifies answers, executing trajectories according to criteria (verifiable, valid, process-informative, evidence-covering, unambiguously specified).
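The control flow of one round of this game might look like the sketch below. It assumes the verifier's criteria act as an accept/reject filter on trajectories, which the summary suggests but does not state outright. The function `self_play_round` and the three toy policies are hypothetical stand-ins for the learned policies.

```python
import random


def self_play_round(graph_regions, propose, solve, verify, rng):
    """Run one proposer -> solver -> verifier round; return a kept sample or None."""
    region = rng.choice(graph_regions)      # pi_P samples a graph region
    task = propose(region)                  # ... and proposes a task from it
    trajectory, answer = solve(task)        # pi_S solves the task with tools
    if verify(task, trajectory, answer):    # pi_V applies the acceptance check
        return {"task": task, "trajectory": trajectory, "answer": answer}
    return None


# Toy stand-ins for the three policies (hypothetical, for illustration only).
def toy_propose(region):
    return {"question": f"What is linked to {region}?", "gold": region.upper()}


def toy_solve(task):
    steps = [("search", task["question"])]  # a single tool call as the trajectory
    return steps, task["gold"]


def toy_verify(task, trajectory, answer):
    return answer == task["gold"] and len(trajectory) > 0


kept = self_play_round(["node-a", "node-b"], toy_propose, toy_solve,
                       toy_verify, random.Random(0))
```

Only samples that pass the verifier are returned, so the surviving trajectories carry a verified outcome alongside their intermediate steps.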

The key insight: scaling the horizon with a smaller model can match trillion-parameter performance by leveraging dense process-level supervision and multi-domain integration.

Methodology

1. Long-Horizon Knowledge-Action Infrastructure

A Knowledge-Action Graph (KAG) is constructed by decomposing agentic competence into five atomic abilities: information acquisition, tool calling, executable iteration, evidence verification, and constraint tracking. For each domain, the KAG is built from heterogeneous corpora (wiki pages, competition specs, scientific problems, tool schemas) and expanded through a self-play graph search loop. The resulting trajectories provide step-level supervision with average length 45K tokens.

2. Three-Stage Training Recipe

Stage 1: Full-Domain Supervised Fine-Tuning (SFT)

  • Initialization: Qwen3.5-35B-A3B.
  • Data: Approximately 100K trajectories across deep research, coding/engineering, scientific problem-solving, instruction following, and general agentic tasks. Average token lengths per domain range from 3K (instruction following) to 48K (coding). Overall average: 45K tokens.
  • Training: Standard cross-entropy loss on response tokens only; sample packing; hyperparameters: learning rate $1\times10^{-5}$, batch size 16, 1 epoch, max sequence length 131,072.

| Data Source | Avg. Token Length |
|---|---|
| Deep research | 44K |
| Coding and engineering | 48K |
| Scientific reasoning | 37K |
| Instruction following | 3K |
| General agentic tasks | 39K |
| Overall | 45K |
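
The "response tokens only" loss used in Stage 1 can be illustrated with a small NumPy sketch. The function name `response_only_cross_entropy` and the 0/1 `response_mask` convention are assumptions made for illustration. Sample packing and the actual training framework are not modeled.

```python
import numpy as np


def response_only_cross_entropy(logits, targets, response_mask):
    """Mean cross-entropy over positions where response_mask is 1.

    logits: (T, V) float array; targets: (T,) int array;
    response_mask: (T,) array of 0/1 flags (1 = response token).
    """
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    mask = response_mask.astype(float)
    # Masked positions (prompts, tool outputs) contribute nothing to the loss.
    return float((token_nll * mask).sum() / max(mask.sum(), 1.0))
```

Because masked positions are multiplied by zero, changing the logits at a prompt or tool-output position leaves the loss unchanged.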

Stage 2: Domain-Level Teacher Training

Four domain teachers are trained using specialized SFT or RL:

  • Search teacher (SFT + RL): Trained on search trajectories (Sec. 3.1) with GRPO. The reward combines correctness, efficiency penalties (a linear penalty after $K$ rounds and a repetition penalty), and format calibration. Hyperparameters: learning rate $1\times10^{-6}$, rollout batch 256, GRPO clip $[0.2, 0.28]$, KL penalty 0.001, entropy 0.0001.

  • Science teacher (two-stage SFT): Stage 1: reasoning-enhanced SFT on non-tool scientific reasoning trajectories. Stage 2: tool-augmented SFT on filtered trajectories with search, visit, code, scholar tools.

  • Instruction-following teacher (two-stage RL):

    • Stage 1: RL on Nemotron instruction-following data with rule-based rewards.
    • Stage 2: RL on long-context learning data with answer-matching reward. Uses GRPO with dynamic sampling (keep groups with non-uniform rewards).
  • Tool-calling teacher (SFT + RL): SFT on tool-calling data, followed by RL with asymmetric advantage:

    $$A_i = A_i^{\text{out}} + \lambda_{\text{neg}}\, \mathbf{1}[r_i^{\text{out}}=0]\, A_i^{\text{proc}}, \quad \lambda_{\text{neg}}=0.5$$

    where $A_i^{\text{out}}$ is normalized over all samples and $A_i^{\text{proc}}$ only over negative samples. A hard-task set of 64 samples is reused across rollout rounds.
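
The asymmetric advantage for the tool-calling teacher can be sketched as follows. The summary does not spell out the normalization, so this sketch assumes mean/standard-deviation standardization within a rollout group and one scalar process reward per sample. The function name and the `eps` term are likewise illustrative.

```python
import numpy as np


def asymmetric_advantage(r_out, r_proc, lam_neg=0.5, eps=1e-6):
    """A_i = A_i^out + lam_neg * 1[r_i^out == 0] * A_i^proc for one rollout group."""
    r_out = np.asarray(r_out, dtype=float)
    r_proc = np.asarray(r_proc, dtype=float)
    # Outcome advantage: standardized over every sample in the group.
    a_out = (r_out - r_out.mean()) / (r_out.std() + eps)
    # Process advantage: standardized over failed samples only (r_out == 0).
    neg = r_out == 0
    a_proc = np.zeros_like(r_out)
    if neg.any():
        a_proc[neg] = (r_proc[neg] - r_proc[neg].mean()) / (r_proc[neg].std() + eps)
    # Successful samples keep their outcome advantage unchanged.
    return a_out + lam_neg * neg * a_proc
```

Under these assumptions, two failed rollouts no longer receive identical advantages: the one with the better process reward is penalized less, while successful rollouts are unaffected.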

Stage 3: Multi-Teacher On-Policy Distillation (OPD)

  • Student initialization: from full-domain SFT checkpoint.
  • Teacher pool: domain-level teachers (search, science, instruction, tool).
  • Domain routing: each prompt assigned a domain label; only the corresponding teacher provides the distillation signal.
  • Rollout: student generates trajectories under domain protocol; tool outputs and user turns are masked from loss.
  • Salient Vocabulary Alignment (SVA): at each generation position $t$, the student and routed teacher are evaluated on the same student-generated prefix. Let $p_{s'}(u)$ and $p_{t,i}(u)$ be the student and teacher distributions. The top-$k$ tokens under the teacher distribution, $\mathcal{S}_{i,t}^{(k)}$, are selected. Both distributions are renormalized on this set:

    $$\bar{p}_{s'}(u) = \frac{p_{s'}(u)}{\sum_{v\in\mathcal{S}_{i,t}^{(k)}} p_{s'}(v)}, \quad \bar{p}_{t,i}(u) = \frac{p_{t,i}(u)}{\sum_{v\in\mathcal{S}_{i,t}^{(k)}} p_{t,i}(v)}, \quad u\in\mathcal{S}_{i,t}^{(k)}$$

    The per-sample SVA loss is the truncated reverse KL:

    $$\ell_{\text{SVA}}^{(i)}(\theta'_s;\theta_{t,i}) = \frac{1}{|R_i|}\sum_{t\in R_i}\sum_{u\in\mathcal{S}_{i,t}^{(k)}}\bar{p}_{s'}(u)\log\frac{\bar{p}_{s'}(u)}{\bar{p}_{t,i}(u)}$$
  • Domain-normalized objective: For a mini-batch $\mathcal{B}$, compute the loss per domain and then average over active domains:

    $$\mathcal{L}_{\text{MT-SVA}}(\theta'_s) = \frac{1}{|\mathcal{D}_\mathcal{B}|}\sum_{d\in\mathcal{D}_\mathcal{B}}\frac{1}{|\mathcal{B}_d|}\sum_{i\in\mathcal{B}_d} \ell_{\text{SVA}}^{(i)}(\theta'_s;\theta_{t,i})$$

    This prevents high-frequency domains from dominating.
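
A small NumPy sketch of the SVA loss and the domain-normalized objective follows. It works on explicit probability vectors rather than model outputs, assumes all probabilities are strictly positive (as softmax outputs are), and takes $R_i$ to be the unmasked response positions. The function names are illustrative, not taken from the paper.

```python
import numpy as np


def sva_token_loss(student_probs, teacher_probs, k):
    """Truncated reverse KL at one position, restricted to the teacher's top-k tokens."""
    top = np.argsort(teacher_probs)[-k:]                 # salient set S^{(k)}
    ps = student_probs[top] / student_probs[top].sum()   # renormalized student
    pt = teacher_probs[top] / teacher_probs[top].sum()   # renormalized teacher
    return float(np.sum(ps * np.log(ps / pt)))           # KL(student || teacher)


def sva_sample_loss(student_seq, teacher_seq, k):
    """Average the per-position loss over the response positions of one sample."""
    return float(np.mean([sva_token_loss(s, t, k)
                          for s, t in zip(student_seq, teacher_seq)]))


def mt_sva_loss(sample_losses, domains):
    """Mean within each domain, then mean over the domains present in the batch."""
    per_domain = {}
    for loss, d in zip(sample_losses, domains):
        per_domain.setdefault(d, []).append(loss)
    return float(np.mean([np.mean(v) for v in per_domain.values()]))
```

For example, with per-sample losses `[1, 1, 1, 4]` and domain labels `["search", "search", "search", "tool"]`, a plain batch mean gives 1.75, whereas the domain-normalized objective gives 2.5. The single tool-calling sample carries the same weight as the three search samples combined.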

3. Multi-Domain Data Pipeline

Detailed construction for five domains: Long-horizon search (wiki-based KAG with relation chains and QA pairs), Machine learning engineering (gradeable competitions with agentic harness, tree of solution nodes, tools in Table 1), Scientific reasoning (graph-driven problem enhancement, trajectory generation with tools), Instruction following (Nemotron constraints + long-context QA with injected in-context rules/distractors), and Tool calling (graph-compositional task synthesis, trajectory generation in Tool Sandbox).

Empirical Validation / Results

Evaluation Setting

  • Long-horizon search: GAIA, BrowseComp, XBench, SEAL-0 – 300 turns, pass@1, official judges.
  • Engineering: SciCode (288 subproblems, pass@1), MLE-Bench-Lite (22 Kaggle competitions, medal rate, 12-hour budget on H200).
  • Scientific research: HLE with tools, HiPhO, FS-O, FS-R – official protocols, tool-free for baselines to avoid confounds.
  • Instruction following: LongBench V2 (503 MC questions), IFBench (294 prompts), IFEval (541 prompts) – strict accuracy.
  • General agentic: $\tau^2$-Bench, VitaBench (using DeepSeek-V3.2 as simulator/judge).
  • Scientific agentic: MolBench-Bind (Binding Affinity Comparison, 3 runs), MatTools (138 subtasks, 3 runs).

Key Results

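| Benchmark | Agents-A1 (35B) | Qwen3.6-35B-A3B | Kimi-K2.6 (1T) | DSV4-Pro (1T) | GPT-5.5 (1T) |
|---|---|---|---|---|---|
| BrowseComp | 75.5 | 67.9 | 83.2 | 83.4 | 84.4 |
| SEAL-0 | 56.4 | 38.7 | 50.5 | 55.0 | 42.3 |
| GAIA | 96.0 | 78.6 | 80.6 | 98.1 | 87.4 |
| SciCode | 44.3 | 35.8 | 53.5 | 50.0 | 56.1 |
| MLE-Bench-Lite | 43.9 | 34.9 | 62.1 | 63.6 | 72.7 |
| HiPhO | 46.4 | 37.7 | 41.1 | 38.7 | 43.3 |
| FS-O | 79.0 | 60.3 | 73.0 | 76.0 | 78.0 |
| FS-R | 40.0 | 2.9 | 17.9 | 13.3 | 26.7 |
| IFBench | 80.6 | 64.4 | 71.8 | 73.5 | 75.9 |
| MolBench-Bind | 56.8 | 48.7 | 21.6 | 37.8 | 62.2 |
| MatTools | 47.1 | 15.9 | 63.8 | 47.1 | 68.8 |
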
  • Agents-A1 outperforms all listed 1T models on SEAL-0, IFBench, HiPhO, FS-O, and FS-R. On MolBench-Bind it beats Kimi-K2.6 and DSV4-Pro but, per the table above, trails GPT-5.5 (56.8 vs. 62.2).
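  • On BrowseComp, XBench, SciCode, MLE-Bench-Lite, HLE, and $\tau^2$-Bench, it is competitive but behind the best 1T models. Among the benchmarks tabulated above, the gap to the best 1T score ranges from 8.9 points on BrowseComp (75.5 vs. 84.4) to 28.8 points on MLE-Bench-Lite (43.9 vs. 72.7).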
  • Compared to same-scale 35B models, Agents-A1 leads on all benchmarks except $\tau^2$-Bench (nearly tied) and VitaBench (slightly behind DSV4-Pro but ahead of others).
  • The OPD stage substantially recovers performance drops observed in full-domain SFT, especially on instruction following and HLE.

Ablations and Teacher Performance

  • Search teacher improves over Qwen3.5 on GAIA (+25.6), SEAL-0 (+12.7), HLE (+2.9).
  • Science teacher boosts FS-R from 2.5 to 54.3, HiPhO from 37.0 to 46.9.
  • Instruction-following teacher improves IFBench from 70.2 to 82.0, LongBench from 59.0 to 62.4.
  • Tool teacher raises $\tau^2$-Bench from 32.53 to 82.50, VitaBench from 26.00 to 44.16.

Long-Horizon Case Studies

  • 12-hour MLE optimization: Agents-A1 autonomously improves a Kaggle whale call detection task from AUC 0.58 to 0.9935, achieving gold-medal-level performance through temporal analysis, augmentation, and architectural refinement.
  • Earth science analysis: Agents-A1 reconstructs Cyclone Nargis (2008) track and intensity from IBTrACS, generating diagnostics and visualizations automatically.

Theoretical and Practical Implications

  • Scaling the horizon, not parameters, is a viable path to achieve trillion-parameter performance with a much smaller model (35B). This reduces computational cost and democratizes access to strong agentic capabilities.
  • The knowledge-action infrastructure provides a general framework for constructing dense, process-level supervision from heterogeneous corpora, enabling credit assignment across long trajectories.
  • Multi-teacher on-policy distillation with Salient Vocabulary Alignment effectively resolves conflicts between different reasoning patterns (e.g., long-thinking vs. multi-turn tool-use) that arise in multi-domain training, without requiring model merging or ensemble.
  • The approach unifies six heterogeneous domains (search, engineering, science, instruction following, tool calling, general agentic) into a single deployable student.
  • The work suggests that atomic abilities (planning, reflection, summarization, memory) are crucial for long-horizon agents and should be targets for further improvement.

Conclusion

Agents-A1 demonstrates that a 35B MoE model can approach or match the performance of trillion-parameter models on long-horizon agent tasks by scaling the agent horizon rather than parameters. Key enablers are the long-horizon knowledge-action infrastructure and the three-stage training recipe culminating in multi-teacher on-policy distillation with salient vocabulary alignment. The model achieves strong results across search, engineering, scientific research, instruction following, and agentic tasks.

Future work will focus on improving fundamental atomic abilities for long-interaction agents (planning, reflection, context summarization, memory) to further enhance performance on extremely long-horizon tasks. The authors hope this provides a practical path for the community to build powerful agents with modest compute budgets.
