Summary (Overview)

  • Proposes Orchestra-o1, an omnimodal agent orchestration framework that decouples high-level orchestration from specialized perception and action execution, enabling modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution.
  • Introduces DA-GRPO (Decision-Aligned Group Relative Policy Optimization), an efficient offline agentic reinforcement learning algorithm that provides dense, decision-level supervision for training the main orchestrator agent.
  • Achieves state-of-the-art performance on the OmniGAIA benchmark: with GPT-5 as the main agent, Orchestra-o1 reaches 72.8% accuracy, surpassing the second-best approach (Gemini-3-Pro, 62.5%) by 10.3 percentage points; with the trained open-source 8B model as the main agent, it raises the previous best open-source accuracy from 20.8% to 30.0%.
  • Demonstrates cost-effectiveness: compared to AOrchestra, Orchestra-o1 achieves higher accuracy (72.8% vs. 40.0%) at lower cost (341.6 vs. 565.7), owing to parallel sub-task execution and explicit tool/sub-agent selection.

Introduction and Theoretical Foundation

Large language model (LLM)-based agents have evolved from single-agent workflows to multi-agent systems, where a main agent coordinates multiple specialized agents to decompose complex tasks. However, existing orchestration frameworks are limited to text or vision-language settings and fail to handle omnimodal scenarios where text, image, audio, and video coexist and interact.

Current omnimodal agents fall into two categories:

  1. Native omnimodal agents – use a single omnimodal LLM (OLLM) for perception, reasoning, planning, and tool use simultaneously. They struggle with long-horizon reasoning, tool use, and fine-grained cross-modal understanding. Even strong models like Gemini-3-Pro achieve only 62.5% on OmniGAIA.
  2. Orchestration-based agents – decouple perception/action from high-level reasoning using a text-based orchestrator. Existing open-source frameworks (e.g., AOrchestra) are limited by incomplete toolsets and linear sub-agent workflows.

The paper argues for a unified orchestration paradigm that explicitly models modality-aware task decomposition, dependency-aware scheduling, and parallel sub-task execution. Its theoretical foundation is an information-theoretic argument (Proposition 2): orchestration achieves strictly higher mutual information with the latent answer than native processing, provided the specialized sub-agents are at least as informative per modality and strictly more informative in at least one case.

Methodology

Problem Definition. Given a task instance $x = (q, \mathcal{M})$, where $q$ is the question and $\mathcal{M} = \{m_i\}_{i=1}^N$ are auxiliary modality inputs (image, audio, video), the goal is to produce a final answer $\hat{a}$ maximizing the reward $R(\hat{a}, a^*)$.

System Formulation. The main agent $\pi_\theta$ acts as the orchestrator. At round $t$, it observes the state

$$s_t = \left\langle q, \mathcal{M}, c_t, H_t, \mathcal{B}, \mathcal{T} \right\rangle,$$

where $c_t$ is the accumulated context, $H_t$ is the structured sub-task history, $\mathcal{B}$ is the set of available sub-agent backends, and $\mathcal{T}$ is the tool set. The main agent outputs a structured decision $y_t$ from $\{\text{delegate}, \text{complete}\}$. If it delegates, it generates a batch of $K_t$ sub-tasks

$$u_{t,j} = (I_{t,j}, C_{t,j}, b_{t,j}, \mathcal{T}_{t,j}),$$

with instruction $I_{t,j}$, context $C_{t,j}$, backend $b_{t,j} \in \mathcal{B}$, and tool subset $\mathcal{T}_{t,j} \subseteq \mathcal{T}$.
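
The decision format above can be sketched as plain data structures. This is a minimal illustration rather than the framework's actual interface: the class names, field names, the `validate` helper, and the backend and tool names are all hypothetical. It only encodes the two constraints stated in the formulation, namely $b_{t,j} \in \mathcal{B}$ and $\mathcal{T}_{t,j} \subseteq \mathcal{T}$.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class SubTask:
    """One sub-task u_{t,j} = (I, C, b, T_sub)."""
    instruction: str          # I_{t,j}
    context: str              # C_{t,j}
    backend: str              # b_{t,j}, must belong to B
    tools: FrozenSet[str]     # T_{t,j}, must be a subset of T


@dataclass(frozen=True)
class Decision:
    """Structured decision y_t: either delegate a batch or complete."""
    action: str                            # "delegate" or "complete"
    sub_tasks: Tuple[SubTask, ...] = ()    # the K_t sub-tasks when delegating
    answer: str = ""                       # final answer when completing


def validate(decision: Decision, backends: FrozenSet[str],
             tools: FrozenSet[str]) -> List[str]:
    """Return a list of constraint violations (an empty list means valid)."""
    errors: List[str] = []
    if decision.action == "delegate":
        if not decision.sub_tasks:
            errors.append("delegate needs at least one sub-task")
        for j, u in enumerate(decision.sub_tasks):
            if u.backend not in backends:
                errors.append(f"sub-task {j}: unknown backend {u.backend!r}")
            extra = u.tools - tools
            if extra:
                errors.append(f"sub-task {j}: tools outside T: {sorted(extra)}")
    elif decision.action == "complete":
        if decision.sub_tasks:
            errors.append("complete must not carry sub-tasks")
    else:
        errors.append(f"unknown action {decision.action!r}")
    return errors


# Hypothetical backend and tool names, for illustration only.
BACKENDS = frozenset({"vision-model", "audio-model"})
TOOLS = frozenset({"image_analysis", "audio_analysis", "web_search"})

batch = Decision(
    action="delegate",
    sub_tasks=(
        SubTask("Describe the chart", "", "vision-model",
                frozenset({"image_analysis"})),
        SubTask("Transcribe the clip", "", "audio-model",
                frozenset({"audio_analysis"})),
    ),
)
print(validate(batch, BACKENDS, TOOLS))  # prints []
```

Keeping the decision as structured data means that format violations can be detected before any sub-agent is launched.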

Orchestra-o1 Framework (key components):

  1. Flexible Agentic Backends: Each backend $b$ has a capability vector $\phi(b) = (\phi^{\text{txt}}_b, \phi^{\text{img}}_b, \phi^{\text{aud}}_b, \phi^{\text{vid}}_b, \phi^{\text{code}}_b, \kappa_b, \delta_b)$, whose last two entries $\kappa_b$ and $\delta_b$ capture cost and latency. Model assignment maximizes a cost-aware matching score (Eq. 5).

  2. Unified Omnimodal Tool Ecosystem: Tools include perception (image, audio, video analysis) and action (web search, page visit, code execution). Tool selection via requirement matching (Eq. 6-7).

  3. Modality-Aware Task Decomposition: The main agent induces a latent dependency graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t)$ over unsolved sub-goals. Each node $v$ has a modality mask $\mu(v)$ and a tool mask $\alpha(v)$. The ready set contains the nodes that are not yet in the completed set $\mathcal{C}_t$ but whose predecessors all are:

    $$\mathcal{R}_t = \{ v \in \mathcal{V}_t \setminus \mathcal{C}_t : \text{Pred}(v) \subseteq \mathcal{C}_t \}.$$

    A parallel batch $\mathcal{P}_t$ is selected from $\mathcal{R}_t$ under capacity and budget constraints (Eq. 10).

  4. Parallel Sub-task Execution: Each sub-task is executed by an independent ReAct-style sub-agent. Execution factorizes as:

    $$p(Z_t \mid U_t, s_t) = \prod_{j=1}^{K_t} p(z_{t,j} \mid u_{t,j}, s_t).$$

    Proposition 1 formalizes the latency advantage of parallel over linear execution.

  5. Context Memory and Iterative Refinement: After each round, the memory $H_t$ is updated with summarized sub-agent results. A compressed context $C_{t+1}$ is maintained under a token budget $L_{\text{ctx}}$ (Eq. 20). The main agent terminates when the evidence sufficiency score exceeds a threshold $\tau_{\text{stop}}$ (Eq. 22).
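
The ready-set rule in component 3 can be sketched in a few lines. The dependency graph, node names, and costs below are hypothetical, and `select_batch` is a simple greedy stand-in: the summary does not reproduce the actual selection rule of Eq. 10, so cheapest-first under a capacity and budget limit is an assumption made only for illustration.

```python
from typing import Dict, List, Set


def ready_set(preds: Dict[str, Set[str]], completed: Set[str]) -> List[str]:
    """Nodes not yet completed whose predecessors are all completed."""
    return sorted(
        v for v, p in preds.items()
        if v not in completed and p <= completed
    )


def select_batch(ready: List[str], cost: Dict[str, float],
                 capacity: int, budget: float) -> List[str]:
    """Greedy stand-in for Eq. 10 (assumption): cheapest nodes first,
    subject to a maximum batch size and a total cost budget."""
    batch: List[str] = []
    spent = 0.0
    for v in sorted(ready, key=lambda n: (cost[n], n)):
        if len(batch) >= capacity:
            break
        if spent + cost[v] <= budget:
            batch.append(v)
            spent += cost[v]
    return batch


# Hypothetical dependency graph: node -> set of predecessor nodes.
PREDS = {
    "transcribe_audio": set(),
    "describe_image": set(),
    "web_search": {"transcribe_audio", "describe_image"},
    "final_answer": {"web_search"},
}
COST = {"transcribe_audio": 1.0, "describe_image": 2.0,
        "web_search": 1.5, "final_answer": 0.5}

first = ready_set(PREDS, completed=set())
print(first)  # ['describe_image', 'transcribe_audio']
print(select_batch(first, COST, capacity=2, budget=2.5))  # ['transcribe_audio']
print(ready_set(PREDS, completed={"transcribe_audio", "describe_image"}))
# ['web_search']
```

The two perception nodes have no predecessors, so they are ready in the first round and can run in parallel. The search node becomes ready only once both have completed.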

Proposition 1 (Round-level Latency Advantage):

$$\text{Latency}_{\text{linear}}(t) = \sum_{j=1}^{K_t} \delta_{t,j}, \quad \text{Latency}_{\text{parallel}}(t) = \max_{1 \le j \le K_t} \delta_{t,j} + \delta^{\text{sync}}_t.$$

The speedup satisfies $S_t \le K_t$ under $\delta^{\text{sync}}_t \le \sum_{j} \delta_{t,j} - \max_{j} \delta_{t,j}$.
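
A small numeric check of Proposition 1 follows. It assumes the speedup is the ratio $S_t = \text{Latency}_{\text{linear}}(t) / \text{Latency}_{\text{parallel}}(t)$, which the summary does not state explicitly; the delay values are made up. Under that reading, $S_t \le K_t$ holds because the sum of $K_t$ delays is at most $K_t$ times the largest one and the synchronization overhead is non-negative, while the stated condition on $\delta^{\text{sync}}_t$ is exactly what keeps parallel execution from being slower than linear execution ($S_t \ge 1$).

```python
from typing import Sequence


def linear_latency(delays: Sequence[float]) -> float:
    """Sub-tasks run one after another: latencies add up."""
    return sum(delays)


def parallel_latency(delays: Sequence[float], sync: float) -> float:
    """Sub-tasks run concurrently: slowest sub-task plus sync overhead."""
    return max(delays) + sync


def speedup(delays: Sequence[float], sync: float) -> float:
    """Assumed definition: S_t = linear latency / parallel latency."""
    return linear_latency(delays) / parallel_latency(delays, sync)


# Hypothetical per-sub-task latencies (seconds) and sync overhead.
delays = [4.0, 2.0, 6.0]
sync = 1.0

print(linear_latency(delays))          # 12.0
print(parallel_latency(delays, sync))  # 7.0
print(round(speedup(delays, sync), 3))

# At the boundary sync = sum - max, parallel and linear latency coincide.
boundary_sync = sum(delays) - max(delays)
print(speedup(delays, boundary_sync))  # 1.0
```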

Proposition 2 (Information Gain): If $\forall r:\ I(Y; E_r \mid q, E_{<r}) \ge I(Y; E^0_r \mid q, E^0_{<r})$, with strict inequality for some $r$, then $I(Y; E^{\text{orch}} \mid q) > I(Y; E^0 \mid q)$.
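
One way to see why the per-component condition suffices is the chain rule for mutual information. The sketch below assumes, as notation not spelled out in this summary, that both evidence sequences are indexed over the same components $r = 1, \dots, R$, with $E^{\text{orch}} = (E_1, \dots, E_R)$ and $E^0 = (E^0_1, \dots, E^0_R)$.

```latex
\begin{align*}
I(Y; E^{\text{orch}} \mid q)
  &= \sum_{r=1}^{R} I(Y; E_r \mid q, E_{<r})
  && \text{(chain rule)} \\
  &> \sum_{r=1}^{R} I(Y; E^0_r \mid q, E^0_{<r})
  && \text{(termwise $\ge$, strict for some $r$)} \\
  &= I(Y; E^0 \mid q)
  && \text{(chain rule)}.
\end{align*}
```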

Training Recipe – DA-GRPO:

  1. Training Data Curation: Starting from 300 seeds, the pipeline applies anchor fact extraction, five rewrite strategies (pivot swapping, temporal shifting, numerical recombination, entity-sibling querying, multi-hop reordering), and a cascade of quality gates (anchor coverage, lexical similarity, modal-bypass test, numerical check, LLM judge). It produces ~1200 verified examples with dense decision-level supervision.

  2. DA-GRPO Objective: For each state, sample $G$ candidate decisions $\{y_{i,j}\}_{j=1}^G$. Reward each decision on four dimensions:

    $$r_{i,j} = \alpha_1 r^{\text{format}}_{i,j} + \alpha_2 r^{\text{action}}_{i,j} + \alpha_3 r^{\text{tool}}_{i,j} + \alpha_4 r^{\text{decision}}_{i,j}.$$

    Weights: $(\alpha_1, \alpha_2, \alpha_3, \alpha_4) = (0.1, 0.1, 0.2, 0.6)$.

    Compute relative advantage:

    $$\hat{A}_{i,j} = \frac{r_{i,j} - \text{Mean}(\{r_{i,k}\}_{k=1}^G)}{\text{Std}(\{r_{i,k}\}_{k=1}^G) + \epsilon}.$$

    Optimize with clipped policy gradient:

    $$\mathcal{L}_{\text{DA-GRPO}}(\theta) = -\mathbb{E}_{i,j}\left[ \min\left( \rho_{i,j}(\theta)\hat{A}_{i,j},\ \text{clip}(\rho_{i,j}(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_{i,j} \right) - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right].$$
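
The three formulas of the DA-GRPO objective can be traced numerically with the standard library alone. This is a sketch under several assumptions rather than the authors' training code: the component rewards are taken to lie in $[0, 1]$, $\text{Std}$ is taken as the population standard deviation, $\rho_{i,j}$ is treated as a precomputed probability ratio, the KL term is passed in as a number, and the values of `clip_eps` and `beta` are placeholders. The summary writes $\epsilon$ for both the stabilizer in the advantage and the clipping range; the code keeps them as two separate parameters.

```python
import statistics
from typing import Dict, List, Sequence

# Reward weights (alpha_1..alpha_4) as reported in the summary.
WEIGHTS = {"format": 0.1, "action": 0.1, "tool": 0.2, "decision": 0.6}


def total_reward(parts: Dict[str, float]) -> float:
    """Weighted sum of the four decision-level reward components."""
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (r - mean) / (std + eps) within one group.
    Population standard deviation is an assumption."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, adv: float, clip_eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one candidate."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * adv, clipped_ratio * adv)


def da_grpo_loss(ratios: Sequence[float], advantages: Sequence[float],
                 kl: float, beta: float = 0.01, clip_eps: float = 0.2) -> float:
    """Negative of (mean clipped surrogate - beta * KL), as in the objective."""
    terms = [clipped_term(r, a, clip_eps) for r, a in zip(ratios, advantages)]
    return -(statistics.fmean(terms) - beta * kl)


# Hypothetical group of G = 4 candidate decisions for one state.
group = [
    {"format": 1.0, "action": 1.0, "tool": 1.0, "decision": 1.0},
    {"format": 1.0, "action": 1.0, "tool": 0.0, "decision": 0.0},
    {"format": 1.0, "action": 0.0, "tool": 0.0, "decision": 0.0},
    {"format": 0.0, "action": 0.0, "tool": 0.0, "decision": 0.0},
]
rewards = [total_reward(p) for p in group]
advantages = group_advantages(rewards)
ratios = [1.1, 0.9, 1.0, 1.3]  # placeholder probability ratios
print(rewards)
print(da_grpo_loss(ratios, advantages, kl=0.05))
```

Because every candidate is scored directly on its decision, the reward can be computed offline without executing sub-agents, which is the efficiency argument made for DA-GRPO.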

Empirical Validation / Results

Benchmark: OmniGAIA, covering text, image, audio, video across 9 categories and 3 difficulty levels.

Main Results (Table 1):

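| Category   | Gemini-3-Pro | AOrchestra-GPT-5 | Orchestra-o1-GPT-5 (Ours) | OmniAtlas-Qwen3-30B | Orchestra-o1-8B (Ours) |
|------------|--------------|------------------|---------------------------|---------------------|------------------------|
| Geography  | 65.2         | 34.8             | 72.5                      | 10.1                | 21.7                   |
| Technology | 59.2         | 40.8             | 69.4                      | 30.6                | 32.7                   |
| History    | 62.1         | 56.1             | 75.8                      | 29.9                | 37.9                   |
| Finance    | 72.0         | 32.0             | 64.0                      | 32.0                | 12.0                   |
| Sport      | 78.4         | 51.4             | 83.8                      | 18.9                | 29.7                   |
| Art        | 52.8         | 25.0             | 63.9                      | 16.7                | 16.7                   |
| Movie      | 48.5         | 42.4             | 69.7                      | 12.1                | 45.5                   |
| Science    | 42.3         | 30.8             | 73.1                      | 11.5                | 38.5                   |
| Food       | 88.9         | 22.2             | 83.3                      | 27.8                | 38.9                   |
| Overall    | 62.5         | 40.0             | 72.8                      | 20.8                | 30.0                   |
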
  • Orchestra-o1-GPT-5 outperforms Gemini-3-Pro by 10.3 percentage points and AOrchestra-GPT-5 by 32.8 percentage points overall.
  • Orchestra-o1-8B improves over OmniAtlas-Qwen3-30B by 9.2 percentage points despite using a smaller 8B backbone.

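Difficulty-level Results (Figure 4): In the proprietary setting, Orchestra-o1 reaches 80.3% on easy, 75.0% on medium, and 56.4% on hard tasks; in the open-source setting, it reaches 36.1%, 26.9%, and 26.9%, respectively. Both settings show consistent improvements over the best baselines.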

Efficiency (Figure 5): Orchestra-o1 reaches 72.8% accuracy at cost 341.6, whereas AOrchestra reaches 40.0% at cost 565.7. Parallel execution and explicit tool/sub-agent selection drive the cost-effectiveness.
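
The efficiency comparison can be restated as two derived quantities. Accuracy per unit of cost and relative cost saving are not metrics reported in the summary; they are computed here only from the four numbers above, and the unit of cost is left as given.

```python
from typing import Dict, Tuple

# (accuracy in %, cost) as reported in the efficiency comparison.
RESULTS: Dict[str, Tuple[float, float]] = {
    "Orchestra-o1": (72.8, 341.6),
    "AOrchestra": (40.0, 565.7),
}


def accuracy_per_cost(accuracy: float, cost: float) -> float:
    """Illustrative derived metric: accuracy points per unit of cost."""
    return accuracy / cost


def relative_cost_saving(cost_new: float, cost_old: float) -> float:
    """Fraction by which cost_new is lower than cost_old."""
    return 1.0 - cost_new / cost_old


for name, (acc, cost) in RESULTS.items():
    print(f"{name}: {accuracy_per_cost(acc, cost):.4f} accuracy points per cost unit")

saving = relative_cost_saving(RESULTS["Orchestra-o1"][1], RESULTS["AOrchestra"][1])
gap = RESULTS["Orchestra-o1"][0] - RESULTS["AOrchestra"][0]
print(f"cost saving vs. AOrchestra: {saving:.1%}")
print(f"accuracy gap: {gap:.1f} percentage points")
```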

Ablation Studies (Figure 6, Table 2):

  • ReAct-GPT-5 with the same tools: 53.9% → Orchestra-o1-GPT-5: 72.8% (+18.9 percentage points).
  • Qwen3-8B ReAct: 12.5% → Orchestra-o1 without post-training: 26.3% → with SFT: 28.6% → with vanilla GRPO: 27.7% → with DA-GRPO: 30.0%.

Theoretical and Practical Implications

Theoretical:

  • Orchestration provides an information-theoretic advantage (Proposition 2) by specializing sub-agents per modality, increasing mutual information between evidence and answer.
  • Parallel execution yields formal latency speedup (Proposition 1), bounded by the number of independent sub-tasks.
  • DA-GRPO offers a principled way to provide dense, decision-level rewards for multi-step orchestration, avoiding expensive online execution during training.

Practical:

  • Orchestra-o1 is a scalable, open-source framework for building omnimodal agent swarms with flexible backends and tools.
  • The system achieves both higher accuracy and lower cost than linear orchestration baselines.
  • Training with DA-GRPO enables compact 8B models to competently orchestrate complex multi-modal tasks, democratizing omnimodal agent intelligence.
  • The data curation pipeline can generate diverse, verified training examples from limited seeds, applicable to other orchestration tasks.

Conclusion

Orchestra-o1 is an omnimodal agent orchestration framework that separates high-level orchestration from specialized execution, enabling modality-aware decomposition, parallel sub-task execution, and iterative evidence aggregation. DA-GRPO trains the main agent by rewarding decision-level quality across format, action, tool, and strategic dimensions. Extensive experiments on OmniGAIA show state-of-the-art performance, with strong gains over both native omnimodal agents and previous orchestration approaches, while being more cost-effective. Future work will extend to practical scenarios like audio-video collaborative coding and voice-guided computer-use tasks.
