Visual Summary | Orchestra-o1: Omnimodal Agent Orchestration

Summary (Overview)

Proposes Orchestra-o1, an omnimodal agent orchestration framework that decouples high-level orchestration from specialized perception and action execution, enabling modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution.
Introduces DA-GRPO (Decision-Aligned Group Relative Policy Optimization), an efficient offline agentic reinforcement learning algorithm that provides dense, decision-level supervision for training the main orchestrator agent.
Achieves state-of-the-art performance on the OmniGAIA benchmark: with GPT-5 as the main agent, Orchestra-o1 reaches 72.8% accuracy, surpassing the second-best approach (Gemini-3-Pro, 62.5%) by 10.3 percentage points; with the trained open-source 8B model, it raises the previous best open-source accuracy from 20.8% to 30.0%.
Demonstrates cost-effectiveness: compared to AOrchestra, Orchestra-o1 achieves higher accuracy (72.8% vs. 40.0%) at lower cost (341.6 vs. 565.7), owing to parallel sub-task execution and explicit tool/sub-agent selection.

Introduction and Theoretical Foundation

Large language model (LLM)-based agents have evolved from single-agent workflows to multi-agent systems, where a main agent coordinates multiple specialized agents to decompose complex tasks. However, existing orchestration frameworks are limited to text or vision-language settings and fail to handle omnimodal scenarios where text, image, audio, and video coexist and interact.

Current omnimodal agents fall into two categories:

Native omnimodal agents – use a single omnimodal LLM (OLLM) for perception, reasoning, planning, and tool use simultaneously. They struggle with long-horizon reasoning, tool use, and fine-grained cross-modal understanding. Even strong models like Gemini-3-Pro achieve only 62.5% on OmniGAIA.
Orchestration-based agents – decouple perception/action from high-level reasoning using a text-based orchestrator. Existing open-source frameworks (e.g., AOrchestra) are limited by incomplete toolsets and linear sub-agent workflows.

The paper argues for a unified orchestration paradigm that explicitly models modality-aware task decomposition, dependency-aware scheduling, and parallel sub-task execution. The theoretical foundation is an information-theoretic justification (Proposition 2): if the evidence gathered by specialized sub-agents is at least as informative as native processing at every step, and strictly more informative at one step or more, then orchestration achieves strictly higher mutual information with the latent answer.

Methodology

Problem Definition. Given a task instance $x = (q, \mathcal{M})$, where $q$ is the question and $\mathcal{M} = \{m_i\}_{i=1}^N$ are auxiliary modality inputs (image, audio, video), the goal is to produce a final answer $\hat{a}$ maximizing the reward $R(\hat{a}, a^*)$.

System Formulation. The main agent $\pi_\theta$ acts as orchestrator. At round $t$, it observes the state:

$s_t = \left\langle q, \mathcal{M}, c_t, H_t, \mathcal{B}, \mathcal{T} \right\rangle,$

where $c_t$ is the accumulated context, $H_t$ is the structured sub-task history, $\mathcal{B}$ is the set of available sub-agent backends, and $\mathcal{T}$ is the tool set. The main agent outputs a structured decision $y_t \in \{\text{delegate}, \text{complete}\}$. If $y_t = \text{delegate}$, it generates a batch of $K_t$ sub-tasks:

$u_{t,j} = (I_{t,j}, C_{t,j}, b_{t,j}, \mathcal{T}_{t,j}),$

with instruction $I_{t,j}$, context $C_{t,j}$, backend $b_{t,j} \in \mathcal{B}$, and tool subset $\mathcal{T}_{t,j} \subseteq \mathcal{T}$.
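
To make the decision format concrete, here is a minimal Python sketch of a sub-task tuple and a structured decision, with a check of the two membership constraints stated above. The class names, field names, and the `validate` helper are illustrative assumptions and not part of the Orchestra-o1 codebase.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class SubTask:
    """One sub-task u_{t,j} = (I, C, b, T_sub). Names are hypothetical."""
    instruction: str        # I_{t,j}
    context: str            # C_{t,j}
    backend: str            # b_{t,j}, must belong to the backend set B
    tools: FrozenSet[str]   # T_{t,j}, must be a subset of the tool set T


@dataclass(frozen=True)
class Decision:
    """Structured decision y_t: either delegate a batch or complete."""
    kind: str                           # "delegate" or "complete"
    subtasks: Tuple[SubTask, ...] = ()  # the K_t sub-tasks when delegating
    answer: str = ""                    # final answer when completing


def validate(decision: Decision, backends: FrozenSet[str],
             tools: FrozenSet[str]) -> List[str]:
    """Return a list of constraint violations (empty means well-formed)."""
    errors: List[str] = []
    if decision.kind == "delegate":
        if not decision.subtasks:
            errors.append("delegate requires at least one sub-task")
        for j, u in enumerate(decision.subtasks):
            if u.backend not in backends:
                errors.append(f"sub-task {j}: unknown backend {u.backend!r}")
            extra = u.tools - tools
            if extra:
                errors.append(f"sub-task {j}: tools not in T: {sorted(extra)}")
    elif decision.kind == "complete":
        if decision.subtasks:
            errors.append("complete must not carry sub-tasks")
    else:
        errors.append(f"unknown decision kind {decision.kind!r}")
    return errors
```

A check of this kind is also a natural basis for a format-style reward signal, although the source does not specify how format correctness is scored.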

Orchestra-o1 Framework (key components):

Flexible Agentic Backends: Each backend $b$ has a capability vector $\phi(b) = (\phi^{\text{txt}}_b, \phi^{\text{img}}_b, \phi^{\text{aud}}_b, \phi^{\text{vid}}_b, \phi^{\text{code}}_b, \kappa_b, \delta_b)$ covering text, image, audio, video, and code capability, plus cost $\kappa_b$ and latency $\delta_b$. Model assignment maximizes a cost-aware matching score (Eq. 5).
Unified Omnimodal Tool Ecosystem: Tools include perception (image, audio, video analysis) and action (web search, page visit, code execution). Tool selection via requirement matching (Eq. 6-7).
Modality-Aware Task Decomposition: The main agent induces a latent dependency graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t)$ over unsolved sub-goals. Each node $v$ has a modality mask $\mu(v)$ and a tool mask $\alpha(v)$. With $\mathcal{C}_t$ denoting the completed nodes and $\text{Pred}(v)$ the predecessors of $v$, the ready set is:
$\mathcal{R}_t = \{ v \in \mathcal{V}_t \setminus \mathcal{C}_t : \text{Pred}(v) \subseteq \mathcal{C}_t \}.$
A parallel batch $\mathcal{P}_t$ is selected from $\mathcal{R}_t$ under capacity and budget constraints (Eq. 10).
Parallel Sub-task Execution: Each sub-task is executed by an independent ReAct-style sub-agent. Execution factorizes as:
$p(Z_t \mid U_t, s_t) = \prod_{j=1}^{K_t} p(z_{t,j} \mid u_{t,j}, s_t).$
Proposition 1 formalizes the latency advantage of parallel over linear execution.
Context Memory and Iterative Refinement: After each round, the memory $H_t$ is updated with summarized sub-agent results. The compressed context $C_{t+1}$ is kept within a token budget $L_{\text{ctx}}$ (Eq. 20). The main agent terminates when the evidence sufficiency score exceeds the threshold $\tau_{\text{stop}}$ (Eq. 22).
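
The ready-set rule in the decomposition step can be sketched in a few lines of Python. The example node names, the costs, and the greedy `select_batch` policy are illustrative assumptions: the summary only cites Eq. 10 for batch selection without reproducing it, so the cheapest-first rule below is a stand-in, not the paper's objective.

```python
from typing import Dict, List, Set


def ready_set(preds: Dict[str, Set[str]], completed: Set[str]) -> List[str]:
    """R_t: unsolved nodes whose predecessors all lie in the completed set C_t."""
    return sorted(
        v for v, p in preds.items()
        if v not in completed and p <= completed
    )


def select_batch(ready: List[str], cost: Dict[str, float],
                 capacity: int, budget: float) -> List[str]:
    """Greedy stand-in for Eq. 10 (assumption): cheapest-first selection
    subject to a batch-size capacity and a total cost budget."""
    batch: List[str] = []
    spent = 0.0
    for v in sorted(ready, key=lambda n: (cost[n], n)):
        if len(batch) >= capacity:
            break
        if spent + cost[v] <= budget:
            batch.append(v)
            spent += cost[v]
    return batch


# Hypothetical dependency graph: node -> set of predecessor nodes.
PREDS = {
    "transcribe_audio": set(),
    "describe_image": set(),
    "search_web": {"transcribe_audio"},
    "answer": {"describe_image", "search_web"},
}
COST = {"transcribe_audio": 1.0, "describe_image": 2.0,
        "search_web": 1.5, "answer": 0.5}

if __name__ == "__main__":
    done: Set[str] = set()
    while len(done) < len(PREDS):
        batch = select_batch(ready_set(PREDS, done), COST, capacity=2, budget=10.0)
        print("round batch:", batch)
        done.update(batch)
```

In this toy graph the two perception nodes have no predecessors, so they can run in the same round, while the web search has to wait for the transcription.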

Proposition 1 (Round-level Latency Advantage):

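$\text{Latency}_{\text{linear}}(t) = \sum_{j=1}^{K_t} \delta_{t,j}, \quad \text{Latency}_{\text{parallel}}(t) = \max_{1 \le j \le K_t} \delta_{t,j} + \delta^{\text{sync}}_t.$

The speedup $S_t = \text{Latency}_{\text{linear}}(t) / \text{Latency}_{\text{parallel}}(t)$ never exceeds $K_t$, and parallel execution is at least as fast as linear execution ($S_t \ge 1$) whenever $\delta^{\text{sync}}_t \le \sum_{j} \delta_{t,j} - \max_{j} \delta_{t,j}$.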
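
As a worked example of Proposition 1, the sketch below evaluates both latency formulas for one round. The latency values and the function name `round_latency` are made up for illustration.

```python
from typing import Sequence, Tuple


def round_latency(latencies: Sequence[float],
                  sync_overhead: float = 0.0) -> Tuple[float, float, float]:
    """Return (linear latency, parallel latency, speedup) for one round.

    linear   = sum of the sub-task latencies delta_{t,j}
    parallel = slowest sub-task latency plus the synchronization overhead
    """
    if not latencies:
        raise ValueError("a round needs at least one sub-task")
    linear = sum(latencies)
    parallel = max(latencies) + sync_overhead
    return linear, parallel, linear / parallel


if __name__ == "__main__":
    # Three hypothetical sub-tasks taking 4 s, 2 s, and 6 s, with 1 s of sync cost.
    linear, parallel, speedup = round_latency([4.0, 2.0, 6.0], sync_overhead=1.0)
    print(f"linear={linear:.1f}s parallel={parallel:.1f}s speedup={speedup:.2f}x")
```

Here the linear latency is 12 s and the parallel latency is 7 s, a speedup of 12/7. The upper bound $K_t$ is reached only when all sub-tasks take equally long and the synchronization overhead is zero.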

Proposition 2 (Information Gain): If $\forall r: I(Y; E_r \mid q, E_{<r}) \ge I(Y; E^0_r \mid q, E^0_{<r})$ with strict inequality for some $r$, then $I(Y; E^{\text{orch}} \mid q) > I(Y; E^0 \mid q)$.
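
One way to see why Proposition 2 holds is the chain rule for mutual information. The derivation below is a sketch under the assumption that both evidence sequences have the same number of steps $R$, with $E^{\text{orch}} = (E_1, \dots, E_R)$ and $E^0 = (E^0_1, \dots, E^0_R)$; the summary does not state this indexing explicitly.

```latex
% Assumption: E^{orch} = (E_1, ..., E_R) and E^0 = (E^0_1, ..., E^0_R).
\begin{align*}
I(Y; E^{\text{orch}} \mid q)
  &= \sum_{r=1}^{R} I(Y; E_r \mid q, E_{<r})
  && \text{(chain rule)} \\
I(Y; E^{0} \mid q)
  &= \sum_{r=1}^{R} I(Y; E^0_r \mid q, E^0_{<r})
  && \text{(chain rule)} \\
I(Y; E^{\text{orch}} \mid q) - I(Y; E^{0} \mid q)
  &= \sum_{r=1}^{R} \Big[ I(Y; E_r \mid q, E_{<r}) - I(Y; E^0_r \mid q, E^0_{<r}) \Big] > 0.
\end{align*}
```

Every bracketed term is non-negative by the premise and at least one is strictly positive, so the sum is strictly positive.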

Training Recipe – DA-GRPO:

Training Data Curation: Starting from 300 seeds, use anchor fact extraction, five rewrite strategies (pivot swapping, temporal shifting, numerical recombination, entity-sibling querying, multi-hop reordering), and a cascade of quality gates (anchor coverage, lexical similarity, modal-bypass test, numerical check, LLM judge). Produces ~1200 verified examples with dense decision-level supervision.
DA-GRPO Objective: For each state $i$, sample $G$ candidate decisions $\{y_{i,j}\}_{j=1}^G$. Reward each decision on four dimensions:
$r_{i,j} = \alpha_1 r^{\text{format}}_{i,j} + \alpha_2 r^{\text{action}}_{i,j} + \alpha_3 r^{\text{tool}}_{i,j} + \alpha_4 r^{\text{decision}}_{i,j}.$
Weights: $(\alpha_1, \alpha_2, \alpha_3, \alpha_4) = (0.1, 0.1, 0.2, 0.6)$, so the decision-quality term dominates the reward.

Compute relative advantage:
$\hat{A}_{i,j} = \frac{r_{i,j} - \text{Mean}(\{r_{i,k}\})}{\text{Std}(\{r_{i,k}\}) + \epsilon}.$
Optimize with clipped policy gradient:
$\mathcal{L}_{\text{DA-GRPO}}(\theta) = -\mathbb{E}_{i,j}\left[ \min\left( \rho_{i,j}(\theta)\hat{A}_{i,j},\ \text{clip}(\rho_{i,j}(\theta),1-\epsilon,1+\epsilon)\hat{A}_{i,j} \right) - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right].$
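
The three formulas above can be traced numerically with a small, dependency-free Python sketch. Only the reward weights come from the text; the clip range `clip_eps=0.2`, the KL coefficient `beta=0.01`, the use of the population standard deviation, and the scalar `kl` estimate are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence

# Reward weights (format, action, tool, decision) as reported in the text.
WEIGHTS = (0.1, 0.1, 0.2, 0.6)


def combined_reward(components: Sequence[float],
                    weights: Sequence[float] = WEIGHTS) -> float:
    """r_{i,j}: weighted sum of the four per-dimension rewards."""
    return sum(w * r for w, r in zip(weights, components))


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """A_hat_{i,j}: rewards standardized within the group of G candidates.
    Population std is an assumption; the source does not say which is used."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one candidate."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def da_grpo_loss(ratios: Sequence[float], advantages: Sequence[float],
                 kl: float, beta: float = 0.01, clip_eps: float = 0.2) -> float:
    """Negative of (mean clipped surrogate minus beta * KL).
    `kl` is a precomputed scalar estimate of D_KL(pi_theta || pi_ref)."""
    surrogate = mean([clipped_term(r, a, clip_eps)
                      for r, a in zip(ratios, advantages)])
    return -(surrogate - beta * kl)


if __name__ == "__main__":
    # Four hypothetical candidate decisions for one state, scored per dimension.
    group = [(1, 1, 1, 1), (1, 1, 0, 0), (1, 0, 1, 1), (0, 0, 0, 0)]
    rewards = [combined_reward(c) for c in group]
    adv = group_advantages(rewards)
    print("rewards:", rewards)
    print("advantages:", [round(a, 3) for a in adv])
    print("loss:", da_grpo_loss([1.1, 0.9, 1.3, 1.0], adv, kl=0.05))
```

Because the rewards are computed on logged decisions rather than on full rollouts, a loop like this needs no sub-agent execution at training time, which matches the offline framing of DA-GRPO.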

Empirical Validation / Results

Benchmark: OmniGAIA, covering text, image, audio, video across 9 categories and 3 difficulty levels.

Main Results (Table 1, accuracy in %):

Category	Gemini-3-Pro	AOrchestra-GPT-5	Orchestra-o1-GPT-5	OmniAtlas-Qwen3-30B	Orchestra-o1-8B
Geography	65.2	34.8	72.5	10.1	21.7
Technology	59.2	40.8	69.4	30.6	32.7
History	62.1	56.1	75.8	29.9	37.9
Finance	72.0	32.0	64.0	32.0	12.0
Sport	78.4	51.4	83.8	18.9	29.7
Art	52.8	25.0	63.9	16.7	16.7
Movie	48.5	42.4	69.7	12.1	45.5
Science	42.3	30.8	73.1	11.5	38.5
Food	88.9	22.2	83.3	27.8	38.9
Overall	62.5	40.0	72.8	20.8	30.0

Orchestra-o1-GPT-5 outperforms Gemini-3-Pro by 10.3 percentage points and AOrchestra-GPT-5 by 32.8 percentage points overall.
Orchestra-o1-8B improves over OmniAtlas-Qwen3-30B by 9.2 percentage points despite using a smaller 8B backbone.
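
The margins quoted above are absolute differences in percentage points. The short sketch below recomputes them from the table, contrasts them with relative improvements, and lists the categories where Gemini-3-Pro still scores higher than Orchestra-o1-GPT-5. The numbers are copied from Table 1; the helper names are illustrative.

```python
# Overall accuracy (%) from Table 1.
OVERALL = {
    "Gemini-3-Pro": 62.5,
    "AOrchestra-GPT-5": 40.0,
    "Orchestra-o1-GPT-5": 72.8,
    "OmniAtlas-Qwen3-30B": 20.8,
    "Orchestra-o1-8B": 30.0,
}

# Per-category accuracy (%): (Gemini-3-Pro, Orchestra-o1-GPT-5).
PER_CATEGORY = {
    "Geography": (65.2, 72.5), "Technology": (59.2, 69.4),
    "History": (62.1, 75.8), "Finance": (72.0, 64.0),
    "Sport": (78.4, 83.8), "Art": (52.8, 63.9),
    "Movie": (48.5, 69.7), "Science": (42.3, 73.1),
    "Food": (88.9, 83.3),
}


def point_gap(ours: str, baseline: str) -> float:
    """Absolute gap in percentage points."""
    return round(OVERALL[ours] - OVERALL[baseline], 1)


def relative_gain(ours: str, baseline: str) -> float:
    """Relative improvement over the baseline, in percent."""
    return round(100.0 * (OVERALL[ours] - OVERALL[baseline]) / OVERALL[baseline], 1)


def categories_where_baseline_leads() -> list:
    """Categories in which Gemini-3-Pro scores above Orchestra-o1-GPT-5."""
    return [c for c, (gem, ours) in PER_CATEGORY.items() if gem > ours]


if __name__ == "__main__":
    for ours, base in [("Orchestra-o1-GPT-5", "Gemini-3-Pro"),
                       ("Orchestra-o1-GPT-5", "AOrchestra-GPT-5"),
                       ("Orchestra-o1-8B", "OmniAtlas-Qwen3-30B")]:
        print(f"{ours} vs {base}: +{point_gap(ours, base)} points, "
              f"+{relative_gain(ours, base)}% relative")
    print("Gemini-3-Pro leads in:", categories_where_baseline_leads())
```

Running it shows that the overall lead does not hold in every category: per the table, Gemini-3-Pro remains ahead on Finance and Food.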

Difficulty-level Results (Figure 4): In the proprietary setting, Orchestra-o1 reaches 80.3% on easy, 75.0% on medium, and 56.4% on hard tasks; in the open-source setting, it reaches 36.1%, 26.9%, and 26.9%, respectively. Both settings improve consistently over the best baselines.

Efficiency (Figure 5): Orchestra-o1 reaches 72.8% accuracy at a cost of 341.6, whereas AOrchestra reaches 40.0% at a cost of 565.7. Parallel execution and explicit tool/sub-agent selection drive the cost-effectiveness.

Ablation Studies (Figure 6, Table 2):

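ReAct-GPT-5 with the same tools: 53.9% → Orchestra-o1-GPT-5: 72.8% (+18.9 percentage points).
Qwen3-8B ReAct: 12.5% → Orchestra-o1 without post-training: 26.3% → with SFT: 28.6% → with vanilla GRPO: 27.7% → with DA-GRPO: 30.0%. Among the post-training variants, only vanilla GRPO falls below SFT, and DA-GRPO is the strongest.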

Theoretical and Practical Implications

Theoretical:

Orchestration provides an information-theoretic advantage (Proposition 2) by specializing sub-agents per modality, increasing mutual information between evidence and answer.
Parallel execution yields formal latency speedup (Proposition 1), bounded by the number of independent sub-tasks.
DA-GRPO offers a principled way to provide dense, decision-level rewards for multi-step orchestration, avoiding expensive online execution during training.

Practical:

Orchestra-o1 is a scalable, open-source framework for building omnimodal agent swarms with flexible backends and tools.
The system achieves both higher accuracy and lower cost than linear orchestration baselines.
Training with DA-GRPO enables compact 8B models to competently orchestrate complex multi-modal tasks, democratizing omnimodal agent intelligence.
The data curation pipeline can generate diverse, verified training examples from limited seeds, applicable to other orchestration tasks.

Conclusion

Orchestra-o1 is an omnimodal agent orchestration framework that separates high-level orchestration from specialized execution, enabling modality-aware decomposition, parallel sub-task execution, and iterative evidence aggregation. DA-GRPO trains the main agent by rewarding decision-level quality across format, action, tool, and strategic dimensions. Extensive experiments on OmniGAIA show state-of-the-art performance, with strong gains over both native omnimodal agents and previous orchestration approaches, while being more cost-effective. Future work will extend to practical scenarios like audio-video collaborative coding and voice-guided computer-use tasks.