# Orchestra-o1: Omnimodal Agent Orchestration

> Orchestra-o1 achieves 72.8% accuracy on OmniGAIA, surpassing the prior best by 10.3 percentage points via modality-aware orchestration and DA-GRPO training.

- **Source:** [arXiv](https://arxiv.org/abs/2606.13707)
- **Published:** 2026-06-16
- **Permalink:** https://picx.dev/p/nJIGk4
- **Whiteboard:** https://picx.dev/p/nJIGk4/image

## Summary

- **Proposes Orchestra-o1**, an omnimodal agent orchestration framework that decouples high-level orchestration from specialized perception and action execution, enabling modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution.
- **Introduces DA-GRPO** (Decision-Aligned Group Relative Policy Optimization), an efficient offline agentic reinforcement learning algorithm that provides dense, decision-level supervision for training the main orchestrator agent.
- **Achieves state-of-the-art performance** on the OmniGAIA benchmark: with GPT-5 as the main agent, Orchestra-o1 reaches 72.8% accuracy, surpassing the second-best approach (Gemini-3-Pro, 62.5%) by 10.3 percentage points; with the trained open-source 8B model as the main agent, it raises the previous best open-source accuracy from 20.8% to 30.0%.
- **Demonstrates cost-effectiveness**: Compared to AOrchestra, Orchestra-o1 achieves higher accuracy (72.8% vs. 40.0%) at lower cost (341.6 vs. 565.7), which the paper attributes to parallel sub-task execution and explicit tool/sub-agent selection.

## Introduction and Theoretical Foundation

Large language model (LLM)-based agents have evolved from single-agent workflows to multi-agent systems, where a main agent coordinates multiple specialized agents to decompose complex tasks. However, existing orchestration frameworks are limited to text or vision-language settings and fail to handle **omnimodal** scenarios where text, image, audio, and video coexist and interact.

Current omnimodal agents fall into two categories:
1. **Native omnimodal agents** – use a single omnimodal LLM (OLLM) for perception, reasoning, planning, and tool use simultaneously. They struggle with long-horizon reasoning, tool use, and fine-grained cross-modal understanding. Even strong models like Gemini-3-Pro achieve only 62.5% on OmniGAIA.
2. **Orchestration-based agents** – decouple perception/action from high-level reasoning using a text-based orchestrator. Existing open-source frameworks (e.g., AOrchestra) are limited by incomplete toolsets and linear sub-agent workflows.

The paper argues for a unified orchestration paradigm that explicitly models **modality-aware task decomposition**, **dependency-aware scheduling**, and **parallel sub-task execution**. Its theoretical foundation is an information-theoretic argument (Proposition 2): provided that specialized sub-agents are at least as informative as native processing per modality, and strictly more informative in at least one case, orchestration achieves strictly higher mutual information with the latent answer.

## Methodology

**Problem Definition.** Given task instance $x = (q, \mathcal{M})$ where $q$ is the question and $\mathcal{M} = \{m_i\}_{i=1}^N$ are auxiliary modality inputs (image, audio, video), the goal is to produce final answer $\hat{a}$ maximizing reward $R(\hat{a}, a^*)$.

**System Formulation.** The main agent $\pi_\theta$ acts as orchestrator. At round $t$, it observes state:
$$s_t = \left\langle q, \mathcal{M}, c_t, H_t, \mathcal{B}, \mathcal{T} \right\rangle,$$
where $c_t$ is accumulated context, $H_t$ is structured sub-task history, $\mathcal{B}$ is the set of available sub-agent backends, and $\mathcal{T}$ is the tool set. The main agent outputs a structured decision $y_t$ from $\{\text{delegate}, \text{complete}\}$. If delegate, it generates a batch of $K_t$ sub-tasks:
$$u_{t,j} = (I_{t,j}, C_{t,j}, b_{t,j}, \mathcal{T}_{t,j}),$$
with instruction $I_{t,j}$, context $C_{t,j}$, backend $b_{t,j} \in \mathcal{B}$, and tool subset $\mathcal{T}_{t,j} \subseteq \mathcal{T}$.
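
The structured decision and sub-task tuple can be pictured as plain data types. The sketch below is illustrative only: the class names, field names, the `validate` helper, and the example backend and tool names are hypothetical and are not taken from the paper; it simply encodes the constraints $b_{t,j} \in \mathcal{B}$ and $\mathcal{T}_{t,j} \subseteq \mathcal{T}$.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Literal, Optional, Set


@dataclass(frozen=True)
class SubTask:
    """One sub-task u_{t,j} = (I, C, b, T_sub); field names are hypothetical."""
    instruction: str        # I_{t,j}
    context: str            # C_{t,j}
    backend: str            # b_{t,j}, expected to be a member of B
    tools: FrozenSet[str]   # T_{t,j}, expected to be a subset of T


@dataclass
class Decision:
    """Structured decision y_t: either delegate a batch or complete."""
    kind: Literal["delegate", "complete"]
    subtasks: List[SubTask] = field(default_factory=list)
    answer: Optional[str] = None


def validate(decision: Decision, backends: Set[str], tools: Set[str]) -> List[str]:
    """Return a list of constraint violations (empty if the decision is well-formed)."""
    errors: List[str] = []
    if decision.kind == "complete":
        if decision.answer is None:
            errors.append("complete decision needs an answer")
        return errors
    if not decision.subtasks:
        errors.append("delegate decision needs at least one sub-task")
    for j, u in enumerate(decision.subtasks):
        if u.backend not in backends:
            errors.append(f"sub-task {j}: unknown backend {u.backend!r}")
        extra = u.tools - tools
        if extra:
            errors.append(f"sub-task {j}: tools not in T: {sorted(extra)}")
    return errors


# Hypothetical backend and tool names, for illustration only.
B = {"text-llm", "vision-llm"}
T = {"web_search", "image_analysis"}
ok = Decision("delegate", [SubTask("Describe the image", "", "vision-llm",
                                   frozenset({"image_analysis"}))])
bad = Decision("delegate", [SubTask("Transcribe", "", "audio-llm",
                                    frozenset({"asr"}))])
print(validate(ok, B, T))   # no violations
print(validate(bad, B, T))  # unknown backend and unknown tool
```

A check of this kind corresponds roughly to what a format-level reward could verify, although the paper's actual decision schema is not given in this summary.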

**Orchestra-o1 Framework** (key components):

1. **Flexible Agentic Backends**: Each backend $b$ has a capability vector $\phi(b) = (\phi^{\text{txt}}_b, \phi^{\text{img}}_b, \phi^{\text{aud}}_b, \phi^{\text{vid}}_b, \phi^{\text{code}}_b, \kappa_b, \delta_b)$ covering its text, image, audio, video, and code abilities together with its cost $\kappa_b$ and latency $\delta_b$. Model assignment maximizes a cost-aware matching score (Eq. 5).

2. **Unified Omnimodal Tool Ecosystem**: Tools include perception (image, audio, video analysis) and action (web search, page visit, code execution). Tools are selected by requirement matching (Eqs. 6-7).

3. **Modality-Aware Task Decomposition**: The main agent induces a latent dependency graph $\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t)$ over unsolved sub-goals. Each node has modality mask $\mu(v)$ and tool mask $\alpha(v)$. The ready set is:
   $$\mathcal{R}_t = \{ v \in \mathcal{V}_t \setminus \mathcal{C}_t : \text{Pred}(v) \subseteq \mathcal{C}_t \}.$$
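   Here $\mathcal{C}_t$ is the set of sub-goals already completed and $\text{Pred}(v)$ is the set of predecessors of $v$ in $\mathcal{G}_t$. A parallel batch $\mathcal{P}_t$ is selected from $\mathcal{R}_t$ under capacity and budget constraints (Eq. 10).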

4. **Parallel Sub-task Execution**: Each sub-task is executed by an independent ReAct-style sub-agent. Execution factorizes as:
   $$p(Z_t \mid U_t, s_t) = \prod_{j=1}^{K_t} p(z_{t,j} \mid u_{t,j}, s_t).$$
   Proposition 1 formalizes the latency advantage of parallel over linear execution.

5. **Context Memory and Iterative Refinement**: After each round, memory $H_t$ is updated with summarized sub-agent results, and a compressed context $C_{t+1}$ is maintained under a token budget $L_{\text{ctx}}$ (Eq. 20). The main agent terminates when the evidence sufficiency score exceeds a threshold $\tau_{\text{stop}}$ (Eq. 22).
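
The ready-set definition in component 3 translates directly into a few lines of code. The sketch below computes $\mathcal{R}_t$ from a predecessor map and then picks a parallel batch. Since Eq. 10 is not reproduced in this summary, the greedy cheapest-first rule, the per-round `capacity` and `budget` parameters, and the example sub-goal names and costs are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def ready_set(preds: Dict[str, Set[str]], completed: Set[str]) -> Set[str]:
    """R_t: unsolved nodes whose predecessors are all completed."""
    return {v for v, p in preds.items() if v not in completed and p <= completed}


def select_batch(ready: Set[str], cost: Dict[str, float],
                 capacity: int, budget: float) -> List[str]:
    """Assumed stand-in for Eq. 10: greedy, cheapest first, per-round limits."""
    batch: List[str] = []
    spent = 0.0
    for v in sorted(ready, key=lambda n: (cost[n], n)):
        if len(batch) == capacity:
            break
        if spent + cost[v] <= budget:
            batch.append(v)
            spent += cost[v]
    return batch


# Hypothetical dependency graph: node -> set of predecessor nodes.
preds = {
    "transcribe_audio": set(),
    "describe_image": set(),
    "web_lookup": {"transcribe_audio"},
    "final_compute": {"describe_image", "web_lookup"},
}
cost = {"transcribe_audio": 2.0, "describe_image": 1.0,
        "web_lookup": 1.5, "final_compute": 0.5}

completed: Set[str] = set()
rounds: List[List[str]] = []
while len(completed) < len(preds):
    batch = select_batch(ready_set(preds, completed), cost, capacity=2, budget=5.0)
    if not batch:  # nothing schedulable under the limits; stop instead of looping
        break
    rounds.append(batch)
    completed |= set(batch)

for t, batch in enumerate(rounds, start=1):
    print(f"round {t}: {batch}")
```

In this toy graph the two perception sub-goals have no predecessors and run together in the first round, while the dependent sub-goals wait until their inputs are available.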

**Proposition 1 (Round-level Latency Advantage)**:
$$\text{Latency}_{\text{linear}}(t) = \sum_{j=1}^{K_t} \delta_{t,j}, \quad \text{Latency}_{\text{parallel}}(t) = \max_{1 \le j \le K_t} \delta_{t,j} + \delta^{\text{sync}}_t.$$
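Defining the speedup as $S_t = \text{Latency}_{\text{linear}}(t) / \text{Latency}_{\text{parallel}}(t)$, it follows that $S_t \le K_t$, and that $S_t \ge 1$ whenever $\delta^{\text{sync}}_t \le \sum_{j=1}^{K_t} \delta_{t,j} - \max_{1 \le j \le K_t} \delta_{t,j}$.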
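
The two latency expressions can be checked numerically. The sketch below evaluates them for made-up sub-task latencies (the numbers and the function name `round_latencies` are illustrative, not from the paper), taking speedup to mean the ratio of linear to parallel latency.

```python
from typing import Sequence, Tuple


def round_latencies(deltas: Sequence[float], sync: float) -> Tuple[float, float, float]:
    """Return (linear latency, parallel latency, speedup) for one round."""
    linear = sum(deltas)             # sub-tasks run one after another
    parallel = max(deltas) + sync    # slowest sub-task plus synchronization overhead
    return linear, parallel, linear / parallel


# Three sub-tasks with small synchronization overhead: parallel wins.
linear, parallel, speedup = round_latencies([4.0, 2.0, 6.0], sync=0.5)
print(linear, parallel, round(speedup, 3))

# Overhead larger than sum(deltas) - max(deltas): parallel is slower.
_, _, slow = round_latencies([1.0, 1.0], sync=2.0)
print(round(slow, 3))
```

The second call illustrates why the bound on $\delta^{\text{sync}}_t$ matters: once synchronization costs more than the time saved by overlapping sub-tasks, the ratio drops below 1.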

**Proposition 2 (Information Gain)**:
If $\forall r: I(Y; E_r \mid q, E_{<r}) \ge I(Y; E^0_r \mid q, E^0_{<r})$ with strict inequality for some $r$, then $I(Y; E^{\text{orch}} \mid q) > I(Y; E^0 \mid q)$.
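
One way to see why the premise implies the conclusion is the chain rule for mutual information. The derivation below is a sketch under an assumption about notation that this summary does not spell out: $E^{\text{orch}} = (E_1, \dots, E_R)$ and $E^0 = (E^0_1, \dots, E^0_R)$ are the evidence sequences gathered by the orchestrated and native systems, with $R$ finite.

$$
\begin{aligned}
I(Y; E^{\text{orch}} \mid q) &= \sum_{r=1}^{R} I(Y; E_r \mid q, E_{<r}) \\
&> \sum_{r=1}^{R} I(Y; E^0_r \mid q, E^0_{<r}) \\
&= I(Y; E^0 \mid q).
\end{aligned}
$$

The first and last lines are the chain rule applied to each sequence. The middle inequality holds because every term on the left is at least as large as its counterpart and at least one is strictly larger.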

**Training Recipe – DA-GRPO**:

1. **Training Data Curation**: Starting from 300 seeds, the pipeline applies anchor fact extraction, five rewrite strategies (pivot swapping, temporal shifting, numerical recombination, entity-sibling querying, multi-hop reordering), and a cascade of quality gates (anchor coverage, lexical similarity, modal-bypass test, numerical check, LLM judge). It produces ~1200 verified examples with dense decision-level supervision.

2. **DA-GRPO Objective**: For each state, sample $G$ candidate decisions $\{y_{i,j}\}_{j=1}^G$. Reward each decision on four dimensions:
   $$r_{i,j} = \alpha_1 r^{\text{format}}_{i,j} + \alpha_2 r^{\text{action}}_{i,j} + \alpha_3 r^{\text{tool}}_{i,j} + \alpha_4 r^{\text{decision}}_{i,j}.$$
   The weights are $(\alpha_1, \alpha_2, \alpha_3, \alpha_4) = (0.1, 0.1, 0.2, 0.6)$.

   Compute relative advantage:
   $$\hat{A}_{i,j} = \frac{r_{i,j} - \text{Mean}(\{r_{i,k}\})}{\text{Std}(\{r_{i,k}\}) + \epsilon}.$$

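   Optimize with the clipped policy-gradient objective, where $\rho_{i,j}(\theta)$ is the policy probability ratio and $\beta$ weights the KL penalty toward the reference policy $\pi_{\text{ref}}$: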
   $$\mathcal{L}_{\text{DA-GRPO}}(\theta) = -\mathbb{E}_{i,j}\left[ \min\left( \rho_{i,j}(\theta)\hat{A}_{i,j},\ \text{clip}(\rho_{i,j}(\theta),1-\epsilon,1+\epsilon)\hat{A}_{i,j} \right) - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right].$$
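
The reward combination, group-relative advantage, and clipped term can be traced on a toy group of candidates. This is a minimal sketch rather than the paper's training code: the component scores are assumed to lie in $[0, 1]$, the standard deviation is taken as the population standard deviation, `eps` and `clip_eps` are illustrative values (the summary uses $\epsilon$ for both the stabilizer and the clipping range), and the KL penalty is omitted.

```python
import math
from statistics import fmean, pstdev
from typing import Dict, List

# Reward weights alpha_1..alpha_4 as reported in the summary.
WEIGHTS = {"format": 0.1, "action": 0.1, "tool": 0.2, "decision": 0.6}


def combined_reward(parts: Dict[str, float]) -> float:
    """Weighted sum of the four decision-level reward components."""
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of G sampled decisions."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


# Toy group of G = 4 candidate decisions for one state (made-up scores).
candidates = [
    {"format": 1.0, "action": 1.0, "tool": 1.0, "decision": 1.0},
    {"format": 1.0, "action": 1.0, "tool": 1.0, "decision": 0.0},
    {"format": 1.0, "action": 0.0, "tool": 0.0, "decision": 0.0},
    {"format": 0.0, "action": 0.0, "tool": 0.0, "decision": 0.0},
]
rewards = [combined_reward(c) for c in candidates]
advantages = group_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

With these weights, a candidate that gets format, action, and tools right but makes the wrong strategic decision still earns only 0.4, so most of the learning signal comes from the decision component.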

## Empirical Validation / Results

**Benchmark**: OmniGAIA, covering text, image, audio, video across 9 categories and 3 difficulty levels.

**Main Results** (Table 1, accuracy in %; bold marks the best score within the proprietary group, i.e. the first three systems, and within the open-source group, i.e. the last two):

| Category | Gemini-3-Pro | AOrchestra-GPT-5 | Orchestra-o1-GPT-5 | OmniAtlas-Qwen3-30B | Orchestra-o1-8B |
|----------|--------------|------------------|--------------------|----------------------|-----------------|
| Geography | 65.2 | 34.8 | **72.5** | 10.1 | **21.7** |
| Technology | 59.2 | 40.8 | **69.4** | 30.6 | **32.7** |
| History | 62.1 | 56.1 | **75.8** | 29.9 | **37.9** |
| Finance | **72.0** | 32.0 | 64.0 | **32.0** | 12.0 |
| Sport | 78.4 | 51.4 | **83.8** | 18.9 | **29.7** |
| Art | 52.8 | 25.0 | **63.9** | **16.7** | **16.7** |
| Movie | 48.5 | 42.4 | **69.7** | 12.1 | **45.5** |
| Science | 42.3 | 30.8 | **73.1** | 11.5 | **38.5** |
| Food | **88.9** | 22.2 | 83.3 | 27.8 | **38.9** |
| **Overall** | 62.5 | 40.0 | **72.8** | 20.8 | **30.0** |

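- Orchestra-o1-GPT-5 outperforms Gemini-3-Pro by 10.3 percentage points and AOrchestra-GPT-5 by 32.8 points overall. It leads in seven of the nine categories; Gemini-3-Pro scores higher on Finance and Food.
- Orchestra-o1-8B improves over OmniAtlas-Qwen3-30B by 9.2 points overall despite using a smaller 8B backbone, although it trails on Finance and ties on Art.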

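**Difficulty-level Results** (Figure 4): With the proprietary main agent, Orchestra-o1 scores 80.3% on easy, 75.0% on medium, and 56.4% on hard tasks. With the open-source main agent, it scores 36.1%, 26.9%, and 26.9%, respectively. Both settings improve consistently over the best baselines.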

**Efficiency** (Figure 5): Orchestra-o1 reaches 72.8% accuracy at a cost of 341.6, whereas AOrchestra reaches 40.0% at a cost of 565.7. The paper attributes the gap to parallel execution and explicit tool and sub-agent selection.
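
The reported figures can be turned into two simple derived quantities: accuracy per unit of cost and the relative cost reduction. The sketch below is plain arithmetic on the numbers quoted above; the cost unit is not stated in this summary, and the ratio itself is an illustrative metric rather than one reported by the paper.

```python
# (accuracy in %, cost) as quoted in the summary; the cost unit is unspecified.
results = {
    "Orchestra-o1": (72.8, 341.6),
    "AOrchestra": (40.0, 565.7),
}


def accuracy_per_cost(accuracy: float, cost: float) -> float:
    """Accuracy points obtained per unit of cost (illustrative metric)."""
    return accuracy / cost


ratios = {name: accuracy_per_cost(acc, cost) for name, (acc, cost) in results.items()}
cost_reduction = 1.0 - results["Orchestra-o1"][1] / results["AOrchestra"][1]

for name, value in ratios.items():
    print(f"{name}: {value:.3f} accuracy points per cost unit")
print(f"relative cost reduction: {cost_reduction:.1%}")
```

On these numbers Orchestra-o1 obtains roughly three times as many accuracy points per unit of cost while spending about 40% less.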

**Ablation Studies** (Figure 6, Table 2):
- ReAct-GPT-5 with the same tools reaches 53.9%; Orchestra-o1-GPT-5 reaches 72.8% (+18.9 points).
- With Qwen3-8B as the backbone: ReAct 12.5%; Orchestra-o1 without post-training 26.3%; with SFT 28.6%; with vanilla GRPO 27.7%; with DA-GRPO **30.0%**.

## Theoretical and Practical Implications

**Theoretical:**
- Orchestration provides an information-theoretic advantage (Proposition 2) by specializing sub-agents per modality, increasing mutual information between evidence and answer.
- Parallel execution yields formal latency speedup (Proposition 1), bounded by the number of independent sub-tasks.
- DA-GRPO offers a principled way to provide dense, decision-level rewards for multi-step orchestration, avoiding expensive online execution during training.

**Practical:**
- Orchestra-o1 is a scalable, open-source framework for building omnimodal agent swarms with flexible backends and tools.
- The system achieves both higher accuracy and lower cost than linear orchestration baselines.
- Training with DA-GRPO lets a compact 8B model orchestrate complex multi-modal tasks more accurately (30.0% vs. 26.3% without post-training), lowering the barrier to building omnimodal agents.
- The data curation pipeline can generate diverse, verified training examples from limited seeds, applicable to other orchestration tasks.

## Conclusion

Orchestra-o1 is an omnimodal agent orchestration framework that separates high-level orchestration from specialized execution, enabling modality-aware decomposition, parallel sub-task execution, and iterative evidence aggregation. DA-GRPO trains the main agent by rewarding decision-level quality across format, action, tool, and strategic dimensions. Extensive experiments on OmniGAIA show state-of-the-art performance, with strong gains over both native omnimodal agents and previous orchestration approaches, while being more cost-effective. Future work will extend to practical scenarios like audio-video collaborative coding and voice-guided computer-use tasks.
