# Heterogeneous Scientific Foundation Model Collaboration

> Eywa introduces a framework where language models collaborate with specialized scientific foundation models via a "Tsaheylu" interface, achieving higher performance with lower computational cost across diverse data modalities.

- **Source:** [arXiv](https://arxiv.org/abs/2604.27351)
- **Published:** 2026-05-02
- **Permalink:** https://picx.dev/p/DZDopI
- **Whiteboard:** https://picx.dev/p/DZDopI/image

## Summary

*   **Introduces Eywa**, a heterogeneous agentic framework that bridges language-centric agentic systems and domain-specific foundation models (FMs) via an FM-LLM "Tsaheylu" interface, enabling collaboration across non-linguistic scientific data modalities.
*   **Proposes three instantiations**: `EywaAgent` (single-agent with FM integration), `EywaMAS` (multi-agent system with plug-and-play `EywaAgent`s), and `EywaOrchestra` (planning-based dynamic orchestration of heterogeneous experts).
*   **Establishes theoretical guarantees**, proving that `EywaAgent` strictly improves optimal task risk over language-only agents under a Domain Advantage assumption, and that adaptive orchestration (`EywaOrchestra`) dominates any fixed system configuration.
*   **Introduces `EywaBench`**, a scalable multi-task, multi-domain scientific benchmark covering physical, life, and social sciences across natural language, time series, and tabular data.
*   **Demonstrates empirical gains**: Across diverse scientific domains, `Eywa` improves utility over language-only baselines while reducing cost; for example, `EywaAgent` uses ~30% fewer tokens and ~10% less execution time than the single-LLM agent.

## Introduction and Theoretical Foundation
Recent agentic AI systems powered by Large Language Models (LLMs) excel at general reasoning but are fundamentally limited by their reliance on natural language as a universal interface. This creates a bottleneck for scientific tasks involving specialized, structured data modalities (e.g., time series, tabular data, symbolic formulas), where domain-specific foundation models (FMs) offer superior predictive capabilities but lack native language interfaces.

The core challenge is enabling **heterogeneous foundation models to collaborate within agentic systems**. The paper frames this as a communication limitation: existing multi-agent systems assume all agents communicate via language, but many scientific FMs do not. Drawing an analogy from *Avatar*, the authors propose a "Tsaheylu" (neural bond) interface to connect LLMs (Na'vi) and specialized FMs (Pandoran species) under a unified coordinating intelligence ("Eywa").

The problem is formally defined. A task instance is $τ = (q, x, y^⋆, ℓ)$ with instruction $q$, input $x$, target $y^⋆$, and loss $ℓ$. The input space factorizes into linguistic and domain-specific components:
$$X = X_{lng} \times X_1 \times \cdots \times X_m$$
The goal is to minimize the expected task loss: $\min_G \mathbb{E}_{τ∼T}[ℓ(\hat{y}_G(τ), y^⋆)]$.
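
To make the objective concrete, the expected loss can be approximated by averaging over sampled task instances. The following is a minimal sketch under stated assumptions: the `Task` dataclass, `empirical_risk`, the absolute-error loss, and the toy data are all illustrative names and values, not the paper's code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class Task:
    """One task instance tau = (q, x, y_star, loss)."""
    q: str                             # instruction
    x: Any                             # input (linguistic and domain parts)
    y_star: Any                        # target
    loss: Callable[[Any, Any], float]  # task loss l(y_hat, y_star)


def empirical_risk(system: Callable[[str, Any], Any],
                   tasks: Sequence[Task]) -> float:
    """Monte Carlo estimate of E_tau[ l(y_hat_G(tau), y_star) ]."""
    if not tasks:
        raise ValueError("need at least one task")
    return sum(t.loss(system(t.q, t.x), t.y_star) for t in tasks) / len(tasks)


def abs_error(y_hat: float, y_star: float) -> float:
    return abs(y_hat - y_star)


# Hypothetical toy tasks: predict the next value of a short series.
toy_tasks = [
    Task("forecast next value", [1.0, 2.0, 3.0], 4.0, abs_error),
    Task("forecast next value", [2.0, 4.0, 6.0], 8.0, abs_error),
]


def last_value_system(q: str, x: list) -> float:
    """Repeats the last observation."""
    return x[-1]


def linear_system(q: str, x: list) -> float:
    """Extrapolates the last step linearly."""
    return x[-1] + (x[-1] - x[-2])


risk_last = empirical_risk(last_value_system, toy_tasks)
risk_linear = empirical_risk(linear_system, toy_tasks)
print(risk_last, risk_linear)
```

Minimizing over $G$ then amounts to picking the system with the lowest estimated risk on held-out tasks; here the linear extrapolator wins on the toy data.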

A key **Domain Advantage Assumption (Assumption 1)** is made: for any informative domain component $x_k = π_k(x)$, the foundation model $F_k$ achieves strictly better performance than any language-only model on the serialized input:
$$\mathbb{E}_{τ} [ℓ_k(F_k(x_k), y^⋆)] < \inf_{A_{LLM}} \mathbb{E}_{τ} [ℓ_k(A_{LLM}(\text{serialize}(x_k)), y^⋆)]$$

## Methodology
The Eywa framework is developed in three stages, inspired by the Pandora ecosystem analogy.

### 1. EywaAgent: Reasoning Foundation Model Agents
The fundamental unit is the `EywaAgent`, which augments a domain-specific FM with an LLM-based reasoning interface via the "Tsaheylu" bond.

**Definition 2**: An `EywaAgent` is a tuple $A_{eywa} = (A_{LLM}, F, ϕ, ψ, C)$ where:
*   $A_{LLM}: S → ∆(M)$ is a language model.
*   $F: X × U → O$ is a domain-specific foundation model.
*   $(ϕ, ψ)$ is the Tsaheylu interface: $ϕ: S → U$ (query compiler), $ψ: O → Z$ (response adapter).
*   $C: S → \{\text{invoke}, \text{skip}\}$ is a control policy.

At each step $t$, given state $s^{(t)}$:
*   If $a^{(t)} = \text{skip}$: $z^{(t)} = A_{LLM}(s^{(t)})$ (standard LLM reasoning).
*   If $a^{(t)} = \text{invoke}$: Execute the Tsaheylu pipeline $u = ϕ(s^{(t)})$, $o = F(x, u)$, $z^{(t)} = ψ(o)$.

The updated state is $s^{(t+1)} = s^{(t)} ∪ \{z^{(t)}\}$. This allows seamless switching between generalized reasoning and specialized acting.
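
The invoke/skip step above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the class name `EywaAgentSketch`, the list-of-strings state, and every toy stand-in (the naive forecaster, the lambda-based LLM, compiler, adapter, and control policy) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

State = List[str]  # s^(t), modelled here as an ordered list of text items


@dataclass
class EywaAgentSketch:
    llm: Callable[[State], str]             # A_LLM: state -> message
    fm: Callable[[Any, dict], Any]          # F: (x, u) -> o
    compile_query: Callable[[State], dict]  # phi: state -> FM query u
    adapt_response: Callable[[Any], str]    # psi: FM output o -> text z
    control: Callable[[State], str]         # C: state -> "invoke" or "skip"

    def step(self, state: State, x: Any) -> State:
        """Run one step and return s^(t+1) = s^(t) plus the new item z^(t)."""
        action = self.control(state)
        if action == "invoke":
            u = self.compile_query(state)   # u = phi(s)
            o = self.fm(x, u)               # o = F(x, u)
            z = self.adapt_response(o)      # z = psi(o)
        elif action == "skip":
            z = self.llm(state)             # plain LLM reasoning
        else:
            raise ValueError(f"unknown action: {action!r}")
        return state + [z]


def naive_forecaster(x: list, u: dict) -> list:
    """Hypothetical FM stand-in: repeat the last value `horizon` times."""
    return [x[-1]] * u["horizon"]


agent = EywaAgentSketch(
    llm=lambda s: f"LLM: answered using {len(s)} context items",
    fm=naive_forecaster,
    compile_query=lambda s: {"horizon": 2},
    adapt_response=lambda o: f"FM: forecast={o}",
    # Toy policy: call the FM once, then let the LLM finish.
    control=lambda s: "skip" if any(z.startswith("FM:") for z in s) else "invoke",
)

state = ["Forecast the next 2 values."]
series = [1.0, 2.0, 3.0]
state = agent.step(state, series)  # first step invokes the FM
state = agent.step(state, series)  # second step falls back to the LLM
print(state)
```

Keeping the control policy as a separate callable mirrors the tuple in Definition 2: the decision to invoke the FM is independent of how the query is compiled or the response adapted.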

**Theorem 3 (Improvement over Language-only Agent)**: Under Assumption 1, `EywaAgent` achieves a strictly lower optimal expected task loss than language-only agents:
$$\inf_{f∈F_{Eywa}} \mathbb{E}_{τ∼T}[ℓ(f(x), y^⋆)] < \inf_{f∈F_{LLM}} \mathbb{E}_{τ∼T}[ℓ(f(x), y^⋆)]$$

**Implementation**: The Tsaheylu interface is instantiated using the Model Context Protocol (MCP), exposing each FM as a remote service with a well-defined schema.

### 2. EywaMAS: Multi-Agent Composition
`EywaMAS` generalizes the framework to multi-agent settings by allowing plug-and-play composition of `EywaAgent`s and traditional LLM agents.

**Definition 4**: An `EywaMAS` is a multi-agent system $M_{Eywa} = (A, G)$ where $A = \{A_1, ..., A_n\}$ is a set of heterogeneous agents (LLM agents or `EywaAgent`s), and $G$ specifies the communication topology. It follows standard MAS dynamics (Equation 1 from the paper):
$$s_i^{(t)} = \text{Update}_i(s_i^{(t-1)}, m_{-i}^{(t)}), \quad m_i^{(t)} ∼ A_i(s_i^{(t)})$$
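
One way to read these dynamics is as synchronous rounds of message passing over the topology $G$. The sketch below assumes that each agent appends the latest messages of its in-neighbours to its state and then emits a new message; `run_mas`, the dictionary-based topology, and the two toy agents are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Agent = Callable[[List[str]], str]  # maps a state s_i to a message m_i


def run_mas(agents: Dict[str, Agent],
            topology: Dict[str, List[str]],
            task: str,
            rounds: int) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Synchronous rounds; topology[i] lists the agents that i listens to."""
    states = {name: [task] for name in agents}
    messages = {name: "" for name in agents}  # nothing sent before round 1
    for _ in range(rounds):
        # Update_i: extend each state with the in-neighbours' latest messages.
        states = {
            name: states[name] + [
                f"{src}: {messages[src]}"
                for src in topology.get(name, [])
                if messages[src]
            ]
            for name in agents
        }
        # m_i ~ A_i(s_i): every agent emits a message from its updated state.
        messages = {name: agent(states[name]) for name, agent in agents.items()}
    return states, messages


# Toy agents: one stands in for an EywaAgent, one for a plain LLM agent.
agents = {
    "forecaster": lambda s: "forecast=4.0",
    "writer": lambda s: f"summary of {len(s)} items",
}
topology = {"writer": ["forecaster"], "forecaster": []}

states, messages = run_mas(agents, topology, "Forecast and report.", rounds=2)
print(states["writer"], messages["writer"])
```

Because agents are plain callables here, an `EywaAgent` and an LLM-only agent are interchangeable in the `agents` dictionary, which is the plug-and-play property the definition describes.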

### 3. EywaOrchestra: Dynamic Orchestration
`EywaOrchestra` introduces a planning-based conductor that dynamically instantiates the optimal heterogeneous system for each task.

**Definition 5**: `EywaOrchestra` is a tuple $O = (C, P)$, where $C$ is the configuration space induced by candidate LLMs $M_{LLM}$, candidate FMs $M_{FM}$, and a topology pool $Π$; and $P$ is the conductor. The conductor $P$, conditioned on the input task, selects a configuration $c$ that decides: (i) agent types (LLM or `EywaAgent`), (ii) backbone LLMs, (iii) attached FMs, and (iv) communication topology.

**Algorithm 1** outlines the process:
1.  Select system configuration $c ← P(q, x)$.
2.  Instantiate the heterogeneous agent system specified by $c$.
3.  Execute the system and return output $\hat{y}$.
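
The three steps of Algorithm 1 can be sketched as follows. This is a toy illustration only: the `Config` fields follow the four decisions listed in Definition 5, but the rule-based `conductor`, the model names `"small-llm"` and `"ts-fm"`, and the stand-in systems are assumptions, whereas the paper describes a planning-based conductor.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class Config:
    agent_types: Tuple[str, ...]  # (i) "llm" or "eywa" per agent
    backbone: str                 # (ii) backbone LLM (hypothetical name)
    fm: Optional[str]             # (iii) attached FM, if any
    topology: str                 # (iv) communication topology


def conductor(q: str, x: Any) -> Config:
    """Toy stand-in for P: route numeric series to an FM-equipped agent."""
    is_series = (isinstance(x, list) and len(x) > 1
                 and all(isinstance(v, (int, float)) for v in x))
    if is_series:
        return Config(("eywa",), "small-llm", "ts-fm", "single")
    return Config(("llm",), "small-llm", None, "single")


def extrapolate(q: str, x: list) -> float:
    """Stand-in for an FM-backed system: linear extrapolation."""
    return x[-1] + (x[-1] - x[-2])


def answer_in_text(q: str, x: Any) -> str:
    """Stand-in for a language-only system."""
    return f"text answer to: {q}"


def instantiate(c: Config) -> Callable[[str, Any], Any]:
    """Step 2: build the system specified by the configuration."""
    return extrapolate if c.fm == "ts-fm" else answer_in_text


def eywa_orchestra(q: str, x: Any) -> Tuple[Any, Config]:
    c = conductor(q, x)        # Step 1: c <- P(q, x)
    system = instantiate(c)    # Step 2: instantiate the system
    return system(q, x), c     # Step 3: execute and return y_hat


y_hat, cfg = eywa_orchestra("Forecast the next value.", [1.0, 2.0, 3.0])
y_text, cfg_text = eywa_orchestra("Define entropy.", "no structured input")
print(y_hat, cfg.fm, y_text, cfg_text.agent_types)
```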

The **oracle adaptive risk** $R_{oracle} = \mathbb{E}_{τ}[ \min_{c∈C} \mathbb{E}[ℓ(F_c(q, x), y^⋆)] ]$ is always less than or equal to the **best fixed-configuration risk** $R^⋆_{fixed} = \min_{c∈C} \mathbb{E}_{τ}[ℓ(F_c(q, x), y^⋆)]$, with strict inequality when different tasks favor different configurations.
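
The inequality holds because, for every task, the minimum over configurations is no larger than the loss of any single fixed configuration, and taking expectations preserves that ordering. A small numerical example makes the gap visible; the loss values and configuration names below are invented purely for illustration and are treated as already averaged over system randomness.

```python
# Hypothetical expected losses: losses[config][i] is the loss on task i.
losses = {
    "config_a": [0.1, 0.9, 0.5],
    "config_b": [0.8, 0.2, 0.5],
}
n_tasks = 3


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Best fixed configuration: average first, then minimize over configs.
fixed_risk = {c: mean(v) for c, v in losses.items()}
best_fixed = min(fixed_risk.values())

# Oracle adaptive risk: minimize per task, then average.
oracle = mean(min(losses[c][i] for c in losses) for i in range(n_tasks))

print(f"best fixed risk: {best_fixed:.3f}")
print(f"oracle risk:     {oracle:.3f}")
```

Here both fixed configurations average 0.5, while the oracle picks `config_a` on the first task and `config_b` on the second and reaches roughly 0.267. The gap disappears only when one configuration is best on every task.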

## Empirical Validation / Results
Evaluation is conducted on the novel `EywaBench` benchmark and compares `Eywa` against strong single-agent and multi-agent baselines.

### EywaBench: A Scalable Multi-task Multi-domain Scientific Benchmark
*   **Coverage**: Spans 3 domains (Physical, Life, Social), each with 3 sub-domains (e.g., Material, Energy, Space; Biology, Clinic, Drug), across 3 modalities (Natural Language, Time Series, Tabular).
*   **Sources**: Constructed from DeepPrinciple, MMLU-Pro, fev-bench, and TabArena.
*   **Scale**: 200 task instances in V1, designed for extensibility in task volume and domain coverage.
*   **Metric**: A unified utility score $u ∈ [0,1]$ computed modality-specifically (soft-match for NL, normalized error for time series/tabular) to enable cross-modality comparison.

### Experimental Setup
*   **LLM**: `gpt-5-nano` as default backbone.
*   **Foundation Models**: `Chronos` (time series) and `TabPFN` (tabular).
*   **Baselines**:
    *   Single-agent LLM (`gpt-5-nano`).
    *   Homogeneous Multi-Agent LLM: Refine MAS, Debate MAS.
    *   Heterogeneous Multi-Agent LLM: Mixture-of-Agents (MoA), X-MAS.
*   **Implementation**: Tsaheylu built with LangChain agents and FastMCP servers.

### Main Results
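**Table 1** (key results from the paper) reports utility, execution time, and token usage per domain. Each domain cell lists three slash-separated values, which appear to correspond to that domain's three sub-domains. The key findings are listed after the table.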

| Method | Metrics | Physical Science | Life Science | Social Science | **Overall** |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Single-LLM-Agent** | Utility (↑) | 0.5616 / 0.8202 / 0.5235 | 0.3402 / 0.4582 / 0.6004 | 0.7689 / 0.6528 / 0.6758 | **0.6154** |
| | Time (↓) | 34.48 / 27.01 / 26.00 | 34.68 / 22.37 / 21.13 | 22.67 / 22.28 / 18.42 | 25.22 |
| | Tokens (↓) | 6367 / 4854 / 4512 | 6164 / 3618 / 3571 | 4097 / 3915 / 3327 | 4469 |
| **EywaAgent (Ours)** | Utility (↑) | **0.5871** / **0.8390** / **0.6123** | **0.3718** / **0.5085** / **0.6199** | **0.8048** / **0.7371** / **0.7060** | **0.6558** |
| | Time (↓) | 34.88 / 24.42 / 23.12 | 30.84 / 20.32 / 15.84 | 19.71 / 20.98 / 15.99 | **22.78** |
| | Tokens (↓) | 5040 / 3167 / 3329 | 4858 / 2333 / 2210 | 2791 / 2444 / 2248 | **3137** |
| **EywaMAS (Ours)** | Utility (↑) | **0.6381** / **0.8742** / **0.6899** | **0.3798** / **0.5086** / **0.6248** | **0.7959** / **0.7284** / **0.7406** | **0.6761** |
| | Time (↓) | 77.25 / 75.96 / 72.51 | 111.92 / 59.97 / 59.23 | 68.40 / 58.11 / 46.49 | 72.11 |
| | Tokens (↓) | 14529 / 11709 / 11787 | 16502 / 9407 / 8078 | 11044 / 9470 / 8912 | 11214 |
| **EywaOrchestra (Ours)** | Utility (↑) | **0.6249** / **0.8711** / **0.7187** | **0.3682** / **0.5159** / **0.6319** | **0.7830** / **0.7388** / **0.7298** | **0.6746** |
| | Time (↓) | 61.78 / 39.92 / 75.47 | 67.88 / 45.38 / 45.94 | 49.13 / 34.18 / 28.80 | **48.16** |
| | Tokens (↓) | 11535 / 7723 / 10810 | 11315 / 7050 / 6495 | 7117 / 7264 / 6892 | **8335** |

*   **EywaAgent** improves average utility by a relative **6.6%** over the single-agent baseline (0.6154 to 0.6558) while reducing token usage by **~30%** and latency by **~10%**.
*   **EywaMAS** achieves the best overall utility among multi-agent systems, outperforming homogeneous and heterogeneous LLM-only MAS baselines.
*   **EywaOrchestra** approaches `EywaMAS` utility with **lower cost and full automation**, reducing average tokens from 11,214 to 8,335 (**-26%**) compared to the fixed multi-agent system.
*   **Efficiency**: The Pareto frontier plots (Figures 1 and 5) show `Eywa` methods consistently achieve higher utility with lower token consumption than baselines.
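
The percentages quoted in these findings can be reproduced from the Overall column of Table 1. The short check below uses only the table's numbers; the helper name `rel_change_pct` is an assumption introduced for the example.

```python
# Overall column of Table 1: (utility, time, tokens).
overall = {
    "Single-LLM-Agent": (0.6154, 25.22, 4469),
    "EywaAgent": (0.6558, 22.78, 3137),
    "EywaMAS": (0.6761, 72.11, 11214),
    "EywaOrchestra": (0.6746, 48.16, 8335),
}


def rel_change_pct(new: float, old: float) -> float:
    """Relative change from old to new, in percent, rounded to one decimal."""
    return round(100.0 * (new - old) / old, 1)


base_u, base_t, base_k = overall["Single-LLM-Agent"]
agent_u, agent_t, agent_k = overall["EywaAgent"]

utility_gain = rel_change_pct(agent_u, base_u)   # EywaAgent vs. single LLM
token_change = rel_change_pct(agent_k, base_k)
time_change = rel_change_pct(agent_t, base_t)

# EywaOrchestra vs. the fixed EywaMAS configuration.
orchestra_tokens = rel_change_pct(overall["EywaOrchestra"][2],
                                  overall["EywaMAS"][2])

print(utility_gain, token_change, time_change, orchestra_tokens)
```

The results (about +6.6% utility, -29.8% tokens, -9.7% time, and -25.7% tokens for `EywaOrchestra` relative to `EywaMAS`) match the rounded figures in the bullets above.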

### Further Analysis
*   **Hyperparameter Sensitivity**: `Eywa` remains robust across variations in LLM sampling temperature, FM temperature, and prompt design (Figure 6).
*   **LLM Backbone Ablation**: `Eywa` is effective across different LLM backbones (`gpt-4.1-nano`, `gpt-5-nano`, `gpt-5-mini`), with performance scaling with LLM capability but showing diminishing returns, suggesting domain-specific capability becomes the bottleneck (Table 2, Appendix Table 7).
*   **Case Studies**: Qualitative examples illustrate how `EywaAgent` successfully invokes specialized FMs (e.g., Chronos for time series forecasting) where LLM-only agents fail or resort to simplistic heuristics, and how `EywaOrchestra` dynamically selects optimal configurations.

## Theoretical and Practical Implications
**Theoretical**: The paper provides a rigorous information-theoretic and statistical learning foundation. It proves that serialization creates an irreducible Bayes risk gap (Lemma 12), that `EywaAgent` strictly expands the solvable task space (Theorem 15), and that adaptive orchestration dominates fixed systems (Theorem 18). It also formalizes the token efficiency gains (Proposition 19).

**Practical**: `Eywa` provides an actionable framework for integrating the rapidly growing ecosystem of scientific foundation models (e.g., for materials, weather, biology) into LLM-powered agentic workflows. It moves beyond tool-calling to enable true *modality-native collaboration*, where FMs participate directly in reasoning loops. This can accelerate scientific discovery, complex analysis, and decision-making in fields where data is inherently non-linguistic. The plug-and-play design (`EywaMAS`) and dynamic orchestration (`EywaOrchestra`) lower the barrier to building and optimizing such heterogeneous systems.

## Conclusion
The `Eywa` framework successfully addresses the limitation of language-centric agentic systems in scientific domains by enabling effective collaboration between LLMs and domain-specific foundation models. Through the "Tsaheylu" interface and its three instantiations (`EywaAgent`, `EywaMAS`, `EywaOrchestra`), it demonstrates significant improvements in task utility, token efficiency, and execution speed across a diverse range of scientific tasks. The work opens avenues for scaling heterogeneous model ecosystems, learning better orchestration policies, and developing tighter integration mechanisms between linguistic and non-linguistic AI models.

---

_Markdown view of https://picx.dev/p/DZDopI, served by PicX — AI-generated visual whiteboard summaries of research papers._