Heterogeneous Scientific Foundation Model Collaboration

Summary (Overview)

  • Introduces Eywa, a heterogeneous agentic framework that bridges language-centric agentic systems and domain-specific foundation models (FMs) via an FM-LLM "Tsaheylu" interface, enabling collaboration across non-linguistic scientific data modalities.
  • Proposes three instantiations: EywaAgent (single-agent with FM integration), EywaMAS (multi-agent system with plug-and-play EywaAgents), and EywaOrchestra (planning-based dynamic orchestration of heterogeneous experts).
  • Establishes theoretical guarantees, proving that EywaAgent strictly improves optimal task risk over language-only agents under a Domain Advantage assumption, and that adaptive orchestration (EywaOrchestra) dominates any fixed system configuration.
  • Introduces EywaBench, a scalable multi-task, multi-domain scientific benchmark covering physical, life, and social sciences across natural language, time series, and tabular data.
  • Demonstrates empirical gains: Experiments show Eywa improves task utility while reducing token consumption by roughly 30% and shortening execution time relative to language-only baselines across diverse scientific domains.

Introduction and Theoretical Foundation

Recent agentic AI systems powered by Large Language Models (LLMs) excel in general reasoning but are fundamentally limited by their reliance on natural language as a universal interface. This creates a bottleneck for scientific tasks involving specialized, structured data modalities (e.g., time series, tabular data, symbolic formulas), where domain-specific foundation models (FMs) offer superior predictive capabilities but lack native language interfaces.

The core challenge is enabling heterogeneous foundation models to collaborate within agentic systems. The paper frames this as a communication limitation: existing multi-agent systems assume all agents communicate via language, but many scientific FMs do not. Drawing an analogy from Avatar, the authors propose a "Tsaheylu" (neural bond) interface to connect LLMs (Na'vi) and specialized FMs (Pandoran species) under a unified coordinating intelligence ("Eywa").

The problem is formally defined. A task instance is τ = (q, x, y^⋆, ℓ) with instruction q, input x, target y^⋆, and loss ℓ. The input space factorizes into linguistic and domain-specific components:

X = X_{lng} × X_1 × · · · × X_m

The goal is to minimize the expected task loss: \min_G \mathbb{E}_{τ∼T}[ℓ(\hat{y}_G(τ), y^⋆)].

A key Domain Advantage Assumption (Assumption 1) is made: for any informative domain component x_k = π_k(x), the foundation model F_k achieves strictly better performance than any language-only model operating on the serialized input:

\mathbb{E}_{τ}[ℓ_k(F_k(x_k), y^⋆)] < \inf_{A_{LLM}} \mathbb{E}_{τ}[ℓ_k(A_{LLM}(\text{serialize}(x_k)), y^⋆)]

Methodology

The Eywa framework is developed in three stages, inspired by the Pandora ecosystem analogy.

1. EywaAgent: Reasoning Foundation Model Agents

The fundamental unit is the EywaAgent, which augments a domain-specific FM with an LLM-based reasoning interface via the "Tsaheylu" bond.

Definition 2: An EywaAgent is a tuple A_{eywa} = (A_{LLM}, F, ϕ, ψ, C) where:

  • A_{LLM}: S → ∆(M) is a language model.
  • F: X × U → O is a domain-specific foundation model.
  • (ϕ, ψ) is the Tsaheylu interface: ϕ: S → U (query compiler) and ψ: O → Z (response adapter).
  • C: S → {invoke, skip} is a control policy.

At each step t, given state s^{(t)}, the control policy selects an action a^{(t)} = C(s^{(t)}):

  • If a^{(t)} = skip: z^{(t)} = A_{LLM}(s^{(t)}) (standard LLM reasoning).
  • If a^{(t)} = invoke: execute the Tsaheylu pipeline u = ϕ(s^{(t)}), o = F(x, u), z^{(t)} = ψ(o).

The updated state is s^{(t+1)} = s^{(t)} ∪ {z^{(t)}}. This allows seamless switching between generalized reasoning and specialized acting.
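This step logic can be sketched in a few lines of Python. It is an illustrative sketch only: the names `llm`, `fm`, `compile_query`, `adapt_response`, and `should_invoke` are hypothetical stand-ins for A_{LLM}, F, ϕ, ψ, and C, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class EywaAgent:
    # Hypothetical stand-ins for the components in Definition 2.
    llm: Callable[[list], str]             # A_LLM: state -> message
    fm: Callable[[Any, Any], Any]          # F: (x, u) -> o
    compile_query: Callable[[list], Any]   # ϕ: state -> FM query u
    adapt_response: Callable[[Any], str]   # ψ: FM output o -> text z
    should_invoke: Callable[[list], bool]  # C: state -> invoke/skip

    def step(self, state: list, x: Any) -> list:
        """One reasoning step: either invoke the FM or reason with the LLM."""
        if self.should_invoke(state):      # a = invoke: Tsaheylu pipeline
            u = self.compile_query(state)
            o = self.fm(x, u)
            z = self.adapt_response(o)
        else:                              # a = skip: plain LLM reasoning
            z = self.llm(state)
        return state + [z]                 # s^(t+1) = s^(t) ∪ {z^(t)}
```

With toy callables, the first step routes through the FM and later steps fall back to the LLM, mirroring the invoke/skip switching described above.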

Theorem 3 (Improvement over Language-only Agent): Under Assumption 1, EywaAgent achieves a strictly lower optimal expected task loss than language-only agents:

\inf_{f∈F_{Eywa}} \mathbb{E}_{τ∼T}[ℓ(f(x), y^⋆)] < \inf_{f∈F_{LLM}} \mathbb{E}_{τ∼T}[ℓ(f(x), y^⋆)]

Implementation: The Tsaheylu interface is instantiated using the Model Context Protocol (MCP), exposing each FM as a remote service with a well-defined schema.
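For illustration, an MCP-style tool declaration for a forecasting FM could look like the following. The field layout (`name`, `description`, `inputSchema`) follows the general MCP tool-schema convention; the concrete tool names and schemas used in the paper are not shown, so everything below is a hypothetical example.

```python
# Hypothetical MCP-style tool schema exposing a time-series FM
# (e.g. a Chronos-like forecaster) as a remote service.
forecast_tool = {
    "name": "chronos_forecast",            # illustrative tool name
    "description": "Forecast future values of a numeric time series.",
    "inputSchema": {                       # JSON Schema for the FM query u
        "type": "object",
        "properties": {
            "series": {"type": "array", "items": {"type": "number"}},
            "horizon": {"type": "integer", "minimum": 1},
        },
        "required": ["series", "horizon"],
    },
}
```

The query compiler ϕ would produce an argument object matching `inputSchema`, and the response adapter ψ would turn the service's output back into text for the LLM's context.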

2. EywaMAS: Multi-Agent Composition

EywaMAS generalizes the framework to multi-agent settings by allowing plug-and-play composition of EywaAgents and traditional LLM agents.

Definition 4: An EywaMAS is a multi-agent system M_{Eywa} = (A, G), where A = {A_1, ..., A_n} is a set of heterogeneous agents (LLM agents or EywaAgents) and G specifies the communication topology. It follows standard MAS dynamics (Equation 1 from the paper):

s_i^{(t)} = \text{Update}_i(s_i^{(t-1)}, m_{-i}^{(t)}), \quad m_i^{(t)} ∼ A_i(s_i^{(t)})
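These dynamics can be sketched as one synchronous round, where each agent updates its state from the other agents' messages and then emits a new message. This is a minimal sketch; the paper does not prescribe this exact update schedule, and the `agents` representation here is an assumption.

```python
def mas_round(states, messages, agents):
    """One synchronous MAS round (Equation 1).

    agents: list of (update, policy) pairs, where
      update(s_i, others)  plays the role of Update_i,
      policy(s_i)          plays the role of sampling m_i ~ A_i(s_i).
    """
    new_states, new_messages = [], []
    for i, (update, policy) in enumerate(agents):
        others = [m for j, m in enumerate(messages) if j != i]  # m_{-i}^(t)
        s_i = update(states[i], others)   # s_i^(t) = Update_i(s_i^(t-1), m_{-i}^(t))
        new_states.append(s_i)
        new_messages.append(policy(s_i))  # m_i^(t) from A_i(s_i^(t))
    return new_states, new_messages
```

A fully connected topology is implicit here; restricting `others` according to G would encode other communication graphs.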

3. EywaOrchestra: Dynamic Orchestration

EywaOrchestra introduces a planning-based conductor that dynamically instantiates the optimal heterogeneous system for each task.

Definition 5: An EywaOrchestra is a tuple O = (C, P), where C is the configuration space induced by candidate LLMs M_{LLM}, candidate FMs M_{FM}, and a topology pool Π, and P is the conductor. Conditioned on the input task, P selects a configuration c that determines: (i) agent types (LLM or EywaAgent), (ii) backbone LLMs, (iii) attached FMs, and (iv) the communication topology.

Algorithm 1 outlines the process:

  1. Select a system configuration c ← P(q, x).
  2. Instantiate the heterogeneous agent system specified by c.
  3. Execute the system and return the output ŷ.
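The three steps above amount to a short control flow. A minimal sketch, where `conductor` and `instantiate` are hypothetical callables standing in for P and the system builder:

```python
def run_orchestra(conductor, instantiate, task):
    """Sketch of Algorithm 1: select, instantiate, execute."""
    q, x = task
    config = conductor(q, x)       # 1. select configuration c <- P(q, x)
    system = instantiate(config)   # 2. build the heterogeneous agent system
    return system(q, x)            # 3. execute and return y-hat
```

In practice `conductor` would itself be an LLM-based planner and `instantiate` would wire up EywaAgents, backbones, and a topology; here they are placeholders.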

The oracle adaptive risk R_{oracle} = \mathbb{E}_{τ}[\min_{c∈C} \mathbb{E}[ℓ(F_c(q, x), y^⋆)]] is always at most the best fixed-configuration risk R^⋆_{fixed} = \min_{c∈C} \mathbb{E}_{τ}[ℓ(F_c(q, x), y^⋆)], with strict inequality whenever different tasks favor different configurations.
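A toy numeric example makes the dominance concrete. The loss values below are invented purely for illustration: two configurations, two tasks, each configuration good at exactly one task.

```python
# losses[config] = per-task losses; hypothetical numbers for illustration.
losses = {
    "llm_only": [0.9, 0.1],   # strong on task 1, weak on task 0
    "eywa_ts":  [0.1, 0.9],   # strong on task 0, weak on task 1
}
tasks = [0, 1]

# Best single configuration applied to every task (R*_fixed):
r_fixed = min(sum(l[t] for t in tasks) / len(tasks) for l in losses.values())

# Oracle picks the best configuration per task (R_oracle):
r_oracle = sum(min(l[t] for l in losses.values()) for t in tasks) / len(tasks)
# r_fixed = 0.5, r_oracle = 0.1: adaptive selection strictly dominates.
```

When one configuration is best for every task, the two risks coincide; the gap opens exactly when task-dependent selection matters, which is the condition stated above.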

Empirical Validation / Results

Evaluation is conducted on the novel EywaBench benchmark and compares Eywa against strong single-agent and multi-agent baselines.

EywaBench: A Scalable Multi-task Multi-domain Scientific Benchmark

  • Coverage: Spans 3 domains (Physical, Life, Social), each with 3 sub-domains (e.g., Material, Energy, Space; Biology, Clinic, Drug), across 3 modalities (Natural Language, Time Series, Tabular).
  • Sources: Constructed from DeepPrinciple, MMLU-Pro, fev-bench, and TabArena.
  • Scale: 200 task instances in V1, designed for extensibility in task volume and domain coverage.
  • Metric: A unified utility score u ∈ [0, 1], computed modality-specifically (soft match for natural language, normalized error for time series and tabular data) to enable cross-modality comparison.
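The benchmark's exact scoring formulas are not reproduced here; the following are plausible stand-ins for the two ideas (a token-level soft match for natural language, and a score of one minus mean relative error for numeric predictions), assumed for illustration only.

```python
def nl_utility(pred: str, target: str) -> float:
    """Token-level Jaccard soft match in [0, 1] (illustrative, not EywaBench's formula)."""
    p, t = set(pred.lower().split()), set(target.lower().split())
    return len(p & t) / len(p | t) if p | t else 1.0

def numeric_utility(pred, target) -> float:
    """1 - mean relative error, clipped to [0, 1] (illustrative stand-in)."""
    errs = [abs(p - t) / (abs(t) + 1e-8) for p, t in zip(pred, target)]
    return max(0.0, 1.0 - sum(errs) / len(errs))
```

Mapping every modality onto a common [0, 1] scale like this is what lets the benchmark average utility across natural-language, time-series, and tabular tasks.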

Experimental Setup

  • LLM: gpt-5-nano as default backbone.
  • Foundation Models: Chronos (time series) and TabPFN (tabular).
  • Baselines:
    • Single-agent LLM (gpt-5-nano).
    • Homogeneous Multi-Agent LLM: Refine MAS, Debate MAS.
    • Heterogeneous Multi-Agent LLM: Mixture-of-Agents (MoA), X-MAS.
  • Implementation: Tsaheylu built with LangChain agents and FastMCP servers.

Main Results

Table 1 (Key results table from the paper) shows comprehensive performance across domains. Key findings:

| Method | Metric | Physical Science | Life Science | Social Science | Overall |
| --- | --- | --- | --- | --- | --- |
| Single-LLM-Agent | Utility (↑) | 0.5616 / 0.8202 / 0.5235 | 0.3402 / 0.4582 / 0.6004 | 0.7689 / 0.6528 / 0.6758 | 0.6154 |
| | Time (↓) | 34.48 / 27.01 / 26.00 | 34.68 / 22.37 / 21.13 | 22.67 / 22.28 / 18.42 | 25.22 |
| | Tokens (↓) | 6367 / 4854 / 4512 | 6164 / 3618 / 3571 | 4097 / 3915 / 3327 | 4469 |
| EywaAgent (Ours) | Utility (↑) | 0.5871 / 0.8390 / 0.6123 | 0.3718 / 0.5085 / 0.6199 | 0.8048 / 0.7371 / 0.7060 | 0.6558 |
| | Time (↓) | 34.88 / 24.42 / 23.12 | 30.84 / 20.32 / 15.84 | 19.71 / 20.98 / 15.99 | 22.78 |
| | Tokens (↓) | 5040 / 3167 / 3329 | 4858 / 2333 / 2210 | 2791 / 2444 / 2248 | 3137 |
| EywaMAS (Ours) | Utility (↑) | 0.6381 / 0.8742 / 0.6899 | 0.3798 / 0.5086 / 0.6248 | 0.7959 / 0.7284 / 0.7406 | 0.6761 |
| | Time (↓) | 77.25 / 75.96 / 72.51 | 111.92 / 59.97 / 59.23 | 68.40 / 58.11 / 46.49 | 72.11 |
| | Tokens (↓) | 14529 / 11709 / 11787 | 16502 / 9407 / 8078 | 11044 / 9470 / 8912 | 11214 |
| EywaOrchestra (Ours) | Utility (↑) | 0.6249 / 0.8711 / 0.7187 | 0.3682 / 0.5159 / 0.6319 | 0.7830 / 0.7388 / 0.7298 | 0.6746 |
| | Time (↓) | 61.78 / 39.92 / 75.47 | 67.88 / 45.38 / 45.94 | 49.13 / 34.18 / 28.80 | 48.16 |
| | Tokens (↓) | 11535 / 7723 / 10810 | 11315 / 7050 / 6495 | 7117 / 7264 / 6892 | 8335 |

Within each domain column, the three values correspond to the domain's three sub-domains.

  • EywaAgent improves average utility by 6.6% over the single-agent baseline while reducing token usage by ~30% and latency by ~10%.
  • EywaMAS achieves the best overall utility among multi-agent systems, outperforming homogeneous and heterogeneous LLM-only MAS baselines.
  • EywaOrchestra approaches EywaMAS utility with lower cost and full automation, reducing average tokens from 11,214 to 8,335 (-26%) compared to the fixed multi-agent system.
  • Efficiency: The Pareto frontier plots (Figures 1 and 5) show Eywa methods consistently achieve higher utility with lower token consumption than baselines.

Further Analysis

  • Hyperparameter Sensitivity: Eywa remains robust across variations in LLM sampling temperature, FM temperature, and prompt design (Figure 6).
  • LLM Backbone Ablation: Eywa is effective across different LLM backbones (gpt-4.1-nano, gpt-5-nano, gpt-5-mini), with performance scaling with LLM capability but showing diminishing returns, suggesting domain-specific capability becomes the bottleneck (Table 2, Appendix Table 7).
  • Case Studies: Qualitative examples illustrate how EywaAgent successfully invokes specialized FMs (e.g., Chronos for time series forecasting) where LLM-only agents fail or resort to simplistic heuristics, and how EywaOrchestra dynamically selects optimal configurations.

Theoretical and Practical Implications

Theoretical: The paper provides a rigorous information-theoretic and statistical learning foundation. It proves that serialization creates an irreducible Bayes risk gap (Lemma 12), that EywaAgent strictly expands the solvable task space (Theorem 15), and that adaptive orchestration dominates fixed systems (Theorem 18). It also formalizes the token efficiency gains (Proposition 19).

Practical: Eywa provides an actionable framework for integrating the rapidly growing ecosystem of scientific foundation models (e.g., for materials, weather, biology) into LLM-powered agentic workflows. It moves beyond tool-calling to enable true modality-native collaboration, where FMs participate directly in reasoning loops. This can accelerate scientific discovery, complex analysis, and decision-making in fields where data is inherently non-linguistic. The plug-and-play design (EywaMAS) and dynamic orchestration (EywaOrchestra) lower the barrier to building and optimizing such heterogeneous systems.

Conclusion

The Eywa framework successfully addresses the limitation of language-centric agentic systems in scientific domains by enabling effective collaboration between LLMs and domain-specific foundation models. Through the "Tsaheylu" interface and its three instantiations (EywaAgent, EywaMAS, EywaOrchestra), it demonstrates significant improvements in task utility, token efficiency, and execution speed across a diverse range of scientific tasks. The work opens avenues for scaling heterogeneous model ecosystems, learning better orchestration policies, and developing tighter integration mechanisms between linguistic and non-linguistic AI models.