Heterogeneous Scientific Foundation Model Collaboration
Summary (Overview)
- Introduces Eywa, a heterogeneous agentic framework that bridges language-centric agentic systems and domain-specific foundation models (FMs) via an FM-LLM "Tsaheylu" interface, enabling collaboration across non-linguistic scientific data modalities.
- Proposes three instantiations: `EywaAgent` (single-agent with FM integration), `EywaMAS` (multi-agent system with plug-and-play `EywaAgent`s), and `EywaOrchestra` (planning-based dynamic orchestration of heterogeneous experts).
- Establishes theoretical guarantees, proving that `EywaAgent` strictly improves the optimal task risk over language-only agents under a Domain Advantage assumption, and that adaptive orchestration (`EywaOrchestra`) dominates any fixed system configuration.
- Introduces `EywaBench`, a scalable multi-task, multi-domain scientific benchmark covering the physical, life, and social sciences across natural language, time series, and tabular data.
- Demonstrates empirical gains: experiments show `Eywa` improves utility (performance) while significantly reducing token consumption (~30%) and execution time compared to language-only baselines across diverse scientific domains.
Introduction and Theoretical Foundation
Recent agentic AI systems powered by Large Language Models (LLMs) excel in general reasoning but are fundamentally limited by their reliance on natural language as a universal interface. This creates a bottleneck for scientific tasks involving specialized, structured data modalities (e.g., time series, tabular data, symbolic formulas), where domain-specific foundation models (FMs) offer superior predictive capabilities but lack native language interfaces.
The core challenge is enabling heterogeneous foundation models to collaborate within agentic systems. The paper frames this as a communication limitation: existing multi-agent systems assume all agents communicate via language, but many scientific FMs do not. Drawing an analogy from Avatar, the authors propose a "Tsaheylu" (neural bond) interface to connect LLMs (Na'vi) and specialized FMs (Pandoran species) under a unified coordinating intelligence ("Eywa").
The problem is formally defined. A task instance is $(c, x, y)$ with instruction $c$, input $x$, target $y$, and loss $\ell$. The input space factorizes into linguistic and domain-specific components: $\mathcal{X} = \mathcal{X}_{\text{lang}} \times \mathcal{X}_{\text{dom}}$.

The goal is to minimize the expected task loss: $R(\pi) = \mathbb{E}[\ell(\pi(c, x), y)]$.

A key Domain Advantage Assumption (Assumption 1) is made: for any informative domain component $x_{\text{dom}}$, the foundation model $F$ achieves strictly better performance than any language-only model on the serialized input $\mathrm{ser}(x_{\text{dom}})$:

$$\mathbb{E}\big[\ell(F(x_{\text{dom}}), y)\big] \;<\; \inf_{g}\, \mathbb{E}\big[\ell(g(\mathrm{ser}(x_{\text{dom}})), y)\big],$$

where the infimum ranges over language-only predictors $g$.
Methodology
The Eywa framework is developed in three stages, inspired by the Pandora ecosystem analogy.
1. EywaAgent: Reasoning Foundation Model Agents
The fundamental unit is the EywaAgent, which augments a domain-specific FM with an LLM-based reasoning interface via the "Tsaheylu" bond.
Definition 2: An EywaAgent is a tuple $\mathcal{A} = (L, F, T, \pi)$ where:
- $L$ is a language model.
- $F$ is a domain-specific foundation model.
- $T = (q, r)$ is the Tsaheylu interface: $q$ (query compiler) and $r$ (response adapter).
- $\pi$ is a control policy.

At each step $t$, given state $s_t$:
- If $\pi(s_t) = \texttt{lang}$: $a_t = L(s_t)$ (standard LLM reasoning).
- If $\pi(s_t) = \texttt{dom}$: execute the Tsaheylu pipeline $u_t = q(s_t)$, $v_t = F(u_t)$, $a_t = r(v_t)$.

The updated state is $s_{t+1} = s_t \oplus a_t$. This allows seamless switching between generalized reasoning and specialized acting.
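A minimal sketch of this control loop, assuming plain callable stand-ins for $L$, $F$, $q$, $r$, and $\pi$ (all names and signatures below are illustrative, not the paper's implementation):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EywaAgent:
    """Illustrative EywaAgent per Definition 2; fields are assumed stand-ins."""
    llm: Callable[[str], str]                 # L: language model
    fm: Callable[[object], object]            # F: domain-specific foundation model
    compile_query: Callable[[str], object]    # q: Tsaheylu query compiler
    adapt_response: Callable[[object], str]   # r: Tsaheylu response adapter
    policy: Callable[[str], str]              # pi: returns "lang" or "dom"

    def step(self, state: str) -> str:
        if self.policy(state) == "lang":
            action = self.llm(state)               # a_t = L(s_t)
        else:
            query = self.compile_query(state)      # u_t = q(s_t)
            result = self.fm(query)                # v_t = F(u_t)
            action = self.adapt_response(result)   # a_t = r(v_t)
        return state + "\n" + action               # s_{t+1} = s_t (+) a_t
```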
Theorem 3 (Improvement over Language-only Agent): Under Assumption 1, EywaAgent achieves a strictly lower optimal expected task loss than language-only agents: $\inf_{\pi \in \Pi_{\text{Eywa}}} R(\pi) < \inf_{\pi \in \Pi_{\text{lang}}} R(\pi)$.
Implementation: The Tsaheylu interface is instantiated using the Model Context Protocol (MCP), exposing each FM as a remote service with a well-defined schema.
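Since Tsaheylu is instantiated over MCP, a minimal FastMCP server exposing a hypothetical time-series FM as a tool might look like the sketch below; the `forecast` tool name, its schema, and the naive fallback logic are assumptions for illustration, not the paper's service definition.

```python
# Hedged sketch: expose a domain FM as an MCP tool via FastMCP.
from fastmcp import FastMCP

mcp = FastMCP("timeseries-fm")

@mcp.tool()
def forecast(series: list[float], horizon: int) -> list[float]:
    """Forecast `horizon` future values of `series` using the wrapped FM."""
    # Placeholder for a real FM call (e.g., a Chronos pipeline); the sketch
    # repeats the last observation so it stays self-contained and runnable.
    last = series[-1] if series else 0.0
    return [last] * horizon

if __name__ == "__main__":
    mcp.run()  # serve the tool over MCP (stdio transport by default)
```

Publishing the FM behind a typed schema like this is what lets the LLM side treat modality-native inference as just another action.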
2. EywaMAS: Multi-Agent Composition
EywaMAS generalizes the framework to multi-agent settings by allowing plug-and-play composition of EywaAgents and traditional LLM agents.
Definition 4: An EywaMAS is a multi-agent system $\mathcal{M} = (\mathcal{A}, G)$, where $\mathcal{A} = \{A_1, \dots, A_n\}$ is a set of heterogeneous agents (LLM agents or EywaAgents) and $G$ specifies the communication topology. It follows standard MAS dynamics (Equation 1 in the paper): in each round, every agent updates its state from its own history and the messages received from its neighbors in $G$.
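A hedged sketch of one synchronous communication round under a directed topology $G$; the agent signature is a generic MAS form assumed for illustration rather than Equation 1 verbatim.

```python
# Sketch of one EywaMAS round: each agent consumes its neighbors' messages
# and emits an updated state. Agent and edge representations are illustrative.
from typing import Callable

Agent = Callable[[str, list[str]], str]  # (own state, incoming messages) -> new state

def mas_round(agents: dict[str, Agent], states: dict[str, str],
              topology: set[tuple[str, str]]) -> dict[str, str]:
    """Apply one communication round over directed edges (src, dst) in G."""
    new_states = {}
    for name, agent in agents.items():
        inbox = [states[src] for (src, dst) in topology if dst == name]
        new_states[name] = agent(states[name], inbox)
    return new_states
```

Because each entry in `agents` can be a plain LLM agent or an `EywaAgent` step function, heterogeneity is plug-and-play at the dictionary level.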
3. EywaOrchestra: Dynamic Orchestration
EywaOrchestra introduces a planning-based conductor that dynamically instantiates the optimal heterogeneous system for each task.
Definition 5: EywaOrchestra is a tuple $(\mathcal{C}, \kappa)$, where $\mathcal{C}$ is the configuration space induced by candidate LLMs $\mathcal{L}$, candidate FMs $\mathcal{F}$, and a topology pool $\mathcal{G}$; and $\kappa$ is the conductor. The conductor $\kappa$, conditioned on the input task, selects a configuration $c \in \mathcal{C}$ that decides: (i) agent types (LLM or EywaAgent), (ii) backbone LLMs, (iii) attached FMs, and (iv) communication topology.
Algorithm 1 outlines the process (a toy sketch follows the list):
- Select a system configuration $c = \kappa(\text{task})$.
- Instantiate the heterogeneous agent system $\mathcal{M}_c$ specified by $c$.
- Execute the system and return its output $\hat{y}$.
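A minimal sketch of this loop, with a toy keyword-based conductor standing in for the paper's planner $\kappa$ (all field names and routing rules are illustrative assumptions):

```python
# Illustrative sketch of Algorithm 1; Config mirrors Definition 5.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Config:
    agent_types: list[str]      # "llm" or "eywa", one per agent
    backbones: list[str]        # backbone LLM per agent
    fms: list[Optional[str]]    # attached FM per agent (None for pure LLM agents)
    topology: str               # entry from the topology pool

def conductor(task: str) -> Config:
    """Toy stand-in for kappa: route forecasting tasks to a Chronos-backed agent."""
    if "forecast" in task.lower():
        return Config(["eywa"], ["gpt-5-nano"], ["chronos"], "single")
    return Config(["llm"], ["gpt-5-nano"], [None], "single")

def eywa_orchestra(task: str,
                   instantiate: Callable[[Config], Callable[[str], str]]) -> str:
    config = conductor(task)      # 1. select system configuration c = kappa(task)
    system = instantiate(config)  # 2. instantiate the heterogeneous system for c
    return system(task)           # 3. execute and return the output
```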
The oracle adaptive risk $R^{*}_{\text{adapt}}$ is always at most the best fixed-configuration risk, $R^{*}_{\text{adapt}} \le \min_{c \in \mathcal{C}} R(c)$, with strict inequality when different tasks favor different configurations.
Empirical Validation / Results
Evaluation is conducted on the novel EywaBench benchmark and compares Eywa against strong single-agent and multi-agent baselines.
EywaBench: A Scalable Multi-task Multi-domain Scientific Benchmark
- Coverage: Spans 3 domains (Physical, Life, Social), each with 3 sub-domains (e.g., Material, Energy, Space; Biology, Clinic, Drug), across 3 modalities (Natural Language, Time Series, Tabular).
- Sources: Constructed from DeepPrinciple, MMLU-Pro, fev-bench, and TabArena.
- Scale: 200 task instances in V1, designed for extensibility in task volume and domain coverage.
- Metric: A unified utility score computed modality-specifically (soft-match for NL, normalized error for time series/tabular) to enable cross-modality comparison (a stand-in sketch follows this list).
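A stand-in sketch of such a modality-dispatched scorer; the soft-match and normalization below are simple assumptions, since the exact EywaBench formulas are not reproduced in this summary.

```python
# Hedged utility sketch: soft token match for natural language, and a
# normalized-error score in [0, 1] for time-series / tabular predictions.
def utility(modality: str, prediction, target) -> float:
    if modality == "nl":                     # soft match over token sets
        pred = set(str(prediction).lower().split())
        gold = set(str(target).lower().split())
        return len(pred & gold) / len(gold) if gold else 0.0
    # time series / tabular: 1 - normalized absolute error, floored at 0
    errors = sum(abs(p - t) for p, t in zip(prediction, target))
    scale = sum(abs(t) for t in target) or 1.0
    return max(0.0, 1.0 - errors / scale)
```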
Experimental Setup
- LLM: `gpt-5-nano` as the default backbone.
- Foundation Models: `Chronos` (time series) and `TabPFN` (tabular).
- Baselines:
  - Single-agent LLM (`gpt-5-nano`).
  - Homogeneous multi-agent LLM: Refine MAS, Debate MAS.
  - Heterogeneous multi-agent LLM: Mixture-of-Agents (MoA), X-MAS.
- Implementation: Tsaheylu is built with LangChain agents and FastMCP servers (a client-side sketch follows this list).
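On the client side, an agent invokes the FM service through an MCP client; the sketch below uses fastmcp's own `Client` rather than the paper's LangChain wiring for brevity, and the server filename and tool name are assumptions matching the server sketch above.

```python
# Hedged client-side sketch: call the `forecast` tool served by the FastMCP
# server above. The server path and argument names are illustrative.
import asyncio
from fastmcp import Client

async def call_forecast(series: list[float], horizon: int):
    async with Client("timeseries_fm_server.py") as client:  # stdio transport
        return await client.call_tool(
            "forecast", {"series": series, "horizon": horizon}
        )

if __name__ == "__main__":
    print(asyncio.run(call_forecast([1.0, 2.0, 3.0], horizon=2)))
```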
Main Results
Table 1 (the paper's key results table) reports performance across domains; within each domain column, the three slash-separated values are the per-sub-domain scores. Key findings follow the table.
| Method | Metric | Physical Science | Life Science | Social Science | Overall |
|---|---|---|---|---|---|
| Single-LLM-Agent | Utility (↑) | 0.5616 / 0.8202 / 0.5235 | 0.3402 / 0.4582 / 0.6004 | 0.7689 / 0.6528 / 0.6758 | 0.6154 |
| | Time (↓) | 34.48 / 27.01 / 26.00 | 34.68 / 22.37 / 21.13 | 22.67 / 22.28 / 18.42 | 25.22 |
| | Tokens (↓) | 6367 / 4854 / 4512 | 6164 / 3618 / 3571 | 4097 / 3915 / 3327 | 4469 |
| EywaAgent (Ours) | Utility (↑) | 0.5871 / 0.8390 / 0.6123 | 0.3718 / 0.5085 / 0.6199 | 0.8048 / 0.7371 / 0.7060 | 0.6558 |
| | Time (↓) | 34.88 / 24.42 / 23.12 | 30.84 / 20.32 / 15.84 | 19.71 / 20.98 / 15.99 | 22.78 |
| | Tokens (↓) | 5040 / 3167 / 3329 | 4858 / 2333 / 2210 | 2791 / 2444 / 2248 | 3137 |
| EywaMAS (Ours) | Utility (↑) | 0.6381 / 0.8742 / 0.6899 | 0.3798 / 0.5086 / 0.6248 | 0.7959 / 0.7284 / 0.7406 | 0.6761 |
| | Time (↓) | 77.25 / 75.96 / 72.51 | 111.92 / 59.97 / 59.23 | 68.40 / 58.11 / 46.49 | 72.11 |
| | Tokens (↓) | 14529 / 11709 / 11787 | 16502 / 9407 / 8078 | 11044 / 9470 / 8912 | 11214 |
| EywaOrchestra (Ours) | Utility (↑) | 0.6249 / 0.8711 / 0.7187 | 0.3682 / 0.5159 / 0.6319 | 0.7830 / 0.7388 / 0.7298 | 0.6746 |
| | Time (↓) | 61.78 / 39.92 / 75.47 | 67.88 / 45.38 / 45.94 | 49.13 / 34.18 / 28.80 | 48.16 |
| | Tokens (↓) | 11535 / 7723 / 10810 | 11315 / 7050 / 6495 | 7117 / 7264 / 6892 | 8335 |
- EywaAgent improves average utility by 6.6% over the single-agent baseline while reducing token usage by ~30% and latency by ~10%.
- EywaMAS achieves the best overall utility among multi-agent systems, outperforming homogeneous and heterogeneous LLM-only MAS baselines.
- EywaOrchestra approaches `EywaMAS` utility with lower cost and full automation, reducing average tokens from 11,214 to 8,335 (-26%) compared to the fixed multi-agent system.
- Efficiency: the Pareto frontier plots (Figures 1 and 5) show `Eywa` methods consistently achieve higher utility with lower token consumption than baselines.
Further Analysis
- Hyperparameter Sensitivity: `Eywa` remains robust across variations in LLM sampling temperature, FM temperature, and prompt design (Figure 6).
- LLM Backbone Ablation: `Eywa` is effective across different LLM backbones (`gpt-4.1-nano`, `gpt-5-nano`, `gpt-5-mini`), with performance scaling with LLM capability but showing diminishing returns, suggesting domain-specific capability becomes the bottleneck (Table 2, Appendix Table 7).
- Case Studies: qualitative examples illustrate how `EywaAgent` successfully invokes specialized FMs (e.g., Chronos for time-series forecasting) where LLM-only agents fail or resort to simplistic heuristics, and how `EywaOrchestra` dynamically selects optimal configurations.
Theoretical and Practical Implications
Theoretical: The paper provides a rigorous information-theoretic and statistical learning foundation. It proves that serialization creates an irreducible Bayes risk gap (Lemma 12), that EywaAgent strictly expands the solvable task space (Theorem 15), and that adaptive orchestration dominates fixed systems (Theorem 18). It also formalizes the token efficiency gains (Proposition 19).
Practical: Eywa provides an actionable framework for integrating the rapidly growing ecosystem of scientific foundation models (e.g., for materials, weather, biology) into LLM-powered agentic workflows. It moves beyond tool-calling to enable true modality-native collaboration, where FMs participate directly in reasoning loops. This can accelerate scientific discovery, complex analysis, and decision-making in fields where data is inherently non-linguistic. The plug-and-play design (EywaMAS) and dynamic orchestration (EywaOrchestra) lower the barrier to building and optimizing such heterogeneous systems.
Conclusion
The Eywa framework successfully addresses the limitation of language-centric agentic systems in scientific domains by enabling effective collaboration between LLMs and domain-specific foundation models. Through the "Tsaheylu" interface and its three instantiations (EywaAgent, EywaMAS, EywaOrchestra), it demonstrates significant improvements in task utility, token efficiency, and execution speed across a diverse range of scientific tasks. The work opens avenues for scaling heterogeneous model ecosystems, learning better orchestration policies, and developing tighter integration mechanisms between linguistic and non-linguistic AI models.