# ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

> ARIS introduces an adversarial multi-agent research harness that uses cross-model collaboration and a rigorous evidence-to-claim audit cascade to combat plausible unsupported success in autonomous ML research.

- **Source:** [arXiv](https://arxiv.org/abs/2605.03042)
- **Published:** 2026-05-07
- **Permalink:** https://picx.dev/p/9eeqst
- **Whiteboard:** https://picx.dev/p/9eeqst/image

## Summary

## Summary (Overview)
*   **ARIS** is an open-source **research harness** for autonomous machine learning research. Its core premise is that long-horizon tasks performed by a single agent are unreliable, and that the primary failure mode is **"plausible unsupported success"**, where claims outrun the evidence that supports them.
*   The system is built around **cross-model adversarial collaboration** as a default configuration. An **executor** model (e.g., Claude) drives progress, while a **reviewer** from a different model family (e.g., GPT-5.4) critiques intermediate artifacts and requests revisions, breaking self-review blind spots.
*   The architecture is organized into three layers: an **Execution Layer** with over 65 reusable Markdown-defined skills and a persistent research wiki; an **Orchestration Layer** coordinating five end-to-end workflows; and an **Assurance Layer** featuring a rigorous evidence-to-claim audit cascade and manuscript quality checks.
*   A key innovation is the **three-stage evidence-to-claim audit cascade** (`experiment-audit`, `result-to-claim`, `paper-claim-audit`) designed to catch integrity failures, map results to explicit claims, and verify manuscript statements against raw evidence.
*   The system emphasizes **modularity and portability**, with skills defined as plain-text files usable across multiple executor platforms (Claude Code, Codex CLI, Cursor), and includes a prototype **meta-optimization outer loop** for improving the harness itself.

## Introduction and Theoretical Foundation
Recent work on "harness engineering" (Lee et al., 2026) suggests that LLM system performance depends heavily on the surrounding system logic (the *harness*) as well as model weights. Autonomous ML research poses a complex harness-engineering problem, spanning idea generation, experimentation, manuscript writing, and rebuttal.

Existing autonomous research systems (e.g., AI Scientist, Agent Laboratory) exhibit limitations: reliance on the same model family for execution and review (leading to correlated errors), tightly coupled end-to-end workflows, and a lack of explicit system-level checks for experimental integrity.

The paper's core theoretical foundation is a **stringent assumption**:
> Any long-term task performed by a single agent is unreliable.

The central risk is not outright failure but **plausible unsupported success**, where results are real but misreported, claims outrun their evidence, and readers inherit the executor's framing. This assumption decomposes into three operational bottlenecks that guide ARIS's design:
1.  **Persistent research state (i)** is required for meaningful stepwise review.
2.  **Modular execution (ii)** is required to divide long trajectories into replaceable stages.
3.  **Independent assurance (iii)** is required so the reviewer examines artifacts from a sufficiently different perspective.

The paper motivates this design by analogy with stochastic versus adversarial bandits. Single-model self-review resembles the stochastic case, where the noise is predictable, while cross-model review resembles the adversarial case, where the reviewer actively probes for weaknesses. The argument is that an adversarial reviewer is fundamentally harder to game.

## Methodology
ARIS is implemented as a stateful research harness with a three-layer architecture (Figure 4).

**1. Execution Layer:**
*   **Skills:** Over 65 research capabilities are defined as single `SKILL.md` files containing YAML frontmatter and natural-language workflow specifications. They are modular and portable.
*   **Research Wiki:** A persistent, cross-session memory system storing four entity types (papers, ideas, experiments, claims) with typed relationships. It prevents re-trying failed ideas and enables "spiral learning" (Figure 7).
*   **Tooling:** Includes six Model Context Protocol (MCP) bridges for model routing (Codex, Claude, Gemini, etc.), citation lookup tools, and a deterministic `FigureSpec` SVG renderer.
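
The skill format described above (YAML frontmatter followed by a natural-language workflow) can be illustrated with a minimal loader. This is only a sketch: the field names `name` and `description`, the example skill, and the `Skill` and `parse_skill` helpers are assumptions, since the summary does not give the actual schema. The parser handles flat `key: value` lines only, not full YAML.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    metadata: dict[str, str]  # flat key/value pairs from the frontmatter
    workflow: str             # natural-language body of the skill


def parse_skill(text: str) -> Skill:
    """Split a SKILL.md-style document into frontmatter and body.

    Only flat `key: value` lines are supported; real YAML needs a YAML parser.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing frontmatter opening '---'")
    try:
        end = next(i for i, ln in enumerate(lines[1:], start=1) if ln.strip() == "---")
    except StopIteration:
        raise ValueError("missing frontmatter closing '---'") from None
    metadata: dict[str, str] = {}
    for ln in lines[1:end]:
        if not ln.strip():
            continue
        key, sep, value = ln.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {ln!r}")
        metadata[key.strip()] = value.strip()
    workflow = "\n".join(lines[end + 1:]).strip()
    return Skill(metadata, workflow)


# Hypothetical skill file; the field names and steps are illustrative only.
EXAMPLE = """---
name: novelty-check
description: Compare a candidate idea against surveyed papers
---
1. Read the idea summary.
2. Search the research wiki for overlapping papers.
3. Report overlaps and a novelty verdict.
"""

skill = parse_skill(EXAMPLE)
print(skill.metadata["name"])
```

Because a skill is a single plain-text file, a loader of this kind is all an executor platform would need in order to discover it, which is one plausible reading of the portability claim.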

**2. Orchestration Layer:**
*   **Workflows:** Five end-to-end workflows chain skills through plain-text artifact contracts (Figure 1, Table 2):
    *   **W1: Idea Discovery** – Literature survey, idea generation, novelty checking.
    *   **W1.5: Experiment Bridge** – Code implementation, review, deployment.
    *   **W2: Auto Review Loop** – Core adversarial loop (Figure 2). A cross-model reviewer scores a draft, actionable items are extracted, experiments may be run for new evidence, and revisions are made. The loop runs for up to 4 rounds or until a score threshold (default 6/10) is met.
    *   **W3: Paper Writing** – From narrative to polished PDF (Figure 3). Includes planning, figure generation, LaTeX drafting with a five-pass editing pipeline, proof checking, claim auditing, compilation, and an improvement loop.
    *   **W4: Rebuttal** – Parses reviews, atomizes concerns, drafts a response, and passes through three safety gates (fabrication, overpromise, coverage) via a stress test.
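
The control flow of the W2 loop can be sketched as follows. Only the round cap (4) and the default score threshold (6/10) come from the summary. The `review` and `revise` callables are hypothetical stand-ins for the reviewer and executor model calls, and the toy implementations below exist only to make the sketch runnable.

```python
from typing import Callable

MAX_ROUNDS = 4         # from the summary: up to 4 rounds
SCORE_THRESHOLD = 6.0  # from the summary: default threshold of 6/10

Review = Callable[[str], tuple[float, list[str]]]
Revise = Callable[[str, list[str]], str]


def auto_review_loop(
    draft: str,
    review: Review,
    revise: Revise,
    max_rounds: int = MAX_ROUNDS,
    threshold: float = SCORE_THRESHOLD,
) -> tuple[str, list[float]]:
    """Alternate reviewer critique and executor revision.

    Stops when the score reaches the threshold or the round cap is hit.
    """
    scores: list[float] = []
    for _ in range(max_rounds):
        score, action_items = review(draft)
        scores.append(score)
        if score >= threshold:
            break
        draft = revise(draft, action_items)
    return draft, scores


# Toy stand-ins (assumptions): the "reviewer" penalizes lines tagged
# [unsupported]; the "executor" prunes one flagged line per round.
def toy_review(draft: str) -> tuple[float, list[str]]:
    flagged = [ln for ln in draft.splitlines() if "[unsupported]" in ln]
    return 7.0 - 1.5 * len(flagged), flagged


def toy_revise(draft: str, action_items: list[str]) -> str:
    drop = action_items[0]
    return "\n".join(ln for ln in draft.splitlines() if ln != drop)


DRAFT = "\n".join([
    "Claim A holds on benchmark X.",
    "Claim B generalizes. [unsupported]",
    "Claim C is state of the art. [unsupported]",
])

final_draft, score_history = auto_review_loop(DRAFT, toy_review, toy_revise)
print(score_history)
```

In this toy setup the score rises only because unsupported lines are pruned, mirroring the claim-removal behavior the paper reports. A real loop would also run new experiments between rounds.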

**3. Assurance Layer:**
This layer operationalizes the "independent assurance" bottleneck. Its core is the **evidence-to-claim audit cascade** (Figure 6):
*   **Stage 1: Experiment-Integrity Audit (`/experiment-audit`):** A cross-model reviewer audits evaluation code and outputs against five integrity failure modes: model-derived reference labels, self-normalized scores, phantom results, dead-code inflation, and scope inflation.
*   **Stage 2: Result-to-Claim Mapping (`/result-to-claim`):** Maps experimental results to explicit claim verdicts: `supported`, `partially supported`, or `invalidated`. Integrates Stage 1 audit statuses.
*   **Stage 3: Paper-Claim Audit (`/paper-claim-audit`):** A **fresh zero-context** reviewer (new thread, no prior conversation) cross-checks every quantitative claim in the manuscript against the claim ledger and raw result files. Statuses include `exact_match`, `number_mismatch`, etc.

**Additional Manuscript Assurance:**
*   **Five-pass scientific-editing pipeline** (clutter removal, active voice, sentence structure, terminology consistency, numerical consistency).
*   **Proof verification (`/proof-checker`)** with a 20-category issue taxonomy.
*   **Visual PDF review** assessing both LaTeX source and compiled PDF.
*   **Citation audit (`/citation-audit`)** verifying existence, metadata correctness, and context appropriateness of every `\cite`.
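
The existence part of the citation audit can be approximated locally by checking that every cited key appears in the bibliography. The sketch below covers only that step. Metadata and context checks need external lookups and reviewer judgment and are out of scope here. The function names and the sample keys are hypothetical, and the regular expressions cover common `\cite` variants rather than every LaTeX citation command.

```python
import re

# Matches \cite, \citep, \citet*, with optional [..] arguments before {keys}.
CITE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches the key of a BibTeX entry such as "@article{key,".
BIB_KEY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(tex: str) -> set[str]:
    """Collect every citation key used in the LaTeX source."""
    keys: set[str] = set()
    for group in CITE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def bib_keys(bib: str) -> set[str]:
    """Collect every entry key defined in the BibTeX source."""
    return set(BIB_KEY.findall(bib))


def missing_citations(tex: str, bib: str) -> list[str]:
    """Return cited keys that have no matching bibliography entry."""
    return sorted(cited_keys(tex) - bib_keys(bib))


# Hypothetical inputs for illustration.
TEX = r"Prior systems \citep{lu2024, wu2023} and harness work \cite[p.~3]{lee2026}."
BIB = """
@article{lu2024,
  title = {Placeholder title},
}
@inproceedings{lee2026,
  title = {Placeholder title},
}
"""

print(missing_citations(TEX, BIB))
```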

**Cross-Model Adversarial Collaboration Mechanism:**
The core loop (Figure 5) alternates between executor generation and external-model critique. Reviewers are configured along two axes:
1.  **Access Scope:** `Document-only`, `Artifact-augmented`, or `Repository-level`.
2.  **Context Policy:** `Fresh` (new thread per round) or `Cross-round` (retains state).

The system defaults to pairing executor and reviewer from **different model families** (e.g., Claude executor with GPT reviewer) as the recommended configuration to reduce correlated errors.
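
One way to picture the two reviewer axes and the cross-family default is as a small configuration object. The class and field names below are assumptions, as are the default values chosen for `access_scope` and `context_policy`. The summary names the options but does not say which of them is the default.

```python
from dataclasses import dataclass
from enum import Enum


class AccessScope(Enum):
    DOCUMENT_ONLY = "document-only"
    ARTIFACT_AUGMENTED = "artifact-augmented"
    REPOSITORY_LEVEL = "repository-level"


class ContextPolicy(Enum):
    FRESH = "fresh"              # new thread per round
    CROSS_ROUND = "cross-round"  # retains state across rounds


@dataclass(frozen=True)
class ReviewSetup:
    executor_family: str
    reviewer_family: str
    # Defaults below are illustrative assumptions, not taken from the paper.
    access_scope: AccessScope = AccessScope.DOCUMENT_ONLY
    context_policy: ContextPolicy = ContextPolicy.FRESH

    def is_cross_family(self) -> bool:
        """True when executor and reviewer come from different families."""
        return self.executor_family.lower() != self.reviewer_family.lower()

    def warnings(self) -> list[str]:
        """Flag configurations that weaken reviewer independence."""
        if self.is_cross_family():
            return []
        return ["executor and reviewer share a model family; errors may be correlated"]


recommended = ReviewSetup(executor_family="Claude", reviewer_family="GPT")
same_family = ReviewSetup(executor_family="Claude", reviewer_family="claude")
print(recommended.warnings(), same_family.warnings())
```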

**System Controls:**
*   **Effort Levels:** Four presets (`lite`, `balanced`, `max`, `beast`) that scale breadth, depth, and iteration counts.
*   **Reviewer Routing:** Review requests default to GPT-5.4 via Codex MCP, with optional routing to GPT-5.4 Pro via Oracle MCP for high-stakes reviews.
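
The routing rule above amounts to a two-way switch, sketched here under the assumption that a boolean `high_stakes` flag selects the route. The `Route` type and function names are hypothetical. The effort presets are listed by name only, because the summary does not give the breadth, depth, or iteration values they map to.

```python
from dataclasses import dataclass

EFFORT_LEVELS = ("lite", "balanced", "max", "beast")  # preset names from the summary


@dataclass(frozen=True)
class Route:
    model: str
    bridge: str


DEFAULT_ROUTE = Route(model="GPT-5.4", bridge="Codex MCP")
HIGH_STAKES_ROUTE = Route(model="GPT-5.4 Pro", bridge="Oracle MCP")


def route_review(high_stakes: bool = False) -> Route:
    """Pick the reviewer route; high-stakes reviews use the optional Pro route."""
    return HIGH_STAKES_ROUTE if high_stakes else DEFAULT_ROUTE


def validate_effort(level: str) -> str:
    """Reject effort levels that are not one of the four presets."""
    if level not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {level!r}")
    return level


print(route_review(), route_review(high_stakes=True))
```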

**Meta-Optimization:**
A prototype outer loop (`/meta-optimize`) analyzes usage events, proposes patches to `SKILL.md` files, and applies them only after GPT-5.4 xhigh review and user approval.
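
The double approval gate can be sketched as a guard around patch application. This is an illustration of the gating idea only: the `SkillPatch` structure, the in-memory `skills` mapping, and the skill path are hypothetical, and how ARIS actually represents patches is not described in the summary.

```python
from dataclasses import dataclass


@dataclass
class SkillPatch:
    skill_path: str                  # which SKILL.md file the patch targets
    new_text: str                    # proposed replacement content
    reviewer_approved: bool = False  # set after the external model review
    user_approved: bool = False      # set after explicit user approval


def apply_patch(patch: SkillPatch, skills: dict[str, str]) -> bool:
    """Apply the patch only if both the reviewer and the user approved it."""
    if not (patch.reviewer_approved and patch.user_approved):
        return False
    skills[patch.skill_path] = patch.new_text
    return True


# Hypothetical skill store and patch for illustration.
skills = {"skills/novelty-check/SKILL.md": "original workflow"}
patch = SkillPatch("skills/novelty-check/SKILL.md", "revised workflow")

blocked = apply_patch(patch, skills)       # neither approval present yet
patch.reviewer_approved = True
still_blocked = apply_patch(patch, skills)  # user has not approved
patch.user_approved = True
applied = apply_patch(patch, skills)
print(blocked, still_blocked, applied)
```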

## Empirical Validation / Results
The paper provides **observational deployment evidence**; outcomes cannot be causally attributed to ARIS alone.

**Deployment Footprint (as of April 2026):**

| Dimension | Current Status |
| :--- | :--- |
| **Executor platforms** | 3 tested + 3 adapted (6 total) |
| **Reviewer models** | 6+ (GPT, Gemini, GLM, MiniMax, Kimi, DeepSeek) |
| **GPU backends** | 4 (local, SSH, Vast.ai, Modal) |
| **Venue templates** | 9 families |
| **Community contributions** | 30+ contributed skills |

**Documented Overnight Run:**
A single trajectory was documented where the system operated autonomously for ~8 hours:
*   Completed **four review–revise rounds** (Workflow 2).
*   Increased an internal reviewer score from **5.0 to 7.5/10**.
*   Launched **more than 20 GPU experiments**.
*   **Removed claims** that were not supported by available evidence.

This single run illustrates how the harness operationalizes claim pruning and review-driven revision in a realistic setting.

**Feature Comparison:**
Table 4 compares ARIS with related systems feature by feature. ARIS is the only system listed that covers all six dimensions:

| System | Cross-family policy | Adversarial review | Composable skills | E2E Research Workflows | Assurance Stack | Cross-platform portability |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **AI Scientist** (Lu et al., 2024) | none | partial | × | ✓ | partial | × |
| **Agent Laboratory** (Schmidgall et al., 2025) | none | × | × | ✓ | × | × |
| **AutoGen** (Wu et al., 2023) | none | × | partial | × | × | × |
| **ARIS (ours)** | **default** | **✓** | **✓** | **✓** | **✓** | **✓** † |

† Tested on 3 platforms with adaptation guides for 3 more.

## Theoretical and Practical Implications
**Theoretical Implications:**
*   **Harness-Centric View:** The work reinforces the idea that system performance is co-determined by model weights *and* the orchestration harness, especially for long-horizon tasks.
*   **Adversarial Design for Assurance:** It proposes a practical, minimal (two-agent) adversarial framework as a defense against "plausible unsupported success," drawing an analogy to game theory and adversarial bandits.
*   **Modularity for Auditability:** By decomposing research into single-file skills and plain-text artifacts, the system makes the research process more inspectable and debuggable, addressing the "opaque agent trajectory" problem.

**Practical Implications:**
*   **Lowering Barriers to Research:** ARIS can assist in automating repetitive aspects of the research lifecycle (literature review, experiment deployment, manuscript formatting, rebuttal drafting), potentially increasing researcher productivity.
*   **Enhanced Research Integrity:** The evidence-to-claim audit cascade and manuscript assurance mechanisms provide a systematic, automated safety net against common integrity failure modes, which could improve the rigor of AI-generated research.
*   **Portable Research Workflows:** The skill-based, platform-agnostic design allows researchers to define and reuse research procedures across different LLM-powered coding environments, reducing vendor lock-in.
*   **Foundation for Self-Improvement:** The cross-model accountability primitives (reviewer independence, claim ledgers) could be adapted as an oversight layer between model outputs and downstream training data, potentially mitigating issues in recursive self-improvement loops.

## Conclusion
ARIS is a research harness built around the conservative assumption that single-agent, long-horizon research is unreliable. It responds by implementing **cross-model adversarial collaboration** as a default, decomposing the workflow into **modular, auditable skills**, and enforcing integrity through a **multi-stage assurance stack**.

The main contributions are:
1.  An **assurance stack** with a three-stage evidence-to-claim audit cascade and manuscript quality checks.
2.  A **modular system architecture** with over 65 reusable skills, a persistent research wiki, and configurable workflows.
3.  **Early deployment evidence** across multiple platforms, demonstrating operational feasibility.

**Limitations** include the absence of controlled evaluation, the inability to guarantee correctness, potential reviewer bias amplification, and security concerns with repository-level review. Human responsibility remains paramount.

**Future Work** includes:
*   **Controlled evaluations** (Appendix E outlines a benchmark protocol) to isolate the effect of cross-model heterogeneity.
*   Development of **local reviewer models** for confidential settings.
*   User studies on **researcher productivity**.
*   Exploring the adaptation of ARIS's accountability primitives for **LLM self-improvement** pipelines, as a mechanism to reduce judge-model coupling.

**Code and documentation** are available at: [https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep](https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep).

