ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration - Summary

Summary (Overview)

  • ARIS is an open-source research harness designed for autonomous machine learning research. Its core philosophy is that long-horizon tasks performed by a single agent are unreliable, with the primary failure mode being "plausible unsupported success": results where claims outrun their supporting evidence.
  • The system is built around cross-model adversarial collaboration as a default configuration. An executor model (e.g., Claude) drives progress, while a reviewer from a different model family (e.g., GPT-5.4) critiques intermediate artifacts and requests revisions, breaking self-review blind spots.
  • The architecture is organized into three layers: an Execution Layer with over 65 reusable Markdown-defined skills and a persistent research wiki; an Orchestration Layer coordinating five end-to-end workflows; and an Assurance Layer featuring a rigorous evidence-to-claim audit cascade and manuscript quality checks.
  • A key innovation is the three-stage evidence-to-claim audit cascade (experiment-audit, result-to-claim, paper-claim-audit) designed to catch integrity failures, map results to explicit claims, and verify manuscript statements against raw evidence.
  • The system emphasizes modularity and portability, with skills defined as plain-text files usable across multiple executor platforms (Claude Code, Codex CLI, Cursor), and includes a prototype meta-optimization outer loop for improving the harness itself.

Introduction and Theoretical Foundation

Recent work on "harness engineering" (Lee et al., 2026) suggests that LLM system performance depends heavily on the surrounding system logic (the harness) as well as model weights. Autonomous ML research poses a complex harness-engineering problem, spanning idea generation, experimentation, manuscript writing, and rebuttal.

Existing autonomous research systems (e.g., AI Scientist, Agent Laboratory) exhibit limitations: reliance on the same model family for execution and review (leading to correlated errors), tightly coupled end-to-end workflows, and a lack of explicit system-level checks for experimental integrity.

The paper's core theoretical foundation is a stringent assumption:

Any long-term task performed by a single agent is unreliable.

The central risk is not outright failure but plausible unsupported success, where results are real but misreported, claims outrun their evidence, and readers inherit the executor's framing. This assumption decomposes into three operational bottlenecks that guide ARIS's design:

  1. Persistent research state is required for meaningful stepwise review.
  2. Modular execution is required to divide long trajectories into replaceable stages.
  3. Independent assurance is required so the reviewer examines artifacts from a sufficiently different perspective.

The design is analogized to adversarial vs. stochastic bandits: single-model self-review is the stochastic case (predictable noise), while cross-model review is adversarial (actively probing weaknesses), and adversarial bandits are fundamentally harder to game.
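
For readers unfamiliar with the bandit analogy, the standard regret definitions make the contrast concrete (these are textbook definitions, not formulas from the paper):

```latex
% Stochastic bandit: rewards are drawn i.i.d. from fixed arm distributions;
% regret is measured against the best arm's mean \mu^* = \max_i \mu_i.
R_T^{\mathrm{stoch}} = T\,\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right]

% Adversarial bandit: losses \ell_{t,i} may be chosen by an adversary each
% round; regret is measured against the best fixed arm in hindsight.
R_T^{\mathrm{adv}} = \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_{t,a_t}\right] - \min_{i} \sum_{t=1}^{T} \ell_{t,i}
```

On this reading, a self-reviewing model faces a fixed noise distribution it can implicitly learn to satisfy, while a cross-family reviewer can adapt its probes each round, so surviving claims must hold against a moving target.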

Methodology

ARIS is implemented as a stateful research harness with a three-layer architecture (Figure 4).

1. Execution Layer:

  • Skills: Over 65 research capabilities are defined as single SKILL.md files containing YAML frontmatter and natural-language workflow specifications. They are modular and portable (a parsing sketch follows this list).
  • Research Wiki: A persistent, cross-session memory system storing four entity types (papers, ideas, experiments, claims) with typed relationships. It prevents re-trying failed ideas and enables "spiral learning" (Figure 7).
  • Tooling: Includes six Model Context Protocol (MCP) bridges for model routing (Codex, Claude, Gemini, etc.), citation lookup tools, and a deterministic FigureSpec SVG renderer.
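
The paper does not publish a full SKILL.md schema, but the described structure (YAML frontmatter plus a natural-language workflow body) is easy to illustrate. The sketch below uses hypothetical field names and parses the file with PyYAML; none of it should be read as the actual ARIS format.

```python
import yaml  # PyYAML; the field names below are hypothetical, not ARIS's schema

EXAMPLE_SKILL = """\
---
name: citation-audit
description: Verify existence, metadata, and context of every cited reference.
inputs: [manuscript.tex, references.bib]
outputs: [citation_audit_report.md]
---
1. Extract every citation key from the manuscript.
2. Confirm each key resolves to a real, correctly described reference.
3. Flag context mismatches for human review.
"""

def parse_skill(text: str) -> tuple[dict, str]:
    """Split a SKILL.md file into its YAML frontmatter and workflow body."""
    _, frontmatter, body = text.split("---\n", 2)
    return yaml.safe_load(frontmatter), body.strip()

meta, workflow = parse_skill(EXAMPLE_SKILL)
print(meta["name"])  # citation-audit
```

Because each skill is a single plain-text file, it can be dropped into any executor platform that reads Markdown, which is what makes the portability claim plausible.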

2. Orchestration Layer:

  • Workflows: Five end-to-end workflows chain skills through plain-text artifact contracts (Figure 1, Table 2):
    • W1: Idea Discovery – Literature survey, idea generation, novelty checking.
    • W1.5: Experiment Bridge – Code implementation, review, deployment.
    • W2: Auto Review Loop – Core adversarial loop (Figure 2). A cross-model reviewer scores a draft, actionable items are extracted, experiments may be run for new evidence, and revisions are made. The loop runs for up to 4 rounds or until a score threshold (default 6/10) is met; a control-flow sketch follows this list.
    • W3: Paper Writing – From narrative to polished PDF (Figure 3). Includes planning, figure generation, LaTeX drafting with a five-pass editing pipeline, proof checking, claim auditing, compilation, and an improvement loop.
    • W4: Rebuttal – Parses reviews, atomizes concerns, drafts a response, and passes through three safety gates (fabrication, overpromise, coverage) via a stress test.
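
The paper specifies W2's stopping rule (at most 4 rounds, or a score at or above the 6/10 default) but not its implementation; the following is a minimal control-flow sketch in which the Review type and both callables are assumptions rather than ARIS's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Review:
    score: float                  # reviewer's 0-10 rating of the draft
    actionable_items: list[str]   # concrete revision requests from the critique

def auto_review_loop(
    draft: str,
    review_fn: Callable[[str], Review],          # cross-family reviewer model
    revise_fn: Callable[[str, list[str]], str],  # executor applies the critique
    max_rounds: int = 4,      # paper default: up to four review-revise rounds
    threshold: float = 6.0,   # paper default: stop once the score reaches 6/10
) -> tuple[str, float]:
    score = 0.0
    for _ in range(max_rounds):
        review = review_fn(draft)
        score = review.score
        if score >= threshold:
            break  # quality bar met; stop early
        # in ARIS, new experiments may also run here to gather fresh evidence
        draft = revise_fn(draft, review.actionable_items)
    return draft, score
```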

3. Assurance Layer: This layer operationalizes the "independent assurance" bottleneck. Its core is the evidence-to-claim audit cascade (Figure 6), sketched in code after the stage list:

  • Stage 1: Experiment-Integrity Audit (/experiment-audit): A cross-model reviewer audits evaluation code and outputs against five integrity failure modes: model-derived reference labels, self-normalized scores, phantom results, dead-code inflation, and scope inflation.
  • Stage 2: Result-to-Claim Mapping (/result-to-claim): Maps experimental results to explicit claim verdicts: supported, partially supported, or invalidated. Integrates Stage 1 audit statuses.
  • Stage 3: Paper-Claim Audit (/paper-claim-audit): A fresh zero-context reviewer (new thread, no prior conversation) cross-checks every quantitative claim in the manuscript against the claim ledger and raw result files. Statuses include exact_match, number_mismatch, etc.
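
The stage outputs suggest a simple data model. The enums below use the failure modes and verdicts named in the paper, while the "clean" pass label and the rule for integrating Stage 1 statuses into Stage 2 are assumptions:

```python
from enum import Enum

class IntegrityStatus(Enum):
    CLEAN = "clean"  # assumed pass label; the five failure modes are the paper's
    MODEL_DERIVED_LABELS = "model_derived_reference_labels"
    SELF_NORMALIZED = "self_normalized_scores"
    PHANTOM_RESULTS = "phantom_results"
    DEAD_CODE_INFLATION = "dead_code_inflation"
    SCOPE_INFLATION = "scope_inflation"

class ClaimVerdict(Enum):
    SUPPORTED = "supported"
    PARTIALLY_SUPPORTED = "partially_supported"
    INVALIDATED = "invalidated"

def result_to_claim(fully_meets_claim: bool, audit: IntegrityStatus) -> ClaimVerdict:
    """Sketch of Stage 2: a result flagged by the Stage 1 integrity audit
    cannot support a claim (this integration rule is an assumption)."""
    if audit is not IntegrityStatus.CLEAN:
        return ClaimVerdict.INVALIDATED
    return ClaimVerdict.SUPPORTED if fully_meets_claim else ClaimVerdict.PARTIALLY_SUPPORTED
```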

Additional Manuscript Assurance:

  • Five-pass scientific-editing pipeline (clutter removal, active voice, sentence structure, terminology consistency, numerical consistency).
  • Proof verification (/proof-checker) with a 20-category issue taxonomy.
  • Visual PDF review assessing both LaTeX source and compiled PDF.
  • Citation audit (/citation-audit) verifying existence, metadata correctness, and context appropriateness of every \cite.
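
As a minimal illustration of the citation-audit idea, the snippet below extracts citation keys from LaTeX source and flags any that do not resolve to a known BibTeX entry. Existence is only the first of the paper's three checks; metadata and context verification require external lookup and are elided, and the regex is an assumption, not the skill's actual logic.

```python
import re

def audit_citations(tex_source: str, bib_keys: set[str]) -> list[str]:
    """Return cited keys with no matching BibTeX entry."""
    cited: set[str] = set()
    for match in re.finditer(r"\\cite[tp]?\{([^}]*)\}", tex_source):
        cited.update(key.strip() for key in match.group(1).split(","))
    return sorted(cited - bib_keys)

missing = audit_citations(r"\citep{lu2024aiscientist, wu2023autogen}",
                          {"lu2024aiscientist"})
print(missing)  # ['wu2023autogen']
```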

Cross-Model Adversarial Collaboration Mechanism: The core loop (Figure 5) alternates between executor generation and external-model critique. Reviewers are configured along two axes:

  1. Access Scope: Document-only, Artifact-augmented, or Repository-level.
  2. Context Policy: Fresh (new thread per round) or Cross-round (retains state).

The system defaults to pairing executor and reviewer from different model families (e.g., Claude executor with GPT reviewer) as the recommended configuration to reduce correlated errors.
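
The two axes and the cross-family default map naturally onto a small configuration object; all names here are illustrative, and the default scope and policy values are assumptions:

```python
from dataclasses import dataclass
from enum import Enum

class AccessScope(Enum):
    DOCUMENT_ONLY = "document_only"
    ARTIFACT_AUGMENTED = "artifact_augmented"
    REPOSITORY_LEVEL = "repository_level"

class ContextPolicy(Enum):
    FRESH = "fresh"              # new reviewer thread per round
    CROSS_ROUND = "cross_round"  # reviewer retains state across rounds

@dataclass
class ReviewerConfig:
    model_family: str  # e.g. "gpt" while the executor family is "claude"
    access_scope: AccessScope = AccessScope.DOCUMENT_ONLY
    context_policy: ContextPolicy = ContextPolicy.FRESH

def warn_if_same_family(executor_family: str, cfg: ReviewerConfig) -> None:
    """ARIS pairs families apart by default; same-family review risks correlated errors."""
    if executor_family == cfg.model_family:
        print("warning: executor and reviewer share a model family")
```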

System Controls:

  • Effort Levels: Four presets (lite, balanced, max, beast) that scale breadth, depth, and iteration counts.
  • Reviewer Routing: Review requests default to GPT-5.4 via Codex MCP, with optional routing to GPT-5.4 Pro via Oracle MCP for high-stakes reviews.
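
The paper names the four presets and what they scale, but not their contents; the numbers below are invented purely to show the shape such a configuration might take, while the routing default does come from the paper:

```python
# Hypothetical preset values: only the preset names and the scaled dimensions
# (breadth, depth, iteration counts) come from the paper.
EFFORT_PRESETS = {
    "lite":     {"survey_breadth": 10,  "review_rounds": 1, "experiment_budget": 2},
    "balanced": {"survey_breadth": 25,  "review_rounds": 2, "experiment_budget": 5},
    "max":      {"survey_breadth": 50,  "review_rounds": 4, "experiment_budget": 10},
    "beast":    {"survey_breadth": 100, "review_rounds": 6, "experiment_budget": 20},
}

def route_review(high_stakes: bool = False) -> tuple[str, str]:
    """Paper's stated default: GPT-5.4 via Codex MCP, with optional
    escalation to GPT-5.4 Pro via Oracle MCP for high-stakes reviews."""
    return ("gpt-5.4-pro", "oracle-mcp") if high_stakes else ("gpt-5.4", "codex-mcp")
```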

Meta-Optimization: A prototype outer loop (/meta-optimize) analyzes usage events, proposes patches to SKILL.md files, and applies them only after GPT-5.4 xhigh review and user approval.
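
A schematic of the described gating, in which a patch is applied only after both external review and human sign-off; every callable here is a hypothetical stand-in:

```python
def meta_optimize(candidate_patches, review_fn, user_approves_fn, apply_fn):
    """Sketch of the /meta-optimize outer loop's two-gate structure."""
    for patch in candidate_patches:       # proposed SKILL.md edits from usage analysis
        if not review_fn(patch):          # gate 1: GPT-5.4 xhigh cross-model review
            continue
        if not user_approves_fn(patch):   # gate 2: explicit user approval
            continue
        apply_fn(patch)                   # only then does the harness modify itself
```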

Empirical Validation / Results

The paper provides observational deployment evidence; outcomes cannot be causally attributed to ARIS alone.

Deployment Footprint (as of April 2026):

| Dimension | Current Status |
| --- | --- |
| Executor platforms | 3 tested + 3 adapted (6 total) |
| Reviewer models | 6+ (GPT, Gemini, GLM, MiniMax, Kimi, DeepSeek) |
| GPU backends | 4 (local, SSH, Vast.ai, Modal) |
| Venue templates | 9 families |
| Community contributions | 30+ contributed skills |

Documented Overnight Run: A single trajectory was documented where the system operated autonomously for ~8 hours:

  • Completed four review–revise rounds (Workflow 2).
  • Increased an internal reviewer score from 5.0 to 7.5/10.
  • Launched more than 20 GPU experiments.
  • Removed claims that were not supported by available evidence. This run demonstrates the harness's ability to operationalize claim pruning and review-driven revision in a realistic setting.

Feature Comparison: Table 4 provides a structured comparison with related systems, highlighting ARIS's unique combination of features:

| System | Cross-family policy | Adversarial review | Composable skills | E2E Research Workflows | Assurance Stack | Cross-platform portability |
| --- | --- | --- | --- | --- | --- | --- |
| AI Scientist (Lu et al., 2024) | none | partial | × | ✓ | partial | × |
| Agent Laboratory (Schmidgall et al., 2025) | none | × | × | ✓ | × | × |
| AutoGen (Wu et al., 2023) | none | × | partial | × | × | × |
| ARIS (ours) | default | ✓ | ✓ | ✓ | ✓ | ✓† |

† Tested on 3 platforms with adaptation guides for 3 more.

Theoretical and Practical Implications

Theoretical Implications:

  • Harness-Centric View: The work reinforces the idea that system performance is co-determined by model weights and the orchestration harness, especially for long-horizon tasks.
  • Adversarial Design for Assurance: It proposes a practical, minimal (two-agent) adversarial framework as a defense against "plausible unsupported success," drawing an analogy to game theory and adversarial bandits.
  • Modularity for Auditability: By decomposing research into single-file skills and plain-text artifacts, the system makes the research process more inspectable and debuggable, addressing the "opaque agent trajectory" problem.

Practical Implications:

  • Lowering Barriers to Research: ARIS can assist in automating repetitive aspects of the research lifecycle (literature review, experiment deployment, manuscript formatting, rebuttal drafting), potentially increasing researcher productivity.
  • Enhanced Research Integrity: The evidence-to-claim audit cascade and manuscript assurance mechanisms provide a systematic, automated safety net against common integrity failure modes, which could improve the rigor of AI-generated research.
  • Portable Research Workflows: The skill-based, platform-agnostic design allows researchers to define and reuse research procedures across different LLM-powered coding environments, reducing vendor lock-in.
  • Foundation for Self-Improvement: The cross-model accountability primitives (reviewer independence, claim ledgers) could be adapted as an oversight layer between model outputs and downstream training data, potentially mitigating issues in recursive self-improvement loops.

Conclusion

ARIS is a research harness built around the conservative assumption that single-agent, long-horizon research is unreliable. It responds by implementing cross-model adversarial collaboration as a default, decomposing the workflow into modular, auditable skills, and enforcing integrity through a multi-stage assurance stack.

The main contributions are:

  1. An assurance stack with a three-stage evidence-to-claim audit cascade and manuscript quality checks.
  2. A modular system architecture with over 65 reusable skills, a persistent research wiki, and configurable workflows.
  3. Early deployment evidence across multiple platforms, demonstrating operational feasibility.

Limitations include the absence of controlled evaluation, the inability to guarantee correctness, potential reviewer bias amplification, and security concerns with repository-level review. Human responsibility remains paramount.

Future Work includes:

  • Controlled evaluations (Appendix E outlines a benchmark protocol) to isolate the effect of cross-model heterogeneity.
  • Development of local reviewer models for confidential settings.
  • User studies on researcher productivity.
  • Exploring the adaptation of ARIS's accountability primitives for LLM self-improvement pipelines, as a mechanism to reduce judge-model coupling.

Code and documentation are available at: https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep.