Summary (Overview)

  • Agents’ Last Exam (ALE) is a benchmark evaluating AI agents on long-horizon, economically valuable, real-world computer-use tasks. It is sourced from 250+ industry experts and comprises 1,490 task instances covering 55 subdomains across 13 industry clusters.
  • The benchmark is grounded in the O*NET/SOC 2018 occupational taxonomy and targets Generalist Computer-Use Agents (GCUA) that combine GUI interaction, CLI commands, code execution, and tool use within a single action loop.
  • Current frontier agents remain far from saturation: the average full pass rate is 2.6% on the hardest difficulty tier, and even the strongest configuration (Codex + GPT-5.5) scores below 50% on the easiest tier.
  • Unlike prior benchmarks, ALE covers all 55 digital subdomains, verifies outputs with deterministic checks or structured rubric-based scoring rather than human judgment, and is designed as a living benchmark with rolling private/public task rotation to resist contamination.
  • ALE aims to close the gap between benchmark success and GDP-relevant impact: passing these tasks would indicate readiness for sustained professional work, not just abstract competence.

Introduction and Theoretical Foundation

The paper argues that despite impressive AI benchmark results (e.g., world-champion game play, olympiad mathematics, competitive programming), their economic impact has remained surprisingly muted. This "utility problem" arises because existing benchmarks measure abstract competence rather than the ability to carry out long-horizon, economically valuable work in real professional environments.

Key motivation:

  • Benchmarks shape research direction and engineering targets (e.g., ImageNet for computer vision). Economically central sectors (finance, law, electrical engineering, manufacturing) lack comparable evaluations.
  • Building such evaluations is structurally difficult: authentic workflows are expensive to collect; broad industry coverage requires sustained expert access; and verification of heterogeneous outputs (files, spreadsheets, media, designs) is hard without human judgment.
  • Prior benchmarks trade off realism, breadth, or verifiability. ALE aims to satisfy all three simultaneously.

The name "Agents' Last Exam" carries a dual aspiration:

  • Last as competence threshold: passing demonstrates readiness for sustained, economically valuable work.
  • Last as difficulty frontier: authentic long-horizon workflows sit at the boundary of what current systems can reliably do.

Methodology

Benchmark Design Principles

Three requirements govern task admission:

  1. Representativeness – workflows must match real professional practice using domain-appropriate software.
  2. Complexity – end-to-end deliverables requiring substantial expert time (not isolated UI actions).
  3. Verifiability – outputs must admit deterministic checking or unambiguous rubric-based scoring against observable artifacts.

Taxonomy and Coverage

The ALE taxonomy is grounded in SOC 2018 and O*NET, clustering occupations with similar software-mediated workflows into 13 domains spanning 55 subdomains. Coverage is measured against prior benchmarks: the union of 16 major prior benchmarks leaves 13 of 55 subdomains entirely uncovered.
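The coverage comparison above amounts to a set difference: take the union of subdomains touched by prior benchmarks and subtract it from the full taxonomy. The following is a minimal sketch of that calculation, assuming each benchmark has already been mapped to a set of subdomain labels. The function name, the placeholder labels, and the toy data are hypothetical and do not come from the paper.

```python
from typing import Dict, Set


def uncovered_subdomains(all_subdomains: Set[str],
                         coverage: Dict[str, Set[str]]) -> Set[str]:
    """Return the subdomains that no benchmark in `coverage` touches."""
    covered: Set[str] = set()
    for touched in coverage.values():
        covered |= touched          # union over all prior benchmarks
    return all_subdomains - covered


# Toy data (assumption): five placeholder subdomains, two placeholder benchmarks.
subdomains = {"s1", "s2", "s3", "s4", "s5"}
prior = {"bench_a": {"s1", "s2"}, "bench_b": {"s2", "s3"}}

gaps = uncovered_subdomains(subdomains, prior)
print(sorted(gaps))                                  # ['s4', 's5']
print(f"{len(gaps)} of {len(subdomains)} uncovered")  # 2 of 5 uncovered
```

With the real mapping, the same computation over 16 prior benchmarks and 55 subdomains would yield the 13 uncovered subdomains reported above.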

Task Construction Pipeline

Tasks are constructed through a five-gate protocol:

  1. Expert sourcing via advisory committees of industry practitioners.
  2. Task submission through a web portal where experts upload past projects with five core components (description, input files, target software, expected deliverable, evaluation specification).
  3. First-pass review with conference-style decisions (major/minor revision, borderline accept, accept, strong accept).
  4. Task implementation converting accepted specifications into runnable assets and codified evaluation logic.
  5. Final QC with peer review verifying correctness, calibration, and context sufficiency.
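As an illustration of the submission gate in step 2, the sketch below checks that a submission carries all five core components before it moves on to review. The dictionary keys, the function name, and the example task are assumptions made for illustration; the paper's portal schema is not described in this summary.

```python
# The five core components named in step 2; the key names are assumptions.
REQUIRED_COMPONENTS = (
    "description",
    "input_files",
    "target_software",
    "expected_deliverable",
    "evaluation_specification",
)


def missing_components(submission: dict) -> list:
    """Return required components that are absent or empty, in a stable order."""
    return [name for name in REQUIRED_COMPONENTS if not submission.get(name)]


# Made-up draft submission with one empty and one absent component.
draft = {
    "description": "Reconcile a quarterly ledger against bank statements",
    "input_files": ["ledger.xlsx"],
    "target_software": "spreadsheet application",
    "expected_deliverable": "",
}
print(missing_components(draft))
# ['expected_deliverable', 'evaluation_specification']
```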

Public/private split: 150 of 1,490 task instances (~10%) are public; the remainder are held privately to resist contamination. Private tasks rotate into the public set over time.
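The split arithmetic can be checked directly. The sketch below uses the totals stated above together with the tier sizes reported under the key findings (59, 55, and 36 tasks). That the three tiers partition the public set is an inference from the matching sum of 150, not something this summary states explicitly.

```python
TOTAL_INSTANCES = 1490

# Tier sizes as reported in the key findings; treating them as the public
# set is an inference from the matching total.
TIER_SIZES = {"Near-Term": 59, "Full-Spectrum": 55, "Last-Exam": 36}

public = sum(TIER_SIZES.values())
private = TOTAL_INSTANCES - public
public_share = public / TOTAL_INSTANCES

print(public, private, f"{public_share:.1%}")  # 150 1340 10.1%
```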

Evaluation Pipeline

The pipeline decouples three components:

  • Task specification (main.py with load(), start(), evaluate() lifecycle functions).
  • Agent (harness + foundation model) interacting via an action loop over screenshots, shell output, mouse/keyboard, file edits, and API calls.
  • Environment (remote VM with canonical directory layout: input/, software/, output/, reference/).
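To make the decoupling concrete, here is a minimal sketch of what a task's main.py might look like. Only the three function names and the four directory names come from the description above. The signatures, the returned dictionary, the report.txt deliverable, and the scoring rule are all hypothetical, and a temporary directory stands in for the remote VM.

```python
import tempfile
from pathlib import Path

# Canonical directory layout named in the summary.
LAYOUT = ("input", "software", "output", "reference")


def load(root: Path) -> dict:
    """Create the canonical directories and return task metadata (assumed shape)."""
    for name in LAYOUT:
        (root / name).mkdir(parents=True, exist_ok=True)
    return {"root": root, "deliverable": root / "output" / "report.txt"}


def start(task: dict) -> str:
    """Return the instruction handed to the agent."""
    return f"Write the report to {task['deliverable']}"


def evaluate(task: dict) -> float:
    """Score the observable artifact: 1.0 if a non-empty deliverable exists."""
    path = task["deliverable"]
    return 1.0 if path.is_file() and path.stat().st_size > 0 else 0.0


with tempfile.TemporaryDirectory() as tmp:
    task = load(Path(tmp))
    prompt = start(task)
    before = evaluate(task)                   # nothing written yet
    task["deliverable"].write_text("done")    # stand-in for the agent's work
    after = evaluate(task)

print(before, after)  # 0.0 1.0
```

Because evaluate() inspects only files under output/, the same task definition can be scored regardless of which harness or model produced the artifact.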

Agents are evaluated in the GCUA configuration through a unified CUA MCP bridge that exposes 14 desktop-action tools. Evaluation uses a gate-and-score pattern: a binary precondition must pass before a continuous quality metric is evaluated, and a failed gate forces the score to 0. ALE avoids LLM-as-judge wherever a deterministic alternative exists; when an LLM judge is necessary, it is restricted to narrow, evidence-anchored yes/no probes.
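The gate-and-score pattern can be sketched in a few lines. This is an illustration under stated assumptions, not ALE's scoring code: the function name, the artifact dictionary, and the clamping of the metric to [0, 1] are all hypothetical choices.

```python
from typing import Callable


def gate_and_score(artifact: dict,
                   gate: Callable[[dict], bool],
                   metric: Callable[[dict], float]) -> float:
    """Return 0.0 if the binary gate fails, else the quality metric."""
    if not gate(artifact):
        return 0.0                      # failed precondition zeroes the score
    # Clamping to [0, 1] is an assumption made for this sketch.
    return max(0.0, min(1.0, metric(artifact)))


def file_exists(artifact: dict) -> bool:
    return bool(artifact.get("file_exists"))


def row_accuracy(artifact: dict) -> float:
    return artifact["rows_correct"] / artifact["rows_expected"]


good = {"file_exists": True, "rows_expected": 10, "rows_correct": 7}
absent = {"file_exists": False, "rows_expected": 10, "rows_correct": 7}

print(gate_and_score(good, file_exists, row_accuracy))    # 0.7
print(gate_and_score(absent, file_exists, row_accuracy))  # 0.0
```

Putting the gate first means the continuous metric is never computed on an artifact that fails the basic precondition, so partial credit cannot be earned for a missing or malformed deliverable.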

Empirical Validation / Results

Main Results (Table 1)

Configuration                     | Overall Pass Rate (%) | Near-Term Pass (%) | Full-Spectrum Pass (%) | Last-Exam Pass (%)
Codex (GPT-5.5)                   | 26.2                  | 42.4               | 20.0                   | 8.6
ALE-Claw (GPT-5.5)                | 24.2                  | 35.6               | 21.8                   | 8.6
Cursor (GPT-5.5)                  | 22.5                  | 36.4               | 20.0                   | 2.9
Claude Code (Sonnet 4.6)          | 17.1                  | 31.4               | 12.7                   | 0.0
All models average (hardest tier) | -                     | -                  | -                      | ~2.6

Key findings:

  • Three difficulty tiers are established: Near-Term (59 tasks, top pass rates of roughly 30-40% in Table 1), Full-Spectrum (55 tasks, one per subdomain), and Last-Exam (36 tasks, most agents at 0%).
  • ALE-CLI subset (106 Linux-only tasks) is substantially harder than Terminal-Bench: Codex + GPT-5.5 achieves 82% on Terminal-Bench but only 26.4% overall pass rate on ALE-CLI.
  • Cost per run: $3–10 on average; overall timeout rate is 4.3%.

Analysis

  • Domain-level performance (Figure 9a): Computational mathematics and agriculture/environment score highest (~60%); visual media and education remain below 30%.
  • Tool usage: GUI tools are underutilized. Although 34% of tasks designate graphical software as the primary tool, the GUI share of agent tool calls remains small.
  • Failure taxonomy (Figure 9d for Claude Code + Opus 4.7): Understanding and Approach failures account for ~75% of cases, indicating domain knowledge is the dominant bottleneck, not execution capability.

Benchmark Positioning (Table 2)

Benchmark  | Task Form           | Size  | Breadth (of 55 subdomains) | Verification
MMLU       | Knowledge QA        | ~16K  | 26/55                      | Auto (exact match)
GPQA       | Knowledge QA        | ~500  | 8/55                       | Auto (exact match)
SWE-bench  | Code patch          | ~2K   | 5/55                       | Auto (unit tests)
OSWorld    | GUI Operation       | ~400  | 5/55                       | Auto (state checks)
GDPval     | Project deliverable | ~200  | 16/55                      | Human (expert)
RLI        | Project deliverable | ~250  | 14/55                      | Human (expert)
ALE (ours) | Project deliverable | ~1.5K | 55/55                      | Auto (deterministic scripts)

Among the benchmarks compared, ALE is the only one that covers all 55 subdomains, and the only project-deliverable benchmark that is verified automatically rather than by human experts.

Theoretical and Practical Implications

  • Benchmark design methodology: ALE demonstrates that economically grounded, real-world task evaluation with deterministic verification is feasible at scale. The staged construction protocol with expert sourcing, multi-round QC, and public/private split provides a template for future living benchmarks.
  • Industry coverage: By anchoring in the SOC/O*NET occupational taxonomy, ALE provides a comprehensive, principled mapping of digital industries that can serve as a common coordinate system for comparing benchmarks (as used in Figure 3 and Table 2).
  • Agent evaluation: The benchmark formally defines the Generalist Computer-Use Agent (GCUA) class, decomposing agent capabilities into five layers (Brain, Eyes, Body, Hands, Feet), clarifying the limitations of existing CLI-only and GUI-only agents.
  • Economic relevance: Saturation of ALE tasks would signal readiness for real industrial adoption. The gap between benchmark success (high performance on existing benchmarks) and ALE performance (2.6% on hardest tier) quantifies the distance to economically meaningful deployment.
  • Domain knowledge bottleneck: The failure analysis indicates that domain knowledge, not execution skill, is the primary limiting factor. This suggests that training on specialized professional software workflows and domain-specific data may be more impactful than improving general reasoning or tool-use capabilities alone.

Conclusion

ALE is a living benchmark of 1,490 task instances across all 55 digital subdomains of the O*NET/SOC taxonomy, sourced from the real professional work of 250+ experts and scored with deterministic automated verification. Frontier agents currently pass only a small fraction of the hardest tasks (about 2.6% on average on the hardest tier). The benchmark is released to close the gap between benchmark success and GDP-relevant impact: saturation would indicate that AI agents can sustain the long-horizon, tool-intensive work that professional practice actually requires. Future directions include expanding the task pool, rotating private tasks into the public set, and continuing to evaluate new model generations as they emerge.
