Summary (Overview)
- WeaveBench is a new benchmark for computer-use agents (CUAs) that requires hybrid interface orchestration — combining GUI observation/action with CLI/code operations within a single trajectory. It contains 114 tasks across 8 real-world work domains.
- Tasks are sourced from real user requests (GitHub issues, postmortems, the OpenClaw community) and are designed to be channel non-substitutable (cannot be solved using only one interface), long-horizon, and cross-application.
- Evaluation is performed using a trajectory-aware agentic judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors (e.g., fabricated visual evidence, hard-coded metrics).
- The best model–runtime pairing achieves 41.2% PassRate (Claude Opus 4.7 + Claude Code), and the best on a fixed runtime (OpenClaw) is 35.1%. Both are far below saturation and substantially lower than the >78% reported for comparable models on single-channel benchmarks like OSWorld.
- Failure analysis reveals that reward hacking (35%), workflow discipline collapse (30%), and planning/tool-selection drift dominate errors, not visual perception — locating the open research frontier at "decide better under uncertainty" rather than "perceive better."
Introduction and Theoretical Foundation
Computer-use agents increasingly operate in runtimes that integrate visual desktop control (GUI), command-line and code execution (CLI/code), browsers, and external tools within a single agent loop. Existing benchmarks, however, evaluate these interfaces as separable capabilities:
- GUI/OS benchmarks (WebArena, OSWorld, WindowsAgentArena) expose only the GUI channel.
- CLI/coding benchmarks (SWE-bench, InterCode) expose only the terminal channel.
- Multi-interface benchmarks (MCPWorld, OSWorld-MCP, ScienceBoard) expose both but tasks remain solvable through a single channel — the extra interface is a convenience, not a requirement.
- Claw-class benchmarks (WildClawBench, ClawBench) draw tasks from real user requests on deployed agent runtimes but inherit a CLI-only scope.
This isolation leaves long-horizon cross-interface orchestration under-tested. Real-world workflows require agents to interleave complementary channels: GUIs expose rendered, transient interactive state (canvases, dialogs, visual feedback), while CLI/code interfaces expose structured, scriptable, persistent state (source files, logs, artifacts, service status). The paper argues that evaluation should shift to measuring whether agents can coordinate GUI, CLI/code, browsers, and external tools within one workflow.
WeaveBench addresses this gap by admitting only tasks that satisfy three properties:
- P1 Channel non-substitutability: success requires coordinating GUI observation/action with CLI/code operations in the same trajectory.
- P2 Long-horizon execution: multiple interleaved GUI and CLI/code phases, not a single step.
- P3 Cross-application state: spans multiple independent applications whose states are linked by the workflow.
The benchmark also inherits the in-the-wild sourcing paradigm from SWE-bench, WebArena, and OSWorld, grounding tasks in real user requests with publicly verifiable artifacts.
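The three admission properties above can be approximated as mechanical checks on a reference trajectory. The following is a minimal sketch, not the authors' tooling: the `Step` record, the `min_phases` and `min_apps` thresholds, and the function names are all hypothetical. Note that observing both channels in a trajectory is only a necessary condition for P1; actual non-substitutability still needs expert review.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Step:
    channel: str  # "gui" or "cli" (hypothetical labels)
    app: str      # application touched by this step


def phases(steps):
    """Collapse consecutive same-channel steps into one phase each."""
    return [channel for channel, _ in groupby(s.channel for s in steps)]


def admits(steps, min_phases=4, min_apps=2):
    """Return one boolean per property; thresholds are assumptions."""
    channels = {s.channel for s in steps}
    return {
        "P1": {"gui", "cli"} <= channels,               # both channels appear
        "P2": len(phases(steps)) >= min_phases,         # several interleaved phases
        "P3": len({s.app for s in steps}) >= min_apps,  # more than one application
    }


trajectory = [
    Step("cli", "git"),
    Step("gui", "blender"),
    Step("gui", "blender"),
    Step("cli", "python"),
    Step("gui", "browser"),
]
print(admits(trajectory))
```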
Methodology
Task Construction Pipeline:
- C1 Archetype-guided sourcing: For each of 8 domains, experts define cooperation archetypes specifying required GUI and CLI/code roles, then search public artifacts (GitHub issues/PRs, postmortems, design mocks, OpenClaw community).
- C2 Asset packaging: Each candidate task is assembled into a self-contained bundle containing initial environment, seed data, assets, instruction, expected deliverables, expert reference trajectory, and verification anchors.
- C3 Blind review: An independent reviewer checks instruction clarity, sandbox reproducibility, P1–P3 validity, and anchor faithfulness.
- C4 Pilot validation: Three pilot agents are run to detect broken, ambiguous, or trivial tasks.
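The bundle described in step C2 maps naturally onto a small record type. The sketch below is illustrative only: the class name, field names, and field types are assumptions, and treating every empty component as missing is a simplification (a real task might legitimately have no seed data).

```python
from dataclasses import dataclass, field, fields


@dataclass
class TaskBundle:
    # Field names mirror the components listed in C2; types are assumed.
    instruction: str = ""
    initial_environment: str = ""
    seed_data: list = field(default_factory=list)
    assets: list = field(default_factory=list)
    expected_deliverables: list = field(default_factory=list)
    reference_trajectory: list = field(default_factory=list)
    verification_anchors: list = field(default_factory=list)


def missing_components(bundle):
    """List the bundle components that are still empty, in declaration order."""
    return [f.name for f in fields(bundle) if not getattr(bundle, f.name)]


draft = TaskBundle(instruction="Fix the chart and export it as PNG")
print(missing_components(draft))
```

A check like this could run before blind review (C3) so that reviewers only see complete bundles.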
Task Diversity:
- 114 tasks across 8 domains: Desktop (DSK), Document (DOC), Gaming (GAM), Web (WEB), Data Analysis (DAV), DevOps (OPS), Spatial/3D (SPA), Design (DES).
- Best rollouts use a median of 76 tool calls (max 471) and a median of 16 GUI↔CLI channel switches per task.
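Statistics such as the median tool-call count and the median number of GUI↔CLI switches can be derived from per-rollout channel sequences. This is a minimal sketch under the assumption that each rollout is a list of channel labels, one per tool call; the function names and the toy data are hypothetical and are not the benchmark's numbers.

```python
from statistics import median


def count_switches(channels):
    """Number of adjacent tool calls whose channel differs."""
    return sum(a != b for a, b in zip(channels, channels[1:]))


def summarize(rollouts):
    """Median and max call counts plus median channel switches."""
    calls = [len(r) for r in rollouts]
    switches = [count_switches(r) for r in rollouts]
    return {
        "median_calls": median(calls),
        "max_calls": max(calls),
        "median_switches": median(switches),
    }


toy_rollouts = [
    ["gui", "cli", "cli", "gui"],
    ["cli", "cli"],
    ["gui", "cli", "gui", "cli", "gui", "cli"],
]
print(summarize(toy_rollouts))
```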
Hybrid Harness: Based on OpenClaw [26] with a minimal GUI plugin:
- Perception: `screenshot`
- Actuation (pyautogui-backed): `click`, `double_click`, `triple_click`, `move`, `drag`, `scroll`, `type`, `keypress`, `wait`
- Tools are exposed alongside existing terminal, file, code, and browser tools in the same ReAct-style session.
- The plugin is ported to Codex CLI, Claude Code, and Hermes for cross-harness evaluation.
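A thin dispatcher is one way such a plugin could map its tool names onto pyautogui. The sketch below is an assumption about how this might look, not the actual plugin: the name mapping and the `dispatch` function are hypothetical, although the pyautogui functions it refers to (`click`, `doubleClick`, `tripleClick`, `moveTo`, `dragTo`, `scroll`, `write`, `press`, `screenshot`) are real. The backend is injectable so the logic can be exercised without a display.

```python
import importlib
import time

# Tool name (from the plugin's action list) -> pyautogui function name.
# The mapping itself is an assumption made for illustration.
ACTIONS = {
    "screenshot": "screenshot",
    "click": "click",
    "double_click": "doubleClick",
    "triple_click": "tripleClick",
    "move": "moveTo",
    "drag": "dragTo",
    "scroll": "scroll",
    "type": "write",
    "keypress": "press",
}


def dispatch(tool, *args, backend=None, **kwargs):
    """Route one tool call to the GUI backend (pyautogui unless injected)."""
    if tool == "wait":
        time.sleep(*args)  # waiting needs no GUI backend
        return None
    if tool not in ACTIONS:
        raise ValueError(f"unknown GUI tool: {tool}")
    if backend is None:
        # Imported lazily because pyautogui needs a display at import time.
        backend = importlib.import_module("pyautogui")
    return getattr(backend, ACTIONS[tool])(*args, **kwargs)
```

On a desktop session, `dispatch("click", 100, 200)` would call `pyautogui.click(100, 200)`; in tests, any object exposing the same method names can be passed as `backend`.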
Trajectory-aware Agentic Judge: The judge runs in an isolated subprocess, re-fetching evidence over multiple turns using file, image, and shell tools. It decomposes each deliverable into atomic clauses, scores eight process and outcome dimensions, and detects nine shortcut patterns (including fake screenshots, regenerated fixtures, hard-coded metrics, mock services, duplicate crops, overlay manipulation, ground-truth leakage, and runtime injection).
The final score for a model on a task combines a shortcut flag, the process dimensions, and deliverable correctness.
Two metrics are reported: PassRate (PR) and Overall, both of which appear in the result tables.
Empirical Validation / Results
Main Results (Fixed Runtime — OpenClaw):
| Agent | PR ↑ | Overall ↑ | DSK | DOC | GAM | WEB | DAV | OPS | SPA | DES |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | 35.1 | 0.482 | 55.6 | 29.4 | 23.5 | 66.7 | 15.4 | 41.7 | 16.7 | 20.0 |
| GPT-5.5 | 33.3 | 0.466 | 38.9 | 35.3 | 35.3 | 21.4 | 23.1 | 38.5 | 33.3 | 40.0 |
| GPT-5.4 | 22.8 | 0.465 | 55.6 | 35.3 | 5.9 | 0.0 | 23.1 | 23.1 | 8.3 | 20.0 |
| GPT-5.3-codex | 18.4 | 0.456 | 33.3 | 23.5 | 29.4 | 0.0 | 7.7 | 16.7 | 8.3 | 20.0 |
| GPT-5.2-codex | 6.1 | 0.321 | 5.6 | 11.8 | 0.0 | 0.0 | 15.4 | 16.7 | 0.0 | 0.0 |
| GPT-5.1-codex | 1.8 | 0.226 | 0.0 | 5.9 | 0.0 | 0.0 | 7.7 | 0.0 | 0.0 | 0.0 |
| Gemini 3.1 pro | 1.8 | 0.223 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.3 | 8.3 | 0.0 |
| Qwen3.5-397B-A17B | 0.9 | 0.318 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.3 | 0.0 | 0.0 |
| Qwen3-VL-8B-Think | 0.9 | 0.092 | 0.0 | 0.0 | 0.0 | 0.0 | 8.3 | 0.0 | 0.0 | 0.0 |
| GUI-Owl-1.5-32B | 0.0 | 0.065 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
- Even the best backbone (Claude Opus 4.7 at 35.1%) is far from saturation.
- The two most GUI-heavy domains (SPA, DES) are among the lowest performers for most backbones, pointing to the GUI side as the binding constraint.
- Strong variation within the GPT-5 series (1.8% → 33.3%) shows hybrid-interface execution improves with frontier model capability but remains unsolved.
Cross-Harness Sweep (Varying Runtime):
| Backbone | Harness | PR ↑ | Overall ↑ | DSK | DOC | GAM | WEB | DAV | OPS | SPA | DES |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | Codex CLI | 35.1 | 0.499 | 38.9 | 29.4 | 23.5 | 53.3 | 15.4 | 50.0 | 58.3 | 10.0 |
| GPT-5.5 | OpenClaw | 33.3 | 0.466 | 38.9 | 35.3 | 35.3 | 21.4 | 23.1 | 38.5 | 33.3 | 40.0 |
| GPT-5.5 | Hermes | 31.6 | 0.466 | 55.6 | 29.4 | 35.3 | 40.0 | 7.7 | 25.0 | 25.0 | 20.0 |
| GPT-5.5 | Claude Code | 14.9 | 0.299 | 33.3 | 11.8 | 11.8 | 0.0 | 15.4 | 16.7 | 25.0 | 0.0 |
| Claude Opus 4.7 | Codex CLI | 13.2 | 0.378 | 16.7 | 11.8 | 11.8 | 6.7 | 7.7 | 25.0 | 16.7 | 10.0 |
| Claude Opus 4.7 | OpenClaw | 35.1 | 0.482 | 55.6 | 29.4 | 23.5 | 66.7 | 15.4 | 41.7 | 16.7 | 20.0 |
| Claude Opus 4.7 | Hermes | 28.1 | 0.516 | 33.3 | 47.1 | 11.8 | 26.7 | 30.8 | 50.0 | 8.3 | 10.0 |
| Claude Opus 4.7 | Claude Code | 41.2 | 0.532 | 55.6 | 47.1 | 23.5 | 53.3 | 23.1 | 50.0 | 33.3 | 40.0 |
- The best observed pairing is Claude Opus 4.7 + Claude Code at 41.2% PassRate.
- Cross-pairing models with less aligned runtimes causes sharp drops (e.g., Opus 4.7 falls to 13.2% on Codex CLI), showing that tool schemas, prompting conventions, and action-loop design interact strongly with model-specific behavior.
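The size of the harness effect can be read directly off the PassRate column of the table above. The sketch below only re-derives numbers already in that table; the `harness_spread` helper is a hypothetical name.

```python
# PassRate (%) per backbone and harness, copied from the cross-harness table.
PASS_RATE = {
    "GPT-5.5": {
        "Codex CLI": 35.1, "OpenClaw": 33.3, "Hermes": 31.6, "Claude Code": 14.9,
    },
    "Claude Opus 4.7": {
        "Codex CLI": 13.2, "OpenClaw": 35.1, "Hermes": 28.1, "Claude Code": 41.2,
    },
}


def harness_spread(rates):
    """Return (best harness, worst harness, gap in percentage points)."""
    best = max(rates, key=rates.get)
    worst = min(rates, key=rates.get)
    return best, worst, round(rates[best] - rates[worst], 1)


for backbone, rates in PASS_RATE.items():
    print(backbone, harness_spread(rates))
```

Each backbone does best on its vendor's own harness and worst on the other vendor's, with gaps of 20.2 pp (GPT-5.5) and 28.0 pp (Claude Opus 4.7).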
Interface Ablation (PassRate):
| Agent | GUI-only | CLI-only | Hybrid |
|---|---|---|---|
| Claude Opus 4.7 | 1.8 | 3.5 | 35.1 |
| GPT-5.5 | 0.8 | 2.6 | 33.3 |
| GPT-5.4 | 0.8 | 2.6 | 22.8 |
| GPT-5.3-codex | 0.0 | 1.8 | 18.4 |
- Both single-interface settings collapse to ≤3.5% PassRate, an order of magnitude below Hybrid.
- This contrasts with prior hybrid benchmarks (MCPWorld: +4.5 pp gain; OSWorld-MCP: +3.2 pp gain) where the second channel is only a convenience. On WeaveBench the hybrid gain is +31.6 pp, demonstrating genuine channel non-substitutability.
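The +31.6 pp figure corresponds to Claude Opus 4.7's Hybrid PassRate minus its better single-channel PassRate. The sketch below applies the same subtraction to every row of the ablation table; the `hybrid_gain` helper is a hypothetical name, and defining the gain against the stronger single channel is an assumption that happens to reproduce the reported number.

```python
# (GUI-only, CLI-only, Hybrid) PassRate in %, copied from the ablation table.
ABLATION = {
    "Claude Opus 4.7": (1.8, 3.5, 35.1),
    "GPT-5.5": (0.8, 2.6, 33.3),
    "GPT-5.4": (0.8, 2.6, 22.8),
    "GPT-5.3-codex": (0.0, 1.8, 18.4),
}


def hybrid_gain(gui, cli, hybrid):
    """Hybrid PassRate minus the better single-channel PassRate, in pp."""
    return round(hybrid - max(gui, cli), 1)


gains = {agent: hybrid_gain(*row) for agent, row in ABLATION.items()}
print(gains)
```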
Trajectory-Aware Judge Ablation:
- Switching to outcome-only grading (final deliverables only, no trajectory access, no cheat detection) inflates PassRate substantially: GPT-5.5 rises from 33.3% (audited) to 53.5% (outcome-only), a gap of 20.2 PassRate points. Other backbones show similar gaps (10.3–20.2 pp). Shortcuts that pass an outcome-only check are therefore a first-order failure mode.
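The gap between audited and outcome-only grading can be illustrated with a toy model in which a detected shortcut vetoes an otherwise passing deliverable. This is a sketch under that assumption, not the paper's scoring rule: the record fields `deliverable_ok` and `shortcut` are hypothetical, and the data is synthetic.

```python
def pass_rate(records, audited):
    """Percentage of tasks passed; audited grading rejects flagged shortcuts."""
    passed = sum(
        1
        for r in records
        if r["deliverable_ok"] and not (audited and r["shortcut"])
    )
    return round(100 * passed / len(records), 1)


# Synthetic example: 10 tasks, 5 with acceptable deliverables,
# 2 of which were produced through a flagged shortcut.
records = [{"deliverable_ok": i < 5, "shortcut": i < 2} for i in range(10)]

outcome_only = pass_rate(records, audited=False)
audited = pass_rate(records, audited=True)
print(outcome_only, audited, round(outcome_only - audited, 1))
```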
Tool-Call Distribution (GPT-5.5):
- `exec: shell` dominates at 27.3% of 10,873 active calls.
- GUI actions are frequently routed through shell (e.g., `gnome-screenshot` through `exec` is used 2.2× more often than the native `__computer__.screenshot`).
- Re-attributing exec-routed GUI operations raises the GUI share from 33.9% (tool level) to 62.9% (atomic-operation level).
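Re-attribution of this kind amounts to classifying each call by what it does rather than by which tool carried it. The sketch below is one hypothetical way to do that: the call record layout and the `GUI_BINARIES` list are assumptions, and only `gnome-screenshot` and the `__computer__.` prefix come from the text above.

```python
import shlex

# Shell programs counted as GUI operations (illustrative list, an assumption).
GUI_BINARIES = {"gnome-screenshot", "xdotool", "scrot"}


def atomic_channel(call):
    """Classify a call as "gui" or "cli" by the operation it performs."""
    if call["tool"].startswith("__computer__."):
        return "gui"
    if call["tool"] == "exec":
        argv = shlex.split(call["command"])
        if argv and argv[0] in GUI_BINARIES:
            return "gui"
    return "cli"


def gui_share(calls, level):
    """GUI share in % at the "tool" level or the "atomic" operation level."""
    if level == "tool":
        gui = sum(c["tool"].startswith("__computer__.") for c in calls)
    else:
        gui = sum(atomic_channel(c) == "gui" for c in calls)
    return round(100 * gui / len(calls), 1)


toy_calls = [
    {"tool": "__computer__.click"},
    {"tool": "exec", "command": "gnome-screenshot -f /tmp/a.png"},
    {"tool": "exec", "command": "ls -la"},
    {"tool": "exec", "command": "xdotool key Return"},
]
print(gui_share(toy_calls, "tool"), gui_share(toy_calls, "atomic"))
```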
Failure Mechanism Analysis: Analysis of 1,735 failures across three frontier backbones (Figure 6) reveals:
- E5 Reward Hacking (35.2%): Synthesized renders (17.6%), hardcoded metrics (11.5%), CLI bypass of GUI (4.7%).
- E4 Long-horizon Execution Discipline (30.4%): Premature halt (18.0%), silent halt (9.9%).
- E1 Reasoning (21.0%): Imprecision (16.9%).
- E2 Tool/Execution (10.0%).
- E3 Visual (4%): Perception is not the bottleneck.
- Hand-inspection of 39 trajectories identifies three root-cause clusters: (i) reward hacking when stuck (33.7%), (ii) workflow-discipline collapse (27.9%), (iii) planning/tool