Summary of "ClawBench: Can AI Agents Complete Everyday Online Tasks?"

Summary (Overview)

  • ClawBench is a new benchmark comprising 153 real-world, write-heavy online tasks (e.g., purchases, bookings, applications) across 144 live platforms and 15 life categories, designed to test AI agents' ability to function as general online assistants.
  • It introduces a safe, realistic evaluation framework that operates on production websites, using a targeted interception mechanism to block only the final, irreversible submission request (e.g., order placement), while allowing full interaction with dynamic, complex web pages.
  • The framework employs a five-layer recording infrastructure (session replay, action screenshots, HTTP traffic, agent messages, browser actions) and an Agentic Evaluator that compares agent trajectories against human ground-truth references for traceable, step-level evaluation.
  • Performance on existing benchmarks does not transfer: frontier models like Claude Sonnet 4.6 and GPT-5.4 achieve 65-75% on traditional benchmarks (OSWorld, WebArena) but only 33.3% and 6.5%, respectively, on ClawBench, highlighting its difficulty.
  • Overall results are low: Even the strongest model (Claude Sonnet 4.6) achieves only a 33.3% success rate, with two of seven evaluated models scoring below 5%, demonstrating that current AI agents are far from reliable at automating everyday online tasks.

Introduction and Theoretical Foundation

The advent of Large Language Model (LLM)-powered AI agents capable of navigating graphical interfaces and executing multi-step workflows (e.g., OpenAI Operator, Anthropic Computer Use) raises the question of their utility as general-purpose online assistants. To be truly useful, agents must reliably complete the everyday online tasks people regularly perform, such as booking flights or submitting job applications.

However, evaluating agents on such tasks is challenging due to the unpredictable and consequential nature of real websites. Most existing benchmarks (e.g., WebArena, OSWorld) retreat to offline sandboxes with static HTML, fixed DOM structures, and no authentication or dynamic content. While this simplifies evaluation, it removes the very complexities—cookie pop-ups, dynamic JavaScript, multi-step interactions—that define real-world difficulty. Benchmarks that do operate on real websites are often limited to read-only information retrieval or use mock APIs, leaving write-heavy, state-changing task completion largely unevaluated.

ClawBench is introduced to fill this gap. Its core motivation is to provide a realistic, safe, and diagnostically rich testbed for evaluating AI agents on the types of tasks that directly impact daily life, thereby measuring progress toward agents that can reliably "get things done" on the live web.

Methodology

1. Task Design and Collection

The benchmark focuses on write-heavy web tasks that modify server-side state (form submissions, purchases, applications). Each task is defined by:

  • A natural-language user instruction.
  • A starting URL.
  • A terminal submission target specified at the HTTP-request level.

A rigorous, multi-stage filtering pipeline was used to curate 153 final tasks across 144 unique live platforms. For each task, human annotators completed it end-to-end within the recording framework to produce a human reference trajectory and identify the exact interception signal—the specific HTTP endpoint, method, and payload schema of the irreversible submission request. This manual annotation ensures high-precision, safe interception.
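In code, such a task record and its human-annotated interception signal might be represented as in the following minimal sketch (the field names, example values, and endpoint are illustrative assumptions, not the benchmark's actual schema):

```python
from dataclasses import dataclass

@dataclass
class InterceptionSpec:
    """Human-annotated signal identifying the irreversible submission request."""
    endpoint_pattern: str   # URL pattern of the final submission endpoint
    method: str             # HTTP method of the submission request
    payload_schema: dict    # expected top-level fields of the request body

@dataclass
class Task:
    """One ClawBench task: instruction, entry point, and terminal target."""
    instruction: str
    start_url: str
    interception: InterceptionSpec

# Hypothetical example task (all values illustrative):
task = Task(
    instruction="Order a medium pepperoni pizza for pickup at 6pm.",
    start_url="https://example-pizza.com",
    interception=InterceptionSpec(
        endpoint_pattern="https://example-pizza.com/api/orders*",
        method="POST",
        payload_schema={"items": list, "pickup_time": str},
    ),
)
```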

2. Task Taxonomy

Tasks are organized into a two-level taxonomy for fine-grained analysis:

  • 8 High-Level Category Groups: Daily, Work, Dev, Social, Academic, Travel, Pets, Finance.
  • 15 Fine-Grained Categories: e.g., Shopping, Entertainment, Job Search, Education.

3. Safety & Realism: The Interception Mechanism

The key design insight is that safe evaluation on real websites requires intercepting only the final, irreversible request, not restricting interaction more broadly. A lightweight Chrome extension and a Chrome DevTools Protocol (CDP)-based server monitor all outgoing HTTP requests.

  • When a request matches a human-annotated interception specification, the system: captures the full request body, blocks it from reaching the server, and logs the payload.
  • All other requests (page loads, AJAX calls, etc.) pass through unmodified, preserving the full ecological validity and complexity of the live website.
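The matching step above can be sketched as a small predicate: a request is blocked only when its method, endpoint, and payload all match the annotated specification, and everything else passes through. The spec fields and example URLs below are hypothetical, not the benchmark's actual format:

```python
import fnmatch

def should_intercept(method: str, url: str, body: dict, spec: dict) -> bool:
    """Return True only for the final, irreversible submission request."""
    if method != spec["method"]:
        return False
    if not fnmatch.fnmatch(url, spec["endpoint_pattern"]):
        return False
    # The payload must contain the annotated schema's top-level keys.
    return set(spec["payload_keys"]).issubset(body)

spec = {"method": "POST",
        "endpoint_pattern": "https://example-shop.com/api/orders*",
        "payload_keys": ["items", "address"]}

# The terminal submission matches and would be captured, blocked, and logged:
print(should_intercept("POST", "https://example-shop.com/api/orders",
                       {"items": [1], "address": "..."}, spec))  # True
# Ordinary traffic (page loads, AJAX calls) passes through unmodified:
print(should_intercept("GET", "https://example-shop.com/api/cart", {}, spec))  # False
```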

4. Five-Layer Recording Infrastructure

Every agent run produces five synchronized layers of behavioral data:

  1. Session Recording: Full video via Xvfb and FFmpeg.
  2. Action Screenshots: Per-step screenshot after each agent action.
  3. HTTP Traffic: All requests logged via CDP.
  4. Agent Messages: Full chain of reasoning traces and tool calls in JSON.
  5. Browser Actions: Low-level events (clicks, keystrokes) captured via extension.

Human ground-truth trajectories are recorded under the same setup, enabling comparative evaluation.
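A plausible on-disk layout for one run's five synchronized layers might look like the following sketch (file and directory names are illustrative assumptions, not the benchmark's actual conventions):

```python
from pathlib import Path

def layer_paths(run_dir: str) -> dict:
    """Map each of the five recording layers to a hypothetical artifact path."""
    root = Path(run_dir)
    return {
        "session_recording": root / "session.mp4",    # Xvfb + FFmpeg video
        "action_screenshots": root / "screenshots",   # one image per agent step
        "http_traffic": root / "http_traffic.jsonl",  # CDP request log
        "agent_messages": root / "messages.json",     # reasoning traces + tool calls
        "browser_actions": root / "actions.jsonl",    # clicks, keystrokes
    }

paths = layer_paths("runs/example_task")
assert len(paths) == 5  # one entry per layer
```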

5. Evaluation Protocol: The Agentic Evaluator

Trajectories are scored using an Agentic Evaluator, implemented by invoking a Claude Code sub-agent under a fixed rubric. It performs explicit step-level alignment between the agent trajectory $T_a^{(t)}$ and the human reference trajectory $T_h^{(t)}$.

The evaluator function $\mathcal{A}$ maps the task instruction $q^{(t)}$ and both trajectories to a binary verdict:

$$\text{Score}(t) = \mathcal{A}\left( q^{(t)}, T_a^{(t)}, T_h^{(t)} \right),$$

where $\text{Score}(t) \in \{0, 1\}$.

The overall success rate (SR) over a task set $\mathcal{T}$ is:

$$\text{SR} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \text{Score}(t).$$

This comparative approach provides a concrete specification of success, grounded in platform-specific details, and yields structured justifications for failures.
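The aggregate metric is then a simple average of binary per-task verdicts. A minimal sketch, with hypothetical verdict values:

```python
def success_rate(verdicts: list) -> float:
    """SR = (1/|T|) * sum of Score(t) over tasks t, with Score(t) in {0, 1}."""
    return sum(verdicts) / len(verdicts)

# e.g. 3 successes out of 9 tasks -> 33.3%
verdicts = [1, 0, 0, 1, 0, 0, 1, 0, 0]
print(round(100 * success_rate(verdicts), 1))  # 33.3
```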

Empirical Validation / Results

Experimental Setup

  • Models Evaluated (7): 5 proprietary (Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Flash Lite, Claude Haiku 4.5, Gemini 3 Flash) and 2 open-source (GLM-5, Kimi K2.5).
  • Infrastructure: Agents control a Chromium browser via the OpenClaw framework, with the ClawBench interception and recording systems running in the background.
  • Primary Metric: Success Rate (SR), reported overall and per category.

Main Results

Table 2: Main results on ClawBench. Success rate (%) of seven AI agents.

| Rank | Model | Overall | Daily | Finance | Work | Dev | Academic | Travel | Social | Pets |
|------|-------|---------|-------|---------|------|-----|----------|--------|--------|------|
| 1 | Claude Sonnet 4.6 | **33.3** | **44.2** | **50.0** | 19.0 | 11.1 | **50.0** | 23.1 | **38.9** | **18.2** |
| 2 | GLM-5 † | 24.2 | 30.8 | 16.7 | **38.1** | 16.7 | 28.6 | 0.0 | 16.7 | **18.2** |
| 3 | Gemini 3 Flash | 19.0 | 15.4 | 33.3 | 23.8 | 22.2 | 28.6 | **30.8** | 11.1 | 0.0 |
| 4 | Claude Haiku 4.5 | 18.3 | 15.4 | 33.3 | 19.0 | **27.8** | 21.4 | 7.7 | 16.7 | **18.2** |
| 5 | GPT-5.4 | 6.5 | 9.6 | 0.0 | 0.0 | 11.1 | 7.1 | 7.7 | 0.0 | 9.1 |
| 6 | Gemini 3.1 Flash Lite | 3.3 | 1.9 | 0.0 | 0.0 | 5.6 | 14.3 | 0.0 | 0.0 | 9.1 |
| 7 | Kimi K2.5 | 0.7 | 1.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

† denotes a text-only model without vision capability. Bold marks the best result per column (ties bolded).

Key Findings:

  1. Low Overall Performance: The best model (Claude Sonnet 4.6) succeeds on only 33.3% of tasks. Performance drops sharply for others, with GPT-5.4 at 6.5% and Kimi K2.5 at 0.7%.
  2. Significant Performance Gap vs. Traditional Benchmarks: As shown in Figure 1 (right), models like Claude Sonnet 4.6 and GPT-5.4 achieve 65-75% on OSWorld and WebArena but perform dramatically worse on ClawBench, underscoring its heightened difficulty and realism.
  3. Category-Specific Strengths: Performance varies considerably across domains. No model dominates all categories, indicating that current agents lack uniform competence. For example, GLM-5 performs best on "Work" tasks, while Gemini 3 Flash leads on "Travel".

Theoretical and Practical Implications

Theoretical Implications:

  • Highlights a Critical Evaluation Gap: ClawBench demonstrates that strong performance on controlled, sandboxed benchmarks does not guarantee competence on the dynamic, complex live web. This calls for a reevaluation of how web agent capabilities are measured.
  • Provides a Diagnostic Framework: The five-layer recording and agentic evaluator enable traceable failure analysis, moving beyond binary scores to understand why an agent failed (e.g., misinterpreted a form field, missed a required step). This provides concrete signals for guiding future agent development in areas like planning, grounding, and robustness.

Practical Implications:

  • Roadmap for Assistive AI: The low success rates indicate that AI agents are not yet ready to serve as reliable general-purpose online assistants. ClawBench provides a clear benchmark to track progress toward this practical goal.
  • Safety-by-Design for Evaluation: The interception mechanism offers a blueprint for safe, large-scale evaluation on live platforms without real-world side effects, balancing ecological validity with ethical responsibility.
  • Open-Source Foundation: By releasing the complete pipeline, the authors enable the community to maintain, expand, and adapt the benchmark as websites and agent technologies evolve, ensuring its long-term relevance.

Conclusion

ClawBench establishes a realistic and challenging testbed for evaluating AI agents on everyday online tasks. By operating on live production websites, focusing on write-heavy workflows, and providing rich, traceable diagnostics, it reveals a substantial gap between agent performance in controlled settings and in the real world. The low success rates of even frontier models underscore that creating reliable general-purpose web assistants remains an unsolved problem. The release of the benchmark and toolkit aims to catalyze research to bridge this gap, bringing us closer to practical, helpful AI agents.