Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Summary (Overview)

Claw-Eval is a comprehensive evaluation suite designed to address critical gaps in existing benchmarks for LLM-based autonomous agents. It features 300 human-verified tasks across three groups (General service orchestration, Multimodal perception/generation, Multi-turn professional dialogue) with 2,159 fine-grained rubric items.
The framework introduces full-trajectory auditing via three independent evidence channels (execution traces, audit logs, environment snapshots) and integrated multi-dimensional scoring (Completion, Safety, Robustness) to replace unreliable output-only grading.
Key empirical findings from evaluating 14 frontier models reveal that: trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations; robustness (consistency) is a distinct capability axis that degrades sharply under controlled error injection; and no single model dominates across all modalities or task types.

Introduction and Theoretical Foundation

Large Language Models (LLMs) have evolved from conversational assistants into autonomous agents capable of executing multi-step workflows in real-world software environments. This shift necessitates evaluation methodologies that assess agents in live, interactive environments, focusing on how goals are accomplished through situated action, not just what final outputs are produced.

Existing agent benchmarks suffer from three critical limitations (G1–G3):

Trajectory-Opaque Grading (G1): Many benchmarks check only final artifacts, making faithful execution indistinguishable from fabricated steps. This creates an evaluation surface susceptible to "reward hacking."
Underspecified Safety and Robustness (G2): Safety is often isolated in standalone suites, not evaluated under genuine task pressure. Robustness is rarely tested via systematic stress-testing (e.g., API failures).
Modally Narrow Task Coverage (G3): Benchmarks typically target a single modality (e.g., text-based tool calls, GUI navigation), failing to jointly evaluate heterogeneous capabilities under a consistent methodology.

Claw-Eval is introduced to address these gaps with three corresponding design principles: (1) Full-trajectory auditing, (2) Integrated multi-dimensional scoring, and (3) Unified cross-modal coverage. A core challenge is the inherent stochasticity of agentic execution, which Claw-Eval addresses by running each task for k independent trials and reporting complementary metrics.

Methodology

Claw-Eval's architecture is built on a core premise: trustworthy evaluation requires grounding every score in evidence of what the agent actually did.

1. Auditable Execution Pipeline

The framework organizes evaluation into a strict three-phase lifecycle within an isolated Docker container, with a temporal firewall separating execution from grading.

Phase 1: Setup: A fresh sandbox container is provisioned with workspace files (datasets, media assets). Mock services (CRM, email gateways) are deployed outside the sandbox, each silently maintaining an audit log.
Phase 2: Execution: The agent interacts with the environment through two complementary capability layers:
- System Layer: 11 built-in tools for code execution, file operations, web interaction, and multimodal media processing.
- Service Layer: Task-specific custom tools exposing mock APIs. The complete agentic context is recorded in a structured execution trace.
Phase 3: Judge: Upon agent termination, grading artifacts are injected. The pipeline assembles three independent lines of evidence for scoring:
1. Execution Trace: The complete agentic context.
2. Service Audit Logs: Every API request received by mock services.
3. Environment Snapshot: The physical end-state (e.g., generated files).

2. Cross-Modal Task Suite

The 300 tasks are organized into three groups testing complementary capabilities, all instantiating the same three-phase lifecycle.

Group	Category	Description	# Tasks
General (161)	Easy, Medium, Hard	Practical workflow execution, from single-service queries to multi-system orchestration. 43 tasks embed safety constraints.	161
Multimodal (101)	Video, Doc & Image, Code	Perceptual and generative capabilities over rich media (video, documents, images), requiring a perceive–reason–act loop.	101
Multi-turn Dialogue (38)	STEM, Social Science, Business	Professional consultations with a simulated user persona that progressively reveals information based on the agent's questioning.	38

3. Scoring Protocol

The scoring protocol converts rich evidentiary records into comprehensive, precise, and reliable scores.

Multi-dimensional Scoring: Each task is evaluated along three orthogonal dimensions combined into a final score:

\text{score} = s_{\text{safety}} \times (\alpha \cdot s_{\text{completion}} + \beta \cdot s_{\text{robustness}})

where $\alpha + \beta = 1$ . The paper uses $\alpha = 0.8$ and $\beta = 0.2$ .

Completion ( $s_{\text{completion}}$ ): Degree to which the task objective is fulfilled, aggregated from task-specific rubric weights.
Safety ( $s_{\text{safety}}$ ): Acts as a multiplicative gate. A violation pulls the entire score toward zero. Safety constraints are embedded within normal workflow tasks.
Robustness ( $s_{\text{robustness}}$ ): Measured via controlled error injection on mock services. The score captures the breadth of recovery: $s_{\text{robustness}} = \begin{cases} \frac{|T_{\text{recovered}}|}{|T_{\text{errored}}|} & \text{if } |T_{\text{errored}}| > 0 \\ 1 & \text{otherwise} \end{cases}$ where $T_{\text{errored}}$ is the set of tool types that encountered an injected error and $T_{\text{recovered}} \subseteq T_{\text{errored}}$ is the subset for which the agent subsequently obtained a successful response.

Fine-grained Rubrics: Each task is decomposed into 2,159 independently verifiable rubric items (mean 7.2 per task). Items are either deterministic checks (e.g., file exists, API invoked) or judgment-based assessments via an LLM judge, all anchored in the three independent evidence sources.

Evaluation Metrics: To account for stochastic variance, each task is run for $k=3$ independent trials. Three complementary metrics are reported:

Average Score: Mean task score across all runs, measuring overall capability level. $\text{Score} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{k} \sum_{j=1}^{k} s_{ij}$
Pass@ $k$ : Fraction of tasks passed at least once in $k$ runs, measuring the capability ceiling. $\text{Pass@}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\max_{j=1}^{k} s_{ij} \geq \tau]$
Pass $^k$ : Fraction of tasks passed on every trial, measuring the reliability floor. $\text{Pass}^k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\min_{j=1}^{k} s_{ij} \geq \tau]$ A pass threshold $\tau = 0.75$ is used.

Empirical Validation / Results

Experimental Setup

Models: 14 frontier models from seven families (e.g., Claude Opus 4.6, GPT-5.4, Gemini-3.1-Pro). 9 visual models are evaluated on the Multimodal group.
Settings: Default parameters, temperature=0, isolated Docker sandbox, error injection rate initially 0. Each task run for 3 trials. Gemini-3-Flash serves as the LLM judge for General/Multimodal tasks; Claude Opus-4.6 (temperature=0.7) serves as simulated user and judge for Dialogue tasks.

Main Results

Table 4: Main evaluation results (General & Multi-turn tasks). Models sorted by Pass $^3$ .

Model	General	Multi-turn	Overall
	Score	Pass@3	Pass $^3$
Claude Opus 4.6	80.6	80.8	70.8
Claude Sonnet 4.6	81.3	81.4	68.3
GPT 5.4	78.3	75.8	60.2
Gemini 3.1 Pro	76.6	80.8	55.9
...	...	...	...
Nemotron 3 Super	41.7	34.8	6.8

Finding 1: Consistency (Pass $^3$ ) and peak performance (Score/Pass@3) do not align. Claude-Opus-4.6 leads Pass $^3$ (70.4%) while Claude-Sonnet-4.6 leads Score (81.4%).
Finding that the benchmark retains headroom: The strongest model achieves only 70.4% Overall Pass $^3$ .

Figure 2: Pass $^3$ rate by difficulty level (General tasks). All models degrade monotonically from Easy to Hard. The difficulty range provides effective discrimination.

Table 5: Multimodal task evaluation results sorted by Pass $^3$ .

Model	Score	Pass@3	Pass $^3$
GPT 5.4	54.4	55.5	25.7
Claude Opus 4.6	54.7	52.5	24.8
Claude Sonnet 4.6	50.9	43.6	23.8
...	...	...	...
GLM 5V Turbo	47.0	34.6	13.9

Finding: Multimodal tasks are substantially harder (max Pass $^3$ = 25.7%) than General tasks, and rankings shift across modalities. Multimodal capability is a distinct axis.

Targeted Analyses

5.1 Trajectory-Opaque Judges Miss Violations A vanilla LLM judge (Gemini-3-Flash), given the full conversation transcript and grader source code but without server-side audit logs or environment snapshots, is compared against Claw-Eval's hybrid pipeline.

Safety: Misses 44% of task-level violations (12 out of 27).
Robustness: Misses 13% of task-level issues (15 out of 118).

This validates the hybrid design: rule-based checks are necessary for deterministic, safety-critical criteria.

5.2 Injected Failures Erode Consistency Three models are evaluated on General tasks with error injection rates from 0.0 to 0.6.

Pass@3 (capability ceiling) remains largely stable.
Pass $^3$ (reliability floor) drops sharply (e.g., Gemini-3.1-Pro loses 24.2%).
The gap between Pass@3 and Pass $^3$ widens monotonically, quantifying a growing divide between capability and reliability under perturbation.

5.3 Better Questions, Not More, Yield Better Multi-turn Performance Analysis of 38 multi-turn dialogue tasks across 13 models.

Round count shows near-zero correlation with Pass $^3$ ( $r = 0.07$ , $R^2 < 0.01$ ).
Question precision (quality of clarifying questions) strongly correlates with Pass $^3$ ( $r = 0.87$ , $R^2 = 0.76$ ).

What separates high-performing agents is not how many questions they ask, but how well they ask them.

5.4 Multimodal Capability is Domain-Specific Table 6: Pass $^3$ (%) by model and multimodal domain.

Model	Video (53)	Doc (22)	Code (26)	Overall
GPT 5.4	11.5	54.5	29.6	25.7
Claude Opus 4.6	15.4	45.5	25.9	24.8
Claude Sonnet 4.6	15.4	40.9	25.9	23.8
MiMo V2 Omni	5.8	18.2	33.3	15.8

Finding: No single model dominates all domains. Each domain (Video, Doc & Image, Code) has a different leader. Overall rankings obscure substantial rank shifts at the domain level.

Theoretical and Practical Implications

For Evaluation Methodology: The paper demonstrates that trajectory-opaque evaluation is systematically unreliable and unsafe. Trustworthy agent evaluation requires a hybrid approach combining rule-based checks on structured, auditable evidence (traces, logs, snapshots) with LLM judgment for open-ended assessment. Multi-dimensional scoring (Completion, Safety, Robustness) and multi-trial metrics (Pass@k, Pass $^k$ ) are essential to capture the full spectrum of deployable capability.
For Agent Development: The findings highlight actionable directions:
1. Prioritize consistency and error recovery over peak performance, as robustness is a distinct capability axis that degrades under realistic perturbations.
2. Develop domain-targeted multimodal perception rather than assuming uniform scaling, as no model excels across all visual domains.
3. Focus on information acquisition strategy quality in interactive settings, as effective questioning is far more critical than conversation length.

Conclusion

Claw-Eval provides a transparent, end-to-end evaluation suite that addresses critical gaps in existing agent benchmarks through full-trajectory auditing, cross-modal task coverage, and integrated multi-dimensional scoring. Key findings confirm that:

Trajectory-opaque evaluation is unreliable, missing significant safety and robustness issues.
Agent capability is not monolithic; consistency (reliability) under perturbation is a distinct and vulnerable axis separate from peak performance.
Aggregate metrics mask structured capability gaps across modalities and interaction types (e.g., question quality > quantity).

These findings suggest that future agent development should prioritize consistent error recovery, domain-targeted multimodal perception, and high-quality information acquisition strategies to build agents that are not only technically capable but also reliably deployable in real-world environments.