Summary (Overview)

  • EnterpriseClawBench is an enterprise agent benchmark constructed from 852 real-world agent sessions (with a 120-task manually audited Lite subset), using a fully automated pipeline that converts proprietary workplace interactions into reproducible, artifact-centric tasks.
  • The benchmark evaluates harness–model combinations (not just models) across five harnesses (Claude Code, Codex, DeepAgents, Hermes, OpenClaw) and nine models, reporting not only scores but also cost, runtime, artifact delivery, and multi-dimensional semantic quality.
  • The best configuration on the 120-task Lite subset (Codex with GPT-5.5) scores only 0.663, showing that enterprise artifact tasks remain far from saturated.
  • The benchmark supports skill generalization experiments: skills distilled from one task subclass can be tested on held-out tasks from the same class, revealing strong creator–consumer fit effects.
  • The benchmark data are not released due to proprietary content; the reusable contribution is the construction and evaluation protocol.

Introduction and Theoretical Foundation

Large language models are evolving from text-only assistants into agents that operate inside executable workspaces and return business artifacts. A session is defined as a bounded workplace interaction comprising chat turns, uploaded files, tool traces, generated artifacts, and persistent workspace state. Correctness requires not only a proper chat response but also correct input recovery, workspace state preservation, and production of usable files under practical cost and latency—requirements that separate enterprise-agent evaluation from knowledge QA and isolated tool use.

Three gaps are identified:

  1. Enterprise realism vs. scalable task construction: Existing benchmarks (Workspace-Bench, WorkArena, TheAgentCompany, EnterpriseBench, EntWorld) are often human-authored, simulated, or built from public environments, leaving a gap between benchmark tasks and naturally occurring enterprise demand.
  2. Multidimensional evaluation: Agent performance is shaped by more than the base model—harness effects, multimodal judging, artifact delivery quality, time, and cost must be reported as a coupled result.
  3. Task-class-level skill evaluation: Reusable skills are becoming operational assets, but existing benchmarks evaluate skills at the level of individual task items rather than as transfer units across tasks from the same class.

The paper contributes:

  • an automated construction protocol that converts real enterprise agent sessions into reproducible benchmark tasks;
  • a multidimensional evaluation framework that jointly reports harness–model performance, file delivery, text/visual semantic quality, cost, and runtime;
  • native support for evaluating skill generalization across held-out tasks from the same enterprise task class.

Methodology

Data and Construction Pipeline

EnterpriseClawBench is built from continuous internal use of an enterprise agent system at an AI startup with over 100 employees (March–May 2026). Employees interact through private/group chats, upload files, and expect deliverables in a persistent Linux workspace. The pipeline (Figure 2 in the paper) converts noisy proprietary sessions into reproducible benchmark tasks:

  1. Task recovery: Split and merge session turns to obtain 5,291 raw TaskInstances.
  2. Mechanical gates: Four parallel checks: length filter (5,181 pass), fixture lookup (4,896 pass), redaction recovery (4,286 pass), and public-network gate (5,003 pass). Joining the four gates yields 3,813 mechanically usable candidates.
  3. Self-contained review and single-turn prompt rewriting: 852 tasks kept.
  4. Taxonomy assignment: Each task is assigned a role class (7 explicit classes: Product/project, Engineering/IT, HR/admin, Executive, Sales/customer, Marketing, Finance/ops) and a role-specific skill subclass (45 total subclasses).
  5. Annotation: Expected deliverables, hard rules (objective checks like file type, count, non-emptiness), and semantic rubrics (text/visual).
  6. Sandbox preflight: Checks input upload, agent execution, artifact download, and judge routing.
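
The mechanical-gate step in the list above can be read as a conjunction over independent checks: a candidate survives the join only if every gate passes. The following is a minimal sketch under stated assumptions. The `TaskInstance` fields, the gate predicates, and the length bounds are hypothetical placeholders, since the source names the gates but does not describe their logic or thresholds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class TaskInstance:
    # Hypothetical fields; the real schema is not described in the source.
    task_id: str
    prompt: str
    fixtures_found: bool
    redaction_recovered: bool
    needs_public_network: bool


# Each gate is an independent predicate, so the gates could run in parallel.
# The length bounds are illustrative assumptions, not values from the paper.
GATES: Dict[str, Callable[[TaskInstance], bool]] = {
    "length": lambda t: 20 <= len(t.prompt) <= 20_000,
    "fixture_lookup": lambda t: t.fixtures_found,
    "redaction_recovery": lambda t: t.redaction_recovered,
    "public_network": lambda t: not t.needs_public_network,
}


def mechanical_join(
    tasks: List[TaskInstance],
) -> Tuple[List[TaskInstance], Dict[str, int]]:
    """Return the tasks that pass every gate, plus per-gate pass counts."""
    counts = {name: 0 for name in GATES}
    usable = []
    for task in tasks:
        results = {name: gate(task) for name, gate in GATES.items()}
        for name, ok in results.items():
            counts[name] += int(ok)
        if all(results.values()):
            usable.append(task)
    return usable, counts


# Illustrative inputs only; these are not tasks from the benchmark.
sample = [
    TaskInstance("t1", "Summarize the attached quarterly report.", True, True, False),
    TaskInstance("t2", "Fix it", True, True, False),  # fails the length gate
    TaskInstance("t3", "Build a landing page from the brief.", False, True, False),  # fixture missing
]
usable, counts = mechanical_join(sample)
print([t.task_id for t in usable], counts)
```

Because each gate is counted independently, the joined count can be smaller than every individual gate count, which matches the reported numbers (3,813 joined versus at least 4,286 per gate).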

Benchmark statistics (Figure 4): 852 tasks, 719 input fixture files (mostly MD, DOC, Image, PDF, Sheet, Code), 887 expected deliverables (mostly MD, TXT, HTML).
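
The funnel counts reported for the construction pipeline can be turned into retention rates, which makes it easier to see where candidates are lost. The sketch below only re-does arithmetic on the counts quoted in this summary; the variable and function names are illustrative.

```python
# Counts copied from the pipeline description in this summary.
RAW = 5291
GATE_PASSES = {
    "length_filter": 5181,
    "fixture_lookup": 4896,
    "redaction_recovery": 4286,
    "public_network_gate": 5003,
}
JOINED = 3813  # candidates after the mechanical join
KEPT = 852     # tasks kept after self-contained review and rewriting
LITE = 120     # manually audited Lite subset


def rate(part: int, whole: int) -> float:
    """Fraction of `whole` that survives as `part`."""
    return part / whole


def funnel_report() -> list:
    """Build one formatted line per funnel stage."""
    lines = []
    for name, passed in GATE_PASSES.items():
        lines.append(f"{name:<22}{passed:>6}  {rate(passed, RAW):6.1%} of raw")
    lines.append(f"{'mechanical_join':<22}{JOINED:>6}  {rate(JOINED, RAW):6.1%} of raw")
    lines.append(f"{'kept_after_review':<22}{KEPT:>6}  {rate(KEPT, JOINED):6.1%} of joined")
    lines.append(f"{'lite_subset':<22}{LITE:>6}  {rate(LITE, KEPT):6.1%} of kept")
    return lines


for line in funnel_report():
    print(line)
```

On these counts, redaction recovery is the most selective mechanical gate, and the review and rewriting step removes a larger share of candidates than the mechanical join does.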

Evaluation Setting

  • Harness–model combinations: Five harnesses (Claude Code, Codex, DeepAgents, Hermes, OpenClaw) with supported models (GPT-5.5, Sonnet 4.6, Opus 4.6, Haiku 4.5, Kimi K2.6, MiniMax-M3, GPT-4.1-mini, Qwen3-235B-A22B, DeepSeek V4 Pro).
  • Scoring: Two layers.
    • Hard rules check objective delivery properties (required file type, file count, non-emptiness, openability, tracebacks, unreplaced placeholders).
    • Semantic judges score output quality along five dimensions: grounded accuracy, task relevance, substantive depth, practical utility, communication quality.
    • Modality routing: text-extractable outputs → text judge (Sonnet 4.6); HTML, slides, PDFs, spreadsheets, images → rendered screenshots → visual judge (Sonnet 4.6).
  • Skill evaluation experiment: For a frontend-page-generation subclass, collect traces/artifacts from 10 in-domain tasks → a skill creator model distills an agent-specific skill → inject skill back into the same consumer agent → evaluate on 5 held-out tasks. The effect is the change in held-out average score before/after injection.
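
The two scoring layers described above can be sketched as a rule check over delivered files followed by routing to a text or visual judge. This is a sketch under assumptions rather than the paper's implementation: the extension groups, the placeholder pattern, and the function names are hypothetical, and the openability check is omitted because it depends on format-specific parsers.

```python
import re
from pathlib import Path
from typing import Dict, List

# Assumed extension groups; the source names artifact categories, not extensions.
VISUAL_EXTS = {".html", ".pptx", ".pdf", ".xlsx", ".png", ".jpg"}
TEXT_EXTS = {".md", ".txt", ".json", ".py"}

# Assumed pattern for unreplaced placeholders such as {{title}} or TODO.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}|\bTODO\b")


def route_judge(path: str) -> str:
    """Pick a judge route from the file extension."""
    ext = Path(path).suffix.lower()
    if ext in VISUAL_EXTS:
        return "visual"  # rendered to screenshots before judging
    if ext in TEXT_EXTS:
        return "text"
    return "unrouted"


def check_hard_rules(
    files: Dict[str, bytes], required_ext: str, required_count: int
) -> List[str]:
    """Return the violated rules; an empty list means every rule passed."""
    violations = []
    matching = [n for n in files if Path(n).suffix.lower() == required_ext]
    if len(matching) < required_count:
        violations.append("file_count")
    for name in matching:
        data = files[name]
        if not data:
            violations.append(f"empty:{name}")
            continue
        text = data.decode("utf-8", errors="ignore")
        if "Traceback (most recent call last)" in text:
            violations.append(f"traceback:{name}")
        if PLACEHOLDER.search(text):
            violations.append(f"placeholder:{name}")
    return violations


# Illustrative deliverables only; not taken from the benchmark.
delivered = {
    "report.md": b"# Q1 report\nRevenue grew.",
    "page.html": b"<h1>{{title}}</h1>",
    "empty.md": b"",
}
print(check_hard_rules(delivered, ".md", 2))
print(check_hard_rules(delivered, ".html", 1))
print({name: route_judge(name) for name in delivered})
```

Keeping the hard rules separate from the semantic judges means a delivery failure, such as a missing or empty file, is recorded without being confused with a low quality score.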

Empirical Validation / Results

Main Leaderboard on Lite (120 tasks, 32 harness–model combinations)

The best score is 0.663 (Codex with GPT-5.5). Figure 1 in the paper shows the full ranking. Key findings:

  • Harness–model interaction: Claude-family models (Sonnet 4.6, Opus 4.6, Haiku 4.5) drop substantially under Hermes (e.g., Sonnet 4.6: 0.62–0.64 under Claude Code/DeepAgents/OpenClaw → 0.458 under Hermes). Trace inspection suggests approval checks and trace truncation prevent artifact completion.
  • Cost–score trade-off: The relationship is non-linear and log-like: moving from low to mid cost brings large gains, with diminishing returns beyond the mid-range. Hermes/Claude combinations are outliers (high cost, low score).
  • Role-class effects: GPT-5.5 is the most robust generalist. Marketing and Finance/ops remain hardest (heavy document comprehension + company-specific conventions).
  • Artifact-type effects: File format changes rankings. GPT-5.5 is strongest on HTML and code/JSON; Opus 4.6 is strongest on spreadsheets. Visual judge scores are systematically inflated for spreadsheets and presentations.
  • Rubric-dimension analysis: Systems are generally better at communication quality and task relevance than at grounded accuracy (accuracy bottleneck from failing to locate key information in large files).

Scalability Check on Full Set (852 tasks, DeepAgents harness)

Model          Score   Text    Visual  Rule
GPT-5.5        0.766   0.813   0.642   0.959
Sonnet 4.6     0.749   0.793   0.634   0.957
Haiku 4.5      0.632   0.666   0.542   0.963
GPT-4.1-mini   0.336   0.383   0.213   0.817

Table 2 (from paper). The model ranking on the full benchmark matches the ranking on the Lite subset, which the paper takes as evidence that the automated pipeline remains effective at scale.

Skill Evaluation

For the frontend-page-generation subclass, a skill injection matrix (Figure 9 in the paper) shows:

  • GPT-5.5 is the strongest creator (mean delta +0.068, no negative deltas).
  • Haiku 4.5 is the weakest creator (mean delta -0.094, large degradation for OpenClaw/Kimi K2.6).
  • Skill quality depends strongly on creator model; consumer ability and creator–consumer fit also matter.
  • Conclusion: Skill injection is high variance; evaluation should use a consumer–creator matrix, not a single averaged score.
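
The consumer–creator matrix described above can be summarized per creator by its mean delta and by how many consumer cells it degrades. The sketch below assumes a delta is the held-out mean score after injection minus the held-out mean score before injection. All names and numbers in it are illustrative placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, List


def skill_delta(before: List[float], after: List[float]) -> float:
    """Change in held-out average score after injecting a skill."""
    return mean(after) - mean(before)


def summarize_creators(
    matrix: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """matrix[creator][consumer] holds a delta; summarize each creator row."""
    summary = {}
    for creator, row in matrix.items():
        deltas = list(row.values())
        summary[creator] = {
            "mean_delta": mean(deltas),
            "negative_cells": sum(d < 0 for d in deltas),
            "worst_delta": min(deltas),
        }
    return summary


# Hypothetical held-out scores for one consumer on 5 held-out tasks.
before = [0.5, 0.6, 0.4, 0.5, 0.5]
after = [0.6, 0.6, 0.5, 0.5, 0.6]
print(round(skill_delta(before, after), 3))

# Hypothetical 2x2 matrix; creator and consumer names are placeholders.
matrix = {
    "creator_A": {"consumer_X": 0.10, "consumer_Y": 0.02},
    "creator_B": {"consumer_X": -0.20, "consumer_Y": 0.04},
}
for creator, stats in summarize_creators(matrix).items():
    print(creator, stats)
```

Reporting the worst cell and the count of negative cells alongside the mean keeps a high-variance creator from looking safe just because its average is near zero.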

Judge Reliability

  • LLM–LLM agreement: GPT-5.4-text correlates strongly with Sonnet 4.6 text judge (Spearman ρ = 0.918, 1,853 cases); GPT-5.4-visual preserves ordering (ρ = 0.866, 1,428 cases).
  • Human calibration audit (48 packets, text/visual split):
Scope     n    Human   Sonnet  MAE     Spearman
Overall   48   0.571   0.498   0.219   0.263
Text      24   0.476   0.504   0.134   0.790
Visual    24   0.666   0.492   0.303   -0.259

Table 3 (from paper). Text route aligns well with human scores; visual agreement is much weaker (MAE 0.303, negative rank correlation). This exposes an important gap: multimodal evaluation is not yet mature.
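
The agreement statistics in Table 3 are mean absolute error (MAE) and Spearman rank correlation between human and judge scores. A minimal, dependency-free sketch of both follows. It uses the standard definitions (Spearman as the Pearson correlation of average ranks) and is not the paper's evaluation code. The score lists are illustrative, not the audited packets.

```python
from math import sqrt
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mae(x: List[float], y: List[float]) -> float:
    """Mean absolute difference between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Illustrative scores only; not the audit data.
human = [0.2, 0.5, 0.9, 0.7]
judge = [0.3, 0.4, 0.8, 0.9]
print(round(mae(human, judge), 3), round(spearman(human, judge), 3))
```

The two statistics answer different questions: MAE measures how far the judge's scores sit from human scores, while Spearman measures whether the judge orders outputs the same way. A judge can have moderate MAE and still rank outputs in the wrong order, which is the pattern the visual route shows in Table 3.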

Theoretical and Practical Implications

  • Evaluation must report harness–model combinations – enterprise agent performance is coupled with the scaffolding, not just the base model.
  • Multidimensional reporting (score, cost, runtime, artifact quality, semantic dimensions) is necessary to understand system behavior and trade-offs.
  • Skill generalization can be meaningfully tested at the task-class level using a consumer–creator matrix, revealing transfer patterns invisible from single-metric evaluations.
  • Enterprise artifact evaluation is far from saturated – the best systems still leave substantial room for improvement, especially in grounded accuracy and visual artifact quality.
  • Practical guidance for deployment: stronger models buy diminishing returns; harness compatibility is critical; visual judge reliability needs human calibration; and skill creation/consumption abilities are not aligned.

Conclusion

EnterpriseClawBench provides an automated pipeline to transform proprietary enterprise agent sessions into reproducible, artifact-centric benchmark tasks (852 tasks, 120 manually audited). The benchmark reveals that enterprise tasks remain unsaturated (best score 0.663), model performance changes substantially with the harness, and multidimensional evaluation (cost, runtime, artifact delivery, semantic quality) is necessary. The task-class taxonomy enables native skill-generalization testing, showing that skill quality depends on creator model, consumer behavior, and their fit. Limitations include single-enterprise deployment, no public data release due to proprietary content, and imperfect LLM judges (especially for visual artifacts). Future work should extend the protocol to more organizations and improve multimodal judge calibration.
