EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

Summary (Overview)

  • Introduces EnterpriseOps-Gym: A comprehensive benchmark designed to evaluate agentic planning in realistic enterprise settings, featuring a containerized sandbox with 164 database tables and 512 functional tools, and 1,150 expert-curated tasks across eight mission-critical verticals (Customer Service, HR, IT, etc.).
  • Reveals Critical Limitations in SOTA Models: Evaluation of 14 frontier models shows the top-performing Claude Opus 4.5 achieves only a 37.4% success rate. Performance degrades sharply with task horizon length and is worst on policy-governed and cross-domain tasks.
  • Identifies Planning as Primary Bottleneck: Providing oracle human-authored plans improves model performance by 14–35 percentage points, far exceeding gains from automated planning or multi-agent orchestration, indicating strategic reasoning is the key challenge, not tool execution.
  • Highlights Safety and Compliance Gaps: Agents frequently fail to refuse infeasible tasks (the best model refuses correctly only 53.9% of the time), creating potential for harmful side effects. Models struggle most with Permission and Process Compliance verifiers.
  • Provides Cost-Performance Analysis: Presents a Pareto frontier for model selection, showing Gemini-3-Flash offers a strong cost-performance trade-off, while Claude Opus 4.5 provides the highest absolute performance at a premium.

Introduction and Theoretical Foundation

Large Language Models (LLMs) are evolving from conversational assistants to autonomous agents capable of executing complex workflows. A critical application is their deployment as AI workers in enterprise environments. However, this deployment is hindered by benchmarks that fail to capture the intricacies of professional settings, which require:

  1. Long-horizon planning amidst persistent state changes.
  2. Strict adherence to access control policies and procedural rules.
  3. Coherent state management across long sequences of interleaved tool calls.

In enterprise domains, agents directly modify live databases, trigger downstream workflows, and affect real users. Actions are stateful and often irreversible, errors propagate silently, and strict organizational policies constrain every step. Existing benchmarks are inadequate:

  • General tool-use evaluations (e.g., ToolLLM, API-Bank) treat tool calls as atomic and stateless, lacking state dependencies.
  • Enterprise-focused benchmarks (e.g., WorkArena, CRMArena) are often confined to single vendor ecosystems with shallow environments (<25 tables, <50 tools) and limited policy constraints.

EnterpriseOps-Gym is introduced to bridge this gap, providing a high-fidelity enterprise simulation to stress-test the core challenges of enterprise automation.

Methodology

Benchmark Construction

Domain Selection: Eight domains were selected based on relevance, policy complexity, and availability of subject-matter experts (SMEs).

  • Operational Backbone: Customer Service Management (CSM), Human Resources (HR), IT Service Management (ITSM). Characterized by strict process compliance and high-stakes workflows.
  • Universal Collaboration Tools: Email, Calendar, Teams, Drive. Require sophisticated orchestration and management.
  • Hybrid: Mandates cross-domain orchestration across these fragmented tools.

Sandbox Environment:

  • A fully interactive, containerized Docker sandbox.
  • Hosts 164 relational database tables and 512 functional tools.
  • Models realistic enterprise constraints without exposing proprietary infrastructure.
  • High relational density: average of ~1.7 Foreign Keys (FK) per table, indicating complex inter-table dependencies.
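
The ~1.7 FK/table density figure is straightforward to compute for any relational schema. A minimal sketch using SQLite introspection (the three-table schema below is hypothetical, not the benchmark's actual 164-table schema):

```python
import sqlite3

# Build a tiny illustrative schema (table/column names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE tickets (
    id INTEGER PRIMARY KEY,
    opened_by   INTEGER REFERENCES users(id),
    assigned_to INTEGER REFERENCES users(id)
);
CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    ticket_id INTEGER REFERENCES tickets(id)
);
""")

# Average foreign keys per table, the paper's relational-density statistic.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
fk_total = sum(
    len(conn.execute(f"PRAGMA foreign_key_list({t})").fetchall())
    for t in tables)
print(fk_total / len(tables))  # 3 FKs over 3 tables -> 1.0
```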

Task Creation Pipeline: A rigorous, multi-stage process involving over 160 contributors (software engineers and SMEs).

  1. Scenario Design: Crafting multi-step scenarios with specific complexity thresholds (tool counts, verification conditions, state dependencies, policy conflicts). Includes 30 infeasible tasks.
  2. Ground Truth Execution & Plan Authoring: For each task, annotators provide a detailed step-by-step natural language reasoning plan and manually execute it to capture a gold-standard trajectory.
  3. Outcome-based Verification: Annotators author executable SQL verification scripts that check the final state for: goal achievement, integrity constraints, permission compliance, and side effects.
  4. Quality Assurance: Multiple rounds of review for task feasibility, instruction clarity, verification script correctness, and plan coherence.
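
The outcome-based verification in step 3 can be pictured as a list of SQL checks run against the final database state, all of which must pass. A minimal sketch (the schema, task, and check wording are illustrative assumptions, not the benchmark's actual scripts):

```python
import sqlite3

# Final sandbox state after a hypothetical "resolve ticket #10" task.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tickets (id INTEGER PRIMARY KEY,
                      assignee INTEGER REFERENCES users(id),
                      status TEXT);
INSERT INTO users   VALUES (1, 'alice');
INSERT INTO tickets VALUES (10, 1, 'resolved');
""")

# Each verifier is a (category, SQL) pair that must return a truthy scalar.
verifiers = [
    # Goal achievement: the target ticket ended up resolved.
    ("goal", "SELECT COUNT(*) FROM tickets WHERE id = 10 AND status = 'resolved'"),
    # Integrity: every ticket's assignee points at an existing user.
    ("integrity", """SELECT COUNT(*) = 0 FROM tickets t
                     LEFT JOIN users u ON t.assignee = u.id
                     WHERE u.id IS NULL"""),
]

results = {name: bool(conn.execute(sql).fetchone()[0]) for name, sql in verifiers}
success = all(results.values())  # the task passes only if ALL checks pass
print(results, success)
```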

Evaluation Setup

Baseline Models: 14 frontier models evaluated under a unified interface.

  • Closed-source: Claude 4.5 variants (Opus, Sonnet), GPT variants (5.2 High, 5.2 Low, 5, 5-Mini), Gemini variants (3-Pro, 3-Flash, 2.5-Pro).
  • Open-source: Kimi-K2-Thinking, DeepSeek-V3.2, GPT-OSS-120B, Qwen3 variants (235B Inst., 30B Think, 4B Think).

Agent Framework: Standard ReAct-style reasoning and tool-execution loop. Evaluated in an oracle-tool setting (perfect retriever) to focus on planning and execution.
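
The ReAct-style loop above can be sketched as follows; `model` and `tools` are stand-ins for the paper's unified interface, not its actual implementation:

```python
# Minimal ReAct-style loop: the model alternates reasoning with tool calls
# until it emits a final answer or exhausts the step budget.
def react_loop(model, tools, task, max_steps=30):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = model(history)                     # assumed to return a dict
        history.append({"role": "assistant", "content": step})
        if step["type"] == "final":               # agent declares completion
            return step["answer"], history
        if step["type"] == "tool_call":           # execute against the sandbox
            result = tools[step["name"]](**step["args"])
            history.append({"role": "tool", "content": result})
    return None, history                          # horizon exhausted

# Toy usage: a scripted "model" that looks up a user, then answers.
script = iter([
    {"type": "tool_call", "name": "get_user", "args": {"name": "alice"}},
    {"type": "final", "answer": "user id is 1"},
])
answer, _ = react_loop(lambda h: next(script),
                       {"get_user": lambda name: {"id": 1}}, "find alice")
print(answer)
```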

Primary Metric: pass@1 task completion rate, averaged over three runs. A task is successful only if all SQL-based verification checks pass.
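
The metric reduces to a simple average: each run of each task is a strict boolean (all SQL verifiers passed), and per-run pass@1 is averaged over the three runs. A sketch with made-up run outcomes:

```python
# runs[i][j] = did task j succeed on run i? (outcomes below are illustrative)
runs = [
    [True, False, True, False],
    [True, False, False, False],
    [True, True, True, False],
]

per_run = [sum(r) / len(r) for r in runs]      # pass@1 of each run
avg_pass_at_1 = sum(per_run) / len(per_run)    # reported score
print(avg_pass_at_1)  # (0.5 + 0.25 + 0.75) / 3 = 0.5
```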

Ablation Studies:

  • Tool Overload: Adding distractor tools (+5, +10, +15) to test robustness.
  • Plan-Conditioned Execution: Providing models with (a) Claude-generated plans or (b) human-authored oracle plans.
  • Multi-Agent Systems: Testing Planner+Executor and Planner+Decompose+Subtask Executor architectures.
  • Thinking Budget: Varying test-time compute (low, medium, high) for GPT-OSS-120B.

Empirical Validation / Results

Overall Performance

Table 2: Overall task completion performance on EnterpriseOps-Gym

| Model                 | Teams | CSM  | Email | ITSM | Calendar | HR   | Drive | Hybrid | Infeas. | Avg. |
|-----------------------|-------|------|-------|------|----------|------|-------|--------|---------|------|
| Claude Opus 4.5       | 50.0  | 34.2 | 51.9  | 23.8 | 43.2     | 32.1 | 49.5  | 30.7   | 50.0    | 37.4 |
| Gemini-3-Flash        | 47.3  | 35.0 | 44.3  | 28.5 | 30.5     | 12.6 | 49.7  | 24.2   | 38.5    | 31.9 |
| GPT-5.2 (High)        | 31.0  | 34.8 | 51.0  | 21.7 | 38.5     | 25.0 | 40.0  | 22.2   | 50.0    | 31.8 |
| Claude Sonnet 4.5     | 51.0  | 16.7 | 51.3  | 17.6 | 34.6     | 21.6 | 52.1  | 28.1   | 46.2    | 30.9 |
| GPT-5                 | 26.3  | 36.4 | 49.0  | 18.9 | 41.3     | 17.9 | 34.0  | 23.5   | 50.5    | 29.8 |
| …                     | …     | …    | …     | …    | …        | …    | …     | …      | …       | …    |
| DeepSeek-V3.2 (High)  | 37.0  | 14.1 | 47.1  | 16.1 | 21.2     | 16.3 | 35.2  | 22.9   | 53.8    | 24.5 |
| GPT-OSS-120B (High)   | 32.0  | 16.3 | 42.3  | 6.1  | 35.6     | 16.3 | 41.0  | 19.6   | 50.0    | 23.7 |
  • Top Performance: Claude Opus 4.5 leads with 37.4% average success.
  • Domain Variance: Models perform best on collaboration domains (Email, Teams, Drive, ~51%) and worst on policy-heavy ITSM (best: 28.5%) and cross-domain Hybrid (best: 30.7%).
  • Open-source Lag: Best open-source model (DeepSeek-V3.2 High) achieves 24.5%.
  • Infeasibility Detection: Best models (GPT-5.2 Low, DeepSeek-V3.2 High) refuse tasks cleanly only ~54% of the time.

Key Findings

  1. Planning Horizon: Performance degrades monotonically with task horizon length (see Figure 4). Closed-source models show greater resilience than open-source models.
  2. Tool Overload is Not a Bottleneck: Adding distractor tools had negligible impact on performance (~+1.0% change), confirming the primary challenge is not tool discovery.
  3. Cost-Performance Trade-off: Gemini-3-Flash provides a strong practical trade-off (31.9% at $0.03/task). Claude Opus 4.5 offers the highest reliability (37.4%) at a premium ($0.36/task). DeepSeek-V3.2 (High) is Pareto-dominant among open-source models (24.5% at $0.014/task).
  4. Thinking Budget Scaling: Increasing test-time compute yields substantial gains for GPT-OSS-120B (e.g., Drive: 8.6% → 41.0%), but performance plateaus early in some domains (e.g., ITSM).
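
The cost-performance trade-off in finding 3 is a standard Pareto-frontier computation: a model is dominated if another model is both cheaper and at least as accurate. A sketch using the (cost, success) figures quoted above:

```python
# Pareto frontier over (cost per task in $, average success rate in %).
models = {
    "Claude Opus 4.5":      (0.36,  37.4),
    "Gemini-3-Flash":       (0.03,  31.9),
    "DeepSeek-V3.2 (High)": (0.014, 24.5),
}

def pareto(models):
    keep = {}
    for name, (cost, score) in models.items():
        # Dominated if some other model is no costlier and no worse.
        dominated = any(c <= cost and s >= score and (c, s) != (cost, score)
                        for c, s in models.values())
        if not dominated:
            keep[name] = (cost, score)
    return keep

print(sorted(pareto(models)))  # all three sit on the frontier
```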

Ablation Studies: Isolating the Planning Bottleneck

Table 3: Plan-Conditioned Execution Baseline (Performance on CSM, ITSM, HR)

| Model     | Plan             | CSM           | ITSM          | HR            |
|-----------|------------------|---------------|---------------|---------------|
| Kimi-K2   | Claude Plan (CP) | 19.6 (↑12.5%) | 18.1 (↑5.9%)  | 17.2 (↑9.0%)  |
| Kimi-K2   | Human Plan (HP)  | 42.2 (↑35.1%) | 29.1 (↑16.9%) | 34.5 (↑26.3%) |
| Qwen3-30B | Claude Plan (CP) | 15.2 (↑9.8%)  | 11.7 (↑5.0%)  | 17.9 (↑10.3%) |
| Qwen3-30B | Human Plan (HP)  | 33.9 (↑28.5%) | 20.9 (↑14.2%) | 33.2 (↑25.6%) |
  • Human Plans Reveal High Ceiling: Providing oracle human plans yields massive improvements (14–35 percentage points), far exceeding gains from automated Claude plans (5–13 pp).
  • Small Models Can Execute Faithfully: Qwen3-4B with human plans becomes competitive with larger models, suggesting the core deficit is strategic reasoning, not tool invocation.
  • Complex Orchestration Fails: A Planner+Decompose+Subtask Executor multi-agent system often regresses performance due to strong sequential state dependencies in tasks (see Figure 6).
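
Plan-conditioned execution amounts to prepending the oracle plan to the task so the model only needs to execute, not strategize. A sketch of how such a prompt might be assembled (the wording and example task are hypothetical, not the paper's actual template):

```python
# Prepend an oracle plan to the task instruction for plan-conditioned runs.
def plan_conditioned_prompt(task, plan_steps):
    plan = "\n".join(f"{i}. {s}" for i, s in enumerate(plan_steps, 1))
    return (f"Task: {task}\n\n"
            f"Follow this plan exactly, one step per tool call:\n{plan}")

prompt = plan_conditioned_prompt(
    "Reassign ticket #10 to the on-call engineer",
    ["Look up the on-call engineer's user id",
     "Update ticket #10's assignee to that id",
     "Add a reassignment comment to the ticket"])
print(prompt)
```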

Failure Analysis

Table 7 (Excerpt): Verifier Pass Rates by Category

  • Task Completion: ~40-60%
  • Integrity Constraints: ~30-50%
  • Permission and Process Compliance: ~20-40% (Lowest)

Common Failure Modes:

  • Missing Prerequisite Lookup: Creating database objects without first querying necessary prerequisites, leading to broken foreign-key links.
  • Cascading State Propagation: Failing to trigger follow-up actions mandated by system policies.
  • Incorrect ID Resolution: Passing unverified identifiers instead of resolving correct IDs through prior tool interactions.
  • Premature Completion Hallucination: Declaring task completion before all required steps are executed.
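
The first and third failure modes can be reproduced directly in any database that enforces foreign keys: writing a guessed identifier instead of one resolved through a prior lookup breaks the FK link. A sketch with an illustrative schema (not the benchmark's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce FK constraints
conn.executescript("""
CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tickets (id INTEGER PRIMARY KEY,
                      assignee INTEGER NOT NULL REFERENCES users(id));
INSERT INTO users VALUES (7, 'alice');
""")

# Failing agent: passes an unverified identifier (user 1 does not exist).
try:
    conn.execute("INSERT INTO tickets VALUES (10, 1)")
    failed = False
except sqlite3.IntegrityError:
    failed = True  # broken foreign-key link caught at write time

# Correct agent: resolve the ID via a prerequisite lookup, then write.
(uid,) = conn.execute("SELECT id FROM users WHERE name = 'alice'").fetchone()
conn.execute("INSERT INTO tickets VALUES (10, ?)", (uid,))
print(failed, uid)  # -> True 7
```

Note that without FK enforcement the first insert would succeed silently, which is exactly how these errors propagate unnoticed in live systems.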

Theoretical and Practical Implications

  1. Agents Are Not Enterprise-Ready: With success rates below 40% and poor infeasibility detection (~54%), current agents are not reliable for autonomous deployment without human oversight. Failures are systematic, centered on policy compliance and state management.
  2. Planning is the Dominant Bottleneck: The large gap between human-plan and model-plan performance indicates that constraint-aware strategic reasoning is the primary challenge, not tool execution proficiency. This dissociates planning capability from general model scale.
  3. Safety and Compliance are Critical Weaknesses: Low performance on Permission and Process Compliance verifiers highlights a major risk for production deployment, where policy violations can cause cascading system failures and security vulnerabilities.
  4. Benchmark Design Matters: EnterpriseOps-Gym demonstrates the need for benchmarks with high environmental fidelity (many tables/tools), outcome-based verification, expert curation, and explicit tests for safe refusal behavior.

Conclusion

EnterpriseOps-Gym establishes that current LLM agents are far from reliable for autonomous enterprise work, primarily due to deficiencies in long-horizon, policy-aware planning. The benchmark provides a concrete testbed to advance research in this direction.

Main Takeaways:

  • The best model achieves only 37.4% success, with performance decaying as task complexity increases.
  • Providing human plans boosts performance by 14–35 pp, identifying strategic reasoning as the core bottleneck.
  • Agents are unsafe, correctly refusing infeasible tasks only ~54% of the time.
  • Permission and Process Compliance is the hardest category for models.

Future Research Directions:

  1. Constraint-Aware Plan Generation: Methods that explicitly reason over policies, side-effects, and prerequisites.
  2. Long-Horizon State Management: Mechanisms (e.g., episodic memory) to maintain coherent world state over many steps and prevent error accumulation.
  3. Safe Refusal and Escalation: Improving reliability in detecting and abstaining from infeasible or policy-violating requests.

EnterpriseOps-Gym is released to the community to drive progress toward reliable AI workers capable of handling the intricacies of professional enterprise environments.