OCCUBENCH: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

Summary (Overview)

  • Introduces OCCUBENCH, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation.
  • Evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults).
  • Key Findings: (1) No single model dominates all industries; each has a distinct occupational capability profile. (2) Implicit faults are harder than explicit and mixed faults. (3) Larger models, newer generations, and higher reasoning effort consistently improve performance. (4) Strong agents are not necessarily strong environment simulators.
  • Methodology: Uses a multi-agent synthesis pipeline to automatically produce evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity.
  • Results: Evaluates 15 frontier models; GPT-5.2 leads overall (79.6%), but performance varies significantly by industry. Implicit faults cause the largest performance drop (average 53.4% vs. 67.5% clean).

Introduction and Theoretical Foundation

AI agents are increasingly expected to perform professional work across diverse occupational domains such as emergency patient triage, financial auditing, and customs processing. However, a fundamental evaluation gap exists: the professional domains where agents would deliver the most value are precisely the domains where no benchmarks exist. Existing benchmarks (e.g., WebArena, OSWorld, SWE-bench) are confined to domains with available public environments or APIs, creating a severe blind spot covering the vast majority of high-value professional work. Limitations include:

  • The Untestable Majority: Domains like healthcare, finance, and energy are bound to enterprise systems with no public access.
  • Prohibitive Scaling Cost: Adding new domains requires substantial engineering (deploying applications, integrating APIs).
  • No Robustness Evaluation: Benchmarks evaluate only the "happy path," ignoring real-world environmental noise like API timeouts and incomplete data.

Our Approach: Language World Models (LWMs)

The key observation is that the environment itself can be simulated by an LLM. Given a configuration c, an LLM becomes a stateful, interactive environment. This transforms environment construction from an engineering problem into a configuration problem, extending benchmark coverage to "any domain an LLM can understand."

Methodology

Language World Model Formalization

A Language World Model (LWM) is defined as a function:

(s_{t+1}, o_{t+1}) = f_θ(s_t, a_t; c)    (1)

where:

  • c = (system prompt, tool schema, initial state, state description) is the environment configuration.
  • s_t is the latent environment state maintained implicitly by the LLM through its context window.
  • a_t is the agent's action (a tool call with name and arguments).
  • o_{t+1} is the observation returned to the agent (a structured JSON tool response).
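The transition function f_θ maps naturally onto a single chat-completion call, with the latent state carried by the conversation history. A minimal sketch of that interface (the `chat` client and message format are assumptions, not APIs from the paper):

```python
import json

def lwm_step(chat, config, history, action):
    """One LWM transition: (s_t, a_t; c) -> (s_{t+1}, o_{t+1}).

    The latent state s_t lives implicitly in `history` (the LLM's context
    window); `config` carries the system prompt from the configuration c.
    `chat` is a hypothetical callable that sends messages to an LLM and
    returns its text reply.
    """
    messages = (
        [{"role": "system", "content": config["system_prompt"]}]
        + history
        + [{"role": "user", "content": json.dumps(action)}]
    )
    reply = chat(messages)            # LLM generates the tool response
    observation = json.loads(reply)   # structured JSON observation o_{t+1}
    # Appending the exchange advances the latent state s_t -> s_{t+1}.
    new_history = history + [
        {"role": "user", "content": json.dumps(action)},
        {"role": "assistant", "content": reply},
    ]
    return new_history, observation
```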

Why LLMs Can Serve as World Models:

  1. Format Priors: Pre-training on API documentation provides priors for generating well-formatted tool responses.
  2. Domain Knowledge: LLMs encode operational logic for hundreds of professional domains.
  3. State Maintenance: System prompt constraints and in-context tracking enable coherent multi-turn simulation.
  4. Edge Case Handling: LLMs handle unexpected inputs more gracefully than rule-based simulators.

Environment Configuration

Each LWM environment is fully specified by four components:

  1. System Prompt: Defines behavioral rules, simulation logic, error handling, and output format constraints.
  2. Tool Schema: Defines the agent's action space as a set of callable functions with typed parameters (median 5 tools per environment).
  3. Initial State: A structured JSON object specifying starting conditions.
  4. State Description: Semantic annotations guiding the LLM to maintain causal consistency.
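The four components might look like the following for a triage-style scenario. All field names and values here are illustrative assumptions, not a configuration taken from the benchmark:

```python
# Hypothetical LWM configuration for an emergency-triage scenario.
# Keys mirror the four components above; contents are invented examples.
config = {
    "system_prompt": (
        "You simulate a hospital triage system. Respond to each tool call "
        "with a JSON object only. Return an error object for malformed "
        "calls. Keep patient records consistent across turns."
    ),
    "tool_schema": [  # agent's action space (median 5 tools per environment)
        {"name": "get_vitals", "parameters": {"patient_id": "string"}},
        {"name": "assign_priority",
         "parameters": {"patient_id": "string", "level": "integer"}},
    ],
    "initial_state": {  # structured JSON starting conditions
        "patients": [{"id": "P-001", "complaint": "chest pain"}],
        "open_beds": 3,
    },
    "state_description": (  # semantic annotations for causal consistency
        "open_beds decreases by one whenever a patient is admitted; "
        "an assigned priority persists until explicitly changed."
    ),
}
```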

Multi-Agent Synthesis Pipeline

The pipeline automatically generates evaluation instances ensuring:

  • Solvability: A valid solution exists and is verified.
  • Verifiability: Clear, automated success criteria.
  • Discriminativeness: Calibrated difficulty distinguishing agent capabilities.
  • Diversity: Structural variation across instances (grounded by 16 non-overlapping sub-topics per scenario with professional reference documents).

The pipeline uses Gemini-3-Flash-Preview as the World Model to generate configurations, tasks, solutions, and rubrics. Tasks are executed multiple times to verify solvability and calibrate difficulty. A majority-vote verifier assesses trajectories, and a repair module fixes failures.

Evaluation Loop

The interaction between the agent and LWM follows the loop shown in Figure 1 (not reproduced here). The agent issues tool calls, the LWM generates observations, and the trajectory is scored by a rubric-based verifier.
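The loop can be summarized in a few lines; `agent`, `lwm`, and `verifier` are assumed callables standing in for the components named above, not interfaces defined by the paper:

```python
def run_episode(agent, lwm, verifier, max_turns=20):
    """Agent <-> LWM interaction loop, scored by a rubric-based verifier.

    The agent returns a tool call each turn, or None to signal it is done;
    the LWM turns each call into an observation. `max_turns` is an
    illustrative cap, not a value from the paper.
    """
    trajectory = []
    observation = None
    for _ in range(max_turns):
        action = agent(observation)   # next tool call, or None when finished
        if action is None:
            break
        observation = lwm(action)     # LWM generates the observation
        trajectory.append((action, observation))
    return verifier(trajectory)       # rubric-based score of the trajectory
```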

Empirical Validation / Results

Benchmark Scale and Coverage

OCCUBENCH covers 100 scenarios across 10 industry categories (Table 1) and 65 specialized domains, resulting in 382 solvable task instances.

Table 1: Industry categories and representative scenarios in OCCUBENCH

| Category | # Scenarios | Representative Scenarios |
| --- | --- | --- |
| Business & Enterprise | 19 | Resume screening, expense auditing, AML review |
| Technology & IT | 16 | Linux ops, CI/CD recovery, intrusion response |
| Industrial & Engineering | 12 | Production scheduling, mine ventilation |
| Transportation & Logistics | 11 | Last-mile delivery, train dispatch |
| Commerce & Consumer | 9 | Dynamic pricing, hotel revenue mgmt. |
| Education & Culture | 8 | Adaptive curriculum, fact-checking |
| Healthcare & Life Sciences | 7 | Emergency triage, drug interaction screening |
| Public Service & Governance | 7 | Permit processing, wildfire evacuation |
| Agriculture & Environment | ? | ? |

Environmental Fault Injection

Agent robustness is evaluated through controlled fault injection:

  • E0 (Clean): No faults. Baseline performance.
  • E1 (Explicit Faults): Randomly injects clearly visible error responses (HTTP 500, TimeoutError). Clear error signals.
  • E2 (Implicit Faults): Returns degraded responses with no error signal (truncated data, empty fields). Superficially correct.
  • E3 (Mixed): Approximately half explicit, half implicit faults.

Faults are transient, spaced across interactions, and parameterized by fault count (default 2) and fault duration (default 2 consecutive calls).
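A sketch of how such transient faults might be overlaid on a sequence of clean tool responses. Fault placement and the exact degradation are assumptions; the paper's injector also spaces faults across interactions, which this sketch does not enforce:

```python
import random

def inject_faults(responses, mode, fault_count=2, fault_duration=2, seed=0):
    """Overlay transient faults on a list of clean tool responses.

    mode="explicit" mimics E1 (overt error responses); mode="implicit"
    mimics E2 (silently degraded data). E3 would mix both modes. The
    defaults match the reported fault count (2) and duration (2 calls).
    """
    rng = random.Random(seed)
    faulty = list(responses)
    n = len(faulty)
    for _ in range(fault_count):
        start = rng.randrange(max(1, n - fault_duration + 1))
        for i in range(start, min(n, start + fault_duration)):
            if mode == "explicit":       # E1: clear error signal
                faulty[i] = {"error": "HTTP 500: internal server error"}
            elif mode == "implicit":     # E2: superficially correct response
                faulty[i] = {k: None for k in faulty[i]}
    return faulty
```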

Main Results: Cross-Industry Evaluation (E0)

Table 2: E0 completion rate (%) by industry category for all 15 models

| Model | Avg | Agri | Biz | Comm | Edu | Hlth | Ind | Pub | Sci | Tech | Trans |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.2 | 79.6 | 84 | 86 | 67 | 77 | 76 | 85 | 84 | 94 | 80 | 72 |
| Gemini 3.1 Pro | 72.3 | 68 | 73 | 75 | 84 | 62 | 73 | 72 | 81 | 78 | 60 |
| Claude Opus 4.6 | 71.5 | 74 | 78 | 53 | 75 | 76 | 73 | 68 | 62 | 68 | 77 |
| Qwen 3.5 Plus | 69.9 | 77 | 70 | 81 | 56 | 81 | 71 | 76 | 69 | 74 | 55 |
| DeepSeek V3.2 | 69.6 | 65 | 78 | 67 | 66 | 71 | 69 | 72 | 62 | 74 | 64 |
| Claude Opus 4.5 | 65.2 | 58 | 76 | 56 | 62 | 52 | 65 | 72 | 56 | 68 | 66 |
| Claude Sonnet 4.5 | 64.9 | 65 | 70 | 69 | 50 | 71 | 71 | 60 | 44 | 68 | 62 |
| Claude Sonnet 4.6 | 64.4 | 58 | 71 | 64 | 69 | 67 | 64 | 64 | 69 | 64 | 57 |
| Kimi K2.5 | 64.1 | 68 | 62 | 56 | 62 | 81 | 62 | 72 | 56 | 74 | 57 |
| GLM-5 | 62.6 | 55 | 75 | 67 | 53 | 57 | 56 | 68 | 62 | 70 | 55 |
| Claude Opus 4 | 61.3 | 52 | 75 | 50 | 53 | 57 | 58 | 76 | 81 | 66 | 51 |
| Gemini 3.1 FL | 61.3 | 68 | 70 | 58 | 53 | 67 | 58 | 68 | 62 | 68 | 45 |
| Qwen 3.5 Flash | 59.7 | 61 | 60 | 67 | 53 | 76 | 53 | 68 | 69 | 60 | 51 |
| MiniMax M2.7 | 53.9 | 48 | 60 | 56 | 31 | 57 | 60 | 60 | 62 | 64 | 40 |
| Claude Sonnet 4 | 53.4 | 35 | 63 | 61 | 38 | 57 | 51 | 76 | 31 | 60 | 47 |

Key Findings:

  • No single model dominates all industries. GPT-5.2 leads overall but trails in Commerce (67%) where Qwen 3.5 Plus leads (81%).
  • Open-source models are highly competitive. Qwen 3.5 Plus (69.9%) and DeepSeek V3.2 (69.6%) outperform most Claude variants.
  • Each model has a distinct occupational capability profile (visualized in radar chart Figure 2).

Environmental Robustness

Table 3: Environmental robustness evaluation for 9 flagship models

| Model | E0 | E1 | E2 | E3 | Rob. |
| --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Pro | 72.3 | 73.3 | 63.1 | 65.2 | 0.87 |
| MiniMax M2.7 | 53.9 | 52.9 | 47.1 | 46.9 | 0.87 |
| GPT-5.2 | 79.6 | 75.9 | 70.4 | 67.0 | 0.84 |
| GLM-5 | 62.6 | 59.4 | 52.6 | 47.4 | 0.76 |
| Claude Opus 4.6 | 71.5 | 68.1 | 53.9 | 63.9 | 0.75 |
| DeepSeek V3.2 | 69.6 | 59.9 | 56.0 | 51.6 | 0.74 |
| Qwen 3.5 Plus | 69.9 | 61.0 | 51.6 | 54.2 | 0.74 |
| Claude Sonnet 4.6 | 64.4 | 62.8 | 45.0 | 52.9 | 0.70 |
| Kimi K2.5 | 64.1 | 50.0 | 40.6 | 40.1 | 0.63 |
| Avg | 67.5 | 62.6 | 53.4 | 54.4 | 0.77 |
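The Rob. column is consistent with the ratio of the worst faulty-condition score to the clean score. This definition is an inference from the table values, not a formula quoted from the paper:

```python
def robustness(e0, e1, e2, e3):
    """Worst-case faulty score relative to clean performance.

    Inferred definition: min over the three fault conditions, divided by
    the clean (E0) score. It reproduces the per-model Rob. values in the
    robustness table but is not quoted from the paper.
    """
    return round(min(e1, e2, e3) / e0, 2)
```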

Key Findings:

  • Current agents struggle under adverse environments. Average performance drops 14.1 points from E0 (67.5%) to E2 (53.4%).
  • Implicit faults (E2) are harder than both explicit (E1) and mixed (E3) faults. Average E2 score (53.4%) is lower than E1 (62.6%) and E3 (54.4%). Implicit faults lack overt error signals and require agents to independently detect data degradation.
  • Increasing fault severity deepens the challenge. Performance declines further as fault count and duration increase (Figure 4).

Model Scaling, Generational Progress, and Reasoning Effort

  • Larger models consistently outperform smaller counterparts within families (Figure 5), e.g., an 11.0-point gap between Gemini 3.1 Pro (72.3%) and Flash-Lite (61.3%).
  • Claude Opus shows consistent generational improvement: 61.3% (v4) → 65.2% (v4.5) → 71.5% (v4.6) (Figure 6).
  • Higher reasoning effort leads to better performance. GPT-5.2 improves by 27.5 points from none (54.7%) to xhigh (82.2%) effort (Figure 7).

Simulator Quality Matters

A key question: Is a strong agent also a strong environment simulator?

Table 4: Cross-simulator evaluation (E0)

| Agent | Gemini Flash (CR %, Rk) | Qwen 3.5+ (CR %, Rk) | GPT-5.2 (CR %, Rk) |
| --- | --- | --- | --- |
| GPT-5.2 | 79.6, 1 | 74.3, 1 | 42.4, 1 |
| Gemini Pro | 72.3, 2 | 68.6, 2 | 28.3, 4 |
| Opus 4.6 | 71.5, 3 | 66.2, 3 | 33.5, 2 |
| Qwen 3.5+ | 69.9, 4 | 61.8, 6 | 28.3, 4 |
| DeepSeek | 69.6, 5 | 65.2, 4 | 29.6, 3 |
| Kimi K2.5 | 64.1, 6 | 52.4, 8 | 23.0, 8 |
| GLM-5 | 62.6, 7 | 64.1, 5 | 23.6, 7 |
| MiniMax M2.7 | 53.9, 8 | 54.7, 7 | 25.4, 6 |

Findings:

  • Strong agents are not necessarily strong simulators. GPT-5.2 ranks first as an agent but produces the worst simulation quality (all agents average only 29.3% under it).
  • A capable simulator yields reliable rankings. Pairwise ranking agreement between Gemini Flash and Qwen 3.5 Plus simulators reaches 85.7% (24/28 pairs) (Figure 8).
  • Simulator failure modes include state fabrication, entity omission, and rule invention (Figures 9-11).
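The pairwise ranking agreement metric can be sketched as follows; treating a pair as agreeing when both simulators order the two agents the same way is a plausible reading, not a definition quoted from the paper:

```python
from itertools import combinations

def pairwise_agreement(rank_a, rank_b):
    """Fraction of agent pairs ordered identically by two simulators'
    rankings (dicts mapping agent -> rank). With 8 agents there are
    C(8, 2) = 28 pairs, matching the 24/28 figure reported above.
    """
    pairs = list(combinations(rank_a, 2))
    agree = sum(
        (rank_a[x] < rank_a[y]) == (rank_b[x] < rank_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)
```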

Theoretical and Practical Implications

Industry Difficulty and Model-Industry Interaction

  • Industry Difficulty Analysis (Figure 12): Easiest industries are Business & Enterprise (avg 70.1%) and Public Service & Governance (69.4%); hardest are Transportation & Logistics (56.2%) and Education & Culture (57.6%).
  • Each model has a distinct occupational capability profile:
    • Gemini 3.1 Pro excels in knowledge-intensive domains (Education, Science).
    • Claude Opus 4.6 excels in operational domains (Transportation, Business).
    • Qwen 3.5 Plus excels in consumer-facing domains (Commerce, Healthcare).
  • Practical Implication: Organizations should select agent models based on specific industry needs, not just aggregate rankings.

Case Studies Illustrating Agent Capabilities and Failures

The paper includes detailed case studies (Figures 13-17) illustrating:

  • Proactive constraint monitoring vs. violation (Last-Mile Delivery).
  • Skipped verification failure mode (Fish Farm Water Quality Control).
  • Procedural ordering errors (Building Inspection Compliance).
  • Fault resilience differences under explicit (E1) and implicit (E2) faults.

Conclusion

OCCUBENCH is the first benchmark to systematically evaluate AI agents on real-world professional tasks across a broad spectrum of industries and domains via Language World Models. Key conclusions:

  1. Cross-industry evaluation is essential: No single model dominates, revealing unique capability profiles invisible to single-domain benchmarks.
  2. Environmental robustness is a critical gap: Agents struggle significantly, especially with implicit faults lacking error signals.
  3. Scaling benefits are consistent: Larger models, newer generations, and increased reasoning effort reliably improve performance.
  4. Simulator quality is crucial for LWM-based evaluation: While strong agents aren't necessarily good simulators, using a capable simulator yields reliable agent rankings (85.7% pairwise agreement).

Limitations:

  • LWM simulation fidelity: Evaluates decision-making process rather than handling exact real-world data values.
  • Simulator dependence: Evaluation results are tied to the specific simulator used.

Future Directions: OCCUBENCH provides a framework for a richer evaluation paradigm that considers cross-industry specialization and environmental resilience, moving beyond simple task completion metrics.