OCCUBENCH: Evaluating AI Agents on Real-World Professional Tasks via Language World Models
Summary (Overview)
- Introduces OCCUBENCH, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation.
- Evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults).
- Key Findings: (1) No single model dominates all industries, each has a distinct occupational capability profile; (2) Implicit faults are harder than explicit and mixed faults; (3) Larger models, newer generations, and higher reasoning effort consistently improve performance; (4) Strong agents are not necessarily strong environment simulators.
- Methodology: Uses a multi-agent synthesis pipeline to automatically produce evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity.
- Results: Evaluates 15 frontier models; GPT-5.2 leads overall (79.6%), but performance varies significantly by industry. Implicit faults cause the largest performance drop (average 53.4% vs. 67.5% clean).
Introduction and Theoretical Foundation
AI agents are increasingly expected to perform professional work across diverse occupational domains such as emergency patient triage, financial auditing, and customs processing. However, a fundamental evaluation gap exists: the professional domains where agents would deliver the most value are precisely the domains where no benchmarks exist. Existing benchmarks (e.g., WebArena, OSWorld, SWE-bench) are confined to domains with available public environments or APIs, creating a severe blind spot covering the vast majority of high-value professional work. Limitations include:
- The Untestable Majority: Domains like healthcare, finance, and energy are bound to enterprise systems with no public access.
- Prohibitive Scaling Cost: Adding new domains requires substantial engineering (deploying applications, integrating APIs).
- No Robustness Evaluation: Benchmarks evaluate only the "happy path," ignoring real-world environmental noise like API timeouts and incomplete data.
Our Approach: Language World Models (LWMs)
The key observation is that the environment itself can be simulated by an LLM. Given a configuration c, an LLM becomes a stateful, interactive environment. This transforms environment construction from an engineering problem into a configuration problem, extending benchmark coverage to "any domain an LLM can understand."
Methodology
Language World Model Formalization
A Language World Model (LWM) is defined as a function:

(s_{t+1}, o_{t+1}) = LWM(s_t, a_t; c)

where:
- c = (system prompt, tool schema, initial state, state description) is the environment configuration.
- s_t is the latent environment state maintained implicitly by the LLM through its context window.
- a_t is the agent's action (a tool call with name and arguments).
- o_{t+1} is the observation returned to the agent (a structured JSON tool response).
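This interaction contract can be sketched in Python. The class and field names below are illustrative, and `respond` stands in for the backing LLM call (a stub here; the paper's actual simulator prompt and API are not reproduced):

```python
from dataclasses import dataclass

@dataclass
class LWMConfig:
    """The four-part environment configuration c."""
    system_prompt: str      # behavioral rules and output constraints
    tool_schema: dict       # action space: tool name -> typed parameters
    initial_state: dict     # structured JSON starting conditions
    state_description: str  # annotations guiding causal consistency

class LanguageWorldModel:
    """Stateful environment: (s_{t+1}, o_{t+1}) = LWM(s_t, a_t; c).
    The state s_t lives implicitly in the accumulated context."""

    def __init__(self, config: LWMConfig, respond):
        self.config = config
        self.respond = respond  # any callable(context) -> dict; LLM in practice
        self.context = [("system", config.system_prompt),
                        ("state", config.initial_state)]

    def step(self, action: dict) -> dict:
        """Apply agent action a_t (a named tool call), return o_{t+1}."""
        if action["name"] not in self.config.tool_schema:
            return {"error": f"unknown tool: {action['name']}"}
        self.context.append(("tool_call", action))
        observation = self.respond(self.context)  # structured JSON response
        self.context.append(("observation", observation))
        return observation
```

Note that the simulator never materializes s_t explicitly: the growing `context` list is the state, which is exactly what makes state maintenance a failure mode worth auditing.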
Why LLMs Can Serve as World Models:
- Format Priors: Pre-training on API documentation provides priors for generating well-formatted tool responses.
- Domain Knowledge: LLMs encode operational logic for hundreds of professional domains.
- State Maintenance: System prompt constraints and in-context tracking enable coherent multi-turn simulation.
- Edge Case Handling: LLMs handle unexpected inputs more gracefully than rule-based simulators.
Environment Configuration
Each LWM environment is fully specified by four components:
- System Prompt: Defines behavioral rules, simulation logic, error handling, and output format constraints.
- Tool Schema: Defines the agent's action space as a set of callable functions with typed parameters (median 5 tools per environment).
- Initial State: A structured JSON object specifying starting conditions.
- State Description: Semantic annotations guiding the LLM to maintain causal consistency.
Multi-Agent Synthesis Pipeline
The pipeline automatically generates evaluation instances ensuring:
- Solvability: a valid solution exists and is verified.
- Verifiability: clear, automated success criteria.
- Discriminability: calibrated difficulty that distinguishes agent capabilities.
- Diversity: structural variation across instances (grounded by 16 non-overlapping sub-topics per scenario with professional reference documents).
The pipeline uses Gemini-3-Flash-Preview as the World Model to generate configurations, tasks, solutions, and rubrics. Tasks are executed multiple times to verify solvability and calibrate difficulty. A majority-vote verifier assesses trajectories, and a repair module fixes failures.
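The verification stages can be sketched as follows. The run count, the difficulty band, and the verdict labels are illustrative assumptions; the paper's exact thresholds are not stated here:

```python
from collections import Counter

def majority_vote(verdicts):
    """Aggregate independent verifier judgments by simple majority,
    as in the pipeline's majority-vote trajectory assessment."""
    return Counter(verdicts).most_common(1)[0][0]

def calibrate(run_task, n_runs=5, min_rate=0.2, max_rate=0.8):
    """Execute a candidate task several times with a reference agent.
    Keep it only if it is solvable but not trivial. The band
    (0.2-0.8) and n_runs=5 are illustrative, not the paper's values."""
    rate = sum(bool(run_task()) for _ in range(n_runs)) / n_runs
    return min_rate <= rate <= max_rate, rate
```

Tasks failing the band would be routed to the repair module rather than discarded outright.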
Evaluation Loop
The interaction between the agent and LWM follows the loop shown in Figure 1 (not reproduced here). The agent issues tool calls, the LWM generates observations, and the trajectory is scored by a rubric-based verifier.
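A minimal sketch of this loop, assuming a turn limit and a "return None to finish" convention (both assumptions, not details from the paper):

```python
def run_episode(agent, lwm_step, score, max_turns=20):
    """Agent/LWM interaction: the agent issues tool calls, the LWM
    returns observations, and a rubric-based verifier scores the
    resulting trajectory."""
    trajectory = []
    observation = None  # nothing observed before the first action
    for _ in range(max_turns):
        action = agent(observation, trajectory)
        if action is None:            # agent declares the task finished
            break
        observation = lwm_step(action)
        trajectory.append((action, observation))
    return score(trajectory)
```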
Empirical Validation / Results
Benchmark Scale and Coverage
OCCUBENCH covers 100 scenarios across 10 industry categories (Table 1) and 65 specialized domains, resulting in 382 solvable task instances.
Table 1: Industry categories and representative scenarios in OCCUBENCH
| Category | # Scenarios | Representative Scenarios |
|---|---|---|
| Business & Enterprise | 19 | Resume screening, expense auditing, AML review |
| Technology & IT | 16 | Linux ops, CI/CD recovery, intrusion response |
| Industrial & Engineering | 12 | Production scheduling, mine ventilation |
| Transportation & Logistics | 11 | Last-mile delivery, train dispatch |
| Commerce & Consumer | 9 | Dynamic pricing, hotel revenue mgmt. |
| Education & Culture | 8 | Adaptive curriculum, fact-checking |
| Healthcare & Life Sciences | 7 | Emergency triage, drug interaction screening |
| Public Service & Governance | 7 | Permit processing, wildfire evacuation |
| Agriculture & Environment | ? | ? |
Environmental Fault Injection
Agent robustness is evaluated through controlled fault injection across four settings:
- E0 (Clean): No faults. Baseline performance.
- E1 (Explicit Faults): Randomly injects clearly visible error responses (HTTP 500, TimeoutError). Clear error signals.
- E2 (Implicit Faults): Returns degraded responses with no error signal (truncated data, empty fields). Superficially correct.
- E3 (Mixed): Approximately half explicit, half implicit faults.
Faults are transient, spaced across interactions, and parameterized by fault count (default 2) and fault duration (default 2 consecutive calls).
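A sketch of this injection scheme, parameterized as described. The placement policy here (random sampling of start positions) is a simplification of the paper's spacing across interactions, and the payloads are illustrative:

```python
import random

def inject_faults(clean, mode="E2", fault_count=2, fault_duration=2, seed=0):
    """Overlay transient faults on a clean tool-response sequence.
    E1 injects explicit error payloads; E2 silently drops fields
    (superficially correct); E3 alternates between the two."""
    rng = random.Random(seed)
    out = list(clean)
    starts = rng.sample(range(len(out) - fault_duration + 1),
                        k=min(fault_count, len(out)))
    for i, start in enumerate(starts):
        explicit = mode == "E1" or (mode == "E3" and i % 2 == 0)
        for t in range(start, start + fault_duration):
            if explicit:
                out[t] = {"error": "HTTP 500: internal server error"}
            else:
                degraded = dict(clean[t])
                if degraded:  # drop one field with no error signal
                    degraded.pop(next(iter(degraded)))
                out[t] = degraded
    return out
```

The E1/E2 asymmetry is visible in the interface: an E1 response is self-announcing, while an E2 response can only be caught by an agent that checks the data against its own expectations.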
Main Results: Cross-Industry Evaluation (E0)
Table 2: E0 completion rate (%) by industry category for all 15 models
| Model | Avg | Agri | Biz | Comm | Edu | Hlth | Ind | Pub | Sci | Tech | Trans |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.2 | 79.6 | 84 | 86 | 67 | 77 | 76 | 85 | 84 | 94 | 80 | 72 |
| Gemini 3.1 Pro | 72.3 | 68 | 73 | 75 | 84 | 62 | 73 | 72 | 81 | 78 | 60 |
| Claude Opus 4.6 | 71.5 | 74 | 78 | 53 | 75 | 76 | 73 | 68 | 62 | 68 | 77 |
| Qwen 3.5 Plus | 69.9 | 77 | 70 | 81 | 56 | 81 | 71 | 76 | 69 | 74 | 55 |
| DeepSeek V3.2 | 69.6 | 65 | 78 | 67 | 66 | 71 | 69 | 72 | 62 | 74 | 64 |
| Claude Opus 4.5 | 65.2 | 58 | 76 | 56 | 62 | 52 | 65 | 72 | 56 | 68 | 66 |
| Claude Sonnet 4.5 | 64.9 | 65 | 70 | 69 | 50 | 71 | 71 | 60 | 44 | 68 | 62 |
| Claude Sonnet 4.6 | 64.4 | 58 | 71 | 64 | 69 | 67 | 64 | 64 | 69 | 64 | 57 |
| Kimi K2.5 | 64.1 | 68 | 62 | 56 | 62 | 81 | 62 | 72 | 56 | 74 | 57 |
| GLM-5 | 62.6 | 55 | 75 | 67 | 53 | 57 | 56 | 68 | 62 | 70 | 55 |
| Claude Opus 4 | 61.3 | 52 | 75 | 50 | 53 | 57 | 58 | 76 | 81 | 66 | 51 |
| Gemini 3.1 FL | 61.3 | 68 | 70 | 58 | 53 | 67 | 58 | 68 | 62 | 68 | 45 |
| Qwen 3.5 Flash | 59.7 | 61 | 60 | 67 | 53 | 76 | 53 | 68 | 69 | 60 | 51 |
| MiniMax M2.7 | 53.9 | 48 | 60 | 56 | 31 | 57 | 60 | 60 | 62 | 64 | 40 |
| Claude Sonnet 4 | 53.4 | 35 | 63 | 61 | 38 | 57 | 51 | 76 | 31 | 60 | 47 |
Key Findings:
- No single model dominates all industries. GPT-5.2 leads overall but trails in Commerce (67%) where Qwen 3.5 Plus leads (81%).
- Open-source models are highly competitive. Qwen 3.5 Plus (69.9%) and DeepSeek V3.2 (69.6%) outperform most Claude variants.
- Each model has a distinct occupational capability profile (visualized in radar chart Figure 2).
Environmental Robustness
Table 3: Environmental robustness evaluation for 9 flagship models
| Model | E0 | E1 | E2 | E3 | Rob. |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 72.3 | 73.3 | 63.1 | 65.2 | 0.87 |
| MiniMax M2.7 | 53.9 | 52.9 | 47.1 | 46.9 | 0.87 |
| GPT-5.2 | 79.6 | 75.9 | 70.4 | 67.0 | 0.84 |
| GLM-5 | 62.6 | 59.4 | 52.6 | 47.4 | 0.76 |
| Claude Opus 4.6 | 71.5 | 68.1 | 53.9 | 63.9 | 0.75 |
| DeepSeek V3.2 | 69.6 | 59.9 | 56.0 | 51.6 | 0.74 |
| Qwen 3.5 Plus | 69.9 | 61.0 | 51.6 | 54.2 | 0.74 |
| Claude Sonnet 4.6 | 64.4 | 62.8 | 45.0 | 52.9 | 0.70 |
| Kimi K2.5 | 64.1 | 50.0 | 40.6 | 40.1 | 0.63 |
| Avg | 67.5 | 62.6 | 53.4 | 54.4 | 0.77 |
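The table does not define the Rob. column, but every reported value is consistent with worst-case retention: the minimum score across the three fault settings, relative to the clean baseline. This reading is an inference from the numbers, not the paper's stated formula:

```python
def robustness(e0, e1, e2, e3):
    """Worst-case retention: score under the most damaging fault
    setting divided by the clean baseline E0. Formula inferred from
    the reported values; the paper's exact definition may differ."""
    return min(e1, e2, e3) / e0
```

For example, Gemini 3.1 Pro gives min(73.3, 63.1, 65.2) / 72.3 ≈ 0.87, matching the table.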
Key Findings:
- Current agents struggle under adverse environments. Average performance drops 14.1 points from E0 (67.5%) to E2 (53.4%).
- Implicit faults (E2) are harder than both explicit (E1) and mixed (E3) faults. Average E2 score (53.4%) is lower than E1 (62.6%) and E3 (54.4%). Implicit faults lack overt error signals and require agents to independently detect data degradation.
- Increasing fault severity deepens the challenge. Performance declines further as fault count and duration increase (Figure 4).
Model Scaling, Generational Progress, and Reasoning Effort
- Larger models consistently outperform smaller counterparts within families (Figure 5), e.g., Gemini Pro vs. Flash-Lite gap: 11.0%.
- Claude Opus shows consistent generational improvement: 61.3% (v4) → 65.2% (v4.5) → 71.5% (v4.6) (Figure 6).
- Higher reasoning effort leads to better performance. GPT-5.2 improves by 27.5 points from `none` (54.7%) to `xhigh` (82.2%) effort (Figure 7).
Simulator Quality Matters
A key question: Is a strong agent also a strong environment simulator?
Table 4: Cross-simulator evaluation (E0)
| Agent | Gemini Flash (CR %, Rk) | Qwen 3.5+ (CR %, Rk) | GPT-5.2 (CR %, Rk) |
|---|---|---|---|
| GPT-5.2 | 79.6, 1 | 74.3, 1 | 42.4, 1 |
| Gemini Pro | 72.3, 2 | 68.6, 2 | 28.3, 4 |
| Opus 4.6 | 71.5, 3 | 66.2, 3 | 33.5, 2 |
| Qwen 3.5+ | 69.9, 4 | 61.8, 6 | 28.3, 4 |
| DeepSeek | 69.6, 5 | 65.2, 4 | 29.6, 3 |
| Kimi K2.5 | 64.1, 6 | 52.4, 8 | 23.0, 8 |
| GLM-5 | 62.6, 7 | 64.1, 5 | 23.6, 7 |
| MiniMax M2.7 | 53.9, 8 | 54.7, 7 | 25.4, 6 |
Findings:
- Strong agents are not necessarily strong simulators. GPT-5.2 ranks first as an agent but produces the worst simulation quality (all agents average only 29.3% under it).
- A capable simulator yields reliable rankings. Pairwise ranking agreement between Gemini Flash and Qwen 3.5 Plus simulators reaches 85.7% (24/28 pairs) (Figure 8).
- Simulator failure modes include state fabrication, entity omission, and rule invention (Figures 9-11).
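The 85.7% figure is agreement over C(8,2) = 28 agent pairs, and it can be reproduced from the Gemini Flash and Qwen 3.5 Plus columns of Table 4 (short dictionary keys below are abbreviations of the model names):

```python
from itertools import combinations

def pairwise_agreement(scores_a, scores_b):
    """Fraction of agent pairs that two simulators order the same way."""
    pairs = list(combinations(scores_a, 2))
    same = sum((scores_a[x] > scores_a[y]) == (scores_b[x] > scores_b[y])
               for x, y in pairs)
    return same / len(pairs)

# E0 completion rates from Table 4 under each simulator
flash = {"gpt": 79.6, "gem": 72.3, "opus": 71.5, "qwen": 69.9,
         "ds": 69.6, "kimi": 64.1, "glm": 62.6, "mm": 53.9}
qwen_sim = {"gpt": 74.3, "gem": 68.6, "opus": 66.2, "qwen": 61.8,
            "ds": 65.2, "kimi": 52.4, "glm": 64.1, "mm": 54.7}
```

Calling `pairwise_agreement(flash, qwen_sim)` yields 24/28 ≈ 85.7%, matching the reported agreement.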
Theoretical and Practical Implications
Industry Difficulty and Model-Industry Interaction
- Industry Difficulty Analysis (Figure 12): Easiest industries are Business & Enterprise (avg 70.1%) and Public Service & Governance (69.4%); hardest are Transportation & Logistics (56.2%) and Education & Culture (57.6%).
- Each model has a distinct occupational capability profile:
- Gemini 3.1 Pro excels in knowledge-intensive domains (Education, Science).
- Claude Opus 4.6 excels in operational domains (Transportation, Business).
- Qwen 3.5 Plus excels in consumer-facing domains (Commerce, Healthcare).
- Practical Implication: Organizations should select agent models based on specific industry needs, not just aggregate rankings.
Case Studies Illustrating Agent Capabilities and Failures
The paper includes detailed case studies (Figures 13-17) illustrating:
- Proactive constraint monitoring vs. violation (Last-Mile Delivery).
- Skipped verification failure mode (Fish Farm Water Quality Control).
- Procedural ordering errors (Building Inspection Compliance).
- Fault resilience differences under explicit (E1) and implicit (E2) faults.
Conclusion
OCCUBENCH is the first benchmark to systematically evaluate AI agents on real-world professional tasks across a broad spectrum of industries and domains via Language World Models. Key conclusions:
- Cross-industry evaluation is essential: No single model dominates, revealing unique capability profiles invisible to single-domain benchmarks.
- Environmental robustness is a critical gap: Agents struggle significantly, especially with implicit faults lacking error signals.
- Scaling benefits are consistent: Larger models, newer generations, and increased reasoning effort reliably improve performance.
- Simulator quality is crucial for LWM-based evaluation: While strong agents aren't necessarily good simulators, using a capable simulator yields reliable agent rankings (85.7% pairwise agreement).
Limitations:
- LWM simulation fidelity: the benchmark evaluates the decision-making process rather than the handling of exact real-world data values.
- Simulator dependence: Evaluation results are tied to the specific simulator used.
Future Directions: OCCUBENCH provides a framework for a richer evaluation paradigm that considers cross-industry specialization and environmental resilience, moving beyond simple task completion metrics.