OCCUBENCH: Evaluating AI Agents on Real-World Professional Tasks via Language World Models
Summary (Overview)
- Introduces OCCUBENCH, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language World Models (LWMs) that simulate domain-specific environments through LLM-driven tool response generation.
- Evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults).
- Key Findings: (1) No single model dominates all industries, each has a distinct occupational capability profile; (2) Implicit faults are harder than explicit and mixed faults; (3) Larger models, newer generations, and higher reasoning effort consistently improve performance; (4) Strong agents are not necessarily strong environment simulators.
- Methodology: Uses a multi-agent synthesis pipeline to automatically produce evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity.
- Results: Evaluates 15 frontier models; GPT-5.2 leads overall (79.6%), but performance varies significantly by industry. Implicit faults cause the largest performance drop (average 53.4% vs. 67.5% clean).
Introduction and Theoretical Foundation
AI agents are increasingly expected to perform professional work across diverse occupational domains such as emergency patient triage, financial auditing, and customs processing. However, a fundamental evaluation gap exists: the professional domains where agents would deliver the most value are precisely the domains where no benchmarks exist. Existing benchmarks (e.g., WebArena, OSWorld, SWE-bench) are confined to domains with available public environments or APIs, creating a severe blind spot covering the vast majority of high-value professional work. Limitations include:
- The Untestable Majority: Domains like healthcare, finance, and energy are bound to enterprise systems with no public access.
- Prohibitive Scaling Cost: Adding new domains requires substantial engineering (deploying applications, integrating APIs).
- No Robustness Evaluation: Benchmarks evaluate only the "happy path," ignoring real-world environmental noise like API timeouts and incomplete data.
Our Approach: Language World Models (LWMs)
The key observation is that the environment itself can be simulated by an LLM. Given a configuration c, an LLM becomes a stateful, interactive environment. This transforms environment construction from an engineering problem into a configuration problem, extending benchmark coverage to "any domain an LLM can understand."
Methodology
Language World Model Formalization
A Language World Model (LWM) is defined as a function:

(s_{t+1}, o_{t+1}) = LWM(s_t, a_t; c)

where:
- c = (system prompt, tool schema, initial state, state description) is the environment configuration.
- s_t is the latent environment state maintained implicitly by the LLM through its context window.
- a_t is the agent's action (a tool call with name and arguments).
- o_{t+1} is the observation returned to the agent (a structured JSON tool response).
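This interaction contract can be sketched in Python. The class and field names below are illustrative, and `respond` stands in for the backing LLM call (a stub here; the paper's actual simulator prompt and API are not reproduced):

```python
from dataclasses import dataclass

@dataclass
class LWMConfig:
    """The four-part environment configuration c."""
    system_prompt: str      # behavioral rules and output constraints
    tool_schema: dict       # action space: tool name -> typed parameters
    initial_state: dict     # structured JSON starting conditions
    state_description: str  # annotations guiding causal consistency

class LanguageWorldModel:
    """Stateful environment: (s_{t+1}, o_{t+1}) = LWM(s_t, a_t; c).
    The state s_t lives implicitly in the accumulated context."""

    def __init__(self, config: LWMConfig, respond):
        self.config = config
        self.respond = respond  # any callable(context) -> dict; LLM in practice
        self.context = [("system", config.system_prompt),
                        ("state", config.initial_state)]

    def step(self, action: dict) -> dict:
        """Apply agent action a_t (a named tool call), return o_{t+1}."""
        if action["name"] not in self.config.tool_schema:
            return {"error": f"unknown tool: {action['name']}"}
        self.context.append(("tool_call", action))
        observation = self.respond(self.context)  # structured JSON response
        self.context.append(("observation", observation))
        return observation
```

Note that the simulator never materializes s_t explicitly: the growing `context` list is the state, which is exactly what makes state maintenance a failure mode worth auditing.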
Why LLMs Can Serve as World Models:
- Format Priors: Pre-training on API documentation provides priors for generating well-formatted tool responses.
- Domain Knowledge: LLMs encode operational logic for hundreds of professional domains.
- State Maintenance: System prompt constraints and in-context tracking enable coherent multi-turn simulation.
- Edge Case Handling: LLMs handle unexpected inputs more gracefully than rule-based simulators.
Environment Configuration
Each LWM environment is fully specified by four components:
- System Prompt: Defines behavioral rules, simulation logic, error handling, and output format constraints.
- Tool Schema: Defines the agent's action space as a set of callable functions with typed parameters (median 5 tools per environment).
- Initial State: A structured JSON object specifying starting conditions.
- State Description: Semantic annotations guiding the LLM to maintain causal consistency.
Multi-Agent Synthesis Pipeline
The pipeline automatically generates evaluation instances ensuring:
- Solvability: a valid solution exists and is verified.
- Verifiability: clear, automated success criteria.
- Discriminability: calibrated difficulty that distinguishes agent capabilities.
- Diversity: structural variation across instances (grounded by 16 non-overlapping sub-topics per scenario with professional reference documents).
The pipeline uses Gemini-3-Flash-Preview as the World Model to generate configurations, tasks, solutions, and rubrics. Tasks are executed multiple times to verify solvability and calibrate difficulty. A majority-vote verifier assesses trajectories, and a repair module fixes failures.
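The verification stages can be sketched as follows. The run count, the difficulty band, and the verdict labels are illustrative assumptions; the paper's exact thresholds are not stated here:

```python
from collections import Counter

def majority_vote(verdicts):
    """Aggregate independent verifier judgments by simple majority,
    as in the pipeline's majority-vote trajectory assessment."""
    return Counter(verdicts).most_common(1)[0][0]

def calibrate(run_task, n_runs=5, min_rate=0.2, max_rate=0.8):
    """Execute a candidate task several times with a reference agent.
    Keep it only if it is solvable but not trivial. The band
    (0.2-0.8) and n_runs=5 are illustrative, not the paper's values."""
    rate = sum(bool(run_task()) for _ in range(n_runs)) / n_runs
    return min_rate <= rate <= max_rate, rate
```

Tasks failing the band would be routed to the repair module rather than discarded outright.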
Evaluation Loop
The interaction between the agent and LWM follows the loop shown in Figure 1 (not reproduced here). The agent issues tool calls, the LWM generates observations, and the trajectory is scored by a rubric-based verifier.
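A minimal sketch of this loop, assuming a turn limit and a "return None to finish" convention (both assumptions, not details from the paper):

```python
def run_episode(agent, lwm_step, score, max_turns=20):
    """Agent/LWM interaction: the agent issues tool calls, the LWM
    returns observations, and a rubric-based verifier scores the
    resulting trajectory."""
    trajectory = []
    observation = None  # nothing observed before the first action
    for _ in range(max_turns):
        action = agent(observation, trajectory)
        if action is None:            # agent declares the task finished
            break
        observation = lwm_step(action)
        trajectory.append((action, observation))
    return score(trajectory)
```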
Empirical Validation / Results
Benchmark Scale and Coverage
OCCUBENCH covers 100 scenarios across 10 industry categories (Table 1) and 65 specialized domains, resulting in 382 solvable task instances.
Table 1: Industry categories and representative scenarios in OCCUBENCH
| Category | # Scenarios | Representative Scenarios |
|---|---|---|
| Business & Enterprise | 19 | Resume screening, expense auditing, AML review |
| Technology & IT | 16 | Linux ops, CI/CD recovery, intrusion response |
| Industrial & Engineering | 12 | Production scheduling, mine ventilation |
| Transportation & Logistics | 11 | Last-mile delivery, train dispatch |
| Commerce & Consumer | 9 | Dynamic pricing, hotel revenue mgmt. |
| Education & Culture | 8 | Adaptive curriculum, fact-checking |
| Healthcare & Life Sciences | 7 | Emergency triage, drug interaction screening |
| Public Service & Governance | 7 | Permit processing, wildfire evacuation |
| Agriculture & Environment | ? | ? |
Environmental Fault Injection
Agent robustness is evaluated through controlled fault injection across four settings:
- E0 (Clean): No faults. Baseline performance.
- E1 (Explicit Faults): Randomly injects clearly visible error responses (HTTP 500, TimeoutError). Clear error signals.
- E2 (Implicit Faults): Returns degraded responses with no error signal (truncated data, empty fields). Superficially correct.
- E3 (Mixed): Approximately half explicit, half implicit faults.
Faults are transient, spaced across interactions, and parameterized by fault count (default 2) and fault duration (default 2 consecutive calls).
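A sketch of this injection scheme, parameterized as described. The placement policy here (random sampling of start positions) is a simplification of the paper's spacing across interactions, and the payloads are illustrative:

```python
import random

def inject_faults(clean, mode="E2", fault_count=2, fault_duration=2, seed=0):
    """Overlay transient faults on a clean tool-response sequence.
    E1 injects explicit error payloads; E2 silently drops fields
    (superficially correct); E3 alternates between the two."""
    rng = random.Random(seed)
    out = list(clean)
    starts = rng.sample(range(len(out) - fault_duration + 1),
                        k=min(fault_count, len(out)))
    for i, start in enumerate(starts):
        explicit = mode == "E1" or (mode == "E3" and i % 2 == 0)
        for t in range(start, start + fault_duration):
            if explicit:
                out[t] = {"error": "HTTP 500: internal server error"}
            else:
                degraded = dict(clean[t])
                if degraded:  # drop one field with no error signal
                    degraded.pop(next(iter(degraded)))
                out[t] = degraded
    return out
```

The E1/E2 asymmetry is visible in the interface: an E1 response is self-announcing, while an E2 response can only be caught by an agent that checks the data against its own expectations.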
Main Results: Cross-Industry Evaluation (E0)
Table 2: E0 completion rate (%) by industry category for all 15 models
| Model | Avg | Agri | Biz | Comm | Edu | Hlth | Ind | Pub | Sci | Tech | Trans |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.2 | 79.6 | 84 | 86 | 67 | 77 | 76 | 85 | 84 | 94 | 80 | 72 |
| Gemini 3.1 Pro | 72.3 | 68 | 73 | 75 | 84 | 62 | 73 | 72 | 81 | 78 | 60 |
| Claude Opus 4.6 | 71.5 | 74 | 78 | 53 | 75 | 76 | 73 | 68 | 62 | 68 | 77 |
| Qwen 3.5 Plus | 69.9 | 77 | 70 | 81 | 56 | 81 | 71 | 76 | 69 | 74 | 55 |
| DeepSeek V3.2 | 69.6 | 65 | 78 | 67 | 66 | 71 | 69 | 72 | 62 | 74 | 64 |
| Claude Opus 4.5 | 65.2 | 58 | 76 | 56 | 62 | 52 | 65 | 72 | 56 | 68 | 66 |
| Claude Sonnet 4.5 | 64.9 | 65 | 70 | 69 | 50 | 71 | 71 | 60 | 44 | 68 | 62 |
| Claude Sonnet 4.6 | 64.4 | 58 | 71 | 64 | 69 | 67 | 64 | 64 | 69 | 64 | 57 |
| Kimi K2.5 | 64.1 | 68 | 62 | 56 | 62 | 81 | 62 | 72 | 56 | 74 | 57 |
| GLM-5 | 62.6 | 55 | 75 | 67 | 53 | 57 | 56 | 68 | 62 | 70 | 55 |
| Claude Opus 4 | 61.3 | 52 | 75 | 50 | 53 | 57 | 58 | 76 | 81 | 66 | 51 |
| Gemini 3.1 FL | 61.3 | 68 | 70 | 58 | 53 | 67 | 58 | 68 | 62 | 68 | 45 |
| Qwen 3.5 Flash | 59.7 | 61 | 60 | 67 | 53 | 76 | 53 | 68 | 69 | 60 | 51 |
| MiniMax M2.7 | 53.9 | 48 | 60 | 56 | 31 | 57 | 60 | 60 | 62 | 64 | 40 |
| Claude Sonnet 4 | 53.4 | 35 | 63 | 61 | 38 | 57 | 51 | 76 | 31 | 60 | 47 |
Key Findings:
- No single model dominates all industries. GPT-5.2 leads overall but trails in Commerce (67%) where Qwen 3.5 Plus leads (81%).
- Open-source models are highly competitive. Qwen 3.5 Plus (69.9%) and DeepSeek V3.2 (69.6%) outperform most Claude variants.
- Each model has a distinct occupational capability profile (visualized in radar chart Figure 2).
Environmental Robustness
Table 3: Environmental robustness evaluation for 9 flagship models
| Model | E0 | E1 | E2 | E3 | Rob. |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 72.3 | 73.3 | 63.1 | 65.2 | 0.87 |
| MiniMax M2.7 | 53.9 | 52.9 | 47.1 | 46.9 | 0.87 |
| GPT-5.2 | 79.6 | 75.9 | 70.4 | 67.0 | 0.84 |
| GLM-5 | 62.6 | 59.4 | 52.6 | 47.4 | 0.76 |
| Claude Opus 4.6 | 71.5 | 68.1 | 53.9 | 63.9 | 0.75 |
| DeepSeek V3.2 | 69.6 | 59.9 | 56.0 | 51.6 | 0.74 |
| Qwen 3.5 Plus | 69.9 | 61.0 | 51.6 | 54.2 | 0.74 |
| Claude Sonnet 4.6 | 64.4 | 62.8 | 45.0 | 52.9 | 0.70 |
| Kimi K2.5 | 64.1 | 50.0 | 40.6 | 40.1 | 0.63 |
| Avg | 67.5 | 62.6 | 53.4 | 54.4 | 0.77 |
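The table does not define the Rob. column, but every reported value is consistent with worst-case retention: the minimum score across the three fault settings, relative to the clean baseline. This reading is an inference from the numbers, not the paper's stated formula:

```python
def robustness(e0, e1, e2, e3):
    """Worst-case retention: score under the most damaging fault
    setting divided by the clean baseline E0. Formula inferred from
    the reported values; the paper's exact definition may differ."""
    return min(e1, e2, e3) / e0
```

For example, Gemini 3.1 Pro gives min(73.3, 63.1, 65.2) / 72.3 ≈ 0.87, matching the table.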
Key Findings:
- Current agents struggle under adverse environments. Average performance drops 14.1 points from E0 (67.5%) to E2 (53.4%).
- Implicit faults (E2) are harder than both explicit (E1) and mixed (E3) faults. Average E2 score (53.4%) is lower than E1 (62.6%) and E3 (54.4%). Implicit faults lack overt error signals and require agents to independently detect data degradation.
- Increasing fault severity deepens the challenge. Performance declines further as fault count and duration increase (Figure 4).
Model Scaling, Generational Progress, and Reasoning Effort
- Larger models consistently outperform smaller counterparts within families (Figure 5), e.g., Gemini Pro vs. Flash-Lite gap: 11.0%.
- Claude Opus shows consistent generational improvement: 61.3% (v4) → 65.2% (v4.5) → 71.5% (v4.6) (Figure 6).
- Higher reasoning effort leads to better performance. GPT-5.2 improves by 27.5 points from `none` (54.7%) to `xhigh` (82.2%) effort (Figure 7).
Simulator Quality Matters
A key question: Is a strong agent also a strong environment simulator?
Table 4: Cross-simulator evaluation (E0)
| Agent | Gemini Flash (CR %, Rk) | Qwen 3.5+ (CR %, Rk) | GPT-5.2 (CR %, Rk) |
|---|---|---|---|
| GPT-5.2 | 79.6, 1 | 74.3, 1 | 42.4, 1 |
| Gemini Pro | 72.3, 2 | 68.6, 2 | 28.3, 4 |
| Opus 4.6 | 71.5, 3 | 66.2, 3 | 33.5, 2 |
| Qwen 3.5+ | 69.9, 4 | 61.8, 6 | 28.3, 4 |
| DeepSeek | 69.6, 5 | 65.2, 4 | 29.6, 3 |
| Kimi K2.5 | 64.1, 6 | 52.4, 8 | 23.0, 8 |
| GLM-5 | 62.6, 7 | 64.1, 5 | 23.6, 7 |
| MiniMax M2.7 | 53.9, 8 | 54.7, 7 | 25.4, 6 |
Findings:
- Strong agents are not necessarily strong simulators. GPT-5.2 ranks first as an agent but produces the worst simulation quality (all agents average only 29.3% under it).
- A capable simulator yields reliable rankings. Pairwise ranking agreement between Gemini Flash and Qwen 3.5 Plus simulators reaches 85.7% (24/28 pairs) (Figure 8).
- Simulator failure modes include state fabrication, entity omission, and rule invention (Figures 9-11).
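The 85.7% figure is agreement over C(8,2) = 28 agent pairs, and it can be reproduced from the Gemini Flash and Qwen 3.5 Plus columns of Table 4 (short dictionary keys below are abbreviations of the model names):

```python
from itertools import combinations

def pairwise_agreement(scores_a, scores_b):
    """Fraction of agent pairs that two simulators order the same way."""
    pairs = list(combinations(scores_a, 2))
    same = sum((scores_a[x] > scores_a[y]) == (scores_b[x] > scores_b[y])
               for x, y in pairs)
    return same / len(pairs)

# E0 completion rates from Table 4 under each simulator
flash = {"gpt": 79.6, "gem": 72.3, "opus": 71.5, "qwen": 69.9,
         "ds": 69.6, "kimi": 64.1, "glm": 62.6, "mm": 53.9}
qwen_sim = {"gpt": 74.3, "gem": 68.6, "opus": 66.2, "qwen": 61.8,
            "ds": 65.2, "kimi": 52.4, "glm": 64.1, "mm": 54.7}
```

Calling `pairwise_agreement(flash, qwen_sim)` yields 24/28 ≈ 85.7%, matching the reported agreement.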
Theoretical and Practical Implications
Industry Difficulty and Model-Industry Interaction
- Industry Difficulty Analysis (Figure 12): Easiest industries are Business & Enterprise (avg 70.1%) and Public Service & Governance (69.4%); hardest are Transportation & Logistics (56.2%) and Education & Culture (57.6%).
- Each model has a distinct occupational capability profile:
- Gemini 3.1 Pro excels in knowledge-intensive domains (Education, Science).
- Claude Opus 4.6 excels in operational domains (Transportation, Business).
- Qwen 3.5 Plus excels in consumer-facing domains (Commerce, Healthcare).
- Practical Implication: Organizations should select agent models based on specific industry needs, not just aggregate rankings.
Case Studies Illustrating Agent Capabilities and Failures
The paper includes detailed case studies (Figures 13-17) illustrating:
- Proactive constraint monitoring vs. violation (Last-Mile Delivery).
- Skipped verification failure mode (Fish Farm Water Quality Control).
- Procedural ordering errors (Building Inspection Compliance).
- Fault resilience differences under explicit (E1) and implicit (E2) faults.
Conclusion
OCCUBENCH is the first benchmark to systematically evaluate AI agents on real-world professional tasks across a broad spectrum of industries and domains via Language World Models. Key conclusions:
- Cross-industry evaluation is essential: No single model dominates, revealing unique capability profiles invisible to single-domain benchmarks.
- Environmental robustness is a critical gap: Agents struggle significantly, especially with implicit faults lacking error signals.
- Scaling benefits are consistent: Larger models, newer generations, and increased reasoning effort reliably improve performance.
- Simulator quality is crucial for LWM-based evaluation: While strong agents aren't necessarily good simulators, using a capable simulator yields reliable agent rankings (85.7% pairwise agreement).
Limitations:
- LWM simulation fidelity: the benchmark evaluates the decision-making process rather than the handling of exact real-world data values.
- Simulator dependence: Evaluation results are tied to the specific simulator used.
Future Directions: OCCUBENCH provides a framework for a richer evaluation paradigm that considers cross-industry specialization and environmental resilience, moving beyond simple task completion metrics.