χ-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

Summary (Overview)

Introduces χ-Bench, a novel benchmark designed to stress-test AI agents on end-to-end, long-horizon healthcare workflows across three domains: Provider Prior Authorization (PA), Payer Utilization Management (UM), and Population Health Care Management (CM).
Highlights three critical, underexplored challenges: Policy Density (grounding decisions in a large library of medical/insurance rules), Multi-Role Composition (handling irreversible handoffs between different roles), and Multilateral Interaction (conducting multi-turn conversations like peer-to-peer reviews).
Features a high-fidelity simulation environment (χ-World Engine) with 20 healthcare apps, 87 MCP tools, and a 1,279-document Managed-Care Operations Handbook Skill to guide agents.
Empirical results show the task is far from solved: the best agent configuration (Claude Code + Claude Opus 4.6) achieves only 28.0% pass@1. Performance collapses under the stricter pass^3 reliability metric, where every configuration stays below 20%, and in "marathon" sessions where agents handle multiple tasks consecutively (3.8% pass@1).
Failure analysis reveals key bottlenecks: The majority of failures stem from Clinical-Reasoning errors (35.4%), incomplete workflows (23.3%), and Policy-Compliance issues (13.2%), indicating significant gaps in agents' ability to handle realistic, policy-grounded enterprise work.

Introduction and Theoretical Foundation

The U.S. healthcare system is plagued by administrative inefficiency, with workflows like Prior Authorization (PA) and Care Management (CM) being particularly burdensome. While AI agents are increasingly proposed as a solution, automating these realistic, end-to-end workflows exposes three fundamental challenges not adequately addressed by current benchmarks:

Policy Density: Agents must navigate a large, dynamic library of medical guidelines, insurance rules, and operational procedures to make every decision.
Multi-Role Composition: A single workflow requires the agent to sequentially assume different roles (e.g., intake clerk → nurse → physician reviewer), with irreversible handoffs between them.
Multilateral Interactions: Critical steps involve multi-turn conversations (e.g., peer-to-peer reviews, patient outreach) where the agent must interact with simulated humans.

χ-Bench is introduced to rigorously evaluate agents on these combined challenges. The benchmark is formalized as a hierarchical Partially Observable Markov Decision Process (POMDP):

$$M = (S, A, O, P, Z, R, \rho_0; H)$$

where:

$S$ = latent state (patient charts, records, workflow status, artifacts).
$A$ = role-scoped actions (MCP and default-agent tools).
$O$ = role-scoped observations (MCP outputs, messages, policy passages).
$P, Z$ = transition and observation kernels from the environment.
$R$ = verifier-induced reward.
$\rho_0$ = initial state distribution.
$H := (G, \nu, W)$ = hierarchy with role-agent specifications $G := \{(G_i, u_i, K_i)\}_{i=1}^N$, handoff order $\nu$, and shared workspace $W$.

Each role's skill set $K_i$ is a set of options (temporally extended procedures), such as nurse criterion review: policy retrieval → chart read → structured-payload write.
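The hierarchy $H$ can be pictured with a small Python sketch. This is an illustration under assumptions, not the benchmark's implementation: the class names, the dictionary-based workspace, and the `intake_register` step are hypothetical, and the components $G_i$ and $u_i$ are omitted because this summary does not define them. Only the skill set $K_i$, the handoff order $\nu$, and the shared workspace $W$ are modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Workspace = Dict[str, str]             # W: shared workspace (assumed to be a key-value store)
Option = Callable[[Workspace], None]   # one element of K_i: a temporally extended procedure


@dataclass
class RoleAgent:
    name: str
    options: List[Option]              # K_i: the role's skill set


@dataclass
class Hierarchy:
    roles: Dict[str, RoleAgent]        # role agents, keyed by role name
    handoff_order: List[str]           # nu: the order in which roles act
    workspace: Workspace = field(default_factory=dict)

    def run(self) -> List[str]:
        """Run each role once, in handoff order; earlier roles are never revisited."""
        completed = []
        for role_name in self.handoff_order:
            for option in self.roles[role_name].options:
                option(self.workspace)
            completed.append(role_name)
        return completed


def intake_register(ws: Workspace) -> None:
    ws["case"] = "registered"          # hypothetical intake step


def nurse_criterion_review(ws: Workspace) -> None:
    ws["policy"] = "criterion text"    # policy retrieval
    ws["chart"] = "chart findings"     # chart read
    ws["payload"] = ws["policy"] + " | " + ws["chart"]   # structured-payload write


hierarchy = Hierarchy(
    roles={
        "intake": RoleAgent("intake", [intake_register]),
        "nurse": RoleAgent("nurse", [nurse_criterion_review]),
    },
    handoff_order=["intake", "nurse"],
)
completed_roles = hierarchy.run()
```

Because `run` walks the handoff order exactly once, a later role can read what an earlier role wrote to the workspace but cannot send control back, which is the sense in which the handoffs are irreversible.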

Methodology

χ-World Engine: Simulated Healthcare Environment

The core of χ-Bench is the χ-World Engine, a local, high-fidelity simulator built with ~115K lines of Python. It simulates 20 day-to-day healthcare apps across three domains, implementing realistic features:

29 statuses in case state machines with explicit legal transitions.
Reviewer-independence constraints.
Atomic, cross-app effects (e.g., a provider submission automatically creates a payer intake record).
FHIR-grade encounter linkage and document authorship.
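A case state machine with explicit legal transitions might look like the following minimal sketch. The five statuses and the transition table are hypothetical stand-ins (the simulator itself defines 29 statuses, which are not listed here), and the `Case` class is illustrative rather than part of the χ-World Engine.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical statuses; the real simulator defines 29.
    DRAFT = "draft"
    SUBMITTED = "submitted"
    IN_REVIEW = "in_review"
    APPROVED = "approved"
    DENIED = "denied"


# Explicit legal transitions: any move not listed here is rejected.
LEGAL_TRANSITIONS = {
    Status.DRAFT: {Status.SUBMITTED},
    Status.SUBMITTED: {Status.IN_REVIEW},
    Status.IN_REVIEW: {Status.APPROVED, Status.DENIED},
    Status.APPROVED: set(),   # terminal
    Status.DENIED: set(),     # terminal
}


class IllegalTransition(Exception):
    pass


class Case:
    def __init__(self) -> None:
        self.status = Status.DRAFT
        self.history = [Status.DRAFT]

    def transition(self, new_status: Status) -> None:
        """Move to new_status only if the transition table allows it."""
        if new_status not in LEGAL_TRANSITIONS[self.status]:
            raise IllegalTransition(f"{self.status.value} -> {new_status.value}")
        self.status = new_status
        self.history.append(new_status)


case = Case()
case.transition(Status.SUBMITTED)
case.transition(Status.IN_REVIEW)
case.transition(Status.APPROVED)
```

With terminal statuses mapped to an empty set, an agent that tries to reopen an approved case gets an `IllegalTransition` error instead of silently changing the record.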

The environment exposes 87 of its 151 backend APIs as MCP (Model Context Protocol) tools, selected to mirror UI operations available to human users.

Managed-Care Operations Handbook Skill

Agents are guided by a structured skill of 1,279 markdown documents, organized as a wiki-style manual:

managed-care-operations-handbook/
├── SKILL.md (top-level index → routes by role)
├── provider-pa/ (PA specialist sub-skill)
├── payer-um/ (UM reviewer sub-skill)
├── care-manager/ (RN care manager sub-skill)
├── medical-library/ (shared: 1000+ policy documents)
└── platform/ (shared: role-specific tutorials)

This skill, developed with clinicians from Johns Hopkins Medicine, encodes entire operational workflows, software usage patterns, and the governing medical/insurance policies.

Task Construction & Evaluation

A χ-Bench task is a quadruple: instructions, the χ-World environment, role-scoped tools, and a two-layer verifier.

Task Creation Pipeline:

Case Generation: A terminal world state is sampled, and Claude Opus 4.7 + structured JSON sampling generates upstream artifacts (chart specs, packets) conditioned on the system state graph and handbook.
Human Walkthrough: An annotator completes the case end-to-end on the live UI, creating the ground-truth trajectory, database states, and workspace commits.
Multi-Reviewer Review: Each trajectory is reviewed by at least 1 healthcare worker and 5 authors for clinical precision and realism.

The final benchmark consists of 75 representative, long-horizon tasks (from an initial pool of 523), each of which takes a human an average of 21 steps to complete.

Verification & Reward: The verifier scores trials based on the persisted simulator record (world store, event log, transcripts). The reward $R$ is binary:

$$R = \text{DeterministicPass} \land \text{JudgePass}$$

combining a deterministic contract check with a rubric-based LLM judge (Claude Opus 4.7) under strict-majority vote.
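The binary reward can be sketched in a few lines of Python. The function names and the number of judge votes are assumptions made for illustration; the summary specifies only that the judge verdict uses a strict-majority vote and that both checks must pass.

```python
from typing import Sequence


def judge_pass(votes: Sequence[bool]) -> bool:
    """Strict majority: more than half of the judge votes must be passes."""
    return 2 * sum(votes) > len(votes)


def reward(deterministic_pass: bool, votes: Sequence[bool]) -> int:
    """R = DeterministicPass AND JudgePass, returned as 0 or 1."""
    return int(deterministic_pass and judge_pass(votes))


r_both = reward(True, [True, True, False])        # both layers pass
r_tie = reward(True, [True, False])               # a tie is not a strict majority
r_contract = reward(False, [True, True, True])    # the contract check fails
```

The conjunction means a trial that satisfies the rubric but violates the deterministic contract scores zero, and so does the reverse case.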

Empirical Validation / Results

The study evaluated 30 agent harness/model configurations, spanning frontier proprietary models with their first-party CLIs and open-source agent frameworks over open-weight models. Key results are shown below.

Table 2: χ-Bench Results Across Agent Harnesses and Frontier Models (Top Performers)

Agent Harness	Model	Overall pass@1	Overall pass^3	PA pass@1	UM pass@1	CM pass@1	Avg Steps	Avg Cost ($)
Claude Code	Claude Opus 4.6	28.0 +8.9/-8.4	18.7 +9.3/-8.0	18.7	41.3	24.0	76	6.47
Claude Code	Claude Sonnet 4.6	26.2 +7.6/-8.0	12.0 +8.0/-6.7	24.0	34.7	20.0	82	1.30
Claude Code	Claude Opus 4.7	24.4 +8.4/-8.0	10.7 +8.0/-6.7	24.0	17.3	32.0	68	9.91
Codex	GPT-5.5	20.9 +8.4/-7.6	9.3 +8.0/-5.3	29.3	32.0	1.3	54	1.29
OAI Agents	GLM-5.1	18.7 +8.4/-8.0	12.0 +8.0/-6.7	18.7	33.3	4.0	58	0.27

Overall Performance: The best configuration (Claude Code + Opus 4.6) achieves only 28.0% pass@1. No agent clears 20% under the strict pass^3 reliability metric.
Domain Strengths Vary: Different models excel in different domains: GPT-5.5 is best for PA (29.3%), Opus 4.6 for UM (41.3%), and Opus 4.7 for CM (32.0%).
Reliability Gap: Figure 11b shows a significant drop from pass@k (at least one of k trials passes) to pass^k (all k trials must pass), highlighting run-to-run inconsistency. For Claude Code + Opus 4.6, the pass rate falls from 28.0% (pass@1) to 18.7% (pass^3).
ROI Analysis: Figure 11a plots cost vs. performance. OAI Agents + GLM-5.1 emerges as a strong cost-normalized point in the "Sweet Spot" quadrant.
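The difference between the metrics above can be made concrete with a short Python sketch. It assumes each task is run exactly k times and uses made-up outcomes for four tasks; the function names are hypothetical and the numbers are not benchmark data.

```python
from typing import List, Sequence


def pass_at_1(outcomes: Sequence[Sequence[bool]]) -> float:
    """Mean single-trial success rate, averaged over tasks."""
    return sum(sum(trials) / len(trials) for trials in outcomes) / len(outcomes)


def pass_at_k(outcomes: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where at least one of the k trials passes."""
    return sum(any(trials) for trials in outcomes) / len(outcomes)


def pass_hat_k(outcomes: Sequence[Sequence[bool]]) -> float:
    """pass^k: fraction of tasks where all k trials pass."""
    return sum(all(trials) for trials in outcomes) / len(outcomes)


# Made-up outcomes: 4 tasks, 3 trials each.
toy: List[List[bool]] = [
    [True, True, True],     # reliable success
    [True, False, False],   # passes once, fails twice
    [False, False, False],  # never passes
    [False, True, True],    # passes twice
]

p1 = pass_at_1(toy)      # (1 + 1/3 + 0 + 2/3) / 4
p_any = pass_at_k(toy)   # 3 of 4 tasks pass at least once
p_all = pass_hat_k(toy)  # 1 of 4 tasks passes every time
```

In this toy data pass@3 is 0.75 while pass^3 is 0.25, with pass@1 at 0.5 in between: the same set of trials looks strong or weak depending on whether one success or consistent success is required.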

Additional Stress Tests

χ-Bench-Arena (End-to-End PA): Tests a two-agent game where a provider agent and a payer agent (both Codex + GPT-5.5) interact end-to-end. Performance collapses from 30.4% pass@1 (provider-only) to 0%, demonstrating the extreme difficulty of cross-role, interactive workflows.

Table 3: E2E Two-Agent PA Results

Configuration	pass@1
PA provider-only (23 tasks)	30.4
E2E two-agent	0.0

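χ-Bench-Marathon (Long-Running Sessions): Agents attempt all 25 tasks of a domain in a single session. Performance drops in every domain, to at most 8.0% pass@1. Table 4 reports marathon pass@1 (%), with the change ∆ from per-task pass@1 in parentheses (percentage points).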

Table 4: Marathon vs. Per-Task Performance

Agent Harness	Model	PA Marathon (∆)	UM Marathon (∆)	CM Marathon (∆)
Codex	GPT-5.5	8.0 (-21.3)	2.7 (-29.3)	0.0 (-1.3)
Claude Code	Claude Opus 4.7	8.0 (-16.0)	1.3 (-16.0)	2.7 (-29.3)
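The ∆ values in Table 4 can be reproduced from the per-task pass@1 figures in Table 2, as the following Python check shows. The dictionary layout is an illustrative choice, and averaging the six marathon cells is an assumption about aggregation: it yields roughly 3.8, which agrees with the marathon figure quoted in the summary, but this summary does not state how that figure was computed.

```python
# Per-task pass@1 (Table 2) and marathon pass@1 (Table 4), in (PA, UM, CM) order.
per_task = {
    "Codex + GPT-5.5": (29.3, 32.0, 1.3),
    "Claude Code + Claude Opus 4.7": (24.0, 17.3, 32.0),
}
marathon = {
    "Codex + GPT-5.5": (8.0, 2.7, 0.0),
    "Claude Code + Claude Opus 4.7": (8.0, 1.3, 2.7),
}

# Delta = marathon pass@1 minus per-task pass@1, in percentage points.
deltas = {
    name: tuple(round(m - p, 1) for m, p in zip(marathon[name], per_task[name]))
    for name in marathon
}

# Unweighted mean over all six marathon cells (assumed aggregation).
all_cells = [value for row in marathon.values() for value in row]
mean_marathon = round(sum(all_cells) / len(all_cells), 1)
```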

Ablation Studies

Handbook Skill Impact: Trimming the handbook reveals domain-dependent effects (Figure 12). UM is handbook-bound (performance drops without it), while PA performance can sometimes improve slightly without it, as agents avoid "over-verification" and refusal.
Tool Surface (MCP vs. CLI): Re-surfacing MCP tools as CLI commands (MCPorter) gave neutral-to-worse results in PA and UM and a small gain in CM (Table 5), suggesting the tool format is not the primary bottleneck for these workflows.

Table 5: MCP vs. CLI Tool Surface Performance (Codex + GPT-5.5)

Domain	MCP pass@1	CLI pass@1	∆
Prior Authorization	29.3	28.0	-1.3
Utilization Management	32.0	25.3	-6.7
Care Management	1.3	4.0	+2.7

Failure Mode Analysis

Analysis of 5,886 failed trials reveals the primary sources of error.

Figure 14: Top Second-Level Failure Modes

Failure Mode	Share of Failed Trials	Primary Category
Criteria misapplication	28.0% (1,647)	Clinical-Reasoning
Skipped required step	18.7% (1,098)	Workflow-Completion
Misread policy criteria	13.2% (778)	Policy-Compliance
Wall-clock timeout	7.6% (449)	Abstain-or-Stuck
Fatal tool call	7.3% (432)	Tool-Use-Error
Illegitimate consent (CM-specific)	5.7% (337)	Clinical-Reasoning

Clinical-Reasoning (35.4%): Largest category, involving medical/protocol judgment errors.
Workflow-Completion (23.3%): Agent never invoked a required terminal action.
Policy-Compliance (13.2%): Predominantly literal misreadings of policy criterion text.
Illegitimate Consent: A CM-specific failure where the agent improperly "concern-mines" to get a refusing patient to consent, violating autonomy-first engagement principles.
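The percentages in the failure-mode table follow directly from the trial counts, which the short Python check below recomputes against the 5,886 failed trials. The variable names are illustrative. The six listed modes together cover about 80.5% of failed trials, so the remainder falls under modes not shown in this summary.

```python
TOTAL_FAILED = 5886  # failed trials analyzed

# Counts of failed trials per second-level failure mode, as listed in the table.
counts = {
    "Criteria misapplication": 1647,
    "Skipped required step": 1098,
    "Misread policy criteria": 778,
    "Wall-clock timeout": 449,
    "Fatal tool call": 432,
    "Illegitimate consent": 337,
}

# Share of failed trials per mode, in percent, rounded to one decimal place.
shares = {mode: round(100 * n / TOTAL_FAILED, 1) for mode, n in counts.items()}

# Combined coverage of the six listed modes.
covered = round(100 * sum(counts.values()) / TOTAL_FAILED, 1)
```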

Theoretical and Practical Implications

Benchmarking Gap: χ-Bench fills a critical gap in the evaluation landscape (see Table 1 in paper). It is the first benchmark to combine long-horizon tool use, dense policy retrieval, irreversible multi-role workflows, hidden multilateral interaction, and in-situ verification in a healthcare context.
Agent Capability Limits: The results demonstrate that the long-horizon capabilities showcased by frontier agents on coding-style benchmarks do not generalize well to realistic, policy-dense enterprise workflows. The low success rates indicate current agents are not ready for unsupervised deployment in high-stakes healthcare operations.
Safety & Reliability Concerns: The prevalence of Clinical-Reasoning and Policy-Compliance failures translates directly to potential clinical, financial, and regulatory harm. The pass^3 reliability metric and the "illegitimate consent" failure mode highlight that mere task completion is an inadequate safety criterion.
Hypothesis for Other Domains: The authors hypothesize that similar performance gaps will surface in other policy-dense, role-composed, irreversible enterprise domains beyond healthcare, such as legal, financial, or government services.

Conclusion

χ-Bench presents a rigorous, high-fidelity benchmark that exposes significant limitations in current AI agents' ability to automate end-to-end healthcare workflows. The best agents succeed on fewer than a third of tasks, with reliability and performance collapsing under more realistic multi-role and long-running conditions. Failure analysis underscores that the core challenges lie in clinical reasoning, workflow adherence, and policy comprehension, not just tool use. The benchmark serves as a cautionary stress test and a call for focused research on improving agent reliability, policy grounding, and multi-actor coordination before considering deployment in irreversible, high-impact enterprise settings.

Future Directions include extending χ-Bench to multimodal inputs (imaging, speech), covering a wider range of healthcare workflows, and studying the impact of different judge models. The resources (benchmark, simulator, handbook) are released to the community to foster progress in this critical area.