Healthcare AI demos are impressive. An agent reads a chest X-ray, writes a differential diagnosis, cites the literature. Clean, fast, accurate.
Now ask the same agent to complete a prior authorization for a CT scan. It needs to check the patient's insurance eligibility, verify clinical criteria against the payer's medical policy, write a clinical justification letter, submit it through the payer's portal, handle the peer-to-peer review if denied, and document the outcome in the EHR.
That is a different problem entirely.
The Problem: Benchmarks Do Not Test What Matters
Current AI agent benchmarks test narrow capabilities — question answering, tool calling, single-turn reasoning. They do not test what happens when an agent must navigate a long-horizon workflow that spans multiple systems, requires compliance with a dense policy handbook, and involves handoffs between roles (utilization nurse, case manager, physician reviewer).
In real healthcare operations, these workflows are the bottleneck. Prior authorization alone costs the US healthcare system an estimated $35 billion annually in administrative overhead. The workflows are policy-dense (thousands of rules), multi-role (nurse → doctor → payer → patient), and irreversible (a wrong submission cannot be retracted).
If AI agents cannot handle these workflows, the productivity gains enterprise leaders are counting on will not materialize.
What χ-Bench Does
χ-Bench (pronounced "Chi-Bench") introduces a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management.
The benchmark provides a high-fidelity simulator of 20 healthcare applications exposed through 87 MCP tools. The agent receives a clinical case and must navigate the workflow from start to finish — checking eligibility, reviewing clinical criteria, writing justification documents, handling peer-to-peer reviews, and documenting outcomes.
Critically, the agent is guided by a 1,290+ document managed-care operations handbook. This is not optional reading. The workflow cannot be completed correctly without consulting the policy manual — the same manual human staff reference daily.
The benchmark tests three capabilities that current benchmarks underrepresent:
- Policy density — decisions must be grounded in a large library of medical, insurance, and operational rules
- Multi-role composition — a single task requires the agent to play multiple roles with handoffs
- Multilateral interaction — intermediate workflow steps are multi-turn dialogs (peer-to-peer review, patient outreach)
The Results
The researchers tested 30 agent harness/model configurations. The numbers are sobering.
| Metric | Best Agent | What It Means |
|---|---|---|
| Task resolution (pass@1) | 28.0% | Roughly 7 in 10 tasks fail on a single attempt |
| Strict pass^3 | < 20% | Fewer than 1 in 5 tasks succeed on all three attempts |
| Single-session execution | 3.8% | Running all tasks in one session collapses performance |
The gap between individual task performance and full-session execution is the most revealing finding. When an agent must complete multiple tasks sequentially in a single session — as it would in production — context accumulation, error propagation, and policy confusion compound. Performance does not degrade gracefully. It collapses.
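To make the two headline metrics concrete, here is a minimal sketch of how pass@1 and a strict pass^k could be computed from per-task trial outcomes. It assumes the common convention that pass^k counts a task only when all k independent trials succeed; this post does not spell out the paper's exact definition. The task names and outcomes are invented for illustration and are not results from the paper.

```python
from typing import Dict, List

# Hypothetical trial outcomes: task name -> success flag per independent trial.
# These values are made up for illustration; they are not from the paper.
TRIALS: Dict[str, List[bool]] = {
    "prior_auth_ct": [True, True, True],
    "um_review": [True, False, True],
    "care_mgmt_outreach": [False, False, False],
    "peer_to_peer": [True, True, False],
}


def pass_at_1(trials: Dict[str, List[bool]]) -> float:
    """Average single-attempt success rate across tasks."""
    per_task = [sum(runs) / len(runs) for runs in trials.values()]
    return sum(per_task) / len(per_task)


def strict_pass_k(trials: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k trials ALL succeed (assumed definition)."""
    for name, runs in trials.items():
        if len(runs) < k:
            raise ValueError(f"task {name!r} has fewer than {k} trials")
    solved = sum(all(runs[:k]) for runs in trials.values())
    return solved / len(trials)


if __name__ == "__main__":
    print(f"pass@1 = {pass_at_1(TRIALS):.1%}")         # 58.3%
    print(f"pass^3 = {strict_pass_k(TRIALS, 3):.1%}")  # 25.0%
```

In this sketch a task that succeeds on only some trials still contributes to pass@1 but counts as a failure under pass^3, so the strict metric reads as a consistency measure rather than a retry budget.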
Why This Matters
The 87 MCP tools are available. The 1,290+ document policy handbook is the bottleneck. Agents fail because they cannot reliably ground decisions in a large, evolving rule set, not because they lack tool access.
Demos are not deployments. An agent that can answer medical questions or call tools in isolation is not an agent that can complete a prior authorization. The gap is not incremental — it is architectural. Enterprise leaders evaluating agent vendors should demand end-to-end workflow benchmarks, not single-task accuracy scores.
Single-session execution is the deployment reality. Benchmarks that test one task at a time overestimate agent capability by roughly 7× (28.0% pass@1 versus 3.8% in a single session). In production, agents will handle queues of tasks, carry context from one task to the next, and deal with accumulated state. χ-Bench's single-session metric (3.8%) is closer to what enterprise leaders should expect and plan for.
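The overestimate can be checked with simple arithmetic. The sketch below divides the best reported pass@1 by the single-session rate, then applies both rates to a queue of tasks. The queue size of 500 is a hypothetical planning number, and treating each rate as a flat per-task expectation is a simplifying assumption, not something the paper claims.

```python
# Reported figures from the results table above.
PASS_AT_1 = 0.280        # best agent, tasks run one at a time
SINGLE_SESSION = 0.038   # best agent, all tasks in one session


def overestimate_factor(isolated_rate: float, session_rate: float) -> float:
    """How many times higher the isolated rate is than the session rate."""
    if session_rate <= 0:
        raise ValueError("session_rate must be positive")
    return isolated_rate / session_rate


def expected_resolved(queue_size: int, rate: float) -> float:
    """Expected number of resolved tasks, assuming a flat per-task rate."""
    return queue_size * rate


if __name__ == "__main__":
    factor = overestimate_factor(PASS_AT_1, SINGLE_SESSION)
    print(f"overestimate: {factor:.1f}x")  # 7.4x

    queue = 500  # hypothetical daily queue, for illustration only
    print(f"isolated estimate: {expected_resolved(queue, PASS_AT_1):.0f} tasks")      # 140
    print(f"session estimate: {expected_resolved(queue, SINGLE_SESSION):.0f} tasks")  # 19
```

Under these assumptions, a capacity plan built on the isolated number would expect about 140 resolved tasks where the single-session number suggests about 19, with the remainder falling back to human staff.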
Limitations
The benchmark is simulated, not live. Real-world EHR integrations, payer portals, and clinical workflows have additional friction (latency, authentication, data quality) that would likely reduce performance further. It is also limited to English and three healthcare domains. The findings may not generalize uniformly across all workflow types.
Key Takeaways
- Current AI agents cannot automate real healthcare workflows. The best agent resolves only 28% of tasks, and single-session execution collapses to 3.8%. The gap between demo and deployment is architectural, not incremental.
- Policy density, not tool access, is the bottleneck. Agents fail because they cannot reliably ground decisions in a 1,290+ document operations handbook. Policy-retrieval infrastructure is the missing layer.
- Single-session metrics reveal the real deployment picture. Testing one task at a time overestimates agent capability by roughly 7× (28.0% versus 3.8%).
Paper Details
- Title: χ-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
- Authors: Haolin Chen, Deon Metelski, Leon Qi, et al.
- ArXiv: 2605.16679
- Date: May 2026