Dec 29, 2025 · 8 min read
Architecting for Reliability: How We Scaled Healthcare AI to 45M Requests
Scaling AI systems in healthcare is a different problem from scaling consumer apps. The stakes are higher, the data is more sensitive, and the failure modes are more consequential. Here are the architecture and reliability patterns we used to scale a healthcare AI platform from thousands to 45 million requests per month while maintaining clinical accuracy and HIPAA compliance.
The Scale Challenge
The platform started as an internal tool processing clinical notes for a single hospital network. Within 18 months, it had expanded to serve 2M+ patient records across multiple health systems. The original architecture — a monolithic Python service calling a single LLM provider — couldn't handle the load or meet the compliance requirements at that scale.
We needed to redesign for three simultaneous constraints: throughput (45M requests/month), latency (<200ms p95 for clinical workflows), and auditability (every AI decision logged with full provenance for HIPAA compliance).
Human-in-the-Loop as a First-Class Citizen
Healthcare AI cannot be fully autonomous. We designed HITL as a core architectural component, not an afterthought. Every AI output is assigned a confidence score. Below a configurable threshold, the output is flagged and routed to a clinical reviewer queue before being surfaced to end users.
The key insight was decoupling the review queue from the critical path. Low-confidence outputs return a "pending review" state immediately — the user sees a placeholder — while the review happens asynchronously. This kept the UI responsive while ensuring clinical accuracy was never compromised for speed.
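The confidence-gated routing above can be sketched roughly as follows. The threshold value, field names, and queue mechanism here are illustrative assumptions, not the platform's actual implementation:

```python
from dataclasses import dataclass
from queue import Queue

# Illustrative value — the post describes the threshold as configurable.
REVIEW_THRESHOLD = 0.85

@dataclass
class AIOutput:
    request_id: str
    text: str
    confidence: float

def route_output(output: AIOutput, review_queue: Queue) -> dict:
    """Gate an AI output on its confidence score.

    High-confidence results are surfaced immediately; low-confidence
    results go to the clinical reviewer queue, and the caller gets a
    'pending review' placeholder right away so the UI stays responsive.
    """
    if output.confidence >= REVIEW_THRESHOLD:
        return {"status": "ready", "text": output.text}
    # Review happens asynchronously, off the critical path.
    review_queue.put(output)
    return {"status": "pending_review", "request_id": output.request_id}
```

The important property is that the low-confidence branch never blocks: the reviewer queue is drained by a separate worker, and the placeholder is replaced only after a clinician approves the output.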
Reliability Patterns That Held at Scale
Three patterns proved essential at 45M requests/month:
Circuit Breakers per Provider
Each AI provider has an independent circuit breaker. When error rates exceed 5% over a 60-second window, the breaker opens and traffic is re-routed to the fallback provider. This eliminated cascading failures that previously took down the entire pipeline.
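A minimal sliding-window breaker along these lines might look like the sketch below. The cooldown behavior and the lack of a minimum sample size are simplifying assumptions; a production breaker would also require a minimum request count before tripping:

```python
import time

class CircuitBreaker:
    """Per-provider breaker: opens when the error rate over a sliding
    window exceeds a threshold, so traffic can be sent to a fallback."""

    def __init__(self, error_threshold=0.05, window_seconds=60,
                 cooldown_seconds=30):
        self.error_threshold = error_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self.events = []       # (timestamp, ok) pairs within the window
        self.opened_at = None  # set when the breaker trips

    def record(self, ok: bool, now=None):
        """Record one request outcome and re-evaluate the error rate."""
        now = time.monotonic() if now is None else now
        self.events.append((now, ok))
        self._trim(now)
        if self._error_rate() > self.error_threshold:
            self.opened_at = now

    def allow(self, now=None) -> bool:
        """True if this provider may receive traffic right now."""
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_seconds:
            # Half-open: clear state and let a probe request through.
            self.opened_at = None
            self.events.clear()
            return True
        return False

    def _trim(self, now):
        self.events = [(t, ok) for t, ok in self.events
                       if now - t <= self.window_seconds]

    def _error_rate(self) -> float:
        if not self.events:
            return 0.0
        return sum(1 for _, ok in self.events if not ok) / len(self.events)
```

The caller checks `allow()` before each request and routes to the fallback provider when it returns `False`; because each provider owns its own breaker, one provider's outage never poisons the others.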
Idempotent Request IDs
Every clinical AI request is tagged with a client-generated idempotency key. Retries are safe — the system deduplicates at the processing layer and returns the cached result rather than re-invoking the model. This was critical for preventing duplicate AI outputs in clinical records.
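The dedup-and-return-cached-result behavior can be sketched as follows. A real deployment would back this with a shared store rather than an in-process dict; the class and method names here are hypothetical:

```python
import threading

class IdempotentExecutor:
    """Deduplicate requests by client-generated idempotency key.

    The first call for a key invokes the model; any retry with the
    same key returns the cached result instead of re-invoking.
    """

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def execute(self, idempotency_key: str, invoke_model):
        with self._lock:
            if idempotency_key in self._results:
                return self._results[idempotency_key]
        result = invoke_model()
        with self._lock:
            # First writer wins: a concurrent duplicate that raced past
            # the check above still returns the earlier stored result.
            return self._results.setdefault(idempotency_key, result)
```

Because the key is generated by the client before the first attempt, a network timeout followed by a retry produces exactly one model invocation and one entry in the clinical record.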
Structured Audit Logging
Every AI decision — model used, prompt hash, response, confidence score, reviewer action — is written to an append-only audit log in BigQuery. This satisfies HIPAA audit trail requirements and provides the data needed to continuously improve model accuracy.
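A record along these lines might be assembled as in the sketch below. The field names and the choice of SHA-256 for the prompt hash are assumptions; the post does not show the actual schema, and the BigQuery insert is stubbed out with a generic sink:

```python
import hashlib
import json
import time

def build_audit_record(model: str, prompt: str, response: str,
                       confidence: float, reviewer_action=None) -> dict:
    """Build one append-only audit record for an AI decision.

    The prompt is stored only as a SHA-256 hash, so the log captures
    provenance without duplicating the prompt contents in the table.
    """
    return {
        "timestamp": time.time(),
        "model": model,
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "response": response,
        "confidence": confidence,
        "reviewer_action": reviewer_action,
    }

def append_audit_record(record: dict, sink) -> None:
    # In production this would be a streaming insert into an append-only
    # BigQuery table; here `sink` is anything with a write() method.
    sink.write(json.dumps(record) + "\n")
```

Keeping the log append-only is what makes it usable as a HIPAA audit trail: records are never updated in place, and a reviewer override shows up as a new `reviewer_action` value rather than a mutation of the original decision.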
The Lesson at 45M Requests
Reliability in healthcare AI is not about preventing all failures — it's about making every failure mode safe, auditable, and recoverable. Design for failure first, then optimize for the happy path.