Committed
Essay · Production AI

The Monday morning test: when an AI system is actually in production

Cutover, on-call, runbook, escalation, the first incident, the first audit. A short essay on the difference between "live" and "shipped".

By Lior Bar-On · CTO · December 1, 2025 · 7 min read


The first system we ever shipped to production failed quietly for three weeks before anyone noticed. Not because of a model regression — the model was fine — but because the people who wrote the system had moved on, and the people now operating it had no runbook. That experience is the reason this essay exists. Production AI is the discipline of making sure that, when the team that wrote the code is no longer in the room, the system you handed over still does what it said it would.

What we mean by production AI

Most teams encounter this problem the same way: a successful proof of concept, a celebratory all-hands, and then six months of incident tickets. The promise of AI-assisted engineering is that the second half of that arc gets shorter. The reality is that — without a method — the second half is where the program dies.

We treat production AI as a contract between three parties: the engineer who wrote it, the operator who runs it, and the auditor who will eventually ask why it made a particular decision. Each one needs a different surface, and most projects only build for the first.

Fig. 01 — The handoff is the product.

The three failure modes

Across roughly forty production deployments — in healthcare, finance, logistics, and government — the systems that decay tend to decay in one of three ways. They lose their ground truth, they lose their operators, or they lose their guardrails. Most postmortems tell a story about model drift, but the underlying pattern is almost always one of these three.

  1. Ground truth slips. The data the system was evaluated against is no longer representative of the data it sees. Without a regression suite, this is invisible until a customer complains.
  2. Operators drift. The runbook lives in someone's head, then in someone's Notion page, then nowhere. The first incident at 2am surfaces this immediately.
  3. Guardrails relax. A redaction step is "temporarily" disabled for a debugging session and the temporary becomes permanent. The audit trail tells you eventually; you'd rather know on day one.

A system that nobody can audit on Monday morning is not in production. It is a rehearsal.

What works

The remediation is unglamorous and well-known: written specs, machine-readable evals, signed audit on every state change, and an on-call rotation that includes the engineer who wrote the original code for the first ninety days. None of this is novel. The novelty is in refusing to declare a system "done" until all four are in place.
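One way to make "signed audit on every state change" concrete is a hash-chained log, where each entry signs both its own event and the previous entry's signature, so edits or deletions anywhere in the history break verification. A minimal sketch, assuming an HMAC signing key held in a secret manager; the field names and helper functions here are illustrative, not a prescription:

```python
import hashlib
import hmac
import json

# Hypothetical key — in practice, fetched from your secret manager.
SIGNING_KEY = b"replace-with-a-managed-secret"

def append_audit_entry(log: list, event: dict) -> dict:
    """Append a tamper-evident entry. Each record signs the event plus
    the previous entry's signature, forming a hash chain."""
    prev_sig = log[-1]["sig"] if log else ""
    payload = json.dumps({"event": event, "prev": prev_sig}, sort_keys=True)
    sig = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    entry = {"event": event, "prev": prev_sig, "sig": sig}
    log.append(entry)
    return entry

def verify_chain(log: list) -> bool:
    """Recompute every signature in order; any edited, reordered, or
    dropped entry invalidates everything after it."""
    prev_sig = ""
    for entry in log:
        payload = json.dumps({"event": entry["event"], "prev": prev_sig},
                             sort_keys=True)
        expected = hmac.new(SIGNING_KEY, payload.encode(),
                            hashlib.sha256).hexdigest()
        if expected != entry["sig"]:
            return False
        prev_sig = entry["sig"]
    return True
```

The design choice that matters is the chaining: a log of independently signed entries lets an attacker silently drop records, while a chained log makes any gap visible to the auditor.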

We have a working name for this last step — the Monday morning test. If a new operator can pick up the runbook, page the right engineer, and cleanly resolve a P1 within the first business hour after cutover, the system is in production. Otherwise it is a demo with good uptime.

A short checklist

  • Written spec, versioned in the same repo as the code.
  • Eval harness that runs in CI on every PR, with a regression gate.
  • Signed audit log for every model decision that touches a customer.
  • On-call rotation, documented escalation path, and a runbook for the top three failure modes.
  • A 90-day handoff window where the original team is paged for production incidents.
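The regression-gate bullet can be sketched as a small CI check: compare fresh eval metrics against a baseline committed alongside the code, and fail the build on any drop beyond a tolerance. The metric names and numbers below are hypothetical:

```python
def regression_gate(results: dict, baseline: dict,
                    tolerance: float = 0.01) -> bool:
    """Return False (fail the build) if any baseline metric regresses by
    more than `tolerance`. A metric missing from results counts as a
    regression — silently dropped evals are their own failure mode."""
    passed = True
    for metric, expected in baseline.items():
        got = results.get(metric, 0.0)
        if got < expected - tolerance:
            print(f"REGRESSION {metric}: {got:.3f} < baseline {expected:.3f}")
            passed = False
    return passed

# In CI: load the fresh eval output and the committed baseline,
# exit nonzero when the gate fails.
baseline = {"exact_match": 0.82, "citation_accuracy": 0.95}
fresh = {"exact_match": 0.84, "citation_accuracy": 0.91}
regression_gate(fresh, baseline)  # citation_accuracy trips the gate
```

Running the gate on every PR is what keeps "ground truth slips" from being invisible: the suite degrades loudly in review instead of quietly in production.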

None of this is the work most engineers got into the field to do. It is, however, the work that lets the systems we build outlive the team that built them. That is the only definition of production we have found that holds up to an auditor.


Lior Bar-On
CTO · COMMITTED