An AI incident is a production failure of an AI system serious enough to warrant a structured response. Bad output that reached a user. A regression that blew a release gate. A regulatory exposure. A runaway cost event. The pattern mirrors classical site-reliability incidents, but the failure modes are AI-specific and the postmortem looks different.

The four common classes.

  • Output incident. A wrong, harmful, or off-brand response reached a real user. The most visible class and usually the one customers complain about first.
  • Gate-blow. A release made it to production despite an eval-gate failure, because someone overrode the gate. The bad answer was inside the system the moment the override happened.
  • Cost incident. A bug or a regression caused runaway model spend. Without a token budget enforced at the gateway, the bill can reach tens of thousands of dollars before anyone notices.
  • Compliance incident. The system handled data, generated output, or took an action in a way that violates a policy or regulation. PII leak, jurisdiction failure, missing audit trail.
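
To make the cost-incident class concrete, here is a minimal sketch of a token budget checked before each model call. The names TokenBudget, BudgetExceeded, and charge are hypothetical, and the cap and its unit are assumptions; a real gateway would also need a time window, per-tenant accounting, and shared state.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a request would push usage past the configured cap."""


@dataclass
class TokenBudget:
    # Hypothetical cap in tokens; the window it applies to is a deployment choice.
    limit: int
    used: int = 0

    def charge(self, tokens: int) -> int:
        """Record usage and return the remaining budget.

        Refuses the request (without recording it) if it would exceed the cap.
        """
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"request of {tokens} tokens would exceed budget "
                f"({self.used}/{self.limit} used)"
            )
        self.used += tokens
        return self.limit - self.used


if __name__ == "__main__":
    budget = TokenBudget(limit=100)
    print(budget.charge(60))  # 40 tokens remain
    try:
        budget.charge(50)  # would bring usage to 110, over the cap
    except BudgetExceeded as exc:
        print("blocked:", exc)
```

The point of the design is that the refusal happens before the spend, so a runaway loop fails loudly at the cap instead of showing up on the invoice.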

The response, in five steps.

  1. Stop the bleeding. Roll back the change, disable the workflow, or pin to the last green release. Stop users from hitting the bad path first; explain second.
  2. Reproduce from observability. The trace store should let an engineer replay the exact failed run end-to-end. If it can't, the observability layer is the next thing that gets fixed.
  3. Add a fixture. Whatever broke gets a fixture in the eval harness so the regression gate catches it next time. The fixture is the receipt for the incident.
  4. Postmortem. Blameless write-up: what happened, why it wasn't caught, what changes prevent the class of failure (not just this instance).
  5. Communicate. To affected users, to internal stakeholders, to regulators when relevant. Templated by class of incident, not improvised under pressure.
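
Steps 2 and 3 can be sketched together: take the trace of the failed run and turn it into a fixture the eval harness replays on every release. This is an illustration under assumptions, not a prescribed format. The trace fields (incident_id, prompt), the Fixture type, and the forbidden-phrase check are all hypothetical; a real harness would likely score outputs against a rubric.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Fixture:
    incident_id: str                    # links the fixture back to its postmortem
    prompt: str                         # the input that triggered the failure
    forbidden_phrases: Tuple[str, ...]  # lowercased text the output must not contain


def fixture_from_trace(trace: Dict[str, str], forbidden_phrases: List[str]) -> Fixture:
    """Build a regression fixture from a stored trace (hypothetical schema)."""
    return Fixture(
        incident_id=trace["incident_id"],
        prompt=trace["prompt"],
        forbidden_phrases=tuple(p.lower() for p in forbidden_phrases),
    )


def passes(fixture: Fixture, output: str) -> bool:
    """Return True if the output contains none of the forbidden phrases."""
    text = output.lower()
    return not any(phrase in text for phrase in fixture.forbidden_phrases)


if __name__ == "__main__":
    trace = {"incident_id": "INC-1", "prompt": "What is the refund window?"}
    fx = fixture_from_trace(trace, ["guaranteed refund"])
    print(passes(fx, "Refunds are reviewed case by case."))  # True
    print(passes(fx, "You get a Guaranteed Refund."))        # False
```

Keeping the incident identifier on the fixture is what makes it a receipt: anyone who sees the check fail later can find the write-up that explains why it exists.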

The prevention pattern.

The eval harness, the regression gate, the token budget, and the observability trace are the four layers that turn most potential AI incidents into caught regressions before they hit production. Every incident that does happen becomes a new fixture in the harness; the system gets monotonically stronger over time. Without those four layers, nothing bounds how often AI incidents occur or how much they cost.
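
As a rough sketch of the regression-gate layer, the function below takes per-fixture pass/fail results and decides whether a release may proceed. GateDecision, evaluate_gate, and the override_by parameter are hypothetical names. Recording who overrode a failing gate is an assumption added here so that a gate-blow leaves an audit trail; it is not a description of any particular tool.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    failed: Tuple[str, ...]              # names of fixtures that failed, sorted
    overridden_by: Optional[str] = None  # set only when a failing gate was overridden


def evaluate_gate(results: Dict[str, bool],
                  override_by: Optional[str] = None) -> GateDecision:
    """Allow the release only if every fixture passed, or an override is recorded."""
    failed = tuple(sorted(name for name, ok in results.items() if not ok))
    if not failed:
        return GateDecision(allowed=True, failed=failed)
    if override_by:
        # The release proceeds, but the decision carries the evidence of the override.
        return GateDecision(allowed=True, failed=failed, overridden_by=override_by)
    return GateDecision(allowed=False, failed=failed)


if __name__ == "__main__":
    print(evaluate_gate({"INC-1": True, "INC-2": False}))
    print(evaluate_gate({"INC-1": True, "INC-2": False}, override_by="release-manager"))
```

An override that returns the failed fixtures alongside the approver's name gives the postmortem something to replay, rather than a silent pass.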

Frequently asked.

What is an AI incident?
An AI incident is a production failure of an AI system serious enough to warrant a structured response. Four common classes: output incident (bad answer reached a user), gate-blow (override of a failing release gate), cost incident (runaway spend), and compliance incident (PII leak, jurisdiction failure, missing audit trail).
How is AI incident response different from SRE incident response?
The structure is similar (stop the bleeding, reproduce, add a fixture, postmortem, communicate), but the failure modes are AI-specific. You replay from the observability trace rather than from logs. You add a fixture to the eval harness rather than a regression test. The postmortem confronts prompt and rubric changes, not just code changes.
How do you prevent AI incidents?
Four layers in combination: the eval harness (catches regressions before release), the regression gate (blocks the merge), the token budget (caps cost incidents), and the observability trace (makes the rest of the postmortem possible). Each incident that does happen becomes a new fixture; the system gets monotonically stronger over time.
Who runs an AI incident response?
The engineering lead drives the technical response. Product owns user communication. Legal/compliance is looped in on the compliance-incident class. The Morvion incident template covers all three lanes; templates beat improvising under pressure.