What is an AI incident?

An AI incident is a production failure of an AI system serious enough to warrant a structured response. Four common classes: output incident (bad answer reached a user), gate-blow (override of a failing release gate), cost incident (runaway spend), and compliance incident (PII leak, jurisdiction failure, missing audit trail).

How is AI incident response different from SRE incident response?

The structure is similar (stop the bleeding, reproduce, add a fixture, postmortem, communicate), but the failure modes are AI-specific. You replay the failed run from the observability trace rather than reconstructing it from logs. You add a fixture to the eval harness rather than a regression test. The postmortem examines prompt and rubric changes, not just code changes.

How do you prevent AI incidents?

Four layers in combination: the eval harness (catches regressions before release), the regression gate (blocks the merge), the token budget (caps cost incidents), and the observability trace (makes replay and the postmortem possible). Each incident that does happen becomes a new fixture, so the harness's coverage only grows over time.
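As a rough illustration of how a regression gate might decide whether to block a merge, the sketch below compares per-fixture scores from a candidate run against a baseline. The function name, the score dictionaries, and the tolerance parameter are assumptions made for this example, not part of any particular harness.

```python
def gate_passes(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> tuple[bool, list[str]]:
    """Compare candidate scores to baseline scores, fixture by fixture.

    Returns (passed, regressed_fixture_ids). A fixture that is missing
    from the candidate run is scored 0.0, so it counts as a regression
    instead of silently passing.
    """
    regressed = [
        fixture_id
        for fixture_id, base_score in baseline.items()
        if candidate.get(fixture_id, 0.0) < base_score - tolerance
    ]
    return (not regressed, regressed)
```

In a CI job, a false first element would typically translate into a non-zero exit code so the merge is blocked; the list of regressed fixture ids gives the reviewer a starting point.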

Who runs an AI incident response?

The engineering lead drives the technical response. Product owns user communication. Legal/compliance is looped in on the compliance-incident class. The Morvion incident template covers all three lanes; templates beat improvising under pressure.
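One way to make the lanes explicit is a small lookup that returns the owners for a given incident class. This is only a sketch: the lane keys and the class string "compliance" are hypothetical names chosen for the example, and it does not reproduce the Morvion incident template.

```python
# Lanes that apply to every incident class (hypothetical key names).
BASE_LANES = {
    "technical_response": "engineering lead",
    "user_communication": "product",
}


def owners_for(incident_class: str) -> dict[str, str]:
    """Return lane -> owner for one incident class."""
    owners = dict(BASE_LANES)  # copy so the shared mapping is never mutated
    if incident_class == "compliance":
        # Legal/compliance is only looped in on the compliance class.
        owners["compliance_review"] = "legal/compliance"
    return owners
```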

AI incident · Morvion Glossary

An AI incident is a production failure of an AI system serious enough to warrant a structured response. Bad output that reached a user. A regression that blew a release gate. A regulatory exposure. A runaway cost event. The pattern mirrors classical site-reliability incidents, but the failure modes are AI-specific and the postmortem looks different.

The four common classes.

Output incident. A wrong, harmful, or off-brand response reached a real user. The most visible class and usually the one customers complain about first.
Gate-blow. A release made it to production despite an eval-gate failure, because someone overrode the gate. The bad answer was inside the system the moment the override happened.
Cost incident. A bug or a regression caused runaway model spend. Without a token budget enforced at the gateway, this can reach tens of thousands of dollars before anyone notices.
Compliance incident. The system handled data, generated output, or took an action in a way that violates a policy or regulation. PII leak, jurisdiction failure, missing audit trail.
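The four classes above map naturally onto an enumeration, which keeps templates, routing, and reporting keyed on the same values. The sketch below is one possible shape; the field names (including trace_id as a pointer into the trace store) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class IncidentClass(Enum):
    OUTPUT = "output"          # wrong, harmful, or off-brand response reached a user
    GATE_BLOW = "gate_blow"    # release shipped despite a failing eval gate
    COST = "cost"              # runaway model spend
    COMPLIANCE = "compliance"  # policy or regulatory violation


@dataclass
class Incident:
    incident_class: IncidentClass
    summary: str
    trace_id: str  # hypothetical reference to the failed run in the trace store
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

Recording the class at the moment the incident is opened means later steps (which template to send, whether legal is involved) do not depend on someone remembering the category under pressure.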

The response, in five steps.

Stop the bleeding. Roll back the change, disable the workflow, or pin to the last green release. Stop users from hitting the bad path first; explain second.
Reproduce from observability. The trace store should let an engineer replay the exact failed run end-to-end. If it can't, the observability layer is the next thing that gets fixed.
Add a fixture. Whatever broke gets a fixture in the eval harness so the regression gate catches it next time. The fixture is the receipt for the incident.
Postmortem. Blameless write-up: what happened, why it wasn't caught, what changes prevent the class of failure (not just this instance).
Communicate. To affected users, to internal stakeholders, to regulators when relevant. Templated by class of incident, not improvised under pressure.
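Steps two and three above (reproduce, then add a fixture) can be sketched as a function that turns a stored trace into an eval fixture. Everything here is an assumption for illustration: the Trace fields, the fixture layout, and the simple "must not contain" assertion stand in for whatever a real trace store and rubric would provide.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical record of one failed run, as read from the trace store."""
    trace_id: str
    prompt_version: str
    model: str
    inputs: dict
    output: str


def fixture_from_trace(
    trace: Trace, incident_id: str, must_not_contain: list[str]
) -> dict:
    """Build an eval fixture that pins the inputs of the failed run."""
    return {
        "fixture_id": f"incident-{incident_id}",
        "source_trace": trace.trace_id,
        "prompt_version": trace.prompt_version,
        "model": trace.model,
        "inputs": trace.inputs,
        "bad_output": trace.output,  # kept as evidence for the postmortem
        "assertions": {"must_not_contain": must_not_contain},
    }


def check_output(fixture: dict, output: str) -> bool:
    """Return True if the output avoids every banned phrase (case-insensitive)."""
    banned = fixture["assertions"]["must_not_contain"]
    lowered = output.lower()
    return not any(phrase.lower() in lowered for phrase in banned)
```

A substring check is deliberately crude; a production harness would more likely score the output against a rubric. The point of the sketch is that the fixture carries the original inputs and a link back to the trace, so it works as the receipt for the incident.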

The prevention pattern.

The eval harness, the regression gate, the token budget, and the observability trace are the four layers that turn most potential AI incidents into caught regressions before they hit production. Every incident that does happen becomes a new fixture in the harness, so its coverage only grows over time. Without those four layers, nothing bounds how long an AI incident runs or how much it costs before someone notices.
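Of the four layers, the token budget is the simplest to illustrate. The sketch below shows a hard cap that refuses a request before the limit is crossed. The class and exception names are made up for this example, and it leaves out things a real gateway would need, such as per-window resets, per-tenant budgets, and thread safety.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push usage past the configured cap."""


class TokenBudget:
    def __init__(self, limit_tokens: int) -> None:
        if limit_tokens <= 0:
            raise ValueError("limit_tokens must be positive")
        self.limit_tokens = limit_tokens
        self.used_tokens = 0

    def charge(self, tokens: int) -> int:
        """Record usage and return the remaining budget.

        Raises BudgetExceeded without recording anything if the request
        would exceed the cap, so a runaway loop fails fast.
        """
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used_tokens + tokens > self.limit_tokens:
            raise BudgetExceeded(
                f"{self.used_tokens + tokens} tokens requested in total, "
                f"limit is {self.limit_tokens}"
            )
        self.used_tokens += tokens
        return self.limit_tokens - self.used_tokens
```

Checking before recording is the design choice that matters here: a rejected call leaves the counter unchanged, so the cap bounds spend instead of merely reporting it after the fact.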

Intelligent Systems & AI Infrastructure

Keep reading the glossary.

AI infrastructure

CRM intelligence

Immersive website

AI agent

Business intelligence dashboard

Client portal

Discovery sprint

Digital operating layer

Document intelligence

Eval-driven AI

Hospitality website

Marketplace platform

Multi-agent workflow

Real-time dashboard

Retrieval-augmented generation (RAG)

Prompt engineering

Vector database

AI observability

Embedding model

Fine-tuning

Vector search

Semantic search

Hallucination

Chain-of-thought

Function calling

Model distillation

Safety rails

Eval harness

Regression gate

Model Context Protocol (MCP)

Structured output

Agent tool use

Prompt injection

Agentic search

Observability traces

LLM guardrails

Agent handoff

Vector index

Token budget

Retrieval rerank

Embedding space

Semantic cache

Context window

Faithfulness

Cross-encoder

Model router

AI cost control

Agent memory

Structured extraction

AI evaluation framework

Retrieval quality

AI guardrail policy

Eval fixture

Eval rubric

Agent orchestration

Eval versioning

Model fallback

Fine-grained routing

AI policy version control