A regression gate is the artefact that prevents Thursday-night rollbacks from becoming routine. It compares every release's per-metric eval scores against the last-blessed baseline and fails the PR check when any metric drops past its declared tolerance.

Anatomy of a gate.

  • Baseline. Per-metric mean scores from the last release that passed review. Stored alongside the eval config.
  • Tolerance. Per-metric maximum acceptable drop. Tight (0.00) for structural metrics, looser (0.05–0.10) for LLM-graded metrics that carry more score variance.
  • Comparator. The CI step that runs the eval harness, computes the current per-metric means, subtracts each one from its baseline value, and exits non-zero when any drop exceeds that metric's tolerance.
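
The three parts above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the metric names, scores, and the `find_regressions` and `gate` helpers are hypothetical, and the small `eps` is an assumption added so that a drop exactly equal to the tolerance is not failed by floating-point rounding.

```python
# Hypothetical baseline and tolerances; in practice these would be
# loaded from a file stored alongside the eval config.
BASELINE = {"schema_validity": 1.00, "answer_quality": 0.82}
TOLERANCE = {"schema_validity": 0.00, "answer_quality": 0.05}


def find_regressions(current, baseline, tolerance, eps=1e-9):
    """Return {metric: drop} for every metric whose drop exceeds its tolerance."""
    failures = {}
    for metric, base_score in baseline.items():
        # Assumption: a metric missing from the current run counts as a score of 0.
        drop = base_score - current.get(metric, 0.0)
        if drop > tolerance[metric] + eps:
            failures[metric] = drop
    return failures


def gate(current):
    """Return a process exit code: 0 when the gate passes, 1 when it fails."""
    failures = find_regressions(current, BASELINE, TOLERANCE)
    for metric, drop in sorted(failures.items()):
        print(f"FAIL {metric}: dropped {drop:.3f}, tolerance {TOLERANCE[metric]:.2f}")
    return 1 if failures else 0
```

A CI step would pass the harness output to `gate` and hand the result to `sys.exit`, so the PR check fails whenever the function returns 1.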

How to set a tolerance.

Run the eval harness five times against the same code, measure the standard deviation of each metric, and set the tolerance to at least twice that value. Below the noise floor, the gate fires on randomness; at 2× the noise floor or more, it catches real drops with few false positives.
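
As a rough sketch of that procedure, the snippet below takes the per-metric scores from repeated runs and returns twice the standard deviation for each metric. The `suggest_tolerances` helper and the sample scores are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption.

```python
import statistics


def suggest_tolerances(runs, multiplier=2.0):
    """Map each metric to multiplier x its standard deviation across runs.

    `runs` is a list of {metric: score} dicts, one per harness run
    against the same, unchanged code.
    """
    metrics = runs[0].keys()
    return {
        metric: multiplier * statistics.stdev([run[metric] for run in runs])
        for metric in metrics
    }


# Placeholder scores standing in for five real harness runs.
runs = [
    {"schema_validity": 1.0, "answer_quality": 0.80},
    {"schema_validity": 1.0, "answer_quality": 0.82},
    {"schema_validity": 1.0, "answer_quality": 0.84},
    {"schema_validity": 1.0, "answer_quality": 0.81},
    {"schema_validity": 1.0, "answer_quality": 0.83},
]
tolerances = suggest_tolerances(runs)
```

A deterministic metric such as schema validity shows no variance, so its suggested tolerance is zero, while a noisy LLM-graded metric gets a wider band. The result is a floor: the tolerance can be set higher, but not lower.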

Bypass is the failure mode.

The first time a team merges a regression with "we'll fix it next sprint", the gate becomes theatre. The discipline is to either fix the regression or update the baseline with an explicit PR-level explanation of why the new score is acceptable. Both actions are reversible. Bypass without explanation is not.
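
One way to make that discipline mechanical is to refuse any baseline update that arrives without a written reason. The sketch below is illustrative only: the history list, the entry shape, and the `bless_baseline`, `current_baseline`, and `revert_baseline` helpers are all hypothetical. It keeps every blessed baseline so that an update can be undone.

```python
def bless_baseline(new_scores, reason, history):
    """Return a new history with the blessed baseline appended.

    Raises ValueError when no explanation is given.
    """
    if not reason or not reason.strip():
        raise ValueError("baseline update needs a written explanation")
    entry = {"scores": dict(new_scores), "reason": reason.strip()}
    return history + [entry]


def current_baseline(history):
    """Return the scores of the most recently blessed baseline."""
    return history[-1]["scores"]


def revert_baseline(history):
    """Undo the latest blessing, keeping at least one baseline in place."""
    if len(history) < 2:
        raise ValueError("no earlier baseline to revert to")
    return history[:-1]
```

Because each function returns a new list instead of mutating the old one, the previous baseline stays available until the change is reviewed.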

Frequently asked.

What is a regression gate in AI deployment?
A regression gate is an automated CI check that runs the eval harness against the candidate release, computes per-metric mean scores, and fails the PR if any metric drops past its tolerance versus the saved baseline. It is the only artefact that reliably prevents silent quality regressions when prompts, models, or retrieval logic change.
How do I choose the tolerance for a regression gate?
Run the eval harness five times against unchanged code and measure the standard deviation of each metric. Set the tolerance at no less than twice that. Structural metrics (schema validity, banned tokens) typically take a 0.00 tolerance; LLM-graded metrics typically need 0.05 to 0.10 because grader variance is real.
When should I update the baseline?
Only after a release has been reviewed, is intended for production, and has a new score that can be explained. Baseline updates are decisions, not chores. Auto-updating the baseline on every green run defeats the purpose by erasing the historical anchor against which regressions are measured.
What happens when the gate fires?
Read the per-fixture report, identify the failing fixtures, classify the cause as bug, traffic shift, or acceptable trade-off, then either fix the bug, refresh the fixture set with a PR-level note, or widen the tolerance with a PR-level note. Never bypass the gate without writing down why.
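
The first triage step, finding the failing fixtures, can be sketched as a per-fixture diff. This assumes the harness can emit a score per fixture for a given metric; the `failing_fixtures` helper, the fixture IDs, and the `min_drop` threshold are hypothetical and separate from the per-metric tolerance the gate itself enforces.

```python
def failing_fixtures(baseline_scores, current_scores, min_drop):
    """Return (fixture_id, drop) pairs with a drop above min_drop, worst first."""
    drops = []
    for fixture_id, base in baseline_scores.items():
        if fixture_id not in current_scores:
            # Removed or renamed fixtures need separate handling.
            continue
        drop = base - current_scores[fixture_id]
        if drop > min_drop:
            drops.append((fixture_id, drop))
    return sorted(drops, key=lambda pair: pair[1], reverse=True)


# Placeholder per-fixture scores for one metric.
baseline_scores = {"refund_policy": 1.0, "late_delivery": 0.9, "gift_card": 0.8}
current_scores = {"refund_policy": 1.0, "late_delivery": 0.5, "gift_card": 0.7}
worst_first = failing_fixtures(baseline_scores, current_scores, min_drop=0.05)
```

Sorting by the size of the drop puts the fixtures most likely to explain the regression at the top of the list, which is where classification into bug, traffic shift, or acceptable trade-off starts.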