AI demos look the same in week one and week nine. The system in production never does. The bridge between the two, the part most teams discover only after the launch has slipped, is an eval harness: a deterministic test suite for a non-deterministic system. Built first, it is the cheapest line item on the project. Built last, it is the reason the project is six weeks late.

The demo trap.

Every AI engagement we have ever audited had the same shape in week three: a beautiful demo, a stakeholder smiling, and a private spreadsheet someone is keeping of every weird output they have seen the model produce. The demo passes because the demo is one input. Production fails because production is ten thousand of them, and nobody wrote down what “good” means.

The pattern repeats across surfaces: a sales copilot that drafts beautifully for the top five accounts and hallucinates for the next fifty; a CRM enrichment agent that nails public companies and guesses on private ones; a customer-support summariser that mostly works except on the tickets that actually need escalation. Without an eval, you cannot tell whether the system is improving, regressing, or just trading one bug for another.

"An AI system without evals is a vibe. A vibe is not a product."

What an eval harness actually is.

An eval harness is the AI equivalent of an integration test suite, but for outputs that aren't binary. It takes a fixed set of inputs, runs them through the system, scores the outputs against a written rubric, and reports a number. Because the fixtures and the rubric stay fixed, that number is the only thing in the project that stays comparable across a model swap, a prompt change, or a retrieval refactor.
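To make the loop concrete, here is a minimal sketch of it in Python. Everything named here is an assumption for illustration: the `Fixture` shape, the intent-classification task, the JSON output format, and `fake_system`, which stands in for whatever model + prompt + retrieval pipeline is actually under test. The rubric is the simplest deterministic kind: does the output parse, and does it carry the expected label?

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Fixture:
    """One fixed input plus the label a good output must match (hypothetical shape)."""
    input_text: str
    expected_intent: str


def score_output(raw_output: str, fixture: Fixture) -> float:
    """Deterministic rubric: 1.0 if the output is a JSON object with the expected intent."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if parsed.get("intent") == fixture.expected_intent else 0.0


def run_eval(system: Callable[[str], str], fixtures: list[Fixture]) -> float:
    """Run every fixture through the system and return the mean score."""
    if not fixtures:
        raise ValueError("fixture set is empty")
    scores = [score_output(system(f.input_text), f) for f in fixtures]
    return sum(scores) / len(scores)


def fake_system(text: str) -> str:
    """Hypothetical stand-in for the real pipeline; it only knows about refunds."""
    intent = "refund" if "refund" in text.lower() else "other"
    return json.dumps({"intent": intent})


FIXTURES = [
    Fixture("I want a refund for my order", "refund"),
    Fixture("Where is my parcel?", "shipping"),
    Fixture("Refund please", "refund"),
    Fixture("How do I reset my password?", "other"),
]

accuracy = run_eval(fake_system, FIXTURES)
print(f"accuracy: {accuracy:.2f}")  # prints "accuracy: 0.75"
```

The stand-in misses the shipping ticket, so three of four fixtures pass. Swapping `fake_system` for a different model or prompt changes the score but not how it is computed, which is the point.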

The number does not have to be a single metric. In practice the harness produces a small basket: accuracy on a labeled set, calibration on a confidence distribution, refusal rate on out-of-scope inputs, latency at the 95th percentile, cost per call. Each of those gets a target. The project ships when all targets are green and stays shipped only as long as they remain green.
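A basket of metrics with targets might be represented roughly as follows. This is a sketch under stated assumptions: the metric names, the threshold values, and the nearest-rank definition of the 95th percentile are all illustrative choices, and calibration is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A threshold for one metric; the direction says which side counts as green."""
    threshold: float
    higher_is_better: bool

    def is_green(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


# Illustrative targets only; real thresholds come from the product requirements.
TARGETS = {
    "accuracy": Target(0.90, higher_is_better=True),
    "refusal_rate_out_of_scope": Target(0.95, higher_is_better=True),
    "latency_p95_s": Target(2.0, higher_is_better=False),
    "cost_per_call_usd": Target(0.02, higher_is_better=False),
}


def scoreboard(measured: dict[str, float]) -> dict[str, bool]:
    """Map each targeted metric to True (green) or False (red)."""
    missing = TARGETS.keys() - measured.keys()
    if missing:
        raise KeyError(f"metrics not measured: {sorted(missing)}")
    return {name: TARGETS[name].is_green(measured[name]) for name in TARGETS}


# Made-up measurements: nine fast calls and one slow one.
latencies = [0.8, 0.9, 1.1, 1.0, 1.2, 0.7, 1.3, 0.9, 1.0, 2.6]
measured = {
    "accuracy": 0.93,
    "refusal_rate_out_of_scope": 0.97,
    "latency_p95_s": p95(latencies),
    "cost_per_call_usd": 0.015,
}
board = scoreboard(measured)
ready_to_ship = all(board.values())
```

In this made-up run accuracy, refusal rate, and cost are green, but the single slow call pushes the 95th-percentile latency to 2.6 seconds, so `ready_to_ship` is `False`. A single averaged score would have hidden that.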

The three layers.

  1. Fixture layer. A curated dataset of real inputs, labeled with what a good output looks like — see eval fixture. 50 to 200 examples is enough to start. Source them from real traffic, not from imagined personas. If you can't produce a fixture set, you don't understand the problem yet.
  2. Rubric layer. The written definition of “good” for each input class — formally, an eval rubric. Sometimes deterministic (does the JSON parse? does the function call match?), sometimes LLM-graded (does the draft hit the brief?), occasionally human-graded (does the customer reply feel like us?). Each rubric is versioned alongside the prompts.
  3. Regression layer. A baseline number for every metric, stored on every release. A regression gate lets a release ship only when no metric regresses past a defined tolerance. This is the single most valuable artifact in the system, and the one teams skip first.
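The regression layer in particular is small enough to sketch. The version below assumes the baseline is a JSON file of metric names to numbers kept in version control; the file layout, the metric names, and the tolerance values are hypothetical.

```python
import json
from pathlib import Path

# How far each metric may move in the bad direction before the gate closes.
# Illustrative values, not recommendations.
TOLERANCE = {"accuracy": 0.01, "latency_p95_s": 0.10}
HIGHER_IS_BETTER = {"accuracy": True, "latency_p95_s": False}


def find_regressions(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Return one message per metric that got worse by more than its tolerance."""
    failures = []
    for name, base in baseline.items():
        if name not in current:
            failures.append(f"{name}: missing from current run")
            continue
        delta = current[name] - base
        # Normalise so that a positive value always means "worse".
        worsening = -delta if HIGHER_IS_BETTER[name] else delta
        if worsening > TOLERANCE[name]:
            failures.append(f"{name}: {base} -> {current[name]}")
    return failures


def gate(baseline_path: Path, current: dict[str, float]) -> bool:
    """Load the stored baseline and report whether the release may ship."""
    baseline = json.loads(baseline_path.read_text())
    failures = find_regressions(baseline, current)
    for line in failures:
        print("REGRESSION", line)
    return not failures


# In-memory example: accuracy drops by 0.02, latency rises by 0.05 s.
example_baseline = {"accuracy": 0.92, "latency_p95_s": 1.8}
example_current = {"accuracy": 0.90, "latency_p95_s": 1.85}
example_failures = find_regressions(example_baseline, example_current)
```

Here the accuracy drop exceeds its tolerance and is reported, while the latency change stays inside its tolerance and passes. In CI, the boolean from `gate` would decide whether the merge is blocked.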

Build the eval before the agent.

The order matters. If you build the agent first and the eval second, the eval ends up shaped around what the current agent happens to do well. The harness becomes a flatterer. Build the eval first, and the agent has a target to optimise against instead of a vibe to chase.

A two-day investment up front, defining 100 fixtures and a two-page rubric, saves six weeks of “is this better or worse?” debates later. Our intelligent systems engagements start here, before any prompt or retrieval pipeline is written, for exactly that reason.

Field rule

If the engineering lead cannot show you the eval script and the scoreboard from last week's release, the AI system in question is not shippable. It is just running.

Three mistakes we keep seeing.

  1. The single-example trap. A team picks one impressive output, ships, and treats every variation as anecdotal. The eval needs distribution, not specimens.
  2. Vibe grading. Outputs reviewed in Slack threads, scored by adjective (“feels better”, “sharper”, “more on-brand”). Use a rubric or use a coin.
  3. No regression gate. Every release is judged on its own outputs, never against the prior baseline. The team feels productive while the system silently drifts.

A field checklist before any AI ships.

  • A fixture set sourced from real traffic, at least 50 items.
  • A rubric document the engineering and product leads both signed.
  • A baseline number for every metric, stored in version control.
  • A release gate that blocks merges on regression past tolerance.
  • A weekly review of the lowest-scoring 10 outputs, by name.

If any of those is missing, the system isn't finished, no matter what the dashboard says.
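The weekly review item needs very little tooling. As a rough sketch, assuming each eval run keeps a per-fixture record (the `ScoredOutput` shape and the fixture ids below are hypothetical), the worst outputs can be pulled out like this:

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredOutput:
    """One fixture's result from an eval run (hypothetical record shape)."""
    fixture_id: str
    score: float
    output: str


def lowest_scoring(results: list[ScoredOutput], n: int = 10) -> list[ScoredOutput]:
    """Return the n worst results, lowest score first; ties are ordered by fixture id."""
    return heapq.nsmallest(n, results, key=lambda r: (r.score, r.fixture_id))


# Made-up results for illustration.
results = [
    ScoredOutput("t-003", 0.2, "draft A"),
    ScoredOutput("t-001", 0.9, "draft B"),
    ScoredOutput("t-007", 0.2, "draft C"),
    ScoredOutput("t-004", 0.5, "draft D"),
    ScoredOutput("t-002", 1.0, "draft E"),
]
worst_ids = [r.fixture_id for r in lowest_scoring(results, n=3)]
```

Breaking ties by fixture id keeps the report stable from week to week, so the same failing fixtures show up in the same order until someone fixes them.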

The reference

For the canonical version of the fixture / rubric / regression model with worked examples for four common AI workflows, see the public Morvion Eval Spec. It is the methodology this article points at, written as something you can adopt in a single afternoon.

Common questions.

What is an AI eval harness?
A repeatable test suite for an AI system. It runs a fixed set of inputs through the current model + prompt + retrieval pipeline, scores each output against a written rubric, and produces metrics you can compare across releases. It is the AI equivalent of an integration test, adapted to non-deterministic outputs. The broader discipline is eval-driven AI.

How long does it take to build the first eval?
For a single-agent system, one to three days. Most of the time goes into curating fixtures from real traffic and writing the rubric. The scoring code itself is straightforward, and several mature libraries exist for the LLM-graded portion.

Can we add evals after the system ships?
You can, but it costs more. Without a stored baseline, the first eval run becomes the baseline by default, so any regression that happened before that run is invisible. We recommend retrofitting evals before the next significant release rather than waiting for a clear failure.

If you're scoping an AI workflow, copilot, or agentic system for production, we run the eval-first engagement shape end to end: start a conversation or read more about how we structure the work.