An eval harness is the artefact that turns "does this AI feel better?" into a number. It is a fixed fixture set, a written definition of "good", and a scoring run that produces a score per metric, comparable across releases.

The three parts.

  • Fixtures. 50–200 real-traffic inputs paired with the expected output shape. Sampled from logs, redacted, labelled. Synthetic fixtures lie.
  • Rubrics. The written definition of "good" per fixture class. Deterministic when the truth is structural (schema, field match, banned tokens), LLM-graded when the truth is feel-based (tone, faithfulness), human-graded for high-stakes domains.
  • Scoring. The runner that pipes fixtures through the system under test, applies every applicable rubric, aggregates per-metric scores, and emits a structured report. The CI version of this runner is a regression gate.
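
The three parts can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names Fixture, schema_rubric, banned_token_rubric, run_harness, and fake_system are hypothetical, the banned phrase is invented, and every rubric is applied to every fixture for brevity (a real runner would presumably filter rubrics by fixture class).

```python
import json
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Fixture:
    """One redacted, labelled input plus the expected output shape (hypothetical layout)."""
    fixture_id: str
    fixture_class: str
    input_text: str
    required_fields: tuple[str, ...]


# A rubric maps (fixture, raw output) to a score in [0, 1].
Rubric = Callable[[Fixture, str], float]


def schema_rubric(fixture: Fixture, output: str) -> float:
    """Deterministic rubric: output must be a JSON object with every required field."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if all(field in parsed for field in fixture.required_fields) else 0.0


BANNED_TOKENS = ("as an ai",)  # illustrative only


def banned_token_rubric(fixture: Fixture, output: str) -> float:
    """Deterministic rubric: fail if any banned phrase appears in the output."""
    lowered = output.lower()
    return 0.0 if any(token in lowered for token in BANNED_TOKENS) else 1.0


def run_harness(
    fixtures: list[Fixture],
    system_under_test: Callable[[str], str],
    rubrics: dict[str, Rubric],
) -> dict[str, float]:
    """Pipe each fixture through the system, score it, and return per-metric means."""
    per_metric: dict[str, list[float]] = {name: [] for name in rubrics}
    for fixture in fixtures:
        output = system_under_test(fixture.input_text)
        for name, rubric in rubrics.items():
            per_metric[name].append(rubric(fixture, output))
    return {name: mean(scores) for name, scores in per_metric.items() if scores}


def fake_system(input_text: str) -> str:
    """Stand-in for the system under test; always returns the same JSON."""
    return json.dumps({"intent": "refund", "order_id": "A-1"})


FIXTURES = [
    Fixture("f1", "refund", "I want my money back for order A-1", ("intent", "order_id")),
    Fixture("f2", "refund", "Refund A-1, it arrived broken", ("intent", "order_id", "reason")),
]
RUBRICS = {"schema": schema_rubric, "banned_tokens": banned_token_rubric}

report = run_harness(FIXTURES, fake_system, RUBRICS)
print(json.dumps(report, indent=2))  # the structured report a CI job would store
```

Here the second fixture fails the schema rubric because the stand-in system never emits a "reason" field, so the schema metric lands at 0.5 while the banned-token metric stays at 1.0.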

Why a harness is non-optional.

AI outputs are non-deterministic. Without a harness, model swaps, prompt changes, and retrieval refactors regress silently. The harness is the only artefact in an AI project that survives those changes unchanged, and the only objective answer to "is this better than last week?".

Build it first.

The most expensive AI bug is the one that ships because nobody noticed a regression. The harness is the cheapest line item when written first and the most expensive omission when added after launch. The order is fixtures, then rubrics, then the agent.

Frequently asked.

What is an eval harness in AI development?
An eval harness is a deterministic test apparatus for a non-deterministic system. It contains a fixed fixture set (real-traffic inputs paired with the expected output shape), written rubrics (the definition of "good" per fixture class), and a scoring run that emits a per-metric number comparable across releases. It is the artefact that turns "does this AI feel better?" into evidence.
How many fixtures does an eval harness need?
Fifty to two hundred fixtures is enough to start, sampled from real production traffic and redacted before labelling. Below fifty, score variance dominates signal and the gate fires on noise. Above two hundred, marginal value drops unless your traffic mix is unusually diverse.
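
A rough way to see why small fixture sets are noisy is the standard error of a pass rate. The sketch below assumes a pass/fail metric and treats fixtures as independent draws (a binomial model, which real traffic only approximates); the function name and the 0.8 pass rate are illustrative.

```python
import math


def pass_rate_standard_error(pass_rate: float, n_fixtures: int) -> float:
    """Standard error of an observed pass rate under a binomial model."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_fixtures)


for n in (20, 50, 200, 800):
    se = pass_rate_standard_error(0.8, n)
    print(f"n={n:>3}  standard error = {se:.3f}")
```

Under these assumptions, 50 fixtures at an 80% pass rate carry a standard error of about 0.057, and quadrupling to 200 only halves it to about 0.028. Each further halving costs four times as many fixtures, which is one way to read the diminishing returns above two hundred.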
What's the difference between an eval harness and a unit test?
A unit test asserts a fixed boolean against a deterministic function. An eval harness scores a probabilistic output against a rubric that may itself be probabilistic (LLM-graded). The harness aggregates many such scores into per-metric means and compares those means against a baseline; that comparison is what gates a release.
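
The contrast can be shown side by side. In this sketch, slugify is an arbitrary deterministic function, stochastic_system is a hypothetical stand-in for a model call, and grade is a stub where a real harness might call an LLM grader; none of these names come from any particular library.

```python
import random
from statistics import mean


def slugify(title: str) -> str:
    """A deterministic function: same input, same output, every time."""
    return title.lower().replace(" ", "-")


# Unit test: one fixed boolean.
assert slugify("Eval Harness") == "eval-harness"


def stochastic_system(prompt: str, rng: random.Random) -> str:
    """Hypothetical model stand-in that answers well roughly 80% of the time."""
    return "grounded answer" if rng.random() < 0.8 else "off-topic answer"


def grade(output: str) -> float:
    """Stub rubric returning a score in [0, 1]; a real one may be LLM-graded."""
    return 1.0 if output.startswith("grounded") else 0.0


# Eval: many scored outputs, aggregated into a per-metric mean.
rng = random.Random(0)  # seeded only so the sketch is repeatable
scores = [grade(stochastic_system(f"prompt {i}", rng)) for i in range(100)]
faithfulness_mean = mean(scores)
print(f"faithfulness mean over {len(scores)} fixtures: {faithfulness_mean:.2f}")
```

No single score in the list is asserted on. Only the mean is meaningful, and only relative to the mean recorded for the previous release.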
Where does the eval harness sit in CI?
As a required PR check. Every pull request runs the harness against the latest fixture set, compares per-metric scores to the saved baseline, and fails the check if any tolerance is breached. The Morvion Eval Spec ships a reference workflow at github.com/aloalads/eval-spec/.github/workflows/eval.yml.
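
The gate step itself can be small. The following is a sketch of the comparison logic only, not the contents of the reference workflow: the function names, metric names, scores, and tolerances are all hypothetical, and a tolerance is assumed here to mean the largest allowed drop below baseline.

```python
def find_breaches(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerances: dict[str, float],
) -> list[str]:
    """Return one message per metric that dropped by more than its allowed tolerance."""
    breaches = []
    for metric, baseline_score in baseline.items():
        allowed_drop = tolerances.get(metric, 0.0)  # no tolerance listed: no drop allowed
        current_score = current.get(metric)
        if current_score is None:
            breaches.append(f"{metric}: missing from current report")
        elif baseline_score - current_score > allowed_drop:
            breaches.append(
                f"{metric}: {baseline_score:.3f} -> {current_score:.3f} "
                f"(allowed drop {allowed_drop:.3f})"
            )
    return breaches


def gate(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerances: dict[str, float],
) -> int:
    """Print any regressions and return a process exit code (0 = pass, 1 = fail)."""
    breaches = find_breaches(baseline, current, tolerances)
    for breach in breaches:
        print("REGRESSION", breach)
    return 1 if breaches else 0


# Illustrative numbers only.
BASELINE = {"schema": 0.98, "faithfulness": 0.86}
CURRENT = {"schema": 0.97, "faithfulness": 0.79}
TOLERANCES = {"schema": 0.02, "faithfulness": 0.03}

exit_code = gate(BASELINE, CURRENT, TOLERANCES)
print("exit code:", exit_code)
# A CI entry point would end with: raise SystemExit(exit_code)
```

With these numbers the schema metric stays inside its tolerance and faithfulness does not, so the gate returns 1. A non-zero exit code is what most CI systems treat as a failed check.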