Eval harness · Morvion Glossary

An eval harness is the artefact that turns "does this AI feel better?" into a number. It is a fixed fixture set, a written definition of "good", and a scoring run that produces a score per metric, comparable across releases.

The three parts.

Fixtures. 50–200 real-traffic inputs paired with the expected output shape. Sampled from logs, redacted, labelled. Synthetic fixtures lie: they encode the author's assumptions about traffic, not the traffic itself.
Rubrics. The written definition of "good" per fixture class. Deterministic when the truth is structural (schema, field match, banned tokens), LLM-graded when the truth is feel-based (tone, faithfulness), human-graded for high-stakes domains.
Scoring. The runner that pipes fixtures through the system under test, applies every applicable rubric, aggregates per-metric scores, and emits a structured report. The CI version of this runner is a regression gate.
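
A minimal sketch of how the three parts fit together, assuming deterministic rubrics only. Every name here (Fixture, schema_match, field_match, run_harness, stub_system) is hypothetical and chosen for illustration; none of it comes from the Morvion Eval Spec. The stub stands in for the real system under test.

```python
import json
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Fixture:
    fixture_class: str  # selects which rubrics apply
    input_text: str     # redacted real-traffic input
    expected: dict      # expected output shape


# A rubric maps (fixture, system output) to a score in [0, 1].
Rubric = Callable[[Fixture, dict], float]


def schema_match(fx: Fixture, out: dict) -> float:
    """1.0 when the output has exactly the expected keys, else 0.0."""
    return 1.0 if set(out) == set(fx.expected) else 0.0


def field_match(fx: Fixture, out: dict) -> float:
    """Fraction of expected fields whose values match exactly."""
    keys = list(fx.expected)
    if not keys:
        return 1.0
    return sum(out.get(k) == fx.expected[k] for k in keys) / len(keys)


# Rubrics are registered per fixture class (illustrative registry).
RUBRICS: dict[str, dict[str, Rubric]] = {
    "extraction": {"schema": schema_match, "field_match": field_match},
}


def run_harness(fixtures, system, rubrics=RUBRICS) -> dict[str, float]:
    """Pipe fixtures through the system, apply rubrics, return per-metric means."""
    scores: dict[str, list[float]] = {}
    for fx in fixtures:
        out = system(fx.input_text)
        for metric, rubric in rubrics.get(fx.fixture_class, {}).items():
            scores.setdefault(metric, []).append(rubric(fx, out))
    return {metric: mean(values) for metric, values in scores.items()}


def stub_system(text: str) -> dict:
    # Stand-in for the system under test; it always reports amount "0".
    return {"name": text.split()[0], "amount": "0"}


fixtures = [
    Fixture("extraction", "Alice owes 0", {"name": "Alice", "amount": "0"}),
    Fixture("extraction", "Bob owes 12", {"name": "Bob", "amount": "12"}),
]

report = run_harness(fixtures, stub_system)
print(json.dumps(report, sort_keys=True))
# prints {"field_match": 0.75, "schema": 1.0}
```

An LLM-graded or human-graded rubric would slot into the same Rubric signature; only the body of the scoring function changes.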

Why a harness is non-optional.

AI outputs are non-deterministic. Without a harness, model swaps, prompt changes, and retrieval refactors regress silently. The harness is the only artefact in an AI project that survives those changes unchanged, and the only objective answer to "is this better than last week?".

Build it first.

The most expensive AI bug is the one that ships because nobody noticed a regression. The harness is the cheapest line item when written first and the most expensive omission when added after launch. The order is fixtures, then rubrics, then the agent.

Frequently asked.

What is an eval harness in AI development?

An eval harness is a deterministic test apparatus for a non-deterministic system. It contains a fixed fixture set (real-traffic inputs paired with expected outputs), written rubrics (the definition of good per fixture class), and a scoring run that emits a per-metric number comparable across releases. It is the artefact that turns "does this AI feel better?" into evidence.

How many fixtures does an eval harness need?

Fifty to two hundred fixtures is enough to start, sampled from real production traffic and redacted before labelling. Below fifty, score variance dominates signal and the gate fires on noise. As an illustration, for a pass/fail metric sitting near 80%, the standard error of the mean is about 5.7 points at 50 fixtures and about 2.8 points at 200. Above two hundred, marginal value drops unless your traffic mix is unusually diverse.

What's the difference between an eval harness and a unit test?

A unit test asserts a fixed boolean against a deterministic function. An eval harness scores a probabilistic output against a rubric that may itself be probabilistic (LLM-graded). The harness aggregates many such scores into per-metric means and compares those means against a baseline; that comparison is what gates a release.

Where does the eval harness sit in CI?

As a required PR check. Every release runs the harness against the latest fixture set, compares per-metric scores to the saved baseline, and fails the check if any tolerance is breached. The Morvion Eval Spec ships a reference workflow at github.com/aloalads/eval-spec/.github/workflows/eval.yml.
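
The baseline comparison described above can be sketched as a small gate function. This is an illustration under stated assumptions, not the reference workflow from the Morvion Eval Spec: the function name check_regressions, the metric names, the 0.02 default tolerance, and the sample scores are all hypothetical.

```python
EPS = 1e-9  # absorbs floating-point noise when a drop equals the tolerance


def check_regressions(current, baseline, tolerances, default_tolerance=0.02):
    """Return one message per metric that fell more than its tolerance
    below the saved baseline. An empty list means the gate passes."""
    failures = []
    for metric, base in sorted(baseline.items()):
        if metric not in current:
            # A metric that disappears from the run counts as a failure.
            failures.append(f"{metric}: missing from current run")
            continue
        allowed = tolerances.get(metric, default_tolerance)
        drop = base - current[metric]
        if drop > allowed + EPS:
            failures.append(
                f"{metric}: {base:.3f} -> {current[metric]:.3f} "
                f"(drop {drop:.3f} > tolerance {allowed:.3f})"
            )
    return failures


# Hypothetical scores: the saved baseline and the current run.
baseline = {"schema": 1.0, "field_match": 0.80}
current = {"schema": 1.0, "field_match": 0.75}

# Structural metrics get zero tolerance; others use the default.
failures = check_regressions(current, baseline, {"schema": 0.0})
for line in failures:
    print(line)
# prints field_match: 0.800 -> 0.750 (drop 0.050 > tolerance 0.020)

# A CI script would pass this to sys.exit so the required check fails.
exit_code = 1 if failures else 0
```

Only drops are gated here; improvements pass silently. Whether to refresh the baseline after an improvement is a policy choice the sketch leaves open.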

