Eval fixture · Morvion Glossary

An eval fixture is the unit primitive of an eval harness: one input plus the labelled answer or rubric outcome the AI workflow is meant to produce. The fixture set is the collection of fixtures the harness runs on every release. Fixtures are the closest thing AI engineering has to integration tests.

The shape of a fixture.

Input. The actual prompt, document, query, or event the system would see in production. Verbatim, not paraphrased.
Expected output. Either the exact answer (deterministic fixtures) or the rubric criteria the answer must satisfy (LLM-graded fixtures).
Metadata. Source (production sample, manual, synthetic), date added, owner, tags. Helps slice scores by query class later.
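The three parts above can be sketched as a small record type. This is a minimal illustration, not a Morvion schema: the class name `Fixture`, its field names, and the sample query are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

ALLOWED_SOURCES = {"production", "manual", "synthetic"}


@dataclass(frozen=True)
class Fixture:
    """One eval fixture: verbatim input, expected outcome, and metadata.

    Field names are hypothetical; adapt them to your own harness.
    """
    input: str                            # verbatim production input, not paraphrased
    expected_exact: Optional[str] = None  # set for deterministic fixtures
    rubric: tuple[str, ...] = ()          # set for LLM-graded fixtures
    source: str = "production"            # one of ALLOWED_SOURCES
    added: date = field(default_factory=date.today)
    owner: str = ""
    tags: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # A fixture carries exactly one kind of expectation.
        has_exact = self.expected_exact is not None
        has_rubric = len(self.rubric) > 0
        if has_exact == has_rubric:
            raise ValueError("set either expected_exact or rubric, not both or neither")
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")


# Invented sample data, for illustration only.
refund = Fixture(
    input="Can I get a refund on an annual plan after 40 days?",
    rubric=("states the refund window", "does not promise a refund"),
    owner="billing-team",
    tags=("billing",),
)
```

Keeping tags and source on the record is what lets a harness slice scores by query class later, as the metadata item describes.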

Where fixtures come from.

The strongest fixtures are real production samples — actual queries customers asked, actual documents the system saw — labelled by a domain expert. Synthetic fixtures fill gaps (rare cases, adversarial inputs) but should never dominate the set. A harness where most fixtures are synthetic ends up optimising the agent for imagined queries instead of real ones.
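One way to keep an eye on this is to compute the source mix of the fixture set and flag it when synthetic fixtures become the majority. The sketch below assumes each fixture carries a source label of "production", "manual", or "synthetic"; the function names and the 0.5 threshold are assumptions chosen for the example, not a recommended limit.

```python
from collections import Counter


def source_mix(sources):
    """Return the fraction of fixtures per source label."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def synthetic_dominates(sources, max_share=0.5):
    """True when synthetic fixtures exceed max_share of the set.

    max_share=0.5 encodes "most fixtures are synthetic"; it is an assumed default.
    """
    return source_mix(sources).get("synthetic", 0.0) > max_share


# Invented example: 6 production samples and 2 synthetic gap-fillers.
labels = ["production"] * 6 + ["synthetic"] * 2
mix = source_mix(labels)  # {"production": 0.75, "synthetic": 0.25}
```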

“If you can't write a fixture set, you don't understand the workflow yet.”

How many fixtures.

50 to 200 to start, growing to a few hundred as the workflow matures. Quality far outranks quantity: a tight 80-fixture set sampled from real traffic beats a 2,000-fixture synthetic dataset. The fixture set is also where regressions get caught fastest, because every shipped bug becomes a fixture for the next release.
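Turning a shipped bug into a fixture can be as simple as recording the triggering input and the corrected answer. The following is a rough sketch under stated assumptions: the record layout, the "regression" tag, and the bug reference "BUG-123" are hypothetical, and storing fixtures as JSON Lines rows is one possible choice rather than a requirement.

```python
import json
from datetime import date


def fixture_from_bug(bug_input, expected_exact, bug_ref, today=None):
    """Build a regression fixture record from a shipped bug (illustrative field names)."""
    return {
        "input": bug_input,                # the verbatim input that triggered the bug
        "expected_exact": expected_exact,  # the corrected answer
        "source": "production",
        "added": (today or date.today()).isoformat(),
        "tags": ["regression", bug_ref],
    }


def to_jsonl_row(record):
    """Serialise one record as a single JSON Lines row."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


# Invented example data.
record = fixture_from_bug(
    bug_input="Invoice total for order 1001?",
    expected_exact="EUR 42.00",
    bug_ref="BUG-123",
    today=date(2024, 1, 2),
)
row = to_jsonl_row(record)
```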

Frequently asked.

What is an eval fixture?

An eval fixture is one input paired with its expected outcome in an evaluation harness: a real or representative example, along with the labelled answer or rubric outcome the AI workflow is meant to produce. The fixture set is the collection of fixtures the harness runs on every release.

How many fixtures do we need?

50 to 200 to start, growing to a few hundred as the workflow matures. Quality outranks quantity: a tight 80-fixture set sampled from real production traffic beats a 2,000-fixture synthetic dataset. Every shipped bug should become a fixture for the next release.

Where should fixtures come from?

Primarily from real production traffic, labelled by a domain expert. Synthetic fixtures fill gaps (rare cases, adversarial inputs) but should never dominate the set. A harness populated mostly by synthetic fixtures ends up optimising the agent for imagined queries instead of real ones.

What's the difference between a fixture and a unit test?

A unit test asserts deterministic behaviour: the output must equal an expected value. A fixture is graded against a rubric (either deterministic or LLM-graded), because AI output can be correct without matching a reference string exactly. The shape is similar (input plus expected outcome), but the assertion is a score against a rubric rather than equality against a string.
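The contrast between equality and rubric scoring can be sketched in a few lines. This is an illustration under assumptions: the function names and sample strings are invented, and plain Python predicates stand in for the rubric checks that an LLM grader would perform in an LLM-graded harness.

```python
def unit_test_passes(output, expected):
    """Unit-test style: the output must equal the expected string."""
    return output == expected


def rubric_score(output, checks):
    """Fixture style: fraction of rubric checks the output satisfies.

    checks is a list of callables taking the output string and returning bool.
    """
    if not checks:
        raise ValueError("a rubric needs at least one check")
    passed = sum(1 for check in checks if check(output))
    return passed / len(checks)


# Invented example: a correct answer phrased differently from the reference.
reference = "You can get a refund within 30 days."
output = "Refunds are available within 30 days of purchase."

checks = [
    lambda out: "30 days" in out,         # states the refund window
    lambda out: "refund" in out.lower(),  # talks about refunds
    lambda out: len(out) <= 200,          # stays concise
]

exact_match = unit_test_passes(output, reference)  # False: wording differs
score = rubric_score(output, checks)               # 1.0: all three checks pass
```

A release gate would then compare the score against a threshold rather than asserting equality; the threshold itself is a per-workflow decision.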
