An eval fixture is the basic unit of an eval harness: one input plus the labelled answer or rubric outcome the AI workflow is meant to produce. The fixture set is the collection of fixtures the harness runs on every release. Fixtures are the closest thing AI engineering has to integration tests.
The shape of a fixture.
- Input. The actual prompt, document, query, or event the system would see in production. Verbatim, not paraphrased.
- Expected output. Either the exact answer (deterministic fixtures) or the rubric criteria the answer must satisfy (LLM-graded fixtures).
- Metadata. Source (production sample, manual, synthetic), date added, owner, tags. This makes it possible to slice scores by query class later.
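The three parts above can be sketched as a small record type. This is a minimal illustration, not a prescribed schema: the class name `Fixture`, the field names, and the rule that a fixture carries exactly one of an exact answer or a rubric are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Fixture:
    # Input: the verbatim prompt, document, query, or event.
    input_text: str
    # Expected output, option 1: the exact answer (deterministic fixture).
    expected_answer: Optional[str] = None
    # Expected output, option 2: rubric criteria (LLM-graded fixture).
    rubric: Tuple[str, ...] = ()
    # Metadata (hypothetical field names).
    source: str = "production"  # "production", "manual", or "synthetic"
    date_added: date = field(default_factory=date.today)
    owner: str = ""
    tags: Tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Assumption for this sketch: exactly one kind of expectation is set.
        has_answer = self.expected_answer is not None
        has_rubric = len(self.rubric) > 0
        if has_answer == has_rubric:
            raise ValueError("set exactly one of expected_answer or rubric")
        if self.source not in ("production", "manual", "synthetic"):
            raise ValueError(f"unknown source: {self.source!r}")

    @property
    def is_deterministic(self) -> bool:
        return self.expected_answer is not None


# An illustrative LLM-graded fixture; the query and criteria are made up.
refund_fixture = Fixture(
    input_text="Where is my refund for order 1182?",
    rubric=("States the refund status", "Gives an expected timeframe"),
    source="production",
    owner="support-team",
    tags=("refunds",),
)
```

Keeping the metadata on the record itself means a score report can later be grouped by `source` or `tags` without a separate lookup table.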
Where fixtures come from.
The strongest fixtures are real production samples — actual queries customers asked, actual documents the system saw — labelled by a domain expert. Synthetic fixtures fill gaps (rare cases, adversarial inputs) but should never dominate the set. A harness where most fixtures are synthetic ends up optimising the agent for imagined queries instead of real ones.
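One way to keep an eye on this is to compute the source mix of the fixture set and flag it when synthetic fixtures become the majority. The sketch below assumes each fixture's source is recorded as one of the strings `"production"`, `"manual"`, or `"synthetic"`; the function names and the 0.5 threshold are illustrative choices, not values from any particular harness.

```python
from collections import Counter
from typing import Dict, Iterable


def source_mix(sources: Iterable[str]) -> Dict[str, float]:
    """Return the share of the fixture set that comes from each source."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("fixture set is empty")
    return {source: n / total for source, n in counts.items()}


def synthetic_dominates(sources: Iterable[str], max_share: float = 0.5) -> bool:
    """True when synthetic fixtures exceed max_share of the set (assumed threshold)."""
    return source_mix(sources).get("synthetic", 0.0) > max_share


# Illustrative set: 6 production samples, 1 manual, 3 synthetic.
example_sources = ["production"] * 6 + ["manual"] + ["synthetic"] * 3
mix = source_mix(example_sources)
```

A check like this could run whenever fixtures are added, so a drift toward synthetic data is noticed before it skews the scores.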
“If you can't write a fixture set, you don't understand the workflow yet.”
How many fixtures.
Start with 50 to 200 fixtures and grow to a few hundred as the workflow matures. Quality far outranks quantity: a tight 80-fixture set sampled from real traffic is generally more useful than a 2,000-fixture synthetic dataset. The fixture set is also where regressions get caught fastest, provided every shipped bug becomes a fixture for the next release.
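Turning a shipped bug into a fixture can be as simple as recording the input that triggered it alongside the corrected output. The sketch below uses plain dictionaries; the helper names, the dictionary keys, and the bug reference `BUG-EXAMPLE` are hypothetical.

```python
from datetime import date
from typing import Dict, List, Optional


def fixture_from_bug(bug_input: str, correct_output: str, bug_ref: str,
                     owner: str, today: Optional[date] = None) -> Dict:
    """Build a regression fixture from the input that triggered a shipped bug."""
    return {
        "input": bug_input,          # verbatim input from the bug report
        "expected": correct_output,  # what the workflow should have produced
        "source": "production",
        "date_added": (today or date.today()).isoformat(),
        "owner": owner,
        "tags": ["regression", bug_ref],
    }


def add_regression(fixture_set: List[Dict], fixture: Dict) -> bool:
    """Append the fixture unless the same input is already covered."""
    if any(existing["input"] == fixture["input"] for existing in fixture_set):
        return False
    fixture_set.append(fixture)
    return True


fixture_set: List[Dict] = []
bug_fixture = fixture_from_bug(
    bug_input="Cancel my order and refund to the original card",
    correct_output="Order cancelled; refund issued to the original card.",
    bug_ref="BUG-EXAMPLE",  # placeholder, not a real ticket
    owner="support-team",
    today=date(2024, 1, 15),
)
added_first = add_regression(fixture_set, bug_fixture)
added_again = add_regression(fixture_set, bug_fixture)
```

Tagging these fixtures as `regression` makes it possible to report their pass rate separately from the rest of the set.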
Frequently asked.
- What is an eval fixture?
- An eval fixture is one input-and-expected-outcome pair in an evaluation harness: a real or representative example together with the labelled answer or rubric outcome the AI workflow is meant to produce. The fixture set is the collection of fixtures the harness runs on every release.
- How many fixtures do we need?
- 50 to 200 to start, growing to a few hundred as the workflow matures. Quality outranks quantity — a tight 80-fixture set sampled from real production traffic beats a 2,000-fixture synthetic dataset. Every shipped bug should become a fixture for the next release.
- Where should fixtures come from?
- Primarily from real production traffic, labelled by a domain expert. Synthetic fixtures fill gaps (rare cases, adversarial inputs) but should never dominate the set. A harness mostly populated by synthetic fixtures ends up optimising the agent for imagined queries.
- What's the difference between a fixture and a unit test?
- A unit test asserts deterministic behaviour: the output either equals the expected value or it does not. A fixture is graded against an expected outcome, which can be an exact answer (deterministic fixtures) or a rubric (LLM-graded fixtures), because AI output isn't always exact. The shape is similar (input plus expected outcome), but for rubric-graded fixtures the assertion is a score against the rubric rather than equality against a string.
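The contrast between the two kinds of assertion can be sketched in a few lines. In a real harness the rubric criteria would typically be judged by an LLM; here simple string predicates stand in for that judge so the example runs on its own. The function names, criteria, and sample output are all illustrative assumptions.

```python
from typing import Callable, Dict


def exact_match(output: str, expected: str) -> bool:
    """Unit-test style assertion: equality against a string."""
    return output.strip() == expected.strip()


def rubric_score(output: str, criteria: Dict[str, Callable[[str], bool]]) -> float:
    """Fixture-style assertion: fraction of rubric criteria the output satisfies."""
    if not criteria:
        raise ValueError("rubric has no criteria")
    passed = sum(1 for check in criteria.values() if check(output))
    return passed / len(criteria)


# Stand-in predicates; an LLM judge would replace these in practice.
criteria = {
    "mentions_refund": lambda o: "refund" in o.lower(),
    "gives_timeframe": lambda o: "business days" in o.lower(),
    "is_concise": lambda o: len(o) <= 200,
}

output = "Your refund was issued on 3 May and should arrive within 5 business days."

strict = exact_match(output, "Refund issued.")  # wording differs, so equality fails
score = rubric_score(output, criteria)          # all three criteria are met
```

The same output fails the equality check yet satisfies every criterion, which is why rubric-graded fixtures report a score and compare it against a pass threshold instead of asserting equality.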