The Morvion Eval Spec is a public reference for shipping AI systems with confidence. It defines the structure of an evaluation harness, the layers it must contain, the gating semantics that govern releases, and four worked examples for workflows we see in nearly every engagement. The shape is intentionally compact: anyone reading this should be able to start a harness for their own AI system within a single afternoon.
Why this spec exists.
AI outputs are non-deterministic. Without an eval harness, model swaps, prompt changes, and retrieval refactors regress silently. The harness is the only artifact in an AI project that survives those changes unchanged, and the only objective answer to the question “is this better than last week?” Morvion built this spec across engagements where we wrote the eval first, where we wrote it second, and where we didn't write one at all. The spec is the shape that worked.
“An AI system without evals is a vibe. A vibe is not a product.”
The three layers.
Every working eval harness has three layers. Skip any one and the harness becomes a flatterer instead of a referee.
Layer 1 · Fixtures.
A curated dataset of real inputs, labeled with what a good output looks like. 50 to 200 examples is enough to start. Source them from real traffic, not from imagined personas. If you cannot produce a fixture set, you do not understand the problem yet, and any AI system you build will reflect that gap.
Fixture shape (JSON, one record per example):
{
  "id": "fix-001",
  "input": {
    "company": "Aperitivo Bar Zurich",
    "industry": "hospitality",
    "context": "..."
  },
  "expected": {
    "category": "small-venue",
    "key_signals": ["live-music", "aperitivo", "zurich"],
    "draft_tone": "warm, specific, never generic"
  },
  "labels": {
    "difficulty": "medium",
    "edge_case": false
  },
  "source": "real-traffic-2026-04-12"
}
The fixture set lives in version control. New records are added whenever a real-world output surprises an operator. The set grows; old records are not removed unless the underlying workflow has fundamentally changed.
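A fixture set that lives in version control benefits from a structural check before it is used. The following is a minimal sketch in Python, assuming records arrive as JSON strings; the function names `validate_fixture` and `load_fixtures` and the specific validation rules are illustrative assumptions, not part of the spec.

```python
import json

# Top-level keys taken from the fixture record shown above. Treating all of
# them as required is an assumption made for this sketch.
REQUIRED_KEYS = {"id", "input", "expected", "labels", "source"}


def validate_fixture(record: dict) -> list[str]:
    """Return a list of problems found in one fixture record (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if not isinstance(record.get("id"), str) or not record.get("id"):
        problems.append("id must be a non-empty string")
    for section in ("input", "expected", "labels"):
        if section in record and not isinstance(record[section], dict):
            problems.append(f"{section} must be an object")
    return problems


def load_fixtures(raw_records) -> list[dict]:
    """Parse JSON strings into fixtures, rejecting invalid or duplicate records."""
    fixtures, seen_ids = [], set()
    for raw in raw_records:
        record = json.loads(raw)
        problems = validate_fixture(record)
        if record.get("id") in seen_ids:
            problems.append("duplicate id")
        if problems:
            raise ValueError(f"{record.get('id', '<no id>')}: {'; '.join(problems)}")
        seen_ids.add(record["id"])
        fixtures.append(record)
    return fixtures
```

Failing loudly on a duplicate `id` matters because the set only grows: a silently overwritten record would change the baseline without anyone noticing.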
Layer 2 · Rubrics.
The written definition of “good” for each input class. Three grading patterns, used in combination:
- Deterministic checks. Does the JSON parse? Does the function call match the schema? Does the output stay under the token budget? These are pass/fail.
- LLM-graded checks. A reference model (often a larger or different model than the production one) scores the output against a written rubric prompt. Score is a small integer (for example 0-2, 0-3, or 0-5) per dimension.
- Human-graded checks. For outputs where tone or judgment cannot be captured deterministically, periodic human review of a sampled subset. Human grades feed back into the LLM-grader rubric over time.
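The deterministic checks named above can each be a few lines of code. A minimal sketch in Python follows; the function names are hypothetical, the whitespace-based token count is a crude stand-in for the production tokenizer, and the schema check only verifies required keys rather than a full schema.

```python
import json


def parses_as_json(output: str) -> bool:
    """Pass if the raw model output is valid JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def within_token_budget(output: str, max_tokens: int) -> bool:
    """Pass if the output stays under the budget.

    Assumption: whitespace-separated words approximate tokens. A real harness
    would count with the same tokenizer the production model uses.
    """
    return len(output.split()) <= max_tokens


def has_required_fields(output: str, required: set[str]) -> bool:
    """Pass if the output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())
```

Each function returns a plain boolean, which keeps the result pass/fail and makes it trivial to map onto a 0 or 1 score in the rubric.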
Rubric shape (YAML, one document per workflow):
workflow: customer-reply-draft
version: 4
dimensions:
  - name: factual_correctness
    grader: deterministic
    check: cited_sources_match_retrieval
    weight: 0.4
  - name: tone_alignment
    grader: llm
    prompt: |
      Does this draft sound like the brand voice in the
      reference samples? Score 0-3 where 0=robotic,
      3=on-voice.
    weight: 0.3
  - name: actionability
    grader: llm
    prompt: |
      Does the draft propose a clear next step the
      customer can act on? Score 0-2.
    weight: 0.2
  - name: refusal_appropriateness
    grader: deterministic
    check: refused_when_out_of_scope
    weight: 0.1
Rubrics are versioned alongside the prompts. A prompt change that requires a new dimension bumps the rubric version too.
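The rubric gives each dimension a weight but uses different score ranges (pass/fail, 0-3, 0-2), so the scores need a common scale before they can be combined. One plausible way to do that is sketched below in Python: normalise each raw score to the range 0 to 1, then apply the weight. The normalise-then-weight rule, the `max_score` values, and the function name `weighted_score` are assumptions; the spec does not define how weights are combined.

```python
# Dimensions mirror the customer-reply-draft rubric above. max_score is an
# assumption: 1 for deterministic pass/fail, otherwise the top of the range
# stated in the rubric prompt.
RUBRIC = [
    {"name": "factual_correctness", "weight": 0.4, "max_score": 1},
    {"name": "tone_alignment", "weight": 0.3, "max_score": 3},
    {"name": "actionability", "weight": 0.2, "max_score": 2},
    {"name": "refusal_appropriateness", "weight": 0.1, "max_score": 1},
]


def weighted_score(raw_scores: dict[str, float], rubric=RUBRIC) -> float:
    """Combine per-dimension raw scores into one number between 0 and 1."""
    total_weight = sum(dim["weight"] for dim in rubric)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    score = 0.0
    for dim in rubric:
        raw = raw_scores[dim["name"]]
        if not 0 <= raw <= dim["max_score"]:
            raise ValueError(f"{dim['name']}: score {raw} outside 0-{dim['max_score']}")
        # Normalise to 0-1 first so a 0-3 dimension cannot outweigh a 0-1 one.
        score += dim["weight"] * raw / dim["max_score"]
    return score
```

Without the normalisation step, a dimension scored 0-3 would contribute three times as much as a pass/fail dimension with the same weight, which defeats the purpose of writing the weights down.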
Layer 3 · Regression gates.
A baseline number per metric, stored on every release. A new release ships only when no metric regresses past a defined tolerance versus the baseline. This is the single most valuable artifact in the system and the one teams skip first.
Gate shape (in CI, as a check on every PR):
# .github/workflows/eval-gate.yml (excerpt)
- name: Run eval harness
  run: morvion-evals run --workflow customer-reply --fixtures ./fixtures
- name: Check regression
  run: |
    morvion-evals compare \
      --current ./evals/current.json \
      --baseline ./evals/baseline.json \
      --tolerance 0.02
    # exits non-zero if any weighted dimension regressed
    # by more than 2% versus the stored baseline
The baseline is updated only when a release ships green and is accepted into main. The tolerance is specific to each workflow and dimension: a factual-correctness regression has a tighter band than a tone-alignment regression.
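The comparison behind the gate is small enough to sketch. The Python below is an illustration of the gating logic only, not the implementation of `morvion-evals compare`: it assumes scores are normalised to the range 0 to 1, treats the tolerance as an absolute drop, and uses a hypothetical `overrides` mapping to give individual dimensions a tighter band.

```python
# Guards against floating-point noise flagging a drop that is exactly at the
# tolerance (an assumption of this sketch).
EPSILON = 1e-9


def find_regressions(current, baseline, tolerance=0.02, overrides=None):
    """Return {dimension: drop} for each dimension that fell past its tolerance.

    current, baseline: dicts mapping dimension name to a score in [0, 1].
    overrides: optional per-dimension tolerances that replace the default.
    """
    overrides = overrides or {}
    regressions = {}
    for name, base_score in baseline.items():
        if name not in current:
            raise KeyError(f"dimension {name!r} missing from current results")
        drop = base_score - current[name]
        allowed = overrides.get(name, tolerance)
        if drop - allowed > EPSILON:
            regressions[name] = drop
    return regressions
```

A CI step built on this would exit non-zero whenever the returned dict is non-empty. Improvements never fail the gate, because only positive drops are compared against the tolerance.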
Deterministic vs. LLM-graded vs. human-graded.
The grading pattern follows the output shape. Use the simplest grader that can answer the question reliably. Three rules:
- Prefer deterministic when possible. If you can write a check in code, write a check in code. Deterministic graders are free, fast, and don't themselves regress.
- LLM-graders need their own version pinning. The grader model is part of the harness, not part of the production stack. Pin it. Re-evaluate the entire fixture set when you bump the grader.
- Human review is the audit layer. Even with strong LLM-graders, sample 5-10% of outputs weekly for human review. Disagreements feed back into the rubric.
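The weekly human-review sample can be drawn reproducibly so that two reviewers looking at the same week see the same outputs. A minimal sketch in Python, assuming each output has a string identifier; the function name `sample_for_review` and the seeding scheme are illustrative assumptions.

```python
import random


def sample_for_review(output_ids, fraction=0.05, seed=None):
    """Pick a random subset of output ids for human grading.

    fraction: share of outputs to review (the text above suggests 0.05-0.10).
    seed: fix this (for example to the ISO week number) to make the draw
    reproducible.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = sorted(output_ids)  # stable order so the same seed gives the same sample
    if not ids:
        return []
    sample_size = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, sample_size)
```

Sampling at random rather than reviewing only flagged outputs is what makes this an audit: it can surface cases where the LLM-grader was confidently wrong.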
Worked examples · four common workflows.
These are the four AI workflows we see in nearly every engagement, with the eval shape that has worked for each.
Example 1 · CRM enrichment.
- Fixtures: 120 real lead records, half public companies, half private. Labeled with the enrichment fields a human would write.
- Rubric: per-field accuracy (deterministic match), plus an LLM-graded check for whether the enrichment paragraph reads as natural prose rather than concatenated bullet points.
- Regression: per-field accuracy must stay ≥ baseline. Naturalness can fluctuate ±0.3 points.
- Common failure mode caught by the harness: the model hallucinating funding rounds for private companies. Caught by a deterministic check against a public funding-data source.
Example 2 · Customer reply drafting.
- Fixtures: 80 real inbound messages with the operator's actual reply as the reference.
- Rubric: factual correctness (deterministic against retrieval), tone alignment (LLM-graded against brand-voice samples), actionability (LLM-graded), refusal appropriateness (deterministic).
- Regression: factual correctness ≥ baseline. Tone may drift up to 0.2 points. Refusal must be 100% on out-of-scope inputs.
- Common failure mode caught by the harness: the model trying to answer policy questions it shouldn't. Caught by the refusal-appropriateness gate.
Example 3 · Document summarization.
- Fixtures: 50 real documents (contracts, PDFs, intake forms), labeled with a human-written extract-of-truth.
- Rubric: coverage (LLM-graded: are all key clauses mentioned?), faithfulness (deterministic: does every claim trace to a source span?), length (deterministic).
- Regression: coverage ≥ baseline. Faithfulness must be 100%: no claim without a source.
- Common failure mode caught by the harness: the model inventing clause numbers. Caught by the faithfulness check.
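A deterministic faithfulness check is only possible if the summarizer reports where each claim came from. The sketch below, in Python, assumes the workflow emits claims as dicts with a `text` and a quoted `source_span`; that output shape and the function names are assumptions for illustration, not something the spec prescribes.

```python
def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not cause false misses."""
    return " ".join(text.split()).lower()


def unfaithful_claims(claims: list[dict], document: str) -> list[str]:
    """Return the text of every claim whose quoted span is absent from the document.

    Each claim is assumed to look like {"text": ..., "source_span": ...}.
    A claim with no span at all counts as unfaithful.
    """
    doc = normalise(document)
    failures = []
    for claim in claims:
        span = normalise(claim.get("source_span", ""))
        if not span or span not in doc:
            failures.append(claim["text"])
    return failures
```

The 100% gate then reduces to asserting that `unfaithful_claims` returns an empty list for every fixture. An invented clause number fails because no span in the source contains it.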
Example 4 · Sales-call recap.
- Fixtures: 40 real call transcripts with the rep's actual recap as the reference.
- Rubric: next-action presence (deterministic), customer-signals captured (LLM-graded), pricing accuracy (deterministic), tone (LLM-graded).
- Regression: next-action presence must be 100%. Pricing accuracy must be 100%. Customer signals and tone can drift ±0.2 points.
- Common failure mode caught by the harness: the model rewriting prices the customer quoted. Caught by the pricing-accuracy gate.
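One way to make the pricing-accuracy gate deterministic is to require that every price in the recap also appears in the transcript. The Python sketch below assumes prices are written with a leading currency symbol (`$`, `€`, or `£`); the pattern and the function names are illustrative assumptions, and a real harness would need to cover whatever formats its transcripts actually contain.

```python
import re

# Assumption: amounts look like "$1,200" or "€49.50". Spelled-out amounts and
# trailing currency codes are not handled by this sketch.
PRICE_PATTERN = re.compile(r"(?:\$|€|£)\s?(\d[\d,]*(?:\.\d+)?)")


def extract_prices(text: str) -> set[float]:
    """Return every currency amount found in the text as a float."""
    return {float(match.replace(",", "")) for match in PRICE_PATTERN.findall(text)}


def invented_prices(recap: str, transcript: str) -> set[float]:
    """Return prices that appear in the recap but nowhere in the transcript."""
    return extract_prices(recap) - extract_prices(transcript)
```

The gate passes only when `invented_prices` is empty, which catches a recap that rounds or rewrites a quoted figure.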
How to adopt the spec.
- Day 1. Source 50 real records into a fixture set. Do this before you write any prompt. If you cannot, the workflow isn't ready to be AI-ified.
- Day 2. Write the rubric. One page, in YAML or Markdown. Get the operator and the engineer to both sign it before any code.
- Day 3. Build the harness. A simple Python or Node script that runs the workflow on every fixture and prints the rubric scores. Store the baseline.
- Day 4. Wire the regression gate into CI. PRs that regress past tolerance fail the check.
- Ongoing. Add fixtures whenever a real output surprises an operator. Re-baseline only when a release ships green.
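The Day 3 harness can start very small. The following Python sketch shows one possible shape, assuming the workflow is a callable that takes a fixture's `input` and each grader is a callable that takes the output and the fixture's `expected` and returns a number; `run_harness`, `store_baseline`, and the toy workflow are hypothetical names for illustration.

```python
import json
from statistics import mean


def run_harness(fixtures, workflow, graders):
    """Run the workflow on every fixture and average each grader's scores."""
    if not fixtures:
        raise ValueError("fixture set is empty")
    scores = {name: [] for name in graders}
    for fixture in fixtures:
        output = workflow(fixture["input"])
        for name, grade in graders.items():
            scores[name].append(float(grade(output, fixture["expected"])))
    return {name: mean(values) for name, values in scores.items()}


def store_baseline(scoreboard, path):
    """Write the scoreboard to disk so the next release can be compared to it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(scoreboard, handle, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Toy stand-ins: a real harness would call the production workflow here.
    def toy_workflow(inputs):
        is_venue = inputs["industry"] == "hospitality"
        return {"category": "small-venue" if is_venue else "other"}

    def category_match(output, expected):
        return output["category"] == expected["category"]

    toy_fixtures = [
        {"input": {"industry": "hospitality"}, "expected": {"category": "small-venue"}},
        {"input": {"industry": "software"}, "expected": {"category": "enterprise"}},
    ]
    print(run_harness(toy_fixtures, toy_workflow, {"category_match": category_match}))
```

Keeping the workflow and graders as injected callables means the same script survives model swaps and prompt changes: only the callables change, never the loop.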
If the engineering lead cannot show you the eval script and the scoreboard from last week's release, the AI system in question is not shippable. It is just running.
Where this fits in Morvion engagements.
Every engagement under Intelligent Systems & AI Infrastructure starts with this spec. The two-week Discovery Sprint that often precedes a production build delivers a fixture set and a baseline rubric as part of the sprint output, before any commitment to the full system.
For the long-form discussion of why this approach matters, see the field note “Eval-driven AI: the only kind that ships.” For the one-paragraph definitions of the underlying terms, see the glossary entries on eval-driven AI, AI agent, and retrieval-augmented generation.
Versioning of this spec.
This document is versioned. Substantial changes (new layer, renamed concept, breaking rubric format) bump the major version and are noted at the top. Minor refinements (new worked example, tightened language) are silent. Current version: 1.0.0, published 2026-05-18.