The Morvion Eval Spec

The Morvion Eval Spec is a public reference for shipping AI systems with confidence. It defines the structure of an evaluation harness, the layers it must contain, the gating semantics that govern releases, and four worked examples for workflows we see in nearly every engagement. The shape is intentionally compact: anyone reading this should be able to start a harness for their own AI system within a single afternoon.

Why this spec exists.

AI outputs are non-deterministic. Without an eval harness, model swaps, prompt changes, and retrieval refactors regress silently. The harness is the only artifact in an AI project that survives those changes unchanged, and the only objective answer to the question “is this better than last week?” Morvion built this spec across engagements where we wrote the eval first, where we wrote it second, and where we didn't write one at all. The spec is the shape that worked.

“An AI system without evals is a vibe. A vibe is not a product.”

The three layers.

Every working eval harness has three layers. Skip any one and the harness becomes a flatterer instead of a referee.

Layer 1 · Fixtures.

A curated dataset of real inputs, labeled with what a good output looks like. 50 to 200 examples is enough to start. Source them from real traffic, not from imagined personas. If you cannot produce a fixture set, you do not understand the problem yet, and any AI system you build will reflect that gap.

Fixture shape (JSON, one record per example):

{
  "id": "fix-001",
  "input": {
    "company": "Aperitivo Bar Zurich",
    "industry": "hospitality",
    "context": "..."
  },
  "expected": {
    "category": "small-venue",
    "key_signals": ["live-music", "aperitivo", "zurich"],
    "draft_tone": "warm, specific, never generic"
  },
  "labels": {
    "difficulty": "medium",
    "edge_case": false
  },
  "source": "real-traffic-2026-04-12"
}

The fixture set lives in version control. New records are added whenever a real-world output surprises an operator. The set grows; old records are not removed unless the underlying workflow has fundamentally changed.
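
Because the fixture set only grows, a small validation step helps keep malformed records out of version control. The sketch below is one way to do that, not part of the spec: it assumes records arrive as JSON strings (for example, one per line in a file), and the names `REQUIRED_KEYS`, `validate_fixture`, and `load_fixtures` are hypothetical.

```python
import json

# Top-level keys taken from the fixture shape shown above.
REQUIRED_KEYS = {"id", "input", "expected", "labels", "source"}


def validate_fixture(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key in ("input", "expected", "labels"):
        if key in record and not isinstance(record[key], dict):
            problems.append(f"{key} must be an object")
    # An empty source hides whether the record came from real traffic.
    if "source" in record and not str(record["source"]).strip():
        problems.append("source must name where the record came from")
    return problems


def load_fixtures(raw_records: list[str]) -> list[dict]:
    """Parse JSON records, rejecting invalid records and duplicate ids."""
    fixtures, seen_ids = [], set()
    for raw in raw_records:
        record = json.loads(raw)
        problems = validate_fixture(record)
        if record.get("id") in seen_ids:
            problems.append("duplicate id")
        if problems:
            raise ValueError(f"{record.get('id', '<no id>')}: {'; '.join(problems)}")
        seen_ids.add(record["id"])
        fixtures.append(record)
    return fixtures
```

Running a check like this in CI means a bad fixture fails loudly at load time instead of silently skewing a score.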

Layer 2 · Rubrics.

The written definition of “good” for each input class. Three grading patterns, used in combination:

Deterministic checks. Does the JSON parse? Does the function call match the schema? Does the output stay under the token budget? These are pass/fail.
LLM-graded checks. A reference model (often a larger or different model than the production one) scores the output against a written rubric prompt. The score is a small integer per dimension, on a short scale such as 0-2, 0-3, or 0-5.
Human-graded checks. For outputs where tone or judgment cannot be captured deterministically, periodic human review of a sampled subset. Human grades feed back into the LLM-grader rubric over time.
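
The three deterministic questions named above can each be written as a small pass/fail function. This is a minimal sketch under stated assumptions: the function names are hypothetical, a function call is assumed to be a dict with an `arguments` mapping, and whitespace splitting stands in for a real tokenizer.

```python
import json


def json_parses(output: str) -> bool:
    """Pass if the output is valid JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def call_matches_schema(call: dict, required_args: dict[str, type]) -> bool:
    """Pass if the call has exactly the expected argument names and types."""
    args = call.get("arguments", {})
    if set(args) != set(required_args):
        return False
    return all(isinstance(args[name], typ) for name, typ in required_args.items())


def within_token_budget(output: str, budget: int) -> bool:
    """Pass if the output fits the budget.

    Assumption: whitespace-separated words approximate tokens. Swap in the
    production tokenizer for a real harness.
    """
    return len(output.split()) <= budget
```

Each function returns a plain boolean, so the results can be recorded as 0 or 1 alongside the LLM-graded scores.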

Rubric shape (YAML, one document per workflow):

workflow: customer-reply-draft
version: 4
dimensions:
  - name: factual_correctness
    grader: deterministic
    check: cited_sources_match_retrieval
    weight: 0.4
  - name: tone_alignment
    grader: llm
    prompt: |
      Does this draft sound like the brand voice in the
      reference samples? Score 0-3 where 0=robotic,
      3=on-voice.
    weight: 0.3
  - name: actionability
    grader: llm
    prompt: |
      Does the draft propose a clear next step the
      customer can act on? Score 0-2.
    weight: 0.2
  - name: refusal_appropriateness
    grader: deterministic
    check: refused_when_out_of_scope
    weight: 0.1

Rubrics are versioned alongside the prompts. A prompt change that requires a new dimension bumps the rubric version too.
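
The rubric above mixes scales (0-3, 0-2, pass/fail), so the weights only make sense once each dimension is normalized. The sketch below shows one plausible way to combine them. It assumes deterministic checks score 0 or 1 and that each raw score is divided by its scale maximum before weighting; the spec does not prescribe this formula, and `RUBRIC` and `weighted_score` are hypothetical names.

```python
# Dimension name -> (weight, maximum raw score), mirroring the rubric above.
# Assumption: deterministic checks are recorded as 0 (fail) or 1 (pass).
RUBRIC = {
    "factual_correctness": (0.4, 1),
    "tone_alignment": (0.3, 3),
    "actionability": (0.2, 2),
    "refusal_appropriateness": (0.1, 1),
}


def weighted_score(raw_scores: dict[str, float]) -> float:
    """Normalize each dimension to [0, 1], then combine using the weights."""
    total_weight = sum(weight for weight, _ in RUBRIC.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    score = 0.0
    for name, (weight, max_raw) in RUBRIC.items():
        raw = raw_scores[name]
        if not 0 <= raw <= max_raw:
            raise ValueError(f"{name}: {raw} is outside 0..{max_raw}")
        score += weight * raw / max_raw
    return score
```

For example, a draft that passes both deterministic checks, scores 2 of 3 on tone, and 2 of 2 on actionability comes out at roughly 0.9 under these assumptions.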

Layer 3 · Regression gates.

A baseline number per metric, stored on every release. A new release ships only when no metric regresses past a defined tolerance versus the baseline. This is the single most valuable artifact in the system and the one teams skip first.

Gate shape (in CI, as a check on every PR):

# .github/workflows/eval-gate.yml (excerpt)
- name: Run eval harness
  run: morvion-evals run --workflow customer-reply-draft --fixtures ./fixtures

- name: Check regression
  run: |
    morvion-evals compare \
      --current ./evals/current.json \
      --baseline ./evals/baseline.json \
      --tolerance 0.02
  # exits non-zero if any weighted dimension regressed
  # by more than 2% versus the stored baseline

The baseline is updated only when a release ships green and is accepted into main. Tolerances are set per workflow and, within a workflow, per dimension: factual-correctness regression gets a tighter band than tone-alignment regression.
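
The comparison step itself is small. The following sketch illustrates the gating logic with per-dimension tolerances; it is not the implementation of `morvion-evals compare`. It assumes scores are normalized to the 0-1 range and that the tolerance is an absolute drop (0.02 meaning two points on that scale). `find_regressions` is a hypothetical name.

```python
def find_regressions(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerances: dict[str, float],
    default_tolerance: float = 0.02,
) -> dict[str, float]:
    """Return each dimension whose score dropped by more than its tolerance.

    Assumptions: scores are on a 0-1 scale, the tolerance is an absolute
    drop, and a metric missing from the current run counts as a score of 0.
    """
    epsilon = 1e-9  # absorbs float rounding when a drop equals the tolerance
    regressions = {}
    for name, base in baseline.items():
        drop = base - current.get(name, 0.0)
        if drop > tolerances.get(name, default_tolerance) + epsilon:
            regressions[name] = drop
    return regressions
```

A CI step would exit non-zero whenever the returned dict is non-empty. Improvements never fail the gate, because only drops below the baseline are counted.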

Deterministic vs. LLM-graded vs. human-graded.

The grading pattern follows the output shape. Use the simplest grader that can answer the question reliably. Three rules:

Prefer deterministic when possible. If you can write a check in code, write a check in code. Deterministic graders are free, fast, and don't themselves regress.
LLM-graders need their own version pinning. The grader model is part of the harness, not part of the production stack. Pin it. Re-evaluate the entire fixture set when you bump the grader.
Human review is the audit layer. Even with strong LLM-graders, sample 5-10% of outputs weekly for human review. Disagreements feed back into the rubric.
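
Two of these rules lend themselves to a few lines of code: refusing to compare scores produced by different graders, and picking the weekly human-review sample reproducibly. The sketch below is an illustration under assumptions. The metadata keys `grader_model` and `rubric_version`, the function names, and the hash-based sampling scheme are all hypothetical and not taken from the spec.

```python
import hashlib


def scores_comparable(current_meta: dict, baseline_meta: dict) -> bool:
    """Scores are comparable only if grader model and rubric version match."""
    keys = ("grader_model", "rubric_version")
    return all(
        key in current_meta and current_meta[key] == baseline_meta.get(key)
        for key in keys
    )


def selected_for_review(output_id: str, week: str, rate: float = 0.05) -> bool:
    """Deterministically select about `rate` of outputs for human review.

    Hashing the id together with the week label gives a stable sample that
    changes from week to week without storing any random state.
    """
    digest = hashlib.sha256(f"{week}:{output_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate
```

If `scores_comparable` returns false after a grader bump, the safe move is the one stated above: re-evaluate the entire fixture set and store a fresh baseline before gating again.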

Worked examples · four common workflows.

These are the four AI workflows we see in nearly every engagement, with the eval shape that has worked for each.

Example 1 · CRM enrichment.

Fixtures: 120 real lead records, half public companies, half private. Labeled with the enrichment fields a human would write.
Rubric: per-field accuracy (deterministic match), plus an LLM-graded check for whether the enrichment paragraph reads as natural prose rather than concatenated bullet points.
Regression: per-field accuracy must stay ≥ baseline. Naturalness can fluctuate ±0.3 points.
Common failure mode caught by the harness: the model hallucinating funding rounds for private companies. Caught by a deterministic check against a public funding-data source.

Example 2 · Customer reply drafting.

Fixtures: 80 real inbound messages with the operator's actual reply as the reference.
Rubric: factual correctness (deterministic against retrieval), tone alignment (LLM-graded against brand-voice samples), actionability (LLM-graded), refusal appropriateness (deterministic).
Regression: factual correctness ≥ baseline. Tone may drift up to 0.2 points. Refusal must be 100% on out-of-scope inputs.
Common failure mode caught by the harness: the model trying to answer policy questions it shouldn't. Caught by the refusal-appropriateness gate.

Example 3 · Document summarization.

Fixtures: 50 real documents (contracts, PDFs, intake forms), labeled with a human-written extract-of-truth.
Rubric: coverage (LLM-graded: are all key clauses mentioned?), faithfulness (deterministic: does every claim trace to a source span?), length (deterministic).
Regression: coverage ≥ baseline. Faithfulness must be 100%, no claim without source.
Common failure mode caught by the harness: the model inventing clause numbers. Caught by the faithfulness check.
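
One way to make "every claim traces to a source span" checkable in code is to have the summarizer emit, for each claim, the character offsets and the quoted text it relies on. The sketch below assumes that output shape, which the spec does not define; `Claim` and `unsupported_claims` are hypothetical names. It verifies that the cited span really exists in the document. It does not verify that the claim follows from the span, which remains a job for an LLM or human grader.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str   # the sentence as it appears in the summary
    start: int  # character offset where the cited span begins (assumed shape)
    end: int    # character offset where the cited span ends (exclusive)
    quote: str  # the source text the model says it is citing


def unsupported_claims(document: str, claims: list[Claim]) -> list[Claim]:
    """Return claims whose cited span does not match the document text."""
    bad = []
    for claim in claims:
        in_bounds = 0 <= claim.start < claim.end <= len(document)
        if not in_bounds or document[claim.start:claim.end] != claim.quote:
            bad.append(claim)
    return bad
```

Faithfulness at 100% then means this function returns an empty list for every fixture. An invented clause number fails the check because its quoted text cannot be found at the cited offsets.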

Example 4 · Sales-call recap.

Fixtures: 40 real call transcripts with the rep's actual recap as the reference.
Rubric: next-action presence (deterministic), customer-signals captured (LLM-graded), pricing accuracy (deterministic), tone (LLM-graded).
Regression: next-action presence must be 100%. Pricing accuracy must be 100%. Signals/tone can drift ±0.2 points.
Common failure mode caught by the harness: the model rewriting prices the customer quoted. Caught by the pricing-accuracy gate.
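
A pricing-accuracy gate can be approximated by checking that every price in the recap also appears in the transcript. The sketch below is a rough illustration, not the spec's implementation. The regular expression covers only a few currency markers and plain number formats, and `extract_prices` and `pricing_accurate` are hypothetical names.

```python
import re
from decimal import Decimal

# Assumption: prices look like "CHF 1,200", "$15,000.00", or "€99.50".
PRICE = re.compile(r"(CHF|USD|EUR|\$|€)\s?(\d[\d,]*(?:\.\d{1,2})?)")


def extract_prices(text: str) -> set[tuple[str, Decimal]]:
    """Return (currency, amount) pairs, ignoring thousands separators."""
    return {
        (currency, Decimal(amount.replace(",", "")))
        for currency, amount in PRICE.findall(text)
    }


def pricing_accurate(transcript: str, recap: str) -> bool:
    """Pass only if every price in the recap was stated in the transcript."""
    return extract_prices(recap) <= extract_prices(transcript)
```

Using `Decimal` means "1,200" and "1200.00" compare as equal, so formatting differences do not fail the gate while a changed amount does.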

How to adopt the spec.

Day 1. Source 50 real records into a fixture set. Do this before you write any prompt. If you cannot, the workflow isn't ready for an AI system.
Day 2. Write the rubric. One page, in YAML or Markdown. Get the operator and the engineer to both sign it before any code.
Day 3. Build the harness. A simple Python or Node script that runs the workflow on every fixture and prints the rubric scores. Store the baseline.
Day 4. Wire the regression gate into CI. PRs that regress past tolerance fail the check.
Ongoing. Add fixtures whenever a real output surprises an operator. Re-baseline only when a release ships green.
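
The Day 3 harness can start as a single short script. The sketch below shows the shape under assumptions: the workflow is any callable from a fixture's `input` to an output dict, each grader is a callable returning a numeric score, and scores are averaged per dimension. `run_harness`, `store_baseline`, and the `Grader` alias are hypothetical names.

```python
import json
from pathlib import Path
from statistics import mean
from typing import Callable

# A grader takes (output, expected) and returns a numeric score.
Grader = Callable[[dict, dict], float]


def run_harness(
    fixtures: list[dict],
    workflow: Callable[[dict], dict],
    graders: dict[str, Grader],
) -> dict[str, float]:
    """Run the workflow on every fixture and average each dimension."""
    if not fixtures:
        raise ValueError("fixture set is empty")
    per_dimension: dict[str, list[float]] = {name: [] for name in graders}
    for fixture in fixtures:
        output = workflow(fixture["input"])
        for name, grade in graders.items():
            per_dimension[name].append(grade(output, fixture["expected"]))
    return {name: mean(scores) for name, scores in per_dimension.items()}


def store_baseline(scores: dict[str, float], path: Path) -> None:
    """Write the scoreboard as JSON so the next release can be compared."""
    path.write_text(json.dumps(scores, indent=2, sort_keys=True))
```

Printing the returned dict gives the scoreboard. Writing it with `store_baseline` after a green release gives the Day 4 gate something to compare against.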

Field rule

If the engineering lead cannot show you the eval script and the scoreboard from last week's release, the AI system in question is not shippable. It is just running.

Where this fits in Morvion engagements.

Every engagement under Intelligent Systems & AI Infrastructure starts with this spec. The two-week Discovery Sprint that often precedes a production build delivers a fixture set and a baseline rubric as part of the sprint output, before any commitment to the full system.

For the long-form discussion of why this approach matters, see the field note Eval-driven AI: the only kind that ships. For the one-paragraph definitions of the underlying terms, see the glossary entries on eval-driven AI, AI agent, and retrieval-augmented generation.

Versioning of this spec.

This document is versioned. Substantial changes (new layer, renamed concept, breaking rubric format) bump the major version and are noted at the top. Minor refinements (new worked example, tightened language) are silent. Current version: 1.0.0, published 2026-05-18.

Common questions.

What is the Morvion Eval Spec?

The Morvion Eval Spec is a public reference for shipping AI systems with confidence. It defines a three-layer model (fixtures, rubrics, regression gates) plus deterministic, LLM-graded, and human-graded scoring patterns, release-gate semantics, and worked examples for four common AI workflows. It is the methodology Morvion uses on every production AI engagement.

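Why is eval-driven AI necessary?

AI outputs are non-deterministic. Without an eval harness, model swaps, prompt changes, and retrieval refactors regress silently. The harness is the only objective answer to the question “is this better than last week?”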

How many fixtures do you need to start?

50 to 200 examples is enough to start, sourced from real traffic rather than imagined personas. If you cannot produce a fixture set, you do not understand the problem yet, and any AI system you build will reflect that gap.

What does a rubric look like?

A rubric is the written definition of 'good' for each input class. Sometimes deterministic (does the JSON parse? does the function call match?), sometimes LLM-graded (does the draft hit the brief?), occasionally human-graded (does the customer reply feel like us?). Rubrics are versioned alongside the prompts.

How does the regression gate work?

Every release stores a baseline score per metric. A new release ships only when no metric regresses past a defined tolerance versus the baseline. The gate is automated in CI: a failing regression blocks the merge.

Is the Morvion Eval Spec open source?

Yes. The reference is publicly readable on morvion.com and the companion open-source template lives at github.com/aloalads/eval-spec under MIT license. It ships with YAML schemas for fixtures, rubrics, and regression gates, a TypeScript reference scorer (deterministic + LLM-graded + human-graded metrics), four worked examples (CRM enrichment, customer reply, document summarization, sales-call recap), and a CI workflow file that fails a PR on metric regression.
