Eval versioning · Morvion Glossary

Eval versioning is the discipline of treating the fixture set, the rubric, the baseline scoreboard, and the grader model version as versioned artefacts: stored in git, PR-reviewed, and release-noted. Without it, a score from this week isn't comparable to last week's, and drift is invisible. With it, every movement on the scoreboard is traceable to a specific change.

What gets versioned.

The fixture set. Adding, removing, or relabelling fixtures changes the meaning of every score against it. Each change is a commit with a rationale.
The rubric. A reworded clause shifts the LLM grader's output distribution. Rubrics are versioned; major edits trigger a re-baseline of the scoreboard.
The baseline scoreboard. The numbers from the previous release. The regression gate reads from here; new releases compare against this baseline.
The grader model version. An LLM grader update shifts scores even on identical fixtures and rubric. Pin the grader version; bump it deliberately.
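
As an illustration of how the four artefacts above might be captured in a single file committed to git, here is a minimal sketch. The `EvalManifest` structure, its field names, the truncated SHA-256 hashes, and the grader name `example-grader-2024-06-01` are all assumptions made for the example; they are not part of any Morvion specification.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def content_hash(text: str) -> str:
    """Short, stable fingerprint of a piece of text (first 12 hex chars of SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass
class EvalManifest:
    # Hypothetical manifest layout: one record per release, committed to git.
    fixtures_hash: str      # fingerprint of the whole fixture set
    rubric_version: str     # human-assigned version string for the rubric
    rubric_hash: str        # fingerprint of the rubric text itself
    grader_model: str       # pinned grader model identifier
    baseline_scores: dict   # metric name -> score from the previous release


def build_manifest(
    fixtures: dict[str, str],
    rubric_text: str,
    rubric_version: str,
    grader_model: str,
    baseline_scores: dict[str, float],
) -> EvalManifest:
    # Serialise fixtures with sorted keys so the hash does not depend on dict order.
    canonical_fixtures = json.dumps(fixtures, sort_keys=True)
    return EvalManifest(
        fixtures_hash=content_hash(canonical_fixtures),
        rubric_version=rubric_version,
        rubric_hash=content_hash(rubric_text),
        grader_model=grader_model,
        baseline_scores=dict(baseline_scores),
    )


def to_json(manifest: EvalManifest) -> str:
    """Render the manifest as deterministic JSON, suitable for review in a PR diff."""
    return json.dumps(asdict(manifest), sort_keys=True, indent=2)


if __name__ == "__main__":
    manifest = build_manifest(
        fixtures={"f-001": "What is the refund window?", "f-002": "Summarise the contract."},
        rubric_text="Answers must be grounded in the provided context.",
        rubric_version="1.0.0",
        grader_model="example-grader-2024-06-01",  # hypothetical model name
        baseline_scores={"faithfulness": 0.91},
    )
    print(to_json(manifest))
```

Because the output is deterministic, any change to a fixture, the rubric text, or the pinned grader shows up as a one-line diff in review.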

When to re-baseline.

Re-baseline when the rubric changes meaningfully, when the grader model is upgraded, or when the fixture set turns over by more than ~20%. Otherwise, leave the baseline alone; the value of the scoreboard is its continuity.
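
The re-baseline rule above can be sketched as a small check. This is one possible reading, not a definitive implementation: it assumes fixture turnover is measured as the number of fixture IDs added or removed divided by the size of the old set, and it assumes a "meaningful" rubric change is flagged by a human as a boolean. The function names and the `0.20` default are illustrative.

```python
def fixture_turnover(old_ids: set[str], new_ids: set[str]) -> float:
    """Fraction of the fixture set that changed, relative to the old set.

    Assumption: turnover counts fixtures added or removed (symmetric difference).
    Relabelled fixtures keep their ID here, so they would need separate tracking.
    """
    if not old_ids:
        return 1.0 if new_ids else 0.0
    return len(old_ids ^ new_ids) / len(old_ids)


def rebaseline_reasons(
    *,
    rubric_changed_meaningfully: bool,
    old_grader: str,
    new_grader: str,
    old_fixture_ids: set[str],
    new_fixture_ids: set[str],
    turnover_threshold: float = 0.20,
) -> list[str]:
    """Return the reasons a re-baseline is warranted; an empty list means leave it alone."""
    reasons = []
    if rubric_changed_meaningfully:
        reasons.append("rubric changed meaningfully")
    if old_grader != new_grader:
        reasons.append(f"grader model changed: {old_grader} -> {new_grader}")
    turnover = fixture_turnover(old_fixture_ids, new_fixture_ids)
    if turnover > turnover_threshold:
        reasons.append(f"fixture turnover {turnover:.0%} exceeds {turnover_threshold:.0%}")
    return reasons


if __name__ == "__main__":
    old = {f"f-{i:03d}" for i in range(10)}
    new = (old - {"f-000", "f-001"}) | {"f-100"}  # two removed, one added
    print(rebaseline_reasons(
        rubric_changed_meaningfully=False,
        old_grader="example-grader-v1",  # hypothetical identifiers
        new_grader="example-grader-v1",
        old_fixture_ids=old,
        new_fixture_ids=new,
    ))
```

Returning the list of reasons, rather than a bare boolean, gives the release notes something concrete to record when the baseline does move.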

Frequently asked.

What is eval versioning?

Eval versioning is the discipline of treating the fixture set, the rubric, the baseline scoreboard, and the grader model version as versioned artefacts in git. Each is PR-reviewed and release-noted. Without it, scores aren't comparable across releases. With it, every drift is traceable to a specific change.

When do we re-baseline the scoreboard?

When the rubric changes meaningfully, when the grader model is upgraded, or when the fixture set turns over by more than ~20%. Otherwise leave the baseline alone: the value of the scoreboard is its continuity across releases.

Should we version the grader model?

Yes. An LLM grader update shifts scores even on identical fixtures and rubric. Pin the grader version like any other infrastructure dependency. Bump it deliberately, treat the bump as a scoreboard-affecting change, and re-baseline afterwards.

How does eval versioning relate to the regression gate?

The regression gate reads from the versioned baseline scoreboard. A release ships only if the new scoreboard doesn't regress past tolerance from the baseline. Without versioning, the baseline is whatever someone remembers, which means the gate is theatre.
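
A regression gate of the kind described here could look roughly like the sketch below. It assumes every metric is higher-is-better, that a single absolute tolerance applies to all metrics, and that the baseline is a plain mapping loaded from the versioned scoreboard. The function name, the metric names, and the `0.02` default are hypothetical.

```python
def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Compare a candidate scoreboard against the versioned baseline.

    Returns a list of failure messages; an empty list means the release may ship.
    Assumption: all metrics are higher-is-better and share one absolute tolerance.
    """
    failures = []
    for metric, base_score in sorted(baseline.items()):
        if metric not in candidate:
            # A metric that silently disappears is treated as a failure, not a pass.
            failures.append(f"{metric}: missing from candidate scoreboard")
            continue
        drop = base_score - candidate[metric]
        if drop > tolerance:
            failures.append(f"{metric}: dropped {drop:.3f} (tolerance {tolerance:.3f})")
    return failures


if __name__ == "__main__":
    baseline = {"faithfulness": 0.90, "accuracy": 0.80}   # read from the versioned scoreboard
    candidate = {"faithfulness": 0.89, "accuracy": 0.70}  # scores for the release under test
    problems = regression_gate(baseline, candidate)
    for line in problems:
        print("FAIL", line)
    raise SystemExit(1 if problems else 0)
```

The gate only means something if `baseline` comes from a committed, reviewed file; if it is typed in from memory, the comparison has nothing stable to compare against.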
