Eval versioning is the discipline of treating the fixture set, the rubric, the baseline scoreboard, and the grader model version as versioned artefacts: stored in git, PR-reviewed, and release-noted. Without it, a score from this week isn't comparable to last week's, and drift is invisible. With it, every movement on the scoreboard is traceable to a specific change.

What gets versioned.

  • The fixture set. Adding, removing, or relabelling fixtures changes the meaning of every score against it. Each change is a commit with a rationale.
  • The rubric. A reworded clause shifts the LLM grader's output distribution. Rubrics are versioned; major edits trigger a re-baseline of the scoreboard.
  • The baseline scoreboard. The numbers from the previous release. The regression gate reads from here and compares each new release against them.
  • The grader model version. An LLM grader update shifts scores even on identical fixtures and rubric. Pin the grader version; bump it deliberately.
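
One way to make these four artefacts visible in review is to commit a small manifest next to them. The sketch below is illustrative only: the function names (content_hash, build_manifest, changed_artefacts), the manifest keys, and the fixture shape (a dict with an "id" field) are assumptions, not part of any existing tool. It fingerprints each artefact so that a diff of the manifest shows which one moved.

```python
import hashlib
import json


def content_hash(payload: object) -> str:
    """Stable SHA-256 of a JSON-serialisable object (keys sorted)."""
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def build_manifest(
    fixtures: list[dict],
    rubric: str,
    grader_model: str,
    baseline: dict[str, float],
) -> dict[str, str]:
    """Fingerprint the four versioned artefacts (hypothetical layout).

    Fixtures are sorted by their assumed "id" field so that reordering
    the file does not register as a change.
    """
    return {
        "fixtures_sha256": content_hash(sorted(fixtures, key=lambda f: f["id"])),
        "rubric_sha256": content_hash(rubric),
        "grader_model": grader_model,  # the pinned version string, stored as-is
        "baseline_sha256": content_hash(baseline),
    }


def changed_artefacts(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Names of manifest entries that differ between two manifests."""
    return sorted(key for key in new if old.get(key) != new[key])
```

Calling changed_artefacts on the previous and current manifests gives a reviewer a short list of what changed, which can be copied into the commit rationale or the release notes.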

When to re-baseline.

Re-baseline when the rubric changes meaningfully, when the grader model is upgraded, or when the fixture set turns over by more than ~20%. Otherwise, leave the baseline alone; the value of the scoreboard is its continuity.
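
The rule above can be sketched as a small check. The text does not define how turnover is measured, so this sketch assumes one plausible definition: fixtures added, removed, or relabelled, divided by the size of the baseline fixture set. The names fixture_turnover, should_rebaseline, and TURNOVER_THRESHOLD are hypothetical, and whether a rubric change counts as "meaningful" is left as a human judgement passed in as a flag.

```python
TURNOVER_THRESHOLD = 0.20  # the "~20%" from the text; tune to taste


def fixture_turnover(baseline: dict[str, str], current: dict[str, str]) -> float:
    """Fraction of the baseline fixture set that has turned over.

    Both arguments map fixture id -> label. Assumed definition:
    (added + removed + relabelled) / number of baseline fixtures.
    """
    if not baseline:
        return 1.0 if current else 0.0
    added = current.keys() - baseline.keys()
    removed = baseline.keys() - current.keys()
    relabelled = {
        fid for fid in baseline.keys() & current.keys()
        if baseline[fid] != current[fid]
    }
    return (len(added) + len(removed) + len(relabelled)) / len(baseline)


def should_rebaseline(
    rubric_changed_meaningfully: bool,
    grader_upgraded: bool,
    turnover: float,
) -> bool:
    """True if any of the three re-baseline triggers fires."""
    return (
        rubric_changed_meaningfully
        or grader_upgraded
        or turnover > TURNOVER_THRESHOLD
    )
```

Under this definition, a ten-fixture baseline with one fixture added, one removed, and one relabelled has a turnover of 0.3 and would trigger a re-baseline; two such changes would not.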

Frequently asked.

What is eval versioning?
Eval versioning is the discipline of treating the fixture set, the rubric, the baseline scoreboard, and the grader model version as versioned artefacts in git. Each is PR-reviewed and release-noted. Without it, scores aren't comparable across releases. With it, every shift in the scores is traceable to a specific change.
When do we re-baseline the scoreboard?
When the rubric changes meaningfully, when the grader model is upgraded, or when the fixture set turns over by more than ~20%. Otherwise leave the baseline alone — the value of the scoreboard is its continuity across releases.
Should we version the grader model?
Yes. An LLM grader update shifts scores even on identical fixtures and rubric. Pin the grader version like any other infrastructure dependency. Bump it deliberately, treat the bump as a scoreboard-affecting change, and re-baseline afterwards.
How does eval versioning relate to the regression gate?
The regression gate reads from the versioned baseline scoreboard. A release ships only if the new scoreboard doesn't regress past tolerance from the baseline. Without versioning, the baseline is whatever someone remembers, which means the gate is theatre.
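
As a rough illustration of such a gate, the sketch below compares a candidate scoreboard against the committed baseline. It assumes scoreboards are flat mappings of metric name to score, that higher is better, and that one absolute tolerance applies to every metric; the function names (find_regressions, gate_passes) and the default tolerance are hypothetical.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, Optional[float]]:
    """Metrics that dropped more than `tolerance` below the baseline.

    Returns metric -> score delta (candidate minus baseline). A metric
    present in the baseline but missing from the candidate is reported
    with a delta of None, since it cannot be compared.
    """
    failed: dict[str, Optional[float]] = {}
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None:
            failed[metric] = None
        elif base_score - new_score > tolerance:
            failed[metric] = new_score - base_score
    return failed


def gate_passes(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> bool:
    """True if no baseline metric regressed past tolerance."""
    return not find_regressions(baseline, candidate, tolerance)
```

In CI, the baseline would be loaded from the versioned file in the repository rather than passed in by hand, which is what ties the gate to a reviewed commit instead of to memory.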