Eval versioning is the discipline of treating the fixture set, the rubric, the baseline scoreboard, and the grader model version as versioned artefacts: stored in git, PR-reviewed, release-noted. Without it, a score from this week isn't comparable to last week's, and drift is invisible. With it, every movement on the scoreboard is traceable to a specific change.
What gets versioned.
- The fixture set. Adding, removing, or relabelling fixtures changes the meaning of every score against it. Each change is a commit with a rationale.
- The rubric. A reworded clause shifts the LLM grader's output distribution. Rubrics are versioned; major edits trigger a re-baseline of the scoreboard.
- The baseline scoreboard. The scores recorded at the previous release. The regression gate reads from here: each new release is compared against these numbers.
- The grader model version. An LLM grader update shifts scores even on identical fixtures and rubric. Pin the grader version; bump it deliberately.
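One way to make these four artefacts reviewable together is a small manifest committed next to the fixtures. The sketch below is illustrative only: `EvalManifest`, its field names, and the example version strings are hypothetical, and it assumes fixtures are JSON-serialisable dicts.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalManifest:
    """Hypothetical record of everything a score depends on."""
    fixture_set_hash: str   # content hash of the fixture set
    rubric_version: str     # e.g. a tag or commit of the rubric file
    grader_model: str       # pinned grader model identifier
    baseline_id: str        # release whose scoreboard is the baseline


def fixture_set_hash(fixtures: list[dict]) -> str:
    """Order-independent SHA-256 over the canonical JSON of each fixture."""
    canonical = sorted(json.dumps(f, sort_keys=True) for f in fixtures)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def changed_fields(old: EvalManifest, new: EvalManifest) -> list[str]:
    """Names of the manifest fields that differ between two versions."""
    old_d, new_d = asdict(old), asdict(new)
    return [name for name in old_d if old_d[name] != new_d[name]]
```

With a manifest like this in git, a reviewer can see from the diff which of the four inputs moved, and a relabelled fixture shows up as a changed hash even when the fixture count stays the same.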
When to re-baseline.
Re-baseline when the rubric changes meaningfully, when the grader model is upgraded, or when the fixture set turns over by more than ~20%. Otherwise, leave the baseline alone; the value of the scoreboard is its continuity.
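The rule above can be sketched as a small check. This is a minimal illustration under stated assumptions: turnover is taken to mean fixtures added, removed, or relabelled since the baseline, divided by the baseline size, and whether a rubric change is "meaningful" stays a human judgement passed in as a flag. The function names and the id-to-label mapping are hypothetical.

```python
def fixture_turnover(baseline: dict[str, str], current: dict[str, str]) -> float:
    """Fraction of the fixture set that changed since the baseline.

    Both arguments map fixture id -> expected label. Counts added,
    removed, and relabelled fixtures, relative to the baseline size.
    """
    if not baseline:
        return 1.0 if current else 0.0
    added = current.keys() - baseline.keys()
    removed = baseline.keys() - current.keys()
    relabelled = {
        fid for fid in baseline.keys() & current.keys()
        if baseline[fid] != current[fid]
    }
    return (len(added) + len(removed) + len(relabelled)) / len(baseline)


def should_rebaseline(
    *,
    rubric_changed_meaningfully: bool,
    grader_upgraded: bool,
    turnover: float,
    turnover_threshold: float = 0.20,  # the ~20% figure from the text
) -> bool:
    """True if any of the three re-baseline triggers fires."""
    return (
        rubric_changed_meaningfully
        or grader_upgraded
        or turnover > turnover_threshold
    )
```

For example, removing one fixture, relabelling one, and adding one against a ten-fixture baseline gives a turnover of 3/10, which crosses the threshold; a single relabel does not.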
Frequently asked.
- What is eval versioning?
- Eval versioning is the discipline of treating the fixture set, the rubric, the baseline scoreboard, and the grader model version as versioned artefacts in git. Each is PR-reviewed and release-noted. Without it, scores aren't comparable across releases. With it, every drift is traceable to a specific change.
- When do we re-baseline the scoreboard?
- When the rubric changes meaningfully, when the grader model is upgraded, or when the fixture set turns over by more than ~20%. Otherwise leave the baseline alone — the value of the scoreboard is its continuity across releases.
- Should we version the grader model?
- Yes. An LLM grader update shifts scores even on identical fixtures and rubric. Pin the grader version like any other infrastructure dependency. Bump it deliberately, treat the bump as a scoreboard-affecting change, and re-baseline afterwards.
- How does eval versioning relate to the regression gate?
- The regression gate reads from the versioned baseline scoreboard. A release ships only if the new scoreboard doesn't regress past tolerance from the baseline. Without versioning, the baseline is whatever someone remembers, which means the gate is theatre.
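A gate of this kind can be sketched in a few lines. The example assumes the scoreboard is a mapping from metric name to score, that higher is better, and that one absolute tolerance applies to every metric; the function names and the default tolerance are hypothetical, not taken from any particular tool.

```python
def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,  # assumed absolute tolerance, higher score is better
) -> dict[str, float]:
    """Metrics that dropped by more than `tolerance`, mapped to the size of the drop.

    A metric present in the baseline but missing from the candidate is
    reported as an infinite drop, so it cannot slip through the gate.
    """
    failed: dict[str, float] = {}
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None:
            failed[metric] = float("inf")
            continue
        drop = base_score - new_score
        if drop > tolerance:
            failed[metric] = drop
    return failed


def gate_passes(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> bool:
    """True if no baseline metric regressed past tolerance."""
    return not regressions(baseline, candidate, tolerance)
```

In practice the `baseline` mapping would be loaded from the versioned scoreboard file rather than built by hand, which is what keeps the comparison anchored to a reviewed artefact instead of someone's memory.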