An eval rubric is the written definition of what counts as a good output for one input class. It is the scoring contract that turns subjective judgement into a number an eval harness can compare across releases. Rubric quality determines whether the scoreboard reflects actual product quality or just the rubric author's mood.
The three rubric shapes.
- Deterministic. The output either does or doesn't match the expected value. Examples: schema validation, fact lookup, exact-match classification. This is the fastest and cheapest shape, and the best choice when the answer is binary.
- LLM-graded. A judge model scores the output against a written rubric. Used for faithfulness, tone, appropriateness, and other criteria that don't reduce to exact match. Slower and noisier than deterministic, but usable on subjective dimensions.
- Human-graded. A domain expert scores a sample. The most reliable and the most expensive. Used to calibrate the LLM grader and to spot-check the production output distribution.
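As a rough illustration of the deterministic shape and of calibrating an LLM grader against human labels, here is a minimal Python sketch. The function names (`exact_match`, `has_required_fields`, `agreement_rate`) and the idea of treating each grade as a pass/fail boolean are assumptions made for this example, not part of any particular eval harness.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """Deterministic check: output equals the expected label after trimming and lowercasing."""
    return output.strip().lower() == expected.strip().lower()


def has_required_fields(output: str, required: dict[str, type]) -> bool:
    """Deterministic check: output parses as a JSON object with the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Calibration: fraction of sampled items where the LLM judge agrees with the human grader."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

A low agreement rate on the human-graded sample suggests the judge prompt or the rubric wording needs work before the LLM-graded scores can be trusted.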
Writing a good rubric.
Prefer specific criteria over general ones: “the answer cites the source span verbatim” beats “the answer is grounded”. Prefer multiple narrow dimensions (faithfulness, format, and tone, each graded separately) over one fuzzy overall score. Include worked examples: for each dimension, show one passing and one failing output. Treat the rubric as a living document, so that each new edge case becomes a new clause.
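One way to keep a rubric honest about these properties is to store it as structured data rather than free text. The sketch below is illustrative only: the `Dimension` and `Rubric` classes, the `problems` check, and the example criteria are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dimension:
    name: str
    criterion: str         # one specific, checkable statement
    passing_example: str   # worked example that satisfies the criterion
    failing_example: str   # worked example that violates it


@dataclass
class Rubric:
    input_class: str
    dimensions: list[Dimension] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List dimensions that lack a passing or a failing worked example."""
        issues = []
        for d in self.dimensions:
            if not d.passing_example.strip():
                issues.append(f"{d.name}: missing passing example")
            if not d.failing_example.strip():
                issues.append(f"{d.name}: missing failing example")
        return issues


def score_report(rubric: Rubric, grades: dict[str, bool]) -> dict[str, bool]:
    """Return one grade per dimension; never collapse them into a single overall score."""
    missing = [d.name for d in rubric.dimensions if d.name not in grades]
    if missing:
        raise ValueError(f"ungraded dimensions: {missing}")
    return {d.name: grades[d.name] for d in rubric.dimensions}


# Hypothetical rubric for one input class; the tone dimension is deliberately incomplete.
example = Rubric(
    input_class="support-ticket summary",
    dimensions=[
        Dimension("faithfulness", "The answer cites the source span verbatim.",
                  'Quotes "refund issued on 3 May" from the ticket.',
                  "States a refund date that appears nowhere in the ticket."),
        Dimension("format", "The summary is a single paragraph under 80 words.",
                  "One 60-word paragraph.", "Three bullet lists."),
        Dimension("tone", "The summary avoids blaming the customer.",
                  "Describes the issue neutrally.", ""),
    ],
)
```

Calling `example.problems()` flags the tone dimension, which has no failing example yet. That is the kind of gap a new edge case would later fill with a new clause.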
Rubrics get versioned too.
A rubric change shifts the meaning of every score. Treat rubric edits the same as prompt edits: kept in version control, PR-reviewed, and shipped with a release note explaining the change. The scoreboard from before the rubric edit is not directly comparable to the scoreboard after.
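A harness can enforce this non-comparability mechanically by stamping each score with an identifier for the rubric that produced it. The following is a minimal sketch under the assumption that the rubric is stored as text; `rubric_fingerprint`, `ScoreRecord`, and `pass_rate_delta` are hypothetical names, and a content hash is just one possible choice of identifier (a version tag from version control would serve equally well).

```python
import hashlib
from dataclasses import dataclass


def rubric_fingerprint(rubric_text: str) -> str:
    """Short content hash of the rubric text; any edit to the text produces a different id."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class ScoreRecord:
    release: str
    pass_rate: float
    rubric_id: str  # fingerprint of the rubric used to produce pass_rate


def pass_rate_delta(before: ScoreRecord, after: ScoreRecord) -> float:
    """Compare two releases, refusing to do so if they were scored under different rubrics."""
    if before.rubric_id != after.rubric_id:
        raise ValueError(
            f"rubric changed ({before.rubric_id} -> {after.rubric_id}); "
            "re-baseline before comparing scores"
        )
    return after.pass_rate - before.pass_rate
```

With this in place, a comparison across a rubric edit fails loudly instead of silently producing a misleading delta.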
Frequently asked.
- What is an eval rubric?
- An eval rubric is the written definition of what counts as a good output for one input class. It's the scoring contract that turns subjective judgement into a number an eval harness can compare across releases. Without a rubric, the scoreboard measures the grader's mood rather than the product.
- Should we use deterministic, LLM-graded, or human-graded rubrics?
- Whichever fits the dimension: deterministic for exact match and schema validation, LLM-graded for faithfulness, tone, and appropriateness, and human-graded for high-stakes calibration and spot-checks. Most workflows use all three at different layers.
- How specific should a rubric be?
- As specific as you can make it. “The answer cites the source span verbatim” beats “the answer is grounded.” Multiple narrow dimensions (faithfulness, format, and tone, each graded separately) beat one fuzzy overall score. Writing the rubric in specific terms is where the team confronts what it actually wants from the AI.
- What happens when we change the rubric?
- Scores from before the change are no longer directly comparable to scores from after it. Treat rubric edits like prompt edits: version control, PR review, and a release note explaining the change. Major rubric edits trigger a re-baseline of the scoreboard.