An eval rubric is the written definition of what counts as a good output for one input class. It is the scoring contract that turns subjective judgement into a number an eval harness can compare across releases. Rubric quality determines whether the scoreboard reflects actual product quality or just the rubric author's mood.

The three rubric shapes.

  • Deterministic. The output either does or doesn't match the expected value. Schema validation, fact lookup, exact-match classification. The fastest and cheapest shape, and the best choice when the answer is binary.
  • LLM-graded. A judge model scores the output against a written rubric. Used for faithfulness, tone, appropriateness, and other criteria that don't reduce to exact match. Slower and noisier than deterministic, but usable on subjective dimensions.
  • Human-graded. A domain expert scores a sample. The most reliable and the most expensive. Used to calibrate the LLM grader and to spot-check the production output distribution.
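To make the deterministic shape and the calibration step concrete, here is a minimal sketch. The function names (grade_exact_match, grade_json_schema, agreement_rate) and the 0/1 scoring convention are assumptions chosen for illustration, not part of any particular eval harness.

```python
import json


def grade_exact_match(output: str, expected: str) -> int:
    """Return 1 if the normalized output equals the expected label, else 0."""
    return int(output.strip().lower() == expected.strip().lower())


def grade_json_schema(output: str, required_fields: dict) -> int:
    """Return 1 if the output is a JSON object with every required field
    present and of the expected Python type, else 0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0
    if not isinstance(parsed, dict):
        return 0
    return int(all(
        name in parsed and isinstance(parsed[name], expected_type)
        for name, expected_type in required_fields.items()
    ))


def agreement_rate(judge_scores: list, human_scores: list) -> float:
    """Fraction of items where the LLM judge and the human expert gave the
    same score. A low rate suggests the judge prompt or rubric needs work."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)
```

The LLM-graded shape is left out on purpose: it depends on whichever judge model and client the team uses. The agreement check is one simple way to compare that judge against a human-graded sample; teams with ordinal scores may prefer a chance-corrected statistic instead.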

Writing a good rubric.

Prefer specific criteria over general ones: “the answer cites the source span verbatim” beats “the answer is grounded”. Prefer multiple narrow dimensions (faithfulness, format, tone, each graded separately) over one fuzzy overall score. Include worked examples: for each dimension, show one passing and one failing output. The rubric is a living document; new edge cases turn into new clauses.
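One way to hold a rubric to these rules is to store it as structured data rather than free text. The sketch below is an assumption about how that could look; the Dimension and Rubric classes, their field names, and the example criteria are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Dimension:
    """One narrow, separately graded criterion with its worked examples."""
    name: str
    criterion: str          # specific, checkable wording
    passing_example: str    # one output that meets the criterion
    failing_example: str    # one output that does not


@dataclass
class Rubric:
    input_class: str
    dimensions: List[Dimension] = field(default_factory=list)

    def score_sheet(self, grades: Dict[str, int]) -> Dict[str, int]:
        """Return one score per dimension, in rubric order.

        Raises if any dimension is ungraded, so a single overall number
        cannot silently stand in for the separate dimensions."""
        missing = [d.name for d in self.dimensions if d.name not in grades]
        if missing:
            raise ValueError(f"ungraded dimensions: {missing}")
        return {d.name: grades[d.name] for d in self.dimensions}


# Hypothetical rubric for a support-answer input class.
support_rubric = Rubric(
    input_class="support_answer",
    dimensions=[
        Dimension(
            name="faithfulness",
            criterion="The answer cites the source span verbatim.",
            passing_example='Per the policy: "Refunds are issued within 14 days."',
            failing_example="Refunds usually arrive pretty quickly.",
        ),
        Dimension(
            name="tone",
            criterion="The answer addresses the customer directly and politely.",
            passing_example="Thanks for reaching out. Here is what applies to you.",
            failing_example="Read the policy.",
        ),
    ],
)
```

When a new edge case shows up, it becomes a new Dimension or a tightened criterion string, which keeps the "living document" visible in a diff.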

Rubrics get versioned too.

A rubric change shifts the meaning of every score. Treat rubric edits the same as prompt edits: in version control, PR-reviewed, with a release note explaining the change. The scoreboard from before the rubric edit is not directly comparable to the scoreboard after.
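A lightweight way to enforce this in a harness is to stamp every score with the version of the rubric that produced it and refuse to diff scores across versions. This is a sketch under assumptions: the content-hash versioning scheme and the names rubric_version, ScoreRecord, and score_delta are illustrative, and a team could equally use a git commit or a manual version number.

```python
import hashlib
from dataclasses import dataclass


def rubric_version(rubric_text: str) -> str:
    """Derive a short version id from the rubric text itself, so any edit
    to the wording produces a new id."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class ScoreRecord:
    release: str
    rubric_version: str
    score: float


def score_delta(before: ScoreRecord, after: ScoreRecord) -> float:
    """Return after.score - before.score, but only when both records were
    graded under the same rubric version."""
    if before.rubric_version != after.rubric_version:
        raise ValueError(
            "scores were graded under different rubric versions; "
            "re-baseline before comparing"
        )
    return after.score - before.score
```

With this in place, a rubric edit shows up as a hard error on the scoreboard instead of a silent shift in what the numbers mean.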

Frequently asked.

What is an eval rubric?
An eval rubric is the written definition of what counts as a good output for one input class. It's the scoring contract that turns subjective judgement into a number an eval harness can compare across releases. With a vague rubric, the scoreboard measures the grader's mood rather than the product.
Should we use deterministic, LLM-graded, or human-graded rubrics?
Whichever fits the dimension. Deterministic for exact-match and schema validation. LLM-graded for faithfulness, tone, and appropriateness. Human-graded for high-stakes calibration and spot-checks. Most workflows use all three at different layers.
How specific should a rubric be?
As specific as you can make it. “The answer cites the source span verbatim” beats “the answer is grounded.” Multiple narrow dimensions (faithfulness, format, tone, each graded separately) beat one fuzzy overall score. Writing the rubric in specific terms is where the team confronts what it actually wants from the AI.
What happens when we change the rubric?
Scores from before the change are not directly comparable to scores after it. Treat rubric edits like prompt edits: version control, PR review, and a release note explaining the change. Major rubric edits trigger a re-baseline of the scoreboard.