Eval rubric · Morvion Glossary

An eval rubric is the written definition of what counts as a good output for one input class. It is the scoring contract that turns subjective judgement into a number an eval harness can compare across releases. Rubric quality determines whether the scoreboard reflects actual product quality or just the rubric author's mood.

The three rubric shapes.

Deterministic. The output either matches the expected value or it doesn't. Schema validation, fact lookup, exact-match classification. Fastest, cheapest, and the best choice when the answer is binary.
LLM-graded. A judge model scores the output against a written rubric. Used for faithfulness, tone, appropriateness, and other criteria that don't reduce to exact match. Slower and noisier than deterministic grading, but usable on subjective dimensions.
Human-graded. A domain expert scores a sample of outputs. The most reliable shape and the most expensive. Used to calibrate the LLM grader and to spot-check the production output distribution.
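As a rough illustration of how the three shapes differ in code, the sketch below shows two deterministic checks, a thin wrapper around a judge model, and an agreement rate for calibrating that judge against human labels. Every name here (`grade_exact_match`, `grade_schema`, `grade_with_judge`, `agreement_rate`, and the `Judge` callable) is hypothetical and chosen for this example; the judge is passed in as a plain function so no particular model API is assumed.

```python
import json
from typing import Callable

# Hypothetical judge interface: takes a prompt, returns the model's raw reply.
Judge = Callable[[str], str]


def grade_exact_match(output: str, expected: str) -> bool:
    """Deterministic: pass only if the normalized output equals the expected label."""
    return output.strip().lower() == expected.strip().lower()


def grade_schema(output: str, required_keys: dict[str, type]) -> bool:
    """Deterministic: pass only if the output is a JSON object with the required typed keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required_keys.items())


def grade_with_judge(output: str, criterion: str, judge: Judge) -> bool:
    """LLM-graded: ask the judge for a PASS/FAIL verdict on one written criterion."""
    prompt = (
        f"Criterion: {criterion}\n"
        f"Output: {output}\n"
        "Reply with exactly PASS or FAIL."
    )
    return judge(prompt).strip().upper() == "PASS"


def agreement_rate(judge_scores: list[bool], human_scores: list[bool]) -> float:
    """Calibration: fraction of sampled items where the judge and the human expert agree."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)
```

A low agreement rate on the human-graded sample is a signal to revise the judge prompt or the rubric wording before trusting the LLM-graded scores.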

Writing a good rubric.

Specific over general: “the answer cites the source span verbatim” beats “the answer is grounded.” Multiple narrow dimensions over one fuzzy overall score: faithfulness, format, and tone, each graded separately. Worked examples: for each dimension, show one passing and one failing output. The rubric is a living document; new edge cases turn into new clauses.
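One way to make these guidelines concrete is to store the rubric as data: one entry per narrow dimension, each with a specific criterion, a check, and its worked passing and failing examples. The sketch below is a minimal illustration under that assumption. The `Dimension` class, the two checks, the 30-word limit, and the sample `SOURCE` text are all invented for this example and are not part of any Morvion specification.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Dimension:
    name: str
    criterion: str                      # the specific, checkable statement
    check: Callable[[str, str], bool]   # (output, source) -> pass/fail
    passing_example: str
    failing_example: str


def cites_verbatim(output: str, source: str) -> bool:
    """Pass if the output has at least one double-quoted span and all such spans occur in the source."""
    quotes = re.findall(r'"([^"]+)"', output)
    return bool(quotes) and all(q in source for q in quotes)


def within_length(output: str, source: str) -> bool:
    """Pass if the output is at most 30 whitespace-separated words (an assumed limit)."""
    return len(output.split()) <= 30


# Illustrative source text, invented for this example.
SOURCE = "Refunds are issued within 14 days of a returned item arriving at the warehouse."

RUBRIC = [
    Dimension(
        name="faithfulness",
        criterion="The answer cites the source span verbatim.",
        check=cites_verbatim,
        passing_example='The policy says "Refunds are issued within 14 days".',
        failing_example='The policy says "refunds take about two weeks".',
    ),
    Dimension(
        name="format",
        criterion="The answer is at most 30 words.",
        check=within_length,
        passing_example="Refunds arrive within 14 days.",
        failing_example=" ".join(["word"] * 31),
    ),
]


def score(output: str, source: str) -> dict[str, bool]:
    """Grade each dimension separately instead of producing one overall score."""
    return {d.name: d.check(output, source) for d in RUBRIC}


def examples_hold(source: str) -> bool:
    """The worked examples double as tests: each must pass or fail as labelled."""
    return all(
        d.check(d.passing_example, source) and not d.check(d.failing_example, source)
        for d in RUBRIC
    )
```

Running `examples_hold` whenever a clause is added gives an early warning when a new edge case contradicts an existing worked example.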

Rubrics get versioned too.

A rubric change shifts the meaning of every score. Treat rubric edits the same as prompt edits: kept in version control, PR-reviewed, and shipped with a release note explaining the change. The scoreboard from before the rubric edit is not directly comparable to the scoreboard after.
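A harness can enforce this non-comparability mechanically by stamping every run with a fingerprint of the rubric it was scored against. The sketch below assumes the rubric lives in a text file whose contents are hashed; `rubric_fingerprint`, `EvalRun`, and `pass_rate_delta` are hypothetical names for this illustration, not an existing API.

```python
import hashlib
from dataclasses import dataclass


def rubric_fingerprint(rubric_text: str) -> str:
    """Short content hash of the rubric; any edit to the text changes it."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class EvalRun:
    release: str
    rubric_version: str   # fingerprint of the rubric used for this run
    pass_rate: float


def pass_rate_delta(baseline: EvalRun, candidate: EvalRun) -> float:
    """Compare two runs, refusing when they were scored against different rubrics."""
    if baseline.rubric_version != candidate.rubric_version:
        raise ValueError("rubric changed between runs; re-baseline before comparing")
    return candidate.pass_rate - baseline.pass_rate
```

Hashing the raw text means even whitespace edits count as a new version. That is the strict choice; a team could normalize the text first if cosmetic edits should not force a re-baseline.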

Frequently asked.

What is an eval rubric?

An eval rubric is the written definition of what counts as a good output for one input class. It's the scoring contract that turns subjective judgement into a number an eval harness can compare across releases. With a vague rubric, the scoreboard measures the rubric author's mood rather than the product.

Should we use deterministic, LLM-graded, or human-graded rubrics?

Whichever fits the dimension. Deterministic for exact-match and schema validation. LLM-graded for faithfulness, tone, and format adherence. Human-graded for high-stakes calibration and spot-checks. Most workflows use all three at different layers.

How specific should a rubric be?

As specific as you can make it. 'The answer cites the source span verbatim' beats 'the answer is grounded.' Multiple narrow dimensions (faithfulness, format, and tone, each graded separately) beat one fuzzy overall score. Writing the rubric in specific terms is where the team confronts what it actually wants from the AI.

What happens when we change the rubric?

Every score before the change becomes non-comparable to every score after. Treat rubric edits like prompt edits: version control, PR review, and a release note explaining the change. Major rubric edits trigger a re-baseline of the scoreboard.
