An AI evaluation framework is the discipline-level layer above any single eval harness. The harness is the tool; the framework is the methodology — how fixtures are sourced, how rubrics are versioned, how regression policies are set, how releases are gated, and how all of it stays coherent across multiple workflows on the same product.
The five pieces of a framework.
- Fixture sourcing policy. Where do real examples come from? Production sampling, manual curation, synthetic generation? How are they labelled? How often refreshed?
- Rubric library. Reusable scoring rubrics across workflows (faithfulness, refusal appropriateness, format adherence). Versioned and shared so different teams measure the same things the same way.
- Regression policy. The tolerances: how far a metric may drop before a release is blocked. Defaults differ by metric (for example, faithfulness may drop by at most 0.02 and throughput by at most 10%).
- Release gates. The CI rules that read the eval output and decide whether the change ships. Gate logic lives in version control, not in someone's head.
- Audit log. Every release records which rubrics it scored against, what each metric was, and whether any gate was overridden. The auditable trail of the framework.
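The regression policy, release gate, and audit log described in the list can be sketched together in TypeScript. This is a minimal sketch under stated assumptions, not the Morvion implementation: the names `RegressionPolicy`, `evaluateGate`, and `buildAuditRecord` are hypothetical, the release id, rubric versions, and metric values are invented for illustration, and reading the faithfulness tolerance as an absolute drop and the throughput tolerance as a relative drop is one interpretation of the defaults quoted in the list.

```typescript
// Hypothetical shapes for illustration only; not taken from any published spec.
type Tolerance =
  | { kind: "absolute"; maxDrop: number }  // drop measured in score units
  | { kind: "relative"; maxDrop: number }; // drop as a fraction of the baseline

type RegressionPolicy = Record<string, Tolerance>;

interface MetricResult {
  metric: string;
  baseline: number;
  candidate: number;
  drop: number;
  allowed: number;
  passed: boolean;
}

interface AuditRecord {
  release: string;
  rubricVersions: Record<string, string>;
  results: MetricResult[];
  blocked: boolean;
  overriddenBy: string | null;
}

// Assumption: 0.02 is an absolute drop, 10% is relative to the baseline.
const policy: RegressionPolicy = {
  faithfulness: { kind: "absolute", maxDrop: 0.02 },
  throughput: { kind: "relative", maxDrop: 0.1 },
};

// Small slack so floating-point noise at the boundary does not fail the gate.
const EPSILON = 1e-9;

function evaluateGate(
  gatePolicy: RegressionPolicy,
  baseline: Record<string, number>,
  candidate: Record<string, number>,
): MetricResult[] {
  return Object.keys(gatePolicy).map((metric) => {
    const tol = gatePolicy[metric];
    const base = baseline[metric];
    const cand = candidate[metric];
    // A metric missing from either run fails the gate instead of passing silently.
    if (tol === undefined || base === undefined || cand === undefined) {
      return { metric, baseline: NaN, candidate: NaN, drop: NaN, allowed: NaN, passed: false };
    }
    const drop = base - cand; // positive means the candidate is worse
    const allowed = tol.kind === "absolute" ? tol.maxDrop : tol.maxDrop * base;
    return { metric, baseline: base, candidate: cand, drop, allowed, passed: drop <= allowed + EPSILON };
  });
}

function buildAuditRecord(
  release: string,
  rubricVersions: Record<string, string>,
  results: MetricResult[],
  overriddenBy: string | null,
): AuditRecord {
  const anyFailed = results.some((r) => !r.passed);
  // An override unblocks the release, but the override itself stays on record.
  return { release, rubricVersions, results, blocked: anyFailed && overriddenBy === null, overriddenBy };
}

// Example run with invented numbers.
const record = buildAuditRecord(
  "example-release",
  { faithfulness: "1.0.0" },
  evaluateGate(policy, { faithfulness: 0.91, throughput: 120 }, { faithfulness: 0.9, throughput: 100 }),
  null,
);
console.log(record.blocked); // true: throughput fell by 20, more than the allowed 12
```

Keeping the policy as plain data in version control means a tolerance change shows up as a reviewable diff, and the audit record captures both the scores and any override in one place.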
Why a framework, not just a harness.
A harness scores one workflow. A framework keeps a hundred workflows scored consistently. Without the framework, every team picks its own metrics, the same word means different things in different scoreboards, and cross-product comparison is impossible. The framework is the difference between AI engineering as a craft and AI engineering as a discipline.
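To make the distinction concrete, here is a minimal TypeScript sketch of one versioned rubric shared by two harnesses. Everything in it is an assumption made for illustration: the `RubricLibrary` class, the `runHarness` function, the `format-adherence` rubric (which here only checks that the output parses as JSON), and both toy workflows are hypothetical and do not describe any particular library's API.

```typescript
// Hypothetical types for illustration only.
interface Fixture {
  id: string;
  input: string;
  expected: string;
}

interface Rubric {
  name: string;
  version: string;
  score(output: string, fixture: Fixture): number; // 0 (fail) to 1 (pass)
}

// The shared library: rubrics are looked up by name AND version,
// so every harness that pins the same pair measures the same thing.
class RubricLibrary {
  private rubrics: { [key: string]: Rubric } = {};

  register(rubric: Rubric): void {
    const key = rubric.name + "@" + rubric.version;
    if (this.rubrics[key] !== undefined) {
      throw new Error("rubric already registered: " + key);
    }
    this.rubrics[key] = rubric;
  }

  get(name: string, version: string): Rubric {
    const key = name + "@" + version;
    const found = this.rubrics[key];
    if (found === undefined) {
      throw new Error("unknown rubric: " + key);
    }
    return found;
  }
}

const library = new RubricLibrary();
library.register({
  name: "format-adherence",
  version: "1.0.0",
  // Toy check: 1 if the output parses as JSON, otherwise 0.
  score: (output) => {
    try {
      JSON.parse(output);
      return 1;
    } catch {
      return 0;
    }
  },
});

// A harness scores ONE workflow against its fixtures with a pinned rubric.
function runHarness(
  workflow: (input: string) => string,
  fixtures: Fixture[],
  rubric: Rubric,
): { rubric: string; mean: number } {
  if (fixtures.length === 0) {
    throw new Error("a harness needs at least one fixture");
  }
  const total = fixtures
    .map((f) => rubric.score(workflow(f.input), f))
    .reduce((a, b) => a + b, 0);
  return { rubric: rubric.name + "@" + rubric.version, mean: total / fixtures.length };
}

// Two toy workflows scored with the same pinned rubric.
const extract = (input: string): string => JSON.stringify({ text: input });
const triage = (input: string): string =>
  input.indexOf("refund") !== -1 ? JSON.stringify({ label: "billing" }) : "label: other";

const sharedFixtures: Fixture[] = [
  { id: "f1", input: "refund please", expected: "" },
  { id: "f2", input: "hello", expected: "" },
];
const pinned = library.get("format-adherence", "1.0.0");

console.log(runHarness(extract, sharedFixtures, pinned).mean); // 1
console.log(runHarness(triage, sharedFixtures, pinned).mean);  // 0.5
```

Because both harnesses report the same `name@version` string, the two scores are comparable; if one team changed the rubric, it would have to register a new version rather than silently redefine the old one.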
The Morvion Eval Spec.
The studio's framework is published openly at /eval-spec: schemas, a scoring library, four worked examples, and the conventions every Morvion intelligent-systems engagement inherits. The published spec is the version every team can read, adopt, and challenge.
Frequently asked.
- What is an AI evaluation framework?
  An AI evaluation framework is the discipline-level layer above any single eval harness. It defines fixture sourcing, rubric reuse, regression tolerances, release-gate logic, and the audit log, so that multiple workflows on the same product stay scored consistently.
- What's the difference between a framework and a harness?
  A harness is the running tool: fixtures + rubric + scorer for one workflow. A framework is the methodology that keeps many harnesses coherent: shared rubric library, shared regression policy, shared release-gate logic. Harnesses are run; frameworks are written.
- Do we need a framework if we only have one AI workflow?
  Not strictly, but writing it down once costs little and pays back the moment you add a second workflow. Most production AI grows from one workflow to five within a year. The framework written at workflow #1 makes workflows #2–#5 ship faster and stay measurable.
- What does Morvion's framework include?
  Fixture and rubric JSON schemas, a TypeScript scoring library, four worked examples (RAG, classification, agentic workflow, document extraction), a CLI harness, and a CI integration template. Published openly under MIT at /eval-spec.