The discipline of evaluating AI systems has moved from a quiet engineering concern to the gating question on whether a system ships at all. This report is Morvion's yearly read on where the field actually stands in 2026, what is working in production, what is breaking in public, and what the discipline is likely to look like by 2027. It is observational, drawn from our own engagements, from the open literature published since 2025, and from operator conversations across the practices we work in. Not a survey. A reading of the field.
Why this report exists.
We publish this report for three audiences. Founder-operators who are commissioning AI work and need a way to tell shippable from not-shippable. Engineering leads building AI systems who want a yardstick against which to measure their own practice. And investors who are reading AI claims at the diligence stage and want a structured way to ask the right questions. The four levels and the twelve-question scorecard are meant to be useful in all three conversations.
We are not the first to write about this. The Anthropic and OpenAI engineering blogs, the AI Engineer summit talks, and a growing set of open-source eval frameworks have all contributed to the language. Morvion's contribution is the operator-side read: what we see when we sit with the teams shipping these systems, what changes hands between the engineering bench and the operator's reality, where the discipline holds and where it cracks.
“An AI system without evals is a vibe. A vibe is not a product. A vibe is what gets rolled back at midnight.”
The four-level maturity model.
Production AI systems in 2026 cluster into four discernible levels. Most systems sit lower than their teams claim. The level is determined by the engineering artifacts, not by the quality of the demo.
Level 0 · Vibe.
No fixture set. No rubric. No baseline. The team ships when the demo feels right and rolls back when the operator complains. Most teams whose production AI sits at this level in 2026 do not know it; they have a Notion page titled “evaluation” that contains five hand-typed prompts and an opinion. Our operator-informed read places roughly half of production AI systems here.
Level 1 · Scripted.
A small set of smoke-test prompts the team runs by hand before each release. No version-controlled fixture set, no written rubric, no stored baseline. Failures are noticed when someone notices them. The team is doing better than Vibe but is still discovering regressions through user-reported incidents rather than through gates. Roughly a tenth of the production AI we see sits here.
Level 2 · Eval-gated.
The team has a fixture set in version control, a written rubric, and a script that produces scores on demand. The scores are read by humans before each release. The gate is human judgment, not automation. This level is competent and shippable; it is also where the discipline starts to be visible from the outside. Roughly a quarter of production AI systems sit at this level in 2026.
Level 3 · Eval-driven.
The team has Level 2 plus an automated regression gate in continuous integration. A pull request that regresses past a per-metric tolerance fails the check. Baselines are updated only when a release ships green. Fixtures grow whenever a production output surprises an operator. The grader model is pinned and version-controlled. This level ships confidently. Roughly an eighth of production AI in 2026 reaches it, and that fraction is growing fastest. Teams at Level 3 are not louder than the rest; they are usually quieter, because they have nothing to argue about. The scoreboard answers for them.
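A minimal sketch of the kind of regression gate described above, assuming scores are stored as plain dictionaries keyed by metric name and that higher is better. The function names, metric names, and numbers are hypothetical; a real harness would load the baseline from the last green release and run this as a CI step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    name: str
    baseline: float
    current: float
    tolerance: float  # maximum allowed drop below the baseline

    @property
    def regressed(self) -> bool:
        # Small epsilon so floating-point noise at the boundary does not fail the gate.
        return (self.baseline - self.current) > self.tolerance + 1e-9


def check_regressions(baseline, current, tolerances):
    """Compare the current run to the stored baseline, metric by metric.

    A metric that is missing from the current run raises instead of being
    silently skipped, so a dropped metric cannot pass the gate.
    """
    results = []
    for name, base_score in baseline.items():
        if name not in current:
            raise KeyError(f"metric {name!r} missing from current run")
        results.append(MetricResult(name, base_score, current[name], tolerances[name]))
    return results


def gate(baseline, current, tolerances) -> int:
    """Return a process exit code: 0 if green, 1 if any metric regressed."""
    failed = [r for r in check_regressions(baseline, current, tolerances) if r.regressed]
    for r in failed:
        print(f"FAIL {r.name}: {r.baseline:.3f} -> {r.current:.3f} "
              f"(tolerance {r.tolerance:.3f})")
    return 1 if failed else 0
```

In CI, the script would end with something like `raise SystemExit(gate(baseline, current, tolerances))`, so that a regression past tolerance fails the pull request check. The tolerance is a per-metric value, not a single global threshold.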
The level claimed in a pitch is almost always one or two levels higher than the level evidenced in the codebase. The fastest diligence question for any AI vendor in 2026 is “can you show me the eval script and last week's scoreboard?” Silence is an answer.
Six patterns observed in the field.
These are the patterns we and our operator network have seen recur across 2025 and into 2026. None are theoretical; each comes from a real incident or a recurring shape in the work.
Pattern 1 · Grader-model drift.
Teams using a hosted model as their LLM-grader started seeing shifted scores in 2026 as those models received silent updates from the provider. The harness did not change, the production workflow did not change, but the numbers moved. The fix is grader pinning: treat the grader model as a pinned dependency of the harness, re-evaluate the entire fixture set when the pin moves, and never read a score change as a workflow change without ruling the grader out first. This pattern is now common enough that pinning is becoming a default rather than an advanced practice.
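One way to make the pin enforceable is to fingerprint the grader configuration and store that fingerprint next to the baseline. The sketch below assumes the baseline is a dictionary with a `grader_fingerprint` field; the class name, field names, and model identifier are illustrative, not any provider's API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GraderPin:
    model: str                # exact, dated model identifier (hypothetical)
    prompt: str               # full text of the grader prompt
    temperature: float = 0.0

    def fingerprint(self) -> str:
        # Any change to the model, the prompt, or the settings changes the hash.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def baseline_is_comparable(pin: GraderPin, baseline: dict) -> bool:
    """True only if the baseline was produced with this exact grader pin.

    When this returns False, the scores are not comparable: re-run the whole
    fixture set with the new pin and store a fresh baseline before reading
    any score change as a workflow change.
    """
    return baseline.get("grader_fingerprint") == pin.fingerprint()
```

A fingerprint only catches changes the team makes itself. It does not detect a provider silently updating a model behind an unchanged identifier, which is why the pin should name a dated or otherwise immutable model version where the provider offers one.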
Pattern 2 · Fixture decay.
Fixture sets captured at the MVP no longer represent production traffic six to twelve months in. The shape of real inputs drifts, edge cases that were rare become frequent, and the harness keeps scoring high on stale examples while real customers see degraded behavior. The practice that fixes this is fixture growth: every operator surprise becomes a new record in the set. Old records stay unless the underlying workflow has fundamentally changed. The fixture set should look like a year of real traffic, not a week of imagined personas.
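Fixture growth can be as small as one append-only helper. The sketch below assumes fixtures are dictionaries serialized one per line (JSON Lines) in version control; the field names and the `add_surprise` helper are hypothetical.

```python
import hashlib
import json


def fixture_id(input_text: str) -> str:
    """Stable id derived from the input, so the same surprise is not added twice."""
    return hashlib.sha256(input_text.encode("utf-8")).hexdigest()[:16]


def add_surprise(fixtures, input_text, observed_output, operator_note, captured_on):
    """Return a new fixture list with the operator surprise appended.

    Existing records are never removed or rewritten; a duplicate input is a no-op.
    """
    fid = fixture_id(input_text)
    if any(f["id"] == fid for f in fixtures):
        return list(fixtures)
    record = {
        "id": fid,
        "input": input_text,
        "observed_output": observed_output,
        "operator_note": operator_note,   # why the operator flagged the output
        "captured_on": captured_on,       # ISO date string, e.g. "2026-03-02"
        "source": "production",
    }
    return list(fixtures) + [record]


def to_jsonl(fixtures) -> str:
    """Serialize one fixture per line, ready to commit to version control."""
    return "".join(
        json.dumps(f, ensure_ascii=False, sort_keys=True) + "\n" for f in fixtures
    )
```

Keeping the helper append-only mirrors the rule that old records stay: removing a fixture is a deliberate, reviewed change, not a side effect of adding a new one.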
Pattern 3 · The multi-agent eval gap.
Single-agent eval matured in 2025. Most teams now know how to score the output of a single LLM call. Multi-agent flow eval is the open frontier in 2026. When a planner agent hands off to a retrieval agent which hands off to a drafting agent which hands off to a critic agent, where does the failure live? Per-agent rubrics are necessary but not sufficient; the flow itself has properties (handoff fidelity, end-to-end coverage, latency budget) that need their own grading. The field is still inventing the shape. Expect a wave of framework work here in 2027.
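Since the shape of flow-level grading is still unsettled, the following is only an illustration of what grading the flow, rather than each agent, could look like. It assumes each agent's step is recorded with its output and latency, and it uses deliberately crude definitions: coverage means the expected agents ran in the expected order, and handoff fidelity means a set of required fields is still present in what each agent passes downstream. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    agent: str
    output: dict        # what this agent hands to the next one
    latency_ms: float


def grade_flow(steps, expected_agents, required_fields, latency_budget_ms):
    """Grade properties of the whole flow, independent of per-agent rubrics."""
    # Coverage: did every expected agent run, in the expected order?
    coverage = [s.agent for s in steps] == list(expected_agents)

    # Handoff fidelity: fraction of handoffs where the required fields survive.
    # The final step hands off to no one, so it is excluded.
    handoffs = steps[:-1]
    intact = sum(
        1 for s in handoffs
        if all(s.output.get(f) not in (None, "") for f in required_fields)
    )
    fidelity = intact / len(handoffs) if handoffs else 1.0

    total_latency = sum(s.latency_ms for s in steps)
    return {
        "coverage": coverage,
        "handoff_fidelity": fidelity,
        "total_latency_ms": total_latency,
        "within_latency_budget": total_latency <= latency_budget_ms,
    }
```

A low fidelity score localizes the failure to a handoff rather than to an agent, which is the question a per-agent rubric cannot answer on its own.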
Pattern 4 · The eval-cost cliff.
In 2025, the cost of running a 100-fixture rubric suite was a rounding error. In 2026, with LLM-graded rubrics running multiple dimensions per fixture against reference models, the cost of a single full eval run is starting to exceed the daily cost of the production workflow it tests. Teams are responding with tiered fixtures (a small smoke set per PR, the full set on a nightly schedule), with deterministic-first rubric design, and with sampled runs against the full set. The teams that have not redesigned for cost are running their evals weekly or monthly and shipping blind in between.
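A sketch of tiered and sampled fixture selection, assuming each fixture carries an `id` and a `tier` field with the value `"smoke"` or `"full"`. The mode names and the hashing scheme are illustrative choices, not an established convention.

```python
import hashlib


def _bucket(fixture_id: str, seed: str) -> int:
    """Map a fixture to a stable bucket in [0, 10000) for a given seed."""
    digest = hashlib.sha256(f"{seed}:{fixture_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def select_fixtures(fixtures, mode, sample_rate=0.2, seed="nightly"):
    """Choose which fixtures to run for a given trigger.

    "pr"      -> the small smoke tier only, cheap enough to run on every PR
    "full"    -> every fixture, intended for a nightly schedule
    "sampled" -> the smoke tier plus a stable sample of the remaining fixtures
    """
    if mode == "pr":
        return [f for f in fixtures if f["tier"] == "smoke"]
    if mode == "full":
        return list(fixtures)
    if mode == "sampled":
        threshold = round(sample_rate * 10_000)
        return [
            f for f in fixtures
            if f["tier"] == "smoke" or _bucket(f["id"], seed) < threshold
        ]
    raise ValueError(f"unknown mode: {mode!r}")
```

Hash-based sampling keeps the selection reproducible for a given seed, so two runs on the same commit score the same fixtures; rotating the seed (for example, by date) lets successive sampled runs cover different parts of the full set.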
Pattern 5 · Operator-grader disagreement.
An LLM-graded rubric returns green. The operator looks at the output and says it is wrong. This happens when the rubric has overfit on the shape of a good output (the right sections, the right length, the right tone words) instead of the signal (does it actually move the customer forward, does it answer the real question, does it sound like us). The fix is human review at a sampled rate. Five to ten percent of outputs weekly, reviewed by an operator, with disagreements fed back into the rubric. Without this loop, the rubric becomes a flatterer that always tells the team the system is doing well.
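The review loop has two mechanical parts: drawing the weekly sample, and turning operator verdicts into a list of cases the rubric got wrong. The sketch below assumes each reviewed output is recorded with a grader verdict and an operator verdict as booleans; the function and field names are hypothetical.

```python
import random


def weekly_review_sample(output_ids, rate=0.05, seed=0):
    """Draw a reproducible sample of production outputs for operator review."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    ids = sorted(output_ids)
    if not ids:
        return []
    k = min(len(ids), max(1, round(len(ids) * rate)))
    return random.Random(seed).sample(ids, k)


def disagreements(reviews):
    """Summarize where the grader and the operator disagree.

    Each review is a dict with "id", "grader_pass", and "operator_pass".
    Cases the grader passed but the operator rejected are the ones that
    signal a flattering rubric and should be fed back into it.
    """
    green_red = [r["id"] for r in reviews if r["grader_pass"] and not r["operator_pass"]]
    red_green = [r["id"] for r in reviews if not r["grader_pass"] and r["operator_pass"]]
    rate = (len(green_red) + len(red_green)) / len(reviews) if reviews else 0.0
    return {
        "disagreement_rate": rate,
        "grader_green_operator_red": green_red,
        "grader_red_operator_green": red_green,
    }
```

Tracking the disagreement rate week over week gives the team one number for whether the rubric is drifting away from operator judgment.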
Pattern 6 · Retrieval evals lagging.
When a retrieval-augmented system fails, the team almost always blames the model. In our experience the cause is more often the retrieval layer: the wrong documents surfaced, the right documents not chunked usefully, the query rewriter dropping context, the reranker scoring on the wrong axis. Almost no team in 2026 evaluates the retrieval step independently of generation. Until the retrieval harness exists alongside the generation harness, the team is debugging a two-stage system with one instrument. Expect retrieval eval to become first-class in 2027.
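A retrieval harness can start with a single metric scored before any generation happens. The sketch below computes recall at k, assuming each fixture lists the document ids a human labelled as relevant alongside the ids the retriever returned in ranked order. The field names are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the labelled-relevant documents that appear in the top k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("fixture has no labelled relevant documents")
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(cases, k):
    """Average recall at k over fixtures shaped like
    {"retrieved": [...ranked ids...], "relevant": [...labelled ids...]}."""
    if not cases:
        raise ValueError("no retrieval fixtures to score")
    return sum(recall_at_k(c["retrieved"], c["relevant"], k) for c in cases) / len(cases)
```

For example, `recall_at_k(["d3", "d7", "d1", "d9"], {"d1", "d2"}, k=3)` is 0.5: one of the two relevant documents made the top three. A low score here points at the retrieval layer before anyone starts blaming the model.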
The three principles emerging as discipline.
Across the teams operating at Level 2 and Level 3, three principles recur. They are not best-practice in the soft sense; they are the practices the discipline appears to be converging on.
- Fixtures from real traffic, never imagined personas. The single highest-leverage decision in an eval harness is where the fixtures come from. Real production traffic produces fixtures that catch real failures. Imagined personas produce fixtures that catch nothing the team did not already anticipate. A fixture set sourced from real inputs is the difference between an eval suite that ages with the system and one that ossifies on day one.
- Grader pinning, the same way model pinning is treated. The grader model is part of the harness. The version is fixed, the prompt is fixed, and a bump on either triggers a re-evaluation of the entire fixture set with the new baseline. Teams that treat the grader as a dependency instead of a free-floating tool catch grader drift before it becomes a production decision.
- The baseline owns the release. The regression check is automated. The tolerance is per-metric. A regression past tolerance fails the PR. No exceptions, no overrides, no “we'll fix it in the next release.” The discipline only works if the gate cannot be argued with. Teams that allow manual overrides on the gate lose the gate within two releases.
The 2026 scorecard · twelve questions.
A self-audit for placing an AI system on the maturity model. Each question is answered yes or no by reference to the codebase, not by reference to the team's intent. The scorecard is designed to take under thirty minutes to run with an engineering lead and an operator in the room together.
- Do you have a fixture set in version control?
- Are the fixtures sourced from real production traffic, not imagined personas?
- Is the rubric written down and signed off by both the engineering lead and the operator?
- Does the rubric use deterministic checks wherever the output shape permits?
- Is the LLM-grader model version pinned, like the production model is pinned?
- Is there a baseline score per metric, stored from the last green release?
- Does continuous integration gate releases on a regression check against that baseline?
- Are tolerances defined per metric, not as a single global threshold?
- Are new fixtures added whenever a real-world output surprises an operator?
- Is a sampled subset of production output (five to ten percent) reviewed by humans weekly?
- Is retrieval quality evaluated independently from generation quality?
- Is the running cost of the eval suite tracked, budgeted, and bounded?
Scoring:
- 0 to 3 Yes: Level 0 (Vibe). The system ships when someone says it feels right. Start with question 1 and walk forward. A fixture set is the unlock.
- 4 to 6 Yes: Level 1 (Scripted). The team has scaffolding. The next move is a written rubric and a baseline.
- 7 to 9 Yes: Level 2 (Eval-gated). The discipline is visible. The remaining gap is automation of the gate and independent retrieval eval.
- 10 to 12 Yes: Level 3 (Eval-driven). The system ships confidently. The remaining work is to stay there as production traffic shifts.
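The scoring bands above are simple enough to encode directly. The sketch below assumes answers are recorded as a mapping from question number to the evidence offered (a file path, a commit, or a CI configuration), and that a question without evidence counts as a no. The function name and data shape are hypothetical.

```python
TOTAL_QUESTIONS = 12

# (lowest yes count, highest yes count, level) as given in the scoring bands.
BANDS = [
    (0, 3, "Level 0 (Vibe)"),
    (4, 6, "Level 1 (Scripted)"),
    (7, 9, "Level 2 (Eval-gated)"),
    (10, 12, "Level 3 (Eval-driven)"),
]


def place_on_model(evidence):
    """Return (yes_count, level) for a scorecard run.

    `evidence` maps question numbers 1..12 to an evidence string, or to
    None or an empty string when the answer is no. A yes is only counted
    when evidence is present, matching the rule that answers refer to the
    codebase rather than to the team's intent.
    """
    valid = set(range(1, TOTAL_QUESTIONS + 1))
    unknown = set(evidence) - valid
    if unknown:
        raise ValueError(f"unknown question numbers: {sorted(unknown)}")
    yes_count = sum(1 for q in valid if (evidence.get(q) or "").strip())
    for low, high, level in BANDS:
        if low <= yes_count <= high:
            return yes_count, level
    raise AssertionError("yes_count outside 0..12")  # unreachable given the bands
```

Recording the evidence string rather than a bare yes also leaves an audit trail the team can revisit the next time the scorecard is run.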
Run the scorecard in a thirty-minute working session. Engineering lead answers each question with reference to a file path, a commit, or a CI configuration. Operator witnesses. The pairing is the point: it forces both sides of the system to share the same picture of where the work actually stands.
Where the field is heading · five predictions for 2027.
Calibrated forecasts based on what we are seeing accelerate. We will revisit each in the 2027 edition of this report and note which ones moved in the predicted direction.
- Retrieval evals become first-class. Separate harnesses for retrieval quality (recall at k, rerank precision, chunking faithfulness) and for generation quality. By the end of 2027, evaluating a RAG system without retrieval-level metrics will look the way shipping a backend without integration tests looks today.
- Multi-agent flow eval frameworks ship in the open. A handful of open-source frameworks for grading agent-to-agent handoffs, flow coverage, and per-agent accountability will land. The shape of the eval will rhyme with distributed-tracing tooling more than with unit-test tooling.
- Grading-as-a-service emerges. A small market for pinned, audited grader models with their own SLAs, their own change logs, and their own pricing axes. Teams will stop running their own grader stacks for common rubric shapes and will buy the grading layer.
- The eval gate becomes a hiring filter. By the end of 2027, “walk me through the eval harness on the last AI system you shipped” will be a standard interview question for AI engineers, the same way “walk me through your test setup” is for backend engineers today.
- The vibe-shipping cliff becomes visible. Level 0 and Level 1 teams will see public incident reports, forced rollbacks, and customer trust failures at rates materially higher than Level 2 and Level 3 teams. The discipline gap will become a competitive moat for the teams that have crossed it and a recurring liability for the teams that have not. Insurance and procurement language will start to reflect the gap before the end of 2027.
Methodology and limits of this report.
This report is observational, not survey-grounded. It is drawn from three sources. First, Morvion's own production engagements: AI workflows shipped through eval harnesses since 2025 across CRM enrichment, customer reply drafting, document summarization, sales-call recap, and hospitality and marketplace operations. Second, the open eval literature: blog posts, conference talks, and open-source frameworks published since the 2025 inflection when the language for this work started to stabilize. Third, operator conversations: founder-operators across hospitality, marketplace, and B2B service businesses who have either commissioned AI work or evaluated it for acquisition.
The numeric reads in the maturity-model section (approximately half at Level 0, a tenth at Level 1, a quarter at Level 2, an eighth at Level 3) are operator-informed estimates, not measurements; they are rounded and do not sum exactly to the whole. The patterns section is a recurrence-frequency read across our own engagements and our operator network. The 2027 predictions are calibrated forecasts, not certainties; the value is in the structure and the falsifiability, not in the precision.
The 2027 edition of this report will include a survey component. If your team would like to participate, the intake is at the bottom of this page.
Where this fits in the rest of the practice.
The State of Eval-Driven AI report is the landscape view. The companion reference, The Morvion Eval Spec, is the how-to: the three-layer model, the deterministic and LLM-graded scoring patterns, the worked examples for four common AI workflows. Read together, they answer the two questions an operator brings to this work: where the field is, and how to do the work inside it.
For the long-form field note that pre-dated this report, see Eval-driven AI · the only kind that ships. For one-paragraph definitions of the underlying terms, see the glossary entries on eval-driven AI, AI observability, retrieval-augmented generation, and multi-agent workflow. The full practice page is Intelligent Systems and AI Infrastructure.
Cite this report.
The canonical citation:
Morvion. (2026). The State of Eval-Driven AI · 2026.
Morvion Field Reports, v1.0.0.
https://morvion.com/state-of-eval-2026
Excerpts and graphics may be reproduced with attribution. The report will be re-published annually in May. Substantive revisions to the 2026 edition will bump the minor version and be noted at the top.