Document intelligence is where AI earns its keep most reliably and where most teams under-engineer it most visibly. The mistake is treating extraction as a prompt-engineering problem. It isn't. It is a pipeline problem with five stages, three confidence tiers, and one fact every operator learns the hard way: extraction accuracy isn't a number, it is a distribution across field types, and the economics are won by routing the long tail to humans, not by chasing the last two percent with a bigger model. This report is the reference for how the pipeline is built.

The five-stage pipeline.

Every working document intelligence system Morvion has shipped runs the same five stages. The boundaries differ; the responsibilities do not. Structured extraction sits at the core, but the pipeline around it is what makes it shippable.

Stage 01 · Ingest.

The surface that pulls documents in. Email forwards, upload forms, S3 prefixes, scanned-input devices, partner SFTPs. The ingest stage normalises file format (PDF, DOCX, EML, image), strips obvious noise (signature blocks, email quotes), and emits a stable canonical representation the rest of the pipeline can rely on.
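
The canonical representation can be sketched as a small typed record. This is a minimal illustration, not a Morvion implementation: CanonicalDocument, strip_email_noise, and to_canonical are hypothetical names, text extraction from PDF or image is assumed to happen upstream, and the noise rules (quoted lines starting with ">", the "-- " signature delimiter) are one common email convention among many.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalDocument:
    doc_id: str         # SHA-256 of the original bytes, stable across re-ingests
    source_format: str  # "pdf", "docx", "eml", or "image"
    text: str           # cleaned text handed to the classifier


def strip_email_noise(text: str) -> str:
    """Drop quoted reply lines and everything after a "-- " signature delimiter."""
    kept = []
    for line in text.splitlines():
        if line == "-- ":
            break  # conventional plain-text signature delimiter
        if line.lstrip().startswith(">"):
            continue  # quoted text from an earlier message
        kept.append(line)
    return "\n".join(kept).strip()


def to_canonical(raw: bytes, source_format: str, text: str) -> CanonicalDocument:
    """Build the record the later stages rely on (hypothetical shape)."""
    cleaned = strip_email_noise(text) if source_format == "eml" else text.strip()
    return CanonicalDocument(hashlib.sha256(raw).hexdigest(), source_format, cleaned)
```

Hashing the original bytes rather than the cleaned text keeps the identifier stable even when the noise rules change.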

Stage 02 · Classify.

The stage that decides what kind of document this is. Invoice. Contract. Intake form. Compliance attachment. Free-text query. The classifier outputs a typed label plus a confidence. Documents below the classification threshold route to human review and do not move further; documents above route to the right extractor.
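
The gate described above fits in a few lines. A minimal sketch, assuming the classifier returns a label and a confidence; Classification, next_stop, and the 0.90 threshold are illustrative assumptions, not values from the report.

```python
from dataclasses import dataclass

# Hypothetical threshold; in practice it would be tuned per class on labelled fixtures.
CLASSIFY_THRESHOLD = 0.90


@dataclass(frozen=True)
class Classification:
    label: str         # e.g. "invoice", "contract", "intake_form"
    confidence: float  # classifier confidence in [0, 1]


def next_stop(result: Classification, threshold: float = CLASSIFY_THRESHOLD) -> str:
    """Return the queue a document moves to after classification."""
    if result.confidence < threshold:
        # Below threshold: the document stops here and goes to a person.
        return "human_review"
    # At or above threshold: dispatch to the extractor for that class.
    return f"extractor:{result.label}"
```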

Stage 03 · Extract.

The stage that turns the document text into a typed object that matches a schema. Required fields, optional fields, types, validation rules, all declared up front. The model generates under structured-output constraints; the output is parsed against the schema before it is allowed to continue.
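
A sketch of the parse-before-continue step, using only the Python standard library. The Invoice schema and its field names are assumptions chosen for illustration; a production build might use a schema library instead, and the raw string stands in for whatever the model returns under structured-output constraints.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional

REQUIRED = ("invoice_number", "issue_date", "total")


@dataclass(frozen=True)
class Invoice:
    invoice_number: str
    issue_date: date
    total: float
    payment_terms: Optional[str] = None  # optional field


def parse_invoice(raw: str) -> Invoice:
    """Parse raw model output; raise ValueError if it does not fit the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED if name not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if not isinstance(data["invoice_number"], str) or not isinstance(data["issue_date"], str):
        raise ValueError("invoice_number and issue_date must be strings")
    total = data["total"]
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        raise ValueError("total must be a non-negative number")
    terms = data.get("payment_terms")
    if terms is not None and not isinstance(terms, str):
        raise ValueError("payment_terms must be a string when present")
    return Invoice(
        invoice_number=data["invoice_number"],
        issue_date=date.fromisoformat(data["issue_date"]),  # ValueError if malformed
        total=float(total),
        payment_terms=terms,
    )
```

Anything that raises here never reaches validation or routing; the caller decides whether to retry or send the document to review.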

Stage 04 · Validate.

Two layers of validation: deterministic schema checks (does the JSON parse, do required fields exist, do types match) and rubric checks (is the extracted value plausible against the source document, with an LLM grader scoring faithfulness per field). Both layers must pass for the extraction to leave the pipeline unsupervised.
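
The two layers can be sketched as one function that returns a list of failures. The grader is injected as a plain callable so the example runs without a model; substring_grader is a toy stand-in, and the 0.95 cut-off per field is an assumption borrowed from the faithfulness target later in this report.

```python
from typing import Callable, Mapping

# (field_name, extracted_value, source_text) -> faithfulness score in [0, 1].
# A real pipeline would call an LLM grader here; this signature is hypothetical.
Grader = Callable[[str, object, str], float]


def validate(record: Mapping[str, object], source_text: str,
             required: Mapping[str, type], grader: Grader,
             min_faithfulness: float = 0.95) -> list[str]:
    """Return a list of failures; an empty list means both layers passed."""
    failures = []
    # Layer 1: deterministic schema checks.
    for name, expected in required.items():
        if name not in record:
            failures.append(f"schema: missing {name}")
        elif not isinstance(record[name], expected):
            failures.append(f"schema: {name} is not {expected.__name__}")
    if failures:
        return failures  # do not spend grader calls on a malformed record
    # Layer 2: rubric check, scored per field against the source document.
    for name, value in record.items():
        if grader(name, value, source_text) < min_faithfulness:
            failures.append(f"faithfulness: {name}")
    return failures


def substring_grader(name: str, value: object, source_text: str) -> float:
    """Toy stand-in for an LLM grader: 1.0 if the value appears verbatim."""
    return 1.0 if str(value) in source_text else 0.0
```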

Stage 05 · Route.

The stage that hands the structured record to the next system: the CRM, the accounting platform, the contract repository, the operator inbox, or a downstream agent. Routing rules are typed (which fields decide the route) and observable (every routing decision is part of the observability trace).
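
One way to read "typed and observable": each rule declares the fields it decides on, and every decision is appended to a trace. A minimal sketch; RoutingRule, route, the target names, and the operator_inbox fallback are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class RoutingRule:
    name: str
    fields: tuple[str, ...]                          # fields that decide the route
    matches: Callable[[Mapping[str, object]], bool]  # predicate over the record
    target: str                                      # downstream system


def route(record: Mapping[str, object], rules: list[RoutingRule],
          trace: list[dict], default: str = "operator_inbox") -> str:
    """Pick the first matching rule and append the decision to the trace."""
    for rule in rules:
        if all(f in record for f in rule.fields) and rule.matches(record):
            trace.append({
                "rule": rule.name,
                "target": rule.target,
                "decided_by": {f: record[f] for f in rule.fields},
            })
            return rule.target
    # No rule matched: record that too, so the fallback is visible in the trace.
    trace.append({"rule": None, "target": default, "decided_by": {}})
    return default
```

Because the trace stores the deciding field values, an operator can replay why a record went where it did without re-running the pipeline.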

“OCR gives you words. Document intelligence gives you actions.”

The accuracy / cost / human-review tradeoff.

The economics of document intelligence are won at the boundary between “automate” and “route to human”. Three tiers, in production:

Tier 1 ─ Confidence ≥ 0.97   auto-route, no review
Tier 2 ─ Confidence 0.85–0.97  auto-route, sampled review (10–25%)
Tier 3 ─ Confidence < 0.85   route to human queue, do not auto-process

The thresholds vary by field type. A field with high regulatory cost (a contract clause number, a payment recipient) holds a tighter threshold than a free-text note. The eval harness sets the threshold per field type against a labelled fixture set, and the regression gate fails any release that lowers the threshold without raising the alert volume.
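
The tiering and the per-field override can be sketched together. The defaults below are the thresholds from the table; the tighter payment_recipient values and the field-type names are hypothetical. The sketch treats exactly 0.97 as Tier 1, which is one reading of the shared boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    auto: float = 0.97    # at or above: Tier 1, auto-route without review
    review: float = 0.85  # at or above (and below auto): Tier 2, sampled review


# Hypothetical per-field-type overrides; real values come from the eval harness.
FIELD_THRESHOLDS = {
    "payment_recipient": Thresholds(auto=0.99, review=0.95),
    "free_text_note": Thresholds(),
}


def tier(confidence: float, field_type: str) -> int:
    """Map a confidence score to tier 1, 2, or 3 for the given field type."""
    t = FIELD_THRESHOLDS.get(field_type, Thresholds())
    if confidence >= t.auto:
        return 1
    if confidence >= t.review:
        return 2
    return 3  # human queue, never auto-processed
```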

Reference architectures · three scales.

Reference 01 · Single-team pipeline.

One document type, one routing target. The team processes 100–2,000 documents per month. The smallest viable stack: one extractor, one schema, one human-review queue, one routing rule.

Ingest ──────── one inbound channel (email forward or upload form)
Classify ────── single-class shortcut (the team only ingests one type)
Extract ─────── one schema, one extraction prompt, structured-output
Validate ────── schema check + faithfulness grader per field
Route ───────── one downstream system (CRM, accounting, repository)
Cost band ───── €30–70k build, €0.5–1.5k/month run (excl. model API)
Build time ──── 6–10 weeks
Sweet spot ──── 100–2,000 docs/month
Live shape ──── invoice extraction, lead intake, NDA review

Reference 02 · Multi-team pipeline.

3–8 document types, multiple routing targets. Different teams own different document classes; the pipeline is shared infrastructure. Per-class extractors and routing rules; shared ingest, classifier, validator, and observability.

Ingest ──────── shared inbound surface + per-class normalisation
Classify ────── multi-class classifier with threshold per class
Extract ─────── per-class extractor + per-class schema
Validate ────── shared validator framework + per-class rubrics
Route ───────── per-class routing rules + audit trail per record
Cost band ───── €60–140k build, €2–6k/month run
Build time ──── 10–16 weeks
Sweet spot ──── 2,000–50,000 docs/month
Pattern ─────── ship one document class end-to-end, then onboard
                additional classes in 1–2 week passes

Reference 03 · Regulated pipeline (healthcare / FS / legal).

Sensitive document classes, audit obligations, data-residency boundaries. Same five stages, with extra layers for redaction at ingest, signed audit trails, residency-aware retrieval, and tighter human-review thresholds. The eval harness is shared with auditors before launch.

Ingest ──────── encrypted-at-rest channels + PII redaction layer
Classify ────── multi-class + jurisdiction tagging
Extract ─────── per-class extractor + per-jurisdiction prompts
Validate ────── deterministic + LLM-graded + sampled human review
Route ───────── per-class + per-jurisdiction routing + audit signature
Cost band ───── €100–280k build, €6–18k/month run
Build time ──── 14–24 weeks + audit cycle
Audit assets ── eval harness, fixture set, observability traces,
                rubric library, regression-gate logs

From the studio

Reference 01 and Reference 02 are shipped patterns across Morvion engagements. Reference 03 is a forward projection of the same shape with extra regulatory scaffolding: the architecture is identical, and the audit-evidence layer grows with the regulatory footprint.

Eval harness per document class.

Every document class ships with its own eval harness. Fixture set is per-class (50–500 real labelled documents). Rubric is per-class. Regression gate fails the release if any field accuracy regresses past tolerance.

Per-class fixture set ──── 50–500 labelled documents (real, not synthetic)
Per-field rubric ───────── deterministic + LLM-graded + spot-checked
Accuracy targets ───────── stable fields ≥ 0.97 · structured ≥ 0.92
                            free-text ≥ 0.85
Faithfulness target ────── ≥ 0.95 (every extracted value traceable to source)
Human-review rate target ─ 10–25% of Tier-2 documents (sampled)
Regression tolerance ───── ≤ 0.02 drop on any field; ≤ 5% rise in review rate
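
The regression tolerance row translates into a small gate. A sketch under two stated assumptions: the "5% rise in review rate" is read as five percentage points (0.05 absolute), and regression_gate and its argument names are hypothetical. A tiny epsilon keeps a drop of exactly 0.02 from failing on floating-point noise.

```python
EPSILON = 1e-9  # guards against float noise at the exact tolerance boundary


def regression_gate(baseline_acc: dict, candidate_acc: dict,
                    baseline_review_rate: float, candidate_review_rate: float,
                    max_drop: float = 0.02, max_review_rise: float = 0.05) -> list:
    """Return reasons to block the release; an empty list means the gate passes."""
    reasons = []
    for field, base in baseline_acc.items():
        cand = candidate_acc.get(field)
        if cand is None:
            reasons.append(f"{field}: no score in candidate run")
        elif base - cand > max_drop + EPSILON:
            reasons.append(f"{field}: accuracy dropped by {base - cand:.3f}")
    if candidate_review_rate - baseline_review_rate > max_review_rise + EPSILON:
        reasons.append("review rate rose past tolerance")
    return reasons
```

A release ships only when the returned list is empty; the reasons double as the log entry for the regression-gate record.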

Cost bands · what to expect.

Public ranges from Morvion engagements in 2026, all-in (audit + design + engineering + launch), excluding the model API spend that scales with document volume.

Discovery Sprint ──────── €18–25k · 2 weeks · validates a single class
Single-team pipeline ──── €30–70k · 6–10 weeks
Multi-team pipeline ───── €60–140k · 10–16 weeks
Regulated pipeline ────── €100–280k · 14–24 weeks + audit cycle
Retainer (post-launch) ── €3–14k/month · ongoing fixture refresh + tuning
Per-document API spend ── €0.01–0.10 depending on length + model choice

The wider end of each band is driven by integration depth: source systems with non-standard formats, legacy routing targets, multi-language extraction, and any compliance layer (PII redaction, signed audit trails, residency-aware infrastructure) that demands extra work.

The 12-question self-audit scorecard.

For an operator to assess their current document processing. One point per affirmative answer; max 12. Below 6 = a Discovery Sprint will surface faster wins than another vendor evaluation. Below 4 = humans are the extraction system and are the bottleneck.

  1. Does every incoming document have a typed extraction record produced within minutes?
  2. Is there a written per-field accuracy target the system is measured against?
  3. Are low-confidence extractions routed to a human review queue, not auto-processed?
  4. Does the validator check both schema (deterministic) and faithfulness (rubric)?
  5. Can the operator replay why any specific field was extracted the way it was?
  6. Is the eval harness run against a freshly sampled fixture at least weekly?
  7. Does the regression gate block releases that lower field accuracy past tolerance?
  8. Is per-document cost tracked alongside accuracy (cost-per-good-extraction)?
  9. Are routing rules typed and observable (every routing decision in the trace)?
  10. Does the system survive a source-format change without silent regression?
  11. Are PII fields handled with redaction or boundary controls where required?
  12. Is there a documented incident-response template for extraction failures?
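
Scoring is simple enough to automate. A minimal sketch: score_audit and the band labels are hypothetical names; the cut-offs (below 4, below 6) come from the text above, which gives no reading for scores of 6 or more.

```python
def score_audit(answers: list) -> tuple:
    """Score the 12-question audit: one point per True answer."""
    if len(answers) != 12:
        raise ValueError("the scorecard has exactly 12 questions")
    score = sum(bool(a) for a in answers)
    if score < 4:
        band = "below_4"    # humans are the extraction system and the bottleneck
    elif score < 6:
        band = "below_6"    # a Discovery Sprint likely beats another vendor evaluation
    else:
        band = "6_or_more"  # no specific reading given in the text
    return score, band
```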

What not to build.

The mistakes we see most often. Skip these even when the vendor demo is compelling.

  • OCR-and-pray. Running OCR and calling the resulting text “extracted data” is a category error. OCR is one stage out of five; the classification, extraction, validation, and routing layers are the work.
  • One mega-prompt for every document type. The temptation is to write one prompt that handles all classes. It regresses across model updates, can't be evaluated, and produces worse accuracy per class than a per-class extractor.
  • Skipping the human-review queue. Auto-routing Tier-2 and Tier-3 extractions saves cost in the short term and produces invisible incidents in the long term. The queue is the difference between “shipped” and “quietly broken”.
  • Vibes-based accuracy claims. “The model is 95% accurate” with no fixture set, no rubric, and no regression baseline is marketing copy, not an engineering statement.
  • Vendor lock-in via proprietary extraction schemas. If the schema lives only in the vendor's portal, the day you switch vendors the pipeline is a rewrite. Keep the schema in version control on your side.

Worked example · invoice extraction at €0.04/document.

A concrete shape, drawn from an aggregated engagement pattern. The team processes ~8,000 supplier invoices per month. Pre-pipeline: ~1.5 minutes per invoice in manual processing, ~200 hours/month, plus 3% downstream errors caught only at month-end reconciliation. Post-pipeline:

  • Volume. 8,000 invoices/month, 92% auto-routed, 8% to human queue (averaging 45 seconds each).
  • Field accuracy. Total ≥ 0.99 · supplier ≥ 0.98 · line items ≥ 0.94 · payment terms ≥ 0.91.
  • Cost. €0.04 per document model spend, €0.50 per document fully-loaded with infrastructure, eval, and proportional retainer.
  • Operator time. Down to ~25 hours/month from ~200, with errors caught at extraction time rather than at reconciliation.
  • Build economics. €52k build + €2.5k/month run. Payback at ~3.5 months on the operator-time number alone, before counting reduced reconciliation errors.

The numbers above are aggregated and de-identified across comparable engagements. The pattern is the asset; the specific numbers vary by document complexity and source quality.
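
The arithmetic behind the example can be checked in a few lines. The volumes, times, and costs are taken from the bullets above; the loaded hourly cost of operator time is not stated in the report, so it is left as a parameter, and netting the monthly run cost against the savings is an assumption of this sketch.

```python
INVOICES_PER_MONTH = 8_000
MANUAL_MINUTES_EACH = 1.5
HUMAN_QUEUE_SHARE = 0.08       # 8% of invoices go to the human queue
QUEUE_SECONDS_EACH = 45
HOURS_AFTER = 25               # operator hours per month post-pipeline
BUILD_COST_EUR = 52_000
RUN_COST_EUR_PER_MONTH = 2_500

hours_before = INVOICES_PER_MONTH * MANUAL_MINUTES_EACH / 60  # 200.0
hours_saved = hours_before - HOURS_AFTER                      # 175.0
# Time spent in the human-review queue alone, in hours per month.
queue_hours = INVOICES_PER_MONTH * HUMAN_QUEUE_SHARE * QUEUE_SECONDS_EACH / 3600


def payback_months(hourly_cost_eur: float) -> float:
    """Months to recover the build cost from operator time saved, net of run cost."""
    net_saving = hours_saved * hourly_cost_eur - RUN_COST_EUR_PER_MONTH
    if net_saving <= 0:
        return float("inf")  # the pipeline never pays back at this hourly cost
    return BUILD_COST_EUR / net_saving
```

Under these assumptions, a loaded cost of roughly €100 per operator hour reproduces the ~3.5-month payback quoted above; a lower hourly cost lengthens it. The human queue itself accounts for about 8 of the ~25 remaining hours.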

Regional notes · CH, DACH, EU.

The architecture applies broadly. A few region-specific considerations from CH and DACH engagements:

  • Data residency: Swiss FADP (revised 2023) and EU GDPR/DSGVO both apply to document content and any AI processing of it. For regulated clients, the extraction model runs on CH/EU-resident inference endpoints and the fixture set is hosted in the same jurisdiction.
  • Multilingual extraction: Swiss invoices arrive in DE, FR, IT, and EN. Per-language extraction prompts outperform a single multilingual prompt by 4–7 accuracy points on free-text fields. The classifier emits language alongside class so the right extractor runs.
  • Signature + audit trail: Regulated industries often require cryptographically signed audit trails per extraction. The observability layer emits a signed record per document with the model version, prompt version, schema version, and the operator who approved any human-review decision.
  • AI Act readiness: The EU AI Act's obligations phase in through 2026–2027. Document intelligence in regulated sectors can fall under the high-risk band, depending on the use case; the eval harness, fixture set, observability traces, and per-field rubrics are the primary artefacts auditors ask for. Build them from day one.
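
The signed record per document, mentioned under signature and audit trail, can be sketched with the standard library. This example uses HMAC-SHA256 with a shared key purely for illustration; a regulated deployment may require asymmetric signatures and managed keys instead. The function names and record field names are assumptions.

```python
import hashlib
import hmac
import json


def sign_audit_record(record: dict, key: bytes) -> dict:
    """Return a copy of the record with an HMAC-SHA256 signature attached."""
    # Canonical JSON (sorted keys, fixed separators) so the same record
    # always serialises to the same bytes before signing.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {**record, "signature": signature}


def verify_audit_record(signed: dict, key: bytes) -> bool:
    """Recompute the signature over every field except the signature itself."""
    record = {k: v for k, v in signed.items() if k != "signature"}
    expected = sign_audit_record(record, key)["signature"]
    return hmac.compare_digest(expected, str(signed.get("signature", "")))
```

A record would carry the fields the text lists, for example model_version, prompt_version, schema_version, and approved_by, so that any later change to one of them invalidates the signature.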

Field rule

A document intelligence pipeline is not finished when the model passes the rubric. It is finished when the operator stops opening documents to verify them and starts opening only the human-review queue.

The pipeline is built inside Intelligent Systems & AI Infrastructure, with the operator dashboard living in the Digital Products & Platforms practice. Most engagements start with a two-week Discovery Sprint that audits the current manual flow, picks the highest-leverage document class, and locks the schema + the accuracy targets before any production build.

The methodology behind the eval harness is open-sourced as The Morvion Eval Spec. For the one-paragraph definitions of the underlying terms, see the glossary entries on document intelligence, structured extraction, faithfulness, and eval rubric.

Versioning of this report.

This document is versioned. Substantial revisions (new reference architecture, new cost band, new rubric dimension) bump the major version. Minor refinements are silent. Current version: 1.0.0, published 2026-05-19.

Common questions.

What is document intelligence?
Document intelligence is the AI layer that reads, extracts, classifies, and routes the unstructured documents a business runs on (contracts, invoices, briefs, intake forms, emails, PDFs). It replaces manual triage with measurable extraction and structured downstream actions. The output is typed records the next system can consume deterministically, not prose.

What does a working document intelligence stack include?
Five stages: ingest (pull documents from email, upload forms, storage buckets), classify (decide what kind of document each one is), extract (pull typed fields against a schema), validate (run the extraction through an evaluation rubric and a schema check), and route (send the structured output to the next system). Plus an eval harness per document class, a human-review queue for low-confidence extractions, and observability traces on every step.

How accurate is AI document extraction?
Per-field, not per-document. A working pipeline targets ≥0.97 field-accuracy on stable fields (dates, totals, IDs), ≥0.92 on structured fields (line items, parties, jurisdictions), and ≥0.85 on free-text fields (terms, notes, descriptions). Anything below those thresholds routes to human review rather than straight to production. Vibes-based accuracy claims do not ship.

How much does a document intelligence stack cost?
Morvion engagement bands: single-team pipeline (one document type, one routing target) lands €30–70k. Multi-team pipeline (3–8 document types, multiple routing targets) lands €60–140k. Regulated pipeline (healthcare, financial services, legal) lands €100–280k. Per-document API spend is typically €0.01–0.10 depending on length and model choice; the eval harness keeps it stable.

How long does a document intelligence build take?
Single document type: 6–10 weeks from kickoff to operating beta. Multi-document pipeline: 10–16 weeks. Regulated pipeline: 14–24 weeks plus audit time. A working preview on real customer documents lands inside the first 3–4 weeks of any engagement. The eval harness is built before any extraction prompt is written.

Where does document intelligence break in production?
Schema drift (the document format changed and nobody told the system), confidence-threshold mis-tuning (too much human review or too many bad extractions slipping through), and silent degradation when the source model updates. All three are caught by the eval harness running on a fresh labelled sample weekly. Without that, the pipeline regresses invisibly.