Retrieval quality · Morvion Glossary

Retrieval quality is the family of metrics that measures whether a RAG pipeline actually surfaces the right context for the query. The model's answer is only as good as what was retrieved; measuring retrieval separately from generation is what tells you which layer is the bottleneck.

The core metrics.

Recall@k. Of the documents that should have been retrieved, what fraction made it into the top k? A workflow with recall@10 of 0.6 is missing 40% of relevant context.
Precision@k. Of the documents in the top k, what fraction are actually relevant? Low precision means the generator is wading through noise.
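MRR (Mean Reciprocal Rank). How high does the first relevant document rank? Take the reciprocal of that rank for each query and average across queries. An MRR of 0.5 is what you get when the first relevant document sits at rank 2 for every query, but because reciprocals are averaged, it does not mean the average rank is 2.
nDCG (normalized Discounted Cumulative Gain). Sums the relevance of each retrieved document, discounted by its position, and divides by the score of the ideal ordering, so 1.0 means a perfect ranking. It captures the full ranking quality and is usually the most informative single metric for ordered retrieval.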
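The four metrics above can be sketched in a few lines of Python. This is a minimal illustration that assumes binary relevance (a document is either relevant or not) and plain document IDs; the function names and the sample IDs are made up for the example rather than taken from any particular eval library.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant documents."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: DCG of the actual ranking over DCG of the ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Illustrative data only: one query, four retrieved IDs, three labelled relevant IDs.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

print(recall_at_k(retrieved, relevant, 4))     # 2 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, 4))  # 2 of 4 slots are relevant
print(reciprocal_rank(retrieved, relevant))    # first relevant doc is at rank 2
print(ndcg_at_k(retrieved, relevant, 4))
```

Note that `mean_reciprocal_rank` averages reciprocals, not ranks: one query with a hit at rank 1 and one query with no hit at all also averages to 0.5.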

Why measure retrieval separately.

When RAG answers go wrong, the failure is either upstream (retrieval missed the relevant chunk) or downstream (generation ignored the relevant chunk). Without retrieval quality measured separately, every regression looks like a model problem. With it, you know whether to invest in embeddings, rerankers, chunking, or in prompt and model changes.
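One way to put the upstream/downstream split into practice is to tag each evaluated query with two flags and count where the failures land. The sketch below assumes you already have, per query, a boolean for "a relevant chunk was retrieved" and a boolean for "the final answer was judged correct"; the `triage` and `failure_breakdown` names and the category labels are hypothetical.

```python
from collections import Counter

def triage(relevant_retrieved: bool, answer_correct: bool) -> str:
    """Attribute one evaluated query to a layer of the pipeline."""
    if answer_correct:
        # A correct answer without relevant context is worth flagging:
        # the model may be answering from memory rather than the corpus.
        return "ok" if relevant_retrieved else "correct_without_context"
    # Wrong answer: blame the layer that failed first.
    return "generation" if relevant_retrieved else "retrieval"

def failure_breakdown(cases):
    """Count triage labels over (relevant_retrieved, answer_correct) pairs."""
    return Counter(triage(hit, correct) for hit, correct in cases)

# Illustrative flags for four queries, not real results.
cases = [(True, True), (False, False), (False, False), (True, False)]
print(failure_breakdown(cases))
```

If most failures land in "retrieval", the next step is the retrieval stack; if they land in "generation", it is the prompt or the model.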

“Bad answers from a RAG system are usually a retrieval bug masquerading as a model bug.”

Building the fixture set.

A retrieval-quality fixture is a query plus a labelled list of relevant documents in the corpus. Sourcing: real production queries plus a human pass labelling which corpus documents actually answer them. 50–200 queries is enough to start; scale up as the corpus grows.
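A fixture set like the one described above could be represented and scored roughly as follows. This is a sketch under assumptions: the `Fixture` class, the three-document corpus, and the keyword-overlap retriever are all invented stand-ins so the example runs on its own. In practice `retrieve` would call your actual retrieval pipeline.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

@dataclass(frozen=True)
class Fixture:
    """One labelled example: a query and the corpus documents that answer it."""
    query: str
    relevant_ids: FrozenSet[str]

def mean_recall_at_k(
    fixtures: List[Fixture],
    retrieve: Callable[[str, int], List[str]],
    k: int,
) -> float:
    """Average recall@k over the fixture set for a given retriever."""
    scores = []
    for fx in fixtures:
        if not fx.relevant_ids:
            continue  # skip fixtures with no labelled documents
        top_k = retrieve(fx.query, k)[:k]
        scores.append(len(set(top_k) & fx.relevant_ids) / len(fx.relevant_ids))
    return sum(scores) / len(scores) if scores else 0.0

# Toy corpus and retriever, purely illustrative.
CORPUS = {
    "doc-refunds": "refund policy for annual plans",
    "doc-sso": "configure SSO with SAML",
    "doc-billing": "billing cycle and invoices",
}

def keyword_retrieve(query: str, k: int) -> List[str]:
    """Rank documents by how many lowercase words they share with the query."""
    words = set(query.lower().split())

    def overlap(doc_id: str) -> int:
        return len(words & set(CORPUS[doc_id].lower().split()))

    ranked = sorted(CORPUS, key=lambda doc_id: (-overlap(doc_id), doc_id))
    return ranked[:k]

FIXTURES = [
    Fixture("refund policy annual", frozenset({"doc-refunds"})),
    Fixture("SAML SSO setup", frozenset({"doc-sso"})),
    Fixture("when does the annual invoice arrive", frozenset({"doc-billing"})),
]

print(mean_recall_at_k(FIXTURES, keyword_retrieve, k=1))
```

The third fixture is a deliberate miss at k=1: the query says "invoice" while the document says "invoices", so the keyword retriever ranks the refunds document first. Real fixtures surface exactly this kind of gap.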

Frequently asked.

What is retrieval quality? Retrieval quality is the family of metrics (recall@k, precision@k, MRR, nDCG) that measure whether a RAG pipeline surfaces the right context for the query. It is measured against a labelled fixture set, separately from the generation step, so you know which layer is the bottleneck.
Why measure retrieval separately from generation? Because bad RAG answers can come from either layer, and the fix depends on which one. Low retrieval quality means investing in embeddings, rerankers, or chunking. Low generation quality on good retrieved context means investing in prompts, models, or output validation. Without separate measurement, every bug looks like a model bug.
What's a good recall@k target? It is workload-specific. For a tightly scoped corpus with clear factual queries, recall@10 above 0.9 is achievable and expected. For broad, semantic, conversational corpora, recall@10 above 0.75 is good. In both cases, the eval harness measures against the real query distribution rather than a benchmark dataset.
How does retrieval quality interact with faithfulness? They're complementary. Retrieval quality measures whether the right context was found. Faithfulness measures whether the model used the context it was given. A workflow can have high faithfulness and low retrieval (the model is honest about what it has, but it doesn't have enough) or low faithfulness on good retrieval (the model invented an answer despite having the right context). Production needs both above target.
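The "both above target" rule can be sketched as a small check that names the layer to fix. The 0.9 recall target below is the tightly scoped figure quoted above; the 0.95 faithfulness target, the `diagnose` name, and treating a score exactly at the target as passing are assumptions made for illustration.

```python
def diagnose(
    recall_at_10: float,
    faithfulness: float,
    recall_target: float = 0.9,         # tightly scoped corpus figure from the text
    faithfulness_target: float = 0.95,  # hypothetical value, set per workload
) -> str:
    """Say which layer, if any, is below target."""
    retrieval_ok = recall_at_10 >= recall_target
    faithful_ok = faithfulness >= faithfulness_target
    if retrieval_ok and faithful_ok:
        return "pass"
    if faithful_ok:
        # Honest about what it has, but it does not have enough.
        return "fix retrieval"
    if retrieval_ok:
        # Good context was found, but the model did not stick to it.
        return "fix generation"
    return "fix both"

print(diagnose(recall_at_10=0.72, faithfulness=0.97))
print(diagnose(recall_at_10=0.93, faithfulness=0.80))
```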

