Retrieval quality is the family of metrics that measures whether a RAG pipeline actually surfaces the right context for the query. The model's answer is only as good as what was retrieved; measuring retrieval separately from generation is what tells you which layer is the bottleneck.

The core metrics.

  • Recall@k. Of the documents that should have been retrieved, what fraction made it into the top k? A workflow with recall@10 of 0.6 is missing 40% of relevant context.
  • Precision@k. Of the documents in the top k, what fraction are actually relevant? Low precision means the generator is wading through noise.
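  • MRR (Mean Reciprocal Rank). How high does the first correct document rank, averaged across queries? The score is the mean of 1/rank of the first relevant doc, so MRR = 0.5 is what you get when the first relevant doc sits at rank 2 for every query; a mix of rank-1 hits and complete misses can produce the same value.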
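  • nDCG (normalized Discounted Cumulative Gain). Captures the full ranking quality: each relevant result's gain is discounted by its position, then normalized against the ideal ordering so scores fall between 0 and 1. The most informative single metric for ordered retrieval.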
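
The four metrics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes binary relevance labels (a document is either relevant or not), document IDs as strings, and a ranked list with no duplicates. The function names are hypothetical.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labelled-relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k that is relevant (divides by k, a common convention)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    scores = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the actual ranking over DCG of the ideal one."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

With graded relevance labels (for example 0 to 3), the gain term in `ndcg_at_k` would use the grade rather than a constant 1.0; the binary version is shown here because it needs no labels beyond a relevant set per query.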

Why measure retrieval separately.

When RAG answers go wrong, the failure is either upstream (retrieval missed the relevant chunk) or downstream (generation ignored the relevant chunk). Without retrieval quality measured separately, every regression looks like a model problem. With it, you know whether to invest in embeddings, rerankers, chunking, or in prompt and model changes.
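
The upstream/downstream split can be made mechanical once each evaluated query records both a retrieval outcome and an answer outcome. The sketch below is one possible way to do that; the `EvalRecord` shape and its field names are assumptions for illustration, and it presumes you already have a correctness judgment for each answer.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class EvalRecord:
    """One evaluated query (hypothetical record shape)."""
    query: str
    retrieved_relevant: bool  # did a labelled-relevant chunk reach the generator?
    answer_correct: bool      # was the final answer judged correct?


def attribute_failure(record: EvalRecord) -> str:
    """Label a record 'ok', 'retrieval' (upstream miss) or 'generation' (downstream miss)."""
    if record.answer_correct:
        return "ok"
    if not record.retrieved_relevant:
        return "retrieval"
    return "generation"


def failure_breakdown(records: Iterable[EvalRecord]) -> Counter:
    """Count how many records fall into each bucket."""
    return Counter(attribute_failure(r) for r in records)
```

A breakdown dominated by `retrieval` points toward embeddings, rerankers, or chunking; one dominated by `generation` points toward prompt and model changes.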

“Bad answers from a RAG system are usually a retrieval bug masquerading as a model bug.”

Building the fixture set.

A retrieval-quality fixture is a query plus a labelled list of relevant documents in the corpus. Sourcing: real production queries plus a human pass labelling which corpus documents actually answer them. 50–200 queries is enough to start; scale up as the corpus grows.
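
As a rough sketch, a fixture can be represented as a query paired with a set of relevant document IDs, and the whole set scored by running each query through the retriever. Everything named here is an assumption for illustration: `Fixture`, `mean_recall_at_k`, and the `retrieve(query, k)` callable stand in for whatever your pipeline exposes.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Sequence


@dataclass(frozen=True)
class Fixture:
    """A query plus the IDs of corpus documents a human labelled as relevant."""
    query: str
    relevant_ids: FrozenSet[str]


# Hypothetical retriever interface: takes a query and k, returns ranked doc IDs.
Retriever = Callable[[str, int], List[str]]


def mean_recall_at_k(fixtures: Sequence[Fixture], retrieve: Retriever, k: int) -> float:
    """Average recall@k across the fixture set, skipping fixtures with no labels."""
    scores = []
    for fixture in fixtures:
        if not fixture.relevant_ids:
            continue  # an unlabelled fixture cannot be scored
        top_k = retrieve(fixture.query, k)[:k]
        hits = sum(1 for doc_id in top_k if doc_id in fixture.relevant_ids)
        scores.append(hits / len(fixture.relevant_ids))
    return sum(scores) / len(scores) if scores else 0.0
```

Keeping the retriever behind a plain callable means the same fixture set can score a new embedding model, reranker, or chunking scheme without touching the fixtures themselves.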

Frequently asked.

What is retrieval quality?
Retrieval quality is the family of metrics (recall@k, precision@k, MRR, nDCG) that measure whether a RAG pipeline surfaces the right context for the query. It is measured against a labelled fixture set, separately from the generation step, so you know which layer is the bottleneck.
Why measure retrieval separately from generation?
Because bad RAG answers can come from either layer, and the fix is different depending on which one. Low retrieval quality means invest in embeddings, rerankers, chunking. Low generation quality on good retrieved context means invest in prompts, models, or output validation. Without measuring separately, every bug looks like a model bug.
What's a good recall@k target?
Workload-specific. For a tightly-scoped corpus with clear factual queries, recall@10 above 0.9 is achievable and expected. For broad, semantic, conversational corpora, recall@10 above 0.75 is good. In both cases, the eval harness measures against the real query distribution rather than a benchmark dataset.
How does retrieval quality interact with faithfulness?
They're complementary. Retrieval quality measures whether the right context was found. Faithfulness measures whether the model used the context it was given. A workflow can have high faithfulness and low retrieval quality (the model is honest about what it has, but it doesn't have enough) or low faithfulness on good retrieval (the model invented an answer despite having the right context). Production needs both above target.