Retrieval rerank is the second pass in a production RAG pipeline. After vector search returns the top-K candidates by embedding similarity, a small cross-encoder model scores each (query, passage) pair directly and reorders. The top N after rerank are what actually enter the model's prompt.

Why rerank.

Bi-encoder retrieval (the embedding lookup) is fast but lossy. The query and the document are embedded independently, so the score is approximate. A cross-encoder takes the query and the passage together as input and produces a single relevance score per pair. It's slower (every candidate has to be scored against the query at request time) but typically 10–20 points more accurate on most benchmarks. The two-stage pattern is the standard production answer.

How to wire it.

  • Vector search returns the top 50–100 candidates.
  • Rerank scores those candidates and keeps the top 5–15.
  • Generate answers using only the reranked top N.
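The three steps above can be sketched as a single function. This is a minimal illustration, not a reference implementation: `retrieve_then_rerank`, `vector_search`, and `score_pair` are hypothetical names, and the toy stand-ins below replace a real vector index and a real cross-encoder so the example runs on its own.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    vector_search: Callable[[str, int], List[str]],
    score_pair: Callable[[str, str], float],
    top_k: int = 50,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Two-stage retrieval: cheap candidate fetch, then pairwise rescoring."""
    # Stage 1: fast, approximate candidate generation (the embedding lookup).
    candidates = vector_search(query, top_k)
    # Stage 2: score each (query, passage) pair together, as a cross-encoder would.
    scored = [(passage, score_pair(query, passage)) for passage in candidates]
    # Keep only the highest-scoring passages for the prompt.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# --- Toy stand-ins (assumptions) so the sketch runs without a model or an index ---
CORPUS = [
    "Rerankers score query and passage together.",
    "Vector search uses embedding similarity.",
    "Bananas are rich in potassium.",
    "A cross-encoder reranker reorders the retrieved passages.",
]


def toy_vector_search(query: str, k: int) -> List[str]:
    # Placeholder: returns the corpus in stored order instead of querying an index.
    return CORPUS[:k]


def toy_score_pair(query: str, passage: str) -> float:
    # Placeholder for a cross-encoder: fraction of query words found in the passage.
    q_words = set(query.lower().split())
    p_words = set(passage.lower().replace(".", "").split())
    return len(q_words & p_words) / len(q_words)


top = retrieve_then_rerank(
    "cross-encoder reranker", toy_vector_search, toy_score_pair, top_k=4, top_n=2
)
for passage, score in top:
    print(f"{score:.2f}  {passage}")
```

In a real pipeline, `vector_search` would wrap the vector store query and `score_pair` would wrap a reranker call. Most rerankers score many passages per request, so the per-pair loop would normally become one batched call.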

Common rerank models.

Common options are Cohere Rerank, Voyage rerank, BGE-Reranker, and mxbai-rerank (the last two are open weights). The choice trades cost against quality; for most production RAG, a hosted reranker at ~$0.001 per query is the right starting point. An eval harness run on your own queries measures which reranker wins on your specific query distribution.
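A rough sketch of what such an eval harness could look like, assuming a small set of queries with hand-labeled relevant passages. The metric here is mean reciprocal rank, chosen only for illustration. The `Reranker` signature, the data layout, and the two toy rerankers are all assumptions, not any vendor's API.

```python
from typing import Callable, Dict, List

# Assumed interface: takes a query and candidates, returns candidates best-first.
Reranker = Callable[[str, List[str]], List[str]]


def mean_reciprocal_rank(reranker: Reranker, labeled_queries: List[Dict]) -> float:
    """Average of 1/rank of the first relevant passage (0 if none is returned)."""
    total = 0.0
    for example in labeled_queries:
        ranked = reranker(example["query"], example["candidates"])
        for rank, passage in enumerate(ranked, start=1):
            if passage in example["relevant"]:
                total += 1.0 / rank
                break
    return total / len(labeled_queries)


def compare(rerankers: Dict[str, Reranker], labeled_queries: List[Dict]) -> Dict[str, float]:
    """Score every reranker on the same labeled set."""
    return {name: mean_reciprocal_rank(fn, labeled_queries) for name, fn in rerankers.items()}


# Toy rerankers standing in for real models (hypothetical).
def keep_order(query: str, candidates: List[str]) -> List[str]:
    return list(candidates)


def reverse_order(query: str, candidates: List[str]) -> List[str]:
    return list(reversed(candidates))


# Toy labeled set; real labels would come from your own query logs.
LABELED = [
    {"query": "q1", "candidates": ["a", "b", "c"], "relevant": {"c"}},
    {"query": "q2", "candidates": ["d", "e"], "relevant": {"d"}},
]

scores = compare({"keep_order": keep_order, "reverse_order": reverse_order}, LABELED)
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

The point of the design is that every reranker sees the same candidates and the same labels, so differences in the score come from the reranker alone.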

Frequently asked.

What is retrieval rerank?
Retrieval rerank is the second pass over the top-K passages from vector search. A small cross-encoder model scores each (query, passage) pair directly and reorders, so the most relevant chunks reach the prompt first. It's the standard production answer for getting accuracy out of a RAG pipeline.
Do I need a reranker if my embeddings are good?
Almost always yes. Bi-encoder embeddings are fast but score the query and passage independently, which is structurally lossy. A cross-encoder takes them together and produces a much better relevance score per pair. Production RAG quality reliably jumps 10–20 points when a reranker is added.
What's the latency cost of rerank?
The cross-encoder has to score every candidate, though hosted rerankers usually accept them in batches. With 50 candidates and a hosted reranker at ~80ms per batch, you're adding ~150–300ms to the pipeline. The accuracy gain almost always justifies it, but if the latency budget is brutal, top-K can be tightened to 20–30 candidates in exchange for a small quality hit.
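One way to reason about those numbers is a back-of-the-envelope estimate. It assumes batches are scored one after another and that per-batch latency is roughly constant. The `batch_size` of 20 is a made-up value for illustration; the real limit depends on the reranker and on passage length.

```python
import math


def estimated_rerank_latency_ms(
    num_candidates: int, batch_size: int, ms_per_batch: float
) -> float:
    """Added latency if batches are scored sequentially."""
    num_batches = math.ceil(num_candidates / batch_size)
    return num_batches * ms_per_batch


# batch_size=20 is hypothetical; ~80ms per batch is the figure quoted above.
for candidates in (50, 30, 20):
    latency = estimated_rerank_latency_ms(candidates, batch_size=20, ms_per_batch=80)
    print(f"{candidates} candidates -> ~{latency:.0f} ms")
```

Under these assumptions, cutting top-K only helps when it removes a whole batch, so it is worth measuring real latency at a few top-K values rather than relying on the estimate.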
Which reranker should I use?
For hosted: Cohere Rerank and Voyage rerank are both production-grade. For self-hosted: BGE-Reranker and mxbai-rerank are open-weight and competitive. The eval harness picks the right one for your query distribution; default to a hosted reranker until you have a reason to self-host.