Retrieval rerank is the second pass in a production RAG pipeline. After vector search returns the top-K candidates by embedding similarity, a small cross-encoder model scores each (query, passage) pair directly and reorders them. The top N after rerank are what actually enter the generating model's prompt.
Why rerank.
Bi-encoder retrieval (the embedding lookup) is fast but lossy. The query and the document are embedded independently, so the score never sees how the two interact and is only approximate. A cross-encoder takes the query and the passage together as input and produces a single relevance score per pair. It's slower (one model call per candidate) but typically ten to twenty points more accurate on most benchmarks. The two-stage pattern is the standard production answer.
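The structural difference can be illustrated with a minimal sketch. Everything here is a toy stand-in: `embed` is a hypothetical bag-of-words "encoder" rather than a real embedding model, and `score_pair` is whatever cross-encoder call you plug in. The point is only the shape of the computation: passage vectors are computed once and reused, while the cross-encoder path makes one call per (query, passage) pair.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bi_encoder_scores(query: str, passage_vectors: List[Counter]) -> List[float]:
    # The query is embedded once, independently of any passage.
    query_vector = embed(query)
    # Passage vectors were computed ahead of time, so scoring is cheap.
    return [cosine(query_vector, vector) for vector in passage_vectors]


def cross_encoder_scores(
    query: str,
    passages: List[str],
    score_pair: Callable[[str, str], float],
) -> List[float]:
    # The scorer sees query and passage together: one call per candidate.
    return [score_pair(query, passage) for passage in passages]
```

In a real system the passage vectors would live in a vector index and `score_pair` would wrap a reranker model, but the cost profile is the same: the first path scales with index lookups, the second with the number of candidates.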
How to wire it.
- Vector search returns the top 50–100 candidates.
- Rerank scores those candidates and keeps the top 5–15.
- Generate answers using only the reranked top N.
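The three steps above can be sketched as a single function. This is a minimal sketch under stated assumptions: `vector_search` and `score_pair` are hypothetical callables standing in for your vector store and your reranker, and `build_prompt` shows one possible prompt layout, not a prescribed one. The defaults of 50 and 10 sit inside the ranges given above.

```python
from typing import Callable, List


def retrieve_and_rerank(
    query: str,
    vector_search: Callable[[str, int], List[str]],  # hypothetical: (query, k) -> passages
    score_pair: Callable[[str, str], float],         # hypothetical: (query, passage) -> score
    top_k: int = 50,
    top_n: int = 10,
) -> List[str]:
    # Stage 1: cheap, approximate recall from the vector index.
    candidates = vector_search(query, top_k)
    # Stage 2: score each (query, passage) pair jointly, then reorder.
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    # Only the reranked top N go on to generation.
    return [passage for _, passage in scored[:top_n]]


def build_prompt(query: str, passages: List[str]) -> str:
    """Assemble a prompt from the reranked passages (illustrative layout)."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Keeping the search and scoring functions as parameters makes it straightforward to swap rerankers or to run the pipeline with the rerank stage replaced by a pass-through for comparison.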
Common rerank models.
Common options are Cohere Rerank and Voyage rerank (hosted), and BGE-Reranker and mxbai-rerank (open weights). The choice trades cost against quality; for most production RAG, a hosted reranker at ~$0.001 per query is the right starting point. An eval harness run on your own queries shows which reranker wins on your specific query distribution.
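One way such an eval harness could look is sketched below. It is illustrative only: the metric (hit rate at N, the fraction of queries with at least one relevant passage in the top N) is an assumption chosen for simplicity, and the `rerankers` mapping holds hypothetical wrapper functions around whichever models you are comparing. Each labeled example is a `(query, candidates, relevant_passages)` tuple that you supply from your own data.

```python
from typing import Callable, Dict, List, Set, Tuple

# (query, candidate passages from vector search, passages judged relevant)
LabeledQuery = Tuple[str, List[str], Set[str]]
# A reranker takes (query, candidates) and returns the candidates reordered.
Reranker = Callable[[str, List[str]], List[str]]


def hit_rate_at_n(rerank: Reranker, labeled: List[LabeledQuery], n: int = 5) -> float:
    """Fraction of queries with at least one relevant passage in the top n."""
    if not labeled:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query, candidates, relevant in labeled:
        top = rerank(query, candidates)[:n]
        if any(passage in relevant for passage in top):
            hits += 1
    return hits / len(labeled)


def compare_rerankers(
    rerankers: Dict[str, Reranker],
    labeled: List[LabeledQuery],
    n: int = 5,
) -> Dict[str, float]:
    """Score every reranker on the same labeled queries."""
    return {name: hit_rate_at_n(fn, labeled, n) for name, fn in rerankers.items()}
```

Including a pass-through "no rerank" entry in `rerankers` gives a baseline, so the comparison also shows how much the rerank stage adds over vector search alone.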
Frequently asked.
- What is retrieval rerank?
- Retrieval rerank is the second pass over the top-K passages from vector search. A small cross-encoder model scores each (query, passage) pair directly and reorders, so the most relevant chunks reach the prompt first. It's the standard production answer for getting accuracy out of a RAG pipeline.
- Do I need a reranker if my embeddings are good?
- Almost always yes. Bi-encoder embeddings are fast but score the query and passage independently, which is structurally lossy. A cross-encoder takes them together and produces a much better relevance score per pair. Production RAG quality reliably jumps 10–20 points when a reranker is added.
- What's the latency cost of rerank?
- The cross-encoder has to score every candidate, though candidates are usually sent in batches rather than one call each. With 50 candidates and a hosted reranker at ~80ms per batch, you're adding ~150–300ms to the pipeline. The accuracy gain almost always justifies it, but if the latency budget is tight, top-K can be cut to 20–30 candidates at the cost of a small quality hit.
- Which reranker should I use?
- For hosted: Cohere Rerank and Voyage rerank are both production-grade. For self-hosted: BGE-Reranker and mxbai-rerank are open-weight and competitive. The eval harness picks the right one for your query distribution; default to a hosted reranker until you have a reason to self-host.
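The latency figures in the answer on rerank cost can be turned into a back-of-the-envelope estimate. This sketch assumes batches are processed one after another and that each batch costs a fixed amount of time; the batch sizes used in the usage note are assumptions for illustration, not documented limits of any particular reranker.

```python
import math


def rerank_latency_ms(num_candidates: int, batch_size: int, per_batch_ms: float) -> float:
    """Estimate added latency when batches are scored sequentially."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if num_candidates <= 0:
        return 0.0
    # Every started batch costs a full round trip.
    batches = math.ceil(num_candidates / batch_size)
    return batches * per_batch_ms
```

With 50 candidates at 80ms per batch, an assumed batch size of 25 gives two batches (160ms) and an assumed batch size of 16 gives four (320ms), which brackets the rough range quoted above. Cutting to 20 candidates at a batch size of 25 fits in one batch.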