Retrieval rerank · Morvion Glossary

Retrieval rerank is the second pass in a production RAG pipeline. After vector search returns the top-K candidates by embedding similarity, a small cross-encoder model scores each (query, passage) pair directly and reorders the candidates by that score. Only the top N after rerank enter the model's prompt.

Why rerank.

Bi-encoder retrieval (the embedding lookup) is fast but lossy. The query and the document are embedded independently, so the similarity score only approximates relevance. A cross-encoder takes the query and the passage together as input and produces a single relevance score per pair. It is slower (one model pass per candidate) but ten to twenty points more accurate on most benchmarks. The two-stage pattern, cheap retrieval over the whole corpus followed by precise reranking of a short list, is the standard production answer.
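The difference between the two scoring modes can be illustrated with a toy sketch. Nothing here is a real model: `embed`, `cosine`, and `pair_score` are hypothetical stand-ins (a bag-of-words vector and a word-order check) chosen only to mirror the two interfaces. The bi-encoder stand-in sees one text at a time; the cross-encoder stand-in sees the query and the passage together.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bi-encoder stand-in: encodes ONE text with no view of the other."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bigrams(text: str) -> set:
    """Set of adjacent word pairs, preserving order within each pair."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def pair_score(query: str, passage: str) -> float:
    """Toy cross-encoder stand-in: reads the query AND the passage together.

    Returns the fraction of query word pairs that appear, in order, in the passage.
    """
    q = bigrams(query)
    return len(q & bigrams(passage)) / len(q) if q else 0.0


query = "dog bites man"
passages = ["man bites dog", "dog bites man in the park"]

query_vec = embed(query)  # the query is embedded once, on its own
bi_scores = {p: cosine(query_vec, embed(p)) for p in passages}
cross_scores = {p: pair_score(query, p) for p in passages}  # one call per pair

print(max(bi_scores, key=bi_scores.get))        # man bites dog
print(max(cross_scores, key=cross_scores.get))  # dog bites man in the park
```

The toy embedding cannot tell the two passages apart by word order, while the pairwise scorer can. Real bi-encoders are far stronger than a bag of words, so treat this as an illustration of the interface difference rather than of actual model behavior.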

How to wire it.

Vector search returns the top 50–100 candidates.
Rerank scores those candidates and keeps the top 5–15.
Generate answers using only the reranked top N.
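The three steps above can be sketched as follows. This is a minimal outline, not a reference implementation: `VectorSearch`, `PairScorer`, `retrieve_and_rerank`, and `build_prompt` are hypothetical names, and the toy search and scorer at the bottom stand in for a real vector store and a real cross-encoder.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces; adapt any vector store or reranker client to them.
VectorSearch = Callable[[str, int], List[str]]  # (query, top_k) -> candidate passages
PairScorer = Callable[[str, str], float]        # (query, passage) -> relevance score


def retrieve_and_rerank(
    query: str,
    vector_search: VectorSearch,
    score_pair: PairScorer,
    top_k: int = 50,
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Stage 1: fetch top_k candidates. Stage 2: rescore each pair, keep top_n."""
    candidates = vector_search(query, top_k)
    scored = [(passage, score_pair(query, passage)) for passage in candidates]
    # sorted() is stable, so tied passages keep their vector-search order.
    scored = sorted(scored, key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def build_prompt(query: str, ranked: List[Tuple[str, float]]) -> str:
    """Stage 3: only the reranked top N reach the generation prompt."""
    context = "\n\n".join(passage for passage, _ in ranked)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Toy stand-ins so the sketch runs without any external service.
corpus = [
    "bananas are yellow",
    "vector search uses embeddings",
    "a reranker reorders retrieved passages",
]


def toy_search(query: str, top_k: int) -> List[str]:
    return corpus[:top_k]


def toy_scorer(query: str, passage: str) -> float:
    return float(len(set(query.split()) & set(passage.split())))


ranked = retrieve_and_rerank(
    "what does a reranker do", toy_search, toy_scorer, top_k=3, top_n=2
)
print(build_prompt("what does a reranker do", ranked))
```

Passing the search and scoring functions in as arguments keeps the pipeline independent of any one vendor, which makes it easier to swap rerankers later.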

Common rerank models.

Hosted options include Cohere Rerank and Voyage rerank; open-weight options include BGE-Reranker and mxbai-rerank. The choice trades cost against quality; for most production RAG, a hosted reranker at ~$0.001 per query is the right starting point. The eval harness measures which reranker wins on the specific query distribution.
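One way such a comparison could look is sketched below. The text does not describe the eval harness itself, so everything here is an assumption: the `Reranker` and `Fixture` shapes, the choice of recall in the top N as the metric, and the two toy rerankers used for the demo.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical shapes for this sketch.
Reranker = Callable[[str, List[str]], List[str]]  # (query, candidates) -> best first
Fixture = Tuple[str, List[str], Set[str]]          # (query, candidates, relevant passages)


def recall_at_n(reranker: Reranker, fixtures: List[Fixture], n: int = 5) -> float:
    """Mean fraction of relevant passages that land in the top n after rerank."""
    if not fixtures:
        return 0.0
    total = 0.0
    for query, candidates, relevant in fixtures:
        top = set(reranker(query, candidates)[:n])
        total += len(top & relevant) / len(relevant) if relevant else 0.0
    return total / len(fixtures)


def compare_rerankers(
    rerankers: Dict[str, Reranker], fixtures: List[Fixture], n: int = 5
) -> List[Tuple[str, float]]:
    """Score every reranker on the same fixtures, best first."""
    scores = [(name, recall_at_n(fn, fixtures, n)) for name, fn in rerankers.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy fixtures and rerankers; real ones would wrap hosted or self-hosted models.
fixtures = [
    ("q1", ["a", "b", "c"], {"c"}),
    ("q2", ["d", "e", "f"], {"f"}),
]
rerankers = {
    "keep_order": lambda query, candidates: list(candidates),
    "reverse": lambda query, candidates: list(reversed(candidates)),
}

print(compare_rerankers(rerankers, fixtures, n=1))
```

The fixtures should be drawn from real queries, since a reranker that wins on a public benchmark may not win on a specific query distribution.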

Frequently asked.

What is retrieval rerank?

Retrieval rerank is the second pass over the top-K passages from vector search. A small cross-encoder model scores each (query, passage) pair directly and reorders the passages, so the most relevant chunks reach the prompt first. It is the standard production approach for improving the accuracy of a RAG pipeline.

Do I need a reranker if my embeddings are good?

Almost always yes. Bi-encoder embeddings are fast but score the query and passage independently, which is structurally lossy. A cross-encoder takes them together and produces a much better relevance score per pair. Adding a reranker typically makes retrieval ten to twenty points more accurate on most benchmarks.

What's the latency cost of rerank?

One cross-encoder pass per candidate, usually sent to the model in batches. With 50 candidates and a hosted reranker at ~80ms per batch, expect to add roughly 150–300ms to the pipeline. The accuracy gain almost always justifies it; if the latency budget is tight, top-K can be cut to 20–30 candidates at the cost of a small quality hit.

Which reranker should I use?

For hosted: Cohere Rerank and Voyage rerank are both production-grade. For self-hosted: BGE-Reranker and mxbai-rerank are open-weight and competitive. The eval harness shows which one wins on your query distribution; default to a hosted reranker until you have a reason to self-host.
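The latency figures in the answer above can be approximated with simple arithmetic. The sketch below assumes batches run one after another; `rerank_latency_ms` is a hypothetical helper, and the batch size of 25 is an assumed value that the text does not state. Only the ~80ms per batch and the candidate counts come from the text.

```python
import math


def rerank_latency_ms(
    num_candidates: int,
    batch_size: int,
    ms_per_batch: float,
    overhead_ms: float = 0.0,
) -> float:
    """Estimated added latency when batches are processed sequentially."""
    if num_candidates <= 0:
        return 0.0
    batches = math.ceil(num_candidates / batch_size)
    return batches * ms_per_batch + overhead_ms


# ~80 ms per batch is from the text; batch_size=25 is an assumption.
print(rerank_latency_ms(50, batch_size=25, ms_per_batch=80))  # 160.0
print(rerank_latency_ms(25, batch_size=25, ms_per_batch=80))  # 80.0
```

Under these assumptions, halving top-K from 50 to 25 drops one batch and halves the added latency, which is the trade the answer describes. Measure against the actual reranker before relying on any such estimate.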
