A cross-encoder is a model architecture used for the rerank step in retrieval. Unlike a bi-encoder (which embeds query and document independently), a cross-encoder takes the query and the candidate passage as a single joint input and produces a single relevance score per pair. It is slower, but much more accurate.

Cross-encoder vs. bi-encoder.

  • Bi-encoder (the embedding model in vector search): embeds query and document independently. Fast — embeddings can be precomputed at index time. Lossy — independent encoding loses interaction signal.
  • Cross-encoder (the reranker): takes (query, document) as joint input, produces one score. Slow — one model call per pair, can't precompute. Accurate — joint attention captures fine-grained relevance.
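
To make the contrast concrete, here is a minimal sketch of the two call patterns. The functions embed, bi_encoder_score, and cross_encoder_score are hypothetical toy stand-ins (bag-of-words counts and a bigram-overlap heuristic), not real models. The point is only which inputs each scorer sees and when the work can happen.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding' (assumption): bag-of-words counts. Sees one text at a time."""
    return Counter(text.lower().split())

def bi_encoder_score(query_vec: Counter, doc_vec: Counter) -> float:
    """Cosine similarity between two independently computed vectors."""
    dot = sum(count * doc_vec[term] for term, count in query_vec.items())
    q_norm = math.sqrt(sum(c * c for c in query_vec.values()))
    d_norm = math.sqrt(sum(c * c for c in doc_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

def cross_encoder_score(query: str, doc: str) -> float:
    """Toy joint scorer (assumption): fraction of query bigrams found in the doc.
    It needs both texts at once, so nothing can be precomputed per document."""
    q, d = query.lower().split(), doc.lower().split()
    q_bigrams = set(zip(q, q[1:]))
    d_bigrams = set(zip(d, d[1:]))
    return len(q_bigrams & d_bigrams) / len(q_bigrams) if q_bigrams else 0.0

query = "dog bites man"
docs = ["man bites dog", "dog bites man today"]

# Index time: document vectors are computed once, before any query arrives.
doc_vecs = [embed(doc) for doc in docs]

# Query time, bi-encoder: embed the query once, then compare vectors.
query_vec = embed(query)
bi_scores = [bi_encoder_score(query_vec, vec) for vec in doc_vecs]

# Query time, cross-encoder: one scoring call per (query, document) pair.
cross_scores = [cross_encoder_score(query, doc) for doc in docs]
```

In this toy setup the independent encoding cannot see word order, so it ranks the reversed sentence first, while the joint scorer prefers the passage that actually matches the query. Real bi-encoders and cross-encoders are far more capable than these stand-ins, but the structural difference is the same.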

When to use which.

The two-stage pattern uses both: bi-encoder for fast initial retrieval (the top 50–100 candidates), then cross-encoder to rerank down to the top 5–15. This combines the speed of bi-encoder retrieval with the accuracy of cross-encoder scoring. It's the standard production answer.
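
The two-stage pattern can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: two_stage_search, word_overlap, and phrase_match are hypothetical names, and both scorers are toy functions standing in for a real bi-encoder and cross-encoder. A real first stage would query a vector index over precomputed embeddings instead of scanning the corpus.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]

def two_stage_search(
    query: str,
    corpus: Sequence[str],
    first_stage: Scorer,
    reranker: Scorer,
    retrieve_k: int = 100,
    final_k: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap scoring over everything, keep the top retrieve_k.
    # (Assumption: a linear scan stands in for a vector-index lookup.)
    candidates = sorted(
        corpus, key=lambda doc: first_stage(query, doc), reverse=True
    )[:retrieve_k]
    # Stage 2: one expensive call per surviving (query, document) pair.
    scored = [(doc, reranker(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:final_k]

def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

calls = {"rerank": 0}  # counts how often the expensive scorer runs

def phrase_match(query: str, doc: str) -> float:
    """Toy reranker: rewards the exact query phrase appearing in the doc."""
    calls["rerank"] += 1
    return 1.0 if query.lower() in doc.lower() else 0.0

corpus = [
    "reset your password from the account settings page",
    "password reset emails can take a few minutes",
    "how to reset a router to factory settings",
    "billing questions and invoices",
    "your password must contain a number",
]

results = two_stage_search(
    "password reset", corpus, word_overlap, phrase_match,
    retrieve_k=3, final_k=2,
)
```

The reranker runs only on the retrieve_k survivors (three calls here, not five), which is where the speed of the combined pipeline comes from. The retrieve_k and final_k defaults mirror the 50–100 and 5–15 ranges above and would need tuning per workload.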

Common cross-encoder models.

Common options include BGE-Reranker, Cohere Rerank (hosted), Voyage Rerank (hosted), mxbai-rerank, and the classic ms-marco-MiniLM cross-encoders. The choice trades cost against accuracy; the eval harness picks the right one for your specific query distribution.
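
As a rough illustration of what such an eval harness does, the sketch below scores each candidate reranker on a small labeled set and picks the best. The metric (mean reciprocal rank), the function names mrr_at_k and pick_reranker, and the toy scorers are all assumptions for the example; a real harness would wrap actual model calls and would likely weigh cost and latency alongside accuracy.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Scorer = Callable[[str, str], float]
# One labeled example: (query, candidate passages, index of the relevant one).
Example = Tuple[str, Sequence[str], int]

def mrr_at_k(reranker: Scorer, examples: List[Example], k: int = 10) -> float:
    """Mean reciprocal rank of the relevant passage within the top k."""
    total = 0.0
    for query, candidates, relevant_idx in examples:
        order = sorted(
            range(len(candidates)),
            key=lambda i: reranker(query, candidates[i]),
            reverse=True,
        )[:k]
        if relevant_idx in order:
            total += 1.0 / (order.index(relevant_idx) + 1)
    return total / len(examples) if examples else 0.0

def pick_reranker(
    rerankers: Dict[str, Scorer], examples: List[Example], k: int = 10
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate reranker and return the best name plus all scores."""
    scores = {name: mrr_at_k(fn, examples, k) for name, fn in rerankers.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores

def word_overlap(query: str, doc: str) -> float:
    """Toy scorer: number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def prefer_short(query: str, doc: str) -> float:
    """Deliberately poor toy scorer: shorter passages score higher."""
    return -float(len(doc))

examples = [
    ("reset password", ["billing and invoices", "how to reset your password"], 1),
    ("refund policy", ["our refund policy explained", "shipping times"], 0),
]

best, scores = pick_reranker(
    {"word_overlap": word_overlap, "prefer_short": prefer_short}, examples
)
```

The labeled examples should come from your own query logs, since a reranker that wins on a public benchmark may not win on your distribution.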

Frequently asked.

What is a cross-encoder?
A cross-encoder is a neural model that takes a query and a candidate passage as a single joint input and produces one relevance score for the pair. It's used in the rerank step of production retrieval pipelines. Slower than bi-encoder embedding lookup, but 10–20 points more accurate on most benchmarks.
When should I use a cross-encoder vs. a bi-encoder?
Use both, in sequence. Bi-encoder for the first-stage retrieval (fast, embeddings precomputed at index time). Cross-encoder for the rerank step over the top-K candidates (slow per call, but accuracy gain dominates the marginal latency). This two-stage pattern is the standard production answer.
How much slower is a cross-encoder?
Per scoring call, much slower: the model has to run a full forward pass on each (query, document) pair at query time, whereas a bi-encoder lookup only compares precomputed vectors. In practice, reranking the top 50 candidates with a hosted reranker adds ~150–300ms to the pipeline. For 95% of production RAG workflows the latency hit is worth the accuracy gain.
Are cross-encoder models the same as LLM graders?
No, they have different shapes. Cross-encoders are comparatively small (typically tens to hundreds of millions of parameters), output a single scalar score per pair, and are purpose-built for relevance. LLM graders are full language models that read a rubric and produce a score with reasoning. Cross-encoders are faster and cheaper; LLM graders are more flexible. They do different jobs.