Cross-encoder · Morvion Glossary

A cross-encoder is a model architecture used for the rerank step of retrieval. Unlike a bi-encoder (which embeds query and document independently), a cross-encoder takes the query and the candidate passage as a single joint input and produces a single relevance score per pair. It is slower per pair, but much more accurate.

Cross-encoder vs. bi-encoder.

Bi-encoder (the embedding model in vector search): embeds query and document independently. Fast, because document embeddings can be precomputed at index time. Lossy, because independent encoding loses the interaction signal between query and document.
Cross-encoder (the reranker): takes (query, document) as joint input and produces one score. Slow, because it needs one model call per pair and nothing can be precomputed. Accurate, because joint attention over both texts captures fine-grained relevance.
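The difference in call shape can be sketched in a few lines of plain Python. This is only an illustration: `embed`, `cosine`, and `cross_score` are hypothetical toy stand-ins (bag-of-words and word-pair matching), not real neural models. What matters is that document vectors are built before the query exists, while the pair scorer has to see both texts at once.

```python
import math
from collections import Counter

# Toy stand-ins (assumptions for illustration); real systems use trained models.

def embed(text: str) -> Counter:
    """Bi-encoder shape: encode ONE text with no knowledge of the other side."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Compare two independently computed vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cross_score(query: str, doc: str) -> float:
    """Cross-encoder shape: score the PAIR jointly, one call per (query, doc).

    Toy rule: fraction of adjacent query word pairs that are also adjacent
    in the document.
    """
    q = query.lower().split()
    d = doc.lower().split()
    q_pairs = list(zip(q, q[1:]))
    d_pairs = set(zip(d, d[1:]))
    if not q_pairs:
        return 0.0
    return sum(pair in d_pairs for pair in q_pairs) / len(q_pairs)

docs = ["dog bites man", "man bites dog"]

# Index time: document vectors are computed once, before any query arrives.
doc_vectors = [embed(d) for d in docs]

# Query time, bi-encoder: embed the query once, compare against stored vectors.
query = "man bites dog"
query_vector = embed(query)
bi_scores = [cosine(query_vector, v) for v in doc_vectors]

# Query time, cross-encoder: one scoring call per (query, document) pair.
cross_scores = [cross_score(query, d) for d in docs]

print(bi_scores)     # both documents look identical to the toy bi-encoder
print(cross_scores)  # the pair scorer separates them
```

In this toy setup the independent encoding cannot tell the two documents apart, while the joint scorer can. Real models differ in degree rather than this starkly, but the precompute-versus-per-pair trade-off is the same.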

When to use which.

The two-stage pattern uses both: bi-encoder for fast initial retrieval (the top 50–100 candidates), then cross-encoder to rerank down to the top 5–15. This combines the speed of bi-encoder retrieval with the accuracy of cross-encoder scoring. It's the standard production answer.
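A minimal sketch of the two-stage pattern, assuming both stages are plain scoring functions. The names `retrieve_then_rerank`, `cheap_overlap`, and `pair_score` are hypothetical, and the scorers are toy word-matching rules standing in for a real embedding lookup and a real reranker.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]

def top_k(query: str, docs: List[str], scorer: Scorer, k: int) -> List[str]:
    """Return the k highest-scoring documents, best first (ties keep input order)."""
    return sorted(docs, key=lambda d: scorer(query, d), reverse=True)[:k]

def retrieve_then_rerank(query: str, docs: List[str], first_stage: Scorer,
                         reranker: Scorer, k: int = 50, n: int = 5) -> List[str]:
    """Stage 1 narrows the corpus to k candidates; stage 2 reranks them down to n."""
    candidates = top_k(query, docs, first_stage, k)
    return top_k(query, candidates, reranker, n)

def cheap_overlap(query: str, doc: str) -> float:
    """Toy first stage: share of query words present in the document.
    A real first stage would compare precomputed embeddings in a vector index."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def pair_score(query: str, doc: str) -> float:
    """Toy reranker: share of adjacent query word pairs also adjacent in the doc.
    A real reranker would be a cross-encoder model call on the pair."""
    q, d = query.lower().split(), doc.lower().split()
    q_pairs = list(zip(q, q[1:]))
    d_pairs = set(zip(d, d[1:]))
    return sum(p in d_pairs for p in q_pairs) / len(q_pairs) if q_pairs else 0.0

corpus = [
    "reset your account password from the login page",
    "password reset emails can take a few minutes",
    "how to reset a router to factory settings",
    "billing and invoices overview",
]

result = retrieve_then_rerank("password reset", corpus,
                              cheap_overlap, pair_score, k=3, n=2)
print(result)
```

The expensive scorer only ever sees the `k` candidates that survive the first stage, which is what keeps the rerank cost bounded regardless of corpus size.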

Common cross-encoder models.

BGE-Reranker, Cohere Rerank (hosted), Voyage Rerank (hosted), mxbai-rerank, and the classic ms-marco-MiniLM cross-encoders. The choice trades cost against accuracy; run the candidates through an eval harness on your own query distribution to pick the right one.
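One way such a comparison might look, as a sketch only: each reranker is wrapped as a function that returns the candidates best first, and the harness scores it by mean reciprocal rank (MRR) on a small labeled set. The function names, the choice of MRR, and the two toy rerankers are assumptions for illustration, not any particular product's harness. Cost is not modeled here.

```python
from typing import Callable, Dict, List, Tuple

# A reranker takes (query, candidates) and returns the candidates, best first.
Reranker = Callable[[str, List[str]], List[str]]
# One labeled example: (query, candidates, the single relevant candidate).
Example = Tuple[str, List[str], str]

def reciprocal_rank(ranked: List[str], relevant: str) -> float:
    """1 / position of the relevant document, or 0.0 if it is missing."""
    for position, doc in enumerate(ranked, start=1):
        if doc == relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(reranker: Reranker, eval_set: List[Example]) -> float:
    if not eval_set:
        raise ValueError("eval_set must contain at least one example")
    scores = [reciprocal_rank(reranker(q, cands), rel) for q, cands, rel in eval_set]
    return sum(scores) / len(scores)

def pick_reranker(rerankers: Dict[str, Reranker], eval_set: List[Example]):
    """Score every candidate reranker and return (best_name, all_scores)."""
    results = {name: mean_reciprocal_rank(fn, eval_set)
               for name, fn in rerankers.items()}
    best = max(results, key=results.get)
    return best, results

# Two toy rerankers standing in for real models.
def keep_order(query: str, candidates: List[str]) -> List[str]:
    return list(candidates)

def by_term_overlap(query: str, candidates: List[str]) -> List[str]:
    q = set(query.lower().split())
    return sorted(candidates,
                  key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)

eval_set = [
    ("refund policy",
     ["shipping times by region", "refund policy for annual plans"],
     "refund policy for annual plans"),
    ("api rate limits",
     ["api rate limits per plan", "changelog for march"],
     "api rate limits per plan"),
]

best, results = pick_reranker(
    {"keep_order": keep_order, "term_overlap": by_term_overlap}, eval_set)
print(best, results)
```

The labeled examples should come from your own traffic, since a reranker that wins on a public benchmark may not win on your query distribution.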

Frequently asked.

What is a cross-encoder?

A cross-encoder is a neural model that takes a query and a candidate passage as a single joint input and produces one relevance score for the pair. It's used in the rerank step of production retrieval pipelines. It is slower than bi-encoder embedding lookup, but 10–20 points more accurate on most retrieval benchmarks.

When should I use a cross-encoder vs. a bi-encoder?

Use both, in sequence. Use the bi-encoder for first-stage retrieval (fast, with embeddings precomputed at index time) and the cross-encoder for the rerank step over the top-K candidates (slow per call, but the accuracy gain outweighs the added latency). This two-stage pattern is the standard production answer.

How much slower is a cross-encoder?

Per scoring call, much slower: the model has to run a forward pass on every (query, document) pair at query time. In practice, reranking the top 50 candidates with a hosted reranker adds ~150–300ms to the pipeline. For 95% of production RAG workflows the latency hit is worth the accuracy gain.

Are cross-encoder models the same as LLM graders?

No, they have different shapes. Cross-encoders are comparatively small (tens to hundreds of millions of parameters) and output a single scalar score per pair; they are purpose-built for relevance. LLM graders are full language models that read a rubric and produce a score with reasoning. Cross-encoders are faster and cheaper; LLM graders are more flexible. Different jobs.
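To put the latency figure quoted above (~150–300ms for the top 50 candidates) into budget terms, the arithmetic can be sketched as follows. The linear-scaling projection is an assumption made for illustration: it ignores fixed network overhead and batching, so treat the projected numbers as rough estimates, not measurements.

```python
def per_candidate_ms(added_ms: float, candidates: int) -> float:
    """Amortized rerank latency per candidate."""
    return added_ms / candidates

def projected_ms(added_ms: float, candidates: int, new_candidates: int) -> float:
    """Projected added latency for a different candidate count.
    ASSUMPTION: cost scales linearly with the number of candidates."""
    return per_candidate_ms(added_ms, candidates) * new_candidates

LOW_MS, HIGH_MS = 150.0, 300.0   # range quoted above for reranking 50 candidates
TOP_K = 50

low_each = per_candidate_ms(LOW_MS, TOP_K)
high_each = per_candidate_ms(HIGH_MS, TOP_K)
print(f"amortized cost per candidate: {low_each}-{high_each} ms")

# Rough projection if the first stage handed over 100 candidates instead.
print(f"projected for 100 candidates: "
      f"{projected_ms(LOW_MS, TOP_K, 100)}-{projected_ms(HIGH_MS, TOP_K, 100)} ms")
```

Measuring the actual rerank call in your own pipeline is the only reliable way to confirm these numbers.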
