
Cross-Encoder

Definition

A cross-encoder is a neural model that takes a query and a document together as input, processing them jointly to produce a relevance score. It is more accurate than a bi-encoder but too slow for first-stage retrieval over a full corpus, so it is typically used to rerank a small set of candidates.

Why It Matters

The bi-encoder vs cross-encoder tradeoff is fundamental to retrieval system design. Bi-encoders embed the query and document independently, which is fast but misses the nuanced interaction between them. Cross-encoders process the query and document together, which is slow but catches subtle relevance signals that bi-encoders miss.

Consider the query “How do I reset my password?” and a document titled “Password Reset Guide.” A bi-encoder might rank it slightly below a document about “Password Security Best Practices” because both documents are semantically close to the query. A cross-encoder, seeing the query and document together, immediately recognizes the direct match.

For AI engineers, understanding cross-encoders explains why two-stage retrieval works. You can’t cross-encode every document in your corpus (too slow), but you can cross-encode the top candidates from initial retrieval.

Implementation Basics

Architecture: Cross-encoders are typically BERT-style transformers. Input is the concatenation: [CLS] query [SEP] document [SEP]. The model outputs a single relevance score, often through a classification head on the [CLS] token.
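
As a rough sketch of what that looks like in code (assuming the Hugging Face transformers library and the MiniLM reranker listed under model options below), the pair is tokenized into one joint sequence and the classification head emits a single relevance logit:

```python
# Sketch: scoring one (query, document) pair with a BERT-style cross-encoder.
# Model name is the MiniLM reranker mentioned below; any similar checkpoint works.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "How do I reset my password?"
document = "Password Reset Guide: step-by-step instructions for resetting your password."

# Passing both texts produces a single joint input: [CLS] query [SEP] document [SEP]
inputs = tokenizer(query, document, return_tensors="pt", truncation=True)

with torch.no_grad():
    # The classification head emits one logit, used as the relevance score.
    score = model(**inputs).logits.squeeze().item()

print(score)  # higher means more relevant
```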

Why they’re more accurate: With query and document in the same forward pass, the model can:

  • Attend across query and document tokens
  • Compare specific phrases directly
  • Model complex relevance patterns

Bi-encoders compress all document meaning into a single vector before comparison. Cross-encoders delay compression until after comparison, preserving more information.
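
A minimal sketch of that contrast, using the sentence-transformers library with illustrative model names, might look like this:

```python
# Illustrative sketch: the bi-encoder compresses each text into a vector before
# comparison; the cross-encoder scores each pair in a joint forward pass.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "How do I reset my password?"
docs = ["Password Reset Guide", "Password Security Best Practices"]

# Bi-encoder: independent embeddings, then a vector similarity.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
bi_scores = util.cos_sim(query_emb, doc_emb)[0]

# Cross-encoder: one forward pass per (query, document) pair.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_scores = cross_encoder.predict([(query, d) for d in docs])

for doc, b, c in zip(docs, bi_scores.tolist(), cross_scores.tolist()):
    print(f"{doc}: bi-encoder={b:.3f}, cross-encoder={c:.3f}")
```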

The efficiency problem: For a corpus of 1 million documents, a bi-encoder encodes the query once, then compares it against pre-computed document vectors. A cross-encoder would need 1 million forward passes per query; even at an optimistic few milliseconds per pass, that is tens of minutes of compute, which is infeasible for online retrieval.

Using cross-encoders in practice (a code sketch follows these steps):

  1. Initial retrieval with bi-encoder or BM25 returns top 100 candidates
  2. Cross-encoder scores each candidate against the query
  3. Rerank by cross-encoder scores
  4. Return top 5-20 results
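
A toy version of this pipeline, again sketched with sentence-transformers and placeholder models and corpus (a real system would retrieve candidates from a vector index or BM25 rather than a Python list), could look like this:

```python
# Minimal two-stage retrieve-then-rerank sketch. Corpus, model names, and
# top-k values are placeholders for illustration only.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                   # stage 1
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # stage 2

corpus = [
    "Password Reset Guide",
    "Password Security Best Practices",
    "Two-Factor Authentication Setup",
    "Account Deletion Policy",
]
# Document embeddings are computed once, ahead of query time.
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

def search(query: str, top_k: int = 100, final_k: int = 5):
    # Stage 1: cheap candidate retrieval against pre-computed vectors.
    query_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]

    # Stage 2: cross-encode each (query, candidate) pair and rerank.
    pairs = [(query, corpus[h["corpus_id"]]) for h in hits]
    scores = cross_encoder.predict(pairs)
    reranked = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)
    return [(corpus[h["corpus_id"]], float(s)) for h, s in reranked[:final_k]]

print(search("How do I reset my password?"))
```

In practice the cross-encoder call is the latency bottleneck, so candidates are usually scored in batches and the candidate count is tuned against the latency budget discussed below.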

Model options:

  • cross-encoder/ms-marco-MiniLM-L-6-v2: Fast, decent quality
  • BAAI/bge-reranker-large: Strong open-source option
  • Cohere Rerank: Production-ready hosted API
  • Mixedbread mxbai-rerank: Open-source, multiple sizes

Latency expectations: Cross-encoding 100 documents typically takes 50-200ms depending on model size and hardware. This is acceptable for most applications when added to a 20-50ms initial retrieval step.

The takeaway: use bi-encoders for scale, cross-encoders for precision. Most production systems need both.

Source

Sentence-BERT popularized the bi-encoder/cross-encoder distinction, showing that cross-encoders achieve higher accuracy at the cost of computational efficiency.

https://arxiv.org/abs/1908.10084