Cross-Encoder
Definition
A cross-encoder is a neural model that takes a query and document together as input, processing them jointly to produce a relevance score. It is more accurate than a bi-encoder but too slow to run over an entire corpus for initial retrieval.
Why It Matters
The bi-encoder vs. cross-encoder tradeoff is fundamental to retrieval system design. Bi-encoders embed query and document independently, which is fast but misses the nuanced interaction between them. Cross-encoders process query and document together, which is slow but catches subtle relevance signals that bi-encoders miss.
Consider the query “How do I reset my password?” and a document titled “Password Reset Guide.” A bi-encoder might rank this slightly below a document about “Password Security Best Practices,” because both documents are semantically close to the query. A cross-encoder, seeing query and document together, immediately recognizes the direct match.
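To make this concrete, here is a minimal sketch using the sentence-transformers library (the specific model names are illustrative choices, not mandated by this article):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How do I reset my password?"
docs = ["Password Reset Guide", "Password Security Best Practices"]

# Bi-encoder: embed query and documents independently, then compare vectors.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
q_emb = bi_encoder.encode(query, convert_to_tensor=True)
print(util.cos_sim(q_emb, doc_emb))  # similarity of the query to each document

# Cross-encoder: score each (query, document) pair in one joint forward pass.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross_encoder.predict([(query, d) for d in docs]))
```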
For AI engineers, understanding cross-encoders explains why two-stage retrieval works. You can’t cross-encode every document in your corpus (too slow), but you can cross-encode the top candidates from initial retrieval.
Implementation Basics
Architecture: Cross-encoders are typically BERT-style transformers. Input is the concatenation: [CLS] query [SEP] document [SEP]. The model outputs a single relevance score, often through a classification head on the [CLS] token.
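A rough sketch of that input format with Hugging Face transformers (the MS MARCO MiniLM reranker is used here purely as an example of a model with a single-logit classification head):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# Passing a text pair makes the tokenizer emit [CLS] query [SEP] document [SEP].
inputs = tokenizer("How do I reset my password?", "Password Reset Guide",
                   return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze()  # one relevance logit for the pair
print(score.item())
```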
Why they’re more accurate: With query and document in the same forward pass, the model can:
- Attend across query and document tokens
- Compare specific phrases directly
- Model complex relevance patterns
Bi-encoders compress all document meaning into a single vector before comparison. Cross-encoders delay compression until after comparison, preserving more information.
The efficiency problem: For a corpus of 1 million documents, a bi-encoder encodes the query once, then compares against pre-computed vectors. A cross-encoder would need 1 million forward passes per query, which is infeasible.
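A back-of-envelope illustration of why this is infeasible (the 20 ms per batch of 32 pairs is an assumed GPU throughput, not a measured number):

```python
corpus_size = 1_000_000
batch_size = 32          # assumed batch size
ms_per_batch = 20        # assumed GPU forward-pass time per batch
batches = corpus_size / batch_size                 # 31,250 batches
seconds_per_query = batches * ms_per_batch / 1000  # ~625 s
print(f"~{seconds_per_query / 60:.1f} minutes per query")  # over 10 minutes
```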
Using cross-encoders in practice (a runnable sketch follows this list):
- Initial retrieval with bi-encoder or BM25 returns top 100 candidates
- Cross-encoder scores each candidate against the query
- Rerank by cross-encoder scores
- Return top 5-20 results
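A minimal end-to-end sketch of this two-stage pipeline with sentence-transformers, assuming a small in-memory corpus (model names and top-k values are illustrative):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Password Reset Guide",
    "Password Security Best Practices",
    "Two-Factor Authentication Setup",
]
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)  # computed once, offline

def search(query, top_k=100, final_k=10):
    # Stage 1: cheap vector search over precomputed embeddings.
    q_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=top_k)[0]
    # Stage 2: cross-encode only the candidates, then rerank by those scores.
    pairs = [(query, corpus[h["corpus_id"]]) for h in hits]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)
    return [(corpus[h["corpus_id"]], float(s)) for h, s in ranked[:final_k]]

print(search("How do I reset my password?"))
```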
Model options:
- cross-encoder/ms-marco-MiniLM-L-6-v2: Fast, decent quality
- BAAI/bge-reranker-large: Strong open-source option
- Cohere Rerank: Production-ready hosted API
- Mixedbread mxbai-rerank: Open-source, multiple sizes
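As a usage note, the open-source rerankers above can typically be loaded through the same CrossEncoder interface, with only the model identifier changing (treat compatibility of any given checkpoint as an assumption to verify):

```python
from sentence_transformers import CrossEncoder

# Swap in a different reranker by name; larger models trade speed for quality.
reranker = CrossEncoder("BAAI/bge-reranker-large")
print(reranker.predict([("example query", "example candidate document")]))
```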
Latency expectations: Cross-encoding 100 documents typically takes 50-200ms depending on model size and hardware. This is acceptable for most applications when added to a 20-50ms initial retrieval step.
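A quick way to sanity-check those numbers on your own hardware (the model, batch of 100 synthetic pairs, and warm-up convention are all assumptions for illustration):

```python
import time
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [("How do I reset my password?", f"candidate document {i}")
         for i in range(100)]

reranker.predict(pairs)  # warm-up pass so model loading is not timed
start = time.perf_counter()
reranker.predict(pairs)
print(f"{(time.perf_counter() - start) * 1000:.0f} ms for 100 pairs")
```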
The takeaway: use bi-encoders for scale, cross-encoders for precision. Most production systems need both.
Source
Sentence-BERT (Reimers & Gurevych, 2019) formalized the bi-encoder/cross-encoder distinction, showing that cross-encoders achieve higher accuracy at the cost of computational efficiency.
https://arxiv.org/abs/1908.10084