Reranking
Definition
Reranking is a second-stage retrieval process that takes initial search results and reorders them using a more powerful model to improve precision, typically using cross-encoders that compare query and document together.
Why It Matters
Initial retrieval is fast but imprecise. When you query a vector database, you get the top 100 results in milliseconds, but the ordering isn’t perfect. Document #15 might actually be more relevant than document #3. Reranking fixes this by applying a more sophisticated model to reorder the results.
The pattern is universal in production search: cast a wide net with fast retrieval, then use a slower, more accurate model to surface the best results. Google does this. Bing does this. Your RAG system should too.
For AI engineers, reranking is one of the highest-impact improvements you can add to a retrieval pipeline. It typically improves precision significantly with minimal latency impact (you’re only reranking the top 50-100 documents, not your entire corpus).
Implementation Basics
The two-stage pattern:
Stage 1: Initial retrieval. Use bi-encoder embeddings or BM25 to quickly find the top 100-500 candidate documents. This is your recall stage: you want to include all potentially relevant documents.
Stage 2: Reranking. Pass the query and each candidate document through a cross-encoder model that outputs a relevance score. Sort by this score to get your final ranking.
Why cross-encoders are more accurate: Bi-encoders embed query and document separately, so they can't model the interaction between the two directly. Cross-encoders concatenate query and document and process them together, enabling fine-grained, token-level comparison. This is slower (scores depend on the query, so nothing can be pre-computed per document) but more accurate.
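The two stages above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the bag-of-words cosine and token-overlap scorers are crude stand-ins for a real bi-encoder and cross-encoder (e.g. trained transformer models), and `retrieve_then_rerank` is a hypothetical helper name.

```python
from math import sqrt

def bi_encode(text):
    """Stand-in bi-encoder: term counts, computed per text in isolation,
    so document vectors could be pre-computed offline."""
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cross_score(query, doc):
    """Stand-in cross-encoder: looks at query and document together,
    so nothing here can be pre-computed per document."""
    q, d = query.lower().split(), doc.lower().split()
    return sum(1 for t in q if t in d) / max(len(q), 1)

def retrieve_then_rerank(query, docs, retrieve_k=100, final_k=5):
    # Stage 1: fast scoring over the whole corpus (recall).
    doc_vecs = [bi_encode(d) for d in docs]  # pre-computed in practice
    q_vec = bi_encode(query)
    candidates = sorted(range(len(docs)),
                        key=lambda i: cosine(q_vec, doc_vecs[i]),
                        reverse=True)[:retrieve_k]
    # Stage 2: slower joint scoring over the candidates only (precision).
    reranked = sorted(candidates,
                      key=lambda i: cross_score(query, docs[i]),
                      reverse=True)
    return [docs[i] for i in reranked[:final_k]]
```

In a real pipeline, stage 1 is a vector database or BM25 index query and stage 2 is a cross-encoder forward pass per (query, document) pair; the structure, though, is the same.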
Reranking models:
- Cohere Rerank: Hosted API, easy to integrate
- Mixedbread rerank: Open-source, self-hostable
- BGE reranker: Open-source, multiple sizes
- Cross-encoder/ms-marco-MiniLM: Lightweight option
Implementation tips:
- Retrieve more than you need (100-500), rerank down to your final count (5-20)
- Cache reranked results for repeated queries
- Set a score threshold; low-scoring results may not be worth including
- Monitor reranking latency; it adds 50-200ms typically
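The thresholding and caching tips can be combined in a small helper. A minimal sketch, assuming some cross-encoder scoring function is available; `overlap_score` is a toy stand-in, and the threshold and counts are illustrative, not tuned values.

```python
from functools import lru_cache

def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder relevance score."""
    q = query.lower().split()
    return sum(1 for t in q if t in doc.lower().split()) / max(len(q), 1)

def rerank(query, candidates, score_fn=overlap_score,
           final_k=10, threshold=0.2):
    """Score all candidates, drop low-scoring ones, keep the top final_k."""
    scored = sorted(((score_fn(query, d), d) for d in candidates),
                    key=lambda pair: pair[0], reverse=True)
    # Apply the score threshold before truncating to final_k.
    return [d for s, d in scored if s >= threshold][:final_k]

@lru_cache(maxsize=1024)
def cached_rerank(query: str, candidates: tuple):
    # lru_cache requires hashable arguments, hence the tuple of candidates.
    return tuple(rerank(query, candidates))
```

An in-process LRU cache like this only helps with exact repeat queries over the same candidate set; production systems typically cache in a shared store keyed on a hash of the query and candidate IDs.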
When to skip reranking:
- Very small document sets (< 1000) where initial retrieval is already precise
- Latency-critical applications where 100ms matters
- When you’ve already fine-tuned your bi-encoder for your domain
For most RAG applications, adding a reranker is the single most effective improvement you can make after basic semantic search is working.
Source
Cross-encoder rerankers that jointly encode query and passage achieve higher accuracy than bi-encoders, making them ideal for reranking the top results from an initial retrieval stage.
https://arxiv.org/abs/1901.04085