Retrieval
Definition
Retrieval is the process of finding and ranking relevant documents from a knowledge base in response to a query, typically using embedding similarity, keyword matching, or hybrid approaches.
Why It Matters
Retrieval is the bottleneck in most RAG systems. Your LLM can only reason about what you provide. If retrieval misses relevant documents or returns irrelevant ones, the generated answer suffers, no matter how capable the model.
Retrieval quality comes down to two measures: precision (are the retrieved documents relevant?) and recall (did we find all the relevant documents?). In practice, you optimize for both while balancing latency and cost: retrieving 100 documents might improve recall, but it wastes tokens and slows responses.
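To make the two measures concrete, precision@k and recall@k can be computed per query against a small labeled set of relevant documents. The helper below is a minimal sketch with hypothetical names, not part of any particular library.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Compute precision@k and recall@k for a single query.

    retrieved_ids: ranked list of document IDs returned by the retriever
    relevant_ids: set of document IDs labeled relevant for the query
    """
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Example: 2 of the top 3 results are relevant, out of 4 relevant docs total
p, r = precision_recall_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9", "d12"}, k=3)
print(round(p, 3), round(r, 3))  # 0.667 0.5
```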
For AI engineers, retrieval engineering is where you spend most debugging time. When answers are wrong or incomplete, the cause is almost always retrieval, not the LLM.
Implementation Basics
Retrieval Approaches
- Dense retrieval: Embed query and documents, find nearest neighbors. Good for semantic similarity.
- Sparse retrieval (BM25): Keyword matching with term frequency weighting. Good for specific terms.
- Hybrid: Combine dense and sparse scores. Often the best of both worlds (a fusion sketch follows this list).
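One widely used way to combine dense and sparse results is reciprocal rank fusion (RRF), which merges ranked lists without having to calibrate their raw scores against each other. The sketch below assumes you already have ranked ID lists from each retriever; k=60 is the commonly used default constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into a single ranking.

    A document's fused score is the sum of 1 / (k + rank) over every list
    it appears in, so documents ranked highly by multiple retrievers rise
    to the top. The constant k dampens the influence of any single list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_a", "doc_c", "doc_b"]   # from embedding search
sparse_hits = ["doc_c", "doc_d", "doc_a"]  # from BM25
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# ['doc_c', 'doc_a', 'doc_d', 'doc_b'] -- docs found by both retrievers win
```

Score-level fusion (a weighted sum of normalized dense and BM25 scores) also works, but RRF avoids the score normalization step entirely.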
Key Parameters
- Top-K: How many documents to retrieve (typically 3-10)
- Similarity threshold: Minimum score to include a result
- Reranking: Use a more expensive model to reorder initial results (see the selection sketch after this list)
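The sketch below shows how these parameters typically interact in a retrieval pipeline: filter by a similarity threshold, sort, optionally rerank a wider pool, then keep the top-k. The function and parameter names are illustrative, not from a specific library.

```python
def select_results(scored_docs, top_k=5, min_score=0.3, reranker=None):
    """Apply a similarity threshold and top-k cutoff, with an optional rerank step.

    scored_docs: list of (doc, score) pairs from the first-stage retriever,
        where score is a similarity such as cosine similarity.
    reranker: optional callable doc -> relevance score, e.g. a cross-encoder
        closed over the query.
    """
    # Drop weak matches first so the threshold applies to raw retriever scores.
    candidates = [(doc, score) for doc, score in scored_docs if score >= min_score]
    candidates.sort(key=lambda pair: pair[1], reverse=True)

    if reranker is not None:
        # Rerank a wider pool than top_k so the reranker can promote
        # documents the first-stage retriever underrated.
        pool = candidates[: top_k * 4]
        candidates = sorted(pool, key=lambda pair: reranker(pair[0]), reverse=True)

    return [doc for doc, _ in candidates[:top_k]]
```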
Improving Retrieval
- Query expansion: Rewrite user queries to match document language
- Metadata filtering: Narrow search by date, category, or source
- Multi-query: Generate variations of the query and merge results
- Reranking: Use cross-encoders to score query-document pairs more accurately (sketched below)
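To make the reranking step concrete, here is a sketch using the CrossEncoder class from the sentence-transformers library; the specific model name is one common public checkpoint and should be swapped for whatever fits your stack.

```python
from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and document together, so it judges
# relevance more accurately than comparing independently computed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, keep=5):
    """Reorder first-stage candidate documents by cross-encoder score."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:keep]]
```

Because cross-encoders are slower than embedding lookups, they are usually applied only to a few dozen first-stage candidates rather than the whole corpus.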
Debugging Retrieval
When answers are wrong, always inspect the retrieved documents first. Add logging to see what is being retrieved. Common issues: query-document vocabulary mismatch, wrong metadata filters, too few chunks retrieved, or query and document embeddings produced by different models.
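A minimal logging hook makes these failure modes visible; the result shape here (text, score, metadata) is illustrative and should match whatever your retriever actually returns.

```python
import logging

logger = logging.getLogger("rag.retrieval")

def log_retrieval(query, results):
    """Log retrieved documents so bad answers can be traced to bad context.

    results: list of (doc_text, score, metadata) tuples (shape is illustrative).
    """
    logger.info("query=%r returned %d documents", query, len(results))
    for rank, (text, score, metadata) in enumerate(results, start=1):
        logger.info(
            "rank=%d score=%.3f source=%s snippet=%r",
            rank, score, metadata.get("source", "unknown"), text[:120],
        )
```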
Start simple. Basic semantic search often works well. Add complexity (hybrid search, reranking) only when you identify specific failure modes through testing.
Source
Dense retrieval using learned embeddings outperforms traditional sparse methods like BM25 on many benchmarks, while hybrid approaches combining both show further improvements.
https://arxiv.org/abs/2112.09118