
Sparse Retrieval

Definition

Sparse retrieval represents documents and queries as high-dimensional vectors in which most values are zero and each non-zero value corresponds to a specific vocabulary term. This representation is the basis of traditional keyword search.

Why It Matters

Sparse retrieval has powered search for decades and remains relevant. While dense retrieval captures semantic meaning, sparse retrieval excels at exact matching, efficiency, and interpretability. When someone searches for “error ERR_CONNECTION_REFUSED,” you want exact match behavior, not semantic approximation.

The term “sparse” refers to how documents are represented: as vectors with one dimension per vocabulary word. A document about “machine learning” has non-zero values only for the words it actually contains; the rest of the vector is zeros.
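A minimal sketch of this idea, using a toy five-word vocabulary (the vocabulary and document here are illustrative, not from any real system):

```python
# Toy sparse representation: one dimension per vocabulary term.
vocabulary = ["algorithm", "car", "learning", "machine", "vehicle"]

def to_sparse_vector(text: str) -> list[int]:
    """Count-based sparse vector; most entries stay zero."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

doc = "machine learning beats machine intuition"
print(to_sparse_vector(doc))  # [0, 0, 1, 2, 0] -- mostly zeros
```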

For AI engineers, sparse retrieval matters because it complements dense retrieval in hybrid systems, is more efficient for large-scale search, and is essential when exact terminology matters (legal, medical, and technical domains).

Implementation Basics

Traditional sparse retrieval (BM25): Each document becomes a sparse vector where dimensions correspond to vocabulary terms. BM25 scores consider term frequency, document length, and how rare a term is across all documents. Documents are stored in inverted indexes for efficient lookup.
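As a concrete illustration, here is a self-contained sketch of the classic Okapi BM25 scoring function, with k1 and b set to common defaults; a production system would score via an inverted index rather than scanning the whole corpus as this toy version does:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Classic Okapi BM25 score for one document against a query.

    corpus: list of tokenized documents, used for IDF and average length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rarity across corpus
        freq = tf[term]                                  # term frequency
        # Document-length normalization: longer-than-average docs are
        # penalized, controlled by b.
        denom = freq + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * freq * (k1 + 1) / denom
    return score

corpus = [
    "sparse retrieval uses an inverted index".split(),
    "dense retrieval encodes documents as embeddings".split(),
    "hybrid search combines sparse and dense retrieval".split(),
]
print(bm25_score("sparse retrieval".split(), corpus[0], corpus))
```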

Learned sparse retrieval (SPLADE): Neural models learn which terms are important, expanding documents with related terms they don’t explicitly contain. A document about “cars” might get expanded to include “automobile” and “vehicle.” This bridges the vocabulary mismatch problem while keeping the efficiency of sparse representations.
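A hedged sketch of SPLADE-style inference with the Hugging Face transformers library; the checkpoint name is one published SPLADE model, and the pooling follows the log-saturated max-pooling described in the SPLADE papers:

```python
# Sketch of SPLADE-style term expansion. The checkpoint is an example;
# the pooling is log(1 + ReLU(logit)), max-pooled over token positions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-cocondenser-ensembledistil"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("used cars for sale", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# One weight per vocabulary term: a learned sparse vector.
weights = torch.max(
    torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1),
    dim=1,
).values.squeeze(0)  # shape: (vocab_size,)

# The highest-weighted terms typically include expansions ("automobile",
# "vehicle") that never appear in the input text.
top = torch.topk(weights, 10)
terms = tokenizer.convert_ids_to_tokens(top.indices.tolist())
print([(t, round(w, 2)) for t, w in zip(terms, top.values.tolist())])
```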

Advantages of sparse retrieval:

  • Efficiency: Inverted indexes enable sub-millisecond search over billions of documents (see the sketch after this list)
  • Interpretability: You can see exactly which terms caused a match
  • No embedding computation: BM25 requires no neural encoding of queries or documents
  • Works with any vocabulary: Handles new terms without retraining
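The first two advantages follow from the inverted index itself. A toy sketch (production indexes add compression, skip lists, and scoring on top of this structure):

```python
# Minimal inverted index: term -> postings set of document IDs.
from collections import defaultdict

docs = {
    0: "sparse retrieval uses an inverted index",
    1: "dense retrieval encodes documents as embeddings",
    2: "the inverted index maps terms to documents",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Lookup touches only the postings for the query terms, not every
# document, and the matched terms explain why each document was returned.
query = "inverted index"
for term in query.split():
    print(term, "->", sorted(index[term]))
# inverted -> [0, 2]
# index -> [0, 2]
```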

When to use sparse vs dense:

  • Use sparse for exact matching, known terminology, large-scale efficiency
  • Use dense for semantic understanding, conceptual matching
  • Use hybrid to get the best of both

Implementation options:

  • Elasticsearch, OpenSearch: BM25 out of the box (query sketch after this list)
  • SPLADE: Learned sparse, available through Pinecone or self-hosted
  • PyTerrier: Research framework for sparse retrieval experiments
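To show what the BM25 path looks like in practice, a hedged sketch using the Elasticsearch 8.x Python client; the index name and field are illustrative assumptions:

```python
# Hedged sketch: a BM25-scored match query via the Elasticsearch 8.x
# Python client. Index name ("docs") and field ("body") are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="docs",
    query={"match": {"body": "error ERR_CONNECTION_REFUSED"}},
    size=5,
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["body"])
```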

Modern best practice combines sparse and dense in hybrid search, using reciprocal rank fusion or learned combination weights. This covers both exact matches and semantic relevance.
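Reciprocal rank fusion is simple enough to sketch directly; the ranked lists below are hypothetical, and k=60 is the constant commonly used in the literature:

```python
# Reciprocal rank fusion: combine sparse and dense result rankings.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs; higher fused score is better."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

sparse_hits = ["doc3", "doc1", "doc7"]  # e.g. BM25 order (hypothetical)
dense_hits = ["doc1", "doc9", "doc3"]   # e.g. embedding order (hypothetical)
print(rrf([sparse_hits, dense_hits]))
# doc1 and doc3 rise to the top because both retrievers rank them highly.
```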

Source

SPLADE and similar learned sparse representations combine the efficiency of inverted indexes with the semantic understanding of neural models.

https://arxiv.org/abs/2104.08663