
BM25

Definition

BM25 (Best Match 25) is a ranking function that scores documents against a query using term frequency and inverse document frequency, with document length normalization. It is the de facto standard for keyword-based search.

Why It Matters

BM25 has powered search for 30 years and isn’t going anywhere. When you search on Elasticsearch, Lucene, or OpenSearch, you’re using BM25 by default. Understanding it helps you tune search quality and decide when keyword search outperforms semantic alternatives.

Despite the hype around neural search, BM25 wins in specific scenarios: exact match queries, highly specialized vocabulary, systems where every millisecond matters, and as the sparse component of hybrid search.

For AI engineers, BM25 is essential knowledge because it’s the baseline. When you add semantic search to a system, you compare against BM25. When you build hybrid search, BM25 is half of your pipeline. It’s battle-tested, fast, and requires no GPUs.

Implementation Basics

The BM25 formula considers three factors:

Term Frequency (TF): How often a term appears in the document. More occurrences mean a higher score, but BM25 applies saturation, so the 100th occurrence adds far less than the 10th. This prevents keyword stuffing from dominating.

Inverse Document Frequency (IDF): Rare terms matter more. “Machine learning” appearing in 1% of documents contributes more to the score than “the” appearing in 99%. IDF downweights common terms that don’t indicate relevance.

Document Length Normalization: Longer documents naturally contain more term occurrences. BM25 scales term frequency by the document’s length relative to the average document length in the corpus, so short, focused documents can compete with long, comprehensive ones.
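For reference, the three factors combine as follows in the commonly cited Okapi form of the formula. This is a sketch of one widespread variant, not the exact expression every engine uses; implementations differ in details such as the IDF smoothing. Here f(t, D) is the count of term t in document D, |D| is the document length in tokens, avgdl is the average document length, N is the number of documents, n(t) is the number of documents containing t, and k_1 and b are the parameters described under Key parameters.

```latex
\[
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
\]
\[
\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n(t) + 0.5}{n(t) + 0.5}\right)
\]
```

The fraction carries both saturation (it approaches k_1 + 1 as f(t, D) grows) and length normalization (a longer-than-average document enlarges the denominator).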

Key parameters:

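k1: Controls term frequency saturation (typically 1.2-2.0). Higher values let repeated occurrences keep adding to the score for longer before it levels off; lower values make the score saturate sooner.
b: Controls length normalization (ranges from 0 to 1, typically 0.75). At 0, document length is ignored; at 1, normalization is applied fully. Lower values reduce the penalty on long documents.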
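To make the roles of k1 and b concrete, here is a minimal from-scratch sketch in Python. It assumes documents and queries are already tokenized into lists of lowercase strings and uses a smoothed, always-positive IDF variant; the function name bm25_scores and the sample corpus are illustrative, not taken from any library.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document for a tokenized query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: equals 1.0 for an average-length document.
        length_norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (an assumed variant; engines differ here).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturating term-frequency component.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "bm25 ranks documents by keyword relevance".split(),
    "the cat sat on the mat".split(),
    "bm25 bm25 bm25 keyword stuffing does not help much".split(),
]
query = ["bm25", "keyword"]
scores = bm25_scores(query, docs)
scores_no_length_norm = bm25_scores(query, docs, b=0.0)
```

In this toy corpus the third document repeats “bm25” three times yet scores only modestly higher than the first, which is the saturation effect at work. Setting b=0.0 removes the length term entirely, so only term counts and IDF remain.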

Implementation options:

Elasticsearch/OpenSearch: BM25 is the default
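PostgreSQL full-text search: Built-in tsvector/tsquery matching; note that the built-in ts_rank functions are not BM25, so BM25 ranking requires an extension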
rank_bm25 Python package: For in-memory search
Tantivy (Rust): Lucene-inspired search library with BM25 scoring

When to use BM25:

Exact match requirements (product codes, error messages)
Known vocabulary domains (legal, medical)
As part of hybrid search
When you need very low query latency without model inference or GPUs
When you need to explain why a document matched

When to prefer semantic search:

Conceptual queries (“how to fix slow API”)
Vocabulary mismatch likely (users don’t know your terminology)
Exploratory search

The best modern systems combine BM25 and semantic search, merging the two result lists with reciprocal rank fusion or learned weights. BM25 catches the exact matches that semantic search misses, while semantic search covers the paraphrases that BM25 misses.
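Reciprocal rank fusion is simple enough to sketch in a few lines. The example below assumes each retriever returns a list of document IDs ordered best first; the function name, the document IDs, and the two sample rankings are hypothetical. The constant k=60 is the value conventionally used with this method, but it is a tunable assumption here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document IDs into one ranked list."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); absent documents add nothing.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical outputs of a BM25 retriever and a semantic retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
semantic_ranking = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
```

Because the fusion uses only ranks, it needs no score calibration between the two retrievers, which is the main reason it is a common starting point before moving to learned weights.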

Source

BM25 was developed in the 1990s as part of the Okapi retrieval system at City University, London, and remains the default ranking function in Lucene, Elasticsearch, and many other production search engines.

https://en.wikipedia.org/wiki/Okapi_BM25
