Keyword Search
Definition
Keyword search (also called lexical search) finds documents that contain the exact words in a query, then ranks them with algorithms like BM25, which weigh how often each term appears in a document against how rare that term is across the whole collection.
Why It Matters
Before semantic search, keyword search was all we had, and it still excels in specific scenarios. When users search for product codes, error messages, proper nouns, or technical identifiers, keyword search finds exact matches that semantic search might miss.
Semantic search understands that “automobile” and “car” are similar, but keyword search is better when you need “BMW 328i” exactly. Both have their place.
For AI engineers, keyword search remains relevant in three ways: as part of hybrid search systems, as a fallback when semantic search fails, and as the baseline to beat when evaluating your retrieval quality. Understanding keyword search helps you choose the right approach for each use case.
Implementation Basics
Keyword search algorithms rank documents by how well they match query terms:
TF-IDF (Term Frequency-Inverse Document Frequency): Words that appear frequently in a document but rarely across all documents are likely important. “Machine learning” appearing 10 times in one article but rarely elsewhere signals high relevance.
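To make the idea concrete, here is a minimal sketch of TF-IDF scoring in plain Python. It assumes whitespace tokenization and one common weighting variant (relative term frequency times log inverse document frequency); the function names and sample documents are illustrative, and real engines use more careful tokenization and smoothing.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace.
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            term_freq = tf[term] / len(tokens)   # frequent in this document
            idf = math.log(n_docs / df[term])    # rare across the collection
            score += term_freq * idf
        scores.append(score)
    return scores


docs = [
    "machine learning models learn from data",
    "the data pipeline moves data between systems",
    "cooking recipes for busy weeknights",
]
scores = tf_idf_scores("machine learning", docs)
```

Only the first document contains the query terms, so it is the only one with a non-zero score.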
BM25 (Best Match 25): An improved version of TF-IDF that normalizes for document length and saturates term frequency, so each additional repetition of a term adds less to the score. It’s the default ranking function in Lucene, Elasticsearch, and OpenSearch.
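The sketch below shows how saturation and length normalization enter the BM25 formula. It is a simplified illustration, not any engine's actual implementation: it assumes whitespace tokenization, the commonly used defaults `k1=1.2` and `b=0.75`, and the smoothed IDF form `log(1 + (N - n + 0.5) / (n + 0.5))`.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            freq = tf[term]
            if freq == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturation: this fraction approaches (k1 + 1) as freq grows.
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores


# Three documents of equal length, so only term frequency differs.
docs = [
    "search engines rank documents by matching the words users type",
    "search search search search search search search search search search",
    "cooking recipes for busy weeknights with very few simple ingredients",
]
scores = bm25_scores("search", docs)
```

The second document repeats the query term ten times, yet its score is only about twice that of the first, because the term-frequency component can never exceed `k1 + 1`.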
Implementation options:
- Elasticsearch/OpenSearch: Full-featured search engines with BM25, filters, facets.
- PostgreSQL full-text search: Built into your existing database.
- Tantivy (Rust) / Lucene (Java): Libraries for building custom search.
- SQLite FTS5: Lightweight option for smaller datasets.
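As a quick way to try one of these options, the following sketch uses SQLite FTS5 through Python's built-in `sqlite3` module. It assumes your SQLite build has FTS5 compiled in (most recent Python distributions do); the table name, columns, and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An FTS5 virtual table indexes every column for full-text search.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("BMW 328i review", "The BMW 328i is a compact sedan."),
        ("Car buying guide", "How to choose a used car."),
        ("Error E1042", "Error E1042 means the connection timed out."),
    ],
)


def search(query, limit=5):
    # Wrapping the query in double quotes makes FTS5 treat it as a phrase.
    phrase = '"' + query.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT title, bm25(docs) FROM docs "
        "WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (phrase, limit),
    )
    return rows.fetchall()


results = search("BMW 328i")
```

FTS5's `bm25()` returns lower (more negative) values for better matches, which is why ordering by `rank` ascending puts the best result first. A query for "automobile" returns nothing here, since no document contains that word.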
When to use keyword search:
- Exact match requirements (IDs, codes, names)
- Highly structured queries (known terminology)
- As part of hybrid search to complement semantic retrieval
- When you need to explain why a result matched (term highlighting)
Limitations: On its own, keyword search doesn’t understand synonyms, typos, or conceptual relationships. “Car” won’t match “automobile,” and “machine learning engineer” won’t match “ML engineer.” For these cases, you need semantic search or query expansion.
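Query expansion can be as simple as rewriting the query with known synonyms before it reaches the keyword index. The sketch below assumes a small hand-written synonym table (`SYNONYMS` is hypothetical); production systems typically rely on curated synonym lists or analyzer-level synonym filters instead.

```python
import re

# Hypothetical, hand-maintained synonym table for illustration only.
SYNONYMS = {
    "car": ["automobile"],
    "automobile": ["car"],
    "ml": ["machine learning"],
    "machine learning": ["ml"],
}


def expand_query(query):
    """Return the query plus variants with known synonyms swapped in."""
    query = query.lower()
    variants = {query}
    for phrase, alternatives in SYNONYMS.items():
        # Match whole words only, so "car" does not fire inside "cart".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, query):
            for alt in alternatives:
                variants.add(re.sub(pattern, alt, query))
    return sorted(variants)


expanded = expand_query("machine learning engineer")
# expanded == ["machine learning engineer", "ml engineer"]
```

Each variant can then be run against the keyword index and the results merged, so a document that says “ML engineer” is found by a query for “machine learning engineer.”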
The key insight: keyword search is transparent and predictable. You can always explain why a document matched. This matters for debugging and for domains where explainability is required.
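A small sketch of what that explainability looks like in practice: given a query and a matching document, list which query terms were found and mark them in the text. The function name and the bracket-style highlighting are illustrative choices; search engines such as Elasticsearch offer their own highlighting features.

```python
import re


def explain_match(query, document):
    """Return the query terms found in the document and a highlighted copy."""
    terms = set(query.lower().split())
    matched = []

    def mark(match):
        word = match.group(0)
        if word.lower() in terms:
            if word.lower() not in matched:
                matched.append(word.lower())
            return f"[{word}]"
        return word

    # Visit every word in the document and bracket the ones from the query.
    highlighted = re.sub(r"\w+", mark, document)
    return matched, highlighted


matched, highlighted = explain_match("BMW 328i", "Review: the BMW 328i sedan")
# matched == ["bmw", "328i"]
# highlighted == "Review: the [BMW] [328i] sedan"
```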
Source
TF-IDF and its successors like BM25 score documents based on term frequency (how often a word appears) inversely weighted by document frequency (how common the word is across all documents).
https://en.wikipedia.org/wiki/Tf%E2%80%93idf