Cosine Similarity
Definition
Cosine similarity measures the cosine of the angle between two vectors, returning a value from -1 to 1 that indicates how similar their directions are regardless of magnitude. It is the primary metric for comparing embeddings in semantic search.
Why It Matters
When you convert text into embeddings, you get high-dimensional vectors (often 768 or 1536 dimensions). Comparing these vectors requires a similarity metric, and cosine similarity is the standard choice for text embeddings.
Why cosine over Euclidean distance? Cosine similarity focuses on direction, not magnitude. Two documents about the same topic will point in similar directions in embedding space, even if one is longer and has a larger vector magnitude. Euclidean distance would penalize this length difference; cosine similarity ignores it.
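A quick way to see the difference is to scale a toy vector and compare the two metrics. The sketch below uses NumPy with made-up three-dimensional vectors standing in for real embeddings:

```python
import numpy as np

short_doc = np.array([0.2, 0.7, 0.1])   # toy embedding for a short document
long_doc = short_doc * 3.0               # same direction, three times the magnitude

# Cosine similarity: dot product divided by the product of magnitudes
cos = np.dot(short_doc, long_doc) / (np.linalg.norm(short_doc) * np.linalg.norm(long_doc))
dist = np.linalg.norm(short_doc - long_doc)  # Euclidean distance

print(cos)   # 1.0    -- cosine similarity ignores the length difference
print(dist)  # ~1.47  -- Euclidean distance penalizes it
```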
For AI engineers, understanding cosine similarity helps you debug retrieval issues. When your RAG system returns irrelevant results, you can inspect the similarity scores. A score of 0.9 indicates strong semantic overlap; 0.5 suggests weak relevance. This gives you concrete numbers to optimize against.
Implementation Basics
The formula is straightforward: divide the dot product of two vectors by the product of their magnitudes, i.e. similarity = (A · B) / (‖A‖ × ‖B‖).
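As a rough sketch, the formula maps directly to a few lines of NumPy; the vectors here are illustrative, not real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the two vectors divided by the product of their magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(cosine_similarity(a, b))   #  1.0 -- same direction
print(cosine_similarity(a, -b))  # -1.0 -- opposite direction
```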
In practice: Most vector databases handle cosine similarity internally, and you just specify the metric when creating your index. When using Pinecone, Weaviate, or Chroma, configure the index for cosine similarity (sometimes called “cosine distance” where distance = 1 - similarity).
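For example, a minimal Chroma setup might look like the sketch below. The collection name and toy embeddings are placeholders, and the `hnsw:space` configuration key may vary across client versions:

```python
import chromadb

# Create a collection configured for cosine distance (Chroma's default is "l2").
client = chromadb.Client()
collection = client.create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"},  # distance = 1 - cosine similarity
)

collection.add(
    ids=["a", "b"],
    embeddings=[[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],  # toy 3-dimensional embeddings
)

results = collection.query(query_embeddings=[[0.1, 0.85, 0.05]], n_results=2)
print(results["distances"])  # smaller distance = more similar
```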
Similarity thresholds: Typical ranges for text embeddings (see the filtering sketch after this list):
- 0.8-1.0: Very similar or near-duplicate content
- 0.6-0.8: Related topics, good retrieval candidates
- 0.4-0.6: Loosely related, often noise
- Below 0.4: Usually irrelevant
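One common use of these ranges is filtering retrieval candidates before passing them to a model. The document IDs and scores below are hypothetical, and the 0.6 cutoff is illustrative; tune it against your own data and embedding model:

```python
# Hypothetical retrieval results: (document_id, cosine similarity to the query)
candidates = [
    ("doc_14", 0.83),
    ("doc_07", 0.64),
    ("doc_22", 0.52),
    ("doc_31", 0.31),
]

THRESHOLD = 0.6  # illustrative cutoff, not a universal constant

# Keep only candidates above the relevance threshold
relevant = [(doc_id, score) for doc_id, score in candidates if score >= THRESHOLD]
print(relevant)  # [('doc_14', 0.83), ('doc_07', 0.64)]
```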
Normalized vectors: Many embedding models output normalized vectors (magnitude = 1). For normalized vectors, cosine similarity equals the dot product, which is faster to compute. This is why some systems normalize embeddings before storage.
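A small check makes this concrete; the vectors below are toy examples normalized to unit length:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    # Scale the vector to unit length (magnitude = 1)
    return v / np.linalg.norm(v)

a = normalize(np.array([0.2, 0.7, 0.1]))
b = normalize(np.array([0.5, 0.4, 0.3]))

dot = np.dot(a, b)                                            # plain dot product
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # full cosine formula

print(np.isclose(dot, cos))  # True -- for unit vectors the two are identical
```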
Choosing metrics: Cosine similarity works best for semantic text matching. For some use cases (image embeddings, certain recommendation systems), Euclidean distance or inner product may perform better. Test with your actual data.
The key insight: cosine similarity tells you “are these pointing in the same direction?” rather than “are these close together in space?” For language understanding, direction matters more than position.
Source
Cosine similarity equals the dot product of two vectors divided by the product of their magnitudes, measuring orientation rather than Euclidean distance.
https://en.wikipedia.org/wiki/Cosine_similarity