Caching
Definition
Caching in AI systems stores computed results to avoid redundant model inference, dramatically reducing latency and costs for repeated or similar queries.
Why It Matters
AI inference is expensive, both in latency (100ms-10s per request) and cost ($0.001-$0.10 per request). Caching eliminates that cost for repeated queries: if 30% of your queries are near-duplicates, effective caching can cut inference costs by roughly 30% while dramatically improving response times for cached queries.
For RAG systems, caching is particularly valuable. Embedding calculations for the same document or query produce identical results. Caching embeddings means you compute each document’s embedding once, regardless of how many times it’s retrieved.
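A minimal sketch of an embedding cache, assuming an in-process dict keyed by a SHA-256 hash of the normalized text; the `embed` function below is a deterministic placeholder standing in for your actual embedding model or API:

```python
import hashlib

def embed(text: str) -> list[float]:
    # Placeholder for a real embedding model call (API client or local model);
    # deterministic so the example runs on its own.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:8]]

_embedding_cache: dict[str, list[float]] = {}

def cached_embedding(text: str) -> list[float]:
    """Compute each document's embedding once, then reuse it on every later retrieval."""
    key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed(text)
    return _embedding_cache[key]
```

In practice the dict would be replaced by a persistent store (Redis, a column next to the document, or the vector database itself) so embeddings survive restarts.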
Semantic caching extends this further, identifying that “What’s the weather?” and “Weather today?” should return the same cached response. This requires embedding queries and finding semantically similar past queries in your cache.
Implementation Basics
Caching layers for AI systems:
- Exact match cache: Hash the input, return cached output if hash matches
- Semantic cache: Embed queries, return cached outputs for semantically similar queries
- Embedding cache: Store computed embeddings to avoid re-embedding documents
- KV cache: LLM-internal reuse of attention key-value tensors across decoding steps, so earlier tokens are not recomputed during generation
Implementation patterns:
Exact caching uses standard cache systems (Redis, Memcached). Use a hash of the normalized input as the key and the serialized output as the value. Simple and effective for repeated identical queries.
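A minimal sketch of that pattern with redis-py; `call_model` is a hypothetical placeholder for your actual inference call, and the key prefix and TTL are illustrative choices:

```python
import hashlib
import json

import redis  # requires the redis-py package and a running Redis instance

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for your actual LLM call.
    return f"response to: {prompt}"

def cached_completion(prompt: str, ttl_seconds: int = 3600) -> str:
    # Hash the normalized input to form a stable cache key.
    key = "llm:" + hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)["output"]
    output = call_model(prompt)
    # Store the serialized output with a TTL so stale entries expire on their own.
    r.set(key, json.dumps({"output": output}), ex=ttl_seconds)
    return output
```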
Semantic caching embeds each query and searches for similar cached queries using vector similarity. If similarity exceeds a threshold, return the cached response. Tools like GPTCache implement this pattern.
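The library-agnostic core of that idea looks roughly like the sketch below: scan cached query embeddings with cosine similarity and return the stored response when the best match clears a threshold. The 0.9 threshold and the linear scan are assumptions for illustration; a production semantic cache would typically back the lookup with a vector index and tune the threshold on real traffic.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.9  # illustrative; tune on your own query traffic

# Each entry pairs a query embedding with the response cached for it.
_semantic_cache: list[tuple[np.ndarray, str]] = []

def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_lookup(query_embedding: np.ndarray) -> str | None:
    """Return a cached response if any past query is similar enough, else None."""
    best_score, best_response = 0.0, None
    for emb, response in _semantic_cache:
        score = _cosine(query_embedding, emb)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None

def semantic_store(query_embedding: np.ndarray, response: str) -> None:
    """Add a freshly computed response to the cache after a miss."""
    _semantic_cache.append((query_embedding, response))
```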
Cache invalidation strategies (a combined sketch follows this list):
- TTL (Time-To-Live): Expire entries after fixed duration
- LRU (Least Recently Used): Evict the least recently accessed entries when the cache is full
- Manual invalidation: Clear cache when underlying data changes
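As an illustrative combination of these strategies, the small in-process cache below expires entries after a TTL, evicts the least recently used entry when full, and exposes a manual invalidation hook; the class name and defaults are invented for the example. For a shared cache you rarely need to hand-roll this: Redis supports TTLs natively (the `EX` argument to `SET`) and LRU-style eviction via its `maxmemory-policy` setting.

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """In-process cache combining TTL expiry, LRU eviction, and manual invalidation."""

    def __init__(self, max_entries: int = 1024, ttl_seconds: float = 3600.0):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._data: OrderedDict[str, tuple[float, str]] = OrderedDict()

    def get(self, key: str) -> str | None:
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if time.time() - stored_at > self.ttl_seconds:  # TTL: entry has expired
            del self._data[key]
            return None
        self._data.move_to_end(key)  # LRU: mark as recently used
        return value

    def set(self, key: str, value: str) -> None:
        self._data[key] = (time.time(), value)
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:  # LRU: drop the least recently used entry
            self._data.popitem(last=False)

    def invalidate(self, key: str) -> None:
        # Manual invalidation: call this when the underlying data changes.
        self._data.pop(key, None)
```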
Key considerations:
- Cache hit rate depends on query distribution; workloads with heavy repetition benefit most
- Semantic caching adds latency (embedding + similarity search) that may exceed inference time for small models
- Monitor cache effectiveness: hit rate, latency reduction, cost savings (see the tracking sketch after this list)
- Consider privacy implications of caching user queries
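To make that monitoring concrete, a small tracking sketch is shown below; the default latency and cost per avoided call are placeholders to replace with your own measurements.

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    """Track hit rate plus estimated latency and cost savings."""
    hits: int = 0
    misses: int = 0
    saved_latency_s: float = 0.0
    saved_cost_usd: float = 0.0

    def record(self, hit: bool, latency_s: float = 1.0, cost_usd: float = 0.01) -> None:
        # latency_s / cost_usd: what one uncached inference call would have spent.
        if hit:
            self.hits += 1
            self.saved_latency_s += latency_s
            self.saved_cost_usd += cost_usd
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```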
Start with exact match caching for embeddings and repeated queries. Add semantic caching only if you have high query similarity but low exact-match rates, and measure whether the added complexity improves overall performance.
Source
Client-side caching allows applications to store frequently accessed data locally, reducing network round trips and improving response times.
https://redis.io/docs/manual/client-side-caching/