KV Cache
Definition
Key-Value cache storing the attention keys and values computed during LLM generation, enabling efficient autoregressive decoding by avoiding redundant recomputation for previous tokens.
The KV (Key-Value) cache stores the attention key and value tensors computed for earlier tokens during LLM inference, so each new decoding step only computes keys and values for the newest token and reuses the cached tensors for everything before it.
Why It Matters
Without KV caching, every decoding step would recompute the keys and values for all previous tokens, so generating N tokens performs O(N²) redundant projection work; caching computes each token's keys and values once, cutting that redundant work to O(N) (see the decode-loop sketch at the end of this section):
- Massive speedup: Avoids re-projecting keys and values for every previous token at each step
- Memory trade-off: Stores large tensors in GPU memory
- Batch impact: Memory usage scales with batch size × sequence length
- Context limits: Maximum context often bound by KV cache memory
For AI engineers, KV cache management is critical for:
- Estimating GPU memory requirements
- Understanding inference latency behavior
- Optimizing concurrent request handling
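To make the reuse concrete, here is a minimal single-head decoding loop in NumPy. The head dimension, weights, and token embeddings are all made up for illustration; the point is that each step projects only the newest token and appends its key/value rows to the cache before attending over everything cached so far.

```python
import numpy as np

d = 64                                            # head dimension (assumed)
Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """x_new: (d,) embedding of the newest token; caches hold one row per past token."""
    q = x_new @ Wq
    k_cache = np.vstack([k_cache, x_new @ Wk])    # project and append ONLY the new token's key
    v_cache = np.vstack([v_cache, x_new @ Wv])    # ...and its value
    scores = k_cache @ q / np.sqrt(d)             # attend over all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache

k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
for x in np.random.randn(10, d):                  # ten decode steps
    out, k_cache, v_cache = decode_step(x, k_cache, v_cache)
print(k_cache.shape)                              # (10, 64): one cached key row per token
```

Without the cache, each step would have to re-project every earlier token through Wk and Wv before attending, which is exactly the redundant work the cache removes.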
Implementation Basics
KV cache characteristics:
- Size calculation: 2 × layers × heads × sequence_length × head_dim × batch_size × precision_bytes, where the leading 2 covers keys and values (see the sizing sketch after this list)
- Memory growth: Linear with sequence length
- Sharing: System prompts can share cached KV values
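As a sizing sketch (not a measured number), the formula can be evaluated directly. The dimensions below are Llama-2-7B-style assumptions: 32 layers, 32 KV heads, head dimension 128, and 2 bytes per value for FP16 (for models with grouped-query attention, count KV heads rather than query heads).

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    # The leading 2 covers keys AND values; bytes_per_value=2 assumes FP16/BF16.
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value

size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
print(f"{size / 1024**3:.2f} GiB")   # 2.00 GiB for a single 4K-token sequence
```

At batch size 8 the same 4K context already needs around 16 GiB, which is why the cache, not the weights, often caps how many requests fit on a GPU.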
Memory management strategies:
Static allocation:
- Pre-allocate maximum context length
- Simple but wasteful for variable-length sequences
- Roughly 60-80% of reserved KV memory goes unused in practice (see the sketch below)
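A toy illustration of where that waste comes from, using hypothetical request lengths against a pre-allocated 4096-token maximum:

```python
# Static allocation reserves max_len slots per request regardless of actual length.
max_len = 4096
actual_lens = [310, 1200, 95, 2048, 640]               # hypothetical request lengths
used = sum(actual_lens)
reserved = max_len * len(actual_lens)
print(f"wasted: {100 * (1 - used / reserved):.0f}%")   # ~79% of reserved KV slots unused
```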
PagedAttention (vLLM):
- Allocates the KV cache in fixed-size, non-contiguous blocks mapped through a block table (a toy sketch follows this list)
- Near-zero memory waste
- Enables efficient memory sharing across sequences
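A toy block-table allocator in the spirit of PagedAttention; block size and pool size are arbitrary assumptions, and real implementations also track reference counts so blocks can be shared.

```python
BLOCK_SIZE = 16                            # tokens stored per KV block (assumed)
free_blocks = list(range(1024))            # pool of physical block ids

class SequenceCache:
    def __init__(self):
        self.block_table = []              # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:           # current block is full (or first token)
            self.block_table.append(free_blocks.pop())  # grab any free physical block
        self.num_tokens += 1

seq = SequenceCache()
for _ in range(40):                        # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(seq.block_table)                     # 3 physical block ids, allocated only as needed
```

Because blocks are claimed on demand, a sequence never reserves more than one partially filled block, which is where the near-zero waste comes from.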
Prompt caching:
- Cache KV values for common prefixes
- System prompts cached across requests
- Anthropic and OpenAI expose prompt caching in their APIs (a lookup sketch follows this list)
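A hypothetical lookup sketch: compute_kv stands in for the prefill pass, and the cache key is simply a hash of the shared prefix text (real systems key on token ids and cache the KV blocks themselves).

```python
import hashlib

prefix_cache = {}                          # prefix hash -> precomputed KV tensors (opaque here)

def get_prefix_kv(system_prompt, compute_kv):
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    if key not in prefix_cache:
        prefix_cache[key] = compute_kv(system_prompt)  # run prefill once for this prefix
    return prefix_cache[key]                           # later requests reuse the cached result

prefill = lambda p: f"<KV tensors for {len(p.split())} prompt tokens>"   # stand-in for prefill
kv1 = get_prefix_kv("You are a helpful assistant.", prefill)
kv2 = get_prefix_kv("You are a helpful assistant.", prefill)
assert kv1 is kv2                          # second request skipped prefill for the shared prefix
```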
Practical considerations:
- Long contexts (100K+ tokens) can require 50 GB+ of KV cache per request even at 7B scale (see the arithmetic below)
- Quantized KV cache (FP8) halves memory relative to FP16
- Sliding window attention caps cache growth but sacrifices long-range context
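The 100K-token and FP8 figures above follow directly from the sizing formula; the arithmetic below reuses the same assumed 7B-scale dimensions.

```python
# FP16 vs FP8 KV cache size for a hypothetical 100K-token request (7B-scale dims assumed).
layers, kv_heads, head_dim, seq_len = 32, 32, 128, 100_000
fp16 = 2 * layers * kv_heads * head_dim * seq_len * 2   # 2 bytes per value
fp8  = 2 * layers * kv_heads * head_dim * seq_len * 1   # 1 byte per value
print(f"FP16: {fp16 / 1e9:.1f} GB, FP8: {fp8 / 1e9:.1f} GB")   # 52.4 GB vs 26.2 GB
```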
Much of modern LLM serving research focuses on KV cache optimization, since cache memory is often the bottleneck for scaling to many concurrent users.
Source
KV cache memory management is central to efficient LLM serving; see Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (2023).
https://arxiv.org/abs/2309.06180