Architecture

KV Cache

Definition

Key-Value cache storing computed attention states during LLM generation, enabling efficient autoregressive decoding by avoiding redundant computation of previous tokens.

The KV (Key-Value) cache stores the attention key and value tensors computed during LLM inference, allowing subsequent token generation to reuse past computations rather than recalculating them.

Why It Matters

Without KV caching, every decoding step would recompute the keys and values for all previous tokens, so generating N tokens repeats work that has already been done, and the redundant computation grows quadratically with sequence length. Caching those tensors means each step computes keys and values only for the newest token (a minimal sketch follows the list below):

  • Massive speedup: Avoids recomputing attention for all previous tokens
  • Memory trade-off: Stores large tensors in GPU memory
  • Batch impact: Memory usage scales with batch size × sequence length
  • Context limits: Maximum context often bound by KV cache memory
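
A minimal sketch of the idea in PyTorch (single attention head, toy dimensions, purely illustrative rather than any particular model's implementation): each step computes a key and value only for the newest token, appends them to the cache, and attends over the whole cached prefix.

  import torch

  torch.manual_seed(0)
  d_model, head_dim = 64, 64
  W_q = torch.randn(d_model, head_dim) / d_model**0.5
  W_k = torch.randn(d_model, head_dim) / d_model**0.5
  W_v = torch.randn(d_model, head_dim) / d_model**0.5

  k_cache, v_cache = [], []                      # grows by one entry per generated token

  def decode_step(x_t):
      """One decoding step for a single new token embedding x_t (shape [d_model])."""
      q = x_t @ W_q                              # query for the new token only
      k_cache.append(x_t @ W_k)                  # K and V are computed once, then reused
      v_cache.append(x_t @ W_v)
      K = torch.stack(k_cache)                   # [t, head_dim]
      V = torch.stack(v_cache)                   # [t, head_dim]
      scores = (K @ q) / head_dim**0.5           # attention over the cached prefix
      weights = torch.softmax(scores, dim=0)
      return weights @ V                         # context vector for the new token

  # Each step costs work proportional to the current length, not its square,
  # because previous keys and values are never recomputed.
  for t in range(5):
      out = decode_step(torch.randn(d_model))
      print(t, out.shape)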

For AI engineers, KV cache management is critical for:

  • Estimating GPU memory requirements
  • Understanding inference latency behavior
  • Optimizing concurrent request handling

Implementation Basics

KV cache characteristics:

  1. Size calculation: 2 (K and V) × num_layers × num_kv_heads × sequence_length × head_dim × batch_size × bytes_per_element (worked example after this list)
  2. Memory growth: linear in sequence length
  3. Sharing: requests with a common prefix (e.g. the same system prompt) can share cached KV values
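
As a rough worked example of the size formula above, using hypothetical 7B-class dimensions (32 layers, 32 KV heads, head_dim 128, FP16); models with grouped-query attention keep fewer KV heads than query heads and therefore need proportionally less:

  def kv_cache_bytes(num_layers, num_kv_heads, seq_len, head_dim,
                     batch_size=1, bytes_per_elem=2):
      # The leading 2 accounts for storing both K and V; bytes_per_elem=2 assumes FP16/BF16.
      return 2 * num_layers * num_kv_heads * seq_len * head_dim * batch_size * bytes_per_elem

  # Hypothetical 7B-class model: 32 layers, 32 KV heads, head_dim 128, FP16.
  per_seq_gb = kv_cache_bytes(32, 32, 4096, 128) / 1e9
  print(f"4K-token context: ~{per_seq_gb:.1f} GB per sequence")    # ~2.1 GB
  print(f"Batch of 16:      ~{per_seq_gb * 16:.0f} GB")            # grows linearly with batch size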

Memory management strategies:

Static allocation:

  • Pre-allocate maximum context length
  • Simple but wasteful for variable-length sequences
  • Reportedly 60-80% of KV cache memory wasted in practice (per the vLLM paper)

PagedAttention (vLLM):

  • Allocate KV cache in non-contiguous blocks
  • Near-zero memory waste
  • Enables efficient memory sharing across sequences (simplified sketch below)
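
A highly simplified sketch of the paged-allocation idea (toy Python, not vLLM's actual implementation): each sequence's cache is a block table pointing at fixed-size physical blocks drawn from a shared pool, so memory is committed in small increments and blocks can be freed independently.

  BLOCK_TOKENS = 16                      # tokens stored per KV block

  class BlockPool:
      def __init__(self, num_blocks):
          self.free = list(range(num_blocks))
      def alloc(self):
          return self.free.pop()         # raises if the pool is exhausted
      def release(self, block_id):
          self.free.append(block_id)

  class SequenceCache:
      """Maps a sequence's logical token positions onto non-contiguous physical blocks."""
      def __init__(self, pool):
          self.pool = pool
          self.block_table = []          # logical block index -> physical block id
          self.num_tokens = 0
      def append_token(self):
          if self.num_tokens % BLOCK_TOKENS == 0:   # current block is full: grab one more
              self.block_table.append(self.pool.alloc())
          self.num_tokens += 1
      def free(self):
          for b in self.block_table:
              self.pool.release(b)
          self.block_table.clear()

  pool = BlockPool(num_blocks=1024)
  seq = SequenceCache(pool)
  for _ in range(40):                    # 40 tokens commit only ceil(40/16) = 3 blocks
      seq.append_token()
  print(seq.block_table)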

Prompt caching:

  • Cache KV values for common prefixes
  • System prompts cached across requests
  • Anthropic and OpenAI offer prompt caching as an API feature (conceptual sketch below)
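
Conceptually, the serving layer can key cached prefill results by a hash of the token prefix, so requests that start with the same system prompt skip recomputing it. A toy sketch of that lookup (hypothetical helper names, not any provider's real API):

  import hashlib

  prefix_cache = {}                                # prefix hash -> precomputed KV state

  def prefix_key(token_ids):
      return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

  def get_or_compute_prefix(token_ids, compute_kv):
      key = prefix_key(token_ids)
      if key not in prefix_cache:                  # first request pays the full prefill cost
          prefix_cache[key] = compute_kv(token_ids)
      return prefix_cache[key]                     # later requests with the same prefix reuse it

  system_prompt_ids = list(range(200))             # stand-in for a tokenized system prompt
  kv = get_or_compute_prefix(system_prompt_ids,
                             compute_kv=lambda ids: f"kv-for-{len(ids)}-tokens")  # placeholder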

Practical considerations:

  • Long contexts (100K+ tokens) can require 50GB+ of KV cache for a single request (see the arithmetic below)
  • Quantizing the KV cache (e.g. FP8 instead of FP16) roughly halves its memory footprint
  • Sliding-window attention bounds cache growth but sacrifices long-range context
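
A back-of-the-envelope check on those numbers, reusing the size formula from above with hypothetical 7B-class, full multi-head-attention dimensions (grouped-query models store proportionally less):

  def kv_cache_gb(num_layers, num_kv_heads, seq_len, head_dim, bytes_per_elem):
      # Same formula as above, for a single sequence (batch_size = 1).
      return 2 * num_layers * num_kv_heads * seq_len * head_dim * bytes_per_elem / 1e9

  # Hypothetical 7B-class model with full multi-head attention:
  # 32 layers, 32 KV heads, head_dim 128, at a 100K-token context.
  print(f"FP16: {kv_cache_gb(32, 32, 100_000, 128, 2):.1f} GB")  # ~52 GB for one request
  print(f"FP8:  {kv_cache_gb(32, 32, 100_000, 128, 1):.1f} GB")  # ~26 GB, half the footprint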

Much of modern LLM serving research focuses on KV cache optimization, since cache memory is often the bottleneck for scaling to many concurrent users.

Source

KV cache memory management is central to efficient LLM serving

Kwon et al., 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). https://arxiv.org/abs/2309.06180