Architecture

KV Cache

Definition

Key-Value cache storing computed attention states during LLM generation, enabling efficient autoregressive decoding by avoiding redundant computation of previous tokens.

The KV (Key-Value) cache stores the attention key and value tensors computed during LLM inference, allowing subsequent token generation to reuse past computations rather than recalculating them.

Why It Matters

Without KV caching, every decoding step would recompute the keys and values for all previous tokens, so generating N tokens repeats work that has already been done, and the redundant computation grows quadratically with sequence length. Caching those tensors means each step computes keys and values only for the newest token (a minimal sketch follows the list below):

  • Massive speedup: Avoids recomputing attention for all previous tokens
  • Memory trade-off: Stores large tensors in GPU memory
  • Batch impact: Memory usage scales with batch size × sequence length
  • Context limits: Maximum context often bound by KV cache memory
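
A minimal sketch of the idea in PyTorch (single attention head, toy dimensions, purely illustrative rather than any particular model's implementation): each step computes a key and value only for the newest token, appends them to the cache, and attends over the whole cached prefix.

  import torch

  torch.manual_seed(0)
  d_model, head_dim = 64, 64
  W_q = torch.randn(d_model, head_dim) / d_model**0.5
  W_k = torch.randn(d_model, head_dim) / d_model**0.5
  W_v = torch.randn(d_model, head_dim) / d_model**0.5

  k_cache, v_cache = [], []                      # grows by one entry per generated token

  def decode_step(x_t):
      """One decoding step for a single new token embedding x_t (shape [d_model])."""
      q = x_t @ W_q                              # query for the new token only
      k_cache.append(x_t @ W_k)                  # K and V are computed once, then reused
      v_cache.append(x_t @ W_v)
      K = torch.stack(k_cache)                   # [t, head_dim]
      V = torch.stack(v_cache)                   # [t, head_dim]
      scores = (K @ q) / head_dim**0.5           # attention over the cached prefix
      weights = torch.softmax(scores, dim=0)
      return weights @ V                         # context vector for the new token

  # Each step costs work proportional to the current length, not its square,
  # because previous keys and values are never recomputed.
  for t in range(5):
      out = decode_step(torch.randn(d_model))
      print(t, out.shape)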

For AI engineers, KV cache management is critical for:

  • Estimating GPU memory requirements
  • Understanding inference latency behavior
  • Optimizing concurrent request handling

Implementation Basics

KV cache characteristics:

  1. Size calculation: 2 (K and V) × num_layers × num_kv_heads × sequence_length × head_dim × batch_size × bytes_per_element (worked example after this list)
  2. Memory growth: linear in sequence length
  3. Sharing: requests with a common prefix (e.g. the same system prompt) can share cached KV values
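
As a rough worked example of the size formula above, using hypothetical 7B-class dimensions (32 layers, 32 KV heads, head_dim 128, FP16); models with grouped-query attention keep fewer KV heads than query heads and therefore need proportionally less:

  def kv_cache_bytes(num_layers, num_kv_heads, seq_len, head_dim,
                     batch_size=1, bytes_per_elem=2):
      # The leading 2 accounts for storing both K and V; bytes_per_elem=2 assumes FP16/BF16.
      return 2 * num_layers * num_kv_heads * seq_len * head_dim * batch_size * bytes_per_elem

  # Hypothetical 7B-class model: 32 layers, 32 KV heads, head_dim 128, FP16.
  per_seq_gb = kv_cache_bytes(32, 32, 4096, 128) / 1e9
  print(f"4K-token context: ~{per_seq_gb:.1f} GB per sequence")    # ~2.1 GB
  print(f"Batch of 16:      ~{per_seq_gb * 16:.0f} GB")            # grows linearly with batch size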

Memory management strategies:

Static allocation:

  • Pre-allocate maximum context length
  • Simple but wasteful for variable-length sequences
  • Reportedly 60-80% of KV cache memory wasted in practice (per the vLLM paper)

PagedAttention (vLLM):

  • Allocate KV cache in non-contiguous blocks
  • Near-zero memory waste
  • Enables efficient memory sharing across sequences (simplified sketch below)
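
A highly simplified sketch of the paged-allocation idea (toy Python, not vLLM's actual implementation): each sequence's cache is a block table pointing at fixed-size physical blocks drawn from a shared pool, so memory is committed in small increments and blocks can be freed independently.

  BLOCK_TOKENS = 16                      # tokens stored per KV block

  class BlockPool:
      def __init__(self, num_blocks):
          self.free = list(range(num_blocks))
      def alloc(self):
          return self.free.pop()         # raises if the pool is exhausted
      def release(self, block_id):
          self.free.append(block_id)

  class SequenceCache:
      """Maps a sequence's logical token positions onto non-contiguous physical blocks."""
      def __init__(self, pool):
          self.pool = pool
          self.block_table = []          # logical block index -> physical block id
          self.num_tokens = 0
      def append_token(self):
          if self.num_tokens % BLOCK_TOKENS == 0:   # current block is full: grab one more
              self.block_table.append(self.pool.alloc())
          self.num_tokens += 1
      def free(self):
          for b in self.block_table:
              self.pool.release(b)
          self.block_table.clear()

  pool = BlockPool(num_blocks=1024)
  seq = SequenceCache(pool)
  for _ in range(40):                    # 40 tokens commit only ceil(40/16) = 3 blocks
      seq.append_token()
  print(seq.block_table)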

Prompt caching:

  • Cache KV values for common prefixes
  • System prompts cached across requests
  • Anthropic and OpenAI offer prompt caching as an API feature (conceptual sketch below)
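
Conceptually, the serving layer can key cached prefill results by a hash of the token prefix, so requests that start with the same system prompt skip recomputing it. A toy sketch of that lookup (hypothetical helper names, not any provider's real API):

  import hashlib

  prefix_cache = {}                                # prefix hash -> precomputed KV state

  def prefix_key(token_ids):
      return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

  def get_or_compute_prefix(token_ids, compute_kv):
      key = prefix_key(token_ids)
      if key not in prefix_cache:                  # first request pays the full prefill cost
          prefix_cache[key] = compute_kv(token_ids)
      return prefix_cache[key]                     # later requests with the same prefix reuse it

  system_prompt_ids = list(range(200))             # stand-in for a tokenized system prompt
  kv = get_or_compute_prefix(system_prompt_ids,
                             compute_kv=lambda ids: f"kv-for-{len(ids)}-tokens")  # placeholder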

Practical considerations:

  • Long contexts (100K+ tokens) can require 50GB+ of KV cache for a single request (see the arithmetic below)
  • Quantizing the KV cache (e.g. FP8 instead of FP16) roughly halves its memory footprint
  • Sliding-window attention bounds cache growth but sacrifices long-range context
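
A back-of-the-envelope check on those numbers, reusing the size formula from above with hypothetical 7B-class, full multi-head-attention dimensions (grouped-query models store proportionally less):

  def kv_cache_gb(num_layers, num_kv_heads, seq_len, head_dim, bytes_per_elem):
      # Same formula as above, for a single sequence (batch_size = 1).
      return 2 * num_layers * num_kv_heads * seq_len * head_dim * bytes_per_elem / 1e9

  # Hypothetical 7B-class model with full multi-head attention:
  # 32 layers, 32 KV heads, head_dim 128, at a 100K-token context.
  print(f"FP16: {kv_cache_gb(32, 32, 100_000, 128, 2):.1f} GB")  # ~52 GB for one request
  print(f"FP8:  {kv_cache_gb(32, 32, 100_000, 128, 1):.1f} GB")  # ~26 GB, half the footprint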

Much of modern LLM serving research focuses on KV cache optimization, since cache memory is often the bottleneck for scaling to many concurrent users.

Source

KV cache memory management is central to efficient LLM serving

Kwon et al., 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). https://arxiv.org/abs/2309.06180