
Prompt Caching

Definition

Prompt caching stores the computed state (KV cache) from static prompt portions across requests, reducing latency and costs when multiple requests share common prefixes like system prompts or context documents.

Why It Matters

Many LLM applications send the same context repeatedly. A chatbot includes the same system prompt with every message. A RAG system includes the same retrieved documents for follow-up questions. Without prompt caching, you pay to process that identical content every single time.

The key insight: the expensive part of LLM inference is computing the key-value cache for your prompt. If that prompt doesn’t change, why recompute it? Cache the KV state once, then reuse it for subsequent requests with the same prefix.

For AI engineers, prompt caching directly impacts costs and latency. If your system prompt is 2,000 tokens and you make 1,000 requests per hour, caching eliminates 2 million tokens of redundant processing. That’s real money saved and faster responses delivered.

How It Works

Prompt caching operates at the API or serving layer:

1. Prefix Identification: The system identifies the static portions of your prompt, the content that stays identical across requests. This is typically marked explicitly or detected automatically.

2. Cache on First Request: When processing the first request with a given prefix, the system computes the KV cache normally and stores the resulting state, keyed by a hash of the prefix content.

3. Cache Hit on Subsequent Requests: For later requests with the same prefix, the system skips computation for the cached portion, loads the stored KV cache, and continues generation from the end of the prefix.

4. Cache Management: Caches expire after a time-to-live (TTL) or are evicted under memory pressure. Providers differ in cache lifetimes and pricing.
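
In pseudocode, the flow above amounts to a keyed lookup. This is a minimal sketch assuming a hypothetical compute_kv_cache function and a plain in-memory dictionary, not any provider's real implementation:

```python
import hashlib
import time

CACHE_TTL_SECONDS = 300   # assumed cache lifetime; real providers set their own TTLs
kv_cache_store = {}       # prefix hash -> (kv_state, timestamp)

def get_or_compute_prefix_state(prefix_tokens, compute_kv_cache):
    """Return the KV state for a prompt prefix, reusing a cached copy when possible.

    `compute_kv_cache` stands in for the expensive forward pass over the prefix.
    """
    key = hashlib.sha256(" ".join(map(str, prefix_tokens)).encode()).hexdigest()
    entry = kv_cache_store.get(key)

    # Cache hit: reuse the stored KV state if it has not expired.
    if entry is not None and time.time() - entry[1] < CACHE_TTL_SECONDS:
        return entry[0]

    # Cache miss (or expired entry): compute the KV state once and store it.
    kv_state = compute_kv_cache(prefix_tokens)
    kv_cache_store[key] = (kv_state, time.time())
    return kv_state
```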

Implementation Basics

Using prompt caching effectively:

Provider Support: Anthropic Claude, OpenAI, and Google Gemini all offer prompt caching, each with its own API. Check your provider's documentation for the specific syntax.

Prompt Structure: Design prompts with stable prefixes. Put your system prompt and any static context at the beginning; variable content (user messages, dynamic context) comes after.

Cache Markers: Some APIs require explicit cache breakpoints to mark where caching should apply. Others cache automatically based on prefix matching.
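
For example, Anthropic's Messages API takes an explicit cache_control breakpoint on the static system block, while the variable user turn stays outside the cached prefix. The model ID and prompt text below are placeholders; check the provider docs for current syntax and minimum cacheable prompt lengths:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."      # placeholder: a long, stable system prompt or context document

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # placeholder model ID; use a current one
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Breakpoint: content up to and including this block is eligible for caching.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Variable content goes after the cached prefix.
    messages=[{"role": "user", "content": "Summarize the key points."}],
)

# The usage object reports how many tokens were written to and read from the cache.
print(response.usage)
```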

Cost Calculation: Cached tokens are typically billed at a reduced rate, often 10-25% of the normal input token price depending on the provider, and some providers charge extra for the initial cache write. Factor both into your cost estimates.
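
As a back-of-the-envelope illustration, using made-up prices and a 10% cached-read rate rather than any provider's actual pricing:

```python
# Hypothetical prices and traffic; substitute your provider's real rates.
INPUT_PRICE_PER_MTOK = 3.00      # $ per million input tokens (assumed)
CACHED_READ_MULTIPLIER = 0.10    # cached tokens billed at 10% of the input price (assumed)

system_prompt_tokens = 2_000
requests_per_hour = 1_000

# Ignores the first uncached request and any cache-write surcharge.
uncached = system_prompt_tokens * requests_per_hour / 1e6 * INPUT_PRICE_PER_MTOK
cached = uncached * CACHED_READ_MULTIPLIER

print(f"hourly prefix cost without caching: ${uncached:.2f}")
print(f"hourly prefix cost with caching:    ${cached:.2f}")
```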

Self-Hosted Caching: For self-hosted models, serving frameworks like vLLM support automatic prefix caching. Multiple requests with shared prefixes share KV cache pages in memory.
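
With vLLM, automatic prefix caching is a constructor flag (recent versions may enable it by default; check the vLLM docs). The model name and prompts below are only examples:

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching lets requests that share a prompt prefix reuse KV cache blocks.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared_prefix = "You are a support assistant for ExampleCo. Policies: ..."  # stable context
questions = ["How do I reset my password?", "What is the refund window?"]

params = SamplingParams(max_tokens=128)
outputs = llm.generate([shared_prefix + "\n\nUser: " + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text)
```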

The main trade-off is memory usage versus latency. Caching more prefixes uses more GPU memory but speeds up more requests. Find the balance that works for your traffic patterns.
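
To estimate the memory side of that trade-off, a rough rule of thumb is 2 (keys and values) × layers × KV heads × head dimension × bytes per element, per cached token. The dimensions below are illustrative rather than tied to a specific model:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values; bytes_per_elem=2 assumes fp16/bf16 storage.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Illustrative dimensions for a mid-sized model with grouped-query attention.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
prefix_tokens = 2_000
print(f"{per_token * prefix_tokens / 1e6:.1f} MB of KV cache per 2,000-token cached prefix")
```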

Source

Anthropic's prompt caching can reduce costs by up to 90% and latency by up to 85% for prompts with reusable prefixes.

https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching