Prefill
Definition
The initial phase of LLM inference in which the model processes all input tokens in parallel to populate the key-value (KV) cache; its duration largely determines time-to-first-token latency.
Why It Matters
LLM inference has two distinct phases with different characteristics:
Prefill phase:
- Processes entire input prompt
- Compute-bound (parallelizable)
- Duration scales with input length
- Determines time-to-first-token
Decode phase:
- Generates output tokens one at a time
- Memory-bound (sequential)
- Uses cached KV values
- Determines tokens-per-second
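The split between the two phases shows up directly in a hand-rolled generation loop. The sketch below is a minimal illustration, assuming the Hugging Face transformers library and using "gpt2" purely as an example model: one parallel forward pass over the prompt builds the KV cache (prefill), then tokens are generated one at a time against that cache (decode).
```python
# Minimal prefill/decode sketch; "gpt2" is only an example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("Explain prefill vs decode:", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one parallel forward pass over the whole prompt.
    # Builds the KV cache for every prompt token; cost grows with prompt length.
    out = model(input_ids=prompt_ids, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # first generated token

    # Decode: one token per forward pass, reusing the cached K/V (greedy here).
    generated = [next_token]
    for _ in range(16):
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```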
Understanding prefill matters because:
- Long prompts increase initial latency significantly
- RAG systems with large contexts are prefill-heavy
- Optimization strategies differ between phases
Implementation Basics
Prefill optimization strategies:
- Prompt caching: Reuse prefill results for common prefixes (system prompts, few-shot examples); see the caching sketch after this list
- Chunked prefill: Split long prompts across iterations to interleave with decode
- Speculative decoding: Reduces decode latency; the prefill cost is unchanged
- FlashAttention: Avoids materializing the full attention matrix during prefill, reducing memory use and memory traffic; total compute (FLOPs) stays the same
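As a rough illustration of prompt caching, the sketch below prefills a shared system prompt once and reuses the resulting KV cache for each request. The model name, the helper function, and the deep-copy step are assumptions for this example; production servers (e.g. vLLM's automatic prefix caching) implement the same idea far more efficiently.
```python
# Prompt-caching sketch: prefill a shared system prompt once, reuse its KV cache.
# Assumes a Hugging Face causal LM; "gpt2" is an example model only.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

system_prompt = "You are a helpful assistant. Answer briefly.\n"
system_ids = tokenizer(system_prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Pay the system-prompt prefill cost once, up front.
    prefix_cache = model(input_ids=system_ids, use_cache=True).past_key_values

def prefill_request(user_text: str):
    """Prefill only the user-specific suffix, reusing the cached prefix."""
    user_ids = tokenizer(user_text, return_tensors="pt").input_ids
    cache = copy.deepcopy(prefix_cache)  # copy so the shared prefix cache is not mutated
    with torch.no_grad():
        out = model(input_ids=user_ids, past_key_values=cache, use_cache=True)
    return out.logits[:, -1, :], out.past_key_values

last_logits, kv_cache = prefill_request("What is prefill?")
```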
Measuring prefill performance (a timing sketch follows this list):
- TTFT (Time to First Token): Primary metric, includes prefill
- Prefill throughput: Tokens processed per second during prefill
- KV cache memory: Memory needed scales with sequence length
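A rough way to measure prefill cost on a local model is to time the prompt forward pass directly; in a deployed system, TTFT is normally measured client-side as the time until the first streamed token arrives. The model choice and prompt length below are arbitrary assumptions for the sketch.
```python
# Rough local measurement of prefill cost. The timed forward pass dominates
# TTFT for long prompts; queueing and sampling add to it in a real server.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # example model only
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "word " * 512                                  # deliberately long prompt
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    model(input_ids=ids[:, :8], use_cache=True)         # warm-up pass

    start = time.perf_counter()
    out = model(input_ids=ids, use_cache=True)          # prefill pass
    if torch.cuda.is_available():
        torch.cuda.synchronize()                        # wait for any GPU work
    prefill_s = time.perf_counter() - start

n_tokens = ids.shape[1]
print(f"prefill: {prefill_s * 1000:.1f} ms for {n_tokens} tokens "
      f"({n_tokens / prefill_s:.0f} tokens/s)")
```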
Practical implications:
- System prompts add to every request's prefill time
- Long context windows (100K+ tokens) can have multi-second prefill; see the KV-cache estimate after this list
- Batch similar prompt lengths to avoid prefill variance
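A back-of-envelope estimate makes the memory side of this concrete. The sketch below assumes illustrative Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dimension 128, fp16); models with grouped-query attention store proportionally fewer KV heads.
```python
# Back-of-envelope KV-cache sizing. The default shape (32 layers, 32 KV heads,
# head_dim 128, fp16) is an illustrative assumption, not a measured figure.
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2, batch: int = 1) -> int:
    # Factor of 2 accounts for storing both K and V in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch

for seq_len in (4_096, 32_768, 100_000):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> ~{gib:.1f} GiB of KV cache")
```
Under these assumptions that is roughly 0.5 MiB per token, so a 100K-token context needs on the order of 50 GiB of KV cache before decoding even starts.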
When optimizing LLM performance, distinguish between prefill-bound and decode-bound workloads. They require different optimization approaches.
Source
Pope et al., "Efficiently Scaling Transformer Inference" (the prefill phase processes the prompt to populate the KV cache)
https://arxiv.org/abs/2211.05102