LLM Inference

Definition

Inference is the process of running a trained AI model to generate predictions or outputs from new inputs, as opposed to training, which updates the model's weights. For LLMs, inference means generating text responses from prompts.

Why It Matters

Inference is where AI delivers value. Training happens once (or periodically), but inference runs for every user request. The speed, cost, and reliability of inference directly impact user experience and operational costs. Optimizing inference is often more impactful than improving model accuracy.

For LLMs, inference has unique characteristics. Unlike image models, which process an entire input in one forward pass, LLMs generate tokens one at a time (autoregressive generation). Each new token requires reading all model weights from memory, making inference memory-bandwidth bound rather than compute-bound.

Understanding inference economics is essential for AI engineers. A model that’s 10% more accurate but 3x slower often loses to the faster option in production. You’ll spend more time optimizing inference than training in most production systems.

Implementation Basics

LLM inference has two distinct phases with different performance characteristics:

1. Prefill (Prompt Processing): The prompt is processed in parallel, computing attention across all input tokens simultaneously. This phase is compute-bound and benefits from batching. Longer prompts take proportionally longer to prefill.

2. Decode (Token Generation): New tokens are generated one at a time, each requiring a full forward pass through the model. This phase is memory-bandwidth bound: the bottleneck is reading model weights from GPU memory, not arithmetic. Longer outputs mean more decode iterations.
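
Both phases can be seen in a plain generation loop. The sketch below is a minimal illustration using Hugging Face transformers with gpt2 (the model choice and greedy decoding are assumptions, not part of this entry): one parallel forward pass over the prompt builds the KV cache (prefill), then each token is produced by a single-token forward pass that reuses it (decode).

```python
# Minimal sketch of the two inference phases (gpt2 used purely for illustration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: all prompt tokens processed in one parallel forward pass,
    # building the KV cache as a side effect.
    out = model(prompt_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    # Decode: one token per forward pass, reusing the cached keys/values,
    # so each step's cost is dominated by reading the weights.
    for _ in range(20):
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```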

Key optimization strategies:

Batching - Process multiple requests together to amortize the cost of reading model weights. Static batching waits until a full batch has accumulated; continuous batching adds and removes requests dynamically as they arrive and finish.
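
A static batch is easy to sketch with the same library; the model name and prompts below are illustrative. Continuous batching, by contrast, is scheduler logic that lives inside serving frameworks rather than in a few lines of user code.

```python
# Static batching sketch: one batched generate() call shares each weight read
# across all sequences in the batch (gpt2 used purely for illustration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompts = ["The capital of France is", "Large language models are"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

for text in tokenizer.batch_decode(output_ids, skip_special_tokens=True):
    print(text)
```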

KV Cache - Store the key and value tensors computed by the attention layers so earlier tokens are not re-processed at every decoding step. Critical for performance, but the cache grows with sequence length and batch size and consumes significant GPU memory.
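
The memory cost is easy to estimate. The sketch below uses shape numbers matching a Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128, fp16); the batch size and context length are arbitrary assumptions.

```python
# Back-of-the-envelope KV cache size for a Llama-2-7B-shaped model.
num_layers = 32        # transformer blocks
num_kv_heads = 32      # key/value heads (no grouped-query attention here)
head_dim = 128         # per-head dimension
bytes_per_value = 2    # fp16 / bf16

# One key tensor and one value tensor per layer, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # 512 KiB

context_len, batch_size = 4096, 8
total_bytes = kv_bytes_per_token * context_len * batch_size
print(f"Batch of {batch_size} at {context_len} tokens: {total_bytes / 2**30:.0f} GiB")  # 16 GiB
```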

Quantization - Store model weights in lower precision (INT8, INT4) to reduce memory-bandwidth requirements. Usually a 1.5-2x speedup with minimal quality loss.
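
One common way to apply weight quantization at load time is via bitsandbytes through the transformers API. A sketch follows; the checkpoint name is a placeholder, and bitsandbytes plus a CUDA GPU are assumed to be available.

```python
# Sketch: loading a causal LM with INT8 weights via bitsandbytes
# (checkpoint name is illustrative; requires bitsandbytes and a CUDA GPU).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # INT8 weights, higher-precision activations

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # placeholder checkpoint
    quantization_config=quant_config,
    device_map="auto",                 # place the quantized weights on the GPU
)
```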

Speculative Decoding - Use a smaller “draft” model to propose several tokens ahead, then verify them with the main model in a single parallel forward pass. Can significantly speed up generation.
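
The sketch below is a deliberately simplified greedy variant (not the full rejection-sampling algorithm from the literature), using two GPT-2 sizes as stand-ins for a draft/target pair: the draft proposes k tokens, the target scores all of them in one forward pass, and the longest matching prefix plus one target-predicted token is accepted.

```python
# Simplified greedy speculative decoding (model names and k are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()          # small, fast
target = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()  # larger, slower

@torch.no_grad()
def speculative_step(input_ids, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft_ids = draft.generate(input_ids, max_new_tokens=k, do_sample=False,
                               pad_token_id=tok.eos_token_id)
    proposed = draft_ids[:, input_ids.shape[1]:]

    # 2. Target model scores prompt + proposal in ONE parallel forward pass.
    logits = target(draft_ids).logits
    target_pred = logits[:, input_ids.shape[1] - 1:-1, :].argmax(dim=-1)

    # 3. Accept the longest prefix where draft and target agree.
    matches = (proposed == target_pred)[0].long()
    n_accept = int(matches.cumprod(dim=0).sum())
    accepted = proposed[:, :n_accept]

    # Always gain at least one token: the target's own next prediction.
    bonus = logits[:, input_ids.shape[1] - 1 + n_accept, :].argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, accepted, bonus], dim=-1)

ids = tok("The quick brown fox", return_tensors="pt").input_ids
for _ in range(5):
    ids = speculative_step(ids)
print(tok.decode(ids[0]))
```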

Inference serving frameworks like vLLM, TGI, and TensorRT-LLM implement these optimizations automatically. Most production systems use these rather than raw model inference.
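
As an example of relying on a serving framework rather than rolling your own loop, a minimal vLLM offline-batching sketch is shown below; the model name and sampling settings are placeholders.

```python
# Offline batched generation with vLLM, which applies continuous batching,
# paged KV caching, and related optimizations under the hood.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")   # placeholder checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain KV caching in one sentence.",
    "Why is LLM decoding memory-bandwidth bound?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```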

Source

LLM inference is memory-bandwidth bound, and autoregressive generation dominates compute time, with each token requiring a full forward pass through the model.

https://arxiv.org/abs/2211.05102