Quantization

Definition

Quantization reduces model memory footprint and increases inference speed by representing weights in lower-precision formats (INT8, INT4) instead of 16-bit or 32-bit floating point, typically with minimal quality loss.

Why It Matters

Quantization is why you can run large language models on consumer hardware. A 7B parameter model in 16-bit precision requires 14GB of GPU memory, more than most gaming GPUs have. The same model quantized to 4-bit needs only 3.5GB, fitting comfortably on an RTX 3060.
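
The arithmetic behind these numbers is simple: parameter count times bytes per parameter. A minimal Python sketch, ignoring activations, the KV cache, and the small overhead of quantization scales:

```python
# Back-of-the-envelope memory math from the paragraph above:
# bytes per parameter times parameter count.
params = 7e9
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name}: {gigabytes:.1f} GB")
# FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```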

Beyond memory savings, quantization speeds up inference. LLM inference is memory-bandwidth bound. The bottleneck is reading weights from GPU memory. Smaller weights mean less data to transfer, directly improving tokens-per-second throughput.

For AI engineers, quantization is a standard deployment optimization. Most production systems use some form of quantization, whether INT8 for API serving or INT4 for edge devices. Understanding the quality-speed tradeoffs helps you make informed decisions about which quantization level to use.

Implementation Basics

Quantization maps high-precision values to lower-precision representations:

1. Weight Quantization - Model weights (the learned parameters) are stored in reduced precision. This is the most common form; it reduces model size and loading time. Weights are quantized once after training (see the sketch after this list).

2. Activation Quantization - Intermediate activations during forward passes are also quantized. This is more complex because activations vary by input, requiring calibration data. It provides additional speedup but is harder to implement.

3. Common Formats

  • INT8 - 8-bit integers. Industry standard, widely supported. Minimal quality loss for most models. ~2x memory reduction.
  • INT4/NF4 - 4-bit formats. Aggressive compression. Quality depends on method. ~4x memory reduction.
  • GPTQ - Post-training quantization method optimized for LLMs. Popular for open-source models.
  • AWQ - Activation-aware quantization. Preserves important weights at higher precision.
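
To make the mapping concrete, here is a minimal sketch of symmetric (absmax) INT8 weight quantization in NumPy. It uses a single scale per tensor for clarity; real libraries quantize per-channel or per-group and handle outliers separately:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric absmax quantization: one FP32 scale for the whole tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max absolute error:", np.abs(w - w_hat).max())  # small relative to the weight range
```

The key design choice is the granularity of the scale: per-channel or per-group scales track the weight distribution more closely than a single per-tensor scale, at the cost of storing more scale values.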

Practical considerations:

Outliers - A small fraction of weights and activations are much larger than the rest and quantize poorly. Modern methods (LLM.int8(), AWQ) handle these outliers specially, keeping them at higher precision.
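
A toy sketch of the idea behind mixed-precision decomposition, loosely in the spirit of LLM.int8(): feature columns whose magnitude exceeds a threshold are pulled out and kept in FP16 while the rest are quantized. The threshold value and the column-wise split are illustrative simplifications, not the paper's exact scheme:

```python
import numpy as np

def split_outlier_features(x: np.ndarray, threshold: float = 6.0):
    """Separate feature columns whose magnitude exceeds the threshold."""
    outlier_cols = np.abs(x).max(axis=0) > threshold
    return x[:, ~outlier_cols], x[:, outlier_cols], outlier_cols

x = np.random.randn(8, 4096).astype(np.float32)
x[:, 10] *= 20.0                                  # inject one outlier feature
regular, outliers, mask = split_outlier_features(x)
print("columns kept in FP16:", int(mask.sum()))   # quantize the rest to INT8
```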

Calibration - Quantization parameters are computed using sample data. Quality depends on calibration data matching your use case. Use representative prompts.
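
A minimal sketch of what calibration computes: run representative inputs, record activation magnitudes, and derive a scale from a high percentile so rare outliers are clipped rather than stretching the quantization range. The percentile value here is an illustrative assumption; real toolkits expose it as a tunable:

```python
import numpy as np

def calibrate_activation_scale(activation_batches, percentile=99.9):
    """Derive a symmetric INT8 scale from observed activation magnitudes."""
    magnitudes = np.concatenate([np.abs(a).ravel() for a in activation_batches])
    clip_value = np.percentile(magnitudes, percentile)  # clip rare outliers
    return clip_value / 127.0

# Stand-in for activations captured while running representative prompts.
calibration_batches = [np.random.randn(32, 4096).astype(np.float32) for _ in range(16)]
scale = calibrate_activation_scale(calibration_batches)
print(f"activation scale: {scale:.5f}")
```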

Hardware Support - INT8 has broad hardware support. INT4 requires specific GPU architectures or specialized libraries. Check your target hardware before choosing a format.

Combining with Other Optimizations - Quantization stacks with other techniques. Quantized models still benefit from KV caching, batching, and speculative decoding. vLLM, TGI, and Ollama handle this automatically.

The quality-speed tradeoff is usually favorable. INT8 quantization typically loses less than 1% quality while halving memory. INT4 might lose 1-3% quality but quarters memory. For most applications, the tradeoff is worth it.
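
In practice, most engineers consume quantization through a library rather than implementing it. A hedged example of loading a model in 4-bit NF4 via Hugging Face transformers with bitsandbytes; the model id is a placeholder and option names may vary across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # ~4x smaller than FP16
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # matrix multiplies run in bf16
)

model_id = "meta-llama/Llama-2-7b-hf"       # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```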

Source

LLM.int8() demonstrates that 8-bit matrix multiplication can be performed with no degradation in performance for models up to 175B parameters using vector-wise quantization.

https://arxiv.org/abs/2208.07339