
Model Optimization

Definition

Model optimization encompasses techniques to make AI models faster, smaller, and more efficient for deployment, including quantization, pruning, distillation, and compilation.

Why It Matters

Raw model performance rarely meets production requirements. A research model achieving state-of-the-art accuracy might be too slow for real-time inference, too large to fit in available memory, or too expensive to run at scale. Model optimization bridges the gap between research quality and production constraints.

For AI engineers, optimization is often the difference between a successful deployment and an abandoned project. A 7B parameter model needs roughly 14GB of VRAM in FP16, but quantized to INT4 its weights shrink to around 4GB, so it runs on a consumer GPU with 6GB of VRAM. That accessibility gain is pure optimization work.
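
As a concrete sketch of that workflow, the snippet below loads a 7B causal LM in 4-bit with Hugging Face Transformers and bitsandbytes. The model ID, NF4 quantization type, and FP16 compute dtype are illustrative assumptions, not recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative model id; any ~7B causal LM on the Hub works the same way.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# NF4 weights with FP16 compute: weights drop from ~14GB (FP16) to roughly 4GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s)/CPU
)
```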

The optimization tradeoff is typically accuracy versus efficiency. Aggressive optimization reduces model size and latency but may degrade output quality. Understanding these tradeoffs helps you choose appropriate optimization levels for your specific requirements.

Implementation Basics

Core optimization techniques:

Quantization reduces numerical precision (FP32 → FP16 → INT8 → INT4). Each step roughly halves memory and often improves speed, with varying accuracy impact depending on the model.
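
For a feel of the API surface, here is a minimal sketch of post-training dynamic quantization in PyTorch, which stores Linear weights as INT8 and quantizes activations on the fly; the toy model is a placeholder for illustration.

```python
import torch
import torch.nn as nn

# Toy FP32 model standing in for a real network.
model_fp32 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: weights stored as INT8,
# activations quantized at inference time.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(model_int8(x).shape)  # same interface, smaller weights
```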

Pruning removes less important weights from the network. Structured pruning removes entire neurons, channels, or layers; unstructured pruning zeros individual weights. How much sparsity a model tolerates varies by architecture.
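
The sketch below applies unstructured L1-magnitude pruning to a single Linear layer with torch.nn.utils.prune; the 30% sparsity level is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)

# Zero out the 30% of weights with smallest L1 magnitude (unstructured pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent by removing the reparametrization mask.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.1%}")
```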

Distillation trains a smaller “student” model to mimic a larger “teacher” model. The student learns from the teacher’s outputs rather than only the original training labels, and often reaches better accuracy at a given size than training the small model from scratch.
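
A common way to implement this is to blend the usual hard-label loss with a temperature-softened KL divergence between teacher and student logits. The sketch below shows one such loss under that assumption; the temperature and weighting values are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-label KL at temperature T."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients after temperature softening
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: 4 examples, 10 classes. In practice the teacher runs in no-grad mode.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```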

Compilation (TensorRT, ONNX Runtime) converts models to optimized formats for specific hardware. Compilation fuses operations, optimizes memory access, and generates hardware-specific code.
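
As one example of the compilation path, the sketch below exports a small PyTorch model to ONNX and runs it with ONNX Runtime; the file name, shapes, and CPU execution provider are illustrative placeholders.

```python
import torch
import torch.nn as nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
dummy = torch.randn(1, 128)

# Export to ONNX; the graph can then be optimized for the target hardware.
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},
)

# ONNX Runtime fuses operations and picks an execution provider (CPU here).
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": dummy.numpy()})
print(outputs[0].shape)
```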

Optimization workflow:

  1. Measure baseline performance (latency, throughput, memory); see the timing sketch after this list
  2. Identify bottlenecks (compute-bound vs. memory-bound)
  3. Apply appropriate techniques based on constraints
  4. Validate accuracy on representative test data
  5. Re-measure and iterate
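
For step 1, a minimal latency baseline can be taken with wall-clock timing after a few warm-up runs; the model and batch below are placeholders, assumed only for illustration.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, example_input, warmup=10, iters=100):
    """Return mean per-batch latency in milliseconds."""
    for _ in range(warmup):          # warm-up: autotuning and cache effects
        model(example_input)
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(example_input)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000

# Placeholder model and input; substitute the real ones.
model = torch.nn.Linear(512, 512).eval()
x = torch.randn(32, 512)
print(f"{measure_latency(model, x):.2f} ms per batch of 32")
```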

Tool recommendations:

  • Quantization: bitsandbytes, GPTQ, AWQ for LLMs
  • Distillation: Hugging Face Trainer with distillation support
  • Compilation: ONNX Runtime, TensorRT, vLLM
  • Profiling: PyTorch Profiler, NVIDIA Nsight
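
For the profiling entry, a minimal PyTorch Profiler sketch looks like this; the placeholder model and input are assumptions. The printed operator table helps separate compute-heavy operations from memory-bound ones.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).eval()  # placeholder model
x = torch.randn(64, 1024)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
    with torch.no_grad():
        model(x)

# Top operators by self CPU time; sort by "cuda_time_total" when running on GPU.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```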

Start with quantization since it’s often the highest-impact, lowest-effort optimization. Add other techniques when quantization alone doesn’t meet requirements.

Source

Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision, enabling significant speedup and memory reduction.

https://pytorch.org/docs/stable/quantization.html