Model Distillation
Definition
Model distillation is a technique that trains a smaller 'student' model to mimic the outputs of a larger 'teacher' model, creating more efficient models that retain much of the original's capability at a fraction of the size and cost.
Why It Matters
Distillation lets you deploy powerful AI at a fraction of the cost. Running GPT-4 for every request is expensive; running a distilled model that captures most of that capability for your specific use case is dramatically cheaper. Many production systems use distilled models for this reason.
The key insight: large models have far more capacity than most specific tasks require. A 70B-parameter model can handle questions across every domain, but your customer support bot only needs to handle a narrow one. Distillation transfers the relevant knowledge to a smaller, faster, cheaper model.
For AI engineers, distillation is a practical optimization technique. When you’ve validated a solution with a frontier model but can’t afford to scale it, distillation offers a path to production. It’s also how many smaller open-source models achieve their capabilities, since they’re distilled from larger proprietary models.
Implementation Basics
Distillation involves training a student model on the outputs of a teacher:
1. Generate Training Data: Run the teacher model on a large set of inputs (prompts). Collect both the outputs and, ideally, the probability distributions over tokens (logits). The teacher's outputs become training data for the student; see the data-collection sketch after this list.
2. Choose a Student Architecture: The student is typically much smaller than the teacher; a 7B student can distill from a 70B teacher. The student architecture doesn't need to match the teacher's, since you're transferring knowledge, not weights.
3. Define the Distillation Loss: Traditional supervised learning uses hard labels (the correct answer). Distillation uses soft labels: the teacher's probability distribution over all tokens. This transfers more information than hard labels alone, including the teacher's relative confidence across alternative tokens; see the loss sketch after this list.
4. Focus on a Specific Task: Distillation works best for focused domains. A student trained on general data will be mediocre at everything; a student trained on data from your specific use case can match the teacher on that domain.
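As a concrete illustration of step 1, here is a minimal data-collection sketch using the Hugging Face transformers library. The teacher checkpoint, the prompt file, and the output filename are placeholder assumptions, not a prescribed setup; an API-only teacher would work the same way, just without access to logits.

```python
# Sketch: collect (prompt, teacher_response) pairs from a teacher model.
# Model name, prompt file, and output path are illustrative assumptions.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "meta-llama/Llama-2-70b-chat-hf"  # assumed teacher; any large causal LM works
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

# Hypothetical file of domain-specific prompts, one per line.
prompts = [line.strip() for line in open("domain_prompts.txt") if line.strip()]

records = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(teacher.device)
    output_ids = teacher.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Keep only the newly generated tokens as the teacher's response.
    response = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    records.append({"prompt": prompt, "response": response})

with open("distillation_data.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```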
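For step 3, a common formulation of the soft-label loss follows Hinton et al. (2015): a KL-divergence term between temperature-softened teacher and student distributions, optionally mixed with ordinary cross-entropy on hard labels. This PyTorch sketch assumes the teacher and student logits are already aligned over the same vocabulary and positions; the temperature and mixing weight are illustrative defaults.

```python
# Sketch of a soft-label distillation loss (KL term + hard-label cross-entropy).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between student and teacher; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    kl = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2

    # Standard cross-entropy on the hard labels (e.g., next-token targets).
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1))

    # alpha balances imitation of the teacher against fitting the hard labels.
    return alpha * kl + (1 - alpha) * ce
```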
Practical approaches:
Response Distillation - Train on (prompt, teacher_response) pairs. Simple and effective, and it works with API-only teachers where logits aren't available. A fine-tuning sketch follows this list.
Logit Distillation - Train on probability distributions, not just final outputs. Captures more nuance but requires access to teacher logits.
Synthetic Data Generation - Use the teacher to generate diverse training examples. Scale up training data without manual labeling.
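Response distillation reduces to ordinary supervised fine-tuning: the student is trained with next-token cross-entropy on the teacher's responses. The sketch below assumes the distillation_data.jsonl file produced above and an illustrative 7B student checkpoint; neither name is required.

```python
# Sketch: response distillation as supervised fine-tuning of a small student.
import json
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "mistralai/Mistral-7B-v0.1"  # assumed student; any small causal LM works
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(student_name)
tokenizer.pad_token = tokenizer.eos_token
student = AutoModelForCausalLM.from_pretrained(student_name).to(device)

# Concatenate prompt and teacher response into one training sequence.
pairs = [json.loads(line) for line in open("distillation_data.jsonl")]
texts = [p["prompt"] + "\n" + p["response"] + tokenizer.eos_token for p in pairs]

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True,
                    truncation=True, max_length=1024)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(texts, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

student.train()
for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = student(**batch).loss  # causal LM cross-entropy on teacher responses
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

To distill logits instead of responses, the same loop would swap the built-in loss for the distillation_loss function sketched earlier, feeding it the stored teacher logits.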
The main limitation: you need significant compute and data to distill effectively. For many use cases, fine-tuning an existing small model or using prompt engineering achieves similar results with less effort. Consider distillation when you’ve exhausted simpler options and need maximum efficiency.
Source
Hinton, Vinyals, and Dean (2015), "Distilling the Knowledge in a Neural Network": knowledge distillation can compress the knowledge in an ensemble of models into a single smaller model by training on soft probability distributions rather than hard labels.
https://arxiv.org/abs/1503.02531