LoRA (Low-Rank Adaptation)
Definition
LoRA is a parameter-efficient fine-tuning technique that trains small low-rank adapter matrices instead of updating the full model weights, shrinking the number of trainable parameters by orders of magnitude and substantially reducing GPU memory requirements while achieving performance comparable to full fine-tuning.
Why It Matters
LoRA made fine-tuning accessible to engineers without massive GPU clusters. Before LoRA, fine-tuning a 7B parameter model required multiple A100 GPUs. Now you can fine-tune on a single consumer GPU, and the resulting adapter weights are measured in megabytes rather than gigabytes.
The key insight: most of the model’s knowledge lives in the pre-trained weights. You don’t need to modify all of them; you only need to learn a small set of adjustments. LoRA injects trainable rank-decomposition matrices into the model’s attention layers, leaving the original weights frozen.
For AI engineers, LoRA is the default choice for customization. You can train multiple LoRA adapters for different use cases, swap them at inference time, and even combine them. This enables personalization at scale: one base model with many specialized adapters.
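As a rough illustration of that adapter-swapping workflow, here is a minimal sketch using Hugging Face’s PEFT library; the adapter paths and names are placeholders, and the exact setup depends on your model and PEFT version.

```python
from peft import PeftModel

# base_model is a loaded transformers model; the adapter paths below are hypothetical.
model = PeftModel.from_pretrained(base_model, "adapters/support-bot", adapter_name="support")
model.load_adapter("adapters/code-review", adapter_name="code_review")

model.set_adapter("support")      # route requests through the support adapter
model.set_adapter("code_review")  # ...or switch adapters without reloading the base model
```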
Implementation Basics
LoRA works by adding small trainable matrices to the model’s transformer layers:
1. Matrix Decomposition Instead of training a full weight update matrix W (size d×d), LoRA trains two much smaller matrices, B (size d×r) and A (size r×d), where the rank r is much smaller than d. The update becomes ΔW = BA, dramatically reducing trainable parameters (see the sketch after this list).
2. Target Layers LoRA typically targets the attention projection layers (Q, K, V, O matrices). The rank r controls capacity, with higher ranks capturing more complexity but requiring more memory. Start with r=8 or r=16 for most tasks.
3. Training Process The base model weights stay frozen while only the LoRA matrices train. This means gradient computation is much cheaper, and you can use larger batch sizes. Training completes faster with less memory than traditional fine-tuning.
4. Inference At inference time, the LoRA weights can be merged into the base model (no latency overhead) or kept separate (for easy swapping). Libraries such as Hugging Face’s PEFT make LoRA implementation straightforward.
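To make step 1 concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. This is an illustration of the decomposition, not the PEFT implementation: the base weight stays frozen, and only the two low-rank matrices train.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # original weights stay frozen
        d_in, d_out = base.in_features, base.out_features
        # A (r x d_in) starts random, B (d_out x r) starts at zero, so the update BA begins as zero.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r                           # common LoRA scaling convention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = frozen projection + scaled low-rank update (BA)x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Parameter math for d = 4096, r = 8: a full update would hold 4096 * 4096 ≈ 16.8M values,
# while A and B together hold 2 * 4096 * 8 = 65,536, roughly 250x fewer.
```

In practice, a wrapper like this is applied only to the attention projection layers described in step 2.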
Most practitioners start with default hyperparameters (r=8, alpha=16, dropout=0.05) and adjust based on results. If your adapter underfits, increase rank. If it overfits, add dropout or reduce epochs.
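In PEFT, those defaults map directly onto a LoraConfig. A minimal sketch, assuming a Llama-style causal language model (the model name and target module names are illustrative and vary by architecture):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Model name and target module names are illustrative; adjust for your architecture.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                     # rank of the update matrices
    lora_alpha=16,           # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()   # typically well under 1% of the base model

# After training, either keep the adapter separate for swapping,
# or fold it into the base weights for zero inference overhead:
# merged = model.merge_and_unload()
```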
Source
LoRA reduces trainable parameters by 10,000x and GPU memory requirement by 3x compared to full fine-tuning while matching or exceeding model quality.
https://arxiv.org/abs/2106.09685