Layer Normalization
Definition
A technique that normalizes activations across features within each layer, stabilizing training and enabling deeper neural networks without batch dependencies.
Why It Matters
Layer normalization is a critical but often overlooked component that makes deep Transformers possible. Without it, training models with dozens or hundreds of layers would be unstable, as gradients would explode or vanish.
Every modern LLM uses layer normalization. It’s applied multiple times per Transformer layer, typically before or after attention and feed-forward operations. This stabilization enables the extreme depth (100+ layers) of large language models.
For AI engineers, layer normalization is mostly handled by frameworks. But understanding it helps when debugging training issues, implementing custom architectures, or optimizing inference.
Implementation Basics
How It Works
- Compute mean and variance across features for each position
- Normalize to zero mean and unit variance
- Apply learned scale (γ) and shift (β) parameters
LayerNorm(x) = γ × (x - mean) / √(variance + ε) + β
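A minimal sketch of these steps in PyTorch (the module name and shapes are illustrative, not taken from any particular model):

```python
import torch
import torch.nn as nn

class SimpleLayerNorm(nn.Module):
    """Zero-mean / unit-variance normalization over the feature dimension,
    followed by a learned scale (gamma) and shift (beta)."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # learned scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # learned shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); statistics are per position, over features
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```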
Layer Norm vs. Batch Norm
- Batch Norm: normalizes across the batch dimension
  - Needs batch statistics (problematic for inference)
  - Batch size affects training dynamics
- Layer Norm: normalizes across the feature dimension
  - Independent of batch size
  - Consistent behavior between training and inference
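To make the axis difference concrete, a small illustration with arbitrary shapes:

```python
import torch

x = torch.randn(8, 16, 512)  # (batch, seq_len, features)

# Batch Norm style: per-feature statistics pooled over the batch (and sequence)
bn_mean = x.mean(dim=(0, 1))  # shape (512,) -- depends on the other samples in the batch
# Layer Norm style: per-position statistics over the feature dimension only
ln_mean = x.mean(dim=-1)      # shape (8, 16) -- independent of batch size and contents
```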
Pre-Norm vs. Post-Norm
- Post-Norm (original Transformer): LayerNorm applied after the sublayer and residual connection
- Pre-Norm (modern standard): LayerNorm applied before the sublayer
  - More stable training
  - Better gradient flow
  - Used in GPT, Llama, and most modern models
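The difference is simply where the normalization sits relative to the residual connection; a sketch with a generic sublayer (attention or feed-forward):

```python
import torch.nn as nn

def post_norm_step(x, sublayer, norm: nn.LayerNorm):
    # Original Transformer: apply the sublayer, add the residual, then normalize the sum
    return norm(x + sublayer(x))

def pre_norm_step(x, sublayer, norm: nn.LayerNorm):
    # Modern standard: normalize the input, apply the sublayer, then add the residual
    return x + sublayer(norm(x))
```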
RMSNorm
A simplified normalization used in Llama and other efficient models:
- Only divides by root mean square (no mean centering)
- Fewer operations, similar effectiveness
- 10-15% faster than standard LayerNorm
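A sketch of RMSNorm, assuming the common formulation with a learned scale and no bias or mean centering (as used in the Llama family):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scales each position's feature vector by the inverse of its root mean square."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))  # learned scale only, no shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No mean subtraction: just divide by the RMS of the features at each position
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * inv_rms
```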
Where It Appears
In a typical Transformer layer:
- Pre-attention LayerNorm
- Self-attention
- Residual connection
- Pre-FFN LayerNorm
- Feed-forward network
- Residual connection
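Putting the pieces together, a pre-norm Transformer layer might look like the following sketch (the dimensions and the use of nn.MultiheadAttention are illustrative):

```python
import torch
import torch.nn as nn

class PreNormTransformerLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)  # pre-attention LayerNorm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)  # pre-FFN LayerNorm
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)                 # pre-attention LayerNorm
        attn_out, _ = self.attn(h, h, h)  # self-attention
        x = x + attn_out                  # residual connection
        x = x + self.ffn(self.norm2(x))   # pre-FFN LayerNorm, feed-forward, residual
        return x
```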
Implementation Considerations
- Small epsilon (ε ≈ 1e-5) prevents division by zero
- Learned parameters (γ, β) per layer
- Activations must be stored for the backward pass, adding memory overhead during training
- Fused operations available for GPU efficiency
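These knobs surface directly in framework APIs; for example, with PyTorch's built-in module:

```python
import torch
import torch.nn as nn

# eps guards the division; elementwise_affine controls the learned gamma/beta parameters
norm = nn.LayerNorm(normalized_shape=512, eps=1e-5, elementwise_affine=True)
x = torch.randn(8, 16, 512)
y = norm(x)  # same shape as x, normalized over the last (feature) dimension
print(norm.weight.shape, norm.bias.shape)  # torch.Size([512]) torch.Size([512])
```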
Quantization Impact
Layer normalization can be sensitive to quantization:
- Mean/variance calculations need precision
- Some quantization schemes keep LayerNorm in higher precision
- Affects inference optimization strategies
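One common mitigation is to compute the normalization statistics in float32 even when the surrounding activations are in lower precision; a sketch of that pattern (a widespread practice, not any one framework's required API):

```python
import torch
import torch.nn.functional as F

def layer_norm_fp32(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor,
                    eps: float = 1e-5) -> torch.Tensor:
    # Upcast for the mean/variance computation, then cast back to the input dtype
    orig_dtype = x.dtype
    out = F.layer_norm(x.float(), x.shape[-1:], weight.float(), bias.float(), eps)
    return out.to(orig_dtype)
```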
Source
Layer normalization computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case, eliminating batch size dependencies.
https://arxiv.org/abs/1607.06450