Layer Normalization

Definition

A technique that normalizes activations across features within each layer, stabilizing training and enabling deeper neural networks without batch dependencies.

Why It Matters

Layer normalization is a critical but often overlooked component that makes deep Transformers possible. Without it, training models with dozens or hundreds of layers would be unstable, because gradients would tend to explode or vanish as they propagate through the network.

Every modern LLM uses layer normalization. It’s applied multiple times per Transformer layer, typically before or after attention and feed-forward operations. This stabilization enables the extreme depth (100+ layers) of large language models.

For AI engineers, layer normalization is mostly handled by frameworks. But understanding it helps when debugging training issues, implementing custom architectures, or optimizing inference.

Implementation Basics

How It Works

  1. Compute mean and variance across features for each position
  2. Normalize to zero mean and unit variance
  3. Apply learned scale (γ) and shift (β) parameters

LayerNorm(x) = γ × (x - mean) / √(variance + ε) + β
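
The three steps map directly onto a few lines of code. Below is a minimal PyTorch sketch of the formula above; the shapes, the d_model name, and the ε value are illustrative rather than prescriptive, and the final check should confirm the sketch matches the framework's built-in LayerNorm at default settings.

    import torch

    def layer_norm(x, gamma, beta, eps=1e-5):
        # 1. Mean and variance over the feature (last) dimension, per position
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        # 2. Normalize to zero mean and unit variance
        x_hat = (x - mean) / torch.sqrt(var + eps)
        # 3. Apply the learned per-feature scale and shift
        return gamma * x_hat + beta

    # Illustrative shapes: batch of 2 sequences, 4 positions, 8 features
    d_model = 8
    x = torch.randn(2, 4, d_model)
    gamma = torch.ones(d_model)    # initialized to 1 in practice
    beta = torch.zeros(d_model)    # initialized to 0 in practice
    print(torch.allclose(layer_norm(x, gamma, beta),
                         torch.nn.LayerNorm(d_model)(x), atol=1e-6))  # True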

Layer Norm vs. Batch Norm

  • Batch Norm: Normalizes across batch dimension
    • Needs batch statistics (problematic for inference)
    • Batch size affects training dynamics
  • Layer Norm: Normalizes across feature dimension
    • Independent of batch size
    • Consistent behavior at training and inference time
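
A quick way to see the difference is to compare which dimension the statistics are computed over. The sketch below uses PyTorch's built-in modules; the sizes are arbitrary.

    import torch
    import torch.nn as nn

    x = torch.randn(32, 64)       # (batch, features); sizes are arbitrary

    # Batch Norm: one mean/variance per feature, computed across the batch
    bn = nn.BatchNorm1d(64)
    bn_out = bn(x)                # each output depends on the rest of the batch

    # Layer Norm: one mean/variance per example, computed across the features
    ln = nn.LayerNorm(64)
    ln_out = ln(x)                # each row is normalized independently

    # Layer norm gives the same answer whether an example is alone or in a batch
    print(torch.allclose(ln(x[:1]), ln_out[:1], atol=1e-6))  # True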

Pre-Norm vs. Post-Norm

  • Post-Norm (original Transformer): LayerNorm after sublayer + residual
  • Pre-Norm (modern standard): LayerNorm before sublayer
    • More stable training
    • Better gradient flow
    • Used in GPT, Llama, and most other modern models
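
The difference is simply where the normalization sits relative to the residual connection. A minimal sketch, assuming a generic sublayer (attention or FFN) and illustrative sizes:

    import torch
    import torch.nn as nn

    def post_norm_step(x, sublayer, norm):
        # Original Transformer: sublayer, add residual, then normalize
        return norm(x + sublayer(x))

    def pre_norm_step(x, sublayer, norm):
        # Modern LLMs: normalize first; the residual path stays an identity,
        # which gives gradients a clean route through deep stacks
        return x + sublayer(norm(x))

    d_model = 16
    norm = nn.LayerNorm(d_model)
    ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                        nn.Linear(4 * d_model, d_model))
    x = torch.randn(2, 5, d_model)
    y_post = post_norm_step(x, ffn, norm)
    y_pre = pre_norm_step(x, ffn, norm)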

RMSNorm

Simplified normalization used in Llama and other efficient models:

  • Only divides by root mean square (no mean centering)
  • Fewer operations, similar effectiveness
  • 10-15% faster than standard LayerNorm
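
A minimal RMSNorm sketch following the common formulation; the class name and ε value are illustrative, and real implementations often also upcast to float32 internally.

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        """Scale by the root mean square of the features; no mean centering, no bias."""
        def __init__(self, d_model, eps=1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(d_model))  # only a learned scale

        def forward(self, x):
            # Root mean square over the feature dimension
            rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
            return self.weight * (x / rms)

    x = torch.randn(2, 4, 512)     # illustrative shape
    print(RMSNorm(512)(x).shape)   # torch.Size([2, 4, 512])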

Where It Appears

In a typical pre-norm Transformer layer:

  1. Pre-attention LayerNorm
  2. Self-attention
  3. Residual connection
  4. Pre-FFN LayerNorm
  5. Feed-forward network
  6. Residual connection
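
Putting the pieces together, a layer in the order listed above might look like the following sketch; the module sizes, head count, and GELU feed-forward network are illustrative choices, not any specific model's configuration.

    import torch
    import torch.nn as nn

    class PreNormTransformerLayer(nn.Module):
        def __init__(self, d_model=64, n_heads=4):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)      # 1. pre-attention LayerNorm
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d_model)      # 4. pre-FFN LayerNorm
            self.ffn = nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x):
            h = self.norm1(x)                       # 1. normalize
            attn_out, _ = self.attn(h, h, h)        # 2. self-attention
            x = x + attn_out                        # 3. residual connection
            x = x + self.ffn(self.norm2(x))         # 4-6. norm, FFN, residual
            return x

    x = torch.randn(2, 10, 64)                      # (batch, seq, features)
    print(PreNormTransformerLayer()(x).shape)       # torch.Size([2, 10, 64])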

Implementation Considerations

  • Small epsilon (ε ≈ 1e-5) prevents division by zero
  • Learned parameters (γ, β) per layer
  • Activations must be kept for the backward pass, adding memory overhead during training
  • Fused operations available for GPU efficiency
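
In practice these details surface as a couple of constructor arguments and parameter tensors. A quick look using PyTorch's built-in module; the hidden size is arbitrary.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d_model = 4096                          # arbitrary hidden size
    ln = nn.LayerNorm(d_model, eps=1e-5)    # eps guards the division when variance ≈ 0
    print(ln.weight.shape, ln.bias.shape)   # learned γ and β: one of each per feature

    x = torch.randn(1, 8, d_model)
    # Functional form; on GPU this typically dispatches to a fused kernel
    y = F.layer_norm(x, (d_model,), ln.weight, ln.bias, ln.eps)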

Quantization Impact

Layer normalization can be sensitive to quantization:

  • Mean/variance calculations need precision
  • Some quantization schemes keep LayerNorm in higher precision
  • Affects inference optimization strategies
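
One common pattern, shown here as a generic sketch rather than any particular quantization library's API, is to upcast to float32 for the statistics and cast the result back to the low-precision activation dtype.

    import torch

    def layer_norm_fp32(x_half, weight, bias, eps=1e-5):
        # Compute the statistics in float32 even when activations are low precision,
        # then cast the result back to the activation dtype.
        x = x_half.float()
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        out = (x - mean) / torch.sqrt(var + eps) * weight.float() + bias.float()
        return out.to(x_half.dtype)

    x = torch.randn(2, 4, 8, dtype=torch.float16)
    w = torch.ones(8, dtype=torch.float16)
    b = torch.zeros(8, dtype=torch.float16)
    print(layer_norm_fp32(x, w, b).dtype)   # torch.float16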

Source

Layer normalization computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case, eliminating batch size dependencies.

Ba, Kiros & Hinton, "Layer Normalization" (2016): https://arxiv.org/abs/1607.06450