Positional Encoding
Definition
A technique for injecting sequence order information into Transformer models, which otherwise process tokens in parallel without inherent position awareness.
Why It Matters
Without positional encoding, a Transformer would treat “The dog bit the man” and “The man bit the dog” identically, as nothing more than bags of words. Position information is essential for understanding language.
Unlike RNNs that process sequences step-by-step (inherently capturing order), Transformers process all positions in parallel. Positional encoding solves this by adding position information to each token’s representation.
For AI engineers, positional encoding directly impacts context windows. The choice of positional encoding determines how long a model’s effective context can be, and whether it can generalize to longer sequences than it trained on.
Implementation Basics
Original Sinusoidal Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where pos is the token position, i indexes the embedding-dimension pairs, and d is the model dimension.
Properties:
- Deterministic (no learned parameters)
- Can theoretically extend to any length
- Relative positions have consistent patterns
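A minimal sketch of how this table can be computed from the formulas above (the function name, PyTorch usage, and shapes are illustrative, not from the original paper):

```python
import torch

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build a (max_len, d_model) table from the PE formulas above (d_model assumed even)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # the "2i" in the exponent
    angle = pos / (10000.0 ** (two_i / d_model))                    # (max_len, d_model // 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angle)   # PE(pos, 2i+1)
    return pe

# Added to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_encoding(seq_len, d_model)
```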
Learned Positional Embeddings
- Treat positions like vocabulary tokens
- Learn embedding for each position
- Used in GPT-2, BERT
- Limitation: Fixed maximum length
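A sketch of the idea (class and argument names are illustrative; GPT-2 and BERT differ in details of how positions are combined with token embeddings):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Look up a learned vector per position, exactly like a vocabulary embedding."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)   # fixed table => fixed maximum length

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model); lookup fails if seq_len > max_len
        positions = torch.arange(token_emb.size(1), device=token_emb.device)
        return token_emb + self.pos_emb(positions)      # broadcast over the batch dimension
```

The fixed size of the embedding table is exactly where the maximum-length limitation comes from.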
RoPE (Rotary Position Embedding)
- Encodes position through rotation matrices
- Naturally captures relative positions
- Better extrapolation to longer sequences
- Used in Llama, Mistral, most modern models
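A simplified sketch of applying rotary embeddings to query/key vectors (the interleaved pairing convention shown here is one of several used in practice; names and shapes are illustrative):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) dimension pair by a position-dependent angle.
    x: (batch, seq, heads, head_dim) with even head_dim."""
    seq_len, dim = x.size(1), x.size(-1)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq   # (seq, dim/2)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]    # the two halves of each rotation pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # standard 2-D rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Applied to queries and keys (not values) before the attention dot-product:
# q, k = rope(q), rope(k)
```

Because the dot product of two rotated vectors depends only on the difference of their angles, attention scores end up depending on relative position.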
ALiBi (Attention with Linear Biases)
- Adds position-based penalty to attention scores
- Closer tokens get less penalty
- Good length generalization
- Used in some efficient models
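A rough sketch of how the ALiBi bias can be built (the slope formula assumes the number of heads is a power of two, as in the ALiBi paper; names are illustrative, and a causal mask is assumed to handle future positions):

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear bias added to attention scores before the softmax."""
    # Geometric slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).float()   # (seq, seq), negative for past keys
    bias = slopes[:, None, None] * distance[None]      # (heads, seq, seq): farther => bigger penalty
    return bias

# scores = q @ k.transpose(-1, -2) / head_dim ** 0.5 + alibi_bias(num_heads, seq_len)
```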
Why Position Methods Matter
Different encodings have different properties:
- Extrapolation: Can the model handle sequences longer than those seen during training?
- Memory: How much does position add to memory requirements?
- Relative vs. absolute: Does the model understand “3 tokens apart” vs. “position 47”?
Context Length Extensions
Modern context window extensions often modify position encoding:
- Position interpolation: Scale positions to fit longer sequences
- NTK-aware scaling: Non-linear position scaling
- YaRN: Combines interpolation and NTK-aware scaling with an attention temperature adjustment
These techniques enable models trained on 4K contexts to work with 32K+ tokens.
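A sketch of the core idea behind position interpolation (the 4K trained length and 32K target are illustrative numbers):

```python
def interpolated_positions(seq_len: int, trained_len: int = 4096) -> list[float]:
    """Squeeze a longer sequence's positions into the range seen during training."""
    scale = min(1.0, trained_len / seq_len)        # e.g. 4096 / 32768 = 0.125
    return [p * scale for p in range(seq_len)]     # fractional positions fed into RoPE's angles

# A 32K-token sequence maps onto positions 0 .. ~4095.9 instead of 0 .. 32767,
# so the rotation angles stay within the range the model saw during training.
```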
Practical Implications
- A model’s trained context length isn’t always its effective limit
- Position encoding choice affects performance at different lengths
- Some tasks need absolute position, others need relative
- Longer isn’t always better; attention can “dilute” as context grows
Source
Since the Transformer contains no recurrence and no convolution, positional encodings are added to give the model information about the relative or absolute position of tokens in the sequence.
https://arxiv.org/abs/1706.03762