Positional Encoding
Definition
A technique for injecting sequence order information into Transformer models, which otherwise process tokens in parallel without inherent position awareness.
Why It Matters
Without positional encoding, a Transformer would treat “The dog bit the man” and “The man bit the dog” identically, as nothing more than bags of words. Position information is essential for understanding language.
Unlike RNNs that process sequences step-by-step (inherently capturing order), Transformers process all positions in parallel. Positional encoding solves this by adding position information to each token’s representation.
For AI engineers, positional encoding directly impacts context windows. The choice of positional encoding determines how long a model’s effective context can be, and whether it can generalize to longer sequences than it trained on.
Implementation Basics
Original Sinusoidal Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where pos is the token position, i indexes the embedding-dimension pairs, and d is the model dimension.
Properties:
- Deterministic (no learned parameters)
- Can theoretically extend to any length
- Relative positions have consistent patterns
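A minimal sketch of how this table can be computed from the formulas above (the function name, PyTorch usage, and shapes are illustrative, not from the original paper):

```python
import torch

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build a (max_len, d_model) table from the PE formulas above (d_model assumed even)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # the "2i" in the exponent
    angle = pos / (10000.0 ** (two_i / d_model))                    # (max_len, d_model // 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angle)   # PE(pos, 2i+1)
    return pe

# Added to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_encoding(seq_len, d_model)
```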
Learned Positional Embeddings
- Treat positions like vocabulary tokens
- Learn embedding for each position
- Used in GPT-2, BERT
- Limitation: Fixed maximum length
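A sketch of the idea (class and argument names are illustrative; GPT-2 and BERT differ in details of how positions are combined with token embeddings):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Look up a learned vector per position, exactly like a vocabulary embedding."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)   # fixed table => fixed maximum length

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model); lookup fails if seq_len > max_len
        positions = torch.arange(token_emb.size(1), device=token_emb.device)
        return token_emb + self.pos_emb(positions)      # broadcast over the batch dimension
```

The fixed size of the embedding table is exactly where the maximum-length limitation comes from.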
RoPE (Rotary Position Embedding)
- Encodes position through rotation matrices
- Naturally captures relative positions
- Better extrapolation to longer sequences
- Used in Llama, Mistral, most modern models
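A simplified sketch of applying rotary embeddings to query/key vectors (the interleaved pairing convention shown here is one of several used in practice; names and shapes are illustrative):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) dimension pair by a position-dependent angle.
    x: (batch, seq, heads, head_dim) with even head_dim."""
    seq_len, dim = x.size(1), x.size(-1)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq   # (seq, dim/2)
    cos = angles.cos()[None, :, None, :]   # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]    # the two halves of each rotation pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # standard 2-D rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Applied to queries and keys (not values) before the attention dot-product:
# q, k = rope(q), rope(k)
```

Because the dot product of two rotated vectors depends only on the difference of their angles, attention scores end up depending on relative position.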
ALiBi (Attention with Linear Biases)
- Adds position-based penalty to attention scores
- Closer tokens get less penalty
- Good length generalization
- Used in some efficient models
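A rough sketch of how the ALiBi bias can be built (the slope formula assumes the number of heads is a power of two, as in the ALiBi paper; names are illustrative, and a causal mask is assumed to handle future positions):

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear bias added to attention scores before the softmax."""
    # Geometric slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).float()   # (seq, seq), negative for past keys
    bias = slopes[:, None, None] * distance[None]      # (heads, seq, seq): farther => bigger penalty
    return bias

# scores = q @ k.transpose(-1, -2) / head_dim ** 0.5 + alibi_bias(num_heads, seq_len)
```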
Why Position Methods Matter
Different encodings have different properties:
- Extrapolation: Can the model handle sequences longer than those seen during training?
- Memory: How much does position add to memory requirements?
- Relative vs. absolute: Does the model understand “3 tokens apart” vs. “position 47”?
Context Length Extensions
Modern context window extensions often modify position encoding:
- Position interpolation: Scale positions to fit longer sequences
- NTK-aware scaling: Non-linear position scaling
- YaRN: Combines interpolation and NTK-aware scaling with an attention temperature adjustment
These techniques enable models trained on 4K contexts to work with 32K+ tokens.
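A sketch of the core idea behind position interpolation (the 4K trained length and 32K target are illustrative numbers):

```python
def interpolated_positions(seq_len: int, trained_len: int = 4096) -> list[float]:
    """Squeeze a longer sequence's positions into the range seen during training."""
    scale = min(1.0, trained_len / seq_len)        # e.g. 4096 / 32768 = 0.125
    return [p * scale for p in range(seq_len)]     # fractional positions fed into RoPE's angles

# A 32K-token sequence maps onto positions 0 .. ~4095.9 instead of 0 .. 32767,
# so the rotation angles stay within the range the model saw during training.
```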
Practical Implications
- A model’s trained context length isn’t always its effective limit
- Position encoding choice affects performance at different lengths
- Some tasks need absolute position, others need relative
- Longer isn’t always better; attention can “dilute” as context grows
Source
Since the Transformer contains no recurrence and no convolution, positional encodings are added to give the model information about the relative or absolute position of tokens in the sequence.
https://arxiv.org/abs/1706.03762