
Positional Encoding

Definition

A technique for injecting sequence order information into Transformer models, which otherwise process tokens in parallel without inherent position awareness.

Why It Matters

Without positional encoding, Transformers would treat “The dog bit the man” and “The man bit the dog” identically, as unordered bags of words. Position information is essential for understanding language.

Unlike RNNs that process sequences step-by-step (inherently capturing order), Transformers process all positions in parallel. Positional encoding solves this by adding position information to each token’s representation.

For AI engineers, positional encoding directly impacts context windows. The choice of positional encoding determines how long a model’s effective context can be and whether it can generalize to sequences longer than those it was trained on.

Implementation Basics

Original Sinusoidal Encoding

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Properties:

  • Deterministic (no learned parameters)
  • Can theoretically extend to any length
  • Relative positions have consistent patterns
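
A minimal NumPy sketch of the table defined by the formula above (the function name and shapes are illustrative, not from the paper):

import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Build the (seq_len, d_model) sinusoidal position matrix; d_model is the d in the formula."""
    positions = np.arange(seq_len)[:, None]        # pos
    even_dims = np.arange(0, d_model, 2)[None, :]  # 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # PE(pos, 2i+1)
    return pe

# The resulting matrix is added to the token embeddings before the first layer.
pe = sinusoidal_encoding(seq_len=6, d_model=8)     # assumes an even d_model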

Learned Positional Embeddings

  • Treat positions like vocabulary
  • Learn embedding for each position
  • Used in GPT-2, BERT
  • Limitation: Fixed maximum length
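
A minimal sketch of the learned approach, assuming a PyTorch-style module (the class name is illustrative):

import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """GPT-2/BERT-style positions: one trainable embedding row per position."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # fixed maximum length baked in

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model); fails if seq_len > max_len
        positions = torch.arange(token_emb.size(1), device=token_emb.device)
        return token_emb + self.pos_emb(positions)

Because the table has exactly max_len rows, sequences longer than max_len cannot be encoded without resizing or retraining, which is the fixed-length limitation noted above.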

RoPE (Rotary Position Embedding)

  • Encodes position through rotation matrices
  • Naturally captures relative positions
  • Better extrapolation to longer sequences
  • Used in Llama, Mistral, and most modern models
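
A simplified sketch of the rotation, using one common pairing of even/odd channels (names and layout are illustrative; production implementations such as Llama’s differ in detail):

import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) channel pair of a query or key by a position-dependent angle."""
    seq_len, d = x.shape[-2], x.shape[-1]
    half = d // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]            # rotation pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

Because the angle grows linearly with position, the dot product between a rotated query and key depends only on their offset, which is what gives RoPE its relative-position behavior.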

ALiBi (Attention with Linear Biases)

  • Adds position-based penalty to attention scores
  • Closer tokens get less penalty
  • Good length generalization
  • Used in models such as BLOOM and MPT
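
A rough sketch of the bias matrix, with a geometric slope schedule similar to the one described in the ALiBi paper (names and exact slopes are illustrative):

import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Per-head linear penalty proportional to query-key distance."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs().float()  # |i - j|
    return -distance[None, :, :] * slopes[:, None, None]                # (heads, seq, seq)

The bias is added to the raw attention scores before the softmax, so more distant tokens are progressively down-weighted; no position vectors are added to the embeddings at all.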

Why Position Methods Matter

Different encodings have different properties:

  • Extrapolation: Can the model handle longer sequences than training?
  • Memory: How much does position add to memory requirements?
  • Relative vs. absolute: Does the model understand “3 tokens apart” vs. “position 47”?

Context Length Extensions

Modern context window extensions often modify position encoding:

  • Position interpolation: Scale positions to fit longer sequences
  • NTK-aware scaling: Non-linear position scaling
  • YaRN: Combines NTK-style interpolation with attention scaling

These techniques enable models trained on 4K contexts to work with 32K+ tokens.
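
A toy sketch of position interpolation for a RoPE-style model, assuming a 4K training window (names and the scaling rule are illustrative):

import torch

def interpolated_positions(seq_len: int, trained_len: int = 4096) -> torch.Tensor:
    """Squeeze positions beyond the trained window back into [0, trained_len)."""
    positions = torch.arange(seq_len, dtype=torch.float32)
    if seq_len <= trained_len:
        return positions
    scale = trained_len / seq_len        # e.g. 4096 / 32768 = 0.125
    return positions * scale             # fractional positions stay in the trained range

These scaled, now fractional, positions would then feed into the rotary angle computation in place of integer positions; NTK-aware scaling and YaRN instead adjust the rotary frequencies (and, for YaRN, the attention scale) rather than uniformly rescaling every position.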

Practical Implications

  • A model’s trained context length isn’t always its effective limit
  • Position encoding choice affects performance at different lengths
  • Some tasks need absolute position, others need relative
  • Longer isn’t always better; attention can “dilute” as sequences grow

Source

Since the Transformer contains no recurrence and no convolution, positional encodings are added to give the model information about the relative or absolute position of tokens in the sequence.

https://arxiv.org/abs/1706.03762