
BPE (Byte Pair Encoding)

Definition

BPE is a tokenization algorithm that iteratively merges the most frequent pair of adjacent characters or tokens into a new token, building a vocabulary that sits between character-level and word-level representations for efficient text processing in LLMs.

Why It Matters

Every piece of text you send to an LLM is converted into tokens first. BPE is the algorithm behind most modern tokenizers, including those used by GPT models, Claude, and Llama. Understanding BPE helps you optimize prompts, estimate costs, and debug tokenization issues.

The key insight: BPE finds a middle ground between character-level and word-level tokenization. Character-level tokenization produces too many tokens (sequences get long and expensive, and single characters carry little meaning). Word-level tokenization can’t handle new or rare words. BPE learns common subword patterns, so “unhappiness” might become [“un”, “happiness”] while “the” stays as a single token.
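You can inspect these splits directly with a trained tokenizer. Here is a minimal sketch using OpenAI’s tiktoken library (assuming it is installed; the exact pieces depend on which model’s vocabulary you load, so your output may differ from the example splits above):

```python
import tiktoken  # pip install tiktoken

# Load a trained BPE vocabulary (cl100k_base is used by several OpenAI models).
enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "unhappiness", "tokenization"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> {pieces} ({len(ids)} token(s))")
```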

For AI engineers, BPE knowledge matters when you hit token limits, calculate API costs, or wonder why certain words get split unexpectedly. A 4,000-word document might be 5,000 tokens or 8,000 tokens depending on vocabulary and content type.

How It Works

BPE builds its vocabulary through an iterative merging process (a minimal code sketch follows these steps):

1. Initialize with Characters: Start with all individual characters as your base vocabulary. The word “lower” becomes [“l”, “o”, “w”, “e”, “r”].

2. Count Pair Frequencies: Scan the training corpus and count how often each adjacent pair appears. If “e” and “r” appear together frequently, they become a merge candidate.

3. Merge Most Frequent Pair: Take the most common pair and merge it into a new token. Add “er” to your vocabulary. Now “lower” becomes [“l”, “o”, “w”, “er”].

4. Repeat Until Done: Continue merging until you reach your target vocabulary size (typically 32K-100K tokens). Common words become single tokens. Rare words decompose into recognizable subwords.
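The whole loop fits in a few lines of Python. The sketch below is modeled on the reference implementation from the original BPE paper; the toy corpus and the number of merges are made up purely for illustration:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word so the chosen pair becomes a single merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated sequence of characters with a count.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

num_merges = 10  # real vocabularies are built with tens of thousands of merges
for i in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(f"merge {i + 1}: {best} -> {''.join(best)}")
```

Each printed merge becomes a rule in the finished tokenizer: at encoding time, the same merges are replayed in the same order on new text.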

Implementation Basics

Most AI engineers use pre-trained tokenizers rather than training BPE from scratch:

Tokenizer Libraries: Use tiktoken (OpenAI), tokenizers (Hugging Face), or sentencepiece (Google) to tokenize text. Each model family has its own trained BPE vocabulary.

Token Counting: Before hitting API limits, count tokens locally. Different models have different tokenizers. GPT-4 tokens don’t equal Claude tokens.
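For OpenAI models this is a one-liner with tiktoken. A minimal sketch, assuming the package is installed (other providers ship their own tokenizers or token-counting endpoints, so don’t reuse these counts across model families):

```python
import tiktoken  # pip install tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Return how many tokens the given model's tokenizer produces for text."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

prompt = "BPE finds a middle ground between character-level and word-level tokenization."
print(count_tokens(prompt))
```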

Special Tokens: BPE vocabularies include special tokens like <|endoftext|> or [SEP]. These control model behavior and aren’t part of your content.
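This is also why tokenizer libraries guard special tokens. For example, tiktoken refuses to encode text containing a special token unless you say what to do with it, which keeps user content from impersonating control tokens. A small sketch:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "First document.<|endoftext|>Second document."

# enc.encode(text) would raise here, because the text contains a special token.
# Option 1: treat it as ordinary text (it gets split into regular BPE pieces).
as_plain = enc.encode(text, disallowed_special=())

# Option 2: deliberately keep it as the single <|endoftext|> control token.
as_control = enc.encode(text, allowed_special={"<|endoftext|>"})

print(len(as_plain), len(as_control))  # the plain-text version uses more tokens
```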

The main variants are the original BPE, Byte-Level BPE (which operates on UTF-8 bytes and can represent text in any language), and SentencePiece (a framework that implements BPE and unigram tokenization directly on raw text, without language-specific pre-tokenization). Most modern LLMs use byte-level BPE for universal language support.
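If you ever need to train a byte-level BPE vocabulary of your own, the Hugging Face tokenizers library wraps the whole pipeline. A rough sketch; the corpus, vocabulary size, and special token below are placeholders chosen for illustration:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE: operate on UTF-8 bytes so any input, in any language, is representable.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=5000, special_tokens=["<|endoftext|>"])

corpus = ["a tiny illustrative corpus", "replace this with real training text"]
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("unhappiness").tokens)
```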

Source

Sennrich, Haddow & Birch (2016). Neural Machine Translation of Rare Words with Subword Units - introduces BPE for open-vocabulary translation, achieving better handling of rare and unknown words.

https://arxiv.org/abs/1508.07909