Perplexity
Definition
Perplexity is a metric that measures how well a language model predicts text, calculated as the exponential of the average negative log-likelihood per token. Lower perplexity indicates the model is less 'surprised' by the text and better at predicting it.
Why It Matters
Perplexity is the go-to metric for comparing language models’ raw text prediction ability. When researchers report “GPT-4 achieves lower perplexity than GPT-3,” they’re saying GPT-4 is better at predicting text sequences.
The intuition: perplexity measures how “surprised” the model is by each word. A perplexity of 20 on a text means the model’s uncertainty is equivalent to randomly choosing among 20 equally likely words at each position. Lower is better. A model with perplexity 10 is less surprised and more confident in its predictions.
For AI engineers, perplexity matters when comparing base models, evaluating fine-tuning results, or assessing model quality on your domain-specific data. A model with low perplexity on general text but high perplexity on your legal documents might not be the right choice for your legal AI application.
Implementation Basics
Calculating Perplexity
Perplexity = exp(average negative log-likelihood). For a sequence of N tokens, you sum the negative log probabilities of each token given the previous tokens, divide by N, then exponentiate.
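A minimal sketch of that calculation in Python (the per-token log probabilities below are made-up values for illustration, standing in for the logprobs an API might return):

```python
import math

# Hypothetical per-token log probabilities: log P(token_i | previous tokens)
# for each of the N tokens in the sequence.
token_logprobs = [-2.1, -0.4, -3.5, -1.2, -0.9]

avg_nll = -sum(token_logprobs) / len(token_logprobs)  # average negative log-likelihood
perplexity = math.exp(avg_nll)                         # exponentiate to get perplexity
print(f"perplexity = {perplexity:.2f}")

# Sanity check of the intuition above: if every token had probability 1/20,
# the average NLL would be log(20) and the perplexity exactly 20.
assert abs(math.exp(-math.log(1 / 20)) - 20) < 1e-9
```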
Most LLM APIs don’t directly return perplexity, but some expose per-token log probabilities that you can use to calculate it yourself. The Hugging Face transformers documentation includes a guide for computing perplexity from model outputs.
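A rough sketch of that approach using Hugging Face transformers (GPT-2 and the sample text are arbitrary choices; passing labels makes a causal LM return the mean cross-entropy, i.e. the average negative log-likelihood per token):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 chosen only as a small, freely available example model.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Perplexity measures how well a language model predicts text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Supplying labels makes the model compute the average cross-entropy
    # loss over the sequence (negative log-likelihood per token).
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"perplexity = {perplexity.item():.2f}")
```

For texts longer than the model's context window, the transformers documentation describes a sliding-window strategy so each token is still scored with sufficient preceding context.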
Interpretation Guidelines
- Modern LLMs achieve perplexity around 10-30 on standard benchmarks
- Domain-specific text may have higher perplexity (more specialized vocabulary)
- Lower perplexity doesn’t guarantee better downstream task performance
- Compare perplexity only on the same test set with the same tokenizer
Limitations
Perplexity measures prediction quality, not generation quality. A model might predict text well but generate poorly. It’s also sensitive to tokenization: different tokenizers make perplexity scores incomparable.
Use perplexity as one signal among many. For production applications, downstream task metrics (accuracy, user satisfaction) matter more than perplexity scores.
Source
Perplexity remains a standard intrinsic evaluation metric for language models, measuring the probability assigned to test sequences.
https://arxiv.org/abs/1808.10583