Perplexity
Definition
Perplexity is a metric that measures how well a language model predicts text, calculated as the exponential of the average negative log-likelihood per token. Lower perplexity indicates the model is less 'surprised' by the text and better at predicting it.
Why It Matters
Perplexity is the go-to metric for comparing language models’ raw text prediction ability. When researchers report “GPT-4 achieves lower perplexity than GPT-3,” they’re saying GPT-4 is better at predicting text sequences.
The intuition: perplexity measures how “surprised” the model is by each word. A perplexity of 20 on a text means the model’s uncertainty is equivalent to randomly choosing among 20 equally likely words at each position. Lower is better. A model with perplexity 10 is less surprised and more confident in its predictions.
For AI engineers, perplexity matters when comparing base models, evaluating fine-tuning results, or assessing model quality on your domain-specific data. A model with low perplexity on general text but high perplexity on your legal documents might not be the right choice for your legal AI application.
Implementation Basics
Calculating Perplexity
Perplexity = exp(average negative log-likelihood). For a sequence of N tokens, you sum the negative log probabilities of each token given the previous tokens, divide by N, then exponentiate.
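A minimal sketch of that calculation in Python (the per-token log probabilities below are made-up values for illustration, standing in for the logprobs an API might return):

```python
import math

# Hypothetical per-token log probabilities: log P(token_i | previous tokens)
# for each of the N tokens in the sequence.
token_logprobs = [-2.1, -0.4, -3.5, -1.2, -0.9]

avg_nll = -sum(token_logprobs) / len(token_logprobs)  # average negative log-likelihood
perplexity = math.exp(avg_nll)                         # exponentiate to get perplexity
print(f"perplexity = {perplexity:.2f}")

# Sanity check of the intuition above: if every token had probability 1/20,
# the average NLL would be log(20) and the perplexity exactly 20.
assert abs(math.exp(-math.log(1 / 20)) - 20) < 1e-9
```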
Most LLM APIs don’t directly return perplexity, but some expose per-token log probabilities that you can use to calculate it yourself. The Hugging Face transformers documentation includes a guide for computing perplexity from model outputs.
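A rough sketch of that approach using Hugging Face transformers (GPT-2 and the sample text are arbitrary choices; passing labels makes a causal LM return the mean cross-entropy, i.e. the average negative log-likelihood per token):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 chosen only as a small, freely available example model.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Perplexity measures how well a language model predicts text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Supplying labels makes the model compute the average cross-entropy
    # loss over the sequence (negative log-likelihood per token).
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"perplexity = {perplexity.item():.2f}")
```

For texts longer than the model's context window, the transformers documentation describes a sliding-window strategy so each token is still scored with sufficient preceding context.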
Interpretation Guidelines
- Modern LLMs achieve perplexity around 10-30 on standard benchmarks
- Domain-specific text may have higher perplexity (more specialized vocabulary)
- Lower perplexity doesn’t guarantee better downstream task performance
- Compare perplexity only on the same test set with the same tokenizer
Limitations
Perplexity measures prediction quality, not generation quality. A model might predict text well but generate poorly. It’s also sensitive to tokenization: different tokenizers make perplexity scores incomparable.
Use perplexity as one signal among many. For production applications, downstream task metrics (accuracy, user satisfaction) matter more than perplexity scores.
Source
Perplexity remains a standard intrinsic evaluation metric for language models, measuring the probability assigned to test sequences.
https://arxiv.org/abs/1808.10583