Top-P (Nucleus Sampling)
Definition
Top-P (nucleus sampling) is a decoding strategy that limits token selection to the smallest set whose cumulative probability exceeds P, providing dynamic vocabulary restriction based on confidence.
Why It Matters
Top-P addresses a weakness of pure temperature sampling: at high temperatures, even very unlikely tokens (nonsense words, wrong languages) occasionally get selected. Top-P restricts selection to the smallest set of tokens that collectively account for the top P fraction of the probability mass, ignoring the long tail of unlikely options.
The key advantage: adaptive vocabulary size. When the model is confident (one token has 95% probability), top_p=0.95 effectively picks just that token. When uncertain (many tokens around 10-15% each), it allows selection from all reasonable options.
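This adaptivity is easy to see with a small helper. The sketch below (the `nucleus_size` function and the example distributions are illustrative, not from any library) counts how many tokens fall inside the nucleus for a confident versus an uncertain distribution:

```python
def nucleus_size(probs, top_p):
    """Count tokens in the smallest set whose cumulative probability reaches top_p."""
    total, count = 0.0, 0
    for p in sorted(probs, reverse=True):  # highest-probability tokens first
        total += p
        count += 1
        if total >= top_p:
            break
    return count

confident = [0.95, 0.02, 0.01, 0.01, 0.01]
uncertain = [0.15, 0.14, 0.13, 0.12, 0.12, 0.12, 0.11, 0.11]

print(nucleus_size(confident, 0.95))  # 1  (one token already covers 95%)
print(nucleus_size(uncertain, 0.95))  # 8  (many tokens needed to reach 95%)
```

With the same top_p setting, the effective vocabulary shrinks to a single token when the model is confident and expands to the whole set of plausible tokens when it is not.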
For AI engineers, top_p is a fine-tuning lever. It’s less intuitive than temperature but can produce more natural variation for certain tasks.
Implementation Basics
Typical Settings
- top_p=1.0: No restriction (default, consider all tokens)
- top_p=0.95: Slight restriction (common for general use)
- top_p=0.9: Moderate restriction (balanced)
- top_p=0.5-0.7: Strong restriction (more focused)
How It Works
- The model computes a probability for each possible next token
- Sort tokens by probability, highest first
- Accumulate probabilities down the sorted list until the running total reaches top_p
- Sample only from this “nucleus” of tokens, renormalizing their probabilities
Example: If top_p=0.9 and the top three tokens have probabilities 0.5, 0.3, and 0.15, all three are included: the first two sum to only 0.8 (< 0.9), so the third is needed to cross the threshold (0.95 ≥ 0.9). With top_p=0.8, the nucleus would stop at the first two tokens.
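The steps above can be sketched in plain Python. This is a minimal illustration of the algorithm, not a production decoder; the function name is our own:

```python
import random

def nucleus_sample(probs, top_p, rng=random):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability reaches top_p (the 'nucleus')."""
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= top_p:  # stop once the nucleus covers top_p of the mass
            break
    # Renormalize within the nucleus, then sample from it.
    weights = [probs[i] / total for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

For example, `nucleus_sample([0.5, 0.3, 0.15, 0.05], top_p=0.9)` can return index 0, 1, or 2 but never 3, since the last token falls outside the nucleus.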
Temperature vs. Top-P
Both control randomness, but differently:
- Temperature: Rescales all probabilities, flattening or sharpening the whole distribution; every token keeps some chance
- Top-P: Truncates the distribution, giving tail tokens exactly zero chance
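The contrast is concrete when both operations are applied to the same toy distribution. Both helper functions below are illustrative sketches (temperature is applied the standard way, as a softmax over logits divided by T):

```python
import math

# Temperature rescales every probability: softmax(logits / T).
def apply_temperature(probs, T):
    logits = [math.log(p) for p in probs]
    scaled = [math.exp(l / T) for l in logits]
    z = sum(scaled)
    return [s / z for s in scaled]

# Top-P truncates: tokens outside the nucleus get probability zero.
def apply_top_p(probs, top_p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = set(), 0.0
    for i in order:
        kept.add(i)
        total += probs[i]
        if total >= top_p:
            break
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]
print(apply_temperature(probs, 2.0))  # flattened, but every token keeps nonzero mass
print(apply_top_p(probs, 0.9))        # tail token is zeroed out entirely
```

With temperature, the unlikely 0.05 token becomes more likely than before; with top-p, it disappears from consideration altogether.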
Recommendation
Adjust one and keep the other at its default. Temperature is more intuitive for most use cases. Use top_p when you specifically want adaptive vocabulary restriction, which is useful for creative tasks where you want variation but not nonsense.
Production Tip
Most production systems use temperature alone (top_p=1.0). If you need both, test extensively: the interaction between them can produce behavior that is hard to predict or debug.
Source
Holtzman et al., “The Curious Case of Neural Text Degeneration” (2019). Nucleus sampling dynamically selects from the top-p portion of the probability mass, avoiding both the incoherence of pure sampling and the repetition of beam search.
https://arxiv.org/abs/1904.09751