LLM

Top-P (Nucleus Sampling)

Definition

Top-P (nucleus sampling) is a decoding strategy that limits token selection to the smallest set of most-likely tokens whose cumulative probability reaches or exceeds P, providing dynamic vocabulary restriction based on the model's confidence.

Why It Matters

Top-P solves a problem with pure temperature sampling. At high temperatures, even very unlikely tokens (nonsense words, wrong languages) get selected occasionally. Top-P restricts selection to the most likely tokens that together hold a fraction P of the probability mass (top_p=0.9 keeps the top 90%), ignoring the long tail of unlikely options.

The key advantage: adaptive vocabulary size. When the model is confident (one token has 95% probability), top_p=0.95 effectively picks just that token. When uncertain (many tokens around 10-15% each), it allows selection from all reasonable options.
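To make the adaptive behavior concrete, here is a minimal sketch that counts how many tokens fall inside the nucleus for a confident and an uncertain distribution. The function name nucleus_size and both probability lists are made up for illustration; they do not come from any particular model or library.

```python
def nucleus_size(probs, top_p):
    """Count the tokens in the smallest top-ranked set whose cumulative probability reaches top_p."""
    total = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total >= top_p:
            return count
    return len(probs)  # floating-point rounding left the total just short of top_p

# Illustrative distributions (assumed values, not model output).
confident = [0.96, 0.02, 0.01, 0.01]
uncertain = [0.15, 0.15, 0.14, 0.13, 0.12, 0.11, 0.10, 0.06, 0.02, 0.02]

print(nucleus_size(confident, 0.95))  # 1: the dominant token alone clears the threshold
print(nucleus_size(uncertain, 0.95))  # 8: most of the plausible tokens stay in play
```

The same top_p value yields a one-token nucleus in the first case and an eight-token nucleus in the second, which is the adaptivity described above.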

For AI engineers, top_p is a knob for fine adjustments to output randomness (unrelated to model fine-tuning). It’s less intuitive than temperature but can produce more natural variation for certain tasks.

Implementation Basics

Typical Settings

  • top_p=1.0: No restriction (considers all tokens; the default in many APIs)
  • top_p=0.95: Slight restriction (common for general use)
  • top_p=0.9: Moderate restriction (balanced)
  • top_p=0.5-0.7: Strong restriction (more focused)

How It Works

  1. Compute a probability for each possible next token
  2. Sort tokens by probability (highest first)
  3. Add probabilities cumulatively until the running total reaches or exceeds top_p
  4. Renormalize and sample only from this “nucleus” of tokens

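Example: If top_p=0.9 and the three most likely tokens have probabilities 0.5, 0.3, and 0.15, all three are in the nucleus: 0.5 + 0.3 = 0.8 falls short of 0.9, so the third token is needed to cross the threshold (cumulative 0.95). The remaining tokens, which share the last 0.05 of probability mass, are excluded.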
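The four steps above can be sketched in plain Python. This is a minimal illustration, not any library's implementation: the function name top_p_sample and the probability list are hypothetical, and step 1 is assumed to have already happened (the list stands in for a model's softmax output).

```python
import random

def top_p_sample(probs, top_p, rng=random):
    """Sample a token index from probs using nucleus (top-p) sampling."""
    # Step 2: rank token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Step 3: grow the nucleus until its cumulative probability reaches top_p.
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= top_p:
            break

    # Step 4: sample inside the nucleus; random.choices treats the weights
    # as relative, so no explicit renormalization is needed.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

# Assumed example distribution over four token indices.
probs = [0.1, 0.4, 0.2, 0.3]
rng = random.Random(0)  # seeded for repeatability
samples = [top_p_sample(probs, top_p=0.8, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # index 0 (p=0.1) lies outside the nucleus and never appears
```

With top_p=0.8 the running total goes 0.4, 0.7, 0.9, so the nucleus holds indices 1, 3, and 2, and the least likely token is never drawn.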

Temperature vs. Top-P

Both control randomness, but differently:

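  • Temperature: Rescales the whole distribution, sharpening or flattening it; every token keeps a nonzero probability
  • Top-P: Truncates the distribution; tokens outside the nucleus get zero probability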
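The difference between the two is easier to see side by side. The sketch below assumes temperature is applied by dividing logits before a softmax, which is the usual formulation; the helper names and logit values are invented for illustration.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature: reshapes the distribution, removes nothing."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_top_p(probs, top_p):
    """Zero out tokens outside the nucleus, then renormalize the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = set(), 0.0
    for i in order:
        kept.add(i)
        total += probs[i]
        if total >= top_p:
            break
    return [p / total if i in kept else 0.0 for i, p in enumerate(probs)]

logits = [4.0, 3.0, 1.0, -2.0]  # made-up values
base = apply_temperature(logits, 1.0)
hot = apply_temperature(logits, 2.0)
truncated = apply_top_p(base, 0.9)

print([round(p, 3) for p in hot])        # flatter than base, but no entry is zero
print([round(p, 3) for p in truncated])  # the two least likely tokens are exactly 0.0
```

Raising the temperature moves probability toward the tail without ever removing a token, while top-p removes the tail outright and leaves the relative order of the survivors untouched.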

Recommendation

Adjust one and keep the other at its default. Temperature is more intuitive for most use cases. Use top_p when you specifically want adaptive vocabulary restriction, for example in creative tasks where you want variation but not nonsense.

Production Tip

Many production systems use temperature alone (top_p=1.0). If you need both, test extensively. The interaction between them can produce unexpected behavior that’s hard to predict or debug.
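One way to see the interaction is to hold top_p fixed and vary the temperature. The sketch below assumes temperature is applied before top-p truncation; that ordering is an assumption about the sampler, and a given API may behave differently. The logits are made up.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, top_p):
    """Count the tokens needed for the cumulative probability to reach top_p."""
    total = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total >= top_p:
            return count
    return len(probs)

logits = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0]  # assumed values
sizes = {}
for temperature in (0.5, 1.0, 2.0):
    sizes[temperature] = nucleus_size(softmax(logits, temperature), top_p=0.9)
    print(f"temperature={temperature}: {sizes[temperature]} of {len(logits)} tokens in the nucleus")
```

Under this ordering, the same top_p=0.9 admits more tokens as the temperature rises, because a flatter distribution needs more tokens to reach the threshold. Changing either setting therefore shifts what the other one does.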

Source

Nucleus sampling dynamically selects from the top-p portion of the probability mass, avoiding both the incoherence of pure sampling and the repetition of beam search.

https://arxiv.org/abs/1904.09751