
Beam Search

Definition

Beam search is a decoding algorithm that maintains multiple candidate sequences (beams) at each generation step, exploring the most promising paths to find higher-quality outputs than greedy decoding while remaining computationally tractable.

Why It Matters

When an LLM generates text, it predicts probabilities for each next token. Greedy decoding always picks the highest-probability token, but this doesn’t always produce the best overall sequence. Beam search explores multiple paths simultaneously, often finding better outputs.

The key insight: the locally optimal choice isn’t always globally optimal. If “The cat” has the highest probability for the first two tokens, but “A dog” leads to a much better complete sentence, greedy decoding will never find it. Beam search keeps multiple candidates alive.
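A toy calculation (with made-up probabilities) shows the trap concretely:

```python
# Hypothetical probabilities: "The cat" is the most likely two-token
# prefix, but "A dog" leads to the better complete sentence.
p_the_cat = 0.6 * 0.9   # best two-token prefix: 0.54
p_a_dog = 0.4 * 0.8     # runner-up prefix: 0.32

best_after_the_cat = p_the_cat * 0.1   # only weak continuations: 0.054
best_after_a_dog = p_a_dog * 0.6       # a strong continuation: 0.192
assert best_after_a_dog > best_after_the_cat
```

Greedy decoding commits to "The" at step one and can never recover; a beam of width 2 keeps "A" alive long enough to find the higher-scoring sentence.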

For AI engineers, understanding beam search helps you tune generation quality. It’s the default for tasks requiring deterministic, high-quality output, like translation, summarization, and code generation. For creative tasks, you might prefer sampling methods with temperature instead.

How It Works

Beam search maintains a fixed number of candidate sequences (the beam width):

1. Initialize: Start with an empty sequence. The model predicts probabilities for all possible first tokens.

2. Expand and Score: For each candidate in the beam, compute probabilities for all possible next tokens. Score each extended sequence by multiplying token probabilities (in practice, summing log probabilities for numerical stability).

3. Prune to Beam Width: Keep only the top-k sequences by score, where k is the beam width. Discard the rest.

4. Repeat Until Done: Continue expanding and pruning until all beams reach an end token or hit the maximum length. Return the highest-scoring complete sequence.
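The four steps above can be sketched in a few lines of Python. The toy `step_probs` table and its token names are invented for illustration; a real decoder would call the model for next-token probabilities instead:

```python
import math

def beam_search(step_probs, beam_width, eos="</s>", max_len=10):
    """Minimal beam search. step_probs(prefix) returns a dict mapping
    each possible next token to its probability; scores are summed
    log probabilities. Returns the best (sequence, log_prob) pair."""
    beams = [([], 0.0)]          # 1. Initialize: one empty sequence
    finished = []
    for _ in range(max_len):
        # 2. Expand and score every candidate continuation
        candidates = []
        for seq, score in beams:
            for tok, p in step_probs(tuple(seq)).items():
                candidates.append((seq + [tok], score + math.log(p)))
        # 3. Prune to the top-k sequences by score
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        # 4. Repeat until every surviving beam has finished
        if not beams:
            break
    finished.extend(beams)       # keep any beams that hit max_len
    return max(finished, key=lambda c: c[1])

# Toy next-token distributions. Greedy decoding (beam_width=1) commits
# to "The" and ends with "The cat sat" (p = 0.36); beam_width=2 keeps
# "A" alive and finds the better "A dog barked" (p = 0.38).
TABLE = {
    (): {"The": 0.6, "A": 0.4},
    ("The",): {"cat": 1.0},
    ("A",): {"dog": 1.0},
    ("The", "cat"): {"sat": 0.6, "</s>": 0.4},
    ("A", "dog"): {"barked": 0.95, "</s>": 0.05},
    ("The", "cat", "sat"): {"</s>": 1.0},
    ("A", "dog", "barked"): {"</s>": 1.0},
}

greedy_seq, _ = beam_search(TABLE.get, beam_width=1)
beam_seq, _ = beam_search(TABLE.get, beam_width=2)
```

Here `greedy_seq` ends as "The cat sat" while `beam_seq` finds "A dog barked", mirroring the locally-vs-globally-optimal distinction above.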

Implementation Basics

Key parameters when using beam search:

Beam Width: Typical values are 4-10. Larger beams explore more paths but increase computation linearly. Beyond 10, gains diminish while costs grow.

Length Penalty: Beam search naturally favors shorter sequences, since each additional token multiplies in a probability below 1 and lowers the score. Length penalties normalize scores by sequence length to prevent truncated outputs.
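One common normalization divides the summed log probability by the sequence length raised to a tunable exponent; the function name and the alpha value below are illustrative:

```python
import math

def normalized_score(logprobs, alpha=0.7):
    """Length-normalized beam score: sum of token log probs divided by
    length^alpha. alpha=0 means no penalty; alpha=1 is full averaging."""
    return sum(logprobs) / (len(logprobs) ** alpha)

# A short, mediocre sequence vs. a longer, consistently likely one.
short = [math.log(0.5)] * 2   # raw sum ~ -1.39
long_ = [math.log(0.6)] * 5   # raw sum ~ -2.55
assert sum(short) > sum(long_)                            # raw score favors short
assert normalized_score(long_) > normalized_score(short)  # penalty flips it
```

Without the penalty, the two-token sequence wins on raw score purely because it is shorter; with normalization, the consistently likely longer sequence ranks first.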

N-Best Output: Rather than returning just the top beam, you can get the top-n candidates. Useful for reranking or showing alternatives.

When to Use: Prefer beam search for deterministic tasks (translation, summarization) where quality matters more than creativity. For open-ended generation, sampling with temperature often produces more natural, varied text.

Most inference libraries (Hugging Face, vLLM) support beam search via simple parameter flags. The trade-off is always exploration quality versus generation speed.
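In Hugging Face's transformers library, for example, the parameters above map onto a GenerationConfig; the values here are illustrative, and exact semantics are defined by the library's docs:

```python
from transformers import GenerationConfig

# Sketch of beam-search settings via transformers' GenerationConfig.
config = GenerationConfig(
    num_beams=4,             # beam width
    length_penalty=1.0,      # length-normalization exponent
    num_return_sequences=4,  # n-best: return all 4 beams
    early_stopping=True,     # stop once num_beams candidates have finished
    do_sample=False,         # beam search, not sampling
)
# Then pass it to generation, e.g.:
# outputs = model.generate(input_ids, generation_config=config)
```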

Source

Sequence to Sequence Learning with Neural Networks (Sutskever, Vinyals, and Le, 2014) - the paper after which beam search became the standard decoding method for neural machine translation, improving BLEU scores by 2-3 points over greedy decoding.

https://arxiv.org/abs/1409.3215