Speculative Decoding
Definition
An inference acceleration technique that uses a smaller, faster draft model to predict multiple tokens, which are then verified in parallel by the larger target model, yielding significant speedups while provably preserving the target model's output distribution.
Why It Matters
Speculative decoding is one of the most effective techniques for making LLM inference faster without sacrificing quality. Unlike quantization or pruning, which trade accuracy for speed, speculative decoding produces an output distribution mathematically identical to the target model's while achieving 2-3x speedups in practice.
The key insight is that LLM inference is bottlenecked by memory bandwidth, not computation. Each new token requires loading all model weights from GPU memory, which is slow. Speculative decoding amortizes this cost by verifying multiple candidate tokens in a single forward pass of the large model.
For AI engineers, this technique is increasingly important as inference costs dominate production expenses. It’s particularly valuable when you can’t compromise on output quality, since the target model’s distribution is preserved exactly. Many production systems now enable speculative decoding by default.
How It Works
The Basic Principle
Standard autoregressive decoding generates one token at a time: run the model, sample a token, append it, repeat. Each step requires reading all model weights from GPU memory, so generation is memory-bandwidth bound.
Speculative decoding inverts this: a small draft model quickly predicts multiple tokens ahead, then the large target model verifies all predictions in a single parallel forward pass. Accepted tokens are kept; the first rejected token gets resampled from the target model’s corrected distribution.
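For reference, here is a minimal sketch of the standard loop being replaced, using Hugging Face transformers with gpt2 purely as an illustrative model (a real implementation would also reuse a KV cache; the point is the one-token-per-pass structure):

```python
# Minimal sketch of standard greedy autoregressive decoding: one token per
# forward pass, each pass streaming all model weights from memory.
# gpt2 is used purely as a small illustrative model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Speculative decoding is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                          # 20 new tokens -> 20 full forward passes
        logits = model(ids).logits[:, -1, :]     # only the last position is used
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```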
Step-by-Step Process
- Draft phase: The small draft model generates K candidate tokens autoregressively (fast because the model is small)
- Verify phase: The target model processes all K candidates in parallel, computing probabilities for each position
- Accept/reject: Compare draft and target probabilities position by position. Each drafted token is accepted with probability min(1, p_target/p_draft); the first rejected token is resampled from a corrected distribution
- Repeat: Continue drafting from the last accepted position (one full round is sketched below)
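Below is a minimal sketch of one draft-then-verify round under greedy decoding, where acceptance simplifies to “the drafted token equals the target’s argmax”; distilgpt2 and gpt2 stand in for a real draft/target pair (they share a tokenizer), and the probabilistic acceptance rule used for sampling is covered in the next subsection.

```python
# One draft-then-verify round, greedy variant: distilgpt2 drafts K tokens,
# gpt2 verifies them all in a single forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")   # small, fast drafter
target = AutoModelForCausalLM.from_pretrained("gpt2")        # model whose output we want

@torch.no_grad()
def speculative_round(ids: torch.Tensor, K: int = 4) -> torch.Tensor:
    prompt_len = ids.shape[1]

    # 1) Draft phase: the small model proposes K tokens autoregressively.
    draft_ids = ids
    for _ in range(K):
        logits = draft(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
    proposed = draft_ids[:, prompt_len:]                     # the K drafted tokens

    # 2) Verify phase: one target pass scores every drafted position at once.
    target_logits = target(draft_ids).logits
    # The target's choice *for* position i comes from its logits at position i-1.
    target_choice = target_logits[:, prompt_len - 1 : -1, :].argmax(-1)

    # 3) Accept the longest prefix where draft and target agree, then append one
    #    token chosen by the target (a correction, or a free "bonus" token).
    agree = (proposed == target_choice)[0].long()
    n_accept = int(agree.cumprod(0).sum())
    accepted = proposed[:, :n_accept]
    if n_accept == K:                                        # everything accepted
        extra = target_logits[:, -1:, :].argmax(-1)          # bonus token
    else:
        extra = target_choice[:, n_accept : n_accept + 1]    # target's correction
    return torch.cat([ids, accepted, extra], dim=-1)

ids = tok("Speculative decoding is", return_tensors="pt").input_ids
for _ in range(5):                                           # each round yields 1..K+1 tokens
    ids = speculative_round(ids)
print(tok.decode(ids[0]))
```

For brevity the sketch recomputes the full prefix on every pass; production implementations reuse KV caches for both models.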
Why It’s Mathematically Exact
The acceptance criterion ensures the final output distribution is identical to running the target model alone. Each drafted token x is accepted with probability min(1, p_target(x) / p_draft(x)); when a token is rejected, the replacement is sampled from a corrected distribution proportional to max(0, p_target - p_draft), which restores exactly the probability mass the draft model under-represented. This rejection sampling scheme guarantees correctness.
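In code, the per-token rule looks roughly like the following sketch, where p_target and p_draft are the two models’ full-vocabulary distributions at the position being checked (the function and argument names are illustrative):

```python
# Lossless accept/resample rule for sampled (non-greedy) decoding:
# accept drafted token x with probability min(1, p_target[x] / p_draft[x]);
# on rejection, resample from the residual distribution norm(max(0, p_target - p_draft)).
import numpy as np

def accept_or_resample(x: int, p_target: np.ndarray, p_draft: np.ndarray,
                       rng: np.random.Generator) -> tuple[bool, int]:
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return True, x                               # keep the drafted token
    residual = np.maximum(p_target - p_draft, 0.0)   # mass the draft under-covers
    residual /= residual.sum()                       # corrected distribution
    return False, int(rng.choice(len(residual), p=residual))
```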
Implementation Basics
Key Parameters
- Draft model selection: The draft model should be much smaller (10-100x fewer parameters), share the target’s tokenizer, and be trained on similar data. Many model families provide draft variants, or you can use a smaller model from the same family
- Speculation length (K): How many tokens to draft before verification. Typical values are 4-8 tokens. Too high wastes compute on rejections; too low doesn’t amortize the target model pass (see the back-of-envelope calculation after this list)
- Acceptance rule: With the standard lossless rule there is nothing to tune and outputs match the target model exactly; some implementations also expose relaxed acceptance criteria that trade strict equivalence for additional speed
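As a rough way to reason about K: under the simplifying assumption that each drafted token is accepted independently with probability alpha, the expected number of tokens produced per target forward pass is (1 - alpha^(K+1)) / (1 - alpha), as in the sketch below.

```python
# Back-of-envelope for choosing K: expected tokens per target forward pass,
# assuming each drafted token is accepted independently with probability alpha
# (a simplification; real acceptance rates are prompt- and position-dependent).
def expected_tokens_per_target_pass(alpha: float, K: int) -> float:
    return (1 - alpha ** (K + 1)) / (1 - alpha)

for K in (2, 4, 8):
    print(f"K={K}: {expected_tokens_per_target_pass(0.8, K):.2f}")
# K=2: 2.44, K=4: 3.36, K=8: 4.33; gains flatten while draft cost grows linearly with K.
```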
When Speculative Decoding Excels
- Long-form generation where many tokens must be produced
- High-latency scenarios where parallel verification is much faster than sequential generation
- When a good draft model exists (same tokenizer, similar training data)
- Memory-bandwidth bound systems (most modern deployments)
Limitations
- Requires a compatible draft model, which is not always available
- Adds overhead for short outputs, where the draft phase dominates
- Complex to implement correctly; use libraries like vLLM or Hugging Face TGI that handle it
- Speedup varies with how well the draft model predicts the target
Framework Support
Most modern inference frameworks support speculative decoding: vLLM has built-in support with --speculative-model, Hugging Face TGI offers it for supported model pairs, and TensorRT-LLM includes optimized implementations for NVIDIA GPUs.
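A rough sketch of enabling it through vLLM’s offline Python API follows; the exact argument names have shifted across vLLM releases (newer versions group them into a speculative config), and the model names are placeholders, so treat the details as assumptions to check against the docs for your version.

```python
# Sketch: speculative decoding via vLLM's offline API. Argument names
# (speculative_model, num_speculative_tokens) vary by vLLM version, and newer
# releases use a single speculative config dict, so check your version's docs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",                # target (placeholder)
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",     # draft (placeholder)
    num_speculative_tokens=5,                                  # K from the section above
)
outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```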
Source
Speculative decoding achieves 2-3x speedup on large language models by using a draft model to generate candidate tokens that the target model verifies in parallel, with mathematically guaranteed identical outputs to standard decoding.
https://arxiv.org/abs/2211.17192