Speculative Decoding
Definition
An inference acceleration technique that uses a smaller, faster draft model to predict multiple tokens, which are then verified in parallel by the larger target model, yielding significant speedups while provably preserving the target model's output distribution.
Why It Matters
Speculative decoding is one of the most effective techniques for making LLM inference faster without sacrificing quality. Unlike quantization or pruning, which trade accuracy for speed, speculative decoding produces an output distribution mathematically identical to the target model's while achieving 2-3x speedups in practice.
The key insight is that LLM inference is bottlenecked by memory bandwidth, not computation. Each new token requires loading all model weights from GPU memory, which is slow. Speculative decoding amortizes this cost by verifying multiple candidate tokens in a single forward pass of the large model.
For AI engineers, this technique is increasingly important as inference costs dominate production expenses. It’s particularly valuable when you can’t compromise on output quality, since the target model’s distribution is preserved exactly. Many production systems now enable speculative decoding by default.
How It Works
The Basic Principle
Standard autoregressive decoding generates one token at a time: run the model, sample a token, append it, repeat. Each step requires reading all model weights from GPU memory, so generation is memory-bandwidth bound.
Speculative decoding inverts this: a small draft model quickly predicts multiple tokens ahead, then the large target model verifies all predictions in a single parallel forward pass. Accepted tokens are kept; the first rejected token gets resampled from the target model’s corrected distribution.
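For reference, here is a minimal sketch of the standard loop being replaced, using Hugging Face transformers with gpt2 purely as an illustrative model (a real implementation would also reuse a KV cache; the point is the one-token-per-pass structure):

```python
# Minimal sketch of standard greedy autoregressive decoding: one token per
# forward pass, each pass streaming all model weights from memory.
# gpt2 is used purely as a small illustrative model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Speculative decoding is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                          # 20 new tokens -> 20 full forward passes
        logits = model(ids).logits[:, -1, :]     # only the last position is used
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```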
Step-by-Step Process
- Draft phase: The small draft model generates K candidate tokens autoregressively (fast because the model is small)
- Verify phase: The target model processes all K candidates in parallel, computing probabilities for each position
- Accept/reject: Compare draft and target probabilities position by position. Each drafted token is accepted with probability min(1, p_target/p_draft); the first rejected token is resampled from a corrected distribution
- Repeat: Continue drafting from the last accepted position (one full round is sketched below)
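Below is a minimal sketch of one draft-then-verify round under greedy decoding, where acceptance simplifies to “the drafted token equals the target’s argmax”; distilgpt2 and gpt2 stand in for a real draft/target pair (they share a tokenizer), and the probabilistic acceptance rule used for sampling is covered in the next subsection.

```python
# One draft-then-verify round, greedy variant: distilgpt2 drafts K tokens,
# gpt2 verifies them all in a single forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")   # small, fast drafter
target = AutoModelForCausalLM.from_pretrained("gpt2")        # model whose output we want

@torch.no_grad()
def speculative_round(ids: torch.Tensor, K: int = 4) -> torch.Tensor:
    prompt_len = ids.shape[1]

    # 1) Draft phase: the small model proposes K tokens autoregressively.
    draft_ids = ids
    for _ in range(K):
        logits = draft(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
    proposed = draft_ids[:, prompt_len:]                     # the K drafted tokens

    # 2) Verify phase: one target pass scores every drafted position at once.
    target_logits = target(draft_ids).logits
    # The target's choice *for* position i comes from its logits at position i-1.
    target_choice = target_logits[:, prompt_len - 1 : -1, :].argmax(-1)

    # 3) Accept the longest prefix where draft and target agree, then append one
    #    token chosen by the target (a correction, or a free "bonus" token).
    agree = (proposed == target_choice)[0].long()
    n_accept = int(agree.cumprod(0).sum())
    accepted = proposed[:, :n_accept]
    if n_accept == K:                                        # everything accepted
        extra = target_logits[:, -1:, :].argmax(-1)          # bonus token
    else:
        extra = target_choice[:, n_accept : n_accept + 1]    # target's correction
    return torch.cat([ids, accepted, extra], dim=-1)

ids = tok("Speculative decoding is", return_tensors="pt").input_ids
for _ in range(5):                                           # each round yields 1..K+1 tokens
    ids = speculative_round(ids)
print(tok.decode(ids[0]))
```

For brevity the sketch recomputes the full prefix on every pass; production implementations reuse KV caches for both models.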
Why It’s Mathematically Exact
The acceptance criterion ensures the final output distribution is identical to running the target model alone. Each drafted token x is accepted with probability min(1, p_target(x) / p_draft(x)); when a token is rejected, the replacement is sampled from a corrected distribution proportional to max(0, p_target - p_draft), which restores exactly the probability mass the draft model under-represented. This rejection sampling scheme guarantees correctness.
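In code, the per-token rule looks roughly like the following sketch, where p_target and p_draft are the two models’ full-vocabulary distributions at the position being checked (the function and argument names are illustrative):

```python
# Lossless accept/resample rule for sampled (non-greedy) decoding:
# accept drafted token x with probability min(1, p_target[x] / p_draft[x]);
# on rejection, resample from the residual distribution norm(max(0, p_target - p_draft)).
import numpy as np

def accept_or_resample(x: int, p_target: np.ndarray, p_draft: np.ndarray,
                       rng: np.random.Generator) -> tuple[bool, int]:
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return True, x                               # keep the drafted token
    residual = np.maximum(p_target - p_draft, 0.0)   # mass the draft under-covers
    residual /= residual.sum()                       # corrected distribution
    return False, int(rng.choice(len(residual), p=residual))
```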
Implementation Basics
Key Parameters
- Draft model selection: The draft model should be much smaller (10-100x fewer parameters), share the target’s tokenizer, and be trained on similar data. Many model families provide draft variants, or you can use a smaller model from the same family
- Speculation length (K): How many tokens to draft before verification. Typical values are 4-8 tokens. Too high wastes compute on rejections; too low doesn’t amortize the target model pass (see the back-of-envelope calculation after this list)
- Acceptance rule: With the standard lossless rule there is nothing to tune and outputs match the target model exactly; some implementations also expose relaxed acceptance criteria that trade strict equivalence for additional speed
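As a rough way to reason about K: under the simplifying assumption that each drafted token is accepted independently with probability alpha, the expected number of tokens produced per target forward pass is (1 - alpha^(K+1)) / (1 - alpha), as in the sketch below.

```python
# Back-of-envelope for choosing K: expected tokens per target forward pass,
# assuming each drafted token is accepted independently with probability alpha
# (a simplification; real acceptance rates are prompt- and position-dependent).
def expected_tokens_per_target_pass(alpha: float, K: int) -> float:
    return (1 - alpha ** (K + 1)) / (1 - alpha)

for K in (2, 4, 8):
    print(f"K={K}: {expected_tokens_per_target_pass(0.8, K):.2f}")
# K=2: 2.44, K=4: 3.36, K=8: 4.33; gains flatten while draft cost grows linearly with K.
```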
When Speculative Decoding Excels
- Long-form generation where many tokens must be produced
- High-latency scenarios where parallel verification is much faster than sequential generation
- When a good draft model exists (same tokenizer, similar training data)
- Memory-bandwidth bound systems (most modern deployments)
Limitations
- Requires a compatible draft model, which is not always available
- Adds overhead for short outputs, where the draft phase dominates
- Complex to implement correctly; use libraries like vLLM or Hugging Face TGI that handle it
- Speedup varies with how well the draft model predicts the target
Framework Support
Most modern inference frameworks support speculative decoding: vLLM has built-in support with --speculative-model, Hugging Face TGI offers it for supported model pairs, and TensorRT-LLM includes optimized implementations for NVIDIA GPUs.
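A rough sketch of enabling it through vLLM’s offline Python API follows; the exact argument names have shifted across vLLM releases (newer versions group them into a speculative config), and the model names are placeholders, so treat the details as assumptions to check against the docs for your version.

```python
# Sketch: speculative decoding via vLLM's offline API. Argument names
# (speculative_model, num_speculative_tokens) vary by vLLM version, and newer
# releases use a single speculative config dict, so check your version's docs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",                # target (placeholder)
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",     # draft (placeholder)
    num_speculative_tokens=5,                                  # K from the section above
)
outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```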
Source
Speculative decoding achieves 2-3x speedup on large language models by using a draft model to generate candidate tokens that the target model verifies in parallel, with mathematically guaranteed identical outputs to standard decoding.
https://arxiv.org/abs/2211.17192