Autoregressive
Definition
A generation approach where models predict one token at a time, with each new token conditioned on all previously generated tokens. The standard method for LLM text generation.
Why It Matters
Autoregressive generation is how every major LLM produces text. Understanding this process explains key LLM behaviors: why generation takes time proportional to output length, why models can’t “go back” and edit, and why the first tokens in a response influence everything that follows.
The autoregressive approach is both powerful and limited. It enables coherent, contextually appropriate text, but it makes parallel generation impossible and creates strong dependencies on early tokens. This is one reason prompt engineering matters: the prompt conditions everything the model generates afterward.
For AI engineers, understanding autoregressive generation helps with optimization (batching, caching), debugging (why did the model go in that direction?), and architecture decisions (when to use alternatives).
Implementation Basics
How It Works
- Model receives input tokens
- Predicts probability distribution over vocabulary for next token
- Samples or selects next token
- Appends token to sequence
- Repeats until stop condition (max length, EOS token)
Generation Loop
tokens = list(prompt_tokens)        # start from the tokenized prompt
while True:
    probs = model(tokens)[-1]       # distribution over the vocabulary for the next token
    next_token = sample(probs)      # greedy, temperature, top-k, top-p, ...
    tokens.append(next_token)       # the new token becomes part of the context
    if next_token == EOS or len(tokens) >= max_length:
        break                       # stop on end-of-sequence token or length limit
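The loop above is schematic: model and sample are stand-ins. As a concrete, hedged illustration, the sketch below runs the same loop with a causal language model from the Hugging Face transformers library, using greedy selection and recomputing the full forward pass at every step; the model name "gpt2", the prompt, and the 20-token limit are arbitrary choices for the example.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                 # cap on new tokens for this example
        logits = model(input_ids).logits                # full forward pass over the whole sequence
        next_id = logits[:, -1, :].argmax(dim=-1)       # greedy: pick the highest-probability token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:    # stop at end-of-sequence
            break

print(tokenizer.decode(input_ids[0]))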
Sampling Strategies
- Greedy: Always pick the highest-probability token
- Temperature: Scale logits to control randomness (see the sampling sketch after this list)
- Top-k: Sample only from the k most likely tokens
- Top-p (nucleus): Sample from the smallest set of tokens whose cumulative probability reaches p
- Beam search: Track multiple candidate sequences and keep the highest-scoring ones
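To make these concrete, here is a hedged sketch of temperature, top-k, and top-p sampling applied to a single logits vector. The function name sample_next and its defaults are illustrative, not taken from any particular library.

import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    """Pick the next token id from a vector of logits (illustrative sketch)."""
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)  # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:                        # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()

    if top_p is not None:                        # nucleus: smallest set with cumulative prob >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
        probs /= probs.sum()

    return int(np.random.choice(len(probs), p=probs))   # greedy is the limit as temperature -> 0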
Computational Implications
- Each token requires a forward pass
- Generation time = O(output_length)
- Can’t parallelize output generation
- KV cache reduces redundant computation
KV Cache Optimization
- Without cache: recompute attention over all previous tokens at every step
- With cache: store each token's key/value vectors and compute attention only for the new token (see the sketch after this list)
- Massive speedup (10-100x for long sequences)
- Memory vs. compute tradeoff
- Why GPU memory matters for inference
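As a hedged illustration of this tradeoff, the sketch below reuses past_key_values from the Hugging Face transformers library so that, after the first step, only the newest token is fed through the model; the model name and token count are placeholders, and exact cache APIs vary across library versions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tokenizer("Autoregressive decoding with a KV cache", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(10):
        step_input = ids if past is None else ids[:, -1:]  # after step 1, feed only the new token
        out = model(step_input, past_key_values=past, use_cache=True)
        past = out.past_key_values                         # cached keys/values for all prior tokens
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tokenizer.decode(ids[0]))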
Limitations
- Sequential generation is slow
- Early tokens heavily influence later ones
- Can’t revise previous outputs
- Potential for error accumulation
- Exposure bias: trained on ground-truth prefixes but generates from its own (possibly imperfect) outputs at inference time
Alternatives
- Non-autoregressive: Predict all tokens in parallel (faster, less coherent)
- Semi-autoregressive: Generate chunks at a time
- Diffusion models: Iterative refinement (used for images, emerging for text)
Practical Tips
- Use streaming for better UX (show tokens as they are generated; see the sketch after this list)
- Batch multiple requests for efficiency
- KV cache management is critical for serving
- Temperature affects coherence vs. creativity tradeoff
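As one hedged example of streaming, the transformers library provides TextIteratorStreamer, which yields decoded text pieces while generate() runs in a background thread; the model choice and generation settings below are placeholders, not recommendations.

from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Explain autoregressive generation in one paragraph:", return_tensors="pt")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

# Generation runs in the background; the streamer yields text as tokens are produced.
Thread(target=model.generate,
       kwargs=dict(**inputs, max_new_tokens=60, do_sample=True,
                   temperature=0.7, streamer=streamer)).start()

for piece in streamer:
    print(piece, end="", flush=True)   # display incrementally instead of waiting for completion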
Source
GPT-3 is an autoregressive language model that generates text by predicting the next token given all preceding tokens, enabling coherent long-form generation.
https://arxiv.org/abs/2005.14165