Architecture

Autoregressive

Definition

A generation approach in which the model predicts one token at a time, each new token conditioned on the prompt and all previously generated tokens. The standard method for LLM text generation.

Why It Matters

Autoregressive generation is how every major LLM produces text. Understanding this process explains key LLM behaviors: why generation takes time proportional to output length, why models can’t “go back” and edit, and why the first tokens in a response influence everything that follows.

The autoregressive approach is both powerful and limited. It enables coherent, contextually appropriate text, but it makes parallel generation of the output impossible and creates strong dependencies on early tokens. This is one reason prompt engineering matters: the prompt conditions every token that follows.

For AI engineers, understanding autoregressive generation helps with optimization (batching, caching), debugging (why did the model go in that direction?), and architecture decisions (when to use alternatives).

Implementation Basics

How It Works

  1. Model receives input tokens
  2. Predicts probability distribution over vocabulary for next token
  3. Samples or selects next token
  4. Appends token to sequence
  5. Repeats until stop condition (max length, EOS token)

Generation Loop

tokens = [prompt tokens]                  # start from the encoded prompt
while True:
    probs = model(tokens)[-1]             # probability distribution over the vocabulary for the next token
    next_token = sample(probs)            # greedy, temperature, top-k, top-p, ...
    tokens.append(next_token)             # the new token becomes part of the context
    if next_token == EOS or len(tokens) >= max_length:
        break                             # stop on end-of-sequence or length limit
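
The same loop written against a real model looks roughly like the sketch below. It assumes the Hugging Face transformers library and the small GPT-2 checkpoint purely for illustration (neither is named above); any decoder-only causal LM follows the same pattern.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: transformers is installed and "gpt2" is used only as a small,
# convenient causal LM for the example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
max_new_tokens = 20

with torch.no_grad():
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]               # scores for the next token
        next_token = torch.argmax(logits, dim=-1, keepdim=True)  # greedy selection
        input_ids = torch.cat([input_ids, next_token], dim=-1)   # append and repeat
        if next_token.item() == tokenizer.eos_token_id:          # stop at end-of-sequence
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))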

Sampling Strategies

  • Greedy: Always pick highest probability token
  • Temperature: Scale logits to control randomness
  • Top-k: Sample from k most likely tokens
  • Top-p (nucleus): Sample from smallest set summing to p
  • Beam search: Track multiple candidates
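
To make the differences concrete, the sketch below applies temperature, top-k, and top-p filtering to a toy vector of logits before sampling. The helper name, the toy logits, and the parameter defaults are invented for illustration; in practice these appear as decoding parameters of the inference API.

import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick a next-token id from raw logits (illustrative helper, not a library API)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)  # temperature scaling
    if top_k is not None:                              # keep only the k most likely tokens
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())              # softmax over the remaining tokens
    probs /= probs.sum()
    if top_p is not None:                              # nucleus: smallest set with cumulative prob >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered / filtered.sum()
    return int(rng.choice(len(probs), p=probs))        # greedy would instead be int(np.argmax(probs))

toy_logits = [2.0, 1.0, 0.5, -1.0]                     # toy vocabulary of four tokens
print(sample_next(toy_logits, temperature=0.7, top_k=3, top_p=0.9))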

Computational Implications

  • Each token requires a forward pass
  • Generation time = O(output_length)
  • Can’t parallelize output generation
  • KV cache reduces redundant computation

KV Cache Optimization

  • Without cache: recompute attention over all previous tokens at every step
  • With cache: store each token’s key/value vectors and compute attention only for the new token
  • Massive speedup (10-100x for long sequences)
  • Memory vs. compute tradeoff
  • Why GPU memory matters for inference
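
A minimal sketch of the cached loop, again assuming the Hugging Face transformers library and GPT-2 as an illustrative model: the first forward pass processes the whole prompt and returns past_key_values, and every later step feeds back the cache together with only the newest token.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: transformers + "gpt2" serve only as a convenient example model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("Autoregressive decoding with a cache", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, use_cache=True)             # prompt pass: attention over all tokens once
    past = out.past_key_values                         # cached key/value tensors for every layer
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)

    for _ in range(10):
        # Only the newest token is passed in; its keys/values get appended to the cache.
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))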

Limitations

  • Sequential generation is slow
  • Early tokens heavily influence later ones
  • Can’t revise previous outputs
  • Potential for error accumulation
  • Exposure bias: trained on ground truth, generates from own outputs

Alternatives

  • Non-autoregressive: Predict all tokens in parallel (faster, less coherent)
  • Semi-autoregressive: Generate chunks at a time
  • Diffusion models: Iterative refinement (used for images, emerging for text)

Practical Tips

  • Use streaming for better UX (show tokens as they are generated; see the sketch after this list)
  • Batch multiple requests for efficiency
  • KV cache management is critical for serving
  • Temperature affects coherence vs. creativity tradeoff
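
Streaming falls out of the loop naturally: tokens can be handed to the caller the moment they are sampled. The sketch below wraps the greedy loop from the earlier example in a generator; model and tokenizer are assumed to be the same illustrative GPT-2 objects as above.

import torch

def stream_generate(model, tokenizer, prompt, max_new_tokens=50):
    """Yield decoded pieces one token at a time as they are generated (illustrative sketch)."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids).logits[:, -1, :]
            next_token = torch.argmax(logits, dim=-1, keepdim=True)  # greedy for simplicity
            if next_token.item() == tokenizer.eos_token_id:
                return
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            yield tokenizer.decode(next_token[0])                    # hand each piece to the UI immediately

# Usage (a production server would also reuse the KV cache shown above):
# for piece in stream_generate(model, tokenizer, "Once upon a time"):
#     print(piece, end="", flush=True)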

Source

GPT-3 is an autoregressive language model that generates text by predicting the next token given all preceding tokens, enabling coherent long-form generation.

https://arxiv.org/abs/2005.14165