Architecture

Encoder-Decoder

Definition

A neural network architecture with two components: an encoder that processes the input into an intermediate representation, and a decoder that generates the output from that representation. Used for sequence-to-sequence tasks such as translation and summarization.

Why It Matters

Encoder-decoder architectures were the original Transformer design and remain important for specific tasks. Understanding this architecture helps you choose the right model for your use case and understand why different models excel at different tasks.

While most modern LLMs (GPT, Claude, Llama) are decoder-only, encoder-decoder models like T5 and BART still excel at certain tasks: translation, summarization, and structured generation. Some multimodal models also use encoder-decoder designs.

For AI engineers, knowing when to use encoder-decoder vs. decoder-only architectures is a practical skill. The choice affects both capability and efficiency.

Implementation Basics

Architecture Components

Encoder:

  • Bidirectional self-attention (sees full input)
  • Processes entire input in parallel
  • Creates rich contextual representations
  • No masking needed

Decoder:

  • Causal self-attention (can’t see future)
  • Cross-attention to encoder outputs
  • Generates output autoregressively
  • Each token can attend to all encoded positions
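
A minimal sketch of these two components using PyTorch's built-in transformer layers. This is illustrative only: the layer count and dimensions are arbitrary, and real models add token embeddings, positional information, and an output head.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6

# Encoder: bidirectional self-attention over the whole input, no mask
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)

# Decoder: causal self-attention plus cross-attention to the encoder outputs
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)

src = torch.randn(1, 20, d_model)  # embedded input sequence (20 tokens)
tgt = torch.randn(1, 7, d_model)   # embedded output generated so far (7 tokens)

memory = encoder(src)              # encoded representation, computed once

# Causal mask: each output position may only attend to earlier positions
tgt_len = tgt.size(1)
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

out = decoder(tgt, memory, tgt_mask=causal_mask)  # cross-attends to `memory`
print(out.shape)                   # torch.Size([1, 7, 512])
```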

Information Flow

Input → Encoder → Encoded Representation
                          ↓ (cross-attention)
                       Decoder → Output (token by token)
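
Libraries such as Hugging Face transformers wrap this flow behind a single generate call. A minimal sketch, assuming the transformers library is installed and using the public t5-small checkpoint with the translation prefix T5 was trained on:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The encoder consumes the full input once; the decoder then generates token
# by token, cross-attending to the encoded representation at every step.
inputs = tokenizer("translate English to German: The house is small.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```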

Model Examples

  • Encoder-Decoder: T5, BART, mT5, FLAN-T5
  • Encoder-Only: BERT, RoBERTa (classification, embeddings)
  • Decoder-Only: GPT, Claude, Llama (generation)
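
In the Hugging Face transformers library these families map onto different auto classes; the checkpoints below are just small, publicly available examples:

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")       # T5, BART
encoder_only    = AutoModel.from_pretrained("bert-base-uncased")          # BERT: embeddings/classification
decoder_only    = AutoModelForCausalLM.from_pretrained("gpt2")            # GPT-style generation
```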

When to Use Encoder-Decoder

  • Translation: Encode source, decode target
  • Summarization: Encode document, decode summary
  • Structured output: Encode prompt, decode structure
  • When input and output have different “roles”

When Decoder-Only Works Better

  • General text generation
  • Instruction following
  • In-context learning
  • When you want a unified model for many tasks
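
For contrast, a decoder-only model handles the same kind of task as plain text continuation, with any instructions or examples placed in the prompt (in-context learning). A rough sketch using the transformers text-generation pipeline with gpt2 as a small stand-in (it will not translate well; the point is the prompt format):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Few-shot prompt: the task examples live in the input itself, no separate encoder
prompt = (
    "English: Good morning. -> German: Guten Morgen.\n"
    "English: Thank you. -> German: Danke.\n"
    "English: The house is small. -> German:"
)
print(generator(prompt, max_new_tokens=10)[0]["generated_text"])
```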

Cross-Attention Explained

  • Decoder queries attend to encoder keys/values
  • Every decoder position can see every encoder position
  • Allows decoder to “look back” at input while generating
  • Critical for tasks where output maps to input
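
Stripped down to a single head, cross-attention is ordinary scaled dot-product attention with queries taken from the decoder and keys/values taken from the encoder. A self-contained sketch, with random projections standing in for learned weights:

```python
import math
import torch
import torch.nn.functional as F

d = 64
enc = torch.randn(1, 20, d)   # encoder outputs: 20 input positions
dec = torch.randn(1, 7, d)    # decoder states: 7 output positions so far

# Projection matrices (randomly initialized here, learned in a real model)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

Q = dec @ Wq                  # queries come from the decoder
K, V = enc @ Wk, enc @ Wv     # keys/values come from the encoder

scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # shape (1, 7, 20)
weights = F.softmax(scores, dim=-1)               # each output position weighs
context = weights @ V                             # ...all 20 input positions
print(context.shape)          # torch.Size([1, 7, 64])
```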

Efficiency Considerations

  • Encoder runs once per input
  • Decoder runs per output token
  • For long outputs, decoder dominates compute
  • Encoder-decoder can be more efficient for translation-like tasks
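
A rough way to see where the compute goes is to count forward passes only. This is a simplification that ignores model size, KV caching, and batching; the function and numbers below are purely illustrative:

```python
def forward_passes(n_inputs: int, avg_output_tokens: int) -> dict:
    """Count encoder passes vs. incremental decoder steps for a batch of inputs."""
    return {
        "encoder_passes": n_inputs,                      # encoder runs once per input
        "decoder_steps": n_inputs * avg_output_tokens,   # decoder runs once per output token
    }

# Summarizing 1,000 documents into ~200-token summaries:
print(forward_passes(1_000, 200))
# {'encoder_passes': 1000, 'decoder_steps': 200000} -> decoder work dominates for long outputs
```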

Modern Trends

  • Decoder-only dominates due to versatility
  • Encoder-decoder still strong for specific tasks
  • Multimodal models often use dedicated encoders for non-text inputs (images, audio)
  • Hybrid approaches are emerging

Source

The Transformer follows an encoder-decoder structure: the encoder maps an input sequence to a sequence of continuous representations, which the decoder then uses to generate an output sequence one element at a time.

https://arxiv.org/abs/1706.03762