Encoder-Decoder
Definition
A neural network architecture with two components: an encoder that processes input into a representation, and a decoder that generates output from that representation. Used for sequence-to-sequence tasks.
Why It Matters
Encoder-decoder architectures were the original Transformer design and remain important for specific tasks. Understanding this architecture helps you choose the right model for your use case and understand why different models excel at different tasks.
While most modern LLMs (GPT, Claude, Llama) are decoder-only, encoder-decoder models like T5 and BART still excel at certain tasks: translation, summarization, and structured generation. Some multimodal models also use encoder-decoder designs.
For AI engineers, knowing when to use encoder-decoder vs. decoder-only architectures is a practical skill. The choice affects both capability and efficiency.
Implementation Basics
Architecture Components
Encoder:
- Bidirectional self-attention (sees full input)
- Processes entire input in parallel
- Creates rich contextual representations
- No masking needed
Decoder:
- Causal self-attention (can’t see future)
- Cross-attention to encoder outputs
- Generates output autoregressively
- Each token can attend to all encoded positions
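The split above can be sketched directly with PyTorch's built-in layers. This is a minimal illustration rather than any particular model: the sizes, layer counts, and random inputs are placeholder choices, and a real model would add a tokenizer, positional information, and an output projection.

```python
import torch
import torch.nn as nn

d_model, nhead, vocab = 256, 8, 1000
embed = nn.Embedding(vocab, d_model)

# Encoder: bidirectional self-attention over the full input, no mask.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2
)

# Decoder: causal self-attention plus cross-attention to the encoder output.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=2
)

src = embed(torch.randint(0, vocab, (1, 12)))   # full input sequence
tgt = embed(torch.randint(0, vocab, (1, 5)))    # output generated so far

memory = encoder(src)                           # runs once, fully in parallel
causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
out = decoder(tgt, memory, tgt_mask=causal)     # each target position sees past
                                                # targets and all of `memory`
print(out.shape)                                # torch.Size([1, 5, 256])
```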
Information Flow
Input → Encoder → Encoded Representation
                            ↓  (cross-attention)
                         Decoder → Output (token by token)
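To make this flow concrete, here is a hedged sketch using the Hugging Face transformers library with the public t5-small checkpoint: the encoder is called once, then a greedy decoding loop produces one token at a time. The prompt and length limit are arbitrary illustrative choices; in practice you would usually just call model.generate().

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tok("summarize: The encoder reads the whole document in parallel, "
          "and the decoder writes the summary one token at a time.",
          return_tensors="pt")

# Encoder: runs once over the full input.
encoder_outputs = model.get_encoder()(**enc)

# Decoder: runs once per output token, cross-attending to the encoded input.
ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(30):
    logits = model(encoder_outputs=encoder_outputs,
                   attention_mask=enc.attention_mask,
                   decoder_input_ids=ids).logits
    next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy choice
    ids = torch.cat([ids, next_id], dim=-1)
    if next_id.item() == model.config.eos_token_id:
        break

print(tok.decode(ids[0], skip_special_tokens=True))
```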
Model Examples
- Encoder-Decoder: T5, BART, mT5, FLAN-T5
- Encoder-Only: BERT, RoBERTa (classification, embeddings)
- Decoder-Only: GPT, Claude, Llama (generation)
When to Use Encoder-Decoder
- Translation: Encode source, decode target
- Summarization: Encode document, decode summary
- Structured output: Encode prompt, decode structure
- When input and output have different “roles”
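For these input-to-output mappings, an off-the-shelf encoder-decoder checkpoint is often enough. A short sketch with the transformers pipeline API, using t5-small as an illustrative (not prescriptive) model choice:

```python
from transformers import pipeline

# Translation: encode the English source, decode the German target.
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("The house is wonderful.")[0]["translation_text"])

# Summarization: encode the document, decode a shorter summary.
summarizer = pipeline("summarization", model="t5-small")
doc = ("Encoder-decoder models read the whole input with a bidirectional "
       "encoder and then generate the output with an autoregressive decoder.")
print(summarizer(doc, max_length=20, min_length=5)[0]["summary_text"])
```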
When Decoder-Only Works Better
- General text generation
- Instruction following
- In-context learning
- When you want a unified model for many tasks
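For contrast, a decoder-only sketch with GPT-2 (model and prompt are illustrative): there is no separate encoder and no cross-attention, so the prompt and its continuation flow through a single causal stack.

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The prompt is simply the start of the token stream the model extends.
ids = tok("The encoder-decoder Transformer was originally designed for",
          return_tensors="pt")
out = model.generate(**ids, max_new_tokens=25, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```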
Cross-Attention Explained
- Decoder queries attend to encoder keys/values
- Every decoder position can see every encoder position
- Allows decoder to “look back” at input while generating
- Critical for tasks where output maps to input
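A minimal, self-contained sketch of single-head cross-attention. The learned query/key/value projection matrices found in real models are omitted for brevity, and the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def cross_attention(decoder_states, encoder_states):
    # decoder_states: (tgt_len, d)   encoder_states: (src_len, d)
    d = decoder_states.size(-1)
    q = decoder_states            # queries: one per decoder position
    k = v = encoder_states        # keys/values: one per encoder position
    scores = q @ k.T / d ** 0.5   # (tgt_len, src_len): every decoder position
                                  # scores every encoder position
    weights = F.softmax(scores, dim=-1)   # no causal mask: the input is not "future"
    return weights @ v            # weighted mix of encoder values

out = cross_attention(torch.randn(5, 64), torch.randn(12, 64))
print(out.shape)                  # torch.Size([5, 64])
```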
Efficiency Considerations
- Encoder runs once per input
- Decoder runs per output token
- For long outputs, decoder dominates compute
- Encoder-decoder can be more efficient for translation-like tasks, where the input is encoded once and reused at every decoding step
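A back-of-the-envelope sketch of that split, counting only the dominant attention (≈ len² · d) and MLP (≈ len · d²) terms per layer and ignoring constant factors; the formula and example sizes are simplifying assumptions, not measurements:

```python
def rough_cost(src_len, tgt_len, d_model, n_layers):
    # Encoder: self-attention + MLP, run once over the input.
    encoder = n_layers * (src_len**2 * d_model + src_len * d_model**2)
    # Decoder: causal self-attention + cross-attention to the input + MLP,
    # accumulated over the whole generated sequence.
    decoder = n_layers * (tgt_len**2 * d_model
                          + tgt_len * src_len * d_model
                          + tgt_len * d_model**2)
    return encoder, decoder

# Long input, short output (summarization-like): encoder share is large.
print(rough_cost(src_len=1024, tgt_len=64, d_model=768, n_layers=12))
# Short input, long output: decoder dominates.
print(rough_cost(src_len=64, tgt_len=1024, d_model=768, n_layers=12))
```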
Modern Trends
- Decoder-only dominates due to versatility
- Encoder-decoder still strong for specific tasks
- Multimodal models often use encoders for non-text inputs (images, audio)
- Hybrid approaches emerging
Source
The Transformer follows an encoder-decoder structure: the encoder maps an input sequence to a sequence of continuous representations, which the decoder then uses to generate an output sequence one element at a time.
https://arxiv.org/abs/1706.03762