Encoder-Decoder
Definition
A neural network architecture with two components: an encoder that processes input into a representation, and a decoder that generates output from that representation. Used for sequence-to-sequence tasks.
Why It Matters
Encoder-decoder architectures were the original Transformer design and remain important for specific tasks. Understanding this architecture helps you choose the right model for your use case and understand why different models excel at different tasks.
While most modern LLMs (GPT, Claude, Llama) are decoder-only, encoder-decoder models like T5 and BART still excel at certain tasks: translation, summarization, and structured generation. Some multimodal models also use encoder-decoder designs.
For AI engineers, knowing when to use encoder-decoder vs. decoder-only architectures is a practical skill. The choice affects both capability and efficiency.
Implementation Basics
Architecture Components
Encoder:
- Bidirectional self-attention (sees full input)
- Processes entire input in parallel
- Creates rich contextual representations
- No masking needed
Decoder:
- Causal self-attention (can’t see future)
- Cross-attention to encoder outputs
- Generates output autoregressively
- Each token can attend to all encoded positions
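The split above can be sketched directly with PyTorch's built-in layers. This is a minimal illustration rather than any particular model: the sizes, layer counts, and random inputs are placeholder choices, and a real model would add a tokenizer, positional information, and an output projection.

```python
import torch
import torch.nn as nn

d_model, nhead, vocab = 256, 8, 1000
embed = nn.Embedding(vocab, d_model)

# Encoder: bidirectional self-attention over the full input, no mask.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2
)

# Decoder: causal self-attention plus cross-attention to the encoder output.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=2
)

src = embed(torch.randint(0, vocab, (1, 12)))   # full input sequence
tgt = embed(torch.randint(0, vocab, (1, 5)))    # output generated so far

memory = encoder(src)                           # runs once, fully in parallel
causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
out = decoder(tgt, memory, tgt_mask=causal)     # each target position sees past
                                                # targets and all of `memory`
print(out.shape)                                # torch.Size([1, 5, 256])
```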
Information Flow
Input → Encoder → Encoded Representation
                            ↓  (cross-attention)
                         Decoder → Output (token by token)
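To make this flow concrete, here is a hedged sketch using the Hugging Face transformers library with the public t5-small checkpoint: the encoder is called once, then a greedy decoding loop produces one token at a time. The prompt and length limit are arbitrary illustrative choices; in practice you would usually just call model.generate().

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tok("summarize: The encoder reads the whole document in parallel, "
          "and the decoder writes the summary one token at a time.",
          return_tensors="pt")

# Encoder: runs once over the full input.
encoder_outputs = model.get_encoder()(**enc)

# Decoder: runs once per output token, cross-attending to the encoded input.
ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(30):
    logits = model(encoder_outputs=encoder_outputs,
                   attention_mask=enc.attention_mask,
                   decoder_input_ids=ids).logits
    next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy choice
    ids = torch.cat([ids, next_id], dim=-1)
    if next_id.item() == model.config.eos_token_id:
        break

print(tok.decode(ids[0], skip_special_tokens=True))
```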
Model Examples
- Encoder-Decoder: T5, BART, mT5, FLAN-T5
- Encoder-Only: BERT, RoBERTa (classification, embeddings)
- Decoder-Only: GPT, Claude, Llama (generation)
When to Use Encoder-Decoder
- Translation: Encode source, decode target
- Summarization: Encode document, decode summary
- Structured output: Encode prompt, decode structure
- When input and output have different “roles”
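For these input-to-output mappings, an off-the-shelf encoder-decoder checkpoint is often enough. A short sketch with the transformers pipeline API, using t5-small as an illustrative (not prescriptive) model choice:

```python
from transformers import pipeline

# Translation: encode the English source, decode the German target.
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("The house is wonderful.")[0]["translation_text"])

# Summarization: encode the document, decode a shorter summary.
summarizer = pipeline("summarization", model="t5-small")
doc = ("Encoder-decoder models read the whole input with a bidirectional "
       "encoder and then generate the output with an autoregressive decoder.")
print(summarizer(doc, max_length=20, min_length=5)[0]["summary_text"])
```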
When Decoder-Only Works Better
- General text generation
- Instruction following
- In-context learning
- When you want a unified model for many tasks
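For contrast, a decoder-only sketch with GPT-2 (model and prompt are illustrative): there is no separate encoder and no cross-attention, so the prompt and its continuation flow through a single causal stack.

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The prompt is simply the start of the token stream the model extends.
ids = tok("The encoder-decoder Transformer was originally designed for",
          return_tensors="pt")
out = model.generate(**ids, max_new_tokens=25, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```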
Cross-Attention Explained
- Decoder queries attend to encoder keys/values
- Every decoder position can see every encoder position
- Allows decoder to “look back” at input while generating
- Critical for tasks where output maps to input
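A minimal, self-contained sketch of single-head cross-attention. The learned query/key/value projection matrices found in real models are omitted for brevity, and the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def cross_attention(decoder_states, encoder_states):
    # decoder_states: (tgt_len, d)   encoder_states: (src_len, d)
    d = decoder_states.size(-1)
    q = decoder_states            # queries: one per decoder position
    k = v = encoder_states        # keys/values: one per encoder position
    scores = q @ k.T / d ** 0.5   # (tgt_len, src_len): every decoder position
                                  # scores every encoder position
    weights = F.softmax(scores, dim=-1)   # no causal mask: the input is not "future"
    return weights @ v            # weighted mix of encoder values

out = cross_attention(torch.randn(5, 64), torch.randn(12, 64))
print(out.shape)                  # torch.Size([5, 64])
```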
Efficiency Considerations
- Encoder runs once per input
- Decoder runs per output token
- For long outputs, decoder dominates compute
- Encoder-decoder can be more efficient for translation-like tasks, where the input is encoded once and reused at every decoding step
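A back-of-the-envelope sketch of that split, counting only the dominant attention (≈ len² · d) and MLP (≈ len · d²) terms per layer and ignoring constant factors; the formula and example sizes are simplifying assumptions, not measurements:

```python
def rough_cost(src_len, tgt_len, d_model, n_layers):
    # Encoder: self-attention + MLP, run once over the input.
    encoder = n_layers * (src_len**2 * d_model + src_len * d_model**2)
    # Decoder: causal self-attention + cross-attention to the input + MLP,
    # accumulated over the whole generated sequence.
    decoder = n_layers * (tgt_len**2 * d_model
                          + tgt_len * src_len * d_model
                          + tgt_len * d_model**2)
    return encoder, decoder

# Long input, short output (summarization-like): encoder share is large.
print(rough_cost(src_len=1024, tgt_len=64, d_model=768, n_layers=12))
# Short input, long output: decoder dominates.
print(rough_cost(src_len=64, tgt_len=1024, d_model=768, n_layers=12))
```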
Modern Trends
- Decoder-only dominates due to versatility
- Encoder-decoder still strong for specific tasks
- Multimodal models often use encoders for non-text inputs (images, audio)
- Hybrid approaches emerging
Source
The Transformer follows an encoder-decoder structure: the encoder maps an input sequence to a sequence of continuous representations, which the decoder then uses to generate an output sequence one element at a time.
https://arxiv.org/abs/1706.03762