Self-Attention
Definition
An attention mechanism where a sequence attends to itself, allowing each position to gather information from all other positions in the same sequence. The core operation in Transformer models.
Why It Matters
Self-attention is what makes Transformers work. It allows every token to “see” every other token in the sequence, regardless of distance. This is fundamentally different from older architectures where information had to propagate step-by-step.
When you ask an LLM a question and it references information from your earlier context, that’s self-attention at work. It’s also why models can understand that “it” in a sentence refers to something mentioned paragraphs earlier.
For AI engineers, self-attention explains both capabilities and costs. It’s why models can maintain coherent long-form generation, but also why processing long contexts is computationally expensive.
Implementation Basics
How Self-Attention Works
- Each token produces Query, Key, and Value vectors
- Each token's Query is compared against every Key, including its own
- The comparison scores indicate how relevant each position is
- Scores are normalized into attention weights via softmax
- The output is a weighted sum of the Value vectors (sketched in code below)
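A minimal sketch of these steps in NumPy, using a single attention head and randomly initialized projection matrices. The function and the shapes chosen here are illustrative assumptions, not part of any specific model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_head)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v       # per-token Query, Key, Value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every Query compared against every Key
    weights = softmax(scores, axis=-1)        # normalize scores into attention weights
    return weights @ V                        # weighted sum of Values per position

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))                      # toy input sequence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)                       # shape (5, 8): one vector per token
```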
The Difference from Cross-Attention
- Self-attention: Q, K, V all come from the same sequence
- Cross-attention: Q from one sequence, K and V from another
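The contrast is only in where the inputs come from; the math is identical. A hypothetical `attention` helper that accepts separate query and key/value sources makes this explicit (names and dimensions are illustrative):

```python
import numpy as np

def attention(x_q, x_kv, W_q, W_k, W_v):
    Q = x_q @ W_q                    # Queries come from x_q
    K, V = x_kv @ W_k, x_kv @ W_v    # Keys and Values come from x_kv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d_model, d_head = 16, 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
seq = rng.normal(size=(5, d_model))      # one sequence
other = rng.normal(size=(7, d_model))    # e.g. encoder outputs in an encoder-decoder model

self_attn = attention(seq, seq, W_q, W_k, W_v)     # Q, K, V all from the same sequence
cross_attn = attention(seq, other, W_q, W_k, W_v)  # Q from `seq`, K and V from `other`
```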
Causal vs. Bidirectional
- Causal (masked): Each position only attends to itself and earlier positions
  - Used in: GPT, Claude, Llama (generation models)
  - Enables: Autoregressive text generation
- Bidirectional: Each position attends to all positions
  - Used in: BERT (understanding models)
  - Enables: Better context understanding for classification
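The two variants differ only in a mask applied to the score matrix before the softmax. A NumPy sketch with toy dimensions (all values here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_head = 5, 8
Q = rng.normal(size=(seq_len, d_head))
K = rng.normal(size=(seq_len, d_head))
V = rng.normal(size=(seq_len, d_head))

scores = Q @ K.T / np.sqrt(d_head)

# Bidirectional (BERT-style): use the scores as-is.
bi_weights = softmax(scores)

# Causal (GPT-style): mask out future positions before the softmax.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True strictly above the diagonal
causal_scores = np.where(mask, -np.inf, scores)               # future positions get -inf
causal_weights = softmax(causal_scores)                       # each row sums to 1 over positions <= its index

causal_out = causal_weights @ V
```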
What Self-Attention Learns
Through training, self-attention learns to:
- Connect pronouns to their referents
- Relate verbs to their subjects and objects
- Link questions to relevant context
- Identify important vs. filler words
- Track entities across long passages
Limitations
- Quadratic complexity: compute and memory scale as O(n²) with sequence length
- Fixed context window (though expanding)
- Can lose focus in very long sequences
- Computationally expensive for each token generated
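The quadratic cost is easy to see from the score matrix alone: one entry per (query, key) pair. A back-of-the-envelope calculation, assuming a single layer, a single head, and float32 scores (real models multiply this by layer and head counts):

```python
# Rough size of the n x n attention score matrix at different sequence lengths.
for n in (1_000, 10_000, 100_000):
    entries = n * n                 # one score per (query, key) pair
    bytes_fp32 = entries * 4        # 4 bytes per float32 score
    print(f"n={n:>7,}: {entries:>15,} scores ~= {bytes_fp32 / 2**30:8.2f} GiB")
```

Going from 10,000 to 100,000 tokens multiplies the score matrix by 100x, which is why long contexts are disproportionately expensive.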
Efficient Alternatives
Research continues on more efficient attention:
- Sparse attention (attend to subset of positions)
- Linear attention (approximates full attention in O(n) time)
- Sliding window attention (local + global)
These enable longer context windows with manageable compute costs.
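As one concrete illustration, a sliding-window variant can be expressed as just another mask over the score matrix. The window size below is an arbitrary choice for demonstration, and the helper is a sketch, not any library's API:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: True where attention is allowed (causal, local window)."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    # Allowed if the key is not in the future and within `window` tokens of the query.
    return (j <= i) & (i - j < window)

print(sliding_window_mask(6, 3).astype(int))
# Each row has at most 3 ones (the token itself plus two preceding tokens),
# so the number of scored pairs grows as O(n * window) rather than O(n^2).
```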
Source
Self-attention, or intra-attention, relates different positions of a single sequence to compute a representation of the sequence, enabling the Transformer to capture dependencies regardless of distance.
https://arxiv.org/abs/1706.03762