Self-Attention
Definition
An attention mechanism where a sequence attends to itself, allowing each position to gather information from all other positions in the same sequence. The core operation in Transformer models.
Why It Matters
Self-attention is what makes Transformers work. It allows every token to “see” every other token in the sequence, regardless of distance. This is fundamentally different from older architectures where information had to propagate step-by-step.
When you ask an LLM a question and it references information from your earlier context, that’s self-attention at work. It’s also why models can understand that “it” in a sentence refers to something mentioned paragraphs earlier.
For AI engineers, self-attention explains both capabilities and costs. It’s why models can maintain coherent long-form generation, but also why processing long contexts is computationally expensive.
Implementation Basics
How Self-Attention Works
- Each token produces Query, Key, and Value vectors
- Each token's Query is compared against every Key, including its own
- The comparison scores indicate how relevant each position is
- Scores are normalized into attention weights via softmax
- The output is a weighted sum of the Value vectors (sketched in code below)
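A minimal sketch of these steps in NumPy, using a single attention head and randomly initialized projection matrices. The function and the shapes chosen here are illustrative assumptions, not part of any specific model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_head)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v       # per-token Query, Key, Value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every Query compared against every Key
    weights = softmax(scores, axis=-1)        # normalize scores into attention weights
    return weights @ V                        # weighted sum of Values per position

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))                      # toy input sequence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)                       # shape (5, 8): one vector per token
```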
The Difference from Cross-Attention
- Self-attention: Q, K, V all come from the same sequence
- Cross-attention: Q from one sequence, K and V from another
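The contrast is only in where the inputs come from; the math is identical. A hypothetical `attention` helper that accepts separate query and key/value sources makes this explicit (names and dimensions are illustrative):

```python
import numpy as np

def attention(x_q, x_kv, W_q, W_k, W_v):
    Q = x_q @ W_q                    # Queries come from x_q
    K, V = x_kv @ W_k, x_kv @ W_v    # Keys and Values come from x_kv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d_model, d_head = 16, 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
seq = rng.normal(size=(5, d_model))      # one sequence
other = rng.normal(size=(7, d_model))    # e.g. encoder outputs in an encoder-decoder model

self_attn = attention(seq, seq, W_q, W_k, W_v)     # Q, K, V all from the same sequence
cross_attn = attention(seq, other, W_q, W_k, W_v)  # Q from `seq`, K and V from `other`
```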
Causal vs. Bidirectional
- Causal (masked): Each position only attends to itself and earlier positions
  - Used in: GPT, Claude, Llama (generation models)
  - Enables: Autoregressive text generation
- Bidirectional: Each position attends to all positions
  - Used in: BERT (understanding models)
  - Enables: Better context understanding for classification
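The two variants differ only in a mask applied to the score matrix before the softmax. A NumPy sketch with toy dimensions (all values here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_head = 5, 8
Q = rng.normal(size=(seq_len, d_head))
K = rng.normal(size=(seq_len, d_head))
V = rng.normal(size=(seq_len, d_head))

scores = Q @ K.T / np.sqrt(d_head)

# Bidirectional (BERT-style): use the scores as-is.
bi_weights = softmax(scores)

# Causal (GPT-style): mask out future positions before the softmax.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True strictly above the diagonal
causal_scores = np.where(mask, -np.inf, scores)               # future positions get -inf
causal_weights = softmax(causal_scores)                       # each row sums to 1 over positions <= its index

causal_out = causal_weights @ V
```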
What Self-Attention Learns
Through training, self-attention learns to:
- Connect pronouns to their referents
- Relate verbs to their subjects and objects
- Link questions to relevant context
- Identify important vs. filler words
- Track entities across long passages
Limitations
- Quadratic complexity: compute and memory scale as O(n²) with sequence length
- Fixed context window (though expanding)
- Can lose focus in very long sequences
- Computationally expensive for each token generated
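The quadratic cost is easy to see from the score matrix alone: one entry per (query, key) pair. A back-of-the-envelope calculation, assuming a single layer, a single head, and float32 scores (real models multiply this by layer and head counts):

```python
# Rough size of the n x n attention score matrix at different sequence lengths.
for n in (1_000, 10_000, 100_000):
    entries = n * n                 # one score per (query, key) pair
    bytes_fp32 = entries * 4        # 4 bytes per float32 score
    print(f"n={n:>7,}: {entries:>15,} scores ~= {bytes_fp32 / 2**30:8.2f} GiB")
```

Going from 10,000 to 100,000 tokens multiplies the score matrix by 100x, which is why long contexts are disproportionately expensive.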
Efficient Alternatives
Research continues on more efficient attention:
- Sparse attention (attend to subset of positions)
- Linear attention (approximates full attention in O(n) time)
- Sliding window attention (local + global)
These enable longer context windows with manageable compute costs.
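As one concrete illustration, a sliding-window variant can be expressed as just another mask over the score matrix. The window size below is an arbitrary choice for demonstration, and the helper is a sketch, not any library's API:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: True where attention is allowed (causal, local window)."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    # Allowed if the key is not in the future and within `window` tokens of the query.
    return (j <= i) & (i - j < window)

print(sliding_window_mask(6, 3).astype(int))
# Each row has at most 3 ones (the token itself plus two preceding tokens),
# so the number of scored pairs grows as O(n * window) rather than O(n^2).
```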
Source
Self-attention, or intra-attention, relates different positions of a single sequence to compute a representation of the sequence, enabling the Transformer to capture dependencies regardless of distance.
https://arxiv.org/abs/1706.03762