
Self-Attention

Definition

An attention mechanism where a sequence attends to itself, allowing each position to gather information from all other positions in the same sequence. The core operation in Transformer models.
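
In the scaled dot-product form used by the Transformer paper cited below, the queries, keys, and values are all projections of the same sequence, and each position's output is:

```latex
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```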

Why It Matters

Self-attention is what makes Transformers work. It allows every token to “see” every other token in the sequence, regardless of distance. This is fundamentally different from older recurrent architectures, where information had to propagate step by step through the sequence.

When you ask an LLM a question and it references information from your earlier context, that’s self-attention at work. It’s also why models can understand that “it” in a sentence refers to something mentioned paragraphs earlier.

For AI engineers, self-attention explains both capabilities and costs. It’s why models can maintain coherent long-form generation, but also why processing long contexts is computationally expensive.

Implementation Basics

How Self-Attention Works

  1. Each token generates Query, Key, and Value vectors
  2. Each Query compares against all Keys (including its own)
  3. Comparison scores indicate relevance
  4. Scores are scaled by the square root of the key dimension, then normalized via softmax
  5. Output is a weighted sum of Values
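
As a rough, single-head illustration of these five steps, here is a minimal NumPy sketch; the dimensions and random projection matrices are invented for the example, and production implementations add multiple heads, batching, masking, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over one sequence X of shape (n_tokens, d_model)."""
    Q = X @ W_q                              # step 1: queries
    K = X @ W_k                              #         keys
    V = X @ W_v                              #         values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # steps 2-3: every query vs. every key
    weights = softmax(scores, axis=-1)       # step 4: normalize each query's scores
    return weights @ V                       # step 5: weighted sum of values

# Toy example: 5 tokens, model dim 16, head dim 8 (arbitrary sizes).
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 8): one output vector per token
```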

The Difference from Cross-Attention

  • Self-attention: Q, K, V all come from the same sequence
  • Cross-attention: Q from one sequence, K and V from another
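
Continuing the NumPy sketch above (same softmax, X, and projection matrices), the only structural change for cross-attention is where the keys and values come from; the second sequence Y here is invented for illustration.

```python
def attention(Q, K, V):
    """Scaled dot-product attention on already-projected matrices."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

# Self-attention: Q, K, V all derived from the same sequence X.
self_out = attention(X @ W_q, X @ W_k, X @ W_v)

# Cross-attention: Q from X, but K and V from a different sequence Y
# (e.g. an encoder's output attended to by a decoder).
Y = rng.normal(size=(7, d_model))
cross_out = attention(X @ W_q, Y @ W_k, Y @ W_v)
print(self_out.shape, cross_out.shape)  # (5, 8) (5, 8): output length follows the queries
```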

Causal vs. Bidirectional

  • Causal (masked): Each position attends only to itself and earlier positions
    • Used in: GPT, Claude, Llama (generation models)
    • Enables: Autoregressive text generation
  • Bidirectional: Each position attends to all positions
    • Used in: BERT (understanding models)
    • Enables: Better context understanding for classification
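
One common way to implement the causal variant, continuing the NumPy sketch above, is to set every future position's score to negative infinity before the softmax so it receives zero weight:

```python
def causal_self_attention(X, W_q, W_k, W_v):
    """Self-attention where position i may attend only to positions <= i."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = future tokens
    scores = np.where(future, -np.inf, scores)          # masked scores get zero softmax weight
    return softmax(scores, axis=-1) @ V

causal_out = causal_self_attention(X, W_q, W_k, W_v)
# The bidirectional (BERT-style) case is simply the unmasked version shown earlier.
```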

What Self-Attention Learns

Through training, self-attention learns to:

  • Connect pronouns to their referents
  • Relate verbs to their subjects and objects
  • Link questions to relevant context
  • Identify important vs. filler words
  • Track entities across long passages

Limitations

  • Quadratic complexity: compute and memory scale as O(n²) with sequence length
  • Fixed context window (though window sizes keep growing)
  • Can lose focus in very long sequences
  • Expensive at generation time, since each new token must attend over all previous tokens
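
A back-of-the-envelope calculation makes the quadratic cost concrete; the sequence lengths and the 4-byte float size below are illustrative, and real systems pay this per head and per layer.

```python
# Memory for a single n x n attention score matrix in float32 (4 bytes per entry).
for n in (1_024, 8_192, 65_536):
    score_bytes = n * n * 4
    print(f"n={n:>6}: {score_bytes / 2**20:,.0f} MiB")
# n=  1024: 4 MiB
# n=  8192: 256 MiB
# n= 65536: 16,384 MiB  <- doubling the sequence length quadruples the cost
```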

Efficient Alternatives

Research continues on more efficient attention:

  • Sparse attention (each position attends to a subset of positions)
  • Linear attention (approximates full attention in O(n) time)
  • Sliding window attention (local attention, sometimes combined with a few global tokens)

These enable longer context windows with manageable compute costs.
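
As one sketch of these ideas, a sliding-window mask (continuing the NumPy example above, with an arbitrary window size) limits each token to a fixed-size local neighborhood, so the useful work grows as O(n·w) rather than O(n²):

```python
def sliding_window_self_attention(X, W_q, W_k, W_v, window=2):
    """Each position attends only to itself and the previous `window` positions."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    n = scores.shape[0]
    idx = np.arange(n)
    # allowed[i, j] is True when j lies in the causal window [i - window, i]
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] <= window)
    scores = np.where(allowed, scores, -np.inf)
    return softmax(scores, axis=-1) @ V

window_out = sliding_window_self_attention(X, W_q, W_k, W_v, window=2)
# A real implementation computes only the in-window scores, which is where the savings
# come from; this sketch still builds the full matrix purely for clarity.
```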

Source

Self-attention, or intra-attention, relates different positions of a single sequence to compute a representation of the sequence, enabling the Transformer to capture dependencies regardless of distance.

https://arxiv.org/abs/1706.03762