
Attention Mechanism

Definition

A neural network mechanism that lets models weigh the relevance of different parts of the input when producing each output, enabling dynamic focus on contextually important information.

Why It Matters

Attention is the core innovation behind the Transformer architecture that powers modern LLMs. It solves the fundamental problem of how a model can focus on the relevant parts of a variable-length input.

Before attention, encoder-decoder models compressed the entire input into a single fixed-size vector, losing information on long sequences. Attention lets the model look directly at any part of the input when generating each output, which is why LLMs can reference information from thousands of tokens earlier.

For AI engineers, understanding attention explains key LLM behaviors: why context ordering matters, why certain prompts work better, and why models have context limits. Attention also dominates computational cost, which is why long prompts are expensive.

Implementation Basics

The Attention Formula

Attention(Q, K, V) = softmax(QK^T / √d) × V

Where:

  • Q (Query): What the current position is looking for
  • K (Key): What each position offers for matching
  • V (Value): The content each position contributes
  • √d: Scaling factor (d is the key dimension) that keeps the dot products numerically stable
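
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function and variable names are illustrative rather than taken from any particular library:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        d = K.shape[-1]                               # key dimension
        scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d)  # QK^T / sqrt(d)
        weights = softmax(scores, axis=-1)            # each row sums to 1
        return weights @ V, weights                   # blend Values by weight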

How It Works

  1. Each Query is compared against all Keys via dot products
  2. The resulting scores measure how relevant each position is to the Query
  3. Softmax normalizes the scores into weights that sum to 1
  4. The weights determine how much of each Value flows into the output
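
A toy run of these four steps, reusing the attention() sketch above (the shapes and random seed are arbitrary):

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 4))   # 3 positions, dimension 4
    K = rng.normal(size=(3, 4))
    V = rng.normal(size=(3, 4))

    out, weights = attention(Q, K, V)
    print(weights.round(2))  # 3x3 matrix: how much each Value feeds each output
    print(out.shape)         # (3, 4): one weighted blend of Values per Query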

Attention Types

  • Self-attention: A sequence attends to itself
  • Cross-attention: One sequence attends to another
  • Causal attention: Each position attends only to itself and earlier positions (used in generation; see the sketch below)
  • Bidirectional attention: Each position can attend to all positions (used in understanding)
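
To illustrate the causal variant, here is a masking sketch that builds on the helpers above: scores for future positions are set to -inf, so softmax gives them zero weight and each position can only draw on itself and earlier positions.

    def causal_attention(Q, K, V):
        n, d = Q.shape[0], K.shape[-1]
        scores = Q @ K.T / np.sqrt(d)
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly above the diagonal
        scores = np.where(future, -np.inf, scores)          # -inf becomes weight 0
        return softmax(scores, axis=-1) @ V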

Computational Cost

Attention has O(n²) complexity in sequence length:

  • 1K tokens: 1M attention computations
  • 4K tokens: 16M attention computations
  • 128K tokens: 16B attention computations

This quadratic scaling is why context windows have limits and why long prompts cost more.
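
The figures above are simply the n × n entries of the score matrix (per head, per layer); a quick arithmetic check:

    # Quadratic growth of attention score computations with sequence length n
    for n in (1_000, 4_000, 128_000):
        print(f"{n:>7} tokens -> {n * n:,} attention scores")
    # 1,000,000 / 16,000,000 / 16,384,000,000 (~16B), matching the list above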

Attention Patterns

Models learn different attention patterns:

  • Recent tokens often get high attention
  • Punctuation and special tokens can act as “anchors”
  • Key information positions receive focused attention
  • Some heads specialize in syntactic relationships

Understanding these patterns helps with prompt engineering and debugging model behavior.
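
One way to observe these patterns directly is to dump a model's attention weights. The sketch below assumes the Hugging Face transformers library and the distilbert-base-uncased checkpoint; which tokens the heads favor varies by model and layer:

    import torch
    from transformers import AutoTokenizer, AutoModel

    name = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_attentions=True)

    inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    # out.attentions holds one tensor per layer, shaped (batch, heads, seq, seq)
    last = out.attentions[-1][0]  # attention weights of the final layer
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    for h in range(3):            # inspect the first few heads
        strongest = last[h].argmax(dim=-1)  # most-attended token per position
        print(f"head {h}:", [tokens[i] for i in strongest.tolist()])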

Source

Bahdanau, Cho, and Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate” (2014), introduced the attention mechanism, letting translation models focus on the relevant source words when generating each target word.

https://arxiv.org/abs/1409.0473