
Attention Mechanism

Definition

A neural network mechanism that lets models weigh the relevance of different parts of the input when producing each output, enabling dynamic focus on contextually important information.

Why It Matters

Attention is the core innovation behind the Transformer architecture that powers modern LLMs. It solves the fundamental problem of how a model can focus on the relevant parts of a variable-length input.

Before attention, encoder-decoder models compressed the entire input into a single fixed-size vector, losing information on long sequences. Attention lets the model look directly at any part of the input when generating each output, which is why LLMs can reference information from thousands of tokens earlier.

For AI engineers, understanding attention explains key LLM behaviors: why context ordering matters, why certain prompts work better, and why models have context limits. Attention also dominates computational cost, which is why long prompts are expensive.

Implementation Basics

The Attention Formula

Attention(Q, K, V) = softmax(QK^T / √d) × V

Where:

  • Q (Query): What the current position is looking for
  • K (Key): What each position offers for matching
  • V (Value): The content each position contributes
  • √d: Scaling factor (d is the key dimension) that keeps the dot products numerically stable
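
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function and variable names are illustrative rather than taken from any particular library:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        d = K.shape[-1]                               # key dimension
        scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d)  # QK^T / sqrt(d)
        weights = softmax(scores, axis=-1)            # each row sums to 1
        return weights @ V, weights                   # blend Values by weight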

How It Works

  1. Each Query is compared against all Keys via dot products
  2. The resulting scores measure how relevant each position is to the Query
  3. Softmax normalizes the scores into weights that sum to 1
  4. The weights determine how much of each Value flows into the output
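
A toy run of these four steps, reusing the attention() sketch above (the shapes and random seed are arbitrary):

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 4))   # 3 positions, dimension 4
    K = rng.normal(size=(3, 4))
    V = rng.normal(size=(3, 4))

    out, weights = attention(Q, K, V)
    print(weights.round(2))  # 3x3 matrix: how much each Value feeds each output
    print(out.shape)         # (3, 4): one weighted blend of Values per Query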

Attention Types

  • Self-attention: A sequence attends to itself
  • Cross-attention: One sequence attends to another
  • Causal attention: Each position attends only to itself and earlier positions (used in generation; see the sketch below)
  • Bidirectional attention: Each position can attend to all positions (used in understanding)
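
To illustrate the causal variant, here is a masking sketch that builds on the helpers above: scores for future positions are set to -inf, so softmax gives them zero weight and each position can only draw on itself and earlier positions.

    def causal_attention(Q, K, V):
        n, d = Q.shape[0], K.shape[-1]
        scores = Q @ K.T / np.sqrt(d)
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly above the diagonal
        scores = np.where(future, -np.inf, scores)          # -inf becomes weight 0
        return softmax(scores, axis=-1) @ V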

Computational Cost

Attention has O(n²) complexity in sequence length:

  • 1K tokens: 1M attention computations
  • 4K tokens: 16M attention computations
  • 128K tokens: 16B attention computations

This quadratic scaling is why context windows have limits and why long prompts cost more.
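
The figures above are simply the n × n entries of the score matrix (per head, per layer); a quick arithmetic check:

    # Quadratic growth of attention score computations with sequence length n
    for n in (1_000, 4_000, 128_000):
        print(f"{n:>7} tokens -> {n * n:,} attention scores")
    # 1,000,000 / 16,000,000 / 16,384,000,000 (~16B), matching the list above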

Attention Patterns

Models learn different attention patterns:

  • Recent tokens often get high attention
  • Punctuation and special tokens can act as “anchors”
  • Key information positions receive focused attention
  • Some heads specialize in syntactic relationships

Understanding these patterns helps with prompt engineering and debugging model behavior.
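
One way to observe these patterns directly is to dump a model's attention weights. The sketch below assumes the Hugging Face transformers library and the distilbert-base-uncased checkpoint; which tokens the heads favor varies by model and layer:

    import torch
    from transformers import AutoTokenizer, AutoModel

    name = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_attentions=True)

    inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    # out.attentions holds one tensor per layer, shaped (batch, heads, seq, seq)
    last = out.attentions[-1][0]  # attention weights of the final layer
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    for h in range(3):            # inspect the first few heads
        strongest = last[h].argmax(dim=-1)  # most-attended token per position
        print(f"head {h}:", [tokens[i] for i in strongest.tolist()])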

Source

Bahdanau, Cho, and Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate” (2014), introduced the attention mechanism, letting translation models focus on the relevant source words when generating each target word.

https://arxiv.org/abs/1409.0473