Architecture

Feed-Forward Network

Definition

A neural network component in Transformers that processes each position independently through two linear transformations with a non-linear activation, providing position-wise computation after attention.

Why It Matters

The feed-forward network (FFN) is where Transformers store most of their “knowledge.” While attention handles relationships between positions, FFNs process information at each position independently, functioning like a huge lookup table of learned patterns.

Research suggests FFNs act as key-value memories: the first layer identifies patterns, and the second layer retrieves associated information. This is why larger FFN dimensions (more parameters) generally mean more knowledgeable models.

For AI engineers, understanding FFNs helps with model selection (FFN size correlates with knowledge capacity) and efficiency optimization (FFNs are often the target of quantization and pruning).

Implementation Basics

Standard Architecture

FFN(x) = activation(x × W₁ + b₁) × W₂ + b₂

Typical dimensions:

  • Input: d_model (e.g., 4096)
  • Hidden: 4 × d_model (e.g., 16384)
  • Output: d_model (e.g., 4096)
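
A minimal PyTorch sketch of this two-layer FFN (the GELU choice and the 4 × expansion are illustrative defaults, not tied to any particular model):

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand to the hidden size, apply a non-linearity,
    then project back down to d_model."""
    def __init__(self, d_model: int = 4096, expansion: int = 4):
        super().__init__()
        d_hidden = expansion * d_model
        self.up = nn.Linear(d_model, d_hidden)    # W1, b1
        self.act = nn.GELU()                      # ReLU in the original Transformer
        self.down = nn.Linear(d_hidden, d_model)  # W2, b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); every position is transformed independently
        return self.down(self.act(self.up(x)))

x = torch.randn(2, 8, 4096)
print(FeedForward()(x).shape)  # torch.Size([2, 8, 4096])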

Activation Functions

  • ReLU: Original Transformer (simple, fast)
  • GELU: GPT models (smoother gradients)
  • SiLU/Swish: Llama, Mistral (better performance)
  • GeGLU/SwiGLU: Gated variants (even better, used in modern LLMs)
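
For reference, the non-gated activations are available directly in PyTorch; the gated variants (next section) combine one of them with an extra linear projection:

import torch
import torch.nn.functional as F

x = torch.randn(4)

F.relu(x)   # original Transformer
F.gelu(x)   # GPT-family models
F.silu(x)   # SiLU/Swish, used in Llama- and Mistral-style FFNs

# GeGLU / SwiGLU are not standalone activations: they gate one linear
# projection with another, e.g. SwiGLU(x) = silu(x @ W_gate) * (x @ W_up)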

Gated FFN (Modern Standard)

FFN_gated(x) = (activation(x × W_gate) ⊙ (x × W_up)) × W_down
  • Gate controls information flow
  • Better performance but more parameters
  • Used in Llama, PaLM, and many other modern LLMs
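
A minimal sketch of a SwiGLU-style gated FFN; the bias-free linear layers and the 11008 hidden size follow Llama-7B's published configuration, but exact sizes and activations vary by model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated FFN: act(x W_gate) ⊙ (x W_up), then project back down with W_down."""
    def __init__(self, d_model: int = 4096, d_hidden: int = 11008):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)  # W_gate
        self.up   = nn.Linear(d_model, d_hidden, bias=False)  # W_up
        self.down = nn.Linear(d_hidden, d_model, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))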

Where Knowledge Lives

Research on knowledge localization suggests:

  • Factual associations stored in FFN weights
  • Different layers store different types of knowledge
  • Early layers: syntax, basic patterns
  • Middle layers: factual knowledge
  • Late layers: task-specific computation

Computation Cost

FFNs often dominate compute in Transformers:

  • With 4x expansion: ~2/3 of layer FLOPs
  • Target for efficiency optimizations
  • Sparse FFNs (MoE) activate only subset
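
A rough per-token count shows where the ~2/3 figure comes from, assuming only the dense projections are counted (attention-score FLOPs, which grow with sequence length, are left out):

d = 4096  # d_model

# Per token, a matmul against a (d_in x d_out) weight costs about 2 * d_in * d_out FLOPs.
attn_proj_flops = 4 * (2 * d * d)        # Q, K, V and output projections
ffn_flops       = 2 * (2 * d * (4 * d))  # up- and down-projection with 4x expansion

print(ffn_flops / (ffn_flops + attn_proj_flops))  # 0.666... -> roughly 2/3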

Mixture of Experts (MoE)

MoE replaces the single FFN with multiple “expert” FFNs:

  • Router selects which experts process each token
  • Only 1-2 experts active per token
  • Massive parameter count, but efficient compute
  • Used in Mixtral, GPT-4 (reportedly)
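
A toy top-k routing sketch to illustrate the idea (names and sizes are hypothetical; production MoE layers add load-balancing losses, capacity limits, and fused expert kernels):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Top-k MoE sketch: a router scores experts per token; only the top-k expert FFNs run."""
    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)                    # normalize selected scores
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(MoELayer(d_model=512)(tokens).shape)  # torch.Size([16, 512])

Only the selected experts run for each token, so per-token compute scales with top_k rather than with the total number of experts.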

Memory Considerations

  • FFN weights are large (on the order of 8 × d_model² parameters per layer for the standard two-matrix variant)
  • Often quantized more aggressively than attention
  • Memory bandwidth bound during inference
  • Batching helps amortize weight loading
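
Back-of-the-envelope numbers for a single layer at d_model = 4096 (illustrative, not tied to a specific model) show why FFN weights are a prime quantization target:

d = 4096  # d_model

ffn_params = 2 * d * (4 * d)  # standard two-matrix FFN; gated variants use three matrices
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: {ffn_params * bytes_per_param / 2**30:.2f} GiB per layer")
# fp16: 0.25 GiB, int8: 0.12 GiB, int4: 0.06 GiB -> multiplied by the number of layers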

Source

In addition to attention sub-layers, each layer contains a fully connected feed-forward network applied to each position separately and identically, consisting of two linear transformations with a ReLU activation.

https://arxiv.org/abs/1706.03762