Architecture

Feed-Forward Network

Definition

A neural network component in Transformers that processes each position independently through two linear transformations with a non-linear activation, providing position-wise computation after attention.

Why It Matters

The feed-forward network (FFN) is where Transformers store most of their “knowledge.” While attention handles relationships between positions, FFNs process information at each position independently, functioning like a huge lookup table of learned patterns.

Research suggests FFNs act as key-value memories: the first layer identifies patterns, and the second layer retrieves associated information. This is why larger FFN dimensions (more parameters) generally mean more knowledgeable models.

For AI engineers, understanding FFNs helps with model selection (FFN size correlates with knowledge capacity) and efficiency optimization (FFNs are often the target of quantization and pruning).

Implementation Basics

Standard Architecture

FFN(x) = activation(x × W₁ + b₁) × W₂ + b₂

Typical dimensions:

  • Input: d_model (e.g., 4096)
  • Hidden: 4 × d_model (e.g., 16384)
  • Output: d_model (e.g., 4096)
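
A minimal PyTorch sketch of this two-layer FFN (the GELU choice and the 4 × expansion are illustrative defaults, not tied to any particular model):

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand to the hidden size, apply a non-linearity,
    then project back down to d_model."""
    def __init__(self, d_model: int = 4096, expansion: int = 4):
        super().__init__()
        d_hidden = expansion * d_model
        self.up = nn.Linear(d_model, d_hidden)    # W1, b1
        self.act = nn.GELU()                      # ReLU in the original Transformer
        self.down = nn.Linear(d_hidden, d_model)  # W2, b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); every position is transformed independently
        return self.down(self.act(self.up(x)))

x = torch.randn(2, 8, 4096)
print(FeedForward()(x).shape)  # torch.Size([2, 8, 4096])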

Activation Functions

  • ReLU: Original Transformer (simple, fast)
  • GELU: GPT models (smoother gradients)
  • SiLU/Swish: Llama, Mistral (better performance)
  • GeGLU/SwiGLU: Gated variants (even better, used in modern LLMs)
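
For reference, the non-gated activations are available directly in PyTorch; the gated variants (next section) combine one of them with an extra linear projection:

import torch
import torch.nn.functional as F

x = torch.randn(4)

F.relu(x)   # original Transformer
F.gelu(x)   # GPT-family models
F.silu(x)   # SiLU/Swish, used in Llama- and Mistral-style FFNs

# GeGLU / SwiGLU are not standalone activations: they gate one linear
# projection with another, e.g. SwiGLU(x) = silu(x @ W_gate) * (x @ W_up)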

Gated FFN (Modern Standard)

FFN_gated(x) = (activation(x × W_gate) ⊙ (x × W_up)) × W_down
  • Gate controls information flow
  • Better performance but more parameters
  • Used in Llama, PaLM, and many other modern LLMs
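
A minimal sketch of a SwiGLU-style gated FFN; the bias-free linear layers and the 11008 hidden size follow Llama-7B's published configuration, but exact sizes and activations vary by model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated FFN: act(x W_gate) ⊙ (x W_up), then project back down with W_down."""
    def __init__(self, d_model: int = 4096, d_hidden: int = 11008):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)  # W_gate
        self.up   = nn.Linear(d_model, d_hidden, bias=False)  # W_up
        self.down = nn.Linear(d_hidden, d_model, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))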

Where Knowledge Lives

Research on knowledge localization suggests:

  • Factual associations stored in FFN weights
  • Different layers store different types of knowledge
  • Early layers: syntax, basic patterns
  • Middle layers: factual knowledge
  • Late layers: task-specific computation

Computation Cost

FFNs often dominate compute in Transformers:

  • With 4x expansion: ~2/3 of layer FLOPs
  • Target for efficiency optimizations
  • Sparse FFNs (MoE) activate only subset
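
A rough per-token count shows where the ~2/3 figure comes from, assuming only the dense projections are counted (attention-score FLOPs, which grow with sequence length, are left out):

d = 4096  # d_model

# Per token, a matmul against a (d_in x d_out) weight costs about 2 * d_in * d_out FLOPs.
attn_proj_flops = 4 * (2 * d * d)        # Q, K, V and output projections
ffn_flops       = 2 * (2 * d * (4 * d))  # up- and down-projection with 4x expansion

print(ffn_flops / (ffn_flops + attn_proj_flops))  # 0.666... -> roughly 2/3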

Mixture of Experts (MoE)

MoE replaces the single FFN with multiple “expert” FFNs:

  • Router selects which experts process each token
  • Only 1-2 experts active per token
  • Massive parameter count, but efficient compute
  • Used in Mixtral, GPT-4 (reportedly)
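
A toy top-k routing sketch to illustrate the idea (names and sizes are hypothetical; production MoE layers add load-balancing losses, capacity limits, and fused expert kernels):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Top-k MoE sketch: a router scores experts per token; only the top-k expert FFNs run."""
    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)                    # normalize selected scores
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(MoELayer(d_model=512)(tokens).shape)  # torch.Size([16, 512])

Only the selected experts run for each token, so per-token compute scales with top_k rather than with the total number of experts.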

Memory Considerations

  • FFN weights are large (on the order of 8 × d_model² parameters per layer for the standard two-matrix variant)
  • Often quantized more aggressively than attention
  • Memory bandwidth bound during inference
  • Batching helps amortize weight loading
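
Back-of-the-envelope numbers for a single layer at d_model = 4096 (illustrative, not tied to a specific model) show why FFN weights are a prime quantization target:

d = 4096  # d_model

ffn_params = 2 * d * (4 * d)  # standard two-matrix FFN; gated variants use three matrices
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: {ffn_params * bytes_per_param / 2**30:.2f} GiB per layer")
# fp16: 0.25 GiB, int8: 0.12 GiB, int4: 0.06 GiB -> multiplied by the number of layers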

Source

In addition to attention sub-layers, each layer contains a fully connected feed-forward network applied to each position separately and identically, consisting of two linear transformations with a ReLU activation.

https://arxiv.org/abs/1706.03762