Architecture

Mixture of Experts (MoE)

Definition

Mixture of Experts is a neural network architecture that uses multiple specialized sub-networks (experts) with a routing mechanism that activates only a subset for each input, enabling larger model capacity without proportionally increasing compute costs.

Why It Matters

MoE architectures power some of the most capable models available. Mixtral, GPT-4 (reportedly), and other frontier models use MoE to achieve better quality at lower compute costs. Understanding MoE helps you interpret model behavior and make informed deployment decisions.

The key insight: not all inputs need the same processing. A question about cooking activates different knowledge than a question about calculus. MoE implements this intuition architecturally, with specialized expert networks handling different types of inputs and a router deciding which experts to use.

For AI engineers, MoE models have distinct deployment characteristics. They’re memory-intensive (all experts must be loaded) but compute-efficient (only some experts run per token). This affects infrastructure choices since MoE models need more GPU memory but don’t need proportionally more compute.

Implementation Basics

MoE replaces standard feedforward layers with a routing mechanism and multiple experts (a minimal code sketch follows this list):

1. Expert Networks: Each expert is a feedforward network (typically the same architecture as a standard transformer’s FFN). A model might have 8, 16, or more experts per MoE layer, and each expert learns to specialize in different types of inputs.

2. Router/Gating: A small network that takes each token and decides which experts should process it, outputting a probability distribution over experts. Top-k routing activates the k highest-probability experts (commonly k=1 or k=2).

3. Sparse Activation: Only the selected experts run for each token. For example, a 70B-parameter model with 8 experts and top-2 routing might spend only ~25B parameters’ worth of compute per token, approaching 70B-scale quality at roughly the per-token cost of a 25B dense model.

4. Load Balancing: Without constraints, routers collapse to using only a few experts. Auxiliary losses encourage balanced expert utilization, ensuring all experts contribute and preventing wasted capacity.
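
To make these four pieces concrete, here is a minimal sketch of an MoE layer in PyTorch with top-2 routing and a Switch-style load-balancing loss. The class name, dimensions, and hyperparameters are illustrative assumptions, not the internals of any particular model.

```python
# Minimal MoE layer sketch: router + sparse experts + auxiliary balancing loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """Drop-in replacement for a transformer FFN block (illustrative only)."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # 1. Expert networks: each expert is an ordinary FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # 2. Router/gating: one logit per expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                           # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.size(-1))          # (num_tokens, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        # 3. Sparse activation: each expert sees only the tokens routed to it.
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                            # expert idle for this batch
            weight = topk_probs[token_ids, slot].unsqueeze(-1)
            out[token_ids] += weight * expert(tokens[token_ids])

        # 4. Load balancing (Switch-style): penalize the dot product of the
        # fraction of tokens routed to each expert and the mean router probability.
        frac_tokens = F.one_hot(topk_idx[:, 0], self.num_experts).float().mean(dim=0)
        mean_probs = probs.mean(dim=0)
        aux_loss = self.num_experts * (frac_tokens * mean_probs).sum()

        return out.reshape_as(x), aux_loss


layer = MoELayer()
y, aux = layer(torch.randn(4, 16, 512))  # y keeps the input shape; add aux to the training loss
```

In a full transformer, a layer like this would replace the FFN in some or all blocks, and the auxiliary loss would be added to the language-modeling loss with a small coefficient.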

Practical implications:

Memory vs. Compute - MoE models require loading all experts (high memory) but only run a subset (lower compute). An 8x7B MoE model such as Mixtral holds roughly 47B parameters in memory (less than the naive 8 × 7B = 56B, because attention layers are shared across experts) yet activates only about 13B per token.
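
A quick back-of-the-envelope calculation makes the split explicit. The parameter counts below are rough, Mixtral-like assumptions rather than official figures:

```python
# Illustrative parameter counts for an 8-expert, top-2 MoE model (assumed values).
num_experts, top_k = 8, 2
shared = 1.6e9       # attention, embeddings, norms: always loaded, always active
per_expert = 5.6e9   # one expert's FFN weights summed over all layers

total = shared + num_experts * per_expert   # must fit in GPU memory
active = shared + top_k * per_expert        # compute actually spent per token

print(f"memory:  ~{total / 1e9:.0f}B parameters")   # ~46B
print(f"compute: ~{active / 1e9:.0f}B per token")   # ~13B
```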

Inference Serving - MoE models benefit from expert parallelism, as different experts can run on different devices. Specialized serving frameworks optimize this.
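
To illustrate the idea (not how production frameworks implement it), the sketch below round-robins experts across GPUs and ships each expert’s routed tokens to its device before gathering the results; real systems rely on all-to-all communication and fused kernels instead. The function names and the two-GPU default are hypothetical.

```python
import torch

def place_experts(experts, num_gpus=2):
    """Naive expert parallelism: round-robin each expert module onto a GPU."""
    for e, expert in enumerate(experts):
        expert.to(f"cuda:{e % num_gpus}")

def dispatch(tokens, topk_idx, topk_probs, experts, num_gpus=2):
    """Run each expert on its own GPU over only the tokens routed to it."""
    out = torch.zeros_like(tokens)
    for e, expert in enumerate(experts):
        token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue
        device = f"cuda:{e % num_gpus}"
        expert_out = expert(tokens[token_ids].to(device)).to(tokens.device)
        out[token_ids] += topk_probs[token_ids, slot].unsqueeze(-1) * expert_out
    return out
```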

Batch Efficiency - MoE is most efficient with large batches. Different tokens in a batch route to different experts, maximizing utilization. Small batches waste capacity.
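
A small simulation makes the point, assuming (as an idealization) uniform top-2 routing over 8 experts: with only a few tokens in flight, several experts receive no work at all.

```python
import torch

num_experts, top_k = 8, 2
for num_tokens in (2, 8, 64, 512):
    # Pick top_k distinct experts per token uniformly at random.
    routes = torch.multinomial(torch.ones(num_tokens, num_experts), top_k)
    busy = routes.unique().numel()
    print(f"{num_tokens:4d} tokens -> {busy}/{num_experts} experts receive work")
```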

Quality Characteristics - MoE models sometimes show “modular” capabilities, being strong at certain tasks and weaker at others depending on expert specialization. Test on your specific use case.

When choosing between dense and MoE models of similar capability, consider your memory budget and batch sizes. MoE excels in high-throughput serving scenarios with sufficient memory.

Source

The Switch Transformer paper demonstrates that sparse MoE models, which activate only a single expert per token, can achieve large pre-training speedups over dense models with the same per-token compute while maintaining quality.

https://arxiv.org/abs/2101.03961