Architecture

Diffusion Model

Definition

A generative model that learns to reverse a gradual noising process, generating data by iteratively denoising from pure noise. The dominant architecture for AI image generation.

Why It Matters

Diffusion models power the AI image generation revolution. Stable Diffusion, Midjourney, and DALL-E (from version 2 onward) are all built on diffusion. Understanding this architecture is essential if you’re working with any form of AI-generated visual content.

Unlike autoregressive models (used for text), diffusion works by iterative refinement: starting from random noise and gradually removing it to reveal an image. This approach produces remarkably high-quality outputs and allows fine-grained control over generation.

For AI engineers, diffusion models are increasingly relevant beyond images. They’re being applied to video, audio, 3D models, and even text generation. The diffusion paradigm is expanding.

Implementation Basics

Core Process

  1. Forward process (training): Gradually add noise to data over T steps
  2. Learn to denoise: Train network to predict noise at each step
  3. Reverse process (generation): Start from pure noise, iteratively denoise
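
A minimal PyTorch sketch of steps 1-2 (the denoising network `model(x_t, t)`, the clean batch `x0`, and the optimizer are placeholders, and the linear noise schedule is just one common choice):

```python
import torch
import torch.nn.functional as F

# Linear noise schedule over T steps (one common choice; cosine schedules also exist).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative products, written ᾱ_t below

def train_step(model, x0, optimizer):
    """One training step: noise a clean batch x0 to a random timestep,
    then train `model` to predict the noise that was added."""
    t = torch.randint(0, T, (x0.shape[0],))            # a random timestep per sample
    noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise   # forward process in closed form
    loss = F.mse_loss(model(x_t, t), noise)            # learn to predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```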

Forward Process

x_t = √(α_t) × x_{t-1} + √(1-α_t) × noise

Because the α_t factors compound, any step can also be sampled in one shot from the original data: x_t = √(ᾱ_t) × x_0 + √(1-ᾱ_t) × noise, where ᾱ_t = α_1 × α_2 × … × α_t. After many steps ᾱ_t approaches zero and the data becomes indistinguishable from random noise.
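
A quick numerical check of how completely the signal is destroyed, assuming the common linear β schedule from the DDPM paper (other schedules behave similarly):

```python
import torch

# Same linear schedule as the training sketch above.
T = 1000
alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

# √(ᾱ_t) is the fraction of the original signal that survives at step t.
print(alpha_bars[0].sqrt())       # ≈ 1.0   (almost no noise at t = 1)
print(alpha_bars[T - 1].sqrt())   # ≈ 0.006 (well under 1% of the signal left at t = T)
```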

Reverse Process

x_{t-1} = (1/√(α_t)) × (x_t - ((1-α_t)/√(1-ᾱ_t)) × predicted_noise) + σ_t × noise

The model predicts the noise to remove at each step; σ_t controls how much fresh noise is re-injected, and none is added on the final step.
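
A minimal sketch of this step in PyTorch, assuming schedule tensors that match the forward process and the σ_t = √β_t choice from the DDPM paper:

```python
import torch

def ddpm_step(x_t, predicted_noise, t, betas, alphas, alpha_bars):
    """One reverse (denoising) step, implementing the formula above.
    The schedule tensors are assumed to match the forward process,
    e.g. the linear schedule from the training sketch."""
    alpha_t, ab_t = alphas[t], alpha_bars[t]
    mean = (x_t - (1.0 - alpha_t) / (1.0 - ab_t).sqrt() * predicted_noise) / alpha_t.sqrt()
    if t == 0:
        return mean                            # no fresh noise on the final step
    sigma_t = betas[t].sqrt()                  # one common choice of σ_t
    return mean + sigma_t * torch.randn_like(x_t)
```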

Key Components

  • U-Net backbone: Commonly used denoising network
  • Cross-attention: Incorporates text conditioning
  • Noise schedule: How quickly noise is added/removed
  • Classifier-free guidance: Balances prompt adherence vs. diversity
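
Of these, classifier-free guidance is simple enough to show inline: it is just a weighted combination of two noise predictions (the helper below is a sketch; the function name and default scale are illustrative):

```python
import torch

def classifier_free_guidance(noise_cond: torch.Tensor,
                             noise_uncond: torch.Tensor,
                             guidance_scale: float = 7.5) -> torch.Tensor:
    """Combine conditional and unconditional noise predictions.
    scale = 1 reduces to the plain conditional prediction; larger values
    trade diversity for prompt adherence."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# During sampling the denoiser is run twice per step, once with the text
# embeddings and once with an empty ("null") prompt, and the combined
# prediction above is used for the reverse step.
```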

Text-to-Image Pipeline

  1. Encode the text prompt with a text encoder (CLIP in Stable Diffusion)
  2. Start with random latent noise
  3. Iteratively denoise the latent with a U-Net conditioned on the text embeddings via cross-attention
  4. Decode the final latent to an image with the VAE decoder
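
With Hugging Face's diffusers library the whole pipeline is a single object; a sketch (the checkpoint ID is illustrative, and a CUDA GPU is assumed):

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint ID; any Stable Diffusion checkpoint exposes the same interface.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# One call runs the full pipeline: CLIP text encoding, iterative U-Net
# denoising in latent space, and VAE decoding back to pixels.
image = pipe(
    "an astronaut riding a horse on the moon",
    num_inference_steps=30,   # denoising iterations
    guidance_scale=7.5,       # classifier-free guidance strength
).images[0]
image.save("astronaut.png")
```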

Latent Diffusion (Stable Diffusion)

  • Operate in compressed latent space, not pixel space
  • VAE encodes images to latents
  • Much faster than pixel-space diffusion
  • Enables practical image generation
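
A sketch of the size reduction using a standalone Stable Diffusion VAE from diffusers (the repo ID and the 0.18215 latent scaling factor are the values commonly paired with SD 1.x; treat them as assumptions for other models):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.randn(1, 3, 512, 512)                   # stand-in for a real image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # shape (1, 4, 64, 64): ~48x fewer values
    latents = latents * 0.18215                       # scale latents for the diffusion model
    recon = vae.decode(latents / 0.18215).sample      # back to (1, 3, 512, 512)
```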

Sampling Methods

  • DDPM: Original, many steps (1000)
  • DDIM: Deterministic, fewer steps (50-100)
  • DPM-Solver: Fast, high quality (20-50 steps)
  • Euler, Heun: Alternative ODE solvers

Fewer steps = faster but potentially lower quality.
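
In diffusers the sampler is a swappable scheduler object, so switching from the default to DPM-Solver is a one-line change (checkpoint ID illustrative):

```python
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5")

# Swap the default scheduler for DPM-Solver++; the model weights are untouched.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# 20-30 steps is usually enough with DPM-Solver, versus 50+ for the default sampler.
image = pipe("a lighthouse at dusk", num_inference_steps=25).images[0]
```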

Practical Applications

  • Text-to-image generation
  • Image editing and inpainting
  • Super-resolution upscaling
  • Style transfer
  • Video generation (emerging)

Control Mechanisms

  • ControlNet: Add structural guidance (poses, edges, depth)
  • LoRA: Fine-tune for specific styles/subjects
  • Textual Inversion: Learn new concepts from few examples
  • Inpainting masks: Edit specific regions
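
As one illustration, attaching a ControlNet in diffusers looks roughly like this (the repo IDs are illustrative, and a real Canny edge map would replace the blank placeholder image):

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

edge_map = Image.new("RGB", (512, 512))   # placeholder; use a real Canny edge image here

# The ControlNet injects the edge structure into the U-Net at each denoising
# step, so the output follows the edges while matching the prompt.
image = pipe("a watercolor house", image=edge_map, num_inference_steps=30).images[0]
```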

Source

"Denoising Diffusion Probabilistic Models" (Ho et al., 2020): diffusion models achieve image quality rivaling GANs by learning to reverse a Markov chain that gradually adds noise to data until the signal is destroyed.

https://arxiv.org/abs/2006.11239