Diffusion Model
Definition
A generative model that learns to reverse a gradual noising process, generating data by iteratively denoising from pure noise. The dominant architecture for AI image generation.
Why It Matters
Diffusion models power the AI image generation revolution. Stable Diffusion, Midjourney, and DALL-E (from version 2 onward) are all built on diffusion. Understanding this architecture is essential if you're working with any form of AI-generated visual content.
Unlike autoregressive models (used for text), diffusion works by iterative refinement: starting from random noise and gradually removing it to reveal an image. This approach produces remarkably high-quality outputs and allows fine-grained control over generation.
For AI engineers, diffusion models are increasingly relevant beyond images. They’re being applied to video, audio, 3D models, and even text generation. The diffusion paradigm is expanding.
Implementation Basics
Core Process
- Forward process (training): Gradually add noise to data over T steps
- Learn to denoise: Train network to predict noise at each step
- Reverse process (generation): Start from pure noise, iteratively denoise
Forward Process
x_t = √(α_t) × x_{t-1} + √(1-α_t) × ε,  where ε ~ N(0, I)
Here α_t = 1 - β_t comes from the noise schedule. Because Gaussian noise composes, any step is reachable directly from the clean data, which is what makes training efficient:
x_t = √(ᾱ_t) × x_0 + √(1-ᾱ_t) × ε,  where ᾱ_t = ∏_{s≤t} α_s
After many steps, the data becomes indistinguishable from random noise.
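A minimal PyTorch sketch of this closed-form forward step, assuming the linear β schedule from the DDPM paper; the name q_sample and the constants are illustrative, not taken from any particular library:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule (DDPM defaults)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # ᾱ_t = ∏_{s≤t} α_s

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Jump directly from clean data x0 to noisy x_t via the closed form."""
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise
```

During training, a random t is drawn per example, x_t is produced in one shot, and the network is trained to recover the injected noise.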
Reverse Process
x_{t-1} = (x_t - (1-α_t)/√(1-ᾱ_t) × predicted_noise) / √(α_t) + σ_t × z
The model ε_θ(x_t, t) predicts the noise that was added at step t; its scaled prediction is subtracted out, and a small amount of fresh noise σ_t × z (z ~ N(0, I)) is re-injected, except at the final step.
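A matching sketch of one reverse step, where model stands in for any trained noise predictor ε_θ(x_t, t) and the schedule repeats the one defined above:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # same linear schedule as above
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def p_sample(model, x_t: torch.Tensor, t: int) -> torch.Tensor:
    """One reverse step x_t -> x_{t-1} (DDPM sampling rule)."""
    eps = model(x_t, t)                                     # predicted noise ε_θ(x_t, t)
    mean = (x_t - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean                                         # no fresh noise at the final step
    return mean + betas[t].sqrt() * torch.randn_like(x_t)   # σ_t² = β_t is one common choice
```

Generation starts from x_T ~ N(0, I) and applies p_sample for t = T-1 down to 0.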
Key Components
- U-Net backbone: Commonly used denoising network
- Cross-attention: Incorporates text conditioning
- Noise schedule: How quickly noise is added/removed
- Classifier-free guidance: Balances prompt adherence vs. diversity
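Classifier-free guidance reduces to a couple of lines at sampling time: run the noise predictor with and without the prompt and extrapolate between the two predictions. A sketch with illustrative names; a guidance_scale around 7-8 is a common default:

```python
def guided_noise(model, x_t, t, text_emb, null_emb, guidance_scale: float = 7.5):
    """Classifier-free guidance: push the conditional prediction away from the
    unconditional one. scale=1 disables guidance; larger values follow the
    prompt more closely at the cost of diversity."""
    eps_uncond = model(x_t, t, null_emb)   # prediction with an empty prompt
    eps_cond = model(x_t, t, text_emb)     # prediction with the actual prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```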
Text-to-Image Pipeline
- Encode the text prompt with a text encoder (CLIP's text encoder in Stable Diffusion v1)
- Start with random latent noise
- Iteratively denoise using U-Net
- U-Net conditioned on text embeddings via cross-attention
- Decode final latent to image
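With Hugging Face's diffusers library, the whole pipeline collapses to a few lines. A sketch assuming the runwayml/stable-diffusion-v1-5 checkpoint and a CUDA GPU are available:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Text encoding, latent denoising, and VAE decoding all happen inside this call.
image = pipe("an astronaut riding a horse", num_inference_steps=50).images[0]
image.save("astronaut.png")
```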
Latent Diffusion (Stable Diffusion)
- Operate in compressed latent space, not pixel space
- VAE encodes images to latents
- Much faster than pixel-space diffusion: a 512×512×3 image becomes a 64×64×4 latent, roughly 48× fewer values to denoise
- Enables practical image generation
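A sketch of the latent roundtrip using diffusers' AutoencoderKL, assuming a Stable Diffusion v1 VAE checkpoint; 0.18215 is that model family's latent scaling factor:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

images = torch.randn(1, 3, 512, 512)                    # stand-in for real RGB images in [-1, 1]
latents = vae.encode(images).latent_dist.sample() * 0.18215
print(latents.shape)                                    # torch.Size([1, 4, 64, 64])
decoded = vae.decode(latents / 0.18215).sample          # back to (1, 3, 512, 512)
```

The U-Net only ever sees the 64×64×4 latents; the VAE decode at the end turns the final latent into pixels.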
Sampling Methods
- DDPM: Original, many steps (1000)
- DDIM: Deterministic, fewer steps (50-100)
- DPM-Solver: Fast, high quality (20-50 steps)
- Euler, Heun: Alternative ODE solvers
Fewer steps = faster but potentially lower quality.
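In diffusers, the sampler is a swappable scheduler object, so moving from the default to DPM-Solver takes one line. A sketch, reusing the checkpoint assumption from above:

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap the sampler; the model weights are untouched, only the sampling loop changes.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("an astronaut riding a horse", num_inference_steps=25).images[0]
```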
Practical Applications
- Text-to-image generation
- Image editing and inpainting
- Super-resolution upscaling
- Style transfer
- Video generation (emerging)
Control Mechanisms
- ControlNet: Add structural guidance (poses, edges, depth)
- LoRA: Fine-tune for specific styles/subjects
- Textual Inversion: Learn new concepts from a few examples
- Inpainting masks: Edit specific regions
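A hedged ControlNet sketch with diffusers, assuming the lllyasviel/sd-controlnet-canny checkpoint; edge_map stands in for a Canny edge image you prepare yourself:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

edge_map = Image.open("edges.png")   # precomputed Canny edges of a reference image
# The edge map constrains structure; the prompt controls content and style.
image = pipe("a stained-glass cathedral window", image=edge_map).images[0]
```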
Source
Ho, Jain & Abbeel (2020) show that denoising diffusion probabilistic models achieve image quality rivaling GANs by learning to reverse a Markov chain that gradually adds noise to data until the signal is destroyed.
https://arxiv.org/abs/2006.11239