Architecture

Diffusion Model

Definition

A generative model that learns to reverse a gradual noising process, generating data by iteratively denoising from pure noise. The dominant architecture for AI image generation.

Why It Matters

Diffusion models power the AI image generation revolution. Stable Diffusion, Midjourney, and DALL-E (from version 2 onward) are all built on diffusion. Understanding this architecture is essential if you’re working with any form of AI-generated visual content.

Unlike autoregressive models (used for text), diffusion works by iterative refinement: starting from random noise and gradually removing it to reveal an image. This approach produces remarkably high-quality outputs and allows fine-grained control over generation.

For AI engineers, diffusion models are increasingly relevant beyond images. They’re being applied to video, audio, 3D models, and even text generation. The diffusion paradigm is expanding.

Implementation Basics

Core Process

  1. Forward process (training): Gradually add noise to data over T steps
  2. Learn to denoise: Train network to predict noise at each step
  3. Reverse process (generation): Start from pure noise, iteratively denoise
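
A minimal PyTorch sketch of steps 1-2 (the denoising network `model(x_t, t)`, the clean batch `x0`, and the optimizer are placeholders, and the linear noise schedule is just one common choice):

```python
import torch
import torch.nn.functional as F

# Linear noise schedule over T steps (one common choice; cosine schedules also exist).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative products, written ᾱ_t below

def train_step(model, x0, optimizer):
    """One training step: noise a clean batch x0 to a random timestep,
    then train `model` to predict the noise that was added."""
    t = torch.randint(0, T, (x0.shape[0],))            # a random timestep per sample
    noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise   # forward process in closed form
    loss = F.mse_loss(model(x_t, t), noise)            # learn to predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```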

Forward Process

x_t = √(α_t) × x_{t-1} + √(1-α_t) × noise

Because the α_t factors compound, any step can also be sampled in one shot from the original data: x_t = √(ᾱ_t) × x_0 + √(1-ᾱ_t) × noise, where ᾱ_t = α_1 × α_2 × … × α_t. After many steps ᾱ_t approaches zero and the data becomes indistinguishable from random noise.
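
A quick numerical check of how completely the signal is destroyed, assuming the common linear β schedule from the DDPM paper (other schedules behave similarly):

```python
import torch

# Same linear schedule as the training sketch above.
T = 1000
alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

# √(ᾱ_t) is the fraction of the original signal that survives at step t.
print(alpha_bars[0].sqrt())       # ≈ 1.0   (almost no noise at t = 1)
print(alpha_bars[T - 1].sqrt())   # ≈ 0.006 (well under 1% of the signal left at t = T)
```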

Reverse Process

x_{t-1} = (1/√(α_t)) × (x_t - ((1-α_t)/√(1-ᾱ_t)) × predicted_noise) + σ_t × noise

The model predicts the noise to remove at each step; σ_t controls how much fresh noise is re-injected, and none is added on the final step.
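
A minimal sketch of this step in PyTorch, assuming schedule tensors that match the forward process and the σ_t = √β_t choice from the DDPM paper:

```python
import torch

def ddpm_step(x_t, predicted_noise, t, betas, alphas, alpha_bars):
    """One reverse (denoising) step, implementing the formula above.
    The schedule tensors are assumed to match the forward process,
    e.g. the linear schedule from the training sketch."""
    alpha_t, ab_t = alphas[t], alpha_bars[t]
    mean = (x_t - (1.0 - alpha_t) / (1.0 - ab_t).sqrt() * predicted_noise) / alpha_t.sqrt()
    if t == 0:
        return mean                            # no fresh noise on the final step
    sigma_t = betas[t].sqrt()                  # one common choice of σ_t
    return mean + sigma_t * torch.randn_like(x_t)
```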

Key Components

  • U-Net backbone: Commonly used denoising network
  • Cross-attention: Incorporates text conditioning
  • Noise schedule: How quickly noise is added/removed
  • Classifier-free guidance: Balances prompt adherence vs. diversity
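
Of these, classifier-free guidance is simple enough to show inline: it is just a weighted combination of two noise predictions (the helper below is a sketch; the function name and default scale are illustrative):

```python
import torch

def classifier_free_guidance(noise_cond: torch.Tensor,
                             noise_uncond: torch.Tensor,
                             guidance_scale: float = 7.5) -> torch.Tensor:
    """Combine conditional and unconditional noise predictions.
    scale = 1 reduces to the plain conditional prediction; larger values
    trade diversity for prompt adherence."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# During sampling the denoiser is run twice per step, once with the text
# embeddings and once with an empty ("null") prompt, and the combined
# prediction above is used for the reverse step.
```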

Text-to-Image Pipeline

  1. Encode the text prompt with a text encoder (CLIP in Stable Diffusion)
  2. Start with random latent noise
  3. Iteratively denoise the latent with a U-Net conditioned on the text embeddings via cross-attention
  4. Decode the final latent to an image with the VAE decoder
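
With Hugging Face's diffusers library the whole pipeline is a single object; a sketch (the checkpoint ID is illustrative, and a CUDA GPU is assumed):

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint ID; any Stable Diffusion checkpoint exposes the same interface.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# One call runs the full pipeline: CLIP text encoding, iterative U-Net
# denoising in latent space, and VAE decoding back to pixels.
image = pipe(
    "an astronaut riding a horse on the moon",
    num_inference_steps=30,   # denoising iterations
    guidance_scale=7.5,       # classifier-free guidance strength
).images[0]
image.save("astronaut.png")
```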

Latent Diffusion (Stable Diffusion)

  • Operate in compressed latent space, not pixel space
  • VAE encodes images to latents
  • Much faster than pixel-space diffusion
  • Enables practical image generation
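
A sketch of the size reduction using a standalone Stable Diffusion VAE from diffusers (the repo ID and the 0.18215 latent scaling factor are the values commonly paired with SD 1.x; treat them as assumptions for other models):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.randn(1, 3, 512, 512)                   # stand-in for a real image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # shape (1, 4, 64, 64): ~48x fewer values
    latents = latents * 0.18215                       # scale latents for the diffusion model
    recon = vae.decode(latents / 0.18215).sample      # back to (1, 3, 512, 512)
```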

Sampling Methods

  • DDPM: Original, many steps (1000)
  • DDIM: Deterministic, fewer steps (50-100)
  • DPM-Solver: Fast, high quality (20-50 steps)
  • Euler, Heun: Alternative ODE solvers

Fewer steps = faster but potentially lower quality.
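
In diffusers the sampler is a swappable scheduler object, so switching from the default to DPM-Solver is a one-line change (checkpoint ID illustrative):

```python
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5")

# Swap the default scheduler for DPM-Solver++; the model weights are untouched.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# 20-30 steps is usually enough with DPM-Solver, versus 50+ for the default sampler.
image = pipe("a lighthouse at dusk", num_inference_steps=25).images[0]
```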

Practical Applications

  • Text-to-image generation
  • Image editing and inpainting
  • Super-resolution upscaling
  • Style transfer
  • Video generation (emerging)

Control Mechanisms

  • ControlNet: Add structural guidance (poses, edges, depth)
  • LoRA: Fine-tune for specific styles/subjects
  • Textual Inversion: Learn new concepts from few examples
  • Inpainting masks: Edit specific regions
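
As one illustration, attaching a ControlNet in diffusers looks roughly like this (the repo IDs are illustrative, and a real Canny edge map would replace the blank placeholder image):

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

edge_map = Image.new("RGB", (512, 512))   # placeholder; use a real Canny edge image here

# The ControlNet injects the edge structure into the U-Net at each denoising
# step, so the output follows the edges while matching the prompt.
image = pipe("a watercolor house", image=edge_map, num_inference_steps=30).images[0]
```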

Source

"Denoising Diffusion Probabilistic Models" (Ho et al., 2020): diffusion models achieve image quality rivaling GANs by learning to reverse a Markov chain that gradually adds noise to data until the signal is destroyed.

https://arxiv.org/abs/2006.11239