Stable Diffusion: A Tutorial
It can turn text prompts (e.g. “an astronaut riding a horse”) into images.
It can also do a variety of other things!
You may have also heard of DALL·E 2, which works in a similar way.
Why should we care?
Could be a model of imagination.
It’s complicated… but here’s the high-level idea.
Making a ‘good’ generative model is about making all these parts work together well!
Stable Diffusion in Action
Cartoon made with Stable Diffusion img2img + After Effects
https://www.reddit.com/r/StableDiffusion/comments/xcjj7u/sd_img2img_after_effects_i_generated_2_images_and/
Some Resources
• Diffusion models in general
  • What are Diffusion Models? | Lil'Log
  • Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song
• Stable Diffusion
  • Annotated & simplified code: U-Net for Stable Diffusion (labml.ai)
  • Illustrations: The Illustrated Stable Diffusion – Jay Alammar
• Attention & Transformers
  • The Illustrated Transformer
Outline
• Stable Diffusion is cool!
• Build Stable Diffusion “from Scratch”
• Principle of Diffusion models (sampling, learning)
• Diffusion for Images – UNet architecture
• Understanding prompts – Word as vectors, CLIP
• Let words modulate diffusion – Conditional Diffusion, Cross Attention
• Diffusion in latent space – AutoEncoderKL
• Training on a massive dataset – LAION-5B
• Let’s try it ourselves.
Principle of Diffusion Models
Learning to generate by iterative denoising.
“Creating noise from data is easy;
Creating data from noise is generative modeling.”
-- Yang Song
Diffusion models
• Forward diffusion (noising): $x_0 \to x_1 \to \cdots \to x_T$
• Take a data distribution $x_0 \sim p(x)$ and turn it into noise by diffusion: $x_T \sim \mathcal{N}(0, \sigma^2 I)$.
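A minimal sketch of the forward (noising) process, assuming a DDPM-style variance-preserving schedule; the schedule values and names (`betas`, `alphas_cumprod`, `q_sample`) are illustrative, not the actual Stable Diffusion code.

```python
import torch

# Illustrative noise schedule (DDPM-style)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # per-step noise variances
alphas_cumprod = torch.cumprod(1.0 - betas, 0)   # \bar{alpha}_t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in closed form (forward diffusion)."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

# For t close to T, x_t is almost pure Gaussian noise.
```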
[Figure: the score vector field, and reverse diffusion guided by the score vector field]
https://yang-song.net/blog/2021/score/
Training a diffusion model = learning to denoise
• If we can learn a score model $f_\theta(x, t) \approx \nabla_x \log p(x, t)$,
• then we can denoise samples by running the reverse diffusion equation: $x_t \to x_{t-1}$.
https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
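A hedged sketch of what “learning to denoise” looks like in practice, assuming the common noise-prediction (epsilon) parameterization; `model` is a placeholder UNet, and the schedule and `q_sample` come from the forward-diffusion sketch above.

```python
import torch
import torch.nn.functional as F

def denoise_loss(model, x0, T=1000):
    """Train the network to predict the noise added at a random step t
    (equivalent, up to scaling, to learning the score)."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)                  # forward diffusion (sketch above)
    return F.mse_loss(model(x_t, t), noise)

@torch.no_grad()
def reverse_step(model, x_t, t):
    """One reverse-diffusion step x_t -> x_{t-1}, DDPM-style."""
    beta, a_bar = betas[t], alphas_cumprod[t]     # schedule from the sketch above
    eps = model(x_t, torch.full((x_t.shape[0],), t, device=x_t.device))
    mean = (x_t - beta / (1 - a_bar).sqrt() * eps) / (1 - beta).sqrt()
    return mean + beta.sqrt() * torch.randn_like(x_t) if t > 0 else mean
```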
Diffusion vs GAN / VAE
GAN
• One-shot generation. Fast.
• Harder to control in one pass.
• Adversarial min-max objective. Can collapse.

Diffusion
• Multi-iteration generation. Slow.
• Easier to control during generation.
• Simple objective, no adversary in training.
Activation maximization ~
Reverse Diffusion
$z_{t+1} \leftarrow z_t + \nabla f(G(z_t)) + \epsilon$
• Motivation
  • Features are translation invariant.
  • Extract features at different scales / abstraction levels.
• Key modules
  • Convolution
  • Downsampling (max-pool)
VGG
CNN + inverted CNN ⇒ UNet
• An inverted CNN (generator) can generate images.
• Down/Up sampling
• Multiscale / hierarchy
  • Learn modulation at multiple scales and abstraction levels.
• Skip connections
  • No bottleneck.
  • Route features of the same scale directly.
  • Cf. an AutoEncoder, which has a bottleneck.
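A toy sketch (not the real Stable Diffusion UNet) showing the down/up sampling structure and a skip connection that routes same-scale features directly; the channel sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy UNet: downsample, upsample, and a skip connection that routes
    features of the same scale directly (no bottleneck as in a plain AE)."""
    def __init__(self, ch=3, base=32):
        super().__init__()
        self.down1 = nn.Conv2d(ch, base, 3, padding=1)
        self.down2 = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)            # downsample
        self.up1 = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)     # upsample
        self.out = nn.Conv2d(base * 2, ch, 3, padding=1)                          # takes concat

    def forward(self, x):
        h1 = torch.relu(self.down1(x))                 # full resolution
        h2 = torch.relu(self.down2(h1))                # half resolution
        u1 = torch.relu(self.up1(h2))                  # back to full resolution
        return self.out(torch.cat([u1, h1], dim=1))    # skip connection
```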
Note: Add Time Dependency
• The score function is time-dependent.
• Target: $s(x, t) = \nabla_x \log p(x, t)$
• Add time dependency: embed $t$ as $[\sin(\omega_i t), \cos(\omega_i t), \ldots]$, pass it through a Linear/MLP, and add it ($\oplus$) to the conv feature tensor.
• Assume the time dependency is spatially homogeneous.
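A minimal sketch of the sinusoidal time embedding and how it is broadcast-added to the conv features; the dimensions and the small MLP are illustrative.

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim=128):
    """Sinusoidal embedding [sin(w_i t), cos(w_i t), ...] of the step t."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half).float() / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# Pass the embedding through a small Linear/MLP and add it to the conv feature
# tensor, broadcast over space (the time dependency is spatially homogeneous).
time_mlp = nn.Sequential(nn.Linear(128, 256), nn.SiLU(), nn.Linear(256, 256))
t = torch.randint(0, 1000, (8,))
emb = time_mlp(timestep_embedding(t))        # (8, 256)
feat = torch.randn(8, 256, 32, 32)           # conv feature map
feat = feat + emb[:, :, None, None]          # broadcast-add over H, W
```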
RNN / Transformers
• The meaning of a word depends on context; it is not always the same.
  • “I book a ticket to buy that book.”
• Transformers let each word “absorb” influence from other words to become “contextualized”.
[Figure: stacked Transformer Blocks]
• Maximize representation similarity between an image and its caption.
[Figure: Vision Transformer]
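A schematic of the CLIP-style contrastive objective behind “maximize similarity between an image and its caption”; the encoders are omitted and `image_emb` / `text_emb` are placeholder batch embeddings.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Contrastive objective: each image should be most similar to its own
    caption (and vice versa) among all pairs in the batch."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # (B, B) cosine similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```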
• Face attributes (e.g. {female, blonde hair, with glasses, …}, {male, short hair, dark skin}):
  • a set of vectors, one vector per attribute
• Time to be creative!!
How does text affect diffusion?
Incoming Cross Attention
Origin of Attention:
Machine Translation (Seq2Seq)
[Figure: Seq2Seq machine translation. Original sentence “I love cats and dogs.” → encoder hidden states (word vectors) $e_1, \ldots, e_6$; French translation “J'adore les chats et les chiens.” → decoder hidden states (word vectors) $h_1, h_2, h_3$]
• Hard indexing: `dic[2]`
  • Query 2: find 2 in the keys, get the corresponding value.
  • As a matrix-vector product: $[v_1, v_2, v_3] \cdot [0, 1, 0]^\top = v_2$
• Soft indexing
  • Define an attention distribution $a$ over the keys $\{1, 2, 3\}$.
  • $[v_1, v_2, v_3] \cdot [0.1, 0.8, 0.1]^\top = 0.1\, v_1 + 0.8\, v_2 + 0.1\, v_3$
  • Still a matrix-vector product.
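A small sketch of attention as “soft indexing”: instead of picking one dictionary entry, a softmax distribution over the keys mixes the values. The names here are illustrative.

```python
import torch

def soft_lookup(query, keys, values):
    """Attention as soft indexing: a softmax distribution over the keys
    weights the values, instead of selecting exactly one entry."""
    scores = query @ keys.t() / keys.shape[-1] ** 0.5   # (n_q, n_k) similarities
    attn = torch.softmax(scores, dim=-1)                # each row sums to 1
    return attn @ values                                # e.g. 0.8*v2 + 0.1*v1 + 0.1*v3

# Hard lookup dic[2] is the special case where attn is one-hot on key 2.
```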
Attention + RNN
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Cross & Self Attention
• Cross Attention
  • Tokens in one language pay attention to tokens in another.
  [Figure: French translation “J'adore les chats” → decoder hidden states (word vectors) $h_1, h_2, h_3$]
• Self Attention ($e_i = h_i$)
https://jalammar.github.io/illustrated-gpt2/
“A robot must obey the order given it.”
Note: Feed Forward network
[Figure: the encoded latent state of the image is a set of patch vectors (spatial dimensions flattened into a sequence dimension, plus a channel dimension); the prompt “A ballerina chasing her cat running …” becomes word vectors.]
• `einops.rearrange` function
  • Shift the order of axes.
  • Split / combine dimensions.
• `torch.einsum` function
  • Multiply & sum tensors along axes.
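A short sketch of the two utilities in use: flattening a latent tensor into a sequence of patch vectors and computing attention-style scores. The shapes (and the 77-token prompt length) are illustrative.

```python
import torch
from einops import rearrange

# Flatten the spatial grid of a latent tensor into a sequence of patch vectors
x = torch.randn(2, 4, 64, 64)                           # (batch, channels, H, W)
seq = rearrange(x, 'b c h w -> b (h w) c')              # (2, 4096, 4)

# einsum: multiply & sum along chosen axes, e.g. query-key attention scores
q = torch.randn(2, 4096, 64)
k = torch.randn(2, 77, 64)                              # 77 = prompt token count (illustrative)
scores = torch.einsum('b i d, b j d -> b i j', q, k)    # (2, 4096, 77)

# and combine dimensions back into a spatial map
x_back = rearrange(seq, 'b (h w) c -> b c h w', h=64, w=64)
```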
UNet = Giant Sandwich of ResBlock + SpatialTransformer
• Down blocks and Up blocks stack ResBlock → SpatialTransformer pairs, with DownSample/UpSample layers between scales.
[Figure: the latent tensor (4, 64, 64) flows through ResBlock / SpatialTransformer / ResBlock / SpatialTransformer / … / UpSample; the word vectors ($L_{seq}$, 768) feed every SpatialTransformer.]
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR.
Diffusion in latent space
• Motivation:
  • Natural images are high-dimensional,
  • but have many redundant details that could be compressed / statistically filled in.
  [Figure: downsampling. 32 pix ($d = 2352$) vs. 180 pix ($d = 97200$)]
• Division of labor
  • Diffusion model → generate a low-resolution sketch.
  • AutoEncoder → fill in the high-resolution details.
[Figure: face samples from CelebA-HQ and ImageNet]
Spatial Compression Tradeoff
• LDM-{$f$}: $f$ = spatial downsampling factor.
• Too little compression ($f = 1, 2$) or too much compression ($f = 32$) makes diffusion hard to train.
Details in Stable Diffusion
• In Stable Diffusion, the spatial downsampling factor is $f = 8$.
• KL regularizer
  • Similar to a VAE: pushes the latent distribution toward a Gaussian.
• VQ regularizer
  • Quantizes the latent representation into a set of discrete tokens.
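A minimal sketch of the latent round trip with the diffusers `AutoencoderKL`; the model id and version details are assumptions, but it illustrates the $f = 8$ mapping from a 512×512 image to a 4×64×64 latent.

```python
import torch
from diffusers import AutoencoderKL

# Encode / decode round trip through the KL-regularized autoencoder
# (model id and API details per diffusers; treat them as assumptions).
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

img = torch.randn(1, 3, 512, 512)                    # a 512x512 RGB image in [-1, 1]
with torch.no_grad():
    latent = vae.encode(img).latent_dist.sample()    # (1, 4, 64, 64): f = 8 downsampling
    recon = vae.decode(latent).sample                # back to (1, 3, 512, 512)
```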
Let the GPUs roar!
Training data & details.
Large Data Training
• https://laion.ai/blog/laion-5b/
Diffusion Process Visualized
Meaning of latent space
• `z[0:3, :, :]`: visualize the first 3 of the 4 latent channels as an RGB-like image.
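A hedged visualization sketch, assuming the `latent` tensor from the VAE sketch above: treat the first 3 channels of `z` as an RGB image (a rough peek at the latent space, not an actual decoding).

```python
import matplotlib.pyplot as plt

# Rough peek at the latent: show the first 3 of its 4 channels as RGB.
z = latent[0]                                          # (4, 64, 64), from the VAE sketch above
rgb = z[0:3, :, :]                                     # z[0:3, :, :]
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min())      # normalize to [0, 1] for display
plt.imshow(rgb.permute(1, 2, 0).cpu().numpy())
plt.axis("off")
plt.show()
```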