Stable Diffusion: A Tutorial


Stable Diffusion

Binxu Wang, John Vastola


Machine Learning from Scratch
Nov. 1st, 2022
What’s the deal with all these pictures?
These pictures were generated by Stable Diffusion,
a recent diffusion generative model.

It can turn text prompts (e.g. “an astronaut riding a horse”) into images.
It can also do a variety of other things!

You may have also heard of DALL·E 2, which works in a similar way.
Why should we care?
Could be a model of imagination.

Similar techniques could be used to generate any number of things (e.g. neural data).

It's cool!

"a lovely cat running in the desert in Van Gogh style, trending art."
How does it work?

It's complicated…
but here's the high-level idea.

"Batman eating pizza in a diner"
What do we need?

1. Method of learning to generate new stuff given many examples

"bad stick figure drawing"
Example pictures of people


What do we need?

2. Way to link text and images

"cool professor person"

3. Way to compress images (for speed in training and generation)

z[0:3, :, :]
What do we need?

4. Way to add in good image-related inductive biases…

…since when you're generating something new, you need a way to safely go beyond the images you've seen before.
What do we need?
1. Method of learning to generate new stuff: Forward/reverse diffusion

2. Way to link text and images: Text-image representation model

3. Way to compress images: Autoencoder

4. Way to add in good inductive biases: U-Net + 'attention' architecture

Making a ‘good’ generative model is about making all these parts work together well!
Stable Diffusion in Action
Cartoon made with Stable Diffusion img2img + After Effects

https://www.reddit.com/r/StableDiffusion/comments/xcjj7u/sd_img2img_after_effects_i_generated_2_images_and/
Some Resources
• Diffusion models in general
• What are Diffusion Models? | Lil'Log
• Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song
• Stable diffusion
• Annotated & simplified code: U-Net for Stable Diffusion (labml.ai)
• Illustrations: The Illustrated Stable Diffusion – Jay Alammar
• Attention & Transformers
• The Illustrated Transformer
Outline
• Stable Diffusion is cool!
• Build Stable Diffusion “from Scratch”
• Principle of Diffusion models (sampling, learning)
• Diffusion for Images – UNet architecture
• Understanding prompts – Word as vectors, CLIP
• Let words modulate diffusion – Conditional Diffusion, Cross Attention
• Diffusion in latent space – AutoEncoderKL
• Training on a massive dataset – LAION 5B
• Let's try it ourselves.
Principle of Diffusion Models
Learning to generate by iterative denoising.
“Creating noise from data is easy;
Creating data from noise is generative modeling.”

-- Yang Song
Diffusion models
• Forward diffusion (noising)
• x_0 → x_1 → ⋯ → x_T
• Take a data distribution x_0 ~ p(x) and turn it into noise by diffusion: x_T ~ N(0, σ²I)

• Reverse diffusion (denoising)
• x_T → x_{T−1} → ⋯ → x_0
• Sample from the noise distribution x_T ~ N(0, σ²I) and reverse the diffusion process to generate data x_0 ~ p(x)
Math Formalism

• For a forward diffusion process

  dx = f(x, t) dt + g(t) dw

• there is a backward diffusion process that reverses time:

  dx = [f(x, t) − g(t)² ∇_x log p(x, t)] dt + g(t) dw

• If we know the time-dependent score function ∇_x log p(x, t),
• then we can reverse the diffusion process (a sampler sketch follows below).
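A minimal sketch of how this reverse process can be simulated numerically (an Euler-Maruyama discretization), assuming a learned `score_model(x, t)` that approximates ∇_x log p(x, t); the zero drift and constant noise scale below are simplifying assumptions, not the Stable Diffusion schedule.

```python
import torch

def reverse_diffusion_sample(score_model, shape, n_steps=1000, sigma=1.0):
    """Integrate the reverse SDE from t = 1 down to t = 0, assuming f(x, t) = 0
    and a constant g(t) = sigma (a crude variance-exploding choice)."""
    dt = 1.0 / n_steps
    x = torch.randn(shape) * sigma                           # start from the noise distribution x_T
    for i in range(n_steps, 0, -1):
        t = torch.full((shape[0],), i / n_steps)
        score = score_model(x, t)                            # learned approximation of the score
        x = x + (sigma ** 2) * score * dt                    # drift term of the reverse SDE
        x = x + sigma * (dt ** 0.5) * torch.randn_like(x)    # diffusion (noise) term
    return x
```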
Animation for the Reverse Diffusion

Score vector field | Reverse diffusion guided by the score vector field
https://yang-song.net/blog/2021/score/
Training a diffusion model =
Learning to denoise
• If we can learn a score model

  f_θ(x, t) ≈ ∇_x log p(x, t),

• then we can denoise samples by running the reverse diffusion equation: x_t → x_{t−1}

• Score model f_θ : 𝒳 × [0, 1] → 𝒳
• A time-dependent vector field over x space.

• Training objective: infer the noise from a noised sample

  x ~ p(x), ε ~ N(0, I), t ∈ [0, 1]
  min_θ ‖ε + f_θ(x + σ(t) ε, t)‖₂²

• Add Gaussian noise ε to an image x with scale σ(t); learn to infer the noise ε (a training sketch follows below).
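A minimal sketch of this objective in PyTorch, assuming a `score_model(x_noised, t)` and a noise schedule `sigma_fn(t)`; the names are illustrative, not the Stable Diffusion training code.

```python
import torch

def denoising_loss(score_model, x, sigma_fn):
    """x: clean images [B, C, H, W]; sigma_fn(t): noise scale sigma(t) for t in [0, 1]."""
    t = torch.rand(x.shape[0], device=x.device)               # t ~ Uniform[0, 1]
    eps = torch.randn_like(x)                                 # epsilon ~ N(0, I)
    sigma = sigma_fn(t).view(-1, 1, 1, 1)                     # broadcast sigma(t) over pixels
    x_noised = x + sigma * eps                                # noised sample x + sigma(t) * eps
    pred = score_model(x_noised, t)                           # f_theta(x + sigma(t) eps, t)
    return ((eps + pred) ** 2).flatten(1).sum(dim=1).mean()   # ||eps + f_theta||_2^2, batch mean
```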
Conditional denoising

• Infer the noise from a noised sample, based on a condition y

  (x, y) ~ p(x, y), ε ~ N(0, I), t ∈ [0, 1]
  min_θ ‖ε − f_θ(x + σ(t) ε, y, t)‖₂²

• Conditional score model f_θ : 𝒳 × 𝒴 × [0, 1] → 𝒳

• Use a UNet to model the image-to-image mapping.
• Modulate the UNet with the condition (text prompt).
Comparing Generative Models

https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Diffusion vs GAN / VAE

GAN:
• One-shot generation. Fast.
• Harder to control in one pass.
• Adversarial min-max objective. Can collapse.

Diffusion:
• Multi-iteration generation. Slow.
• Easier to control during generation.
• Simple objective, no adversary in training.
Activation maximization ~ Reverse Diffusion

• For a neuron, activation maximization can be realized by gradient ascent

  z_{t+1} ← z_t + ∇_z f(G(z_t)) + ε

• Homologous to the reverse diffusion equation.

• Idea: neuron activation defines a generative model on image space.
Modelling the Score Function over the Image Domain
Introducing UNet

Convolutional Neural Network
• CNN parametrizes a function over images

• Motivation
• Features are translation invariant
• Extract features at different scales / abstraction levels
  (features of larger scale, i.e. larger receptive field, sit at a higher abstraction level)

• Key modules
• Convolution
• Downsampling (max-pool)

(Example architecture: VGG)
CNN + inverted CNN ⇒ UNet

• An inverted CNN (generator) can generate images.

• CNN + inverted CNN could model an Image → Image function.

(Downsampling / Convolution on the way down; Upsampling / Transposed Convolution on the way up)
UNet: a natural architecture for image-to-image functions

• Skip connections: transporting information at the same resolution.

• Down(sampling) side: Encoder
• Up(sampling) side: Decoder
Key Ingredients of UNet
• Convolution operation
• Saves parameters; spatially invariant

• Down/Up sampling
• Multiscale / hierarchy
• Learn modulation at multiple scales and abstraction levels.

• Skip connections
• No bottleneck
• Route features of the same scale directly.
• Cf. an AutoEncoder, which has a bottleneck.

(a minimal UNet sketch follows below)
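To make this structure concrete, here is a tiny UNet sketch in PyTorch; the layer sizes are illustrative and this is not the Stable Diffusion UNet.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=3, base=32):
        super().__init__()
        self.enc1 = nn.Conv2d(ch, base, 3, padding=1)          # full-resolution features
        self.enc2 = nn.Conv2d(base, base * 2, 3, padding=1)    # half-resolution features
        self.mid = nn.Conv2d(base * 2, base * 2, 3, padding=1)
        self.dec2 = nn.Conv2d(base * 4, base, 3, padding=1)    # upsampled + skip (concatenated)
        self.dec1 = nn.Conv2d(base * 2, ch, 3, padding=1)      # upsampled + skip (concatenated)
        self.down = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.act = nn.SiLU()

    def forward(self, x):
        h1 = self.act(self.enc1(x))                            # [B, base,   H,   W]
        h2 = self.act(self.enc2(self.down(h1)))                # [B, 2*base, H/2, W/2]
        m = self.act(self.mid(self.down(h2)))                  # [B, 2*base, H/4, W/4]
        u2 = self.act(self.dec2(torch.cat([self.up(m), h2], dim=1)))   # skip from enc2
        u1 = self.dec1(torch.cat([self.up(u2), h1], dim=1))            # skip from enc1
        return u1

x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)   # torch.Size([1, 3, 64, 64]): same spatial shape as the input
```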
Note: Add Time Dependency
• The score function is time-dependent.
• Target: s(x, t) = ∇_x log p(x, t)

• Add time dependency
• Assume the time dependency is spatially homogeneous.
• Add one scalar value per channel, f(t).
• Parametrize f(t) by an MLP / linear map of a Fourier basis [sin(ω_i t), cos(ω_i t), …].

(Figure: the t embedding passes through a Linear/MLP and is added (⊕) per channel to the conv tensor; a sketch follows below)
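A minimal sketch of this time conditioning: a Fourier (sinusoidal) embedding of t is mapped through a small MLP and added as one scalar per channel to a conv feature map. Dimensions and names are illustrative, not the exact Stable Diffusion modules.

```python
import math
import torch
import torch.nn as nn

def fourier_time_embedding(t, dim=64, max_freq=1000.0):
    """t: [B] in [0, 1] -> [B, dim] features [sin(w_i t), cos(w_i t), ...]."""
    freqs = torch.exp(torch.linspace(0.0, math.log(max_freq), dim // 2))
    angles = t[:, None] * freqs[None, :]                      # [B, dim/2]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)    # [B, dim]

time_mlp = nn.Sequential(nn.Linear(64, 128), nn.SiLU(), nn.Linear(128, 32))

t = torch.rand(8)                                 # a batch of time values
feat = torch.randn(8, 32, 16, 16)                 # a conv feature map with 32 channels
scale = time_mlp(fourier_time_embedding(t))       # one scalar per channel: [8, 32]
feat = feat + scale[:, :, None, None]             # broadcast-add over the spatial dims
```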


UNet in Stable Diffusion

(conv_in): Conv2d(4, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_proj): Timesteps()
(time_embedding): TimestepEmbedding(
  (linear_1): Linear(in_features=320, out_features=1280, bias=True)
  (act): SiLU()
  (linear_2): Linear(in_features=1280, out_features=1280, bias=True)
)
(down_blocks):
  (0): CrossAttnDownBlock2D
  (1): CrossAttnDownBlock2D
  (2): CrossAttnDownBlock2D
  (3): DownBlock2D
(up_blocks):
  (0): UpBlock2D
  (1): CrossAttnUpBlock2D
  (2): CrossAttnUpBlock2D
  (3): CrossAttnUpBlock2D
(mid_block): UNetMidBlock2DCrossAttn
  (attentions):
  (resnets):
(conv_norm_out): GroupNorm(32, 320, eps=1e-05, affine=True)
(conv_act): SiLU()
(conv_out): Conv2d(320, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
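One way to reproduce a module listing like the one above, assuming the `diffusers` library and access to the Stable Diffusion v1 weights (the model id is an example):

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
print(unet)   # prints conv_in, time_embedding, down/mid/up blocks, conv_out, ...
```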
How to understand prompts?
Language / Multimodal Transformers, CLIP!

Words as Vectors: Language Model 101
• Unlike pixels, the meaning of a word is not explicit in its characters.
• A word can be represented as an index into a dictionary,
• but an index by itself is also meaningless.
• Instead, represent words in a vector space:
• vector geometry => semantic relations.

Words in a sentence: I love cats and dogs .
Token indices:       328, 793, 3989, 537, 3255, 269
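A small sketch of turning a prompt into token indices, assuming the Hugging Face `transformers` CLIP tokenizer (the exact indices depend on the tokenizer vocabulary):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
tokens = tokenizer("I love cats and dogs.")
print(tokens["input_ids"])   # a list of token indices, including start/end tokens
```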
Word Vectors in Context: RNNs / Transformers
• The meaning of a word depends on context; it is not always the same.
• "I book a ticket to buy that book."

• Word vectors should therefore depend on context.

• Transformers let each word "absorb" influence from other words to become "contextualized".

More on attention later…

(Figure: "I love cats and dogs ." passing through N stacked Transformer blocks)
Learning Word Vectors: GPT, BERT & CLIP
• Self-supervised learning of word representations

• Predicting missing / next words in a sentence (BERT, GPT)

• Contrastive learning, matching image and text (CLIP)

• A downstream classifier can decode: part of speech, sentiment, …

(Figure: MLM – Sentence-Transformers documentation (sbert.net))


Joint Representation for Vision and Language: CLIP
• Learn a joint encoding space for text captions and images
  (text → Transformer; image → Vision Transformer)

• Maximize the representation similarity between an image and its caption.

• Minimize it for other pairs.

(CLIP paper, 2021)


Choice of text encoding
• Encoder in Stable Diffusion: pre-trained CLIP ViT-L/14 text encoder (a usage sketch follows below)

• Word vectors can also be randomly initialized and learned online.

• Representing other conditional signals
• Object categories (e.g. Shark, Trout, etc.):
  • 1 vector per class
• Face attributes (e.g. {female, blonde hair, with glasses, …}, {male, short hair, dark skin}):
  • a set of vectors, 1 vector per attribute

• Time to be creative!!
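A sketch of how prompt embeddings can be obtained from the pre-trained CLIP ViT-L/14 text encoder, assuming the Hugging Face `transformers` library:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a lovely cat running in the desert in Van Gogh style"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    word_vectors = text_encoder(**tokens).last_hidden_state
print(word_vectors.shape)   # torch.Size([1, 77, 768]): one 768-dim vector per token position
```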
How does text affect diffusion?
Incoming Cross Attention
Origin of Attention:
Machine Translation (Seq2Seq)

Original sentence: "I love cats and dogs ."  →  encoder hidden states (word vectors) e_1 … e_6
French translation: "J'adore les chats et les chiens."  →  decoder hidden states (word vectors) h_1, h_2, h_3

• Use attention to retrieve useful info from a batch of vectors.


From Dictionary to Attention
Dictionary: Hard-indexing
• `dic = {1: v1, 2: v2, 3: v3}`
• Keys: 1, 2, 3
• Values: v1, v2, v3

• `dic[2]`
• Query: 2
• Find 2 in the keys
• Get the corresponding value.

• Retrieving values as a matrix-vector product
• One-hot vector over the keys: [v1 v2 v3] × [0, 1, 0]ᵀ = v2
• Matrix-vector product (a numeric sketch follows below)
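A tiny numeric illustration of hard indexing as a matrix-vector product with a one-hot vector:

```python
import torch

values = torch.tensor([[1., 10., 100.],
                       [2., 20., 200.]])    # columns are v1, v2, v3 (each a 2-d value vector)
one_hot = torch.tensor([0., 1., 0.])        # "query = key 2"
print(values @ one_hot)                     # tensor([10., 20.]) == v2
```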
From Dictionary to Attention
Attention: Soft-indexing

• Soft indexing
• Define an attention distribution a over the keys
• Matrix-vector product: [v1 v2 v3] × [0.1, 0.8, 0.1]ᵀ = 0.8 v2 + 0.1 v1 + 0.1 v3

• The distribution is based on the similarity of query and key.
QKV attention
• Query: what I need (J'adore: "I want a subject pronoun & verb")
• Key: what the target provides (I: "Here is the subject")
• Value: the information to be retrieved (a latent related to Je or J')

• Linear projections of the "word vectors"
• Query q_i = W_q h_i
• Key k_j = W_k e_j
• Value v_j = W_v e_j

• e_j: hidden state of the encoder (English, source)
• h_i: hidden state of the decoder (French, target)
Attention mechanism

• Compute the inner product (similarity) of key k and query q.
• SoftMax the normalized scores to get the attention distribution:

  a_ij = SoftMax(k_jᵀ q_i / len(q)),   Σ_j a_ij = 1

• Use the attention distribution to take a weighted average of the values v:

  c_i = Σ_j a_ij v_j

(a minimal implementation follows below)
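A minimal implementation sketch of QKV attention in PyTorch; it uses the common 1/sqrt(d) scaling (analogous to the normalization above), and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """q: [Lq, d], k: [Lk, d], v: [Lk, dv] -> contexts c: [Lq, dv]."""
    scores = q @ k.T / (q.shape[-1] ** 0.5)   # similarity k_j^T q_i for every pair (i, j)
    a = F.softmax(scores, dim=-1)             # attention distribution; each row sums to 1
    return a @ v                              # c_i = sum_j a_ij v_j

q = torch.randn(3, 8)     # e.g. 3 decoder (French) tokens
k = torch.randn(6, 8)     # 6 encoder (English) tokens
v = torch.randn(6, 16)
print(attention(q, k, v).shape)   # torch.Size([3, 16])
```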
Visualizing the Attention matrix a_ij
• French to English
• "Learnt to pay attention"
• "la zone économique européenne" -> "the European Economic Area"
• "a été signé" -> "was signed"

(Attention + RNN)
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Cross & Self Attention

• Cross Attention
• Tokens in one language pay attention to tokens in another.
  (e.g. the French translation "J'adore les chats", with decoder hidden states / word vectors h_1, h_2, h_3, attending to the English source)

• Self Attention (e_i = h_i)
• Tokens in a language pay attention to each other.
• "A robot must obey the order given it."

https://jalammar.github.io/illustrated-gpt2/
Note: Feed-Forward network

• Attention is usually followed by a 2-layer MLP and normalization.

• Learns a nonlinear transform.


Text2Image as translation
• Source language: words. Target language: images.

• Word vectors: sequence dimension × channel dimension.
• Encoded latent state of the image: spatial dimensions × channel dimension (patch vectors!)

"A ballerina chasing her cat running on the grass in the style of Monet"
Text2Image as translation

• Cross Attention: Image to Words
• Self Attention: Image to Image
Spatial Transformer
• Rearrange the spatial tensor into a sequence.
• Cross Attention
• Self Attention
• FFN
• Rearrange back into a spatial tensor (same shape).

(a minimal sketch follows below)
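A minimal sketch of such a block, assuming the `einops` library; the stand-in attention modules and sizes are illustrative, not the Stable Diffusion implementation.

```python
import torch
import torch.nn as nn
from einops import rearrange

class SpatialTransformerBlock(nn.Module):
    """Rearrange -> cross-attention -> self-attention -> FFN -> rearrange back."""
    def __init__(self, channels=320, context_dim=768, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(channels, heads, kdim=context_dim,
                                                vdim=context_dim, batch_first=True)
        self.self_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(channels, 4 * channels), nn.GELU(),
                                 nn.Linear(4 * channels, channels))

    def forward(self, feat, words):
        b, c, h, w = feat.shape
        seq = rearrange(feat, "b c h w -> b (h w) c")            # spatial tensor -> sequence
        seq = seq + self.cross_attn(seq, words, words)[0]        # cross attention: image to words
        seq = seq + self.self_attn(seq, seq, seq)[0]             # self attention: image to image
        seq = seq + self.ffn(seq)                                # feed-forward network
        return rearrange(seq, "b (h w) c -> b c h w", h=h, w=w)  # back to the same spatial shape

block = SpatialTransformerBlock()
out = block(torch.randn(2, 320, 8, 8), torch.randn(2, 77, 768))
print(out.shape)   # torch.Size([2, 320, 8, 8])
```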
Tips: Implementing attention with the `einops` lib

• `einops.rearrange` function
• Shift the order of axes
• Split / combine dimensions.

• `torch.einsum` function
• Multiply & sum tensors along axes.
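Two small examples of these utilities, as one might use them when implementing attention:

```python
import torch
from einops import rearrange

x = torch.randn(2, 320, 64, 64)                # [batch, channels, height, width]
seq = rearrange(x, "b c h w -> b (h w) c")     # combine spatial dims into a sequence axis

q = torch.randn(2, 4096, 40)                   # queries (image patches)
k = torch.randn(2, 77, 40)                     # keys (prompt tokens)
scores = torch.einsum("bid,bjd->bij", q, k)    # batched dot products q_i . k_j
print(seq.shape, scores.shape)                 # [2, 4096, 320] and [2, 4096, 77]
```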
UNet = a giant sandwich of ResBlocks (conv layers) + Spatial Transformers

• Down blocks: repeated [ResBlock + SpatialTransformer] pairs, with a DownSample after each stage
  (the last down stage uses ResBlocks only)
• Up blocks: repeated [ResBlock + SpatialTransformer] groups, with an UpSample after each stage
  (the first up stage uses ResBlocks only)
Spatial Transformer + ResBlock (conv layer)

(Figure: the latent tensor [4, 64, 64] flows through alternating ResBlocks and Spatial Transformers;
 the time embedding [1280] modulates the ResBlocks, the word vectors [L_seq, 768] modulate the Spatial Transformers)

• Alternating time and word modulation
• Alternating local and nonlocal operations

(a sketch of one step follows below)
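A self-contained sketch of one step of this alternation, with simplified stand-ins for the real blocks (sizes follow the slide; module names are illustrative):

```python
import torch
import torch.nn as nn
from einops import rearrange

t_dim, ctx_dim, ch = 1280, 768, 4
conv = nn.Conv2d(ch, ch, 3, padding=1)
t_proj = nn.Linear(t_dim, ch)                                   # time embedding -> one scalar per channel
cross_attn = nn.MultiheadAttention(ch, 1, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)

latent = torch.randn(1, ch, 64, 64)                             # [4, 64, 64] latent tensor
t_emb = torch.randn(1, t_dim)                                   # 1280-d time embedding
words = torch.randn(1, 77, ctx_dim)                             # [L_seq, 768] word vectors

h = latent + conv(latent) + t_proj(t_emb)[:, :, None, None]     # ResBlock: local, time-modulated
seq = rearrange(h, "b c h w -> b (h w) c")
seq = seq + cross_attn(seq, words, words)[0]                    # Spatial Transformer: nonlocal, word-modulated
h = rearrange(seq, "b (h w) c -> b c h w", h=64, w=64)
print(h.shape)   # torch.Size([1, 4, 64, 64])
```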
Diffusion in Latent Space
Adding in AutoEncoder

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. CVPR.
Diffusion in latent space

• Motivation:
• Natural images are high dimensional,
• but have many redundant details that could be compressed / statistically filled out.
  (Figure: a 180-pix image, d = 97200, downsampled to 32 pix, d = 2352)

• Division of labor
• Diffusion model -> generate a low-resolution sketch
• AutoEncoder -> fill out the high-resolution details

• Train a VAE model to compress images into a latent space:
  x → z → x̂, with x, x̂ of shape [3, 512, 512] and z of shape [4, 512/f, 512/f]
• Train the diffusion model in the latent space of z.
Spatial Compression Tradeoff
• LDM-{f}, where f = spatial downsampling factor
• Higher f leads to faster sampling, with degraded image quality (FID ↑).
• Fewer sampling steps lead to faster sampling, with lower quality (FID ↑).

(Figures: CelebA-HQ faces and ImageNet)
Spatial Compression Tradeoff
• LDM-{f}, where f = spatial downsampling factor
• Too little compression (f = 1, 2) or too much compression (f = 32) makes diffusion hard to train.
Details in Stable Diffusion
• In Stable Diffusion, the spatial downsampling factor is f = 8.

• x is a (3, 512, 512) image tensor.

• z is a (4, 64, 64) latent tensor (a sketch follows below).
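A sketch of the f = 8 compression using the Stable Diffusion VAE, assuming the `diffusers` library (the model id is an example; the latent scaling constant is omitted):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

x = torch.randn(1, 3, 512, 512)                      # image tensor
with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()           # latent tensor, shape [1, 4, 64, 64]
    x_hat = vae.decode(z).sample                     # reconstruction, shape [1, 3, 512, 512]
print(z.shape, x_hat.shape)
```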
Regularizing the Latent Space

• KL regularizer
• As in a VAE, push the latent distribution toward a Gaussian.

• VQ regularizer
• Quantize the latent representation into a set of discrete tokens.
Let the GPUs roar!
Training data & details.
Large Data Training

• SD is trained on ~2 billion image-caption (English) pairs.

• Scraped from the web, filtered by CLIP.

• https://laion.ai/blog/laion-5b/
Diffusion Process Visualized
Meaning of latent space

• The latent state contains a "sketch version" of the image.

(Figure: z[0:3, :, :] visualized)
