11 GANs
Computer Vision
Generative Adversarial Networks
Steve Elston
[Figure: GAN training diagram – training data supplies real images to the discriminator and its loss; the generator maps noise to generated images, which also feed the discriminator]
Introduction to Game Theory for GAN Training
A bit of game theory
The generator and discriminator engage in a two-player non-cooperative zero-sum game
• In a zero-sum game the cost to one player has the same magnitude but opposite sign as the cost to the other player
• If the cost to the discriminator is $J^{(D)}$ and the cost to the generator is $J^{(G)}$, then:
$$J^{(G)} = -J^{(D)}$$
• Both players employ strategies to reduce their cost, but each move
causes the other player to make a counter move
• If the players continue to employ optimal strategies, the game
eventually reaches a Nash equilibrium
• At Nash equilibrium, the players are deadlocked
• The first player makes an optimal move to reduce its cost
• The other player makes an optimal counter move to reduce its cost
• The counter move increases the cost of the first player
• A subsequent optimal counter-counter move by the first player increases the cost of the second player
• etc…
A bit of game theory
Nash equilibrium does not imply stability!
• Consider a game where a first player can change a value, x, and the second player can change a value, y, with cost functions for each player:
$$J^{(1)}(x, y) = xy, \qquad J^{(2)}(x, y) = -xy$$
• At the Nash equilibrium, $(x, y) = (0, 0)$, the costs to the players are not stable with time: simultaneous gradient updates orbit the equilibrium rather than settling onto it
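A minimal sketch of this instability (an illustration, not from the slides): simultaneous gradient updates on the assumed costs $J^{(1)} = xy$ and $J^{(2)} = -xy$ circle the Nash equilibrium at (0, 0) instead of converging to it.

```python
# Simultaneous gradient descent on the zero-sum game
# J1(x, y) = x*y (player 1 controls x), J2(x, y) = -x*y (player 2 controls y).
# The Nash equilibrium is (0, 0), but the iterates oscillate around it.

lr = 0.1
x, y = 1.0, 1.0
for step in range(20):
    grad_x = y    # dJ1/dx
    grad_y = -x   # dJ2/dy
    x, y = x - lr * grad_x, y - lr * grad_y   # simultaneous update
    print(f"step {step:2d}: x = {x:+.3f}, y = {y:+.3f}, J1 = {x * y:+.3f}")
```

Printing the iterates shows x and y oscillating in sign while slowly growing in magnitude; the costs never settle, even though (0, 0) is the equilibrium.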
Game theory and GAN training
The generator and discriminator engage in a two-player non-cooperative zero-sum game
• How can we understand this intimidating equation?
$$\min_{\theta_G} \max_{\theta_D} V(\theta_D, \theta_G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
• $\mathbb{E}_{x \sim p_{data}}[\log D(x)]$ is the expected value of the discriminator over the distribution of the real data, $p_{data}$
• Maximizing this term means the discriminator optimally recognizes real data
• $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$ uses one minus the expected value of the discriminator over the distribution of the generated data, $p_G$
• Minimizing this term means the generator optimally fools the discriminator
[Figure: GAN block diagram – training data provides real images $x_R$ to the discriminator; the generator maps noise, z, to generated images $x_G$; the discriminator output feeds the loss]
Training a Discriminator and Generator
Basic GAN architecture
Alternately train generator and discriminator
[Figure: the basic GAN architecture – training data and generated images feed the discriminator, whose output drives the loss; the generator maps noise to generated images]
Alternate Training Discriminator and Generator
Train the discriminator as for any regressor; train the generator in opposition to the discriminator
[Figure: the generator is a differentiable function G(z) applied to input noise z; the discriminator is a differentiable function D(x) applied to real and fake data x, with fake data x_G sampled from G(z)]
• D(x) tries to be large (close to 1) for real data
• For fake data, D(x_G) = D(G(z)): D learns to push D(G(z)) close to 0, while G tries to make D(G(z)) close to 1
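A minimal PyTorch sketch of this alternating update, assuming toy fully connected networks, batch size, and Adam settings that are not from the slides: the discriminator is pushed toward D(x) = 1 for real data and D(G(z)) = 0 for fake data, then the generator is pushed to make D(G(z)) close to 1.

```python
import torch
import torch.nn as nn

# Assumed toy networks; a real GAN would use convolutional models.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(x_real):
    batch = x_real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: D(x_real) -> 1, D(G(z)) -> 0
    z = torch.randn(batch, 64)
    x_fake = G(z).detach()                 # do not backpropagate into G here
    loss_D = bce(D(x_real), ones) + bce(D(x_fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: make D(G(z)) -> 1 (non-saturating heuristic loss)
    z = torch.randn(batch, 64)
    loss_G = bce(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

# Example: one alternating step on a random stand-in "real" batch
print(train_step(torch.rand(32, 784) * 2 - 1))
```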
Gradients of the value function
To converge to a solution of the equilibrium problem, we need to find gradients
• The standard minibatch gradients for m samples (Goodfellow et al., 2014) are, for the discriminator parameters $\theta_D$ and the generator parameters $\theta_G$:
$$\nabla_{\theta_D} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log\left(1 - D(G(z^{(i)}))\right) \right], \qquad \nabla_{\theta_G} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D(G(z^{(i)}))\right)$$
[Figure: the generator objectives Ez[Log(D(G(z)))] and Ez[Log(1 – D(G(z)))] plotted against D(G(z)), showing the direction of increasing J(G)]
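A small numeric check of the point behind that figure (an illustration, not from the slides): when D(G(z)) is near 0, the gradient of log(1 – D(G(z))) is nearly flat, while the gradient of –log(D(G(z))) stays large, which is why the non-saturating heuristic loss is often used for the generator.

```python
import numpy as np

d = np.array([0.01, 0.1, 0.5, 0.9])   # values of D(G(z))

# Saturating generator loss   J = log(1 - D); dJ/dD = -1 / (1 - D)
grad_saturating = -1.0 / (1.0 - d)

# Non-saturating heuristic    J = -log(D);    dJ/dD = -1 / D
grad_non_saturating = -1.0 / d

for di, gs, gn in zip(d, grad_saturating, grad_non_saturating):
    print(f"D(G(z)) = {di:4.2f}  saturating grad = {gs:7.2f}  non-saturating grad = {gn:8.2f}")
```

At D(G(z)) = 0.01 the saturating gradient is about -1 while the non-saturating gradient is about -100, so the generator keeps receiving a useful learning signal early in training when the discriminator easily rejects its samples.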
Mode collapse and training failure
Mode collapse is a common problem that prevents GANs from learning
• Mode collapse is a deadlock in the two-player game
• Example: the discriminator becomes too good at recognizing fake images and stops the generator from learning
• Example: the discriminator and generator alternate between modes of the loss function
• Example: the gradient of the loss function becomes nearly zero – typically for the generator
Mode collapse and training failure
[Figure: the mode-collapse cycle over alternating training steps 0–4 (actual data, train generator, train discriminator, train generator, train discriminator) – the generator learns one mode of the true data, the discriminator learns p = .5 / that this one mode is true data, the generator then learns the other mode, and the cycle repeats]
Loss Functions for GANs
Loss Functions for Training Neural Networks
We need the distribution of the generated data, $p_G$, to be the same as the distribution of the real data, $p_{data}$
• The Kullback-Leibler divergence between the two distributions, $D_{KL}(p_{data} \,\|\, p_G) = \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_G(x)}\right]$, is such a measure
• When $p_G = p_{data}$, $D_{KL}(p_{data} \,\|\, p_G) = 0$
• But the KL divergence is not symmetric: $D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$
• Perhaps we can train the generator by minimizing the KL divergence?
Loss Functions for Training Neural Networks
The KL divergence is asymmetric
• For two distributions the KL divergence depends on the order of the arguments
• Notice that the gradient becomes nearly zero on one side of the KL function
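A short numeric check of the asymmetry (an illustration, not from the slides), using two arbitrary discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.1, 0.2, 0.7])
q = np.array([0.3, 0.4, 0.3])

print("KL(P||Q) =", kl_divergence(p, q))   # != KL(Q||P)
print("KL(Q||P) =", kl_divergence(q, p))
print("KL(P||P) =", kl_divergence(p, p))   # 0 when the distributions match
```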
The Wasserstein distance metric
The Wasserstein distance metric is symmetric and intuitive
• The Wasserstein distance metric is symmetric, with bounded gradients
• The Wasserstein GAN, or W-GAN (Arjovsky et al., 2017), uses the Wasserstein distance as a loss function
• Using the Wasserstein loss helps W-GANs avoid mode collapse
The Wasserstein distance metric
The Wasserstein distance metric is symmetric and intuitive
• The definition of the Wasserstein distance is a bit intimidating:
$$W(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \; \mathbb{E}_{(x, y) \sim \gamma}\left[\, \lVert x - y \rVert \,\right]$$
Where
$\inf$ = the greatest lower bound (infimum)
$\Pi(P, Q)$ = the set of joint distributions, $\gamma$, with marginal distributions $P$ and $Q$
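For one-dimensional samples the Wasserstein-1 distance is available directly in SciPy; a small check (an illustration, not from the slides, with made-up stand-in samples) that the metric is symmetric:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=1000)        # stand-in for real data
generated = rng.normal(loc=2.0, scale=1.0, size=1000)   # stand-in for generated data

d_pq = wasserstein_distance(real, generated)
d_qp = wasserstein_distance(generated, real)
print(f"W(P, Q) = {d_pq:.3f}, W(Q, P) = {d_qp:.3f}")    # identical: the metric is symmetric
```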
The Wasserstein distance metric
The Wasserstein distance metric is symmetric and intuitive
[Figure: bar charts of example discrete distributions Q(x) (top row) and P(x) (bottom row) over the values 0–4, illustrating the Wasserstein distance as the cost of moving probability mass from one distribution to the other]
• GANs should produce a diversity of objects – not the same few objects many times
– There should be a nearly uniform distribution of the probability of occurrence of each recognized object class
Evaluation of GANs
Inception score attempts to balance accuracy at creating realistic objects with diversity of the objects created
• Recall the relationships of entropy for probability distributions
• For a classifier with feature vector, x, and prediction, y, we want the conditional output distribution, $p(y|x)$, to have low entropy
• For diversity of the objects recognized from a generator, $G(z)$, given random noise input, z, we want the marginal distribution, $p(y)$, to have high entropy
• Combining these measures, we can write the inception score in terms of the expected Kullback-Leibler (KL) divergence:
$$IS(G) = \exp\left( \mathbb{E}_{x \sim p_G}\left[ D_{KL}\big( p(y|x) \,\|\, p(y) \big) \right] \right)$$
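A sketch of this computation (array names, shapes, and the toy inputs are assumptions, not from the slides) from a matrix of per-image class probabilities p(y|x), such as softmax outputs of the Inception network on generated images:

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """p_yx: (n_images, n_classes) array of class probabilities p(y|x)."""
    p_y = p_yx.mean(axis=0, keepdims=True)                  # marginal p(y)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))                         # IS = exp(E_x[KL(p(y|x) || p(y))])

# Confident and diverse predictions -> score near the number of classes
confident_diverse = np.eye(10)[np.arange(10)]
# Every image predicted as the same class -> score near 1
collapsed = np.tile(np.eye(10)[0], (10, 1))
print(inception_score(confident_diverse), inception_score(collapsed))
```

The two toy cases show the intended balance: low conditional entropy plus high marginal entropy gives a high score, while a collapsed generator scores close to 1.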
Evaluation of GANs
Inception score attempts to balance accuracy at creating realistic objects with diversity of the objects created
• How can we interpret the inception score?
• High scores require that each generated object is confidently classified and that the generated classes are diverse
• Conversely, objects that make no sense, like an animal with two heads, can still have high inception scores
• As a result of the above problems, human evaluators may not agree with quality judged by the inception score
Evaluation of GANs
The Fréchet Inception distance (FID) was proposed to overcome the limitations of the inception score
• Rather than use the output of the Inception network, the FID compares distributions of activations in a hidden layer, typically Inception V4
• Model the distributions of activations as multivariate Normal distributions:
– $\mathcal{N}(\mu_G, \Sigma_G)$ is the distribution of activations from the generated images
– $\mathcal{N}(\mu_w, \Sigma_w)$ is the distribution of activations from the real-world data, denoted by w
$$FID = \lVert \mu_G - \mu_w \rVert_2^2 + \mathrm{Tr}\left( \Sigma_G + \Sigma_w - 2\left( \Sigma_G \Sigma_w \right)^{1/2} \right)$$
• Notice that to compute FID we only need the means and covariances of the distributions – easy to estimate
• To understand FID, consider the behavior of the two terms
– The first term is the squared Euclidean difference between the mean vectors
– The second term compares the spread (covariances) of the two distributions
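A sketch of the FID computed from two sets of hidden-layer activations (function and variable names, activation dimensions, and the random stand-in data are assumptions, not from the slides), using only their means and covariances:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(act_gen, act_real):
    """act_gen, act_real: (n_samples, n_features) activation matrices."""
    mu_g, mu_w = act_gen.mean(axis=0), act_real.mean(axis=0)
    cov_g = np.cov(act_gen, rowvar=False)
    cov_w = np.cov(act_real, rowvar=False)
    cov_mean = sqrtm(cov_g @ cov_w)
    if np.iscomplexobj(cov_mean):            # discard tiny imaginary parts from sqrtm
        cov_mean = cov_mean.real
    diff = mu_g - mu_w
    return float(diff @ diff + np.trace(cov_g + cov_w - 2.0 * cov_mean))

rng = np.random.default_rng(0)
gen = rng.normal(0.0, 1.0, size=(500, 8))     # stand-in for generated-image activations
real = rng.normal(0.5, 1.0, size=(500, 8))    # stand-in for real-image activations
print(frechet_distance(gen, real))            # larger values mean the distributions differ more
```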
Evaluation of GANs
How can we interpret the Fréchet Inception distance?
• The FID is the squared Wasserstein-2 distance between the Gaussian models of the activations from the generated and real-world distributions
The Conditional GAN
For a conditional GAN, the generated image is conditioned on an exogenous variable
[Figure: the generator function receives a random input vector together with a conditional vector, y, derived from the exogenous variable]
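A minimal PyTorch sketch of a conditional generator (layer sizes, the label embedding, and the concatenation scheme are assumptions, not from the slides): the random input vector z and the conditional vector y are joined before being passed through the generator function.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim=64, n_classes=10, out_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)   # conditional vector y from a class label
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),
        )

    def forward(self, z, labels):
        y = self.embed(labels)                     # exogenous variable -> conditional vector y
        return self.net(torch.cat([z, y], dim=1))  # condition the generated image on y

G = ConditionalGenerator()
z = torch.randn(4, 64)
labels = torch.tensor([0, 3, 5, 9])
print(G(z, labels).shape)   # torch.Size([4, 784])
```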
The Convolutional GAN
Linear mathematical operations can be performed in the convolutional embedding (latent) space
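A small sketch of the kind of linear operation meant here (the stand-in generator and random latent vectors are assumptions, not from the slides): arithmetic on latent vectors, with the result decoded by the generator, as in the well-known DCGAN "smiling woman − neutral woman + neutral man" example.

```python
import torch
import torch.nn as nn

# Stand-in generator; in practice this would be a trained convolutional generator.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())

# Latent vectors for three concepts (random here; in practice found by
# averaging the latent codes of images with those attributes).
z_smiling_woman = torch.randn(1, 100)
z_neutral_woman = torch.randn(1, 100)
z_neutral_man = torch.randn(1, 100)

# Linear arithmetic in the embedding space, then decode with the generator;
# with a trained model this combination yields an image of a smiling man.
z_result = z_smiling_woman - z_neutral_woman + z_neutral_man
image = G(z_result)
print(image.shape)   # torch.Size([1, 784])
```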
GANs trained with two time scales
Heusel et al., 2018, use a two time-scale update rule (TTUR) to train the discriminator and generator
• Observe that simple generator-discriminator training may only converge to a local Nash equilibrium – there is no global convergence guarantee
• The discriminator learns too fast
– The discriminator learns to tell real from generated data
– The generator is prevented from further learning
Spectral normalization of weights
Spectral normalization (the SN-GAN of Miyato et al., 2018) stabilizes the discriminator by dividing each weight matrix by its largest singular value, $\bar{W} = W / \sigma(W)$
• We can efficiently compute the first singular value with the power iteration algorithm
– For details see, for example, Section 11.1 of Mining Massive Datasets, Leskovec et al., 2020
Spectral normalization of weights
The stochastic gradient descent (SGD) algorithm for SN-GAN is straightforward
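A numpy sketch of the core idea (function names, matrix sizes, and the number of iterations are assumptions, not from the slides): estimate the largest singular value of a weight matrix with a few power iterations, then divide the weights by it before the discriminator's forward pass. PyTorch provides a ready-made version of this as torch.nn.utils.spectral_norm.

```python
import numpy as np

def largest_singular_value(w, n_iter=5, eps=1e-12):
    """Estimate sigma(W), the largest singular value, with power iteration."""
    u = np.random.default_rng(0).normal(size=w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= (np.linalg.norm(v) + eps)
        u = w @ v
        u /= (np.linalg.norm(u) + eps)
    return float(u @ w @ v)          # Rayleigh-style estimate of sigma(W)

def spectral_normalize(w, n_iter=5):
    return w / largest_singular_value(w, n_iter)

w = np.random.default_rng(1).normal(size=(128, 64))
w_sn = spectral_normalize(w)
print(np.linalg.svd(w, compute_uv=False)[0])     # original sigma(W)
print(np.linalg.svd(w_sn, compute_uv=False)[0])  # approximately 1 after normalization
```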
Spectral normalization of weights
Comparing inception scores of GAN models for different ADAM optimizer
hyperparameters
Spectral normalization of weights
Comparison of squared largest singular values for several regularization
methods
Self-attention GAN
Images of real-world scenes have a significant spatial extent
• Convolutional operators used to create an embedding (latent) space have a small spatial extent – the receptive field
• As a result of this purely local sensitivity, convolutional GANs are poor at creating many types of scenes
• Wang et al., 2017, developed a non-local neural network algorithm using a self-attention mechanism, which gives superior performance on several tasks
• Zhang et al., 2019, applied non-local self-attention to training GANs
Self-attention GAN
Images of real-world scenes have a significant spatial extent
• Non-local self-attention adds non-local behavior to GAN training
• Non-local self-attention is applied to both the discriminator and the generator
• Start with two latent feature spaces, $f(x) = W_f x$ and $g(x) = W_g x$, for the hidden-layer activations x
• $W_f$ and $W_g$ are learnable weight tensors
• The interaction between the latent spaces is computed as the inner product $s_{ij} = f(x_i)^{T} g(x_j)$, taken over all pairs of spatial positions i and j
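A PyTorch sketch of a non-local self-attention block for image feature maps, following the general structure described by Zhang et al., 2019 (the reduced channel count, the 1×1 convolutions, and the third projection h are assumptions, not spelled out on the slides): the projections f and g of the activations x are compared with an inner product at every pair of spatial positions, and the resulting attention map reweights h(x).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // 8, kernel_size=1)   # f(x) = W_f x
        self.g = nn.Conv2d(channels, channels // 8, kernel_size=1)   # g(x) = W_g x
        self.h = nn.Conv2d(channels, channels, kernel_size=1)        # h(x) = W_h x
        self.gamma = nn.Parameter(torch.zeros(1))                    # learned residual weight

    def forward(self, x):
        b, c, height, width = x.shape
        n = height * width
        f = self.f(x).view(b, -1, n)                               # (b, c/8, n)
        g = self.g(x).view(b, -1, n)                               # (b, c/8, n)
        h = self.h(x).view(b, c, n)                                # (b, c,   n)
        attn = F.softmax(torch.bmm(f.transpose(1, 2), g), dim=1)   # (b, n, n) inner products f(x_i)^T g(x_j)
        out = torch.bmm(h, attn).view(b, c, height, width)         # reweight h(x) by the attention map
        return self.gamma * out + x                                # non-local output added back to x

attn = SelfAttention2d(64)
print(attn(torch.randn(2, 64, 16, 16)).shape)   # torch.Size([2, 64, 16, 16])
```

The learned scale gamma starts at zero, so the block initially passes activations through unchanged and gradually mixes in the non-local information as training proceeds.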