Generative Adversarial Text to Image Synthesis
PROJECT REPORT
submitted towards the partial fulfillment of the
requirements for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
Submitted by
We declare that the work presented in this report with title “Generative Adversarial
Text to Image Synthesis” towards the fulfillment of the requirement for the award of the
degree of Bachelor of Technology in Electronics & Communication Engineering
submitted in the Dept. of Electronics & Communication Engineering, Indian Institute of Technology, Roorkee, India is an authentic record of our own work carried out
during the period from August 2017 to May 2018 under the supervision of DR. VINOD
PANKAJAKSHAN, Assistant Professor, IIT Roorkee. The content of this report has not
been submitted by us for the award of any other degree of this or any other institute.
This is to certify that the statement made by the candidates is correct to the best of my
knowledge and belief.
(DR. VINOD PANKAJAKSHAN)
ASSISTANT PROFESSOR
DEPT. OF ELECTRONICS AND COMMUNICATION
IIT ROORKEE
ACKNOWLEDGEMENTS
First and foremost, we would like to express our sincere gratitude towards our supervisor, Dr. Vinod Pankajakshan, for his guidance and support throughout this project.
Finally, we appreciate the help of all our friends for keeping us motivated and providing us with valuable insights as a part of various healthy discussions.
ABSTRACT
Generative Adversarial Networks have shown striking results on unconditional and conditional image-generation tasks, but they have been limited in output size and are unable to generate fine-grained details from text. In this report, we implement several techniques to obtain a model for synthesizing images from text descriptions. We implement the Attention Generative Adversarial Network, which uses an attention mechanism at multiple stages to generate fine-grained images; it improves the details by selecting the words on which it needs to focus. We also implement an image-word loss, the Deep Attentional Multimodal Similarity Model, to be used while training the model. The model generates images of 64 × 64, 128 × 128 and 256 × 256 resolution with photo-realistic quality. We have also explored the use of a conditional Wasserstein Generative Adversarial Network for generating images, since the Wasserstein distance provides an insightful representation of the distance between two low-dimensional distributions and has therefore shown plausible results in unconditional image generation.
TABLE OF CONTENTS

List of Figures

1 Introduction
2 Related Work
3 Background
   3.1 Generative Adversarial Networks
      3.1.1 DCGAN
      3.1.2 Conditional GANs
      3.1.3 Wasserstein GANs
   3.2 Text Embeddings
      3.2.1 Recurrent Neural Network
      3.2.2 Skip Thought Vectors
   3.3 Attention Mechanism
4 Approach
   4.1 Model Architectures
      4.1.1 Vanilla and Wasserstein Conditional GANs
      4.1.2 StackGAN
      4.1.3 Attention GANs
   4.2 Training the Model
      4.2.1 Vanilla Conditional GAN
      4.2.2 Wasserstein Conditional GAN
      4.2.3 Attention GANs
5 Experimental Details
   5.1 Datasets
   5.2 Vanilla Conditional GANs
   5.3 Wasserstein GANs
   5.4 Attention GANs
6 Results
   6.1 Vanilla Conditional GAN
   6.2 Wasserstein GAN
   6.3 Attention GAN
7 Future Work
8 Conclusion
Bibliography
LIST OF FIGURES

2.1 Image Captioning Using multi-modal networks (RNN and CNN) [1]
2.2 Images generated via DCGAN generator [2]
6.1 Text descriptions and the image generated with Vanilla Conditional GAN
6.2 64 × 64 images generated with Vanilla Conditional GAN
6.3 Wasserstein loss of WGAN
6.4 Generator loss of WGAN
6.5 Text descriptions and the image generated with WGAN
6.6 64 × 64 images generated from WGAN
6.7 Discriminator loss of Attention GANs
6.8 Generator loss of Attention GANs
6.9 Images generated in multiple stages along with their attention maps. The attention maps in the upper row are from stage 2 and those in the lower row are from stage 3
6.10 256 × 256 images generated using the AttnGAN model trained on CUB
CHAPTER 1
INTRODUCTION
Human beings have the ability to imagine and mentally picture an image just by analyzing the textual description of a scene or an object. For example, the sentence “An apple is lying on a wooden table.” provides us with a mental image of the information it is trying to convey. However, there can be multiple possible outcomes that correctly depict the entire information in the sentence. Artificial synthesis of images from textual descriptions could have numerous applications in visual editing, picture-based summarization, digital design, and animation generation.
This challenging problem has two subproblems: the first is to learn to represent textual sentences as an encoded embedding that captures the important visual details required to draw the image, and the second is to develop a generative model that uses the encoded embedding to synthesize an image that could be mistaken for a real one. Recent advancements in deep learning have made huge progress in both domains, i.e. natural language processing and generative modeling.
In recent years, Recurrent Neural Networks and Long Short-Term Memory networks have been able to encode text sequences very efficiently by retaining long-term temporal dependencies. Advancements in generative modeling using Variational Autoencoders and Generative Adversarial Networks have also begun to produce highly realistic images. We build on these previous works and employ techniques from both fields.
We aim at generating image pixel values directly from raw text in the form of a caption -
a single-sentence description of the image. For example, feeding the input “A red flower with a yellow stigma” should generate an image corresponding to this caption. To accomplish this
translation of text to images, the caption needs to be encoded to capture the discriminative
text features. The encoded sentence will then be used to condition a generative adversarial
model to generate images corresponding to the caption.
One challenging aspect of text-to-image synthesis is that synthesizing images with generative models, and GANs in particular, conditioned on a textual description leads to the problem of multi-modality in the output. A single caption could lead to many possible images, each of which could be aesthetically correct and cover all the information in the caption.
Figure 1.1: Example results for image synthesizing model using captions.
The rest of the report is organised as follows. We discuss related work in Chapter 2, and in Chapter 3 we briefly describe the background required to understand the concepts involved in the methods. In Chapter 4, we discuss the approach and the algorithms. Chapter 5 gives the implementation details, such as the hyperparameters. Chapter 6 presents the results of our experiments, followed by future work and conclusions in Chapters 7 and 8 respectively.
CHAPTER 2
RELATED WORK
Generating images from text is a multimodal learning problem in which we learn a shared representation between different modalities and synthesize unavailable data in one modality conditioned on another. Audio and video multimodal learning has been achieved using stacked autoencoders [10]. Srivastava and Salakhutdinov [11] modeled text tags with images using a deep Boltzmann machine. The earliest work on multimodal prediction with a mathematical justification was proposed in 2014 [12]. Other approaches in this direction use Variational Autoencoders and work by maximizing a lower bound on the data likelihood. Researchers have also used deep deconvolutional networks to generate 3D chair models [13]. The DRAW model of Variational Autoencoders (VAE) has also proved very successful when conditioned on text embeddings to produce images from captions [14]; the authors used skip-thought vectors as the text embeddings for conditioning the VAE.
For a long time, the major focus of previous work was the retrieval of images from a text query, or vice versa. However, over the past couple of years, recurrent neural network architectures have been used for generating text descriptions of images [15]. They use a Long Short-Term Memory network [16] on top of a deep convolutional neural network architecture to generate captions for real images using MS COCO [17] and other datasets. The trained encoder of an RNN can be coupled with a multimodal network involving other domains, such as images or sound, to train across multiple domains using transfer learning. An attention mechanism [18] has also been introduced to identify the regions on which a particular word focuses. Figure 2.1 shows the architecture of this multi-modal network.
Figure 2.1: Image Captioning Using multi-modal networks (RNN and CNN) [1].
Recently, researchers have been looking into Generative Adversarial Networks [19] composed of deep recurrent and convolutional networks to generate realistic images, with promising results. A GAN is composed of two separate models that compete in a zero-sum game and, following ideas from game theory, end up improving each other. However, GANs are highly unstable and require careful hyperparameter tuning to generate high-resolution images. A lot of research has gone into stabilizing the training [2], [20], [21], [22], [23], [24], with the Deep Convolutional Generative Adversarial Network (DCGAN) as a major breakthrough.
[4] proposed the first end-to-end differentiable conditional GAN architecture conditioned on text descriptions in place of class labels, working from the character level to the pixel level. It generated realistic 64 × 64 images with the help of a matching-aware discriminator and a manifold interpolation regularizer. Using a series of GANs to generate detailed images has also been studied [5]: the first stage generates the basic structure described by the text, and the job of the second-stage GAN is to correct the defects of the first GAN and complete the object details.
Figure 2.2: Images generated via DCGAN generator [2].
CHAPTER 3
BACKGROUND

3.1 Generative Adversarial Networks
GANs work on the principle of game theory and set up a minimax game between two players, namely the generator and the discriminator, which compete against each other. The generator G, parameterized by θ, generates samples G(z; θ) from random noise z. The generated samples are intended to be drawn from the same distribution as the training data x. The other player, the discriminator D, tries to discriminate between samples drawn from the real training distribution and those drawn from the generator. The discriminator learns by classifying inputs into two classes (real or fake); it emits a probability D(x) or D(G(z)) depending on whether the input is from the training data or the generator. The generator's aim is to fool the discriminator by producing samples that look as real as possible, while the discriminator's aim is to correctly identify the synthesized samples among the real ones. In mathematical terms, GANs are structured probabilistic models with latent variables z (random noise) and observed variables x, as shown in Figure 3.1.
Both players are associated with a cost function, also commonly called a loss function, defined in terms of their parameters. The discriminator has to minimize $J^{(D)}(\theta^{(D)}, \theta^{(G)})$ and must do so by adjusting only $\theta^{(D)}$. On the other hand, the generator has to minimize $J^{(G)}(\theta^{(D)}, \theta^{(G)})$ by controlling $\theta^{(G)}$. The cost function of each player also depends on the other player's parameters, which it cannot modify. This scenario is best described as a game. The optimum solution of this game is a point in parameter space where the two cost functions are jointly minimized and all neighboring points incur greater or equal cost.
The cost function of the discriminator is shown in Equation 3.1, where $p_{data}$ represents the probability distribution of the true data; it is the same as the standard cross-entropy loss.

$$J^{(D)}(\theta^{(D)}, \theta^{(G)}) = -\frac{1}{2}\,\mathbb{E}_{x \sim p_{data}} \log D(x) - \frac{1}{2}\,\mathbb{E}_{z \sim p_z} \log\bigl(1 - D(G(z))\bigr) \tag{3.1}$$
The simplest version of the two-player GAN game is the zero-sum game, in which the costs of the two players sum to zero. Hence

$$J^{(G)} = -J^{(D)} \tag{3.2}$$

Zero-sum games involve minimizing the total loss with respect to one set of parameters and maximizing it with respect to the other; hence they are also called minimax games. The value function to be optimized for GANs then becomes

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} \log D(x) + \mathbb{E}_{z \sim p_z(z)} \log\bigl(1 - D(G(z))\bigr) \tag{3.3}$$
However, this generator loss is not optimal for practical use: when the discriminator rejects the generator's samples with high confidence, the generator's gradient vanishes and its parameters cannot be trained. In practice, the following cost function is used for the generator.

$$J^{(G)} = -\frac{1}{2}\,\mathbb{E}_{z \sim p_z} \log D(G(z)) \tag{3.4}$$
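To make Equations 3.1 and 3.4 concrete, a minimal sketch of one training step is given below. PyTorch is our assumed framework (the report does not fix one), and `G` and `D` are placeholders for a generator and a discriminator whose output is a probability of shape (batch, 1).

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real_x, opt_g, opt_d, z_dim=100):
    """One GAN update using Eq. 3.1 for D and the non-saturating Eq. 3.4 for G
    (sketch; the 1/2 factors are dropped since they do not change the optimum)."""
    batch = real_x.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    # Discriminator: minimize -E[log D(x)] - E[log(1 - D(G(z)))]
    fake_x = G(torch.randn(batch, z_dim)).detach()      # block gradients into G
    d_loss = F.binary_cross_entropy(D(real_x), ones) + \
             F.binary_cross_entropy(D(fake_x), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: minimize -E[log D(G(z))] instead of the vanishing log(1 - D(G(z)))
    g_loss = F.binary_cross_entropy(D(G(torch.randn(batch, z_dim))), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```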
3.1.1 DCGAN
The practical GAN architectures used today are mostly based on the DCGAN (Deep Convolutional GAN) architecture [2], which specifies a set of guidelines important for the stable implementation and training of GANs. Some of the key insights of DCGAN are:

• The architecture is based on the all-convolutional net, which contains only convolutional layers. Strided convolutions replace the pooling layers in the discriminator, and fractionally-strided (transposed) convolutions are used in the generator to increase the spatial dimensions of the output.

• The fully connected hidden layers are removed for deeper architectures.

• ReLU activations are used in the generator and Leaky ReLU in the discriminator. The last layer of the generator uses a tanh activation. Both the generator and the discriminator use the Adam optimizer.
The architecture of DCGAN is also explained in Figure 3.2. DCGANs have shown good results when trained to generate images such as images of bedrooms and birds. The authors have also shown that the latent codes support simple arithmetic operations that produce meaningful outcomes, as demonstrated in Figure 2.2.
3.1.2 Conditional GANs

In a conditional GAN [31], both the generator and the discriminator receive additional conditioning information c, such as a class label or a text embedding, and the value function becomes

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} \log D(x \mid c) + \mathbb{E}_{z \sim p_z(z)} \log\bigl(1 - D(G(z \mid c))\bigr) \tag{3.5}$$
3.1.3 Wasserstein GANs

The generator may fail to learn the true distribution if the discriminator does not provide enough information to the generator. The Wasserstein GAN, or WGAN [32], leverages this fact and fits a known distribution function to the desired distribution. To achieve this, it needs to compute the distance between the real and model distributions.
WGANs have proved to improve training stability. Here, the discriminator (critic) is trained many more times than the generator, which works better for generator training in a WGAN. The authors of the paper [32] also claim that they experienced no mode collapse with this approach.
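A minimal sketch of the WGAN training loop of [32] is given below, assuming PyTorch and a critic `D` that outputs an unbounded score; the `n_critic` count and the weight-clipping range follow the defaults reported in the WGAN paper, and reusing the same real batch for every critic step is a simplification.

```python
import torch

def wgan_step(G, D, real_x, opt_g, opt_d, z_dim=100, n_critic=5, clip=0.01):
    """One WGAN update (sketch): train the critic n_critic times per generator
    step and clip its weights to keep it approximately 1-Lipschitz."""
    batch = real_x.size(0)
    for _ in range(n_critic):
        z = torch.randn(batch, z_dim)
        # Critic maximizes E[D(x)] - E[D(G(z))], so we minimize the negative.
        d_loss = -(D(real_x).mean() - D(G(z).detach()).mean())
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        for p in D.parameters():
            p.data.clamp_(-clip, clip)          # weight clipping

    # Generator minimizes -E[D(G(z))].
    g_loss = -D(G(torch.randn(batch, z_dim))).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    # -d_loss estimates K * W(P_r, P_theta), the quantity plotted in Figure 6.3.
    return -d_loss.item(), g_loss.item()
```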
3.2 Text Embeddings

3.2.1 Recurrent Neural Network

One of the major drawbacks of an RNN is that, in its native form, it cannot capture long-term dependencies due to vanishing gradients. Long Short-Term Memory (LSTM) networks are a variant of the RNN designed specifically for this purpose; remembering long-term information is their unique selling point. A native RNN has a single tanh activation function, while an LSTM deploys three gates and two activation functions to keep the memory and the current state saved. The three gates/mechanisms are as follows:
1. Forgetting Mechanism: When a new input comes, it needs to decide which prior information is important and should be kept, and which information can be forgotten or thrown away.

2. Saving Mechanism: This mechanism decides which information from the new input is important and worth saving. Therefore, the LSTM first throws away any long-term information that is no longer required and then saves the important information from the input.
3. Extracting information from long-term memory: After the forgetting and saving mechanisms, the model identifies which information from the long-term memory is of immediate importance conditioned on the current input (a minimal encoder sketch built on these ideas follows below).
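The sketch below shows one simple way to use such a recurrent network as a caption encoder in PyTorch; the vocabulary size, the embedding and hidden dimensions, and the choice of the final hidden state as the sentence vector are illustrative assumptions rather than the report's exact text encoder.

```python
import torch
import torch.nn as nn

class CaptionEncoder(nn.Module):
    """Encodes a tokenized caption into per-word features and a sentence vector."""
    def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                  # (batch, seq_len, embed_dim)
        word_feats, (h_n, _) = self.lstm(x)        # h_n: (1, batch, hidden_dim)
        return word_feats, h_n.squeeze(0)          # word features, sentence vector

encoder = CaptionEncoder()
tokens = torch.randint(0, 5000, (4, 12))           # a batch of 4 captions, 12 tokens each
word_feats, sent_vec = encoder(tokens)             # sent_vec conditions the GAN
```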
3.2.2 Skip Thought Vectors

Encoder: Let $w_i^1, w_i^2, \ldots, w_i^N$ represent the words of the $i$-th sentence, $N$ being the length of the sentence. Let $h_i^t$ denote the hidden state of the encoder of sentence $i$ after the $t$-th word has been fed in, so that $h_i^N$ represents the skip-thought vector of the whole sentence. The following equations describe the encoding.
Figure 3.5: Skip-thought model, showing the input sentence and the neighboring sentences [3].
$$h^t = (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t \tag{3.11}$$
Decoder: The decoder conditions on the hidden state $h_i$ produced by the encoder. $C_r$, $C_z$, and $C$ are used to bias the reset gate, the update gate, and the hidden state, respectively. Two decoders with separate parameters are used, one for the previous sentence and one for the next sentence; they share only the vocabulary matrix that maps words to vectors. Decoding uses the following equations:
$$r^t = \sigma\bigl(W_r^d x^{t-1} + U_r^d h^{t-1} + C_r h_i\bigr) \tag{3.12}$$
$$P\bigl(w_{i+1}^{t} \mid w_{i+1}^{<t}, h_i\bigr) \propto \exp\bigl(v_{w_{i+1}^{t}}^{\top} h_{i+1}^{t}\bigr) \tag{3.16}$$
Using these equations, our objective is to optimize the sum of the forward and backward sentence log-probabilities:

$$\sum_{t} \log P\bigl(w_{i+1}^{t} \mid w_{i+1}^{<t}, h_i\bigr) + \sum_{t} \log P\bigl(w_{i-1}^{t} \mid w_{i-1}^{<t}, h_i\bigr)$$
3.3 Attention Mechanism
An attention model takes $n$ inputs $y_1, y_2, \ldots, y_n$ along with a context and returns a vector $z$ that is a weighted sum of the $y_i$. By focusing on the contextual information, it picks out and gives more weight to specific $y_i$'s. These weights are easily accessible and are therefore used to identify the regions of focus. Figure 3.6 shows the model of the attention mechanism.
To begin with, we compute a similarity or dissimilarity score $m_i$ (depending on the use case) between each input $y_i$ and the context vector. An important thing to notice is that each $m_i$ is calculated without looking at the other $y_j$'s. The $m_i$'s are then passed through a softmax layer that normalizes them into weights $s_i$. The output $z$ is the weighted sum of the $s_i$ and $y_i$.
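A minimal sketch of this computation is given below, assuming PyTorch tensors and a plain dot product as the similarity score $m_i$ (other scoring functions can be substituted).

```python
import torch
import torch.nn.functional as F

def soft_attention(context, inputs):
    """Soft attention (sketch): score each y_i against the context with a dot
    product, softmax the scores, and return the weighted sum z and the weights."""
    # context: (batch, d); inputs y_1..y_n: (batch, n, d)
    scores = torch.bmm(inputs, context.unsqueeze(2)).squeeze(2)   # m_i
    weights = F.softmax(scores, dim=1)                            # s_i, sum to 1
    z = torch.bmm(weights.unsqueeze(1), inputs).squeeze(1)        # z = sum_i s_i * y_i
    return z, weights                                             # weights show the focus regions
```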
In [18], the authors applied an attention mechanism to generate text descriptions from images, using convolutional neural networks and Long Short-Term Memory networks together with attention. In [34], the authors used this mechanism in a passage question-answering system, and [35] used the approach to translate text from English to French.
CHAPTER 4
APPROACH
4.1 Model Architectures

4.1.1 Vanilla and Wasserstein Conditional GANs

The vanilla conditional GAN, following [4], was the first GAN architecture for text-to-image synthesis. The architecture is shown in Figure 4.1. The authors use a DCGAN architecture that is conditioned on text features, which are generated by encoding the text using a character-level recurrent neural
network. The generator and discriminator architectures used in the vanilla conditional GAN are described below.
1. Generator Network

   a) A noise vector z is sampled from a random noise prior.
   b) The caption is encoded using a pretrained skip-thought model. The encoded caption is mapped to a vector of smaller size and concatenated with z.
   c) The concatenated embedding is then passed through a feed-forward deep deconvolutional network to obtain a synthetic image x̂ (a minimal sketch of this generator is given after the discriminator description below).
2. Discriminator Network

   b) The discriminator applies layer-wise strided convolutions with batch normalization.
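A minimal PyTorch-style sketch of the conditional generator described above follows. The framework, layer widths, the 128-dimensional projection of the caption embedding, and the 2400-dimensional uni-skip input size are our assumptions for illustration, not the exact configuration used in the report.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Sketch of the conditional generator: compress the caption embedding,
    concatenate it with the noise z, and upsample with transposed convolutions."""
    def __init__(self, z_dim=100, embed_dim=2400, proj_dim=128, ngf=64):
        super().__init__()
        # 2400 is the uni-skip vector size; the 128-d projection is an assumption.
        self.project = nn.Sequential(nn.Linear(embed_dim, proj_dim), nn.LeakyReLU(0.2))
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + proj_dim, ngf * 8, 4, 1, 0), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1), nn.BatchNorm2d(ngf), nn.ReLU(True),
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1), nn.Tanh(),        # 64 x 64 RGB output
        )

    def forward(self, z, caption_embedding):
        cond = self.project(caption_embedding)                     # compressed text feature
        x = torch.cat([z, cond], dim=1).unsqueeze(2).unsqueeze(3)  # (batch, z+proj, 1, 1)
        return self.net(x)                                         # (batch, 3, 64, 64)
```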
The architecture used for the Wasserstein GAN is the same as the vanilla conditional GAN architecture described in 4.1.1; the difference lies in the loss functions of the generator and discriminator and in the training process.
4.1.2 StackGAN
The images generated in [4] were of size 64 × 64 and were not of optimal quality. StackGAN [5] uses a two-stage process that generates images of size 64 × 64 in stage 1 and 256 × 256 in stage 2; moreover, the generated images are plausible enough to be treated as real. The stage-1 GAN draws the primitive layout of the image and the shape of the object along with its colors, yielding a low-resolution image. The stage-2 GAN conditions on the stage-1 output as well as the text embedding and completes the fine details
and imparts minute features, giving a high-resolution output. The model architecture is depicted in Figure 4.2.
In StackGAN, the conditioning variable $\hat{c}_0$ is sampled from the Gaussian distribution $\mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t))$, where $\varphi_t$ is the text embedding. This is called Conditioning Augmentation.
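A minimal sketch of Conditioning Augmentation is shown below, assuming a PyTorch module, a 1024-dimensional text embedding and a 128-dimensional conditioning vector (both sizes are illustrative); the KL term corresponds to the regularizer that appears in Equation 4.4.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Sketch: predict mu and log(sigma) from the text embedding phi_t and draw
    c_hat ~ N(mu, sigma^2) with the reparameterization trick."""
    def __init__(self, embed_dim=1024, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(embed_dim, cond_dim * 2)

    def forward(self, phi_t):
        mu, log_sigma = self.fc(phi_t).chunk(2, dim=1)
        c_hat = mu + torch.exp(log_sigma) * torch.randn_like(mu)
        # KL(N(mu, sigma^2) || N(0, I)): the regularizer that appears in Eq. 4.4.
        kl = 0.5 * torch.mean(torch.exp(2 * log_sigma) + mu ** 2 - 1 - 2 * log_sigma)
        return c_hat, kl
```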
The different stages of StackGAN are explained below:
1. Stage-1 GAN
a) The text embedding is fed through a fully connected layer to get a Gaussian
distribution from which the conditioning variable ĉ0 is sampled.
b) Then ĉ0 is concatenated with the noise vector to generate a W0 × H0 image by
passing through a series of upsampling blocks.
c) For the discriminator D0 , the text embedding is converted to Nd dimensions by
passing it to a fully connected layer and then replicated spatially to form a tensor
of dimension Md × Md × Nd . The real and fake images are then downsampled to
Md × Md and then concatenated to the text tensor along the channel dimension.
d) A single node fully connected layer produces the final decision score.
e) This stage maximizes the discriminator loss and minimizes the generator loss given in Equations 4.1 and 4.2.
2. Stage-2 GAN
b) The tensor obtained is passed through residual blocks, which learn a multi-modal representation of text and images. Upsampling blocks are then used to generate a W × H image.
e) This stage maximizes the discriminator loss and minimizes the generator loss given in Equations 4.3 and 4.4.
$$L_D = \mathbb{E}_{(I,t) \sim p_{data}}\bigl[\log D(I, \varphi_t)\bigr] + \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{data}}\bigl[\log\bigl(1 - D(G(s_0, \hat{c}), \varphi_t)\bigr)\bigr] \tag{4.3}$$

$$L_G = \mathbb{E}_{s_0 \sim p_{G_0},\, t \sim p_{data}}\bigl[\log\bigl(1 - D(G(s_0, \hat{c}), \varphi_t)\bigr)\bigr] + \lambda\, D_{KL}\bigl(\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t)) \,\|\, \mathcal{N}(0, I)\bigr) \tag{4.4}$$
Building upon the success of StackGAN, the authors proposed a modified version of StackGAN called StackGAN++ [6]. In StackGAN-v2, multiple generators and discriminators are arranged in a tree-like structure, and each branch of the tree generates an image at one resolution, from low to high. Each discriminator analyzes whether an image comes from the true data or from its generator. The generators are jointly trained to hierarchically generate images from random noise up to high resolution. The architecture of StackGAN++ is shown in Figure 4.3.
4.1.3 Attention GANs

The approaches above encode the whole sentence into a single global vector, but this global vector misses the fine-grained details at the word level. Therefore, [7] proposes a multi-stage, attention-driven architecture for text-to-image generation. The architecture of the GAN is shown in Figure 4.4.
$$h_0 = F_0\bigl(z, F^{ca}(\bar{e})\bigr) \tag{4.6}$$
c) The hidden features are passed to the attention mechanism $F_i^{attn}$ along with the word features. The word-context vector $c_j$, a representation of the words relevant to the $j$-th sub-region of $h$, is calculated as follows (a sketch of this computation is given after this list):

$$c_j = \sum_{i=0}^{T-1} \beta_{j,i}\, e'_i, \qquad \beta_{j,i} = \frac{\exp(s'_{j,i})}{\sum_{k=0}^{T-1} \exp(s'_{j,k})} \tag{4.8}$$
d) This word-context vector, along with the hidden output, is used to generate the images for the next stage.
e) Discriminators are defined at each node to classify whether the inputs are real or fake and to judge the authenticity of the image-text pair.
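A minimal sketch of the word-context computation of Equation 4.8 is given below, assuming PyTorch, image features laid out as (batch, d, N) for N sub-regions, word features as (batch, d_w, T), and a linear layer `proj` that maps word features into the image feature space; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def word_context(hidden, word_feats, proj):
    """Sketch of Eq. 4.8: for every sub-region j of the hidden image features,
    attend over the T word features and build a word-context vector c_j."""
    # hidden: (batch, d, N); word_feats e_i: (batch, d_w, T); proj: nn.Linear(d_w, d)
    e = proj(word_feats.transpose(1, 2)).transpose(1, 2)   # e'_i: (batch, d, T)
    scores = torch.bmm(hidden.transpose(1, 2), e)          # s'_{j,i}: (batch, N, T)
    beta = F.softmax(scores, dim=2)                        # normalize over the words
    c = torch.bmm(e, beta.transpose(1, 2))                 # c_j: (batch, d, N)
    return c, beta                                         # beta gives the attention maps
```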
CHAPTER 5
EXPERIMENTAL DETAILS
5.1 Datasets
We used the Caltech-UCSD Birds (CUB) dataset [8] and the Oxford-102 flowers dataset [9] for our experiments. The CUB dataset contains 11,788 bird images from 200 categories. Since 80% of the images have an object-to-image size ratio of less than 0.5, the first step in the experiments was to crop the images so that the object covers more than 75% of the image. The Oxford-102 dataset contains 8,189 images from 102 different flower categories common in the United Kingdom. Tables 5.1 and 5.2 show the splits of the datasets.
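A minimal sketch of this cropping step is shown below. It assumes the per-image bounding boxes that ship with CUB and uses a square crop sized so that the box covers roughly the 0.75 target ratio; the function name, arguments and the final 64 × 64 resize are illustrative, not the report's exact preprocessing.

```python
from PIL import Image

def crop_to_object(img_path, bbox, target_ratio=0.75, out_size=64):
    """Crop a square region so the object's bounding box covers roughly
    target_ratio of the crop (sketch; bbox = (x, y, w, h) in pixels)."""
    img = Image.open(img_path)
    x, y, w, h = bbox
    # Side of a square whose area is bbox_area / target_ratio (at least the bbox itself).
    side = int(max(w, h, (w * h / target_ratio) ** 0.5))
    cx, cy = x + w / 2, y + h / 2
    left = int(max(0, min(cx - side / 2, img.width - side)))
    top = int(max(0, min(cy - side / 2, img.height - side)))
    return img.crop((left, top, left + side, top + side)).resize((out_size, out_size))
```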
5.2 Vanilla Conditional GANs

1. We used uni-skip vectors from the skip-thought model. We have not tried training the model with combine-skip vectors.
3. The model was trained for 600 epochs with a batch size of 64 on a GPU with a learning
rate α = 0.0002 using Adam Optimizer.
4. While processing the batches before training, we flipped the images horizontally with
a probability of 0.5.
6. During a single mini-batch iteration, the weights of the generator are updated twice to prevent the discriminator loss from going down to 0 (a sketch of this training configuration is given below).
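The sketch below collects the configuration from the list above in PyTorch-like code; the framework, the Adam betas of (0.5, 0.999), and the normalization transform are our assumptions, and the placeholder G and D stand in for the networks of Section 4.1.1.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Placeholder networks; the real G and D are the conditional GAN modules of Section 4.1.1.
G, D = nn.Linear(10, 10), nn.Linear(10, 1)

# Item 3: 600 epochs, batch size 64, learning rate 0.0002 with Adam.
LR, BATCH_SIZE, EPOCHS = 0.0002, 64, 600
opt_g = torch.optim.Adam(G.parameters(), lr=LR, betas=(0.5, 0.999))   # betas assumed
opt_d = torch.optim.Adam(D.parameters(), lr=LR, betas=(0.5, 0.999))

# Item 4: flip training images horizontally with probability 0.5.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # match the tanh output range
])

# Item 6: within each mini-batch, the generator is stepped twice for every
# discriminator step (the update functions themselves are those of Chapter 4).
```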
5.3 Wasserstein GANs

1. The text encoder used a CNN input size (sequence length) of 201 and a text embedding layer of 1024 dimensions.

2. The model was trained for 600 epochs. The discriminator was trained 100 times for each update of the generator in the initial phase, to provide substantial gradients to the generator and trigger learning.
4. The batch size was 64 with a learning rate of 0.0002; the Adam optimizer was used.
5.4 Attention GANs

1. The model generates three images of dimensions 64 × 64, 128 × 128, and 256 × 256. It also generates two attention maps, applied to the 64 × 64 and 128 × 128 images.
2. The DAMSM model was trained for 600 epochs on a GeForce GTX 1080 Ti GPU with a learning rate of 0.002, with γ1, γ2, γ3 equal to 4.0, 5.0, 10.0 respectively, and the Adam optimizer (β1 = 0.5, β2 = 0.999).

3. The attentional generative model was also trained for 600 epochs on a GeForce GTX 1080 Ti GPU with discriminator and generator learning rates of 0.002, and with γ1, γ2, γ3, λ equal to 4.0, 5.0, 10.0, 5.0 respectively.
CHAPTER 6
RESULTS
6.1 Vanilla Conditional GAN

Figure 6.1 shows the 64 × 64 images generated by the vanilla conditional GAN along with their text descriptions. Most of the flower images are realistic enough; however, the images are of suboptimal quality, and the method also suffers from mode collapse, as many of the generated flowers have very similar features and do not show much variety.
Figure 6.1: Text descriptions and the image generated with Vanilla Conditional
GAN.
Figure 6.2 shows some more images generated by the vanilla conditional GAN. The text descriptions are randomly chosen strings from the 1000 samples kept out for validation. However, as seen in the figure, similar flowers are generated multiple times. This is because the text descriptions are very generic and could fit a wide variety of flowers, because a single global sentence vector cannot capture all the information in a sentence, and because of mode collapse. These drawbacks are rectified later by the Attention GAN mechanism.
6.2 Wasserstein GAN
Figure 6.3: Wasserstein loss of WGAN.
Figure 6.4: Generator loss of WGAN.
The Wasserstein GAN tries to distinguish between real image-real caption, fake image-real caption, and wrong image-real caption pairs. The Wasserstein loss in Figure 6.3 decreases as the image quality improves. This behaviour differs from the vanilla GAN losses because here the loss, $K \cdot W(P_r, P_\theta)$, represents the closeness of the two distributions. The Wasserstein loss is therefore different from, and should not be confused with, a discriminator loss, since a discriminator loss would never reach zero even in the ideal case where the distributions are the same.
The instability of the generator loss together with the decreasing discriminator loss indicates that the generator produces improved images while the discriminator is simultaneously trained to find flaws in those improved images, so the generator has to improve the images even further.
Some of the results shown in Figures 6.5 and 6.6 capture the shape and colors of the birds, but they lack minute details such as legs, beaks and other fine features, because of which they cannot be called real. This is due to the partial training of the model, caused by limitations in computational power. As can be seen from Figure 6.3, the discriminator has not yet saturated and could be trained further to improve the results.
Figure 6.5 shows the text descriptions along with the images generated by the conditional WGAN algorithm, and Figure 6.6 shows some more generated bird images.
Figure 6.5: Text descriptions and the image generated with WGAN.
Figure 6.6: 64 × 64 images generated from WGAN.
6.3 Attention GAN

Figure 6.7: Discriminator loss of Attention GANs.
Figure 6.8: Generator loss of Attention GANs.
As shown in Figure 6.9, a low-resolution image is generated first (by $G_0$); it shows only the shape and color of the object, since only the global sentence vector is used. In the next stages ($G_1$, $G_2$), the focus on individual words rectifies the drawbacks of the first image and produces a more detailed, higher-resolution image. The figure shows the top-5 words chosen by the algorithm along with the regions they attend to in the two stages. Figure 6.10 shows further examples of final-stage AttnGAN outputs. The attention mechanism provides a dual benefit: first, it extracts the important words on which the model needs to focus; second, we can visualize and gain insight into the positions on which each word is focusing. These images capture and focus on minute details of the bird, such as its textures, along with the background. These results are clearly better than the current WGAN results.
Figure 6.9: Images generated at multiple stages along with their attention maps. The attention maps in the upper row are from stage 2 and those in the lower row are from stage 3.
Figure 6.10: 256 × 256 images generated using AttnGAN model trained on CUB.
CHAPTER 7
FUTURE WORK
The current architecture has performed better than the vanilla conditional GAN architecture. In this architecture, we have explored improvements in the text embedding part: the ability to focus on specific words and then draw fine details in the image. Below are some experiments that could be tried to improve the results.
• If we consider the way an artist draws an image, first a bounding box is created for an object, then the object inside the bounding box is given shape, and then the whole image is completed. Therefore, if we could break down our generation process into similar steps along with the attention mechanism, better results could be hypothesized.
• Wasserstein GANs have become a trending topic within the GAN community because, for two low-dimensional distributions, the Wasserstein distance provides a smoother and more meaningful representation of the distance between them than the Kullback-Leibler or Jensen-Shannon divergence. However, training a WGAN takes more time and more computational power. Applying a conditional WGAN with an attention mechanism could therefore produce better results.
• The results in this report were generated on the CUB and Oxford-102 datasets. The same architecture with different hyper-parameters could be used to train on the MS COCO dataset to produce more general images.
CHAPTER 8
CONCLUSION
We also explored conditional Wasserstein GANs, which can be used instead of standard GANs to generate images. The modified loss function of the WGAN provides sufficient theoretical as well as practical justification for the success of WGANs. Although the training process is slower than for the vanilla GAN model, since the discriminator has to be trained many more times before the generator can effectively start learning, WGANs have proved to be resistant to the mode-collapse problem.
The current methods work well on datasets like CUB [8] and Oxford-102 flowers [9]. One limitation of the current methods is that the models only learn the distinguishing features of a single object, or features of the whole image, but do not learn the concept of the objects in it. Hence, besides improving the available models for text-to-image synthesis, another possible area of research is to learn the concepts of individual objects in an image in order to improve image generation from text.
BIBLIOGRAPHY
[1] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image
descriptions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, pp. 664–676, Apr.
2017.
[4] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in Proceedings of the 33rd International Conference on Machine Learning (ICML’16), pp. 1060–1069, JMLR.org, 2016.
[5] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. N. Metaxas, “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in ICCV, 2017.
[6] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, “StackGAN++: Realistic image synthesis with stacked generative adversarial networks,” arXiv:1710.10916, 2017.
[7] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,” CoRR, vol. abs/1711.10485, 2017.
[9] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number
of classes,” in Proceedings of the Indian Conference on Computer Vision, Graphics
and Image Processing, Dec 2008.
[10] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,”
in ICML, 2011.
[12] K. Sohn, W. Shang, and H. Lee, “Improved multimodal deep learning with variation of information,” in NIPS, 2014.
[13] A. Dosovitskiy, J. T. Springenberg, and T. Brox, “Learning to generate chairs with convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[15] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption
generator,” in CVPR, June 2015.
[16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9,
pp. 1735–1780, Nov. 1997.
[17] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L.
Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision – ECCV
2014, (Cham), pp. 740–755, Springer International Publishing, 2014.
[18] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio,
“Show, attend and tell: Neural image caption generation with visual attention,” in
ICML, 2015.
[23] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li, “Mode regularized generative adversarial networks,” in ICLR, 2017.
[24] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune, “Plug & play generative
networks: Conditional iterative generation of images in latent space,” in CVPR,
2017.
[26] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “Neural photo editing with introspective
adversarial networks,” CoRR, vol. abs/1609.07093, 2016.
[27] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with condi-
tional adversarial networks,” CVPR, 2017.
[31] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint
arXiv:1411.1784, 2014.
[35] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning
to align and translate,” CoRR, vol. abs/1409.0473, 2014.
[36] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training
by reducing internal covariate shift.,” in ICML (F. R. Bach and D. M. Blei, eds.),
vol. 37 of JMLR Workshop and Conference Proceedings, pp. 448–456, JMLR.org,
2015.