College Documentation - Automated Image Captioning


AUTOMATIC IMAGE CAPTIONING
ABSTRACT

We examine the problem of automatic image captioning. Given a training set of
captioned images, we want to discover correlations between image features and keywords, so
that we can automatically find good keywords for a new image. We experiment with
multiple design alternatives on large datasets of varied images, and our proposed methods
achieve fair captioning accuracy compared with the state of the art.

Caption generation is a challenging artificial intelligence problem where a textual description
must be generated for a given photograph.

It requires both methods from computer vision to understand the content of the image and a
language model from the field of natural language processing to turn the understanding of the
image into words in the right order. Recently, deep learning methods have achieved state-of-the-
art results on examples of this problem.
INDEX

• INTRODUCTION
• SOFTWARE AND HARDWARE REQUIREMENTS
• MODULES
• TECHNOLOGY DESCRIPTION
• DESIGN
• SYNOPSIS
• SCREENSHOTS
• CONCLUSION
• REFERENCES
INTRODUCTION
The goal is to develop and implement a model that detects features in an image and can
provide the possible captions for that image.

Automated image captioning is an interesting problem because it draws on both computer
vision and natural language processing techniques. To create the model we use a
pre-trained CNN and the Flickr8K dataset. We chose this dataset because it is realistic
and relatively small, so you can download it and build models on your workstation using
a CPU.

The model takes a single image as input and outputs a caption for that image.

The main objective of this project is to generate captions by taking only the image as input
and producing relevant captions.
SOFTWARE REQUIREMENTS
1. Jupyter Notebook
2. Keras
3. TensorFlow

HARDWARE REQUIREMENTS
1. RAM : 13 GB
2. DISK : 5 GB
3. OPERATING SYSTEM: WINDOWS 10
4. GPU
TECHNOLOGY DESCRIPTION
• Keras
Keras is an open-source neural-network library written in Python. It is capable of running
on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML. Designed to
enable fast experimentation with deep neural networks, it focuses on being user-friendly,
modular, and extensible. It was developed as part of the research effort of project
ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System), and its
primary author and maintainer is François Chollet, a Google engineer. Chollet also is the
author of the Xception deep neural network model.

In 2017, Google's TensorFlow team decided to support Keras in TensorFlow’s core
library. Chollet explained that Keras was conceived to be an interface rather than a
standalone machine-learning framework. It offers a higher-level, more intuitive set of
abstractions that make it easy to develop deep learning models regardless of the
computational backend used. Microsoft added a CNTK backend to Keras as well,
available as of CNTK v2.0.

Keras contains numerous implementations of commonly used neural-network building
blocks such as layers, objectives, activation functions, optimizers, and a host of tools to
make working with image and text data easier. The code is hosted on GitHub, and
community support forums include the GitHub issues page and a Slack channel.

In addition to standard neural networks, Keras has support for convolutional and
recurrent neural networks. It supports other common utility layers like dropout, batch
normalization, and pooling.

Keras allows users to productize deep models on smartphones (iOS and Android), on the
web, or on the Java Virtual Machine. It also allows use of distributed training of deep-
learning models on clusters of Graphics Processing Units (GPU) and Tensor processing
units (TPU).

Keras was developed and is maintained by François Chollet, a Google engineer, following four
guiding principles:
• Modularity: A model can be understood as a sequence or a graph alone. All the
concerns of a deep learning model are discrete components that can be combined in
arbitrary ways.
• Minimalism: The library provides just enough to achieve an outcome, no frills
and maximizing readability.
• Extensibility: New components are intentionally easy to add and use within the
framework, intended for researchers to trial and explore new ideas.
• Python: No separate model files with custom file formats. Everything is native
Python.

Keras doesn't handle low-level computation itself. Instead, it delegates to another library,
called the "backend". Keras is thus a high-level API wrapper around a low-level API, capable
of running on top of TensorFlow, CNTK, or Theano.

The Keras high-level API handles how we build models, define layers, and set up
multiple-input, multiple-output models. At this level, Keras also compiles the model with a
loss function and an optimizer, and runs the training process with the fit function. Keras
doesn't handle low-level operations such as building the computational graph or creating
tensors and other variables, because these are handled by the backend engine.

In Keras, the backend is the component that performs all low-level computation, such as
tensor products and convolutions, with the help of other libraries such as
TensorFlow or Theano.

• TensorFlow
TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used for
machine learning applications such as neural networks. It is used for both research and
production at Google. Experience with TensorFlow is a common expectation for machine
learning work in industry.

TensorFlow was developed by the Google Brain team for internal Google use. It was
released under the Apache 2.0 open-source license on November 9, 2015.

Starting in 2011, Google Brain built DistBelief as a proprietary machine learning system
based on deep learning neural networks. Its use grew rapidly across diverse Alphabet
companies in both research and commercial applications. Google assigned multiple
computer scientists, including Jeff Dean, to simplify and refactor the codebase of
DistBelief into a faster, more robust application-grade library, which became
TensorFlow. In 2009, the team, led by Geoffrey Hinton, had implemented generalized
back propagation and other improvements which allowed generation of neural networks
with substantially higher accuracy, for instance a 25% reduction in errors in speech
recognition.

TensorFlow is Google Brain's second-generation system. Version 1.0.0 was released on
February 11, 2017. While the reference implementation runs on single devices,
TensorFlow can run on multiple CPUs and GPUs (with optional CUDA and SYCL
extensions for general-purpose computing on graphics processing units). TensorFlow is
available on 64-bit Linux, macOS, Windows, and mobile computing platforms including
Android and iOS.

Its flexible architecture allows for the easy deployment of computation across a variety of
platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and
edge devices.
TensorFlow computations are expressed as stateful dataflow graphs. The name
TensorFlow derives from the operations that such neural networks perform on
multidimensional data arrays, which are referred to as tensors. During the Google I/O
Conference in June 2016, Jeff Dean stated that 1,500 repositories on GitHub mentioned
TensorFlow, of which only 5 were from Google.

In May 2016, Google announced its Tensor Processing Unit (TPU), an application-
specific integrated circuit (a hardware chip) built specifically for machine learning and
tailored for TensorFlow. TPU is a programmable AI accelerator designed to provide high
throughput of low-precision arithmetic (e.g., 8-bit), and oriented toward using or running
models rather than training them. Google announced they had been running TPUs inside
their data centers for more than a year, and had found them to deliver an order of
magnitude better-optimized performance per watt for machine learning.

TensorFlow is a machine learning framework that might be your new best friend if you
have a lot of data and/or you’re after the state-of-the-art in AI: deep learning. Neural
networks. Big ones. It’s not a data science Swiss Army Knife, it’s the industrial lathe…
which means you can probably stop reading if all you want to do is put a regression line
through a 20-by-2 spreadsheet.

But if big is what you’re after, get excited. TensorFlow has been used to go hunting for
new planets, prevent blindness by helping doctors screen for diabetic retinopathy, and
help save forests by alerting authorities to signs of illegal deforestation activity. It’s what
AlphaGo and Google Cloud Vision are built on top of, and it’s yours to play with.
TensorFlow is open source, you can download it for free and get started immediately.

Keras is all about user-friendliness and easy prototyping, something old TensorFlow
sorely craved more of. If you like object oriented thinking and you like building neural
networks one layer at a time, you’ll love tf.keras.

• CNN
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep
neural networks, most commonly applied to analyzing visual imagery.

CNNs use a variation of multilayer perceptrons designed to require minimal
preprocessing. They are also known as shift invariant or space invariant artificial neural
networks (SIANN), based on their shared-weights architecture and translation invariance
characteristics.

Convolutional networks were inspired by biological processes in that the connectivity
pattern between neurons resembles the organization of the animal visual cortex.
Individual cortical neurons respond to stimuli only in a restricted region of the visual
field known as the receptive field. The receptive fields of different neurons partially
overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification
algorithms. This means that the network learns the filters that in traditional algorithms
were hand-engineered. This independence from prior knowledge and human effort in
feature design is a major advantage.

A convolutional neural network consists of an input and an output layer, as well as
multiple hidden layers. The hidden layers of a CNN typically consist of convolutional
layers, ReLU (activation function) layers, pooling layers, fully connected layers and
normalization layers.

Describing the process as a convolution is a convention in neural networks.
Mathematically it is a cross-correlation rather than a convolution (although cross-
correlation is a related operation). This only has significance for the indices in the matrix,
and thus which weights are placed at which index.
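
The distinction can be illustrated with a small sketch in plain Python. The function names and the toy image and kernel below are illustrative assumptions, not part of the project code. Cross-correlation slides the kernel as it is, while a true convolution flips the kernel in both axes first, so the two differ only in which weight meets which input index.

```python
def cross_correlate2d(image, kernel):
    """Slide the kernel over the image without flipping it ("valid" positions only)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def convolve2d(image, kernel):
    """True convolution: flip the kernel in both axes, then cross-correlate."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return cross_correlate2d(image, flipped)


# Toy data chosen only for illustration.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

print(cross_correlate2d(image, kernel))  # [[-4, -4], [-4, -4]]
print(convolve2d(image, kernel))         # [[4, 4], [4, 4]]
```

Because the kernel weights are learned, a network can learn the flipped kernel just as easily, which is why the difference rarely matters in practice.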

Convolutional layers apply a convolution operation to the input, passing the result to the
next layer. The convolution emulates the response of an individual neuron to visual
stimuli.

Each convolutional neuron processes data only for its receptive field. Although fully
connected feed forward neural networks can be used to learn features as well as classify
data, it is not practical to apply this architecture to images. A very high number of
neurons would be necessary, even in a shallow (opposite of deep) architecture, due to the
very large input sizes associated with images, where each pixel is a relevant variable. For
instance, a fully connected layer for a (small) image of size 100 x 100 has 10000 weights
for each neuron in the second layer. The convolution operation brings a solution to this
problem as it reduces the number of free parameters, allowing the network to be deeper
with fewer parameters. For instance, regardless of image size, tiling regions of size 5 x 5,
each with the same shared weights, requires only 25 learnable parameters. Having fewer
parameters to learn also helps to mitigate the vanishing and exploding gradient problems
encountered when training traditional multi-layer neural networks with many layers by
back-propagation.
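
The parameter counts quoted above can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical helper names; it counts weights only and ignores bias terms, as the text does.

```python
def dense_weights_per_neuron(height, width, channels=1):
    """Weights needed by ONE fully connected neuron that sees the whole image."""
    return height * width * channels


def conv_weights_per_filter(kernel_h, kernel_w, channels=1):
    """Weights in ONE shared convolution filter, independent of image size."""
    return kernel_h * kernel_w * channels


print(dense_weights_per_neuron(100, 100))  # 10000
print(conv_weights_per_filter(5, 5))       # 25
```

The fully connected count grows with the image, while the filter count stays fixed because the same 25 weights are reused at every position.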

Convolutional networks may include local or global pooling layers. Pooling layers reduce
the dimensions of the data by combining the outputs of neuron clusters at one layer into a
single neuron in the next layer. Local pooling combines small clusters, typically 2 x 2.
Global pooling acts on all the neurons of the convolutional layer. In addition, pooling
may compute a max or an average. Max pooling uses the maximum value from each of a
cluster of neurons at the prior layer. Average pooling uses the average value from each of
a cluster of neurons at the prior layer.
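
As a rough illustration of local pooling, the sketch below applies non-overlapping max and average pooling to a small feature map. The function name and the sample values are assumptions made for this example, and the map dimensions are assumed to be divisible by the window size.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size x size clusters.

    Assumes both dimensions of feature_map are divisible by size.
    """
    if mode == "max":
        reduce = max
    elif mode == "average":
        reduce = lambda values: sum(values) / len(values)
    else:
        raise ValueError("mode must be 'max' or 'average'")

    pooled = []
    for r in range(0, len(feature_map), size):
        row = []
        for c in range(0, len(feature_map[0]), size):
            # Collect the cluster of neurons that feeds one output neuron.
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


fm = [[1, 3, 2, 0],
      [5, 7, 4, 6],
      [9, 1, 8, 2],
      [3, 5, 0, 6]]

print(pool2d(fm, 2, "max"))      # [[7, 6], [9, 8]]
print(pool2d(fm, 2, "average"))  # [[4.0, 3.0], [4.5, 4.0]]
print(pool2d(fm, 4, "max"))      # [[9]]  (window covers the whole map: global pooling)
```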

Fully connected layers connect every neuron in one layer to every neuron in another
layer. It is in principle the same as the traditional multi-layer perceptron neural network
(MLP). The flattened matrix goes through a fully connected layer to classify the images.

In neural networks, each neuron receives input from some number of locations in the
previous layer. In a fully connected layer, each neuron receives input from every element
of the previous layer. In a convolutional layer, neurons receive input from only a
restricted subarea of the previous layer. Typically the subarea is of a square shape (e.g.,
size 5 by 5). The input area of a neuron is called its receptive field. So, in a fully
connected layer, the receptive field is the entire previous layer. In a convolutional layer,
the receptive area is smaller than the entire previous layer.

Each neuron in a neural network computes an output value by applying some function to
the input values coming from the receptive field in the previous layer. The function that is
applied to the input values is specified by a vector of weights and a bias (typically real
numbers). Learning in a neural network progresses by making incremental adjustments to
the biases and weights. The vector of weights and the bias are called a filter and
represent some feature of the input (e.g., a particular shape). A distinguishing feature of
CNNs is that many neurons share the same filter. This reduces memory footprint because
a single bias and a single vector of weights is used across all receptive fields sharing that
filter, rather than each receptive field having its own bias and vector of weights.

A system to recognize hand-written ZIP Code numbers involved convolutions in which
the kernel coefficients had been laboriously hand designed.
Yann LeCun et al. (1989) used back-propagation to learn the convolution kernel
coefficients directly from images of hand-written numbers. Learning was thus fully
automatic, performed better than manual coefficient design, and was suited to a broader
range of image recognition problems and image types.
This approach became a foundation of modern computer vision.

The feed-forward architecture of convolutional neural networks was extended in the
neural abstraction pyramid by lateral and feedback connections. The resulting recurrent
convolutional network allows for the flexible incorporation of contextual information to
iteratively resolve local ambiguities. In contrast to previous models, image-like outputs at
the highest resolution were generated.

Traditional multilayer perceptron (MLP) models were successfully used for image
recognition. However, due to the full connectivity between nodes they suffer from the
curse of dimensionality, and thus do not scale well to higher resolution images. A
1000×1000-pixel image with RGB color channels has 3 million dimensions, which is too
high to feasibly process efficiently at scale with full connectivity.

Convolutional neural networks are biologically inspired variants of multilayer
perceptrons that are designed to emulate the behavior of a visual cortex. These models
mitigate the challenges posed by the MLP architecture by exploiting the strong spatially
local correlation present in natural images. As opposed to MLPs, CNNs have the
following distinguishing features:

1) 3D volumes of neurons: The layers of a CNN have neurons arranged in 3 dimensions:
width, height and depth. The neurons inside a layer are connected to only a small
region of the layer before it, called a receptive field. Distinct types of layers, both
locally and completely connected, are stacked to form a CNN architecture.

2) Local connectivity: following the concept of receptive fields, CNNs exploit spatial
locality by enforcing a local connectivity pattern between neurons of adjacent layers.
The architecture thus ensures that the learned "filters" produce the strongest response
to a spatially local input pattern. Stacking many such layers leads to non-linear filters
that become increasingly global (i.e. responsive to a larger region of pixel space) so
that the network first creates representations of small parts of the input, then from
them assembles representations of larger areas.

3) Shared weights: In CNNs, each filter is replicated across the entire visual field. These
replicated units share the same parameterization (weight vector and bias) and form a
feature map. This means that all the neurons in a given convolutional layer respond to
the same feature within their specific response field.

The convolutional layer is the core building block of a CNN. The layer's parameters
consist of a set of learnable filters (or kernels), which have a small receptive field, but
extend through the full depth of the input volume. During the forward pass, each filter
is convolved across the width and height of the input volume, computing the dot
product between the entries of the filter and the input and producing a 2-dimensional
activation map of that filter. As a result, the network learns filters that activate when
it detects some specific type of feature at some spatial position in the input.

Stacking the activation maps for all filters along the depth dimension forms the full
output volume of the convolution layer. Every entry in the output volume can thus
also be interpreted as an output of a neuron that looks at a small region in the input
and shares parameters with neurons in the same activation map.
• LSTM
Long short-term memory (LSTM) is an artificial recurrent neural network (RNN)
architecture used in the field of deep learning. Unlike standard feed forward neural
networks, LSTM has feedback connections that make it a "general purpose computer"
(that is, it can compute anything that a Turing machine can). It can not only process
single data points (such as images), but also entire sequences of data (such as speech or
video).
A common LSTM unit is composed of a cell, an input gate, an output gate and a forget
gate. The cell remembers values over arbitrary time intervals and the three gates regulate
the flow of information into and out of the cell.

LSTM networks are well-suited to classifying, processing and making predictions based
on time series data, since there can be lags of unknown duration between important
events in a time series. LSTMs were developed to deal with the exploding and vanishing
gradient problems that can be encountered when training traditional RNNs. Relative
insensitivity to gap length is an advantage of LSTM over RNNs, hidden Markov models
and other sequence learning methods in numerous applications.

In theory, classic (or "vanilla") RNNs can keep track of arbitrary long-term dependencies
in the input sequences. The problem of vanilla RNNs is computational (or practical) in
nature: when training a vanilla RNN using back-propagation, the gradients which are
back-propagated can "vanish" (that is, they can tend to zero) or "explode" (that is, they
can tend to infinity), because of the computations involved in the process, which use
finite-precision numbers. RNNs using LSTM units partially solve the vanishing gradient
problem, because LSTM units allow gradients to also flow unchanged. However, LSTM
networks can still suffer from the exploding gradient problem.

There are several architectures of LSTM units. A common architecture is composed of a
cell (the memory part of the LSTM unit) and three "regulators", usually called gates, of
the flow of information inside the LSTM unit: an input gate, an output gate and a forget
gate. Some variations of the LSTM unit do not have one or more of these gates, or may
have other gates. For example, gated recurrent units (GRUs) do not have an output gate.

Intuitively, the cell is responsible for keeping track of the dependencies between the
elements in the input sequence. The input gate controls the extent to which a new value
flows into the cell, the forget gate controls the extent to which a value remains in the cell
and the output gate controls the extent to which the value in the cell is used to compute
the output activation of the LSTM unit.
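
To make the roles of the three gates concrete, here is a minimal sketch of one time step of a single LSTM unit with scalar input and state. It follows the common textbook formulation; the function names, the parameter layout and the sample numbers are assumptions for illustration, not the implementation used by Keras.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM unit.

    params maps each of "i" (input gate), "f" (forget gate), "o" (output gate)
    and "g" (candidate value) to a (w_x, w_h, bias) tuple.
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("i"))    # how much of the new value enters the cell
    f = sigmoid(pre_activation("f"))    # how much of the old cell value is kept
    o = sigmoid(pre_activation("o"))    # how much of the cell value is exposed
    g = math.tanh(pre_activation("g"))  # candidate value to write into the cell

    c = f * c_prev + i * g              # updated cell (memory) value
    h = o * math.tanh(c)                # output activation of the unit
    return h, c


# With the forget gate pushed open and the input gate pushed shut,
# the cell value passes through the step essentially unchanged.
keep = {"i": (0.0, 0.0, -50.0), "f": (0.0, 0.0, 50.0),
        "o": (0.0, 0.0, 0.0), "g": (1.0, 1.0, 0.0)}
h, c = lstm_step(x=0.7, h_prev=0.1, c_prev=2.0, params=keep)
print(c)
```

The example at the end shows why the cell can remember values over long intervals: when the forget gate is near 1 and the input gate near 0, the stored value is carried forward.
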
A RNN using LSTM units can be trained in a supervised fashion, on a set of training
sequences, using an optimization algorithm, like gradient descent, combined with back
propagation through time to compute the gradients needed during the optimization
process, in order to change each weight of the LSTM network in proportion to the
derivative of the error (at the output layer of the LSTM network) with respect to
corresponding weight.

A problem with using gradient descent for standard RNNs is that error gradients vanish
exponentially quickly with the size of the time lag between important events. This is due
to $\lim_{n \to \infty} W^n = 0$ if the spectral radius of $W$ is smaller than 1.
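
A small numerical sketch can make this limit tangible. The 2 x 2 matrix below is an arbitrary example chosen for this illustration (its eigenvalues are 0.6 and 0.3, so its spectral radius is below 1); the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matrix_power(w, n):
    """Compute W^n by repeated multiplication, starting from the identity."""
    size = len(w)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, w)
    return result


def max_abs(m):
    return max(abs(v) for row in m for v in row)


# Example matrix with spectral radius 0.6 (an assumption for this demo).
W = [[0.5, 0.2],
     [0.1, 0.4]]

for n in (1, 10, 20, 50):
    print(n, max_abs(matrix_power(W, n)))
```

The largest entry shrinks roughly like 0.6 to the power n, which mirrors how an error signal fades as it is propagated back through many time steps.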

However, with LSTM units, when error values are back-propagated from the output
layer, the error remains in the LSTM unit's cell. This "error carousel" continuously feeds
error back to each of the LSTM unit's gates, until they learn to cut off the value.

Sequence prediction problems have been around for a long time. They are considered
among the hardest problems to solve in the data science industry. They include a wide
range of problems: from predicting sales to finding patterns in stock market data, from
understanding movie plots to recognizing your way of speaking, from language translation
to predicting the next word on your iPhone’s keyboard.

With the recent breakthroughs in data science, Long Short-Term Memory networks (LSTMs)
have been found to be the most effective solution for almost all of these sequence
prediction problems.

DESIGN
SYNOPSIS
We loaded the VGG model in Keras using the VGG16 class. We removed the last layer
from the loaded model, as this layer is used to predict a classification for a photo.
We were not interested in classifying images, but we were interested in the internal
representation of the photo right before a classification is made. These are the “features”
that the model has extracted from the photo.

Keras also provides tools for reshaping the loaded photo into the preferred size for the
model (e.g. 3 channel 224 x 224 pixel image).

The function extract_features(), given a directory name, loaded each
photo, prepared it for VGG, and collected the predicted features from the VGG model.
The image features are a 1-dimensional 4,096 element vector.

Next, we stepped through the list of photo descriptions. The function
load_descriptions(), given the loaded document text, returns a dictionary of photo
identifiers to descriptions. Each photo identifier maps to a list of one or more textual
descriptions.
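
A minimal sketch of how load_descriptions() might look is given below. It assumes each line of the description file starts with a token of the form "file_name.jpg#index" followed by the caption, which is how the Flickr8K token file is commonly laid out; that layout and the sample lines are assumptions, not taken from the project files.

```python
def load_descriptions(doc):
    """Map each photo identifier to the list of its descriptions.

    Assumes every non-empty line looks like "<file_name>.jpg#<n> <caption words...>".
    """
    mapping = {}
    for line in doc.split("\n"):
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        # The identifier is the file name without its extension and "#n" suffix.
        image_id = tokens[0].split(".")[0]
        description = " ".join(tokens[1:])
        mapping.setdefault(image_id, []).append(description)
    return mapping


sample = (
    "1000268201_693b08cb0e.jpg#0\tA child in a pink dress\n"
    "1000268201_693b08cb0e.jpg#1\tA girl going into a wooden building\n"
    "1001773457_577c3a7d70.jpg#0\tA black dog and a spotted dog are fighting\n"
)
descriptions = load_descriptions(sample)
print(len(descriptions))                       # 2
print(descriptions["1000268201_693b08cb0e"])   # two captions for the first photo
```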

We cleaned the text in the following ways in order to reduce the size of the vocabulary of
words we will need to work with:

1) Converted all words to lowercase.
2) Removed all punctuation.
3) Removed all words that are one character or less in length (e.g. ‘a’).
4) Removed all words with numbers in them.
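
The four cleaning steps above can be sketched as follows. The function name clean_description() and the sample sentence are assumptions for illustration. Note that step 4 is implemented here with str.isalpha(), which drops any token containing a non-letter character, not only digits.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_description(text):
    words = text.lower().split()                               # 1) lowercase
    words = [w.translate(_PUNCTUATION_TABLE) for w in words]   # 2) remove punctuation
    words = [w for w in words if len(w) > 1]                   # 3) drop one-character words
    words = [w for w in words if w.isalpha()]                  # 4) drop words with numbers
    return " ".join(words)


print(clean_description("A dog runs on Route66, near the man's car."))
# dog runs on near the mans car
```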

Ideally, we wanted a vocabulary that is both expressive and as small as possible. A
smaller vocabulary will result in a smaller model that will train faster.

The training and development datasets are predefined in the Flickr_8k.trainImages.txt
and Flickr_8k.devImages.txt files respectively, both of which contain lists of photo file
names. From these file names, we can extract the photo identifiers and use these
identifiers to filter photos and descriptions for each set.
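
One way to do this filtering is sketched below. The helper names load_set() and filter_descriptions() and the sample data are assumptions for illustration; the sketch takes the already loaded text of a file such as Flickr_8k.trainImages.txt, with one photo file name per line.

```python
import os


def load_set(doc):
    """Return the set of photo identifiers listed in a file-name document."""
    identifiers = set()
    for line in doc.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # The identifier is the file name without its extension.
        identifiers.add(os.path.splitext(line)[0])
    return identifiers


def filter_descriptions(descriptions, identifiers):
    """Keep only the descriptions whose photo identifier is in the given set."""
    return {key: value for key, value in descriptions.items() if key in identifiers}


train_ids = load_set("photo_a.jpg\nphoto_c.jpg\n")
all_descriptions = {
    "photo_a": ["dog runs on grass"],
    "photo_b": ["child climbs stairs"],
    "photo_c": ["man rides bike"],
}
print(sorted(filter_descriptions(all_descriptions, train_ids)))  # ['photo_a', 'photo_c']
```
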
The model we have developed generates a caption given a photo, and the caption is
generated one word at a time. The sequence of previously generated words is
provided as input. Therefore, we need a ‘first word’ to kick off the generation
process and a ‘last word’ to signal the end of the caption.

We have used the strings ‘startseq’ and ‘endseq’ for this purpose. These tokens are added
to the loaded descriptions as they are loaded. It is important to do this now before we
encode the text so that the tokens are also encoded correctly.
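
The sketch below shows, under stated assumptions, how a description can be wrapped with these tokens and then split into the "words so far, next word" pairs that word-at-a-time generation implies. The helper names wrap_caption() and next_word_pairs() are hypothetical.

```python
def wrap_caption(description):
    """Add the start and end tokens around a cleaned description."""
    return "startseq " + description + " endseq"


def next_word_pairs(caption):
    """Split a caption into (previous words, next word) training pairs."""
    words = caption.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


caption = wrap_caption("dog runs")
for previous_words, next_word in next_word_pairs(caption):
    print(previous_words, "->", next_word)
# ['startseq'] -> dog
# ['startseq', 'dog'] -> runs
# ['startseq', 'dog', 'runs'] -> endseq
```

At prediction time the same idea runs in reverse: generation starts from ‘startseq’ alone and stops once the model emits ‘endseq’.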

The first step in encoding the data was to create a consistent mapping from words to
unique integer values. Keras provides the Tokenizer class that can learn this mapping
from the loaded description data.

The to_lines() function converts the dictionary of descriptions into a list of strings,
and the create_tokenizer() function fits a Tokenizer given the loaded photo
description text.
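
The idea of a consistent word-to-integer mapping can be sketched without Keras. In the example below, to_lines() follows the description above, while build_word_index() and encode() are hypothetical stand-ins that only imitate the kind of mapping a tokenizer learns (more frequent words get smaller integers, starting at 1); they are not the Keras Tokenizer class, and the alphabetical tie-breaking is a choice made for this sketch.

```python
from collections import Counter


def to_lines(descriptions):
    """Flatten the {photo id: [descriptions]} dictionary into a list of strings."""
    return [d for description_list in descriptions.values() for d in description_list]


def build_word_index(lines):
    """Assign integers from 1 upward, most frequent word first (ties alphabetical)."""
    counts = Counter(word for line in lines for word in line.split())
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: index for index, word in enumerate(ranked, start=1)}


def encode(line, word_index):
    """Replace each known word with its integer; unknown words are skipped."""
    return [word_index[word] for word in line.split() if word in word_index]


descriptions = {"photo_a": ["startseq dog runs endseq", "startseq dog sits endseq"]}
word_index = build_word_index(to_lines(descriptions))
print(word_index)
print(encode("startseq dog runs endseq", word_index))  # [3, 1, 4, 2]
```

Index 0 is left unused so that it can serve as padding when sequences are brought to a common length.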

We describe the model in three parts:

1) Photo Feature Extractor: This is a 16-layer VGG model pre-trained on the ImageNet
dataset. We have pre-processed the photos with the VGG model (without the output
layer) and will use the extracted features predicted by this model as input.
2) Sequence Processor: This is a word embedding layer for handling the text input,
followed by a Long Short-Term Memory (LSTM) recurrent neural network layer.
3) Decoder (for lack of a better name): Both the feature extractor and sequence
processor output a fixed-length vector. These are merged together and processed by a
Dense layer to make a final prediction.
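
The three parts can be sketched with the Keras functional API as shown below. This is a sketch under assumptions rather than the exact project model: the layer width of 256, the dropout rate of 0.5, merging by element-wise addition, and the loss and optimizer are illustrative choices that the text above does not specify, and the imports assume Keras is used through the tensorflow.keras package.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model


def define_model(vocab_size, max_length, feature_size=4096, units=256):
    # 1) Photo feature extractor: takes the pre-computed VGG feature vector.
    photo_input = Input(shape=(feature_size,))
    photo = Dropout(0.5)(photo_input)
    photo = Dense(units, activation="relu")(photo)

    # 2) Sequence processor: embeds the words generated so far, then an LSTM.
    text_input = Input(shape=(max_length,))
    text = Embedding(vocab_size, units, mask_zero=True)(text_input)
    text = Dropout(0.5)(text)
    text = LSTM(units)(text)

    # 3) Decoder: merge both fixed-length vectors and predict the next word.
    merged = add([photo, text])
    merged = Dense(units, activation="relu")(merged)
    output = Dense(vocab_size, activation="softmax")(merged)

    model = Model(inputs=[photo_input, text_input], outputs=output)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model


# Small illustrative sizes; real values come from the tokenizer and the data.
model = define_model(vocab_size=50, max_length=10)
model.summary()
```

The softmax output has one entry per vocabulary word, so each prediction is a probability distribution over the next word of the caption.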

The model learns fast and quickly overfits the training dataset. For this reason, we
will monitor the skill of the trained model on the holdout development dataset. When
the skill of the model on the development dataset improves at the end of an epoch, we
will save the whole model to file.

At the end of the run, we can then use the saved model with the best skill on the
development dataset as our final model.

We can do this by defining a ModelCheckpoint callback in Keras and specifying it to monitor
the minimum loss on the validation dataset and save the model to a file that has both
the training and validation loss in the filename.
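
A checkpoint of this kind might be configured as in the sketch below. The file name pattern and the ".keras" extension are assumptions for illustration (the accepted extension depends on the installed Keras version), and the import assumes Keras is used through the tensorflow.keras package.

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Keras fills in the placeholders with the epoch number and the logged losses.
filepath = "model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.keras"

checkpoint = ModelCheckpoint(
    filepath,
    monitor="val_loss",    # watch the loss on the validation (development) data
    mode="min",            # lower loss is better
    save_best_only=True,   # only save when the monitored value improves
    verbose=1,
)

# The callback is then passed to training, for example:
# model.fit(..., validation_data=..., callbacks=[checkpoint])
```

With save_best_only enabled, the most recently written file is the model with the lowest validation loss seen so far.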

To get a sense for the structure of the model, specifically the shapes of the layers, see
the summary listed below.

SCREENSHOTS
CONCLUSION

From the above analysis, we can observe that the task of image captioning divides logically
into two modules. One is an image-based model, which extracts the features and nuances of
the image. The other is a language-based model, which translates the features and objects
given by the image-based model into a natural sentence. For the image-based model (the
encoder) we usually rely on a Convolutional Neural Network, and for the language-based
model (the decoder) we rely on a Recurrent Neural Network.

Some ideas for extending this project that you may wish to explore:

1) Alternate Pre-Trained Photo Models: A small 16-layer VGG model was used for feature
extraction. Consider exploring larger models that offer better performance on the
ImageNet dataset, such as Inception.
2) Smaller Vocabulary: A larger vocabulary of nearly eight thousand words was used in the
development of the model. Many of the words supported may be misspellings or only
used once in the entire dataset. Refine the vocabulary and reduce the size, perhaps by
half.
3) Pre-trained Word Vectors: The model learned the word vectors as part of fitting the
model. Better performance may be achieved by using word vectors either pre-trained on
the training dataset or trained on a much larger corpus of text, such as news articles or
Wikipedia.
4) Tune Model: The configuration of the model was not tuned on the problem. Explore
alternate configurations and see if you can achieve better performance.

In this project, we learned:

• How to prepare photo and text data for training a deep learning model.
• How to design and train a deep learning caption generation model.
• How to evaluate a trained caption generation model and use it to caption entirely new
photographs.

REFERENCES

• https://www.analyticsvidhya.com/blog/2018/04/solving-an-image-captioning-task-using-deep-learning/
• https://medium.freecodecamp.org/building-an-image-caption-generator-with-deep-learning-in-tensorflow-a142722e9b1f
• https://www.kdnuggets.com/2019/02/deep-learning-nlp-rnn-cnn.html
