Module 1
Deep neural networks are currently capable of providing human-level solutions to a variety of problems such as image recognition, speech recognition, machine translation, natural language processing, and many more.
Deep learning is a subset of machine learning (ML), which is itself a subset of artificial intelligence
(AI). The concept of AI has been around since the 1950s, with the goal of making computers able to
think and reason in a way similar to humans. As part of making machines able to think, ML is focused
on how to make them learn without being explicitly programmed. Deep learning goes beyond ML by
creating more complex hierarchical models meant to mimic how humans learn new information.
The "deep" in deep learning refers to the many layers the neural network accumulates over time,
with performance improving as the network gets deeper. Each level of the network processes its
input data in a specific way, which then informs the next layer. So the output from one layer
becomes the input for the next.
Machine learning is a subset of artificial intelligence. Its aim is to give computers the ability to
learn without being specifically programmed on what output to deliver. The algorithms used by
machine learning help the computer learn how to recognize things. This training can be tedious and
require a significant amount of human effort.
Deep learning algorithms go a step further by creating hierarchical models meant to mirror our own brain's thought processes. They use multi-layered neural networks that do not require extensive preprocessing of the input data in order to produce a result. Data scientists feed the raw data into the algorithm; the system analyzes the data based on what it already knows and what it can infer from the new data, and makes a prediction.
The advantage of deep learning is that it can process data in ways that simple rules-based AI cannot.
The technology can be used to drive clear business outcomes as diverse as improved fraud
detection, increased crop yields, improved accuracy of warehouse inventory control systems, and
many others.
Evolution of deep learning
A lot of the important work on neural networks happened in the 1980s and 1990s, but back then computers were slow and datasets very tiny, so the research didn't find many applications in the real world. As a result, in the first decade of the 21st century neural networks largely disappeared from the world of machine learning. It's only in the last few years, first in speech recognition around 2009, and then in computer vision around 2012, that neural networks made a big comeback (with AlexNet and its successors). What changed? Lots of data (big data) and cheap, fast GPUs. Today, neural networks are everywhere. So, if you're doing anything with data, analytics, or prediction, deep learning is definitely something that you want to get familiar with.
Figure-1: Evolution of deep learning
Deep learning is an exciting branch of machine learning that uses data, lots of data, to teach
computers how to do things only humans were capable of before, such as recognizing what's in an
image, what people are saying when they are talking on their phones, translating a document into
another language, and helping robots explore the world and interact with it. Deep learning has
emerged as a central tool to solve perception problems and it's state of the art with computer vision
and speech recognition. Today many companies have made deep learning a central part of their
machine learning toolkit—Facebook, Baidu, Amazon, Microsoft, and Google are all using deep
learning in their products because deep learning shines wherever there is lots of data and complex
problems to solve.
Deep learning is the name we often use for "deep neural networks" composed of several layers. Each layer is made of nodes. The computation happens in the nodes, where the input data is combined with a set of parameters, or weights, that either amplify or dampen that input. These input-weight products are then summed, and the sum is passed through an activation function to determine to what extent the value should progress through the network to affect the final prediction, such as an act of classification. A layer consists of a row of nodes that turn on or off as the input is fed through the network. The output of the first layer becomes the input of the second layer, and so on.
Here's a diagram of what a neural network might look like:
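To make the node computation described above concrete, here is a minimal NumPy sketch of a single node (the variable names and the choice of a sigmoid activation are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

def sigmoid(z):
    # Squash the summed input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def node_output(inputs, weights, bias):
    # Combine the inputs with the weights (amplify or dampen each input),
    # sum the products, then pass the sum through the activation function.
    z = np.dot(inputs, weights) + bias
    return sigmoid(z)

# Example: three inputs feeding one node
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
print(node_output(x, w, bias=0.2))
```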
Sigmoid Activation:
An activation function is a function that is added to an artificial neural network in order to help the network learn complex patterns in the data. By analogy with the neurons in our brains, the activation function is what ultimately decides what is to be fired to the next neuron.
The simplest activation function is referred to as the linear activation, where no transform is applied
at all. A network comprised of only linear activation functions is very easy to train, but cannot learn
complex mapping functions. Linear activation functions are still used in the output layer for
networks that predict a quantity (e.g. regression problems).
Nonlinear activation functions are preferred as they allow the nodes to learn more complex structures in the data. Traditionally, the two most widely used nonlinear activation functions have been the sigmoid and hyperbolic tangent (tanh) activation functions.
A sigmoid function, also called a logistic function, is a mathematical function having a characteristic "S"-shaped curve or sigmoid curve. The sigmoid activation function used in neural networks has an output bounded in (0, 1), and α is the offset parameter that sets the input value at which the sigmoid evaluates to 0.5 (its midpoint).
The sigmoid function often works fine for gradient descent as long as the input data x is kept within a limit. For large values of x, y is nearly constant, so the derivative dy/dx (the gradient) approaches 0, which is often termed the vanishing gradient problem.
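A short illustrative NumPy sketch of why the gradient vanishes for large inputs: the derivative of the sigmoid is σ(x)(1 − σ(x)), which collapses towards zero as |x| grows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    # The gradient is largest at x = 0 and shrinks towards 0 as x grows,
    # which is the saturation / vanishing-gradient behaviour described above.
    print(f"x = {x:5.1f}  sigmoid = {sigmoid(x):.5f}  gradient = {sigmoid_derivative(x):.5f}")
```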
Limitation of sigmoid
A general problem with both the sigmoid and tanh functions is that they saturate: large positive values snap to 1.0, and large negative values snap to -1 for tanh and 0 for sigmoid. Further, the functions are only really sensitive to changes around the mid-point of their input, which is an input of 0, where sigmoid outputs 0.5 and tanh outputs 0.0.
The limited sensitivity and saturation of the function happen regardless of whether the summed
activation from the node provided as input contains useful information or not. Once saturated, it
becomes challenging for the learning algorithm to continue to adapt the weights to improve the
performance of the model.
ReLU
The rectified linear activation function or ReLU for short is a piecewise linear function that will
output the input directly if it is positive, otherwise, it will output zero. It has become the default
activation function for many types of neural networks because a model that uses it is easier to train
and often achieves better performance.
A neural network can be built by combining some linear classifiers with some non-linear functions.
The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the
function f(x)=max(0,x). In other words, the activation is simply thresholded at zero. Unfortunately,
ReLU units can be fragile during training and can die, as a ReLU neuron could cause the weights to
update in such a way that the neuron will never activate on any datapoint again, and so the gradient
flowing through the unit will forever be zero from that point on.
To overcome this problem, a leaky ReLU function will have a small negative slope (of 0.01, or so)
instead of zero when x<0:
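In formula form, leaky ReLU can be written as f(x) = x for x > 0 and f(x) = 0.01x otherwise. A minimal NumPy sketch of both ReLU and leaky ReLU (illustrative, using the conventional 0.01 slope mentioned above):

```python
import numpy as np

def relu(x):
    # Output the input directly if positive, otherwise output zero.
    return np.maximum(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # Same as ReLU for x > 0, but a small slope (0.01 by default) for x < 0
    # so the gradient never becomes exactly zero and the unit cannot "die".
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))        # [0.  0.  0.  2.]
print(leaky_relu(x))  # [-0.03  -0.005  0.  2.]
```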
Gradient, in plain terms, means the slope or slant of a surface, so gradient descent literally means descending a slope to reach the lowest point on that surface. Let us imagine a two-dimensional graph, such as the parabola in the figure below.
In the above graph, the lowest point on the parabola occurs at x = 1. The objective of the gradient descent algorithm is to find the value of "x" such that "y" is minimum. "y" here is termed the objective function that the gradient descent algorithm operates upon to descend to the lowest point. Gradient descent is an iterative algorithm that starts from a random point on a function and travels down its slope in steps until it reaches the lowest point of that function. There are a few downsides to the gradient descent algorithm; we need to take a closer look at the amount of computation we make for each iteration of the algorithm.
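As an illustration of the iterative descent described above, here is a minimal sketch of gradient descent on the parabola y = (x − 1)², whose minimum is at x = 1 (the starting point and learning rate are arbitrary choices):

```python
def dy_dx(x):
    # Derivative (gradient) of y = (x - 1)**2
    return 2.0 * (x - 1.0)

x = 5.0              # random starting point
learning_rate = 0.1  # step size

for step in range(50):
    # Move a small step in the direction opposite to the gradient.
    x = x - learning_rate * dy_dx(x)

print(x)  # converges towards 1.0, the lowest point of the parabola
```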
Say we have 10,000 data points and 10 features. The sum of squared residuals consists of as many terms as there are data points, so 10,000 terms in our case. We need to compute the derivative of this function with respect to each of the features, so in effect we will be doing 10,000 * 10 = 100,000 computations per iteration. It is common to take 1,000 iterations, so in effect we have 100,000 * 1,000 = 100,000,000 computations to complete the algorithm. That is a substantial overhead, and hence gradient descent is slow on huge datasets.
Stochastic gradient descent comes to our rescue! "Stochastic", in plain terms, means "random". Where can we potentially introduce randomness into our gradient descent algorithm?
It is while selecting the data points at each step to calculate the derivatives: SGD randomly picks one data point from the whole data set at each iteration, which reduces the computations enormously.
It is also common to sample a small number of data points instead of just one point at each step, and that is called "mini-batch" gradient descent. Mini-batch tries to strike a balance between the goodness of gradient descent and the speed of SGD.
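A compact sketch of these variants on a toy linear-regression loss (the data, model, and batch size are illustrative assumptions; a batch size of 1 gives pure SGD, the full dataset gives batch gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 10))          # 10,000 data points, 10 features
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=10_000)

def gradient(w, X_batch, y_batch):
    # Gradient of the mean squared error for this batch only.
    error = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ error / len(y_batch)

w = np.zeros(10)
learning_rate = 0.05
batch_size = 32     # 1 -> pure SGD, len(X) -> batch gradient descent

for step in range(1_000):
    idx = rng.integers(0, len(X), size=batch_size)   # random mini-batch
    w -= learning_rate * gradient(w, X[idx], y[idx])

print(np.round(w - true_w, 2))  # close to zero: the estimate approaches true_w
```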
Scaling batch gradient descent is cumbersome because it has to compute a lot when the dataset is big: as a rule of thumb, if computing your loss takes n floating point operations, computing its gradient takes about three times that. But in practice we want to be able to train on lots of data, because on real problems we will always get more gains the more data we use. And because gradient descent is iterative and has to run for many steps, updating the parameters in a single step means going through all the data samples, and then doing this pass over the data tens or hundreds of times.
Instead of computing the loss over the entire set of data samples for every step, we can compute the average loss for a very small random fraction of the training data, think between 1 and 1000 training samples each time. This technique is called Stochastic Gradient Descent (SGD) and is at the core of deep learning, because SGD scales well with both data and model size. SGD gets its reputation for being black magic because it has lots of hyper-parameters to play with and tune, such as initialization parameters, learning rate parameters, decay, and momentum, and you have to get them right.
AdaGrad is a simple modification of SGD which implicitly does momentum and learning rate decay by itself. Using AdaGrad often makes learning less sensitive to hyper-parameters, but it often tends to be a little worse than precisely tuned SGD with momentum. It's still a very good option, though, if you're just trying to get things to work:
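A minimal sketch of the AdaGrad update rule (an illustrative NumPy implementation, not tied to any particular framework): each parameter accumulates the sum of its squared gradients, and its effective learning rate shrinks accordingly.

```python
import numpy as np

def adagrad_update(w, grad, accum, learning_rate=0.1, eps=1e-8):
    # Accumulate the squared gradient for each parameter...
    accum += grad ** 2
    # ...and scale each parameter's step by 1 / sqrt(accumulated squared gradient),
    # so frequently-updated parameters automatically get a decayed learning rate.
    w -= learning_rate * grad / (np.sqrt(accum) + eps)
    return w, accum

w = np.zeros(10)
accum = np.zeros(10)
# inside the training loop:
#   grad = gradient(w, X_batch, y_batch)
#   w, accum = adagrad_update(w, grad, accum)
```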
The word ‘stochastic‘ means a system or process linked with a random probability. Hence, in
Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set for
each iteration. In Gradient Descent, there is a term called “batch” which denotes the total number of
samples from a dataset that is used for calculating the gradient for each iteration. In typical Gradient
Descent optimization, like Batch Gradient Descent, the batch is taken to be the whole dataset.
Although using the whole dataset is really useful for getting to the minima in a less noisy and less
random manner, the problem arises when our dataset gets big.
Suppose, you have a million samples in your dataset, so if you use a typical Gradient Descent
optimization technique, you will have to use all of the one million samples for completing one
iteration while performing the Gradient Descent, and it has to be done for every iteration until the
minima are reached. Hence, it becomes computationally very expensive to perform.
This problem is solved by Stochastic Gradient Descent. SGD uses only a single sample, i.e., a batch size of one, to perform each iteration. The dataset is randomly shuffled and a single sample is selected to perform the iteration.
So, in SGD, we find out the gradient of the cost function of a single example at each iteration instead
of the sum of the gradient of the cost function of all the examples.
In SGD, since only one sample from the dataset is chosen at random for each iteration, the path
taken by the algorithm to reach the minima is usually noisier than your typical Gradient Descent
algorithm. But that doesn’t matter all that much because the path taken by the algorithm does not
matter, as long as we reach the minima and with a significantly shorter training time.
Figure: The path taken by Batch Gradient Descent
One thing to note is that, because SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minima, due to the randomness in its descent. Even though it requires a higher number of iterations to reach the minima than typical Gradient Descent, it is still computationally much less expensive. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
Learning Rate
In machine learning, we deal with two types of parameters: 1) machine-learnable parameters and 2) hyper-parameters. The machine-learnable parameters are the ones the algorithm learns/estimates on its own during training for a given dataset. The hyper-parameters are the ones that machine learning engineers or data scientists assign specific values to, in order to control the way the algorithm learns and to tune the performance of the model. The learning rate, generally represented by the symbol 'α' (shown in equation-4), is a hyper-parameter used to control the rate at which the algorithm updates the parameter estimates, i.e. learns the values of the parameters.
The learning rate is a configurable hyperparameter used in the training of neural networks; it has a small positive value, often in the range between 0.0 and 1.0, and it controls how quickly the model is adapted to the problem. In an optimization algorithm, the learning rate is the tuning parameter that determines the step size at each iteration while moving toward a minimum of a loss function; the amount by which the weights are updated during training is therefore also referred to as the step size. A large learning rate allows the model to learn faster, at the cost of arriving at a sub-optimal final set of weights. A smaller learning rate may allow the model to learn a more optimal or even globally optimal set of weights, but may take significantly longer to train.
Effect of different values for learning rate
Learning rate is used to scale the magnitude of parameter updates during gradient descent. The
choice of the value for learning rate can impact two things: 1) how fast the algorithm learns and 2)
whether the cost function is minimized or not. Figure 2 shows the variation in cost function with a
number of iterations/epochs for different learning rates.
It can be seen that for an optimal value of the learning rate, the cost function value is minimized in a
few iterations (smaller time). This is represented by the blue line in the figure. If the learning rate
used is lower than the optimal value, the number of iterations/epochs required to minimize the cost
function is high (takes longer time). This is represented by the green line in the figure. If the learning
rate is high, the cost function could saturate at a value higher than the minimum value. This is
represented by the red line in the figure. If the learning rate selected is very high, the cost function
could continue to increase with iterations/epochs. An optimal learning rate is not easy to find for a given problem. Though getting the learning rate right is always a challenge, there are some well-researched methods documented for finding optimal learning rates. Some of these techniques are discussed in the following sections. In all these techniques the fundamental idea is to vary the learning rate dynamically instead of using a constant learning rate, as in the sketch below.
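A minimal sketch of one such dynamic schedule, simple exponential (per-epoch) decay of the learning rate; the initial rate and decay factor are illustrative values:

```python
def decayed_learning_rate(initial_lr, decay_rate, epoch):
    # Exponential decay: the learning rate shrinks by a constant factor each epoch,
    # so early epochs take large steps and later epochs take small, careful steps.
    return initial_lr * (decay_rate ** epoch)

initial_lr = 0.1
decay_rate = 0.95
for epoch in range(0, 50, 10):
    print(epoch, round(decayed_learning_rate(initial_lr, decay_rate, epoch), 5))
```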
Regularization
Regularization refers to a set of techniques that lower the complexity of a neural network model during training and thus prevent overfitting.
There are three very popular and efficient regularization techniques: L1, L2, and dropout.
Underfitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data.
Underfitting means that your model fails to make accurate predictions even on the training data; in this case, the train error is large and the val/test error is large too. Overfitting means that your model makes accurate predictions on the training data but not on new data; in this case, the train error is very small and the val/test error is large. Regularization refers to techniques used to calibrate machine learning models in order to minimize the adjusted loss function and prevent overfitting or underfitting.
The first way to prevent overfitting is to look at the performance on a validation set and stop training as soon as it stops improving. This is called early termination, and it's one way to prevent a neural network from over-optimizing on the training set. Another way is to apply regularization. Regularizing means applying artificial constraints on the network that implicitly reduce the number of free parameters while not making it more difficult to optimize.
Fig: Early termination
In the skinny jeans analogy shown in Figure 6b, think stretch pants: they fit just as well, but because they're flexible, they don't make things harder to fit into. The stretch pants of deep learning are sometimes called L2 regularization. The idea is to add another term to the loss which penalizes large weights.
L2 Regularization
Currently, in deep learning practice, the widely used approach for preventing overfitting is to
feed lots of data into the deep network.
Performing L2 regularization encourages the weight values towards zero (but not exactly zero).
Performing L1 regularization encourages the weight values to be exactly zero.
In the case of L2 regularization, our weight parameters decrease but do not necessarily become zero, since the penalty curve becomes flat near zero. During L1 regularization, on the other hand, the weights are forced all the way towards zero.
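In formula form (a standard way of writing these penalties, assumed here, with L_0 the unregularized loss and λ the regularization strength):

```latex
L_{\mathrm{L2}} = L_0 + \lambda \sum_i w_i^2
\qquad\qquad
L_{\mathrm{L1}} = L_0 + \lambda \sum_i \lvert w_i \rvert
```

The squared penalty barely changes once a weight is already small (its curve is flat near zero), while the absolute-value penalty keeps pushing with the same force, which is why L1 drives weights exactly to zero and L2 only shrinks them.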
Intuitively speaking, smaller weights reduce the impact of the hidden neurons. In that case, those hidden neurons become negligible and the overall complexity of the neural network is reduced.
Dropout
In addition to L2 and L1 regularization, another famous and powerful regularization technique is called dropout regularization. Dropout means that, during training, each neuron of the neural network gets turned off with some probability P.
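A minimal sketch of (inverted) dropout applied to one layer's activations, the way it is commonly implemented in practice; the drop probability and array shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.3, training=True):
    # During training, turn each neuron off with probability p_drop and
    # rescale the survivors so the expected activation stays the same.
    if not training:
        return activations            # no dropout at inference time
    keep_mask = rng.random(activations.shape) >= p_drop
    return activations * keep_mask / (1.0 - p_drop)

hidden = rng.normal(size=(4, 8))      # a batch of 4 examples, 8 hidden units
print(dropout(hidden, p_drop=0.3))    # roughly 30% of the values zeroed out
```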
If both L1 and L2 regularization work well, you might be wondering why we need both. It turns out
they have different but equally useful properties. From a practical standpoint, L1 tends to shrink
coefficients to zero whereas L2 tends to shrink coefficients evenly. L1 is therefore useful for feature
selection, as we can drop any variables associated with coefficients that go to zero. L2, on the other
hand, is useful when you have collinear/codependent features. (An example pair of codependent
features is gender and ispregnant since, at the current level of medical technology, only females can
be ispregnant.) Codependence tends to increase coefficient variance, making coefficients
unreliable/unstable, which hurts model generality. L2 reduces the variance of these estimates, which
counteracts the effect of codependencies.
CNN
A convolution is an operation that changes a function into something else. We do convolutions so that we can transform the original function into a form from which we can get more information.
Convolutions have been used for a long time in image processing to blur and sharpen images, and to perform other operations such as enhancing edges and embossing.
The term "convolution" in CNN denotes the mathematical operation of convolution, a special kind of linear operation wherein two functions are multiplied to produce a third function which expresses how the shape of one function is modified by the other. In simple terms, the image and the filter, each of which can be represented as a matrix, are multiplied to give an output that is used to extract features from the image.
There are three types of layers that make up a CNN: convolutional layers, pooling layers, and fully-connected (FC) layers. When these layers are stacked, a CNN architecture is formed. In addition to these three layers, there are two more important components: the dropout layer and the activation function.
Convolutional Layer
This layer is the first layer used to extract the various features from the input images. In this layer, the mathematical operation of convolution is performed between the input image and a filter of a particular size M×M. By sliding the filter over the input image, the dot product is taken between the filter and the part of the input image covered by the filter (M×M).
The output is termed as the Feature map which gives us information about the image such as the
corners and edges. Later, this feature map is fed to other layers to learn several other features of the
input image.
The convolution layer in a CNN passes the result to the next layer after applying the convolution operation to the input. Convolutional layers benefit a CNN a lot because they keep the spatial relationship between the pixels intact.
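A minimal NumPy sketch of the sliding-window convolution described above (stride 1, no padding; the small example image and filter are made up for illustration):

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image; at each position take the element-wise
    # product with the covered patch and sum it, producing one feature-map value.
    h, w = image.shape
    m, n = kernel.shape
    out = np.zeros((h - m + 1, w - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+m, j:j+n] * kernel)
    return out

image = np.array([[1, 2, 3, 0],
                  [4, 5, 6, 1],
                  [7, 8, 9, 2],
                  [1, 0, 1, 3]], dtype=float)
edge_filter = np.array([[1, -1],
                        [1, -1]], dtype=float)   # crude vertical-edge detector
print(convolve2d(image, edge_filter))            # 3x3 feature map
```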
Pooling Layer
In most cases, a convolutional layer is followed by a pooling layer. The primary aim of this layer is to decrease the size of the convolved feature map and so reduce the computational cost. This is performed by decreasing the connections between layers; the pooling operation acts independently on each feature map. Depending on the method used, there are several types of pooling operation; pooling basically summarises the features generated by a convolution layer.
In Max Pooling, the largest element is taken from the feature map. Average Pooling calculates the average of the elements in a predefined-size image section. The total sum of the elements in the predefined section is computed in Sum Pooling. The pooling layer usually serves as a bridge between the convolutional layer and the FC layer.
Pooling generalises the features extracted by the convolution layer and helps the network recognise the features independently of their exact position. It also reduces the amount of computation in the network.
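A minimal NumPy sketch of max pooling and average pooling with a 2×2 window and stride 2 (a common choice, used here purely for illustration):

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    # Split the feature map into non-overlapping size x size windows and keep
    # one summary value per window: the maximum or the average.
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 8, 3],
               [1, 4, 9, 0]], dtype=float)
print(pool2d(fm, mode="max"))       # [[6. 4.] [7. 9.]]
print(pool2d(fm, mode="average"))   # [[3.75 2.25] [3.5  5.  ]]
```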
Fully Connected Layer
The Fully Connected (FC) layer consists of the weights and biases along with the neurons and is used
to connect the neurons between two different layers. These layers are usually placed before the
output layer and form the last few layers of a CNN Architecture.
In this layer, the feature maps from the previous layers are flattened and fed to the FC layer. The flattened vector then passes through a few more FC layers, where the usual mathematical operations take place; it is in this stage that the classification process begins. The reason two layers are connected is that two fully connected layers will usually perform better than a single fully connected layer. These layers in a CNN reduce the need for human supervision.
Dropout Layer
Usually, when all the features are connected to the FC layer, it can cause overfitting on the training dataset. Overfitting occurs when a particular model works so well on the training data that it has a negative impact on the model's performance when used on new data.
To overcome this problem, a dropout layer is utilised, wherein a few neurons are dropped from the neural network during the training process, resulting in a reduced size of the model. On passing a dropout of 0.3, 30% of the nodes are dropped out randomly from the neural network.
Dropout results in improving the performance of a machine learning model as it prevents overfitting
by making the network simpler. It drops neurons from the neural networks during training.
Activation Functions
Finally, one of the most important components of the CNN model is the activation function. Activation functions are used to learn and approximate any kind of continuous and complex relationship between variables of the network. In simple words, they decide which information of the model should fire in the forward direction and which should not at the end of the network.
They add non-linearity to the network. There are several commonly used activation functions, such as ReLU, Softmax, tanh and Sigmoid, and each of these functions has a specific usage. For a binary classification CNN model, sigmoid and softmax functions are preferred, and for multi-class classification, softmax is generally used. In simple terms, activation functions in a CNN model determine whether a neuron should be activated or not; they decide whether the input to the network is important or not for making a prediction, using mathematical operations.
The full picture of Convolutional Neural Network
To decompose the entire picture of a CNN, there are three parts to it:
1. The convolution step
2. The subsampling (pooling) step
(Steps 1 and 2 can be repeated)
3. The linear layer
(Step 3 is also called the fully connected layer)
Imagine you are given an image of a cute kitty. The kitty with a 'please' face is actually a matrix of pixels. The size of the image is the pixel size represented by width x height; this is the size of our input. A black and white image is basically an image represented by different shades of black and white, so it will be a single flat n*n matrix [2D].
Now if the image is coloured, it is a combination of red, green and blue used to obtain the different colours, so it will be a 3D matrix: simply three 2D matrices stacked on top of each other, one each representing red, green and blue.
The three layers are called channels, also known as the depth.
In a convolution, we don't look at the image as a whole; we take a window. A window is a small portion of the image. How do we choose the small portion? Well, say we basically want to see only the first 2*2 values of a 4*4 matrix; then the portion/window size is 2*2. This small window is then moved over the entire image sequentially.
Why? For now, just say that to see an image carefully, we look at small parts of it in detail. This small portion with a chosen size is called a window or patch.
Every time the window moves, it does some calculation and produces one value that represents whatever part of the image is seen in the window. The process of moving the patch or window sequentially over the entire image is called convolution. But while moving, how much do we move? We can move to the adjacent position, or skip one and move two positions at a time. How many pixels to move is defined by a value called the stride.
Say we have a stride of 1 (our step size for moving forward). The image, for simplicity, is made up of 3*3 pixels as shown in the box on the left. The image is black and white, so the depth is 1 in this case, and the input size of our image is 3*3*1. The window or patch in this case is 2*2*1. Why? We arbitrarily decided to look at only the first 4 pixels out of the 9 pixels, i.e. 2*2 pixels at a time.
The window or patch is also called a Filter in CNN terminology.
Say when we see the first 2*2 window, we calculate a number that represents that 2*2 part of the image; this number is noted and we move ahead by one step (the stride). Then we calculate another number, and so on until we reach the end. The numbers we noted will look like the box on the left, and the matrix will be of size 2*2.
Suppose we use a 2*2 window: each number except the corner numbers is seen in the window at least twice, right? So we are losing some information by seeing the corner numbers only once. How about adding zeros around the edges so we can move the window and see every number at least twice? This adding of zeros is called padding.
Again, when we see the first 2*2 window we calculate a number that represents that 2*2 part of the image, note it, and move ahead by one step (stride).
What is this calculation?
We could simply take the dot product of the filter and the pixels of the part of the image we see in the window. Say we have an RGB image with depth 3 and an image size of 32*32, and a window size of 5*5; then the window/filter becomes a 5*5*3 matrix of weights w, and we just take the dot product of the filter with the image patch.
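A commonly used formula (assumed here, but consistent with the examples above) gives the size of the resulting activation map from the input size N, filter size F, padding P and stride S:

```latex
\text{output size} = \frac{N - F + 2P}{S} + 1
```

For the 3*3 image with a 2*2 filter, stride 1 and no padding, this gives (3 - 2 + 0)/1 + 1 = 2, matching the 2*2 matrix of numbers noted earlier.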
Next we use this information to do pooling. Pooling is nothing but reducing the information in the activation map, and it is done in a similar way to how we moved the window over the image to get the activation map in the first place.
The calculations done here, however, are fundamentally different. The most common one is max-pooling (remember that to get the activation map we used dot products).
What is max pooling? It is just selecting the maximum of the numbers in the window.
Similar to max pooling, we can take the average value of the numbers in the window, and that gives us avg pooling, i.e. average pooling.
Then we can stack several convolutions, in a similar way to how we stacked different linear layers in our previous posts. The final output of all the convolutions is, however, given to a linear layer (also called the fully connected layer).
We started with the full picture of a CNN and explained all the parts by decomposing it. When we join all the parts back together, we again get the full picture of a CNN, as sketched below.
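As an illustration only (the layer sizes and the use of PyTorch are assumptions, not part of the material above), here is a minimal CNN that joins the three parts: convolution, subsampling, and a fully connected layer.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # 1. The convolution step (conv + ReLU blocks)
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, padding=2),  # RGB input, depth 3
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 2. the subsampling step
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # 3. The linear (fully connected) layer on the flattened feature maps
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.classifier(x)

model = TinyCNN()
images = torch.randn(4, 3, 32, 32)   # a batch of four 32x32 RGB images
print(model(images).shape)           # torch.Size([4, 10])
```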
The metric applied is the loss: the higher the loss, the worse the model is. To improve the knowledge of the network, some optimization is required, performed by adjusting the weights of the net. Stochastic gradient descent is the method employed to change the values of the weights in the right direction. Once the adjustment is made, the network can use another batch of data to test its new knowledge.
The error, hopefully, is lower than before, yet not small enough. The optimization step is repeated iteratively until the error is minimized, i.e., no more information can be extracted.
The problem with this type of model is that it does not have any memory: the inputs and outputs are independent. In other words, the model does not care about what came before. This raises problems when you need to predict time series or sentences, because the network needs information about the historical data or the past words.
To overcome this issue, a new type of architecture was developed: the Recurrent Neural Network (RNN).
In a feed-forward neural network, the information only moves in one direction — from the input
layer, through the hidden layers, to the output layer. The information moves straight through the
network and never touches a node twice.
Feed-forward neural networks have no memory of the input they receive and are bad at predicting
what’s coming next. Because a feed-forward network only considers the current input, it has no
notion of order in time. It simply can’t remember anything about what happened in the past except
its training.
In a RNN the information cycles through a loop. When it makes a decision, it considers the current
input and also what it has learned from the inputs it received previously.
The two images below illustrate the difference in information flow between a RNN and a feed-
forward neural network.
A usual RNN has a short-term memory; in combination with an LSTM it also gets a long-term memory (more on this later). Imagine you have a normal feed-forward neural network and give it the word "neuron" as an input, and it processes the word character by character. By the time it reaches the character "r," it has already forgotten about "n," "e" and "u," which makes it almost impossible for this type of neural network to predict which character will come next.
A recurrent neural network, however, is able to remember those characters because of its internal
memory. It produces output, copies that output and loops it back into the network.
Simply put: recurrent neural networks add the immediate past to the present.
Therefore, a RNN has two inputs: the present and the recent past. This is important because the
sequence of data contains crucial information about what is coming next, which is why a RNN can do
things other algorithms can’t.
A feed-forward neural network, like all other deep learning algorithms, assigns a weight matrix to its inputs and then produces the output. Note that an RNN applies weights to the current input and also to the previous hidden state. Furthermore, a recurrent neural network tweaks its weights through both gradient descent and backpropagation through time (BPTT).
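A minimal sketch of a single recurrent step (plain NumPy; the tanh activation and the names W_x, W_h, b are conventional choices assumed here): the new hidden state mixes the current input with the previous hidden state.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # Combine the present input (x_t) with the recent past (h_prev):
    # each gets its own weight matrix, and the result becomes the new memory.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                       # empty memory at the start
for x_t in rng.normal(size=(5, input_size)):    # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h.shape)                                  # (16,) -- the final hidden state
```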
TYPES OF RNNS
• One to One
• One to Many
• Many to One
• Many to Many
While feed-forward neural networks map one input to one output, RNNs can map one to many,
many to many (translation) and many to one (classifying a voice).
BPTT is basically just a fancy buzz word for doing backpropagation on an unrolled RNN. Unrolling is a
visualization and conceptual tool, which helps you understand what’s going on within the network.
Most of the time when implementing a recurrent neural network in the common programming
frameworks, backpropagation is automatically taken care of, but you need to understand how it
works to troubleshoot problems that may arise during the development process.
You can view a RNN as a sequence of neural networks that you train one after another with
backpropagation.
The image below illustrates an unrolled RNN. On the left, the RNN is unrolled after the equal sign.
Note there is no cycle after the equal sign since the different time steps are visualized and
information is passed from one time step to the next. This illustration also shows why a RNN can be
seen as a sequence of neural networks.
If you do BPTT, the conceptualization of unrolling is required since the error of a given time step
depends on the previous time step.
Within BPTT the error is backpropagated from the last to the first timestep, while unrolling all the
timesteps. This allows calculating the error for each timestep, which allows updating the weights.
Note that BPTT can be computationally expensive when you have a high number of timesteps.
Two issues of standard RNNs
There are two major obstacles RNNs have had to deal with, but to understand them, you first need to know what a gradient is.
A gradient is a partial derivative with respect to its inputs. If you don’t know what that means, just
think of it like this: a gradient measures how much the output of a function changes if you change
the inputs a little bit.
You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the
slope and the faster a model can learn. But if the slope is zero, the model stops learning. A gradient
simply measures the change in all weights with regard to the change in error.
EXPLODING GRADIENTS
Exploding gradients occur when the algorithm, without much reason, assigns an unreasonably high importance to the weights. Fortunately, this problem can be easily solved by truncating or squashing (clipping) the gradients.
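A minimal sketch of this squashing idea, gradient clipping by global norm (illustrative NumPy; the threshold value is an arbitrary choice):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    # If the overall gradient norm explodes past the threshold,
    # rescale all gradients so their combined norm equals the threshold.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([300.0, -400.0]), np.array([120.0])]   # "exploded" gradients
print(clip_gradients(grads))   # same directions, norm squashed down to 5.0
```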
VANISHING GRADIENTS
Vanishing gradients occur when the values of a gradient are too small and the model stops learning
or takes way too long as a result. This was a major problem in the 1990s and much harder to solve
than the exploding gradients. Fortunately, it was solved through the concept of LSTM by Sepp
Hochreiter and Juergen Schmidhuber.
Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks are an extension of RNNs that extends their memory, which makes them well suited to learning from important experiences that have very long time lags in between. The units of an LSTM are used as building blocks for the layers of an RNN, which is then often called an LSTM network. LSTMs assign the data "weights" that help the RNN either let new information in, forget information, or give it enough importance to impact the output.
LSTMs enable RNNs to remember inputs over a long period of time. This is because LSTMs contain information in a
memory, much like the memory of a computer. The LSTM can read, write and delete information from its memory.
This memory can be seen as a gated cell, with gated meaning the cell decides whether or not to store or delete
information (i.e., if it opens the gates or not), based on the importance it assigns to the information. The assigning of
importance happens through weights, which are also learned by the algorithm. This simply means that it learns over
time what information is important and what is not.
In an LSTM you have three gates: the input, forget and output gates. These gates determine whether or not to let new input in (input gate), delete the information because it isn't important (forget gate), or let it impact the output at the current timestep (output gate). Below is an illustration of an LSTM cell with its three gates:
The gates in an LSTM are analog, in the form of sigmoids, meaning they range from zero to one. The fact that they are analog (and therefore differentiable) is what makes backpropagation through them possible.
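A minimal NumPy sketch of one LSTM step using the standard gate equations (the weight shapes and names are conventional assumptions; real implementations typically pack these matrices together for speed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # Stack the previous hidden state and the current input.
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(W["i"] @ z + b["i"])   # input gate: let new information in?
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate: delete old information?
    o = sigmoid(W["o"] @ z + b["o"])   # output gate: expose memory to the output?
    g = np.tanh(W["g"] @ z + b["g"])   # candidate values to write to memory
    c = f * c_prev + i * g             # update the gated memory cell
    h = o * np.tanh(c)                 # new hidden state / output
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
W = {k: rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
     for k in "ifog"}
b = {k: np.zeros(hidden_size) for k in "ifog"}

h = c = np.zeros(hidden_size)
for x_t in rng.normal(size=(3, input_size)):   # a short input sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)   # (8,) (8,)
```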
The problematic issue of vanishing gradients is solved through LSTM because it keeps the gradients steep enough,
which keeps the training relatively short and the accuracy high.
Discriminative models are those used for most supervised classification or regression problems. As an example
of a classification problem, suppose you’d like to train a model to classify images of handwritten digits from 0 to
9. For that, you could use a labeled dataset containing images of handwritten digits and their associated labels
indicating which digit each image represents.
During the training process, you’d use an algorithm to adjust the model’s parameters. The goal would be to
minimize a loss function so that the model learns the probability distribution of the output given the input.
After the training phase, you could use the model to classify a new handwritten digit image by estimating the
most probable digit the input corresponds to, as illustrated in the figure below:
You can picture discriminative models for classification problems as blocks that use the training data to learn the
boundaries between classes. They then use these boundaries to discriminate an input and predict its class. In
mathematical terms, discriminative models learn the conditional probability P(y|x) of the output y given the
input x.
Besides neural networks, other structures can be used as discriminative models such as logistic regression models
and support vector machines (SVMs).
Generative models like GANs, however, are trained to describe how a dataset is generated in terms of
a probabilistic model. By sampling from a generative model, you’re able to generate new data. While
discriminative models are used for supervised learning, generative models are often used with unlabeled datasets
and can be seen as a form of unsupervised learning.
Using the dataset of handwritten digits, you could train a generative model to generate new digits. During the
training phase, you’d use some algorithm to adjust the model’s parameters to minimize a loss function and learn
the probability distribution of the training set. Then, with the model trained, you could generate new samples, as
illustrated in the following figure:
To output new samples, generative models usually consider a stochastic, or random, element that influences the
samples generated by the model. The random samples used to drive the generator are obtained from a latent
space in which the vectors represent a kind of compressed form of the generated samples.
Unlike discriminative models, generative models learn the probability P(x) of the input data x, and by having the
distribution of the input data, they’re able to generate new data instances.