DL UNIT 1 and 2 - NOTES


UNIT – I – DEEP LEARNING

UNIT I INTRODUCTION TO DEEP LEARNING

Introduction to machine learning - Linear models (SVMs, perceptrons, logistic regression) - Introduction to neural nets: what a shallow network computes - Training a network: loss functions, back propagation and stochastic gradient descent - Neural networks as universal function approximators.

1 Definition of Machine Learning [ML]:

Well posed learning problem: "A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E." (Tom Mitchell)

Machine Learning (ML) is the study of algorithms that improve automatically through experience and
by the use of data. It is viewed as a part of artificial intelligence. ML algorithms build a model
based on sample data, known as training data, in order to make predictions or decisions without
being explicitly programmed to do so. ML algorithms are used in a wide variety of applications,
for example in medicine, email filtering, speech recognition and computer vision, where it is
difficult or impractical to develop conventional algorithms to perform the required tasks.

ML involves computers learning how to perform tasks without being explicitly programmed to
do so. Models learn from data so that they can carry out specific applications. For simple
tasks assigned to a model, it is feasible to tell the machine how to execute every step needed to
solve the problem at hand; no learning is required. For more advanced tasks, it can be challenging
for a human to manually create the needed algorithms. In practice, it can turn out to be more
effective to help the machine develop its own algorithm, rather than having human programmers
specify every needed step.

The term machine learning was coined in 1959 by Arthur Samuel, an American IBMer and pioneer in the
field of computer gaming and artificial intelligence.

1.1 Fundamentals of ANN

Neural computing is an information processing paradigm, inspired by biological nervous systems,
composed of a large number of highly interconnected processing elements (neurons) working in
unison to solve specific problems.

Artificial neural networks (ANNs), like people, learn by example. An ANN is configured
for a specific application, such as pattern recognition or data classification, through a learning
process. Learning in biological systems involves adjustments to the synaptic connections that exist
between the neurons. This is true of ANNs as well.

1.2 The Biological Neuron

The human brain consists of a very large number (tens of billions) of neural cells that process
information. Each cell works like a simple processor. The massive interaction between all cells and
their parallel processing is what makes the brain’s abilities possible. Figure 1 represents a human
biological nervous unit. The various parts of the biological neural network (BNN) are marked in Figure 1.

Figure 1: Biological Neural Network

Dendrites are branching fibres that extend from the cell body or soma.

Soma or cell body of a neuron contains the nucleus and other structures, support chemical
processing and production of neurotransmitters.

Axon is a single fibre that carries information away from the soma to the synaptic sites of
other neurons (dendrites and somas), muscles, or glands.

Axon hillock is the site of summation for incoming information. At any moment, the
collective influence of all neurons that conduct impulses to a given neuron will determine whether
or not an action potential will be initiated at the axon hillock and propagated along the axon.

Myelin sheath consists of fat-containing cells that insulate the axon from electrical activity.
This insulation acts to increase the rate of transmission of signals. A gap exists between each
myelin sheath cell along the axon. Since fat inhibits the propagation of electricity, the signals jump
from one gap to the next.

Nodes of Ranvier are the gaps (about 1 μm) between myelin sheath cells. Since fat serves as
a good insulator, the myelin sheaths speed the rate of transmission of an electrical impulse along the
axon.

Synapse is the point of connection between two neurons or a neuron and a muscle or a gland.
Electrochemical communication between neurons takes place at these junctions.

Terminal buttons of a neuron are the small knobs at the end of an axon that release
chemicals called neurotransmitters.

Information flow in a neural cell

The input/output and the propagation of information are shown below.

1.3. Artificial neuron model

An artificial neuron is a mathematical function conceived as a simple model of a real (biological) neuron.

 The McCulloch-Pitts Neuron


This is a simplified model of real neurons, known as a Threshold Logic Unit.
 A set of input connections brings in activations from other neurons.
 A processing unit sums the inputs, and then applies a non-linear activation function (i.e.
squashing/transfer/threshold function).
 An output line transmits the result to other neurons (a small sketch of this computation follows below).
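To make this concrete, here is a minimal sketch of a threshold logic unit in Python; the weights, threshold and the AND-gate example are illustrative values, not taken from the text.

import numpy as np

def threshold_logic_unit(inputs, weights, threshold):
    """McCulloch-Pitts style unit: fire (output 1) if the weighted sum reaches the threshold."""
    net = np.dot(inputs, weights)          # sum of incoming weighted activations
    return 1 if net >= threshold else 0    # hard threshold (step) activation

# Example: a 2-input unit that behaves like an AND gate
weights = np.array([1.0, 1.0])
threshold = 2.0
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, threshold_logic_unit(np.array(x), weights, threshold))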

1.3.1 Basic Elements of ANN:

A neuron consists of three basic components: weights, thresholds (bias) and a single activation
function. An artificial neural network (ANN) model based on biological neural systems is
shown in figure 2.

Figure 2: Basic Elements of Artificial Neural Network

1.4 Different Learning Rules

A brief classification of Different Learning algorithms is depicted in figure 3.

 Training: It is the process in which the network is taught to change its weight
and bias.
 Learning: It is the internal process of training where the artificial neural
system learns to update/adapt the weights and biases.

The different training/learning procedures available in ANN are

 Supervised learning
 Unsupervised learning
 Reinforced learning
 Hebbian learning
 Gradient descent learning
 Competitive learning
 Stochastic learning

1.4.1. Requirements of Learning Laws:

• A learning law should lead to convergence of the weights

• The learning or training time should be small for capturing the information from the
training pairs

• Learning should use only local information

• The learning process should be able to capture the complex non-linear mapping between
the input and output pairs

• Learning should be able to capture as many patterns as possible

• Storage of the pattern information gathered at the time of learning should be high for
the given network

Figure 3: Different Training methods of Artificial Neural Network
1.4.1.1. Supervised learning:

Every input pattern that is used to train the network is associated with an output pattern which is
the target or the desired pattern.

A teacher is assumed to be present during the training process, when a comparison is made
between the network’s computed output and the correct expected output, to determine the error. The
error can then be used to change the network parameters, which results in an improvement in
performance.

1.4.1.2 Unsupervised learning:

In this learning method the target output is not presented to the network. It is as if there is no
teacher to present the desired patterns and hence the system learns on its own by discovering and
adapting to structural features in the input patterns.

1.4.1.3 Reinforced learning:

In this method, a teacher, though available, does not present the expected answer but only
indicates whether the computed output is correct or incorrect. The information provided helps the network in
the learning process.

1.4.1.4 Hebbian learning:

This rule was proposed by Hebb and is based on correlative weight adjustment. This is the oldest
learning mechanism inspired by biology. In this, the input-output pattern pairs (xᵢ, yᵢ) are
associated by the weight matrix W, known as the correlation matrix.

It is computed as

W = Σᵢ₌₁ⁿ xᵢ yᵢᵀ ------------ eq(1)

Here yᵢᵀ is the transpose of the associated output vector yᵢ. Numerous variants of the rule
have been proposed.
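As an illustration of eq (1), here is a minimal Python sketch that builds the correlation matrix W as a sum of outer products of the pattern pairs; the bipolar patterns used are made-up values.

import numpy as np

def hebbian_weights(X, Y):
    """Correlation matrix W = sum_i x_i y_i^T, one outer product per training pair."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for x, y in zip(X, Y):
        W += np.outer(x, y)   # correlative weight adjustment for one pattern pair
    return W

# Two bipolar input patterns associated with two bipolar target patterns (illustrative)
X = np.array([[1, -1, 1], [-1, 1, 1]])
Y = np.array([[1, -1], [-1, 1]])
print(hebbian_weights(X, Y))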

1.4.1.5 Gradient descent learning:

This is based on the minimization of the error E defined in terms of the weights and the activation function
of the network. Also, it is required that the activation function employed by the network is
differentiable, as the weight update is dependent on the gradient of the error E.

Thus if ∆wᵢⱼ is the weight update of the link connecting the i-th and j-th neuron of the two
neighbouring layers, then ∆wᵢⱼ is defined as,

∆wᵢⱼ = −η (𝜕E/𝜕wᵢⱼ) ----------- eq(2)

Where η is the learning rate parameter and 𝜕E/𝜕wᵢⱼ is the error gradient with reference to the
weight wᵢⱼ.

1.5 Perceptron Model

1.5.1 Simple Perceptron for Pattern Classification

Perceptron network is capable of performing pattern classification into two or more
categories. The perceptron is trained using the perceptron learning rule. We will first consider
classification into two categories and then the general multiclass classification later. For
classification into only two categories, all we need is a single output neuron. Here we will use bipolar neurons.
The simplest architecture that could do the job consists of a layer of N input neurons, an output
layer with a single output neuron, and no hidden layers. This is the same architecture as we saw
before for Hebb learning. However, we will use a different transfer function here for the output
neurons, as given below in eq (7). Figure 4 represents a single layer perceptron network.

y = f(x) = +1 if x ≥ 0, and f(x) = −1 if x < 0 eq (7)

Figure 4: Single Layer Perceptron

Equation 7 gives the bipolar activation function, which is the most common function used in
perceptron networks. Figure 4 represents a single layer perceptron network. The inputs arising
from the problem space are collected by the sensors and they are fed to the association
units. Association units are the units which are responsible for associating the inputs based on their
similarities. This unit groups the similar inputs, hence the name association unit. A single input from
each group is given to the summing unit. Weights are randomly fixed initially and assigned to these
inputs. The net value is calculated by using the expression

x = Σ wiai – θ eq(8)

This value is given to the activation function unit to get the final output response. The actual
output is compared with the target (desired) output. If they are the same then we can stop training;
else the weights have to be updated, since it means there is an error. The error is given as δ = b − s, where b is the desired
/ target output and s is the actual outcome of the machine. Here the weights are updated based on the
perceptron learning law as given in equation 9.

The weight change is given as Δw = η δ ai. So the new weight is given as

Wi (new) = Wi (old) + Change in weight vector (Δw) eq(9)

1.5.2. Perceptron Algorithm

Step 1: Initialize weights and bias. For simplicity, set weights and bias to zero. Set the learning
rate in the range of zero to one.

• Step 2: While the stopping condition is false, do steps 3-7

• Step 3: For each training pair s:t, do steps 4-7

• Step 4: Set activations of input units xi = ai

• Step 5: Calculate the summing part value Net = Σ aiwi − θ

• Step 6: Compute the response of the output unit based on the activation function

• Step 7: Update weights and bias if an error occurred for this pattern (if y is not equal to t)
wi(new) = wi(old) + α t xi , & bias b(new) = b(old) + α t
Else wi(new) = wi(old) & b(new) = b(old)

• Step 8: Test Stopping Condition
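A minimal sketch of this training loop in Python, assuming bipolar inputs and targets and folding the threshold into a bias term; the AND-gate data is an illustrative choice, not taken from the text.

import numpy as np

def train_perceptron(X, t, lr=0.1, epochs=20):
    """Perceptron learning rule: w <- w + lr * t * x whenever the response is wrong."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, t):
            net = np.dot(w, x) + b
            y = 1 if net >= 0 else -1          # bipolar step activation
            if y != target:                     # update only when an error occurs
                w += lr * target * x
                b += lr * target
                errors += 1
        if errors == 0:                         # stopping condition: no errors in a full pass
            break
    return w, b

# Illustrative AND-gate data with bipolar inputs and targets
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
t = np.array([-1, -1, -1, 1])
print(train_perceptron(X, t))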

1.5.3. Limitations of single layer perceptrons:

• Uses only a binary (step) activation function

• Can be used only for linear networks

• Since it uses supervised learning, an optimal solution is provided

• Training time is more

• Cannot solve linearly inseparable problems

1.5.4. Multi-Layer Perceptron Model:

Figure 5 is the general representation of a multi-layer perceptron network. In between the
input and output layers there will be one or more additional layers, also known as hidden layers.
Figure 5: Multi-Layer Perceptron

1.5.5. Multi Layer Perceptron Algorithm

1. Initialize the weights (Wi) and bias (b0) to small random values near zero
2. Set the learning rate η or α in the range of “0” to “1”
3. Check for the stop condition. If the stop condition is false, do steps 4 to 8
4. For each training pair, do steps 5 to 8
5. Set activations of the input units: xi = si for i = 1 to N
6. Calculate the output response
   yin = b0 + Σ xiwi
7. The activation function used is the bipolar sigmoidal or bipolar step function
   (For multi-layer networks, steps 6 and 7 are repeated for each layer, based on the number of layers)
8. If the target (t) is not equal to the actual output (Y), then update the weights and bias
   based on the perceptron learning law:
   Wi (new) = Wi (old) + Change in weight vector
   Change in weight vector = η ti xi
   where η = learning rate, ti = target output of the ith unit, xi = ith input vector
   b0 (new) = b0 (old) + Change in bias
   Change in bias = η ti
   Else Wi (new) = Wi (old), b0 (new) = b0 (old)
9. Test for the stop condition

1.6. Linearly separable & linearly inseparable tasks:

Figure 6: Representation of linearly separable & linearly inseparable tasks

Perceptrons are successful only on problems with a linearly separable solution space. Figure 6
represents both a linearly separable and a linearly inseparable problem. Perceptrons cannot handle, in
particular, tasks which are not linearly separable (known as the linear inseparability problem). Sets of
points in two-dimensional space are linearly separable if the sets can be separated by a straight
line. Generalizing, a set of points in n-dimensional space is called linearly separable if it can be
separated by a hyperplane (a straight line in two dimensions), as represented in Figure 6.

A single layer perceptron can be used for linear separation, for example the AND gate. But it cannot be
used for non-linear, inseparable problems (for example the XOR gate). Consider Figure 7.

Figure 7: XOR representation (linearly inseparable task)


Here a single decision line cannot separate the zeros and ones linearly. At least two lines
are required to separate the zeros and ones, as shown in Figure 7. Hence single layer networks cannot
be used to solve inseparable problems. To overcome this problem we go for the creation of convex
regions.

Convex regions can be created by multiple decision lines arising from multi-layer
networks. A single layer network cannot be used to solve an inseparable problem. Hence we go for a
multi-layer network, thereby creating convex regions which solve the inseparable problem.

1.6.1 Convex Region:

Select any two points in a region and draw a straight line between these two points. If the
points selected and the line joining them both lie inside the region, then that region is known as a
convex region.

1.6.2. Types of convex regions

(a) Open Convex region (b) Closed Convex region

Figure 8: Open convex region

Figure 9 A: Circle - Closed convex region Figure 9 B: Triangle - Closed convex region

1.7. Logistic Regression


Logistic regression is a probabilistic model that organizes the instances in terms of
probabilities. Because the classification is probabilistic, a natural method for optimizing the
parameters is to ensure that the predicted probability of the observed class for each training
occurrence is as large as possible. This goal is achieved by using the notion of maximum-likelihood
estimation in order to learn the parameters of the model. The likelihood of the training data is
defined as the product of the probabilities of the observed labels of each training instance. Clearly,
larger values of this objective function are better. By using the negative logarithm of this value, one
obtains a loss function in minimization form. Therefore, the output node uses the negative log-
likelihood as a loss function. This loss function replaces the squared error used in the Widrow-Hoff
method. The output layer can be formulated with the sigmoid activation function, which is very
common in neural network design.

 Logistic regression is another supervised learning algorithm which is
used to solve the classification problems. In classification problems, we
have dependent variables in a binary or discrete format such as 0 or 1.

 Logistic regression algorithm works with categorical variables such as
0 or 1, Yes or No, True or False, Spam or not spam, etc.

 It is a predictive analysis algorithm which works on the concept of probability.

 Logistic regression is a type of regression, but it differs from the
linear regression algorithm in terms of how it is used.

 Logistic regression uses the sigmoid function, or logistic function, which forms the
basis of its cost function. This sigmoid function is used to model the data in
logistic regression. The function can be represented as:

f(x) = 1 / (1 + e⁻ˣ)

Where f(x) = output between the 0 and 1 value,
x = input to the function,
e = base of the natural logarithm.

When we provide the input values (data) to the function, it gives an S-shaped
curve as follows. It uses the concept of threshold levels: values above the
threshold level are rounded up to 1, and values below the threshold level are
rounded down to 0.

Figure 10: Logistic (Sigmoid) Function
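As a small illustration, here is a sketch of the sigmoid and the threshold rule in Python; the weights and data are made-up values, not from the text.

import numpy as np

def sigmoid(x):
    """Logistic function: maps any real input to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

def predict(X, w, b, threshold=0.5):
    """Probability via sigmoid of the linear score, then thresholding to class 0 or 1."""
    probs = sigmoid(X @ w + b)
    return probs, (probs >= threshold).astype(int)

# Illustrative 2-feature data and parameters
X = np.array([[0.5, 1.2], [-1.0, 0.3], [2.0, -0.5]])
w = np.array([1.5, -0.8])
b = 0.1
print(predict(X, w, b))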

UNIT – II – DEEP LEARNING – SECA4002

UNIT II
Introduction

A feedforward neural network is an artificial neural network that solves many problems, including image classification,
natural language processing, and time series prediction. These networks are particularly effective for tasks involving pattern
recognition. They consist of interconnected "neurons" organized into layers, with the inputs passed through the
first layer and the output produced by the final layer. There may also be any number of hidden layers between the input
and output layers. Each neuron has associated weights and biases that are adjusted during training to optimize the network's
performance. Once trained, these networks can process new inputs and produce outputs based on their learned patterns.
Feedforward neural networks are widely used and valuable in the machine learning toolkit.

What is a Feed-forward Neural Network?

Feedforward neural networks, also known as multi-layered networks of neurons, are called "feedforward" because
information flows in one direction, from the input to the output layer, without looping back. They are composed of three types of
layers:

 Input Layer:
The input layer accepts the input data and passes it to the next layer.
 Hidden Layers:
One or more hidden layers that process and transform the input data. Each hidden layer has a set of neurons
connected to the neurons of the previous and next layers. These layers use activation functions, such as ReLU or
sigmoid, to introduce non-linearity into the network, allowing it to learn and model more complex relationships
between the inputs and outputs.
 Output Layer:
The output layer generates the final output. Depending on the type of problem, the number of neurons in the output
layer may vary. For example, in a binary classification problem, it would only have one neuron. In contrast, a multi-
class classification problem would have as many neurons as the number of classes.

The purpose of a feedforward neural network is to approximate certain functions. The input to the network is a vector of
values, x, which is passed through the network, layer by layer, and transformed into an output, y. The network's final
output predicts the target function for the given input. The network makes this prediction using a set of parameters, θ
(theta), adjusted during training to minimize the error between the network's predictions and the target function.

The training involves adjusting the θ (theta) values to minimize errors. This is done by presenting the network with a set
of input-output pairs (also called training data) and computing the error between the network's prediction and the true
output for each pair. This error is then used to compute the gradient of the error with respect to the parameters, which tells us
how to adjust the parameters to reduce the error. This is done using optimization techniques like gradient descent. Once the
training process is completed, the network has "learned" the function and can be used to predict new inputs.

Finally, the network stores this optimal value of θ (theta) in its memory, so it can use it to predict new inputs.

 I:
Input node (the starting point for data entering the neural network)
 W:
Connection weight (used to determine the strength of the connection between nodes)
 H:
Hidden node (a layer within the network that processes input)
 HA:
Activated hidden node (the value of the hidden node after passing through a predefined function)
 O:
Output node (the final output of the network, calculated as a weighted sum of the last hidden layer)
 OA:
Activated output node (the final output of the network after passing through a predefined function)
 B:
Bias node (a constant value, typically set to 1.0, used to adjust the output of the network)

Feed Forward Process in Deep Neural
Network
Now, we know how the combination of lines with different weights and biases can result in non-linear models. How
does a neural network know what weight and bias values to have in each layer? It is no different from how we did it for
the single perceptron model.

We are still making use of the gradient descent optimization algorithm, which acts to minimize the error of our model by
iteratively moving in the direction of steepest descent, the direction which updates the parameters of our model while
ensuring the minimal error. It updates the weights of every node in every single layer. We will talk more about
optimization algorithms and backpropagation later.

After training, the neural network performs recognition by dividing the data samples through some decision boundary.

"The process of receiving an input to produce some kind of output to make some kind of prediction is known as Feed
Forward." Feed Forward neural network is the core of many other important neural networks such as convolution neural
network.

In the feed-forward neural network, there are no feedback loops or connections in the network. There is simply an input
layer, a hidden layer, and an output layer.

There can be multiple hidden layers, depending on what kind of data you are dealing with. The number of hidden layers
is known as the depth of the neural network. A deeper neural network can learn more complex functions. The input layer first
provides the neural network with data, and the output layer then makes predictions on that data based on a series of
functions. The ReLU function is the most commonly used activation function in deep neural networks.

To gain a solid understanding of the feed-forward process, let's see this mathematically.

1) The first input is fed to the network, represented as a matrix [x1, x2, 1], where 1 is the bias value.

2) Each input is multiplied by a weight with respect to the first and second model to obtain its probability of being in the
positive region in each model.

So, we will multiply our inputs by a matrix of weights using matrix multiplication.

3) After that, we take the sigmoid of our scores, which gives us the probability of the point being in the positive region in
both models.

4) We multiply the probabilities which we have obtained from the previous step with the second set of weights. We always
include a bias of one whenever taking a combination of inputs.

And as we know, to obtain the probability of the point being in the positive region of this model, we take the sigmoid,
thus producing our final output in the feed-forward process.

Let us take the neural network which we had previously, with the following linear models and the hidden layer which
combined to form the non-linear model in the output layer.

So, we will use our non-linear model to produce an output that describes the probability of the point being in
the positive region. The point is represented by (2, 2). Along with the bias, we will represent the input as [2, 2, 1].

Recall the first linear model in the hidden layer and the equation that defined it,
which means that in the first layer, to obtain the linear combination, the inputs are multiplied by -4, -1 and the bias value is
multiplied by twelve.

The weights of the inputs are multiplied by -1/5, 1, and the bias is multiplied by three to obtain the linear combination of
that same point in our second model.

Now, to obtain the probability that the point is in the positive region relative to both models, we apply the sigmoid to both
scores.

The second layer contains the weights which dictate the combination of the linear models in the first layer to obtain the
non-linear model in the second layer. The weights are 1.5, 1, and a bias value of 0.5.

Now, we have to multiply our probabilities from the first layer with the second set of weights, and then we take the
sigmoid of the final score.

This is the complete math behind the feed-forward process, where the inputs from the input layer traverse the entire depth of the neural
network. In this example, there is only one hidden layer. Whether there is one hidden layer or twenty, the computational
processes are the same for all hidden layers.
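A short Python sketch of this forward pass, using the weights quoted above (first hidden unit: -4, -1 with bias 12; second hidden unit: -1/5, 1 with bias 3; output unit: 1.5, 1 with bias 0.5) and the input point (2, 2). The figures referenced in the text are assumed to use these same numbers.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input point (2, 2)
x = np.array([2.0, 2.0])

# Hidden layer: two linear models (weights quoted in the text)
W1 = np.array([[-4.0, -1.0],     # first linear model
               [-0.2,  1.0]])    # second linear model (-1/5, 1)
b1 = np.array([12.0, 3.0])

# Output layer combining the two hidden activations
W2 = np.array([1.5, 1.0])
b2 = 0.5

h = sigmoid(W1 @ x + b1)          # probabilities produced by the first layer
y = sigmoid(W2 @ h + b2)          # final output of the feed-forward pass
print(h, y)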

1.8. Gradient Descent:

Gradient Descent is a popular optimization technique in Machine Learning and Deep
Learning, and it can be used with most, if not all, of the learning algorithms. A gradient is the slope
of a function; it measures the degree of change of a variable in response to the changes of another
variable. Mathematically, gradient descent repeatedly updates a set of parameters using the partial
derivatives (the gradient) of the cost function with respect to those parameters. The greater the gradient, the steeper the slope. Starting
from an initial value, Gradient Descent is run iteratively to find the optimal values of the parameters,
i.e. the values that give the minimum possible value of the given cost function.

1.9.1. Types of Gradient Descent:
Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
1.9.2. Stochastic Gradient Descent (SGD):
The word ‘stochastic‘ means a system or a process that is linked with a random probability.
Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole
data set for each iteration. In Gradient Descent, there is a term called “batch” which denotes the
total number of samples from a dataset that is used for calculating the gradient for each iteration. In
typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the
whole dataset. Although using the whole dataset is really useful for getting to the minima in a less
noisy and less random manner, the problem arises when our datasets get big.
Suppose you have a million samples in your dataset; if you use a typical Gradient Descent
optimization technique, you will have to use all of the one million samples for completing one
iteration of Gradient Descent, and it has to be done for every iteration until the
minimum is reached. Hence, it becomes computationally very expensive to perform.

Momentum based GD:-

The problem with gradient descent is that the weight update at a moment (t) is governed by the
learning rate and gradient at that moment only. It doesn’t take into account the past steps taken
while traversing the cost space.


It leads to the following problems.

1. The gradient of the cost function at saddle points( plateau) is negligible or zero, which in
turn leads to small or no weight updates. Hence, the network becomes stagnant, and
learning stops

2. The path followed by Gradient Descent is very jittery even when operating with mini-
batch mode

Consider the below cost surface.


Let’s assume the initial weights of the network under consideration correspond to point A. With
gradient descent, the Loss function decreases rapidly along the slope AB as the gradient along this
slope is high. But as soon as it reaches point B the gradient becomes very low. The weight updates
around B are very small. Even after many iterations, the cost moves very slowly before getting stuck
at a point where the gradient eventually becomes zero.

In this case, ideally, cost should have moved to the global minima point C, but because the gradient
disappears at point B, we are stuck with a sub-optimal solution.

How can momentum fix this?

Now, Imagine you have a ball rolling from point A. The ball starts rolling down slowly and gathers
some momentum across the slope AB. When the ball reaches point B, it has accumulated enough
momentum to push itself across the plateau region B and finally following slope BC to land at the
global minima C.

How can this be used and applied to Gradient Descent?

To account for the momentum, we can use a moving average over the past gradients. In regions
where the gradient is high like AB, weight updates will be large. Thus, in a way we are
gathering momentum by taking a moving average over these gradients. But there is a problem with
this method: it considers all the gradients over iterations with equal weightage. The gradient at t = 0
has equal weightage to that of the gradient at the current iteration t. We need to use some sort of
weighted average of the past gradients such that the recent gradients are given more weightage.

This can be done by using an Exponential Moving Average (EMA). An exponential moving
average is a moving average that assigns a greater weight to the most recent values.

The EMA for a series Y may be calculated recursively as

S(t) = β · S(t−1) + (1 − β) · Y(t)
where

 The coefficient β represents the degree of weighting increase, a constant smoothing factor
between 0 and 1. A lower β discounts older observations faster.

 Y(t) is the value at a period t.

 S(t) is the value of the EMA at any period t.

In our case of a sequence of gradients, the new weight update equation at iteration t becomes

V(t) = β · V(t−1) + (1 − β) · ∇W(t)

Let's break it down.

V(t) is the new weight update done at iteration t

β is the momentum constant

∇W(t) is the gradient at iteration t

Assume the weight update at the zeroth iteration, t = 0, is zero.

Expanding the recursion, the update at iteration t is a weighted sum of all past gradients:

V(t) = (1 − β) · [∇W(t) + β · ∇W(t−1) + β² · ∇W(t−2) + …]

Think about the constant β and ignore the term (1-β) in the above equation.

Note: In many texts, you might find (1-β) replaced with η the learning rate.

what if β is 0.1?

At n = 3: the gradient at t = 3 will contribute 100% of its value, the gradient at t = 2 will contribute
10% of its value, and the gradient at t = 1 will only contribute 1% of its value.

Here the contribution from earlier gradients decreases rapidly.

what if β is 0.9?

At n = 3: the gradient at t = 3 will contribute 100% of its value, t = 2 will contribute 90% of its value,
and the gradient at t = 1 will contribute 81% of its value.

From above, we can deduce that higher β will accommodate more gradients from the past. Hence,
generally, β is kept around 0.9 in most of the cases.

Note: The actual contribution of each gradient in the weight update will be further subjected to the
learning rate.

This addresses our first point where we said when the gradient at the current moment is negligible
or zero the learning becomes zero. Using momentum with gradient descent, gradients from the
past will push the cost further to move around a saddle point.

In the cost surface shown earlier let's zoom into point C.

With gradient descent, if the learning rate is too small, the weights will be updated very slowly
hence convergence takes a lot of time even when the gradient is high. This is shown in the left side
image below. If the learning rate is too high, the cost oscillates around the minimum as shown in the right
side image below.


How does Momentum fix this?

Let's look at the last summation equation of the momentum again.

Case 1: When all the past gradients have the same sign

The summation term will become large and we will take large steps while updating the weights.
Along the curve BC, even if the learning rate is low, all the gradients along the curve will have the
same direction(sign) thus increasing the momentum and accelerating the descent.

Case 2: When some of the gradients have a +ve sign whereas others have a -ve sign

The summation term will become small and weight updates will be small. If the learning rate is
high, the gradient at each iteration around the valley C will alter its sign between +ve and -ve, and
after a few oscillations the sum of past gradients will become small. Thus, small updates are made to
the weights from there on, damping the oscillations.

This, to some extent, addresses our second problem. Gradient Descent with Momentum takes
small steps in directions where the gradients oscillate and takes large steps along the direction where
the past gradients have the same direction (same sign).

Conclusion

By adding a momentum term in the gradient descent, gradients accumulated from past iterations
will push the cost further to move around a saddle point even when the current gradient is
negligible or zero.

Even though momentum with gradient descent converges better and faster, it still doesn’t resolve
all the problems. First, the hyperparameter η (learning rate) has to be tuned manually. Second, in
some cases, even if the learning rate is low, the momentum term and the current gradient alone
can drive the update and cause oscillations.

The learning rate problem can be further addressed by using other variations of Gradient
Descent like Adaptive Gradient (AdaGrad) and RMSprop. The large-momentum problem can be
further addressed by using a variation of momentum-based gradient descent called Nesterov
Accelerated Gradient Descent.
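A minimal sketch of gradient descent with momentum in Python, following the EMA-style update described above; the quadratic toy loss and the hyperparameter values are illustrative.

import numpy as np

def grad(w):
    """Gradient of an illustrative quadratic loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def momentum_gd(w0, lr=0.1, beta=0.9, steps=50):
    w, v = w0, 0.0
    for _ in range(steps):
        g = grad(w)
        v = beta * v + (1.0 - beta) * g   # exponential moving average of past gradients
        w = w - lr * v                     # weight update driven by the accumulated momentum
    return w

print(momentum_gd(w0=-5.0))   # moves towards the minimum at w = 3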
Gradient descent with momentum
The issue discussed above can be solved by including the previous gradients in our calculation. The intuition behind this is
if we are repeatedly asked to go in a particular direction, we can take bigger steps towards that direction.
The weighted average of all the previous gradients is added to our equation, and it acts as momentum to our step.

We can understand gradient descent with momentum from the above image. As we start to descend, the momentum
increases, and even at gentle slopes where the gradient is minimal, the actual movement is large due to the added
momentum.
But this added momentum causes a different type of problem. We actually cross the minimum point and have to take a U-
turn to get to the minimum point. Momentum-based gradient descent oscillates around the minimum point, and we have to
take a lot of U-turns to reach the desired point. Despite these oscillations, momentum-based gradient descent is faster than
conventional gradient descent.
To reduce these oscillations, we can use Nesterov Accelerated Gradient.
Nesterov Accelerated Gradient (NAG)
NAG resolves this problem by adding a look ahead term in our equation. The intuition behind NAG can be summarized as
‘look before you leap’. Let us try to understand this through an example.

As we can see, in momentum-based gradient descent the steps become larger and larger due to the accumulated momentum, and
then we overshoot at the 4th step. We then have to take steps in the opposite direction to reach the minimum point.
However, the update in NAG happens in two steps. First, a partial step to reach the look-ahead point, and then the final
update. We calculate the gradient at the look-ahead point and then use it to calculate the final update. If the gradient at the
look-ahead point is negative, our final update will be smaller than that of a regular momentum-based gradient. Like in the
above example, the updates of NAG are similar to that of the momentum-based gradient for the first three steps because the
gradient at that point and the look-ahead point are positive. But at step 4, the gradient of the look-ahead point is negative.

In NAG, the first partial update 4a will be used to go to the look-ahead point and then the gradient will be calculated at that
point without updating the parameters. Since the gradient at step 4b is negative, the overall update will be smaller than the
momentum-based gradient descent.
We can see in the above example that the momentum-based gradient descent takes six steps to reach the minimum point,
while NAG takes only five steps.
This looking ahead helps NAG to converge to the minimum points in fewer steps and reduce the chances of overshooting.
How does NAG actually work?
We saw how NAG solves the problem of overshooting by ‘looking ahead’. Let us see how this is calculated and the actual
math behind it.
Update rule for gradient descent:
wt+1 = wt − η∇wt
In this equation, the weight (W) is updated in each iteration. η is the learning rate, and ∇wt is the gradient.
Update rule for momentum-based gradient descent:
In this, momentum is added to the conventional gradient descent equation. The update equation is
wt+1 = wt − updatet
updatet is calculated by:
updatet = γ · updatet−1 + η∇wt

This is how the gradient of all the previous updates is added to the current update.
Update rule for NAG:
wt+1 = wt − updatet
While calculating updatet, we will include the look-ahead gradient (∇wlook_ahead).
updatet = γ · updatet−1 + η∇wlook_ahead
∇wlook_ahead is calculated by:
wlook_ahead = wt − γ · updatet−1
This look-ahead gradient will be used in our update and will prevent overshooting.
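A small Python sketch of the NAG update rule above on the same kind of toy quadratic loss; γ, η and the starting point are made-up values.

def grad(w):
    # gradient of the illustrative loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

def nag(w0, lr=0.1, gamma=0.9, steps=50):
    w, update = w0, 0.0
    for _ in range(steps):
        w_look_ahead = w - gamma * update                    # partial step using the previous update
        update = gamma * update + lr * grad(w_look_ahead)    # gradient evaluated at the look-ahead point
        w = w - update                                       # final update
    return w

print(nag(w0=-5.0))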

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting linear classifiers and regressors under
convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though SGD has been
around in the machine learning community for a long time, it has received a considerable amount of attention just recently
in the context of large-scale learning.
SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text
classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale to
problems with more than 10^5 training examples and more than 10^5 features.

Strictly speaking, SGD is merely an optimization technique and does not correspond to a specific family of machine
learning models. It is only a way to train a model. Often, an instance of SGDClassifier or SGDRegressor will have an
equivalent estimator in the scikit-learn API, potentially using a different optimization technique. For example,
using SGDClassifier(loss='log_loss') results in logistic regression, i.e. a model equivalent to LogisticRegression which is fitted
via SGD instead of being fitted by one of the other solvers in LogisticRegression.
Similarly, SGDRegressor(loss='squared_error', penalty='l2') and Ridge solve the same optimization problem, via different means.

The advantages of Stochastic Gradient Descent are:

 Efficiency.
 Ease of implementation (lots of opportunities for code tuning).
The disadvantages of Stochastic Gradient Descent include:

 SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
 SGD is sensitive to feature scaling.
Warning

Make sure you permute (shuffle) your training data before fitting the model or use shuffle=True to shuffle after each iteration
(used by default). Also, ideally, features should be standardized using
e.g. make_pipeline(StandardScaler(), SGDClassifier()) (see Pipelines).
Classification
The class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different loss
functions and penalties for classification. Below is the decision boundary of a SGDClassifier trained with the hinge loss,
equivalent to a linear SVM.
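As a usage sketch (assuming scikit-learn is installed; the synthetic data is made up), the hinge-loss SGDClassifier mentioned above can be fitted as follows, with the standardization recommended in the warning.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic binary classification problem (illustrative data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Standardize features, then fit a linear model with SGD and the hinge loss (linear SVM)
clf = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3))
clf.fit(X, y)
print(clf.predict(X[:5]))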

SGD is a variation on gradient descent, also called batch gradient descent. As a review, gradient descent seeks to minimize an objective
function by iteratively updating each parameter by a small amount based on the negative gradient of a given data set. The steps for
performing gradient descent are as follows:

Step 1: Select a learning rate

Step 2: Select initial parameter values as the starting point

Step 3: Update all parameters from the gradient of the training data set, i.e. compute

θ := θ − η ∇J(θ)

Step 4: Repeat Step 3 until a local minimum is reached

Under batch gradient descent, the gradient, ∇J(θ), is calculated at every step against the full data set. When the training data is
large, computation may be slow or require large amounts of computer memory.[2]

Visualization of the stochastic gradient descent algorithm [6]

Stochastic Gradient Descent Algorithm

SGD modifies the batch gradient descent algorithm by calculating the gradient for only one training example at every iteration.[7] The steps
for performing SGD are as follows:

Step 1: Randomly shuffle the data set of size m

Step 2: Select a learning rate

Step 3: Select initial parameter values as the starting point

Step 4: Update all parameters from the gradient of a single training example (xᵢ, yᵢ), i.e. compute

θ := θ − η ∇J(θ; xᵢ, yᵢ)

Step 5: Repeat Step 4 until a local minimum is reached

By calculating the gradient for one training example per iteration, SGD takes a less direct route towards the local minimum. However, SGD has the
advantage of being able to incrementally update the objective function when new training data is available, at minimum cost.

Learning Rate

The learning rate is used to calculate the step size at every iteration. Too large a learning rate and the step sizes may overstep too far past the
optimum value. Too small a learning rate may require many iterations to reach a local minimum. A good starting point for the learning rate is
0.1; adjust as necessary.[8]

Mini-Batch Gradient Descent

A variation on stochastic gradient descent is the mini-batch gradient descent. In SGD, the gradient is computed on only one training example
and may result in a large number of iterations required to converge on a local minimum. Mini-batch gradient descent offers a compromise
between batch gradient descent and SGD by splitting the training data into smaller batches. The steps for performing mini-batch gradient
descent are identical to SGD with one exception - when updating the parameters from the gradient, rather than calculating the gradient of a
single training example, the gradient is calculated against a batch of b training examples, i.e. compute

θ := θ − η · (1/b) Σᵢ₌₁ᵇ ∇J(θ; xᵢ, yᵢ)

Numerical Example
Data preparation

Consider a simple 2-D data set with only 6 data points (each point has two features x₁ and x₂), and each data point has a label value y assigned to it.

Model overview

For the purpose of demonstrating the computation of the SGD process, simply employ a linear regression model ŷ = w₁x₁ + w₂x₂ + b, where w₁ and w₂ are weights
and b is the constant (bias) term. In this case, the goal of this model is to find the best values for w₁, w₂ and b, based on the dataset.

Definition of loss function

In this example, the loss function is the squared l2 norm of the error, that is L = (ŷ − y)².

Forward

Initial Weights:

The linear regression model starts by initializing the weights and setting the bias term at 0. In this case, initiate [w₁, w₂] = [-0.044, -0.042] and b = 0.

Dataset:

For this problem, the batch size is set to 1 and the entire dataset of [x₁, x₂, y] is given by:

x₁  x₂  y
4   1   2
2   8   -14
1   0   1
3   2   -1
1   4   -7
6   7   -8

Gradient Computation and Parameter Update

The purpose of backpropagation is to obtain the impact of the weights and bias terms on the entire model. The update of the model is entirely dependent
on the gradient values. To minimize the loss during the process, the model needs to ensure the gradient is descending so that it can finally
converge to a global optimal point. The 3 partial derivatives are:

∂L/∂w₁ = 2(ŷ − y)·x₁, ∂L/∂w₂ = 2(ŷ − y)·x₂, ∂L/∂b = 2(ŷ − y)

where each parameter is updated as θ := θ − η·(∂L/∂θ), η stands for the learning rate and in this model is set to 0.05. To update each parameter, simply substitute the resulting gradient value.

Use the first data point [x₁, x₂] = [4, 1] with the corresponding label y = 2. The prediction the model gives is ŷ ≈ -0.2. Now, with these ŷ and y values, update the
parameters to [w₁, w₂, b] = [0.843, 0.179, 0.222]. That marks the end of iteration 1.

Now, iteration 2 begins with the next data point [2, 8] and the label -14. The estimate ŷ is now 3.3. With the new ŷ and y values, once again,
we update the parameters to [w₁, w₂, b] = [-2.625, -13.696, -1.513]. And that marks the end of iteration 2.

Keep on updating the model through additional iterations to output [w₁, w₂, b] = [-19.021, -35.812, -1.232].

This is just a simple demonstration of the SGD process. In actual practice, more epochs can be utilized to run through the entire dataset
enough times to ensure the best learning results based on the training dataset[9]. But learning that is overly specific to the training dataset can
sometimes also expose the model to the risk of overfitting[9]. Therefore, tuning such parameters is quite tricky and often takes days or even
weeks before finding the best results.
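The iterations above can be reproduced with a short Python sketch, using the same data, initial weights [-0.044, -0.042], zero bias and learning rate 0.05 as in the example; small differences can appear from rounding.

import numpy as np

# Dataset: columns are x1, x2, y
data = np.array([[4, 1, 2], [2, 8, -14], [1, 0, 1],
                 [3, 2, -1], [1, 4, -7], [6, 7, -8]], dtype=float)

w = np.array([-0.044, -0.042])   # initial weights
b = 0.0                          # initial bias
lr = 0.05                        # learning rate

for x1, x2, y in data:           # batch size 1: one parameter update per data point
    x = np.array([x1, x2])
    y_hat = w @ x + b            # forward pass of the linear model
    err = y_hat - y
    w -= lr * 2 * err * x        # dL/dw = 2*(y_hat - y)*x
    b -= lr * 2 * err            # dL/db = 2*(y_hat - y)
    print(np.round(w, 3), round(b, 3))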

Application

SGD, often referred to as the cornerstone for deep learning, is an algorithm for training a wide range of models in machine learning. Deep
learning is a machine learning technique that teaches computers to do what comes naturally to humans. In deep learning, a computer model
learns to perform classification tasks directly from images, text, or sound. Models are trained by using a large set of labeled data and neural
network architectures that contain many layers. Neural networks make up the backbone of deep learning algorithms. A neural network that
consists of more than three layers which would be inclusive of the inputs and the output can be considered a deep learning algorithm. Due to
SGD’s efficiency in dealing with large scale datasets, it is the most common method for training deep neural networks. Furthermore, SGD
has received considerable attention and is applied to text classification and natural language processing. It is best suited for unconstrained
optimization problems and is the main way to train large linear models on very large data sets. Implementations of stochastic gradient descent
include ridge regression and regularized logistic regression. Other problems, such as the Lasso[10] and support vector machines[11], can be
solved by stochastic gradient descent.

Support Vector Machine

SGD is a simple yet very efficient approach to fitting linear classifiers and regressors under convex functions such as (linear) Support Vector
Machines (SVM). A support vector machine is a supervised machine learning model that uses classification algorithms for two-group
classification problems. An SVM finds what is known as a separating hyperplane: a hyperplane (a line, in the two-dimensional case) which
separates the two classes of points from one another. It is a fast and dependable classification algorithm that performs very well with a
limited amount of data to analyze. However, because SVM is computationally costly, software applications often do not provide sufficient
performance in order to meet time requirements for large amounts of data. To improve SVM scalability regarding the size of the data set,
SGD algorithms are used as a simplified procedure for evaluating the gradient of a function.[12]

Logistic regression

Logistic regression models the probabilities for classification problems with two possible outcomes. It's an extension of the linear regression
model for classification problems. It is a statistical technique with the input variables as continuous variables and the output variable as a
binary variable. It is a class of regression where the independent variable is used to predict the dependent variable. The objective of training a
machine learning model is to minimize the loss or error between ground truths and predictions by changing the trainable parameters. Logistic
regression has two phases: training, and testing. The system, specifically the weights w and b, is trained using stochastic gradient descent and
the cross-entropy loss.

What is Adam Optimizer?

Adam derives its name from adaptive moment estimation. This optimization algorithm is a stochastic gradient descent extension
that updates network weights during training. It is a hybrid of the “gradient descent with momentum” and the “RMSP”
algorithms.

It is an adaptive learning rate method that calculates individual learning rates for various parameters.

Adam can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on the
training data.

The Adam optimizer employs a hybrid of two gradient descent methods:

Momentum: This algorithm is used to speed up the gradient descent algorithm by considering the "exponentially weighted
average" of the gradients. Using averages causes the algorithm to converge to the minima more quickly:

mt = β·mt-1 + (1 − β)·(∂L/∂Wt)
Wt+1 = Wt − αt·mt

where

mt = Aggregate of gradients at time t [Current] (Initially, mt = 0)
mt-1 = Aggregate of gradients at time t-1 [Previous]
Wt = Weights at time t
Wt+1 = Weights at time t+1
αt = Learning rate at time t
∂L = Derivative of Loss Function
∂Wt = Derivative of weights at time t
β = Moving average parameter (Constant, 0.9)

Root Mean Square Propagation (RMSP):

RMSprop, or root mean square prop, is an adaptive learning algorithm that attempts to improve AdaGrad. It uses the
"exponential moving average" rather than the cumulative sum of squared gradients that AdaGrad uses:

Vt = β·Vt-1 + (1 − β)·(∂L/∂Wt)²
Wt+1 = Wt − (αt / √(Vt + ε))·(∂L/∂Wt)

where

Wt = Weights at time t
Wt+1 = Weights at time t+1
αt = Learning rate at time t
∂L = Derivative of Loss Function
∂Wt = Derivative of weights at time t
Vt = Sum of the square of past gradients [i.e. sum(∂L/∂Wt-1)²] (initially, Vt = 0)
β = Moving average parameter (const, 0.9)
ϵ = A small positive constant (10⁻⁸)

Adam Optimizer takes the strengths or positive characteristics of the previous two methods and builds on them to provide a more
optimized gradient descent.

In this case, we control the gradient descent rate so that there is minimal oscillation when it reaches the global minimum while
taking large enough steps (step size) to avoid the local minima hurdles along the way—as a result, combining the features of the
above methods to reach the global minimum efficiently.

Mathematical Aspect of Adam Optimizer:

Using the formulas used in the previous two methods, we get the following:

mt = β₁·mt-1 + (1 − β₁)·(∂L/∂Wt)
vt = β₂·vt-1 + (1 − β₂)·(∂L/∂Wt)²

with bias-corrected estimates

m̂t = mt / (1 − β₁ᵗ),  v̂t = vt / (1 − β₂ᵗ)

and the parameter update

Wt+1 = Wt − α·m̂t / (√v̂t + ε)

where β₁ ≈ 0.9 and β₂ ≈ 0.999 are the decay rates for the first and second moment estimates.
Implementation of Adam Optimizer

Now it’s time to put it into practice and compare the results using different optimizers on a simple neural network. Let’s
use the MNIST dataset; we will train a simple model with some basic layers, using the same batch size and epochs but
different optimizers. We will use the default values with each optimizer.
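A minimal sketch of such a comparison using tf.keras (assuming TensorFlow is installed); the layer sizes, batch size and number of epochs are illustrative choices, not prescribed by the text.

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

def build_model():
    # small model with a few basic layers
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# Same data, batch size and epochs; only the optimizer changes (default settings)
for opt in ["sgd", "rmsprop", "adam"]:
    model = build_model()
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    history = model.fit(x_train, y_train, batch_size=128, epochs=3, verbose=0)
    print(opt, history.history["accuracy"][-1])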

Advantages of Adam Optimizer

Simple to put into action

Effective in terms of computation
Memory requirements are minimal.
Appropriate for gradients that are very noisy or sparse.
Ideal for problems with a large amount of data or parameters.
Appropriate for non-stationary objectives.
Hyper-parameters have an intuitive interpretation and typically require little tuning.

Limitations of Adam Optimizer

Source: Adam Official Paper

Building on the strengths of previous models, the Adam optimizer provides significantly better performance than previous
models. It outperforms them by a wide margin in terms of providing an optimized gradient descent. The plot below clearly
shows how Adam Optimizer outperforms the rest of the optimizers in training cost (low) and performance by a significant
margin (high).
Adagrad stands for Adaptive Gradient Optimizer. There were optimizers like Gradient Descent, Stochastic
Gradient Descent and mini-batch SGD, all of which were used to reduce the loss function with respect to the weights. The
weight updating formula is as follows:

w(new) = w(old) − η·(∂L/∂w(old))

Based on iterations, this formula can be written as:

w(t) = w(t-1) − η·(∂L/∂w(t-1))

where

w(t) = value of w at current iteration, w(t-1) = value of w at previous iteration and η = learning rate.

In SGD and mini-batch SGD, the value of η used to be the same for each weight, or say for each parameter.
Typically, η = 0.01. But in the Adagrad optimizer the core idea is that each weight has a different learning rate
(η). This modification has great importance: in real-world datasets, some features are sparse (for example,
in Bag of Words most of the features are zero, so it’s sparse) and some are dense (most of the features will be
non-zero), so keeping the same value of learning rate for all the weights is not good for optimization. The
weight updating formula for Adagrad looks like:

w(t) = w(t-1) − η′t·(∂L/∂w(t-1)), with η′t = η / √(α(t) + ε) and α(t) = Σ (∂L/∂w)² accumulated over all past iterations

Where η′t denotes the different learning rate for each weight at each iteration and α(t) is the accumulated sum of squared gradients.

Here, η is a constant number and epsilon is a small positive number included to avoid a divide-by-zero error
when α(t) is 0; without it, the update would be undefined at the first iteration, where α(0) = 0.
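A minimal sketch of the Adagrad update in Python; the toy gradient function and hyperparameters are illustrative.

import numpy as np

def grad(w):
    # gradient of an illustrative quadratic loss L(w) = sum((w - 3)^2)
    return 2.0 * (w - 3.0)

def adagrad(w0, eta=0.5, eps=1e-8, steps=200):
    w = np.array(w0, dtype=float)
    alpha = np.zeros_like(w)                      # running sum of squared gradients, per weight
    for _ in range(steps):
        g = grad(w)
        alpha += g ** 2                           # accumulate squared gradient
        w -= (eta / np.sqrt(alpha + eps)) * g     # per-weight adaptive learning rate
    return w

print(adagrad([-5.0, 10.0]))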

What is RMSProp?
For optimizing the training of neural networks, RMSprop relies on gradients; this idea has its roots in
backpropagation.

As data travels through very complicated functions, such as neural networks, the resulting gradients often
vanish or explode. RMSprop is an innovative stochastic mini-batch learning method.

RMSprop (Root Mean Squared Propagation) is an optimization algorithm used in deep learning and
other Machine Learning techniques.

It is a variant of the gradient descent algorithm that helps to improve the convergence speed and stability of the
model training process.

RMSProp algorithm
Like other gradient descent algorithms, RMSprop works by calculating the gradient of the loss function with
respect to the model’s parameters and updating the parameters in the opposite direction of the gradient to
minimize the loss. However, RMSProp introduces a few additional techniques to improve the performance of
the optimization process.

One key feature is its use of a moving average of the squared gradients to scale the learning rate for each
parameter. This helps to stabilize the learning process and prevent oscillations in the optimization trajectory.

The algorithm can be summarized by the following RMSProp formula:

v_t = decay_rate * v_{t-1} + (1 - decay_rate) * gradient^2


parameter = parameter - learning_rate * gradient / (sqrt(v_t) + epsilon)
Where:

v_t is the moving average of the squared gradients;

decay_rate is a hyperparameter that controls the decay rate of the moving average;

learning_rate is a hyperparameter that controls the step size of the update;

gradient is the gradient of the loss function with respect to the parameter; and

epsilon is a small constant added to the denominator to prevent division by zero.
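The formula above can be turned into a small Python sketch; the toy gradient and hyperparameter values are illustrative.

import numpy as np

def grad(w):
    # gradient of an illustrative quadratic loss L(w) = sum((w - 3)^2)
    return 2.0 * (w - 3.0)

def rmsprop(w0, learning_rate=0.05, decay_rate=0.9, epsilon=1e-8, steps=500):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)                               # moving average of squared gradients
    for _ in range(steps):
        g = grad(w)
        v = decay_rate * v + (1 - decay_rate) * g**2   # v_t from the formula above
        w = w - learning_rate * g / (np.sqrt(v) + epsilon)
    return w

print(rmsprop([-5.0, 10.0]))   # both weights move towards the minimum at 3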

Adam vs RMSProp
RMSProp is often compared to the Adam (Adaptive Moment Estimation) optimization algorithm, another
popular optimization method for deep learning. Both algorithms combine elements of momentum and adaptive
learning rates to improve the optimization process, but Adam uses a slightly different approach to compute the
moving averages and adjust the learning rates. Adam is generally more popular and widely used than the
RMSProp optimizer, but both algorithms can be effective in different settings.

RMSProp advantages

Fast convergence. RMSprop is known for its fast convergence speed, which means that it can find good
solutions to optimization problems in fewer iterations than some other algorithms. This can be especially
useful for training large or complex models, where training time is a critical concern.

Stable learning. The use of a moving average of the squared gradients in RMSprop helps to stabilize the
learning process and prevent oscillations in the optimization trajectory. This can make the optimization
process more robust and less prone to diverging or getting stuck in local minima.

Fewer hyperparameters. RMSprop has fewer hyperparameters than some other optimization algorithms, which
makes it easier to tune and use in practice. The main hyperparameters in RMSprop are the learning rate and the
decay rate, which can be chosen using techniques like grid search or random search.

Good performance on non-convex problems. RMSprop tends to perform well on non-convex optimization
problems, common in Machine Learning and deep learning. Non-convex optimization problems have multiple
local minima, and RMSprop’s fast convergence speed and stable learning can help it find good solutions even
in these cases.

Overall, RMSprop is a powerful and widely used optimization algorithm that can be effective for training a
variety of Machine Learning models, especially deep learning models.

(∂L/∂w)², the squared derivative of the loss with respect to the weight, will always be positive since it is a square term, which
means that α(t) will also remain positive; this implies that α(t) ≥ α(t-1).

It can be seen from the formula that α(t) and η′t are inversely proportional to one another; this implies that

as α(t) increases, η′t will decrease. This means that as the number of iterations increases, the
learning rate reduces adaptively, so there is no need to manually select the learning rate.
Advantages of Adagrad:
7
 No manual tuning of the learning rate required.
 Faster convergence
 More reliable
One main disadvantage of Adagrad optimizer is that alpha(t) can become large as the number of iterations will

increase and due to this will decrease at the larger rate. This will make the old weight almost equal to the
new weight which may lead to slow convergence.
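A rough Python sketch of the Adagrad rule described above (the function name adagrad_update, the toy loss and the
default values are assumptions made only for illustration, not taken from any particular library):

import numpy as np

def adagrad_update(weight, gradient, alpha_t, eta=0.01, epsilon=1e-8):
    # alpha(t) = alpha(t-1) + (dL/dW)^2 : running sum of squared gradients
    alpha_t = alpha_t + gradient ** 2
    # effective learning rate eta / sqrt(alpha(t) + epsilon) shrinks as alpha(t) grows
    weight = weight - eta * gradient / (np.sqrt(alpha_t) + epsilon)
    return weight, alpha_t

# example: alpha_t keeps growing, so later steps become smaller and smaller
weight = np.array([1.0])
alpha_t = np.zeros_like(weight)
for _ in range(5):
    gradient = 2.0 * (weight - 3.0)          # gradient of the toy loss (w - 3)^2
    weight, alpha_t = adagrad_update(weight, gradient, alpha_t)

Because alpha_t only ever grows, the effective step keeps shrinking, which is exactly the slow-convergence issue
noted above.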
RMSProp
Introduction
Optimization is a mathematical technique that determines the best solution. Optimization algorithms in deep learning
(especially in neural networks) minimize an objective function like a loss function, which calculates the difference between
the predicted data and the expected values.
Optimization algorithms generate better results by updating the model parameters such as the Weight(W) and bias(b). The
most important of them is Gradient Descent. The objective of Gradient Descent is to reach the global minima where the
cost function attains the least possible value. Gradient descent is an algorithm that helps calculate the loss function's
gradient to navigate the search space.
Another is Stochastic Gradient Descent, where a single fixed learning rate is used for all parameters throughout
training. This fixed rate can lead to slow convergence, which motivates adaptive methods like RMSprop, where each
parameter effectively gets its own, time-varying learning rate. With a well-chosen learning rate, the algorithm is
ensured to converge to a minimum of the error function even if the input samples are not linearly separable.
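For comparison, here is a minimal sketch of plain gradient descent with a fixed learning rate (the quadratic toy loss
is chosen only for illustration):

# toy loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3)
w = 0.0
learning_rate = 0.1
for step in range(100):
    gradient = 2.0 * (w - 3.0)
    w = w - learning_rate * gradient      # the same fixed step size is used on every iteration
# w ends up close to the minimum at w = 3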
Reason Behind Using RMSProp
Firstly, let us look at the contour plot of a cost function:

Figure: Contour plot of the cost function (point A marks the starting point; the red point marks the global minimum)
At the start of the training of our model, our cost will be quite high (point A). From there, we calculate the negative
gradient and take a step in the direction it specifies, repeating until we reach the global minimum (red point).
When we start gradient descent from point A, one iteration of gradient descent may take us to point B. Then
another gradient descent step may end up at point C. So, it can take many steps to move slowly towards the minimum.
But what is the reason for such a haphazard motion? It is the large number of local optima that arise in high
dimensions (in reality, the cost function depends on many weights, which increases the dimension).
When optimizing parameters in many dimensions, there are a lot of local optima, so it is easier for Gradient Descent to
get stuck in a local optimum than to move to the global optimum.

Therefore the algorithm keeps moving from one local optimum to another, which delays reaching the global optimum and
slows down Gradient Descent.
To overcome this, we want slower learning in the vertical direction and faster learning in the horizontal direction. So
here is what we can do: use RMSProp. RMSProp uses the concept of the Exponentially Weighted Average (EWA) of the
gradients.
RMSProp
RMSProp stands for Root Mean Square Propagation. RMSprop is an optimization technique that we use in training neural
networks. RMSProp was first proposed by the father of back-propagation, Geoffrey Hinton.
The gradients of complex functions like neural networks tend to explode or vanish as the data propagates through the
function (known as the vanishing gradients and exploding gradients problems). RMSProp was developed as a stochastic
technique for mini-batch learning.
RMSprop deals with the above issue by using a moving average of squared gradients to normalize the gradient. This
normalization balances the step size (momentum), decreasing the step for large gradients to avoid exploding and increasing
the step for small gradients to avoid vanishing.
In simple terms, RMSprop uses an adaptive learning rate instead of a single fixed learning rate, which means that the
effective learning rate varies over time. RMSprop uses the same concept of the exponentially weighted average of the
gradient as gradient descent with momentum, but it differs in how the parameters are updated.

Parameter Updation
The recursive formula of the EWA is given by:

Vt = β * V(t-1) + (1 - β) * θt

Where,
Vt: moving average value at step t; β: smoothing factor; θt: current value (here, the gradient) at step t.
In RMSProp, we find the EWA of the squared gradients and update our parameters using those EWAs. On each iteration t,
we calculate dW and db on the current minibatch. Then we calculate Vdw and Vdb using the following formulae:

Vdw = β * Vdw + (1 - β) * dW^2
Vdb = β * Vdb + (1 - β) * db^2

We will update our parameters after calculating the exponentially weighted averages. In plain gradient descent the
updates are
W = W - learning rate * dW
b = b - learning rate * db.
Substituting the scaled gradients into these equations, we get the RMSProp updates:

W = W - learning rate * dW / (sqrt(Vdw) + epsilon)
b = b - learning rate * db / (sqrt(Vdb) + epsilon)
We try to diminish the vertical movement using the average because the vertical components of successive gradients
point in opposite directions and roughly cancel out (sum to approximately 0) when averaged.
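A small sketch of this balancing effect, assuming a single weight W and bias b with a constant small horizontal
gradient dw and a much larger vertical gradient db (all numbers are purely illustrative):

import math

beta, learning_rate, epsilon = 0.9, 0.001, 1e-8
W, b = 1.0, 1.0
Vdw, Vdb = 0.0, 0.0
for t in range(10):
    dw, db = 0.1, 2.0                      # small horizontal gradient, large vertical gradient
    Vdw = beta * Vdw + (1 - beta) * dw ** 2
    Vdb = beta * Vdb + (1 - beta) * db ** 2
    W = W - learning_rate * dw / (math.sqrt(Vdw) + epsilon)
    b = b - learning_rate * db / (math.sqrt(Vdb) + epsilon)
    # dw / sqrt(Vdw) and db / sqrt(Vdb) have the same magnitude here, so the 20x larger
    # vertical gradient no longer produces a 20x larger (oscillation-prone) step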
Ideal Values of Alpha and Beta
The suggested value of β is 0.9, which has been observed experimentally. If we keep β less than 0.9, then the
fluctuation in the values of Vdw and Vdb will be very high. On the other hand, keeping the value of β higher than 0.9
will not give a proper value for the average.
The suggested value of alpha is 0.001. Keeping alpha less than 0.001 will make the learning slower, whereas keeping it
higher than 0.001 will increase the chance of overshooting the minima.
Working Of RMSProp
We start gradient descent from point A. After one iteration of gradient descent, we may end up at point B. Then another
gradient descent step may end up at point C. After some iterations, we step towards the local optimum with oscillations
up and down. Using a higher learning rate would make the vertical oscillations larger, slowing our gradient descent, so
we are prevented from using a higher learning rate.
The bias plays a role in determining the vertical oscillations, whereas the weight defines the movement in the
horizontal direction. If we slow down the update process for the bias, then the vertical oscillations can be dampened,
and if we update the weights with higher values, we can still move quickly towards the local optimum.
We need to slow down the learning in the vertical direction and speed up, or at least not slow down, the learning in
the horizontal direction, i.e., in the horizontal direction we want to move fast, and in the vertical direction we want
to slow down or damp out the oscillations.
The derivative db is much larger in the vertical direction than dw is in the horizontal direction. Because of this,
Vdw will be small and Vdb will be large, and hence the updates are slowed down in the vertical direction while
remaining fast in the horizontal direction.
The net effect of RMSProp is that the movement in the vertical direction is reduced, so the minimum point is reached
quickly and the learning of the parameters is much faster than in the earlier cases.

UNIT II INTRODUCTION TO DEEP LEARNING

History of Deep Learning - A Probabilistic Theory of Deep Learning - Backpropagation and
regularization, batch normalization - VC Dimension and Neural Nets - Deep Vs Shallow Networks -
Convolutional Networks - Generative Adversarial Networks (GAN), Semi-supervised Learning

2 History of Deep Learning [DL]:

 The chain rule that underlies the back-propagation algorithm was invented in the
seventeenth century (Leibniz, 1676; L’Hôpital, 1696)
 Beginning in the 1940s, the function approximation techniques were used to motivate
machine learning models such as the perceptron
 The earliest models were based on linear models. Critics including Marvin Minsky
pointed out several of the flaws of the linear model family, such as its inability to learn
the XOR function, which led to a backlash against the entire neural network approach
 Efficient applications of the chain rule based on dynamic programming began to appear
in the 1960s and 1970s
 Werbos (1981) proposed applying chain rule techniques for training artificial neural
networks. The idea was finally developed in practice after being independently
rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a)
 Following the success of back-propagation, neural network research gained popularity
and reached a peak in the early 1990s. Afterwards, other machine learning techniques
became more popular until the modern deep learning renaissance that began in 2006
 The core ideas behind modern feedforward networks have not changed substantially
since the 1980s. The same back-propagation algorithm and the same approaches to
gradient descent are still in use.
Most of the improvement in neural network performance from 1986 to 2015 can be
attributed to two factors. First, larger datasets have reduced the degree to which statistical
generalization is a challenge for neural networks. Second, neural networks have become
much larger, because of more powerful computers and better software infrastructure. A
small number of algorithmic changes have also improved the performance of neural
networks noticeably. One of these algorithmic changes was the replacement of mean
squared error with the cross-entropy family of loss functions. Mean squared error was
popular in the 1980s and 1990s but was gradually replaced by cross-entropy losses and the
principle of maximum likelihood as ideas spread between the statistics community and the
machine learning community.

The other major algorithmic change that has greatly improved the performance of
feedforward networks was the replacement of sigmoid hidden units with piecewise linear
hidden units, such as rectified linear units. Rectification using the max{0, z} function was
introduced in early neural network models and dates back at least as far as the Cognitron
and Neo-Cognitron (Fukushima, 1975, 1980).

For small datasets, Jarrett et al. (2009) observed that using rectifying nonlinearities
is even more important than learning the weights of the hidden layers. Random weights
are
sufficient to propagate useful information through a rectified linear network, enabling the
classifier layer at the top to learn how to map different feature vectors to class identities.
When more data is available, learning begins to extract enough useful knowledge to exceed
the performance of randomly chosen parameters. Glorot et al. (2011a) showed that learning
is far easier in deep rectified linear networks than in deep networks that have curvature or
two-sided saturation in their activation functions.

When the modern resurgence of deep learning began in 2006, feedforward networks
continued to have a bad reputation. From about 2006 to 2012, it was widely believed that
feedforward networks would not perform well unless they were assisted by other models,
such as probabilistic models. It is now known that, with the right resources and
engineering practices, feedforward networks perform very well. Today, gradient-based
learning in feedforward networks is used as a tool to develop probabilistic models.
Feedforward networks continue to have unfulfilled potential. In the future, we expect they
will be applied to many more tasks, and that advances in optimization algorithms and model
design will improve their performance even further.

2.1 A Probabilistic Theory of Deep Learning

Probability is the science of quantifying uncertain things. Most machine learning and
deep learning systems utilize a lot of data to learn about patterns in the data. Whenever data is
utilized in a system rather than pure logic, uncertainty grows, and whenever uncertainty
grows, probability becomes relevant.

By introducing probability to a deep learning system, we introduce common sense to the
system. In deep learning, several models like Bayesian models, probabilistic graphical models,
and Hidden Markov models are used. They depend entirely on probability concepts.

Real world data is chaotic. Since deep learning systems utilize real world data, they
require a tool to handle this chaotic nature.

2.2 Back Propagation Networks (BPN)

2.2.1. Need for Multilayer Networks

 Single layer networks cannot be used to solve linearly inseparable problems;
they can only be used to solve linearly separable problems
 Single layer networks cannot solve complex problems
 Single layer networks cannot be used when a large input-output data set is
available
 Single layer networks cannot capture the complex information available in
the training pairs

Hence to overcome the above said Limitations we use Multi-Layer Networks.

2.2.2. Multi-Layer Networks

 Any neural network which has at least one layer in between the input and
output layers is called a Multi-Layer Network
 Layers present in between the input and output layers are called Hidden Layers
 Input layer neural units just collect the inputs and forward them to the next
higher layer
 Hidden layer and output layer neural units process the information fed to
them and produce an appropriate output
 Multi-layer networks provide optimal solutions for arbitrary classification
problems
 Multi-layer networks use linear discriminants, where the inputs are non-linear

2.2.3. Back Propagation Networks (BPN)

Introduced by Rumelhart, Hinton, & Williams in 1986. BPN is a Multi-layer
Feedforward Network in which the error is back propagated, hence the name
Back Propagation Network (BPN). It uses a Supervised Training process; it
has a systematic procedure for training the network and is used in Error
Detection and Correction. The Generalized Delta Law / Continuous Perceptron
Law / Gradient Descent Law is used in this network. The Generalized Delta
rule minimizes the mean squared error between the calculated output and the
target output. The Delta law has a faster convergence rate when compared
with the Perceptron Law. It is the extended version of the Perceptron
Training Law. The limitation of this law is the local minima problem, due
to which the convergence speed reduces, but it is still better than the
perceptron's. Figure 1 represents a BPN network architecture. Even though
multi-level perceptrons can be used, they are not as flexible and efficient
as BPN. In Figure 1 the weights between the input and the hidden layer are
denoted as Wij and the weights between the first hidden layer and the next
layer are denoted as Vjk. This network is valid only for differentiable
output (activation) functions. The training process used in backpropagation
involves three stages, which are listed below:

1. Feedforward of input training pair

2. Calculation and backpropagation of associated error

3. Adjustments of weights

Figure 1: Back Propagation Network

2.2.4. BPN Algorithm

The algorithm for BPN is classified into four major steps as follows:

1. Initialization of Bias, Weights

2. Feedforward process

3. Back Propagation of Errors

4. Updating of weights & biases

Algorithm:

I. Initialization of weights:
Step 1: Initialize the weights to small random values near zero
Step 2: While the stop condition is false, do steps 3 to 10
Step 3: For each training pair, do steps 4 to 9
II. Feed forward of inputs
Step 4: Each input xi is received and forwarded to the next higher (hidden)
layer
Step 5: Hidden unit sums its weighted inputs as follows
Zinj = Woj + Σxiwij
Applying Activation function
Zj = f(Zinj)
This value is passed to the output layer
Step 6: Output unit sums its weighted inputs
Yink = Vok + Σ ZjVjk
Applying Activation function
Yk = f(Yink)

III. Backpropagation of Errors

Step 7: Error term for each output unit:
δk = (tk – Yk) f'(Yink)

Step 8: Error term for each hidden unit:
δinj = Σk δkVjk
δj = δinj f'(Zinj)

IV. Updating of Weights & Biases
Step 9: Weight and bias corrections:
ΔVjk = αδkZj and ΔVok = αδk (hidden-to-output layer)
Δwij = αδjxi and Δwoj = αδj (input-to-hidden layer)
New weights and biases:
Wij(new) = Wij(old) + Δwij
Vjk(new) = Vjk(old) + ΔVjk
Woj(new) = Woj(old) + Δwoj
Vok(new) = Vok(old) + ΔVok

Step 10: Test for Stop Condition
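To make these ten steps concrete, here is a minimal Python/NumPy sketch of a single training pass for a tiny BPN
with one hidden layer and sigmoid activations (the layer sizes, the sigmoid choice, the toy training pair and the
variable names are assumptions made only for illustration):

import numpy as np

def f(x):                                  # sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):                            # derivative of the sigmoid
    s = f(x)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 3, 1
alpha = 0.5                                # learning rate

# Step 1: initialise weights and biases to small random values near zero
W = rng.normal(0.0, 0.1, (n_in, n_hidden))     # input-to-hidden weights Wij
Woj = np.zeros(n_hidden)                       # hidden biases
V = rng.normal(0.0, 0.1, (n_hidden, n_out))    # hidden-to-output weights Vjk
Vok = np.zeros(n_out)                          # output biases

x = np.array([0.0, 1.0])                   # one training input
t = np.array([1.0])                        # its target output

# Steps 4-6: feedforward
Zin = Woj + x @ W                          # Zinj = Woj + sum_i xi*Wij
Z = f(Zin)                                 # hidden activations
Yin = Vok + Z @ V                          # Yink = Vok + sum_j Zj*Vjk
Y = f(Yin)                                 # network output

# Steps 7-8: backpropagation of the error
delta_k = (t - Y) * f_prime(Yin)           # output error term
delta_j = (V @ delta_k) * f_prime(Zin)     # hidden error term

# Step 9: weight and bias corrections, then update
V = V + alpha * np.outer(Z, delta_k)       # delta_Vjk = alpha * delta_k * Zj
Vok = Vok + alpha * delta_k
W = W + alpha * np.outer(x, delta_j)       # delta_Wij = alpha * delta_j * xi
Woj = Woj + alpha * delta_j

# Steps 2, 3 and 10: this pass is repeated for every training pair until the stop condition is met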

2.2.5 Merits
• Has a smooth effect on weight correction
• Computing time is less if the weights are small
• 100 times faster than the perceptron model
• Has a systematic weight updating procedure

2.2.6. Demerits
• Learning phase requires intensive calculations
• Selection of number of Hidden layer neurons is an issue
• Selection of number of Hidden layers is also an issue
• Network gets trapped in Local Minima
• Temporal Instability
• Network Paralysis

• Training time is more for Complex problems
