Unit 1


What is Deep Learning?

• Deep learning is a subset of machine learning that is based entirely on artificial neural networks.
• Like neural networks, deep learning mimics the human brain.
• In deep learning, we don’t need to explicitly program everything.
• A neural network with multiple layers is known as a Deep Neural Network (DNN).
• These networks are inspired by the structure and function of the human brain.
• They are designed to learn from large amounts of data in an unsupervised or semi-supervised manner.
1
Introduction to Simple DNN
• Deep learning models automatically learn features from the data.
• A deep learning algorithm learns multiple levels of abstraction in the data.
• Hence it is applied to image recognition, speech recognition and natural language processing.
• The most widely used architectures in deep learning are
• Feedforward Neural Networks (FNNs)
• Convolutional Neural Networks (CNNs)
• Recurrent Neural Networks (RNNs)
• It includes statistics and predictive modeling.
• It is extremely beneficial in analyzing huge volumes of data.

2
..contd..
• Why does deep learning outperform other machine learning (ML) approaches for vision, speech and language?

5
Benefits
• The scalability of neural networks means that results get better with more data and larger models, which in turn require more computation to train.
• Another benefit of deep learning models is their ability to perform automatic feature extraction from raw data, also called feature learning.

6
Difference between ML and DL

7
How Deep Learning Works?
• Each algorithm applies a nonlinear transformation to its input and uses
what it learns to create a statistical model as output.
• Iterations continue until the output has reached an acceptable level of
accuracy.
• The number of processing layers through which data must pass is what
inspired the label deep.

9
Working

10
Use Cases / Applications of DL
• Customer experience - DL models are used in chatbots, and as the technology matures it continues to improve customer experience and increase customer satisfaction.
• Automatic text generation - Machines are taught the grammar and style of a piece of text and then use this model to automatically create completely new text matching the proper spelling, grammar and style of the original text.
• Aerospace and military - DL is used to detect objects from satellites that identify areas of interest, as well as safe or unsafe zones for troops.
• Industrial automation - DL is improving worker safety by providing services that automatically detect when a worker or object is getting too close to a machine.
11
..contd..
• Adding color - Color can be added to black-and-white photos and videos using deep learning models.
• Medical research / Healthcare - DL is used in diagnosing various diseases and treating them, e.g. to automatically detect cancer cells.
• Computer vision - DL provides computers with extreme accuracy for object detection and for image classification, restoration and segmentation.
• Automatic Machine Translation - Words, sentences or phrases in one language are transformed into another language.
• Image Recognition - Recognizes and identifies people and objects in images.
• Predicting Earthquakes - Teaches a computer to perform the viscoelastic computations used in predicting earthquakes.

12
Advantages & Disadvantages
Advantages:
• Best-in-class performance on many problems.
• Reduces the need for feature engineering.
• Eliminates unnecessary costs.
• Easily identifies defects that are difficult to detect.
Disadvantages:
• Large amount of data required.
• Computationally expensive to train.
• No strong theoretical foundation.

13
Platforms for Deep Learning
Some of the most popular open-source deep learning platforms are:
• TensorFlow
• DL4J – Deep Learning for Java
• Theano
• Torch
• Microsoft Cognitive Toolkit (CNTK)
• Caffe & Caffe2
• Apache MXNet
• Keras

14
..contd..
Refer to these links for more platforms:
• https://en.wikipedia.org/wiki/Comparison_of_deep-learning_software
• https://www.predictiveanalyticstoday.com/deep-learning-software-libraries/

15
Software Libraries
• Tools used: Anaconda, Jupyter, PyCharm, etc.
• Languages used: R, Python, MATLAB, C++, Java, Julia, Lisp, JavaScript, etc.

16
Biological vs Artificial Neuron

Biological Neural Network      Artificial Neural Network
Dendrites                      Inputs
Cell Nucleus                   Nodes
Synapse                        Weights
Axon                           Output

17
Feed Forward Neural Network
• It is an artificial neural network in which the connections between nodes do not form a cycle.
• The opposite of a feedforward neural network is a recurrent neural network, in which certain pathways form cycles.
• Information is processed in only one direction. Though data may pass through multiple hidden nodes, it always moves in one direction and never backwards.
18
Working
• A series of inputs enter the layer and are multiplied by their respective weights.
• The weighted input values are then added together to get their sum.
• The weighted sum is passed through an activation function, which is a non-linear function.
• If the sum is above a specific threshold, the value produced is often 1, whereas if it falls below the threshold, the output value is often -1.
• Using the delta rule, the neural network can compare the outputs of its nodes with the intended values, allowing the network to adjust its weights through training in order to produce more accurate output values.

19
Layers of Feed Forward Neural Network
• Input layer - Neurons of this layer receive the input and pass it on to the other layers of the network. The number of features or attributes in the dataset must match the number of neurons in the input layer.
• Output layer - Depending on the type of model being built, this layer represents the forecasted feature.
• Hidden layer - There may be several hidden layers. The neurons in the hidden layers transform the input before passing it to the next layer. The weights of this network are constantly updated to make prediction easier.
• Neuron weights - Weights measure the strength or magnitude of a connection between neurons. A weight normally lies between 0 and 1.

20
..contd..
• Neurons - A neural network consists of artificial neurons. Neurons first compute a weighted sum of their inputs and then pass that sum through an activation function to normalize it. Neurons weight each of their inputs.
• Activation functions - They can be either linear or nonlinear. The activation function determines whether a neuron makes a linear or nonlinear decision. They fall into three major categories: Sigmoid, Tanh, and Rectified Linear Unit (ReLU).
• Sigmoid - Input values are mapped to output values between 0 and 1.
• Tanh - Input values are mapped to output values between -1 and 1.
• Rectified Linear Unit - Only positive values are allowed to flow through this function; negative values get mapped to 0.
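To make these layers and activation functions concrete, here is a minimal sketch of such a network using Keras (one of the platforms listed earlier); the layer sizes and activation choices below are illustrative assumptions, not taken from the slides:

from tensorflow import keras

# a small feedforward network: input layer, one hidden layer, one output neuron
model = keras.Sequential([
    keras.layers.Input(shape=(4,)),               # input layer: 4 features
    keras.layers.Dense(8, activation="tanh"),     # hidden layer with tanh activation
    keras.layers.Dense(1, activation="sigmoid")   # output layer for a binary prediction
])
model.summary()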

21
Mathematical Model
• An artificial neuron takes a vector of input features x_1, x_2, . . . , x_n, and each of them is multiplied by a specific weight w_1, w_2, . . . , w_n.
• The weighted inputs are summed together and a constant value called the bias (b) is added to them to produce the net input of the neuron:
z = w_1·x_1 + w_2·x_2 + . . . + w_n·x_n + b
• The net input is then passed through an activation function g to produce the output a = g(z), which is then transmitted to other neurons.
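A minimal NumPy sketch of this computation, assuming a sigmoid activation for g (the names and values are illustrative):

import numpy as np

def g(z):
    # activation function (sigmoid assumed here)
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, 0.2, 0.1])    # input features x1, x2, x3
w = np.array([0.4, 0.3, 0.9])    # weights w1, w2, w3
b = 0.1                          # bias

z = np.dot(w, x) + b             # net input: weighted sum plus bias
a = g(z)                         # neuron output a = g(z)
print(a)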

22
Architecture of FFN
• It consists of a large number of simple neuron-like processing units, organized in layers.
• Every unit in a layer is connected with all the units in the previous layer. These connections are not all equal; each connection may have a different strength or weight.
• The weights on these connections encode the knowledge of the network.
• Units in a neural network are also called nodes.

23
..contd..

24
Algorithm of FFN
Randomly choose the initial weights.
While the error is too large:
• For each training pattern (presented in random order)
  Apply the inputs to the network
  Calculate the output of every neuron from the input layer, through the hidden layer(s), to the output layer
  Calculate the error at the outputs
  Use the output error to compute error signals for the pre-output layers
  Use the error signals to compute weight adjustments
  Apply the weight adjustments
• Periodically evaluate the network performance

25
FFNN
The operation of this network can be divided into two phases:
• The learning phase
• The classification phase

26
FFNN Functions

27
Cost Function
• In an FFNN, the cost function plays an important role. Correctly categorized data points should be only slightly affected by minor adjustments to the weights and biases.
• Thus, a smooth cost function can be used to determine a method of adjusting weights and biases to improve performance. The mean squared error cost function is defined as follows:
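One common way to write this cost function (standard notation; w denotes the weights, b the biases, n the number of training inputs, y(x) the desired output and a the network's output for input x):

C(w, b) = (1/2n) · Σ_x ‖ y(x) − a ‖²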

28
Loss Function
• The loss function of a neural network is used to determine whether an adjustment needs to be made in the learning process.
• The number of neurons in the output layer equals the number of classes. The loss expresses the difference between the predicted and actual probability distributions. The cross-entropy loss for binary classification is given below.
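In standard notation, with y the actual label (0 or 1) and ŷ the predicted probability, the binary cross-entropy loss is commonly written as:

L = − [ y · log(ŷ) + (1 − y) · log(1 − ŷ) ]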

29
The Perceptron – Structure and Properties
A perceptron has the following components:
• Input nodes – pass the input to the network
• Output node
• An activation function – shapes the output into a more useful form; a sigmoidal activation function, for example, fits the value into the range 0 to 1.
• Weights and biases – weights are initialized to some random value and get updated during the training process. The bias is a weight independent of any input node.
• Error function

30
..contd..
Evaluation:
• Compute the dot product of the input and weight vector
• Add the bias
• Apply the activation function.

31
..contd..
Classification:
• A datapoint’s evaluation is expressed by the relation wX + b .
• Threshold (θ) is defined to classify the data. It is usually set to 0 for a perceptron.
• So points for which wX + b is greater than or equal to 0 will belong to one class while
the rest belong to another class.
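A minimal NumPy sketch of evaluation followed by threshold classification with θ = 0 (the function name and example values are illustrative):

import numpy as np

def perceptron_classify(x, w, b, theta=0.0):
    z = np.dot(w, x) + b               # wX + b: dot product plus bias
    return 1 if z >= theta else 0      # class 1 if above the threshold, else class 0

x = np.array([1.0, 0.0])
w = np.array([0.5, -0.3])
b = -0.2
print(perceptron_classify(x, w, b))    # prints 1, since 0.5 - 0.2 = 0.3 >= 0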

32
..contd..
Training Algorithm:
• We start the training algorithm by calculating the gradient, Δw.
• It is the product of:
• the value of the input node corresponding to that weight
• the difference between the actual value and the computed value
• New weights are obtained by incrementing the original weights by the computed gradients multiplied by the learning rate, as shown below.
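Written out, the update is Δw_i = x_i · (actual − computed) and w_i(new) = w_i(old) + α · Δw_i, where α is the learning rate. A minimal NumPy training sketch under these assumptions (names and default values are illustrative):

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    w = np.random.rand(X.shape[1])     # weights initialized to random values
    b = np.random.rand()               # bias: a weight independent of any input node
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            computed = 1 if np.dot(w, xi) + b >= 0 else 0   # evaluate and threshold
            delta = yi - computed                           # actual value minus computed value
            w += lr * delta * xi                            # gradient = input * difference
            b += lr * delta
    return w, b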

33
Learning XOR
• XOR problem – we try to train a model to mimic an XOR function.
• The XOR function is defined as follows:
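Its truth table on two binary inputs A and B is:

A   B   A XOR B
0   0   0
0   1   1
1   0   1
1   1   0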

34
..contd..
• A perceptron can only converge on linearly separable data; therefore, it isn’t capable of imitating the XOR function.
• A single perceptron will not make the classes linearly separable.
• Non-linearity allows for more complex decision boundaries.

35
Solution 1
• First break down the XOR function into its AND and OR counterparts.
• The XOR function on two boolean variables A and B is defined as:

36
..contd..
• Let’s replace A and B with x_1 and x_2 respectively.
• The XOR function can be condensed into two parts: a NAND and an OR.
• We combine their results using an AND gate, as shown below.
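• In Boolean form: x_1 XOR x_2 = (x_1 OR x_2) AND NOT(x_1 AND x_2), i.e. XOR = AND( OR(x_1, x_2), NAND(x_1, x_2) ).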

37
..contd..
• Representation of OR and NAND gate:

38
..contd..
• We perform a logical AND on the outputs of the two logic gates, with both functions being passed the same inputs (x1 and x2).
• Let’s model this in our network.

39
..contd..
• AND of OR and NAND models is:

40
..contd..
• XOR and XNOR gates are not linearly separable.
• Most problems can’t be split into simple intermediate problems that can be individually solved and then combined. For something like this:

41
Solution 2 – Multi Layer Perceptron (MLP)
• An MLP has hidden layers; here the MLP is restricted to a single hidden layer.
• The hidden layer allows for non-linearity.
• A node in the hidden layer isn’t very different from an output node: nodes in the previous layer connect to it with their own weights and biases, and an output is computed, generally with an activation function.

42
..contd..
• Backpropagation is an algorithm for updating the weights and biases of a model based on their gradients with respect to the error function, starting from the output layer and working back to the first layer.
• The main principle behind it is that each parameter changes in proportion to how much it affects the network’s output.
• The method of updating the weights follows directly from differentiation and the chain rule.
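• In symbols, for each weight w (and similarly for each bias), with error function E and learning rate α, this is the standard gradient descent step: w(new) = w(old) − α · ∂E/∂w.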
43
..contd..
• The architecture is determined by the number of hidden layers, the number of nodes in each layer and how these nodes are inter-connected.

44
..contd..
• The architecture is determined by the number of hidden layers, the number of nodes in each layer and how these nodes are inter-connected.
1. Adding more layers or nodes gives increasingly complex decision boundaries. But this can lead to overfitting, where a model achieves very high accuracy on the training data but fails to generalize.
2. Choosing a loss function for the MLP. The mean squared loss function makes assumptions about the data and isn’t always convex for a classification problem.
45
Neural Network for XOR

46
Learning Algorithm
• Initialize the weights and biases randomly.
• Iterate over the data:
i. Compute the predicted output using the sigmoid function
ii. Compute the loss using the squared error loss function
iii. W(new) = W(old) − α·ΔW
iv. B(new) = B(old) − α·ΔB
• Repeat until the error is minimal

47
Implementation
• Define the XOR inputs and expected outputs:

import numpy as np

inputs = np.array([[0,0],[0,1],[1,0],[1,1]])
expected_output = np.array([[0],[1],[1],[0]])

• To initialize the weights and biases with random values:
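One possible initialization (a sketch: the sizes follow the 2-2-1 XOR network and the variable names match the code on the following slides; the learning rate lr is an assumed value):

lr = 0.1    # learning rate
inputLayerNeurons, hiddenLayerNeurons, outputLayerNeurons = 2, 2, 1

# weights and biases initialized with random values
hidden_weights = np.random.uniform(size=(inputLayerNeurons, hiddenLayerNeurons))
hidden_bias = np.random.uniform(size=(1, hiddenLayerNeurons))
output_weights = np.random.uniform(size=(hiddenLayerNeurons, outputLayerNeurons))
output_bias = np.random.uniform(size=(1, outputLayerNeurons))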

48
def sigmoid(x):
    return 1/(1 + np.exp(-x))

# Forward propagation
hidden_layer_activation = np.dot(inputs, hidden_weights)
hidden_layer_activation += hidden_bias
hidden_layer_output = sigmoid(hidden_layer_activation)

output_layer_activation = np.dot(hidden_layer_output, output_weights)
output_layer_activation += output_bias
predicted_output = sigmoid(output_layer_activation)

49
def sigmoid_derivative(x):
    return x * (1 - x)

# Backpropagation
error = expected_output - predicted_output
d_predicted_output = error * sigmoid_derivative(predicted_output)

error_hidden_layer = d_predicted_output.dot(output_weights.T)
d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_output)

# Updating weights and biases
output_weights += hidden_layer_output.T.dot(d_predicted_output) * lr
output_bias += np.sum(d_predicted_output, axis=0, keepdims=True) * lr
hidden_weights += inputs.T.dot(d_hidden_layer) * lr
hidden_bias += np.sum(d_hidden_layer, axis=0, keepdims=True) * lr

50
Example problem

51
52
Activation Functions
• A neural network is made of interconnected neurons. Each neuron is characterized by its weight, bias and activation function.
• Input is fed to the input layer, and the neurons perform a linear transformation on this input using the weights and biases: y = w·X + b
• An activation function is applied to the above result.
• Finally, the output from the activation function moves to the next hidden layer and the same process is repeated. This forward movement of information is known as forward propagation.
• If the output generated is far away from the actual value, the error is calculated from the output.
• Based on the error value, the weights and biases of the neurons are updated. This process is known as back-propagation.

53
Types of Activation Functions
1. Binary Step Function
• If the input to the activation function is greater than a threshold, then the neuron is activated, else it is deactivated.
i.e.: f(x) = 1, x >= 0
      f(x) = 0, x < 0
• Implementation:

def binary_step(x):
    if x < 0:
        return 0
    else:
        return 1

binary_step(5), binary_step(-1)
Output: (1, 0)
54
..contd..
• This function can be used as an activation
function while creating a binary classifier.
• Limitation: This function will not be useful
when there are multiple classes in the target
variable.
• Gradient of the binary step function.
Derivative of f(x) with respect to x is 0 
f’(x) = 0, for all x
• Gradients calculate the weights and biases
and since the gradient of the function is zero,
the weights and biases don’t update.

55
..contd..
• This function can be used as an activation function while creating a binary classifier.
• Limitation: This function will not be useful when there are multiple classes in the
target variable.
• Gradient of the binary step function. Derivative of f(x) with respect to x is 0 
f’(x) = 0, for all x
• Gradients calculate the weights and biases and since the gradient of the function is
zero, the weights and biases don’t update.

56
• Linear Function
• The activation is proportional to the input: f(x) = a·x
• Implementation (with a = 4):

def linear_function(x):
    return 4 * x

• Ex: linear_function(4), linear_function(-2)
• Output: (16, -8)

57
• Gradient of Linear Function
• For the gradient we differentiate the function with respect to x:
f'(x) = a
• Though the gradient is not zero, it is a constant that does not depend on the input value x at all.
• Hence the updating factor would be the same constant for every input.
• The neural network will not really improve the error, since the gradient is the same for every iteration.

58
• Sigmoid Function
• The most widely used non-linear activation function.
• Sigmoid transforms the values into the range 0 to 1.
• Mathematical expression: f(x) = 1/(1 + e^-x)
• Implementation:

import numpy as np

def sigmoid_function(x):
    z = 1/(1 + np.exp(-x))
    return z

• Ex: sigmoid_function(7), sigmoid_function(-22)
• Output: (0.9990889488055994, 2.7894680920908113e-10)

59
• Gradient of Sigmoid Function
• For the gradient we differentiate the function with respect to x:
f'(x) = sigmoid(x)·(1 - sigmoid(x))
• The gradient values are significant in the range -3 to 3, but the graph gets much flatter in other regions.
• For values greater than 3 or less than -3, the gradients are very small.

60
• Tanh Function
• Similar to the sigmoid activation function, but symmetric around the origin.
• The range of values is from -1 to +1.
• Mathematical expression: tanh(x) = 2·sigmoid(2x) - 1
• Implementation:

def tanh_function(x):
    z = (2/(1 + np.exp(-2*x))) - 1
    return z

• Ex: tanh_function(-1)
• Output: -0.761594155955

61
• Gradient of Tanh Function
• For the gradient we differentiate the function with respect to x:
f'(x) = 1 - tanh²(x)
• The gradient values are steeper compared to the sigmoid function.
• Tanh is preferred over the sigmoid function since it is zero-centered and the gradients are not restricted to move in a certain direction.

62
• ReLU Function
• ReLU stands for Rectified Linear Unit.
• Its main advantage is that it does not activate all the neurons at the same time.
• Mathematical expression: f(x) = x, x >= 0
                                = 0, x < 0
• Implementation:

def relu_function(x):
    if x < 0:
        return 0
    else:
        return x

• Ex: relu_function(-1)
• Output: 0
63
• Gradient of ReLU Function
• For the gradient we differentiate the function with respect to x:
f'(x) = 1, x >= 0
      = 0, x < 0
• For negative inputs the gradient is zero. Because of this, during the backpropagation process the weights and biases of some neurons are not updated.
• This can create dead neurons which never get activated.
• This is taken care of by the ‘Leaky’ ReLU function.

64
• Leaky ReLU Function
• An improved version of the ReLU function.
• Instead of 0 for negative values, an extremely small multiple of x is used, to avoid deactivating the neuron.
• Mathematical expression: f(x) = 0.01x, x < 0
                                = x, x >= 0
• Implementation:

def leaky_relu_function(x):
    if x < 0:
        return 0.01*x
    else:
        return x

• Ex: leaky_relu_function(-1)
• Output: -0.01
65
• Gradient of Leaky ReLU Function
• For the gradient we differentiate the function with respect to x:
f'(x) = 1, x >= 0
      = a, x < 0   (here a = 0.01)
66
• Parameterised ReLU Function
• Introduced to solve the problem of the gradient becoming zero for the left half of the axis.
• In the case of a parameterised ReLU function, ‘a‘ is also a trainable parameter.
• Mathematical expression: f(x) = a·x, x < 0
                                = x, x >= 0
• Implementation:

def parameterised_relu_function(x, a):
    if x < 0:
        return a*x
    else:
        return x

67
• Gradient of Parameterised ReLU Function
• For the gradient we differentiate the function with respect to x:
f'(x) = 1, x >= 0
      = a, x < 0

68
• Exponential Linear Unit (ELU)
• Unlike the leaky ReLU and parameterised ReLU functions, instead of a straight line ELU uses a log curve for defining the negative values.
• Mathematical expression: f(x) = x, x >= 0
                                = a·(e^x - 1), x < 0
• Implementation:

def elu_function(x, a):
    if x < 0:
        return a*(np.exp(x)-1)
    else:
        return x

69
• Gradient of ELU Function
• For the gradient we differentiate the function with respect to x:
f'(x) = 1, x >= 0
      = a·e^x, x < 0

70
