Unit I
Syllabus
Historical trends in deep learning – Machine Learning basics, Learning algorithms – Supervised
and Unsupervised Training, Linear Algebra for machine learning, Testing - Cross Validation,
Dimensionality Reduction, Overfitting/Underfitting, Hyperparameters and validation sets,
Estimators – Bias – Variance, Loss Function – Regularization, Biological Neuron – Idea of
Computational units, McCulloch-Pitts units and Thresholding logic, Linear Perceptron, Perceptron
Learning Algorithm, Convergence theorem for Perceptron Learning Algorithm, Linear Separability
Multilayer perceptron – The first example of a network with Keras code, Backpropagation
⚫ Suppose we have a dataset of different types of shapes, which includes squares, rectangles,
triangles, and polygons. Now the first step is to train the model on each shape:
⚫ If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
⚫ If the given shape has three sides, then it will be labelled as a triangle.
⚫ If the given shape has six equal sides, then it will be labelled as a hexagon.
⚫ Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.
⚫ The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of the number of sides and predicts the output.
Examples of supervised learning
• you get a bunch of photos labelled with what is on them, and you train a model to
recognize new photos
• predicting stock market price
• classifying whether an email is spam or not
• predicting house/property price
• predicting whether a patient has a disease or not
• Face detection and recognition
3. Cross-Validation
Cross-validation is a technique in which we train our model using a subset of the dataset and
then evaluate it using the complementary subset of the dataset.
The three steps involved in cross-validation are as follows:
1. Reserve some portion of the sample dataset.
2. Train the model using the rest of the dataset.
3. Test the model using the reserved portion of the dataset.
Leave-One-Out Cross-Validation (LOOCV)
In this method, we perform training on the whole dataset but leave out only one data point at a
time, and we iterate this for each data point. It has some advantages as well as disadvantages.
An advantage of this method is that we make use of all data points, and hence it has low bias.
The major drawback is that it leads to higher variation in testing the model, as we are testing
against a single data point; if that data point is an outlier, it can lead to higher variation.
Another drawback is that it takes a lot of execution time, as it iterates 'the number of data points'
times.
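As a sketch of how LOOCV might be run in practice, the snippet below uses scikit-learn; the iris dataset and logistic-regression model are illustrative assumptions, not part of these notes.
# Illustrative LOOCV sketch (scikit-learn); dataset and model are assumptions
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()                              # one held-out point per iteration
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=loo)    # trains n times on n-1 points each
print('iterations:', len(scores))                # equals the number of data points
print('mean accuracy:', scores.mean())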
Goodness of Fit
Ideally, a model that makes predictions with zero error is said to have a good fit on the data. This
situation is achievable at a spot between overfitting and underfitting. In order to understand it, we
have to look at the performance of our model over time, while it is learning from the training
dataset.
With the passage of time, our model keeps learning, and thus its error on the training and testing
data keeps decreasing. If it learns for too long, the model becomes more prone to overfitting due
to the presence of noise and less useful details, and its performance will decrease. In order to get
a good fit, we stop at a point just before the testing error starts increasing. At this point, the model
is said to have good skill on the training dataset as well as on our unseen testing dataset.
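In Keras, this stopping point is commonly found with an EarlyStopping callback that watches the validation error; the model, X_train and y_train names below are placeholders assumed to exist, not definitions from these notes.
# Illustrative early-stopping sketch; `model`, `X_train`, `y_train` are assumed to exist
from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss',          # error on held-out data
                           patience=5,                  # allow a few worse epochs
                           restore_best_weights=True)   # roll back to the best epoch (newer Keras versions)
history = model.fit(X_train, y_train,
                    validation_split=0.2,               # hold out data to detect overfitting
                    epochs=200,
                    callbacks=[early_stop])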
6. Hyperparameters
A hyperparameter is a parameter that is set before the learning process begins. These parameters
are tunable and can directly affect how well a model trains. Some examples of hyperparameters
in machine learning (a short Keras sketch follows this list):
1. Learning Rate
2. Number of Epochs
3. Momentum
4. Regularization constant
5. Number of branches in a decision tree
6. Number of clusters in a clustering algorithm (like k-means)
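As a sketch, the snippet below sets a few of these hyperparameters (learning rate, momentum, number of epochs) for a small Keras model; the layer sizes, optimizer and values are illustrative assumptions.
# Illustrative sketch: hyperparameters are chosen before training starts
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
learning_rate = 0.01      # hyperparameter: learning rate
momentum = 0.9            # hyperparameter: momentum
epochs = 50               # hyperparameter: number of epochs
model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# (older Keras versions name the first argument lr instead of learning_rate)
model.compile(optimizer=SGD(learning_rate=learning_rate, momentum=momentum),
              loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(X, y, epochs=epochs)   # X and y are assumed to be loaded elsewhere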
7. Loss Function
Loss functions measure how far an estimated value is from its true value. A loss function maps
decisions to their associated costs. Loss functions are not fixed; they change depending on the task
at hand and the goal to be met.
For example, the L1 (mean absolute error) and L2 (mean squared error) losses for regression are
MAE = (1/n) Σ |yi - ŷi|      and      MSE = (1/n) Σ (yi - ŷi)²
Where yi is the true value, ŷi is the predicted value and n is the total number of data points in the
dataset.
Huber Loss
A comparison between L1 and L2 loss yields the following results:
1. L1 loss is more robust than its counterpart.
On taking a closer look at the formulas, one can observe that if the difference between the predicted
and the actual value is high, L2 loss magnifies the effect when compared to L1. Since L2 succumbs
to outliers, L1 loss function is the more robust loss function.
2. L1 loss is less stable than L2 loss.
Since L1 loss deals with the difference in distances, a small horizontal change can lead to the
regression line jumping a large amount. Such an effect taking place across multiple iterations would
lead to a significant change in the slope between iterations.
On the other hand, MSE ensures the regression line moves only slightly for a small adjustment in a
data point.
Huber Loss combines the robustness of L1 with the stability of L2, essentially the best of L1 and
L2 losses. For huge errors, it is linear and for small errors, it is quadratic in nature.
Huber Loss is characterized by the parameter delta (𝛿). For a prediction f(x) of the data point y,
with the characterizing parameter 𝛿, Huber Loss is formulated as:
L𝛿(y, f(x)) = ½ (y - f(x))²           if |y - f(x)| ≤ 𝛿
L𝛿(y, f(x)) = 𝛿 |y - f(x)| - ½ 𝛿²     otherwise
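A small NumPy sketch of this piecewise definition, purely for illustration:
# Illustrative Huber loss: quadratic for small errors, linear for large ones
import numpy as np
def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                          # quadratic branch
    linear = delta * np.abs(error) - 0.5 * delta ** 2   # linear branch
    return np.mean(np.where(small, squared, linear))
print(huber_loss(np.array([1.0, 2.0, 10.0]), np.array([1.1, 2.2, 3.0])))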
Binary Cross Entropy Loss
Binary cross entropy compares each predicted probability with the true class label (0 or 1):
BCE = -(1/n) Σ [ yi log(hθ(xi)) + (1 - yi) log(1 - hθ(xi)) ]
Where yi is the true label and hθ(xi) is the predicted value post hypothesis.
Since binary classification means the classes take either 0 or 1, if yi = 0 the first term vanishes,
and if yi = 1 the (1 - yi) term becomes 0.
Categorical Cross Entropy Loss
Categorical Cross Entropy loss is essentially Binary Cross Entropy Loss expanded to multiple
classes. One requirement when categorical cross entropy loss function is used is that the labels
should be one-hot encoded.
This way, only one element will be non-zero, as the other elements in the vector are multiplied
by zero. This property pairs naturally with an activation function called softmax, which converts
the network's outputs into a probability distribution over the classes.
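A minimal NumPy sketch of categorical cross entropy with a one-hot label and a softmax output; the logits used here are made-up illustrative values.
# Illustrative categorical cross entropy with one-hot labels and softmax
import numpy as np
def softmax(logits):
    exp = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return exp / exp.sum()
def categorical_cross_entropy(y_one_hot, probs):
    # multiplication by the one-hot vector keeps only the true-class term
    return -np.sum(y_one_hot * np.log(probs + 1e-12))
y_true = np.array([0, 1, 0])                  # one-hot label for class 1
y_pred = softmax(np.array([1.0, 3.0, 0.2]))   # predicted class probabilities
print(categorical_cross_entropy(y_true, y_pred))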
Hinge Loss
Another commonly used loss function for classification is the hinge loss. Hinge loss is primarily
developed for support vector machines for calculating the maximum margin from the hyperplane
to the classes.
Loss functions penalize wrong predictions and do not do so for the right predictions. So, the score
of the target label should be greater than the score of each incorrect label by a margin of (at
least) one.
This margin is the maximum margin from the hyperplane to the data points, which is why hinge
loss is preferred for SVMs.
[Figure: a hyperplane separating two classes with the maximum margin]
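A short NumPy sketch of the hinge loss for a binary classifier, assuming labels in {-1, +1} and raw (signed) scores; these conventions are an assumption for illustration.
# Illustrative hinge loss; labels are in {-1, +1}, scores are raw classifier outputs
import numpy as np
def hinge_loss(y_true, scores):
    # no penalty once the correct side's score clears the margin of 1
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))
print(hinge_loss(np.array([1, -1, 1]), np.array([2.0, -0.5, 0.3])))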
8. Regularization
Regularization refers to techniques used to reduce error by fitting a function appropriately on the
given training set and to avoid overfitting.
Consider a training dataset comprising independent variables X = (x1, x2, …, xn) and the
corresponding target variables t = (t1, t2, …, tn). The values of X are drawn uniformly at random
from [0, 1]. The target dataset t is obtained by substituting the values of X into the function
sin(2πx) and then adding some Gaussian noise to it.
Now, our goal is to find patterns in this underlying dataset and generalize them to predict the
corresponding target value for some new value of x. The problem here is that our target dataset is
corrupted by random noise, so it will be difficult to find the underlying function sin(2πx) in the
training data. So, how do we solve it?
We fit a polynomial of degree M to the data (Eq 1.1):
y(x, w) = w0 + w1 x + w2 x² + … + wM x^M
and measure the misfit with the sum-of-squares error function
E(w) = ½ Σn ( y(xn, w) - tn )²
In order to minimize the error, calculus is used. The derivative of E(w) is equated to 0 to obtain the
value of w that gives the minimum value of the error function. E(w) is quadratic in the coefficients
w, so its derivative is linear in w and hence has a single solution. Let that be denoted by w*.
So now we can find the best value of w, but the issue is: what degree of polynomial (given in Eq
1.1) should we choose? Any degree of polynomial can be fit to the training data, but how do we
decide the best choice with minimum complexity?
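One standard way out is to keep a flexible (high-degree) polynomial but add a penalty on the size of the weights to E(w), which is what L2 (ridge) regularization does. The sketch below tries this with scikit-learn; the degree, penalty strength and noise level are illustrative assumptions.
# Illustrative sketch: fit sin(2*pi*x) + noise with a degree-9 polynomial and an L2 penalty
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 1))                         # inputs drawn uniformly from [0, 1]
t = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 20)  # targets with Gaussian noise
# alpha is the regularization constant: larger alpha shrinks the weights more
model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=1e-3))
model.fit(X, t)
print(model.predict(np.array([[0.25], [0.5]])))             # roughly sin(2*pi*x) at those points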
9. Biological Neurons
A biological neuron is an unusual-looking cell mostly found in animal cerebral cortexes (e.g., your brain), composed
of a cell body containing the nucleus and most of the cell’s complex components, and many
branching extensions called dendrites, plus one very long extension called the axon. The axon’s
length may be just a few times longer than the cell body, or up to tens of thousands of times longer.
Near its extremity the axon splits off into many branches called telodendria, and at the tip of these
branches are minuscule structures called synaptic terminals (or simply synapses), which are
connected to the dendrites (or directly to the cell body) of other neurons. Biological neurons receive
short electrical impulses called signals from other neurons via these synapses. When a neuron
receives a sufficient number of signals from other neurons within a few milliseconds, it fires its
own signals.
Thus, individual biological neurons seem to behave in a rather simple way, but they are organized
in a vast network of billions of neurons, each neuron typically connected to thousands of other
neurons. Highly complex computations can be performed by a vast network of fairly simple
neurons, much like a complex anthill can emerge from the combined efforts of simple ants. In the
architecture of biological neural networks (BNNs), neurons appear to be organized in consecutive
layers.
10. McCulloch-Pitts Neuron
The first computational model of a neuron was proposed by Warren McCulloch (neuroscientist)
and Walter Pitts (logician) in 1943. It can be divided into two parts: the first part, g, takes the
inputs and performs an aggregation, and based on the aggregated value the second part, f, makes a
decision.
Let's suppose that I want to predict my own decision, whether to watch a random football game
on TV or not. The inputs are all boolean, i.e., {0, 1}, and my output variable is also boolean
{1: Will watch it, 0: Won't watch it}. Some of the inputs could be:
2. x_2 could be is It A Friendly Game (I tend to care less about the friendlies)
3. x_3 could be is Not Home (Can’t watch it when I’m running errands. Can I?)
4. x_4 could be is Man United Playing (I am a big Man United fan. GGMU!) and so on.
These inputs can either be excitatory or inhibitory. Inhibitory inputs are those that have a decisive
effect on the decision making irrespective of the other inputs: if x_3 is 1 (not home), then my output
will always be 0, i.e., the neuron will never fire, so x_3 is an inhibitory input. Excitatory inputs are
NOT the ones that will make the neuron fire on their own, but they might cause it to fire when
combined together. The neuron fires (outputs 1) only when no inhibitory input is on and the sum of
the excitatory inputs reaches a fixed threshold.
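A tiny Python sketch of such a thresholding unit with an inhibitory input; the threshold value and the example inputs are illustrative assumptions.
# Illustrative McCulloch-Pitts unit with thresholding logic
def mp_neuron(excitatory, inhibitory, threshold):
    # any active inhibitory input (e.g. x_3 = "not home") forces the output to 0
    if any(inhibitory):
        return 0
    # otherwise fire only if enough excitatory inputs are on
    return 1 if sum(excitatory) >= threshold else 0
print(mp_neuron(excitatory=[1, 0, 1], inhibitory=[0], threshold=2))  # fires -> 1
print(mp_neuron(excitatory=[1, 0, 1], inhibitory=[1], threshold=2))  # inhibited -> 0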
11. Perceptron
The perceptron model is a more general computational model than the McCulloch-Pitts neuron. It
takes an input, aggregates it (as a weighted sum) and returns 1 only if the aggregated sum is more
than some threshold, else it returns 0. Rewriting the threshold as a constant input with its own
learnable weight (the bias), the perceptron outputs 1 when w·x + b > 0 and 0 otherwise.
A single perceptron can only be used to implement linearly separable functions. It takes both real
and boolean inputs and associates a set of weights to them, along with a bias (the threshold thing I
mentioned above). We learn the weights, we get the function. Let's use a perceptron to learn an OR
function.
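A minimal sketch of the perceptron learning rule applied to the OR function; the learning rate and number of epochs are illustrative choices.
# Illustrative perceptron learning of the OR function
import numpy as np
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # OR truth table inputs
y = np.array([0, 1, 1, 1])                       # OR targets
w, b, lr = np.zeros(2), 0.0, 0.1                 # weights, bias, learning rate
for epoch in range(10):
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0
        # perceptron rule: update only when the prediction is wrong
        w += lr * (target - pred) * xi
        b += lr * (target - pred)
print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])   # expected: [0, 1, 1, 1]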
Multilayer perceptron
A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). The
term MLP is used ambiguously, sometimes loosely to mean any feedforward ANN, sometimes
strictly to refer to networks composed of multiple layers of perceptrons (with threshold activation).
Multilayer perceptrons are sometimes colloquially referred to as "vanilla"
neural networks, especially when they have a single hidden layer.
An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer.
Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP
utilizes a supervised learning technique called backpropagation for training. Its multiple layers and
non-linear activation distinguish MLP from a linear perceptron. It can distinguish data that is
not linearly separable.
a) Steps
The steps to create a first neural network model in Keras are as follows:
1. Load Data.
2. Define Keras Model.
3. Compile Keras Model.
4. Fit Keras Model.
5. Evaluate Keras Model.
6. Tie It All Together.
7. Make Predictions
b) Requirements
1. You have Python 2 or 3 installed and configured.
2. You have SciPy (including NumPy) installed and configured.
3. You have Keras and a backend (Theano or TensorFlow) installed and configured.
Load Data
The first step is to define the functions and classes we intend to use. We will use the NumPy
library to load our dataset, and we will use two classes from the Keras library (Sequential and
Dense) to define our model.
The imports required are listed below.
# first neural network with keras
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense
# load the dataset
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
# split into input (X) and output (y) variables
X = dataset[:,0:8]
y = dataset[:,8]
# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
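The remaining steps from the list above (compile, fit, evaluate, predict) might look like the sketch below; the choice of optimizer, the epoch count and the batch size are illustrative, not prescribed by these notes.
# compile the keras model (typical choices for a binary classification problem)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X, y, epochs=150, batch_size=10)
# evaluate the keras model on the same data
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy * 100))
# make class predictions (probabilities thresholded at 0.5)
predictions = (model.predict(X) > 0.5).astype(int)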
14. Backpropagation
The error computed at the output is propagated backward through the network, and the resulting
calculations are used to give nodes with high error rates less weight than nodes with lower error
rates.
Backpropagation uses the chain rule of calculus to work out these adjustments: after each forward
pass through a network, the algorithm performs a backward pass to adjust the model's weights.
An important goal of backpropagation is to give data scientists insight into how changing a weight
function will change loss functions and the overall behaviour of the neural network. The term is
sometimes used as a synonym for "error correction."
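A compact NumPy sketch of forward and backward passes through a tiny network (trained on XOR) shows the chain rule in action; the architecture, learning rate and iteration count are illustrative assumptions.
# Illustrative backpropagation on a tiny 2-4-1 network learning XOR
import numpy as np
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
for step in range(10000):
    h = sigmoid(X @ W1 + b1)              # forward pass: hidden layer
    out = sigmoid(h @ W2 + b2)            # forward pass: output
    # backward pass: chain rule applied layer by layer (squared-error loss)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent updates with learning rate 0.5
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0, keepdims=True)
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2).ravel())  # typically close to [0, 1, 1, 0]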