UNIT-II
Supervised Learning(Regression/Classification)
Basic Methods: Distance based Methods, Nearest Neighbours, Decision Trees, Naive Bayes,
Linear Models: Linear Regression, Logistic Regression, Generalized Linear Models, Support Vector Machines
……………………………………………………………………………………………………………………………..
Euclidean Distance
It is the most commonly used measure of distance; in most cases, when people talk about distance, they mean Euclidean distance. Euclidean distance is also known simply as distance. When the data is dense or continuous, this is the best proximity measure. The Euclidean distance between two points is the length of the straight line connecting them, and it is given by the Pythagorean theorem:
Euclidean distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)
Manhattan Distance
If you want to find the Manhattan distance between two points (x1, y1) and (x2, y2), it is computed as:
Manhattan distance = |x2 - x1| + |y2 - y1|
Diagrammatically, it corresponds to traversing the path from point A to point B by moving only along horizontal and vertical directions (the grid-like path shown in the figure), rather than along the straight line joining them.
The generalized form of the Euclidean and Manhattan distances is the Minkowski distance. The Minkowski distance of order p between two points x = (x1, ..., xn) and y = (y1, ..., yn) is
Minkowski distance = (|x1 - y1|^p + |x2 - y2|^p + ... + |xn - yn|^p)^(1/p)
When the order p is 1, this formula gives the Manhattan distance, and when p is 2, it gives the Euclidean distance.
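As a quick illustration (a minimal sketch with made-up points), the three distances can be computed with NumPy as follows:
import numpy as np

a = np.array([2.0, 3.0])   # hypothetical point (x1, y1)
b = np.array([5.0, 7.0])   # hypothetical point (x2, y2)

euclidean = np.sqrt(np.sum((a - b) ** 2))   # p = 2  -> 5.0
manhattan = np.sum(np.abs(a - b))           # p = 1  -> 7.0

def minkowski(u, v, p):
    # general Minkowski distance of order p
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

print(euclidean, manhattan, minkowski(a, b, 3))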
2. Nearest Neighbours
The abbreviation KNN stands for “K-Nearest Neighbour”. It is a supervised machine learning algorithm. The
algorithm can be used to solve both classification and regression problem statements.
The number of nearest neighbours to a new unknown variable that has to be predicted or classified is
denoted by the symbol ‘K’.
KNN calculates the distance from all points in the proximity of the unknown data and filters out the ones
with the shortest distances to it. As a result, it’s often referred to as a distance-based algorithm.
In order to correctly classify the results, we must first determine the value of K (Number of Nearest
Neighbours).
When the value of K is set to even, a situation may arise in which the elements from both groups are equal.
In the diagram below, elements from both groups are equal in the internal “Red” circle (k == 4).
In this condition, the model would be unable to do the correct classification for you. Here the model will
randomly assign any of the two classes to this new unknown data.
Choosing an odd value for K is preferred because such a tie between the two classes can then never occur: one of the two groups will always be in the majority.
(Figure source: https://images.app.goo.gl/Q8ZKxQ8mhP68yxqn7)
• Larger K value: Underfitting occurs when the value of K is made too large. In this case, the model cannot learn the patterns in the training data well.
• Smaller K value: Overfitting occurs when the value of K is too small. The model then captures all of the training data, including noise, and performs poorly on the test data.
(Figure source: https://images.app.goo.gl/vXStNS4NeEqUCDXn8)
How does KNN work for ‘Classification’ and ‘Regression’ problem statements?
Classification
When the problem statement is of ‘classification’ type, KNN tends to use the concept of “Majority Voting”.
Within the given range of K values, the class with the most votes is chosen.
Consider the following diagram, in which a circle is drawn within the radius of the five closest neighbours.
Four of the five neighbours in this neighbourhood voted for ‘RED,’ while one voted for ‘WHITE.’ It will be
classified as a ‘RED’ wine based on the majority votes.
(Figure source: https://images.app.goo.gl/Ud42nZn8Q8FpDVcs5)
Real-world example:
Several parties compete in an election in a democratic country like India. Parties compete for voter support
during election campaigns. The public votes for the candidate with whom they feel more connected.
When the votes for all of the candidates have been recorded, the candidate with the most votes is declared
as the election’s winner.
Regression
KNN employs a mean/average method for predicting the value of new data. Based on the value of K, it
would consider all of the nearest neighbours.
Once the algorithm has identified all the nearest neighbours within the given value of K, it calculates the mean of their values.
Consider the diagram below, where the value of k is set to 3. It will now calculate the mean (52) based on
the values of these neighbours (50, 55, and 51) and allocate this value to the unknown data.
(Figure source: https://images.app.goo.gl/pzW97weL6vHJByni8)
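As a small illustration (a minimal sketch with made-up one-dimensional data), scikit-learn's KNeighborsClassifier and KNeighborsRegressor implement exactly these two behaviours:
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical training data
X_train = [[50], [55], [51], [20], [22], [25]]
y_class = ['red', 'red', 'red', 'white', 'white', 'white']
y_value = [50, 55, 51, 20, 22, 25]

# Classification: majority vote among the K nearest neighbours
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_class)
print(clf.predict([[52]]))      # -> ['red']

# Regression: mean of the K nearest neighbours' values
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_value)
print(reg.predict([[52]]))      # -> [52.]  (mean of 50, 55 and 51)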
• Imbalanced dataset
When dealing with an imbalanced data set, the model will become biased. Consider the example shown in
the diagram below, where the “Yes” class is more prominent.
As a consequence, the bulk of the closest neighbours to this new point will be from the dominant class.
Because of this, we must balance our dataset using either an up-sampling (oversampling) or down-sampling (undersampling) strategy.
(Figure source: https://images.app.goo.gl/1XkGHtn16nXDkrTL7)
• Outliers
Outliers are the points that differ significantly from the rest of the data points.
The outliers will impact the classification/prediction of the model. The appropriate class for the new data
point, according to the following diagram, should be “Category B” in green.
The model, however, would be unable to have the appropriate classification due to the existence of outliers.
As a result, removing outliers before using KNN is recommended.
(Figure source: https://images.app.goo.gl/K35WtKYCTnGBDLW36)
• Scaling of data
Every data value has two components:
1) Magnitude
2) Unit
For instance, if we say 20 years, then "20" is the magnitude and "years" is its unit.
Since it is a distance-dependent algorithm, KNN selects the neighbours in the closest vicinity based solely on
the magnitude of the data. Have a look at the diagram below; the data is not scaled, so it can not find the
closest neighbours correctly. As a consequence, the outcome will be influenced.
(Figure source: https://images.app.goo.gl/M1oenLdEo427VBGc7)
The data values in the previous figure have now been scaled down to the same level in the following
example. Based on the scaled distance, all of the closest neighbours would be accurately identified.
(Figure source: https://images.app.goo.gl/CtdoNXq5hPVvynre9)
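A short sketch of this point (the feature values below are made up; StandardScaler and make_pipeline are standard scikit-learn utilities):
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

# Features on very different scales: age in years, income in rupees (hypothetical)
X = [[25, 300000], [45, 900000], [30, 350000], [50, 1000000]]
y = [0, 1, 0, 1]

# Scaling first brings both features to the same level before distances are computed
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)
print(knn.predict([[28, 320000]]))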
3. Decision Trees
Like SVMs, Decision Trees are versatile Machine Learning algorithms that can perform both classification and
regression tasks, and even multioutput tasks. They are very powerful algorithms, capable of fitting complex
datasets.
Decision Trees are also the fundamental components of Random Forests, which are among the most
powerful Machine Learning algorithms available today.
While constructing a decision tree, the very first question to be answered is, Which Attribute Is the Best
Classifier?
The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree.
We would like to select the attribute that is most useful for classifying examples.
What is a good quantitative measure of the worth of an attribute? We will define a statistical property, called
information gain, that measures how well a given attribute separates the training examples according to their
target classification.
ID3 uses this information gain measure to select among the candidate attributes at each step while growing
the tree.
In order to define information gain precisely, we first define entropy, which characterizes the impurity of a collection of examples:
Entropy(S) = -p+ log2(p+) - p- log2(p-)
where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
Given entropy as a measure of the impurity in a collection of training examples, we can now define a measure
of the effectiveness of an attribute in classifying the training data.
Now, the information gain is simply the expected reduction in entropy caused by partitioning the examples
according to this attribute.
More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}).
For example, suppose S is a collection of training-example days described by attributes including Wind, which
can have the values Weak or Strong.
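For instance, with illustrative counts (assumed here, matching the classic play-tennis example): suppose S contains 14 examples, 9 positive and 5 negative, and Wind splits S into S_Weak = [6+, 2-] and S_Strong = [3+, 3-]. Then
Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940
Entropy(S_Weak) ≈ 0.811 and Entropy(S_Strong) = 1.000
Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.000) ≈ 0.048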
Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the
tree.
• ID3 Algorithm:
Ross Quinlan developed the ID3 algorithm in 1986. It is a greedy algorithm for decision tree construction. The ID3 algorithm was inspired by the CLS (Concept Learning System) algorithm proposed by Hunt in 1966. The CLS algorithm starts with a training set of objects O = {o1, o2, ..., on} drawn from a universe, where each object is described by a set of m attributes; at each step an attribute Aj is selected and a tree node structure is formed to represent Aj.
In the ID3 system, a relatively small number of training examples is randomly selected from a large set of objects O through a window. Using these, a preliminary decision tree is constructed. The tree is then tested by scanning all the objects in O to see if there are any exceptions to the tree. A new subset is formed using the original examples together with some of the exceptions found during the scan. This process is repeated until no exceptions are found. The resulting decision tree can then be used to classify new objects.
• A key feature of ID3 is the way in which the attributes are ordered for use in the classification process.
• Attributes which discriminate best are selected for the evaluation first.
• This requires computing an estimate of the expected information gain using all available attributes and
then selecting the attribute having the largest expected gain.
• The attribute having the next largest gain is assigned to the next level of nodes in the tree and so on
until the leaves of the tree have been reached.
• A decision tree is created by recursive selection of the best attribute in a top-down manner to use at
the current node in the tree.
• When a particular attribute is selected as the current node, it creates its child nodes, one for each
possible value of the selected attribute.
• The next step is to partition the samples using the possible values of the attribute and to assign these
subsets of examples to the appropriate child node.
• The process is repeated for every child node until we get a positive or negative result for all nodes
associated with the particular sample.
ID3 Algorithm
1. Establish the classification (target) attribute in the table R.
2. Compute the entropy of the classification attribute.
3. For each attribute in R, calculate its information gain with respect to the classification attribute.
4. Select attribute with the highest gain to be the next node in the tree.
5. Remove node attribute, creating reduced table Rs.
6. Repeat steps 3-5 until all the attributes have been used, or the same classification value remains
for all rows in the reduced table.
Information gain for attribute A on set S is defined by taking the entropy of S and subtracting from it the
summation of entropy of each subset of S, determined by the values of A multiplied by each subset’s proportion
of S.
A decision tree can easily be transformed to a set of rules by mapping from the root node to the leaf nodes
one by one.
A decision tree does its own feature extraction. The univariate tree uses only the necessary variables, and after the tree is built certain features may not be used at all. As shown in the figure, the variables x1, x2 and x4 are used, but not x3. It is possible to use a decision tree for feature extraction: we build a tree and then take only those features used by the tree as inputs to another learning method.
The main advantage of decision trees is interpretability: the decision nodes carry conditions that are simple to understand. Each path from the root to a leaf corresponds to one conjunction of tests, as all these conditions must be satisfied to reach the leaf node. These paths together can be written down as a set of IF-THEN rules, called a rule base.
For example, the decision tree of the figure can be written down as the following set of rules:
• People who are thirty-eight years old or less are different from people who are thirty-nine or more years old.
• And in the latter group, it is the job type that makes them different.
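In IF-THEN form, an illustrative rule base consistent with this description (the thresholds and output values below are assumed purely for illustration) might read:
R1: IF (age ≤ 38.5) AND (years-in-job ≤ 2.5) THEN y = 0.2
R2: IF (age ≤ 38.5) AND (years-in-job > 2.5) THEN y = 0.3
R3: IF (age > 38.5) AND (job-type = 'A') THEN y = 0.4
R4: IF (age > 38.5) AND (job-type = 'B') THEN y = 0.6
R5: IF (age > 38.5) AND (job-type = 'C') THEN y = 0.8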
In the case of a classification tree, there may be more than one leaf labelled with the same class. In such a case, the multiple conjunctive expressions corresponding to different paths can be combined as a disjunction. The class region then corresponds to a union of these multiple patches, each patch corresponding to the region defined by one leaf.
Pruning rules is possible for simplification. Pruning a subtree corresponds to pruning terms from a number of rules at the same time. It may also be possible to prune a term from one rule without touching the other rules. For example, in the previous rule set, if we see that all cases whose job-type = 'A' have outcomes close to 0.4, regardless of age, R3 can be pruned to
R3': IF (job-type = 'A') THEN y = 0.4
Once the rules are pruned, we cannot write them back as a tree anymore.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("DTree.csv")
print(df)

# Map the non-numeric columns to numbers (the category values below are
# assumed to match the values present in DTree.csv)
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
print(df)

# Feature columns (names assumed from the dataset) and target column
features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)
tree.plot_tree(dtree, feature_names=features)
plt.show()
Output:-
Decision Trees are also capable of performing regression tasks. Let us build a regression tree using Scikit-Learn's DecisionTreeRegressor class with max_depth=2 (here it is trained on the iris dataset for illustration):
# Loading the dataset
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data
y = iris.target
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)
tree.plot_tree(tree_reg, feature_names=iris.feature_names)
plt.show()
Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees (also called "growing" trees). The idea is quite simple: the algorithm first splits the training set into two subsets using a single feature k and a threshold t_k, chosen so that the resulting subsets are as pure as possible.
The CART algorithm uses Gini impurity to decide the splits when building the decision tree. It does this by searching for the split that produces the most homogeneous sub-nodes, with the help of the Gini index criterion.
Gini index / Gini impurity: The Gini index is the metric used for classification tasks in CART. It is based on the sum of squared class probabilities:
Gini = 1 - Σ (p_i)^2, where p_i is the probability of class i at the node.
It measures the probability of a particular element being wrongly classified when it is chosen randomly, and is a variation of the Gini coefficient. It works on categorical variables, provides outcomes of either "success" or "failure", and hence conducts binary splitting only.
• A Gini index of 0 indicates that all the elements belong to a single class (only one class exists there).
• A Gini index of 1 indicates that the elements are randomly distributed across various classes.
• A value of 0.5 indicates that the elements are equally distributed over two classes.
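As a small worked example (the counts are chosen here purely for illustration): if a node contains 10 samples, 4 of class A and 6 of class B, then
Gini = 1 - (0.4^2 + 0.6^2) = 1 - (0.16 + 0.36) = 0.48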
Classification tree
A classification tree is an algorithm where the target variable is categorical. The algorithm is then used to identify the "class" within which the target variable is most likely to fall. Classification trees are used when the dataset needs to be split into classes that belong to the response variable (like yes or no).
Regression tree
A Regression tree is an algorithm where the target variable is continuous and the tree is used to predict its
value. Regression trees are used when the response variable is continuous. For example, if the response variable
is the temperature of the day.
Advantages of CART
Limitations of CART
• Overfitting.
• High Variance.
• low bias.
Linear Models: Linear Regression, Logistic Regression, Generalized Linear Models, Support Vector Machines
……………………………………………………………………………………………………………………………..
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method
that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables
such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the variables.
Consider the below image:
y = a0 + a1x + ε
Here,
a0 = intercept, a1 = linear regression coefficient (slope), and ε = random error.
The values for the x and y variables are training datasets for the Linear Regression model representation.
Linear regression can be further divided into two types of the algorithm:
Simple Linear Regression is a type of Regression algorithms that models the relationship between a dependent
variable and a single independent variable. The relationship shown by a Simple Linear Regression model is
linear or a sloped straight line, hence it is called Simple Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value.
However, the independent variable can be measured on continuous or categorical values.
• Simple Linear Regression is used to model the relationship between two variables, such as the relationship between income and expenditure, or experience and salary.
The Simple Linear Regression model can be represented using the below equation:
y= a0+a1x+ ε
Where,
a0 = the intercept of the regression line (it can be obtained by putting x = 0)
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing.
ε = the error term (for a good model it will be negligible).
Here we are taking a dataset that has two variables: salary (dependent variable) and experience (independent variable). The goal of this problem is:
• We want to find out if there is any correlation between these two variables
Here, we will create a Simple Linear Regression model to find out the best fitting line for representing the
relationship between these two variables.
To implement the Simple Linear regression model in machine learning using Python, we need to follow the
below steps:
The first step for creating the Simple Linear Regression model is data pre-processing. We have already done it
earlier in this tutorial. But there will be some changes, which are given in the below steps:
a) First, we will import the three important libraries, which will help us for loading the dataset, plotting the
graphs, and creating the Simple Linear Regression model.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
b) Next, we will load the dataset into our code. After that, we need to extract the dependent and independent
variables from the given dataset. The independent variable is years of experience, and the dependent variable
is salary.
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
In the above lines of code, for the X variable we have used -1, since we want to take all columns except the last one from the dataset. For the y variable we have used the index 1, since we want to extract the second column and indexing starts from zero.
c) Next, we will split both variables into the test set and training set. We have 30 observations, so we will take
20 observations for the training set and 10 observations for the test set. We are splitting our dataset so that
we can train our model using a training dataset and then test the model using a test dataset.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
# test_size = 1/3 keeps 10 of the 30 observations for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
Now the second step is to fit our model to the training dataset. To do so, we will import the LinearRegression
class of the linear_model library from the scikit learn. After importing the class, we are going to create an
object of the class named as a regressor.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
In the above code, we have used the fit() method to fit our Simple Linear Regression object to the training set. In the fit() function, we have passed X_train and y_train, which form our training dataset for the dependent variable (salary) and the independent variable (experience). We have fitted our regressor object to the training set so that the model can easily learn the correlations between the predictor and target variables. Now our model is ready to predict the output for new observations. In this step, we will provide the test dataset (new observations) to the model to check whether it can predict the correct output or not.
We will create a prediction vector y_pred, which will contain the predictions for the test dataset.
y_pred = regressor.predict(X_test)
On the x-axis we will plot the Years of Experience of the employees, and on the y-axis their Salary. Using the scatter() function, we pass the real values of the training set (X_train for years of experience and y_train for salaries) along with a colour for the observations; here we take green, but it can be any colour of your choice. Next, we give the plot a title using the title() function of the pyplot library, passing the name "Salary vs Experience (Training Dataset)". After that, we assign labels for the x-axis and y-axis using the xlabel() and ylabel() functions.
plt.scatter(X_train, y_train, color='green')
plt.plot(X_train, regressor.predict(X_train), color='red')
plt.title('Salary vs Experience (Training Dataset)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
In the above plot, we can see the real observations as green dots and the predicted values covered by the red regression line. The regression line shows the correlation between the dependent and independent variables.
The good fit of the line can be observed by calculating the difference between actual values and predicted
values. But as we can see in the above plot, most of the observations are close to the regression line, hence
our model is good for the training set.
In the same way, we can visualise the test set results:
plt.scatter(X_test, y_test, color='blue')
plt.plot(X_train, regressor.predict(X_train), color='red')
plt.title('Salary vs Experience (Test Dataset)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
Multiple Linear Regression is one of the important regression algorithms which models the linear relationship
between a single dependent continuous variable and more than one independent variable.
Example:
Prediction of CO2 emission based on engine size and number of cylinders in a car.
• For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor or
independent variables may be of continuous or categorical form.
• Each feature variable must model the linear relationship with the dependent variable.
The multiple regression equation explained above takes the following form:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
Where,
Y = Output/Response variable
b0 = intercept of the line; b1, b2, ..., bn = coefficients of the independent variables x1, x2, ..., xn.
Assumptions for Multiple Linear Regression:
• A linear relationship should exist between the Target and predictor variables.
• MLR assumes little or no multicollinearity (correlation between the independent variable) in data.
Problem Description:
We have a dataset of 50 start-up companies. This dataset contains five main pieces of information: R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal is to create a model that can easily determine which company has a maximum profit, and which factor affects the profit of a company the most.
Since we need to find the Profit, so it is the dependent variable, and the other four variables are independent
variables. Below are the main steps of deploying the MLR model:
• Importing libraries: Firstly, we will import the library which will help in building the model. Below is
the code for it:
import numpy as np
import pandas as pd
• Importing dataset: Now we will import the dataset (50_Startups.csv), which contains all the variables, and extract the dependent and independent variables from it.
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, 4]
# Encode the categorical 'State' column as dummy variables
states = pd.get_dummies(X['State'], drop_first=True)
# Drop the original 'State' column and append the dummy columns
X = X.drop('State', axis=1)
X = pd.concat([X, states], axis=1)
• Now we will split the dataset into training and test set.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
# 20% of the 50 companies are kept for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Now, we have well prepared our dataset in order to provide training, which means we will fit our
regression model to the training set. It will be similar to as we did in Simple Linear Regression model.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
The last step for our model is checking the performance of the model. We will do it by predicting the test set
result. For prediction, we will create a y_pred vector.
y_pred = regressor.predict(X_test)
print('Train Score: ', regressor.score(X_train, y_train))
print('Test Score: ', regressor.score(X_test, y_test))
The above scores tell us that our model is about 95% accurate on the training dataset and about 93% accurate on the test dataset.
Logistic regression aims to solve classification problems. It does this by predicting categorical outcomes,
unlike linear regression that predicts a continuous outcome.
In the simplest case there are two outcomes, which is called binomial, an example of which is predicting if a
tumor is malignant or benign. Other cases have more than two outcomes to classify, in this case it is called
multinomial. A common example for multinomial logistic regression would be predicting the class of an iris
flower between 3 different species.
Here we will be using basic logistic regression to predict a binomial variable. This means it has only two
possible outcomes.
Example:
import numpy
from sklearn import linear_model
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
logr = linear_model.LogisticRegression()
logr.fit(X,y)
predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))
print(predicted)
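The fitted model can also report the estimated probability for each class (a small follow-on sketch using the same logr object; predict_proba is a standard scikit-learn method):
probs = logr.predict_proba(numpy.array([3.46]).reshape(-1,1))
print(probs)   # [[probability of class 0, probability of class 1]]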
……………………………………………………………………………………………………………………………..
Binary Classification
It is the process or task of classification in which the given data is classified into two classes. It is basically a kind of prediction about which of two groups a thing belongs to.
Let us suppose, two emails are sent to you, one is sent by an insurance company that keeps sending their ads,
and the other is from your bank regarding your credit card bill. The email service provider will classify the
two emails, the first one will be sent to the spam folder and the second one will be kept in the primary one.
This process is known as binary classification, as there are two discrete classes, one is spam and the other is
primary. So, this is a problem of binary classification.
Binary classification uses several algorithms to perform the task; some of the most common algorithms used for binary classification are:
Logistic Regression
k-Nearest Neighbors
Decision Trees
Support Vector Machine
Naive Bayes
Multiclass Classification
Multi-class classification is the task of classifying elements into different classes. Unlike binary classification, it is not restricted to two classes.
In these, there are different classes for the response variable to be classified in and thus according to the
name, it is a Multi-class classification.
Let us suppose we have to do sentiment analysis of a person: if the classes are just "positive" and "negative", then it is a binary classification problem. But if the classes are "sadness", "happiness", "disgust" and "depression", then it is called a multi-class classification problem.
Binary vs. multi-class classification:
No. of classes: Binary classification is a classification into two groups, i.e., it classifies objects into at most two classes, whereas multi-class classification can have any number of classes, i.e., it classifies objects into more than two classes.
Algorithms used for binary classification:
• Logistic Regression
• k-Nearest Neighbors
• Decision Trees
• Support Vector Machine
• Naive Bayes
Algorithms used for multi-class classification:
• k-Nearest Neighbors
• Decision Trees
• Random Forest
• Gradient Boosting
• Naive Bayes
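As a small illustration (a sketch; scikit-learn classifiers such as LogisticRegression handle a multi-class target directly), the iris data used earlier is a three-class problem:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # three classes -> multi-class problem
clf = LogisticRegression(max_iter=200)
clf.fit(X, y)
print(clf.predict(X[:2]))                  # predicted class labels
print(clf.predict_proba(X[:2]))            # one probability per class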
Q) MNIST
The MNIST database (Modified National Institute of Standards and Technology database) is a large database
of handwritten digits that is commonly used for training various image processing systems.
The database is also widely used for training and testing in the field of machine learning.
The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while
the testing dataset was taken from American high school students, it was not well-suited for machine learning
experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel
bounding box and anti-aliased, which introduced grayscale levels.
The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and
half of the test set were taken from NIST's training dataset, while the other half of the training set and the
other half of the test set were taken from NIST's testing dataset. The original creators of the database keep a
list of some of the methods tested on it. In their original paper, they use a support-vector machine to get an
error rate of 0.8%.
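For reference (a small sketch using scikit-learn's OpenML loader), the dataset can be pulled directly as follows:
from sklearn.datasets import fetch_openml

# 70,000 images of 28x28 pixels, each flattened to 784 features
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target
print(X.shape, y.shape)   # (70000, 784) (70000,)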
Extended MNIST (EMNIST) is a newer dataset developed and released by NIST to be the (final) successor to
MNIST.[11][12] MNIST included images only of handwritten digits. EMNIST includes all the images from
NIST Special Database 19, which is a large database of handwritten uppercase and lower case letters as well
as digits. The images in EMNIST were converted into the same 28x28 pixel format, by the same process, as
were the MNIST images. Accordingly, tools which work with the older, smaller, MNIST dataset will likely
work unmodified with EMNIST.
Q) Ranking
Some binary classification systems produce a numerical rating (score) for each occurrence; ordering the occurrences by these ratings turns them into a ranking. The ratings are then compared to a threshold: occurrences with ratings above the threshold are declared positive, and occurrences with ratings below the threshold are declared negative.
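A minimal sketch of this idea (the data, the logistic-regression scorer and the 0.5 threshold are all illustrative assumptions):
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]          # a rating for each occurrence
ranking = np.argsort(-scores)                # indices ordered from highest to lowest rating
threshold = 0.5
labels = (scores >= threshold).astype(int)   # above the threshold -> positive
print(scores, ranking, labels)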