Unit 2
Learning a Class from Examples, Linear, Non-linear, Multi-class and Multi-label classification,
Decision Trees: ID3, Classification and Regression Trees (CART), Regression: Linear Regression,
Multiple Linear Regression, Logistic Regression, Bayesian Network, Bayesian Classifier
Regression analysis is a statistical method for modeling the relationship between a dependent (target)
variable and one or more independent (predictor) variables. More specifically, regression analysis
helps us understand how the value of the dependent variable changes with respect to one independent
variable when the other independent variables are held fixed. It predicts continuous/real values such as
temperature, age, salary, price, etc.
Example: Suppose there is a marketing company A that runs various advertisements every year and
gets sales from them. Regression can use the advertisements made by the company in the last 5 years
and the corresponding sales to predict the sales expected from a planned advertisement.
Regression is a supervised learning technique which helps in finding the correlation between variables
and enables us to predict a continuous output variable based on one or more predictor variables.
It is mainly used for prediction, forecasting, time series modeling, and determining the cause-and-
effect relationship between variables.
In regression, we plot a graph between the variables which best fits the given data points; using this
plot, the machine learning model can make predictions about the data. In simple words, "Regression
shows a line or curve that passes through all the data points on the target-predictor graph in such a way
that the vertical distance between the data points and the regression line is minimum." The distance
between the data points and the line tells whether the model has captured a strong relationship or not.
o Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variables or which are used to
predict the values of the dependent variables are called independent variable, also called as
a predictor.
o Outliers: Outlier is an observation which contains either very low value or very high value in
comparison to other observed values. An outlier may hamper the result, so it should be
avoided.
o Multicollinearity: If the independent variables are highly correlated with each other, then such a
condition is called multicollinearity. It should not be present in the dataset, because it creates a
problem while ranking the most affecting variable.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not
well with test dataset, then such problem is called Overfitting. And if our algorithm does not
perform well even with training dataset, then such problem is called underfitting.
As mentioned above, Regression analysis helps in the prediction of a continuous variable. There are
various scenarios in the real world where we need some future predictions such as weather condition,
sales prediction, marketing trends, etc., for such case we need some technology which can make
predictions more accurately. So for such case we need Regression analysis which is a statistical
method and used in machine learning and data science. Below are some other reasons for using
Regression analysis:
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor, the
least important factor, and how each factor is affecting the other factors.
Types of Regression
There are various types of regressions which are used in data science and machine learning. Each type
has its own importance on different scenarios, but at the core, all the regression methods analyze the
effect of the independent variable on dependent variables. Here we are discussing some important
types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in the linear regression model can be explained using the
below image. Here we are predicting the salary of an employee on the basis of the year of
experience.
o Below is the mathematical equation for Linear regression:
Y= aX+b
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a binary
or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it is different from the linear regression
algorithm in the term how they are used.
o Logistic regression uses the sigmoid function (also called the logistic function) to model the data.
The function can be represented as:
f(x) = 1 / (1 + e^-x)
When we provide the input values (data) to the function, it gives an S-shaped curve as follows:
o It uses the concept of threshold levels: values above the threshold level are rounded up to 1,
and values below the threshold level are rounded down to 0.
There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multi (cats, dogs, lions)
o Ordinal (low, medium, high)
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using a
linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x
and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear
fashion, so for such case, linear regression will not best fit to those datapoints. To cover such
datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial features
of given degree and then modeled using a linear model. Which means the datapoints are
best fitted using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression equation, which
means the linear regression equation Y = b0 + b1x is transformed into the polynomial regression
equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1,... bn are the regression coefficients. x is
our independent/input variable.
Support Vector Machine is a supervised learning algorithm which can be used for regression as well
as classification problems. So if we use it for regression problems, then it is termed as Support Vector
Regression.
Support Vector Regression is a regression algorithm which works for continuous variables. Below are
some keywords which are used in Support Vector Regression:
o Kernel: It is a function used to map a lower-dimensional data into higher dimensional data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a
line which helps to predict the continuous variables and cover most of the datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a
margin for datapoints.
o Support vectors: Support vectors are the data points which are nearest to the hyperplane and
which determine the position of the hyperplane and the boundary lines.
In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum number
of datapoints are covered in that margin. The main goal of SVR is to consider the maximum
datapoints within the boundary lines and the hyperplane (best-fit line) must contain a maximum
number of datapoints. Consider the below image:
Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
o Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
o It can solve problems for both categorical and numerical data
o Decision Tree regression builds a tree-like structure in which each internal node represents
the "test" for an attribute, each branch represent the result of the test, and each leaf node
represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (dataset), which splits
into left and right child nodes (subsets of dataset). These child nodes are further divided into
their children node, and themselves become the parent node of those nodes. Consider the
below image:
The above image shows an example of Decision Tree regression; here, the model is trying to predict the
choice of a person between a Sports car or a Luxury car.
o Random forest is one of the most powerful supervised learning algorithms which is capable of
performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines multiple
decision trees and predicts the final output based on the average of each tree's output. The
combined decision trees are called base models, and the prediction can be represented more formally as:
g(x) = (f1(x) + f2(x) + ... + fN(x)) / N, where f1, f2, ..., fN are the individual decision trees (base models).
o Ridge regression is one of the most robust versions of linear regression in which a small
amount of bias is introduced so that we can get better long term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. We
compute this penalty term by multiplying lambda by the squared weight of each
individual feature.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the
model. It is also called as L2 regularization.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model.
o It is similar to the Ridge Regression except that penalty term contains only the absolute
weights instead of a square of weights.
o Since it takes absolute values, it can shrink a slope exactly to 0, whereas Ridge Regression
can only shrink it close to 0.
o It is also called L1 regularization. The cost function for Lasso regression is:
Cost = Σ(yi − ŷi)^2 + λ Σ|wj|
where ŷi is the predicted value, λ is the penalty strength, and wj are the model coefficients.
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
Linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more
independent (x) variables, hence it is called linear regression. Since linear regression shows the linear
relationship, which means it finds how the value of the dependent variable is changing according to
the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:
y= a0+a1x+ ε
Here,
Linear regression can be further divided into two types of the algorithm:
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
When working with linear regression, our main goal is to find the best fit line that means the error
between predicted values and actual values should be minimized. The best fit line will have the least
error.
The different values for the weights or coefficients of the line (a0, a1) give a different line of regression,
so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate this we use the
cost function.
Cost function-
o The different values for weights or coefficient of lines (a0, a1) gives the different line of
regression, and the cost function is used to estimate the values of the coefficient for the best
fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the
input variable to the output variable. This mapping function is also known as Hypothesis
function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average
of the squared errors between the predicted values and actual values. It can be written as:
MSE = (1/N) Σ (Yi − (a1xi + a0))^2
Where,
N=Total number of observation
Yi = Actual value
(a1xi+a0)= Predicted value.
Residuals: The distance between the actual value and the predicted value is called the residual. If the
observed points are far from the regression line, then the residuals will be high, and so the cost function
will be high. If the scatter points are close to the regression line, then the residuals will be small and
hence the cost function will be small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
o It is done by randomly selecting initial values for the coefficients and then iteratively updating them
to reach the minimum of the cost function, as illustrated in the sketch below.
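As an illustration, a minimal sketch of gradient descent for simple linear regression (the learning rate and number of iterations are arbitrary choices):
import numpy as nm

def gradient_descent(x, y, learning_rate=0.01, epochs=1000):
    a0, a1 = 0.0, 0.0                        # start from arbitrary coefficient values
    n = len(x)
    for _ in range(epochs):
        y_pred = a1 * x + a0                 # current predictions
        # gradients of the MSE cost function with respect to a1 and a0
        d_a1 = (-2 / n) * nm.sum(x * (y - y_pred))
        d_a0 = (-2 / n) * nm.sum(y - y_pred)
        a1 -= learning_rate * d_a1           # move the coefficients against the gradient
        a0 -= learning_rate * d_a0
    return a0, a1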
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process of
finding the best model out of various models is called optimization. It can be achieved by below
method:
1. R-squared method:
R-squared is a statistical method that determines the goodness of fit. It measures the strength of the
relationship between the dependent and independent variables on a scale of 0-100%. A high R-squared
value means less difference between the predicted values and the actual values, hence a good model. It is
also called the coefficient of determination.
Below are some important assumptions of Linear Regression. These are some formal checks while
building a Linear Regression model, which ensures to get the best possible result from the given
dataset.
o Linear relationship between the features and target:
Linear regression assumes the linear relationship between the dependent and independent
variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors and the
target variable. Or we can say, it is difficult to determine which predictor variable is
affecting the target variable and which is not. So, the model assumes either little or no
multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern distribution of
data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution pattern. If
error terms are not normally distributed, then confidence intervals will become either too
wide or too narrow, which may cause difficulties in finding coefficients.
It can be checked using the q-q plot. If the plot shows a straight line without any deviation,
which means the error is normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in error terms. If there will be any
correlation in the error term, then it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.
Simple Linear Regression is a type of Regression algorithms that models the relationship between a
dependent variable and a single independent variable. The relationship shown by a Simple Linear
Regression model is linear or a sloped straight line, hence it is called Simple Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a continuous/real
value. However, the independent variable can be measured on continuous or categorical values.
o Model the relationship between the two variables. Such as the relationship between
Income and expenditure, experience and Salary, etc.
o Forecasting new observations. Such as Weather forecasting according to temperature,
Revenue of a company according to the investments in a year, etc.
Simple Linear Regression Model:
The Simple Linear Regression model can be represented using the below equation:
y= a0+a1x+ ε
Where,
a0= It is the intercept of the Regression line (can be obtained putting x=0)
a1= It is the slope of the regression line, which tells whether the line is increasing or decreasing.
ε = The error term. (For a good model it will be negligible)
Here we are taking a dataset that has two variables: salary (dependent variable) and experience
(Independent variable). The goals of this problem are:
o We want to find out if there is any correlation between these two variables
o We will find the best fit line for the dataset.
o How the dependent variable is changing by changing the independent variable.
In this section, we will create a Simple Linear Regression model to find out the best fitting line for
representing the relationship between these two variables.
To implement the Simple Linear regression model in machine learning using Python, we need to
follow the below steps:
The first step for creating the Simple Linear Regression model is data pre-processing. We have
already done it earlier in this tutorial. But there will be some changes, which are given in the below
steps:
o First, we will import the three important libraries, which will help us for loading the dataset,
plotting the graphs, and creating the Simple Linear Regression model.
1. import numpy as nm
2. import matplotlib.pyplot as mtp
3. import pandas as pd
1. data_set= pd.read_csv('Salary_Data.csv')
By executing the above line of code (ctrl+ENTER), we can read the dataset on our Spyder IDE screen
by clicking on the variable explorer option.
The above output shows the dataset, which has two variables: Salary and Experience.
After that, we need to extract the dependent and independent variables from the given dataset. The
independent variable is years of experience, and the dependent variable is salary. Below is code for it:
1. x= data_set.iloc[:, :-1].values
2. y= data_set.iloc[:, 1].values
In the above lines of code, for the x variable, we have taken -1 since we want to remove the last
column from the dataset. For the y variable, we have taken 1 as the parameter, since we want to
extract the second column and indexing starts from zero.
By executing the above line of code, we will get the output for X and Y variable as:
In the above output image, we can see the X (independent) variable and Y (dependent) variable has
been extracted from the given dataset.
o Next, we will split both variables into the test set and training set. We have 30 observations,
so we will take 20 observations for the training set and 10 observations for the test set. We are
splitting our dataset so that we can train our model using a training dataset and then test the
model using a test dataset. The code for this is given below:
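One way to perform this split is with scikit-learn's train_test_split; a minimal sketch (a test_size of 1/3 sends 10 of the 30 observations to the test set):
from sklearn.model_selection import train_test_split

# 20 observations for training, 10 for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)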
By executing the above code, we will get x-test, x-train and y-test, y-train dataset. Consider the below
images:
Test-dataset:
Training Dataset:
o For simple linear Regression, we will not use Feature Scaling. Because Python libraries take
care of it for some cases, so we don't need to perform it here. Now, our dataset is well
prepared to work on it and we are going to start building a Simple Linear Regression model
for the given problem.
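Now we fit the Simple Linear Regression model to the training set; a minimal sketch, assuming scikit-learn's LinearRegression class:
from sklearn.linear_model import LinearRegression

# Fitting the Simple Linear Regression model to the training set
regressor = LinearRegression()
regressor.fit(x_train, y_train)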
In the above code, we have used a fit() method to fit our Simple Linear Regression object to the
training set. In the fit() function, we have passed the x_train and y_train, which is our training dataset
for the dependent and an independent variable. We have fitted our regressor object to the training set
so that the model can easily learn the correlations between the predictor and target variables. After
executing the above lines of code, we will get the below output.
Output:
In the above output, we can see that the regressor object has been fitted on the
dependent (Salary) and independent (Experience) variables. So now our model is ready to predict
the output for new observations. In this step, we will provide the test dataset (new observations) to
the model to check whether it can predict the correct output or not.
We will create a prediction vector y_pred, and x_pred, which will contain predictions of test dataset,
and prediction of training set respectively.
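A minimal sketch of this prediction step, assuming the regressor object fitted above:
y_pred = regressor.predict(x_test)     # salary predictions for the test set
x_pred = regressor.predict(x_train)    # salary predictions for the training set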
On executing the above lines of code, two variables named y_pred and x_pred will be generated in the
variable explorer, containing the salary predictions for the test set and the training set respectively.
Output:
You can check the variable by clicking on the variable explorer option in the IDE, and also compare
the result by comparing values from y_pred and y_test. By comparing these values, we can check how
good our model is performing.
Now in this step, we will visualize the training set result. To do so, we will use the scatter() function
of the pyplot library, which we have already imported in the pre-processing step. The scatter ()
function will create a scatter plot of observations.
In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary of
employees. In the function, we will pass the real values of training set, which means a year of
experience x_train, training set of Salaries y_train, and color of the observations. Here we are taking a
green color for the observation, but it can be any color as per the choice.
Now, we need to plot the regression line, so for this, we will use the plot() function of the pyplot
library. In this function, we will pass the years of experience for training set, predicted salary for
training set x_pred, and color of the line.
Next, we will give the title for the plot. So here, we will use the title() function of the pyplot library
and pass the name "Salary vs Experience (Training Dataset)".
After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel() function.
Finally, we will represent all above things in a graph using show(). The code is given below:
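A sketch of the plotting code described above, assuming the mtp alias for matplotlib.pyplot and the x_pred vector from the previous step (colors and labels are illustrative):
mtp.scatter(x_train, y_train, color="green")   # actual training observations
mtp.plot(x_train, x_pred, color="red")         # fitted regression line
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()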
Output:
By executing the above lines of code, we will get the below graph plot as an output.
In the above plot, we can see the real values observations in green dots and predicted values are
covered by the red regression line. The regression line shows a correlation between the dependent and
independent variable.
The good fit of the line can be observed by calculating the difference between actual values and
predicted values. But as we can see in the above plot, most of the observations are close to the
regression line, hence our model is good for the training set.
Step: 5. visualizing the Test set results:
In the previous step, we have visualized the performance of our model on the training set. Now, we
will do the same for the Test set. The complete code will remain the same as the above code, except in
this, we will use x_test, and y_test instead of x_train and y_train.
Here we are also changing the color of observations and regression line to differentiate between the
two plots, but it is optional.
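A sketch of the test set plot, reusing the regression line fitted on the training data:
mtp.scatter(x_test, y_test, color="blue")                    # actual test observations
mtp.plot(x_train, regressor.predict(x_train), color="red")   # same fitted regression line
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary")
mtp.show()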
Output:
By executing the above line of code, we will get the output as:
In the above plot, there are observations given by the blue color, and prediction is given by the red
regression line. As we can see, most of the observations are close to the regression line, hence we can
say our Simple Linear Regression is a good model and able to make good predictions.
In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there may be
various cases in which the response variable is affected by more than one predictor variable; for such
cases, the Multiple Linear Regression algorithm is used.
Moreover, Multiple Linear Regression is an extension of Simple Linear regression as it takes more
than one predictor variable to predict the response variable. We can define it as:
Multiple Linear Regression is one of the important regression algorithms which models the linear
relationship between a single dependent continuous variable and more than one independent
variable.
Example:
Prediction of CO2 emission based on engine size and number of cylinders in a car.
o For MLR, the dependent or target variable(Y) must be the continuous/real, but the predictor
or independent variable may be of continuous or categorical form.
o Each feature variable must model the linear relationship with the dependent variable.
o MLR tries to fit a regression line through a multidimensional space of data-points.
MLR equation:
In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple predictor
variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear Regression, the same idea
applies to the multiple linear regression equation, which becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
Where,
Y= Output/Response variable
b0, b1, b2, ..., bn = Coefficients of the model
x1, x2, ..., xn = Independent/Predictor variables
o A linear relationship should exist between the Target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent variable) in
data.
We have a dataset of 50 start-up companies. This dataset contains five main information: R&D
Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal
is to create a model that can easily determine which company has a maximum profit, and which is the
most affecting factor for the profit of a company.
Since we need to find the Profit, so it is the dependent variable, and the other four variables are
independent variables. Below are the main steps of deploying the MLR model:
The very first step is data pre-processing, which we have already discussed in this tutorial. This
process contains the below steps:
o Importing libraries: Firstly we will import the library which will help in building the model.
Below is the code for it:
1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
o Importing dataset: Now we will import the dataset(50_CompList), which contains all the
variables. Below is the code for it:
1. #importing datasets
2. data_set= pd.read_csv('50_CompList.csv')
Output:
As we can see in the above output, the last column contains categorical variables which are not
suitable to apply directly for fitting the model. So we need to encode this variable.
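The extraction of the dependent and independent variables is done in the same way as before; a minimal sketch, assuming Profit (the fifth column) is the target:
# Extracting dependent and independent variables
x = data_set.iloc[:, :-1].values   # R&D Spend, Administration, Marketing Spend, State
y = data_set.iloc[:, 4].values     # Profit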
As we have one categorical variable (State), which cannot be directly applied to the model, so we will
encode it. To encode the categorical variable into numbers, we will use the LabelEncoder class. But
it is not sufficient because it still has some relational order, which may create a wrong model. So in
order to remove this problem, we will use OneHotEncoder, which will create the dummy variables.
Below is code for it:
1. #Catgorical data
2. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
3. labelencoder_x= LabelEncoder()
4. x[:, 3]= labelencoder_x.fit_transform(x[:,3])
5. onehotencoder= OneHotEncoder(categorical_features= [3])
6. x= onehotencoder.fit_transform(x).toarray()
Here we are only encoding one independent variable, which is state as other variables are continuous.
Output:
As we can see in the above output, the state column has been converted into dummy variables (0 and
1). Here each dummy variable column is corresponding to the one State. We can check by
comparing it with the original dataset. The first column corresponds to the California State, the
second column corresponds to the Florida State, and the third column corresponds to the New York
State.
Now, we are writing a single line of code just to avoid the dummy variable trap:
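A sketch of that single line, assuming the dummy variables occupy the first columns of x:
x = x[:, 1:]   # drop the first dummy variable column to avoid the dummy variable trap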
If we do not remove the first dummy variable, then it may introduce multicollinearity in the model.
As we can see in the above output image, the first column has been removed.
o Now we will split the dataset into training and test set. The code for this is given below:
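A minimal sketch of the split with scikit-learn's train_test_split (an 80/20 split leaves 40 companies for training and 10 for testing):
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)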
The above code will split our dataset into a training set and test set.
Output: The above code will split the dataset into training set and test set. You can check the output
by clicking on the variable explorer option given in Spyder IDE. The test set and training set will look
like the below image:
Test set:
Training set:
Now, we have well prepared our dataset in order to provide training, which means we will fit our
regression model to the training set. It will be similar to as we did in Simple Linear Regression model.
The code for this will be:
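A minimal sketch of the fitting step, assuming scikit-learn's LinearRegression (the same class handles multiple predictors):
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)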
Output:
Now, we have successfully trained our model using the training dataset. In the next step, we will test
the performance of the model using the test dataset.
The last step for our model is checking the performance of the model. We will do it by predicting the
test set result. For prediction, we will create a y_pred vector. Below is the code for it:
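A minimal sketch of this prediction step, assuming the fitted regressor object:
y_pred = regressor.predict(x_test)   # predicted profits for the test set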
By executing the above lines of code, a new vector will be generated under the variable explorer
option. We can test our model by comparing the predicted values and test set values.
Output:
In the above output, we have predicted result set and test set. We can check model performance by
comparing these two value index by index. For example, the first index has a predicted value
of 103015$ profit and test/real value of 103282$ profit. The difference is only of 267$, which is a
good prediction, so, finally, our model is completed here.
o We can also check the score for training dataset and test dataset. Below is the code for it:
1. print('Train Score: ', regressor.score(x_train, y_train))
2. print('Test Score: ', regressor.score(x_test, y_test))
The above score tells that our model is 95% accurate with the training dataset and 93%
accurate with the test dataset.
ML Polynomial Regression
o It is also called a special case of Multiple Linear Regression in ML, because we add some
polynomial terms to the Multiple Linear regression equation to convert it into Polynomial
Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial regression for training is of non-linear nature.
o It makes use of a linear regression model to fit the complicated and non-linear functions and
datasets.
o Hence, "In Polynomial regression, the original features are converted into Polynomial
features of required degree (2,3,..,n) and then modeled using a linear model."
o If we apply a linear model to a linear dataset, then it provides a good result, as we have
seen in Simple Linear Regression, but if we apply the same model without any modification
to a non-linear dataset, then it will produce a drastically worse output, due to which the loss
function will increase, the error rate will be high, and accuracy will decrease.
o So for such cases, where data points are arranged in a non-linear fashion, we need the
Polynomial Regression model. We can understand it in a better way using the below
comparison diagram of the linear dataset and non-linear dataset.
o In the above image, we have taken a dataset which is arranged non-linearly. So if we try to
cover it with a linear model, then we can clearly see that it hardly covers any data point. On
the other hand, a curve is suitable to cover most of the data points, which is of the Polynomial
model.
o Hence, if the datasets are arranged in a non-linear fashion, then we should use the
Polynomial Regression model instead of Simple Linear Regression. The equation of the
Polynomial Regression model is: y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n
Here we will implement the Polynomial Regression using Python. We will understand it by
comparing Polynomial Regression model with the Simple Linear Regression model. So first, let's
understand the problem for which we are going to build the model.
Problem Description: There is a Human Resource company which is going to hire a new candidate.
The candidate has stated that his previous salary was 160K per annum, and HR has to check whether he
is telling the truth or bluffing. To identify this, they only have a dataset from his previous company in which
the salaries of the top 10 positions are mentioned with their levels. By checking the dataset available,
we have found that there is a non-linear relationship between the Position levels and the salaries.
Our goal is to build a Bluffing detector regression model, so HR can hire an honest candidate.
Below are the steps to build such a model.
Steps for Polynomial Regression:
o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the result for Linear Regression and Polynomial Regression model.
o Predicting the final output with both models.
The data pre-processing step will remain the same as in previous regression models, except for some
changes. In the Polynomial Regression model, we will not use feature scaling, and also we will not
split our dataset into training and test set. It has two reasons:
o The dataset contains very little information, which is not suitable for dividing into a test and a
training set; otherwise our model will not be able to find the correlations between the salaries and
levels.
o In this model, we want very accurate predictions for salary, so the model should have enough
information.
1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('Position_Salaries.csv')
8.
9. #Extracting Independent and dependent Variable
10. x= data_set.iloc[:, 1:2].values
11. y= data_set.iloc[:, 2].values
Explanation:
o In the above lines of code, we have imported the important Python libraries to import dataset
and operate on it.
o Next, we have imported the dataset 'Position_Salaries.csv', which contains three columns
(Position, Levels, and Salary), but we will consider only two columns (Salary and Levels).
o After that, we have extracted the dependent(Y) and independent variable(X) from the dataset.
For x-variable, we have taken parameters as [:,1:2], because we want 1 index(levels), and
included :2 to make it as a matrix.
Output:
As we can see in the above output, there are three columns present (Positions, Levels, and Salaries).
But we are only considering two columns, because the Position column is equivalent to the Levels
column (Levels can be seen as the encoded form of Positions).
Here we will predict the output for level 6.5, because the candidate has 4+ years' experience as a
regional manager, so he must be somewhere between levels 6 and 7.
Building the Linear regression model:
Now, we will build and fit the Linear regression model to the dataset. In building polynomial
regression, we will take the Linear regression model as reference and compare both the results. The
code is given below:
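A minimal sketch of this reference model, assuming scikit-learn's LinearRegression and the x and y variables extracted above:
from sklearn.linear_model import LinearRegression

# Fitting the Linear Regression model to the whole dataset (no train/test split here)
lin_regs = LinearRegression()
lin_regs.fit(x, y)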
In the above code, we have created the Simple Linear model using lin_regs object
of LinearRegression class and fitted it to the dataset variables (x and y).
Output:
Now we will build the Polynomial Regression model, but it will be a little different from the Simple
Linear model. Because here we will use PolynomialFeatures class of preprocessing library. We are
using this class to add some extra features to our dataset.
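A minimal sketch of the Polynomial Regression model, assuming scikit-learn's PolynomialFeatures and LinearRegression classes:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly_regs = PolynomialFeatures(degree=2)   # the degree can be changed as per requirement
x_poly = poly_regs.fit_transform(x)        # convert x into the polynomial feature matrix
lin_reg_2 = LinearRegression()
lin_reg_2.fit(x_poly, y)                   # fit a linear model on the polynomial features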
In the above lines of code, we have used poly_regs.fit_transform(x), because first we are converting
our feature matrix into polynomial feature matrix, and then fitting it to the Polynomial regression
model. The parameter value (degree=2) depends on our choice; we can choose it as per our
requirement.
After executing the code, we will get another matrix x_poly, which can be seen under the variable
explorer option:
Next, we have used another LinearRegression object, namely lin_reg_2, to fit our x_poly vector to
the linear model.
Output:
Now we will visualize the result for Linear regression model as we did in Simple Linear Regression.
Below is the code for it:
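A sketch of the plotting code, assuming the mtp alias and the lin_regs model fitted earlier:
mtp.scatter(x, y, color="blue")                  # actual data points
mtp.plot(x, lin_regs.predict(x), color="red")    # straight regression line
mtp.title("Bluff detection model (Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()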
Output:
In the above output image, we can clearly see that the regression line is far from the data points.
Predictions lie on a red straight line, and the blue points are the actual values. If we use this output to
predict the salary of the CEO, it gives a salary of approx. 600000$, which is far away from the real
value.
So we need a curved model to fit the dataset other than a straight line.
Here we will visualize the result of Polynomial regression model, code for which is little different
from the above model.
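A sketch of the polynomial plot; the only change is that the predictions come from lin_reg_2 applied to the transformed features:
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")   # polynomial curve
mtp.title("Bluff detection model (Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()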
Output:
As we can see in the above output image, the predictions are close to the real values. The above plot
will vary as we will change the degree.
For degree= 3:
If we change the degree to 3, then we will get a more accurate plot, as shown in the below image.
As we can see in the above output image, the predicted salary for level 6.5 is near 170K$-
190K$, which suggests that the future employee is telling the truth about his salary.
Degree= 4: Let's again change the degree to 4, and now will get the most accurate plot. Hence we can
get more accurate results by increasing the degree of Polynomial.
Predicting the final result with the Linear Regression model:
Now, we will predict the final output using the Linear regression model to see whether an employee is
saying truth or bluff. So, for this, we will use the predict() method and will pass the value 6.5. Below
is the code for it:
1. lin_pred = lin_regs.predict([[6.5]])
2. print(lin_pred)
Output:
[330378.78787879]
Now, we will predict the final output using the Polynomial Regression model to compare with Linear
model. Below is the code for it:
1. poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
2. print(poly_pred)
Output:
[158862.45265153]
As we can see, the predicted output for the Polynomial Regression is [158862.45265153], which is
much closer to the real value; hence, we can say that the future employee is telling the truth.
As we know, the Supervised Machine Learning algorithm can be broadly classified into Regression
and Classification Algorithms. In Regression algorithms, we have predicted the output for continuous
values, but to predict the categorical values, we need Classification algorithms.
Classification Algorithm
The Classification algorithm is a Supervised Learning technique that is used to identify the category
of new observations on the basis of training data. In Classification, a program learns from the given
dataset or observations and then classifies new observation into a number of classes or groups. Such
as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called as targets/labels or
categories.
Unlike regression, the output variable of Classification is a category, not a value, such as "Green or
Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised learning technique,
hence it takes labeled input data, which means it contains input with the corresponding output.
The main goal of the Classification algorithm is to identify the category of a given dataset, and these
algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below diagram,
there are two classes, class A and Class B. These classes have features that are similar to each other
and dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier. There are two
types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
1. Lazy Learners: Lazy Learner firstly stores the training dataset and wait until it receives the
test dataset. In Lazy learner case, classification is done on the basis of the most related data
stored in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners:Eager Learners develop a classification model based on a training dataset
before receiving a test dataset. Opposite to Lazy learners, Eager Learner takes more time in
learning, and less time in prediction. Example: Decision Trees, Naïve Bayes, ANN.
Classification Algorithms can be further divided mainly into two categories:
o Linear Models
o Logistic Regression
o Support Vector Machines
o Non-linear Models
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification.
Once our model is completed, it is necessary to evaluate its performance, whether it is a Classification
or a Regression model. For evaluating a Classification model, we have the following ways:
1. Log Loss or Cross-Entropy Loss:
o It is used for evaluating the performance of a classifier whose output is a probability value
between 0 and 1.
o For a good binary Classification model, the value of log loss should be near to 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o The lower log loss represents the higher accuracy of the model.
o For Binary classification, cross-entropy can be calculated as:
−(y log(p) + (1 − y) log(1 − p))
2. Confusion Matrix:
o The confusion matrix provides us a matrix/table as output and describes the performance of
the model.
o It is also known as the error matrix.
o The matrix consists of predictions result in a summarized form, which has a total number of
correct predictions and incorrect predictions. The matrix looks like as below table:
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area
Under the Curve.
o It is a graph that shows the performance of the classification model at different thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-ROC
Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-axis and
FPR(False Positive Rate) on X-axis.
Classification algorithms can be used in many different applications, such as email spam detection,
speech recognition, identification of cancer tumour cells, drug classification, and biometric
identification.
Logistic Regression
o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
o Logistic Regression is much similar to Linear Regression except in how they are used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image
is showing the logistic function:
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.
o In logistic regression, we use the concept of the threshold value, which defines the probability
of either 0 or 1. Such as values above the threshold value tends to 1, and a value below the
threshold values tends to 0.
The Logistic Regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:
o We know the equation of a straight line: y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation
by (1 − y): y / (1 − y), which is 0 for y = 0 and infinity for y = 1.
o But we need a range between −infinity and +infinity, so taking the logarithm of the equation, it
becomes: log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
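As an illustration of these ideas, a minimal sketch of fitting a binomial logistic regression classifier with scikit-learn on a small synthetic dataset (the data and parameters are made up for demonstration):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A small synthetic two-feature binary classification problem
x, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
classifier = LogisticRegression()
classifier.fit(x, y)
print(classifier.predict_proba(x[:3]))   # probabilities between 0 and 1
print(classifier.predict(x[:3]))         # class labels after applying the threshold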
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis
of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence
each feature individually contributes to identifying that it is an apple, without depending on the
others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A given the observed evidence B.
P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.
P(A) is Prior probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: Probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether we should play or not on a particular day according to the
weather conditions. To solve this problem, we need to follow the below steps:
Problem: If the weather is sunny, then the Player should play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table of the weather conditions:
Weather     No    Yes
Overcast    0     5
Rainy       2     2
Sunny       2     3
Total       4     10
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.35
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = 0.3*0.71/0.35 = 0.60
P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5*0.29/0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.
o Its main limitation is that Naive Bayes assumes all features are independent or unrelated, so it
cannot learn the relationship between features.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes that
these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomial
distributed. It is primarily used for document classification problems, it means a particular
document belongs to which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similar to the Multinomial classifier, but the
predictor variables are the independent Booleans variables. Such as if a particular word is
present or not in a document. This model is also famous for document classification tasks.
Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use the
"user_data" dataset, which we have used in our other classification model. Therefore we can easily
compare the Naive Bayes model with the other models.
Steps to implement:
In this step, we will pre-process/prepare the data so that we can use it efficiently in our code. It is
similar as we did in data-pre-processing. The code for this is given below:
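A minimal sketch of this pre-processing step, assuming the Age and Estimated Salary columns are the features and the Purchased column is the target (the column positions are assumptions):
# Importing the libraries and the dataset
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values   # Age and Estimated Salary (assumed column positions)
y = dataset.iloc[:, 4].values        # Purchased (assumed column position)

# Splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)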
In the above code, we have loaded the dataset into our program using "dataset =
pd.read_csv('user_data.csv'). The loaded dataset is divided into training and test set, and then we
have scaled the feature variable.
After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below is the
code for it:
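A minimal sketch of the fitting step, assuming scikit-learn's GaussianNB class:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(x_train, y_train)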
In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We can
also use other classifiers as per our requirement.
Output:
Now we will predict the test set result. For this, we will create a new predictor variable y_pred, and
will use the predict function to make the predictions.
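A minimal sketch of the prediction step, assuming the classifier object fitted above:
y_pred = classifier.predict(x_test)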
Output:
The above output shows the result for the prediction vector y_pred and the real vector y_test. We can see
that some predictions are different from the real values, which are the incorrect predictions.
Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix. Below is
the code for it:
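A minimal sketch, assuming scikit-learn's confusion_matrix function:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)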
Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions, and
65+25=90 correct predictions.
Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code for it:
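A sketch of the decision-boundary plot, assuming the scaled two-feature training data from the pre-processing sketch above (colors, step size, and axis labels are illustrative):
from matplotlib.colors import ListedColormap

x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
# Color the regions according to the class predicted by the classifier
mtp.contourf(X1, X2,
             classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
# Plot the actual training observations on top of the colored regions
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()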
Output:
In the above output, we can see that the Naïve Bayes classifier has segregated the data points with a
fine boundary. It is a Gaussian curve, as we have used the GaussianNB classifier in our code.
In the same way, we can visualize the test set result. The output for the test set data shows that the
classifier has created a Gaussian curve to divide the "purchased" and "not purchased" classes. There are
some wrong predictions, which we have counted in the confusion matrix, but still it is a pretty good classifier.
o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.
o Below diagram explains the general structure of a decision tree:
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the
two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute
and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
move further. It continues the process until it reaches the leaf node of the tree. The complete process
can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide the S into subsets that contains possible values for the best attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in step
3. Continue this process until a stage is reached where you cannot further classify the nodes,
and the final node is called a leaf node.
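The same recursive procedure is implemented by library classifiers; an illustrative sketch with scikit-learn's DecisionTreeClassifier on the built-in iris dataset (any labeled dataset would do):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)
# criterion="entropy" selects splits by information gain; "gini" would use the Gini index
classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)
classifier.fit(x, y)
print(classifier.predict(x[:2]))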
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary
attribute by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further gets split into one
decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offers and Declined offer). Consider the below diagram:
While implementing a Decision tree, the main issue that arises is how to select the best attribute for the
root node and for the sub-nodes. To solve such problems there is a technique which is called
as Attribute selection measure or ASM. By this measurement, we can easily select the best attribute
for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) − [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in
the data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
S = total number of samples
P(yes) = probability of yes
P(no) = probability of no
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σj Pj^2
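As an illustration, both impurity measures can be computed directly from the class probabilities of a node (a standalone sketch, not part of the tutorial code):
import numpy as nm

def entropy(p):
    p = nm.asarray(p, dtype=float)
    p = p[p > 0]                       # ignore empty classes to avoid log(0)
    return -nm.sum(p * nm.log2(p))

def gini(p):
    p = nm.asarray(p, dtype=float)
    return 1.0 - nm.sum(p ** 2)

print(entropy([0.5, 0.5]))   # 1.0 -> maximum impurity for two classes
print(gini([0.5, 0.5]))      # 0.5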
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learned tree without reducing accuracy is therefore known as Pruning. There are mainly two types of tree pruning techniques used: Cost Complexity Pruning and Reduced Error Pruning.
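As one concrete illustration of cost complexity pruning (assuming the scikit-learn library, which is not otherwise used in these notes), the CART-based tree estimator exposes it through the ccp_alpha parameter; the dataset and the alpha value below are arbitrary:

# Post-pruning via cost complexity pruning in scikit-learn (usage sketch only).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# A larger ccp_alpha removes more branches, giving a smaller tree.
print(unpruned.tree_.node_count, pruned.tree_.node_count)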
Advantages of the Decision Tree
o It is simple to understand, as it follows the same process which a human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
CART is a predictive algorithm used in machine learning that explains how the values of a target variable can be predicted from the other variables. It is a decision tree in which each fork is a split on a predictor variable and each leaf node holds a prediction for the target variable.
In the decision tree, nodes are split into sub-nodes on the basis of a threshold value of an attribute. The root node is taken as the training set and is split into two by considering the best attribute and threshold value. The resulting subsets are then split using the same logic. This continues until the last pure sub-set is found, or until the maximum possible number of leaves in the growing tree is reached.
The impurity of a split is measured with the Gini index, where:
o a value of 0 depicts that all the elements belong to a single class, i.e. only one class exists in the node,
o a value of 1 signifies that the elements are randomly distributed across the various classes, and
o a value of 0.5 denotes that the elements are uniformly distributed over some classes.
Mathematically, we can write Gini Impurity as follows:
Gini Impurity = 1 - Σᵢ (pᵢ)²
where pᵢ is the probability of an element being classified into class i.
Classification tree
A classification tree is an algorithm in which the target variable is categorical. The algorithm is used to identify the "class" within which the target variable is most likely to fall. Classification trees are used when the dataset needs to be split into classes that belong to the response variable (like yes or no).
Regression tree
A regression tree is an algorithm in which the target variable is continuous and the tree is used to predict its value. Regression trees are used when the response variable is continuous, for example when the response variable is the temperature of the day.
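To make the distinction concrete, here is a hedged sketch using scikit-learn's two tree estimators; the toy feature values and targets below are invented for illustration:

# Classification tree vs. regression tree on invented toy data.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[30], [25], [40], [15], [35]]        # e.g. humidity readings (made up)

# Classification tree: the target is a class label ("yes"/"no").
clf = DecisionTreeClassifier().fit(X, ["no", "no", "yes", "no", "yes"])
print(clf.predict([[38]]))                # -> a class label

# Regression tree: the target is a continuous value (e.g. temperature of the day).
reg = DecisionTreeRegressor().fit(X, [21.0, 19.5, 27.0, 15.0, 25.5])
print(reg.predict([[38]]))                # -> a numeric prediction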
Greedy algorithm: The input space is divided using a greedy method known as recursive binary splitting. This is a numerical procedure in which all of the values are lined up, and several candidate split points are tried and assessed using a cost function; the split with the lowest cost is chosen.
Stopping Criterion: As it works its way down the tree with the training data, the recursive binary splitting method described above must know when to stop splitting. The most frequent halting method is to require a minimum number of training instances assigned to every leaf node. If the count at a node is smaller than this minimum, the split is rejected and the node is taken as a final leaf node.
Tree pruning: A decision tree's complexity is defined as the number of splits in the tree. Trees with fewer branches are recommended, as they are simpler to grasp and less prone to overfit the data. Working through each leaf node in the tree and evaluating the effect of deleting it using a hold-out test set is the quickest and simplest pruning approach.
Data preparation for the CART: No special data preparation is required for the CART
algorithm.
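The greedy recursive binary splitting and the minimum-node-size stopping criterion described above can be sketched as a toy Python implementation. The data layout (a list of ([feature values], class label) pairs) is an assumption made for this illustration; this is not the reference CART code:

# Toy sketch of CART-style recursive binary splitting with a Gini cost
# function and a minimum-node-size stopping criterion.
def gini_cost(groups, classes):
    # Weighted Gini index of a candidate split (lower is better).
    n = sum(len(g) for g in groups)
    cost = 0.0
    for g in groups:
        if not g:
            continue
        labels = [row[1] for row in g]
        impurity = 1.0 - sum((labels.count(c) / len(g)) ** 2 for c in classes)
        cost += impurity * (len(g) / n)
    return cost

def best_split(rows):
    # Greedy step: try every observed value of every feature as a threshold.
    classes = list(set(r[1] for r in rows))
    best = None
    for f in range(len(rows[0][0])):
        for r in rows:
            threshold = r[0][f]
            left = [x for x in rows if x[0][f] < threshold]
            right = [x for x in rows if x[0][f] >= threshold]
            cost = gini_cost((left, right), classes)
            if best is None or cost < best["cost"]:
                best = {"feature": f, "threshold": threshold,
                        "cost": cost, "groups": (left, right)}
    return best

def build_tree(rows, min_size=2):
    # Stopping criterion: make a majority-class leaf when the node is pure
    # or holds fewer than min_size rows.
    labels = [r[1] for r in rows]
    if len(set(labels)) == 1 or len(rows) < min_size:
        return max(set(labels), key=labels.count)
    split = best_split(rows)
    left, right = split["groups"]
    if not left or not right:                      # degenerate split -> leaf
        return max(set(labels), key=labels.count)
    return {"feature": split["feature"], "threshold": split["threshold"],
            "left": build_tree(left, min_size), "right": build_tree(right, min_size)}

data = [([2.7], "NO"), ([1.3], "NO"), ([3.6], "NO"), ([7.5], "YES"), ([9.0], "YES")]
print(build_tree(data))   # splits once at threshold 7.5: a NO leaf and a YES leaf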
Advantages of CART
Results are simple to understand and interpret.
Classification and regression trees are nonparametric and nonlinear.
Classification and regression trees implicitly perform feature selection.
Outliers have no meaningful effect on CART.
It requires minimal supervision and produces easy-to-understand models.
Limitations of CART
Overfitting.
High Variance.
Low bias.
The tree structure may be unstable.
Applications of the CART algorithm
For quick Data insights.
In Blood Donors Classification.
For environmental and ecological data.
In the financial sectors.
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively
(repeatedly) dichotomizes(divides) features into two or more groups at each step.
Most generally, ID3 is used only for classification problems with nominal (categorical) features.
Dataset description
Here, we'll be using a sample dataset of COVID-19 infection. A preview of the entire dataset is shown below.
+----+-------+-------+------------------+----------+
| ID | Fever | Cough | Breathing issues | Infected |
+----+-------+-------+------------------+----------+
| 1 | NO | NO | NO | NO |
+----+-------+-------+------------------+----------+
| 2 | YES | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 3 | YES | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 4 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 5 | YES | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 6 | NO | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 7 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 8 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 9 | NO | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 10 | YES | YES | NO | YES |
+----+-------+-------+------------------+----------+
| 11 | NO | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 12 | NO | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 13 | NO | YES | YES | NO |
+----+-------+-------+------------------+----------+
| 14 | YES | YES | NO | NO |
+----+-------+-------+------------------+----------+
The columns are self-explanatory. YES and NO stand for Yes and No respectively. The values or classes in the Infected column are YES and NO.
The columns used to make the decision nodes, viz. 'Breathing Issues', 'Cough' and 'Fever', are called feature columns or just features, and the column used for the leaf nodes, i.e. 'Infected', is called the target column.
Metrics in ID3
As mentioned previously, the ID3 algorithm selects the best feature at each step while building a
Decision tree.
Before you ask, the answer to the question 'How does ID3 select the best feature?' is that ID3 uses Information Gain, or simply Gain, to find the best feature.
Information Gain calculates the reduction in entropy and measures how well a given feature separates or classifies the target classes. The feature with the highest Information Gain is selected as the best one.
In simple words, Entropy is the measure of disorder, and the Entropy of a dataset is the measure of disorder in the target feature of the dataset.
In the case of binary classification (where the target column has only two types of classes), entropy is 0 if all values in the target column are homogenous (similar) and is 1 if the target column has an equal number of values for both classes.
Denoting our dataset as S, its entropy is calculated as:
Entropy(S) = - Σᵢ pᵢ * log₂(pᵢ) , for i = 1 to n
where,
n is the total number of classes in the target column (in our case n = 2, i.e. YES and NO), and
pᵢ is the probability of class 'i', i.e. the ratio of the "number of rows with class i in the target column" to the total number of rows in the dataset.
Information Gain for a feature column A is calculated as:
IG(S, A) = Entropy(S) - Σᵥ (|Sᵥ| / |S|) * Entropy(Sᵥ)
where Sᵥ is the set of rows in S for which the feature column A has value v, |Sᵥ| is the number of rows in Sᵥ, and likewise |S| is the number of rows in S.
ID3 Steps
1. Calculate the Information Gain of each feature.
2. Considering that all rows don't belong to the same class, split the dataset S into subsets using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information Gain.
4. If all rows belong to the same class, make the current node a leaf node with the class as its label.
5. Repeat for the remaining features until we run out of features, or the decision tree has all leaf nodes.
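A compact sketch of these steps in Python is shown below. The entropy and information-gain helpers are repeated here so the snippet stands alone, and the rows are assumed to be dictionaries shaped like the table above; this is an illustrative sketch, not the original ID3 implementation.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    gain = entropy([r[target] for r in rows])
    for v in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == v]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

def id3(rows, features, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                       # step 4: pure node -> leaf
        return labels[0]
    if not features:                                # no features left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, f, target))   # step 1
    node = {"feature": best, "branches": {}}        # step 3: decision node
    for v in set(r[best] for r in rows):            # step 2: split into subsets
        subset = [r for r in rows if r[best] == v]
        node["branches"][v] = id3(subset, [f for f in features if f != best], target)   # step 5
    return node

# Usage on the first four rows of the COVID-19 table above:
rows = [{"Fever": "NO", "Cough": "NO", "Breathing issues": "NO", "Infected": "NO"},
        {"Fever": "YES", "Cough": "YES", "Breathing issues": "YES", "Infected": "YES"},
        {"Fever": "YES", "Cough": "YES", "Breathing issues": "NO", "Infected": "NO"},
        {"Fever": "YES", "Cough": "NO", "Breathing issues": "YES", "Infected": "YES"}]
print(id3(rows, ["Fever", "Cough", "Breathing issues"], "Infected"))
# -> a small tree with 'Breathing issues' at the root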
The tree is built by repeatedly selecting the feature with the maximum Information Gain (IG). We'll calculate the IG for each of the features now, but for that, we first need to calculate the entropy of S.
From the total of 14 rows in our dataset S, there are 8 rows with the target value YES and 6 rows with the target value NO, so:
Entropy(S) = - (8/14) * log₂(8/14) - (6/14) * log₂(6/14) ≈ 0.99
First, consider the feature Fever. In this feature there are 8 rows having the value YES and 6 rows having the value NO.
In the 8 rows with YES for Fever, there are 6 rows having target value YES and 2 rows having target value NO.
In the 6 rows with NO for Fever, there are 2 rows having target value YES and 4 rows having target value NO.
The block below demonstrates the calculation of Information Gain for Fever.
# total rows
|S| = 14

For v = YES, |Sᵥ| = 8
Entropy(Sᵥ) = - (6/8) * log₂(6/8) - (2/8) * log₂(2/8) = 0.81

For v = NO, |Sᵥ| = 6
Entropy(Sᵥ) = - (2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.91

# Expanding the summation in the IG formula:
IG(S, Fever) = Entropy(S) - (|Sʏᴇꜱ| / |S|) * Entropy(Sʏᴇꜱ) - (|Sɴᴏ| / |S|) * Entropy(Sɴᴏ)

∴ IG(S, Fever) = 0.99 - (8/14) * 0.81 - (6/14) * 0.91 = 0.13
Next, we calculate the IG for the features “Cough” and “Breathing issues”.
IG(S, Cough) = 0.04
IG(S, BreathingIssues) = 0.40
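These numbers can be checked directly from the class counts in the table above. The quick standalone check below is an addition to the original walkthrough:

import math

def H(*proportions):
    # Entropy from a node's class proportions.
    return sum(-p * math.log2(p) for p in proportions if p > 0)

entropy_S = H(8/14, 6/14)                                    # ≈ 0.99

# Breathing issues: YES in 8 rows (7 infected), NO in 6 rows (1 infected).
ig_breathing = entropy_S - (8/14) * H(7/8, 1/8) - (6/14) * H(1/6, 5/6)
# Cough: YES in 10 rows (5 infected), NO in 4 rows (3 infected).
ig_cough = entropy_S - (10/14) * H(5/10, 5/10) - (4/14) * H(3/4, 1/4)

print(round(ig_breathing, 2), round(ig_cough, 2))            # -> 0.4 0.04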
Since the feature Breathing issues has the highest Information Gain, it is used to create the root node.
Hence, after this initial step our tree looks like this:
Next, from the remaining two unused features, namely Fever and Cough, we decide which one is the best for the left branch of Breathing Issues.
Since the left branch of Breathing Issues denotes YES, we will work with the subset of the original
data i.e the set of rows having YES as the value in the Breathing Issues column. These 8 rows are
shown below:
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | NO |
+-------+-------+------------------+----------+
Next, we calculate the IG for the features Fever and Cough using the subset Sʙʏ (the rows shown above, where Breathing Issues = YES).
Note: For IG calculation the Entropy will be calculated from the subset Sʙʏ and not the original
dataset S.
IG(Sʙʏ, Fever) = 0.20
IG(Sʙʏ, Cough) = 0.09
IG of Fever is greater than that of Cough, so we select Fever as the left branch of Breathing Issues:
Next, we find the feature with the maximum IG for the right branch of Breathing Issues. But, since there is only one unused feature left, we have no other choice but to make it the right branch of the root node.
There are now no unused features left, so we stop splitting here and move on to the final step of creating the leaf nodes.
For the left leaf node of Fever, we look at the subset of rows from the original dataset that have both Breathing Issues and Fever equal to YES.
Since all the values in the target column of this subset are YES, we label the left leaf node as YES, but to make it more readable we label it Infected.
Similarly, for the right node of Fever we look at the subset of rows from the original dataset that have Breathing Issues equal to YES and Fever equal to NO.
We repeat the same process for the node Cough; however, here both the left and right leaves turn out to be the same, i.e. NO or Not Infected.