UNIT-2

SUPERVISED LEARNING ALGORITHMS

Learning a Class from Examples, Linear, Non-linear, Multi-class and Multi-label classification,
Decision Trees: ID3, Classification and Regression Trees (CART), Regression: Linear Regression,
Multiple Linear Regression, Logistic Regression, Bayesian Network, Bayesian Classifier

Regression Analysis in Machine learning

Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable while the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

Example: Suppose there is a marketing company A, which runs various advertisements every year and gets sales from them. A record of the advertisements made by the company in the last five years and the corresponding sales forms such a dataset.

Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time series modelling, and determining the cause-and-effect relationship between variables.

In regression, we plot a graph between the variables which best fits the given datapoints, and using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through the datapoints on the target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." The distance between the datapoints and the line tells whether the model has captured a strong relationship or not.

Some examples of regression can be as:

o Prediction of rain using temperature and other factors


o Determining Market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variable, or which are used to
predict the values of the dependent variable, are called independent variables, also called
predictors.
o Outliers: An outlier is an observation which contains either a very low value or a very high value
in comparison to the other observed values. An outlier may distort the result, so it should be
avoided.
o Multicollinearity: If the independent variables are highly correlated with each other, then such a
condition is called multicollinearity. It should not be present in the dataset, because it creates
problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not
well with the test dataset, then such a problem is called overfitting. And if our algorithm does not
perform well even with the training dataset, then such a problem is called underfitting.

Why do we use Regression Analysis?

As mentioned above, regression analysis helps in the prediction of a continuous variable. There are various real-world scenarios where we need future predictions, such as weather conditions, sales figures, marketing trends, etc. For such cases we need a technique which can make predictions accurately, and regression analysis is such a statistical method, used widely in machine learning and data science. Below are some other reasons for using regression analysis:

o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important factor, the
least important factor, and how each factor is affecting the other factors.

Types of Regression

There are various types of regression used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all the regression methods analyse the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression

Linear Regression:

o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest regression algorithms and shows the relationship between
continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in a linear regression model can be illustrated graphically;
for example, we might predict the salary of an employee on the basis of years of
experience.
o Below is the mathematical equation for Linear regression:

Y= aX+b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.
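
As a minimal sketch of how a and b could be estimated in practice (assuming NumPy is available; the experience/salary numbers below are made up purely for illustration):

import numpy as np

# Hypothetical data: years of experience (X) and salary (Y)
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([30000, 35000, 41000, 44000, 50000], dtype=float)

# np.polyfit with degree 1 returns the slope a and intercept b of Y = aX + b,
# fitted by least squares
a, b = np.polyfit(X, Y, 1)
print("slope a:", a, "intercept b:", b)
print("prediction for X = 6:", a * 6 + b)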

Some popular applications of linear regression are:

o Analyzing trends and sales estimates


o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.

Logistic Regression:

o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a binary
or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in
terms of how it is used.
o Logistic regression uses the sigmoid function (also called the logistic function) to model the data.
The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output between the 0 and 1 value.
o x = input to the function
o e = base of the natural logarithm.

When we provide the input values (data) to the function, it gives the S-curve as follows:

o It uses the concept of a threshold level: values above the threshold are rounded up to 1,
and values below the threshold are rounded down to 0.
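
A small sketch of the sigmoid function and the thresholding step (assuming NumPy; the score values are made up):

import numpy as np

def sigmoid(x):
    # Logistic (sigmoid) function: maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])
probs = sigmoid(scores)

# Threshold at 0.5: probabilities above the threshold become class 1, the rest class 0
labels = (probs >= 0.5).astype(int)
print(probs)
print(labels)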

There are three types of logistic regression:

o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)

Polynomial Regression:

o Polynomial Regression is a type of regression which models the non-linear dataset using a
linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of x
and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear
fashion, so for such case, linear regression will not best fit to those datapoints. To cover such
datapoints, we need Polynomial regression.
o In polynomial regression, the original features are transformed into polynomial features
of a given degree and then modelled using a linear model, which means the datapoints are
best fitted using a polynomial curve.

o The equation for polynomial regression is derived from the linear regression equation: the linear
equation Y = b0 + b1x is extended to the polynomial regression equation
Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is
our independent/input variable.

Support Vector Regression:

Support Vector Machine is a supervised learning algorithm which can be used for regression as well
as classification problems. So if we use it for regression problems, then it is termed as Support Vector
Regression.

Support Vector Regression is a regression algorithm which works for continuous variables. Below are
some keywords which are used in Support Vector Regression:

o Kernel: It is a function used to map lower-dimensional data into a higher-dimensional space.
o Hyperplane: In a general SVM, it is the separation line between two classes, but in SVR, it is the
line which helps to predict the continuous variable and covers most of the datapoints.
o Boundary lines: Boundary lines are the two lines apart from the hyperplane which create a
margin around the datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane,
from either class.

In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum number
of datapoints are covered in that margin. The main goal of SVR is to consider the maximum
datapoints within the boundary lines and the hyperplane (best-fit line) must contain a maximum
number of datapoints. Consider the below image:

Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
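
A minimal SVR sketch using scikit-learn (assuming it is installed; the one-dimensional data below are invented for illustration):

import numpy as np
from sklearn.svm import SVR

# Hypothetical training data
X = np.arange(1, 11).reshape(-1, 1).astype(float)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 20.9])

# epsilon controls the width of the margin (boundary lines) around the hyperplane
model = SVR(kernel="rbf", C=100, epsilon=0.5)
model.fit(X, y)

print(model.predict([[5.5]]))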

Decision Tree Regression:

o Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
o It can solve problems for both categorical and numerical data
o Decision Tree regression builds a tree-like structure in which each internal node represents
a "test" on an attribute, each branch represents the result of the test, and each leaf node
represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (dataset), which splits
into left and right child nodes (subsets of dataset). These child nodes are further divided into
their children node, and themselves become the parent node of those nodes. Consider the
below image:
The above image shows an example of Decision Tree regression; here, the model is trying to predict the
choice of a person between a sports car and a luxury car.
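
A minimal Decision Tree regression sketch with scikit-learn (assumed installed); the level/salary values are made up:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: position level vs. salary (in thousands)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000])

# max_depth limits how many times the tree is allowed to split
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.predict([[6.5]]))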

Random Forest Regression:

o Random forest is one of the most powerful supervised learning algorithms, capable of
performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines multiple
decision trees and predicts the final output as the average of the individual tree outputs. The
combined decision trees are called base models, and the ensemble can be represented more formally
as:

g(x) = (f1(x) + f2(x) + ... + fN(x)) / N, where N is the number of trees

o Random forest uses the Bagging, or Bootstrap Aggregation, technique of ensemble learning, in
which the aggregated decision trees run in parallel and do not interact with each other.
o With the help of Random Forest regression, we can prevent overfitting in the model by
creating random subsets of the dataset.
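
A minimal Random Forest regression sketch with scikit-learn (assumed installed); the data are the same made-up level/salary values used above:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000])

# n_estimators is the number of decision trees whose outputs are averaged
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict([[6.5]]))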
Ridge Regression:

o Ridge regression is one of the most robust versions of linear regression, in which a small
amount of bias is introduced so that we can get better long-term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. The penalty
term is computed by multiplying lambda by the sum of the squared weights of the individual
features.

o The equation (cost function) for ridge regression will be:

Cost = Σ(yi - ŷi)^2 + λ Σ bj^2

o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the
model. It is also called as L2 regularization.
o It helps to solve the problems if we have more parameters than samples.

Lasso Regression:

o Lasso regression is another regularization technique to reduce the complexity of the model.
o It is similar to Ridge Regression except that the penalty term contains the absolute values of the
weights instead of the squares of the weights.
o Since it uses absolute values, it can shrink a slope coefficient all the way to 0, whereas Ridge
Regression can only shrink it close to 0.
o It is also called L1 regularization. The equation (cost function) for Lasso regression will be:

Cost = Σ(yi - ŷi)^2 + λ Σ |bj|
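
A short sketch contrasting Ridge and Lasso with scikit-learn (assumed installed); the two-feature data below are invented, with the second feature deliberately correlated with the first:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Hypothetical data with two correlated features
X = np.array([[1, 2.0], [2, 4.1], [3, 5.9], [4, 8.2], [5, 9.9]])
y = np.array([3.0, 5.1, 7.2, 9.1, 11.0])

# alpha plays the role of lambda: larger values mean a stronger penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: squares of the weights
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: absolute values of the weights

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)  # some coefficients may be shrunk exactly to 0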

Linear Regression in Machine Learning

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more
independent (x) variables, hence it is called linear regression. Since linear regression shows a linear
relationship, it describes how the value of the dependent variable changes according to the value of
the independent variable.

The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:

Mathematically, we can represent a linear regression as:

y= a0+a1x+ ε

Here,

Y= Dependent Variable (Target Variable)


X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

o Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

Linear Regression Line

A straight line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:

o Positive Linear Relationship:


If the dependent variable increases on the Y-axis and independent variable increases on X-
axis, then such a relationship is termed as a Positive linear relationship.

o Negative Linear Relationship:


If the dependent variable decreases on the Y-axis and independent variable increases on the
X-axis, then such a relationship is called a negative linear relationship.
Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line, which means that the error
between the predicted values and the actual values should be minimized. The best fit line will have the least
error.

Different values of the weights or line coefficients (a0, a1) give different regression lines,
so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate these we use a
cost function.

Cost function-

o The different values for weights or coefficient of lines (a0, a1) gives the different line of
regression, and the cost function is used to estimate the values of the coefficient for the best
fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the
input variable to the output variable. This mapping function is also known as Hypothesis
function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average
of the squared errors between the predicted values and the actual values. For the above linear equation, MSE can be calculated as:

MSE = (1/N) Σ (yi - (a1xi + a0))^2

Where,
N = total number of observations
yi = actual value
(a1xi + a0) = predicted value.

Residuals: The distance between the actual value and the predicted value is called the residual. If the
observed points are far from the regression line, the residuals will be high and so the cost function
will be high. If the scatter points are close to the regression line, the residuals will be small and hence
so will the cost function.

Gradient Descent:

o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
o This is done by starting from randomly selected coefficient values and then iteratively updating
them to reach the minimum of the cost function.
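
A bare-bones gradient descent sketch for simple linear regression (plain NumPy; the data, learning rate, and iteration count are arbitrary choices for illustration):

import numpy as np

# Hypothetical data for y = a1*x + a0
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

a0, a1 = 0.0, 0.0        # start from arbitrary coefficient values
learning_rate = 0.01
n = len(x)

for _ in range(2000):
    y_pred = a1 * x + a0
    error = y_pred - y
    # Gradients of the MSE cost with respect to a0 and a1
    grad_a0 = (2.0 / n) * np.sum(error)
    grad_a1 = (2.0 / n) * np.sum(error * x)
    a0 -= learning_rate * grad_a0
    a1 -= learning_rate * grad_a1

print("a0:", a0, "a1:", a1, "MSE:", np.mean((a1 * x + a0 - y) ** 2))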

Model Performance:

The Goodness of fit determines how the line of regression fits the set of observations. The process of
finding the best model out of various models is called optimization. It can be achieved by below
method:

1. R-squared method:

o R-squared is a statistical method that determines the goodness of fit.


o It measures the strength of the relationship between the dependent and independent variables
on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted values and the
actual values and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple determination for
multiple regression.
o It can be calculated from the below formula:

R-squared = Explained variation / Total variation
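
A small sketch of computing R-squared by hand (NumPy; the actual/predicted values are made up):

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.2, 6.9, 9.3, 10.8])

ss_res = np.sum((y_actual - y_pred) ** 2)              # residual (unexplained) sum of squares
ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)

# The same value is returned by sklearn.metrics.r2_score(y_actual, y_pred)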

Assumptions of Linear Regression

Below are some important assumptions of Linear Regression. These are some formal checks while
building a Linear Regression model, which ensures to get the best possible result from the given
dataset.
o Linear relationship between the features and target:
Linear regression assumes the linear relationship between the dependent and independent
variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors and the
target variable; in other words, it is difficult to determine which predictor variable is
affecting the target variable and which is not. So, the model assumes either little or no
multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern distribution of
data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution pattern. If
error terms are not normally distributed, then confidence intervals will become either too
wide or too narrow, which may cause difficulties in finding coefficients.
This can be checked using a Q-Q plot: if the plot shows a straight line without any deviation,
the errors are normally distributed.

o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. If there is any
correlation in the error terms, it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.

Simple Linear Regression in Machine Learning

Simple Linear Regression is a type of Regression algorithms that models the relationship between a
dependent variable and a single independent variable. The relationship shown by a Simple Linear
Regression model is linear or a sloped straight line, hence it is called Simple Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a continuous/real
value. However, the independent variable can be measured on continuous or categorical values.

Simple Linear regression algorithm has mainly two objectives:

o Model the relationship between the two variables. Such as the relationship between
Income and expenditure, experience and Salary, etc.
o Forecasting new observations. Such as Weather forecasting according to temperature,
Revenue of a company according to the investments in a year, etc.
Simple Linear Regression Model:

The Simple Linear Regression model can be represented using the below equation:

y= a0+a1x+ ε

Where,

a0= It is the intercept of the Regression line (can be obtained putting x=0)
a1= It is the slope of the regression line, which tells whether the line is increasing or decreasing.
ε = The error term. (For a good model it will be negligible)

Implementation of Simple Linear Regression Algorithm using Python

Problem Statement example for Simple Linear Regression:

Here we are taking a dataset that has two variables: salary (dependent variable) and experience
(independent variable). The goals of this problem are:

o To find out whether there is any correlation between these two variables
o To find the best fit line for the dataset
o To see how the dependent variable changes as the independent variable changes

In this section, we will create a Simple Linear Regression model to find out the best fitting line for
representing the relationship between these two variables.

To implement the Simple Linear regression model in machine learning using Python, we need to
follow the below steps:

Step-1: Data Pre-processing

The first step for creating the Simple Linear Regression model is data pre-processing. We have
already done it earlier in this tutorial. But there will be some changes, which are given in the below
steps:

o First, we will import the three important libraries, which will help us for loading the dataset,
plotting the graphs, and creating the Simple Linear Regression model.

1. import numpy as nm
2. import matplotlib.pyplot as mtp
3. import pandas as pd

o Next, we will load the dataset into our code:

1. data_set= pd.read_csv('Salary_Data.csv')

By executing the above line of code (ctrl+ENTER), we can read the dataset on our Spyder IDE screen
by clicking on the variable explorer option.
The above output shows the dataset, which has two variables: Salary and Experience.

After that, we need to extract the dependent and independent variables from the given dataset. The
independent variable is years of experience, and the dependent variable is salary. Below is code for it:

1. x= data_set.iloc[:, :-1].values
2. y= data_set.iloc[:, 1].values

In the above lines of code, for x variable, we have taken -1 value since we want to remove the last
column from the dataset. For y variable, we have taken 1 value as a parameter, since we want to
extract the second column and indexing starts from the zero.

By executing the above line of code, we will get the output for X and Y variable as:
In the above output image, we can see the X (independent) variable and Y (dependent) variable has
been extracted from the given dataset.

o Next, we will split both variables into the test set and training set. We have 30 observations,
so we will take 20 observations for the training set and 10 observations for the test set. We are
splitting our dataset so that we can train our model using a training dataset and then test the
model using a test dataset. The code for this is given below:

1. # Splitting the dataset into training and test set.


2. from sklearn.model_selection import train_test_split
3. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)

By executing the above code, we will get x-test, x-train and y-test, y-train dataset. Consider the below
images:

Test-dataset:
Training Dataset:

o For simple linear Regression, we will not use Feature Scaling. Because Python libraries take
care of it for some cases, so we don't need to perform it here. Now, our dataset is well
prepared to work on it and we are going to start building a Simple Linear Regression model
for the given problem.

Step-2: Fitting the Simple Linear Regression to the Training Set:


Now the second step is to fit our model to the training dataset. To do so, we will import
the LinearRegression class of the linear_model library from scikit-learn. After importing the
class, we create an object of the class named regressor. The code for this is given
below:

1. #Fitting the Simple Linear Regression model to the training dataset


2. from sklearn.linear_model import LinearRegression
3. regressor= LinearRegression()
4. regressor.fit(x_train, y_train)

In the above code, we have used a fit() method to fit our Simple Linear Regression object to the
training set. In the fit() function, we have passed the x_train and y_train, which is our training dataset
for the dependent and an independent variable. We have fitted our regressor object to the training set
so that the model can easily learn the correlations between the predictor and target variables. After
executing the above lines of code, we will get the below output.

Output:

Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Step: 3. Prediction of test set result:

We have now fitted our model on the training set, which contains a dependent variable (salary) and an
independent variable (Experience). So, now, our model is ready to predict the output for new
observations. In this step, we will provide the test dataset (new observations) to the model to check
whether it can predict the correct output or not.

We will create a prediction vector y_pred, and x_pred, which will contain predictions of test dataset,
and prediction of training set respectively.

1. #Prediction of Test and Training set result


2. y_pred= regressor.predict(x_test)
3. x_pred= regressor.predict(x_train)

On executing the above lines of code, two variables named y_pred and x_pred will be generated in the
variable explorer, containing the salary predictions for the test set and the training set respectively.

Output:

You can check the variable by clicking on the variable explorer option in the IDE, and also compare
the result by comparing values from y_pred and y_test. By comparing these values, we can check how
good our model is performing.

Step: 4. visualizing the Training set results:

Now in this step, we will visualize the training set result. To do so, we will use the scatter() function
of the pyplot library, which we have already imported in the pre-processing step. The scatter ()
function will create a scatter plot of observations.

In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary of
employees. In the function, we will pass the real values of training set, which means a year of
experience x_train, training set of Salaries y_train, and color of the observations. Here we are taking a
green color for the observation, but it can be any color as per the choice.

Now, we need to plot the regression line, so for this, we will use the plot() function of the pyplot
library. In this function, we will pass the years of experience for training set, predicted salary for
training set x_pred, and color of the line.

Next, we will give the title for the plot. So here, we will use the title() function of the pyplot library
and pass the name ("Salary vs Experience (Training Dataset)".

After that, we will assign labels for x-axis and y-axis using xlabel() and ylabel() function.

Finally, we will represent all above things in a graph using show(). The code is given below:

1. mtp.scatter(x_train, y_train, color="green")


2. mtp.plot(x_train, x_pred, color="red")
3. mtp.title("Salary vs Experience (Training Dataset)")
4. mtp.xlabel("Years of Experience")
5. mtp.ylabel("Salary(In Rupees)")
6. mtp.show()

Output:

By executing the above lines of code, we will get the below graph plot as an output.

In the above plot, we can see the real values observations in green dots and predicted values are
covered by the red regression line. The regression line shows a correlation between the dependent and
independent variable.

The good fit of the line can be observed by calculating the difference between actual values and
predicted values. But as we can see in the above plot, most of the observations are close to the
regression line, hence our model is good for the training set.
Step: 5. visualizing the Test set results:

In the previous step, we have visualized the performance of our model on the training set. Now, we
will do the same for the Test set. The complete code will remain the same as the above code, except in
this, we will use x_test, and y_test instead of x_train and y_train.

Here we are also changing the color of observations and regression line to differentiate between the
two plots, but it is optional.

1. #visualizing the Test set results


2. mtp.scatter(x_test, y_test, color="blue")
3. mtp.plot(x_train, x_pred, color="red")
4. mtp.title("Salary vs Experience (Test Dataset)")
5. mtp.xlabel("Years of Experience")
6. mtp.ylabel("Salary(In Rupees)")
7. mtp.show()

Output:

By executing the above line of code, we will get the output as:

In the above plot, there are observations given by the blue color, and prediction is given by the red
regression line. As we can see, most of the observations are close to the regression line, hence we can
say our Simple Linear Regression is a good model and able to make good predictions.

Multiple Linear Regression

In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there may be
various cases in which the response variable is affected by more than one predictor variable; for such
cases, the Multiple Linear Regression algorithm is used.
Moreover, Multiple Linear Regression is an extension of Simple Linear regression as it takes more
than one predictor variable to predict the response variable. We can define it as:

Multiple Linear Regression is one of the important regression algorithms which models the linear
relationship between a single dependent continuous variable and more than one independent
variable.

Example:

Prediction of CO2 emission based on engine size and number of cylinders in a car.

Some key points about MLR:

o For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor
or independent variables may be continuous or categorical.
o Each feature variable must model a linear relationship with the dependent variable.
o MLR tries to fit a regression plane (hyperplane) through a multidimensional space of datapoints.

MLR equation:

In Multiple Linear Regression, the target variable(Y) is a linear combination of multiple predictor
variables x1, x2, x3, ...,xn. Since it is an enhancement of Simple Linear Regression, so the same is
applied for the multiple linear regression equation, the equation becomes:

Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn ............... (a)

Where,

Y= Output/Response variable

b0, b1, b2, b3 , bn....= Coefficients of the model.

x1, x2, x3, x4,...= Various Independent/feature variable

Assumptions for Multiple Linear Regression:

o A linear relationship should exist between the Target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent variable) in
data.

Implementation of Multiple Linear Regression model using Python:

To implement MLR using Python, we have below problem:


Problem Description:

We have a dataset of 50 start-up companies. This dataset contains five attributes: R&D
Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal
is to create a model that can easily determine which company has the maximum profit, and which
factor most affects the profit of a company.

Since we need to find the Profit, so it is the dependent variable, and the other four variables are
independent variables. Below are the main steps of deploying the MLR model:

1. Data Pre-processing Steps


2. Fitting the MLR model to the training set
3. Predicting the result of the test set

Step-1: Data Pre-processing Step:

The very first step is data pre-processing, which we have already discussed in this tutorial. This
process contains the below steps:

o Importing libraries: Firstly we will import the library which will help in building the model.
Below is the code for it:

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd

o Importing dataset: Now we will import the dataset(50_CompList), which contains all the
variables. Below is the code for it:

1. #importing datasets
2. data_set= pd.read_csv('50_CompList.csv')

Output: We will get the dataset as:


In the above output, we can clearly see that there are five variables, of which four are
continuous and one is categorical.

o Extracting dependent and independent Variables:

1. #Extracting Independent and dependent Variable


2. x= data_set.iloc[:, :-1].values
3. y= data_set.iloc[:, 4].values

Output:

Out[5]:

array([[165349.2, 136897.8, 471784.1, 'New York'],


[162597.7, 151377.59, 443898.53, 'California'],
[153441.51, 101145.55, 407934.54, 'Florida'],
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
[101913.08, 110594.11, 229160.95, 'Florida'],
[100671.96, 91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'],
[91992.39, 135495.07, 252664.93, 'California'],
[119943.24, 156547.42, 256512.92, 'Florida'],
[114523.61, 122616.84, 261776.23, 'New York'],
[78013.11, 121597.55, 264346.06, 'California'],
[94657.16, 145077.58, 282574.31, 'New York'],
[91749.16, 114175.79, 294919.57, 'Florida'],
[86419.7, 153514.11, 0.0, 'New York'],
[76253.86, 113867.3, 298664.47, 'California'],
[78389.47, 153773.43, 299737.29, 'New York'],
[73994.56, 122782.75, 303319.26, 'Florida'],
[67532.53, 105751.03, 304768.73, 'Florida'],
[77044.01, 99281.34, 140574.81, 'New York'],
[64664.71, 139553.16, 137962.62, 'California'],
[75328.87, 144135.98, 134050.07, 'Florida'],
[72107.6, 127864.55, 353183.81, 'New York'],
[66051.52, 182645.56, 118148.2, 'Florida'],
[65605.48, 153032.06, 107138.38, 'New York'],
[61994.48, 115641.28, 91131.24, 'Florida'],
[61136.38, 152701.92, 88218.23, 'New York'],
[63408.86, 129219.61, 46085.25, 'California'],
[55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'],
[46014.02, 85047.44, 205517.64, 'New York'],
[28663.76, 127056.21, 201126.82, 'Florida'],
[44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'],
[38558.51, 82982.09, 174999.3, 'California'],
[28754.33, 118546.05, 172795.67, 'California'],
[27892.92, 84710.77, 164470.71, 'Florida'],
[23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'],
[22177.74, 154806.14, 28334.72, 'California'],
[1000.23, 124153.04, 1903.93, 'New York'],
[1315.46, 115816.21, 297114.46, 'Florida'],
[0.0, 135426.92, 0.0, 'California'],
[542.05, 51743.15, 0.0, 'New York'],
[0.0, 116983.8, 45173.06, 'California']], dtype=object)

As we can see in the above output, the last column contains categorical variables which are not
suitable to apply directly for fitting the model. So we need to encode this variable.

Encoding Dummy Variables:

As we have one categorical variable (State), which cannot be directly applied to the model, so we will
encode it. To encode the categorical variable into numbers, we will use the LabelEncoder class. But
it is not sufficient because it still has some relational order, which may create a wrong model. So in
order to remove this problem, we will use OneHotEncoder, which will create the dummy variables.
Below is code for it:

1. #Categorical data
2. from sklearn.preprocessing import LabelEncoder, OneHotEncoder
3. labelencoder_x= LabelEncoder()
4. x[:, 3]= labelencoder_x.fit_transform(x[:,3])
5. onehotencoder= OneHotEncoder(categorical_features= [3])
6. x= onehotencoder.fit_transform(x).toarray()

Here we are only encoding one independent variable, which is state as other variables are continuous.

Output:

As we can see in the above output, the State column has been converted into dummy variables (0 and
1). Each dummy variable column corresponds to one State, which we can check by
comparing it with the original dataset: the first column corresponds to California, the
second column to Florida, and the third column to New York.
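
As an aside, the categorical_features argument used above was removed from OneHotEncoder in newer scikit-learn releases; on recent versions, a roughly equivalent encoding of the original x array (with the State strings still in column 3) can be written with ColumnTransformer, as in this sketch:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 3 (State) and pass the remaining numeric columns through unchanged
ct = ColumnTransformer([("state", OneHotEncoder(), [3])], remainder="passthrough")
x = ct.fit_transform(x)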

Now, we are writing a single line of code just to avoid the dummy variable trap:

1. #avoiding the dummy variable trap:


2. x = x[:, 1:]

If we do not remove the first dummy variable, then it may introduce multicollinearity in the model.
As we can see in the above output image, the first column has been removed.

o Now we will split the dataset into training and test set. The code for this is given below:

1. # Splitting the dataset into training and test set.


2. from sklearn.model_selection import train_test_split
3. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

The above code will split our dataset into a training set and test set.

Output: The above code will split the dataset into training set and test set. You can check the output
by clicking on the variable explorer option given in Spyder IDE. The test set and training set will look
like the below image:

Test set:
Training set:

Step: 2- Fitting our MLR model to the Training set:

Now, we have well prepared our dataset in order to provide training, which means we will fit our
regression model to the training set. It will be similar to as we did in Simple Linear Regression model.
The code for this will be:

1. #Fitting the MLR model to the training set:


2. from sklearn.linear_model import LinearRegression
3. regressor= LinearRegression()
4. regressor.fit(x_train, y_train)

Output:

Out[9]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Now, we have successfully trained our model using the training dataset. In the next step, we will test
the performance of the model using the test dataset.

Step: 3- Prediction of Test set results:

The last step for our model is checking the performance of the model. We will do it by predicting the
test set result. For prediction, we will create a y_pred vector. Below is the code for it:

1. #Predicting the Test set result;


2. y_pred= regressor.predict(x_test)

By executing the above lines of code, a new vector will be generated under the variable explorer
option. We can test our model by comparing the predicted values and test set values.

Output:

In the above output, we have the predicted result set and the test set. We can check model performance by
comparing these two values index by index. For example, the first index has a predicted value
of $103,015 profit and a test/real value of $103,282 profit. The difference is only $267, which is a
good prediction, so our model is essentially complete here.

o We can also check the score for training dataset and test dataset. Below is the code for it:
1. print('Train Score: ', regressor.score(x_train, y_train))
2. print('Test Score: ', regressor.score(x_test, y_test))

Output: The score is:

Train Score: 0.9501847627493607


Test Score: 0.9347068473282446

The above scores show that our model achieves an R-squared of about 95% on the training dataset
and about 93% on the test dataset.

Applications of Multiple Linear Regression:

There are mainly two applications of Multiple Linear Regression:

o Effectiveness of independent variables on prediction
o Predicting the impact of changes

ML Polynomial Regression

o Polynomial Regression is a regression algorithm that models the relationship between a
dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial
Regression equation is given below:

y = b0 + b1x1 + b2x1^2 + b3x1^3 + ... + bnx1^n

o It is also called a special case of Multiple Linear Regression in ML, because we add some
polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial
Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial regression for training is of non-linear nature.
o It makes use of a linear regression model to fit the complicated and non-linear functions and
datasets.
o Hence, "In Polynomial regression, the original features are converted into Polynomial
features of required degree (2,3,..,n) and then modeled using a linear model."

Need for Polynomial Regression:

The need of Polynomial Regression in ML can be understood in the below points:

o If we apply a linear model to a linear dataset, it gives a good result, as we saw in Simple Linear
Regression; but if we apply the same model without any modification to a non-linear dataset, it
will produce drastically poor results: the loss function will increase, the error rate will be high,
and the accuracy will decrease.
o So for such cases, where data points are arranged in a non-linear fashion, we need the
Polynomial Regression model. We can understand it in a better way using the below
comparison diagram of the linear dataset and non-linear dataset.

o In the above image, we have taken a dataset which is arranged non-linearly. If we try to
cover it with a linear model, we can clearly see that it hardly covers any data point. On
the other hand, a curve is suitable to cover most of the data points, which is what the Polynomial
model provides.
o Hence, if the datasets are arranged in a non-linear fashion, then we should use the
Polynomial Regression model instead of Simple Linear Regression.

Equation of the Polynomial Regression Model:

Simple Linear Regression equation: y = b0 + b1x .........(a)

Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn .........(b)

Polynomial Regression equation: y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n ..........(c)

Implementation of Polynomial Regression using Python:

Here we will implement the Polynomial Regression using Python. We will understand it by
comparing Polynomial Regression model with the Simple Linear Regression model. So first, let's
understand the problem for which we are going to build the model.

Problem Description: There is a Human Resource company which is going to hire a new candidate.
The candidate has stated that his previous salary was 160K per annum, and HR has to check whether he is
telling the truth or bluffing. To identify this, they only have a dataset from his previous company in which
the salaries of the top 10 positions are mentioned with their levels. By checking the dataset available,
we find that there is a non-linear relationship between the position levels and the salaries.
Our goal is to build a bluffing-detector regression model, so HR can hire an honest candidate.
Below are the steps to build such a model.
Steps for Polynomial Regression:

The main steps involved in Polynomial Regression are given below:

o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the results for the Linear Regression and Polynomial Regression models.
o Predict the output. (The Linear Regression model is built only for reference and comparison.)

Data Pre-processing Step:

The data pre-processing step will remain the same as in previous regression models, except for some
changes. In the Polynomial Regression model, we will not use feature scaling, and also we will not
split our dataset into training and test set. It has two reasons:

o The dataset contains very few observations, so it is not suitable to divide into a test and
training set; otherwise our model will not be able to find the correlations between the salaries and
levels.
o In this model, we want very accurate predictions for salary, so the model should have as much
information as possible.

The code for pre-processing step is given below:

1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('Position_Salaries.csv')
8.
9. #Extracting Independent and dependent Variable
10. x= data_set.iloc[:, 1:2].values
11. y= data_set.iloc[:, 2].values

Explanation:

o In the above lines of code, we have imported the important Python libraries to import dataset
and operate on it.
o Next, we have imported the dataset 'Position_Salaries.csv', which contains three columns
(Position, Levels, and Salary), but we will consider only two columns (Salary and Levels).
o After that, we have extracted the dependent variable (y) and independent variable (x) from the dataset.
For the x variable, we have taken the slice [:, 1:2], because we want index 1 (Levels), and
the :2 ensures that x is extracted as a matrix (2-D array) rather than a vector.

Output:

By executing the above code, we can read our dataset as:

As we can see in the above output, there are three columns present (Positions, Levels, and Salaries).
But we are only considering two columns, because the Levels column is simply the encoded form of the
Positions column.

Here we will predict the output for level 6.5, because the candidate has 4+ years' experience as a
regional manager, so he must be somewhere between levels 6 and 7.
Building the Linear regression model:


Now, we will build and fit the Linear regression model to the dataset. In building polynomial
regression, we will take the Linear regression model as reference and compare both the results. The
code is given below:

1. #Fitting the Linear Regression to the dataset


2. from sklearn.linear_model import LinearRegression
3. lin_regs= LinearRegression()
4. lin_regs.fit(x,y)

In the above code, we have created the Simple Linear model using lin_regs object
of LinearRegression class and fitted it to the dataset variables (x and y).

Output:

Out[5]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Building the Polynomial regression model:

Now we will build the Polynomial Regression model, but it will be a little different from the Simple
Linear model. Because here we will use PolynomialFeatures class of preprocessing library. We are
using this class to add some extra features to our dataset.

1. #Fitting the Polynomial regression to the dataset


2. from sklearn.preprocessing import PolynomialFeatures
3. poly_regs= PolynomialFeatures(degree= 2)
4. x_poly= poly_regs.fit_transform(x)
5. lin_reg_2 =LinearRegression()
6. lin_reg_2.fit(x_poly, y)

In the above lines of code, we have used poly_regs.fit_transform(x), because first we are converting
our feature matrix into polynomial feature matrix, and then fitting it to the Polynomial regression
model. The parameter value(degree= 2) depends on our choice. We can choose it according to our
Polynomial features.

After executing the code, we will get another matrix x_poly, which can be seen under the variable
explorer option:
Next, we have used another LinearRegression object, namely lin_reg_2, to fit our x_poly vector to
the linear model.

Output:

Out[11]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Visualizing the result for Linear regression:

Now we will visualize the result for Linear regression model as we did in Simple Linear Regression.
Below is the code for it:

1. #Visulaizing the result for Linear Regression model


2. mtp.scatter(x,y,color="blue")
3. mtp.plot(x,lin_regs.predict(x), color="red")
4. mtp.title("Bluff detection model(Linear Regression)")
5. mtp.xlabel("Position Levels")
6. mtp.ylabel("Salary")
7. mtp.show()

Output:

In the above output image, we can clearly see that the regression line is far from the datapoints.
Predictions are shown by the red straight line, and the blue points are the actual values. If we used this
output to predict the salary of the CEO, it would give a salary of approximately $600,000, which is far
from the real value.

So we need a curved model to fit the dataset other than a straight line.

Visualizing the result for Polynomial Regression

Here we will visualize the result of Polynomial regression model, code for which is little different
from the above model.

Code for this is given below:

1. #Visulaizing the result for Polynomial Regression


2. mtp.scatter(x,y,color="blue")
3. mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")
4. mtp.title("Bluff detection model(Polynomial Regression)")
5. mtp.xlabel("Position Levels")
6. mtp.ylabel("Salary")
7. mtp.show()

In the above code, we have used lin_reg_2.predict(poly_regs.fit_transform(x)) instead of x_poly,
because we want the linear regressor object to predict on the polynomial feature matrix.

Output:
As we can see in the above output image, the predictions are close to the real values. The above plot
will vary as we will change the degree.

For degree= 3:

If we change the degree=3, then we will give a more accurate plot, as shown in the below image.

So, as we can see in the above output image, the predicted salary for level 6.5 is near 170K-190K,
which suggests that the future employee is telling the truth about his salary.

Degree= 4: Let's again change the degree to 4, and now will get the most accurate plot. Hence we can
get more accurate results by increasing the degree of Polynomial.
Predicting the final result with the Linear Regression model:

Now, we will predict the final output using the Linear regression model to see whether an employee is
saying truth or bluff. So, for this, we will use the predict() method and will pass the value 6.5. Below
is the code for it:

1. lin_pred = lin_regs.predict([[6.5]])
2. print(lin_pred)

Output:

[330378.78787879]

Predicting the final result with the Polynomial Regression model:

Now, we will predict the final output using the Polynomial Regression model to compare with Linear
model. Below is the code for it:

1. poly_pred = lin_reg_2.predict(poly_regs.fit_transform([[6.5]]))
2. print(poly_pred)

Output:

[158862.45265153]

As we can see, the predicted output for Polynomial Regression is [158862.45265153], which is
much closer to the real value; hence, we can say that the future employee is telling the truth.

Classification Algorithm in Machine Learning

As we know, the Supervised Machine Learning algorithm can be broadly classified into Regression
and Classification Algorithms. In Regression algorithms, we have predicted the output for continuous
values, but to predict the categorical values, we need Classification algorithms.
Classification Algorithm

The Classification algorithm is a Supervised Learning technique that is used to identify the category
of new observations on the basis of training data. In Classification, a program learns from the given
dataset or observations and then classifies new observation into a number of classes or groups. Such
as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called as targets/labels or
categories.

Unlike regression, the output variable of Classification is a category, not a value, such as "Green or
Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised learning technique,
hence it takes labeled input data, which means it contains input with the corresponding output.

In classification algorithm, a discrete output function(y) is mapped to input variable(x).

y=f(x), where y = categorical output

The best example of an ML classification algorithm is Email Spam Detector.

The main goal of the Classification algorithm is to identify the category of a given dataset, and these
algorithms are mainly used to predict the output for the categorical data.

Classification algorithms can be better understood using the below diagram. In the below diagram,
there are two classes, class A and Class B. These classes have features that are similar to each other
and dissimilar to other classes.

The algorithm which implements the classification on a dataset is known as a classifier. There are two
types of Classifications:

o Binary Classifier: If the classification problem has only two possible outcomes, then it is
called as Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, then it is
called as Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.

Learners in Classification Problems:

In the classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the
test dataset. In the lazy learner case, classification is done on the basis of the most related data
stored in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, case-based reasoning.
2. Eager Learners: Eager learners develop a classification model based on a training dataset
before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in
learning and less time in prediction. Example: Decision Trees, Naïve Bayes, ANN.

Types of ML Classification Algorithms:

Classification algorithms can be further divided into mainly two categories:

o Linear Models
o Logistic Regression
o Support Vector Machines
o Non-linear Models
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification.

Evaluating a Classification model:

Once our model is complete, it is necessary to evaluate its performance, whether it is a Classification
or a Regression model. For evaluating a Classification model, we have the following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier, whose output is a probability value
between the 0 and 1.
o For a good binary Classification model, the value of log loss should be near to 0.
o The value of log loss increases if the predicted value deviates from the actual value.
o The lower log loss represents the higher accuracy of the model.
o For binary classification, cross-entropy can be calculated as:

-(y log(p) + (1 - y) log(1 - p))

Where y = actual output, p = predicted probability.
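
A small sketch of computing log loss (assuming scikit-learn; the labels and probabilities are made-up values):

import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]            # actual class labels
y_prob = [0.9, 0.2, 0.7, 0.6, 0.1]  # predicted probabilities of class 1

print(log_loss(y_true, y_prob))

# Manual computation for a single observation: -(y*log(p) + (1-y)*log(1-p))
y, p = 1, 0.9
print(-(y * np.log(p) + (1 - y) * np.log(1 - p)))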

2. Confusion Matrix:

o The confusion matrix provides us a matrix/table as output and describes the performance of
the model.
o It is also known as the error matrix.
o The matrix consists of predictions result in a summarized form, which has a total number of
correct predictions and incorrect predictions. The matrix looks like as below table:

                    Actual Positive    Actual Negative
Predicted Positive  True Positive      False Positive
Predicted Negative  False Negative     True Negative

3. AUC-ROC curve:

o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area
Under the Curve.
o It is a graph that shows the performance of the classification model at different thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-ROC
Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-axis and
FPR(False Positive Rate) on X-axis.
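
A short sketch of computing a confusion matrix and ROC AUC with scikit-learn (assumed installed); the labels and probabilities below are invented:

from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # predicted labels after thresholding
y_prob = [0.9, 0.3, 0.8, 0.4, 0.2, 0.6, 0.7, 0.1]    # predicted probabilities of class 1

print(confusion_matrix(y_true, y_pred))  # rows are actual classes, columns are predicted classes
print(roc_auc_score(y_true, y_prob))     # area under the ROC curve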

Use cases of Classification Algorithms

Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:

o Email Spam Detection


o Speech Recognition
o Identifications of Cancer tumor cells.
o Drugs Classification
o Biometric Identification, etc.

Logistic Regression in Machine Learning

o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
o Logistic Regression is much like Linear Regression except in how it is used:
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image
is showing the logistic function:
Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The output of logistic regression must lie between 0 and 1 and cannot go beyond this
limit, so its graph forms an "S"-shaped curve. This S-shaped curve is called the Sigmoid function
or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides between class 0
and class 1: probabilities above the threshold are mapped to 1, and probabilities below the
threshold are mapped to 0.
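
A minimal sketch of the sigmoid mapping and a 0.5 threshold, using NumPy (the input values are arbitrary and only meant to show the idea):

import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.5, 4.0])   # arbitrary linear-model outputs
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)          # threshold at 0.5

print(probs)    # values between 0 and 1
print(labels)   # predicted classes 0 or 1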

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.


o The independent variable should not have multi-collinearity.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

o We know the equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression y can be between 0 and 1 only, so let's divide the above
equation by (1 - y):

y / (1 - y)   ; which is 0 for y = 0 and infinity for y = 1

o But we need a range between -infinity and +infinity; taking the logarithm of the equation, it
becomes:

log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression: the log-odds (logit) expressed as a
linear function of the inputs.

Types of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the
dependent variable, such as "low", "medium", or "high".
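
To tie these ideas together, here is a brief illustrative scikit-learn sketch that fits a binomial logistic regression on a synthetic dataset and inspects both the predicted probabilities and the class labels (the dataset and parameter choices are assumptions made only for demonstration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

clf = LogisticRegression()
clf.fit(X_train, y_train)

print(clf.predict_proba(X_test)[:5])  # probabilities between 0 and 1 for each class
print(clf.predict(X_test)[:5])        # class labels after applying the 0.5 threshold
print("Accuracy:", clf.score(X_test, y_test))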

Naïve Bayes Classifier Algorithm

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes'
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective Classification algorithms,
and it helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm comprises the two words Naïve and Bayes, which can be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis
of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence
each feature individually contributes to identifying it as an apple without depending on the
others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = [ P(B|A) * P(A) ] / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is
true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset we need to decide whether we should play or not on a particular day according to the
weather conditions. To solve this problem, we need to follow the below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, then the Player should play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No
9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:

Weather    Yes   No

Overcast   5     0

Rainy      2     2

Sunny      3     2

Total      10    4

Likelihood table of the weather conditions:

Weather No Yes

Overcast 0 5

Rainy 2 2

Sunny 2 3

All 4/14=0.29 10/14=0.71

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|NO)= 2/4=0.5
P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

So, as we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a Sunny day, the Player can play the game.
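
The same calculation can be reproduced in a few lines of pandas by counting outcomes in the dataset above; this is only a sketch to check the arithmetic, not a full Naïve Bayes implementation:

import pandas as pd

# The weather/play dataset from the example above
outlook = ["Rainy","Sunny","Overcast","Overcast","Sunny","Rainy","Sunny",
           "Overcast","Rainy","Sunny","Sunny","Rainy","Overcast","Overcast"]
play    = ["Yes","Yes","Yes","Yes","No","Yes","Yes",
           "Yes","No","No","Yes","No","Yes","Yes"]
df = pd.DataFrame({"Outlook": outlook, "Play": play})

p_yes = (df["Play"] == "Yes").mean()                                        # P(Yes) = 10/14
p_sunny = (df["Outlook"] == "Sunny").mean()                                 # P(Sunny) = 5/14
p_sunny_given_yes = (df[df["Play"] == "Yes"]["Outlook"] == "Sunny").mean()  # 3/10

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print("P(Yes|Sunny) =", round(p_yes_given_sunny, 2))   # about 0.6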

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.


o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes that
these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially
distributed. It is primarily used for document classification problems, i.e. deciding which
category a particular document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular word is
present or not in a document. This model is also well known for document classification tasks.
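
As an illustrative sketch, scikit-learn exposes these three variants as GaussianNB, MultinomialNB and BernoulliNB; the tiny text-classification example below uses MultinomialNB with word counts (the example sentences and labels are invented purely for demonstration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus: 1 = spam, 0 = not spam
texts = ["win money now", "lowest price offer", "meeting at noon", "project status update"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(texts)          # word-frequency features

clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vec.transform(["win a free offer"])))   # likely predicts 1 (spam)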

Python Implementation of the Naïve Bayes algorithm:

Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use the
"user_data" dataset, which we have used in our other classification model. Therefore we can easily
compare the Naive Bayes model with the other models.

Steps to implement:

o Data Pre-processing step


o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

1) Data Pre-processing step:

In this step, we will pre-process/prepare the data so that we can use it efficiently in our code. It is
similar to what we did earlier in data pre-processing. The code for this is given below:

# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In the above code, we have loaded the dataset into our program using dataset =
pd.read_csv('user_data.csv'). The loaded dataset is divided into a training set and a test set, and
then we have scaled the feature variables.

The output for the dataset is given as:

2) Fitting Naive Bayes to the Training Set:

After the pre-processing step, now we will fit the Naive Bayes model to the Training set. Below is the
code for it:

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

In the above code, we have used the GaussianNB classifier to fit it to the training dataset. We can
also use other classifiers as per our requirement.

Output:

Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)


3) Prediction of the test set result:

Now we will predict the test set result. For this, we will create a new predictor variable y_pred, and
will use the predict function to make the predictions.

# Predicting the Test set results
y_pred = classifier.predict(x_test)

Output:

The above output shows the result for the prediction vector y_pred and the real vector y_test. We can
see that some predictions are different from the real values, which are the incorrect predictions.

4) Creating Confusion Matrix:

Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix. Below is
the code for it:

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions, and
65+25=90 correct predictions.
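
If desired, the accuracy implied by these counts can also be computed directly; the short sketch below derives it both from the confusion matrix and with scikit-learn's accuracy_score (it assumes the cm, y_test and y_pred variables from the code above are already in scope):

from sklearn.metrics import accuracy_score

# Accuracy from the confusion matrix: (TP + TN) / total
accuracy_from_cm = (cm[0, 0] + cm[1, 1]) / cm.sum()
print("Accuracy from confusion matrix:", accuracy_from_cm)   # 90/100 = 0.9

# Equivalent computation with scikit-learn
print("Accuracy score:", accuracy_score(y_test, y_pred))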

5) Visualizing the training set result:

Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code for it:

# Visualising the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

In the above output we can see that the Naïve Bayes classifier has segregated the data points with a
fine boundary. The boundary is a Gaussian-shaped curve because we have used the GaussianNB classifier
in our code.

6) Visualizing the Test set result:

# Visualising the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(X1.min(), X1.max())
mtp.ylim(X2.min(), X2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Naive Bayes (test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Output:

The above output is the final output for the test set data. As we can see, the classifier has created a
Gaussian-shaped boundary to divide the "purchased" and "not purchased" classes. There are some wrong
predictions, which we have already counted in the Confusion matrix, but it is still a pretty good classifier.

Decision Tree Classification Algorithm

o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes, which are the Decision Node and the Leaf
Node. Decision nodes are used to make a decision and have multiple branches, whereas
Leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the
tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the
two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute
and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues this process until it reaches a leaf node of the tree. The complete process
can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot classify the nodes any further;
the final node is then called a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary
attribute by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further gets split into one
decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offers and Declined offer). Consider the below diagram:

Attribute Selection Measures

While implementing a Decision tree, the main issue that arises is how to select the best attribute for
the root node and for the sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure or ASM. Using this measure, we can easily select the best attribute
for the nodes of the tree. There are two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:

o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in
data. Entropy can be calculated as:

Entropy(S) = -P(yes) log₂ P(yes) - P(no) log₂ P(no)

Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no
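
A small, self-contained sketch of these two formulas in Python, applied to a hypothetical yes/no split (the counts used are arbitrary and only for illustration):

import math

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes)*log2(P(yes)) - P(no)*log2(P(no)); 0*log2(0) is treated as 0
    terms = [p * math.log2(p) for p in (p_yes, p_no) if p > 0]
    return -sum(terms)

# Hypothetical parent node: 9 "yes" and 5 "no" samples
parent = entropy(9/14, 5/14)

# Hypothetical split into two children of sizes 8 and 6
child1 = entropy(6/8, 2/8)
child2 = entropy(3/6, 3/6)
weighted_children = (8/14) * child1 + (6/14) * child2

information_gain = parent - weighted_children
print("Entropy(S) =", round(parent, 3))
print("Information Gain =", round(information_gain, 3))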

2. Gini Index:

o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 - ∑j (Pj)²
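
For comparison with entropy, here is a one-function sketch of the Gini index for a node described by its class proportions (the example proportions are made up):

def gini_index(class_probabilities):
    # Gini Index = 1 - sum of squared class probabilities
    return 1.0 - sum(p ** 2 for p in class_probabilities)

print(gini_index([1.0, 0.0]))   # pure node  -> 0.0
print(gini_index([0.5, 0.5]))   # 50/50 node -> 0.5
print(gini_index([0.7, 0.3]))   # mixed node -> 0.42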

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A too-large tree increases the risk of overfitting, and a small tree may not capture all the important
features of the dataset. Therefore, a technique that decreases the size of the learning tree without
reducing accuracy is known as Pruning. There are mainly two types of tree pruning techniques used:

o Cost Complexity Pruning


o Reduced Error Pruning.
Advantages of the Decision Tree

o It is simple to understand as it follows the same process which a human follows while making
any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.


o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

CART (Classification And Regression Tree)

CART is a predictive algorithm used in Machine Learning; it explains how the target variable's
values can be predicted from the other variables. It is a decision tree in which each fork is a
split on a predictor variable and each leaf node holds a prediction for the target variable.
In the decision tree, nodes are split into sub-nodes on the basis of a threshold value of an attribute.
The root node is taken as the training set and is split into two by considering the best attribute and
threshold value. Further, the subsets are also split using the same logic. This continues until the last
pure subset is found in the tree or the maximum number of leaves allowed for that growing tree is reached.

The CART algorithm works via the following process:

 The best split point of each input is obtained.


 Based on the best split points of each input in Step 1, the new “best” split point is identified.
 Split the chosen input according to the “best” split point.
 Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.
The CART algorithm uses Gini Impurity to split the dataset into a decision tree. It does that by
searching for the best homogeneity of the sub-nodes, with the help of the Gini index criterion.

Gini index/Gini impurity


The Gini index is a metric for classification tasks in CART. It is based on the sum of squared
probabilities of each class. It measures how often a randomly chosen element would be wrongly
classified if it were labelled randomly according to the class distribution, and it is a variation of
the Gini coefficient. It works on categorical variables, gives outcomes of either "success" or
"failure", and hence conducts binary splitting only.

The degree of the Gini index varies from 0 to 1:

 A value of 0 indicates that all the elements belong to a single class, i.e. the node is pure.
 A value close to 1 indicates that the elements are randomly distributed across many
classes.
 A value of 0.5 indicates that the elements are equally distributed between two classes.
Mathematically, we can write Gini Impurity as follows:

Gini = 1 - ∑i (pi)²

where pi is the probability of an object being classified to a particular class.

Classification tree
A classification tree is an algorithm where the target variable is categorical. The algorithm is then
used to identify the “Class” within which the target variable is most likely to fall. Classification
trees are used when the dataset needs to be split into classes that belong to the response
variable (like yes or no).

Regression tree
A Regression tree is an algorithm where the target variable is continuous and the tree is used to
predict its value. Regression trees are used when the response variable is continuous. For example,
if the response variable is the temperature of the day.

CART model representation


CART models are formed by picking input variables and evaluating split points on those variables
until an appropriate tree is produced.

Steps to create a Decision Tree using the CART algorithm:

 Greedy algorithm: In this, the input space is divided using a greedy method known
as recursive binary splitting. This is a numerical procedure in which all the values are
lined up and several candidate split points are tried and assessed using a cost function.
 Stopping Criterion: As it works its way down the tree with the training data, the recursive
binary splitting method described above must know when to stop splitting. The most frequent
halting method is to utilize a minimum amount of training data allocated to every leaf node. If
the count is smaller than the specified threshold, the split is rejected and also the node is
considered the last leaf node.
 Tree pruning: A decision tree's complexity is defined as the number of splits in the tree. Trees
with fewer branches are recommended as they are simpler to grasp and less prone to overfitting
the data. Working through each leaf node in the tree and evaluating the effect of deleting it using a
hold-out test set is the quickest and simplest pruning approach.
 Data preparation for the CART: No special data preparation is required for the CART
algorithm.
Advantages of CART
 Results are simplistic.
 Classification and regression trees are Nonparametric and Nonlinear.
 Classification and regression trees implicitly perform feature selection.
 Outliers have no meaningful effect on CART.
 It requires minimal supervision and produces easy-to-understand models.
Limitations of CART
 Overfitting.
 High variance.
 Low bias.
 The tree structure may be unstable to small changes in the data.
Applications of the CART algorithm
 For quick Data insights.
 In Blood Donors Classification.
 For environmental and ecological data.
 In the financial sectors.
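
scikit-learn's tree estimators are CART-style (binary splits, Gini impurity by default), so a minimal illustrative sketch of both a classification tree and a regression tree might look like the following (the synthetic data and parameter choices are assumptions for demonstration only):

from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical target
Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(Xc, yc)
print("Classification accuracy (training):", clf.score(Xc, yc))

# Regression tree: continuous target
Xr, yr = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(Xr, yr)
print("Regression R^2 (training):", reg.score(Xr, yr))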

Decision Trees: ID3 Algorithm

ID3 stands for Iterative Dichotomiser 3 and is named so because the algorithm iteratively
(repeatedly) dichotomizes (divides) features into two or more groups at each step.

Generally, ID3 is used only for classification problems with nominal (categorical) features.

Dataset description

In this article, we’ll be using a sample dataset of COVID-19 infection. A preview of the entire dataset

is shown below.
+----+-------+-------+------------------+----------+
| ID | Fever | Cough | Breathing issues | Infected |
+----+-------+-------+------------------+----------+
| 1 | NO | NO | NO | NO |
+----+-------+-------+------------------+----------+
| 2 | YES | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 3 | YES | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 4 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 5 | YES | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 6 | NO | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 7 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 8 | YES | NO | YES | YES |
+----+-------+-------+------------------+----------+
| 9 | NO | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 10 | YES | YES | NO | YES |
+----+-------+-------+------------------+----------+
| 11 | NO | YES | NO | NO |
+----+-------+-------+------------------+----------+
| 12 | NO | YES | YES | YES |
+----+-------+-------+------------------+----------+
| 13 | NO | YES | YES | NO |
+----+-------+-------+------------------+----------+
| 14 | YES | YES | NO | NO |
+----+-------+-------+------------------+----------+

The columns are self-explanatory. YES and NO stand for Yes and No respectively. In the Infected
column, the classes YES and NO represent Infected and Not Infected respectively.

The columns used to make decision nodes viz. ‘Breathing Issues’, ‘Cough’ and ‘Fever’ are called

feature columns or just features and the column used for leaf nodes i.e. ‘Infected’ is called the target

column.

Metrics in ID3

As mentioned previously, the ID3 algorithm selects the best feature at each step while building a

Decision tree.

Before you ask, the answer to the question: ‘How does ID3 select the best feature?’ is that ID3

uses Information Gain or just Gain to find the best feature.

Information Gain calculates the reduction in the entropy and measures how well a given feature

separates or classifies the target classes. The feature with the highest Information Gain is selected as

the best one.

In simple words, Entropy is the measure of disorder and the Entropy of a dataset is the measure of

disorder in the target feature of the dataset.

In the case of binary classification (where the target column has only two classes), entropy
is 0 if all values in the target column are homogeneous (similar) and 1 if the target column has an
equal number of values for both classes.

If we denote our dataset as S, entropy is calculated as:


Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n

where,

n is the total number of classes in the target column (in our case n = 2 i.e YES and NO)

pᵢ is the probability of class ‘i’ or the ratio of “number of rows with class i in the target column” to the

“total number of rows” in the dataset.

Information Gain for a feature column A is calculated as:


IG(S, A) = Entropy(S) - ∑((|Sᵥ| / |S|) * Entropy(Sᵥ))

where Sᵥ is the set of rows in S for which the feature column A has value v, |Sᵥ| is the number of rows

in Sᵥ and likewise |S| is the number of rows in S.

ID3 Steps

1. Calculate the Information Gain of each feature.

2. Considering that all rows don’t belong to the same class, split the dataset S into subsets using the
feature for which the Information Gain is maximum.

3. Make a decision tree node using the feature with the maximum Information gain.

4. If all rows belong to the same class, make the current node as a leaf node with the class as its
label.

5. Repeat for the remaining features until we run out of all features, or the decision tree has all leaf
nodes.
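
As a rough sketch of these steps, the following recursive implementation builds an ID3-style tree as nested Python dictionaries using pandas. It is illustrative only (no pruning, no handling of unseen attribute values), the function names are our own, and it re-evaluates information gain per branch rather than using each feature once globally as the simplified walkthrough below does:

import math
import pandas as pd

def entropy(target):
    # Entropy of a pandas Series of class labels
    probs = target.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(df, feature, target_col):
    # IG(S, A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv))
    total = entropy(df[target_col])
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target_col])
        for _, subset in df.groupby(feature)
    )
    return total - weighted

def id3(df, features, target_col):
    # If all rows share one class, return a leaf with that class
    if df[target_col].nunique() == 1:
        return df[target_col].iloc[0]
    # If no features remain, return the majority class
    if not features:
        return df[target_col].mode()[0]
    # Split on the feature with the maximum information gain
    best = max(features, key=lambda f: information_gain(df, f, target_col))
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    for value, subset in df.groupby(best):
        tree[best][value] = id3(subset, remaining, target_col)
    return tree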

Implementation on our Dataset


As stated in the previous section, the first step is to find the best feature, i.e. the one that has the
maximum Information Gain (IG). We'll calculate the IG for each of the features now, but for that, we
first need to calculate the entropy of S.

From the total of 14 rows in our dataset S, there are 8 rows with the target value YES and 6 rows with

the target value NO. The entropy of S is calculated as:


Entropy(S) = - (8/14) * log₂(8/14) - (6/14) * log₂(6/14) = 0.99

We now calculate the Information Gain for each feature:

IG calculation for Fever:

In this (Fever) feature, there are 8 rows having value YES and 6 rows having value NO.

As shown below, in the 8 rows with YES for Fever, there are 6 rows having target

value YES and 2 rows having target value NO.


+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | NO | NO |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | NO | YES |
+-------+-------+------------------+----------+
| YES | YES | NO | NO |
+-------+-------+------------------+----------+

As shown below, in the 6 rows with NO, there are 2 rows having target value YES and 4 rows having

target value NO.


+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| NO | NO | NO | NO |
+-------+-------+------------------+----------+
| NO | YES | NO | NO |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | NO | NO |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | NO |
+-------+-------+------------------+----------+

The block below demonstrates the calculation of Information Gain for Fever.

# total rows
|S| = 14

For v = YES, |Sᵥ| = 8
Entropy(Sᵥ) = - (6/8) * log₂(6/8) - (2/8) * log₂(2/8) = 0.81

For v = NO, |Sᵥ| = 6
Entropy(Sᵥ) = - (2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.92

# Expanding the summation in the IG formula:
IG(S, Fever) = Entropy(S) - (|Sʏᴇꜱ| / |S|) * Entropy(Sʏᴇꜱ) - (|Sɴᴏ| / |S|) * Entropy(Sɴᴏ)

∴ IG(S, Fever) = 0.99 - (8/14) * 0.81 - (6/14) * 0.92 ≈ 0.13

Next, we calculate the IG for the features "Cough" and "Breathing issues" in the same way:

IG(S, Cough) = 0.04
IG(S, Breathing issues) = 0.40
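
These information-gain values can be double-checked programmatically; the self-contained pandas sketch below re-enters the COVID-19 table and recomputes the information gain of each feature (the helper names are our own):

import math
import pandas as pd

data = {
    "Fever":            ["NO","YES","YES","YES","YES","NO","YES","YES","NO","YES","NO","NO","NO","YES"],
    "Cough":            ["NO","YES","YES","NO","YES","YES","NO","NO","YES","YES","YES","YES","YES","YES"],
    "Breathing issues": ["NO","YES","NO","YES","YES","NO","YES","YES","YES","NO","NO","YES","YES","NO"],
    "Infected":         ["NO","YES","NO","YES","YES","NO","YES","YES","YES","YES","NO","YES","NO","NO"],
}
df = pd.DataFrame(data)

def entropy(target):
    probs = target.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(df, feature):
    total = entropy(df["Infected"])
    weighted = sum((len(s) / len(df)) * entropy(s["Infected"])
                   for _, s in df.groupby(feature))
    return total - weighted

for col in ["Fever", "Cough", "Breathing issues"]:
    print(col, round(information_gain(df, col), 2))
# Expected output (approximately): Fever 0.13, Cough 0.04, Breathing issues 0.4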

Since the feature Breathing issues has the highest Information Gain, it is used to create the root node.

Hence, after this initial step our tree looks like this:

Next, from the remaining two unused features, namely, Fever and Cough, we decide which one is the

best for the left branch of Breathing Issues.

Since the left branch of Breathing Issues denotes YES, we will work with the subset of the original

data, i.e. the set of rows having YES as the value in the Breathing Issues column. These 8 rows are

shown below:
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | NO |
+-------+-------+------------------+----------+

Next, we calculate the IG for the features Fever and Cough using the subset Sʙʏ

(Set Breathing Issues Yes) which is shown above:

Note: For IG calculation the Entropy will be calculated from the subset Sʙʏ and not the original

dataset S.
IG(Sʙʏ, Fever) = 0.20
IG(Sʙʏ, Cough) = 0.09

IG of Fever is greater than that of Cough, so we select Fever as the left branch of Breathing Issues:

Our tree now looks like this:

Next, we find the feature with the maximum IG for the right branch of Breathing Issues. But, since

there is only one unused feature left we have no other choice but to make it the right branch of the root

node.

So our tree now looks like this:


There are no more unused features, so we stop here and jump to the final step of creating the leaf

nodes.

For the left leaf node of Fever, we look at the subset of rows from the original dataset that have both

Breathing Issues and Fever equal to YES.


+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | YES | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+
| YES | NO | YES | YES |
+-------+-------+------------------+----------+

Since all the values in the target column are YES, we label the left leaf node as YES, but to make it

more logical we label it Infected.

Similarly, for the right node of Fever we see the subset of rows from the original data set that

have Breathing Issues value as YES and Fever as NO.


+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | YES |
+-------+-------+------------------+----------+
| NO | YES | YES | NO |
+-------+-------+------------------+----------+
Here the values are not all the same: two of the three rows are YES and one is NO, so by majority vote YES, i.e. Infected, becomes our right leaf node.

Our tree, now, looks like this:

We repeat the same process for the node Cough, however here both left and right leaves turn out to be

the same i.e. NO or Not Infected as shown below:
