ML_Unit-2_Material
UNIT – II
Supervised Learning
Learning a Class from Examples, Linear, Non-linear, Multi-class and Multi-label classification, Decision
Trees: ID3, Classification and Regression Trees (CART), Regression: Linear Regression, Multiple Linear Regression
Decision Tree
Introduction
Decision Trees are a type of Supervised Machine Learning (that is, the training data contains both the inputs and the corresponding outputs) in which the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or the final outcomes, and the decision nodes are where the data is split.
An example of a decision tree can be explained using the binary tree above. Let's say you want to predict whether a person is fit given information like age, eating habits, physical activity, etc. The decision nodes here are questions like 'What is the age?', 'Does he exercise?', and 'Does he eat a lot of pizzas?', and the leaves are outcomes like either 'fit' or 'unfit'. In this case this was a binary classification problem (a yes/no type problem). There are two main types of Decision Trees:
1. Classification Trees: What we have seen above is an example of a classification tree, where the outcome was a variable like 'fit' or 'unfit'. Here the decision variable is categorical.
2. Regression Trees: Here the decision or the outcome variable is continuous, e.g. a number like 123.
Working
Now that we know what a Decision Tree is, we'll see how it works internally. There are many algorithms which construct Decision Trees, but one of the best known is the ID3 Algorithm. ID3 stands for Iterative Dichotomiser 3. Before discussing the ID3 algorithm, we'll go through a few definitions.
Entropy
Entropy, also called Shannon Entropy and denoted by H(S) for a finite set S, is the measure of the amount of uncertainty or randomness in data. Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss whose probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest possible, since there is no way of determining what the outcome might be. Alternatively, consider a coin which has heads on both sides; the outcome of such a toss can be perfectly predicted, so its entropy is zero. Entropy is defined as
H(S) = - Σ p(x) log2 p(x)
where the sum runs over the possible classes x and p(x) is the proportion of examples in class x.
Information Gain
Information gain, IG(S, A), is the reduction in entropy achieved by splitting the set S on feature A:
IG(S, A) = H(S) - Σ P(x) H(x)
where IG(S, A) is the information gain obtained by applying feature A, H(S) is the Entropy of the entire set, and the second term calculates the Entropy after applying the feature A, where P(x) is the probability of event x (i.e. the fraction of examples taking value x of A). Let's understand this with the help of an example. Consider a piece of data
collected over the course of 14 days where the features are Outlook, Temperature, Humidity,
Wind and the outcome variable is whether Golf was played on the day. Now, our job is to build
a predictive model which takes in the above four parameters and predicts whether Golf will be played on the day. We'll build a decision tree to do that using the ID3 algorithm.
ID3
The ID3 Algorithm will perform the following tasks recursively:
1. Calculate the entropy of the current data set.
2. For each attribute, calculate the entropy after splitting on that attribute and the resulting information gain.
3. Select the attribute with the highest information gain as the decision node.
4. Split the data set on the selected attribute and repeat the process on each subset until the examples are classified or no attributes remain.
Now we'll go ahead and grow the decision tree. The initial step is to calculate H(S), the Entropy of the current state. In the above example, we can see that in total there are 5 No's and 9 Yes's.

Yes    No    Total
9      5     14

H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

Next, for each attribute we need the second term of the information gain, Σ P(x) H(x), where 'x' are the possible values for the attribute. Here, attribute 'Wind' takes two possible values in the sample data, hence x = {Weak, Strong}, so we'll have to calculate H(S_weak) and H(S_strong).
Amongst all the 14 examples we have 8 places where the wind is weak and 6 where the wind is Strong.
Now out of the 8 Weak examples, 6 of them were 'Yes' for Play Golf and 2 of them were 'No'. So we have:

H(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.811

Similarly, out of the 6 Strong examples, we have 3 examples where the outcome was 'Yes' for Play Golf and 3 where we had 'No':

H(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.000

Putting these together,

IG(S, Wind) = H(S) - (8/14) H(S_weak) - (6/14) H(S_strong) = 0.94 - (8/14)(0.811) - (6/14)(1.000) = 0.048

This is the Information Gain obtained by considering 'Wind' as the split feature: 0.048. Now we must similarly calculate the Information Gain for all the features.
We can clearly see that IG(S, Outlook) has the highest information gain of 0.246, hence we choose the Outlook attribute as the root node. At this point, the decision tree splits on Outlook at the root, with one branch for each of its values; the same procedure is then repeated on each branch. A short computational sketch of these calculations is given below.
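The following is a minimal Python sketch (not part of the original material) that reproduces the entropy and information-gain numbers quoted above for the 'Wind' attribute, using the class counts from the walkthrough:

```python
import math

def entropy(counts):
    # Shannon entropy H(S) = -sum over classes of p(x) * log2 p(x)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Entropy of the full data set: 9 Yes, 5 No
h_s = entropy([9, 5])          # ~0.940

# Entropy of the subsets produced by splitting on Wind
h_weak = entropy([6, 2])       # 8 Weak examples: 6 Yes, 2 No  -> ~0.811
h_strong = entropy([3, 3])     # 6 Strong examples: 3 Yes, 3 No -> 1.000

# IG(S, Wind) = H(S) - sum over values of P(x) * H(S_x)
ig_wind = h_s - (8 / 14) * h_weak - (6 / 14) * h_strong
print(round(h_s, 3), round(ig_wind, 3))   # 0.94 0.048
```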
Classification and Regression Trees (CART)
In a regression tree, a regression model is fit to the target variable using each of the independent variables. After this, the data is split at several candidate points for each independent variable. At each such point, the error between the predicted values and the actual values is squared to get the Sum of Squared Errors (SSE). The SSE is compared across the variables, and the variable or point which has the lowest SSE is chosen as the split point. This process is continued recursively.
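To make the split-selection step concrete, here is a minimal sketch (assumed data and variable names, not from the original material) of how a regression tree evaluates candidate split points on one independent variable and keeps the one with the lowest SSE:

```python
def best_split(x, y):
    # Return (split_point, sse): the threshold that minimizes the Sum of
    # Squared Errors when each side of the split predicts its own mean.
    best = (None, float("inf"))
    for split in sorted(set(x))[1:]:   # candidate thresholds
        left = [yi for xi, yi in zip(x, y) if xi < split]
        right = [yi for xi, yi in zip(x, y) if xi >= split]
        sse = sum((yi - sum(left) / len(left)) ** 2 for yi in left) \
            + sum((yi - sum(right) / len(right)) ** 2 for yi in right)
        if sse < best[1]:
            best = (split, sse)
    return best

# The target roughly jumps at x = 5, so the best split is found there
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 11, 10, 12, 30, 31, 29, 32]
print(best_split(x, y))   # (5, 7.75)
```

A full regression tree would apply this search to every independent variable, pick the overall best split, and then repeat the process recursively on each resulting subset.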
Classification and regression tree tutorials and presentations exist in abundance, which is a testament to the popularity of these decision trees and how frequently they are used. However, these decision trees are not without their disadvantages. There are many examples where the use of a decision tree has not led to the optimal result. Here are some of the limitations of classification and regression trees.
(i) Overfitting
Overfitting occurs when the tree takes into account a lot of noise that exists in the data
and comes up with an inaccurate result.
(ii) High variance
In this case, a small variation in the data can lead to a very large variation in the prediction, thereby affecting the stability of the outcome.
(iii) Low bias
A decision tree that is very complex usually has low bias but high variance, which makes it very difficult for the model to generalize to new data.
Regression
Regression Analysis in Machine learning
Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us to understand how the value of the dependent variable changes corresponding to one independent variable when the other independent variables are held
fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, which spends a certain amount on advertising every year and observes the resulting sales. The list below shows the advertising spend of the company in the last 5 years and the corresponding sales:
Now, the company wants to spend $200 on advertising in the year 2019 and wants to predict the sales for that year. To solve such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict the continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time series modeling, and determining causal relationships between variables.
In regression, we plot a graph between the variables which best fits the given datapoints, and using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through the datapoints on the target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." The distance between the datapoints and the line tells whether the model has captured a strong relationship or not.
Types of Regression
There are various types of regression which are used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all the regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o If there is only one input variable (x), then such linear regression is called simple linear regression. If there is more than one input variable, then it is called multiple linear regression.
o The relationship between the variables in a linear regression model can be explained using the below image, where we are predicting the salary of an employee on the basis of years of experience.
o A popular application of linear regression is salary forecasting.
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
o The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc.
o Logistic regression is a type of regression, but it differs from the linear regression algorithm in terms of how it is used.
o Logistic regression uses the sigmoid function (logistic function) to model the data. The function can be written as:
f(x) = 1 / (1 + e^(-x))
When we provide the input values (data) to the function, it gives an S-shaped curve, as illustrated below.
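The sketch below (not from the original material) shows the sigmoid function and how its output stays between 0 and 1, which is what produces the S-curve:

```python
import math

def sigmoid(z):
    # Logistic function f(z) = 1 / (1 + e^(-z)); maps any real value into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 3))   # small inputs -> near 0, large inputs -> near 1
```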
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between
the variables. Consider the below image:
y = a0 + a1x + ε
Here,
y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The values of the x and y variables are the training data used to fit the Linear Regression model.
Linear regression can be further divided into two types of algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line, which means that the error between the predicted values and the actual values should be minimized. The best fit line will have the least error.
The different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line, and to calculate this we use a cost function.
Cost function-
o The different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) Σ (Yi - (a1xi + a0))^2
Where,
N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.
o It is done by randomly selecting initial values of the coefficients and then iteratively updating them to reach the minimum of the cost function, as sketched in the example below.
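As a minimal sketch (the data, learning rate, and iteration count are assumed for illustration and are not from the original material), gradient descent for the MSE cost of y = a0 + a1x can be written as:

```python
def gradient_descent(x, y, lr=0.01, epochs=5000):
    # Start from arbitrary coefficients and repeatedly step opposite the gradient of MSE
    a0, a1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        pred = [a0 + a1 * xi for xi in x]
        # Partial derivatives of MSE = (1/N) * sum (yi - (a1*xi + a0))^2
        d_a0 = (-2 / n) * sum(yi - pi for yi, pi in zip(y, pred))
        d_a1 = (-2 / n) * sum((yi - pi) * xi for yi, pi, xi in zip(y, pred, x))
        a0 -= lr * d_a0
        a1 -= lr * d_a1
    return a0, a1

# Data generated from y = 2x + 1, so the fitted coefficients should come out close to (1, 2)
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
print(gradient_descent(x, y))
```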
Model Performance:
The goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the below method:
1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted values and the actual values, and hence represents a good model. It is computed as R-squared = 1 - (SS_res / SS_tot), as shown in the sketch below.
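A minimal sketch (with made-up actual and predicted values, not from the original material) of the R-squared calculation:

```python
def r_squared(actual, predicted):
    # R^2 = 1 - SS_res / SS_tot
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in actual)                 # total sum of squares
    return 1 - ss_res / ss_tot

actual = [3, 5, 7, 9, 11]
predicted = [2.8, 5.1, 7.0, 9.2, 10.9]
print(round(r_squared(actual, predicted), 4))   # close to 1 -> good fit
```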
Assumptions of Linear Regression:
o Homoscedasticity Assumption:
Homoscedasticity is a situation in which the error term is the same for all values of the independent variables. With homoscedasticity, there should be no clear pattern in the distribution of the data in the scatter plot.
o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually occurs when there is a dependency between the residual errors.
Simple Linear Regression:
Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear (a sloped straight line), hence it is called Simple Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a
continuous/real value. However, the independent variable can be measured on continuous or
categorical values.
Simple Linear Regression is mainly used to:
o Model the relationship between two variables, such as the relationship between income and expenditure, or experience and salary.
o Forecast new observations, such as weather forecasting according to temperature, or the revenue of a company according to the investments made in a year.
The Simple Linear Regression model can be represented as:
y = a0 + a1x + ε
Where,
a0 = intercept of the Regression line (can be obtained by putting x = 0)
a1 = slope of the regression line, which tells whether the line is increasing or decreasing
ε = the error term (for a good model it will be negligible)
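Below is a minimal sketch (the experience/salary numbers are made up for illustration, not from the original material) of estimating a0 and a1 for Simple Linear Regression using the ordinary least-squares formulas a1 = cov(x, y) / var(x) and a0 = mean(y) - a1 * mean(x):

```python
def fit_simple_linear(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    a1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)        # slope
    a0 = mean_y - a1 * mean_x                        # intercept
    return a0, a1

experience = [1, 2, 3, 4, 5]        # years of experience
salary = [30, 35, 42, 48, 55]       # salary (in thousands)
a0, a1 = fit_simple_linear(experience, salary)
print(a0, a1)                       # intercept and slope of the best fit line
print(a0 + a1 * 6)                  # predicted salary for 6 years of experience
```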
Multiple Linear Regression:
Multiple Linear Regression (MLR) extends Simple Linear Regression by using two or more independent variables to predict the dependent variable.
Example: Prediction of CO2 emission based on engine size and number of cylinders in a car.
o Each feature variable must model a linear relationship with the dependent variable.
MLR equation:
In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear Regression, the same form is extended, and the equation becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
Where,
Y = Output/Response variable
b0, b1, b2, b3, ..., bn = Coefficients of the model
x1, x2, x3, ..., xn = Independent/feature variables
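A minimal sketch (the engine-size/cylinder/CO2 values are made up for illustration, not from the original material) of estimating the MLR coefficients b0...bn by ordinary least squares with NumPy:

```python
import numpy as np

# Hypothetical data: engine size (litres) and number of cylinders vs. CO2 emission
X = np.array([[1.6, 4], [2.0, 4], [2.4, 4], [3.0, 6], [3.5, 6], [4.5, 8]])
y = np.array([150, 165, 180, 210, 230, 270])

# Prepend a column of ones so the first fitted coefficient acts as the intercept b0
X_design = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)

b0, b1, b2 = coeffs
print(b0, b1, b2)                # intercept and feature coefficients
print(b0 + b1 * 2.5 + b2 * 4)    # predicted CO2 for a 2.5 L, 4-cylinder car
```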
Assumptions for Multiple Linear Regression:
o A linear relationship should exist between the Target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.