Accuracy Assessment and Confusion Matrix

Machine Learning

Module1_3
Regression and Evaluation Metrics

Dr. Abhishek Bhatt

[email protected]
Topics to be covered

• Various Regression Types
• Dimensionality Reduction (Why and Types)
• Evaluation Metrics
Regression techniques
There are various kinds of regression techniques available to make predictions. These
techniques are mostly driven by three factors: the number of independent variables, the
type of dependent variable, and the shape of the regression line.


What are the types of Regression?
• Linear Regression
• Logistic Regression
• Polynomial Regression
• Stepwise Regression
• Ridge Regression
• Lasso Regression
• ElasticNet Regression
Linear Regression
• In this technique, the dependent variable is continuous, the independent variable(s)
can be continuous or discrete, and the nature of the regression line is linear.
• Linear Regression establishes a relationship between the dependent variable (Y) and
one or more independent variables (X) using a best-fit straight line (also known as the
regression line).
• It is represented by the equation Y = a + b*X + e, where a is the intercept, b is the slope
of the line and e is the error term. This equation can be used to predict the value of the
target variable based on given predictor variable(s).
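
As a minimal sketch (the data values here are illustrative, not from the slides), the intercept a and slope b can be estimated with NumPy and then used for prediction:

```python
import numpy as np

# Illustrative predictor X and target Y
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Least-squares estimates of slope b and intercept a for Y = a + b*X + e
b, a = np.polyfit(X, Y, deg=1)

# Predict the target for a new predictor value
y_new = a + b * 6.0
print(f"a={a:.3f}, b={b:.3f}, prediction at X=6: {y_new:.3f}")
```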
How to obtain the best-fit line
• by the Least Square Method
• Because the deviations are first squared, when added there is no cancelling out
between positive and negative values.

We can evaluate the model performance using the metric R-square.
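
Continuing with the same illustrative data, R-square is one minus the ratio of the residual sum of squares to the total sum of squares:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])
b, a = np.polyfit(X, Y, deg=1)

# R-square = 1 - SS_res / SS_tot
Y_pred = a + b * X
ss_res = np.sum((Y - Y_pred) ** 2)    # residual sum of squares
ss_tot = np.sum((Y - Y.mean()) ** 2)  # total sum of squares
print(f"R-square = {1 - ss_res / ss_tot:.4f}")
```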


Points to remember in Linear Reg.
• There must be a linear relationship between
the independent and dependent variables.
• Multiple regression suffers
from multicollinearity, autocorrelation and
heteroskedasticity.
• Linear Regression is very sensitive to outliers.
They can terribly affect the regression line
and eventually the forecasted values.
Logistic Regression
• Logistic regression is used to find the
probability of event = Success and
event = Failure. We should use logistic
regression when the dependent variable is
binary (0/1, True/False, Yes/No) in nature.
• Since we are working here with a binomial
distribution (dependent variable), we need to
choose a link function which is best suited for
this distribution, and that is the logit function.
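
A brief sketch of the logit link and its inverse, the sigmoid, which maps a linear predictor back to a probability (the numbers are illustrative):

```python
import numpy as np

def logit(p):
    """Log-odds: the link function suited to a binomial outcome."""
    return np.log(p / (1 - p))

def sigmoid(z):
    """Inverse logit: maps a linear predictor to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

# A success probability of 0.8 corresponds to log-odds of about 1.386
print(logit(0.8))       # ~1.3863
print(sigmoid(1.3863))  # ~0.8
```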
Logistic Regression
• Logistic regression is widely used for classification problems.
• Logistic regression doesn't require a linear relationship between the dependent and
independent variables. It can handle various types of relationships because it
applies a non-linear log transformation to the predicted odds ratio.
• To avoid overfitting and underfitting, we should include all significant variables. A
good approach to ensure this practice is to use a stepwise method to estimate the
logistic regression.
• It requires large sample sizes because maximum likelihood estimates are less
powerful at low sample sizes than ordinary least squares.
• The independent variables should not be correlated with each other, i.e. no
multicollinearity. However, we have the option to include interaction effects of
categorical variables in the analysis and in the model.
• If the values of the dependent variable are ordinal, it is called Ordinal Logistic
Regression.
• If the dependent variable is multi-class, it is known as Multinomial Logistic
Regression.
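
As a minimal sketch (synthetic data generated with scikit-learn, not from the slides), fitting a binary logistic regression looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (illustrative)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()        # fit by maximum likelihood
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on held-out data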
Logistic Regression
We use two types of algorithms (depending on
the kind of output they create):
• Class output: Algorithms like SVM and KNN
create a class output. For instance, in a binary
classification problem, the outputs will be
either 0 or 1.
• Probability output: Algorithms like Logistic
Regression, Random Forest, Gradient
Boosting, AdaBoost etc. give probability
outputs, to which a threshold is applied to
obtain class labels.
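
As a minimal sketch (illustrative probabilities), applying a threshold converts probability outputs into class labels:

```python
import numpy as np

# Illustrative predicted probabilities from some probabilistic classifier
proba = np.array([0.12, 0.47, 0.55, 0.91, 0.30])

threshold = 0.5                            # decision threshold; tune per problem
labels = (proba >= threshold).astype(int)
print(labels)                              # [0 0 1 1 0]
```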
Confusion Matrix
• A confusion matrix is an N x N matrix, where N is the number of
classes being predicted. For the problem at hand, we have N = 2, and
hence we get a 2 x 2 matrix. Here are a few definitions you need to
remember for a confusion matrix:
• Accuracy: the proportion of the total number of predictions that
were correct.
• Positive Predictive Value or Precision: the proportion of predicted positive
cases that were correctly identified.
• Negative Predictive Value: the proportion of predicted negative cases that
were correctly identified.
• Sensitivity or Recall: the proportion of actual positive cases which
are correctly identified.
• Specificity: the proportion of actual negative cases which are
correctly identified.
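
A small sketch computing each of these from the 2 x 2 counts, using scikit-learn's confusion_matrix on illustrative labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative actual vs. predicted binary labels
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # correct / all predictions
precision   = tp / (tp + fp)                   # positive predictive value
npv         = tn / (tn + fn)                   # negative predictive value
recall      = tp / (tp + fn)                   # sensitivity
specificity = tn / (tn + fp)
print(accuracy, precision, npv, recall, specificity)
```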
Confusion Matrix

                      Predicted: Positive    Predicted: Negative
Actual: Positive      True Positive (TP)     False Negative (FN)
Actual: Negative      False Positive (FP)    True Negative (TN)
F1 Score
• F1-Score is the harmonic mean of the precision
and recall values for a classification problem.
The formula for F1-Score is as follows:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
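
A one-line sketch of the harmonic mean, with illustrative precision and recall values:

```python
precision, recall = 0.75, 0.60                        # illustrative values
f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean
print(f1)                                             # 0.666...
```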
Area Under the ROC curve
(AUC – ROC)
• The biggest advantage of using the ROC curve is
that it is independent of the change in the
proportion of responders.
• Let's first try to understand what an ROC
(Receiver Operating Characteristic) curve is. If we
look at the confusion matrix above, we observe
that for a probabilistic model, we get a different
value for each metric at each choice of threshold.
(AUC – ROC)
• Hence, for each sensitivity, we get a different
specificity. The two vary inversely as the
classification threshold changes.
(AUC – ROC)

The ROC curve is the plot between sensitivity and (1 – specificity). (1 – specificity) is
also known as the False Positive Rate, and sensitivity is also known as the True Positive
Rate.
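
In place of the original figure, here is a minimal sketch that computes the ROC points and the AUC with scikit-learn (illustrative labels and scores):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative actual labels and predicted probabilities
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

# One (fpr, tpr) point per threshold; fpr = 1 - specificity
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC =", roc_auc_score(y_true, y_score))
```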
Polynomial Regression
• A regression equation is a polynomial regression equation if the power of the
independent variable is more than 1, e.g. Y = a + b*X + c*X^2.
• In this regression technique, the best-fit line is not a straight line. It is rather a
curve that fits the data points.
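
A minimal sketch using scikit-learn's PolynomialFeatures pipeline (degree 2 and the data values chosen purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data with a curved relationship
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1 + 2 * X.ravel() + 0.5 * X.ravel() ** 2

# Raise the predictor to higher powers, then fit an ordinary linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[4.0]]))  # prediction on the fitted curve
```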
Underfitting/Overfitting

A high-bias, low-variance model would
underfit the data, which tells us that the
model is too simple. In the bulls-eye diagram
corresponding to this, all the points are close
together but they have not hit the right
mark/bulls-eye.

A low-bias, high-variance model would
overfit the data and would perform badly on
unseen data. We see in the diagram that some
of the points have hit the mark but most
haven't.
Stepwise Regression
• Used when we deal with multiple independent variables.
• In this technique, the selection of independent
variables is done with the help of an automatic
process, which involves no human intervention.
• Standard stepwise regression does two things: it adds
and removes predictors as needed at each step (a
score-based variant is sketched after this list).
– Forward selection starts with the most significant predictor in
the model and adds a variable at each step.
– Backward elimination starts with all predictors in the
model and removes the least significant variable at each
step.
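
scikit-learn has no p-value-based stepwise routine; its SequentialFeatureSelector offers the same forward/backward flavor but selects by cross-validated score rather than statistical significance. A sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 candidate predictors, only a few informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       random_state=0)

# direction="forward" adds one predictor per step; "backward" removes one
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected predictors
```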
Ridge Regression
• Ridge Regression is a technique used when the
data suffers from multicollinearity (independent
variables are highly correlated).
• In multicollinearity, even though the least squares
estimates (OLS) are unbiased, their variances are
large, which deviates the observed value far from
the true value.
• The least squares estimator βLS may provide a
good fit to the training data, but it will not fit
sufficiently well to the test data.
• By adding a degree of bias to the regression
estimates, ridge regression reduces the standard
errors.
• The equation for linear regression is
y = a + b*x
• This equation also has an error term. The
complete equation becomes
y = a + b*x + e
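
Ridge minimizes the least-squares objective plus an L2 penalty on the coefficients (alpha times the sum of squared coefficients). A minimal scikit-learn sketch with two deliberately correlated predictors:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Two highly correlated predictors (multicollinearity, illustrative)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=100)

# alpha controls the bias added to shrink the coefficient variance
model = Ridge(alpha=1.0)
model.fit(X, y)
print(model.coef_)  # shrunken, more stable estimates than plain OLS
```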
Dimension Reduction Techniques – Why?
• There are too many variables – do I need to explore
each and every variable?
• Are all variables important?
• All variables are numeric and what if they have multi-
collinearity? How can I identify these variables?
• Is there any machine learning algorithm that can
identify the most significant variables automatically?
• Solution: finding the most significant variables is the
process called dimensionality reduction (a PCA sketch
follows this list).
– Reduce the number of variables (significance)
– Arrange variables in terms of their utility
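
As one concrete example of dimensionality reduction, here is a minimal PCA sketch on synthetic data; PCA ranks the derived components by the variance they explain:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic data with 20 variables, many of them redundant
X, _ = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)

pca = PCA(n_components=4)             # keep the 4 strongest directions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (300, 4)
print(pca.explained_variance_ratio_)  # utility of each component
```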
