Logistic Regression


Logistic regression is a type of regression analysis used for predicting the outcome of a Categorical dependent
variable (a dependent variable that can take on a limited number of categories) based on one or more predictor
variables (Continuous, Ordinal or Categorical).

In binary logistic regression, the outcome (dependent) variable is binary, or dichotomous, i.e. it takes the value 0 or 1.

Example:

a customer will churn (1) or not (0)

a customer will respond to a campaign (1) or not (0)

should we grant a loan to a particular person (1) or not (0)

Logistic regression measures the relationship between a categorical dependent variable and one or more independent
variables by expressing the dependent variable as probability scores (P).

The probability score is the probability of the event happening, for example the probability that a customer churns or
responds to a campaign.
How is Logistic Regression different from Linear Regression
In linear regression, the outcome variable is continuous and the predictor variables can be a mix of numeric and
categorical. But there are often situations where we wish to evaluate the effects of multiple explanatory variables on
a binary outcome variable.
For example, the effects of a number of factors on whether or not a disease develops. A patient may be
cured or not; a prospect may respond or not; a loan may be granted to a particular person or not, etc.
When the outcome or dependent variable is binary, and we wish to measure the effects of several independent
variables on it, we use logistic regression.
The predicted probability does not vary linearly with the predictors; it follows a sigmoid function, so values tend to
lie close to 0 or 1.
 The binary outcome variable can be coded as 0 or 1.
 The logistic curve is shown in the figure below:

Sigmoid Function
Concept of Sigmoid Function in Logistic Regression
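 The sigmoid function is a bounded function: its value always lies between 0 and 1.

 If b (the coefficient of x in the logit a + bx) is positive, the curve rises from 0 to 1 in an S shape.

 If b is negative, the S shape is reversed.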


 If we use linear regression, the predicted
value can become greater than one or less
than zero.
 Basically Y is a random variable having 0 or 1
outcome, which is a Bernoulli random
variable.
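
The points above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sigmoid is written as p = 1 / (1 + e^-(a + bx)); the function name `sigmoid`, the sample x values, and the straight line used for comparison are all made up for the example.

```python
import math

def sigmoid(x, a=0.0, b=1.0):
    """Return p = 1 / (1 + e^-(a + b*x)); a and b are illustrative parameters."""
    z = a + b * x
    # Use the algebraically equivalent form for negative z to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

xs = [-6, -2, 0, 2, 6]
rising = [sigmoid(x, b=1.0) for x in xs]    # b > 0: increasing S shape
falling = [sigmoid(x, b=-1.0) for x in xs]  # b < 0: the S shape is reversed

# A straight line (hypothetical linear fit), by contrast, is unbounded.
line = [0.5 + 0.25 * x for x in xs]

print(rising)
print(falling)
print(line)
```

Every sigmoid value stays strictly between 0 and 1, while the straight line leaves that range at both ends.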

log of odds:

ln(p / (1 - p)) = a + bx
This is also called the logit function.

The parameters are estimated by Maximum Likelihood Estimation (because the model is non-linear), unlike linear
regression, where the method of Ordinary Least Squares is used.
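
To give a feel for what maximum likelihood estimation does, here is a minimal Python sketch that fits a and b by plain gradient ascent on the Bernoulli log-likelihood. The toy data, learning rate, and number of steps are assumptions chosen for illustration; statistical packages typically use faster methods, such as iteratively reweighted least squares.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(a, b, xs, ys):
    """Bernoulli log-likelihood of the data under p = sigmoid(a + b*x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(a + b * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit(xs, ys, rate=0.05, steps=2000):
    """Gradient ascent on the log-likelihood; rate and steps are arbitrary choices."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        # Partial derivatives of the log-likelihood with respect to a and b.
        grad_a = sum(y - sigmoid(a + b * x) for x, y in zip(xs, ys))
        grad_b = sum((y - sigmoid(a + b * x)) * x for x, y in zip(xs, ys))
        a += rate * grad_a
        b += rate * grad_b
    return a, b

# Toy data (hypothetical): larger x makes y = 1 more likely, with some overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

a_hat, b_hat = fit(xs, ys)
print(a_hat, b_hat)
print(log_likelihood(0.0, 0.0, xs, ys), log_likelihood(a_hat, b_hat, xs, ys))
```

The fitted parameters give a higher log-likelihood than the starting point a = b = 0, which is the sense in which they are "more likely" given the data.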
Odds Ratio
Odds is calculated as P(Y = 1)/P(Y = 0)
Odds > 1 if Y = 1 is more likely
Odds < 1 if Y = 0 is more likely

log(Odds) = β0 + β1x1 + β2x2 + … + βkxk

This is called the logit, and it looks like a linear regression equation.

The bigger the logit, the bigger P(Y = 1).
Quick Question 1
Suppose the coefficients of a logistic regression model with two
independent variables are as follows:

β0 = -1.5, β1 = 3, β2 = -0.5

And we have an observation with the following values of independent variables:

x1 = 1, x2 = 5

What is the value of the Logit for this observation? Recall that the Logit is
log(Odds)

What is the value of the Odds for this observation? Note that you can
compute e^x, for some number x, in your R console by typing exp(x). The
function exp() computes the exponential of its argument

What is the value of P(y = 1) for this observation?
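
One way to work through the three parts is sketched below in Python, where `math.exp` plays the role of R's exp(). The variable names are illustrative; the coefficients and observation are those given in the question.

```python
import math

beta0, beta1, beta2 = -1.5, 3.0, -0.5    # coefficients from the question
x1, x2 = 1.0, 5.0                        # observation from the question

logit = beta0 + beta1 * x1 + beta2 * x2  # Logit = log(Odds)
odds = math.exp(logit)                   # Odds = e^Logit
p = odds / (1.0 + odds)                  # P(y = 1) = Odds / (1 + Odds)

print(logit, round(odds, 4), round(p, 4))  # -1.0 0.3679 0.2689
```

Because the logit is negative, the odds are below 1 and P(y = 1) is below 0.5.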


Applications of logistic regression in business

 Response Model: response to an e-mail campaign
 Conversion Model: subscriber conversion after a campaign
 Attrition Model: churning of subscribers
 Application Risk Model: finding the probability of loan defaults
 Behavioral Risk Model: finding credit card defaulters by demography and behavior
 Cross Sell / Up Sell Model: finding parameters that boost cross sell and up sell

These are just a few of them.


Logistic Process
THE FRAMINGHAM HEART STUDY
Evaluating Risk Factors to Save Lives

Misconceptions about blood pressure in the first half of the 20th century


High blood pressure, dubbed hypertension, was considered important to
force blood through arteries, and it was considered harmful to lower
blood pressure.
In the late 1940s, the US government set out to better understand
cardiovascular disease.
The plan was to track a large cohort of initially healthy patients over
their lifetimes
A city was chosen, the city of Framingham, Massachusetts, to be the site
for the study
 Appropriate Size
 Stable population
5209 patients aged 30 to 59 were enrolled
Patients were given a questionnaire and an exam every 2 years:
 Physical characteristics
 Behavioral characteristics
Use the anonymized version of the original data that was collected
 Includes several demographic risk factors:
 the sex of the patient, male or female;
 the age of the patient in years;
 the education level coded as either 1 for some high school, 2 for a
high school diploma or GED, 3 for some college or vocational
school, and 4 for a college degree.
 Includes behavioral risk factors:
 Does the patient smoke (yes/no)
 Medical history - blood pressure medication, previously had a
stroke, hypertensive or not, diabetic or not
 Includes risk factors from physical examination:
 Cholesterol level
 Systolic/diastolic blood pressure
 Body mass index
 Heart rate
 Blood glucose level
Building the model
Split our data randomly into training and testing sets.
Use logistic regression to predict whether or not a patient experienced
Coronary Heart Disease (CHD) within 10 years of first examination.
After building the model, we will evaluate the predictive power of the
model on the test set.
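
The random split can be sketched as follows. This is a minimal standard-library Python illustration; the helper name `train_test_split`, the 65% training fraction, the seed, and the stand-in patient records are all assumptions, not values from the study. A plain random split like this does not guarantee that both sets contain the same proportion of CHD cases.

```python
import random

def train_test_split(rows, train_fraction=0.65, seed=1000):
    """Randomly split rows into (train, test); fraction and seed are arbitrary."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for patient records: (patient_id, chd_within_10_years)
patients = [(i, 1 if i % 7 == 0 else 0) for i in range(100)]

train, test = train_test_split(patients)
print(len(train), len(test))
```

The model would then be fitted on `train` only, and `test` kept aside for evaluation.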
Threshold Value
The outcome of a logistic regression model is a probability
Often, we want to make a binary prediction – whether this person will
suffer from CHD or not
We can do this using a threshold value t

If P(CHD = 1) ≥ t, predict CHD


If P(CHD = 1) < t, predict healthy

What value should we pick?


The threshold is often selected based on which type of error is less costly.
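
A minimal Python sketch of applying a threshold t to predicted probabilities. The function name `predict_class` and the five probabilities are hypothetical.

```python
def predict_class(probabilities, t=0.5):
    """Predict 1 (CHD) when P(CHD = 1) >= t, otherwise 0 (healthy)."""
    return [1 if p >= t else 0 for p in probabilities]

# Hypothetical predicted probabilities for five patients.
probs = [0.05, 0.20, 0.35, 0.60, 0.90]

low_t = predict_class(probs, t=0.2)    # a low threshold flags more patients
high_t = predict_class(probs, t=0.7)   # a high threshold flags fewer patients

print(low_t)   # [0, 1, 1, 1, 1]
print(high_t)  # [0, 0, 0, 0, 1]
```

The same probabilities lead to different predictions depending on t, which is why the choice of threshold depends on which mistakes matter more.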
Confusion Matrix
Compare actual outcomes to predicted outcomes using a confusion
matrix (classification matrix)

 Sensitivity = TP/(TP + FN)


 Specificity = TN/(TN + FP)
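
As a sketch of how the four counts arise, the Python below tallies them from actual and predicted labels, taking 1 as the positive class: TP (true positives), FN (false negatives), TN (true negatives), and FP (false positives). The function name and the eight label pairs are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, TN, FP for binary labels coded as 0 or 1."""
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(1 for a, p in pairs if a == 1 and p == 1),
        "FN": sum(1 for a, p in pairs if a == 1 and p == 0),
        "TN": sum(1 for a, p in pairs if a == 0 and p == 0),
        "FP": sum(1 for a, p in pairs if a == 0 and p == 1),
    }

# Hypothetical labels for eight observations.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 0]

counts = confusion_counts(actual, predicted)
sensitivity = counts["TP"] / (counts["TP"] + counts["FN"])
specificity = counts["TN"] / (counts["TN"] + counts["FP"])

print(counts)                    # {'TP': 2, 'FN': 2, 'TN': 3, 'FP': 1}
print(sensitivity, specificity)  # 0.5 0.75
```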
Receiver Operating Characteristic
 The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Sensitivity)
against the False Positive Rate (1 - Specificity).
 The model's ability to separate the two classes is measured by the area under the ROC curve. The greater the area
under the curve, the better the model. An area of 1 represents a perfect test.
 Each point on the ROC curve represents a cutoff probability. These cutoff points represent the
tradeoff between sensitivity and specificity.
 Ideally the goal should be to have high values for both sensitivity and specificity.
Selecting a Threshold using ROC
 Captures all thresholds simultaneously
 High threshold means High specificity and Low sensitivity
 Low Threshold means Low specificity and High sensitivity
 Choose best threshold for best trade off:
 cost of failing to detect positives
 costs of raising false alarms
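
The ROC construction can be sketched in Python as follows: sweep the threshold from high to low, record one (false positive rate, true positive rate) point per threshold, and estimate the area under the curve with the trapezoidal rule. The function names, labels, and scores are hypothetical.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # a threshold above every score predicts no positives
    for t in sorted(set(scores), reverse=True):  # high threshold to low threshold
        tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s >= t)
        fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s >= t)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical actual outcomes and predicted probabilities.
actual = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

points = roc_points(actual, scores)
area = auc(points)
print(points)
print(round(area, 4))  # 0.8889
```

High thresholds give the points near (0, 0), with high specificity and low sensitivity. Low thresholds give the points near (1, 1).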
Compute Outcome Measures
 Overall accuracy = (TN + TP)/N
 Overall error rate = (FP + FN)/N
 Sensitivity = TP/(TP + FN)
 Specificity = TN/(TN + FP)
 False negative error rate = FN/(TP + FN)
 False positive error rate = FP/(TN + FP)
Quick Question 2
Using the confusion matrix below, answer the following questions.

            Predicted FALSE   Predicted TRUE
Actual 0               1069                6
Actual 1                187               11
What is the sensitivity of our logistic regression model on the test set,
using a threshold of 0.5?
What is the specificity of our logistic regression model on the test set,
using a threshold of 0.5?
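
The outcome measures for this matrix can be computed with the short Python sketch below. It assumes the rows (0/1) are the actual outcomes and the columns (FALSE/TRUE) are the predictions at threshold 0.5; the variable names are illustrative.

```python
# Counts read from the matrix, assuming rows are actual outcomes (0/1)
# and columns are predictions (FALSE/TRUE).
TN, FP = 1069, 6
FN, TP = 187, 11
N = TN + FP + FN + TP

measures = {
    "accuracy": (TN + TP) / N,
    "error_rate": (FP + FN) / N,
    "sensitivity": TP / (TP + FN),
    "specificity": TN / (TN + FP),
    "false_negative_rate": FN / (TP + FN),
    "false_positive_rate": FP / (TN + FP),
}

for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Under that reading, the model at threshold 0.5 rarely predicts CHD, so specificity is high and sensitivity is low.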
Thank You
