(BA ZG524/MBA ZG538/PDBA ZG538) Advanced Statistical Methods Lecture No: 11 (13-04-24)


BITS Pilani

Pilani Campus

Logistic regression is a statistical method for predicting a binary outcome, such as
yes or no, from prior observations of a set of independent variables.
For example, logistic regression could be used to predict whether a political
candidate will win or lose an election, or whether a high school student will
pass or fail an exam.
Such binary outcomes allow straightforward decisions between two
alternatives.

BITS Pilani, Pilani Campus


Logistic Regression Equation
If the dependent variable (DV) y is coded as 0 or 1, the value of E(y) in the
logistic regression equation gives the probability that y = 1 given a particular
set of values of the independent variables.
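The equation itself is missing from these notes. As a reference point, the standard form of the logistic regression equation for p independent variables x_1, ..., x_p, with y coded as 0 or 1, is:

```latex
\[
E(y) = P(y = 1 \mid x_1, \ldots, x_p)
     = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}
            {1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}
\]
```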



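To get a better understanding of the logistic regression equation, consider the
case of a single independent variable x: E(y) follows an S-shaped curve that
stays between 0 and 1 as x varies.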
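A minimal R sketch of that shape for one independent variable. The coefficients b0 = -7 and b1 = 3 and the helper name logistic_mean are hypothetical, chosen only to illustrate the curve; they are not estimates from any data set in these notes.

```r
# Logistic regression equation for one independent variable.
# b0 and b1 are hypothetical values chosen only to show the shape of the curve.
logistic_mean <- function(x, b0, b1) {
  exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
}

b0 <- -7
b1 <- 3
x <- seq(0, 5, by = 0.5)
ey <- logistic_mean(x, b0, b1)

# E(y) rises from near 0 to near 1 as x increases.
print(round(data.frame(x = x, Ey = ey), 4))

# Uncomment to draw the curve:
# curve(logistic_mean(x, b0, b1), from = 0, to = 5, xlab = "x", ylab = "E(y)")
```

With these values, E(y) crosses 0.5 where b0 + b1 * x = 0, that is, at x = 7/3.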



Example:

Let us consider an application of logistic regression involving a direct mail
promotion being used by Simmons Stores.
Simmons owns and operates a national chain of women’s apparel stores.
5000 copies of an expensive four-color sales catalog have been printed, and each
catalog includes a coupon that provides a $50 discount on purchases of
$200 or more.
The catalogs are expensive, and Simmons would like to send them only to
those customers who have the highest probability of using the coupon.
Source: David R. Anderson, Dennis J. Sweeney, Thomas A. Williams, Jeffrey D.
Camm and James J. Cochran, Statistics for Business and Economics,
Twelfth edition, Cengage Learning, 2014, pp. 771-779.



Variables
• Management thinks that annual spending at Simmons Stores
and whether a customer has a Simmons credit card are two
variables that might be helpful in predicting whether a
customer who receives the catalog will use the coupon.
• Simmons conducted a pilot study using a random sample of
50 customers who have a Simmons credit card and 50
customers who do not have the card.
• Simmons sent the catalog to each of the 100 customers.
• At the end of the study, Simmons noted whether each customer used the
coupon or not.



Dataset

The data are available in simmons.csv (web file).

Source : Simmons data file



Explanation of Variables

The amount each customer spent last year at Simmons is shown
in thousands of dollars, and the credit card information has
been coded as 1 if the customer has a Simmons credit card and 0 if
not.
In the Coupon column, a 1 is recorded if the sampled customer
used the coupon and a 0 if not.



Estimating the Logistic Regression Equation

Output and Interpretation:

> buy <- read.csv(file.choose(), header = TRUE)
> buy
> fit <- glm(Coupon ~ Spending + Card, data = buy, family = "binomial")
> summary(fit)
Observe the output and interpret.



Call:
glm(formula = Coupon ~ Spending + Card, family = "binomial",
    data = buy)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6839  -1.0140  -0.6503   1.1216   1.8794

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.1464     0.5772  -3.718 0.000201 ***
Spending      0.3416     0.1287   2.655 0.007928 **
Card          1.0987     0.4447   2.471 0.013483 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 134.60  on 99  degrees of freedom
Residual deviance: 120.97  on 97  degrees of freedom
AIC: 126.97

Number of Fisher Scoring iterations: 4
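Reading the coefficients off this output, the estimated logistic regression equation can be written as follows. The symbols x_1 (annual spending, in thousands of dollars) and x_2 (credit card, 0 or 1) are labels assumed here to match the X1 and X2 used elsewhere in these notes.

```latex
\[
\hat{y} = \frac{e^{-2.1464 + 0.3416\,x_1 + 1.0987\,x_2}}
               {1 + e^{-2.1464 + 0.3416\,x_1 + 1.0987\,x_2}}
\]
```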
Tests of Significance:

Using the output above, discuss the overall significance of the model and the
individual significance of each coefficient.
Conduct a test of overall significance using the G statistic (chi-square test
statistic). Use a level of significance of 0.05.
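One way to carry out the G test directly from the deviances reported in the output is sketched below. It takes G as the drop from the null deviance to the residual deviance, with degrees of freedom equal to the number of independent variables; the variable names are illustrative.

```r
# Deviances and degrees of freedom copied from the summary(fit) output.
null_dev <- 134.60
null_df <- 99
resid_dev <- 120.97
resid_df <- 97

# G statistic: reduction in deviance achieved by adding Spending and Card.
G <- null_dev - resid_dev
df <- null_df - resid_df

# Compare G with the chi-square critical value at the 0.05 level.
alpha <- 0.05
critical <- qchisq(1 - alpha, df)
p_value <- pchisq(G, df, lower.tail = FALSE)
reject <- G > critical

cat("G =", round(G, 2), "on", df, "df\n")
cat("Critical value =", round(critical, 3), "\n")
cat("p-value =", round(p_value, 4), "\n")
cat("Reject H0 (all slope coefficients are zero):", reject, "\n")
```

Since G exceeds the critical value, the sketch rejects the hypothesis that both slope coefficients are zero, so the overall model is significant at the 0.05 level.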
We described how to develop the estimated logistic regression
equation and how to test it for significance.
Let us now use it to make a decision recommendation.

How can Simmons use this information to better target
customers for the new promotion?
Suppose Simmons wants to send the promotional catalog only to
customers who have a 0.40 or higher probability of using the
coupon.
Using the estimated probabilities for each combination of annual spending
and credit card status, the promotion strategy is:
Customers who have a Simmons credit card: send the catalog to
every customer who spent $2000 or more last year.
Customers who do not have a Simmons credit card: send the
catalog to every customer who spent $6000 or more last year.
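The table of estimated probabilities is not reproduced in these notes. A sketch of how it could be rebuilt from the fitted coefficients is given below; the spending grid of $1000 to $7000 and the helper name p_coupon are assumptions made for illustration.

```r
# Coefficients copied from the summary(fit) output in these notes.
b0 <- -2.1464      # intercept
b_spend <- 0.3416  # annual spending, in thousands of dollars
b_card <- 1.0987   # 1 = has a Simmons credit card, 0 = does not

# Estimated probability that a customer uses the coupon.
p_coupon <- function(spending, card) {
  plogis(b0 + b_spend * spending + b_card * card)
}

spending <- 1:7  # $1000 to $7000 (grid chosen for illustration)
probs <- data.frame(
  spending = spending,
  card = p_coupon(spending, 1),
  no_card = p_coupon(spending, 0)
)
print(round(probs, 4))

# Smallest spending level on the grid that reaches the 0.40 cutoff.
cutoff <- 0.40
min_card <- min(spending[probs$card >= cutoff])
min_no_card <- min(spending[probs$no_card >= cutoff])
cat("Card holders: spending >=", min_card, "thousand dollars\n")
cat("Non-card holders: spending >=", min_no_card, "thousand dollars\n")
```

On this grid the cutoff is first reached at $2000 for card holders and at $6000 for customers without the card, which agrees with the strategy stated above.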
With logistic regression, it is difficult to interpret directly the
relation between the independent variables and the
probability that y = 1, because the logistic regression equation is nonlinear.
However, statisticians have shown that the relationship can
be interpreted indirectly using a concept called the
odds ratio.



The odds in favor of an event occurring is defined as the
probability that the event will occur divided by the probability
that the event will not occur.
Note:
In logistic regression, the event of interest is always y = 1.
Given a particular set of values for the independent variables,
the odds in favor of y = 1 can be computed as the probability that y = 1
divided by the probability that y = 0.
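In symbols, and using the standard logistic regression equation for E(y), the odds simplify to a single exponential:

```latex
\[
\text{odds}
  = \frac{P(y = 1 \mid x_1, \ldots, x_p)}{P(y = 0 \mid x_1, \ldots, x_p)}
  = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}
\]
```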



The odds ratio measures
the impact on the odds of a one-unit increase in only one of
the independent variables.
That is,
the odds ratio is the odds that y = 1 given that one of the
independent variables has been increased by one
unit (odds1) divided by the odds that y = 1 given no change
in the values of the independent variables (odds0).
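Written out for a one-unit increase in x_i with every other independent variable held fixed, all terms except the one for x_i cancel, so the odds ratio depends only on that variable's coefficient:

```latex
\[
\text{Odds ratio}
  = \frac{\text{odds}_1}{\text{odds}_0}
  = \frac{e^{\beta_0 + \cdots + \beta_i (x_i + 1) + \cdots + \beta_p x_p}}
         {e^{\beta_0 + \cdots + \beta_i x_i + \cdots + \beta_p x_p}}
  = e^{\beta_i}
\]
```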



Interpretation

Suppose we want to compare the odds of using the
coupon for customers who spend $2000 annually and have a Simmons
credit card (X1 = 2, X2 = 1) to the odds of using the coupon for
customers who spend $2000 annually and do not have a
Simmons credit card (X1 = 2, X2 = 0).

We are interested in interpreting the effect of a one-unit
increase in the independent variable X2.
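A worked version of this comparison, using the estimated coefficients from the R output (intercept -2.1464, Spending 0.3416, Card 1.0987) and values rounded to four decimals:

```latex
\begin{align*}
\text{odds}_1 &= e^{-2.1464 + 0.3416(2) + 1.0987(1)} = e^{-0.3645} \approx 0.6945 \\
\text{odds}_0 &= e^{-2.1464 + 0.3416(2) + 1.0987(0)} = e^{-1.4632} \approx 0.2315 \\
\text{Estimated odds ratio} &= \frac{\text{odds}_1}{\text{odds}_0}
  = e^{1.0987} \approx 3.00
\end{align*}
```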



Conclusion:
“The estimated odds in favor of using the coupon for customers who spent
$2000 last year and have a Simmons credit card are 3 times greater than the
estimated odds in favor of using the coupon for customers who spent
$2000 last year and do not have a Simmons credit card.”



Note:
The odds ratio for each independent variable is computed while holding
all the other independent variables constant.
It does not matter what constant values are used for the other independent variables.
For instance, if we computed the odds ratio for the Simmons credit card variable (X2) using $3000
instead of $2000 as the value for the annual spending variable (X1), we
would still obtain the same odds ratio (3.00).

“Thus, we can conclude that the estimated odds in favor of using the
coupon for customers who have a Simmons credit card are 3 times greater than the
estimated odds in favor of using the coupon for customers who do not
have a Simmons credit card.”
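This invariance can be checked numerically. The sketch below uses the coefficients from the R output; the helper name odds is illustrative and not part of the original analysis.

```r
# Coefficients copied from the summary(fit) output in these notes.
b0 <- -2.1464
b_spend <- 0.3416
b_card <- 1.0987

# Estimated odds in favor of using the coupon: p / (1 - p).
odds <- function(spending, card) {
  p <- plogis(b0 + b_spend * spending + b_card * card)
  p / (1 - p)
}

# Odds ratio for the credit card variable at two different spending levels.
or_at_2 <- odds(2, 1) / odds(2, 0)  # spending held at $2000
or_at_3 <- odds(3, 1) / odds(3, 0)  # spending held at $3000

# The same value is obtained directly from the Card coefficient.
or_card <- exp(b_card)

print(round(c(at_2000 = or_at_2, at_3000 = or_at_3, exp_coef = or_card), 2))
```

All three values agree, because the spending term cancels when the two odds are divided.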





BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
• In this day and age, researchers find
themselves with dozens or even hundreds of
different variables entering into their analyses.
• Whenever the data set becomes
unwieldy (in terms of the number of variables), the
analysis is further complicated by the fact that
there is often substantial redundancy among
dimensions, leading to high levels of
correlation and multicollinearity.

LDA Concept

Objectives:
PCA finds the most accurate representation of the data in a
lower-dimensional space by projecting the data in the
directions of maximum variance.

LDA finds a projection onto a line such that samples from
different classes are well separated.



Q. Refer to the Simmons Stores example introduced in this section. The dependent variable is coded as y = 1 if the
customer used the coupon and 0 if not. Suppose that the only information available to help
predict whether the customer will use the coupon is the customer’s credit card status, coded as
x = 1 if the customer has a Simmons credit card and x = 0 if not.
1. Write the logistic regression equation relating x to y.
2. What is the estimated odds ratio, and what is its interpretation?
3. Conduct a test of significance using the G statistic (chi-square test statistic). Use a level of significance of 0.05.

Call:
glm(formula = Y ~ X1, family = "binomial", data = LR)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.2116  -0.8106  -0.8106   1.1436   1.5956

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9445     0.3150  -2.999  0.00271 **
X1            1.0245     0.4235   2.419  0.01555 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 134.60  on 99  degrees of freedom
Residual deviance: 128.53  on 98  degrees of freedom
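One possible way to work the three parts from this output is sketched below. It assumes that X1 in this fit is the credit card indicator x from the question, and it uses the chi-square critical value 3.841 for 1 degree of freedom at the 0.05 level.

```latex
\begin{align*}
\text{1. } & \hat{y} = \frac{e^{-0.9445 + 1.0245\,x}}{1 + e^{-0.9445 + 1.0245\,x}} \\
\text{2. } & \text{Estimated odds ratio} = e^{1.0245} \approx 2.79 \\
\text{3. } & G = 134.60 - 128.53 = 6.07, \quad df = 99 - 98 = 1, \quad
             \chi^2_{0.05,\,1} = 3.841
\end{align*}
```

Under these assumptions, the estimated odds of using the coupon are about 2.79 times greater for customers with a Simmons credit card than for customers without one. Since 6.07 > 3.841, the hypothesis that the coefficient of x is zero is rejected at the 0.05 level.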

Call:
glm(formula = Y ~ ., family = "binomial", data = LR)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4174  -0.7444  -0.5674   0.8416   1.9893

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.63348    0.79851  -3.298 0.000974 ***
X1           0.22018    0.09001   2.446 0.014441 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.086  on 49  degrees of freedom
Residual deviance: 51.626  on 48  degrees of freedom

Odds ratio interpretation:
The estimated odds ratio is exp(0.22018) = 1.2463 for a one-unit increase in X1
(here, $100 of average monthly balance).
“The estimated odds of signing up for payroll direct deposit for customers
who have an average monthly balance of $600 are 1.2463 times
greater than the estimated odds of signing up for payroll direct deposit
for customers who have an average monthly balance of $500.”

Factor analysis is a way to take a mass of data and
shrink it to a smaller data set that is more manageable
and more understandable. It’s a way to find hidden
patterns.
It is also used to group similar items into a set of variables
and label them.
It can be a very useful tool for complex data sets
involving psychological studies, socioeconomic status and
other involved concepts.

Dataset : L1



Interpretation :
1.

2.



Datafile: L
> L <- read.csv(file.choose(), header = TRUE)
> datpca <- prcomp(L, center = TRUE, scale. = TRUE)
> summary(datpca)
R output:
Importance of components:
                          PC1     PC2
Standard deviation     1.4097 0.11304
Proportion of Variance 0.99   0.01
Cumulative Proportion  0.99   1.00
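The proportions in this output follow from the standard deviations alone. A minimal sketch, with the two standard deviations copied from the output and illustrative variable names:

```r
# Standard deviations of the principal components, copied from the R output.
sdev <- c(PC1 = 1.4097, PC2 = 0.11304)

# Each component's variance is its squared standard deviation.
variance <- sdev^2

# Proportion of total variance explained by each component, and the running total.
prop_var <- variance / sum(variance)
cum_prop <- cumsum(prop_var)

print(round(rbind(sdev, prop_var, cum_prop), 4))
```

PC1 alone accounts for about 99% of the total variance, so the two original variables can be summarized by a single component with little loss of information.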
