Binary Logistic Regression - 6.2
Binary Logistic Regression - 6.2
Binary Logistic Regression - 6.2
edu/stat504)
Home > 6.2 - Binary Logistic Regression with a Single Categorical Predictor
Key Concepts
Objectives
Understand the basic ideas behind modeling categorical data with binary logistic regression.
Understand how to fit the model and interpret the parameter estimates, especially in terms
of odds and odd ratios.
Overview
Binary logistic regression estimates the probability that a characteristic is present (e.g. estimate
probability of "success") given the values of explanatory variables, in this case a single categorical
variable ; = Pr (Y = 1|X = x). Suppose a physician is interested in estimating the proportion of
diabetic persons in a population. Naturally she knows that all sections of the population do not have
equal probability of success, i.e. being diabetic. Older population, population with hypertension,
individuals with diabetes incidence in family are more likely to have diabetes. Consider the predictor
variable X to be any of the risk factor that might contribute to the disease. Probability of success will
depend on levels of the risk factor.
Variables:
X = (X1, X2, ..., Xk) be a set of explanatory variables which can be discrete, continuous, or a
combination. xi is the observed value of the explanatory variables for observation i. In this
section of the notes, we focus on a single variable X.
Model:
exp(0 + 1 i )
i = Pr( i = 1| i = i) =
exp(0 + 1 x i )
i = Pr(Yi = 1|Xi = x i ) =
1 + exp(0 + 1 x i )
or,
logit(i ) = log (
1 i )
i
= 0 + 1 x i
= 0 + 1 x i1 + + k x ik
Assumptions:
The data Y1, Y2, ..., Yn are independently distributed, i.e., cases are independent.
Distribution of Yi is Bin(ni, i), i.e., binary logistic regression model assumes binomial
distribution of the response. The dependent variable does NOT need to be normally
distributed, but it typically assumes a distribution from an exponential family (e.g. binomial,
Poisson, multinomial, normal,...)
Does NOT assume a linear relationship between the dependent variable and the independent
variables, but it does assume linear relationship between the logit of the response and the
explanatory variables; logit() = 0 + X.
Independent (explanatory) variables can be even the power terms or some other nonlinear
transformations of the original independent variables.
The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in
many cases given the model structure.
Errors need to be independent but NOT normally distributed.
It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to
estimate the parameters, and thus relies on large-sample approximations.
Goodness-of-fit measures rely on sufficiently large samples, where a heuristic rule is that not
more than 20% of the expected cells counts are less than 5.
Model Fit:
Parameter Estimation:
The maximum likelihood estimator (MLE) for (0, 1) is obtained by finding (0 , 1 ) that maximizes:
N N exp{y i (0 + 1 x i )}
y ni yi
L(0 , 1 ) = i i (1 i ) =
i=1 i=1 1 + exp(0 + 1 x i )
In general, there are no closed-form solutions, so the ML estimates are obtained by using iterative
algorithms such as Newton-Raphson (NR), or Iteratively re-weighted least squares (IRWLS). In
Agresti (2013), see section 4.6.1 for GLMs, and for logistic regression, see sections 5.5.4-5.5.5.
More Advanced material (not required):
Brief overview of Newton-Raphson (NR). Suppose we want to maximize a loglikelihood l() with respect to a
parameter = (1 , , p )T . At each step of NR, the current estimate (t) is updated as
V () = [l ()]
1
(the inverse of the ``observed information''), whose diagonal elements give the estimated variance, that is the standard
errors, of the parameter estimates.
Applying NR to logistic regression. For the logit model with p 1 predictors X , = (0 , 1 , , p1 ) , the likelihood
is
N
ni !
L() = iyi (1 i )n i yi
y !(ni yi )!
i=1 i
N
(
1 i )
i yi
(1 i )n i
i=1
N
exi yi (1 + exi )
T T n i
,
i=1
( 1 + exiT )
N N N
l 1 xiT
= yi xij ni
(yi i )xij ,
e xij =
j i=1 i=1 i=1
where i = E(yi ) = ni i .
The second derivatives used in computing the standard errors of the parameter estimates, , are
( 1 + exiT )
N TN
2 l exi
= ni xij
ni i (1 i )xij xik
=
j k i=1
k i=1
For reasons including numerical stability and speed, it is generally advisable to avoid computing matrix inverses directly.
Thus in many implementations, clever methods are used to obtain the required information without directly constructing
the inverse, or even the Hessian.
exp(0) = the odds that the characteristic is present in an observation i when Xi = 0, i.e., at
baseline.
exp(1) = for every unit increase in Xi1, the odds that the characteristic is present is multiplied
by exp(1). This is similar to simple linear regression but instead of additive change it is a
multiplicative change in rate. This is an estimated odds ratio.
exp(0 + 1 (x i1 + 1))
= exp(1 )
exp(0 + 1 x i1 )
In general, the logistic model stipulates that the effect of a covariate on the chance of "success" is
linear on the log-odds scale, or multiplicative on the odds scale.
_________________________
Student smokes?
How many parents smoke? Yes (Z = 1) No (Z = 2)
Both (Y = 1) 400 1380
One (Y = 2) 416 1823
Neither (Y = 3) 188 1168
The test for independence yields X2 = 37.6 and G2 = 38.4 with 2 df (p-values are essentially zero),
so Y and Z are related. It is natural to think of Z as a response and Y as a predictor in this example.
We might be interested in exploring the dependency of student's smoking behavior on neither
parent smoking versus at least one parent smoking. Thus we can combine or collapse the first two
rows of our 3 2 table and look at a new 2 2 table:
Student Student does
smokes not smoke
12 parents smoke 816 3203
Neither parent smokes 188 1168
For the chi-square test of independence, this table has X2 = 27.7, G2 = 29.1, p-value 0, and =
1.58. Therefore, we estimate that a student is 58% more likely, on the odds scale, to smoke if he or
she has at least one smoking parent than none.
we want to model the "risk" of student smoking as a function of parents' smoking behavior.
we want to describe the differences between student smokers and nonsmokers as a function
of parents smoking behavior via descriptive discriminate analyses.
we want to predict probabilities that individuals fall into two categories of the binary response
as a function of some explanatory variables, e.g. what is the probability that a student is a
smoker given that neither of his/her parents smokes.
we want to predict that a student is a smoker given that neither of his/her parents smokes, i.e.
probabilities that individuals fall into two categories of the binary response as a function of
some explanatory variables, we want to classify new students into "smoking" or "nonsmoking"
group depending on parents smoking behavior.
we want to develop a social network model, adjust for "bias", analyze choice data, etc...
These are just some of the possibilities of logistic regression, which cannot be handled within a
framework of goodness-of-fit only.
yi ni
12 parents smoke 816 4019
Neither parent smokes 188 1356
where yi is the number of children who smoke, ni is the number of children for a given level of
parents' smoking behaviour, and i = P(yi = 1|xi) is the probability of a randomly chosen student i
smoking given parents' smoking status. Here i takes values 1 and 2. Thus, we claim that all
students who have at least one parent smoking will have the same probability of "success", and all
student who have neither parent smoking will have the same probability of "success", though for
the two groups success probabilities will be different. Then the response variable Y has a binomial
distribution:
y i Bin(ni , i )
Explanatory variable X is a dummy variable such that
Xi = 0 if neither parent smokes,
Xi = 1 if at least one parent smokes.
Think about the following question, then click on the icon to the left display an answer.
If data come in a tabular form, i.e., response pattern is with counts (as seen in the previous
example).
PROC GENMOD: We need a variable that specifies the number of cases that equals marginal
frequency counts or number of trials (e.g. n), and the number of events (y)
model y/n = x1 x2 /link=logit dist=binomial [any other options you may want];
PROC LOGISTIC: We do need a variable that specifies the number of cases that equals
marginal frequency counts
model y/n = x1 x2 / [put any other options you may want here];
If data come in a matrix form, i.e., subject variables matrix with one line for each subject, like a
database
PROC GENMOD: We need a variable that specifies the number of cases that equals marginal
frequency counts or number of trials (e.g. n), and the number of events (y)
model y/n = x1 x2 / link = logit dist = binomial [put any other options you
may want here];
PROC LOGISTIC: We do NOT need a variable that specifies the number of cases that equals
marginal frequency counts
For both GENMOD and LOGISTIC, as before, include interaction terms with *, and make sure to
include all lower order terms.
We will follow both the SAS output through to explain the different parts of model fitting. The
outputs from R will be essentially the same.
In the data step, the dollar sign $ as before indicates that S is a character-string variable.
descending
insures that you are modeling a probability of an "event" which takes value 1, otherwise by
default SAS models the probability of "nonevent"
scale=none
option serves two purposes. One, it will give us the overall goodness-of-fit test statistics,
deviance G2 and Person X2. It also enables us to specify a value for a dispersion parameter
in order to correct for over- or under-dispersion. In this case, we are NOT controlling for either.
Overdispersion is an important concept with discrete data. In the context of logistic
regression, overdispersion occurs when there are discrepancies between the observed
responses yi and their predicted values i = n i i and these values are larger than what the
binomial model would predict.
If yi ~ Bin(ni, i), the mean is i = ni i and the variance is i(ni i)/ni. Overdispersion means that
the data show evidence that the variance of the response yi is greater than i(ni i)/ni.
If overdispersion is present in a dataset, the estimated standard errors and test statistics for
individual parameters and the overall goodness-of-fit will be distorted and adjustments should be
made. We will look at this briefly later when we look into continuous predictors, but you can also
now read Agresti (2002), Sec. 4.7, Agresti (1996), Sec. 4.4.4. and SAS online description on
dispersion and adjustments at:
http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/logistic_sect35.htm#stat_logistic_logisticod
[6]
Now let's review some of the output from the program smoke.sas [5] that uses PROC LOGISTIC
('jump-down links' below):
Model information
Response profile
Class Level Information
Model convergence
Goodness of fit-statistics
Model fit
Testing null hypothesis beta=0
Analysis of maximum likelihood estimates
Odds Ratio Estimates
Model Information
This section, as before, tells us which dataset we are manipulating, the labels of the
response and explanatory variables and what type of model we are fitting (e.g. binarly
logit), and type of scoring algorithm for parameter estimation. Fisher scoring is a variant
of Newton-Raphson method for ML estimation. In logistic regression they are equivalent.
Response Profile
This information tells us how many observations were "successful", e.g., how many
"event" observations take value 1, that is how many children smoke, 816 + 188 = 1004
and "nonevent", i.e., how many do NOT smoke, (4019 816) + (1356 188) = 4371.
Recall that the model is estimating the probability of the "event".
From an explanatory categorical (i.e, class) variable S with 2 levels (0,1), we created
one dummy variable (e.g. design variable):
Since we are using an iterative procedure to fit the model, that is to find the ML
estimates, we need some indication if the algorithm converged.
The Pearson statistic X2 = 27.6766 is precisely equal to the usual X2 for testing independence in
the 2 2 table. And the deviance G2 = 29.1207 is precisely equal to the G2 for testing
independence in the 2 2 table. Thus by the assumption, the intercept-only model or the null
logistic regression model states that student's smoking is unrelated to parents' smoking (e.g.,
assumes independence, or odds-ratio=1). But clearly, based on the values of the calculated
statistics, this model (i.e., independence) does NOT fit well. This example shows that analyzing a 2
2 table for association is equivalent to logistic regression with a single dummy variable. Later on
we will compare these tests to the loglinear model of independence see smokelog.sas [7] and
smokelog.lst [8].
The goodness-of-fit statistics, X2 and G2, are defined as before in the tests of independence and
loglinear models (e.g. compare observed and fitted values). For the 2 approximation to work well,
we need the ni's sufficiently large so that the expected values, i 5, and ni i 5 for most of the
rows i. We can afford to have about 20% of these values less than 5, but none of them should fall
below 1.
With real data, we often find that the ni's are not big enough for an accurate test, and there is no
single best solution to handle this but a possible solution may rely strongly on the data and context
of the problem.
Estimated 0 = 1.827 with a standard error of 0.078 is significant and it says that log-odds of a
child smoking versus not smoking if neither parents is smoking (the baseline level) is -1.827 and it's
statistically significant.
Estimated 1 = 0.459 with a standard error of 0.088 is significant and it says that log-odds-ratio of
a child smoking versus not smoking if at least one parent is smoking versus neither parents is
smoking (the baseline level) is 0.459 and it's statistically significant. $exp(0.459)=1.58 are the
estimated odds-ratios; compare with our analysis in Section 6.2.
Thus there is a strong association between parent's and children smoking behavior, and the model
fits well.
And the estimated probability of a student smoking given the explanatory variable (e.g. parent
smoking behavior):
exp(1.826 + 0.4592Xi )
i =
1 + exp(1.826 + 0.4592Xi )
For example, the predicted probability of a student smoking given that neither parent smokes is
exp(1.826 + 0.4592 0)
P(Yi = 1|Xi = 0) = = 0.14
1 + exp(1.826 + 0.4592 0)
and the predicted probability of a student being a smoker if at least one parent smokes is
exp(1.826 + 0.4592(Xi = 1)
P(Yi = 1|Xi = 1) = = 0.20
1 + exp(1.826 + 0.4592(Xi = 1))
By invoking the following option in MODEL, output out=predict pred=prob the PROC LOGISTIC
will print the predicted probabilities in the output file:
so you do NOT need to do this calculation by hand; but it maybe useful to try it out to see if you
understand what's going on. In this model, 0 is the log-odds of children smoking for no-smoking
parents (Xi = 0). Thus exp(1.8266) are the odds that a student smokes when the parents do not
smoke.
log (
1168 )
188
= log(0.161) = 1.8266
From analysis of this example as a 2 2 table, exp(1), should be the estimate odds ratio. The
estimated coefficient of the dummy variable,
1 = 0.4592
agrees exactly with the log-odds ratio from the 2 2 table (e.g. ln(1.58) = (8161168) / (1883203)
= 0.459). That is exp(0.459) = 1.58 which is the estimated odds ratio of a student smoking.
To relate this to interpretation of the coefficients in a linear regression, you could say that for every
one-unit increase in the explanatory variable X1 (e.g. changing from no smoking parents to smoking
parents), the odds of "success" i / (1 i) will be multiplied by exp(1), given that all the other
variables are held constant.
This is not surprising, because in the logistic regression model 1 is the difference in the log-odds
of children smoking as we move from "nosmoke" (i.e. neither parent smokes) (Xi = 0) to "smoke"
(i.e. at least one parent smokes) (Xi = 1), and the difference in log-odds is a log-odds ratio.
Testing H0 : j = 0 versus H1 : j 0.
The Wald chi-squared statistics z 2 = (j /SE( k ))2 for these tests are displayed along with the
estimated coefficients in the "Analysis of Maximum Likelihood Estimates" section.
The standard error for 1 , 0.0878, agrees exactly with the standard error that you can calculate
from the corresponding 2 2 table.
The values indicate the significant relationship between the logit of the odds of student smoking in
parents' smoking behavior.
Or, we can use the information from "Type III Analysis of Effects" section.
Again, this information indicates that parents' smoking behavior is a significant factor in the model.
We could compare z2 to a chisquare with one degree of freedom; the p-value would then be the
area to the right of z2 under the 12 density curve.
A value of z2 (Wald statistic) bigger than 3.84 indicates that we can reject the null hypothesis j = 0
at the .05-level.
0.4592 2
1 : (
0.0878 )
= 27.354
j z (1/2) SE(j )
For example, a 95% confidence interval for 1
where 0.0878 is the "Standard Error"' from the "Analysis of Maximum Likelihood Estimates" section.
Then, the 95% confidence interval for the odds-ratio of a child smoking if one parent is smoking in
comparison to neither smoking is
When fitting logistic regression, we need to evaluate the overall fit of the model, significance of
individual parameter estimates and consider their interpretation. For assessing the fit of the model,
we also need to consider the analysis of residuals. Definition of Pearson, deviance and adjusted
residuals is as before, and you should be able to carry this analysis.
For now you can try this on your own, or see the next sections for more analysis and code.
If data come in a tabular form, i.e., response pattern is with counts (as seen in the previous
example), that is data are grouped.
glm(): Let y be the response variable capturing the number of events with the number of success
(y = 1) and failures (y = 0). We need to create a response table (e.g., count) that has count for both
the "success" and "failure" out of n number of trials in its columns. Notice that count table below,
could be also the number of success y = 1, and then a column computed as n-y
> count=cbind(xytab[,2],xytab[,1])
> count
[,1] [,2]
0 29 24
1 21 26
> xfactor=factor(c("0","1"))
> xfactor
[1] 0 1
Levels: 0 1
We need to specify the the response distribution and a link, which in this case is done by specifying
family=binomial("logit")
> tmp3=glm(count~xfactor, family=binomial("logit"))
> tmp3
Coefficients:
(Intercept) xfactor1
0.1892 -0.4028
If data come in a matrix form, i.e., subject variables matrix with one line for each subject, like a
database, where data are ungrouped
> xydata=cbind(x,y)
> xydata ## 100 rows, we are showing first 7
x y
[1,] 0 1
[2,] 0 1
[3,] 0 0
[4,] 0 1
[5,] 1 0
[6,] 0 1
[7,] 0 0.....
glm(): We need a binary response variable y and a predictor variable x, which in this case was
also binary. We need to specify the the response distribution and a link, which in this case is done
by specifying family=binomial("logit")
Coefficients:
(Intercept) x
0.1892 -0.4028
For more options see the course examples and homework. We will follow the R output through to
explain the different parts of model fitting. The outputs from SAS (or from many other software) will
be essentially the same.
In R, we can use the glm()function and specify the family = binomial(link = logit).
See the files smoke.R [10] and the output generated in smoke.out [11]. That R code corresponds to
SAS code discussed in the previous section:
Here is the R output for the 2 2 table that we will use in R for logistics regression:
Please Note: the table above is different from the one given from the SAS program. We input y and
n as data columns in SAS; here we just input data columns as yes and no.
Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
These sections tells us which dataset we are manipulating, the labels of the response and
explanatory variables and what type of model we are fitting (e.g. binary logit), and type of scoring
algorithm for parameter estimation. Fisher scoring is a variant of Newton-Raphson method for ML
estimation. In logistic regression they are equivalent. Since we are using an iterative procedure to
fit the model, that is to find the ML estimates, we need some indication if the algorithm converged.
Overdispersion is an important concept with discrete data. In the context of logistic regression,
overdispersion occurs when there are discrepancies between the observed responses yi and their
predicted values i = n i i and these values are larger than what the binomial model would
predict.
If yi ~ Bin(ni, i), the mean is i = ni i and the variance is i(ni i)/ni. Overdispersion means that
the data show evidence that the variance of the response yi is greater than i(ni i)/ni.
If overdispersion is present in a dataset, the estimated standard errors and test statistics for
individual parameters and the overall goodness-of-fit will be distorted and adjustments should be
made. We will look at this briefly later when we look into continuous predictors, but you can also
now read Agresti (2013), Sec. 4.7, Agresti (2007), Sec. 4.4.4. and help(summary.glm())
Table of Coefficients:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
parentsmoke1 0.45918 0.08782 5.228 1.71e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
^i ^ ^
logit(^i ) = log = 0 + 1 Xi = 1.837 + 0.459parentsmoke1
1 ^i
where parentsmoke1 is a dummy variable ((e.g. design variable) that takes value 1 if at least one
parents is smoking and 0 if neither is smoking.
parent smoking = 0 is the baseline. We are modeling here probability of "children smoking".
Estimated 0 = 1.827 with a standard error of 0.078 is significant and it says that log-odds of a
child smoking versus not smoking if neither parents is smoking (the baseline level) is -1.827 and it's
statistically significant.
Estimated 1 = 0.459 with a standard error of 0.088 is significant and it says that log-odds-ratio of
a child smoking versus not smoking if at least one parent is smoking versus neither parents is
smoking (the baseline level) is 0.459 and it's statistically significant. $exp(0.459)=1.58 are the
estimated odds-ratios; compare with our analysis in Section 6.2.
Testing the hypothesis that the probability of the characteristic depends on the value of the jth
variable.
Testing H0 : j = 0 versus H1 : j 0.
The Wald chi-squared statistics z 2 = (j /SE( k ))2 for these tests are displayed along with the
estimated coefficients in the "Coefficients" section.
The standard error for 1 , 0.0878, agrees exactly with the standard error that you can calculate
from the corresponding 2 2 table. The values indicate the significant relationship between the logit
of the odds of student smoking in parents' smoking behavior. This information indicates that
parents' smoking behavior is a significant factor in the model. We could compare z2 to a chisquare
with one degree of freedom; the p-value would then be the area to the right of z2 under the 12
density curve.
A value of z2 (Wald statistic) bigger than 3.84 indicates that we can reject the null hypothesis j = 0
at the .05-level.
0.4592 2
1 : (
0.0878 )
= 27.354
j z (1/2) SE(j )
For example, a 95% confidence interval for 1
where 0.0878 is the "Standard Error"' from the "Coefficients" section. Then, the 95% confidence
interval for the odds-ratio of a child smoking if one parent is smoking in comparison to neither
smoking is
Thus there is a strong association between parent's and children smoking behavior. But does this
model fit?
The goodness-of-fit statistics, deviance, G2 from this model is zero, because the model is
saturated. If we want to know the fit of the intercept only model that is provided by
Suppose that we fit the intercept-only model. This is accomplished by removing the predictor from
the model statement, like this:
glm(response~1, family=binomial(link=logit)
The deviance G2 = 29.1207 is precisely equal to the G2 for testing independence in this 2 2 table.
Thus by the assumption, the intercept-only model or the null logistic regression model states that
student's smoking is unrelated to parents' smoking (e.g., assumes independence, or odds-ratio=1).
But clearly, based on the values of the calculated statistics, this model (i.e., independence) does
NOT fit well.
Analysis of deviance table [12]: In R, we can test factors effects with the anova function to give an
analysis of deviance table. We only include one factor in this model. So R dropped this factor
(parentsmoke) and fit the intercept-only model to get the same statistics as above, i.e., the
deviance G2 = 29.121.
This example shows that analyzing a 2 2 table for association is equivalent to logistic regression
with a single dummy variable. We can further compare these tests to the loglinear model of
independence, e.g. see smokelog.R [13] in Lesson 10.
The goodness-of-fit statistics, X2 and G2, are defined as before in the tests of independence and
loglinear models (e.g. compare observed and fitted values). For the 2 approximation to work well,
we need the ni's sufficiently large so that the expected values, i 5, and ni i 5 for most of the
rows i. We can afford to have about 20% of these values less than 5, but none of them should fall
below 1.
With real data, we often find that the ni's are not big enough for an accurate test, and there is no
single best solution to handle this but a possible solution may rely strongly on the data and context
of the problem.
Notice the difference in the null and alternative hypothesis from the section above. Here to test the
null hypothesis that an arbitrary group of k coefficients from the model is set equal to zero (e.g. no
relationship with the response), we need to fit two models:
and the degrees of freedom is k (the number of coefficients in question). The p-value is
P(k2 G2 ) .
To perform the test, we must look at the "Model Fit Statistics" section and examine the value of "2
Log L" for "Intercept and Covariates." Here, the reduced model is the "intercept-only" model (e.g.
no predictors) and "intercept and covariates" is the current model we fitted. For our running
example, this would be equivalent to testing "intercept-only" model vs. full (saturated) model (since
we have only one covariate).
Larger values of G2 ("2 Log L" ) lead to small p-values, which provide evidence against the
reduced model in favor of the current model; you can explore AIC (Akaike Information Criterion)
and SC (Schwarz Criterion) on your own through SAS help files or see Lesson 5 for AIC.
For our example, G2 = 5176.510 5147.390 = 29.1207 with df = 2 1 = 1. Notice that this
matches Deviance we got in the earlier text above.
This value of -2 Log L is useful to compare two nested models which differ by an arbitrary set of
coefficients.
from "Testing Global Hypothesis: BETA=0" section (the next part of the output, see below).
log ( )= 0 + 1 1 + 2 2 ++ k k
log (
1)
= 0 + 1 X1 + 2 X2 + + k Xk
test H0 : 1 = 2 = ... = 0 versus the alternative that at least one of the coefficients 1, . . . , k is not
zero.
This is like the overall Ftest in linear regression. In other words, this is testing the null hypothesis
that an intercept-only model is correct,
log (
1)
= 0
log (
1)
= 0 + 1 X1 + 2 X2 + + k Xk
In our example, we are testing the null hypothesis that an intercept-only model is correct,
log (
1)
= 0
versus the alternative that the current model (in this case saturated model) is correct
log (
1)
= 0 + 1 X1
Using SAS
Using R
In the SAS output, three different chisquare statistics for this test are displayed in the
section "Testing Global Null Hypothesis: Beta=0," corresponding to the likelihood ratio, score and
Wald tests. Recall their definitions from the very first lessons.
This test has k degrees of freedom (e.g. the number of dummy indicators (design variables), that is
the number of -parameters (except the intercept)).
Large chisquare statistics lead to small p-values and provide evidence against the intercept-only
model in favor of the current model.
The Wald test is based on asymptotic normality of ML estimates of 's. Rather than using the
Wald, most statisticians would prefer the LR test.
If these three tests agree, that is evidence that the large-sample approximations are working well
and the results are trustworthy. If the results from the three tests disagree, most statisticians would
tend to trust the likelihood-ratio test more than the other two.
In our example, the "intercept only" model or the null model says that student's smoking is
unrelated to parents' smoking habits. Thus the test of the global null hypothesis 1 = 0 is
equivalent to the usual test for independence in the 2 2 table. We will see that the estimated
coefficients and SE's are as we predicted before, as well as the estimated odds and odds ratios.
Note: we use one predictor model here, that is, at least one parent smokes.
This is a Pearson-like 2 that is computed after data are grouped by having similar predicted
probabilities. It is more useful when there is more than one predictor and/or continuous predictors in
the model too. We will see more on this later, and in your homework.
It is a conservative statistic, i.e. its value is smaller than what it aught to be and therefore
rejection probability of the null hypothesis is smaller.
It has low power in predicting a certain types of lack of fit such as nonlinearity in explanatory
variable
It is highly dependent on how the observations are grouped
If too few groups are used (e.g. 5 or less) it almost always indicates that the model fits the
data; this means that it's usually not a good measure if you only have one or two categorical
predictor variables, and it's best used for continuous predictors.
In the model statement, the option lackfit tells SAS to compute HL statistics and print the
partitioning.
For our example, because we have small number of groups (e.g, 2), this statistic gives a perfect fit
(HL = 0, p-value = 1).
______
Instead of deriving the diagnostics we will look at them from a purely applied viewpoint. Recall the
definitions and introductions to the regression residuals and Pearson and Deviance residuals.
Residuals
y i i yi ni i
ri = =
n i)
i i (1
V (i )
(y i i ) 2 ((ni y i ) (ni i ) )2
+ = r i2
i n i i
and the Pearson goodness-of fit statistic is
N
X2 = r i2
i=1
2
which we would compare to a Np distribution. The deviance test statistic is
{ ( i ) ( ni i )}
N yi ni yi
2
G =2 y i log + (ni y i )log
i=1
2 , and the contribution of the ith row to the deviance is
which we would again compare to Np
{ ( i ) ( ni i )}
yi ni yi
2 y i log + (ni y i )log
We will note how these quantities are derived through appropriate software and how they provide
useful information to understand and interpret the models. For an example see the SAS or R
analysis in the next section [14].
We begin by replicating the analysis of the original 3 2 tables with logistic regression.
Student smokes?
How many parents smoke? Yes (Z = 1) No (Z = 2)
Both (Y = 1) 400 1380
One (Y = 2) 416 1823
Neither (Y = 3) 188 1168
First, we re-express the data in terms of yi = number of smoking students, and ni = number of
students for the three groups based on parents behavior:
Student smokes?
How many parents smoke? yi ni
Both 400 1780
One 416 2239
Neither 188 1356
Then we decide on a baseline level for the explanatory variable X, and create k 1 dummy
indicators if X is a categorical variable with k levels.
For our example, let's parent smoking = Neither be a baseline, and define a pair of dummy
indicators (or design variables) that takes one of two values,
Let = odds of student smoking; notice that we dropped the index i denoting each individual
1
student to simplify the notation --- we are still modeling each student or rather a randomly chosen
student from the same classification cell. Then the model
log (
1)
= 0 + 1 X1 + 2 X2
says that the log-odds of a student smoking is 0 for parents smoking = neither, 0 + 1 for parents
smoking = one and 0 + 2 for parents smoking = both, and exp(j) = conditional odds ratio for
level j of predictor X versus the baseline. Therefore,
1 = log_odds for one log_odds for neither
odds smoking|one
exp(1 ) =
odds smoking|neither
2 = log_odds for both log_odds for neither
odds smoking|both
exp(2 ) =
odds smoking|neither
exp(1 ) odds smoking|one
= exp(1 2 ) =
exp(2 ) odds smoking|both
and we expect to get 1 = ln(1.42) = 0.351, e.g. 1.42=(416 1168)/(188 1823), and
2 = ln(1.80) = 0.588, e.g., 1.80=400 1168/188 1380. The estimated intercept should be
0 = ln(188/1168) = 1.826 .
There are different versions of SAS program on the course web site to fit this data.
Here is one from smoke.sas [5] where '0' = neither parent smokes, '1' = one smokes, and '2' = both
smoke, and we use PROC LOGISTIC; notice we could use proc GENMOD too.
The option param=ref tells SAS to create a set of two dummy variables to distinguish among the
three categories, where '0'=neither is a baseline because of option descending and ref=first
(see the previous section for details).
Another way of doing the same in smoke.sas but using character labels (e..g, "both", "one",
"neither") rather than numbers for the categories:
In the class statement, the option order=data tells SAS to sort the categories of S by the order
in which they appear in the dataset rather than alphabetical order. The option ref='neither'
makes neither the reference group (i.e. the group for which both dummy variables are zero). Let's
look at some relevant portions of the output of smoke.lst [15] that differ from the analysis of the
corresponding 2 2 table from the previous section of the notes.
From an explanatory variable S with 3 levels (0,1,2), we created two dummy variables, i.e., design
variables:
X2 = 1 if parent smoking=Both,
X2 = 0 otherwise.
Since parent smoking = Neither is equal 0 for both dummy variables, that's the baseline.
Here, we specified neither to be a reference variable. We used option data=order so that both
takes values (1,0), and one (0,1). If we didn't use that option, the other levels would be set based
on the alphabetical order, but in this case they coincide.
then SAS creates a new dataset called "results" that includes all of the variables in the original
dataset, the predicted probabilities i , and the Pearson and deviance residuals. We can add some
code to calculate and print out the estimated expected number of successes i = n i i (e.g.,
"shat") and failures n i i = n i (1 i ) (e.g., "fhat").
All of the "shat" and "fhat" values, that is predicted number of success and failures are greater than
5.0, so the 2 approximation is trustworthy.
Overall goodness-of-fit
The goodness-of-fit statistics 2 and G2 from this model are both zero, because the model is
saturated. Suppose that we fit the intercept-only model as before by removing the predictors from
the model statement:
The Pearson statistic 2 = 37.5663 and the deviance G2 = 38.3658 are precisely equal to the
usual 2 and G2 for testing independence in the 2 3 table.
Clearly, in this example, the saturated model fits perfectly and the independence model does not fit
well.
We can use the Hosmer-Lemeshow test to assess the overall fit of the model. As in the previous
example, the Hosmer and Lemeshow statistic is not very meaningful since the number of groups is
small.
We have shown that analyzing a 2 3 table for associations is equivalent to a binary logistic
regression with two dummy variables as predictors. For 2 J tables, we would fit a binary logistic
regression with J 1 dummy variables.
In our model
log (
1)
= 0 + 1 X1 + 2 X2
Test H0 : 1 = 2 = 0 versus the alternative that at least one of the coefficients 1, . . . , k is not
zero.
In other words, this is testing the null hypothesis that an intercept-only model is correct,
log (
1)
= 0
log (
1)
= 0 + 1 X1 + 2 X2
This test has k degrees of freedom (e.g. the number of dummy or indicator variables, that is the
number of -parameters (except the intercept).
Large chisquare statistics lead to small p-values and provide evidence against the intercept-only
model in favor of the current model.
Testing for an Arbitrary Group of Coefficients
For the previous test this would be equivalent to testing "intercept-only" model vs. full (saturated)
model.
and the degrees of freedom is k (the number of coefficients in question). The p-value is
P(k2 G2 ) .
How would you test only for X2? Which two models can you compare?
How do you interpret this case?
Compare this to
Testing H0 : j = 0 versus H1 : j 0.
A value of z2 bigger than 3.84 indicates that we can reject the null hypothesis j = 0 at the .05-level.
0.3491 2
1 : (
0.0955 )
= 13.3481
j z (1/2) SE(j )
For example, a 95% CI for 1
Then, the 95% CI for the odds-ratio of a student smoking if one parent is smoking in comparison to
neither smoking is:
Links:
[1] https://onlinecourses.science.psu.edu/stat504/node/84
[2]
https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson03/smokeindep.sas
[3]
https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson03/smokeindep.R
[4] http://www.dynamicdrive.com
[5] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smoke.sas
[6] http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/logistic_sect35.htm#stat_logistic_logisticod
[7] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smokelog.sas
[8] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smokelog.lst
[9]
https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/dataformats.R
[10] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smoke.R
[11] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smoke.out
[12] https://onlinecourses.science.psu.edu/stat504/node/157
[13] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smokelog.R
[14] https://onlinecourses.science.psu.edu/stat504/node/152#multitab
[15] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smoke.lst
[16] https://onlinecourses.science.psu.edu/stat504/#
[17]
https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/ROCandHL.R