Logistic Regression and the New: Residual Logistic Regression
F. Berenice Baez-Revueltas
Wei Zhu
Outline
1. Logistic Regression
2. Confounding Variables
3. Controlling for Confounding Variables
4. Residual Linear Regression
5. Residual Logistic Regression
6. Examples
7. Discussion
8. Future Work
1. Logistic Regression Model
In 1938, Ronald Fisher and Frank Yates suggested the logit link for regression with a binary response variable.
$\ln(\text{odds of } Y=1 \mid x) = \ln\dfrac{P(Y=1\mid x)}{P(Y=0\mid x)} = \ln\dfrac{P(Y=1\mid x)}{1-P(Y=1\mid x)} = \ln\dfrac{\pi(x)}{1-\pi(x)} = \beta_0 + \beta_1 x$
A popular model for a categorical response variable:

$P(Y=1\mid x) = \dfrac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$

$E(Y\mid x) = P(Y=1\mid x)\cdot 1 + P(Y=0\mid x)\cdot 0 = P(Y=1\mid x)$ is bounded between 0 and 1 for all values of x. The following linear model can violate this condition:

$P(Y=1\mid x) = \beta_0 + \beta_1 x$
More on the properties of the logistic regression model
In simple logistic regression, the regression coefficient $\beta_1$ has the interpretation that it is the log of the odds ratio of a success event (Y = 1) for a unit change in x:

$\ln\dfrac{P(Y=1\mid x+1)}{P(Y=0\mid x+1)} - \ln\dfrac{P(Y=1\mid x)}{P(Y=0\mid x)} = [\beta_0 + \beta_1(x+1)] - [\beta_0 + \beta_1 x] = \beta_1$

For example, if $\hat\beta_1 = \ln 2 \approx 0.693$, the odds of success double with each one-unit increase in x.
For multiple predictor variables, the logistic regression model is

$\ln\dfrac{P(Y=1\mid x_1, x_2, \dots, x_k)}{P(Y=0\mid x_1, x_2, \dots, x_k)} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
Logistic Regression, SAS Procedure
http://www.ats.ucla.edu/stat/sas/output/SAS_logit_output.htm
Proc Logistic
This page shows an example of logistic regression with footnotes explaining the output. The data were collected on 200 high school students, with measurements on various tests, including science, math, reading and social studies. The response variable is a high writing test score (honcomp): a writing score greater than or equal to 60 is considered high, and less than 60 is considered low. We explore its relationship with gender (female), reading test score (read), and science test score (science). The dataset used in this page can be downloaded from http://www.ats.ucla.edu/stat/sas/webbooks/reg/default.htm.
data logit;
set "c:\temp\hsb2";
honcomp = (write >= 60); * 1 if the writing score is high, 0 otherwise;
run;

proc logistic data=logit descending; * descending: model P(honcomp = 1);
model honcomp = female read science;
run;
Logistic Regression, SAS Output

[SAS output for the model above]
2. Confounding Variables
Correlated with both the dependent and independent variables
Represent a major threat to the validity of inferences about cause and effect
Add to multicollinearity
Can lead to over- or underestimation of an effect, and can even reverse the direction of the conclusion
Add error to the interpretation of what may otherwise be an accurate measurement
For a variable to be a confounder it needs to:
Have a relationship with the exposure
Have a relationship with the outcome even in the absence of the exposure (not an intermediary)
Not lie on the causal pathway
Be unevenly distributed across the comparison groups
[Diagram: a third variable linked to both the exposure and the outcome]
Example: birth order and Down Syndrome. Maternal age is correlated with birth order and is a risk factor for Down Syndrome even when birth order is low; maternal age, not birth order, drives the association.
Example: smoking as a confounder. Smoking is correlated with alcohol consumption and is a risk factor for lung cancer even for persons who don't drink alcohol; smoking confounds the apparent alcohol-lung cancer association.
3. Controlling for Confounding Variables

In study designs:
Restriction
Random allocation of subjects to study groups
Cohort studies
Pros and Cons of Controlling Methods
Matching methods call for subjects with exactly the same characteristics, with a risk of over- or under-matching
Cohort studies can lose too much information when excluding subjects
Some strata might become too thin and thus statistically insignificant, which also loses information
Regression methods, if well handled, can control for confounding factors
4. Residual Linear Regression
Consider a dependent variable Y and a set of n independent covariates, of which the first k (k < n) are potential confounding factors.

The initial model treats only the confounding variables:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$

Residuals are calculated from the fitted model:

$\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \dots + \hat\beta_k X_k$
The residuals are $e_j = Y_j - \hat{Y}_j$, with the following properties:
Zero mean
Homoscedasticity
Normally distributed
$\mathrm{Corr}(e_i, e_j) = 0$ for $i \neq j$

This residual will be considered the new dependent variable. That is, the new model to be fitted is

$Y - \hat{Y} = \gamma_0 + \gamma_{k+1} X_{k+1} + \gamma_{k+2} X_{k+2} + \dots + \gamma_n X_n + \varepsilon$

which is equivalent to:

$Y = \gamma_0 + \hat{Y} + \gamma_{k+1} X_{k+1} + \gamma_{k+2} X_{k+2} + \dots + \gamma_n X_n + \varepsilon$
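A minimal SAS sketch of this two-stage fit, assuming a hypothetical dataset mydata with outcome y, confounders x1-x2, and variables of interest x3-x5:

* Stage 1: regress Y on the confounders only and keep the residuals;
proc reg data=mydata;
model y = x1 x2;
output out=stage1 r=resid; * resid = y - yhat;
run;

* Stage 2: regress the residuals on the variables of interest;
proc reg data=stage1;
model resid = x3 x4 x5;
run;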
The Usual Logistic Regression Approach to ‘Control for’ Confounders

Consider a binary outcome Y and n covariates, where the first k (k < n) are potential confounding factors.

The usual way to ‘control for’ these confounding variables is to simply put all n variables in the same model:

$\log\dfrac{\pi}{1-\pi} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \dots + \beta_n X_n$
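In SAS this is a single PROC LOGISTIC call with every covariate on the MODEL statement; a sketch with the same hypothetical names as above:

proc logistic data=mydata descending;
model y = x1 x2 x3 x4 x5; * confounders and variables of interest together;
run;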
5. Residual Logistic Regression
Each subject has a binary outcome Y
Method 1
The effect of the confounding variables is retained and plugged into the second-level regression model along with the variables of interest, following the residual linear regression approach.

That is, from the initial logistic model fitted on the confounding variables alone, let

$T = \hat\beta_1 X_1 + \hat\beta_2 X_2 + \dots + \hat\beta_k X_k$

The new model to be fitted is

$\log\dfrac{\pi}{1-\pi} = \gamma_0 + T + \gamma_{k+1} X_{k+1} + \gamma_{k+2} X_{k+2} + \dots + \gamma_n X_n$
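A minimal SAS sketch of Method 1, again with hypothetical names. XBETA= returns the estimated linear predictor, which differs from T above only by the intercept constant (absorbed into $\gamma_0$), and the OFFSET= option fixes its coefficient at 1 as in the formula:

* Stage 1: logistic model on the confounders alone;
proc logistic data=mydata descending;
model y = x1 x2;
output out=m1 xbeta=t; * t = estimated linear predictor;
run;

* Stage 2: carry t as a fixed offset and add the variables of interest;
proc logistic data=m1 descending;
model y = x3 x4 x5 / offset=t;
run;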
Method 2
Pearson residuals are calculated from the initial model (Hosmer and Lemeshow, 1989):

$Z = \dfrac{Y - \hat\pi}{\sqrt{\hat\pi(1-\hat\pi)}}$, with $Y \in \{0, 1\}$

where $\hat\pi$ is the estimated probability of success based on the confounding variables alone:

$\hat\pi = \dfrac{e^{\hat\beta_0 + \sum_{i=1}^{k} \hat\beta_i X_i}}{1 + e^{\hat\beta_0 + \sum_{i=1}^{k} \hat\beta_i X_i}}$
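As Example 1 below shows, Z is then regressed linearly on the variables of interest. A minimal SAS sketch, with the same hypothetical names:

* Stage 1: logistic model on the confounders, keeping the fitted probabilities;
proc logistic data=mydata descending;
model y = x1 x2;
output out=m2 pred=pihat; * pihat = estimated P(Y = 1);
run;

* Pearson residual Z = (Y - pihat) / sqrt(pihat * (1 - pihat));
data m2;
set m2;
z = (y - pihat) / sqrt(pihat * (1 - pihat));
run;

* Stage 2: ordinary regression of Z on the variables of interest;
proc reg data=m2;
model z = x3 x4 x5;
run;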
6. Example 1
Data: Low Birth Weight
Low: indicator of birth weight less than 2.5 kg
Age: mother's age in years
Lwt: mother's weight in pounds
Smk: smoking status during pregnancy
Ht: history of hypertension
Potential confounding factor: Age

Model for $\pi$ (probability of low birth weight), logistic regression:

$\log\dfrac{\pi}{1-\pi} = \beta_0 + \beta_1\,age + \beta_2\,lwt + \beta_3\,smk + \beta_4\,ht$

Method 2:

$Z = \gamma_0 + \gamma_2\,lwt + \gamma_3\,smk + \gamma_4\,ht + \varepsilon$
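The RLR Method 1 column in the results below comes from a two-stage fit like this sketch (the dataset name lowbwt is hypothetical):

* Stage 1: age, the confounder, alone;
proc logistic data=lowbwt descending;
model low = age;
output out=ex1 xbeta=t;
run;

* Stage 2: t carried as an offset, variables of interest added;
proc logistic data=ex1 descending;
model low = lwt smk ht / offset=t;
run;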
Results

Variable   Logistic Regression                RLR Method 1
           Odds ratio  P-value  SE            Odds ratio  P-value  SE
lwt        0.988       0.060    0.0064        0.989       0.078    0.0065
smk        3.480       0.001    0.3576        3.455       0.001    0.3687
ht         3.395       0.053    0.6322        3.317       0.059    0.6342
Example 2
Data: Alzheimer patients
Decline: whether the subject's cognitive capabilities deteriorate or not
Age: subject's age
Gender: subject's gender
MMS: Mini-Mental Score
PDS: psychometric deterioration scale
HDT: depression scale
Potential confounding factors: Age, Gender

Model for $\pi$ (probability of declining), logistic regression:

$\log\dfrac{\pi}{1-\pi} = \beta_0 + \beta_1\,age + \beta_2\,gender + \beta_3\,mms + \beta_4\,pds + \beta_5\,hdt$
Results

Variable   Logistic Regression                RLR Method 1
           Odds ratio  P-value  SE            Odds ratio  P-value  SE
mms        0.717       0.023    0.1451        0.720       0.023    0.1443
pds        1.691       0.001    0.1629        1.674       0.001    0.1565
hdt        1.018       0.643    0.0380        1.018       0.644    0.0377
7. Discussion
The usual logistic regression is not designed to control for confounding factors, and there is a risk of multicollinearity.
Method 1 is designed to control for confounding factors; however, in the given examples it yields results similar to the usual logistic regression approach.
Method 2 appears to be more accurate, with some standard errors significantly reduced and thus significantly smaller p-values for some regressors. However, it does not yield the odds ratios that Method 1 can.
8. Future Work
We will further examine the assumptions behind Method 2 to understand why it sometimes yields more significant results.
We will also study residual longitudinal data analysis, including survival analysis, where one or more time-dependent variable(s) will be taken into account.
Selected References
Menard, S. Applied Logistic Regression Analysis. Series: Quantitative Applications in the Social Sciences. Sage University Series.
Lemeshow, S., Teres, D., Avrunin, J.S. and Pastides, H. Predicting the Outcome of Intensive Care Unit Patients. Journal of the American Statistical Association 83, 348-356.
Hosmer, D.W., Jovanovic, B. and Lemeshow, S. Best Subsets Logistic Regression. Biometrics 45, 1265-1270. 1989.
Pregibon, D. Logistic Regression Diagnostics. The Annals of Statistics 9(4), 705-724. 1981.
Questions?