Analysis of Complex Sample Survey Data: Multinomial and Ordinal Logistic Regression For Complex Samples
SURVMETH 614
1
Multinomial Responses
NHANES HUQ.040
2
Multinomial Responses
NCS-R EM7.1
3
Multinomial Response Models
Consider modeling a multinomial dependent variable with K = 3 categories:
Y = employment status: 1 (employed), 2 (unemployed), 3 (not in labor force)
as a function of six independent variables:
X1 Age category
X2 Gender
X3 Dx of alcohol dependency
X4 Dx of major depression episode
X5 Education level
X6 Marital status
4
Multinomial Responses:
Analysis Options
• Contingency table analysis
– Stata: svy: tab
– SAS: PROC SURVEYFREQ
5
Review:
Binary Logistic Regression Model
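\[
g(x) = \ln\left[\frac{\pi_1(x)}{\pi_0(x)}\right] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
\]
where:
$\pi_0(x)$ = Prob(Y=0|x);
$\pi_1(x)$ = Prob(Y=1|x);
$\beta_0$ = unique intercept for the logit function;
$\beta_1, \ldots, \beta_p$ = model coefficients corresponding to predictors x = {$X_1, \ldots, X_p$}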
6
Multinomial Logistic Model:
Generalized Logit Form
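\[
g_k(x) = \ln\left[\frac{\pi_k(x)}{\pi_1(x)}\right] = \beta_{k0} + \beta_{k1} X_1 + \cdots + \beta_{kp} X_p
\]
where:
$\pi_1(x)$ = Prob(Y=1|x);
$\pi_k(x)$ = Prob(Y=k|x);
$\beta_{k0}$ = unique intercept for the k-th logit function;
$\beta_{k1}, \ldots, \beta_{kp}$ = model coefficients corresponding to predictors x = {$X_1, \ldots, X_p$}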
7
Multinomial Logistic Regression:
Generalized Logit Model
Using the example where Y takes the values
1=Employed, 2=Unemployed, 3= Not in Labor Force ,
consider two logit functions with Y=1 as the reference.
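\[
\text{logit}(\pi(\text{"UN"} \mid x)) = \text{logit}(\pi_2) = \ln\left[\frac{\pi(y=2 \mid x)}{\pi(y=1 \mid x)}\right] = B_{2:0} + B_{2:1} x_1 + \cdots + B_{2:p} x_p
\]
\[
\text{logit}(\pi(\text{"NLF"} \mid x)) = \text{logit}(\pi_3) = \ln\left[\frac{\pi(y=3 \mid x)}{\pi(y=1 \mid x)}\right] = B_{3:0} + B_{3:1} x_1 + \cdots + B_{3:p} x_p
\]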
8
Multinomial Logistic Regression:
Generalized Logit Model
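\[
\hat{\pi}(y=2 \mid x) = \frac{e^{X\hat{\beta}_2}}{1 + e^{X\hat{\beta}_2} + e^{X\hat{\beta}_3}}
\]
\[
\hat{\pi}(y=3 \mid x) = \frac{e^{X\hat{\beta}_3}}{1 + e^{X\hat{\beta}_2} + e^{X\hat{\beta}_3}}
\]
\[
\hat{\pi}(y=1 \mid x) = \frac{1}{1 + e^{X\hat{\beta}_2} + e^{X\hat{\beta}_3}}
\]
Note: $\hat{\pi}(y=2 \mid x) + \hat{\pi}(y=3 \mid x) + \hat{\pi}(y=1 \mid x) = 1.0$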
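The three probability formulas can be checked numerically. The sketch below is a minimal illustration, not output from the fitted model: the linear predictor values are hypothetical, and the function name `multinomial_probs` is introduced here only for the example.

```python
import math

def multinomial_probs(etas):
    """Convert linear predictors for the non-reference categories into
    probabilities; the reference category has linear predictor 0."""
    denom = 1.0 + sum(math.exp(e) for e in etas)
    reference = 1.0 / denom
    return [reference] + [math.exp(e) / denom for e in etas]

# Hypothetical values of the linear predictors for logit 2 and logit 3
eta_unemployed = -1.5
eta_nlf = -0.4

p_emp, p_un, p_nlf = multinomial_probs([eta_unemployed, eta_nlf])
print(p_emp, p_un, p_nlf)
print(p_emp + p_un + p_nlf)  # the three probabilities sum to 1
```

Because the reference category contributes the "1" in the denominator, the ratio of any category's probability to the reference probability equals the exponentiated linear predictor for that logit.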
9
Multinomial vs. Separate Logits
• A natural question: “Is it possible to simply estimate the
multinomial logit regression model as a series of binary
logistic regression models that consider only the response
data for two categories at a time?”
• Strictly speaking, the answer is no. Estimates from the “separate-fitting”
approach will be similar but not identical to those from
simultaneous estimation of the multinomial logits.
• Standard errors from the separate fits will generally be larger than
those from simultaneous estimation, and only simultaneous estimation
yields the full variance-covariance matrix that is needed to test
hypotheses concerning the significance or equivalence of
parameters across the estimated logits.
10
Multinomial Logistic Model:
Generalized Logit
• Pseudo-maximum likelihood estimation of
parameters:
\[
PL_{Mult}(\hat{\beta} \mid X) = \prod_{i=1}^{n} \left\{ \prod_{k=1}^{K} \hat{\pi}_k(x_i)^{y_{i(k)}} \right\}^{w_i}
\]
where: $y_{i(k)}$ = 1 if y = k for sampled unit i, 0 otherwise;
$\hat{\pi}_k(x_i)$ is the estimated probability that $y_i = k \mid x_i$; and
$w_i$ is the survey weight for sampled unit i.
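On the log scale the weighted pseudo-likelihood becomes a weighted sum, because each unit contributes only the log probability of its observed category. The following is a minimal sketch with made-up weights, outcomes, and probabilities; the function name `log_pseudo_likelihood` and the 0-based category coding are assumptions for illustration only.

```python
import math

def log_pseudo_likelihood(weights, outcomes, probs):
    """Weighted log pseudo-likelihood: sum_i w_i * log(pi_hat_{y_i}(x_i)).

    outcomes[i] is the 0-based index of the observed category for unit i,
    probs[i] is the list of estimated category probabilities for unit i.
    """
    total = 0.0
    for w, y, p in zip(weights, outcomes, probs):
        total += w * math.log(p[y])
    return total

# Hypothetical sample of three units
weights = [1.5, 2.0, 0.5]
outcomes = [0, 2, 1]
probs = [
    [0.6, 0.1, 0.3],
    [0.5, 0.2, 0.3],
    [0.7, 0.2, 0.1],
]

log_pl = log_pseudo_likelihood(weights, outcomes, probs)
print(log_pl)
```

Setting every weight to 1 reduces this to the ordinary multinomial log-likelihood, which is one way to see what the survey weights add.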
11
Logit Probability Transform
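\[
\pi_k(B) = \frac{\exp(x'B_k)}{1 + \sum_{k'=2}^{K} \exp(x'B_{k'})}
\]
\[
\hat{\psi}_{k:j} = \exp(\hat{B}_{k:j})
\]
\[
CI(\hat{\psi}_{k:j}) = \exp\left[\hat{B}_{k:j} \pm t_{df,\,1-\alpha/2} \cdot se(\hat{B}_{k:j})\right]
\]
where:
$\hat{B}_{k:j}$ = the parameter estimate corresponding to predictor j in logit equation k.
where:
$\hat{B}_{k:j}$, $\hat{B}_{k':j}$ = the parameter estimates corresponding to
predictor j in logit equations k and k'.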
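As a worked check of the odds ratio and confidence interval formulas, the sketch below exponentiates one coefficient and its interval limits. The estimate and standard error are the values reported in these slides for SEX = Male in logit 2 (Unemployed vs. Employed); the critical value 2.018 is an approximation to the 97.5th percentile of a t distribution with 42 design degrees of freedom and is hard-coded here as an assumption rather than computed.

```python
import math

# Approximate t critical value for 42 design degrees of freedom, 95% CI
T_CRIT_42 = 2.018

def odds_ratio_ci(b, se, t_crit):
    """Return the odds ratio exp(b) and the limits exp(b -/+ t_crit * se)."""
    return (
        math.exp(b),
        math.exp(b - t_crit * se),
        math.exp(b + t_crit * se),
    )

# SEX = Male, logit 2: B-hat = -1.393, se = 0.198
or_hat, ci_lower, ci_upper = odds_ratio_ci(-1.393, 0.198, T_CRIT_42)
print(round(or_hat, 2), round(ci_lower, 2), round(ci_upper, 2))
```

Rounded to two decimals this gives 0.25 (0.17, 0.37), which agrees with the adjusted odds ratio reported for this predictor.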
14
Multinomial Logistic Model:
Generalized Logit
• Variance Estimation by Taylor Series
Approximation or Replication Methods
15
Weighted Distribution of WKSTAT3
(Source: NCS-R)
[Figure: bar chart of the weighted proportion in each WKSTAT3 category; vertical axis "Proportions", 0 to .6]
16
Initial Bivariate Design-Based Tests Assessing Potential
Predictors of WKSTAT3 for the NCS-R Adult Sample
Categorical Predictor   F-test Statistic              P(F > F)
AGE4CAT                 F(4.96, 208.51) = 113.49      < 0.001
17
Multinomial Logistic Model Example:
Stata Code
svy: mlogit wkstat3 i.sex ald mde i.ed4cat ///
    i.ag4cat i.mar3cat
svy: mlogit, rrr
The rrr option is used in the repetition of the svy: mlogit command
to request output of the estimated odds ratios (which Stata
interprets as relative risk ratios) and 95% confidence intervals.
The default baseline category for the multinomial logit regression
model in Stata will be the lowest-valued category of the dependent
variable, which in this example would be 1 = “Employed”.
18
Multinomial Logistic Model Example
Stata: svy: mlogit
svy: mlogit wkstat3 i.sex ald mde i.ed4cat ///
    i.ag4cat i.mar3cat, baseoutcome(3)
19
Multinomial Logistic Model Example:
R Code
library(svyVGAM)
multi_model <- svy_vglm(
  as.factor(wkstat3) ~ factor(sex) + ald + mde + factor(ed4cat) +
    factor(ag4cat) + factor(mar3cat),
  design = nhanes_design,
  family = multinomial(refLevel = "1")
)
summary(multi_model)
20
Estimated Multinomial Logit Regression
Model for WKSTAT3: Logit 2.
Logit 2: Unemployed vs. Employed
Predictor*   Category     B-hat(2:j)   se(B-hat(2:j))   t       P(t42 > |t|)
INTERCEPT                 -0.643       0.296            -2.17   0.035
AGE4CAT      30-44        -0.852       0.294            -2.89   0.006
             45-59        -0.838       0.258            -3.25   0.002
             60+           1.828       0.295             6.20   < 0.001
SEX          Male         -1.393       0.198            -7.05   < 0.001
ALD          Yes          -0.164       0.357            -0.46   0.649
MDE          Yes          -0.140       0.157            -0.89   0.379
ED4CAT       12           -0.847       0.235            -3.60   0.001
             13-15        -1.365       0.258            -5.30   < 0.001
             16+          -1.731       0.310            -5.57   < 0.001
MAR3CAT      Previously   -0.589       0.225            -2.62   0.012
             Never        -2.785       0.380            -7.32   < 0.001
21
Estimated Multinomial Logit Regression
Model for WKSTAT3: Logit 3.
Logit 3: Not in Labor Force vs. Employed
Predictor*   Category     B-hat(3:j)   se(B-hat(3:j))   t       P(t42 > |t|)
INTERCEPT                 -0.379       0.173            -2.19   0.034
AGE4CAT      30-44        -0.316       0.129            -2.46   0.018
             45-59         0.065       0.171             0.38   0.706
             60+           2.381       0.173            13.78   < 0.001
SEX          Male         -0.640       0.110            -5.82   < 0.001
ALD          Yes           0.333       0.130             2.56   0.014
MDE          Yes           0.098       0.088             1.12   0.269
ED4CAT       12           -0.651       0.141            -4.62   < 0.001
             13-15        -0.917       0.146            -6.26   < 0.001
             16+          -1.229       0.160            -7.70   < 0.001
MAR3CAT      Previously   -0.052       0.105            -0.50   0.621
             Never         0.553       0.132             4.18   < 0.001
22
Estimates of Adjusted Odds Ratios for the
Workforce Status Outcome (WKSTAT3).
                          Unemployed : Employed         NLF : Employed
Predictor*   Category     OR (2:j)   95% CI             OR (3:j)   95% CI
AGE4CAT      30-44        0.43       (0.24, 0.77)       0.73       (0.56, 0.94)
             45-59        0.43       (0.26, 0.73)       1.07       (0.76, 1.51)
             60+          6.22       (3.43, 11.28)      10.81      (7.62, 15.34)
SEX          Male         0.25       (0.17, 0.37)       0.53       (0.42, 0.66)
ALD          Yes          0.85       (0.41, 1.74)       1.40       (1.07, 1.82)
MDE          Yes          0.87       (0.63, 1.19)       1.10       (0.92, 1.32)
ED4CAT       12           0.43       (0.27, 0.69)       0.52       (0.39, 0.69)
             13-15        0.26       (0.15, 0.43)       0.40       (0.30, 0.54)
             16+          0.18       (0.10, 0.33)       0.29       (0.21, 0.40)
MAR3CAT      Previously   0.55       (0.35, 0.87)       0.95       (0.77, 1.17)
             Never        0.06       (0.03, 0.13)       1.74       (1.33, 2.27)
23
Stata: Wald Tests of Model Parameters
• To evaluate the fitted model, we perform multi-parameter
Wald tests of the overall significance of each of the
predictors: AG4CAT, SEX, ALD, MDE, MAR3CAT, and
ED4CAT:
24
Stata: Wald Tests of Model Parameters
25
Stata Wald Test Results
26
Multinomial Logistic Regression:
Ordinal Logit Models
• Multiple “parameterizations”
• Ordinal models use fewer parameters than
generalized logit models by assuming a
functional relationship for the odds in adjacent
categories.
• Hosmer et al. (2013)
– Generalized logit ~ “Baseline Model”
– Adjacent categories model (ordinal)
– Continuation ratio model (ordinal)
– Proportional odds model (ordinal)
27
Multinomial Logistic Model:
Ordinal Model Example
Proportional Odds or "Cumulative Logit Model"
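\[
c_k(x) = \ln\left[\frac{P(Y \le k \mid x)}{P(Y > k \mid x)}\right]
= \ln\left[\frac{\pi_0(x) + \pi_1(x) + \cdots + \pi_k(x)}{\pi_{k+1}(x) + \pi_{k+2}(x) + \cdots + \pi_K(x)}\right]
= \tau_k - (\beta_1 X_1 + \cdots + \beta_p X_p)
\]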
28
Multinomial Logistic Regression:
Cumulative Logit Model
• Stata: svy: ologit
• R: svyolr()
• Pseudo Maximum Likelihood Estimation of
Parameters
• Sampling errors estimated using Taylor
Series Approximation or Replication
29
Ordinal Response Question
NHANES 2005-06 PAQ.180
Please tell me which of these four sentences best describes your usual daily
activities:
1. You sit during the day and do not walk about very much
2. You stand or walk about quite a lot during the day,
but do not have to carry or lift things very often
3. You lift or carry light loads, or have to climb stairs or hills often; or
4. You do heavy work or carry heavy loads.
7. Refused
9. Don’t know
30
Recoded Ordinal Response
ESS-Russian 2016 STFLIFE
Score of satisfaction with life, on a 0-10 scale, with 0 representing extremely
dissatisfied and 10 representing very satisfied
1. 0-1
2. 2-4
3. 5
4. 6-8
5. 9-10
31
Bar Chart (Weighted) of the Recoded
Variable STFLIFE2 (2016 ESS)
32
Cumulative Logit Models
(a.k.a. Proportional Odds Models)
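\[
\text{logit}[P(y \le k \mid x)] = \ln\left[\frac{P(y \le k \mid x)}{P(y > k \mid x)}\right]
= \ln\left[\frac{P(y=1 \mid x) + \cdots + P(y=k \mid x)}{P(y=k+1 \mid x) + \cdots + P(y=K \mid x)}\right]
= B_k - (B_1 x_1 + B_2 x_2 + \cdots + B_p x_p)
\]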
33
Cumulative Logit Model
• For an ordinal variable with K categories, K - 1
cumulative logit functions are defined.
• Each cumulative logit function includes a unique
intercept or “cut point,” τk, but all share a common set of
regression parameters for the p predictors.
• Consequently, a cumulative logit model for an ordinal
response variable with K categories and j = 1, …, p
predictors requires the estimation of (K - 1) + p
parameters—far fewer than the (K - 1) × (p + 1)
parameters for a multinomial logit model.
• Still need design-adjusted tests of goodness of fit!
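The parameter counts in the bullets above are simple arithmetic, sketched here for a five-category outcome. The number of predictors p = 6 is a hypothetical value chosen for illustration, and both function names are introduced only for this example.

```python
def n_params_cumulative(K, p):
    """K - 1 cut points plus one shared coefficient per predictor."""
    return (K - 1) + p

def n_params_multinomial(K, p):
    """K - 1 logits, each with its own intercept and p coefficients."""
    return (K - 1) * (p + 1)

K, p = 5, 6  # five ordered categories; six predictors (hypothetical)
print(n_params_cumulative(K, p))   # 10
print(n_params_multinomial(K, p))  # 28
```

The gap grows with both K and p, which is the sense in which the cumulative logit model is more parsimonious when its proportional odds assumption is reasonable.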
34
Logit Probability Transform:
Cumulative and Category-specific
\[
\hat{\pi}_k(x) = \hat{\pi}(y \le k \mid x) - \hat{\pi}(y \le k-1 \mid x)
\]
where:
$\hat{\pi}(y \le 0 \mid x) = 0$.
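The differencing of cumulative probabilities can be sketched numerically as follows. This is an illustration under stated assumptions: the cut points and the linear predictor value are hypothetical, the cumulative logit is taken to be cut point minus linear predictor (the parameterization used in these slides), and the function names are not from any package.

```python
import math

def inv_logit(z):
    """Inverse logit: converts a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(cutpoints, xb):
    """Category probabilities for an ordinal outcome with K = len(cutpoints) + 1
    categories, assuming logit[P(y <= k | x)] = cutpoint_k - xb."""
    cumulative = [inv_logit(c - xb) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0  # P(y <= 0 | x) = 0
    for cum in cumulative:
        probs.append(cum - previous)
        previous = cum
    return probs

# Hypothetical cut points for a five-category outcome and a linear predictor
cutpoints = [-2.0, -0.5, 0.3, 2.2]
probs = category_probs(cutpoints, xb=0.4)
print(probs)
print(sum(probs))  # category probabilities sum to 1
```

The cut points must be increasing for every category probability to be positive, which estimation routines for this model enforce.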
35
Cumulative Logit Model:
Stata and R Code
Stata:
svy: ologit stflife2 i.agecat i.marcat male
svy: ologit, or
* Note: gologit2 (user-written) can be used for a design-adjusted
* test of the proportional odds assumption; see pages 322-323 in ASDA
R:
library(survey)
stflife.olr <- svyolr(stflife2 ~ factor(agecat) +
    factor(marcat) + male, design = russia.dsgn)
summary(stflife.olr)
# obtain odds ratios for the predictors
exp(stflife.olr$coefficients)
36
Estimated Cumulative Logit Regression
Model for STFLIFE2
37
Interpretation of Parameter Estimates
in Cumulative Logit Models
• Given Stata’s parameterization, negative coefficients
indicate decreased odds of higher-valued categories on
the ordinal dependent variable
• Hence, older individuals have lower odds of higher-
valued categories on the score of satisfaction with life,
indicating less satisfaction with life
• Individuals who were previously married have lower odds of
higher-valued categories, indicating less satisfaction with
life, compared with those who are married
• Fit the same model using standard linear regression
to confirm directions of relationships!
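The sign interpretation can be verified with a small numeric sketch. It assumes the parameterization logit[P(y <= k | x)] = cut point minus linear predictor; the coefficient and cut points below are hypothetical and are not the STFLIFE2 estimates.

```python
import math

def prob_above(cutpoint, xb):
    """P(y > k | x) when logit[P(y <= k | x)] = cutpoint - xb."""
    return 1.0 / (1.0 + math.exp(cutpoint - xb))

def odds(p):
    return p / (1.0 - p)

b = -0.5                            # hypothetical negative coefficient
cutpoints = [-2.0, -0.5, 0.3, 2.2]  # hypothetical cut points

for cut in cutpoints:
    p0 = prob_above(cut, xb=0.0)    # predictor = 0
    p1 = prob_above(cut, xb=b)      # predictor = 1
    # The odds ratio for being above the cut point equals exp(b) at every cut
    print(round(p0, 3), round(p1, 3), round(odds(p1) / odds(p0), 3))

print(round(math.exp(b), 3))
```

With a negative coefficient, the probability of falling in a higher category drops at every cut point, and the ratio of odds is the same at each one. That constant ratio is the proportional odds assumption.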
38
Estimated Cumulative Odds Ratios in the
Cumulative Logit Regression Model for STFLIFE2
39