Analysis of Complex Sample Survey Data: Multinomial and Ordinal Logistic Regression For Complex Samples


Analysis of Complex Sample Survey Data

SURVMETH 614

Lecture Notes: Module 12

Multinomial and Ordinal Logistic Regression for Complex Samples

Instructor: Brady T. West

e-mail: [email protected]

July 21, 2021

1
Multinomial Responses
NHANES HUQ.040

What kind of place do you go to most often: is it a clinic, doctor's office,
emergency room, or some other place?
1. CLINIC OR HEALTH CENTER
2. DOCTOR'S OFFICE OR HMO
3. HOSPITAL EMERGENCY ROOM
4. HOSPITAL OUTPATIENT DEPARTMENT
5. SOME OTHER PLACE
6. REFUSED
7. DON'T KNOW

2
Multinomial Responses
NCS-R EM7.1

What about your current employment situation as of today -- are you?
1. EMPLOYED
2. SELF-EMPLOYED
3. LOOKING FOR WORK; UNEMPLOYED
4. TEMPORARILY LAID OFF
5. RETIRED
6. HOMEMAKER
7. STUDENT
8. MATERNITY LEAVE
9. ILLNESS/SICK LEAVE
10. DISABLED
11. OTHER (SPECIFY)

3
Multinomial Response Models
Consider modeling a multinomial dependent variable with K = 3 categories:
Y = employment status: 1 (employed), 2 (unemployed), 3 (not in labor force)
as a function of six independent variables:
$X_1$ = Age category
$X_2$ = Gender
$X_3$ = Dx of alcohol dependency
$X_4$ = Dx of major depression episode
$X_5$ = Education level
$X_6$ = Marital status

4
Multinomial Responses:
Analysis Options
• Contingency table analysis
– Stata: svy: tab
– SAS: PROC SURVEYFREQ

• Log linear models
– Limited programs for complex sample survey data

• Multinomial versions of Probit, Negative Binomial, and C-Log-Log
models (generalized linear regression models)

• Multinomial Logistic Regression (our focus today)
– Generalized logit models / Polytomous models
– Ordinal logit models (see Hosmer et al., 2013)

5
Review:
Binary Logistic Regression Model

$$g(x) = \ln\left[\frac{\pi_1(x)}{\pi_0(x)}\right] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

where:
$\pi_0(x)$ = Prob(Y=0|x);
$\pi_1(x)$ = Prob(Y=1|x);
$\beta_0$ = unique intercept for the logit function;
$\beta_1, \ldots, \beta_p$ = model coefficients corresponding to predictors $x = (X_1, \ldots, X_p)$

6
Multinomial Logistic Model:
Generalized Logit Form

$$g_k(x) = \ln\left[\frac{\pi_k(x)}{\pi_1(x)}\right] = \beta_{k0} + \beta_{k1} X_1 + \cdots + \beta_{kp} X_p$$

where:
$\pi_1(x)$ = Prob(Y=1|x);
$\pi_k(x)$ = Prob(Y=k|x);
$\beta_{k0}$ = unique intercept for the k-th logit function;
$\beta_{k1}, \ldots, \beta_{kp}$ = model coefficients corresponding to predictors $x = (X_1, \ldots, X_p)$

7
Multinomial Logistic Regression:
Generalized Logit Model
Using the example where Y takes the values
{1 = Employed, 2 = Unemployed, 3 = Not in Labor Force},
consider two logit functions with Y=1 as the reference.

$$\text{logit}(\pi(\text{"UN"} \mid x)) = \text{logit}(\pi_2) = \ln\left[\frac{\pi(y=2 \mid x)}{\pi(y=1 \mid x)}\right] = B_{2:0} + B_{2:1} x_1 + \cdots + B_{2:p} x_p$$

$$\text{logit}(\pi(\text{"NLF"} \mid x)) = \text{logit}(\pi_3) = \ln\left[\frac{\pi(y=3 \mid x)}{\pi(y=1 \mid x)}\right] = B_{3:0} + B_{3:1} x_1 + \cdots + B_{3:p} x_p$$

8
Multinomial Logistic Regression:
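Generalized Logit Model

$$\hat{\pi}(y=2 \mid x) = \frac{e^{X\hat{\beta}_2}}{1 + e^{X\hat{\beta}_2} + e^{X\hat{\beta}_3}}$$

$$\hat{\pi}(y=3 \mid x) = \frac{e^{X\hat{\beta}_3}}{1 + e^{X\hat{\beta}_2} + e^{X\hat{\beta}_3}}$$

$$\hat{\pi}(y=1 \mid x) = \frac{1}{1 + e^{X\hat{\beta}_2} + e^{X\hat{\beta}_3}}$$

note: $\hat{\pi}(y=2 \mid x) + \hat{\pi}(y=3 \mid x) + \hat{\pi}(y=1 \mid x) = 1.0$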
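
The three predicted probabilities can be checked numerically. The following is a minimal Python sketch, not code from the course: the function name `multinomial_probs` and the two linear predictor values are hypothetical and chosen only for illustration.

```python
import math

def multinomial_probs(etas):
    """Return [P(y=1), P(y=2), ..., P(y=K)] from the linear predictors
    of logits 2..K. Category 1 is the baseline, so its linear
    predictor is fixed at 0 and contributes exp(0) = 1."""
    exps = [math.exp(eta) for eta in etas]
    denom = 1.0 + sum(exps)
    return [1.0 / denom] + [e / denom for e in exps]

# Hypothetical linear predictors for logit 2 and logit 3 (illustration only)
probs = multinomial_probs([-0.5, 0.25])
print([round(p, 3) for p in probs])

# The probabilities sum to 1, and the ratio of P(y=2) to P(y=1)
# recovers exp of the logit-2 linear predictor.
print(sum(probs))
print(probs[1] / probs[0], math.exp(-0.5))
```

The same function works for any number of categories, because only the K - 1 non-baseline linear predictors are passed in.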
9
Multinomial vs. Separate Logits
• A natural question: “Is it possible to simply estimate the
multinomial logit regression model as a series of binary
logistic regression models that consider only the response
data for two categories at a time?”
• Strictly speaking, the answer is no. The “separate-fitting”
approach will yield estimates that are similar but not identical to
those from simultaneous estimation of the multinomial logits.
• Standard errors from separate fitting will generally be greater
than those from simultaneous estimation, and only simultaneous
estimation yields the full variance-covariance matrix that is needed
to test hypotheses concerning the significance or equivalence of
parameters across the estimated logits.
10
Multinomial Logistic Model:
Generalized Logit
• Pseudo-maximum likelihood estimation of
parameters:

$$PL_{Mult}(\hat{\beta} \mid X) = \prod_{i=1}^{n} \left\{ \prod_{k=1}^{K} \hat{\pi}_k(x_i)^{y_i^{(k)}} \right\}^{w_i}$$

where: $y_i^{(k)}$ = 1 if y = k for sampled unit i, 0 otherwise;
$\hat{\pi}_k(x_i)$ is the estimated probability that $y_i = k \mid x_i$; and
$w_i$ is the survey weight for sampled unit i.
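
On the log scale the pseudo-likelihood becomes a weighted sum, which is easier to compute. The sketch below is a rough Python illustration under stated assumptions: the function name `log_pseudo_likelihood` and the toy data (three sampled units, their fitted probabilities, and their weights) are hypothetical.

```python
import math

def log_pseudo_likelihood(y, probs, weights):
    """Weighted multinomial log pseudo-likelihood.

    y       : observed categories coded 1..K, one per sampled unit
    probs   : per-unit lists [pi_1(x_i), ..., pi_K(x_i)]
    weights : survey weights w_i
    """
    total = 0.0
    for y_i, p_i, w_i in zip(y, probs, weights):
        # Only the category with indicator y_i^(k) = 1 contributes for unit i
        total += w_i * math.log(p_i[y_i - 1])
    return total

# Hypothetical toy data: three sampled units, K = 3 categories
y = [1, 3, 2]
probs = [[0.6, 0.1, 0.3], [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]]
weights = [1200.0, 800.0, 1500.0]
print(log_pseudo_likelihood(y, probs, weights))
```

Setting every weight to 1 gives the ordinary multinomial log-likelihood, which shows that the weights are the only change to the estimating function.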

11
Logit Probability Transform

$$\pi_k(B) = \frac{\exp(x' B_k)}{1 + \sum_{j=2}^{K} \exp(x' B_j)},$$

$x_{h\alpha i}$ = a column vector of the p + 1 design matrix elements for case i
= $[1, x_{1,h\alpha i}, \ldots, x_{p,h\alpha i}]'$;

$B = \{B_{2,0}, \ldots, B_{2,p}, \ldots, B_{K,0}, \ldots, B_{K,p}\}$ is a
(K - 1) × (p + 1) vector of parameters
with $B_1 = 0$ for k = 1 (the baseline).
12
Multinomial Logistic Model:
Generalized Logit
• Odds ratio interpretation for regression
parameters, as in the simple logit model:

$$\hat{\psi}_{k:j} = \exp(\hat{B}_{k:j})$$

$$CI(\hat{\psi}_{k:j}) = \exp\left[\hat{B}_{k:j} \pm t_{df,1-\alpha/2} \cdot se(\hat{B}_{k:j})\right]$$

where:
$\hat{B}_{k:j}$ = the parameter estimate corresponding to
predictor j in logit equation k.
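
As a quick check of these formulas, the Python sketch below reproduces one odds ratio and interval from the fitted model reported later in these notes (AGE4CAT 30-44 in the Unemployed vs. Employed logit: estimate -0.852, standard error 0.294). The function name `odds_ratio_ci` is hypothetical, and the critical value 2.018 is an assumption: it is the approximate 97.5th percentile of a t distribution with 42 design degrees of freedom.

```python
import math

def odds_ratio_ci(b, se, t_crit):
    """Return (odds ratio, lower limit, upper limit) for a logit coefficient."""
    return (math.exp(b),
            math.exp(b - t_crit * se),
            math.exp(b + t_crit * se))

# Assumed critical value: t distribution, 42 design df, 97.5th percentile
T_CRIT_42 = 2.018

# AGE4CAT 30-44, Unemployed vs. Employed logit: B = -0.852, se = 0.294
or_, lower, upper = odds_ratio_ci(-0.852, 0.294, T_CRIT_42)
print(round(or_, 2), round(lower, 2), round(upper, 2))  # 0.43 0.24 0.77
```

The interval is built on the logit scale and then exponentiated, which is why it is not symmetric around the odds ratio.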


13
Multinomial Logistic Model:
Generalized Logit
• Odds ratio interpretation for regression parameters,
comparing two non-reference categories:

$$\hat{\psi}_{k,k':j} = \exp(\hat{B}_{k:j} - \hat{B}_{k':j})$$

$$CI(\hat{\psi}_{k,k':j}) = \exp\left[(\hat{B}_{k:j} - \hat{B}_{k':j}) \pm t_{df,1-\alpha/2} \cdot se(\hat{B}_{k:j} - \hat{B}_{k':j})\right]$$

where:
$\hat{B}_{k:j}, \hat{B}_{k':j}$ = the parameter estimates corresponding to
predictor j in logit equations k and k'.
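
For example, the SEX = Male estimates reported later in these notes (-1.393 in the Unemployed logit and -0.640 in the Not in Labor Force logit) give an odds ratio comparing the two non-reference categories. The Python sketch below computes the point estimate only: the confidence interval needs the standard error of the difference, which depends on the covariance between the two estimates, and that covariance is not reported in these notes.

```python
import math

# SEX = Male coefficients from the two fitted logits (Employed is the baseline)
b_unemployed = -1.393   # logit 2: Unemployed vs. Employed
b_nlf = -0.640          # logit 3: Not in Labor Force vs. Employed

# Odds ratio for Not in Labor Force vs. Unemployed, men relative to women
or_nlf_vs_unemployed = math.exp(b_nlf - b_unemployed)
print(round(or_nlf_vs_unemployed, 2))  # 2.12
```

The result equals the ratio of the two baseline-referenced odds ratios, exp(-0.640) / exp(-1.393).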

14
Multinomial Logistic Model:
Generalized Logit
• Variance Estimation by Taylor Series
Approximation or Replication Methods

15
Weighted Distribution of WKSTAT3
(Source: NCS-R)
[Figure: bar chart of weighted proportions (vertical axis: Proportions, 0 to .6) for the WKSTAT3 categories Employed, Unemployed, and NLF.]

16
Initial Bivariate Design-Based Tests Assessing Potential
Predictors of WKSTAT3 for the NCS-R Adult Sample

Categorical Predictor    F-test Statistic              P(F > F)
AGE4CAT                  F(4.96, 208.51) = 113.49      < 0.001
SEX                      F(1.87, 78.75) = 27.33        < 0.001
ALD                      F(1.72, 72.44) = 3.12         0.057
MDE                      F(1.73, 72.86) = 4.67         0.016
ED4CAT                   F(5.15, 216.12) = 27.64       < 0.001
MAR3CAT                  F(3.20, 134.34) = 23.12       < 0.001

17
Multinomial Logistic Model Example:
Stata Code
svy: mlogit wkstat3 i.sex ald mde i.ed4cat ///
    i.ag4cat i.mar3cat
svy: mlogit, rrr

• Repeating the svy: mlogit command with the rrr option requests
output of the estimated odds ratios (which Stata labels relative
risk ratios) and 95% confidence intervals.
• The default baseline category for the multinomial logit regression
model in Stata will be the lowest-valued category of the dependent
variable, which in this example would be 1 = “Employed”.
• NOTE: At present, there is not a modeling option in the survey
package in R for fitting multinomial logistic regression models!
These models need to be fitted using the svyVGAM package.

18
Multinomial Logistic Model Example
Stata: svy: mlogit
svy: mlogit wkstat3 i.sex ald mde i.ed4cat ///
    i.ag4cat i.mar3cat, baseoutcome(3)

• This command would fit the multinomial logit model
to WKSTAT3 with 3 = “Not in Labor Force” as the
baseline outcome category.

19
Multinomial Logistic Model Example:
R Code

• NOTE: At present, there is not a modeling option in the survey package
in R for fitting multinomial logistic regression models!

• These models need to be fitted using the svyVGAM package:

library(svyVGAM)
# Generalized logit model with level "1" (Employed) as the reference
multi_model <- svy_vglm(
  as.factor(wkstat3) ~ factor(sex) + ald + mde + factor(ed4cat) +
    factor(ag4cat) + factor(mar3cat),
  design = nhanes_design,
  family = multinomial(refLevel = "1")
)
summary(multi_model)

20
Estimated Multinomial Logit Regression
Model for WKSTAT3: Logit 2.
Logit 2: Unemployed vs. Employed
Predictor*   Category      $\hat{B}_{2:j}$   $se(\hat{B}_{2:j})$   t        P(t42 > t)
INTERCEPT                  -0.643            0.296                 -2.17    0.035
AGE4CAT      30-44         -0.852            0.294                 -2.89    0.006
             45-59         -0.838            0.258                 -3.25    0.002
             60+            1.828            0.295                  6.20    < 0.001
SEX          Male          -1.393            0.198                 -7.05    < 0.001
ALD          Yes           -0.164            0.357                 -0.46    0.649
MDE          Yes           -0.140            0.157                 -0.89    0.379
ED4CAT       12            -0.847            0.235                 -3.60    0.001
             13-15         -1.365            0.258                 -5.30    < 0.001
             16+           -1.731            0.310                 -5.57    < 0.001
MAR3CAT      Previously    -0.589            0.225                 -2.62    0.012
             Never         -2.785            0.380                 -7.32    < 0.001

21
Estimated Multinomial Logit Regression
Model for WKSTAT3: Logit 3.
Logit 3: Not in Labor Force vs. Employed
Predictor*   Category      $\hat{B}_{3:j}$   $se(\hat{B}_{3:j})$   t        P(t42 > t)
INTERCEPT                  -0.379            0.173                 -2.19    0.034
AGE4CAT      30-44         -0.316            0.129                 -2.46    0.018
             45-59          0.065            0.171                  0.38    0.706
             60+            2.381            0.173                 13.78    < 0.001
SEX          Male          -0.640            0.110                 -5.82    < 0.001
ALD          Yes            0.333            0.130                  2.56    0.014
MDE          Yes            0.098            0.088                  1.12    0.269
ED4CAT       12            -0.651            0.141                 -4.62    < 0.001
             13-15         -0.917            0.146                 -6.26    < 0.001
             16+           -1.229            0.160                 -7.70    < 0.001
MAR3CAT      Previously    -0.052            0.105                 -0.50    0.621
             Never          0.553            0.132                  4.18    < 0.001

22
Estimates of Adjusted Odds Ratios for the
Workforce Status Outcome (WKSTAT3).
                           Unemployed : Employed               NLF : Employed
Predictor*   Category      $\hat{\psi}_{2:j}$   95% CI          $\hat{\psi}_{3:j}$   95% CI
AGE4CAT      30-44         0.43                 (0.24, 0.77)    0.73                 (0.56, 0.94)
             45-59         0.43                 (0.26, 0.73)    1.07                 (0.76, 1.51)
             60+           6.22                 (3.43, 11.28)   10.81                (7.62, 15.34)
SEX          Male          0.25                 (0.17, 0.37)    0.53                 (0.42, 0.66)
ALD          Yes           0.85                 (0.41, 1.74)    1.40                 (1.07, 1.82)
MDE          Yes           0.87                 (0.63, 1.19)    1.10                 (0.92, 1.32)
ED4CAT       12            0.43                 (0.27, 0.69)    0.52                 (0.39, 0.69)
             13-15         0.26                 (0.15, 0.43)    0.40                 (0.30, 0.54)
             16+           0.18                 (0.10, 0.33)    0.29                 (0.21, 0.40)
MAR3CAT      Previously    0.55                 (0.35, 0.87)    0.95                 (0.77, 1.17)
             Never         0.06                 (0.03, 0.13)    1.74                 (1.33, 2.27)

23
Stata: Wald Tests of Model Parameters
• To evaluate the fitted model, we perform multi-parameter
Wald tests of the overall significance of each of the
predictors: AG4CAT, SEX, ALD, MDE, MAR3CAT, and
ED4CAT:

test 2.ag4cat 3.ag4cat 4.ag4cat
test sex
test ald
test mde
test 2.mar3cat 3.mar3cat
test 2.ed4cat 3.ed4cat 4.ed4cat

24
Stata: Wald Tests of Model Parameters

• To test whether the education level coefficients in the two
logits are equal to each other, a test of the following form
may be used (note the use of labels here):

test [NLF=unemployed]: 2.ed4cat 3.ed4cat 4.ed4cat

25
Stata Wald Test Results

Categorical Predictor    F-test Statistic      P(F > F)
AGE4CAT                  F(6,37) = 83.59       < 0.001
SEX                      F(2,41) = 35.75       < 0.001
ALD                      F(2,41) = 5.05        0.011
MDE                      F(2,41) = 1.14        0.330
ED4CAT                   F(6,37) = 13.68       < 0.001
MAR3CAT                  F(4,39) = 24.81       < 0.001

26
Multinomial Logistic Regression:
Ordinal Logit Models
• Multiple “parameterizations”
• Ordinal models use fewer parameters than
generalized logit models by assuming a
functional relationship for the odds in adjacent
categories.
• Hosmer et al. (2013)
– Generalized logit ~ “Baseline Model”
– Adjacent categories model (ordinal)
– Continuation ratio model (ordinal)
– Proportional odds model (ordinal)
27
Multinomial Logistic Model:
Ordinal Model Example
Proportional Odds or "Cumulative Logit Model"

$$c_k(x) = \ln\left[\frac{P(Y \le k \mid x)}{P(Y > k \mid x)}\right]$$

$$= \ln\left[\frac{\pi_0(x) + \pi_1(x) + \cdots + \pi_k(x)}{\pi_{k+1}(x) + \pi_{k+2}(x) + \cdots + \pi_K(x)}\right]$$

$$= \tau_k - (\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)$$
28
Multinomial Logistic Regression:
Cumulative Logit Model
• Stata: svy: ologit
• R: svyolr()
• Pseudo Maximum Likelihood Estimation of
Parameters
• Sampling errors estimated using Taylor
Series Approximation or Replication

29
Ordinal Response Question
NHANES 2005-06 PAQ.180

Please tell me which of these four sentences best describes your usual daily
activities:
1. You sit during the day and do not walk about very much
2. You stand or walk about quite a lot during the day,
but do not have to carry or lift things very often
3. You lift or carry light loads, or have to climb stairs or hills often; or
4. You do heavy work or carry heavy loads.
7. Refused
9. Don’t know

30
Recoded Ordinal Response
ESS-Russian 2016 STFLIFE
Score of satisfaction with life, on a 0-10 scale, with 0 representing extremely
dissatisfied and 10 representing very satisfied
1. 0-1
2. 2-4
3. 5
4. 6-8
5. 9-10
 

31
Bar Chart (Weighted) of the Recoded
Variable STFLIFE2 (2016 ESS)

32
Cumulative Logit Models
(a.k.a. Proportional Odds Models)

$$\text{logit}[P(y \le k \mid x)] = \ln\left[\frac{P(y \le k \mid x)}{P(y > k \mid x)}\right]$$

$$= \ln\left[\frac{P(y = 1 \mid x) + \cdots + P(y = k \mid x)}{P(y = k+1 \mid x) + \cdots + P(y = K \mid x)}\right]$$

$$= B_k - (B_1 x_1 + B_2 x_2 + \cdots + B_p x_p)$$

33
Cumulative Logit Model
• For an ordinal variable with K categories, K - 1
cumulative logit functions are defined.
• Each cumulative logit function includes a unique
intercept or “cut point,” τk, but all share a common set of
regression parameters for the p predictors.
• Consequently, a cumulative logit model for an ordinal
response variable with K categories and j = 1, …, p
predictors requires the estimation of (K - 1) + p
parameters—far fewer than the (K - 1) × (p + 1)
parameters for a multinomial logit model.
• Still need design-adjusted tests of goodness of fit!
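
The parameter counts are easy to verify for the two examples in these notes. The Python sketch below is illustrative only: the function names are hypothetical, and the value p = 6 for the STFLIFE2 model assumes dummy coding of AGECAT (4 levels), MARCAT (3 levels), and the male indicator.

```python
def n_params_cumulative_logit(K, p):
    """K - 1 cut points plus one shared coefficient per predictor."""
    return (K - 1) + p

def n_params_generalized_logit(K, p):
    """K - 1 logits, each with its own intercept and p coefficients."""
    return (K - 1) * (p + 1)

# STFLIFE2: K = 5 ordered categories, p = 6 dummy-coded predictors (assumed)
print(n_params_cumulative_logit(5, 6))    # 10
print(n_params_generalized_logit(5, 6))   # 28

# WKSTAT3: K = 3 categories, p = 11 predictors, i.e. two logits of 12 rows each
print(n_params_generalized_logit(3, 11))  # 24
```

The saving grows with K, since every additional category costs one cut point in the cumulative logit model but p + 1 parameters in the generalized logit model.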

34
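Logit Probability Transform:
Cumulative and Category-specific

$$\hat{\pi}(y \le k \mid x) = \frac{\exp(x\hat{B})}{1 + \exp(x\hat{B})} = \frac{\exp[\hat{B}_k - (\hat{B}_1 x_1 + \hat{B}_2 x_2 + \cdots + \hat{B}_p x_p)]}{1 + \exp[\hat{B}_k - (\hat{B}_1 x_1 + \hat{B}_2 x_2 + \cdots + \hat{B}_p x_p)]}$$

$$\hat{\pi}_k(x) = \hat{\pi}(y \le k \mid x) - \hat{\pi}(y \le k-1 \mid x)$$

where:
$\hat{\pi}(y \le 0 \mid x) = 0$ and $\hat{\pi}(y \le K \mid x) = 1$.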
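
The step from cumulative to category-specific probabilities can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the four cut points, and the linear predictor value are hypothetical, and the sign convention is the one used above (cut point minus linear predictor).

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(cutpoints, xb):
    """Category probabilities for a cumulative logit model.

    cutpoints : increasing cut points for k = 1..K-1
    xb        : linear predictor B_1*x_1 + ... + B_p*x_p (no intercept)
    """
    # P(y <= k | x) for k = 1..K-1, then P(y <= K | x) = 1
    cumulative = [expit(c - xb) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0  # P(y <= 0 | x) = 0
    for cum in cumulative:
        probs.append(cum - previous)
        previous = cum
    return probs

# Hypothetical cut points for K = 5 categories and a hypothetical xb
probs = category_probs([-2.0, -0.5, 0.3, 1.8], xb=-0.4)
print([round(p, 3) for p in probs])
print(sum(probs))
```

Because the cut points are increasing, each difference of adjacent cumulative probabilities is positive, and the K category probabilities sum to 1.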

35
Cumulative Logit Model:
Stata and R Code
Stata:
svy: ologit stflife2 i.agecat i.marcat male
svy: ologit, or
* Note: can use gologit2 (user-written) for a design-adjusted test of
* the proportional odds assumption; see pages 322-323 in ASDA

R:
library(survey)
stflife.olr <- svyolr(stflife2 ~ factor(agecat) + factor(marcat) + male,
                      design = russia.dsgn)
summary(stflife.olr)
# Obtain odds ratios by exponentiating the coefficients
exp(stflife.olr$coef)
36
Estimated Cumulative Logit Regression
Model for STFLIFE2
[Table: parameter estimates with standard errors $se(\hat{B})$; table body not recovered.]

37
Interpretation of Parameter Estimates
in Cumulative Logit Models
• Given Stata’s parameterization, negative coefficients
indicate decreased odds of higher-valued categories on
the ordinal dependent variable
• Hence, older individuals have lower odds of higher-
valued categories on the score of satisfaction with life,
indicating less satisfaction with life
• Individuals who were previously married have lower odds of
higher-valued categories, indicating less satisfaction with
life, compared with those who are married
• Fit the same model using standard linear regression
to confirm directions of relationships!

38
Estimated Cumulative Odds Ratios in the
Cumulative Logit Regression Model for STFLIFE2

Predictor    Category       Cumulative Odds Ratio    95% CI
AGECAT       2=30-44        0.596                    (0.457, 0.777)
             3=45-59        0.481                    (0.362, 0.639)
             4=60+          0.457                    (0.330, 0.633)
MARCAT       2=Previous     0.807                    (0.656, 0.993)
             3=Never        0.888                    (0.686, 1.149)
GENDER       Male           0.894                    (0.741, 1.077)

39
