Analysis of Complex Sample Survey Data

SURVMETH 614
Lecture Notes: Module 7
Percentiles, Subpopulation Analysis, and Functions of Estimates
Instructor: Brady T. West
e-mail: [email protected]
July 7, 2021
Estimation of Percentiles
• Analysts are often interested in estimating
more than just the mean to describe the
distribution of a variable in a population
• Example: 95th percentile of PSA levels in a
national sample of men over age 40
• Example: Median total assets (a very
skewed variable) for rural homes
• Example: Quartiles of systolic blood
pressure for African-American females
2
The Ungrouped Estimation Method
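• First, generate a weighted estimate of the
CDF for a variable. The population CDF is
\[ F(x) = \frac{\sum_{i=1}^{N} I(y_i \le x)}{N} \]
• A weighted estimator of this CDF can be
written as follows:
\[ \hat{F}(x) = \frac{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i} \, I(y_{h\alpha i} \le x)}{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}} \]
where h indexes strata, \alpha indexes clusters within
strata, and i indexes sample elements within clusters.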
3
• The q-th quantile for a variable y is the
smallest value of the variable such that the
estimated CDF is greater than or equal to
q (q = 0.25, 0.5, 0.75, etc.)
• Ungrouped method: first order sample
values from smallest to largest, x(1),…, x(n),
and then find the value of j such that
\[ \hat{F}(x_j) \le q < \hat{F}(x_{j+1}). \]
4
• Then, a weighted estimate of the q-th
population quantile, Xq, is computed as
follows:
\[ \hat{X}_q = x_j + \frac{q - \hat{F}(x_j)}{\hat{F}(x_{j+1}) - \hat{F}(x_j)} \, (x_{j+1} - x_j) \]
• Taylor Series Linearization (Binder, 1991)
approaches can be used to estimate the
variance of this estimate, but BRR
approaches are recommended (Kovar,
Rao, and Wu, 1988)
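The ungrouped method described above can be sketched in a few lines of Python (used here only for illustration; it is not one of the packages covered in these notes). This is a minimal sketch under assumptions: the function names weighted_cdf and ungrouped_quantile and the toy data are hypothetical, tied values are collapsed into a single CDF step, and only the point estimate is produced (no TSL or BRR variance estimation).

```python
def weighted_cdf(values, weights):
    """Return the sorted distinct values and the weighted CDF estimate at each."""
    total = sum(weights)
    xs, cdf = [], []
    running = 0.0
    for x, w in sorted(zip(values, weights)):
        running += w
        if xs and xs[-1] == x:
            cdf[-1] = running / total      # tied values share one CDF step
        else:
            xs.append(x)
            cdf.append(running / total)
    return xs, cdf


def ungrouped_quantile(values, weights, q):
    """Weighted q-th quantile by linear interpolation of the estimated CDF."""
    xs, cdf = weighted_cdf(values, weights)
    if q < cdf[0]:
        return xs[0]                       # below the first CDF step
    for j in range(len(xs) - 1):
        if cdf[j] <= q < cdf[j + 1]:
            frac = (q - cdf[j]) / (cdf[j + 1] - cdf[j])
            return xs[j] + frac * (xs[j + 1] - xs[j])
    return xs[-1]                          # q at or above the last CDF step


# Toy data (hypothetical): four sampled values with unequal weights.
y = [10, 20, 30, 40]
w = [1, 2, 3, 4]
print(weighted_cdf(y, w))
print(ungrouped_quantile(y, w, 0.5))
```

With these toy weights the estimated CDF at the four values is 0.1, 0.3, 0.6, and 1.0, so the median falls between 20 and 30 and is interpolated two-thirds of the way from 20 to 30.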
5
Software for Design-based
Estimation of Percentiles
• R: svyquantile() function
• WesVar: Balanced Repeated Replication
(recommended)
• SAS: Woodruff’s Method (uses TSL)
• Procedures enabling appropriate computation of
standard errors and confidence intervals (using
BRR) are not readily implemented in the current
versions of Stata and SPSS
• Stata users can install the user-written epctile
command (findit epctile, net install epctile.pkg)
6
Estimation of Percentiles:
Total Household Wealth (2012 HRS)
                 SAS (TSL)               SAS (BRR)               Stata (TSL)
Percentile       Q̂_p        se(Q̂_p)     Q̂_p        se(Q̂_p)     Q̂_p        se(Q̂_p)
Q25              $21,953     $2,215      $21,953     $2,287      $22,000     $2,200
Q50 (Median)     $141,907    $7,754      $141,907    $7,761      $142,000    $8,000
Q75              $439,965    $18,652     $439,965    $19,940     $440,000    $18,869
Stata code:
svyset secu [pweight = nwgthh], strata(stratum)
epctile H11ATOTA, percentiles(25 50 75) subpop(if nfinr==1) svy
R code (default CI approach works well for large samples):
svyquantile(~H11ATOTA, hrs.sub.dsgn, c(0.25, 0.50, 0.75), ci=TRUE)
7
Subpopulation Analysis
• Examples of Subpopulations:
– Men, women
– Age groups
– Disease groups (asthma sufferers)
– Voters
– Home Owners
– Persons with income >$20,000
8
“Unconditional” vs. “Conditional”
Analysis
• Cochran (1977), West et al. (2008): “Conditional”
analysis “conditions” on the observed subpopulation sizes as
though they were fixed. It results from using “if”, “by”, or
“where” statements to subset the data before analysis.
• Conditional analysis is OK for simple random samples…
• …but not necessarily for stratified samples:
– The number of subpopulation cases falling in each stratum,
m(h), h=1,…,H, is a random variable and is rarely fixed by
the design.
– Conditional analysis is correct if the subpopulation of interest
is used to define explicit strata, e.g., Census Region in a
multi-stage national sample of U.S. households, or gender in a
sample of men and women in a university student body that is
stratified by sex.
9
“Unconditional” vs. “Conditional” Analysis
• “Unconditional” analysis treats the stratum
subpopulation sizes, m(h), h=1,…,H, as random
variables.
• Variability in how subpopulation cases are
distributed across strata and across clusters within
strata (including m(h)=0) must be reflected in the
variance estimation.
• The Stata subpop() option and the R subset() and
svyby() functions (applied to the survey design
object) ensure that this variability is reflected in
the standard errors of estimates.
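To make the distinction concrete, the following Python sketch compares a conditional and an unconditional linearized standard error for a subpopulation mean. It is an illustration under stated assumptions, not the algorithm used by Stata or R: the function name domain_mean_and_se, the toy data, and the with-replacement PSU variance approximation are all choices made here. The conditional version drops non-members first (as an “if” or “where” statement would), so a cluster with no subpopulation cases disappears from the design; the unconditional version keeps every cluster and lets non-members contribute zeros.

```python
from collections import defaultdict
from math import sqrt


def domain_mean_and_se(rows, conditional):
    """rows: (stratum, psu, weight, y, in_domain). Returns (mean, linearized SE)."""
    if conditional:
        # Subsetting first: PSUs with no domain members vanish from the design.
        rows = [r for r in rows if r[4]]
    wsum = sum(w for _, _, w, _, d in rows if d)
    mean = sum(w * y for _, _, w, y, d in rows if d) / wsum
    # PSU totals of the linearized variable z = w * d * (y - mean) / wsum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for h, a, w, y, d in rows:
        psu_totals[h][a] += w * (y - mean) / wsum if d else 0.0
    var = 0.0
    for h, totals in psu_totals.items():
        a_h = len(totals)
        if a_h < 2:
            raise ValueError(f"stratum {h} has a single PSU; variance undefined")
        zbar = sum(totals.values()) / a_h
        var += a_h / (a_h - 1) * sum((z - zbar) ** 2 for z in totals.values())
    return mean, sqrt(var)


# Toy data (hypothetical): 2 strata x 3 PSUs; PSU 3 of stratum 1 has no domain cases.
rows = [
    (1, 1, 1.0, 10, True), (1, 1, 1.0, 12, True),
    (1, 2, 1.0, 14, True), (1, 2, 1.0, 99, False),
    (1, 3, 1.0, 50, False), (1, 3, 1.0, 60, False),
    (2, 1, 1.0, 20, True), (2, 2, 1.0, 22, True), (2, 3, 1.0, 30, True),
]
print(domain_mean_and_se(rows, conditional=False))  # unconditional
print(domain_mean_and_se(rows, conditional=True))   # conditional
```

Both calls return the same point estimate, but the conditional standard error is smaller in this toy example because the empty cluster in stratum 1 no longer contributes to the between-cluster variability. If subsetting had left a stratum with only one cluster, the conditional calculation would fail outright.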
10
“Unconditional” vs. “Conditional” Analysis
• Incorrect conditional analyses can also result in
the case where only one sampling error
computation unit (SECU) is detected within a
stratum by the software, preventing appropriate
variance estimation!
• Sampling error calculation models based on
combining strata attempt to prevent this from
happening…
• …but caution is still needed. When in doubt, use
the unconditional approach!
11
Unconditional Analyses in Stata
• over(varname) option for command
– varname is a categorical variable
– Analyses (correct) will be replicated for each
level of the categorical varname
• subpop(varname) for command
– varname is a user-generated 0,1 variable
where 1 indicates membership in subclass
– applies to all svy procedures
12
Unconditional Analyses in Stata
• The Stata subpop() option always produces the
correct result.
• The same is true when using the subset() / svyby()
functions in R (see earlier examples).
• Stata examines the design distribution (by strata
and clusters) of the subpopulation. If a stratum
contains 0 subpopulation cases, that stratum and its
clusters do not contribute to the degrees of freedom,
i.e., DF = (# of clusters) - (# of strata), counting
only the contributing strata.
13
Functions of Estimates: Stata svy: mean
Mean Household Assets by Education (2012 HRS)
• gen finr = 1
• replace finr = 0 if kfinr != 1
• svyset secu [pweight = nwgthh], strata(stratum)
• svy, subpop(finr): mean h11atota, over(edcat)
Education of Head   Stata Label   ȳ_w        se(ȳ_w)    CI.95(ȳ_w)
0-11 yrs            1             $122,089   $10,595    ($100,863, $143,314)
12 yrs              2             $259,027   $9,802     ($239,390, $278,664)
13-15 yrs           3             $336,308   $17,201    ($301,849, $370,768)
16+ yrs             4             $834,141   $46,478    ($741,035, $927,247)
14
Functions of Estimates: R Example
Mean Household Assets by Education (2012 HRS)
• R Code:
ex5_15 <- svyby(~H11ATOTA, ~edcat, hrssvysub, svymean, na.rm=TRUE)
print(ex5_15)
confint(ex5_15)
15
Sampling Errors for Functions of Estimates:
Differences of Subpopulation Means
J J J 1 K
var( a j j )   a j var(ˆj )  2   a j ak  cov(ˆj ,ˆk )
ˆ 2
j 1 j 1 j 1 k  j
where : a j , ak are any chosen constants.

Example:
var(ysub1  y sub2 )  var(ysub1 )  var(ysub2 )  2cov(ysub1 , y sub2 )
where : ysub1 , y sub2 are estimates of the mean of y for
two subclasses.
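The same result can be written compactly in matrix form, which is convenient when the software stores the variances and covariances of the estimates in a single matrix. The vector a and matrix V below are notation introduced here for illustration; they do not appear in the original slides.

```latex
\[
\operatorname{var}\big(a^{\top}\hat{\theta}\big) = a^{\top} V a,
\qquad
V_{jj} = \operatorname{var}(\hat{\theta}_j), \quad
V_{jk} = \operatorname{cov}(\hat{\theta}_j, \hat{\theta}_k).
\]
% Difference of two subclass means: a = (1, -1)
\[
\begin{pmatrix} 1 & -1 \end{pmatrix}
\begin{pmatrix} V_{11} & V_{12} \\ V_{12} & V_{22} \end{pmatrix}
\begin{pmatrix} 1 \\ -1 \end{pmatrix}
= V_{11} + V_{22} - 2 V_{12}.
\]
```

Setting a = (1, -1) recovers the variance of the difference of two subclass means, with the covariance entering with a negative sign.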
16
Sampling Errors for Functions of Estimates:
Complex Sample Designs
• Sampling errors for statistics that are functions of survey
estimates require computations of variances and
covariances of estimates. These should be stored by the
software in a variance/covariance matrix.
• In complex sample designs, the covariance terms for
subpopulation estimates are often positive due to the
clustering (non-independence) in the design.
• This is true even when the subpopulations are distinct
(e.g., male and female). The covariance will be zero if the
subpopulations are defined by distinct design strata (e.g.,
Northeast vs. South Region).
17
Stata Example: Compute Subpopulation Means
. svy, vce(linearized): mean numadl, over(arthrtis)

1: arthrtis = 1
2: arthrtis = 2
------------------------------------------------------------
Mean Std. Err. [95% Conf. Interval]
------------------------------------------------------------
numadl |
1 | .9487063 .0356663 .8780096 1.019403
2 | .3940189 .0279537 .3386098 .449428
------------------------------------------------------------
18
Stata Example (2): Compute the standard error for
the difference of the subpopulation means.
. lincom [email protected] - [email protected]

( 1) [email protected] - [email protected] = 0
------------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | .5546874 .0408345 13.58 0.000 .4737463 .6356285
------------------------------------------------------------------------------
Note: Difference in the two estimates from the previous slide!
19
Stata Example: Display Variance/Covariance
Matrix for the Two Subpopulation Means
. estat vce
Covariance matrix of coefficients of mean model
| numadl
e(V) | 1 2
-------------------------------------
numadl |
1 | .00127208
2 | .00019302 .00078141
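As a quick check, the standard error that lincom reported for the difference of the two numadl means can be reproduced by hand from this variance/covariance matrix. The short Python sketch below only re-does the arithmetic with the values printed by estat vce; the variable names are illustrative.

```python
from math import sqrt

# Values copied from the estat vce output for the two arthritis subpopulations.
var_1 = 0.00127208    # variance of the mean for arthrtis = 1
var_2 = 0.00078141    # variance of the mean for arthrtis = 2
cov_12 = 0.00019302   # covariance of the two means

# var(mean_1 - mean_2) = var_1 + var_2 - 2 * cov_12
se_diff = sqrt(var_1 + var_2 - 2 * cov_12)

# For comparison: what the SE would be if the covariance were ignored.
se_ignoring_cov = sqrt(var_1 + var_2)

print(round(se_diff, 7))          # agrees with the lincom Std. Err. (about 0.0408)
print(round(se_ignoring_cov, 7))  # larger, because the positive covariance is dropped
```

Ignoring the positive covariance would overstate the standard error of the difference, which is why the full matrix is needed.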
20
Difference of Means: Stata lincom post-estimation
Total Assets by Education Level of Head: 2012 HRS
Command immediately following the svy: mean command from Slide 14:
lincom [email protected] - [email protected]

Education of Head   ȳ(0-11) - ȳ(16+)   se(ȳ(0-11) - ȳ(16+))   CI.95(ȳ(0-11) - ȳ(16+))
0-11 vs. 16+        -$712,052          $48,886                 (-$809,983, -$614,122)

Note: Difference in the two estimates from Slide 14.
21
Difference of Means: R svycontrast() function
• R code:
# covmat=TRUE keeps the covariances between groups, which svycontrast() needs
ex5_15 <- svyby(~H11ATOTA, ~edcat, hrssvysub, svymean,
                na.rm=TRUE, covmat=TRUE)
svycontrast(ex5_15, list(avg=c(.5,0,0,.5),
                         diff=c(1,0,0,-1)))
22
Stata Example: Display Variance-Covariance
Matrix for the Subpopulation Means
. estat vce
Covariance matrix of coefficients of mean model
Subpopulation   1              2              3              4
1               1.123 × 10^8   0.257 × 10^8   0.564 × 10^8   -0.587 × 10^8
2                              0.961 × 10^8   0.640 × 10^8   0.194 × 10^8
3                                             2.959 × 10^8   1.572 × 10^8
4                                                            2.160 × 10^9
23
Difference of Means: Hand Computation
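A hand computation of the 0-11 vs. 16+ difference can be sketched as follows, using the means, standard errors, and covariance reported on the earlier slides. This is an illustrative Python calculation, not Stata output: the variable names are made up here, and the variance for the 16+ group is taken as the square of its reported standard error ($46,478).

```python
from math import sqrt

# Estimated mean household assets (2012 HRS) from the svy: mean results.
mean_0_11 = 122_089
mean_16plus = 834_141

# Variances and covariance of the two estimated means.
var_0_11 = 1.123e8            # roughly 10,595 squared
var_16plus = 46_478 ** 2      # square of the reported SE, about 2.160e9
cov_0_11_16plus = -0.587e8    # covariance between groups 1 and 4

diff = mean_0_11 - mean_16plus
se_diff = sqrt(var_0_11 + var_16plus - 2 * cov_0_11_16plus)

print(diff)            # -712052, the lincom point estimate
print(round(se_diff))  # within a few dollars of the lincom SE of $48,886
```

Because the covariance between these two groups is negative, subtracting twice the covariance increases the variance of the difference relative to simply adding the two variances.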
24
