Analysis of Complex Sample Survey Data
SURVMETH 614
e-mail: [email protected]
July 7, 2021
Estimation of Percentiles
• Analysts are often interested in estimating
more than just the mean to describe the
distribution of a variable in a population
• Example: 95th percentile of PSA levels in a
national sample of men over age 40
• Example: Median total assets (a very
skewed variable) for rural homes
• Example: Quartiles of systolic blood
pressure for African-American females
2
The Ungrouped Estimation Method
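• First, generate a weighted estimate of the
CDF for a variable. Population CDF:
$$F(x) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \le x)$$
Weighted sample estimate:
$$\hat{F}(x) = \frac{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, I(y_{h\alpha i} \le x)}{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}}$$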
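A minimal sketch of the weighted CDF estimate, assuming the data arrive as parallel lists of values and sampling weights. The function name weighted_cdf and the toy numbers are illustrative assumptions, not part of the course materials. The stratum and cluster indices are flattened here because the point estimate depends only on the weights.

```python
def weighted_cdf(values, weights, x):
    """Weighted estimate of F(x): share of the total weight with y <= x."""
    total = sum(weights)
    below = sum(w for y, w in zip(values, weights) if y <= x)
    return below / total

# Toy data (hypothetical): five respondents with unequal weights.
y = [10, 20, 30, 40, 50]
w = [1.0, 2.0, 3.0, 2.0, 2.0]

# Weight at or below 30 is 1 + 2 + 3 = 6 out of a total of 10.
print(weighted_cdf(y, w, 30))
```

The design information (strata and clusters) matters for the variance of this estimate, not for the estimate itself.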
3
The Ungrouped Estimation Method
• The q-th quantile for a variable y is the
smallest value of the variable such that the
estimated CDF is greater than or equal to
q (q = 0.25, 0.5, 0.75, etc.)
• Ungrouped method: first order sample
values from smallest to largest, x(1),…, x(n),
and then find the value of j such that
$$\hat{F}(x_{(j)}) \le q < \hat{F}(x_{(j+1)}).$$
4
The Ungrouped Estimation Method
• Then, a weighted estimate of the q-th
population quantile, X_q, is computed as
follows:
$$\hat{X}_q = x_{(j)} + \frac{q - \hat{F}(x_{(j)})}{\hat{F}(x_{(j+1)}) - \hat{F}(x_{(j)})}\,\bigl(x_{(j+1)} - x_{(j)}\bigr)$$
• Taylor Series Linearization (Binder, 1991)
approaches can be used to estimate the
variance of this estimate, but BRR
approaches are recommended (Kovar,
Rao, and Wu, 1988)
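The ungrouped point estimate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the algorithm of any particular package: the function name weighted_quantile, the handling of ties, and the boundary rule for q at or below the first CDF step are choices made here. It covers the point estimate only; variance estimation (BRR or linearization) is not attempted.

```python
def weighted_quantile(values, weights, q):
    """Ungrouped estimate of the q-th quantile: linear interpolation
    between adjacent steps of the weighted CDF."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    xs, cdf, running = [], [], 0.0
    for x, w in pairs:
        running += w
        if xs and xs[-1] == x:
            cdf[-1] = running / total      # tied values share one CDF step
        else:
            xs.append(x)
            cdf.append(running / total)
    if q <= cdf[0]:
        return xs[0]                       # boundary rule assumed for this sketch
    for j in range(len(xs) - 1):
        if cdf[j] <= q < cdf[j + 1]:       # bracketing step j
            frac = (q - cdf[j]) / (cdf[j + 1] - cdf[j])
            return xs[j] + frac * (xs[j + 1] - xs[j])
    return xs[-1]                          # reached only when q equals 1

# Toy data (hypothetical): weighted CDF steps are 0.1, 0.3, 0.6, 0.8, 1.0.
y = [10, 20, 30, 40, 50]
w = [1.0, 2.0, 3.0, 2.0, 2.0]
median = weighted_quantile(y, w, 0.5)
print(median)
```

For q = 0.5 the bracketing values are 20 and 30 (CDF 0.3 and 0.6), so the estimate lies two thirds of the way from 20 to 30.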
5
Software for Design-based
Estimation of Percentiles
• R: svyquantile() function
• WesVar: Balanced Repeated Replication
(recommended)
• SAS: Woodruff’s Method (uses TSL)
• Procedures enabling appropriate computation of
standard errors and confidence intervals (using
BRR) are not readily implemented in the current
versions of Stata and SPSS
• Stata users can install the user-written epctile
command (findit epctile, net install epctile.pkg)
6
Estimation of Percentiles:
Total Household Wealth (2012 HRS)
Stata code:
svyset secu [pweight = nwgthh], strata(stratum)
epctile H11ATOTA, percentiles(25 50 75) subpop(if nfinr==1) svy
8
“Unconditional” vs. “Conditional”
Analysis
• Cochran (1977), West et al. (2008): a “conditional”
analysis conditions on the observed subpopulation sizes as
though they were fixed. This is what results from using “if”, “by”, or
“where” statements to restrict the data being analyzed.
9
“Unconditional” vs. “Conditional”
Analysis
• “Unconditional” analysis treats stratum
subpopulation sizes as a random variable,
m(h), h=1,…,H
• Variability of the subpopulation sample size across
strata and across clusters within strata (including
m(h)=0) must be reflected in the variance estimation.
• The Stata subpop() option and the R subset() and
svyby() functions ensure that this variability is
reflected in the standard errors of estimates.
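The difference between the two approaches can be sketched numerically. The Python below is an illustration only, assuming with-replacement selection of clusters within strata and Taylor series linearization of a weighted subpopulation mean. The function name subpop_mean_var and the toy sample are hypothetical.

```python
from collections import defaultdict

def subpop_mean_var(rows, conditional):
    """Weighted subpopulation mean and its linearized variance.

    rows: (stratum, psu, weight, y, d) tuples, d = 1 for subpopulation members.
    conditional=True drops non-members first, like an "if" or "where" statement.
    conditional=False keeps every sampled PSU; non-members contribute zeros.
    """
    if conditional:
        rows = [r for r in rows if r[4] == 1]
    w_sum = sum(w * d for _, _, w, _, d in rows)
    mean = sum(w * d * y for _, _, w, y, d in rows) / w_sum

    # PSU totals of the linearized variable z = w * d * (y - mean) / w_sum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for h, a, w, y, d in rows:
        psu_totals[h][a] += w * d * (y - mean) / w_sum

    var = 0.0
    for h, psus in psu_totals.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError(f"stratum {h}: a single PSU remains, variance undefined")
        z_bar = sum(psus.values()) / n_h
        var += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in psus.values())
    return mean, var

# Toy sample (hypothetical): PSU 3 of stratum 1 has no subpopulation members.
rows = [
    (1, 1, 1.0, 10.0, 1), (1, 1, 1.0, 12.0, 0),
    (1, 2, 1.0, 14.0, 1), (1, 2, 1.0, 9.0, 1),
    (1, 3, 1.0, 11.0, 0), (1, 3, 1.0, 13.0, 0),
    (2, 1, 1.0, 20.0, 1),
    (2, 2, 1.0, 16.0, 1), (2, 2, 1.0, 18.0, 0),
    (2, 3, 1.0, 22.0, 1),
]
mean_u, var_u = subpop_mean_var(rows, conditional=False)
mean_c, var_c = subpop_mean_var(rows, conditional=True)
print(mean_u, mean_c)   # the point estimates agree
print(var_u, var_c)     # the variance estimates do not
```

In this toy sample the conditional run sees only two of the three clusters in stratum 1, so its variance estimate differs from the unconditional one even though the means are identical. If subsetting leaves a single cluster in a stratum, the sketch raises an error instead of returning a variance.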
10
“Unconditional” vs. “Conditional”
Analysis
• Incorrect conditional analyses can also result in
the case where only one sampling error
computation unit (SECU) is detected within a
stratum by the software, preventing appropriate
variance estimation!
• Sampling error calculation models based on
combining strata attempt to prevent this from
happening…
• …but caution is still needed. When in doubt, use
the unconditional approach!
11
Unconditional Analyses in Stata
• over(varname) option for command
– varname is a categorical variable
– Correct (unconditional) analyses will be replicated
for each level of the categorical varname
• subpop(varname) for command
– varname is a user-generated 0,1 variable
where 1 indicates membership in subclass
– applies to all svy procedures
12
Unconditional Analyses in Stata
• The Stata subpop() option always produces the
correct result.
Education of Head | Stata Label | $\bar{y}_w$ | $se(\bar{y}_w)$ | $CI_{.95}(\bar{y}_w)$
14
Functions of Estimates: R Example
Mean Household Assets by Education (2012 HRS)
• R Code:
ex5_15 <- svyby(~H11ATOTA, ~edcat,
hrssvysub, svymean, na.rm=TRUE)
print(ex5_15)
confint(ex5_15)
15
Sampling Errors for Functions of Estimates:
Differences of Subpopulation Means
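$$var\Bigl(\sum_{j=1}^{J} a_j \hat{\theta}_j\Bigr) = \sum_{j=1}^{J} a_j^2\, var(\hat{\theta}_j) + 2 \sum_{j=1}^{J-1} \sum_{k>j}^{J} a_j a_k\, cov(\hat{\theta}_j, \hat{\theta}_k)$$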
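As a worked special case, take J = 2 with coefficients a_1 = 1 and a_2 = -1, which gives the difference of two subpopulation means. The general result then reduces to:

```latex
\[
var(\hat{\theta}_1 - \hat{\theta}_2)
  = var(\hat{\theta}_1) + var(\hat{\theta}_2)
  - 2\, cov(\hat{\theta}_1, \hat{\theta}_2)
\]
```

A positive covariance between the two estimates, common when subpopulations share the same sample clusters, makes the difference more precise than it would be if the estimates were independent.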
18
Stata Example (2): Compute the standard error for
the difference of the subpopulation means.
------------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | .5546874 .0408345 13.58 0.000 .4737463 .6356285
------------------------------------------------------------------------------
Note: Difference in two estimates from previous slide!
19
Stata Example: Display Variance/Covariance
Matrix for the Two Subpopulation Means
. estat vce
| numadl
e(V) | 1 2
-------------------------------------
numadl |
1 | .00127208
2 | .00019302 .00078141
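The entries of e(V) are enough to compute the standard error of the difference by hand. A short Python check, using only the three values displayed by estat vce (the variable names are illustrative):

```python
import math

# Entries of e(V) as displayed by estat vce.
var_1 = 0.00127208    # variance of the mean for numadl = 1
var_2 = 0.00078141    # variance of the mean for numadl = 2
cov_12 = 0.00019302   # covariance of the two means

# Contrast coefficients for the difference of the two means.
a = (1.0, -1.0)

var_diff = a[0] ** 2 * var_1 + a[1] ** 2 * var_2 + 2 * a[0] * a[1] * cov_12
se_diff = math.sqrt(var_diff)
print(se_diff)
```

The result agrees with the standard error of .0408345 reported for the difference, up to the rounding of the displayed matrix entries. The order of subtraction does not affect the variance.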
20
Difference of Means: Stata lincom post-estimation
Total Assets by Education Level of Head: 2012 HRS
Command immediately following svy: mean command from Slide 15:
lincom [email protected] – [email protected]
Education of Head | $\bar{y}_{0-11} - \bar{y}_{16+}$ | $se(\bar{y}_{0-11} - \bar{y}_{16+})$ | $CI_{.95}(\bar{y}_{0-11} - \bar{y}_{16+})$
Note: Difference in two estimates from Slide 14.
21
Difference of Means: R svycontrast() function
Total Assets by Education Level of Head: 2012 HRS
• R code:
svycontrast(ex5_15, list(avg=c(.5,0,0,.5),
diff=c(1,0,0,-1)))
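Each coefficient vector passed to svycontrast() defines a linear combination of the four subpopulation means. The Python sketch below shows the arithmetic such a contrast implies (estimate a'θ, variance a'Va). It is not the survey package's implementation, and the means and covariance matrix are hypothetical numbers chosen for illustration, not the HRS estimates.

```python
import math

def contrast(coefs, estimates, vcov):
    """Return (estimate, standard error) of sum_j coefs[j] * estimates[j]."""
    est = sum(c * m for c, m in zip(coefs, estimates))
    k = len(coefs)
    var = sum(coefs[i] * coefs[j] * vcov[i][j] for i in range(k) for j in range(k))
    return est, math.sqrt(var)

# Hypothetical means and covariance matrix for four education groups
# (illustrative numbers only).
means = [100.0, 200.0, 300.0, 500.0]
vcov = [
    [16.0, 1.0, 0.0, 2.0],
    [1.0, 9.0, 1.0, 0.0],
    [0.0, 1.0, 25.0, 3.0],
    [2.0, 0.0, 3.0, 36.0],
]

# Same coefficient vectors as in the svycontrast() call.
contrasts = {"avg": [0.5, 0, 0, 0.5], "diff": [1, 0, 0, -1]}
results = {name: contrast(c, means, vcov) for name, c in contrasts.items()}
for name, (est, se) in results.items():
    print(name, est, se)
```

The zero coefficients drop groups 2 and 3 from both contrasts, so only the variances of groups 1 and 4 and their covariance enter the result.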
22
Stata Example: Display Variance-Covariance
Matrix for the Subpopulation Means
. estat vce
Subpopulation 1 2 3 4
4 2.160 × 10^10
23
Difference of Means: Hand Computation
Total Assets by Education Level of Head: 2012 HRS
24