2 DHS6 - Sampling - Manual - Sept2012 - DHSM4
The MEASURE DHS program has used several methods (see footnote 1) for selecting households within
clusters including:
1) Systematic selection: From a random starting point select every nth household (see
Chapter 3 Section 3.2 for more details).
2) Systematic selection with runs: From a random starting point, select a group of sequential
households called a “run”. Several runs may be used within a cluster. Runs are selected
with systematic selection. Selecting households in runs can greatly reduce the amount of
travel within a cluster during data collection, especially in rural clusters where households
can be far apart.
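The two selection methods above can be sketched in Python as follows. This is an illustration only, not the program's Excel templates: the function names, the fractional sampling interval, and the rule that a run reaching the end of the listing wraps around to household 1 are all assumptions made for this sketch; Chapter 3 Section 3.2 describes the actual procedures.

```python
import random

def systematic_starts(n_units, n_select, rng):
    """Return n_select 1-based positions chosen systematically from 1..n_units."""
    interval = n_units / n_select        # sampling interval, may be fractional
    start = rng.random() * interval      # random start in [0, interval)
    return [min(int(start + k * interval) + 1, n_units) for k in range(n_select)]

def select_households(n_listed, n_select, rng=None):
    """Systematic selection of n_select households from a listing of n_listed."""
    rng = rng or random.Random()
    return systematic_starts(n_listed, n_select, rng)

def select_households_in_runs(n_listed, n_runs, run_length, rng=None):
    """Select n_runs runs of run_length sequential households each.

    Run starting points are chosen systematically; a run that reaches the end
    of the listing wraps around to household 1 (an assumption of this sketch).
    """
    if n_listed // n_runs < run_length:
        raise ValueError("listing too short: runs could overlap")
    rng = rng or random.Random()
    selected = []
    for start in systematic_starts(n_listed, n_runs, rng):
        selected.extend((start - 1 + j) % n_listed + 1 for j in range(run_length))
    return sorted(selected)
```

For example, `select_households(120, 25)` returns 25 distinct household numbers between 1 and 120, while `select_households_in_runs(120, 5, 5)` returns the same number of households grouped into 5 runs of 5 neighbours.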
The advantages of household selection in the central office can be summarized as:
1) It allows for a check of coverage of the household listing results before the main survey
and for the review and possible relisting of problematic clusters in advance.
3) The field work procedure is exactly replicable which provides the possibility of easy and
close supervision of the field work.
However, in cases when travelling between clusters represents a substantial cost, it is possible
to forego the step of selecting households in the central office. In such cases, the household listing
operation and the main survey can be combined into a single field operation. No essential changes are
needed in the household listing procedure or household numbering, but making a detailed sketch map
for the cluster may not be necessary because the listing team and the interviewing team are the same,
and the household interview will begin immediately after the listing, so identifying the exact selected
households during a separate visit is no longer a problem. The household selection must be done in
the field manually if portable computers are not available. Some manual selection procedures have
been developed for this purpose. Household listing and interviewing are two very different jobs, so in
surveys where listing, selection and interviewing take place in the same visit by the same staff, it may
be necessary to conduct more extensive training of field teams before the field work begins and to
supervise the teams more closely during the fieldwork. See Chapter 3 Section 3.2.2 for more details on
manual household selection.
1
The MEASURE DHS program has developed various Excel templates for household selection in the central office:
systematic selection, systematic selection with runs, self-weighting selection with and without control of sample
size and with or without runs. Once the household listing is completed, it is possible to just copy the number of
households listed in a cluster into the spreadsheet and the spreadsheet will show the selected household numbers
automatically. See Chapter 3 Section 3.2.2 for details.
Once training is complete, teams of interviewers will be assigned a list of clusters and
deployed to the field. Upon arrival in a new area, the interviewer team must first contact the local
authorities for help to identify the correct cluster and to solicit cooperation during the field work. A
team leader or supervisor is assigned for each interviewing team. The supervisor is responsible for
cluster identification and should guarantee that the correct cluster will be interviewed. After checking
the listing materials and verifying with the local authorities, the supervisor will distribute the sampled
households among the interviewers. After locating a selected household, the interviewer will begin
with a brief household interview, listing household members and visitors, and identifying among them
all eligible women and men for the individual interview. Eligible individuals are defined as those who
are in the specified age group (15-49), and are either usual members of the selected household or who
slept in the household the night before the interviewer’s visit.
In the event of failure to contact a household or an eligible person in the first visit, the
interviewer is required to make at least two repeat visits, or call backs, on different days and at
different times of the day before the interview is abandoned. The process of making call backs requires
the teams to stay in a cluster for at least two to three days. Some countries propose large interviewing
teams in order to try to cover an entire cluster in one day. This process is not acceptable for a DHS
survey, even when the designed sample size can bear a large non-response rate, because non-
response biases the survey results. A quick survey usually ends up with poor data quality. Both theory
and practice prove that call backs and efforts to get difficult units to respond to the survey are the best
way to remove bias and reduce the non-sampling errors to a minimum. For more details, refer to the
DHS Survey Organization Manual and the Interviewer’s Manual.
1.13 Sampling weight calculation
1.13.1 Why we need to weight the survey data
A DHS sample is a representative sample randomly selected from the target population. Each
interviewed unit (household and individual) represents a certain number of similar units in the target
population. In order for any statistical inferences drawn from the survey data to be valid, this
representativeness of the sample must be taken into account. In general terms, sampling weights are
used to make the sample more like the target population. All analyses should use the sampling
weights calculated for each interviewed household and for each interviewed individual.
A sampling weight is an inflation factor which extrapolates the sample to the target population.
For example, if equal probability sampling (or a self-weighting sample) is applied in a domain with a
sampling fraction 1/500, this means that each sampled individual represents 500 similar individuals in
the target population. Therefore, if we observed one particular individual having secondary education,
we would conclude that there are 500 individuals in the target population having secondary education,
corresponding to this particular individual. The total number of individuals with secondary education in
the target population would be 500 times the total number of interviewed individuals having secondary
education observed in the sample. This explanation also applies to unequal probability sampling. It is
very important that sampling weights are properly calculated and applied in data analysis; otherwise,
serious bias may be introduced, leading to incorrect conclusions.
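The inflation-factor reasoning above can be illustrated with a small Python sketch. The sample of 500 respondents, of whom 120 have secondary education, is a made-up figure for illustration, and `estimate_total` is a hypothetical helper name.

```python
from fractions import Fraction

def estimate_total(respondents):
    """Sum the weights of respondents who have the characteristic.

    respondents: iterable of (sampling_weight, has_characteristic) pairs.
    """
    return sum(weight for weight, has_it in respondents if has_it)

# Equal probability sampling with sampling fraction 1/500:
# every respondent carries the same weight of 500.
weight = 1 / Fraction(1, 500)
sample = [(weight, True)] * 120 + [(weight, False)] * 380
total_secondary = estimate_total(sample)   # 500 times the 120 observed cases

# The same function covers unequal probability sampling: each respondent
# simply contributes his or her own weight.
mixed_total = estimate_total([(500, True), (250, True), (250, False)])
```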
All of the DHS indicators are means, proportions, rates or ratios. However, because a nationwide
self-weighting sample is usually not feasible given the study domains explained in Section 1.9,
sampling weights are always necessary. Even when a survey is designed to be nationally
self-weighting, it is necessary to correct for the different response patterns across domains/strata (see
Section 1.13.4 for more details). Therefore, even surveys with self-weighting sample designs require
the use of sampling weights.
Though the effect of sampling weights on survey indicators may be small, it is necessary to
use sampling weights for the following reasons:
2) For correcting or reducing bias; weighting can reduce bias introduced by non-response or
other non-sampling errors.
3) For keeping the weighted sample distribution close to the target population distribution,
especially when oversampling is applied in certain domains/strata.
Since the DHS protocol involves no selection of eligible individuals within a sampled
household (except for the domestic violence module, in which one eligible woman is selected from a
sampled household), all eligible individuals from the same household share the same design weight,
which is the same as the household’s design weight. Therefore, the design weight is the basic weight
for DHS surveys. All other weights are calculated based on the design weight. In calculating the
sampling weight, it is possible to correct for both unit non-response (a sampling unit is not interviewed
at all) and item non-response (the sampling unit does not provide an answer for a specific question). The
policy of the MEASURE DHS program is to correct for unit non-response at the stratum level (see
Section 1.13.4) and leave the correction of item non-response to data users because it is variable
specific. Correction of unit non-response at cluster level will increase the variability of sampling
weights and therefore increase sampling errors. Because the correction for unit non-response is the
same for an entire cluster and because household selection within a cluster is an equal probability
selection, all the households in the same cluster share the same design weight and sampling weight,
and the same is true for all individuals in the same cluster. This means that the DHS weights (both
design weights and sampling weights) are cluster weights.
Let $n_h$ be the number of clusters selected in stratum h; let $M_{hi}$ be the measure of size of the
cluster used in the first-stage selection (usually the number of households residing in the cluster
according to the sampling frame); let $\sum M_{hi}$ be the total measure of size in stratum h. The
probability of selecting the ith cluster in the sample is calculated as follows:
$$P_{1hi} = \frac{n_h M_{hi}}{\sum M_{hi}}$$
Let $b_{hi}$ be the proportion of households in the selected cluster compared to the total number
of households in EA i in stratum h if the EA is segmented; otherwise $b_{hi} = 1$. Then the probability of
selecting cluster i in the sample is:
$$P_{1hi} = \frac{n_h M_{hi}}{\sum M_{hi}} \times b_{hi}$$
Let $L_{hi}$ be the number of households listed in the household listing operation in cluster i in
stratum h; let $t_{hi}$ be the number of households selected in the cluster. The second stage selection
probability for each household in the cluster is calculated as follows:
$$P_{2hi} = \frac{t_{hi}}{L_{hi}}$$
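The overall selection probability of each household in cluster i of stratum h is therefore the
product of the selection probabilities of the two stages:
$$P_{hi} = P_{1hi} \times P_{2hi}$$
The design weight of each household in cluster i of stratum h is the inverse of its overall selection
probability:
$$d_{hi} = 1 / P_{hi}$$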
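As a worked illustration of the two-stage calculation, the sketch below computes the design weight for one cluster. The function names and the example figures (20 clusters selected in a stratum of 30,000 households, a cluster of 150 frame households, 25 households selected out of 160 listed) are assumptions chosen for the example, not values from any survey.

```python
def first_stage_probability(n_h, m_hi, sum_m_h, b_hi=1.0):
    """P1: probability of selecting the cluster (b_hi < 1 only if the EA was segmented)."""
    return n_h * m_hi / sum_m_h * b_hi

def second_stage_probability(t_hi, l_hi):
    """P2: households selected divided by households listed in the cluster."""
    return t_hi / l_hi

def design_weight(n_h, m_hi, sum_m_h, t_hi, l_hi, b_hi=1.0):
    """Design weight: inverse of the overall selection probability P1 * P2."""
    p_hi = first_stage_probability(n_h, m_hi, sum_m_h, b_hi) * second_stage_probability(t_hi, l_hi)
    return 1.0 / p_hi

# P1 = 20 * 150 / 30000 = 0.1, P2 = 25 / 160 = 0.15625, so the weight is 1 / 0.015625
d_hi = design_weight(n_h=20, m_hi=150, sum_m_h=30000, t_hi=25, l_hi=160)
```

In this example each selected household represents about 64 households in the target population. Every input must come from the sample documentation, which is why missing design parameters make the calculation difficult in practice.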
The calculation of the design weight is not complicated; however, difficulties often result from
not having all the design parameters involved in the above calculation because they are not well
documented, especially when the sampling frame is a master sample. See Chapter 5 for more details
on sample documentation.
The idea of correcting for unit non-response is to calculate a response rate for each
homogeneous response group, then inflate the design weight by dividing it by the response rate for
each response group. The construction of homogeneous response groups depends on the knowledge
of the response behavior of the sampling units. DHS surveys always use the sampling stratum as the
response group because the stratification is usually achieved by regrouping homogeneous sampling
units in a single stratum. It is possible to use a cluster as a response group, but the disadvantage is
that the response rates may vary too much at the cluster level, which will increase the variability of
the sampling weight, which in turn increases the sampling variance. Furthermore, correction of
non-response at the cluster level will interfere with self-weighting if a self-weighting sample has been
designed.
By assuming that the response groups coincide with the sampling strata, the following steps
explain how to calculate the sampling weight by first calculating the various response rates for unit
non-response. Please note that the response rates calculated here are different from the response
rates calculated in Appendix A of DHS survey final reports. In Appendix A, household and individual
response rates are calculated as ratios of the number of interviewed units over the number of eligible
units because the aim is just to show the results of survey implementation. Here we use weighted
ratios because the aim is to correct the design weight to compensate for non-response, therefore the
design weight should be involved. Because a non-responding unit with a large design weight will
have a larger impact on survey estimates than a non-responding unit with a small design weight, a
weighted response rate for correction of non-response is better than an un-weighted response rate.
1. Cluster level response rate
Let $n_h$ be the number of clusters selected in stratum h; let $n_h^*$ be the number of clusters
interviewed. The cluster level response rate in stratum h is therefore
$$R_{ch} = n_h^* / n_h$$
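2. Household level response rate
Let $m_{hi}$ be the number of households found (see Chapter 2, Section 2.10 for definition) in
cluster i of stratum h; let $m_{hi}^*$ be the number of households interviewed in the cluster. The household
response rate in stratum h is calculated by
$$R_{hh} = \sum d_{hi} m_{hi}^* / \sum d_{hi} m_{hi}$$
where $d_{hi}$ is the design weight of cluster i in stratum h; the summation is over all clusters in
stratum h.
3. Individual level response rate
Let $k_{hi}$ be the number of eligible individuals found in cluster i of stratum h; let $k_{hi}^*$ be the
number of individuals interviewed in the cluster. The individual response rate in stratum h is
calculated by
$$R_{ph} = \sum d_{hi} k_{hi}^* / \sum d_{hi} k_{hi}$$
where the summation is again over all clusters in stratum h.
The household sampling weight of cluster i in stratum h is calculated by dividing the household
design weight by the product of the cluster response rate and the household response rate, for each of
the sampling strata:
$$D_{hi} = \frac{d_{hi}}{R_{ch} \times R_{hh}}$$
The individual sampling weight of cluster i in stratum h is calculated by dividing the household
sampling weight by the individual response rate, or equivalently, by dividing the household design
weight by the product of the cluster response rate, the household response rate and the individual
response rate, for each of the sampling strata:
$$W_{hi} = \frac{D_{hi}}{R_{ph}} = \frac{d_{hi}}{R_{ch} \times R_{hh} \times R_{ph}}$$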
The sampling weights for households selected for the men’s survey and for men can be
calculated similarly. We need a separate household sampling weight for the men’s survey in cases
where the men’s survey is conducted in a sub-sample of households selected for the women’s survey,
and we suppose that the response behavior of households in the men’s survey sub-sample may be
different from the overall household response rate.
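The non-response correction described in this section can be sketched as follows, assuming the sampling stratum is the response group and that all response rates are design-weighted ratios. The dictionary keys, the function name and the two-cluster example are hypothetical; the standard Excel templates remain the reference implementation.

```python
from collections import defaultdict

def sampling_weights(clusters, clusters_selected):
    """Return {cluster id: (household sampling weight, individual sampling weight)}.

    clusters: one dict per interviewed cluster with keys
        cluster, stratum, d (design weight),
        m_found, m_int (households found / interviewed),
        k_found, k_int (eligible individuals found / interviewed).
    clusters_selected: {stratum: number of clusters selected in the stratum}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for c in clusters:
        t = totals[c["stratum"]]
        t["clusters_int"] += 1
        t["hh_found"] += c["d"] * c["m_found"]
        t["hh_int"] += c["d"] * c["m_int"]
        t["ind_found"] += c["d"] * c["k_found"]
        t["ind_int"] += c["d"] * c["k_int"]

    weights = {}
    for c in clusters:
        t = totals[c["stratum"]]
        r_cluster = t["clusters_int"] / clusters_selected[c["stratum"]]
        r_household = t["hh_int"] / t["hh_found"]      # design-weighted ratio
        r_individual = t["ind_int"] / t["ind_found"]   # design-weighted ratio
        hh_weight = c["d"] / (r_cluster * r_household)
        ind_weight = hh_weight / r_individual
        weights[c["cluster"]] = (hh_weight, ind_weight)
    return weights

example = [
    {"cluster": 1, "stratum": "A", "d": 50, "m_found": 20, "m_int": 18, "k_found": 25, "k_int": 20},
    {"cluster": 2, "stratum": "A", "d": 100, "m_found": 20, "m_int": 20, "k_found": 30, "k_int": 30},
]
w = sampling_weights(example, {"A": 2})
```

Because the correction is made at the stratum level, both clusters in the example are inflated by the same factor even though only cluster 1 had non-responding households. With all selected clusters interviewed, the weighted number of interviewed households equals the design-weighted number of households found.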
If no normalization is requested, we can stop here. The above calculated household sampling
weight and individual sampling weight can be used to produce any indicators at the household level
and the individual level, respectively. As we mentioned earlier in Section 1.13.1, a sampling weight is
an inflation or extrapolation factor. The weighted sum of households interviewed
$$T = \sum \sum D_{hi} m_{hi}^*$$
is an unbiased estimate of the total number of ordinary residential households of the country, where
$D_{hi}$ is the household sampling weight and $m_{hi}^*$ is the number of households interviewed in the ith
cluster of stratum h, and the summation is over all clusters and strata in the total sample. Similarly,
the weighted sum of all interviewed women
$$W = \sum \sum W_{hi} k_{hi}^*$$
is an unbiased estimate of the total number of women in the target population (women age 15-49) of the
country, where $W_{hi}$ is the women's sampling weight and $k_{hi}^*$ is the number of women interviewed in
the ith cluster of stratum h, and the summation is over all clusters and strata in the total sample.
2
The MEASURE DHS program has developed Excel templates for facilitating standard weight calculations. If all
design parameters and the survey results (number of households found and interviewed, number of eligible women
found and interviewed, number of eligible men found and interviewed, number of eligible women and men found
and tested, by cluster) are provided in the input page, the standard weights will be calculated automatically in
different pages.
$$HV005_{hi} = D_{hi} \times FH = D_{hi} \times \sum \sum m_{hi}^* / \sum \sum D_{hi} m_{hi}^*$$
where HV005 is the household standard weight variable in the DHS Recode data files.
It is easy to see that the weighted sum of households interviewed by using the standard
weight equals the unweighted sum of households interviewed for the total sample. This condition will
not be met at the domain level or for sub-populations. At the domain level, the weighted sum of
households interviewed may be larger or smaller than the unweighted sum of households interviewed,
depending on whether the domain is undersampled or oversampled.
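The normalization factor for calculating the women’s standard weight is calculated as
$$FW = \sum \sum k_{hi}^* / \sum \sum W_{hi} k_{hi}^*$$
and the women’s standard weight for cluster i of stratum h is
$$V005_{hi} = W_{hi} \times FW$$
where V005 is the women’s standard weight variable in the DHS Recode data files.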
The standard weights for households selected for the men’s survey and for men can be
calculated in a similar way.
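The normalization step can be sketched in a few lines of Python. The cluster weights and interview counts below are invented for illustration, and the function names are hypothetical.

```python
def normalization_factor(weights, counts):
    """Unweighted number of interviews divided by the weighted number."""
    total_cases = sum(counts)
    weighted_cases = sum(w * c for w, c in zip(weights, counts))
    return total_cases / weighted_cases

def standard_weights(weights, counts):
    """Multiply every cluster sampling weight by the common normalization factor."""
    factor = normalization_factor(weights, counts)
    return [w * factor for w in weights]

# Three clusters: sampling weights and number of completed interviews in each.
cluster_weights = [100.0, 200.0, 400.0]
interviews = [20, 20, 10]
std = standard_weights(cluster_weights, interviews)
weighted_total = sum(s * c for s, c in zip(std, interviews))
```

Here the weighted total of interviews matches the unweighted total of 50, but within any single cluster (or domain) the weighted and unweighted counts differ, as noted above.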
where $WHIV_{hi}$ is the number of women eligible for HIV testing, and $WHIV_{hi}^*$ is the number of women
tested with a valid test result, in cluster i of stratum h; $MHIV_{hi}$ and $MHIV_{hi}^*$ are the number of men
eligible and the number of men tested with a valid test result, respectively, in cluster i of stratum h.
The sampling weights for HIV testing for women and men, respectively, are calculated by
$$HIV_{hi}^{W} = MD_{hi} / WR_{hi}, \quad HIV_{hi}^{M} = MD_{hi} / MR_{hi}$$
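In cluster i of stratum h, the normalized standard weights for HIV testing for women and men,
respectively, are calculated by
$$HIV05_{hi}^{W} = HIV_{hi}^{W} \times \frac{\sum \sum WHIV_{hi}^* + \sum \sum MHIV_{hi}^*}{\sum \sum HIV_{hi}^{W} \times WHIV_{hi}^* + \sum \sum HIV_{hi}^{M} \times MHIV_{hi}^*}$$
$$HIV05_{hi}^{M} = HIV_{hi}^{M} \times \frac{\sum \sum WHIV_{hi}^* + \sum \sum MHIV_{hi}^*}{\sum \sum HIV_{hi}^{W} \times WHIV_{hi}^* + \sum \sum HIV_{hi}^{M} \times MHIV_{hi}^*}$$
where the double summations are over all clusters and strata in the total sample.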
As mentioned above, if pooled data analysis is required, the standard weight variables HV005,
V005 and MV005 must be rescaled or de-normalized. The de-normalization procedure is the inverse of
the normalization procedure: that is, multiply the standard weight by the target population and divide
by the number of completed cases, for each survey. The de-normalized weights for households,
women and men (HV005*, V005*, and MV005*, respectively) can be calculated using the following
formulas:
3
http://esa.un.org/unpd/wpp/index.htm
If normalized weights are preferred, the above re-scaled weights can be re-normalized by
multiplying by the total number of completed women’s and men’s interviews combined, dividing by
the total number of weighted cases combined, and applying the above re-scaled weights to the pooled
data.
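A minimal sketch of de-normalization and optional re-normalization for pooled data is given below. The population figures and interview counts are invented for illustration, and the function names are hypothetical; in practice the target population totals come from an external source such as census projections.

```python
def denormalize(standard_weight, target_population, completed_cases):
    """Rescale a standard weight: multiply by the survey's target population
    and divide by its number of completed cases."""
    return standard_weight * target_population / completed_cases

def renormalize(pooled_weights):
    """Rescale pooled weights so that they sum to the number of pooled cases."""
    factor = len(pooled_weights) / sum(pooled_weights)
    return [w * factor for w in pooled_weights]

# Two hypothetical surveys: (women age 15-49 in the country, completed interviews).
survey_a = (5_000_000, 10_000)
survey_b = (1_000_000, 8_000)

# A woman with standard weight 1.0 in each survey.
v005_star_a = denormalize(1.0, *survey_a)
v005_star_b = denormalize(1.0, *survey_b)

# Pool two respondents from each survey and re-normalize.
pooled = renormalize([v005_star_a, v005_star_a, v005_star_b, v005_star_b])
```

After de-normalization, a respondent from the larger country carries more weight in the pooled analysis (500 versus 125 in this example), which the standard weights alone would not reflect.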
Note that the normalization of sampling weights is done for the total sample for households,
women and men separately. If the aim is to tabulate indicators for a certain sub-population from
pooled data, for example, vaccination coverage for children 12-23 months, the de-normalization has
nothing to do with the total population of children 12-23 months because there is no standard weight
calculated for children 12-23 months in DHS surveys. If the indicator is tabulated at the household
level using the household weight, the household standard weights must be de-normalized for all of the
surveys included in the analysis as explained above; likewise, if the indicator is tabulated at the
individual level using the women’s (or child’s mother’s) weight, the women’s standard weights must be
de-normalized for each of the surveys.
Let X be a multivariate auxiliary variable with p components such that the population totals of
each of the component variables are known beforehand from the recent population census, that is,
$$t_x = \sum_{i \in U} X_i = (t_{x1}, t_{x2}, \ldots, t_{xp})^{\tau}$$
is known. Let $x_i = (x_{1i}, x_{2i}, \ldots, x_{pi})^{\tau}$ be the observations of the auxiliary variables from the
survey for the respondent sampling unit i. Let $D_i$ be the sampling weight for unit i. The calibration
procedure consists of modifying the sampling weight slightly from $D_i$ to $W_i$ such that a given distance
measure between the sampling weights $D_i$ and the calibrated weights $W_i$,
$$\sum_{i \in s} g(W_i, D_i)$$
is minimized subject to the calibration constraint $\sum_{i \in s} W_i x_i = t_x$. The calibrated weights then
take the form
$$W_i = D_i F(x_i^{\tau} \lambda(s))$$
where $F(\cdot)$ is called the calibration function, which is the inverse function of the derivative of the
distance function g; $q_i$ is a calibration weight which is usually set to 1 in the absence of prior
knowledge; $\lambda(s)$ is a constant depending on the particular sample s which is to be solved. When
$$F(x_i^{\tau} \lambda(s)) = 1 + q_i x_i^{\tau} \lambda(s),$$
which corresponds to one of the five proposed calibration functions in Deville et al, 1993, it is easy to
solve: $\lambda(s)$ is given by
$$\lambda(s) = T_s^{-1} (t_x - \hat{t}_{\pi x}) \quad \text{with} \quad T_s = \sum_{i \in s} D_i q_i x_i x_i^{\tau}$$
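For a given variable of interest y, the calibrated estimator of the population total is equivalent
to the generalized regression estimator
$$\hat{t}_y = \sum_{i \in s} W_i y_i = \hat{t}_{\pi y} + \hat{B}_s^{\tau} (t_x - \hat{t}_{\pi x})$$
where $\hat{B}_s = T_s^{-1} \sum_{i \in s} q_i D_i x_i y_i$ is the sample estimation of the regression coefficient; and
$\hat{t}_{\pi y}$ and $\hat{t}_{\pi x}$ are the simple estimators using the sampling weight
$$\hat{t}_{\pi y} = \sum_{i \in s} D_i y_i, \quad \hat{t}_{\pi x} = \sum_{i \in s} D_i x_i$$
The calibrated estimator of the population mean is
$$\hat{\bar{Y}} = \frac{\sum_{i \in s} W_i y_i}{\sum_{i \in s} W_i}$$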
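For the linear calibration function, the calculation can be illustrated with a single auxiliary variable (p = 1), in which case the matrix inverse reduces to a division. The sketch below is a simplified illustration under that assumption, with invented weights and values; the function names are hypothetical.

```python
def linear_calibration(d, x, t_x, q=None):
    """Calibrate weights d to a known total t_x of one auxiliary variable x,
    using the linear calibration function F(u) = 1 + q*u."""
    q = q or [1.0] * len(d)
    t_pi_x = sum(di * xi for di, xi in zip(d, x))                 # simple estimate of t_x
    t_s = sum(di * qi * xi * xi for di, qi, xi in zip(d, q, x))   # scalar T_s
    lam = (t_x - t_pi_x) / t_s
    return [di * (1 + qi * xi * lam) for di, qi, xi in zip(d, q, x)]

def greg_total(d, x, y, t_x):
    """Generalized regression estimate of the total of y (q_i = 1 for all units)."""
    t_pi_y = sum(di * yi for di, yi in zip(d, y))
    t_pi_x = sum(di * xi for di, xi in zip(d, x))
    b_s = sum(di * xi * yi for di, xi, yi in zip(d, x, y)) / sum(di * xi * xi for di, xi in zip(d, x))
    return t_pi_y + b_s * (t_x - t_pi_x)

d = [10.0, 10.0, 20.0]    # sampling weights
x = [1.0, 0.0, 1.0]       # auxiliary variable, e.g. an indicator of a known group
y = [2.0, 3.0, 5.0]       # variable of interest
w = linear_calibration(d, x, t_x=35.0)
calibrated_total_x = sum(wi * xi for wi, xi in zip(w, x))
calibrated_total_y = sum(wi * yi for wi, yi in zip(w, y))
```

The calibrated weights reproduce the known total of x exactly, and the weighted total of y agrees with `greg_total(d, x, y, 35.0)`, illustrating the equivalence with the generalized regression estimator stated above.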
The calibration estimator can be equivalently formulated with known proportions of one or
more auxiliary variables. The calibration can be conducted at the individual level, which will result in
an individual specific weight, or it can be conducted at the cluster level with aggregated data, which
will result in a cluster weight. For more details see the related references given at the end of this
document.
DHS survey final reports usually include tables in an appendix for data quality evaluation
purposes, including: age distributions of household population by sex; age distributions of eligible and
interviewed women and men; completeness of reporting on date of birth, age at death, age/date at
first union, education and anthropometric measures, etc. The MEASURE DHS program also conducts
some in-depth studies on data quality for specific topics, which are provided in published reports.
Apart from the data quality tables, DHS survey final reports provide sampling errors for
selected indicators in Appendix B. Sampling errors are important reliability measures which tell the
user the degree of error associated with a particular estimated indicator value, the number of cases
involved in the calculation of the indicator, the efficiency or clustering effects of the sample design
compared to a simple random sampling and the range for the true value of an indicator at a certain
confidence level. The reader is referred to Chapter 4, Section 4.2 for more details on sampling errors
and their calculation.
DHS survey final reports also provide an appendix on the sample design of the survey. The
sample design document reports the survey methodology used for the survey, including the aim of the
survey, the target population, the sample size, the reporting domains, the stratification and sample
allocation, sample selection procedure, sampling weight calculation, correction for non-response,
calibration of sampling weights, and the results of survey implementation. See Chapter 5, Section 5.2
for more details on sample design.
The sample documentation must comply with the survey confidentiality requirements. When
HIV testing is conducted in a DHS or AIS (AIDS Indicator Survey), the confidentiality guidelines require
the complete destruction of all intermediate documents which can potentially be used to identify any
single household or individual who participated in the testing. This requirement reinforces the
importance of timely sample documentation. See Chapter 5 for detailed requirements in sample
documentation.
1.17 Confidentiality
The final data files for DHS surveys are made available to interested researchers. Therefore,
the confidentiality of private information collected from individual respondents is a major concern,
especially when sensitive information such as sexual activity and HIV status is collected. Protecting
the confidentiality of the individual respondent is not only an ethical obligation, but it also promotes
more accurate data because respondents are more likely to provide truthful responses if they feel
confident their information will be kept private.
DHS surveys follow strict rules imposed at various steps during the survey implementation to
prevent the direct or indirect disclosure of the identity of individual respondents. The principal pieces
of information that can indirectly identify an individual respondent are cluster number, household
number, the cluster selection probability and the sampling weights. The cluster number is an
important identifier for sampling error calculations; the household number is important for household
level and individual level data management and tabulation; the cluster selection probability is useful
for cluster level modeling; and sampling weights are necessary for all analysis. So these variables must
be present in the final data file. The household number in the final DHS data file is not informative, and
sampling weights are not informative after correction of non-response and normalization. The cluster
selection probability is potentially informative only if lower level identification information such as
district and locality are present, and DHS survey final data files do not provide geographic information
below the level of region or survey domain, especially when HIV testing is conducted. Thus the only
concern is the disclosure of the cluster. For DHS or AIS surveys with HIV testing, the final data files
provide scrambled cluster and household numbers for further insurance against disclosure.