2 DHS6 - Sampling - Manual - Sept2012 - DHSM4


1.11 Household selection in the central office


After the household listing operation, once the central office receives the completed listing
materials for a cluster, it must first assign a serial number to each occupied residential
household, beginning with 1 and continuing to the total number of occupied residential households
listed in the cluster. Occupied residential households are those occupied at the time of the
listing, even if the occupant refused to cooperate at the time of listing, as well as those
where the occupants were absent at the time of listing but neighbors confirmed that they
would not be absent for a long period and would be at home during the period of the main survey.
Only occupied residential households should be numbered. This serial number serves as the ID number
of the household, and the household selection procedure is based on it.
Whether or not a household is considered occupied at the time of the listing is very important because
it affects the proportion of vacant households found in the main survey.

The MEASURE DHS program has used several methods [1] for selecting households within
clusters including:

1) Systematic selection: From a random starting point select every nth household (see
Chapter 3 Section 3.2 for more details).

2) Systematic selection with runs: From a random starting point, select a group of sequential
households called a “run”. Several runs may be used within a cluster. Runs are selected
with systematic selection. Selecting households in runs can greatly reduce the amount of
travel within cluster during data collection, especially in rural clusters where households
can be far apart.
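
Systematic selection (method 1 above) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the use of a fractional sampling interval, and the example cluster size are hypothetical, and the authoritative procedure is the one described in Chapter 3 Section 3.2.

```python
import math
import random

def systematic_selection(n_listed, n_select, rng=None):
    """Select n_select serial numbers from 1..n_listed by systematic sampling.

    Assumption: a fractional interval n_listed / n_select is used so that
    exactly n_select households are drawn even when the ratio is not an integer.
    """
    if not 0 < n_select <= n_listed:
        raise ValueError("need 0 < n_select <= n_listed")
    rng = rng or random.Random()
    interval = n_listed / n_select
    start = rng.random() * interval  # random start in [0, interval)
    # Map each point start + j * interval to a serial number in 1..n_listed.
    return [min(math.floor(start + j * interval) + 1, n_listed)
            for j in range(n_select)]

# Hypothetical cluster: 120 occupied households listed, 25 to be selected.
selected = systematic_selection(n_listed=120, n_select=25, rng=random.Random(2012))
print(selected)
```

Because the interval is at least 1, the selected serial numbers are distinct and increasing, so they can be matched directly against the numbered household list.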

The advantages of household selection in the central office can be summarized as:

1) It allows for a check of coverage of the household listing results before the main survey
and for the review and possible relisting of problematic clusters in advance.

2) Sampled households are pre-determined, which prevents potential bias introduced by
allowing the interviewers to select in the field which households are to be interviewed.

3) The field work procedure is exactly replicable which provides the possibility of easy and
close supervision of the field work.

4) It is easier to control the work load for each interviewing team.

However, in cases when travelling between clusters represents a substantial cost, it is possible
to forego the step of selecting households in the central office. In such cases, the household listing
operation and the main survey can be combined into a single field operation. No essential changes are
needed in the household listing procedure or household numbering, but making a detailed sketch map
for the cluster may not be necessary because the listing team and the interviewing team are the same,
and the household interview will begin immediately after the listing, so identifying the exact selected
households during a separate visit is no longer a problem. The household selection must be done in
the field manually if portable computers are not available. Some manual selection procedures have
been developed for this purpose. Household listing and interviewing are two very different jobs, so in
surveys where listing, selection and interviewing take place in the same visit by the same staff, it may
be necessary to conduct more extensive training of field teams before the field work begins and to
supervise the teams more closely during the fieldwork. See Chapter 3 Section 3.2.2 for more details on
manual household selection.

[1] The MEASURE DHS program has developed various Excel templates for household selection in the central office:
systematic selection, systematic selection with runs, self-weighting selection with and without control of sample
size and with or without runs. Once the household listing is completed, it is possible to just copy the number of
households listed in a cluster into the spreadsheet and the spreadsheet will show the selected household numbers
automatically. See Chapter 3 Section 3.2.2 for details.

1.12 Household interviews


The household interview procedure is out of the scope of this manual since it is explained in
detail in the interviewer’s manual. This section will briefly discuss the main statistical points of the
household interview. After the household selection, interviewers will be recruited and trained for the
household and individual interviews. Interviewer training is intensive, lasting at
least four weeks for a standard DHS survey, and longer if the survey includes many biomarkers. Prior
to the training, a pretest of the questionnaire will be conducted in a small number of clusters not
selected for the main survey to assess the quality of the questionnaires and the understanding of the
translations by interviewers and respondents. Problems and potential errors observed in the pretest
will be addressed and resolved prior to fieldwork training. Finally, the interviewing team will be sent to
selected clusters with a certain work load per team.

Once training is complete, teams of interviewers will be assigned a list of clusters and
deployed to the field. Upon arrival in a new area, the interviewer team must first contact the local
authorities for help to identify the correct cluster and to solicit cooperation during the field work. A
team leader or supervisor is assigned for each interviewing team. The supervisor is responsible for
cluster identification and should guarantee that the correct cluster will be interviewed. After checking
the listing materials and verifying with the local authorities, the supervisor will distribute the sampled
households among the interviewers. After locating a selected household, the interviewer will begin
with a brief household interview, listing household members and visitors, and identifying among them
all eligible women and men for the individual interview. Eligible individuals are defined as those who
are in the specified age group (15-49) and who are either usual members of the selected household or
slept in the household the night before the interviewer’s visit.

Conscious omission of eligible individuals on the part of an interviewer by mis-reporting their
age outside of the eligible age group is a real concern. Measures to eliminate this problem should be
undertaken. For example, the field editor should check the consistency of each completed
questionnaire and, if anything suspicious is identified, should return to the household for further
verification of key items such as the number of household members, number of eligible individuals and
number of children under age five.

In the event of failure to contact a household or an eligible person in the first visit, the
interviewer is required to make at least two repeat visits, or call backs, on different days and at
different times of the day before the interview is abandoned. The process of making call backs requires
the teams to stay in a cluster for at least two to three days. Some countries propose large interviewing
teams in order to try to cover an entire cluster in one day. This process is not acceptable for a DHS
survey, even when the designed sample size can bear a large non-response rate, because non-
response biases the survey results. A quick survey usually ends up with poor data quality. Both theory
and practice prove that call backs and efforts to get difficult units to respond to the survey are the best
way to remove bias and reduce the non-sampling errors to a minimum. For more details, refer to the
DHS Survey Organization Manual and the Interviewer’s Manual.

1.13 Sampling weight calculation
1.13.1 Why we need to weight the survey data
A DHS sample is a representative sample randomly selected from the target population. Each
interviewed unit (household and individual) represents a certain number of similar units in the target
population. In order for any statistical inferences drawn from the survey data to be valid, this
representativeness of the sample must be taken into account. In general terms, sampling weights are
used to make the sample more like the target population. All analyses should use the sampling
weights calculated for each interviewed household and for each interviewed individual.

A sampling weight is an inflation factor which extrapolates the sample to the target population.
For example, if equal probability sampling (or a self-weighting sample) is applied in a domain with a
sampling fraction 1/500, this means that each sampled individual represents 500 similar individuals in
the target population. Therefore, if we observed one particular individual having secondary education,
we would conclude that there are 500 individuals in the target population having secondary education,
corresponding to this particular individual. The total number of individuals with secondary education in
the target population would be 500 times the total number of interviewed individuals having secondary
education observed in the sample. This explanation also applies to unequal probability sampling. It is
very important that sampling weights are properly calculated and applied in data analysis; otherwise,
serious bias may be introduced, leading to incorrect conclusions.
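
The extrapolation described above can be sketched numerically. The respondent counts and the second domain's sampling fraction of 1/200 are hypothetical; the sketch only illustrates that each respondent contributes the inverse of their sampling fraction to the estimated total.

```python
from fractions import Fraction

def estimate_total(records):
    """Weighted total of units having a characteristic.

    Each record is (sampling_fraction, has_characteristic); the weight of a
    record is the inverse of its sampling fraction.
    """
    return sum(1 / fraction for fraction, flag in records if flag)

# Equal probability sampling with fraction 1/500: 240 hypothetical
# respondents with secondary education and 760 without.
equal = [(Fraction(1, 500), True)] * 240 + [(Fraction(1, 500), False)] * 760
print(estimate_total(equal))   # 500 * 240 individuals

# Unequal probabilities: a second, hypothetical domain sampled at 1/200.
mixed = [(Fraction(1, 500), True), (Fraction(1, 200), True), (Fraction(1, 200), False)]
print(estimate_total(mixed))   # 500 + 200 individuals
```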

Although all of the DHS indicators are means, proportions, rates or ratios, sampling weights are
always necessary because a nationwide self-weighting sample is usually not feasible given the
study domains explained in Section 1.9. Even when a survey is designed to be nationally
self-weighting, it is necessary to correct for the different response patterns across domains/strata (see
Section 1.13.4 for more details). Therefore, even surveys with self-weighting sample designs require
the use of sampling weights.

Though the effect of sampling weights on survey indicators may be small, it is necessary to
use sampling weights for the following reasons:

1) For valid statistical inference.

2) For correcting or reducing bias; weighting can reduce bias introduced by non-response or
other non-sampling errors.

3) For keeping the weighted sample distribution close to the target population distribution,
especially when oversampling is applied in certain domains/strata.

1.13.2 Design weights and sampling weights


The MEASURE DHS program calculates both design weights and sampling weights (or survey
weights) for both households and individuals. The design weight of a sampling unit (household or
individual) is the inverse of the overall probability with which the unit was selected in the sample. The
sampling weight of a sampling unit is the design weight corrected for non-response or other
calibrations.

Since the DHS protocol involves no selection of eligible individuals within a sampled
household (except for the domestic violence module, in which one eligible woman is selected from a
sampled household), all eligible individuals from the same household share the same design weight,
which is the same as the household’s design weight. Therefore, the design weight is the basic weight
for DHS surveys. All other weights are calculated based on the design weight. In calculating the
sampling weight, it is possible to correct for both unit non-response (a sampling unit is not interviewed
at all) and item non-response (the sampling unit does not provide an answer for a specific question). The
policy of the MEASURE DHS program is to correct for unit non-response at the stratum level (see
Section 1.13.4) and leave the correction of item non-response to data users because it is variable
specific. Correction of unit non-response at cluster level will increase the variability of sampling
weights and therefore increase sampling errors. Because the correction for unit non-response is the
same for an entire cluster and because household selection within a cluster is an equal probability
selection, all the households in the same cluster share the same design weight and sampling weight,
and the same is true for all individuals in the same cluster. This means that the DHS weights (both
design weights and sampling weights) are cluster weights.

1.13.3 How to calculate the design weights


Assuming that a DHS survey sample is drawn with two-stage, stratified cluster sampling,
design weights will be calculated based on the separate sampling probabilities for each sampling stage
and for each cluster. We use the following notations:

$P_{1hi}$: first-stage sampling probability of the ith cluster in stratum h

$P_{2hi}$: second-stage sampling probability within the ith cluster (household selection)

Let $n_h$ be the number of clusters selected in stratum h; let $M_{hi}$ be the measure of size of the
cluster used in the first-stage selection (usually the number of households residing in the cluster
according to the sampling frame); let $\sum_i M_{hi}$ be the total measure of size in stratum h. The
probability of selecting the ith cluster in the sample is calculated as follows:

$$P_{1hi} = \frac{n_h M_{hi}}{\sum_i M_{hi}}$$

Let $b_{hi}$ be the proportion of households in the selected cluster compared to the total number
of households in EA i in stratum h if the EA is segmented; otherwise $b_{hi} = 1$. Then the probability of
selecting cluster i in the sample is:

$$P_{1hi} = \frac{n_h M_{hi}}{\sum_i M_{hi}} \times b_{hi}$$

Let $L_{hi}$ be the number of households listed in the household listing operation in cluster i in
stratum h; let $t_{hi}$ be the number of households selected in the cluster. The second stage selection
probability for each household in the cluster is calculated as follows:

$$P_{2hi} = \frac{t_{hi}}{L_{hi}}$$
The overall selection probability of each household in cluster i of stratum h is therefore the
product of the selection probabilities of the two stages:

$$P_{hi} = P_{1hi} \times P_{2hi}$$


The design weight for each household in cluster i of stratum h is the inverse of its overall
selection probability:

$$d_{hi} = 1 / P_{hi}$$

The calculation of the design weight is not complicated; however, difficulties often result from
not having all of the design parameters involved in the above calculation because they are not well
documented, especially when the sampling frame is a master sample. See Chapter 5 for more details
on sample documentation.
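
The design weight calculation of this section can be sketched as a short Python function. The function name, argument names and the example figures are hypothetical; the arithmetic follows the two-stage probabilities defined above (first-stage probability, optional segmentation proportion, second-stage probability).

```python
def design_weight(n_h, M_hi, M_h_total, t_hi, L_hi, b_hi=1.0):
    """Household design weight for cluster i of stratum h.

    n_h       clusters selected in the stratum
    M_hi      measure of size of the cluster in the sampling frame
    M_h_total total measure of size of the stratum
    t_hi      households selected in the cluster
    L_hi      households listed in the cluster
    b_hi      proportion kept after segmentation (1 if the EA is not segmented)
    """
    p1 = n_h * M_hi / M_h_total * b_hi   # first-stage probability
    p2 = t_hi / L_hi                     # second-stage probability
    return 1.0 / (p1 * p2)               # inverse of the overall probability

# Hypothetical stratum: 20 clusters selected out of 30,000 frame households;
# the cluster has 150 frame households, 160 listed, 25 selected.
d = design_weight(n_h=20, M_hi=150, M_h_total=30000, t_hi=25, L_hi=160)
d_segmented = design_weight(20, 150, 30000, 25, 160, b_hi=0.5)
print(d, d_segmented)
```

In this example the overall selection probability is 0.1 × 25/160, so each selected household represents about 64 households; if only half of the EA had been retained through segmentation, the weight would double.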

1.13.4 Correction of unit non-response and calculation of sampling weights


The design weight calculated above is based on sample design parameters. If there is no non-
response at the cluster level, at the household level, or at the individual level, the design weight is
enough for all analyses, for both household indicators and individual indicators. However, non-
response is inevitable in all surveys, and different units have different response behaviors. The
experience of the MEASURE DHS program shows that urban households are less likely to respond to
the survey than their counterparts in rural areas, households in developed regions are less likely to
respond to the survey than their counterparts in less-developed regions, rich households are less likely
to respond to the survey than poor households, individuals with higher levels of education are less
likely to respond to the survey than those with lower levels of education, men are less likely to respond
to the survey than women, and so forth.

The idea of correcting for unit non-response is to calculate a response rate for each
homogeneous response group, then inflate the design weight by dividing it by the response rate for
each response group. The construction of homogeneous response groups depends on the knowledge
of the response behavior of the sampling units. DHS surveys always use the sampling stratum as the
response group because the stratification is usually achieved by regrouping homogeneous sampling
units in a single stratum. It is possible to use a cluster as a response group, but the disadvantage is
that the response rates may vary too much at the cluster level, which will increase the variability of
the sampling weight and in turn the sampling variance. Furthermore, correction of
non-response at the cluster level will interfere with self-weighting if a self-weighting sample has been
designed.

By assuming that the response groups coincide with the sampling strata, the following steps
explain how to calculate the sampling weight by first calculating the various response rates for unit
non-response. Please note that the response rates calculated here are different from the response
rates calculated in Appendix A of DHS survey final reports. In Appendix A, household and individual
response rates are calculated as ratios of the number of interviewed units over the number of eligible
units because the aim is just to show the results of survey implementation. Here we use weighted
ratios because the aim is to correct the design weight to compensate for non-response, so the
design weight should be involved. Because a non-responding unit with a large design weight will
have a larger impact on survey estimates than a non-responding unit with a small design weight, a
weighted response rate for correction of non-response is better than an un-weighted response rate.

1. Cluster level response rate

Let $n_h$ be the number of clusters selected in stratum h; let $n_h^*$ be the number of clusters
interviewed. The cluster level response rate in stratum h is therefore

$$R_{ch} = n_h^* / n_h$$

2. Household level response rate

Let $m_{hi}$ be the number of households found (see Chapter 2, Section 2.10 for definition) in
cluster i of stratum h; let $m_{hi}^*$ be the number of households interviewed in the cluster. The household
response rate in stratum h is calculated by

$$R_{hh} = \sum d_{hi} m_{hi}^* / \sum d_{hi} m_{hi}$$

where $d_{hi}$ is the design weight of cluster i in stratum h; the summation is over all clusters in
stratum h.

3. Individual response rate


Let $k_{hi}$ be the number of eligible individuals found in cluster i of stratum h; let $k_{hi}^*$ be the
number of individuals interviewed. The individual response rate in stratum h is calculated as

$$R_{ph} = \sum d_{hi} k_{hi}^* / \sum d_{hi} k_{hi}$$

where $d_{hi}$ is the design weight of cluster i in stratum h; the summation is over all clusters in
stratum h.

The household sampling weight of cluster i in stratum h is calculated by dividing the household
design weight by the product of the cluster response rate and the household response rate of
the sampling stratum:

$$D_{hi} = d_{hi} / (R_{ch} \times R_{hh})$$
for cluster i of stratum h.

The individual sampling weight of cluster i in stratum h is calculated by dividing the household
sampling weight by the individual response rate, or equivalently, by dividing the household design
weight by the product of the cluster response rate, the household response rate and the individual
response rate, for each of the sampling strata:

$$W_{hi} = D_{hi} / R_{ph} = d_{hi} / (R_{ch} \times R_{hh} \times R_{ph})$$
for cluster i of stratum h.


It is easy to see that the difference between the household sampling weights and the
individual sampling weights is introduced by individual non-response.
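
The three response rates and the resulting sampling weights can be sketched for a single stratum as follows. The data layout (a list of dictionaries, one per interviewed cluster), the key names and all figures are hypothetical; the calculation follows the design-weighted ratios defined in this section.

```python
def stratum_weights(n_selected, clusters):
    """Sampling weights for the interviewed clusters of one stratum.

    Each cluster is a dict with the design weight d, households found and
    interviewed (m, m_int), and eligible individuals found and interviewed
    (k, k_int). Returns the three response rates and a list of (D, W) pairs.
    """
    r_c = len(clusters) / n_selected                      # cluster level
    r_hh = (sum(c["d"] * c["m_int"] for c in clusters)    # household level
            / sum(c["d"] * c["m"] for c in clusters))
    r_p = (sum(c["d"] * c["k_int"] for c in clusters)     # individual level
           / sum(c["d"] * c["k"] for c in clusters))
    weights = []
    for c in clusters:
        D = c["d"] / (r_c * r_hh)   # household sampling weight
        W = D / r_p                 # individual sampling weight
        weights.append((D, W))
    return r_c, r_hh, r_p, weights

# Hypothetical stratum: 2 clusters selected, both interviewed.
clusters = [
    {"d": 100.0, "m": 20, "m_int": 18, "k": 25, "k_int": 24},
    {"d": 50.0, "m": 20, "m_int": 20, "k": 30, "k_int": 27},
]
r_c, r_hh, r_p, weights = stratum_weights(n_selected=2, clusters=clusters)
print(r_c, r_hh, r_p, weights)
```

A useful check on such a calculation: when every selected cluster is interviewed, the weighted number of interviewed households (using D) equals the design-weighted number of households found in the stratum.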

The sampling weights for households selected for the men’s survey and for men can be
calculated similarly. We need a separate household sampling weight for the men’s survey in cases
where the men’s survey is conducted in a sub-sample of households selected for the women’s survey,
and we suppose that the response behavior of households in the men’s survey sub-sample may be
different from the overall household response rate.

If no normalization is requested, we can stop here. The above calculated household sampling
weight and individual sampling weight can be used to produce any indicators at the household level
and the individual level, respectively. As we mentioned earlier in Section 1.13.1, a sampling weight is
an inflation or extrapolation factor. The weighted sum of households interviewed

$$T = \sum\sum D_{hi} m_{hi}^*$$

is an unbiased estimate of the total number of ordinary residential households of the country, where
$m_{hi}^*$ is the number of households interviewed in the ith cluster of stratum h, and the summation is over
all clusters and strata in the total sample. Similarly, the weighted sum of all interviewed women

$$W = \sum\sum W_{hi} k_{hi}^*$$

is an unbiased estimate of the total women in the target population (women age 15-49) of the country,
where $k_{hi}^*$ is the number of women interviewed in the ith cluster of stratum h, and the summation is
over all clusters and strata in the total sample.

1.13.5 Normalization of sampling weights


Normalization of sampling weights is not necessary for survey data analysis. In order to
prevent large numbers for the number of weighted cases in the tables in DHS survey final reports, it is
the MEASURE DHS tradition to calculate normalized standard weights for both households and
individuals. With the normalized standard weight, the number of unweighted cases coincides with the
number of weighted cases at the national level for both total households and total individuals. The
normalized standard weight of a sampling unit is calculated based on its sampling weight, by
multiplying the sampling weight with a unique constant at the national level. The constant or the
normalization factor is the total number of completed cases divided by the total number of weighted
cases (based on the sampling weight). This number is equal to the estimated total sampling fraction
because the total number of weighted cases with the sampling weight is an estimation of the total
target population. Therefore the standard weights in the DHS data files are relative weights. Relative
weights can be used to estimate means, proportions, rates and ratios because the normalization factor
is cancelled out when used in both numerator and denominator, so it has no effect on the calculated
indicator values. This point also explains why the normalization must be done at the national level and
not the regional level: at the regional level, the normalization factor cannot be cancelled out, and bias
will be introduced in the calculated indicator values. Because the normalized standard weights have no
scale, they are not valid for estimating totals. Also the normalized weight is not valid for pooled data,
even for data pooled for women and men in the same survey, because the normalization factor is
country and sex specific.

1. Normalized household standard weight [2]

The normalization factor for calculating household standard weight is calculated as

$$F_H = \sum\sum m_{hi}^* / \sum\sum D_{hi} m_{hi}^*$$

The household standard weight for cluster i in stratum h is calculated by

$$\mathrm{HV005}_{hi} = D_{hi} \times F_H = D_{hi} \times \sum\sum m_{hi}^* / \sum\sum D_{hi} m_{hi}^*$$

where HV005 is the household standard weight variable in the DHS Recode data files.

[2] The MEASURE DHS program has developed Excel templates for facilitating standard weight calculations. If all
design parameters and the survey results (number of households found and interviewed, number of eligible women
found and interviewed, number of eligible men found and interviewed, number of eligible women and men found
and tested, by cluster) are provided in the input page, the standard weights will be calculated automatically in
different pages.

It is easy to see that the weighted sum of households interviewed by using the standard
weight equals the unweighted sum of households interviewed for the total sample. This condition will
not be met at the domain level or for sub-populations. At the domain level, the weighted sum of
households interviewed may be larger or smaller than the unweighted sum of households interviewed,
depending on whether the domain is undersampled or oversampled.

2. Normalized women’s standard weight

The normalization factor for calculating the women’s standard weight is calculated as

$$F_W = \sum\sum k_{hi}^* / \sum\sum W_{hi} k_{hi}^*$$

The women’s standard weight for cluster i in stratum h is calculated by

$$\mathrm{V005}_{hi} = W_{hi} \times F_W = W_{hi} \times \sum\sum k_{hi}^* / \sum\sum W_{hi} k_{hi}^*$$

where V005 is the women’s standard weight variable in the DHS Recode data files.

The standard weights for households selected for the men’s survey and for men can be
calculated in a similar way.
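
The normalization step can be sketched as follows. The function name and the three-cluster example are hypothetical; the same function applies to households (with D and interviewed households) or to women (with W and interviewed women).

```python
def standard_weights(sampling_weights, completed):
    """Normalize cluster sampling weights at the national level.

    sampling_weights: one sampling weight per cluster (all strata pooled)
    completed:        number of completed interviews in each cluster
    Returns the normalization factor and the standard weights.
    """
    total_completed = sum(completed)
    total_weighted = sum(w * n for w, n in zip(sampling_weights, completed))
    factor = total_completed / total_weighted
    return factor, [w * factor for w in sampling_weights]

def weighted_mean(weights, counts, values):
    """Weighted mean of cluster-level values, weighting by weight * count."""
    num = sum(w * n * v for w, n, v in zip(weights, counts, values))
    den = sum(w * n for w, n in zip(weights, counts))
    return num / den

D = [120.0, 60.0, 300.0]      # hypothetical household sampling weights
m_int = [18, 20, 22]          # households interviewed per cluster
F_H, hv005 = standard_weights(D, m_int)

# Weighted cases with the standard weight equal the unweighted cases.
weighted_cases = sum(w * n for w, n in zip(hv005, m_int))

# A proportion is unchanged because the factor cancels out.
share = [0.5, 0.2, 0.1]       # hypothetical cluster-level proportions
p_sampling = weighted_mean(D, m_int, share)
p_standard = weighted_mean(hv005, m_int, share)
print(F_H, weighted_cases, p_sampling, p_standard)
```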

1.13.6 Standard weights for HIV testing


The sampling weights for HIV testing are calculated separately for women and men, but they
are calculated using the same methodology. The only difference is in the calculation of the
normalization factors, if a normalized weight is requested. In order to calculate the weighted HIV
prevalence for women and men together using a normalized weight, the standard weight for HIV
testing must be normalized for women and men together. In most DHS surveys, HIV testing is
conducted in the same subsample of households selected for men’s survey, and every woman or man
in the household who is eligible for the individual interview is eligible for HIV testing. Once the
household sampling weight for the men’s survey is calculated using the procedures stated in Section
1.13.5, the sampling weights for HIV testing for women and men may be calculated separately by
correcting the household sampling weight for the non-response rates of women and men for HIV
testing, respectively. For simplicity, let $\mathrm{MD}_{hi}$ be the household sampling weight in cluster i of stratum h
for the men’s survey sub-sample. The response rates to HIV testing for women and men are calculated
respectively by

$$\mathrm{WR}_{hi} = \sum \mathrm{MD}_{hi} \mathrm{WHIV}_{hi}^* / \sum \mathrm{MD}_{hi} \mathrm{WHIV}_{hi}$$

$$\mathrm{MR}_{hi} = \sum \mathrm{MD}_{hi} \mathrm{MHIV}_{hi}^* / \sum \mathrm{MD}_{hi} \mathrm{MHIV}_{hi}$$

where $\mathrm{WHIV}_{hi}$ is the number of women eligible for HIV testing, and $\mathrm{WHIV}_{hi}^*$ is the number of women
tested with a valid test result, in cluster i of stratum h; $\mathrm{MHIV}_{hi}$ and $\mathrm{MHIV}_{hi}^*$ are the number of men
eligible and the number of men tested with a valid test result, respectively, in cluster i of stratum h.

The sampling weights for HIV testing for women and men, respectively, are calculated by

$$\mathrm{HIV}^{W}_{hi} = \mathrm{MD}_{hi} / \mathrm{WR}_{hi}, \qquad \mathrm{HIV}^{M}_{hi} = \mathrm{MD}_{hi} / \mathrm{MR}_{hi}$$

In cluster i of stratum h, the normalized standard weights for HIV testing for women and men,
respectively, are calculated by

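$$\mathrm{HIV05}^{W}_{hi} = \mathrm{HIV}^{W}_{hi} \times \frac{\sum\sum \mathrm{WHIV}_{hi}^* + \sum\sum \mathrm{MHIV}_{hi}^*}{\sum\sum \mathrm{HIV}^{W}_{hi} \times \mathrm{WHIV}_{hi}^* + \sum\sum \mathrm{HIV}^{M}_{hi} \times \mathrm{MHIV}_{hi}^*}$$

$$\mathrm{HIV05}^{M}_{hi} = \mathrm{HIV}^{M}_{hi} \times \frac{\sum\sum \mathrm{WHIV}_{hi}^* + \sum\sum \mathrm{MHIV}_{hi}^*}{\sum\sum \mathrm{HIV}^{W}_{hi} \times \mathrm{WHIV}_{hi}^* + \sum\sum \mathrm{HIV}^{M}_{hi} \times \mathrm{MHIV}_{hi}^*}$$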

where the double summations are over all clusters and strata in the total sample.
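
The joint normalization for women and men can be sketched as follows. The function name and the two-cluster figures are hypothetical; the point illustrated is that a single factor is shared by both sexes, so that the combined weighted number of tested cases equals the combined unweighted number.

```python
def hiv_standard_weights(hiv_w, hiv_m, tested_w, tested_m):
    """Normalize HIV sampling weights for women and men together.

    hiv_w, hiv_m:       HIV sampling weights per cluster for women and men
    tested_w, tested_m: women and men tested with a valid result per cluster
    """
    completed = sum(tested_w) + sum(tested_m)
    weighted = (sum(w * n for w, n in zip(hiv_w, tested_w))
                + sum(w * n for w, n in zip(hiv_m, tested_m)))
    factor = completed / weighted   # one factor shared by both sexes
    return [w * factor for w in hiv_w], [w * factor for w in hiv_m]

# Hypothetical two-cluster example.
hiv_w, hiv_m = [110.0, 70.0], [125.0, 80.0]
tested_w, tested_m = [20, 25], [15, 18]
hiv05_w, hiv05_m = hiv_standard_weights(hiv_w, hiv_m, tested_w, tested_m)

combined = (sum(w * n for w, n in zip(hiv05_w, tested_w))
            + sum(w * n for w, n in zip(hiv05_m, tested_m)))
print(combined)
```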

1.13.7 De-normalization of standard weights for pooled data


For all of the DHS data, the weight variables HV005 (household standard weight), V005
(women’s standard weight) and MV005 (men’s standard weight) are relative weights which are
normalized so that the total number of weighted cases is equal to the total number of unweighted
cases, for the three kinds of units. In some situations, such as analyses involving data from more than
one survey, data users may need the un-normalized sampling weight for analyzing pooled data. As
mentioned in Section 1.13.5, since normalization is country specific and sex specific, it is necessary to
de-normalize the standard weights provided in the DHS Recode data files for analyzing pooled data.

The normalization procedure consists of multiplying the sampling weight by a normalization
factor for the total sample. The normalization factor is the estimated total sampling fraction: the
number of completed cases divided by the number of weighted cases by using the sampling weight,
for each kind of sampling unit. The weighted number of cases with sampling weight is an estimation of
the total target population. Therefore, in order to de-normalize a normalized weight, simply divide the
normalized weight by the total sampling fraction. The estimated total sampling fraction is usually not
provided in the DHS data file or in the final report. In order to calculate the total sampling fraction, it is
necessary to know the total target population at the time of the survey. The total target population at
the time of the survey is easy to get from various sources. The country’s statistical office, the United
Nations Population Division’s (UNPD) World Population Prospects [3], and the United Nations Population
Fund (UNFPA) are three sources that may be easy to access.

As mentioned above, if pooled data analysis is required, the standard weight variables HV005,
V005 and MV005 must be rescaled or de-normalized. The de-normalization procedure is the inverse of
the normalization procedure: that is, multiply the standard weight by the target population and divide
by the number of completed cases, for each survey. The de-normalized weights for households,
women and men (HV005*, V005*, and MV005*, respectively) can be calculated using the following
formulas:

HV005* = HV005 × (total number of residential households in the country) /
(total number of households interviewed in the survey)

V005* = V005 × (total female population 15-49 in the country) /
(total number of women 15-49 interviewed in the survey)

MV005* = MV005 × (total male population 15-49 (15-59) in the country) /
(total number of men 15-49 (15-59) interviewed in the survey)
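
The de-normalization can be sketched for two pooled surveys as follows. The survey names, population totals, interview counts and the example weight are hypothetical, and the sketch assumes that any scaling applied to the weight variable in the data file has already been removed.

```python
def denormalize(standard_weight, target_population, completed_cases):
    """Rescale a standard weight to the population scale.

    Assumption: standard_weight is on its natural scale (any scaling used
    for storage in the data file has been removed beforehand).
    """
    return standard_weight * target_population / completed_cases

# Hypothetical inputs for pooling the women's data of two surveys.
surveys = {
    "survey A": {"female_pop_15_49": 4_000_000, "women_interviewed": 8_000},
    "survey B": {"female_pop_15_49": 1_500_000, "women_interviewed": 6_000},
}

v005 = 1.25  # the same standard weight value, taken from each file
rescaled = {
    name: denormalize(v005, s["female_pop_15_49"], s["women_interviewed"])
    for name, s in surveys.items()
}
print(rescaled)
```

The same standard weight of 1.25 corresponds to 625 women in the first survey but only 312.5 in the second, which is why standard weights from different surveys cannot be pooled without rescaling.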

[3] http://esa.un.org/unpd/wpp/index.htm

If normalized weights are preferred, the above re-scaled weights can be re-normalized by
multiplying by the total number of completed women’s and men’s interviews combined, dividing by
the total number of weighted cases combined, and applying the above re-scaled weights to the pooled
data.

Note that the normalization of sampling weights is done for the total sample for households,
women and men separately. If the aim is to tabulate indicators for a certain sub-population from
pooled data, for example, vaccination coverage for children 12-23 months, the de-normalization has
nothing to do with the total population of children 12-23 months because there is no standard weight
calculated for children 12-23 months in DHS surveys. If the indicator is tabulated at the household
level using the household weight, the household standard weights must be de-normalized for all of the
surveys included in the analysis as explained above; likewise, if the indicator is tabulated at the
individual level using the women’s (or child’s mother’s) weight, the women’s standard weights must be
de-normalized for each of the surveys.

1.14 Calibration of sampling weights in case of bias


Generalized calibration (Deville and Särndal, 1992; Deville et al, 1993) has now become a
popular and powerful framework in survey data analysis for statistical offices in many countries. It
allows for the utilization of different sources of auxiliary information to improve estimates from sample
surveys. Calibration can reduce sampling errors, can correct bias caused by non-response and other
non-sampling errors, and can reduce the influence of extreme values. Calibration is a “weight tuning”
procedure such that the tuned sampling weight can produce estimates without error for known
population characteristics. The precision of an estimator using a calibrated weight is equivalent to a
regression estimator but is much easier to calculate with the help of calibration software such as
CALMAR, a SAS Macro procedure developed by the French Institute of Statistics and Economic Studies
(INSEE), and the SPSS procedure developed by Statistics Belgium. DHS surveys employ calibration of
sampling weights only in cases where serious bias is observed in the collected data, and there is
reliable auxiliary information available for the calibration.

Let X be a multivariate auxiliary variable with p components such that the population totals of
each of the component variables are known beforehand from the recent population census, that is,

$$t_x = \sum_{i \in U} X_i = (t_{x1}, t_{x2}, \ldots, t_{xp})^{\tau}$$

is known. Let $x_i = (x_{1i}, x_{2i}, \ldots, x_{pi})^{\tau}$ be the observations of the auxiliary variables from the
survey for the respondent sampling unit i. Let $D_i$ be the sampling weight for unit
i. The calibration procedure consists of modifying the sampling weight slightly from $D_i$ to $W_i$ such that
a given distance measure between the sampling weights $D_i$ and the calibrated weights $W_i$

$$\sum_{i \in s} g(W_i, D_i)$$

is minimized under the constraints

$$\sum_{i \in s} W_i x_i = t_x$$

where g is a distance function which measures the distance between $D_i$ and $W_i$. The constraints
imposed are that the known auxiliary variable totals are estimated without error with the calibrated
weights. If the variable of interest is well correlated with the auxiliary variables, then we expect that
the precision can be greatly improved for estimating the variable of interest. The calibration theory
states that the calibrated weights have the following formula

$$W_i = D_i F(q_i x_i^{\tau} \lambda(s))$$

where $F(\cdot)$ is called the calibration function, which is the inverse function of the derivative
of the distance function g; $q_i$ is a calibration weight which is usually set to 1 in the absence of
prior knowledge; $\lambda(s)$ is a vector of constants depending on the particular sample s, which
must be solved for. When

$$F(q_i x_i^{\tau} \lambda(s)) = 1 + q_i x_i^{\tau} \lambda(s),$$

which corresponds to one of the five proposed calibration functions in Deville et al., 1993, the
solution is straightforward and $\lambda(s)$ is given by

$$\lambda(s) = T_s^{-1} (t_x - \hat{t}_{\pi x})$$

with

$$T_s = \sum_{i \in s} D_i q_i x_i x_i^{\tau}$$
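The closed form for $\lambda(s)$ can be checked by working through the linear case directly. The sketch below assumes the chi-square-type distance $g(W_i, D_i) = (W_i - D_i)^2 / (2 q_i D_i)$, which is the distance that leads to the linear calibration function in Deville and Särndal (1992), and writes $\hat{t}_{\pi x} = \sum_{i \in s} D_i x_i$ for the estimate of the auxiliary totals under the original sampling weights. The Lagrangian symbol $L$ is introduced here only for illustration.

```latex
\begin{align*}
L &= \sum_{i\in s}\frac{(W_i-D_i)^2}{2\,q_i D_i}
     -\lambda^{\tau}\Big(\sum_{i\in s}W_i x_i-t_x\Big)\\
\frac{\partial L}{\partial W_i} &= \frac{W_i-D_i}{q_i D_i}-x_i^{\tau}\lambda=0
     \;\Longrightarrow\; W_i=D_i\,(1+q_i x_i^{\tau}\lambda)\\
\sum_{i\in s}W_i x_i &= \hat t_{\pi x}+T_s\lambda=t_x
     \;\Longrightarrow\; \lambda(s)=T_s^{-1}(t_x-\hat t_{\pi x})
\end{align*}
```

The last line substitutes the calibrated weights into the calibration constraints, which is why $T_s$ must be invertible for the solution to exist.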

For a given variable of interest y, the calibrated estimator of the population total is equivalent
to the generalized regression estimator

$$\hat{t}_y = \sum_{i \in s} W_i y_i = \hat{t}_{\pi y} + \hat{B}_s^{\tau} (t_x - \hat{t}_{\pi x})$$

where

$$\hat{B}_s = T_s^{-1} \sum_{i \in s} q_i D_i x_i y_i$$

is the sample estimation of the regression coefficient; and $\hat{t}_{\pi y}$ and $\hat{t}_{\pi x}$
are the simple estimators using the sampling weight

$$\hat{t}_{\pi y} = \sum_{i \in s} D_i y_i , \qquad \hat{t}_{\pi x} = \sum_{i \in s} D_i x_i$$

A mean estimation of the variable of interest y can be calculated by

$$\hat{\bar{y}} = \frac{\sum_{i \in s} W_i y_i}{\sum_{i \in s} W_i}$$
The calibration estimator can be equivalently formulated with known proportions of one or
more auxiliary variables. The calibration can be conducted at the individual level, which will result in
an individual specific weight, or it can be conducted at the cluster level with aggregated data, which
will result in a cluster weight. For more details see the related references given at the end of this
document.

1.15 Data quality and sampling error reporting


Data quality is always a major concern for all MEASURE DHS projects. Although considerable effort
goes into maximizing the quality of the data collected in DHS surveys, non-sampling errors remain
the main concern for data quality. The data quality of a survey directly affects the
reliability of the statistics produced. Many countries have laws that require reports of survey findings
to include an evaluation of data quality and reliability. Data quality can be measured by total survey
error, including the bias introduced by various sampling and non-sampling errors.

DHS survey final reports usually include tables in an appendix for data quality evaluation
purposes, including: age distributions of household population by sex; age distributions of eligible and
interviewed women and men; completeness of reporting on date of birth, age at death, age/date at
first union, education and anthropometric measures, etc. The MEASURE DHS program also conducts
some in-depth studies on data quality for specific topics, which are provided in published reports.

Apart from the data quality tables, DHS survey final reports provide sampling errors for
selected indicators in Appendix B. Sampling errors are important reliability measures which tell the
user the degree of error associated with a particular estimated indicator value, the number of cases
involved in the calculation of the indicator, the efficiency or clustering effects of the sample design

compared to simple random sampling, and the range for the true value of an indicator at a certain
confidence level. The reader is referred to Chapter 4, Section 4.2 for more details on sampling errors
and their calculation.
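The quantities listed above can be related to each other with a small calculation. The following is only a rough sketch for a proportion-type indicator: it assumes the standard error under the actual design is already available, approximates the simple random sampling standard error by sqrt(p(1 - p)/n), and uses a normal-approximation interval. The function name, its arguments and the input values are hypothetical; the procedures actually used are described in Chapter 4, Section 4.2.

```python
import math


def sampling_error_summary(p, se, n, z=1.96):
    """Summarize the reliability of an estimated proportion p.

    p  : estimated proportion (hypothetical input)
    se : standard error of p under the actual sample design
    n  : number of cases used to calculate the indicator
    z  : normal quantile for the chosen confidence level (assumed 1.96)
    """
    se_srs = math.sqrt(p * (1.0 - p) / n)   # standard error under simple random sampling
    return {
        "relative_error": se / p,
        "deft": se / se_srs,                 # design effect on the standard-error scale
        "ci": (p - z * se, p + z * se),
    }


summary = sampling_error_summary(p=0.40, se=0.02, n=1200)
```

A `deft` value above 1 indicates that the clustered design is less efficient than a simple random sample of the same size for that indicator.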

DHS survey final reports also provide an appendix on the sample design of the survey. The
sample design document reports the survey methodology, including the aim of the
survey, the target population, the sample size, the reporting domains, the stratification and sample
allocation, sample selection procedure, sampling weight calculation, correction for non-response,
calibration of sampling weights, and the results of survey implementation. See Chapter 5, Section 5.2
for more details on sample design.

1.16 Sample documentation


The task of a sampling statistician does not end with the selection of the sample. The
preservation of sampling documentation is an essential requisite for sampling weight calculation, for
sampling error computation, for data quality evaluation, for linkage with other data sources, and for
various kinds of checks and supplementary studies. Special efforts are needed at the time of the
sample design, at the end of the fieldwork, and at the completion of the data file if the task of sample
documentation is to be carried out effectively. If preservation of documentation is delayed,
considerable effort will be required to reconstitute the missing information when it is needed.

The sample documentation must comply with the survey confidentiality requirements. When
HIV testing is conducted in a DHS or AIS (AIDS Indicator Survey), the confidentiality guidelines require
the complete destruction of all intermediate documents which can potentially be used to identify any
single household or individual who participated in the testing. This requirement reinforces the
importance of timely sample documentation. See Chapter 5 for detailed requirements for sample
documentation.

1.17 Confidentiality
The final data files for DHS surveys are made available to interested researchers. Therefore,
the confidentiality of private information collected from individual respondents is a major concern,
especially when sensitive information such as sexual activity and HIV status is collected. Protecting
the confidentiality of the individual respondent is not only an ethical obligation, but it also promotes
more accurate data because respondents are more likely to provide truthful responses if they feel
confident their information will be kept private.

DHS surveys follow strict rules imposed at various steps during the survey implementation to
prevent the direct or indirect disclosure of the identity of individual respondents. The principal pieces
of information that can indirectly identify an individual respondent are cluster number, household
number, the cluster selection probability and the sampling weights. The cluster number is an
important identifier for sampling error calculations; the household number is important for household
level and individual level data management and tabulation; the cluster selection probability is useful
for cluster level modeling; and sampling weights are necessary for all analyses. These variables must
therefore be present in the final data file. The household number in the final DHS data file is not
informative, and
sampling weights are not informative after correction of non-response and normalization. The cluster
selection probability is potentially informative only if lower level identification information such as
district and locality are present, and DHS survey final data files do not provide geographic information
below the level of region or survey domain, especially when HIV testing is conducted. Thus the only
concern is the disclosure of the cluster. For DHS or AIS surveys with HIV testing, the final data files
provide scrambled cluster and household numbers as a further safeguard against disclosure.
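One way to picture the scrambling of identifiers is as a random one-to-one renumbering. The sketch below is an illustration under that assumption only and is not the procedure used by the DHS program: the function name `scramble_ids` and the cluster numbers are hypothetical.

```python
import random


def scramble_ids(ids, rng):
    """Map each distinct identifier to a new number drawn as a random permutation of 1..K."""
    unique = sorted(set(ids))
    new_numbers = list(range(1, len(unique) + 1))
    rng.shuffle(new_numbers)
    return dict(zip(unique, new_numbers))


# Hypothetical cluster numbers attached to seven records.
clusters = [101, 101, 205, 317, 317, 317, 402]

# SystemRandom draws from the operating system's entropy source, so the
# mapping cannot be regenerated from a seed.
mapping = scramble_ids(clusters, random.SystemRandom())
scrambled = [mapping[c] for c in clusters]
```

Records from the same cluster keep a common scrambled number, so the cluster can still be used for sampling error calculations, while the link to the original number exists only in `mapping`, which would be treated as an intermediate document under the confidentiality guidelines.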
