Sampling Notes 2016 PDF
David Steel
Carole Birrell
Contents
1 Introduction
1.1 What is Survey Sampling?
1.2 Alternatives to Sampling
1.3 Advantages of Sampling
1.4 Disadvantages of Sampling
1.5 Complementary Roles of Samples and Censuses
1.6 Probability Sampling
1.7 Other Sampling Methods
1.8 The Survey Process
1.8.1 Steps in the Survey Process
1.9 Sources of Errors in Samples
1.10 Examples
1.11 References:
1.12 Additional Reading:
3.3 Setting Sample Size
3.3.1 Sample Size for Estimating Proportions
3.4 Estimation of Ratios
3.5 Helpful background information
3.5.1 Alternative formulas
3.5.2 Taylor Series Expansion
3.6 References:
3.7 Additional Reading:
4 Systematic Sampling
4.1 References:
4.2 Additional Reading:
5 Stratified Sampling
5.1 Introduction
5.1.1 Benefits of Stratification:
5.1.2 Decisions to be made:
5.2 Notation
5.3 Definitions and Basic Properties
5.3.1 Probability of selection
5.4 Allocation of Sample
5.4.1 Proportional Allocation
5.4.2 Optimal Allocation
5.4.3 Equal Sampling Variance
5.4.4 Power Allocation
5.4.5 Allocation in Practice
5.5 Variables to use as Stratification Variables
5.6 Number of Strata
5.7 Choosing Stratum Boundaries
5.7.1 Dalenius and Hodges Method
5.8 References:
5.9 Additional Reading:
6 Ratio Estimation
6.1 Introduction and Notation
6.2 Properties of Ratio Estimation
6.2.1 Comparison of Ratio Estimator with Number Raised Estimator
6.3 Ratio Estimation Under a Super-population Model
6.4 Use of Ratio Estimation with Stratification
6.5 Additional Comments
6.6 Additional Reading:
7 Other Sampling Designs
7.1 Introduction to Cluster Sampling
7.2 Introduction to Multi-stage Sampling
Chapter 1
Introduction
Sample surveys belong to the class of studies generally called ‘observational studies’,
which are distinct from experimental studies. They are sometimes referred to as
‘cross-sectional studies’ as they provide a ‘snapshot’ of a population at some point in
time, although repeated and longitudinal surveys are often conducted. Other types of
observational studies are cohort or case control studies, which are designed for the
purpose of testing a hypothesis. Cohort studies are also called prospective studies as
they follow one group of people into the future and collect information at several
points in time. Sample surveys are descriptive surveys and will be the focus of this part
of the subject, although they are increasingly used for analytical purposes.
1.1 What is Survey Sampling?
In its widest sense, a survey is any process that involves the collection of information
about some population. The population in question will often be some group of people,
but may be a group of businesses or events, or episodes (e.g. trips). It may also be,
for example, an area of land or a volume of the atmosphere. In general a population
consists of a group of units about which you wish to draw conclusions.
From a sample we attempt to draw conclusions about the population from which that
sample was drawn - this involves inductive logic or inference.
If the variables that we measure in the population have low variability, then only a
few observations, possibly selected in some haphazard way, are needed to provide good
estimates of characteristics of the population. However, it is usually not that simple.
Examples of Surveys: Monthly Labour Force, Retail Trade, Capital Expenditure,
Company Profits, Market Research, Performance Appraisal (job satisfaction), Opinion
Polls, Quality Management, Ecological or Geological Surveys.
• how to calculate estimates and make inferences
• the precision of inferences and how to estimate that precision.
It takes into account:
• the “messiness” of the real world
• cost.
This topic differs from most statistics subjects in the following ways.
• It is concerned with estimating characteristics of a finite population, not the
parameters of a probability distribution, or infinite population.
• It will consider how to obtain samples taking costs and other operational factors
into account. Often you will have been taught how to analyse data already
available.
The topic covers methods of sample design and related estimation procedures aimed at
providing the required precision at least cost.
1.2 Alternatives to Sampling
A sample survey is one way to obtain information about a finite population. Other
important ways of obtaining information are:
• censuses or complete enumerations,
• administrative systems e.g. Medicare, Australian Tax Office.
Sampling may be used in a census or administrative system. In a census, some
questions which are more detailed and expensive to process may be asked of a sample of
people whereas basic questions are asked of everyone. To reduce costs, statistics may be
generated from an administrative system using a sample of the records.
1.4 Disadvantages of Sampling
• introduces sampling errors (errors due to use of a sample rather than the entire
population)
• limits level of disaggregation possible (i.e. breaking sample up into small areas or
groups).
1.5 Complementary Roles of Samples and Censuses
• census provides information for use in sample design and estimation, e.g. in
stratification and ratio estimation
• can combine samples and censuses e.g. 1st phase - census, 2nd phase - follow up
sample for more detail.
1.6 Probability Sampling
• each unit in the population has a known non-zero chance of selection, which can
be determined for each unit in the sample
• this justifies the theory that follows, which enables unbiased estimates to be
calculated and the precision of the estimates to be evaluated through the calculation
of estimates of standard errors.
This subject will focus on probability sampling methods, which may also be called
scientific or random sampling methods. It is not necessary that the selection
probabilities are equal, but if unequal selection probabilities are used, this must
be accounted for in the estimation procedures used.
1.7 Other Sampling Methods
• purposive selection
• haphazard selection
• volunteer
• quota sampling.
These methods may be useful in some situations, but since they result in some units
having zero or unknown probabilities of selection they are not recommended in general.
Even with a probability sample, the final sample may not be exactly as designed because
of non-response, non-contact, bad operations, sampling process not followed properly,
bad lists, poor frames e.g. missing units, duplicates.
1.8 The Survey Process
Surveys vary widely in the subjects they cover, the methods used, their size and
complexity and the purposes they fulfil. Conducting a survey is a process that involves
a number of steps that must fit together well. To ensure everything fits together the
whole survey process must be properly planned. While the steps involved follow a logical
sequence there is always a degree of iteration involved in the development of a survey.
Decisions have to be reviewed in light of later developments.
A common fault is lack of effort in the development phase, e.g. in testing key aspects
of the questionnaire.
Survey development
• determine objectives;
• determine resources available and constraints;
• review alternative sources of information;
• specify population of interest;
• specify population units and sampling units;
• identify research issues;
• decide data items and classifications;
• determine precision required;
• decide type of investigation needed;
• determine collection method, including non-response follow-up;
• develop collection instrument;
• specify sampling method, design and estimation procedures;
• develop and plan survey operations.
Survey operations
• weighting;
• calculation of estimates;
• account for missing data;
• production of tables, charts and diagrams;
• identifying important subgroups and relationships;
• calculation of estimates of sampling errors;
• report preparation.
Evaluation
In any particular survey some of these steps may not be significant and the relative
importance of them will vary between projects.
1.9 Sources of Errors in Samples
• sampling errors
• non-sampling errors e.g., errors due to the form, respondent, frame or list from
which we sample, etc. These types of errors potentially affect all statistical
collections.
The reliability of the estimates from a survey depends on the errors that affect the survey.
Groves (1989), Chapter 1, gives an excellent review of the potential sources of survey
errors:
• Sampling error - If instead of including all units in the population in the survey
a sample is selected then the estimates will differ from the result that a complete
enumeration would give. The size of this difference is called the sampling error.
For a probability sample, an indication of the likely size, but not direction, of this
error, can be calculated from the sample using the standard error. This is one key
advantage of using probability sampling. For other methods, it is not possible to
estimate the likely size of the sampling error, although in some cases an attempt is
made by assuming the sampling procedure is equivalent to a probability sampling
scheme.
• Coverage error - errors caused by some units not being on the sampling frame or
list.
• Non-response error - these arise if some selected units could not be contacted or
refused to provide the information.
• Instrument errors - errors or differences may arise from the way the questions are
asked and instructions given.
• Mode of data collection - different answers to the same question may be obtained
when using different modes (e.g. mail, telephone, face-to-face) of data collection.
All data collections are potentially subject to these errors. A census or complete
enumeration would have no sampling error but would be subject to all of the other sources
of error. Although they introduce sampling error, sample surveys can sometimes give
more reliable results than censuses because more effort can be put into reducing the
non-sampling errors for the same cost.
1.10 Examples
Example 1.1: Discuss the following diagram
Figure 1.3: The target population and sample population in a telephone survey of likely
voters (adapted from Lohr Figure 1.1 p4)
Notes:
Example 1.2: (Griffiths et al, p228)
Prior to the 1936 presidential election between two candidates, Roosevelt and Landon,
two polls were conducted:
The results of the 2 polls and the actual result are given in the following table.
Literary Digest
Gallup
Comments:
1.11 References:
Griffiths, D., Stirling, W.D. and Weldon, L.L. (1998) Understanding Data: Principles &
Practice of Statistics. John Wiley.
Groves, R.M. (1989) Survey Errors and Survey Costs. New York; John Wiley.
Lemeshow, S. and Levy, P.S. (2008) Sampling of Populations: Methods and Applications.
4th edition. New York; John Wiley.
Chapter 2
2.1 Population
• Survey population - population that is sampled from i.e., set of units with non-zero
chance of selection e.g. all retailers with employees operating in June 2008. The
survey population and the target population may differ.
• Population units - the elements or entities about which we wish to make estimates
(e.g. people).
• Reporting (or observation) Units - the units providing the information (e.g. people)
or an object on which a measurement is taken.
• Population values - each individual in the population has associated with it the
value of one or more characteristics of interest. We will usually consider one
characteristic or variable and denote the population values as Y1 , . . . , YN . The
value of the characteristic of interest of the ith population unit is denoted Yi .
Population mean: $\bar{Y} = \dfrac{Y}{N}$,
Population median,
Population variance: $\sigma_Y^2 = \dfrac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{Y})^2$,
(later we will consider $S_Y^2 = \dfrac{1}{N-1}\sum_{i=1}^{N} (Y_i - \bar{Y})^2$),
Population coefficient of variation: $CV = \dfrac{\sigma_Y}{|\bar{Y}|}$.
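The population parameters above can be computed directly when the whole population is known. The following is a minimal sketch, not part of the original notes; the function name `population_summary` and the five population values are hypothetical and chosen only for illustration. It makes the difference between the divisor-N variance and the divisor-(N - 1) variance explicit.

```python
import math

def population_summary(values):
    """Summarise a finite population held in a list of numbers.

    Returns the total Y, the mean Y-bar, sigma_Y^2 (divisor N),
    S_Y^2 (divisor N - 1) and the coefficient of variation sigma_Y / |Y-bar|.
    """
    N = len(values)
    total = sum(values)
    mean = total / N
    sum_sq_dev = sum((v - mean) ** 2 for v in values)
    sigma2 = sum_sq_dev / N          # population variance with divisor N
    S2 = sum_sq_dev / (N - 1)        # variant with divisor N - 1
    cv = math.sqrt(sigma2) / abs(mean)
    return {"total": total, "mean": mean, "sigma2": sigma2, "S2": S2, "cv": cv}

# Hypothetical population of N = 5 values (illustration only).
summary = population_summary([2, 4, 4, 6, 9])
print(summary)
```

The two variances differ only by the factor N/(N - 1), so for large populations the distinction matters little numerically.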
2.2 Sample
• Sampling Frame - the list of sampling units. (e.g. the list of households)
Sample mean is $\bar{y} = \dfrac{y}{n}$
Sample variance is $s_y^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$.
A sample s is any subset of U . The sample will often be obtained by selecting sampling
units from a list of these units (sampling frame). Usually the sampling units are the
population units but can be groups of population units (e.g. cluster and multi-stage
sampling). More generally, the sampling frame is the set of materials used to obtain the
sample.
Sample values will be written y1 , . . . , yn , where n is the sample size. Usually, the order
of selection contains no information and is ignored. The sampling scheme may allow
the same population unit to be selected more than once - if this is the case it is a with
replacement scheme. If the duplicates can be identified it is better, in theory, to remove
them. This must then be taken into account in the estimation procedure. If duplicates in
the sample cannot be identified, unbiased estimates can still be calculated provided the
expected number of times a population unit is selected can be determined. See Theorem
3.1. In practice, without replacement schemes are generally used. Unknown duplicates
on the sampling frame are a different issue and can cause biases.
Let S be the random variable denoting which sample is drawn and for the sample design
or selection procedure
\[ P(S = s) = p_d(s) \]
Note $p_d(s) \ge 0 \;\; \forall s \in S$ and $\sum_{s \in S} p_d(s) = 1$. So $p_d(s)$ defines a
probability distribution over $S$.
Example 2.2 (b) N = 4 so U = {1, 2, 3, 4}
Note not every sample has a non-zero chance of selection, but every population unit
does, so provided these probabilities are known, this is a probability sampling method.
Suppose we define an estimation procedure which results in the estimate $y'(s)$ for $s$, then
$y'(S)$ is a random variable such that
\[ P(y'(S) = a) = \sum_{\{s \,:\, y'(s) = a\}} p_d(s) \]
We will drop the “d” subscript, but remember different designs give different sampling
distributions. We define the mean and variance of y ′ (S) over all possible samples, that
is the randomisation distribution introduced by sampling.
\[ E_p[y'(S)] = \sum_{s \in S} p(s)\, y'(s) \]
\[ V_p[y'(S)] = \sum_{s \in S} p(s)\,\bigl(y'(s) - E_p[y'(S)]\bigr)^2 \]
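The two definitions above are simply a weighted mean and a weighted variance over the set of possible samples. As a hedged illustration (the design, the population values and the function name `design_moments` are all hypothetical and not taken from the notes), the sketch below evaluates them exactly for a small design in which only four of the six possible samples of size 2 can be selected.

```python
from fractions import Fraction

def design_moments(design, estimator):
    """Mean and variance of an estimator over the randomisation distribution.

    design    -- dict mapping each possible sample (tuple of unit labels) to p(s)
    estimator -- function returning the estimate y'(s) for a sample s
    """
    assert sum(design.values()) == 1, "p(s) must sum to 1 over all samples"
    mean = sum(p * estimator(s) for s, p in design.items())
    var = sum(p * (estimator(s) - mean) ** 2 for s, p in design.items())
    return mean, var

# Hypothetical population values, indexed by unit label.
Y = {1: 3, 2: 5, 3: 8, 4: 12}

# Hypothetical design: samples (1,3) and (2,4) have zero probability,
# but every unit still has inclusion probability 1/2.
design = {
    (1, 2): Fraction(1, 4),
    (3, 4): Fraction(1, 4),
    (1, 4): Fraction(1, 4),
    (2, 3): Fraction(1, 4),
}

def sample_mean(s):
    return Fraction(sum(Y[i] for i in s), len(s))

mean, var = design_moments(design, sample_mean)
print(mean, var)
```

Exact fractions are used so that the result can be compared with a hand calculation without rounding error. A different design over the same population would generally give a different variance, which is the point of keeping the design subscript in mind.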
Example 2.2 cont.
(c) Consider the population values {−1, 2, 4, 10}
(i) Determine the population mean and the sample mean for each possible sample
with n = 2.
The shape of the probability distribution over S will depend on the original distribution
of the variable in the finite population, the sample design and the estimator used. We
will drop the “p” subscript and also write y ′ (s) and y ′ (S) as y ′ - the context will make
clear which is meant. Considering the properties of estimators with respect to the
randomisation distribution is called the design based approach.
P (i ∈ S) = pi
P (i, j ∈ S) = pij
Note that the finite population value, Yi , is regarded as a fixed quantity. On some occa-
sions we may let it be a random variable also generated by some stochastic mechanism
or drawn from some superpopulation. This approach is called a model-based approach.
Note: $y'$ is unbiased for $Y$ if $E(y') = Y$;
\[ \mathrm{Bias}(y') = E(y') - Y \]
\[ \mathrm{MSE}(y') = V(y') + \mathrm{Bias}^2(y') \]
See Cochran (1977) section 1.8 about the effect of bias on the reliability of confidence
intervals.
Figure 2.1: Unbiased, precise and accurate archers (Lohr Fig 2.2 p29)
Solution:
2.4.1 Other Parameters
The standard error is the standard deviation of the sampling distribution of y ′ and is
not the standard deviation of the population distribution (σY ).
• This assumption may not work if the sample size is very small or the population
distribution is very skewed, although it will work for large samples from skewed
populations. Some theoretical justification is based on a version of the Central
Limit Theorem appropriate to finite populations.
• Cochran (1977, section 2.15) suggests that, for simple random sampling, the
assumption of Normality should be reasonable if $n > 25\,G_1^2$, where
\[ G_1 = \frac{\dfrac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{Y})^3}{\left( \dfrac{1}{N}\sum_{i=1}^{N} (Y_i - \bar{Y})^2 \right)^{3/2}} \]
is Fisher’s Coefficient of Skewness.
Sugden, Smith and Jones (2000) suggest that when the fact that the standard error
is estimated is taken into account then we need $n > 28 + 25\,G_1^2$.
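Both rules of thumb are easy to evaluate once the skewness is known or guessed. The sketch below is an illustration only: the function names and the two small populations are hypothetical, and in practice $G_1$ would have to be estimated or guessed because the full population is not available.

```python
def fisher_skewness(values):
    """Fisher's coefficient of skewness G1 for a finite population."""
    N = len(values)
    mean = sum(values) / N
    m2 = sum((v - mean) ** 2 for v in values) / N   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / N   # third central moment
    return m3 / m2 ** 1.5

def cochran_threshold(g1):
    """Cochran's rule: Normality is reasonable if n exceeds 25 * G1^2."""
    return 25 * g1 ** 2

def sugden_threshold(g1):
    """Sugden, Smith and Jones: n should exceed 28 + 25 * G1^2."""
    return 28 + 25 * g1 ** 2

# Hypothetical populations: one symmetric, one strongly right-skewed.
g_sym = fisher_skewness([1, 2, 3, 4, 5])
g_skew = fisher_skewness([1, 1, 1, 1, 10])
print(g_sym, cochran_threshold(g_sym), sugden_threshold(g_sym))
print(g_skew, cochran_threshold(g_skew), sugden_threshold(g_skew))
```

For the symmetric population the skewness is zero, so Cochran's rule imposes no lower bound while the Sugden, Smith and Jones rule still asks for more than 28 units.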
2.6 Bivariate Definitions
Suppose that we collect information on two variables y, x on each sample unit then we
get a bivariate sampling distribution for the estimators y ′ , x′ .
Relative Covariance of $y'$, $x'$ is
\[ V_{y'x'} = \frac{C(y', x')}{E(y')E(x')} \]
(Sampling) correlation
\[ \frac{C(y', x')}{\sqrt{V(y')V(x')}} = \mathrm{corr}(y', x') = \rho(y', x') \]
This is not the population correlation $\rho_{YX} = \dfrac{\sigma_{YX}}{\sigma_Y \sigma_X}$.
Theorem 2.1
Choose one of the balls at random and then choose one of the numbers inside that ball.
Let $Y$ be the number that is chosen and let
\[ Z = \begin{cases} 1 & \text{if Ball A is chosen} \\ 0 & \text{if Ball B is chosen} \end{cases} \tag{2.1} \]
(a) Calculate
(i) E(Y |Z = 1)
(ii) E(Y |Z = 0)
(iii) E(Y )
2.7 References:
Cochran, W.G. (1977) Sampling Techniques, 3rd ed. New York; John Wiley.
Sugden, R.A., Smith, T.M.F. and Jones, R.P. (2000) Cochran’s rule for Simple Random
Sampling. Journal of the Royal Statistical Society, Series B, 62, pp787-793.
Chapter 3
• SRSWR - with replacement - all possible samples have the same chance of selection,
but an individual unit can be drawn more than once.
Number of possible samples = $N^n$.
• SRSWOR - without replacement - all possible samples have the same chance of
selection, but an individual unit cannot be drawn more than once.
Number of possible samples = $\dbinom{N}{n} = \dfrac{N!}{n!(N-n)!}$
Theorem 3.1
For any probability sampling scheme for which $\pi_i$ is the expected number of times the
ith population unit is selected, then
\[ E\left[ \sum_{i=1}^{n} y_i \right] = \sum_{i=1}^{N} \pi_i Y_i . \]
Proof:
Define $\delta_i$ = the number of times the ith population unit is in the sample. Then,
\[ \sum_{i=1}^{n} y_i = \sum_{i=1}^{N} \delta_i Y_i \]
Hence $E\left[ \sum_{i=1}^{n} y_i \right] = \sum_{i=1}^{N} E(\delta_i) Y_i$, since $Y_i$ is fixed, and
$E(\delta_i) = \pi_i$, by definition. (Recall that $Y_i$ is not a random variable.) For any WOR
sampling scheme, $\pi_i = p_i$, the probability of selection.
Example 3.1
Recall example 2.1 SRSWOR with y1 = Y2 , and y2 = Y106 : then
δ1 = 0, δ2 = 1, δ3 = δ4 = . . . = δ105 = 0, δ106 = 1, δ107 = . . . = δN = 0
and we can write
y1 + y2 =
=
=
Corollary 3.2
\[ E\left[ \sum_{i=1}^{n} \frac{y_i}{\pi_i} \right] = \sum_{i=1}^{N} Y_i = Y \]
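Corollary 3.2 can be checked numerically for any design whose sample probabilities are known. The sketch below is illustrative only: the three-unit population, the unequal-probability design and the function names are hypothetical. It computes the inclusion probabilities $\pi_i$ from $p(s)$ for a without-replacement design and then averages the weighted sample sum over all samples.

```python
from fractions import Fraction

def inclusion_probabilities(design):
    """pi_i = sum of p(s) over all samples s containing unit i (WOR designs)."""
    pi = {}
    for sample, p in design.items():
        for unit in sample:
            pi[unit] = pi.get(unit, 0) + p
    return pi

def expected_weighted_sum(design, Y):
    """E[ sum over the sample of y_i / pi_i ] under the given design."""
    pi = inclusion_probabilities(design)
    return sum(p * sum(Y[i] / pi[i] for i in sample)
               for sample, p in design.items())

# Hypothetical population and unequal-probability design (illustration only).
Y = {1: 6, 2: 9, 3: 4}
design = {(1, 2): Fraction(1, 2), (1, 3): Fraction(1, 4), (2, 3): Fraction(1, 4)}

pi = inclusion_probabilities(design)
expectation = expected_weighted_sum(design, Y)
print(pi, expectation, sum(Y.values()))
```

Although the three units have different chances of selection, the expectation equals the population total because each sample value is divided by its own $\pi_i$.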
Example 2.2 continued: By using the values of pi obtained earlier, show that
\[ \frac{1}{N} \sum_{i=1}^{n} \frac{y_i}{\pi_i} \]
Corollary 3.3 For SRSWOR $\pi_i = n/N$, hence an unbiased estimator for the population
total $Y$ is
\[ y' = \frac{N}{n} \sum_{i=1}^{n} y_i \]
and also, since $\bar{y} = \dfrac{y'}{N}$, then
\[ \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \]
is unbiased for $\bar{Y}$.
The estimator $y'$ is sometimes called a number raised estimator.
Example 3.2
One eighth of a population is selected by SRSWOR, and the total income for the sample
is calculated as $5,200,000. Calculate an estimate of the total income for the population
using correct notation.
Solution:
Theorem 3.4
For SRSWOR, the estimator $y' = \dfrac{N}{n}\sum_{i=1}^{n} y_i = N\bar{y}$ has sampling variance
\[ V(y') = N^2 \left( 1 - \frac{n}{N} \right) \frac{S_Y^2}{n} \quad \text{where} \quad S_Y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 \]
Proof:
extra steps
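\[ E(y'^2) = \left( \frac{N}{n} \right)^2 \left[ \frac{n}{N} \sum_{i=1}^{N} Y_i^2 + \frac{n}{N}\,\frac{(n-1)}{N-1} \sum_{i=1}^{N} \sum_{j \ne i} Y_i Y_j \right] . \]
Hence
\[ V(y') = \frac{N^2}{n} \left[ \frac{1}{N} \sum_{i=1}^{N} Y_i^2 + \frac{n-1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \ne i} Y_i Y_j - n(\bar{Y})^2 \right] \]
\[ = \frac{N^2}{n} \left[ \frac{1}{N} \sum_{i=1}^{N} Y_i^2 + \frac{n-1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \ne i} Y_i Y_j - \frac{n}{N^2} \left( \sum_{i=1}^{N} Y_i^2 + \sum_{i=1}^{N} \sum_{j \ne i} Y_i Y_j \right) \right] \]
since
\[ \bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i , \qquad \bar{Y}^2 = \frac{1}{N^2} \left( \sum_{i=1}^{N} Y_i^2 + \sum_{i=1}^{N} \sum_{j \ne i} Y_i Y_j \right) \]
Thus
\[ V(y') = \frac{N^2}{n} \left[ \frac{1}{N} \left( 1 - \frac{n}{N} \right) \sum_{i=1}^{N} Y_i^2 - \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \ne i} Y_i Y_j \underbrace{\left\{ \frac{N(N-1)n}{N^2} - (n-1) \right\}}_{1 - \frac{n}{N}} \right] \]
\[ = \frac{N^2}{n} \left( 1 - \frac{n}{N} \right) \underbrace{\left[ \frac{1}{N} \sum_{i=1}^{N} Y_i^2 - \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \ne i} Y_i Y_j \right]}_{S_Y^2} \]
\[ = \frac{N^2}{n} \left( 1 - \frac{n}{N} \right) S_Y^2 \]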
Corollary 3.5
For SRSWOR, the sample mean, $\bar{y}$, has sampling variance
\[ V(\bar{y}) = \left( 1 - \frac{n}{N} \right) \frac{S_Y^2}{n} \]
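For a very small population the variance formula for the sample mean can be verified by brute force, listing every SRSWOR sample and computing the variance of the sample means directly. The sketch below does this with exact fractions; the five population values and the function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations

def variance_of_mean_by_enumeration(values, n):
    """Variance of the sample mean over all equally likely SRSWOR samples."""
    samples = list(combinations(values, n))
    p = Fraction(1, len(samples))                  # each sample equally likely
    means = [Fraction(sum(s), n) for s in samples]
    expected = sum(p * m for m in means)
    return sum(p * (m - expected) ** 2 for m in means)

def variance_of_mean_by_formula(values, n):
    """(1 - n/N) * S_Y^2 / n, with S_Y^2 using divisor N - 1."""
    N = len(values)
    mean = Fraction(sum(values), N)
    S2 = sum((v - mean) ** 2 for v in values) / (N - 1)
    return (1 - Fraction(n, N)) * S2 / n

# Hypothetical population of N = 5 values, samples of size n = 2.
population = [3, 7, 8, 12, 15]
enumerated = variance_of_mean_by_enumeration(population, 2)
formula = variance_of_mean_by_formula(population, 2)
print(enumerated, formula)
```

The two results agree exactly, which is only practical to confirm when the number of possible samples is small (here 10).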
The sampling variance of ȳ depends on 3 factors:
Of these factors, the sampling fraction is generally the least important; it is typically,
but not always, small (much nearer to 0 than to 1).
The term $\left( 1 - \dfrac{n}{N} \right)$ is called the finite population correction factor.
Notes
Example 3.3
Consider the population values from Example 2.2: {−1, 2, 4, 10}
If each of the 6 possible samples of size 2 is drawn using SRSWOR, show that the
sampling variance of $\bar{y}$ (i.e. $V(\bar{y})$) is equal to the variance of the mean over all 6 samples.
Hint: first calculate the mean for each of the 6 possible samples of size 2.
Solution:
i 1 2 3 4 5 6
p(si )
y1 , y 2
ȳ
Solution continued.
Corollary 3.6
For SRSWOR the relative variance of $y'$ and also of $\bar{y}$ is
\[ V_{y'}^2 = V_{\bar{y}}^2 = (1 - f) \frac{S_Y^2}{\bar{Y}^2 n} = (1 - f) \frac{V_Y^2}{n} \quad \text{where} \quad V_Y^2 = \frac{S_Y^2}{|\bar{Y}|^2} \]
This corollary shows that knowledge of $V_Y$ is important. From now on we will refer to
$V_Y$ as the population coefficient of variation.
Corollary 3.7
For SRSWOR the RSE of $y'$ and $\bar{y}$ are
\[ V_{y'} = V_{\bar{y}} = \sqrt{1 - f}\; \frac{V_Y}{\sqrt{n}} . \]
3.1.1 Estimating Sampling Variance
An important feature of probability sampling is that, for most designs, the likely precision
of the estimates can be estimated from the sample.
Theorem 3.8
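Let $s_y^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ be the sample variance for a SRSWOR then $E(s_y^2) = S_Y^2$
Proof:
\[ s_y^2 = \frac{1}{n} \sum_{i=1}^{n} y_i^2 - \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \ne i} y_i y_j . \]
Hence
\[ E(s_y^2) = \frac{1}{n} \sum_{i=1}^{N} \frac{n}{N} Y_i^2 - \frac{1}{n(n-1)} \sum_{i=1}^{N} \sum_{j \ne i} Y_i Y_j \frac{n(n-1)}{N(N-1)} = S_Y^2 \]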
Corollary 3.9
An estimate of the sampling variance of $y'$ is given by
\[ \hat{V}(y') = N^2 (1 - f) \frac{s_y^2}{n} . \]
$\hat{V}(y')$ is unbiased for $V(y')$.
In any particular problem, a statistician must be able to specify the design, the estimator
and the associated sampling variance and the variance estimator. The results above
effectively do this for SRSWOR.
where for a two-sided confidence interval with degrees of freedom $df$,
$P(-t_{df,1-\alpha/2} < T < t_{df,1-\alpha/2}) = 1 - \alpha$. For example, for a 95% confidence interval
with n = 20, α = 0.05, we have df = n − 1 = 19 degrees of freedom and from t distribution
tables we find that $t_{n-1,\,1-\alpha/2} = t_{19,\,0.975} = 2.093$.
The term which is added to and subtracted from the estimate to obtain the confidence
interval is often referred to as the margin of error. Therefore, in this case, the margin
of error is $t_{n-1,1-\alpha/2}\, SE(y') = t_{n-1,1-\alpha/2} \sqrt{\hat{V}(y')}$.
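The estimator, its estimated variance and the margin of error fit together as in the following minimal sketch for the sample mean under SRSWOR. It is an illustration, not the notes' own solution method: the function name `srswor_mean_ci`, the five sample values and N = 50 are hypothetical, and the critical value is passed in by the caller after being read from t tables (2.776 is the tabulated value for 4 degrees of freedom at 0.975).

```python
import math

def srswor_mean_ci(sample, N, crit):
    """Estimate the population mean from an SRSWOR sample.

    sample -- list of observed values y_1, ..., y_n
    N      -- population size
    crit   -- t (or z) critical value chosen by the user, e.g. from tables
    Returns (estimate, standard error, (lower, upper)).
    """
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)   # sample variance
    var_hat = (1 - n / N) * s2 / n                        # includes the fpc
    se = math.sqrt(var_hat)
    margin = crit * se                                    # margin of error
    return ybar, se, (ybar - margin, ybar + margin)

# Hypothetical data: n = 5 values from a population of N = 50.
ybar, se, (lower, upper) = srswor_mean_ci([4, 6, 5, 9, 6], N=50, crit=2.776)
print(ybar, se, lower, upper)
```

Multiplying the estimate and its standard error by N gives the corresponding results for the number raised estimate of the total.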
Example 3.4
In a large first year subject with 260 students enrolled at exam time, a lecturer was
interested in student absence from tutorials and recorded the number of tutorials (out of
28) missed by a simple random sample (without replacement) of 20 of those 260 students.
The data obtained are tabulated below together with some additional calculations.
Other information: $\sum y f(y) = 36$ and $\sum y^2 f(y) = 166$
(a) Calculate the mean and standard deviation of the number of missed tutorials for
the students in this sample.
(b) Find an approximate 95% confidence interval for the mean number of missed
tutorials per student in the whole class. Explain your calculations and avoid any
unnecessary approximations.
(c) State how the confidence interval would change if the sample had come from an
infinite population. Do not do any additional calculations.
Solution:
Solution cont.:
3.2 Proportions or Bernoulli Variables
An important special case occurs when we want to estimate the number or proportion
of the population in some category “c”. For example we might want to estimate the
number or proportion of people unemployed.
\[ Y_i = \begin{cases} 1 & \text{if } i \in c \text{ for some category } c \text{ of the population} \\ 0 & \text{if } i \notin c \end{cases} \]
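The sampling variance of the estimate of the proportion can be obtained by substituting
for $S_Y^2$ into $V(\bar{y})$. From corollary 3.5:
\[ V(\bar{y}) = \left( 1 - \frac{n}{N} \right) \frac{S_Y^2}{n} \]
\[ V(p_c) = \left( 1 - \frac{n}{N} \right) \frac{1}{n} \left( \frac{N}{N-1} \right) P_c (1 - P_c) \]
\[ = \left( \frac{N-n}{N} \right) \frac{1}{n} \left( \frac{N}{N-1} \right) P_c (1 - P_c) \]
\[ = \left( \frac{N-n}{N-1} \right) \frac{P_c (1 - P_c)}{n} \]
If $S_Y^2$ is unknown, an estimate may be calculated by just using the sample proportion:
\[ s_y^2 = \frac{n}{n-1}\, p_c (1 - p_c) \]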
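Thus, the sampling variance can be estimated by just using $s_y^2$ in place of $S_Y^2$:
\[ \hat{V}(\bar{y}) = \left( 1 - \frac{n}{N} \right) \frac{s_y^2}{n} \]
\[ \hat{V}(p_c) = \left( 1 - \frac{n}{N} \right) \frac{1}{n} \left[ \frac{n}{n-1}\, p_c (1 - p_c) \right] \]
\[ = \left( 1 - \frac{n}{N} \right) \frac{p_c (1 - p_c)}{n-1} \]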
where for a two-sided confidence interval, $P(-z_{1-\alpha/2} < Z < z_{1-\alpha/2}) = 1 - \alpha$. For
example, for a 95% confidence interval, α = 0.05 and from Normal distribution tables we
can determine that $z_{1-\alpha/2} = z_{0.975} = 1.96$. For small samples, use t critical values, as
before.
The margin of error is $z_{1-\alpha/2}\, SE(p_c)$ and is easily obtained from the calculation of the
confidence interval. It has a maximum when $p_c = 0.5$.
Also, the number raised estimate of the number of people in the category is
\[ y' = N p_c . \]
An estimate of its variance can also be obtained by just using the sample proportion to
calculate $s_y^2$, which can then be substituted into $\hat{V}(y')$.
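The proportion formulas above can be collected into one small routine. The following is a sketch under stated assumptions rather than a definitive implementation: the function name `srswor_proportion` and the counts used (30 units in the category out of a sample of 100 from a population of 1000) are hypothetical, and the Normal critical value is taken from Python's standard library `statistics.NormalDist` instead of tables.

```python
import math
from statistics import NormalDist

def srswor_proportion(count, n, N, level=0.95):
    """Estimate a proportion and the corresponding total under SRSWOR.

    count -- number of sampled units in the category
    n, N  -- sample size and population size
    level -- confidence level for the Normal-based interval
    """
    p = count / n
    var_hat = (1 - n / N) * p * (1 - p) / (n - 1)   # estimated variance of p_c
    se = math.sqrt(var_hat)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    margin = z * se
    return {
        "p": p,
        "se": se,
        "ci": (p - margin, p + margin),
        "total": N * p,          # number raised estimate y' = N * p_c
        "se_total": N * se,
    }

# Hypothetical survey: 30 of 100 sampled units are in the category, N = 1000.
result = srswor_proportion(30, 100, 1000)
print(result)
```

Note the divisor n - 1 in the estimated variance, which comes from using $s_y^2$ in place of $S_Y^2$.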
Newspoll Example:
3.2.2 Examples
Example 3.5 (from p52 Cochran):
From a list of 3042 names and addresses, a simple random sample of 200 names showed
on investigation 38 wrong addresses.
(a) Estimate the total number of addresses needing correction in the list.
(b) Find the standard error of this estimate.
Example 3.6 (from Lohr p35 eg 2.6):
An SRS of 300 counties out of a total of 3078 counties in the U.S. collected information
including the acreage devoted to farms. 153 were found to have less than 200,000 acres
of farmland.
(a) Estimate the proportion of counties with less than 200,000 acres of farmland;
(b) Calculate its standard error;
(c) Determine a 95% CI for the proportion.
Newspoll Example cont.:
3.3 Setting Sample Size
An important step in designing a survey is estimating the size of the sample to be taken
from the population. The chosen sample size has important implications for time and
money: if the sample is too large, the survey will take longer and cost more than
necessary; if the sample is too small, the survey results will be unreliable and the
time and resources expended will have been wasted.
Sample size can be determined based on the precision required for the estimate. Suppose
we want a relative variance of $\alpha^2$, e.g. a 5% RSE corresponds to $\alpha = 0.05$.
We have shown (see corollary 3.6) that the relative variance is given by:
\[ V_{y'}^2 = V_{\bar{y}}^2 = (1 - f) \frac{S_Y^2}{\bar{Y}^2 n} = (1 - f) \frac{V_Y^2}{n} = \left( \frac{1}{n} - \frac{1}{N} \right) V_Y^2 \]
where $V_Y = \dfrac{S_Y}{|\bar{Y}|}$ is the coefficient of variation of the variable in the population.
To determine n, we need to know $V_Y^2$, $\alpha^2$, $N$. We thus need to estimate or guess $V_Y^2 = \dfrac{S_Y^2}{\bar{Y}^2}$.
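Setting the relative variance $\left( \frac{1}{n} - \frac{1}{N} \right) V_Y^2$ equal to the target $\alpha^2$ and rearranging gives $\frac{1}{n} = \frac{\alpha^2}{V_Y^2} + \frac{1}{N}$. The sketch below evaluates this; the function name `sample_size_for_rse` and the inputs (a guessed coefficient of variation of 1.0 and a target RSE of 5%) are hypothetical values for illustration only. The function returns the unrounded value, which in practice would be rounded up to the next whole unit.

```python
def sample_size_for_rse(cv, alpha, N=None):
    """Sample size needed for a target relative standard error under SRSWOR.

    cv    -- guessed population coefficient of variation V_Y
    alpha -- target relative standard error (e.g. 0.05 for 5%)
    N     -- population size; None means N is treated as effectively infinite
    Solves 1/n = alpha^2 / cv^2 + 1/N and returns n (not rounded).
    """
    inverse_n = (alpha / cv) ** 2
    if N is not None:
        inverse_n += 1 / N
    return 1 / inverse_n

# Hypothetical inputs: V_Y guessed as 1.0, target RSE of 5%.
n_large_population = sample_size_for_rse(1.0, 0.05)
n_small_population = sample_size_for_rse(1.0, 0.05, N=1000)
print(n_large_population, n_small_population)
```

The finite population term only matters when the resulting n is an appreciable fraction of N; for very large N the two answers are practically identical.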
There are several approaches:
• Roughly estimate the range, or the range within which 95 percent of the population
lies, assume a particular distribution, e.g. Normal (or uniform), and work out
the implied CV.
- Normal distribution: Estimate what range would contain 95% of the values.
Divide by 4 to obtain an estimate of $S_Y$.
- Uniform distribution: Estimate the range of the distribution and divide by
$\sqrt{12}$.
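The two range-based rules above amount to one line of arithmetic each. A minimal sketch follows; the function names are hypothetical, and the range of 20 to 150 (thousand dollars) is the household income range used in Example 3.7.

```python
import math

def sd_from_normal_range(low, high):
    """Guess S_Y assuming (low, high) contains about 95% of a Normal population."""
    return (high - low) / 4

def sd_from_uniform_range(low, high):
    """Guess S_Y assuming the population is uniform on (low, high)."""
    return (high - low) / math.sqrt(12)

# Range from Example 3.7, in thousands of dollars.
sd_normal = sd_from_normal_range(20, 150)
sd_uniform = sd_from_uniform_range(20, 150)
print(sd_normal, sd_uniform)
```

Dividing either guess by a guess of the mean gives the implied coefficient of variation needed for the sample size calculation.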
Example 3.7
Consider average household income - the average is probably something like 50,000
dollars a year and most would be in the range 20,000 to 150,000 dollars.
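(a) If we assume a Normal distribution (working in thousands of dollars):
\[ \bar{Y} = 50 \]
\[ 4 S_Y = 130 \;\Rightarrow\; S_Y = \frac{130}{4} = 32.5 \]
\[ \Rightarrow\; \frac{S_Y}{\bar{Y}} = \frac{32.5}{50} = 0.65 \]
n =                                                                  (3.1)
Note since N ≈ 5,000,000, the term involving N can be ignored in this case.
(b) If we assume a uniform distribution in the population then
$S_Y = h/\sqrt{12} = 130/\sqrt{12} \approx 37.5$, in which case $\dfrac{S_Y}{\bar{Y}} \approx 0.75$.
n =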
Example 3.8 (From Cochran p56)
In nurseries that produce young trees for sale it is advisable to estimate, in late winter or
early spring, how many healthy young trees are likely to be on hand, since this determines
policy toward the solicitation and acceptance of orders. A study of sampling methods
for the estimation of the total numbers of seedlings was undertaken by Johnson (1943).
The data that follow were obtained from a bed of silver maple seedlings 1 foot wide and
430 feet long. The sampling unit was 1 foot of the length of the bed, so that N = 430.
By complete enumeration of the bed it was found that $\bar{Y} = 19$, $S^2 = 85.6$, these being
the true population values.
How many trees must be sampled to estimate Ȳ within 10%, apart from a chance of 1
in 20? (Hint: Use the margin of error to firstly calculate the SE and hence the RSE).
Solution:
Suppose we want to estimate a proportion $P$ with a standard error of $SE$, then the
relative standard error required is $\alpha = SE/P$. Moreover, $V_Y^2 = \dfrac{(1-P)}{P}$. Substituting in
the formula for the required sample size gives
\[ n = \frac{1}{\dfrac{SE^2}{P(1-P)} + \dfrac{1}{N}} \tag{3.2} \]
If N is large this becomes
\[ n = \frac{P(1-P)}{SE^2} \]
To use this approach all we need is a rough idea of P. For a given SE using P = 0.5 is
a conservative approach.
It is important to distinguish between the required standard error (SE) and the required
relative standard error (α). Suppose that we want to estimate a proportion that we think
is roughly 20 percent, so P = 0.2. We want a confidence interval of plus or minus 2
percentage points - this is sometimes called the margin of error. Taking the margin of
error as roughly two standard errors, this corresponds to SE = 0.01, that is 1 percentage
point, and α = 0.01/0.2 = 0.05, which is a relative standard error of 5 percent (of the
20 percent). For large N this gives n = 1600.
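The figure of 1600 can be reproduced with a short function implementing equation (3.2). This is a sketch only: the function name `sample_size_for_proportion` is hypothetical, the inputs P = 0.2 and SE = 0.01 are the values from the paragraph above, and N = 10,000 is an arbitrary value added to show the effect of the finite population term.

```python
def sample_size_for_proportion(P, se, N=None):
    """Sample size for estimating a proportion with a given standard error.

    P  -- rough guess of the proportion (0.5 is the conservative choice)
    se -- required standard error, on the proportion scale (0.01 = 1 point)
    N  -- population size; None means N is treated as effectively infinite
    Implements n = 1 / (se^2 / (P(1 - P)) + 1/N); returns n without rounding.
    """
    inverse_n = se ** 2 / (P * (1 - P))
    if N is not None:
        inverse_n += 1 / N
    return 1 / inverse_n

n_text_example = sample_size_for_proportion(0.2, 0.01)       # large N
n_conservative = sample_size_for_proportion(0.5, 0.01)       # P unknown
n_finite = sample_size_for_proportion(0.2, 0.01, N=10_000)   # hypothetical N
print(n_text_example, n_conservative, n_finite)
```

Using P = 0.5 gives the largest sample size for a fixed SE, which is why it is the conservative choice when nothing is known about P.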
Example 3.9
A survey showed that 20% of Australians supported an Australia Card in 2000. A new
survey is to be run this year. The aim is to estimate the proportion with a SE of 2%.
(a) What sample size would you use?
(b) What if the aim was to get an RSE of 2%?
Solution:
Example 3.10
A survey is to be run to find out the proportion of people who use public transport, out
of a population of 400. The aim is to estimate this proportion with a SE of 5%. No
other information is given. What sample size would you use?
Solution:
3.4 Estimation of Ratios
Sometimes in surveys we wish to estimate ratios which have a numerator and a
denominator, both of which are random variables to be estimated. Examples include the
ratio of profit to employees, or the unemployment rate. This is in contrast to estimating
proportions where p = y/n; y is a random variable and n is a constant.
Suppose for each unit in the population we collect Yi , Xi and we want to estimate
$$R = \frac{\sum Y_i}{\sum X_i} = \frac{\bar{Y}}{\bar{X}},$$
which is the ratio of the variables over the whole population.
Example 1: Yi = profit
Xi = no. employees
Y /X = the ratio of profit to employees
Example 2: $Y_i = 1$ if unemployed, 0 otherwise
Xi = 1 if in Labour Force
Y /X = the unemployment rate
Note:
$$R = \frac{\sum_{i=1}^{N} Y_i}{\sum_{i=1}^{N} X_i} = \frac{\bar{Y}}{\bar{X}} \neq \frac{1}{N}\sum_{i=1}^{N}\frac{Y_i}{X_i} = \frac{1}{N}\sum_{i=1}^{N} R_i = \bar{R}$$
where $R_i = Y_i/X_i$ is the ratio for the ith population unit. That is, the ratio of
means ≠ the mean of ratios. If we want to estimate R̄ then the methods and theory
presented in the previous sections of the chapter apply with Ri replacing Yi . Hence
it is important to decide whether the characteristic of the population in which we are
interested is R or R̄. In most cases it is R.
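A small numerical sketch may help to see that R and R̄ really are different quantities. The three population units below are hypothetical values chosen only for illustration.

```python
# Hypothetical population of three units (illustration only)
Y = [10.0, 40.0, 90.0]   # e.g. profit
X = [1.0, 2.0, 3.0]      # e.g. number of employees

# Ratio of totals (equivalently, ratio of means)
R = sum(Y) / sum(X)

# Mean of the unit-level ratios R_i = Y_i / X_i
R_bar = sum(y / x for y, x in zip(Y, X)) / len(Y)

print(R, R_bar)
```

The large unit dominates R but gets the same weight as every other unit in R̄, which is why the two differ whenever the unit-level ratios vary with size.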
Theorem 3.10
Similarly for x′ .
Then
$$r = \frac{y'}{x'} = \frac{Y(1+\Delta y')}{X(1+\Delta x')} = \frac{Y}{X}(1+\Delta y')(1 - \Delta x' + \Delta x'^2 - \Delta x'^3 + \ldots)$$
therefore
$$r = R(1 + \Delta y' - \Delta x' - \Delta y'\Delta x' + \Delta x'^2 + \text{3rd order terms})$$
Take expectation of both sides.
extra steps:
where
$$V_{y'x'} = \frac{Cov(y',x')}{YX} = \frac{(1-f)S_{YX}}{n\bar{Y}\bar{X}} \qquad (3.3)$$
and where
$$Cov(y',x') = N^2(1-f)\frac{S_{YX}}{n}$$
$$S_{YX} = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\bar{Y})(X_i-\bar{X})$$
$$E(r) = R + O(n^{-1})$$
provided $V_{y'x'}$, $V_{x'}^2$ are $O(n^{-1})$, which they will be for SRSWOR. Thus the bias of the
estimate of the ratio will be small provided the ratio is not based on a small sample.
$$\frac{E(r)-R}{R} \approx V_{x'}^2 - V_{y'x'} = \frac{V(x')}{X^2} - \frac{Cov(y',x')}{YX} = V_{x'}^2\left(1-\frac{B_{y'x'}}{R}\right)$$
where $B_{y'x'} = \dfrac{Cov(y',x')}{V(x')}$. Hence the bias depends on how close $B_{y'x'}$ is to R.
Theorem 3.11
For SRSWOR, to $O(n^{-1})$, the mean square error of r is given by
Proof: As before
$$r - R = R(\Delta y' - \Delta x') + \ldots$$
where the omitted terms will be at least 3rd order when we square this.
$$(r-R)^2 = R^2(\Delta y'^2 - 2\Delta y'\Delta x' + \Delta x'^2) + \text{3rd order or higher terms}$$
$$E(r-R)^2 = R^2\left[V_{y'}^2 + V_{x'}^2 - 2V_{y'x'}\right] + \text{3rd order or higher terms}$$
Corollary 3.12
The relative mean square error of r is
$$V_r^2 = \frac{MSE(r)}{R^2} = V_{y'}^2 + V_{x'}^2 - 2V_{y'x'}$$
Corollary 3.13
$V_r^2 < V_{y'}^2$ if $V_{x'}^2 - 2V_{y'x'} < 0$, that is, if
$$\frac{1}{2}\frac{V_{x'}}{V_{y'}} < corr(y',x').$$
Corollary 3.14
For SRSWOR, to $O(n^{-1})$,
$$MSE(r) \approx \frac{1}{X^2}\left[V(y') + R^2V(x') - 2R\,C(y',x')\right]$$
Proof: Substitute for $V_{y'}^2$ etc. in Corollary 3.12
Theorem 3.15
For SRSWOR to $O(n^{-1})$, the MSE(r) may be written in terms of $S_R^2$:
$$MSE(r) \approx \frac{N^2(1-f)}{X^2 n}\left[S_Y^2 + R^2S_X^2 - 2RS_{YX}\right]$$
$$= \frac{(1-f)}{\bar{X}^2}\times\frac{1}{n}\times\frac{1}{N-1}\sum_{i=1}^{N}(Y_i - RX_i)^2$$
$$MSE(r) \approx (1-f)\frac{1}{n}\frac{S_R^2}{\bar{X}^2}$$
where $S_R^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - RX_i)^2$
(Note: If X were known, we could calculate y'/X and this would have MSE
$$(1-f)\times\frac{1}{n}\frac{S_Y^2}{\bar{X}^2}.$$
So, using the sample estimate x' in the calculation of the ratio can lead to a better
estimate of R provided $S_R^2 < S_Y^2$.)
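Hence
$$MSE(r) = \frac{N^2(1-f)}{X^2 n}\left[S_Y^2 + R^2S_X^2 - 2RS_{YX}\right]$$
write
$$S_R^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - RX_i)^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left[(Y_i-\bar{Y}) - R(X_i-\bar{X})\right]^2 = S_Y^2 + R^2S_X^2 - 2RS_{YX}$$
where the second equality uses $\bar{Y} = R\bar{X}$.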
Theorem 3.16
For SRSWOR, $S_R^2$ is estimated without bias by
$$s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - rx_i)^2$$
and the corresponding estimate of the MSE is
$$\widehat{MSE}(r) = \frac{(1-f)}{\bar{x}^2}\frac{1}{n}s_r^2.$$
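As a rough sketch of how Theorem 3.16 is used in a calculation, the following Python function computes r, $s_r^2$ and the estimated MSE from a SRSWOR sample. The function name and the tiny data set (n = 3 from N = 30) are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
def ratio_and_mse(y, x, N):
    """Sample ratio r, s_r^2 and estimated MSE(r) for a SRSWOR of size n from N."""
    n = len(y)
    f = n / N                                   # sampling fraction
    r = sum(y) / sum(x)                         # r = y'/x' = ybar/xbar
    s2r = sum((yi - r * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)
    xbar = sum(x) / n
    mse_hat = (1 - f) / xbar ** 2 * s2r / n     # Theorem 3.16
    return r, s2r, mse_hat

# Hypothetical sample values (illustration only)
y = [2.0, 4.0, 7.0]
x = [1.0, 2.0, 3.0]
r, s2r, mse_hat = ratio_and_mse(y, x, N=30)
print(r, s2r, mse_hat, mse_hat ** 0.5)
```

The square root of the estimated MSE plays the role of a standard error for r, as asked for in part (b) of the example that follows.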
Example 3.11
A SRSWOR of 6 universities is selected from a population of 36 universities to estimate
the average number of academic papers published in a year per academic staff member.
The following data are obtained:
1 263 154
2 1604 743
3 4210 1420
4 407 194
5 738 303
6 504 320
Calculate an estimate of
(a) the number of papers per academic staff member;
(b) the square root of the mean square error of this estimate.
Solution:
3.5 Helpful background information
3.5.1 Alternative formulas
For exercises involving calculations by calculator it is useful to note the following identities
$$\sum_{i=1}^{n}(y_i - rx_i)^2 = \sum_{i=1}^{n}y_i^2 + r^2\sum_{i=1}^{n}x_i^2 - 2r\sum_{i=1}^{n}y_ix_i$$
$$\sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}y_i\right)^2 = \sum_{i=1}^{n}y_i^2 - n(\bar{y})^2$$
For functions with two variables, we use the partial derivatives in a region of x close to
x = a and y close to y = b:
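$$g(x,y) = g(a,b) + \frac{(x-a)}{1!}\frac{\partial g}{\partial x} + \frac{(y-b)}{1!}\frac{\partial g}{\partial y} + \frac{(x-a)^2}{2!}\frac{\partial^2 g}{\partial x^2} + \frac{(y-b)^2}{2!}\frac{\partial^2 g}{\partial y^2} + \frac{2(x-a)(y-b)}{2!}\frac{\partial^2 g}{\partial x\,\partial y} + \ldots \qquad (3.4)$$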
Example 3.12
Use the Taylor series expansion about x = 0 to obtain an approximation to
1
f (x) = .
1+x
Solution:
3.6 References:
Cochran, W.G. (1977) Sampling Techniques, 3rd. ed.: New York; John Wiley.
Hornbeck, R. W. (1975) Numerical Methods, Quantum publishers, New York.
Chapter 4
Systematic Sampling
In the previous chapter, the theory of SRSWOR was discussed. One way to implement
a SRSWOR is to randomly order the list of population units, select a random number
between 1 and N/n and take that unit and every N/nth unit thereafter. When the
list has not been randomly ordered (i.e. it has been purposely ordered according to
a particular variable) the method is called systematic sampling. The initial random
number is called the random start and N/n is the skip interval. Because of the use of
the random start this method is still a probability sampling method - it is not purposive
selection. It is important that you do not start the selection at the first unit, unless that
happens to be the randomly selected start.
If N/n is not an integer, we can round it to an integer, that is use k = int(N/n). The
sample size will then not be exactly n, and in estimation the achieved sample size
should be used. Alternatively we can use a non-integer skip, which is a little more
complicated.
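The selection procedure described above can be sketched as follows. This is a minimal illustration using the integer skip k = int(N/n); the function name is an assumption, and units are labelled 1 to N in list order.

```python
import random

def systematic_sample(N, n, rng=random):
    """Return the selected unit numbers (1-based) for a systematic sample."""
    k = N // n                      # skip interval k = int(N/n)
    start = rng.randint(1, k)       # random start between 1 and k inclusive
    return list(range(start, N + 1, k))

# N/n is an integer here, so exactly n units are selected
units = systematic_sample(100000, 500, random.Random(1))
print(len(units), units[:3])

# N/n is not an integer: the achieved sample size depends on the random start
small = systematic_sample(10, 3, random.Random(0))
print(small)
```

With N = 10 and n = 3 the skip is 3, and a random start of 1 selects four units while starts of 2 or 3 select three, which is why the achieved sample size should be used in estimation.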
Systematic sampling is often used because of its convenience. There are only as many
different samples as there are random starts. In the above example there are only
200 samples instead of $\binom{100000}{500}$. This can be beneficial if the samples that have been
eliminated are ones that would give estimates a long way from the population value
being estimated. This will occur if we order by a variable that is related to the variables
being estimated. For example by ordering businesses by size we remove the possibility
of having all small or all large units in the sample, since it must include a unit from the
smallest k, one from the second smallest group of k etc. Another common application of
systematic sampling is to order units in a serpentine geographic way. You can think of
systematic sampling as being a weak form of stratification, although the independence of
the selection of units between strata is not fulfilled (see next chapter). In fact systematic
sampling is a special case of cluster sampling.
Provided a sensible ordering has been used, there will usually be some reduction in
sampling variance through the use of systematic sampling. Most of the time the worst
that can happen is that the ordering ends up being close to random, because there is
little or no relationship between the ordering variable and the variable of interest. The
main problem to watch out for is periodicity in the list which is related to the skip being
used, e.g. sampling production every Friday or Monday, or sampling every fourth flat in
a block of flats with four flats on each floor.
Example 4.1 (adapted from Lemeshow, S. and Levy, P.S. (2008) Ch4.)
A nurse attended to a total of 12 patients on a particular day and the time spent per
patient was recorded. The data in Table 4.1 are listed in the order the nurse saw the
patients. Table 4.2 has the same data as Table 4.1 but ordered by decreasing time spent
with the patient.
Patient   Time spent with patient
1         15
2         34
3         35
4         36
5         11
6         17
7         49
8         40
9         25
10        46
11        33
12        14
The population mean is Ȳ = 29.583 and the finite population variance is $S_Y^2 = 166.99$.
Table 4.2: Nurse visits - ordered by decreasing time
Rank   Patient   Time spent with patient
1 7 49
2 10 46
3 8 40
4 4 36
5 3 35
6 2 34
7 11 33
8 9 25
9 6 17
10 1 15
11 12 14
12 5 11
Note: In this exercise, all possible samples are determined in (b) for the purpose of
comparison. In a practical situation, only one of these would be carried out by using a
random start.
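Enumerating every possible systematic sample is easy to do by computer. The sketch below uses the times from Table 4.1 and compares the two orderings. The skip interval k = 4 (so n = 3) is an assumption made here for illustration and is not taken from the example; the helper names are also hypothetical.

```python
# Times from Table 4.1, in the order the nurse saw the patients
times = [15, 34, 35, 36, 11, 17, 49, 40, 25, 46, 33, 14]
k = 4  # assumed skip interval, giving samples of size n = 3

def systematic_means(values, k):
    """Sample mean for each of the k possible random starts."""
    return [sum(values[start::k]) / len(values[start::k]) for start in range(k)]

def variance_over_samples(means, centre):
    """Sampling variance: average squared deviation over the equally likely samples."""
    return sum((m - centre) ** 2 for m in means) / len(means)

N = len(times)
pop_mean = sum(times) / N
pop_var = sum((t - pop_mean) ** 2 for t in times) / (N - 1)

means_visit = systematic_means(times, k)                          # Table 4.1 order
means_sorted = systematic_means(sorted(times, reverse=True), k)   # Table 4.2 order

var_visit = variance_over_samples(means_visit, pop_mean)
var_sorted = variance_over_samples(means_sorted, pop_mean)
print(pop_mean, pop_var, var_visit, var_sorted)
```

Because N/k is an integer, the average of the possible sample means equals the population mean under either ordering; what changes with the ordering is the spread of those means.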
Solution: (a)
4.1 References:
Lemeshow, S. and Levy, P.S. (2008) Sampling of populations. Methods and Applications.
4th edition. New York; John Wiley.
Chapter 5
Stratified Sampling
5.1 Introduction
SRS is rarely used in practice. It is usually possible to do better. Unless VY is small you
often need large samples to get good RSEs.
Stratification involves using auxiliary information for all units in the population. We
divide the population into H mutually exclusive and exhaustive groups called strata and
then take a sample from each stratum independently of the sample in the other strata.
Examples
• faculty;
• location;
• type (UG, PG, Mature Age);
• sex;
• year of enrolment.
Figure 5.1: Stratified sampling. Source: http://simon.cs.vt.edu/SoSci/converted/Sampling/
The stratification variables must be known for each population unit before the sample
is selected. An important feature of any sampling frame is what stratification variables
are available on it. Sampling frames must also have unique unit identifiers and contact
details such as address.
• often gives lower variance for fixed cost compared with non-stratifying (i.e. SRSWOR)
However, if for some reason, the allocation is a long way from proportional or optimal,
the sampling variance can be greater than for SRSWOR.
5.2 Notation
The total number of strata is denoted by H, with individual stratum denoted by h, such
that h = 1 . . . H. The subscript h can then be attached to notation previously used for
SRSWOR.
Nh is the population size in stratum h
nh is the number of units selected SRSWOR from stratum h
Yhi is the value of ith population unit in stratum h
yhi is the value of ith unit selected from stratum h.
$N = \sum_{h=1}^{H} N_h$ is the population size
$n = \sum_{h=1}^{H} n_h$ is the sample size
$Y = \sum_{h=1}^{H}\sum_{i=1}^{N_h} Y_{hi} = \sum_{h=1}^{H} Y_h$ is the overall population total
i.e. $Y_h = \sum_{i=1}^{N_h} Y_{hi}$ is the total in stratum h
$\bar{Y}_h = Y_h/N_h$ is the population mean in stratum h
$\bar{y}_h = y_h/n_h$ is the sample mean in stratum h
5.3 Definitions and Basic Properties
Assuming SRSWOR is used within all strata, for stratum h
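$$y_h' = \frac{N_h}{n_h}\sum_{i=1}^{n_h} y_{hi} \quad\text{is unbiased for}\quad Y_h = \sum_{i=1}^{N_h} Y_{hi}$$
and
$$V(y_h') = \frac{N_h^2}{n_h}\left(1-\frac{n_h}{N_h}\right)S_h^2 \quad\text{(applying Theorem 3.4 to } y_h'\text{)}$$
where
$$S_h^2 = \frac{1}{N_h-1}\sum_{i=1}^{N_h}(Y_{hi}-\bar{Y}_h)^2$$
is the population variance in stratum h.
Also the estimate of the variance of the estimate of total for stratum h is
$$\hat{V}(y_h') = \frac{N_h^2}{n_h}\left(1-\frac{n_h}{N_h}\right)s_h^2$$
where
$$s_h^2 = \frac{1}{n_h-1}\sum_{i=1}^{n_h}(y_{hi}-\bar{y}_h)^2$$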
Theorem 5.1
Suppose yh′ is an estimate of the total of the variable of interest in stratum h. Then
$y' = \sum_{h=1}^{H} y_h'$ has:
$$E(y') = \sum_{h=1}^{H} E(y_h') \quad\text{(depends only on } y' \text{ being linear.)}$$
$$V(y') = \sum_{h=1}^{H} V(y_h') \quad\text{(since we sample independently between strata)}$$
Hence, applying Theorem 5.1 we obtain
Theorem 5.2
Using SRSWOR within strata
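$$y' = \sum_{h=1}^{H} y_h' = \sum_{h=1}^{H}\frac{N_h}{n_h}y_h \quad\text{is unbiased for}\quad Y = \sum_{h=1}^{H} Y_h$$
Also $\bar{y} = y'/N$ is unbiased for $\bar{Y}$
The sampling variance of y' is
$$V(y') = \sum_{h=1}^{H}\frac{N_h^2}{n_h}\left(1-\frac{n_h}{N_h}\right)S_h^2$$
and
$$\hat{V}(y') = \sum_{h=1}^{H}\frac{N_h^2}{n_h}\left(1-\frac{n_h}{N_h}\right)s_h^2$$
is an unbiased estimator of V(y')
2. then $\bar{y} = y'/N$ is an unbiased estimate of $\bar{Y}$.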
The probability (and expected number of times) a unit in stratum h is selected is
$\pi_{hi} = n_h/N_h$. The estimator can be written as
$$y' = \sum_{h=1}^{H}\sum_{i=1}^{n_h}\frac{1}{\pi_{hi}}y_{hi} = \sum_{h=1}^{H}\sum_{i=1}^{n_h} w_{hi}y_{hi}$$
where $w_{hi} = \pi_{hi}^{-1}$. This shows we can use unequal probabilities of selection provided
we account for it in the estimation.
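A short sketch of the stratified estimator of total and its estimated variance (Theorem 5.2), together with the equivalent weighted form, is given below. The function name and the two-stratum data set are hypothetical, chosen only to keep the arithmetic simple.

```python
def stratified_total(samples, N_h):
    """Estimate of total y' and its estimated variance under SRSWOR within strata."""
    total = 0.0
    var_hat = 0.0
    for y, Nh in zip(samples, N_h):
        nh = len(y)
        ybar = sum(y) / nh
        s2 = sum((v - ybar) ** 2 for v in y) / (nh - 1)    # s_h^2
        total += Nh * ybar                                  # y'_h = (N_h/n_h) * sum(y_hi)
        var_hat += Nh ** 2 / nh * (1 - nh / Nh) * s2        # V-hat(y'_h)
    return total, var_hat

# Hypothetical sample values in two strata (illustration only)
samples = [[2.0, 4.0, 6.0], [10.0, 14.0]]
N_h = [30, 20]
total, var_hat = stratified_total(samples, N_h)

# Same estimate written with weights w_hi = 1/pi_hi = N_h/n_h
weighted = sum((Nh / len(y)) * v for y, Nh in zip(samples, N_h) for v in y)
print(total, var_hat, weighted)
```

The weighted form is the one most survey software works with: each responding unit carries a weight, and the estimate of total is the weighted sum of the responses.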
Example 5.1 (Eg. 4.1 from Lohr)
We want to estimate the total number of acres devoted to farming in the United States.
Using the 4 census regions as strata: Northeast, North Central, South, West; a SRS of
10% of the counties in each stratum is selected. The following data are obtained:
Solution:
5.4 Allocation of Sample
An important decision to be made is how many units to select from each stratum. There
are three main ways of allocating sample to strata: proportional allocation, optimal
allocation and equal sampling variance.
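In proportional allocation,
$$n_h \propto N_h \;\Rightarrow\; n_h = \frac{N_h}{N}\times n \;\Rightarrow\; \frac{n_h}{N_h} = \frac{n}{N} = f$$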
In this design, each unit has the same chance of selection and the sampling fraction or
rate is the same in each stratum. Rounding and non-response mean that this rarely
happens exactly.
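Proportional allocation is a one-line calculation, sketched below. The function name and the three stratum sizes are hypothetical values used only for illustration.

```python
def proportional_allocation(N_h, n):
    """n_h = (N_h / N) * n for each stratum (before rounding)."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

# Hypothetical stratum sizes and total sample size
N_h = [500, 300, 200]
n_h = proportional_allocation(N_h, n=100)
fractions = [nh / Nh for nh, Nh in zip(n_h, N_h)]   # sampling fraction in each stratum
print(n_h, fractions)
```

Every stratum ends up with the same sampling fraction f = n/N, so each unit has the same selection probability.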
Example 5.2
A population consists of 2400 men and 1600 women. It is desired that the total sample
size is 10% of the population.
(a) Calculate the proportional allocation;
(b) Determine the probability of selection for each stratum.
Solution:
(a)
So allocation is calculated by
nM =
nF =
π1i =
π2i =
Substituting the expression for $n_h$ into the formula for V(y') in Theorem 5.2 gives,
$$V_{prop}(y') = \sum_{h=1}^{H} N_h^2\left(1-\frac{n}{N}\right)\frac{S_h^2}{nN_h/N} = \frac{(1-f)}{f}\sum_{h=1}^{H} N_hS_h^2 = \frac{N-n}{n}\sum_{h=1}^{H} N_hS_h^2$$
$$V_{prop}(y') = N^2\left(1-\frac{n}{N}\right)\frac{1}{n}\sum_{h=1}^{H}\frac{N_h}{N}S_h^2$$
Notice that this is of the same form as for SRSWOR in Theorem 3.4 with $S_Y^2$ replaced by
$$\sum_{h=1}^{H}\frac{N_h}{N}S_h^2,$$
Within Strata (N − H): $\sum_{h=1}^{H}\sum_{i=1}^{N_h}(Y_{hi}-\bar{Y}_h)^2 = \sum_{h=1}^{H}(N_h-1)S_h^2 = (N-H)S_W^2$
Total (N − 1): $\sum_{h=1}^{H}\sum_{i=1}^{N_h}(Y_{hi}-\bar{Y})^2 = (N-1)S^2$
respectively.
For an unstratified design:
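$$V_{SRS}(y') = N^2\left(1-\frac{n}{N}\right)\frac{S^2}{n}$$
$$= N^2\left(1-\frac{n}{N}\right)\frac{1}{n}\left[\frac{1}{N-1}\sum_{h=1}^{H}(N_h-1)S_h^2 + \frac{(H-1)}{(N-1)}S_b^2\right]$$
$$= V_{prop}(y') + O(N^{-1}) + N^2\left(1-\frac{n}{N}\right)\frac{1}{n}\frac{(H-1)}{(N-1)}S_b^2$$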
It is possible (but hard) to find a stratification such that Vprop > VSRS , but this rarely
happens in practice. Usually the worst that happens is your gains are small, but often
they are large.
Example 5.3
For the following design data
(a) Calculate the proportional allocations assuming a total sample size of 100.
(b) Round the resulting stratum sizes to integers and calculate the associated relative
standard error on the estimate of the population total Y for the two allocations.
Solution:
5.4.2 Optimal Allocation
Another way to allocate the sample to the strata is called optimal allocation. Optimal
allocation is designed to minimize the variance of estimates referring to the whole pop-
ulation. It is optimal according to a given constraint such as fixed sample size or fixed
cost.
Taking
$$n_h = \frac{N_hS_h}{\sum_h N_hS_h}\,n \qquad (5.1)$$
minimises V(y') subject to $n = \sum_{h=1}^{H} n_h$ fixed.
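We want to minimise a function subject to a linear constraint, hence we will use Lagrangian
methods.
Consider
$$F = V(y') + \lambda\left(\sum_{h=1}^{H} n_h - n\right) = \sum_{h=1}^{H}\frac{N_h^2S_h^2}{n_h} - \sum_{h=1}^{H} N_hS_h^2 + \lambda\left(\sum_{h=1}^{H} n_h - n\right)$$
$$\frac{\partial F}{\partial n_h} = -\frac{N_h^2S_h^2}{n_h^2} + \lambda = 0 \;\Rightarrow\; n_h = \frac{N_hS_h}{\sqrt{\lambda}}$$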
Use constraint to obtain λ:
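$$\sum_{h=1}^{H} n_h = \sum_{h=1}^{H}\frac{N_hS_h}{\sqrt{\lambda}}$$
$$n = \frac{1}{\sqrt{\lambda}}\sum_{h=1}^{H} N_hS_h$$
$$\sqrt{\lambda} = \frac{\sum_{h=1}^{H} N_hS_h}{n}$$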
extra steps:
$$n_h = \frac{N_hS_h}{\sum_{h=1}^{H} N_hS_h}\times n$$
nh /Nh = fh ∝ Sh
which implies using higher sampling fraction in the more heterogeneous strata, where
heterogeneity is measured by the population standard deviation, Sh .
nh ∝ Nh Sh implies putting more of the sample in the strata with high Sh and Nh .
This allocation is called an “optimal” allocation. It is only optimal for the constraints
given. Different constraints give different optimal allocations. Optimal allocation is also
called the Neyman allocation.
By construction,
Vopt ≤ Vprop .
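The comparison between optimal and proportional allocation can be checked numerically. The sketch below applies equation (5.1) and the variance formula of Theorem 5.2 to two hypothetical strata; the function names and all numbers are assumptions for illustration, and the allocations are left unrounded.

```python
def neyman_allocation(N_h, S_h, n):
    """n_h proportional to N_h * S_h, as in equation (5.1)."""
    weights = [N * S for N, S in zip(N_h, S_h)]
    total = sum(weights)
    return [n * w / total for w in weights]

def stratified_variance(N_h, S_h, n_h):
    """V(y') from Theorem 5.2 for a given allocation."""
    return sum(N ** 2 / m * (1 - m / N) * S ** 2 for N, S, m in zip(N_h, S_h, n_h))

# Hypothetical design information
N_h = [600, 400]
S_h = [10.0, 30.0]
n = 100

n_opt = neyman_allocation(N_h, S_h, n)
n_prop = [n * N / sum(N_h) for N in N_h]
v_opt = stratified_variance(N_h, S_h, n_opt)
v_prop = stratified_variance(N_h, S_h, n_prop)
print(n_opt, n_prop, v_opt, v_prop)
```

Here the more variable stratum receives two thirds of the sample even though it contains only 40 percent of the population, and the variance of the estimated total falls accordingly.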
Proportional allocation is the same as optimal allocation if Sh are constant i.e. same
for all strata. This can happen for geographic strata and a 0/1 variable, since then
Sh2 ≈ Ph (1 − Ph ), and will be approximately constant if the proportions do not vary
much across areas. If the variances within each stratum differ, then optimal allocation
will give a smaller sampling variance of the estimate for the whole population than
proportional allocation.
To calculate the optimal allocation you need information to estimate or guess the values
of Sh .
Corollary 5.4
For an optimal allocation the sampling variance of y ′ is
$$V_{opt}(y') = \frac{\left(\sum_{h=1}^{H} N_hS_h\right)^2}{n} - \sum_{h=1}^{H} N_hS_h^2$$
Note: if rounding has been used to obtain $n_h$, the actual sampling variance is not exactly
as given in Corollary 5.4. It may be wiser to use the formula given in Theorem 5.2. This
comment also applies for proportional allocation.
Fixed cost
Costs may vary across strata. To take this into account a simple cost function can be
used as an approximation to the real cost structure, such as
$$\text{Cost} = C_0 + \sum_{h=1}^{H} C_hn_h$$
The same sort of approach gives the optimal allocation for fixed cost
$$n_h = n\cdot\frac{N_hS_h/\sqrt{C_h}}{\sum_h N_hS_h/\sqrt{C_h}}$$
which leads to putting less sample in the strata which are more expensive to enumerate.
The value of n is obtained from the cost constraint. This gives the final allocations as
Corollary 5.5 The optimal allocation for a linear cost function is given by
$$n_h = (\text{Cost} - C_0)\cdot\frac{N_hS_h/\sqrt{C_h}}{\sum_h \sqrt{C_h}N_hS_h}$$
Because cost enters the equation in terms of $\sqrt{C_h}$, there has to be quite a degree of
variation in costs between strata before it is worthwhile taking them into account.
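Corollary 5.5 can be sketched in code as follows. The function name, the budget and the per-unit costs are hypothetical values chosen for illustration; the final check confirms that the allocation spends exactly the available budget.

```python
import math

def cost_optimal_allocation(N_h, S_h, C_h, budget, C0=0.0):
    """Allocation of Corollary 5.5 for the cost function C0 + sum(C_h * n_h)."""
    numer = [N * S / math.sqrt(C) for N, S, C in zip(N_h, S_h, C_h)]
    denom = sum(math.sqrt(C) * N * S for N, S, C in zip(N_h, S_h, C_h))
    return [(budget - C0) * a / denom for a in numer]

# Hypothetical design information: stratum 2 is four times as costly per unit
N_h = [600, 400]
S_h = [10.0, 30.0]
C_h = [1.0, 4.0]
n_h = cost_optimal_allocation(N_h, S_h, C_h, budget=1200.0, C0=200.0)
total_cost = 200.0 + sum(C * m for C, m in zip(C_h, n_h))
print(n_h, total_cost)
```

With equal costs this allocation would put twice as much sample in stratum 2 as in stratum 1; quadrupling the unit cost in stratum 2 only halves its relative share, which illustrates the square-root effect noted above.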
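$$n_h = \left(\frac{\alpha_h^2}{V_{Yh}^2} + \frac{1}{N_h}\right)^{-1} \approx \frac{V_{Yh}^2}{\alpha_h^2} = \frac{S_{Yh}^2/\bar{Y}_h^2}{\alpha_h^2}$$
Note that if $S_{Yh}^2/\bar{Y}_h^2$ and $\alpha_h^2$ are equal across strata, then this implies equal $n_h$.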
5.4.4 Power Allocation
Sometimes in a survey, reliable estimates are required at both the national level and
for regional areas. Bankier (1988), provides a simple allocation method which allows a
compromise between Neyman or optimal allocation and equal allocation. He calls the
allocation method a power allocation.
To determine sample sizes for each stratum, the loss function F, given by
$$F = \sum_h\left(X_h^q\,\frac{\sqrt{V(y_h')}}{Y_h}\right)^2 = \sum_h X_h^{2q}\,\frac{V(y_h')}{Y_h^2}$$
is minimised subject to the constraint $\sum_h n_h = n$. $X_h$ is some measure of size or
importance of stratum h (could be $Y_h$ or $N_h$), and q is a constant in the range 0 ≤ q ≤ 1
and is called the power of the allocation. The result is as follows:
$$n_h = n\,\frac{S_{Yh}X_h^q/\bar{Y}_h}{\sum_h S_{Yh}X_h^q/\bar{Y}_h}$$
Note that if q is set to 1 and $X_h = Y_h$, then the result for $n_h$ is the Neyman allocation.
If q = 0, and $S_{Yh}/\bar{Y}_h$ are not equal, then the allocation for $n_h$, given n, can be determined
by
$$n_h = n\,\frac{S_{Yh}/\bar{Y}_h}{\sum_h S_{Yh}/\bar{Y}_h}.$$
If q = 0, and if the $S_{Yh}/\bar{Y}_h$ are similar between strata, then the allocation becomes
equal $n_h = n/H$ (see Section 5.4.3). A compromise between optimal allocation and
equal allocation can be achieved by setting q to a value between 0 and 1 (Bankier, p174,
1988). In practice, a value of q = 0.5 is often used.
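The effect of the power q can be explored with a small sketch of the allocation formula above. The function name and the design values are hypothetical; the example takes $X_h = Y_h = N_h\bar{Y}_h$ so that q = 1 can be compared with the Neyman allocation.

```python
def power_allocation(S_h, Ybar_h, X_h, n, q):
    """n_h proportional to (S_h / Ybar_h) * X_h**q."""
    weights = [S / Yb * X ** q for S, Yb, X in zip(S_h, Ybar_h, X_h)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical design information for two strata
N_h = [600, 400]
S_h = [10.0, 30.0]
Ybar_h = [50.0, 100.0]
Y_h = [N * Yb for N, Yb in zip(N_h, Ybar_h)]    # stratum totals, used as X_h

alloc_q1 = power_allocation(S_h, Ybar_h, Y_h, n=100, q=1.0)    # Neyman
alloc_q0 = power_allocation(S_h, Ybar_h, Y_h, n=100, q=0.0)    # driven by S_h/Ybar_h only
alloc_half = power_allocation(S_h, Ybar_h, Y_h, n=100, q=0.5)  # compromise
print(alloc_q1, alloc_q0, alloc_half)
```

Intermediate values of q move the allocation smoothly between the two extremes, which is the compromise between national and regional accuracy described above.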
• where does the design information that allows calculation of Sh come from?
– use a related variable (e.g. calculate Sh for stratification variable say employ-
ment and hope it is good for the variable of interest (say turnover) or use past
or pilot data to build a regression model between say turnover and employ-
ment. Using this model we can relate Sh for turnover to Sh for employment.)
Problems arise in balancing the use of the different stratification variables available since
the number of strata can become very large and we need at least 2 responding units per
stratum for variance estimation. If we use ratio estimation within strata, a minimum of
6 respondents in each stratum may be required.
e.g. n = 2000 retailers, stratification 8 (states) × 15 (industry types) × 4 (sizes) = 480
strata, allowing only an average of 4 selections per stratum.
5.7 Choosing Stratum Boundaries
In many surveys, the stratum boundaries are determined by the information available.
If the variable chosen is a size variable such as number of employees or turnover in a
business survey, the boundaries may be chosen to minimise the variance of the estimates.
But since Ȳh , Sh depend on the boundaries, this would have to be solved iteratively.
• divide the population into “fine” substrata according to the size variable. Calculate
$\sqrt{d_kf_k}$ where $d_k$ = width of kth interval and $f_k$ = number of population units in
the interval
• cumulate $\sqrt{d_kf_k}$
• select the ‘boundaries’ so that $\sum_{k\in h}\sqrt{d_kf_k}$ is approximately constant.
Number of    d_k    f_k       √(d_k f_k)   cumulative Σ
employees
0-4          5      10,000    223.6        223.6
5-9          5      4,000     141.4        365.0
10-14        5      2,000     100          465
15-19        5      1,000     70.7         535.7
20-29        10     300       54.8         590.5
30-39        10     120       34.6         625.1
40-49        10     100       31.6         656.7
50-69        20     80        40           696.7
70-89        20     30        24.5         721.2
90-99        10     4         6.3          727.5
100+                4
727.5/3 ≈ 242.5
⇒ stratum 1 (0 to 4)
stratum 2 (5 to 14)
stratum 3 15+
(We would probably take those over 100 as a completely enumerated stratum.)
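The tabulation above is easy to reproduce by computer. The sketch below uses the class widths and counts from the table, leaves out the open-ended 100+ class (it has no width and would probably be completely enumerated), and picks the class whose cumulative value is closest to each target. The variable names and the "closest class" rule are assumptions made for this illustration.

```python
import math

# (class label, width d_k, count f_k) taken from the table; 100+ class omitted
classes = [("0-4", 5, 10000), ("5-9", 5, 4000), ("10-14", 5, 2000),
           ("15-19", 5, 1000), ("20-29", 10, 300), ("30-39", 10, 120),
           ("40-49", 10, 100), ("50-69", 20, 80), ("70-89", 20, 30),
           ("90-99", 10, 4)]
H = 3  # number of strata wanted

# Cumulate sqrt(d_k * f_k)
cum = []
running = 0.0
for _, d, f in classes:
    running += math.sqrt(d * f)
    cum.append(running)

# Targets at 1/H, 2/H, ... of the total; choose the class closest to each target
targets = [cum[-1] * j / H for j in range(1, H)]
cuts = [min(range(len(cum)), key=lambda i: abs(cum[i] - t)) for t in targets]
upper_classes = [classes[i][0] for i in cuts]   # last class in strata 1 .. H-1
print(round(cum[-1], 1), upper_classes)
```

The classes returned are the last classes of strata 1 and 2; everything above the second cut forms stratum 3.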
In practice, this method is applied to the stratification variable. This procedure produces
approximately optimal stratum boundaries for the stratification variable and we hope
these are good for the variable of interest also.
Other approximations suggested are:
• choose boundaries such that $N_h(y_h - y_{h-1})$ is approximately constant. In this case, if the
distribution within each stratum is approximately constant (i.e. roughly uniform), then
$S_h \propto (y_h - y_{h-1})$ and again, the optimal allocation is approximately equal sample within each stratum.
This suggests that a rough check that stratum boundaries are close to optimal, is if the
optimal allocation is close to equal sample numbers per stratum.
In practice, it is more important to have a near optimal allocation than optimal bound-
aries.
Explanation of Dalenius and Hodges method
so
$$N_h \approx Nf_h(y_h - y_{h-1})$$
assuming approximate uniform distribution in the interval $(y_{h-1}, y_h)$.
$$S_h \approx \frac{1}{\sqrt{12}}(y_h - y_{h-1})$$
for small intervals. Now multiplying respective sides by $N_h \approx Nf_h(y_h - y_{h-1})$, gives
$$N_hS_h = \frac{N}{\sqrt{12}}f_h(y_h - y_{h-1})^2.$$
Hence, taking the sum over h, and then dividing by N,
$$\sum N_hS_h \approx \frac{N}{\sqrt{12}}\sum_h f_h(y_h - y_{h-1})^2$$
$$\sum W_hS_h = \frac{1}{\sqrt{12}}\sum_h f_hd_h^2$$
5.8 References:
Bankier, Michael, D. (1988) Power Allocations: Determining Sample Sizes for
Subnational Areas. The American Statistician, Volume 42, number 3.
Chapter 6
Ratio Estimation
$$y' = \frac{N}{n}\sum_{i=1}^{n} y_i$$
We have shown that this estimate is unbiased for Y , and we have determined the sam-
pling variance of this estimator (theorem 3.4).
Sometimes, there is available useful information such as another variable that can be
used to determine an alternative estimator of the population total, Y . The additional
variable is often referred to as the auxiliary or benchmark variable. We will assume that
for each unit, in addition to the variable of interest Yi , we have some auxiliary variable
Zi , known for each unit in the population. We could therefore use Zi in the design
as a stratification variable, but we will consider how we might use this information in
estimation. We can calculate
$$Z = \sum_{i=1}^{N} Z_i$$
which is the population total of the auxiliary variable, and similarly the sample total is
given by
$$z = \sum_{i=1}^{n} z_i.$$
The ratio Z/z is a check on how well the sampling worked. If Z/z is very different from
N/n then it suggests the sample has under or over-represented the smaller units. To
compensate, multiply by
$$\frac{Z}{N}\Big/\frac{z}{n} = \frac{Z}{z'} = \frac{Z}{\frac{N}{n}z}$$
This suggests we weight by Z/z instead of N/n, resulting in the ratio estimator,
which will be denoted by y'':
$$y'' = \frac{Z}{N}\Big/\frac{z}{n}\times y' = \frac{Z}{z}\sum_{i=1}^{n} y_i = Z\frac{y'}{z'} = Zr \quad\text{where}\quad r = \frac{y'}{z'}.$$
The latter form shows that y ′′ is the sample ratio of the variable of interest to the
benchmark variable multiplied by the benchmark total.
For example, if y = turnover, z = employment, then to estimate total turnover we
estimate the turnover to employment ratio from the sample and multiply it by the total
employment for the population which is available from some other source.
Other examples,
y z
turnover past employment
employment now past employment
earnings past employment
retail sales past retail sales
Often the benchmark variable is the same variable at a previous point of time, in which case r is
the growth rate or factor.
Since y ′′ is just a ratio multiplied by a constant, the properties of a ratio estimator follow
immediately from those of a ratio, considered in section 3.4, with z replacing x in the
formulas.
Proof: We saw previously in theorem 3.10:
where R = Y /Z hence
$$E(y'') - Y = \frac{N^2(1-f)}{nZ}S_Z^2\left[R - \frac{S_{YZ}}{S_Z^2}\right]$$
Note the bias is $O(n^{-1})$, so it might be important if n is small - this becomes an issue
for within stratum ratio estimation (see section 6.3).
Theorem 6.2
The mean square error of the ratio estimator y ′′ is given by
Proof: follows from theorem 3.11, i.e. $MSE(r) = R^2(V_{y'}^2 + V_{z'}^2 - 2V_{y'z'})$.
Corollary 6.3
The relative MSE of y'' is
$$V_{y''}^2 = \frac{MSE(y'')}{Y^2} = V_r^2 = V_{y'}^2 + V_{z'}^2 - 2V_{y'z'}$$
The results so far apply no matter what design and estimation methods are used, pro-
vided E(y ′ ) = Y , E(z ′ ) = Z and the higher order terms of the Taylor Series expansion
can be ignored.
Theorem 6.4
For a SRSWOR of size n drawn from a population of size N, to $O(n^{-1})$
$$MSE(y'') = N^2(1-f)\frac{1}{n}\left(S_Y^2 + R^2S_Z^2 - 2RS_{YZ}\right)$$
$$= N^2(1-f)\frac{1}{n}\left(\frac{1}{N-1}\sum_{i=1}^{N}(Y_i - RZ_i)^2\right)$$
$$MSE(y'') = N^2(1-f)\frac{1}{n}S_R^2$$
6.2.1 Comparison of Ratio Estimator with Number Raised Estimator
We now have two estimators for the population total: the number raised estimator (y ′ )
and the ratio estimator (y ′′ ). We can compare their mean square errors to determine
which estimator is the better choice.
We have already seen that (see Corollary 3.13)
$$V_r^2 < V_{y'}^2 \quad\text{if}\quad \rho_{y'z'} > \frac{1}{2}\frac{V_{z'}}{V_{y'}}$$
and so, since $V_{y''}^2 = V_r^2$, this condition also implies $V_{y''}^2 < V_{y'}^2$.
This condition relates to $\rho_{y'z'}$, the correlation between the two estimators y', z'. For a
SRSWOR design
$$\rho_{y'z'} = \frac{S_{YZ}}{S_YS_Z} = \rho_{YZ}$$
where $S_{YZ} = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\bar{Y})(Z_i-\bar{Z})$ and so the condition is equivalent to
$$\rho_{YZ} > \frac{1}{2}\frac{V_Z}{V_Y},$$
since use of ratio estimation when the condition is not satisfied will lead to $V_{y''}^2 > V_{y'}^2$.
To check if ratio estimation is better than number raised estimation we can check the
correlation condition. Alternatively, calculate the variance of the ratio estimator, or an
estimator of it, corresponding to theorem 6.2 (or theorem 6.4) and compare it with the
variance of the number raised estimator (theorem 3.4).
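Both checks can be sketched in code. The function below computes the number raised and ratio estimates, plug-in estimates of their variance and MSE (sample quantities $s_y^2$ and $s_r^2$ replacing $S_Y^2$ and $S_R^2$ in Theorems 3.4 and 6.4), and the sample version of the correlation condition. The function name, the data and the benchmark total Z = 66 are hypothetical and used only for illustration.

```python
import math

def compare_estimators(y, z, N, Z):
    """Number raised vs ratio estimation of a total from a SRSWOR sample."""
    n = len(y)
    f = n / N
    ybar, zbar = sum(y) / n, sum(z) / n
    r = sum(y) / sum(z)
    s2y = sum((a - ybar) ** 2 for a in y) / (n - 1)
    s2z = sum((b - zbar) ** 2 for b in z) / (n - 1)
    syz = sum((a - ybar) * (b - zbar) for a, b in zip(y, z)) / (n - 1)
    s2r = sum((a - r * b) ** 2 for a, b in zip(y, z)) / (n - 1)
    return {
        "number_raised": N * ybar,
        "ratio": Z * r,
        "var_number_raised": N ** 2 * (1 - f) * s2y / n,   # Theorem 3.4, s_y^2 plugged in
        "mse_ratio": N ** 2 * (1 - f) * s2r / n,           # Theorem 6.4, s_r^2 plugged in
        "rho": syz / math.sqrt(s2y * s2z),                 # sample correlation
        "threshold": 0.5 * (math.sqrt(s2z) / zbar) / (math.sqrt(s2y) / ybar),
    }

# Hypothetical sample and benchmark total (illustration only)
res = compare_estimators(y=[2.0, 4.0, 7.0], z=[1.0, 2.0, 3.0], N=30, Z=66.0)
print(res)
```

In this illustration the sample correlation is well above the threshold and the estimated MSE of the ratio estimate is far smaller than the estimated variance of the number raised estimate, so the two checks agree.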
Example 6.1
A SRSWOR of 5 large retailers is selected from a population of 357 and the following
data are obtained
Sample Retailer No. Employees Turnover($millions)
1 1,050 169
2 1,270 163
3 608 120
4 829 94
5 1,509 263
It is known that the total number of employees in the population is 370,128.
(a) Based on the data obtained, do you think ratio estimation using the number of em-
ployees as the benchmark variable would be better than number raised estimation
for estimating the total turnover of the population of 357 large retailers? Justify
your answer without doing any formal calculations.
(b) Calculate the number raised estimate of total turnover and an estimate of the
sampling variance of this estimate.
(c) Calculate the ratio estimate of total turnover using the number of employees as
the benchmark variable and an estimate of the MSE of this estimate.
(d) Compare your answers to parts (b) and (c). Is ratio estimation better than number
raised estimation in this example? Justify your answer. Does this agree with your
response to part (a)?
(f) Calculate the estimate of the average turnover per employee and an estimate of its
standard error.
Solution:
6.3 Ratio Estimation Under a Super-population Model
To gain some insight into when ratio estimation is useful, we assume the population
values follow a “super-population” model. In this approach we assume that the popula-
tion values are selected from some “super population” or generated by some stochastic
process giving the population values
Y1 ,..,YN
Z1 ,...,ZN
A sample is selected giving the sample values
y1 ,..,yn
z1 ,...,zn
In this case we will assume that the population values are generated from a linear
regression model
Yi = α + βZi + ϵi
where
Eξ [ϵi |Zi ] = 0
The ξ subscript is used to denote taking expectations over the superpopulation or
stochastic process involved in generating the population values. Then
$$E_\xi[Y|Z] = N\alpha + \beta Z \qquad \left(\text{recall } Y = \sum_{i=1}^{N} Y_i\right)$$
$$E_\xi[\bar{Y}|\bar{Z}] = \alpha + \beta\bar{Z} \qquad (6.1)$$
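Now
$$S_{YZ} = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\bar{Y})(Z_i-\bar{Z})$$
$$= \frac{1}{N-1}\sum_{i=1}^{N}(\alpha + \beta Z_i + \epsilon_i - \alpha - \beta\bar{Z} - \bar{\epsilon})(Z_i-\bar{Z})$$
$$= \frac{1}{N-1}\sum_{i=1}^{N}\left(\beta(Z_i-\bar{Z}) + (\epsilon_i-\bar{\epsilon})\right)(Z_i-\bar{Z}).$$
Taking ξ expectations gives:
$$E_\xi[S_{YZ}] = \frac{\beta}{N-1}\sum_{i=1}^{N}(Z_i-\bar{Z})^2 + \frac{1}{N-1}E_\xi\left[\sum(\epsilon_i-\bar{\epsilon})(Z_i-\bar{Z})\right] = \beta S_Z^2$$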
We can also use this model to explain the meaning of the condition:
$$\rho_{YZ} > \frac{1}{2}\frac{S_Z/\bar{Z}}{S_Y/\bar{Y}}, \quad\text{which is equivalent to}\quad S_{YZ} > \frac{1}{2}S_Z^2\frac{\bar{Y}}{\bar{Z}}.$$
Replacing terms by the ξ-expectation gives:
$$\beta S_Z^2 > \frac{\frac{1}{2}S_Z^2(\alpha + \beta\bar{Z})}{\bar{Z}}$$
i.e. $\beta\bar{Z} > \frac{1}{2}(\alpha + \beta\bar{Z})$ (assuming $\bar{Z} > 0$)
$$\beta\bar{Z} > \alpha$$
Since this condition relates to $E_\xi(V_{y''}^2)$, then the ratio estimate is superior, in expectation,
when the condition is fulfilled, even if there is some expected bias due to a non-zero
intercept. These results make no assumption about $V_\xi(\epsilon_i|Z_i)$.
The use of a statistical model for the population values to guide us in understanding the
properties of design based estimators is called model assisted sampling.
The ratio estimator can be thought of as arising from a regression through the origin. It
can also be written as
$$y'' = y' + \frac{y'}{z'}(Z - z').$$
We can generalise this to the regression estimator
ŷreg = y ′ + β̂(Z − z ′ ),
where β̂ is some estimator of the slope of the linear regression relating the variable of
interest to the auxiliary variable. This approach can be easily generalised to include
information about several auxiliary variables and more complex designs.
• stratum by stratum
$${}_sy'' = \sum_{h=1}^{H}\frac{y_h'}{z_h'}Z_h = \sum_{h=1}^{H} y_h''$$
• across stratum
$${}_ay'' = Z\frac{y'}{z'} = Z\frac{\sum_{h=1}^{H} y_h'}{\sum_{h=1}^{H} z_h'}$$
Theorem 6.5
$$MSE({}_sy'') = \sum_{h=1}^{H} MSE(y_h'') \quad\text{(by linearity and independence between strata)}$$
$$= \sum_{h=1}^{H} N_h^2\left(\frac{1-f_h}{n_h}\right)\left(S_{hY}^2 + R_h^2S_{hZ}^2 - 2R_hS_{hYZ}\right) \quad\text{(apply Thm 6.4 within strata)}$$
$$= \sum_{h=1}^{H} N_h^2\left(\frac{1-f_h}{n_h}\right)S_{hR}^2 \quad\text{where}\quad S_{hR}^2 = \frac{1}{N_h-1}\sum_{i=1}^{N_h}(Y_{hi} - R_hZ_{hi})^2$$
Note that $Y_{hi}$ is the value for the ith unit in the hth stratum. This formula for the MSE
involves a separate ratio for each stratum $R_h = Y_h/Z_h$.
$$s_{hr}^2 = \frac{1}{n_h-1}\sum_{i=1}^{n_h}(y_{hi} - r_hz_{hi})^2$$
Theorem 6.6
This formula is similar to $MSE({}_sy'')$ but with R replacing $R_h$ for each stratum.
$$\frac{1}{N_h-1}\sum_{i=1}^{N_h}(Y_{hi} - RZ_{hi})^2$$
by
$$\frac{1}{n_h-1}\sum_{i=1}^{n_h}(y_{hi} - r'z_{hi})^2 \quad\text{where}\quad r' = \frac{y'}{z'}.$$
Which is better?
Consider
Now apply the result on the bias of the ratio estimator within strata. We can show that:
$$\text{Bias}(y_h'') = E[y_h''] - Y_h = N_h^2\frac{(1-f_h)}{n_hZ_h}\left(R_hS_{hZ}^2 - S_{hYZ}\right).$$
Hence,
$$MSE({}_ay'') - MSE({}_sy'') = \sum_{h=1}^{H} N_h^2\frac{(1-f_h)}{n_h}\left((R-R_h)^2S_{hZ}^2\right) + \sum_{h=1}^{H} 2(R_h - R)Z_h\left(-\text{Bias}(y_h'')\right).$$
We have seen before that the bias of the ratio estimator depends on $n^{-1}$, hence Bias($y_h''$)
depends on $n_h^{-1}$. This suggests that if the stratum sample sizes are reasonable, the
difference in MSE depends mainly on the variation of the $R_h$. If the $R_h$ vary a lot then
${}_sy''$ would have lower MSE. But if the sample sizes within strata are small, as is often
the case, the second term may be important and ${}_ay''$ may have smaller MSE than ${}_sy''$.
To determine which is better it is necessary to do the calculations.
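As a starting point for such calculations, the two point estimates can be computed as sketched below. The function name, the dictionary layout and the two-stratum data are hypothetical; the stratum ratios in the illustration are deliberately quite different (2 and 4), so the two estimates disagree noticeably.

```python
def separate_and_combined(strata):
    """Stratum-by-stratum and across-stratum ratio estimates of the total of y."""
    separate = 0.0
    y_total = z_total = Z_total = 0.0
    for s in strata:
        n_h = len(s["y"])
        y_dash = s["N"] / n_h * sum(s["y"])     # number raised estimate y'_h
        z_dash = s["N"] / n_h * sum(s["z"])     # number raised estimate z'_h
        separate += s["Z"] * y_dash / z_dash    # adds y''_h = Z_h * y'_h / z'_h
        y_total += y_dash
        z_total += z_dash
        Z_total += s["Z"]
    combined = Z_total * y_total / z_total      # Z * y' / z'
    return separate, combined

# Hypothetical strata: N = stratum size, Z = known benchmark total, y and z = sample values
strata = [
    {"N": 10, "Z": 50.0, "y": [2.0, 4.0], "z": [1.0, 2.0]},
    {"N": 20, "Z": 200.0, "y": [30.0, 50.0], "z": [10.0, 10.0]},
]
separate, combined = separate_and_combined(strata)
print(separate, combined)
```

The stratum-by-stratum version needs the benchmark total $Z_h$ for every stratum, whereas the across-stratum version only needs the overall total Z.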
In practice there is a conflict between using across stratum ratio estimation with fine
stratification and stratum-by-stratum ratio estimation with a broad stratification. Moreover,
once stratification is introduced, the condition for ratio estimation to be better than
number raised changes. For stratum by stratum it becomes:
$$\rho_{hYZ} > \frac{\frac{1}{2}S_{hZ}/\bar{Z}_h}{S_{hY}/\bar{Y}_h}.$$
• If the appropriate conditions are not met, ratio estimation can be worse than
number raised. You need to watch out for defunct units and zeros; these reduce
correlations a lot.
• We can use ratio estimation in some strata in which it is beneficial and not in
others.
• Do not need $Z_i$ to be known for all population units, just need Z and $z = \sum_{i=1}^{n} z_i$
(unlike stratification or PPS selection methods).
• Can use ratio estimation for some variables, but not others (c.f. stratification
which affects all variables).
• Leads to a minimum sample size constraint of say 5 or 6 per stratum, because of the
potential bias in small samples.
6.6 Additional Reading:
Cochran (1977), Sections 6.1 to 6.12.
Lohr (1999), Sections 3.1, 3.2, 3.4.
Chapter 7
In this chapter, a brief description of cluster sampling and multi-stage sampling will be
given. The theory behind these designs will be covered in your next course in sampling.
Each population unit must be uniquely identified with one and only one cluster through
well constructed and applied coverage rules.
We use cluster and multistage sampling for one or both of the following reasons:
(i) a suitable sampling frame of population units does not exist but a list of clusters
does;
(ii) cost - a clustered sample is usually less costly than an unclustered sample of the
same size in terms of population units.
For cluster sampling, the probability a population unit is selected is the probability the
cluster containing the unit is selected.
For example, if we want to determine how many computers are owned per household in
a community of 10,000 households, we could take a simple random sample
of households. Alternatively, we could take a sample of CDs within the community and
then survey every household in the selected CDs. The CDs are the primary sampling
units and the household is the population unit.
However, for many variables there is often the penalty of higher sampling variances than
for a simple random sample with the same sample size. This is due to the tendency of
members within a cluster to be similar while large differences can occur between clusters.
In practice, the size of a cluster sample often needs to be larger than that for a simple
random sample in order to compensate for the higher sampling variance.
We would prefer if the clusters are as heterogeneous as possible, but many of the clusters
that arise naturally are reasonably homogeneous. This contrasts with stratification
where we want homogeneous strata. This is because in stratified sampling we include all
strata in the sample and hence eliminate the between strata component of variance. In
cluster sampling, we eliminate the within cluster component of variance, since all units
in a cluster are selected but only a sample of clusters is taken.
The design effect of a cluster sample will be large when the clusters are very homogeneous
or, in many cases, when the clusters are large. In both these situations consideration
may be given to including only a sample of population units from each selected cluster.
The money saved by including only a sample of population units from each selected
cluster may then be spent by including more clusters in the sample. Because of the
costs involved with selecting clusters the total number of population units in the sample
will be reduced but the sample will be more spread and this may compensate for the
reduced sample size leading to estimates with smaller sampling variance. One of the
main problems in designing such samples is to determine what size subsample to take
to optimally balance cost and sampling variance.
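The trade-off described above can be sketched numerically. The Python snippet below uses a standard approximation for the design effect with equal-sized clusters, deff = 1 + (m - 1) * rho, where m is the number of units taken per cluster and rho is the intra-cluster correlation. This formula is not given in these notes and is introduced here as an assumption; the sample sizes and the value of rho are hypothetical.

```python
def design_effect(units_per_cluster, icc):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (units_per_cluster - 1) * icc


def effective_sample_size(n_units, units_per_cluster, icc):
    """Size of the simple random sample with roughly the same sampling variance."""
    return n_units / design_effect(units_per_cluster, icc)


icc = 0.05  # hypothetical intra-cluster correlation

# Option 1: 20 whole clusters of 50 units each (1000 units in total).
whole_clusters = effective_sample_size(1000, 50, icc)

# Option 2: 80 clusters with 10 units subsampled in each (800 units in total).
subsampled = effective_sample_size(800, 10, icc)

print(round(whole_clusters), round(subsampled))
```

Under these assumptions the subsampled design has fewer units in total but a larger effective sample size, because the sample is spread over more clusters.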
In many situations, the problems of compiling lists of population units and travel between
selected population units are present even within selected first stage units. Consideration
is thus given to selecting the sample of population units within selected first stage units
by grouping the population units into second stage units, a sample of which is selected.
The population units are then selected from selected second stage units. This is called
three stage sampling. Clearly this process can be continued to any number of stages.
A multistage sample can be defined as one which is selected in stages, the sample units
at each stage being subsampled from the larger units chosen at the previous stage. At the
first stage the entire population is divided into first stage (or primary sampling) units.
At each successive stage smaller sampling units are defined within those selected at the
previous stage and a further selection is made within each of them. At each stage a list
of units from which the selections are to be made is required only within units selected at
the previous stage.
In multistage sampling the probability that a population unit is selected is the proba-
bility the cluster, i.e. PSU, containing the unit is selected multiplied by the conditional
probability the unit is selected given the cluster it is in is selected.
Multistage sampling is especially important where the population units are geographi-
cally spread and there is no list of them. The units of selection are then usually areas
of land and this is called area sampling.
The set of all selected population units in a selected PSU is sometimes called an ultimate
cluster.
Multistage sampling is a very flexible technique since many aspects of the design have
to be chosen; including the number of stages and for each stage
• the method of selection (eg PPS or equal probability, systematic or simple random)
Moreover, stratification and ratio estimation may be used. This flexibility means that
there is large scope for meeting the demands of a particular survey in the most efficient
way and hence good opportunity for the sampling statisticians to practice their craft.
Appendix A
University of Wollongong
Both classes of investigations can give evidence of association between variables, but
only a controlled experiment can give evidence of causation, provided other factors have
been properly accounted for in the design of the experiment.
An important class of observational studies are sample surveys which if done properly will
represent the population from which the sample is selected well, i.e. have strong external
validity. They permit analysis of relationships over a large number of different groups in
the population. There are issues with internal validity because of the self-selection of the
treatment or independent variables and lack of control of other factors. Other important
types of observational studies are retrospective and prospective studies. Observational
studies can suffer from lack of representation of the population, at worst even self-selection
of inclusion in the study, i.e. volunteers. Experiments and observational studies
can play complementary roles in investigating an issue.
Surveys are usually conducted to provide a description of a population. This usually
involves estimation of features of the population such as totals, means, proportions,
the number of units in various categories and ratios. Often the major outputs from a
survey are a number of tables. Surveys can also be used for analytical purposes such
as investigating the association between two or more variables. In this situation we
are usually interested in associations that apply more generally than just the particular
population surveyed at a particular time.
A.2.1 Steps in the Survey Process
Survey development
• determine objectives
• determine resources available and constraints
• review alternative sources of information
• specify population of interest
• identify research issues
• decide data items and classifications
• determine precision required
• decide type of investigation needed
• determine collection method
• develop collection instrument
• specify sampling method
• develop and plan survey operations
Survey operations
• calculation of estimates
• production of tables, charts and diagrams
• identifying important subgroups and relationships
• calculation of sampling errors
• report preparation
Evaluation
The relative importance of these steps will vary between projects. Some sampling related
issues are discussed here.
A.3 Specifying the Population of Interest
The specification of the population should clearly define the group about which we wish
to make conclusions. It should cover the definition of units, scope, geographic coverage,
and reference period.
For example, suppose we wish to survey Doctors in the Illawarra. We must first decide
exactly what a Doctor is. Exactly what constitutes the Illawarra? Do we want Doctors
who live in the Illawarra or those that work in the Illawarra? Is it the actual Doctors we
are interested in or their practices or offices? What period are we concerned with, a
particular week or a financial year? If the latter, are Doctors that only practice part of
the year included? The answers to these questions depend on the purposes of the study.
When initially defining the population we should not overly concern ourselves about the
feasibility of obtaining information on all or a sample of the population, although as we
develop the survey we may have to define a survey population that does not correspond
precisely to the target population. Availability of data may influence the definition of
the unit.
To conduct a survey we must be able to identify the units in the population and include
all or a sample of them. This means we have to have access to, or construct, a sampling
frame. In most cases the frame will be a list of the population units and some way of
contacting them, such as a list of all businesses and their address and contact names,
positions or telephone numbers. Often the list available does not correspond to the
target population and we must decide if we can proceed with a survey population that
differs from the target population, e.g. members of the AMA instead of all Doctors.
Lists always have some problems:
• omissions
• duplicates
• ceased units
Some judgement has to be made as to how serious these problems might be and what
steps can be taken to overcome them. You must plan how to handle these problems in
the survey operations and estimation phases. Information on the likely quality of the
list should be obtained; it may even be necessary to do a small pilot test to gauge the
extent of these problems.
In some situations no list exists but by use of a technique known as multi-stage sampling
it is still possible to obtain a valid sample of units through a sampling frame of higher
level units through which the population units can be identified. For example, to obtain
a sample of hospital patients we could select a sample of hospitals and for the selected
hospitals select a sample of wards and then a sample of patients. Even to get a sample
of private households we might need to start with a list of streets if we do not have a
satisfactory list of dwellings.
In practice, the size of a survey is determined by the funds available and it is important to
consider whether the size of the survey possible will be of any real use. In looking at the
usefulness of a proposed survey the value of the information to be obtained is determined
by what it adds to what is already known. If there is virtually no information available
then even a small study will be quite valuable, but if reliable and detailed information
is already available then we must critically examine what the proposed survey will add.
The reliability of the estimates from a survey depends on the errors that are affecting
the survey. Groves (1989), Chapter 1, gives an excellent review of the potential sources
of survey errors.
• Sampling error: if instead of including all units in the population in the survey a
sample is selected then the estimates will differ from the result that a complete
enumeration would give. The size of this difference is called the sampling error. For
a probability sample an indication of the likely size, but not direction of this error,
can be calculated from the sample using a statistic called the standard error. This
is one of the main attributes of using probability sampling; for other methods it
is not possible to estimate the likely size of the sampling error, although in some
cases an attempt is made by assuming the sampling procedure is equivalent to a
probability sampling scheme.
• Coverage error: errors because some units were not on the sampling frame or list
• Non-response error: errors because some selected units could not be contacted or
refused to provide the information
• Interviewer error: for surveys involving personal interviewing the interviewers may
affect the responses the respondent provides in various ways
• Instrument errors: errors or differences due to the way the questions and instruc-
tions are asked. If physical measurements are taken there will be measurement
errors associated with the measurement process.
• Mode of data collection: different answers to the same question may be obtained
when using different modes e.g. mail versus telephone to collect the data.
All data collections are potentially subject to these errors. A census or complete enu-
meration would have no sampling error but would be subject to all the other sources
of error. In fact although they introduce sampling error, sample surveys can give more
reliable results than censuses because more effort can be put into reducing the other
errors for the same cost.
In the end information is used for decisions and the reliability of the estimates from the
survey should be that necessary to support that decision-making. If the same decision
will be made whether the estimate is 30% or 40% then there is no need to design the
survey to have a likely sampling error of less than 10%. The subject of sampling error is a
technical one. For many surveys the detail of the tables to be produced is a determining
factor. For example, if the key output from the survey is to be a table then a sample size of
10 to 25 times the number of cells in the table should be considered initially. Hence if
we have a table with 40 cells a sample of between 400 and 1000 should be considered.
More precisely, if a simple random sample is conducted then an estimate of a proportion
$p$ has a 95% chance of a sampling error of $2\sqrt{p(1-p)/n}$ or less, where $n$ is the sample
size.
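The two rules of thumb in this paragraph can be combined in a short calculation. The Python sketch below computes the initial sample size range for a 40-cell table and the approximate 95% bound on the sampling error at each end of that range. The function names are hypothetical, and the proportion 0.5 is assumed only because it gives the largest value of p(1 - p).

```python
import math


def approx_95_margin(p, n):
    """Approximate 95% bound on the sampling error of a proportion under SRS."""
    return 2 * math.sqrt(p * (1 - p) / n)


def initial_sample_range(n_cells, low=10, high=25):
    """Rule of thumb: 10 to 25 sample units per cell of the key output table."""
    return n_cells * low, n_cells * high


n_low, n_high = initial_sample_range(40)  # table with 40 cells
for n in (n_low, n_high):
    # Assume p = 0.5, the least favourable case for the sampling error.
    print(n, round(approx_95_margin(0.5, n), 3))
```

For a proportion near 0.5 the bound is 0.05 with 400 units and a little over 0.03 with 1000 units. Note that this applies to estimates for the whole sample; estimates for individual cells of the table rest on far fewer units.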
In considering sample size an allowance for non-response has to be made, and unless
there is some legal compulsion response rates of 40% are common, and often they are
less. Non-response raises the possibility of non-response error, which is the error because
the non-respondents are different from the respondents. This error is not reflected in the
standard error discussed above which only reflects the likely error due to the fact that a
sample and not the whole population is selected. Standard errors also do not cover the
errors due to respondent or interviewer errors. These errors are difficult to measure and
are best minimized through the testing in the development phase of the survey.
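One simple way to make the allowance for non-response, shown here only as a sketch, is to divide the number of completed responses required by the expected response rate. The function name and the figures are hypothetical. As the paragraph above stresses, selecting more units restores the sample size but does nothing about non-response error.

```python
import math


def units_to_select(n_responses_needed, response_rate_percent):
    """Number of units to select so the expected number of responses is met."""
    return math.ceil(n_responses_needed * 100 / response_rate_percent)


# Hypothetical planning figures: 1000 completed responses are wanted.
for rate in (80, 40, 25):
    print(rate, units_to_select(1000, rate))
```

At a 40% response rate, 2500 units would have to be selected to expect 1000 responses.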
In considering the precision required the balance between the amount of information
collected and the number of units to include has to be faced. There is often a temptation
to cover a lot of questions in a survey. Overloading the questionnaire will probably lower
the response rate and affect the quality of the responses that are obtained. Use of small
sample sizes often means that more in depth methods can be used but the generalisations
that can be made are more limited. Small, in depth studies, and larger surveys with less
depth can complement each other.
Traditionally there are three common ways of collecting data in a survey: by mail,
telephone, or field interview. More recently the options of using email or the internet
have become available, although there are issues associated with adequate sampling
frames for general surveys. The best method to use in any given situation depends on
the population being surveyed, the information being collected, the contact information
available on the sampling frame and the cost structure applying.
A.6.1 Mail Surveys
These involve mailing a form to selected units and asking them to fill in the form and
return it. The sampling frame must give addresses and preferably contact names. The
method has the attraction of being apparently cheap and simple, the postage cost being
the cost of two stamps per unit. However the initial response rate is often poor e.g.
20%-30% and several mail-based follow-ups are required to increase the response rate. I
would allow the time and money for 3 follow up phases. The form must get to the right
person and must be clearly set out. The questions, instructions and explanations must
also be clear and somehow encourage the person to respond. Mail surveys tend to be
used to survey businesses that are more used to providing information in this way, but
can be used to survey households. Problems of literacy and foreign language arise.
made on the basis of cost, logistics and sampling factors. However, mail and other self-
completion surveys have to be shorter than interviewer surveys, basically because the
interviewer is not there to maintain interest. Open questions and sequencing or question
skipping should only be used sparingly in mail surveys.
If the survey population is small, or very detailed analysis is planned, a census of the
population should be taken. However, once the population becomes large the resources
needed to conduct a census become too much and some sampling has to be used. Use of
sampling also means that more effort can be put into the data collection so that higher
quality and more detailed information can be collected. Different ways of obtaining
samples are used, and for this section it will be assumed that we wish to obtain a sample
using probability sampling methods. For large-scale surveys, or even small surveys with
special requirements, sample design and the specification of the associated estimation
procedures can be a fairly technical exercise requiring the advice of an experienced
specialist. The comments here indicate the basic approaches possible.
than the population size so that the following approximation can be made:
$$SE = \sqrt{\frac{1}{n} P(1 - P)}$$
These formulas for the SE assume a particular sample design, simple random sampling
without replacement; however, they are often used for survey planning even for more
complex sample designs. Use of cluster or multistage sampling can increase the sampling
errors. These methods are discussed below.
For survey design, we may have a given standard error (SE) we wish to achieve, for a
proportion. The sample size n can be calculated by
$$n = \frac{P(1 - P)}{SE^2}$$
To apply this formula, we must have some idea of the population proportion P . This
may seem strange - if we already know P then why do we need to run the survey?
However, a rough estimate of P is sufficient to calculate the sample size. The survey will
provide a precise estimate that can be used for research, but a rough estimate is good
enough to decide the sample size.
How can we calculate the sample size if we have no information at all about P ? The
maximum value of P (1 − P ) is at P = 0.5; at this value P (1 − P ) = 0.25. So we can
calculate a “conservative” or “worst case” value for the required sample size using
$$n = \frac{0.25}{SE^2}$$
However, we then hear about a similar survey conducted 5 years ago, which estimated
that this proportion was 0.3. The proportion may have changed, but we can still use
P = 0.3 as a rough estimate for setting sample size:
$$n = \frac{P(1 - P)}{SE^2} = \frac{0.3(1 - 0.3)}{0.01^2} = 2100$$
Remember that the 95% confidence interval for the population proportion will be the
sample proportion plus and minus approximately twice the SE.
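The sample size calculation above is easy to script. The following Python sketch reproduces the worked figure, assuming a target SE of 0.01 as implied by the calculation, and also gives the conservative value obtained with P = 0.5. The function name is hypothetical, and the result is rounded up to a whole number of units.

```python
import math


def sample_size_for_proportion(se_target, p=0.5):
    """n = P(1 - P) / SE^2 under SRS, ignoring the finite population correction.

    The default p = 0.5 gives the conservative (worst case) sample size.
    """
    n = p * (1 - p) / se_target ** 2
    # Round off floating-point noise before rounding up to a whole unit.
    return math.ceil(round(n, 6))


n_rough = sample_size_for_proportion(0.01, p=0.3)  # using the earlier survey
n_worst = sample_size_for_proportion(0.01)         # no information about P

print(n_rough, n_worst)
```

With a target SE of 0.01 the rough estimate P = 0.3 gives 2100 units and the worst case gives 2500, so even imprecise prior information reduces the required sample. The corresponding 95% confidence interval would be the sample proportion plus or minus about 0.02.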
A.7.2 Simple Random Sampling
This is the method that most people mean when they refer to random sampling. It means
each unit in the population has the same chance of selection. In fact it goes further
than that and is a method in which every possible sample of the specified size has the
same chance of selection. Usually it is done without replacement, so if a unit is selected
it is not given another chance of selection. Suppose you have a population of size N and
wish to select a sample of size n. Manually a simple random sample can be achieved by
numbering, at least notionally, all the units in the population and selecting n random
numbers between 1 and N from a table of random numbers - if the same number comes
up twice just select another. If the sampling frame is in computer readable form there
is usually software available to generate random numbers; alternatively, the sampling
frame can be randomly ordered and units then selected using the random systematic method
described in the next section. If the list is long, randomly ordering it can be expensive
in computer time. Be wary of methods that appear random or lists that are alleged to
already be in random order.
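When the frame is held on a computer, the manual procedure above has a direct software counterpart. The sketch below uses Python's standard `random.sample`, which draws without replacement, to select n unit numbers between 1 and N. The population size, sample size and seed are hypothetical; fixing a seed is simply one way to make the selection reproducible and auditable.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Select unit numbers 1..N by simple random sampling without replacement."""
    rng = random.Random(seed)
    # random.sample never returns the same element twice, so there is no need
    # to redraw when a number "comes up twice".
    selected = rng.sample(range(1, population_size + 1), sample_size)
    return sorted(selected)


# Hypothetical frame of N = 10,000 numbered units and a sample of n = 200.
selected = simple_random_sample(10_000, 200, seed=2016)
print(selected[:10])
```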
In stratified sampling we divide the population into more homogeneous groups and select
a separate sample from each group. This ensures an adequate representation from each
group. So in taking a sample of employees, rather than let the representation of say
males to females be random we could divide the list of employees into two groups or
strata according to sex and take a sample from both using the same sampling rate or
fraction in each stratum. If the strata vary a lot in their degree of homogeneity then this
means for some strata only a small sample is required to obtain reasonable reliability,
whereas for the more heterogeneous strata a proportionately larger sample is required.
By altering the arrangement of the sample in this way much more efficient samples can be
obtained. The sampling fraction may also be varied between strata because we wish to
ensure sufficient representation of particular groups in our analysis. Common variables
to use in forming strata are size, industry, type, and geographic area.
If different sampling fractions are used between strata then this must be taken account
of in the estimation procedures. So for example if we took a 1 in 8 sample of men but a
1 in 5 sample of women then in the estimation the men’s answers would be multiplied
or weighted by 8 but the women’s by 5. In sampling institutions it is common to
form a stratum consisting of all the very large units and include all of them; they have then
been selected with probability 1. If say a 1 in 10 sample of the remaining businesses
were selected then the non-large businesses would be multiplied by 10 and added to the
unweighted results from the very large businesses.
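The weighting described above can be sketched in a few lines of Python. Each stratum's responses are multiplied by the inverse of its sampling fraction (8 for men, 5 for women) before being added. The response data are invented purely for illustration, and the function name is hypothetical.

```python
def weighted_total(responses_by_stratum, weight_by_stratum):
    """Estimate a population total by weighting each stratum's sample total."""
    return sum(
        weight_by_stratum[stratum] * sum(values)
        for stratum, values in responses_by_stratum.items()
    )


# Weights are the inverse sampling fractions: 1 in 8 men, 1 in 5 women.
weights = {"men": 8, "women": 5}

# Hypothetical yes/no answers (1 = yes) from the sampled units.
responses = {"men": [1, 0, 1, 1], "women": [1, 1, 0, 0, 1]}

estimated_yes = weighted_total(responses, weights)
print(estimated_yes)
```

Here the 3 "yes" answers from men represent 24 men and the 3 from women represent 15 women, giving an estimated total of 39. A completely enumerated stratum would simply carry a weight of 1.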
For cost or logistical reasons it is sometimes more convenient to select the sample by
selecting, at random, groups of units. This often happens when there is some geographic
aspect to the sample selection and/or when there is no list of population
units available. The method is best illustrated by an example. Suppose we wish to
select a national sample of hospital patient records. No central list of such records
exists. However, we may be able to obtain a list of hospitals in Australia. We could then
select a sample of hospitals from this list and select all patient records from the selected
establishments in which case we have a cluster of units from each selected hospital. It
would probably be better to select a sample from the patient records in each selected
establishment, in which case we have a multi-stage sampling scheme. Notice that the
probability of a particular patient record being selected in the sample is the product of the
probability of the establishment being selected and the probability the record is selected
given the establishment is selected. So if we took a 1 in 5 sample of establishments
and then a 1 in 20 sample of hospital records from the selected establishments we have
selected a 1 in 100 sample of patient records. Provided the selection of establishments and records
within selected establishments is done randomly the sample is still a valid probability
sample. In this example it would also be worthwhile stratifying according to the size of
the establishment and its type.
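The product rule for selection probabilities in this example can be checked with exact fractions. The Python sketch below multiplies the stage-by-stage sampling fractions and takes the inverse as the estimation weight; the function name is hypothetical, and equal sampling fractions within each stage are assumed.

```python
from fractions import Fraction


def overall_selection_probability(*stage_probabilities):
    """Multiply the conditional selection probabilities of successive stages."""
    overall = Fraction(1)
    for probability in stage_probabilities:
        overall *= probability
    return overall


# 1 in 5 establishments, then 1 in 20 records within selected establishments.
p_record = overall_selection_probability(Fraction(1, 5), Fraction(1, 20))
weight = 1 / p_record  # each sampled record represents this many records

print(p_record, weight)
```

The result is a 1 in 100 sample, so each selected record would be weighted by 100. The same function accepts a third fraction for a three stage design.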
While cluster and multistage sampling are usually cheaper and more convenient than
other methods there is a price to pay in increased standard errors for the same sample
size in terms of number of finally selected population units. These methods are quite
complicated and the advice of an experienced sampling statistician would be desirable.
A.8 Survey Operations
A.8.1 Follow Up
No matter how the survey is carried out 100% response rate will not be achieved at
first. The strategy for following up selected units that have not responded or not been
contacted has to be worked out. For mail-based surveys the main problems are people
not returning the form and the form not getting to the business in the first place because
of moves or ceased businesses or poor contact information. Follow up is usually a further
mail out which is eventually supplemented with telephone and occasionally field visits.
In field surveys outright refusals are less of a problem than contacting people at home
and several visits may be necessary to even contact the household.
It is sometimes the practice to replace non-contacts with apparently similar units, e.g.
next door neighbours or by just making another telephone call. While these procedures
maintain the sample size, they hide the non-response and can give biased samples e.g.
biased to the people at home more often. If such a situation is unavoidable at least
obtain a count of the substitutions so the real response rate can be worked out.
A.8.2 Non-Response
Response rates vary considerably according to the population being surveyed, the subject
matter, the survey organisation and the survey methods used. Once the follow up phases
have been completed and an element of non-response remains there are two things that
can be done. One is to compare the profile of the sample with any information available
on the population e.g. the age and sex of the population of Wollongong, or the ranks or functions in
a staff survey. Difference in these profiles can be used in the estimation phase to attempt
to adjust for non-response bias. Even if the sample and population profiles are reasonably
close there is no guarantee that they will be so for other variables. An often recommended
method not often used is to intensively follow up a small number of randomly selected
non-respondents to determine the characteristics of the non-respondent group.
Edit failures should be checked back to the form and if necessary to the respondent.
A.8.4 Output Editing
This refers to checking the figures that come from the survey in various ways such as
against other or historical data sources or across subgroups in the survey.
A.9 References