
Sample Surveys

David Steel
Carole Birrell
Contents

1 Introduction
1.1 What is Survey Sampling?
1.2 Alternatives to Sampling
1.3 Advantages of Sampling
1.4 Disadvantages of Sampling
1.5 Complementary Roles of Samples and Censuses
1.6 Probability Sampling
1.7 Other Sampling Methods
1.8 The Survey Process
1.8.1 Steps in the Survey Process
1.9 Sources of Errors in Samples
1.10 Examples
1.11 References
1.12 Additional Reading

2 Basic Definitions, Concepts and Notation
2.1 Population
2.2 Sample
2.3 Mapping sample units to population units
2.4 Randomisation Distribution
2.4.1 Other Parameters
2.5 Sampling Distribution
2.6 Bivariate Definitions
2.7 References
2.8 Additional Reading

3 Simple Random Sampling
3.1 Definition and Basic Properties
3.1.1 Estimating Sampling Variance
3.2 Proportions or Bernoulli Variables
3.2.1 Upper Bound on RSE for Small Proportions
3.2.2 Examples
3.2.3 Percentages vs percentage points
3.3 Setting Sample Size
3.3.1 Sample Size for Estimating Proportions
3.4 Estimation of Ratios
3.5 Helpful background information
3.5.1 Alternative formulas
3.5.2 Taylor Series Expansion
3.6 References
3.7 Additional Reading

4 Systematic Sampling
4.1 References
4.2 Additional Reading

5 Stratified Sampling
5.1 Introduction
5.1.1 Benefits of Stratification
5.1.2 Decisions to be made
5.2 Notation
5.3 Definitions and Basic Properties
5.3.1 Probability of selection
5.4 Allocation of Sample
5.4.1 Proportional Allocation
5.4.2 Optimal Allocation
5.4.3 Equal Sampling Variance
5.4.4 Power Allocation
5.4.5 Allocation in Practice
5.5 Variables to use as Stratification Variables
5.6 Number of Strata
5.7 Choosing Stratum Boundaries
5.7.1 Dalenius and Hodges Method
5.8 References
5.9 Additional Reading

6 Ratio Estimation
6.1 Introduction and Notation
6.2 Properties of Ratio Estimation
6.2.1 Comparison of Ratio Estimator with Number Raised Estimator
6.3 Ratio Estimation Under a Super-population Model
6.4 Use of Ratio Estimation with Stratification
6.5 Additional Comments
6.6 Additional Reading

7 Other Sampling Designs
7.1 Introduction to Cluster Sampling
7.2 Introduction to Multi-stage Sampling

A Surveys and Sampling
A.1 Experiments and Observational Studies
A.2 Overview of the Survey Process
A.2.1 Steps in the Survey Process
A.3 Specifying the Population of Interest
A.4 Sampling Frames
A.5 Precision Required
A.6 Collection Methods
A.6.1 Mail Surveys
A.6.2 Telephone surveys
A.6.3 Field Interview Survey
A.7 Sampling Methods
A.7.1 Sample Size
A.7.2 Simple Random Sampling
A.7.3 Random Systematic Sampling
A.7.4 Stratified Sampling
A.7.5 Cluster and Multistage Sampling
A.8 Survey Operations
A.8.1 Follow Up
A.8.2 Non-Response
A.8.3 Input Editing
A.8.4 Output Editing
A.9 References

Chapter 1

Introduction

Sample surveys belong to the class of studies generally called ‘observational studies’
which are separate from experimental studies. They are sometimes referred to as ‘cross-
sectional studies’ as they provide a ‘snapshot’ of a population at some point in time,
although repeated and longitudinal surveys are often conducted. Other types of obser-
vational studies are cohort or case control studies which are designed for the purpose of
testing a hypothesis. Cohort studies are also called prospective studies as they follow
one group of people into the future and collect information at several points in time.
Sample surveys are descriptive surveys and will be the focus of this part of the subject,
although they are increasingly used for analytical purposes.

Figure 1.1: Overview

1.1 What is Survey Sampling?

In its widest sense, a survey is any process that involves the collection of information
about some population. The population in question will often be some group of people,
but may be a group of businesses or events, or episodes (e.g. trips). It may also be,
for example, an area of land or a volume of the atmosphere. In general a population
consists of a group of units about which you wish to draw conclusions.

Population of units → Selection of random sample → Sample of units

Figure 1.2: Populations and samples (Griffiths et al p228)

From a sample we attempt to draw conclusions about the population from which that
sample was drawn - this involves inductive logic or inference.

If the variables that we measure in the population have low variability, then only a
few observations, possibly selected in some haphazard way, are needed to provide good
estimates of characteristics of the population. However, it is usually not that simple.

Examples of Surveys: Monthly Labour Force, Retail Trade, Capital Expenditure, Com-
pany Profits, Market Research, Performance Appraisal (job satisfaction), Opinion Polls,
Quality Management, Ecological or Geological Surveys.

We will be concerned with sampling from a finite population


U = {1, 2, . . . , N },
and with estimating descriptive quantities such as means, medians, variances, propor-
tions, rates etc. We want to estimate characteristics of the population based on our
sample. In this subject, our focus is on estimation for a single variable. Many of the
principles involved also apply for hypothesis testing and for relationships between vari-
ables.

Sampling theory is concerned with:


• how to select samples

• how to calculate estimates and make inferences
• the precision of inferences and how to estimate that precision.
It takes into account:
• the “messiness” of the real world
• cost.

This topic differs from most statistics subjects in the following ways.
• It is concerned with estimating characteristics of a finite population, not the pa-
rameters of a probability distribution, or infinite population.
• It will consider how to obtain samples taking costs and other operational factors
into account. Often you will have been taught how to analyse data already avail-
able.

The topic covers methods of sample design and related estimation procedures aimed at
providing the required precision at least cost.

1.2 Alternatives to Sampling

A sample survey is one way to obtain information about a finite population. Other
important ways of obtaining information are;
• censuses or complete enumerations,
• administrative systems e.g. Medicare, Australian Tax Office.
Sampling may be used in a census or administrative system. In a census, some ques-
tions which are more detailed and expensive to process may be asked of a sample of
people whereas basic questions are asked of everyone. To reduce costs, statistics may be
generated from an administrative system using a sample of the records.

1.3 Advantages of Sampling

• cheaper than a census


• better timeliness
• reduces respondent load
• reduces non-sampling error through the use of better and more expensive methods
of collecting data
• enables collection of more detailed data.

1.4 Disadvantages of Sampling

• introduces sampling errors (errors due to use of a sample rather than the entire
population)

• limits level of disaggregation possible (i.e. breaking sample up into small areas or
groups).

1.5 Complementary Roles of Samples and Censuses

• census provides information for use in sample design and estimation, e.g in strati-
fication and ratio estimation

• sample provides updates of census at the broad level

• can combine samples and censuses e.g. 1st phase - census, 2nd phase - follow up
sample for more detail.

1.6 Probability Sampling

• each unit in the population has a known non-zero chance of selection, which can
be determined for each unit in the sample

• this justifies the theory that follows, which enables unbiased estimates to be cal-
culated and the precision of the estimates to be evaluated through the calculation
of estimates of standard errors.
This subject will focus on probability sampling methods, which may also be called
scientific or random sampling methods. It is not necessary that the selection
probabilities are equal, but if unequal selection probabilities are used, this must
be accounted for in the estimation procedures used.

1.7 Other Sampling Methods

• purposive selection

• haphazard selection

• volunteer

• quota sampling.

These methods may be useful in some situations, but since they result in some units
having zero or unknown probabilities of selection they are not recommended in general.

Even with a probability sample, the final sample may not be exactly as designed because
of non-response, non-contact, bad operations, sampling process not followed properly,
bad lists, poor frames e.g. missing units, duplicates.

1.8 The Survey Process

Surveys vary widely in the subjects they cover, the methods used, their size and com-
plexity and the purposes they fulfil. Conducting a survey is a process that involves
a number of steps that must fit together well. To ensure everything fits together the
whole survey process must be properly planned. While the steps involved follow a logical
sequence there is always a degree of iteration involved in the development of a survey.
Decisions have to be reviewed in light of later developments.

Keys to a successful survey are:

• have clear aims;

• test and evaluate all processes involved.

Three main phases in conducting a survey are

• development (e.g. developing questionnaire);

• operational (collecting information);

• analysis and reporting - producing estimates, tables.

A common fault is lack of effort in the development phase, e.g testing key aspects of
questionnaire.

1.8.1 Steps in the Survey Process

Survey development

• determine objectives;
• determine resources available and constraints;
• review alternative sources of information;
• specify population of interest;
• specify population units and sampling units;
• identify research issues;

• decide data items and classifications;
• determine precision required;
• decide type of investigation needed;
• determine collection method, including non-response follow-up;
• develop collection instrument;
• specify sampling method, design and estimation procedures;
• develop and plan survey operations.

Survey operations

• recruitment and training of operational staff;


• despatch and collection control;
• data collection;
• non-response follow-up;
• data capture;
• input editing;
• output editing.

Survey Analysis and Reporting

• weighting;
• calculation of estimates;
• account for missing data;
• production of tables, charts and diagrams;
• identifying important subgroups and relationships;
• calculation of estimates of sampling errors;
• report preparation.

Evaluation

In any particular survey some of these steps may not be significant and the relative
importance of them will vary between projects.

1.9 Sources of Errors in Samples

Two classes of errors in a sample survey are

• sampling errors

• non-sampling errors e.g., errors due to the form, respondent, frame or list from
which we sample, etc. These types of errors potentially affect all statistical collec-
tions.

The reliability of the estimates from a survey depends on the errors that affect the survey.
Groves (1989), Chapter 1, gives an excellent review of the potential sources of survey
errors:

• Sampling error - If instead of including all units in the population in the survey
a sample is selected then the estimates will differ from the result that a complete
enumeration would give. The size of this difference is called the sampling error.
For a probability sample, an indication of the likely size, but not direction, of this
error, can be calculated from the sample using the standard error. This is one key
advantage of using probability sampling. For other methods, it is not possible to
estimate the likely size of the sampling error, although in some cases an attempt is
made by assuming the sampling procedure is equivalent to a probability sampling
scheme.

• Coverage error - errors caused by some units not being on the sampling frame or
list.

• Non-response error - these arise if some selected units could not be contacted or refused to provide the information.

• Interviewer or observer error - for surveys involving personal interviewing, the interviewers may affect the responses the respondent provides in various ways.

• Instrument errors - errors or differences may arise from the way the questions are
asked and instructions given.

• Mode of data collection - different answers to the same question may be obtained
when using different modes (e.g. mail, telephone, face-to-face) of data collection.

All data collections are potentially subject to these errors. A census or complete enu-
meration would have no sampling error but would be subject to all of the other sources
of error. Although they introduce sampling error, sample surveys can sometimes give
more reliable results than censuses because more effort can be put into reducing the
non-sampling errors for the same cost.

1.10 Examples
Example 1.1: Discuss the following diagram

Figure 1.3: The target population and sample population in a telephone survey of likely
voters (adapted from Lohr Figure 1.1 p4)

Notes:

Example 1.2: (Griffiths et al, p228)
Prior to the 1936 presidential election, contested by two candidates, Roosevelt and Landon, two polls were conducted:

1. Literary digest poll

• mailed questionnaires to 10 million Americans (using names from telephone books and club membership)
• received 2.4 million replies

2. George Gallup Poll

• surveyed 50 000 people


• each interviewer given a quota of people to select in different categories (males,
females, old, young etc)

Discuss the main differences between the two polls.


Notes:

The results of the 2 polls and the actual result are given in the following table.

Table 1.1: Results

                    Support for Landon    Support for Roosevelt
Literary Digest
Gallup
Actual Result              38%                     62%

Comments:

1.11 References:
Griffiths, D., Stirling, W.D. and Weldon, L.L. (1998) Understanding Data: Principles &
Practice of Statistics. John Wiley.
Groves, R. M. (1989) Survey Errors and Survey Costs: New York; John Wiley.
Lemeshow, S. and Levy, P.S. (2008) Sampling of populations. Methods and Applications.
4th edition. New York; John Wiley.

1.12 Additional Reading:


Cochran (1977), Chapter 1;
Lohr (1999), Chapter 1;
Lemeshow, S. and Levy, P.S. (2008) Chapter 1;
Steel, D. Surveys and Sampling Notes (see Appendix 1)

Chapter 2

Basic Definitions, Concepts and


Notation

2.1 Population

• Finite population - set of individual units U = {1, 2, . . . , N }.

• Target population - “ideal” population about which we want to draw conclusions


e.g. all retailers operating at any time in 2008/9.

• Survey population - population that is sampled from i.e., set of units with non-zero
chance of selection e.g. all retailers with employees operating in June 2008. The
survey population and the target population may differ.

• Population units - the elements or entities about which we wish to make estimates
(e.g. people).

• Reporting (or observation) Units - the units providing the information (e.g. people)
or an object on which a measurement is taken.

• Population values - each individual in the population has associated with it the
value of one or more characteristics of interest. We will usually consider one
characteristic or variable and denote the population values as Y1 , . . . , YN . The
value of the characteristic of interest of the ith population unit is denoted Yi .

• Population parameters - any function of the population values.


Examples include:
Population total: $Y = \sum_{i=1}^{N} Y_i$ (sometimes written $\sum_{i \in U} Y_i$),

Population mean: $\bar{Y} = Y/N$,

Population median,

Population variance: $\sigma_Y^2 = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$,

(later we will consider $S_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$),

Population coefficient of variation: $CV = \sigma_Y / |\bar{Y}|$.

2.2 Sample

• Sample - A subset of a population.

• Sampling Units - the units we select (e.g. households).

• Sampling Frame - the list of sampling units. (e.g. the list of households)

• Sample statistics - any function of the sample values.


Examples include:
Sample total: $y = \sum_{i=1}^{n} y_i = \sum_{i \in s} Y_i$,

Sample mean: $\bar{y} = y/n$,

Sample variance: $s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$.

A sample s is any subset of U . The sample will often be obtained by selecting sampling
units from a list of these units (sampling frame). Usually the sampling units are the
population units but can be groups of population units (e.g. cluster and multi-stage
sampling). More generally, the sampling frame is the set of materials used to obtain the
sample.

Sample values will be written y1 , . . . , yn , where n is the sample size. Usually, the order
of selection contains no information and is ignored. The sampling scheme may allow
the same population unit to be selected more than once - if this is the case it is a with
replacement scheme. If the duplicates can be identified it is better, in theory, to remove
them. This must then be taken into account in the estimation procedure. If duplicates in
the sample cannot be identified, unbiased estimates can still be calculated provided the
expected number of times a population unit is selected can be determined. See Theorem
3.1. In practice, without replacement schemes are generally used. Unknown duplicates
on the sampling frame are a different issue and can cause biases.

2.3 Mapping sample units to population units


Denote $y_1 = Y_{j(1)}, \ldots, y_n = Y_{j(n)}$, where $j(1), \ldots, j(n)$ map sample units to population units.

Example 2.1: If population units are

Y1 , Y2 , . . ., Y105 , Y106 , Y107 , . . ., YN

then s = {2, 106} maps to Y2 , Y106


such that: y1 = Y2 , y2 = Y106 , so j(1) = 2, and j(2) = 106.

2.4 Randomisation Distribution


Let $\mathcal{S}$ = the set of all possible samples, so that $s \in \mathcal{S}$.
Example 2.2: (a) If $N = 4$ and $n = 2$, how many possible samples can be drawn without replacement?

Let $S$ be the random variable denoting which sample is drawn and, for the sample design or selection procedure,
\[ P(S = s) = p_d(s). \]
Note $p_d(s) \ge 0$ for all $s \in \mathcal{S}$ and $\sum_{s \in \mathcal{S}} p_d(s) = 1$. So $p_d(s)$ defines a probability distribution over $\mathcal{S}$.

Example 2.2 (b) N = 4 so U = {1, 2, 3, 4}

Since n = 2, possible samples of size 2 are:

Each sample has a known probability of being chosen:

So we have a probability distribution over S

Note not every sample has a non-zero chance of selection, but every population unit
does, so provided these probabilities are known, this is a probability sampling method.
Suppose we define an estimation procedure which results in the estimate y ′ (s) for s, then
y ′ (S) is a random variable such that

\[ P(y'(S) = a) = \sum_{\{s : y'(s) = a\}} p_d(s) \]

and this gives the sampling distribution of the estimator y ′ (S).

We will drop the “d” subscript, but remember different designs give different sampling
distributions. We define the mean and variance of y ′ (S) over all possible samples, that
is the randomisation distribution introduced by sampling.

\[ E_p[y'(S)] = \sum_{s \in \mathcal{S}} p(s)\, y'(s) \]
\[ V_p[y'(S)] = \sum_{s \in \mathcal{S}} p(s)\,\left(y'(s) - E_p[y'(S)]\right)^2 \]
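These definitions can be checked directly on a toy population by enumerating every possible sample. The sketch below is not part of the original notes; it uses the population values from Example 2.2 and takes the sample mean as the estimator, then compares the randomisation mean and variance with the population mean and the SRSWOR variance formula derived in Chapter 3.

```python
# A minimal sketch: the randomisation distribution of the sample mean for the
# Example 2.2 population {-1, 2, 4, 10} under SRSWOR with n = 2, where every
# possible sample s has p(s) = 1 / C(N, n).
from itertools import combinations
from statistics import mean

population = [-1, 2, 4, 10]                    # Y_1, ..., Y_N
N, n = len(population), 2

samples = list(combinations(population, n))    # all possible samples s
p = 1 / len(samples)                           # p(s), equal for SRSWOR

# E_p[ybar(S)] and V_p[ybar(S)] taken over all possible samples
E_ybar = sum(p * mean(s) for s in samples)
V_ybar = sum(p * (mean(s) - E_ybar) ** 2 for s in samples)

# compare with the population mean and the SRSWOR formula (1 - n/N) S_Y^2 / n
Ybar = mean(population)
S2 = sum((y - Ybar) ** 2 for y in population) / (N - 1)
print(E_ybar, Ybar)                            # both 3.75
print(V_ybar, (1 - n / N) * S2 / n)            # both about 5.396
```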

Example 2.2 cont.
(c) Consider the population values {−1, 2, 4, 10}

(i) Determine the population mean and the sample mean for each possible sample
with n = 2.

(ii) Calculate the expectation of the sample mean.

(iii) Calculate the variance of the sample mean.

The shape of the probability distribution over S will depend on the original distribution
of the variable in the finite population, the sample design and the estimator used. We
will drop the “p” subscript and also write y ′ (s) and y ′ (S) as y ′ - the context will make
clear which is meant. Considering the properties of estimators with respect to the
randomisation distribution is called the design based approach.

Sometimes it is convenient to define for i ∈ U

\[ P(i \in S) = p_i, \qquad P(i, j \in S) = p_{ij} \]

Note that the finite population value, Yi , is regarded as a fixed quantity. On some occa-
sions we may let it be a random variable also generated by some stochastic mechanism
or drawn from some superpopulation. This approach is called a model-based approach.

Note: $y'$ is unbiased for $Y$ if $E(y') = Y$;
\[ \mathrm{Bias}(y') = E(y') - Y \]
\[ \mathrm{MSE}(y') = V(y') + \mathrm{Bias}^2(y') \]

See Cochran (1977) section 1.8 about the effect of bias on the reliability of confidence
intervals.

Figure 2.1: Unbiased, precise and accurate archers (Lohr Fig 2.2 p29)

Example 2.2 (d)


Calculate the bias of the sample mean:

Solution:

An unbiased estimator is possible in this situation by properly accounting for the different selection probabilities - see later.

2.4.1 Other Parameters

Note: y ′ is usually used to denote estimation of the population total Y .



Standard error of $y'$ = $\sqrt{V(y')}$.

Relative variance of $y'$: $V_{y'}^2 = \dfrac{V(y')}{E(y')^2}$, which is scale invariant.

Relative standard error (RSE) of $y'$: $V_{y'} = \dfrac{\sqrt{V(y')}}{|E(y')|}$, often expressed in percentage terms. Sometimes the RSE is called the coefficient of variation of the survey estimate.

The standard error is the standard deviation of the sampling distribution of y ′ and is
not the standard deviation of the population distribution (σY ).

2.5 Sampling Distribution

• In practice, we assume that the distribution of the sample estimator is approximately Normal, so that the interval obtained using the estimate $\pm 1.96 \times$ (estimated) standard error is a 95% confidence interval for the corresponding population parameter. (Empirical studies have found this to work generally.)

• This assumption may not work if the sample size is very small or the population
distribution is very skewed, although it will work for large samples from skewed
populations. Some theoretical justification is based on a version of the Central
Limit Theorem appropriate to finite populations.

• Cochran (1977, section 2.15) suggests that, for simple random sampling, the assumption of Normality should be reasonable if $n > 25 G_1^2$, where
\[ G_1 = \frac{\frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^3}{\left(\frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2\right)^{3/2}} \]
is Fisher's coefficient of skewness. Sugden, Smith and Jones (2000) suggest that when the fact that the standard error is estimated is taken into account then we need $n > 28 + 25 G_1^2$.

• In developing a sample design we should try to isolate tails in the population distribution and include those units with certainty. One approach is stratification - considered in Chapter 5.

2.6 Bivariate Definitions
Suppose that we collect information on two variables y, x on each sample unit then we
get a bivariate sampling distribution for the estimators y ′ , x′ .

(Sampling) covariance $= E[(y' - E[y'])(x' - E[x'])] = C(y', x') = \sigma_{y'x'}$,
where the expectation is taken over all possible samples. This is not the population covariance $\sigma_{YX} = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X})$.

Relative covariance of $y', x'$:
\[ V_{y'x'} = \frac{C(y', x')}{E(y')E(x')} \]

(Sampling) correlation $= \dfrac{C(y', x')}{\sqrt{V(y')V(x')}} = \mathrm{corr}(y', x') = \rho(y', x')$.

This is not the population correlation $\rho_{YX} = \dfrac{\sigma_{YX}}{\sigma_Y \sigma_X}$.

Theorem 2.1

For any two random variables $A$, $B$:
\[ E(A) = E_B[E(A \mid B = b)] \]
\[ V(A) = V_B[E(A \mid B = b)] + E_B[V(A \mid B = b)]. \]

Example 2.3: (from Lohr Ex B.3 p432)

Choose one of the balls at random and then choose one of the numbers inside that ball.
Let Y be the number that is chosen and let
\[ Z = \begin{cases} 1 & \text{if Ball A is chosen} \\ 0 & \text{if Ball B is chosen} \end{cases} \tag{2.1} \]

(a) Calculate

(i) E(Y |Z = 1)

(ii) E(Y |Z = 0)

(iii) E(Y )

2.7 References:
Cochran, W.G. (1977) Sampling Techniques, 3rd. ed.: New York; John Wiley.
Sugden, R., Smith, T.M.F. and Jones (2000). Cochran's rule for Simple Random Sampling. Journal of the Royal Statistical Society, Series B, 62, pp. 787-793.

2.8 Additional Reading:


Cochran (1977), Section 2.3.
Lohr (1999), Sections 2.1, 2.2; Appendix B.3, Appendix B.4.

Chapter 3

Simple Random Sampling

3.1 Definition and Basic Properties


Simple Random Sampling (SRS), for samples of size n, refers to a sampling scheme in
which each possible subset of size n has the same chance of selection. In practice, it is
not often used on its own, but it is a useful starting point for developing the theory
of more complex methods.
There are two variants:

• SRSWR - with replacement - all possible samples have the same chance of selection, but an individual unit can be drawn more than once.
  Number of possible samples $= N^n$.

• SRSWOR - without replacement - all possible samples have the same chance of selection, but an individual unit cannot be drawn more than once.
  Number of possible samples $= \binom{N}{n} = \dfrac{N!}{n!(N-n)!}$.

Theorem 3.1
For any probability sampling scheme for which πi is the expected number of times the
ith population unit is selected, then
\[ E\left[\sum_{i=1}^{n} y_i\right] = \sum_{i=1}^{N} \pi_i Y_i. \]

Proof:
Define $\delta_i$ = the number of times the $i$th population unit is in the sample. Then
\[ \sum_{i=1}^{n} y_i = \sum_{i=1}^{N} \delta_i Y_i. \]
Hence $E\left[\sum_{i=1}^{n} y_i\right] = \sum_{i=1}^{N} E(\delta_i) Y_i$, since $Y_i$ is fixed, and $E(\delta_i) = \pi_i$, by definition. (Recall that $Y_i$ is not a random variable.) For any WOR sampling scheme, $\pi_i = p_i$, the probability of selection.

Example 3.1
Recall example 2.1 SRSWOR with y1 = Y2 , and y2 = Y106 : then
δ1 = 0, δ2 = 1, δ3 = δ4 = . . . = δ105 = 0, δ106 = 1, δ107 = . . . = δN = 0
and we can write

y1 + y2 =
=
=

so we can see that
\[ \sum_{i=1}^{n} y_i = \sum_{i=1}^{N} \delta_i Y_i. \]

Corollary 3.2
\[ E\left[\sum_{i=1}^{n} \frac{y_i}{\pi_i}\right] = \sum_{i=1}^{N} Y_i = Y \]

Proof: Apply Theorem 3.1 with yi replaced by yi /πi .


This corollary shows that we can always get an unbiased estimator of the population total by weighting by $\pi_i^{-1}$. That is, the estimator can be written as $\sum_{i=1}^{n} w_i y_i$ where $w_i = \pi_i^{-1}$. To do this we need all $\pi_i > 0$, and $\pi_i$ must be known when the $i$th population unit is in the sample, as per the definition of probability sampling.
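As a small illustration of Corollary 3.2 (the function name and example data below are mine, not from the notes), each sampled value is weighted by $1/\pi_i$ to estimate the population total; under SRSWOR, where $\pi_i = n/N$, the estimator reduces to $(N/n)$ times the sample total.

```python
# A minimal sketch of the pi-weighted (inverse selection probability) estimator.
def weighted_total(y_sample, pi_sample):
    """Estimate of the population total: sum_i y_i / pi_i (w_i = 1/pi_i)."""
    if any(p <= 0 for p in pi_sample):
        raise ValueError("every selected unit needs pi_i > 0")
    return sum(y / p for y, p in zip(y_sample, pi_sample))

# e.g. under SRSWOR with n = 2 from N = 4, every unit has pi_i = n/N = 0.5,
# so the estimator is (N/n) * sum(y):
print(weighted_total([2, 10], [0.5, 0.5]))   # 24.0
```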

Example 2.2 continued: By using the values of pi obtained earlier, show that

\[ \frac{1}{N}\sum_{i=1}^{n}\frac{y_i}{\pi_i} \]

is unbiased for the population mean.

In a SRSWOR sampling scheme:


• Probability of any particular sample being chosen is
\[ P(s) = \frac{1}{\binom{N}{n}} = \frac{n!(N-n)!}{N!} \]

• Probability of unit $i$ being selected in the sample is
\[ \pi_i = \frac{n}{N} \]
Proof: (Lohr p44) If unit $i$ is in the sample, then the other $(n-1)$ units must be chosen from the remaining $(N-1)$ units in the population. So we can write:
\[ \pi_i = \frac{\binom{1}{1}\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}. \]

Corollary 3.3 For SRSWOR πi = n/N hence an unbiased estimator for the population
total Y is
\[ y' = \frac{N}{n}\sum_{i=1}^{n} y_i \]
and also, since $\bar{y} = y'/N$,
\[ \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \]

is unbiased for Ȳ .
The estimator y ′ is sometimes called a number raised estimator.

Example 3.2
One eighth of a population is selected by SRSWOR, and the total income for the sample is calculated as $5,200,000. Calculate an estimate of the total income for the population, using correct notation.
Solution:

Theorem 3.4
For SRSWOR, the estimator $y' = \dfrac{N}{n}\sum_{i=1}^{n} y_i = N\bar{y}$ has sampling variance
\[ V(y') = \frac{N^2}{n}\left(1 - \frac{n}{N}\right) S_Y^2, \qquad \text{where } S_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 \]
is the (redefined) population variance and $n/N = f$ is the sampling fraction.

Proof:
\[ V(y') = E[(y')^2] - \{E(y')\}^2, \qquad \text{but } E(y') = Y. \]
\begin{align*}
E[(y')^2] &= \left(\frac{N}{n}\right)^2 E\left[\left(\sum_{i=1}^{n} y_i\right)^2\right] \\
          &= \left(\frac{N}{n}\right)^2 E\left[\sum_{i=1}^{n} y_i^2 + \sum_{i=1}^{n}\sum_{\substack{j=1\\ j\ne i}}^{n} y_i y_j\right]
\end{align*}

extra steps

\[ E[(y')^2] = \left(\frac{N}{n}\right)^2\left[\frac{n}{N}\sum_{i=1}^{N} Y_i^2
   + \frac{n(n-1)}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\ne i}}^{N} Y_i Y_j\right]. \]

Hence
\begin{align*}
V(y') &= \frac{N^2}{n}\left[\frac{1}{N}\sum_{i=1}^{N} Y_i^2
   + \frac{n-1}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\ne i}}^{N} Y_i Y_j - n\bar{Y}^2\right] \\
      &= \frac{N^2}{n}\left[\frac{1}{N}\sum_{i=1}^{N} Y_i^2
   + \frac{n-1}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\ne i}}^{N} Y_i Y_j
   - \frac{n}{N^2}\left(\sum_{i=1}^{N} Y_i^2 + \sum_{i=1}^{N}\sum_{\substack{j=1\\ j\ne i}}^{N} Y_i Y_j\right)\right]
\end{align*}
since
\[ \bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i, \qquad
   \bar{Y}^2 = \frac{1}{N^2}\left[\sum_{i=1}^{N} Y_i^2 + \sum_{i=1}^{N}\sum_{\substack{j=1\\ j\ne i}}^{N} Y_i Y_j\right]. \]

Thus
\begin{align*}
V(y') &= \frac{N^2}{n}\left(1 - \frac{n}{N}\right)
   \underbrace{\left[\frac{1}{N}\sum_{i=1}^{N} Y_i^2
   - \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\ne i}}^{N} Y_i Y_j\right]}_{S_Y^2} \\
      &= \frac{N^2}{n}\left(1 - \frac{n}{N}\right) S_Y^2
\end{align*}

Corollary 3.5
For SRSWOR, the sample mean, ȳ, has sampling variance
\[ V(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{S_Y^2}{n} \]

Proof: ȳ = y ′ /N and the result follows immediately.

The sampling variance of ȳ depends on 3 factors:

• SY2 : the population variance

• n: the sample size

• f = n/N the sampling fraction.

Of these factors, the sampling fraction is generally the least important; it is typically,
but not always, small (much nearer to 0 than to 1).
The term $\left(1 - \dfrac{n}{N}\right)$ is called the finite population correction factor.

Notes

Example 3.3
Consider the population values from Example 2.2: {−1, 2, 4, 10}
If each of the 6 possible samples of size 2 is drawn using SRSWOR, show that the sam-
pling variance of ȳ (i.e. V (ȳ)) is equal to the variance of the mean over all 6 samples.
Hint: first calculate the mean for each of the 6 possible samples of size 2.

Solution:

i 1 2 3 4 5 6

si (1, 2) (1, 3) (1, 4) (2, 3) (2, 4) (3, 4)

p(si )
y1 , y 2

Also verify that for SRSWOR, ȳ is unbiased for Ȳ .

Solution continued.

Corollary 3.6
For SRSWOR the relative variance of $y'$ and also of $\bar{y}$ is
\[ V_{y'}^2 = V_{\bar{y}}^2 = (1 - f)\frac{S_Y^2}{\bar{Y}^2 n} = (1 - f)\frac{V_Y^2}{n},
   \qquad \text{where } V_Y^2 = \frac{S_Y^2}{|\bar{Y}|^2}. \]

This corollary shows that knowledge of $V_Y$ is important. From now on we will refer to $V_Y$ as the population coefficient of variation.

Corollary 3.7
For SRSWOR the RSE of $y'$ and $\bar{y}$ is
\[ V_{y'} = V_{\bar{y}} = \sqrt{1 - f}\,\frac{V_Y}{\sqrt{n}}. \]

3.1.1 Estimating Sampling Variance
An important feature of probability sampling is that, for most designs, the likely precision
of the estimates can be estimated from the sample.

Theorem 3.8
Let $s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ be the sample variance for a SRSWOR; then $E(s_y^2) = S_Y^2$.

Proof:
\[ s_y^2 = \frac{1}{n}\sum_{i=1}^{n} y_i^2
   - \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{\substack{j=1\\ j\ne i}}^{n} y_i y_j. \]
Hence
\[ E(s_y^2) = \frac{1}{n}\cdot\frac{n}{N}\sum_{i=1}^{N} Y_i^2
   - \frac{1}{n(n-1)}\cdot\frac{n(n-1)}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\ne i}}^{N} Y_i Y_j
   = S_Y^2. \]

Corollary 3.9
An estimate of the sampling variance of $y'$ is given by
\[ \widehat{V}(y') = N^2(1 - f)\frac{s_y^2}{n}. \]
$\widehat{V}(y')$ is unbiased for $V(y')$.

In any particular problem, a statistician must be able to specify the design, the estimator
and the associated sampling variance and the variance estimator. The results above
effectively do this for SRSWOR.

For inference, we assume
\[ \frac{y' - Y}{\sqrt{\widehat{V}(y')}} \]
is approximately distributed as $t_{n-1}$. Hence a 95% confidence interval for $Y$ is
\[ \left(y' - t_{n-1,\,0.975}\sqrt{\widehat{V}(y')},\; y' + t_{n-1,\,0.975}\sqrt{\widehat{V}(y')}\right) \]
where, for a two-sided confidence interval with degrees of freedom $df$, $P(-t_{df,1-\alpha/2} < T < t_{df,1-\alpha/2}) = 1 - \alpha$. For example, for a 95% confidence interval with $n = 20$, $\alpha = 0.05$, we have $df = n - 1 = 19$ degrees of freedom and from $t$ distribution tables we find that $t_{n-1,1-\alpha/2} = t_{19,0.975} = 2.093$.

The term which is added to and subtracted from the estimate to obtain the confidence interval is often referred to as the margin of error. Therefore, in this case, the margin of error is $t_{n-1,1-\alpha/2}\,SE(y') = t_{n-1,1-\alpha/2}\sqrt{\widehat{V}(y')}$.

The estimate of the RSE of $y'$ is
\[ \widehat{V}_{y'} = \frac{\sqrt{\widehat{V}(y')}}{|y'|} = \sqrt{1 - f}\,\frac{s_y}{\sqrt{n}\,|\bar{y}|} \]
which is also the estimate of the RSE of $\bar{y}$.
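The pieces above (the number raised estimator, its estimated variance from Corollary 3.9, and the $t$-based interval) fit together as in the sketch below. This is an illustrative helper of my own, assuming scipy is available for the $t$ quantile; it is not code from the notes.

```python
# A minimal sketch: SRSWOR estimate of a population total, its estimated SE,
# and a t-based confidence interval.
from statistics import mean, variance
from scipy.stats import t

def srswor_total_ci(y, N, level=0.95):
    n = len(y)
    f = n / N                                    # sampling fraction
    y_tot = N * mean(y)                          # number raised estimator y'
    v_hat = N ** 2 * (1 - f) * variance(y) / n   # N^2 (1 - f) s_y^2 / n
    se = v_hat ** 0.5
    moe = t.ppf(1 - (1 - level) / 2, df=n - 1) * se   # margin of error
    return y_tot, se, (y_tot - moe, y_tot + moe)

# usage: a SRSWOR of 5 incomes from a population of N = 100 units
print(srswor_total_ci([12.0, 15.5, 9.8, 20.1, 14.2], N=100))
```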

Example 3.4
In a large first year subject with 260 students enrolled at exam time, a lecturer was
interested in student absence from tutorials and recorded the number of tutorials (out of
28) missed by a simple random sample (without replacement) of 20 of those 260 students.
The data obtained are tabulated below together with some additional calculations.

No. missed tutorials y 0 1 2 3 4 5 6 7 8 9 Total

No. students f(y) 8 3 4 1 2 1 0 0 0 1 20

Other information: $\sum y f(y) = 36$ and $\sum y^2 f(y) = 166$

(a) Calculate the mean and standard deviation of the number of missed tutorials for
the students in this sample.

(b) Find an approximate 95% confidence interval for the mean number of missed tu-
torials per student in the whole class. Explain your calculations and avoid any
unnecessary approximations.

(c) State how the confidence interval would change if the sample had come from an
infinite population. Do not do any additional calculations.

Solution:

Solution cont.:

3.2 Proportions or Bernoulli Variables
An important special case occurs when we want to estimate the number or proportion
of the population in some category “c”. For example we might want to estimate the
number or proportion of people unemployed.
\[ Y_i = \begin{cases} 1 & \text{if } i \in c \text{ for some category } c \text{ of the population} \\ 0 & \text{if } i \notin c \end{cases} \]

then $\bar{Y} = P_c$ is the proportion of the population in the category;
and $Y = N P_c$ is the number of population units in the category, and
\[ S_Y^2 = \frac{N}{N-1}\, P_c(1 - P_c) \approx P_c(1 - P_c), \quad \text{for large } N. \]

An unbiased estimate of the population proportion is given by the sample proportion, denoted by $p_c$. Hence, the sample proportion and variance are
\[ \bar{y} = p_c \quad \text{and} \quad s_y^2 = \frac{n}{n-1}\, p_c(1 - p_c) \approx p_c(1 - p_c) \]
where $p_c$ is the sample proportion in category $c$.

The sampling variance of the estimate of the proportion can be obtained by substituting for $S_Y^2$ into $V(\bar{y})$. From Corollary 3.5:
\begin{align*}
V(\bar{y}) &= \left(1 - \frac{n}{N}\right)\frac{S_Y^2}{n} \\
V(p_c) &= \left(1 - \frac{n}{N}\right)\frac{1}{n}\left(\frac{N}{N-1}\right) P_c(1 - P_c) \\
       &= \left(\frac{N-n}{N}\right)\frac{1}{n}\left(\frac{N}{N-1}\right) P_c(1 - P_c) \\
       &= \left(\frac{N-n}{N-1}\right)\frac{P_c(1 - P_c)}{n}
\end{align*}

If $S_Y^2$ is unknown, an estimate may be calculated by just using the sample proportion:
\[ s_y^2 = \frac{n}{n-1}\, p_c(1 - p_c) \]

Thus, the sampling variance can be estimated by using $s_y^2$ in place of $S_Y^2$:
\begin{align*}
\widehat{V}(\bar{y}) &= \left(1 - \frac{n}{N}\right)\frac{s_y^2}{n} \\
\widehat{V}(p_c) &= \left(1 - \frac{n}{N}\right)\frac{1}{n}\left[\frac{n}{n-1}\, p_c(1 - p_c)\right] \\
                 &= \left(1 - \frac{n}{N}\right)\frac{p_c(1 - p_c)}{n-1}
\end{align*}

A $100(1 - \alpha)\%$ confidence interval for $p_c$ is given by
\[ p_c \pm z_{1-\alpha/2}\sqrt{\widehat{V}(p_c)} \]
where, for a two-sided confidence interval, $P(-z_{1-\alpha/2} < Z < z_{1-\alpha/2}) = 1 - \alpha$. For example, for a 95% confidence interval, $\alpha = 0.05$ and from Normal distribution tables we can determine that $z_{1-\alpha/2} = z_{0.975} = 1.96$. For small samples, use $t$ critical values, as before.

The margin of error is $z_{1-\alpha/2}\,SE(p_c)$ and is easily obtained from the calculation of the confidence interval. It has a maximum when $p_c = 0.5$.
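A small sketch of these proportion formulas (the function name and numbers are illustrative, not from the notes): the estimated variance uses the finite population correction and the $n/(n-1)$ adjustment shown above, and the interval uses the Normal critical value.

```python
# A minimal sketch: estimated proportion, its SE with the fpc, and a
# normal-approximation confidence interval p_c +/- z * SE(p_c).
def proportion_ci(count, n, N, z=1.96):
    p = count / n
    var_hat = (1 - n / N) * p * (1 - p) / (n - 1)   # (1 - f) p(1-p)/(n-1)
    se = var_hat ** 0.5
    return p, se, (p - z * se, p + z * se)

# illustrative numbers: 120 units in the category from a SRS of 400 out of N = 5000
print(proportion_ci(120, 400, 5000))
```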

Also, the number raised estimate of the number of people in the category is
\[ y' = N p_c. \]
An estimate of its variance can also be obtained by just using the sample proportion to calculate $s_y^2$, which can then be substituted into $\widehat{V}(y')$.

Exercise: Determine an expression (using $P_c$) for each of the following:
(a) $V_Y^2$; (b) $V_{\bar{y}}^2$; and (c) $V_{\bar{y}}$.

Newspoll Example:

3.2.1 Upper Bound on RSE for Small Proportions


\[ V_Y^2 = \frac{N}{N-1}\,\frac{P_c Q_c}{P_c^2} \approx \frac{Q_c}{P_c} \le \frac{1}{P_c}, \qquad \text{where } Q_c = 1 - P_c, \]
and the relative standard error of $y'$ is
\[ V_{y'} \approx \sqrt{1 - f}\,\sqrt{\frac{Q_c}{n P_c}} \le \frac{1}{\sqrt{n P_c}} \qquad \left(\text{letting } \frac{N}{N-1} = 1\right) \]
where $n P_c$ is the expected sample taken for category $c$.

The corresponding sample estimate of the RSE of $y'$ is
\[ \widehat{V}_{y'} = \sqrt{1 - f}\,\sqrt{\frac{q_c}{n p_c}} \le \frac{1}{\sqrt{n p_c}} \]
where $n p_c$ is the actual sample in category $c$.

e.g. $n p_c = 25$ implies an estimated relative standard error
\[ \widehat{V}_{y'} \le \frac{1}{\sqrt{25}} = 0.2, \quad \text{i.e. } 20\%. \]
Hence a rule of thumb is that if we want the estimated cell totals for an r × c table to
be estimated with “average” RSE of 20% we need a total sample of 25 × r × c. If we
want an “average” RSE of 10% then we should set n = 100 × r × c.
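The rule of thumb can be wrapped up as a one-line calculation; the helper below and its example numbers are mine, not from the notes.

```python
# Sketch of the rule of thumb above: total sample needed so that cell totals of
# an r x c table have an "average" RSE of about `rse`. Since the bound gives
# RSE <= 1/sqrt(n_cell), an average cell needs about 1/rse^2 units.
def table_sample_size(r, c, rse=0.20):
    per_cell = (1 / rse) ** 2        # 25 for a 20% RSE, 100 for a 10% RSE
    return int(per_cell * r * c)

print(table_sample_size(4, 3, rse=0.20))   # 300
print(table_sample_size(4, 3, rse=0.10))   # 1200
```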

3.2.2 Examples
Example 3.5 (from p52 Cochran):
From a list of 3042 names and addresses, a simple random sample of 200 names showed
on investigation 38 wrong addresses. Estimate
(a) the total number of addresses needing correction in the list; and
(b) find the standard error of this estimate.

Example 3.6 (from Lohr p35 eg 2.6):
An SRS of 300 of the 3078 counties in the U.S. collected information including the acreage devoted to farms; 153 counties were found to have less than 200,000 acres of farmland.
(a) Estimate the proportion of counties with less than 200,000 acres of farmland;
(b) Calculate its standard error;
(c) Determine a 95% CI for the proportion.

Newspoll Example cont.:

3.2.3 Percentages vs percentage points


A person's salary increased from $45,000 to $48,000, so the increase is $3,000. This can be expressed as a percentage of the original salary: a 6.7% increase, i.e. the change in salary expressed relative to the original salary.
We can express a proportion, $p_c$, as a percentage, e.g. 0.1 = 10%. Suppose the standard error of $p_c$ is 0.02. We might also express this as a percentage: 2% (percentage points).

Comment: it is important to distinguish between a standard error of 5% (meaning an SE of 0.05 on the proportion, $SE(p) = 0.05$) and a relative standard error of 5% (meaning an SE of 0.05 times the proportion).

3.3 Setting Sample Size
An important step in designing a survey is estimating the size of the sample to be taken
from the population. The implications of the chosen sample size are important in terms of time and money: if the sample is too large, the survey will take longer and cost more than necessary; if it is too small, the survey results will be unreliable and the time and resources expended will have been wasted.

Sample size can be determined based on the precision required for the estimate. Suppose
we want a relative variance of α2 , e.g. 5% RSE corresponds to α = 0.05.

Assume SRSWOR. What size sample should we use?

We have shown (see corollary 3.6) that the relative variance is given by:

\begin{align*}
V_{y'}^2 = V_{\bar{y}}^2 &= (1 - f)\frac{S_Y^2}{\bar{Y}^2 n} \\
  &= (1 - f)\frac{V_Y^2}{n} \\
  &= \left(\frac{1}{n} - \frac{1}{N}\right) V_Y^2
\end{align*}
where $V_Y = \dfrac{S_Y}{|\bar{Y}|}$ is the coefficient of variation of the variable in the population.

Hence, if we require a relative variance of $\alpha^2$, then:
\begin{align*}
\alpha^2 &= \left(\frac{1}{n} - \frac{1}{N}\right) V_Y^2 \\
\alpha^2 + \frac{V_Y^2}{N} &= \frac{1}{n} V_Y^2 \\
\Rightarrow n &= \frac{V_Y^2}{\alpha^2 + \frac{V_Y^2}{N}} = \frac{1}{\frac{\alpha^2}{V_Y^2} + \frac{1}{N}}.
\end{align*}

To determine $n$, we need to know $V_Y^2$, $\alpha^2$, $N$. We thus need to estimate or guess $V_Y^2 = \dfrac{S_Y^2}{\bar{Y}^2}$.

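Once a value for $V_Y$ has been guessed (approaches for doing so are listed below), the formula can be applied directly. The helper below is an illustrative sketch of my own, not part of the notes.

```python
# A minimal sketch of the sample size formula n = V_Y^2 / (alpha^2 + V_Y^2 / N),
# where alpha is the target RSE and cv = V_Y = S_Y / |Ybar| is a guessed
# population coefficient of variation.
import math

def sample_size_for_rse(cv, alpha, N):
    n = cv ** 2 / (alpha ** 2 + cv ** 2 / N)
    return math.ceil(n)

# e.g. a guessed CV of 0.5 and a 5% RSE target in a population of N = 10,000
print(sample_size_for_rse(0.5, 0.05, N=10_000))   # about 100
```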
There are several approaches:

• Roughly estimate the range or the range within which 95 percent of the population
lies and assume a particular distribution e.g Normal (or uniform) and work out
the implied CV.

- Normal distribution: Estimate what range would contain 95% of the values.
Divide by 4 to obtain an estimate of SY .
- Uniform distribution: Estimate the range of the distribution and divide by $\sqrt{12}$.

• Use pilot test data to estimate the CV

• Use previous data (e.g. census) perhaps for a similar variable

Example 3.7
Consider average household income - the average is probably something like 50,000
dollars a year and most would be in the range 20,000 to 150,000 dollars.
(a) If we assume a Normal distribution:
\[ \bar{Y} = 50, \qquad 4 S_Y = 130 \Rightarrow S_Y = \frac{130}{4} \approx 32.5, \qquad \Rightarrow \frac{S_Y}{\bar{Y}} = \frac{32.5}{50} \approx 0.65 \]

Hence, for $\alpha = 0.01$, the required sample size is

n =

(3.1)

Note since $N \approx 5{,}000{,}000$, the term involving $N$ can be ignored in this case.

(b) If we assume a uniform distribution in the population then $S_Y = h/\sqrt{12} = 130/\sqrt{12} \approx 37.5$, in which case $\frac{S_Y}{\bar{Y}} \approx 0.75$.

Hence, for $\alpha = 0.01$, the required sample size is

n =

Example 3.8 (From Cochran p56)
In nurseries that produce young trees for sale it is advisable to estimate, in late winter or
early spring, how many healthy young trees are likely to be on hand, since this determines
policy toward the solicitation and acceptance of orders. A study of sampling methods
for the estimation of the total numbers of seedlings was undertaken by Johnson (1943).
The data that follow were obtained from a bed of silver maple seedlings 1 foot wide and
430 feet long. The sampling unit was 1 foot of the length of the bed, so that N = 430.
By complete enumeration of the bed it was found that $\bar{Y} = 19$, $S^2 = 85.6$, these being
the true population values.
How many trees must be sampled to estimate Ȳ within 10%, apart from a chance of 1
in 20? (Hint: Use the margin of error to firstly calculate the SE and hence the RSE).
Solution:

3.3.1 Sample Size for Estimating Proportions

If the variable is dichotomous (i.e. a Bernoulli or 0/1 variable) then $S_Y^2 \approx PQ$ and $\bar{Y} = P$, so we just need a rough idea of the proportion in the category. The last method is very useful for social surveys since nearly all the estimates produced are of the number or proportion of people in various categories. Even estimates for a variable such as income often take the form of estimates of the number of people in various income ranges.

Suppose we want to estimate a proportion $P$ with a standard error of $SE$; then the relative standard error required is $\alpha = SE/P$. Moreover, $V_Y^2 = \frac{1-P}{P}$. Substituting in the formula for the required sample size gives
\[ n = \frac{1}{\frac{SE^2}{P(1-P)} + \frac{1}{N}} \tag{3.2} \]

If $N$ is large this becomes
\[ n = \frac{P(1-P)}{SE^2} \]
To use this approach all we need is a rough idea of P. For a given SE using P = 0.5 is
a conservative approach.

It is important to distinguish between the required standard error (SE) and the required
relative standard error (α). Suppose that we want to estimate a proportion that we think
is roughly 20 percent, so P = 0.2. We want a confidence interval of plus or minus 2
percentage points - this is sometimes called the margin of error. This corresponds to
SE = 0.01, that is 1 percentage point and α = .01/0.2 = 0.05, which is a relative
standard error of 5 percent (of the 20 percent). For large N this gives n = 1600.
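Equation (3.2) is easy to wrap in a helper; this sketch (function name mine, not from the notes) reproduces the worked case above and shows the effect of including the finite-population term.

```python
# A minimal sketch of equation (3.2): sample size to estimate a proportion P
# with a target standard error SE, with or without the 1/N term.
import math

def sample_size_for_proportion(P, SE, N=None):
    denom = SE ** 2 / (P * (1 - P)) + (0 if N is None else 1 / N)
    return math.ceil(1 / denom)

# the worked case in the text: P roughly 0.2, margin of 2 percentage points
# at 95% confidence, i.e. SE = 0.01, with N large
print(sample_size_for_proportion(0.2, 0.01))          # 1600
print(sample_size_for_proportion(0.2, 0.01, N=5000))  # smaller, about 1213
```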

Example 3.9
A survey showed that 20% of Australians supported an Australia Card in 2000. A new
survey is to be run this year. The aim is to estimate the proportion with a SE of 2%.
(a) What sample size would you use?
(b) What if the aim was to get an RSE of 2%?
Solution:

Example 3.10
A survey is to be run to find out the proportion of people who use public transport, out
of a population of 400. The aim is to estimate this proportion with a SE of 5%. No
other information is given. What sample size would you use?
Solution:

3.4 Estimation of Ratios
Sometimes in surveys we wish to estimate ratios which have a numerator and a denom-
inator, both of which are random variables to be estimated. Examples include the ratio
of profit to employees, or the unemployment rate. This is in contrast to estimating
proportions where p = y/n; y is a random variable and n is a constant.

Suppose for each unit in the population we collect Yi , Xi and we want to estimate

\[ R = \frac{\sum Y_i}{\sum X_i} = \frac{\bar{Y}}{\bar{X}}, \]
which is the ratio of the variables over the whole population.

Example 1: $Y_i$ = profit, $X_i$ = number of employees; $Y/X$ = the ratio of profit to employees.

Example 2: $Y_i = \begin{cases} 1 & \text{if unemployed} \\ 0 & \text{otherwise} \end{cases}$, $\quad X_i = 1$ if in the labour force; $Y/X$ = the unemployment rate.

A “natural” estimator of R = Y /X is r = y ′ /x′ , where y ′ estimates Y and x′ estimates X.

Note:
\[ R = \frac{\sum_{i=1}^{N} Y_i}{\sum_{i=1}^{N} X_i} = \frac{\bar{Y}}{\bar{X}}
   \;\ne\; \frac{1}{N}\sum_{i=1}^{N}\frac{Y_i}{X_i} = \frac{1}{N}\sum_{i=1}^{N} R_i = \bar{R} \]
where $R_i = Y_i/X_i$ is the ratio for the $i$th population unit. That is, the ratio of means $\ne$ the mean of ratios. If we want to estimate $\bar{R}$ then the methods and theory presented in the previous sections of the chapter apply, with $R_i$ replacing $Y_i$. Hence it is important to decide whether the characteristic of the population in which we are interested is $R$ or $\bar{R}$. In most cases it is $R$.

Theorem 3.10

\[ E[r] \approx R\left[1 - V_{y'x'} + V_{x'}^2\right] \]

Proof: (using Taylor Series)

Write
\[ y' = Y\left(1 + \frac{y' - Y}{Y}\right) = Y(1 + \Delta_{y'}), \qquad
   \text{where } \Delta_{y'} = \frac{y' - Y}{Y} \text{ is the relative error,} \]
and similarly for $x'$. Then
\[ r = \frac{y'}{x'} = \frac{Y(1 + \Delta_{y'})}{X(1 + \Delta_{x'})}
     = \frac{Y}{X}(1 + \Delta_{y'})(1 - \Delta_{x'} + \Delta_{x'}^2 - \Delta_{x'}^3 + \ldots) \]
therefore
\[ r = R(1 + \Delta_{y'} - \Delta_{x'} - \Delta_{y'}\Delta_{x'} + \Delta_{x'}^2 + \text{3rd order terms}). \]
Take expectation of both sides.
extra steps:
\[ E(r) = R(1 + 0 - 0 - V_{y'x'} + V_{x'}^2 + \ldots) \]
since $y'$, $x'$ are unbiased.

The proof works provided $y'$, $x'$ are unbiased and the higher order terms (those of order $O(n^{-3/2})$) can be ignored. Hence for SRSWOR
\[ E(r) \approx R\left[1 - V_{y'x'} + V_{x'}^2\right] \]

where
\[ V_{y'x'} = \frac{\mathrm{Cov}(y', x')}{YX} = \frac{(1 - f) S_{YX}}{n \bar{Y}\bar{X}}, \tag{3.3} \]
and where
\[ \mathrm{Cov}(y', x') = N^2(1 - f)\frac{S_{YX}}{n}, \qquad
   S_{YX} = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X}). \]

We can write Theorem 3.10 as
\[ E(r) = R + O(n^{-1}) \]
provided $V_{y'x'}$, $V_{x'}^2$ are $O(n^{-1})$, which they will be for SRSWOR. Thus the bias of the estimate of the ratio will be small provided the ratio is not based on a small sample.

The relative bias of $r$ (to $O(n^{-3/2})$) is
\begin{align*}
\frac{E(r) - R}{R} &= V_{x'}^2 - V_{y'x'} \\
  &= \frac{V(x')}{X^2} - \frac{\mathrm{Cov}(y', x')}{YX} \\
  &= V_{x'}^2\left(1 - \frac{B_{y'x'}}{R}\right)
\end{align*}
where $B_{y'x'} = \dfrac{\mathrm{Cov}(y', x')}{V(x')}$. Hence the bias depends on how close $B_{y'x'}$ is to $R$.

Theorem 3.11
For SRSWOR, to $O(n^{-1})$, the mean square error of $r$ is given by
\[ MSE(r) \approx R^2\left(V_{y'}^2 + V_{x'}^2 - 2V_{y'x'}\right) \]

Proof: As before,
\[ r = R(1 + \Delta_{y'} - \Delta_{x'} - \Delta_{y'}\Delta_{x'} + \ldots) \]
\[ r - R = R(\Delta_{y'} - \Delta_{x'}) + \text{terms which will be at least 3rd order when we square this} \]
\[ (r - R)^2 = R^2\left(\Delta_{y'}^2 - 2\Delta_{y'}\Delta_{x'} + \Delta_{x'}^2\right) + \text{3rd order or higher terms} \]
\[ E(r - R)^2 = R^2\left[V_{y'}^2 + V_{x'}^2 - 2V_{y'x'}\right] + \text{3rd order or higher terms} \]

Corollary 3.12
The relative mean square error of $r$ is
\[ V_r^2 = \frac{MSE(r)}{R^2} = V_{y'}^2 + V_{x'}^2 - 2V_{y'x'} \]

Corollary 3.13
\[ V_r^2 < V_{y'}^2 \quad \text{if} \quad V_{x'}^2 - 2V_{y'x'} < 0, \]
that is, if $\dfrac{1}{2}\dfrac{V_{x'}}{V_{y'}} < \mathrm{corr}(y', x')$.

Corollary 3.14
For SRSWOR, to $O(n^{-1})$,
\[ MSE(r) \approx \frac{1}{X^2}\left[V(y') + R^2 V(x') - 2R\, C(y', x')\right] \]
Proof: Substitute for $V_{y'}^2$ etc. in Corollary 3.12.

Theorem 3.15
For SRSWOR, to $O(n^{-1})$, the $MSE(r)$ may be written in terms of $S_R^2$:
\begin{align*}
MSE(r) &\approx \frac{N^2(1 - f)}{X^2 n}\left[S_Y^2 + R^2 S_X^2 - 2R S_{YX}\right] \\
       &= \frac{(1 - f)}{\bar{X}^2} \times \frac{1}{n} \times \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - R X_i)^2 \\
       &= (1 - f)\frac{1}{n}\frac{S_R^2}{\bar{X}^2}
\end{align*}
where $S_R^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - R X_i)^2$.
(Note: If $X$ were known, we could calculate $y'/X$ and this would have MSE
\[ (1 - f)\frac{1}{n} \times \frac{S_Y^2}{\bar{X}^2}. \]

So, using the sample estimate $x'$ in the calculation of the ratio can lead to a better estimate of $R$ provided $S_R^2 < S_Y^2$.)

Proof of Theorem 3.15: To $O(n^{-1})$,
\[ MSE(r) = \frac{1}{X^2}\left[V(y') + R^2 V(x') - 2R\, C(y', x')\right]. \]
Now
\begin{align*}
V(y') &= N^2(1 - f)\frac{S_Y^2}{n} \\
V(x') &= N^2(1 - f)\frac{S_X^2}{n} \\
C(y', x') &= \frac{N^2(1 - f) S_{YX}}{n} \quad \text{(not shown, but the proof is the same as for } V(y')\text{)}
\end{align*}
Hence
\[ MSE(r) = \frac{N^2(1 - f)}{X^2 n}\left[S_Y^2 + R^2 S_X^2 - 2R S_{YX}\right]. \]
Write
\[ S_R^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - R X_i)^2
        = \frac{1}{N-1}\sum_{i=1}^{N}\left[(Y_i - \bar{Y}) - R(X_i - \bar{X})\right]^2
        = S_Y^2 + R^2 S_X^2 - 2R S_{YX}. \]

Theorem 3.16
For SRSWOR, $S_R^2$ is estimated without bias by
\[ s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - r x_i)^2. \]
Hence $MSE(r)$ can be estimated by
\[ \widehat{MSE}(r) = \frac{(1 - f)}{\bar{x}^2}\frac{1}{n}\, s_r^2. \]

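Putting Theorems 3.15 and 3.16 together, the ratio estimate and its estimated root mean square error can be computed as in the sketch below (the function name and data are illustrative, not from the notes).

```python
# A minimal sketch of the ratio estimator and its estimated MSE under SRSWOR.
from statistics import mean

def ratio_estimate(y, x, N):
    n = len(y)
    f = n / N
    r = mean(y) / mean(x)                       # r = y'/x' = ybar / xbar
    s2_r = sum((yi - r * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)
    mse_hat = (1 - f) * s2_r / (n * mean(x) ** 2)
    return r, mse_hat ** 0.5                    # estimate and its root-MSE

# illustrative data: y = papers, x = staff, for a SRSWOR of n = 3 from N = 20
print(ratio_estimate([40, 120, 300], [25, 60, 140], N=20))
```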
Example 3.11
A SRSWOR of 6 universities is selected from a population of 36 universities to estimate
the average number of academic papers published in a year per academic staff member.
The following data are obtained:

Sample university No. papers No. staff members

1 263 154
2 1604 743
3 4210 1420
4 407 194
5 738 303
6 504 320

Calculate an estimate of
(a) the ratio of number papers per academic staff member;
(b) the square root of the mean square error of this estimate.
Solution:

Solution cont.:

3.5 Helpful background information
3.5.1 Alternative formulas

For exercises involving calculations by calculator it is useful to note the following iden-
tities
\[ \sum_{i=1}^{n}(y_i - r x_i)^2 = \sum_{i=1}^{n} y_i^2 + r^2\sum_{i=1}^{n} x_i^2 - 2r\sum_{i=1}^{n} y_i x_i \]
\[ \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \]

Because of the squaring and subtraction involved in these calculations it is important to use a reasonable number of significant figures in your calculations.

3.5.2 Taylor Series Expansion


The Taylor Series expansion is a useful method in statistics for deriving approximate
expectations and variances.

For functions with one variable:

If the value of a function $f(x)$ can be expressed in a region of $x$ close to $x = a$ by the infinite power series
\[ f(x) = f(a) + \frac{(x-a)}{1!} f'(a) + \frac{(x-a)^2}{2!} f''(a) + \frac{(x-a)^3}{3!} f'''(a) + \ldots + \frac{(x-a)^n}{n!} f^{(n)}(a) + \ldots \]
then $f(x)$ is said to be analytic in the region near $x = a$, and the series above is unique and called the Taylor series expansion of $f(x)$. (Hornbeck Ch2)

It can also be written as
\[ f(x) - f(a) = \int_a^x f'(t)\,dt = (x-a)f'(a) + \frac{(x-a)^2}{2!} f''(a) + \frac{(x-a)^3}{3!} f'''(a) + \ldots \]

For functions of two variables, we use the partial derivatives in a region of $x$ close to $x = a$ and $y$ close to $y = b$:
\[ g(x, y) = g(a, b) + \frac{(x-a)}{1!}\frac{\partial g}{\partial x} + \frac{(y-b)}{1!}\frac{\partial g}{\partial y}
   + \frac{(x-a)^2}{2!}\frac{\partial^2 g}{\partial x^2} + \frac{(y-b)^2}{2!}\frac{\partial^2 g}{\partial y^2}
   + \frac{2(x-a)(y-b)}{2!}\frac{\partial^2 g}{\partial x\,\partial y} + \ldots \tag{3.4} \]
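As a quick numerical sanity check of a one-variable Taylor expansion (my own illustration, not from the notes): truncating the series for $e^x$ about $a = 0$ after the quadratic term leaves an error of order $x^3$, which the snippet below displays.

```python
# Truncated Taylor expansion of e^x about 0: e^x ~ 1 + x + x^2/2 for small x.
import math

for x in [0.5, 0.1, 0.01]:
    exact = math.exp(x)
    approx = 1 + x + x ** 2 / 2
    print(x, exact - approx)     # error shrinks roughly like x^3 / 6
```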

Example 3.12
Use the Taylor series expansion about x = 0 to obtain an approximation to
\[ f(x) = \frac{1}{1+x}. \]
Solution:

3.6 References:
Cochran, W.G. (1977) Sampling Techniques, 3rd. ed.: New York; John Wiley.
Hornbeck, R. W. (1975) Numerical Methods, Quantum publishers, New York.

3.7 Additional Reading:


Cochran (1977), Sections 2.1 to 2.11, 3.1 to 3.3, 4.1 to 4.7.
Lohr (1999), Sections 2.3 to 2.5, 2.7, 2.9.

Chapter 4

Systematic Sampling

In the previous chapter, the theory of SRSWOR was discussed. One way to implement
a SRSWOR is to randomly order the list of population units, select a random number
between 1 and N/n and take that unit and every N/nth unit thereafter. When the
list has not been randomly ordered (i.e. it has been purposely ordered according to
a particular variable) the method is called systematic sampling. The initial random
number is called the random start and N/n is the skip interval. Because of the use of
the random start this method is still a probability sampling method - it is not purposive
selection. It is important that you do not start the selection at the first unit, unless that
happens to be the randomly selected start.

If N/n is not an integer, we can round it to an integer, that is use k = int(N/n). The
sample size then will not then be exactly n, and in estimation the achieved sample size
should be used. Alternatively we can use a non-integer skip, which is a little more
complicated.
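A sketch of how a 1-in-$k$ systematic selection might be coded (the function name is mine; the notes do not prescribe an implementation): order the frame, draw a random start, then take every $k$th unit.

```python
# A minimal sketch of 1-in-k systematic selection with a random start.
import random

def systematic_sample(frame, n):
    N = len(frame)
    k = N // n                                   # integer skip, k = int(N/n)
    start = random.randint(0, k - 1)             # random start (0-based)
    return frame[start::k]                       # achieved size may differ from n

units = list(range(1, 101))                      # a frame of N = 100 units
print(systematic_sample(units, 10))              # e.g. [7, 17, 27, ...]
```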

e.g. a 1-in-200 sample to select 500 out of 100,000: random start 146, then 346, 546, etc.

Systematic sampling is often used because of its convenience. There are only as many
different samples as there are random starts. In the above example there are only
200 samples instead of $\binom{100000}{500}$. This can be beneficial if the samples that have been
eliminated are ones that would give estimates a long way from the population value
being estimated. This will occur if we order by a variable that is related to the variables
being estimated. For example by ordering businesses by size we remove the possibility
of having all small or all large units in the sample, since it must include a unit from the
smallest k, one from the second smallest group of k etc. Another common application of
systematic sampling is to order units in a serpentine geographic way. You can think of

systematic sampling being a weak form of stratification, although the independence of
the selection of units between strata is not fulfilled (see next chapter). In fact systematic
sampling is a special case of cluster sampling.

Provided a sensible ordering has been used, there will usually be some reduction in
sampling variance through the use of systematic sampling. Most of the time the worst
that can happen is that the ordering ends up being close to random, because there is
little or no relationship between the ordering variable and the variable of interest. The
main problem to watch out for is periodicity in the list which is related to the skip being
used, eg sampling production every Friday or Monday; sampling every fourth flat in
block of flats with four flats on each floor.

Example 4.1 (adapted from Lemeshow, S. and Levy, P.S. (2008) Ch4.)
A nurse attended to a total of 12 patients on a particular day and the time spent per
patient was recorded. The data in Table 4.1 are listed in the order the nurse saw the
patients. Table 4.2 has the same data as Table 4.1 but ordered by decreasing time spent
with the patient.

Table 4.1: Nurse visits - unordered

Patient Time (in mins)

1 15
2 34
3 35
4 36
5 11
6 17
7 49
8 40
9 25
10 46
11 33
12 14

The population mean is Ȳ = 29.583 and the finite population variance is SY2 = 166.99.

(a) Consider all possible SRSWOR of size 3.

(i) How many possible samples are there?


(ii) Determine the sampling variance and standard error of the mean time in
minutes spent with patients, using formula given in chapter 3.

Table 4.2: Nurse visits - ordered by decreasing time

j Patient Time (in mins)

1 7 49
2 10 46
3 8 40
4 4 36
5 3 35
6 2 34
7 11 33
8 9 25
9 6 17
10 1 15
11 12 14
12 5 11

(b) Using a 1 in 4 systematic sampling process, determine the 4 possible systematic


samples by ordering the list by the time spent with the patient (refer to Table
4.2).

(i) Calculate the mean for each sample;


(ii) Determine the sampling variance and standard error of the mean using your
answers from part (i).

(c) Compare your results for (a) and (b). Discuss.

Note: In this exercise, all possible samples are determined in (b) for the purpose of
comparison. In a practical situation, only one of these would be carried out by using a
random start.

Solution: (a)

Solution cont.:

4.1 References:
Lemeshow, S. and Levy, P.S. (2008) Sampling of Populations: Methods and Applications,
4th edition. New York: John Wiley.

4.2 Additional Reading:


Cochran (1977), Sections 8.1, 8.2.
Lemeshow, S. and Levy, P.S. (2008) Ch4.
Lohr (1999), Section 2.6, 5.6.

Chapter 5

Stratified Sampling

5.1 Introduction
SRS is rarely used in practice; it is usually possible to do better. Unless V_Y is small, you
often need large samples to get good RSEs.

Stratification involves using auxiliary information for all units in the population. We
divide the population into H mutually exclusive and exhaustive groups called strata and
then take a sample from each stratum independently of the sample in the other strata.
Examples

1. For a population of businesses possible stratification variables are

• type of industry (ANZSIC);


• geographic location;
• size (employment).

2. For a population of households or people we could stratify by

• geographic location (SLA, postcode)

3. For a population of university students we could stratify by

• faculty;
• location;
• type (UG, PG, Mature Age);
• sex;
• year of enrolment.

Figure 5.1: Stratified sampling. Source: http://simon.cs.vt.edu/SoSci/converted/Sampling/

The stratification variables must be known for each population unit before the sample
is selected. An important feature of any sampling frame is what stratification variables
are available on it. Sampling frames must also have unique unit identifiers and contact
details such as address.

5.1.1 Benefits of Stratification:


Stratification

• often gives lower variance for fixed cost compared with not stratifying (i.e. SRSWOR):

  – true for proportional allocation because, as we shall see, we eliminate the
    between stratum component of variance;

  – using optimal allocation we can concentrate the sample where there is higher
    variability and lower cost;

• permits use of different sampling and estimation methods in different strata;

• allows sufficient sample size in groups of particular interest.

However, if for some reason, the allocation is a long way from proportional or optimal,
the sampling variance can be greater than for SRSWOR.

5.1.2 Decisions to be made:


• what variable(s) to use as stratification variable(s);

• how many strata;

• precise definition of strata;

• allocation of sample to strata;

• selection and estimation methods to be used - may differ between strata.

5.2 Notation
The total number of strata is denoted by H, with individual stratum denoted by h, such
that h = 1 . . . H. The subscript h can then be attached to notation previously used for
SRSWOR.
N_h is the population size in stratum h
n_h is the number of units selected by SRSWOR from stratum h
Y_{hi} is the value of the ith population unit in stratum h
y_{hi} is the value of the ith unit selected from stratum h

N = \sum_{h=1}^{H} N_h is the population size

n = \sum_{h=1}^{H} n_h is the sample size

Y = \sum_{h=1}^{H} \sum_{i=1}^{N_h} Y_{hi} = \sum_{h=1}^{H} Y_h is the overall population total,

i.e. Y_h = \sum_{i=1}^{N_h} Y_{hi} is the total in stratum h

\bar{Y}_h = Y_h / N_h is the population mean in stratum h

\bar{y}_h = y_h / n_h is the sample mean in stratum h

5.3 Definitions and Basic Properties
Assuming SRSWOR is used within all strata, for stratum h

y_h' = \frac{N_h}{n_h} \sum_{i=1}^{n_h} y_{hi} is unbiased for Y_h = \sum_{i=1}^{N_h} Y_{hi}

and

V(y_h') = \frac{N_h^2}{n_h} \left(1 - \frac{n_h}{N_h}\right) S_h^2   (applying Theorem 3.4 to y_h')

where

S_h^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2 is the population variance in stratum h.

Also the estimate of the variance of the estimate of the total for stratum h is

\hat{V}(y_h') = \frac{N_h^2}{n_h} \left(1 - \frac{n_h}{N_h}\right) s_h^2

where

s_h^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2

Theorem 5.1
Suppose y_h' is an estimate of the total of the variable of interest in stratum h. Then

y' = \sum_{h=1}^{H} y_h' has:

E(y') = \sum_{h=1}^{H} E(y_h')   (depends only on y' being linear)

V(y') = \sum_{h=1}^{H} V(y_h')   (since we sample independently between strata)

Hence, applying Theorem 5.1 we obtain

Theorem 5.2
Using SRSWOR within strata,

y' = \sum_{h=1}^{H} y_h' = \sum_{h=1}^{H} \frac{N_h}{n_h} y_h is unbiased for Y = \sum_{h=1}^{H} Y_h

Also \bar{y} = y'/N is unbiased for \bar{Y}.

The sampling variance of y' is

V(y') = \sum_{h=1}^{H} \frac{N_h^2}{n_h} \left(1 - \frac{n_h}{N_h}\right) S_h^2

and

\hat{V}(y') = \sum_{h=1}^{H} \frac{N_h^2}{n_h} \left(1 - \frac{n_h}{N_h}\right) s_h^2

is an unbiased estimator of V(y').

To obtain an estimate of the population mean when using stratified sampling:

1. apply Theorem 5.2 to obtain y',

2. then \bar{y} = y'/N is an unbiased estimate of \bar{Y}.

Hence, to obtain the sampling variance of the estimate of the mean:

1. apply Theorem 5.2 to obtain V(y') or \hat{V}(y'),

2. then V(\bar{y}) = V(y')/N^2 and similarly \hat{V}(\bar{y}) = \hat{V}(y')/N^2.
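These steps translate directly into code. A minimal Python sketch (the function name and the per-stratum summary inputs are illustrative choices, not from the notes):

    def stratified_estimates(strata):
        """Stratified SRSWOR estimates following Theorem 5.2.
        strata: list of dicts with keys N, n, ybar, s2 (per-stratum size, sample size,
        sample mean and sample variance). Returns y', Vhat(y'), ybar and Vhat(ybar)."""
        N = sum(st["N"] for st in strata)
        y_total = sum(st["N"] * st["ybar"] for st in strata)            # y' = sum_h N_h * ybar_h
        v_total = sum(st["N"] ** 2 / st["n"] * (1 - st["n"] / st["N"]) * st["s2"]
                      for st in strata)                                  # Vhat(y')
        return y_total, v_total, y_total / N, v_total / N ** 2

    # purely hypothetical two-stratum illustration
    example = [{"N": 2400, "n": 240, "ybar": 10.0, "s2": 4.0},
               {"N": 1600, "n": 160, "ybar": 20.0, "s2": 9.0}]
    print(stratified_estimates(example))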

5.3.1 Probability of selection

The probability (and expected number of times) a unit in stratum h is selected is
\pi_{hi} = n_h / N_h. The estimator can be written as

y' = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \frac{1}{\pi_{hi}} y_{hi} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} w_{hi} y_{hi}

where w_{hi} = \pi_{hi}^{-1}. This shows we can use unequal probabilities of selection provided
we account for it in the estimation.

Example 5.1 (Eg. 4.1 from Lohr)
We want to estimate the total number of acres devoted to farming in the United States.
Using the 4 census regions (Northeast, North Central, South, West) as strata, an SRS of
10% of the counties in each stratum is selected. The following data are obtained:

Region No. of counties Sample size Sample Mean Sample Variance


Northeast 220 21 97,629.8 7,647,472,708
North Central 1054 103 300,504.2 29,618,183,543
South 1382 135 211,315.0 53,587,487,856
West 422 41 662,295.5 396,185,950,266
Total 3078 300

Solution:
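As a numerical cross-check (not a substitute for the worked solution), the per-stratum summaries above can be plugged directly into the Theorem 5.2 formulas, for example in Python:

    regions = [{"N": 220,  "n": 21,  "ybar": 97629.8,  "s2": 7647472708.0},
               {"N": 1054, "n": 103, "ybar": 300504.2, "s2": 29618183543.0},
               {"N": 1382, "n": 135, "ybar": 211315.0, "s2": 53587487856.0},
               {"N": 422,  "n": 41,  "ybar": 662295.5, "s2": 396185950266.0}]

    y_total = sum(r["N"] * r["ybar"] for r in regions)                      # estimated total acres
    v_total = sum(r["N"] ** 2 / r["n"] * (1 - r["n"] / r["N"]) * r["s2"]
                  for r in regions)                                         # estimated variance
    print(round(y_total), round(v_total ** 0.5))                            # total and its standard error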

5.4 Allocation of Sample
An important decision to be made is how many units to select from each stratum. There
are three main ways of allocating sample to strata: proportional allocation, optimal
allocation and equal sampling variance.

5.4.1 Proportional Allocation


The number of sampled units in each stratum is proportional to the population size of
the stratum, i.e.

n_h \propto N_h

\Rightarrow n_h = \frac{N_h}{N} \times n

\Rightarrow \frac{n_h}{N_h} = \frac{n}{N} = f

In this design, each unit has the same chance of selection and the sampling fraction or
rate is the same in each stratum. Rounding and non-response mean that this rarely
happens exactly.
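A short sketch of this allocation in Python (the stratum sizes and the simple nearest-integer rounding here are illustrative; in practice the rounded n_h may need adjusting so they sum to n):

    def proportional_allocation(N_h, n):
        """Allocate a total sample of n to strata in proportion to the stratum sizes N_h."""
        N = sum(N_h)
        return [round(n * Nh / N) for Nh in N_h]

    # purely illustrative stratum sizes
    print(proportional_allocation([5000, 3000, 2000], n=200))   # -> [100, 60, 40]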

Example 5.2
A population consists of 2400 men and 1600 women. It is desired that the total sample
size is 10% of the population.
(a) Calculate the proportional allocation;
(b) Determine the probability of selection for each stratum.

Solution:
(a)

N = 2400 + 1600 = 4000


n = 0.1 × 4000 = 400

So allocation is calculated by

nM =

nF =

b) Let πhi be the probability of selection of unit i in stratum h. Then

π1i =

π2i =

Substituting the expression for n_h into the formula for V(y') in Theorem 5.2 gives

V_{prop}(y') = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n}{N}\right) \frac{S_h^2}{n N_h / N}

             = \frac{(1 - f)}{f} \sum_{h=1}^{H} N_h S_h^2

             = \frac{N - n}{n} \sum_{h=1}^{H} N_h S_h^2

V_{prop}(y') = N^2 \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_{h=1}^{H} \frac{N_h}{N} S_h^2

Notice that this is of the same form as for SRSWOR in Theorem 3.4 with S_Y^2 replaced by

\sum_{h=1}^{H} \frac{N_h}{N} S_h^2,

which is the weighted average of the stratum population variances.

Consider the ANOVA table:

Component        df       Sum of Squares

Between Strata   H - 1    \sum_{h=1}^{H} \sum_{i=1}^{N_h} (\bar{Y}_h - \bar{Y})^2 = \sum_{h=1}^{H} N_h (\bar{Y}_h - \bar{Y})^2 = (H - 1) S_b^2

Within Strata    N - H    \sum_{h=1}^{H} \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2 = \sum_{h=1}^{H} (N_h - 1) S_h^2 = (N - H) S_W^2

Total            N - 1    \sum_{h=1}^{H} \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y})^2 = (N - 1) S^2

These equations define S_b^2 and S_W^2 as the between strata and within strata mean squares
respectively.

For an unstratified design:

V_{SRS}(y') = N^2 \left(1 - \frac{n}{N}\right) \frac{S^2}{n}

            = N^2 \left(1 - \frac{n}{N}\right) \frac{1}{n} \left[ \frac{1}{N-1} \sum_{h=1}^{H} (N_h - 1) S_h^2 + \frac{(H-1)}{(N-1)} S_b^2 \right]

            = V_{prop}(y') + O(N^{-1}) + N^2 \left(1 - \frac{n}{N}\right) \frac{1}{n} \frac{(H-1)}{(N-1)} S_b^2

Hence, to O(N^{-1}), V_{SRS}(y') \ge V_{prop}(y').


Basically, we eliminate the between stratum component of variance. This suggests

• form strata to make Sb2 as large as possible and hence SW


2 as small as possible

• form homogeneous strata ⇔ make strata means as different as possible.

It is possible (but hard) to find a stratification such that V_{prop} > V_{SRS}, but this rarely
happens in practice. Usually the worst that happens is that your gains are small, but often
they are large.

Example 5.3
For the following design data

Stratum   N_h   Ȳ_h     S_h    N_h S_h^2

1         480   4060    2274
2         750   8453    3974
3         900   17819   6415

(a) Calculate the proportional allocations assuming a total sample size of 100.

(b) Round the resulting stratum sizes to integers and calculate the associated relative
standard error on the estimate of the population total Y for the two allocations.

Solution:

Solution cont.:

5.4.2 Optimal Allocation
Another way to allocate the sample to the strata is called optimal allocation. Optimal
allocation is designed to minimize the variance of estimates referring to the whole pop-
ulation. It is optimal according to a given constraint such as fixed sample size or fixed
cost.

Fixed sample size


Theorem 5.3

Taking

n_h = \frac{N_h S_h}{\sum_h N_h S_h} \, n          (5.1)

minimises V(y') subject to n = \sum_{h=1}^{H} n_h fixed.

Proof: Recall Theorem 5.2:

V(y') = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h}

      = \sum_{h=1}^{H} \frac{N_h^2 S_h^2}{n_h} - \sum_{h=1}^{H} N_h S_h^2

We want to minimise a function subject to a linear constraint, hence we will use
Lagrangian methods.

Consider

F = V(y') + \lambda \left( \sum_{h=1}^{H} n_h - n \right)
  = \sum_{h=1}^{H} \frac{N_h^2 S_h^2}{n_h} - \sum_{h=1}^{H} N_h S_h^2 + \lambda \left( \sum_{h=1}^{H} n_h - n \right)

\frac{dF}{dn_h} = -\frac{N_h^2 S_h^2}{n_h^2} + \lambda = 0

\Rightarrow n_h = \frac{N_h S_h}{\sqrt{\lambda}}
dnh n2h

N h Sh
⇒ nh = √
λ

Use the constraint to obtain \lambda:

\sum_{h=1}^{H} n_h = \sum_{h=1}^{H} \frac{N_h S_h}{\sqrt{\lambda}}

n =

\lambda =

\sqrt{\lambda} = \frac{\sum_{h=1}^{H} N_h S_h}{n}

extra steps:

n_h = \frac{N_h S_h}{\sum_{h=1}^{H} N_h S_h} \times n
h=1 Nh Sh

This result can be re-expressed as

n_h / N_h = f_h \propto S_h

which implies using a higher sampling fraction in the more heterogeneous strata, where
heterogeneity is measured by the population standard deviation, S_h.

n_h \propto N_h S_h implies putting more of the sample in the strata with high S_h and N_h.

This allocation is called an “optimal” allocation. It is only optimal for the constraints
given. Different constraints give different optimal allocations. Optimal allocation is also
called the Neyman allocation.

By construction,

V_{opt} \le V_{prop}.
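A sketch of the Neyman allocation in Python (the design values S_h would come from past data or guesses, as discussed later in Section 5.4.5; all numbers below are illustrative):

    def neyman_allocation(N_h, S_h, n):
        """Optimal (Neyman) allocation for fixed total sample size: n_h proportional to N_h * S_h."""
        w = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
        total = sum(w)
        return [round(n * wi / total) for wi in w]

    # purely illustrative design values
    print(neyman_allocation([5000, 3000, 2000], [2.0, 8.0, 20.0], n=200))   # -> [27, 65, 108]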

Proportional allocation is the same as optimal allocation if the S_h are constant, i.e. the same
for all strata. This can happen for geographic strata and a 0/1 variable, since then
S_h^2 \approx P_h (1 - P_h), and this will be approximately constant if the proportions do not vary
much across areas. If the variances within each stratum differ, then optimal allocation
will give a smaller sampling variance of the estimate for the whole population than
proportional allocation.
To calculate the optimal allocation you need information to estimate or guess the values
of Sh .

Corollary 5.4
For an optimal allocation the sampling variance of y' is

V_{opt}(y') = \frac{\left( \sum_{h=1}^{H} N_h S_h \right)^2}{n} - \sum_{h=1}^{H} N_h S_h^2

Note: if rounding has been used to obtain n_h, the actual sampling variance is not exactly
as given in Corollary 5.4. It may be wiser to use the formula given in Theorem 5.2. This
comment also applies for proportional allocation.

Example 5.3 cont.


We will now calculate the optimal allocation and the associated RSE for n = 100 given
the data in example 5.3. Compare your result with that given by proportional allocation.

Solution: Set up a table

Stratum   N_h   Ȳ_h     S_h    N_h S_h   n_h   N_h^2 S_h^2

1         480   4060    2274
2         750   8453    3974
3         900   17819   6415
Total

Solution cont.:

Fixed cost
Costs may vary across strata. To take this into account a simple cost function can be
used as an approximation to the real cost structure, such as

Cost = C_0 + \sum_{h=1}^{H} C_h n_h

The same sort of approach gives the optimal allocation for fixed cost

n_h = n \cdot \frac{N_h S_h / \sqrt{C_h}}{\sum_h N_h S_h / \sqrt{C_h}}

which leads to putting less sample in the strata which are more expensive to enumerate.
The value of n is obtained from the cost constraint. This gives the final allocation as

Corollary 5.5
The optimal allocation for a linear cost function is given by

n_h = (Cost - C_0) \cdot \frac{N_h S_h / \sqrt{C_h}}{\sum_h \sqrt{C_h} N_h S_h}

Because cost enters the allocation only through \sqrt{C_h}, there has to be quite a degree of
variation in costs between strata before it is worthwhile taking them into account.
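A sketch of this cost-based allocation in Python (the cost structure and design values are illustrative only):

    def cost_optimal_allocation(N_h, S_h, C_h, budget, C0=0.0):
        """Optimal allocation for the linear cost function Cost = C0 + sum(C_h * n_h):
        n_h proportional to N_h * S_h / sqrt(C_h), scaled so the variable budget is spent."""
        w = [Nh * Sh / Ch ** 0.5 for Nh, Sh, Ch in zip(N_h, S_h, C_h)]
        denom = sum(Ch ** 0.5 * Nh * Sh for Nh, Sh, Ch in zip(N_h, S_h, C_h))
        return [(budget - C0) * wi / denom for wi in w]

    # illustrative values: the second stratum is four times as expensive per unit
    alloc = cost_optimal_allocation([5000, 3000], [2.0, 8.0], C_h=[1.0, 4.0], budget=500.0)
    print([round(a, 1) for a in alloc])   # spends the budget: sum(C_h * n_h) = 500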

5.4.3 Equal Sampling Variance


Equal allocation involves taking the same size sample from each stratum. Thus, n_h = n/H.
This would be the allocation of choice if the main objective is to test hypotheses regarding
differences among the strata for the variable of interest, under the assumption that the
stratum variances are equal. That is, if the strata are the domains of interest (e.g.
states) then we can allocate to give each domain the same relative variance. If S_h/\bar{Y}_h is
approximately constant, this gives approximately the same sample size in each domain,
which may be strata.
If the assumption of equal variances cannot be made, then the allocation may be deter-
mined using the result from Section 3.3 and applying it to strata:

n_h = \left( \frac{\alpha_h^2}{V_{Y_h}^2} + \frac{1}{N_h} \right)^{-1} \approx \frac{V_{Y_h}^2}{\alpha_h^2} = \frac{S_{Y_h}^2 / \bar{Y}_h^2}{\alpha_h^2}

Note that if S_{Y_h}^2 / \bar{Y}_h^2 and \alpha_h^2 are equal across strata, then this implies equal n_h.

5.4.4 Power Allocation

Sometimes in a survey, reliable estimates are required at both the national level and
for regional areas. Bankier (1988) provides a simple allocation method which allows a
compromise between Neyman or optimal allocation and equal allocation. He calls the
allocation method a power allocation.
To determine sample sizes for each stratum, the loss function F, given by

F = \sum_h X_h^{2q} \, V(y_h') / Y_h^2

is minimised subject to the constraint \sum_h n_h = n. X_h is some measure of size or
importance of stratum h (it could be Y_h or N_h), and q is a constant in the range 0 \le q \le 1,
called the power of the allocation. The result is as follows:

n_h = n \, \frac{S_{Yh} X_h^q / \bar{Y}_h}{\sum_h S_{Yh} X_h^q / \bar{Y}_h}

Note that if q is set to 1 and X_h = Y_h, then the result for n_h is the Neyman allocation.
If q = 0, and the S_{Yh}/\bar{Y}_h are not equal, then the allocation for n_h, given n, can be
determined by

n_h = n \, \frac{S_{Yh} / \bar{Y}_h}{\sum_h S_{Yh} / \bar{Y}_h}.

If q = 0 and the S_{Yh}/\bar{Y}_h are similar between strata, then the allocation becomes
equal, n_h = n/H (see Section 5.4.3). A compromise between optimal allocation and
equal allocation can be achieved by setting q to a value between 0 and 1 (Bankier, 1988,
p.174). In practice, a value of q = 0.5 is often used.
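A sketch of the power allocation in Python (q, the size measures X_h and the design values below are all illustrative):

    def power_allocation(S_Yh, Ybar_h, X_h, n, q=0.5):
        """Bankier's power allocation: n_h proportional to S_Yh * X_h**q / Ybar_h.
        q = 1 with X_h = Y_h reproduces the Neyman allocation; q = 0 moves towards
        equal allocation when the S_Yh / Ybar_h are similar."""
        w = [S * X ** q / Yb for S, Yb, X in zip(S_Yh, Ybar_h, X_h)]
        total = sum(w)
        return [n * wi / total for wi in w]

    # illustrative values, using the stratum totals Y_h = N_h * Ybar_h as the size measure
    N_h, S_Yh, Ybar_h = [5000, 3000, 2000], [2.0, 8.0, 20.0], [10.0, 40.0, 100.0]
    X_h = [Nh * Yb for Nh, Yb in zip(N_h, Ybar_h)]
    print([round(a, 1) for a in power_allocation(S_Yh, Ybar_h, X_h, n=200, q=0.5)])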

5.4.5 Allocation in Practice


In order to carry out the allocation, there may be other decisions that need to be made
or other information which is required. Some of these are listed below:

• must decide which variable to base the allocation on.

• where does the design information that allows calculation of Sh come from?

– use previous data or census data


– make up using range type methods
– model and simulate
– pilot test data

– use a related variable (e.g. calculate Sh for stratification variable say employ-
ment and hope it is good for the variable of interest (say turnover) or use past
or pilot data to build a regression model between say turnover and employ-
ment. Using this model we can relate Sh for turnover to Sh for employment.)

• multipurpose surveys ⇒ multivariate allocation.


• compromise different requirements (see e.g. above)
• compromise optimal allocation for 2 variables
• some strata are completely enumerated, so n_h = N_h and there is no sampling
variance arising from these strata - also called 'take all' strata and typically used
for large and essentially unique units
• the theoretical derived design standard errors are usually optimistic i.e. you do
not usually get as good as expected.

5.5 Variables to use as Stratification Variables


We have to form strata using variables that are available to all units on the sampling
frame. Ideally, we use the variable(s) most closely related to what we are trying to
estimate.
Common examples of stratification:
• geography - state, region, LGA, postcode;
• size - no. beds (hospitals), no. employees or turnover (businesses);
• time (to account for seasonality);
• type - public/private, type of industry, male/female, car model.

Problems arise in balancing the use of the different stratification variables available since
the number of strata can become very large and we need at least 2 responding units per
stratum for variance estimation. If we use ratio estimation within strata, a minimum of
6 respondents in each stratum may be required.
e.g. n = 2000 retailers, stratification 8 (states) × 15 (industry types) × 4 (sizes) = 480
strata, allowing only an average of 4 selections per stratum.

5.6 Number of Strata


The number of strata formed is influenced by the output requirements and total sample
size available, but for size stratification it is usually not worth going past 6 strata. (See
Cochran 5A.8).

5.7 Choosing Stratum Boundaries
In many surveys, the stratum boundaries are determined by the information available.
If the variable chosen is a size variable such as number of employees or turnover in a
business survey, the boundaries may be chosen to minimise the variance of the estimates.

For an optimal allocation

V_{opt}(y') = \frac{\left( \sum_{h=1}^{H} N_h S_h \right)^2}{n} - \sum_{h=1}^{H} N_h S_h^2   ( = f(y_0, \ldots, y_H) )

Both N_h and S_h are functions of the strata used.

Let y_0, y_1, \ldots, y_H be the stratum boundaries (we are considering size stratification using
y as the size variable).

V_{opt}(y') is minimised, ignoring the term \sum_{h=1}^{H} N_h S_h^2 due to the finite population
correction factor (fpc), if the boundaries satisfy the equations:

\frac{(y_h - \bar{Y}_h)^2 + S_h^2}{S_h} = \frac{(y_h - \bar{Y}_{h+1})^2 + S_{h+1}^2}{S_{h+1}},   h = 1, 2, \ldots, H - 1

But since \bar{Y}_h and S_h depend on the boundaries, this would have to be solved iteratively.

5.7.1 Dalenius and Hodges Method


Dalenius and Hodges suggest a quick approximate solution.

• divide the population into "fine" substrata according to the size variable. Calculate
  \sqrt{d_k f_k}, where d_k = width of the kth interval and f_k = number of population units in
  the interval;

• cumulate \sqrt{d_k f_k};

• select the boundaries so that \sum_{k \in h} \sqrt{d_k f_k} is approximately constant for each stratum.

e.g. Y = Number of employees

The boundaries correspond to cumulative values of 242.5 and 485.0

⇒ stratum 1 (0 to 4)
   stratum 2 (5 to 14)
   stratum 3 (15+)
(We would probably take those over 100 as a completely enumerated stratum.)

Number of employees   d_k   f_k      \sqrt{d_k f_k}   cumulative \Sigma
0-4                    5    10,000   223.6            223.6
5-9                    5     4,000   141.4            365.0
10-14                  5     2,000   100.0            465.0
15-19                  5     1,000    70.7            535.7
20-29                 10       300    54.8            590.5
30-39                 10       120    34.6            625.1
40-49                 10       100    31.6            656.7
50-69                 20        80    40.0            696.7
70-89                 20        30    24.5            721.2
90-99                 10         4     6.3            727.5
100+                             4

727.5 / 3 ≈ 242.5

In practice, this method is applied to the stratification variable. This procedure produces
approximately optimal stratum boundaries for the stratification variable and we hope
these are good for the variable of interest also.
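The cumulative \sqrt{d_k f_k} rule is straightforward to script. A rough Python sketch, using the substrata of the table above and taking, for each cut point total/H, 2·total/H, ..., the interval whose cumulative value is closest (one simple way of making the sums "approximately constant"):

    def dh_boundaries(substrata, H):
        """Dalenius-Hodges cum sqrt(f) rule.
        substrata: list of (upper_end, d_k, f_k) for the fine substrata, in order.
        Returns the chosen stratum boundaries (upper ends of substrata)."""
        cums, cum = [], 0.0
        for upper, d, f in substrata:
            cum += (d * f) ** 0.5
            cums.append((upper, cum))
        total = cum
        boundaries = []
        for j in range(1, H):
            target = total * j / H                                    # 242.5 and 485.0 when H = 3
            upper, _ = min(cums, key=lambda uc: abs(uc[1] - target))  # closest cumulative value
            boundaries.append(upper)
        return boundaries

    substrata = [(4, 5, 10000), (9, 5, 4000), (14, 5, 2000), (19, 5, 1000), (29, 10, 300),
                 (39, 10, 120), (49, 10, 100), (69, 20, 80), (89, 20, 30), (99, 10, 4)]
    print(dh_boundaries(substrata, H=3))   # -> [4, 14], i.e. strata 0-4, 5-14, 15+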
Other approximations suggested are:

• choose the boundaries such that N_h S_h is approximately constant, in which case
  the optimal allocation is equal sample size in each stratum. If S_h is proportional
  to Ȳ_h, i.e. constant coefficient of variation, then this implies N_h S_h is proportional
  to N_h Ȳ_h = Y_h, the stratum total. This leads to forming strata with approximately
  the same total of the variable in each stratum.

• choose the boundaries such that N_h (y_h - y_{h-1}) is approximately constant; in this
  case, if the distribution within each stratum is approximately uniform, then
  S_h \propto (y_h - y_{h-1}) and again the optimal allocation is approximately equal sample
  sizes within each stratum.

This suggests that a rough check that stratum boundaries are close to optimal is whether the
optimal allocation is close to equal sample numbers per stratum.
In practice, it is more important to have a near optimal allocation than optimal bound-
aries.

Explanation of the Dalenius and Hodges method

Suppose f(t) is the pdf of the variable of interest. Then

W_h = \frac{N_h}{N} = \int_{y_{h-1}}^{y_h} f(t) \, dt \approx f_h (y_h - y_{h-1})

so

N_h \approx N f_h (y_h - y_{h-1})

assuming an approximately uniform distribution in the interval (y_{h-1}, y_h). Also

S_h \approx \frac{1}{\sqrt{12}} (y_h - y_{h-1})

for small intervals. Multiplying the respective sides by N_h \approx N f_h (y_h - y_{h-1}) gives

N_h S_h \approx \frac{N}{\sqrt{12}} f_h (y_h - y_{h-1})^2.

Hence, taking the sum over h, and then dividing by N,

\sum_h N_h S_h \approx \frac{N}{\sqrt{12}} \sum_h f_h (y_h - y_{h-1})^2

\sum_h W_h S_h \approx \frac{1}{\sqrt{12}} \sum_h f_h d_h^2 = \frac{1}{\sqrt{12}} \sum_h \left( \sqrt{f_h} \, d_h \right)^2

and, since \sum_h \sqrt{f_h} d_h is fixed by the cumulation, this sum is minimised when each term
\sqrt{f_h} d_h is the same.

5.8 References:
Bankier, M.D. (1988) Power Allocations: Determining Sample Sizes for Subnational
Areas. The American Statistician, Volume 42, Number 3.

5.9 Additional Reading:


Cochran (1977), Sections 5.1 to 5.12, 5A.7, 5A.8, 5A.10.
Lemeshow, S. and Levy, P.S. (2008) Ch 5, 6.
Lohr (1999), Sections 4.1, 4.6.

Chapter 6

Ratio Estimation

6.1 Introduction and Notation


Suppose we take an SRSWOR of n out of N units, then as previously discussed, the
number raised estimator of the population total is

y' = \frac{N}{n} \sum_{i=1}^{n} y_i

We have shown that this estimate is unbiased for Y , and we have determined the sam-
pling variance of this estimator (theorem 3.4).
Sometimes, there is available useful information such as another variable that can be
used to determine an alternative estimator of the population total, Y . The additional
variable is often referred to as the auxiliary or benchmark variable. We will assume that
for each unit, in addition to the variable of interest Yi , we have some auxiliary variable
Zi , known for each unit in the population. We could therefore use Zi in the design
as a stratification variable, but we will consider how we might use this information in
estimation. We can calculate

Z = \sum_{i=1}^{N} Z_i

which is the population total of the auxiliary variable, and similarly the sample total is
given by

z = \sum_{i=1}^{n} z_i.

The ratio Z/z is a check on how well the sampling worked. If Z/z is very different from
N/n then it suggests the sample has under or over-represented the smaller units. To
compensate, multiply by

\frac{Z/N}{z/n} = \frac{Z}{z'} = \frac{Z}{\frac{N}{n} z}

This suggests we weight by Z/z instead of N/n, resulting in the ratio estimator,
which will be denoted by y'':

y'' = \frac{Z/N}{z/n} \times y'

    = \frac{Z}{z} \sum_{i=1}^{n} y_i

    = Z \frac{y'}{z'}

    = Z r   where   r = \frac{y'}{z'}.

The latter form shows that y ′′ is the sample ratio of the variable of interest to the
benchmark variable multiplied by the benchmark total.
For example, if y = turnover, z = employment, then to estimate total turnover we
estimate the turnover to employment ratio from the sample and multiply it by the total
employment for the population which is available from some other source.
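A minimal sketch of the ratio estimator in Python (the sample values and benchmark total below are purely illustrative):

    def ratio_estimate(y, z, Z):
        """Ratio estimator of the population total: y'' = Z * r, with r = sum(y_i) / sum(z_i)."""
        r = sum(y) / sum(z)
        return Z * r, r

    # illustrative numbers: y = variable of interest, z = benchmark, Z = known benchmark total
    total, r = ratio_estimate(y=[12.0, 8.0, 15.0], z=[30.0, 20.0, 40.0], Z=5000.0)
    print(round(r, 4), round(total, 1))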

Other examples,
y z
turnover past employment
employment now past employment
earnings past employment
retail sales past retail sales

Often the benchmark variable is the same variable at a previous point of time, then r is
the growth rate or factor.

Since y ′′ is just a ratio multiplied by a constant, the properties of a ratio estimator follow
immediately from those of a ratio, considered in section 3.4, with z replacing x in the
formulas.

6.2 Properties of Ratio Estimation


Theorem 6.1

E[y''] \approx Y (1 - V_{y'z'} + V_{z'}^2)

       = Y + O(n^{-1})   provided V_{y'z'} and V_{z'}^2 are O(n^{-1})

Proof: We saw previously in Theorem 3.10:

E(r) \approx R (1 - V_{y'z'} + V_{z'}^2)

where R = Y/Z, hence

E(y'') = E(Zr) \approx Z R (1 - V_{y'z'} + V_{z'}^2) \approx Y (1 - V_{y'z'} + V_{z'}^2)

For SRSWOR, the bias is given by

E(y'') - Y = \frac{N^2 (1 - f)}{n Z} S_Z^2 \left[ R - \frac{S_{YZ}}{S_Z^2} \right]

Proof: do as a class exercise.

Note the bias is O(n^{-1}), so it might be important if n is small - this becomes an issue
for within stratum ratio estimation (see Section 6.4).

Theorem 6.2
The mean square error of the ratio estimator y'' is given by

MSE(y'') = Y^2 (V_{y'}^2 + V_{z'}^2 - 2 V_{y'z'})

Proof: follows from Theorem 3.11, i.e. MSE(r) = R^2 (V_{y'}^2 + V_{z'}^2 - 2 V_{y'z'}).

Corollary 6.3
The relative MSE of y'' is

V_{y''}^2 = \frac{MSE(y'')}{Y^2} = V_r^2 = V_{y'}^2 + V_{z'}^2 - 2 V_{y'z'}

The results so far apply no matter what design and estimation methods are used, provided
E(y') = Y, E(z') = Z and the higher order terms of the Taylor Series expansion
can be ignored.

Theorem 6.4
For an SRSWOR of size n drawn from a population of size N, to O(n^{-1}),

MSE(y'') = N^2 (1 - f) \frac{1}{n} \left( S_Y^2 + R^2 S_Z^2 - 2 R S_{YZ} \right)

         = N^2 (1 - f) \frac{1}{n} \frac{1}{N - 1} \sum_{i=1}^{N} (Y_i - R Z_i)^2

         = N^2 (1 - f) \frac{1}{n} S_R^2

Proof: See Theorem 3.15 and multiply by Z^2.

We use s_r^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - r z_i)^2 to estimate S_R^2 (approximately
unbiased), and substitute in the above to obtain an estimate of MSE(y'').
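A sketch of this estimated MSE in Python (SRSWOR is assumed; the inputs are illustrative):

    def ratio_mse_estimate(y, z, N):
        """Estimated MSE of the ratio estimator under SRSWOR (Theorem 6.4 with s_r^2):
        N^2 (1 - f) s_r^2 / n, where s_r^2 = sum((y_i - r*z_i)^2)/(n - 1) and r = sum(y)/sum(z)."""
        n = len(y)
        r = sum(y) / sum(z)
        s_r2 = sum((yi - r * zi) ** 2 for yi, zi in zip(y, z)) / (n - 1)
        return N ** 2 * (1 - n / N) * s_r2 / n

    # illustrative values only
    print(round(ratio_mse_estimate(y=[12.0, 8.0, 15.0, 9.0], z=[30.0, 20.0, 40.0, 25.0], N=500), 1))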

6.2.1 Comparison of Ratio Estimator with Number Raised Estimator
We now have two estimators for the population total: the number raised estimator (y ′ )
and the ratio estimator (y ′′ ). We can compare their mean square errors to determine
which estimator is the better choice.
We have already seen that (see Corollary 3.13)

V_r^2 < V_{y'}^2   if   \rho_{y'z'} > \frac{1}{2} \frac{V_{z'}}{V_{y'}}

and so, since V_{y''}^2 = V_r^2, this condition also implies

V_{y''}^2 < V_{y'}^2.

This condition relates to \rho_{y'z'}, the correlation between the two estimators y' and z'. For an
SRSWOR design

\rho_{y'z'} = \frac{S_{YZ}}{S_Y S_Z} = \rho_{YZ}

where S_{YZ} = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})(Z_i - \bar{Z}), and so the condition is equivalent to

\rho_{YZ} > \frac{1}{2} \frac{V_Z}{V_Y},

the correlation of the variables in the population.

Often Z_i is the same variable as Y_i measured at a previous point in time and so V_Y \approx V_Z,
and the condition simplifies to \rho_{YZ} > 1/2. Furthermore, it is often the case that the
benchmark is less variable than the variable of interest, in which case V_Z/V_Y < 1 and
so in this case, the bound is less than 1/2.

It is necessary to check that

\rho_{YZ} > \frac{1}{2} \frac{V_Z}{V_Y}   where   V_Y = \frac{S_Y}{|\bar{Y}|}   and   V_Z = \frac{S_Z}{|\bar{Z}|}

since use of ratio estimation when the condition is not satisfied will lead to V_{y''}^2 > V_{y'}^2.

To check if ratio estimation is better than number raised estimation we can check the
correlation condition. Alternatively, calculate the variance of the ratio estimator, or an
estimator of it, corresponding to Theorem 6.2 (or Theorem 6.4), and compare it with the
variance of the number raised estimator (Theorem 3.4).

Example 6.1
A SRSWOR of 5 large retailers is selected from a population of 357 and the following
data are obtained
Sample Retailer No. Employees Turnover($millions)

1 1,050 169
2 1,270 163
3 608 120
4 829 94
5 1,509 263

Total 5266 809

It is known that the total number of employees in the population is 370,128.

(a) Based on the data obtained, do you think ratio estimation using the number of em-
ployees as the benchmark variable would be better than number raised estimation
for estimating the total turnover of the population of 357 large retailers? Justify
your answer without doing any formal calculations.

(b) Calculate the number raised estimate of total turnover and an estimate of the
sampling variance of this estimate.

(c) Calculate the ratio estimate of total turnover using the number of employees as
the benchmark variable and an estimate of the MSE of this estimate.

(d) Compare your answers to parts (b) and (c). Is ratio estimation better than number
raised estimation in this example? Justify your answer. Does this agree with your
response to part (a)?

(e) Use an alternative method to answer the question in part (d).

(f) Calculate the estimate of the average turnover per employee and an estimate of its
standard error.

Solution:
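A sketch of the main calculations for parts (b) and (c), in Python, using the data above and the formulas of Theorems 3.4 and 6.4 (the worked class solution supplies the interpretation):

    N, Z = 357, 370128
    y = [169.0, 163.0, 120.0, 94.0, 263.0]        # turnover ($ millions)
    z = [1050.0, 1270.0, 608.0, 829.0, 1509.0]    # number of employees
    n = len(y)
    f = n / N

    # (b) number raised estimate of total turnover and its estimated variance
    ybar = sum(y) / n
    s2_y = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    y_nr = N * ybar
    v_nr = N ** 2 * (1 - f) * s2_y / n

    # (c) ratio estimate using employees as the benchmark, and its estimated MSE
    r = sum(y) / sum(z)
    y_ratio = Z * r
    s2_r = sum((yi - r * zi) ** 2 for yi, zi in zip(y, z)) / (n - 1)
    mse_ratio = N ** 2 * (1 - f) * s2_r / n

    print(round(y_nr), round(v_nr), round(y_ratio), round(mse_ratio))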

Solution cont.:

Solution cont.:
6.3 Ratio Estimation Under a Super-population Model
To gain some insight into when ratio estimation is useful, we assume the population
values follow a “super-population” model. In this approach we assume that the popula-
tion values are selected from some “super population” or generated by some stochastic
process giving the population values
Y_1, \ldots, Y_N
Z_1, \ldots, Z_N

A sample is selected giving the sample values

y_1, \ldots, y_n
z_1, \ldots, z_n

In this case we will assume that the population values are generated from a linear
regression model

Y_i = \alpha + \beta Z_i + \epsilon_i   where   E_\xi[\epsilon_i | Z_i] = 0

The \xi subscript is used to denote taking expectations over the superpopulation or
stochastic process involved in generating the population values. Then

E_\xi[Y | Z] = N \alpha + \beta Z      (recall Y = \sum_{i=1}^{N} Y_i)

E_\xi[\bar{Y} | \bar{Z}] = \alpha + \beta \bar{Z}      (6.1)

Now

S_{YZ} = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})(Z_i - \bar{Z})

       = \frac{1}{N-1} \sum_{i=1}^{N} (\alpha + \beta Z_i + \epsilon_i - \alpha - \beta \bar{Z} - \bar{\epsilon})(Z_i - \bar{Z})

       = \frac{1}{N-1} \sum_{i=1}^{N} \left( \beta (Z_i - \bar{Z}) + (\epsilon_i - \bar{\epsilon}) \right)(Z_i - \bar{Z}).

Taking ξ expectations gives:

E_\xi[S_{YZ}] = \frac{\beta}{N-1} \sum_{i=1}^{N} (Z_i - \bar{Z})^2 + \frac{1}{N-1} E_\xi\left[ \sum_i (\epsilon_i - \bar{\epsilon})(Z_i - \bar{Z}) \right]
             = \beta S_Z^2

The ξ expectation of the bias of y'' is thus:

E_\xi[E(y'') - Y] = E_\xi[-V_{y'z'} + V_{z'}^2] \, Y

   = E_\xi\left[ -\frac{S_{YZ}}{\bar{Y}\bar{Z}} + \frac{S_Z^2}{\bar{Z}^2} \right] \frac{Y(1-f)}{n}   for SRSWOR

   = E_\xi\left[ \frac{(1-f)}{n} \left( \frac{S_Z^2 Y}{\bar{Z}^2} - \frac{N S_{YZ}}{\bar{Z}} \right) \right]

   = \frac{(1-f)}{n} \frac{S_Z^2}{\bar{Z}^2} \left[ N\alpha + \beta Z - \beta Z \right]   (recall E_\xi[Y] = N\alpha + \beta Z)

   = \frac{(1-f)}{n} \frac{S_Z^2}{\bar{Z}^2} N\alpha

So the expected bias depends on the intercept and is 0 if \alpha = 0.

We can also use this model to explain the meaning of the condition:

\rho_{YZ} > \frac{1}{2} \frac{S_Z/\bar{Z}}{S_Y/\bar{Y}},   which is equivalent to   S_{YZ} > \frac{1}{2} \frac{\bar{Y}}{\bar{Z}} S_Z^2.

Replacing terms by the ξ-expectation gives:

\beta S_Z^2 > \frac{1}{2} S_Z^2 \frac{(\alpha + \beta\bar{Z})}{\bar{Z}}

i.e. \beta\bar{Z} > \frac{1}{2}(\alpha + \beta\bar{Z})   (assuming \bar{Z} > 0)

i.e. \beta\bar{Z} > \alpha

Since this condition relates to E_\xi(V_{y''}^2), the ratio estimate is superior, in expectation,
when the condition is fulfilled, even if there is some expected bias due to a non-zero
intercept. These results make no assumption about V_\xi(\epsilon_i | Z_i).

The use of a statistical model for the population values to guide us in understanding the
properties of design based estimators is called model assisted sampling.

The ratio estimator can be thought of as arising from a regression through the origin. It
can also be written as

y'' = y' + \frac{y'}{z'}(Z - z').

We can generalise this to the regression estimator

ŷreg = y ′ + β̂(Z − z ′ ),

where β̂ is some estimator of the slope of the linear regression relating the variable of
interest to the auxiliary variable. This approach can be easily generalised to include
information about several auxiliary variables and more complex designs.

6.4 Use of Ratio Estimation with Stratification


Usually stratification is also used with ratio estimation, sometimes using the same
auxiliary variable. There are two ways of combining ratio estimation with stratification:

• stratum by stratum

  {}_s y'' = \sum_{h=1}^{H} Z_h \frac{y_h'}{z_h'} = \sum_{h=1}^{H} y_h''

• across stratum

  {}_a y'' = Z \frac{y'}{z'} = Z \frac{\sum_{h=1}^{H} y_h'}{\sum_{h=1}^{H} z_h'}

Theorem 6.5

MSE({}_s y'') = \sum_{h=1}^{H} MSE(y_h'')   (by linearity and independence between strata)

              = \sum_{h=1}^{H} N_h^2 \frac{(1 - f_h)}{n_h} \left( S_{hY}^2 + R_h^2 S_{hZ}^2 - 2 R_h S_{hYZ} \right)   (apply Thm 6.4 within strata)

              = \sum_{h=1}^{H} N_h^2 \frac{(1 - f_h)}{n_h} S_{hR}^2   where   S_{hR}^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (Y_{hi} - R_h Z_{hi})^2

Note that Y_{hi} is the value for the ith unit in the hth stratum. This formula for the MSE
involves a separate ratio for each stratum, R_h = Y_h / Z_h.

To estimate MSE({}_s y'') we replace S_{hR}^2 by

s_{hr}^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - r_h z_{hi})^2

Theorem 6.6

MSE({}_a y'') = Y^2 (V_{y'}^2 + V_{z'}^2 - 2 V_{y'z'})   (apply Thm 6.2)

              = V(y') + R^2 V(z') - 2 R C(y', z')

Now expand each term to reflect the stratification:

              = \sum_{h=1}^{H} N_h^2 \frac{(1 - f_h)}{n_h} \left[ S_{hY}^2 + R^2 S_{hZ}^2 - 2 R S_{hYZ} \right]

              = \sum_{h=1}^{H} N_h^2 \frac{(1 - f_h)}{n_h} \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (Y_{hi} - R Z_{hi})^2.

This formula is similar to MSE({}_s y'') but with R replacing R_h for each stratum.

To estimate MSE({}_a y'') replace

\frac{1}{N_h - 1} \sum_{i=1}^{N_h} (Y_{hi} - R Z_{hi})^2   by   \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - r' z_{hi})^2   where   r' = \frac{y'}{z'}.

Which is better?
Consider

MSE({}_a y'') - MSE({}_s y'')                                              (6.2)

   = \sum_{h=1}^{H} N_h^2 \frac{(1 - f_h)}{n_h} \left( (R^2 - R_h^2) S_{hZ}^2 - 2(R - R_h) S_{hYZ} \right)

   = \sum_{h=1}^{H} N_h^2 \frac{(1 - f_h)}{n_h} \left( (R - R_h)^2 S_{hZ}^2 + 2(R_h - R)(S_{hYZ} - R_h S_{hZ}^2) \right).

Now apply the result on the bias of the ratio estimator within strata. We can show that:

Bias(y_h'') = E[y_h''] - Y_h = N_h^2 \frac{(1 - f_h)}{n_h Z_h} (R_h S_{hZ}^2 - S_{hYZ}).

Hence,

MSE({}_a y'') - MSE({}_s y'') = \sum_{h=1}^{H} N_h^2 \frac{(1 - f_h)}{n_h} (R - R_h)^2 S_{hZ}^2
                               + \sum_{h=1}^{H} 2(R_h - R) Z_h \left( -Bias(y_h'') \right).

We have seen before that the bias of the ratio estimator depends on n^{-1}, hence Bias(y_h'')
depends on n_h^{-1}. This suggests that if the stratum sample sizes are reasonable, the
difference in MSE depends mainly on the variation of the R_h. If the R_h vary a lot then
{}_s y'' would have lower MSE. But if the sample sizes within strata are small, as is often
the case, the second term may be important and {}_a y'' may have smaller MSE than {}_s y''.
To determine which is better it is necessary to do the calculations.

In practice there is a conflict between using across stratum ratio estimation with fine
stratification and stratum-by-stratum ratio estimation with a broad stratification. More-
over, once stratification is introduced, the condition for ratio estimation to be better than
number raised changes. For stratum by stratum it becomes:

\rho_{hYZ} > \frac{1}{2} \frac{S_{hZ}/\bar{Z}_h}{S_{hY}/\bar{Y}_h}.

Stratification tends to reduce \rho_{hYZ} but also S_{hZ}/\bar{Z}_h.

For {}_a y'', the condition becomes:

\rho(y', z') > \frac{1}{2} \frac{V_{z'}}{V_{y'}},

where \rho(y', z') is the correlation between the number raised estimates calculated from
the stratified sample and is not \rho_{YZ}.

6.5 Additional Comments

• If the appropriate conditions are not met, ratio estimation can be worse than
number raised. You need to watch out for defunct units and zeros; these reduce
correlations a lot.

• We can use ratio estimation in some strata in which it is beneficial and not in
others.


• Do not need Z_i to be known for all population units, just need Z and z = \sum_{i=1}^{n} z_i
  (unlike stratification or PPS selection methods).

• Can use ratio estimation for some variables, but not others (c.f. stratification
which affects all variables).

• Leads to a minimum sample size constraint of, say, 5 or 6 per stratum, because of the
potential bias in small samples.

• Can lead to very large gains.

6.6 Additional Reading:
Cochran (1977), Sections 6.1 to 6.12.
Lohr (1999), Sections 3.1, 3.2, 3.4.

Chapter 7

Other Sampling Designs

In this chapter, a brief description of cluster sampling and multi-stage sampling will be
given. The theory behind these designs will be covered in your next course in sampling.

7.1 Introduction to Cluster Sampling

In cluster sampling, instead of selecting a sample of population units directly we select


sampling units which are groups or clusters containing several population units. The
sampling unit and the population unit differ. The sampling units are called clusters. If
we select all population units from each selected cluster, we have cluster sampling. (If
we select a subsample of the units in the selected clusters, we have multistage sampling).
Examples:
cluster population unit
households people
employer employee
block household
hospital patient
school student
flight passenger

Each population unit must be uniquely identified with one and only one cluster through
well constructed and applied coverage rules.

We use cluster and multistage sampling for one or both of the following reasons:

(i) a suitable sampling frame of population units does not exist but a list of clusters
does;

(ii) cost - a clustered sample is usually less costly than an unclustered sample of the
same size in terms of population units.

For cluster sampling, the probability a population unit is selected is the probability the
cluster containing the unit is selected.

An important example: area sampling. There is no absolutely complete list of households


in Australia. But there is a complete set of maps developed for the population census,
which provides a complete list of geographic areas in Australia, called census districts
(CD’s). On average a CD contains 200-250 dwellings. In area sampling, we select a
sample of CD’s and then select all households (for a cluster sample) or a sample of
households (for a multistage sample) within the CD. We can then select all or a sample
of people from the selected households.

For example, if we want to determine how many computers are owned per household in
a community of 10,000 households, we could take a simple random sample of a selection
of households. Alternatively, we could take a sample of CD’s within the community and
then survey every household in the selected CD’s. The CD’s are the primary sampling
units and the household is the population unit.

However, for many variables there is often the penalty of higher sampling variances than
for a simple random sample with the same sample size. This is due to the tendency of
members within a cluster to be similar while large differences can occur between clusters.
In practice, the size of a cluster sample often needs to be larger than that for a simple
random sample in order to compensate for the higher sampling variance.

We would prefer if the clusters are as heterogeneous as possible, but many of the clusters
that arise naturally are reasonably homogeneous. This contrasts with stratification
where we want homogeneous strata. This is because in stratified sampling we include all
strata in the sample and hence eliminate the between strata component of variance. In
cluster sampling, we eliminate the within cluster component of variance, since all units
in a cluster are selected but only a sample of clusters is taken.

7.2 Introduction to Multi-stage Sampling


A sampling scheme in which a sample of population units is selected from each cluster
in a sample of clusters is called a two-stage sampling scheme. The clusters are usually
called First Stage Units or Primary Sampling Units (PSU’s).

In our previous example of the number of computers per household, an alternative


sampling design could be a two-stage design where we take a sample of CD’s within the
community and then take a sample of households in the selected CD’s.

The design effect of a cluster sample will be large when the clusters are very homogeneous
or, in many cases, when the clusters are large. In both these situations consideration
may be given to including only a sample of population units from each selected cluster.
The money saved by including only a sample of population units from each selected
cluster may then be spent by including more clusters in the sample. Because of the

costs involved with selecting clusters the total number of population units in the sample
will be reduced but the sample will be more spread and this may compensate for the
reduced sample size leading to estimates with smaller sampling variance. One of the
main problems in designing such samples is to determine what size subsample to take
to optimally balance cost and sampling variance.

In many situations, the problems of compiling lists of population units and travel between
selected population units are present even within selected first stage units. Consideration
is thus given to selecting the sample of population units within selected first stage units
by grouping the population units into second stage units, a sample of which is selected.
The population units are then selected from selected second stage units. This is called
three stage sampling. Clearly this process can be continued to any number of stages.
A multistage sample can be defined as one which is selected in stages, the sample units
at each stage being subsampled from the larger units chosen at the previous stage. At the
first stage the entire population is divided into First Stage (or Primary Sampling) Units.
At each successive stage smaller sampling units are defined within those selected at the
previous stage and a further selection is made within each of them. At each stage a list
of units from which the selections are to be made is required only within units selected at
the previous stage.
In multistage sampling the probability that a population unit is selected is the proba-
bility the cluster, i.e. PSU, containing the unit is selected multiplied by the conditional
probability the unit is selected given the cluster it is in is selected.
Multistage sampling is especially important where the population units are geographi-
cally spread and there is no list of them. The units of selection are then usually areas
of land and this is called area sampling.
The set of all selected population units in a selected PSU is sometimes called an ultimate
cluster.
Multistage sampling is a very flexible technique since many aspects of the design have
to be chosen; including the number of stages and for each stage

• the unit of selection

• the method of selection (eg PPS or equal probability, systematic or simple random)

• number of units selected.

Moreover, stratification and ratio estimation may be used. This flexibility means that
there is large scope for meeting the demands of a particular survey in the most efficient
way and hence good opportunity for the sampling statisticians to practice their craft.

Appendix A

Surveys and Sampling

Professor David Steel

School of Mathematics and Applied Statistics

University of Wollongong

A.1 Experiments and Observational Studies


There are 2 broad classes of research investigations
• Controlled experiments - where the investigator introduces changes into a process
and makes observations to evaluate the effect of these changes.
• Observational studies - where the investigator does not interfere with the process
but only observes what happens. This includes sample surveys, quasi or pseudo
experiments.

Both classes of investigations can give evidence of association between variables, but
only controlled experiments can give evidence of causation, provided other factors have
been properly accounted for in the design of the experiment.
An important class of observational studies are sample surveys which if done properly will
represent the population from which the sample is selected well, i.e. have strong external
validity. They permit analysis of relationships over a large number of different groups in
the population. There are issues with internal validity because of the self-selection of the
treatment or independent variables and lack of control of other factors. Other important
types of observational studies are retrospective and prospective studies. Observational
studies can suffer from lack of representation of the population, at worst even self-
selection of inclusion in the study, i.e. volunteers. Experiments and observational studies
can play complementary roles in investigating an issue.

Surveys are usually conducted to provide a description of a population. This usually
involves estimation of features of the population such as totals, means, proportions,
the number of units in various categories and ratios. Often the major outputs from a
survey are a number of tables. Surveys can also be used for analytical purposes such
as investigating the association between two or more variables. In this situation we
are usually interested in associations that apply more generally than just the particular
population surveyed at a particular time.

A.2 Overview of the Survey Process


In its widest sense a survey is any process that involves the collection of information
about some population. The population of interest will often be some group of people,
but may be a group of businesses, institutions, events or episodes. In general a population
consists of a group of units about which you wish to draw conclusions.
Surveys vary widely in the subjects they cover, the methods used, the size and complexity
and the purposes they fulfill. Conducting a survey is a process that involves a number
of steps that must fit together well. To ensure everything fits together the whole survey
process must be properly planned. While the steps involved follow a logical sequence
there is always a degree of iteration involved in the development of a survey. Decisions
have to be reviewed in light of later developments. Sufficient time for the development
phase must be allowed.
The keys to conducting a successful survey are:

• have clear aims,

• test and evaluate all the processes involved.

There are three main phases in conducting a survey:

• development - e.g developing the questionnaire or collection instrument,

• operational - actually collecting the information,

• analysis and reporting - producing estimates, tables etc.

A common fault in conducting a survey is lack of effort in the development phase,


especially testing of the key aspects such as the questionnaire.

A.2.1 Steps in the Survey Process
Survey development

• determine objectives
• determine resources available and constraints
• review alternative sources of information
• specify population of interest
• identify research issues
• decide data items and classifications
• determine precision required
• decide type of investigation needed
• determine collection method
• develop collection instrument
• specify sampling method
• develop and plan survey operations

Survey operations

• recruitment and training of operational staff


• despatch and collection control
• data collection
• data capture
• input editing
• output editing

Survey Analysis and Reporting

• calculation of estimates
• production of tables, charts and diagrams
• identifying important subgroups and relationships
• calculation of sampling errors
• report preparation

Evaluation

The relative importance of these steps will vary between projects. Some sampling related
issues are discussed here.

A.3 Specifying the Population of Interest

The specification of the population should clearly define the group about which we wish
to make conclusions. It should cover the definition of units, scope, geographic coverage,
and reference period.
For example, suppose we wish to survey Doctors in the Illawarra. We must first decide
exactly what a Doctor is. Exactly what constitutes the Illawarra? Do we want Doctors
who live in the Illawarra or those that work in the Illawarra? Is it the actual Doctors we
are interested in, or their practices or offices? What period are we concerned with, a
particular week or a financial year? If the latter, are Doctors that only practice part of
the year included? The answers to these questions depend on the purposes of the study.
When initially defining the population we should not overly concern ourselves about the
feasibility of obtaining information on all or a sample of the population, although as we
develop the survey we may have to define a survey population that does not correspond
precisely to the target population. Availability of data may influence the definition of
the unit.

A.4 Sampling Frames

To conduct a survey we must be able to identify the units in the population and include
all or a sample of them. This means we have to have access to, or construct, a sampling
frame. In most cases the frame will be list of the population units and some way of
contacting them, such as a list of all businesses and their address and contact names,
positions or telephone numbers. Often the list available does not correspond to the
target population and we must decide if we can proceed with a survey population that
differs from the target population, e.g. members of the AMA instead of all Doctors.
Lists always have some problems:

• omissions

• duplicates

• ceased units

• units not in scope

• incorrect or out-of-date names and addresses or other contact information.

Some judgement has to be made as to how serious these problems might be and what
steps can be taken to overcome them. You must plan how to handle these problems in
the survey operations and estimation phases. Information on the likely quality of the
list should be obtained; it may even be necessary to do a small pilot test to gauge the
extent of these problems.

In some situations no list exists but by use of a technique known as multi-stage sampling
it is still possible to obtain a valid sample of units through a sampling frame of higher
level units through which the population units can be identified. For example, to obtain
a sample of hospital patients we could select a sample of hospitals and for the selected
hospitals select a sample of wards and then a sample of patients. Even to get a sample
of private households we might need to start with a list of streets if we do not have a
satisfactory list of dwellings.

A.5 Precision Required

In practice, the size of a survey is determined by the funds available and it is important to
consider whether the size of the survey possible will be of any real use. In looking at the
usefulness of a proposed survey the value of the information to be obtained is determined
by what it adds to what is already known. If there is virtually no information available
then even a small study will be quite valuable, but if reliable and detailed information
is already available then we must critically examine what the proposed survey will add.
The reliability of the estimates from a survey depends on the errors that are affecting
the survey. Groves (1989), Chapter 1, gives an excellent review of the potential sources
of survey errors.

• Sampling error: if instead of including all units in the population in the survey a
sample is selected then the estimates will differ from the result that a complete
enumeration would give. The size of this difference is called the sampling error. For
a probability sample an indication of the likely size, but not direction of this error,
can be calculated from the sample using a statistic called the standard error. This
is one of the main attributes of using probability sampling; for other methods it
is not possible to estimate the likely size of the sampling error, although in some
cases an attempt is made by assuming the sampling procedure is equivalent to a
probability sampling scheme.

• Coverage error: errors because some units were not on the sampling frame or list

• Non-response error: errors because some selected units could not be contacted or
refused to provide the information

• Interviewer error: for surveys involving personal interviewing the interviewers may
affect the responses the respondent provides in various ways

• Instrument errors: errors or differences due to the way the questions and instruc-
tions are asked. If physical measurements are taken there will be measurement
errors associated with the measurement process.

• Mode of data collection: different answers to the same question may be obtained
when using different modes e.g. mail versus telephone to collect the data.

All data collections are potentially subject to these errors. A census or complete enu-
meration would have no sampling error but would be subject to all the other sources
of error. In fact although they introduce sampling error, sample surveys can give more
reliable results than censuses because more effort can be put into reducing the other
errors for the same cost.
In the end information is used for decisions and the reliability of the estimates from the
survey should be that necessary to support that decision-making. If the same decision
will be made whether the estimate is 30% or 40% then there is no need to design the
survey to have a likely sampling error of less than 10%. The subject of sampling error is a
technical one. For many surveys the detail of the tables to be produced is a determining
factor. For example, if the key output from the survey is to be a table then a sample size of
10 to 25 times the number of cells in the table should be considered initially. Hence if
we have a table with 40 cells a sample of between 400 and 1000 should be considered.
More precisely, if a simple random sample is conducted then an estimate of a proportion
p has a 95% chance of a sampling error of 2\sqrt{p(1-p)/n} or less, where n is the sample
size.
In considering sample size an allowance for non-response has to be made, and unless
there is some legal compulsion response rates of 40% are common, and often they are
less. Non-response raises the possibility of non-response error, which is the error because
the non-respondents are different from the respondents. This error is not reflected in the
standard error discussed above which only reflects the likely error due to the fact that a
sample and not the whole population is selected. Standard errors also do not cover the
errors due to respondent or interviewer errors. These errors are difficult to measure and
are best minimized through the testing in the development phase of the survey.
In considering the precision required the balance between the amount of information
collected and the number of units to include has to be faced. There is often a temptation
to cover a lot of questions in a survey. Overloading the questionnaire will probably lower
the response rate and affect the quality of the responses that are obtained. Use of small
sample sizes often means that more in depth methods can be used but the generalisations
that can be made are more limited. Small, in-depth studies and larger surveys with less
depth can complement each other.

A.6 Collection Methods

Traditionally there are three common ways of collecting data in a survey: by mail,
telephone, or field interview. More recently the options of using email or the internet
have become available, although there are issues associated with adequate sampling
frames for general surveys. The best method to use in any given situation depends on
the population being surveyed, the information being collected, the contact information
available on the sampling frame and the cost structure applying.

A.6.1 Mail Surveys
These involve mailing a form to selected units and asking them to fill in the form and
return it. The sampling frame must give addresses and preferably contact names. The
method has the attraction of being apparently cheap and simple, the postage cost being
the cost of two stamps per unit. However the initial response rate is often poor e.g.
20%-30% and several mail-based follow-ups are required to increase the response rate. I
would allow the time and money for 3 follow up phases. The form must get to the right
person and must be clearly set out. The questions, instructions and explanations must
also be clear and somehow encourage the person to respond. Mail surveys tend to be
used to survey businesses that are more used to providing information in this way, but
can be used to survey households. Problems of literacy and foreign language arise.

A.6.2 Telephone surveys


These require the sampling frame to have telephone numbers or at least be possible
to obtain the telephone numbers of selected units. Businesses would usually have a
telephone although there is still the problem of contacting the right person in the organ-
isation. More than 95% of households have a telephone connected, but the people who
are in non-connected households are different from the population as a whole. Steel and
Boal (1988) found that telephone accessibility was lower for young people, people living
alone or in rented accommodation. Moreover if the telephone book or electronic white
pages is used people with unlisted numbers will not be included. Telephone surveys have
the advantage of obtaining information quickly and relatively cheaply. However, it will
often take several calls to contact a person and the time and money for these additional
calls must be allowed for. The main problems with telephone surveys are establishing the
credentials of the survey organisation, the length of interview feasible, and the ease with
which the respondent can understand the questions, especially if there are a range of
responses possible for each question. Bennet and Steel (2000) give details of a large-scale
telephone survey conducted in Queensland.

A.6.3 Field Interview Survey


This method involves an interviewer making contact with the selected unit and interview-
ing them face to face. It is expensive since it involves traveling to the respondent, often
several times, to establish contact. It is often used in conjunction with multi-stage or
area sampling, where the interviewer calls at selected households. While expensive, the
use of personal interviewing usually means that more detailed and extensive information
can be asked since the interviewer is present to answer requests for clarification. This
interaction may be a source of bias, especially if the interviewers are inexperienced or
poorly trained. The cost of recruiting and proper training of the interviewers is incurred.
Generally little or no difference has been found between the information collected
using different collection methods. Hence the decision on which method to use is usually

made on the basis of cost, logistics and sampling factors. However, mail and other self-
completion surveys have to be shorter than interviewer surveys, basically because the
interviewer is not there to maintain interest. Open questions and sequencing or question
skipping should only be used sparingly in mail surveys.

A.7 Sampling Methods

If the survey population is small, or very detailed analysis is planned, a census of the
population should be taken. However, once the population becomes large the resources
needed to conduct a census become too much and some sampling has to be used. Use of
sampling also means that more effort can be put into the data collection so that higher
quality and more detailed information can be collected. Different ways of obtaining
samples are used, and for this section it will be assumed that we wish to obtain a sample
using probability sampling methods. For large-scale surveys, or even small surveys with
special requirements, sample design and the specification of the associated estimation
procedures can be a fairly technical exercise requiring the advice of an experienced
specialist. The comments here indicate the basic approaches possible.

A.7.1 Sample Size


It must be remembered that sample size is more important than the sampling fraction
or rate in terms of sampling error. So for example a sample of 100 out of a population
of 1000 has almost the same sampling error as a sample of 100 out of a population
of 1,000,000, provided the two populations are equally variable or heterogeneous. The
simple approach of taking the number of cells or degree of breakdown required and
multiplying by between 10 and 25 gives a good starting point. Such sample sizes will
give standard errors of roughly 30% (with 10 per cell) down to about 20% (with 25 per
cell) of the size of the estimate for most estimates. However, if lower sampling errors are
required, and/or a particular group that is relatively rare in the population is of interest,
larger sample sizes will be needed. For example,
suppose we had an interest in a particular group in the population that occurred in
about 5% of the population and we wish to produce estimates that break this group into
five categories. To get an average of 25 in each of these groups, we need 125 of the group
of interest in the sample, which means we want a sample of 20 × 125 = 2500. Effective
screening methods for identifying people in the group can sometimes be developed.
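
This kind of arithmetic can be written as a few lines of Python. The figures below are the
ones used in the example above (5% prevalence, five categories, about 25 per category),
not recommendations; the variable names are chosen only for this sketch.

    # Rough overall sample size needed to get enough of a rare subgroup.
    group_prevalence = 0.05    # subgroup is about 5% of the population
    categories = 5             # estimates wanted for five categories within the subgroup
    per_cell = 25              # aim for roughly 25 sampled members per category

    needed_in_group = categories * per_cell              # 125 subgroup members needed
    total_sample = needed_in_group / group_prevalence    # 125 / 0.05 = 2500 overall
    print(total_sample)
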
An important objective of many sample surveys is to estimate one or more proportions.
If the population proportion is P , then the standard error (SE) of the sample proportion,
p, is

SE = \sqrt{P(1-P)\left(1 - \frac{n}{N}\right)\frac{1}{n}}

where n is the sample size and N is the population size. (Note that the factor N/(N-1) has
been ignored here; see Section 3.2.) In practice, the sample size is often much smaller
than the population size, so that the following approximation can be made:

SE = \sqrt{\frac{P(1-P)}{n}}

These formulas for the SE assume a particular sample design, simple random sampling
without replacement; however, they are often used for survey planning even for more
complex sample designs. Use of cluster or multistage sampling can increase the sampling
errors. These methods are discussed below.
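
A minimal Python sketch of these two formulas follows; the function name se_proportion
and the illustrative values P = 0.3, n = 400 and N = 2000 are assumptions made for this
example, not figures from the notes.

    from math import sqrt

    def se_proportion(P, n, N=None):
        """Approximate SE of a sample proportion under simple random sampling.

        If N is given, the finite population correction (1 - n/N) is applied;
        otherwise the large-population approximation sqrt(P(1-P)/n) is used.
        The factor N/(N-1) is ignored, as in the formulas above.
        """
        fpc = 1.0 - n / N if N is not None else 1.0
        return sqrt(P * (1 - P) * fpc / n)

    print(se_proportion(0.3, 400))          # about 0.023
    print(se_proportion(0.3, 400, N=2000))  # about 0.020; the FPC reduces the SE
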
For survey design, we may have a given standard error (SE) that we wish to achieve for a
proportion. The required sample size n can then be calculated by

n = \frac{P(1-P)}{SE^2}

To apply this formula, we must have some idea of the population proportion P. This
may seem strange: if we already know P, then why do we need to run the survey?
However, a rough estimate of P is sufficient to calculate the sample size. The survey will
provide a precise estimate that can be used for research, but a rough estimate is good
enough to decide the sample size.
How can we calculate the sample size if we have no information at all about P ? The
maximum value of P (1 − P ) is at P = 0.5; at this value P (1 − P ) = 0.25. So we can
calculate a “conservative” or “worst case” value for the required sample size using
n = \frac{0.25}{SE^2}

For example, suppose we are surveying a population of University of Wollongong students,
to estimate the proportion who make use of the shuttle bus. We want to estimate
this proportion with a standard error of 0.01 (i.e. one percentage point). If we have no
information about P , then we would calculate
n = \frac{0.25}{SE^2} = \frac{0.25}{0.01^2} = 2500

However, we then hear about a similar survey conducted 5 years ago, which estimated
that this proportion was 0.3. The proportion may have changed, but we can still use
P = 0.3 as a rough estimate for setting sample size:

n = \frac{P(1-P)}{SE^2} = \frac{0.3(1-0.3)}{0.01^2} = 2100

Remember that the 95% confidence interval for the population proportion will be the
sample proportion plus and minus approximately twice the SE.
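
The two calculations above can be packaged as a small Python helper. The function name
sample_size_for_proportion is chosen here for illustration; the inputs are the target SE of
0.01 and the prior estimate P = 0.3 used in the example.

    from math import ceil

    def sample_size_for_proportion(target_se, P=0.5):
        """Sample size needed so the SE of a proportion is about target_se.

        P = 0.5 gives the conservative (worst case) value of P(1 - P).
        """
        return ceil(P * (1 - P) / target_se ** 2)

    print(sample_size_for_proportion(0.01))         # 2500, with no information about P
    print(sample_size_for_proportion(0.01, P=0.3))  # 2100, using the earlier estimate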

A.7.2 Simple Random Sampling

This is the method that most people mean when they refer to random sampling. It means
each unit in the population has the same chance of selection. In fact it goes further
than that: it is a method in which every possible sample of the specified size has the
same chance of selection. Usually it is done without replacement, so if a unit is selected
it is not given another chance of selection. Suppose you have a population of size N and
wish to select a sample of size n. Manually, a simple random sample can be achieved by
numbering, at least notionally, all the units in the population and selecting n random
numbers between 1 and N from a table of random numbers; if the same number comes
up twice, just select another. If the sampling frame is in computer-readable form there
is usually software available to generate random numbers; alternatively, the sampling
frame can be randomly ordered and units then selected using the random systematic method
described in the next section. If the list is long, randomly ordering it can be expensive
in computer time. Be wary of methods that appear random or lists that are alleged to
already be in random order.
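
A minimal Python sketch of this selection, assuming the frame units have simply been
numbered 1 to N; simple_random_sample is a name chosen only for this illustration.

    import random

    def simple_random_sample(N, n, seed=None):
        """Select n distinct units, labelled 1..N, without replacement.

        Every unit, and indeed every possible sample of size n, has the
        same chance of selection.
        """
        rng = random.Random(seed)
        return sorted(rng.sample(range(1, N + 1), n))

    print(simple_random_sample(1000, 10, seed=1))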

A.7.3 Random Systematic Sampling

In this procedure, a sample is selected by initially selecting a random number, r, between
1 and N/n, and selecting the units corresponding to r, r + N/n, r + 2N/n, and so on. The
initial random number is called the random start and the factor N/n is called the skip
interval. For example, suppose we wish to select 120 units out of 2400. The skip interval
is 2400/120 = 20. We thus select a random number between 1 and 20 inclusive and
select that unit and every 20th unit afterwards. The method can be applied even when
N/n is not a whole number.
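
A sketch of the procedure in Python, using the 120-out-of-2400 example above;
random_systematic_sample is a name chosen here, and the fractional-interval case is
handled by taking a fractional random start and rounding up to a unit label.

    import math
    import random

    def random_systematic_sample(N, n, seed=None):
        """Select a systematic sample of about n units from a frame of size N.

        A random start is chosen in (0, N/n] and the units at that position
        and every N/n thereafter are taken; the ceiling handles a fractional
        skip interval.
        """
        rng = random.Random(seed)
        skip = N / n                         # skip interval, e.g. 2400 / 120 = 20
        start = skip * (1 - rng.random())    # random start in (0, N/n]
        positions = []
        x = start
        while x <= N:
            positions.append(math.ceil(x))   # unit label, between 1 and N
            x += skip
        return positions

    sample = random_systematic_sample(2400, 120, seed=1)
    print(len(sample), sample[:5])           # 120 units, every 20th after the random start
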
As mentioned above this method can be used to obtain a simple random sample if the list
has been randomly ordered or we are sure that the order can be treated as random. In
some circumstances deliberately using a non-randomly ordered list can be advantageous.
For example ordering a list of hospitals by size or location will ensure an even spread
and removes the possibility of the sample including only small units. The main point
to watch for is periodicity in the list. For example selecting every 5th flat in a block of
flats might just always give the same flat on different floors if there were five flats per
floor. If some of the flats have ocean views then there would be price differences, which
might mean that wealthier people tend to occupy flats with certain numbers.

A.7.4 Stratified Sampling

In stratified sampling we divide the population into more homogeneous groups and select
a separate sample from each group. This ensures an adequate representation from each
group. So in taking a sample of employees, rather than let the representation of say
males to females be random we could divide the list of employees into two groups or
strata according to sex and take a sample from both using the same sampling rate or
fraction in each stratum. If the strata vary a lot in their degree of homogeneity then this
means for some strata only a small sample is required to obtain reasonable reliability,
whereas for the more heterogeneous strata a proportionately larger sample is required.
By altering the allocation of the sample across the strata in this way, much more efficient samples can be
obtained. The sampling fraction may also be varied between strata because we wish to
ensure sufficient representation of particular groups in our analysis. Common variables
to use in forming strata are size, industry, type, and geographic area.
If different sampling fractions are used between strata then this must be taken account
of in the estimation procedures. So for example if we took a 1 in 8 sample of men but a
1 in 5 sample of women then in the estimation the men’s answers would be multiplied
or weighted by 8 but the women’s by 5. In sampling institutions it is common to
form a stratum consisting of all the very large units and include them all; they have then
been selected with probability 1. If, say, a 1 in 10 sample of the remaining businesses
were selected, then the responses of the non-large businesses would be multiplied by 10
and added to the unweighted results from the very large businesses.
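
A small Python illustration of this weighting; the sample values below are invented purely
to show the arithmetic of weighting by the inverse sampling fractions.

    # Weighted estimate of a population total from two strata sampled at
    # different rates: men 1 in 8 (weight 8), women 1 in 5 (weight 5).
    men_sample = [3, 1, 4, 2]        # hypothetical responses from sampled men
    women_sample = [2, 2, 5]         # hypothetical responses from sampled women

    estimated_total = 8 * sum(men_sample) + 5 * sum(women_sample)
    print(estimated_total)           # 8*10 + 5*9 = 125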

A.7.5 Cluster and Multistage Sampling

For cost or logistical reasons it is sometimes more convenient to select the sample by
selecting, at random, groups of units. This often happens when there is some geographic
aspect to the sample selection and/or when there is no list of population
units available. The method is best illustrated by an example. Suppose we wish to
select a national sample of hospital patient records. No central list of such records
exists. However, we may be able to obtain a list of hospitals in Australia. We could then
select a sample of hospitals from this list and select all patient records from the selected
establishments, in which case we have a cluster of units from each selected hospital. It
would probably be better to select a sample from the patient records in each selected
establishment, in which case we have a multi-stage sampling scheme. Notice that the
probability of a particular patient record being selected in the sample is the product of the
probability of the establishment being selected and the probability the record is selected
given the establishment is selected. So if we took a 1 in 5 sample of establishments
and then a 1 in 20 sample of patient records from the selected establishments, we have
selected a 1 in 100 sample of patient records. Provided the selection of establishments,
and of records within selected establishments, is done randomly, the sample is still a valid
probability sample. In this example it would also be worthwhile stratifying according to
the size of the establishment and its type.
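
The overall selection probability is simply the product of the stage probabilities, as this
small Python check of the 1 in 5 by 1 in 20 example shows (the variable names are chosen
only for the sketch).

    from fractions import Fraction

    p_hospital = Fraction(1, 5)    # 1 in 5 hospitals selected at the first stage
    p_record = Fraction(1, 20)     # 1 in 20 patient records within a selected hospital

    p_overall = p_hospital * p_record
    print(p_overall)               # 1/100: a 1 in 100 sample of patient records overall
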
While cluster and multistage sampling are usually cheaper and more convenient than
other methods, there is a price to pay in increased standard errors for the same sample
size in terms of number of finally selected population units. These methods are quite
complicated and the advice of an experienced sampling statistician would be desirable.

A.8 Survey Operations
A.8.1 Follow Up
No matter how the survey is carried out, a 100% response rate will not be achieved at
first. The strategy for following up selected units that have not responded or not been
contacted has to be worked out. For mail-based surveys the main problems are people
not returning the form and the form not getting to the business in the first place because
of moves or ceased businesses or poor contact information. Follow up is usually a further
mail out which is eventually supplemented with telephone and occasionally field visits.
In field surveys outright refusals are less of a problem than contacting people at home
and several visits may be necessary to even contact the household.
It is sometimes the practice to replace non-contacts with apparently similar units, e.g.
next door neighbours or by just making another telephone call. While these procedures
maintain the sample size, they hide the non-response and can give biased samples, e.g.
biased towards people who are at home more often. If such a situation is unavoidable, at
least obtain a count of the substitutions so that the real response rate can be worked out.

A.8.2 Non-Response
Response rates vary considerably according to the population being surveyed, the subject
matter, the survey organisation and the survey methods used. Once the follow up phases
have been completed and an element of non-response remains, there are two things that
can be done. One is to compare the profile of the sample with any information available
on the population, e.g. the age and sex distribution of the population of Wollongong, or
the ranks or functions in a staff survey. Differences in these profiles can be used in the
estimation phase to attempt to adjust for non-response bias. Even if the sample and
population profiles are reasonably close, there is no guarantee that they will be so for
other variables. An often recommended, but rarely used, method is to intensively follow
up a small number of randomly selected non-respondents to determine the characteristics
of the non-respondent group.
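
One common way of using such profile information in estimation is a post-stratification
style weighting adjustment. The sketch below is only an illustration of the idea, not a
method prescribed in these notes; the age groups and counts are invented.

    # Post-stratification style adjustment: known population counts by age group
    # (hypothetical figures) are used to re-weight respondents so the weighted
    # sample matches the population profile.
    population_counts = {"15-24": 4000, "25-44": 7000, "45+": 9000}
    respondent_counts = {"15-24": 60, "25-44": 150, "45+": 190}

    weights = {g: population_counts[g] / respondent_counts[g] for g in population_counts}
    print(weights)  # young respondents get a larger weight: about 66.7 vs 46.7 and 47.4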

A.8.3 Input Editing


As the data for each unit are received, or in the computer processing, various checks or
input edits should be carried out. Typical input edits are

• legal values and range checks

• internal consistency checks

• checks for reasonableness on values and ratios of variables

Edit failures should be checked back against the form and, if necessary, with the respondent.
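
A minimal Python sketch of input edits of this kind; the variable names, the legal age
range and the wages-per-staff tolerance are assumptions made purely for illustration.

    def input_edits(record):
        """Return a list of edit failures for one survey record.

        The checks illustrate the three types of edit above: range checks,
        internal consistency, and reasonableness of a ratio of variables.
        """
        failures = []
        if not (15 <= record["age"] <= 120):                     # legal value / range check
            failures.append("age out of range")
        if record["employees"] > record["total_staff"]:          # internal consistency
            failures.append("employees exceeds total staff")
        if record["total_staff"] and record["wages"] / record["total_staff"] > 500000:
            failures.append("wages per staff member unreasonably high")  # reasonableness
        return failures

    print(input_edits({"age": 130, "employees": 12, "total_staff": 10, "wages": 900000}))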

A.8.4 Output Editing
This refers to checking the figures that come from the survey in various ways, such as
against other or historical data sources, or across subgroups in the survey.

A.9 References

Bennet, D. and Steel, D. G. (2000). An evaluation of a large scale CATI household survey
using random digit dialling. Australian and New Zealand Journal of Statistics.
Groves, R. M. (1989). Survey Errors and Survey Costs. John Wiley, New York.
Steel, D. and Boal, P. (1988). Accessibility by Telephone in Australia: Implications for
Telephone Surveys. Journal of Official Statistics, Vol. 4, No. 4, 307-318.
