3rd Assignment Research Solution



Q1. Chapter 8: Sampling
Sampling is the statistical process of selecting a subset (called a “sample”) of a
population of interest for purposes of making observations and statistical inferences
about that population. Social science research is generally about inferring patterns of
behaviors within specific populations. We cannot study entire populations because of
feasibility and cost constraints, and hence, we must select a representative sample from
the population of interest for observation and analysis. It is extremely important to
choose a sample that is truly representative of the population so that the inferences
derived from the sample can be generalized back to the population of interest. Improper
and biased sampling is the primary reason for the divergent and erroneous inferences
often reported in opinion polls and exit polls conducted by different polling groups, such
as CNN/Gallup Poll, ABC, and CBS, prior to every U.S. Presidential election.

The Sampling Process

Figure 8.1. The sampling process

The sampling process comprises several stages. The first stage is defining the target
population. A population can be defined as all people or items (the unit of analysis) with
the characteristics that one wishes to study. The unit of analysis may be a person,
group, organization, country, object, or any other entity that you wish to draw scientific
inferences about. Sometimes the population is obvious. For example, if a manufacturer
wants to determine whether finished goods manufactured at a production line meets
certain quality requirements or must be scrapped and reworked, then the population
consists of the entire set of finished goods manufactured at that production facility. At
other times, the target population may be a little harder to understand. If you wish to
identify the primary drivers of academic learning among high school students, then what
is your target population: high school students, their teachers, school principals, or
parents? The right answer in this case is high school students, because you are
interested in their performance, not the performance of their teachers, parents, or
schools. Likewise, if you wish to analyze the behavior of roulette wheels to identify
biased wheels, your population of interest is not different observations from a single
roulette wheel, but different roulette wheels (i.e., their behavior over an infinite set of
wheels).

The second step in the sampling process is to choose a sampling frame. This is an
accessible section of the target population (usually a list with contact information) from
where a sample can be drawn. If your target population is professional employees at
work, because you cannot access all professional employees around the world, a more
realistic sampling frame will be employee lists of one or two local companies that are
willing to participate in your study. If your target population is organizations, then the
Fortune 500 list of firms or the Standard & Poor’s (S&P) list of firms registered with the
New York Stock Exchange may be acceptable sampling frames.

Note that sampling frames may not entirely be representative of the population at large,
and if so, inferences derived by such a sample may not be generalizable to the
population. For instance, if your target population is organizational employees at large
(e.g., you wish to study employee self-esteem in this population) and your sampling
frame is employees at automotive companies in the American Midwest, findings from
such groups may not even be generalizable to the American workforce at large, let
alone the global workplace. This is because the American auto industry has been under
severe competitive pressures for the last 50 years and has seen numerous episodes of
reorganization and downsizing, possibly resulting in low employee morale and self-
esteem. Furthermore, the majority of the American workforce is employed in service
industries or in small businesses, and not in automotive industry. Hence, a sample of
American auto industry employees is not particularly representative of the American
workforce. Likewise, the Fortune 500 list includes the 500 largest American enterprises,
which is not representative of all American firms in general, most of which are medium
and small-sized firms rather than large firms, and is therefore, a biased sampling frame.
In contrast, the S&P list will allow you to select large, medium, and/or small companies,
depending on whether you use the S&P large-cap, mid-cap, or small-cap lists, but
includes only publicly traded firms (and not private firms) and is hence still biased. Also note
that the population from which a sample is drawn may not necessarily be the same as
the population about which we actually want information. For example, if a researcher
wants to study the success rate of a new “quit smoking” program, then the target population is
the universe of smokers who had access to this program, which may be an unknown
population. Hence, the researcher may sample patients arriving at a local medical
facility for smoking cessation treatment, some of whom may not have had exposure to
this particular “quit smoking” program, in which case, the sampling frame does not
correspond to the population of interest.
The last step in sampling is choosing a sample from the sampling frame using a well-
defined sampling technique. Sampling techniques can be grouped into two broad
categories: probability (random) sampling and non-probability sampling. Probability
sampling is ideal if generalizability of results is important for your study, but there may
be unique circumstances where non-probability sampling can also be justified. These
techniques are discussed in the next two sections.

Probability Sampling

Probability sampling is a technique in which every unit in the population has a chance
(non-zero probability) of being selected in the sample, and this chance can be
accurately determined. Sample statistics thus produced, such as sample mean or
standard deviation, are unbiased estimates of population parameters, as long as the
sampled units are weighted according to their probability of selection. All probability
sampling techniques have two attributes in common: (1) every unit in the population has a known
non-zero probability of being sampled, and (2) the sampling procedure involves random
selection at some point. The different types of probability sampling techniques include:

Simple random sampling. In this technique, all possible subsets of a population (more
accurately, of a sampling frame) are given an equal probability of being selected. The
probability of selecting any particular set of n units out of a total of N units in a sampling
frame is 1/(NCn), the reciprocal of the number of possible samples of size n. Hence,
sample statistics are unbiased estimates of population parameters,
without any weighting. Simple random sampling involves randomly selecting
respondents from a sampling frame, but with large sampling frames, usually a table of
random numbers or a computerized random number generator is used. For instance, if
you wish to select 200 firms to survey from a list of 1000 firms, and this list is entered into
a spreadsheet like Excel, you can use Excel’s RAND() function to generate a random
number for each of the 1000 firms on that list. Next, you sort the list in increasing
order of the corresponding random numbers and select the first 200 firms on that
sorted list. This is the simplest of all probability sampling techniques, and its simplicity
is also its strength. Because the sampling frame is not subdivided or partitioned, the
sample is unbiased and the inferences are the most generalizable among all probability
sampling techniques.
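The spreadsheet procedure described above can also be sketched in Python. This is an illustrative sketch, not part of the chapter: the firm names and the `simple_random_sample` helper are hypothetical, but the sort-by-random-key idea is the same one behind the Excel RAND() approach.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw a simple random sample of n units from a sampling frame.

    Mirrors the spreadsheet approach in the text: attach a random
    number to every unit, sort by it, and keep the first n units.
    """
    rng = random.Random(seed)
    keyed = [(rng.random(), unit) for unit in frame]
    keyed.sort()                      # order by the random key
    return [unit for _, unit in keyed[:n]]

# Hypothetical sampling frame of 1000 firms
firms = [f"Firm{i}" for i in range(1000)]
sample = simple_random_sample(firms, 200, seed=42)
print(len(sample))  # 200 firms; every subset of 200 is equally likely
```

In practice, `random.sample(firms, 200)` does the same job in one call; the explicit random-key version is shown only because it matches the spreadsheet workflow.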

Systematic sampling. In this technique, the sampling frame is ordered according to
some criterion and elements are selected at regular intervals through that ordered list.
Systematic sampling involves a random start and then proceeds with the selection of
every kth element from that point onwards, where k = N/n is the ratio of the sampling
frame size N to the desired sample size n, formally called the sampling ratio. It is
important that the starting point is not automatically the first on the list, but is instead
randomly chosen from within the first k elements on the list.
previous example of selecting 200 firms from a list of 1000 firms, you can sort the 1000
firms in increasing (or decreasing) order of their size (i.e., employee count or annual
revenues), randomly select one of the first five firms on the sorted list, and then select
every fifth firm on the list. This process will ensure that there is no overrepresentation of
large or small firms in your sample, but rather that firms of all sizes are generally
uniformly represented, just as they are in your sampling frame. In other words, the sample is
representative of the population, at least on the basis of the sorting criterion.
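The same 200-from-1000 example can be sketched in a few lines (an illustrative sketch; the `systematic_sample` helper and firm labels are hypothetical):

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th unit from an ordered frame, with k = N // n
    and a random starting point within the first k units."""
    k = len(frame) // n                        # sampling ratio
    start = random.Random(seed).randrange(k)   # random start in 0..k-1
    return frame[start::k][:n]

# Hypothetical frame: 1000 firms already sorted by size
firms_by_size = [f"Firm{i:04d}" for i in range(1000)]
sample = systematic_sample(firms_by_size, 200, seed=7)
print(len(sample))  # 200 firms, spread evenly across the size ordering
```

Because the frame is sorted by size before selection, every fifth firm is taken and no size class is over-represented.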

Stratified sampling. In stratified sampling, the sampling frame is divided into
homogeneous and non-overlapping subgroups (called “strata”), and a simple random
sample is drawn within each subgroup. In the previous example of selecting 200 firms
from a list of 1000 firms, you can start by categorizing the firms based on their size as
large (more than 500 employees), medium (between 50 and 500 employees), and small
(less than 50 employees). You can then randomly select 67 firms from each subgroup
to make up your sample of 200 firms. However, since there are many more small firms
in a sampling frame than large firms, having an equal number of small, medium, and
large firms will make the sample less representative of the population (i.e., biased in
favor of large firms that are fewer in number in the target population). This is called non-
proportional stratified sampling because the proportion of sample within each subgroup
does not reflect the proportions in the sampling frame (or the population of interest), and
the smaller subgroup (large-sized firms) is over-sampled. An alternative technique would
be to select subgroup samples in proportion to their size in the population. For instance,
if there are 100 large firms, 300 mid-sized firms, and 600 small firms, you can sample
20 firms from the “large” group, 60 from the “medium” group and 120 from the “small”
group. In this case, the proportional distribution of firms in the population is retained in
the sample, and hence this technique is called proportional stratified sampling. Note that
the non-proportional approach is particularly effective in representing small subgroups,
such as large-sized firms, and is not necessarily less representative of the population
compared to the proportional approach, as long as the findings of the non-proportional
approach are weighted in accordance with each subgroup’s proportion in the overall population.
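Both variants can be sketched with one hypothetical helper (`stratified_sample` and the firm lists below are invented for illustration), using the chapter's 100/300/600 frame:

```python
import random

def stratified_sample(strata, total_n, proportional=True, seed=None):
    """Draw a stratified sample from {stratum_name: [units]}.

    proportional=True allocates to each stratum in proportion to its
    share of the frame; False splits the total equally across strata
    (non-proportional stratified sampling).
    """
    rng = random.Random(seed)
    N = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n = (round(total_n * len(units) / N) if proportional
             else total_n // len(strata))
        sample[name] = rng.sample(units, n)
    return sample

# Hypothetical frame: 100 large, 300 medium, 600 small firms
strata = {"large":  [f"L{i}" for i in range(100)],
          "medium": [f"M{i}" for i in range(300)],
          "small":  [f"S{i}" for i in range(600)]}
prop = stratified_sample(strata, 200, proportional=True, seed=1)
print({k: len(v) for k, v in prop.items()})  # large 20, medium 60, small 120
```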

Cluster sampling. If you have a population dispersed over a wide geographic region, it
may not be feasible to conduct simple random sampling of the entire population. In
such cases, it may be reasonable to divide the population into “clusters” (usually along
geographic boundaries), randomly sample a few clusters, and measure all units within
that cluster. For instance, if you wish to sample city governments in the state of New
York, rather than travel all over the state to interview key city officials (as you may have
to do with a simple random sample), you can cluster these governments based on their
counties, randomly select a set of three counties, and then interview officials from
every city government in those counties. However, depending on between-cluster differences, the
variability of sample estimates in a cluster sample will generally be higher than that of a
simple random sample, and hence the results are less generalizable to the population
than those obtained from simple random samples.
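The county example can be sketched as single-stage cluster sampling (a hedged illustration; the `cluster_sample` helper and the county/city names are hypothetical):

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Single-stage cluster sampling: randomly pick n_clusters clusters
    and include every unit inside each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical clusters: city governments grouped by county
counties = {f"County{i}": [f"City{i}-{j}" for j in range(4)]
            for i in range(20)}
sample = cluster_sample(counties, 3, seed=3)
print(sorted(sample))  # three randomly chosen counties; every city in each is surveyed
```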

Matched-pairs sampling. Sometimes, researchers may want to compare two subgroups
within one population based on a specific criterion. For instance, why are some firms
consistently more profitable than others? To conduct such a study, you would have
to categorize a sampling frame of firms into “high-profitability” and “low-profitability”
firms based on gross margins, earnings per share, or some other measure of
profitability. You would then select a simple random sample of firms in one subgroup,
and match each firm in this group with a firm in the second subgroup, based on its size,
industry segment, and/or other matching criteria. Now, you have two matched samples
of high-profitability and low-profitability firms that you can study in greater detail. Such
a matched-pairs sampling technique is often an ideal way of understanding bipolar
differences between different subgroups within a given population.
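The matching step can be sketched as a greedy nearest-match on firm size. This is one simple matching rule among many, and the firm data and `matched_pairs` helper are hypothetical:

```python
def matched_pairs(high_group, low_group, key):
    """Greedily pair each unit in high_group with the closest unmatched
    unit in low_group on a matching criterion (here, firm size)."""
    remaining = list(low_group)
    pairs = []
    for firm in high_group:
        match = min(remaining, key=lambda f: abs(key(f) - key(firm)))
        remaining.remove(match)       # each low firm is matched once
        pairs.append((firm, match))
    return pairs

# Hypothetical firms as (name, employee_count)
high_profit = [("A", 520), ("B", 90), ("C", 1400)]
low_profit  = [("X", 100), ("Y", 1500), ("Z", 500)]
pairs = matched_pairs(high_profit, low_profit, key=lambda f: f[1])
# A (520) pairs with Z (500), B (90) with X (100), C (1400) with Y (1500)
```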

Multi-stage sampling. The probability sampling techniques described previously are all
examples of single-stage sampling techniques. Depending on your sampling needs, you
may combine these single-stage techniques to conduct multi-stage sampling. For
instance, you can stratify a list of businesses based on firm size, and then conduct
systematic sampling within each stratum. This is a two-stage combination of stratified
and systematic sampling. Likewise, you can start with a cluster of school districts in the
state of New York, and within each cluster, select a simple random sample of schools;
within each school, select a simple random sample of grade levels; and within each
grade level, select a simple random sample of students for study. In this case, you have
a four-stage sampling process consisting of cluster and simple random sampling.
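The two-stage combination of stratified and systematic sampling described above can be sketched as follows (an illustrative sketch; the size cut-offs follow the chapter's large/medium/small definitions, while the `two_stage_sample` helper and firm data are hypothetical):

```python
import random

def two_stage_sample(frame, size_of, n_per_stratum, seed=None):
    """Two-stage sampling: stratify firms by size category, then draw
    a systematic sample (random start, every k-th unit) per stratum."""
    rng = random.Random(seed)
    strata = {"small": [], "medium": [], "large": []}
    for firm in frame:
        s = size_of(firm)
        cat = "large" if s > 500 else "medium" if s >= 50 else "small"
        strata[cat].append(firm)
    sample = {}
    for cat, units in strata.items():
        units = sorted(units, key=size_of)         # order within stratum
        k = max(1, len(units) // n_per_stratum)    # per-stratum ratio
        start = rng.randrange(k)
        sample[cat] = units[start::k][:n_per_stratum]
    return sample

# Hypothetical frame: firm i has i employees, i = 0..999
firms = [(f"Firm{i}", i) for i in range(1000)]
sample = two_stage_sample(firms, size_of=lambda f: f[1],
                          n_per_stratum=20, seed=5)
print({cat: len(v) for cat, v in sample.items()})  # 20 firms per stratum
```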

Non-Probability Sampling

Nonprobability sampling is a sampling technique in which some units of the population
have zero chance of selection or where the probability of selection cannot be accurately
determined. Typically, units are selected based on certain non-random criteria, such as
quota or convenience. Because selection is non-random, nonprobability sampling does
not allow the estimation of sampling errors, and may be subject to sampling bias.
Therefore, information from such a sample cannot be generalized back to the population.
Types of non-probability sampling techniques include:

Convenience sampling. Also called accidental or opportunity sampling, this is a
technique in which a sample is drawn from that part of the population that is close to
hand, readily available, or convenient. For instance, if you stand outside a shopping
center and hand out questionnaire surveys to people or interview them as they walk in,
the sample of respondents you will obtain will be a convenience sample. This is a non-
probability sample because you are systematically excluding all people who shop at
other shopping centers. The opinions that you would get from your chosen sample may
reflect the unique characteristics of this shopping center, such as the nature of its stores
(e.g., high-end stores will attract a more affluent demographic), the demographic profile
of its patrons, or its location (e.g., a shopping center close to a university will attract
primarily university students with unique purchase habits), and therefore may not be
representative of the opinions of the shopper population at large. Hence, the scientific
generalizability of such observations will be very limited. Other examples of
convenience sampling are sampling students registered in a certain class or sampling
patients arriving at a certain medical clinic. This type of sampling is most useful for pilot
testing, where the goal is instrument testing or measurement validation rather than
obtaining generalizable inferences.

Quota sampling. In this technique, the population is segmented into mutually exclusive
subgroups (just as in stratified sampling), and then a non-random set of observations is
chosen from each subgroup to meet a predefined quota. In proportional quota
sampling, the proportion of respondents in each subgroup should match that of the
population. For instance, if the American population consists of 70% Caucasians, 15%
Hispanic-Americans, and 13% African-Americans, and you wish to understand their
voting preferences in a sample of 98 people, you can stand outside a shopping center
and ask people their voting preferences. But you will have to stop asking Hispanic-
looking people when you have 15 responses from that subgroup (or African-Americans
when you have 13 responses) even as you continue sampling other ethnic groups, so
that the ethnic composition of your sample matches that of the general American
population. Non-proportional quota sampling is less restrictive in that you don’t have to
achieve a proportional representation, but perhaps meet a minimum size in each
subgroup. In this case, you may decide to have 50 respondents from each of the three
ethnic subgroups (Caucasians, Hispanic-Americans, and African- Americans), and stop
when your quota for each subgroup is reached. Neither type of quota sampling will be
representative of the American population, since depending on whether your study was
conducted in a shopping center in New York or Kansas, your results may be entirely
different. The non-proportional technique is even less representative of the population
but may be useful in that it allows capturing the opinions of small and underrepresented
groups through oversampling.
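The stopping rule behind proportional quota sampling can be sketched as a counter over an intercept stream (a hedged illustration; the `quota_sample` helper and the respondent stream are hypothetical, while the 70/15/13 quotas follow the chapter's example):

```python
def quota_sample(stream, quotas):
    """Proportional quota sampling: accept respondents from an intercept
    stream until each subgroup's quota is filled, then stop recruiting
    that subgroup (selection within the stream is non-random)."""
    counts = {g: 0 for g in quotas}
    accepted = []
    for person, group in stream:
        if counts.get(group, 0) < quotas.get(group, 0):
            counts[group] += 1
            accepted.append((person, group))
        if counts == quotas:          # all quotas met: stop sampling
            break
    return accepted

# Quotas matching the chapter's example; the stream is hypothetical
quotas = {"Caucasian": 70, "Hispanic": 15, "African-American": 13}
groups = ["Caucasian"] * 90 + ["Hispanic"] * 25 + ["African-American"] * 20
stream = [(f"P{i}", g) for i, g in enumerate(groups)]
sample = quota_sample(stream, quotas)
print(len(sample))  # 98 respondents in the target proportions
```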

Expert sampling. This is a technique where respondents are chosen in a non-random
manner based on their expertise on the phenomenon being studied. For instance, in
order to understand the impacts of a new governmental policy, such as the Sarbanes-
Oxley Act, you can sample a group of corporate accountants who are familiar with this
act. The advantage of this approach is that since experts tend to be more familiar with
the subject matter than non-experts, opinions from a sample of experts are more
credible than a sample that includes both experts and non-experts, although the
findings are still not generalizable to the overall population at large.

Snowball sampling. In snowball sampling, you start by identifying a few respondents
that match the criteria for inclusion in your study, and then ask them to recommend
others they know who also meet your selection criteria. For instance, if you wish to
survey computer network administrators and you know of only one or two such people,
you can start with them and ask them to recommend others who also do network
administration. Although this method hardly leads to representative samples, it may
sometimes be the only way to reach hard-to-reach populations or when no sampling
frame is available.

Statistics of Sampling

In the preceding sections, we introduced terms such as population parameter, sample
statistic, and sampling bias. In this section, we will try to understand what these terms
mean and how they are related to each other.

When you measure a certain observation from a given unit, such as a person’s
response to a Likert-scaled item, that observation is called a response (see Figure 8.2).
In other words, a response is a measurement value provided by a sampled unit. Each
respondent will give you different responses to different items in an instrument.
Responses from different respondents to the same item or observation can be graphed
into a frequency distribution based on their frequency of occurrences. For a large
number of responses in a sample, this frequency distribution tends to resemble a bell-
shaped curve called a normal distribution, which can be used to estimate overall
characteristics of the entire sample, such as sample mean (average of all observations
in a sample) or standard deviation (variability or spread of observations in a sample).
These sample estimates are called sample statistics (a “statistic” is a value that is
estimated from observed data). Populations also have means and standard deviations
that could be obtained if we could sample the entire population. However, since the
entire population can never be sampled, population characteristics are always unknown,
and are called population parameters (and not “statistic” because they are not
statistically estimated from data). Sample statistics may differ from population
parameters if the sample is not perfectly representative of the population; the difference
between the two is called the sampling error. Theoretically, if we could gradually increase
the sample size so that the sample approaches closer and closer to the population, then
sampling error will decrease and a sample statistic will increasingly approximate the
corresponding population parameter.

If a sample is truly representative of the population, then the estimated sample statistics
should be identical to corresponding theoretical population parameters. How do we
know if the sample statistics are at least reasonably close to the population parameters?
Here, we need to understand the concept of a sampling distribution. Imagine that you
took three different random samples from a given population, as shown in Figure 8.3,
and for each sample, you derived sample statistics such as sample mean and standard
deviation. If each random sample was truly representative of the population, then your
three sample means from the three random samples will be identical (and equal to the
population parameter), and the variability in sample means will be zero. But this is
extremely unlikely, given that each random sample will likely constitute a different
subset of the population, and hence, their means may be slightly different from each
other. However, you can take these three sample means and plot a frequency
histogram of sample means. If the number of such samples increases from three to 10
to 100, the frequency histogram becomes a sampling distribution. Hence, a sampling
distribution is a frequency distribution of a sample statistic (like sample mean) from a set
of samples, while the commonly referenced frequency distribution is the distribution of
a response (observation) from a single sample. Just like a frequency distribution, the
sampling distribution will also tend to have more sample statistics clustered around the
mean (which presumably is an estimate of a population parameter), with fewer values
scattered farther away from the mean. With an infinitely large number of samples, this distribution
will approach a normal distribution. The variability or spread of a sample statistic in a
sampling distribution (i.e., the standard deviation of a sampling statistic) is called its
standard error. In contrast, the term standard deviation is reserved for the variability of an
observed response from a single sample.
Figure 8.2. Sample Statistic.
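A short simulation makes the distinction concrete: the standard error is just the standard deviation of the sample means. The population parameters below (mean 100, standard deviation 15) and the sample sizes are invented for illustration:

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical population with mean 100 and standard deviation 15
population = [rng.gauss(100, 15) for _ in range(100_000)]

# Draw many samples of n = 50 and record each sample's mean: the
# collection of means is an empirical sampling distribution of the mean.
sample_means = [statistics.mean(rng.sample(population, 50))
                for _ in range(2000)]

# The standard error is the standard deviation of that distribution;
# theory predicts roughly sigma / sqrt(n) = 15 / sqrt(50), about 2.12.
std_error = statistics.stdev(sample_means)
print(round(statistics.mean(sample_means), 1), round(std_error, 2))
```

Note how the spread of the sample means (about 2) is far smaller than the spread of individual responses (about 15); increasing n shrinks it further.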

The mean value of a sample statistic in a sampling distribution is presumed to be an
estimate of the unknown population parameter. Based on the spread of this sampling
distribution (i.e., based on the standard error), it is also possible to estimate confidence
intervals for the predicted population parameter. A confidence interval is the estimated
probability that a population parameter lies within a specific interval of sample statistic
values. All normal distributions tend to follow a 68-95-99 percent rule (see Figure 8.4),
which says that over 68% of the cases in the distribution lie within one standard
deviation of the mean value (µ ± 1σ), over 95% of the cases lie within two standard
deviations of the mean (µ ± 2σ), and over 99% of the cases lie within three standard
deviations of the mean value (µ ± 3σ). Since a
sampling distribution with an infinite number of samples will approach a normal
distribution, the same 68-95-99 rule applies, and it can be said that:

 (Sample statistic ± one standard error) represents a 68% confidence interval for
the population parameter.
 (Sample statistic ± two standard errors) represents a 95% confidence interval for
the population parameter.
 (Sample statistic ± three standard errors) represents a 99% confidence interval
for the population parameter.
Figure 8.3. The sampling distribution.
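The three intervals above can be computed directly from a sample statistic and its standard error. The survey numbers here (mean 52.3, standard error 1.4) are hypothetical, and the z values of 1, 2, 3 are the rounded 68-95-99 rule rather than exact normal quantiles:

```python
def confidence_interval(stat, std_error, level):
    """Approximate a confidence interval as the sample statistic plus or
    minus z standard errors, using the 68-95-99 rule's z of 1, 2, 3."""
    z = {68: 1, 95: 2, 99: 3}[level]
    return (stat - z * std_error, stat + z * std_error)

# Hypothetical survey result: sample mean 52.3, standard error 1.4
for level in (68, 95, 99):
    lo, hi = confidence_interval(52.3, 1.4, level)
    print(f"{level}% CI: ({lo:.1f}, {hi:.1f})")
```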

A sample is “biased” (i.e., not representative of the population) if its sampling
distribution cannot be estimated or if the sampling distribution violates the 68-95-99
percent rule. As an aside, note that in most regression analysis where we examine the
significance of regression coefficients with p<0.05, we are attempting to see if the
sampling statistic (regression coefficient) predicts the corresponding population
parameter (true effect size) with a 95% confidence interval. Interestingly, the “six sigma”
standard attempts to identify manufacturing defects outside the 99% confidence interval
or six standard deviations (standard deviation is represented using the Greek letter
sigma), representing significance testing at p<0.01.

Figure 8.4. The 68-95-99 percent rule for confidence interval.


Methods of sampling from a population

It would normally be impractical to study a whole population, for example when doing a
questionnaire survey. Sampling is a method that allows researchers to infer information
about a population based on results from a subset of the population, without having to
investigate every individual. Reducing the number of individuals in a study reduces the cost
and workload, and may make it easier to obtain high quality information, but this has to be
balanced against having a large enough sample size with enough power to detect a true
association. (Calculation of sample size is addressed in section 1B (statistics) of the Part A
syllabus.)
If a sample is to be used, by whatever method it is chosen, it is important that the
individuals selected are representative of the whole population. This may involve specifically
targeting hard to reach groups. For example, if the electoral roll for a town was used to
identify participants, some people, such as the homeless, would not be registered and
therefore excluded from the study by default.

There are several different sampling techniques available, and they can be subdivided into
two groups: probability sampling and non-probability sampling. In probability (random)
sampling, you start with a complete sampling frame of all eligible individuals from which you
select your sample. In this way, all eligible individuals have a chance of being chosen for the
sample, and you will be more able to generalise the results from your study. Probability
sampling methods tend to be more time-consuming and expensive than non-probability
sampling. In non-probability (non-random) sampling, you do not start with a complete
sampling frame, so some individuals have no chance of being selected. Consequently, you
cannot estimate the effect of sampling error and there is a significant risk of ending up with
a non-representative sample which produces non-generalisable results. However, non-
probability sampling methods tend to be cheaper and more convenient, and they are useful
for exploratory research and hypothesis generation.
 

Probability Sampling Methods

1. Simple random sampling

In this case each individual is chosen entirely by chance and each member of the population
has an equal chance, or probability, of being selected. One way of obtaining a random
sample is to give each individual in a population a number, and then use a table of random
numbers to decide which individuals to include.1 For example, if you have a sampling frame
of 1000 individuals, labelled 0 to 999, use groups of three digits from the random number
table to pick your sample. So, if the first three numbers from the random number table
were 094, select the individual labelled “94”, and so on.
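The random-number-table procedure can be sketched as reading fixed-width digit groups from a digit stream (an illustrative sketch; the `table_sample` helper and the simulated digit stream stand in for a printed table):

```python
import random

def table_sample(frame_size, sample_size, digit_stream):
    """Pick individuals labelled 0..frame_size-1 by reading fixed-width
    groups of digits from a random number table (a string of digits),
    skipping out-of-range labels and duplicates."""
    width = len(str(frame_size - 1))       # 3 digits covers 0..999
    chosen, i = [], 0
    while len(chosen) < sample_size and i + width <= len(digit_stream):
        label = int(digit_stream[i:i + width])
        i += width
        if label < frame_size and label not in chosen:
            chosen.append(label)
    return chosen

# Simulated random number table: a long stream of random digits
rng = random.Random(9)
digits = "".join(str(rng.randrange(10)) for _ in range(1500))
sample = table_sample(1000, 100, digits)
# e.g. a digit group "094" selects the individual labelled 94
```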

As with all probability sampling methods, simple random sampling allows the sampling error
to be calculated and reduces selection bias. A specific advantage is that it is the most
straightforward method of probability sampling. A disadvantage of simple random sampling
is that you may not select enough individuals with your characteristic of interest, especially
if that characteristic is uncommon. It may also be difficult to define a complete sampling
frame and inconvenient to contact them, especially if different forms of contact are required
(email, phone, post) and your sample units are scattered over a wide geographical area.
 

2. Systematic sampling

Individuals are selected at regular intervals from the sampling frame. The intervals are
chosen to ensure an adequate sample size. If you need a sample size n from a population of
size x, you should select every x/nth individual for the sample.  For example, if you wanted a
sample size of 100 from a population of 1000, select every 1000/100 = 10th member of the
sampling frame.
Systematic sampling is often more convenient than simple random sampling, and it is easy
to administer. However, it may also lead to bias, for example if there are underlying
patterns in the order of the individuals in the sampling frame, such that the sampling
technique coincides with the periodicity of the underlying pattern. As a hypothetical
example, if a group of students were being sampled to gain their opinions on college
facilities, but the Student Record Department’s central list of all students was arranged such
that the sex of students alternated between male and female, choosing an even interval
(e.g. every 20th student) would result in a sample of all males or all females. Whilst in this
example the bias is obvious and should be easily corrected, this may not always be the
case.
 

3. Stratified sampling

In this method, the population is first divided into subgroups (or strata) who all share a
similar characteristic. It is used when we might reasonably expect the measurement of
interest to vary between the different subgroups, and we want to ensure representation
from all the subgroups. For example, in a study of stroke outcomes, we may stratify the
population by sex, to ensure equal representation of men and women. The study sample is
then obtained by taking equal sample sizes from each stratum. In stratified sampling, it
may also be appropriate to choose non-equal sample sizes from each stratum. For example,
in a study of the health outcomes of nursing staff in a county, if there are three hospitals
each with different numbers of nursing staff (hospital A has 500 nurses, hospital B has 1000
and hospital C has 2000), then it would be appropriate to choose the sample numbers from
each hospital proportionally (e.g. 10 from hospital A, 20 from hospital B and 40 from
hospital C). This ensures a more realistic and accurate estimation of the health outcomes of
nurses across the county, whereas simple random sampling would over-represent nurses
from hospitals A and B. The fact that the sample was stratified should be taken into account
at the analysis stage.
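The hospital allocation above follows from proportional arithmetic, which can be sketched as follows (the `proportional_allocation` helper is hypothetical; the 500/1000/2000 sizes are the text's example, here allocated for a total sample of 70):

```python
def proportional_allocation(strata_sizes, total_sample):
    """Allocate a total sample across strata in proportion to each
    stratum's share of the population (largest-remainder rounding)."""
    N = sum(strata_sizes.values())
    exact = {k: total_sample * v / N for k, v in strata_sizes.items()}
    alloc = {k: int(x) for k, x in exact.items()}
    # distribute any units lost to rounding, largest remainder first
    short = total_sample - sum(alloc.values())
    for k in sorted(exact, key=lambda k: exact[k] - alloc[k],
                    reverse=True)[:short]:
        alloc[k] += 1
    return alloc

# The text's example: hospitals with 500, 1000 and 2000 nurses
hospitals = {"A": 500, "B": 1000, "C": 2000}
print(proportional_allocation(hospitals, 70))  # {'A': 10, 'B': 20, 'C': 40}
```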
Stratified sampling improves the accuracy and representativeness of the results by reducing
sampling bias. However, it requires knowledge of the appropriate characteristics of the
sampling frame (the details of which are not always available), and it can be difficult to
decide which characteristic(s) to stratify by.
 

4. Clustered sampling

In a clustered sample, subgroups of the population are used as the sampling unit, rather
than individuals. The population is divided into subgroups, known as clusters, which are
randomly selected to be included in the study. Clusters are usually already defined, for
example individual GP practices or towns could be identified as clusters. In single-stage
cluster sampling, all members of the chosen clusters are then included in the study. In two-
stage cluster sampling, a selection of individuals from each cluster is then randomly
selected for inclusion. Clustering should be taken into account in the analysis. The General
Household survey, which is undertaken annually in England, is a good example of a (one-
stage) cluster sample. All members of the selected households (clusters) are included in the
survey.1

Cluster sampling can be more efficient than simple random sampling, especially where a
study takes place over a wide geographical region. For instance, it is easier to contact lots
of individuals in a few GP practices than a few individuals in many different GP practices.
Disadvantages include an increased risk of bias, if the chosen clusters are not
representative of the population, resulting in an increased sampling error.
 

Non-Probability Sampling Methods

1. Convenience sampling

Convenience sampling is perhaps the easiest method of sampling, because participants are
selected based on availability and willingness to take part. Useful results can be obtained,
but the results are prone to significant bias, because those who volunteer to take part may
be different from those who choose not to (volunteer bias), and the sample may not be
representative of other characteristics, such as age or sex. Note: volunteer bias is a risk of
all non-probability sampling methods.
 

2. Quota sampling

This method of sampling is often used by market researchers. Interviewers are given a
quota of subjects of a specified type to attempt to recruit. For example, an interviewer
might be told to go out and select 20 adult men, 20 adult women, 10 teenage girls and 10
teenage boys so that they could interview them about their television viewing. Ideally the
quotas chosen would proportionally represent the characteristics of the underlying
population.

Whilst this has the advantage of being relatively straightforward and potentially
representative, the chosen sample may not be representative of other characteristics that
weren’t considered (a consequence of the non-random nature of sampling).2
 

3. Judgement (or Purposive) Sampling

Also known as selective, or subjective, sampling, this technique relies on the judgement of
the researcher when choosing whom to ask to participate. Researchers may thus implicitly
choose a “representative” sample to suit their needs, or specifically approach individuals
with certain characteristics. This approach is often used by the media when canvassing the
public for opinions and in qualitative research.

Judgement sampling has the advantage of being time- and cost-effective to perform whilst
resulting in a range of responses (particularly useful in qualitative research). However, in
addition to volunteer bias, it is also prone to errors of judgement by the researcher and the
findings, whilst being potentially broad, will not necessarily be representative.
 

4. Snowball sampling
This method is commonly used in social sciences when investigating hard-to-reach groups.
Existing subjects are asked to nominate further subjects known to them, so the sample
increases in size like a rolling snowball. For example, when carrying out a survey of risk
behaviours amongst intravenous drug users, participants may be asked to nominate other
users to be interviewed.

Snowball sampling can be effective when a sampling frame is difficult to identify. However,
by selecting friends and acquaintances of subjects already investigated, there is a significant
risk of selection bias (choosing a large number of people with similar characteristics or
views to the initial individual identified).
 

Bias in sampling

There are five important potential sources of bias that should be considered when selecting
a sample, irrespective of the method used. Sampling bias may be introduced when:1

1. Any pre-agreed sampling rules are deviated from


2. People in hard-to-reach groups are omitted
3. Selected individuals are replaced with others, for example if they are difficult to
contact
4. There are low response rates
5. An out-of-date list is used as the sample frame (for example, if it excludes people
who have recently moved to an area)
Further potential problems with sampling strategies are covered in chapter 8 of this section
(“Sources of variation, its measurement and control”).
 

References

1. Ben-Shlomo Y, Brookes S, Hickman M. 2013. Lecture Notes: Epidemiology, Evidence-based Medicine and Public Health (6th ed.), Wiley-Blackwell, Oxford.
 
2. http://www.stats.gla.ac.uk/steps/glossary/sampling.html - Accessed 8/04/17
 

Understanding different sampling methods


Date published September 19, 2019 by Shona McCombes. Date updated: June 19, 2020

When you conduct research about a group of people, it’s rarely possible to collect data from
every person in that group. Instead, you select a sample. The sample is the group of individuals
who will actually participate in the research.

To draw valid conclusions from your results, you have to carefully decide how you will select a
sample that is representative of the group as a whole. There are two types of sampling methods:

 Probability sampling involves random selection, allowing you to make statistical inferences about the whole group.
 Non-probability sampling involves non-random selection based on convenience or other criteria, allowing you to easily collect initial data.
You should clearly explain how you selected your sample in the methodology section of your
paper or thesis.


Population vs sample
First, you need to understand the difference between a population and a sample, and identify the
target population of your research.

 The population is the entire group that you want to draw conclusions about.
 The sample is the specific group of individuals that you will collect data from.

The population can be defined in terms of geographical location, age, income, and many other
characteristics.

It can be very broad or quite narrow: maybe you want to make inferences about the whole
adult population of your country; maybe your research focuses on customers of a certain
company, patients with a specific health condition, or students in a single school.

It is important to carefully define your target population according to the purpose and
practicalities of your project.

If the population is very large, demographically mixed, and geographically dispersed, it might be
difficult to gain access to a representative sample.
Sampling frame
The sampling frame is the actual list of individuals that the sample will be drawn from. Ideally, it
should include the entire target population (and nobody who is not part of that population).

Example
You are doing research on working conditions at Company X. Your population is all 1000
employees of the company. Your sampling frame is the company’s HR database which lists the
names and contact details of every employee.

Sample size
The number of individuals in your sample depends on the size of the population, and on how
precisely you want the results to represent the population as a whole.

You can use a sample size calculator to determine how big your sample should be. In general,
the larger the sample size, the more accurately and confidently you can make inferences about
the whole population.
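As a rough sketch of what such a sample size calculator does, the snippet below implements Cochran's formula for estimating a proportion, with an optional finite-population correction. The function name and defaults are hypothetical, not from the original text; the worst-case assumption p = 0.5 gives the largest (safest) sample size.

```python
import math

def cochran_sample_size(z, margin_of_error, p=0.5, population=None):
    """Hypothetical helper: Cochran's formula n0 = z^2 * p * (1 - p) / e^2,
    with an optional finite-population correction when `population` is given."""
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

# 95% confidence (z ≈ 1.96), ±5% margin of error, worst-case p = 0.5
print(cochran_sample_size(1.96, 0.05))                   # 385
# Same precision, but for a finite population of 1000 (as in Company X above)
print(cochran_sample_size(1.96, 0.05, population=1000))  # 278
```

Note how the finite-population correction shrinks the required sample considerably when the population itself is small.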

Probability sampling methods


Probability sampling means that every member of the population has a chance of being selected.
It is mainly used in quantitative research. If you want to produce results that are representative of
the whole population, you need to use a probability sampling technique.

There are four main types of probability sample.


1. Simple random sampling
In a simple random sample, every member of the population has an equal chance of being
selected. Your sampling frame should include the whole population.

To conduct this type of sampling, you can use tools like random number generators or other
techniques that are based entirely on chance.

Example
You want to select a simple random sample of 100 employees of Company X. You assign a
number to every employee in the company database from 1 to 1000, and use a random number
generator to select 100 numbers.
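The Company X example above can be sketched in a few lines with Python's standard library, which provides exactly this kind of random number generation. The employee IDs are hypothetical stand-ins for the company database.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: employee IDs 1..1000, as in the example above
frame = list(range(1, 1001))

# Draw 100 IDs without replacement; every ID has the same chance of selection
sample = random.sample(frame, k=100)

print(len(sample))       # 100
print(len(set(sample)))  # 100 (no duplicates: sampling is without replacement)
```

`random.sample` draws without replacement, so no employee can appear twice in the sample.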

2. Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually slightly easier to
conduct. Every member of the population is listed with a number, but instead of randomly
generating numbers, individuals are chosen at regular intervals.

Example
All employees of the company are listed in alphabetical order. From the first 10 numbers, you
randomly select a starting point: number 6. From number 6 onwards, every 10th person on the
list is selected (6, 16, 26, 36, and so on), and you end up with a sample of 100 people.
If you use this technique, it is important to make sure that there is no hidden pattern in the list
that might skew the sample. For example, if the HR database groups employees by team, and
team members are listed in order of seniority, there is a risk that your interval might skip over
people in junior roles, resulting in a sample that is skewed towards senior employees.
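The interval-based selection described above can be sketched as follows. The frame of 1000 employees is hypothetical; the random starting point within the first interval mirrors the "number 6" step in the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

frame = list(range(1, 1001))  # hypothetical list of 1000 employees
n = 100                       # desired sample size
k = len(frame) // n           # sampling interval: 1000 / 100 = 10

start = random.randint(0, k - 1)  # random start within the first interval
sample = frame[start::k]          # then every k-th person on the list

print(len(sample))  # 100
```

Because only the starting point is random, the caution in the text applies: any periodic pattern in the list with period k would bias this sample.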

3. Stratified sampling
This sampling method is appropriate when the population has mixed characteristics, and you
want to ensure that every characteristic is proportionally represented in the sample.

You divide the population into subgroups (called strata) based on the relevant characteristic (e.g.
gender, age range, income bracket, job role).

From the overall proportions of the population, you calculate how many people should be
sampled from each subgroup. Then you use random or systematic sampling to select a sample
from each subgroup.

Example
The company has 800 female employees and 200 male employees. You want to ensure that the
sample reflects the gender balance of the company, so you sort the population into two strata
based on gender. Then you use random sampling on each group, selecting 80 women and 20
men, which gives you a representative sample of 100 people.
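A minimal sketch of the proportional allocation just described, using the hypothetical 800/200 gender split from the example. The helper function name is an illustration, not a library API; it draws a simple random sample within each stratum.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical frame: 800 female ("F") and 200 male ("M") employees
frame = [("F", i) for i in range(800)] + [("M", i) for i in range(200)]

def proportional_stratified_sample(frame, key, n):
    """Draw a simple random sample from each stratum, proportional to its size.
    (If the proportions don't divide evenly, the rounding may need adjusting.)"""
    strata = {}
    for unit in frame:
        strata.setdefault(key(unit), []).append(unit)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(frame))
        sample.extend(random.sample(members, share))
    return sample

sample = proportional_stratified_sample(frame, key=lambda u: u[0], n=100)
print(sum(1 for s in sample if s[0] == "F"))  # 80 women
print(sum(1 for s in sample if s[0] == "M"))  # 20 men
```

The within-stratum draw could equally use systematic sampling, as the text notes.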
4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each subgroup should
have similar characteristics to the whole sample. Instead of sampling individuals from each
subgroup, you randomly select entire subgroups.

If it is practically possible, you might include every individual from each sampled cluster. If the
clusters themselves are large, you can also sample individuals from within each cluster using one
of the techniques above.

This method is good for dealing with large and dispersed populations, but there is more risk of
error in the sample, as there could be substantial differences between clusters. It’s difficult to
guarantee that the sampled clusters are really representative of the whole population.

Example
The company has offices in 10 cities across the country (all with roughly the same number of
employees in similar roles). You don’t have the capacity to travel to every office to collect your
data, so you use random sampling to select 3 offices – these are your clusters.
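The office example above is single-stage cluster sampling, which can be sketched like this. The office names and the 50-employee size per office are hypothetical assumptions for illustration.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10 offices (clusters), each with 50 employees
offices = {
    f"office_{i}": [f"office_{i}_employee_{j}" for j in range(50)]
    for i in range(10)
}

# Single-stage cluster sampling: randomly pick 3 whole offices,
# then include every employee of the chosen offices in the sample
chosen_offices = random.sample(list(offices), k=3)
sample = [emp for office in chosen_offices for emp in offices[office]]

print(len(chosen_offices))  # 3
print(len(sample))          # 150
```

For two-stage cluster sampling, you would apply `random.sample` again within each chosen office instead of taking everyone.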


Non-probability sampling methods


In a non-probability sample, individuals are selected based on non-random criteria, and not every
individual has a chance of being included.

This type of sample is easier and cheaper to access, but it has a higher risk of sampling bias, and
you can’t use it to make valid statistical inferences about the whole population.

Non-probability sampling techniques are often appropriate for exploratory and qualitative
research. In these types of research, the aim is not to test a hypothesis about a broad population,
but to develop an initial understanding of a small or under-researched population.
1. Convenience sampling
A convenience sample simply includes the individuals who happen to be most accessible to the
researcher.

This is an easy and inexpensive way to gather initial data, but there is no way to tell if the sample
is representative of the population, so it can’t produce generalizable results.

Example
You are researching opinions about student support services in your university, so after each of
your classes, you ask your fellow students to complete a survey on the topic. This is a convenient
way to gather data, but as you only surveyed students taking the same classes as you at the same
level, the sample is not representative of all the students at your university.

2. Voluntary response sampling


Similar to a convenience sample, a voluntary response sample is mainly based on ease of access.
Instead of the researcher choosing participants and directly contacting them, people volunteer
themselves (e.g. by responding to a public online survey).

Voluntary response samples are always at least somewhat biased, as some people will inherently
be more likely to volunteer than others.

Example
You send out the survey to all students at your university and a lot of students decide to complete
it. This can certainly give you some insight into the topic, but the people who responded are
more likely to be those who have strong opinions about the student support services, so you can’t
be sure that their opinions are representative of all students.

3. Purposive sampling
This type of sampling involves the researcher using their judgement to select a sample that is
most useful to the purposes of the research.

It is often used in qualitative research, where the researcher wants to gain detailed knowledge
about a specific phenomenon rather than make statistical inferences. An effective purposive
sample must have clear criteria and rationale for inclusion.

Example
You want to know more about the opinions and experiences of disabled students at your
university, so you purposefully select a number of students with different support needs in order
to gather a varied range of data on their experiences with student services.

4. Snowball sampling
If the population is hard to access, snowball sampling can be used to recruit participants via other
participants. The number of people you have access to “snowballs” as you get in contact with
more people.
Example
You are researching experiences of homelessness in your city. Since there is no list of all
homeless people in the city, probability sampling isn’t possible. You meet one person who
agrees to participate in the research, and she puts you in contact with other homeless people that
she knows in the area.

What is sampling?
Sampling definition: Sampling is a technique of selecting individual
members or a subset of the population to make statistical inferences from
them and estimate characteristics of the whole population. Different sampling
methods are widely used by researchers in market research so that they do
not need to research the entire population to collect actionable insights. It is
also a time-saving and cost-effective method and hence forms the basis
of any research design. Sampling techniques can also be built into research
survey software to draw samples efficiently.

For example, if a drug manufacturer would like to research the adverse side
effects of a drug on the country’s population, it is almost impossible to conduct
a research study that involves everyone. In this case, the researcher selects
a sample of people from each demographic and then researches them, giving
him/her indicative feedback on the drug’s behavior.


Types of sampling: sampling methods 


Sampling in market research is of two types – probability sampling and non-
probability sampling. Let’s take a closer look at these two methods of
sampling.

1. Probability sampling: Probability sampling is a sampling technique in
which a researcher sets a few selection criteria and chooses members
of a population randomly. All the members have an equal opportunity to be
a part of the sample under this selection parameter.
2. Non-probability sampling: In non-probability sampling, the researcher
chooses members for research arbitrarily rather than at random. This
sampling method has no fixed or predefined selection process, which makes it
difficult for all elements of a population to have equal opportunities to be included in a sample.
In this blog, we discuss the various probability and non-probability sampling
methods that you can implement in any market research study.

Types of probability sampling with examples:


Probability sampling is a sampling technique in which researchers choose
samples from a larger population using a method based on the theory of
probability. This sampling method considers every member of the population
and forms samples based on a fixed process.

For example, in a population of 1000 members, every member will have a
1/1000 chance of being selected to be a part of a sample. Probability
sampling reduces selection bias and gives all members a fair
chance to be included in the sample.

There are four types of probability sampling techniques:

 Simple random sampling: One of the best probability sampling
techniques, which helps in saving time and resources, is the simple random
sampling method. It is a reliable method of obtaining information where
every single member of a population is chosen randomly, merely by chance.
Each individual has the same probability of being chosen to be a part of a
sample.
For example, in an organization of 500 employees, if the HR team decides
on conducting team building activities, it is highly likely that they would
prefer picking chits out of a bowl. In this case, each of the 500 employees
has an equal opportunity of being selected.
 Cluster sampling: Cluster sampling is a method where the researchers
divide the entire population into sections or clusters that represent a
population. Clusters are identified and included in a sample based on
demographic parameters like age, sex, location, etc. This makes it very
simple for a survey creator to derive effective inference from the feedback.
For example, if the United States government wishes to evaluate the
number of immigrants living in the Mainland US, they can divide it into
clusters based on states such as California, Texas, Florida, Massachusetts,
Colorado, Hawaii, etc. This way of conducting a survey will be more
effective as the results will be organized into states and provide insightful
immigration data.
 Systematic sampling: Researchers use the systematic sampling
method to choose the sample members of a population at regular intervals.
It requires the selection of a starting point for the sample and sample size
that can be repeated at regular intervals. This type of sampling method has
a predefined range, and hence this sampling technique is the least time-
consuming.
For example, a researcher intends to collect a systematic sample of 500
people in a population of 5000. He/she numbers each element of the
population from 1-5000 and will choose every 10th individual to be a part of
the sample (Total population/ Sample Size = 5000/500 = 10).
 Stratified random sampling: Stratified random sampling is a method in
which the researcher divides the population into smaller groups that don’t
overlap but together represent the entire population. While sampling, these groups
can be organized, and a sample is then drawn from each group separately.
For example, a researcher looking to analyze the characteristics of people
belonging to different annual income divisions will create strata (groups)
according to the annual family income, e.g. less than $20,000, $21,000 to
$30,000, $31,000 to $40,000, $41,000 to $50,000, etc. By doing this, the
researcher concludes the characteristics of people belonging to different
income groups. Marketers can analyze which income groups to target and
which ones to eliminate to create a roadmap that would bear fruitful results.
Uses of probability sampling
There are multiple uses of probability sampling. They are:

 Reduce Sample Bias: Using the probability sampling method, the bias
in the sample derived from a population is negligible to non-existent. The
selected sample reflects the population rather than the researcher's own
judgement, leading to higher-quality data
collection as the sample appropriately represents the population.
 Diverse Population: When the population is vast and diverse, it is
essential to have adequate representation so that the data is not skewed
towards one demographic. For example, if Square would like to understand
the people that could use their point-of-sale devices, a survey conducted
from a sample of people across the US from different industries and socio-
economic backgrounds helps.
 Create an Accurate Sample: Probability sampling helps the
researchers plan and create an accurate sample. This helps to obtain well-
defined data.

Types of non-probability sampling with examples


The non-probability method is a sampling method that involves a collection of
feedback based on a researcher or statistician’s sample selection capabilities
and not on a fixed selection process. In most situations, the output of a survey
conducted with a non-probability sample leads to skewed results, which may
not represent the desired target population. But there are situations, such as
the preliminary stages of research or cost constraints for conducting research,
where non-probability sampling will be much more useful than the other type.

Four types of non-probability sampling illustrate the purpose of this sampling
method:

 Convenience sampling: This method depends on the ease of
access to subjects, such as surveying customers at a mall or passers-by on
a busy street. It is termed convenience sampling because of the ease with
which the researcher can carry it out and get in touch with the subjects.
Researchers have nearly no authority to select the sample elements, and
it’s purely done based on proximity and not representativeness. This non-
probability sampling method is used when there are time and cost
limitations in collecting feedback. In situations where there are resource
limitations such as the initial stages of research, convenience sampling is
used.
For example, startups and NGOs usually conduct convenience sampling at
a mall to distribute leaflets of upcoming events or promotion of a cause –
they do that by standing at the mall entrance and giving out pamphlets
randomly.
 Judgmental or purposive sampling: Judgemental or purposive
samples are formed by the discretion of the researcher. Researchers purely
consider the purpose of the study, along with the understanding of the target
audience. For instance, when researchers want to understand the thought
process of people interested in studying for their master’s degree. The
selection criteria will be: “Are you interested in doing your masters in …?”
and those who respond with a “No” are excluded from the sample.
 Snowball sampling: Snowball sampling is a sampling method that
researchers apply when the subjects are difficult to trace. For example, it
will be extremely challenging to survey shelterless people or illegal
immigrants. In such cases, using the snowball theory, researchers can track
a few categories to interview and derive results. Researchers also
implement this sampling method in situations where the topic is highly
sensitive and not openly discussed, for example surveys to gather
information about HIV/AIDS. Not many affected individuals will readily respond to the
questions, but researchers can contact people they might know or
volunteers associated with the cause to get in touch with the victims and
collect information.
 Quota sampling: In quota sampling, members are selected based
on a pre-set standard. Because the sample is formed based on specific
attributes, the created sample will have the same qualities found in the
total population. It is a rapid method of collecting samples.

Uses of non-probability sampling


Non-probability sampling is used for the following:

 Create a hypothesis: Researchers use the non-probability sampling
method to create a working assumption when limited or no prior information is
available. This method helps with the immediate return of data and builds a
base for further research.
 Exploratory research: Researchers use this sampling technique widely
when conducting qualitative research, pilot studies, or exploratory research.
 Budget and time constraints: The non-probability method is used when there
are budget and time constraints, and some preliminary data must be
collected. Since the survey design is not rigid, it is easier to pick
respondents at random and have them take the survey or questionnaire.
How do you decide on the type of sampling to use?
For any research, it is essential to choose a sampling method accurately to
meet the goals of your study. The effectiveness of your sampling relies on
various factors. Here are some steps expert researchers follow to decide the
best sampling method.

 Jot down the research goals. Generally, they will be a combination of
cost, precision, and accuracy.
 Identify the effective sampling techniques that might potentially achieve
the research goals.
 Test each of these methods and examine whether they help in
achieving your goal.
 Select the method that works best for the research.
Difference between Probability Sampling and Non-
Probability Sampling Methods
We have looked at the different types of sampling methods above and their
subtypes. To encapsulate the whole discussion, though, the significant
differences between probability sampling methods and non-probability
sampling methods are as below:

In each point below, the first part describes probability sampling methods and the second part non-probability sampling methods:

 Definition: Probability sampling is a sampling technique in which samples from a larger population are chosen using a method based on the theory of probability; non-probability sampling is a sampling technique in which the researcher selects samples based on the researcher’s subjective judgment rather than random selection.
 Alternatively known as: the random sampling method; the non-random sampling method.
 Population selection: the population is selected randomly; the population is selected arbitrarily.
 Nature: the research is conclusive; the research is exploratory.
 Sample: since there is a method for deciding the sample, the population demographics are conclusively represented; since the sampling method is arbitrary, the population demographics representation is almost always skewed.
 Time taken: takes longer to conduct, since the research design defines the selection parameters before the market research study begins; quick, since both the sample and the selection criteria of the sample are undefined.
 Results: this type of sampling is entirely unbiased, and hence the results are unbiased too and conclusive; this type of sampling is entirely biased, and hence the results are biased too, rendering the research speculative.
 Hypothesis: in probability sampling, there is an underlying hypothesis before the study begins, and the objective of this method is to prove the hypothesis; in non-probability sampling, the hypothesis is derived after conducting the research study.

 
Q2.A sample design is the framework, or road map, that serves as the basis for the
selection of a survey sample and affects many other important aspects of a survey as
well.

Type of universe, sampling unit, source list, size of sample, parameters of interest,
budgetary constraint and sampling procedure are the points to be taken into
consideration by a researcher in developing a sample design.


What points should be taken into consideration by a researcher in developing a sample design?

While developing a sampling design, the researcher must pay attention to the following
points:
1. Type of universe: The first step in developing any sample design is to clearly
define the set of objects, technically called the Universe, to be studied. The universe
can be finite or infinite. In finite universe the number of items is certain, but in case of
an infinite universe the number of items is infinite, i.e., we cannot have any idea about
the total number of items. The population of a city, the number of workers in a
factory and the like are examples of finite universes, whereas the number of stars in
the sky, listeners of a specific radio programme, throws of a die, etc. are examples
of infinite universes.
2. Sampling unit: A decision has to be taken concerning a sampling unit before
selecting sample. Sampling unit may be a geographical one such as state, district,
village, etc., or a construction unit such as house, flat, etc., or it may be a social unit
such as family, club, school, etc., or it may be an individual. The researcher will have
to decide one or more of such units that he has to select for his study.
3. Source list: It is also known as ‘sampling frame’ from which sample is to be
drawn. It contains the names of all items of a universe (in case of finite universe
only). If source list is not available, researcher has to prepare it. Such a list should be
comprehensive, correct, reliable and appropriate. It is extremely important for the
source list to be as representative of the population as possible.
4. Size of sample: This refers to the number of items to be selected from the
universe to constitute a sample. This is a major problem before a researcher. The size
of sample should neither be excessively large, nor too small. It should be optimum.
An optimum sample is one which fulfills the requirements of efficiency,
representativeness, reliability and flexibility. While deciding the size of sample,
researcher must determine the desired precision as well as an acceptable confidence
level for the estimate. The size of population variance needs to be considered as in
case of larger variance usually a bigger sample is needed. The size of population
must be kept in view for this also limits the sample size. The parameters of interest
in a research study must be kept in view, while deciding the size of the sample. Costs
too dictate the size of sample that we can draw. As such, budgetary constraint must
invariably be taken into consideration when we decide the sample size.
5. Parameters of interest: In determining the sample design, one must consider
the question of the specific population parameters which are of interest. For
instance, we may be interested in estimating the proportion of persons with some
characteristic in the population, or we may be interested in knowing some average or
the other measure concerning the population. There may also be important sub-
groups in the population about whom we would like to make estimates. All this has a
strong impact upon the sample design we would accept.
6. Budgetary constraint: Cost considerations, from practical point of view, have a
major impact upon decisions relating to not only the size of the sample but also to
the type of sample. This fact can even lead to the use of a non-probability sample.
7. Sampling procedure: Finally, the researcher must decide the type of sample he
will use i.e., he must decide about the technique to be used in selecting the items for
the sample. In fact, this technique or procedure stands for the sample design itself.
There are several sample designs (explained in the pages that follow) out of which
the researcher must choose one for his study. Obviously, he must select that design
which, for a given sample size and for a given cost, has a smaller sampling error.

Q3. Why is probability sampling generally preferred in comparison to non-probability
sampling? Explain the procedure of selecting a simple random sample.

Q3. There are two primary reasons:

The first reason relates to random error. Probability is the machine that allows the
assessment of precision of an estimate taken from a survey, and random sampling is what
generates the framework for this machinery.

If you don’t sample randomly, it’s simply impossible to make any kind of assessment of the
quality of your estimate, which statisticians gauge through statements about the likely size
of error conjoined with a meaning of “likely.” For example, when we say “with 90%
confidence the estimate is within error E of the true value of the parameter,” we mean that if
we took random samples in the same manner over and over, our estimate would change
each time… but 90% of the samples would lead to estimates within E of the unknown true
value. This kind of assessment is only possible if the sample is random.
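This repeated-sampling idea can be demonstrated with a small simulation. Below is a minimal Python sketch; the population, the seed, and the usual 90% interval formula are invented for illustration and do not come from the text:

```python
import random

# Invented population for illustration; not from the original text.
random.seed(42)
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = sum(population) / len(population)

def sample_ci(n=100, z=1.645):
    """Draw a simple random sample and return a rough 90% CI for the mean."""
    sample = random.sample(population, n)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = z * (var ** 0.5) / n ** 0.5
    return mean - half_width, mean + half_width

# Repeat the sampling many times and count how often the interval
# captures the (unknown in practice) true mean.
covered = sum(lo <= true_mean <= hi for lo, hi in (sample_ci() for _ in range(1000)))
print(f"{covered / 10:.1f}% of 1000 intervals covered the true mean")
```

With a 90% interval, roughly 900 of the 1000 intervals should cover the true mean, which is exactly the "over and over" statement above, and it is only possible to say this because each sample was drawn randomly.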

The second reason relates to nonrandom error. While random samples are not certain to be
representative of a population, we know that nonrandom samples are notoriously unlikely
to be so. Simply put, if you sample nonrandomly, you’re very much inclined to obtain a
nonrepresentative sample, leading to nonrandom error (bias).

Currently, President Trump's reelection campaign is posting advertisements online, inviting
people to take an "official" Trump approval poll. His approval ratings in this poll are
sky-high. Obviously, people who choose to visit his reelection campaign site to take a
survey are not representative of the population at large.

Simply put, you should reject out of hand any and all surveys in which nonrandom sampling
is done. They’re always untrustworthy.

Does this mean that polls that use random sampling are always trustworthy? No, all kinds of
problems can occur with them. Think of it this way: Nonrandom sampling is always bad;
random sampling is always good in theory, but in practice may not lead to a reliable
estimate.
A probability sample is a sample for which you know a priori the probability of each
unit to be part of the sample. There are many sampling designs that fit with this
definition, such as simple random sampling, stratified sampling, two-stage sampling,
etc. Statistical inference techniques are based on such sampling, so with a
non-probability sample you cannot make valid statistical inferences.

Probability sampling is based on the concept of random selection, where each population
element has a known, non-zero chance of being selected in the sample. This differs from
non-probability sampling, in which each member of the population does not have a known
chance of being selected.
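The procedure of selecting a simple random sample asked about above amounts to numbering a sampling frame and drawing units by random numbers, so that every subset of the chosen size is equally likely. A minimal Python sketch; the 500-unit frame and sample size of 25 are invented:

```python
import random

# Hypothetical sampling frame: a numbered list of all 500 population units.
frame = [f"unit_{i}" for i in range(1, 501)]

random.seed(7)
# Draw 25 units without replacement; every subset of size 25 is equally likely.
srs = random.sample(frame, k=25)

print(len(srs), len(set(srs)))  # 25 25 (no unit appears twice)
```

Drawing without replacement, as here, matches the usual definition of a simple random sample from a finite population.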

Q4. Sampling Methods


Sampling Methods can be classified into one of two categories:

 Probability Sampling: the sample has a known probability of being
selected
 Non-probability Sampling: the sample does not have a known probability of
being selected, as in convenience or voluntary response surveys
 
Probability Sampling
In probability sampling, it is possible to both determine which sampling units
belong to which sample and the probability that each sample will be selected.
The following sampling methods are examples of probability sampling:

1. Simple Random Sampling (SRS)


2. Stratified Sampling
3. Cluster Sampling
4. Systematic Sampling
5. Multistage Sampling (in which some of the methods above are
combined in stages)
Of the five methods listed above, students have the most trouble
distinguishing between stratified sampling and cluster sampling.
Stratified Sampling is possible when it makes sense to partition the
population into groups based on a factor that may influence the variable that
is being measured. These groups are then called strata. An individual group is
called a stratum. With stratified sampling one should:

 partition the population into groups (strata)


 obtain a simple random sample from each group (stratum)
 collect data on each sampling unit that was randomly sampled from
each group (stratum)
Stratified sampling works best when a heterogeneous population is split into
fairly homogeneous groups. Under these conditions, stratification generally
produces more precise estimates of the population percents than estimates
that would be found from a simple random sample. Table 2.2 shows some
examples of ways to obtain a stratified sample.
 

Table 2.2. Examples of Stratified Samples

Example 1
  Population: all people in the U.S.
  Groups (strata): the 4 time zones in the U.S. (Eastern, Central, Mountain, Pacific)
  Obtain a simple random sample: 500 people from each of the 4 time zones
  Sample: 4 × 500 = 2000 selected people

Example 2
  Population: all PSU intercollegiate athletes
  Groups (strata): the 26 PSU intercollegiate teams
  Obtain a simple random sample: 5 athletes from each of the 26 PSU teams
  Sample: 26 × 5 = 130 selected athletes

Example 3
  Population: all elementary students in a school district
  Groups (strata): the 11 different elementary schools in the school district
  Obtain a simple random sample: 20 students from each of the 11 elementary schools
  Sample: 11 × 20 = 220 selected students
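The three stratified-sampling steps listed above (partition, SRS per stratum, collect) can be sketched in Python. The time-zone strata follow Example 1, but the stratum sizes and the per-stratum draw of 50 are invented:

```python
import random

random.seed(1)
# Invented frames for the four time-zone strata.
strata = {
    "Eastern":  [f"E{i}" for i in range(1000)],
    "Central":  [f"C{i}" for i in range(800)],
    "Mountain": [f"M{i}" for i in range(300)],
    "Pacific":  [f"P{i}" for i in range(900)],
}

def stratified_sample(strata, per_stratum=50):
    """Take a simple random sample of the same size from every stratum."""
    return {name: random.sample(units, per_stratum) for name, units in strata.items()}

picks = stratified_sample(strata)
total = sum(len(units) for units in picks.values())
print(total)  # 4 strata x 50 = 200 selected people
```

Note that every stratum contributes to the sample, which is what distinguishes this from the cluster design below it in the text.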

Cluster Sampling is very different from Stratified Sampling. With cluster


sampling, one should

 divide the population into groups (clusters).


 obtain a simple random sample of so many clusters from all possible
clusters.
 obtain data on every sampling unit in each of the randomly selected
clusters.
It is important to note that, unlike with the strata in stratified sampling, the
clusters should be microcosms, rather than subsections, of the population.
Each cluster should be heterogeneous. Additionally, the statistical analysis
used with cluster sampling is not only different but also more complicated
than that used with stratified sampling.

Table 2.3. Examples of Cluster Samples

Example 1
  Population: all people in the U.S.
  Groups (clusters): the 4 time zones in the U.S. (Eastern, Central, Mountain, Pacific)
  Obtain a simple random sample: 2 time zones from the 4 possible time zones
  Sample: every person in the 2 selected time zones

Example 2
  Population: all PSU intercollegiate athletes
  Groups (clusters): the 26 PSU intercollegiate teams
  Obtain a simple random sample: 8 teams from the 26 possible teams
  Sample: every athlete on the 8 selected teams

Example 3
  Population: all elementary students in a school district
  Groups (clusters): the 11 different elementary schools in the school district
  Obtain a simple random sample: 4 elementary schools from the 11 possible elementary schools
  Sample: every student in the 4 selected elementary schools

Each of the three examples that are found in Tables 2.2 and 2.3 was used to
illustrate how both stratified and cluster sampling could be accomplished.
However, there are obviously times when one sampling method is preferred
over the other. The following explanations add some clarification about when
to use which method.

 With Example 1: Stratified sampling would be preferred over cluster


sampling, particularly if the questions of interest are affected by time
zone. For example, the percentage of people watching a live sporting
event on television might be highly affected by the time zone they are
in. Cluster sampling really works best when there are a reasonable
number of clusters relative to the entire population. In this case,
selecting 2 clusters from 4 possible clusters really does not provide
many advantages over simple random sampling.
 With Example 2: Either stratified sampling or cluster sampling could be
used. It would depend on what questions are being asked. For instance,
consider the question "Do you agree or disagree that you receive
adequate attention from the team of doctors at the Sports Medicine
Clinic when injured?" The answer to this question would probably not be
team dependent, so cluster sampling would be fine. In contrast, if the
question of interest is "Do you agree or disagree that weather affects
your performance during an athletic event?" The answer to this question
would probably be influenced by whether or not the sport is played
outside or inside. Consequently, stratified sampling would be preferred.
 With Example 3: Cluster sampling would probably be better than
stratified sampling if each individual elementary school appropriately
represents the entire population as in a school district where students
from throughout the district can attend any school. Stratified sampling
could be used if the elementary schools had very different locations and
served only their local neighborhood (i.e., one elementary school is
located in a rural setting while another elementary school is located in
an urban setting.) Again, the questions of interest would affect which
sampling method should be used.
The most common method of carrying out a poll today is using Random
Digit Dialing, in which a machine randomly dials phone numbers. Some polls
go even further and have a machine conduct the interview itself rather than
just dialing the number! Such "robocall polls" can be very biased because
they have extremely low response rates (most people don't like speaking to a
machine) and because federal law prevents such calls to cell phones. Since the
people who have landline phone service tend to be older than people who
have cell phone service only, another potential source of bias is introduced.
National polling organizations that use random digit dialing in conducting
interviewer based polls are very careful to match the number of landline
versus cell phones to the population they are trying to survey.
 
Non-probability Sampling
The following sampling methods that are listed in your text are types of non-
probability sampling that should be avoided:

1. volunteer samples
2. haphazard (convenience) samples
Since such non-probability sampling methods are based on human choice
rather than random selection, a statistical theory cannot explain how they
might behave and potential sources of bias are rampant. In your textbook, the
two types of non-probability samples listed above are called "sampling
disasters."
Read the article: "How Polls are Conducted" by the Gallup organization
available in Canvas.
The article provides great insight into how major polls are conducted. When
you are finished reading this article you may want to go to the Gallup Poll
Website and see the results from recent Gallup polls. Another excellent source
of public opinion polls on a wide variety of topics using solid sampling
methodology is the Pew Research Center Website. When you read one of
the summary reports on the Pew site, there is a link (in the upper right corner)
to the complete report giving more detailed results and a full description of
their methodology as well as a link to the actual questionnaire used in the
survey so you can judge whether there might be bias in the wording of their
survey.
It is important to be mindful of margin of error as discussed in this article. We
all need to remember that public opinion on a given topic cannot be
appropriately measured with one question that is only asked on one poll. Such
results only provide a snapshot at that moment under certain conditions. The
concept of repeating procedures over different conditions and times leads to
more valuable and durable results. Within this section of the Gallup article,
there is also an error: "in 95 out of those 100 polls, his rating would be
between 46% and 54%." This should instead say that in an expected 95 out of
those 100 polls, the true population percent would be within the confidence
interval calculated. In 5 of those surveys, the confidence interval would not
contain the population percent.
Q5. "A systematic bias results from errors in the sampling procedures." What do you
mean by such a systematic bias? Describe the important causes responsible for
such a bias.

What are the Criteria for Selecting a Sampling Procedure?

One must remember that two costs are involved in a sampling analysis, viz., the cost of
collecting the data and the cost of an incorrect inference resulting from the data. The
researcher must keep in view the two causes of incorrect inferences, viz., systematic bias
and sampling error. A systematic bias results from errors in the sampling procedures, and
it cannot be reduced or eliminated by increasing the sample size. At best, the causes
responsible for these errors can be detected and corrected. Usually a systematic bias is
the result of one or more of the following factors:
1. Inappropriate sampling frame: If the sampling frame is inappropriate i.e., a
biased representation of the universe, it will result in a systematic bias.
2. Defective measuring device: If the measuring device is constantly in error, it
will result in systematic bias. In survey work, systematic bias can result if the
questionnaire or the interviewer is biased. Similarly, if the physical measuring device
is defective there will be systematic bias in the data collected through such a
measuring device.
3. Non-respondents: If we are unable to sample all the individuals initially included
in the sample, there may arise a systematic bias. The reason is that in such a
situation the likelihood of establishing contact or receiving a response from an
individual is often correlated with the measure of what is to be estimated.
4. Indeterminacy principle: Sometimes we find that individuals act differently
when kept under observation than what they do when kept in non-observed
situations. For instance, if workers are aware that somebody is observing them in
course of a work study on the basis of which the average length of time to complete
a task will be determined and accordingly the quota will be set for piece work, they
generally tend to work slowly in comparison to the speed with which they work if kept
unobserved. Thus, the indeterminacy principle may also be a cause of a systematic
bias.
5. Natural bias in the reporting of data: Natural bias of respondents in the
reporting of data is often the cause of a systematic bias in many inquiries. There is
usually a downward bias in the income data collected by government taxation
department, whereas we find an upward bias in the income data collected by some
social organisation. People in general understate their incomes if asked about it for
tax purposes, but they overstate the same if asked for social status or their
affluence. Generally in psychological surveys, people tend to give what they think is
the ‘correct’ answer rather than revealing their true feelings.
Sampling errors  are the random variations in the sample estimates around the true
population parameters. Since they occur randomly and are equally likely to be in either
direction, their nature happens to be of compensatory type and the expected value of such
errors happens to be equal to zero. Sampling error decreases with the increase in the size
of the sample, and it happens to be of a smaller magnitude in case of homogeneous
population.
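The claim that sampling error shrinks as the sample size grows can be checked directly by simulation; the uniform population and the two sample sizes below are invented for illustration:

```python
import random

random.seed(9)
# Invented population; its mean plays the role of the true parameter.
population = [random.uniform(0, 100) for _ in range(50_000)]
mu = sum(population) / len(population)

def mean_abs_error(n, reps=400):
    """Average |sample mean - population mean| over many repeated samples of size n."""
    total = 0.0
    for _ in range(reps):
        s = random.sample(population, n)
        total += abs(sum(s) / n - mu)
    return total / reps

small_n, large_n = mean_abs_error(25), mean_abs_error(400)
print(small_n > large_n)  # the larger sample has the smaller typical error
```

Because these errors are random and average out to zero, increasing n tightens the estimates; it would do nothing, however, against a systematic bias such as a defective frame.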
Sampling error  can be measured for a given sample design and size. The measurement of
sampling error is usually called the ‘precision of the sampling plan’. If we increase the
sample size, the precision can be improved. But increasing the size of the sample has its
own limitations viz., a large sized sample increases the cost of collecting data and also
enhances the systematic bias. Thus the effective way to increase precision is usually to
select a better sampling design which has a smaller sampling error for a given sample size
at a given cost. In practice, however, people prefer a less precise design because it is easier
to adopt the same and also because of the fact that systematic bias can be controlled in a
better way in such a design.
In brief, while selecting a sampling procedure, researcher must ensure that the procedure
causes a relatively small sampling error and helps to control the systematic bias in a better
way.
ADDITIONAL INFORMATION

Sampling Design

Census and Sample Survey

 All items in any field of inquiry constitute a ‘Universe’ or ‘Population.’

 A complete enumeration of all items in the ‘population’ is known as a census inquiry

Researcher must prepare a sample design for his study i.e., he must plan how a sample should be
selected and of what size such a sample would be.

Implications of a Sample Design

         A sample design is a definite plan for obtaining a sample from a given population.

         It refers to the technique or the procedure the researcher would adopt in selecting items for the
sample.

         Sample design may as well lay down the number of items to be included in the sample i.e., the size of
the sample. Sample design is determined before data are collected


Characteristics of a Good Sample Design

From what has been stated above, we can list down the characteristics of a good sample design as
under:

o  Sample design must result in a truly representative sample.

o  Sample design must be such which results in a small sampling error.

o  Sample design must be viable in the context of funds available for the research study.

o  Sample design must be such so that systematic bias can be controlled in a better way.

o  Sample should be such that the results of the sample study can be applied, in general, for the universe
with a reasonable level of confidence.

Different Types of Sample Designs


There are different types of sample designs based on two factors viz., the representation basis and the
element selection technique. On the representation basis, the sample may be probability sampling or it
may be non-probability sampling. Probability sampling is based on the concept of random selection,
whereas non-probability sampling is ‘non-random’ sampling. On element selection basis, the sample
may be either unrestricted or restricted. When each sample element is drawn individually from the
population at large, then the sample so drawn is known as ‘unrestricted sample’, whereas all other
forms of sampling are covered under the term ‘restricted sampling’. The following chart exhibits the
sample designs as explained above.

Thus, sample designs are basically of two types viz., non-probability sampling and probability
sampling. We take up these two designs separately.

CHART SHOWING BASIC SAMPLING DESIGNS

                              Representation basis
Element selection
technique               Probability sampling          Non-probability sampling

Unrestricted sampling   Simple random sampling        Haphazard sampling or
                                                      convenience sampling

Restricted sampling     Complex random sampling       Purposive sampling
                        (such as cluster sampling,    (such as quota sampling,
                        systematic sampling,          judgement sampling)
                        stratified sampling, etc.)
a)      Non-probability sampling:

 Non-probability sampling is that sampling procedure which does not afford any basis for
estimating the probability that each item in the population has of being included in the sample.

 Non-probability sampling is also known by different names such as deliberate sampling,


purposive sampling and judgement sampling.

 In this type of sampling, items for the sample are selected deliberately by the researcher;
his choice concerning the items remains supreme.

 In other words, under non-probability sampling the organisers of the inquiry purposively
choose the particular units of the universe for constituting a sample on the basis that the small
mass that they so select out of a huge one will be typical or representative of the whole.

 For instance, if economic conditions of people living in a state are to be studied, a few
towns and villages may be purposively selected for intensive study on the principle that they can
be representative of the entire state.

 In such a design, personal element has a great chance of entering into the selection of the
sample.

 The investigator may select a sample which shall yield results favourable to his point of
view and if that happens, the entire inquiry may get vitiated. Thus, there is always the danger of
bias entering into this type of sampling technique.

 But if the investigators are impartial, work without bias and have the necessary
experience to make sound judgements, the results obtained from an analysis of a deliberately
selected sample may be tolerably reliable.

 However, in such a sampling, there is no assurance that every element has some
specifiable chance of being included.

 Sampling error in this type of sampling cannot be estimated and the element of bias, great
or small, is always there. As such, this sampling design is rarely adopted in large inquiries of
importance. However, in small inquiries and researches by individuals, this design may be adopted
because of the relative advantage of time and money inherent in this method of sampling.

 Quota sampling is also an example of non-probability sampling. Under quota sampling the
interviewers are simply given quotas to be filled from the different strata, with some restrictions
on how they are to be filled.

 In other words, the actual selection of the items for the sample is left to the interviewer’s
discretion. This type of sampling is very convenient and is relatively inexpensive.
 But the samples so selected certainly do not possess the characteristic of random samples.
Quota samples are essentially judgement samples and inferences drawn on their basis are not
amenable to statistical treatment in a formal way.

b)      Probability sampling:

         Probability sampling is also known as ‘random sampling’ or ‘chance sampling’.

         Under this sampling design, every item of the universe has an equal chance of inclusion in the sample.

         It is, so to say, a lottery method in which individual units are picked up from the whole group not
deliberately but by some mechanical process.

         Here it is blind chance alone that determines whether one item or the other is selected.

         The results obtained from probability or random sampling can be assured in terms of probability i.e., we
can measure the errors of estimation or the significance of results obtained from a random sample, and
this fact brings out the superiority of random sampling design over the deliberate sampling design.

         Random sampling ensures the law of Statistical Regularity which states that if on an average the sample
chosen is a random one, the sample will have the same composition and characteristics as the universe.

         This is the reason why random sampling is considered as the best technique of selecting a
representative sample.

         Random sampling from a finite population refers to that method of sample selection which gives each
possible sample combination an equal probability of being picked up and each item in the entire
population to have an equal chance of being included in the sample.
         This applies to sampling without replacement, i.e., once an item is selected for the sample, it cannot
appear in the sample again. (Sampling with replacement, in which the element selected for the sample is
returned to the population before the next element is selected, is used less frequently; in such a
procedure the same element could appear more than once in the same sample.) In brief, the implications
of random sampling (or simple random sampling) are:

  It gives each element in the population an equal probability of getting into the sample; and all choices are
independent of one another.

   It gives each possible sample combination an equal probability of being chosen.

Random Sample from an Infinite Universe

So far we have talked about random sampling, keeping in view only the finite populations. But what
about random sampling in context of infinite populations? It is relatively difficult to explain the concept
of random sample from an infinite population. However, a few examples will show the basic
characteristic of such a sample. Suppose we consider the 20 throws of a fair dice as a sample from the
hypothetically infinite population which consists of the results of all possible throws of the dice. If the
probability of getting a particular number, say 1, is the same for each throw and the 20 throws are all
independent, then we say that the sample is random. Similarly, it would be said to be sampling from an
infinite population if we sample with replacement from a finite population and our sample would be
considered as a random sample if in each draw all elements of the population have the same probability
of being selected and successive draws happen to be independent. In brief, one can say that the
selection of each item in a random sample from an infinite population is controlled by the same
probabilities and that successive selections are independent of one another.
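The dice example above translates directly into code: each throw is independent and governed by the same probabilities, so the 20 throws form a random sample from the hypothetically infinite population of all possible throws. A minimal sketch:

```python
import random

random.seed(5)
# 20 independent throws of a fair die: same probabilities on every draw.
throws = [random.randint(1, 6) for _ in range(20)]

print(len(throws), min(throws) >= 1 and max(throws) <= 6)  # 20 True
```

Nothing about the population is enumerated here; independence and identical probabilities are what make the sample random.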

Complex Random Sampling Designs

Probability sampling under restricted sampling techniques, as stated above, may result in complex
random sampling designs. Such designs may as well be called ‘mixed sampling designs’ for many of such
designs may represent a combination of probability and non-probability sampling procedures in
selecting a sample. Some of the popular complex random sampling designs are as follows:

a)      Systematic sampling:
         In some instances, the most practical way of sampling is to select every ith item on a list. Sampling of
this type is known as systematic sampling.

         An element of randomness  is introduced into this kind of sampling by using random numbers to pick up
the unit with which to start. For instance, if a 4 per cent sample is desired, the first item would be
selected randomly from the first twenty-five and thereafter every 25th item would automatically be
included in the sample.

         Thus, in systematic sampling only the first unit is selected randomly and the remaining units of the
sample are selected at fixed intervals. Although a systematic sample is not a random sample in the strict
sense of the term, it is often considered reasonable to treat a systematic sample as if it were a
random sample.

         Systematic sampling has certain plus points. It can be taken as an improvement over a simple random
sample in as much as the systematic sample is spread more evenly over the entire population.

         It is an easier and less costly method of sampling and can be conveniently used even in case of large
populations.

         But there are certain dangers too in using this type of sampling. If there is a hidden periodicity in the
population, systematic sampling will prove to be an inefficient method of sampling.

         For instance, suppose every 25th item produced by a certain production process is defective. If we select
a 4% sample of the items of this process in a systematic manner, we will get either all defective items
or all good items in our sample, depending upon the random starting position.

         If all elements of the universe are ordered in a manner representative of the total population, i.e., the
population list is in random order, systematic sampling is considered equivalent to random sampling.

         But if this is not so, then the results of such sampling may, at times, not be very reliable. In practice,
systematic sampling is used when lists of population are available and they are of considerable length.
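The mechanics described above, a random start followed by a fixed skip, can be sketched in a few lines of Python (the function name and the 4 per cent example are illustrative, not part of the text):

```python
import random

def systematic_sample(population, k):
    """Pick a random starting unit among the first k items,
    then take every k-th item thereafter."""
    start = random.randrange(k)      # random start in 0 .. k-1
    return population[start::k]

# A 4 per cent sample means k = 25: from 500 items we get 20.
items = list(range(500))
sample = systematic_sample(items, k=25)
print(len(sample))  # 20
```

Note that only the first draw is random; every later unit is fully determined by the starting position, which is exactly why a hidden periodicity of length k in the list would ruin the sample.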
b)     Stratified sampling:

         If a population from which a sample is to be drawn does not constitute a homogeneous group, stratified
sampling technique is generally applied in order to obtain a representative sample.

         Under stratified sampling the population is divided into several sub-populations that are individually
more homogeneous than the total population (the different sub-populations are called ‘strata’) and then
we select items from each stratum to constitute a sample.

         Since each stratum is more homogeneous than the total population, we are able to get more precise
estimates for each stratum and by estimating more accurately each of the component parts, we get a
better estimate of the whole.

         In brief, stratified sampling results in more reliable and detailed information.
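As a rough sketch of the idea (Python; the proportional-allocation rule and the stratum sizes are assumptions made for illustration), drawing from each stratum in proportion to its size looks like:

```python
import random

def stratified_sample(strata, total_n):
    """Proportional allocation: each stratum contributes items
    in proportion to its share of the whole population."""
    pop_size = sum(len(s) for s in strata)
    sample = []
    for stratum in strata:
        n_i = round(total_n * len(stratum) / pop_size)  # stratum's share
        sample.extend(random.sample(stratum, n_i))
    return sample

# Three strata of sizes 600, 300 and 100; a sample of 100 takes
# 60, 30 and 10 items respectively.
strata = [list(range(600)), list(range(600, 900)), list(range(900, 1000))]
print(len(stratified_sample(strata, total_n=100)))  # 100
```

Because every stratum is represented in its population proportion, the combined estimate pools the more precise within-stratum estimates, which is the gain the text describes.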

c)      Cluster sampling:

         If the total area of interest happens to be a big one, a convenient way in which a sample can be taken is
to divide the area into a number of smaller non-overlapping areas and then to randomly select a
number of these smaller areas (usually called clusters), with the ultimate sample consisting of all (or
samples of) units in these small areas or clusters.

         Thus in cluster sampling the total population is divided into a number of relatively small subdivisions
which are themselves clusters of still smaller units and then some of these clusters are randomly
selected for inclusion in the overall sample.

         Suppose we want to estimate the proportion of machine-parts in an inventory which are defective. Also
assume that there are 20000 machine parts in the inventory at a given point of time, stored in 400 cases
of 50 each. Now, using cluster sampling, we would consider the 400 cases as clusters, randomly
select ‘n’ cases, and examine all the machine-parts in each randomly selected case.
         Cluster sampling, no doubt, reduces cost by concentrating surveys in selected clusters. But certainly it is
less precise than random sampling. There is also not as much information in ‘ n’ observations within a
cluster as there happens to be in ‘ n’ randomly drawn observations. Cluster sampling is used only
because of the economic advantage it possesses; estimates based on cluster samples are usually more
reliable per unit cost.
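The machine-parts illustration can be sketched as follows (Python; the 5% defect rate used to build the synthetic inventory is an assumption, since the text does not state one):

```python
import random

def cluster_defect_rate(cases, n_cases):
    """One-stage cluster sample: randomly choose n_cases cases and
    inspect every part in each chosen case."""
    chosen = random.sample(cases, n_cases)
    parts = [p for case in chosen for p in case]
    return sum(parts) / len(parts)   # fraction of defective parts

# 20000 parts stored in 400 cases of 50 each; 1 = defective, 0 = good.
random.seed(1)
inventory = [[1 if random.random() < 0.05 else 0 for _ in range(50)]
             for _ in range(400)]
print(round(cluster_defect_rate(inventory, n_cases=20), 3))
```

Only the cases are sampled; the parts inside each case come along as a block, which is what makes the design cheap but, as the text notes, less precise than drawing the same number of parts individually.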

d)     Area sampling:

         If clusters happen to be some geographic subdivisions, in that case cluster sampling is better known as
area sampling.

         In other words, cluster designs, where the primary sampling unit represents a cluster of units based on
geographic area, are distinguished as area sampling.

         The plus and minus points of cluster sampling are also applicable to area sampling.

e)      Multi-stage sampling:

         Multi-stage sampling is a further development of the principle of cluster sampling. Suppose we want to
investigate the working efficiency of nationalised banks in India and we want to take a sample of a few
banks for this purpose.

         The first stage is to select large primary sampling units, such as the states in a country. Then we may
select certain districts and interview all banks in the chosen districts. This would represent a two-stage
sampling design with the ultimate sampling units being clusters of districts.

         If, instead of taking a census of all banks within the selected districts, we select certain towns and
interview all banks in the chosen towns, this would represent a three-stage sampling design.

         If instead of taking a census of all banks within the selected towns, we randomly sample banks from
each selected town, then it is a case of using a four-stage sampling plan.
         If we select randomly at all stages, we will have what is known as ‘multi-stage random sampling design’.

         Ordinarily multi-stage sampling is applied in big inquiries extending over a considerably large geographical
area, say, the entire country. There are two advantages of this sampling design, viz.,

         (a) It is easier to administer than most single-stage designs, mainly because the sampling frame
under multi-stage sampling is developed in partial units. (b) A large number of units can be sampled for
a given cost under multi-stage sampling because of sequential clustering, whereas this is not possible in
most of the simple designs.
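A minimal two-stage version of the bank example might look like this (Python; the tiny frame of 5 states, 4 districts per state and 3 banks per district is invented purely for illustration):

```python
import random

def two_stage_sample(states, n_states, n_districts):
    """Stage 1: sample states.  Stage 2: sample districts within each
    chosen state, then take a census of all banks in those districts."""
    banks = []
    for state in random.sample(states, n_states):
        for district in random.sample(state, n_districts):
            banks.extend(district)   # every bank in the district
    return banks

# Hypothetical frame: 5 states, 4 districts each, 3 banks per district.
frame = [[[f"s{s}-d{d}-b{b}" for b in range(3)]
          for d in range(4)] for s in range(5)]
print(len(two_stage_sample(frame, n_states=2, n_districts=2)))  # 12
```

Replacing the final census of banks with a random draw of banks from each town would turn this into the three- or four-stage plan the text describes.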

f)       Sequential sampling:

         This is a somewhat complex sample design.

         The ultimate size of the sample under this technique is not fixed in advance, but is determined
according to mathematical decision rules on the basis of information yielded as survey progresses.

         This is usually adopted in case of acceptance sampling plan in context of statistical quality control.

         When a particular lot is to be accepted or rejected on the basis of a single sample, it is known as single
sampling; when the decision is to be taken on the basis of two samples, it is known as double sampling
and in case the decision rests on the basis of more than two samples but the number of samples is
certain and decided in advance, the sampling is known as multiple sampling.

         But when the number of samples is more than two but it is neither certain nor decided in advance, this
type of system is often referred to as sequential sampling. Thus, in brief, we can say that in sequential
sampling, one can go on taking samples one after another as long as one desires to do so.

Conclusion

From a brief description of the various sample designs presented above, we can say that normally one
should resort to simple random sampling because under it bias is generally eliminated and the sampling
error can be estimated. But purposive sampling is considered more appropriate when the universe
happens to be small and a known characteristic of it is to be studied intensively. There are situations in
real life under which sample designs other than simple random samples may be considered better (say
easier to obtain, cheaper or more informative) and as such the same may be used. In a situation where
random sampling is not possible, we necessarily have to use a sampling design other than random
sampling. At times, several methods of sampling may well be used in the same study.

CRITERIA OF SELECTING A SAMPLING PROCEDURE

In this context one must remember that two costs are involved in a sampling analysis, viz., the cost of
collecting the data and the cost of an incorrect inference resulting from the data. The researcher must
keep in view the two causes of incorrect inferences, viz., systematic bias and sampling error.

A systematic bias results from errors in the sampling procedures, and it cannot be reduced or
eliminated by increasing the sample size. At best the causes responsible for these errors can be
detected and corrected. Usually a systematic bias is the result of one or more of the following factors:

1. Inappropriate sampling frame: If the sampling frame is inappropriate, i.e., a biased representation of
the universe, it will result in a systematic bias.

2. Defective measuring device: If the measuring device is constantly in error, it will result in systematic
bias. In survey work, systematic bias can result if the questionnaire or the interviewer is biased.
Similarly, if the physical measuring device is defective, there will be systematic bias in the data
collected through such a measuring device.

3. Non-respondents: If we are unable to sample all the individuals initially included in the sample, a
systematic bias may arise. The reason is that in such a situation the likelihood of establishing contact
with, or receiving a response from, an individual is often correlated with the measure of what is to be
estimated.

4. Indeterminacy principle: Sometimes we find that individuals act differently when kept under
observation than they do in non-observed situations. For instance, if workers are aware that somebody
is observing them in the course of a work study, on the basis of which the average length of time to
complete a task will be determined and the quota for piece work set accordingly, they generally tend to
work more slowly than they would if unobserved. Thus the indeterminacy principle may also be a cause
of a systematic bias.

5. Natural bias in the reporting of data: Natural bias of respondents in the reporting of data is often the
cause of a systematic bias in many inquiries. There is usually a downward bias in the income data
collected by a government taxation department, whereas we find an upward bias in the income data
collected by some social organisations.

Sampling errors are the random variations in the sample estimates around the true population
parameters. Since they occur randomly and are equally likely to be in either direction, they are
compensatory in nature, and the expected value of such errors is equal to zero. Sampling error
decreases with an increase in the size of the sample, and it is of smaller magnitude in the case of a
homogeneous population.

Sampling error can be measured for a given sample design and size. This measurement is usually called
the ‘precision of the sampling plan’. If we increase the sample size, the precision can be improved. But
increasing the size of the sample has its own limitations: a large sample increases the cost of collecting
data and can also enhance the systematic bias. Thus the effective way to increase precision is usually to
select a better sampling design, one which has a smaller sampling error for a given sample size at a
given cost. In practice, however, people sometimes prefer a less precise design because it is easier to
adopt and because systematic bias can be controlled in a better way in such a design.
INTRODUCTORY BUSINESS STATISTICS
Confidence Intervals

A Confidence Interval for a Population Proportion

During an election year, we see articles in the newspaper that state confidence intervals in terms
of proportions or percentages. For example, a poll for a particular candidate running for president
might show that the candidate has 40% of the vote within three percentage points (if the sample
is large enough). Often, election polls are calculated with 95% confidence, so, the pollsters
would be 95% confident that the true proportion of voters who favored the candidate would be
between 0.37 and 0.43.

Investors in the stock market are interested in the true proportion of stocks that go up and down
each week. Businesses that sell personal computers are interested in the proportion of households
in the United States that own personal computers. Confidence intervals can be calculated for the
true proportion of stocks that go up or down each week and for the true proportion of households
in the United States that own personal computers.

The procedure to find the confidence interval for a population proportion is similar to that for the
population mean, but the formulas are a bit different although conceptually identical. While the
formulas are different, they are based upon the same mathematical foundation given to us by the
Central Limit Theorem. Because of this we will see the same basic format using the same three
pieces of information: the sample value of the parameter in question, the standard deviation of
the relevant sampling distribution, and the number of standard deviations we need to have the
confidence in our estimate that we desire.

How do you know you are dealing with a proportion problem? First, the
underlying distribution has a binary random variable and therefore is a binomial
distribution. (There is no mention of a mean or average.) If X is a binomial random variable,
then X ~ B(n, p) where n is the number of trials and p is the probability of a success. To form a
sample proportion, take X, the random variable for the number of successes and divide it by n,
the number of trials (or the sample size). The random variable P′ (read “P prime”) is the sample
proportion:

P′ = X / n

(Sometimes the random variable is denoted as p̂, read “p hat”.)

p′ = the estimated proportion of successes, or sample proportion of successes (p′ is a point
estimate for p, the true population proportion; similarly, q′ = 1 − p′ is a point estimate for q, the
probability of a failure in any one trial.)

x = the number of successes in the sample

n = the size of the sample


The formula for the confidence interval for a population proportion follows the same format as
that for an estimate of a population mean. Remembering the sampling distribution for the
proportion from Chapter 7, the standard deviation was found to be:

σ_p′ = sqrt( p′q′ / n )

The confidence interval for a population proportion, therefore, becomes:

p′ ± Z_(α/2) · sqrt( p′q′ / n )

where Z_(α/2) is set according to our desired degree of confidence and sqrt( p′q′ / n ) is the standard
deviation of the sampling distribution.

The sample proportions p′ and q′ are estimates of the unknown population


proportions p and q. The estimated proportions p′ and q′ are used because p and q are not
known.

Remember that as p moves further from 0.5 the binomial distribution becomes less symmetrical.
Because we are estimating the binomial with the symmetrical normal distribution the further
away from symmetrical the binomial becomes the less confidence we have in the estimate.

This conclusion can be demonstrated through the following analysis. Proportions are based upon
the binomial probability distribution. The possible outcomes are binary, either “success” or
“failure”. This gives rise to a proportion, meaning the percentage of the outcomes that are
“successes”. It was shown that the binomial distribution could be fully understood if we knew
only the probability of a success in any one trial, called p. The mean and the standard deviation
of the binomial were found to be:

μ = np   and   σ = sqrt( npq )

It was also shown that the binomial could be estimated by the normal distribution if BOTH np
AND nq were greater than 5. From the discussion above, it was found that the standardizing
formula for the binomial distribution is:

Z = ( p′ − p ) / sqrt( pq / n )

which is nothing more than a restatement of the general standardizing formula with appropriate
substitutions for μ and σ from the binomial. We can use the standard normal distribution, the
reason Z is in the equation, because the normal distribution is the limiting distribution of the
binomial. This is another example of the Central Limit Theorem. We have already seen that the
sampling distribution of means is normally distributed. Recall the extended discussion in Chapter
7 concerning the sampling distribution of proportions and the conclusions of the Central Limit
Theorem.
We can now manipulate this formula in just the same way we did for finding the confidence
intervals for a mean, but to find the confidence interval for the binomial population parameter, p:

p′ − Z_(α/2) · sqrt( p′q′ / n )  ≤  p  ≤  p′ + Z_(α/2) · sqrt( p′q′ / n )

Where p′ = x/n, the point estimate of p taken from the sample. Notice that p′ has replaced p in the
formula. This is because we do not know p, indeed, this is just what we are trying to estimate.

Unfortunately, there is no correction factor for cases where the sample size is small, so np′ and
nq′ must always be greater than 5 to develop an interval estimate for p.
Suppose that a market research firm is hired to estimate the percent of adults living in a large city who
have cell phones. Five hundred randomly selected adult residents in this city are surveyed to determine
whether they have cell phones. Of the 500 people sampled, 421 responded yes – they own cell phones.
Using a 95% confidence level, compute a confidence interval estimate for the true proportion of adult
residents of this city who have cell phones.

 The solution step-by-step.

Let X = the number of people in the sample who have cell phones. X is binomial: the random variable is
binary, people either have a cell phone or they do not.

To calculate the confidence interval, we must find p′, q′.

n = 500

x = the number of successes in the sample = 421

p′ = x/n = 421/500 = 0.842 is the sample proportion; this is the point estimate of the population proportion.

q′ = 1 – p′ = 1 – 0.842 = 0.158

Since the requested confidence level is CL = 0.95, then α = 1 – CL = 1 – 0.95 = 0.05, so α/2 = 0.025.

Then Z_(α/2) = Z_0.025 = 1.96.

This Z-value can be found using the standard normal probability table. It can also be found in the
Student’s t table at the 0.025 column and infinity degrees of freedom, because at infinite degrees of
freedom the Student’s t distribution becomes the standard normal distribution, Z.

The confidence interval for the true binomial population proportion is

0.810 ≤ p ≤ 0.874

Interpretation: We estimate with 95% confidence that between 81% and 87.4% of all adult residents of this
city have cell phones.
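The arithmetic of this example can be checked with a short script (Python; the helper name is ours, not part of the text):

```python
from math import sqrt

def proportion_ci(x, n, z):
    """Normal-approximation interval  p' ± z·sqrt(p'q'/n);
    valid only when n·p' and n·q' both exceed 5."""
    p = x / n
    q = 1 - p
    assert n * p > 5 and n * q > 5, "normal approximation not justified"
    ebp = z * sqrt(p * q / n)        # error bound for the proportion
    return p - ebp, p + ebp

low, high = proportion_ci(x=421, n=500, z=1.96)   # 95% confidence
print(f"{low:.3f} to {high:.3f}")  # 0.810 to 0.874
```

The same function with a different z reproduces any of the worked intervals in this section, since only the number of standard deviations changes with the confidence level.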

Explanation of the 95% confidence level: Ninety-five percent of the confidence intervals constructed in this
way would contain the true value for the population proportion of all adult residents of this city who have
cell phones.

Try It

Suppose 250 randomly selected people are surveyed to determine if they own a tablet. Of the 250
surveyed, 98 reported owning a tablet. Using a 95% confidence level, compute a confidence
interval estimate for the true proportion of people who own tablets.
(0.3315, 0.4525)
The Dundee Dog Training School has a larger than average proportion of clients who compete in
competitive professional events. A confidence interval for the population proportion of dogs that compete
in professional events from 150 different training schools is constructed. The lower limit is determined to
be 0.08 and the upper limit is determined to be 0.16. Determine the level of confidence used to construct
the interval of the population proportion of dogs that compete in professional events.
We begin with the formula for a confidence interval for a proportion because the random variable is
binary; either the client competes in professional competitive dog events or they don’t.

Next we find the sample proportion, the midpoint of the interval:

p′ = (0.08 + 0.16) / 2 = 0.12

The ± that makes up the confidence interval is thus 0.04; 0.12 + 0.04 = 0.16 and 0.12 − 0.04 = 0.08, the
boundaries of the confidence interval. Finally, we solve for Z:

0.04 = Z · sqrt( (0.12 × 0.88) / 150 ), therefore Z = 1.51

And then look up the probability for 1.51 standard deviations on the standard normal table:

P(−1.51 ≤ Z ≤ 1.51) = 0.8690, or a confidence level of about 87%.
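These back-solving steps can be reproduced numerically (Python; `erf` supplies the two-tailed normal area, and the function name is ours):

```python
from math import sqrt, erf

def implied_confidence(lower, upper, n):
    """Recover Z and the confidence level from a proportion CI."""
    p = (lower + upper) / 2          # sample proportion = midpoint
    half_width = (upper - lower) / 2
    z = half_width / sqrt(p * (1 - p) / n)
    level = erf(z / sqrt(2))         # P(-z <= Z <= z)
    return z, level

z, level = implied_confidence(0.08, 0.16, 150)
print(f"Z = {z:.2f}, confidence level ~ {level:.2f}")  # Z = 1.51, confidence level ~ 0.87
```

Carrying the unrounded Z through gives a level a shade below the table value obtained from Z = 1.51, which is why hand and machine answers can differ in the third decimal.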
A financial officer for a company wants to estimate the percent of accounts receivable that are more than
30 days overdue. He surveys 500 accounts and finds that 300 are more than 30 days overdue. Compute a
90% confidence interval for the true percent of accounts receivable that are more than 30 days overdue,
and interpret the confidence interval.

 The solution is step-by-step:

x = 300 and n = 500

p′ = x/n = 300/500 = 0.60 and q′ = 1 − p′ = 0.40

Since the confidence level = 0.90, then α = 1 – confidence level = (1 – 0.90) = 0.10, so α/2 = 0.05.

Z_(α/2) = Z_0.05 = 1.645

This Z-value can be found using a standard normal probability table. The student’s t-table can also be
used by entering the table at the 0.05 column and reading at the line for infinite degrees of freedom. The
t-distribution is the normal distribution at infinite degrees of freedom. This is a handy trick to remember
in finding Z-values for commonly used levels of confidence. We use this formula for a confidence
interval for a proportion:

p′ − Z_(α/2) · sqrt( p′q′ / n )  ≤  p  ≤  p′ + Z_(α/2) · sqrt( p′q′ / n )

Substituting in the values from above, we find the confidence interval for the true binomial population
proportion is 0.564 ≤ p ≤ 0.636.
Interpretation

 We estimate with 90% confidence that the true percent of all accounts receivable overdue 30
days is between 56.4% and 63.6%.
 Alternate Wording: We estimate with 90% confidence that between 56.4% and 63.6% of ALL
accounts are overdue 30 days.

Explanation of the 90% confidence level: Ninety percent of all confidence intervals constructed in this way
contain the true value for the population percent of accounts receivable that are overdue 30 days.
Try It

A student polls his school to see if students in the school district are for or against the new
legislation regarding school uniforms. She surveys 600 students and finds that 480 are against
the new legislation.

a. Compute a 90% confidence interval for the true percent of students who are against the new
legislation, and interpret the confidence interval.
(0.7731, 0.8269); We estimate with 90% confidence that the true percent of all students in the
district who are against the new legislation is between 77.31% and 82.69%.
b. In a sample of 300 students, 68% said they own an iPod and a smart phone. Compute a 97%
confidence interval for the true percent of students who own an iPod and a smartphone.
Solution

Sixty-eight percent (68%) of students own an iPod and a smart phone.

Since CL = 0.97, we know α = 1 – 0.97 = 0.03 and α/2 = 0.015.

The area to the left of z_0.015 is 0.015, and the area to the right of z_0.015 is 1 – 0.015 = 0.985.

Using the TI 83, 83+, or 84+ calculator function InvNorm(0.985, 0, 1), z_0.015 = 2.17.

EPB = 2.17 · sqrt( (0.68 × 0.32) / 300 ) = 0.0584

p′ – EPB = 0.68 – 0.0584 = 0.6216

p′ + EPB = 0.68 + 0.0584 = 0.7384

We are 97% confident that the true proportion of all students who own an iPod and a smart
phone is between 0.6216 and 0.7384.


Chapter Review

Some statistical measures, like many survey questions, measure qualitative rather than
quantitative data. In this case, the population parameter being estimated is a proportion. It is
possible to create a confidence interval for the true population proportion following procedures
similar to those used in creating confidence intervals for population means. The formulas are
slightly different, but they follow the same reasoning.

Let p′ represent the sample proportion, x/n, where x represents the number of successes


and n represents the sample size. Let q′ = 1 – p′. Then the confidence interval for a population
proportion is given by the following formula:

p′ − Z_(α/2) · sqrt( p′q′ / n )  ≤  p  ≤  p′ + Z_(α/2) · sqrt( p′q′ / n )

Formula Review

p′ = x/n, where x represents the number of successes in a sample and n represents the sample size.
The variable p′ is the sample proportion and serves as the point estimate for the true population
proportion.

q′ = 1 – p′

The variable p′ has a binomial distribution that can be approximated with the normal distribution
shown here. The confidence interval for the true population proportion is given by the formula:

p′ − Z_(α/2) · sqrt( p′q′ / n )  ≤  p  ≤  p′ + Z_(α/2) · sqrt( p′q′ / n )

n = Z_(α/2)² · p′q′ / e²  provides the number of observations needed to sample to estimate the population
proportion, p, with confidence 1 – α and margin of error e, where e is the acceptable difference
between the actual population proportion and the sample proportion.
Use the following information to answer the next two exercises: Marketing companies are
interested in knowing the population percent of women who make the majority of household
purchasing decisions.
When designing a study to determine this population proportion, what is the minimum number
you would need to survey to be 90% confident that the population proportion is estimated to
within 0.05?
If it were later determined that it was important to be more than 90% confident and a new survey
were commissioned, how would it affect the minimum number you need to survey? Why?
It would increase, because the z-score would increase, raising the numerator and thus the
minimum number needed.
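The minimum-sample-size question above can be worked numerically (Python; p = 0.5 is the standard conservative choice when no prior estimate of the proportion exists):

```python
from math import ceil

def min_sample_size(z, e, p=0.5):
    """n = z^2 * p * (1 - p) / e^2, rounded up to the next whole person."""
    return ceil(z ** 2 * p * (1 - p) / e ** 2)

# 90% confidence (z = 1.645), margin of error e = 0.05:
print(min_sample_size(z=1.645, e=0.05))  # 271
# A higher confidence level raises z and therefore raises n:
print(min_sample_size(z=1.96, e=0.05))   # 385
```

Running the function for several z values makes the direction of the effect concrete: moving from 90% to 95% confidence raises the required sample from 271 to 385.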

Use the following information to answer the next five exercises: Suppose the marketing company
did do a survey. They randomly surveyed 200 households and found that in 120 of them, the
woman made the majority of the purchasing decisions. We are interested in the population
proportion of households where women make the majority of the purchasing decisions.
Identify the following:

a. x = ______
b. n = ______
c. p′ = ______

Define the random variables X and P′ in words.


X is the number of “successes” where the woman makes the majority of the purchasing decisions
for the household. P′ is the percentage of households sampled where the woman makes the
majority of the purchasing decisions for the household.
Which distribution should you use for this problem?
Construct a 95% confidence interval for the population proportion of households where the
women make the majority of the purchasing decisions. State the confidence interval, sketch the
graph, and calculate the error bound.
CI: (0.5321, 0.6679)
EBM: 0.0679
List two difficulties the company might have in obtaining random results, if this survey were
done by email.

Use the following information to answer the next five exercises: Of 1,050 randomly selected
adults, 360 identified themselves as manual laborers, 280 identified themselves as non-manual
wage earners, 250 identified themselves as mid-level managers, and 160 identified themselves as
executives. In the survey, 82% of manual laborers preferred trucks, 62% of non-manual wage
earners preferred trucks, 54% of mid-level managers preferred trucks, and 26% of executives
preferred trucks.
We are interested in finding the 95% confidence interval for the percent of executives who prefer
trucks. Define random variables X and P′ in words.
X is the number of “successes” where an executive prefers a truck. P′ is the percentage of
executives sampled who prefer a truck.
Which distribution should you use for this problem?
Construct a 95% confidence interval. State the confidence interval, sketch the graph, and
calculate the error bound.
CI: (0.19432, 0.33068)

Suppose we want to lower the sampling error. What is one way to accomplish that?
The sampling error given in the survey is ±2%. Explain what the ±2% means.
The sampling error means that the true proportion can be 2% above or below the sample proportion.
Use the following information to answer the next five exercises: A poll of 1,200 voters asked
what the most significant issue was in the upcoming election. Sixty-five percent answered the
economy. We are interested in the population proportion of voters who feel the economy is the
most important.
Define the random variable X in words.
Define the random variable P′ in words.
P′ is the proportion of voters sampled who said the economy is the most important issue in the
upcoming election.
Which distribution should you use for this problem?
Construct a 90% confidence interval, and state the confidence interval and the error bound.
CI: (0.62735, 0.67265)

EBM: 0.02265
What would happen to the confidence interval if the level of confidence were 95%?

Use the following information to answer the next 16 exercises: The Ice Chalet offers dozens of
different beginning ice-skating classes. All of the class names are put into a bucket. The 5 P.M.,
Monday night, ages 8 to 12, beginning ice-skating class was picked. In that class were 64 girls
and 16 boys. Suppose that we are interested in the true proportion of girls, ages 8 to 12, in all
beginning ice-skating classes at the Ice Chalet. Assume that the children in the selected class are
a random sample of the population.
What is being counted?
The number of girls, ages 8 to 12, in the 5 P.M. Monday night beginning ice-skating class.
In words, define the random variable X.
Calculate the following:

a. x = _______
b. n = _______
c. p′ = _______

a. x = 64
b. n = 80
c. p′ = 0.8

State the estimated distribution of X. X~________


Define a new random variable P′. What is p′ estimating?
p
In words, define the random variable P′.
State the estimated distribution of P′. Construct a 92% Confidence Interval for the true
proportion of girls in the ages 8 to 12 beginning ice-skating classes at the Ice Chalet.

(0.72171, 0.87829)
How much area is in both tails (combined)?
How much area is in each tail?
0.04
Calculate the following:

a. lower limit
b. upper limit
c. error bound

The 92% confidence interval is _______.


(0.72; 0.88)
Fill in the blanks on the graph with the areas, upper and lower limits of the confidence interval,
and the sample proportion.

In one complete sentence, explain what the interval means.


With 92% confidence, we estimate the proportion of girls, ages 8 to 12, in a beginning ice-
skating class at the Ice Chalet to be between 72% and 88%.
Using the same p′ and level of confidence, suppose that n were increased to 100. Would the error
bound become larger or smaller? How do you know?
Using the same p′ and n = 80, how would the error bound change if the confidence level were
increased to 98%? Why?
The error bound would increase. Assuming all other variables are kept constant, as the
confidence level increases, the area under the curve corresponding to the confidence level
becomes larger, which creates a wider interval and thus a larger error.
If you decreased the allowable error bound, why would the minimum sample size increase
(keeping the same level of confidence)?

Homework

Insurance companies are interested in knowing the population percent of drivers who always
buckle up before riding in a car.

a. When designing a study to determine this population proportion, what is the minimum number
you would need to survey to be 95% confident that the population proportion is estimated to within
0.03?
b. If it were later determined that it was important to be more than 95% confident and a new
survey was commissioned, how would that affect the minimum number you would need to survey?
Why?

a. 1,068
b. The sample size would need to be increased since the critical value increases as the confidence
level increases.
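Part a can be checked with a quick sketch (standard library only; p′ = 0.5 is the conservative choice since no prior estimate is given):

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(confidence, ebp, p_hat=0.5):
    """Smallest n for which the error bound is at most ebp.
    p_hat = 0.5 maximises p'(1 - p') and is the conservative default."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p_hat * (1 - p_hat) / ebp ** 2)

print(min_sample_size(confidence=0.95, ebp=0.03))   # 1068
```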

Suppose that the insurance companies did do a survey. They randomly surveyed 400 drivers and
found that 320 claimed they always buckle up. We are interested in the population proportion of
drivers who claim they always buckle up.

a.
i. x = __________
ii. n = __________
iii. p′ = __________
b. Define the random variables X and P′, in words.
c. Which distribution should you use for this problem? Explain your choice.
d. Construct a 95% confidence interval for the population proportion who claim they always buckle
up.
i. State the confidence interval.
ii. Sketch the graph.
e. If this survey were done by telephone, list three difficulties the companies might have in
obtaining random results.

According to a recent survey of 1,200 people, 61% feel that the president is doing an acceptable
job. We are interested in the population proportion of people who feel the president is doing an
acceptable job.

a. Define the random variables X and P′ in words.


b. Which distribution should you use for this problem? Explain your choice.
c. Construct a 90% confidence interval for the population proportion of people who feel the
president is doing an acceptable job.
i. State the confidence interval.
ii. Sketch the graph.

a. X = the number of people who feel that the president is doing an acceptable job;

P′ = the proportion of people in a sample who feel that the president is doing an acceptable job.

b.
c.
i. CI: (0.59, 0.63)
ii. Check student’s solution

An article regarding interracial dating and marriage recently appeared in the Washington Post.
Of the 1,709 randomly selected adults, 315 identified themselves as Latinos, 323 identified
themselves as blacks, 254 identified themselves as Asians, and 779 identified themselves as
whites. In this survey, 86% of blacks said that they would welcome a white person into their
families. Among Asians, 77% would welcome a white person into their families, 71% would
welcome a Latino, and 66% would welcome a black person.

a. We are interested in finding the 95% confidence interval for the percent of all black adults who
would welcome a white person into their families. Define the random variables X and P′, in words.
b. Which distribution should you use for this problem? Explain your choice.
c. Construct a 95% confidence interval.
i. State the confidence interval.
ii. Sketch the graph.

Refer to the information in (Figure).

a. Construct three 95% confidence intervals.


i. percent of all Asians who would welcome a white person into their families.
ii. percent of all Asians who would welcome a Latino into their families.
iii. percent of all Asians who would welcome a black person into their families.
b. Even though the three point estimates are different, do any of the confidence intervals overlap?
Which?
c. For any intervals that do overlap, in words, what does this imply about the significance of the
differences in the true proportions?
d. For any intervals that do not overlap, in words, what does this imply about the significance of
the differences in the true proportions?
a.
i. (0.72, 0.82)
ii. (0.65, 0.76)
iii. (0.60, 0.72)
b. Yes, the intervals (0.72, 0.82) and (0.65, 0.76) overlap, and the intervals (0.65, 0.76) and (0.60,
0.72) overlap.
c. We can say that there does not appear to be a significant difference between the proportion of
Asian adults who say that their families would welcome a white person into their families and the
proportion of Asian adults who say that their families would welcome a Latino person into their families.
d. We can say that there is a significant difference between the proportion of Asian adults who say
that their families would welcome a white person into their families and the proportion of Asian adults
who say that their families would welcome a black person into their families.

Stanford University conducted a study of whether running is healthy for men and women over
age 50. During the first eight years of the study, 1.5% of the 451 members of the 50-Plus Fitness
Association died. We are interested in the proportion of people over 50 who ran and died in the
same eight-year period.

a. Define the random variables X and P′ in words.


b. Which distribution should you use for this problem? Explain your choice.
c. Construct a 97% confidence interval for the population proportion of people over 50 who ran
and died in the same eight–year period.
i. State the confidence interval.
ii. Sketch the graph.
d. Explain what a “97% confidence interval” means for this study.

A telephone poll of 1,000 adult Americans was reported in an issue of Time Magazine. One of
the questions asked was “What is the main problem facing the country?” Twenty percent
answered “crime.” We are interested in the population proportion of adult Americans who feel
that crime is the main problem.

a. Define the random variables X and P′ in words.


b. Which distribution should you use for this problem? Explain your choice.
c. Construct a 95% confidence interval for the population proportion of adult Americans who feel
that crime is the main problem.
i. State the confidence interval.
ii. Sketch the graph.
d. Suppose we want to lower the sampling error. What is one way to accomplish that?
e. The sampling error given by Yankelovich Partners, Inc. (which conducted the poll) is ±3%. In one
to three complete sentences, explain what the ±3% represents.
a. X = the number of adult Americans who feel that crime is the main problem; P′ = the proportion
of adult Americans who feel that crime is the main problem
b. Since we are estimating a proportion, given P′ = 0.2 and n = 1000, the distribution we should
use is P′ ~ N(0.2, √((0.2)(0.8)/1000)).
c.
i. CI: (0.18, 0.22)
ii. Check student’s solution.
d. One way to lower the sampling error is to increase the sample size.
e. The stated “± 3%” represents the maximum error bound. This means that those doing the study
are reporting a maximum error of 3%. Thus, they estimate the percentage of adult Americans who feel
that crime is the main problem to be between 18% and 22%.

Refer to (Figure). Another question in the poll was “[How much are] you worried about the
quality of education in our schools?” Sixty-three percent responded “a lot”. We are interested in
the population proportion of adult Americans who are worried a lot about the quality of
education in our schools.

a. Define the random variables X and P′ in words.


b. Which distribution should you use for this problem? Explain your choice.
c. Construct a 95% confidence interval for the population proportion of adult Americans who are
worried a lot about the quality of education in our schools.
i. State the confidence interval.
ii. Sketch the graph.
d. The sampling error given by Yankelovich Partners, Inc. (which conducted the poll) is ±3%. In one
to three complete sentences, explain what the ±3% represents.

Use the following information to answer the next three exercises: According to a Field Poll, 79%
of California adults (actual results are 400 out of 506 surveyed) feel that “education and our
schools” is one of the top issues facing California. We wish to construct a 90% confidence
interval for the true proportion of California adults who feel that education and the schools is one
of the top issues facing California.
A point estimate for the true population proportion is:

a. 0.90
b. 1.27
c. 0.79
d. 400

c
A 90% confidence interval for the population proportion is _______.

a. (0.761, 0.820)
b. (0.125, 0.188)
c. (0.755, 0.826)
d. (0.130, 0.183)

Use the following information to answer the next two exercises: Five hundred and eleven (511)
homes in a certain southern California community are randomly surveyed to determine if they
meet minimal earthquake preparedness recommendations. One hundred seventy-three (173) of
the homes surveyed met the minimum recommendations for earthquake preparedness, and 338
did not.
Find the confidence interval at the 90% Confidence Level for the true population proportion of
southern California community homes meeting at least the minimum recommendations for
earthquake preparedness.

a. (0.2975, 0.3796)
b. (0.6270, 0.6959)
c. (0.3041, 0.3730)
d. (0.6204, 0.7025)

The point estimate for the population proportion of homes that do not meet the minimum
recommendations for earthquake preparedness is ______.

a. 0.6614
b. 0.3386
c. 173
d. 338

a
On May 23, 2013, Gallup reported that of the 1,005 people surveyed, 76% of U.S. workers
believe that they will continue working past retirement age. The confidence level for this study
was reported at 95% with a ±3% margin of error.

a. Determine the estimated proportion from the sample.


b. Determine the sample size.
c. Identify CL and α.
d. Calculate the error bound based on the information provided.
e. Compare the error bound in part d to the margin of error reported by Gallup. Explain any
differences between the values.
f. Create a confidence interval for the results of this study.
g. A reporter is covering the release of this study for a local news station. How should she explain
the confidence interval to her audience?

A national survey of 1,000 adults was conducted on May 13, 2013 by Rasmussen Reports. It
concluded with 95% confidence that 49% to 55% of Americans believe that big-time college
sports programs corrupt the process of higher education.

a. Find the point estimate and the error bound for this confidence interval.
b. Can we (with 95% confidence) conclude that more than half of all American adults believe this?
c. Use the point estimate from part a and n = 1,000 to calculate a 75% confidence interval for the
proportion of American adults that believe that major college sports programs corrupt higher education.
d. Can we (with 75% confidence) conclude that at least half of all American adults believe this?

a. p′ = (0.55 + 0.49)/2 = 0.52; EBP = 0.55 – 0.52 = 0.03


b. No, the confidence interval includes values less than or equal to 0.50. It is possible that less than
half of the population believe this.

c. CL = 0.75, so α = 1 – 0.75 = 0.25 and z_(α/2) = z_(0.125) = 1.150. (The area to the right of this z is
0.125, so the area to the left is 1 – 0.125 = 0.875.) EBP = 1.150 × √((0.52)(0.48)/1000) ≈ 0.018.

(p′ – EBP, p′ + EBP) = (0.52 – 0.018, 0.52 + 0.018) = (0.502, 0.538)


d. Yes. This interval lies entirely above 0.50, so we can conclude that at least half of all
American adults believe that major sports programs corrupt education, but we do so with only
75% confidence.
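The arithmetic in parts a and c can be verified with a short sketch (standard library only):

```python
from math import sqrt
from statistics import NormalDist

lower95, upper95 = 0.49, 0.55
p_hat = (lower95 + upper95) / 2          # point estimate is the interval midpoint
n = 1000

# 75% confidence leaves alpha/2 = 0.125 in each tail.
z75 = NormalDist().inv_cdf(0.875)
ebp75 = z75 * sqrt(p_hat * (1 - p_hat) / n)
print(round(p_hat - ebp75, 3), round(p_hat + ebp75, 3))   # 0.502 0.538
```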

Public Policy Polling recently conducted a survey asking adults across the U.S. about music
preferences. When asked, 80 of the 571 participants admitted that they have illegally
downloaded music.

a. Create a 99% confidence interval for the true proportion of American adults who have illegally
downloaded music.
b. This survey was conducted through automated telephone interviews on May 6 and 7, 2013. The
error bound of the survey compensates for sampling error, or natural variability among samples. List
some factors that could affect the survey’s outcome that are not covered by the margin of error.
c. Without performing any calculations, describe how the confidence interval would change if the
confidence level changed from 99% to 90%.

You plan to conduct a survey on your college campus to learn about the political awareness of
students. You want to estimate the true proportion of college students on your campus who voted
in the 2012 presidential election with 95% confidence and a margin of error no greater than five
percent. How many students must you interview?
Glossary

Binomial Distribution

a discrete random variable (RV) which arises from Bernoulli trials; there are a fixed number, n, of
independent trials. “Independent” means that the result of any trial (for example, trial 1) does
not affect the results of the following trials, and all trials are conducted under the same
conditions. Under these circumstances the binomial RV X is defined as the number of successes
in n trials. The notation is: X~B(n,p). The mean is μ = np and the standard deviation is
σ = √(npq), where q = 1 – p. The probability of exactly x successes in n trials is
P(X = x) = C(n, x) p^x q^(n–x).

Error Bound for a Population Proportion (EBP)

the margin of error; depends on the confidence level, the sample size, and the estimated (from
the sample) proportion of successes.

Random sampling
Random, or probability sampling, gives each member of the target population a known and
equal probability of selection. The two basic procedures are:

1 the lottery method, e.g. picking numbers out of a hat or bag


2 the use of a table of random numbers.

Systematic sampling
Systematic sampling is a modification of random sampling. To arrive at a systematic sample we
simply calculate the desired sampling fraction. For example, if there are 100 distributors of a
particular product in which we are interested and our budget allows us to sample, say, 20 of
them, then we divide 100 by 20 and obtain a sampling interval of 5 (a sampling fraction of 1 in
5). Thereafter we go through our sampling frame selecting every 5th distributor. In the purest
sense this does not give rise to a true random sample, since a systematic arrangement is used in
listing and, once the random starting point is chosen, not every distributor retains a chance of
being selected. However, because there is no conscious control of precisely which distributors
are selected, all but the most pedantic of practitioners would treat a systematic sample as
though it were a true random sample.

Figure 7.2 Systematic sampling as applied to a survey of retailers

Systematic sampling
Population = 100 Food Stores
Sample desired = 20 Food Stores
a. Draw a random number from 1 to 5.
b. Starting at that number, sample every 5th store.
Sample  Numbered Stores
1       1, 6, 11, 16, 21, ..., 96
2       2, 7, 12, 17, 22, ..., 97
3       3, 8, 13, 18, 23, ..., 98
4       4, 9, 14, 19, 24, ..., 99
5       5, 10, 15, 20, 25, ..., 100
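The procedure in Figure 7.2 can be sketched as follows (an illustrative helper using only the standard library):

```python
import random

def systematic_sample(population_size, sample_size, seed=None):
    """Pick a random start in 1..k, then take every k-th unit."""
    k = population_size // sample_size      # sampling interval (1 in k)
    rng = random.Random(seed)
    start = rng.randint(1, k)
    return list(range(start, population_size + 1, k))

stores = systematic_sample(population_size=100, sample_size=20, seed=1)
print(len(stores))   # 20 stores, evenly spaced 5 apart
```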

Stratified samples
Stratification increases precision without increasing sample size. Stratification does not imply
any departure from the principles of randomness; it merely denotes that, before any selection
takes place, the population is divided into a number of strata and random samples are then
taken within each stratum. It is only possible to do this if the distribution of the population with respect to a
particular factor is known, and if it is also known to which stratum each member of the
population belongs. Examples of characteristics which could be used in marketing to stratify a
population include: income, age, sex, race, geographical region, possession of a particular
commodity.

Stratification can also occur after the selection of individuals. For example, if one wanted to
stratify a sample of individuals in a town by age, one could easily obtain figures for the town's
age distribution; but if there is no general population list showing to which age group each
member belongs, prior stratification is not possible. What may have to be done in this case, at
the analysis stage, is to weight the results to correct the proportional representation. Weighting,
however, can easily destroy the assumptions one is able to make when interpreting data
gathered from a random sample, and so stratification prior to selection is advisable. Random
stratified sampling is more precise and more convenient than simple random sampling.

When stratified sampling designs are to be employed, there are 3 key questions which have to
be immediately addressed:

1 The bases of stratification, i.e. what characteristics should be used to subdivide the
universe/population into strata?

2 The number of strata, i.e. how many strata should be constructed and what stratum
boundaries should be used?

3 Sample sizes within strata, i.e. how many observations should be taken in each stratum?

Bases of stratification

Intuitively, it seems clear that the best basis would be the frequency distribution of the principal
variable being studied. For example, in a study of coffee consumption we may believe that
behavioural patterns will vary according to whether a particular respondent drinks a lot of coffee,
only a moderate amount of coffee or drinks coffee very occasionally. Thus we may consider that
to stratify according to "heavy users", "moderate users" and "light users" would provide an
optimum stratification. However, two difficulties may arise in attempting to proceed in this way.
First, there is usually interest in many variables, not just one, and stratification on the basis of
one may not provide the best stratification for the others. Secondly, even if one survey variable
is of primary importance, current data on its frequency distribution are unlikely to be available.
The latter difficulty can, however, be attended to, since it is possible to stratify after the data
have been collected and before the analysis is undertaken. The practical approach is to create
strata on the basis of variables for which information is, or can be made, available and which
are believed to be highly correlated with the principal survey characteristics of interest, e.g.
age, socio-economic group, sex, farm size, firm size, etc.

In general, it is desirable to make up strata in such a way that the sampling units within strata
are as similar as possible. In this way a relatively limited sample within each stratum will provide
a generally precise estimate of the mean of that stratum. Similarly it is important to maximise
differences in stratum means for the key survey variables of interest. This is desirable since
stratification has the effect of removing differences between stratum means from the sampling
error.

Total variance within a population has two types of natural variation: between-strata variance
and within-strata variance. Stratification removes the first of these (the between-strata
variance) from the calculation of the standard error. Suppose, for example, we stratified students in a particular
university by subject speciality - marketing, engineering, chemistry, computer science,
mathematics, history, geography etc. and questioned them about the distinctions between
training and education. Without stratification we would expect two sources of variation:
variation in the views expressed by students within, say, the marketing speciality, and variation
between the views of marketing students as a whole and engineering students as a whole.
Stratification ensures that the variation between strata does not enter into the standard error,
because the sample is drawn taking account of this source of variation.

Number of strata

The next question is that of the number of strata and the construction of stratum boundaries. As
regards number of strata, as many as possible should be used. If each stratum could be made
as homogeneous as possible, its mean could be estimated with high reliability and, in turn, the
population mean could be estimated with high precision. However, some practical problems limit
the desirability of a large number of strata:

1 No stratification scheme will completely "explain" the variability among a set of observations.
Past a certain point, the "residual" or "unexplained" variation will dominate, and little
improvement will be effected by creating more strata.

2 Depending on the costs of stratification, a point may be reached quickly where creation of
additional strata is economically unproductive.

If a single overall estimate is to be made (e.g. the average per capita consumption of coffee) we
would normally use no more than about 6 strata. If estimates are required for population
subgroups (e.g. by region and/or age group), then more strata may be justified.

Sample sizes within strata


Proportional allocation: Once strata have been established, the question becomes, "How big
a sample must be drawn from each?" Consider a situation where a survey of a two-stratum
population is to be carried out:
Stratum Number of Items in Stratum
A 10,000
B 90,000

If the budget is fixed at $3000 and the cost per observation is $6 in each stratum, then the
available total sample size is 500. The most common approach would be to sample the same
proportion of items in each stratum. This is termed proportional allocation. In this example,
the overall sampling fraction is 500/100,000 = 0.5%.
Thus, this method of allocation would result in:

Stratum A (10,000 × 0.5%) = 50


Stratum B (90,000 × 0.5%) = 450
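The allocation above can be computed directly (an illustrative sketch, standard library only):

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Allocate the total sample across strata in proportion to stratum size."""
    population = sum(stratum_sizes)
    return [round(total_sample * size / population) for size in stratum_sizes]

budget, cost_per_observation = 3000, 6
total_sample = budget // cost_per_observation      # 500 affordable observations
print(proportional_allocation([10_000, 90_000], total_sample))   # [50, 450]
```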

The major practical advantage of proportional allocation is that it leads to estimates which are
computationally simple. Where proportional sampling has been employed we do not need to
weight the means of the individual strata when calculating the overall mean. The general
stratified estimator is:

x̄_st = W1·x̄1 + W2·x̄2 + W3·x̄3 + ... + Wk·x̄k

where Wi = Ni/N is the weight of stratum i and x̄i is its sample mean; under proportional
allocation this reduces to the simple overall sample mean.

Optimum allocation: Proportional allocation is advisable when all we know of the strata is their
sizes. In situations where the standard deviations of the strata are known it may be
advantageous to make a disproportionate allocation.

Suppose that, once again, we had stratum A and stratum B, but we know that the individuals
assigned to stratum A were more varied with respect to their opinions than those assigned to
stratum B. Optimum allocation minimises the standard error of the estimated mean by ensuring
that more respondents are assigned to the stratum within which there is greatest variation.
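The text describes optimum allocation qualitatively; the standard formula behind it (Neyman allocation, with n_h proportional to N_h × σ_h) can be sketched as follows. The standard deviations used here are hypothetical, chosen so that stratum A is three times as varied as stratum B:

```python
def neyman_allocation(stratum_sizes, stratum_sds, total_sample):
    """Optimum (Neyman) allocation: the sample size in stratum h is
    proportional to N_h * sigma_h."""
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total_weight = sum(weights)
    return [round(total_sample * w / total_weight) for w in weights]

# Hypothetical spread: stratum A is three times as varied as stratum B.
print(neyman_allocation([10_000, 90_000], [3.0, 1.0], 500))   # [125, 375]
```

Compared with proportional allocation (50 and 450), the more varied stratum A now receives a much larger share of the interviews.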

Quota sampling
Quota sampling is a method of stratified sampling in which the selection within strata is non-
random. Selection is normally left to the discretion of the interviewer and it is this characteristic
which destroys any pretensions towards randomness.

Quota v random sampling

The advantages and disadvantages of quota versus probability samples have been a subject of
controversy for many years. Some practitioners hold the quota sample method to be so
unreliable and prone to bias as to be almost worthless. Others think that although it is clearly
less sound theoretically than probability sampling, it can be used safely in certain
circumstances. Still others believe that with adequate safeguards quota sampling can be made
highly reliable and that the extra cost of probability sampling is not worthwhile.
Generally, statisticians criticise the method for its theoretical weakness while market
researchers defend it for its cheapness and administrative convenience.

Main arguments against: Quota sampling

1 It is not possible to estimate sampling errors with quota sampling because of the absence of
randomness.

Some people argue that sampling errors are so small compared with all the other errors and
biases that enter into a survey that not being able to estimate them is no great disadvantage.
One does not have the security, though, of being able to measure and control these errors.

2 The interviewer may fail to secure a representative sample of respondents in quota sampling.
For example, are those in the over 65 age group spread over all the age range or clustered
around 65 and 66?

3 Social class controls leave a lot to the interviewer's judgement.

4 Strict control of fieldwork is more difficult, i.e. interviewers may place respondents in groups
where cases are needed rather than in those to which they actually belong.

Main arguments for: quota sampling

1 Quota sampling is less costly. A quota interview on average costs only a half to a third as
much as a random interview, but we must remember that precision is lost.

2 It is easy administratively. The labour of random selection is avoided, and so are the
headaches of non-contact and callbacks.

3 If fieldwork has to be done quickly, perhaps to reduce memory errors, quota sampling may be
the only possibility, e.g. to obtain immediate public reaction to some event.

4. Quota sampling is independent of the existence of sampling frames.

Cluster and multistage sampling


Cluster sampling: The process of sampling complete groups or units is called cluster sampling;
situations where there is any sub-sampling within the clusters chosen at the first stage are
covered by the term multistage sampling. For example, suppose that a survey is to be done in a
large town and that the unit of inquiry (i.e. the unit from which data are to be gathered) is the
individual household. Suppose further that the town contains 20,000 households, all of them
listed on convenient records, and that a sample of 200 households is to be selected. One
approach would be to pick the 200 by some random method. However, this would spread the
sample over the whole town, with consequent high fieldwork costs and much inconvenience. (All
the more so if the survey were to be conducted in rural areas, especially in developing countries
where rural areas are sparsely populated and access difficult). One might decide therefore to
concentrate the sample in a few parts of the town and it may be assumed for simplicity that the
town is divided into 400 areas with 50 households in each. A simple course would be to select
say 4 areas at random (i.e. 1 in 100) and include all the households within these areas in our
sample. The overall probability of selection is unchanged, but by selecting clusters of
households, one has materially simplified and made cheaper the fieldwork.
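The town example can be sketched directly (illustrative, standard library only):

```python
import random

def cluster_sample(n_areas, households_per_area, areas_to_pick, seed=None):
    """Single-stage cluster sample: choose whole areas at random and
    include every household in each chosen area."""
    rng = random.Random(seed)
    chosen = rng.sample(range(1, n_areas + 1), areas_to_pick)
    return [(area, hh)
            for area in chosen
            for hh in range(1, households_per_area + 1)]

sample = cluster_sample(n_areas=400, households_per_area=50, areas_to_pick=4, seed=7)
print(len(sample))   # 200 households, drawn from only 4 areas
```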

A large number of small clusters is better, all other things being equal, than a small number of
large clusters. Whether single stage cluster sampling proves to be as statistically efficient as a
simple random sampling depends upon the degree of homogeneity within clusters. If
respondents within clusters are homogeneous with respect to such things as income, socio-
economic class etc., they do not fully represent the population and will, therefore, provide larger
standard errors. On the other hand, the lower cost of cluster sampling often outweighs the
disadvantages of statistical inefficiency. In short, cluster sampling tends to offer greater
reliability for a given cost rather than greater reliability for a given sample size.

Multistage sampling: The population is regarded as being composed of a number of first-stage
or primary sampling units (PSUs), each of which is made up of a number of second-stage units;
second-stage units are sampled within each selected PSU, and the procedure continues down
to the final sampling unit, with the sampling ideally being random at each stage.

The necessity of multistage sampling is easily established. PSU's for national surveys are often
administrative districts, urban districts or parliamentary constituencies. Within the selected PSU
one may go direct to the final sampling units, such as individuals, households or addresses, in
which case we have a two-stage sample. It would be more usual to introduce intermediate
sampling stages, i.e. administrative districts are sub-divided into wards, then polling districts.

Area sampling
Area sampling is basically multistage sampling in which maps, rather than lists or registers,
serve as the sampling frame. This is the main method of sampling in developing countries
where adequate population lists are rare. The area to be covered is divided into a number of
smaller sub-areas from which a sample is selected at random; within these selected areas,
either a complete enumeration is taken or a further sub-sample is drawn.

Figure 7.3 Aerial sampling


A grid, such as that shown above, is drawn and superimposed on a map of the area of concern.
Sampling points are selected on the basis of numbers drawn at random that equate to the
numbered columns and rows of the grid.

If the area is large, it can be subdivided into sub-areas and a grid overlaid on these. Figure 7.4
depicts the procedures involved. As in Figure 7.3, the columns and rows are given numbers.
Then, each square in the grid is allocated numbers to define grid lines. Using random numbers,
sampling points are chosen within each square. Figure 7.4 gives an impression of the pattern of
sampling which emerges.
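The grid procedure can be sketched as follows (illustrative only, standard library; the grid dimensions are assumed):

```python
import random

def grid_sampling_points(n_rows, n_cols, n_points, seed=None):
    """Choose sampling points as random (row, column) cells on a map grid."""
    rng = random.Random(seed)
    cells = [(row, col)
             for row in range(1, n_rows + 1)
             for col in range(1, n_cols + 1)]
    return rng.sample(cells, n_points)

points = grid_sampling_points(n_rows=10, n_cols=10, n_points=5, seed=3)
print(points)   # five distinct (row, column) cells
```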

Figure 7.4 Multistage aerial sampling

Suppose that a survey of agricultural machinery/implement ownership is to be made in a sample


of rural households and that no comprehensive list of such dwellings is available to serve as a
sampling frame. If there is an accurate map of the area we can superimpose vertical and
horizontal lines on it, number these and use them as a reference grid. Using random numbers
points can be placed on the map and data collected from households either on or nearest to
those points. A variation is to divide the area into "parcels" of land. These "parcels" (the
equivalent of city blocks) can be formed using natural boundaries e.g. hills or mountains, canals,
rivers, railways, roads, etc. If sufficient information is known about an area then it is permissible
to construct the "parcels" on the basis of agro-ecosystems.
Alternatively, if the survey is of urban households then clusters of dwellings such as blocks
bounded by streets can be identified. This can serve as a convenient sampling frame. The town
area is then divided into blocks and these blocks are numbered and a random sample of them is
selected. The boundaries of the blocks must be well defined, easily identifiable by field workers
and every dwelling must be clearly located in only one block. Streets, railway lines and rivers
make good boundaries.

Sampling and statistical testing


Research is conducted in order to determine the acceptability (or otherwise) of hypotheses.
Having set up a hypothesis, we collect data which should yield direct information on the
acceptability of that hypothesis. These empirical data need to be organised in such a fashion as
to make them meaningful. To this end, we organise them into frequency distributions and
calculate averages or percentages. But often, these statistics on their own mean very little. The
data we collect often need to be compared, and when comparisons have to be made we must
take into account the fact that our data are collected from a sample of the population and are
subject to sampling and other errors. The remainder of this paper is concerned with the
statistical testing of sample data. One assumption which is made is that the survey results are
based on random probability samples.

The null hypothesis


The first step in evaluating sample results is to set up a null hypothesis (Ho). The null
hypothesis is a hypothesis of no differences. We formulate it for the express purpose of
rejecting it. It is formulated before we collect the data (a priori). For example, we may wish to
know whether a particular promotional campaign has succeeded in increasing awareness
amongst housewives of a certain brand of biscuit. Before the campaign we have a certain
measure of awareness, say x%. After the campaign we obtain another measure of the
awareness, say y%. The null hypothesis in this case would be that "there is no difference
between the proportions aware of the brand, before and after the campaign",

Since we are dealing with sample results, we would expect some differences; and we must try
and establish whether these differences are real (i.e. statistically significant) or whether they are
due to random error or chance.

If the null hypothesis is rejected, then the alternative hypothesis may be accepted. The
alternative hypothesis (H1) is a statement relating to the researchers' original hypothesis. Thus,
in the above example, the alternative hypothesis could either be:

a. H1: There is a difference between the proportions of housewives aware of the brand, before
and after the campaign,

or

b. H1: There is an increase in the proportion of housewives aware of the brand, after the
promotional campaign.

Note that these are clearly two different and distinct hypotheses. Case (a) does not indicate the
direction of change and requires a TWO-TAILED test. Case (b), on the other hand, indicates the
predicted direction of the difference and a one-tailed test is called for. Situations in which a
one-tailed test is used include:

(a) comparing an experimental product with a currently marketed one;

(b) comparing a cheaper product which will be marketed only if it is not inferior to a current
product.

Parametric tests and non-parametric tests

The next step is that of choosing the appropriate statistical test. There are basically two types of
statistical test, parametric and non-parametric. Parametric tests are those which make
assumptions about the nature of the population from which the scores were drawn (i.e.
population values are "parameters", e.g. means and standard deviations). If we assume, for
example, that the distribution of the sample means is normal, then we require to use a
parametric test. Non-parametric tests do not require this type of assumption and relate mainly to
that branch of statistics known as "order statistics". We discard actual numerical values and
focus on the way in which things are ranked or classed. Thereafter, the choice between
alternative types of test is determined by 3 factors: (1) whether we are working with dependent
or independent samples, (2) whether we have more or less than two levels of the independent
variable, and (3) the mathematical properties of the scale which we have used, i.e. ratio,
interval, ordinal or nominal. (These issues are covered extensively in the data analysis course
notes).

We will reject H0, our null hypothesis, if a statistical test yields a value whose associated
probability of occurrence is equal to or less than some small probability, known as the critical
(or significance) level. Common values of this critical level are 0.05 and 0.01. Referring back to our
example, if we had found that the observed difference between the percentage of housewives
aware of the brand from pre- to post-campaign could have arisen with probability 0.10, and if we
had set our significance level in advance at 0.05, then we would accept H0, since 0.10 exceeds
0.05. If, on the other hand, we found the probability of this difference occurring was 0.02, then
we would reject the null hypothesis and accept our alternative hypothesis.

Type I errors and type II errors


The choice of significance level affects the ratio of correct and incorrect conclusions which will
be drawn. Given a significance level there are four alternatives to consider:

Figure 7.5 Type I and type II errors

                  Null hypothesis true                   Null hypothesis false
Accept H0         Correct conclusion                     Incorrect conclusion (type II error)
Reject H0         Incorrect conclusion (type I error)    Correct conclusion

Consider the following example. In a straightforward test of two products, we may decide to
market product A if, and only if, 60% of the population prefer the product. Clearly we can set a
sample size so as to reject the null hypothesis of A = B = 50% at, say, a 5% significance level.
If we get a sample which yields 62% (and there will be 5 chances in 100 that we get a figure
greater than 60%) and the null hypothesis is in fact true, then we make what is known as a type
I error.

If, however, the real population preference is A = 62%, then we shall accept the null hypothesis
A = 50% on nearly half the occasions, as shown in the diagram overleaf. In this situation we
shall be saying "do not market A" when in fact there is a market for A. This is the type II error.
We can of course increase the chance of making a type I error, which will automatically
decrease the chance of making a type II error.

Obviously some sort of compromise is required. This depends on the relative importance of the
two types of error. If it is more important to avoid rejecting a true hypothesis (type I error), a high
confidence coefficient (low value of α) will be used. If it is more important to avoid accepting a
false hypothesis, a low confidence coefficient may be used. An analogy with the legal profession
may help to clarify the matter. Under our system of law, a man is presumed innocent of murder
until proved otherwise. Now, if a jury convicts a man when he is, in fact, innocent, a type I error
will have been made: the jury has rejected the null hypothesis of innocence although it is
actually true. If the jury absolves the man, when he is, in fact, guilty, a type II error will have
been made: the jury has accepted the null hypothesis of innocence when the man is really
guilty. Most people will agree that in this case, a type I error, convicting an innocent man, is the
more serious.

In practice, of course, researchers rarely base their decisions on a single significance test.
Significance tests may be applied to the answers to every question in a survey, but the results
will only be convincing if consistent patterns emerge. For example, we may conduct a product
test to find out consumers' preferences. We do not usually base our conclusions on the results
of one particular question; rather, we ask several, make statistical tests on the key questions and
look for consistent significances. We must remember that when one makes a series of tests,
some of the correct hypotheses will be rejected by chance. For example, if 20 questions were
asked in our "before" and "after" survey and we test each question at the 5% level, then one of
the differences is likely to give significant results, even if there is no real difference in the
population.
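The multiple-testing point above can be made concrete with a short illustrative calculation (ours, not from the text): if every null hypothesis is in fact true, testing 20 questions at the 5% level still produces about one spurious "significant" result on average.

```python
# 20 independent questions, each tested at the 5% level, all nulls true.
alpha, tests = 0.05, 20

# Expected number of false positives across the series of tests.
expected_false_positives = alpha * tests

# Probability that at least one test comes out "significant" by chance.
p_at_least_one = 1 - (1 - alpha) ** tests

print(expected_false_positives)   # 1.0
print(round(p_at_least_one, 2))   # 0.64
```

So even with no real differences anywhere, there is roughly a two-in-three chance of at least one apparently significant result, which is why consistent patterns across questions matter more than any single test.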

No mention is made in these notes of the costs of incorrect decisions. Statistical significance is
not always the only criterion on which to base action. Economic considerations of alternative
actions are often just as important.

These, therefore, are the basic steps in the statistical testing procedure. The majority of tests
are likely to be parametric tests where researchers assume some underlying distribution like the
normal or binomial distribution. Researchers will obtain a result, say a difference between two
means, calculate the standard error of the difference and then ask "How far away from the zero
difference hypothesis is the difference we have found from our samples?"

To enable researchers to answer this question, they convert their actual difference into
"standard errors" by dividing it by its standard deviation, then refer to a chart to ascertain the
probability of such a difference occurring.

Example calculations of sample size


1. Suppose a researcher wishes to measure a population with respect to the percentage of
persons owning a maize sheller. He/she may have a rough idea of the likely percentage, and
wishes the sample to be accurate to within 5% points and to be 95% confident of this accuracy.

2. Consider the standard error of a percentage:

SE(p) = sqrt(pq/n)

Assume that the researcher hazards a guess that the likely percentage of ownership is 30%.

Then,

SE(p) = sqrt(30 × 70/n)

But 2 × SE(p) must equal 5% (the level of accuracy required),

i.e. sqrt(2100/n) = 2.5

i.e. n = 2100/6.25 = 336

It is necessary to take a sample of, say, 340 (rounding up).

Generally, then, for percentages, the sample size may be calculated using:

n = 4pq/e² (where e is the required accuracy in percentage points) for accuracy at the 95% level.
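The sample-size rule for percentages at the 95% level can be checked with a short sketch; the function name and parameters below are ours, not from the text.

```python
import math

def sample_size_for_percentage(p, accuracy, z=2):
    """Sample size so that z standard errors of an estimated
    percentage p stay within `accuracy` percentage points.
    z = 2 approximates the 95% level, as in the text."""
    q = 100 - p
    return (z ** 2) * p * q / accuracy ** 2

# Maize sheller example: guessed ownership 30%, accuracy +/- 5 points.
n = sample_size_for_percentage(30, 5)
print(n)                        # 336.0
print(math.ceil(n / 10) * 10)   # rounded up to 340, as in the text
```

Note how the required size grows with the square of the accuracy: halving the allowed error to 2.5 points would quadruple the sample to 1344.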

Case 1: In a census taken 6 years ago, 60% of farms were found to be selling horticultural
produce direct to urban markets. Recently a sample survey has been carried out on 1000 farms
and found 70% of them were selling their horticultural produce to urban centres direct.

Situation: Population statistics (P = 60%) are known

Question: Has there been a change in 6 years or is the higher percentage (p = 70%) found due
to sampling error?
When the population value is known, we can know the sampling error and we use this error for
the purpose of our statistical test. The standard error of a percentage is always sqrt(pq/n), but in
this case the researcher puts P, the population value, in the formula and uses the size of the
sample, n, to ascertain the standard error of the estimate, p = 70%.

The null hypothesis for this case is: "There is no difference between the sample percentage of
farms selling direct to urban areas and the population percentage of farms found to be selling
direct 6 years ago" (i.e. the sample we have drawn comes from the population on which the
census was carried out and there has been no change in the 6 years).

This must be a 2-tailed test as it could not be assumed that there would either be more or less
farms selling produce direct six years later.

Standard error: S.E.(p) = sqrt(PQ/n), where Q = 100 − P

Statistical test:

t = (p − P)/S.E.(p) = (70 − 60)/sqrt(60 × 40/1000) = 10/1.55

N.B. This has infinite degrees of freedom.

t = 6.45

If reference is made to the table for a two-tailed test with infinite degrees of freedom, it can be
seen that t = 3.29 which shows that there is only a 1/1000 chance of our result (p = 70%) being
due to sampling error, since 6.45 > 3.29. Researchers realise that the probability of this having
occurred because of sampling error must be even smaller than 1/1000. Thus they are able to
say that the probability that the percentage of households selling direct is now 70% is at least
999/1000 and that the null hypothesis is refuted at beyond 1/1000 level of significance. If
researchers claim this, they shall be wrong less than 1 in 1000 times.
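As an arithmetic check on Case 1, the standard error and t value can be recomputed directly (variable names are ours):

```python
import math

# Case 1: one sample percentage tested against a known population value.
P, Q, n = 60, 40, 1000   # census percentage, its complement, sample size
p = 70                   # recent sample percentage

# The population value P goes into the standard error formula.
se = math.sqrt(P * Q / n)   # sqrt(2400/1000) = sqrt(2.4)
t = (p - P) / se

print(round(se, 2))  # 1.55
print(round(t, 2))   # 6.45
```

Since 6.45 comfortably exceeds the two-tailed critical value of 3.29, the conclusion of the text follows.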
Case 2:. Six months ago, it was found from a sample survey that 20% of shoppers in a certain
urban area buy fresh fruit from street vendors rather than established shops or supermarkets. A
second survey, independent of the earlier one, is carried out on 500 respondents and it is found
that 24% of them buy fresh fruit and vegetables regularly from street vendors. Is there any real
difference?

Situation: The two surveys are carried out on different occasions, so the two samples may well
be subject to different amounts of error. Due to this researchers use both estimates of error.

Question: Has the percentage of shoppers buying from street vendors changed?

Null hypothesis: There is no difference in the percentages of housewives buying from street
vendors six months ago and now. This is a 2-tailed test.

Six months ago Now


P1 = 20% P2 = 24%
n1 = 200 n2 = 500

Standard error of (p2 − p1):

Since p1 is independent of p2,

S.E.(p2 − p1) = sqrt(p1q1/n1 + p2q2/n2) = sqrt(20 × 80/200 + 24 × 76/500) = sqrt(8 + 3.65)

= 3.41%

Test of significance:

t = (p2 − p1)/S.E.(p2 − p1) = (24 − 20)/3.41 = 1.17

N.B. This has infinite degrees of freedom.

Since 1.17 < 1.64, the difference is not significant even at the 1/10 (10%) level, so the null
hypothesis is not refuted and researchers do not accept that there is any significant change in
the percentage of women buying fresh fruit and vegetables from street vendors.

Case 3: 54% of rural housewives are found, in a sample of 200, to include fish in their family's
weekly diet. However, in a sample of 100 urban housewives only 33% said that fish was a
regular part of their diet.
Situation: The same commodity is being investigated on the same occasion by sampling two
parts of a population.

Question: Is there any difference between rural and urban housewives in their regular
consumption of fish?

Null hypothesis: There is no difference between the rural and urban groups in their regular
consumption of fish. This is a two-tailed test.

                   Urban          Rural
no. eating fish    c1 = 33        c2 = 108
                   p1 = 33%       p2 = 54%
                   n1 = 100       n2 = 200

Standard error of (p2 − p1):

S.E.(p2 − p1) = sqrt(p̄q̄(1/n1 + 1/n2)),

where p̄ = 100(c1 + c2)/(n1 + n2) = 100 × 141/300 = 47% and q̄ = 100 − p̄ = 53%.

N.B. Researchers take an average value of p, since they believe both the rural and urban
families to be alike and the circumstances of measurement of p1 and p2 are exactly the same.

S.E.(p2 − p1) = sqrt(47 × 53 × (1/100 + 1/200)) = sqrt(37.4)

= 6.1

Significance test:

t = (p2 − p1)/S.E.(p2 − p1) = (54 − 33)/6.1 = 3.44

(N.B. This has infinite degrees of freedom.)

Since 3.44 > 3.29, the two-tailed t-value for the 1/1000 level of significance with infinite degrees
of freedom, the null hypothesis is refuted at beyond the 1/1000 level. Thus the difference in fish
consumption between rural and urban housewives is significant at beyond the 1/1000 level.

Case 4:. 200 housewives are interviewed in June to determine their purchases of a canned fruit
juice. Two months later, after an intensive promotional campaign, they are re-interviewed with
the same object.

Situation: The same sample is interviewed on two different occasions (or assessing two
different products).

Question: Is there any difference in purchases of the product between June and September?

Null hypothesis: There is no difference in purchases of the product between June and
September. (A two-tailed test.)

                   June    September
Purchases %        20      32          Sample size = n = 200

Standard error of (p2 − p1):

S.E.(p2 − p1) = sqrt(p1q1/n + p2q2/n − 2 Cov(p1, p2))

The last term under the square root sign is 2 × the covariance of the two assessments, the term
which takes into consideration how each person behaves both in June and September.

= 3.54

Significance test:

t = (p2 − p1)/S.E.(p2 − p1) = (32 − 20)/3.54 = 3.39

This has infinite degrees of freedom.

Since 3.39 > 3.29 with infinite degrees of freedom, the difference between the June and
September purchases is significant at beyond the 1/1000 or 0.1% level (i.e. the null hypothesis
is refuted at this level).
Confidence intervals for the mean

Sometimes the task is one of estimating a population value from a sample mean, rather than
testing hypotheses. For example, suppose from a sample of 100 farmers it is found that their
average monthly purchases of the insecticide Bugdeath were 10.5 litres. It cannot be assumed
that simply because the sample mean was 10.5 litres this is necessarily a good estimate of the
average purchases of all farmers in the population. Indeed, samples cannot be relied upon to
give exact point estimates like 10.5 litres. Rather, a sample gives a range within which the true
population value is thought to lie. To calculate this range researchers need to know the standard
deviation as well as the mean. The standard deviation is calculated as follows:

Suppose a small sample of, say, 8 farmers is taken and each is asked how much Bugdeath they
buy each month. Their responses appear in Table 7.1 below. Their mean consumption is 10.5
litres per month. In the middle column you will see that researchers have subtracted the mean
from each of the individual values. In the end column these values have been squared and
summed to give the sum of squared deviations.

Table 7.1 Calculating the mean and standard deviation

X (consumption in litres)    X − X̄     (X − X̄)²
5                            −5.5      30.25
8                            −2.5       6.25
8                            −2.5       6.25
11                            0.5       0.25
11                            0.5       0.25
11                            0.5       0.25
14                            3.5      12.25
16                            5.5      30.25
X̄ = 10.5                     Sum of squared deviations = 86.00

To calculate the standard deviation researchers divide the sum of squared deviations by the
sample size and take the square root, i.e.

s = sqrt(86/8) = sqrt(10.75) ≈ 3.28 litres

From the standard deviation researchers must now calculate the standard error if they are to
project from what are sample figures to the population. The standard error is calculated by
dividing the standard deviation by the square root of the sample size, viz:

S.E. = s/sqrt(n) = 3.28/sqrt(8) = 3.28/2.83 ≈ 1.16 litres

Thus the estimate is that the average consumption is 10.5 litres plus or minus 1.16 litres, i.e. it
is estimated that the average for all farmers lies somewhere between 9.34 litres and 11.66 litres.
This is the best estimate that can be given on the basis of such a small sample.

As those who have studied elementary statistics will know, only 68% of the values under a
normal distribution curve lie within ±1 standard error. In other words, researchers can only be
68% sure that the true consumption level is between 9.34 and 11.66 litres. If researchers want
to be 95% sure of a correct prediction then they must multiply their standard error by 1.96.
(Students may have to be reminded that if they look up their statistical tables they will see that
95% of the area under the curve equates to a Z value of 1.96.)

Thus, the calculation becomes:

Confidence interval = 10.5 ± 1.96 × 1.16 (standard error)
                    = 10.5 ± 2.27
                    = 8.23 to 12.77 litres

So, researchers are 95% confident that the true value of farmers' usage of Bugdeath is between
8.23 and 12.77 litres. This example serves to show the mechanics of the confidence interval
calculation and the wide intervals that result from small sample sizes.

Students who have had a basic training in statistics will also know that if they wanted to be 99%
confident then the Z value would be 2.57 rather than 1.96.

Chapter Summary
Two major principles underlie all sample design: the desire to avoid bias in the selection
procedure and to achieve the maximum precision for a given outlay of resources. Sampling bias
arises when selection is consciously or unconsciously influenced by human choice, the
sampling frame inadequately covers the target population or some sections of the population
cannot be found or refuse to co-operate.

Random, or probability, sampling gives each member of the target population a known, non-zero
probability of selection; in simple random sampling these probabilities are equal. Systematic
sampling is a modification of random sampling. To arrive at a systematic sample we simply
calculate the desired sampling fraction and take every nth case.
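The "every nth case" procedure can be sketched as follows (an illustrative helper, not from the text): compute the sampling interval from the frame and sample sizes, pick a random starting point within the first interval, and take every nth case from there.

```python
import random

def systematic_sample(frame, sample_size):
    """Draw a systematic sample: a random start within the first
    interval, then every nth case thereafter."""
    interval = len(frame) // sample_size      # the "n" in "every nth case"
    start = random.randrange(interval)        # random starting point
    return frame[start::interval][:sample_size]

population = list(range(1000))                # a hypothetical sampling frame
sample = systematic_sample(population, 100)
print(len(sample))  # 100
```

The randomness enters only through the starting point; thereafter selection is fixed, which is why systematic sampling can go wrong if the frame has a periodic pattern that matches the interval.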

Stratification increases precision without increasing sample size. There is no departure from the
principles of randomness. It merely denotes that before any selection takes place, the
population is divided into a number of strata, then a random sample is taken within each
stratum. It is only possible to stratify if the distribution of the population with respect to a
particular factor is known, and if it is also known to which stratum each member of the
population belongs. Random stratified sampling is more precise and more convenient than
simple random sampling. Stratification has the effect of removing differences between stratum
means from the sampling error. The best basis would be the frequency distribution of the
principal variable being studied. Some practical problems limit the desirability of a large number
of strata: (1) past a certain point, the "residual" variation will dominate, and little improvement
will be effected by creating more strata (2) a point may be reached where creation of additional
strata is economically unproductive. Sample sizes within strata are determined either on a
proportional allocation or optimum allocation basis.
Quota sampling is a method of stratified sampling in which the selection within strata is non-
random. Therefore, it is not possible to estimate sampling errors. Some argue that sampling
errors are so small compared with all the other errors and biases that not being able to estimate
standard errors is no great disadvantage. The interviewer may fail to secure a representative
sample of respondents in quota sampling, e.g. are those in the over 65 age group spread over
all the age range or clustered around 65 and 66? Social class controls leave a lot to the
interviewer's judgments. Strict control of fieldwork is more difficult, i.e. did interviewers place
respondents in groups where cases are needed rather than in those to which they belong.

A quota interview on average costs only half or a third as much as a random interview; the
labour of random selection is avoided, and so are the headaches of non-contact and call-backs;
and if fieldwork has to be quick, perhaps to reduce memory errors, quota sampling may be the
only possibility. Quota sampling is also independent of the existence of sampling frames.

The process of sampling complete groups or units is called cluster sampling. Where there is
sub-sampling within the clusters chosen at the first stage, the term multistage sampling applies.
The population is regarded as being composed of a number of first stage or primary sampling
units (PSU's) each of them being made up of a number of second stage units in each selected
PSU and so the procedure continues down to the final sampling unit, with the sampling ideally
being random at each stage. Using cluster samples ensures fieldwork is materially simplified
and made cheaper. That is, cluster sampling tends to offer greater reliability for a given cost
rather than greater reliability for a given sample size. With respect to statistical efficiency, larger
numbers of small clusters is better - all other things being equal - than a small number of large
clusters.

Multistage sampling involves first selecting the PSUs, then the final sampling units such as
individuals, households or addresses.

Area sampling is basically multistage sampling in which maps, rather than lists or registers,
serve as the sampling frame. This is the main method of sampling in developing countries
where adequate population lists are rare.

In a random sample of 64 mangoes taken from a large consignment, some were found to
be bad. Deduce that the percentage of bad mangoes in the consignment almost certainly
lies between 31.25 and 68.75, given that the standard error of the proportion of bad
mangoes in the sample is 1/16.

Solution: If p is the proportion of bad mangoes, then SE(p) = sqrt(p(1 − p)/64) = 1/16, so
p(1 − p) = 64/256 = 1/4, which gives p = 1/2, i.e. 50%. Taking "almost certainly" as within
3 standard errors, the percentage lies between 50 − 3 × 6.25 = 31.25 and
50 + 3 × 6.25 = 68.75.
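The mango deduction can be verified numerically (a small sketch, with 3 standard errors standing in for "almost certainly"):

```python
import math

# SE of the sample proportion is 1/16, so p(1 - p)/64 = 1/256,
# giving p(1 - p) = 1/4 and hence p = 1/2.
n = 64
se = 1 / 16
p = 0.5   # the only solution of p(1 - p) = 1/4

# Check that this p reproduces the stated standard error.
assert math.isclose(math.sqrt(p * (1 - p) / n), se)

# "Almost certainly": within 3 standard errors, in percentage terms.
lower = 100 * (p - 3 * se)
upper = 100 * (p + 3 * se)
print(lower, upper)  # 31.25 68.75
```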

A box contains 20 mangoes, out of which 4 are not good. If two mangoes are taken out
without replacement, what is the probability distribution of the number of bad mangoes in
the sample?

There are 20 mangoes, of which 4 are bad and 16 are good. Drawing 2 without replacement,
there are C(4, 2) = 6 possible pairs of two bad mangoes, C(16, 2) = 120 possible pairs of two
good mangoes, and 4 × 16 = 64 pairs of one good and one bad mango. The total number of
possible pairs is C(20, 2) = 190.

6/190 ≈ 0.0316, i.e. about 3.2% of draws give two bad mangoes.

120/190 ≈ 0.6316, i.e. about 63.2% give two good mangoes.

64/190 ≈ 0.3368, i.e. about 33.7% give one good and one bad mango.

So, rounding, two bad is about 3%, two good is about 63%, and one good and one bad is
about 34%.
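The combination counts behind this distribution can be checked with `math.comb`:

```python
from math import comb

# Drawing 2 mangoes from 20 (4 bad, 16 good) without replacement.
total = comb(20, 2)       # 190 possible pairs
both_bad = comb(4, 2)     # 6
both_good = comb(16, 2)   # 120
one_each = 4 * 16         # 64

# The three cases exhaust all possible pairs.
assert both_bad + both_good + one_each == total

print(round(both_bad / total, 3))   # 0.032
print(round(both_good / total, 3))  # 0.632
print(round(one_each / total, 3))   # 0.337
```

This is the hypergeometric distribution for 2 draws from a population of 20 with 4 "successes".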

Community College Students and Gender


According to a 2010 report from the American Council on Education, females make up
57% of the college population in the United States. Students in a statistics class at
Tallahassee Community College want to determine the proportion of female students at
TCC. They select a random sample of 135 TCC students and find that 72 are female,
which is a sample proportion of 72 / 135 ≈ 0.533. So 53.3% of the students in the
sample are female.
What can they conclude about the proportion of females at the college? How confident
can they be in their estimate?
To answer these questions, we need to find a confidence interval.
Checking conditions:
We learned in Linking Probability to Statistical Inference that a confidence interval
comes from a normal model of the sampling distribution. Let’s first make sure that a
normal model is appropriate here. Recall the two conditions for using a normal model
for sample proportions:
 The sample must be random.
 The expected number of successes in the sample, np, and the expected number
of failures, n(1 – p), are both greater than or equal to 10. In symbols, this is np ≥
10 and n(1 − p) ≥ 10. Recall that success doesn’t mean good and failure doesn’t
mean bad. A success is just what we are counting.
When we try to check these conditions, we have a problem. We do not know p, the
population proportion. In fact, p is what we are trying to estimate! So we cannot
determine the expected number of successes and failures. Our solution to this problem
is to adjust these conditions. Advanced theory tells us that if the actual number of
successes and failures in the sample are greater than or equal to 10, then a normal
model is still a good fit.
This sample contains 72 successes (female students) and 63 failures (male students).
Both are greater than 10. We therefore use the normal model for the sampling
distribution.
Finding the margin of error:
We know that a sample proportion is only an estimate for the population proportion. We
do not expect the sample proportion to equal the population proportion, so there is
some error due to random chance. We use the standard deviation of the sample
proportions to describe the amount of error we can expect in random samples. We call
this the standard error.
In Linking Probability to Statistical Inference, we learned that the standard error of the
sample proportion depends on the population proportion and sample size. Here is the
formula for the standard error:
SE = sqrt(p(1 − p)/n)
When we use a normal model for the sampling distribution, 95% of sample proportions
estimate the population proportion within approximately 2 standard errors. So
the margin of error is the following:
margin of error = 2 × sqrt(p(1 − p)/n)
Now let’s calculate the margin of error for the TCC estimate of 53.3%. Notice that we
have the same problem we had earlier. We don’t know p, the population proportion. So
we can’t calculate the margin of error! Our solution to this problem is to estimate the
standard error using the sample proportion in place of p. We call this the estimated
standard error, and the formula is:
estimated SE = sqrt(p̂(1 − p̂)/n)
For this example, the estimated standard error is
sqrt(0.533(1 − 0.533)/135) ≈ 0.043
So the margin of error for the 95% confidence interval is:
2 × sqrt(0.533(1 − 0.533)/135) ≈ 2(0.043) = 0.086
Finding the confidence interval:
We can interpret the margin of error by saying we are 95% confident that the proportion
of all students at TCC who are female is within 0.086 of our sample proportion of 0.533.
We can then write the interval in the following form:
p̂ ± margin of error = 0.533 ± 0.086
When we add and subtract the margin of error from the sample proportion, the
confidence interval is 0.447 to 0.619.
Conclusion:
We are 95% confident that the proportion of all TCC students who are female is
between 0.447 and 0.619. We can also make this statement using percentages. We are
95% confident that the percentage of all TCC students who are female is between
44.7% and 61.9%.
Recall from Linking Probability to Statistical Inference that 95% confidence means this
method captures the population proportion about 95% of the time.
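The TCC interval can be reproduced in a few lines, using the estimated standard error and a margin of error of 2 standard errors, as in the text:

```python
import math

# 72 female students in a random sample of 135 TCC students.
successes, n = 72, 135
p_hat = successes / n                      # sample proportion, about 0.533

# Estimated standard error, using p-hat in place of the unknown p.
se = math.sqrt(p_hat * (1 - p_hat) / n)    # about 0.043
margin = 2 * se                            # about 0.086

print(round(p_hat, 3))                                      # 0.533
print(round(p_hat - margin, 3), round(p_hat + margin, 3))   # 0.447 0.619
```

Using the more precise multiplier 1.96 instead of 2 would narrow the interval only slightly, which is why 2 is an acceptable shortcut here.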

SUBJECT: Research in physical education (Chapter 8)
Noorkalam Sekh, AIM NET/SET in Physical Education, VIP Talent Hub
Researchers must keep in view the two causes of incorrect inferences, viz., systematic bias and
sampling error. A systematic bias results from errors in the sampling procedures, and it cannot
be reduced or eliminated by increasing the sample size. At best the causes responsible for
these errors can be detected and corrected. Usually a systematic bias is the result of one or
more of the following factors:

1. Inappropriate sampling frame: If the sampling frame is inappropriate, i.e., a biased
representation of the universe, it will result in a systematic bias.

2. Defective measuring device: If the measuring device is constantly in error, it will result in
systematic bias. In survey work, systematic bias can result if the questionnaire or the
interviewer is biased. Similarly, if the physical measuring device is defective there will be
systematic bias in the data collected through such a measuring device.

3. Non-respondents: If we are unable to sample all the individuals initially included in the
sample, there may arise a systematic bias. The reason is that in such a situation the likelihood
of establishing contact or receiving a response from an individual is often correlated with the
measure of what is to be estimated.

4. Indeterminacy principle: Sometimes we find that individuals act differently when kept under
observation than they do in non-observed situations. For instance, if workers are aware that
somebody is observing them in the course of a work study, on the basis of which the average
length of time to complete a task will be determined and the quota set for piece work, they
generally tend to work more slowly than they do when unobserved. Thus, the indeterminacy
principle may also be a cause of systematic bias.

5. Natural bias in the reporting of data: Natural bias of respondents in the reporting of data is
often the cause of a systematic bias in many inquiries. There is usually a downward bias in the
income data collected by government taxation departments, whereas we find an upward bias in
the income data collected by some social organisations.

Sampling errors are the random variations in the sample estimates around the true population
parameters. Since they occur randomly and are equally likely to be in either direction, they are
compensatory in nature and their expected value is zero. Sampling error decreases with an
increase in the size of the sample, and it is of smaller magnitude in the case of a homogeneous
population.

Sampling error can be measured for a given sample design and size. The measurement of
sampling error is usually called the 'precision of the sampling plan'. If we increase the sample
size, the precision can be improved. But increasing the size of the sample has its own
limitations, viz., a large sample increases the cost of collecting data and also enhances the
systematic bias. Thus the effective way to increase precision is usually to select a better
sampling design which has a smaller sampling error for a given sample size at a given cost. In
practice, however, people often prefer a less precise design because it is easier to adopt and
because systematic bias can be controlled in a better way in such a design.
