Definitions and Types of Error: STAT 344


DEFINITIONS AND TYPES OF ERROR
STAT 344
OUTLINE
  Definitions
  Types of Surveys
  Components of a Survey

  Types of Error
  Sampling Error
  Non-Sampling Error

  Types of Bias – general and specific

Note: Though we go through the topics of chapter 1
quickly, they are extremely important in practice
and should not be ignored!
A NATURAL PLACE TO BEGIN…
… by defining the class name!
Survey: The collection of information about
characteristics of interest from some or all units
of a population using well defined concepts,
methods and procedures, and the compilation of
such information into a useful summary.

Census: A survey carried out over every element of
the population.

Sample: The collection of information about
characteristics of interest from only a part of the
population.
SOME HISTORY
  Up until the early twentieth century, statistics
consisted mostly of reporting numbers, for
example collecting the total amount of gold
owned by each country. What brought about this
change?
  Thus, until recently, statistics dealt mostly with
census data only.
  In Canada, sampling goes back to the late 1930s
and early 1940s. Statistics Canada (then the
Dominion Bureau of Statistics) recognized in 1943
that survey sampling was an "essential scientific
technique superior to enumeration".
WHY A SAMPLE?
  This course pertains to survey sampling rather
than census surveys, mostly because census
data aren't interesting to analyze.

  What are the pros of a sample over a census?
  Reduced cost
  Timeliness
  Plausibility
  Increased accuracy
SO YOU WANT TO TAKE A SURVEY – PART 1
The first task in planning a survey is to establish
your goals. Suppose you want to study housing
conditions for the poor. This is not a well defined
objective.
  Who are the poor? Is it based on debt? Is it based
on income? Is it based on asset to debt ratio?

  Where is the geographical area of interest? Are
we interested in the poor of the world or those of
BC?

  What are housing conditions? Age of house?
Location? Area per person living in it?
SO YOU WANT TO TAKE A SURVEY – PART 2
  Once the basic objective has been specified, we
can proceed to develop operational definitions.

  Operational definitions indicate who or what is to
be observed and what is measured.

  Once the operational definitions are specified, we
can specify the data requirements and the level
of acceptable error.

  In other words, define the who, the what and the
why, and be specific.
SOME VOCABULARY FOR WHO AND WHAT
  A population is the aggregate of interest.
  It is composed of elements or observational units.

  In general, a variable is a feature of the
population we wish to observe. It can be
quantitative (numerical) or categorical. A
categorical variable in sampling can be called an
attribute.

  A characteristic is a summary measure based
on a variable or attribute.

  The primary goal of sampling is to estimate a
characteristic of the population.
EXAMPLE
BC Hydro wants to find out more about the
households it services. It sends out a
questionnaire to 1000 households, addressed to
the person who pays the bills. On it are 20
questions. One of these questions is: "Do you own
a color television?"

  Here the population would be the 561,432
households it services.
  The elements would be households.

  "Owning a color television" would be an attribute
whose categories are "yes" and "no".
CHARACTERISTICS OF INTEREST –
NOTATION
t: the total value of a given variable in the population.
  For example, the total number of households which own a
color television.
µ: the average value of a variable in the population.
  For example, the average weight of bears in a forest.
  Note: This notation is different from the book and we will
convert to that notation in due time.
p: the proportion of elements in the population falling
into a given category.
  For example, the proportion of male bears.
t’: the number of elements in the population falling into
a given category.
  For example, the number of male bears.
In general we know:
N: the population size (the number of elements).
n: the sample size (the number of elements in the
sample).
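
As a small illustration of this notation, the sketch below computes t, µ, t' and p for a made-up population of five bears. The weights and sexes are hypothetical values chosen only for the example; the variable names simply mirror the symbols above.

```python
# Toy population (hypothetical values): one (weight in kg, sex) pair per bear.
population = [(210, "M"), (145, "F"), (180, "M"), (160, "F"), (150, "F")]

N = len(population)                       # population size
weights = [w for w, _ in population]

t = sum(weights)                          # total of the variable "weight"
mu = t / N                                # average of the variable "weight"
t_prime = sum(1 for _, sex in population if sex == "M")  # number of male bears
p = t_prime / N                           # proportion of male bears

print(N, t, mu, t_prime, p)
```

Note that the four characteristics are tied together by the population size: t = Nµ and t' = Np.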
AND THEN WE WERE LEFT WITH HOW?
  "How" can be applied to many aspects of sampling:
  How will we collect the data?
  What type of sampling methodology will we use?
  What type of measuring instrument will we employ? A
questionnaire?
  How will we analyze the data?

In 1994, Statistics Canada conducted a cross-Canada
survey to determine the toxicological behavior of
Canadians (i.e. what drugs do we consume and how).
They used 30 employees to call over 10,000
households. No list exists of all Canadians
and the coordinates at which they can be reached.
Though the elements of the population are Canadians
over the age of 15, the units they actually sampled
were households.
SAMPLING VOCABULARY
  Sampling Unit: The unit we actually sample.
  Observational Unit: The unit we wish to study.
  Sampling Frame: The list of sampling units.
We need some handle on the population, such as a
list of units, to carry out a survey.
  Sampled Population: The collection of all
possible elements that could be included in the
sample. The population from which the sample
was taken.
  Target Population: The collection of elements
which we wish to study. In an ideal situation, the
sampled population and the target population are
the same.
IMAGE FROM TEXTBOOK
[Figure not reproduced.]
HUMAN RESOURCES
Human resources has begun using a new income
invoice statement and wants to assess its impact
on the employees. Using the list of employee
numbers, they sample 10 people from each of the
5 departments that make up the company. Each
selected employee then fills out a questionnaire.

What is the population?
What is the sample?
What are the units?
What is the sampling frame?
DIFFERENT DOCS
Mike Irvin at Medical Genetics would like to assess the classification
by his psychiatrists. Patients are classified for depression into the
following classes: possible, probable, definite. This is done
according to answers to a questionnaire.
25 cases were randomly pulled from files classified by each
psychiatrist and the classifications were recorded. Among
numerous other queries, he wanted to investigate if the
classification changed according to psychiatrist.

Identify the population (target and sampled).

Identify the units (sampling and observational).
A SAMPLE
  A sample is simply a subset of the population.

  We hope our sample is representative of the
population.

  A perfect sample would be a scaled down version
of the population, mirroring every characteristic
of the whole population.

  Unfortunately there is no way of knowing if our
sample is representative or not.
OBTAINING A REPRESENTATIVE SAMPLE
  While there is no way of knowing whether or not
our sample is representative, we can find ways
of obtaining samples which tend to be representative.
  Often, the information of interest will be
associated with other variables, some of which we
are aware of and others which we don't even
know of.
  E.g. If polling Canadians on environmental issues,
location of the individual is important (e.g. Alberta
vs. Vancouver). But other variables such as education
and age may play a role as well.
MATCHING THE SAMPLE TO THE
POPULATION
  Perfectville (from book) is a city which is an ideal,
scaled down version of the population of interest. It
doesn’t exist.
  It may be tempting to choose a sample such that it
represents as many characteristics of the population
as possible.
  E.g. from a sample of 1000, get 40% from Ontario, 25%
from Quebec, etc.
  Controlling for the distribution of other variables, like
age, income, etc., to match those of the population
can be a lot of work and must be restricted to certain
variables.
  One of the variables we are not aware of may not be
well distributed in the sample, leading to error.
RANDOMIZATION
  When we randomize, we randomly select individuals/
cases/participants from our entire population, like
drawing names from a hat.
  There are two motivations for randomization:
  It tends to give samples that have characteristics which
are comparable to the population, which minimizes the
chance of a non-representative sample.
  It plays an important role in the theory used to make
inference, i.e. it dictates the math that we'll use.
  In other words, we eliminate many potential causes of
bias in our sample and tend to obtain a representative
sample.
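
The first motivation can be checked with a small simulation. The sketch below is only an illustration under assumed numbers: a hypothetical population of 10,000 people, 30% of whom have some trait we never thought to control for, and repeated random samples of 500. It counts how often the sample proportion lands within 5 percentage points of the population value.

```python
import random

rng = random.Random(344)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1 = has the unrecorded trait, 0 = does not (30% have it).
population = [1] * 3_000 + [0] * 7_000
TRUE_P = sum(population) / len(population)

def sample_proportion(n):
    """Draw n units at random without replacement; return the trait proportion."""
    return sum(rng.sample(population, n)) / n

proportions = [sample_proportion(500) for _ in range(1_000)]
share_close = sum(abs(p - TRUE_P) <= 0.05 for p in proportions) / len(proportions)

print(f"{share_close:.1%} of the random samples are within 5 points of {TRUE_P:.0%}")
```

Nothing in the sampling step looks at the trait, yet most samples reproduce it closely. That is the sense in which randomization "tends to" give representative samples without guaranteeing it for any single one.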
THE SOUP ANALOGY
  Imagine we have a large soup we'd like to study
the composition of.

  Matching would be the equivalent of selecting
soup from specific areas of the cauldron.
  E.g. one spoonful each from the center and the edge,
at the bottom, in the middle and at the top.

  Randomization is like mixing the soup
thoroughly and then taking a few spoonfuls.

  Expand analogy.
SAMPLING
  How the sample is selected is key and is a major
topic of this course.
  Two distinctive sampling methods are:
  Sampling with replacement: once an element is
selected it is measured and eligible to be selected
again. In this scheme only one element can be
selected at a time.
  Sampling without replacement: once the element
is selected it cannot be selected again. All elements
may be selected simultaneously here. Throughout
most of this course, we will deal with this type of
sampling and it will be understood that we are
sampling without replacement unless specifically
indicated.
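
The two schemes can be contrasted in a few lines of Python. This is a minimal sketch using the standard library's random module on a hypothetical frame of ten unit labels; the seed and the sample size of 5 are arbitrary choices.

```python
import random

rng = random.Random(1)       # fixed seed, arbitrary
frame = list(range(1, 11))   # hypothetical sampling frame: units labelled 1..10

# With replacement: each draw is made from the full frame, so repeats are possible.
with_replacement = rng.choices(frame, k=5)

# Without replacement: the five selected units are guaranteed to be distinct.
without_replacement = rng.sample(frame, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

With replacement, the same unit can show up more than once, so a sample of size 5 may contain fewer than 5 distinct units. Without replacement, the sample size and the number of distinct units always agree.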
MORE DISTINCTIVE SAMPLING
  Two more distinctive sampling methods are:

  Probability Sampling: Sampling is done using
randomization. Most sampling methods we will
discuss will be of this flavor.

  Non-Probability Sampling: There is no
randomization mechanism used for the sampling.
Sometimes this is the only option – specifically if
there is no access to a sampling frame!
TYPES OF ERROR
THE GOOD AND THE BAD
  Two classes of error can affect characteristic
estimates obtained through sampling:
sampling error and non-sampling error
(bias).

  The presence of sampling error is unavoidable
and acceptable (we have some grasp over it).

  The presence of non-sampling error (bias) is for
the most part avoidable and should be removed
as best as possible (we have no grasp over it).
SAMPLING ERROR
  This is the error due to the luck of the draw. It is
the error that comes from the difference between
a particular sample and the population as a
whole.
  In probability sampling, we can characterize the
random behavior of characteristic estimators
using sampling distributions.
  As a counterexample, consider an interviewer
standing at the corner of the street and
interviewing people who pass by. Not everyone
has a chance of being interviewed, so we cannot
characterize the behavior of the estimator.
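
To make "the luck of the draw" concrete, the following sketch approximates the sampling distribution of a sample mean by simulation. Everything in it is assumed for illustration: a hypothetical population of 5,000 numeric values, simple random sampling without replacement, and 1,000 repeated samples at each sample size.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Hypothetical population of 5,000 values of some numeric variable.
population = [rng.gauss(100, 15) for _ in range(5_000)]
mu = statistics.fmean(population)

def sample_means(n, reps=1_000):
    """Means of `reps` random samples of size n, drawn without replacement."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]

center = {}  # average of the estimates, per sample size
spread = {}  # standard deviation of the estimates, per sample size
for n in (25, 100, 400):
    means = sample_means(n)
    center[n] = statistics.fmean(means)
    spread[n] = statistics.stdev(means)
    print(f"n={n:3d}  average estimate - mu = {center[n] - mu:+.2f}  spread = {spread[n]:.2f}")
```

The estimates scatter around µ, and the scatter shrinks as n grows. This is the "grasp" we have over sampling error: under probability sampling its size can be described and controlled through the sample size.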
TYPES OF BIAS
There are two categories of bias:

1. That caused by a discrepancy between the target
population and the sampled population, called
Selection Bias.
2. That caused by measuring instruments that tend
to differ from the truth in one direction, called
Measurement Bias.

Within each category, there are many
subcategories. We discuss some of them now.
SELECTION BIAS – 1
Many survey designs lead to sampled populations
which do not cover the target populations
properly.
  Convenience Sampling (see previous example)
  Judgment Sampling: Selecting elements
according to what is believed to be representative of
the population.
  Volunteer Sampling: The title says it all. People who
volunteer to respond tend to belong to the extremes.
(see the Hite report example in book)
In general, if the sampled population is a subset
of the target population, we say there is
undercoverage.
SELECTION BIAS – 2
Sometimes Selection Bias arises from sources other
than the design. These are often more difficult to
avoid.
  Non-response: a unit cannot be reached,
refuses to respond, or is incapable of
responding.
  People who refuse to answer often differ
systematically from those who do.
  People's availability may be indirectly related to the
variable we wish to measure.
  Misspecified target population – no explanation
required
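
The effect of non-response can be sketched with a simulation in which availability depends on the variable being measured. The setup is entirely hypothetical: the variable is weekly hours spent away from home, and the assumed response model makes people who are away more often less likely to answer.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical population: weekly hours spent away from home, between 0 and 80.
population = [rng.uniform(0, 80) for _ in range(20_000)]
mu = statistics.fmean(population)

def responds(hours_away):
    """Assumed response model: probability of answering is 1 - hours_away / 100."""
    return rng.random() < 1 - hours_away / 100

selected = rng.sample(population, 2_000)            # the sample as designed
respondents = [y for y in selected if responds(y)]  # the units we actually hear from

mean_selected = statistics.fmean(selected)
mean_respondents = statistics.fmean(respondents)
print(f"population mean:  {mu:.1f}")
print(f"full sample mean: {mean_selected:.1f}")
print(f"respondent mean:  {mean_respondents:.1f}")
```

The full random sample tracks the population mean, but the respondents alone underestimate it, because the people who are hardest to reach are exactly those with large values. Drawing a larger sample does not repair this; it only gives a more precise estimate of the wrong quantity.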
MEASUREMENT BIAS
Measurement Bias isn't as directly related to survey
design. Here are sources of such bias:
  Malfunctioning equipment (e.g. pH measuring tool
consistently off by 1.2).
  The measuring instrument can be a person. Family
members tend to overestimate pain and doctors tend
to underestimate it.
  Ambiguous measurements may be carried out
differently by different people.
  People lie.
  Misunderstood question.
  Misleading question.
  Misleading interviewer (introduces subjectivity).
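
The pH example shows why measurement bias differs from sampling error: taking more readings does not make it go away. A minimal sketch, assuming a true pH of 7.0, the constant offset of 1.2 from the example, and a small amount of random noise (the noise level is an assumption):

```python
import random
import statistics

rng = random.Random(12)  # fixed seed so the sketch is reproducible

TRUE_PH = 7.0  # hypothetical true value
OFFSET = 1.2   # systematic error of the malfunctioning instrument
NOISE_SD = 0.1 # assumed random noise of a single reading

def measure():
    """One reading: truth + constant offset + random noise."""
    return TRUE_PH + OFFSET + rng.gauss(0, NOISE_SD)

error = {}  # average reading minus the truth, per number of readings
for n in (10, 1_000, 100_000):
    readings = [measure() for _ in range(n)]
    error[n] = statistics.fmean(readings) - TRUE_PH
    print(f"n={n:6d}  average reading - truth = {error[n]:.3f}")
```

The random noise averages out as n grows, but the average error settles at the offset rather than at zero.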
GUN OWNERS
A group in support of stricter gun laws wants to collect
more information on gun owners. Among other things,
they’d like to know how many gun owners have used
guns in home defense. Not having a list of all
Canadian gun owners, they selected 10 shooting
ranges from the directory of registered shooting
ranges. For each shooting range, they interviewed 5
members of the range (by showing up and
interviewing people present at the time). The
question asked towards this endeavor was: "Some
people claim using guns for self-defense increases
your chance of being harmed, have you ever used a
gun for home defense?"
Discuss the different sources of bias here. Be sure to
specify what type of bias it is.
THE ART OF MAKING A QUESTIONNAIRE
  This could be a whole other course on its own. We
skim the details.

  A questionnaire is a measurement instrument
and so great care must be taken to avoid
introduction of bias through it.

  It can be extremely difficult to find the question
that will measure what it is you wish to measure.

  A golden rule is the KISS principle.

OTHER USEFUL QUESTIONNAIRE TIPS
  Design your questionnaire to fit the interviewing
medium. You can't show pictures over the phone.

  A good introduction can help reduce the
non-response rate.

  Make questions specific.

  For large questionnaires, questions can
be repeated in different wording to allow
reliability to be tested. For reliability see ICC
(Intraclass Correlation) and Cronbach's alpha.
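
As a rough illustration of the last tip, the sketch below computes Cronbach's alpha with the usual formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The answers are made up: five hypothetical respondents answering three rewordings of the same question on a 1 to 5 scale. The function name is chosen for the example, not taken from any library.

```python
import statistics

def cronbach_alpha(scores):
    """Cronbach's alpha for `scores`: one list of k item scores per respondent."""
    k = len(scores[0])
    items = list(zip(*scores))  # regroup the scores item by item
    item_variances = [statistics.variance(item) for item in items]
    total_variance = statistics.variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical answers: 5 respondents x 3 rewordings of the same question.
answers = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [1, 2, 1],
    [3, 3, 4],
]

alpha = cronbach_alpha(answers)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Values near 1 indicate that the reworded questions are answered consistently; low values suggest that the rewordings are not measuring the same thing, or that respondents are answering carelessly.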
QUESTIONNAIRE RELATED DEFINITIONS
  Loaded Question: A question that begins with a
statement that favors one answer.

  Double-Barreled Question: A question which
asks about two concepts at once.
  E.g. Should Vancouver extend its support for the
homeless and the mentally ill?

  Closed questions are restricted in the options of
answers.
  E.g. True/False, multiple choice, Likert Scale, etc.
