SW Statistics


Social Work Statistics

Melba L. Manapol, PhD, RSW

Foundation and Concepts for
Understanding Statistics
What are statistics and how are they used in social
work practice, programs and policy analysis?
Outline of Presentation
Definition of Statistics and Basic
Underlying Concepts
Levels of Measurement
Rates, ratios and percentages
Measures of Central Tendency
Measures of Dispersion
The Normal Distribution
• Statistics is
the art and
science of
Spiegel, 1978


Process/Input Output Outcomes

Why should we study STATISTICS?

1. Social workers conduct research studies

a knowledge of statistical analyses enhances our
ability to design research studies and to draw
justifiable conclusions
for our study to have credibility, we must
demonstrate that the data we gathered were
collected, analyzed, and interpreted according to
accepted methodological and statistical procedures
Why should we study STATISTICS?
2. Social workers rely on others’ research findings
We must be statistically literate when
reading professional social work literature or
attempting to evaluate the vast amount of
information now available on the internet.
Unless we know whether a statistical
analysis is performed correctly, we cannot
know whether the findings derived from the
study are credible.
Why should we study STATISTICS?

3. Social workers need to evaluate their

practice effectiveness
both single-system designs and program
evaluations rely on statistical analyses
to determine whether an individual
intervention and/or project is
accomplishing its objective.
Uses of Statistics

Summarize the characteristics of a specific

research sample or population
Estimate the characteristics of the population
from which our sample was drawn
Determine if any patterns of relationships found
within a research sample can safely be
generalized back to the population from which
the sample was drawn
Methodological Terms
Data (singular: datum)

refers to the measurements (numbers, scores and so

forth) collected in a research study before they
have been analyzed in any way.
usually consists of measurements taken previously
for some other purpose (secondary data analysis)
data are ‘raw facts and figures’. For example, the
scores on the midterm for 50 students are ‘raw’ data.
data alone do not help decision-makers make decisions;
information does. Calculating statistics can
transform data into information. Humans then
process information to obtain knowledge.
Data Collection Methods

in-depth interviews
content analyses
participant observations
group experiments
historical research
Information
is the interpretation we give to collected data
after we have analyzed them.
Treatment Intervention A is more successful
than Treatment Intervention B in reducing
the substance abuse among research
If the temperature outside this room
as measured by a thermometer is 35
degrees Celsius, the 35 degrees is
datum. The interpretation that it is
very hot is information.
Variable
is a characteristic that
differs in quantity or quality
among units under study
All research focus on
Examples: educational
level, gender, ethnicity
Qualitative variable
A qualitative variable takes on non-
numerical values.
It simply describes which class or
category the observations fall.
Nationality - Filipino, Chinese, American, Hispanic

Religion - Catholic, Protestants, Muslim, Jewish

Occupation - Social Worker, Nurse, Statistician

Skills requirement - Computer literate, College

graduate, 35 years old below
A quantitative variable may take any
value from a given set of values. It
has actual units of measure
Constants
are traits and characteristics that do not differ
in quantity or quality among units under study
Example: In a study of drug addiction among
female adolescents, the constants are
female, addiction to drugs, and
adolescent, which are characteristics
of all the participants.

when different measurements of a

variable can be expressed in words
• gender - female, male
• tenure - permanent, casual, temporary
• civil status - single, married, separated
when different measurements of a variable can
be expressed in numbers that reflect some
quantifiable difference
it reflects more precise measurements than value
• age - 23, 35, 67 (reflects the actual age of
Frequency of Value Categories and Values
the number of times that a given
value category or value occurs within
a group of cases
Value Category Frequency
BS SW 90
MSW 35
PhD 20
Total 145
The Stages of Social Research

1. The problem to be studied is reduced
to a testable hypothesis
2. An appropriate set of instruments is
developed (e.g. questionnaire,
interview schedule, etc.)
3. The data are collected
4. The data are analyzed for their
bearing on the initial hypothesis
5. Results of analysis are interpreted and
communicated to an audience (e.g. lecture,
journal, publications)
Research Hypothesis

is a statement of a relationship between

or among variables.
is often stated in the future tense
because it predicts what will be found
when the data derived from the research
study are collected and analyzed
it expresses what we believe will be
found to be true
Causal relationship
the values of one variable are believed to
actually produce the different values of
the other variable

relationships of variables where there are

identifiable patterns but there is no reason to
believe that one variable directly causes the
measurements of the other variable.
association and correlation
Independent variable
a variable predicted to influence
another variable

Dependent variable
a variable believed to be influenced by the
independent variable
Levels of Measurement
Why is Level of Measurement Important?

• Knowing the level of measurement helps you decide what

statistical analysis is appropriate on the values that were
• Knowing the level of measurement helps you decide how to
interpret the data from that variable.
• Knowing the level of measurement helps you to design and
implement appropriate research designs.
4 Levels of Measurement

1. Categorical (Nominal)
2. Ordinal
3. Interval
4. Ratio
Example : Nominal Scale

Zip Codes

Gender: male, female

race: black or white

phone number
Example: Ordinal Scale

Educational Attainment: 1 College graduate, 2 High school graduate, 3 ... graduate
Quality of Service: 5 Excellent, 4 Very ..., 3 Satisfactory, 2 Needs Improvement, 1 Poor
Ranking: 1st place, 2nd place, 3rd place
Example: Interval Scale

• the existence of a fixed, absolute, and non-arbitrary zero
point constitutes the only difference between interval and
ratio levels of measurement
• since the measurement has an absolute zero and the
difference between numbers is significant thus ratios makes
• the absolute zero point property permits all arithmetic
• values at the ratio level indicate the actual amount of the
property being measured
Example: Ratio Scale

Weights and Measurements
Permitted Operations
• Categorical
• Equality and inequality, no < or >, no + or -
• Ordinal
• = and ≠, < and >, no + or -
• Interval
• = and ≠, < and >, + and -, but no * or /
• Ratio
• = and ≠, < and >, + and -, * and /
Comparative Summary
Nominal: Are you currently employed? Yes/No
Ordinal: What is your employment status?
Employed part-time
Employed-full time (37 - 40 hours)
Employed over 40 hours
Ratio: How many hours a week are you employed?
Discrete and Continuous Variables
• Discrete variables
• can take on only a finite number of values
(e.g. 34 students, 3 days, 5 lectures)
• Continuous variables
• can theoretically take on all numerical values
• assuming that we can use precise measuring
instruments capable of measuring the values with ever
increasing precision, it can take a number of different
values (e.g.height of a person, kilos of rice, liters of
Dichotomous, Binary and Dummy Variables
• Dichotomous variable
• a specific type of discrete variable that only has two value categories
(e.g. gender:male, female; election:win , lose)
1. Binary variable
• a special type of dichotomous variable
• 1 - presence of the variable, 0 - absence of the variable
2. Dummy variable
• transformation used when we want a nominal dichotomous
variable such as gender to be used in performing other statistical
analyses such as in regression analysis where variables need to
be in the interval/ratio level
Categories of Statistical Analyses

1. Number of variables
being analyzed
• Univariate
- examine the
distribution of value
categories (for
nominal or ordinal
level data) and values
(for interval/ratio
level data) for a
single variable
Categories of Statistical Analyses

1. Number of variables being analyzed (cont.)

• Multivariate analyses
- examine the relationship among three or more variables
- e.g. cholesterol level and blood pressure in relation to ounces of
red meat, fish, and dairy products taken per week
Categories of Statistical Analyses
1. Number of variables being analyzed (cont.)
• Bivariate analyses
- examine the relationship between two variables
- e.g. correlation between gender and income
Considerations in the Choice of Data Analysis

1. Level of measurement of the variable to be

2. The unit of analysis
3. The shape of the distribution of a variable,
including the presence of outliers (extreme values)
4. The study design used to produce the data from
populations, probability samples or batches
5. Completeness of the data
Proportions and Percentages,
Ratios and Rates
compares the number of cases in a given category with the total
size of the distribution
the value obtained when the number of observations in the
category is divided by the total number of cases

the sum of proportions is always 1

$$P = \frac{f}{N}$$

P = proportion
f = frequency of the category
N = total number of cases
Proportions: Example
TABLE 1 - Number of Employed Persons by Age Group, Philippines: January 2008

Employed Persons
Age Group Proportion
Total 33,695 1.000
15-24 years 6,520 ?
25-34 years 8,916 ?
35-44 years 7,943 ?
45-54 years 5,851 ?
55-64 years 3,080 ?
65 years and over 1,383 ?
Not reported 1 ?
Source of basic data: National Statistics Office, Labor Force Survey
Proportions: Example

to get the proportion of employed persons aged 15-24
years old:

$$P = \frac{f}{N} = \frac{6{,}520}{33{,}695} = 0.194$$
Proportions: Example
TABLE 1 - Number of Employed Persons by Age Group, Philippines: January 2008

Employed Persons
Age Group Proportion
Total 33,695 1.000
15-24 years 6,520 0.194
25-34 years 8,916 0.265
35-44 years 7,943 0.236
45-54 years 5,851 0.174
55-64 years 3,080 0.091
65 years and over 1,383 0.041
Not reported 1 *
Note: Details may not add up to total due to rounding
* Less than 0.005
Source of basic data: National Statistics Office, Labor Force Survey
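The proportion column of Table 1 can be reproduced with a few lines of Python. This is only a sketch: the counts are copied from the table, and the names `employed`, `TOTAL` and `proportion` are illustrative choices, not part of the original material.

```python
# Employed persons by age group, copied from Table 1 (January 2008).
employed = {
    "15-24 years": 6520,
    "25-34 years": 8916,
    "35-44 years": 7943,
    "45-54 years": 5851,
    "55-64 years": 3080,
    "65 years and over": 1383,
    "Not reported": 1,
}
TOTAL = 33695  # published total; the detail rows add up to 33,694 due to rounding


def proportion(f, n):
    """P = f / N: frequency of one category over the total number of cases."""
    return f / n


# Round to three decimal places, as in the table.
proportions = {group: round(proportion(f, TOTAL), 3) for group, f in employed.items()}

for group, p in proportions.items():
    print(f"{group:<20}{p:.3f}")
```

The "Not reported" row rounds to 0.000, which is why the table marks it with an asterisk (less than 0.005).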

from the Latin words “per” (by means of/for every) and “centum” (hundred)
the frequency of occurrence of a category per 100 cases
useful in computing the distribution or disaggregation of a set of
observations
the sum of percentages should add up to 100 percent
rounding affects the sum of percentages
put a footnote if the sum is not equal to 100

$$\% = \frac{f}{N} \times 100$$

f = frequency of the category

N = total number of cases
Table 1 - Gender of Students Majoring in Engineering in Universities A and B

Engineering Majors
Gender of Students University A University B
f % f %
Male 1,082 80 146 80
Female 270 20 37 20
Total 1,352 100 183 100
Percentages: Example
TABLE 2 - Number and Percentage Distribution of Unemployed Persons by Age Group,
Philippines: January 2008
Unemployed Persons
Age Group Percentage (%)
Total 2,675 100.0
15-24 years 1,328 ?
25-34 years 796 ?
35-44 years 274 ?
45-54 years 175 ?
55-64 years 85 ?
65 years and over 17 ?
Source of basic data: National Statistics Office, Labor Force Survey
Percentages: Example
to get the percentage of unemployed persons 15 to 24
years old:

$$\% = \frac{f}{N} \times 100 = \frac{1{,}328}{2{,}675} \times 100 = 49.6$$

Percentages: Example
TABLE 3 - Number and Percentage Distribution of Unemployed Persons by Age Group,
Philippines: January 2008
Unemployed Persons
Age Group Percentage (%)
Total 2,675 100.0
15-24 years 1,328 49.6
25-34 years 796 29.8
35-44 years 274 10.2
45-54 years 175 6.5
55-64 years 85 3.2
65 years and over 17 0.6

Source of basic data: National Statistics Office, Labor Force Survey
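A short Python sketch of the same percentage computation, which also shows how rounding can keep the column from summing to exactly 100. The counts come from the table; the variable names are illustrative.

```python
# Unemployed persons by age group, copied from the table above (January 2008).
unemployed = {
    "15-24 years": 1328,
    "25-34 years": 796,
    "35-44 years": 274,
    "45-54 years": 175,
    "55-64 years": 85,
    "65 years and over": 17,
}
total = sum(unemployed.values())

# Percentage = (f / N) * 100, rounded to one decimal place as in the table.
percentages = {group: round(100 * f / total, 1) for group, f in unemployed.items()}

# Sum of the rounded percentages; this is what a footnote would explain.
rounded_sum = round(sum(percentages.values()), 1)

for group, pct in percentages.items():
    print(f"{group:<20}{pct:5.1f}")
print(f"{'Sum of rounded %':<20}{rounded_sum:5.1f}")
```

Here the rounded percentages add up to 99.9 rather than 100.0, which is the situation the footnote rule is meant for.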


directly compares the number of cases falling into one

category with the number of cases falling into another
divide the first number or quantity by the second
colon (:) is used to indicate ratio and is read as “is to”
example: 3:4 is read as “3 is to 4”

$$\text{Ratio} = \frac{f_1}{f_2}$$

f1 = frequency of one category

f2 = frequency of another category
to express the ratio in terms of a denominator of 10, multiply
each quantity in the ratio by 10

• 2:1 → 2×10 : 1×10 → 20:10

to express the ratio in terms of a denominator of 100, multiply

each quantity in the ratio by 100, and so on

• 2:1 → 2×100 : 1×100 → 200:100

sex ratio = ratio of males to females
males = 150
females = 50

For every one female, there are three males
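The sex-ratio arithmetic can be sketched in Python as follows. The helper names `simplify_ratio` and `per_100` are hypothetical, introduced only for this illustration.

```python
from math import gcd


def simplify_ratio(f1, f2):
    """Reduce the ratio f1:f2 to lowest terms, e.g. 150:50 becomes 3:1."""
    d = gcd(f1, f2)
    return f1 // d, f2 // d


def per_100(f1, f2):
    """Express f1:f2 as 'number in the first category per 100 in the second'."""
    return f1 / f2 * 100


males, females = 150, 50
a, b = simplify_ratio(males, females)
print(f"sex ratio = {a}:{b}")  # males to females
print(f"{per_100(males, females):.0f} males per 100 females")
```

With 150 males and 50 females this gives 3:1, or 300 males per 100 females.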
sex ratio = ratio of males to females
males = 1,350
females = 80

If multiplied by 100,
Measures of Central Tendency
Measures of Central Tendency…

– a measure of the center of a set of data.

- is regarded as the most representative value of the given

data because it is determined at the point where the
concentration of values is greatest.

- most common measures of central tendency are the mean,

median and the mode.

• commonly referred to as the average or arithmetic mean.

• it is the sum of all the values in a distribution divided by the
total number of values
• can be computed for interval level or ratio level variable but
not with the nominal and ordinal levels
• uses all the values within a data set in its computation

$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{\text{Sum of all values in the data set}}{\text{Total number of observations}}$$
MEAN – Ungrouped Data

Name Age Name Age Name Age

Chuck 21 Leon 27 Shanti 37
Tony 25 Kathy 31 Rosemarie 37
Brad 26 Vincent 31 Clarisse 37
Herb 26 Raquel 31 Marguerite 49
Karen 26 Mario 31 Elwin 49
Rosina 27 Rashad 32 David 69
Peter 27 Antoinette 32

$$\bar{X} = \frac{\sum_{i=1}^{20} X_i}{20} = \frac{671}{20} = 33.55 \approx 34 \text{ years old}$$
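The ungrouped mean can be checked with a minimal Python sketch. The ages are copied from the table above; the variable names are illustrative.

```python
from statistics import mean

# Ages of the 20 people listed in the table above.
ages = [21, 25, 26, 26, 26, 27, 27, 27, 31, 31,
        31, 31, 32, 32, 37, 37, 37, 49, 49, 69]

total = sum(ages)   # sum of all values in the data set
n = len(ages)       # total number of observations
x_bar = total / n   # arithmetic mean

print(total, n, x_bar)
print(mean(ages))   # the standard library gives the same value
```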
MEAN – Grouped Data
$$\bar{X} = \frac{\sum (\text{frequency} \times \text{midpoint})}{n}$$

Limits f midpoints

20-24 1 22
25-29 7 27
30-34 6 32
35-39 3 37
40-44 0 42
45-49 2 47
50-54 0 52
55-59 0 57
60-64 0 62
65-70 1 67
total 20
Limits f Midpt f*midpoint
20-24 1 22 22
25-29 7 27 189
30-34 6 32 192
35-39 3 37 111
40-44 0 42 0
45-49 2 47 94
50-54 0 52 0
55-59 0 57 0
60-64 0 62 0
65-70 1 67 67
n = 20 Total = 675

$$\bar{X} = \frac{\sum (\text{frequency} \times \text{midpoint})}{n} = \frac{675}{20} = 33.75 \approx 34$$
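A sketch of the grouped-data mean in Python. The midpoints are taken from the table as printed rather than recomputed from the class limits, and the variable names are illustrative.

```python
# (midpoint, frequency) pairs copied from the grouped frequency table above.
classes = [
    (22, 1), (27, 7), (32, 6), (37, 3), (42, 0),
    (47, 2), (52, 0), (57, 0), (62, 0), (67, 1),
]

n = sum(f for _, f in classes)                          # total number of cases
weighted_total = sum(mid * f for mid, f in classes)     # sum of frequency * midpoint
grouped_mean = weighted_total / n

print(n, weighted_total, grouped_mean)
```

The grouped mean (33.75) differs slightly from the ungrouped mean (33.55) because every score in a class is treated as if it sat at the class midpoint.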
Advantages of the MEAN:

• Takes into account all observations.

• Can be used for further statistical calculations and
mathematical manipulation.
Disadvantages of the MEAN:
• Mean is easily affected by extreme values.
• Mean cannot be computed if there are missing values due to
omission or non-response.
• In grouped data with open-ended class intervals, the
mean cannot be computed. It is dependent on all
observed values.

• The middle score of a ranked distribution of raw scores

• It is the value where half the scores fall above
and half the scores fall below.
MEDIAN – Ungrouped Data
For an even number of scores, the median is the average of the two
middle scores. For small distributions the calculation is fairly easy as is
shown below for this small set of raw data:
10 23 2 34 17 5 3 12 43 25 44 17 7 8

The first step is to rank order all of the raw scores:

2 3 5 7 8 10 12 17 17 23 25 34 43 44

Then count the total number of scores (in this case n = 14) and divide
by two (7). Now count in by the number you just calculated from both
ends of your distribution until you find the middle score or scores.

2 3 5 7 8 10 12 17 17 23 25 34 43 44

The median falls between 12 and 17. So add the two together and
divide by 2 to find the actual median: (12 + 17) / 2 = 14.5
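The ranking-and-counting procedure above can be sketched in Python for this even-sized data set. The variable names are illustrative.

```python
from statistics import median

scores = [10, 23, 2, 34, 17, 5, 3, 12, 43, 25, 44, 17, 7, 8]

ranked = sorted(scores)   # step 1: rank order the raw scores
n = len(ranked)           # step 2: count the scores (14)
half = n // 2             # 14 / 2 = 7

# For an even n, the two middle scores are the 7th and 8th in rank order.
middle_pair = (ranked[half - 1], ranked[half])
manual_median = sum(middle_pair) / 2

print(middle_pair, manual_median)
print(median(scores))     # the standard library agrees
```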
Advantages of the MEDIAN:

• Not affected by extreme values.

• It can be computed even for grouped data with open-ended
class intervals.

Disadvantages of the MEDIAN:

• The median cannot be combined with other

distributions with similar variates to obtain an overall

• the mode is the value within a data set that occurs most frequently

Although you do not have to rank order these raw scores to determine the
10 23 2 34 17 5 3 12 43 25 44 17 7 8

Rank ordering makes the mode stand out:

2 3 5 7 8 10 12 17 17 23 25 34 43 44

There is only one raw score that appears twice. Therefore, the mode of this
raw score distribution is 17. It is the most frequently occurring score.
If a distribution has one mode it is said to be unimodal. If it has two modes it is
bimodal; if it has more than two modes it is said to be multimodal. It is
also possible for a distribution to not have any mode.
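A brief Python sketch of finding the mode by counting. It uses `statistics.multimode`, which returns every value tied for the highest frequency, so it also covers the bimodal case; the second data set is made up for illustration.

```python
from collections import Counter
from statistics import multimode

scores = [10, 23, 2, 34, 17, 5, 3, 12, 43, 25, 44, 17, 7, 8]

counts = Counter(scores)     # frequency of each raw score
modes = multimode(scores)    # all values that share the highest frequency
print(counts[17], modes)

# A hypothetical bimodal data set: two values tie for most frequent.
print(multimode([1, 1, 2, 2, 3]))
```

Note that when every score occurs exactly once, `multimode` returns all of the scores; in the terms used here, that distribution has no mode.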
MODE – Grouped Data
For a Group Frequency Distribution, the mode is the midpoint of the interval
with the highest frequency. The Frequency Distribution table below has its mode

Apparent Limits Frequency Midpoints

81-90 5 85.5
71-80 3 75.5
61-70 12 65.5
51-60 16 55.5
41-50 33 45.5
31-40 21 35.5
21-30 15 25.5
11-20 7 15.5
Total 112
The mode of this Grouped Frequency Distribution is 45.5.
Advantage of the MODE:

• Mode, unlike the mean and the median where some

calculations are required, can be easily identified
through ocular inspection.
Disadvantages of the MODE:

• The mode does not possess the desired algebraic

property of the mean that allows further manipulation.
• Like the median, all the raw data of the different
distributions have to be merged to obtain a new mode,
whether grouped or ungrouped data are involved.
Considerations to be made when using the three most
common measures of central tendency:

Mean
Measurement scale: Interval or Ratio
Shape of distribution: Normal
Other considerations:
• When further statistical calculations or mathematical manipulations are needed
• When all observations are considered in the computation

Median
Measurement scale: Ordinal
Shape of distribution: Skewed
Other considerations:
• When distribution has open-ended class intervals

Mode
Measurement scale: Nominal
Shape of distribution: Skewed
Other considerations:
• When interested in the most frequently occurring observation
Skew refers to the general shape of a
distribution when it is graphed. There are
three basic types of skewing that can occur
with any data set
A Normal Distribution or Bell-Shaped Curve is said to have No Skew.
• The distribution is symmetrical.
• The mean, median, and mode are all the same.
• It is best to use the mean as the measure of central tendency.
A Positively Skewed distribution of data is not symmetrical.
• The tail of the distribution goes toward the positive end of the curve.
• The mean, median, and mode are not the same.
• It is best to use the median or the mode as the measure of central tendency.
A Negatively Skewed distribution of data is also not symmetrical.
• The tail of the distribution goes toward the negative end of the curve.
• The mean, median, and mode are not the same.
• It is best to use the median or the mode as the measure of central tendency.
Measures of Dispersion
• An index of how the scores in a data set are
scattered around the center of the distribution (An
indicator of how widely scores are dispersed
around the measure of central tendency)

• A measure of what is commonly referred to as

variability (also known as spread or width)

• Supplements an average or a measure of central tendency.

• Compares one group of data with another.

• Indicates how representative the average is.

5 Measures of Variability

1. Range
2. Interquartile Range
3. Mean deviation
4. Variance
5. Standard deviation
The Standard Deviation

•Gives us a measure of dispersion relative to the mean

•Is sensitive to each score in the distribution
•Like the mean, the standard deviation is stable with regard to sampling
•Both the mean and standard deviation can be manipulated algebraically,
which allows them to be used in inferential statistics calculations
The Standard Deviation
•The standard deviation is small when the majority of scores are around
the mean
•The standard deviation INCREASES as more scores lie further from the
mean score
•Similar to the mean, only appropriate for interval and ratio level scores
•The standard deviation has the same units as the original data
•The standard deviation is the most stable measure of variability (varies
least from sample to sample)
Obtaining the Variance and Standard Deviation from a Simple Frequency Distribution

The following are scores representing the age at first
marriage of a sample of 25 adults:

18 20 21 24 27
18 20 22 25 27
19 20 22 26 29
19 20 23 26 30
19 21 23 26 31

This data can be rearranged as a simple frequency distribution table:

X f
31 1
30 1
29 1
28 0
27 2
26 3
25 1
24 1
23 2
22 2
21 2
20 4
19 3
18 2
Formula for Standard Deviation

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$$

s = the standard deviation

$\sum (X - \bar{X})^2$ = the sum of the squared deviations from the mean

N = total number of scores
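As a sketch, the variance and standard deviation of the 25 ages at first marriage can be computed in Python as shown below. This assumes the formula divides by N (the total number of scores), which corresponds to `statistics.pstdev`. A sample-based estimate would instead divide by N - 1 (`statistics.stdev`). The variable names are illustrative.

```python
from math import sqrt
from statistics import pstdev

# The 25 raw scores (age at first marriage) listed above.
ages_at_marriage = [
    18, 20, 21, 24, 27,
    18, 20, 22, 25, 27,
    19, 20, 22, 26, 29,
    19, 20, 23, 26, 30,
    19, 21, 23, 26, 31,
]

n = len(ages_at_marriage)
x_bar = sum(ages_at_marriage) / n

# Sum of the squared deviations from the mean.
ss = sum((x - x_bar) ** 2 for x in ages_at_marriage)

variance = ss / n      # divides by N, as assumed above
s = sqrt(variance)     # standard deviation, in the same units as the data (years)

print(f"mean = {x_bar:.2f}, variance = {variance:.2f}, s = {s:.2f}")
```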

Reporting Measures of Central Tendency and Variability

Level of Measurement          Central Tendency        Variability
Nominal                       Mode                    # of value categories
Ordinal                       Mode                    Range
Interval/ratio (no outliers)  Mean                    Standard Deviation, Variance
Interval/ratio (outliers)     Median, trimmed mean    Interquartile range

Five Number Summary

1. Minimum value
2. 25th percentile (Q1)
3. 50th percentile (Q2, median)
4. 75th percentile (Q3)
5. Maximum value
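A five-number summary can be sketched in Python with `statistics.quantiles`. The data reuse the 25 ages at first marriage from the standard deviation example. Quartile conventions differ between textbooks and software, so other data sets may give slightly different Q1 and Q3 values depending on the method used.

```python
from statistics import quantiles

# Ages at first marriage of 25 adults (same data as the standard deviation example).
data = [
    18, 20, 21, 24, 27,
    18, 20, 22, 25, 27,
    19, 20, 22, 26, 29,
    19, 20, 23, 26, 30,
    19, 21, 23, 26, 31,
]

# n=4 cuts the ranked data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4)

summary = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1   # interquartile range

print(summary, iqr)
```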
Interpreting Distributions

School A School B School C

Mean 50 60 70
S.d. 10 10 10
Interpreting Distributions

School A School B School C

Mean 50 50 50
S.d. 10 13 16
Interpreting Distributions

National Mean School A School B

Mean 55 60 40
S.d. 10 15 15

The Normal Distribution
Normal Distribution
• Normal Distribution – A bell-shaped and symmetrical theoretical
distribution, with the mean, the median, and the mode all coinciding at
its peak and with frequencies gradually decreasing at both ends of the

• The normal distribution is a theoretical ideal distribution.

Real-life empirical distributions never match this model
perfectly. However, many things in life do approximate
the normal distribution, and are said to be “normally
distributed.”
Scores “Normally Distributed?”
Table 10.1 Exam Grades in Statistics of 1,200 Students
Midpoint Score   Frequency Bar Chart        Freq   Cum. Freq (below)   %       Cum. % (below)
40               *                          4      4                   0.33    0.33
50               *******                    78     82                  6.5     6.83
60               ***************            275    357                 22.92   29.75
70               ***********************    483    840                 40.25   70
80               ***************            274    1114                22.83   92.83
90               *******                    81     1195                6.75    99.58
100              *                          5      1200                0.42    100

• Is this distribution normal?

• There are two things to initially examine: (1) look at the shape
illustrated by the bar chart, and (2) calculate the mean, median,
and mode.
• The Mean = 70.07
• The Median = 70
• The Mode = 70

• Since all three are essentially equal, and this is reflected in the bar
graph, we can assume that these data are normally distributed.
• Also, since the median is approximately equal to the mean, we
know that the distribution is symmetrical.
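The three values quoted above can be recomputed from the midpoints and frequencies in Table 10.1. The sketch below treats every score as sitting at its class midpoint, which is an assumption of working with grouped data, and the variable names are illustrative.

```python
# Midpoint scores and frequencies copied from Table 10.1.
midpoints = [40, 50, 60, 70, 80, 90, 100]
freqs = [4, 78, 275, 483, 274, 81, 5]

n = sum(freqs)

# Mean: sum of (midpoint * frequency) divided by the number of students.
mean_score = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Mode: the midpoint with the highest frequency.
mode_score = midpoints[freqs.index(max(freqs))]

# Median: the first midpoint whose cumulative frequency reaches half of n.
cumulative = 0
median_score = None
for m, f in zip(midpoints, freqs):
    cumulative += f
    if cumulative >= n / 2:
        median_score = m
        break

print(n, round(mean_score, 2), median_score, mode_score)
```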
The Shape of a Normal Distribution: The Normal Curve
The Shape of a Normal Distribution

Notice the shape of the normal curve in this graph. Some normal distributions are tall
and thin, while others are short and wide. All normal distributions, though, are wider
in the middle and symmetrical.
Different Shapes of the Normal Distribution

Notice that the standard deviation changes the relative width of

the distribution; the larger the standard deviation, the wider the
Areas Under the Normal Curve by Measuring Standard Deviations
Correlation
• A measure of association between
• Two ordinal variables
• An ordinal and an interval/ratio variable
• Two ratio variables
• Correlation analysis examines, if one variable changes by a certain
amount, how much and in which direction the other variable would change
Example: Scatter Plot
• Students who get higher scores in the Chemistry
class also get higher scores in the Biology class
Positive Correlation
• When scores of two variables move together in the same
direction, we say that these variables are positively (or
directly) correlated
• There is a positive correlation between the chemistry final
score and the biology final score
• When two variables are positively correlated, the scatter
plot shows a trend line that runs from lower-left to upper-right
Example: Scatter Plot
• Students who get higher scores in the Chemistry
class get lower scores in the Art class
Negative Correlation
• When scores of two variables move in opposite
directions, we say that these variables are negatively (or
inversely) correlated
• There is a negative correlation between the chemistry
final score and the art final score
• When two variables are negatively correlated, the
scatter plot shows a trend line that runs from upper-left
to lower-right
Example: Test Score
• Score in Chemistry and Score in English are not correlated
No Correlation
• When the change in one variable does not
affect the change in another variable, we
say these variables have no correlation
• There is no correlation between the
chemistry final score and the English final score
• When two variables have no correlation,
the scatter plot shows the dots scattered
throughout the grid

0: No relationship
Positive: +
•As one variable gets bigger, so does the

Negative: -
•As one variable gets bigger, the 2nd
gets smaller
Correlation Co-efficient
• A number that indicates how strongly and in
which direction 2 variables are correlated with
each other
• A correlation co-efficient varies from –1 to +1
• Indicated as r
• r = +1: Perfect positive correlation
• All points fall exactly on an upward-sloping straight line: whenever
one variable increases, the other increases by a proportional amount
• r = - 1: Perfect negative correlation
• r = 0: No correlation
Correlation Co-efficient
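[Figure: scale of the correlation coefficient from -1 (perfect negative) through 0 (none) to +1 (perfect positive); the relationship is stronger toward either end and weaker near 0]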
Ranges from -1 to +1
0 or close to 0 indicates NO relationship
+/- 0.20 – 0.39 weak +/- correlation
+/- 0.40 – 0.59 moderate +/- correlation
+/- 0.60 – 0.79 strong +/- correlation
+/- 0.80 – 0.99 very strong +/- correlation
+/- 1.00 perfect +/- correlation
Negative relationships are NOT weaker!
Pearson’s correlation coefficient

• Determines the strength and direction of the relationship between X and Y
variables, both of which have been measured at the interval level.

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$

CHILD X Y
A 49 81
B 50 88
C 53 87
D 55 99
E 60 91
F 55 89
G 60 95
H 50 90

N = 8, X̄ = 54, Ȳ = 90

CHILD  X   Y   (X-X̄)  (Y-Ȳ)  (X-X̄)(Y-Ȳ)  (X-X̄)²  (Y-Ȳ)²
A      49  81  -5     -9     45          25      81
B      50  88  -4     -2     8           16      4
C      53  87  -1     -3     3           1       9
D      55  99  1      9      9           1       81
E      60  91  6      1      6           36      1
F      55  89  1      -1     -1          1       1
G      60  95  6      5      30          36      25
H      50  90  -4     0      0           16      0
ΣX=432 ΣY=720 SP=100 SSX=132 SSY=202

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{100}{\sqrt{(132)(202)}} = +0.61$$
CHILD  X   Y   X²     Y²     XY
A      49  81  2401   6561   3969
B      50  88  2500   7744   4400
C      53  87  2809   7569   4611
D      55  99  3025   9801   5445
E      60  91  3600   8281   5460
F      55  89  3025   7921   4895
G      60  95  3600   9025   5700
H      50  90  2500   8100   4500
ΣX=432 ΣY=720 ΣX²=23460 ΣY²=65002 ΣXY=38980
X̄ = 432/8 = 54, Ȳ = 720/8 = 90

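$$
\begin{aligned}
r &= \frac{\sum XY - N\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - N\bar{X}^2\right)\left(\sum Y^2 - N\bar{Y}^2\right)}} \\
  &= \frac{38980 - (8)(54)(90)}{\sqrt{\left(23460 - (8)(54^2)\right)\left(65002 - (8)(90^2)\right)}} \\
  &= \frac{38980 - 38880}{\sqrt{(23460 - 23328)(65002 - 64800)}} \\
  &= \frac{100}{\sqrt{(132)(202)}} = \frac{100}{\sqrt{26664}} = 0.61
\end{aligned}
$$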
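Both routes to Pearson's r, the deviation formula and the computational formula, can be checked against each other with a short Python sketch. The X and Y scores are those of children A to H. The names `sp`, `ssx` and `ssy` mirror the SP, SSX and SSY labels in the table, and the rest are illustrative.

```python
from math import sqrt

# Scores for children A-H from the worked example.
x = [49, 50, 53, 55, 60, 55, 60, 50]
y = [81, 88, 87, 99, 91, 89, 95, 90]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Deviation formula: r = SP / sqrt(SSX * SSY)
sp = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ssx = sum((xi - x_bar) ** 2 for xi in x)
ssy = sum((yi - y_bar) ** 2 for yi in y)
r_deviation = sp / sqrt(ssx * ssy)

# Computational formula, using raw sums instead of deviations.
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
r_computational = (sum_xy - n * x_bar * y_bar) / sqrt(
    (sum_x2 - n * x_bar ** 2) * (sum_y2 - n * y_bar ** 2)
)

print(sp, ssx, ssy)
print(round(r_deviation, 2), round(r_computational, 2))
```

Both formulas give r of about +0.61, which falls in the "strong positive" band of the interpretation scale.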
Types of Correlation r

• Use Spearman’s rho correlation if one or both of your

variables are ordinal
• Use Pearson’s r correlation if both of your variables are
interval or ratio
• You can interpret both kinds of correlation in the same way
Basic Concepts in Samples and Sampling

• Population: the entire group under study as defined by

research objectives. Sometimes called the “universe.”

Researchers define populations in specific terms such

as heads of households, individual person types,
families, types of retail outlets, etc. Population
geographic location and time of study are also
Basic Concepts in Samples and Sampling

• Sample: a subset of the population that

should represent the entire group

• Sample unit: the basic level of

investigation…student, household, types of
benefit etc.

• Census: an accounting of the complete

Basic Concepts in Samples and Sampling

• Sampling error: any error that occurs in a survey because a

sample is used (random error)

• Sample frame: a master list of the population (total or partial)

from which the sample will be drawn

• Sample frame error (SFE): the degree to which the sample

frame fails to account for all of the defined units in the population
(e.g. a telephone book listing does not contain unlisted numbers).

• Sampling fraction: the ratio of sample size to population size or,

in the context of stratified sampling, the ratio of the sample size
to the size of the stratum. The sampling fraction
is n/N, where n is the sample size and N is the population size.
Reasons for Taking a Sample

• Practical considerations such as cost and population


• Inability of researcher to analyze large quantities of

data potentially generated by a census

• Samples can produce sound results if proper rules are

followed for the draw
Basic Sampling Classifications

• Probability samples: ones in which members of the population

have a known chance (probability) of being selected
• Simple random sampling
• Systematic sampling
• Cluster sampling
• Stratified sampling
• Non-probability samples: instances in which the chances
(probability) of selecting members from the population are unknown
• Convenience sampling
• Judgment sampling
• Referral sampling (snowball sampling)
• Quota sampling
Probability Sampling Methods
Simple Random Sampling

• Simple random sampling: the probability of being selected is “known

and equal” for all members of the population

• Procedures are set up such that the different units in the population
have equal probabilities of being chosen

• Blind Draw Method (e.g. names “placed in a hat” and then drawn
• Random Numbers Method (all items in the sampling frame given
numbers, numbers then drawn using table or computer program)
Probability Sampling Methods
Simple Random Sampling

• Advantages:
• Known and equal chance of selection
• Easy method when there is an electronic database
• Disadvantages
• Complete accounting of population needed
• Cumbersome to provide unique designations to
every population member
• Very inefficient when applied to skewed population
distribution (over- and under-sampling problems) –
this is not “overcome with the use of an electronic
Probability Sampling Methods
Systematic Sampling

• Systematic sampling: way to select a probability-based

sample from a directory or list. This method is at times
more efficient than simple random sampling.
• Sampling interval (SI) = population list size (N) divided by a pre-
determined sample size (n)

• How to draw:
1) calculate SI,
2) select a number between 1 and SI randomly,
3) go to this number as the starting point and the item on the list
here is the first in the sample,
4) add SI to the position number of this item and the new position
will be the second sampled item,
5) continue this process until desired sample size is reached
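The five drawing steps can be sketched in Python as follows. The function name `systematic_sample` and the example frame are hypothetical, and the sketch assumes the sampling interval is rounded down to a whole number when N is not an exact multiple of n.

```python
import random


def systematic_sample(frame, n, rng=random):
    """Draw a systematic sample of size n from an ordered sampling frame."""
    big_n = len(frame)
    if not 0 < n <= big_n:
        raise ValueError("sample size must be between 1 and the frame size")

    si = big_n // n               # step 1: sampling interval SI = N / n (rounded down)
    start = rng.randint(1, si)    # step 2: random number between 1 and SI (1-based)

    # steps 3-5: take the starting item, then every SI-th item after it
    positions = [start + k * si for k in range(n)]
    return [frame[p - 1] for p in positions]


# Hypothetical frame: 100 list entries numbered 1..100, sample size 10.
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, random.Random(1))
print(sample)
```

Because only the starting point is random, each of the SI possible start positions defines one fixed set of sampled items.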
Probability Sampling Methods
Systematic Sampling

• Advantages:
• Known and equal chance of any of the SI “clusters” being

• Efficiency…do not need to designate (assign a number to)

every population member, just those early on in the list (unless
there is a very large sampling frame).
• Less expensive…faster than SRS

• Disadvantages:
• Small loss in sampling precision
• Potential “periodicity” problems (An example of this would
occur if you used a sampling frame of adult residents in an
area composed of predominantly couples or young families. If
this list was arranged: Husband / Wife / Husband / Wife etc.
and if every tenth person was to be interviewed, there would
be an increased chance of males being selected)
Probability Sampling Methods
Cluster Sampling

• Cluster sampling: method by which the population is

divided into groups (clusters), any of which can be
considered a representative sample.

• These clusters are mini-populations and therefore

are heterogeneous.
• Once clusters are established a random draw is
done to select one (or more) clusters to represent
the population.
• Area and systematic sampling (discussed earlier)
are two common methods.
Probability Sampling Methods
Cluster Sampling

• Advantages
• Economic efficiency … faster and less expensive
than SRS
• Does not require a list of all members of the
population

• Disadvantage:
• Cluster specification error…the more homogeneous
the cluster chosen, the more imprecise the sample
Probability Sampling Methods
Cluster Sampling – Area Method

• Drawing the area sample:
• Divide the geographic area into sectors (subareas) and give
them names/numbers, determine how many sectors
are to be sampled (typically a judgment call),
randomly select these subareas. Do either a census
or a systematic draw within each area.
• To estimate the total for the whole geographic area, add the
counts in the sampled subareas together and multiply this
number by the ratio of the total number of subareas
divided by the number of subareas sampled.
A two-step area cluster sample (sampling several clusters) is
preferable to a one-step (selecting only one cluster) sample
unless the clusters are homogeneous
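
The area estimate described above is a simple scale-up: the sum of the counts in the sampled subareas, multiplied by (total subareas / sampled subareas). A minimal sketch, with a hypothetical function name and made-up counts:

```python
def estimate_area_total(sampled_counts, total_subareas):
    """Scale counts from the sampled subareas up to the whole geographic area."""
    sampled = len(sampled_counts)
    if sampled == 0 or sampled > total_subareas:
        raise ValueError("need between 1 and total_subareas sampled subareas")
    return sum(sampled_counts) * total_subareas / sampled


# Hypothetical example: 20 subareas in all, 4 of them sampled.
counts = [42, 35, 51, 40]          # households counted in each sampled subarea
estimate = estimate_area_total(counts, total_subareas=20)
print(estimate)                    # 168 * 20 / 4 = 840.0
```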
Probability Sampling Methods
Stratified Sampling

• This method is used when the population
distribution of items is skewed. It allows us
to draw a more representative sample.
Hence if there are more of a certain type of
item in the population, the sample has more
of this type, and if there are fewer of another
type, there are fewer in the sample.
Probability Sampling Methods
Stratified Sampling

• Stratified sampling: the population is separated into
homogeneous groups/segments/strata and a sample
is taken from each.
• The results are then combined to get the picture of
the total population.

• Sample stratum size determination

• Proportional method (stratum share of total sample
is stratum share of total population)
• Disproportionate method (variances among strata
affect sample size for each stratum)
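
The two ways of sizing strata can be compared with a short sketch. The proportional rule follows the definition above. For the disproportionate method the slides do not name a specific formula, so the example assumes Neyman allocation (stratum size times stratum standard deviation), which is one common choice. All numbers and function names are hypothetical.

```python
def proportional_allocation(stratum_sizes, n):
    """Each stratum's share of the sample equals its share of the population."""
    total = sum(stratum_sizes)
    return [n * size / total for size in stratum_sizes]


def neyman_allocation(stratum_sizes, stratum_sds, n):
    """Assumed disproportionate rule: weight each stratum by size x std. deviation."""
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


sizes = [600, 300, 100]   # hypothetical stratum population sizes (N = 1000)
sds = [5, 20, 30]         # hypothetical within-stratum standard deviations
prop = proportional_allocation(sizes, n=100)
ney = neyman_allocation(sizes, sds, n=100)
print(prop)   # 60, 30, 10
print(ney)    # 25, 50, 25
```

Under the assumed rule, the small but highly variable third stratum receives more of the sample than its population share alone would give it. In practice the results are rounded to whole respondents.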
Probability Sampling Methods
Stratified Sampling

• Advantage:
• More accurate overall sample of skewed population

• Disadvantage:
• More complex sampling plan requiring different
sample sizes for each stratum
Nonprobability Sampling Methods
Convenience Sampling Method

• Convenience samples: samples drawn at the
convenience of the interviewer.
• People tend to make the selection at familiar
locations and to choose respondents who are
like themselves.
• Error occurs because:
1. members of the population who are infrequent users
or nonusers of that location are under-represented
2. those selected may not be typical of the population
Nonprobability Sampling Methods
Judgment Sampling Method

• Judgment samples: samples that require a
judgment or an “educated guess” on the part
of the interviewer as to who should represent
the population.
• Also, “judges” (informed individuals) may be
asked to suggest who should be in the
sample.
• Subjectivity enters in here, and certain
members of the population will have a
smaller or no chance of selection compared
to others
Nonprobability Sampling Methods
Referral and Quota Sampling Methods

• Referral samples (snowball samples): samples which
require respondents to provide the names of additional
respondents.
• Members of the population who are less known,
disliked, or whose opinions conflict with the
respondent have a low probability of being selected.
• Quota samples: samples that set a specific number of
certain types of individuals to be interviewed
• Often used to ensure that convenience samples will
have the desired proportion of different respondent
types.
Developing a Sample Plan

• Sample plan: definite sequence of steps that the
researcher goes through in order to draw and
ultimately arrive at the final sample
Developing a Sample Plan
Six steps

• Step 1: Define the relevant population.

• Specify the descriptors, geographic locations,
and time for the sampling units.

• Step 2: Obtain a population list, if possible; this may
only be some type of sample frame
• List brokers, government units, customer
lists, competitors’ lists, association lists,
directories, etc.
Developing a Sample Plan
Six steps …continued

• Step 3: Design the sample (size and method).

• Determine specific sampling method to be used. All
necessary steps must be specified (sample frame,
n, … recontacts, and replacements)

• Step 4: Draw the sample.

• Select the sample unit and gain the information
Developing a Sample Plan
Six steps…concluded

• Step 4 (Continued):
• Substitution
• Oversampling
• Resampling

• Step 5: Assess the sample.

• Sample validation – compare sample profile
with population profile; check non-responders

• Step 6: Resample if necessary.

• Parametric tests: used when a test of significance requires (1)
normality in the population and (2) an interval level
of measurement
• Examples: t-test and Analysis of Variance (ANOVA)

• Nonparametric tests: significance tests whose list of requirements
does not include normality or an interval level of
measurement
• The chi-square test is an example
Power of a test

• The probability of rejecting the null hypothesis when it is
actually false and should be rejected
• The most powerful tests are those that have the strongest
or most difficult requirements (e.g. normality) to satisfy.
• The results of a parametric test whose requirements have
gone unsatisfied may lack any meaningful interpretation
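
To make the definition of power concrete, the sketch below computes it for one simple case: a one-sided z-test with a known population standard deviation. The choice of test, the function name `z_test_power`, and the numbers are assumptions for illustration only. Here power = 1 − Φ(z_crit − effect·√n / σ), where Φ is the standard normal cumulative distribution.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test when the true mean exceeds the null mean by `effect`."""
    z = NormalDist()                       # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha)          # rejection cutoff under the null hypothesis
    shift = effect * sqrt(n) / sigma       # how far the true mean moves the test statistic
    return 1 - z.cdf(z_crit - shift)       # probability of landing in the rejection region


# Hypothetical study: true difference of 5 points, sigma = 10, 25 clients.
power = z_test_power(effect=5, sigma=10, n=25)
print(round(power, 3))                     # roughly 0.80
```

When the effect is zero the function returns alpha, the chance of rejecting a true null hypothesis. Power rises as the sample size or the true effect grows.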
Common Parametric Tests

Tests of Differences Between Means (t-test)

1. Independent Samples t-test
2. Repeated Measures
3. Analysis of Variance (test of differences
among more than two means)
Chi-Square as a Statistical Test

• Chi-square test: an inferential statistics technique
designed to test for significant relationships between two
variables organized in a bivariate table.
• Chi-square requires no assumptions about the shape of the
population distribution from which a sample is drawn.
• It can be applied to nominally or ordinally measured variables.
• It is the most commonly used nonparametric test because it is
relatively easy to follow and applicable to a wide variety of
research problems.
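
As a worked illustration, the chi-square statistic for a bivariate table can be computed by hand: for each cell, expected = (row total × column total) / N, and the statistic is the sum of (observed − expected)² / expected. The table below is made up for the example, and the function name `chi_square` is hypothetical.

```python
def chi_square(table):
    """Return the chi-square statistic and degrees of freedom for a bivariate table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # count expected if no relationship
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical 2 x 2 table: rows = received service (yes / no),
# columns = outcome (improved / not improved).
table = [[30, 20],
         [20, 30]]
stat, df = chi_square(table)
print(stat, df)   # every expected count is 25, so stat = 4 * (5**2 / 25) = 4.0, df = 1
```

With 1 degree of freedom the critical value at the .05 level is 3.841, so a statistic of 4.0 would be judged significant at that level.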
