SW Statistics
Decisions
in-depth interviews
content analyses
participant observations
surveys
group experiments
meta-analysis
historical research
Information
Is the interpretation we give to collected data
after we have analyzed them.
Example:
Treatment Intervention A is more successful
than Treatment Intervention B in reducing
the substance abuse among research
participants.
If the temperature outside this room,
as measured by a thermometer, is 35
degrees Celsius, the 35 degrees is a
datum. The interpretation that it is
very hot is information.
A variable is a characteristic that
differs in quantity or quality
among units under study
All research focuses on
variables
Examples: educational
level, gender, ethnicity
etc.
Qualitative variable
A qualitative variable takes on non-
numerical values.
It simply describes which class or
category the observations fall into.
Nationality - Filipino, Chinese, American, Hispanic
Dependent variable
a variable believed to be influenced by the
independent variable
Levels of Measurement
Why is Level of Measurement Important?
1. Categorical (Nominal)
2. Ordinal
3. Interval
4. Ratio
Example : Nominal Scale
Zip Codes
phone number
Example: Ordinal Scale
1 College graduate
2 High school graduate
3 Elementary graduate

5 Excellent
4 Very Satisfactory
3 Satisfactory
Example: Interval Scale
temperature
Ratio
• the existence of a fixed, absolute, and non-arbitrary zero
point constitutes the only difference between the interval and
ratio levels of measurement
• since the measurement has an absolute zero and the
differences between numbers are meaningful, ratios make
sense
• the absolute zero point permits all arithmetic
operations
• values at the ratio level indicate the actual amount of the
property being measured
Example: Ratio Scale
income
Weights and Measurements
Permitted Operations
• Categorical
• = and ≠ (equality and inequality), no < or >, no + or -
• Ordinal
• = and ≠, < and >, no + or -
• Interval
• = and ≠, < and >, + and -, but no * or /
• Ratio
• = and ≠, < and >, + and -, * and /
Comparative Summary
Illustration
Nominal: Are you currently employed? Yes/No
Ordinal: What is your employment status?
Unemployed
Employed part-time
Employed full-time (37 - 40 hours)
Employed over 40 hours
Ratio: How many hours a week are you employed?
Discrete and Continuous Variables
• Discrete variables
• can take on only a finite number of values
(e.g. 34 students, 3 days, 5 lectures)
• Continuous variables
• can theoretically take on all numerical values
• assuming that we can use precise measuring
instruments capable of measuring the values with ever
increasing precision, it can take an unlimited number of different
values (e.g. height of a person, kilos of rice, liters of
water)
Dichotomous, Binary and Dummy Variables
• Dichotomous variable
• a specific type of discrete variable that has only two value categories
(e.g. gender: male, female; election: win, lose)
1. Binary variable
• a special type of dichotomous variable
• 1 - presence of the variable, 0 - absence of the variable
2. Dummy variable
• transformation used when we want a nominal dichotomous
variable such as gender to be used in performing other statistical
analyses such as in regression analysis where variables need to
be in the interval/ratio level
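A minimal sketch of dummy coding in Python. The responses are made up for illustration, and the helper name `dummy_code` and the choice of "female" as the category coded 1 are assumptions, not part of the original slides.

```python
def dummy_code(values, coded_as_one):
    """Return 1 where a value equals `coded_as_one`, otherwise 0."""
    return [1 if v == coded_as_one else 0 for v in values]

# Hypothetical responses, for illustration only.
gender = ["male", "female", "female", "male", "female"]

female_dummy = dummy_code(gender, "female")
print(female_dummy)  # [0, 1, 1, 0, 1]

# Because the codes are 0 and 1, their mean is the proportion coded 1.
print(sum(female_dummy) / len(female_dummy))  # 0.6
```

Once coded this way, the variable can be entered into procedures such as regression that expect numeric input.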
Categories of Statistical Analyses
1. Number of variables being analyzed
• Univariate analyses
- examine the distribution of value categories (for nominal or ordinal
level data) and values (for interval/ratio level data) for a
single variable
Proportions
P = f / N
where
P = proportion
f = frequency of the category
N = total number of cases
Proportions: Example
TABLE 1 - Number of Employed Persons by Age Group, Philippines: January 2008
Age Group            Employed Persons ('000)    Proportion
Total                33,695                     1.000
15-24 years          6,520                      ?
25-34 years          8,916                      ?
35-44 years          7,943                      ?
45-54 years          5,851                      ?
55-64 years          3,080                      ?
65 years and over    1,383                      ?
Not reported         1                          ?
Source of basic data: National Statistics Office, Labor Force Survey
Proportions: Example
Age Group            Employed Persons ('000)    Proportion
Total                33,695                     1.000
15-24 years          6,520                      0.194
25-34 years          8,916                      0.265
35-44 years          7,943                      0.236
45-54 years          5,851                      0.174
55-64 years          3,080                      0.091
65 years and over    1,383                      0.041
Not reported         1                          *
Note: Details may not add up to total due to rounding
* Less than 0.005
Source of basic data: National Statistics Office, Labor Force Survey
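The proportions in Table 1 can be reproduced with a short Python sketch that divides each category frequency by the total number of cases. The counts are taken from the table; the variable names are illustrative.

```python
# Employed persons in thousands, from Table 1.
employed = {
    "15-24 years": 6520,
    "25-34 years": 8916,
    "35-44 years": 7943,
    "45-54 years": 5851,
    "55-64 years": 3080,
    "65 years and over": 1383,
    "Not reported": 1,
}
total = 33695  # published total; details may not add up to it due to rounding

# Proportion = frequency of the category / total number of cases.
proportions = {group: f / total for group, f in employed.items()}

for group, p in proportions.items():
    print(f"{group:<20} {p:.3f}")
```

The "Not reported" row prints as 0.000, which is why the table shows an asterisk (less than 0.005) instead.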
Percentages
from the Latin words “per” (by means of/for every) and “centum”
(hundred): for every hundred
the frequency of occurrence of a category per 100 cases
useful in computing the distribution or disaggregation of a particular
set of observations
the sum of percentages should always add up to 100 percent
rounding affects the sum of percentages
put a footnote if the sum is not equal to 100
Percentages
Engineering Majors
Gender of Students    University A        University B
                      f        %          f        %
Male                  1,082    80         146      80
Female                270      20         37       20
Total                 1,352    100        183      100
Percentages: Example
TABLE 2 - Number and Percentage Distribution of Unemployed Persons by Age Group,
Philippines: January 2008
Age Group            Unemployed Persons ('000)    Percentage (%)
Total                2,675                        100.0
15-24 years          1,328                        ?
25-34 years          796                          ?
35-44 years          274                          ?
45-54 years          175                          ?
55-64 years          85                           ?
65 years and over    17                           ?
Source of basic data: National Statistics Office, Labor Force Survey
Percentages: Example
to get the percentage of unemployed persons 15 to 24
years old:
• % = (f / N) × 100
• % = (1,328 / 2,675) × 100 = 49.6
Percentages: Example
TABLE 3 - Number and Percentage Distribution of Unemployed Persons by Age Group,
Philippines: January 2008
Age Group            Unemployed Persons ('000)    Percentage (%)
Total                2,675                        100.0
15-24 years          1,328                        49.6
25-34 years          796                          29.8
35-44 years          274                          10.2
45-54 years          175                          6.5
55-64 years          85                           3.2
65 years and over    17                           0.6
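The percentage column can be checked with a short Python sketch, which also shows how rounding affects the sum of percentages. The counts come from the table; the variable names are illustrative.

```python
# Unemployed persons in thousands, by age group.
unemployed = {
    "15-24 years": 1328,
    "25-34 years": 796,
    "35-44 years": 274,
    "45-54 years": 175,
    "55-64 years": 85,
    "65 years and over": 17,
}
total = sum(unemployed.values())  # 2,675

# Percentage = (frequency / total) * 100, rounded to one decimal place.
percentages = {g: round(100 * f / total, 1) for g, f in unemployed.items()}

for group, pct in percentages.items():
    print(f"{group:<20} {pct:>5.1f}")

# The rounded percentages add up to 99.9 rather than 100.0.
print(round(sum(percentages.values()), 1))
```

Because the rounded figures sum to 99.9, a table like this one would carry the footnote that details may not add up to the total due to rounding.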
Interpretation
For every one female, there are three males
Ratios
sex ratio = ratio of males to females
males = 1,350
females = 80
If multiplied by 100,
Measures of Central Tendency
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{\text{Sum of all values in the data set}}{\text{Total number of observations}}$$
MEAN – Ungrouped Data
$$\bar{X} = \frac{\sum_{i=1}^{20} X_i}{n} = \frac{671}{20} = 33.55 \approx 34 \text{ years old}$$
MEAN – Grouped Data
$$\bar{X} = \frac{\sum (\text{frequency} \times \text{midpoint})}{n} = \frac{675}{20} = 33.75 \approx 34$$
Advantages of the MEAN:
Then count the total number of scores (in this case n = 14) and divide
by two (7). Now count in by that number from both
ends of your distribution until you find the middle score or scores.
2 3 5 7 8 10 12 17 17 23 25 34 43 44
The median falls between 12 and 17, so add the two together and
divide by 2 to find the actual median: (12 + 17) / 2 = 14.5
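The counting procedure can be sketched in Python as follows, using the 14 ranked scores from the example. The explicit even/odd branch is one way to write it; the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics

scores = [2, 3, 5, 7, 8, 10, 12, 17, 17, 23, 25, 34, 43, 44]

ranked = sorted(scores)  # the median requires rank-ordered scores
n = len(ranked)          # 14

if n % 2 == 0:
    # Even number of scores: average the two middle scores.
    median = (ranked[n // 2 - 1] + ranked[n // 2]) / 2
else:
    # Odd number of scores: take the single middle score.
    median = ranked[n // 2]

print(median)  # 14.5
assert median == statistics.median(scores)
```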
Advantages of the MEDIAN:
• the mode is the value within a data set that occurs most frequently
You do not have to rank-order the raw scores to determine the mode, although
doing so makes repeated values easier to spot:
10 23 2 34 17 5 3 12 43 25 44 17 7 8
2 3 5 7 8 10 12 17 17 23 25 34 43 44
There is only one raw score that appears twice. Therefore, the mode of this
raw score distribution is 17. It is the most frequently occurring score.
If a distribution has one mode it is said to be unimodal. If it has two modes it is
bimodal; if it has more than two modes it is said to be multimodal. It is
also possible for a distribution to not have any mode.
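A small Python sketch of finding the mode or modes of raw scores. The helper name `modes` is illustrative, and treating a distribution in which no score repeats as having no mode is an assumption about the convention being used.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every score occurs once: treated here as "no mode"
    return sorted(v for v, c in counts.items() if c == highest)

scores = [10, 23, 2, 34, 17, 5, 3, 12, 43, 25, 44, 17, 7, 8]

print(modes(scores))           # [17]  (unimodal)
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  (bimodal)
print(modes([1, 2, 3]))        # []  (no mode)
```

Note that the raw scores are passed in unranked; counting frequencies does not depend on order.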
MODE – Grouped Data
For a Grouped Frequency Distribution, the mode is the midpoint of the interval
with the highest frequency. The Frequency Distribution table below has its mode
marked:
Interval    Frequency    Midpoint
81-90       5            85.5
71-80       3            75.5
61-70       12           65.5
51-60       16           55.5
41-50       33           45.5    <- mode
31-40       21           35.5
21-30       15           25.5
11-20       7            15.5
Total       112          404
The mode of this Grouped Frequency Distribution is 45.5.
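The grouped-data rule can be sketched in Python: find the interval with the highest frequency and report its midpoint. The intervals and frequencies are taken from the table; the variable names are illustrative.

```python
# (lower limit, upper limit) -> frequency, from the table above.
classes = {
    (81, 90): 5,
    (71, 80): 3,
    (61, 70): 12,
    (51, 60): 16,
    (41, 50): 33,
    (31, 40): 21,
    (21, 30): 15,
    (11, 20): 7,
}

# The modal interval is the one with the highest frequency.
modal_interval = max(classes, key=classes.get)

# The mode of grouped data is the midpoint of that interval.
lower, upper = modal_interval
mode = (lower + upper) / 2

print(modal_interval, mode)  # (41, 50) 45.5
```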
Advantage of the MODE:
1. Range
2. Interquartile Range
3. Mean deviation
4. Variance
5. Standard deviation
The Standard Deviation
18 20 21 24 27
18 20 22 25 27
19 20 22 26 29
19 20 23 26 30
19 21 23 26 31

This data can be rearranged as a simple frequency distribution table:

Value    Frequency
31       1
30       1
29       1
28       0
27       2
26       3
25       1
24       1
23       2
22       2
21       2
20       4
19       3
18       2
Formula for Standard
Deviation
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$$
Where
s = the standard deviation
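A hedged Python sketch of the standard deviation for the 25 values in the grid above. Textbooks differ on the denominator, so the sketch computes both the population form (dividing by N) and the sample form (dividing by n - 1); which one a given course expects is an assumption to confirm.

```python
import math
import statistics

data = [18, 18, 19, 19, 19, 20, 20, 20, 20, 21, 21, 22, 22,
        23, 23, 24, 25, 26, 26, 26, 27, 27, 29, 30, 31]

n = len(data)         # 25
mean = sum(data) / n  # 23.04

# Sum of squared deviations from the mean.
sum_sq = sum((x - mean) ** 2 for x in data)

pop_sd = math.sqrt(sum_sq / n)           # divide by N
sample_sd = math.sqrt(sum_sq / (n - 1))  # divide by n - 1

print(round(pop_sd, 2), round(sample_sd, 2))

# Cross-check against the standard library.
assert math.isclose(pop_sd, statistics.pstdev(data))
assert math.isclose(sample_sd, statistics.stdev(data))
```

The two results differ only slightly here because n is fairly large; the gap widens for small samples.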
The Normal Distribution
Normal Distribution – A bell-shaped and symmetrical theoretical
distribution, with the mean, the median, and the mode all coinciding at
its peak and with frequencies gradually decreasing at both ends of the
curve.
• Since all three are essentially equal, and this is reflected in the bar
graph, we can assume that these data are normally
distributed.
• Also, since the median is approximately equal to the mean, we
know that the distribution is symmetrical.
The Shape of a Normal Distribution: The Normal Curve
The Shape of a Normal Distribution
Notice the shape of the normal curve in this graph. Some normal distributions are tall
and thin, while others are short and wide. All normal distributions, though, are wider
in the middle and symmetrical.
Different Shapes of the Normal Distribution
0: No relationship
Positive: +
• As one variable gets bigger, so does the 2nd
Negative: -
• As one variable gets bigger, the 2nd gets smaller
Correlation Co-efficient
• A number that indicates how strongly and in
which direction 2 variables are correlated with
each other
• A correlation co-efficient varies from –1 to +1
• Indicated as r
• r = +1: Perfect positive correlation
• As one variable increases, the other increases by an exactly
proportional amount (all points fall on a straight line)
• r = - 1: Perfect negative correlation
• r = 0: No correlation
Correlation Co-efficient
Perfect Negative    None      Perfect Positive
-1                  0         +1
Stronger            Weaker    Stronger
• SIZE / STRENGTH
Ranges from –1 to +1
0 or close to 0 indicates NO relationship
+/- 0.2 – 0.39 weak -/+ correlation
+/- 0.4 – 0.59 moderate -/+ correlation
+/- 0.6 – 0.79 strong -/+ correlation
+/- 0.8 – 0.99 very strong -/+ correlation
+/- 1.00 perfect -/+ correlation
Negative relationships are NOT weaker!
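The strength bands can be turned into a small Python helper. The function name `describe_correlation` is illustrative; treating the bands as contiguous (for example, 0.395 counts as weak) and labelling anything below 0.2 as "no relationship" are assumptions made to cover values the table does not list.

```python
def describe_correlation(r):
    """Label a correlation coefficient by strength and direction."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")

    size = abs(r)  # strength ignores the sign
    if size == 1:
        strength = "perfect"
    elif size >= 0.8:
        strength = "very strong"
    elif size >= 0.6:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    elif size >= 0.2:
        strength = "weak"
    else:
        return "no relationship"  # 0 or close to 0

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

print(describe_correlation(0.61))   # strong positive correlation
print(describe_correlation(-0.85))  # very strong negative correlation
```

Because strength is judged on the absolute value, -0.85 is rated stronger than +0.61, which is the point of the reminder that negative relationships are not weaker.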
Pearson’s correlation coefficient
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$
CHILD HEIGHT (in) WEIGHT (lb)
A 49 81
B 50 88
C 53 87
D 55 99
E 60 91
F 55 89
G 60 95
H 50 90
[Figure: scatter plot, HEIGHT VS WEIGHT]
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{SP}{\sqrt{SS_X \, SS_Y}}$$
$$r = \frac{100}{\sqrt{(132)(202)}} = \frac{100}{\sqrt{26{,}664}} = \frac{100}{163.3} = +0.61$$
CHILD    X      Y      X²       Y²       XY
A        49     81     2401     6561     3969
B        50     88     2500     7744     4400
C        53     87     2809     7569     4611
D        55     99     3025     9801     5445
E        60     91     3600     8281     5460
F        55     89     3025     7921     4895
G        60     95     3600     9025     5700
H        50     90     2500     8100     4500
Sum      432    720    23460    65002    38980
Mean     54     90

N = 8
ΣX = 432, ΣY = 720
X̄ = 432/8 = 54, Ȳ = 720/8 = 90
ΣX² = 23460, ΣY² = 65002, ΣXY = 38980
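$$r = \frac{\sum XY - N\bar{X}\bar{Y}}{\sqrt{(\sum X^2 - N\bar{X}^2)(\sum Y^2 - N\bar{Y}^2)}} = \frac{38980 - (8 \times 54 \times 90)}{\sqrt{(23460 - (8 \times 54^2))(65002 - (8 \times 90^2))}}$$
$$= \frac{38980 - 38880}{\sqrt{(23460 - 23328)(65002 - 64800)}} = \frac{100}{\sqrt{(132)(202)}} = \frac{100}{\sqrt{26664}} = 0.61$$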
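Both versions of the Pearson formula can be checked in Python against the height and weight data for children A to H. This is a sketch with illustrative variable names; it computes r from deviations and from the computational (raw-sums) form and confirms that they agree.

```python
import math

height = [49, 50, 53, 55, 60, 55, 60, 50]  # X, inches
weight = [81, 88, 87, 99, 91, 89, 95, 90]  # Y, pounds

n = len(height)
mean_x = sum(height) / n  # 54.0
mean_y = sum(weight) / n  # 90.0

# Definitional form: sum of products over root of sums of squares.
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(height, weight))
ssx = sum((x - mean_x) ** 2 for x in height)
ssy = sum((y - mean_y) ** 2 for y in weight)
r_definitional = sp / math.sqrt(ssx * ssy)

# Computational form: uses raw sums instead of deviations.
sum_xy = sum(x * y for x, y in zip(height, weight))
sum_x2 = sum(x * x for x in height)
sum_y2 = sum(y * y for y in weight)
r_computational = (sum_xy - n * mean_x * mean_y) / math.sqrt(
    (sum_x2 - n * mean_x ** 2) * (sum_y2 - n * mean_y ** 2)
)

print(sp, ssx, ssy)              # 100.0 132.0 202.0
print(round(r_definitional, 2))  # 0.61
assert math.isclose(r_definitional, r_computational)
```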
Types of Correlation r
• Procedures are set up such that the different units in the population
have equal probabilities of being chosen
• Blind Draw Method (e.g. names “placed in a hat” and then drawn
randomly)
• Random Numbers Method (all items in the sampling frame given
numbers, numbers then drawn using table or computer program)
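The random numbers method can be sketched with Python's standard `random` module. The sampling frame below is hypothetical, and the helper name `simple_random_sample` and its `seed` parameter are illustrative; `random.sample` draws without replacement, so every unit has the same chance of selection.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, n)

# Hypothetical sampling frame of 100 numbered population members.
frame = [f"person_{i}" for i in range(1, 101)]

sample = simple_random_sample(frame, 10, seed=42)
print(sample)
```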
Probability Sampling Methods
Simple Random Sampling
• Advantages:
• Known and equal chance of selection
• Easy method when there is an electronic database
• Disadvantages
• Complete accounting of population needed
• Cumbersome to provide unique designations to
every population member
• Very inefficient when applied to skewed population
distribution (over- and under-sampling problems);
this is not overcome by the use of an electronic
database
Probability Sampling Methods
Systematic Sampling
• How to draw:
1) calculate SI,
2) select a number between 1 and SI randomly,
3) go to this number as the starting point and the item on the list
here is the first in the sample,
4) add SI to the position number of this item and the new position
will be the second sampled item,
5) continue this process until desired sample size is reached
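The five steps can be sketched in Python. The slides do not define SI, so the sketch assumes it is the skip interval, computed as the list size divided by the desired sample size (rounded down); the function name and the numbered frame are illustrative.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Systematic sample of n items from an ordered sampling frame."""
    if not 0 < n <= len(frame):
        raise ValueError("sample size must be between 1 and the frame size")
    rng = random.Random(seed)

    si = len(frame) // n        # 1) skip interval (assumed: list size / sample size)
    start = rng.randint(1, si)  # 2) random number between 1 and SI
    # 3)-5) take the item at the starting position, then every SI-th item
    positions = [start + k * si for k in range(n)]
    return [frame[p - 1] for p in positions]  # positions are 1-based

# Hypothetical frame: population members numbered 1 to 100.
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, seed=1)
print(sample)
```

Only the starting point is random; every later pick is determined by it, which is why a list with a repeating pattern can cause periodicity problems.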
Probability Sampling Methods
Systematic Sampling
• Advantages:
• Known and equal chance of any of the SI “clusters” being
selected
• Disadvantages:
• Small loss in sampling precision
• Potential “periodicity” problems (An example of this would
occur if you used a sampling frame of adult residents in an
area composed of predominantly couples or young families. If
this list was arranged: Husband / Wife / Husband / Wife etc.
and if every tenth person was to be interviewed, there would
be an increased chance of males being selected)
Probability Sampling Methods
Cluster Sampling
• Advantages
• Economic efficiency … faster and less expensive
than SRS
• Does not require a list of all members of the
universe
• Disadvantage:
• Cluster specification error…the more homogeneous
the cluster chosen, the more imprecise the sample
results
Probability Sampling Methods
Cluster Sampling – Area Method
• Advantage:
• More accurate overall sample of skewed population
• Disadvantage:
• More complex sampling plan requiring different
sample sizes for each stratum
Nonprobability Sampling Methods
Convenience Sampling Method
• Step 4 (Continued):
• Substitution
• Oversampling
• Resampling