W2 Descriptive Statistics


DESCRIPTIVE STATISTICS
KEY STATISTICAL CONCEPTS

Population
• the group of all items of interest to a statistics practitioner.
• frequently very large; sometimes infinite.

Sample
• a set of data drawn from the population.
• potentially large, but less than the population.
KEY STATISTICAL CONCEPTS

Parameter: a descriptive measure of a population.

Statistic: a descriptive measure of a sample.


SCENARIO
The faculty senate at a major university with 35,000 students
is considering changing the current grading policy from A, B,
C, D, F to a plus and minus system (that is, B+, B, B- rather
than just B). The faculty is interested in the students’ opinions
concerning this change and will sample 500 students.
a. What is the population of interest?
b. What is the sample?
c. How could the sample be selected?
d. What type of questions should be included in the
questionnaire?
DESCRIPTIVE STATISTICS
Which Group is Smarter?

Class A--IQs of 13 Students    Class B--IQs of 13 Students

102   115                      127   162
128   109                      131   103
131    89                       96   111
 98   106                       80   109
140   119                       93    87
 93    97                      120   105
110                            109
Each individual may be different. If you try to understand a group by remembering the
qualities of each member, you become overwhelmed and fail to understand the group.
DESCRIPTIVE STATISTICS
Which group is smarter now?

Class A--Average IQ    Class B--Average IQ

110.54                 110.23

They’re roughly the same!

With a summary descriptive statistic, it is much easier to answer our question.
DESCRIPTIVE STATISTICS
Types of descriptive statistics:

• Organize Data
– Tables
– Graphs

• Summarize Data
– Central Tendency
– Variation
DESCRIPTIVE STATISTICS
• Organize Data
– Tables
• Frequency Distribution
• Relative Frequency Distribution
– Graphs
• Bar Chart
• Histogram
• Stem and Leaf Plot
• Frequency Polygon
• Pie Chart
• Scatter Plot
SPSS OUTPUT FOR FREQUENCY DISTRIBUTION
GROUPED RELATIVE FREQUENCY DISTRIBUTION

Relative Frequency Distribution of IQ for Two Classes

IQ              Frequency   Percent   Cumulative Percent

80 – 89             3         11.5          11.5
90 – 99             5         19.2          30.8
100 – 109           7         26.9          57.7
110 – 119           4         15.4          73.1
120 – 129           3         11.5          84.6
130 – 139           2          7.7          92.3
140 – 149           1          3.8          96.2
150 and over        1          3.8         100.0

Total              26        100.0
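A grouped frequency table like this one can be rebuilt from the raw scores. The following is a minimal sketch, not the SPSS procedure itself: the helper names `bin_label` and `bin_start` are made up for illustration, and the scores are the combined Class A and Class B IQs listed elsewhere in these slides. Cumulative percentages are computed from the running counts rather than by adding up rounded percentages, so they can differ by 0.1 from a sum of the rounded Percent column.

```python
from collections import Counter

# Combined IQ scores for Classes A and B, as listed in the slides.
iqs = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109,
       110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]


def bin_label(score):
    """Map a score to its 10-point class interval; 150+ is one open bin."""
    if score >= 150:
        return "150 and over"
    low = (score // 10) * 10
    return f"{low} - {low + 9}"


def bin_start(label):
    """Lower bound of a class interval, used only to order the rows."""
    return int(label.split()[0])


counts = Counter(bin_label(s) for s in iqs)
total = len(iqs)

rows = []      # (label, frequency, percent, cumulative percent)
running = 0    # running count of cases up to and including this interval
for label in sorted(counts, key=bin_start):
    running += counts[label]
    rows.append((label,
                 counts[label],
                 round(100 * counts[label] / total, 1),
                 round(100 * running / total, 1)))

for label, freq, pct, cum in rows:
    print(f"{label:<14}{freq:>4}{pct:>8.1f}{cum:>8.1f}")
```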


HISTOGRAM
BAR GRAPH
STEM AND LEAF PLOT
Stem and Leaf Plot of IQ for Two Classes

Stem | Leaf
   8 | 0 7 9
   9 | 3 3 6 7 8
  10 | 2 3 5 6 9 9 9
  11 | 0 1 5 9
  12 | 0 7 8
  13 | 1 1
  14 | 0
  15 |
  16 | 2
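The stem and leaf plot can be generated from the raw scores by splitting each value into its leading digits (the stem) and its last digit (the leaf). This is a small sketch under that assumption; the function name `stem_and_leaf` is invented for illustration.

```python
from collections import defaultdict

# Combined IQ scores for Classes A and B, as listed in the slides.
iqs = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109,
       110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]


def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # Include empty stems so that gaps in the data stay visible.
    return {stem: leaves.get(stem, [])
            for stem in range(min(leaves), max(leaves) + 1)}


plot = stem_and_leaf(iqs)
for stem, leaf_list in plot.items():
    print(f"{stem:>2} | {' '.join(str(d) for d in leaf_list)}")
```

Keeping the empty stem 15 in the output shows the gap between 140 and 162 at a glance.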
SPSS OUTPUT OF A FREQUENCY POLYGON
PIE CHART
SCATTER PLOT
DESCRIPTIVE STATISTICS
Summarizing Data:

– Central Tendency (or Groups’ “Middle Values”)


• Mean
• Median
• Mode

– Variation (or Summary of Differences Within Groups)
• Range
• Interquartile Range
• Variance
• Standard Deviation
MEAN
Most commonly called the “average.”

Add up the values for each case and divide by the total number of cases.

\[ \bar{Y} = \frac{Y_1 + Y_2 + \dots + Y_n}{n} \]

\[ \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} \]
MEAN
What’s up with all those symbols, man?

\[ \bar{Y} = \frac{Y_1 + Y_2 + \dots + Y_n}{n} \]

\[ \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} \]

Some Symbolic Conventions in this Class:


• Y = your variable (could be X or Q or even “Glitter”)
• “-bar” or line over symbol of your variable = mean of that
variable
• Y1 = first case’s value on variable Y
• “. . .” = ellipsis = continue sequentially
• Yn = last case’s value on variable Y
• n = number of cases in your sample
• Σ = Greek letter “sigma” = sum or add up what follows
• i = a typical case or each case in the sample (1 through n)
MEAN
Class A--IQs of 13 Students    Class B--IQs of 13 Students

102   115                      127   162
128   109                      131   103
131    89                       96   111
 98   106                       80   109
140   119                       93    87
 93    97                      120   105
110                            109
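As a quick check of the averages quoted for the two classes, the mean can be computed directly from its definition. This is a minimal sketch; the function name `mean` is just an illustrative choice.

```python
# IQ scores from the slides.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]


def mean(values):
    """Add up the values for each case and divide by the number of cases."""
    return sum(values) / len(values)


mean_a = mean(class_a)
mean_b = mean(class_b)
print(f"Class A mean: {mean_a:.2f}")  # 110.54
print(f"Class B mean: {mean_b:.2f}")  # 110.23
```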
MEAN
The mean is the “balance point.”
Each person’s score is like 1 pound placed at the score’s
position on a see-saw. Below, on a 200 cm see-saw, the
mean equals 110, the place on the see-saw where a
fulcrum finds balance:

1 lb at 93 cm     1 lb at 106 cm    1 lb at 110 cm    1 lb at 131 cm
17 units below    4 units below     0 units           21 units above
The scale is balanced because…
17 + 4 on the left = 21 on the right
MEAN

1. Means can be badly affected by outliers (data points with extreme values unlike the rest).
2. Outliers can make the mean a bad measure of central tendency or common experience.

[Figure: Income in Malaysia. Labels: “All of Us,” “Mean,” and the outlier, Syed Al-Bukhary.]
MEDIAN

The middle value when a variable’s values are ranked in order; the point that divides a distribution into two equal halves.

When data are listed in order, the median is the point at which 50% of the cases are above and 50% below it.

The 50th percentile.


MEDIAN
Class A--IQs of 13 Students (ranked)
89
93
97
98
102
106
109    Median = 109 (six cases above, six below)
110
115
119
128
131
140
MEDIAN
If the first student (IQ 102) were to drop out of Class A, there
would be a new median:
89
93
97
98
106
109
       Median = (109 + 110) / 2 = 219 / 2 = 109.5 (six cases above, six below)
110
115
119
128
131
140
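The two cases above (an odd and an even number of scores) can be sketched in a few lines. The function name `median` is illustrative; dropping the first listed student (IQ 102) is done here with a list slice.

```python
def median(values):
    """Middle value of the ranked scores; average the two middle values if n is even."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


# Class A IQ scores in the order listed in the slides (102 comes first).
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

median_all = median(class_a)                # 13 cases: the 7th ranked value
median_without_first = median(class_a[1:])  # 12 cases: average of 6th and 7th

print(median_all)            # 109
print(median_without_first)  # 109.5
```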
MEDIAN

1. The median is largely unaffected by outliers, making it a better
measure of central tendency than the mean when data are skewed:
it better describes the “typical person.”

[Figure: income distribution. Labels: “All of Us” and the outlier, Syed Al-Bukhary.]
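To make the contrast between the mean and the median concrete, here is a small sketch. The income figures are hypothetical and invented purely for illustration; they are not Malaysian income data.

```python
import statistics

# Hypothetical monthly incomes, invented for illustration only.
incomes = [3000, 3500, 4000, 4500, 5000]
with_outlier = incomes + [1_000_000]  # add one extremely high earner

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)  # 4000 4000
print(mean_after, median_after)    # 170000 4250.0
```

One extreme case multiplies the mean many times over, while the median moves only from 4000 to 4250.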
MEDIAN
2. If the recorded values for a variable form a
symmetric distribution, the median and
mean are identical.
3. In skewed data, the mean lies further
toward the skew than the median.

[Figure: two curves. Symmetric: the mean and median coincide. Skewed: the median and the mean are marked separately, in that order.]
MEDIAN

The middle score or measurement in a set of ranked scores or measurements; the point that divides a distribution into two equal halves.

When data are listed in order, the median is the point at which 50% of the cases are above and 50% below.

The 50th percentile, or second quartile (Q2).


MODE
The most common data point is called the mode.

The combined IQ scores for Classes A & B:

80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120 127 128 131 131 140 162

The score 109 occurs three times, more often than any other score, so the mode is 109. A la mode!!

BTW, it is possible to have more than one mode!
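Finding the mode amounts to counting how often each value occurs. The sketch below returns every value tied for the highest count, so it also covers the case of more than one mode; the function name `modes` and the second, tiny data set are illustrative.

```python
from collections import Counter

# Combined IQ scores for Classes A and B, as listed in the slides.
iqs = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109,
       110, 111, 115, 119, 120, 127, 128, 131, 131, 140, 162]


def modes(values):
    """Return all values that share the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


iq_modes = modes(iqs)
two_modes = modes([1, 1, 2, 2, 3])  # made-up data with a tie

print(iq_modes)   # [109]
print(two_modes)  # [1, 2]
```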


MODE
It may not be at the center of a distribution.

[Figure: a distribution with two peaks.]

The data distribution in the figure is “bimodal” (even statistics can be open-minded).
MODE
1. It may give you the most likely experience rather than the “typical” or “central” experience.
2. In symmetric distributions with a single peak, the mean, median, and mode are the same.
3. In skewed data, the mean and median lie further toward the skew than the mode.

[Figure: two curves. Symmetric: the mean, median, and mode coincide. Skewed: the mode, median, and mean are marked separately, in that order.]
Choosing a Measure of Central Tendency

– If you want to know which score occurred most often, then the mode is the choice.
– The median is a better choice than the mode to serve as the
representative score because its position depends on the rank order
of every score in the distribution. However, it uses only that order;
differences in magnitude between scores are not taken into account.
– When the mean is calculated, the value of each number is taken into account.
• When the scores in your distribution tend to cluster in one of
the tails (i.e., a cluster of high or low scores) the distribution
is skewed (i.e., a nonsymmetrical distribution). In these
instances, the median may be more appropriate than the mean.
DESCRIPTIVE STATISTICS

Summarizing Data:

– Central Tendency (or Groups’ “Middle Values”)
• Mean
• Median
• Mode

– Variation (or Summary of Differences Within Groups)
• Range
• Interquartile Range
• Variance
• Standard Deviation
EXERCISE
The mid-semester examination scores of several students in
class KP2 are as follows (scores are the percentage of items
answered correctly):
87 99 75 87 94 75 35 88 87 93

Find the mean, median, and mode.

What information do these give about the students’ performance
in the KP2 mid-semester examination? Is the distribution of the
data skewed? State the type of skewness.
RANGE
The spread, or the distance, between the lowest and highest values of a variable.

To get the range for a variable, you subtract its lowest value from its highest value.

Class A--IQs of 13 Students    Class B--IQs of 13 Students

102   115                      127   162
128   109                      131   103
131    89                       96   111
 98   106                       80   109
140   119                       93    87
 93    97                      120   105
110                            109

Class A Range = 140 - 89 = 51
Class B Range = 162 - 80 = 82
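The same subtraction in code, as a brief sketch (the function name `value_range` is illustrative, chosen to avoid shadowing Python's built-in `range`):

```python
# IQ scores from the slides.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]


def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)


range_a = value_range(class_a)
range_b = value_range(class_b)
print(range_a)  # 51
print(range_b)  # 82
```

Although the two classes have nearly the same mean, Class B's scores are spread over a much wider interval.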
INTERQUARTILE RANGE
A quartile is a value that marks one of the divisions that break a
series of values into four equal parts.

The median is a quartile and divides the cases in half.

The 25th percentile is a quartile that divides the first ¼ of cases from the latter ¾.
The 75th percentile is a quartile that divides the first ¾ of cases from the latter ¼.

The interquartile range is the distance or range between the 25th
percentile and the 75th percentile. Below, what is the interquartile range?

[Figure: a scale marked 0, 250, 500, 750, 1000; each of the four segments holds 25% of cases.]


DETECTING POTENTIAL OUTLIERS
An observation is a potential outlier if it falls more than 1.5 × IQR
below the first quartile or more than 1.5 × IQR above the third quartile.

• Cutoff value for LOW OUTLIERS: Q1 - 1.5 × IQR
(any value less than this number is considered a low outlier)
• Cutoff value for HIGH OUTLIERS: Q3 + 1.5 × IQR
(any value greater than this number is considered a high outlier)
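The cutoff rule can be sketched with Python's standard `statistics.quantiles` function, applied here to the ten KP2 mid-semester scores from the exercise slide. One assumption to keep in mind: software packages use slightly different conventions for locating quartiles, so the Q1 and Q3 below (Python's default "exclusive" method) may not match SPSS or a hand calculation exactly.

```python
import statistics

# KP2 mid-semester scores from the exercise slide.
scores = [87, 99, 75, 87, 94, 75, 35, 88, 87, 93]

# Quartiles under Python's default ("exclusive") convention.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

low_cutoff = q1 - 1.5 * iqr    # values below this are potential low outliers
high_cutoff = q3 + 1.5 * iqr   # values above this are potential high outliers

outliers = [s for s in scores if s < low_cutoff or s > high_cutoff]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"cutoffs: {low_cutoff} and {high_cutoff}")
print(f"potential outliers: {outliers}")
```

Under this convention the score of 35 falls below the low cutoff and is flagged as a potential outlier.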
VARIANCE
A measure of the spread of the recorded values on a variable. A measure of dispersion.

The larger the variance, the further the individual cases are from the mean.

The smaller the variance, the closer the individual scores are to the mean.

[Figures: two distributions, each with the mean marked.]
VARIANCE
Variance is a number that at first seems complex to calculate.

Calculating variance starts with a “deviation.”

A deviation is the distance of a case’s score from the mean: \( Y_i - \bar{Y} \)

If the average person’s car costs $20,000 and mine costs $6,000,
my deviation from the mean is -$14,000!
6K - 20K = -14K
VARIANCE
The deviation of 102 from 110.54 is? Deviation of 115 from 110.54?

Class A--IQs of 13 Students


102 115
128 109
131 89
98 106
140 119
93 97
110
Y-barA = 110.54
VARIANCE
The deviation of 102 from 110.54 is? Deviation of 115 from 110.54?
102 - 110.54 = -8.54 115 - 110.54 = 4.46

Class A--IQs of 13 Students


102 115
128 109
131 89
98 106
140 119
93 97
110
Y-barA = 110.54
VARIANCE
• We want to add these to get total deviations,
but if we were to do that, we would get zero
every time. Why?
• We need a way to eliminate negative signs.

Squaring the deviations will eliminate negative signs...

Back to the IQ example,
A deviation squared: \( (Y_i - \bar{Y})^2 \)
A deviation squared for 102: \( (102 - 110.54)^2 = (-8.54)^2 = 72.93 \)
A deviation squared for 115: \( (115 - 110.54)^2 = (4.46)^2 = 19.89 \)
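One way to see why the raw deviations always add up to zero uses nothing but the definition of the mean, which says that the sum of the scores equals n times the mean:

```latex
\sum_{i=1}^{n} (Y_i - \bar{Y})
  = \sum_{i=1}^{n} Y_i - n\bar{Y}
  = n\bar{Y} - n\bar{Y}
  = 0
```

The negative deviations below the mean exactly cancel the positive deviations above it, which is the same balance-point property shown with the see-saw.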
VARIANCE
If you were to add all the squared deviations together,
you’d get what we call the “Sum of Squares.”

Sum of Squares (SS):

\[ SS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \]

\[ SS = (Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \dots + (Y_n - \bar{Y})^2 \]


VARIANCE
Class A--IQs of 13 Students

102   115
128   109
131    89
 98   106
140   119
 93    97
110

\( \bar{Y} = 110.54 \)

Class A, sum of squares:

\[
\begin{aligned}
SS ={} & (102 - 110.54)^2 + (115 - 110.54)^2 + (128 - 110.54)^2 + (109 - 110.54)^2 \\
       & + (131 - 110.54)^2 + (89 - 110.54)^2 + (98 - 110.54)^2 + (106 - 110.54)^2 \\
       & + (140 - 110.54)^2 + (119 - 110.54)^2 + (93 - 110.54)^2 + (97 - 110.54)^2 \\
       & + (110 - 110.54)^2 \\
   ={} & 2891.23
\end{aligned}
\]
VARIANCE
The last step…

The variance is approximately the average of the squared deviations.

SS / N = variance for a population.

SS / (n - 1) = variance for a sample.

\[ \text{Variance} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1} \]

VARIANCE
For Class A, Variance = 2891.23 / (n - 1)
= 2891.23 / 12 =
240.94

How helpful is that???

STANDARD DEVIATION
To convert variance into something of meaning,
let’s create standard deviation.

The square root of the variance gives a typical (roughly average)
deviation of the observations from the mean.

\[ \text{s.d.} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}} \]

STANDARD DEVIATION
For Class A, the standard deviation is:

\[ \sqrt{240.94} = 15.52 \]

A typical person’s deviation from the mean IQ of 110.54 is
about 15.52 IQ points.

Review:
1. Deviation
2. Deviation squared
3. Sum of squares
4. Variance
5. Standard deviation
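The five review steps map directly onto a few lines of code. This is a sketch for the Class A scores using the sample formula (dividing by n - 1); the variable names are illustrative, and the unrounded mean is used throughout, so results can differ in the last decimal from a hand calculation that rounds the mean to 110.54 first.

```python
import math

# Class A IQ scores from the slides.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

n = len(class_a)
y_bar = sum(class_a) / n

# 1. Deviation: each score minus the mean.
deviations = [y - y_bar for y in class_a]

# 2. Deviation squared: removes the negative signs.
squared_deviations = [d ** 2 for d in deviations]

# 3. Sum of squares.
ss = sum(squared_deviations)

# 4. Variance for a sample: divide by n - 1.
variance = ss / (n - 1)

# 5. Standard deviation: square root of the variance.
sd = math.sqrt(variance)

print(f"mean = {y_bar:.2f}")
print(f"sum of deviations = {sum(deviations):.10f}")
print(f"SS = {ss:.2f}, variance = {variance:.2f}, s.d. = {sd:.2f}")
```

Python's standard library offers `statistics.variance` and `statistics.stdev` for the same sample quantities; writing the steps out by hand is only meant to mirror the slides.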
STANDARD DEVIATION
1. Larger s.d. = greater amounts of variation around the mean.
For example:

[Figure: two distributions. Left: axis marked 19, 25, 31; mean = 25, s.d. = 3. Right: axis marked 13, 25, 37; mean = 25, s.d. = 6.]

2. s.d. = 0 only when all values are the same (only when you have
a constant and not a “variable”).
3. If you were to “rescale” a variable, the s.d. would change by the
same factor: if we changed units above so the mean equaled 250,
the s.d. on the left would be 30, and on the right, 60.
4. Like the mean, the s.d. will be inflated by an outlier case value.
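The rescaling property in point 3 is easy to check numerically. The sketch below multiplies the Class A scores by 10 (an arbitrary factor chosen for illustration) and compares the sample standard deviations using the standard library's `statistics.stdev`.

```python
import statistics

# Class A IQ scores from the slides.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

# Rescale every score by the same factor (10 is an arbitrary choice).
factor = 10
rescaled = [factor * y for y in class_a]

sd_original = statistics.stdev(class_a)
sd_rescaled = statistics.stdev(rescaled)
ratio = sd_rescaled / sd_original

print(f"original s.d. = {sd_original:.2f}")
print(f"rescaled s.d. = {sd_rescaled:.2f}")
print(f"ratio = {ratio:.1f}")
```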
STANDARD DEVIATION
• Note about computational formulas:
– Your book provides a useful short-cut formula for computing
the variance and standard deviation.
– This is intended to make hand calculations as quick as
possible.
– These formulas obscure the conceptual understanding of our statistics.
– SPSS and the computer are “computational formulas” now.
SYMBOLS IN STATISTICS
DESCRIPTIVE STATISTICS
Summarizing Data:

– Central Tendency (or Groups’ “Middle Values”)
• Mean
• Median
• Mode

– Variation (or Summary of Differences Within Groups)
• Range
• Interquartile Range
• Variance
• Standard Deviation

– …Wait! There’s more


BOX-PLOTS
A way to graphically portray almost all
the descriptive statistics at once is the
box-plot.

A box-plot shows: min, Q1, Q2, Q3, max
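The five numbers a box-plot displays can be computed as follows. This is a sketch for the Class A scores; it assumes Python's default quartile convention in `statistics.quantiles`, and other software (including SPSS) may place Q1 and Q3 slightly differently.

```python
import statistics

# Class A IQ scores from the slides.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

# Quartiles under Python's default ("exclusive") convention.
q1, q2, q3 = statistics.quantiles(class_a, n=4)

# min, Q1, Q2 (median), Q3, max
five_number_summary = (min(class_a), q1, q2, q3, max(class_a))
print(five_number_summary)  # (89, 97.5, 109.0, 123.5, 140)
```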


BOX-PLOTS
[Box-plot figure. Values marked on the plot: 162, 123.5, M = 110.5, 106.5, 96.5, 82.]

IQR = 27; there is no outlier.
SPSS OUTPUT OF CLASS A & B
SHAPE OF DISTRIBUTIONS
• Shape of distribution is measured by
– Skewness & Kurtosis
• When the scores in your distribution tend to cluster in
one of the tails (i.e., a cluster of high scores or a cluster
of low scores) the distribution is skewed.
– Positively Skewed Distributions – occur when there is a
cluster of lower scores; the smaller, more spread-out tail
will be on the right (i.e., fewer high scores).
– Negatively Skewed Distributions – occur when there is a
cluster of higher scores; the smaller, more spread-out tail
will be on the left (i.e., fewer low scores).
• Statisticians use several specific terms to describe the different
shapes these distributions can assume.
– Unimodal Distributions have one prominent category or high point.
– Bimodal Distributions have two prominent categories or high points.
– Multimodal Distributions have several prominent categories or high points.
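The direction of skew described earlier can also be checked with a number. The sketch below uses one common definition of skewness, the third central moment divided by the 1.5 power of the second central moment; this choice is an assumption, since statistical packages often apply a small-sample correction and report slightly different values. A positive result indicates a longer right tail, a negative result a longer left tail.

```python
# Class B IQ scores from the slides.
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]


def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


skew_b = skewness(class_b)
skew_mirrored = skewness([-v for v in class_b])  # flipping the data flips the tail
skew_symmetric = skewness([1, 2, 3, 4, 5])       # made-up symmetric data

print(f"Class B skewness: {skew_b:.2f}")
print(f"Mirrored skewness: {skew_mirrored:.2f}")
print(f"Symmetric data skewness: {skew_symmetric:.2f}")
```

Class B's single very high score (162) stretches the right tail, so its skewness comes out positive.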
SAMPLE RESEARCH ARTICLE
DESCRIPTIVE STATISTICS
• Now you are qualified to use descriptive statistics!
• Questions?
• Do your Quiz 1 (Week 1 and Week 2’s lectures) online
PLEASE!
