Descriptive Statistics (II)
GOG 502/PLN 504 Youqin Huang
Calculating the Mean
Yi represents the “ith” case of variable Y
i goes from 1 to n
Y1 = value of Y for first case in spreadsheet
Y2 = value for second case, etc.
Yn = value for last case

Person   # Guns owned (Y)
1        Y1 = 0
2        Y2 = 3
3        Y3 = 0
4        Y4 = 1
5        Y5 = 1

$\sum_{i=1}^{5} Y_i = Y_1 + Y_2 + Y_3 + Y_4 + Y_5 = 0 + 3 + 0 + 1 + 1 = 5$
$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i = \frac{1}{5} \times 5 = 1$
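As a minimal sketch of the calculation above, the same sum and mean can be reproduced in Python. The variable names (`guns_owned`, `total`, `mean_y`) are illustrative and not part of the original slides.

```python
# Number of guns owned by each of the five persons (Y1..Y5 in the table).
guns_owned = [0, 3, 0, 1, 1]

n = len(guns_owned)       # number of cases
total = sum(guns_owned)   # sum of Y_i for i = 1..n
mean_y = total / n        # Y-bar = (1/n) * sum of Y_i

print(total, mean_y)
```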
$\sum_{j=1}^{k} M_j f_j$

$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 < \sum_{i=1}^{n} (Y_i - A)^2$ (for any value A other than the mean)
The Mean and Extreme Values
Case   Value   Value with an extreme case
1      20      20
2      40      40
3      0       0
4      70      1000

Hypothetical Block After One Heck of a Remodel:
Mean housing price/value is not very meaningful
House values: 255000, 250000, 235000, 250000, 1000000
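A short sketch, assuming the values listed on this slide, shows how far a single extreme case pulls the mean while the median barely moves. The variable names are illustrative only.

```python
from statistics import mean, median

values = [20, 40, 0, 70]            # the four original cases
values_extreme = [20, 40, 0, 1000]  # case 4 replaced by an extreme value
house_values = [255000, 250000, 235000, 250000, 1000000]

# Compare mean and median for each data set.
for name, data in [("values", values),
                   ("values_extreme", values_extreme),
                   ("house_values", house_values)]:
    print(name, "mean =", mean(data), "median =", median(data))
```

In the housing block, the mean (398000) is higher than four of the five house values, which is why the slide calls it not very meaningful.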
Central Tendency: Median
Pros:
  Unaffected by outliers (appropriate for variables such as income, housing price)
Cons:
  Insensitive to the distances of the measurements from the middle.
  8, 9, 10, 11, 12
  1, 2, 10, 100, 500

Central Tendency: Mode
The value that occurs most frequently: the “modal” value
Appropriate for all types of data.
Commonly used for categorical (nominal, ordinal) data
Only useful for continuous (interval/ratio) variables if you have grouped data
  Otherwise, all values may very likely be unique
Modes = Peaks
  Uni-modal distribution: One peak
  Bi-modal distribution: Two peaks
  Multi-modal distribution: Multiple peaks (usually more than two).
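The points about the mode can be sketched with Python's standard `statistics` module. The categorical responses below are illustrative data, and the numeric list is the ungrouped example from the median slide.

```python
from statistics import mode, multimode

# Categorical data: the mode is the most frequent category.
views = ["Favor", "Oppose", "Favor", "Favor", "Oppose"]
modal_view = mode(views)

# Ungrouped numeric data: every value is unique, so all values tie
# and the mode carries no useful information.
ungrouped = [8, 9, 10, 11, 12]
tied_modes = multimode(ungrouped)

print(modal_view)
print(tied_modes)
```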
Comparing Mean, Median, Mode
For a symmetrical unimodal distribution, the three are identical
For a smoothed unimodal frequency curve, the mode defines the peak value; the median divides the area under the curve into two equal parts; the mean divides the curve into two equally balanced parts through the center of gravity
Both mean and median can be easily calculated for grouped or ungrouped data. Mode is usually used for grouped data
Unequal class intervals in grouped data do not hinder the calculation of the mean or median, but severely limit the calculation of the mode
The presence of an open-ended class does not affect the median or mode, but severely limits the calculation of the mean.
Variability
• Very different groups can have the same means:
[Figure: two histograms on a 0 to 200 scale. One reports Std. Dev = 21.72, Mean = 101, N = 23; the other reports Std. Dev = 67.62, Mean = 100.0, N = 23.]
Measures of Variation
Range (Ymax - Ymin)
  Doesn’t tell you much about the middle cases
  Influenced by extreme values… may not be representative
Interpercentile (usually interquartile) range
  Percentile: p% of scores fall below it, (100-p)% above it
  Lower quartile (P25), upper quartile (P75)
  IQR = P75 - P25
  Not sensitive to extreme values
  Outlier: more than 1.5 IQR above the upper quartile, or more than 1.5 IQR below the lower quartile

Quartile and Interquartile Range
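The range, quartiles, IQR, and the 1.5 IQR outlier rule can be sketched as follows. This is an illustration under one assumption worth stating: software packages compute quartiles with slightly different conventions, and the `method="inclusive"` option of `statistics.quantiles` is just one of them. The data are the second list from the median slide.

```python
from statistics import quantiles

data = [1, 2, 10, 100, 500]

value_range = max(data) - min(data)  # Ymax - Ymin

# Quartile cut points P25, P50, P75 (the convention used is an assumption).
p25, p50, p75 = quantiles(data, n=4, method="inclusive")
iqr = p75 - p25

# Fences for the 1.5 IQR outlier rule.
lower_fence = p25 - 1.5 * iqr
upper_fence = p75 + 1.5 * iqr
outliers = [y for y in data if y < lower_fence or y > upper_fence]

print(value_range, p25, p75, iqr, outliers)
```

Here the range (499) is driven entirely by the largest case, while the IQR (98) describes the middle half of the data.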
Variance (sY2)
$s_Y^2 = \frac{\sum_{i=1}^{n} d_i^2}{n-1} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}$

Standard deviation (sY)
$s_Y = \sqrt{s_Y^2} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$
Example
Case   Num CD’s   Mean (Ybar)   Deviation (Yi-Ybar)
1      3          4             -1
2      5          4             1
3      1          4             -3
4      7          4             3
Example
Case   Num CD’s   Mean (Ybar)   Deviation (Yi-Ybar)   Square of deviation
1      3          4             -1                    1
2      5          4             1                     1
3      1          4             -3                    9
4      7          4             3                     9
sum                             0                     20

$s_Y^2 = \frac{\sum_{i=1}^{n} d_i^2}{n-1} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}$
Variance = 20/(4-1) = 6.67
$s_Y = \sqrt{\frac{\sum_{i=1}^{n} d_i^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$
St.Dev = sqrt(6.67) = 2.58

Properties of Variance (sY2)
sY2 >= 0
Zero if all points cluster exactly on the mean
Larger for more “spread” distributions
Pros:
  Comparable across samples of different size
  Better mathematical characteristics than AAD
Cons:
  Values get fairly large, due to “squaring”
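The worked example can be checked step by step with a short sketch. It follows the slide's n - 1 denominator for the variance; the average absolute deviation (AAD) is included for comparison and is assumed to divide by n. Variable names are illustrative.

```python
from math import sqrt

cds = [3, 5, 1, 7]  # number of CDs for the four cases

n = len(cds)
y_bar = sum(cds) / n                       # mean = 4
deviations = [y - y_bar for y in cds]      # Yi - Ybar
squared = [d ** 2 for d in deviations]     # squared deviations

variance = sum(squared) / (n - 1)          # 20 / 3
st_dev = sqrt(variance)
aad = sum(abs(d) for d in deviations) / n  # average absolute deviation

print(sum(deviations), sum(squared))
print(round(variance, 2), round(st_dev, 2), aad)
```

The deviations always sum to zero, which is why they are squared (or taken in absolute value) before averaging.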
Properties of Standard Deviation
s >= 0
s = 0 when all observations have the same value; it grows larger as points are spread further from the mean
s is, roughly, the typical distance of an observation from the mean
Most commonly used measure of dispersion
Comparable across different sample sizes
The Empirical Rule
$AAD = \frac{\sum_{i=1}^{N} |d_i|}{N} = \frac{\sum_{i=1}^{N} |Y_i - \bar{Y}|}{N}$

How to interpret this box plot?
Measuring Skewness
Is the distribution symmetrical?
Skewness measures the degree of asymmetry around a measure of central tendency
  Zero = perfectly symmetrical
  Higher number = increasingly skewed
A “tail” is referred to as “skewness”
  Tail on left = skewed to left = negative skew
  Tail on right = skewed to right = positive skew
Pearson’s Coefficient of Skewness
  Based on distance from Mean to Median
  Mean moves more if there are extreme cases, as when there is a “tail”
  $skew = \frac{3(\bar{Y} - Mdn)}{s_Y}$
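As a rough sketch, Pearson's coefficient of skewness can be computed from the mean, median, and sample standard deviation. The function name `pearson_skew` is hypothetical; the two data sets are the hypothetical block of house values and the symmetric list 8 to 12 from earlier slides.

```python
from statistics import mean, median, stdev

def pearson_skew(values):
    """Pearson's coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)

house_values = [255000, 250000, 235000, 250000, 1000000]
symmetric = [8, 9, 10, 11, 12]

skew_houses = pearson_skew(house_values)  # long right tail
skew_symmetric = pearson_skew(symmetric)  # mean equals median

print(round(skew_houses, 3), skew_symmetric)
```

The expensive house pulls the mean above the median, so the coefficient is positive (tail on the right); for the symmetric list it is zero.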
Measuring Skewness
Pearson’s Coefficient of Skewness
Quartile skewness
  Measures distance between median and

Interpreting Skewness
[Figure: histogram of Penn 56 RGDPCH 1990]
Which way is it skewed?
What is the social interpretation?
What would be the skewness?
Example:
How would you describe this variable?
Summary statistics
Measures of central tendency
  Mean, median, mode
Measures of variability
  Range, IQR, variance, s.d., AAD
Measures of skewness
Measures of kurtosis

Measuring Kurtosis:
Is the distribution curve flat or peaked?
$K = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^4 / n}{(s^2)^2} - 3$
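A minimal sketch of the kurtosis formula, assuming s^2 is the sample variance with the n - 1 denominator used elsewhere in these slides. The function name `kurtosis` and the four-case CD data are for illustration only; a sample this small is far too short for a meaningful kurtosis estimate.

```python
from statistics import mean, variance

def kurtosis(values):
    """K = (sum of fourth-power deviations / n) / (s^2)^2 - 3."""
    n = len(values)
    y_bar = mean(values)
    fourth_moment = sum((y - y_bar) ** 4 for y in values) / n
    s2 = variance(values)  # sample variance, n - 1 denominator (assumption)
    return fourth_moment / s2 ** 2 - 3

cds = [3, 5, 1, 7]
k = kurtosis(cds)
print(k)
```

Subtracting 3 centers the measure so that a normal curve scores about zero; negative values suggest a flatter curve and positive values a more peaked one.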
Cumulative Frequency List
Years of Education (N=2904)
Value       Frequency   Percent   Cumulat %
7 or less   21          1.4       3.9
8           82          5.3       9.3
9           51          3.3       12.6
10          70          4.6       17.2
11          95          6.2       23.4
12          489         31.8      55.4
13          125         8.1       63.5
14          184         12.0      75.6
15          76          4.9       80.5
16          152         9.9       90.5
17          40          2.6       93.1
18          61          4.0       97.1
19          18          1.2       98.2
20          27          1.8       100.0

Cumulative % Graphs
[Figure: Cumulative Percentage (0 to 100) plotted against Years of Education (5 to 20)]
Indicates that 55% of students have 12 years of education or less
Rank (Ri)
Cumulative frequency list/curve

Quantile
Percentiles, quartiles, deciles, etc…
General term = quantile
Dividing cases up into a fixed number of equal “chunks”
  100 chunks = percentiles (1% each)
  10 chunks = deciles (10% each)
  5 chunks = quintiles (20% each)
  4 chunks = quartiles (25% each)
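The idea of cutting cases into equal chunks can be sketched with `statistics.quantiles`, which returns the cut points between chunks (so n chunks give n - 1 cut points). The scores 1 to 100 are made-up data for illustration.

```python
from statistics import quantiles

scores = list(range(1, 101))  # hypothetical scores 1..100

quartile_cuts = quantiles(scores, n=4)      # 4 chunks, 25% each
quintile_cuts = quantiles(scores, n=5)      # 5 chunks, 20% each
decile_cuts = quantiles(scores, n=10)       # 10 chunks, 10% each
percentile_cuts = quantiles(scores, n=100)  # 100 chunks, 1% each

print(quartile_cuts)
print(len(quintile_cuts), len(decile_cuts), len(percentile_cuts))
```

The middle quartile cut point is the median, which ties the quantile idea back to the measures of central tendency.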
Z-Score (Standardized Score)
$Z_i = \frac{Y_i - \bar{Y}}{s}$
Unit of Z-scores is “standard deviation”
  A Z-score of -1.1 indicates a case is slightly more than one standard deviation below the mean
  Z = 0.5 → 0.5 St. Dev above the mean
You can convert any or all values of a variable to a common scale
  mean = 0
  negative = below mean
  positive = above mean
  range approximately from -3 to +3. WHY?

Z-Score Example
Number of CD’s: Mean = 32.5, s = 29.8
Case   Num CD’s (Y)   Mean (Y bar)   Deviation (d)   Z-score (di/s)
1      20             32.5           -12.5           -.42
2      40             32.5           7.5             +.25
3      0              32.5           -32.5           -1.1
4      70             32.5           37.5            +1.3
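The Z-score table can be reproduced with a few lines of Python, shown here as a sketch using the sample standard deviation (n - 1 denominator). Variable names are illustrative.

```python
from statistics import mean, stdev

cds = [20, 40, 0, 70]  # number of CDs for the four cases

y_bar = mean(cds)  # 32.5
s = stdev(cds)     # sample standard deviation, about 29.9

# Z_i = (Y_i - Y-bar) / s for every case.
z_scores = [(y - y_bar) / s for y in cds]

for case, (y, z) in enumerate(zip(cds, z_scores), start=1):
    print(case, y, round(z, 2))
```

On the question of why Z-scores mostly fall between -3 and +3: for roughly bell-shaped data, nearly all cases lie within three standard deviations of the mean.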
Summary statistics
Measures of central tendency
  Mean, median, mode
Measures of variability
  Range, IQR, variance, s.d., AAD
Measures of skewness
Measures of kurtosis
Measures of relative position
  Rank, quantile, Z-score

Special case: Dichotomous Variables
Mean, variance, and S.D. are generally NOT too useful for nominal variables
Exception: Mean of dichotomous variables
  Dichotomous variable = nominal, w/ 2 categories, often called “dummy” variables
  E.g.: Do you approve of gun control (yes/no)?
  People saying “yes” assigned 1, no = 0
Dichotomous (Dummy) Variables:
1 = Presence of something, 0 = absence of it
Person   View On Gun Control   Support? (Dummy)
1        Favor                 1
2        Oppose                0
3        Favor                 1
4        Favor                 1
5        Oppose                0
1 = Presence of support for gun control
0 = Absence of support for gun control

Dichotomous Variables
Interpretation: Mean = proportion indicating yes
Example: “Do you approve of gun control?”
  14 yes, 24 no. (37% yes, 63% no)
  Mean of variable = .37
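A short sketch of the dummy-variable interpretation, using the 14 yes / 24 no split from the example. The names `responses` and `dummy` are illustrative.

```python
# 14 "yes" and 24 "no" answers to "Do you approve of gun control?"
responses = ["yes"] * 14 + ["no"] * 24

# Code the dummy variable: yes = 1, no = 0.
dummy = [1 if r == "yes" else 0 for r in responses]

# The mean of a 0/1 variable is the proportion of 1s.
proportion_yes = sum(dummy) / len(dummy)

print(round(proportion_yes, 2))
```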
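Mean center: $(\bar{X}, \bar{Y})$
Areal data:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}; \quad \bar{y} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$

Mean of X = 78/8 = 9.8
Mean of Y = 99/8 = 12.4
Weighted mean of X = 178.3/18.2 = 9.8
Weighted mean of Y = 196.7/18.2 = 10.8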
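The weighted mean formulas can be sketched as a small function. The coordinates and weights below are hypothetical (the slide's underlying data table is not reproduced here), and `weighted_mean_center` is an illustrative name.

```python
def weighted_mean_center(points, weights):
    """Return (x-bar, y-bar) where each point counts in proportion to its weight."""
    total_w = sum(weights)
    x_bar = sum(w * x for (x, _), w in zip(points, weights)) / total_w
    y_bar = sum(w * y for (_, y), w in zip(points, weights)) / total_w
    return x_bar, y_bar

# Hypothetical areal units: (x, y) locations and a weight such as population.
points = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
weights = [1.0, 3.0, 1.0]

center = weighted_mean_center(points, weights)
unweighted = weighted_mean_center(points, [1.0] * len(points))

print(center, unweighted)
```

The heavily weighted unit at (10, 0) pulls the weighted center toward it, compared with the unweighted mean center.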
Mean Center

Geographic Data: Median Center
NOT the point defined by the medians of x and y
Minimum aggregated distance to all points: the point $(X_0, Y_0)$ that minimizes
$\sum_{i=1}^{n} \sqrt{(X_i - X_0)^2 + (Y_i - Y_0)^2}$

Standard distance (SD)
$SD = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2 + \sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n}}$
$SD = \sqrt{\frac{\sum_{i=1}^{n} d_{ic}^2}{n}}$
$SD = \sqrt{s_x^2 + s_y^2}$

Relative distance (RD)
$RD = \frac{SD}{r}$
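As an illustration, the mean center and the standard distance can be computed together, where the standard distance is the square root of the average squared distance from each point to the mean center. The four points are hypothetical and `standard_distance` is an illustrative name.

```python
from math import sqrt

def standard_distance(points):
    """Return the mean center and the standard distance of a set of (x, y) points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    # Sum of squared distances from each point to the mean center.
    sum_sq = sum((x - x_bar) ** 2 + (y - y_bar) ** 2 for x, y in points)
    return (x_bar, y_bar), sqrt(sum_sq / n)

# Hypothetical points at the corners of a 2 x 2 square.
points = [(0, 0), (2, 0), (0, 2), (2, 2)]
center, sd = standard_distance(points)

print(center, sd)
```

Every corner lies sqrt(2) from the center (1, 1), so the standard distance is also sqrt(2).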
SPSS: Descriptive Statistics

Summary
Measures of central tendency
  Mean, median, mode
Measures of variability, skewness, kurtosis
  Variance, standard deviation, range, IQR
  Pearson’s coefficient, Quartile skewness, kurtosis
Measures of relative position
  Rank, quantile, Z-score
Dummy variables
Measures for geographic data
  Mean center, median center, standard distance, relative distance