Descriptive Statistics (II)


Descriptive Statistics
† Tabulation
† Graph
† Mapping
† Summary statistics
„ Measures of central tendency
„ Measures of variability, skewness, kurtosis
„ Measures of relative position
„ Geographic data
GOG 502/PLN 504 Youqin Huang

Summary Measures for Frequency Curve

Measuring the Central Tendency
† The “center” of a distribution, “typical” case
„ Mean, median, mode

Variables
† Each column of a dataset is considered a variable, generally referred to as “Y” or “X”

The variable “Y”:
Person #   Guns owned
1          0
2          3
3          0
4          1
5          1

Central Tendency: Mean
† Arithmetic mean, or “average”, “Y-bar”
„ Sum of the Y for all cases divided by the number of subjects
„ Most frequently used measure
Calculating the Mean
† Yi represents the “ith” case of variable Y
† i goes from 1 to n
† Y1 = value of Y for the first case in the spreadsheet
† Y2 = value for the second case, etc.
† Yn = value for the last case

Person #   Guns owned (Y)
1          Y1 = 0
2          Y2 = 3
3          Y3 = 0
4          Y4 = 1
5          Y5 = 1

Calculating the Mean
$$\sum_{i=1}^{5} Y_i = Y_1 + Y_2 + Y_3 + Y_4 + Y_5 = 0 + 3 + 0 + 1 + 1 = 5$$
$$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i = \frac{1}{5} \times 5 = 1$$
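The calculation above can be checked with a few lines of Python. This is only a sketch using the five values from the guns-owned table; the variable names are illustrative.

```python
# Guns owned by each of the five persons in the table (Y1..Y5).
guns_owned = [0, 3, 0, 1, 1]

n = len(guns_owned)      # number of cases
total = sum(guns_owned)  # sum of Y over all cases
mean_guns = total / n    # Y-bar = (1/n) * sum of Y

print(total, mean_guns)
```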

Mean of Groups
† Mean of groups is the weighted mean (average) of group means.
† Two groups of size n1, n2
$$\bar{Y} = (n_1 \bar{Y}_1 + n_2 \bar{Y}_2)/(n_1 + n_2)$$
† More generally,
$$\bar{Y} = \frac{\sum_{j=1}^{k} M_j f_j}{n}$$

Central Tendency: Mean
† Grouped data
$$\bar{Y} = \frac{\sum_{j=1}^{k} M_j f_j}{n} = \frac{9817}{86} = 114.15$$
What if the last class is open-ended?
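A minimal sketch of the weighted mean follows. The group sizes and means are hypothetical, and the function name is illustrative. The same function could be applied to grouped data if M_j is read as the midpoint of class j and f_j as its frequency, which is an assumption here since the slide does not define those symbols.

```python
def mean_of_groups(sizes, means):
    """Weighted mean of group means: sum(n_j * mean_j) / sum(n_j)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# Hypothetical example: 10 cases averaging 2.0 and 30 cases averaging 4.0.
combined = mean_of_groups([10, 30], [2.0, 4.0])
print(combined)  # the larger group pulls the result toward 4.0
```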

Properties of the Mean
† Pros:
„ Gives a sense of “typical” case
„ Useful for continuous data
„ Easy to calculate
„ Center of gravity
$$\sum_{i=1}^{n} (Y_i - \bar{Y}) = 0$$
$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 < \sum_{i=1}^{n} (Y_i - A)^2 \quad \text{for any } A \neq \bar{Y}$$

Properties of the Mean
† Cons:
„ Every case influences outcome
„ Extreme cases (outliers) affect results a lot (e.g. mean income is often not very meaningful)
„ Doesn’t give you a full sense of the distribution
„ Appropriate only for quantitative data
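The two "center of gravity" properties can be verified numerically. The sketch below reuses the guns-owned values; the alternative centers A are arbitrary choices made for illustration.

```python
y = [0, 3, 0, 1, 1]
y_bar = sum(y) / len(y)

# Property 1: deviations from the mean sum to zero.
sum_dev = sum(v - y_bar for v in y)

def sum_sq(a):
    """Sum of squared deviations of y around an arbitrary center a."""
    return sum((v - a) ** 2 for v in y)

# Property 2: the sum of squares is smallest when the center is the mean.
ss_mean = sum_sq(y_bar)
ss_other = {a: sum_sq(a) for a in (0, 0.5, 2, 3)}

print(sum_dev, ss_mean, ss_other)
```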

The Mean and Extreme Values

Case   Num CD’s   Num CD’s2
1      20         20
2      40         40
3      0          0
4      70         1000
Mean   32.5       265

Hypothetical Block After One Heck of a Remodel:
Mean housing price/value is not very meaningful
[Figure: house values on the block: 235000, 250000, 245000, 260000, 240000, 255000, 255000, 250000, 235000, 250000, and one remodeled house at 1000000]
© 2011 Taylor and Francis
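The CD table can be reproduced with Python's standard `statistics` module. As a sketch, it also computes the median of each column to show how differently the two measures react to the single extreme case.

```python
from statistics import mean, median

cds = [20, 40, 0, 70]      # column "Num CD's"
cds2 = [20, 40, 0, 1000]   # column "Num CD's2": case 4 is now extreme

print(mean(cds), mean(cds2))      # the mean jumps from 32.5 to 265
print(median(cds), median(cds2))  # the median is the same for both columns
```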

Central Tendency: Median
† The middle measurement of a ranked sample: position (n+1)/2.
† If n is odd, median is a single measurement; if n is even, median is the midpoint between the two middle measurements
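The odd/even rule can be sketched as a small function. The name `median_of` is illustrative; the two samples are taken from tables elsewhere in these slides.

```python
def median_of(values):
    """Median of a sample: rank the values, then take the middle one(s)."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single measurement at rank (n + 1) / 2.
        return ranked[mid]
    # Even n: midpoint between the two middle measurements.
    return (ranked[mid - 1] + ranked[mid]) / 2

print(median_of([0, 3, 0, 1, 1]))  # odd n
print(median_of([3, 5, 1, 7]))     # even n
```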

Central Tendency: Median
† Appropriate for both ratio and ordinal data, but not for nominal data
† Same as the mean for symmetric distributions; for skewed distributions, the median lies toward the shorter tail relative to the mean
† What is the median education?
Central Tendency: Median
† Pros:
„ Unaffected by outliers (appropriate for variables such as income, housing price)
† Cons:
„ Insensitive to the distances of the measurements from the middle.
† 8, 9, 10, 11, 12
† 1, 2, 10, 100, 500

Central Tendency: Mode
† The value that occurs most frequently -- the “modal” value
† Appropriate for all types of data.
„ Commonly used for categorical (nominal, ordinal) data
„ Only useful for continuous (interval/ratio) variables if you have grouped data
† Otherwise, all values may very likely be unique
† Modes = Peaks
„ Uni-modal distribution: One peak
„ Bi-modal distribution: Two peaks
„ Multi-modal distribution: Multiple peaks (usually more than two).
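A short sketch of both points: the two samples listed under the median's cons share a median despite very different spreads, and `statistics.multimode` reports every peak. The `education` list is hypothetical data invented for illustration.

```python
from statistics import median, multimode

tight = [8, 9, 10, 11, 12]
wide = [1, 2, 10, 100, 500]
print(median(tight), median(wide))  # identical medians, very different spreads

# Hypothetical years of education with peaks at 12 and 16.
education = [12, 12, 12, 16, 16, 16, 14, 10]
modes = multimode(education)
print(modes)  # every value tied for the highest frequency
```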

Central Tendency: Mode
Why is the distribution bimodal?

Reasons for Multi-Modal Distributions
† The sample is heterogeneous (i.e., made up of more than one group)
„ Height forms a bell-shaped distribution for men and for women, but the peaks are different. A combined sample has two peaks
† The sample reflects some exogenous structural ordering process
„ Years of education completed is peaked at 12 (high school), 16 (college)

Mode
† Pro: Easy, useful
† Con:
„ Not necessarily close to the center
„ Not very helpful (even misleading) in certain circumstances, e.g. if there are many peaks, or a single unusual one; if the variable is distributed quite evenly
Comparing Mean, Median, Mode
† For a symmetrical unimodal distribution, the three are identical
† For a smoothed unimodal frequency curve, the mode defines the peak value; the median divides the area under the curve into two equal parts; the mean divides the curve into two equally balanced parts through the center of gravity

Comparing Mean, Median, Mode
† Both mean and median can be easily calculated for grouped or ungrouped data. Mode is usually used for grouped data
† Unequal class intervals in grouped data do not hinder the calculation of the mean or median, but severely limit the calculation of the mode
† The presence of an open-ended class does not affect the median or mode, but severely limits the calculation of the mean.

Levels of Measurement and Measures of the Centre

         Nominal   Ordinal   Ratio
Mode     YES       YES       YES
Median   NO        YES       YES
Mean     NO        NO        YES

If appropriate, report all three. The differences between them tell something important about the distribution
© 2011 Taylor and Francis

Summary statistics
† Measures of central tendency
„ Mean, median, mode
† Measures of variability
„ Describing how “spread out” a distribution is around its center

Variability
• Very different groups can have the same means:
[Figure: two histograms. Number of CDs (Group 1): Mean = 101, Std. Dev = 21.72, N = 23. Number of CDs (Group 2): Mean = 100.0, Std. Dev = 67.62, N = 23.]

Variability
Which country would you prefer to live in?
Measures of Variation
† Range (Ymax – Ymin)
„ Doesn’t tell you much about the middle cases
„ Influenced by extreme values… may not be representative
† Interpercentile (usually interquartile) range
„ Percentile: p% of scores fall below it, (100-p)% above it
„ Lower quartile (P25), upper quartile (P75)
„ IQR = P75 - P25
„ Not sensitive to extreme values
„ Outlier: more than 1.5 IQR above the upper quartile, or more than 1.5 IQR below the lower quartile

Quartile and Interquartile Range
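As a sketch, the range, IQR and 1.5 IQR outlier rule can be applied to the house values from the remodeled-block slide. Quartiles can be interpolated in several ways; the `method="inclusive"` option of `statistics.quantiles` is an assumption here, not something the slides specify.

```python
from statistics import quantiles

house_values = [235000, 250000, 245000, 260000, 240000, 255000,
                1000000, 255000, 250000, 235000, 250000]

value_range = max(house_values) - min(house_values)

# Quartile cut points P25, P50, P75 (interpolation method is an assumption).
p25, p50, p75 = quantiles(house_values, n=4, method="inclusive")
iqr = p75 - p25

# Outlier fences: 1.5 IQR beyond the quartiles.
lower_fence = p25 - 1.5 * iqr
upper_fence = p75 + 1.5 * iqr
outliers = [v for v in house_values if v < lower_fence or v > upper_fence]

print(value_range, iqr, outliers)
```

The range is dominated by the single remodeled house, while the IQR reflects only the middle half of the block.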

Measures of Variation
† Deviation
$$d_i = Y_i - \bar{Y}$$
† Variance (sY2)
$$s_Y^2 = \frac{\sum_{i=1}^{n} d_i^2}{n-1} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}$$
† Standard deviation (sY)
$$s_Y = \sqrt{s_Y^2} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$$

Example
Case   Num CD’s
1      3
2      5
3      1
4      7

Variance? Standard Deviation?

Example
Case   Num CD’s   Mean (Ybar)   Deviation (Yi-Ybar)   Square of deviation
1      3          4             -1                    1
2      5          4             1                     1
3      1          4             -3                    9
4      7          4             3                     9
sum                             0                     20

$$s_Y^2 = \frac{\sum_{i=1}^{n} d_i^2}{n-1} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}$$
Variance = 20/(4-1) = 6.67
St.Dev = sqrt(6.67) = 2.58

Properties of Variance (sY2)
† sY2 >= 0
„ Zero if all points cluster exactly on the mean
„ Larger for more “spread” distributions
† Pros:
„ Comparable across samples of different size
„ Better mathematical characteristics than AAD
† Cons:
„ Values get fairly large, due to “squaring”
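The worked example can be reproduced step by step, then cross-checked against the standard library. This is a sketch; `statistics.variance` and `statistics.stdev` use the same n-1 denominator as the slide's formula.

```python
from statistics import mean, stdev, variance

cds = [3, 5, 1, 7]
y_bar = mean(cds)

deviations = [y - y_bar for y in cds]   # Yi - Ybar
squared = [d ** 2 for d in deviations]  # squared deviations

var_manual = sum(squared) / (len(cds) - 1)
sd_manual = var_manual ** 0.5

print(deviations, sum(squared))
print(round(var_manual, 2), round(sd_manual, 2))
print(round(variance(cds), 2), round(stdev(cds), 2))
```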

Properties of Standard Deviation
† s >= 0
† s = 0 when all observations have the same value; grows larger as points are spread further from the mean
† s is, roughly, the typical distance of an observation from the mean (strictly, a root-mean-square distance)
† Most commonly used measure of dispersion
† Comparable across different sample sizes
† The Empirical Rule

The Alternative Measure
♦ Average Absolute Deviation (AAD)
– Very intuitive interpretation
– Has non-ideal statistical properties
$$AAD = \frac{\sum_{i=1}^{N} |d_i|}{N} = \frac{\sum_{i=1}^{N} |Y_i - \bar{Y}|}{N}$$

Central tendency and Variation
† Box plots
How to interpret this box plot?
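A minimal sketch of the AAD, applied to the four-case CD example used for the variance so the two measures can be compared. The function name is illustrative.

```python
from statistics import mean, stdev

def average_absolute_deviation(values):
    """AAD: mean of the absolute deviations from the mean."""
    y_bar = mean(values)
    return sum(abs(y - y_bar) for y in values) / len(values)

cds = [3, 5, 1, 7]
aad = average_absolute_deviation(cds)
print(aad, round(stdev(cds), 2))  # AAD next to the standard deviation
```

For this sample the AAD is smaller than the standard deviation, because squaring gives the larger deviations more weight.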
Measuring Skewness
† Is the distribution symmetrical?
† Skewness: measuring the degree of asymmetry around a measure of central tendency
† Zero = perfectly symmetrical
† Higher number = increasingly skewed

Measuring Skewness
† A “tail” is referred to as “skewness”
„ Tail on left = skewed to left = negative skew
„ Tail on right = skewed to right = positive skew
† Pearson’s Coefficient of Skewness
„ Based on distance from Mean to Median
„ Mean moves more if there are extreme cases, as when there is a “tail”
$$skew = \frac{3(\bar{Y} - Mdn)}{s_Y}$$

Measuring Skewness
† Pearson’s Coefficient of Skewness
† Quartile skewness
„ Measures distance between median and lower & upper quartiles
„ Extreme values move lower/upper quartiles further out, resulting in larger skewness
$$skew = \frac{P_{25} + P_{75} - 2\,Mdn}{2}$$

Interpreting Skewness
[Figure: histogram of Penn 56 RGDPCH 1990; y-axis: Frequency; Mean = 4810.4, Std. Dev = 4915.68, N = 152]
Which way is it skewed?
What is the social interpretation?
What would be the interpretation if it were skewed in the opposite direction?
† Skewness provides information about inequality
„ Example: Economic wealth of nations
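Both skewness measures can be sketched directly from the formulas on these slides. The function names are illustrative, and the quartile interpolation (`method="inclusive"`) is an assumption. The two samples are the ones listed on the median slide.

```python
from statistics import mean, median, quantiles, stdev

def pearson_skew(values):
    """Pearson's coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)

def quartile_skew(values):
    """Quartile skewness as written on the slide: (P25 + P75 - 2*Mdn) / 2."""
    p25, mdn, p75 = quantiles(values, n=4, method="inclusive")
    return (p25 + p75 - 2 * mdn) / 2

symmetric = [8, 9, 10, 11, 12]
right_tailed = [1, 2, 10, 100, 500]

print(pearson_skew(symmetric), quartile_skew(symmetric))        # both zero
print(pearson_skew(right_tailed), quartile_skew(right_tailed))  # both positive
```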

Interpreting Skewness
† Skewness may reflect “floor” or “ceiling” effects
„ Example: Number of crimes committed by individuals in a sample. Lower bound is zero. Mode is very low. A few cases are high.
„ Example: National secondary school enrollment ratio. Cannot exceed 100%

Notes on Skewness
† More often assessed informally “by eye” than calculated as a value.
„ Look at a histogram to identify skewness
† Some statistical techniques work properly only on variables that are not skewed (e.g. empirical rule).
„ It can be very important to identify highly skewed variables.
† Note: mode, skew sound like “jargon”, but are actually quite helpful in communicating descriptive information about your variables
Example:
† How would you describe this variable?

Summary statistics
† Measures of central tendency
„ Mean, median, mode
† Measures of variability
„ Range, IQR, variance, s.d., AAD
† Measures of skewness
† Measures of kurtosis

Measuring Kurtosis:
Is the distribution curve flat or peaked?

Measuring Kurtosis
† Is the distribution curve flat or peaked?
† Negative: flattened curve
† Positive: peaked/pointed curve
† Zero: bell-shaped normal distribution
$$K = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^4 / n}{(s^2)^2} - 3$$

Measuring Relative Position
† Rank (Ri)
„ Sort the data, the position of score
† Cumulative frequency list/curve
„ Number of cases (percentage of cases) falling in or below a given interval
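A sketch of the kurtosis formula above. It takes s² to be the sample variance with the n-1 denominator defined earlier in these slides, which is an assumption about the intended s². Both test samples are made up for illustration.

```python
from statistics import mean, variance

def kurtosis(values):
    """K = (mean fourth-power deviation) / (s^2)^2 - 3."""
    y_bar = mean(values)
    n = len(values)
    fourth_moment = sum((y - y_bar) ** 4 for y in values) / n
    return fourth_moment / variance(values) ** 2 - 3

flat = list(range(1, 101))       # evenly spread values: flattened curve
peaked = [0] * 98 + [-10, 10]    # almost all cases at the center, two far out

print(kurtosis(flat), kurtosis(peaked))
```

The evenly spread sample gives a negative K and the sharply peaked one a positive K, matching the sign conventions in the bullets.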
Cumulative Frequency List
Years of Education (N=2904)

Value       Frequency   Percent   Cumulat %
7 or less   21          1.4       3.9
8           82          5.3       9.3
9           51          3.3       12.6
10          70          4.6       17.2
11          95          6.2       23.4
12          489         31.8      55.4
13          125         8.1       63.5
14          184         12.0      75.6
15          76          4.9       80.5
16          152         9.9       90.5
17          40          2.6       93.1
18          61          4.0       97.1
19          18          1.2       98.2
20          27          1.8       100.0

The row for 12 indicates that 55% of students have 12 years of education or less.

Cumulative % Graphs
[Figure: cumulative percentage curve; y-axis: Cumulative Percentage (0 to 100); x-axis: Years of Education (5 to 20)]
Measuring Relative Position
† Rank (Ri)
† Cumulative frequency list/curve
† Quantile
„ Percentiles, quartiles, deciles, etc…
„ General term = quantile
„ Dividing cases up into fixed number of equal “chunks”
† 100 chunks = percentiles (1% each)
† 10 chunks = deciles (10% each)
† 5 = quintiles (20% each)
† 4 = quartiles (25% each)

Is History Siding With Obama’s Economic Plan? (NY Times)

Benefit of Quantiles
† Quantiles allow you to identify cases (or groups of cases) in relation to the larger group
„ Who is “high”, who is “low”
„ Regardless of the unit of measurement
† Advantage: Allows comparison between variables with different scales (or with different means)
„ Example: Reading test scored 1-100, Math test is scored 1-25. How do you know which you scored better on? Answer: percentile

Measuring Relative Position: Ratio
† Ratio
„ The position of an individual score in relation to some other score (e.g. mean, max, min…)
„ Standardized score (Z-score): deviation from mean divided by standard deviation
$$Z_i = \frac{Y_i - \bar{Y}}{s}$$
Z-Score Example
$$Z_i = \frac{Y_i - \bar{Y}}{s}$$
† Number of CD’s: Mean = 32.5, s = 29.8

Case   Num CD’s (Y)   Mean (Y bar)   Deviation (d)   Z-score (di/s)
1      20             32.5           -12.5           -.42
2      40             32.5           7.5             +.25
3      0              32.5           -32.5           -1.1
4      70             32.5           37.5            +1.3

Z-Score (Standardized Score)
† Unit of Z-scores is “standard deviation”
„ A Z-score of -1.1 indicates a case is just over one standard deviation below the mean
„ Z = 0.5 → 0.5 standard deviations above the mean
† You can convert any or all values of a variable to a common scale
„ mean = 0
„ negative = below mean
„ positive = above mean
„ range approximately from –3 to +3. WHY?
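The Z-score table can be reproduced in a few lines. This sketch uses the sample standard deviation (n-1 denominator) and also checks that the standardized variable has mean 0 and standard deviation 1.

```python
from statistics import mean, stdev

cds = [20, 40, 0, 70]
y_bar = mean(cds)
s = stdev(cds)  # sample standard deviation

z_scores = [(y - y_bar) / s for y in cds]

print(y_bar, round(s, 2))
print([round(z, 2) for z in z_scores])
print(round(mean(z_scores), 10), round(stdev(z_scores), 10))
```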

Z-Scores
† Z-scores can be compared across variables with different units or means
„ Examples: height and weight; a person is -0.3 on math, but 1.2 on income
„ Simple deviations can’t be compared if units of measurement are different
† Convert an entire variable (all cases) to Z-scores, creating a new variable with useful properties
„ Preserves the shape of the distribution, but unit is changed
„ Mean = zero, because it is based on deviations
„ Standard Deviation (sy) = 1
„ Easier to compare different variables

Converting Variables to Z-scores
GSS Data, N=2904
[Figure: two histograms with y-axis Frequency. Left: HIGHEST YEAR OF SCHOOL COMPLETED (x-axis 0 to 20). Right: Z-SCORE: HIGHEST YEAR OF EDUCATION (x-axis -4.56 to 2.27)]

Summary statistics
† Measures of central tendency
„ Mean, median, mode
† Measures of variability
„ Range, IQR, variance, s.d., AAD
† Measures of skewness
† Measures of kurtosis
† Measures of relative position
„ Rank, quantile, Z-score

Special case: Dichotomous Variables
† Mean, variance, and S.D. are generally NOT too useful for nominal variables
† Exception: Mean of dichotomous variables
„ Dichotomous variable = nominal, w/ 2 categories, often called “dummy” variables
„ E.g.: Do you approve of gun control (yes/no)?
„ People saying “yes” assigned 1, no = 0
Dichotomous (Dummy) Variables:
1 = Presence of something, 0 = absence of it

Person   View On Gun Control   Support? (Dummy)
1        Favor                 1
2        Oppose                0
3        Favor                 1
4        Favor                 1
5        Oppose                0

1 = Presence of support for gun control
0 = Absence of support for gun control

Dichotomous Variables
† Interpretation:
„ Mean = proportion indicating yes
† Example: “Do you approve of gun control?”
„ 14 yes, 24 no. (37% yes, 63% no)
„ Mean of variable = .37

Dichotomous Variables
† The Standard Deviation for dichotomous variables
$$s_Y = \sqrt{s_Y^2} = \sqrt{(p_0)(p_1)}$$
• p0 = the proportion of cases scoring 0
• p1 = the proportion of cases scoring 1
• Q: What is sY if the sample is half 0, half 1?
• Answer:
• p0 = 0.5, p1 = 0.5
• square root of 0.25 = 0.5

Geographic Data
† Statistical and spatial distribution
† Summary statistics:
„ Important to accessibility and dispersion
„ Centrality: mean center, median center
„ Dispersion: standard distance, relative distance
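A sketch with the gun-control example (14 yes, 24 no) coded as a dummy variable. Note that sqrt(p0 * p1) matches the standard deviation computed with n in the denominator (`statistics.pstdev`); the n-1 version (`statistics.stdev`) is slightly larger in small samples.

```python
from math import isclose, sqrt
from statistics import mean, pstdev

approve = [1] * 14 + [0] * 24  # 1 = yes, 0 = no

p1 = mean(approve)  # proportion scoring 1
p0 = 1 - p1         # proportion scoring 0
sd_formula = sqrt(p0 * p1)

print(round(p1, 2), round(sd_formula, 3), round(pstdev(approve), 3))

# Half 0, half 1: the standard deviation reaches 0.5.
half = [0, 1] * 10
print(pstdev(half))
```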

Geographic Data: Mean Center
† Minimize the sum of squared distance
„ Point data:
$$(\bar{X}, \bar{Y})$$
„ Areal data:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}; \quad \bar{y} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$

Geographic Data: Mean Center
† Mean of X = 78/8 = 9.8
† Mean of Y = 99/8 = 12.4
† Weighted mean of X = 178.3/18.2 = 9.8
† Weighted mean of Y = 196.7/18.2 = 10.8
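A minimal sketch of the mean center and its weighted version. The coordinates and weights are hypothetical (the slide's underlying points are not listed), and the function name is illustrative.

```python
def mean_center(points, weights=None):
    """Mean center of (x, y) points; with weights, the weighted mean center."""
    if weights is None:
        weights = [1] * len(points)
    total = sum(weights)
    x = sum(w * px for w, (px, _) in zip(weights, points)) / total
    y = sum(w * py for w, (_, py) in zip(weights, points)) / total
    return x, y

# Hypothetical locations at the corners of a square.
points = [(0, 0), (4, 0), (4, 4), (0, 4)]

print(mean_center(points))                # unweighted
print(mean_center(points, [1, 1, 3, 3]))  # heavier weights on the top corners
```

Weighting the two upper corners more heavily pulls the center upward while leaving its x coordinate unchanged.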
Mean Center

Geographic Data: Median Center
† NOT the point defined by the medians of x and y
† Minimum aggregated distance to all points: the point (X0, Y0) that minimizes
$$\sum_{i=1}^{n} \sqrt{(X_i - X_0)^2 + (Y_i - Y_0)^2}$$
† Application in location theory

Median Center

Geographic Data: Standard Distance
† Spatial equivalent to standard deviation
† Average distance of observations to mean center, or radius around mean center
$$SD = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2 + \sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n}}$$
$$SD = \sqrt{\frac{\sum_{i=1}^{n} d_{ic}^2}{n}}$$
$$SD = \sqrt{s_x^2 + s_y^2}$$

Geographic Data: Standard Distance
† Affected by the unit in which distance is measured
† Affected by the study area

Geographic Data: Relative Distance
† Dividing SD by the radius of a circle with area equal to the size of the study area
$$RD = \frac{SD}{r}$$
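Standard distance and relative distance can be sketched as follows. The points and study area are hypothetical, and the function names are illustrative. The radius r is derived from the study area A as r = sqrt(A / pi), following the definition of a circle with area equal to the study area.

```python
from math import isclose, pi, sqrt

def standard_distance(points):
    """Root of the mean squared distance from each point to the mean center."""
    n = len(points)
    xc = sum(x for x, _ in points) / n
    yc = sum(y for _, y in points) / n
    return sqrt(sum((x - xc) ** 2 + (y - yc) ** 2 for x, y in points) / n)

def relative_distance(points, study_area):
    """Standard distance divided by the radius of a circle of equal area."""
    r = sqrt(study_area / pi)
    return standard_distance(points) / r

# Hypothetical points at the corners of a square.
points = [(0, 0), (4, 0), (4, 4), (0, 4)]
sd = standard_distance(points)
rd = relative_distance(points, study_area=4 * pi)  # circle of radius 2

print(round(sd, 3), round(rd, 3))
```

Because RD divides out the size of the study area, it can be compared across study areas in a way that SD alone cannot.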

SPSS: Descriptive Statistics

Summary
† Measures of central tendency
„ Mean, median, mode
† Measures of variability, skewness, kurtosis
„ Variance, standard deviation, range, IQR
„ Pearson’s coefficient, Quartile skewness, kurtosis
† Measures of relative position
„ Rank, quantile, Z-score
† Dummy variables
† Measures for geographic data
„ Mean center, median center, standard distance, relative distance
