L03 ECO220 Print


Percentiles, STATA, Box Plots, Grouped Data, Standardizing, and Other Transformations

Lecture 3

Reading: Sections 5.7 – 5.14

Figure 1: Distribution of Individual Management Score in 2016

Notes: The management score is the unweighted average of the score for each of the 16 questions,
where each question is first normalized to be on a 0-1 scale. The sample is all 2016 CEES surveyors
with at least 11 non-missing responses to management questions and [select firms].

“Do CEOs Know Best? Evidence from China” (2018) http://www.nber.org/papers/w24760

Lecture 3 Slides, ECO220Y1Y, 1


Measures of Relative Standing:
Percentiles
• Example 1: SSAT scores
– Sample SSAT Upper-Level Score Report for a
student named “Jordan”
• https://www.ssat.org/about/scores/report-breakdown
• Example 2: Growth charts for infants
– Birth to 36 months (5th-95th percentile): Boys
Length-for-age and Weight-for-age
• https://www.cdc.gov/growthcharts/clinical_charts.htm
Percentiles help communicate when people are not too familiar
with a context. (E.g. Is a score of 2,100 on the SSAT good or not?
Is a 24-month-old boy who weighs 14 kg big or small?)

Linking Histograms and Percentiles

[Figure: histogram of Inflation Rate, 2011; n = 174 countries, bin width = 5; vertical axis: Fraction; horizontal axis: 0 to 60. Bar-height labels: .482, .350, .074, .034, .023, .011, .005, .005, .005, .005. World Bank data, again.]

• What is approx. median (50th percentile)?
• 2.30th percentile?
• 85.64th percentile?
• What if a country had exactly 5% inflation?



Reading STATA Output
. su inflation_2011, detail

                       inflation_2011
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -2.517798      -4.895247
 5%     .9223603      -2.517798
10%     2.075173      -.3644478       Obs                 174
25%     3.329906      -.2833333       Sum of Wgt.         174

50%     4.977675                      Mean           6.646499
                        Largest       Std. Dev.       6.77998
75%     8.253968       26.09021
90%     12.43155       33.22422       Variance       45.96813
95%     17.71178       47.27686       Skewness       3.773002
99%     47.27686        53.2287       Kurtosis       22.85972

Median? Range?

Trips  Freq.  Percent    Cum.  |  Trips  Freq.  Percent    Cum.
    0    294    35.85   35.85  |     19      1     0.12   95.85
    1     76     9.27   45.12  |     20      3     0.37   96.22
    2     66     8.05   53.17  |     21      2     0.24   96.46
    3     58     7.07   60.24  |     22      4     0.49   96.95
    4     47     5.73   65.98  |     23      1     0.12   97.07
    5     47     5.73   71.71  |     24      4     0.49   97.56
    6     36     4.39   76.10  |     25      2     0.24   97.80
    7     30     3.66   79.76  |     26      4     0.49   98.29
    8     28     3.41   83.17  |     27      2     0.24   98.54
    9     15     1.83   85.00  |     28      3     0.37   98.90
   10      9     1.10   86.10  |     30      1     0.12   99.02
   11     16     1.95   88.05  |     34      1     0.12   99.15
   12     25     3.05   91.10  |     35      1     0.12   99.27
   13      9     1.10   92.20  |     36      1     0.12   99.39
   14      5     0.61   92.80  |     41      1     0.12   99.51
   15      9     1.10   93.90  |     43      1     0.12   99.63
   16      5     0.61   94.51  |     44      1     0.12   99.76
   17      6     0.73   95.24  |     45      1     0.12   99.88
   18      4     0.49   95.73  |     50      1     0.12  100.00
cont’d                         |  Total    820   100.00

• What is this table called?
• What’s the 25th percentile?
• What’s the median?
• What’s the 75th percentile?
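A minimal Python sketch (for illustration only; the lecture itself uses Stata) of reading percentiles off the fishing-trips tabulation: walk down the Cum. column and take the smallest value whose cumulative percent reaches p. That rule is a simplifying assumption and may differ from Stata's exact convention when p falls exactly on a boundary; the helper name `percentile_from_table` is hypothetical.

```python
# Frequencies copied from the fishing-trips tabulation (trips -> respondents).
freq = {
    0: 294, 1: 76, 2: 66, 3: 58, 4: 47, 5: 47, 6: 36, 7: 30, 8: 28, 9: 15,
    10: 9, 11: 16, 12: 25, 13: 9, 14: 5, 15: 9, 16: 5, 17: 6, 18: 4, 19: 1,
    20: 3, 21: 2, 22: 4, 23: 1, 24: 4, 25: 2, 26: 4, 27: 2, 28: 3, 30: 1,
    34: 1, 35: 1, 36: 1, 41: 1, 43: 1, 44: 1, 45: 1, 50: 1,
}
n = sum(freq.values())  # total number of observations


def percentile_from_table(p):
    """Smallest value whose cumulative percent is at least p (assumed rule)."""
    cumulative = 0
    for value in sorted(freq):
        cumulative += freq[value]
        if 100 * cumulative / n >= p:
            return value
    raise ValueError("p must be between 0 and 100")


for p in (10, 25, 50, 75):
    print(p, percentile_from_table(p))
```

Because 35.85% of respondents report zero trips, every percentile up to the 35th is 0 under this rule.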



Discrete Histogram (bin width = 1)

[Figure: histogram of Number Fishing Trips; vertical axis: Density; horizontal axis: 0 to 50.]

Reading a STATA Summary


. summarize Number_of_Trips, detail;

                       Number_of_Trips
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 820
25%            0              0       Sum of Wgt.         820

50%            2                      Mean            4.52439
                        Largest       Std. Dev.      6.684273
75%            6             43
90%           12             44       Variance        44.6795
95%           17             45       Skewness       2.717188
99%           30             50       Kurtosis       13.01081

How can the 10th percentile and the 25th percentile both be zero?


One Popular Use of Percentiles
• Quartiles:
– 1st quartile: obs btwn 0th and 25th percentiles
– 2nd quartile: obs btwn 25th and 50th percentiles
– 3rd quartile: obs btwn 50th and 75th percentiles
– 4th quartile: obs btwn 75th and 100th percentiles
• Quintiles:
– Divide variable into fifths: e.g. top quintile includes obs btwn 80th and 100th percentiles
• Deciles:
– Divide variable into tenths: e.g. bottom decile includes obs btwn 0th and 10th percentiles
Note: You are responsible for knowing the meaning of these
terms if they appear on a test, exam, etc.
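A short Python sketch of placing an observation in a quartile from percentile cutoffs. The cutoffs are the 25th, 50th, and 75th percentiles from the inflation summary earlier in the lecture. The helper name `quartile_of` is hypothetical, and counting a value that falls exactly on a cutoff in the lower quartile is an assumption made only for this illustration.

```python
from bisect import bisect_left

# 25th, 50th, and 75th percentiles of inflation_2011 (from the Stata summary).
cutoffs = [3.329906, 4.977675, 8.253968]


def quartile_of(x):
    """Return 1 to 4; a value equal to a cutoff goes to the lower quartile (assumed)."""
    return bisect_left(cutoffs, x) + 1


print(quartile_of(2.91))  # Canada's 2011 inflation rate, quoted later in the lecture
print(quartile_of(6.0))
```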

Practice Reading and Interpreting

Alesina et al. (2001) “Why Doesn’t the United States Have a European-Style Welfare State?”
What do these numbers mean? How should they be interpreted?



Interquartile Range (IQR)
• Interquartile range: 75th percentile minus 25th
percentile
– Measures spread of middle observations
– What does it measure?


Boxplot of Inflation Distribution, n = 174 countries

[Figure: box plot of Inflation Rate, 2011 (axis 0 to 60). Labelled parts: 25th percentile (3.3299), median, 75th percentile, whiskers, lower adjacent value (LAV), upper adjacent value (UAV), outside values.]

UAV marks biggest obs. within 1.5 IQR’s of the 75th percentile
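A small Python sketch of the box plot arithmetic, using values read from the Stata summary of inflation_2011. The upper limit follows the rule stated on the slide; applying the same 1.5 IQR rule below the 25th percentile for the LAV is assumed here as the usual symmetric convention. Variable names are illustrative.

```python
# Values read from the Stata summary of inflation_2011.
q1, q3 = 3.329906, 8.253968
smallest, largest = -4.895247, 53.2287

iqr = q3 - q1                  # spread of the middle half of the data
upper_limit = q3 + 1.5 * iqr   # the UAV is the biggest obs. at or below this
lower_limit = q1 - 1.5 * iqr   # the LAV is the smallest obs. at or above this (assumed)

print(round(iqr, 4), round(lower_limit, 4), round(upper_limit, 4))
print(smallest < lower_limit, largest > upper_limit)  # outside values on each side?
```

The limits themselves are generally not data points; the adjacent values are the most extreme observations that still fall inside them.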



Is this x1, x2, or x3?

[Figure: four panels. One shows box plots of x1, x2, and x3 (axis -4 to 4); the other three are density histograms, each captioned “Is this x1, x2, or x3?”]

Is this x1, x2, or x3?

[Figure: four panels. One shows box plots of x1, x2, and x3 (axis 4 to 12); the other three are density histograms, each captioned “Is this x1, x2, or x3?”]



Is this x1, x2, or x3?

[Figure: four panels. One shows box plots of x1, x2, and x3 (axis 6 to 14); the other three are density histograms, each captioned “Is this x1, x2, or x3?”]

Outliers
• Outliers: extremely large or small values
different from the bulk of the data
• Robust: not sensitive to outliers
– Is the sample mean a robust measure of central
tendency?
• Is the sample median robust?
• However, the mean retains more information from
sample & has useful statistical properties
– Is the IQR robust? variance?
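One way to explore the robustness questions above is to change a single observation into an extreme value and see which summaries move. The sketch below uses a made-up five-value sample (not course data) and Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical data: a small sample, then the same sample with one extreme value.
sample = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]

print(mean(sample), median(sample))
print(mean(with_outlier), median(with_outlier))
```

In this example only the mean responds to the extreme value; the same experiment can be repeated for the IQR and the variance.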



“Sunlight and Protection Against Influenza”

Note: Unit of observation is a year-month for each of the 36 contiguous [U.S.] states that
have complete flu and sunlight data.

Which kind of data are these: cross-sectional, time series, or panel?


Why 1,404 observations? These are monthly data from Oct. 2008 to
Dec. 2011 (39 months) for 36 states (39*36 = 1,404).
Slusky and Zeckhauser (2018), http://www.nber.org/papers/w24340.pdf

Jan is 1, Feb is 2, ... Each month has 108 obs (36 states * 3 yrs), except that Oct, Nov,
and Dec have 144 obs (36 states * 4 yrs). N = 1,404 (= 9*108 + 3*144)
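The observation count can be checked by enumerating the year-months directly. This Python sketch assumes only what the note states: 36 states, monthly data from Oct. 2008 through Dec. 2011; variable names are illustrative.

```python
# Enumerate year-months from Oct. 2008 through Dec. 2011.
n_states = 36
months = [(year, month)
          for year in range(2008, 2012)
          for month in range(1, 13)
          if (year, month) >= (2008, 10)]

total_obs = n_states * len(months)

# Observations per calendar month (Jan is 1, ..., Dec is 12).
obs_per_month = {m: n_states * sum(1 for (_, mm) in months if mm == m)
                 for m in range(1, 13)}

print(len(months), total_obs)
print(obs_per_month)
```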



Charitable Donors: Stats Can
http://www5.statcan.gc.ca/cansim/a05?lang=eng&id=1110002&pattern=1110002&searchTypeByValue=1&p2=35

Donors and donations                                    2011
Number of taxfilers [4]                           24,841,630
Number of donors [2,3]                             5,709,700
Percentage of donors aged 0 to 24 years [2,3,6]            3
Percentage of donors aged 25 to 34 years [2,3,6]          12
Percentage of donors aged 35 to 44 years [2,3,6]          17
Percentage of donors aged 45 to 54 years [2,3,6]          23
Percentage of donors aged 55 to 64 years [2,3,6]          21
Percentage of donors aged 65 years and over [2,3,6]       25

[2] A charitable donor is defined as a taxfiler reporting a charitable donation amount on line 340 of the personal income tax form.

Average Age of Donors?


Section 5.7 “Grouped Data” tells how to approximate the mean & s.d. with grouped data.

Age group            [value used]   %
% aged 0 to 24       [21]           3
% aged 25 to 34      [29.5]        12
% aged 35 to 44      [39.5]        17
% aged 45 to 54      [49.5]        23
% aged 55 to 64      [59.5]        21
% aged 65 and over   [70]          25

$$\text{Mean} \approx 0.03 \cdot 21 + 0.12 \cdot 29.5 + 0.17 \cdot 39.5 + 0.23 \cdot 49.5 + 0.21 \cdot 59.5 + 0.25 \cdot 70 \approx 52.3 \text{ years}$$

What if we use 75 years old for last category? Then mean ≈ 53.5.
What if we use 12 years old for first category? Then mean ≈ 52.0.
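The grouped-mean approximation and its sensitivity to the value chosen for the open-ended categories can be sketched in Python as follows. The ages and shares are those on the slide; the published percentages sum to 101 because of rounding, and this sketch follows the slide in not rescaling them. The function name `grouped_mean` is hypothetical.

```python
# (representative age, share of donors) for each age group, as on the slide.
groups = [(21, 0.03), (29.5, 0.12), (39.5, 0.17),
          (49.5, 0.23), (59.5, 0.21), (70, 0.25)]


def grouped_mean(groups):
    """Share-weighted sum of representative values (the slide's approximation)."""
    return sum(age * share for age, share in groups)


base = grouped_mean(groups)
older_last = grouped_mean(groups[:-1] + [(75, 0.25)])    # 75 for "65 and over"
younger_first = grouped_mean([(12, 0.03)] + groups[1:])  # 12 for "0 to 24"

print(round(base, 1), round(older_last, 1), round(younger_first, 1))
```

The last category carries a quarter of the donors, so its assumed age moves the estimate far more than the first category's does.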



Logic of Calculation: Smaller Example
• Survey a random sample of 40 A&S students, asking each how many
courses they are currently taking. A tabulation:
num_courses |      Freq.     Percent        Cum.
------------+-----------------------------------
          2 |          3        7.50        7.50
          4 |          7       17.50       25.00
          5 |         28       70.00       95.00
          6 |          2        5.00      100.00
------------+-----------------------------------
      Total |         40      100.00

$$\bar{X} = \frac{\sum x}{n} = \frac{\sum 2 + \sum 4 + \sum 5 + \sum 6}{40} = \frac{3 \cdot 2 + 7 \cdot 4 + 28 \cdot 5 + 2 \cdot 6}{40}$$

$$= 0.075 \cdot 2 + 0.175 \cdot 4 + 0.700 \cdot 5 + 0.050 \cdot 6 = 4.65$$
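The two routes to the mean (weighting by counts, or by relative frequencies) can be checked with a few lines of Python. The frequencies come from the tabulation; the variable names are illustrative.

```python
# num_courses -> frequency, from the tabulation of 40 students.
freq = {2: 3, 4: 7, 5: 28, 6: 2}
n = sum(freq.values())

mean_from_counts = sum(x * f for x, f in freq.items()) / n
mean_from_shares = sum(x * (f / n) for x, f in freq.items())

print(mean_from_counts, mean_from_shares)
```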

Similarly for standard deviation


num_courses |      Freq.     Percent        Cum.
------------+-----------------------------------
          2 |          3        7.50        7.50
          4 |          7       17.50       25.00
          5 |         28       70.00       95.00
          6 |          2        5.00      100.00
------------+-----------------------------------
      Total |         40      100.00

$$s = \sqrt{\frac{\sum (x - \bar{X})^2}{n - 1}}$$

$$= \sqrt{\frac{\sum (2 - 4.65)^2 + \sum (4 - 4.65)^2 + \sum (5 - 4.65)^2 + \sum (6 - 4.65)^2}{39}}$$

$$= \sqrt{\frac{3 (2 - 4.65)^2 + 7 (4 - 4.65)^2 + 28 (5 - 4.65)^2 + 2 (6 - 4.65)^2}{39}} = 0.89$$

[= 0.70 but = 0.736842]



Short-cut for standard deviation
num_courses |      Freq.     Percent        Cum.
------------+-----------------------------------
          2 |          3        7.50        7.50
          4 |          7       17.50       25.00
          5 |         28       70.00       95.00
          6 |          2        5.00      100.00
------------+-----------------------------------
      Total |         40      100.00

$$s = \sqrt{\frac{\sum (x - \bar{X})^2}{n - 1}}$$

$$= \sqrt{\frac{\sum (2 - 4.65)^2 + \sum (4 - 4.65)^2 + \sum (5 - 4.65)^2 + \sum (6 - 4.65)^2}{40} \cdot \frac{40}{39}}$$

$$= \sqrt{\left[ 0.075 (2 - 4.65)^2 + 0.175 (4 - 4.65)^2 + 0.7 (5 - 4.65)^2 + 0.05 (6 - 4.65)^2 \right] \cdot \frac{40}{39}}$$

= 0.89. And, if you ignore 40/39, you get 0.88 (very close to right answer)
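A Python sketch comparing the exact sample s.d. (dividing by n - 1 = 39) with the relative-frequency short-cut that ignores the 40/39 factor. The data are the tabulated frequencies; the variable names are illustrative.

```python
from math import sqrt

freq = {2: 3, 4: 7, 5: 28, 6: 2}   # num_courses -> frequency
n = sum(freq.values())
x_bar = sum(x * f for x, f in freq.items()) / n

# Sum of squared deviations, counting each value as often as it occurs.
ss = sum(f * (x - x_bar) ** 2 for x, f in freq.items())

s_exact = sqrt(ss / (n - 1))   # divides by n - 1 = 39
s_shortcut = sqrt(ss / n)      # relative-frequency version, ignoring 40/39

print(round(s_exact, 2), round(s_shortcut, 2))
```

The two differ only by the factor sqrt(40/39), which shrinks toward 1 as the sample grows.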

Standard Deviation of Age of Donors?


% aged 0 - 24      [21]     3
% aged 25 - 34     [29.5]  12
% aged 35 - 44     [39.5]  17
% aged 45 - 54     [49.5]  23
% aged 55 - 64     [59.5]  21
% aged 65 & over   [70]    25

$$s^2 \approx 0.03 (21 - 52.3)^2 + 0.12 (29.5 - 52.3)^2 + 0.17 (39.5 - 52.3)^2 + 0.23 (49.5 - 52.3)^2 + 0.21 (59.5 - 52.3)^2 + 0.25 (70 - 52.3)^2 = 210.6 \text{ years}^2$$

$$s.d. \approx \sqrt{210.6} = 14.5 \text{ years}$$
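The grouped s.d. calculation can be reproduced in Python as below. It uses the slide's representative ages and shares, plugs in the rounded grouped mean of 52.3 as the slide does, and (like the short-cut for the course example) ignores the n/(n - 1) adjustment. Variable names are illustrative.

```python
from math import sqrt

groups = [(21, 0.03), (29.5, 0.12), (39.5, 0.17),
          (49.5, 0.23), (59.5, 0.21), (70, 0.25)]
mean_age = 52.3   # grouped mean, rounded as on the slide

# Share-weighted squared deviations: units are years squared.
variance = sum(share * (age - mean_age) ** 2 for age, share in groups)
sd = sqrt(variance)  # back in years

print(round(variance, 1), round(sd, 1))
```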



Standardization (“z-scores”)
• Standardize: $z = \frac{x - \bar{X}}{s}$
– z: how many s.d.’s a value is from the mean (+ if above; - if below)
– Z has a mean of 0 and s.d. of 1 and no units
– Eg: mean inflation 6.64, s.d. 6.78; 2.91 in Canada: z = -0.55 = (2.91 - 6.64)/6.78
– What does -0.55 mean?

[Figure: two density histograms, n = 174 countries. Top: Inflation Rate, 2011 (axis 0 to 60). Bottom: the same data standardized (z-scores, axis -2 to 6).]
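A Python sketch of standardizing: first the Canada example with the mean and s.d. quoted on the slide, then a made-up five-value sample showing that z-scores have mean 0 and s.d. 1. The function name `z_score` and the sample are hypothetical.

```python
from statistics import mean, stdev


def z_score(x, x_bar, s):
    """Number of standard deviations that x lies from the mean."""
    return (x - x_bar) / s


# Canada's 2011 inflation, with the mean and s.d. quoted on the slide.
z_canada = z_score(2.91, 6.64, 6.78)
print(round(z_canada, 2))

# Hypothetical sample: after standardizing, the mean is 0 and the s.d. is 1.
data = [1.0, 3.0, 4.0, 6.0, 11.0]
z = [z_score(x, mean(data), stdev(data)) for x in data]
print(round(mean(z), 10), round(stdev(z), 10))
```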

Linear Transformations
• Linear transformation can be written as
Y = a + bX where a and b are constants
– Linear transformation of X?
• V = 200 – X
• W = X² – 1 = (X – 1)(X + 1)
• Y = (X - 10)/2
– Linear transformations change scale of a variable
but not the shape of the distribution
• Conditional on the numeric values of the mean and
s.d., standardization is a linear transformation
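A Python sketch of what a linear transformation does to the mean, the s.d., and the shape, using Y = (X - 10)/2 rewritten as a + bX with a = -5 and b = 0.5. The X values are made up for illustration, and the helper `standardize` is hypothetical.

```python
from statistics import mean, stdev

x = [4.0, 8.0, 10.0, 12.0, 16.0]   # hypothetical values of X
a, b = -5.0, 0.5                    # Y = (X - 10)/2 rewritten as a + bX
y = [a + b * xi for xi in x]

# Mean and s.d. of a linear transformation follow directly from those of X.
print(mean(y), a + b * mean(x))
print(stdev(y), abs(b) * stdev(x))


def standardize(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# With b > 0 the z-scores are unchanged, so the shape is unchanged.
same_shape = all(abs(zx - zy) < 1e-12
                 for zx, zy in zip(standardize(x), standardize(y)))
print(same_shape)
```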


[Figure: three histograms (vertical axis: Fraction). CIA data again, US$, PPP, 2012 est., n = 185 countries.
– GDP per capita: mean = 14955, med = 9100, sd = 16243
– GDP per capita ($1000s): mean = 14.955, med = 9.100, sd = 16.243
– ln(GDP per capita): mean = 8.972, med = 9.116, sd = 1.249]

Non-linear transformations (natural log is very popular) can often transform skewed data to be more symmetric.

Linear transformations (such as changing units) do not affect the shape at all.
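The contrast between a change of units and a log transformation can be illustrated in Python with a skewness measure. The income values below are made up (each is double the previous one, so their logs are evenly spaced); they are not the CIA data. The `skewness` helper is a hypothetical implementation of the third standardized moment.

```python
from math import log
from statistics import mean, pstdev


def skewness(values):
    """Third standardized moment (population version); 0 for symmetric data."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Hypothetical right-skewed incomes: each value is double the previous one.
income = [1000, 2000, 4000, 8000, 16000, 32000, 64000]

in_thousands = [v / 1000 for v in income]   # linear: change of units
logged = [log(v) for v in income]           # non-linear: natural log

print(round(skewness(income), 2))
print(round(skewness(in_thousands), 2))     # same shape, so same skewness
print(round(skewness(logged), 2))           # symmetric after taking logs
```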

[Figure: three histograms (vertical axis: Fraction). World Bank data again, Central gov’t debt, n = 48 countries.
– Gov’t debt (% GDP), 2010: mean = 58.47, med = 49.34, sd = 34.37
– Gov’t debt (% GDP), 2005: mean = 53.24, med = 46.65, sd = 35.09
– Change: 2005 to 2010: mean = 5.23, med = 5.30, sd = 22.58]

A linear combination of X and Y is a + bX + cY, where a, b, and c are constants.

Change = Debt10 – Debt05
5.23 = 58.47 – 53.24

Linear combinations have a simple effect on the mean, but not on the median or std. dev.
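A Python sketch of why the mean of a difference is the difference of the means while the median and s.d. follow no such rule. The three-country data are made up for illustration. The slide's own figures show the same pattern: 58.47 - 53.24 = 5.23 matches the mean change, but 49.34 - 46.65 = 2.69 is not the median change of 5.30.

```python
from statistics import mean, median, stdev

# Hypothetical debt ratios for three countries in two years.
debt10 = [10, 50, 90]
debt05 = [40, 20, 60]
change = [x - y for x, y in zip(debt10, debt05)]   # a = 0, b = 1, c = -1

print(mean(change), mean(debt10) - mean(debt05))        # equal
print(median(change), median(debt10) - median(debt05))  # need not be equal
print(stdev(change), stdev(debt10) - stdev(debt05))     # need not be equal
```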



Recap
• Continued previous lecture to describe a
single interval (quantitative) variable
– Reviewed histograms and learned box plots
– Percentiles and quartiles, quintiles, and deciles
– Learned how to read a Stata summary (output)
– Estimated mean and s.d. of grouped data
– Discussed both linear transformations (e.g.
standardizing) and non-linear transformations
(e.g. taking a log) and impacts of each