Intro W03 Rev


Intro to Statistics

B232KL : Thursday 16.00 – 18.30

Week 3: Numerical Descriptive Measures


Albertus Bramantyo
[email protected]
Summarizing Data Numerically
The numerical summary measures:
1. Center of a distribution (shows a typical value)
2. Spread/dispersion of a distribution
3. Position (e.g., the top 10% of salaries for fresh graduates)

Notes:
If the measures are computed for data from a sample, they are called sample statistics.
If the measures are computed for data from a population, they are called population
parameters.
A sample statistic is referred to as the point estimator of the corresponding population
parameter.
Mean
The mean for ungrouped data is obtained by dividing the
sum of all values by the number of values in the data set. Thus,

Mean for population data:

$$\mu = \frac{\sum x}{N}$$

Mean for sample data:

$$\bar{x} = \frac{\sum x}{n}$$

where $\sum x$ is the sum of all values; $N$ is the population size; $n$ is the sample size; $\mu$ is the population mean; and $\bar{x}$ is the sample mean.
Mean example:
Table 3.1 lists the total revenues (rounded
to billions of dollars) of 10 U.S. companies
for the year 2018 (Fortune.com, August
2019).

Solution:

Thus, the average revenue of these 10 companies was $209.8 billion in 2018.
Median
 Definition
The median is the value that divides a data set that has been ranked in
increasing order into two equal halves.
1. If the data set has an odd number of values, the median is given by the value
of the middle term in the ranked data set.
2. If the data set has an even number of values, the median is given by the
average of the two middle values in the ranked data set.

 Notes:
 The median gives the center of a histogram, with half the data values to the left
of the median and half to the right of the median.
 The advantage of using the median as a measure of central tendency is that it is
not influenced by outliers.
 Consequently, the median is preferred over the mean as a measure of center
for data sets that contain outliers.
Mean and Median application:
The following data give lunch expenses in Prasmul canteen (in 000 Rp) for a
sample of 20 students on Friday, 8 March 2024.
25;35;27;45;33;28;29;42;33;28;45;35;35;40;22;29;40;35;35;30
a) Calculate the mean
b) Calculate the median
c) Give some interpretation of these statistics.
d) Assume you intend to open a canteen business in this location.
What are some considerations, relevant to these two statistics, before you
start the business?
Solution
a) $$\bar{x} = \frac{22 + 25 + 27 + 28 + 28 + 29 + 29 + 30 + 33 + 33 + 35 + 35 + 35 + 35 + 35 + 40 + 40 + 42 + 45 + 45}{20} = \frac{671}{20} = 33.55$$

b) n = 20 is an even number, so the median position is $\frac{20 + 1}{2} = 10.5$, that is, midway between the 10th and 11th ranked values.

Ranked data:
22, 25, 27, 28, 28, 29, 29, 30, 33, 33, 35, 35, 35, 35, 35, 40, 40, 42, 45, 45

Median value: $$\frac{x_{10} + x_{11}}{2} = \frac{33 + 35}{2} = 34$$

c) For these 20 students, the mean (33.55) and the median (34) are close, so both measures locate
the center at almost the same place. The sample distribution is therefore likely to be roughly
symmetric (skewness close to zero).

d) How many students are likely to become your customers? The likely daily revenue would be the
number of customers times the typical expense (mean or median); costs must still be subtracted to
estimate profit.
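
Parts (a) and (b) can be checked with a short sketch in Python. It assumes the standard-library `statistics` module; the variable names are illustrative and not part of the original exercise.

```python
from statistics import mean, median

# Lunch expenses (in thousand Rp) for the 20 sampled students
lunch = [25, 35, 27, 45, 33, 28, 29, 42, 33, 28,
         45, 35, 35, 40, 22, 29, 40, 35, 35, 30]

lunch_mean = mean(lunch)      # sum of all values divided by n
lunch_median = median(lunch)  # n is even: average of the 10th and 11th ranked values

print(f"mean = {lunch_mean:.2f}, median = {lunch_median}")
```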
Mode
• The mode of a data set is the value that occurs with the
greatest frequency. It is more suitable for categorical
data (a limited number of categories) than for continuous
quantitative data.
• The greatest frequency can occur at two or more
different values.
• If the data have exactly two modes, the data are
bimodal.
• If the data have more than two modes, the data are
multimodal.
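
As an illustration, the mode of the lunch-expense sample used in this deck can be found with `statistics.multimode`, which returns every value tied for the greatest frequency. This is a minimal sketch; the second list is a hypothetical data set added only to show the bimodal case.

```python
from statistics import multimode

# Lunch expenses (in thousand Rp) from the canteen example
lunch = [25, 35, 27, 45, 33, 28, 29, 42, 33, 28,
         45, 35, 35, 40, 22, 29, 40, 35, 35, 30]

lunch_modes = multimode(lunch)        # all values sharing the highest count

# Hypothetical data with two equally frequent values (bimodal)
bimodal_example = [1, 1, 2, 2, 3]
bimodal_modes = multimode(bimodal_example)

print("lunch modes:", lunch_modes)
print("bimodal example modes:", bimodal_modes)
```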
Weighted Mean

$$\text{Weighted mean} = \frac{\sum xw}{\sum w}$$

where x and w denote the variable and the weights,
respectively.

Example:
Maura bought gas for her car four times during June 2015. She bought 10
gallons at a price of $2.60 a gallon, 13 gallons at a price of $2.80 a gallon, 8
gallons at a price of $2.70 a gallon, and 15 gallons at a price of $2.75 a gallon.
What is the average price that Maura paid for gas during June 2015?

$$\text{Weighted mean} = \frac{\sum xw}{\sum w} = \frac{125.25}{46} \approx \$2.72$$
Thus, Maura paid an average of $2.72 a gallon for the gas she bought in June
2015.
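
The same calculation can be sketched in Python, treating the price per gallon as the variable x and the gallons bought as the weight w. The variable names are assumptions made for this illustration.

```python
# Maura's four purchases: price per gallon (x) and gallons bought (w)
prices = [2.60, 2.80, 2.70, 2.75]
gallons = [10, 13, 8, 15]

total_cost = sum(p * g for p, g in zip(prices, gallons))  # sum of x*w
total_gallons = sum(gallons)                              # sum of w
weighted_mean = total_cost / total_gallons

print(f"sum(xw) = {total_cost:.2f}, sum(w) = {total_gallons}, "
      f"weighted mean = ${weighted_mean:.2f} per gallon")
```

Note that the unweighted average of the four prices would ignore how much gas was bought at each price, which is why the weights matter here.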
Relationships Among the Mean, Median, and Mode

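Symmetric: for a histogram with one peak, the mean, median, and mode coincide.

Skewed to right: typically mode < median < mean.

Skewed to left: typically mean < median < mode.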
Geometric Mean
1. The geometric mean is calculated by finding the nth root of the product of n
values. It is useful for obtaining a "typical" value when the data contain high outliers.

The (simple/arithmetic) mean for the following numbers (25, 22, 24, 23, 22, 100) = ?

The geometric mean = (25*22*24*23*22*100)^(1/6) = ?

Compare the two values. Which one is the better representative of the "typical"
value, or center, of the distribution?

2. It should be applied any time you want to determine the mean rate of
change over several successive periods (e.g., returns, inflation, and growth
rates).
Suppose that the inflation rates for the last five years are 4%, 3%, 5%, 6%,
and -2%, respectively. Find the mean rate of inflation over the five-year
period.
The mean rate of inflation = (1.04*1.03*1.05*1.06*0.98)^(1/5) - 1 = ?
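
Both exercises can be worked through with a short Python sketch. It assumes `math.prod` (Python 3.8 or later) and uses illustrative variable names; it is one way to evaluate the expressions above, not the only one.

```python
import math

# Exercise 1: arithmetic vs. geometric mean with one high outlier
values = [25, 22, 24, 23, 22, 100]
arithmetic_mean = sum(values) / len(values)
geometric_mean = math.prod(values) ** (1 / len(values))  # nth root of the product

# Exercise 2: mean rate of change over five successive years
inflation_rates = [0.04, 0.03, 0.05, 0.06, -0.02]
growth_factors = [1 + r for r in inflation_rates]         # 1.04, 1.03, ...
mean_inflation = math.prod(growth_factors) ** (1 / len(growth_factors)) - 1

print(f"arithmetic mean = {arithmetic_mean:.2f}")
print(f"geometric mean  = {geometric_mean:.2f}")
print(f"mean inflation  = {mean_inflation:.2%}")
```

With these numbers the arithmetic mean is 36, while the geometric mean is roughly 29.6 and so sits closer to the bulk of the values; the mean inflation rate comes out at roughly 3.16% per year.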
Measures of spread/variability
• Range = Largest value – Smallest Value
Disadvantages
 The range, like the mean, has the disadvantage of being influenced by
outliers. Consequently, the range is not a good measure of dispersion to use
for a data set that contains outliers. This indicates that the range is a
nonresistant measure of dispersion.
 Its calculation is based on two values only: the largest and the smallest. All
other values in a data set are ignored when calculating the range. Thus, the
range is not a very satisfactory measure of dispersion.
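
To see how strongly a single outlier affects the range, the sketch below reuses the six numbers from the geometric mean exercise (25, 22, 24, 23, 22, 100) and computes the range with and without the value 100. The helper name `value_range` is hypothetical.

```python
def value_range(data):
    """Range = largest value - smallest value."""
    return max(data) - min(data)

values = [25, 22, 24, 23, 22, 100]
without_outlier = [v for v in values if v != 100]

range_with = value_range(values)              # 100 - 22
range_without = value_range(without_outlier)  # 25 - 22

print("range with outlier:   ", range_with)
print("range without outlier:", range_without)
```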

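• Variance and standard deviation

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \quad \text{and} \quad s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$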
Coefficient of Variation (CV)

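CV expresses the standard deviation as a percentage of the mean and is
computed as follows:

$$CV = \frac{\sigma}{\mu} \times 100\% \ \text{(population data)} \quad \text{and} \quad CV = \frac{s}{\bar{x}} \times 100\% \ \text{(sample data)}$$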

Note that the coefficient of variation does not have any units of
measurement, as it is always expressed as a percent.

The following data give lunch expenses in Prasmul canteen (in 000 Rp) for a
sample of 20 students on Friday, 8 March 2024.
25;35;27;45;33;28;29;42;33;28;45;35;35;40;22;29;40;35;35;30
Find the sample standard deviation and sample CV!
   x    (x - mean)^2
  22      133.4025
  25       73.1025
  27       42.9025
  28       30.8025
  28       30.8025
  29       20.7025
  29       20.7025
  30       12.6025
  33        0.3025
  33        0.3025
  35        2.1025
  35        2.1025
  35        2.1025
  35        2.1025
  35        2.1025
  40       41.6025
  40       41.6025
  42       71.4025
  45      131.1025
  45      131.1025

Total of (x - mean)^2 = 792.95
Mean = 33.55
s = sqrt(792.95 / 19) = 6.460202
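
The table can be reproduced, and the sample CV obtained, with the following sketch. It assumes the standard-library `statistics.stdev`, which uses the n - 1 divisor for sample data; the variable names are illustrative.

```python
import math
from statistics import mean, stdev

# Lunch expenses (in thousand Rp) for the sample of 20 students
lunch = [25, 35, 27, 45, 33, 28, 29, 42, 33, 28,
         45, 35, 35, 40, 22, 29, 40, 35, 35, 30]

x_bar = mean(lunch)
squared_deviations = [(x - x_bar) ** 2 for x in lunch]   # the (x - mean)^2 column
total = sum(squared_deviations)                          # the "Total" row
s_manual = math.sqrt(total / (len(lunch) - 1))           # divide by n - 1, then take the root

s = stdev(lunch)                 # library equivalent of s_manual
cv = s / x_bar * 100             # sample CV as a percentage of the mean

print(f"total = {total:.2f}, s = {s:.6f}, CV = {cv:.2f}%")
```

Under these data the sample CV is about 19.26%, meaning the standard deviation is a little under one fifth of the mean lunch expense.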
Measures of Position
 Quartiles and Interquartile Range
 Percentiles and Percentile Rank

IQR = Interquartile range = Q3 – Q1


Calculating Percentiles
The (approximate) value of the kth percentile, denoted by $P_k$, is

$$P_k = \text{Value of the } \left(\frac{kn}{100}\right)\text{th term in a ranked data set}$$

where k denotes the number of the percentile and n represents
the sample size.

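Finding Percentile Rank of a Value

$$\text{Percentile rank of } x_i = \frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100\%$$
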

Example:
A sample of 12 commuter students was selected from a college. The following
data give the typical one-way commuting times (in minutes) from home to
college for these 12 students.
29 14 39 17 7 47 63 37 42 18 24 55
(a) Find the values of the three quartiles.
(b) Where does the commuting time of 47 fall in relation to the three quartiles?
(c) Find the interquartile range.
(d) Find the value of the 70th percentile. Give a brief interpretation of the 70th
percentile.
(e) Find the percentile rank of 42 minutes. Give a brief interpretation of this
percentile rank.
Solution:
(a) The given data, ranked in increasing order, are as follows:
7 14 17 18 24 29 37 39 42 47 55 63
Q2=(x6+x7)/2=(29+37)/2=33
Q1=median(7,14,17,18,24,29)=(17+18)/2=17.5
Q3=median(37,39,42,47,55,63)=(42+47)/2=44.5
(b) By looking at the position of 47 minutes, we can state that this value
lies in the top 25% of the commuting times.
(c) IQR = Interquartile range = Q3 – Q1 = 44.5 – 17.5 = 27 minutes
(d) $$\frac{kn}{100} = \frac{(70)(12)}{100} = 8.4 \approx 9\text{th term}$$
Thus, the 70th percentile, P70, is given by the value of the 9th term in
the ranked data set. Note that we rounded 8.4 up to 9, which is always
the case when calculating a percentile.
P70 = Value of the 9th term = 42 minutes
Thus, we can state that approximately 70% of these 12 students
commute for less than or equal to 42 minutes.
(e) First, find how many data values are less than 42.
In the ranked data above, there are eight data values that are less than 42.

$$\text{Percentile rank of } 42 = \frac{8}{12} \times 100\% = 66.67\%$$

Rounding this answer to the nearest integer, we can state that about
67% of the students in this sample commute for less than 42 minutes.
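
The steps of this solution can be sketched in Python as follows. The helper functions `quartiles`, `percentile`, and `percentile_rank` are hypothetical names written for this illustration; they follow the hand methods used above (median of each half, position kn/100 rounded up, count of values below x). How the middle value is treated when n is odd is an assumption, since the example only covers an even n. Library routines such as `numpy.percentile` use interpolation by default and may give slightly different answers.

```python
from statistics import median

times = [29, 14, 39, 17, 7, 47, 63, 37, 42, 18, 24, 55]
ranked = sorted(times)

def quartiles(ranked_values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    size = len(ranked_values)
    half = size // 2
    lower = ranked_values[:half]
    # Assumption: for odd n the middle value is left out of both halves
    upper = ranked_values[half:] if size % 2 == 0 else ranked_values[half + 1:]
    return median(lower), median(ranked_values), median(upper)

def percentile(ranked_values, k):
    """Approximate kth percentile: value of the (k*n/100)th term, rounded up."""
    position = -(-k * len(ranked_values) // 100)  # integer ceiling of k*n/100
    return ranked_values[position - 1]            # terms are counted from 1

def percentile_rank(ranked_values, x):
    """Percentage of values strictly less than x."""
    below = sum(1 for v in ranked_values if v < x)
    return below / len(ranked_values) * 100

q1, q2, q3 = quartiles(ranked)
iqr = q3 - q1
p70 = percentile(ranked, 70)
rank_42 = percentile_rank(ranked, 42)

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"P70 = {p70}, percentile rank of 42 = {rank_42:.2f}%")
```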
Box-and-Whisker Plot
 Definition
 A plot that shows the center, spread, and skewness of a data set. It is
constructed by drawing a box and two whiskers that use the median, the first
quartile, the third quartile, and the smallest and the largest values in the data
set between the lower and the upper inner fences.

 The following data are the incomes (in thousands of dollars) for a sample of
12 households.

 75 69 84 112 74 104 81 90 94 144 79 98

 Construct a box-and-whisker plot for these data.


The ranked data are

69 74 75 79 81 84 90 94 98 104 112 144

Median = (84 + 90) / 2 = 87


 Q1 = (75 + 79) / 2 = 77
 Q3 = (98 + 104) / 2 = 101
 IQR = Q3 – Q1 = 101 – 77 = 24
1.5 x IQR = 1.5 x 24 = 36
Lower inner fence = Q1 – 36 = 77 – 36 = 41
Upper inner fence = Q3 + 36 = 101 + 36 = 137
Smallest value within the two inner fences = 69
Largest value within the two inner fences = 112
Looking at the constructed Box-plot, about 50% of the data values fall
within the box, about 25% of the values fall on the left side of the box, and
about 25% fall on the right side of the box. Also, 50% of the values fall on
the left side of the median and 50% lie on the right side of the median. The
data of this example are slightly skewed to the right because the lower 50%
of the values are spread over a smaller range than the upper 50% of the
values.
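
The numbers needed to draw this box-and-whisker plot can be computed with a short sketch. It assumes an even number of observations, as in this example, so that the ranked data split cleanly into two halves; the variable names are illustrative.

```python
from statistics import median

incomes = [75, 69, 84, 112, 74, 104, 81, 90, 94, 144, 79, 98]
ranked = sorted(incomes)
half = len(ranked) // 2            # assumes an even number of values

med = median(ranked)
q1 = median(ranked[:half])         # median of the lower half
q3 = median(ranked[half:])         # median of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that lie within the inner fences
inside = [x for x in ranked if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = inside[0], inside[-1]
outliers = [x for x in ranked if x < lower_fence or x > upper_fence]

print(f"median = {med}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence}), "
      f"whiskers = ({whisker_low}, {whisker_high}), outliers = {outliers}")
```

The value 144 lies above the upper inner fence of 137, so it is plotted as a separate point rather than as the end of the right whisker.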
Distribution Shape: Skewness

Moderately Skewed Left
Skewness is negative.
Mean will usually be less than the median.

Symmetric (not skewed)
Skewness is zero.
Mean and median are equal.

Highly Skewed Right
Skewness is positive (often above 1.0)
Mean will usually be more than the median.

# python
from scipy.stats import skew
# define the data. Example:
data = [1, 2, 1, 3, 2, 5, 1, 12]
skewness = skew(data)
print("Skewness:", skewness)
