Descriptive Statistics
Collect data
ex. Survey
Present data
ex. Tables and graphs
Characterize data
ex. Sample mean: $\bar{X} = \frac{\sum X_i}{n}$
Descriptive statistics encompasses the following:
• Graphical or pictorial display
• Condensation of large masses of data into a form such as tables
• Preparation of summary measures to give a concise description of complex information (e.g. an average figure)
• Exhibition of patterns that may be found in sets of information
Summary Statistics
The Arithmetic Mean
• Example 1
The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33,
14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{x_1 + x_2 + \dots + x_{10}}{10} = \frac{0 + 7 + \dots + 22}{10} = \frac{110}{10} = 11.0$
• Example 2
Suppose the 200 telephone bills represent the population of measurements.
The population mean is
$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{x_1 + x_2 + \dots + x_{200}}{200} = \frac{42.19 + 38.45 + \dots + 45.77}{200} = 43.59$
The Arithmetic Mean
Drawback of the mean:
It can be influenced by unusual
observations, because it uses all
the information in the data set.
The Median
The Median of a set of
observations is the value that falls
in the middle when the
observations are arranged in
order of magnitude.
•Median = 6
The Median
The engineering group receives e-mail
requests for technical information from
sales and service personnel. The daily
numbers for 6 days were
11, 9, 17, 19, 4, and 15.
What is the central location of the data?
• For even sample sizes, the median is the
average of the (n/2)th and (n/2 + 1)th ordered observations.
4, 9, 11, 15, 17, 19
Median = (11 + 15) / 2 = 13
The Mode
The Mode of a set of observations is
the value that occurs most
frequently.
The Mode
Find the mode for the data in Example
1. Here are the data again: 0, 7, 12, 5,
33, 14, 8, 0, 9, 22
Solution: the value 0 occurs twice and every other value occurs once, so the mode is 0.
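The three measures of central location can be checked quickly with Python's standard `statistics` module. This is only a sketch using the data quoted in the examples above; the variable names are illustrative.

```python
from statistics import mean, median, multimode

# Example 1: reported hours on the Internet for 10 adults
internet_hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
# Daily e-mail requests over 6 days (median example)
email_requests = [11, 9, 17, 19, 4, 15]

print(mean(internet_hours))       # arithmetic mean of the 10 values
print(median(email_requests))     # average of the two middle ordered values
print(multimode(internet_hours))  # list of the most frequent value(s)
```

`multimode` returns a list because a data set can have more than one mode.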
Relationship among Mean,
Median, and Mode
If a distribution is symmetrical, the
mean, median and mode coincide.
Mean = Median = Mode
[Figure: symmetric distribution with the mode, median, and mean at the same central point]
If a distribution is not symmetrical, the three measures generally differ.
For a negatively skewed (left-skewed) distribution, typically:
Mean < Median < Mode
[Figure: negatively skewed distribution with the mean, median, and mode marked]
Geometric Mean
The arithmetic mean is the most popular measure
of the central location of the distribution of a set of
observations.
But the arithmetic mean is not a good measure of
the average rate at which a quantity grows over
time. That quantity, whose growth rate (or rate
of change) we wish to measure, might be the
total annual sales of a firm or the market value of
an investment.
The geometric mean should be used to
measure the average growth rate of the
values of a variable over time.
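A small sketch of why the arithmetic mean misleads for growth rates. It assumes the usual definition of the geometric-mean growth rate, $R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$, which is not spelled out in the slides; the investment figures are hypothetical.

```python
import math

def average_growth_rate(rates):
    """Geometric-mean growth rate for per-period rates given as decimals."""
    growth_factors = [1 + r for r in rates]
    return math.prod(growth_factors) ** (1 / len(rates)) - 1

# Hypothetical investment: +100% in year 1, then -50% in year 2,
# so it ends exactly where it started.
rates = [1.00, -0.50]
arithmetic = sum(rates) / len(rates)     # suggests 25% growth per year
geometric = average_growth_rate(rates)   # reports no growth overall
print(arithmetic, geometric)
```

The arithmetic mean reports positive growth even though the value is unchanged after two years, while the geometric mean reports zero.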
Measures of variability
Methods of Variability Measurement
$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$
Properties of Standard
Deviation (S)
• S is always greater than or equal to 0.
• The greater the variation about the mean, the greater S is.
• n - 1 corrects for bias when using sample data. Dividing by n tends to
underestimate the real population standard deviation when based on sample data.
To correct for this bias, we divide by n - 1. The larger the sample size,
the smaller the difference this correction makes.
• When calculating the standard deviation for the
whole population, use N in the denominator.
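The difference between the two denominators can be seen in a short sketch. The function names are illustrative, and the data are the Internet hours from Example 1.

```python
import math

def sample_std(values):
    """Sample standard deviation S: divides by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def population_std(values):
    """Population standard deviation: divides by N."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
print(sample_std(hours), population_std(hours))
```

For the same data the n - 1 version is always the larger of the two, and the gap shrinks as n grows.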
Practical Application for Understanding
Variance and Standard Deviation
Even though we live in a world where we pay real monetary
values for goods and services (not percentages of income),
most employers issue raises based on percent of salary.
The problem is that the flat percent raise gives unequal increased
rewards. . .
Acme Toilet Cleaning Services
Salary Pool: $200,000
Incomes:
President: $100K; Manager: $50K; Secretary: $40K; Toilet Cleaner: $10K
Mean: $50K
Range: $90K
Variance: $1,050,000,000
Standard Deviation: $32.4K
The range, variance, and standard deviation can be considered “measures of inequality”.
The flat percentage raise increased inequality. The top earner got 50% of the new
money. The bottom earner got 5% of the new money. Measures of inequality
went up by 5%.
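The Acme figures can be reproduced with a few lines of Python. The sketch assumes a flat 5% raise, which is not stated explicitly on the slide but is consistent with the shares of new money quoted above, and it uses the population formulas (divide by N), which match the variance and standard deviation shown.

```python
from statistics import pstdev, pvariance

incomes = {"President": 100_000, "Manager": 50_000,
           "Secretary": 40_000, "Toilet Cleaner": 10_000}
FLAT_RAISE = 0.05  # assumption: every employee gets the same 5% raise

raised = {job: pay * (1 + FLAT_RAISE) for job, pay in incomes.items()}
new_money = {job: raised[job] - incomes[job] for job in incomes}
pool = sum(new_money.values())
share_of_new_money = {job: amount / pool for job, amount in new_money.items()}

before, after = list(incomes.values()), list(raised.values())
range_before, range_after = max(before) - min(before), max(after) - min(after)
sd_before, sd_after = pstdev(before), pstdev(after)

print(share_of_new_money)
print(range_after / range_before, sd_after / sd_before)
```

Under this assumption the range and the standard deviation both grow by 5%; the variance, being measured in squared dollars, grows by 1.05² - 1 = 10.25%.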
Since we pay for goods and services in real dollars, not in percentages,
there are substantially more new things the top earners can purchase
compared with the bottom earner for the rest of their employment
years.
Acme is essentially saying: “Each year we’ll buy you a new TV, in
addition to everything else you buy, here’s what you’ll get:”
[Figure: what the raise buys each employee: Toilet Cleaner, Secretary, Manager, President]
The first quartile (Q1) is the first 25% of the data. The
second quartile (Q2) is between the 25th and 50th
percentage points in the data. The upper bound of Q2
is the median. The third quartile (Q3) is the 25% of
the data lying between the median and the 75% cut
point in the data.
Q1 is the median of the first half of the ordered observations and Q3 is
the median of the second half of the ordered observations.
In the following example (15 observations), Q1 is at position ((15 + 1)/4) × 1 = 4, i.e. the 4th observation of the
ordered data. The 4th observation is 11, so Q1 of this data is 11.
[Figure: positions of Q1, Q2, and Q3 in a positively skewed histogram and in a negatively skewed histogram]
Percentiles
$P_k$ = kth percentile
Compute $L = \left(\frac{k}{100}\right) n$, where
n = number of scores
k = percentile in question
Is L a whole number?
• Yes: the value of the kth percentile is midway between the Lth score and the next higher score in the ordered set of data. Find $P_k$ by adding the Lth score and the next higher score and dividing the total by 2.
• No: change L by rounding it up to the next larger whole number. $P_k$ is then the Lth score in the ordered set of data.
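The percentile procedure above can be sketched as a small function. The name `percentile` and the restriction to whole-number k strictly between 0 and 100 are assumptions made for this illustration.

```python
def percentile(scores, k):
    """Return P_k using the locator L = (k / 100) * n."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    ordered = sorted(scores)
    n = len(ordered)
    # Integer arithmetic avoids floating-point round-off when testing
    # whether L is a whole number.
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        # L is whole: midway between the Lth score and the next higher score
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round up; position whole + 1 is index `whole` (0-based)
    return ordered[whole]

requests = [11, 9, 17, 19, 4, 15]
print(percentile(requests, 50), percentile(requests, 25))
```

With k = 50 the rule reduces to the median rule for an even sample size, so the 50th percentile of the e-mail data equals the median found earlier.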
Coefficient of Variation = $\frac{s}{\bar{x}}$
Measures of Central Tendency for Grouped Data
Mean
Median
1.
Class    Frequency
1-5      2
6-10     4
11-15    9
16-20    7
21-25    5
26-30    3
Total    30
Answer: 15.5
2.
Class interval   1-3   4-6   7-9   10-12   13-15   16-18
Frequency        5     3     2     1       6       4
Answer: 11
Mode
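$\hat{x} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2}\, c$ where
L = lower boundary of the modal class
$\Delta_1$ = frequency of the modal class minus the frequency of the class before it
$\Delta_2$ = frequency of the modal class minus the frequency of the class after it
c = class width
Answer: 14.64 (for data set 2)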
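The answers for the grouped data can be checked with a short sketch. The median formula used here, $L + \frac{n/2 - F}{f}\, c$ (with F the cumulative frequency before the median class and f its frequency), is the standard interpolation formula and is an assumption, since the slide's own formula did not survive. The function names and the 0.5 boundary adjustment for integer class limits are also illustrative choices.

```python
def grouped_median(lower_limits, freqs, width):
    """Median by interpolation inside the median class."""
    n = sum(freqs)
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= n / 2:
            boundary = lower - 0.5  # class boundary for integer class limits
            return boundary + (n / 2 - cumulative) / f * width
        cumulative += f

def grouped_mode(lower_limits, freqs, width):
    """Mode by interpolation inside the modal class."""
    i = freqs.index(max(freqs))  # first class with the highest frequency
    before = freqs[i - 1] if i > 0 else 0
    after = freqs[i + 1] if i + 1 < len(freqs) else 0
    d1, d2 = freqs[i] - before, freqs[i] - after
    return lower_limits[i] - 0.5 + d1 / (d1 + d2) * width

set1 = ([1, 6, 11, 16, 21, 26], [2, 4, 9, 7, 5, 3], 5)
set2 = ([1, 4, 7, 10, 13, 16], [5, 3, 2, 1, 6, 4], 3)
print(grouped_median(*set1), grouped_median(*set2), round(grouped_mode(*set2), 2))
```

Under these assumptions the sketch reproduces the three answers quoted for the exercises.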
Variance & standard deviation
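Variance: $s^2 = \frac{\sum f x^2 - \frac{(\sum f x)^2}{\sum f}}{\sum f - 1}$ or $s^2 = \frac{\sum f x^2 - n \bar{x}^2}{\sum f - 1}$
Standard deviation: $s = \sqrt{\frac{\sum f x^2 - \frac{(\sum f x)^2}{\sum f}}{\sum f - 1}}$ or $s = \sqrt{\frac{\sum f x^2 - n \bar{x}^2}{\sum f - 1}}$
where $n = \sum f$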
Find the variance and standard deviation of the sample data below:
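Class    f     x     fx
60-62    5     61    305
63-65    18    64    1152
66-68    42    67    2814
69-71    27    70    1890
72-74    8     73    584
Total    100         6745
$s^2 = \frac{\sum f x^2 - \frac{(\sum f x)^2}{\sum f}}{\sum f - 1} = ?$
$s = \sqrt{\frac{\sum f x^2 - \frac{(\sum f x)^2}{\sum f}}{\sum f - 1}} = ?$
Answer: $s^2 = 8.61$; $s = 2.93$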
Consider a data set of the weights of 30 items. Find the standard
deviation.
Weight(kg) Frequency (f)
20-29 1
30-39 8
40-49 10
50-59 6
60-69 5
Answer: s = 11.265
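Both grouped-data exercises can be verified with the computational formula $s^2 = \frac{\sum f x^2 - (\sum f x)^2 / \sum f}{\sum f - 1}$. In this sketch each class is represented by its midpoint; the function name is illustrative.

```python
import math

def grouped_sample_variance(midpoints, freqs):
    """Sample variance of grouped data from class midpoints and frequencies."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(midpoints, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

# First exercise: classes 60-62 ... 72-74
first_var = grouped_sample_variance([61, 64, 67, 70, 73], [5, 18, 42, 27, 8])
# Second exercise: weights in kg, classes 20-29 ... 60-69
weights_var = grouped_sample_variance([24.5, 34.5, 44.5, 54.5, 64.5],
                                      [1, 8, 10, 6, 5])

print(round(first_var, 2), round(math.sqrt(first_var), 2))
print(round(math.sqrt(weights_var), 3))
```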
Shape of Data
Shape of data is measured by
Skewness
Kurtosis
Skewness is a measure of symmetry or lack of
symmetry.
A data set is symmetrical when the
proportions of data at equal distances (measured in
terms of standard deviation) on either side of the mean (or
median) are equal,
i.e., the proportion of data between μ and μ – kσ is
same as that between μ and μ + kσ.
Skewness
Measures asymmetry of data
Positive or right skewed: Longer right
tail
Kurtosis: measures how heavy the tails of the distribution are (often described as peakedness).
The Empirical Rule
Amount spent per month by a segment of credit
card users of a bank has a mean of Rs. 12000
and standard deviation of Rs. 2000. What can you
say about the amount spent?
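Solution
The shape of the distribution is not given, so the bounds below come from
Chebyshev's theorem, which holds for any distribution: at least $1 - 1/k^2$ of
the data lie within k standard deviations of the mean.
At least 75% of the customers spend between Rs. 8000 and Rs. 16000
(k = 2: 12000 - 2(2000) and 12000 + 2(2000)).
At least 88.9% of the customers spend between Rs. 6000 and Rs. 18000
(k = 3: 12000 - 3(2000) and 12000 + 3(2000)).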
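The "at least" percentages in this solution match the Chebyshev bound $1 - 1/k^2$ rather than the empirical rule's approximate 95% and 99.7%. A minimal sketch of that bound, with an illustrative function name:

```python
def chebyshev_interval(mean, std, k):
    """Return (lower, upper, minimum proportion) for k standard deviations."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return mean - k * std, mean + k * std, 1 - 1 / k ** 2

for k in (2, 3):
    lower, upper, proportion = chebyshev_interval(12000, 2000, k)
    print(f"k={k}: at least {proportion:.1%} between Rs. {lower} and Rs. {upper}")
```

If the spending data were known to be roughly bell-shaped, the empirical rule would give tighter statements for the same intervals.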
Paired Data Sets and the Sample
Correlation Coefficient
The covariance and the coefficient of
correlation are used to measure the
direction and strength of the linear
relationship between two variables.
Covariance - is there any pattern to the
way two variables move together?
Coefficient of correlation - how strong is
the linear relationship between two
variables
Covariance
Population covariance: $COV(X, Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y).
N is the population size.
Sample covariance: $cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
$\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y).
n is the sample size.
Covariance
Compare the following three sets
xi    yi    (x - x̄)    (y - ȳ)    (x - x̄)(y - ȳ)
2     13    -3         -7         21
6     20    1          0          0
7     27    2          7          14
COV(X,Y) = 0 or r = 0: no linear relationship
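The sample covariance for the one data set that survives in the table (x = 2, 6, 7 and y = 13, 20, 27) can be computed directly. The correlation formula used here, $r = \frac{cov(x, y)}{s_x s_y}$, is the standard definition and is an assumption, since it does not appear in the text above; the function names are illustrative.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Coefficient of correlation: covariance scaled by both standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # cov(x, x) is the variance of x
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

xs, ys = [2, 6, 7], [13, 20, 27]
print(sample_covariance(xs, ys), correlation(xs, ys))
```

The cross-products 21, 0, and 14 sum to 35, so the sample covariance is 35 / 2 = 17.5; the positive sign indicates the two variables move together.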
Population Linear Regression
The population regression model:
$y = \beta_0 + \beta_1 x + \varepsilon$
where
y = dependent variable
$\beta_0$ = population y intercept
$\beta_1$ = population slope coefficient
x = independent variable
$\varepsilon$ = random error term, or residual
Linear Regression Assumptions
Error values (ε) are statistically
independent
Error values are normally distributed for
any given value of x
The probability distribution of the errors
is normal
The probability distribution of the errors
has constant variance
The underlying relationship between the x
variable and the y variable is linear
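Population Linear Regression (continued)
[Figure: the line $y = \beta_0 + \beta_1 x + \varepsilon$ plotted against x, with intercept $\beta_0$ and slope $\beta_1$. For a given $x_i$, the observed value of y, the predicted value of y, and the random error $\varepsilon_i$ for this x value are marked.]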
Explained and Unexplained
Variation
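Total variation is made up of two parts: SST = SSR + SSE
SST = $\sum (y_i - \bar{y})^2$ (total sum of squares)
SSR = $\sum (\hat{y}_i - \bar{y})^2$ (regression sum of squares, the explained variation)
SSE = $\sum (y_i - \hat{y}_i)^2$ (error sum of squares, the unexplained variation)
[Figure: for an observation $(x_i, y_i)$, the vertical distances from $y_i$ to the fitted value $\hat{y}_i$ and from $\hat{y}_i$ to the mean $\bar{y}$]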
Coefficient of Determination, $R^2$
The coefficient of determination is the
portion of the total variation in the
dependent variable that is explained by
variation in the independent variable
$R^2 = r^2$
where:
$R^2$ = Coefficient of determination
r = Simple correlation coefficient
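The identities SST = SSR + SSE and $R^2 = r^2$ can be checked numerically. This sketch assumes a least-squares fit with a single independent variable (the fitting method is not stated in the slides) and reuses the small data set from the covariance table; all function names are illustrative.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line."""
    b0, b1 = least_squares(xs, ys)
    y_bar = sum(ys) / len(ys)
    fitted = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return sst, ssr, sse

xs, ys = [2, 6, 7], [13, 20, 27]
sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = ssr / sst
print(sst, ssr, sse, r_squared)
```

For this data the fitted line is y = 7.5 + 2.5x, and the explained share of the variation equals the square of the correlation coefficient.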
Demand                     Demand for Data Set
Period                     1      2      3      4      5
1                          92     80     50     10     0
2                          92     100    80     10     0
3                          92     125    180    15     0
4                          92     100    80     20     0
5                          92     50     0      70     0
6                          92     50     0      180    1105
7                          92     100    180    250    0
8                          92     125    150    270    0
9                          93     125    10     230    0
10                         92     100    100    40     0
11                         92     50     180    0      0
12                         93     100    95     10     0
Total                      1105   1105   1105   1105   1105
Coefficient of variation   0      0.29   0.72   1.41   3.31
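The coefficient of variation for a column of the table can be recomputed as standard deviation divided by mean. This sketch uses the population standard deviation (`pstdev`), which is an assumption about how the table's row was produced; data sets 3 and 5 are copied from the table.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation relative to the mean (population form, an assumption)."""
    return pstdev(values) / mean(values)

data_set_3 = [50, 80, 180, 80, 0, 0, 180, 150, 10, 100, 180, 95]
data_set_5 = [0, 0, 0, 0, 0, 1105, 0, 0, 0, 0, 0, 0]

cv3 = coefficient_of_variation(data_set_3)
cv5 = coefficient_of_variation(data_set_5)
print(round(cv3, 3), round(cv5, 3))
```

All five data sets have the same total demand, so the coefficient of variation is what distinguishes steady demand from demand concentrated in a single period. Small differences from the table's row may come from rounding or from a different divisor.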
Box Plot
– This is a pictorial display that provides the main
descriptive measures of the data set:
• L The largest observation
• Q3 The upper quartile
• Q2 The median
• Q1 The lower quartile
• S The smallest observation
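The five values a box plot displays can be computed with a short sketch. It follows the rule stated earlier that Q1 and Q3 are the medians of the lower and upper halves of the ordered observations; leaving the middle observation out of both halves when n is odd is an assumption, and the function name is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return the smallest value, quartiles, and largest value of a data set."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]       # observations below the median position
    upper = ordered[n - half:]   # observations above the median position
    return {
        "S": ordered[0],
        "Q1": median(lower),
        "Q2": median(ordered),
        "Q3": median(upper),
        "L": ordered[-1],
    }

internet_hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
print(five_number_summary(internet_hours))
```

Other quartile conventions exist, so plotting libraries may report slightly different Q1 and Q3 values for the same data.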