Descriptive Statistics

• Collect data
  • e.g., a survey
• Present data
  • e.g., tables and graphs
• Characterize data
  • e.g., the sample mean $\bar{X} = \frac{\sum X_i}{n}$

Descriptive statistics
• Encompasses the following:
  • Graphical or pictorial display
  • Condensation of large masses of data into a form such as tables
  • Preparation of summary measures to give a concise description of complex information (e.g., an average figure)
  • Exhibition of patterns that may be found in sets of information

Summary Statistics

Summary statistics describe the characteristics of a data set.

Measures of central tendency yield information about the center, or middle part, of a group of numbers.

Measures of variability describe the spread or the dispersion of a set of data.

The Arithmetic Mean

• This is the most popular and useful measure of central location.

$$\text{Mean} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$$

Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size.

Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where $N$ is the population size.

• Example 1
The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, and 22 hours. Find the mean time on the Internet.

$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \dots + 22}{10} = \frac{110}{10} = 11.0$$

• Example 2
Suppose the telephone bills represent the population of measurements. The population mean is

$$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{42.19 + 38.45 + \dots + 45.77}{200} = 43.59$$

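As a quick check of Example 1, the mean can be computed in a few lines. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean

# Hours on the Internet reported by the 10 adults in Example 1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Sum of the observations divided by the number of observations.
sample_mean = mean(hours)
print(sample_mean)
```

The same call works for the population mean of Example 2 once all 200 bills are available; only the interpretation (sample versus population) changes.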
• Drawback of the mean: It can be influenced by unusual observations, because it uses all the information in the data set.

The Median
• The median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.
• It divides the data in half.


• Median of 8, 2, 9, 11, 1, 6, 3
n = 7 (odd sample size). First order the data:
1, 2, 3, 6, 8, 9, 11
• For an odd sample size, the median is the ((n+1)/2)th ordered observation, here the 4th.
• Median = 6
• The engineering group receives e-mail requests for technical information from sales and service personnel. The daily numbers for 6 days were 11, 9, 17, 19, 4, and 15. What is the central location of the data?
• For even sample sizes, the median is the average of the (n/2)th and (n/2+1)th ordered observations.
4, 9, 11, 15, 17, 19
Median = (11 + 15) / 2 = 13
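Both median rules (odd and even sample size) can be checked against the two examples. This is a small sketch that relies on `statistics.median` from the Python standard library, which sorts the data and applies the same two rules.

```python
from statistics import median

odd_sample = [8, 2, 9, 11, 1, 6, 3]      # n = 7, middle value is the 4th ordered observation
even_sample = [11, 9, 17, 19, 4, 15]     # n = 6, average of the 3rd and 4th ordered observations

median_odd = median(odd_sample)
median_even = median(even_sample)
print(median_odd, median_even)
```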
The Mode
• The mode of a set of observations is the value that occurs most frequently.
• A set of data may have one mode, two or more modes, or no mode.

• Find the mode for the data in Example 1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22

Solution
• All observations except “0” occur once. There are two “0”s. Thus, the mode is zero.
• Is this a good measure of central location?
• The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).

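The comparison of the three measures for the Example 1 data can be reproduced as follows. This is a sketch using `statistics.multimode`, which returns every most-frequent value, so it also covers data sets with several modes.

```python
from statistics import mean, median, multimode

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

modes = multimode(hours)       # list of the most frequent value(s)
center_mean = mean(hours)
center_median = median(hours)
print(modes, center_mean, center_median)
```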
Relationship among Mean, Median, and Mode
• If a distribution is symmetrical, the mean, median and mode coincide: Mean = Median = Mode
• If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.
• A positively skewed distribution (“skewed to the right”) typically has Mode < Median < Mean
• A negatively skewed distribution (“skewed to the left”) typically has Mean < Median < Mode
Geometric Mean
• The arithmetic mean is the most popular measure of the central location of the distribution of a set of observations.
• But the arithmetic mean is not a good measure of the average rate at which a quantity grows over time. That quantity, whose growth rate (or rate of change) we wish to measure, might be the total annual sales of a firm or the market value of an investment.
• The geometric mean should be used to measure the average growth rate of the values of a variable over time.
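To illustrate why the arithmetic mean misleads for growth rates, here is a minimal sketch with hypothetical numbers (they are not from the slides): an investment that doubles in year 1 and halves in year 2 ends where it started, so its average growth rate should be 0%. The sketch assumes the usual convention of averaging the growth factors (1 + rate) geometrically.

```python
from statistics import geometric_mean

# Hypothetical yearly growth rates: +100% in year 1, -50% in year 2.
rates = [1.00, -0.50]
factors = [1 + r for r in rates]            # growth factors 2.0 and 0.5

arithmetic_avg = sum(rates) / len(rates)    # suggests 25% growth per year
geometric_avg = geometric_mean(factors) - 1 # average growth rate per year

print(arithmetic_avg, geometric_avg)
```

The arithmetic average reports 25% per year even though the investment did not grow at all; the geometric average reports (essentially) zero.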
Measures of variability

• Measures of central location fail to tell the whole story about the distribution.
• A question of interest still remains unanswered: How much are the observations spread out around the mean value?

Methods of Variability Measurement

Variability (or dispersion) measures the amount of scatter in a data set.
Commonly used measures: range, variance, standard deviation, interquartile range, coefficient of variation, etc.

Range: The difference between the largest and the smallest observations. The range of 10, 5, 2, 100 is (100 − 2) = 98. It’s a crude measure of variability.

Variance: The variance of a set of observations is, roughly, the average of the squares of the deviations of the observations from their mean; for sample data, the sum of squared deviations is divided by n − 1 rather than n. In symbols, the variance of the n observations x1, x2, …, xn is

$$S^2 = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$

Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is

$$\frac{(5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2}{3 - 1} = 4$$

Standard Deviation: The square root of the variance. The standard deviation of the above example is 2.
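The worked example for 5, 7, 3 can be verified with the standard library. This is a sketch that uses `statistics.variance` and `statistics.stdev`, both of which divide by n − 1 as in the sample formula.

```python
from statistics import variance, stdev

data = [5, 7, 3]

s2 = variance(data)   # sum of squared deviations from the mean, divided by n - 1
s = stdev(data)       # square root of the sample variance
print(s2, s)
```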
The standard deviation is affected by extreme values

[Figure: three data sets with the same mean] All three data sets have the same mean. The variation in the distribution of values changes the standard deviation.
A working formula for the standard deviation:

$$S = \sqrt{\frac{\sum X_i^2 - n\bar{X}^2}{n - 1}}$$
Properties of Standard Deviation (S)
• Always greater than or equal to 0
• The greater the variation about the mean, the greater S is
• n − 1 corrects for bias when using sample data. S tends to underestimate the real population standard deviation when based on sample data. To correct for this bias, we use n − 1. The larger the sample size, the smaller the difference this correction makes.
• When calculating the standard deviation for the whole population, use N in the denominator.
Practical Application for Understanding Variance and Standard Deviation
Even though we live in a world where we pay real monetary values for goods and services (not percentages of income), most employers issue raises based on percent of salary.

Why do supervisors think the fairest raise is a percentage raise?
Answer: (1) Because higher-paid persons win the most money.
(2) The easiest thing to do is raise everyone’s salary by a fixed percent.

If your budget went up by 5%, salaries can go up by 5%.

The problem is that the flat percent raise gives unequal increased rewards...
Acme Toilet Cleaning Services
Salary Pool: $200,000
Incomes:
President: $100K; Manager: $50K; Secretary: $40K; and Toilet Cleaner: $10K
Mean: $50K
Range: $90K
Variance: 1,050,000,000 (in squared dollars, dividing by N = 4 because the four employees are the whole population)
Standard Deviation: $32.4K
The range, variance, and standard deviation can be considered “measures of inequality”.

Now, let’s apply a 5% raise.

After a 5% raise, the pool of money increases by $10K to $210,000.
Incomes:
President: $105K; Manager: $52.5K; Secretary: $42K; and Toilet Cleaner: $10.5K
Mean: $52.5K (went up by 5%)
Range: $94.5K (went up by 5%)
Variance: 1,157,625,000 in squared dollars (went up by 10.25%)
Standard Deviation: $34K (went up by 5%)

The flat percentage raise increased inequality. The top earner got 50% of the new money. The bottom earner got 5% of the new money. The range and the standard deviation, as measures of inequality, went up by 5%.

Last year’s statistics:
Acme Toilet Cleaning Services annual payroll of $200K
Incomes: $100K, $50K, $40K, and $10K
Mean: $50K
Range: $90K; Variance: 1,050,000,000; Standard Deviation: $32.4K
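The before-and-after figures can be reproduced with a short script. This is a sketch that assumes the four salaries are treated as a whole population (hence `pvariance` and `pstdev`, which divide by N); that assumption matches the variance of 1,050,000,000 quoted for last year.

```python
from statistics import mean, pvariance, pstdev

salaries = [100_000, 50_000, 40_000, 10_000]
raised = [s * 105 // 100 for s in salaries]      # flat 5% raise, kept in whole dollars

range_before = max(salaries) - min(salaries)
range_after = max(raised) - min(raised)

var_ratio = pvariance(raised) / pvariance(salaries)   # variance scales with 1.05 squared
sd_ratio = pstdev(raised) / pstdev(salaries)          # standard deviation scales with 1.05

print(mean(raised), range_after, pvariance(raised), round(pstdev(raised)))
print(var_ratio, sd_ratio)
```

Multiplying every value by 1.05 multiplies the mean, range, and standard deviation by 1.05, and the variance by 1.05 squared, which is why the spread in dollars grows even though the percentages are equal.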
The flat percentage raise increased inequality. The top earner got 50% of the new money. The bottom earner got 5% of the new money. Inequality increased by 5%.

Since we pay for goods and services in real dollars, not in percentages, there are substantially more new things the top earners can purchase compared with the bottom earner for the rest of their employment years.

Acme Toilet Cleaning Services is giving the earners $5,000, $2,500, $2,000, and $500 more respectively each and every year forever.

What does this mean in terms of compounding raises?

Acme is essentially saying: “Each year we’ll buy you a new TV, in addition to everything else you buy, here’s what you’ll get:”
[Figure: Toilet Cleaner, Secretary, Manager, President]

The gap between the rich and poor expands.

This is why some progressive organizations give a percentage raise with a flat increase for the lowest wage earners. For example, 5% or $1,000, whichever is greater.

Quartiles: Data can be divided into four regions that cover the total range of observed values. The cut points for these regions are known as quartiles.

In notation, the qth quartile of a data set is the (q(n+1)/4)th ordered observation, where q is the desired quartile and n is the number of observations.

The first quartile (Q1) is the cut point below which the lowest 25% of the data lie. The second quartile (Q2) is the cut point at 50%, which is the median. The third quartile (Q3) is the cut point at 75%: a further 25% of the data lie between the median and Q3, and the top 25% lie above it.

Q1 is the median of the first half of the ordered observations and Q3 is the median of the second half of the ordered observations.

An example with 15 numbers:
3 6 7 11 13 22 30 40 44 50 52 61 68 80 94
Here Q1 is the ((15+1)/4)·1 = 4th observation of the data, which is 11. The first quartile is Q1 = 11. The second quartile is Q2 = 40 (this is also the median). The third quartile is Q3 = 61.

Inter-quartile Range: The difference between Q3 and Q1. The inter-quartile range of the previous example is 61 − 11 = 50. The middle half of the ordered data lie between 11 and 61.
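The 15-number example can be checked in code. This is a sketch using `statistics.quantiles` with `method="exclusive"`, which places the qth quartile at position q(n+1)/4 in the ordered data (interpolating when that position is not a whole number); the inter-quartile range is then Q3 minus Q1.

```python
from statistics import quantiles

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Other software may use a different quartile convention and return slightly different cut points for the same data.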
Quartiles and Variability
• Quartiles can provide an idea about the shape of a histogram.

[Figure: a positively skewed histogram and a negatively skewed histogram, each with Q1, Q2, and Q3 marked]
Percentiles

• The kth percentile of a set of measurements is the value for which k percent of the observations are less than that value and (100 − k) percent of all the observations are greater than that value.
• Example: Suppose your score is at the 60th percentile of a test. Then 60% of all the scores lie below your score and 40% lie above it.

Finding the Percentile of a Given Score

$$\text{Percentile of score } x = \frac{\text{number of scores less than } x}{\text{total number of scores}} \cdot 100$$
Finding the Score Given a Percentile

$$L = \frac{k}{100} \cdot n$$

n = number of scores in the data set
k = percentile being used
L = locator that gives the position of a score
Pk = kth percentile
Finding the Value of the kth Percentile

1. Rank the data. (Arrange the data in order of lowest to highest.)
2. Compute $L = \left(\frac{k}{100}\right) n$, where n = number of scores and k = percentile in question.
3. Is L a whole number?
   • Yes: The value of the kth percentile is midway between the Lth score and the next higher score in the ranked data. Find Pk by adding the Lth score and the next higher score and dividing the total by 2.
   • No: Change L by rounding it up to the next larger whole number. The value of Pk is the Lth score, counting from the lowest.
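The locator procedure can be sketched as a small function. The function name `kth_percentile` and the restriction to whole-number k between 1 and 99 are assumptions made for this illustration; the data reuse the 15 numbers from the quartile example.

```python
def kth_percentile(scores, k):
    """Return P_k using the locator L = (k / 100) * n, for whole k with 0 < k < 100."""
    ordered = sorted(scores)                 # rank the data from lowest to highest
    n = len(ordered)
    # Integer arithmetic: L = whole + remainder / 100, with no rounding error.
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        # L is a whole number: average the Lth score and the next higher score.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round up, then take that score counting from the lowest.
    return ordered[whole]


data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
p20 = kth_percentile(data, 20)   # L = 3 exactly, so average the 3rd and 4th scores
p25 = kth_percentile(data, 25)   # L = 3.75, rounded up to the 4th score
p50 = kth_percentile(data, 50)   # L = 7.5, rounded up to the 8th score
print(p20, p25, p50)
```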

Coefficient of Variation

Coefficient of Variation: The standard deviation of data divided by its mean.

$$\text{Coefficient of Variation} = \frac{s}{\bar{x}}$$
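As a small illustration, the coefficient of variation of the earlier data set 5, 7, 3 (sample standard deviation 2, mean 5) works out to 0.4. The sketch below uses the sample standard deviation; using the population standard deviation instead is an equally common choice and gives a slightly smaller value.

```python
from statistics import mean, stdev

data = [5, 7, 3]

# Standard deviation expressed as a fraction of the mean (unit-free).
cv = stdev(data) / mean(data)
print(cv)
```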
Measures of Central Tendency for Grouped Data

• Mean
• Median

The median of frequency distribution data can be described as:

$$\tilde{x} = L + c \left( \frac{\frac{\sum f}{2} - F_{j-1}}{f_j} \right)$$

where
L = the lower class boundary of the median class
c = the size of the median class interval
$F_{j-1}$ = the sum of frequencies of all classes BEFORE the median class
$f_j$ = the frequency of the median class
Find the median of the following data:

1.
Class     Frequency
1-5       2
6-10      4
11-15     9
16-20     7
21-25     5
26-30     3
Total     30
Answer: 15.5

2.
Class interval   1–3   4–6   7–9   10–12   13–15   16–18
Frequency        5     3     2     1       6       4
Answer: 11
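Both answers can be reproduced with a direct translation of the grouped-median formula. This is a sketch: the function name and argument layout are hypothetical, and the lower class boundaries (0.5, 5.5, ...) assume whole-number class limits, so each boundary sits half a unit below the stated lower limit.

```python
def grouped_median(lower_boundaries, width, freqs):
    """Median of grouped data: L + c * ((sum(f) / 2 - F_before) / f_median)."""
    half = sum(freqs) / 2
    cumulative = 0                      # sum of frequencies before the current class
    for lower, f in zip(lower_boundaries, freqs):
        if cumulative + f >= half:      # first class whose cumulative frequency reaches n/2
            return lower + width * (half - cumulative) / f
        cumulative += f
    raise ValueError("frequencies must contain at least one positive value")


median_1 = grouped_median([0.5, 5.5, 10.5, 15.5, 20.5, 25.5], 5, [2, 4, 9, 7, 5, 3])
median_2 = grouped_median([0.5, 3.5, 6.5, 9.5, 12.5, 15.5], 3, [5, 3, 2, 1, 6, 4])
print(median_1, median_2)
```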
• Mode

$$\hat{x} = L + c \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right)$$

where
L = the lower class boundary of the modal class
c = the size of the modal class interval
$\Delta_1$ = the difference between the modal class frequency and the frequency of the class before it
$\Delta_2$ = the difference between the modal class frequency and the frequency of the class after it

NOTE: The class with the highest frequency is called the MODAL CLASS.


Find the mode of the following data:

1.
Class     Frequency
1-5       2
6-10      4
11-15     9
16-20     7
21-25     5
26-30     3
Total     30
Answer: 14.07

2.
Class interval   1–3   4–6   7–9   10–12   13–15   16–18
Frequency        5     3     2     1       6       4
Answer: 14.64
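The grouped-mode formula can be sketched the same way. The function name is hypothetical, the class boundaries again assume whole-number class limits, and a modal class at either end of the table is handled by treating the missing neighbor's frequency as 0 (an assumption, since the slides do not cover that case).

```python
def grouped_mode(lower_boundaries, width, freqs):
    """Mode of grouped data: L + c * d1 / (d1 + d2) for the modal class."""
    j = max(range(len(freqs)), key=lambda i: freqs[i])       # index of the modal class
    before = freqs[j - 1] if j > 0 else 0                    # assumed 0 if no class before
    after = freqs[j + 1] if j < len(freqs) - 1 else 0        # assumed 0 if no class after
    d1 = freqs[j] - before
    d2 = freqs[j] - after
    return lower_boundaries[j] + width * d1 / (d1 + d2)


mode_1 = grouped_mode([0.5, 5.5, 10.5, 15.5, 20.5, 25.5], 5, [2, 4, 9, 7, 5, 3])
mode_2 = grouped_mode([0.5, 3.5, 6.5, 9.5, 12.5, 15.5], 3, [5, 3, 2, 1, 6, 4])
print(round(mode_1, 2), round(mode_2, 2))
```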
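Variance & standard deviation

$$s^2 = \frac{\sum f x^2 - \frac{\left(\sum f x\right)^2}{\sum f}}{\sum f - 1} \quad \text{or} \quad s^2 = \frac{\sum f x^2 - n\bar{x}^2}{\sum f - 1}$$

$$s = \sqrt{\frac{\sum f x^2 - \frac{\left(\sum f x\right)^2}{\sum f}}{\sum f - 1}} \quad \text{or} \quad s = \sqrt{\frac{\sum f x^2 - n\bar{x}^2}{\sum f - 1}}$$

where $n = \sum f$ and x is the class mark.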
Find the variance and standard deviation of the sample data below:

Weight (Class Interval)   Frequency, f   Class Mark, x   fx
60-62                     5              61              305
63-65                     18             64              1152
66-68                     42             67              2814
69-71                     27             70              1890
72-74                     8              73              584
Total                     100                            6745

$$s^2 = \frac{\sum f x^2 - \frac{\left(\sum f x\right)^2}{\sum f}}{\sum f - 1} = ? \qquad s = \sqrt{s^2} = ?$$

Answer: s² = 8.61; s = 2.93
Consider a data set of weights of 30 items. Find the standard deviation.

Weight (kg)   Frequency (f)
20-29         1
30-39         8
40-49         10
50-59         6
60-69         5

Answer: s = 11.265
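Both grouped-data answers can be checked with a short function that applies the formula with Σf − 1 in the denominator. The function name is hypothetical; the class marks are the midpoints of each class (for example, 24.5 for 20-29).

```python
from math import sqrt


def grouped_variance(class_marks, freqs):
    """Sample variance of grouped data: (sum(f x^2) - (sum(f x))^2 / n) / (n - 1)."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(class_marks, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(class_marks, freqs))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


# Weights 60-62 ... 72-74 with 100 observations.
var_1 = grouped_variance([61, 64, 67, 70, 73], [5, 18, 42, 27, 8])
sd_1 = sqrt(var_1)

# Weights of 30 items, classes 20-29 ... 60-69.
sd_2 = sqrt(grouped_variance([24.5, 34.5, 44.5, 54.5, 64.5], [1, 8, 10, 6, 5]))

print(round(var_1, 2), round(sd_1, 2), round(sd_2, 3))
```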
Shape of Data
• Shape of data is measured by
  • Skewness
  • Kurtosis
Skewness is a measure of symmetry or lack of symmetry.
A data set is symmetrical when the proportion of data at equal distance (measured in terms of standard deviation) from the mean (or median) is equal on both sides, i.e., the proportion of data between μ and μ – kσ is the same as that between μ and μ + kσ.
Skewness
• Measures asymmetry of data
• Positive or right skewed: Longer right tail
• Negative or left skewed: Longer left tail

Kurtosis
• Measures peakedness of the distribution of data.
• The excess kurtosis (kurtosis minus 3) of the normal distribution is 0.
Karl Pearson (1857-1938)
Skewness

Relationship between location measures (approximate, for moderately skewed distributions):

mean – mode ≈ 3(mean – median)

Coefficient of skewness (independent of measurement units):

$$sk = \frac{\bar{x} - x_M}{\sigma}$$

Combining both:

$$sk = \frac{3(\bar{x} - m)}{\sigma}$$

$x_M$ = mode, a value that occurs most frequently in the sample or population
m = median
σ = standard deviation
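Pearson's median-based coefficient, 3(mean − median) divided by the standard deviation, can be applied to the Example 1 Internet-hours data as a rough check of its right skew. This is a sketch that uses the sample standard deviation from `statistics.stdev`, which is an assumption; the slide does not say whether a sample or population value is intended.

```python
from statistics import mean, median, stdev

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Pearson's second coefficient of skewness: 3 * (mean - median) / standard deviation.
sk = 3 * (mean(hours) - median(hours)) / stdev(hours)
print(round(sk, 2))
```

A positive value is consistent with the mean (11.0) sitting above the median (8.5), that is, a longer right tail.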

Kurtosis

$$k = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n\,\sigma^4}$$

That is, the average fourth power of the deviations from the mean value, divided by the standard deviation to the 4th power. Subtracting 3 gives the excess kurtosis, which is 0 for a normal distribution.
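A minimal sketch of the kurtosis calculation follows. It assumes the common convention of averaging the fourth-power deviations over n and using the population standard deviation, then subtracting 3 to obtain excess kurtosis; the helper name and the small data set are illustrative only.

```python
from statistics import fmean, pstdev


def kurtosis(values):
    """Average fourth-power deviation from the mean, divided by sigma to the 4th power."""
    m = fmean(values)
    sigma = pstdev(values)            # population standard deviation (divides by n)
    return sum((x - m) ** 4 for x in values) / (len(values) * sigma ** 4)


data = [1, 2, 3, 4, 5]                # evenly spread values: a flat (platykurtic) shape
k = kurtosis(data)
excess = k - 3                        # negative for flat distributions, 0 for the normal
print(round(k, 3), round(excess, 3))
```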
Kurtosis

Positive and large: leptokurtic distribution (high and thin distribution), e.g., high-frequency financial data, abnormal rates of return, long time series covering periods of crisis and expansion.
Negative and large: platykurtic distribution (flat and spread), e.g., large variability.
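Z-score

The z-score of an observation x is its distance from the mean measured in standard deviations: $z = \frac{x - \mu}{\sigma}$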


The Empirical Rule

Apply this rule to interpret the measures when the data are symmetrical and bell-shaped.
Approximately:
• 68% of the data values are within one standard deviation of the mean: µ ± 1σ
• 95% of the data values are within two standard deviations of the mean: µ ± 2σ
• 99.7% of the data values are within three standard deviations of the mean: µ ± 3σ
Chebyshev’s Inequality

For any data set and any k > 1, at least 1 − 1/k² of the values lie within k standard deviations of the mean.

• The amount spent per month by a segment of credit card users of a bank has a mean of Rs. 12000 and a standard deviation of Rs. 2000. What can you say about the amount spent?
Solution
At least 75% of the customers spend between Rs. 8000 and Rs. 16000, that is, between 12000 – 2(2000) and 12000 + 2(2000).
At least 88.9% of the customers spend between Rs. 6000 and Rs. 18000, that is, between 12000 – 3(2000) and 12000 + 3(2000).

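The credit card example can be reproduced with a small helper that applies the bound 1 − 1/k². The function name `chebyshev_interval` is made up for this sketch.

```python
def chebyshev_interval(mean, sd, k):
    """Return (low, high, minimum share of values inside) for k standard deviations."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return mean - k * sd, mean + k * sd, 1 - 1 / k ** 2


low2, high2, share2 = chebyshev_interval(12000, 2000, 2)
low3, high3, share3 = chebyshev_interval(12000, 2000, 3)
print(low2, high2, share2)
print(low3, high3, round(share3, 3))
```

Unlike the empirical rule, this bound makes no assumption about the shape of the distribution, which is why its guarantees are weaker.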
Paired Data Sets and the Sample Correlation Coefficient
• The covariance and the coefficient of correlation are used to measure the direction and strength of the linear relationship between two variables.
  • Covariance: is there any pattern to the way two variables move together?
  • Coefficient of correlation: how strong is the linear relationship between two variables?

Covariance

$$\text{Population covariance} = COV(X, Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$

$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y).
N is the population size.

$$\text{Sample covariance} = cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

$\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y).
n is the sample size.
Covariance
• Compare the following three sets:

Set 1
xi    yi    (x – x̄)   (y – ȳ)   (x – x̄)(y – ȳ)
2     13    -3         -7         21
6     20    1          0          0
7     27    2          7          14
x̄ = 5, ȳ = 20, Cov(x,y) = 17.5

Set 2
xi    yi    (x – x̄)   (y – ȳ)   (x – x̄)(y – ȳ)
2     27    -3         7          -21
6     20    1          0          0
7     13    2          -7         -14
x̄ = 5, ȳ = 20, Cov(x,y) = -17.5

Set 3
xi    yi
2     20
6     27
7     13
x̄ = 5, ȳ = 20, Cov(x,y) = -3.5
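The three covariances can be recomputed from the sample formula (dividing by n − 1). This is a minimal sketch; the helper name `sample_cov` is illustrative.

```python
def sample_cov(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


xs = [2, 6, 7]
cov_same_direction = sample_cov(xs, [13, 20, 27])   # y rises with x
cov_opposite = sample_cov(xs, [27, 20, 13])         # y falls as x rises
cov_weak = sample_cov(xs, [20, 27, 13])             # no clear pattern
print(cov_same_direction, cov_opposite, cov_weak)
```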


Covariance

• If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.
• If the two variables move in opposite directions (one increases when the other one decreases), the covariance is a large negative number.
• If the two variables are unrelated, the covariance will be close to zero.
The coefficient of correlation

Population coefficient of correlation:

$$\rho = \frac{COV(X, Y)}{\sigma_x \sigma_y}$$

Sample coefficient of correlation:

$$r = \frac{cov(X, Y)}{s_x s_y}$$

• This coefficient answers the question: How strong is the association between X and Y?
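Dividing the sample covariance by the two sample standard deviations rescales it to the range −1 to +1. The sketch below applies this to the three small data sets from the covariance comparison; the helper name `sample_corr` is illustrative.

```python
from statistics import mean, stdev


def sample_corr(xs, ys):
    """Sample correlation r = cov(x, y) / (s_x * s_y)."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))


xs = [2, 6, 7]
r_positive = sample_corr(xs, [13, 20, 27])
r_negative = sample_corr(xs, [27, 20, 13])
r_weak = sample_corr(xs, [20, 27, 13])
print(round(r_positive, 3), round(r_negative, 3), round(r_weak, 3))
```

Because r is unit-free, it shows directly that the first two sets have strong relationships of opposite sign while the third is weak, which the raw covariances only hint at.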
ρ or r = +1: strong positive linear relationship (COV(X,Y) > 0)
ρ or r = 0: no linear relationship (COV(X,Y) = 0)
ρ or r = −1: strong negative linear relationship (COV(X,Y) < 0)

• If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
• If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
• No straight line relationship is indicated by a coefficient close to zero.
If r = 0, there is no linear association or correlation between the two variables.

If 0 < |r| < 0.25, the correlation is weak.

If 0.25 ≤ |r| < 0.75, the correlation is intermediate.

If 0.75 ≤ |r| < 1, the correlation is strong.

If |r| = 1, the correlation is perfect.

Population Linear Regression
The population regression model:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where:
y = Dependent variable
β0 = Population y intercept
β1 = Population slope coefficient
x = Independent variable
ε = Random error term, or residual

β0 + β1x is the linear component and ε is the random error component.

Linear Regression Assumptions
• Error values (ε) are statistically independent
• Error values are normally distributed for any given value of x
• The probability distribution of the errors is normal
• The probability distribution of the errors has constant variance
• The underlying relationship between the x variable and the y variable is linear
Population Linear Regression (continued)
[Figure: the line y = β0 + β1x + ε with intercept β0 and slope β1; for a given xi, the random error εi is the gap between the observed value of y and the predicted value of y]
Explained and Unexplained Variation
• Total variation is made up of two parts:

$$SST = SSE + SSR$$

Total sum of squares: $SST = \sum (y - \bar{y})^2$
Sum of squares error: $SSE = \sum (y - \hat{y})^2$
Sum of squares regression: $SSR = \sum (\hat{y} - \bar{y})^2$

where:
ȳ = Average value of the dependent variable
y = Observed values of the dependent variable
ŷ = Estimated value of y for the given x value
SST = total sum of squares
• Measures the variation of the yi values around their mean ȳ
SSE = error sum of squares
• Variation attributable to factors other than the relationship between x and y
SSR = regression sum of squares
• Explained variation attributable to the relationship between x and y

[Figure: for one observation (xi, yi) on the regression plot, the vertical distances illustrate SST = Σ(yi − ȳ)², SSE = Σ(yi − ŷi)², and SSR = Σ(ŷi − ȳ)²]
Coefficient of Determination, R²
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.

The coefficient of determination is also called R-squared and is denoted as R².

$$R^2 = \frac{SSR}{SST}, \quad \text{where } 0 \le R^2 \le 1$$
Coefficient of determination:

$$R^2 = \frac{SSR}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}$$

Note: In the single independent variable case, the coefficient of determination is

$$R^2 = r^2$$

where:
R² = Coefficient of determination
r = Simple correlation coefficient
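The decomposition SST = SSE + SSR and the identity R² = r² can be checked numerically. The sketch below fits the line by ordinary least squares, which is an assumption (the slides do not state the estimation method), and reuses a small data set from the covariance comparison; all variable names are illustrative.

```python
xs = [2, 6, 7]
ys = [13, 20, 27]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Least-squares estimates of the slope and intercept.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

sst = syy                                               # total variation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))    # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation

r_squared = ssr / sst
r = sxy / (sxx * syy) ** 0.5                            # simple correlation coefficient

print(sst, sse, ssr, round(r_squared, 4), round(r ** 2, 4))
```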
Demand for Data Set
Demand Period               1      2      3      4      5
1                           92     80     50     10     0
2                           92     100    80     10     0
3                           92     125    180    15     0
4                           92     100    80     20     0
5                           92     50     0      70     0
6                           92     50     0      180    1105
7                           92     100    180    250    0
8                           92     125    150    270    0
9                           93     125    10     230    0
10                          92     100    100    40     0
11                          92     50     180    0      0
12                          93     100    95     10     0
Total                       1105   1105   1105   1105   1105
Coefficient of variation    0      0.29   0.72   1.41   3.31
Box Plot
– This is a pictorial display that provides the main
descriptive measures of the data set:
• L The largest observation
• Q3 The upper quartile
• Q2 The median
• Q1 The lower quartile
• S The smallest observation
