LECTURED Statistics Refresher
STATISTICS REFRESHER
Introduction
Definition
Statistics is a discipline that involves collecting, summarizing,
analyzing, and presenting numerical data in a convenient form.
It is a discipline that involves extracting useful information from
numerical data.
It turns numerical data into a useful form.
Importance (uses) of statistics
• Accounting
Public accounting firms use statistical sampling procedures
when conducting audits for their clients.
• Finance
Financial analysts use a variety of statistical information,
including price-earnings ratios and dividend yields, to guide
their investment recommendations.
Areas (Types of Statistics)
Two general types of statistics:
1. Descriptive statistics: statistics that summarize
observations.
Are procedures used to summarize, organize, and make
sense of a set of scores or observations.
Typically presented graphically, in tabular form (in tables),
or as summary statistics (single values).
A descriptive value for a population is called a parameter
and a descriptive value for a sample is called a statistic.
It involves processing numerical data into a convenient form.
Explaining the reality
E.g. computing the average test score
2. Inferential statistics
It involves making judgment, inference, generalizations,
estimates etc. about a larger population using the information
that is obtained from the sample.
Measurements of population are called parameters.
Studying about the population is called census.
Studying about sample is called sampling.
While descriptive statistics describe the characteristics of the
observed data and help to reach conclusions about the observed
group only, inferential statistics provide methods for making
generalizations about the whole population based on the
sample of observed data.
Inferential statistics are used to interpret the meaning of descriptive
statistics.
STATISTICAL INFERENCE
It is the process of making statements, forecasts,
prediction, and generalizations about a population
using information obtained from the sample.
It is the primary purpose of statistics because:
The population is usually large (infinite).
Conducting census is too costly or impractical.
Statistical data
Meaning
Data are facts/figures/values that variables will
assume.
Data are raw facts that will be used to draw a
conclusion or make a decision.
Data are the facts and figures that are collected,
summarized, analyzed, and interpreted.
E.g. ABC’s sales revenue is $100 bn.; stock/share price $80.
The data collected in a particular study are referred to
as the data set.
E.g. The sales revenue and stock price data for a number
of firms
Data Sources
Primary (data collection): observation, survey, experimentation
Secondary (data compilation): print or electronic
Types of Data
Data
  Categorical (Qualitative)
  Numerical (Quantitative)
    Discrete
    Continuous
Qualitative Data
Qualitative data are labels or names used to identify an attribute of
each element.
Qualitative data can be either numeric or nonnumeric.
Qualitative data can use either the nominal or ordinal scale of
measurement
1. Nominal (nominative): there is no meaningful ordering, or
ranking of the categories. Example: a person’s gender, the color
of a car, and an employee’s state of residence. It simply assigns
values.
2. Ordinal: There is a meaningful ordering, or ranking of the
categories. The measurements may be nonnumeric or
numerical. Example: a student may be asked to rate the teaching
effectiveness of a college instructor as
i. Excellent, Very good, Good, Poor, unsatisfactory
ii. 1,2,3,4,5
Quantitative Data
Quantitative data indicate either how many or how much.
Quantitative data that measure how many are discrete.
Quantitative data that measure how much are continuous because there is
no separation between the possible values for the data.
Quantitative data are always numeric.
Ordinary arithmetic operations (e.g., +, -) are meaningful only with
quantitative data.
Quantitative data can use either the interval or ratio (rates) scale of
measurement
1. Interval: the ratios of its values are not meaningful and there is not an
inherently defined zero value. Example: Temperature (on the Fahrenheit
scale) is an interval variable. Zero degrees Fahrenheit does not represent
“no heat at all”. In practice there are very few interval variables other than
temperature. Almost all variables are ratio variables.
2. Ratio (rates): values are measured on a scale such that ratios of the values are
meaningful and there is an inherently defined zero value. Example:
salary, weight, height, time, distance
Organization of descriptive data
Tabular Methods of Data Presentation
Tabulation is the arrangement of information or data in tables.
There are various techniques of tabulation.
A) Data Array
Is a table showing data arranged in descending or ascending order.
Descending: (100, 99, 98, 97 ……..)
Ascending: (1, 2, 3,4,5,6,7,8,9 …………)
Data array offers a number of advantages:
Determine at a glance the highest & lowest values
contained in the data.
Identify groups of similar data values.
Easily see differences b/n values in the data.
B) Frequency Distribution
Is a table that groups data into non-overlapping intervals called
classes & records the number of observations in each class.
Summarizes data in a condensed form that can be readily
understood & easily interpreted.
Key Terms in frequency distribution
Class: Each category of the frequency distribution.
Frequency: The number of data values falling within each class.
Total frequency: The sum of class frequencies.
Class Limits: The boundaries for each class (upper/lower limits)
Class Boundaries: Limits which are determined mathematically so
that no gap exists b/n classes. Also called true class limits.
Class Interval: The width of each class = The difference b/n the lower
limit/upper limit of the class & the lower limit/upper limit of the next
higher class.
$$\text{Approximate class width} = \frac{\text{range}}{\text{number of classes desired}}$$
$$\text{Range} = \text{Maximum value} - \text{Minimum value}$$
Class Mark: The midpoint of each class = The midway point b/n the
upper & lower class limits.
Guidelines for frequency distribution
a) The set of classes must be mutually exclusive.
- A given data value should fall into only one class/category.
- No overlap b/n classes.
- Limits such as the following would be inappropriate:
Class    Frequency        Class        Frequency
15-20    4                17.0-23.5    5
20-25    5                22.0-28.5    10
Class boundaries: can be obtained by subtracting 0.5 from
each lower class limit and adding 0.5 to each upper class limit.
99.5 – 104.5 means 99.5 ≤ x < 104.5, i.e. [99.5, 104.5), a half-closed
interval.
104.5 – 109.5 means 104.5 ≤ x < 109.5, i.e. [104.5, 109.5)
Class     Class         Upper      Absolute    Relative    Less than    Lower      More than
limits    boundaries    boundary   frequency   frequency   frequency    boundary   frequency
100-104   99.5-104.5    104.5      2           0.04        2            99.5       50
105-109   104.5-109.5   109.5      8           0.16        10           104.5      48
110-114   109.5-114.5   114.5      18          0.36        28           109.5      40
115-119   114.5-119.5   119.5      13          0.26        41           114.5      22
120-124   119.5-124.5   124.5      7           0.14        48           119.5      9
125-129   124.5-129.5   129.5      1           0.02        49           124.5      2
130-134   129.5-134.5   134.5      1           0.02        50           129.5      1
Total                              50          1
[Figure: histogram of the frequency distribution; x-axis: classes in order (99.5-104.5 through 129.5-134.5), y-axis: frequency]
[Figure: frequency polygon; x-axis: class marks (97 through 137), y-axis: frequency]
c) Cumulative frequency graph (ogive):
The ogive is a graph that displays cumulative values for
frequencies, relative frequencies or percentages.
These values can be either “more than” or “less than”.
Steps in constructing cumulative frequency graph
Find the cumulative frequency for each class,
Draw the x–y axis and label the x–axis with the class
boundaries and y–axis with the cumulative
frequencies, and
Plot the cumulative frequency at each upper class
boundary. Upper class boundaries are used since the
cumulative frequencies represent the number of data
values accumulated up to the upper boundary of each
class.
Example: construct an ogive for the frequency distribution
given in the previous example.
Step 1: Calculate the cumulative frequency for each class
Class boundaries    Less than cumulative frequency    Found by
99.5 - 104.5        2                                 2+0
104.5 - 109.5       10                                2+8
109.5 - 114.5       28                                2+8+18
114.5 - 119.5       41                                2+8+18+13
119.5 - 124.5       48                                2+8+18+13+7
124.5 - 129.5       49                                2+8+18+13+7+1
129.5 - 134.5       50                                2+8+18+13+7+1+1
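The cumulative columns can be checked with a short Python sketch. It assumes only the class frequencies from the table above; the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies for the classes 100-104 through 130-134 (from the table above).
frequencies = [2, 8, 18, 13, 7, 1, 1]

# "Less than" cumulative frequency: running total starting from the first class.
less_than = list(accumulate(frequencies))

# "More than" cumulative frequency: running total starting from the last class.
more_than = list(accumulate(reversed(frequencies)))[::-1]

print(less_than)  # [2, 10, 28, 41, 48, 49, 50]
print(more_than)  # [50, 48, 40, 22, 9, 2, 1]
```

Plotting `less_than` against the upper class boundaries gives the "less than" ogive; plotting `more_than` against the lower class boundaries gives the "more than" ogive.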
Steps 2 & 3
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
where: $N$ = number of elements in the population
$\mu$ = population mean
For ungrouped data, the sample mean is the sum of all the
sample values divided by the number of sample values:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
$\bar{X}$ = sample mean
$n$ = number of elements in the sample (sample size)
A sample of five executives received the following salaries
(Birr in thousands): 14.0, 15.0, 17.0, 16.0, and 15.0. Find the
mean salary.
$$\bar{X} = \frac{\sum X_i}{n} = \frac{14.0 + 15.0 + 17.0 + 16.0 + 15.0}{5} = \frac{77}{5} = 15.4$$
Therefore, the mean salary of the executives is Birr
15,400.00
Properties of Arithmetic mean
1. Arithmetic mean is the most widely used measure of
location/central tendency.
2. All the values are included in computing the mean.
3. A set of data has a unique mean.
4. Every set of quantitative data has a mean.
5. The mean is affected by unusually large or small data values, called
outliers, and may not be the appropriate average to use in
such situations.
6. We cannot determine a mean for open ended data.
7. The sum of the deviations of each value from the mean is
always zero.
Arithmetic mean for grouped data
The mean of a sample of data organized in a
frequency distribution is computed by the following
formula:
$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$
where: $f_i$ = frequency of the $i$th class
$X_i$ = class mark of the $i$th class
$k$ = number of classes
Example: Compute the arithmetic mean for the
following grouped data:
Class Boundaries    Class mark (Xi)    fi    fiXi
5.5-10.5            8                  1     8
10.5-15.5           13                 2     26
15.5-20.5           18                 3     54
20.5-25.5           23                 5     115
25.5-30.5           28                 4     112
30.5-35.5           33                 3     99
35.5-40.5           38                 2     76
Total                                  20    490
$$\sum_{i=1}^{7} f_i = 20, \quad \sum_{i=1}^{7} f_i X_i = 490, \quad \bar{X} = \frac{490}{20} = 24.5$$
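The grouped-mean calculation can be reproduced with a minimal Python sketch. The function name `grouped_mean` is illustrative, not part of the lecture.

```python
# Class marks and frequencies from the example table above.
class_marks = [8, 13, 18, 23, 28, 33, 38]
frequencies = [1, 2, 3, 5, 4, 3, 2]

def grouped_mean(marks, freqs):
    """Mean of grouped data: sum(f_i * X_i) / sum(f_i)."""
    return sum(f * x for x, f in zip(marks, freqs)) / sum(freqs)

mean = grouped_mean(class_marks, frequencies)
print(mean)  # 24.5
```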
2. Weighted Mean: It is a special case of arithmetic mean.
It is the mean value of data values that have been weighted
according to their relative importance.
The formula for the weighted mean of a population or a
sample will be as follows:
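$$\mu_w = \frac{\sum w_i X_i}{\sum w_i} \quad \text{or} \quad \bar{X}_w = \frac{\sum w_i x_i}{\sum w_i}$$
Where: $\mu_w$ is the population weighted mean, $\bar{X}_w$ is the sample weighted mean, and $w_i$ is the weight of the $i$th value.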
b) The GM of 1, 3, 9 is 3:
$$GM = \sqrt[3]{1 \times 3 \times 9} = \sqrt[3]{27} = 3$$
Solution:
$$GM = \sqrt[8]{\frac{835{,}000}{755{,}000}} - 1 = 1.27\%$$
ii) If a person receives a 20% rise in his initial income after one
year of service and a 10% rise after the second year of
service, What is the average percentage increase?
The average percentage raise is not 15% (i.e., $\frac{20\% + 10\%}{2}$) but 14.89%, as shown below:
Let’s show this answer by assuming that the person earns Birr 10,000 at the
beginning and receives two raises of 20% and 10%.
Raise 1=10,000*20%=Birr 2000, Raise 2=12,000*10%=1200
The total increase in his salary is Birr 3200. The total is equivalent to:
Birr 10,000*14.89%=Birr 1489
Birr 10,000 +1489=11,489*14.89%=Birr 1710.71
Total increase= Birr 1489 + Birr 1710.71=3199.71 (almost equal to Birr 3200)
iii) The price of a certain commodity in 1970 was 1.06 times
that of 1969, in 1971 it was 1.04 times that of 1970. In the
next two years it was 1.10 and 1.23 times that of the
respective preceding years. What is the average annual
percentage increase in the given period?
$$GM = \sqrt[4]{1.06 \times 1.04 \times 1.10 \times 1.23} = 1.105, \quad (1.105 - 1) \times 100\% = 10.5\%$$
(the average annual increase is 10.5%)
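The two growth examples above can be checked with a short Python sketch. The helper names are illustrative; the only inputs are the growth factors given in the examples.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def average_growth_rate(factors):
    """Average percentage increase implied by successive growth factors."""
    return (geometric_mean(factors) - 1) * 100

# Raises of 20% and 10% correspond to growth factors 1.20 and 1.10.
raises = average_growth_rate([1.20, 1.10])
# Year-on-year price ratios from example iii).
prices = average_growth_rate([1.06, 1.04, 1.10, 1.23])

print(round(raises, 2), round(prices, 1))  # 14.89 10.5
```

Working with growth factors (1 + rate) rather than the raw percentages is what makes the geometric mean the right average here: the raises compound.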
For grouped data, the geometric mean is calculated as:
$$GM = \sqrt[n]{x_1^{f_1} \times x_2^{f_2} \times \cdots \times x_m^{f_m}}$$
Where: $f_i$ is the frequency of the $i$th class mark, $x_i$ is the class mark, $m$ is the number of values, and $n$ = total
number of observations.
Example: Find the geometric mean for the following grouped
data on the percentage increase in salary of 16 employees of a
company.
% increase in salary    Number of employees    Class mark
0-4                     5                      2
5-9                     6                      7
10-14                   3                      12
15-19                   2                      17
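$$\text{Speed} = \frac{\text{Distance}}{\text{Time}}$$
$$\text{Arithmetic mean (weighted mean)} = \frac{2.5 \times 40 + 2 \times 50}{4.5} = 44.44 \text{ km/hr}$$
This value can also be found by using the harmonic mean formula:
$$HM = \frac{2}{\frac{1}{40} + \frac{1}{50}} = 44.44 \text{ km/hr}$$
Notice that we do not use the simple arithmetic mean of the two speeds to find the
average speed, because the man traveled equal distances at different speeds on the two
trips, so the trips took different amounts of time (they have different weights).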
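As a quick check, the harmonic mean of the two speeds can be compared with total distance divided by total time. This sketch assumes each leg is 100 km, which follows from 2.5 hours at 40 km/hr and 2 hours at 50 km/hr.

```python
from statistics import harmonic_mean

speeds = [40, 50]  # km/hr on two legs of equal distance

hm = harmonic_mean(speeds)

# Cross-check: total distance / total time, assuming 100 km per leg.
total_distance = 100 + 100
total_time = 100 / 40 + 100 / 50  # 2.5 hours + 2 hours
average_speed = total_distance / total_time

print(round(hm, 2), round(average_speed, 2))  # 44.44 44.44
```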
Relationship between Arithmetic mean, Geometric Mean
and Harmonic Mean
Where : md is the lower class boundary/class limit of the median class, n
b. Find $n = \sum f_i = 75$ (odd).
c. Find the position of the median: the $\left(\frac{75 + 1}{2}\right)^{th}$ observation = 38th observation.
d. In which class does the 38th observation fall? In the 3rd class and thus the 3rd class is
the median class
e. Find the cumulative frequency preceding the median class. 20 in this case.
f. Find the class width. 10 in this case.
g. Find the frequency of the median class. 24 in this case.
$$MD = 50 + \frac{\frac{75}{2} - 20}{24} \times 10 = 57.29$$
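The final substitution can be written as a small Python function. This is a sketch of the grouped-median formula used in the steps above; the parameter names are illustrative.

```python
def grouped_median(lower_boundary, n, cum_freq_before, class_freq, width):
    """Median of grouped data: L + ((n/2 - cf) / f) * w."""
    return lower_boundary + (n / 2 - cum_freq_before) / class_freq * width

# Values from steps b to g: L = 50, n = 75, cf = 20, f = 24, w = 10.
md = grouped_median(50, 75, 20, 24, 10)
print(round(md, 2))  # 57.29
```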
Properties of Median
I. Array is a must before we calculate the median.
II. There is a unique median for each data set.
III. Geometrically, median divides the histogram or
cumulative frequency curves into two parts with equal
area.
IV. Median remains unaffected by the magnitude of the
extreme values.
6. Mode (MO): is the most frequent value in a data set.
Example: the examination scores for ten students are:
81, 93, 84, 75, 68, 87, 81, 75, 81 and 87. Because the
score of 81 occurs three times, it is the mode.
A data set may have
No mode at all, e.g. 1, 3, 9, 0, 7, 8
One mode (unimodal), e.g. 1, 3, 1, 7, 1, 9, mode is 1
Two modes (bimodal), e.g. 7,2,4,4,7 , the modes are
7 and 4.
Many modes (multimodal), e.g. 1, 0, 0, 1, 3, 2, 2, 3, 7,
7, 4, 9, the modes are 1, 0, 3, 2, 7.
Mode of a grouped data
• The approximate modal value of a grouped data is
calculated by the following formula:
$$\text{Mode} = L_0 + \frac{f - f_1}{(f - f_1) + (f - f_2)} \times i = L_0 + \frac{f - f_1}{2f - f_1 - f_2} \times i$$
Where:
$L_0$ = lower class boundary of the modal class (i.e., the class with the highest frequency)
$f$ = frequency of the modal class
$f_1$ = frequency of the class immediately preceding the modal class
$f_2$ = frequency of the class immediately following the modal class
$i$ = class interval/width
Example: Find the mode of the following distribution:
Class Limit    Frequency
90-100         10
100-110        37
110-120        65
120-130        80
130-140        51
140-150        35
150-160        18
160-170        4
Solution:
$$\text{Mode} = 120 + \frac{80 - 65}{2 \times 80 - 65 - 51} \times 10 = 120 + \frac{150}{44} = 123.41$$
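The same calculation can be sketched in Python. The function below assumes the modal class is neither the first nor the last class, so that both neighbouring frequencies exist; names are illustrative.

```python
def grouped_mode(lower_boundary, f, f_prev, f_next, width):
    """Approximate mode of grouped data: L0 + (f - f1) / (2f - f1 - f2) * i."""
    return lower_boundary + (f - f_prev) / (2 * f - f_prev - f_next) * width

# Lower class limits and frequencies from the example distribution.
lower_limits = [90, 100, 110, 120, 130, 140, 150, 160]
freqs = [10, 37, 65, 80, 51, 35, 18, 4]

k = freqs.index(max(freqs))  # index of the modal class (120-130)
mode = grouped_mode(lower_limits[k], freqs[k], freqs[k - 1], freqs[k + 1], 10)
print(round(mode, 2))  # 123.41
```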
Properties of mode
I. It is the easiest average to compute.
II. It can be obtained for both qualitative and quantitative
data.
III. It is not affected by extreme values.
IV. The mode may not exist for a data set.
V. It is not unique. A data set can have more than one
mode.
VI. The mode is not based on all observations.
Distribution, shape and measures of central tendency
The relative values of the mean, median and mode are very much
dependent on the shape of the distribution for the data they are
describing.
The data distributions may be described in terms of symmetry and
Skewness.
In other words, data can be either symmetric or skewed depending on
how the data are distributed around the center.
Symmetry (normal, bell shaped) distribution: occurs when the data
values are evenly distributed around the center.
In a symmetrical distribution, the left and right sides of the distribution
are mirror images of each other, and the values of the mean, median and
mode are equal.
Skewed distribution: occurs when the data values are not evenly
distributed around the center.
Skewness is lack of symmetry of a distribution.
Skewness refers to the tendency of the distribution to “tail off” to the
right or left.
Right (positively) skewed distribution: The mean is
greater than the median, which in turn is greater than the
mode.
In such distributions, the median tends to be a better
measure of central tendency than the mean.
MO < MD < AM
Left (negatively) skewed distribution: the mean is less
than the median, which in turn is less than the mode.
As with the positively skewed distribution, the median is
less influenced by extreme values and tends to be a better
measure of central tendency than the mean.
AM < MD < MO
Quartiles, Deciles and Percentiles
Descriptive measures that describe the position (place) of a value in a given
data set or distribution are positional averages.
Measures which divide data into many equal parts are called quantiles
(fractiles).
The most important of these are quartiles, deciles and percentiles.
To obtain such measures, it is mandatory to first arrange the data in
increasing order.
Quartiles: divide the data into four equal parts. The jth quartile, denoted as Qj
where j = 1, 2, 3, is defined as:
$$Q_j = \left(\frac{j(n + 1)}{4}\right)^{th} \text{ observation}$$
Q1 gives the value where 25% of the observations lie below and 75% above
it.
Q2 gives the value where 50% of the observations lie below and 50% above
it.
Q3 gives the value where 75% of the observations lie below and 25% above
it.
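Example: Find the quartiles (Q1, Q2, & Q3) from the
following distribution: 8, 4, 8, 3, 4, 8, 5, 5, 10.
Solution: Arrange first: 3, 4, 4, 5, 5, 8, 8, 8, 10
$$Q_1 = \left(\frac{1(9 + 1)}{4}\right)^{th} \text{item} = (2.5)^{th} \text{ item} = 2^{nd} \text{ item} + 0.5\,(3^{rd} \text{ item} - 2^{nd} \text{ item}) = 4 + 0.5 \times (4 - 4) = 4$$
$$Q_2 = \left(\frac{2(9 + 1)}{4}\right)^{th} \text{item} = (5)^{th} \text{ item} = 5$$
$$Q_3 = \left(\frac{3(9 + 1)}{4}\right)^{th} \text{item} = (7.5)^{th} \text{ item} = 7^{th} \text{ item} + 0.5\,(8^{th} \text{ item} - 7^{th} \text{ item}) = 8 + 0.5 \times (8 - 8) = 8$$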
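The position rule with linear interpolation can be sketched in Python as below. The function name and the clamping of positions that fall outside the data are assumptions added for illustration; the lecture only defines the j(n + 1)/parts position.

```python
def quantile_by_position(data, j, parts):
    """j-th quantile using the (j * (n + 1) / parts)-th observation rule."""
    values = sorted(data)
    position = j * (len(values) + 1) / parts  # 1-based, may be fractional
    lower = int(position)                     # item just below the position
    fraction = position - lower
    if lower < 1:                # assumption: clamp to the smallest value
        return values[0]
    if lower >= len(values):     # assumption: clamp to the largest value
        return values[-1]
    # Interpolate between the lower item and the next item.
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])

data = [8, 4, 8, 3, 4, 8, 5, 5, 10]
quartiles = [quantile_by_position(data, j, 4) for j in (1, 2, 3)]
print(quartiles)  # [4.0, 5.0, 8.0]
```

The same function gives deciles with `parts=10` and percentiles with `parts=100`.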
Quartiles for grouped data
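The quartiles for grouped data can be calculated as follows:
$$Q_i = L_i + \frac{\frac{i \times n}{4} - cf}{f_i} \times w_i$$
Where i = 1, 2, 3
$L_i$ = lower class boundary of the $i$th quartile class (the class which contains the
$\left(\frac{i \times n}{4}\right)^{th}$ item).
$w_i$ = class width, $f_i$ = frequency of the $i$th quartile class, $n$ = total number of observations
$cf$ = the cumulative frequency of the class preceding the $i$th quartile class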
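Q1: the $\left(\frac{n}{4}\right)^{th}$ item = $\left(\frac{20}{4}\right)^{th}$ item = 5th item, which falls in the 3rd class; 15.5 - 20.5 is the first quartile class.
$$Q_1 = 15.5 + \frac{\frac{1 \times 20}{4} - 3}{3} \times 5 = 18.83$$
Q2 = ?
The $\left(\frac{2n}{4}\right)^{th}$ item = $\left(\frac{40}{4}\right)^{th}$ item = 10th item, which falls in the 4th class; 20.5 - 25.5 is the second quartile class.
$$Q_2 = 20.5 + \frac{\frac{2 \times 20}{4} - 6}{5} \times 5 = 20.5 + 4 = 24.5 = \text{median}$$
Q3 = ?
The $\left(\frac{3n}{4}\right)^{th}$ item = $\left(\frac{60}{4}\right)^{th}$ item = 15th item, which falls in the 5th class; 25.5 - 30.5 is the third quartile class.
$$Q_3 = 25.5 + \frac{\frac{3 \times 20}{4} - 11}{4} \times 5 = 25.5 + 5 = 30.5$$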
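The three grouped quartiles can be reproduced with one Python function. This sketch assumes equal class widths and uses the frequency table from the grouped-mean example (boundaries 5.5-10.5 through 35.5-40.5); the function name is illustrative.

```python
from itertools import accumulate

def grouped_quantile(lower_boundaries, freqs, width, j, parts=4):
    """L + ((j*n/parts - cf) / f) * w for the class containing the target item."""
    n = sum(freqs)
    target = j * n / parts
    cumulative = list(accumulate(freqs))
    # First class whose cumulative frequency reaches the target item.
    k = next(i for i, c in enumerate(cumulative) if c >= target)
    cf_before = cumulative[k - 1] if k > 0 else 0
    return lower_boundaries[k] + (target - cf_before) / freqs[k] * width

lower = [5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 35.5]
freqs = [1, 2, 3, 5, 4, 3, 2]

q1, q2, q3 = (grouped_quantile(lower, freqs, 5, j) for j in (1, 2, 3))
print(round(q1, 2), round(q2, 2), round(q3, 2))  # 18.83 24.5 30.5
```

Setting `parts=10` or `parts=100` applies the same formula to deciles and percentiles.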
Interpretation of Q1, Q2, & Q3 ………..
Deciles: are measures that divide a distribution/data set into ten equal parts.
The jth decile for a simple frequency distribution (ungrouped data), denoted as Dj, where j = 1,
2, 3, ..., 9, is defined as
$$D_j = \left(\frac{j(n + 1)}{10}\right)^{th} \text{ observation}$$
D1 gives the value where 10% of the observations lie below and 90% above it
D2 gives the value where 20% of the observations lie below and 80% above it
D3 gives the value where 30% of the observations lie below and 70% above it
.
D9 gives the value where 90% of the observations lie below and 10% above it
Deciles for grouped data
$$D_i = L_i + \frac{\frac{i \times n}{10} - cf}{f_i} \times w_i$$
Where i = 1, 2, 3, 4, ..., 9
$L_i$ = lower class boundary of the $i$th decile class (the class which contains the
$\left(\frac{i \times n}{10}\right)^{th}$ item).
$w_i$ = class width, $f_i$ = frequency of the $i$th decile class, $n$ = total number of observations
$cf$ = the cumulative frequency of the class preceding the $i$th decile class
Percentiles
Percentiles divide a distribution/data set into 100 equal parts.
– The jth percentile for a simple frequency distribution (ungrouped
data), denoted as Pj, where j = 1, 2, 3, ..., 99, is defined as:
$$P_j = \left(\frac{j(n + 1)}{100}\right)^{th} \text{ observation}$$
P1 gives the value where 1% of the observations lie below and 99%
above it
P2 gives the value where 2% of the observations lie below and 98%
above it
.
.
P99 gives the value where 99% of the observations lie below and 1%
above it
Percentiles For grouped data
$$P_i = L_i + \frac{\frac{i \times n}{100} - cf}{f_i} \times w_i$$
Where i = 1, 2, 3, 4, ..., 99
$L_i$ = lower class boundary of the $i$th percentile class (the class which contains the
$\left(\frac{i \times n}{100}\right)^{th}$ item).
$w_i$ = class width
$f_i$ = frequency of the $i$th percentile class
$n$ = total number of observations
$cf$ = the cumulative frequency of the class preceding the $i$th percentile class
Measures of Dispersion
Dispersion is the scatter or variation of items from a measure of
central tendency.
It measures the extent to which the values vary among themselves.
Example: Consider the following data on the expenditures of two
groups of workers:
– Group A: ETB 6200 2000 1300 1300 1200 (the mean is
ETB 2400)
– Group B: ETB 1600 1700 1300 4200 3200 (the mean is
ETB 2400)
We simply conclude that the two groups spend identical amount, if
we were given only the average expenditure of the two groups
without knowing the actual expenditures.
But the actual observations indicate that more variation is
observed in group A.
Consequently, there is a need to have a measure of
dispersion to observe variability of data.
A measure of dispersion may be in an absolute form or a
relative form.
An absolute measure expresses the magnitude of dispersion in
the same unit of measurement in which the data are
recorded.
However, a relative measure (which is unitless) expresses
dispersion in percentages or ratios. It is a quotient obtained
by dividing the absolute measure by a quantity in respect to
which the absolute dispersion has been computed.
Qualities of a Good Measure of Dispersion
I. It should be based on all observations
II. It should be easily calculated.
III. It should be easily understandable.
IV. It should be affected as little as possible by sampling
fluctuations.
V. It should be capable of further statistical treatment.
Types of Measures of Dispersion
1. Range
2. Mean Deviation
3. Variance and Standard Deviation
4. Coefficient of Variation
1) Range
Range is defined as the difference between the smallest and
the largest observations in a given set of raw data.
Properties of Range:
Only two values are used in its calculation
It is influenced by an extreme value (Outliers).
It is easy to compute and understand.
It is the crudest measure of dispersion.
It cannot be determined for an open ended data.
The greater the range, the higher the variability of the
data and vice versa.
Example: Find the ranges of the following two groups.
– Group A: ETB 6200 2000 1300 1300 1200 (the mean is ETB 2400)
– Group B: ETB 1600 1700 1300 4200 3200 (the mean is ETB 2400)
Solution:
• For Group A :
The highest expenditure = 6200 birr
The lowest expenditure = 1200 birr
Range = highest value – lowest value
= 6200 – 1200 = 5000 Birr
• For Group B :
The highest expenditure = 4200
The lowest expenditure = 1300
Range = 4200 – 1300 = 2900 Birr
• Therefore, in terms of expenditure more variation is observed
in group A.
Note that: For discrete grouped data we use the same
formula as given above, i.e., the difference between the
highest and lowest values.
Example: Compute the range of the following data.
Table: Results (out of 35%) of 20 students in Cost Accounting
test.
Xi    6    24    18    22    30    15
Fi    3    2     5     1     4     5
• Solution:-
UCBL = 30.5, LCBF = 5.5.
$$\text{Coefficient of range} = \frac{30.5 - 5.5}{30.5 + 5.5} \times 100\% = 69.4\%$$
Points to note:
1. Range is as good a measure of dispersion as any other
where the data consist of a few observations.
2. It is advantageous when one wants to know only the
extent of the extreme dispersion under “ordinary”
conditions.
3. It tells us nothing about the dispersion of the values which
fall between the two extremes.
4. It is highly affected if the value of either of the two extremes
changes.
2. Mean Deviation
Mean Deviation measures the average deviation /scatters of
a set of observations about a central value(mean/median).
For ungrouped data
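With $n = \sum f_i = 40$:
$$\text{Position of the median} = \frac{\left(\frac{n}{2}\right)^{th} \text{value} + \left(\frac{n}{2} + 1\right)^{th} \text{value}}{2} = \frac{20 + 21}{2} = 20.5^{th} \text{ value}$$
$$\text{Mean} = \frac{\sum f_i X_i}{n} = \frac{685}{40} = 17.125$$
$$\text{Median} = L_{md} + \frac{\frac{n}{2} - CF}{F_{md}} \times CW_{md} = 15.5 + \frac{20 - 15}{15} \times 5 = 17.167$$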
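Therefore, the mean deviation is $\frac{\sum f_i |D_i|}{40}$, where $D_i$ is the deviation of the $i$th class mark from the chosen central value (mean or median).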
$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$$
where:
$\mu$ = Mean (population)
$N$ = total number of observations
$$S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$$
Where;
$n$ = sample size
$\bar{X}$ = mean
Alternatively, we can simplify it as follows
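$$\begin{aligned}
S^2 &= \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{\sum \left(X_i^2 + \bar{X}^2 - 2\bar{X}X_i\right)}{n - 1} \\
&= \frac{\sum X_i^2}{n - 1} + \frac{n\bar{X}^2}{n - 1} - \frac{2\bar{X}\sum X_i}{n - 1} \\
&= \frac{\sum X_i^2}{n - 1} + \frac{n\bar{X}^2}{n - 1} - \frac{2n\bar{X}^2}{n - 1} \\
&= \frac{\sum X_i^2 - n\bar{X}^2}{n - 1} = \frac{n\sum X_i^2 - \left(\sum X_i\right)^2}{n(n - 1)}
\end{aligned}$$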
Cont’d: Standard Deviation
The population standard deviation is the square root of the population variance.
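$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$
and the sample standard deviation is the square root of the sample variance.
$$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} \quad \text{for small sample size, and}$$
$$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}} \quad \text{for large sample size}$$
Alternatively, for a small sample (less than about 30):
$$S = \sqrt{\frac{n\sum X_i^2 - \left(\sum X_i\right)^2}{n(n - 1)}}$$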
Example: From the sample data given below, compute the
variance and standard deviation.
10, 15, 30, 22, 41, 32
Solution:- n = 6
Xi      Xi²
10      100
15      225
30      900
22      484
41      1681
32      1024
Total   ΣXi = 150    ΣXi² = 4414
$$S^2 = \frac{n\sum X_i^2 - \left(\sum X_i\right)^2}{n(n - 1)} = \frac{6 \times 4414 - 150^2}{6 \times 5} = 132.8$$
$$S = \sqrt{S^2} = \sqrt{132.8} = 11.52$$
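The shortcut formula can be verified in Python against the standard library, whose `statistics.variance` uses the n - 1 divisor. Variable names are illustrative.

```python
import math
import statistics

data = [10, 15, 30, 22, 41, 32]
n = len(data)
sum_x = sum(data)                  # 150
sum_x2 = sum(x * x for x in data)  # 4414

# Shortcut formula: (n * sum(X^2) - (sum X)^2) / (n * (n - 1))
variance = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
std_dev = math.sqrt(variance)

print(variance, round(std_dev, 2))  # 132.8 11.52

# Cross-check with the definitional formula from the standard library.
library_variance = statistics.variance(data)
```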
Variance and Standard deviations for grouped data
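For grouped data the population and sample variance, denoted by $\sigma^2$ and $S^2$ respectively, are given
by:
$$\sigma^2 = \frac{\sum f_i (X_i - \mu)^2}{\sum f_i} = \frac{N\sum f_i X_i^2 - \left(\sum f_i X_i\right)^2}{N^2}$$
$$S^2 = \frac{\sum f_i (X_i - \bar{X})^2}{n} = \frac{n\sum f_i X_i^2 - \left(\sum f_i X_i\right)^2}{n^2}$$
in which the $X_i$'s are the class mid-points, and $\sum f_i = N$ for the population and $\sum f_i = n$ for the
sample.
Alternatively, for small sample size we can use:
$$S^2 = \frac{n\sum f_i X_i^2 - \left(\sum f_i X_i\right)^2}{n(n - 1)}$$
By definition, standard deviations in each case are the square roots of the respective variances.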
Example: From the continuous frequency distribution given
below, compute the sample variance and standard deviation.
Class limits   Class mark   fi   fiXi   Xi - X̄    (Xi - X̄)²   fi(Xi - X̄)²   Xi²   fiXi²
(scores)       (Xi)
6 – 10         8            5    40     -9.125    83.26       416.328       64    320
11 – 15        13           10   130    -4.125    17.016      170.16        169   1690
16 – 20        18           15   270    0.875     0.7656      11.48         324   4860
21 – 25        23           7    161    5.875     34.516      241.609       529   3703
26 – 30        28           3    84     10.875    118.26      354.80        784   2352
Total                       40   685              253.82      1194.38             12925
[n=40>30, better to use the formula for large sample.]
Therefore, for small sample size
$$S^2 = \frac{\sum f_i (X_i - \bar{X})^2}{n - 1} = \frac{1194.38}{40 - 1} = 30.625$$
$$S = \sqrt{S^2} = \sqrt{30.625} = 5.534$$
Alternatively,
$$S^2 = \frac{n\sum f_i X_i^2 - \left(\sum f_i X_i\right)^2}{n(n - 1)} = \frac{40 \times 12925 - 685^2}{40 \times 39} = 30.625$$
$$S = \sqrt{30.625} = 5.534$$
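A short Python sketch reproduces the grouped calculation from the class marks and frequencies, using the n - 1 shortcut formula. Variable names are illustrative.

```python
import math

marks = [8, 13, 18, 23, 28]   # class marks
freqs = [5, 10, 15, 7, 3]     # class frequencies

n = sum(freqs)                                         # 40
sum_fx = sum(f * x for x, f in zip(marks, freqs))      # 685
sum_fx2 = sum(f * x * x for x, f in zip(marks, freqs)) # 12925

# Shortcut formula with the n - 1 divisor.
variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
std_dev = math.sqrt(variance)

print(variance, round(std_dev, 3))  # 30.625 5.534
```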
Important properties of Variance /Standard Deviation
The variance/standard deviation of any constant is always
zero.
A standard deviation of zero implies that there is no
variation at all in the data set. In other words the data
values are the same.
A variance/standard deviation can never be a negative number.
If a constant is added to or subtracted from each observation,
the variance/standard deviation of the resulting observations
will not be affected.
If every observation is multiplied by a constant K, then the
new variance will be K² times the original variance and the
new standard deviation will be |K| times the original
standard deviation.
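If there are two sets of data consisting of $n_1$ and $n_2$ observations with $S_1^2$ and $S_2^2$ as their
respective variances, the combined variance $S_C^2$ of the $(n_1 + n_2)$ observations is
$$S_C^2 = \frac{n_1\left(S_1^2 + d_1^2\right) + n_2\left(S_2^2 + d_2^2\right)}{n_1 + n_2}$$
where $d_1^2 = \left(\bar{X}_1 - \bar{X}_C\right)^2$ and $d_2^2 = \left(\bar{X}_2 - \bar{X}_C\right)^2$.
Herein, the combined mean is
$$\bar{X}_C = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$$
In case $\bar{X}_1 = \bar{X}_2$:
$$S_C^2 = \frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2}$$
Further, when $n_1 = n_2$:
$$S_C^2 = \frac{S_1^2 + S_2^2}{2}$$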
• If Y represents a linear transformation of X as Y = a + bX, with a
as the additive constant and b as the multiplicative constant,
then the variance of Y is $S_Y^2 = b^2 S_X^2$,
where $S_X^2$ is the variance of X. It follows that the standard deviation of Y
is $|b| S_X$, where $S_X$ is the standard deviation of X.
Example: Calculate the standard deviation of the combined
group of 400 items from the following data.
                       Group A   Group B   Group C
Number of items (ni)   50        150       200
Mean (X̄i)              40        50        60
Variance (Si²)         81        100       121
Solution:-
$$\bar{X}_C = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3}{n_1 + n_2 + n_3} = \frac{50(40) + 150(50) + 200(60)}{50 + 150 + 200} = 53.75$$
$d_i = \bar{X}_i - \bar{X}_C$
d1 = 40 – 53.75 = -13.75
d2 = 50 – 53.75 = -3.75
d3 = 60 – 53.75 = 6.25
Consequently, the combined variance is given as
$$S_C^2 = \frac{n_1\left(S_1^2 + d_1^2\right) + n_2\left(S_2^2 + d_2^2\right) + n_3\left(S_3^2 + d_3^2\right)}{n_1 + n_2 + n_3}$$
$$= \frac{50\left(81 + 13.75^2\right) + 150\left(100 + 3.75^2\right) + 200\left(121 + 6.25^2\right)}{400}$$
$$= \frac{13503 + 17109 + 32012}{400} = 156.56$$
$$S_C = \sqrt{156.56} = 12.512$$
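The combined mean and variance generalise to any number of groups. The sketch below applies the formula to the three groups in the example; the function name is illustrative.

```python
import math

def combined_mean_variance(sizes, means, variances):
    """Combined mean and variance of several groups.

    Variance formula: sum(n_i * (S_i^2 + d_i^2)) / sum(n_i),
    where d_i is the group mean minus the combined mean.
    """
    total = sum(sizes)
    mean_c = sum(n * m for n, m in zip(sizes, means)) / total
    var_c = sum(
        n * (v + (m - mean_c) ** 2)
        for n, m, v in zip(sizes, means, variances)
    ) / total
    return mean_c, var_c

mean_c, var_c = combined_mean_variance([50, 150, 200], [40, 50, 60], [81, 100, 121])
print(mean_c, round(var_c, 2), round(math.sqrt(var_c), 3))  # 53.75 156.56 12.512
```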
4. Coefficient of Variation
A useful measure of dispersion when the data are in
different units or the data are in the same units but the
means are far apart.
It is defined as the ratio of the standard deviation to the
arithmetic mean (where mean is different from zero),
expressed as a percentage:
$$CV = \frac{\text{Standard deviation}}{\text{Mean}} \times 100\%$$
for population
$$CV = \frac{\sigma}{\mu} \times 100\%$$
while for sample, it is obtained as
$$CV = \frac{S}{\bar{X}} \times 100\%$$
Coefficient of variation (CV) helps us for comparing the
– Variability,
– Heterogeneity /homogeneity,
– Uniformity, &
– Consistency of two or more distributions.
A series /distribution with smaller coefficient of variation is
said to be more homogenous /uniform/ consistent than the
other distribution, and vice versa.
Example: The number of employees, the average wages and
the variance of the wages for two factories are given below.
Which factory is consistent in respect to the wages of
employees?
Summary of wage & employees of two factories.
Factory A Factory B
Number of employees 50 100
Average wages 120 85
Variance of the wages 9 16
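Solution:
Factory A: given $n_A = 50$, $\bar{X}_A = 120$, $S_A^2 = 9$
Factory B: given $n_B = 100$, $\bar{X}_B = 85$, $S_B^2 = 16$
$$CV_A = \frac{S_A}{\bar{X}_A} \times 100\% = \frac{3}{120} \times 100\% = 2.5\%$$
$$CV_B = \frac{S_B}{\bar{X}_B} \times 100\% = \frac{4}{85} \times 100\% = 4.71\%$$
Since $CV_A < CV_B$, Factory A is more consistent in respect to the wages of its employees.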
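The comparison of the two factories can be sketched in Python from the given means and variances. The function name is illustrative.

```python
import math

def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: (standard deviation / mean) * 100."""
    return std_dev / mean * 100

# The table gives variances, so take square roots to get standard deviations.
cv_a = coefficient_of_variation(math.sqrt(9), 120)
cv_b = coefficient_of_variation(math.sqrt(16), 85)

print(round(cv_a, 2), round(cv_b, 2))  # 2.5 4.71
```

The factory with the smaller CV has the more uniform wages, which here is Factory A.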
$$Z = \frac{X_i - \bar{X}}{S}$$
Example: Helen scored 65 in Auditing and Samuel scored 70
in Auditing. If the average score of the whole students in
Auditing is 67 and standard deviation equal to 3, which
student performs better?
Solution
$$Z_{Helen} = \frac{X_{Helen} - \bar{X}}{S} = \frac{65 - 67}{3} = -0.67$$
$$Z_{Samuel} = \frac{X_{Samuel} - \bar{X}}{S} = \frac{70 - 67}{3} = 1$$
Therefore, Samuel performed better in Auditing than Helen, and better than the average result of
all the students.
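The two standard scores can be checked with a minimal Python sketch; the function name is illustrative.

```python
def z_score(x, mean, std_dev):
    """Standard score: number of standard deviations x lies from the mean."""
    return (x - mean) / std_dev

z_helen = z_score(65, 67, 3)
z_samuel = z_score(70, 67, 3)

print(round(z_helen, 2), z_samuel)  # -0.67 1.0
```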
Moments, Skewness, and Kurtosis
Moments
Moments tell us information about the “shape” of
the distribution
It is represented by Mr, r =0, 1, …, r, which is called
the rth moment.
We can have moments about any constant number,
about the mean, zero or any desired value.
In general, the rth moment about any arbitrary constant
number, say A, is given by
$$M_r = \frac{\sum (X_i - A)^r}{n}$$
Note: For grouped data the rth moment about any
constant number, say A, is given as:
$$M_r = \frac{\sum f_i (X_i - A)^r}{\sum f_i}$$
where;
$f_i$ => Frequency of $X_i$ in case of discrete grouped data
$f_i$ => Frequency of the $i$th class in case of continuous grouped data,
and here $X_i$ is the class mark of the $i$th class
Class work: Using the following data, compute
the first THREE moments about TWO(2).
– Also calculate the first THREE central
moments.
4, 4, 5, 6, 7, 8, 10 , 10, 24, 26
Solution (moments about 2):
M0 = 1
M1 = 8.4
M2 = 128.2
M3 = ?
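A minimal Python sketch for the class work: one function computes the rth moment about any constant A, so the same code gives moments about 2 and, with A set to the mean, the central moments. The function name is illustrative.

```python
def moment_about(data, r, a):
    """r-th moment about the constant a: sum((x - a)^r) / n."""
    return sum((x - a) ** r for x in data) / len(data)

data = [4, 4, 5, 6, 7, 8, 10, 10, 24, 26]

# First three moments about A = 2.
about_two = [moment_about(data, r, 2) for r in (1, 2, 3)]

# First three central moments (moments about the mean).
mean = sum(data) / len(data)
central = [moment_about(data, r, mean) for r in (1, 2, 3)]

print(about_two[:2])  # [8.4, 128.2]
```

The first central moment is always zero, which is a useful sanity check on `central[0]`.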
Skewness
Skewness refers to lack of symmetry.
We study skewness to have an idea about the shape of the
curve which we can draw with the help of the frequency
distribution.
Frequency distributions are often found to be skewed on either side
of their central value. As a result, the distribution has a longer tail either to
the left or to the right.
If there is a longer tail to the right of the center, the
distribution is said to be positively skewed.
If the tail is longer to the left of the center, the distribution is
said to be negatively skewed.
A positive Skewness means a greater dispersal of individual
observations towards the right of the central value.
A negative Skewness, on the other hand, implies that
individual observations have greater dispersal towards the
left of the central value.
Skewness, therefore, not only refers to the lack of symmetry
in distribution, it also shows the direction of dispersion of
individual observations on either side of the center of the
distribution.
Accordingly, a measure of skewness quantifies the extent of
departure from symmetry and also indicates the direction in
which the departure takes place.
Measures of Skewness:
a) Moment coefficient of Skewness
b) Pearsonian coefficient of Skewness
a) Moment coefficient of Skewness
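In terms of the moment coefficient, skewness is defined as:
$$\alpha_3 = \frac{M_3}{M_2^{3/2}} = \frac{M_3}{S^3}$$
Where $M_3$ is the third moment about the mean and $M_2 = S^2$ = variance
Interpretation:
(1) If $\alpha_3 = 0$ => Symmetrical distribution
(2) If $\alpha_3 < 0$ => Negatively skewed distribution
(3) If $\alpha_3 > 0$ => Positively skewed distribution
(4) A greater or smaller absolute value of $\alpha_3$ means a greater or smaller degree of skewness.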
b) Pearsonian coefficient of Skewness
This measure is based on the fact that when a distribution
drifts away from symmetry, its mean, median, and mode
tend to deviate from each other.
This results from the presence of exceptionally high
or low observations, which affect the value of the mean the
most, and that of the mode the least.
Thus, it is the direction in which mode drifts from mean that
determines whether a distribution will have positive or
negative skewness.
• Thus, the Pearsonian coefficient of skewness is defined as:
$$SK = \frac{\text{Mean} - \text{Mode}}{S}$$
In which S is the standard deviation. Using the empirical relationship among mean, mode and median in a
moderately skewed distribution, i.e., mode = mean – 3(mean – median), the above equation can be modified
as
$$SK = \frac{3(\text{Mean} - \text{Median})}{S}$$
Note:
1. $-3 \le SK \le 3$
2. If $SK = 0$, the distribution is symmetrical
3. If $SK > 0$, the distribution is positively skewed
4. If $SK < 0$, the distribution is negatively skewed
Example 5.23. Find the skewness of the following data using the Pearsonian coefficient of skewness.
Solution:-
Arrange the data in increasing order
1, 2, 4, 5, 6, 7, 8, 10, 30, 32
Median = 6.5
Mean = 10.5
$S^2$ = 124.06
$S$ = 11.14
Therefore, $SK = \frac{3(10.5 - 6.5)}{11.14}$
= 1.077
Interpretation: The distribution is positively skewed.
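The example can be checked with Python's standard library, using the median-based form of the Pearsonian coefficient and the sample standard deviation (n - 1 divisor), which matches the variance of 124.06 above.

```python
import statistics

data = [1, 2, 4, 5, 6, 7, 8, 10, 30, 32]

mean = statistics.mean(data)      # 10.5
median = statistics.median(data)  # 6.5
s = statistics.stdev(data)        # sample standard deviation, about 11.14

# Pearsonian coefficient of skewness (median form).
sk = 3 * (mean - median) / s
print(round(sk, 3))  # 1.077
```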
Kurtosis
• Kurtosis measures the flatness or
peakedness at the top of the distribution.
• Taking a symmetrical distribution as a frame of reference,
– a distribution which is more peaked than the normal is
known as a Leptokurtic distribution.
– The one whose polygon is flat at its top is called a
Platykurtic distribution.
– A distribution with a polygon which is neither too high in
peak, nor too flat at the top is termed a Mesokurtic
distribution.
(i) The coefficient of Kurtosis
The coefficient of kurtosis denoted by K is defined as a ratio of inter-quartile range to inter-
decile range.
𝑄3 −𝑄1
K=
𝐷9 −𝐷1
Interpretation:
If K = 0.5, approximately the distribution is Mesokurtic
If K > 0.5, approximately the distribution is leptokurtic
If K<0.5, approximately the distribution is platykurtic.
Moment coefficient of Kurtosis
The moment coefficient of kurtosis is kurtosis in terms of the fourth moment about the mean, denoted by $B_2$, and is
defined as
$$B_2 = \frac{M_4}{M_2^2} = \frac{M_4}{S^4}$$
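A minimal Python sketch of the moment coefficient of kurtosis follows. The data set is hypothetical, chosen only for illustration, and the moments use the divisor n, as in the definition of $M_r$ given earlier.

```python
def central_moment(data, r):
    """r-th moment about the mean, with divisor n."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)

def moment_kurtosis(data):
    """B2 = M4 / M2^2, with M2 and M4 taken about the mean."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

# Hypothetical data, not from the lecture.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
b2 = moment_kurtosis(sample)
print(round(b2, 2))  # 2.78
```

As a standard reference point, a normal distribution has $B_2 = 3$, so values above or below 3 are usually read as more or less peaked than the normal.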