Foundation Notes 2013
Arranging Data
In this lesson we will get familiar with data and its various types. We will also discuss the methods of
data collection. Then we will focus on various data presentation tools such as tables and graphs (line
chart, bar chart, pie diagram, pictogram and scatter diagram).
We will then get familiar with the frequency distribution and the frequency polygon, and study
the properties (skewness and kurtosis) of the frequency distribution curve.
What is Data?
Data is a collection of related observations, facts or figures. A collection of data is called a data set, and
each observation a data point.
Example: Marks obtained by students in Introduction to
Quantitative Methods course
Types of Data
Raw Data: Information before its systematic arrangement and analysis is called raw data. Useful
inferences can be derived from the raw data by applying various statistical methods.
Example: Sales data of a company for a year
Data can be classified as:
Published Data
Unpublished Data
Secondary Data
By magnitude
In graphical presentation, the collected data is represented by various types of geometrical devices such
as points, lines, bars, multi-dimensional figures, pictorials, etc. A graphical method is primarily a visual,
non-numerical form of presentation, although the quantities are usually indicated alongside the figures.
The magnitude of the data is depicted visually through the proportional size of the diagram or graph.
A line chart is one of the effective graphical methods for depicting the trend in data. If the line rises from
left to right, the data shows an increasing trend; if it falls, the trend is decreasing.
Bar Chart
Bar charts use rectangles, referred to as bars, to present the data. There are two types of bar
charts: vertical and horizontal. These diagrams are one-dimensional, as the magnitude of the data is
represented by the length of the bar. The thickness or width of the bar has no relevance. The bars should be
arranged from left to right.
The given bar diagram shows the yearly sales of a company.
Multiple bar diagram or compound bar diagrams are used to compare two or more sets of related data.
This diagram is similar to the simple bar diagram, but bars in each set are placed together and gap is left
between each set of bars.
The given multiple bar diagram shows yearly export import values of a company.
Pie Diagram
A pie diagram is a circle divided into segments, each segment representing the percentage
contribution of one component to the total. Pie diagrams are used to compare many components
simultaneously.
For drawing a pie diagram it is necessary to express the value of each category as a percentage of the
total. The 360° of the circle represent the whole (i.e., 100%), so 3.6° constitute 1% of the total.
Degrees of each part = (Part / Total) × 360 = Part (in %) × 3.6
The pie diagram represents the share holding pattern of a company.
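The degree calculation for a pie diagram can be sketched in a few lines of Python. The shareholding figures below are hypothetical, chosen only to illustrate the formula; they are not taken from the diagram in these notes.

```python
def pie_angles(values):
    """Return the central angle (in degrees) of each segment.

    Each angle is (part / total) * 360, as in the formula above.
    """
    total = sum(values.values())
    return {name: value / total * 360 for name, value in values.items()}


# Hypothetical shareholding pattern in percent (illustrative only).
holdings = {"Promoters": 50, "Institutions": 30, "Public": 20}
angles = pie_angles(holdings)

for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```

Because the angles are proportions of 360, they always add up to a full circle, which is a quick check on the arithmetic.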
Pictogram
Pictograms represent the data in the form of pictures. The data is presented using appropriate pictures
and their sizes indicate the magnitude of the data.
Scatter Diagram
A scatter diagram is used to study the correlation between two variables. The scatter diagram
is drawn by plotting the paired observations as points against the X and Y axes. When the points on the
graph follow a clear pattern, it indicates high correlation; an irregular pattern indicates low correlation.
Frequency Distribution
A frequency distribution table is a table in which raw data is tabulated by dividing it into classes of
convenient size and computing the number of data elements (or their fraction of the total) falling within
each pair of class boundaries.
Classes are groups of values that share a characteristic of the data. E.g. employees of a company are
grouped together on the basis of their ages.
The end values of a class are called its class limits, and the middle of a class interval is called the class
mark. For the class 25-29, 25 and 29 are the class limits, 27 is the class mark and
30 - 25 = 5 is the width of the class interval.
A cumulative frequency distribution is a tabular display of data showing how many observations lie
above, or below, certain values.
Construction of Frequency Distribution
To construct a frequency distribution, the data is to be divided into groups of similar intervals. Then the
number of data points that fall into each group has to be recorded against each group.
Frequency distributions can be constructed with classes of qualitative attributes. The classification can be
either quantitative or qualitative and either discrete or continuous classes.
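The construction steps above can be sketched in Python. This is a minimal illustration that assumes integer data and equal-width classes of the 25-29 kind; the function name and the ages are hypothetical.

```python
from collections import Counter


def frequency_distribution(data, lower, width, n_classes):
    """Tabulate integer data into equal-width classes.

    Returns a list of (class label, frequency, cumulative frequency).
    Values outside the covered range are ignored.
    """
    counts = Counter()
    for value in data:
        index = (value - lower) // width
        if 0 <= index < n_classes:
            counts[index] += 1

    table = []
    cumulative = 0
    for i in range(n_classes):
        lo = lower + i * width
        hi = lo + width - 1
        cumulative += counts[i]
        table.append((f"{lo}-{hi}", counts[i], cumulative))
    return table


# Hypothetical ages of 12 employees (illustrative data only).
ages = [22, 27, 25, 31, 34, 29, 26, 38, 23, 30, 27, 33]
table = frequency_distribution(ages, lower=20, width=5, n_classes=4)
for label, freq, cum in table:
    print(label, freq, cum)
```

The third column is the "less than" cumulative frequency, the quantity plotted in a less-than ogive.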
Histogram
A histogram is a series of rectangles, the width of each being proportional to the range of values within a
class and height being proportional to the number of items falling in the class. The widths of the bars are
uniform when the widths of classes in a frequency distribution are equal.
When a histogram is constructed using relative frequency, it is called a relative frequency histogram.
While the absolute histogram represents the number of data items, the relative frequency histogram
shows the relative size of each class with the total.
Frequency Polygon
For constructing a frequency polygon, the frequencies are marked on the vertical axis and the values of
the variable being studied are taken on the horizontal axis. Dots are put on the graph against the
class marks to represent the frequencies. Connecting these dots with straight lines forms
a frequency polygon. When the straight lines are smoothed by adding classes and data points, the result
is called a frequency curve.
Frequency polygons represent graphically both simple and relative frequency distributions.
Ogive
Frequency Distribution Table
The Less than Ogive Curve for the above Frequency Distribution is:
Skewness and kurtosis are two characteristics of data sets that reveal useful trends and patterns
in the data represented as frequency distribution curves.
Skewness is the extent to which a distribution of data points is concentrated at one end or the other, i.e.
the lack of symmetry in the curve. The curves representing the data points in the data set can be of two
types:
Symmetrical curves :- A curve is said to be symmetrical when a vertical line drawn from the
center of the curve to the X-axis divides the area under the curve into equal parts.
Skewed curves (positively or negatively skewed):-A curve is said to be skewed when the
values in the frequency distribution are concentrated more towards the left or right side of the
curve i.e. the values are not equally distributed from the center of the curve. A curve is said to
be positively skewed when the tail of the curve is more stretched towards the right side. It is
said to be negatively skewed when the tail is more stretched towards the left side.
Kurtosis
Kurtosis measures the degree of peakedness of a distribution of points. Two curves with the same
central location and dispersion may have different degrees of kurtosis.
Summary
Data collection is done in two ways: complete enumeration and the sample method.
Data is systematically and clearly represented in the form of tables and graphs.
Line charts, bar charts, pie diagram, scatter diagram are some of the tools that are used to
graphically represent the data.
In this Lesson we will get familiar with measures of central tendency. We will study the objectives
of averaging and requisites of good average. We will also focus on other types of averages like
arithmetic mean, weighted arithmetic mean, geometric mean, harmonic mean, median and mode.
Objectives of Averaging
To find out one value that represents the whole mass of data
If the researcher knows the average value of the data, then he need not study each
and every data point in the data set.
To enable comparison
Averages act as a common denominator for comparing two or more sets of data.
To establish relationship
The average calculated from sample data gives a reliable idea about the average of
the entire universe.
To aid decision-making
Averages are basically divided into two types: Mathematical averages and positional averages.
The mathematical averages are arithmetic mean, geometric mean and harmonic mean. The
positional averages are median and mode.
Arithmetic Mean
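$\bar{x} = \frac{\sum (f \times x)}{n}$
where $x$ is the class mark, $f$ the frequency of the class and $n$ the total number of observations.
Class    Frequency
21-25    38
26-30    30
31-35    35
36-40    25
41-45    15
46-50    12
51-55    3
56-60    2
Class    Frequency (f)    Class mark (x)    fx
21-25    38               23                874
26-30    30               28                840
31-35    35               33                1155
36-40    25               38                950
41-45    15               43                645
46-50    12               48                576
51-55    3                53                159
56-60    2                58                116
Total    n = 160                            $\sum fx$ = 5315
$\bar{x} = \frac{\sum (f \times x)}{n} = \frac{5315}{160} = 33.219$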
Short-cut Method
Locate an assumed mean. Assign a code value zero to the class containing assumed
mean
Assign negative integers as codes to the classes with values smaller than assumed mean
and positive integers to the classes with values larger than assumed mean
$\bar{x} = x_0 + w \times \frac{\sum (u \times f)}{n}$
Where,
$\bar{x}$ = mean
$x_0$ = value of the class mark assigned the code 0
$w$ = numerical width of the class interval
$u$ = code assigned to each class
$f$ = frequency of the class (number of observations)
$n$ = total number of observations in the sample
Example: We will solve the previous example by the short-cut method.
Class    Class Mark (x)    Code (u)    Frequency (f)    uf
21-25    23                -3          38               -114
26-30    28                -2          30               -60
31-35    33                -1          35               -35
36-40    38                0           25               0
41-45    43                1           15               15
46-50    48                2           12               24
51-55    53                3           3                9
56-60    58                4           2                8
Total                                  n = 160          $\sum uf$ = -153
$\bar{x} = x_0 + w \times \frac{\sum (u \times f)}{n} = 38 + 5 \times \frac{-153}{160} = 33.219$
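Both methods can be checked against each other with a short Python sketch. The class marks and frequencies are those of the worked example; the frequencies 3 and 2 for the classes 51-55 and 56-60 are inferred from the fx column (159/53 and 116/58), and the function names are illustrative.

```python
# Class marks and frequencies of the worked example
# (last two frequencies inferred from the fx column).
class_marks = [23, 28, 33, 38, 43, 48, 53, 58]
frequencies = [38, 30, 35, 25, 15, 12, 3, 2]


def grouped_mean(marks, freqs):
    """Direct method: sum(f * x) / n."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(marks, freqs)) / n


def shortcut_mean(marks, freqs, assumed_mean, width):
    """Short-cut method: x0 + w * sum(u * f) / n.

    Codes u are the number of class widths between each class mark
    and the assumed mean (negative below it, positive above it).
    """
    n = sum(freqs)
    codes = [(x - assumed_mean) // width for x in marks]
    return assumed_mean + width * sum(u * f for u, f in zip(codes, freqs)) / n


direct = grouped_mean(class_marks, frequencies)
shortcut = shortcut_mean(class_marks, frequencies, assumed_mean=38, width=5)
print(direct, shortcut)
```

Any class mark can serve as the assumed mean; the short-cut method only changes the arithmetic, not the result.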
Weighted Arithmetic Mean
The weighted mean is calculated taking into account the relative importance of each of the values
to the total value. The formula for calculating the weighted average is:
$\bar{x}_w = \frac{\sum (w \times x)}{\sum w}$
Where,
$\bar{x}_w$ = symbol for weighted mean
$w$ = weight allocated to each observation
$\sum (w \times x)$ = sum of each weight multiplied by that element
$\sum w$ = sum of all the weights
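Example:
Class of Labour    Rate per hour in Rs. (x)    Product 1 (w)    Product 2 (w)
Unskilled          10                          2                6
Semiskilled        15                          3                2
Skilled            20                          5                1
The labour cost per hour for Product 1 is given by
$\bar{x}_w = \frac{\sum (w \times x)}{\sum w} = \frac{(2 \times 10) + (3 \times 15) + (5 \times 20)}{2 + 3 + 5} = \frac{165}{10}$ = Rs. 16.5 per hour
Similarly, the labour cost per hour for Product 2 is given by
$\bar{x}_w = \frac{\sum (w \times x)}{\sum w} = \frac{(6 \times 10) + (2 \times 15) + (1 \times 20)}{6 + 2 + 1} = \frac{110}{9}$ = Rs. 12.22 per hour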
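A weighted mean is a one-line computation once values and weights are paired. The sketch below uses the figures as they can be read from the example, assuming the rates 10, 15 and 20 belong to unskilled, semiskilled and skilled labour and that 2, 3 and 5 are the corresponding weights; the function name is illustrative.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Assumed reading of the example: hourly rates and their weights.
rates = [10, 15, 20]
weights_first = [2, 3, 5]

cost_first = weighted_mean(rates, weights_first)
print(cost_first)
```

With equal weights the function reduces to the ordinary arithmetic mean, which makes a convenient sanity check.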
Median
The median is the middle value of a series arranged in ascending or descending order. The
median is the 50th percentile value below which 50% of the values in the sample fall.
Ungrouped Data
If the dataset contains an odd number of items, the middle item of the dataset is the
median
If the dataset contains an even number of items, the average of the two middle items is
the median
If the total of the frequencies is odd, say n, then value of (n+1)/2th item gives the median
If the total of the frequencies is even, say, 2n, then the arithmetic mean of nth and
(n + 1)th gives the median
Example: A fruit vendor recorded the sales of oranges for a week.
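Day          Number of oranges
Sunday       280
Monday       240
Tuesday      250
Wednesday    220
Thursday     270
Friday       225
Saturday     265
Arranging the data in ascending order:
Day          Number of oranges
Wednesday    220
Friday       225
Monday       240
Tuesday      250
Saturday     265
Thursday     270
Sunday       280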
The dataset contains 7 data points, so the median is given by the middle item, i.e. item number
4. Thus the median for the given data is 250.
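The same result can be reproduced with Python's standard `statistics` module, which sorts the data and picks the middle item (or averages the two middle items when the count is even). The four-item list at the end is a hypothetical extra to show the even case.

```python
import statistics

# Orange sales for the week, as listed in the example.
sales = {
    "Sunday": 280, "Monday": 240, "Tuesday": 250, "Wednesday": 220,
    "Thursday": 270, "Friday": 225, "Saturday": 265,
}

ordered = sorted(sales.values())
median_sales = statistics.median(ordered)
print(ordered)
print(median_sales)

# Even number of items (hypothetical): the median is the mean of 225 and 240.
even_case = statistics.median([220, 225, 240, 250])
print(even_case)
```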
Grouped Data
To find the median for grouped data, first we need to identify the median class. It is
assumed that the items are evenly spaced over the entire class interval. Then by
interpolation median is calculated as
$\text{Median} = \left( \frac{N/2 - F}{f_m} \right) \times W + L_m$
where,
$L_m$ = lower limit of the median class
$f_m$ = frequency of the median class
$F$ = cumulative frequency up to the lower limit of the median class
$W$ = width of the class interval
$N$ = total frequency
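Example:
Class       Frequency    Cumulative Frequency
101-200     6            6
201-300     12           18
301-400     18           36
401-500     27           63
501-600     21           84
601-700     17           101
701-800     15           116
801-900     11           127
901-1000    9            136
Since $N/2 = 68$, the median lies in the class 501-600.
$L_m = 501$, $N = 136$, $F = 63$, $f_m = 21$, $W = 100$
$\text{Median} = \left( \frac{N/2 - F}{f_m} \right) \times W + L_m = \left( \frac{68 - 63}{21} \right) \times 100 + 501 = 524.81$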
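The interpolation for grouped data can be sketched as follows. This assumes the common form of the formula in which the median position is N/2 (some texts use (N + 1)/2 instead, which gives a slightly different value); the frequencies 6 and 9 for the first and last classes are inferred from the cumulative frequency column, and the function name is illustrative.

```python
def grouped_median(lower_limits, freqs, width):
    """Median by linear interpolation within the median class.

    Assumes the N/2 convention: Lm + ((N/2 - F) / fm) * W.
    """
    n = sum(freqs)
    half = n / 2
    cumulative = 0  # F: cumulative frequency below the current class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= half:
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("empty distribution")


lower_limits = [101, 201, 301, 401, 501, 601, 701, 801, 901]
freqs = [6, 12, 18, 27, 21, 17, 15, 11, 9]

median_value = grouped_median(lower_limits, freqs, width=100)
print(round(median_value, 2))
```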
$M_o = L_{mo} + \frac{d_1}{d_1 + d_2} \times w$
Where,
$L_{mo}$ = lower limit of the modal class
$d_1$ = frequency of the modal class - the frequency of the class just below it
$d_2$ = frequency of the modal class - the frequency of the class just above it
$w$ = width of the modal class
We analyze the data statistically to calculate the average point of the data.
The average point of the data that is located centrally is called the measure of central
tendency.
There are two types of averages: mathematical averages (arithmetic mean, geometric
mean and harmonic mean) and positional averages (median and mode).
Measure Of Dispersion
In this Lesson we will get familiar with what is dispersion. We will study a few measures of
dispersion namely range, quartile deviation and mean deviation along with their merits and
limitations. In this session we will discuss the calculation of these measures for ungrouped
and grouped data.
Dispersion of a dataset measures the variability of the data, i.e. how the data is spread out within the
dataset.
When the dispersion is measured in terms of the difference between two values selected
from the data set, it is called a distance measure, e.g. the range, the interquartile range
and the quartile deviation.
When the dispersion is measured in terms of the average deviation from some measure of
central tendency, it is called an average deviation measure, e.g. mean deviation, variance
and standard deviation.
The Range
For ungrouped data, range is defined as the difference between the value of the smallest
observation and the value of the largest observation present in the distribution.
Range = Largest Value - Smallest Value
For grouped data, range is defined as the difference between the upper limit of the highest
class and the lower limit of the lowest class.
Range = Upper limit of the highest class - Lower limit of the lowest class
Coefficient of range is relative measure of range and is used for comparing observations in
different units. For example, a physical trainer cannot compare the range of the weights of
employees with range of their heights as the range of weights would be in kilograms and
that of heights in centimeters.
$\text{Coefficient of Range} = \frac{L - S}{L + S}$
where $L$ is the largest value and $S$ the smallest value.
given data:
Class
0-10
11-20
21-30
31-40
41-50
Frequency
10
Range = 50 - 0 = 50
$\text{Coefficient of Range} = \frac{50 - 0}{50 + 0} = 1$
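The two range measures are simple enough to express directly in Python. The grouped figures below are the class limits of the example (lowest class 0-10, highest class 41-50); the list of weights is hypothetical and only illustrates the ungrouped case.

```python
def value_range(largest, smallest):
    """Range = L - S."""
    return largest - smallest


def coefficient_of_range(largest, smallest):
    """Coefficient of range = (L - S) / (L + S), a unit-free measure."""
    return (largest - smallest) / (largest + smallest)


# Grouped example: upper limit of highest class, lower limit of lowest class.
grouped_range = value_range(50, 0)
grouped_coeff = coefficient_of_range(50, 0)

# Hypothetical ungrouped weights in kilograms (illustrative only).
weights = [52, 61, 58, 70, 65]
weights_range = value_range(max(weights), min(weights))
weights_coeff = coefficient_of_range(max(weights), min(weights))

print(grouped_range, grouped_coeff)
print(weights_range, round(weights_coeff, 4))
```

Whenever the smallest value is 0 the coefficient is exactly 1, which is why the grouped example gives 1 regardless of the frequencies.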
Merits:
Range is simple to understand and easy to calculate.
Range is the quickest way to get a measure of dispersion, although it is not accurate.
Limitations:
It is not based on all the observations in the data. It is computed based on the highest
and the lowest values and ignores the nature of dispersion among other values of
observations in the data set.
It is influenced by extreme values and hence fluctuates from sample to sample of a
population, even though the values that fall in between the highest and lowest values are
similar.
Range cannot be computed for frequency distributions with open-end classes.
Range fails to explain about the character of the distribution within two extreme
observations (i.e. L and S)
Range is unreliable as a measure of dispersion of the values within a distribution.
Uses:
Quality control experts analyze the dispersion of a product's quality. If the
dispersion is high, the quality keeps changing; if the dispersion is low,
the quality remains more or less the same.
Financial analysts are concerned about the dispersion of a firm's earnings. Widely
dispersed earnings, those varying from extremely high to low, indicate a higher risk
to stockholders and creditors than do earnings remaining relatively stable.
Quartile Deviation
Interquartile Range
The range calculated on the basis of middle 50% of the observations is called as
interquartile range. This interquartile range is calculated from observations obtained after
discarding one quartile of the observations at the lower end and another quartile of the
observations at the upper end of the distribution. Thus, interquartile range is the
difference between the third quartile and the first quartile.
Interquartile range = Q3-Q1
Quartile Deviation
Quartile deviation is defined as one half of the interquartile range. Quartile deviation gives
the average value by which the two quartiles differ from the median. In a symmetrical
distribution, the quartiles Q3 and Q1 are equidistant from the median, i.e. Median - Q1 = Q3 - Median.
$\text{Quartile deviation (Q.D.)} = \frac{Q_3 - Q_1}{2}$
The relative measure of quartile deviation is called the coefficient of quartile deviation. It can
be used to compare the degree of variation in different distributions.
$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
$Q_1$ = value of the $\frac{N + 1}{4}$th observation
$Q_3$ = value of the $\frac{3(N + 1)}{4}$th observation
Where,
N = total number of observations
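Example: The sales figures of a company are given below. Calculate the quartile deviation for
the sales data.
Month & Year    Sales (in Rs. 000)
April 02        15.6
May 02          16.3
June 02         18.1
July 02         19.5
Aug. 02         20.4
Sept. 02        21.5
Oct. 02         22.7
$Q_1$ = value of the $\frac{7 + 1}{4}$th observation = 2nd observation = 16.3
$Q_3$ = value of the $\frac{3(7 + 1)}{4}$th observation = 6th observation = 21.5
$\text{Quartile Deviation} = \frac{21.5 - 16.3}{2} = 2.6$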
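The ungrouped calculation can be sketched in Python using the (N + 1)/4 and 3(N + 1)/4 positions. The linear interpolation for fractional positions is an assumption added for generality (it is not needed for this example, where both positions are whole numbers), and the helper names are illustrative.

```python
def quartile_positions(n):
    """1-based positions of Q1 and Q3: (n + 1)/4 and 3(n + 1)/4."""
    return (n + 1) / 4, 3 * (n + 1) / 4


def value_at(sorted_values, position):
    """Value at a 1-based position (n >= 3 assumed).

    Fractional positions are interpolated linearly between the two
    neighbouring observations; this is one common convention.
    """
    lower = int(position)
    fraction = position - lower
    if lower >= len(sorted_values):
        return sorted_values[-1]
    base = sorted_values[lower - 1]
    return base + fraction * (sorted_values[lower] - base)


# Monthly sales (in Rs. 000) from the example, already in ascending order.
sales = sorted([15.6, 16.3, 18.1, 19.5, 20.4, 21.5, 22.7])

p1, p3 = quartile_positions(len(sales))
q1, q3 = value_at(sales, p1), value_at(sales, p3)
quartile_deviation = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1)

print(q1, q3, round(quartile_deviation, 2), round(coefficient_qd, 4))
```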
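Wages        No. of Employees    Cumulative Frequency
1501-2500    3                   3
2501-3500    10                  13
3501-4500    15                  28
4501-5500    12                  40
5501-6500    2                   42
$Q_1$ = value of the $\frac{42 + 1}{4}$th observation = 10.75th observation, which lies in the class 2501-3500
$Q_1 = 2501 + \left( \frac{42/4 - 3}{10} \right) \times 1000 = 3251$
$Q_3$ = value of the $\frac{3(42 + 1)}{4}$th observation = 32.25th observation, which lies in the class 4501-5500
$Q_3 = 4501 + \left( \frac{3 \times 42/4 - 28}{12} \right) \times 1000 = 4792.667$
$\text{Quartile Deviation} = \frac{4792.667 - 3251}{2} = 770.833$
$\text{Coefficient of Q.D.} = \frac{4792.667 - 3251}{4792.667 + 3251} = 0.192$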
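For grouped data the quartiles are found by interpolation inside the quartile class. The sketch below assumes the N/4 and 3N/4 convention inside the interpolation, since that is what reproduces the values 3251 and 4792.667 of the example; the function name is illustrative.

```python
def grouped_quantile(lower_limits, freqs, width, fraction):
    """Interpolated quantile: L + ((fraction * N - F) / f) * W."""
    n = sum(freqs)
    target = fraction * n
    cumulative = 0  # F: cumulative frequency below the current class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("empty distribution")


# Wage classes and number of employees from the example.
lower_limits = [1501, 2501, 3501, 4501, 5501]
freqs = [3, 10, 15, 12, 2]

q1 = grouped_quantile(lower_limits, freqs, 1000, 0.25)
q3 = grouped_quantile(lower_limits, freqs, 1000, 0.75)
quartile_deviation = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1)

print(round(q1, 3), round(q3, 3))
print(round(quartile_deviation, 3), round(coefficient_qd, 3))
```

Calling the same function with `fraction=0.5` gives the grouped median under the same convention.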
Merits:
Limitations:
Subtract the mean from every value in the data set and ignore the positive or
negative signs
Add all the differences and divide the sum by the number of items in the sample
(for a sample)
Day     Temperature (°C)    Deviation from mean ($x - \bar{x}$)    Absolute deviation ($|x - \bar{x}|$)
1       25.0                1.12                                   1.12
2       24.8                0.92                                   0.92
3       25.2                1.32                                   1.32
4       24.6                0.72                                   0.72
5       24.0                0.12                                   0.12
6       23.7                -0.18                                  0.18
7       23.3                -0.58                                  0.58
8       23.0                -0.88                                  0.88
9       22.7                -1.18                                  1.18
10      22.5                -1.38                                  1.38
N = 10  $\sum x$ = 238.8                                           $\sum |x - \bar{x}|$ = 8.4
$\text{Mean } (\bar{x}) = \frac{238.8}{10} = 23.88$
$\text{Mean deviation} = \frac{\sum |x - \bar{x}|}{N} = \frac{8.4}{10} = 0.84$
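The two steps (absolute deviations from the mean, then their average) translate directly into Python. The temperatures are those of the example; the function name is illustrative.

```python
def mean_absolute_deviation(values):
    """Average of the absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


# Daily temperatures (°C) from the example.
temperatures = [25.0, 24.8, 25.2, 24.6, 24.0, 23.7, 23.3, 23.0, 22.7, 22.5]

mean_temperature = sum(temperatures) / len(temperatures)
mad = mean_absolute_deviation(temperatures)

print(round(mean_temperature, 2), round(mad, 2))
```

Without the absolute value the deviations would cancel out and sum to zero, which is why the signs are ignored.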
Class       Frequency
0-200       32
201-400     108
401-600     67
601-800     28
801-1000    14
Solution:
Class Interval    Frequency (f)    Mid-value of class interval (x)    fx          $|x - \bar{x}|$    $f|x - \bar{x}|$
0-200             32               100                                3200        307.0879           9826.8128
201-400           108              300.5                              32454       106.5879           11511.493
401-600           67               500.1                              33506.7     93.0121            6231.8107
601-800           28               700.1                              19602.8     293.0121           8204.3388
801-1000          14               900.1                              12601.4     493.0121           6902.1694
Total             N = 249          2500.8                             101364.9                       42676.624
Hint:
Use MS Excel to demonstrate the example
$\text{Mean} = \frac{\sum fx}{N} = \frac{101364.9}{249} = 407.0879$
$\text{Mean Absolute Deviation} = \frac{\sum f|x - \bar{x}|}{N} = \frac{42676.624}{249} = 171.3920$
Merits:
Limitations:
Subtract the mean from every value in the data set and square the difference
Add all the squared differences and divide the sum by the total number of items in the
sample
$\sigma^2 = \frac{\sum (x - \bar{x})^2}{N}$
For grouped data, first calculate the mean:
$\bar{x} = \frac{\sum (f \times x)}{\sum f}$
Where x is the mid-point of the class and f is the frequency of the class
Calculate the difference between the sample mean and the mid-point of the class
Multiply the frequency of the class and the squared difference. Add all the products
and divide the sum by the total frequency
$\sigma^2 = \frac{\sum f (x - \bar{x})^2}{\sum f}$
Standard Deviation ($\sigma$)
Standard deviation is the square root of the average of the squared distances of the
observations from the mean (i.e. the square root of the variance).
Standard deviation for ungrouped data, $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{N}}$
Standard deviation for grouped data, $\sigma = \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}}$
Properties of Standard Deviation
The value of standard deviation remains the same, if in a series each of the observation is
increased or decreased by a constant quantity.
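For example, for the observations 3, 10 and 12, $\bar{x} = 8.33$ and $\sigma = 3.859$.
If 4.5 is added to each observation, $\bar{x} = 12.83$ and $\sigma = 3.859$.
Hence although $\bar{x}$ has increased by 4.5, $\sigma$ remains the same.
If each observation is instead multiplied by 6, $\sigma = 23.152$, i.e. $3.859 \times 6$.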
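The sum of the squares of the deviations of items of any series from a value other than the
arithmetic mean would always be greater.
The combined standard deviation of two groups is
$\sigma_{12} = \sqrt{\frac{n_1 \sigma_1^2 + n_2 \sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}}$
Where,
$\sigma_1$ = standard deviation of first group
$\sigma_2$ = standard deviation of second group
$d_1 = \bar{x}_1 - \bar{x}_{12}$
$d_2 = \bar{x}_2 - \bar{x}_{12}$
$\bar{x}_{12} = (n_1 \bar{x}_1 + n_2 \bar{x}_2) / (n_1 + n_2)$
Coefficient of Variation
$\text{Coefficient of variation} = \frac{\sigma}{\bar{x}} \times 100$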
The coefficient of variation measures the spread of a set of data as a proportion of its
mean. It is used in problem situations where we want to compare the variability,
homogeneity, stability, uniformity and consistency of two or more data sets. The data set
for which the coefficient of variation is greater is said to be more variable i.e. less
consistent or less homogeneous. On the other hand, if the coefficient of variation is less it
is said to be less variable i.e., more consistent or more homogeneous.
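Example 1: Find the standard deviation and the coefficient of variation for the given data.
$x_i$: 15, 13, 17, 16, 18, 20
$x_i$     $(x_i - \bar{x})$    $(x_i - \bar{x})^2$
15        -1.5                 2.25
13        -3.5                 12.25
17        0.5                  0.25
16        -0.5                 0.25
18        1.5                  2.25
20        3.5                  12.25
Sum = 99                       29.50
$\bar{x} = \frac{99}{6} = 16.5$
$\sigma^2 = \frac{29.50}{6} = 4.917$
$\sigma = \sqrt{4.917} = 2.217$
$\text{Coefficient of variation (\%)} = \frac{\sigma}{\bar{x}} \times 100 = \frac{2.217}{16.5} \times 100 = 13.44$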
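Example 1 can be cross-checked with Python's standard `statistics` module. The sketch uses `pstdev`, the population standard deviation (divisor N), because that is the divisor the worked example uses; `stdev` would divide by N - 1 instead.

```python
import statistics

# Observations from Example 1.
data = [15, 13, 17, 16, 18, 20]

mean = statistics.mean(data)
variance = statistics.pvariance(data)   # sum of squared deviations / N
sigma = statistics.pstdev(data)         # square root of the variance
cv = sigma / mean * 100                 # coefficient of variation in percent

print(mean, round(variance, 3), round(sigma, 3), round(cv, 2))
```

Rounding the intermediate values early shifts the final coefficient slightly, so it is safer to round only at the end.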
Example 2: Find the standard deviation and the coefficient of variation for the given data.
Class    Frequency
0-10     6
11-20    8
21-30    13
31-40    15
41-50    18
51-60    20
Class    Frequency (f)    Mid point (x)    fx       $x - \bar{x}$    $(x - \bar{x})^2$    $f(x - \bar{x})^2$
0-10     6                5                30       -31.8375         1013.6264            6081.7584
11-20    8                15.5             124      -21.3375         455.2889             3642.3112
21-30    13               25.5             331.5    -11.3375         128.5389             1671.0057
31-40    15               35.5             532.5    -1.3375          1.7889               26.8335
41-50    18               45.5             819      8.6625           75.0389              1350.7002
51-60    20               55.5             1110     18.6625          348.2889             6965.778
Sum      80                                2947                                           19738.387
$\bar{x} = \frac{2947}{80} = 36.8375$
$\sigma^2 = \frac{19738.387}{80} = 246.7298$
$\sigma = \sqrt{246.7298} = 15.7076$
$\text{Coefficient of variation (\%)} = \frac{\sigma}{\bar{x}} \times 100 = \frac{15.7076}{36.8375} \times 100 = 42.6402$
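The grouped calculation of Example 2 can be sketched in Python as well. The mid-points are taken from the solution table, and the frequencies 6 and 8 for the first two classes are inferred from the fx column (30/5 and 124/15.5); the function name is illustrative.

```python
import math


def grouped_stats(midpoints, freqs):
    """Return (mean, variance, standard deviation, CV in %) for grouped data.

    Uses the total frequency as divisor, as in the worked example.
    """
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n
    sigma = math.sqrt(variance)
    return mean, variance, sigma, sigma / mean * 100


# Mid-points and frequencies of Example 2.
midpoints = [5, 15.5, 25.5, 35.5, 45.5, 55.5]
freqs = [6, 8, 13, 15, 18, 20]

mean, variance, sigma, cv = grouped_stats(midpoints, freqs)
print(round(mean, 4), round(variance, 4), round(sigma, 4), round(cv, 2))
```

A coefficient of variation above 40% marks this distribution as far less consistent than the data of Example 1.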