Grouped Data:: BA 302: Chapter-2 Instructions by Dr. Kishor Guru-Gharana
Grouped Data:: BA 302: Chapter-2 Instructions by Dr. Kishor Guru-Gharana
Grouped Data:: BA 302: Chapter-2 Instructions by Dr. Kishor Guru-Gharana
Grouped data: The raw data can be very large. Looking at the raw data it may be difficult to
see or understand any pattern. Therefore it helps in our understanding if the data is organized by
grouping into few classes or categories and counting the number of items which fall into
different classes or groups or categories.
Frequency Table for Qualitative Data: Qualitative data can be organized by grouping
the observations into the different categories which define the variable (such as male/female,
white/black/others, undergraduate/graduate/post-graduate, etc) and counting the number of items
which fall into the various categories. The number of items falling in each category is called the
frequency of the group or category. If it is presented in the form of a Table showing categories
and corresponding frequencies, we have a frequency Table. If the frequencies are divided by the
total frequency we have relative frequency (which can be expressed as decimal or multiplied by
100 to express percentage relative frequency). Remember that the total of all relative frequency
(in decimals) must be exactly equal to 1. Read the book for examples.
Bar Chart: If a frequency Table for a set of Qualitative Data is plotted graphically showing
the categories in the horizontal axis and the frequencies by vertical bars (with heights
corresponding to the size of the frequency), we have Bar Chart. Sometimes the axes are
switched, showing the frequencies as horizontal bars and categories listed in the vertical axis.
Pie Chart: If the percentage relative frequencies are plotted as contiguous slices (cones) of a
circular pie, we have a pie chart where the size of the total pie is 100% and the sizes (percentage
frequencies) of the different categories are indicated by the sizes of the corresponding slices or
cones. Remember that the total degrees in a circle are 360. So the degrees for a class with a share
of 15% (or relative frequency of 0.15) will be .15 times 360 or 54 degrees. Read the book for
examples.
Frequency Table for Quantitative Data: In case of quantitative data we do not have
inherently defined classes unlike the categorical or qualitative data. In this case we can organize
the data into different classes arbitrarily using the values of quantitative variable. First we have
to find how many class intervals to make.
Number of Classes: The popular practical guide for determining the number of classes (k) is
the 2k rule. According to this rule, the number of classes (k) is the lowest value of k which makes
2k greater than the total frequency (or total number of observations). However, note that this rule
is only a guide or recommendation and the exact number of classes could be a little more or less
as decided appropriate by the researcher depending on the nature of the data and the underlying
population. The point to remember is that having too many classes beats the very purpose of
grouping, while having too few classes hides too much information.
For example we have 23 = 8 and 24 = 16. Therefore for data sizes from 8 to 15, four classes are
recommended. If the data size is between 16 and 31, five classes are recommended, and so on.
For other data sizes you can easily find the recommended number of classes using this rule.
Class Intervals: After determining the number of classes, the next step is to find the class
intervals (or the upper and lower boundaries of the classes). Generally the class intervals are
made equal (or nearly equal) as far as possible. But the class intervals do not always have to be
equal. For example one could define lower income group below $60,000 per year, lower middle
income group between $60,000 and $150, 000, upper middle income group between $150,000
and $250,000 and higher income group above $250,000 per year. These numbers are given here
only for illustration. Some researchers may have more (or less) number of categories and may
define different groups quite differently, depending on the purpose of the study.
The only necessary rule is that the classes have to be mutually exclusive or non-overlapping.
The same value cannot belong to two different classes. For discrete values the classes will
naturally be disjoint. However, for continuous values they are shown with overlapping class
boundaries. The popular convention is that the lower limit of the class is included but the upper
limit is excluded (that is, it goes to the next class). According to this convention if a class interval
is shown as 10 to 20 or simply 10-20, then it means all values including 10 and up to 20 but not
including 20. That means a value like 19.99 would belong to this class but exact value 20 would
belong to the next class. But some authors and some computer programs (including Excel) may
do just the opposite (that is, include the upper boundary and exclude the lower boundary). In this
class we will adopt the first approach if we are making the classes, or let the computer program
use its own convention.
If the class intervals are decided to be equal then the size of the interval i is found using the
following formula:
i=
HL
k
, where i is the size of class interval, H is the highest value in the data, L is the lowest
value and k is the number of classes. If we get decimals, we generally round it. After the class
intervals are established it is only a matter of counting the number of items which fall in each
class to make the frequency table. Read the book for an example on this.
Class Midpoints: Sometimes we work with the class midpoints as the value representing the
whole class (for example in calculating the weighted average for grouped data or in drawing the
frequency polygons, the topic discussed below). The class midpoint is obtained by adding the
lower and upper limits of a class and dividing by two. So the midpoint of a class with interval 10
to 20 could be 15.
Relative Frequency: As already mentioned above the relative frequencies are derived by
dividing the frequencies by the total frequency (or total number of observation). It can be
expressed in decimals or multipled by 100 to express in percentages.
Histograms: When we plot the class intervals along the horizontal axis and the corresponding
frequencies using vertical bars, we have the Histograms. Histograms in quantitative data are the
counterparts of Bar charts in qualitative data.
Frequency Polygon: If vertical lines (or heights) are erected at the class midpoints with
heights corresponding to the class frequencies and the top of the vertical lines are connected
using line segments, the resulting picture or diagram is called the frequency polygon. The
frequency polygon visually depicts the shape of the frequency distribution among the different
classes.
Use of Excel
Suppose a series of observation of values for some variable (20 observations) are as follows;
observation
1
1
2
2
3
3
3
4
4
4
4
4
5
5
5
6
6
7
8
9
To obtain frequency distribution, cumulative frequency distribution, frequency polygon and
histogram using excel follow the following simple steps.
Open excel, type the data in any column, e.g. from cell A2 to A21 (cell A1 used for the title
observation). Then click on data button on top, then data analysis on the right corner (you
must have installed Excel Analysis Tool Pack. I will give simple hint for new Excel. Open Excel,
click on the upper left corner icon, Then in the opening window click on Excel options on the
bottom right, then select Excel Add-In, then analysis Tool Pack, and so on. Just follow this
lead.). After opening Data Analysis, click on option Histogram from the list given, then specify
input range by selecting A1 and dragging the mouse until the end of the series ($A$1:$A$21),
then specify the space where you want output, then check selection boxes for labels, cumulative
percentage and chart output, then click ok. If we leave the bin range blank, excel forms the class
intervals automatically. In this case it forms the class intervals as 0 to 1, 1 to 3, 3 to 5, 5 to 7, and
7 and above. We could specify our own class intervals (of equal or unequal length). The Excel
output appears below:
Frequency
Cumulative
%
10.00%
35.00%
75.00%
90.00%
100.00%
Bin
More
It is so simple. Note that the line in red is the cumulative relative frequency line or Ogive, which
is a non-decreasing curve or plot of cumulative relative (or percentage) frequency.
If you want to calculate relative frequency, you can use simple calculator (divide frequency by
total number of observations) or use excel functions for dividing one cell value by another. Or
you can also get it from the difference in successive cumulative (percentage) frequency column
given above by Excel. All methods are mathematically the same. I used excel formula in one cell
and copied (dragged) to all cells to get relative frequency for the different class intervals or
groups.
rel.fr.
10.00%
25.00%
40.00%
15.00%
10.00%
Perhaps you already know all these and more, if not learn all these tricks in excel by playing with
it. It will be useful for your assignments. Below I give examples of working with grouped
data.
class
freq
2-6
15
6-10
60
10-14
100
14-18
25
If I want to plot the frequency histogram I go to excel, then type the data with the headings.
Remember excel will not allow you to type 2-6, because it will be treated as February 6. So type
2-6 (that is with an apostrophe). But it will allow you to type 14-18, because there is no 14th
moth. Some stupid Excel rules!
Anyway, after you type the data select the whole data by dragging the mouse, then select
Insert from the buttons on top, which will open a whole new set of options. To plot
histogram, select Column and 2D (if you dont want 3D). Hit the ok or enter, and the
output appears wherever you want it. After the output appears, you can move it around in any
convenient place or copy and paste somewhere like this:
Now by clicking anywhere on the histogram, right clicking, you can change the format, and do a
number of things, like changing the labels, adding labels, changing the X-axis, etc. Learn all
these tricks.
Incidentally, the above histogram shows a Nonsymmetrical or Asymmetrical distribution (with
dissimilar tails) and it is skewed to the left (with long left tail).
Remember: we use the Analysis Tool Pack Data and Data Analysis path for
individual observations, but the Insert function and Column path is for
grouped data.
6
Now suppose I want to plot the Ogive and frequency polygon. Frequency polygon is easy. Just
select the data set, click on Insert and select Line, and then select any line type you want,
e.g. Line with Marker. This will plot the frequency polygon as follows:
The Ogive curve is the plot of cumulative relative frequency or a continuous cumulative
frequency polygon. To get Ogive for grouped data one way is to first construct the relative
frequency and cumulative relative frequency table along with or beside the frequency table using
simple calculator or excel function. Remember that cumulative relative frequency column is
obtained by adding relative frequencies and that the total of all relative frequencies must always
be 100% or in probability value 1.The expanded table for the above example will be as follows:
class frequency
2-6
15
6-10
60
10-14
100
14-18
25
Cumulative
Relative
relative
frequency frequency
0.075
0.075
0.3
0.375
0.5
0.875
0.125
1
Now select the first column (the column of class intervals) and pressing control button, also
select the cumulative relative frequency column, and then select the Insert button, then the
option Scatter and select scatter with smooth lines. This will produce the cumulative relative
frequency or Ogive curve as follows:
You can also change the X-axis labels corresponding to the class intervals or the upper limits of
classes (which makes sense in the case of cumulative frequency). Right clicking on the curve
gives you further options for labeling.