PNG University of Technology Mathematics & Computer Science Department
MA 339
LECTURE 2
Data Processing
Once a data collection process has been
implemented, we are faced with the task of
storing and processing the data. This was
once a most time-consuming and tedious
operation, but technology has now made it
much easier. A large number of statistical
analysis packages have been developed
which can access the information from
databases and perform a variety of
analyses.
Another important development has been
the electronic spreadsheet, which is widely
used for data storage, manipulation,
graphics and relatively simple statistical
analysis.
Some of the most widely used computer software
packages are:
Database managers: DBase, Oracle, SQL
Statistical analysis packages: SPSS, SAS, SYSTAT, Minitab
Electronic spreadsheets: Lotus 1-2-3, Quattro Pro, Excel
Graphical packages: Harvard Graphics, Lotus Freelance Plus
The widespread use of computers has dramatically
improved the ease of data processing.
Frequency Distribution
A data set may be quite small or it may
be very large. In either case, the best
way to summarize a set of data
values relating to a given variable is
to construct a frequency
distribution for that variable.
Types of frequency distributions
A simple (or ungrouped) frequency
distribution is a table that shows how many
times each value of the variable appears in
the dataset. The values observed are listed
separately in the first column; the frequency
of each value, i.e. the number of times that
value occurs in the dataset (which can also
be found from a tally), is shown in the
next column.
Example 1: Consider the marks obtained by
50 students in a test which is marked out of
10.
4 3 5 4 3 5
5 4 3 6 5 4
5 3 4 4 5 5
7 4 4 3 4 5
4 3 6 1 3 6
3 2 6 6 3 5
2 7 5 7 7 6
5 8 6 3 1 4
3 5
The above is an example of raw
data, where it can be seen that the
numbers have not yet been
arranged in any systematic way.
One way of organizing such raw
data is to form a frequency
distribution; in this case, an
ungrouped frequency distribution.
Mark    Tally               Frequency
1       ||                  2
2       ||                  2
3       ||||| ||||| |       11
4       ||||| ||||| |       11
5       ||||| ||||| ||      12
6       ||||| ||            7
7       ||||                4
8       |                   1
Total                       50
Table 1
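The counting behind Table 1 can be sketched in a few lines of Python. This is only an illustration, assuming the raw marks of Example 1 are typed in as a list; the names `marks`, `simple_frequency` and `table` are made up for this sketch and are not part of the lecture.

```python
from collections import Counter

# Raw marks from Example 1 (50 students, test marked out of 10).
marks = [
    4, 3, 5, 4, 3, 5,
    5, 4, 3, 6, 5, 4,
    5, 3, 4, 4, 5, 5,
    7, 4, 4, 3, 4, 5,
    4, 3, 6, 1, 3, 6,
    3, 2, 6, 6, 3, 5,
    2, 7, 5, 7, 7, 6,
    5, 8, 6, 3, 1, 4,
    3, 5,
]


def simple_frequency(values):
    """Return (value, frequency) pairs, sorted by value."""
    counts = Counter(values)  # counts how many times each value occurs
    return sorted(counts.items())


table = simple_frequency(marks)

# Print the ungrouped frequency distribution with a total row.
print("Mark  Frequency")
for mark, freq in table:
    print(f"{mark:>4}  {freq:>9}")
print(f"Total {sum(freq for _, freq in table):>9}")
```

Only the marks that actually occur are listed, so a mark such as 9 or 10, which no student obtained, does not appear in the output.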
Grouped frequency distribution: When dealing
with a large amount of data, it is useful to group
the values into classes or categories. The
number of items belonging to each class, the
class frequency, can then be determined; the
resulting table is called a grouped frequency
distribution.
A grouped frequency distribution condenses
the data even more than a simple one, because
it shows for each class only how many values
fall in that class.
Example 2: The table below gives
the times in seconds (to the nearest
second) taken by 40 children to swim
the length of a swimming pool.
40 49 43 35 42 43 46 36
42 36 37 44 39 41 31 45
38 48 44 51 38 53 35 32
30 43 41 52 46 43 50 40
39 41 48 47 32 52 47 42
The raw data above can be
represented in the form of a
grouped frequency table, using the
intervals 30 – 34, 35 – 39, etc. A
tally chart can again be used to
obtain the frequencies, as
shown below.
Class Interval    Tally                 Frequency
30 – 34           ||||                  4
35 – 39           ||||| ||||            9
40 – 44           ||||| ||||| ||||      14
45 – 49           ||||| |||             8
50 – 54           |||||                 5
Total                                   40
Table 2
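The grouping used for Table 2 can be sketched in Python as follows. This is a minimal illustration, assuming equal-sized classes that start at 30 and each cover five whole-second values; the function `grouped_frequency` and its parameters are hypothetical names introduced here.

```python
# Swimming times from Example 2 (seconds, to the nearest second).
times = [
    40, 49, 43, 35, 42, 43, 46, 36,
    42, 36, 37, 44, 39, 41, 31, 45,
    38, 48, 44, 51, 38, 53, 35, 32,
    30, 43, 41, 52, 46, 43, 50, 40,
    39, 41, 48, 47, 32, 52, 47, 42,
]


def grouped_frequency(values, first_lower, size, n_classes):
    """Return (lower limit, upper limit, frequency) for each class.

    `size` is the number of whole-number values each class covers,
    e.g. 5 for the class 30 - 34 (30, 31, 32, 33, 34).
    """
    classes = []
    for k in range(n_classes):
        lower = first_lower + k * size
        upper = lower + size - 1
        # Class frequency: how many values lie between the class limits.
        freq = sum(1 for v in values if lower <= v <= upper)
        classes.append((lower, upper, freq))
    return classes


table = grouped_frequency(times, first_lower=30, size=5, n_classes=5)
for lower, upper, freq in table:
    print(f"{lower} - {upper}: {freq}")
print("Total:", sum(freq for _, _, freq in table))
```

Calling the same function with a different `size` and `n_classes` is a quick way to try out other groupings of the same data.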
The main advantage of grouping is that it
produces a clear overall picture of the
distribution. However, too many groups
will destroy the pattern of the distribution,
while too few will destroy much of the
detail which was present in the raw data.
Depending on the volume of the raw data,
the number of classes used is usually
between 5 and 20.
Definitions
The class intervals, bounded by class limits, define the
grouping taken. Thus in Example 2, the first class
interval is 30 – 34, with 30 being the lower class limit
and 34 being the upper class limit.
Class boundaries are the real limits beyond which the
data in the class cannot go. Thus above, with times
recorded to the nearest second, the true class
boundaries for the first class are 29.5 seconds and 34.5
seconds. In any distribution, the class boundaries
may be found by adding the upper class limit of one
class to the lower class limit of the next and dividing
this sum by two.
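The rule for class boundaries can be sketched in Python using the class limits of Table 2. The averaging rule only gives the boundaries between neighbouring classes; for the two outermost boundaries this sketch assumes that the same gap between classes continues beyond the first and last class. The name `class_boundaries` is made up for the illustration.

```python
# Class limits from Table 2 (times recorded to the nearest second).
class_limits = [(30, 34), (35, 39), (40, 44), (45, 49), (50, 54)]


def class_boundaries(limits):
    """Return (lower boundary, upper boundary) for each class."""
    # Boundary between neighbouring classes: average of the upper limit
    # of one class and the lower limit of the next.
    inner = [
        (limits[i][1] + limits[i + 1][0]) / 2
        for i in range(len(limits) - 1)
    ]
    # Assumption of this sketch: the gap between classes (here 35 - 34 = 1)
    # also applies below the first class and above the last class.
    gap = limits[1][0] - limits[0][1]
    edges = [limits[0][0] - gap / 2] + inner + [limits[-1][1] + gap / 2]
    return list(zip(edges[:-1], edges[1:]))


boundaries = class_boundaries(class_limits)
for (lower, upper), (lo_b, hi_b) in zip(class_limits, boundaries):
    print(f"{lower} - {upper}: boundaries {lo_b} and {hi_b}")
```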
The class width of a class interval is the
difference between the upper and lower class
boundaries (not the limits). Thus in the above
example, the width of the second class is
39.5 – 34.5 = 5 seconds.
When calculating the width of a class interval,
a common mistake is to take the difference of
the class limits, giving 4 seconds above. This is
incorrect.
The class midpoint is the value halfway
between the lower and upper class boundaries
(the real limits). Thus in the above example,
the midpoint of the first class is
(29.5 + 34.5)/2 = 32 seconds.
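As a small check on these two definitions, the width and midpoint of every class in Table 2 can be computed from the class boundaries. This is only a sketch; the helper names `class_width` and `class_midpoint` are introduced here for illustration.

```python
# Class boundaries for Table 2 (times recorded to the nearest second).
boundaries = [
    (29.5, 34.5), (34.5, 39.5), (39.5, 44.5), (44.5, 49.5), (49.5, 54.5),
]


def class_width(lower_boundary, upper_boundary):
    """Width = upper boundary minus lower boundary (not the limits)."""
    return upper_boundary - lower_boundary


def class_midpoint(lower_boundary, upper_boundary):
    """Midpoint = value halfway between the two boundaries."""
    return (lower_boundary + upper_boundary) / 2


widths = [class_width(lo, hi) for lo, hi in boundaries]
midpoints = [class_midpoint(lo, hi) for lo, hi in boundaries]

for (lo, hi), w, m in zip(boundaries, widths, midpoints):
    print(f"{lo} to {hi}: width {w}, midpoint {m}")

# The common mistake: using the class limits 35 and 39 gives 4, not 5.
print("Using limits by mistake:", 39 - 35)
```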
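Cumulative Frequency
The cumulative frequency of a value is the total number of
observations that are less than or equal to that value.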
From Table 1: Cumulative frequency of the test marks
Mark    Frequency    Cumulative Frequency
1       2            2
2       2            4
3       11           15
4       11           26
5       12           38
6       7            45
7       4            49
8       1            50
Table 1.1
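The cumulative column of Table 1.1 is a running total of the frequency column. A minimal Python sketch, assuming the frequencies of Table 1 are stored in a dictionary (the variable names are illustrative only):

```python
from itertools import accumulate

# Frequencies from Table 1: mark -> number of students.
frequency = {1: 2, 2: 2, 3: 11, 4: 11, 5: 12, 6: 7, 7: 4, 8: 1}

marks = sorted(frequency)
# Running total of the frequencies, taken in increasing order of mark.
cumulative = list(accumulate(frequency[m] for m in marks))

print("Mark  Frequency  Cumulative")
for mark, cum in zip(marks, cumulative):
    print(f"{mark:>4}  {frequency[mark]:>9}  {cum:>10}")
```

The last cumulative frequency always equals the total number of observations, which is a useful check on the arithmetic.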
From Table 2: "Less than" cumulative frequency
distribution of the times of the 40 children.
Upper class boundary    Cumulative Frequency
< 34.5                  4
< 39.5                  13
< 44.5                  27
< 49.5                  35
< 54.5                  40
Table 2.1
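For grouped data, each entry of Table 2.1 is the number of observations below an upper class boundary. The sketch below counts these directly from the raw times of Example 2, as a check on the table; the variable names are made up for this illustration.

```python
# Swimming times from Example 2 (seconds, to the nearest second).
times = [
    40, 49, 43, 35, 42, 43, 46, 36,
    42, 36, 37, 44, 39, 41, 31, 45,
    38, 48, 44, 51, 38, 53, 35, 32,
    30, 43, 41, 52, 46, 43, 50, 40,
    39, 41, 48, 47, 32, 52, 47, 42,
]

# Upper class boundaries of the five classes in Table 2.
upper_boundaries = [34.5, 39.5, 44.5, 49.5, 54.5]

# "Less than" cumulative frequency: count the times below each boundary.
less_than = [sum(1 for t in times if t < b) for b in upper_boundaries]

for boundary, cum in zip(upper_boundaries, less_than):
    print(f"< {boundary}: {cum}")
```

The same numbers are obtained by adding up the class frequencies 4, 9, 14, 8 and 5 one class at a time.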
Relative frequency and relative
cumulative frequency distribution
Mark    Frequency    Relative Frequency (%)    Relative Cumulative Frequency (%)
1       2            (2/50)*100 = 4            4
2       2            (2/50)*100 = 4            8
3       11           (11/50)*100 = 22          30
4       11           (11/50)*100 = 22          52
5       12           (12/50)*100 = 24          76
6       7            (7/50)*100 = 14           90
7       4            (4/50)*100 = 8            98
8       1            (1/50)*100 = 2            100
Table 1.2
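The two percentage columns of Table 1.2 can be reproduced with a short Python sketch. It assumes only the frequencies of Table 1; the function name `relative_table` is hypothetical and introduced here for illustration.

```python
from itertools import accumulate


def relative_table(frequencies):
    """Return relative and relative cumulative frequencies, in percent."""
    total = sum(frequencies)
    # Relative frequency: each frequency as a percentage of the total.
    relative = [100 * f / total for f in frequencies]
    # Relative cumulative frequency: each running total as a percentage.
    relative_cumulative = [100 * c / total for c in accumulate(frequencies)]
    return relative, relative_cumulative


marks = [1, 2, 3, 4, 5, 6, 7, 8]
frequencies = [2, 2, 11, 11, 12, 7, 4, 1]  # from Table 1

relative, relative_cumulative = relative_table(frequencies)

print("Mark  Freq  Rel %  Rel cum %")
for m, f, r, rc in zip(marks, frequencies, relative, relative_cumulative):
    print(f"{m:>4}  {f:>4}  {r:>5.1f}  {rc:>9.1f}")
```

The function works for grouped data as well: passing the class frequencies of Table 2, `[4, 9, 14, 8, 5]`, gives the corresponding percentages for the swimming times.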
From Table 2:
The relative frequency and relative
cumulative frequency
distributions of the times of the 40 children
are shown below.
Upper class boundary    Frequency    Relative Frequency    Relative Cumulative Frequency