Mathematicsunit 4lesson
Introduction
Objectives:
Lesson I. Classification and Organization of Data
At the end of the twentieth century, there were 94 million households in the Philippines
with television sets. The television program viewed by the greatest percentage of such
households in that century was the final episode of Probinsiyano. Over 50 million Filipinos
watched this program.
Numerical information, such as the information about the top three TV shows of the
twentieth century, is called data. The word statistics is often used when referring to data.
However, statistics has a second meaning:
Statistics is also a method for collecting, organizing, analyzing, and interpreting data, as
well as drawing conclusions based on the data. This methodology divides statistics into two
main areas.
Inferential Statistics has to do with making generalizations about and drawing conclusions
from the data.
There are many classifications of data. Different kinds of data are collected,
analyzed, and interpreted. Being able to differentiate them is the first thing that must be
considered when organizing data.
Qualitative data deals with categories or attributes. Examples are eye color, ethnicity,
and brand of ice cream.
Quantitative data are numerical data. Quantitative data can be discrete or continuous.
Discrete data is obtained through counting. The number of countries in Southeast Asia and
the number of courses in a school term are examples of discrete data.
Continuous data is obtained by measuring. Weight and age are some examples of
continuous data.
Nominal level of measurement classifies qualitative data into two or more categories. It is
the lowest level of measurement.
Examples of nominal data are the books in the library and courses in college.
Examples of ordinal data are winners in a science quiz bee and levels of anxiety.
Interval level of measurement involves quantitative data that can be ranked and for which
differences make sense. There is no true starting point (zero) for this level of measurement.
Ratio level of measurement not only includes the characteristics of the interval level of
measurement but also starts at a 0 value. It is the highest level of measurement.
Examples of ratio data are weight, the time it takes to do a math project, and the number of
absences of students in a class.
Lesson II. Measures of Central Tendency
Consider the set of all Filipino TV households. Such a set is called the population. In
general, a population is the set containing all the people or objects whose properties are to
be described and analyzed by the data collector.
Random Samples:
A random sample is a sample obtained in such a way that every element in the population
has an equal chance of being selected for the sample.
One of the most basic statistical concepts involves finding measures of central
tendency of a set of numerical data. It is often helpful to find numerical values that locate,
in some sense, the center of a set of data. Suppose Mikee is a senior at a university. In a
few months he plans to graduate and start a career as a landscape architect. A survey of
five landscape architects from last year’s senior class shows that they received job offers
with the following yearly salaries.
Before Mikee interviews for the job, he wishes to determine the average of these
five salaries. This average should be a “central” number around which the salaries cluster.
We will consider three types of averages, known as the arithmetic mean, the median, and
the mode.
Each of these averages is a measure of central tendency for the numerical data. There are
three different measures of central tendency: the mean, the median, and the mode. Each
measure of central tendency is calculated in a different way. Thus, it is better to use a
specific term (mean, median, or mode) than to use the generic descriptive term
“average.”
The arithmetic mean is the most commonly used measure of central tendency. The
arithmetic mean of a set of numbers is often referred to as simply the mean. To find the
mean for a set of data, find the sum of the data values and divide it by the number of data
values.
$$\bar{x} = \frac{\Sigma x}{n}$$
where $\bar{x}$ = mean
For instance, to find the mean of the 5 salaries listed above, Mikee would divide the sum of
the salaries by 5.
$$\bar{x} = \frac{\text{₱}206{,}500}{5} = \text{₱}41{,}300$$
The mean suggests that Mikee can reasonably expect a job offer at a salary of about
₱41,300.
In statistics it is often necessary to find the sum of a set of numbers. The traditional symbol
used to indicate a sum is the Greek letter sigma, Σ. Thus the notation Σx, called summation
notation, denotes the sum of all the numbers in a given set.
Mean
$$\text{Mean} = \bar{x} = \frac{\Sigma x}{n}$$
Statisticians often collect data from small portions of a large group in order to determine
information about the group. In such situations the entire group under consideration is
known as the population, and any subset of the population is called a sample. It is
traditional to denote the mean of a sample by $\bar{x}$ (read "x bar") and to denote the mean of
the population by the Greek letter μ (lowercase mu).
95 85 75 70 88 90 93 69 89
Solution :
Exercises:
A doctor ordered 6 separate blood tests to measure a patient’s total blood cholesterol
levels. The test results were
The Median
Another type of average is the median. Essentially, the median is the middle number or
the mean of the two middle numbers in a list of numbers that have been arranged in
numerical order from smallest to largest or largest to smallest. Any list of numbers that
is arranged in numerical order from smallest to largest or largest to smallest is a ranked list.
Median
Example : Find the median of each list.
1. 4, 8, 1, 14, 9, 21, 12
2. 46, 23, 92, 89, 77, 108
Solution :
1. The list 4, 8, 1, 14, 9, 21, 12 contains 7 numbers. The median of a list with an odd
number of entries is found by ranking the numbers and finding the middle number.
Rank the numbers from smallest to largest.
1, 4, 8, 9, 12, 14, 21
The middle number is 9. Therefore the median of the data is 9.
2. The list 46, 23, 92, 89, 77, 108 contains 6 numbers. The median of a list of data
with an even number of entries is found by ranking the numbers and computing the
mean of the two middle numbers. Rank the numbers from smallest to largest.
23, 46, 77, 89, 92, 108
The two middle numbers are 77 and 89. The mean of 77 and 89 is 83. Therefore the
median of the data is 83.
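The ranking procedure used in this solution can be sketched in Python. This is only an illustration; the function name `median` is chosen here and is not part of the lesson.

```python
def median(values):
    """Return the median of a list of numbers using a ranked list."""
    ranked = sorted(values)            # rank from smallest to largest
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of entries: the single middle number.
        return ranked[middle]
    # Even number of entries: mean of the two middle numbers.
    return (ranked[middle - 1] + ranked[middle]) / 2


print(median([4, 8, 1, 14, 9, 21, 12]))    # 9
print(median([46, 23, 92, 89, 77, 108]))   # 83.0
```

Sorting first mirrors the "rank the numbers" step, so the two cases (odd and even count) only differ in how the middle is read off.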
Exercises :
Mode
The mode of a list of numbers is the number that occurs most frequently.
Some lists of numbers do not have a mode. For instance, in the list 2, 7, 9, 11, 33, 16, 50
each number occurs exactly once. Because no number occurs more often than the other
numbers, there is no mode.
A list of numerical data can have more than one mode. For instance, in the list 4, 2, 6, 2, 7,
9, 2, 4, 9, 8, 9, 7, the number 2 occurs three times and the number 9 also occurs three times.
All other numbers occur fewer than three times. Thus 2 and 9 are both modes for the data.
Example :
Solution :
1. In the list 18, 15, 21, 16, 15, 14, 15, 21, the number 15 occurs more often than the
other numbers. Therefore 15 is the mode.
2. Each number in the list occurs only once. Because no number occurs more often
than the others, there is no mode.
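As an illustration, the three situations described above (one mode, no mode, more than one mode) can be checked with a short Python sketch. The helper name `modes` is an assumption made for this example; it returns an empty list when no number occurs more often than the others.

```python
from collections import Counter


def modes(values):
    """Return the list of modes, or an empty list if there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # If several distinct values all occur equally often, none stands out.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([18, 15, 21, 16, 15, 14, 15, 21]))         # [15]
print(modes([2, 7, 9, 11, 33, 16, 50]))                # []
print(modes([4, 2, 6, 2, 7, 9, 2, 4, 9, 8, 9, 7]))     # [2, 9]
```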
Exercises :
1. 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 8, 8, 8, 8
2. 12, 34, 12, 71, 48, 93, 71, 93, 12
The mean, the median, and the mode are all averages; however, they are generally not
equal. The mean of a set of data is the most sensitive of the averages. A change in any of the
numbers changes the mean, and the mean can be changed drastically by changing an
extreme value.
In contrast, the median and the mode of a set of data are usually not changed by
changing an extreme value.
When a data set has one or more extreme values that are very different from the
majority of the data values, the mean will not necessarily be a good indicator of an average
value. In the following example, we compare the mean, median, and mode for the salaries
of 5 employees of a small company.
$$\text{Mean} = \frac{\text{₱}506{,}000}{5} = \text{₱}101{,}200$$
The median is the middle number, ₱36,000. Because the ₱20,000 salary occurs the
most, the mode is ₱20,000. The data contain one extreme value that is much larger than
the other values. This extreme value makes the mean considerably larger than the median.
Most of the employees of this company would probably agree that the median of ₱36,000
better represents the average of the salaries than does either the mean or the mode.
A value called the weighted mean is often used when some data values are more important
than others. For instance, many professors determine a student’s course grade from the
student’s tests and the final examination. Consider the situation in which a professor
counts the final examination score 3 times and each test score 2 times. To find the weighted
mean of the student’s scores, the professor first assigns a weight to each score. In this case
the professor could assign each test a weight of 2 and the final exam score a weight of 3. A
student with test scores of 65, 70, and 75 and a final examination score of 90 has a
weighted mean of
$$\frac{(65 \times 2) + (70 \times 2) + (75 \times 2) + (90 \times 3)}{9} = \frac{690}{9} \approx 76.667$$
Note that the numerator of the weighted mean above is the sum of the products of each test
score and its corresponding weight. The number 9 in the denominator is the sum of all the
weights ( 2+2+2+3 ). The procedure for finding the weighted mean can be generalized as
follows.
The weighted mean of the n numbers x1, x2, x3, . . . , xn with the respective assigned weights
w1, w2, w3, . . . , wn is
$$\text{Weighted mean} = \bar{x} = \frac{\Sigma (x \cdot w)}{\Sigma w}$$
where Σ(x ⋅ w) is the sum of the products formed by multiplying each number by its
assigned weight, and Σw is the sum of all the weights.
A student’s grade point average (GPA) is calculated as a weighted mean, where the
student’s grade in each course is given a weight equal to the number of units (or
credits) that course is worth. Use this 4-point grading system for the given example.
A = 4, B = 3, C = 2, D = 1, E = 0
Table 4.1 shows Peter’s first semester course grades. Use the weighted mean formula
to find Peter’s GPA for the first semester.
Table 4.1
Course       Grade   Units
Math         B       4
History      A       3
Chemistry    D       3
Biology      C       4
Solution:
The B is worth 3 points, with a weight of 4; the A is worth 4 points, with a weight of 3; the D
is worth 1 point, with a weight of 3; and the C is worth 2 points, with a weight of 4. The
sum of all the weights is 4 + 3 + 3 + 4, or 14.
$$\text{Weighted mean} = \bar{x} = \frac{(3 \times 4) + (4 \times 3) + (1 \times 3) + (2 \times 4)}{14} = \frac{35}{14} = 2.5$$
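The weighted mean formula can be tried out with a small Python sketch. The function name `weighted_mean` and the dictionary `grade_points` are names chosen for this illustration; the data are Peter's grades and the test scores from the examples above.

```python
def weighted_mean(values, weights):
    """Sum of (value * weight) divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


# 4-point grading system used in this lesson.
grade_points = {"A": 4, "B": 3, "C": 2, "D": 1, "E": 0}

# (course, grade, units) for Peter's first semester.
peter = [("Math", "B", 4), ("History", "A", 3),
         ("Chemistry", "D", 3), ("Biology", "C", 4)]

points = [grade_points[grade] for _, grade, _ in peter]
units = [u for _, _, u in peter]
print(weighted_mean(points, units))                             # 2.5

# Three tests with weight 2 each and a final exam with weight 3.
print(round(weighted_mean([65, 70, 75, 90], [2, 2, 2, 3]), 3))  # 76.667
```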
EXERCISE SET 13 :
Table 4.2 shows Lourd’s second semester course grades. Use the weighted mean
formula to find Lourd’s GPA for the second semester.
Table 4.2
Course       Grade   Units
Biology      A       4
Statistics   B       3
Business     C       3
Psychology   F       2
CAD          B       2
Frequency Distribution
After the data have been collected from the sample of the population, the next task
facing the statistician is to present the data in a condensed and manageable form. In this
way, the data can be interpreted more easily.
Data that have not been organized or manipulated in any manner are called raw
data. A large collection of raw data may not provide much readily observable information.
A frequency distribution, which is a table that lists observed events and the frequency of
occurrence of each observed event, is often used to organize raw data. For instance,
consider the following table, which lists the number of laptop computers owned by
families in each of 40 homes in a subdivision.
A piece of data is called a data item. This list of data has 40 data items. Some of the
data items are identical. Three of the data items are 5. Thus, we say that the data
value 5 occurs three times. Similarly, because 14 of the data items are 2, the data value 2
occurs 14 times.
2 0 3 1 2 1 0 4
2 1 1 7 2 0 1 1
0 2 2 1 3 2 2 1
1 4 2 5 2 3 1 2
2 1 2 1 5 0 2 5
The next table (Table 2) is a frequency distribution table which was constructed using the
data from the above table. The first column of the frequency distribution consists of the
numbers 0, 1, 2, 3, 4, 5, 6, and 7. The corresponding frequency of occurrence, f, of each of
the numbers in the first column is listed in the last column.
Table 2
x       Tally                   f
0       llll -                  5
1       llll - llll - ll        12
2       llll - llll - llll      14
3       lll                     3
4       ll                      2
5       lll                     3
6                               0
7       l                       1
                                ____
Total                           40
The formula for a weighted mean can be used to find the mean of the data in a
frequency distribution. The only change is that the weights w1 , w2, w3, . . . wn are replaced
with the frequencies f1, f2, f3, . . . fn. This procedure is illustrated in the next example.
Example 1:
Solution :
The numbers in the right-hand column of Table 2 are the frequencies f for the numbers in the
first column. The sum of all the frequencies is 40.
$$\text{Mean} = \bar{x} = \frac{\Sigma (x \cdot f)}{n}$$
$$= \frac{(0 \cdot 5) + (1 \cdot 12) + (2 \cdot 14) + (3 \cdot 3) + (4 \cdot 2) + (5 \cdot 3) + (6 \cdot 0) + (7 \cdot 1)}{40} = \frac{79}{40} = 1.975$$
The mean number of laptop computers per household for the homes in the subdivision is
1.975.
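As a check on this example, the sketch below tallies the 40 raw data items with Python's `collections.Counter` and then applies the frequency form of the weighted mean. The variable names are chosen for this illustration only.

```python
from collections import Counter

# Number of laptop computers in each of the 40 homes (raw data).
laptops = [2, 0, 3, 1, 2, 1, 0, 4,
           2, 1, 1, 7, 2, 0, 1, 1,
           0, 2, 2, 1, 3, 2, 2, 1,
           1, 4, 2, 5, 2, 3, 1, 2,
           2, 1, 2, 1, 5, 0, 2, 5]

frequency = Counter(laptops)          # data value -> frequency f
for x in range(0, 8):
    print(x, frequency[x])            # a value that never occurs has f = 0

n = sum(frequency.values())           # sum of all the frequencies
mean = sum(x * f for x, f in frequency.items()) / n
print(n, mean)                        # 40 1.975
```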
Example 2 :
Table 4.5 :
x       f       x ⋅ f
0       2       0 ⋅ 2 = 0
1       1       1 ⋅ 1 = 1
2       3       2 ⋅ 3 = 6
3       12      3 ⋅ 12 = 36
4       16      4 ⋅ 16 = 64
5       18      5 ⋅ 18 = 90
6       13      6 ⋅ 13 = 78
7       31      7 ⋅ 31 = 217
8       26      8 ⋅ 26 = 208
9       15      9 ⋅ 15 = 135
10      14      10 ⋅ 14 = 140
$$\text{Mean} = \bar{x} = \frac{\Sigma (x \cdot f)}{n} = \frac{975}{151} \approx 6.46$$
The mean of the 0 to 10 stress-level ratings is approximately 6.46. Notice that the mean is
greater than 5, the middle of the 0 to 10 scale.
A frequency distribution that lists all possible data items can be quite cumbersome
when there are many such items . For example , consider the following data items . These
are statistics test scores for a class of 40 students.
82 47 75 64 57 82 63 93
76 68 84 54 88 77 79 80
94 92 94 80 94 66 81 67
75 73 66 87 76 45 43 56
57 74 50 78 71 84 59 76
It is difficult to determine how well the group did when the grades are displayed like
this. Because there are so many data items, one way to organize these data so that the
results are more meaningful is to arrange the grades into groups, or classes, based on
something that interests us. Many grading systems assign an A to grades in the 90 – 100
class, B to grades in the 80 – 89 class, C to grades in the 70 – 79 class, and so on. These
classes provide one way to organize the data.
Looking at the 40 statistics test scores, we see that they range from a low of 43 to a
high of 94. We can use classes that run from 40 through 49, 50 through 59, 60 through 69,
and so on up to 90 through 99, to organize the scores. In the example, we go through the
data and tally each item into the appropriate class. This method for organizing data is called
a grouped frequency distribution.
TABLE 4.6 :
Class      Tally                 Frequency
40 - 49    lll                   3
50 - 59    llll - l              6
60 - 69    llll - l              6
70 - 79    llll - llll - l       11
80 - 89    llll - llll           9
90 - 99    llll -                5
Omitting the tally column results in the grouped frequency distribution in Table 4.7. The
distribution shows that the greatest number of students scored in the 70 – 79 class. The
number of students decreases in classes that contain successively lower and higher scores.
The sum of the frequencies, 40, is equal to the original number of data items.
Table 4.7.
Class Frequency
40 - 49 3
50 - 59 6
60 - 69 6
70 - 79 11
80 - 89 9
90 - 99 5
Total: n = 40
The leftmost number in each class of a grouped frequency distribution is called the
lower class limit. For example, in Table 4.7, the lower limit of the first class is 40 and the
lower limit of the third class is 60. The rightmost number in each class is called the upper
class limit. In Table 4.7, 49 and 69 are the upper class limits of the first and third classes,
respectively. Notice that if we take the difference between any two consecutive lower class
limits, we get the same number, the class width.
50 - 40 = 10, 60 - 50 = 10, 70 - 60 = 10, 80 - 70 = 10, 90 - 80 = 10
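The tallying procedure for a grouped frequency distribution can be sketched in Python as follows. The function name `grouped_frequency` and its parameters are assumptions made for this illustration; the data are the 40 statistics test scores listed earlier, grouped into classes of width 10 starting at a lower class limit of 40.

```python
scores = [82, 47, 75, 64, 57, 82, 63, 93,
          76, 68, 84, 54, 88, 77, 79, 80,
          94, 92, 94, 80, 94, 66, 81, 67,
          75, 73, 66, 87, 76, 45, 43, 56,
          57, 74, 50, 78, 71, 84, 59, 76]


def grouped_frequency(data, lowest_limit, width, num_classes):
    """Return a list of ((lower limit, upper limit), frequency) pairs."""
    table = []
    for k in range(num_classes):
        lower = lowest_limit + k * width     # lower class limit
        upper = lower + width - 1            # upper class limit
        count = sum(1 for x in data if lower <= x <= upper)
        table.append(((lower, upper), count))
    return table


table = grouped_frequency(scores, lowest_limit=40, width=10, num_classes=6)
for (lower, upper), count in table:
    print(f"{lower} - {upper}: {count}")
print("Total:", sum(count for _, count in table))   # Total: 40
```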
When setting up class limits, each class, with the possible exception of the first or
last, should have the same width. Because each data item must fall into exactly one class, it
is sometimes helpful to vary the width of the first or last class to allow for items that fall far
above or below most of the data.
Exercise :
A housing subdivision consists of 45 homes. The following frequency distribution shows the
number of homes in the subdivision that are two-bedroom homes, the number that are
three-bedroom homes, the number that are four-bedroom homes, and the number that are
five-bedroom homes. Find the mean number of bedrooms for the 45 homes.
Number of        Number of homes
bedrooms, x      with x bedrooms
2                5
3                25
4                10
5                5
                 ______
Total            45
EXERCISE SET 14 :
1. The following table displays the ages of actors when they starred in their Oscar –
winning Best Actor performances in 1980 – 2015 Academy Awards.
41 33 31 74 33 49 38 61 21 41 26 80
42 29 33 36 45 49 39 34 26 25 33 35
35 28 30 29 61 32 33 45 66 25 46 55
Find the mean , median and mode for the data in the table.
2. In some 4.0 grading systems, a student’s grade point average (GPA) is calculated by
assigning letter grades the following numerical values.
English A 3
Anthropology A 3
Chemistry B 4
French C+ 3
Theatre B– 2
History D+ 3
Computer Science B+ 2
Math A- 3
3. Find the mean for the data in the given frequency distribution.
basketball game
2 6
4 5
5 6
9 3
10 1
14 2
19 1
In the preceding units we introduced three types of averages for a data set - the mean ,
the median and the mode. Some characteristics of a set of data may not be evident from the
examination of averages.
Example 1:
For instance, consider a soft-drink dispensing machine that should dispense 8 oz of your
selection into a cup. Table 4.8 shows data for two of these machines.
Table 4.8
Machine 1        Machine 2
9.52             8.01
6.41             7.99
10.07            7.95
5.85             8.03
8.15             8.03
$\bar{x}$ = 8.0          $\bar{x}$ = 8.0
The mean data value for each machine is 8 oz. However, look at the variation in
data values for machine 1. The quantity of soda dispensed is very inconsistent: in some
cases the soda overflows the cup, and in other cases too little soda is dispensed. The
machine obviously needs adjustments. Machine 2, on the other hand, is working just fine.
The quantity dispensed is very consistent, with little variation.
This example shows that average values do not reflect the spread or dispersion
of data.
Example 2.
When you think of Houston, Texas and Honolulu, Hawaii, do the same temperatures come to
mind? Both cities have a mean temperature of 75°. However, the mean temperature does
not tell the whole story. The temperature in Houston differs seasonally from a low of about
40° in January to a high of close to 100° in July and August. By contrast, Honolulu’s
temperature varies less throughout the year, usually ranging between 60° and 90°.
Measures of dispersion are used to describe the spread of data items in a data set.
To measure the spread or dispersion of data, we introduce two of the most
common statistical values, the range and the standard deviation.
The Range
A quick but rough measure of dispersion is the range, the difference between the highest
(greatest) data value and the lowest (least) data value in a data set.
1. For example, if Houston’s hottest annual temperature is 103° and its coldest is 33°, the
range in temperature is
103° − 33° = 70°.
If Honolulu’s hottest day is 89° and its coldest day is 61°, the range in temperature is
89° − 61° = 28°.
2. Find the range of the numbers of ounces dispensed by machine 1 in the given table.
Solution :
The greatest number of ounces dispensed is 10.07 and the least is 5.85 . The range of the
numbers of ounces dispensed is 10.07 - 5.85 = 4.22 oz.
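A minimal Python sketch of the range, applied to the two soft-drink machines of Table 4.8, is given below. The function name `data_range` is chosen for this illustration; the results are rounded to two decimal places because decimal values are stored only approximately in floating point.

```python
def data_range(values):
    """Range = highest data value minus lowest data value."""
    return max(values) - min(values)


machine_1 = [9.52, 6.41, 10.07, 5.85, 8.15]
machine_2 = [8.01, 7.99, 7.95, 8.03, 8.03]

print(round(data_range(machine_1), 2))   # 4.22
print(round(data_range(machine_2), 2))   # 0.08
```

Both machines have the same mean, but the much larger range of machine 1 shows its greater spread.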
The Range
The range , the difference between the highest and the lowest data values in a data set ,
indicates the total spread of the data.
Exercises:
a. 16, 17 , 18 , 19, 20
b. 11, 13 , 14 , 15 , 16 , 17
c. 3, 3, 4, 4, 5, 5
A second measure of dispersion, and one that is dependent on all of the data items, is
called the standard deviation. The standard deviation is found by determining how much
each data item differs from the mean.
Example: Preparing to find the standard deviation; finding deviations from the mean.
Find the deviations from the mean for the five countries with the most workers, whose data
items are 778, 472, 147, 106, and 82 (in millions).
Solution :
$$\text{Mean} = \bar{x} = \frac{\Sigma x}{n} = \frac{778 + 472 + 147 + 106 + 82}{5} = \frac{1585}{5} = 317 \text{ million}$$
The mean for the countries with the largest labor forces is 317 million workers. Now, let’s
find by how much each of the five data items differs from 317, the mean.
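Data item (x)      Deviation (x − x̄)
778                778 − 317 = 461
472                472 − 317 = 155
147                147 − 317 = −170
106                106 − 317 = −211
82                 82 − 317 = −235
Σx = 1585
$$\text{Mean} = \bar{x} = \frac{\Sigma x}{n} = \frac{1585}{5} = 317$$
For China, with 778 million workers, the computation is shown as follows:
778 − 317 = 461
This indicates that the labor force in China exceeds the mean by 461 million workers.
The computation for the United States, with 147 million workers, is given by
147 − 317 = −170
This indicates that the labor force in the United States is 170 million workers below the mean.
The sum of the deviations for a set of data is always zero. For the deviations in the table above,
461 + 155 + (−170) + (−211) + (−235) = 0.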
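1. Find the mean of the data items: $\bar{x} = \frac{\Sigma x}{n}$.
2. Find the deviation of each data item from the mean: $x - \bar{x}$.
3. Square each deviation: $(x - \bar{x})^2$.
4. Sum the squared deviations: $\Sigma (x - \bar{x})^2$.
5. Divide the sum in step 4 by n − 1, where n represents the number of data items:
$$\frac{\Sigma(\text{data item} - \text{mean})^2}{n-1} = \frac{\Sigma (x - \bar{x})^2}{n-1}$$
6. Take the square root of the quotient in step 5. This value is the standard deviation for
the data set.
$$\text{Standard deviation} = \sqrt{\frac{\Sigma(\text{data item} - \text{mean})^2}{n-1}} = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n-1}}$$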
The computation of the standard deviation can be organized using a table with three
columns.
Data item: x      Deviation: x − x̄      (Deviation)²: (x − x̄)²
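Example : Table 4.10 shows the number of workers, in millions, for the five
countries with the largest labor forces. Find the standard deviation, in millions, for these
five countries.
Table 4.10
Data item: x      Deviation: x − x̄      (Deviation)²: (x − x̄)²
778               461                   212,521
472               155                   24,025
147               −170                  28,900
106               −211                  44,521
82                −235                  55,225
                  Σ(x − x̄) = 0          Σ(x − x̄)² = 365,192
$$\text{Standard deviation} = \sqrt{\frac{\Sigma(\text{data item} - \text{mean})^2}{n-1}} = \sqrt{\frac{365{,}192}{4}} = \sqrt{91{,}298} \approx 302.16$$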
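The six-step procedure can be followed line by line in Python. This is a sketch under the lesson's sample formula (division by n − 1); the function name `sample_standard_deviation` is chosen for this illustration.

```python
import math


def sample_standard_deviation(data):
    """Standard deviation using the n - 1 divisor, step by step."""
    n = len(data)
    mean = sum(data) / n                          # step 1: mean
    deviations = [x - mean for x in data]         # step 2: x - mean
    squared = [d ** 2 for d in deviations]        # step 3: square each
    total = sum(squared)                          # step 4: sum of squares
    return math.sqrt(total / (n - 1))             # steps 5 and 6


workers = [778, 472, 147, 106, 82]                # in millions
mean = sum(workers) / len(workers)
print(sum(x - mean for x in workers))             # 0.0 (deviations sum to zero)
print(sum((x - mean) ** 2 for x in workers))      # 365192.0
print(round(sample_standard_deviation(workers), 2))   # 302.16
```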
Exercises:
A consumer group has tested a sample of 8 size-D batteries from each of 3 companies. The
results of the tests are shown in the following table. According to these tests, which
company produces batteries for which the values representing hours of constant use have
the smallest standard deviation?
Ever ready 6.2, 6.4, 7.1, 5.9, 8.3, 5.3, 7.5, 9.3
The Variance
A statistic known as the variance is also used as a measure of dispersion . The variance for
a given set of data is the square of the standard deviation of the data. The following chart
shows the mathematical notations that are used to denote standard deviations and
variance.
2 , 4 , 7 , 12 , 15
Solution :
$$\text{Mean} = \bar{x} = \frac{2 + 4 + 7 + 12 + 15}{5} = \frac{40}{5} = 8$$
Table 4.11
Data item     Deviation:        (Deviation)²:
2             2 − 8 = −6        (−6)² = 36
4             4 − 8 = −4        (−4)² = 16
7             7 − 8 = −1        (−1)² = 1
12            12 − 8 = 4        (4)² = 16
15            15 − 8 = 7        (7)² = 49
$$\text{Standard deviation} = s = \sqrt{\frac{\Sigma(\text{data item} - \text{mean})^2}{n-1}} = \sqrt{\frac{118}{4}} = \sqrt{29.5} \approx 5.43$$
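Because the variance is the square of the standard deviation, both can be read off from the same computation. As a cross-check, the sketch below uses Python's standard `statistics` module, whose `variance` and `stdev` functions use the same n − 1 divisor as this lesson.

```python
import statistics

data = [2, 4, 7, 12, 15]

variance = statistics.variance(data)     # sum of squared deviations / (n - 1)
std_dev = statistics.stdev(data)         # square root of the variance

print(variance)                          # 29.5
print(round(std_dev, 2))                 # 5.43
```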
Find the Range , the standard deviation , and the variance for the following:
1. 1, 2 , 5 , 7 , 19 , 22
2. 3 , 4 , 7 , 11 , 12 , 12 , 15 , 16
3. 78 , 91, 87 , 93 , 59 , 68 , 92 , 100 , 81
4. 93 , 67 , 49 , 55 , 92 , 87 , 77 , 66 , 73 , 96 , 54
5. 8,6,8,6,8,6,8,6,8,6,8,6,8
Consider an Internet site that offers movie downloads. Based on data kept by the
site, an estimate of the mean time to download a certain movie is 12 min, with a standard
deviation of 4 min. When you download this movie, the download takes 20 min, which you
think is an unusually long time for the download. On the other hand, when your friend
downloads the movie, the download takes only 6 min, and your friend is pleasantly
surprised at how quickly she receives the movie. In each case, a data value far from the
mean is unexpected.
The number of standard deviations between a data value and the mean is known as
the data value’s z-score or standard score.
z – Score
The z-score for a given data value x is the number of standard deviations that x is above or
below the mean of the data. The following formulas show how to calculate the z-score for a
data value x in a population and in a sample.
$$\text{Population: } z_x = \frac{x - \mu}{\sigma} \qquad \text{Sample: } z_x = \frac{x - \bar{x}}{s}$$
x = data item
The z-score equation involves four variables. If the values of any three of the four
variables are known, you can solve for the unknown variable.
Example : Aggu Utang has taken two tests in his chemistry class . He scored 72 on the first
test , for which the mean of all scores was 65 and the standard deviation was 8 . He
received a 60 on a second test , for which the mean of all scores was 45 and the standard
deviation was 12. In comparison to the other students , did Aggu Utang do better on the
first test or the second test ?
Solution :
$$z_{72} = \frac{72 - 65}{8} = 0.875 \qquad z_{60} = \frac{60 - 45}{12} = 1.25$$
Aggu Utang scored 0.875 standard deviations above the mean on the first test and 1.25
standard deviations above the mean on the second test. The z-scores indicate that, in
comparison to his classmates, Aggu Utang scored better on the second test than he did on
the first test.
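The comparison in this solution can be reproduced with a few lines of Python. The function name `z_score` is chosen for this illustration.

```python
def z_score(x, mean, standard_deviation):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / standard_deviation


z_first = z_score(72, mean=65, standard_deviation=8)
z_second = z_score(60, mean=45, standard_deviation=12)

print(z_first, z_second)        # 0.875 1.25
print(z_second > z_first)       # True: the second test is the better relative score
```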
Percentiles
Most standardized examinations provide scores in terms of percentiles, which are defined
as follows:
pth Percentile
A value x is called the pth percentile of a data set provided p% of the data values are less
than x.
In a recent year, the median annual salary of a Medical Technologist was ₱185,698.00. If
the 90th percentile for the annual salary of a Medical Technologist was ₱205,500.00, find the
percent of Medical Technologists whose annual salary was
a. more than ₱185,698.00
b. less than ₱205,500.00
c. between ₱185,698.00 and ₱205,500.00
Solution :
a. By definition, the median is the 50th percentile. Therefore, 50% of the Medical
Technologists earned more than ₱185,698.00.
b. Because ₱205,500.00 is the 90th percentile, 90% of all Medical Technologists made less
than ₱205,500.00.
c. From parts a and b, 90% − 50% = 40% of the Medical Technologists earned between
₱185,698.00 and ₱205,500.00.
The following formula can be used to find the percentile that corresponds to a data value
in a set of data.
Example :
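On a reading examination given to 950 students, Jack Ammu’s score of 652 was higher than
the scores of 580 of the students who took the examination. What is the percentile for Jack
Ammu’s score?
$$\text{Percentile} = \frac{\text{number of scores less than 652}}{\text{total number of scores}} \cdot 100 = \frac{580}{950} \cdot 100 \approx 61$$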
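The percentile computation in this example can be sketched in Python as follows. The function name `percentile_of` is an assumption made for this illustration; it takes the number of data values below the given value and the total number of data values.

```python
def percentile_of(count_below, total):
    """Percent of the data values that are less than the given value."""
    return count_below / total * 100


jack = percentile_of(580, 950)
print(round(jack, 2))    # 61.05
print(round(jack))       # 61
```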
Quartiles
The three numbers Q1 , Q2 , and Q3 , that partition a ranked data set into four
( approximately ) equal groups are called the quartiles of the data .
↑ ↑ ↑
Q1 Q2 Q3
The quartile Q1 is called the first quartile. The quartile Q2 is called the second quartile. It
is the median of the data. The quartile Q3 is called the third quartile. The following
method of finding the quartiles makes use of medians.
3. The first quartile, Q1, is the median of the data values less than Q2. The third quartile, Q3, is
the median of the data values greater than Q2.
The following table lists the calories per 100 milliliters of 25 popular soft drinks. Find the
quartiles for the data.
43 37 42 40 53 62 36 32 50 49 26 53 73
48 45 39 45 48 40 56 41 36 58 42 39
Solution :
Step 1: Rank the data as shown in the following table.
1) 26      11) 42      21) 53
2) 32      12) 42      22) 56
3) 36      13) 43      23) 58
4) 36      14) 45      24) 62
5) 37      15) 45      25) 73
6) 39      16) 48
7) 39      17) 48
8) 40      18) 49
9) 40      19) 50
10) 41     20) 53
Step 2: The median of these 25 data values has a rank of 13. Thus the median is 43. The
second quartile, Q2, is the median of the data, so Q2 = 43.
Step 3: There are 12 data values less than the median and 12 data values greater than the
median. The first quartile is the median of the data values less than the median. Thus Q1 is
the mean of the data values with ranks 6 and 7.
$$Q_1 = \frac{39 + 39}{2} = 39$$
The third quartile is the median of the data values greater than the median. Thus, Q3 is
the mean of the data values with ranks 19 and 20.
$$Q_3 = \frac{50 + 53}{2} = 51.5$$
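The median-based method for quartiles can be sketched in Python. The helper names `median` and `quartiles` are chosen for this illustration, and the sketch follows the wording of the method literally: Q1 and Q3 are the medians of the data values strictly less than and strictly greater than Q2.

```python
def median(ranked):
    """Median of a list that is already ranked from smallest to largest."""
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2


def quartiles(data):
    """Return (Q1, Q2, Q3) using the medians method."""
    ranked = sorted(data)                       # step 1: rank the data
    q2 = median(ranked)                         # step 2: median of all data
    lower = [x for x in ranked if x < q2]       # values less than Q2
    upper = [x for x in ranked if x > q2]       # values greater than Q2
    return median(lower), q2, median(upper)     # step 3


calories = [43, 37, 42, 40, 53, 62, 36, 32, 50, 49, 26, 53, 73,
            48, 45, 39, 45, 48, 40, 56, 41, 36, 58, 42, 39]

print(quartiles(calories))    # (39.0, 43, 51.5)
```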
EXERCISE SET 16 :
1. A data set has a mean of x = 75 and a standard deviation of 11.5 . Find the z-score for
each of the following:
a. x = 85 b. x = 95
c. x = 50 d. x = 75
2. Which of the following three test scores is the highest relative score?
b. A score of 102 on a test with a mean of 130 and a standard deviation of 18.5.
c. A score of 605 on a test with a mean of 720 and a standard deviation of 116.4.
Large sets of data are often displayed using a grouped frequency distribution or a
histogram. For instance, consider the following situation. An Internet Service Provider
(ISP) has installed new computers. To estimate the new download times its subscribers
will experience, the ISP surveyed 1000 of its subscribers to determine the time required
for each subscriber to download a particular file from an Internet site. The results of that
survey are summarized in Table 4.12.
Table 4.12 :
Download time      Number of
(in seconds)       subscribers
0 - 5              6
5 - 10             17
10 - 15            43
15 - 20            92
20 - 25            151
25 - 30            192
30 - 35            190
35 - 40            149
40 - 45            90
45 - 50            45
50 - 55            15
55 - 60            10
[Figure: histogram of the data in Table 4.12. Vertical axis: number of subscribers (0 to 200).
Horizontal axis: 0 to 60.]
The type of frequency distribution that lists the percent of data in each class is
called a relative frequency distribution. The relative frequency histogram was drawn by
using the data in the relative frequency distribution. It shows the percent of subscribers
along its vertical axis.
a. the percent of subscribers who required at least 25 seconds to download the file.
b. probability that a subscriber chosen at random will require at least 5 seconds but less
than 20 seconds to download the file.
Solution :
a. The percent of data in all the classes with a lower boundary of 25 seconds or more is the
sum of the percents for the classes 25 - 30 through 55 - 60 in Table 4.14 below. Thus the
percent of the subscribers who required at least 25 seconds to download the file is 69.1%.
Table 4.14
Download time      Percent of
(in seconds)       subscribers
0 - 5              0.6
5 - 10             1.7
10 - 15            4.3
15 - 20            9.2
20 - 25            15.1
25 - 30            19.2
30 - 35            19.0
35 - 40            14.9
40 - 45            9.0
45 - 50            4.5
50 - 55            1.5
55 - 60            1.0
Sum for classes 5 - 10 through 15 - 20: 15.2%
Sum for classes 25 - 30 through 55 - 60: 69.1%
b. The percent of the data in all the classes with a lower boundary of at least 5 seconds and
an upper boundary of at most 20 seconds is the sum of the percents for the classes 5 - 10,
10 - 15, and 15 - 20 in Table 4.14 above: 1.7% + 4.3% + 9.2% = 15.2%. Thus, the percent of
subscribers who required at least 5 seconds but less than 20 seconds to download the file is
15.2%. The probability that a subscriber chosen at random will require at least 5 seconds
but less than 20 seconds to download the file is 0.152.
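Both answers can be checked directly from the counts in Table 4.12. The Python sketch below builds the relative frequency distribution and adds the relevant classes; the variable names are chosen for this illustration, and results are rounded to one decimal place because of floating-point arithmetic.

```python
# (lower boundary, upper boundary) in seconds -> number of subscribers
counts = {(0, 5): 6, (5, 10): 17, (10, 15): 43, (15, 20): 92,
          (20, 25): 151, (25, 30): 192, (30, 35): 190, (35, 40): 149,
          (40, 45): 90, (45, 50): 45, (50, 55): 15, (55, 60): 10}

total = sum(counts.values())                          # 1000 subscribers

# Relative frequency distribution: percent of data in each class.
percents = {cls: 100 * f / total for cls, f in counts.items()}

at_least_25 = sum(p for (lower, upper), p in percents.items() if lower >= 25)
from_5_to_20 = sum(p for (lower, upper), p in percents.items()
                   if lower >= 5 and upper <= 20)

print(round(at_least_25, 1))          # 69.1
print(round(from_5_to_20, 1))         # 15.2
print(round(from_5_to_20 / 100, 3))   # 0.152 (probability)
```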
[Figure 4.4. Vertical axis: frequency.]
Figures 4.2 and 4.3 show non-normal distributions. Figure 4.2 has two peaks. There is also
a gap in the data. The peak of Figure 4.3 is not centered, which violates the concept of a bell.
Figure 4.4 shows a normal distribution.
1. It is a bell-shaped curve.
3. The tails of the normal curve are asymptotic to the horizontal axis.
6. The mean, median, and mode have the same value.
The standard normal distribution has the same properties as the normal distribution,
except that the mean is zero and the standard deviation is 1.
It was stated that the normal distribution is symmetric about the mean. This signifies that
the area for a z-value is the same whether z is positive or negative. Hence, the area of −z is
equal to the area of +z.
The concept of probability is used for the normal distribution. Probabilities range from 0
to 1. This means that the values of areas cannot be negative. Moreover, they also cannot
have values greater than 1.
The notations P ( a < z < b ), P ( z < a ), and P ( z > a ) will be used, and their
meanings are as follows:
Using the z-table, the area for z = −0.46 is 0.1772 and the area for z = 0.52 is 0.1985.
For z = −0.46, look for 0.4 under the z column and then for the column headed 0.06. The
entry at the intersection of the row of 0.4 and the column of 0.06 is the area, which is 0.1772.
The same goes for z = 0.52, with an area of 0.1985: look for 0.5 along the z column and then
for the column headed 0.02. The intersection of the row of 0.5 and the column of 0.02 is the
area, which is 0.1985.
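The table values quoted here are areas between the mean (z = 0) and the given z. Assuming the table is of that kind, the same areas can be obtained from Python's standard `statistics.NormalDist` class, whose `cdf` method gives the area to the left of z. The function name `area_from_mean` is chosen for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)   # mean 0, standard deviation 1


def area_from_mean(z):
    """Area under the standard normal curve between 0 and z.

    By symmetry, the area for -z equals the area for +z.
    """
    return standard_normal.cdf(abs(z)) - 0.5


print(round(area_from_mean(-0.46), 4))   # 0.1772
print(round(area_from_mean(0.52), 4))    # 0.1985
```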
To find the areas under the normal curve, three things must be done:
3. Calculate the area by using the Table of Areas under the Normal Curve.
Example :
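2. P ( −2.58 < z < 2.58 )
[Figure: normal curve with −2.58, 0, and 2.58 marked on the horizontal axis; area label 0.3389]
3. P ( z > 1.95 )
[Figure: normal curve with 1.95 marked on the horizontal axis]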
If the areas are given , what are the values of z ? Here are some examples :
Since the area given is less than 0.5 , the shaded area is on the extreme left or extreme right.
However , looking at the direction , it can be seen that the shaded area is at the extreme
right.
Therefore ,
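Since the area given is more than 0.5 and there are two values of z0 to be obtained,
0.8452 has to be divided by 2. The z-table entry closest to 0.4226, which is 0.8452 / 2,
gives the value of z0.
[Figure: standard normal curve with -z0 and z0 marked on the horizontal axis]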
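The reverse lookup can also be sketched in code. The example below assumes the problem asks for z0 such that P(-z0 < z < z0) = 0.8452, as the surrounding text suggests; the value of z0 it prints is computed here and is not quoted from the lesson.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

total_area = 0.8452
half_area = total_area / 2  # area between 0 and z0, by symmetry

# The area to the left of z0 is 0.5 (left half of the curve) plus half_area.
z0 = Z.inv_cdf(0.5 + half_area)

print(round(half_area, 4))  # 0.4226
print(round(z0, 2))         # z0, to z-table precision
```

Running the sketch gives z0 of about 1.42, so the two values would be -1.42 and 1.42.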
There are various applications of the normal distribution to real-life problems. Such
problems are first transformed to the standard normal distribution using the formula:
\[ z = \frac{x - \mu}{\sigma} \]
where
x = random variable
μ = population mean
σ = population standard deviation
Examples:
1. Thirteen students who took the final exam last term have a mean grade of 34.08 and a
standard deviation of 7.62.
a. What is the probability that Akiwikiwag will get more than 40 in the final exam?
\[ z = \frac{40 - 34.08}{7.62} = 0.78 \]
From the z-table, the area between z = 0 and z = 0.78 is 0.2823, so the area to the right of
z = 0.78 is 0.5 - 0.2823 = 0.2177. This means that Akiwikiwag has a 21.77% chance of
getting more than 40 in the final exam.
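The standardization step in part (a) can be sketched as follows. The variable names are illustrative, and rounding z to two decimals is an assumption made here to mirror a z-table lookup.

```python
from statistics import NormalDist

mean = 34.08  # mean grade given in the example
sd = 7.62     # standard deviation given in the example
x = 40        # grade of interest

# Standardize, then round to two decimals as one would before using a z-table.
z = round((x - mean) / sd, 2)

# P(grade > 40) is the area to the right of z under the standard normal curve.
p_more_than_40 = 1 - NormalDist().cdf(z)

print(z)                         # 0.78
print(round(p_more_than_40, 4))  # compare with 0.2177 in the text
```

The same pattern, with a different mean, standard deviation, and x, applies to the other examples in this lesson.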
\[ z_1 = \frac{30 - 34.08}{7.62} = -0.54 \qquad z_2 = \frac{40 - 34.08}{7.62} = 0.78 \]
_____________
2. The average age at which a Filipino man undergoes the sacrament of matrimony is 29, with
a standard deviation of 2.5 years. Peter, aged 26, is contemplating whether he should marry
already. What is the probability that he will marry before he reaches 30?
\[ z_1 = \frac{26 - 29}{2.5} = -1.2 \qquad z_2 = \frac{30 - 29}{2.5} = 0.4 \]
____________
Exercise Set 17 :
1. Find the area of the standard normal distribution between z = -1.44 and z = 0.
2. Find the area of the standard normal distribution between z = - 0.67 and z = 0.
3. A soft drink machine dispenses soft drinks into 12-ounce cups. Tests show that the
actual amount of soft drink dispensed is normally distributed, with a mean of 11.5 oz and
a standard deviation of 0.2 oz.
a. What percent of cups will receive less than 11.25 oz of soft drink?
b. What percent of cups will receive between 11.2 and 11.5 oz of soft drink?
c. If a cup is filled at random, what is the probability that the machine will overflow the
cup?
Linear Regression
When performing research studies, scientists often wish to know whether two
variables are related. If the variables are determined to be related, a scientist may then
wish to find an equation that can be used to model the relationship. For instance, a
geologist might want to know whether there is a relationship between the duration of an
eruption of a geyser and the time between eruptions. A first step in this determination is
to collect some data. Data involving two variables are called bivariate data. Table 6.1 gives
bivariate data showing the time between two eruptions and the duration of the second
eruption for 10 eruptions of the geyser Old Faithful.
Table 6.1:
Time between eruptions (in seconds), x:  272  227  237  238  203  270  218  226  250  245
Duration of eruption (in seconds), y:     89   79   83   82   81   85   78   81   85   79
Once the data are collected , a scatter diagram or scatter plot can be drawn , as shown in
Figure 6.1
[Figure 6.1: Scatter plot of the data in Table 6.1, with time between eruptions on the
horizontal axis and length of eruptions on the vertical axis]
One way for the geologist to create a model of the relationship between the time between two
eruptions and the duration of the second eruption is to find the line that approximates the
data points plotted in the scatter plot (the dots). Many lines can be drawn
in Figure 6.1.
Of all the possible lines that can be drawn, the one that is usually of most interest is
called the line of best fit or the least-squares regression line. The least-squares line is the
line that fits the data better than any other line that might be drawn. The least-squares
regression line is defined as follows.
The least-squares regression line for a set of bivariate data is the line that minimizes the
sum of the squares of the vertical deviations from each data point to the line. The
least-squares line, or simple linear regression line, seeks to develop an equation that will
predict future values of the dependent variable from the values of the independent variable.
In this definition, the phrase "minimizes the sum of the squares of the vertical deviations"
means that, of all possible lines, the linear equation that minimizes the sum
\[ d_1^2 + d_2^2 + d_3^2 + d_4^2 + d_5^2 + d_6^2 + d_7^2 + d_8^2 + d_9^2 + d_{10}^2 \]
is the equation of the line of best fit. In this expression, each d_n represents the vertical
distance from point n to the line.
[Figure: scatter plot of the data in Table 6.1 with a line drawn through the points and the
vertical deviations d1 through d10 marked between each data point and the line]
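To make the idea of "minimizing the sum of squared vertical deviations" concrete, the sketch below computes that sum for two candidate lines over the Table 6.1 data. Both candidate lines are hypothetical choices made for illustration; neither is taken from the lesson.

```python
# Table 6.1: time between eruptions (x) and duration of eruption (y), in seconds
xs = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
ys = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]

def sum_squared_deviations(a, b, xs, ys):
    """Sum of d_n^2, where d_n is the vertical deviation of point n from y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Two hypothetical candidate lines (illustrative slopes and intercepts)
line_1 = (0.12, 53.6)
line_2 = (0.05, 70.0)

s1 = sum_squared_deviations(*line_1, xs, ys)
s2 = sum_squared_deviations(*line_2, xs, ys)

# The line with the smaller sum fits the data better in the least-squares sense.
print(f"line 1: {s1:.2f}")
print(f"line 2: {s2:.2f}")
```

Of these two candidates, the first has the smaller sum. The least-squares line is the one line for which this sum is smaller than for every other possible line.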
Applying some techniques from calculus, it is possible to find a formula for the
least-squares line. For the n ordered pairs
(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_n, y_n),
the regression line, or prediction line, is drawn on the scatter plot and is given by
\[ \hat{y} = ax + b, \]
where
\[ a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \quad\text{and}\quad b = \bar{y} - a\bar{x} \]
Ʃx = 272 + 227 + 237 + 238 + 203 + 270 + 218 + 226 + 250 + 245 = 2386
Ʃy = 89 + 79 + 83 + 82 + 81 + 85 + 78 + 81 + 85 + 79 = 822
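\[ \sum x^2 = 272^2 + 227^2 + 237^2 + 238^2 + 203^2 + 270^2 + 218^2 + 226^2 + 250^2 + 245^2 = 573{,}560 \]
\[ a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \]
\[ \bar{x} = \frac{\sum x}{n} = \frac{2386}{10} = 238.6 \quad\text{and}\quad \bar{y} = \frac{\sum y}{n} = \frac{822}{10} = 82.2 \]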
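The sums and coefficients for the Old Faithful data can be reproduced with a short script. This is a minimal sketch of the formulas for a and b applied to Table 6.1; the value of Ʃxy and the coefficients it prints are computed here and are not quoted from the lesson.

```python
# Table 6.1: time between eruptions (x) and duration of eruption (y), in seconds
xs = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
ys = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]

n = len(xs)
sum_x = sum(xs)                              # 2386, as in the text
sum_y = sum(ys)                              # 822, as in the text
sum_x2 = sum(x * x for x in xs)              # 573560, as in the text
sum_xy = sum(x * y for x, y in zip(xs, ys))  # computed here

# Slope and intercept of the least-squares line y-hat = a*x + b
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - a * (sum_x / n)

print(sum_x, sum_y, sum_x2, sum_xy)
print(f"a = {a:.4f}, b = {b:.2f}")
```

With these data the slope comes out at roughly 0.119 and the intercept at roughly 53.8.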
[Figure: scatter plot of the data in Table 6.1, with time between eruptions on the
horizontal axis and length of eruptions on the vertical axis]
We can now use the regression equation to estimate the duration of an eruption given the
time between eruptions. For instance, if the time between two eruptions is 250 seconds,
then the estimated duration of the second eruption is
ŷ ≈ 84 seconds.
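The substitution behind this estimate can be written out as follows. The coefficients a ≈ 0.119 and b ≈ 53.8 are assumptions here: they are what the least-squares formulas give for the data in Table 6.1, rounded, and are not printed in the surviving text.

```latex
\[
\hat{y} = ax + b \approx 0.119(250) + 53.8 \approx 83.6 \approx 84
\]
```

Rounding to the nearest second gives the estimate of 84 seconds stated above.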
The scatter plot is a visual representation of the linear relationship between the two
variables. It is a graph involving the x- and y-axes. The following scatter plots show
different kinds of linear relationships between two variables.
[Figure: three scatter plots of y against x; the last panel is labeled "No relationship"]
There are many methods to get the value of a correlation coefficient. However, the
Pearson product-moment correlation coefficient (or simply the Pearson correlation
coefficient) will be used throughout this lesson. The formula for the Pearson correlation
coefficient is given by
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]
where:
n = number of ordered pairs
x = independent variable
y = dependent variable
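The formula translates directly into a short function. The sketch below is one possible implementation, applied for illustration to the Old Faithful data of Table 6.1; the function name pearson_r and the resulting value of r are not part of the lesson.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from the raw-sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Table 6.1: time between eruptions (x) and duration of eruption (y), in seconds
time_between = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]
duration = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]

r = pearson_r(time_between, duration)
print(round(r, 2))
```

For these data r is roughly 0.76, a positive value, which is consistent with the upward trend in the scatter plot.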
Table 6.1:
[Figure 6.5: Scatter plot of income (vertical axis) against hours using the lathe machine
(horizontal axis)]
It can be presumed that there is a positive relationship between the hours on the lathe
machine and the income.
Table 6.2 :
Month    X    Y    XY    X²    Y²
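\[ r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}} \]
\[ r = \frac{12(332.175) - (62.25)(62.8)}{\sqrt{[12\sum X^2 - (62.25)^2][12\sum Y^2 - (62.8)^2]}} \]
\[ r = \frac{76.8}{\sqrt{[76.6875][208.1]}} = 0.61 \]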
As with the scatter plot, the obtained value is positive. Therefore, there is a
positive relationship between the number of hours on the lathe machine and the income
per month.
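The arithmetic in the last steps can be checked with the intermediate values printed in the computation. This sketch uses only numbers that appear in the text (n = 12, ƩXY = 332.175, ƩX = 62.25, ƩY = 62.8, and the two bracketed terms); the variable names are illustrative.

```python
import math

n = 12
sum_xy = 332.175
sum_x = 62.25
sum_y = 62.8

# Bracketed terms of the denominator, as printed in the text
x_term = 76.6875  # n*sum(X^2) - (sum X)^2
y_term = 208.1    # n*sum(Y^2) - (sum Y)^2

numerator = n * sum_xy - sum_x * sum_y
r = numerator / math.sqrt(x_term * y_term)

print(round(numerator, 1))  # 76.8
print(round(r, 2))          # 0.61
```

Both printed values match the computation in the text.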
Exercise Set 18 :
1. Find the equation of the least-squares line for the ordered pairs in the given table below.
Adult men
2. Use the equation of the least-squares line from item 1 to predict the average speed
of an adult man for each of the following stride lengths. Round your answer to the nearest
tenth of a meter per second.
a. 2.8 m b. 4.8 m
UNIT IV : SUMMARY
The following table summarizes essential concepts in this unit. The references given with
each concept list examples and exercises that can be used to test your
understanding of the concept.
Mean, Median, and Mode: The mean of n numbers is the sum of the numbers divided by n. The
median of a ranked list of n numbers is the middle number if n is odd, or the mean of
the two middle numbers if n is even. The mode of a list of numbers is the number that
occurs most frequently.
See examples on pages 54, 57, and 58.
Weighted Mean: The formula for the weighted mean of the n numbers
x_1, x_2, x_3, ..., x_n is
\[ \text{weighted mean} = \frac{\sum (x \cdot w)}{\sum w} \]
where Σ(x · w) is the sum of the products formed by multiplying each number by its
assigned weight, and Σw is the sum of all the weights.
See example on page 60 and then try exercises on page 61.
Range: The range of a set of data values is the difference between the greatest data
value and the least data value.
See example on page 69 and then try exercises on page 70.
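\[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}, \]
and the variance is
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{n}. \]
If x_1, x_2, x_3, ..., x_n is a sample of n numbers with mean x̄, then the standard
deviation of the sample is
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}, \]
and the variance is
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}. \]
\[ z_x = \frac{x - \mu}{\sigma} \]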
Percentile score of x ,
Least-Squares Line: Bivariate data are data given as ordered pairs. The least-squares
regression line, least-squares line, or regression line for a set of bivariate data
is the line that minimizes the sum of the squares of the vertical deviations from each
data point to the line. The equation of the least-squares line for the n ordered pairs
(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_n, y_n) is
\[ \hat{y} = ax + b, \]
where
\[ a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \quad\text{and}\quad b = \bar{y} - a\bar{x} \]
See examples on page 99.
Linear Correlation Coefficient: The linear correlation coefficient r measures the
strength of a linear relationship between two variables. The closer |r| is to 1, the
stronger the linear relationship is between the variables. For the n ordered pairs
(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_n, y_n), the linear correlation coefficient is
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]
See examples on pages 103 - 105.
UNIT IV TEST :
Class interval f
30 - 34 2
35 - 39 3
40 - 44 6
45 - 49 7
50 - 54 8
55 - 59 7
60 - 64 5
65 - 69 4
70 - 75 2
N = 44
2. The mean weight of newborn infants is 7 pounds and the standard deviation is 0.8
pound. The weights of newborn infants are normally distributed. Find the z-score for a
weight of
a. 9 pounds
b. 7 pounds
c. 6 pounds
3. Shown below are data on the number of years of school, x, completed by ten
randomly selected people and their scores on a test measuring prejudice, y. Higher
scores on prejudice (1 to 10) indicate greater levels of prejudice. Determine the
correlation coefficient between years of education and scores on the prejudice test.
Respondent:              A   B   C   D   E   F   G   H   I   J
Years of education (x): 12   5  14  13   8  10  16  11  12   4
Score on prejudice (y):  1   7   2   3   5   4   1   2   3  10
9 12 10 8 9 12 12 11 14 12
10 8 10 9 12 8 12 15 9 8
13 10 9 9 11 10 11 10
Determine the mean, median, and mode of the data items given.
1. Find the least-squares regression line for the data in the table:
Employees X Y
A 2 8
B 8 10
C 4 11
D 11 13
E 5 9
F 13 17
G 4 8
H 15 14