STATS Studyguide
Department of Statistics
STA1610
Introduction to Statistics
STA1610/1
Table of contents
STUDY UNIT 1
STUDY UNIT 2
STUDY UNIT 3
STUDY UNIT 4
STUDY UNIT 5
5.1 Introduction to this study unit
STUDY UNIT 6
6.1 Introduction to this study unit
STUDY UNIT 7
7.1 Introduction to this study unit
STUDY UNIT 8
8.1 Introduction to this study unit
8.2 Confidence interval estimate for the mean when the population standard deviation is known
8.3 Confidence interval estimate for the mean (population standard deviation is unknown)
STUDY UNIT 9
9.1 Introduction to this study unit
STUDY UNIT 10
10.1 Introduction to this study unit
STUDY UNIT 11
11.1 Introduction to this study unit
Let's GO!
I am sure that there are many questions in your mind, and seeing that you are a distance learner, I
will anticipate some of those questions:
Question 1
How will I benefit by gaining this knowledge?
Statistical skills take on different forms and without realizing it, you are using many of them every day
of your life. Soon, everybody will be forced to have a certain level of numeracy at school level, and
statistics has been introduced at school level as well. Learning concepts by heart has little meaning
in statistics, because it is a subject about perception, insight and the ability to apply knowledge.
Allow logic to direct you through the rules and results.
How will this benefit you? Statistics will enrich you with knowledge relevant in different walks of life,
because it is living knowledge, applicable wherever it fits in. It is definitely not only about collecting
information, called data. We will go beyond data and you will become an explorer, turning information
into wisdom. After completion of this module you should understand more about life, the role of
decision making and the importance of scientific knowledge in governance and control.
Question 2
What is the nature of the statistics in this module?
The authors of this prescribed book explain in the preface that their aim was to present statistics in
an interesting and useful way. In this they succeeded and also in their use of modern technological
advances. You can complete this module even if you do not have access to a computer, but should
you have access, you can familiarize yourself with different statistical software and the additional
information given on the CD-ROM that accompanies each textbook. This module is a service module
for students from different disciplines and with varying background knowledge. It is therefore different
from the more mathematical presentations we offer to students from the College of Science, or
students from the College of Economic and Management Sciences who are majoring in statistics.
This is a stand-alone module, as it may not be a prerequisite for any level module in statistics.
Question 3
How must I go about this module?
Keep in mind that different students have different study methods, so it is not really possible to give
you an indication of the time you will spend preparing for this module. The time you spend studying
is not necessarily correlated to intelligence. The extremely important fact I do want to stress is that
Statistics needs continuous, steady attention! You need time for reflection on the knowledge you
have attained. Please make a study timetable for all the modules for which you are enrolled, taking the
assignment due dates and your personal circumstances into consideration.
Question 4
Can I continue with statistics once I have completed this module?
As said, this is a service module. You cannot present this module for exemption from any other
major statistics module at Unisa. Furthermore, this module cannot form part of a major in statistics.
We trust that this module will spark your interest in statistics, but if you want to major in
statistics, you will have to start again at level one. The reason for this rather depressing statement
lies in the depth of knowledge and the method of presentation in this module, which is too different
from the more mathematical presentation typical of modules forming part of a major in statistics.
Question 5
What happens if I cannot use the CD-ROM?
If you cannot go to a regional office, or you do not have a computer, or your computer cannot read
a CD-ROM, you can still complete this module successfully. You will not be examined on additional
information given on the CD or asked how to construct a specific descriptive measure in one of the
statistical software programs. Of course, the CD is very useful for those of you who have access
to a modern computer as it functions as an additional tutor system. We all seem to understand
better with practical applications, additional data sets, pictures, etc. If you are at all interested in
Statistics and see it as a career benefit, you will realize the importance of computer access. The
modern statistician needs a computer in the same way as the previous generation needed pocket
calculators!
Question 6
What is a wraparound (or textbook guide)?
This is a textbook guide you are reading at this moment. It is a way of talking to you in a manner
similar to that of a lecturer at a residential university talking to his/her students. I know that words
on a piece of paper can never substitute personal contact, but we try to come as close to that as
possible. In this guide I include summaries on certain sections, discuss difficult sections, indicate the
sections needed for examination preparation and then I also give you what we call activities. They
are like worked out examples, but I give you the opportunity to try them yourself before you look at
my solutions.
The process to follow is as follows:
Study the particular section in the textbook (given in a block at the beginning of each study unit).
Read the corresponding section in this textbook guide.
Attempt to answer the questions in the activity relevant to that section. Do not look at my solutions
at the end of each study unit before you have tried really hard to do them yourself.
You may have more questions and if they are serious and you have concerns, contact your lecturer.
You may find that you are able to answer your own questions as the year rolls on!
Chapter 2:
Chapter 3:
Chapter 4: Basic Probability
Chapter 5:
Chapter 6:
Chapter 7:
Chapter 8:
Chapter 9:
Chapter 11: Chi-Square Tests
Chapter 12:
You cannot do this module without the prescribed book. Also make sure that you buy the correct
edition. You should nurse all your prescribed books, even after completion of the different modules.
They will become precious references in your current or future career. Note that at the time of the
development of this module, your assignments and examination paper will all contain only multiple
choice questions. This is due to the large number of students enrolled for this module. The majority
of the questions in the activities given in this wraparound will also be multiple choice. We found that
some basic principles can be explained in more detail in a standard question.
The contents of this module have been divided into the following 11 study units:
Study material
Your study material consists of
a prescribed book: Business Statistics: A First Course, 5th ed.,
by DM Levine, TC Krehbiel and ML Berenson
Please make sure that you receive the study guide and tutorial letter 101 during registration. Once
you have bought the prescribed book, you will be ready to start with your new and exciting learning.
If you have access to the internet, log onto myUnisa and join the discussion forums for the different
modules you are enrolled for. Being a distance learner can lead to isolation, so get connected or
meet regularly with a peer group who is also registered for this module. Look around you and see if
you can find statistical information in your community, involve your parents, friends, etc.
STUDY UNIT 1
STUDY CHAPTER 1
The information in the table below will become clearer as we work through the later
chapters. If you do not know what is meant by measures of location or spread, use this table for
further reference.
Measure                      Sample                   Population
                             (set of observations)    (all possible observations)
                             Statistic                Parameter
Measures of location
  Average                    Sample mean              Population mean
  Middle element             Sample median            Population median
  Most frequent element      Sample mode              Population mode
Measures of spread
  Range                      Sample range             Population range
  Standard deviation (SD)    Sample SD                Population SD
This brings us to variables and the difference between qualitative and quantitative variables.
Characterization of data is the starting point of any statistical analysis, so, know your data! I would
like to train you to evaluate published statistical analyses in order to decide if results are really
trustworthy. This process starts with the data type and the corresponding correct form of analysis.
Qualitative variable (several categories)
  Data only as frequencies (count elements in categories).
  Frequencies can also be expressed as percentages.
Quantitative variable
  Discrete: only specific values; you counted.
    Data generated by counts of elements.
  Continuous: any value within an interval; you measured.
    Data generated by measurements of some aspect of the elements.
Scales of measurement
Nominal: Categories or labels. If numbers are used they have no numerical meaning.
Ordinal: Preferences are ordered. Numbers are ranked but ranks do not represent specific measurements.
Interval: Numerical labels indicate order and distance. Unit of measurement exists but no absolute zero.
Ratio: Absolute zero present and multiples have meaning.
Activity 1.1
Question 1
Which of the following statements about the variable type is incorrect?
1. Whether or not you own a Panasonic television set is a qualitative variable.
2. Your status as either a full-time or part-time student is a quantitative variable.
3. The number of people you know who attended the graduation last year is a quantitative, discrete
variable.
4. The price of your most recent haircut is a quantitative, discrete variable.
5. Cyril's travel time from his home to the examination centre is a quantitative, continuous variable.
Question 2
Which of the following quantitative variables is not continuous (i.e. it is discrete)?
1. Your weight
2. The circumference of your head (in centimetres)
3. The time it takes Jerome to walk from his home to the taxi pickup point
4. The length of your forearm from elbow to wrist
5. The number of coins in your pocket
...................................................................................................
STUDY UNIT 2
STUDY CHAPTER 2
Even non-
manipulation would be to arrange data from small to large, or in alphabetical order, or...
For categorical data you can draw up a summary table and use the bar chart, pie chart and pareto
chart to display the data.
stem up into two or more lines. Look again at the printout and you will see that the stems have been
doubled, i.e. there are two 1s, two 2s, etc. Because there were too many numbers per stem, they
split each stem into two parts. For example, for the stem 1 all the numbers from 10 to 14 are written
with the first 1 and all numbers from 15 to 19 are with the second 1. An outlier is a data value very
distant from most of the others.
Of course, if the interest is in the form of the data (symmetric, skewed, ...), repeating stems may not
be used in the stem-and-leaf display.
Activity 2.1
Question 1
The following stem-and-leaf plot gives the ages of people living in block A of a retirement village:
5 | 5
6 | 04
6 | 57
7 | 1112
7 | 5799
8 | 01334
8 | 666789
9 | 0124
9 | 78
Question 2
The following stem-and-leaf display is for a set of values where the stem is formed by the units and
the leaf represents the decimal digits.
3 | 0167
4 | 333
5 | 258
6 | 99
7 | 4
Three methods to help portray a frequency distribution are the histogram, the percentage polygon
and the cumulative percentage polygon.
...................................................................................................
Activity 2.2
Question 1
The following comments refer to histograms. Identify the incorrect statement.
1. Histograms graphically display class intervals as well as class frequencies.
2. Histograms are appropriate for qualitative data.
3. Where stem-and-leaf displays are ideal for small data sets, large data sets are better presented
in histograms.
4. Histograms are good tools for judging the shape of a dataset, provided the sample size is relatively
large.
5. Only estimates of the centre, variability and outliers of a dataset can be determined from a
histogram.
Question 2
The following comments refer to different graphical presentations. Which statement is incorrect?
1. Adjacent rectangles in the histogram share a common side, while those in the bar chart have a
gap between them.
2. A pie chart is a circular display divided into sections based on the number of observations within
the segments.
3. With a stem-and-leaf plot the intervals for the stems are restricted in length, but this is not true for
a histogram.
4. The histogram as well as the bar chart, represents frequencies according to the relative lengths
of a set of rectangles.
5. Boxplots give a direct look at centre, variability, outliers and shape of a dataset.
...................................................................................................
Feedback on Activities
Activity 2.1
Question 1
Option 1
1. Incorrect. There are 30 data points. (We do not count the stems.) You can simply count the digits
to the right of the vertical bar.
2. Correct. The first stem of 5 is for persons in the fifties and the leaf is a five, making the age 55.
3. Correct. With stem 7 there are eight values, namely 71, 71, 71, 72, 75, 77, 79 and 79.
4. Correct. 55 (the smallest number) is not markedly lower than the second lowest value of
60. Furthermore, 98 (the highest value) is very close to 97 and is also not an outlier.
5. Correct. Half of the people are indeed younger than 83. There are 30 people, so half of
them would imply 15 persons. If you look at the ages while counting these ordered ages from
55, 60, 64, 65, ...79, 79, 80, 81, the value 81 belongs to the 15th person.
Question 2
Option 3 is incorrect.
3 | 0167
4 | 333
5 | 258
6 | 99
7 | 4
1. Correct. The number with the highest frequency is 4.3 because there are three data points with
the value 4.3. (Highest frequency means the one that occurs the most.)
2. Correct. Count the leaves and you will have 13 values. If you have a problem, look at the listed
numbers in the real answer to 3. below and count those.
3. Incorrect. The list given below did not include the values that occur more than once. The original
data are: 3.0, 3.1, 3.6, 3.7, 4.3, 5.2, 5.5, 5.8, 6.9 and 7.4. The values 4.3, 4.3 and 6.9 should also
have been in the list and then there are 13 datapoints.
4. Correct.
5. Correct. In a stem-and-leaf plot the values are arranged in ascending order. If you have 13 items
arranged in order, then item number 7 is in the middle position with 6 on either side of it. Start
with 3.0, 3.1, 3.6, ... and in position 7 you will find the second 4.3.
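The reasoning in this feedback can be checked with a short Python sketch. It expands the stem-and-leaf display of Question 2 into its 13 values and then counts them and finds the mode and the median. The dictionary name `stem_and_leaf` and the other variable names are illustrative choices only; they do not come from the prescribed book.

```python
from statistics import median, mode

# Stem-and-leaf display from Activity 2.1, Question 2:
# the stem is the units digit, each leaf is one decimal digit.
stem_and_leaf = {3: "0167", 4: "333", 5: "258", 6: "99", 7: "4"}

# Rebuild every data value: stem 4 with leaf 3 becomes 4.3, and so on.
values = sorted(
    (stem * 10 + int(leaf)) / 10
    for stem, leaves in stem_and_leaf.items()
    for leaf in leaves
)

number_of_values = len(values)   # count the leaves, not the stems
most_frequent = mode(values)     # value with the highest frequency
middle_value = median(values)    # 7th of the 13 ordered values

print(values)
print(number_of_values, most_frequent, middle_value)
```

Counting leaves rather than stems is the point of statement 2: there are only five stems but thirteen data values.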
Activity 2.2
Question 1
Option 2
Statement 2 is the only statement which is incorrect as histograms are appropriate for quantitative
data.
Question 2
Option 5
Boxplots give a direct look at centre, variability and outliers but not shape.
...................................................................................................
Activity 2.3
Question 1
Option 2
Two variables have a positive association when the values of one variable tend to increase as the
values of the other variable increase.
Question 2
Option 3
All other statements are incorrect.
...................................................................................................
STUDY UNIT 3
STUDY CHAPTER 3
In the discussion on the arithmetic mean you are introduced to mathematical notations, such as
$\mu$, $\Sigma$, $x_i$, $\bar{x}$... These symbols are like little pictures and you should read them in that way. If
you were in a lecture hall, the lecturer would not say "mew" for $\mu$, but he/she would say "population
mean". Let me give you the words for the symbols as I trust this will help you a lot and has a double
purpose as it helps you to learn.
Symbol          Pronunciation       Read as
$\mu$           mew (like a cat)    population mean
$\Sigma$        sigma               the sum of
$x_i$           ex eye              all x-values
$\sum x_i$      sigma ex eye        the sum of all x-values
$\bar{x}$       ex bar              sample mean
You can now imagine the lecturer saying the following sentences:
For $\mu = \frac{\sum x_i}{N}$: The population mean is equal to the sum of all the data
values in the population, divided by how many they were.
For $\bar{x} = \frac{\sum x_i}{n}$: The sample mean is equal to the sum of all the data
values in the sample, divided by how many they were.
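Note
Be careful not to write $\mu = \frac{\sum x_i}{n}$ for $\bar{x} = \frac{\sum x_i}{n}$.
What is wrong?
$\mu$ refers to the population and you cannot divide by the number of values $n$ in the sample if you are
calculating the population mean $\mu = \frac{\sum x_i}{N}$.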
Note
It will help if you take note of the following:
Greek letters are used for the population parameters, e.g. $\mu$.
Standard alphabet letters are used for the sample statistics, e.g. $\bar{x}$.
The relationship between the mean, median and mode is determined by the shape of the distribution,
which can be symmetric, positively skewed or negatively skewed.
Activity 3.1
Question 1
Read the following statements. The incorrect statement is:
1. The mean is one of the most frequently used measures of central tendency.
2. When the mean is greater than the mode, we say it is negatively skewed.
3. When the mean is greater than the median, we say it is positively skewed.
4. When a distribution is bimodal, it will be impossible for the mean, median, and both modes to be
equal.
5. The measure most affected by extreme values is the mean.
Question 2
Certain measures have been calculated for the following small sample data set
15  17  23  11  20  45  13  9  9
The range is a concept that is easy to understand and very few students have problems with it.
The reference to quartiles is much more complex, as the divisions of the data into equal groups lead
to very important features of the dataset. The interquartile range plays a very important role in the
analysis of statistical data. The best application of the quartiles is found in a boxplot. Make sure
that you understand and are able to interpret a given boxplot.
The variance and standard deviation are but one mathematical calculation apart, but in general
people prefer to refer to standard deviation and not to variance. You need to know how to calculate
these measures as well as understand their meaning for when you come across them in different
walks of life.
As we move through the prescribed book, you will find that the number of symbols you have to
recognize increases. Variance and standard deviation have their own symbols and again there are
clear distinctions between the sample and the population measures. Do you want another summary
and sentences?
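Symbol        Pronunciation                   Read as
$\sigma$      sigma (same as for $\Sigma$)    population standard deviation
$\sigma^2$    sigma squared                   population variance
$s^2$                                         sample variance

$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$: population variance. Note division by $N$.
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$: sample variance. Note division by $(n - 1)$.
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$: population standard deviation.
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$: sample standard deviation.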
Note
A reminder that
Greek letters are used for the population parameters, e.g. $\sigma$ and $\sigma^2$.
standard alphabet letters are used for the sample statistics, e.g. $s$ and $s^2$.
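The difference between dividing by N (population) and by n - 1 (sample) is easiest to see numerically. The sketch below uses Python's standard `statistics` module; the list `values` is a made-up data set chosen for easy arithmetic and is not taken from the prescribed book.

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical data set (illustration only): mean is 5,
# sum of squared deviations from the mean is 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Treat the values as the whole population: divide by N = 8.
population_variance = pvariance(values)   # 32 / 8
population_sd = pstdev(values)            # square root of the variance

# Treat the same values as a sample: divide by n - 1 = 7.
sample_variance = variance(values)        # 32 / 7
sample_sd = stdev(values)

print(population_variance, population_sd)
print(sample_variance, sample_sd)
```

For the same numbers the sample measures are always a little larger than the population measures, because the divisor n - 1 is smaller than N.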
Make sure that you will be able to draw and interpret a boxplot.
Chebyshev's theorem and the Empirical Rule are described for the interpretation and uses of the
standard deviation.
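As a rough illustration of how the two results differ, the sketch below compares them for intervals of k standard deviations around the mean. It assumes the usual statements of the results: Chebyshev's theorem guarantees at least 1 - 1/k^2 of the values for any distribution (k greater than 1), while the Empirical Rule gives approximately 68%, 95% and 99.7% for bell-shaped distributions. The function and variable names are illustrative only.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is only informative for k > 1")
    return 1 - 1 / k**2


# Approximate proportions under the Empirical Rule (bell-shaped data only).
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}

for k in (2, 3):
    print(
        f"within {k} SD: at least {chebyshev_lower_bound(k):.3f} (Chebyshev), "
        f"about {EMPIRICAL_RULE[k]} (Empirical Rule)"
    )
```

Chebyshev's bound is lower because it has to hold for every possible shape of distribution, not only for bell-shaped ones.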
Activity 3.2
Question 1
Given below are the summary statistics for data described as fastest ever driven.
Suppose the speed is given in kilometres per hour, then:
             Males            Females
             (87 students)    (102 students)
Median       110              89
Quartiles    95, 120          80, 95
Extremes     55, 150          30, 130
Determine
1.
2.
3.
4.
5.
6.
7.
Activity 3.3
Question 1
Identify the correct statement:
1. If the coefficient of correlation r is positive, the dependent variable y and the independent variable
x are said to be inversely related.
2. No indication of the value of the sample coefficient of correlation can be determined from a scatter
plot.
3. If the coefficient of correlation r = 1, then the best-fit linear equation will include all of the data
points.
4. The coefficient of correlation r is a number that only indicates the direction of the relationship
between the dependent variable y and the independent variable x.
5. If the coefficient of correlation = 0, then there is a linear relationship between the dependent
variable y and the independent variable x.
...................................................................................................
Feedback to activities
Activity 3.1
Question 1
Option 2
When the mean is greater than the mode, we say it is positively skewed.
Question 2
Option 5
1. Correct.
The mean of the given data is 18 and the mode is 9 (there are two 9s), and we all know that 18 is
double 9.
2. Correct.
The value nearest to 45 is 23 and the data consist in general mostly of much smaller numbers, so
45 can be considered an outlier.
3. Correct.
We have already discussed the mode and saw that it is 9.
4. Correct.
To determine the median the data must be ordered (from small to large or vice versa):
9, 9, 11, 13, 15, 17, 20, 23, 45
If there are nine values, the middle one is in position five (four values on each side). In this position
we have the 15.
5. Incorrect.
Remove 45 and the total is 117, which must be divided by 8. The answer should have been 14.625
(117/8). If you are interested, the incorrect answer given was calculated by dividing 117 by 9 instead
of 8.
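If you have access to a computer, the values quoted in this feedback can be verified with a few lines of Python. The sketch uses the ordered data listed under statement 4; the variable names are illustrative only.

```python
from statistics import mean, median, mode

# Ordered data from the feedback to Activity 3.1, Question 2.
data = [9, 9, 11, 13, 15, 17, 20, 23, 45]

sample_mean = mean(data)       # 162 divided by 9
sample_median = median(data)   # the 5th of the 9 ordered values
sample_mode = mode(data)       # the value that occurs most often

# Recalculate the mean after removing the outlier 45:
# the total drops to 117 and only 8 values remain.
without_outlier = [x for x in data if x != 45]
mean_without_outlier = mean(without_outlier)

print(sample_mean, sample_median, sample_mode, mean_without_outlier)
```

Note how strongly the single value 45 pulls the mean upwards, while the median and mode are unaffected by it.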
...................................................................................................
Activity 3.2
Question 1
             Males            Females
             (87 students)    (102 students)
Median       110              89
Quartiles    95, 120          80, 95
Extremes     55, 150          30, 130
1. The fastest speed driven by anyone in the group is 150 km/h. Male and female top speeds are
indicated in the table as extremes. (Maximum for females is 130 km/h.)
2. The slowest speed driven by a male is 55 km/h.
3. The cut-off speed indicating that 25% of the men drove at that speed or faster, implies the value
of the upper quartile for males, which is 120 km/h.
4. To find the proportion of females who had driven 89 km/h or faster, you have to notice that 89 is the
value of the female median. The median divides the data into two equal parts, so the proportion
of the data above 89 is 50% (expressed as a percentage) or 0.5 expressed as a fraction.
5. The number of females who had driven 89 km/h or faster is (as said) half of the number of women.
If there are 102 female students, half of them will be 51.
6. Use the table to interpret the differences between male and female drivers. Some of the obvious
differences are the following:
(a) The median speed for males is higher than that for females.
(b) The highest speed was recorded by a male.
(c) The lowest speed was recorded by a female (which is what can be expected if the median values
differ the way they do).
(d) The upper quartile of females is the same as the lower quartile of the males (speed 95).
7. Range.
Males: (150 - 55) = 95
Females: (130 - 30) = 100
8. Interquartile range for male and female are respectively
(120 - 95) = 25 and (95 - 80) = 15.
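The range and interquartile range calculations above follow directly from the five-number summaries in the table. A minimal Python sketch of the same arithmetic is given below; the dictionary layout and the function name `spread` are illustrative choices, not part of the prescribed book.

```python
# Five-number summaries (km/h) taken from the table in Activity 3.2.
summaries = {
    "Males": {"min": 55, "q1": 95, "median": 110, "q3": 120, "max": 150},
    "Females": {"min": 30, "q1": 80, "median": 89, "q3": 95, "max": 130},
}


def spread(summary):
    """Return (range, interquartile range) for one five-number summary."""
    data_range = summary["max"] - summary["min"]
    interquartile_range = summary["q3"] - summary["q1"]
    return data_range, interquartile_range


for group, summary in summaries.items():
    data_range, iqr = spread(summary)
    print(f"{group}: range = {data_range}, interquartile range = {iqr}")
```

The females have the larger range but the smaller interquartile range, which shows that the two measures can tell different stories about the spread of the same data.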
9.
Question 2
Which of the following statements is correct?
1. Incorrect.
The range is found by taking the difference between the high and low values - not divided by 2.
2. Incorrect.
The interquartile range is found by taking the difference between the 1st and 3rd quartiles - not
divided by 2.
3. Incorrect.
The mean is a measure of central tendency.
4. Correct.
The standard deviation is expressed in terms of the original units of measurement, but the
variance is not.
5. Incorrect.
The median is a measure of central tendency.
...................................................................................................
Question 3
a. The median is approximately 37.5 defects per day. The first quartile is approximately 37 defects
per day. The third quartile is approximately 39 defects per day.
b. The asterisks at the right are outliers, indicating two days on which unusually large numbers of
defects were produced. The production supervisor should try to determine if anything out of the
ordinary was happening at the plant on those days.
c. The distribution is positively skewed. Look at the position of the median and you will understand
this answer.
Activity 3.3
Question 1
Option 3
The corrected statements are:
1. If the coefficient of correlation r is positive, the dependent variable y and the independent variable
x are said to be directly related.
2. If the coefficient of determination is 0.81, the coefficient of correlation r can be 0.90 or -0.90.
3. Statement is correct.
4. The coefficient of correlation r is a number that indicates the direction as well as the strength of
the relationship between the dependent variable y and the independent variable x.
5. If the coefficient of correlation = 0, then there is no linear relationship whatsoever between the
dependent variable y and the independent variable x.
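The corrected statements can be illustrated with a small Python sketch that computes the coefficient of correlation from its usual definition (sum of cross-products divided by the square root of the product of the sums of squares). The data lists are made up for illustration, and the function name `pearson_r` is an assumption of this sketch.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample coefficient of correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: every point lies exactly on a straight line.
x = [1, 2, 3, 4, 5]
y_increasing = [2, 4, 6, 8, 10]   # y rises as x rises
y_decreasing = [10, 8, 6, 4, 2]   # y falls as x rises

r_up = pearson_r(x, y_increasing)
r_down = pearson_r(x, y_decreasing)

# A coefficient of determination of 0.81 allows two values of r.
r_magnitude = sqrt(0.81)

print(r_up, r_down, r_magnitude)
```

The sign of r gives the direction of the relationship and its size gives the strength, which is why a coefficient of determination alone cannot tell you whether r is positive or negative.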
...................................................................................................
evaluate the meaning of dispersion as conveyed by the range, the quantiles, MAD and
variance/standard deviation
say if values in a given data set are outliers or not with special reference to a box-and-whisker plot
use the standard deviation and mean to determine the coefficient of variation for both sample and
population
explore different measures of association to determine the direction and strength of relationships
STUDY UNIT 4
STUDY CHAPTER 4
Activity 4.1: Overview
Study skill
Draw a mind-map of the different sections/headings you will deal with in this study session. Then
page through the unit with the purpose of completing the map.
...................................................................................................
Probability Rules
Complementary events A and A^C:
  P(A^C) = 1 - P(A)
Conditional probability:
  P(A/B) = P(A and B) / P(B)
Mutually exclusive events A and B:
  P(A and B) = 0
Independent events A and B:
  P(A/B) = P(A)
Union (A or B), Addition Rule:
  P(A or B) = P(A) + P(B) - P(A and B)
Intersection (A and B), joint probability, Multiplication Rule:
  If A and B are INDEPENDENT then P(A and B) = P(A) × P(B)
Description
The probability of any event A is then obtained by summing the probabilities assigned to the simple
events contained in A.
How do I know whether I should combine two events A and B using "and" or "or"?
Solution:
The key here is to fully understand the meaning of the combined statement.
P (A and B) = probability that A and B both occur while P (A or B) = probability that A or B or both
occur. Sometimes it will be necessary to reword the statement of a given event so that it conforms
with one of the two expressions given above.
For example, suppose your friend Rajab is about to write two exams and you define the events as
follows:
A: Rajab will pass the statistics exam.
B: Rajab will pass the accounting exam.
The event "Rajab will pass at least one of the two exams" can be reworded as "Rajab will either pass
the statistics exam or he will pass the accounting exam, or he will pass both exams". This new event
can therefore be denoted (A or B).
On the other hand, the event "Rajab will not fail either exam" is the same as "Rajab will pass both his
statistics exam and his accounting exam". This event can therefore be denoted (A and B).
Example 4.1
An investor has asked his stockbroker to rate three stocks (A, B , and C ) and list them in the order in
which she would recommend them. Consider the following events:
L: Stock A doesn't receive the lowest rating.
M: Stock B doesn't receive the lowest rating.
N : Stock C receives the highest rating.
(i) Define the random experiment and list the simple events in the sample space.
(ii) List the simple events in each of the events L, M , and N .
(iii) List the simple events belonging to each of the following events:
L or N , L and M , and M .
Solution:
(i) The random experiment consists of observing the order in which the stockbroker recommends
the three stocks. The sample space consists of the set of all possible orderings:
S = {ABC, ACB, BAC, BCA, CAB, CBA}
(ii) L = {ABC, ACB, BAC, CAB}; M = {ABC, BAC, BCA, CBA}; N = {CAB, CBA}
(iii) The event (L or N) consists of all simple events in L or N or both; (L or N) =
{ABC, ACB, BAC, CAB, CBA}. The event (L and M) consists of all simple events in both L
and M; (L and M) = {ABC, BAC}.
(iv) No, there is not a pair of mutually exclusive events among L, M , and N , since each pair of events
has at least one simple event in common.
(L and M ) = {ABC, BAC}
(L and N ) = {CAB}
(M and N ) = {CBA}
(v) Yes, L and M are an exhaustive pair of events, since every simple event in the sample space is
contained either in L or M , or both. That is, (L or M ) = S
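The sample space and events of Example 4.1 are small enough to enumerate by computer, which is a useful way to check your own lists. The sketch below assumes that each ordering is written from the highest rating to the lowest (so "CAB" means C is rated highest and B lowest), which matches the event lists in the solution. The variable names are illustrative.

```python
from itertools import permutations

# Sample space: every possible ordering of the three stocks,
# written from highest rating (first letter) to lowest (last letter).
S = {"".join(order) for order in permutations("ABC")}

L = {s for s in S if s[-1] != "A"}   # stock A is not rated lowest
M = {s for s in S if s[-1] != "B"}   # stock B is not rated lowest
N = {s for s in S if s[0] == "C"}    # stock C is rated highest

L_or_N = L | N      # union: simple events in L or N or both
L_and_M = L & M     # intersection: simple events in both L and M
L_and_N = L & N
M_and_N = M & N

# Mutually exclusive pairs would have an empty intersection.
no_exclusive_pair = all(pair for pair in (L_and_M, L_and_N, M_and_N))

# L and M are exhaustive if together they cover the sample space.
exhaustive = (L | M) == S

print(sorted(L_or_N), sorted(L_and_M), no_exclusive_pair, exhaustive)
```

In set language "or" corresponds to a union and "and" to an intersection, which is exactly how the two words are used in the probability rules.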
Number         1      4      5
Probability    1/5    4/5    5/5 or 1
Before we calculate the probabilities, let us first discuss the meanings of the words.
At least two: This means that two is the minimum value and if we say at least two children, it means
two or three or four or ... children.
P(X ≥ 2) = P(x = 2) + P(x = 3) + P(x = 4) + ...
At most two: This means that two is the maximum value. At most two children means no child or
one child or two children.
P(X ≤ 2) = P(x = 0) + P(x = 1) + P(x = 2)
No more than two: This means that two is the maximum number, that is two or one or zero children.
P(X ≤ 2) = P(x = 0) + P(x = 1) + P(x = 2)
Less than two: This means that two is not included and we are only interested in the values smaller
than two, that is zero or one.
P(X < 2) = P(x = 0) + P(x = 1)
More than two: This means that two is not included and we are only interested in the values larger
than two, that is three, four, five, etc.
P(X > 2) = P(x = 3) + P(x = 4) + P(x = 5) + ...
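The wording rules above translate directly into sums over a probability distribution. The sketch below uses a made-up distribution for the number of children in a family (the probabilities are hypothetical and chosen only so that they add up to 1); the helper name `prob` is also an assumption of this sketch.

```python
from fractions import Fraction

# Hypothetical distribution of X = number of children (illustration only).
distribution = {
    0: Fraction(1, 10),
    1: Fraction(2, 10),
    2: Fraction(4, 10),
    3: Fraction(2, 10),
    4: Fraction(1, 10),
}


def prob(condition):
    """Add the probabilities of all values of x that satisfy the condition."""
    return sum(p for x, p in distribution.items() if condition(x))


at_least_two = prob(lambda x: x >= 2)    # two is included
at_most_two = prob(lambda x: x <= 2)     # same as "no more than two"
less_than_two = prob(lambda x: x < 2)    # two is excluded
more_than_two = prob(lambda x: x > 2)    # two is excluded

print(at_least_two, at_most_two, less_than_two, more_than_two)
```

Notice that "at least two" and "less than two" are complementary events, so their probabilities add up to 1.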
Example 4.2
Consider the following table in which wild azaleas were classified by colour and by the presence and
absence of fragrance.
             Fragrance
Colour       Yes     No     Total
White        12      50     62
Pink         60      10     70
Orange       58      10     68
Total        130     70     200
If an azalea is randomly selected from the group, which one of the following probabilities is
incorrect?
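1. P(a fragrance) = 130/200
2. P(Color is orange) = 68/200
3. P(orange and has a fragrance) = 58/200
4. P(orange given that it has a fragrance) = 58/130
5. P(has a fragrance given that it is orange) = 58/130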
Solution:
Option (1): Correct
Option (2): Correct
Option (3): Correct
Option (4): Correct
Option (5): Incorrect
P(has a fragrance given that it is orange) = 58/68
Example 4.3
Consider the table of wild azaleas (example 4.2). If event A denotes "the flower is orange", and event
B "it has a fragrance", then: P(A or B) = 68/200 + 130/200 - 58/200 = 140/200
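The probabilities in Examples 4.2 and 4.3 can all be read off the azalea table, and a short Python sketch makes the pattern explicit. The counts come from the table in Example 4.2; the dictionary layout and the helper name `p` are illustrative choices only.

```python
from fractions import Fraction

# Counts from the wild azalea table: (colour, fragrance) -> number of plants.
counts = {
    ("White", "Yes"): 12, ("White", "No"): 50,
    ("Pink", "Yes"): 60, ("Pink", "No"): 10,
    ("Orange", "Yes"): 58, ("Orange", "No"): 10,
}
total = sum(counts.values())   # 200 plants in the group


def p(colour=None, fragrance=None):
    """Probability of the cells matching the given colour and/or fragrance."""
    matching = sum(
        n for (c, f), n in counts.items()
        if (colour is None or c == colour)
        and (fragrance is None or f == fragrance)
    )
    return Fraction(matching, total)


p_orange = p(colour="Orange")               # marginal probability
p_fragrance = p(fragrance="Yes")            # marginal probability
p_orange_and_fragrance = p("Orange", "Yes")  # joint probability

# Conditional probability: fragrance given that the flower is orange.
p_fragrance_given_orange = p_orange_and_fragrance / p_orange

# Addition rule: orange or fragrant (or both).
p_orange_or_fragrance = p_orange + p_fragrance - p_orange_and_fragrance

print(p_fragrance_given_orange, p_orange_or_fragrance)
```

The joint probability is subtracted in the addition rule because the 58 orange, fragrant plants would otherwise be counted twice. Python prints the fractions in lowest terms, so 58/68 appears as 29/34 and 140/200 as 7/10.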
B:9%
AB: 4%
0:46%
What is the
Multiplication Rule
The multiplication rule finds the probability that events A and B both occur.
Two events are independent if one may occur irrespective of the other. For example, events A, "the patient has
tennis elbow", and B, "the patient has appendicitis", are intuitively independent.
Formula: P(A and B) = P(A) × P(B)
For many independent events A1, A2, ..., An, the multiplication rule can be written as:
P(A1 and A2 and ... and An) = P(A1) × P(A2) × ... × P(An)
The probability that a certain plant will flower during the first summer is 0.6. If five plants are planted,
calculate the probability that all of them will have flowers during the first summer.
Assuming the plants flower independently of one another, the probability is:
0.6 × 0.6 × 0.6 × 0.6 × 0.6 = 0.07776 ≈ 0.078.
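The multiplication rule for several independent events is simply a product, which the following Python sketch reproduces for the five plants. It assumes, as the example does, that the plants flower independently of one another; the variable names are illustrative.

```python
from math import prod

p_flower = 0.6          # probability that one plant flowers in the first summer
number_of_plants = 5

# Multiplication rule for independent events:
# multiply the individual probabilities together.
p_all_flower = prod([p_flower] * number_of_plants)

print(round(p_all_flower, 3))
```

The answer is small even though each plant is more likely than not to flower, because all five events must happen together.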
Computation of objective probabilities (Section 4.1 and 4.2 of Levine textbook).
Objective probabilities can be classified into 3 categories. The categories are:
Marginal probability
Joint probability
Conditional probability
72/150 = 0.48
Joint probability
A joint probability is the probability of both event A and event B occurring simultaneously on a given
trial of a random experiment. A joint event describes the behaviour of two or more random variables
(i.e. the characteristics of interest) simultaneously. It is written as:
P (A and B)
Example 4.7
Table 4.2
                                    Industry
                          Mining  Finance  Service  Retail  Total
Large (50 and above)        35      42        1        6      84
Total                       35      72       10       33     150

9/150 = 0.06
Conditional probability
Conditional probability is the probability of one event A occurring given information about the
occurrence of another event B.
A conditional event describes the behaviour of one random variable in the light of known additional
information about a second random variable.
Conditional probability is defined as:
P(A/B) = P(A and B) / P(B)
The essential feature of the conditional probability is that the sample space is reduced to the
outcomes describing event B (the given prior event) only, and not all possible outcomes as for
marginal and joint probabilities.
Example 4.8
Let A = event (large company)
Let B = event (retail company)
Then P(A/B) is the probability of selecting a company from the JSE sample which is large, given that
the company is known to be a retail company.

Retail
Small     14
Medium    13
Large      6
Total     33

There are 6 large companies out of 33 retail companies (refer to table 4.2).
Thus P(A/B) = 6/33 = 0.1818
The same answer follows from the definition of conditional probability:
P(B) = 33/150 (a marginal probability)
P(A and B) = 6/150 (a joint probability)
Then P(A/B) = (6/150) / (33/150) = 6/33
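As a quick check of Example 4.8, the sketch below applies the definition of conditional probability to the counts quoted from table 4.2. The variable names are illustrative; only the three counts come from the text.

```python
from fractions import Fraction

n_total = 150        # companies in the JSE sample
n_retail = 33        # retail companies (event B)
n_large_retail = 6   # companies that are both large and retail (A and B)

p_b = Fraction(n_retail, n_total)              # marginal probability P(B)
p_a_and_b = Fraction(n_large_retail, n_total)  # joint probability P(A and B)

# Definition: P(A/B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b, round(float(p_a_given_b), 4))
```

The sample size of 150 cancels out, which is why the reduced sample space of 33 retail companies gives the same answer directly.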
A conditional probability, denoted P (A/B) is the probability that an event A will occur given that we
know that an event B has already occurred.
The key to recognizing a conditional probability is to look for the words "given that" or their equivalent.
For example, the statement of a conditional probability might read "The probability that A will occur
when B occurs" or "The probability that A will occur if B occurs". In each of these cases, you can
reword the statement using "given that" instead of "when" or "if". Therefore, both of these statements
refer to conditional probabilities.
Probability Rules
This section outlines three rules of probability that allow you to calculate the probabilities of three
special events [A', (A or B), and (A and B)] from known probabilities of various related events. The
three rules are as follows:
1. Complement Rule: P(A') = 1 − P(A)
2. Addition Rule: P(A or B) = P(A) + P(B) − P(A and B)
3. Multiplication Rule: P(A and B) = P(A) × P(B/A)
If A and B are independent events, then the multiplication rule becomes: P(A and B) = P(A) × P(B)
Activity 4.1: Self-assessment exercises
Application skills
Question 1
A soft drink company holds a contest in which a prize may be revealed on the inside of the bottle
cap. The probability that each bottle cap reveals a prize is 0.1 and winning is independent from one
bottle to the next. What is the probability that a customer wins a prize when opening his third bottle?
1. (0.1)(0.1)(0.9) = 0.009
2. (0.9)(0.9)(0.1) = 0.081
3. (0.9)(0.9) = 0.81
4. 1 − (0.1)(0.1)(0.9) = 0.991
5. (0.9)(0.9)(0.9) = 0.729
Question 2
Suppose two people each have to select a number from 00 to 99 (therefore 100 possible choices).
The probability that they both pick the number 13 is
1. 2/100
2. 1/100
3. 1/200
4. 1/10 000
5. 2/10 000
Question 3
Use the same information as in question 2. The probability that both persons pick the same number
is equal to
1. 2/100
2. 1/100
3. 1/200
4. 1/10 000
5. 2/10 000
Question 4
For three mutually exclusive events the probabilities are as follows:
P (A) = 0.2, P (B) = 0.7 and P (A or B or C) = 1.0. The value of P (A or C) is equal to
1. 0.3
2. 0.5
3. 0.9
4. 0.6
5. 0.1
...................................................................................................
Feedback to Activity 4.1:
Application skills
Question 1
Option 2.
The first two bottles must definitely not reveal a prize on the inside of the bottle top. The probability
of not winning with one bottle is 1 − 0.1 = 0.9. Probability not to win with 2 bottles is (0.9)(0.9) = 0.81.
The third bottle must reveal a prize, so the required probability is (0.9)(0.9)(0.1) = 0.081.
...................................................................................................
Question 2
Option 4.
The probability that the first person selects 13 is 1/100. The choice of the second person
is independent from the first and the probability to select 13 for the second person is also 1/100. To
have the probability of two independent events use the rule
P(A and B) = P(A) × P(B) = 1/100 × 1/100 = 1/10 000.
Question 3
Option 2.
In the previous question we found the probability that both selected one specific number (13) was
1/10 000. For this question any doubling of numbers is considered for the probability. There can be
two zeros, two ones, two twos, ..., two ninety-nines and if you count these possibilities there are 100
such double combinations.
P(select the same number) is 100 × 1/10 000 = 1/100.
Question 4
Option 1
Given that P (A or B or C) = 1.0 plus the fact that these events are mutually exclusive.
Therefore P (A) + P (B) + P (C) = 1.0.
Filling in the given probabilities we get 0.2 + 0.7 + P (C) = 1.0. Solve this for P (C), then P (C) = 0.1.
Therefore, P (A or C) = P (A) + P (C) = 0.2 + 0.1 = 0.3
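The answers to Questions 2 and 3 can also be confirmed by brute force. The sketch below lists all 100 × 100 equally likely pairs of choices and counts the favourable ones; it assumes each person picks uniformly and independently, and the names are illustrative.

```python
from fractions import Fraction
from itertools import product

choices = range(100)                       # the numbers 00 to 99
pairs = list(product(choices, repeat=2))   # all (first person, second person) pairs

# Question 2: both pick the number 13
p_both_13 = Fraction(sum(1 for a, b in pairs if a == b == 13), len(pairs))

# Question 3: both pick the same number, whatever it is
p_same = Fraction(sum(1 for a, b in pairs if a == b), len(pairs))

print(p_both_13, p_same)
```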
Tree diagrams
The following contingency table shows sampled data for four regions in South Africa in which people
live and three types of music the people listen to. We are going to use this table to illustrate different
types of probability.
...................................................................................................
             Limpopo   Gauteng   Free State   KwaZulu-Natal
Classical       50       105         80             65
Jazz            40        85         70             55
Rock            85       160        125             80
To use the table we have to total the number of people in each region and the number of people who
listen to each of the types of music.
             Limpopo   Gauteng   Free State   KwaZulu-Natal   Total
Classical       50       105         80             65          300
Jazz            40        85         70             55          250
Rock            85       160        125             80          450
Total          175       350        275            200         1000
Tree diagrams
The information given in the contingency table above can just as well be given in a probability tree
as follows.
...................................................................................................
Tree 1
Region (Marginal) x Type of Music (Conditional) = (Joint)

Limpopo          175/1000    Classical    50/175
                             Jazz         40/175
                             Rock         85/175
Gauteng          350/1000    Classical   105/350
                             Jazz         85/350
                             Rock        160/350
Free State       275/1000    Classical    80/275
                             Jazz         70/275
                             Rock        125/275
KwaZulu-Natal    200/1000    Classical    65/200
                             Jazz         55/200
                             Rock         80/200
Notice that the regional probabilities are all marginal probabilities and the probabilities for each type
of music in each of the four regions are all conditional probabilities. So, the probability that someone
listens to classical music, for example, changes depending on the region in which he/she lives.
The probability tree we developed above shows the regions first and the types of music preferred
as contingent upon the region in which someone lives. But, suppose we want to show the tree the
other way around; that is, suppose we wanted to show the types of music first and the regional
identifications as contingent upon the types of music preferred.
Tree 2
Type of Music (Marginal) x Region (Conditional) = (Joint)

Classical    300/1000    Limpopo          50/300
                         Gauteng         105/300
                         Free State       80/300
                         KwaZulu-Natal    65/300
Jazz         250/1000    Limpopo          40/250
                         Gauteng          85/250
                         Free State       70/250
                         KwaZulu-Natal    55/250
Rock         450/1000    Limpopo          85/450
                         Gauteng         160/450
                         Free State      125/450
                         KwaZulu-Natal    80/450
Notice now that the music types are marginal probabilities and the probabilities for each region are
all conditional probabilities. So, the probability that someone lives in Limpopo, for example, changes
depending on which type of music he/she listens to.
The joint probabilities can be calculated directly from the contingency table or using the tree diagrams,
e.g.
P(Classical and Gauteng) = P(Classical) × P(Gauteng/Classical)
                         = 300/1000 × 105/300
                         = 105/1000
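The relationship marginal × conditional = joint can be verified for any branch of the tree. The following sketch stores the counts from the contingency table in a nested dictionary; the helper names `marginal_music` and `region_given_music` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

# Counts from the contingency table: music type -> region -> number of people
table = {
    "Classical": {"Limpopo": 50, "Gauteng": 105, "Free State": 80, "KwaZulu-Natal": 65},
    "Jazz":      {"Limpopo": 40, "Gauteng": 85,  "Free State": 70, "KwaZulu-Natal": 55},
    "Rock":      {"Limpopo": 85, "Gauteng": 160, "Free State": 125, "KwaZulu-Natal": 80},
}
grand_total = sum(sum(row.values()) for row in table.values())  # 1000 people

def marginal_music(music):
    """P(music): row total divided by the grand total."""
    return Fraction(sum(table[music].values()), grand_total)

def region_given_music(region, music):
    """P(region / music): cell count divided by the row total."""
    return Fraction(table[music][region], sum(table[music].values()))

# Joint probability along one branch of Tree 2
joint = marginal_music("Classical") * region_given_music("Gauteng", "Classical")
print(joint, float(joint))  # 21/200 0.105
```

The product equals the cell count divided by the grand total (105/1000), which is the joint probability read directly from the table.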
Activity 4.2: Self-assessment exercises
Application skills
Question 1
Which statement is incorrect?
1. A marginal probability is the probability that an event will occur regardless of any other events.
2. A joint probability is the probability that two or more events will all occur.
3. If P (A) = 0.8 and P (B) = 0.5 and P (A and B) = 0.24, we can conclude that events A and B are
mutually exclusive.
4. Given the same information as in 3, the events A and B cannot be independent.
5. Two events cannot be mutually exclusive as well as independent.
Question 2
A study was conducted at a small college on first-year students living on campus. A number of
variables were measured. The table below provides information regarding number of roommates
and end of term health status for the first-year students at this college. Health status for individuals
is measured as poor, average, and exceptional.
                     Number of roommates
Health Status      None     One     Two
Poor                15       36      65
Average             35       94      40
Exceptional         50       50      25
4. The events H = {the student has poor health status} and N = {the student has no roommates}
are dependent.
5. If you find a person with average health you can be 52% sure that he/she comes from a room with
only one roommate.
Question 3
In the Barana Republic there are two producers of fridges, referred to as Cool and Dry. Assume that
there are no fridge imports. The market shares for these two producers are 70% for Cool and 30%
for Dry. One executive at Dry proposes a longer warranty period to be offered at a slight extra cost as
a plan to increase market share. A market research company appointed by Dry conducts a census
of fridge owners on their opinion of this warranty proposal. Among owners of a fridge made by Cool,
50% like the proposal, 30% are indifferent to it, while the remaining owners oppose it. Among owners
of a fridge made by Dry, 70% like the proposal, 20% are indifferent to it, and the remaining owners
oppose it.
3.1 A fridge owner will be selected at random. What is the probability that the person will own a fridge
made by Dry?
3.2 A fridge owner will be selected at random. What is the probability that the owner will be opposed
to the proposal of a new warranty at extra cost?
Application skills
Question 1
Option 3
1. Correct.
A marginal probability is the probability that an event will occur regardless of any other events.
2. Correct.
A joint probability is the probability that two or more events will all occur.
3. Incorrect.
For mutually exclusive events P (A and B) = 0.
4. Correct.
Given the same information as in 3, the events A and B cannot be independent. If events A and
B were independent, then P(A and B) = P(A) × P(B) = 0.8 × 0.5 = 0.40. However, the problem
states that P(A and B) = 0.24.
Question 2
Option 3
To be able to answer this question it is advisable that you add the rows and columns of the given
table.
                     Number of roommates
Health Status      None     One     Two     Total
Poor                15       36      65      116
Average             35       94      40      169
Exceptional         50       50      25      125
Total              100      180     130      410
1. Correct.
Of the 100 students with no roommates, 15 had poor health, and 15/100 = 0.15.
2. Correct.
Of the 180 students with one roommate, 36 had poor health, and 36/180 = 0.20.
3. Incorrect.
If these two events were to be mutually exclusive, the cell where poor health and no roommates
cross should have had a zero in (and there is a 15).
4. Correct.
Use the multiplication rule to prove that the events are not independent.
Test if P(H and N) = P(H) × P(N)
P(H and N) = 15/410 = 0.037
P(H) × P(N) = 116/410 × 100/410 = 0.069 ≠ 0.037
94/180 = 0.5222.
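The independence test used for statement 4 can be written as a small Python sketch. The totals (410 students, 116 with poor health, 100 with no roommates, 15 in both groups) come from the table with row and column totals; the variable names are illustrative.

```python
from fractions import Fraction

total = 410                         # all first-year students in the table
p_h = Fraction(116, total)          # P(H): poor health status
p_n = Fraction(100, total)          # P(N): no roommates
p_h_and_n = Fraction(15, total)     # P(H and N): poor health and no roommates

# H and N are independent only if P(H and N) equals P(H) x P(N)
independent = (p_h_and_n == p_h * p_n)
print(round(float(p_h_and_n), 3), round(float(p_h * p_n), 3), independent)
```

Because the two values differ (0.037 against 0.069), the events are dependent.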
Question 3
The following tree diagram gives the question information in a concise manner:
A = owns a fridge made by Cool, A' = owns a fridge made by Dry
B1 = like proposal
B2 = indifferent to proposal
B3 = oppose proposal

Cool: P(A) = 0.7
   P(B1 | A) = 0.5     P(A and B1) = 0.35
   P(B2 | A) = 0.3     P(A and B2) = 0.21
   P(B3 | A) = 0.2     P(A and B3) = 0.14
Dry: P(A') = 0.3
   P(B1 | A') = 0.7    P(A' and B1) = 0.21
   P(B2 | A') = 0.2    P(A' and B2) = 0.06
   P(B3 | A') = 0.1    P(A' and B3) = 0.03
3.1 The probability that the person will own a fridge made by Dry is equal to 0.30.
3.2 The probability that the owner will be opposed to the proposal of a new warranty at extra cost is
0.03 + 0.14 = 0.17.
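The answer to 3.2 adds the joint probabilities of the two "oppose" branches of the tree. A minimal sketch of that calculation follows; the dictionaries and their names are illustrative, and the 20% and 10% opposition rates are the "remaining owners" implied by the question.

```python
# Market shares (marginal probabilities) and opposition rates (conditional probabilities)
p_brand = {"Cool": 0.7, "Dry": 0.3}
p_oppose_given_brand = {"Cool": 0.2, "Dry": 0.1}

# Joint probability for each branch: P(brand) x P(oppose / brand)
joint_oppose = {brand: p_brand[brand] * p_oppose_given_brand[brand] for brand in p_brand}

# The branches are mutually exclusive, so P(oppose) is the sum of the joint probabilities
p_oppose = sum(joint_oppose.values())
print(round(p_oppose, 2))
```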
Question 2
Assume that X and Y are two independent events with P (X) = 0.5 and P (Y ) = 0.25. Which of the
following statements is incorrect?
1. P(X') = 0.75
2. P (X and Y ) = 0.125
3. P (X or Y ) = 0.625
4. X and Y are not mutually exclusive
5. P (X/Y ) = 0.5
Question 3
Refer to the following contingency table:
Event      D1      D2      D3     Total
C1         75      90     135      300
C2        125     105     120      350
C3         65      60      75      200
C4         35      45      70      150
Total     300     300     400     1000
Question 4
A sidewalk ice-cream seller sells three flavours: chocolate, vanilla and strawberry. Of his sales 40%
is chocolate, 35% vanilla and 25% strawberry. Sales are by cone or cup. The percentages of cone
sales for chocolate, vanilla and strawberry are 80%, 60% and 40% respectively. Use a tree diagram
to determine the relevant probabilities of a randomly selected sale of one ice cream. Which one of
the following statements is incorrect?
1. P (strawberry) = 0.25
2. P (vanilla in a cup) = 0.14
3. P (chocolate in a cone) = 0.32
4. P (chocolate or vanilla) = 0.75
5. P (vanilla/in a cone) = 0.3889
Question 5
A survey asked people how often they exceed speed limits. The data are then categorized into the
following contingency table of counts showing the relationship between age group and response.
Age
Total
200
200
400
Question 6
Numbers 1, 2, 3, 4, 5, 6, 7, 8, 9 are written on separate cards. The cards are shuffled and the top
one turned over. Let A = an even number and B = a number greater than 6.
Which one of the following statements is incorrect?
1. The sample space is S = {1, 2, 3, 4, 5, 6, 7, 8, 9}
2. P (A) = 4/9
3. P (B) = 1/3
4. P (A and B) = 1/9
5. P (A or B) = 7/9
Question 7
If A and B are independent events with P (A) = 0.25 and P (B) = 0.60, then P (A/B) is equal to
1. 0.25
2. 0.60
3. 0.35
4. 0.85
5. 0.15
Question 8
Given that P (A) = 0.7, P (B) = 0.6 and P (A and B) = 0.35, which one of the following statements is
incorrect?
1. P(B') = 0.4
2. A and B are not mutually exclusive
3. A and B are dependent
4. P (B/A) = 0.6
5. P (A or B) = 0.95
Question 9
The Burger Queen Company has 124 locations along the west coast.
is concerned with the profitability of the locations compared with major menu items sold. The
information below shows the number of each menu item selected by profitability of store.
                       Baby     Mother    Father
                       Burger   Burger    Burger    Nachos    Tacos
                       M1       M2        M3        M4        M5       Total
High profit     R1     250      424       669       342       284      1969
Medium profit   R2     312      369       428       271       200      1580
Low profit      R3     289      242       216       221       238      1206
Total                  851     1035      1313       834       722      4755
Question 10
In a particular country, airport A handles 50% of all airline traffic, and airports B and C handle 30%
and 20% respectively. The detection rates for weapons at the three airports are 0.9, 0.5 and 0.4
respectively.
If a passenger at one of the airports is found to be carrying a weapon through the boarding gate,
what is the probability that the passenger is using airport C?
1. 0.2206
2. 0.6618
3. 0.5000
4. 0.2941
5. 0.1176
Alternative (1).
Question 3
P(C1 and D1) = 75/1000 = 0.075
P(D1) = 300/1000 = 0.3
P(C1 or D1) = 0.3 + 0.3 − 0.075 = 0.525
Alternative (3).
Question 4
Sa = strawberry flavour
P (Strawberry) = 0.25
P(Vanilla in a cup) = P(V) × P(cup/V) = 0.35 × 0.4 = 0.14
Alternative 5
Question 5
P (always|over 30) = 40/200 = 0.20
Alternative 4
Question 6
P(A or B) = P(A) + P(B) − P(A and B) = 4/9 + 1/3 − 1/9 = 6/9 = 2/3
Alternative (5)
Question 7
P(A/B) = P(A and B) / P(B) = P(A) × P(B) / P(B) = P(A) = 0.25
Alternative 1
Question 8
P(B') = 1 − P(B) = 0.4
Alternative (4).
Question 9
Alternative (2).
Question 10
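One way to approach Question 10 is Bayes' rule. The sketch below rests on two assumptions that the question does not state explicitly: that "found to be carrying a weapon" means the weapon was detected, and that passengers carrying weapons are spread over the three airports in the same proportions as the traffic. All names in the code are illustrative.

```python
# Share of airline traffic per airport (assumed prior for where a weapon carrier travels)
prior = {"A": 0.5, "B": 0.3, "C": 0.2}
# Detection rate at each airport: P(detected / airport)
detect = {"A": 0.9, "B": 0.5, "C": 0.4}

# Total probability that a weapon is detected, summed over the three airports
p_detected = sum(prior[a] * detect[a] for a in prior)

# Bayes' rule: P(airport C / detected)
posterior_c = prior["C"] * detect["C"] / p_detected
print(round(p_detected, 2), round(posterior_c, 4))
```

Under these assumptions the result is 0.08/0.68, which rounds to 0.1176.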
STUDY UNIT 5
STUDY CHAPTER 5
A continuous random variable assumes an uncountable number of possible values; it can take
on any value in one or more intervals of values. Levine et al. provide the following definition of a
probability function.
A probability function, denoted p (x) , specifies the probability that a random
variable is equal to a specific value. More formally, p (x) is the probability that the random
variable X takes on the value x, or p (x) = P (X = x) .
$\sum p(x) = 1$: the sum of the probabilities for all possible outcomes, $x$, of a random variable, $X$,
equals one.
English term
Probability distributions
Discrete random variables
Continuous random variables
Binomial distributions
Poisson distributions
The mean of the binomial distribution
The variance of the binomial distribution
The mean of the Poisson distribution
The variance of the Poisson distribution
The standard deviation of the binomial
distribution
The standard deviation of the Poisson
distribution
Description
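$\mu = E(X) = \sum_{i=1}^{N} X_i P(X_i)$
where
$X_i$ = the ith outcome of the discrete random variable $X$
$P(X_i)$ = the probability of occurrence of the ith outcome of $X$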
Example 5.1
Based on her experience, a professor knows that the probability distribution for X = number of
students who come to her office on Wednesdays is given below.
x            0       1       2       3       4
P(X = x)     0.10    0.20    0.50    0.15    0.05
What is the expected number of students who visit her on Wednesdays?
1. 0.50
2. 0.70
3. 1.85
4. 0.90
5. 0.30
Solution: The expected value (the mean) is calculated as the sum of the products of each value of the
random variable X and its corresponding probability, P(X), as follows:
$\mu = E(X) = \sum_{i=1}^{N} X_i P(X_i)$
$= 0(0.10) + 1(0.20) + 2(0.50) + 3(0.15) + 4(0.05)$
$= 1.85$
Alternative 3
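The expected value in Example 5.1 is a probability-weighted sum, which is easy to check in Python. In this sketch the probability for x = 0 is taken as 0.10 so that the probabilities add up to one (an assumption; the term contributes nothing to the mean either way), and the names are illustrative.

```python
# Probability distribution of X = number of students visiting on a Wednesday
distribution = {0: 0.10, 1: 0.20, 2: 0.50, 3: 0.15, 4: 0.05}

# A valid discrete distribution must have probabilities that sum to one
total_probability = sum(distribution.values())

# Expected value: multiply each outcome by its probability and add the products
mean = sum(x * p for x, p in distribution.items())
print(round(total_probability, 2), round(mean, 2))
```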
The variance of a discrete random variable is calculated as follows:
$\sigma^2 = \sum_{i=1}^{N} [X_i - \mu]^2 P(X_i)$
Where
Xi = the ith outcome of the discrete random variable X
P (Xi ) = the probability of occurrence of the ith outcome of X
Please note that we have to compute the mean first before we think of calculating the variance of a
discrete random variable.
The standard deviation is the positive square root of the variance of a discrete random variable:
$\sigma = \sqrt{\sigma^2} = \sqrt{\sum_{i=1}^{N} [X_i - \mu]^2 P(X_i)}$
Example 5.2
Let the probability distribution for X = number of jobs held during the past year for students at a
college be as follows:
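x            1       2       3       4       5
P(X = x)     0.25    0.33    0.17    0.15    0.10

$\mu = 1(0.25) + 2(0.33) + 3(0.17) + 4(0.15) + 5(0.10) = 2.52$
$\sigma^2 = (1 - 2.52)^2(0.25) + (2 - 2.52)^2(0.33) + (3 - 2.52)^2(0.17) + (4 - 2.52)^2(0.15) + (5 - 2.52)^2(0.10) = 1.6496$
$\sigma = \sqrt{\sigma^2} = \sqrt{1.6496} = 1.2844$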
Alternative 4
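The mean, variance and standard deviation for Example 5.2 can be reproduced with the sketch below. It follows the order recommended in the text (mean first, then variance, then the square root); the variable names are illustrative.

```python
import math

# Probability distribution of X = number of jobs held during the past year
distribution = {1: 0.25, 2: 0.33, 3: 0.17, 4: 0.15, 5: 0.10}

# Step 1: the mean, needed before the variance can be computed
mean = sum(x * p for x, p in distribution.items())

# Step 2: the variance, the probability-weighted squared deviations from the mean
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())

# Step 3: the standard deviation, the positive square root of the variance
std_dev = math.sqrt(variance)
print(round(mean, 2), round(variance, 4), round(std_dev, 4))
```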
If you have not mastered how to calculate the mean, the variance and the standard deviation of a
discrete random variable, you can now work through section 5.1 of Levine et al. again; otherwise try
the following activity before looking at its solutions.
Activity 5.1
Self-assessment exercise
Application skills
Question 1
The number of telephone calls coming into a switchboard and their respective probabilities for a
3-minute interval are as follows:
x            0       1       2       3       4       5
P(X = x)     0.60    0.20    0.10    0.04    0.03    0.03
How many calls might be expected over a 3-minute interval?
1. 0.04
2. 3
3. 0.2
4. 0.79
5. 3.75
Question 2
The probability distribution of a discrete random variable is shown below.
x            0       1       2       3
P(X = x)     0.25    0.40    0.20    0.15
Question 3
Use the data set given in question 2 and find the incorrect statement.
1. $P(x > 1) = 0.35$
2. $P(x \le 2) = 0.65$
3. $P(1 < x \le 2) = 0.20$
4. $P(0 < x < 1) = 0.00$
5. $P(1 \le x < 3) = 0.60$
Solutions to Activity 5.1
Application skills
Question 1
Recall, the expected number is also the mean of a discrete random variable, calculated as:
$\mu = E(X) = \sum_{i=1}^{N} X_i P(X_i)$
$= 0(0.60) + 1(0.20) + 2(0.10) + 3(0.04) + 4(0.03) + 5(0.03)$
$= 0.79$
Alternative 4
Question 2
1. Correct. The variable takes on discrete values, therefore the statement is correct. Remember
in section 5.2 of this unit we defined the probability distribution for discrete random variable
as a mutually exclusive listing of all possible numerical outcomes along with the probability of
occurrence of each outcome which is exactly the case in this option.
2. Correct.
$\mu = E(X) = \sum_{i=1}^{N} X_i P(X_i)$
$= 0(0.25) + 1(0.40) + 2(0.20) + 3(0.15)$
$= 1.25$
4. Correct. You can see it if you study the calculation of the mean and the variance.
Question 3
1. Correct. We add from two (greater than one) up to three as follows:
$P(x > 1) = P(x = 2) + P(x = 3) = 0.20 + 0.15 = 0.35$
2. Incorrect. Here we take values from zero to two. One could also consider this question as "at most
two", as discussed in study unit 4.
$P(x \le 2) = P(x = 0) + P(x = 1) + P(x = 2) = 0.25 + 0.40 + 0.20 = 0.85$
3. Correct. In this case one is not included but two is. $P(1 < x \le 2) = P(x = 2) = 0.20$
4. Correct. $P(0 < x < 1) = 0.00$ because between 0 and 1 there is no discrete value for x.
5. Correct. Here one is included but three is not. $P(1 \le x < 3) = P(x = 1) + P(x = 2) = 0.40 + 0.20 = 0.60$
Having understood discrete random variables, we can now discuss their probability distributions. This
is a very small but important section in statistics.
There are quite a number of discrete probability distributions, though Levine emphasises only two,
namely
the Binomial distribution and
the Poisson distribution.
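The probability ($\pi$) that the trial results in a success remains the same from trial to trial.
The trials are independent of each other (the outcome of a trial does not affect the outcome of
any other trial).
$P(x) = \dfrac{n!}{x!\,(n - x)!}\, \pi^{x} (1 - \pi)^{n - x}$
where
$P(x)$ = the probability of $x$ successes in $n$ trials
$n$ = number of trials or sample size
$\pi$ = probability of success in a single trial
$x$ = number of successes in the sample
The mathematical sign (!) is called the factorial sign of a positive integer n. It is interpreted as the
product of all positive integers less than or equal to n. For example $5! = 5 \times 4 \times 3 \times 2 \times 1 = 120$,
$4! = 4 \times 3 \times 2 \times 1 = 24$ and $0! = 1$.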
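The binomial formula translates directly into code. The sketch below is one possible implementation using Python's standard library; the function name `binomial_pmf` and the parameter name `pi` (for the probability of success) are illustrative choices, not notation from the textbook.

```python
import math

def binomial_pmf(x, n, pi):
    """Probability of exactly x successes in n independent trials,
    each with success probability pi."""
    # math.comb(n, x) equals n! / (x! (n - x)!)
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)

# The factorial examples from the text
print(math.factorial(5), math.factorial(4), math.factorial(0))

# The probabilities over all possible values x = 0, 1, ..., n add up to one
total = sum(binomial_pmf(x, 5, 0.2) for x in range(6))
print(round(total, 10))
```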
5.3.2 The variance and the standard deviation of the binomial distribution
$\sigma^2 = Var(X) = n\pi(1 - \pi)$
$\sigma = \sqrt{\sigma^2} = \sqrt{n\pi(1 - \pi)}$
Example 5.3
A textile firm has found from experience that only 20% of the people applying for a certain
stitching-machine job are qualified for the work. If 5 people are interviewed, what is the probability of
finding at least three qualified persons?
$n = 5$, $\pi = 0.20$, $P(x \ge 3)$?
Please do not forget that "at least three" means: add from three, four and so on.
$P(x \ge 3) = P(x = 3) + P(x = 4) + P(x = 5)$
$= \dfrac{5!}{3!\,(5-3)!}\, 0.20^{3} (1 - 0.20)^{5-3} + \dfrac{5!}{4!\,(5-4)!}\, 0.20^{4} (1 - 0.20)^{5-4} + \dfrac{5!}{5!\,(5-5)!}\, 0.20^{5} (1 - 0.20)^{5-5}$
$= 0.0512 + 0.0064 + 0.0003$
$= 0.0579$
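Example 5.3 can be checked numerically. The sketch below sums the binomial probabilities for three, four and five qualified applicants and also evaluates the mean and variance formulas of section 5.3.2 for the same n and π; the function and variable names are illustrative.

```python
import math

def binomial_pmf(x, n, pi):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)

n, pi = 5, 0.20   # five interviews, 20% of applicants are qualified

# "At least three" means x = 3, 4 or 5
p_at_least_3 = sum(binomial_pmf(x, n, pi) for x in range(3, n + 1))

# Mean and variance of the binomial distribution
mean = n * pi
variance = n * pi * (1 - pi)
print(round(p_at_least_3, 4), round(mean, 2), round(variance, 2))
```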
You can now attempt the following typical exam questions. Please try to answer them before looking
at the solutions.
Activity 5.2: Self-assessment exercises
Application skills
Question 1
A new car salesperson knows that he sells cars to one customer out of 10 who enters the showroom.
The probability that he will sell a car to exactly two of the next three customers is
1. 0.027
2. 0.973
3. 0.000
4. 0.090
5. 0.901
Question 2
Use the information given in question 1. Let X be number of cars the salesperson sells to the next
three customers. Which one of the following statements is incorrect?
1. X has a binomial distribution
2. The expected number of cars sold if n = 3 is 0.3
3. The variance of this distribution is 0.27
4. $P(X \le 1) = 0.9720$
5. P (X > 2) = 0.0280
Question 3
Suppose that 62% of new cars sold in a country are made by one big car manufacturer. A random
sample of 7 purchases of new cars is selected. The probability that 4 of those selected purchases
are made by this car manufacturer is
1. 0.5800
2. 0.5714
3. 0.2838
4. 0.4200
5. 0.7162
Application skills
Question 1
$n = 3$, $\pi = \dfrac{1}{10} = 0.1$, $P(x = 2)$?
$P(x = 2) = \dfrac{3!}{2!\,(3-2)!}\, 0.1^{2} (1 - 0.1)^{3-2}$
$= 0.027$
Question 2
1. Correct.
2. Correct. $E(X) = n\pi = 3 \times 0.1 = 0.3$
3. Correct. $\sigma^2 = n\pi(1 - \pi) = 3 \times 0.1 \times (1 - 0.1) = 0.27$
4. Correct.
$P(x \le 1) = P(x = 0) + P(x = 1)$
$= \dfrac{3!}{0!\,(3-0)!}\, 0.1^{0} (1 - 0.1)^{3-0} + \dfrac{3!}{1!\,(3-1)!}\, 0.1^{1} (1 - 0.1)^{3-1}$
$= 0.7290 + 0.2430$
$= 0.9720$
5. Incorrect.
$P(x > 2) = P(x = 3)$
$= \dfrac{3!}{3!\,(3-3)!}\, 0.1^{3} (1 - 0.1)^{3-3}$
$= 0.001$
Question 3
$n = 7$, $\pi = 0.62$, $P(x = 4)$?
$P(x = 4) = \dfrac{7!}{4!\,(7-4)!}\, 0.62^{4} (1 - 0.62)^{7-4}$
$= 0.2838$
Alternative 3
The probability that success will occur in an interval is the same for all intervals of equal size, and
x is the count of the number of successes that occur in a given interval, and may take on any
$P(x) = \dfrac{e^{-\lambda} \lambda^{x}}{x!}$, $x = 0, 1, 2, \ldots$
where
$\lambda$ = the average number of successes occurring in the given time or measurement
$e$ = 2.71828 (the base of natural logarithms)
Example 5.4
The average number of a certain radio sold per day by a firm is approximately Poisson, with mean of
1.5. The probability that the firm will sell at least two radios over a three-day period is equal to
1. 0.5578
2. 0.1255
3. 0.9344
4. 0.0447
5. 0.4422
Solution:
Recall that this distribution has no upper bound. Therefore we have to express "at least" in another
equivalent way such as
$P(x \ge 2) = 1 - P(x \le 1)$
$= 1 - \{P(x = 0) + P(x = 1)\}$
$= 1 - \left\{ \dfrac{1.5^{0} e^{-1.5}}{0!} + \dfrac{1.5^{1} e^{-1.5}}{1!} \right\}$
$= 1 - \{0.2231 + 0.3347\}$
$= 0.4422$
Alternative 5
Example 5.5
A bank receives on average 6 bad cheques per day. The probability that it will receive exactly 4 bad
cheques on a given day is
1. 0.0892
2. 0.1393
3. 0.2851
4. 0.1339
5. 0.6667
Solution
Given that $\lambda = 6$, $P(x = 4)$?
$P(x = 4) = \dfrac{e^{-\lambda} \lambda^{x}}{x!} = \dfrac{e^{-6}\, 6^{4}}{4!} = 0.1339$
Alternative 4
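Both Poisson examples can be verified with a short function. This sketch uses only the standard library; the name `poisson_pmf` and the parameter `lam` (for the average number of successes) are illustrative, and the first calculation follows the worked solution of Example 5.4 in using an average of 1.5.

```python
import math

def poisson_pmf(x, lam):
    """Probability of exactly x successes when the average number is lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

# Example 5.4: "at least two" via the complement of "zero or one"
p_at_least_2 = 1 - (poisson_pmf(0, 1.5) + poisson_pmf(1, 1.5))

# Example 5.5: exactly four bad cheques when the average is six per day
p_exactly_4 = poisson_pmf(4, 6)

print(round(p_at_least_2, 4), round(p_exactly_4, 4))
```

Using the complement avoids summing an unbounded number of terms, which is why the solution rewrites "at least two" as one minus "at most one".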
An announcement that six bank robberies are taking place is being broadcast. The probability that
a firearm is being used in at least one of the robberies is
1. 0.0015
2. 0.7379
3. 0.0001
4. 0.9999
5. 0.0016
Question 2
In an urban county, health officials anticipate that the number of births this year will be the same as
last year, when 438 children were born, an average of 438/365, or 1.2 births per day. Daily births
have been distributed according to a Poisson distribution.
The distribution can be represented as
x            0        1        2        3        4        5        6        7
P(X = x)     0.3012   0.3614   0.2169   0.0867   0.0260   0.0062   0.0012   0.0002
What is the probability that at least two births will occur on a given day?
1. 0.3374
2. 0.8795
3. 0.3795
4. 0.7831
5. 0.6626
Question 3
Given the following probability distribution for an infinite population with the discrete random
variables, x
x        0      1      2      3
P(x)     0.2    0.1    0.3    0.4
Question 4
A drug is known to be 80% effective in curing a certain disease. If four people with the disease are
to be given the drug, the probability that more than two are cured is:
1. 0.8464
2. 0.1536
3. 0.5000
4. 0.1808
5. 0.8192
Question 5
Refer to question 4, the expected value of people cured is
1. 0.80
2. 0.20
3. 3.20
4. 0.64
5. 1.00
Question 6
Given a Poisson random variable X, where the average number of successes occurring in a
specified interval is 1.8, P(X = 0) is equal to
1. 0.1653
2. 0.2975
3. 1.0000
4. 0.0000
5. 0.4762
$P(x \ge 1) = 1 - P(x \le 0)$
$= 1 - \{P(x = 0)\}$
$= 1 - \dfrac{6!}{0!\,(6-0)!}\, 0.80^{0} (1 - 0.80)^{6-0}$
$= 1 - 0.000064$
$= 0.9999$
Alternative 4
Question 2
$P(x \ge 2) = 1 - P(x \le 1)$
$= 1 - \{P(x = 0) + P(x = 1)\}$
$= 1 - \{0.3012 + 0.3614\}$
$= 0.3374$
Alternative 1
Question 3
1. Correct.
$\mu = E(X) = \sum_{i=1}^{N} X_i P(X_i)$
$= 0(0.2) + 1(0.1) + 2(0.3) + 3(0.4)$
$= 1.9$
2. Correct.
P (x 1)
3. Correct.
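$\sigma^2 = \sum_{i=1}^{N} [X_i - \mu]^2 P(X_i)$
$= (0 - 1.9)^2(0.2) + (1 - 1.9)^2(0.1) + (2 - 1.9)^2(0.3) + (3 - 1.9)^2(0.4)$
$= 1.29$
4. Correct
$\sigma = \sqrt{1.29} = 1.14$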
5. Incorrect
P (x 1) = P (x = 0) + P (x = 1) + P (x = 2) + P (x = 3) = 1.0
Question 4
$P(x > 2) = P(x = 3) + P(x = 4)$
$= \dfrac{4!}{3!\,(4-3)!}\, 0.80^{3} (1 - 0.8)^{4-3} + \dfrac{4!}{4!\,(4-4)!}\, 0.8^{4} (1 - 0.8)^{4-4}$
$= 0.4096 + 0.4096$
$= 0.8192$
Alternative 5
Question 5
$E(X) = n\pi = 4 \times 0.80 = 3.2$
Alternative 3
Question 6
Given that $\lambda = 1.8$, $P(x = 0)$?
$P(x = 0) = \dfrac{e^{-\lambda} \lambda^{x}}{x!} = \dfrac{e^{-1.8}\, 1.8^{0}}{0!} = 0.1653$
Alternative 1
In summary to the study unit
After you have studied this study unit about discrete probability distributions, you should be able to
recognise and define a discrete probability distribution
construct a probability distribution for a discrete random variable
understand the concept of a Bernoulli process and its application in consecutive trials, as
STUDY UNIT 6
STUDY CHAPTER 6
THE NORMAL DISTRIBUTION
normal distribution.
The normal distribution can be used to approximate various discrete probability distributions.
The normal distribution provides the basis for classical statistical inference because of its
relationship to the Central Limit Theorem (see section 6.1 and 6.2 of Levine)
You must make sure that you know the characteristics of the normal distribution and how to use the
normal table (E.2) to determine probabilities. For this module it is not necessary that you can use the
normal distribution to approximate the binomial distribution. Still, please read through those sections
with care as we cannot cover all the knowledge in one module, but it contains essential statistical
knowledge that you should be aware of. Fortunately you will always have the Levine text book for
reference, should you need it at any time in the future!
Activity 6.1: Overview
Study skills
Draw a mind-map of the different sections/headings you will deal with in this study session. Then
page through the unit with the purpose of completing the map.
...................................................................................................
Conceptual skills
Communication skill
Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline; can
you think of some examples?
...................................................................................................
English term
Continuous Probability distributions
Discrete Probability distributions
The Standard Normal probabilities
The mean of the Normal distribution
The standard deviation of the Normal distribution
The area under the normal curve
Description
product of two values, even though it is not such a perfect box. The nice part is that it is not your
problem how the area of such a funny box is determined; you simply read off the answer from a table
(see E.2).
ACTIVITY
Read through sections 6.1 and 6.2 of Levine text book at least once!
In addition, there are a few things you have to know about the normal distribution.
The form of the distribution is described as bell-shaped, meaning that it is symmetric (if it was
possible to cut out the line forming the bell you would be able to fold it double with the two halves
fitting on top of each other).
The normal curve is indicated within an axis system two perpendicular lines (like you may have
The values of the continuous variable X (in the notation of Levine) are indicated on the horizontal
axis.
The distribution is only the line forming the bell and is called the density function, indicated as
f (X) (a function of the variable X ).
The total probability is represented by the area between that density function (bell-shaped
The total area equals one, but can be broken up into sections, determined by the values given to
The centre X -value, where you would fold the curve to indicate the symmetry, represents the
It is not only the placement of the mean which determines the distribution of a particular
normal distribution, the standard deviation (how the values are spread around the mean) also
determines the form of the distribution.
If the values of $\mu$ and $\sigma$ are used to standardize each individual X-value with the formula
$Z = \dfrac{X - \mu}{\sigma}$ (see equation 6.2) then the original normal distribution will be transformed to a so-called
standardized normal distribution.
The values in the normal table give the areas for the probabilities of the standard normal
distribution.
The previous statement implies that all general normal distributions must first be standardized
with the formula $Z = \dfrac{X - \mu}{\sigma}$ before the normal table may be used.
In this study guide we also insert another version of the normal table (taken from Utts and Heckard:
Mind on Statistics, 3rd edition) for your convenience. Here two separate tables are given one for
negative Z values and one for positive Z values. Some students find it easier to calculate areas
under the normal curve, using these two tables.
The normal table can be used either to determine an area under the normal curve for a given Z -value
or it can be used backwards whenever the value of the Z variable has to be determined for a given
area.
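Both uses of the normal table, forwards (from a Z-value to an area) and backwards (from an area to a Z-value), can be mimicked with Python's standard library. This is only an illustrative sketch: in the module you are expected to use the printed table, and the values μ = 15, σ = 3 and X = 18 are chosen here purely as an example.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

# Step 1: standardize an X-value with Z = (X - mu) / sigma
mu, sigma = 15, 3                # illustrative mean and standard deviation
x = 18
z = (x - mu) / sigma

# Forwards: area under the curve to the left of Z, i.e. P(X <= 18)
area_left = standard_normal.cdf(z)

# Backwards: the Z-value that has an area of 0.95 to its left
z_for_95 = standard_normal.inv_cdf(0.95)

print(z, round(area_left, 4), round(z_for_95, 3))
```

The forward lookup corresponds to reading an area next to a Z-value in the table, and the backward lookup corresponds to finding the area inside the table and reading off the Z-value in the margins.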
Now you should go through sections 6.1 and 6.2 again. There is nothing wrong with you if you have
to go through a chapter a number of times. It may even be that you need to break a chapter up into
small sections and repeat them over and over until you understand what we are trying to teach you.
Remember that statistics is about understanding and then building a mental structure based on the
underlying theory.
I think you are now ready to do a few activities. Always study the feedback carefully as I do a lot of
explaining there (for those of you who could not manage the activity yourself).
Activity 6.3
Study skill
Question 1
Assume X is normally distributed with mean $\mu = 15$ and standard deviation $\sigma = 3$. Use the
approximate areas under the normal curve to evaluate the following statements. The incorrect
statement is
1. $P(X \le 15) = P(X \ge 15) = 0.5$
2. $P(12 \le X \le 18) = 0.955$
3. $P(X \le 9) = 0.0228$
4. $P(X = 20) = 0$
5. $P(X \ge 12) = 0.8413$
Question 2
I want you to do question 6.9 in the Levine textbook. The breaking strength of plastic bags used for
packaging is normally distributed, with a mean of 5 pounds per square inch and a standard
deviation of 1.5 pounds per square inch. What proportion of the bags have a breaking strength of
(a) less than 3.11 pounds per square inch?
(b) at least 3.8 pounds per square inch?
(c) between 5 and 5.5 pounds per square inch?
(d) Between what two values, symmetrically distributed around the mean, will 95% of the breaking
strengths be contained?
Question 3
A manufacturer of tow chains finds that the average breaking point is at 3500 kilograms and the
standard deviation is 250 kilograms. If you pull a weight of 4200 kilograms with this tow chain, the
percentage of the time you can expect the chain to break is
1. 2.8%
2. 0.26%
3. 49.74%
4. 99.74%
5. None of the above.
Question 4
A retailer finds that the demand for a very popular board game averages 100 per week with a standard
deviation of 20. If the seller wishes to have adequate stock 95% of the time, how many of the games
must she keep on hand?
1. 132.9
2. 67.1
3. 119
4. 195.0
5. 109
Question 5
Identify the incorrect statement:
1. The average waiting time at the checkout counter for a large grocery chain is 2.45 minutes with a
standard deviation of 24 seconds (0.40 minutes). If we assume that the distribution of waiting time
is normal, the probability that a customer must wait more than 3 minutes for check out is 0.9162.
2. Considering the information in 1, the proportion of the customers who are served between 1
minute and 2.5 minutes is 0.5518.
3. Suppose the monthly demand for automobile tyres at a tyre dealer is normally distributed with a
mean of 250 tyres and a standard deviation of 50 tyres. The number of tyres the store must have
in stock at the beginning of each month in order to meet demand for 95 percent of the time, is
332.25
4. A circus performer who gets shot from a cannon is supposed to land in a safety net. The distance
he travels is normally distributed with a mean of 55 metres and a standard deviation of 4.7 metres.
His landing net is 16 metres long and the mid-point of the net is positioned 55 metres from the
cannon. The probability that the performer will miss the net on a given night is 0.0892.
5. The probability that the circus performer in statement 4 will hit the net is equal to 0.9108.
Question 6
The scores of high school students on a national mathematics exam in Uganda were normally
distributed with a mean of 86 and a standard deviation of 4. If there were 97680 students with scores
higher than 91, how many students took the test?
1. 125000
2. 925000
3. 105000
4. 247667
5. 394400
1. Correct
The mean lies at the centre of the distribution and therefore divides the total area of 1 into two halves
(each half represents 0.5 of the total area) as shown in the figure below.
2. Incorrect. Standardize the random variable before reading off the respective probabilities from
the graphs. You can use the graphs on pages 4 and 5 of the study guide or table E.10 in Levine et al.
$P(12 \le X \le 18) = P\left(\frac{12 - 15}{3} \le Z \le \frac{18 - 15}{3}\right) = P(-1 \le Z \le 1) = 0.8413 - 0.1587 = 0.6826$
3. Correct
$P(X \le 9) = P\left(Z \le \frac{9 - 15}{3}\right) = P(Z \le -2.00) = 0.0228$
4. Correct
$P(X = 20) = 0$
If the variable is continuous we assume that the probability of it assuming any fixed value is always
zero! Remember the continuous variable lies somewhere within a small interval, but we cannot
give a fixed value to it.
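5. Correct
$P(X \ge 12) = P\left(Z \ge \frac{12 - 15}{3}\right) = P(Z \ge -1.00) = 0.8413$
Question 2
(a) $P(X < 3.11) = P\left(Z < \frac{3.11 - 5}{1.5}\right) = P(Z < -1.26) = 0.1038$
(b) $P(X \ge 3.8) = P\left(Z \ge \frac{3.8 - 5}{1.5}\right) = P(Z \ge -0.80) = 0.7881$
(c) $P(5 < X < 5.5) = P\left(\frac{5 - 5}{1.5} < Z < \frac{5.5 - 5}{1.5}\right) = P(0 < Z < 0.33) = 0.6293 - 0.5000 = 0.1293$
(d) We need two Z-values a and b such that $P(a < Z < b) = 0.95$.
Note: Since the values of a and b are symmetrically distributed, they are similar in magnitude with
opposite signs. Using the normal table (see E.2), $a = -1.96$ and $b = 1.96$, so the two breaking
strengths are $5 - 1.96(1.5) = 2.06$ and $5 + 1.96(1.5) = 7.94$ pounds per square inch.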
Question 3
Option 2
$P(X > 4200) = P\left(Z > \frac{4200 - 3500}{250}\right) = P(Z > 2.80) = 1 - 0.9974 = 0.0026$, that is, 0.26%.
Question 4
Option 1
This question is about working backwards. Standardise the random variable first, then look for the
Z-value corresponding to an area of 0.95. Please note that this should be read off from the body
of the table.
$P(X < w) = 0.95$
$P\left(Z < \frac{w - 100}{20}\right) = 0.95$
$\frac{w - 100}{20} = 1.645$
$w = 100 + (20 \times 1.645) = 132.9$
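For those of you who like to verify an answer by computer, working backwards can be sketched in Python as follows. This is only an illustration under the assumption that demand is normally distributed; the function name `value_for_area` is hypothetical, and `NormalDist().inv_cdf` from the standard library plays the role of reading the table from the inside out.

```python
from statistics import NormalDist

def value_for_area(mu, sigma, area):
    """Return w such that P(X < w) = area for a normal X with mean mu and sd sigma."""
    z = NormalDist().inv_cdf(area)   # z-value with `area` to its left
    return mu + z * sigma            # undo the standardization: w = mu + z * sigma

# Board game demand: mean 100 per week, standard deviation 20, stock adequate 95% of the time.
stock = value_for_area(100, 20, 0.95)
print(round(stock, 1))
```

The computer uses a slightly more precise Z-value than the table value 1.645, so tiny differences in the last decimal are to be expected.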
Question 5
1. Incorrect. The correct answer is 0.0838.
Note that you always have to use the same units; in this case use only minutes.
$P(X > 3) = P\left(Z > \frac{3 - 2.45}{0.40}\right) = P(Z > 1.38) = 1.00 - 0.9162 = 0.0838$
From the normal table the value corresponding with 1.38 (rounded to two decimals) is 0.9162. This
is the given answer, but not the correct one. Remember how the normal table is tabulated? The
areas are cumulative: each entry is the total area to the left of the listed value. The question,
however, specifies the area greater than 1.38 (to the right of 1.38). For the correct answer you
therefore have to subtract 0.9162 from 1.00.
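2. Correct
Using the normal table:
$P(1 < X < 2.5) = P\left(\frac{1 - 2.45}{0.40} < Z < \frac{2.5 - 2.45}{0.40}\right) = P(-3.63 < Z < 0.13) = 0.5517 - 0.00014 \approx 0.5516$
This agrees with the stated value of 0.5518 to three decimals.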
3. Correct
This question is about working backwards.
$P(X < w) = P\left(Z < \frac{w - 250}{50}\right) = 0.95$
To meet the demand 95% of the time implies that we are looking for a z-value such that 0.95 of
the area lies to the left of it. We use the normal table (see E.2) and look for the value 0.95 inside
the table, because this is an area.
The z-value which corresponds to an area of 0.95 is 1.645. This 1.645 is the z-value, but we still
have to solve for w:
$\frac{w - 250}{50} = 1.645$
$w = 250 + (50 \times 1.645) = 332.25$
4. Correct. According to the information, the 16 m net is placed in such a way that it begins at
(55 - 8 = 47) metres and stretches up to (55 + 8 = 63) metres from the cannon. The performer will
miss the net by falling short of it or falling past it. In terms of the normally distributed variable,
this means
$P(X \le 47)$ or $P(X \ge 63)$
Standardize: $P\left(Z \le \frac{47 - 55}{4.7}\right)$ or $P\left(Z \ge \frac{63 - 55}{4.7}\right)$
$P(Z \le -1.70)$ or $P(Z \ge 1.70)$
The table value for $P(Z \le -1.70)$ is 0.0446, and $P(Z \ge 1.70) = 1.00 - 0.9554 = 0.0446$.
The probability of missing the net is therefore $0.0446 + 0.0446 = 0.0892$.
SELF ASSESSMENT EXERCISE TEST YOUR KNOWLEDGE
Question 1
Which one of the following is not a characteristic of a normal distribution?
1. The normal variable can take on only discrete values.
2. It is a symmetrical distribution.
3. The mean, median and mode are all equal.
4. It is a bell-shaped distribution.
5. The area under the curve is equal to one.
Question 2
Given that Z is a standard normal random variable, a negative value of z indicates that
1. the value Z is to the left of the mean
2. the value Z is to the right of the median
3. the standard deviation of Z is negative
4. the area between zero and Z is negative
5. the area to the right of Z is equal to 1
Question 3
If Z is a normal variable with $\mu = 0$ and $\sigma = 1$, the area to the left of Z = 1.6 is
1. 0.4452
2. 0.9452
3. 0.0548
4. 0.5548
5. 0.5000
Question 4
Use the normal table to find the Z-value $Z_1$ if the area to the right of $Z_1$ is 0.8413. The value of $Z_1$ is
1. -1.36
2. 1.36
3. 0.00
4. -1.00
5. 1.00
Question 5
Let Z be a Z -score that is unknown but identifiable by position and area. If the area to the left of Z is
0.9306, then the value of Z must be
1. -1.48
2. 0.9603
3. 1.48
4. 0.4306
5. 0.0694
Question 6
Which of the following statements is incorrect?
1. $P(Z \ge 1.63) = 0.0516$
2. $P(Z \ge 0.5) = 0.3085$
3. $P(Z < -1.63) = -0.0516$
4. $P(Z > 1.28) = 0.1003$
5. $P(-1 \le Z \le 1) = 0.6826$
Question 7
For a normal curve, if the mean is 20 minutes and the standard deviation is 5 minutes, then the area
between 22 and 25 minutes is
1. 0.1554
2. 0.3413
3. 0.4967
4. 0.1859
5. 0.0185
Question 8
A bakery firm finds that its average weight of the most popular package of biscuits is 200.5 g with a
standard deviation of 10.5 g. What proportion of biscuit packages will weigh less than 180 g?
1. 0.4744
2. 0.0256
3. 0.5226
4. 0.4713
5. 0.9744
Question 9
The average labour time to sew a pair of denims is 4.2 hours with a standard deviation of 0.5 hours.
If the distribution is normal, then the probability of a worker finishing a pair of jeans in more than 3.5
hours is
1. 0.0808
2. 0.4192
3. 0.5808
4. 0.9192
5. 0.9808
Question 10
A retailer finds that the demand for a popular board game averages 50 per week with a standard
deviation of 20. If the seller wishes to have adequate stock 99% of the time, how many games must
she keep on hand?
1. 81.0
2. 89.2
3. 50.0
4. 70.0
5. 96.6
Question 3
P (Z < 1.6) = 0.9452
Alternative (2)
Question 4
$P(Z > Z_1) = 0.8413$
$P(Z < Z_1) = 0.1587$, so $Z_1 = -1.00$
Alternative (4)
Question 5
$P(Z < z) = 0.9306$, so $z = 1.48$
Alternative (3)
Question 6
Option 1. Correct
$P(Z \ge 1.63) = 0.0516$
Option 2. Correct
$P(Z \ge 0.5) = 0.3085$
Option 3. Incorrect! Remember the area under the graph cannot be negative.
$P(Z < -1.63) = 0.0516$
Option 4. Correct
$P(Z > 1.28) = 0.1003$
Option 5. Correct
$P(-1 < Z < 1) = 0.6826$
Alternative (3)
Question 7
$P(22 \le X \le 25) = P\left(\frac{22 - 20}{5} \le Z \le \frac{25 - 20}{5}\right)$
$= P(0.4 \le Z \le 1)$
$= 0.8413 - 0.6554$
$= 0.1859$
Alternative (4)
Question 8
$P(X < 180) = P\left(Z < \frac{180 - 200.5}{10.5}\right)$
$= P(Z < -1.95)$
$= 0.0256$
Alternative (2)
Question 9
$P(X > 3.5) = P\left(Z > \frac{3.5 - 4.2}{0.5}\right)$
$= P(Z > -1.4)$
$= P(Z < 1.4)$
$= 0.9192$
Alternative (4)
Question 10
This question is about working backwards.
$P\left(Z < \frac{a - 50}{20}\right) = 0.99$
$\frac{a - 50}{20} = 2.33$
$a = 50 + (20 \times 2.33) = 96.6$
Alternative (5)
STUDY UNIT 7
STUDY CHAPTER 7
SAMPLING DISTRIBUTION
OBJECTIVES
At the end of this chapter, you should be able:
To understand the concept of the sampling distribution.
To compute probabilities related to the sample mean and the sample proportion.
To understand the importance of the Central Limit Theorem.
Inference means drawing conclusions about a population on the basis of data gathered from a
sample drawn from that population.
Activity 7.1
Overview
Study Skill
Draw a mind-map of the different sections/headings you will deal with in this study session. Then
page through the unit with the purpose of completing the map.
Sampling distribution
  - Sampling distribution of the mean
  - Sampling distribution of the proportion
In many situations the population is so large that you cannot gather information on every item.
Instead, statistical sampling procedures focus on collecting a small representative group of the larger
population. Analysing a sample is less time-consuming, less costly, and more practical than an
analysis of the entire population.
Activity 7.2
Concepts
Conceptual skill
Communication skill
Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline; can
you think of some examples?
English term | Description
Sample mean |
Population mean |
Inference |
Statistic |
Parameter |
Sample proportion |
Population proportion |
Sampling distribution |
Unbiased |
Standard error of the mean |
Standard error of proportion |
The sample mean ($\bar{X}$) is unbiased because the mean of all the possible sample means is equal to
the population mean ($\mu$). Alternatively, the sample mean ($\bar{X}$) is an unbiased estimator of the
population parameter because the expected value of the sample mean is equal to the population parameter:
$E(\bar{X}) = \mu$
The standard error of the mean is
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
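To see unbiasedness with your own eyes, you can list every possible sample from a tiny population and average the sample means. The sketch below does this in Python under two assumptions of mine: the population `[2, 4, 6, 8]` is made up purely for illustration, and samples are drawn with replacement so that the formula $\sigma_{\bar{X}} = \sigma/\sqrt{n}$ applies exactly.

```python
from itertools import product
from math import isclose, sqrt
from statistics import mean, pstdev

population = [2, 4, 6, 8]   # hypothetical population, for illustration only
n = 2                       # sample size

# Every possible ordered sample of size n, drawn with replacement.
sample_means = [mean(sample) for sample in product(population, repeat=n)]

mu = mean(population)                  # population mean
sigma = pstdev(population)             # population standard deviation
mean_of_means = mean(sample_means)     # mean of all possible sample means
se = pstdev(sample_means)              # standard deviation of the sample means

print(mean_of_means == mu)             # unbiasedness: E(X-bar) equals mu
print(isclose(se, sigma / sqrt(n)))    # standard error equals sigma / sqrt(n)
```

Both comparisons should come out true: the sample means centre on the population mean, and their spread is the population standard deviation divided by the square root of n.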
Activity 7.3
Question 1
Suppose a random sample of n = 25 observations is selected from a population that is normally
distributed with mean equal to 106 and the standard deviation equal to 12. Determine the mean and
the standard deviation of the sampling distribution of the sample mean X.
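Solution
Steps
1. Population standard deviation $\sigma = 12$
2. The sample size n = 25
The mean of the sampling distribution is equal to the population mean: $\mu_{\bar{X}} = \mu = 106$
The standard error of the mean is equal to $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{25}} = \frac{12}{5} = 2.4$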
Question 2
A random sample of n observations is selected from a population with a standard deviation $\sigma = 2$.
Calculate the standard error of the mean for these values of n:
(a) n = 5
(b) n = 49
Solution
(a) When n = 5
Standard deviation $\sigma = 2$
Standard error of the mean $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{5}} = \frac{2}{2.2361} = 0.8944$
(b) When n = 49
Standard deviation $\sigma = 2$
Standard error of the mean $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{49}} = \frac{2}{7} = 0.2857$
Question 3
Population A consists of all values of invoices of a certain company. The mean of the population A is
R350 and the standard deviation is R100. Population B consists of all samples of 16 values drawn
from population A. The mean of population B is
1. R100
2. R250
3. R350
4. R450
5. R25
Solution
The mean of the population A is equal to the mean of population B .
Option (3)
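If the population is normally distributed with the mean $\mu$ and the standard deviation $\sigma$, then
regardless of the sample size n, the sampling distribution of the mean is normally distributed with
the mean $\mu_{\bar{X}} = \mu$ and the standard error of the mean $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.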
How do you calculate the probability for the sampling distribution of the mean?
Steps
1. Determine the population mean $\mu$ and the sample mean $\bar{X}$
2. Determine the sample size n
3. Determine the value of the sample mean ($\bar{X}$) for which we want to determine the probability.
4. Find the value of Z, called the test statistic:
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ or $Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$
where $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
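The steps above can be sketched in a few lines of Python. Please treat this as an illustration rather than a prescribed method: the function names are hypothetical, the numbers ($\mu = 100$, $\sigma = 12$, $n = 36$, $\bar{X} = 95$) are illustrative inputs, and `statistics.NormalDist` from the standard library replaces the printed table.

```python
from math import sqrt
from statistics import NormalDist

def z_for_sample_mean(xbar, mu, sigma, n):
    """Test statistic Z = (xbar - mu) / (sigma / sqrt(n))."""
    return (xbar - mu) / (sigma / sqrt(n))

def prob_sample_mean_below(xbar, mu, sigma, n):
    """P(sample mean < xbar), assuming the sample mean is normally distributed."""
    return NormalDist().cdf(z_for_sample_mean(xbar, mu, sigma, n))

z = z_for_sample_mean(95, 100, 12, 36)
p = prob_sample_mean_below(95, 100, 12, 36)
print(z)            # standardized value of the sample mean 95
print(round(p, 4))  # area to the left of that Z-value
```

An area to the right, such as $P(\bar{X} > 102.2)$, is obtained as one minus the value returned by `prob_sample_mean_below`.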
Activity 7.4
Question 1
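Given a normal distribution with the population mean $\mu = 100$ and the standard deviation $\sigma = 12$, if
you select a sample of n = 36, what is the probability for the sample mean $\bar{X}$ in each of the
following cases?
Solution
1. $P(\bar{X} < 95)$?
Steps
1. The given information: $\mu = 100$, $\sigma = 12$, $n = 36$
2. The test statistic: $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{95 - 100}{12 / \sqrt{36}} = \frac{-5}{12/6} = \frac{-5}{2} = -2.5$
3. Determine the equivalent value of the sample mean for which we want to determine the
probability: $P(\bar{X} < 95) = P(Z < -2.5)$; now determine the area to the left of -2.5.
[Figure: standard normal curve; the area 0.0062 lies to the left of Z = -2.5]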
4. Find the value using the cumulative standard normal distribution table E.2 (from the Appendix):
Z     | 0.00   | 0.01   | 0.02   | 0.03   | 0.04   | 0.05   | 0.06   | 0.07   | 0.08   | 0.09
-6.0  | ...
-5.5  | ...
...
-2.5  | 0.0062 | 0.0060 | 0.0059 | 0.0057 | 0.0055 | 0.0054 | 0.0052 | 0.0051 | 0.0049 | 0.0048
...
$P(Z < -2.5) = 0.0062$
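2. $P(95 < \bar{X} < 97.2)$?
Steps
1. The given information: $\mu = 100$, $\sigma = 12$, $n = 36$
2. If $\bar{X} = 95$ then $Z = \frac{95 - 100}{12 / \sqrt{36}} = \frac{-5}{12/6} = -2.5$
If $\bar{X} = 97.2$ then $Z = \frac{97.2 - 100}{12 / \sqrt{36}} = \frac{-2.8}{12/6} = \frac{-2.8}{2} = -1.4$
3. $P(95 < \bar{X} < 97.2) = P(-2.5 < Z < -1.4)$; now we determine the area between -2.5 and -1.4.
[Figure: standard normal curve; the area to the left of Z = -2.5 is 0.0062, the area to the left of Z = -1.4 is 0.0808, and the area between them is 0.0746]
$P(-2.5 < Z < -1.4) = 0.0808 - 0.0062 = 0.0746$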
3. P(above 102.2) = $P(\bar{X} > 102.2)$?
The given information: $\mu = 100$, $\sigma = 12$, $n = 36$
Steps
(a) Use the transformation formula, called the test statistic: $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
(b) $Z = \frac{102.2 - 100}{12 / \sqrt{36}} = \frac{2.2}{12/6} = \frac{2.2}{2} = 1.1$
(c) Determine the equivalent value of the sample mean for which we want to determine the
probability: $P(\bar{X} > 102.2) = P(Z > 1.1)$; now determine the area to the right of 1.1.
[Figure: standard normal curve; the area to the left of Z = 1.1 is 0.8643 and the area to the right is 0.1357]
(d) Find the value using the cumulative standard normal distribution table E.2 (from the Appendix):
Z    | 0.00   | 0.01   | 0.02   | 0.03   | 0.04   | 0.05   | 0.06   | 0.07   | 0.08   | 0.09
0.0  | ...
0.1  | ...
...
1.1  | 0.8643 | 0.8665 | 0.8686 | 0.8708 | 0.8729 | 0.8749 | 0.8770 | 0.8790 | 0.8810 | 0.8830
...
$P(Z > 1.1) = 1 - 0.8643 = 0.1357$
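4. $P(\bar{X} > a) = 0.65$
The given information: $\mu = 100$, $\sigma = 12$, $n = 36$
$P\left(Z > \frac{a - \mu}{\sigma / \sqrt{n}}\right) = 0.65$, that is, $P\left(Z > \frac{a - 100}{12 / \sqrt{36}}\right) = 0.65$
Find the Z-value corresponding to 0.65 in the cumulative standard normal table by looking inside
the table.
Z    | 0.00   | 0.01   | 0.02   | 0.03   | 0.04   | 0.05   | 0.06   | 0.07   | 0.08   | 0.09
0.0  | ...
0.1  | ...
...
0.3  | 0.6179 | 0.6217 | 0.6255 | 0.6293 | 0.6331 | 0.6368 | 0.6406 | 0.6443 | 0.6480 | 0.6517
...
Using the table entry 0.6480, the Z-value is 0.38. Because the area of 0.65 lies to the right of a,
the value a lies below the mean and the required Z-value is negative:
$\frac{a - 100}{12 / \sqrt{36}} = -0.38$
$\frac{a - 100}{2} = -0.38$
$a = 100 - (2 \times 0.38) = 99.24$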
Question 2
Given an infinite population with a mean of 75 and a standard deviation of 12, the probability that the
mean of a sample of 36 observations, taken at random from this population, exceeds 78 is
1. 0.4332
2. 0.0668
3. 0.0987
4. 0.9013
5. 0.9332
Solution
$P(\bar{X} > 78)$?
Steps
1. The given information: $\mu = 75$, $\sigma = 12$, $n = 36$, $\bar{X} = 78$
2. The test statistic: $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{78 - 75}{12 / \sqrt{36}} = \frac{3}{12/6} = \frac{3}{2} = 1.5$
3. Determine the equivalent value of the sample mean for which we want to determine the
probability: $P(\bar{X} > 78) = P(Z > 1.5)$; now determine the area to the right of 1.5.
[Figure: standard normal curve; the area to the left of Z = 1.5 is 0.9332 and the area to the right is 0.0668]
4. Find the value using the cumulative standard normal distribution table E.2:
$P(Z > 1.5) = 1 - 0.9332 = 0.0668$
Option (2)
The Central Limit Theorem is important in using statistical inference to draw conclusions about a
population without having to know the specific shape of the population distribution. The theorem
states that the sum (and therefore the mean) of a large number of independent observations from the
same distribution has, under certain general conditions, an approximately normal distribution.
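A small simulation can make the theorem less abstract. The sketch below is my own illustration, not something taken from the prescribed book: it draws repeated samples of size 36 from a strongly skewed (exponential) population with mean 1 and standard deviation 1, and then looks at how the sample means behave. The seed value and the number of repetitions are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(1610)   # fixed seed so that the run is repeatable

def sample_mean(n):
    """Mean of n independent draws from an exponential population (mean 1, sd 1)."""
    return mean(rng.expovariate(1.0) for _ in range(n))

n = 36
means = [sample_mean(n) for _ in range(5000)]

centre = mean(means)        # should be close to the population mean, 1
spread = stdev(means)       # should be close to sigma / sqrt(n) = 1/6

# Share of sample means within 1.96 standard errors of the population mean;
# for an approximately normal distribution this is close to 0.95.
se = 1.0 / sqrt(n)
within = sum(abs(m - 1.0) <= 1.96 * se for m in means) / len(means)

print(round(centre, 3), round(spread, 3), round(within, 3))
```

Even though the individual observations are far from normal, the sample means cluster symmetrically around the population mean with a spread close to $\sigma/\sqrt{n}$.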
The standard error of the proportion is
$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}$
In many instances, you can use the normal distribution to estimate the sampling distribution of the
proportion. When the parameter $\pi$ is unknown, then
$\sigma_p = \sqrt{\frac{p(1 - p)}{n}}$
When do you assume that the sampling distribution of the proportion is approximately normally
distributed?
It is when $n\pi$ and $n(1 - \pi)$ are each at least 5.
How do you calculate the probability for the sampling distribution of the proportion?
Steps
1. Determine the population proportion $\pi$ and the sample proportion p
2. Determine the sample size n
3. Determine the value of the sample proportion (p) for which we want to determine the probability.
4. Find the value of Z, called the test statistic:
$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}}$ or $Z = \frac{p - \pi}{\sigma_p}$
where the standard error of the proportion is
$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}$ when $\pi$ is known
$\sigma_p = \sqrt{\frac{p(1 - p)}{n}}$ when $\pi$ is unknown
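As a rough computer check of these steps, consider the Python sketch below. The function names are hypothetical, and the inputs ($\pi = 0.9$, $n = 100$, $p = 0.96$) are simply illustrative values; the code assumes that the normal approximation condition holds, which is why it also contains a small helper that tests it.

```python
from math import sqrt
from statistics import NormalDist

def standard_error_of_proportion(pi, n):
    """sigma_p = sqrt(pi * (1 - pi) / n), for a known population proportion pi."""
    return sqrt(pi * (1 - pi) / n)

def z_for_sample_proportion(p, pi, n):
    """Test statistic Z = (p - pi) / sigma_p."""
    return (p - pi) / standard_error_of_proportion(pi, n)

def normal_approximation_ok(pi, n):
    """Rule of thumb: n*pi and n*(1 - pi) must each be at least 5."""
    return n * pi >= 5 and n * (1 - pi) >= 5

se = standard_error_of_proportion(0.9, 100)
z = z_for_sample_proportion(0.96, 0.9, 100)
upper_tail = 1 - NormalDist().cdf(z)       # P(p >= 0.96)

print(normal_approximation_ok(0.9, 100))
print(round(se, 4), round(z, 2), round(upper_tail, 4))
```

Rounding is applied only when printing, so small floating-point differences do not affect the comparison with the table values.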
Activity 7.5
Question 1
In a random sample of 64 people, 48 are classified as "successful".
(a) Determine the sample proportion, p, of "successful" people.
(b) If the population proportion is 0.80, determine the standard error of the proportion.
Solution
(a) The sample proportion $p = \frac{X}{n} = \frac{48}{64} = 0.75$
(b) The standard error of the proportion $\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.80(1 - 0.80)}{64}} = 0.05$
Question 2
Suppose that we will randomly select a sample of n = 100 units from a population and that we
will compute the sample proportion p of these units that fall into a category of interest. Suppose
further that the true population proportion $\pi$ equals 0.9.
(a) Find the mean and the standard deviation of the sampling distribution of p.
(b) Calculate the following probabilities about the sample proportion p.
(i) $P(p \ge 0.96)$
(ii) $P(0.855 \le p \le 0.945)$
(iii) $P(p \le 0.915)$
Solution
(a) The population of all possible sample proportions has mean $\mu_p = \pi = 0.9$
The standard deviation $\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.9(1 - 0.9)}{100}} = \sqrt{0.0009} = 0.03$
(b) (i) $P(p \ge 0.96)$
Steps
1. The population proportion $\pi = 0.9$ and the sample proportion p = 0.96
2. The sample size n = 100
3. The value of Z, called the test statistic:
$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}}$ or $Z = \frac{p - \pi}{\sigma_p} = \frac{0.96 - 0.9}{0.03} = 2$
4. $P(p \ge 0.96) = P(Z \ge 2) = 0.0228$
[Figure: standard normal curve; the area 0.0228 lies to the right of Z = 2]
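(ii) $P(0.855 \le p \le 0.945)$?
Steps
1. The population proportion $\pi = 0.9$
2. The sample proportions p = 0.855 and p = 0.945
3. The sample size n = 100
4. The value of Z, called the test statistic:
if p = 0.855 then $Z = \frac{p - \pi}{\sigma_p} = \frac{0.855 - 0.9}{0.03} = -1.5$
if p = 0.945 then $Z = \frac{p - \pi}{\sigma_p} = \frac{0.945 - 0.9}{0.03} = 1.5$
[Figure: standard normal curve; the area to the left of Z = -1.5 is 0.0668 and the area to the left of Z = 1.5 is 0.9332]
5. $P(0.855 \le p \le 0.945) = P(-1.5 \le Z \le 1.5) = P(Z \le 1.5) - P(Z \le -1.5) = 0.9332 - 0.0668 = 0.8664$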
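(iii) $P(p \le 0.915)$?
Steps
1. The population proportion $\pi = 0.9$ and the sample proportion p = 0.915
2. The sample size n = 100
3. The value of Z, called the test statistic:
$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}}$ or $Z = \frac{p - \pi}{\sigma_p} = \frac{0.915 - 0.9}{0.03} = 0.5$
4. $P(p \le 0.915) = P(Z \le 0.5) = 0.6915$
[Figure: standard normal curve; the area to the left of Z = 0.5 is 0.6915 and the area to the right is 0.3085]
SELF-ASSESSMENT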
Question 1
Time spent using e-mail per session is normally distributed, with a population mean of 8 minutes and
population standard deviation of 2 minutes. Select a random sample of 16 sessions,
(a) What is the probability that the sample mean is between 7.8 and 8.2 minutes?
(b) If you select a random sample of 100 sessions, what is the probability that the sample mean is
between 7.8 and 8.2 minutes?
Question 2
Consider an infinite population with a mean of 160 and a standard deviation of 25. A random sample
of size 64 is taken from this population. The standard error of the mean equals
1. 0.391
2. 6.4
3. 2.50
4. 9.766
5. 3.125
Question 3
The standard error of the mean is
1. the standard deviation of the sampling distribution.
2. the squared value of the population variance.
3. the same value as the population standard deviation.
4. the same for distributions of all sample sizes.
5. the mean of the sampling distribution
Question 4
A manufacturing company packages peanuts for Piedmont Airlines. The individual packages weigh
1.4 grams with a standard deviation of 0.6 grams. For a flight of 152 passengers receiving the peanuts,
the probability that the average weight of the packages is less than 1.3 grams is
1. 0.0202
2. 0.2040
3. 0.9798
4. 0.4798
5. 2.0500
Question 5
The fill amount of bottles of a soft drink is normally distributed, with a mean of 2.0 liters and a
standard deviation of 0.06 liter. If you select a random sample of 36 bottles, what is the probability
that the sample mean will be
1. between 1.99 and 2.0 liters?
2. below 1.98 liters?
3. greater than 2.01 liters?
4. The probability is 99% that the sample mean amount of soft drink will be at least how much?
5. The probability is 99% that the sample mean amount of soft drink will be between which two
values?
SELF-ASSESSMENT
Question 1
In each of the following cases, find the mean, variance, and the standard deviation of the sampling
distribution of the sample proportion p.
(a) $\pi = 0.5$, n = 250
Question 3
According to Gallup's poll on personal finances, 46% of U.S. workers say they feel they will
have enough money to live comfortably when they retire. If you select a random sample of 200 U.S.
workers,
(a) what is the probability that the sample will have between 45% and 55% who say they have
enough money to live comfortably now and expect to do so in future?
(b) the probability is 90% that the sample percentage will be contained within what symmetrical limits
of the population percentage?
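Question 1
(a) The given information: $\mu = 8$, $\sigma = 2$, $n = 16$
Steps
1. The transformation formula is the test statistic $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
2. If $\bar{X} = 7.8$ then $Z = \frac{7.8 - 8}{2 / \sqrt{16}} = \frac{-0.2}{2/4} = \frac{-0.2}{0.5} = -0.4$
If $\bar{X} = 8.2$ then $Z = \frac{8.2 - 8}{2 / \sqrt{16}} = \frac{0.2}{2/4} = \frac{0.2}{0.5} = 0.4$
3. $P(7.8 < \bar{X} < 8.2) = P(-0.4 < Z < 0.4)$; now determine the area between -0.4 and 0.4.
[Figure: standard normal curve; the area to the left of Z = -0.4 is 0.3446 and the area to the left of Z = 0.4 is 0.6554]
$P(-0.4 < Z < 0.4) = P(Z < 0.4) - P(Z < -0.4) = 0.6554 - 0.3446 = 0.3108$
(b) The given information: $\mu = 8$, $\sigma = 2$, $n = 100$
Steps
1. The test statistic $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
2. If $\bar{X} = 7.8$ then $Z = \frac{7.8 - 8}{2 / \sqrt{100}} = \frac{-0.2}{2/10} = \frac{-0.2}{0.2} = -1.0$
If $\bar{X} = 8.2$ then $Z = \frac{8.2 - 8}{2 / \sqrt{100}} = \frac{0.2}{2/10} = \frac{0.2}{0.2} = 1.0$
3. $P(7.8 < \bar{X} < 8.2) = P(-1.0 < Z < 1.0)$; now determine the area between -1.0 and 1.0.
$P(-1.0 < Z < 1.0) = P(Z < 1.0) - P(Z < -1.0) = 0.8413 - 0.1587 = 0.6826$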
Question 2
Steps
1. Population standard deviation $\sigma = 25$
2. The sample size n = 64
The standard error of the mean is equal to $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{25}{\sqrt{64}} = \frac{25}{8} = 3.125$
Option 5
Question 3
Option 1
Question 4
$P(\bar{X} < 1.3)$?
Steps
1. The given information: $\mu = 1.4$, $\sigma = 0.6$, $n = 152$, $\bar{X} = 1.3$
2. The test statistic: $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{1.3 - 1.4}{0.6 / \sqrt{152}} = \frac{-0.1}{0.6 / 12.3288} = \frac{-0.1}{0.0487} = -2.05$
3. Determine the equivalent value of the sample mean for which we want to determine the
probability: $P(\bar{X} < 1.3) = P(Z < -2.05)$; now determine the area to the left of -2.05.
[Figure: standard normal curve; the area 0.0202 lies to the left of Z = -2.05]
4. Find the value using the cumulative standard normal distribution table E.2 (from the Appendix):
$P(Z < -2.05) = 0.0202$
Option 1
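Question 5
(a) $P(1.99 < \bar{X} < 2.0)$?
Steps
1. The test statistic $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
2. When $\bar{X} = 1.99$ then $Z = \frac{1.99 - 2.0}{0.06 / \sqrt{36}} = \frac{-0.01}{0.06/6} = \frac{-0.01}{0.01} = -1$
When $\bar{X} = 2.0$ then $Z = \frac{2.0 - 2.0}{0.06 / \sqrt{36}} = \frac{0}{0.06/6} = \frac{0}{0.01} = 0$
3. $P(1.99 < \bar{X} < 2.0) = P(-1 < Z < 0) = P(Z < 0) - P(Z < -1)$; now determine the area.
[Figure: standard normal curve; the area between Z = -1 and Z = 0 is 0.3413]
4. The values from the cumulative standard normal distribution table E.2:
$P(Z < 0) = 0.5$
$P(Z < -1) = 0.1587$
$P(-1 < Z < 0) = 0.5 - 0.1587 = 0.3413$
(b) $P(\bar{X} < 1.98)$?
1. The test statistic $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{1.98 - 2.0}{0.06 / \sqrt{36}} = \frac{-0.02}{0.06/6} = \frac{-0.02}{0.01} = -2$
2. $P(\bar{X} < 1.98) = P(Z < -2) = 0.0228$; this is the area to the left of -2.
[Figure: standard normal curve; the area 0.0228 lies to the left of Z = -2]
(c) $P(\bar{X} > 2.01)$?
1. The test statistic $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{2.01 - 2.0}{0.06 / \sqrt{36}} = \frac{0.01}{0.06/6} = \frac{0.01}{0.01} = 1$
2. $P(\bar{X} > 2.01) = P(Z > 1) = 1 - 0.8413 = 0.1587$; this is the area to the right of 1.
[Figure: standard normal curve; the area 0.1587 lies to the right of Z = 1]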
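(d) We need the value A such that $P(\bar{X} \ge A) = 0.99$. An area of 0.99 to the right of A leaves 0.01
to the left of it, and the Z-value corresponding to a cumulative area of 0.01 is -2.33 (using the
cumulative standardized normal table).
$\bar{X} = \mu + Z\frac{\sigma}{\sqrt{n}}$
$A = 2.0 - 2.33 \times \frac{0.06}{\sqrt{36}}$
$= 2.0 - 0.0233$
$= 1.9767$
(e) The area between A and B equals 0.99, which leaves 0.005 in each tail. The Z-values
corresponding to the cumulative areas 0.005 and 0.995 are -2.58 and 2.58.
The values A and B associated with the known probability are given by
$A = \mu - Z\frac{\sigma}{\sqrt{n}} = 2.0 - 2.58 \times \frac{0.06}{\sqrt{36}} = 1.9742$
$B = \mu + Z\frac{\sigma}{\sqrt{n}} = 2.0 + 2.58 \times \frac{0.06}{\sqrt{36}} = 2.0258$
$P(1.9742 < \bar{X} < 2.0258) = 0.99$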
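Question 1
(a) $\pi = 0.5$, n = 250
Mean $\mu_p = \pi = 0.5$
Variance $\sigma_p^2 = \frac{\pi(1 - \pi)}{n} = \frac{0.5(1 - 0.5)}{250} = 0.001$
Standard deviation $\sigma_p = \sqrt{0.001} = 0.0316$
(b) $\pi = 0.98$, n = 1000
Mean $\mu_p = \pi = 0.98$
Variance $\sigma_p^2 = \frac{\pi(1 - \pi)}{n} = \frac{0.98(1 - 0.98)}{1000} = 0.0000196$
Standard deviation $\sigma_p = \sqrt{0.0000196} = 0.0044$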
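Question 2
(a) Population proportion $\pi = 0.501$
The given information: p = 0.55, n = 100
$P(p > 0.55)$?
Steps
1. The test statistic $Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}} = \frac{0.55 - 0.501}{\sqrt{\frac{0.501(1 - 0.501)}{100}}} = \frac{0.049}{\sqrt{0.0025}} = \frac{0.049}{0.05} = 0.98$
2. $P(p > 0.55) = P(Z > 0.98)$; now determine the area to the right of 0.98.
[Figure: standard normal curve; the area to the left of Z = 0.98 is 0.8365]
3. The value using the cumulative standard normal distribution table is
$P(Z > 0.98) = 1 - P(Z < 0.98)$
$= 1 - 0.8365$
$= 0.1635$
(b) Population proportion $\pi = 0.49$
The given information: p = 0.55, n = 100
Steps
1. The test statistic $Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}} = \frac{0.55 - 0.49}{\sqrt{\frac{0.49(1 - 0.49)}{100}}} = \frac{0.06}{0.05} = 1.20$
2. $P(p > 0.55) = P(Z > 1.20)$; now determine the area to the right of 1.20.
[Figure: standard normal curve; the area to the left of Z = 1.20 is 0.8849]
3. $P(Z > 1.20) = 1 - 0.8849 = 0.1151$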
Question 3
(a) Steps
1. The test statistic $Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}}$
when p = 0.45 then $Z = \frac{0.45 - 0.46}{\sqrt{\frac{0.46(1 - 0.46)}{200}}} = \frac{-0.01}{0.0352} = -0.2841$
when p = 0.55 then $Z = \frac{0.55 - 0.46}{\sqrt{\frac{0.46(1 - 0.46)}{200}}} = \frac{0.09}{0.0352} = 2.5568$
2. $P(0.45 < p < 0.55) = P(-0.2841 < Z < 2.5568)$; now determine the area between
-0.2841 and 2.5568.
[Figure: standard normal curve; the area to the left of Z = -0.28 is 0.3897 and the area to the left of Z = 2.56 is 0.9948]
3. The value using the cumulative standard normal distribution table is
$P(-0.2841 < Z < 2.5568) = P(Z < 2.5568) - P(Z < -0.2841)$
$= 0.9948 - 0.3897$
$= 0.6051$
(b) The area between A and B represents 0.90. The Z-value corresponding to the area 0.90 is 1.645.
The values A and B associated with the known probability are given by
$A = \pi - Z\sqrt{\frac{\pi(1 - \pi)}{n}} = 0.46 - 1.645\sqrt{\frac{0.46(1 - 0.46)}{200}} = 0.4021$
$B = \pi + Z\sqrt{\frac{\pi(1 - \pi)}{n}} = 0.46 + 1.645\sqrt{\frac{0.46(1 - 0.46)}{200}} = 0.5179$
STUDY UNIT 8
STUDY CHAPTER 8
CONFIDENCE INTERVAL ESTIMATION
OBJECTIVES
At the end of this chapter, you should be able to construct and interpret confidence interval estimates
for the mean and the proportion.
Activity 8.1
Overview
Study Skill
Draw a mind-map of the different sections/headings you will deal with in this study session. Then
page through the unit with the purpose of completing the map.
Confidence interval estimate
  - Confidence interval estimate for the mean
      - Confidence interval estimate when $\sigma$ is known
      - Confidence interval estimate when $\sigma$ is unknown
  - Confidence interval estimate for the proportion
There are two options to consider for the confidence interval estimate of the mean, depending on
whether the population standard deviation is known or unknown.
Activity 8.2
Concepts
Conceptual Skill
Communication skill
Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline; can
you think of some examples?
English term | Description
Confidence level |
Confidence interval |
Point estimate |
Population standard deviation known |
Population standard deviation unknown |
Critical value |
Degrees of freedom |
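Commonly used critical values:
Confidence level | $Z_{\alpha/2}$ (two-tailed) | Z (one-tailed)
99% | 2.58 | 2.33
95% | 1.96 | 1.645
90% | 1.645 | 1.28
The confidence interval estimate for the mean ($\sigma$ known) is
$\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
or
$\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$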
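The interval formula above translates almost word for word into Python. The sketch below is one possible way to write it, with a hypothetical function name `z_interval`; it computes the critical value with `statistics.NormalDist` instead of reading it from the table, so it uses 1.95996 where the table gives 1.96. The inputs (sample mean 952, $\sigma = 396$, n = 40) are illustrative values.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sd (sigma) is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    margin = z * sigma / sqrt(n)              # half-width of the interval
    return xbar - margin, xbar + margin

low, high = z_interval(952, 396, 40, confidence=0.95)
print(round(low, 2), round(high, 2))
```

Changing `confidence` to 0.99 or 0.90 gives the other two-tailed critical values, roughly 2.58 and 1.645.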
Activity 8.3
Question 1
The owner of a large shopping centre is besieged with complaints about the shortage of parking space.
He feels that the 1 000 spaces are adequate. In an effort to address the problem, he obtains a sample of
the average number of cars on the parking lot during prime hours. The sample of 40 has a mean of
952. Assume a population standard deviation of 396. The 95% confidence interval estimate for prime
hour parking is
1. 790.46 to 1112.54
2. 849.31 to 1054.69
3. 829.28 to 1074.72
4. 932.60 to 971.40
5. 952.00 to 1052.00
Solution
Steps
$\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
$952 \pm 1.96 \times \frac{396}{\sqrt{40}}$
$952 \pm 122.7217$
(952 - 122.7217 , 952 + 122.7217)
(829.2783 , 1074.7217)
Option (3)
Question 2
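If $\bar{X} = 120$, $\sigma = 24$ and n = 36, construct a 99% confidence interval estimate of the population
mean $\mu$.
Solution
Steps
1. Sample mean $\bar{X} = 120$
2. Population standard deviation $\sigma = 24$
3. Sample size n = 36
4. Use the critical value $Z_{\alpha/2} = 2.58$ for the 99% confidence interval estimate
5. Substitute the values into the Z formula: $120 \pm 2.58 \times \frac{24}{\sqrt{36}}$
$120 \pm 10.32$
(120 - 10.32 , 120 + 10.32)
(109.68 , 130.32)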
The confidence interval estimate for the mean ($\sigma$ unknown) is
$\bar{X} \pm t_{(n-1,\ \alpha/2)}\frac{S}{\sqrt{n}}$
or
$\left(\bar{X} - t_{(n-1,\ \alpha/2)}\frac{S}{\sqrt{n}},\ \bar{X} + t_{(n-1,\ \alpha/2)}\frac{S}{\sqrt{n}}\right)$
Steps
1. Determine the sample mean $\bar{X}$
2. Determine the sample standard deviation S
3. Determine the sample size n
4. Determine the degrees of freedom df = n - 1
5. Find the critical value $t_{(n-1,\ \alpha/2)}$ using the Student's t table
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown)
Activity 8.4
Question 1
For a selected month, the average number of kilowatt hours used by 49 residential customers is 1160
and the standard deviation S is 1085. Assume that the t-value for a 95% confidence interval
is 1.6772. Determine the confidence interval estimate for the true mean.
Solution
Steps
1. The sample mean $\bar{X} = 1160$
2. The sample standard deviation S = 1085
3. The sample size n = 49
4. The degrees of freedom df = n - 1 = 49 - 1 = 48
5. The critical value equals 1.6772
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown) formula
$1160 \pm 1.6772 \times \frac{1085}{\sqrt{49}}$
$1160 \pm 259.966$
(1160 - 259.966 , 1160 + 259.966)
(900.034 , 1419.966)
Question 2
A stationery store wants to estimate the mean retail value of greeting cards that it has in its inventory.
A random sample of 100 greeting cards indicates a mean value of R2.65 and a standard deviation
of R0.44. Assuming a normal distribution, construct a 95% confidence interval estimate of the mean
value of all greeting cards in the store's inventory.
Solution
Steps
1. The sample mean $\bar{X} = 2.65$
2. The sample standard deviation S = 0.44
3. The sample size n = 100
4. The degrees of freedom df = n - 1 = 100 - 1 = 99
5. The critical value at $t_{(99,\ 0.025)}$ equals 1.9842
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown) formula
$2.65 \pm 1.9842 \times \frac{0.44}{\sqrt{100}}$
$2.65 \pm 0.0873$
(2.65 - 0.0873 , 2.65 + 0.0873)
(2.5627 , 2.7373)
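The six steps can also be sketched in Python. Because the Python standard library has no t table, the sketch below assumes that you look up the critical value yourself and pass it in as `t_crit`; the function name `t_interval` is hypothetical, and the inputs are the greeting card values from Question 2.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """Confidence interval for the mean when sigma is unknown.

    t_crit is the critical value read from the t table for n - 1 degrees
    of freedom (it is supplied by the user, not computed here).
    """
    margin = t_crit * s / sqrt(n)   # half-width of the interval
    return xbar - margin, xbar + margin

# Greeting cards: sample mean 2.65, S = 0.44, n = 100, t(99, 0.025) = 1.9842
low, high = t_interval(2.65, 0.44, 100, 1.9842)
print(round(low, 4), round(high, 4))
```

Passing the critical value in explicitly keeps the sketch honest about where that number comes from: the table, not the code.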
Let us recall that in study unit 7 the probability of a sampling distribution of the mean was calculated
using the value $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$. If the population standard deviation is unknown, the following
statistic is used:
$t = \frac{\bar{X} - \mu}{S / \sqrt{n}}$
This expression has the same form as the Z statistic, except that S is used to estimate the unknown $\sigma$.
SELF-ASSESSMENT
Question 1
Your statistics instructor wants you to determine a confidence interval estimate for the mean test
score. Past experience indicates that test scores are normally distributed with a sample mean of
160 and a population standard deviation of 45. A 95% confidence interval estimate if your group has 36
students is:
1. 145.3 to 174.7
2. 157.55 to 162.45
3. 152.5 to 167.5
4. 158.75 to 161.25
5. 160 to 174.7
Question 2
If $\bar{X} = 70$, S = 24 and n = 36, and assuming that the population is normally distributed, construct a
95% confidence interval estimate of the population mean $\mu$.
Question 3
The data represent the overall miles per gallon (MPG) of 2008 SUVs priced under $30 000.
23 17 20 21 21 18 22 18 18 18 18
17 17 17 17 16 19 20 19 16 19 22
Construct a 95% confidence interval estimate for the population mean miles per gallon of 2008 SUVs
priced under $30 000 assuming a normal distribution.
SOLUTIONS FOR SELF-ASSESSMENT
Question 1
Steps
1. Sample mean $\bar{X} = 160$
2. Population standard deviation $\sigma = 45$
3. Sample size n = 36
4. Use the critical value $Z_{\alpha/2} = 1.96$ for the 95% confidence interval estimate
5. Substitute the values into the Z formula: $160 \pm 1.96 \times \frac{45}{\sqrt{36}}$
$160 \pm 14.7$
(160 - 14.7 , 160 + 14.7)
(145.3 , 174.7)
Option (1)
Question 2
Steps
1. The sample mean $\bar{X} = 70$
2. The sample standard deviation S = 24
3. The sample size n = 36
4. The degrees of freedom df = n - 1 = 36 - 1 = 35
5. The critical value at $t_{(35,\ 0.025)}$ equals 2.0301
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown) formula
$70 \pm 2.0301 \times \frac{24}{\sqrt{36}}$
$70 \pm 8.1204$
(70 - 8.1204 , 70 + 8.1204)
(61.8796 , 78.1204)
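Question 3
Steps
1. The sample mean $\bar{X} = \frac{\sum X_i}{n} = \frac{23 + 20 + 21 + 22 + \dots + 20 + 16 + 22}{22} = \frac{413}{22} = 18.7727$
2. The sample standard deviation $S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$
$S^2 = \frac{(23 - 18.7727)^2 + (20 - 18.7727)^2 + \dots + (22 - 18.7727)^2}{22 - 1} = \frac{85.8636}{21} = 4.0887$
$S = \sqrt{4.0887} = 2.0221$
3. The sample size n = 22
4. The degrees of freedom df = n - 1 = 22 - 1 = 21
5. The critical value at $t_{(21,\ 0.025)}$ equals 2.0796
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown) formula
$18.7727 \pm 2.0796 \times \frac{2.0221}{\sqrt{22}}$
$18.7727 \pm 0.8965$
(18.7727 - 0.8965 , 18.7727 + 0.8965)
(17.8762 , 19.6692)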
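If you would like to check the arithmetic of this solution, the sample mean and the sample standard deviation can be computed with Python's standard `statistics` module, as in the sketch below. The critical value 2.0796 is taken from the t table (21 degrees of freedom) and typed in by hand; nothing in the code computes it.

```python
from math import sqrt
from statistics import mean, stdev

# Miles per gallon of the 22 SUVs in the data set.
mpg = [23, 17, 20, 21, 21, 18, 22, 18, 18, 18, 18,
       17, 17, 17, 17, 16, 19, 20, 19, 16, 19, 22]

n = len(mpg)
xbar = mean(mpg)    # sample mean
s = stdev(mpg)      # sample standard deviation (divides by n - 1)

t_crit = 2.0796     # t table value for df = 21 and 0.025 in each tail
margin = t_crit * s / sqrt(n)

print(n, round(xbar, 4), round(s, 4))
print(round(xbar - margin, 4), round(xbar + margin, 4))
```

Note that `stdev` divides by n - 1, which matches the formula for S used in step 2, whereas `pstdev` would divide by n.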
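The sample proportion is
$p = \frac{X}{n}$
where X is the number of items in the sample having the characteristic of interest and
n is the sample size.
The confidence interval estimate for the proportion is
$p \pm Z_{\alpha/2}\sqrt{\frac{p(1 - p)}{n}}$
where
p is the sample proportion
$Z_{\alpha/2}$ is the critical value found from the standardized normal distribution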
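The formula for the proportion can be sketched in Python in much the same way as the one for the mean. The function name `proportion_interval` is hypothetical, and the critical value is computed with `statistics.NormalDist` rather than read from the table, so the last decimal may differ slightly from a hand calculation. The inputs (77 successes out of 102) are illustrative values.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, confidence=0.95):
    """Confidence interval for a population proportion from x successes in n trials."""
    p = x / n                                   # sample proportion
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)     # two-tailed critical value
    margin = z * sqrt(p * (1 - p) / n)          # half-width of the interval
    return p - margin, p + margin

low, high = proportion_interval(77, 102, confidence=0.95)
print(round(low, 4), round(high, 4))
```

The same function with `confidence=0.90` uses a critical value of about 1.645, which is slightly smaller than the rounded 1.65 that is sometimes prescribed.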
Activity 8-4
Question 1
Companies are spending more time screening applicants than in the past. A study of 102 recruiters
conducted by ExecuNet found that 77 did internet research on candidates. Construct a 95%
confidence interval estimate of the population proportion of recruiters who do internet research on
candidates.
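Solution
Steps
1. The sample proportion $p = \frac{X}{n} = \frac{77}{102} = 0.7549$
2. Substitute the values into the confidence interval estimate for the proportion formula
$0.7549 \pm 1.96\sqrt{\frac{0.7549(1 - 0.7549)}{102}}$
$0.7549 \pm 0.0835$
(0.7549 - 0.0835 , 0.7549 + 0.0835)
(0.6714 , 0.8384)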
Question 2
Closed caption movies allow the hearing impaired to enjoy the dialogue as well as the acting. A
local organization for the hearing impaired members of the community takes a random sample of 100
movie listings offered by the cable television company in order to estimate the proportion of closed
caption movies offered. Fourteen movies were closed captioned. The cable television company says
at least 5% of the movies shown are captioned. Use $Z_{\alpha/2} = 1.65$ to prepare a 90% confidence interval
estimate for the true proportion, and comment on the cable television company's claim.
Solution
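Steps
1. The sample proportion $p = \frac{X}{n} = \frac{14}{100} = 0.14$
2. The sample size $n = 100$
3. The critical value given in the question is $Z_{\alpha/2} = 1.65$
4. Substitute the values into the confidence interval estimate for proportion formula:
$p \pm Z_{\alpha/2} \sqrt{\frac{p(1 - p)}{n}} = 0.14 \pm 1.65 \sqrt{\frac{0.14(1 - 0.14)}{100}} = 0.14 \pm 0.0573$
$(0.14 - 0.0573,\ 0.14 + 0.0573) = (0.0827,\ 0.1973)$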
The interval for the population proportion is 0.0827 to 0.1973, or approximately 0.08 to 0.20. The
organization for hearing-impaired people can therefore be 90% confident that the proportion of
closed caption movies offered is somewhere between 0.08 (8%) and 0.20 (20%). Because the whole
interval lies above 0.05, the data support the cable television company's claim that at least 5% of
the movies it shows are closed captioned.
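The interval for a proportion, and the comparison with the company's 5% claim, can be sketched as follows. The names `proportion_interval` and `claimed` are illustrative assumptions; the critical value 1.65 is the one given in the question.

```python
import math

def proportion_interval(x, n, z):
    """Confidence interval for a population proportion (illustrative helper)."""
    p = x / n                                    # sample proportion
    margin = z * math.sqrt(p * (1 - p) / n)      # Z * sqrt(p(1 - p)/n)
    return p - margin, p + margin

lower, upper = proportion_interval(x=14, n=100, z=1.65)
claimed = 0.05  # "at least 5%" claim from the question
print(round(lower, 4), round(upper, 4))
print("interval lies entirely above the claimed 5%:", lower > claimed)
```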
SELFASSESSMENT
The owner of a restaurant that serves continental food wants to study characteristics of his
customers. He decides to focus on two variables: the amount of money spent by customers and
whether customers order dessert. The results from a sample of 60 customers are as follows:
Amount spent: $\bar{X}$ = R38.54 and S = R7.26. Dessert: 18 customers purchased dessert.
1. Construct a 95% confidence interval estimate of the population mean amount spent per customer
in the restaurant.
2. Construct a 90% confidence interval estimate of the population proportion of customers who
purchase dessert.
Solution
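1. Steps
1. The sample mean $\bar{X} = 38.54$
2. The sample standard deviation $S = 7.26$
3. The sample size $n = 60$
4. The degrees of freedom $df = n - 1 = 60 - 1 = 59$
5. The critical value $t_{(59,\ 0.025)}$ equals 2.0010
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown) formula:
$\bar{X} \pm t_{(n-1,\ \alpha/2)} \frac{S}{\sqrt{n}} = 38.54 \pm 2.001 \times \frac{7.26}{\sqrt{60}} = 38.54 \pm 1.8755$
$(38.54 - 1.8755,\ 38.54 + 1.8755) = (36.6645,\ 40.4155)$
2. Steps
1. The sample proportion $p = \frac{X}{n} = \frac{18}{60} = 0.3$
2. The sample size $n = 60$
3. The critical value for a 90% confidence interval estimate is $Z_{\alpha/2} = 1.645$
4. Substitute the values into the confidence interval estimate for proportion formula:
$p \pm Z_{\alpha/2} \sqrt{\frac{p(1 - p)}{n}} = 0.3 \pm 1.645 \sqrt{\frac{0.3(1 - 0.3)}{60}} = 0.3 \pm 0.0973$
$(0.3 - 0.0973,\ 0.3 + 0.0973) = (0.2027,\ 0.3973)$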
STUDY UNIT 9
STUDY CHAPTER 9
HYPOTHESIS TESTING
Conceptual Skill
Communication Skill
Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline; can
you think of some examples?
English term                      Description
Acceptance region
Rejection region
Critical value
Two-tailed test
One-tailed test
Significance level of the test
The null hypothesis
The alternative hypothesis
Examples:
1. Suppose the question asks whether the waiting time to place an order has changed in the past month
from its previous population mean of 4.5 minutes.
If the population mean is $\mu$, then
H1: $\mu \neq 4.5$
and
H0: $\mu = 4.5$, which means that the population mean is equal to 4.5.
Alternatively, the question might ask whether there is sufficient evidence to conclude that the waiting time
to place an order is not equal to (or is different from) the previous population mean $\mu = 4.5$.
In both cases you have to perform a two-tailed test.
2. If the question asks whether there is sufficient evidence to conclude that the population mean is greater
than 4.5, then
H1: $\mu > 4.5$
and
H0: $\mu = 4.5$
Step 1
State the null hypothesis H0: population parameter = hypothesized value
Step 2
State the alternative hypothesis H1, which summarizes what will be the case if the null hypothesis is not
true. It can assume one of three possible forms:
(a) H1: population parameter $\neq$ hypothesized value
(b) H1: population parameter < hypothesized value
(c) H1: population parameter > hypothesized value
Step 3
Choose the level of significance ($\alpha$). It provides a probability basis for deciding whether an
observed difference between a sample statistic and a hypothesized value is a chance difference or a
statistically significant difference.
Step 4
Determine the appropriate test statistic and compute the value.
Step 5
Determine the critical values that divide the rejection and nonrejection regions.
Step 6
State the decision rule.
The decision rule is a statement that indicates the action to be taken, that is, to fail to reject H0 or
to reject H0.
Reject H0 when the value of the test statistic is greater than the critical value at a specific significance
level.
The p-value is the lowest level of significance at which the null hypothesis can be rejected. It is
determined from the value of the test statistic.
Step 7
State the conclusion. This conclusion should be stated in the context of the problem, and the level of
significance should be included.
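The decision rule in Step 6 can be sketched as a small Python function. This is a hedged illustration only: the function `decide`, its `tail` argument and the numbers in the example calls are hypothetical and do not come from this guide. The critical value is passed in as the positive table value.

```python
def decide(test_statistic, critical_value, tail):
    """Apply the critical-value decision rule (illustrative helper).

    tail is 'two', 'right' or 'left'; critical_value is the positive table value.
    """
    if tail == "two":
        reject = abs(test_statistic) > critical_value
    elif tail == "right":
        reject = test_statistic > critical_value
    elif tail == "left":
        reject = test_statistic < -critical_value
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return "reject H0" if reject else "fail to reject H0"

# Hypothetical values, chosen only to show the three cases
print(decide(2.5, 1.96, "two"))
print(decide(1.2, 1.645, "right"))
print(decide(-2.0, 1.645, "left"))
```

Passing the tail explicitly keeps the sign handling for left-tailed and two-tailed tests in one place.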
Solution
H0: $\mu = 575$
H1: $\mu \neq 575$
Question 2
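The p-value for a hypothesis test has been reported as 0.03. If the test result is interpreted using the
$\alpha = 0.05$ level of significance as a criterion, will H0 be rejected? Explain.
Solution
Given information: $\alpha = 0.05$
p-value = 0.03
Since the p-value (0.03) is smaller than $\alpha$ (0.05), H0 is rejected at the 0.05 level of significance.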
$\frac{17.0}{\sqrt{12}} = 4.9075$
p-value: $P(Z > 2.04) = 1 - 0.9793 = 0.0207$
The table values are given as positive values. If you work in the left tail of the area under the
curve, you have to put a minus sign before the table value.
When working two-tailed, you have to divide the significance level by 2 and use the answer for the table.
You then use that table value twice: once in the right tail with a positive sign and once in the left
tail, making it negative.
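The upper-tail p-value read from the Z table can also be computed with Python's standard library. This is a sketch under stated assumptions: the helper name `upper_tail` is illustrative, and the significance level 0.05 used for the decision is assumed here only for the example.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable (illustrative helper)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p_value = upper_tail(2.04)
alpha = 0.05  # assumed level of significance for this illustration
print(round(p_value, 4))
print("reject H0" if p_value < alpha else "fail to reject H0")
```

For a left-tailed test the same area applies to the negative table value, and for a two-tailed test the tail area is doubled.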
Steps
1. State (or identify) the null hypothesis H0 and the alternative H1.
2. Choose the level of significance ($\alpha$)
3. Determine the sample mean $\bar{X}$
4. Determine the population mean $\mu$
5. Determine the sample standard deviation $S$
6. Determine the sample size $n$
7. Compute the test statistic
$t = \frac{\bar{X} - \mu}{S / \sqrt{n}}$
Activity 9.3.2:
Question 1
My daughter and I have argued about the average length of our preacher's sermons on Sunday morning.
Despite my arguments, she thinks that the sermons last more than twenty minutes and this is not
acceptable to her. For one year she randomly selected 12 Sundays and found an average time of
26.42 minutes with a standard deviation of 6.69 minutes. Assuming that the population is normally
distributed and using a 0.05 level of significance, we decided to make a scientific analysis, using a
hypothesis test. Calculate the test statistic and make a statistical decision.
Solution
Steps
1. The null hypothesis H0: $\mu \le 20$ versus the alternative H1: $\mu > 20$
2. The level of significance $\alpha = 0.05$
3. The sample mean $\bar{X} = 26.42$
4. The population mean $\mu = 20$
5. The sample standard deviation $S = 6.69$
6. The sample size $n = 12$
7. The test statistic
$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{26.42 - 20}{6.69 / \sqrt{12}} = \frac{6.42}{1.9312} = 3.3244$
Conclusion
Reject H0 if the test statistic is greater than the critical value.
The critical value is $t_{(11,\ 0.05)} = 1.7959$ (from the t table, with $df = n - 1 = 11$).
Since the test statistic 3.3244 is greater than 1.7959, H0 can be rejected. We conclude that there
is enough evidence that the alternative H1 is true and that my daughter is correct in thinking that the
average length of sermons is more than 20 minutes.
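The test statistic in step 7 can be reproduced with a short Python sketch. The function name `t_statistic` is an illustrative assumption. Because the code does not round the standard error to 1.9312 first, the last decimal can differ slightly from the hand calculation.

```python
import math

def t_statistic(sample_mean, mu_0, s, n):
    """One-sample t statistic: (X-bar - mu) / (S / sqrt(n)) (illustrative helper)."""
    standard_error = s / math.sqrt(n)
    return (sample_mean - mu_0) / standard_error

t = t_statistic(sample_mean=26.42, mu_0=20, s=6.69, n=12)
print(round(t, 2))
```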
Question 2
A random sample of 10 observations was drawn from a normally distributed population.
The data values were: 6, 4, 4, 7, 5, 5, 4, 5, 6, 4
6 versus H1: $\mu > 6$ and scribbled down his calculations. When his friend came along and quickly
wanted to copy his work, he did not read properly and wrote down that
1. the sample mean is equal to 4
2. the sample variance is equal to 1
3. the rejection region is $t < -t_{(0.05,\ 10)} = -1.833$
4. the test statistic is $t = -3.0$
5. the conclusion is to reject H0, because the test statistic $t = -3.0 < -1.833$
Which of the above statements is correct?
Solution
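1. The sample mean $\bar{X} = \frac{\sum X_i}{n} = \frac{6 + 4 + 4 + 7 + 5 + 5 + 4 + 5 + 6 + 4}{10} = \frac{50}{10} = 5$
2. The variance $S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{(6 - 5)^2 + (4 - 5)^2 + (4 - 5)^2 + (7 - 5)^2 + \dots + (4 - 5)^2}{10 - 1} = 1.1111$
and the standard deviation $S = \sqrt{1.1111} = 1.0541$
The test statistic $t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{5 - 6}{1.0541 / \sqrt{10}} = \frac{-1}{0.3333} = -3$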
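The sample mean, sample variance and test statistic for these ten observations can be verified with Python's `statistics` module. This is a minimal sketch; the variable names are illustrative, and the hypothesized mean of 6 is taken from the question.

```python
import math
import statistics

data = [6, 4, 4, 7, 5, 5, 4, 5, 6, 4]
mu_0 = 6  # hypothesized mean from the question

mean = statistics.mean(data)
variance = statistics.variance(data)   # sample variance, divides by n - 1
std_dev = math.sqrt(variance)
t = (mean - mu_0) / (std_dev / math.sqrt(len(data)))

print(mean, round(variance, 4), round(std_dev, 4), round(t, 4))
```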
Question 3
The credit manager of a large department store claims that the mean balance for the store's charge
account customers is R410. An independent auditor selects a random sample of 18 accounts and
finds a mean balance of $\bar{X}$ = R511.33 and a standard deviation of S = R183.75. If the manager's
claim is not supported by these data, the auditor intends to examine all charge account balances. If
the population of account balances is assumed to be approximately normally distributed, what action
should the auditor take?
Solution
Steps
1. The null hypothesis H0: $\mu = 410$ versus the alternative H1: $\mu \neq 410$
2. For this test, let the level of significance $\alpha = 0.05$
3. The sample mean $\bar{X} = 511.33$
4. The population mean $\mu = 410$
5. The sample standard deviation $S = 183.75$
6. The sample size $n = 18$
7. The test statistic
$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{511.33 - 410}{183.75 / \sqrt{18}} = \frac{101.33}{43.3103} = 2.3396$
Conclusion
Reject H0 if the test statistic is greater than the critical value.
The critical value is $t_{(17,\ 0.025)} = 2.1098$ (from the table).
Since the test statistic 2.3396 is greater than 2.1098, H0 can be rejected. We conclude that there
is enough evidence that the alternative H1 is true and that the auditor should proceed to examine all
charge account balances.
$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}}$, where $p = \frac{X}{n}$
7. Make the statistical decision.
Activity 9.4
Question 1
A random sample of 200 observations shows that there are 36 successes. We want to test at the 1%
significance level whether the true proportion of successes in the population is less than 24%, and made
certain calculations.
Which one of the following statements is incorrect?
1. The value of p is $\frac{36}{200}$
Solution
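1. Correct
2. Correct
3. Correct
4. Correct: the standard error is $\sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.24(1 - 0.24)}{200}} = 0.0302$
5. Incorrect: the test statistic is $Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}} = \frac{0.18 - 0.24}{0.0302} = \frac{-0.06}{0.0302} = -1.9868$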
Question 2
If, in a random sample of 400 items, 164 are defective, what is the sample proportion of the defective
items?
Solution
The sample proportion $p = \frac{X}{n} = \frac{164}{400} = 0.41$
Question 3
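Refer to Question 2. Suppose you are testing the null hypothesis H0: $\pi = 0.40$ against H1: $\pi \neq 0.40$
and you choose the level of significance $\alpha = 0.05$. What is your statistical decision?
Solution
This is a two-tailed test.
The decision rule:
Reject H0 when the value of the test statistic is greater than the critical value at a specific significance
level.
The test statistic is
$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}} = \frac{0.41 - 0.40}{\sqrt{\frac{0.40(1 - 0.40)}{400}}} = \frac{0.01}{0.0245} = 0.4082$
Since 0.4082 is less than the critical value $Z_{\alpha/2} = 1.96$, H0 is not rejected at the 5% level of significance.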
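The Z test statistic for a proportion can be sketched in Python as follows. The function name `proportion_z` is an illustrative assumption; note that the standard error uses the hypothesized proportion, not the sample proportion.

```python
import math

def proportion_z(x, n, pi_0):
    """Z statistic for testing a proportion against pi_0 (illustrative helper)."""
    p = x / n                                          # sample proportion
    standard_error = math.sqrt(pi_0 * (1 - pi_0) / n)  # uses the hypothesized value
    return (p - pi_0) / standard_error

z = proportion_z(x=164, n=400, pi_0=0.40)
print(round(z, 4))
```

The same helper applied to 36 successes in 200 observations with a hypothesized proportion of 0.24 gives a negative statistic, as expected for a sample proportion below the hypothesized value.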
SELF-ASSESSMENT
Question 1
A machine is supposed to be adjusted to produce components to a dimension of 2.0 centimeters. In
a sample of 50 components, the mean was found to be 2.001 centimeters and the standard deviation
to be 0.003 centimeters. Is there evidence to suggest that the machine is set too high? Use $\alpha = 0.05$
Question 2
The light bulbs in an industrial warehouse have been found to have a mean lifetime of 1030.0 hours,
with a standard deviation of 90.0 hours. The warehouse manager has been approached by a
representative of Extendabulb, a company that makes a device intended to increase bulb life. The
manager is concerned that the average lifetime of Extendabulb-equipped bulbs might not be any
greater than the 1030 hours historically experienced. In a subsequent test, the manager tests 40
bulbs equipped with the device and finds their mean life to be 1061.6 hours.
Does Extendabulb
Question 4
The new director of a local YMCA has been told by his predecessors that the average member has
belonged for 8.7 years. Examining a random sample of 15 membership files, he finds the mean length
of membership to be 7.2 years, with a standard deviation of 2.5 years. Assuming the population is
approximately normally distributed, and using the 0.05 level, does this result suggest that the actual
mean length of membership may be some value other than 8.7 years?
Question 5
The career services director of Hobart University has said that 70% of the school's seniors enter the
job market in a position directly related to their undergraduate field of study. In a sample consisting
of 200 of the graduates from last year's class, 66% have entered jobs related to their field of study.
Make the related decision. Use $\alpha = 0.05$
SOLUTIONS FOR SELF- ASSESSMENT
Solution 1
Steps
1. The null hypothesis H0: $\mu = 2.0$ and the alternative H1: $\mu > 2.0$ (this is a one-tailed test)
2. The level of significance $\alpha = 0.05$
3. The sample mean $\bar{X} = 2.001$
4. The population mean $\mu = 2.0$
5. The population standard deviation $\sigma = 0.003$
6. The sample size $n = 50$
7. The test statistic
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{2.001 - 2.0}{0.003 / \sqrt{50}} = \frac{0.001}{0.00042} = 2.357$
$\frac{90}{\sqrt{40}} = 14.23025$
8. The critical value is equal to 1.645
9. Decision
Since 2.2206 > 1.645, H0 is rejected at the 5% level of significance.
The results suggest that Extendabulb does increase the mean lifetime of the bulbs. The firm may
wish to incorporate Extendabulb into its warehouse lighting system.
Solution 3
The test statistic
$Z = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{82.0 - 90}{20.5 / \sqrt{15}} = \frac{-8}{5.2931} = -1.5114$
The critical value is equal to 2.58
Decision
Since $|-1.5114| = 1.5114 < 2.58$, H0 is not rejected at the 1% level of significance.
Solution 4
Steps
1. The null hypothesis H0: $\mu = 8.7$ versus the alternative H1: $\mu \neq 8.7$
2. The level of significance $\alpha = 0.05$
3. The sample mean $\bar{X} = 7.2$
4. The population mean $\mu = 8.7$
5. The sample standard deviation $S = 2.5$
6. The sample size $n = 15$ and the degrees of freedom $df = n - 1 = 15 - 1 = 14$
7. The test statistic
$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{7.2 - 8.7}{2.5 / \sqrt{15}} = \frac{-1.5}{0.6455} = -2.3238$
8. The critical values are $t = -2.145$ and $t = 2.145$
Conclusion
Since the calculated test statistic falls in the rejection region we reject H0
At the 0.05 level, the results suggest that the actual mean length of membership may be some value
other than 8.7 years.
Solution 5
Steps
1. The value of p is 0.66
2. The appropriate hypotheses are H0: $\pi = 0.70$ versus H1: $\pi \neq 0.70$
3. The critical values of Z are $-1.96$ and $1.96$
4. The test statistic is
$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}} = \frac{0.66 - 0.70}{\sqrt{\frac{0.70(1 - 0.70)}{200}}} = \frac{-0.04}{0.0324} = -1.2346$
The test statistic value falls between the two critical values. The null hypothesis is not rejected.
We conclude that the proportion of graduates who enter the job market in careers related to their
field of study could indeed be equal to the claimed value of 0.70. This analysis would suggest that
the director's assertion not be challenged.
STUDY UNIT 10
STUDY CHAPTER 10
CHI-SQUARE DISTRIBUTION
$(n - 1) s^2 / \sigma^2$
freedom (df)
skewed, but as the df increases, the form of the $\chi^2$-distribution becomes more and more like the
question.
Draw a conclusion regarding the statements in the hypotheses after comparison of the critical
Description
Question 1
Do questions 11.20 and 11.21 in the prescribed textbook. Please try them yourself before looking
at the solutions!
Question 2
The quality manager in a tyre manufacturing plant in Port Elizabeth wants to test whether the nature of
defects found in manufactured tyres depends upon the shift during which the defective tyres were
produced. Formulate the hypotheses of this test.
Question 3
A large carpet store wishes to determine if the brand of carpet purchased is related to the purchaser's
family income. As a sampling frame, they mailed a survey to people who have a store credit card.
Five hundred customers returned the survey and the results follow:

                      Brand of Carpet
Family Income     Brand A   Brand B   Brand C
High Income          65        32        32
Middle Income        80        68       104
Low Income           25        35        59

The statements below refer to a test conducted on the data above to determine if the brand of carpet
purchased is related to the purchaser's family income at the 5% level of significance.
Select the incorrect statement.
1. H0 : Family income and brand of carpet are independent and
H1 : Family income and brand of carpet are dependent.
3. The estimated frequencies are as follows:

                      Brand of Carpet
Family Income     Brand A   Brand B   Brand C
High Income        43.86     34.83     40.31
Middle Income      85.68     68.04     98.28
Low Income         24.46     32.13     46.41
Application skills
Question 1
11.20: 12
11.21:
a. 15.507
b. 20.090
c. 23.209
d. 34.805
e. 34.805
Question 2
1. H0: The nature of defects found in manufactured tyres and the shift during which they were
produced are independent
H1: The nature of defects found in manufactured tyres and the shift during which they were produced
are dependent
$f_e = \frac{129 \times 170}{500} = 43.86$
                      Brand of Carpet
Family Income     Brand A   Brand B   Brand C
High Income        43.86     34.83     50.31
Middle Income      85.68     68.04     98.28
Low Income         40.46     32.13     46.41
4. Because the calculation of the $\chi^2$ value is very tedious and you may not know where you made a
calculation error, I will show you the manual calculation of this value:

f0       fe       (f0 - fe)^2 / fe
65      43.86     10.1892
32      34.83      0.2299
32      50.31      6.6638
80      85.68      0.3765
68      68.04      0
104     98.28      0.3327
25      40.46      5.9074
35      32.13      0.2564
59      46.41      3.4154

$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} = 27.372$
5. Because $27.372 > \chi^2_{(0.01,\ 4)} = 13.277$, we can reject H0 and conclude that there is a significant relationship
between the brand of the carpet and family income.
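Because the manual calculation is tedious, a short Python sketch can be used to check the expected frequencies and the $\chi^2$ value. The variable names are illustrative assumptions. The code works with unrounded expected frequencies, so the total can differ from the hand calculation in the third decimal.

```python
# Observed frequencies: rows are High, Middle, Low income; columns are Brand A, B, C
observed = [
    [65, 32, 32],
    [80, 68, 104],
    [25, 35, 59],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency = (row total x column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for fo, fe in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(expected[0][0], 2), round(chi_square, 2), df)
```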
STUDY UNIT 11
STUDY CHAPTER 11
REGRESSION AND CORRELATION ANALYSIS
Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline; can
you think of some examples?
...................................................................................................
English term                      Description
Regression analysis
Correlation
Scatterplot
Least squares method
Regression coefficients
Dependent variable
Independent variable
Slope
Y-intercept
Interpolation
Extrapolation
Coefficient of correlation
Coefficient of determination
The idea of a simple linear regression model, how to determine the equation, and the principle of the
least-squares criterion are explained in detail in the prescribed textbook. Make sure that you understand
the following:
We are only considering linear relations, meaning that we only fit straight lines through the data.
There is a difference between an observed value of a variable $y_i$ and the estimated value of the
same variable, $\hat{y}_i$. The difference between these two is called the error.
The regression line always passes through the point $(\bar{X}, \bar{Y})$. Stated differently, the means of the
two variables, given as a pair, are always a pair of coordinates on the regression line.
In the general form of the regression line $Y = b_0 + b_1 X$, the $b_0$ and $b_1$ are only symbols and will
be substituted by numbers in the calculated equation for a specific data set. In school you most
probably used the form $Y = mX + c$ for the straight line equation and learnt that m indicated the
slope and c the Y-intercept. Compare what you learnt in school with this new form and you will
see that the Y-intercept is given by the number without an x, namely $b_0$, and that the value with
the x, namely $b_1$ (also called the coefficient of X), represents the slope of the straight line.
Calculating the least squares regression line manually takes a lot of time and is quite tedious.
Still, it is a necessary exercise for you at this stage. There will be enough time later in your
life to use software and simply interpret printouts.
Make sure that you understand the meaning of the required assumptions for the linear regression
model.
...................................................................................................
Activity 11.1
Question 1
Do questions 12.1 to 12.3 in the prescribed textbook.
Question 2
Consider the following data values of variables x and y:
X     5    4    3    6    9    8   10
Y     7    8   10    5    2    3    1
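The least squares coefficients for these data can be checked with a short Python sketch. It assumes the values are paired as (5, 7), (4, 8), (3, 10), (6, 5), (9, 2), (8, 3), (10, 1) and uses the standard formulas $b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$. The variable names are illustrative.

```python
x = [5, 4, 3, 6, 9, 8, 10]
y = [7, 8, 10, 5, 2, 3, 1]   # assumed pairing with x, in the order listed

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx            # slope
b0 = y_bar - b1 * x_bar     # Y-intercept; the line passes through (x_bar, y_bar)

print(round(b0, 3), round(b1, 3))
```

Under that pairing the slope is negative, so y decreases as x increases.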
the strength of the relationship is weaker the closer the value of r (positive or negative) is to zero.
As in the case of the regression line, a lot of calculations are needed if you want to determine the
value of r manually.
The coefficient of determination
This is simply the square of the correlation coefficient r, namely $r^2$. The value of
$r^2$ indicates the proportion of the variation in y that is explained by the regression line $Y = b_0 + b_1 X$.
That is all you have to know about this coefficient. (See the first paragraph under the heading: The
coefficient of determination.)
...................................................................................................
Activity 11.2
Question 1
Do questions 12.11 to 12.15 in the Prescribed textbook.
Question 2
The statements in this question are based on the following data.
X     2.6   2.6   3.2   3.0   2.4   3.7   3.7     $\sum X = 21.2$
Y     5.6   5.1   5.4   5.0   4.0   5.0   5.2     $\sum Y = 35.3$
Question 2
Option 2
1. Incorrect. The relationship between x and y appears to be linear and negative (as the x-values
increase, the y-values decrease).
2. Correct. The least squares regression line is $\hat{y} = 13.223 - 1.257x$.
3. Incorrect.
4. Incorrect. The estimated value of y is 25.189 only when x = 5 is substituted into the equation
given in option 3. The correct answer is 10.709.
0.7
Question 2
1. Correct. Since r > 0.
2. Correct. $\bar{y} = 5.043$.
3. Incorrect. $r^2 = (0.327)^2 = 0.107$.
4. Correct.
5. Correct. The value of $r^2$ is $(0.327)^2 = 0.107$. Then only 10.7% of the variation in y is explained by
the variation in x.
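The values of r and $r^2$ quoted in this solution can be reproduced from the Activity 11.2 data with a short Python sketch. It uses the standard formula $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$; the variable names are illustrative.

```python
import math

x = [2.6, 2.6, 3.2, 3.0, 2.4, 3.7, 3.7]
y = [5.6, 5.1, 5.4, 5.0, 4.0, 5.0, 5.2]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)   # coefficient of correlation
r_squared = r ** 2                  # coefficient of determination

print(round(y_bar, 3), round(r, 3), round(r_squared, 3))
```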