STATS Studyguide


STA1610/1

Department of Statistics
STA1610

Introduction to Statistics

Study guide for STA1610


Table of contents

STUDY UNIT 1
1.1 Sample versus population
1.2 Types of variables
1.3 Levels of measurement

STUDY UNIT 2
2.1 Tables and charts for categorical data
2.2 Tables and charts for numerical data
2.3 Scatter diagram and contingency tables
2.4 Summary: objectives of study units 1 and 2

STUDY UNIT 3
3.1 Central tendency
3.2 Measures of dispersion
3.3 Measures of relationship
3.4 Summary: objectives of study unit 3

STUDY UNIT 4
4.1 Introduction to this study unit
4.2 Assigning probability to an event
4.3 Calculation of probability
4.4 Self assessment exercise
4.5 Solutions to self assessment exercises

STUDY UNIT 5
5.1 Introduction to this study unit
5.2 Probability distribution for discrete random variables
5.3 The binomial distribution
5.4 Poisson distribution
5.5 Self assessment exercise
5.6 Solutions to self assessment exercises

STUDY UNIT 6
6.1 Introduction to this study unit
6.2 The normal and standardized normal distribution
6.3 Summary to the study unit

STUDY UNIT 7
7.1 Introduction to this study unit
7.2 Sampling distribution of the mean
7.3 Sampling distribution of proportion

STUDY UNIT 8
8.1 Introduction to this study unit
8.2 Confidence interval estimate for the mean when the population standard deviation is known
8.3 Confidence interval estimate for the mean (population standard deviation is unknown)
8.4 Confidence interval estimate for proportion

STUDY UNIT 9
9.1 Introduction to this study unit
9.2 Fundamental concept of hypothesis testing
9.3 Hypothesis testing of the mean
9.4 Hypothesis testing for proportion

STUDY UNIT 10
10.1 Introduction to this study unit
10.2 Basic concepts in Chi-square testing
10.3 Testing for independence of two variables
10.4 Summary: objectives of study unit 10

STUDY UNIT 11
11.1 Introduction to this study unit
11.2 The simple linear regression line
11.3 Introduction to correlation analysis
11.4 Summary: objectives of study unit 11

Welcome to this Timely Topic with the Tattered Image:


STATISTICS!
The word Statistics leads to different reactions ranging from disgust to admiration. Do you believe
that statistics is difficult and irrelevant? I sincerely hope not, but if you do, this is just the module
which will enable you to confront that feeling and gradually change it to a feeling of astonishment
about the need and the strength of statistical theory and the clever scientific applications of statistics
in different disciplines. My aim is to convince you that statistics is indeed a timely topic forming an
essential part of your intellectual development!
We found an excellent textbook by Pearson:
Business Statistics (Fifth Edition) (2010)
by David M. Levine, Timothy C. Krehbiel and Mark L. Berenson
ISBN 978-0-13-609422-7
Most probably this module will be the only contact you make with statistics during the completion of
your degree and our aim is to make this introduction to statistics as interesting as possible.
We are all heading somewhere and, being a student, you must be heading for an exciting new
career or for an improvement of your academic skills. Never underestimate the power of your mind
with regard to your final destination with the degree you are enrolled for. Feeling negative about
statistics because you are not interested in numbers, or because you do not understand the need for
a module in statistics, will influence the way you study as well as the way you progress.
Have an open mind for what you are about to learn; let it catch your imagination. I am asking for a
real effort, seeing it will be an investment in your future personal and professional life. Set your mind
on success and think in terms of your journey's end, which is that degree you want to complete!

Let's GO!

Tell yourself: I want to learn, like and use statistics.

I am sure that there are many questions in your mind, and seeing that you are a distance learner, I
will anticipate some of those questions:


Question 1
How will I benefit by gaining this knowledge?
Statistical skills take on different forms and without realizing it, you are using many of them every day
of your life. Soon, everybody will be forced to have a certain level of numeracy at school level, and
statistics has been introduced at school level as well. Learning concepts by heart has little meaning
in statistics, because it is a subject about perception, insight and the ability to apply knowledge.
Allow logic to direct you through the rules and results.
How will this benefit you? Statistics will enrich you with knowledge relevant in different walks of life,
because it is living knowledge, applicable wherever it fits in. It is definitely not only about collecting
information, called data. We will go beyond data and you will become an explorer, turning information
into wisdom. After completion of this module you should understand more about life, the role of
decision making and the importance of scientific knowledge in governance and control.

Question 2
What is the nature of the statistics in this module?
The authors of this prescribed book explain in the preface that their aim was to present statistics in
an interesting and useful way. In this they succeeded and also in their use of modern technological
advances. You can complete this module even if you do not have access to a computer, but should
you have access, you can familiarize yourself with different statistical software and the additional
information given on the CD-ROM that accompanies each textbook. This module is a service module
for students from different disciplines and with varying background knowledge. It is therefore different
from the more mathematical presentations we offer to students from the College of Science, or
students from the College of Economic and Management Sciences who are majoring in statistics.
This is a stand-alone module, as it may not be a prerequisite for any level module in statistics.

Question 3
How must I go about this module?
Keep in mind that different students have different study methods, so it is not really possible to give
you an indication of the time you will spend preparing for this module. The time you spend studying
is not necessarily correlated to intelligence. The extremely important fact I do want to stress is that
Statistics needs continuous, steady attention! You need time for reflection on the knowledge you
attained. Please make a study time table for all the modules for which you are enrolled, taking the
assignment due dates and your personal circumstances into consideration.


Question 4
Can I continue with statistics once I have completed this module?
As said, this is a service module. You cannot present this module for exemption from any other
major statistics module at Unisa. Furthermore, this module cannot form part of a major in statistics.
We trust that this module will open your mind for an interest in statistics, but if you want to major in
statistics, you will have to start again at level one. The reason for this rather depressing statement
lies in the depth of knowledge and the method of presentation in this module, which is too different
from the more mathematical presentation typical of modules forming part of a major in statistics.

Question 5
What happens if I cannot use the CD-ROM?
If you cannot go to a regional office, or you do not have a computer, or your computer cannot read
a CD-ROM, you can still complete this module successfully. You will not be examined on additional
information given on the CD or asked how to construct a specific descriptive measure in one of the
statistical software programs. Of course, the CD is very useful for those of you who have access
to a modern computer as it functions as an additional tutor system. We all seem to understand
better with practical applications, additional data sets, pictures, etc. If you are at all interested in
Statistics and see it as a career benefit, you will realize the importance of computer access. The
modern statistician needs a computer in the same way as the previous generation needed pocket
calculators!

Question 6
What is a wraparound (or textbook guide)?
This is a textbook guide you are reading at this moment. It is a way of talking to you in a manner
similar to that of a lecturer at a residential university talking to his/her students. I know that words
on a piece of paper can never substitute for personal contact, but we try to come as close to that as
possible. In this guide I include summaries on certain sections, discuss difficult sections, indicate the
sections needed for examination preparation and then I also give you what we call activities. They
are like worked out examples, but I give you the opportunity to try them yourself before you look at
my solutions.
The process to follow is as follows:
1. Study the particular section in the textbook (given in a block at the beginning of each study unit).
2. Read the corresponding section in this textbook guide.
3. Attempt to answer the questions in the activity relevant to that section. Do not look at my solutions
at the end of each study unit before you have tried really hard to do them yourself.

You may have more questions and if they are serious and you have concerns, contact your lecturer.
You may find that you are able to answer your own questions as the year rolls on!

Outline of this module


The basic statistics topics you need are all included in the prescribed textbook. In fact, there are
additional topics in this textbook which do not form part of this module. A clear indication of the
chapters designated for examination purposes is given below, but read the notes on the different
study units for finer details.
You have to know the following chapters for examination purposes:
Chapter 1: Introduction and Data Collection
Chapter 2: Presenting Data in Tables and Charts
Chapter 3: Numerical Descriptive Measures
Chapter 4: Basic Probability
Chapter 5: Discrete Probability Distributions
Chapter 6: The Normal Distribution
Chapter 7: Sampling and Sampling Distributions
Chapter 8: Confidence Interval Estimation
Chapter 9: Fundamentals of Hypothesis Testing: One-Sample Tests
Chapter 11: Chi-Square Tests
Chapter 12: Simple Linear Regression

You cannot do this module without the prescribed book. Also make sure that you buy the correct
edition. You should nurse all your prescribed books, even after completion of the different modules.
They will become precious references in your current or future career. Note that, as planned at the
time this module was developed, your assignments and examination paper will all contain only multiple
choice questions. This is due to the large number of students enrolled for this module. The majority
of the questions in the activities given in this wraparound will also be multiple choice, but we found
that some basic principles can be explained in more detail in a standard question.

The contents of this module have been divided into 11 study units.

Study material
Your study material consists of
a prescribed book: Business Statistics, a first course, 5th ed.
by DM Levine, TC Krehbiel and ML Berenson

and this study guide.


Notes
You have to buy the textbook yourself, and as soon as possible!
Tutorial letter 101, containing general information as well as the assignment questions, forms part
of the study material you received during the registration period.

Please make sure that you receive the study guide and tutorial letter 101 during registration. Once
you have bought the prescribed book, you will be ready to start with your new and exciting learning.
If you have access to the internet, log onto myUnisa and join the discussion forums for the different
modules you are enrolled for. Being a distance learner can lead to isolation, so get connected or
meet regularly with a peer group who is also registered for this module. Look around you and see if
you can find statistical information in your community; involve your parents, friends, etc.


STUDY UNIT 1
STUDY CHAPTER 1

Key questions for this unit


What is statistics?
How do you collect and summarise data?
What types of variables exist?

1.1 Sample versus population


The first chapter is very general and offers you an excellent background to the knowledge which
follows. Most probably you will come across concepts which are new to you; others may seem
familiar, but you are not certain about their significance or meaning. Is this the stage where you
find statistics boring and a lot of dead facts? Hey - come on! Nothing really good in life comes easy!
Before you can use a language, you have to learn the boring vocabulary, and only when you can
manoeuvre the words into sentences does the language start to make sense and can you enjoy it in all
its beauty!
Be patient, there are basic facts and concepts you have to learn! Let us quickly run through some
starters.
Make sure you understand the meaning of a population and a sample. Remember that
measures for a sample are called statistics
measures for a population are called parameters

The information in the table below will become clearer as we continue through the chapters that
follow. If you do not know what is meant by measures of location or spread, use this table for
further reference.

Measure                      Sample                    Population
                             (set of observations)     (all possible observations)
                             Statistic                 Parameter

Measures of location
  Average                    Sample mean               Population mean
  Middle element             Sample median             Population median
  Most frequent element      Sample mode               Population mode

Measures of spread
  Range                      Sample range              Population range
  Standard deviation (SD)    Sample SD                 Population SD
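
The difference between a statistic and a parameter can be made concrete with a few lines of Python. This is only a sketch under stated assumptions: the ten marks are invented for illustration, and taking every third value is a convenient stand-in for drawing a sample, not a proper random sample.

```python
import statistics

# Hypothetical population: the marks of ALL ten students in a small tutorial group.
population = [12, 15, 11, 18, 20, 14, 16, 13, 17, 19]

# A sample is a set of observations taken from the population.
# Taking every third value is only a convenient illustration, not a random sample.
sample = population[::3]

population_mean = statistics.mean(population)  # a parameter (describes the population)
sample_mean = statistics.mean(sample)          # a statistic (describes the sample)

print("Parameter (population mean):", population_mean)
print("Statistic (sample mean):", sample_mean)
```

The two means differ because the sample contains only some of the observations; the statistic is used as an estimate of the parameter.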

This brings us to variables and the difference between qualitative and quantitative variables.
Characterization of data is the starting point of any statistical analysis, so, know your data! I would
like to train you to evaluate published statistical analyses in order to decide if results are really
trustworthy. This process starts with the data type and the corresponding correct form of analysis.

1.2 Types of variables


Please understand that there is a difference between Types of Variables and Scales of Measurement.
Variables are classified as either Qualitative (think in terms of quality of life) or Quantitative (if you
quantify something you could count it). The information can be given in table form. Once you know
your variable is quantitative, it helps to ask yourself if you have actually counted (then discrete) or
measured (then continuous), when you gathered the values.
Qualitative variable (several categories)
  Data only as frequencies (count elements in categories).
  Frequencies can also be expressed as percentages.

Quantitative variable
  Discrete: only specific values; you counted.
    Data generated by counts of elements.
  Continuous: any value within an interval; you measured.
    Data generated by measurements of some aspect of the elements.


1.3 Levels of measurement


Once a variable has been measured, you must know how to analyse the data (the measurements put
together), but in order to analyse data you have to look at the variable under a magnifying glass. Four
levels of measurement, called scales, are given. Data are actually either nominal or ordinal. There
is little to say about nominal data, but ordinal data can be refined further into interval or
ratio data. Make sure that you understand the difference between the data types.

Scales of measurement

Nominal
  Categories or labels. If numbers are used they have no numerical meaning.

Ordinal
  Preferences are ordered. Numbers are ranked but ranks do not represent specific measurements.

  Finer levels of ordinal data:

  Interval
    Numerical labels indicate order and distance. Unit of measurement exists but no absolute zero.

  Ratio
    Absolute zero present and multiples have meaning.

Activity 1.1
Question 1
Which of the following statements about the variable type is incorrect?
1. Whether or not you own a Panasonic television set is a qualitative variable.
2. Your status as either a full-time or part-time student is a quantitative variable.
3. The number of people you know who attended the graduation last year is a quantitative, discrete
variable.
4. The price of your most recent haircut is a quantitative, discrete variable.
5. Cyril's travel time from his home to the examination centre is a quantitative, continuous variable.
Question 2
Which of the following quantitative variables is not continuous (i.e. it is discrete)?
1. Your weight
2. The circumference of your head (in centimetres)
3. The time it takes Jerome to walk from his home to the taxi pickup point
4. The length of your forearm from elbow to wrist
5. The number of coins in your pocket
...................................................................................................

Feedback on Activity 1.1


Question 1
Option 2
1. Correct. The options are Yes and No. The variable does not track measurements of elements,
but will generate counts of the number of people in each level.
2. Incorrect. Options are that you are a Full-time or you are a Part-time student and it is a qualitative
variable for the same reasons as given in 1.
3. Correct. The answer is some countable number of people, making the answer a number and this
number can only take on whole numbers as values, therefore a discrete variable.
4. Correct. Answer is a price in rands and cents. These are numbers, therefore quantitative and
furthermore these numbers can only extend to the second decimal place, so it is discrete.
5. Correct. The answer is a number, so quantitative, and time can be measured to any
level of accuracy. So, the quantitative variable is furthermore continuous.
Question 2
Option 5
1. Continuous.
The answer lies in an interval.
2. Continuous.
The measurement lies somewhere in an interval of possibilities.
3. Continuous.
The time it takes Jerome to walk from his home to the taxi pickup point is a measurement.
4. Continuous.
The length of your forearm from elbow to wrist is a measurement.
5. Not continuous.
Number of coins in your pocket is a discrete value; easy to count.


STUDY UNIT 2
STUDY CHAPTER 2

Key questions for this unit


How do you draw up a table for statistical data?
What different graphs exist?

2.1 Tables and charts for categorical data


The emphasis in this study unit is on appropriate methods for analysing quantitative data, but it
concludes with an introduction to the comparison of qualitative variables through cross-tabulation.
It is frustrating to try and make sense of raw data (simply collected information). Even
non-statisticians have the need to do something with such information. The most elementary
manipulation would be to arrange data from small to large, or in alphabetical order, or...
For categorical data you can draw up a summary table and use the bar chart, pie chart and Pareto
chart to display the data.
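
A summary table for categorical data can be sketched in Python as follows. The transport data and the function name `summary_table` are invented for illustration; they do not come from the prescribed book. The rows are sorted from the largest frequency to the smallest, which is also the order in which a Pareto chart displays its bars.

```python
from collections import Counter


def summary_table(observations):
    """Return (category, frequency, percentage) rows, largest frequency first."""
    counts = Counter(observations)
    total = sum(counts.values())
    return [(category, frequency, 100 * frequency / total)
            for category, frequency in counts.most_common()]


# Hypothetical raw data: how ten students travel to the examination centre.
transport = ["taxi", "bus", "taxi", "car", "taxi",
             "bus", "walk", "taxi", "car", "bus"]

rows = summary_table(transport)
for category, frequency, percentage in rows:
    print(f"{category:<6}{frequency:>4}{percentage:>8.1f}%")
```

The frequencies are the heights of the bars in a bar chart and the percentages are the sizes of the slices in a pie chart.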

2.2 Tables and charts for numerical data


In the previous section we decided that a frequency distribution is much better than simply writing
down the measurements as they are being recorded. However, once data have been placed in the
different classes, some of the information is lost. You no longer have the data measurements, only
the counts per class. Different charts are used to display the data and one method is the
stem-and-leaf display, which is a very sweet visual presentation. No information about the data is
lost! The data are also grouped: the stems correspond with the class intervals of the frequency
distribution, but the leaves give more detail: they record the actual data value in that class interval.
Note that the stem-and-leaf computer printout of MINITAB gives an additional first column indicating
a count of the number of leaves per stem. If you read that "There are 37 values in this category or
lower", but you can see with your own eyes that there are only eight leaves, namely 5 5 5 5 7 8 8 9
and 9, in that line, do not be upset. Why is this not a lie? In the text it was explained that it is
possible to break each stem up into two or more lines. Look again at the printout and you will see
that the stems have been doubled, i.e. there are two 1s, two 2s, etc. Because there were too many
numbers per stem, they split each stem into two parts. For example, for the stem 1 all the numbers
from 10 to 14 are written with the first 1 and all numbers from 15 to 19 are with the second 1. An
outlier is a data value very distant from most of the others.
Of course, if the interest is in the form of the data (symmetric, skewed, ...), repeating stems may not
be used in the stem-and-leaf display.
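
The way stems and leaves are formed, and what it means to split each stem into two lines, can be sketched in Python. This is a minimal illustration and not the MINITAB procedure: the function name `stem_and_leaf` and the ages are invented, the stem is taken to be the tens digit and the leaf the units digit, and lines without any leaves are simply left out.

```python
from collections import defaultdict


def stem_and_leaf(values, split=False):
    """Build stem-and-leaf rows for whole numbers (stem = tens, leaf = units).

    With split=True each stem gets two lines: leaves 0-4 and leaves 5-9.
    """
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        half = leaf // 5 if split else 0  # 0 = lower half, 1 = upper half
        rows[(stem, half)].append(leaf)
    return [(stem, "".join(str(leaf) for leaf in leaves))
            for (stem, _), leaves in sorted(rows.items())]


# Hypothetical ages, invented for this illustration.
ages = [55, 60, 64, 65, 67, 71, 71, 72, 75, 79, 80, 83]

for stem, leaves in stem_and_leaf(ages, split=True):
    print(f"{stem} | {leaves}")
```

Calling `stem_and_leaf(ages)` without `split=True` gives one line per stem. Every leaf is still an actual data value, which is why no information is lost.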
Activity 2.1
Question 1
The following stem-and-leaf plot gives the ages of people living in block A of a retirement village:
5 | 5
6 | 04
6 | 57
7 | 1112
7 | 5799
8 | 01334
8 | 666789
9 | 0124
9 | 78

Which statement is incorrect?
1. There are 39 data points in the stem-and-leaf display.
2. The youngest resident is 55 years old.
3. 8 persons are between 70 and 80 years old.
4. There are no outliers in this data set.
5. Half of the people are younger than 83.

Question 2
The following stem-and-leaf display is for a set of values where the stem is formed by the units and
the leaf represents the decimal digits.
3 | 0167
4 | 333
5 | 258
6 | 99
7 | 4

Which one of the following statements is incorrect?
1. The number with the highest frequency is 4.3.
2. 13 values are represented in this stem-and-leaf display.
3. The original data are: 3.0, 3.1, 3.6, 3.7, 4.3, 5.2, 5.5, 5.8, 6.9 and 7.4.
4. The smallest and largest values are 3.0 and 7.4 respectively.
5. If arranged in increasing order, the middle value will be 4.3.


Three methods to help portray a frequency distribution are the histogram, the percentage polygon
and the cumulative percentage polygon.
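
The numbers behind these three displays all come from one frequency distribution, which can be sketched in Python. The function name `frequency_distribution`, the class limits and the ages are assumptions made for this illustration. Each class includes its lower limit and excludes its upper limit.

```python
def frequency_distribution(values, lower, width, classes):
    """Return rows of (class lower limit, class upper limit, frequency,
    percentage, cumulative percentage) for equal-width classes."""
    counts = [0] * classes
    for value in values:
        k = int((value - lower) // width)  # index of the class containing value
        if not 0 <= k < classes:
            raise ValueError(f"{value} falls outside the class limits")
        counts[k] += 1
    total = len(values)
    rows, running = [], 0
    for k, frequency in enumerate(counts):
        running += frequency
        rows.append((lower + k * width, lower + (k + 1) * width, frequency,
                     100 * frequency / total, 100 * running / total))
    return rows


# Hypothetical ages, invented for this illustration.
ages = [55, 60, 64, 65, 67, 71, 71, 72, 75, 79, 80, 83]

table = frequency_distribution(ages, lower=50, width=10, classes=4)
for low, high, frequency, pct, cum_pct in table:
    print(f"{low} to under {high}: {frequency:>2} {pct:>6.1f}% {cum_pct:>6.1f}%")
```

The frequencies give the heights of the rectangles in the histogram, the percentages are plotted in the percentage polygon, and the cumulative percentages are plotted in the cumulative percentage polygon.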

...................................................................................................
Activity 2.2
Question 1
The following comments refer to histograms. Identify the incorrect statement.
1. Histograms graphically display class intervals as well as class frequencies.
2. Histograms are appropriate for qualitative data.
3. Where stem-and-leaf displays are ideal for small data sets, large data sets are better presented
in histograms.
4. Histograms are good tools for judging the shape of a dataset, provided the sample size is relatively
large.
5. Only estimates of the centre, variability and outliers of a dataset can be determined from a
histogram.
Question 2
The following comments refer to different graphical presentations. Which statement is incorrect?
1. Adjacent rectangles in the histogram share a common side, while those in the bar chart have a
gap between them.
2. A pie chart is a circular display divided into sections based on the number of observations within
the segments.
3. With a stem-and-leaf plot the intervals for the stems are restricted in length, but this is not true for
a histogram.
4. The histogram as well as the bar chart, represents frequencies according to the relative lengths
of a set of rectangles.
5. Boxplots give a direct look at centre, variability, outliers and shape of a dataset.
...................................................................................................

2.3 Scatter diagram and contingency tables


The scatterplot or scatter diagram displays the relationship between two quantitative variables.
These variables are in a relationship where the one is the independent and the other the dependent
variable. It is customary to put the independent variable on the horizontal axis and call it x and the
dependent variable on the vertical axis and call it y. Make sure that you understand the characteristics
of the different relationships. In this module we are only going to consider linear relationships
between the two variables. We are going to concentrate on finding a best-fitting line through the
data points, but that will be done in a later study unit. At this stage we simply look at the scatterplot
and imagine a line through the data, using the so-called eyeball method.
In this study unit we looked at tabulation of data. Even non-statisticians tend to use these displays
when they want to compare categories. Make sure that you can evaluate these presentations when
you come across them in the many different walks of life.
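
A cross-tabulation (contingency table) counts how often each combination of two qualitative variables occurs. The sketch below shows the idea in Python; the function name `cross_tabulate` and the seven observations are invented for illustration.

```python
from collections import Counter


def cross_tabulate(pairs):
    """Count each (row category, column category) combination."""
    counts = Counter(pairs)
    row_labels = sorted({row for row, _ in counts})
    col_labels = sorted({col for _, col in counts})
    # A Counter returns 0 for combinations that never occurred.
    table = {row: {col: counts[(row, col)] for col in col_labels}
             for row in row_labels}
    return row_labels, col_labels, table


# Hypothetical observations: (study status, owns a computer).
students = [("full-time", "yes"), ("full-time", "no"), ("part-time", "yes"),
            ("part-time", "yes"), ("full-time", "yes"), ("part-time", "no"),
            ("part-time", "yes")]

row_labels, col_labels, table = cross_tabulate(students)
print(f"{'':<10}" + "".join(f"{col:>5}" for col in col_labels))
for row in row_labels:
    print(f"{row:<10}" + "".join(f"{table[row][col]:>5}" for col in col_labels))
```

Each cell is a frequency, so both variables must be qualitative or grouped into categories.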
Activity 2.3
Question 1
Which of the following statements is incorrect?
1. Every point in a scatterplot represents observed values for two variables.
2. Two variables have a positive association when the values of both variables are always positive.
3. Two variables have a negative trend when the values of one variable tend to decrease as the
values of the other variable increase.
4. The slope of the best-fit straight line through the points of the scatterplot indicates the relationship
between the two variables.
5. If the datapoints in a scatterplot seem to lie in a horizontal line, then the value of the one variable
cannot be used to predict the value of the other variable.
Question 2
A cross-tabulation is an effective summary display of data for examining
1. a single variable with data that are qualitative or grouped into categories.
2. a single variable with quantitative data.
3. two variables that are both qualitative or grouped into categories.
4. two variables that are both quantitative.
5. two variables, one variable quantitative and the other qualitative, or grouped into categories.
................................................................................................


Feedback on Activities
Activity 2.1
Question 1
Option 1
1. Incorrect. There are 30 data points. (We do not count the stems.) You can simply count the digits
to the right of the vertical bar.
2. Correct. The first stem of 5 is for persons in the fifties and the leaf is a five, making the age 55.
3. Correct. With stem 7 there are eight values, namely 71, 71, 71, 72, 75, 77, 79 and 79.
4. Correct. 55 (smallest number) is not markedly lower than the second lowest value of
60. Furthermore 98 (highest value) is very close to 97 and also not an outlier.

5. Correct. Half of the people are indeed younger than 83. There are 30 people, so half of
them would imply 15 persons. If you look at the ages while counting these ordered ages from
55, 60, 64, 65, ...79, 79, 80, 81, the value 81 belongs to the 15th person.

Question 2
Option 3
3 | 0167
4 | 333
5 | 258
6 | 99
7 | 4
1. Correct. The number with the highest frequency is 4.3 because there are three data points with
the value 4.3. (Highest frequency means the one that occurs the most.)
2. Correct. Count the leaves and you will have 13 values. If you have a problem, look at the complete
list of numbers in the answer to 3 below and count those.
3. Incorrect. The list given in the statement did not include the values that occur more than once.
It gives the original data as: 3.0, 3.1, 3.6, 3.7, 4.3, 5.2, 5.5, 5.8, 6.9 and 7.4. The values 4.3, 4.3
and 6.9 should also have been in the list and then there are 13 datapoints. The complete data are:
3.0, 3.1, 3.6, 3.7, 4.3, 4.3, 4.3, 5.2, 5.5, 5.8, 6.9, 6.9 and 7.4.
4. Correct.
5. Correct. In a stem-and-leaf plot the values are arranged in ascending order. If you have 13 items
arranged in order, then item number 7 is in the middle position with 6 on either side of it. Start
with 3.0, 3.1, 3.6, ... and in position 7 you will find the third 4.3.

Activity 2.2
Question 1
Option 2
Statement 2 is the only statement which is incorrect as histograms are appropriate for quantitative
data.
Question 2
Option 5
Boxplots give a direct look at centre, variability and outliers but not shape.
...................................................................................................
Activity 2.3
Question 1
Option 2
Two variables have a positive association when the values of one variable tend to increase as the
values of the other variable increase.
Question 2
Option 3
All other statements are incorrect.
...................................................................................................

2.4 Summary: objectives of study units 1 and 2


Once you have familiarized yourself with these study units you should be able to
use graphical displays to describe sample data and to gain insight into the nature of the sampled
population
define and evaluate different scales of measurement
categorize qualitative data and evaluate different graphical displays
interpret and compare bar charts, pie charts, histograms and stem-and-leaf displays
inspect relationships between two variables in a scatter diagram
fit a straight line to a scatter diagram making use of the eyeball method
understand the principle of simple tabulation and the method and advantages of cross-tabulation
communicate the information contained in different statistical summary measures


STUDY UNIT 3
STUDY CHAPTER 3

Key questions for this unit


What measures of central tendency are used?
What measures of dispersion are used?
How do you find the relationship between two numerical variables?

3.1 Central tendency


There are three measures of central tendency, namely the
arithmetic mean (an average)
median (the middle value in an ordered arrangement of the data)
mode (the value(s) with the highest occurrence in the list of values)

In the discussion on the arithmetic mean you are introduced to mathematical notations, such as
$\mu$, $\Sigma$, $x_i$, $\bar{x}$... These symbols are like little pictures and you should read them
in that way. If you were in a lecture hall, the lecturer would not say "mew" for $\mu$, but he/she
would say "population mean". Let me give you the words for the symbols as I trust this will help you
a lot and has a double purpose as it helps you to learn.

Symbol                  Pronunciation       Read as
$\mu$                   mew (like a cat)    population mean
$\Sigma$                sigma               the sum of
$x_i$                   ex eye              all x-values
$\Sigma x_i$            sigma ex eye        sum of all the x-values
$\bar{x}$               ex bar              sample mean
$\sum_{i=1}^{N} x_i$                        total of the population values
$\sum_{i=1}^{n} x_i$                        total of the sample values

You can now imagine the lecturer saying the following sentences:

The population mean is equal to the sum of all the data values in the population, divided by how
many they were
for $\mu = \frac{\sum x_i}{N}$

The sample mean is equal to the sum of all the data values in the sample divided by how many they
were
for $\bar{x} = \frac{\sum x_i}{n}$

Note
Be careful not to write $\mu = \frac{\sum x_i}{n}$.
What is wrong? $\mu$ refers to the population and you cannot divide by the number of values $n$ in
the sample if you are referring to the population parameter.
Writing $\bar{x} = \frac{\sum x_i}{N}$ would be just as disastrous!

Note
It will help if you take note of the following:
Greek letters are used for the population parameters, e.g. $\mu$.
Standard alphabet letters are used for the sample statistics, e.g. $\bar{x}$.

The relationship between the mean, median and mode is determined by the shape of the distribution,
which can be symmetric, positively skewed or negatively skewed.
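
The three measures, and what their relationship suggests about the shape, can be computed with Python's standard `statistics` module. This is a sketch under stated assumptions: the function name `describe_centre` and the data are invented, and comparing the mean with the median is only a rule of thumb for the direction of skewness, not a proof of the shape.

```python
import statistics


def describe_centre(values):
    """Return the mean, median, mode(s) and a rule-of-thumb shape label."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    modes = statistics.multimode(values)  # a list, since there may be several modes
    if mean > median:
        shape = "positively skewed"
    elif mean < median:
        shape = "negatively skewed"
    else:
        shape = "roughly symmetric"
    return mean, median, modes, shape


# Hypothetical sample, invented for this illustration.
sample = [2, 3, 3, 4, 5, 6, 12]

mean, median, modes, shape = describe_centre(sample)
print("mean:", mean, "median:", median, "mode(s):", modes, "shape:", shape)
```

The single large value 12 pulls the mean above the median, while the median and the mode are not affected by how large that value is.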


Activity 3.1
Question 1
Read the following statements. The incorrect statement is:
1. The mean is one of the most frequently used measures of central tendency.
2. When the mean is greater than the mode, we say it is negatively skewed.
3. When the mean is greater than the median, we say it is positively skewed.
4. When a distribution is bimodal, it will be impossible for the mean, median, and both modes to be
equal.
5. The measure most affected by extreme values is the mean.

Question 2
Certain measures have been calculated for the following small sample data set:
15   17   23   11   20   45   13

The incorrect calculation is


1. The mean for this data is double the value of the mode.
2. The value 45 is an outlier.
3. The mode for this dataset is 9.
4. The median is 15.
5. Removing the outlier, the mean is 13.
...................................................................................................

3.2 Measures of dispersion


Describing data using only the mean, median and mode can lead to disaster. Remember that
statistical analysis should be based on all the measurements, but because that is not practical in
most of the cases, we make summaries. These summaries have to represent all the data in the best
possible way, which implies that more than a description of centre values of the dataset is needed.
The measures of spread discussed in this section are:
* the range (the difference between the highest and the lowest value)
* quartiles (dividing the data in four equal-sized groups)
* the variance/standard deviation (considering the average deviations from the mean)

The range is a concept that is easy to understand and very few students have problems with it.
The reference to quartiles is much more complex, as the divisions of the data into equal groups lead
to very important features of the dataset. The interquartile range plays a very important role in the
analysis of statistical data. The best application of the quartiles is found in a boxplot. Make sure
that you understand and are able to interpret a given boxplot.
The variance and standard deviation are but one mathematical calculation apart, but in general
people prefer to refer to standard deviation and not to variance. You need to know how to calculate
these measures as well as understand their meaning for when you come across them in different
walks of life.
As we move through the prescribed book, you will find that the number of symbols you have to
recognize increases. Variance and standard deviation have their own symbols and again there are
clear distinctions between the sample and the population measures. Do you want another summary
and sentences?
Symbol        Pronunciation                    Read as
$\sigma$      sigma (same as for $\Sigma$)     population standard deviation
$\sigma^2$    sigma squared                    population variance
$s$                                            sample standard deviation
$s^2$                                          sample variance

Population variance (note division by $N$):
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
The sum of the squares of the differences between the values and the population mean, divided
by the total of the population values.

Sample variance (note division by $n - 1$):
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
The sum of the squares of the differences between the values and the sample mean, divided
by the total of sample values minus one.

Population standard deviation:
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$
Square root of the population variance.

Sample standard deviation:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
Square root of the sample variance.

Note
A reminder that
Greek letters are used for the population parameters, e.g. $\sigma$ and $\sigma^2$.
Standard alphabet letters are used for the sample statistics, e.g. $s$ and $s^2$.
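The difference between dividing by N and dividing by (n − 1) is easy to see in a small calculation. The sketch below is only an illustration: the data values are hypothetical and the function names are chosen for this example, not taken from the prescribed book.

```python
import math

def population_variance(values):
    """Sigma squared: squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """s squared: squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

# Hypothetical data, used only to illustrate the two formulas.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sigma_squared = population_variance(data)   # treats data as a whole population
s_squared = sample_variance(data)           # treats data as a sample

sigma = math.sqrt(sigma_squared)            # population standard deviation
s = math.sqrt(s_squared)                    # sample standard deviation

print(sigma_squared, sigma)
print(s_squared, s)
```

For the same values the sample variance is always the larger of the two, because the same sum of squares is divided by a smaller number.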


Make sure that you will be able to draw and interpret a boxplot.
For the interpretation and uses of the standard deviation, Chebyshev's theorem and the Empirical
Rule are described.
Activity 3.2
Question 1
Given below are the summary statistics for data described as "fastest ever driven".
Suppose the speed is given in kilometres per hour, then:

             Males (87 students)     Females (102 students)
Median       110                     89
Quartiles    95     120              80     95
Extremes     55     150              30     130

Determine
1. the fastest speed driven by anyone in the group
2. the slowest speed driven by a male
3. the cut-off speed indicating that 25% of the men drove at that speed or faster
4. the proportion of females who had driven 89 km/h or faster
5. the number of females who had driven 89 km/h or faster
6. the differences (if any) between male and female drivers and interpret your answers
7. the range for males and females
8. the interquartile range for male and female drivers
9. the distribution of male and female data by presenting it in two boxplots
Question 2
Which of the following statements is correct?
1. The range is found by taking the difference between the high and low values and dividing by 2.
2. The interquartile range is found by taking the difference between the 1st and 3rd quartiles and
dividing that value by 2.
3. The mean is a measure of the deviation in a data set.
4. The standard deviation is expressed in terms of the original units of measurement but the variance
is not.
5. The median is a measure of dispersion.
...................................................................................................


3.3 Measures of relationship


The two measures discussed here are the covariance and the coefficient of correlation.
Covariance
* Measures the strength of a linear relationship between two numerical variables

Coefficient of correlation (r)
* Measures the strength of a linear relationship between two numerical variables
* The value of r determines direction and strength
* −1 ≤ r ≤ 1
  Strength increases as the value of r is closer to +1 or to −1
* r > 0: variables are directly related
* r < 0: variables are indirectly related
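To see how the sign and size of r behave, a small calculation can help. This is a minimal sketch: the data are hypothetical, and the sample covariance formula used (products of deviations divided by n − 1) is the standard textbook definition, which this guide does not spell out.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Coefficient of correlation r: covariance divided by both standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical data: y_up rises with x, y_down falls as x rises.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]

r_up = correlation(x, y_up)       # close to +1: directly related
r_down = correlation(x, y_down)   # close to -1: indirectly related
print(r_up, r_down)
```

Unlike the covariance, r has no units and always lies between −1 and +1, which is why it is the easier of the two to interpret.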

Activity 3.3
Question 1
Identify the correct statement:
1. If the coefficient of correlation r is positive, the dependent variable y and the independent variable
x are said to be inversely related.

2. No indication of the value of the sample coefficient of correlation can be determined from a scatter
plot.
3. If the coefficient of correlation r = 1, then the best-fit linear equation will include all of the data
points.

4. The coefficient of correlation r is a number that only indicates the direction of the relationship
between the dependent variable y and the independent variable x.
5. If the coefficient of correlation = 0, then there is a linear relationship between the dependent
variable y and the independent variable x.
...................................................................................................


Feedback to activities
Activity 3.1
Question 1
Option 2
When the mean is greater than the mode, we say it is positively skewed.
Question 2
Option 5
1. Correct.
The mean of the given data is 18 and the mode is 9 (there are two 9s), and we all know that 18 is
double 9.
2. Correct.
The value nearest to 45 is 23 and the data consist in general mostly of much smaller numbers, so
45 can be considered an outlier.

3. Correct.
We have already discussed the mode and saw that it is 9.
4. Correct.
To determine the median the data must be ordered (from small to large or vice versa):
9, 9, 11, 13, 15, 17, 20, 23, 45
If there are nine values, the middle one is in position five (four values on each side). In this position
we have the 15.
5. Incorrect.
Remove 45 and the total is 117, which must be divided by 8. The answer should have been 14.625
(117/8). If you are interested, the incorrect answer given was calculated by dividing 117 by 9 instead
of 8.
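The five checks in this feedback can be reproduced with a few lines of code. The sketch below uses the nine ordered values listed above and Python's standard statistics module; the variable names are chosen for this illustration only.

```python
from statistics import mean, median, mode

# The nine values from the feedback, already ordered.
data = [9, 9, 11, 13, 15, 17, 20, 23, 45]

data_mean = mean(data)       # sum of the values divided by how many there are
data_median = median(data)   # middle value of the ordered data
data_mode = mode(data)       # value that occurs most often

# Remove the outlier 45 and recompute the mean with the remaining 8 values.
without_outlier = [x for x in data if x != 45]
mean_without_outlier = mean(without_outlier)

print(data_mean, data_median, data_mode, mean_without_outlier)
```

Note how strongly the single value 45 pulls the mean upwards, while the median and mode are not affected by it at all.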
...................................................................................................

Activity 3.2
Question 1

             Males (87 students)     Females (102 students)
Median       110                     89
Quartiles    95     120              80     95
Extremes     55     150              30     130

1. The fastest speed driven by anyone in the group is 150 km/h. Male and female top speeds are
indicated in the table as extremes. (Maximum for females is 130 km/h.)
2. The slowest speed driven by a male is 55 km/h.
3. The cut-off speed indicating that 25% of the men drove at that speed or faster, implies the value
of the upper quartile for males, which is 120 km/h.
4. To find the proportion of females who had driven 89 km/h or faster, you have to notice that 89 is the
value of the female median. The median divides the data into two equal parts, so the proportion
of the data above 89 is 50% (expressed as a percentage) or 0.5 expressed as a fraction.
5. The number of females who had driven 89 km/h or faster is (as said), half of the number of women.
If there are 102 female students, half of them will be 51.
6. Use the table to interpret the differences between male and female drivers. Some of the obvious
differences are the following:
(a) The median speed for males is higher than that for females.
(b) The highest speed was recorded by a male.
(c) The lowest speed was recorded by a female (which is what can be expected if the median values
differ the way they do).
(d) The upper quartile of females is the same as the lower quartile of the males (speed 95).
7. Range.
Males: (150 − 55) = 95
Females: (130 − 30) = 100
8. Interquartile range for male and female are respectively
(120 − 95) = 25 and (95 − 80) = 15.


9. [Figure: two boxplots, one for the male and one for the female speed data]
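The range and interquartile range in answers 7 and 8 follow directly from the five-number summaries in the table. As a minimal sketch (the dictionary layout and function names are assumptions made for this illustration):

```python
# Five-number summaries read from the table, in km/h.
summary = {
    "males":   {"min": 55, "q1": 95, "median": 110, "q3": 120, "max": 150},
    "females": {"min": 30, "q1": 80, "median": 89,  "q3": 95,  "max": 130},
}

def value_range(s):
    """Range: highest value minus lowest value."""
    return s["max"] - s["min"]

def interquartile_range(s):
    """Interquartile range: upper quartile minus lower quartile."""
    return s["q3"] - s["q1"]

for group, s in summary.items():
    print(group, "range:", value_range(s), "IQR:", interquartile_range(s))
```

These five values per group are also exactly what is needed to draw each boxplot: the box runs from the lower to the upper quartile, with a line at the median.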

Question 2
Which of the following statements is correct?
1. Incorrect.
The range is found by taking the difference between the high and low values - not divided by 2.
2. Incorrect.
The interquartile range is found by taking the difference between the 1st and 3rd quartiles - not
divided by 2.
3. Incorrect.
The mean is a measure of central tendency.
4. Correct.
The standard deviation is expressed in terms of the original units of measurement, but the
variance is not.
5. Incorrect.
The median is a measure of central tendency.
...................................................................................................

Question 3
a. The median is approximately 37.5 defects per day. The first quartile is approximately 37 defects
per day. The third quartile is approximately 39 defects per day.
b. The asterisks at the right are outliers, indicating two days on which unusually large numbers of
defects were produced. The production supervisor should try to determine if anything out of the
ordinary was happening at the plant on those days.
c. The distribution is positively skewed. Look at the position of the median and you will understand
this answer.
Activity 3.3
Question 1
Option 3
The corrected statements are:
1. If the coefficient of correlation r is positive, the dependent variable y and the independent variable
x are said to be directly related.

2. If the coefficient of determination is 0.81, the coefficient of correlation r can be 0.90 or −0.90.
3. Statement is correct.
4. The coefficient of correlation r is a number that indicates the direction as well as the strength of
the relationship between the dependent variable y and the independent variable x.
5. If the coefficient of correlation = 0, then there is no linear relationship whatsoever between the
dependent variable y and the independent variable x.
...................................................................................................

3.4 Summary: objectives of study unit 3


Once you have familiarized yourself with this chapter you should be able to
* compare the considerations when using the mean and median and mode of quantitative data
* understand the influence of outliers on the mean, median and mode and their correlation with the
  shape of the distribution
* evaluate the meaning of dispersion as conveyed by the range, the quantiles, MAD and
  variance/standard deviation
* say if values in a given data set are outliers or not with special reference to a box-and-whisker plot
* use the standard deviation and mean to determine the coefficient of variation for both sample and
  population
* explore different measures of association to determine the direction and strength of relationships


STUDY UNIT 4
STUDY CHAPTER 4

Key questions for this unit


Define probability. Distinguish between the three types of probability.
What is meant by the following concepts: an event, a joint event, the complement
of an event and the sample space?
Under what conditions does P (A/B) = P (A)?
How would you construct new sets from old ones by forming subsets, unions,
intersections, complements, differences and symmetric differences?
What does it mean if we say that two events are mutually exclusive?
Why can't mutually exclusive events also be independent?

4.1 Introduction to this study unit


This unit introduces the basic concepts of probability. It outlines rules and techniques for assigning
probabilities to events. Probability plays a critical role in statistics. All of us form simple probability
conclusions in our daily lives. Sometimes these determinations are based on facts, while others are
subjective. If the probability of an event is high, one would expect it to occur rather than not
to occur. If the probability of rain is 95%, it is more likely to rain than not to rain.
The principles of probability help bridge the worlds of descriptive statistics and inferential statistics.
Reading this unit will help you learn different types of probabilities, how to compute probability, and
how to revise probabilities in light of new information. Probability principles are the foundation for
the probability distribution, the concept of mathematical expectation, and the binomial and Poisson
distributions, topics that are discussed in study unit 5.

Activity 4.1: Overview

Study skill

Draw a mind-map of the different sections/headings you will deal with in this study session. Then
page through the unit with the purpose of completing the map.
...................................................................................................
[Mind map: Probability Rules]

Complementary events A and A^C
    P (A^C) = 1 − P (A)

Conditional probability
    P (A/B) = P (A and B) / P (B)

Events A and B are mutually exclusive
    P (A and B) = 0

Events A and B are independent
    P (A and B) = P (A) × P (B)
    P (A/B) = P (A)

Union (A or B): Addition Rule
    P (A or B) = P (A) + P (B) − P (A and B)
    If A and B are MUTUALLY EXCLUSIVE then P (A or B) = P (A) + P (B)

Intersection (A and B, joint probability): Multiplication Rule
    P (A and B) = P (A/B) × P (B)
    If A and B are INDEPENDENT then P (A and B) = P (A) × P (B)


Activity 4.2: Concepts Conceptual skill Communication skill


Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline; can
you think of some examples?
...................................................................................................
English term                Description                Term in your home language
Probability
Event
Joint event
Exhaustive event
Venn diagram
Complement of event
Sample space
Simple probability
Joint probability
Marginal probability
General addition rule
Conditional probability
Decision trees
Independence
Multiplication rules
Bayes' theorem

4.2 Assigning probability to an event


This section describes procedures for assigning probabilities to events and outlines the basic
requirements that must be satisfied by probabilities assigned to simple events. Probabilities can
be assigned to simple events (or, for that matter, to any events) using the classical approach, the
relative frequency approach, or the subjective approach.
Whatever method is used to assign probabilities to the simple events that form a sample space, the
following two basic requirements must be satisfied:
1. Each simple event probability must lie between 0 and 1, inclusive.
2. The probabilities assigned to the simple events in a sample space must sum to 1.

The probability of any event A is then obtained by summing the probabilities assigned to the simple
events contained in A.
How do I know whether I should combine two events A and B using "and" or "or"?

Solution:
The key here is to fully understand the meaning of the combined statement.
P (A and B) = probability that A and B both occur, while P (A or B) = probability that A or B or both
occur. Sometimes it will be necessary to reword the statement of a given event so that it conforms
with one of the two expressions given above.
For example, suppose your friend Rajab is about to write two exams and you define the events as
follows:
A: Rajab will pass the statistics examination.
B: Rajab will pass the accounting examination.

The event "Rajab will pass at least one of the two exams" can be reworded as "Rajab will either pass
the statistics exam or he will pass the accounting exam, or he will pass both exams". This new event
can therefore be denoted (A or B).
On the other hand, the event "Rajab will not fail either exam" is the same as "Rajab will pass both his
statistics exam and his accounting exam". This event can therefore be denoted (A and B).
Example 4.1
An investor has asked his stockbroker to rate three stocks (A, B , and C ) and list them in the order in
which she would recommend them. Consider the following events:
L: Stock A doesn't receive the lowest rating.
M: Stock B doesn't receive the lowest rating.
N : Stock C receives the highest rating.

(i) Define the random experiment and list the simple events in the sample space.
(ii) List the simple events in each of the events L, M , and N .
(iii) List the simple events belonging to each of the following events:
(L or N), (L and M), and M^C (the complement of M).

(iv) Is there a pair of mutually exclusive events among L, M , and N ?


(v) Is there a pair of exhaustive events among L, M , and N ?

Solution:
(i) The random experiment consists of observing the order in which the stockbroker recommends
the three stocks. The sample space consists of the set of all possible orderings:
S = {ABC, ACB, BAC, BCA, CAB, CBA}

(ii) L = {ABC, ACB, BAC, CAB}; M = {ABC, BAC, BCA, CBA}; N = {CAB, CBA}
(iii) The event (L or N) consists of all simple events in L or N or both:
(L or N) = {ABC, ACB, BAC, CAB, CBA}
The event (L and M) consists of all simple events in both L and M:
(L and M) = {ABC, BAC}
The complement of M consists of all simple events that do not belong to M:
M^C = {ACB, CAB}

(iv) No, there is not a pair of mutually exclusive events among L, M , and N , since each pair of events
has at least one simple event in common.
(L and M ) = {ABC, BAC}
(L and N ) = {CAB}
(M and N ) = {CBA}

(v) Yes, L and M are an exhaustive pair of events, since every simple event in the sample space is
contained either in L or M, or both. That is, (L or M) = S
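The solution to Example 4.1 can be checked by listing the sample space with a short program. This is only a sketch: it assumes, as the solution above does, that each ordering lists the stocks from highest to lowest rating, so the last letter is the stock with the lowest rating.

```python
from itertools import permutations

# All possible orderings of the three stocks, e.g. "ABC" means A rated highest, C lowest.
sample_space = {"".join(p) for p in permutations("ABC")}

L = {o for o in sample_space if o[-1] != "A"}   # A does not receive the lowest rating
M = {o for o in sample_space if o[-1] != "B"}   # B does not receive the lowest rating
N = {o for o in sample_space if o[0] == "C"}    # C receives the highest rating

L_or_N = L | N                  # union: in L or N or both
L_and_M = L & M                 # intersection: in both L and M
M_complement = sample_space - M # everything not in M

# A pair is mutually exclusive if it shares no simple event,
# and exhaustive if its union is the whole sample space.
mutually_exclusive_pairs = [pair for pair in [(L, M), (L, N), (M, N)]
                            if not (pair[0] & pair[1])]
L_and_M_exhaustive = (L | M) == sample_space

print(sorted(L_or_N), sorted(L_and_M), sorted(M_complement))
print(mutually_exclusive_pairs, L_and_M_exhaustive)
```

Set union, intersection and difference correspond exactly to the words "or", "and" and "complement" used for events.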

4.3 Calculation of probability


Probability can be regarded as a fraction. In a multiple choice test, a typical question has 5 possible
answers. If an examination candidate makes a random guess on one such question, what is the
probability that the response is wrong?
Solution:
The probability is 4/5, because out of the 5 answers there are 4 ways to answer incorrectly. Each
question can be presented as follows:
Type of answer     Number     Probability
Correct            1          1/5
Incorrect          4          4/5
Total              5          5/5 or 1

Before we calculate the probabilities, let us first discuss the meanings of the words.
At least two: This means that two is the minimum value and if we say at least two children, it means
two or three or four or ... children.
P (X ≥ 2) = P (x = 2) + P (x = 3) + P (x = 4) + ...

At most two: This means that two is the maximum value. At most two children means no child or
one child or two children.
P (X ≤ 2) = P (x = 0) + P (x = 1) + P (x = 2)

No more than two: This means that two is the maximum number, that is two or one or zero children.
P (X ≤ 2) = P (x = 0) + P (x = 1) + P (x = 2)

Less than two: This means that two is not included and we are only interested in the values smaller
than two, that is zero or one.
P (X < 2) = P (x = 0) + P (x = 1)

More than two: This means that two is not included and we are only interested in the values larger
than two, that is three, four, five, etc.
P (X > 2) = P (x = 3) + P (x = 4) + P (x = 5) + ...
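The wording of these phrases translates directly into inequality signs. The following sketch uses a hypothetical probability distribution for the number of children X (the probabilities are invented for this illustration) and exact fractions, so that no rounding is involved.

```python
from fractions import Fraction

# Hypothetical distribution of X = number of children; the probabilities sum to 1.
pmf = {
    0: Fraction(1, 10),
    1: Fraction(2, 10),
    2: Fraction(3, 10),
    3: Fraction(3, 10),
    4: Fraction(1, 10),
}

def prob(condition):
    """Add the probabilities of all values of x that satisfy the condition."""
    return sum(p for x, p in pmf.items() if condition(x))

at_least_two = prob(lambda x: x >= 2)    # two is the minimum value
at_most_two = prob(lambda x: x <= 2)     # two is the maximum value ("no more than two")
less_than_two = prob(lambda x: x < 2)    # two is not included
more_than_two = prob(lambda x: x > 2)    # two is not included

print(at_least_two, at_most_two, less_than_two, more_than_two)
```

Notice that "at least two" and "less than two" together cover every possible value exactly once, so their probabilities add up to 1.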

Example 4.2
Consider the following table in which wild azaleas were classified by colour and by the presence and
absence of fragrance.
Fragrance     White     Pink     Orange     Total
Yes           12        60       58         130
No            50        10       10         70
Total         62        70       68         200

If an azalea is randomly selected from the group, which one of the following probabilities is
incorrect?
1. P (has a fragrance) = 130/200
2. P (colour is orange) = 68/200
3. P (is orange and has a fragrance) = 58/200
4. P (is orange, given that it has a fragrance) = 58/130
5. P (has a fragrance, given that it is orange) = 58/130

Solution:
Option (1): Correct
Option (2): Correct
Option (3): Correct
Option (4): Correct
Option (5): Incorrect
P (has a fragrance, given that it is orange) = 58/68
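The five options in Example 4.2 differ only in which count is divided by which total. The sketch below stores the six cell counts of the table and computes each probability; the helper function and variable names are assumptions made for this illustration.

```python
from fractions import Fraction

# Cell counts from the azalea table: (fragrance, colour) -> number of plants.
counts = {
    ("yes", "white"): 12, ("yes", "pink"): 60, ("yes", "orange"): 58,
    ("no", "white"): 50,  ("no", "pink"): 10,  ("no", "orange"): 10,
}
total = sum(counts.values())

def count_where(fragrance=None, colour=None):
    """Add up the cells that match the given fragrance and/or colour."""
    return sum(n for (f, c), n in counts.items()
               if (fragrance is None or f == fragrance)
               and (colour is None or c == colour))

p_fragrance = Fraction(count_where(fragrance="yes"), total)            # option 1
p_orange = Fraction(count_where(colour="orange"), total)               # option 2
p_orange_and_fragrance = Fraction(count_where("yes", "orange"), total) # option 3

# Conditional probabilities: divide by the total of the GIVEN group only.
p_orange_given_fragrance = p_orange_and_fragrance / p_fragrance        # option 4
p_fragrance_given_orange = p_orange_and_fragrance / p_orange           # option 5

print(p_orange_given_fragrance, p_fragrance_given_orange)
```

Fractions are printed in lowest terms, so 58/130 appears as 29/65 and 58/68 as 29/34. The two conditional probabilities share the same numerator but have different denominators, which is exactly the mistake in option 5.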

General Addition Rule


When two events A and B can occur simultaneously, the general addition rule is applied for finding
P (A or B) = probability that event A occurs or event B occurs or both occur.

Formula: P (A or B) = P (A) + P (B) − P (A and B)


Example 4.3
Consider the table of wild azaleas (example 4.2). If event A denotes "the flower is orange", and event
B "it has a fragrance", then:

P (A or B) = 68/200 + 130/200 − 58/200 = 140/200

Note that the word OR in probability theory denotes ADDITION.


If A and B cannot occur simultaneously then P (A and B) = 0.
Events A and B are mutually exclusive if they cannot occur simultaneously
Example 4.4
The distribution of blood types in a certain country is roughly as follows:
A: 41%     B: 9%     AB: 4%     O: 46%

An individual is brought into the emergency room after an automobile accident. What is the
probability that he will be of type A or B or AB?


P (A or B or AB) = P (A) + P (B) + P (AB)
= 0.41 + 0.09 + 0.04
= 0.54
Since it is impossible for one individual to have two different blood types, these events are mutually
exclusive.
For many mutually exclusive events A1, A2, A3, ..., An the addition rule may be written as:
P (A1 or A2 or A3 or ... or An) = P (A1) + P (A2) + P (A3) + ... + P (An)

Multiplication Rule
The multiplication rule finds the probability that events A and B both occur. Two events are
independent if one may occur irrespective of the other. For example, events A, "the patient has
tennis elbow", and B, "the patient has appendicitis", are intuitively independent.
Formula (for independent events): P (A and B) = P (A) × P (B)
For many independent events A1, A2, ..., An, the multiplication rule can be written as:
P (A1 and A2 and ... and An) = P (A1) × P (A2) × ... × P (An)

Note that in probability the word AND denotes Multiplication.


Example 4.5

The probability that a certain plant will flower during the first summer is 0.6. If five plants are planted,
calculate the probability that all of them will have flowers during the first summer.
The probability is: 0.6 × 0.6 × 0.6 × 0.6 × 0.6 = 0.078.
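Example 4.5 can be reproduced in a few lines. The sketch below assumes, as the example does, that the five plants flower independently of one another; the last two lines go one step beyond the example and use the complement rule to find the probability that at least one plant flowers.

```python
p_flower = 0.6   # probability that one plant flowers in the first summer
n_plants = 5

# Multiplication rule for independent events: multiply the five probabilities.
p_all_flower = p_flower ** n_plants

# Complement rule: "at least one flowers" is the complement of "none flowers".
p_none_flower = (1 - p_flower) ** n_plants
p_at_least_one = 1 - p_none_flower

print(round(p_all_flower, 3))
print(round(p_at_least_one, 3))
```

Multiplying probabilities always makes the result smaller, so "all five flower" is far less likely than any single plant flowering.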

Computation of objective probabilities (Section 4.1 and 4.2 of Levine textbook).
Objective probabilities can be classified into 3 categories. The categories are:
Marginal probability
Joint probability
Conditional probability

The definition and computation of each type is described next.


Marginal Probability:
A marginal probability is the probability of only a single event A occurring.
It is written as: P (A)
A single event is an event that describes the outcomes of one random variable only. A frequency
distribution describes the occurrence of only one characteristic of interest at a time and it is used to
estimate marginal probabilities.
Example 4.6
In table 4.1 the random variable industry type is described by the frequency distribution in the row
total column
Table 4.1
Industry type     Number of JSE firms
Mining            35
Finance           72
Service           10
Retail            33

Let B = event (finance). Then P (B) = 72/150 = 0.48

Joint probability
A joint probability is the probability of both event A and event B occurring simultaneously on a given
trial of a random experiment. A joint event describes the behaviour of two or more random variables
(i.e. the characteristics of interest) simultaneously. It is written as:
P (A and B)


Example 4.7
Table 4.2
                  Company size (in R million turnover)
Industry     Small                  Medium                  Large               Total
             (0 to less than 10)    (10 to less than 50)    (50 and above)
Mining       0                      0                       35                  35
Finance      9                      21                      42                  72
Service      6                      3                       1                   10
Retail       14                     13                      6                   33
Total        29                     37                      84                  150

Let A = event (small company)


Let B = event (finance company)
There are 9 out of 150 JSE listed companies in the sample which are both small and finance
companies
Then P (A and B) = 9/150 = 0.06

Conditional probability
Conditional probability is the probability of one event A occurring given information about the
occurrence of another event B.
A conditional event describes the behaviour of one random variable in the light of known additional
information about a second random variable.
Conditional probability is defined as:
P (A/B) = P (A and B) / P (B)

The essential feature of the conditional probability is that the sample space is reduced to the
outcomes describing event B (the given prior event) only, and not all possible outcomes as for
marginal and joint probabilities.
Example 4.8
Let A = event (large company)
Let B = event (retail company)

Then P (A/B) is the probability of selecting a company from the JSE sample which is large given that the
company is known to be a retail company.
             Small     Medium     Large     Total
Retail       14        13         6         33

There are 6 large companies out of 33 retail companies (Refer to table 4.2)
Thus P (A/B) = 6/33 = 0.1818

Using the formula:
P (A and B) = 6/150 (a joint probability)
P (B) = 33/150 (a marginal probability)

Then P (A/B) = (6/150) / (33/150) = 6/33 = 0.1818 (a conditional probability)
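The three types of probability in Examples 4.6 to 4.8 can all be computed from the cell counts of Table 4.2. The sketch below is one possible way to organise the calculation; the function names are chosen for this illustration only.

```python
from fractions import Fraction

# Table 4.2: industry -> company size -> number of JSE firms.
table = {
    "Mining":  {"Small": 0,  "Medium": 0,  "Large": 35},
    "Finance": {"Small": 9,  "Medium": 21, "Large": 42},
    "Service": {"Small": 6,  "Medium": 3,  "Large": 1},
    "Retail":  {"Small": 14, "Medium": 13, "Large": 6},
}
grand_total = sum(sum(row.values()) for row in table.values())

def p_industry(industry):
    """Marginal probability: row total divided by the grand total."""
    return Fraction(sum(table[industry].values()), grand_total)

def p_joint(industry, size):
    """Joint probability: one cell divided by the grand total."""
    return Fraction(table[industry][size], grand_total)

def p_size_given_industry(size, industry):
    """Conditional probability: joint probability divided by the marginal of the given event."""
    return p_joint(industry, size) / p_industry(industry)

print(float(p_industry("Finance")))                       # Example 4.6
print(float(p_joint("Finance", "Small")))                 # Example 4.7
print(round(float(p_size_given_industry("Large", "Retail")), 4))  # Example 4.8
```

In the conditional case the grand total of 150 cancels out, which is why the answer is simply 6 out of the 33 retail companies.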

A conditional probability, denoted P (A/B) is the probability that an event A will occur given that we
know that an event B has already occurred.
The key to recognizing a conditional probability is to look for the words "given that" or their equivalent.
For example, the statement of a conditional probability might read "The probability that A will occur
when B occurs" or "The probability that A will occur if B occurs". In each of these cases, you can
reword the statement using "given that" instead of "when" or "if". Therefore, both of these statements
refer to conditional probabilities.
Probability Rules
This section outlines three rules of probability that allow you to calculate the probabilities of three
special events [A^C, (A or B), and (A and B)] from known probabilities of various related events. The
three rules are as follows:
1. Complement Rule: P (A^C) = 1 − P (A)
2. Addition Rule: P (A or B) = P (A) + P (B) − P (A and B)
3. Multiplication Rule: P (A and B) = P (A) P (B/A) or P (A and B) = P (B) P (A/B)


We note that the addition rule and the multiplication rule can be expressed more simply under certain
conditions.
If A and B are mutually exclusive events, then P (A and B) = 0, so the addition rule becomes:
P (A or B) = P (A) + P (B).

If A and B are independent events, then the multiplication rule becomes: P (A and B) = P (A) × P (B)
Activity 4.1: Self-assessment exercises

Application skills

Question 1
A soft drink company holds a contest in which a prize may be revealed on the inside of the bottle
cap. The probability that each bottle cap reveals a prize is 0.1 and winning is independent from one
bottle to the next. What is the probability that a customer wins a prize when opening his third bottle?
1. (0.1)(0.1)(0.9) = 0.009
2. (0.9)(0.9)(0.1) = 0.081
3. (0.9)(0.9) = 0.81

4. 1 − (0.1)(0.1)(0.9) = 0.991
5. (0.9)(0.9)(0.9) = 0.729

Question 2
Suppose two people each have to select a number from 00 to 99 (therefore 100 possible choices).
The probability that they both pick the number 13 is
1. 2/100
2. 1/100
3. 1/200
4. 1/10 000
5. 2/10 000

Question 3
Use the same information as in question 2. The probability that both persons pick the same number
is equal to
1. 2/100
2. 1/100
3. 1/200
4. 1/10 000
5. 2/10 000

Question 4
For three mutually exclusive events the probabilities are as follows:
P (A) = 0.2, P (B) = 0.7 and P (A or B or C) = 1.0. The value of P (A or C) is equal to

1. 0.3
2. 0.5

3. 0.9
4. 0.6
5. 0.1
...................................................................................................
Feedback to Activity 4.1:

Application skills

Question 1
Option 2.
The first two bottles must definitely not reveal a prize on the inside of the bottle top. The probability
of not winning with one bottle is 1 − 0.1 = 0.9. Probability not to win with 2 bottles is (0.9)(0.9) = 0.81.

Probability of winning with the third bottle is (0.9)(0.9)(0.1) = 0.081

...................................................................................................
Question 2
Option 4.
For the first person to select the number 13 the probability is 1/100. The choice of the second person
is independent from the first and the probability to select 13 for the second person is also 1/100. To
have the probability of two independent events use the rule
P (A and B) = P (A) × P (B) = 1/100 × 1/100 = 1/10 000.

Question 3
Option 2.
In the previous question we found the probability that both selected one specific number (13) was
1/10 000. For this question any doubling of numbers is considered for the probability. There can be
two ones, two twos, ..., two ninety-nines and if you count these possibilities there are 100 such double
combinations.
P (select the same number) is 100 × 1/10 000 = 1/100.

Question 4
Option 1
Given that P (A or B or C) = 1.0 plus the fact that these events are mutually exclusive.
Therefore P (A) + P (B) + P (C) = 1.0.
Filling in the given probabilities we get 0.2 + 0.7 + P (C) = 1.0. Solve this for P (C), then P (C) = 0.1.
Therefore, P (A or C) = P (A) + P (C) = 0.2 + 0.1 = 0.3
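The answers to Questions 2 and 3 can be confirmed by counting all possible pairs of choices. This sketch assumes that every number from 00 to 99 is equally likely and that the two people choose independently, so that each of the 10 000 pairs has the same probability.

```python
from fractions import Fraction
from itertools import product

numbers = range(100)                        # the choices 00 to 99
pairs = list(product(numbers, repeat=2))    # (first person's number, second person's number)

# Question 2: both people pick the number 13.
both_13 = sum(1 for a, b in pairs if a == 13 and b == 13)
p_both_13 = Fraction(both_13, len(pairs))

# Question 3: both people pick the same number, whichever number it is.
same_number = sum(1 for a, b in pairs if a == b)
p_same_number = Fraction(same_number, len(pairs))

print(len(pairs), both_13, same_number)
print(p_both_13, p_same_number)
```

Only one of the 10 000 pairs is (13, 13), but 100 of them are "doubles", which is why the second probability is 100 times larger than the first.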


Tree diagrams
The following contingency table shows sampled data for four regions in South Africa in which people
live and three types of music the people listen to. We are going to use this table to illustrate different
types of probability.
...................................................................................................
                  Classical     Jazz     Rock
Limpopo           50            40       85
Gauteng           105           85       160
Free State        80            70       125
KwaZulu-Natal     65            55       80

To use the table we have to total the number of people in each region and the number of people who
listen to each of the types of music.
                  Classical     Jazz     Rock     Total
Limpopo           50            40       85       175
Gauteng           105           85       160      350
Free State        80            70       125      275
KwaZulu-Natal     65            55       80       200
Total             300           250      450      1000

Tree diagrams
The information given in the contingency table above can just as well be given in a probability tree
as follows.
...................................................................................................
Tree 1

Region (Marginal)           Type of music (Conditional)
Limpopo 175/1000            Classical 50/175      Jazz 40/175      Rock 85/175
Gauteng 350/1000            Classical 105/350     Jazz 85/350      Rock 160/350
Free State 275/1000         Classical 80/275      Jazz 70/275      Rock 125/275
KwaZulu-Natal 200/1000      Classical 65/200      Jazz 55/200      Rock 80/200

Along each branch: (Marginal) × (Conditional) = (Joint)


Notice that the regional probabilities are all marginal probabilities and the probabilities for each type
of music in each of the four regions are all conditional probabilities. So, the probability that someone
listens to classical music, for example, changes depending on the region in which he/she lives.
The probability tree we developed above shows the regions first and the types of music preferred
as contingent upon the region in which someone lives. But, suppose we want to show the tree the
other way around; that is, suppose we wanted to show the types of music first and the regional
identifications as contingent upon the types of music preferred.
Tree 2

Type of music (Marginal)     Region (Conditional)
Classical 300/1000           Limpopo 50/300     Gauteng 105/300     Free State 80/300      KwaZulu-Natal 65/300
Jazz 250/1000                Limpopo 40/250     Gauteng 85/250      Free State 70/250      KwaZulu-Natal 55/250
Rock 450/1000                Limpopo 85/450     Gauteng 160/450     Free State 125/450     KwaZulu-Natal 80/450

Along each branch: (Marginal) × (Conditional) = (Joint)

Notice now that the music types are marginal probabilities and the probabilities for each region are
all conditional probabilities. So, the probability that someone lives in Limpopo, for example, changes
depending on which type of music he/she listens to.
The joint probabilities can be calculated directly from the contingency table or using the tree diagrams
e.g.
P (Classical and Gauteng) = P (Classical) × P (Gauteng/Classical)
                          = 300/1000 × 105/300
                          = 105/1000
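Both trees must lead to the same joint probability, whichever variable comes first. The sketch below checks this for the music table by multiplying the marginal and conditional probability along a branch of each tree; the function names are assumptions made for this illustration.

```python
from fractions import Fraction

# Contingency table: region -> type of music -> number of people.
counts = {
    "Limpopo":       {"Classical": 50,  "Jazz": 40, "Rock": 85},
    "Gauteng":       {"Classical": 105, "Jazz": 85, "Rock": 160},
    "Free State":    {"Classical": 80,  "Jazz": 70, "Rock": 125},
    "KwaZulu-Natal": {"Classical": 65,  "Jazz": 55, "Rock": 80},
}
grand_total = sum(sum(row.values()) for row in counts.values())

def region_total(region):
    return sum(counts[region].values())

def music_total(music):
    return sum(row[music] for row in counts.values())

def joint_via_tree_1(region, music):
    """Tree 1: P(region) x P(music / region)."""
    marginal = Fraction(region_total(region), grand_total)
    conditional = Fraction(counts[region][music], region_total(region))
    return marginal * conditional

def joint_via_tree_2(region, music):
    """Tree 2: P(music) x P(region / music)."""
    marginal = Fraction(music_total(music), grand_total)
    conditional = Fraction(counts[region][music], music_total(music))
    return marginal * conditional

print(joint_via_tree_1("Gauteng", "Classical"))
print(joint_via_tree_2("Gauteng", "Classical"))
```

In both cases the total in the denominator of the conditional probability cancels against the numerator of the marginal probability, leaving the cell count divided by the grand total.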

Activity 4.2: Self-assessment exercises

Application skills

Question 1
Which statement is incorrect?
1. A marginal probability is the probability that an event will occur regardless of any other events.
2. A joint probability is the probability that two or more events will all occur.
3. If P(A) = 0.7 and P(B) = 0.6 and P(A and B) = 0.35, we can conclude that events A and B are
mutually exclusive.
4. Given the same information as in 3, the events A and B cannot be independent.
5. Two events cannot be mutually exclusive as well as independent.

Question 2
A study was conducted at a small college on first-year students living on campus. A number of
variables were measured. The table below provides information regarding number of roommates
and end of term health status for the first-year students at this college. Health status for individuals
is measured as poor, average, and exceptional.
                     Number of roommates
Health status        None    One    Two
Poor                   15     36     65
Average                35     94     40
Exceptional            50     50     25

Which one of the following statements is incorrect?


1. The probability that a randomly selected first-year student with no roommates had poor end of
term health status is 0.15.
2. The probability that a randomly selected first-year student with 1 roommate had poor end of term
health status is 0.20.
3. The events H = {the student has poor health status} and N = {the student has no roommates}
are mutually exclusive.

4. The events H = {the student has poor health status} and N = {the student has no roommates}
are dependent.

5. If you find a person who has only one roommate, you can be 52% sure that he/she has average
health.

Question 3
In the Barana Republic there are two producers of fridges, referred to as Cool and Dry. Assume that
there are no fridge imports. The market shares for these two producers are 70% for Cool and 30%
for Dry. One executive at Dry proposes a longer warranty period to be offered at a slight extra cost as
a plan to increase market share. A market research company appointed by Dry conducts a census
of fridge owners on their opinion of this warranty proposal. Among owners of a fridge made by Cool,
50% like the proposal, 30% are indifferent to it, while the remaining owners oppose it. Among owners
of a fridge made by Dry, 70% like the proposal, 20% are indifferent to it, and the remaining owners
oppose it.
3.1 A fridge owner will be selected at random. What is the probability that the person will own a fridge
made by Dry?
3.2 A fridge owner will be selected at random. What is the probability that the owner will be opposed
to the proposal of a new warranty at extra cost?

Hint: Make a tree diagram.


Feedback to Activity 4.2

Application skills

Question 1
Option 3
1. Correct.
A marginal probability is the probability that an event will occur regardless of any other events.
2. Correct.
A joint probability is the probability that two or more events will all occur.
3. Incorrect.
For mutually exclusive events P (A and B) = 0.
4. Correct.
Given the same information as in 3, the events A and B cannot be independent. If events A and
B were independent, then P(A and B) = P(A) × P(B) = 0.7 × 0.6 = 0.42. However, the problem
says that P(A and B) = 0.35, not 0.42.


5. Correct.

Two events cannot be mutually exclusive as well as independent.


Question 2
Option 3
1. To be able to answer this question it is advisable that you add the row and column totals to the given
table.

                     Number of roommates
Health status        None    One    Two    Total
Poor                   15     36     65      116
Average                35     94     40      169
Exceptional            50     50     25      125
Total                 100    180    130      410

Correct.
Of the 100 students with no roommates, 15 had poor health, and 15/100 = 0.15.

2. Correct.
Of the 180 students with 1 roommate, 36 had poor health, and 36/180 = 0.20.

3. Incorrect.
If these two events were to be mutually exclusive, the cell where poor health and no roommates
cross should have had a zero in (and there is a 15).
4. Correct.
Use the multiplication rule to show that the events are not independent.
Test whether P(H and N) = P(H) × P(N):
P(H and N) = 15/410 = 0.037
P(H) × P(N) = 116/410 × 100/410 = 0.069 ≠ 0.037

This implies that the two events are dependent.


5. Correct.
This is a conditional probability, which you can simply read from the table as 94/180 = 0.5222.
You can also calculate it with the formula
P(A | B) = P(A and B)/P(B) = (94/410)/(180/410) = 0.5222,
where A = {the student has average health} and B = {the student has one roommate}.

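
The checks used in options 3 and 4 above (mutual exclusivity and independence) can be sketched in Python as follows. The variable names and the small tolerance used to compare floating-point numbers are assumptions made for illustration only.

```python
# Counts from the health-status table (rows: health status, columns: roommates).
table = {
    "Poor": {"None": 15, "One": 36, "Two": 65},
    "Average": {"None": 35, "One": 94, "Two": 40},
    "Exceptional": {"None": 50, "One": 50, "Two": 25},
}
grand_total = sum(sum(row.values()) for row in table.values())

# Marginal probabilities of H (poor health) and N (no roommates).
p_h = sum(table["Poor"].values()) / grand_total
p_n = sum(row["None"] for row in table.values()) / grand_total

# Joint probability of H and N, read from the cell where they cross.
p_h_and_n = table["Poor"]["None"] / grand_total

# Mutually exclusive events have a joint probability of zero.
mutually_exclusive = p_h_and_n == 0

# Independent events satisfy P(H and N) = P(H) x P(N).
independent = abs(p_h_and_n - p_h * p_n) < 1e-9

print(round(p_h_and_n, 3), round(p_h * p_n, 3), mutually_exclusive, independent)
```

Comparing with a tolerance rather than `==` is a design choice here, since products of fractions rarely match exactly in floating-point arithmetic.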
Question 3
The following tree diagram gives the question information in a concise manner:

Let A  = Cool fridge
    A' = Dry fridge
    B1 = like proposal
    B2 = indifferent to proposal
    B3 = oppose proposal

Producer (Marginal)     Opinion (Conditional)      (Joint)

Cool   P(A) = 0.7       P(B1 | A) = 0.5            P(A and B1) = 0.35
                        P(B2 | A) = 0.3            P(A and B2) = 0.21
                        P(B3 | A) = 0.2            P(A and B3) = 0.14

Dry    P(A') = 0.3      P(B1 | A') = 0.7           P(A' and B1) = 0.21
                        P(B2 | A') = 0.2           P(A' and B2) = 0.06
                        P(B3 | A') = 0.1           P(A' and B3) = 0.03

3.1 The probability that the person will own a fridge made by Dry is equal to 0.30.
3.2 The probability that the owner will be opposed to the proposal of a new warranty at extra cost is
0.03 + 0.14 = 0.17.
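
The tree calculation for Question 3 can also be sketched in Python. This is only an illustration under the figures given in the question; the dictionary names are hypothetical.

```python
# Marginal probabilities: market share of each producer.
p_producer = {"Cool": 0.7, "Dry": 0.3}

# Conditional probabilities: opinion given the producer of the owner's fridge.
p_opinion_given_producer = {
    "Cool": {"like": 0.5, "indifferent": 0.3, "oppose": 0.2},
    "Dry": {"like": 0.7, "indifferent": 0.2, "oppose": 0.1},
}

# Joint probability for every branch: marginal x conditional.
joint = {
    (producer, opinion): p_producer[producer] * p
    for producer, opinions in p_opinion_given_producer.items()
    for opinion, p in opinions.items()
}

# 3.1: the probability of owning a Dry fridge is the marginal probability.
p_dry = p_producer["Dry"]

# 3.2: add the joint probabilities of all branches that end in "oppose".
p_oppose = sum(p for (_, opinion), p in joint.items() if opinion == "oppose")

print(p_dry, round(p_oppose, 2))
```

Because the branches of a tree are mutually exclusive and together cover every outcome, the joint probabilities should add up to 1, which is a useful sanity check.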

4.4 SELF ASSESSMENT EXERCISE


TEST YOUR KNOWLEDGE
Question 1
Which statement is correct?
1. Probability takes on a value from 0 to 1
2. Probability refers to a number which expresses the chance that an event will occur
3. Probability is zero if the event A of interest is impossible
4. The sample space refers to all possible outcomes of an experiment
5. All the above statements are correct.


Question 2
Assume that X and Y are two independent events with P (X) = 0.5 and P (Y ) = 0.25. Which of the
following statements is incorrect?
1. P(X') = 0.75
2. P(X and Y) = 0.125
3. P(X or Y) = 0.625
4. X and Y are not mutually exclusive
5. P(X | Y) = 0.5

Question 3
Refer to the following contingency table:
Event     D1     D2     D3    Total
C1        75     90    135      300
C2       125    105    120      350
C3        65     60     75      200
C4        35     45     70      150
Total    300    300    400     1000

Which one of the following statements is incorrect?


1. P(C1 and D1) = 0.075
2. P(D1) = 0.3
3. P(C1 or D1) = 0.6
4. P(D3 | C4) = 0.4667
5. P(C4 | D3) = 0.175

Question 4
A sidewalk ice-cream seller sells three flavours: chocolate, vanilla and strawberry. Of his sales 40%
is chocolate, 35% vanilla and 25% strawberry. Sales are by cone or cup. The percentages of cone
sales for chocolate, vanilla and strawberry are 80%, 60% and 40% respectively. Use a tree diagram
to determine the relevant probabilities of a randomly selected sale of one ice cream. Which one of
the following statements is incorrect?
1. P(strawberry) = 0.25
2. P(vanilla in a cup) = 0.14
3. P(chocolate in a cone) = 0.32
4. P(chocolate or vanilla) = 0.75
5. P(vanilla | in a cone) = 0.3889

Question 5
A survey asked people how often they exceed speed limits. The data are then categorized into the
following contingency table of counts showing the relationship between age group and response.

                 Exceed limit if possible
Age              Always    Not always    Total
Under 30            100           100      200
Over 30              40           160      200
Total               140           260      400

Which one of the following statements is incorrect?


1. Among people over 30, the probability of always exceeding the speed limit is 0.20
2. Among people under 30, the probability of always exceeding the speed limit is 0.5
3. The probability that a randomly chosen person is over 30 and does not always exceed the speed
limit is 0.4
4. 10% of the people in the survey always exceed the speed limit
5. Among the people who always exceed the speed limit 71.43% are under 30

Question 6
Numbers 1, 2, 3, 4, 5, 6, 7, 8, 9 are written on separate cards. The cards are shuffled and the top
one turned over. Let A = an even number and B = a number greater than 6.
Which one of the following statements is incorrect?
1. The sample space is S = {1, 2, 3, 4, 5, 6, 7, 8, 9}
2. P (A) = 4/9
3. P (B) = 1/3
4. P (A and B) = 1/9
5. P (A or B) = 7/9

Question 7
If A and B are independent events with P(A) = 0.25 and P(B) = 0.60, then P(A | B) is equal to
1. 0.25
2. 0.60
3. 0.35
4. 0.85
5. 0.15


Question 8
Given that P (A) = 0.7, P (B) = 0.6 and P (A and B) = 0.35, which one of the following statements is
incorrect?
1. P(B') = 0.4
2. A and B are not mutually exclusive
3. A and B are dependent
4. P(B | A) = 0.6
5. P(A or B) = 0.95

Question 9
The Burger Queen Company has 124 locations along the west coast. The general manager
is concerned with the profitability of the locations compared with major menu items sold. The
information below shows the number of each menu item selected by profitability of store.

                     Baby      Mother    Father
                     Burger    Burger    Burger    Nachos    Tacos
                     M1        M2        M3        M4        M5        Total
High profit    R1    250       424       669       342       284       1969
Medium profit  R2    312       369       428       271       200       1580
Low profit     R3    289       242       216       221       238       1206
Total                851       1035      1313      834       722       4755

If a menu order is selected at random, which statement is incorrect?


1. P(M5) = 0.1518
2. P(R3) = 0.0501
3. P(R2 and M3) = 0.0900
4. P(M2 | R2) = 0.2335
5. P(R1 | M4) = 0.4101

Question 10
In a particular country, airport A handles 50% of all airline traffic, and airports B and C handle 30%
and 20% respectively. The detection rates for weapons at the three airports are 0.9, 0.5 and 0.4
respectively.
If a passenger at one of the airports is found to be carrying a weapon through the boarding gate,
what is the probability that the passenger is using airport C?
1. 0.2206
2. 0.6618
3. 0.5000
4. 0.2941
5. 0.1176

4.5 SOLUTIONS TO SELF ASSESSMENT EXERCISES


Question 1
Alternative 5
Question 2
P(X') = 1 - P(X) = 1 - 0.5 = 0.5
P(X and Y) = P(X) × P(Y) = 0.5 × 0.25 = 0.125
P(X or Y) = P(X) + P(Y) - P(X and Y) = 0.5 + 0.25 - 0.125 = 0.625
P(X and Y) = 0.125 ≠ 0, therefore X and Y are not mutually exclusive.
P(X | Y) = P(X and Y)/P(Y) = 0.125/0.25 = 0.5

Alternative (1).
Question 3
P(C1 and D1) = 75/1000 = 0.075
P(D1) = 300/1000 = 0.3
P(C1 or D1) = P(C1) + P(D1) - P(C1 and D1) = 0.3 + 0.3 - 0.075 = 0.525
P(D3 | C4) = 70/150 = 0.4667
P(C4 | D3) = 70/400 = 0.175

Alternative (3).
Question 4

Let C  = chocolate flavour
    V  = vanilla flavour
    Sa = strawberry flavour
    S  = sale in a cone (so that S' = sale in a cup)

P(strawberry) = 0.25
P(vanilla in a cup) = P(V) × P(S' | V) = 0.35 × 0.4 = 0.14
P(chocolate in a cone) = P(C) × P(S | C) = 0.4 × 0.8 = 0.32
P(chocolate or vanilla) = P(C) + P(V) = 0.4 + 0.35 = 0.75

P(vanilla | in a cone) = P(vanilla in a cone)/P(cone)
= P(V) × P(S | V) / [P(C) × P(S | C) + P(V) × P(S | V) + P(Sa) × P(S | Sa)]
= (0.35 × 0.6) / [(0.4 × 0.8) + (0.35 × 0.6) + (0.25 × 0.4)]
= 0.21/0.63
= 0.333

Alternative 5
Question 5
P (always|over 30) = 40/200 = 0.20

P (always|under 30) = 100/200 = 0.5

P (over 30 and not always) = 160/400 = 0.4

Percentage people always exceeding = 140/400 = 0.35 = 35%


P (under 30 |always) = 100/140 = 0.7143 = 71.43%

Alternative 4
Question 6
P(A or B) = P(A) + P(B) - P(A and B) = 4/9 + 1/3 - 1/9 = 6/9 = 2/3

Alternative (5)

Question 7

P(A | B) = P(A and B)/P(B) = P(A) × P(B)/P(B) = P(A) = 0.25
Alternative 1

Question 8
P(B') = 1 - P(B) = 0.4
P(A and B) ≠ 0, so A and B are not mutually exclusive events.
P(A) × P(B) = 0.7 × 0.6 = 0.42 ≠ P(A and B) = 0.35, therefore A and B are dependent events.
P(B | A) = P(A and B)/P(A) = 0.35/0.7 = 0.5
P(A or B) = P(A) + P(B) - P(A and B)
= 0.7 + 0.6 - 0.35 = 0.95

Alternative (4).
Question 9

P (M5 ) = 722/4755 = 0.1518


P (R3 ) = 1206/4755 = 0.2536
P (R2 and M3 ) = 428/4755 = 0.0900
P (M2 |R2 ) = 369/1580 = 0.2335

P (R1 |M4 ) = 342/834 = 0.4101

Alternative (2).
Question 10

Let W = event that a person is carrying a weapon. Use Bayes' theorem.

P(C | W) = P(C) × P(W | C) / [P(A) × P(W | A) + P(B) × P(W | B) + P(C) × P(W | C)]
= (0.2 × 0.4) / [(0.5 × 0.9) + (0.3 × 0.5) + (0.2 × 0.4)]
= 0.08/0.68
= 0.1176
Alternative (5).
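
The Bayes' theorem calculation in Question 10 can be sketched in Python as below. The function name `posterior` and the dictionary layout are assumptions chosen for illustration; the probabilities are those given in the question.

```python
def posterior(priors, likelihoods, target):
    """Return P(target | evidence) using Bayes' theorem.

    priors      -- P(airport) for every airport
    likelihoods -- P(weapon detected | airport) for every airport
    """
    # Denominator: total probability of the evidence over all airports.
    evidence = sum(priors[k] * likelihoods[k] for k in priors)
    return priors[target] * likelihoods[target] / evidence


traffic_share = {"A": 0.5, "B": 0.3, "C": 0.2}
detection_rate = {"A": 0.9, "B": 0.5, "C": 0.4}

p_c_given_w = posterior(traffic_share, detection_rate, "C")
print(round(p_c_given_w, 4))
```

Calling the same function with "A" or "B" as the target gives the other two posterior probabilities, and the three should add up to 1.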
In summary to the study unit
Once you have familiarized yourself with this study unit you should be able to
link random circumstances and probability to everyday life
define a sample space, an event and complementary events
grasp the idea of a Venn diagram displaying the sample space and the events within
differentiate between the union and intersection of events
differentiate between marginal, joint and conditional probability
clarify the difference between mutually exclusive and independent events
appreciate the different fundamentals of counting in probability
describe and apply the basic rules for probability
use contingency tables and tree diagrams to solve more complex questions on probability


STUDY UNIT 5
STUDY CHAPTER 5

Key questions for this unit


Define a discrete probability distribution.
How would you construct a probability distribution for a discrete random variable?
Distinguish between discrete and continuous random variables.
How would you compute the expected value and the variance of a discrete random
variable?
How would you compute the expected value and the variance of a Binomial
distribution?
How would you compute the expected value and the variance of a Poisson distribution?

5.1 Introduction to this study unit


In study unit 4 you learnt much about probability in general. In this study unit we discuss discrete random
variables and their probability distributions. Probability distributions are classified as either discrete
or continuous, depending on the random variable.
A random variable is a variable that can take on different values according to the outcome of an
experiment. It is described as random because we don't know ahead of time exactly what value it
will have following the experiment. For example, when we toss a coin, we don't know for sure whether
it will land heads or tails. Likewise, when we measure the diameter of a roller bearing, we don't know
in advance what the exact measurement will be. Random variables are either discrete or continuous.
In this unit the emphasis is on discrete random variables and their probability distributions. In the
next unit we will cover random variables of the continuous type.
A random variable is discrete if it can assume only a countable number of possible values
(0, 1, 2, 3, ....).

A continuous random variable assumes an uncountable number of possible values; it can take
on any value in one or more intervals of values. Levine et al. provide the following definition of a
probability function.
A probability function, denoted p(x), specifies the probability that a random
variable is equal to a specific value. More formally, p(x) is the probability that the random
variable X takes on the value x, or p(x) = P(X = x).

The two key properties of a probability function are:

For any value of x, 0 ≤ p(x) ≤ 1.
Σ p(x) = 1, that is, the sum of the probabilities for all possible outcomes, x, for a random variable, X,
equals one.

Activity 5.1: Overview Study skill


Draw a mind-map of the different sections/headings you will deal with in this study session. Then
page through the unit with the purpose of completing the map.
...................................................................................................

Activity 5.2: Concepts Conceptual skill Communication skill


Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline; can
you think of some examples?
...................................................................................................

English term                                           Description       Term in your home language
Probability distributions
Discrete random variables
Continuous random variables
Binomial distributions
Poisson distributions
The mean of the binomial distribution
The variance of the binomial distribution
The mean of the Poisson distribution
The variance of the Poisson distribution
The standard deviation of the binomial distribution
The standard deviation of the Poisson distribution

5.2 Probability distribution for discrete random variables


Levine defines the probability distribution for a discrete random variable as a mutually exclusive listing
of all possible numerical outcomes along with the probability of occurrence of each outcome (see
section 5.1).
That is, if X is a discrete random variable associated with a particular chance experiment, a list of
all possible values X can assume together with their associated probabilities is called a discrete
probability distribution. The total probability of all outcomes is 1.

5.2.1 Expected value of a Discrete Random Variable


The mean, μ, of a discrete probability distribution for a discrete random variable is called its expected
value, and is referred to as E(X), or μ. It is calculated as the sum of the products of each value of the
random variable X and its corresponding probability, P(X), as follows:

$$\mu = E(X) = \sum_{i=1}^{N} X_i \, P(X_i)$$

where
X_i = the ith outcome of the discrete random variable X
P(X_i) = the probability of occurrence of the ith outcome of X

Example 5.1
Based on her experience, a professor knows that the probability distribution for X = number of
students who come to her office on Wednesdays is given below.
x            0       1       2       3       4
P(X = x)     0.10    0.20    0.50    0.15    0.05

What is the expected number of students who visit her on Wednesdays?
1. 0.50
2. 0.70
3. 1.85
4. 0.90
5. 0.30
Solution: The expected value (the mean) is calculated as the sum of the products of each value of the
random variable X and its corresponding probability, P(X), as follows:

$$\mu = E(X) = \sum_{i=1}^{N} X_i \, P(X_i)$$

= (0 × 0.10) + (1 × 0.20) + (2 × 0.50) + (3 × 0.15) + (4 × 0.05)
= 1.85

Alternative 3

5.2.2 Variance and standard deviation of a discrete random variable


The variance of a probability distribution is computed by multiplying each possible squared difference
(X_i - μ)² by its corresponding probability, P(X_i), and then summing the resulting products as
follows:

$$\sigma^2 = \sum_{i=1}^{N} (X_i - \mu)^2 \, P(X_i)$$

where
X_i = the ith outcome of the discrete random variable X
P(X_i) = the probability of occurrence of the ith outcome of X

Please note that we have to compute the mean first before we can calculate the variance of a
discrete random variable.
The standard deviation is the positive square root of the variance of a discrete random variable:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\sum_{i=1}^{N} (X_i - \mu)^2 \, P(X_i)}$$

Example 5.2
Let the probability distribution for X = number of jobs held during the past year for students at a
college be as follows:

x            1       2       3       4       5
P(X = x)     0.25    0.33    0.17    0.15    0.10

The standard deviation of the number of jobs held is


1. 8.000
2. 1.3682
3. 2.5200
4. 1.2844
5. 1.6496
Solution:
We first calculate the mean:

$$\mu = E(X) = \sum_{i=1}^{N} X_i \, P(X_i)$$

= (1 × 0.25) + (2 × 0.33) + (3 × 0.17) + (4 × 0.15) + (5 × 0.10)
= 2.52

Then we use the mean to calculate the variance:

$$\sigma^2 = \sum_{i=1}^{N} (X_i - \mu)^2 \, P(X_i)$$

= (1 - 2.52)² × 0.25 + (2 - 2.52)² × 0.33 + (3 - 2.52)² × 0.17 + (4 - 2.52)² × 0.15 + (5 - 2.52)² × 0.10
= 1.6496

The standard deviation is
σ = √σ² = √1.6496 = 1.2844
Alternative 4
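
The mean, variance and standard deviation in Example 5.2 can be verified with a few lines of Python. This is a minimal sketch; the function name `mean_var_sd` and the use of a dictionary to hold the distribution are assumptions made for illustration.

```python
import math


def mean_var_sd(dist):
    """Return (mean, variance, standard deviation) of a discrete distribution.

    dist maps each outcome x to its probability P(X = x).
    """
    # The probabilities of a valid distribution must add up to one.
    assert abs(sum(dist.values()) - 1.0) < 1e-9

    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, var, math.sqrt(var)


# Distribution of the number of jobs held (Example 5.2).
jobs = {1: 0.25, 2: 0.33, 3: 0.17, 4: 0.15, 5: 0.10}

mu, var, sd = mean_var_sd(jobs)
print(round(mu, 2), round(var, 4), round(sd, 4))
```

Note that the mean is computed first and then reused for the variance, in the same order as in the worked solution.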
If you have not mastered how to calculate the mean, the variance and the standard deviation of a
discrete random variable, you can now work through section 5.1 of Levine et al. again; otherwise try
the following activity before looking at its solutions.

Activity 5.1

Self-assessment exercise

Application skills

Question 1
The number of telephone calls coming into a switchboard and their respective probabilities for a
3-minute interval are as follows:
x            0       1       2       3       4       5
P(X = x)     0.60    0.20    0.10    0.04    0.03    0.03

How many calls might be expected over a 3-minute interval?

1. 0.04
2. 3
3. 0.2
4. 0.79
5. 3.75

Question 2
The probability distribution of a discrete random variable is shown below.
x            0       1       2       3
P(X = x)     0.25    0.40    0.20    0.15

Find the incorrect statement:


1. This is an example of a discrete probability distribution.
2. The expected value of x is 1.25
3. The variance of x is 2.55
4. If x = 0, after multiplication by P(x), the answer is 0, which means that the probability associated
with the value x = 0 has no influence on the answers of the mean and the variance.
5. The standard deviation of x is 0.9937

Question 3
Use the data set given in question 2 and find the incorrect statement.
1. P(x > 1) = 0.35
2. P(x ≤ 2) = 0.65
3. P(1 < x ≤ 2) = 0.20
4. P(0 < x < 1) = 0.00
5. P(1 ≤ x < 3) = 0.60

Solutions to Activity 5.1

Application skills

Question 1
Recall that the expected number is also the mean of a discrete random variable, calculated as:

$$\mu = E(X) = \sum_{i=1}^{N} X_i \, P(X_i)$$

= (0 × 0.60) + (1 × 0.20) + (2 × 0.10) + (3 × 0.04) + (4 × 0.03) + (5 × 0.03)
= 0.79

Alternative 4
Question 2
1. Correct. The variable takes on discrete values, therefore the statement is correct. Remember
in section 5.2 of this unit we defined the probability distribution for discrete random variable
as a mutually exclusive listing of all possible numerical outcomes along with the probability of
occurrence of each outcome which is exactly the case in this option.
2. Correct.

$$\mu = E(X) = \sum_{i=1}^{N} X_i \, P(X_i)$$

= (0 × 0.25) + (1 × 0.40) + (2 × 0.20) + (3 × 0.15)
= 1.25

3. Incorrect. This figure was incorrectly computed. It should be

$$\sigma^2 = \sum_{i=1}^{N} (X_i - \mu)^2 \, P(X_i)$$

= (0 - 1.25)² × 0.25 + (1 - 1.25)² × 0.40 + (2 - 1.25)² × 0.20 + (3 - 1.25)² × 0.15
= 0.9875

4. Correct. You can see this if you study the calculation of the mean and the variance.

5. Correct. σ = √σ² = √0.9875 = 0.9937

Question 3
1. Correct. We add the probabilities from two (the first value greater than one) up to three, as follows:
P(x > 1) = P(x = 2) + P(x = 3) = 0.20 + 0.15 = 0.35

2. Incorrect. Here we take the values from zero to two. One could also read this as "at most
two", as discussed in study unit 4.
P(x ≤ 2) = P(x = 0) + P(x = 1) + P(x = 2) = 0.25 + 0.40 + 0.20 = 0.85

3. Correct. In this case one is not included but two is. P(1 < x ≤ 2) = P(x = 2) = 0.20

4. Correct. P(0 < x < 1) = 0.00 because between 0 and 1 there is no discrete value for x.
5. Correct. Here one is included but three is not. P(1 ≤ x < 3) = P(x = 1) + P(x = 2) = 0.40 + 0.20 =
0.60

Having understood discrete random variables, we can now discuss their probability distributions. This
is a very small but important section in statistics.
There are quite a number of discrete probability distributions, though Levine emphasises only two,
namely
the binomial distribution and
the Poisson distribution.

5.3 The Binomial Distribution


The binomial distribution describes the probability distribution resulting from the outcome of a
binomial experiment. A binomial experiment usually involves several repetitions (trials) of the basic
experiment. The binomial probability distribution gives us the probability that a success will occur x
times in n trials, for x = 0, 1, 2, ..., n.
Characteristics
The experiment must consist of n identical trials.
Each trial has 1 of 2 possible mutually exclusive outcomes: success or failure (success refers to
the occurrence of the event of interest).
The probability (π) that the trial results in a success remains the same from trial to trial.
The trials are independent of each other (the outcome of a trial does not affect the outcome of
any other trial).
The probability distribution of the number of successes x of the random variable X in n trials of a
binomial experiment is:

$$P(x) = \frac{n!}{x!\,(n - x)!} \, \pi^{x} (1 - \pi)^{n - x}$$

where
n = number of trials or sample size
π = probability of success on each trial
x = number of successes (the binomial variable), x = 0, 1, 2, ..., n

The mathematical sign (!) is called the factorial sign. For a positive integer n, n! is interpreted as the
product of all positive integers less than or equal to n. For example, 5! = 5 × 4 × 3 × 2 × 1 = 120,
4! = 4 × 3 × 2 × 1 = 24, and 0! = 1.


5.3.1 The mean of the binomial distribution


The mean, μ, of the binomial distribution is equal to the sample size, n, multiplied by the probability
of an event of interest, π.

$$\mu = E(X) = n\pi$$

5.3.2 The variance and the standard deviation of the binomial distribution

$$\sigma^2 = \operatorname{Var}(X) = n\pi(1 - \pi)$$

$$\sigma = \sqrt{\sigma^2} = \sqrt{n\pi(1 - \pi)}$$

Example 5.3

A textile firm has found from experience that only 20% of the people applying for a certain
stitching-machine job are qualified for the work. If 5 people are interviewed, what is the probability of finding
at least three qualified persons?
n = 5, π = 0.20, P(x ≥ 3)?

Please do not forget that "at least three" means adding the probabilities for three, four and so on.

P(x ≥ 3) = P(x = 3) + P(x = 4) + P(x = 5)

$$= \frac{5!}{3!\,(5-3)!}\,0.20^{3}(1-0.20)^{5-3} + \frac{5!}{4!\,(5-4)!}\,0.20^{4}(1-0.20)^{5-4} + \frac{5!}{5!\,(5-5)!}\,0.20^{5}(1-0.20)^{5-5}$$

= 0.0512 + 0.0064 + 0.0003
= 0.0579
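
Example 5.3 can be checked with a short Python sketch. The function name `binomial_pmf` is an assumption made for illustration, and `math.comb` (available from Python 3.8) is used for the factorial term n!/(x!(n - x)!).

```python
import math


def binomial_pmf(x, n, pi):
    """P(X = x) for a binomial variable with n trials and success probability pi."""
    return math.comb(n, x) * pi**x * (1 - pi) ** (n - x)


n, pi = 5, 0.20

# "At least three" means x = 3, 4 or 5.
p_at_least_3 = sum(binomial_pmf(x, n, pi) for x in range(3, n + 1))

# Mean and variance of the binomial distribution.
mean = n * pi
variance = n * pi * (1 - pi)

print(round(p_at_least_3, 4), round(mean, 2), round(variance, 2))
```

Summing the function over x = 0, ..., n should give 1, which is a quick way to confirm that the probabilities form a valid distribution.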

You can now attempt the following typical exam question. Please try to answer them before looking
at the solutions.
Activity 5.2: Self-assessment exercises

Application skills

Question 1
A new car salesperson knows that he sells cars to one customer out of 10 who enter the showroom.
The probability that he will sell a car to exactly two of the next three customers is
1. 0.027
2. 0.973
3. 0.000
4. 0.090
5. 0.901

Question 2
Use the information given in question 1. Let X be number of cars the salesperson sells to the next
three customers. Which one of the following statements is incorrect?
1. X has a binomial distribution
2. The expected number of cars sold if n = 3 is 0.3
3. The variance of this distribution is 0.27
4. P(X ≤ 1) = 0.9720
5. P (X > 2) = 0.0280

Question 3
Suppose that 62% of new cars sold in a country are made by one big car manufacturer. A random
sample of 7 purchases of new cars is selected. The probability that 4 of those selected purchases
are made by this car manufacturer is
1. 0.5800
2. 0.5714
3. 0.2838
4. 0.4200
5. 0.7162

Solutions to Activity 5.2

Application skills

Question 1
n = 3, π = 1/10 = 0.1, P(x = 2)?

$$P(x = 2) = \frac{3!}{2!\,(3-2)!}\,0.1^{2}(1 - 0.1)^{3-2} = 0.027$$

Alternative 1

Question 2
1. Correct.
2. Correct. E(X) = nπ = 3 × 0.1 = 0.3
3. Correct. σ² = nπ(1 - π) = 3 × 0.1 × (1 - 0.1) = 0.27
4. Correct.
P(x ≤ 1) = P(x = 0) + P(x = 1)

$$= \frac{3!}{0!\,(3-0)!}\,0.1^{0}(1-0.1)^{3-0} + \frac{3!}{1!\,(3-1)!}\,0.1^{1}(1-0.1)^{3-1}$$

= 0.7290 + 0.2430
= 0.9720

5. Incorrect.
P(x > 2) = P(x = 3)

$$= \frac{3!}{3!\,(3-3)!}\,0.1^{3}(1-0.1)^{3-3}$$

= 0.001

Question 3
n = 7, π = 0.62, P(x = 4)?

$$P(x = 4) = \frac{7!}{4!\,(7-4)!}\,0.62^{4}(1-0.62)^{7-4} = 0.2838$$

Alternative 3

5.4 Poisson Distribution


The Poisson distribution is a discrete distribution for the occurrence of an event for which the
probability of occurrence over the given span of time, space, or distance is extremely small. There is
no specific upper limit to the count (n is unknown), although a finite count is expected. The Poisson
distribution tends to describe phenomena like:
Customer arrivals at a service point during a given period of time, such as the number of motorists
approaching a toll booth, the number of hungry persons entering a McDonald's restaurant, or the
number of calls received by a company call center. In this context it is also useful in a management
science technique called queuing (waiting-line) theory.
Defects in manufactured materials, such as the number of flaws in wire or pipe products over a
given number of feet, or the number of knots in wooden panels for a given area.
The number of work-related deaths, accidents, divorces, suicides, and homicides over a given
period of time.
Although it is closely related to the binomial distribution, the Poisson distribution has a number of
characteristics that make it unique. These include:
The number of successes that occur in a specified interval is independent of the number of
occurrences in any other interval.
The probability that a success will occur in an interval is the same for all intervals of equal size, and
is proportional to the size of the interval.
x is the count of the number of successes that occur in a given interval, and may take on any
value from 0 to infinity.
If X is a Poisson random variable, the probability distribution of the number of successes x is

$$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$

where
x = 0, 1, 2, ...
λ = the average number of successes occurring in the given time or measurement
e = 2.71828 (the base of natural logarithms)

Example 5.4
The number of a certain radio sold per day by a firm is approximately Poisson, with a mean of
1.5. The probability that the firm will sell at least two radios on a given day is equal to
1. 0.5578
2. 0.1255
3. 0.9344
4. 0.0447
5. 0.4422

Solution:
Recall that this distribution has no upper bound. Therefore we have to express "at least" in another
equivalent way, such as
P(x ≥ 2) = 1 - P(x ≤ 1)
= 1 - {P(x = 0) + P(x = 1)}

$$= 1 - \left\{ \frac{1.5^{0} e^{-1.5}}{0!} + \frac{1.5^{1} e^{-1.5}}{1!} \right\}$$

= 1 - {0.2231 + 0.3347}
= 0.4422
Alternative 5
Example 5.5
A bank receives on average 6 bad cheques per day. The probability that it will receive exactly 4 bad
cheques on a given day is
1. 0.0892
2. 0.1393
3. 0.2851
4. 0.1339
5. 0.6667

Solution
Given that λ = 6, P(x = 4)?

$$P(x = 4) = \frac{\lambda^{x} e^{-\lambda}}{x!} = \frac{6^{4} e^{-6}}{4!} = 0.1339$$

Alternative 4
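
Both Poisson examples can be reproduced with the sketch below. The function name `poisson_pmf` is an assumption made for illustration; Example 5.4 is evaluated with the mean of 1.5 that its worked solution uses.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with mean lam."""
    return lam**x * math.exp(-lam) / math.factorial(x)


# Example 5.5: exactly 4 bad cheques when the mean is 6 per day.
p_exactly_4 = poisson_pmf(4, 6)

# Example 5.4: "at least two" has no upper bound, so use the complement.
p_at_least_2 = 1 - (poisson_pmf(0, 1.5) + poisson_pmf(1, 1.5))

print(round(p_exactly_4, 4), round(p_at_least_2, 4))
```

Working with the complement avoids summing an unlimited number of terms, which is the same idea as in the solution to Example 5.4.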

5.5 Self Assessment exercise


TEST YOUR KNOWLEDGE
In this section we have selected questions based on whole study unit. Please attempt them before
referring to the solutions.
Question 1
Bank robbers brandish firearms to threaten their victims in 80 percent of the incidents. An
announcement that six bank robberies are taking place is being broadcast. The probability that
a firearm is being used in at least one of the robberies is
1. 0.0015
2. 0.7379
3. 0.0001
4. 0.9999
5. 0.0016

Question 2
In an urban country, health officials anticipate that the number of births this year will be the same as
last year, when 438 children were born, an average of 438/365, or 1.2 births per day. Daily births
have been distributed according to a Poisson distribution.
The distribution can be represented as

x            0        1        2        3        4        5        6        7
P(X = x)     0.3012   0.3614   0.2169   0.0867   0.0260   0.0062   0.0012   0.0002

What is the probability that at least two births will occur on a given day?
1. 0.3374
2. 0.8795
3. 0.3795
4. 0.7831
5. 0.6626

Question 3
Given the following probability distribution for an infinite population with the discrete random
variable, x:

x        0      1      2      3
P(x)     0.2    0.1    0.3    0.4

Which statement is incorrect?


1. The mean of x is 1.9
2. The probability that x is at most one equals to 0.3
3. The variance of x is 1.29
4. The standard deviation of x is 1.14
5. The probability that x is at least zero equals to 0.2

Question 4
A drug is known to be 80% effective in curing a certain disease. If four people with the disease are
to be given the drug, the probability that more than two are cured is:
1. 0.8464
2. 0.1536
3. 0.5000
4. 0.1808
5. 0.8192

Question 5
Refer to question 4, the expected value of people cured is
1. 0.80
2. 0.20
3. 3.20
4. 0.64
5. 1.00


Question 6
Given a Poisson random variable X, where the average number of successes occurring in a
specified interval is 1.8, P (X = 0) is equal to
1. 0.1653
2. 0.2975
3. 1.0000
4. 0.0000
5. 0.4762

5.6 Solutions to self assessment exercises Test your knowledge


Question 1
Solution
P(x ≥ 1) = 1 - P(x ≤ 0)
= 1 - P(x = 0)

$$= 1 - \frac{6!}{0!\,(6-0)!}\,0.80^{0}(1 - 0.80)^{6-0}$$

= 1 - 0.000064
= 0.9999

Alternative 4
Question 2
P(x ≥ 2) = 1 - P(x ≤ 1)
= 1 - {P(x = 0) + P(x = 1)}
= 1 - {0.3012 + 0.3614}
= 0.3374

Alternative 1
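
The probability table given in Question 2 can be regenerated from the Poisson formula with a mean of 1.2 births per day. The sketch below is illustrative only; the function name `poisson_pmf` and the list name `table` are assumptions.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with mean lam."""
    return lam**x * math.exp(-lam) / math.factorial(x)


lam = 1.2  # average number of births per day

# Probabilities for x = 0, 1, ..., 7, rounded to four decimals as in the table.
table = [round(poisson_pmf(x, lam), 4) for x in range(8)]

# "At least two births" via the complement rule.
p_at_least_2 = 1 - (poisson_pmf(0, lam) + poisson_pmf(1, lam))

print(table)
print(round(p_at_least_2, 4))
```

The listed probabilities stop at x = 7 only because the remaining values are very small; the distribution itself has no upper limit.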
Question 3
1. Correct.

$$\mu = E(X) = \sum_{i=1}^{N} X_i \, P(X_i)$$

= (0 × 0.2) + (1 × 0.1) + (2 × 0.3) + (3 × 0.4)
= 1.9

2. Correct.
P(x ≤ 1) = P(x = 0) + P(x = 1) = 0.2 + 0.1 = 0.3

3. Correct.

$$\sigma^2 = \sum_{i=1}^{N} (X_i - \mu)^2 \, P(X_i)$$

= (0 - 1.9)² × 0.2 + (1 - 1.9)² × 0.1 + (2 - 1.9)² × 0.3 + (3 - 1.9)² × 0.4
= 1.29

4. Correct.
σ = √1.29 = 1.14

5. Incorrect.
P(x ≥ 0) = P(x = 0) + P(x = 1) + P(x = 2) + P(x = 3) = 1.0

Question 4
P(x > 2) = P(x = 3) + P(x = 4)

$$= \frac{4!}{3!\,(4-3)!}\,0.8^{3}(1 - 0.8)^{4-3} + \frac{4!}{4!\,(4-4)!}\,0.8^{4}(1 - 0.8)^{4-4}$$

= 0.4096 + 0.4096
= 0.8192

Alternative 5
Question 5
E(X) = nπ = 4 × 0.80 = 3.2

Alternative 3
Question 6
Given that λ = 1.8, P(x = 0)?

$$P(x = 0) = \frac{\lambda^{x} e^{-\lambda}}{x!} = \frac{1.8^{0} e^{-1.8}}{0!} = 0.1653$$

Alternative 1
In summary to the study unit
After you have studied this study unit about discrete probability distributions, you should be able to
recognise and define a discrete probability distribution
construct a probability distribution for a discrete random variable
understand the concept of a Bernoulli process and its application in consecutive trials, as

associated with the binomial distribution

differentiate between the binomial and Poisson distributions


determine the probability that a binomial variable will assume a given value and a Poisson variable

a value within a given range


STUDY UNIT 6
STUDY CHAPTER 6
THE NORMAL DISTRIBUTION

Key questions for this unit


How would you compute probabilities from the normal distribution?
Can you distinguish between discrete and continuous probability distributions?
Can you use the normal table to compute probabilities?
Can you determine the Z -variable given the area under the normal curve?
Can you distinguish between the normal, the uniform and the exponential distribution?

6.1 Introduction to this study unit


This chapter describes three continuous distributions, namely the normal, the uniform and the exponential distributions. The normal distribution is the most important distribution in statistics, and the key reasons for this include:
• Numerous continuous variables common in business have distributions that closely resemble the normal distribution.
• The normal distribution can be used to approximate various discrete probability distributions.
• The normal distribution provides the basis for classical statistical inference because of its relationship to the Central Limit Theorem (see sections 6.1 and 6.2 of Levine).

You must make sure that you know the characteristics of the normal distribution and how to use the
normal table (E.2) to determine probabilities. For this module it is not necessary that you can use the
normal distribution to approximate the binomial distribution. Still, please read through those sections
with care: we cannot cover everything in one module, but they contain essential statistical
knowledge that you should be aware of. Fortunately you will always have the Levine textbook for
reference, should you need it at any time in the future!
Activity 6.1: Overview

Study skills

Draw a mind-map of the different sections/headings you will deal with in this study session. Then
page through the unit with the purpose of completing the map.
...................................................................................................


Activity 6.2: Concepts

Conceptual skills

Communication skill

Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline; can
you think of some examples?
...................................................................................................
English term                                         Description          Term in your home language
Continuous probability distributions
Discrete probability distributions
The standard normal probabilities
The mean of the normal distribution
The standard deviation of the normal distribution
The area under the normal curve

6.2 The normal and standardized normal distribution


The most important fact to understand is that although a normal distribution is also a probability
distribution, it has its own characteristics, the most obvious being that the variable
it describes is continuous. This changes the concept of probability to an area instead of the height
of a bar. Many students struggle to understand what we mean by the area beneath a curve,
especially because we say that this area is determined through the mathematics of calculus. So, if
you are one of these, do not worry or imagine it to be so difficult. You have heard about area since
primary school, where it was always the product of length and breadth (l × b). Now area is still the
product of two values, even though it is not such a perfect box. The nice part is that it is not your
problem how the area of such a funny box is determined: you simply read off the answer from a table
(see E.2).
ACTIVITY
Read through sections 6.1 and 6.2 of Levine text book at least once!
In addition, there are a few things you have to know about the normal distribution.
• The form of the distribution is described as bell-shaped, meaning that it is symmetric (if it was possible to cut out the line forming the bell you would be able to fold it double with the two halves fitting on top of each other).
• The normal curve is indicated within an axis system of two perpendicular lines (like you may have used in grade 9 to present a straight line graph).
• The values of the continuous variable X (in the notation of Levine) are indicated on the horizontal axis.
• There is a difference between the probability distribution and the probabilities.
  - The distribution is only the line forming the bell and is called the density function, indicated as f(X) (a function of the variable X).
  - The total probability is represented by the area between that density function (bell-shaped line) and the x-axis.
• The total area equals one, but can be broken up into sections, determined by the values given to the variable X on the horizontal axis as shown in the following graphs:
• The centre X-value, where you would fold the curve to indicate the symmetry, represents the mean for that particular distribution (see graph above).
• It is not only the placement of the mean which determines the distribution of a particular normal distribution; the standard deviation (how the values are spread around the mean) also determines the form of the distribution.

• If the values of $\mu$ and $\sigma$ are used to standardize each individual X-value with the formula $Z = \dfrac{X - \mu}{\sigma}$ (see equation 6.2), then the original normal distribution will be transformed to a so-called standard normal distribution whose mean $\mu = 0$ and standard deviation $\sigma = 1$.
• The values in the normal table give the areas for the probabilities of the standard normal distribution, i.e. the one whose mean $\mu = 0$ and standard deviation $\sigma = 1$.
• The previous statement implies that all general normal distributions must first be standardized with the formula $Z = \dfrac{X - \mu}{\sigma}$ before the normal table may be used.
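
As a rough illustration of the standardization step, the sketch below converts an X-value to a Z-value and then computes the cumulative area that the normal table would give. The function names `standardize` and `phi` are hypothetical, and the values μ = 15, σ = 3 and X = 9 are chosen only for illustration; the area is computed with the error function from Python's standard library instead of being read from table E.2.

```python
from math import erf, sqrt


def standardize(x, mu, sigma):
    """Z = (X - mu) / sigma: the number of standard deviations x lies from the mean."""
    return (x - mu) / sigma


def phi(z):
    """Cumulative area to the left of z under the standard normal curve."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


mu, sigma = 15.0, 3.0              # illustrative mean and standard deviation
z = standardize(9.0, mu, sigma)    # 9 lies two standard deviations below the mean
area_left = phi(z)                 # P(X <= 9) = P(Z <= -2.00)

print(z, round(area_left, 4))
```

The printed area should match the table entry for Z = -2.00 to four decimals.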

In this study guide we also insert another version of the normal table (taken from Utts and Heckard:
Mind on Statistics, 3rd edition) for your convenience. Here two separate tables are given: one for
negative Z values and one for positive Z values. Some students find it easier to calculate areas
under the normal curve using these two tables.


The normal table can be used either to determine an area under the normal curve for a given Z -value
or it can be used backwards whenever the value of the Z variable has to be determined for a given
area.
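
Both directions of table use can be imitated in software. The following is a minimal sketch that assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; `cdf` plays the role of reading an area for a given Z-value, and `inv_cdf` plays the role of reading the table backwards. The inputs 1.6 and 0.95 are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Forwards: given a Z-value, find the cumulative area to its left.
area_left = std_normal.cdf(1.6)

# Backwards: given a cumulative area, find the Z-value that cuts it off.
z_value = std_normal.inv_cdf(0.95)

print(round(area_left, 4), round(z_value, 3))
```

Small differences from table answers are to be expected, because the table rounds Z to two decimals.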
Now you should go through sections 6.1 and 6.2 again. There is nothing wrong with you if you have
to go through a chapter a number of times. It may even be that you need to break a chapter up into
small sections and repeat them over and over until you understand what we are trying to teach you.
Remember that statistics is about understanding and then building a mental structure based on the
underlying theory.
I think you are now ready to do a few activities. Always study the feedback carefully as I do a lot of
explaining there (for those of you who could not manage the activity yourself).
Activity 6.3

Study skill

Question 1
Assume X is normally distributed with mean $\mu = 15$ and standard deviation $\sigma = 3$. Use the
approximate areas under the normal curve to evaluate the following statements. The incorrect
statement is
1. $P(X \le 15) = P(X \ge 15) = 0.5$
2. $P(12 \le X \le 18) = 0.955$
3. $P(X \le 9) = 0.0228$
4. $P(X = 20) = 0$
5. $P(X \ge 12) = 0.8413$
Question 2
I want you to do question 6.9 in the Levine textbook. The breaking strength of plastic bags used for
packaging produce is normally distributed, with a mean of 5 pounds per square inch and a standard
deviation of 1.5 pounds per square inch. What proportion of the bags have a breaking strength of
(a) less than 3.11 pounds per square inch?
(b) at least 3.8 pounds per square inch?
(c) between 5 and 5.5 pounds per square inch?
(d) 95% of the breaking strength will be contained between what two values symmetrically distributed
around the mean?

Question 3
A manufacturer of tow chains finds that the average breaking point is at 3500 kilograms and the
standard deviation is 250 kilograms. If you pull a weight of 4200 kilograms with this tow chain, the
percentage of the time you can expect the chain to break, is
1. 2.8%
2. 0.26%
3. 49.74%
4. 99.74%
5. None of the above.

Question 4
A retailer finds that the demand for a very popular board game averages 100 per week with a standard
deviation of 20. If the seller wishes to have adequate stock 95% of the time, how many of the games
must she keep on hand?
1. 132.9
2. 67.1
3. 119
4. 195.0
5. 109

Question 5
Identify the incorrect statement:
1. The average waiting time at the checkout counter for a large grocery chain is 2.45 minutes with a
standard deviation of 24 seconds (0.40 minutes). If we assume that the distribution of waiting time
is normal, the probability that a customer must wait more than 3 minutes for check out is 0.9162.
2. Considering the information in 1, the proportion of the customers who are served between 1
minute and 2.5 minutes is 0.5518.
3. Suppose the monthly demand for automobile tyres at a tyre dealer is normally distributed with a
mean of 250 tyres and a standard deviation of 50 tyres. The number of tyres the store must have
in stock at the beginning of each month in order to meet demand for 95 percent of the time is
332.25.
4. A circus performer who gets shot from a cannon is supposed to land in a safety net. The distance
he travels is normally distributed with a mean of 55 metres and a standard deviation of 4.7 metres.
His landing net is 16 metres long and the mid-point of the net is positioned 55 metres from the
cannon. The probability that the performer will miss the net on a given night is 0.0892.
5. The probability that the circus performer in statement 4 will hit the net is equal to 0.9108.

Question 6
The scores of high school students on a national mathematics exam in Uganda were normally
distributed with a mean of 86 and a standard deviation of 4. If there were 97680 students with scores
higher than 91, how many students took the test?
1. 125000
2. 925000
3. 105000
4. 247667
5. 394400

Feedback on Activity 6.3


Question 1
1. Correct. We begin by standardising as in equation 6.2 of Levine et al.
$$P(X \le 15) = P(X \ge 15) = P\left(\frac{X - \mu}{\sigma} \le \frac{15 - 15}{3}\right) = P\left(Z \le \frac{15 - 15}{3}\right) = P(Z \le 0) = 0.5$$
The mean lies at the centre of the distribution and therefore divides the total area of 1 into half
(each half represents 0.5 of the total area) as shown in the figure below.

2. Incorrect. Standardize the random variable before reading off the respective probabilities from
the graphs. You can use graphs on page 4 and 5 of the study guide or table E.10 in Levine et al.
$$P(12 \le X \le 18) = P\left(\frac{12 - 15}{3} \le \frac{X - \mu}{\sigma} \le \frac{18 - 15}{3}\right) = P(-1.00 \le Z \le 1.00) = 0.8413 - 0.1587 = 0.6826$$

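3. Correct
$$P(X \le 9) = P\left(\frac{X - \mu}{\sigma} \le \frac{9 - 15}{3}\right) = P(Z \le -2.00) = 0.0228$$

4. Correct
$$P(X = 20) = 0$$
If the variable is continuous we assume that the probability of it assuming any fixed value is always
zero! Remember the continuous variable lies somewhere within a small interval, but we cannot
give a fixed value to it.

5. Correct
$$P(X \ge 12) = P\left(\frac{X - \mu}{\sigma} \ge \frac{12 - 15}{3}\right) = P(Z \ge -1.00) = 1.00 - 0.1587 = 0.8413$$

Question 2
(a) $P(X < 3.11) = P\left(\dfrac{X - \mu}{\sigma} < \dfrac{3.11 - 5}{1.5}\right) = P(Z < -1.26) = 0.1038$

(b) $P(X \ge 3.8) = P\left(\dfrac{X - \mu}{\sigma} \ge \dfrac{3.8 - 5}{1.5}\right) = P(Z \ge -0.80) = 1.00 - 0.2119 = 0.7881$

(c) $P(5 < X < 5.5) = P\left(\dfrac{5 - 5}{1.5} < \dfrac{X - \mu}{\sigma} < \dfrac{5.5 - 5}{1.5}\right) = P(0 < Z < 0.33) = 0.6293 - 0.5000 = 0.1293$

(d) $P(a < X < b) = 0.9500$

Note: Since the values of a and b are symmetrically distributed around the mean, their standardized values are equal in magnitude with
opposite signs. Using the normal table (see E.2), these Z-values are $-1.96$ and $1.96$, so
$a = 5 - (1.96 \times 1.5) = 2.06$ and $b = 5 + (1.96 \times 1.5) = 7.94$.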

Question 3
Option 2
$$P(X > 4200) = P\left(\frac{X - \mu}{\sigma} > \frac{4200 - 3500}{250}\right) = P(Z > 2.80) = 1.00 - 0.9974 = 0.0026 = 0.26\%$$

Question 4
Option 1
This question is about working backwards. Standardise the random variable first, then look for the
Z-value corresponding to a cumulative area of 0.95. Please note that this area should be looked up in the body
of the table.
$$
\begin{aligned}
P(X < w) &= 0.95 \\
P\left(\frac{X - \mu}{\sigma} < \frac{w - 100}{20}\right) &= 0.95 \\
P\left(Z < \frac{w - 100}{20}\right) &= 0.95 \\
\frac{w - 100}{20} &= 1.645 \\
w &= 100 + (20 \times 1.645) = 132.9
\end{aligned}
$$
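
The backwards calculation for the stock level can be sketched in code as follows. This is an illustrative check only, assuming Python 3.8+ and its `statistics.NormalDist` class; the variable names are invented for the example, while the mean of 100, the standard deviation of 20 and the 95% service level come from the question.

```python
from statistics import NormalDist

mean_demand = 100.0     # average weekly demand
sd_demand = 20.0        # standard deviation of weekly demand
service_level = 0.95    # required cumulative area to the left of w

# Route 1: find the Z-value first, then undo the standardization: w = mu + z * sigma
z = NormalDist().inv_cdf(service_level)
w_from_z = mean_demand + z * sd_demand

# Route 2: ask the N(100, 20) distribution for its 95th percentile directly
w_direct = NormalDist(mu=mean_demand, sigma=sd_demand).inv_cdf(service_level)

print(round(w_from_z, 1), round(w_direct, 1))
```

Both routes are the same calculation; the first mirrors the table-based steps shown above.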

Question 5
1. Incorrect. The correct answer is 0.0838.
Note that you always have to use the same units; in this case use only minutes.
$$P(X > 3) = P\left(\frac{X - \mu}{\sigma} > \frac{3 - 2.45}{0.40}\right) = P(Z > 1.38) = 1.00 - 0.9162 = 0.0838$$
From the normal table the value corresponding with 1.38 (rounded to two decimals) is 0.9162. This
is the given answer, but not the correct one. Remember how the normal table is tabulated? The
areas are cumulative from the far left up to the listed value, but the question specifies the
area greater than 1.38 (to the right of 1.38). For the correct answer you therefore have to subtract
0.9162 from 1.00.

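2. Correct
Using the normal table:
$$P(1 < X < 2.5) = P\left(\frac{1 - 2.45}{0.40} < \frac{X - \mu}{\sigma} < \frac{2.5 - 2.45}{0.40}\right) = P(-3.63 < Z < 0.13) = 0.5517 - 0.00014 \approx 0.5516$$
This agrees with the stated value of 0.5518 to three decimal places.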
3. Correct
This question is about working backwards.
$$P(X < w) = P\left(\frac{X - \mu}{\sigma} < \frac{w - 250}{50}\right) = P\left(Z < \frac{w - 250}{50}\right) = 0.95$$
To meet the demand for 95% of the time implies that we are looking for a z-value such that 0.95 of
the area lies to the left of it. We use the normal table (see E.2) to look for the value 0.95 inside
the normal table because this is an area.
The z-value which corresponds to an area of 0.95 is 1.645. This 1.645 is the z-value, but we have
to use it to find the w-value.
$$\frac{w - 250}{50} = 1.645$$
$$w = 250 + (50 \times 1.645) = 332.25$$

4. Correct. According to the information the 16 m net is placed in such a way that it begins at
(55 − 8 = 47) metres and stretches up to (55 + 8 = 63) metres from the cannon. The performer will
miss the net by falling short or falling past the net. In terms of the normally distributed variable,
this comment means
$$P(X \le 47) \text{ or } P(X \ge 63)$$
Standardize:
$$P\left(Z \le \frac{47 - 55}{4.7}\right) \text{ or } P\left(Z \ge \frac{63 - 55}{4.7}\right)$$
$$P(Z \le -1.70) \text{ or } P(Z \ge 1.70)$$
The table value for $P(Z \le -1.70)$ is 0.0446, and by symmetry $P(Z \ge 1.70) = 1.00 - 0.9554 = 0.0446$.
Combining the two tail probabilities gives twice 0.0446, which is 0.0892.

5. Correct. I hope that you realize that it is not necessary to repeat the calculation. The person can
only hit the net or miss the net. This means that the sum of the probabilities must equal one. The
probability for a hit of the net is therefore 1 − 0.0892 = 0.9108.
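
The two-tail reasoning for the circus performer can be checked with a short sketch. It assumes Python 3.8+ (`statistics.NormalDist`), and the variable names are illustrative. Because the code does not round Z to two decimals the way the table does, its result differs from 0.0892 in the fourth decimal.

```python
from statistics import NormalDist

distance = NormalDist(mu=55.0, sigma=4.7)   # landing distance in metres
net_start, net_end = 55.0 - 8.0, 55.0 + 8.0  # a 16 m net centred at 55 m

p_short = distance.cdf(net_start)            # lands before the net
p_long = 1.0 - distance.cdf(net_end)         # lands past the net
p_miss = p_short + p_long
p_hit = 1.0 - p_miss                         # complement: hit or miss are the only outcomes

print(round(p_miss, 4), round(p_hit, 4))
```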

SELF ASSESSMENT EXERCISE TEST YOUR KNOWLEDGE
Question 1
Which one of the following is not a characteristic of a normal distribution?
1. The normal variable can take on only discrete values.
2. It is a symmetrical distribution.
3. The mean, median and mode are all equal.
4. It is a bell-shaped distribution.
5. The area under the curve is equal to one.

Question 2
Given that Z is a standard normal random variable, a negative value of z indicates that
1. the value Z is to the left of the mean
2. the value Z is to the right of the median
3. the standard deviation of Z is negative
4. the area between zero and Z is negative
5. the area to the right of Z is equal to 1

Question 3
If Z is a normal variable with $\mu = 0$ and $\sigma = 1$, the area to the left of Z = 1.6 is
1. 0.4452
2. 0.9452
3. 0.0548
4. 0.5548
5. 0.5000

Question 4
Use the normal table to find the Z -value Z1 if the area to the right of Z1 is 0.8413. The value of Z1 is
1. −1.36
2. 1.36
3. 0.00
4. −1.00
5. 1.00

Question 5
Let Z be a Z -score that is unknown but identifiable by position and area. If the area to the left of Z is
0.9306, then the value of Z must be

1. −1.48
2. 0.9603
3. 1.48
4. 0.4306
5. 0.0694
Question 6
Which of the following statements is incorrect?
1. $P(Z \ge 1.63) = 0.0516$
2. $P(Z \le -0.5) = 0.3085$
3. $P(Z < -1.63) = -0.0516$
4. $P(Z > 1.28) = 0.1003$
5. $P(-1 \le Z \le 1) = 0.6826$
Question 7
For a normal curve, if the mean is 20 minutes and the standard deviation is 5 minutes, then the area
between 22 and 25 minutes is
1. 0.1554
2. 0.3413
3. 0.4967
4. 0.1859
5. 0.0185
Question 8
A bakery firm finds that its average weight of the most popular package of biscuits is 200.5 g with a
standard deviation of 10.5 g. What proportion of biscuit packages will weigh less than 180 g?
1. 0.4744
2. 0.0256
3. 0.5226
4. 0.4713
5. 0.9744

Question 9
The average labour time to sew a pair of denims is 4.2 hours with a standard deviation of 0.5 hours.
If the distribution is normal, then the probability of a worker finishing a pair of jeans in more than 3.5
hours is
1. 0.0808
2. 0.4192
3. 0.5808
4. 0.9192
5. 0.9808

Question 10
A retailer finds that the demand for a popular board game averages 50 per week with a standard
deviation of 20. If the seller wishes to have adequate stock 99% of the time, how many games must
she keep on hand?
1. 81.0
2. 89.2
3. 50.0
4. 70.0
5. 96.6

SOLUTIONS TO SELF ASSESSMENT EXERCISE


Question 1
The normal variable can only take on continuous values.
Alternative (1)
Question 2
The value Z is to the left of the mean
Alternative (1)

Question 3
$P(Z < 1.6) = 0.9452$

Alternative (2)
Question 4
$P(Z > Z_1) = 0.8413$
$P(Z < Z_1) = 0.1587 \Rightarrow Z_1 = -1.00$

Alternative (4)
Question 5
$P(Z < z) = 0.9306 \Rightarrow z = 1.48$

Alternative (3)

Question 6
Option 1. Correct
$P(Z \ge 1.63) = 0.0516$

Option 2. Correct
$P(Z \le -0.5) = 0.3085$

Option 3. Incorrect! Remember the area under the graph cannot be negative.
$P(Z < -1.63) = 0.0516$

Option 4. Correct
$P(Z > 1.28) = 0.1003$

Option 5. Correct
$P(-1 < Z < 1) = 0.6826$

Alternative (3)
Question 7
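$$
\begin{aligned}
P(22 \le X \le 25) &= P\left(\frac{22 - 20}{5} \le Z \le \frac{25 - 20}{5}\right) \\
&= P(0.4 \le Z \le 1) \\
&= 0.8413 - 0.6554 \\
&= 0.1859
\end{aligned}
$$

Alternative (4)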
Question 8
$$
\begin{aligned}
P(X < 180) &= P\left(Z < \frac{180 - 200.5}{10.5}\right) \\
&= P(Z < -1.95) \\
&= 0.0256
\end{aligned}
$$

Alternative (2)
Question 9
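$$
\begin{aligned}
P(X > 3.5) &= P\left(Z > \frac{3.5 - 4.2}{0.5}\right) \\
&= P(Z > -1.4) \\
&= P(Z < 1.4) \\
&= 0.9192
\end{aligned}
$$

Alternative (4)
Question 10
This question is about working backwards.
$$
\begin{aligned}
P\left(Z \le \frac{a - 50}{20}\right) &= 0.99 \\
\frac{a - 50}{20} &= 2.33 \\
a &= 50 + (20 \times 2.33) = 96.6
\end{aligned}
$$

Alternative (5)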

6.3 In summary to the study unit


Once you have familiarized yourself with this study unit you should be able to
• understand the idea that probability is given in terms of an area if the variable is continuous
• identify the normal distribution as a continuous distribution and use it appropriately
• recognize the characteristics of the normal distribution based on its symmetry
• appreciate the link between general normal distributions and the standardized normal distribution
• use the normal table to find specific areas within given limits
• exploit the backwards use of the normal table, i.e. to determine the z-value, given an area

STUDY UNIT 7
STUDY CHAPTER 7
SAMPLING DISTRIBUTION

Key questions for this unit


What is meant by the following concepts: estimate, inference, sample mean,
population mean, statistic, parameter, sample proportion and population proportion?
Define a sampling distribution.
Distinguish between a sampling distribution of the mean and a sampling distribution of
proportion.
What is the purpose of making a statistical inference?
What is the benefit of the Central Limit Theorem?

OBJECTIVES
At the end of this chapter, you should be able:
To understand the concept of the sampling distribution.
To compute probabilities related to the sample mean and the sample proportion.
To understand the importance of the Central Limit Theorem.

7.1 Introduction to this study unit


A sampling distribution is classified as either a sampling distribution of the mean or a sampling distribution
of the proportion, depending on whether the statistic computed from each possible sample is the sample mean
or the proportion of items having a certain characteristic of interest.
Read sections 7.1 and 7.2 in D M Levine, T C Krehbiel and M L Berenson, which are very
explanatory, but they are topics on their own and do not form part of this module.
In study unit 4 we learned about the sample mean ($\bar{X}$), the population mean ($\mu$), statistic and parameter.
In this and the next study unit we will use these concepts, but in a different manner, and include the sample
proportion and the population proportion.

• Inference means that we are making an assumption or a deduction: data are gathered by drawing a sample from the population, and assumptions are then made about the population, based on this sample data.
• A sample must be representative of the population.
• A sampling distribution is the distribution of the results if you actually selected all possible samples.

Activity 7.1

Overview

Study Skill

Draw a mind-map of the different sections/headings you will deal with in this study session. Then
page through the unit with the purpose of completing the map.

[Mind-map: "Sampling distribution", branching into "Sampling distribution of the mean" and "Sampling distribution of proportion"]

In many situations the population is so large that you cannot gather information on every item.
Instead, statistical sampling procedures focus on collecting a small representative group of the larger
population. Selecting a sample is less time-consuming, less costly, and more practical than an
analysis of the entire population.
Activity 7.2

Concepts

Conceptual skill

Communication skill

Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline; can
you think of some examples?
English term                    Description          Term in your home language
Sample mean
Population mean
Inference
Statistic
Parameter
Sample proportion
Population proportion
Sampling distribution
Unbiased
Standard error of the mean
Standard error of proportion


7.2 Sampling Distribution of the Mean


Definition: The sampling distribution of the mean is the distribution of all possible
sample means if we select all possible samples of a given size.

The sample mean ($\bar{X}$) is unbiased because the mean of all the possible sample means is equal to
the population mean ($\mu$). In other words, the sample mean ($\bar{X}$) is an unbiased estimator of the population parameter
because the expected value of the sample mean is equal to the population parameter:
$$E(\bar{X}) = \mu$$

STANDARD ERROR OF THE MEAN


The standard error of the mean ($\sigma_{\bar{X}}$) is the standard deviation of the population of all possible
sample means.
Steps
1. Calculate the population standard deviation ($\sigma$), if it is not given.
2. Find the sample size n.
The standard error of the mean is then calculated as
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
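
A minimal sketch of this formula in code is given below. The function name `standard_error_of_mean` is hypothetical, and the inputs (σ = 12 with n = 25, and σ = 2 with n = 49) are illustrative values only.

```python
from math import sqrt


def standard_error_of_mean(sigma, n):
    """Standard deviation of all possible sample means: sigma / sqrt(n)."""
    return sigma / sqrt(n)


se_n25 = standard_error_of_mean(12, 25)   # 12 / 5
se_n49 = standard_error_of_mean(2, 49)    # 2 / 7

print(round(se_n25, 4), round(se_n49, 4))
```

Note how the standard error shrinks as the sample size grows: larger samples give sample means that vary less around the population mean.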
Activity 7.3
Question 1
Suppose a random sample of n = 25 observations is selected from a population that is normally
distributed with mean equal to 106 and the standard deviation equal to 12. Determine the mean and
the standard deviation of the sampling distribution of the sample mean X.

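Solution
Steps
1. Population standard deviation $\sigma = 12$
2. The sample size n = 25
The standard error of the mean is equal to
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{25}} = \frac{12}{5} = 2.4$$

The population mean $\mu = 106$, which is also the mean of the sampling distribution of $\bar{X}$.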

Question 2
A random sample of n observations is selected from a population with a standard deviation = 2.
Calculate the standard error of the mean for these values of n:
(a) n = 5
(b) n = 49
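Solution
(a) When n = 5
Standard deviation $\sigma = 2$
Standard error of the mean $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{2}{\sqrt{5}} = \dfrac{2}{2.2361} = 0.8944$
(b) When n = 49
Standard deviation $\sigma = 2$
Standard error of the mean $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{2}{\sqrt{49}} = \dfrac{2}{7} = 0.2857$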

Question 3
Population A consists of all values of invoices of a certain company. The mean of population A is
R350 and the standard deviation is R100. Population B consists of the means of all samples of 16 values drawn
from population A. The mean of population B is
1. R100
2. R250
3. R350
4. R450
5. R25


Solution
The mean of the population A is equal to the mean of population B .
Option (3)

What distribution will the sample mean $\bar{X}$ follow?


In the previous study unit, we studied that a random variable X is normally distributed with the mean
$\mu$ and the standard deviation $\sigma$. Now if we are sampling from a population that is normally distributed
with the mean $\mu$ and the standard deviation $\sigma$, then regardless of the sample size n, the sampling
distribution of the mean is normally distributed with the mean $\mu_{\bar{X}} = \mu$ and the standard error of the
mean $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$.
How do you calculate the probability for the sampling distribution of the mean?
Steps
1. Determine the population mean $\mu$ and the sample mean $\bar{X}$.
2. Determine the sample size n.
3. Determine the value of the sample mean ($\bar{X}$) for which we want to determine the probability.
4. Find the value of Z, called the test statistic:
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \quad \text{or} \quad Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}$$
where $\dfrac{\sigma}{\sqrt{n}}$ is the standard error of the mean ($\sigma_{\bar{X}}$).
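
The steps above can be sketched in code as follows. This is an illustration under stated assumptions: the function name `z_for_sample_mean` is invented for the example, the cumulative area comes from `statistics.NormalDist` (Python 3.8+) instead of table E.2, and the values μ = 100, σ = 12, n = 36 and a sample mean of 95 are used for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_for_sample_mean(x_bar, mu, sigma, n):
    """Z = (x_bar - mu) / (sigma / sqrt(n)), using the standard error of the mean."""
    return (x_bar - mu) / (sigma / sqrt(n))


z = z_for_sample_mean(95, 100, 12, 36)   # standard error is 12 / 6 = 2
p_below = NormalDist().cdf(z)            # area to the left of z

print(z, round(p_below, 4))
```

The only difference from standardizing a single X-value is the denominator: the standard error σ/√n replaces σ.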

Activity 7.4
Question 1
Given a normal distribution with the population mean $\mu = 100$ and the standard deviation $\sigma = 12$, if
you select a sample of n = 36, what is the probability that the sample mean $\bar{X}$ is
1. Less than 95?
2. Between 95 and 97.2?
3. Above 102.2?
4. There is a 65% chance that $\bar{X}$ is above what value?

Solution

1. $P(\bar{X} < 95)$?

Steps
1. Use the transformation formula, called the test statistic: $Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
2. Substitute the values into the Z formula:
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{95 - 100}{12 / \sqrt{36}} = \frac{-5}{12 / 6} = \frac{-5}{2} = -2.5$$
3. Determine the equivalent probability for the sample mean: $P(\bar{X} < 95) = P(Z < -2.5)$. Now determine the area to the left of −2.5.

[Figure: standard normal curve with an area of 0.0062 shaded to the left of Z = −2.5]

4. Find the value using the cumulative standard normal distribution table E.2 (from the Appendix):

Z       0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
−6.0
−5.5
.......
−2.5    0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
.......

$P(Z < -2.5) = 0.0062$

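2. $P(95 < \bar{X} < 97.2)$?

Steps
1. $Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$, with $\mu = 100$, n = 36 and $\sigma = 12$
2. If $\bar{X} = 95$ then
$$Z = \frac{95 - 100}{12 / \sqrt{36}} = \frac{-5}{12 / 6} = -2.5$$
If $\bar{X} = 97.2$ then
$$Z = \frac{97.2 - 100}{12 / \sqrt{36}} = \frac{-2.8}{12 / 6} = \frac{-2.8}{2} = -1.4$$
3. $P(95 < \bar{X} < 97.2) = P(-2.5 < Z < -1.4)$. Now we determine the area between −2.5 and −1.4.

[Figure: standard normal curve; area to the left of Z = −2.5 is 0.0062, area to the left of Z = −1.4 is 0.0808, and the shaded area between them is 0.0746]

Since our statistical tables all give cumulative areas to the left:
$$
\begin{aligned}
P(-2.5 < Z < -1.4) &= P(Z < -1.4) - P(Z < -2.5) \\
&= 0.0808 - 0.0062 \\
&= 0.0746
\end{aligned}
$$
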
3. $P(\text{above } 102.2) = P(\bar{X} > 102.2)$?

$\mu = 100$, n = 36, $\sigma = 12$

Steps
(a) Use the transformation formula, called the test statistic: $Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
(b) Substitute the values into the Z formula:
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{102.2 - 100}{12 / \sqrt{36}} = \frac{2.2}{12 / 6} = \frac{2.2}{2} = 1.1$$
(c) Determine the equivalent probability for the sample mean: $P(\bar{X} > 102.2) = P(Z > 1.1)$. Now determine the area to the right of 1.1.

[Figure: standard normal curve; area to the left of Z = 1.1 is 0.8643 and the shaded area to the right is 0.1357]

(d) Find the value using the cumulative standard normal distribution table E.2 (from the Appendix):

Z       0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0
0.1
.......
1.1     0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
.......

$P(Z > 1.1) = 1 - P(Z < 1.1) = 1 - 0.8643 = 0.1357$

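4. $P(\bar{X} > a) = 0.65$
$$P\left(Z > \frac{a - \mu}{\sigma / \sqrt{n}}\right) = 0.65$$
Now substitute $\mu = 100$, n = 36 and $\sigma = 12$:
$$P\left(Z > \frac{a - 100}{12 / \sqrt{36}}\right) = 0.65$$
Because more than half of the area lies to the right of a, the corresponding Z-value is negative. By symmetry,
look inside the cumulative standard normal table for the area closest to 0.65 and then change the sign of the Z-value found.

Z       0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0
0.1
.......
0.3     0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
.......

The area closest to 0.65 is 0.6517, at Z = 0.39, so the required Z-value is −0.39.
$$\frac{a - 100}{12 / \sqrt{36}} = -0.39$$
$$\frac{a - 100}{2} = -0.39$$
$$a = 100 - (2 \times 0.39) = 99.22$$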

Question 2
Given an infinite population with a mean of 75 and a standard deviation of 12, the probability that the
mean of a sample of 36 observations, taken at random from this population, exceeds 78 is

1. 0.4332
2. 0.0668
3. 0.0987
4. 0.9013
5. 0.9332
Solution
$P(\bar{X} > 78)$?

Steps
1. Use the transformation formula, called the test statistic: $Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$, with $\mu = 75$, $\sigma = 12$, n = 36 and $\bar{X} = 78$
2. Substitute the values into the Z formula:
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{78 - 75}{12 / \sqrt{36}} = \frac{3}{12 / 6} = \frac{3}{2} = 1.5$$
3. Determine the equivalent probability for the sample mean: $P(\bar{X} > 78) = P(Z > 1.5)$. Now determine the area to the right of 1.5.

[Figure: standard normal curve; area to the left of Z = 1.5 is 0.9332 and the shaded area to the right is 0.0668]

4. Find the value using the cumulative standard normal distribution table E.2:
$P(Z > 1.5) = 1 - 0.9332 = 0.0668$

The Central Limit Theorem is important in using statistical inference to draw conclusions about a
population without having to know the specific shape of the population distribution. The theorem
states that the sum (and therefore the mean) of a large number of independent observations from the same distribution has,
under certain general conditions, an approximately normal distribution.
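
A small simulation can make the theorem more concrete. The sketch below is an illustration only and not part of the module: it repeatedly draws samples of size 36 from an exponential population (a strongly skewed distribution with mean 1 and standard deviation 1, chosen as an assumption for this example) and compares the spread of the sample means with the standard error σ/√n.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(1610)        # fixed seed so the run is repeatable

n = 36                   # size of each sample
num_samples = 5000       # number of repeated samples

# Exponential population with rate 1: mean 1, standard deviation 1, skewed to the right.
sample_means = [
    fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = fmean(sample_means)   # should be close to the population mean, 1
sd_of_means = stdev(sample_means)     # should be close to sigma / sqrt(n)
theoretical_se = 1.0 / sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(theoretical_se, 3))
```

Even though the population is far from bell-shaped, a histogram of `sample_means` would look approximately normal, centred near the population mean.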

7.3 Sampling Distribution of Proportion


The sample proportion is represented by p and the population proportion is represented by $\pi$; p
is a statistic and $\pi$ is a parameter.
The statistic p is used to estimate the population proportion parameter $\pi$. This sample proportion is an
unbiased estimator of the population proportion because the expected value of the sample statistic
is equal to the respective population parameter, i.e. $E(p) = \pi$.
The sample proportion formula
$$p = \frac{X}{n} = \frac{\text{number of items having the characteristic of interest}}{\text{sample size}}$$

THE STANDARD ERROR OF PROPORTION


The standard error of proportion $\sigma_p$ is given by
$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}$$

In many instances, you can use the normal distribution to estimate the sampling distribution of the
proportion. When the parameter $\pi$ is unknown, the standard error of proportion is given by
$$\sigma_p = \sqrt{\frac{p(1 - p)}{n}}$$

When do you assume that the sampling distribution of proportion is approximately normally
distributed?
It is when $n\pi$ and $n(1 - \pi)$ are each at least 5.
How do you calculate the probability for the sampling distribution of the proportion?
Steps
1. Determine the population proportion $\pi$ and the sample proportion p.
2. Determine the sample size n.
3. Determine the value of the sample proportion (p) for which we want to determine the probability.
4. Find the value of Z, called the test statistic:
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} \quad \text{or} \quad Z = \frac{p - \pi}{\sigma_p}$$
where the standard error of the proportion is $\sigma_p = \sqrt{\dfrac{\pi(1 - \pi)}{n}}$ when $\pi$ is known,
and $\sigma_p = \sqrt{\dfrac{p(1 - p)}{n}}$ when $\pi$ is unknown.
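
The steps for a proportion can be sketched in the same way as for a mean. In the illustration below the function names are hypothetical, the values π = 0.5, n = 100 and p = 0.55 are made up for the example, and the area is taken from `statistics.NormalDist` (Python 3.8+) instead of the normal table.

```python
from math import sqrt
from statistics import NormalDist


def standard_error_of_proportion(pi, n):
    """sigma_p = sqrt(pi * (1 - pi) / n), used when the population proportion is known."""
    return sqrt(pi * (1 - pi) / n)


def normal_approximation_ok(pi, n):
    """Rule of thumb from the text: n*pi and n*(1 - pi) must each be at least 5."""
    return n * pi >= 5 and n * (1 - pi) >= 5


pi, n, p = 0.5, 100, 0.55                 # illustrative values
assert normal_approximation_ok(pi, n)     # 50 and 50 are both at least 5

se = standard_error_of_proportion(pi, n)  # sqrt(0.0025) = 0.05
z = (p - pi) / se                         # one standard error above pi
p_at_least = 1.0 - NormalDist().cdf(z)    # P(p >= 0.55) = P(Z >= 1)

print(round(se, 4), round(z, 2), round(p_at_least, 4))
```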

Activity 7.5
Question 1
In a random sample of 64 people, 48 are classified as "successful".
(a) Determine the sample proportion, p, of "successful".
(b) If the population proportion is 0.80; determine the standard error of proportion.

Solution
(a) The sample proportion
$$p = \frac{X}{n} = \frac{\text{number of items having the characteristic of interest}}{\text{sample size}} = \frac{48}{64} = 0.75$$

(b) Population proportion $\pi = 0.80$
The standard error of the proportion
$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.80(1 - 0.80)}{64}} = 0.05$$

Question 2
Suppose that we will randomly select a sample of n = 100 units from a population and that we
will compute the sample proportion p of these units that fall into a category of interest. Suppose the true
population proportion $\pi$ equals 0.9.
(a) Find the mean and the standard deviation of the sampling distribution of p.
(b) Calculate the following probabilities about the sample proportion p.
(i) $P(p \ge 0.96)$
(ii) $P(0.855 \le p \le 0.945)$
(iii) $P(p \ge 0.915)$

Solution
(a) The population of all possible sample proportions has mean $\mu_p = \pi = 0.9$.
The standard deviation
$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{0.9(1 - 0.9)}{100}} = \sqrt{0.0009} = 0.03$$
n
100
(b) (i) $P(p \ge 0.96)$

Steps
1. The population proportion $\pi = 0.9$ and the sample proportion p = 0.96
2. The sample size n = 100
3. The value of Z, called the test statistic:
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} \quad \text{or} \quad Z = \frac{p - \pi}{\sigma_p} = \frac{0.96 - 0.9}{0.03} = 2$$
4. $P(p \ge 0.96) = P(Z \ge 2) = 0.0228$

[Figure: standard normal curve with an area of 0.0228 shaded to the right of Z = 2]

(ii) $P(0.855 \le p \le 0.945)$?

Steps
1. The population proportion $\pi = 0.9$
2. The sample proportions p = 0.855 and p = 0.945
3. The sample size n = 100
4. The value of Z, called the test statistic:
if p = 0.855 then $Z = \dfrac{p - \pi}{\sigma_p} = \dfrac{0.855 - 0.9}{0.03} = -1.5$
if p = 0.945 then $Z = \dfrac{p - \pi}{\sigma_p} = \dfrac{0.945 - 0.9}{0.03} = 1.5$

[Figure: two standard normal curves; the area to the left of Z = −1.5 is 0.0668 and the area to the left of Z = 1.5 is 0.9332]

5. $P(0.855 \le p \le 0.945) = P(-1.5 \le Z \le 1.5) = P(Z \le 1.5) - P(Z \le -1.5) = 0.9332 - 0.0668 = 0.8664$
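(iii) $P(p \ge 0.915)$?

Steps
1. The population proportion $\pi = 0.9$ and the sample proportion p = 0.915
2. The sample size n = 100
3. The value of Z, called the test statistic:
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} \quad \text{or} \quad Z = \frac{p - \pi}{\sigma_p} = \frac{0.915 - 0.9}{0.03} = 0.5$$

[Figure: standard normal curve; area to the left of Z = 0.5 is 0.6915 and the shaded area to the right is 0.3085]

4. $P(p \ge 0.915) = P(Z \ge 0.5) = 1 - 0.6915 = 0.3085$

SELF ASSESSMENT

STUDY UNIT 7.2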

Question 1
Time spent using e-mail per session is normally distributed, with a population mean of 8 minutes and
population standard deviation of 2 minutes. Select a random sample of 16 sessions,
(a) What is the probability that the sample mean is between 7.8 and 8.2 minutes?
(b) If you select a random sample of 100 sessions, what is the probability that the sample mean is
between 7.8 and 8.2 minutes?

Question 2
Consider an infinite population with a mean of 160 and a standard deviation of 25. A random sample
of size 64 is taken from this population. The standard error of the mean equals
1. 0.391
2. 6.4
3. 2.50
4. 9.766
5. 3.125
Question 3
The standard error of the mean is
1. the standard deviation of the sampling distribution.
2. the squared value of the population variance.
3. the same value as the population standard deviation.
4. the same for distributions of all sample sizes.
5. the mean of the sampling distribution

Question 4
A manufacturing company packages peanuts for Piedmont Airlines. The individual packages weigh
1.4 grams with a standard deviation of 0.6 grams. For a flight of 152 passengers receiving the peanuts,

the probability that the average weight of the packages is less than 1.3 grams is
1. 0.0202
2. 0.2040
3. 0.9798
4. 0.4798
5. 2.0500

Question 5
The fill amount of bottles of a soft drink is normally distributed, with a mean of 2.0 liters and a
standard deviation of 0.06 liter. If you select a random sample of 36 bottles, what is the probability
that the sample mean will be
(a) between 1.99 and 2.0 liters?
(b) below 1.98 liters?
(c) greater than 2.01 liters?
(d) The probability is 99% that the sample mean amount of soft drink will be at least how much?
(e) The probability is 99% that the sample mean amount of soft drink will be between which two
values (symmetric about the mean)?

SELF-ASSESSMENT STUDY UNIT 7.3

Question 1
In each of the following cases, find the mean, variance, and the standard deviation of the sampling
distribution of the sample proportion p.
(a) $\pi = 0.5$, n = 250

(b) $\pi = 0.98$, n = 1000


Question 2
A political pollster is conducting an analysis of sample results in order to make predictions on election
night. Assuming a two-candidate election, if a specific candidate receives at least 55% of the vote in
the sample, then that candidate will be forecast as the winner of the election. If you select a random
sample of 100 voters, what is the probability that a candidate will be forecast as the winner when
(a) The true percentage of her vote is 50.1%?
(b) The true percentage of her vote is 49%?


Question 3

According to Gallup's poll on personal finances, 46% of U.S. workers say they feel they will
have enough money to live comfortably when they retire. If you select a random sample of 200 U.S.
workers,

(a) what is the probability that the sample will have between 45% and 55% who say they have
enough money to live comfortably now and expect to do so in future?

(b) the probability is 90% that the sample percentage will be contained within what symmetrical limits
of the population percentage?

SOLUTIONS TO SELF-ASSESSMENT STUDY UNIT 7.2

Question 1

Given information: $\mu = 8$, $\sigma = 2$, n = 16

(a) $P(7.8 < \bar{X} < 8.2)$?

Steps
1. The transformation formula is the test statistic $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

2. Substitute the values into the Z formula:

If $\bar{X} = 7.8$ then $Z = \frac{7.8 - 8}{2/\sqrt{16}} = \frac{-0.2}{2/4} = \frac{-0.2}{0.5} = -0.4$

If $\bar{X} = 8.2$ then $Z = \frac{8.2 - 8}{2/\sqrt{16}} = \frac{0.2}{2/4} = \frac{0.2}{0.5} = 0.4$

3. $P(7.8 < \bar{X} < 8.2) = P(-0.4 < Z < 0.4)$; now determine the area between -0.4 and 0.4.

[Figure: standard normal curve; the area below Z = -0.4 is 0.3446 and the area below Z = 0.4 is 0.6554]

4. The value using the cumulative standard normal distribution table is

$P(-0.4 < Z < 0.4) = P(Z < 0.4) - P(Z < -0.4) = 0.6554 - 0.3446 = 0.3108$

(b) $P(7.8 < \bar{X} < 8.2)$?

Given information: $\mu = 8$, $\sigma = 2$, n = 100

Steps

1. Use the transformation formula, called the test statistic, $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

2. Substitute the values into the Z formula:

If $\bar{X} = 7.8$ then $Z = \frac{7.8 - 8}{2/\sqrt{100}} = \frac{-0.2}{2/10} = \frac{-0.2}{0.2} = -1$

If $\bar{X} = 8.2$ then $Z = \frac{8.2 - 8}{2/\sqrt{100}} = \frac{0.2}{2/10} = \frac{0.2}{0.2} = 1$

3. $P(7.8 < \bar{X} < 8.2) = P(-1 < Z < 1)$; now determine the area between -1 and 1.

[Figure: standard normal curve; the area below Z = -1 is 0.1587 and the area below Z = 1 is 0.8413]

4. The value using the cumulative standard normal distribution table is

$P(-1 < Z < 1) = P(Z < 1) - P(Z < -1) = 0.8413 - 0.1587 = 0.6826$

With the larger sample the standard error shrinks from 0.5 to 0.2, so the sample mean is much more
likely to fall within 0.2 minutes of the population mean.
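The two parts of Question 1 differ only in the sample size, so the calculation can be sketched as one small Python function. This is an illustrative check rather than part of the study guide: the function name is hypothetical, and it assumes Python's standard-library `statistics.NormalDist` in place of the printed table, so results may differ from table-based answers in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_mean_between(lower, upper, mu, sigma, n):
    """P(lower < sample mean < upper) for a normal sampling distribution.

    The sample mean has mean mu and standard error sigma / sqrt(n).
    """
    standard_error = sigma / sqrt(n)
    z_lower = (lower - mu) / standard_error
    z_upper = (upper - mu) / standard_error
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(z_upper) - std_normal.cdf(z_lower)


# Question 1: e-mail session times with mu = 8 and sigma = 2
prob_n16 = prob_sample_mean_between(7.8, 8.2, mu=8, sigma=2, n=16)
prob_n100 = prob_sample_mean_between(7.8, 8.2, mu=8, sigma=2, n=100)
print(round(prob_n16, 4), round(prob_n100, 4))
```

Changing only `n` shows how the standard error, and therefore the probability, responds to the sample size.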

Question 2
Steps
1. Population mean $\mu = 160$, population standard deviation $\sigma = 25$
2. The sample size n = 64

The standard error of the mean is $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{25}{\sqrt{64}} = \frac{25}{8} = 3.125$

Option 5

Question 3
Option 1
Question 4
$P(\bar{X} < 1.3)$?

Given information: $\mu = 1.4$, $\sigma = 0.6$, n = 152, $\bar{X} = 1.3$

Steps
1. The test statistic $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

2. Substitute the values into the Z formula:

$Z = \frac{1.3 - 1.4}{0.6/\sqrt{152}} = \frac{-0.1}{0.6/12.3288} = \frac{-0.1}{0.0487} = -2.05$

3. Determine the Z value equivalent to the sample mean for which we want the probability:
$P(\bar{X} < 1.3) = P(Z < -2.05)$; now determine the area to the left of -2.05.

[Figure: standard normal curve with the area 0.0202 shaded to the left of Z = -2.05]

4. Find the value using the cumulative standard normal distribution table E.2 (from the Appendix):
$P(Z < -2.05) = 0.0202$

Option 1

Question 5

Given information: population mean $\mu = 2.0$, population standard deviation $\sigma = 0.06$, sample size n = 36

(a) $P(1.99 < \bar{X} < 2.0)$?

Steps

1. The test statistic $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

2. Substitute the values into the Z formula:

when $\bar{X} = 1.99$ then $Z = \frac{1.99 - 2.0}{0.06/\sqrt{36}} = \frac{-0.01}{0.06/6} = \frac{-0.01}{0.01} = -1$

when $\bar{X} = 2.0$ then $Z = \frac{2.0 - 2.0}{0.06/\sqrt{36}} = \frac{0}{0.06/6} = \frac{0}{0.01} = 0$

3. $P(1.99 < \bar{X} < 2.0) = P(-1 < Z < 0) = P(Z < 0) - P(Z < -1)$; now determine the area
between -1 and 0.

[Figure: standard normal curve with the area 0.3413 shaded between Z = -1 and Z = 0]

4. The value using the cumulative standard normal distribution table E.2:
$P(Z < 0) = 0.5$
$P(Z < -1) = 0.1587$

$P(-1 < Z < 0) = P(Z < 0) - P(Z < -1) = 0.5 - 0.1587 = 0.3413$

(b) $P(\bar{X} < 1.98)$?

1. The test statistic $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

2. Substitute the values into the Z formula:

$Z = \frac{1.98 - 2.0}{0.06/\sqrt{36}} = \frac{-0.02}{0.06/6} = \frac{-0.02}{0.01} = -2$

3. $P(\bar{X} < 1.98) = P(Z < -2)$; now determine the area to the left of -2.

[Figure: standard normal curve with the area 0.0228 shaded to the left of Z = -2]

4. The value using the cumulative standard normal distribution table is

$P(Z < -2) = 0.0228$
(c) $P(\bar{X} > 2.01)$?

1. The test statistic $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

2. Substitute the values into the Z formula:

$Z = \frac{2.01 - 2.0}{0.06/\sqrt{36}} = \frac{0.01}{0.06/6} = \frac{0.01}{0.01} = 1$

3. $P(\bar{X} > 2.01) = P(Z > 1)$; now determine the area to the right of 1.

[Figure: standard normal curve with the area 0.1587 shaded to the right of Z = 1]

4. The value using the cumulative standard normal distribution table is

$P(Z > 1) = P(Z < -1) = 0.1587$
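(d) $P(\bar{X} > a) = 0.99$

$P\left(Z > \frac{a - \mu}{\sigma/\sqrt{n}}\right) = 0.99$

An area of 0.99 to the right of a leaves an area of 0.01 to the left of a, so the corresponding Z value
is -2.33 (using the cumulative standardized normal table).

$a = \mu + Z\frac{\sigma}{\sqrt{n}} = 2.0 - 2.33\frac{0.06}{\sqrt{36}} = 2.0 - 0.0233 = 1.9767$

(e) The area between A and B equals 0.99, which leaves an area of 0.005 in each tail. The Z values
corresponding to the cumulative areas 0.005 and 0.995 are -2.58 and 2.58.
The values A and B associated with this probability are given by

$A = \mu - Z\frac{\sigma}{\sqrt{n}} = 2.0 - 2.58\frac{0.06}{\sqrt{36}} = 2.0 - 0.0258 = 1.9742$

$B = \mu + Z\frac{\sigma}{\sqrt{n}} = 2.0 + 2.58\frac{0.06}{\sqrt{36}} = 2.0 + 0.0258 = 2.0258$

$P(1.9742 < \bar{X} < 2.0258) = 0.99$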
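Parts (d) and (e) of Question 5 run the table lookup in reverse: the probability is given and the value of the sample mean is wanted. A minimal Python sketch of that reverse lookup follows. It assumes the standard-library `statistics.NormalDist` (its `inv_cdf` method returns the value below which a given area lies) instead of the printed table, and the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 2.0, 0.06, 36

# Sampling distribution of the mean: normal with standard error sigma / sqrt(n)
sample_mean_dist = NormalDist(mu, sigma / sqrt(n))

# Value that the sample mean exceeds with probability 0.99,
# that is, the value with an area of 0.01 below it.
at_least = sample_mean_dist.inv_cdf(0.01)

# Symmetric limits that contain an area of 0.99 (0.005 in each tail).
lower = sample_mean_dist.inv_cdf(0.005)
upper = sample_mean_dist.inv_cdf(0.995)

print(round(at_least, 4), round(lower, 4), round(upper, 4))
```

The one-sided question uses the area 0.01 in a single tail, while the symmetric interval splits the remaining 0.01 into 0.005 per tail, which is why the two parts use different Z values.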

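SOLUTIONS TO SELF-ASSESSMENT STUDY UNIT 7.3

Question 1
(a) $\pi = 0.5$, n = 250
The mean $\mu_p = \pi = 0.5$
The standard deviation for proportion $\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} = \sqrt{\frac{0.5(1-0.5)}{250}} = \sqrt{0.001} = 0.0316$
The variance $\sigma_p^2 = \frac{\pi(1-\pi)}{n} = 0.001$

(b) $\pi = 0.98$, n = 1000
The mean $\mu_p = \pi = 0.98$
The standard deviation for proportion $\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} = \sqrt{\frac{0.98(1-0.98)}{1000}} = \sqrt{0.0000196} = 0.0044$
The variance $\sigma_p^2 = \frac{\pi(1-\pi)}{n} = 0.0000196$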
Question 2
Given information: sample proportion p = 55% = 0.55, sample size n = 100
(a) Population proportion $\pi$ = 50.1% = 0.501

$P(p > 0.55)$?

Steps
1. The test statistic

$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} = \frac{0.55 - 0.501}{\sqrt{\frac{0.501(1-0.501)}{100}}} = \frac{0.049}{\sqrt{0.0025}} = \frac{0.049}{0.05} = 0.98$

2. $P(p > 0.55) = P(Z > 0.98)$; now determine the area to the right of 0.98.

[Figure: standard normal curve; the area below Z = 0.98 is 0.8365]

3. The value using the cumulative standard normal distribution table is
$P(Z > 0.98) = 1 - P(Z < 0.98) = 1 - 0.8365 = 0.1635$
(b) Population proportion $\pi = 0.49$
Given information: p = 0.55, n = 100

Steps
1. The test statistic

$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} = \frac{0.55 - 0.49}{\sqrt{\frac{0.49(1-0.49)}{100}}} = \frac{0.06}{0.05} = 1.20$

2. $P(p > 0.55) = P(Z > 1.20)$; now determine the area to the right of 1.20.

[Figure: standard normal curve; the area below Z = 1.20 is 0.8849]

3. The value using the cumulative standard normal distribution table is

$P(Z > 1.20) = 1 - P(Z < 1.20) = 1 - 0.8849 = 0.1151$
Question 3
(a) $P(0.45 < p < 0.55)$
Given information: population proportion $\pi = 0.46$, sample proportions p = 0.45 and p = 0.55,
n = 200

Steps
1. The test statistic $Z = \frac{p - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}}$

when p = 0.45 then $Z = \frac{0.45 - 0.46}{\sqrt{\frac{0.46(1-0.46)}{200}}} = \frac{-0.01}{0.0352} = -0.2841$

when p = 0.55 then $Z = \frac{0.55 - 0.46}{\sqrt{\frac{0.46(1-0.46)}{200}}} = \frac{0.09}{0.0352} = 2.5568$

2. $P(0.45 < p < 0.55) = P(-0.2841 < Z < 2.5568)$; now determine the area between
-0.2841 and 2.5568.

[Figure: standard normal curve; the area below Z = -0.28 is 0.3897 and the area below Z = 2.56 is 0.9948]

3. The value using the cumulative standard normal distribution table is
$P(-0.2841 < Z < 2.5568) = P(Z < 2.5568) - P(Z < -0.2841) = 0.9948 - 0.3897 = 0.6051$

(b) The area between A and B represents 0.90. The Z value corresponding to the area 0.90 is 1.645.
The values A and B associated with this probability are given by

$A = \pi - Z\sqrt{\frac{\pi(1-\pi)}{n}} = 0.46 - 1.645\sqrt{\frac{0.46(1-0.46)}{200}} = 0.4021$

$B = \pi + Z\sqrt{\frac{\pi(1-\pi)}{n}} = 0.46 + 1.645\sqrt{\frac{0.46(1-0.46)}{200}} = 0.5179$

$P(0.4021 < p < 0.5179) = 0.90$
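The probability calculations for a sample proportion in this unit all follow the same pattern: compute the standard error from the population proportion and n, then take a difference of normal areas. The Python sketch below illustrates this for Question 3 (a). The function name is hypothetical and `statistics.NormalDist` stands in for the printed table, so the result agrees with the table-based answer only to about two decimal places, because no Z value is rounded.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_proportion_between(lower, upper, pi, n):
    """P(lower < p < upper) using the normal approximation.

    The sample proportion p has mean pi and standard error
    sqrt(pi * (1 - pi) / n).
    """
    standard_error = sqrt(pi * (1 - pi) / n)
    sampling_dist = NormalDist(pi, standard_error)
    return sampling_dist.cdf(upper) - sampling_dist.cdf(lower)


# Question 3 (a): pi = 0.46, n = 200
prob = prob_sample_proportion_between(0.45, 0.55, pi=0.46, n=200)
print(round(prob, 4))
```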


STUDY UNIT 8
STUDY CHAPTER 8
CONFIDENCE INTERVAL ESTIMATION

Key questions for this unit


What is meant by the following concepts: point estimate, standard deviation known,
standard deviation unknown, the level of confidence, the level of significance,
the critical value, degrees of freedom and Student's t distribution?
Define a confidence interval estimate.
Distinguish between a confidence interval estimate for the mean and a confidence interval
estimate for proportion.
Distinguish between a confidence interval estimate for the mean when $\sigma$ is known and a
confidence interval estimate for the mean when $\sigma$ is unknown.
What is the purpose of constructing a confidence interval estimate?

OBJECTIVES
At the end of this chapter, you should be able to construct and interpret confidence interval estimates
for the mean and the proportion.

8.1 Introduction to this study unit


In this study unit we use inferential statistics, the process of using sample results to estimate
unknown population parameters such as the population mean or the population proportion. We estimate
population parameters using either point estimates or interval estimates.
A point estimate is the value of a single sample statistic. For example, the sample mean $\bar{X}$ is the
point estimate of the population mean $\mu$, and the sample proportion p is the point estimate of the
population proportion $\pi$. A confidence interval estimate is a range of numbers, called an interval,
constructed around the point estimate. It is called a confidence interval because we associate a degree of
confidence that the real value of the population mean lies within this interval. Of course, the interval
may or may not contain the true value of the population mean or proportion. Note that even a
statistician cannot be 100% sure either way. So, what we do is indicate a level of confidence that
the true population mean or population proportion will lie within the confidence interval.

Activity 8.1: Overview

Study Skill

Draw a mind map of the different sections and headings you will deal with in this study session. Then
page through the unit with the purpose of completing the map.

[Diagram: "Confidence interval estimate" branches into "Confidence interval estimate for the mean" and
"Confidence interval estimate for proportion"; the branch for the mean splits into "Confidence interval
estimate when $\sigma$ is known" and "Confidence interval estimate when $\sigma$ is unknown"]

There are two options to consider for a confidence interval estimate of the mean, depending on whether
the population standard deviation is known or unknown.
Activity 8.2: Concepts

Conceptual Skill

Communication Skill

Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline. Can
you think of some examples?

English term | Description | Term in your home language
Confidence level | |
Confidence interval | |
Point estimate | |
Population standard deviation known | |
Population standard deviation unknown | |
Critical value | |
Degrees of freedom | |

8.2 Confidence Interval Estimate for the Mean when the population standard deviation is known

When the population standard deviation is known, the normal distribution is used in the construction
of the interval.
Steps
1. Determine the sample mean $\bar{X}$
2. Determine the population standard deviation $\sigma$
3. Determine the sample size n
4. Determine the Z value, called the critical value, corresponding to the level of confidence $(1-\alpha)100\%$

Level | $Z_{\alpha/2}$ (two-tailed) | $Z_{\alpha}$ (one-tailed)
99% | 2.58 | 2.33
95% | 1.96 | 1.645
90% | 1.645 | 1.28

5. The confidence interval estimate for the population mean is

$\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$

that is,

$\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$

where the first value is the lower limit of the interval and the second value is the upper limit of the
interval.

6. Substitute the values into the above formula.
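The six steps above can be sketched as a short Python function. This is an illustrative sketch only: the function name and argument names are hypothetical, the critical value is passed in as read from the table in step 4, and the example numbers (sample mean 120, $\sigma$ = 24, n = 36, 99% confidence) are simply convenient values.

```python
from math import sqrt


def z_confidence_interval(sample_mean, sigma, n, z_critical):
    """Confidence interval for the mean when sigma is known.

    z_critical is the two-tailed table value, e.g. 1.96 for 95%.
    """
    margin = z_critical * sigma / sqrt(n)  # half-width of the interval
    return sample_mean - margin, sample_mean + margin


# 99% confidence level, so the two-tailed critical value is 2.58
lower, upper = z_confidence_interval(sample_mean=120, sigma=24, n=36, z_critical=2.58)
print(round(lower, 2), round(upper, 2))
```

Passing the critical value in explicitly keeps the function tied to the table in step 4 rather than hiding the choice of confidence level.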

Activity 8.3
Question 1
The owner of a large shopping centre is besieged with complaints about the shortage of parking space.
He feels that the 1 000 spaces are adequate. In an effort to address the problem, he obtains a sample of
the average number of cars in the parking lot during prime hours. The sample of 40 has a mean of
952. Assume a population standard deviation of 396. The 95% confidence interval estimate for prime
hour parking is
1. 790.46 to 1112.54
2. 849.31 to 1054.69
3. 829.28 to 1074.72
4. 932.60 to 971.40
5. 952.00 to 1052.00

Solution
Steps
1. Sample mean $\bar{X} = 952$
2. Population standard deviation $\sigma = 396$
3. Sample size n = 40
4. Select $Z_{\alpha/2} = 1.96$ at the 95% confidence level
5. The confidence interval estimate formula is $\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
6. Substitute the values into the formula:

$952 \pm 1.96\frac{396}{\sqrt{40}}$
$952 \pm 122.7217$
$(952 - 122.7217,\ 952 + 122.7217)$
$(829.2783,\ 1074.7217)$

Option 3

Question 2
If $\bar{X} = 120$, $\sigma = 24$ and n = 36, construct a 99% confidence interval estimate of the population
mean $\mu$.
Solution
Steps
1. Sample mean $\bar{X} = 120$
2. Population standard deviation $\sigma = 24$
3. Sample size n = 36
4. Use the critical value $Z_{\alpha/2} = 2.58$ for a 99% confidence interval estimate
5. Substitute the values into the formula:

$120 \pm 2.58\frac{24}{\sqrt{36}}$
$120 \pm 10.32$
$(120 - 10.32,\ 120 + 10.32)$
$(109.68,\ 130.32)$

8.3 Confidence Interval Estimate for the Mean (population standard deviation is unknown)

When the population standard deviation $\sigma$ is unknown, we construct a confidence interval estimate
of $\mu$ using the sample standard deviation S as an estimate of the population standard deviation $\sigma$.
Recall that for the confidence interval estimate of the mean with $\sigma$ known, the normal distribution
was used in the construction of the interval. If $\sigma$ is unknown, the Student's t distribution
with n - 1 degrees of freedom is used.
The degrees of freedom are df = n - 1.
The critical value of t for the appropriate degrees of freedom is found from the table of the t distribution.

The confidence interval for the mean ($\sigma$ unknown) is

$\bar{X} \pm t_{(n-1,\ \alpha/2)}\frac{S}{\sqrt{n}}$

that is,

$\left(\bar{X} - t_{(n-1,\ \alpha/2)}\frac{S}{\sqrt{n}},\ \bar{X} + t_{(n-1,\ \alpha/2)}\frac{S}{\sqrt{n}}\right)$

where the first value is the lower limit of the interval and the second value is the upper limit of the
interval.

Steps
1. Determine the sample mean $\bar{X}$
2. Determine the sample standard deviation S
3. Determine the sample size n
4. Determine the degrees of freedom df = n - 1
5. Find the critical value $t_{(n-1,\ \alpha/2)}$ using the Student's t table
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown)

Activity 8.4
Question 1
For a selected month, the average number of kilowatt hours used by 49 residential customers is 1160
and the standard deviation S is 1085 kilowatt hours. Assume that the t-value for a 95% confidence interval
is 1.6772. Determine the confidence interval estimate for the true mean.
Solution
Steps
1. The sample mean $\bar{X} = 1160$
2. The sample standard deviation S = 1085
3. The sample size n = 49
4. The degrees of freedom df = n - 1 = 49 - 1 = 48
5. The critical value equals 1.6772
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown) formula:

$1160 \pm 1.6772\frac{1085}{\sqrt{49}}$
$1160 \pm 259.966$
$(1160 - 259.966,\ 1160 + 259.966)$
$(900.034,\ 1419.966)$

The lower limit of the interval is 900.034 and the upper limit of the interval is 1419.966.

Question 2
A stationery store wants to estimate the mean retail value of greeting cards that it has in its inventory.
A random sample of 100 greeting cards indicates a mean value of R2.65 and a standard deviation
of R0.44. Assuming a normal distribution, construct a 95% confidence interval estimate of the mean
value of all greeting cards in the store's inventory.
Solution
Steps
1. The sample mean $\bar{X} = 2.65$
2. The sample standard deviation S = 0.44
3. The sample size n = 100
4. The degrees of freedom df = n - 1 = 100 - 1 = 99
5. The critical value $t_{(99,\ 0.025)}$ equals 1.9842
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown) formula:

$2.65 \pm 1.9842\frac{0.44}{\sqrt{100}}$
$2.65 \pm 0.0873$
$(2.65 - 0.0873,\ 2.65 + 0.0873)$
$(2.5627,\ 2.7373)$

The lower limit of the interval is 2.5627 and the upper limit of the interval is 2.7373.

HOW DO YOU CALCULATE THE PROBABILITY FOR THE SAMPLING DISTRIBUTION OF THE MEAN WHEN $\sigma$ IS UNKNOWN?

Recall that in study unit 7 the probability for a sampling distribution of the mean was calculated
using the value $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$. If the population standard deviation is unknown, the
following statistic is used:

$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$

This expression has the same form as the Z statistic, except that S is used to estimate the unknown $\sigma$.
SELF-ASSESSMENT
Question 1
Your statistics instructor wants you to determine a confidence interval estimate for the mean test
score. Past experience indicates that test scores are normally distributed with a sample mean of
160 and a population standard deviation of 45. A 95% confidence interval estimate if your group has 36
students is:
1. 145.3 to 174.7
2. 157.55 to 162.45
3. 152.5 to 167.5
4. 158.75 to 161.25
5. 160 to 174.7

Question 2
If $\bar{X} = 70$, S = 24 and n = 36, and assuming that the population is normally distributed, construct a
95% confidence interval estimate of the population mean $\mu$.

Question 3
The data represent the overall miles per gallon (MPG) of 2008 SUVs priced under $30 000.

23 20 21 22 18 18 17 17 19 19 19
17 21 18 18 18 17 17 16 20 16 22

Construct a 95% confidence interval estimate for the population mean miles per gallon of 2008 SUVs
priced under $30 000 assuming a normal distribution.
SOLUTIONS FOR SELF-ASSESSMENT
Question 1
Steps
1. Sample mean $\bar{X} = 160$
2. Population standard deviation $\sigma = 45$
3. Sample size n = 36
4. Use the critical value $Z_{\alpha/2} = 1.96$ for a 95% confidence interval estimate
5. Substitute the values into the formula:

$160 \pm 1.96\frac{45}{\sqrt{36}}$
$160 \pm 14.7$
$(160 - 14.7,\ 160 + 14.7)$
$(145.3,\ 174.7)$

Option (1)
Question 2
Steps
1. The sample mean $\bar{X} = 70$
2. The sample standard deviation S = 24
3. The sample size n = 36
4. The degrees of freedom df = n - 1 = 36 - 1 = 35
5. The critical value $t_{(35,\ 0.025)}$ equals 2.0301
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown) formula:

$70 \pm 2.0301\frac{24}{\sqrt{36}}$
$70 \pm 8.1204$
$(70 - 8.1204,\ 70 + 8.1204)$
$(61.8796,\ 78.1204)$

The lower limit of the interval is 61.8796 and the upper limit of the interval is 78.1204.

Question 3
Steps
1. The sample mean $\bar{X} = \frac{\sum X_i}{n} = \frac{23 + 20 + 21 + 22 + \dots + 20 + 16 + 22}{22} = \frac{413}{22} = 18.7727$

2. The sample standard deviation $S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$

$S^2 = \frac{(23 - 18.7727)^2 + (20 - 18.7727)^2 + \dots + (22 - 18.7727)^2}{22 - 1} = \frac{85.8636}{21} = 4.0887$

$S = \sqrt{4.0887} = 2.0221$

3. The sample size n = 22
4. The degrees of freedom df = n - 1 = 22 - 1 = 21
5. The critical value $t_{(21,\ 0.025)}$ equals 2.0796
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown) formula:

$18.7727 \pm 2.0796\frac{2.0221}{\sqrt{22}}$
$18.7727 \pm 0.8965$
$(18.7727 - 0.8965,\ 18.7727 + 0.8965)$
$(17.8762,\ 19.6692)$

The lower limit of the interval is 17.8762 and the upper limit of the interval is 19.6692.
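When the data are given as raw values, as in Question 3, the sample mean and sample standard deviation have to be computed before the interval can be built. The Python sketch below is an illustrative check of that workflow. It assumes the standard-library `statistics` module, and the critical value 2.0796 is typed in from the t table (df = 21, upper-tail area 0.025) because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Miles per gallon of the 22 SUVs from Question 3
mpg = [23, 20, 21, 22, 18, 18, 17, 17, 19, 19, 19,
       17, 21, 18, 18, 18, 17, 17, 16, 20, 16, 22]

n = len(mpg)
sample_mean = mean(mpg)
s = stdev(mpg)          # sample standard deviation (divides by n - 1)
t_critical = 2.0796     # t table value for df = n - 1 = 21, area 0.025

margin = t_critical * s / sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin
print(n, round(sample_mean, 4), round(s, 4), round(lower, 4), round(upper, 4))
```

Small differences in the last decimal place compared with a hand calculation come from rounding the intermediate values by hand.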

8.4 Confidence Interval Estimate for proportion

This section is concerned with estimating the proportion of items in a population having a certain
characteristic of interest. The unknown population proportion is $\pi$ and the sample proportion is

$p = \frac{X}{n}$

where X is the number of items in the sample having the characteristic of interest and
n is the sample size.

The confidence interval estimate for the proportion is given by

$p \pm Z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$

where
p is the sample proportion
$Z_{\alpha/2}$ is the critical value found from the standardized normal distribution
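The formula for the proportion interval can be sketched in Python as follows. The function and argument names are hypothetical, the critical value is supplied from the standardized normal table, and the example values (18 items with the characteristic out of n = 60, 90% confidence) are used purely for illustration.

```python
from math import sqrt


def proportion_confidence_interval(successes, n, z_critical):
    """Confidence interval for a population proportion.

    successes is X, the number of sampled items with the characteristic;
    z_critical is the two-tailed table value, e.g. 1.645 for 90%.
    """
    p = successes / n                              # sample proportion
    margin = z_critical * sqrt(p * (1 - p) / n)    # half-width
    return p - margin, p + margin


lower, upper = proportion_confidence_interval(successes=18, n=60, z_critical=1.645)
print(round(lower, 4), round(upper, 4))
```

Note that the standard error here uses the sample proportion p, not the population proportion, because the population proportion is the unknown quantity being estimated.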

Activity 8-4
Question 1
Companies are spending more time screening applicants than in the past. A study of 102 recruiters
conducted by ExecuNet found that 77 did internet research on candidates. Construct a 95%
confidence interval estimate of the population proportion of recruiters who do internet research on
candidates.

Solution
Steps
1. The sample proportion $p = \frac{X}{n} = \frac{77}{102} = 0.7549$
2. The sample size n = 102
3. The critical value $Z_{\alpha/2} = 1.96$ at the 95% confidence level
4. Substitute the values into the confidence interval estimate for proportion formula:

$0.7549 \pm 1.96\sqrt{\frac{0.7549(1 - 0.7549)}{102}}$
$0.7549 \pm 0.0835$
$(0.7549 - 0.0835,\ 0.7549 + 0.0835)$
$(0.6714,\ 0.8384)$

Question 2
Closed caption movies allow the hearing impaired to enjoy the dialogue as well as the acting. A
local organization for the hearing impaired members of the community takes a random sample of 100
movies listings offered by the cable television company in order to estimate the proportion of closed
caption movies offered. Fourteen movies were closed captioned. The cable television company says
at least 5% of the movies shown are captioned. Use $Z_{\alpha/2} = 1.65$ to prepare a 90% confidence interval
estimate for the true proportion, and comment on the cable television company's claim.
Solution
Steps
1. The sample proportion $p = \frac{X}{n} = \frac{14}{100} = 0.14$
2. The sample size n = 100
3. The critical value $Z_{\alpha/2} = 1.65$ at the 90% confidence level
4. Substitute the values into the confidence interval estimate for proportion formula:

$0.14 \pm 1.65\sqrt{\frac{0.14(1 - 0.14)}{100}}$
$0.14 \pm 0.0573$
$(0.14 - 0.0573,\ 0.14 + 0.0573)$
$(0.0827,\ 0.1973)$

The interval for the population proportion is 0.0827 to 0.1973, or approximately 0.08 to 0.20. The
organization for hearing impaired people can therefore be 90% confident that the proportion of
closed caption movies offered is somewhere between 0.08 (8%) and 0.20(20%). The cable television
company is correct in saying that at least 5% of the movies it shows are closed captioned.
SELF-ASSESSMENT
The owner of a restaurant that serves continental food wants to study characteristics of his
customers. He decides to focus on two variables: the amount of money spent by customers and
whether customers order dessert. The results from a sample of 60 customers are as follows:
Amount spent: $\bar{X}$ = R38.54 and S = R7.26. Dessert: 18 customers purchased dessert.
1. Construct a 95% confidence interval estimate of the population mean amount spent per customer
in the restaurant.
2. Construct a 90% confidence interval estimate of the population proportion of customers who
purchase dessert.
Solution
1. Steps
1. The sample mean $\bar{X} = 38.54$
2. The sample standard deviation S = 7.26
3. The sample size n = 60
4. The degrees of freedom df = n - 1 = 60 - 1 = 59
5. The critical value $t_{(59,\ 0.025)}$ equals 2.0010
6. Substitute the values into the confidence interval estimate for the mean ($\sigma$ unknown) formula:

$38.54 \pm 2.001\frac{7.26}{\sqrt{60}}$
$38.54 \pm 1.8755$
$(38.54 - 1.8755,\ 38.54 + 1.8755)$
$(36.6645,\ 40.4155)$

2. Steps
1. The sample proportion $p = \frac{X}{n} = \frac{18}{60} = 0.3$
2. The sample size n = 60
3. The critical value $Z_{\alpha/2} = 1.645$ at the 90% confidence level
4. Substitute the values into the confidence interval estimate for proportion formula:

$0.3 \pm 1.645\sqrt{\frac{0.3(1 - 0.3)}{60}}$
$0.3 \pm 0.0973$
$(0.3 - 0.0973,\ 0.3 + 0.0973)$
$(0.2027,\ 0.3973)$

STUDY UNIT 9
STUDY CHAPTER 9
HYPOTHESIS TESTING

Key questions for this unit


What is meant by the following concepts: acceptance region, rejection region,
type I error, type II error, critical value, power of the test, two-tailed test,
one-tailed test?
Define hypothesis testing.
Distinguish between hypothesis testing for the mean and hypothesis testing for
proportion.
What is the significance level of a test?
How do you make a decision in hypothesis testing?

9.1 Introduction to this study unit


Hypothesis testing is the statistical assessment of a statement or idea regarding a population. This
means we state a claim or assertion about a particular parameter of a population. For instance, a
statement could be as follows: "The mean weight of cereal boxes is 368 grams." Given the weights of a
sample of cereal boxes, hypothesis testing procedures can be employed to test the validity of
this statement at a given significance level. You will examine the
results of the sample to see whether they support the stated claim. This type of problem introduces you
to inferential statistics.
In the previous study unit you saw that a confidence interval can be used when we have to predict the
value of a population parameter. Inclusion of the parameter was never certain, but we quantified the
likelihood of the parameter lying within that particular interval through the expression of confidence
level (95%, or 99%, or, ....). The same form of quantified uncertainty is used in hypothesis testing.
Activity 9.1: Concepts

Conceptual Skill

Communication Skill

Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline. Can
you think of some examples?

English term | Description | Term in your home language
Acceptance region | |
Rejection region | |
Critical value | |
Two-tailed test | |
One-tailed test | |
Significance level of the test | |
The null hypothesis | |
The alternative hypothesis | |

9.2 Fundamental Concept of Hypothesis testing


In this section, we will present the basic concepts of hypothesis testing as follows:
1. The null hypothesis, denoted by $H_0$, represents the current belief in a situation.
2. The alternative hypothesis, denoted by $H_1$, is the opposite of the null hypothesis and represents a
research claim or specific inference you would like to prove.
3. The level of significance ($\alpha$) is the probability of rejecting the null hypothesis when it is true. It
represents the level of risk that you are willing to take of rejecting the null hypothesis when it is true.
You usually select levels of 0.01 (1%), 0.05 (5%) or 0.10 (10%).
4. The confidence coefficient is the probability that you will not reject the null hypothesis $H_0$ when
it is true and should not be rejected. The confidence coefficient is $(1 - \alpha) \times 100\%$.
5. The acceptance (nonrejection) region is the portion of the distribution in which, if the observed
statistic falls there, the decision is to fail to reject the null hypothesis.
6. The rejection region is the portion of the distribution in which, if the observed statistic falls there,
the decision is to reject the null hypothesis.
7. The critical value is the value that divides the acceptance region from the rejection region.
8. The power of the test is the probability that you will reject $H_0$ when it is false and should be
rejected.
9. A two-tailed test is a statistical test in which the researcher is interested in testing both sides of
the distribution.
10. A one-tailed test is a statistical test in which the researcher is interested in testing one side of the
distribution.

HOW TO SPECIFY THE NULL AND THE ALTERNATIVE HYPOTHESIS

The null hypothesis states that the parameter is equal to some particular value; the
alternative hypothesis reflects the question being asked. To specify the alternative, you must
determine what the question asks.

Examples:
1. Suppose the question asks whether the waiting time to place an order has changed in the past month
from its previous population mean of 4.5 minutes.
The population mean is $\mu$, then
$H_1: \mu \ne 4.5$
and
$H_0: \mu = 4.5$, which means that the population mean equals 4.5.

Alternatively, the question might be: "Is there sufficient evidence to conclude that the waiting time
to place an order is not equal to (or is different from) the previous population mean $\mu = 4.5$?"
Therefore you have to perform a two-tailed test.
2. If the question asks whether there is sufficient evidence to conclude that the population mean is greater
than 4.5, then
$H_1: \mu > 4.5$
and
$H_0: \mu = 4.5$

Therefore you have to perform a one-tailed test.

3. If the question asks whether there is sufficient evidence to conclude that the population mean is less
than 4.5, then
$H_1: \mu < 4.5$
and
$H_0: \mu = 4.5$

Therefore you have to perform a one-tailed test.

Hypothesis testing is classified as either hypothesis testing for the mean or hypothesis testing for
proportion, depending on whether the parameter of interest is the population mean or the proportion
of items in a population having a certain characteristic of interest.
The steps method of hypothesis testing

Step 1
State the null hypothesis $H_0$: population parameter = hypothesized value
Step 2
State the alternative hypothesis $H_1$, which summarizes what will be the case if the null hypothesis is not
true and can assume one of three possible forms:
(a) $H_1$: population parameter $\ne$ hypothesized value
(b) $H_1$: population parameter < hypothesized value
(c) $H_1$: population parameter > hypothesized value
Step 3
Choose the level of significance ($\alpha$). It provides a probability basis for deciding whether an
observed difference between a sample statistic and a hypothesized value is a chance difference or a
statistically significant difference.
Step 4
Determine the appropriate test statistic and compute its value.
Step 5
Determine the critical values that divide the rejection and nonrejection regions.
Step 6
State the decision rule.
The decision rule is a statement that indicates the action to be taken, that is, to fail to reject $H_0$ or
to reject $H_0$.
Reject $H_0$ when the value of the test statistic falls in the rejection region at the specified significance
level (for example, when it is greater than the critical value in an upper-tailed test); otherwise do not
reject $H_0$.

Equivalently, reject $H_0$ when the p-value is less than the significance level.

The p-value is the lowest level of significance at which the null hypothesis can be rejected. It is
determined from the value of the test statistic.

Step 7
State the conclusion. This conclusion should be stated in the context of the problem, and the level of
significance should be included.

9.3 Hypothesis testing of the Mean


We are again going to differentiate between two distinct cases: the population standard deviation
is known or it is unknown. The first distinction to make is whether the population standard deviation or
only the sample standard deviation is known; you then decide whether the population is normally distributed
or not. If you do not know the answer to this last question, check the size of the sample. When the
population is not known to be normal, this module only considers sample sizes greater than 30, because
nonparametric (distribution-free) tests are not included in this module.
9.3.1 Population standard deviation known
When the population standard deviation is known, the normal distribution is used in the hypothesis
testing.
Steps
1. State (or identify) the null hypothesis $H_0$ and the alternative $H_1$.
2. Choose the level of significance ($\alpha$)
3. Determine the sample mean $\bar{X}$
4. Determine the population mean $\mu$
5. Determine the population standard deviation $\sigma$
6. Determine the sample size n
7. Compute the test statistic

$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

8. Make the statistical decision.
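Steps 7 and 8 can be sketched in Python using the p-value form of the decision rule. The function name, the `tail` argument and the example numbers (a sample of 25 cereal boxes with mean 372.5 grams and an assumed known $\sigma$ of 15 grams, tested against a hypothesized mean of 368 grams) are all hypothetical and chosen only for illustration; `statistics.NormalDist` stands in for the normal table.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, tail="two"):
    """Return the Z statistic and p-value for a test of H0: mu = mu0.

    tail is "two", "upper" (H1: mu > mu0) or "lower" (H1: mu < mu0).
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if tail == "upper":
        p_value = 1 - std_normal.cdf(z)
    elif tail == "lower":
        p_value = std_normal.cdf(z)
    elif tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("tail must be 'two', 'upper' or 'lower'")
    return z, p_value


# Hypothetical two-tailed test of H0: mu = 368 at alpha = 0.05
alpha = 0.05
z, p_value = z_test(sample_mean=372.5, mu0=368, sigma=15, n=25, tail="two")
reject_h0 = p_value < alpha   # decision rule: reject H0 if p-value < alpha
print(round(z, 2), round(p_value, 4), reject_h0)
```

In this illustration the p-value is larger than the significance level, so the null hypothesis is not rejected.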


Activity 9.2
Question 1
The quality control manager at a lightbulb factory needs to determine whether the mean life of a large
shipment of lightbulbs is equal to the specified value of 575 hours. State the null and the alternative
hypotheses.

Solution
$H_0: \mu = 575$
$H_1: \mu \ne 575$

Question 2
The p-value for a hypothesis test has been reported as 0.03. If the test result is interpreted using the
$\alpha = 0.05$ level of significance as a criterion, will $H_0$ be rejected? Explain.

Solution
Given information: $\alpha = 0.05$, p-value = 0.03

The decision rule

Reject $H_0$ if the p-value is less than the level of significance $\alpha$.
Since 0.03 < 0.05, we reject $H_0$ at the 5% level of significance.
Question 3
For a sample of 12 items from a normally distributed population for which the standard deviation is
= 17.0, the sample mean is 230.8. At the 5% level of significance, test H0 : = 220 versus H1 :
> 220

(a) Compute the test statistic


(b) Determine the p-value for the test
Solution
(a) Steps
1. The null hypothesis H0: $\mu = 220$ and the alternative H1: $\mu > 220$.
2. The level of significance $\alpha = 0.05$
3. The sample mean $\bar{X} = 230.8$
4. The population mean $\mu = 220$
5. The population standard deviation $\sigma = 17.0$
6. The sample size $n = 12$
7. The test statistic
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{230.8 - 220}{17.0 / \sqrt{12}} = \frac{10.8}{4.9075} = 2.2007$$
which is approximately 2.20.

(b) p-value: $P(Z > 2.20) = 1 - 0.9861 = 0.0139$
(using statistics table E.2 from Appendices)
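The steps of section 9.3.1 can be checked with a few lines of Python. This is only a sketch and not part of the prescribed material: the function name `z_test_mean` is made up for this illustration, and `statistics.NormalDist` from the standard library stands in for table E.2. The numbers are those given in Question 3.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(x_bar, mu0, sigma, n):
    """Return the Z test statistic and the upper-tail p-value P(Z > z)."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    p_upper = 1 - NormalDist().cdf(z)  # area to the right of z
    return z, p_upper


# Values from Question 3: sample mean 230.8, H0 mean 220, sigma 17.0, n 12
z, p_value = z_test_mean(x_bar=230.8, mu0=220, sigma=17.0, n=12)
alpha = 0.05
reject_h0 = p_value < alpha  # decision rule: reject H0 if p-value < alpha

print(f"Z = {z:.4f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Because the software does not round Z to two decimals before looking up the area, the p-value may differ from the table answer in the fourth decimal place.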

9.3.2 Population standard deviation unknown


When the population standard deviation is not known, the hypothesis testing procedure is the same
as when it is known, except for the table you use. If you are given the value of S, you use the
t-distribution table; it is wrong to use the normal distribution in such a question.
Other reminders:
- The test statistic follows a t distribution having $n - 1$ degrees of freedom. The degrees of
  freedom are not the sample size.
- The table values are given as positive values. If you work in the left tail of the area under the
  curve, you have to put a minus sign before the table value.
- If you are working two-tailed, you have to divide the significance level by 2 and use the answer for
  the table. You then use that table value twice: once in the right tail with a positive sign and once
  in the left tail, making it negative.

Steps
1. State (or identify) the null hypothesis H0 and the alternative H1.
2. Choose the level of significance ($\alpha$).
3. Determine the sample mean $\bar{X}$.
4. Determine the population mean $\mu$.
5. Determine the sample standard deviation $S$.
6. Determine the sample size $n$.
7. Compute the test statistic
$$t = \frac{\bar{X} - \mu}{S / \sqrt{n}}$$
8. Make the statistical decision.

Activity 9.3.2:
Question 1
My daughter and I have argued about the average length of our preacher's sermons on Sunday mornings.
Despite my arguments, she thinks that the sermons are longer than twenty minutes and this is not
acceptable to her. For one year she randomly selected 12 Sundays and found an average time of
26.42 minutes with a standard deviation of 6.69 minutes. Assuming that the population is normally
distributed and using a 0.05 level of significance, we decided to make a scientific analysis, using a
hypothesis test. Calculate the test statistic and make a statistical decision.
Solution
Steps
1. The null hypothesis H0: $\mu \le 20$ versus the alternative H1: $\mu > 20$
2. The level of significance $\alpha = 0.05$
3. The sample mean $\bar{X} = 26.42$
4. The population mean $\mu = 20$
5. The sample standard deviation $S = 6.69$
6. The sample size $n = 12$
7. The test statistic
$$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{26.42 - 20}{6.69 / \sqrt{12}} = \frac{6.42}{1.9312} = 3.3244$$

Conclusion
Reject H0 if the test statistic is greater than the critical value.
Because S is used, the critical value comes from the t table with $n - 1 = 11$ degrees of freedom:
$t_{0.05,11} = 1.7959$.
Since the test statistic 3.3244 is greater than 1.7959, H0 can be rejected. We conclude that there
is enough evidence that the alternative H1 is true and that my daughter is correct in thinking that the
average length of sermons is more than 20 minutes.
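As a check on the arithmetic in this activity, the t statistic can be computed in Python. This is a sketch under stated assumptions: the helper name `t_statistic` is hypothetical, and because the standard library has no t-distribution table, the critical value for 11 degrees of freedom is typed in by hand from a t table.

```python
from math import sqrt


def t_statistic(x_bar, mu0, s, n):
    """t = (sample mean - hypothesised mean) / (S / sqrt(n))."""
    return (x_bar - mu0) / (s / sqrt(n))


# Values from the sermon question
n = 12
t = t_statistic(x_bar=26.42, mu0=20, s=6.69, n=n)
df = n - 1  # degrees of freedom, not the sample size

# Assumption: upper-tail table value t(0.05, 11), read from a t table by hand
t_critical = 1.7959
reject_h0 = t > t_critical  # right-tailed test

print(f"t = {t:.4f}, df = {df}, reject H0: {reject_h0}")
```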


Question 2
A random sample of 10 observations was drawn from a normally distributed population.
The data values were: 6, 4, 4, 7, 5, 5, 4, 5, 6, and 4. A person tested the hypothesis H0: $\mu \ge 6$
versus H1: $\mu < 6$ and scribbled down his calculations. When his friend came along and quickly
wanted to copy his work, he did not read properly and wrote down that
1. the sample mean is equal to 4
2. the sample variance is equal to 1
3. the rejection region is $t < -t_{(0.05,10)} = -1.833$
4. the test statistic is $t = -3.0$
5. the conclusion is to reject H0, because the test statistic $t = -3.0 < -1.833$
Which of the above statements is correct?
Solution

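1. The sample mean is
$$\bar{X} = \frac{\sum X_i}{n} = \frac{6+4+4+7+5+5+4+5+6+4}{10} = \frac{50}{10} = 5$$
2. The variance is
$$S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{(6-5)^2 + (4-5)^2 + (4-5)^2 + (7-5)^2 + \dots + (4-5)^2}{10 - 1} = \frac{10}{9} = 1.1111$$
and the standard deviation is $S = \sqrt{1.1111} = 1.0541$.
3. The rejection region is $t < -t_{(0.05,\,n-1)} = -t_{(0.05,\,9)} = -1.833$
4. The test statistic is
$$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{5 - 6}{1.0541 / \sqrt{10}} = \frac{-1}{0.3333} = -3$$
5. Correct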
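The sample mean, variance and test statistic for these ten observations can be verified with Python's standard `statistics` module. This is a sketch for checking hand calculations only; the variable names are chosen for this illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance

data = [6, 4, 4, 7, 5, 5, 4, 5, 6, 4]
mu0 = 6                  # hypothesised population mean

n = len(data)
x_bar = mean(data)       # sample mean
s2 = variance(data)      # sample variance, divides by n - 1
s = stdev(data)          # sample standard deviation
df = n - 1               # degrees of freedom for the t table

t = (x_bar - mu0) / (s / sqrt(n))

print(f"mean = {x_bar}, variance = {s2:.4f}, S = {s:.4f}, df = {df}, t = {t:.4f}")
```

Note that `variance` and `stdev` use the $n - 1$ divisor, which is what the formula for $S^2$ requires; `pvariance` and `pstdev` would divide by $n$ instead.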

Question 3
The credit manager of a large department store claims that the mean balance for the store's charge
account customers is R410. An independent auditor selects a random sample of 18 accounts and
finds a mean balance of $\bar{X}$ = R511.33 and a standard deviation of $S$ = R183.75. If the manager's
claim is not supported by these data, the auditor intends to examine all charge account balances. If
the population of account balances is assumed to be approximately normally distributed, what action
should the auditor take?
Solution
Steps
1. The null hypothesis H0: $\mu = 410$ versus the alternative H1: $\mu \neq 410$
2. For this test, let the level of significance $\alpha = 0.05$
3. The sample mean $\bar{X} = 511.33$
4. The population mean $\mu = 410$
5. The sample standard deviation $S = 183.75$
6. The sample size $n = 18$
7. The test statistic
$$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{511.33 - 410}{183.75 / \sqrt{18}} = \frac{101.33}{43.3103} = 2.3396$$

Conclusion
This is a two-tailed test, so reject H0 if the absolute value of the test statistic is greater than the
critical value.
The critical value is $t_{0.025,17} = 2.1098$ (from the table).
Since the test statistic 2.3396 is greater than 2.1098, H0 can be rejected. We conclude that there
is enough evidence that the alternative H1 is true and that the auditor should proceed to examine all
charge account balances.


9.4 Hypothesis testing for proportion


Many of the principles applied in this section are not new, as they follow the same pattern as the
technique of hypothesis testing for the population mean.
Steps
1. State the null hypothesis H0 and the alternative H1.
2. Choose the level of significance ($\alpha$).
3. Determine the sample proportion $p = X / n$.
4. Determine the population proportion $\pi$.
5. Determine the sample size $n$.
6. Compute the test statistic for the proportion
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi (1 - \pi)}{n}}}$$
7. Make the statistical decision.

Activity 9.4
Question 1
A random sample of 200 observations shows that there are 36 successes. We want to test at the 1%
significance level whether the true proportion of successes in the population is less than 24%, and
made certain calculations.
Which one of the following statements is incorrect?
1. The value of $p$ is $36/200$.
2. The appropriate hypotheses are H0: $\pi = 0.24$ versus H1: $\pi < 0.24$.
3. The critical value of Z (from the table) is $Z < -Z_{0.01} = -2.33$.
4. The standard error associated with this test is 0.0302.
5. The test statistic is 1.99.

Solution
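1. Correct
2. Correct
3. Correct
4. Correct. The standard error is
$$\sqrt{\frac{\pi (1 - \pi)}{n}} = \sqrt{\frac{0.24\,(1 - 0.24)}{200}} = 0.0302$$
5. Incorrect. The test statistic is
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi (1 - \pi)}{n}}} = \frac{0.18 - 0.24}{0.0302} = \frac{-0.06}{0.0302} = -1.9868$$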

Question 2
If, in a random sample of 400 items, 164 are defective, what is the sample proportion of the defective
items?
Solution
The sample proportion is
$$p = \frac{X}{n} = \frac{164}{400} = 0.41$$

Question 3
Refer to question 2. Suppose you are testing the null hypothesis H0: $\pi = 0.40$ against H1: $\pi \neq 0.40$
and you choose the level of significance $\alpha = 0.05$. What is your statistical decision?
Solution
This is a two-tailed test.
The decision rule:
- Reject H0 when the absolute value of the test statistic is greater than the critical value at the
  specified significance level; otherwise do not reject H0.
- Equivalently, reject H0 when the p-value is less than the significance level.

The test statistic is
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi (1 - \pi)}{n}}} = \frac{0.41 - 0.40}{\sqrt{\dfrac{0.40\,(1 - 0.40)}{400}}} = \frac{0.01}{0.0245} = 0.4082$$

The critical value equals 1.96 (from the normal table).
Since 0.4082 is less than 1.96, we do not reject H0 at the 5% level of significance.
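The proportion test in Questions 2 and 3 can be sketched in Python as follows. The function name `z_test_proportion` is hypothetical, and `statistics.NormalDist().inv_cdf` is used here in place of the normal table to obtain the two-tailed critical value.

```python
from math import sqrt
from statistics import NormalDist


def z_test_proportion(x, n, pi0):
    """Return sample proportion, standard error under H0, and Z statistic."""
    p = x / n
    se = sqrt(pi0 * (1 - pi0) / n)  # uses the hypothesised proportion pi0
    z = (p - pi0) / se
    return p, se, z


# Questions 2 and 3: 164 defective items out of 400, H0: pi = 0.40
p, se, z = z_test_proportion(x=164, n=400, pi0=0.40)

alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed, about 1.96
reject_h0 = abs(z) > z_critical

print(f"p = {p:.2f}, SE = {se:.4f}, Z = {z:.4f}, reject H0: {reject_h0}")
```

Note that the standard error is built from the hypothesised proportion, not from the sample proportion, because the test statistic is computed under the assumption that H0 is true.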

SELF-ASSESSMENT: STUDY UNIT 9.2

Question 1
A machine is supposed to be adjusted to produce components to a dimension of 2.0 centimetres. In
a sample of 50 components, the mean was found to be 2.001 centimetres and the standard deviation
to be 0.003 centimetres. Is there evidence to suggest that the machine is set too high? Use $\alpha = 0.05$.
Question 2
The light bulbs in an industrial warehouse have been found to have a mean lifetime of 1030.0 hours,
with a standard deviation of 90.0 hours. The warehouse manager has been approached by a
representative of Extendabulb, a company that makes a device intended to increase bulb life. The
manager is concerned that the average lifetime of Extendabulb-equipped bulbs might not be any
greater than the 1030 hours historically experienced. In a subsequent test, the manager tests 40
bulbs equipped with the device and finds their mean life to be 1061.6 hours. Does Extendabulb
really work? Use $\alpha = 0.05$.

Question 3
For a simple random sample of 15 items from a population that is approximately normally distributed,
$\bar{X} = 82.0$ and $S = 20.5$. At the 0.01 level of significance, test H0: $\mu = 90$ versus H1: $\mu \neq 90$.

Question 4
The new director of a local YMCA has been told by his predecessors that the average member has
belonged for 8.7 years. Examining a random sample of 15 membership files, he finds the mean length
of membership to be 7.2 years, with a standard deviation of 2.5 years. Assuming the population is
approximately normally distributed, and using the 0.05 level, does this result suggest that the actual
mean length of membership may be some value other than 8.7 years?
Question 5
The career services director of Hobart University has said that 70% of the school's seniors enter the
job market in a position directly related to their undergraduate field of study. In a sample consisting
of 200 of the graduates from last year's class, 66% have entered jobs related to their field of study.
Make the related decision. Use $\alpha = 0.05$.

SOLUTIONS FOR SELF-ASSESSMENT: STUDY UNIT 9.2

Solution 1
Steps
1. The null hypothesis H0: $\mu = 2.0$ and the alternative H1: $\mu > 2.0$ (this is a one-tailed test)
2. The level of significance $\alpha = 0.05$
3. The sample mean $\bar{X} = 2.001$
4. The population mean $\mu = 2.0$
5. The population standard deviation $\sigma = 0.003$
6. The sample size $n = 50$
7. The test statistic
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{2.001 - 2.0}{0.003 / \sqrt{50}} = \frac{0.001}{0.00042} = 2.357$$
The critical value is equal to 1.645.
8. Decision
Since 2.357 > 1.645, H0 is rejected at the 5% level of significance.
The sample results suggest that the machine is set too high.
Solution 2
Steps
1. The null hypothesis H0: $\mu \le 1030.0$ and the alternative H1: $\mu > 1030.0$ (this is a one-tailed test)
2. The level of significance $\alpha = 0.05$
3. The sample mean $\bar{X} = 1061.6$
4. The population mean $\mu = 1030.0$
5. The population standard deviation $\sigma = 90$
6. The sample size $n = 40$
7. The test statistic
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{1061.6 - 1030.0}{90 / \sqrt{40}} = \frac{31.6}{14.23025} = 2.2206$$
8. The critical value is equal to 1.645.
9. Decision
Since 2.2206 > 1.645, H0 is rejected at the 5% level of significance.
The results suggest that Extendabulb does increase the mean lifetime of the bulbs. This firm may
wish to incorporate Extendabulb into its warehouse lighting system.
Solution 3
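Because the sample standard deviation S is given, the t distribution with $n - 1 = 14$ degrees of
freedom is used.
The test statistic
$$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{82.0 - 90}{20.5 / \sqrt{15}} = \frac{-8}{5.2931} = -1.5114$$
This is a two-tailed test, so the critical values are $\pm t_{0.005,14} = \pm 2.9768$.

Decision
Since $-1.5114$ lies between $-2.9768$ and $2.9768$, H0 is not rejected at the 1% level of significance.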
Solution 4
Steps
1. The null hypothesis H0: $\mu = 8.7$ versus the alternative H1: $\mu \neq 8.7$
2. The level of significance $\alpha = 0.05$
3. The sample mean $\bar{X} = 7.2$
4. The population mean $\mu = 8.7$
5. The sample standard deviation $S = 2.5$
6. The sample size $n = 15$ and the degrees of freedom $df = n - 1 = 15 - 1 = 14$
7. The test statistic
$$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{7.2 - 8.7}{2.5 / \sqrt{15}} = \frac{-1.5}{0.6455} = -2.3238$$
8. The critical values are $t = -2.145$ and $t = 2.145$.

Conclusion
Since the calculated test statistic falls in the rejection region ($-2.3238 < -2.145$), we reject H0.
At the 0.05 level, the results suggest that the actual mean length of membership may be some value
other than 8.7 years.
Solution 5
Steps
1. The value of $p$ is 0.66
2. The appropriate hypotheses are H0: $\pi = 0.70$ versus H1: $\pi \neq 0.70$
3. The critical values of Z are $-1.96$ and $1.96$
4. The test statistic is
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi (1 - \pi)}{n}}} = \frac{0.66 - 0.70}{\sqrt{\dfrac{0.70\,(1 - 0.70)}{200}}} = \frac{-0.04}{0.0324} = -1.2346$$

The test statistic value falls between the two critical values. The null hypothesis is not rejected.
We conclude that the proportion of graduates who enter the job market in careers related to their
field of study could indeed be equal to the claimed value of 0.70. This analysis would suggest that
the director's assertion not be challenged.


STUDY UNIT 10
STUDY CHAPTER 10
CHI-SQUARE DISTRIBUTION

Key questions for this unit


What is meant by the following concepts: Chi-square, test of
independence, observed and expected frequencies, critical area,
contingency table, degrees of freedom?
What are the steps to test whether two nominal variables are
related?
Under what conditions should you use the $\chi^2$ test of independence?

10.1 Introduction to this study unit


This study unit focuses on hypothesis testing for two nominal variables having two or more categories.
Chi-square analysis is used to test whether there is a relationship between these variables.
In this study unit we are going to consider only section 11.3 in the prescribed textbook. You may
read the rest of the chapter if you are interested, but you will not be examined on that knowledge.

10.2 Basic concepts in Chi-square testing


You have to understand the basic concepts in Chi-square testing, and be able to test for the
independence of two variables.

Going through the general characteristics of the Chi-square distribution, you should understand that
the $\chi^2$-distribution is
- continuous
- the sampling distribution of $(n - 1)\,s^2 / \sigma^2$
- always positive because $(n - 1)$, $s^2$ and $\sigma^2$ are all always positive
- a family of distributions, where every family member is determined by its number of degrees of
  freedom ($df$)
- skewed, but as the $df$ increases, the form of the $\chi^2$-distribution becomes more and more like the
  bell shape of the normal distribution
- tabulated and mostly used for right-tailed areas
- a reflection of the extent to which a table of observed frequencies differs from one constructed
  under the assumption that the particular null hypothesis is true


10.3 Testing for independence of two variables


This is a special technique used to test whether or not two nominal variables could be independent
of each other.
The test procedure involves the following steps:
- Start with a contingency table of observed frequencies reflecting the intersection of the various
  categories of the two variables.
- Formulate the null and alternative hypotheses.
- Construct tables of observed and expected frequencies.
- Calculate the value of the $\chi^2$ test statistic.
- Identify the critical value of the $\chi^2$ statistic at the significance level specified for the particular
  question.
- Draw a conclusion regarding the statements in the hypotheses after comparison of the critical
  value and the value of the test statistic.

Summary of the Chi-square test for independence:

Test for independence of two variables

H0: Variables are independent of each other
H1: Variables are not independent of each other
$df = (r - 1)(c - 1)$

Expected frequency per cell, namely $f_e$:
$$f_e = \frac{\text{row total} \times \text{column total}}{n}$$
Test statistic:
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$

NB: Use Table E.4 for $\chi^2$-testing


Activity 10.1: Concepts (Conceptual skill, Communication skill)

Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline; can
you think of some examples?
...................................................................................................
English term            Description            Term in your home language
Chi-square
Test of independence
Critical value
Degrees of freedom
Contingency table
Observed frequency
Expected frequency

Question 1
Do questions 11.20 and 11.21 in a Prescribed textbook. Please try them yourselves before looking
at the solutions!
Question 2
The quality manager in a tyre manufacturing plant in Port Elizabeth wants to test whether the nature of
defects found in manufactured tyres depends upon the shift during which the defective tyres were
produced. Formulate the hypotheses of this test.
Question 3
A large carpet store wishes to determine if the brand of carpet purchased is related to the purchaser's
family income. As a sampling frame, they mailed a survey to people who have a store credit card.
Five hundred customers returned the survey and the results follow:

                   Brand of Carpet
Family Income      Brand A    Brand B    Brand C
High Income           65         32         32
Middle Income         80         68        104
Low Income            25         35         59

The statements below refer to a test conducted on the data above to determine if the brand of carpet
purchased is related to the purchaser's family income at the 5% level of significance.
Select the incorrect statement.
1. H0: Family income and brand of carpet are independent and
H1: Family income and brand of carpet are dependent.

2. Rejection region: reject H0 if calculated $\chi^2 > \chi^2_{0.05,4} = 9.488$.

3. The estimated frequencies are as follows:

                   Brand of Carpet
Family Income      Brand A    Brand B    Brand C
High Income         43.86      34.83      40.31
Middle Income       85.68      68.04      98.28
Low Income          24.46      32.13      46.41

4. The calculated $\chi^2$ value is 27.372.

5. We can conclude that the brand of carpet purchased is related to the purchaser's family income.
...................................................................................................
Feedback to Activity 10.1 (Application skills)

Question 1
11.20  12
11.21  a. 15.507
       b. 20.090
       c. 23.209
       d. 34.805
       e. 34.805

Question 2
H0: The nature of defects found in manufactured tyres and the shift during which they were
produced are independent.
H1: The nature of defects found in manufactured tyres and the shift during which they were produced
are related (dependent).


Question 3
1. Correct
2. Correct
3. Incorrect
The error lies in the table of expected frequencies.
The calculation for each cell is:
$$f_e = \frac{\text{row total} \times \text{column total}}{n}$$
E.g., using the observed frequencies

                   Brand of Carpet
Family Income      Brand A    Brand B    Brand C
High Income           65         32         32
Middle Income         80         68        104
Low Income            25         35         59

the expected frequency for row 1, column 1 is
$$f_e = \frac{129 \times 170}{500} = 43.86$$

Two values were calculated incorrectly: the expected frequency for Low Income, Brand A (40.46, not
24.46) and for High Income, Brand C (50.31, not 40.31).
The table should be as follows:

                   Brand of Carpet
Family Income      Brand A    Brand B    Brand C
High Income         43.86      34.83      50.31
Middle Income       85.68      68.04      98.28
Low Income          40.46      32.13      46.41


4. Because the calculation of the $\chi^2$ value is very tedious and you may not know where you made a
calculation error, I will show you the manual calculation of this value:

$f_0$     $f_e$      $(f_0 - f_e)^2 / f_e$
65        43.86      10.1892
32        34.83       0.2299
32        50.31       6.6638
80        85.68       0.3765
68        68.04       0
104       98.28       0.3327
25        40.46       5.9074
35        32.13       0.2564
59        46.41       3.4154

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} = 27.372$$

5. Because $27.372 > \chi^2_{0.05,4} = 9.488$, we can reject H0 and there is a significant relationship
between the brand of the carpet and family income.
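Since the manual calculation is tedious, the expected frequencies and the $\chi^2$ statistic for the carpet data can be checked with a short Python sketch. The function names `expected_frequencies` and `chi_square_statistic` are chosen for this illustration only, and the critical value is typed in from Table E.4 rather than computed.

```python
# Observed frequencies: rows = income (High, Middle, Low), columns = Brand A, B, C
observed = [
    [65, 32, 32],
    [80, 68, 104],
    [25, 35, 59],
]


def expected_frequencies(table):
    """fe = (row total * column total) / n for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum of (f0 - fe)^2 / fe over all cells."""
    expected = expected_frequencies(table)
    return sum(
        (f0 - fe) ** 2 / fe
        for obs_row, exp_row in zip(table, expected)
        for f0, fe in zip(obs_row, exp_row)
    )


expected = expected_frequencies(observed)
chi2 = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)

chi2_critical = 9.488  # assumption: Table E.4 value for alpha = 0.05, df = 4
reject_h0 = chi2 > chi2_critical

for row in expected:
    print([round(fe, 2) for fe in row])
print(f"chi-square = {chi2:.3f}, df = {df}, reject H0: {reject_h0}")
```

The software keeps all decimals in every cell, so its total can differ from a hand calculation in the third decimal place.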

10.4 Summary: objectives of study unit 10


Once you have familiarized yourself with this study unit you should be able to
- understand the nature and procedures involved in chi-square testing in general
- calculate the expected frequencies for the chi-square test of independence discussed in this study
  unit
- determine the value of the $\chi^2$ test statistic
- apply the chi-square distribution to test if two nominal variables are independent or not


STUDY UNIT 11
STUDY CHAPTER 11
REGRESSION AND CORRELATION ANALYSIS

Key questions for this unit


What is meant by the following concepts: Regression and
correlation analysis, scatterplot, least square method,
regression coefficients, dependent and independent variables,
slope, y intercept, interpolation, extrapolation, coefficients of
correlation and determination?
How would you calculate and interpret the regression coefficients?
How would you estimate Y variable using X variable?
How do you calculate and interpret coefficients of correlation and
determination?

11.1 Introduction to this study unit


Simple linear regression
It enables us to develop a model for the prediction of a numerical variable based on the value of
another variable. The variable we wish to predict is called the dependent variable (y) and the one used
for the prediction is called the independent variable (x). In simpler terms, simple linear regression
describes and evaluates the relationship between two variables.
In this study unit we are only going to consider sections 12.2 and 12.3 in the prescribed textbook. You
may read the rest of the chapter if you are interested, but you will not be examined on that knowledge.
Activity: Concepts (Conceptual skill, Communication skill)

Test your own knowledge (write in pencil) and then correct your understanding afterwards (erase and
write the correct description). Often a young language may not have all the terms in a discipline; can
you think of some examples?
...................................................................................................
English term                    Description            Term in your home language
Regression analysis
Correlation
Scatterplot
Least square method
Regression coefficients
Dependent variable
Independent variable
Slope
Y-intercept
Interpolation
Extrapolation
Coefficient of correlation
Coefficient of determination

11.2 The simple linear regression line


If you understand the basics of the straight line equation from school, you should have no problem
understanding the concept of a linear regression line.
Suppose that you have a data set of paired observations, i.e. two observations have been recorded
for each object under study. If the objects under study were people, and one set of observations
indicates their lengths and the other set their mass, then the paired observation per person would be
the combined pair (length; mass). Take a piece of graph paper, draw two perpendicular
axes for length and mass respectively, and record the combined pair per person in this two-dimensional
plane. You have now made what statisticians call a scatterplot of the values.
Looking at a scatterplot, the question is: what can you deduce? Remember that you are a scientist
and you want proper backing for what you say! This is when you use a simple linear regression
model, which is scientifically acceptable. With this regression line you may even make predictions
for the one variable based on values of the other. This means that if you consider the length of the
person as the independent variable, you can use the regression line equation to predict the mass of
a person once you know what his/her length is. Never lose perspective: you cannot say that the
answer in such a case is absolutely correct. As a statistician you are only estimating the value of
what can be expected at some level of certainty (never 100%!).

The idea of a simple linear regression model, how to determine the equation, as well as the principle of
the least-squares criterion are explained in detail in the prescribed textbook. Make sure that you
understand the following:
- We are only considering linear relations, meaning that we only fit straight lines through the data.
- There is a difference between an observed value of a variable $y_i$ and the estimated value $\hat{y}_i$
  of the same value. The difference between these two is called the error.
- The regression line always passes through the point $(\bar{X}, \bar{Y})$. Stated differently, the means of
  the two variables, given as a pair, are always a pair of coordinates on the regression line.
- In the general form of the regression line $Y = b_0 + b_1 X$, the $b_0$ and $b_1$ are only symbols and will
  be substituted by numbers in the calculated equation for a specific data set. In school you most
  probably used the form $Y = mX + c$ for the straight line equation and learnt that $m$ indicated the
  slope and $c$ the Y-intercept. Compare what you learnt in school with this new form and you will
  see that the Y-intercept is given by the number without an x, namely $b_0$, and that the value with
  the x, namely $b_1$ (also called the coefficient of X), represents the slope of the straight line.
- To calculate the least squares regression line manually takes a lot of time and is quite tedious.
  Still, it is a necessary exercise for you at this stage. There will be enough time for you later in your
  life to use software and simply interpret printouts.
- Make sure that you understand the meaning of the required assumptions for the linear regression
  model.
...................................................................................................
Activity 11.1
Question 1
Do questions 12.1 to 12.3 in the prescribed textbook.
Question 2
Consider the following data values of variables x and y:

X     5     4     3     6     9     8     10
Y     7     8     10    5     2     3     1

The regression coefficients were calculated as $b_0 = 13.223$ and $b_1 = -1.257$.

Select the correct statement.
1. The relationship between x and y appears to be linear and positive.
2. The least squares regression line is $\hat{y} = 13.223 - 1.257x$.
3. The least squares regression line is $\hat{y} = -1.257 + 13.223x$.
4. If $x = 2$, the estimated value of y from the relevant regression line is 25.189.
5. An x-value of 11 resulted in an estimated value of $-1.861$.

11.3 Introduction to correlation analysis

Correlation analysis takes data analysis a little further. In regression analysis the relationship
between two interval or ratio scale variables is expressed in terms of a least-squares regression
line, while correlation analysis can measure the strength and the nature of the relationship between
the variables. We will discuss both the coefficient of correlation and that of determination.
The coefficient of correlation
This concept should not be new to you. In study unit 3, section 3.3, in the discussion of measures of
association, you learnt about the coefficient of correlation.
You should know that the correlation coefficient r is such that
- $-1 \le r \le 1$
- if $r > 0$ (positive), the two variables will either both increase or both decrease
- if $r < 0$ (negative), the one variable will increase when the other variable decreases
- the strength of the relationship depends on the actual value of r (if r is close to $+1$ or close to
  $-1$, the relationship is strong)
- the strength of the relationship is weaker the closer the value of r (positive or negative) is to zero

As in the case of the regression line, a lot of calculations are needed if you want to determine the
value of r manually.
The coefficient of determination
This is simply the square of the correlation coefficient r, namely $r^2$. The value of
$r^2$ indicates the proportion of the variation in y that is explained by the regression line $Y = b_0 + b_1 X$.
That is all you have to know about this coefficient. (See the first paragraph under the heading: The
coefficient of determination.)
...................................................................................................

Activity 11.2
Question 1
Do questions 12.11 to 12.15 in the prescribed textbook.
Question 2
The statements in this question are based on the following data.

X       Y
2.6     5.6
2.6     5.1
3.2     5.4
3.0     5.0
2.4     4.0
3.7     5.0
3.7     5.2
$\sum X = 21.2$     $\sum Y = 35.3$

The correlation coefficient r was calculated as 0.327.

Identify the incorrect statement:
1. There is a positive relationship between x and y.
2. $\bar{y} = 5.043$
3. The coefficient of determination is 0.5719.
4. The regression coefficient $b_1$ is also positive.
5. Only 10.7% of the variation in y is explained by the variation in x.
...................................................................................................
Feedback to activities
Activity 11.1
Question 1
12.1 (a) When X = 0 then Y = 4
(b) When X increases by 1 unit, then the mean value of Y is estimated to increase by 8.
(c) Y = 20


12.2 (a) Yes


(b) No
(c) No
(d) Yes

12.3 (a) When X = 0, then Y = 24


(b) When X increases by 1 unit, then the mean value of Y is estimated to decrease by 0.8
(c) Y = 20

Question 2
Option 2
1. Incorrect. The relationship between x and y appears to be linear and negative (as the x-values
increase, the y-values decrease).
2. Correct. The least squares regression line is $\hat{y} = 13.223 - 1.257x$.
3. Incorrect.
4. Incorrect. The estimated value of y is 25.189 only when $x = 2$ is substituted into the equation
given in option 3. The correct answer is 10.709.
5. Incorrect. An x-value of 12 resulted in an estimated value of $-1.861$.
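The regression coefficients quoted in this question can be reproduced with a short Python sketch of the least-squares formulas $b_1 = \sum (x - \bar{x})(y - \bar{y}) / \sum (x - \bar{x})^2$ and $b_0 = \bar{y} - b_1 \bar{x}$. The function name `least_squares` is hypothetical; the prescribed textbook may present an equivalent computational form of these formulas.

```python
def least_squares(xs, ys):
    """Return (b0, b1) of the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # Y-intercept: line passes through (x_bar, y_bar)
    return b0, b1


xs = [5, 4, 3, 6, 9, 8, 10]
ys = [7, 8, 10, 5, 2, 3, 1]

b0, b1 = least_squares(xs, ys)
y_hat_at_2 = b0 + b1 * 2      # estimated y when x = 2

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, estimate at x = 2: {y_hat_at_2:.3f}")
```

The negative slope confirms that y decreases as x increases, which is why option 1 is incorrect.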


...................................................................................................
Activity 11.2
Question 1
12.11 $r^2 = 0.85$, which means that 85% of the variation in Y is explained by X.
12.12 SST = 40
$r^2 = 0.95$, which means that 95% of the variation in Y is explained by X.
12.13 $r^2 = \dfrac{SSR}{SST} = 0.7$
70% of the variation in Y is explained by X.

12.14 $r^2 = 0.8$; 80% of the variation in Y is explained by X.

12.15 The coefficient of determination ranges from 0 to 1. If SST = 140 and SSR = 150, $r^2$ will lie
beyond this range.

Question 2
1. Correct. Since $r > 0$.
2. Correct. $\bar{y} = 5.043$.
3. Incorrect. $r^2 = (0.327)^2 = 0.107$.
4. Correct.
5. Correct. The value of $r^2$ is $(0.327)^2 = 0.107$, so only 10.7% of the variation in y is explained by
the variation in x.
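For the data in this question, the coefficient of correlation and the coefficient of determination can be checked with the sketch below. The function name `correlation` is chosen for this illustration, and the formula used is the usual $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, which is assumed to match the textbook definition.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample coefficient of correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [2.6, 2.6, 3.2, 3.0, 2.4, 3.7, 3.7]
ys = [5.6, 5.1, 5.4, 5.0, 4.0, 5.0, 5.2]

r = correlation(xs, ys)
r_squared = r ** 2            # coefficient of determination
y_bar = sum(ys) / len(ys)     # mean of y

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}, mean of y = {y_bar:.3f}")
```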

11.4 Summary: objectives of study unit 11


Once you have familiarized yourself with this study unit you should be able to
give detailed descriptions of the individual terms in the simple regression line
interpret the value of the coefficient of correlation
interpret the coefficient of determination
