Introduction To Statistics 1662031282
INTRODUCTION TO STATISTICS
An Excel-Based Approach
VALERIE WATTS
Acknowledgements
About this Book
Changes From Previous Version
6.1 Introduction to Sampling Distributions and the Central Limit Theorem
6.2 Sampling Distribution of the Sample Mean
6.3 Sampling Distribution of the Sample Proportion
6.4 Exercises
References
Versioning History
ACKNOWLEDGEMENTS
This open textbook has been adapted by Dr. Valerie Watts in partnership with the OER Design
Studio and the Library Learning Commons at Fanshawe College in London, Ontario.
This work is part of the FanshaweOpen learning initiative and is made available through a
Creative Commons Attribution-ShareAlike 4.0 International License unless otherwise noted.
We would like to acknowledge and thank the following authors/entities who have graciously
made their work available for the remixing, reusing, and adapting of this text:
• Introductory Business Statistics by Alexander Holmes, Barbara Illowsky, Susan Dean, and
OpenStax is licensed under a Creative Commons Attribution 4.0 License.
• Introductory Statistics by Barbara Illowsky, Susan Dean, and OpenStax is licensed under a
Creative Commons Attribution 4.0 License.
Collaborators
This project was a collaboration between the author and the team in the OER Design Studio at
Fanshawe. The following staff and students were involved in the creation of this project:
• Chapter 2: Descriptive Statistics. Chapter 2 examines the different descriptive statistics, both
graphical and numerical, required to organize, summarize, and describe data.
• Chapter 5: Continuous Random Variables and the Normal Distribution. Chapter 5 covers continuous
random variables, focusing on the normal distribution and probability problems associated with
the normal distribution.
• Chapter 8: Hypothesis Tests for Single Population Parameters. Chapter 8 introduces the formal
hypothesis testing procedure, focusing on conducting and drawing a conclusion from a hypothesis
test on a population mean or a population proportion.
• Chapter 10: Statistical Inference Using the χ²-Distribution. Chapter 10 covers the use of the
χ²-distribution in statistical inference, including confidence intervals and hypothesis testing for
a population variance, the goodness-of-fit test, and the test of independence.
Each sub-chapter in this text begins with a list of relevant learning objectives and concludes with
a concept review that highlights the key topics in the sub-chapter. Where appropriate, videos are
included to review, enhance, and extend the material covered in the text. At the end of each
chapter, a series of exercises is included to check retention and assess understanding.
Accessibility Statement
We are actively committed to increasing the accessibility and usability of the textbooks we
produce. Every attempt has been made to make this OER accessible to all learners and
compatible with assistive and adaptive technologies. We have attempted to provide closed
captions, alternative text, or multiple formats for on-screen and off-line access.
The web version of this resource has been designed to meet Web Content Accessibility Guidelines
2.0, level AA. In addition, it follows all guidelines in Appendix A: Checklist for Accessibility of
the Accessibility Toolkit – 2nd Edition.
In addition to the web version, additional files are available in a number of file formats including
PDF, EPUB (for eReaders), and MOBI (for Kindles).
If you are having problems accessing this resource, please contact us at [email protected].
Feedback
This book is an adaptation of Introductory Statistics by OpenStax, licensed under a Creative
Commons Attribution 4.0 license. Additional content was incorporated from Introductory
Business Statistics by OpenStax, licensed under a Creative Commons Attribution 4.0 license, and
other open materials.
The following is a summary of changes that were made in this version:
XVI | CHANGES FROM PREVIOUS VERSION
Chapter 1: Sampling and Data
• Added definition of cumulative frequency and added cumulative frequency to example.
Chapter 13: Multiple Regression
• Created this new chapter.
PART I
SAMPLING AND DATA
Chapter Outline
We encounter statistics in our daily lives more often than we probably realize and from many different
sources, like the news. Photo by David Sim, CC BY 4.0.
You are probably asking yourself the question, “When and where will I use statistics?” If you read
any newspaper, watch television, or use the Internet, you will see statistical information. There are
statistics about crime, sports, education, politics, and real estate, just to mention a few. Typically,
when you read a newspaper article or watch a television news program, you are given sample
information. With this information, you may make a decision about the correctness of a statement,
claim, or “fact.” Statistical methods can help you make the “best educated guess.”
Because you will undoubtedly be given statistical information at some point in your life, you
need to know some techniques for analyzing the information thoughtfully. Think about buying
a house or managing a budget. Think about your chosen profession. The fields of economics,
business, psychology, education, biology, law, computer science, police science, and early childhood
development require at least one course in statistics.
Included in this chapter are the basic ideas and words of probability and statistics. You will soon
understand that statistics and probability work together. You will also learn how data are gathered
and how “good” data can be distinguished from “bad.”
4 | 1.1 INTRODUCTION TO SAMPLING AND DATA
Attribution
1.2 DEFINITIONS OF STATISTICS, PROBABILITY, AND KEY TERMS
LEARNING OBJECTIVES
The science of statistics deals with the collection, analysis, interpretation, and presentation of
data. We see and use data in our everyday lives. In this course, we will learn how to organize and
summarize data. Organizing and summarizing data is called descriptive statistics. Two
ways to summarize data are by graphing and by using numbers (for example, finding an average).
After we have studied probability and probability distributions, we will use formal methods for
drawing conclusions from “good” data. These formal methods are called inferential statistics.
Statistical inference uses probability to determine how confident we can be that our conclusions are
correct.
Effective interpretation of data (inference) is based on good procedures for producing data
and thoughtful examination of the data. You will encounter what will seem to be too many
mathematical formulas for interpreting data. The goal of statistics is not to perform numerous
calculations using the formulas, but to gain an understanding of the data. The calculations can
be done using a calculator or a computer. The understanding must come from you. If you can
thoroughly grasp the basics of statistics, you can be more confident in the decisions you make in
life.
Probability
Probability is a mathematical tool used to study randomness. It deals with the chance (or
likelihood) of an event occurring. For example, if we toss a fair coin four times, the outcomes may
not be two heads and two tails. However, if we toss the same coin 4,000 times, the outcomes will
be close to half heads and half tails. The expected theoretical probability of heads in any one toss
is 50%. Even though the outcomes of a few repetitions are uncertain, there is a regular pattern
of outcomes when there are many repetitions. After reading about the English statistician Karl
Pearson who tossed a coin 24,000 times with a result of 12,012 heads, one of the authors tossed
a coin 2,000 times. The results were 996 heads, or 49.8% heads. This is very close to the expected
probability of 50%.
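The contrast between a few tosses and many tosses can be illustrated with a short simulation. The sketch below is not part of the original text: it uses Python's standard random module rather than Excel, and the function name proportion_of_heads and the seed value are assumptions chosen only for illustration.

```python
import random

def proportion_of_heads(num_tosses, seed=None):
    """Simulate tossing a fair coin and return the proportion of heads."""
    rng = random.Random(seed)  # separate generator so a given seed is reproducible
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses

# A few tosses can land far from 50% heads; many tosses tend to land close to it.
for n in (4, 4000):
    print(n, "tosses:", proportion_of_heads(n, seed=1))
```

Running the loop with different seeds should show the four-toss proportion jumping between values such as 0.25 and 0.75, while the 4,000-toss proportion stays near 0.5.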
The theory of probability began with the study of games of chance such as poker. Predictions
take the form of probabilities. To predict the likelihood of an earthquake, of rain, or whether you
will get an A in a particular course, we use probabilities. Doctors use probability to determine the
chance of a vaccination causing the disease the vaccination is supposed to prevent. A stockbroker
uses probability to determine the rate of return on a client’s investments. You might use probability
to decide whether or not to buy a lottery ticket. In the study of statistics, we use the power of
mathematics through probability calculations to analyze and interpret the data.
Key Terms
sample must contain the characteristics of the population in order to be a representative sample.
We are interested in both the sample statistic and the population parameter in inferential statistics.
In a later chapter, we will use the sample statistic to test the validity of the established population
parameter.
A variable, notated by capital letters such as X and Y, is a characteristic of interest for each
person or thing in a population. Variables may be numerical or categorical. Numerical variables
take on values with equal units such as weight in pounds and time in hours. Categorical
variables place the person or thing into a category. If we let X equal the number of points earned
by one math student at the end of a term, then X is a numerical variable. If we let Y be a person’s
party affiliation, then some examples of Y include Republican, Democrat, and Independent. In
this case, Y is a categorical variable. We could do some math with values of X (calculate the
average number of points earned, for example), but it makes no sense to do math with values of
Y (calculating an average party affiliation makes no sense). Data are the actual values of the variable.
They may be numbers or they may be words. Datum is a single value.
Two words that come up often in statistics are mean and proportion. If you take three exams
in your math classes and obtain scores of 86, 75, and 92, you would calculate your mean score
by adding the three exam scores and dividing by three (your mean score would be 84.3 to one
decimal place). If, in your math class, there are 40 students and 22 are men and 18 are women, then
the proportion of men students is 22/40 and the proportion of women students is 18/40. Mean and
NOTE
The words mean and average are often used interchangeably. The substitution of one word for
the other is common practice. The technical term for mean is “arithmetic mean,” and “average”
is technically a center location. However, in practice among non-statisticians, “average” is
commonly accepted for “arithmetic mean.”
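The mean and proportion calculations from the exam-score and class-composition examples above can be checked with a few lines of code. This is a minimal sketch in Python (the text itself uses Excel); the variable names are assumptions introduced only for this illustration.

```python
# Exam scores from the example above.
scores = [86, 75, 92]
mean_score = sum(scores) / len(scores)   # add the scores, divide by how many there are
print(round(mean_score, 1))              # 84.3

# Class of 40 students: 22 men and 18 women.
men, women = 22, 18
total = men + women
prop_men = men / total                   # 22/40
prop_women = women / total               # 18/40
print(prop_men, prop_women)              # 0.55 0.45
```

The two proportions add to 1 because every student in the class falls into exactly one of the two groups counted here.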
EXAMPLE
Determine what the key terms refer to in the following study. We want to know the average (mean)
amount of money first year college students spend at ABC College on school supplies that do not
include books. We randomly surveyed 100 first-year students at the college. Three of those students
spent $150, $200, and $225, respectively.
Solution:
• The population is all first year students attending ABC College this term.
• The sample could be all students enrolled in one section of a beginning statistics course at ABC College
(although this sample may not represent the entire population).
• The parameter is the average (mean) amount of money spent (excluding books) by first year college
students at ABC College this term (the population mean).
• The statistic is the average (mean) amount of money spent (excluding books) by first year college students
in the sample (the sample mean).
• The variable could be the amount of money spent (excluding books) by one first year student. Let X be
the amount of money spent (excluding books) by one first year student attending ABC College.
• The data are the dollar amounts spent by the first year students. Examples of the data are $150, $200, and
$225.
TRY IT
Determine what the key terms refer to in the following study. We want to know the average (mean)
amount of money spent on school uniforms each year by families with children at Knoll Academy.
We randomly survey 100 families with children in the school. Three of the families spent $65, $75,
and $95, respectively.
TRY IT
Determine what the key terms refer to in the following study. A study was conducted at a local
college to analyze the average cumulative GPA’s of students who graduated last year. Fill in the letter
of the phrase that best describes each of the items below.
1. f 3. e 5. b
2. g 4. d 6. c
EXAMPLE
Determine what the key terms refer to in the following study. As part of a study designed to test the
safety of automobiles, the National Transportation Safety Board collected and reviewed data about
the effects of an automobile crash on test dummies. Here is the criterion they used:
Cars with dummies in the front seats were crashed into a wall at a speed of 56 kilometers per hour.
We want to know the proportion of dummies in the driver’s seat that would have had head injuries, if
they had been actual drivers. We start with a simple random sample of 75 cars.
Solution:
• The data are either: yes, had head injury, or no, did not.
EXAMPLE
Determine what the key terms refer to in the following study. An insurance company would like to
determine the proportion of all medical doctors who have been involved in one or more malpractice
lawsuits. The company selects 500 doctors at random from a professional directory and determines
the number in the sample who have been involved in a malpractice lawsuit.
Solution:
Attribution
“1.1 Definitions of Statistics, Probability, and Key Terms” in Introductory Statistics by OpenStax is
licensed under a Creative Commons Attribution 4.0 International License.
1.3 SAMPLING AND DATA
LEARNING OBJECTIVES
Data may come from a population or from a sample. Generally, small letters like x or y are
used to represent data values. Most data can be put into one of two categories: qualitative or
quantitative.
Qualitative data are the result of categorizing or describing attributes of a population.
Qualitative data are also called categorical data. Hair color, blood type, ethnic group, the car a
person drives, and the street a person lives on are examples of qualitative data. Qualitative data are
generally described by words or letters. For instance, hair color might be black, dark brown, light
brown, blonde, gray, or red. Blood type might be AB+, O-, or B+. Researchers often prefer to use
quantitative data over qualitative data because it lends itself more easily to mathematical analysis.
For example, it does not make sense to find an average hair color or blood type.
Quantitative data are always numbers. Quantitative data are the result
of counting or measuring attributes of a population. Amount of money, pulse rate, weight, the
number of people living in your town, and the number of students who take statistics are examples
of quantitative data. Quantitative data may be either discrete or continuous.
All data that are the result of counting are called quantitative discrete data. These data take
on only certain numerical values. If you count the number of phone calls you receive for each day
of the week, you might get values such as zero, one, two, or three.
All data that are the result of measuring are quantitative continuous data, assuming that
we can measure accurately. Measuring angles in radians might result in such numbers as
π/6, π/3, π/2, π, 3π/4, and so on. If you and your friends carry backpacks with books in them to
school, the numbers of books in the backpacks are discrete data and the weights of the backpacks
are continuous data.
EXAMPLE
The data are the number of books students carry in their backpacks. You sample five students. Two
students carry three books, one student carries four books, one student carries two books, and one
student carries one book. The numbers of books (three, four, two, and one) are quantitative discrete
data.
TRY IT
The data are the number of machines in a gym. You sample five gyms. One gym has 12 machines,
one gym has 15 machines, one gym has ten machines, one gym has 22 machines, and the other gym
has 20 machines. What type of data is this?
EXAMPLE
The data are the weights of backpacks with books in them. You sample the same five students. The
weights (in pounds) of their backpacks are 6.2, 7, 6.8, 9.1, and 4.3. Weights are quantitative
continuous data.
TRY IT
The data are the areas of lawns in square feet. You sample five houses. The areas of the lawns are
144 sq. feet, 160 sq. feet, 190 sq. feet, 180 sq. feet, and 210 sq. feet. What type of data is this?
Sampling
Gathering information about an entire population often costs too much, is too time consuming,
or is virtually impossible. Instead, we use a sample of the population. In order to get accurate
conclusions about the population from the sample, a sample should have the same
characteristics as the population it represents. Most statisticians use various methods of
random sampling in an attempt to achieve this goal. This section will describe a few of the most
common methods.
There are several different methods of random sampling. In each form of random sampling,
each member of a population initially has an equal chance of being selected for the sample. Each
method has pros and cons. The easiest method to describe is called a simple random sample.
Any group of individuals is equally likely to be chosen as any other group of individuals if the
simple random sampling technique is used. In other words, each sample of the same size has an
equal chance of being selected. For example, suppose Lisa wants to form a four-person study group
(herself and three other people) from her pre-calculus class, which has 31 members not including
Lisa. To choose a simple random sample of size three from the other members of her class, Lisa
could put all 31 names in a hat, shake the hat, close her eyes, and pick out three names. A more
technological way is for Lisa to first list the last names of the members of her class together with a
two-digit number, as in the following table.
10 Khan
Lisa can use a table of random numbers (found in many statistics books and mathematical
handbooks), a calculator, or a computer to generate random numbers. For this example, suppose
Lisa chooses to generate random numbers from a calculator. The numbers generated are as follows:
Lisa reads two-digit groups from these random numbers until she has chosen three class
members (that is, she reads 0.94360 as groups 94, 43, 36, 60). Each random number may only
contribute one class member. If she needed to, Lisa could have generated more random numbers.
The random numbers 0.94360 and 0.99832 do not contain appropriate two digit numbers.
However, the third random number, 0.14669, contains 14 (the fourth random number also contains
14), the fifth random number contains 05, and the seventh random number contains 04. The two
favor a certain group. It is better for the person conducting the survey to select the sample
respondents.
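Returning to Lisa's study group, a computer can draw the simple random sample directly instead of reading two-digit groups from random decimals. The sketch below is an illustration in Python, not the method used in the text: the two-digit IDs "00" through "30" stand in for the 31 classmates (the class list itself is an assumption here), and simple_random_sample is a hypothetical helper name.

```python
import random

def simple_random_sample(ids, size, seed=None):
    """Draw a simple random sample without replacement.

    Every group of `size` members is equally likely to be chosen.
    """
    rng = random.Random(seed)
    return rng.sample(ids, size)

# Hypothetical two-digit IDs standing in for Lisa's 31 classmates.
class_ids = [f"{i:02d}" for i in range(31)]

study_group = simple_random_sample(class_ids, 3, seed=2024)
print(study_group)  # three distinct IDs; the exact IDs depend on the seed
```

Because the draw is made without replacement, no classmate can be picked twice, which matches drawing three names from a hat.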
True random sampling is done with replacement. That is, once a member is picked, that
member goes back into the population and thus may be chosen more than once. However, for
practical reasons, in most populations, simple random sampling is done without replacement,
where a member of the population may only be chosen once. Surveys are typically done without
replacement. Most samples are taken from large populations and the sample tends to be small in
comparison to the population. Consequently, sampling without replacement is approximately the
same as sampling with replacement because the chance of picking the same individual more than
once with replacement is very low.
In a college population of 10,000 people, suppose we want to pick a sample of 1,000 randomly for
a survey. For any particular sample of 1,000, if we are sampling with replacement:
• the chance of picking the first person is 1,000 out of 10,000 (0.1000);
• the chance of picking a different second person for this sample is 999 out of 10,000 (0.0999);
• the chance of picking the same person again is 1 out of 10,000 (very low).
If we are sampling without replacement:
• the chance of picking the first person for any particular sample is 1,000 out of 10,000 (0.1000);
• the chance of picking a different second person is 999 out of 9,999 (0.0999);
• you do not replace the first person before picking the next person.
Comparing the fractions 999/10,000 and 999/9,999 to four decimal places, these numbers are
equivalent. So we can see that the chance of selecting a small sample from a large population is
basically the same, whether or not the sampling is done with replacement.
Sampling without replacement instead of sampling with replacement becomes a mathematical
issue only when the population is small. For example, if the population is 25 people, the sample
is ten, and we are sampling with replacement for any particular sample, then the chance of
picking the first person is 10 out of 25, and the chance of picking a different second person is 9 out
of 25 (we replace the first person). If we sample without replacement, then the chance of picking
the first person is still 10 out of 25 but the chance of picking the second person (who is different)
is 9 out of 24. Comparing the fractions 9/25 = 0.3600 and 9/24 = 0.3750, these numbers are not
equivalent.
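The two comparisons above (a large population of 10,000 and a small population of 25) can be reproduced with exact fractions. This is a minimal sketch in Python that follows the text's own reasoning about the chance of picking a different second person; the function name second_pick_chance is an assumption made for this illustration.

```python
from fractions import Fraction

def second_pick_chance(population, sample_size, with_replacement):
    """Chance of picking a different second person for a particular sample."""
    remaining_slots = sample_size - 1          # one sample member is already picked
    pool = population if with_replacement else population - 1
    return Fraction(remaining_slots, pool)

for population, sample_size in ((10_000, 1_000), (25, 10)):
    with_rep = second_pick_chance(population, sample_size, True)
    without_rep = second_pick_chance(population, sample_size, False)
    print(population, f"{float(with_rep):.4f}", f"{float(without_rep):.4f}")
# 10000 0.0999 0.0999
# 25 0.3600 0.3750
```

For the large population the two chances agree to four decimal places, while for the small population they differ in the second decimal place.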
When we analyze data, it is important to be aware of sampling errors and non-sampling
errors. The actual process of sampling causes sampling error, which is the difference between the
actual population parameter and the corresponding sample statistic. For example, the sample may
not be large enough. Factors not related to the sampling process cause non-sampling errors. For
example, a defective counting device can cause a non-sampling error. In reality, a sample will never
be exactly representative of the population so there will always be some sampling error. As a rule,
the larger the sample, the smaller the sampling error.
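The rule that larger samples tend to have smaller sampling error can be explored with a simulation. The sketch below is an illustration only: the population is made-up data (10,000 values between 0 and 100), the sample sizes and number of repetitions are arbitrary choices, and average_sampling_error is a hypothetical helper name.

```python
import random
import statistics

def average_sampling_error(population, sample_size, repetitions, rng):
    """Average absolute gap between the sample mean and the population mean."""
    mu = statistics.fmean(population)               # population parameter
    errors = []
    for _ in range(repetitions):
        sample = rng.sample(population, sample_size)  # sampling without replacement
        errors.append(abs(statistics.fmean(sample) - mu))
    return statistics.fmean(errors)

rng = random.Random(0)
# Hypothetical population: 10,000 made-up values between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(10_000)]

small = average_sampling_error(population, 10, 200, rng)
large = average_sampling_error(population, 1_000, 200, rng)
print(f"n = 10:   average error {small:.2f}")
print(f"n = 1000: average error {large:.2f}")
```

Averaging over many repeated samples matters here: a single small sample can land close to the population mean by chance, but on average the small samples miss by more than the large ones.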
In statistics, a sampling bias is created when a sample is collected from a population and some
members of the population are not as likely to be chosen as others (remember, each member of the
population should have an equally likely chance of being chosen). When a sampling bias happens,
there can be incorrect conclusions drawn about the population that is being studied.
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=21#oembed-1
Watch this video: Statistics: Sources of Bias by Mathispower4u [4:43] (transcript available).
Attribution
“1.2 Data, Sampling, and Variation in Data and Sampling” in Introductory Statistics by OpenStax is
licensed under a Creative Commons Attribution 4.0 International License.
1.4 FREQUENCY, FREQUENCY TABLES, AND
LEVELS OF MEASUREMENT
LEARNING OBJECTIVES
Once we have a set of data, we need to organize it so that we can analyze how frequently each
datum occurs in the set. However, when calculating the frequency, we may need to round our
answers so that they are as precise as possible.
A simple way to round off answers is to carry the final answer to one more decimal place
than was present in the original data. Round off only the final answer. Do not round off any
intermediate results, if possible. If it becomes necessary to round off intermediate results, carry
them to at least twice as many decimal places as the final answer. For example, the average of the
three quiz scores four, six, and nine is 6.3, rounded off to the nearest tenth because the data are
whole numbers. Most answers will be rounded off in this manner.
Levels of Measurement
The way a set of data is measured is called its level of measurement. Correct statistical
procedures depend on a researcher being familiar with levels of measurement. Not every statistical
operation can be applied to every set of data. In addition to being classified as quantitative or
qualitative, data is classified into four levels of measurement. They are (from lowest to highest
level):
• Nominal scale level
• Ordinal scale level
• Interval scale level
• Ratio scale level
Data that is measured using a nominal scale is data that can be placed into categories. Colors,
names, labels, favorite foods, and yes/no survey responses are examples of nominal level data.
Nominal scale data are not ordered, which means the categories of the data are not ordered. For
example, trying to “order” people according to their favorite food does not make any sense. Putting
pizza first and sushi second is not meaningful. Smartphone companies are another example of
nominal scale data. Some examples are Sony, Motorola, Nokia, Samsung, and Apple. This is just
a list of different brand names, and there is no agreed upon order for the categories. Some people
may prefer Apple but that is a matter of opinion. Because nominal data consists of categories,
nominal scale data cannot be used in calculations.
Data that is measured using an ordinal scale is similar to nominal scale data in that the data
can be placed into categories, but there is a big difference. The categories of ordinal scale data can
be ordered or ranked. An example of ordinal scale data is a list of the top five national parks in
the United States because the parks can be ranked from one to five. Another example of using the
ordinal scale is a cruise survey where the responses to questions about the cruise are “excellent,”
“good,” “satisfactory,” and “unsatisfactory.” These responses are ordered from the most desired
response to the least desired. In ordinal scale data, the differences between two pieces of data
cannot be measured or calculated. Similar to nominal scale data, ordinal scale data cannot be used
in calculations.
Data that is measured using an interval scale is similar to ordinal level data because it has
a definite ordering. However, the differences between interval scale data can be measured or
calculated, but the data does not have a starting point. Temperature scales like Celsius (C) and
Fahrenheit (F) are measured by using the interval scale. In both temperature measurements
(Celsius and Fahrenheit), 40° is equal to 100° minus 60°. The differences in temperature can be
measured and make sense. But there is no starting point to the temperature scales because 0° is not
the absolute lowest temperature. Temperatures like -10°F and -15°C exist, and are colder than 0°.
Interval level data can be used in calculations, but ratios do not make sense and cannot be done.
For example, 80°C is not four times as hot as 20°C (nor is 80°F four times as hot as 20°F). So there
is no meaning to the ratio of 80 to 20 (or four to one) in either temperature scale. In general, ratios
have no meaning in interval scale data.
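One way to see why the ratio of 80° to 20° carries no meaning on an interval scale is to express the same two temperatures in both scales. The sketch below is an illustration in Python using the standard Celsius-to-Fahrenheit conversion; the function and variable names are assumptions made for this example.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion: multiply by 9/5 and add 32."""
    return c * 9 / 5 + 32

hot_c, cool_c = 80, 20
hot_f, cool_f = celsius_to_fahrenheit(hot_c), celsius_to_fahrenheit(cool_c)

# The same two temperatures give different ratios in the two scales.
print(hot_c / cool_c)             # 4.0
print(round(hot_f / cool_f, 2))   # 2.59 (176 / 68)

# Differences stay consistent: they differ only by the fixed factor 9/5.
print(hot_c - cool_c, hot_f - cool_f)  # 60 108.0
```

The ratio changes with the choice of scale because each scale places its zero at an arbitrary point, whereas the difference between the two temperatures converts cleanly from one scale to the other.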
Data that is measured using the ratio scale takes care of the ratio problem, and gives us the most
information. Ratio scale data is like interval scale data, but it has a starting point to the scale (a 0
point) and ratios can be calculated. For example, four multiple choice statistics final exam scores
are 80, 68, 20 and 92 (out of a possible 100 points). The data can be put in order from lowest to
highest: 20, 68, 80, 92. The differences between the data have meaning: 92 minus 68 is 24. Ratios
can be calculated: 80 is four times 20. The smallest possible score is 0.
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=23#oembed-3
Watch this video: Nominal, ordinal, interval and ratio data: How to Remember the differences by NurseKillam [11:03]
(transcript available)
Frequency
Twenty students were asked how many hours they worked per day. Their responses, in hours, are
recorded in the table below:
5 6 3 3 2 4 7 5 2 3
5 6 5 4 4 3 5 2 5 3
The following table lists the different data values in ascending order and their frequencies.
A frequency is the number of times a value of the data occurs. According to the table, there are
three students who work two hours, five students who work three hours, and so on. The sum of the
values in the frequency column is 20, which is the total number of students included in the sample.
A relative frequency is the ratio (fraction or proportion) of the number of times a value of
the data occurs in the set of all outcomes to the total number of outcomes. To find the relative
frequencies, divide each frequency by the total number of students in the sample, in this case 20.
Relative frequencies can be written as fractions, percents, or decimals. The sum of the values in the
relative frequency column is 1 or 100%.
Cumulative frequency is the accumulation of the previous frequencies. To find the cumulative
frequencies, add all of the previous frequencies to the frequency for the current row, as shown in
the table below. The last entry of the cumulative frequency column is the number of observations
in the data.
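Frequency Table of Student Work Hours with Relative and Cumulative Frequencies
DATA VALUE    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE FREQUENCY
2             3            0.15                  3
3             5            0.25                  8
4             3            0.15                  11
5             6            0.30                  17
6             2            0.10                  19
7             1            0.05                  20
Frequency Table of Student Work Hours with Relative, Cumulative and Cumulative Relative
Frequencies
DATA VALUE    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
2             3            0.15                  3                       0.15
3             5            0.25                  8                       0.40
4             3            0.15                  11                      0.55
5             6            0.30                  17                      0.85
6             2            0.10                  19                      0.95
7             1            0.05                  20                      1.00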
NOTE
Because of rounding of the relative frequencies, the relative frequency column may not always
sum to 1 or 100%, and the last entry in the cumulative relative frequency column may not be 1 or
100%. However, they each should be close to 1 or 100%. If all of the decimals are kept in the
calculations, the relative frequency column will sum to 1 or 100% and the last cumulative relative
frequency will be 1 or 100%.
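The frequency, relative frequency, and cumulative frequency calculations described above for the twenty students' work hours can also be sketched in code. The example below uses Python's standard library rather than Excel's Analysis ToolPak, so it should be read as an illustration under that assumption; frequency_table is a hypothetical helper name.

```python
from collections import Counter

# Hours worked per day, as reported by the twenty students.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3,
         5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

def frequency_table(data):
    """Return rows of (value, frequency, relative, cumulative, cumulative relative)."""
    counts = Counter(data)
    total = len(data)
    rows = []
    cumulative = 0
    for value in sorted(counts):          # data values in ascending order
        freq = counts[value]
        cumulative += freq                # add this row's frequency to the running total
        rows.append((value, freq, freq / total, cumulative, cumulative / total))
    return rows

for value, freq, rel, cum, cum_rel in frequency_table(hours):
    print(f"{value:>5} {freq:>9} {rel:>8.2f} {cum:>10} {cum_rel:>8.2f}")
```

Here each cumulative relative frequency is computed from the running count divided by the total rather than by adding rounded relative frequencies, which is one way to avoid the rounding drift mentioned in the note.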
In order to create a frequency distribution and its corresponding histogram in Excel, we need to use
the Analysis ToolPak. Follow these instructions to install the Analysis ToolPak add-in in Excel.
This website provides additional information on using Excel to create a frequency distribution.
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=23#oembed-1
Watch this video: Frequency Distributions by Joshua Emmanuel [8:40] (transcript available).
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=23#oembed-2
Watch this video: How to Construct a Histogram in Excel using built-in Data Analysis by Joshua Emmanuel [1:58]
(transcript available).
Concept Review
Some calculations generate numbers that are artificially precise. It is not necessary to report a value
to eight decimal places when the measures that generated that value were only accurate to the
nearest tenth. Round off final answers to one more decimal place than was present in the original
data. This means that if you have data measured to the nearest tenth of a unit, report the final
statistic to the nearest hundredth.
There are four levels of measurement for data:
• Nominal scale level: the data are categories, but the data cannot be ordered or used in
calculations.
• Ordinal scale level: the data are categories and the data can be ordered, but the differences
cannot be measured.
• Interval scale level: the data have definite order or rank, but no starting point. The
differences can be measured, but there is no such thing as a ratio.
• Ratio scale level: the data have a definite order or rank with a starting point. The
differences can be measured and ratios can be calculated.
When organizing data, it is important to know how many times a value appears. How many
statistics students study five hours or more for an exam? What percent of families on your block
own two pets? Frequency, relative frequency, cumulative frequency, and cumulative relative
frequency are measures that answer questions like these.
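As a rough sketch of how these four measures relate, the following Python builds a frequency table from raw data. The function name `frequency_table` and the study-hours data are hypothetical and chosen only for illustration. The book itself builds these tables in Excel.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, relative frequency,
    cumulative relative frequency), sorted by value."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0  # running count of observations so far
    for value in sorted(counts):
        freq = counts[value]
        cumulative += freq
        rows.append((value, freq, freq / n, cumulative / n))
    return rows

# Hypothetical data: hours ten students studied for an exam
hours = [2, 3, 3, 4, 5, 5, 5, 6, 6, 7]

print("hours  freq  rel_freq  cum_rel_freq")
for value, freq, rel, cum_rel in frequency_table(hours):
    print(f"{value:>5}  {freq:>4}  {rel:>8.2f}  {cum_rel:>12.2f}")

# How many students studied five hours or more?
five_or_more = sum(freq for value, freq, _, _ in frequency_table(hours) if value >= 5)
```

The cumulative relative frequency in the last row is always 1, which is a quick check that no observation was dropped.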
Attribution
LEARNING OBJECTIVES
Does aspirin reduce the risk of heart attacks? Is one brand of fertilizer more effective at growing
roses than another? Is fatigue as dangerous to a driver as the influence of alcohol? Questions like
these are answered using randomized experiments. Proper study design ensures the production of
reliable, accurate data.
The purpose of an experiment is to investigate the relationship between two variables. When one
variable causes change in another, we call the first variable the explanatory variable. The affected
variable is called the response variable. In a randomized experiment, the researcher manipulates
values of the explanatory variable and measures the resulting changes in the response variable.
The different values of the explanatory variable are called treatments. An experimental unit is
a single object or individual to be measured.
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=25#oembed-1
Watch this video: Observational Studies and Experiments by ProfessorMcComb [3:05] (transcript available).
28 | 1.5 EXPERIMENTAL DESIGN AND ETHICS
Suppose you want to investigate the effectiveness of vitamin E in preventing disease. You recruit a
group of subjects and ask them if they regularly take vitamin E. You notice that the subjects who
take vitamin E exhibit better health on average than those who do not. Does this prove that vitamin
E is effective in disease prevention? No, it does not. There are many differences between the two
groups compared, in addition to vitamin E consumption. People who take vitamin E often take
other steps to improve their health, such as exercise, diet, other vitamin supplements, or choosing
not to smoke. Any one of these factors could be influencing a person’s health. As described, this
study does not prove that vitamin E is the key to disease prevention.
Additional variables that can cloud a study are called lurking variables. In order to prove
that the explanatory variable is the cause of a change in the response variable, it is necessary
to isolate the explanatory variable. The researcher must design their experiment in such a way
that there is only one difference between groups being compared: the planned treatments. This
is accomplished by the random assignment of experimental units to treatment groups. When
subjects are assigned treatments randomly, all of the potential lurking variables are spread equally
among the groups. At this point the only difference between groups is the one imposed by the
researcher. Therefore, different outcomes measured in the response variable must be a direct result
of the different treatments. In this way, an experiment can prove a cause-and-effect connection
between the explanatory and response variables.
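Random assignment of experimental units to treatment groups can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name `assign_treatments`, the subject IDs, the treatment labels, and the seed are all assumptions made for the example. It shuffles the units and then deals them out in turn so the groups are as equal in size as possible.

```python
import random

def assign_treatments(units, treatments, seed=None):
    """Randomly assign each experimental unit to exactly one treatment.

    The seed parameter exists only to make this illustration
    reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)  # random order removes any pattern in the list
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        # Deal units out in turn so group sizes differ by at most one
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical study: 400 subjects identified by number, two treatments
subjects = range(1, 401)
groups = assign_treatments(subjects, ["active", "placebo"], seed=1)
print({name: len(members) for name, members in groups.items()})
```

Because chance alone decides who lands in which group, characteristics such as age, diet, or exercise habits tend to balance out across groups as the number of units grows.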
The power of suggestion can have an important influence on the outcome of an experiment.
Studies have shown that the expectation of the study participant can be as important as the actual
medication. In one study of performance-enhancing drugs, researchers noted:
Results showed that believing one had taken the substance resulted in [performance] times almost as
fast as those associated with consuming the drug itself. In contrast, taking the drug without knowledge
yielded no significant performance increment (McClung et al., 2007).
When participation in a study prompts a physical response from a participant, it is difficult to
isolate the effects of the explanatory variable. To counter the power of suggestion, researchers
set aside one treatment group as a control group. This group is given a placebo treatment—a
treatment that cannot influence the response variable. The control group helps researchers balance
the effects of being in an experiment with the effects of the active treatments. Of course, if
you are participating in a study and you know that you are receiving a pill which contains no
actual medication, then the power of suggestion is no longer a factor. Blinding in a randomized
experiment preserves the power of suggestion. When a person involved in a research study is
blinded, they do not know who is receiving the active treatment(s) and who is receiving the placebo
treatment. A double-blind experiment is one in which both the subjects and the researchers
involved with the subjects are blinded.
EXAMPLE
Researchers want to investigate whether taking aspirin regularly reduces the risk of heart attack.
Four hundred men between the ages of 50 and 84 are recruited as participants. The men are divided
randomly into two groups: one group will take aspirin and the other group will take a placebo. Each
man takes one pill each day for three years, but he does not know whether he is taking aspirin or the
placebo. At the end of the study, researchers count the number of men in each group who have had
heart attacks.
Identify the following values for this study: population, sample, experimental units, explanatory
variable, response variable, treatments.
Solution:
EXAMPLE
The Smell & Taste Treatment and Research Foundation conducted a study to investigate whether
smell can affect learning. Subjects completed mazes multiple times while wearing masks. They
completed the pencil and paper mazes three times wearing floral-scented masks and three times
with unscented masks. Participants were assigned at random to wear the floral mask during the first
three trials or during the last three trials. For each trial, researchers recorded the time it took to
complete the maze and the subject’s impression of the mask’s scent: positive, negative, or neutral.
Solution:
1. The explanatory variable is scent and the response variable is the time it takes to complete the
maze.
2. There are two treatments: a floral-scented mask and an unscented mask.
3. All subjects experienced both treatments. The order of treatments was randomly assigned so
there were no differences between the treatment groups. Random assignment eliminates the
problem of lurking variables.
4. Subjects will clearly know whether they can smell flowers or not, so subjects cannot be
blinded in this study. However, researchers timing the mazes can be blinded. The researcher
who is observing a subject will not know which mask is being worn.
EXAMPLE
A researcher wants to study the effects of birth order on personality. Explain why this study could
not be conducted as a randomized experiment. What is the main problem in a study that cannot be
designed as a randomized experiment?
Solution:
The explanatory variable is birth order. You cannot randomly assign a person’s birth order. Random
assignment eliminates the impact of lurking variables. When you cannot assign subjects to
treatment groups at random, there will be differences between the groups other than the explanatory
variable.
TRY IT
You are concerned about the effects of texting on driving performance. Design a study to test the
response time of drivers while texting and while driving only. How many seconds does it take for a
driver to respond when a leading car hits the brakes?
Ethics
The widespread misuse and misrepresentation of statistical information often gives the field a bad
name. Some say that “numbers don’t lie,” but the people who use numbers to support their claims
often do.
A recent investigation of the famous social psychologist Diederik Stapel has led to the retraction
of his articles from some of the world’s top journals, including Journal of Experimental Social
Psychology, Social Psychology, Basic and Applied Social Psychology, British Journal of Social
Psychology, and the magazine Science. Diederik Stapel is a former professor at Tilburg University
in the Netherlands. An extensive investigation involving three universities where Stapel
worked concluded that the psychologist is guilty of fraud on a colossal scale. Falsified data taint
over 55 papers he authored and 10 Ph.D. dissertations that he supervised.
Stapel did not deny that his deceit was driven by ambition. But it was more complicated than that,
he told me. He insisted that he loved social psychology but had been frustrated by the messiness of
experimental data, which rarely led to clear conclusions. His lifelong obsession with elegance and
order, he said, led him to concoct sexy results that journals found attractive. “It was a quest for
aesthetics, for beauty—instead of the truth,” he said. He described his behavior as an addiction that
drove him to carry out acts of increasingly daring fraud, like a junkie seeking a bigger and better high
(Levelt et al., 2012).
The committee investigating Stapel concluded that he is guilty of several practices including:
Clearly, it is never acceptable to falsify data the way this researcher did. Sometimes, however,
violations of ethics are not as easy to spot.
Researchers have a responsibility to verify that proper methods are being followed. The report
describing the investigation of Stapel’s fraud states that, “statistical flaws frequently revealed a
lack of familiarity with elementary statistics” (n.a, 2013). Many of Stapel’s co-authors should have
spotted irregularities in his data. Unfortunately, they did not know very much about statistical
analysis, and they simply trusted that he was collecting and reporting data properly.
Many types of statistical fraud are difficult to spot. Some researchers simply stop collecting data
once they have just enough to prove what they had hoped to prove. They don’t want to take the
chance that a more extensive study would complicate their lives by producing data contradicting
their hypothesis.
Professional organizations, like the American Statistical Association, clearly define expectations
for researchers. There are even laws in the federal code about the use of research data.
When a statistical study uses human participants, as in medical studies, both ethics and the
law dictate that researchers should be mindful of the safety of their research subjects. The U.S.
Department of Health and Human Services oversees federal regulations of research studies with the
aim of protecting participants. When a university or other research institution engages in research,
it must ensure the safety of all human subjects. For this reason, research institutions establish
oversight committees known as Institutional Review Boards (IRB). All planned studies must be
approved in advance by the IRB. Key protections that are mandated by law include the following:
• Risks to participants must be minimized and reasonable with respect to projected benefits.
• Participants must give informed consent. This means that the risks of participation must be
clearly explained to the subjects of the study. Subjects must consent in writing, and
researchers are required to keep documentation of their consent.
• Data collected from individuals must be guarded carefully to protect their privacy.
These ideas may seem fundamental, but they can be very difficult to verify in practice. Is removing
a participant’s name from the data record sufficient to protect privacy? Perhaps the person’s
identity could be discovered from the data that remains. What happens if the study does not
proceed as planned and risks arise that were not anticipated? When is informed consent really
necessary? Suppose your doctor wants a blood sample to check your cholesterol level. Once the
sample has been tested, you expect the lab to dispose of the remaining blood. At that point the
blood becomes biological waste. Does a researcher have the right to take it for use in a study?
It is important that students of statistics take time to consider the ethical questions that arise
in statistical studies. How prevalent is fraud in statistical studies? You might be surprised—and
disappointed. There is a website dedicated to cataloging retractions of study articles that have been
proven fraudulent. A quick glance will show that the misuse of statistics is a bigger problem than
most people realize.
Vigilance against fraud requires knowledge. Learning the basic theory of statistics will empower
you to analyze statistical studies critically.
EXAMPLE
Describe the unethical behavior in each example and describe how it could impact the reliability of
the resulting data. Explain how the problem should be corrected.
1. She selects a block where she is comfortable walking because she knows many of the people
living on the street.
2. No one seems to be home at four houses on her route. She does not record the addresses and
does not return at a later time to try to find residents at home.
3. She skips four houses on her route because she is running late for an appointment. When she
gets home, she fills in the forms by selecting random answers from other residents in the
neighborhood.
Solution:
1. By selecting a convenient sample, the researcher is intentionally selecting a sample that could
be biased. Claiming that this sample represents the community is misleading. The researcher
TRY IT
Describe the unethical behavior, if any, in each example and describe how it could impact the
reliability of the resulting data. Explain how the problem should be corrected.
A study is commissioned to determine the favorite brand of fruit juice among teens in California.
Concept Review
A poorly designed study will not produce reliable data. There are certain key components that
must be included in every experiment. To eliminate lurking variables, subjects must be assigned
randomly to different treatment groups. One of the groups must act as a control group,
demonstrating what happens when the active treatment is not applied. Participants in the control
group receive a placebo treatment that looks exactly like the active treatments but cannot influence
the response variable. To preserve the integrity of the placebo, both researchers and subjects may
be blinded. When a study is designed properly, the only difference between treatment groups is the
one imposed by the researcher. Therefore, when groups respond differently to different treatments,
the difference must be due to the influence of the explanatory variable.
“An ethics problem arises when you are considering an action that benefits you or some cause
you support, hurts or reduces benefits to others, and violates some rule” (Gelman, 2013). Ethical
violations in statistics are not always easy to spot. Professional associations and federal agencies
post guidelines for proper conduct. It is important that you learn basic statistical procedures so that
you can recognize proper data analysis.
Attribution
“1.4 Experimental Design and Ethics” in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
1.6 EXERCISES
Researcher B:
Determine what the key terms refer to in the example for Researcher A.
a. population
b. sample
c. parameter
d. statistic
e. variable
2. For each of the following eight exercises, identify: the population, the sample, the parameter, the
statistic, the variable, and the data. Give examples where appropriate.
a. A fitness center is interested in the mean amount of time a client exercises in the center each
week.
b. Ski resorts are interested in the mean age that children take their first ski and snowboard
lessons. They need this information to plan their ski classes optimally.
c. A cardiologist is interested in the mean recovery period of her patients who have had heart
attacks.
d. Insurance companies are interested in the mean health costs each year of their clients, so that
they can determine the costs of health insurance.
e. A politician is interested in the proportion of voters in his district who think he is doing a
good job.
f. A marriage counselor is interested in the proportion of clients she counsels who stay married.
g. Political pollsters may be interested in the proportion of people who will vote for a particular
cause.
h. A marketing company is interested in the proportion of people who will buy a particular
product.
3. Use the following information to answer the next three exercises: A Lake Tahoe Community
College instructor is interested in the mean number of days Lake Tahoe Community College math
students are absent from class during a quarter.
6. The table contains the total number of deaths worldwide as a result of earthquakes from 2000 to
2012. Use the table to answer the following questions.
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
Total
7. For the following four exercises, determine the type of sampling used (simple random, stratified,
systematic, cluster, or convenience).
a. A group of test subjects is divided into twelve groups; then four of the groups are chosen at
random.
b. A market researcher polls every tenth person who walks into a store.
c. The first people who walk into a sporting event are polled on their television preferences.
d. A computer generates random numbers, and people whose names correspond with
the numbers on the list are chosen.
Researcher A
Researcher B
b. Determine what the key term data refers to in the above example for Researcher A.
c. List two reasons why the data may differ.
d. Can you tell if one researcher is correct and the other one is incorrect? Why?
e. Would you expect the data to be identical? Why or why not?
f. How might the researchers gather random data?
g. Suppose that the first researcher conducted his survey by randomly choosing one state in the
nation and then randomly picking patients from that state. What sampling method would
that researcher have used?
h. Suppose that the second researcher conducted his survey by choosing patients he knew.
What sampling method would that researcher have used? What concerns would you have
about this data set, based upon the data collection method?
9. Two researchers are gathering data on hours of video games played by school-aged children and
young adults. They each randomly sample different groups of students from the same school.
They collect the following data.
Researcher A
Hours Played per Week Frequency Relative Frequency Cumulative Relative Frequency
0–2
2–4
4–6
6–8
8–10
10–12
Researcher B
Hours Played per Week Frequency Relative Frequency Cumulative Relative Frequency
0–2
2–4
4–6
6–8
8–10
10–12
10. A pair of studies was performed to measure the effectiveness of a new software program
designed to help stroke patients regain their problem-solving skills. Patients were asked to use the
software program twice a day, once in the morning and once in the evening. The studies observed
stroke patients recovering over a period of several weeks. The first study collected the data in
the table. The second study collected the data in the table.
Used program
Used program
g. brand of toothpaste
h. distance to the closest movie theatre
i. age of executives in Fortune 500 companies
j. number of competing computer spreadsheet software packages
17. A study was done to determine the age, number of times per week, and the duration (amount
of time) of resident use of a local park in San Jose. The first house in the neighborhood around the
park was selected randomly and then every 8th house in the neighborhood around the park was
interviewed.
18. Airline companies are interested in the consistency of the number of babies on each flight, so
that they have adequate safety equipment. Suppose an airline conducts a survey. Over Thanksgiving
weekend, it surveys six flights from Boston to Salt Lake City to determine the number of babies on
the flights. It determines the amount of safety equipment needed by the result of that study.
a. Using complete sentences, list three things wrong with the way the survey was conducted.
b. Using complete sentences, list three ways that you would improve the survey if it were to be
repeated.
19. Suppose you want to determine the mean number of students per statistics class in your state.
Describe a possible sampling method in three to five complete sentences. Make the description
detailed.
20. Suppose you want to determine the mean number of cans of soda drunk each month by
students in their twenties at your school. Describe a possible sampling method in three to five
complete sentences. Make the description detailed.
21. List some practical difficulties involved in getting accurate results from a telephone survey.
22. List some practical difficulties involved in getting accurate results from a mailed survey.
23. With your classmates, brainstorm some ways you could overcome these problems if you
needed to conduct a phone or mail survey.
24. The instructor takes her sample by gathering data on five randomly selected students from
each Lake Tahoe Community College math class. What type of sampling was used?
25. A study was done to determine the age, number of times per week, and the duration (amount
of time) of residents using a local park in San Jose. The first house in the neighborhood around the
park was selected randomly and then every eighth house in the neighborhood around the park was
interviewed. What sampling method was used?
26. Name the sampling method used in each of the following situations.
a. A woman in the airport is handing out questionnaires to travelers asking them to evaluate the
airport’s service. She does not ask travelers who are hurrying through the airport with their
hands full of luggage, but instead asks all travelers who are sitting near gates and not taking
naps while they wait.
b. A teacher wants to know if her students are doing homework, so she randomly selects rows
two and five and then calls on all students in row two and all students in row five to present
the solutions to homework problems to the class.
c. The marketing manager for an electronics chain store wants information about the ages of its
customers. Over the next two weeks, at each store location, randomly selected customers
are given questionnaires to fill out asking for information about age, as well as about other
variables of interest.
d. The librarian at a public library wants to determine what proportion of the library users are
children. The librarian has a tally sheet on which she marks whether books are checked out
by an adult or a child. She records this data for every fourth patron who checks out books.
e. A political party wants to know the reaction of voters to a debate between the candidates. The
day after the debate, the party’s polling staff calls randomly selected phone numbers. If
a registered voter answers the phone or is available to come to the phone, that registered voter
is asked whom he or she intends to vote for and whether the debate changed his or her
opinion of the candidates.
27. A “random survey” was conducted of people of the “microprocessor generation” (people
born since 1971, the year the microprocessor was invented). It was reported that % of those
individuals surveyed stated that if they had to spend, they would use it for computer
equipment. Also, % of those surveyed considered themselves relatively savvy computer users.
a. Do you consider the sample size large enough for a study of this type? Why or why not?
b. Based on your “gut feeling,” do you believe the percents accurately reflect the U.S. population
for those individuals born since 1971? If not, do you think the percents of the population are
actually higher or lower than the sample statistics? Why?
Additional information: The survey, reported by Intel Corporation, was filled out by individuals
who visited the Los Angeles Convention Center to see the Smithsonian Institute’s road show called
“America’s Smithsonian.”
c. With this additional information, do you feel that all demographic and ethnic groups were
equally represented at the event? Why or why not?
d. With the additional information, comment on how accurately you think the sample statistics
reflect the population parameters.
28. The Gallup-Healthways Well-Being Index is a survey that follows trends of U.S. residents on
a regular basis. There are six areas of health and wellness covered in the survey: Life Evaluation,
Emotional Health, Physical Health, Healthy Behavior, Work Environment, and Basic Access. Some
of the questions used to measure the Index are listed below. Identify the type of data obtained
from each question used in this survey: qualitative, quantitative discrete, or quantitative
continuous.
a. Do you have any health problems that prevent you from doing any of the things people your
age can normally do?
b. During the past days, for about how many days did poor health keep you from doing your
usual activities?
c. In the last seven days, on how many days did you exercise for minutes or more?
d. Do you have health insurance coverage?
29. In advance of the 1936 Presidential Election, a magazine titled Literary Digest released the
results of an opinion poll predicting that the Republican candidate Alf Landon would win by a
large margin. The magazine sent post cards to approximately prospective voters.
These prospective voters were selected from the subscription list of the magazine, from automobile
registration lists, from phone lists, and from club membership lists. Approximately
people returned the postcards.
a. Think about the state of the United States in 1936. Explain why a sample chosen from
magazine subscription lists, automobile registration lists, phone books, and club membership
lists was not representative of the population of the United States at that time.
b. What effect does the low response rate have on the reliability of the sample?
c. Are these problems examples of sampling error or nonsampling error?
d. During the same year, George Gallup conducted his own poll of prospective voters.
His researchers used a method they called “quota sampling” to obtain survey answers from
specific subsets of the population. Quota sampling is an example of which sampling method
described in this module?
30. Crime-related and demographic statistics for US states in 1960 were collected from
government agencies, including the FBI’s Uniform Crime Report. One analysis of this data found
a strong connection between education and crime indicating that higher levels of education in
a community correspond to higher crime rates. Which of the potential problems with samples
discussed below could explain this connection?
31. YouPolls is a website that allows anyone to create and respond to polls. One question posted
April 15 asks: “Do you feel happy paying your taxes when members of the Obama administration
are allowed to ignore their tax liabilities?” As of April 25, people responded to this question.
Each participant answered “NO!” Which of the potential problems with samples could explain this
connection?
32. A scholarly article about response rates begins with the following quote: “Declining contact
and cooperation rates in random digit dial (RDD) national telephone surveys raise serious concerns
about the validity of estimates drawn from such research.” The Pew Research Center for People and
the Press admits: “The percentage of people we interview – out of all we try to interview – has been
declining over the past decade or more.”
a. What are some reasons for the decline in response rate over the past decade?
b. Explain why researchers are concerned with the impact of the declining response rate on
public opinion polls.
33. Seven hundred and seventy-one distance learning students at Long Beach City College
responded to surveys in the 2010-11 academic year. Highlights of the summary report are listed in
the table.
Age 41 or over
c. If the same survey were done at Great Basin College in Elko, Nevada, do you think the
percentages would be the same? Why?
34. Several online textbook retailers advertise that they have lower prices than on-campus
bookstores. However, an important factor is whether the Internet retailers actually have the
textbooks that students need in stock. Students need to be able to get textbooks promptly at the
beginning of the college term. If the book is not available, then a student would not be able to get
the textbook at all, or might get a delayed delivery if the book is back ordered.
35. A college newspaper reporter is investigating textbook availability at online retailers. He
decides to investigate one textbook for each of the following seven subjects: calculus, biology,
chemistry, physics, statistics, geology, and general engineering. He consults textbook industry sales
data and selects the most popular nationally used textbook in each of these subjects. He visits
websites for a random sample of major online textbook sellers and looks up each of these seven
textbooks to see if they are available in stock for quick delivery through these retailers. Based on his
investigation, he writes an article in which he draws conclusions about the overall availability of all
college textbooks through online textbook retailers. Write an analysis of his study that addresses
the following issues: Is his sample representative of the population of all college textbooks? Explain
why or why not. Describe some possible sources of bias in this study, and how it might affect the
results of the study. Give some suggestions about what could be done to improve the study.
36. What type of measurement scale is being used: nominal, ordinal, interval, or ratio?
a. High school soccer players classified by their athletic ability: Superior, Average, Above
average
b. Baking temperatures for various main dishes: , , , ,
c. The colors of crayons in a -crayon box
d. Social security numbers
e. Incomes measured in dollars
f. A satisfaction survey of a social website by number: = very satisfied, = somewhat
satisfied, = not satisfied
g. Political outlook: extreme left, left-of-center, right-of-center, extreme right
h. Time of day on an analog watch
i. The distance in miles to the closest grocery store
j. The dates 1066, 1492, 1644, 1947, and 1944
k. The heights of 21–65 year-old women
l. Common letter grades: A, B, C, D, and F
37. Fifty part-time students were asked how many courses they were taking this term. The
(incomplete) results are shown below:
38. Sixty adults with gum disease were asked the number of times per week they used to floss before
their diagnosis. The (incomplete) results are shown below:
39. Nineteen immigrants to the U.S. were asked how many years, to the nearest year, they have lived
in the U.S. The data are as follows:
a. Fix the errors in the table. Also, explain how someone might have arrived at the incorrect
number(s).
b. Explain what is wrong with this statement: “ percent of the people surveyed have lived in
the U.S. for years.”
c. Fix the statement in part (b) to make it correct.
d. What fraction of the people surveyed have lived in the U.S. five or seven years?
e. What fraction of the people surveyed have lived in the U.S. at most years?
f. What fraction of the people surveyed have lived in the U.S. fewer than years?
g. What fraction of the people surveyed have lived in the U.S. from five to years, inclusive?
40. How much time does it take to travel to work? The table below shows the mean commute time
by state for workers at least years old who are not working at home. Find the mean travel time,
and round off the answer properly.
41. Forbes magazine published data on the best small firms in 2012. These were firms which had
been publicly traded for at least a year, have a stock price of at least per share, and have reported
annual revenue between million and billion. The table below shows the ages of the chief
executive officers for the first ranked firms.
40–44
45–49
50–54
55–59
60–64
65–69
70–74
42. The table below contains data on hurricanes that have made direct hits on the U.S. between
1851 and 2004. A hurricane is given a strength category rating based on the minimum wind speed
generated by the storm.
Total =
a. What is the relative frequency of direct hits that were category hurricanes?
b. What is the relative frequency of direct hits that were AT MOST a category storm?
43. Design an experiment. Identify the explanatory and response variables. Describe the population
being studied and the experimental units. Explain the treatments that will be used and how they
will be assigned to the experimental units. Describe how blinding and placebos may be used to
counter the power of suggestion.
44. Discuss potential violations of the rule requiring informed consent.
a. Inmates in a correctional facility are offered good behavior credit in return for participation in
a study.
b. A research study is designed to investigate a new children’s allergy medication.
c. Participants in a study are told that the new medication being tested is highly promising, but
they are not told that only a small portion of participants will receive the new medication.
Others will receive placebo treatments and traditional treatments.
45. How does sleep deprivation affect your ability to drive? A recent study measured the effects on
professional drivers. Each driver participated in two experimental sessions: one after normal
sleep and one after hours of total sleep deprivation. The treatments were assigned in random
order. In each session, performance was measured on a variety of tasks including a driving
simulation. Use key terms from this module to describe the design of this experiment.
46. An advertisement for Acme Investments displays the two graphs in the figure below to show
the value of Acme’s product in comparison with the Other Guy’s product. Describe the potentially
misleading visual effect of these comparison graphs. How can this be corrected?
47. The graph in the figure below shows the number of complaints for six different airlines
as reported to the US Department of Transportation in February 2013. Alaska, Pinnacle, and
Airtran Airlines have far fewer complaints reported than American, Delta, and United. Can we
conclude that American, Delta, and United are the worst airline carriers since they have the most
complaints?
Attribution
1. (a) AIDS patients. (c) The average length of time (in months) AIDS patients live after treatment.
(e) = the length of time (in months) AIDS patients live after treatment
2. (b) all children who take ski or snowboard lessons; a group of these children; the population
mean age of children who take their first snowboard lesson; the sample mean age of children
who take their first snowboard lesson; = the age of one child who takes his or her first ski or
snowboard lesson; values for , such as , , and so on. (d) the clients of the insurance companies;
a group of the clients; the mean health costs of the clients; the mean health costs of the sample;
= the health costs of one client; values for , such as , , , and so on (f) all the clients of this
counselor; a group of clients of this marriage counselor; the proportion of all her clients who stay
married; the proportion of the sample of the counselor’s clients who stay married; = the number
of couples who stay married; yes, no (h) all people (maybe in a certain geographic area, such as the
United States); a group of the people; the proportion of all people who will buy the product; the
proportion of the sample who will buy the product; = the number of people who will buy it; buy,
not buy.
6.
a.
b.
c.
d.
e. quantitative discrete
f. quantitative continuous
g. In both years, underwater earthquakes produced massive tsunamis.
Even though the specific data support each researcher’s conclusions, the different results suggest
that more data need to be collected before the researchers can reach a conclusion.
10. (a) There is not enough information given to judge if either one is correct or incorrect. (c) The
software program seems to work because the second study shows that more patients improve while
using the software than not. Even though the difference is not as large as that in the first study, the
results from the second study are likely more reliable and still show improvement. (e) Yes, because
we cannot tell if the improvement was due to the software or the exercise; the data is confounded,
and a reliable conclusion cannot be drawn. New studies should be performed.
12. No, even though the sample is large enough, the fact that the sample consists of volunteers
makes it a self-selected sample, which is not reliable.
14. No, even though the sample is a large portion of the population, two responses are not enough
to justify any conclusions. Because the population is so small, it would be better to include everyone
in the population to get the most accurate data.
16. (a) quantitative discrete, (c) qualitative, Oakland A’s (e) quantitative discrete,
students (g) qualitative, Crest (i) quantitative continuous, years
18.
a. The survey was conducted using six similar flights. The survey would not be a true
representation of the entire population of air travelers.
Conducting the survey on a holiday weekend will not produce representative results.
b. Conduct the survey during different times of the year. Conduct the survey using flights to and
from various locations.
Conduct the survey on different days of the week.
20. Answers will vary. Sample Answer: You could use a systematic sampling method. Stop the tenth
person as they leave one of the buildings on campus at 9:50 in the morning. Then stop the tenth
person as they leave a different building on campus at 1:50 in the afternoon.
22. Answers will vary. Sample Answer: Many people will not respond to mail surveys. If they
do respond to the surveys, you can’t be sure who is responding. In addition, mailing lists can be
incomplete.
26. (a) convenience (b) cluster (c) stratified (d) systematic (e) simple random
28.
a. qualitative
b. quantitative discrete
c. quantitative discrete
d. qualitative
1.7 ANSWERS TO SELECTED EXERCISES | 57
30. Causality: The fact that two variables are related does not guarantee that one variable is
influencing the other. We cannot assume that crime rate impacts education level or that education
level impacts crime rate.
Confounding: There are many factors that define a community other than education level and
crime rate. Communities with high crime rates and high education levels may have other lurking
variables that distinguish them from communities with lower crime rates and lower education
levels. Because we cannot isolate these variables of interest, we cannot draw valid conclusions
about the connection between education and crime. Possible lurking variables include police
expenditures, unemployment levels, region, average age, and size.
32.
a. Possible reasons: increased use of caller id, decreased use of landlines, increased use of private
numbers, voice mail, privacy managers, hectic nature of personal schedules, decreased
willingness to be interviewed
b. When a large number of people refuse to participate, then the sample may not have the same
characteristics of the population. Perhaps the majority of people willing to participate are
doing so because they feel strongly about the subject of the survey.
36.
a. ordinal
b. interval
c. nominal
d. nominal
e. ratio
f. ordinal
g. nominal
h. interval
i. ratio
j. interval
k. ratio
l. ordinal
40. The sum of the travel times is . Divide the sum by to calculate the mean value:
. Because each state’s travel time was measured to the nearest tenth, round this calculation
to the nearest hundredth: .
44.
a. Inmates may not feel comfortable refusing participation, or may feel obligated to take
advantage of the promised benefits. They may not feel truly free to refuse participation.
b. Parents can provide consent on behalf of their children, but children are not competent to
provide consent for themselves.
c. All risks and benefits must be clearly outlined. Study participants must be informed of
relevant aspects of the study in order to give appropriate consent.
45.
Explanatory variable: amount of sleep
Random assignment: treatments were assigned in random order; this eliminated the effect of any
“learning” that may take place during the first experimental session
Blinding: researchers evaluating subjects’ performance must not know which treatment is being
applied at the time.
47. You cannot assume that the number of complaints reflects the quality of the
airlines. The airlines shown with the greatest number of complaints are the ones with the most
passengers. You must consider the appropriateness of methods for presenting data; in this case,
displaying totals is misleading.
Attribution
Chapter Outline
When you have large amounts of data, you will need to organize it in a way that makes sense. These ballots
from an election are rolled together with similar ballots to keep them organized. Photo by William Greeson,
CC BY 4.0.
Once we have collected data, what do we do with it? Data can be described and presented in many
different formats. For example, suppose you are interested in buying a house in a particular area.
You may have no clue about the house prices, so you might ask your real estate agent to give you a
sample data set of prices. Looking at all the prices in the sample is often overwhelming. A better
way might be to look at the median price and the variation of prices. The median and variation are
just two ways that we can summarize and describe data. Your agent might also provide you with a
graph of the data.
In this chapter, we will study numerical and graphical ways to describe and display your data.
64 | 2.1 INTRODUCTION TO DESCRIPTIVE STATISTICS
This area of statistics is called descriptive statistics. We will learn how to calculate, and more
importantly, how to interpret these measurements and graphs.
A statistical graph is a tool that helps us learn about the shape or distribution of a sample or a
population. A graph can be a more effective way of presenting data than a mass of numbers because
we can see where data clusters and where there are only a few data values. Newspapers and the
internet use graphs to show trends and to enable readers to compare facts and figures quickly.
Statisticians often graph data first to get a picture of the data, and then use more formal tools to
analyze the data.
Some of the types of graphs that are used to summarize and organize data are the dot plot, the bar
graph, the histogram, the stem-and-leaf plot, the frequency polygon (a type of broken line graph),
the pie chart, and the box plot. In this chapter, we will briefly look at bar graphs (or histograms), as
well as frequency polygons and time-series graphs.
Attribution
LEARNING OBJECTIVES
• Display data using an appropriate graph: histograms, frequency polygons, and time series
graphs.
• Analyze and interpret data presented in a graph.
Histograms
For most of the work we do in this book, we will use a histogram to display the data. One advantage
of a histogram is that it can readily display large data sets. A rule of thumb is to use a histogram
when the data set consists of 100 values or more.
A histogram is a visual display of a frequency chart. It consists of contiguous, vertical boxes
with both a horizontal axis and a vertical axis. The horizontal axis is labeled with the classes
or categories from the frequency chart. The vertical axis is labeled either frequency or relative
frequency (or percent frequency or probability). The graph will have the same shape with either
label on the vertical axis but the scale on the vertical axis will be different. The histogram gives us
the shape of the data, the center of the data, and the spread of the data.
Recall that the frequency is the number of times an observation falls into that particular class
and the relative frequency is the frequency for the class divided by the total number of data values
in the sample. For example, if three students in Mr. Ahab’s English class of 40 students received
from 90% to 100%, then the frequency of the 90% to 100% class is 3 and the relative frequency is
3/40 = 0.075. So, 7.5% of the students received between 90% and 100%.
To construct a histogram, first decide how many bars or intervals, also called classes,
represent the data. Many histograms consist of 5 to 15 bars or classes for clarity, but the number of
bars is determined by the person constructing the histogram. Choose a starting point for the first
class to be less than the smallest data value. A convenient starting point is a lower value carried
out to one more decimal place than the value with the most decimal places. For example, if the
value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is
6.05. We say that 6.05 has more precision. If the value with the most decimal places is 2.23 and the
lowest value is 1.5, a convenient starting point is 1.495. If the value with the most decimal places is
3.234 and the lowest value is 1.0, a convenient starting point is 0.9995. If all the data happen to be
integers and the smallest value is 2, then a convenient starting point is 1.5. Also, when the starting
point and other boundaries are carried to one additional decimal place, no data value will fall on a
boundary.
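The starting-point rule can be summarized as: find the largest number of decimal places in the data, then subtract half a unit of the next decimal place from the smallest value. The Python sketch below is one way to express that rule. The function name convenient_start is hypothetical, and the second value in each list is made up so that each list has more than one entry.

```python
from decimal import Decimal

def convenient_start(values):
    """Return the smallest value minus half a unit of the next decimal place.

    Hypothetical helper: `values` may be numbers or numeric strings.
    """
    decimals = [Decimal(str(v)) for v in values]
    # Largest number of decimal places found in the data (0 for integers).
    places = max(max(-d.as_tuple().exponent, 0) for d in decimals)
    # Half a unit one decimal place further out: 0.5, 0.05, 0.005, ...
    half_unit = Decimal(5).scaleb(-(places + 1))
    return min(decimals) - half_unit

print(convenient_start([6.1, 7.4]))    # 6.05
print(convenient_start([2.23, 1.5]))   # 1.495
print(convenient_start([3.234, 1.0]))  # 0.9995
print(convenient_start([2, 5, 9]))     # 1.5
```

Decimal arithmetic is used here so that results such as 1.495 are exact rather than subject to binary floating-point rounding.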
Watch this video: Histograms | Applying mathematical reasoning | Pre-algebra | Khan Academy by Khan Academy [6:07]
(transcript available)
EXAMPLE
The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional
soccer players. The heights are continuous data because height is a measurement.
61 64 66 66.5 67 67 68 69 70 72
61.5 64 66 66.5 67 67 68 69 70 72
The smallest data value is 60. Because the data with the most decimal places has one decimal (for
instance, 61.5), we want our starting point to have two decimal places. Because the numbers 0.5,
0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for the
convenient starting point. The starting point is then 60 − 0.05 = 59.95. The largest value
is 74, so 74 + 0.05 = 74.05 is the ending value.
Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting
point from the ending value and divide by the number of classes (you must decide how many classes
you want). Suppose we want to have eight classes.
(74.05 − 59.95) / 8 ≈ 1.76
We will round up to two and make each bar or class interval two units wide. Rounding up to two is
one way to prevent a value from falling on a boundary. Rounding to the next number is often
necessary even if it goes against the standard rules of rounding. For this example, using 1.76 as the
width would also work. With a width of two, the boundaries are:
\begin{eqnarray*}
 & & 59.95 \\
59.95+2 & = & 61.95 \\
61.95+2 & = & 63.95 \\
63.95+2 & = & 65.95 \\
65.95+2 & = & 67.95 \\
67.95+2 & = & 69.95 \\
69.95+2 & = & 71.95 \\
71.95+2 & = & 73.95 \\
73.95+2 & = & 75.95
\end{eqnarray*}
The heights 60 through 61.5 inches are in the first class 59.95–61.95. The heights that are 63.5 are in
the second class 61.95–63.95. The heights that are 64 through 64.5 are in the third class 63.95–65.95.
The heights 66 through 67.5 are in the fourth class 65.95–67.95. The heights 68 through 69.5 are in
the fifth class 67.95–69.95. The heights 70 through 71 are in the sixth class 69.95–71.95. The heights
72 through 73.5 are in the seventh class 71.95–73.95. The height 74 is in the last class 73.95–75.95.
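The width calculation and class assignment for this example can be checked with a short Python sketch. It assumes a starting point of 59.95 and an ending value of 74.05 (the largest height, 74, plus 0.05). The helper class_number is hypothetical.

```python
import math
from bisect import bisect_right

start, end, classes = 59.95, 74.05, 8

raw_width = (end - start) / classes   # about 1.76
width = math.ceil(raw_width)          # round up to the next whole number: 2

# Nine boundaries delimit eight classes: 59.95, 61.95, ..., 75.95
boundaries = [round(start + i * width, 2) for i in range(classes + 1)]

def class_number(height):
    """Return the 1-based class containing `height` (hypothetical helper)."""
    return bisect_right(boundaries, height)

print(boundaries)
print(class_number(60), class_number(63.5), class_number(74))  # 1 2 8
```

Because every boundary ends in .95 and the heights are recorded to the nearest half inch, no height can land exactly on a boundary.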
The following histogram displays the heights on the x-axis and relative frequency on the y-axis.
NOTE
A guideline that some follow for the number of bars or class intervals is to take the square root
of the number of data values and then round to the nearest whole number, if necessary. For
example, if there are 150 values of data, take the square root of 150 and round to 12 bars or classes.
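As a quick check of the square-root guideline, the following Python sketch computes the suggested number of bars or classes. The function name suggested_class_count is hypothetical.

```python
import math

def suggested_class_count(n_values):
    """Square-root guideline: round sqrt(n) to the nearest whole number."""
    return round(math.sqrt(n_values))

print(suggested_class_count(150))  # 12, since sqrt(150) is about 12.25
```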
TRY IT
The following data are the shoe sizes of 50 male students. The sizes are continuous data since shoe
size is measured. Construct a histogram and calculate the width of each bar or class interval.
Suppose you choose six bars.
9 9 9.5 9.5 10 10 10 10 10 10
11 11 11 11 11 11 11 11 11 11
• Smallest value:
• Largest value:
• Convenient starting value:
• Convenient ending value:
• Class width:
The calculations suggest using as the width of each bar or class interval. You can also use an
interval with a width equal to one.
EXAMPLE
The following data are the number of books bought by 50 part-time college students at ABC College.
The number of books is discrete data because books are counted.
1 1 1 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 4 4 4
4 4 4 5 5 5 5 5 6 6
Eleven students buy one book. Ten students buy two books. Sixteen students buy three books. Six
students buy four books. Five students buy five books. Two students buy six books.
Because the data are integers, subtract 0.5 from 1, the smallest data value, and add 0.5 to 6, the largest
data value, to get the starting and ending points. The starting point is then 0.5 and the ending value is
6.5.
Next, calculate the width of each bar or class interval. If the data are discrete and there are not too
many different values, a width that places the data values in the middle of the bar or class interval is
the most convenient. Because the data consist of the numbers 1, 2, 3, 4, 5, 6 and the starting point is
0.5, a width of one places the 1 in the middle of the interval from 0.5 to 1.5, the 2 in the middle of the
interval from 1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, the 4 in the middle of the
interval from _______ to _______, the 5 in the middle of the interval from _______ to _______, and
the _______ in the middle of the interval from _______ to _______ .
Solution:
• 3.5 to 4.5
• 4.5 to 5.5
• 6
• 5.5 to 6.5
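The frequencies and unit-width class intervals for the book data can be reproduced with a short Python sketch. This is only a cross-check of the counts stated above, not the Excel procedure used in this book.

```python
from collections import Counter

# Book counts for the 50 students, as listed in the example.
books = [1] * 11 + [2] * 10 + [3] * 16 + [4] * 6 + [5] * 5 + [6] * 2

frequency = Counter(books)
n = len(books)

for value in sorted(frequency):
    # A width of one centers each integer in its class interval.
    low, high = value - 0.5, value + 0.5
    rel = frequency[value] / n
    print(f"{low} to {high}: frequency {frequency[value]}, relative frequency {rel:.2f}")
```

The relative frequencies printed here are each frequency divided by 50, the total number of data values.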
The following histogram displays the number of books on the x-axis and the frequency on the
y-axis.
In order to create a frequency distribution and its corresponding histogram in Excel, we need to use
the Analysis ToolPak. Follow these instructions to add the Analysis ToolPak.
This website provides additional information on using Excel to create a frequency distribution.
NOTE
The histogram produced by Excel uses the frequency column from the frequency table on the
vertical axis, not the relative frequency column.
Watch this video: Frequency Distributions by Joshua Emmanuel [8:40] (transcript available).
Watch this video: How to Construct a Histogram in Excel Using Data Analysis by Joshua Emmanuel [1:58] (transcript
available).
TRY IT
The following data are the number of sports played by 50 student athletes. The number of sports is
discrete data because sports are counted.
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 3 3 3 3 3 3 3 3
Fill in the blanks for the following sentence. Because the data consist of the numbers 1, 2, 3, and the
starting point is 0.5, a width of one places the 1 in the middle of the interval 0.5 to _____, the 2 in the
middle of the interval from _____ to _____, and the 3 in the middle of the interval from _____ to
_____.
• 1.5
• 1.5 to 2.5
• 2.5 to 3.5
EXAMPLE
23 21.9 24 23.75 18
Some values in this data set fall on boundaries for the class intervals. A value is counted in a class
interval if it falls on the left boundary, but not if it falls on the right boundary. Different researchers
may set up histograms for the same data in different ways. There is more than one correct way to set
up a histogram.
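The boundary convention described here (count a value on the left boundary, but not on the right) corresponds to half-open intervals. The Python sketch below illustrates it with hypothetical boundaries and values chosen so that some values land exactly on a boundary. They are not the full data set from this example.

```python
from bisect import bisect_right

# Hypothetical class boundaries and values for illustration only.
boundaries = [18, 20, 22, 24, 26]
values = [18, 21.9, 22, 23.75, 24]

def class_interval(value):
    """Return the (left, right) interval with left <= value < right."""
    i = bisect_right(boundaries, value) - 1
    if i < 0 or i >= len(boundaries) - 1:
        raise ValueError(f"{value} is outside the classes")
    return boundaries[i], boundaries[i + 1]

for v in values:
    print(v, "->", class_interval(v))
```

Under this convention the value 22 is counted in the class from 22 to 24, not in the class from 20 to 22.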
Frequency Polygons
Associated with frequency charts and histograms, frequency polygons are line graphs with the
classes on the horizontal axis, frequency on the vertical axis, and the frequencies plotted against
the midpoint of the class interval. As with histograms, start by examining the data and decide on
the classes, using similar techniques as discussed above. Find the frequency for each class. Plot
the classes on the x-axis and the frequency on the y-axis. For each class, add a point on the graph
with the x-coordinate equal to the class midpoint and the y-coordinate equal to the frequency of
the class. Add points on the horizontal axis at the midpoint of the class before the first class and
at the midpoint of the class after the last class. After all the points are plotted, draw line segments
to connect them. Frequency polygons are useful for comparing distributions. This is achieved by
overlaying the frequency polygons drawn for different data sets.
EXAMPLE
Lower Bound   Upper Bound   Frequency
49.5          59.5          5
59.5          69.5          10
69.5          79.5          30
79.5          89.5          40
89.5          99.5          15
The first label on the x-axis is 44.5. This represents an interval extending from 39.5 to 49.5. Because
the first class in the table starts at 49.5, this interval contains no data and is used only to allow the
graph to touch the x-axis. The
point labeled 54.5 represents the next interval, or the first “real” interval from the table, and contains
five scores. This reasoning is followed for each of the remaining intervals, with the point 104.5
representing the interval from 99.5 to 109.5. Again, this interval contains no data and is only used so
that the graph will touch the x-axis. Looking at the graph, we say that this distribution is skewed
because one side of the graph does not mirror the other side.
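The points of this frequency polygon can be derived from the table with a short Python sketch: one point per class at the class midpoint, plus one empty class on each side so the polygon touches the horizontal axis. The variable names are illustrative.

```python
# Class boundaries and frequencies from the table above.
classes = [(49.5, 59.5, 5), (59.5, 69.5, 10), (69.5, 79.5, 30),
           (79.5, 89.5, 40), (89.5, 99.5, 15)]

width = classes[0][1] - classes[0][0]

# One point per class: (midpoint, frequency).
points = [((low + high) / 2, freq) for low, high, freq in classes]

# Pad with an empty class on each side so the polygon touches the axis.
first_mid, last_mid = points[0][0], points[-1][0]
points = [(first_mid - width, 0)] + points + [(last_mid + width, 0)]

print(points)
```

The first and last points, at 44.5 and 104.5, carry a frequency of zero, matching the description above.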
EXAMPLE
We will construct an overlay frequency polygon comparing the scores with the students’ final
numeric grade.
Lower Bound   Upper Bound   Frequency
49.5          59.5          5
59.5          69.5          10
69.5          79.5          30
79.5          89.5          40
89.5          99.5          15

Lower Bound   Upper Bound   Frequency
49.5          59.5          10
59.5          69.5          10
69.5          79.5          30
79.5          89.5          45
89.5          99.5          5
Suppose that we want to study the temperature range of a region for an entire month. Every day at
noon we note the temperature and write this down in a log. A variety of statistical studies could
be done with this data. We could find the mean or the median temperature for the month. We
could construct a histogram displaying the number of days that temperatures reach a certain range
of values. However, all of these methods ignore a portion of the data that we have collected.
One feature of the data that we may want to consider is that of time. Because each date is
paired with the temperature reading for the day, we do not have to think of the data as being
random. We can instead use the times given to impose a chronological order on the data. A graph
that recognizes this ordering and displays the changing temperature as the month progresses is
called a time series graph.
To construct a time series graph, we must look at both pieces of our paired data set. We start
with a standard Cartesian coordinate system. The horizontal axis is used to plot the date or time
increments and the vertical axis is used to plot the values of the variable that we are measuring.
By doing this, we make each point on the graph correspond to a date and a measured quantity.
The points on the graph are typically connected by straight lines in the order in which they occur.
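The construction can be sketched in Python as follows: pair each date with its reading, put the pairs in chronological order, and connect consecutive points. The dates, temperatures, and units below are hypothetical and serve only to illustrate the ordering step.

```python
from datetime import date

# Hypothetical noon temperatures (degrees Celsius), logged out of order.
readings = {
    date(2024, 6, 3): 21.5,
    date(2024, 6, 1): 19.0,
    date(2024, 6, 2): 20.5,
}

# Impose chronological order: each point pairs a date (horizontal axis)
# with a measured value (vertical axis).
points = sorted(readings.items())

# Line segments connect consecutive points in the order they occur.
segments = list(zip(points, points[1:]))

for (d0, t0), (d1, t1) in segments:
    print(f"{d0} ({t0}) -> {d1} ({t1})")
```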
EXAMPLE
The following data show the Annual Consumer Price Index, each month, for ten years. Construct a
time series graph for the Annual Consumer Price Index data only.
Time series graphs are important tools in various applications of statistics. When recording values
of the same variable over an extended period of time, sometimes it is difficult to discern any trend
or pattern. However, once the same data points are displayed graphically, some features jump out.
Time series graphs make trends easy to spot.
Concept Review
A histogram is a graphic version of a frequency distribution. The graph consists of bars of equal
width drawn adjacent to each other. The horizontal scale represents classes of quantitative data
values and the vertical scale represents frequencies or relative frequencies. The heights of the bars
correspond to frequency or relative frequency values. Histograms are typically used for large,
continuous, quantitative data sets.
A frequency polygon can also be used when graphing large data sets with data points that
repeat. The data usually goes on the x-axis with the frequency being graphed on the y-axis.
Time series graphs can be helpful when looking at large amounts of data for one variable over
a period of time.
Attribution
“2.2 Histograms, Frequency Polygons, and Time Series Graphs“ in Introductory Statistics by
OpenStax is licensed under a Creative Commons Attribution 4.0 International License.
2.3 MEASURES OF CENTRAL TENDENCY
LEARNING OBJECTIVES
• Recognize, describe, calculate, and analyze the measures of the center of data: mean, median,
and mode.
The “center” of a data set is a way of describing location. The two most widely used measures of
the “center” of the data are the mean (average) and the median. To calculate the mean weight
of 50 people, add the 50 weights together and divide by 50. To find the median weight of the 50
people, order the data, and find the number that splits the data into two equal parts. The median is
generally a better measure of the center when there are extreme values or outliers because it is not
affected by the precise numerical values of the outliers. The mean is the most common measure of
the center.
NOTE
The words “mean” and “average” are often used interchangeably. The substitution of one word
for the other is common practice. The technical term for mean is “arithmetic mean” and
“average” is technically a center location. However, in practice among non-statisticians,
“average” is commonly accepted for “arithmetic mean.”
Mean
The mean is calculated by adding up all of the values in the data and then dividing the sum by the
total number of data values.
The letter used to represent the sample mean is x̄ (read “x-bar”). The Greek letter μ (pronounced
“mew”) represents the population mean. One of the requirements for the sample mean to be a
good estimate of the population mean is for the sample taken to be truly random.
Consider the sample:
1 1 1 2 2 3 4 4 4 4 4
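Applying the definition to this sample (writing x-bar for the sample mean), the eleven values sum to 30, so the calculation can be worked out as follows. The final value is rounded to one decimal place.

```latex
\begin{eqnarray*}
\bar{x} & = & \frac{1+1+1+2+2+3+4+4+4+4+4}{11} \\
 & = & \frac{30}{11} \\
 & \approx & 2.7
\end{eqnarray*}
```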
• To calculate the mean in Excel, use the average(array) function. For array, enter the array or cell range containing the data.
The output from the average function is the mean of the entered data.
Visit the Microsoft page for more information about the average function.
Median
The median is the middle value in an ordered set of data. You can quickly find the location of the
median by using the expression (n + 1)/2, where n is the total number of data values in the sample.
If n is an odd number, the median is the middle value of the ordered data (ordered smallest to
largest). If n is an even number, the median is equal to the two middle values added together and
divided by two after the data has been ordered. For example, if the total number of data values is
97, then the median is located in position (97 + 1)/2 = 49 of the ordered list. If the total
number of data values is 100, then (100 + 1)/2 = 50.5 and the median occurs midway
between the 50th and 51st values. The location of the median and the value of the median are not
the same. The upper case letter M is often used to represent the median.
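The difference between the location of the median and its value can be illustrated with a Python sketch that follows the location rule (n + 1)/2, where n is the number of data values. The function name median_by_location is hypothetical, and the sketch assumes a non-empty list of numbers.

```python
def median_by_location(data):
    """Find the median using the location (n + 1) / 2 in the ordered data."""
    if not data:
        raise ValueError("data must not be empty")
    ordered = sorted(data)
    n = len(ordered)
    location = (n + 1) / 2   # position in the ordered list, counting from 1
    if n % 2 == 1:
        # Odd n: the location is a whole number, so take that value.
        return ordered[int(location) - 1]
    # Even n: the location falls midway between two positions.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median_by_location([3, 1, 2]))     # 2
print(median_by_location([4, 1, 3, 2]))  # 2.5
```

For the whole numbers 1 through 97, the location is 49 and the value is also 49. In general the location and the value differ.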
• To calculate the median in Excel, use the median(array) function. For array, enter the array or cell range containing the data.
The output from the median function is the median of the entered data.
Visit the Microsoft page for more information about the median function.
EXAMPLE
AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody
drug are as follows (smallest to largest):
3 4 8 8 10 11 12 13 14 15
15 16 16 17 17 18 21 22 22 24
24 25 26 26 27 27 29 29 31 32
33 33 34 34 35 37 40 44 44 47
Solution:
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A40.
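As a cross-check outside Excel, the mean and median of these 40 values can be computed with a short Python sketch. This is an independent verification, not the Excel procedure used in this book.

```python
from statistics import median

# Survival times in months, as listed in the example (already ordered).
months = [
    3, 4, 8, 8, 10, 11, 12, 13, 14, 15,
    15, 16, 16, 17, 17, 18, 21, 22, 22, 24,
    24, 25, 26, 26, 27, 27, 29, 29, 31, 32,
    33, 33, 34, 34, 35, 37, 40, 44, 44, 47,
]

mean_months = sum(months) / len(months)   # 943 / 40
median_months = median(months)            # midway between the 20th and 21st values

print(f"mean = {mean_months:.1f}, median = {median_months}")
```

The values sum to 943, so the mean is about 23.6 months. The 20th and 21st values are both 24, so the median is 24 months.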
TRY IT
The following data show the number of months patients typically wait on a transplant list before
getting surgery. The data are ordered from smallest to largest. Calculate the mean and median.
3 4 5 7 7 7 7 8 8 9
9 10 10 10 10 10 11 12 12 13
14 14 15 15 17 17 18 19 19 19
21 21 22 22 23 24 24 24 24
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A39.
EXAMPLE
Suppose that in a small town of 50 people, one person earns per year and the other 49
each earn . Which is the better measure of the “center”: the mean or the median?
Solution:
The median is a better measure of the “center” than the mean because 49 of the values are
and one is . The is an outlier. The median of gives us a
better sense of the middle of the data.
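The effect of a single outlier on the mean can be seen with a small Python sketch. The dollar amounts below are hypothetical stand-ins chosen for illustration. They are not taken from this example.

```python
from statistics import median

# Hypothetical incomes: 49 people earn the same amount and one earns far more.
incomes = [30_000] * 49 + [5_000_000]

mean_income = sum(incomes) / len(incomes)
median_income = median(incomes)

print(mean_income)    # 129400.0
print(median_income)  # 30000.0
```

With these assumed values, the mean is more than four times what 49 of the 50 people earn, while the median matches their income exactly.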
TRY IT
In a sample of 60 households, one house is worth . Half of the rest are worth
, and all the others are worth . Which is the better measure of the “center”:
the mean or the median?
The median is the better measure of the “center” than the mean because 59 of the values are either
or and only one is . The is an outlier. Either
or gives us a better sense of the middle of the data.
Mode
Another measure of the center of the data is the mode. The mode is the most frequently occurring
value in the set of data. There can be more than one mode in a data set as long as those values have
the same frequency and that frequency is the highest. A set of data can also have no mode if all of
the observations in the data are unique.
Unlike the mean and the median, the mode can be calculated for both qualitative data and
quantitative data. For example, if the data set is: red, red, red, green, green, yellow, purple, black,
blue, the mode is red.
• Use the count and mode.mult functions to determine the number of modes in the data.
Enter count(mode.mult(array)) into a cell, where array is the array or cell range containing
the data. This formula will output the number of modes present in the data.
• If the output from the count(mode.mult(array)) function is 1, then the data has a single
mode. To find the single mode, use the mode.sngl(array) function, where array is the array
or cell range containing the data. The output from the mode.sngl function is the value of
the single mode in the data.
◦ Visit the Microsoft page for more information about the mode.sngl function.
• If the output from the count(mode.mult(array)) function is greater than 1, then the data
contains multiple modes. To find the multiple modes:
◦ Left click on a cell, hold and drag down to highlight a number of vertical cells equal to
the number of modes in the data. For example, if there are 4 modes in the data,
highlight 4 cells in the vertical array.
◦ In the highlighted cells, enter the mode.mult(array) function, where array is the
array or cell range containing the data.
◦ After entering the mode.mult function in the vertical array, press
CTRL+SHIFT+ENTER. Because the output from this function is an array, we must
press CTRL+SHIFT+ENTER (and not ENTER) to produce the array output.
◦ The output from the mode.mult function is the set of modes in the data.
◦ Visit the Microsoft page for more information about the mode.mult function.
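As a cross-check outside Excel, the same two-step logic (count the modes, then list them) can be sketched in Python. This is only an illustration: the function name find_modes is hypothetical, and returning an empty list when every value is unique is an assumption chosen to match the "no mode" case described above.

```python
from collections import Counter

def find_modes(data):
    """Return every value that occurs with the highest frequency.

    Returns an empty list when all values are unique (no mode).
    Hypothetical helper for illustration; it is not an Excel function.
    """
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # every observation is unique, so there is no mode
    return sorted(value for value, freq in counts.items() if freq == highest)

# Qualitative data from the text: the mode is also defined for categories
colours = ["red", "red", "red", "green", "green",
           "yellow", "purple", "black", "blue"]
modes = find_modes(colours)
print(len(modes), modes)  # number of modes, then the modes themselves
```

The length of the returned list plays the role of count(mode.mult(array)), and the list itself plays the role of the mode.sngl or mode.mult output.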
EXAMPLE
50 53 59 59 63 63 72 72 72 72
72 76 78 81 83 84 84 84 90 93
Solution:
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A20.
Start by using the count function to count the number of modes in the data:
count(mode.mult(A1:A20)) returns 1
Because the output from the count(mode.mult(…)) function is 1, there is only 1 mode in the data.
To find the single mode, we use the mode.sngl function:
mode.sngl(A1:A20) returns 72
By examining the data, we can see that 72 is the most frequently occurring value (5 times) and that
72 is the only value that occurs 5 times.
TRY IT
The numbers of books checked out from the library by 25 students are as follows:
0 0 0 1 2
3 3 4 4 5
5 7 7 7 7
8 8 8 9 10
10 11 11 12 12
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A25.
Start by using the count function to count the number of modes in the data:
count(mode.mult(A1:A25)) returns 1
Because the output from the count(mode.mult(…)) function is 1, there is only 1 mode in the data.
To find the single mode, we use the mode.sngl function:
mode.sngl(A1:A25) returns 7
EXAMPLE
AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody
drug are as follows (smallest to largest):
3 4 8 8 10 11 12 13 14 15
15 16 16 17 17 18 21 22 22 24
24 25 26 26 27 27 29 29 31 32
33 33 34 34 35 37 40 44 44 47
Solution:
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A40.
Start by using the count function to count the number of modes in the data:
count(mode.mult(A1:A40)) returns 12
Because the output from the count(mode.mult(…)) function is 12, there are 12 modes in the data.
To find the multiple modes, we use the mode.mult function. Left click on a cell, hold and drag
down to highlight 12 vertical cells. In the highlighted cells, enter the mode.mult function:
mode.mult(A1:A40)
Because the output from the mode.mult function is a (vertical) array, press
CTRL+SHIFT+ENTER (not ENTER by itself) after entering the function.
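To see which 12 values are the modes without Excel, one option is Python's standard-library statistics.multimode, which returns every value tied for the highest frequency. This is a cross-check under the assumption that the 40 values are typed in exactly as listed above; it is not the book's Excel procedure.

```python
from statistics import multimode

# Months survived, copied from the example above
months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15,
          15, 16, 16, 17, 17, 18, 21, 22, 22, 24,
          24, 25, 26, 26, 27, 27, 29, 29, 31, 32,
          33, 33, 34, 34, 35, 37, 40, 44, 44, 47]

modes = multimode(months)  # values tied for the highest frequency
print(len(modes))  # 12
print(modes)       # [8, 15, 16, 17, 22, 24, 26, 27, 29, 33, 34, 44]
```

Each of these 12 values occurs twice and no value occurs three times, which is why the data has so many modes.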
TRY IT
645 680 700 720 517 630 598 739 720 680
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A10.
Start by using the count function to count the number of modes in the data:
count(mode.mult(A1:A10)) returns 2
Because the output from the count(mode.mult(…)) function is 2, there are 2 modes in the data. To
find the multiple modes, we use the mode.mult function. Left click on a cell, hold and drag down
to highlight 2 vertical cells. In the highlighted cells, enter the mode.mult function:
mode.mult(A1:A10)
Because the output from the mode.mult function is a (vertical) array, press
CTRL+SHIFT+ENTER (not ENTER by itself) after entering the function.
Watch this video: Finding mean, median, and mode | Descriptive statistics | Probability and Statistics | Khan Academy by
Khan Academy [3:54] (transcript available).
When only grouped data is available, we do not know the individual data values (we only know
intervals and interval frequencies). Therefore, we cannot compute an exact mean for the data
set. What we must do is estimate the actual mean by calculating the mean of a frequency table.
A frequency table is a data representation in which grouped data is displayed along with the
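corresponding frequencies. To calculate the mean from a grouped frequency table we can apply the
basic definition of mean:
\begin{eqnarray*} \mbox{mean} & = & \frac{\mbox{sum of the data values}}{\mbox{number of data values}} \end{eqnarray*}
We simply need to modify the definition to fit within the restrictions of a frequency table.
Because we do not know the individual data values, we use the midpoint of each interval. The
midpoint of an interval is
\begin{eqnarray*} \mbox{midpoint} & = & \frac{\mbox{lower boundary} + \mbox{upper boundary}}{2} \end{eqnarray*}
The mean of the frequency table is then
\begin{eqnarray*} \mbox{mean of frequency table} & = & \frac{\sum f m}{\sum f} \end{eqnarray*}
where f is the frequency of the interval and m is the midpoint of the interval.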
EXAMPLE
A frequency table displaying Professor Blount’s last statistics test is shown. Find the best estimate of
the class mean.
Grade Interval    Number of Students
50–56.5 1
56.5–62.5 0
62.5–68.5 4
68.5–74.5 4
74.5–80.5 2
80.5–86.5 3
86.5–92.5 4
92.5–98.5 1
Solution:
Grade Interval    Midpoint
50–56.5 53.25
56.5–62.5 59.5
62.5–68.5 65.5
68.5–74.5 71.5
74.5–80.5 77.5
80.5–86.5 83.5
86.5–92.5 89.5
92.5–98.5 95.5
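The remaining arithmetic (multiply each midpoint by its frequency, add the products, and divide by the total frequency) can be sketched in Python. The function name grouped_mean and the tuple layout are hypothetical choices made for this illustration.

```python
# (lower boundary, upper boundary, frequency) for each grade interval
intervals = [
    (50.0, 56.5, 1), (56.5, 62.5, 0), (62.5, 68.5, 4), (68.5, 74.5, 4),
    (74.5, 80.5, 2), (80.5, 86.5, 3), (86.5, 92.5, 4), (92.5, 98.5, 1),
]

def grouped_mean(rows):
    """Estimate the mean as sum(f * m) / sum(f), where m is the interval midpoint."""
    total_fm = sum(f * (low + high) / 2 for low, high, f in rows)
    total_f = sum(f for _, _, f in rows)
    return total_fm / total_f

print(round(grouped_mean(intervals), 2))  # 76.86
```

Here the sum of the products is 1460.25 and the total frequency is 19, so the estimated class mean is 1460.25 / 19, or about 76.86.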
TRY IT
Maris conducted a study on the effect that playing video games has on memory recall. As part of her
study, she compiled the following data:
Hours Playing Video Games    Frequency
0–3.5 3
3.5–7.5 7
7.5–11.5 12
11.5–15.5 7
15.5–19.5 9
What is the best estimate for the mean number of hours spent playing video games?
The measures of central tendency tell us about the center of the data, but often give different
answers. So how do we know when to use each? Here are some general rules:
1. The mean is the most frequently used measure of central tendency and is generally
considered the best measure of central location.
2. The median is the preferred measure of central tendency when:
a. There are a few extreme values or outliers in the distribution of the data. (Note:
Remember that a single outlier can have a great effect on the mean.)
b. There are some missing or undetermined values in the data.
c. There is an open-ended distribution. (For example, if you have a data field which
measures the number of children and your options are 0, 1, 2, 3, 4, 5 or “6 or more,” then
the “6 or more” field is open-ended and makes calculating the mean impossible because
we do not know the exact values for this field.)
d. You have data measured on an ordinal scale.
3. The mode is the preferred measure when data are measured on a nominal or ordinal scale.
Concept Review
The mean, the median, and the mode are measures of the “center” of a data set. The mean is
the best estimate for the actual data set, but the median is the best measurement when a data set
contains several outliers or extreme values. The mode tells us the most frequently occurring datum
(or data) in our data set. The mean, median, and mode are extremely helpful when we need to
analyze our data, but if the data set consists of ranges which lack specific values, the mean may be
impossible to calculate. However, the mean of grouped data can be approximated by multiplying
the midpoint of each interval with the frequency, adding up these values and then dividing by the
total number of values in the data set.
Attribution
“2.5 Measures of the Center of the Data” in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
2.4 SKEWNESS AND THE MEAN, MEDIAN, AND MODE
LEARNING OBJECTIVES
4 5 6 6 6 7 7 7
7 7 7 8 8 8 9 10
This data set can be represented by the following histogram. Each interval has width one, and each
value is located in the middle of an interval.
Figure 1
4 5 6 6 6 7 7 7 7 8
This data set can be represented by the following histogram. Each interval has width one, and each
value is located in the middle of an interval.
Figure 2
The histogram above is not symmetrical. The right-hand side seems “chopped off” compared to
the left side. A distribution of this type is called skewed to the left because it is pulled out to the
left. The mean of this data is 6.3, the median is 6.5, and the mode is 7. Notice that the mean is
less than the median, and they are both less than the mode. The mean and the median both
reflect the skewing, but the mean reflects it more so.
Consider the following data set:
6 7 7 7 7 8 8 8 9 10
This data set can be represented by the following histogram. Each interval has width one, and each
value is located in the middle of an interval.
Figure 3
The histogram above is also not symmetrical. In this case, the data is skewed to the right. The
mean for this data is 7.7, the median is 7.5, and the mode is 7. Of the three statistics, the mean is
the largest, while the mode is the smallest. Again, the mean reflects the skewing the most.
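The three data sets in this section can be compared side by side with a short Python sketch using the standard-library statistics module. The labels in the dictionary are informal descriptions taken from the discussion of the three histograms; they are not output of any function.

```python
from statistics import mean, median, mode

data_sets = {
    "symmetric":    [4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10],
    "skewed left":  [4, 5, 6, 6, 6, 7, 7, 7, 7, 8],
    "skewed right": [6, 7, 7, 7, 7, 8, 8, 8, 9, 10],
}

# Print the mean, median, and mode of each data set
for label, values in data_sets.items():
    print(label, mean(values), median(values), mode(values))
```

For the first data set all three measures equal 7. For the second the mean (6.3) is below the median (6.5), which is below the mode (7). For the third the order reverses, with the mean (7.7) above the median (7.5) and the mode (7).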
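To summarize: in a distribution that is skewed to the left, the mean is generally less than the
median, which is generally less than the mode. In a distribution that is skewed to the right, the
mode is generally less than the median, which is generally less than the mean.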
Skewness and symmetry become important when we discuss probability distributions in later
chapters.
Watch this video: Elementary Business Statistics | Skewness and the Mean, Median, and Mode by Janux [3:57] (transcript
available).
EXAMPLE
Statistics are used to compare and sometimes identify authors. The following lists show a simple
random sample that compares the letter counts for three authors.
Terry
7 9 3 3 3 4 1 3 2 2
Davis
3 3 3 4 1 4 3 2 3 1
Maris
2 3 4 4 4 6 6 6 8 3
1. Make a dot plot for the three authors and compare the shapes.
2. Calculate the mean for each.
3. Calculate the median for each.
4. Describe any pattern you notice between the shape and the measures of center.
Solution:
a.
Terry’s distribution has a right (positive) skew.
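Parts 2 and 3 of this example can be checked with a few lines of Python. This is a sketch that assumes the three samples are entered exactly as listed above; it does not replace the dot plots asked for in part 1.

```python
from statistics import mean, median

letter_counts = {
    "Terry": [7, 9, 3, 3, 3, 4, 1, 3, 2, 2],
    "Davis": [3, 3, 3, 4, 1, 4, 3, 2, 3, 1],
    "Maris": [2, 3, 4, 4, 4, 6, 6, 6, 8, 3],
}

# Print the mean and median letter count for each author
for author, counts in letter_counts.items():
    print(author, mean(counts), median(counts))
```

With these samples, the means are 3.7, 2.7, and 4.6 and the medians are 3, 3, and 4. Terry's mean sits above the median, which is consistent with the right skew noted above.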
Concept Review
Looking at the distribution of data can reveal a lot about the relationship between the mean, the
median, and the mode. There are three types of distributions. A right (or positive) skewed
distribution has a shape like Figure 3. A left (or negative) skewed distribution has a shape like
Figure 2. A symmetrical distribution looks like Figure 1.
Attribution
“2.6 Skewness and the Mean, Median, and Mode” in Introductory Statistics by OpenStax is licensed
under a Creative Commons Attribution 4.0 International License.
2.5 MEASURES OF LOCATION
LEARNING OBJECTIVES
• Recognize, describe, calculate, and interpret the measures of location of data: quartiles and
percentiles.
The common measures of location are quartiles and percentiles. Previously, we learned that the
median is a number that measures the “center” of the data. But the median can also be thought of
as a measure of location because the median is the “middle value” of a set of data. The median is a
number that separates ordered data into halves. Half of the values in the data are the same number
or smaller than the median and half of the values in the data are the same number or larger.
For example, consider the following data, already ordered from smallest to largest:
1 1 2 2 4 6 6.8
7.2 8 8.3 9 10 10 11.5
Because there are 14 observations, the median is between the seventh value, 6.8, and the eighth
value, 7.2. To find the median, add the two values together and divide by two:
\begin{eqnarray*} \frac{6.8 + 7.2}{2} & = & 7 \end{eqnarray*}
The median is seven. We can see that half (or 50%) of the values are less than seven and half (or
50%) of the values are larger than seven.
The median is an example of both a quartile and a percentile. The median is also the second
quartile, Q_2, and the 50th percentile, P_50.
Quartiles
Quartiles are numbers that separate the data into quarters (four parts). Like the median, quartiles
may or may not be an actual value in the set of data. To find the quartiles, order the data (from
smallest to largest) and then find the median or second quartile. The first quartile, Q_1, is the
middle value of the lower half of the data and the third quartile, Q_3, is the middle value of the
upper half of the data. To get the idea, consider the same (ordered) data set used above:
1 1 2 2 4 6 6.8
7.2 8 8.3 9 10 10 11.5
The median or second quartile is seven. The lower half of the data are:
1 1 2 2 4 6 6.8
The middle value of the lower half of the data is 2. The number 2, which is part of the data, is the
first quartile, Q_1. One-fourth (or 25%) of the entire set of values are the same as or less than 2
and three-fourths (or 75%) of the values are more than 2.
The upper half of the data are:
7.2 8 8.3 9 10 10 11.5
The middle value of the upper half of the data is 9. The third quartile, Q_3, is 9. Three-fourths (or
75%) of the values are less than 9. One-fourth (or 25%) of the values in the data set are greater than
or equal to 9.
The interquartile range (IQR) is a number that indicates the spread of the middle half or the middle
50% of the data. It is the difference between the third quartile (Q_3) and the first quartile (Q_1):
\begin{eqnarray*} IQR & = & Q_3 - Q_1 \end{eqnarray*}
The IQR can help to determine potential outliers. A value is suspected to be a potential outlier
if it is less than Q_1 - 1.5 × IQR or greater than Q_3 + 1.5 × IQR, that is, if it lies more than
1.5 × IQR below the first quartile or above the third quartile. Potential outliers always require
further investigation.
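The median-of-halves approach and the 1.5 × IQR check described above can be sketched in Python. Two assumptions are labeled in the code: the function quartiles_by_halves is a hypothetical helper, and the 14 values are assumed to be the ordered data set used in this section's discussion.

```python
from statistics import median

def quartiles_by_halves(data):
    """Return Q1, Q2, Q3 using the medians of the lower and upper halves.

    For an odd number of observations the middle value is left out of
    both halves (an assumption; conventions differ between textbooks).
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return median(lower), median(ordered), median(upper)

# Assumed to be the 14 ordered values of this section's example
values = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]

q1, q2, q3 = quartiles_by_halves(values)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < lower_fence or x > upper_fence]
print(q1, q2, q3, iqr, outliers)
```

With these values the quartiles are 2, 7, and 9, so the IQR is 7 and the fences are -8.5 and 19.5. No value falls outside the fences, so no potential outliers are flagged.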
NOTE
A potential outlier is a data point that is significantly different from the other data points. These
special data points may be errors, some kind of abnormality, or they may be a key to
understanding the data.
Watch this video: Median, Quartiles and Interquartile Range by ExamSolutions [12:35] (transcript available).
To calculate a quartile in Excel, use the quartile.exc(array, quartile number) function:
• For array, enter the array or cell range containing the data.
• For quartile number, enter the quartile (1, 2 or 3) being calculated.
The output from the quartile.exc function is the value of the corresponding quartile. For
example, quartile.exc(array,1) returns the value of the first quartile where 25% of the observations
in the data are (strictly) less than the value of the first quartile.
Visit the Microsoft page for more information about the quartile.exc function.
NOTE
We are using the quartile.exc function, and not the quartile.inc function, because we want the
percent of the observations in the data to be strictly less than the value of the quartile.
Watch this video: How To Find Quartiles and Construct a Box Plot in Excel by Joshua Emmanuel [4:12] (transcript
available).
EXAMPLE
For the following 13 real estate prices, calculate the three quartiles and the IQR. Determine if any
prices are potential outliers. The prices are in dollars.
Solution:
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A13.
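First quartile: quartile.exc(A1:A13,1) returns 308,750
Second quartile: quartile.exc(A1:A13,2)
Third quartile: quartile.exc(A1:A13,3) returns 649,000
\begin{eqnarray*} IQR & = & Q_3 - Q_1 = 649,000 - 308,750 = 340,250 \\ \\ 1.5 \times IQR & = &
1.5 \times 340,250 = 510,375 \\ \\ Q_1 - 1.5 \times IQR & = & 308,750 - 510,375 = -201,625 \\ \\
Q_3 + 1.5 \times IQR & = & 649,000 + 510,375 = 1,159,375 \end{eqnarray*}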
NOTE
Quartiles have the same units as the data. In this case, the data is measured in dollars, so the
quartiles are also in dollars.
TRY IT
For the following 11 salaries, calculate the three quartiles and the IQR. Are any of the salaries
outliers? The salaries are in dollars.
54,000 42,000
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A11.
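First quartile: quartile.exc(A1:A11,1) returns 40,500
Second quartile: quartile.exc(A1:A11,2)
Third quartile: quartile.exc(A1:A11,3) returns 69,000
\begin{eqnarray*} IQR & = & Q_3 - Q_1 = 69,000 - 40,500 = 28,500 \\ \\ 1.5 \times IQR & = &
1.5 \times 28,500 = 42,750 \\ \\ Q_1 - 1.5 \times IQR & = & 40,500 - 42,750 = -2,250 \\ \\
Q_3 + 1.5 \times IQR & = & 69,000 + 42,750 = 111,750 \end{eqnarray*}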
TRY IT
Find the interquartile range for the following two data sets and compare them.
Test scores for Class A:
69 96 81 79 65 76 83 99 89 67
90 77 85 98 66 91 77 69 80 94
Test scores for Class B:
90 72 80 92 90 97 92 75 79 68
70 80 99 95 78 73 71 68 95 100
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data for Class A
into column A from cell A1 to A20 and the data for Class B into column B from cell B1 to B20.
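Class A
First quartile: quartile.exc(A1:A20,1)
Third quartile: quartile.exc(A1:A20,3)
Class B
First quartile: quartile.exc(B1:B20,1)
Third quartile: quartile.exc(B1:B20,3)
The data for Class B has a larger IQR, so the scores between Q_1 and Q_3 (the middle 50% of the
data) for Class B are more spread out and not clustered about the median.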
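As a cross-check outside Excel, the two interquartile ranges can be computed with Python's statistics.quantiles. Its "exclusive" method interpolates at position p × (n + 1) in the ordered data, which is the rule quartile.exc is documented to use, so the results are assumed (not guaranteed here) to agree with Excel. The helper name iqr_exclusive is hypothetical.

```python
from statistics import quantiles

class_a = [69, 96, 81, 79, 65, 76, 83, 99, 89, 67,
           90, 77, 85, 98, 66, 91, 77, 69, 80, 94]
class_b = [90, 72, 80, 92, 90, 97, 92, 75, 79, 68,
           70, 80, 99, 95, 78, 73, 71, 68, 95, 100]

def iqr_exclusive(data):
    """IQR from quartiles interpolated at position p * (n + 1)."""
    q1, _, q3 = quantiles(data, n=4, method="exclusive")
    return q3 - q1

print(iqr_exclusive(class_a), iqr_exclusive(class_b))
```

Under this rule Class A has quartiles 70.75 and 90.75 (IQR of 20) and Class B has quartiles 72.25 and 94.25 (IQR of 22), which matches the conclusion that Class B's middle 50% is more spread out.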
Percentiles
Percentiles are numbers that separate the (ordered) data into hundredths (100 parts). Like
quartiles, percentiles may or may not be part of the data. The kth percentile, P_k, is the value where
k% of the observations in the data are less than the value of the kth percentile. To score in the
90th percentile of an exam does not mean, necessarily, that you received 90% on a test. The 90th
percentile means that 90% of test scores are less than your score and 10% of the test scores are the
same or greater than your test score.
Quartiles are special percentiles. The first quartile, Q_1, is the same as the 25th percentile and
the third quartile, Q_3, is the same as the 75th percentile. The median is the 50th percentile.
Percentiles are useful for comparing values. For this reason, universities and colleges use
percentiles extensively. One instance in which colleges and universities use percentiles is when
SAT results are used to determine a minimum testing score that will be used as an acceptance
factor. For example, suppose Duke accepts SAT scores at or above the 75th percentile. That
translates into an SAT score of at least 1220.
Percentiles are mostly used with very large data sets. Therefore, if you were to say that 90% of
the test scores are less (and not the same or less) than your score, it would be acceptable because
removing one particular data value is not significant.
To calculate a percentile in Excel, use the percentile.exc(array, percent) function:
• For array, enter the array or cell range containing the data.
• For percent, enter the percentile (as a decimal) being calculated. For example, if we are
calculating the 60th percentile, we would enter 0.6 for the percent in the percentile.exc
function.
The output from the percentile.exc function is the value of the corresponding percentile. For
example, percentile.exc(array,0.6) returns the value of the 60th percentile where 60% of the
observations in the data are (strictly) less than the value of the 60th percentile.
Visit the Microsoft page for more information about the percentile.exc function.
NOTE
We are using the percentile.exc function, and not the percentile.inc function, because we want
the percent of the observations in the data to be strictly less than the value of the percentile.
Watch this video: Percentiles – How to calculate Percentiles, Quartiles, … by Joshua Emmanuel [3:43] (transcript
available).
EXAMPLE
Listed are twenty-nine ages (in years) for trees found in the Saint Louis Botanical Garden.
18 21 22 25 26 27 29 30 31 33
36 37 41 42 47 52 55 57 58 62
64 67 69 71 72 73 74 76 77
Solution:
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A29.
70th percentile: percentile.exc(A1:A29,0.7)
83rd percentile: percentile.exc(A1:A29,0.83)
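To make the exclusive percentile rule concrete, here is a minimal Python sketch that locates position p × (n + 1) in the ordered data and interpolates linearly between neighbouring values. The function name percentile_exc is borrowed from Excel for readability only; this is a hand-written approximation of the documented rule, not Excel's own code.

```python
import math

def percentile_exc(data, p):
    """Value at rank p * (n + 1) in the ordered data, linearly interpolated.

    Sketch of the exclusive rule; p must satisfy 1/(n+1) <= p <= n/(n+1).
    """
    ordered = sorted(data)
    n = len(ordered)
    rank = round(p * (n + 1), 9)  # 1-based position; rounding guards against float noise
    if rank < 1 or rank > n:
        raise ValueError("p is outside the range the exclusive rule supports")
    lower = math.floor(rank)
    fraction = rank - lower
    if fraction == 0:
        return ordered[lower - 1]  # rank lands exactly on a data value
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33,
        36, 37, 41, 42, 47, 52, 55, 57, 58, 62,
        64, 67, 69, 71, 72, 73, 74, 76, 77]

print(percentile_exc(ages, 0.70), round(percentile_exc(ages, 0.83), 2))
```

For the 29 ages, the 70th percentile sits at position 21 (the value 64) and the 83rd percentile at position 24.9, nine-tenths of the way from 71 to 72, or about 71.9 years.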
NOTE
Percentiles have the same units as the data. In this case, the data is measured in years, so the
percentiles are also in years.
TRY IT
Listed are 29 ages (in years) for Academy Award winning best actors.
18 21 22 25 26 27 29 30 31 33
36 37 41 42 47 52 55 57 58 62
64 67 69 71 72 73 74 76 77
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A29.
20th percentile: percentile.exc(A1:A29,0.2)
55th percentile: percentile.exc(A1:A29,0.55)
A percentile indicates the relative standing of a data value when data are sorted into numerical
order from smallest to largest. A given percentage of the data values is less than the value of the
corresponding percentile. For example, 15% of the data values are less than the value of the 15th percentile. Note
that low percentiles always correspond to lower data values and high percentiles always correspond
to higher data values.
A percentile may or may not correspond to a value judgment about whether it is “good” or “bad.”
The interpretation of whether a certain percentile is “good” or “bad” depends on the context of
the situation to which the data applies. In some situations, a low percentile would be considered
“good,” but in other contexts a high percentile might be considered “good”. In many situations,
there is no value judgment that applies.
Understanding how to interpret percentiles or quartiles properly is important not only when
describing data, but also when calculating probabilities in later chapters of this text. When writing
the interpretation of a percentile or quartile in the context of the given data, the sentence should
contain the following information:
EXAMPLE
On a timed math test, the first quartile for the time it took to finish the exam was 35 minutes.
Interpret the first quartile in the context of this situation.
Solution:
TRY IT
For the 100-meter dash, the third quartile for times for finishing the race was 11.5 seconds. Interpret
the third quartile in the context of the situation.
• Interpretation: 75% of runners finished the race in less than 11.5 seconds.
• In this context, a lower percentile is good because finishing a race more quickly is desirable.
EXAMPLE
On a 20 question math test, the 70th percentile for the number of correct answers was 16. Interpret
the 70th percentile in the context of this situation.
Solution:
TRY IT
On a 60 point written assignment, the 80th percentile for the number of points earned was 49.
Interpret the 80th percentile in the context of this situation.
EXAMPLE
At a community college, it was found that the 30th percentile of credit units that students are
enrolled for is 7 units. Interpret the 30th percentile in the context of this situation.
Solution:
TRY IT
During a season, the 40th percentile for points scored per player in a game is 8. Interpret the 40th
percentile in the context of this situation.
Concept Review
The values that divide an ordered set of data into 100 equal parts are called percentiles. Percentiles
are used to compare and interpret data. For example, an observation at the 50th percentile would
be greater than 50% of the other observations in the set.
Quartiles divide data into quarters. The first quartile, Q_1, is the 25th percentile, the second
quartile, Q_2, is the 50th percentile, and the third quartile, Q_3, is the 75th percentile. The
interquartile range, IQR, is the range of the middle 50% of the data values. The IQR is found by
subtracting Q_1 from Q_3, and can help determine outliers by using the following two expressions:
• Q_1 - 1.5 × IQR
• Q_3 + 1.5 × IQR
Attribution
“2.3 Measures of the Location of the Data” in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
2.6 MEASURES OF DISPERSION
LEARNING OBJECTIVES
• Recognize, describe, calculate, and analyze the measures of the spread of data: variance,
standard deviation, and range.
It can be misleading to only use the measures of central tendency (mean, median, mode) to
describe a data set. Measures of central tendency describe the center of a distribution. Measures
of dispersion or variability are used to describe the spread or dispersion of the data. So far in this
chapter, we have already seen a measure of dispersion—the interquartile range. The interquartile
range describes the spread of the middle 50% of the data. But there are other measures of
dispersion, including range, variance, and standard deviation.
Range
The range is the difference between the largest and smallest value in a set of data:
\begin{eqnarray*} \mbox{Range} & = & \mbox{largest value} - \mbox{smallest value} \end{eqnarray*}
Range is not a very good measure of variability because it is based on only two values in the data
set (the largest and smallest values) and is highly influenced by outliers. Also, the range does not
help us distinguish between two data sets with the same largest and smallest values because the
two data sets will have the same range.
EXAMPLE
AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody
drug are as follows:
3 4 8 8 10 11 12 13 14 15
15 16 16 17 17 18 21 22 22 24
24 25 26 26 27 27 29 29 31 32
33 33 34 34 35 37 40 44 44 47
Solution:
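Assuming this example asks for the range of the data (the question itself is not shown above), a quick Python check of the largest value minus the smallest value looks like this:

```python
# Months survived, copied from the example above
months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15,
          15, 16, 16, 17, 17, 18, 21, 22, 22, 24,
          24, 25, 26, 26, 27, 27, 29, 29, 31, 32,
          33, 33, 34, 34, 35, 37, 40, 44, 44, 47]

data_range = max(months) - min(months)  # largest value minus smallest value
print(data_range)  # 44
```

The largest value is 47 months and the smallest is 3 months, so the range is 44 months.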
An important characteristic of any set of data is the variation in the data from the mean. In some
data sets, the data values are concentrated close to the mean, but in other data sets, the data values
are more widely spread out from the mean. The most common measure of variation, or spread,
is the standard deviation. The standard deviation is a number that measures, on average, how
far data values are from their mean. The standard deviation provides a numerical measure of the
overall amount of variation in a data set, and can be used to determine whether a particular data
value is close to or far away from the mean.
The standard deviation provides a measure of the overall variation in a data set. The standard
deviation is always positive or zero. The standard deviation is small when the data are all
concentrated close to the mean because there is little variation or spread in the data. The standard
deviation is larger when the data values are more spread out from the mean because there is a
lot of variation in the data. The lower case letter s represents the sample standard deviation and the
Greek letter σ (sigma) represents the population standard deviation.
Suppose that we are studying the amount of time customers wait in line at the checkout at
supermarket A and supermarket B. The mean wait time at both supermarkets is five minutes. At
supermarket A the standard deviation for the wait time is two minutes and at supermarket B the
standard deviation for the wait time is four minutes. Because supermarket B has a higher standard
deviation, we know that there is more variation in the wait times at supermarket B. Overall, wait
times at supermarket B are more spread out from the mean and wait times at supermarket A are
more concentrated near the mean.
As well, the standard deviation can be used to determine whether a data value is close to or far
from the mean. For example, suppose that Rosa and Binh both shop at supermarket A where the
mean wait time at the checkout is five minutes and the standard deviation is two minutes. Suppose
Rosa’s wait time is seven minutes and Binh’s wait time is one minute:
• Rosa’s wait time of seven minutes is two minutes longer than the mean of five minutes.
Because two minutes is equal to one standard deviation, Rosa’s wait time of seven minutes is
one standard deviation above the mean of five minutes.
• Binh’s wait time of one minute is four minutes less than the mean of five minutes.
Because four minutes is equal to two standard deviations, Binh’s wait time of one minute is
two standard deviations below the mean of five minutes.
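The Rosa and Binh comparison amounts to dividing each deviation from the mean by the standard deviation. A small Python sketch follows; the function name deviations_from_mean is hypothetical and used only for this illustration.

```python
def deviations_from_mean(value, mean, std_dev):
    """Signed number of standard deviations between a value and the mean."""
    return (value - mean) / std_dev

# Supermarket A: mean wait of 5 minutes, standard deviation of 2 minutes
print(deviations_from_mean(7, 5, 2))  # 1.0  (Rosa: one standard deviation above)
print(deviations_from_mean(1, 5, 2))  # -2.0 (Binh: two standard deviations below)
```

A positive result means the value is above the mean and a negative result means it is below the mean.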
A data value that is two standard deviations from the mean is just on the borderline for what many
statisticians would consider to be far from the mean. Considering data to be far from the mean if it
is more than two standard deviations away is more of an approximate “rule of thumb” than a rigid
rule. In general, the shape of the distribution of the data affects how much of the data is further
away than two standard deviations.
If x is a number, then the difference “x – mean” is called its deviation from the mean. In a
data set, there are as many deviations as there are items in the data set. The deviations are used to
calculate the standard deviation. If the numbers belong to a population, in symbols a deviation is
x − μ. For sample data, in symbols a deviation is x − x̄.
The procedure to calculate the standard deviation depends on whether the numbers are the
entire population or are data from a sample. The calculations are similar, but not identical.
Therefore the symbol used to represent the standard deviation depends on whether it is calculated
from a population or a sample. The lower case letter s represents the sample standard deviation
and the Greek letter σ represents the population standard deviation. If the sample has the same
characteristics as the population, then s should be a good estimate of σ.
To calculate the standard deviation, we need to calculate the variance first. The variance is the
average of the squares of the deviations (the x − x̄ values for a sample or the x − μ values
for a population). The symbol σ² represents the population variance and the population standard
deviation σ is the square root of the population variance. The symbol s² represents the sample
variance and the sample standard deviation s is the square root of the sample variance. The
standard deviation can be thought of as a special average of the deviations.
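To calculate a population standard deviation σ:
\begin{eqnarray*} \sigma & = & \sqrt{\frac{\sum (x - \mu)^2}{N}} \end{eqnarray*}
To calculate a sample standard deviation s:
\begin{eqnarray*} s & = & \sqrt{\frac{\sum (x - \overline{x})^2}{n - 1}} \end{eqnarray*}
where N is the number of values in the population and n is the number of values in the sample.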
Watch this video: How to calculate Standard Deviation and Variance by statisticsfun [5:04] (transcript available).
• If the data is population data, use the var.p(array) function where array is the array or cell
range containing the data. The output from the var.p function is the population variance.
◦ Visit the Microsoft page for more information about the var.p function.
• If the data is sample data, use the var.s(array) function where array is the array or cell range
containing the data. The output from the var.s function is the sample variance.
◦ Visit the Microsoft page for more information about the var.s function.
NOTE
There are two different functions to calculate variance in Excel because variance is calculated
differently depending on whether the data is from a sample or from a population. When calculating
variance, make sure that you are using the correct function based on the type of data you are
working with (sample or population).
• If the data is population data, use the stdev.p(array) function where array is the array or cell
range containing the data. The output from the stdev.p function is the population standard
deviation.
◦ Visit the Microsoft page for more information about the stdev.p function.
• If the data is sample data, use the stdev.s(array) function where array is the array or cell
range containing the data. The output from the stdev.s function is the sample standard
deviation.
◦ Visit the Microsoft page for more information about the stdev.s function.
NOTE
There are two different functions to calculate standard deviation in Excel because standard
deviation is calculated differently depending on whether the data is from a sample or from a
population. When calculating standard deviation, make sure that you are using the correct function
based on the type of data you are working with (sample or population).
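The same sample-versus-population split exists in Python's standard-library statistics module, which can serve as a cross-check on the Excel results. The pairing in the comments (for example, pvariance with var.p) reflects the usual definitions and is offered as an assumption rather than a guarantee of identical output in every case. The eight values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from statistics import pvariance, variance, pstdev, stdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, for illustration only

print(pvariance(values))  # population variance (divides by N), like var.p
print(variance(values))   # sample variance (divides by n - 1), like var.s
print(pstdev(values))     # population standard deviation, like stdev.p
print(stdev(values))      # sample standard deviation, like stdev.s
```

For these values the squared deviations from the mean of 5 add up to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7, or about 4.57. The sample versions are always a little larger because they divide by n - 1 instead of N.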
Watch this video: Range, Variance, Standard Deviation in Excel by Joshua Emmanuel [1:10] (transcript available).
EXAMPLE
In a fifth grade class, the teacher was interested in the standard deviation of the ages of her students.
The following data are the ages, in years, for a sample of 20 fifth grade students. The ages are
rounded to the nearest half year:
Calculate the mean, the variance, and the standard deviation of the ages of the students. Interpret
the standard deviation.
Solution:
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A20.
On average, the age of any fifth grader is 0.7159 years away from the mean of 10.525 years.
NOTES
1. We are using the var.s (not var.p) and stdev.s (not stdev.p) functions to calculate the
variance and standard deviation because the data is from a sample.
2. Standard deviation has the same units as the data. In this case, the data is measured in years,
so the standard deviation is also in years.
3. Variance is measured in squared units (here, years squared), so it is harder to interpret directly
than the standard deviation.
TRY IT
On a baseball team, the ages, in years, of each of the players are as follows:
21 21 22 23 24
24 25 25 28 29
28 31 32 33 33
34 35 36 36 36
36 38 38 38 40
\begin{align*} \mu &= 30.64 \text{ years} \\ \sigma &= 5.99 \text{ years} \end{align*}
NOTE
We are using the var.p (not var.s) and stdev.p (not stdev.s) functions to calculate the variance and
standard deviation because the baseball team is a population.
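As a cross-check outside Excel, the same population calculation can be sketched in Python. This is only an illustrative alternative, not part of the Excel workflow this book uses; it assumes Python's standard statistics module, where pstdev plays the role of stdev.p and stdev plays the role of stdev.s.

```python
import statistics

# Ages of the 25 players, copied from the Try It above.
ages = [21, 21, 22, 23, 24,
        24, 25, 25, 28, 29,
        28, 31, 32, 33, 33,
        34, 35, 36, 36, 36,
        36, 38, 38, 38, 40]

mu = statistics.mean(ages)       # population mean
sigma = statistics.pstdev(ages)  # population standard deviation (divides by N)
s = statistics.stdev(ages)       # sample standard deviation (divides by n - 1), for comparison only

print(round(mu, 2), round(sigma, 2))
```

The rounded values match the 30.64 years and 5.99 years reported above. The sample version s comes out slightly larger, which is why choosing the wrong function changes the answer.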
NOTE
Your concentration should be on what the standard deviation tells you about the data. The
standard deviation is a number which measures how far the data are spread from the mean. Let a
calculator or computer do the arithmetic.
The standard deviation, s or σ, is either zero or a positive number. When the standard deviation is
zero, there is no dispersion; that is, all the data values are equal to each other. The standard
deviation is small when the data are all concentrated close to the mean and is larger when the data
values show more variation from the mean. When the standard deviation is a lot larger than zero,
the data values are very spread out about the mean. Outliers in the data can make s or σ very large.
The standard deviation, when first presented, can seem unclear. By graphing your data, you can
get a better “feel” for the deviations and the standard deviation. You will find that in symmetrical
distributions, the standard deviation can be very helpful but in skewed distributions the standard
deviation may not be much help. The reason is that the two sides of a skewed distribution have
different spreads. In a skewed distribution, it is better to look at the first quartile, the median, the
third quartile, the smallest value, and the largest value. Because numbers can be confusing, always
graph your data.
EXAMPLE
Use the following sample of exam scores from Susan Dean’s spring pre-calculus class:
33 42 49 49 53 55 55 61
63 67 68 68 69 69 72 73
74 78 80 83 88 88 88 90
92 94 94 94 94 96 100
• The mean.
• The standard deviation.
• The median.
• The first quartile.
• The third quartile.
• The IQR.
Solution:
Enter the data into an Excel spreadsheet. For this example, suppose we entered the data in column
A from cell A1 to A31.
Median: Field 1 = A1:A31; answer 73
First quartile: Field 1 = A1:A31, Field 2 = 1; answer 61
Third quartile: Field 1 = A1:A31, Field 2 = 3; answer 90
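The same summary statistics can be cross-checked outside Excel. The sketch below is an illustrative alternative in Python, not the book's Excel method. It assumes the standard statistics module (Python 3.8 or later) and assumes that the "exclusive" quantile method is the counterpart of Excel's quartile.exc, which is consistent with the quartile answers shown above.

```python
import statistics

# The 31 exam scores from the example above.
scores = [33, 42, 49, 49, 53, 55, 55, 61,
          63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90,
          92, 94, 94, 94, 94, 96, 100]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)      # sample standard deviation, since the scores are a sample
median = statistics.median(scores)

# Quartiles using the "exclusive" method; q2 is the median again.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="exclusive")
iqr = q3 - q1                      # interquartile range

print(round(mean, 1), round(sd, 1), median, q1, q3, iqr)
```

The median and quartiles agree with the answers of 73, 61, and 90 listed above.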
The standard deviation is useful when comparing data values that come from different data sets.
If the data sets have different means and different standard deviations, then comparing the data
values directly can be misleading. In order to directly compare values in different data sets, we can
compare how many standard deviations away from the mean of its data set a value is. This is done
by calculating the value’s z-score:
Sample: \[ z = \frac{x - \bar{x}}{s} \]
Population: \[ z = \frac{x - \mu}{\sigma} \]
EXAMPLE
Two students, John and Ali, are from different high schools and wanted to find out who had the
highest GPA when compared to their school. Which student had the highest GPA when compared to
their school?
Student GPA School Mean GPA School Standard Deviation
Ali 77 80 10
Solution:
For each student, determine how many standard deviations, the z-score, their GPA is away from the
mean of their school.
John: z = -0.21
Ali: z = (77 - 80)/10 = -0.3
John has the better GPA when compared to his school because his GPA is 0.21 standard deviations
below his school’s mean while Ali’s GPA is 0.3 standard deviations below her school’s mean, which
means that John’s GPA is closer to his school’s mean than Ali’s GPA is to hers.
NOTE
The sign of a z-score is important. A negative z-score tells us that x is below the mean. A positive
z-score tells us that x is above the mean. The absolute value of the z-score tells us how many
standard deviations away from the mean the value of x is.
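The z-score calculation is simple enough to sketch as a small helper. The function name z_score below is hypothetical and chosen only for illustration; the numbers are Ali's GPA, school mean, and school standard deviation from the example above.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

# Ali: GPA 77, school mean 80, school standard deviation 10.
ali_z = z_score(77, 80, 10)
print(ali_z)  # negative, so Ali's GPA is below her school's mean
```

The same helper works for sample or population data, as long as the mean and standard deviation passed in come from the same data set as the value.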
TRY IT
Two swimmers, Angie and Beth, are from different teams and wanted to find out who had the fastest
time for the 50 meter freestyle when compared to her team’s mean time. Which swimmer had the
fastest time when compared to her team?
Angie: z = -1.25
Beth: z = -2
Angie’s time is 1.25 standard deviations below her team’s mean time and Beth’s time is 2 standard
deviations below her team’s mean time. A lower time is a faster swim, so Beth had the faster time
when compared to her team because her time is further below her team’s mean than Angie’s time is
below hers.
The following lists give a few facts that provide a little more insight into what the standard deviation
tells us about the distribution of the data.
For ANY data set, no matter what the distribution of the data is:
Concept Review
The standard deviation measures the average spread of the data about the mean. There are
different equations to use depending on whether we are calculating the standard deviation of a
sample or of a population. The standard deviation allows us to compare individual data values to
the mean of the data numerically.
Attribution
“2.7 Measures of the Spread of the Data” in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
2.7 EXERCISES
1. In a survey, people were asked how many times they visited a store before making a major
purchase. The results are shown in the table. Construct a line graph.
2. In a survey, several people were asked how many years it has been since they purchased a
mattress. The results are shown in the table. Construct a line graph.
3. Several children were asked how many TV shows they watch each day. The results of the survey
are shown in the table. Construct a line graph.
4. The students in Ms. Ramirez’s math class have birthdays in each of the four seasons. The
table shows the four seasons, the number of students who have birthdays in each season, and the
percentage (%) of students in each group. Construct a bar graph showing the number of students.
Spring
Summer
Autumn
Winter
5. Using the data from Ms. Ramirez’s math class supplied in the tables, construct a bar graph
showing the percentages.
6. David County has six high schools. Each school sent students to participate in a county-
wide science competition. The table shows the percentage breakdown of competitors from each
school, and the percentage of the entire student population of the county that goes to each school.
Construct a bar graph that shows the population percentage of competitors from each school.
Alabaster
Concordia
Genoa
Mocksville
Tynneson
West End
7. Use the data from the David County science competition supplied in the table above. Construct a
bar graph that shows the county-wide population percentage of students at each school.
8. The table contains the 2010 obesity rates in U.S. states and Washington, DC.
a. Use a random number generator to randomly pick eight states. Construct a bar graph of the
obesity rates of those eight states.
b. Construct a bar graph for all the states beginning with the letter “A.”
c. Construct a bar graph for all the states beginning with the letter “M.”
9. Sixty-five randomly selected car salespersons were asked the number of cars they generally sell
in one week. Fourteen people answered that they generally sell three cars; nineteen generally sell
four cars; twelve generally sell five cars; nine generally sell six cars; eleven generally sell seven cars.
Complete the table.
11. Construct a frequency polygon from the frequency distribution for the highest ranked
countries for depth of hunger.
12. Use the two frequency tables to compare the life expectancy of men and women from
randomly selected countries. Include an overlayed frequency polygon and discuss the shapes of
the distributions, the center, the spread, and any outliers. What can we conclude about the life
expectancy of women compared to men?
13. Construct a time series graph for (a) the number of male births, (b) the number of female
births, and (c) the total number of births.
Female
Male
Total
Sex/Year 1862 1863 1864 1865 1866 1867 1868 1869
Female
Male
Total
Sex/Year 1871 1870 1872 1871 1872 1827 1874 1875
Female
Male
Total
14. The following data sets list full-time police per citizens along with homicides per
citizens for the city of Detroit, Michigan during the period from 1961 to 1973.
Police
Homicides
Police
Homicides
a. Construct a double time series graph using a common x-axis for both sets of data.
b. Which variable increased the fastest? Explain.
c. Did Detroit’s increase in police officers have an impact on the murder rate? Explain.
15. Suppose that three book publishers were interested in the number of fiction paperbacks adult
consumers purchase per month. Each publisher conducted a survey. In the survey, adult consumers
were asked the number of fiction paperbacks they had purchased the previous month. The results
are as follows:
Publisher A
Publisher B
Publisher C
a. Find the relative frequencies for each survey. Write them in the charts.
b. Using either a graphing calculator, computer, or by hand, use the frequency column to
construct a histogram for each publisher’s survey. For Publishers A and B, make bar widths of
one. For Publisher C, make bar widths of two.
c. In complete sentences, give two reasons why the graphs for Publishers A and B are not
identical.
d. Would you have expected the graph for Publisher C to look like the other two graphs? Why or
why not?
e. Make new histograms for Publisher A and Publisher B. This time, make bar widths of two.
f. Now, compare the graph for Publisher C to the new graphs for Publishers A and B. Are the
graphs more similar or more different? Explain your answer.
16. Often, cruise ships conduct all on-board transactions, with the exception of gambling, on a
cashless basis. At the end of the cruise, guests pay one bill that covers all onboard transactions.
Suppose that single travelers and couples were surveyed as to their on-board bills for a seven-
day cruise from Los Angeles to the Mexican Riviera. Following is a summary of the bills for each
group.
Singles
Couples
17. Twenty-five randomly selected students were asked the number of movies they watched the
previous week. The results are as follows.
18. Suppose one hundred eleven people who shopped in a special t-shirt store were asked the
number of t-shirts they own costing more than each.
a. The percentage of people who own at most three t-shirts costing more than each is
approximately:
b. If the data were collected by asking the first people who entered the store, then the type
of sampling is:
19. Following are the 2010 obesity rates by U.S. states and Washington, DC. Construct a bar graph
of obesity rates of your state and the four states closest to your state. Hint: Label the x-axis with the
states.
Alabama Kentucky North Dakota
Connecticut Minnesota South Carolina
Delaware Mississippi South Dakota
Washington, DC Missouri Tennessee
Idaho New Hampshire Virginia
Indiana New Mexico West Virginia
20. Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.
; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
; ; ;
21. Listed are ages for Academy Award winning best actors in order from smallest to largest. ;
; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
; ; ; ; ;
22.
a. For runners in a race, a low time means a faster run. The winners in a race have the shortest
running times. Is it more desirable to have a finish time with a high or a low percentile when
running a race?
b. Jesse was ranked th in his graduating class of 180 students. At what percentile is Jesse’s
ranking?
c. The th percentile of run times in a particular race is minutes. Write a sentence
interpreting the th percentile in the context of the situation.
d. A bicyclist in the th percentile of a bicycle race completed the race in hour and
minutes. Is he among the fastest or slowest cyclists in the race? Write a sentence interpreting
the th percentile in the context of the situation.
e. For runners in a race, a higher speed means a faster run. Is it more desirable to have a speed
with a high or a low percentile when running a race?
f. Jesse was ranked th in his graduating class of 180 students. At what percentile is Jesse’s
ranking?
g. The th percentile of speeds in a particular race is miles per hour. Write a sentence
interpreting the th percentile in the context of the situation.
23. On an exam, would it be more desirable to earn a grade with a high or low percentile? Explain.
24. Mina is waiting in line at the Department of Motor Vehicles (DMV). Her wait time of
minutes is the th percentile of wait times. Is that good or bad? Write a sentence interpreting the
th percentile in the context of this situation.
25. In a survey collecting data about the salaries earned by recent college graduates, Li found that
her salary was in the th percentile. Should Li be pleased or upset by this result? Explain.
26. In a study collecting data about the repair costs of damage to automobiles in a certain type of
crash tests, a certain model of car had in damage and was in the th percentile. Should
the manufacturer and the consumer be pleased or upset by this result? Explain and write a sentence
that interprets the th percentile in the context of this problem.
27. The University of California has two criteria used to set admission standards for freshmen to
be admitted to a college in the UC system:
a. Students’ GPAs and scores on standardized tests (SATs and ACTs) are entered into a formula
that calculates an “admissions index” score. The admissions index score is used to set
eligibility standards intended to meet the goal of admitting the top of high school
students in the state. In this context, what percentile does the top % represent?
b. Students whose GPAs are at or above the th percentile of all students at their high school
are eligible (called eligible in the local context), even if they are not in the top of all
students in the state. What percentage of students from each high school are “eligible in the
local context”?
28. Suppose that you are buying a house. You and your realtor have determined that the most
expensive house you can afford is the th percentile. The th percentile of housing prices is
vi. th percentile
29. The median age for U.S. blacks currently is years; for U.S. whites it is years. Based
upon this information, give two reasons why the black median age could be lower than the white
median age. Does the lower median age for blacks necessarily mean that blacks die younger than
whites? Why or why not? How might it be possible for blacks and whites to die at approximately
the same age, but for the median age for whites to be higher?
30. Six hundred adult Americans were asked by telephone poll, “What do you think constitutes
a middle-class income?” The results are in the table. Also, include left endpoint, but not the right
endpoint.
under
or more
Grade Frequency
a.
32. The following data shows the lengths of boats moored in a marina:
; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
; ;
33. Sixty-five randomly selected car salespersons were asked the number of cars they generally sell
in one week. Fourteen people answered that they generally sell three cars; nineteen generally sell
four cars; twelve generally sell five cars; nine generally sell six cars; eleven generally sell seven cars.
Calculate the following:
a. Mean
b. Median
c. Mode
34. The most obese countries in the world have obesity rates that range from % to %. This
data is summarized in the following table.
a. What is the best estimate of the average obesity percentage for these countries?
b. The United States has an average obesity rate of %. Is this rate above average or below?
c. How does the United States compare to other countries?
35. The table gives the percent of children under five considered to be underweight. What is the
best estimate for the mean percentage of underweight children?
36. Javier and Ercilia are supervisors at a shopping mall. Each was given the task of estimating the
mean distance that shoppers live from the mall. They each randomly surveyed shoppers. The
samples yielded the following information.
Javier Ercilia
miles miles
miles miles
37. We are interested in the number of years students in a particular elementary statistics class have
lived in California. The information in the following table is from the entire section.
Total =
a. What is the ?
b. What is the mode?
c. Is this a sample or the entire population?
38. State whether the data are symmetrical, skewed to the left, or skewed to the right.
a. ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
b. ; ; ; ; ; ; ; ;
c. ; ; ; ; ; ; ; ; ;
39. When the data are skewed left, what is the typical relationship between the mean and median?
40. When the data are symmetrical, what is the typical relationship between the mean and
median?
41. What word describes a distribution that has two modes?
42. Describe the shape of this distribution.
43. Describe the relationship between the mode and the median of this distribution.
44. Describe the relationship between the mean and the median of this distribution.
46. Describe the relationship between the mode and the median of this distribution.
47. Are the mean and the median the exact same in this distribution? Why or why not?
49. Describe the relationship between the mode and the median of this distribution.
50. Describe the relationship between the mean and the median of this distribution.
51. The mean and median for the data are the same.
; ; ; ; ; ; ; ; ; ; ; ; ; ;
Is the data perfectly symmetrical? Why or why not?
52. Which is the greatest, the mean, the mode, or the median of the data set?
; ; ; ; ; ; ; ; ; ; ;
53. Which is the least, the mean, the mode, or the median of the data set?
; ; ; ; ; ; ; ; ; ;
54. Of the three measures, which tends to reflect skewing the most, the mean, the mode, or the
median? Why?
55. In a perfectly symmetrical distribution, when would the mode be different from the mean and
median?
56. The median age of the U.S. population in 1980 was years. In 1991, the median age was
years.
57. The following data are the distances between retail stores and a large distribution center. The
distances are in miles.
; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
58. Two baseball players, Fredo and Karl, on different teams wanted to find out who had the higher
batting average when compared to his team.
Baseball Player Batting Average Team Batting Average Team Standard Deviation
Fredo
Karl
a. Which baseball player had the higher batting average when compared to his team?
b. Use the table above to find the value that is three standard deviations above the mean.
c. Use the table below to find the value that is three standard deviations above the mean.
59. Find the standard deviation for the following frequency tables using the formula.
Grade Frequency
a.
60. The population parameters below describe the full-time equivalent number of students (FTES)
each year at Lake Tahoe Community College from 1976–1977 through 2004–2005.
• = FTES
• median = FTES
• = FTES
• first quartile = FTES
• third quartile = FTES
• = years
a. A sample of years is taken. About how many are expected to have a FTES of or
above? Explain how you determined your answer.
b. of all years have an FTES:
i. at or below what value?
ii. at or above what value?
c. Find the population standard deviation.
d. What percent of the FTES were from to ? How do you know?
e. What is the IQR? What does the IQR represent?
f. How many standard deviations away from the mean is the median?
The population FTES for 2005–2006 through 2010–2011 was given in an updated report. The data
are reported here.
Total FTES
g. Calculate the mean, median, standard deviation, the first quartile, the third quartile, and the
IQR. Round to one decimal place.
h. Construct a box plot for the FTES for 2005–2006 through 2010–2011 and a box plot for the
FTES for 1976–1977 through 2004–2005.
i. Compare the IQR for the FTES for 1976–1977 through 2004–2005 with the IQR for the FTES
for 2005–2006 through 2010–2011. Why do you suppose the IQRs are so different?
61. Three students were applying to the same graduate school. They came from schools with
different grading systems. Which student had the best GPA when compared to other students at his
school? Explain how you determined your answer.
Thuy
Vichet
Kamala
62. A music school has budgeted to purchase three musical instruments. They plan to purchase a
piano costing $ , a guitar costing $ , and a drum set costing $ . The mean cost for a
piano is $ with a standard deviation of $ . The mean cost for a guitar is $ with a
standard deviation of $ . The mean cost for drums is $ with a standard deviation of $ .
Which cost is the lowest when compared to other instruments of the same type? Which cost is the
highest when compared to other instruments of the same type? Justify your answer.
63. An elementary school class ran one mile with a mean of minutes and a standard deviation
of three minutes. Rachel, a student in the class, ran one mile in eight minutes. A junior high school
class ran one mile with a mean of nine minutes and a standard deviation of two minutes. Kenji, a
student in the class, ran one mile in minutes. A high school class ran one mile with a mean of
seven minutes and a standard deviation of four minutes. Nedda, a student in the class, ran one mile
in eight minutes.
a. Why is Kenji considered a better runner than Nedda, even though Nedda ran faster than he?
b. Who is the fastest runner with respect to his or her class? Explain why.
64. The most obese countries in the world have obesity rates that range from % to %. This
data is summarized in the following table.
What is the best estimate of the average obesity percentage for these countries? What is the standard
deviation for the listed obesity rates? The United States has an average obesity rate of 33.9%. Is
this rate above average or below? How “unusual” is the United States’ obesity rate compared to the
average rate? Explain.
65. The table gives the percent of children under five considered to be underweight.
What is the best estimate for the mean percentage of underweight children? What is the standard
deviation? Which interval(s) could be considered unusual? Explain.
66. Twenty-five randomly selected students were asked the number of movies they watched the
previous week. The results are as follows:
# of movies Frequency
67. Forty randomly selected students were asked the number of pairs of sneakers they owned. Let
= the number of pairs of sneakers owned. The results are as follows:
Frequency
68. Following are the published weights (in pounds) of all of the team members of the San Francisco
49ers from a previous year.
; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
69. One hundred teachers attended a seminar on mathematical problem solving. The attitudes of a
representative sample of of the teachers were measured before and after the seminar. A positive
number for change in attitude indicates that a teacher’s attitude toward math became more positive.
The change scores are as follows:
; ; ; ; ; ; ; ; ; ; ;
70. In a recent issue of the IEEE SPECTRUM, engineering conferences were announced. Four
conferences lasted two days. Thirty-six lasted three days. Eighteen lasted four days. Nineteen lasted
five days. Four lasted six days. One lasted seven days. One lasted eight days. One lasted nine days.
Let = the length (in days) of an engineering conference.
71. A survey of enrollment at community colleges across the United States yielded the following
figures:
; ; ; ; ; ; ; ; ; ; ; ; ;
; ; ; ; ; ; ; ; ; ; ; ; ; ;
; ; ; ; ; ; ;
a. Organize the data into a chart with five intervals of equal width. Label the two columns
“Enrollment” and “Frequency.”
b. Construct a histogram of the data.
c. If you were to build a new community college, which piece of information would be more
valuable: the mode or the mean?
d. Calculate the sample mean.
e. Calculate the sample standard deviation.
f. A school with an enrollment of would be how many standard deviations away from the
mean?
72. = the number of days per week that clients use a particular exercise facility.
Frequency
73. Suppose that a publisher conducted a survey asking adult consumers the number of fiction
paperback books they had purchased in the previous month. The results are summarized in the
table.
a. Are there any outliers in the data? Use an appropriate numerical test involving the IQR to
identify outliers, if any, and clearly state your conclusion.
b. If a data value is identified as an outlier, what should be done about it?
c. Are any data values further than two standard deviations away from the mean? In some
situations, statisticians may use this criterion to identify data values that are unusual,
compared to the other data values. (Note that this criterion is most appropriate to use for data
that is mound-shaped and symmetric, rather than for skewed data.)
d. Do parts a and c of this problem give the same answer?
e. Examine the shape of the data. Which part, a or c, of this question gives a more appropriate
result for this data?
f. Based on the shape of the data which is the most appropriate measure of center for this data:
mean, median or mode?
Attribution
Chapter Outline
Meteor showers are rare, but the probability of them occurring can be calculated, photo by Ed Sweeney, CC
BY 4.0.
It is often necessary to “guess” about the outcome of an event in order to make a decision.
Politicians study polls to guess their likelihood of winning an election. Teachers choose a particular
course of study based on what they think students can comprehend. Doctors choose the treatments
needed for various diseases based on their assessment of likely results. You may have visited a
casino where people play games chosen because of the belief that the likelihood of winning is good.
You may have chosen your course of study based on the probable availability of jobs.
You have, more than likely, used probability. In fact, you probably have an intuitive sense of
probability. Probability deals with the chance of an event occurring. Whenever you weigh the odds
of whether or not to do your homework or to study for an exam, you are using probability. In this
chapter, you will learn how to solve probability problems using a systematic approach.
170 | 3.1 INTRODUCTION TO PROBABILITY
Attribution
LEARNING OBJECTIVES
Every day, decisions are made that involve uncertainty about the outcome. The ability to estimate
and understand probability helps us make good decisions. Probability is a numerical measure that
is associated with how certain we are of outcomes of a particular experiment or activity. Examples
of probability used in everyday life include the probability that it will rain today and the probability
of winning the lottery.
An experiment is a planned operation carried out under controlled conditions. If the result is
not predetermined, then the experiment is said to be a chance experiment. An experiment is any
activity where the outcome is uncertain. Flipping a coin, rolling a pair of dice, or drawing a card
from a deck of cards are all examples of an experiment.
A result of an experiment is called an outcome. For example, in the experiment of flipping
a coin, a possible outcome is getting heads. The sample space of an experiment is the set of
all possible outcomes of that experiment. Three ways to represent a sample space are: to list the
possible outcomes, to create a tree diagram, or to create a Venn diagram. The uppercase letter S
is used to denote the sample space. For example, in the experiment of flipping a coin, the sample
space has two outcomes: heads or tails. In the notation of probability, we would write the sample
space of flipping a coin as S = {H, T}, where H is heads and T is tails.
An event is any combination of outcomes. Generally, an event is a collection of outcomes that
possess some trait or characteristic. Uppercase letters like A and B are used to represent events.
For example, if the experiment is to flip a coin two times, event A might be getting at most one
head in the two flips. In probability, we are interested in finding the probability of an event. The
probability of an event A is written P(A).
172 | 3.2 THE TERMINOLOGY OF PROBABILITY
EXAMPLE
Solution:
NOTE
The order in which things happen is important, so the outcomes HT and TH are different
outcomes. The outcome HT consists of getting heads on the first flip and tails on the second flip.
The outcome TH consists of getting tails on the first flip and heads on the second flip, which is a
completely different outcome from HT.
Probability is a numerical measure of the likelihood that an event will occur. The probability of
an event is the long-term relative frequency of that event. Probabilities are numbers between
zero and one, inclusive—that is, zero and one and all numbers between these values. Probabilities
can be written as fractions, decimals, or percents. P(A) = 0 means the event A can never
happen, so the probability is 0%. P(A) = 1 means the event A always happens, so the probability is
100%. P(A) = 0.5 means the event A is equally likely to occur or not to occur, so there is a 50%
chance A will happen and a 50% chance A will not happen.
The way that we calculate the probability of an event depends on the situation we are analyzing.
Most often associated with games of chance, the classical method approach requires us to
know that the outcomes of an experiment are equally likely to occur. We have already seen an
experiment where the outcomes are equally likely to occur: flipping a coin. Equally likely means
that each outcome of an experiment occurs with equal probability. In the experiment of tossing
a fair coin, you know that you have a 50% chance of getting heads and a 50% chance of getting
tails, so the outcomes of heads or tails are equally likely to occur. If you roll a fair, six-sided die, you
know that you have the same chance of getting any of the six faces, so the outcomes of 1, 2, 3,
4, 5, and 6 are equally likely to occur.
EXAMPLE
Solution:
1. The outcomes in the event “exactly one head” are HT and TH. We see that there are 2
outcomes in the event out of the 4 possible outcomes in the sample space. So
P(exactly one head) = 2/4 = 0.5.
2. The outcomes in the event “at least one tail” are HT, TH, and TT. We see that there are 3
outcomes in the event out of the 4 possible outcomes in the sample space. So
P(at least one tail) = 3/4 = 0.75.
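The counting in this solution can be sketched in code: list every equally likely outcome of two coin flips, count the outcomes that belong to the event, and divide by the size of the sample space. This Python sketch is only an illustration; the helper name classical_probability is hypothetical, and the approach is valid only when all outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space for flipping a coin two times: HH, HT, TH, TT.
sample_space = ["".join(flips) for flips in product("HT", repeat=2)]

def classical_probability(event):
    """Outcomes in the event divided by outcomes in the sample space."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))

p_exactly_one_head = classical_probability(lambda o: o.count("H") == 1)
p_at_least_one_tail = classical_probability(lambda o: "T" in o)

print(sample_space)
print(p_exactly_one_head, p_at_least_one_tail)
```

Using Fraction keeps the answers as exact fractions (2 out of 4 and 3 out of 4 in lowest terms) rather than rounded decimals.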
TRY IT
Suppose you roll a fair six-sided die with the numbers 1, 2, 3, 4, 5, 6 on the faces.
1.
2.
3.
4.
5.
It is important to realize that in many situations, the outcomes are not equally likely. A coin or
die may be unfair or biased. Two math professors in Europe had their statistics students test the
Belgian one Euro coin and discovered that in 250 trials, a head was obtained 56% of the time and
a tail was obtained 44% of the time. The data seem to show that the coin is not a fair coin, but
more repetitions would be helpful to draw a more accurate conclusion about such bias. Some dice
may be biased. Look at the dice in a game you have at home. The spots on each face are usually
small holes carved out and then painted to make the spots visible. Your dice may or may not be
biased because it is possible that the outcomes may be affected by the slight weight differences due
to the different numbers of holes in the faces. Gambling casinos make a lot of money depending on
outcomes from rolling dice, so casino dice are made differently to eliminate bias. Casino dice have
flat faces and the holes are completely filled with paint having the same density as the material that
the dice are made out of so that each face is equally likely to occur.
The empirical or relative frequency approach to probability uses results from identical previous
experiments that have been performed many times. Probabilities are based on historical or
previously recorded data by determining the proportion of times an event occurs within the data.
For example, a retail business owner might want to know the probability that a customer spends
more than $50 at their store. To determine this probability, the business owner would look at
previous sales, count the number of sales over $50 and then divide that number by the total number
of previous sales.
To calculate an empirical probability, repeat the experiment over a large number of trials and
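record the result of each trial. To find the probability of event A, count the number of times event A
happened and divide by the total number of trials.
\displaystyle{P(A) = \frac{\mbox{number of times } A \mbox{ occurred}}{\mbox{total number of trials}}}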
To get an accurate probability using this approach, it is important that the experiment is repeated
a very large number of times. This important characteristic of probability experiments is known
as the law of large numbers, which states that as the number of repetitions of an experiment
increases, the relative frequency obtained in the experiment tends to become closer and closer
to the theoretical probability. Even though the outcomes do not happen according to any set
pattern or order, overall, the long-term observed relative frequency will approach the theoretical
probability.
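The law of large numbers can be illustrated with a short simulation. The sketch below is an assumption-laden illustration rather than part of the original text: it simulates a fair six-sided die with Python's pseudo-random generator, and the function name `relative_frequency`, the target face, and the seed are arbitrary choices.

```python
import random

def relative_frequency(n_trials, target_face=6, seed=0):
    """Roll a simulated fair die n_trials times; return the share of rolls equal to target_face."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) == target_face)
    return hits / n_trials

theoretical = 1 / 6  # theoretical probability of any single face

for n in (10, 100, 1_000, 10_000, 100_000):
    freq = relative_frequency(n)
    print(f"{n:>7} rolls: relative frequency = {freq:.4f}, "
          f"difference from 1/6 = {abs(freq - theoretical):.4f}")
```

With only a few rolls the relative frequency can be far from 1/6; as the number of rolls grows, the difference tends to shrink, although not necessarily at every step.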
EXAMPLE
An online retailer wants to know the probability that a transaction will be less than $30. In 2000
transactions, 650 are less than $30.
Solution:
\displaystyle{P(\mbox{less than \$30}) = \frac{650}{2000} = 0.325}
In the subjective method approach to probability, probabilities are determined by educated guess,
personal belief, intuition, or expert reasoning. A subjective probability is essentially a guess, but
a guess based on an accumulation of knowledge, understanding, and experience. Estimating the
probability the price of a stock goes down over time or the probability a certain sports team will win
a championship are examples of subjective probability.
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=77#oembed-1
Watch this video: Probability: Tossing Two Coins by Joshua Emmanuel [5:55] (transcript available).
Concept Review
In this section we learned the basic terminology of probability. The set of all possible outcomes of an
experiment is called the sample space. Events are subsets of the sample space, and they are assigned
a probability that is a number between zero and one, inclusive.
Attribution
LEARNING OBJECTIVES
A contingency table provides a way of displaying data that can facilitate calculating probabilities.
The table can be used to describe the sample space of an experiment. Contingency tables allow us
to break down a sample space when two variables are involved.
• The left-side column lists all of the values for one of the variables. In the table shown above,
the left-side column shows the variable about whether or not someone uses a cell phone
while driving.
• The top row lists all of the values for the other variable. In the table shown above, the top row
shows the variable about whether or not someone had a speeding violation in the last year.
• In the body of the table, the cells contain the number of outcomes that fall into both of the
categories corresponding to the intersecting row and column. In the table shown above, the
number of 25 at the intersection of the “cell phone user” row and “speeding violation in the
last year” column tells us that there are 25 people who have both of these characteristics.
3.3 CONTINGENCY TABLES | 179
• The bottom row gives the totals in each column. In the table shown above, the number 685 in
the bottom of the “no speeding violation in the last year” column tells us that there are 685 people
who did not have a speeding violation in the last year.
• The right-side column gives the totals in each row. In the table shown above, the number 305
in the right side of the “cell phone user” row tells us that there are 305 people who use cell
phones while driving.
• The number in the bottom right corner is the size of the sample space. In the table shown
above, the number in the bottom right corner is 755, which tells us that there are 755 people in
the sample space.
EXAMPLE
Suppose a study of speeding violations and drivers who use cell phones while driving produced the
following fictional data:
1. What is the probability that a randomly selected person is a cell phone user?
2. What is the probability that a randomly selected person had no speeding violations in the last
year?
3. What is the probability that a randomly selected person had a speeding violation in the last
year and does not use a cell phone?
4. What is the probability that a randomly selected person uses a cell phone and had no speeding
violations in the last year?
Solution:
1.
2.
3.
4.
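The four probabilities asked for in this example can be computed directly from the cell counts. The sketch below is illustrative only. The counts 25, 280, 305, 685 and 755 are stated in the text; the two cells for people who do not use a cell phone (45 and 405) are assumptions derived by subtraction from those totals, and the category labels and the helper name `prob` are hypothetical.

```python
# Counts stated in the text: 25, 280 (row total 305), column total 685, grand total 755.
# ASSUMPTION: the two remaining cells are derived by subtraction from those totals.
table = {
    ("cell phone user", "speeding violation"): 25,
    ("cell phone user", "no speeding violation"): 280,
    ("not a cell phone user", "speeding violation"): 45,      # (755 - 685) - 25
    ("not a cell phone user", "no speeding violation"): 405,  # 685 - 280
}
total = sum(table.values())  # size of the sample space

def prob(match):
    """Share of all people whose (phone, speeding) categories satisfy match."""
    return sum(n for key, n in table.items() if match(*key)) / total

p_cell_user = prob(lambda phone, speeding: phone == "cell phone user")
p_no_violation = prob(lambda phone, speeding: speeding == "no speeding violation")
p_violation_and_no_phone = prob(
    lambda phone, speeding: phone == "not a cell phone user" and speeding == "speeding violation")
p_phone_and_no_violation = prob(
    lambda phone, speeding: phone == "cell phone user" and speeding == "no speeding violation")

print(round(p_cell_user, 4))               # 0.404
print(round(p_no_violation, 4))            # 0.9073
print(round(p_violation_and_no_phone, 4))  # 0.0596
print(round(p_phone_and_no_violation, 4))  # 0.3709
```

Each probability is a count from the table (a row total, a column total, or a single cell) divided by the size of the sample space.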
TRY IT
This table shows the number of athletes who stretch before exercising and how many had injuries
within the past year.
1. What is the probability that a randomly selected athlete stretches before exercising?
2. What is the probability that a randomly selected athlete had an injury in the last year?
3. What is the probability that a randomly selected athlete does not stretch before exercising and
had no injuries in the last year?
4. What is the probability that a randomly selected athlete stretches before exercising and had no
injuries in the last year?
1.
2.
3.
4.
EXAMPLE
The table below shows a random sample of 100 hikers broken down by gender and the areas of
hiking they prefer.
Gender   The Coastline   Near Lakes and Streams   On Mountain Peaks   Total
Female   18              16                       ___                 45
Male     ___             ___                      14                  55
Total    ___             41                       ___                 ___
Solution:
Gender   The Coastline   Near Lakes and Streams   On Mountain Peaks   Total
Female   18              16                       11                  45
Male     16              25                       14                  55
Total    34              41                       25                  100
2.
3.
4.
5.
TRY IT
The table below relates the weights and heights of a group of individuals participating in an
observational study.
Obese 18 28 14
Normal 20 51 28
Underweight 12 25 9
Totals
3. Find the probability that a randomly chosen individual from this group is normal.
4. Find the probability that a randomly chosen individual from this group is obese and short.
5. Find the probability that a randomly chosen individual from this group is underweight and
medium.
Normal 20 51 28 99
Underweight 12 25 9 46
2.
3.
4.
5.
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=79#oembed-1
Watch this video: Ex: Basic Example of Finding Probability From a Table by Mathispower4u [2:39] (transcript available).
Concept Review
There are several tools we can use to help organize and sort data when calculating probabilities.
Contingency tables help display data and are particularly useful when calculating probabilities that
have two variables of interest.
Attribution
LEARNING OBJECTIVES
The complement of an event A is the set of all outcomes in the sample space that are not in A.
The complement of A is denoted by A^C and is read “not A.”
186 | 3.4 THE COMPLEMENT RULE
EXAMPLE
Suppose a coin is flipped two times. Previously, we found the sample space for this experiment:
S = {HH, HT, TH, TT}, where H is heads and T is tails.
Solution:
1. The event “exactly one head” consists of the outcomes HT and TH. The complement of
“exactly one head” consists of the outcomes HH and TT. These are the outcomes in the
sample space that are NOT in the original event “exactly one head.”
2. The event “at least one tail” consists of the outcomes HT, TH, and TT. The complement
of “at least one tail” consists of the outcome HH. This is the only outcome in the sample
space that is NOT in the original event “at least one tail.”
TRY IT
Suppose you roll a fair six-sided die with the numbers on the faces. Previously, we
found the sample space for this experiment:
1. The complement is .
2. The complement is .
3. The complement is .
4. The complement is .
In any experiment, an event or its complement must occur. This means that
P(A) + P(A^C) = 1. Rearranging this equation gives us a formula for finding the probability of
the complement from the original event:
\begin{eqnarray*} P(A^C) & = & 1-P(A) \\ \\ \end{eqnarray*}
EXAMPLE
An online retailer knows that 30% of customers spend more than $100 per transaction. What is the
probability that a customer spends at most $100 per transaction?
Solution:
Spending at most $100 ($100 or less) per transaction is the complement of spending more than $100
per transaction.
\displaystyle{P(\mbox{at most \$100}) = 1-P(\mbox{more than \$100}) = 1-0.3 = 0.7}
TRY IT
At a local college, a statistics professor has a class of 80 students. After polling the students in the
class, the professor finds out that 15 of the students play on one of the school’s sports teams and 60 of
the students have part-time jobs.
1. What is the probability that a student in the class does not play on one of the school’s sports
teams?
2. What is the probability that a student in the class does not have a part-time job?
1.
2.
Concept Review
The complement, A^C, of an event A consists of all of the outcomes in the sample space that are
NOT in event A. The probability of the complement can be found from the original event using the
formula: P(A^C) = 1 − P(A).
Attribution
LEARNING OBJECTIVES
For two events A and B we might want to know the probability that at least one of the two events
occurs. For example, we might want to find the probability of rolling a 2 or a 5 in a single roll
of a die, or we might want to find the probability that someone has a smartphone or a tablet.
In probability terms, we want to find P(A or B), the probability that either A or B occurs. In
probability, “or” is always an inclusive “or,” which means that either A occurs, or B occurs, or both
occur. The addition rule for probabilities is P(A or B) = P(A) + P(B) − P(A and B).
EXAMPLE
At a local language school, 40% of the students are learning Spanish, 20% of the students are learning
German, and 8% of the students are learning both Spanish and German. What is the probability that
a randomly selected student is learning Spanish or German?
Solution:
EXAMPLE
There are 50 students enrolled in the second year of a business degree program. During this
semester, the students have to take some elective courses. 18 students decide to take an elective in
psychology, 27 students decide to take an elective in philosophy, and 10 students decide to take an
elective in both psychology and philosophy. What is the probability that a student takes an elective
in psychology or philosophy?
Solution:
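The two examples above can be checked with a short calculation. This is a minimal Python sketch under the assumption that the addition rule has the form P(A or B) = P(A) + P(B) − P(A and B), which matches the worked calculations later in this section; the function name `p_or` is a hypothetical helper, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

def p_or(p_a, p_b, p_a_and_b):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

# Language school: 40% Spanish, 20% German, 8% both.
p_spanish_or_german = p_or(Fraction(40, 100), Fraction(20, 100), Fraction(8, 100))

# Business program: 50 students, 18 psychology, 27 philosophy, 10 both.
p_psych_or_phil = p_or(Fraction(18, 50), Fraction(27, 50), Fraction(10, 50))

print(float(p_spanish_or_german))  # 0.52
print(float(p_psych_or_phil))      # 0.7
```

Subtracting P(A and B) removes the outcomes that would otherwise be counted twice, once in P(A) and once in P(B).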
TRY IT
At a local basketball game, 70% of the fans are cheering for the home team, 25% of the fans are
wearing blue, and 12% of the fans are cheering for the home team and wearing blue. What is the
probability that a randomly selected fan is cheering for the home team or wearing blue?
EXAMPLE
Suppose a study of speeding violations and drivers who use cell phones produced the following
fictional data:
1. What is the probability that a randomly selected person is a cell phone user or has no speeding
192 | 3.5 THE ADDITION RULE
Solution:
TRY IT
This table shows the number of athletes who stretch before exercising and how many had injuries
within the past year.
1. What is the probability that a randomly selected athlete stretches before exercising or had an
injury last year?
2. What is the probability that a randomly selected athlete does not stretch before exercising or
had no injuries in the last year?
1.
2.
Two events A and B are mutually exclusive if the two events cannot happen at the same time.
That is, the events A and B do not share any outcomes and so P(A and B) = 0. For example, in
the experiment of flipping a coin, the events heads and tails are mutually exclusive because it is not
possible to have both heads and tails on the top face. In the case of mutually exclusive events, the
addition rule is P(A or B) = P(A) + P(B).
EXAMPLE
Suppose a bag contains 20 balls. 10 of the balls are white, 7 of the balls are red, and 3 of the balls are
blue. Suppose one ball is selected at random from the bag.
1. Are the events “selecting a white ball” and “selecting a red ball” mutually exclusive? Why?
2. What is the probability of selecting a white or red ball?
Solution:
1. The events “selecting a white ball” and “selecting a red ball” are mutually exclusive because
the events cannot happen at the same time. It is not possible for the selected ball to be both
white and red.
2. \displaystyle{P(\mbox{white or red}) = P(\mbox{white})+P(\mbox{red}) = \frac{10}{20}+\frac{7}{20} = 0.85}
NOTE
In the calculation of the probability in part 2, there is nothing to subtract. Because the events are
mutually exclusive, P(white and red) = 0.
TRY IT
At a local college, 60% of the students are taking a math class, 50% of the students are taking a
science class, and 30% of the students are taking both a math and a science class.
1. Are the events “taking a math class” and “taking a science class” mutually exclusive? Explain.
2. What is the probability that a randomly selected student is taking a math class or a science
class?
1. The events “taking a math class” and “taking a science class” are not mutually exclusive
because the events can happen at the same time (i.e. a student can be taking both a math class
and a science class). As stated in the question, P(math and science) = 0.3.
2. \displaystyle{P(\mbox{math or science}) = P(\mbox{math})+P(\mbox{science})-P(\mbox{math and science}) = 0.6+0.5-0.3 = 0.8}
TRY IT
1. Are the events “rolling a 4” and “rolling an even number” mutually exclusive?
2. Are the events “rolling a 4” and “rolling an odd number” mutually exclusive?
3. What is the probability of rolling a 4 or rolling an odd number?
1. The events “rolling a 4” and “rolling an even number” are not mutually exclusive because the
events can happen at the same time (i.e. 4 is an even number).
2. The events “rolling a 4” and “rolling an odd number” are mutually exclusive because the
events cannot happen at the same time. It is not possible to roll a die and get a 4 (an even
number) and an odd number on the top face at the same time.
3. \displaystyle{P(\mbox{4 or odd}) = P(\mbox{4})+P(\mbox{odd}) = \frac{1}{6}+\frac{3}{6}=\frac{4}{6}}
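One way to check whether two events are mutually exclusive is to treat each event as a set of outcomes and test whether the sets overlap. The sketch below is illustrative only and assumes a fair six-sided die; the names `mutually_exclusive` and `p` are hypothetical helpers.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # faces of a fair six-sided die
rolling_a_4 = {4}
even = {2, 4, 6}
odd = {1, 3, 5}

def mutually_exclusive(a, b):
    """Two events are mutually exclusive when they share no outcomes."""
    return a.isdisjoint(b)

def p(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

print(mutually_exclusive(rolling_a_4, even))  # False: 4 is an even number
print(mutually_exclusive(rolling_a_4, odd))   # True: 4 is not an odd number
print(p(rolling_a_4 | odd))                   # 2/3, the same value as 4/6
```

Because “rolling a 4” and “rolling an odd number” share no outcomes, the probability of their union equals the sum of the two separate probabilities.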
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=84#oembed-1
Watch this video: Addition Rule for Probability Khan Academy [10:42] (transcript available).
Concept Review
Attribution
“3.1 Terminology” , “3.2 Independent and Mutually Exclusive Events”, and “3.4 Contingency
Tables” in Introductory Statistics by OpenStax is licensed under a Creative Commons Attribution
4.0 International License.
3.6 CONDITIONAL PROBABILITY
LEARNING OBJECTIVES
A conditional probability is the probability of an event given that another event has already
occurred. The idea behind conditional probability is that it reduces the sample space to the part of
the sample space that involves just the given event B. Except for the event B, everything else in the
sample space is thrown away. Once the sample space is reduced to the given event B, we calculate
the probability of A occurring within the reduced sample space.
The conditional probability of A given B is written as P(A|B) and is read “the probability of A
given B.”
Recognizing a conditional probability and identifying which event is the given event can be
challenging. The following sentences are all asking for the same conditional probability, just in
different ways:
• What is the probability a student has a smartphone given that the student has a tablet?
• If a student has a tablet, what is the probability the student has a smartphone?
• What is the probability that a student with a tablet has a smartphone?
The given event is “has a tablet,” so in calculating the conditional probability we would restrict the
sample space to just those students that have a tablet and then find the probability a student has a
smartphone from among just those students with a tablet.
NOTE
EXAMPLE
Suppose a study of speeding violations and drivers who use cell phones produced the following
fictional data:
1. What is the probability that a randomly selected person is a cell phone user given that they
had no speeding violations in the last year?
2. If a randomly selected person does not have a cell phone, what is the probability they had a
speeding violation last year?
3. What is the probability that someone with a cell phone did not have a speeding violation last
year?
Solution:
1. The given event is “no speeding violations,” so we restrict the table to just the column
involving “no speeding violations.” With this restriction, the table would look like this:
Total 685
Now, we want to find the probability a person is a cell phone user in this restricted sample
space:
2. The given event is “no cell phone,” so we restrict the table to just the row involving “no cell
phone.” With this restriction, the table would look like this:
Now, we want to find the probability a person has a speeding violation in the last year in this
restricted sample space:
3. The given event is “cell phone,” so we restrict the table to just the row involving “cell phone.”
With this restriction, the table would look like this:
Cell phone user   25   280   305
Now, we want to find the probability a person does not have a speeding violation in the last
year in this restricted sample space:
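The “restrict the table, then count” approach used in these three parts can be sketched in code. This is an illustration under stated assumptions, not the original solution: the counts 25, 280, 305, 685 and 755 come from the text, the cells 45 and 405 are derived by subtraction, and the labels and the helper name `conditional` are hypothetical.

```python
# Counts 25 and 280 are stated in the text; 45 and 405 are ASSUMED, derived by
# subtraction from the stated totals 305, 685 and 755.
counts = {
    ("cell phone", "violation"): 25,
    ("cell phone", "no violation"): 280,
    ("no cell phone", "violation"): 45,
    ("no cell phone", "no violation"): 405,
}

def conditional(target, given):
    """P(target | given): keep only outcomes in the given event, then take the share in target."""
    restricted = {key: n for key, n in counts.items() if given(*key)}
    in_target = sum(n for key, n in restricted.items() if target(*key))
    return in_target / sum(restricted.values())

uses_phone = lambda phone, speeding: phone == "cell phone"
no_phone = lambda phone, speeding: phone == "no cell phone"
violation = lambda phone, speeding: speeding == "violation"
no_violation = lambda phone, speeding: speeding == "no violation"

print(round(conditional(uses_phone, no_violation), 4))  # 0.4088 (280 out of 685)
print(round(conditional(violation, no_phone), 4))       # 0.1    (45 out of 450)
print(round(conditional(no_violation, uses_phone), 4))  # 0.918  (280 out of 305)
```

The first and third results use the same cell (280) but different restricted totals, so the two conditional probabilities are different.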
NOTE
The conditional probability P(A|B) does not equal the conditional probability P(B|A). In the
TRY IT
This table shows the number of athletes who stretch before exercising and how many had injuries
within the past year.
1. What is the probability that a randomly selected athlete stretches before exercising given that
1.
2.
3.
When working with a contingency table as in the above examples, we can simply calculate
conditional probabilities by restricting the table to the given event and then finding the required
probability in the restricted sample space. Depending on the situation, it might not be possible to
work out a conditional probability this way. In these situations we can use the following formula to
find a conditional probability:
\displaystyle{P(A|B) = \frac{P(A \mbox{ and } B)}{P(B)}}
EXAMPLE
At a local language school, 40% of the students are learning Spanish, 20% of the students are learning
German and 8% of the students are learning both Spanish and German.
1. What is the probability that a randomly selected student is learning Spanish given that they
are learning German?
2. What is the probability that a randomly selected Spanish student is learning German?
Solution:
2.
EXAMPLE
There are 50 students enrolled in the second year of a business degree program. During this
semester, the students have to take some elective courses. 18 students decide to take an elective in
psychology, 27 students decide to take an elective in philosophy, and 10 students decide to take an
elective in both psychology and philosophy.
1. What is the probability that a student takes an elective in psychology given that they take an
elective in philosophy?
2. If a student takes an elective in psychology, what is the probability that they take an elective in
philosophy?
Solution:
1. \displaystyle{P(\mbox{psychology}|\mbox{philosophy}) = \frac{P(\mbox{psychology and philosophy})}{P(\mbox{philosophy})} = \frac{\frac{10}{50}}{\frac{27}{50}} = 0.3704}
2.
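Both parts of this example follow the same pattern: divide the probability of both events by the probability of the given event. The sketch below is a minimal illustration of that formula; the function name `p_given` is a hypothetical helper and the use of exact fractions is a convenience, not something prescribed by the text.

```python
from fractions import Fraction

def p_given(p_a_and_b, p_b):
    """Conditional probability formula: P(A|B) = P(A and B) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than zero")
    return p_a_and_b / p_b

# 50 students: 18 take psychology, 27 take philosophy, 10 take both.
p_both = Fraction(10, 50)
p_psychology = Fraction(18, 50)
p_philosophy = Fraction(27, 50)

p_psych_given_phil = p_given(p_both, p_philosophy)
p_phil_given_psych = p_given(p_both, p_psychology)

print(round(float(p_psych_given_phil), 4))  # 0.3704
print(round(float(p_phil_given_psych), 4))  # 0.5556
```

The numerator is the same in both parts; only the given event in the denominator changes, which is why the two answers differ.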
TRY IT
At a local basketball game, 70% of the fans are cheering for the home team, 25% of the fans are
wearing blue, and 12% of the fans are cheering for the home team and wearing blue.
1. What is the probability that a randomly selected fan is cheering for the home team given that
they are wearing blue?
2. If a randomly selected fan is cheering for the home team, what is the probability they are
wearing blue?
1.
2.
Independent Events
Two events are independent if the probability of the occurrence of one of the events does not
affect the probability of the occurrence of the other event. In other words, two events A and B are
independent if the knowledge that one of the events occurred does not affect the chance the other
event occurs. For example, the outcomes of two rolls of a fair die are independent events: the
outcome of the first roll does not change the probability of the outcome of the second roll. If two
events are not independent, then we say the events are dependent.
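We can test two events A and B for independence by comparing P(A) and P(A|B):
• If P(A) = P(A|B), then the events A and B are independent.
• If P(A) ≠ P(A|B), then the events A and B are dependent.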
EXAMPLE
At a local language school, 40% of the students are learning Spanish, 20% of the students are learning
German and 8% of the students are learning both Spanish and German. Are the events “Spanish”
and “German” independent? Explain.
Solution:
EXAMPLE
Suppose a study of speeding violations and drivers who use cell phones produced the following
fictional data:
Are the events “cell phone user” and “speeding violation in the last year” independent? Explain.
Solution:
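The independence test in the two examples above can be sketched as a comparison of P(A) with P(A|B). This is an illustrative sketch, not the original solution. The function name `independent` is hypothetical, and the count of 70 people with a speeding violation is an assumption derived from the stated totals (755 − 685).

```python
from fractions import Fraction

def independent(p_a, p_b, p_a_and_b):
    """A and B are independent when P(A|B) = P(A), where P(A|B) = P(A and B) / P(B)."""
    return p_a_and_b / p_b == p_a

# Language school: A = Spanish (40%), B = German (20%), both = 8%.
spanish_german = independent(Fraction(40, 100), Fraction(20, 100), Fraction(8, 100))

# Cell phone study: A = cell phone user (305 of 755), B = speeding violation
# (ASSUMED 70 of 755, derived as 755 - 685), both = 25 of 755.
phone_speeding = independent(Fraction(305, 755), Fraction(70, 755), Fraction(25, 755))

print(spanish_german)  # True:  P(Spanish | German) = 0.08 / 0.20 = 0.40 = P(Spanish)
print(phone_speeding)  # False: P(cell phone | violation) = 25/70 differs from 305/755
```

Exact fractions are used so that the equality test is not affected by floating-point rounding. With real sample data, exact equality is rare, so in practice the two probabilities are compared after rounding.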
TRY IT
At a local basketball game, 70% of the fans are cheering for the home team, 25% of the fans are
wearing blue, and 12% of the fans are cheering for the home team and wearing blue. Are the events
“cheering for the home team” and “wearing blue” independent? Explain.
events “cheering for the home team” and “wearing blue” are dependent.
TRY IT
This table shows the number of athletes who stretch before exercising and how many had injuries
within the past year.
Are the events “does not stretch” and “injury in last year” independent? Explain.
are dependent.
Sampling may be done with replacement or without replacement, which affects whether or
not events are considered independent or dependent.
• With replacement: If each member of a population is replaced after it is picked, then that
member has the possibility of being chosen more than once. When sampling is done with
replacement, then events are considered to be independent because the result of the first pick
will not change the probabilities for the second pick.
• Without replacement: When sampling is done without replacement, each member of a
population may be chosen only once. In this case, the probabilities for the second pick are
affected by the result of the first pick. In this situation, the events are considered
to be dependent (that is, not independent).
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=86#oembed-1
Watch this video: Calculating Conditional Probability Khan Academy [6:42] (transcript available).
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=86#oembed-2
Watch this video: Conditional Probability and Independence Khan Academy [4:06] (transcript available).
Concept Review
A conditional probability is the probability of an event given that another event has already
occurred. Two events A and B are independent if the knowledge that one of the events occurred does
not affect the chance that the other event occurs. If P(A|B) = P(A), the events A and B are
independent. Otherwise the events are dependent.
Attribution
“3.1 Terminology”, “3.2 Independent and Mutually Exclusive Events”, and “3.4 Contingency
Tables” in Introductory Statistics by OpenStax is licensed under a Creative Commons Attribution
4.0 International License.
3.7 JOINT PROBABILITIES
LEARNING OBJECTIVES
A joint probability is the probability of events A and B happening at the same time. We are
interested in both events occurring simultaneously in the unrestricted sample space. We have seen
these types of probabilities already when we looked at contingency tables and in the context of “or”
probabilities.
EXAMPLE
Suppose a study of speeding violations and drivers who use cell phones produced the following
fictional data:
1. What is the probability that a randomly selected person is a cell phone user and had no
speeding violations in the last year?
2. What is the probability that a randomly selected person had a speeding violation in the last
year and does not use a cell phone?
Solution:
2.
NOTE
These two probabilities are examples of joint probabilities. For example, in part 1, we want to find
the probability that a randomly selected person has both traits: cell phone user and no speeding
violations. So, we are interested in both events happening at the same time.
So far, most of the probabilities we have looked at are based on a single trial experiment and
finding a probability based on that single trial. For example, finding the probability of rolling an
even number in a single roll of a die is a single trial experiment: we are only rolling the die one time
and then we want to find the probability of a particular event happening in that single roll. Even
the joint probabilities that we have seen so far, as in the example above, are based on a single trial
experiment. We see these types of joint probabilities when we randomly select a single item and
then want to find the probability that the item has two different characteristics at the same time.
However, we often want to calculate probabilities associated with repeated trial experiments.
In a repeated trial experiment, we deal with identical trials that are repeated a number of times.
For example, flipping a coin three times is an example of a repeated trial experiment—the trial is
flipping the coin and then that trial is repeated three identical times.
EXAMPLE
Which of the following are repeated trial experiments? For the repeated trial experiments, identify
the trial and the number of repetitions.
Solution:
We can think of repeated trial experiments as joint probabilities—event on trial one AND event
on trial two AND event on trial three and so on, depending on the number of trials. Suppose
in the example of flipping the coin three times we want to find the probability of getting three
heads in the three flips. We can think of this as a joint probability—heads on flip one AND heads
on flip two AND heads on flip three. We want to calculate probabilities for such repeated trial
experiments and, as we will see, the key to such probabilities is to think of the repeated trials as a
joint probability.
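The idea of treating three coin flips as “heads AND heads AND heads” can be checked in two ways: by listing every possible sequence of flips, and by multiplying the probabilities of the individual flips. The sketch below assumes a fair coin (probability 1/2 of heads on each flip); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Every possible sequence of three flips: 2 x 2 x 2 = 8 equally likely sequences.
flips = list(product("HT", repeat=3))

# Approach 1: count the sequences that are heads on all three flips.
three_heads = [sequence for sequence in flips if sequence == ("H", "H", "H")]
p_enumerated = Fraction(len(three_heads), len(flips))

# Approach 2: heads on flip one AND heads on flip two AND heads on flip three.
p_heads = Fraction(1, 2)
p_multiplied = p_heads * p_heads * p_heads

print(p_enumerated, p_multiplied)  # 1/8 1/8
```

Both approaches give the same value, which is the reason repeated trials can be handled by multiplying trial by trial.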
One thing we must consider in a repeated trial experiment is whether the trials are done with or
without replacement because this changes how we calculate the probability as we move from trial
to trial.
When calculating probabilities for repeated trial experiments, it is important that we identify if
the experiment is done with or without replacement. Sometimes we will be told directly that the
experiment is done with or without replacement. But most of the time we will need to determine if
the experiment is done with or without replacement from the context of the question.
EXAMPLE
For each of the following determine if the experiment is done with or without replacement.
Solution:
1. With replacement. The probability of heads or tails stays the same with each flip.
2. Without replacement. In this case, we want three different women on the committee, so we
must select them without replacement. (Selecting with replacement would mean a possibility
of the same woman being selected three times and then the committee would consist of just a
single person).
3. Depending on the context, this could be with or without replacement. If each card is replaced
after it is selected, this would be with replacement. If each card is not replaced after it is
selected, this would be without replacement. In this situation, the question would probably
include a statement about whether the cards are drawn with or without replacement.
4. With replacement. The probability of rolling any of the numbers stays the same with each roll
of the die.
5. Without replacement. In this case, we want the members of the executive committee to be all
different, so we must select them without replacement.
Unfortunately, it is more complicated than that because we have to work out the probabilities on
each trial, and these probabilities are affected by whether the experiment is done with or without
replacement.
The multiplication rule to find the probability of A and B in a repeated trial experiment is
\displaystyle{P(A \mbox{ and } B) = P(A) \times P(B|A)}
If we think of A as the first trial and B as the second trial, the probability of A and B is the
probability of A (the probability of A on the first trial) times the probability of B given A (the
probability of B on the second trial assuming that A happened on the first trial).
In the case that the experiment is done with replacement, the events A and B are independent,
so P(B|A) = P(B) and this rule becomes
\displaystyle{P(A \mbox{ and } B) = P(A) \times P(B)}
We can extend this rule to any number of trials. We just need to keep multiplying as we move
from trial to trial.
When finding probabilities associated with repeated trial experiments, remember the following:
• To find the probability we work with the probabilities of the individual trials, multiplying the
EXAMPLE
A small local high school has 25 students in its graduating class. 18 of the students are going to
college next year and the remaining 7 are not going to college next year. Suppose two students are
selected at random from the graduating class.
1. What is the probability that both students are going to college next year?
2. What is the probability that exactly one of the students is going to college next year?
Solution:
This is a repeated trial experiment. A trial is selecting a student and there are two trials. The
assumption here is that the experiment is done without replacement because we do not want to get
the same student twice.
1. We want to get college-bound students on both trials. In other words, college-bound student
on trial one AND college-bound student on trial two. On the first trial, the probability of
getting a college-bound student is 18/25. We are selecting without replacement, so after the first
trial we assume that we have removed one of the college-bound students. This means that on
the second trial, there are only 24 students to pick from (one student was removed on the first
trial) and there are only 17 college-bound students left (on the first trial we removed one of the
18 college-bound students). So on the second trial, the probability of getting a college-bound
student is 17/24. The probability that both students are going to college next year is
\displaystyle{\frac{18}{25} \times \frac{17}{24} = 0.51}
2. We want one college-bound student (denoted ) and one non college-bound student (denoted
). In this case, we have to think about the order of the selections—there is a difference
between college-bound on trial one, non-college bound on trial two( ) and non-college
3.7 JOINT PROBABILITIES | 215
bound on trial one, college bound on trial two ( ). All possible orders must be accounted
for when we calculate the probability. One of the two possible orders must occur: OR
. For each of the individual orders, we multiply the probabilities as we move from trial to
trial. The “or” means that we add the probabilities of the different orders. In other words:
For the order (college-bound on trial one, non college-bound on trial two), we want a
college-bound student on trial one and the probability of getting a college-bound student is
. We are selecting without replacement, so after the first trial we assume that we have
removed one of the college-bound students. This means that on the second trial, there are
only 24 students to pick from (a college-bound student was removed on the first trial) and
there are 7 non college-bound students (none of the non college-bound students were
removed after the first trial). So on the second trial, the probability of getting a non college-
Similarly for the second order (non college-bound on trial one, college-bound on trial two), we
want a non college-bound student on trial one and the probability of getting a non
college-bound student is 7/25. We are selecting without replacement, so after the first trial we
assume that we have removed one of the non college-bound students. This means that on the second
trial, there are only 24 students to pick from (a non college-bound student was removed on the
first trial) and there are 18 college-bound students (none of the college-bound students were
removed after the first trial). So on the second trial, the probability of getting a college-bound
student is 18/24. The probability of this order is 7/25 × 18/24 = 0.21. Adding the two orders, the
probability of getting one college-bound student and one non college-bound student is
0.21 + 0.21 = 0.42.
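The trial-by-trial multiplication used in this solution can be checked with a short script. This is a minimal sketch in Python rather than Excel; the helper name `prob_sequence` and the labels "C" and "N" are illustrative assumptions, and the group of 25 students (18 college-bound, 7 not) is inferred from the counts given in the solution.

```python
from fractions import Fraction

def prob_sequence(order, counts):
    """Probability of drawing the given order without replacement.

    order  -- sequence of category labels, e.g. ("C", "N")
    counts -- dict mapping each label to how many members start in the group
    """
    remaining = dict(counts)          # copy so the caller's dict is untouched
    total = sum(remaining.values())
    prob = Fraction(1)
    for label in order:
        prob *= Fraction(remaining[label], total)  # P(label | earlier draws)
        remaining[label] -= 1         # the selected student is removed
        total -= 1
    return prob

# Hypothetical labels: "C" = college-bound, "N" = non college-bound.
students = {"C": 18, "N": 7}

both_college = prob_sequence(("C", "C"), students)
# "One of each" can happen in two orders, so the two probabilities are added.
one_of_each = prob_sequence(("C", "N"), students) + prob_sequence(("N", "C"), students)

print(float(both_college))  # 0.51
print(float(one_of_each))   # 0.42
```

Exact fractions are used so that no rounding creeps in before the final conversion to a decimal.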
EXAMPLE
Solution:
This is a repeated trial experiment. A trial is rolling a die and there are two trials. The trials are
independent (what happens on the first roll does not affect what happens on the second roll).
1. We want to get a 5 on both rolls. In other words, a 5 on roll one AND a 5 on roll two. On the
first roll, the probability of getting a 5 is 1/6. On the second roll, the probability of getting a 5 is
1/6. Because the rolls are independent, the probability of getting a 5 on the second roll is not
affected by what happens on the first roll. The probability of getting two 5’s is
\begin{eqnarray*} \\ \mbox{Probability} & = & \frac{1}{6} \times
\frac{1}{6}=\frac{1}{36} \\ \\ \end{eqnarray*}
2. We want one 2 and one 6. In this case, we have to think about the order of the rolls: there is a
difference between 2 on roll one, 6 on roll two and 6 on roll one, 2 on roll two. All possible
orders must be accounted for when we calculate the probability. One of the two possible
orders must occur: (2, 6) OR (6, 2). For each of the individual orders, we multiply the
probabilities as we move from trial to trial. The “or” means that we add the probabilities of the
different orders.
For the first order (2 on roll one, 6 on roll two), the probability of getting a 2 on roll one is
1/6 and the probability of getting a 6 on roll two is 1/6. So the probability of getting this order
is 1/6 × 1/6 = 1/36.
Similarly for the second order (6 on roll one, 2 on roll two), the probability of getting a 6 on
roll one is 1/6 and the probability of getting a 2 on roll two is 1/6. So the probability of getting
this order is 1/6 × 1/6 = 1/36. Adding the two orders, the probability of getting one 2 and one 6
is 1/36 + 1/36 = 2/36 ≈ 0.0556.
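Because two rolls of a die have only 36 equally likely outcomes, the answers in this example can also be confirmed by listing every outcome and counting. The following is a small illustrative sketch in Python; the function name `probability` is an assumption made for this example.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (roll one, roll two) outcomes for a fair six-sided die.
outcomes = list(product(range(1, 7), repeat=2))

def probability(event):
    """Fraction of the equally likely outcomes for which event(outcome) is true."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))

two_fives = probability(lambda o: o == (5, 5))
# Either order counts: (2, 6) or (6, 2).
one_two_one_six = probability(lambda o: sorted(o) == [2, 6])

print(two_fives)        # 1/36
print(one_two_one_six)  # 1/18
```

Counting gives the same results as the multiplication rule: one favourable outcome out of 36 for two 5's, and two favourable outcomes out of 36 for one 2 and one 6.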
TRY IT
A box contains 30 microchips and 9 of those microchips are defective. Suppose two microchips
are selected randomly from the box for inspection by the quality control officer.
1.
2.
EXAMPLE
A box contains 5 red cards and 12 white cards. Suppose three cards are drawn at random from the
box without replacement.
Solution:
This is a repeated trial experiment. A trial is selecting a card and there are three trials. There are 17
cards in the box.
1. We want to get a white card on all three draws. In other words, white on draw one AND white
on draw two AND white on draw three. On the first draw, the probability of getting a white
card is 12/17. We are selecting without replacement, so after the first draw we assume that we
removed a white card from the box. This means that on the second draw, there are only 16
cards left in the box (one card was removed on the first draw) and there are only 11 white
cards left (on the first draw we removed one of the 12 white cards). So on the second draw, the
probability of getting a white card is 11/16. After the second draw we assume that we removed
white cards from the box on draws one and two. This means that on the third draw, there are
only 15 cards left in the box (two cards were removed on the first two draws) and there are
only 10 white cards left (white cards were removed on draws one and two). So on the third
draw, the probability of getting a white card is 10/15. The probability of getting three white
cards is
\begin{eqnarray*} \\ \mbox{Probability} & = & \frac{12}{17} \times \frac{11}{16}
\times \frac{10}{15}=0.3235 \\ \\ \end{eqnarray*}
2. We want one red card (R), so the other two cards must be white (W). In this case, we have to
think about the order of the selection. All possible orders must be accounted for when we
calculate the probability. One of the three possible orders must occur: RWW OR WRW OR WWR.
For the RWW order, the probability of red on draw one is 5/17. We are selecting without
replacement, so after the first trial we assume that we have removed one of the red cards. This
means that on the second draw, there are only 16 cards left in the box (one card was removed
on the first draw) and all 12 white cards are left (on the first draw we removed a red card). So
on the second draw, the probability of getting a white card is 12/16. After the second draw we
assume that we removed a red card on draw one and a white card on draw two. This means
that on the third draw, there are only 15 cards left in the box (two cards were removed on the
first two draws) and there are only 11 white cards left (one white card was removed on draw
two). So on the third draw, the probability of getting a white card is 11/15. So the probability of
getting the RWW order is 5/17 × 12/16 × 11/15 = 660/4080 ≈ 0.1618.
Using similar logic, the probability of getting the WRW order is 12/17 × 5/16 × 11/15 ≈ 0.1618 and
the probability of getting the WWR order is 12/17 × 11/16 × 5/15 ≈ 0.1618. Adding the three
orders, the probability of exactly one red card is 3 × 660/4080 = 1980/4080 ≈ 0.4853.
3. We want at least one red card in the three draws. This means we can have exactly one red card
or exactly two red cards or exactly three red cards. As before, we have to think about the
order of the selection. All possible orders must be accounted for when we calculate the
probability. Here, there are seven possible ways of getting at least one red card: RWW OR
WRW OR WWR OR RRW OR RWR OR WRR OR RRR. Of course, we could
work out the probabilities of each of these orders and add them all up. But there is a faster
way to find this probability: use the complement. The complement of “at least one red card”
is “exactly zero red cards.” When we look at the seven possible orders that make up the “at
least one red card” event, the complement consists of all of the missing orders. In this case
there is only one missing order, WWW, which is the event “exactly zero red cards.” Using
the complement, the probability of at least one red card is
\begin{eqnarray*} \\ P(\mbox{at least one red card}) & = & 1-P(\mbox{exactly zero
red cards}) \\ & = & 1-P(WWW) \\ & =
& 1-\left(\frac{12}{17} \times \frac{11}{16} \times \frac{10}{15}\right) \\ & = &
0.6765 \\ \\ \end{eqnarray*}
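4. We want at most one white card in the three draws. This means we can have exactly zero
white cards or exactly one white card. As before, we have to think about the order of the
selection. All possible orders must be accounted for when we calculate the probability. Here,
there are four possible ways of getting at most one white card: RRR OR RRW OR RWR
OR WRR. Using similar logic to above, the probability of getting the RRR order is
5/17 × 4/16 × 3/15 = 60/4080 ≈ 0.0147, and the probability of getting each of the RRW, RWR, and
WRR orders is 240/4080 ≈ 0.0588 (for example, 5/17 × 4/16 × 12/15 for RRW). Adding the four
orders, the probability of at most one white card is 780/4080 ≈ 0.1912.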
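All four parts of this example follow the same pattern: list the orders that satisfy the event, multiply along each order, and add. The sketch below automates that pattern in Python as a way to check the hand calculations. The names `order_probability` and `event_probability` are illustrative, and the labels "R" and "W" stand for red and white cards.

```python
from fractions import Fraction
from itertools import product

START = {"R": 5, "W": 12}  # 5 red and 12 white cards, as in the example

def order_probability(order):
    """Probability of one specific order of colours, drawing without replacement."""
    remaining = dict(START)
    total = sum(remaining.values())
    prob = Fraction(1)
    for colour in order:
        prob *= Fraction(remaining[colour], total)  # conditional on earlier draws
        remaining[colour] -= 1                      # the drawn card is not returned
        total -= 1
    return prob

orders = list(product("RW", repeat=3))  # all 8 possible orders of three draws

def event_probability(condition):
    """Add the probabilities of every order that satisfies the condition."""
    return sum(order_probability(o) for o in orders if condition(o))

three_white = order_probability(("W", "W", "W"))
exactly_one_red = event_probability(lambda o: o.count("R") == 1)
at_least_one_red = 1 - three_white                 # complement rule
at_most_one_white = event_probability(lambda o: o.count("W") <= 1)

print(round(float(three_white), 4))        # 0.3235
print(round(float(exactly_one_red), 4))    # 0.4853
print(round(float(at_least_one_red), 4))   # 0.6765
print(round(float(at_most_one_white), 4))  # 0.1912
```

Adding the seven "at least one red" orders directly gives the same value as the complement rule, which is a useful cross-check.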
TRY IT
A box contains 5 red cards and 12 white cards. Suppose three cards are drawn at random from the
box with replacement.
1.
2.
3.
4.
EXAMPLE
A company produces a popular brand of sports drink. The company is currently running a contest
where winning symbols are placed under the bottle caps. 7% of all the bottle caps contain winning
symbols. You buy three bottles of the sports drink.
1. What is the probability that all three of the bottles have a winning symbol?
2. What is the probability that exactly one of the bottles has a winning symbol?
3. What is the probability that at least one bottle has a winning symbol?
Solution:
This is a repeated trial experiment. A trial is selecting a bottle and there are three trials. This is an
experiment without replacement (you do not want to select the same bottle three times). However,
because the population of bottles is very, very large, we can treat the experiment as if the selections
are made with replacement. This means that we can treat the selections of the bottles as independent,
and so the probability of getting a winning bottle will be 7% on every draw.
1. We want to get a winning symbol on all three bottles. In other words, win on bottle one AND
win on bottle two AND win on bottle three. The probability of winning on the first bottle is
7%, the probability of winning on the second bottle is 7%, and the probability of winning on
the third bottle is 7%. Because we can treat the selections as independent, the probability of
winning does not change from draw to draw. The probability of getting three winning bottles
is
\begin{eqnarray*} \\ \mbox{Probability} & = & 0.07 \times 0.07 \times 0.07=0.000343 \\ \\ \end{eqnarray*}
2. We want one winning bottle, so the other two bottles must be non-winners. The
probability of winning on any bottle is 7%, so the probability of losing on any bottle is 93%. In
this case, we have to think about the order of the selection. All possible orders must be
accounted for when we calculate the probability. One of the three possible orders must occur:
the win is on bottle one OR the win is on bottle two OR the win is on bottle three. For each of
the individual orders, we multiply the probabilities as we move from draw to draw. The “or”
means that we add the probabilities of the different orders.
For the first order (win on bottle one, non-wins on bottles two and three), the probability
of getting a win on bottle one is 0.07 and the probability of getting a non-win on bottle two or
bottle three is 0.93. So the probability of getting this order is 0.07 × 0.93 × 0.93 ≈ 0.0605.
Using similar logic, the probability of getting each of the other two orders is also
approximately 0.0605. Adding the three orders, the probability of exactly one winning bottle is
3 × 0.07 × 0.93 × 0.93 ≈ 0.1816.
3. We want at least one winning bottle. This means we can have exactly one winning bottle or
exactly two winning bottles or exactly three winning bottles. Of course, we could work out the
probabilities of each of these orders and add them all up. But a faster way to find this
probability is to use the complement. The complement of “at least one winning bottle” is
“exactly zero winning bottles.” The “exactly zero winning bottles” event is the case where all
three bottles are non-winners. Using the complement, the probability of at least one winning bottle
is
\begin{eqnarray*} \\ P(\mbox{at least one winning bottle}) & = & 1-P(\mbox{exactly zero
winning bottles}) \\ & = & 1-\left(0.93 \times 0.93 \times 0.93\right) \\ & = &
0.1956 \\ \\ \end{eqnarray*}
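When the draws are treated as independent, the three answers in this example reduce to a few multiplications. The following Python sketch shows one way to check them; the variable names are illustrative, and the only inputs are the 7% winning probability and the three bottles stated in the example.

```python
p_win = 0.07          # probability a bottle cap has a winning symbol
p_lose = 1 - p_win    # probability a bottle cap does not

# Part 1: win AND win AND win.
all_three_win = p_win ** 3

# Part 2: the single win can be on bottle one, two, or three (three orders),
# and each order has probability p_win * p_lose * p_lose.
exactly_one_win = 3 * p_win * p_lose ** 2

# Part 3: complement of "exactly zero winning bottles".
at_least_one_win = 1 - p_lose ** 3

print(round(all_three_win, 6))     # 0.000343
print(round(exactly_one_win, 4))   # 0.1816
print(round(at_least_one_win, 4))  # 0.1956
```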
NOTE
In situations like this example where we are drawing without replacement from a very, very large
population, we treat the draws as if they are independent. Because the population is so large, the
change in the probability as we go from draw to draw is very, very small, which makes it hardly
detectable in the calculation of the answer. In such situations, we can treat the draws as
independent. We cannot do this when we are drawing without replacement from a small
population (as in the red and white card example above) because there are distinct changes in the
probabilities as we move from draw to draw.
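To see how small the change in probability really is, the exact without-replacement calculation can be compared with the independent approximation. The sketch below does this in Python for the event "no winning bottle in three draws". The population sizes (one million caps and 100 caps) are hypothetical values chosen only for illustration; the example does not state how many bottles exist.

```python
from fractions import Fraction

def p_no_winners(population, winners, draws=3):
    """Exact P(no winning cap in `draws` bottles), drawing without replacement."""
    losers = population - winners
    prob = Fraction(1)
    for k in range(draws):
        # After k non-winning draws, k losers and k caps in total are gone.
        prob *= Fraction(losers - k, population - k)
    return prob

independent = 0.93 ** 3                          # treats every draw as 93% "lose"
large = float(p_no_winners(1_000_000, 70_000))   # hypothetical large population
small = float(p_no_winners(100, 7))              # hypothetical small population

print(f"independent approximation: {independent:.6f}")
print(f"large population, exact:   {large:.6f}")
print(f"small population, exact:   {small:.6f}")
```

With the large population the exact answer agrees with the approximation to about six decimal places, while with the small population the difference already shows up in the third decimal place.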
Concept Review
In a repeated trial experiment, we deal with identical trials that are repeated a number of times. A
repeated trial experiment can be thought of as a joint probability. The multiplication rule for joint
probabilities is P(A AND B) = P(A) × P(B|A), where A is the event on the first trial and B is the
event on the second trial. In sampling with replacement each member of a population is replaced
after it is picked, so that member has the possibility of being chosen more than once, and the events
are considered to be independent. In sampling without replacement, each member of a population
may be chosen only once, and the events are considered to be dependent.
Attribution
“3.1 Terminology”, “3.2 Independent and Mutually Exclusive Events”, “3.3 Two Basic Rules of
Probability”, and “3.4 Contingency Tables“ in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
3.8 EXERCISES
1. In a particular college class, there are male and female students. Some students have long
hair and some students have short hair. Write the symbols for the probabilities of the events for
parts a through j. (Note that you cannot find numerical answers here. You were not given enough
information to find any probability values yet; concentrate on understanding the symbols.)
2. A box is filled with several party favors. It contains 12 hats, 15 noisemakers, ten finger traps, and
five bags of confetti. Let H be the event of getting a hat. Let N be the event of getting a noisemaker.
Let F be the event of getting a finger trap. Let C be the event of getting a bag of confetti.
a. Find P(H).
b. Find P(N).
c. Find P(F).
d. Find P(C).
3. A jar of 150 jelly beans contains 22 red jelly beans, 38 yellow, 20 green, 28 purple, 26 blue, and
the rest are orange.
a. Find .
b. Find .
c. Find .
d. Find .
e. Find .
4. There are 23 countries in North America, 12 countries in South America, 47 countries in Europe,
44 countries in Asia, 54 countries in Africa, and 14 in Oceania (Pacific Ocean region).
a. Find .
b. Find .
c. Find .
d. Find .
e. Find .
f. Find .
5. What is the probability of drawing a red card from a standard deck of 52 cards?
6. What is the probability of drawing a club in a standard deck of 52 cards?
7. What is the probability of rolling an even number of dots with a fair, six-sided die numbered
one through six?
8. What is the probability of rolling a prime number of dots with a fair, six-sided die numbered
one through six?
9. On a baseball team, there are infielders and outfielders. Some players are great hitters, and
some players are not great hitters.
a. Write the symbols for the probability that a player is not an outfielder.
b. Write the symbols for the probability that a player is an outfielder or is a great hitter.
c. Write the symbols for the probability that a player is an infielder and is not a great hitter.
d. Write the symbols for the probability that a player is a great hitter, given that the player is an
infielder.
e. Write the symbols for the probability that a player is an infielder, given that the player is a
great hitter.
f. Write the symbols for the probability that of all the outfielders, a player is not a great hitter.
g. Write the symbols for the probability that of all the great hitters, a player is an outfielder.
h. Write the symbols for the probability that a player is an infielder or is not a great hitter.
i. Write the symbols for the probability that a player is an outfielder and is a great hitter.
j. Write the symbols for the probability that a player is an infielder.
10. What is the word for the set of all possible outcomes?
11. What is conditional probability?
12. You are rolling a fair, six-sided number cube. Let E = the event that it lands on an even
number. Let M = the event that it lands on a multiple of three.
13. Explain what is wrong with the following statements. Use complete sentences.
a. If there is a 60% chance of rain on Saturday and a 70% chance of rain on Sunday, then there is
a 130% chance of rain over the weekend.
b. The probability that a baseball player hits a home run is greater than the probability that he
gets a successful hit.
a.
b. P(U|V)
c.
Shirt Number   At most 210   211–250   251–290   More than 290   Total
1–33           21            5         0         0               26
34–66          6             18        7         4               35
66–99          6             12        22        5               45
Total          33            35        29        9               106
For the following, suppose that you randomly select one player from the 49ers or Cowboys.
a. What is the probability that the player’s shirt number is in the 34-66 category?
b. What is the probability that the player weighs at most 210 lbs?
c. What is the probability that the player’s shirt number is in the 1-33 category and the player
weighs between 211 and 250 lbs?
d. What is the probability that the player’s shirt number is in the 66-99 category or the player
weighs more than 290 lbs?
e. What is the probability that the player’s shirt number is in the 34-66 category given that they
weigh between 251 and 290 lbs?
f. What is the probability that a player weighs at most 210 lbs if their shirt number is in the
1-33 category?
g. Are the events “66-99” and “more than 290 lbs” independent? Explain.
19. At a local college, 20% of the students are studying business, 40% of the students are studying
mathematics and 8% of the students are studying both business and mathematics.
a. What is the probability that a randomly selected student studies business or mathematics?
b. What is the probability that a randomly selected student studies mathematics given that they
study business?
c. What is the probability that a randomly selected mathematics student studies business?
d. Are the events “business” and “mathematics” independent? Explain.
e. Are the events “business” and “mathematics” mutually exclusive? Explain.
20. The casino game, roulette, allows the gambler to bet on the probability of a ball, which spins in
the roulette wheel, landing on a particular color, number, or range of numbers. The table used to
place bets consists of 38 numbers (0, 00, 1, 2,…,36), and each number is assigned to a color (green,
red or black) and a range. You can place a bet based on number, color, or range.
credit: film8ker/wikibooks
21. Suppose that you have eight cards. Five are green and three are yellow. The five green cards are
numbered 1, 2, 3, 4, and 5. The three yellow cards are numbered 1, 2, and 3. The cards are well
shuffled. You randomly draw one card.
22. A special deck of cards has ten cards. Four are green, three are blue, and three are red. When a
card is picked, its color is recorded.
a. Suppose three cards are picked without replacement. What is the probability that all three
cards are green?
b. Suppose three cards are picked without replacement. What is the probability that exactly two
of the cards are blue?
c. Suppose three cards are picked without replacement. What is the probability that at least one
of the cards is red?
d. Suppose three cards are picked without replacement. What is the probability that at most one
card is green?
e. Suppose three cards are picked with replacement. What is the probability that all three cards
are green?
f. Suppose three cards are picked with replacement. What is the probability that exactly two of
the cards are blue?
g. Suppose three cards are picked with replacement. What is the probability that at least one of
the cards is red?
h. Suppose three cards are picked with replacement. What is the probability that at most one
card is green?
a. Find .
b. Are and mutually exclusive? Why or why not?
c. Are and independent events? Why or why not?
d. Find .
e. Find .
24. In 1994, the U.S. government held a lottery to issue 55,000 Green Cards (permits for non-citizens
to work legally in the U.S.). Renate Deutsch, from Germany, was one of approximately 6.5 million
people who entered this lottery.
a. What was Renate’s chance of winning a Green Card? Write your answer as a probability
statement.
b. In the summer of 1994, Renate received a letter stating she was one of 110,000 finalists
chosen. Once the finalists were chosen, assuming that each finalist had an equal chance to
win, what was Renate’s chance of winning a Green Card? Write your answer as a conditional
probability statement.
c. Are “won a green card” and “finalist” independent or dependent events? Justify your answer
numerically and also explain why.
d. Are “won a green card” and “finalist” mutually exclusive events? Justify your answer
numerically and explain why.
25. The following table of data obtained from www.baseball-almanac.com shows hit information
for four players. Suppose that one hit from the table is randomly selected.
a. Are “the hit being made by Hank Aaron” and “the hit being a double” independent events?
Explain.
b. What is the probability that a hit was made by Babe Ruth?
c. What is the probability that a hit was made by Hank Aaron and is a home run?
d. What is the probability that a hit was made by Ty Cobb or is a single?
e. What is the probability that a hit was a double given that it was by Jackie Robinson?
f. What is the probability that a triple was hit by Babe Ruth?
26. United Blood Services is a blood bank that serves more than 500 hospitals in 18 states. According
to their website, a person with type O blood and a negative Rh factor (Rh-) can donate blood to any
person with any blood type. Their data show that 43% of people have type O blood and 15% of people
have Rh- factor; 52% of people have type O or Rh- factor.
a. Find the probability that a person has both type O blood and the Rh- factor.
b. Find the probability that a person does NOT have both type O blood and the Rh- factor.
27. At a college, 72% of courses have final exams and 46% of courses require research papers.
Suppose that 32% of courses have a research paper and a final exam.
a. Find the probability that a course has a final exam or a research project.
b. Find the probability that a course has NEITHER of these two requirements.
28. In a box of assorted cookies, 36% contain chocolate and 12% contain nuts. Of those, 8% contain
both chocolate and nuts. Sean is allergic to both chocolate and nuts.
a. Find the probability that a cookie contains chocolate or nuts (he can’t eat it).
b. Find the probability that a cookie does not contain chocolate or nuts (he can eat it).
29. A college finds that 10% of students have taken a distance learning class and that 40% of students
are part time students. Of the part time students, 20% have taken a distance learning class. Let D
= event that a student takes a distance learning class and E = event that a student is a part time
student.
a. Find the probability that a student takes a distance learning class and is a part-time student.
b. Find the probability that a student is a part-time student given that they take a distance
learning class.
c. Find the probability that a student is a part-time student or takes a distance learning class.
d. Are the events “distance learning” and “part-time” independent? Explain.
30. The table shows a random sample of musicians and how they learned to play their instruments.
Female 12 38 22 72
Male 19 24 15 58
Total 31 62 37 130
31. The table shows the political party affiliation of each of 67 members of the US Senate in June
2012, and when they are up for reelection.
November 2014 20 13 0
November 2016 10 24 0
Total
a. What is the probability that a randomly selected senator has an “Other” affiliation?
b. What is the probability that a randomly selected senator is up for reelection in November
2016?
c. What is the probability that a randomly selected senator is a Democrat and up for reelection
in November 2016?
d. What is the probability that a randomly selected senator is a Republican or is up for reelection
in November 2014?
e. Suppose that a member of the US Senate is randomly selected. Given that the randomly
selected senator is up for reelection in November 2016, what is the probability that this
senator is a Democrat?
f. Suppose that a member of the US Senate is randomly selected. What is the probability that
the senator is up for reelection in November 2014, knowing that this senator is a Republican?
32. Table identifies a group of children by one of four hair colors, and by type of hair.
Wavy 20 15 3 43
Straight 80 15 12
Totals 20 215
33. A box of cookies contains three chocolate and seven butter cookies. Miguel randomly selects a
cookie and eats it. Then he randomly selects another cookie and eats it. (How many cookies did he
take?)
a. Let S be the event that both cookies selected were the same flavor. Find P(S).
b. Let T be the event that the cookies selected were different flavors. Find P(T).
c. Let U be the event that the second cookie selected is a butter cookie. Find P(U).
34. A cup contains three red, four yellow and five blue beads.
a. Suppose three beads are selected at random without replacement. What is the probability all
three beads are blue?
b. Suppose three beads are selected at random with replacement. What is the probability all
three beads are blue?
c. Suppose three beads are selected at random without replacement. What is the probability
that exactly one of the beads is red?
d. Suppose three beads are selected at random with replacement. What is the probability that
exactly one of the beads is red?
e. Suppose three beads are selected at random without replacement. What is the probability
that at least one bead is yellow?
f. Suppose three beads are selected at random with replacement. What is the probability that at
least one bead is yellow?
35. The percent of licensed U.S. drivers (from a recent year) that are female is 48.60. Of the females,
5.03% are age 19 and under; 81.36% are age 20–64; 13.61% are age 65 or over. Of the licensed U.S.
male drivers, 5.04% are age 19 and under; 81.43% are age 20–64; 13.53% are age 65 or over.
b. Find the probability a driver is 65 or over given that they are female.
c. Find the probability a driver is 65 or over and female.
d. In words, explain the difference between the probabilities in part b and part c.
e. Find the probability a driver is 65 or over.
f. Are being age 65 or over and being female mutually exclusive events? How do you know?
36. Approximately 86.5% of Americans commute to work by car, truck, or van. Out of that group,
84.6% drive alone and 15.4% drive in a carpool. Approximately 3.9% walk to work and
approximately 5.3% take public transportation.
a. Assuming that the walkers walk alone, what percent of all commuters travel alone to work?
b. Suppose that 1,000 workers are randomly selected. How many would you expect to travel
alone to work?
c. Suppose that 1,000 workers are randomly selected. How many would you expect to drive in a
carpool?
37. When the Euro coin was introduced in 2002, two math professors had their statistics students
test whether the Belgian one Euro coin was a fair coin. They spun the coin rather than tossing it
and found that out of 250 spins, 140 showed a head (event H) while 110 showed a tail (event T).
On that basis, they claimed that it is not a fair coin.
Attribution
Chapter Outline
You can use probability and discrete random variables to calculate the likelihood of
lightning striking the ground five times during a half-hour thunderstorm. Photo by
Leszek Leszczynski, CC BY 2.0.
A student takes a ten-question, true-false quiz. Because the student had such a busy schedule, he
or she could not study and guesses randomly at each answer. What is the probability of the student
passing the test with at least a 70%?
Small companies might be interested in the number of long-distance phone calls their employees
make during the peak time of the day. Suppose the average is 20 calls. What is the probability that
the employees make more than 20 long-distance phone calls during the peak time?
These two examples illustrate two different types of probability problems involving discrete
random variables. Recall that discrete data are data that you can count. A random variable
describes the outcomes of a statistical experiment in words. The values of a random variable can
vary with each repetition of an experiment.
240 | 4.1 INTRODUCTION TO DISCRETE RANDOM VARIABLES
Random Variables
Upper case letters such as X or Y denote a random variable. Lower case letters like x or y denote
the value of a random variable. If X is a random variable, then X is written in words, and x is given
as a number.
For example, let X be the number of heads you get when you toss three fair coins. The sample
space for the toss of three fair coins is
{TTT, THH, HTH, HHT, HTT, THT, TTH, HHH}. Then, x = 0, 1, 2, 3. X is in words and x is a number.
Notice that for this example, the x values are countable outcomes. Because you can count the
possible values that X can take on and the outcomes are random (the x values are 0, 1, 2, 3), X is
a discrete random variable.
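The three-coin example can be reproduced by listing the sample space and counting heads in each outcome. The following is a short illustrative sketch in Python; the variable names are assumptions made for this example. It also shows how often each value of the random variable occurs among the eight equally likely outcomes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for tossing three fair coins: 8 equally likely outcomes.
sample_space = ["".join(toss) for toss in product("HT", repeat=3)]

# The random variable counts the heads in each outcome.
heads_counts = Counter(outcome.count("H") for outcome in sample_space)

# Share of outcomes that produce each value of the random variable.
distribution = {
    x: Fraction(n, len(sample_space)) for x, n in sorted(heads_counts.items())
}

print(sample_space)
for x, p in distribution.items():
    print(x, p)
```

The values 0, 1, 2, and 3 are the only possible counts, which is why the number of heads is a discrete random variable.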
A random variable describes a characteristic of interest in a population being studied.
Common notation for variables is upper case Latin letters X, Y, Z,… and common notation for a
specific value from the domain (set of all possible values of a variable) is lower case Latin letters
x, y, and z. For example, if X is the number of children in a family, then x represents a specific
integer 0, 1, 2, 3,…. Variables in statistics differ from variables in intermediate algebra in the
two following ways:
• The domain of the random variable is not necessarily a numerical set. The domain may be
expressed in words. For example, if X is hair color then the domain is {black, blond, gray,
green, orange}.
• We can tell what specific value the random variable takes only after performing the
experiment.
Attribution
LEARNING OBJECTIVES
A random variable describes the outcomes of a statistical experiment in words. The values of a
random variable can vary with each repetition of an experiment.
Watch this video: Random Variables and Probability Distributions by Dr Nic’s Maths and Stats [4:38]
The probability distribution for a random variable lists all the possible values of the random
variable and the probability the random variable takes on each value. The probability distribution
for a random variable describes how probabilities are distributed over the values of the random
variable. A probability distribution can be a table, with a column for the values of the random
variable and another column for the corresponding probability, or a graph, like a histogram with
the values of the random variable on the horizontal axis and the probabilities on the vertical axis.
242 | 4.2 PROBABILITY DISTRIBUTION OF A DISCRETE RANDOM VARIABLE
In a probability distribution, each probability is between 0 and 1, inclusive. Because all possible
values of the random variable are included in the probability distribution, the sum of the
probabilities is 1.
EXAMPLE
A child psychologist is interested in the number of times a newborn baby’s crying wakes its mother
after midnight. For a random sample of 50 mothers, the following information was obtained. Let X
be the number of times per week a newborn baby’s crying wakes its mother after midnight. For this
example, the values of the random variable are x = 0, 1, 2, 3, 4, 5.
In the table, the left column contains all of the possible values of the random variable and the right
column, P(x), is the probability that X takes on the corresponding value x. For example, in the
first row, the value of the random variable is 0 and the right column gives the probability that the
random variable is 0. In the context of this example, that is the probability that a newborn baby’s
crying wakes its mother zero times per week after midnight.
Because X can only take on the values 0, 1, 2, 3, 4, and 5, X is a discrete random variable. Note that
each probability is between 0 and 1 and the sum of the probabilities is 1.
TRY IT
Suppose Nancy has classes three days a week. She attends classes three days a week 80% of the time,
two days a week 15% of the time, one day a week 4% of the time, and no days 1% of the time.
Suppose one week is randomly selected.
3.
0 0.01
1 0.04
2 0.15
3 0.80
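The two requirements for a probability distribution (each probability between 0 and 1, and probabilities that sum to 1) are easy to check mechanically. The sketch below applies them to the attendance table above. It is an illustration in Python, and the function name `is_valid_distribution` and the tolerance value are assumptions rather than anything defined in the text.

```python
import math

def is_valid_distribution(dist, tol=1e-9):
    """Check the two requirements for a discrete probability distribution."""
    probs = list(dist.values())
    in_range = all(0 <= p <= 1 for p in probs)                   # each P(x) in [0, 1]
    sums_to_one = math.isclose(sum(probs), 1.0, abs_tol=tol)     # probabilities sum to 1
    return in_range and sums_to_one

# Number of days Nancy attends class in a week and the matching probabilities.
nancy = {0: 0.01, 1: 0.04, 2: 0.15, 3: 0.80}

print(is_valid_distribution(nancy))             # True
print(is_valid_distribution({0: 0.5, 1: 0.6}))  # False, the probabilities sum to 1.1
```

A small tolerance is used for the sum because decimal probabilities are stored as floating-point numbers and may not add to exactly 1.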
EXAMPLE
Jeremiah has basketball practice two days a week. Ninety percent of the time, he attends both
practices. Eight percent of the time, he attends one practice. Two percent of the time, he does not
attend either practice. What is X and what values does it take on?
Solution:
X is the number of days Jeremiah attends basketball practice per week. X takes on the values 0, 1,
and 2.
Watch this video: Constructing a Probability Distribution for a Random Variable by Khan Academy [6:47]
Concept Review
A probability distribution for a random variable describes how the probabilities are distributed over
the random variable—in other words, the probability distribution describes the probability that the
random variable takes on a specific value. A probability distribution includes all possible values the
random variable can take on and the corresponding probability. Each probability is between 0 and
1, inclusive, and the sum of the probabilities is 1.
Attribution
“4.1 Probability Distribution Function (PDF) for a Discrete Random Variable“ in Introductory
Statistics by OpenStax is licensed under a Creative Commons Attribution 4.0 International License.
4.3 EXPECTED VALUE AND STANDARD
DEVIATION FOR A DISCRETE PROBABILITY
DISTRIBUTION
LEARNING OBJECTIVES
The expected value is often referred to as the “long-term” average or mean. That is, over the
long term of repeatedly doing an experiment, you would expect this average.
Suppose you toss a coin and record the result. What is the probability that the result is heads?
If you flip a coin two times, does the probability tell you that these flips will result in one head
and one tail? You might toss a fair coin ten times and record nine heads. Probability does not
describe the short-term results of an experiment. Probability gives information about what can be
expected in the long term. To demonstrate this, Karl Pearson once tossed a fair coin 24,000 times!
He recorded the results of each toss, obtaining heads 12,012 times. In his experiment, Pearson
illustrated the Law of Large Numbers.
The Law of Large Numbers states that, as the number of trials in a probability experiment
increases, the difference between the theoretical probability of an event and the relative frequency
approaches zero—the theoretical probability and the relative frequency get closer and closer
together. When evaluating the long-term results of statistical experiments, we often want to know
the “average” outcome. This “long-term average” is known as the mean or expected value of
the experiment and is denoted by E(x) or μ. In other words, after conducting many trials of an
experiment, you would expect this average value.
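The Law of Large Numbers can be seen in a small simulation. The sketch below is illustrative only: it uses Python's pseudo-random number generator in place of a physical coin, and the function name `relative_frequency_of_heads` and the fixed seed are assumptions made for this example.

```python
import random

def relative_frequency_of_heads(num_tosses, seed=0):
    """Simulate fair coin tosses and return the fraction that land heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses

# The gap between the relative frequency and 0.5 tends to shrink as n grows.
for n in (10, 1_000, 100_000):
    freq = relative_frequency_of_heads(n)
    print(f"{n:>7} tosses: relative frequency = {freq:.4f}, gap = {abs(freq - 0.5):.4f}")
```

A short run can land far from 0.5, just as ten real tosses can produce nine heads. Only the long run is expected to settle near the theoretical probability.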
The expected value, denoted by E(x) or μ, is a weighted average where each value of the
random variable is weighted by the value’s corresponding probability.
\begin{eqnarray*} E(x) & = & \sum \left( x \times P(x) \right) \end{eqnarray*}
EXAMPLE
A men’s soccer team plays soccer zero, one, or two days a week. The probability that they play zero
days is 0.2, the probability that they play one day is 0.5, and the probability that they play two days is
0.3. Find the long-term average or expected value of the number of days per week the men’s soccer
team plays soccer.
Solution:
First, let the random variable x be the number of days the men’s soccer team plays soccer per week.
x takes on the values 0, 1, 2. The table below shows the probability distribution for x and includes
an additional column, x × P(x), that we will use to calculate the expected value. In this new
column, we multiply each value of x by its corresponding probability.
x    P(x)    x × P(x)
0    0.2     0
1    0.5     0.5
2    0.3     0.6
Add the last column to find the long-term average or expected value:
\begin{eqnarray*} E(x) & = & 0 + 0.5 + 0.6 \\ & = & 1.1 \end{eqnarray*}
The expected value is 1.1. The men’s soccer team would, on the average, expect to play soccer 1.1
days per week. The number 1.1 is the long-term average or expected value if the men’s soccer team
plays soccer week after week after week.
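The weighted-average calculation in this example can be sketched in a few lines of Python. This is an illustration rather than part of the Excel workflow used in the text, and the function name `expected_value` is an assumption.

```python
def expected_value(distribution):
    """Return the sum of x * P(x) over a {value: probability} mapping."""
    return sum(x * p for x, p in distribution.items())

# Number of days per week the soccer team plays, with its probabilities
soccer = {0: 0.2, 1: 0.5, 2: 0.3}
print(round(expected_value(soccer), 4))  # 1.1
```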
NOTE
The expected value does not have to be a value that the random variable takes on. The expected
value is an average. In this case, the expected value of 1.1 is the average number of days the team
plays per week. To understand what this means, imagine that each week you record the number of
days the soccer team played that week, and that you do this for many, many weeks. Then you
calculate the mean of the numbers you recorded (using the techniques we learned previously). The
mean of these numbers will be close to 1.1, the expected value. The number of trials must be very
large for the mean of the recorded values to settle near the expected value calculated
using the expected value formula.
Like data, probability distributions have standard deviations. The standard deviation, denoted
σ, of a probability distribution for a random variable describes the spread or variability of the
probability distribution. The standard deviation is the standard deviation you expect when doing
an experiment over and over.
\begin{eqnarray*}\sigma & = & \sqrt{\sum \left( (x-\mu)^2 \times P(x) \right)}
\end{eqnarray*}
To calculate the standard deviation of a probability distribution, find each deviation from its
expected value, square it, multiply it by its probability, add the products, and take the square root.
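The steps just listed translate directly into code. The sketch below is illustrative only (the function names are assumptions) and reuses the soccer distribution from the earlier example, whose expected value is 1.1.

```python
from math import sqrt

def expected_value(distribution):
    """Sum of x * P(x) over a {value: probability} mapping."""
    return sum(x * p for x, p in distribution.items())

def standard_deviation(distribution):
    """Square root of the sum of (x - mu)^2 * P(x)."""
    mu = expected_value(distribution)
    return sqrt(sum((x - mu) ** 2 * p for x, p in distribution.items()))

soccer = {0: 0.2, 1: 0.5, 2: 0.3}
# (0 - 1.1)^2 * 0.2 + (1 - 1.1)^2 * 0.5 + (2 - 1.1)^2 * 0.3 = 0.49
print(round(standard_deviation(soccer), 4))  # 0.7
```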
EXAMPLE
Let x be the number of times per week a newborn baby’s crying wakes its mother after midnight.
The probability distribution for x is:
x    P(x)
0    0.04
1    0.22
2    0.46
3    0.18
4    0.08
5    0.02
Find the expected value and standard deviation of the number of times a newborn baby’s crying
wakes its mother after midnight.
Solution:
x    P(x)    x × P(x)
0    0.04    0
1    0.22    0.22
2    0.46    0.92
3    0.18    0.54
4    0.08    0.32
5    0.02    0.1
\begin{eqnarray*} \mu & = & 0 + 0.22 + 0.92 + 0.54 + 0.32 + 0.1 \\ & = & 2.1 \end{eqnarray*}
On average, a newborn wakes its mother after midnight 2.1 times per week.
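For the standard deviation: For each value of x, multiply the square of its deviation by its
probability (each deviation has the format x − μ).
x    P(x)    (x − μ)² × P(x)
0    0.04    (0 − 2.1)² × 0.04 = 0.1764
1    0.22    (1 − 2.1)² × 0.22 = 0.2662
2    0.46    (2 − 2.1)² × 0.46 = 0.0046
3    0.18    (3 − 2.1)² × 0.18 = 0.1458
4    0.08    (4 − 2.1)² × 0.08 = 0.2888
5    0.02    (5 − 2.1)² × 0.02 = 0.1682
Sum          1.05
Add the values in the third column of the table and then take the square root of this sum:
\begin{eqnarray*} \sigma & = & \sqrt{1.05} \\ & \approx & 1.0247 \end{eqnarray*}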
TRY IT
A hospital researcher is interested in the number of times the average post-op patient will ring the
nurse during a 12-hour shift. Let x be the number of times a post-op patient rings for the nurse.
For a random sample of 50 patients, the following information was obtained. What is the expected
value? What is the standard deviation?
x    P(x)
0    0.08
1    0.16
2    0.32
3    0.28
4    0.12
5    0.04
x    P(x)    x × P(x)
0    0.08    0
1    0.16    0.16
2    0.32    0.64
3    0.28    0.84
4    0.12    0.48
5    0.04    0.2
\begin{eqnarray*} \mu & = & 0 + 0.16 + 0.64 + 0.84 + 0.48 + 0.2 \\ & = & 2.32 \end{eqnarray*}
x    P(x)    (x − μ)² × P(x)
0    0.08    0.430592
1    0.16    0.278784
2    0.32    0.032768
3    0.28    0.129472
4    0.12    0.338688
5    0.04    0.287296
Sum          1.4976
\begin{eqnarray*} \sigma & = & \sqrt{1.4976} \\ & \approx & 1.2238 \end{eqnarray*}
EXAMPLE
Suppose you play a game of chance in which five numbers are chosen from 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. A
computer randomly selects five numbers from zero to nine with replacement. You pay $2 to play and
could profit $100,000 if you match all five numbers in order (you get your $2 back plus $100,000).
Over the long term, what is your expected profit of playing the game?
Solution:
To do this problem, set up an expected value table for the amount of money you can profit. Let x be
the amount of money you profit. The values of x are not 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. Because you are
interested in your profit (or loss), the values of x are $100,000 and −$2.
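To win, you must get all five numbers correct, in order. The probability of choosing one correct
number is 1/10 because there are ten numbers. You may choose a number more than once. The
probability of choosing all five numbers correctly and in order is (1/10)⁵ = 0.00001, so the
probability of winning is 0.00001 and the probability of losing is 1 − 0.00001 = 0.99999.
x          P(x)       x × P(x)
−2         0.99999    −1.99998
100,000    0.00001    1
\begin{eqnarray*} E(x) & = & -1.99998 + 1 \\ & = & -0.99998 \end{eqnarray*}
Because –0.99998 is about –1, you would, on average, expect to lose approximately $1 for each game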
you play. However, each time you play, you either lose $2 or profit $100,000. The $1 is the average (or
expected) LOSS per game after playing this game over and over.
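The expected profit for this game can be checked with a short Python sketch. It is an illustration only; the variable names are assumptions, and the win probability is computed as (1/10)⁵ because five digits must all match, in order, with ten equally likely choices for each.

```python
# Probability of matching all five digits, in order
p_win = (1 / 10) ** 5
p_lose = 1 - p_win

# Profit of $100,000 on a win, loss of the $2 stake otherwise
profit_distribution = {100_000: p_win, -2: p_lose}
expected_profit = sum(x * p for x, p in profit_distribution.items())

print(round(expected_profit, 5))  # -0.99998
```

The result is negative, so over many plays the game costs the player about $1 per game on average, even though no single game ever ends with a $1 loss.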
TRY IT
You are playing a game of chance in which four cards are drawn from a standard deck of 52 cards.
You guess the suit of each card before it is drawn. The cards are replaced in the deck on each draw.
You pay $1 to play. If you guess the right suit every time, you get your money back and $256. What is
your expected profit of playing the game over the long term?
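Let x be the amount of money you profit. The values of x are –$1 (for a loss) and $256 (for a win).
The probability of guessing all four suits correctly is (1/4)⁴ = 1/256, so the probability of losing is 255/256.
\begin{eqnarray*} E(x) & = & 256 \times \frac{1}{256} + (-1) \times \frac{255}{256} \\ & = & \frac{1}{256} \\ & \approx & 0.0039 \end{eqnarray*}
Playing the game over and over again means you would average about $0.0039 in profit per game.
(Rounding the probabilities to 0.0039 and 0.9961 before multiplying gives $0.0023 instead.)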
EXAMPLE
Suppose you play a game with a biased coin where the probability of heads is 2/3. You play each
game by tossing the coin once. If you toss a head, you pay $6. If you toss a tail, you win $10. If you
play this game many times, will you come out ahead?
Solution:
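1. Let x be the amount of profit per game. The values of x are –$6 (for a loss) and $10 (for a
win).
2. The probability of a loss (heads) is 2/3 and the probability of a win (tails) is 1/3.
3. E(x) = (−6)(2/3) + (10)(1/3) = −4 + 3.33 = −0.67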
On average, you lose $0.67 each time you play the game, so you do not come out ahead.
TRY IT
Suppose you play a game with a spinner that has three colours on it: red, green, and blue. The
probability of landing on red is 40% and the probability of landing on green is 20%. You play a game
by spinning the spinner once. If you land on red, you pay $10. If you land on blue, you do not pay or
win anything. If you land on green, you win $10. What is the expected value of this game? Do you
come out ahead?
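Let x be the amount won in a game. The values of x are –$10 (for red), $0 (for blue), and $10 (for
green). The probability of landing on blue is 1 − 0.40 − 0.20 = 0.40.
E(x) = (−10)(0.40) + (0)(0.40) + (10)(0.20) = −2
On average, you lose $2 per game. So you do not come out ahead.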
TRY IT
On May 11, 2013 at 9:30 PM, the probability that moderate seismic activity (one moderate
earthquake) would occur in the next 48 hours in Japan was about 1.08%. You bet that a moderate
earthquake will occur in Japan during this period. If you win the bet, you win $100. If you lose the
bet, you pay $10. Let x be the amount of profit from a bet. Find the mean and standard deviation of
x.
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=98#oembed-1
Watch this video: Mean of a Discrete Random Variable by Khan Academy [4:31]
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=98#oembed-2
Watch this video: Variance and Standard Deviation of a Discrete Random Variable by Khan Academy [6:25]
Concept Review
The expected value, or mean, of a discrete random variable predicts the long-term results of a
statistical experiment that has been repeated many times. The standard deviation of a probability
distribution is used to measure the variability of possible outcomes.
• Standard Deviation: σ = √( Σ( (x − μ)² × P(x) ) )
Attribution
“4.2 Mean or Expected Value and Standard Deviation“ in Introductory Statistics by OpenStax is
licensed under a Creative Commons Attribution 4.0 International License.
4.4 THE BINOMIAL DISTRIBUTION
LEARNING OBJECTIVES
A binomial experiment has the following characteristics:
1. There are a fixed number of trials. Think of trials as repetitions of an experiment. The letter
n denotes the number of trials.
2. There are only two possible outcomes, called “success” and “failure,” for each trial. The letter
p denotes the probability of a success on any one trial and q denotes the probability of a
failure on one trial.
3. The trials are independent and are repeated using identical conditions.
4. For each individual trial, the probability of a success, p, and probability of a failure, q,
remain the same. Because the trials are independent, the outcome of one trial does not
affect the outcome of another trial.
For example, randomly guessing at a true-false statistics question has only two outcomes. If a
success is guessing correctly, then a failure is guessing incorrectly. Suppose Joe guesses
correctly on any statistics true-false question with probability p. Then the probability that he
guesses incorrectly is q = 1 − p. This means that for every true-false statistics question Joe
answers, his probability of success p and his probability of failure q remain the same.
The outcomes of a binomial experiment fit a binomial probability distribution. The random
variable x is the number of successes obtained in the n independent trials. The mean of a binomial
probability distribution is μ = n × p and the standard deviation is σ = √(n × p × q).
Any experiment with the characteristics of a binomial experiment and where n = 1 is called a
Bernoulli Trial (named after Jacob Bernoulli who, in the late 1600s, studied them extensively). A
binomial experiment takes place when the number of successes is counted in one or more Bernoulli
Trials.
EXAMPLE
At ABC College, the withdrawal rate from an elementary physics course is 30% for any given term.
This implies that, for any given term, 70% of the students stay in the class for the entire term. A
“success” could be defined as an individual who withdrew from the course. The random variable
x is the number of students who withdraw from the randomly selected elementary physics class.
TRY IT
The state health board is concerned about the amount of fruit available in school lunches. 48% of
schools in the state offer fruit in their lunches every day. This implies that 52% do not. What would
a “success” be in this case?
• A success would be a school that offers fruit in their lunch every day.
EXAMPLE
Suppose you play a game that you can only either win or lose. The probability that you win any
game is 55% and the probability that you lose is 45%. Each game you play is independent. If you play
the game 20 times, write the function that describes the probability that you win 15 of the 20 times.
Solution:
If you define x as the number of wins, then x takes on the values 0, 1, 2, 3, …, 20. The probability
of a success is p = 0.55. The probability of a failure is q = 0.45. The number of trials is
n = 20. The probability question can be stated mathematically as P(x = 15).
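For readers who want to see the number behind P(x = 15), the sketch below evaluates the standard binomial probability formula, P(x = k) = C(n, k) × p^k × q^(n − k). That formula is not written out in this section, so treat the code as a supplementary illustration; the function name `binomial_probability` is an assumption.

```python
from math import comb, sqrt

def binomial_probability(k, n, p):
    """P(x = k) for a binomial random variable with n trials and success probability p."""
    q = 1 - p
    return comb(n, k) * p**k * q ** (n - k)

n, p = 20, 0.55
prob_15_wins = binomial_probability(15, n, p)
mean = n * p                     # mu = n * p
std_dev = sqrt(n * p * (1 - p))  # sigma = sqrt(n * p * q)

print(f"P(x = 15) = {prob_15_wins:.4f}")
print(f"mean = {mean:.2f}, standard deviation = {std_dev:.4f}")
```

Summing `binomial_probability(k, n, p)` over k = 0, 1, …, 20 gives 1, which is a quick check that the probabilities form a valid distribution.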
EXAMPLE
Approximately 70% of statistics students do their homework in time for it to be collected and graded.
Each student does homework independently. In a statistics class of 50 students, what is the
probability that at least 40 will do their homework on time? Students are selected randomly.
1. This is a binomial problem because there is only a success or a __________, there are a fixed
number of trials, and the probability of a success is 0.70 for each trial.
2. If we are interested in the number of students who do their homework on time, then how do
we define x?
3. What values does x take on?
4. What is a “failure,” in words?
5. What is the probability of “failure”?
6. The words “at least” translate as what kind of inequality for the probability question
P(x ____ 40)?
Solution:
1. failure
2. x is the number of statistics students who do their homework on time.
3. 0, 1, 2, …, 50
4. Failure is defined as a student who does not complete his or her homework on time.
5. q = 1 − 0.70 = 0.30
6. “At least” means greater than or equal to (≥). The probability question is P(x ≥ 40).
TRY IT
Sixty-five percent of people pass the state driver’s exam on the first try. A group of 50 individuals
who have taken the driver’s exam is randomly selected. Why is this a binomial problem?
EXAMPLE
The following example illustrates a problem that is not binomial. It violates the condition of
independence. ABC College has a student advisory committee made up of ten staff members and six
students. The committee wishes to choose a chairperson and a recorder. What is the probability that
the chairperson and recorder are both students?
Solution:
The names of all committee members are put into a box and two names are drawn without
replacement. The first name drawn determines the chairperson and the second name the recorder.
There are two trials. However, the trials are not independent because the outcome of the first trial
affects the outcome of the second trial. The probability of a student on the first draw is 6/16. If
a student is drawn first, the probability of a student on the second draw is 5/15. The probability
of drawing a student’s name changes for each of the trials and, therefore, violates the condition
of independence.
TRY IT
A lacrosse team is selecting a captain. The names of all the seniors are put into a hat and the first
three that are drawn will be the captains. The names are not replaced once they are drawn (one
person cannot be two captains). You want to see if the captains all play the same position. State
whether or not this is binomial and state why.
This is not binomial because the names are not replaced after each draw, which means the
probability changes for each time a name is drawn. This violates the condition of independence.
To calculate probabilities associated with binomial random variables in Excel, use the
binom.dist(x,n,p,logic operator) function. The function returns:
• the probability of getting exactly x successes in n trials with a probability of success p when the
logic operator is false.
• the probability of getting at most x successes in n trials with a probability of success p when the
logic operator is true.
Visit the Microsoft page for more information about the binom.dist function.
NOTE
Because we can only enter false or true into the logic operator, the binom.dist function can only
directly calculate the probability of getting exactly x successes in n trials or getting at most x successes
in n trials. In order to calculate other binomial probabilities, such as fewer than x successes, more
than x successes or at least x successes, we need to manipulate how we use the binom.dist function
by changing what we enter into the binom.dist function, using the complement rule, or both.
EXAMPLE
It has been stated that about 41% of adult workers have a high school diploma but do not pursue any
further education. Suppose 20 adult workers are randomly selected.
1. How many adult workers in the sample do you expect to have a high school diploma but do
not pursue any further education?
2. What is the probability that exactly 8 of the workers in the sample have a high school diploma
but do not pursue further education?
3. What is the probability that at most 12 of the workers in the sample have a high school
diploma but do not pursue further education?
Solution:
Let x be the number of workers in the sample who have a high school diploma but do not pursue
further education. The number of trials is n = 20 and the probability of success is p = 0.41.
1. The expected number of workers is μ = n × p = 20 × 0.41 = 8.2.
2. We want to find P(x = 8).
binom.dist    Value    Answer
Field 1       8        0.1790
Field 2       20
Field 3       0.41
Field 4       false
The probability that exactly 8 of the workers in the sample have a high school diploma but do not
pursue further education is 0.1790.
3. We want to find P(x ≤ 12).
binom.dist    Value    Answer
Field 1       12       0.9738
Field 2       20
Field 3       0.41
Field 4       true
The probability that at most 12 of the workers in the sample have a high school diploma but do not
pursue further education is 0.9738.
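Outside Excel, the same two kinds of probability can be computed with a small helper. The Python sketch below is an illustration, not Excel's implementation: the function `binom_dist` is a hypothetical stand-in that only mirrors the argument order of binom.dist, with a Boolean in place of the logic operator.

```python
from math import comb

def binom_dist(x, n, p, cumulative):
    """Return P(X = x) when cumulative is False, or P(X <= x) when it is True."""
    def pmf(k):
        return comb(n, k) * p**k * (1 - p) ** (n - k)
    if cumulative:
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)

n, p = 20, 0.41
print(f"expected value: {n * p:.1f}")                       # 8.2
print(f"P(x = 8)   = {binom_dist(8, n, p, False):.4f}")     # 0.1790
print(f"P(x <= 12) = {binom_dist(12, n, p, True):.4f}")     # 0.9738
```

The cumulative branch simply adds the exact probabilities for 0, 1, …, x, which is what "at most x successes" means.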
TRY IT
About 32% of students participate in a community volunteer program outside of school. Suppose 30
students are selected at random.
1. What is the expected number of students in the sample that participate in a community
volunteer program?
2. What is the probability that exactly 10 of the students in the sample participate in a
community volunteer program?
1. The expected number of students is μ = n × p = 30 × 0.32 = 9.6.
Field 2 30
Field 3 0.32
Field 4 false
Field 2 30
Field 3 0.32
Field 4 true
EXAMPLE
In the 2013 Jerry’s Artarama art supplies catalog, there are 560 pages and 1.5% of the pages feature
signature artists. Suppose 100 pages are randomly selected from the catalog.
1. What is the probability that fewer than 3 of the pages in the sample feature signature artists?
2. What is the probability that more than 5 of the pages in the sample feature signature artists?
3. What is the probability that at least 4 of the pages in the sample feature signature artists?
4. What is the probability that between 2 and 6 of the pages in the sample feature signature
artists?
Solution:
1. We want to find P(x < 3). We cannot find this probability directly in Excel because the
binom.dist function can only calculate = or ≤ probabilities. Because x must be an integer (it
is the number of pages), x < 3 is the same as x ≤ 2 (of course, in general, this is not true).
So P(x < 3) = P(x ≤ 2), and P(x ≤ 2) is a probability we can calculate with the
binom.dist function.
binom.dist    Value    Answer
Field 1       2        0.8098
Field 2       100
Field 3       0.015
Field 4       true
2. We want to find P(x > 5). We cannot find this probability directly in Excel because the
binom.dist function can only calculate = or ≤ probabilities. The complement of x > 5 is x ≤ 5, so
P(x > 5) = 1 − P(x ≤ 5), and P(x ≤ 5) is a probability we can calculate with the
binom.dist function.
1 − binom.dist    Value    Answer
Field 1           5        0.0041
Field 2           100
Field 3           0.015
Field 4           true
3. We want to find P(x ≥ 4). We cannot find this probability directly in Excel because the
binom.dist function can only calculate = or ≤ probabilities. The complement of x ≥ 4 is x < 4, so
P(x ≥ 4) = 1 − P(x < 4). Because x must be an integer (it is the number of
pages), x < 4 is the same as x ≤ 3. So P(x ≥ 4) = 1 − P(x < 4) = 1 − P(x ≤ 3),
and P(x ≤ 3) is a probability we can calculate with the binom.dist function.
1 − binom.dist    Value    Answer
Field 1           3        0.0642
Field 2           100
Field 3           0.015
Field 4           true
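4. We want to find P(2 ≤ x ≤ 6). Because x must be an integer, P(2 ≤ x ≤ 6) = P(x ≤ 6) − P(x ≤ 1),
and both P(x ≤ 6) and P(x ≤ 1) are probabilities we can calculate with the binom.dist function.
binom.dist − binom.dist    Value    Value    Answer
Field 1                    6        1        0.4426
Field 2                    100      100
Field 3                    0.015    0.015
Field 4                    true     true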
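The four rewrites used in this example (fewer than, more than, at least, between) can be collected in one place. The Python sketch below is an illustration only; `binom_cdf` is a hypothetical helper that plays the role of binom.dist with the logic operator set to true.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for a binomial random variable (like binom.dist with 'true')."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

n, p = 100, 0.015

fewer_than_3 = binom_cdf(2, n, p)                    # P(x < 3)  = P(x <= 2)
more_than_5 = 1 - binom_cdf(5, n, p)                 # P(x > 5)  = 1 - P(x <= 5)
at_least_4 = 1 - binom_cdf(3, n, p)                  # P(x >= 4) = 1 - P(x <= 3)
between_2_and_6 = binom_cdf(6, n, p) - binom_cdf(1, n, p)  # P(2 <= x <= 6)

print(f"{fewer_than_3:.4f}")     # 0.8098
print(f"{more_than_5:.4f}")      # 0.0041
print(f"{at_least_4:.4f}")       # 0.0642
print(f"{between_2_and_6:.4f}")  # 0.4426
```

In every case the strict inequality is first converted to a "≤" statement about integers, and the complement rule is applied only when the event lies in the upper tail.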
TRY IT
According to a Gallup poll, 60% of American adults prefer saving over spending. Suppose 50
American adults are selected at random.
1. What is the probability that at least 35 adults in the sample prefer saving over spending?
2. What is the probability that fewer than 20 adults in the sample prefer saving over spending?
3. What is the probability between 15 and 25 adults in the sample prefer saving over spending?
4. What is the probability that more than 30 adults prefer saving over spending?
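1. P(x ≥ 35) = 1 − P(x ≤ 34) = 1 − binom.dist(34, 50, 0.6, true)
2. P(x < 20) = P(x ≤ 19) = binom.dist(19, 50, 0.6, true)
3. P(15 ≤ x ≤ 25) = P(x ≤ 25) − P(x ≤ 14) = binom.dist(25, 50, 0.6, true) − binom.dist(14, 50, 0.6, true)
4. P(x > 30) = 1 − P(x ≤ 30) = 1 − binom.dist(30, 50, 0.6, true)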
TRY IT
During the 2013 regular NBA season, DeAndre Jordan of the Los Angeles Clippers had the highest
field goal completion rate in the league. DeAndre scored with 61.3% of his shots. Suppose you
choose a random sample of 80 shots taken by DeAndre during the 2013 season.
1. What is the expected number of shots that scored points in a sample of 80 of DeAndre’s shots?
2. What is the probability that DeAndre scored on 60 of the 80 shots?
3. What is the probability that DeAndre scored on more than 50 of the 80 shots?
4. What is the probability that DeAndre scored on between 65 and 75 of the 80 shots?
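1. The expected number of shots is μ = n × p = 80 × 0.613 = 49.04.
2. P(x = 60) = binom.dist(60, 80, 0.613, false)
3. P(x > 50) = 1 − P(x ≤ 50) = 1 − binom.dist(50, 80, 0.613, true)
4. P(65 ≤ x ≤ 75) = P(x ≤ 75) − P(x ≤ 64) = binom.dist(75, 80, 0.613, true) − binom.dist(64, 80, 0.613, true)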
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=100#oembed-1
Concept Review
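A statistical experiment can be classified as a binomial experiment if the following conditions are
met:
1. There are a fixed number of trials, n.
2. There are only two possible outcomes, called “success” and “failure,” for each trial.
3. The trials are independent and are repeated using identical conditions.
4. The probability of a success, p, and the probability of a failure, q, remain the same for each trial.
The outcomes of a binomial experiment fit a binomial probability distribution. The random
variable x is the number of successes obtained in the n independent trials. The mean of a binomial
distribution is μ = n × p and the standard deviation is σ = √(n × p × q).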
Attribution
4.5 THE POISSON DISTRIBUTION
LEARNING OBJECTIVES
1. The Poisson probability distribution gives the probability of a number of events occurring in a
fixed interval of time or space if these events happen with a known average rate and
independently of the time since the last event. For example, a book editor might be interested
in the number of words spelled incorrectly in a particular book. It might be that, on the
average, there are five words spelled incorrectly in 100 pages. The interval is the 100 pages.
2. The Poisson distribution may be used to approximate the binomial distribution if the
probability of success is “small” (such as 0.01) and the number of trials is “large” (such as
1,000).
The random variable x associated with a Poisson experiment is the number of occurrences in the
interval of interest. In a Poisson distribution, λ is the average number of occurrences in an interval.
The mean of a Poisson probability distribution is λ and the standard deviation is √λ.
4.5 THE POISSON DISTRIBUTION | 275
EXAMPLE
The average number of loaves of bread put on a shelf in a bakery in a half-hour period is 12. Of
interest is the number of loaves of bread put on the shelf in five minutes. The time interval of
interest is five minutes. What is the probability that the number of loaves, selected randomly, put on
the shelf in five minutes is three?
Solution:
Let x be the number of loaves of bread put on the shelf in five minutes. If the average number of
loaves put on the shelf in 30 minutes (half an hour) is 12, then the average number of loaves put on
the shelf in five minutes is λ = (5/30) × 12 = 2. The probability question asks for P(x = 3).
To calculate probabilities associated with a Poisson experiment in Excel, use the Poisson.dist(x, λ,
logic operator) function. The function returns:
• the probability of getting exactly x successes over the interval when the logic operator is false.
• the probability of getting at most x successes over the interval when the logic operator is true.
Visit the Microsoft page for more information about the Poisson.dist function.
NOTE
Because we can only enter false or true into the logic operator, the Poisson.dist function can only
directly calculate the probability of getting exactly x successes or getting at most x successes over the
interval. In order to calculate other Poisson probabilities, such as fewer than x successes, more than
x successes or at least x successes, we need to manipulate how we use the Poisson.dist function by
changing what we enter into the Poisson.dist function, using the complement rule, or both.
EXAMPLE
Leah receives an average of 6 calls every two hours.
1. What is the probability that Leah receives exactly 4 calls in the next two hours?
2. What is the probability that Leah receives at most 9 calls in the next two hours?
3. What is the probability that Leah receives at most 2 calls in the next hour?
Solution:
1. We want to find P(x = 4).
Poisson.dist    Value    Answer
Field 1         4        0.1339
Field 2         6
Field 3         false
The probability that Leah receives exactly 4 calls in the next two hours is 13.39%.
2. We want to find P(x ≤ 9).
Poisson.dist    Value    Answer
Field 1         9        0.9161
Field 2         6
Field 3         true
The probability that Leah receives at most 9 calls in the next two hours is 91.61%.
3. The average number of calls in any two-hour period is 6. So the average number of calls in one
hour is 6/2 = 3. We want to find P(x ≤ 2).
Poisson.dist    Value    Answer
Field 1         2        0.4232
Field 2         3
Field 3         true
The probability that Leah receives at most 2 calls in the next hour is 42.32%.
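The same three answers can be reproduced from the standard Poisson probability formula, P(x = k) = e^(−λ) × λ^k / k!. That formula is not written out in this section, so the Python sketch below is a supplementary illustration; `poisson_dist` is a hypothetical helper that mirrors the argument order of Poisson.dist.

```python
from math import exp, factorial

def poisson_dist(x, lam, cumulative):
    """Return P(X = x) when cumulative is False, or P(X <= x) when it is True."""
    def pmf(k):
        return exp(-lam) * lam**k / factorial(k)
    if cumulative:
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)

print(f"{poisson_dist(4, 6, False):.4f}")  # exactly 4 calls in two hours: 0.1339
print(f"{poisson_dist(9, 6, True):.4f}")   # at most 9 calls in two hours: 0.9161
print(f"{poisson_dist(2, 3, True):.4f}")   # at most 2 calls in one hour:  0.4232
```

Note that the third call uses λ = 3, not 6, because the average must be rescaled to match the one-hour interval in the question.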
TRY IT
The customer service department of a technology company receives an average of 10 phone calls
every hour.
1. What is the probability that the customer service department receives exactly 7 phone calls in
an hour?
2. What is the probability that the customer service department receives exactly 2 phone calls in
a 15 minute period?
3. What is the probability that the customer service department receives at most 4 phone calls in
a 30 minute period?
4. What is the probability that the customer service department receives at most 20 phone calls
in a three hour period?
1. λ = 10: P(x = 7) = Poisson.dist(7, 10, false)
2. λ = 10/4 = 2.5: P(x = 2) = Poisson.dist(2, 2.5, false)
3. λ = 10/2 = 5: P(x ≤ 4) = Poisson.dist(4, 5, true)
4. λ = 3 × 10 = 30: P(x ≤ 20) = Poisson.dist(20, 30, true)
EXAMPLE
According to Baydin, an email management company, an email user gets, on average, 147 emails
over a six hour period.
1. What is the probability that an email user receives fewer than 160 emails over a six hour
period?
2. What is the probability that an email user receives more than 40 emails over a two hour
period?
3. What is the probability that an email user receives at least 600 emails over a 24 hour period?
4. What is the probability that an email user receives between 150 and 200 emails over a six hour
period?
Solution:
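1. The average over a six hour period is 147. We want to find P(x < 160). We cannot find this
probability directly in Excel because the Poisson.dist function can only calculate = or ≤
probabilities. Because x must be an integer (it is the number of emails), x < 160 is the same
as x ≤ 159. So P(x < 160) = P(x ≤ 159), and P(x ≤ 159) is a probability we can
calculate with the Poisson.dist function.
Poisson.dist    Value    Answer
Field 1         159      0.8486
Field 2         147
Field 3         true
The probability a user receives fewer than 160 emails over a six hour period is 84.86%.
2. The average over a two hour period is 147/3 = 49. We want to find P(x > 40). We cannot
find this probability directly in Excel because the Poisson.dist function can only calculate = or ≤
probabilities. The complement of x > 40 is x ≤ 40, so P(x > 40) = 1 − P(x ≤ 40), and
P(x ≤ 40) is a probability we can calculate with the Poisson.dist function.
1 − Poisson.dist    Value    Answer
Field 1             40       0.8902
Field 2             49
Field 3             true
The probability a user receives more than 40 emails over a two hour period is 89.02%.
3. The average over a 24 hour period is 4 × 147 = 588. We want to find P(x ≥ 600). The complement
of x ≥ 600 is x < 600, which for an integer x is the same as x ≤ 599, so P(x ≥ 600) = 1 − P(x ≤ 599).
1 − Poisson.dist    Value    Answer
Field 1             599      0.3158
Field 2             588
Field 3             true
The probability a user receives at least 600 emails over a 24-hour period is 31.58%.
4. The average over a six hour period is 147. We want to find P(150 ≤ x ≤ 200). Because x must be an
integer, P(150 ≤ x ≤ 200) = P(x ≤ 200) − P(x ≤ 149).
The probability a user receives between 150 and 200 emails over a six hour period is 41.32%.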
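With averages as large as 147 or 588, a direct evaluation of λ^k / k! overflows ordinary floating-point arithmetic, so the sketch below works with logarithms instead. It is an illustration under that stated design choice, not a description of how Excel computes Poisson.dist; the helper names `poisson_pmf` and `poisson_cdf` are assumptions.

```python
from math import exp, lgamma, log

def poisson_pmf(k, lam):
    """P(X = k), evaluated in log space so large lam and k do not overflow."""
    # log(k!) is lgamma(k + 1)
    return exp(-lam + k * log(lam) - lgamma(k + 1))

def poisson_cdf(x, lam):
    """P(X <= x), the counterpart of Poisson.dist with the logic operator true."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

fewer_than_160 = poisson_cdf(159, 147)                       # compare with 84.86%
more_than_40 = 1 - poisson_cdf(40, 147 / 3)                  # compare with 89.02%
at_least_600 = 1 - poisson_cdf(599, 4 * 147)                 # compare with 31.58%
between = poisson_cdf(200, 147) - poisson_cdf(149, 147)      # compare with 41.32%

for value in (fewer_than_160, more_than_40, at_least_600, between):
    print(f"{value:.4f}")
```

Each line rescales the six-hour average of 147 to the interval in the question before any probability is computed, which is the step most often missed in problems of this type.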
TRY IT
A car parts manufacturer can produce an average of 25 parts from 100 meters of sheet metal.
1. What is the probability that more than 30 parts can be made from 100 meters of sheet metal?
2. What is the probability that between 10 and 20 parts can be made from 50 meters of sheet
metal?
3. What is the probability that fewer than 5 parts can be made from 25 meters of sheet metal?
4. What is the probability that at least 80 parts can be made from 400 meters of sheet metal?
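1. λ = 25: P(x > 30) = 1 − P(x ≤ 30) = 1 − Poisson.dist(30, 25, true)
2. λ = 25/2 = 12.5: P(10 ≤ x ≤ 20) = P(x ≤ 20) − P(x ≤ 9) = Poisson.dist(20, 12.5, true) − Poisson.dist(9, 12.5, true)
3. λ = 25/4 = 6.25: P(x < 5) = P(x ≤ 4) = Poisson.dist(4, 6.25, true)
4. λ = 4 × 25 = 100: P(x ≥ 80) = 1 − P(x ≤ 79) = 1 − Poisson.dist(79, 100, true)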
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=102#oembed-1
Watch this video: The Poisson Distribution by Dr. Nic’s Math and Stats [7:48]
Concept Review
A Poisson probability distribution of a discrete random variable gives the probability of a number
of events occurring in a fixed interval of time or space, if these events happen at a known average
rate and independently of the time since the last event. The Poisson distribution may be used to
approximate the binomial, if the probability of success is “small” (less than or equal to 0.05) and the
number of trials is “large” (greater than or equal to 20).
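To see how close the approximation can be, the sketch below compares a binomial probability with its Poisson counterpart, using λ = n × p. The values n = 1,000 and p = 0.01 are the "large" and "small" figures quoted earlier in this section; the choice of x = 10 and the function names are assumptions made for illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Exact binomial probability P(X = k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability P(X = k) with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

n, p = 1000, 0.01
lam = n * p  # the Poisson mean that matches the binomial mean

exact = binomial_pmf(10, n, p)
approx = poisson_pmf(10, lam)
print(f"binomial: {exact:.5f}  Poisson: {approx:.5f}  difference: {abs(exact - approx):.5f}")
```

The two probabilities agree to about three decimal places here. The agreement generally worsens as p grows or n shrinks, which is why the approximation is reserved for small p and large n.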
Attribution
1. A company wants to evaluate its attrition rate, in other words, how long new hires stay with the
company. Over the years, they have established the following probability distribution. Let x be the
number of years a new hire will stay with the company. Let P(x) be the probability that a new hire
will stay with the company x years.
b.
c. P(x ≥ 5) = ?
d. On average, how long would you expect a new hire to stay with the company?
e. What does the column “P(x)” sum to?
2. A baker is deciding how many batches of muffins to make to sell in his bakery. He wants to make
enough to sell every one and no fewer. Through observation, the baker has established a probability
distribution.
4.6 EXERCISES | 285
3. Ellen has music practice three days a week. She practices for all of the three days of the time,
two days of the time, one day of the time, and no days of the time. One week is selected
at random.
4. We know that for a probability distribution function to be discrete, it must have two
characteristics. One is that the sum of the probabilities is one. What is the other characteristic?
5. Javier volunteers in community events each month. He does not do more than five events in a
month. He attends exactly five events of the time, four events of the time, three events
of the time, two events of the time, one event of the time, and no events of the
time.
6. Suppose that the PDF for the number of years it takes to earn a Bachelor of Science (B.S.) degree
is given in the following table.
12. A physics professor wants to know what percent of physics majors will spend the next several
years doing post-graduate research. He has the following probability distribution.
13. A ballet instructor is interested in knowing what percent of each year’s class will continue on to
the next, so that she can plan what classes to offer. Over the years, she has established the following
probability distribution.
• Let x be the number of years a student will study ballet with the teacher.
• Let P(x) be the probability that a student will study ballet x years.
14. You are playing a game by drawing a card from a standard deck and replacing it. If the card is
a face card, you win $30. If it is not a face card, you pay $2. There are 12 face cards in a deck of 52
cards. What is the expected value of playing the game?
15. You are playing a game by drawing a card from a standard deck and replacing it. If the card is
a face card, you win $30. If it is not a face card, you pay $2. There are 12 face cards in a deck of 52
cards. Should you play the game?
16. A theater group holds a fund-raiser. It sells 100 raffle tickets for $5 apiece. Suppose you
purchase four tickets. The prize is two passes to a Broadway show, worth a total of $150.
17. A game involves selecting a card from a regular 52-card deck and tossing a coin. The coin is a
fair coin and is equally likely to land on heads or tails.
• If the card is a face card, and the coin lands on Heads, you win $6
• If the card is a face card, and the coin lands on Tails, you win $2
• If the card is not a face card, you lose $2, no matter what the coin shows.
a. Find the expected value for this game (expected net gain or loss).
b. Explain what your calculations indicate about your long-term average profits and losses on
this game.
c. Should you play this game to win money?
18. You buy a lottery ticket to a lottery that costs $10 per ticket. There are only 100 tickets available
to be sold in this lottery. In this lottery there are one $500 prize, two $100 prizes, and four $25 prizes.
Find your expected gain or loss.
19. Complete the PDF and answer the questions.
20. Suppose that you are offered the following “deal.” You roll a die. If you roll a six, you win $10. If
you roll a four or five, you win $5. If you roll a one, two, or three, you pay $6.
a. What are you ultimately interested in here (the value of the roll or the money you win)?
b. In words, define the random variable X.
c. List the values that X may take on.
d. Construct a PDF.
e. Over the long run of playing this game, what are your expected average winnings per game?
f. Based on numerical values, should you take the deal? Explain your decision in complete
sentences.
21. A venture capitalist, willing to invest $1,000,000, has three investments to choose from. The first
investment, a software company, has a 10% chance of returning $5,000,000 profit, a 30% chance of
returning $1,000,000 profit, and a 60% chance of losing the million dollars. The second company,
a hardware company, has a 20% chance of returning $3,000,000 profit, a 40% chance of returning
$1,000,000 profit, and a 40% chance of losing the million dollars. The third company, a biotech firm,
has a 10% chance of returning $6,000,000 profit, a 70% chance of no profit or loss, and a 20% chance of
losing the million dollars.
22. Suppose that 20,000 married adults in the United States were randomly surveyed as to the
number of children they have. The results are compiled and are used as theoretical probabilities.
Let X be the number of children married people have.
(or more)
23. Suppose that the PDF for the number of years it takes to earn a Bachelor of Science (B.S.) degree
is given in the following table. On average, how many years do you expect it to take for an individual
to earn a B.S.?
24. People visiting video rental stores often rent more than one DVD at a time. The probability
distribution for DVD rentals per customer at Video To Go is given in the following table. There is a
five-video limit per customer at this store, so nobody ever rents more than five DVDs.
Another shop, Entertainment Headquarters, rents DVDs and video games. The probability
distribution for DVD rentals per customer at this shop is given as follows. They also have a five-
DVD limit per customer.
e. At which store is the expected number of DVDs rented per customer higher?
f. If Video to Go estimates that they will have 300 customers next week, how many DVDs do
they expect to rent next week? Answer in sentence form.
g. If Video to Go expects 300 customers next week, and Entertainment HQ projects that they
will have 420 customers, for which store is the expected number of DVD rentals for next week
higher? Explain.
h. Which of the two video stores experiences more variation in the number of DVD rentals per
customer? How do you know that?
25. A “friend” offers you the following “deal.” For a $10 fee, you may pick an envelope from a box
containing 100 seemingly identical envelopes. However, each envelope contains a coupon for a free
gift.
Based upon the financial gain or loss over the long run, should you play the game?
26. Florida State University has 14 statistics classes scheduled for its Summer 2013 term. One
class has space available for 30 students, eight classes have space for 60 students, one class has space
for 70 students, and four classes have space for 100 students.
a. What is the average class size assuming each class is filled to capacity?
b. Space is available for 980 students. Suppose that each class is filled to capacity and select a
statistics student at random. Let the random variable X equal the size of the student’s class.
Define the PDF for X.
27. In a lottery, there are 250 prizes of $5, 50 prizes of $25, and ten prizes of $100. Assuming that
10,000 tickets are to be issued and sold, what is a fair price to charge to break even?
28. The Higher Education Research Institute at UCLA collected data from 203,967 incoming first-
time, full-time freshmen from 270 four-year colleges and universities in the U.S. 71.3% of those
students replied that, yes, they believe that same-sex couples should have the right to legal marital
status. Suppose that you randomly pick eight first-time, full-time freshmen from the survey. You are
interested in the number who believe that same-sex couples should have the right to legal marital
status.
29. According to a recent article the average number of babies born with significant hearing loss
(deafness) is approximately two per 1,000 babies in a healthy baby nursery. The number climbs
to an average of 30 per 1,000 babies in an intensive care nursery. Suppose that 1,000 babies from
healthy baby nurseries were randomly surveyed. Find the probability that exactly two babies were
born deaf.
30. Recently, a nurse commented that when a patient calls the medical advice line claiming to
have the flu, the chance that he or she truly has the flu (and not just a nasty cold) is only about 4%.
Of the next 25 patients calling in claiming to have the flu, we are interested in how many actually
have the flu.
31. A school newspaper reporter decides to randomly survey 12 students to see if they will attend
Tet (Vietnamese New Year) festivities this year. Based on past years, she knows that 18% of students
attend Tet festivities. We are interested in the number of students who will attend the festivities.
32. The probability that the San Jose Sharks will win any given game is based on a 13-year
win history of 382 wins out of 1,034 games played (as of a certain date). An upcoming monthly
schedule contains 12 games.
33. A student takes a ten-question true-false quiz, but did not study and randomly guesses each
answer. Find the probability that the student passes the quiz with a grade of at least 70% of the
questions correct.
34. A student takes a 32-question multiple-choice exam, but did not study and randomly guesses
each answer. Each question has three possible choices for the answer. Find the probability that the
student guesses more than 75% of the questions correctly.
35. Six different colored dice are rolled. Of interest is the number of dice that show a one.
36. More than 96 percent of the very largest colleges and universities (more than 15,000 total
enrollments) have some online offerings. Suppose you randomly pick 13 such institutions. We are
interested in the number that offer distance learning courses.
37. Suppose that about 85% of graduating students attend their graduation. A group of 22
graduating students is randomly chosen.
38. At The Fencing Center, 60% of the fencers use the foil as their main weapon. We randomly
survey 25 fencers at The Fencing Center. We are interested in the number of fencers who do not
use the foil as their main weapon.
39. Approximately 8% of students at a local high school participate in after-school sports all four
years of high school. A group of 60 seniors is randomly chosen. Of interest is the number who
participated in after-school sports all four years of high school.
e. Based upon numerical values, is it more likely that four or that five of the seniors participated
in after-school sports all four years of high school? Justify your answer numerically.
40. The chance of an IRS audit for a tax return with over $25,000 in income is about 2% per year. We
are interested in the expected number of audits a person with that income has in a 20-year period.
Assume each year is independent.
41. It has been estimated that only about 30% of California residents have adequate earthquake
supplies. Suppose you randomly survey 11 California residents. We are interested in the number
who have adequate earthquake supplies.
42. There are two similar games played for Chinese New Year and Vietnamese New Year. In the
Chinese version, fair dice with numbers 1, 2, 3, 4, 5, and 6 are used, along with a board with those
numbers. In the Vietnamese version, fair dice with pictures of a gourd, fish, rooster, crab, crayfish,
and deer are used. The board has those six objects on it, also. We will play with bets being $1. The
player places a bet on a number or object. The “house” rolls three dice. If none of the dice show the
number or object that was bet, the house keeps the $1 bet. If one of the dice shows the number or
object bet (and the other two do not show it), the player gets back his or her $1 bet, plus $1 profit. If
two of the dice show the number or object bet (and the third die does not show it), the player gets
back his or her $1 bet, plus $2 profit. If all three dice show the number or object bet, the player gets
back his or her $1 bet, plus $3 profit. Let X be the number of matches and Y be the profit per game.
c. List the values that Y may take on. Then, construct one PDF table that includes both X and Y
and their probabilities.
d. Calculate the average expected matches over the long run of playing this game for the player.
e. Calculate the average expected earnings over the long run of playing this game for the player.
f. Determine who has the advantage, the player or the house.
43. According to The World Bank, only 9% of the population of Uganda had access to electricity as
of 2009. Suppose we randomly sample 150 people in Uganda. Let X be the number of people who
have access to electricity.
44. The literacy rate for a nation measures the proportion of people age 15 and over that can read
and write. The literacy rate in Afghanistan is 28.1%. Suppose you choose 15 people in Afghanistan
at random. Let X be the number of people who are literate.
a. Assume the event occurs independently in any given day. Define the random variable X.
b. What values does X take on?
c. What is the probability of getting 150 customers in one day?
d. What is the probability of getting 35 customers in the first four hours? Assume the store is
open 12 hours each day.
e. What is the probability that the store will have more than 12 customers in the first hour?
f. What is the probability that the store will have fewer than 12 customers in the first two
hours?
46. On average, eight teens in the U.S. die from motor vehicle injuries per day. As a result, states
across the country are debating raising the driving age.
a. Assume the event occurs independently in any given day. In words, define the random
variable X.
b. What values does X take on?
c. For the given values of the random variable X, fill in the corresponding probabilities.
d. Is it likely that there will be no teens killed from motor vehicle injuries on any given day in
the U.S? Justify your answer numerically.
e. Is it likely that there will be more than 20 teens killed from motor vehicle injuries on any
given day in the U.S.? Justify your answer numerically.
47. The switchboard in a Minneapolis law office gets an average of 5.5 incoming phone calls during
the noon hour on Mondays. Experience shows that the existing staff can handle up to six calls in an
hour. Let X be the number of calls received at noon.
48. The maternity ward at Dr. Jose Fabella Memorial Hospital in Manila in the Philippines is one of
the busiest in the world with an average of 60 births per day. Let X be the number of births in an
hour.
49. A manufacturer of Christmas tree light bulbs knows that 3% of its bulbs are defective. Find the
probability that a string of 100 lights contains at most four defective bulbs using both the binomial
and Poisson distributions.
50. The average number of children a Japanese woman has in her lifetime is 1.37. Suppose that
one Japanese woman is randomly chosen.
51. The average number of children a Spanish woman has in her lifetime is 1.47. Suppose that one
Spanish woman is randomly chosen.
52. Fertile, female cats produce an average of three litters per year. Suppose that one fertile, female
cat is randomly chosen. In one year, find the probability she produces:
53. The chance of having an extra fortune in a fortune cookie is about 3%. Given a bag of 144 fortune
cookies, we are interested in the number of cookies with an extra fortune. Two distributions may be
used to solve this problem, but only use one distribution to solve the problem.
54. According to the South Carolina Department of Mental Health web site, for every 200 U.S.
women, the average number who suffer from anorexia is one. Out of a randomly chosen group of
600 U.S. women determine the following.
55. The chance of an IRS audit for a tax return with over $25,000 in income is about 2% per year.
Suppose that 100 people with tax returns over $25,000 are randomly picked. We are interested in the
number of people audited in one year. Use a Poisson distribution to answer the following questions.
56. Approximately 8% of students at a local high school participate in after-school sports all four
years of high school. A group of 60 seniors is randomly chosen. Of interest is the number that
participated in after-school sports all four years of high school.
57. On average, Pierre, an amateur chef, drops three pieces of egg shell into every two cake batters
he makes. Suppose that you buy one of his cakes.
d. What is the probability that there will not be any pieces of egg shell in the cake?
e. Let’s say that you buy one of Pierre’s cakes each week for six weeks. What is the probability
that there will not be any egg shell in any of the cakes?
f. Based upon the average given for Pierre, is it possible for there to be seven pieces of shell in
the cake? Why?
58. The average number of times per week that Mrs. Plum’s cats wake her up at night because they
want to play is ten. We are interested in the number of times her cats wake her up each week.
Attribution
Chapter Outline
The heights of these radish plants are continuous random variables. “Radishes” by Rev Stan, CC BY 4.0.
A continuous random variable corresponds to data that can be measured. Continuous random
variables have many applications. Baseball batting averages, IQ scores, the length of time a long
distance telephone call lasts, the amount of money a person carries, the length of time a computer
chip lasts, and SAT scores are just a few examples of continuous random variables. The field of
reliability depends on a variety of continuous random variables.
306 | 5.1 INTRODUCTION TO CONTINUOUS RANDOM VARIABLES
NOTE
The values of discrete and continuous random variables can be ambiguous. For example, if $X$ is
equal to the number of miles (to the nearest mile) you drive to work, then $X$ is a discrete random
variable because you count the miles. If $X$ is the distance you drive to work, then $X$ is a
continuous random variable because you measure the distance. For a second example, if $X$ is equal
to the number of books in a backpack, then $X$ is a discrete random variable because the number
of books is a count. If $X$ is the weight of a book, then $X$ is a continuous random variable
because weights are measured. How the random variable is defined is very important.
The graph of a continuous probability distribution is a curve. The probability a continuous random
variable takes on a value in an interval is the area under the curve of the distribution of the
continuous random variable. Properties of a continuous random variable include:
Generally, calculus is needed to find the area under the curve of many continuous probability
distributions. However, we will use the built-in functions in Excel to calculate the area under the
continuous probability distribution functions. There are many different continuous probability
distributions, including the uniform distribution and the exponential distribution. We will focus
on the most important continuous probability distribution—the normal distribution.
The graph shows the Standard Normal Distribution with the area between x=1
and x=2 shaded to represent the probability that the value of the random
variable $x$ is in the interval between one and two.
Attribution
LEARNING OBJECTIVES
For a continuous random variable, the curve of the probability distribution is denoted by the
function $f(x)$. The function $f(x)$ is called a probability density function and produces the
curve of the distribution. The function $f(x)$ is defined so that the area between it and the $x$-axis is
equal to a probability.
NOTE
The probability density function $f(x)$ does NOT give us probabilities associated with the
continuous random variable. The function $f(x)$ produces the graph of the distribution and the
area under this graph corresponds to the probability.
) is 0.
EXAMPLE
Note that the total area under the curve of , above the -axis, from to is
Suppose we want to find the area between $f(x)$ and the $x$-axis for $0 < x < 2$.
310 | 5.2 PROBABILITY DISTRIBUTION OF A CONTINUOUS RANDOM VARIABLE
In this case, the area equals the area of a rectangle from to . The area of a rectangle is
, so
The area corresponds to the probability that the associated continuous random variable takes on a
value between and . Because the area is 0.1, the probability that is 0.1.
Mathematically, we can write this as:
Suppose we want to find the probability that the random variable takes on a value between
and . This corresponds to the area under the curve in between and .
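The rectangle calculation in this example can be sketched in Python. The sketch assumes a uniform density $f(x) = 1/20$ on $0 \le x \le 20$; that density is an assumption, chosen because it makes the area for $0 < x < 2$ equal to the 0.1 quoted in the example and the total area equal to one. The function names are illustrative only.

```python
# Assumed density (an illustration, not taken from the text):
# f(x) = 1/20 on the interval [0, 20], and 0 elsewhere.
HEIGHT = 1 / 20


def f(x):
    """Assumed uniform probability density function."""
    return HEIGHT if 0 <= x <= 20 else 0.0


def rectangle_area(a, b):
    """Area under f between a and b (both inside [0, 20]): width * height."""
    return (b - a) * HEIGHT


def midpoint_area(a, b, steps=1000):
    """Approximate the same area by adding up many thin rectangles."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) * width for i in range(steps))


area_exact = rectangle_area(0, 2)      # probability that 0 < x < 2
area_numeric = midpoint_area(0, 2)     # same area, found numerically
total_area = rectangle_area(0, 20)     # total area under the curve

print(area_exact, area_numeric, total_area)
```

The thin-rectangle sum is the same idea Excel's built-in functions rely on for curved densities, where a single width-times-height product is no longer enough.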
Suppose we want to find . This corresponds to the area above , which is just a
vertical line. A vertical line has no width (or zero width). So
NOTE
TRY IT
Consider the probability density function for . Draw the graph of and
Watch this video: Continuous probability distribution intro by Khan Academy [9:57]
Concept Review
The probability density function describes the distribution curve of a continuous random variable. The area
under the probability density curve between two points corresponds to the probability that the
variable falls between those two values. In other words, the area under the probability density
curve between points $a$ and $b$ is equal to $P(a < x < b)$.
If $X$ is a continuous random variable, the probability density function, $f(x)$, is used to draw the
graph of the probability distribution. The total area under the graph of $f(x)$ is one. The area under
the graph of $f(x)$ and between values $a$ and $b$ gives the probability $P(a < x < b)$.
Attribution
LEARNING OBJECTIVES
The normal distribution is the most important of all the distributions. It is widely used and even
more widely abused. Its graph is a symmetric, bell-shaped curve. You see the bell curve in almost
all disciplines, including psychology, business, economics, the sciences, nursing, and, of course,
mathematics. Some of your instructors may use the normal distribution to help determine your
grade. Most IQ scores are normally distributed. Often real-estate prices fit a normal distribution.
The normal distribution is extremely important, but it cannot be applied to everything in the real
world.
A normal distribution is completely determined by its mean $\mu$ and its standard deviation $\sigma$, which
means there are an infinite number of normal distributions. The mean $\mu$ determines the center
of the distribution: a change in the value of $\mu$ causes the graph to shift to the left or right. The
standard deviation $\sigma$ determines the shape of the bell. Because the area under the curve must equal
one, a change in the standard deviation causes a change in the shape of the curve: the curve
becomes fatter or skinnier depending on the value of $\sigma$.
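As a rough illustration of how $\sigma$ changes the shape of the bell, the following Python sketch uses the standard library's `statistics.NormalDist` to compare the height of the curve at its center for a few standard deviations. The values of $\mu$ and $\sigma$ are hypothetical and chosen only for the comparison; `peak_height` is an illustrative helper name.

```python
from statistics import NormalDist


def peak_height(mu, sigma):
    """Height of the normal curve at its center, x = mu."""
    return NormalDist(mu, sigma).pdf(mu)


# A smaller sigma gives a taller, skinnier bell; a larger sigma gives a
# shorter, fatter bell. The total area under each curve is still one.
for sigma in (0.5, 1.0, 2.0):
    print(f"sigma = {sigma}: peak height = {peak_height(0.0, sigma):.4f}")

# Changing the mean shifts the curve left or right without changing its shape.
same_shape = abs(peak_height(0.0, 1.0) - peak_height(10.0, 1.0)) < 1e-12
print(same_shape)
```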
Watch this video: ck12.org normal distribution problems: Quantitative sense of normal distributions | Khan Academy by
Khan Academy [10:52]
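For a normal distribution with mean $\mu$ and standard deviation $\sigma$, the Empirical Rule says
the following:
• About 68% of the data falls within one standard deviation of the mean.
• About 95% of the data falls within two standard deviations of the mean.
• About 99.7% of the data falls within three standard deviations of the mean.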
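The Empirical Rule percentages (about 68%, 95%, and 99.7% within one, two, and three standard deviations of the mean) can be checked numerically. The sketch below is one way to do so in Python with `statistics.NormalDist`; it works on the standard normal curve, which is enough because the rule is stated in units of standard deviations. The helper name `within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(0, 1)


def within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within(k):.4f}")
```

The printed areas are slightly more precise than the rounded percentages in the rule, which is why the rule is usually stated with the word "about".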
EXAMPLE
TRY IT
Suppose a normal distribution has a mean and a standard deviation . Between what values of
does of the data lie?
• Between and .
EXAMPLE
From 1984 to 1985, the height of 15 to 18-year-old males from Chile follows a normal distribution
with mean cm and standard deviation cm.
1. About of the heights of 15 to 18-year old males in Chile from 1984 to 1985 lie between
what two values?
2. About of the heights of 15 to 18-year old males in Chile from 1984 to 1985 lie between
what two values?
3. About of the heights of 15 to 18-year old males in Chile from 1984 to 1985 lie between
what two values?
Solution:
1. and
2. and
320 | 5.3 THE NORMAL DISTRIBUTION
3. and
TRY IT
The scores on a college entrance exam have an approximate normal distribution with a mean
points and a standard deviation of points.
Watch this video: Empirical Rule| Probability and Statistics | Khan Academy by Khan Academy [10:25]
Concept Review
The normal distribution is the most frequently used distribution in statistics. The graph of a
normal distribution is a symmetric, bell-shaped curve centered at the mean of the distribution. The
probability that a normal random variable takes on a value inside an interval equals the area
under the corresponding normal distribution curve.
For a normal distribution, the empirical rule states that 68% of the data falls within one standard
deviation of the mean, 95% of the data falls within two standard deviations, and 99.7% of the data
falls within three standard deviations of the mean.
Attribution
“6.1 The Standard Normal Distribution” in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
5.4 THE STANDARD NORMAL
DISTRIBUTION
LEARNING OBJECTIVES
The standard normal distribution is the normal distribution with $\mu = 0$ and $\sigma = 1$. The
normal random variable associated with the standard normal distribution is denoted $Z$.
For any normal distribution with mean $\mu$ and standard deviation $\sigma$, a $z$-score is the number
of standard deviations a value $x$ is from the mean. For example, if a normal distribution has
and , then for
In this case, . We would say that is three standard deviations above (or to the right of)
the mean.
The standard normal distribution is a normal distribution of these standardized $z$-scores. For
any normal distribution with mean $\mu$ and standard deviation $\sigma$, we can transform the normal
distribution to the standard normal distribution using the formula
$$z = \frac{x - \mu}{\sigma}$$
where $x$ is a value from the normal distribution. The $z$-score is the number of standard
deviations the value $x$ is above (to the right of) or below (to the left of) the mean $\mu$. Values of $x$
that are larger than the mean have positive $z$-scores and values of $x$ that are smaller than the mean
have negative $z$-scores. If $x$ equals the mean, then $x$ has a $z$-score of zero.
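The $z$-score calculation, the value minus the mean, divided by the standard deviation, is short enough to sketch in Python. The numbers below are hypothetical and chosen only to show a positive, a negative, and a zero $z$-score; `z_score` and `value_from_z` are illustrative names.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


def value_from_z(z, mu, sigma):
    """Reverse the calculation: the x-value that has a given z-score."""
    return mu + z * sigma


# Hypothetical normal distribution with mean 50 and standard deviation 10.
mu, sigma = 50, 10
print(z_score(70, mu, sigma))      # above the mean, so positive
print(z_score(35, mu, sigma))      # below the mean, so negative
print(z_score(50, mu, sigma))      # equal to the mean, so zero
print(value_from_z(-1.5, mu, sigma))
```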
EXAMPLE
This tells us that is two standard deviations ( ) above or to the right of the mean
. Notice that
NOTES
• When $z$ is positive, $x$ is above or to the right of the mean $\mu$. In other words, $x$ is greater
than $\mu$.
• When $z$ is negative, $x$ is below or to the left of the mean $\mu$. In other words, $x$ is less than
$\mu$.
TRY IT
Watch this video: Normal Distribution Problems: z-score | Probability and Statistics | Khan Academy by Khan Academy
[7:47]
EXAMPLE
Some doctors believe that a person can lose five pounds, on the average, in a month by reducing his
or her fat intake and by exercising consistently. Suppose the amount of weight (in pounds) a person
loses in a month has a normal distribution with and . Fill in the blanks.
1. Suppose a person lost ten pounds in a month. The -score when pounds is
(verify). This -score tells you that is ________ standard deviations to the ________
(right or left) of the mean _____ (What is the mean?).
2. Suppose a person gained three pounds (a negative weight loss). Then = __________. This
-score tells you that is ________ standard deviations to the __________ (right or left)
of the mean.
Solution:
1. This -score tells you that is standard deviations to the right of the mean .
2. . This -score tells you that is standard deviations to the left of the mean.
EXAMPLE
Suppose is a normal random variable with and and is a normal random variable
with and .
Suppose :
The -score for is , which means that is standard deviations to the right of the
mean .
Suppose :
The -score for is , which means that is standard deviations to the right of the
mean .
Therefore, and are both two (of their own) standard deviations to the right of their
respective means. In other words, compared to the means of their corresponding distributions,
and have the same relative position.
NOTE
The $z$-score allows us to compare data that are scaled differently by considering the data’s position
relative to its mean. To understand the concept, suppose represents weight gains for one group of
people who are trying to gain weight in a six week period and measures the same weight gain for
a second group of people. A negative weight gain would be a weight loss. Because and
are each two standard deviations to the right of their means, they represent the same,
standardized weight gain relative to their means.
TRY IT
Jerome averages points a game with a standard deviation of points. Suppose Jerome scores
points in a game. The –score when is . This score tells you that is _____
standard deviations to the ______(right or left) of the mean______(What is the mean?).
• 1.5, left, 16
EXAMPLE
The heights of 15 to 18-year-old males from Chile from 2009 to 2010 follow a normal distribution with
mean cm and standard deviation cm.
a. Suppose a 15 to 18-year-old male from Chile was cm tall from 2009 to 2010. The -score
when cm is = _______. This -score tells you that is ________
standard deviations to the ________ (right or left) of the mean _____ (What is the mean?).
b. Suppose that the height of a 15 to 18-year-old male from Chile from 2009 to 2010 has a
-score of . What is the male’s height? The -score ( ) tells you that the
male’s height is ________ standard deviations to the __________ (right or left) of the mean.
Solution:
a. , , left,
b. , , right
TRY IT
The heights of 15 to 18-year-old males from Chile from 2009 to 2010 follow a normal distribution with
mean cm and standard deviation cm.
1. Suppose a 15 to 18-year-old male from Chile was cm tall from 2009 to 2010. The -score
when cm is = _______. This -score tells you that cm is ________
standard deviations to the ________ (right or left) of the mean _____ (What is the mean?).
2. Suppose that the height of a 15 to 18-year-old male from Chile from 2009 to 2010 has a -score
of . What is the male’s height? The -score ( ) tells you that the male’s height
is ________ standard deviations to the __________ (right or left) of the mean.
EXAMPLE
From 2009 to 2010, the heights of 15 to 18-year-old males from Chile follow a
normal distribution with mean cm and standard deviation cm. Let be the height of a 15
to 18-year-old male from Chile in 2009 to 2010.
From 1984 to 1985, the heights of 15 to 18-year-old males from Chile follow a normal distribution
with mean cm and standard deviation cm. Let be the height of a 15 to 18-year-old
male from Chile in 1984 to 1985.
Find the -scores for cm and cm. Interpret each -score. What can you
say about cm and cm?
Solution:
Both and deviate the same number of standard deviations from their
respective means and in the same direction.
TRY IT
In 2012, students took the SAT exam. The distribution of scores in the verbal section of
the SAT followed a normal distribution with a mean of and a standard deviation of .
Find the -scores for Student 1 with a score of and for Student 2 with a score of .
Interpret each -score. What can you say about these two students’ scores?
For Student 1:
For Student 2:
Student 2 scored closer to the mean than Student 1 and, because they both had negative $z$-scores,
Student 2 had the better score.
Concept Review
The standard normal distribution is the normal distribution with a mean of 0 and a standard
deviation of 1. A $z$-score is a standardized value that allows us to transform any normal distribution
back to standard normal. The formula for a $z$-score is $z = \frac{x - \mu}{\sigma}$. The value of the $z$-score for a
value $x$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$ tells us how many standard
deviations $x$ is above (greater than) or below (less than) $\mu$.
Attribution
“6.1 The Standard Normal Distribution” in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
5.5 CALCULATING PROBABILITIES FOR A
NORMAL DISTRIBUTION
LEARNING OBJECTIVES
Probabilities for a normal random variable $X$ equal the area under the corresponding normal
distribution curve. The probability that the value of $X$ falls in between the values $x_1$ and $x_2$
is the area under the normal distribution curve to the right of $x_1$ and to the left of $x_2$.
To calculate probabilities associated with normal random variables in Excel, use the norm.dist(x, $\mu$,
$\sigma$, logic operator) function.
With the logic operator set to true, the output from the norm.dist function is the probability that $X < x$. That is, the output from the
norm.dist function is the area to the left of the value of x.
Visit the Microsoft page for more information about the norm.dist function.
NOTE
The norm.dist function always tells us the area to the left of the value entered for x.
• To find the area to the right of the value of x, we use 1-norm.dist(x, $\mu$, $\sigma$, true). This
corresponds to the probability that $X > x$.
• To find the area in between x1 and x2 with $x_1 < x_2$, we use norm.dist(x2, $\mu$, $\sigma$, true)-
norm.dist(x1, $\mu$, $\sigma$, true). This corresponds to the probability that $x_1 < X < x_2$.
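For readers who want to cross-check the Excel results outside a spreadsheet, the three area calculations (left, right, and in between) can be sketched in Python. This is not part of the Excel workflow: it assumes Python's standard `statistics.NormalDist`, whose `cdf` method plays the role of norm.dist with the last argument set to true. The helper names and the mean of 50 and standard deviation of 10 are hypothetical.

```python
from statistics import NormalDist


def area_left(x, mu, sigma):
    """Area to the left of x, like norm.dist(x, mu, sigma, true)."""
    return NormalDist(mu, sigma).cdf(x)


def area_right(x, mu, sigma):
    """Area to the right of x, like 1-norm.dist(x, mu, sigma, true)."""
    return 1 - area_left(x, mu, sigma)


def area_between(x1, x2, mu, sigma):
    """Area between x1 and x2 (x1 < x2): left area of x2 minus left area of x1."""
    return area_left(x2, mu, sigma) - area_left(x1, mu, sigma)


# Hypothetical normal distribution: mean 50, standard deviation 10.
print(round(area_left(60, 50, 10), 4))
print(round(area_right(60, 50, 10), 4))
print(round(area_between(40, 60, 50, 10), 4))
```

Because the left and right areas at the same x-value split the whole curve, the first two printed values add up to one.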
Given the area to the left of an (unknown) x-value, use the norm.inv(probability, $\mu$, $\sigma$) function.
The output from the norm.inv function is the value of x so that the area to the left of x equals the given
probability. That is, the output from the norm.inv function is the value of x so that
$P(X < x)$ equals the given probability.
Visit the Microsoft page for more information about the norm.inv function.
NOTE
The norm.inv function requires that we enter the area to the left of the unknown x-value. If we are
given the area to the right of the unknown x-value, we enter 1-area to the right for the probability
in the norm.inv function. That is, given the area to the right of the x-value, we use
norm.inv(1-area, $\mu$, $\sigma$).
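The same left-area convention can be sketched in Python as a cross-check. The sketch assumes `statistics.NormalDist.inv_cdf` as the counterpart of norm.inv; the helper names and the mean of 50 and standard deviation of 10 are hypothetical. It also shows that finding an x-value from an area and then finding the area for that x-value returns the starting probability.

```python
from statistics import NormalDist


def x_for_left_area(probability, mu, sigma):
    """x-value with the given area to its left, like norm.inv(probability, mu, sigma)."""
    return NormalDist(mu, sigma).inv_cdf(probability)


def x_for_right_area(area, mu, sigma):
    """x-value with the given area to its right, like norm.inv(1-area, mu, sigma)."""
    return x_for_left_area(1 - area, mu, sigma)


# Hypothetical normal distribution: mean 50, standard deviation 10.
x_left = x_for_left_area(0.9, 50, 10)    # 90% of values fall below this x
x_right = x_for_right_area(0.3, 50, 10)  # 30% of values fall above this x
print(round(x_left, 2), round(x_right, 2))

# Round trip: the area to the left of x_left is the probability we started with.
round_trip = NormalDist(50, 10).cdf(x_left)
print(round(round_trip, 4))
```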
EXAMPLE
The final exam scores in a statistics class are normally distributed with a mean of 63 and a standard
deviation of 5.
1. Find the probability that a randomly selected student scored more than 65 on the exam.
2. Find the probability that a randomly selected student scored less than 75.
3. 90% of the students scored less than what value?
4. 30% of the students scored more than what value?
Solution:
1. We want to find $P(X > 65)$:
Field 1 65 0.3446
Field 2 63
Field 3 5
Field 4 true
The probability that a student scores more than 65 is 0.3446 (or 34.46%)
2. We want to find $P(X < 75)$:
Function    norm.dist    Answer
Field 1     75           0.9918
Field 2     63
Field 3     5
Field 4     true
The probability that a student scores less than 75 is 0.9918 (or 99.18%).
3. We want to find the value of $x$ so that the area to the left of $x$ is 0.9.
Function    norm.inv    Answer
Field 1     0.9         69.41
Field 2     63
Field 3     5
The 90th percentile is 69.4. This means that 90% of the test scores fall at or below 69.4 and 10% fall at or
above.
4. We want to find the value of $x$ so that the area to the right of $x$ is 0.3. This is the same as
finding the value of $x$ so that the area to the left of $x$ is 0.7 (1-0.3).
Function    norm.inv    Answer
Field 1     0.7         65.62
Field 2     63
Field 3     5
30% of the students scored more than 65.62.
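As a cross-check on the exam-score example, the four calculations can be reproduced with a short Python sketch. This assumes Python's standard library NormalDist in place of Excel. The variable names are illustrative only.

```python
from statistics import NormalDist

scores = NormalDist(mu=63, sigma=5)  # exam scores: mean 63, standard deviation 5

p_more_than_65 = 1 - scores.cdf(65)    # like 1-norm.dist(65, 63, 5, true)
p_less_than_75 = scores.cdf(75)        # like norm.dist(75, 63, 5, true)
x_90_below = scores.inv_cdf(0.90)      # like norm.inv(0.9, 63, 5)
x_30_above = scores.inv_cdf(1 - 0.30)  # like norm.inv(0.7, 63, 5)

print(f"{p_more_than_65:.4f} {p_less_than_75:.4f}")  # 0.3446 0.9918
print(f"{x_90_below:.2f} {x_30_above:.2f}")          # 69.41 65.62
```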
TRY IT
The golf scores for a school team are normally distributed with a mean of 68 and a standard
deviation of 3.
1. Find the probability that a randomly selected golfer scored less than 65.
2. Find the probability that a randomly selected golfer scored more than 72.
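1. We want to find $P(X < 65)$:
Function    norm.dist    Answer
Field 1     65           0.1587
Field 2     68
Field 3     3
Field 4     true
2. We want to find $P(X > 72)$:
Function    1-norm.dist    Answer
Field 1     72             0.0912
Field 2     68
Field 3     3
Field 4     true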
EXAMPLE
A personal computer is used for office work at home, research, communication, personal finances,
education, entertainment, social networking, and a myriad of other things. Suppose that the average
number of hours a household personal computer is used for entertainment is 2 hours per day.
Assume the times for entertainment are normally distributed and the standard deviation for the
times is 0.5 hour.
1. Find the probability that a household personal computer is used for entertainment between
1.8 and 2.75 hours per day.
2. Find the maximum number of hours per day that the bottom quartile of households uses a
personal computer for entertainment.
Solution:
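Let $X$ be the amount of time (in hours) a household personal computer is used for entertainment.
1. We want to find $P(1.8 < X < 2.75)$:
Function    norm.dist    -norm.dist    Answer
Field 1     2.75         1.8           0.5886
Field 2     2            2
Field 3     0.5          0.5
Field 4     true         true
The probability a household computer is used for entertainment between 1.8 and 2.75 hours a
day is 0.5886 (or 58.86%).
2. We need to find the value of $x$ so that 25% of the numbers of hours are less than this value.
Function    norm.inv    Answer
Field 1     0.25        1.66
Field 2     2
Field 3     0.5
The bottom quartile of households uses a personal computer for entertainment for at most 1.66 hours
per day.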
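The two parts of this example use the two directions of the calculation: from x-values to an area, and from an area back to an x-value. A minimal Python sketch, assuming the standard library's NormalDist as a substitute for the Excel functions, is:

```python
from statistics import NormalDist

hours = NormalDist(mu=2, sigma=0.5)  # daily entertainment hours from the example

# Part 1: area between two x-values is a difference of two left areas.
p_between = hours.cdf(2.75) - hours.cdf(1.8)

# Part 2: the bottom quartile ends at the x-value with 25% of the area to its left.
first_quartile = hours.inv_cdf(0.25)

print(f"{p_between:.4f}")       # 0.5886
print(f"{first_quartile:.2f}")  # 1.66
```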
TRY IT
The golf scores for a school team are normally distributed with a mean of 68 and a standard
deviation of 3. Find the probability that a golfer scored between 66 and 70.
Function    norm.dist    -norm.dist    Answer
Field 1     70           66            0.4950
Field 2     68           68
Field 3     3            3
Field 4     true         true
EXAMPLE
There are approximately one billion smartphone users in the world today. In the United States the
ages of smartphone users from 13 to 55+ follow a normal distribution with approximate mean and
standard deviation of 36.9 years and 13.9 years, respectively.
1. Determine the probability that a random smartphone user in the age range 13 to 55+ is
between 23 and 64.7 years old.
2. Determine the probability that a randomly selected smartphone user in the age range 13 to
55+ is at most 50.8 years old.
3. 80% of the users in the age range 13 to 55+ are less than what age?
4. 40% of the ages that range from 13 to 55+ are at least what age?
Solution:
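1. We want to find $P(23 < X < 64.7)$:
Function    norm.dist    -norm.dist    Answer
Field 1     64.7         23            0.8186
Field 2     36.9         36.9
Field 3     13.9         13.9
Field 4     true         true
The probability a smartphone user is between 23 and 64.7 years of age is 0.8186 (or 81.86%).
2. We want to find $P(X < 50.8)$:
Function    norm.dist    Answer
Field 1     50.8         0.8413
Field 2     36.9
Field 3     13.9
Field 4     true
The probability that a smartphone user is less than 50.8 years of age is 0.8413 (or 84.13%).
3. We want to find the value of $x$ so that the area to the left of $x$ is 0.8.
Function    norm.inv    Answer
Field 1     0.8         48.6
Field 2     36.9
Field 3     13.9
80% of the smartphone users in the age range 13 – 55+ are 48.6 years old or less.
4. We want to find the value of $x$ so that the area to the right of $x$ is 0.4. This is the same as
finding the value of $x$ so that the area to the left of $x$ is 0.6 (1-0.4).
Function    norm.inv    Answer
Field 1     0.6         40.42
Field 2     36.9
Field 3     13.9
40% of the smartphone users in the age range 13 – 55+ are older than 40.42 years of age.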
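The ages in part 1 were chosen to sit a whole number of standard deviations from the mean, which makes this example a convenient check that working with $z$-scores gives the same probability as working with the original x-values. The sketch below is one way to see this in Python, assuming the standard library's NormalDist. The helper z_score is a hypothetical name introduced only for this illustration.

```python
from statistics import NormalDist

ages = NormalDist(mu=36.9, sigma=13.9)  # smartphone user ages from the example
standard_normal = NormalDist()          # mean 0, standard deviation 1

def z_score(x, dist):
    """Number of standard deviations x lies from the mean of dist."""
    return (x - dist.mean) / dist.stdev

z_low = z_score(23, ages)     # 23 is one standard deviation below the mean
z_high = z_score(64.7, ages)  # 64.7 is two standard deviations above the mean

p_direct = ages.cdf(64.7) - ages.cdf(23)
p_standardized = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(round(z_low, 4), round(z_high, 4))
print(f"{p_direct:.4f} {p_standardized:.4f}")  # 0.8186 0.8186
```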
TRY IT
There are approximately one billion smartphone users in the world today. In the United States the
ages of smartphone users from 13 to 55+ follow a normal distribution with approximate mean and
standard deviation of 36.9 years and 13.9 years, respectively.
Field 2 36.9
Field 3 13.9
Field 2 36.9
Field 3 13.9
Field 4 true
EXAMPLE
A citrus farmer who grows mandarin oranges finds that the diameters of mandarin oranges
harvested on his farm follow a normal distribution with a mean diameter of 5.85 cm and a standard
deviation of 0.24 cm.
1. Find the probability that a randomly selected mandarin orange from this farm has a diameter
larger than 6.0 cm.
2. 90% of the diameters of the mandarin oranges are less than what value?
3. 35% of the diameters of the mandarin oranges are greater than what value?
Solution:
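1. We want to find $P(X > 6)$:
Function    1-norm.dist    Answer
Field 1     6              0.2660
Field 2     5.85
Field 3     0.24
Field 4     true
The probability an orange has a diameter greater than 6 cm is 0.2660 (or 26.60%).
2. We want to find the value of $x$ so that the area to the left of $x$ is 0.9.
Function    norm.inv    Answer
Field 1     0.9         6.16
Field 2     5.85
Field 3     0.24
90% of the diameters of the oranges are less than 6.16 cm.
3. We want to find the value of $x$ so that the area to the right of $x$ is 0.35. This is the same as
finding the value of $x$ so that the area to the left of $x$ is 0.65 (1-0.35).
Function    norm.inv    Answer
Field 1     0.65        5.94
Field 2     5.85
Field 3     0.24
35% of the diameters of the oranges are greater than 5.94 cm.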
Watch this video: Excel 2013 Statistical Analysis #39: Probabilities for Normal (Bell) Probability Distribution by ExcelIsFun
[24:07]
Concept Review
The normal distribution, which is continuous, is the most important of all the probability
distributions. Its graph is bell-shaped. This bell-shaped curve is used in almost all disciplines. The
probability that the value for a normal random variable falls in between the values $x_1$ and $x_2$
is the area under the normal distribution curve to the right of $x_1$ and to the left of $x_2$.
Attribution
“6.2 Using the Normal Distribution” in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
5.6 EXERCISES
1. A bottle of water contains 12.05 fluid ounces with a standard deviation of 0.01 ounces. Define the
random variable in words.
2. A normal distribution has a mean of 61 and a standard deviation of 15. What is the median?
3. A company manufactures rubber balls. The mean diameter of a ball is 12 cm with a standard
deviation of 0.2 cm. Define the random variable in words.
6. What is the $z$-score of $x$ if it is two standard deviations to the right of the mean?
7. What is the $z$-score of $x$ if it is 1.5 standard deviations to the left of the mean?
8. What is the $z$-score of $x$ if it is 2.78 standard deviations to the right of the mean?
9. What is the $z$-score of $x$ if it is 0.133 standard deviations to the left of the mean?
10. Suppose $X$ is a normal random variable with a mean of 2 and standard deviation of 6. What
value of $x$ has a $z$-score of three?
11. Suppose $X$ is a normal random variable with a mean of 8 and standard deviation of 1. What
value of $x$ has a $z$-score of ?
12. Suppose $X$ is a normal random variable with a mean of 9 and standard deviation of 5. What
value of $x$ has a $z$-score of ?
13. Suppose $X$ is a normal random variable with a mean of 2 and standard deviation of 3. What
value of $x$ has a $z$-score of ?
14. Suppose $X$ is a normal random variable with a mean of 4 and standard deviation of 2. What
value of $x$ is 1.5 standard deviations to the left of the mean?
15. Suppose $X$ is a normal random variable with a mean of 4 and standard deviation of 2. What
value of $x$ is 2 standard deviations to the right of the mean?
16. Suppose $X$ is a normal random variable with a mean of 8 and standard deviation of 9. What
value of $x$ is 0.67 standard deviations to the left of the mean?
17. Suppose $X$ is a normal random variable with a mean of and standard deviation of 2. What
is the $z$-score of ?
18. Suppose $X$ is a normal random variable with a mean of 12 and standard deviation of 6. What is
the $z$-score of ?
19. Suppose $X$ is a normal random variable with a mean of 9 and standard deviation of 3. What is
the $z$-score of ?
20. Suppose a normal distribution has a mean of six and a standard deviation of 1.5. What is the
$z$-score of ?
21. In a normal distribution, and . This tells you that is ____ standard
deviations to the ____ (right or left) of the mean.
22. In a normal distribution, and . This tells you that is ____ standard
deviations to the ____ (right or left) of the mean.
23. In a normal distribution, and . This tells you that is ____ standard
deviations to the ____ (right or left) of the mean.
24. In a normal distribution, and . This tells you that is ____ standard
deviations to the ____ (right or left) of the mean.
25. In a normal distribution, and . This tells you that is ____ standard
deviations to the ____ (right or left) of the mean.
26. About what percent of values from a normal distribution lie within one standard deviation
(left and right) of the mean of that distribution?
27. About what percent of the values from a normal distribution lie within two standard
deviations (left and right) of the mean of that distribution?
28. About what percent of values lie between the second and third standard deviations (both
sides)?
29. Suppose $X$ is a normal random variable with mean 15 and standard deviation 3. Between
what values does of the data lie? The range of values is centered at the mean of the
distribution (i.e., 15).
30. Suppose $X$ is a normal random variable with mean $-3$ and standard deviation . Between
what values does of the data lie? The range of values is centered at the mean of the
distribution (i.e., –3).
31. Suppose $X$ is a normal random variable with mean and standard deviation . Between
what values does 34.14% of the data lie?
32. About what percent of values lie between the mean and three standard deviations?
33. About what percent of values lie between the mean and one standard deviation?
34. About what percent of values lie between the first and second standard deviations from the
mean (both sides)?
35. About what percent of values lie between the first and third standard deviations (both sides)?
36. The patient recovery time from a particular surgical procedure is normally distributed with a
mean of 5.3 days and a standard deviation of 2.1 days.
37. The length of time it takes to find a parking space at 9 A.M. follows a normal distribution
with a mean of five minutes and a standard deviation of two minutes. If the mean is significantly
greater than the standard deviation, which of the following statements is true?
38. The heights of the 430 National Basketball Association players were listed on team rosters at
the start of the 2005–2006 season. The heights of basketball players have an approximate normal
distribution with mean inches and a standard deviation inches. For each of the
following heights, calculate the $z$-score and interpret it using complete sentences.
a. 77 inches
b. 85 inches
c. If an NBA player reported his height had a $z$-score of 3.5, would you believe him? Explain
your answer.
39. The systolic blood pressure (given in millimeters) of males has an approximately normal
distribution with mean and standard deviation . Systolic blood pressure for males
follows a normal distribution.
a. Calculate the $z$-scores for the male systolic blood pressures 100 and 150 millimeters.
b. If a male friend of yours said he thought his systolic blood pressure was 2.5 standard
deviations below the mean, but that he believed his blood pressure was between 100 and 150
millimeters, what would you say to him?
40. Kyle’s doctor told him that the $z$-score for his systolic blood pressure is 1.75. Which of the
following is the best interpretation of this standardized score? The systolic blood pressure (given in
millimeters) of males has an approximately normal distribution with mean and standard
deviation .
41. Height and weight are two measurements used to track a child’s development. The World
Health Organization measures child development by comparing the weights of children who
are the same height and the same gender. In 2009, weights for all 80 cm girls in the reference
population had a mean kg and standard deviation kg. Weights are normally
distributed. Calculate the $z$-scores that correspond to the following weights and interpret them.
a. 11 kg
b. 7.9 kg
c. 12.2 kg
42. In 2005, 1,475,623 students heading to college took the SAT. The distribution of scores in the
math section of the SAT follows a normal distribution with mean and standard deviation
.
a. Calculate the $z$-score for an SAT score of 720. Interpret it using a complete sentence.
b. What math SAT score is 1.5 standard deviations above the mean? What can you say about this
SAT score?
c. For 2012, the SAT math test had a mean of 514 and standard deviation 117. The ACT math
test is an alternate to the SAT and is approximately normally distributed with mean 21 and
standard deviation 5.3. If one person took the SAT math test and scored 700 and a second
person took the ACT math test and scored 30, who did better with respect to the test they
took?
43. How would you represent the area to the left of one in a probability statement?
46. How would you represent the area to the left of three in a probability statement?
48. If the area to the left of $x$ in a normal distribution is 0.123, what is the area to the right of $x$?
49. If the area to the right of $x$ in a normal distribution is 0.543, what is the area to the left of $x$?
51. The life of Sunshine CD players is normally distributed with a mean of 4.1 years and a standard
deviation of 1.3 years.
a. A CD player is guaranteed for three years. Find the probability that a CD player will break
down during the guarantee period.
b. Find the probability that a CD player will last between 2.8 and 6 years.
c. 70% of the CD players last how long?
52. The patient recovery time from a particular surgical procedure is normally distributed with a
mean of 5.3 days and a standard deviation of 2.1 days.
53. The length of time it takes to find a parking space at 9 A.M. follows a normal distribution with
a mean of five minutes and a standard deviation of two minutes.
a. Based upon the given information and numerically justified, would you be surprised if it took
less than one minute to find a parking space?
b. Find the probability that it takes at least eight minutes to find a parking space.
c. Seventy percent of the time, it takes more than how many minutes to find a parking space?
54. According to a study done by De Anza students, the height for Asian adult males is normally
distributed with an average of 66 inches and a standard deviation of 2.5 inches. Suppose one Asian
adult male is randomly chosen. Let $X$ be the height of the individual.
a. Find the probability that the person is between 65 and 69 inches. Include a sketch of the
graph, and write a probability statement.
b. Would you expect to meet many Asian adult males over 72 inches? Explain why or why not,
and justify your answer numerically.
c. The middle 40% of heights fall between what two values? Sketch the graph, and write the
probability statement.
55. IQ is normally distributed with a mean of 100 and a standard deviation of 15. Suppose one
individual is randomly chosen. Let $X$ be the IQ of an individual.
a. Find the probability that the person has an IQ greater than 120. Include a sketch of the graph,
and write a probability statement.
b. MENSA is an organization whose members have the top 2% of all IQs. Find the minimum IQ
needed to qualify for the MENSA organization. Sketch the graph, and write the probability
statement.
c. The middle 50% of IQs fall between what two values? Sketch the graph and write the
probability statement.
56. The percent of fat calories that a person in America consumes each day is normally distributed
with a mean of about 36 and a standard deviation of 10. Suppose that one individual is randomly
chosen. Let $X$ be the percent of fat calories.
a. Find the probability that the percent of fat calories a person consumes is more than 40. Graph
the situation. Shade in the area to be determined.
b. Find the maximum number for the lower quarter of percent of fat calories. Sketch the graph
and write the probability statement.
57. Suppose that the distance of fly balls hit to the outfield (in baseball) is normally distributed with
a mean of 250 feet and a standard deviation of 50 feet.
a. If one fly ball is randomly chosen from this distribution, what is the probability that this ball
traveled fewer than 220 feet? Sketch the graph. Scale the horizontal axis X. Shade the region
corresponding to the probability. Find the probability.
b. 80% of fly balls travel for less than what value?
58. In China, four-year-olds average three hours a day unsupervised. Most of the unsupervised
children live in rural areas, considered safe. Suppose that the standard deviation is 1.5 hours and
the amount of time spent alone is normally distributed. We randomly select one Chinese four-year-
old living in a rural area. We are interested in the amount of time the child spends alone per day.
a. Find the probability that the child spends less than one hour per day unsupervised. Sketch
the graph, and write the probability statement.
b. What percent of the children spend over ten hours per day unsupervised?
c. Seventy percent of the children spend at least how long per day unsupervised?
59. In the 1992 presidential election, Alaska’s 40 election districts averaged 1,956.8 votes per district
for President Clinton. The standard deviation was 572.3. (There are only 40 election districts in
Alaska.) The distribution of the votes per district for President Clinton was bell-shaped. Let $X$ be
the number of votes for President Clinton for an election district.
60. Suppose that the duration of a particular type of criminal trial is known to be normally
distributed with a mean of 21 days and a standard deviation of seven days.
a. If one of the trials is randomly chosen, find the probability that it lasted at least 24 days.
Sketch the graph and write the probability statement.
b. Sixty percent of all trials of this type are completed within how many days?
61. Terri Vogel, an amateur motorcycle racer, averages 129.71 seconds per 2.5 mile lap (in a seven-
lap race) with a standard deviation of 2.28 seconds. The distribution of her race times is normally
distributed. We are interested in one of her randomly selected laps.
a. Find the percent of her laps that are completed in less than 130 seconds.
b. The fastest 3% of her laps are under what value?
c. The middle 80% of her laps are between what values?
62. Suppose that Ricardo and Anita attend different colleges. Ricardo’s GPA is the same as the
average GPA at his school. Anita’s GPA is 0.70 standard deviations above her school average. In
complete sentences, explain why each of the following statements may be false.
63. An expert witness for a paternity lawsuit testifies that the length of a pregnancy is normally
distributed with a mean of 280 days and a standard deviation of 13 days. An alleged father was out
of the country from 240 to 306 days before the birth of the child, so the pregnancy would have
been less than 240 days or more than 306 days long if he was the father. The birth was
uncomplicated, and the child needed no medical intervention. What is the probability that he was
NOT the father? What is the probability that he could be the father? Calculate the $z$-scores first,
and then use those to calculate the probability.
64. A NUMMI assembly line, which has been operating since 1984, has built an average of 6,000
cars and trucks a week. Generally, 10% of the cars were defective coming off the assembly line.
Suppose we draw a random sample of cars. Let $X$ represent the number of defective cars
in the sample. What can we say about $X$ in regard to the 68-95-99.7 empirical rule (one standard
deviation, two standard deviations and three standard deviations from the mean are being referred
to)? Assume a normal distribution for the defective cars in the sample.
65. We flip a coin 100 times ($n = 100$) and note that it only comes up heads 20% ($p = 0.20$) of the
time. The mean and standard deviation for the number of times the coin lands on heads are
$\mu = 20$ and $\sigma = 4$ (verify the mean and standard deviation). Solve the following:
a. There is about a 68% chance that the number of heads will be somewhere between ___ and
___.
b. There is about a ____chance that the number of heads will be somewhere between 12 and 28.
c. There is about a ____ chance that the number of heads will be somewhere between eight and
32.
66. A $1 scratch off lotto ticket will be a winner one out of five times. Out of a shipment of
lotto tickets, find the probability for the lotto tickets that there are
67. Facebook provides a variety of statistics on its Web site that detail the growth and popularity of
the site. On average, 28 percent of 18 to 34 year olds check their Facebook profiles before getting
out of bed in the morning. Suppose this percentage follows a normal distribution with a standard
deviation of five percent.
a. Find the probability that the percent of 18 to 34-year-olds who check Facebook before getting
out of bed in the morning is at least 30.
b. 95% of the number of 18 to 34-year-olds who check Facebook before getting out of bed is less
than what value?
Attribution
Chapter Outline
If you want to figure out the distribution of the change people carry in their pockets, using the
central limit theorem and assuming your sample is large enough, you will find that the
distribution is normal and bell-shaped. Photo by John Lodder, CC BY 4.0.
Why are we so concerned with means? Two reasons are that they give us a middle ground for
comparison, and they are easy to calculate. In this chapter, we will study means, proportions and
their relationship to the central limit theorem.
The central limit theorem is one of the most powerful and useful ideas in all of statistics. The
central limit theorem basically says that if we collect samples of size $n$ from a population with mean
$\mu$ and standard deviation $\sigma$, calculate each sample’s mean, and create a histogram of those means,
then, under the right conditions, the resulting histogram will tend to have an approximate normal
bell shape.
362 | 6.1 INTRODUCTION TO SAMPLING DISTRIBUTIONS AND THE CENTRAL LIMIT THEOREM
Watch this video: Central limit theorem | Inferential statistics | Probability and Statistics | Khan Academy by Khan
Academy [9:45]
Attribution
LEARNING OBJECTIVES
Suppose all samples of size $n$ are selected from a population with mean $\mu$ and standard deviation
$\sigma$. For each sample, the sample mean $\overline{x}$ is recorded. The probability distribution of these sample
means is called the sampling distribution of the sample means. The central limit theorem
describes the properties of the sampling distribution of the sample means.
Suppose all samples of size $n$ are taken from a population with mean $\mu$ and standard deviation
$\sigma$. The collection of sample means forms a probability distribution called the sampling
distribution of the sample mean.
1. The mean of the distribution of the sample means, denoted $\mu_{\overline{x}}$, equals the mean of the
population.
\begin{eqnarray*} \mu_{\overline{x}} & = & \mu \end{eqnarray*}
364 | 6.2 SAMPLING DISTRIBUTION OF THE SAMPLE MEAN
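2. The standard deviation of the distribution of the sample means (called the standard error of the mean),
denoted $\sigma_{\overline{x}}$, equals the standard deviation of the population divided by the square root of
the sample size.
\begin{eqnarray*} \sigma_{\overline{x}} & = & \frac{\sigma}{\sqrt{n}} \end{eqnarray*}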
3. The distribution of the sample means follows a normal distribution if one of the following
conditions is met:
◦ The population the samples are drawn from is normal, regardless of the sample size $n$.
◦ The sample size $n \geq 30$.
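These three properties can be explored with a small simulation. The sketch below is only an illustration under stated assumptions: it uses Python's standard library, a hypothetical skewed population (exponential, with mean 1 and standard deviation 1), a hypothetical sample size of 36, and 5000 repeated samples. None of these values come from the text.

```python
import random
from statistics import fmean, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population: exponential with mean 1 and standard deviation 1.
population_mean, population_sd = 1.0, 1.0
n = 36              # size of each sample
num_samples = 5000  # number of samples drawn

# Draw many samples of size n and record the mean of each one.
sample_means = [
    fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = fmean(sample_means)            # should be close to the population mean
sd_of_means = stdev(sample_means)              # should be close to sigma / sqrt(n)
standard_error = population_sd / n ** 0.5

print(f"mean of sample means: {mean_of_means:.3f} (population mean {population_mean})")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {standard_error:.3f})")
```

Even though the simulated population is far from bell-shaped, the mean of the sample means lands near the population mean and their standard deviation lands near the standard error. A histogram of sample_means would also look roughly normal.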
Watch this video: Sampling distribution of the sample mean | Probability and Statistics | Khan Academy by Khan Academy
[10:51]
Watch this video: Standard error of the mean | Inferential statistics | Probability and Statistics | Khan Academy by Khan
Academy [15:14]
Because the central limit theorem states that the sampling distribution of the sample means follows
a normal distribution (under the right conditions), the normal distribution can be used to answer
probability questions about sample means. The $z$-score for the sampling distribution of the sample
means is
\begin{eqnarray*} z & = & \frac{\overline{x} - \mu}{\frac{\sigma}{\sqrt{n}}} \end{eqnarray*}
where $\mu$ is the mean of the population the sample is taken from, $\sigma$ is the standard deviation of
the population the sample is taken from, and $n$ is the sample size.
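The $z$-score of a sample mean divides by the standard error $\frac{\sigma}{\sqrt{n}}$ rather than by $\sigma$ itself. A minimal Python sketch of that calculation follows. The function name sample_mean_z_score and the numbers (population mean 50, standard deviation 12, sample size 36, sample mean 53) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_z_score(x_bar, mu, sigma, n):
    """z-score of a sample mean: (x_bar - mu) / (sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / sqrt(n))

# Hypothetical values: population mean 50, standard deviation 12, sample size 36.
z = sample_mean_z_score(x_bar=53, mu=50, sigma=12, n=36)

# Area to the left of z under the standard normal curve: P(sample mean < 53).
p_below = NormalDist().cdf(z)

print(z)                  # 1.5
print(f"{p_below:.4f}")   # 0.9332
```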
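Because the distribution of the sample means follows a normal distribution (under the right
conditions), the norm.dist($\overline{x}$, $\mu$, $\sigma/\sqrt{n}$, logic operator) function can be used to calculate probabilities
associated with a sample mean.
• For $\overline{x}$, enter the value of the sample mean.
• For $\mu$, enter the mean of the population the sample is taken from.
• For $\sigma/\sqrt{n}$, enter the standard deviation of the sample means (the population standard
deviation divided by the square root of the sample size).
• For the logic operator, enter true. Note: Because we are calculating the area under the curve,
we always enter true for the logic operator.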
NOTE
In this case, we want to calculate probabilities associated with a sample mean. The sample means
follow a normal distribution (under the right conditions), which allows us to use the norm.dist
function to calculate probabilities. Because we are working with sample means, we must enter the
mean and the standard deviation of the distribution of the sample means into the
norm.dist function, and not the mean and standard deviation of the population the samples are
taken from. The mean of the sample means equals the mean of the population, so we are entering
the value of $\mu$ into the second field of the norm.dist function. But the standard deviation of the
sample means equals $\frac{\sigma}{\sqrt{n}}$, so we must enter this value into the third field of the norm.dist function.
We use the norm.dist function in the same way as we learned previously to calculate the
probability a sample mean is less than a given value, a sample mean is greater than a given value, or
a sample mean is in between two given values.
EXAMPLE
The length of time, in hours, it takes an “over 40” group of people to play one soccer match is
normally distributed with a mean of 2 hours and a standard deviation of 0.5 hours. Suppose a
sample of size 25 is drawn randomly from the population.
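1. What is the distribution of the sample means? Explain.
2. What are the mean and standard deviation of the sample means?
3. What is the probability that the mean of the sample is less than 1.7 hours?
4. What is the probability that the mean of the sample is more than 2.2 hours?
5. What is the probability that the sample mean is between 1.8 hours and 2.3 hours?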
Solution:
1. Because the population the sample is taken from follows a normal distribution, the
distribution of the sample means also follows a normal distribution.
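2. The mean of the distribution of the sample means is $\mu_{\overline{x}} = \mu = 2$. The standard deviation of the
sample means is $\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.5}{\sqrt{25}} = 0.1$.
3. We want to find $P(\overline{x} < 1.7)$:
Function    norm.dist        Answer
Field 1     1.7              0.0013
Field 2     2
Field 3     0.5/sqrt(25)
Field 4     true
The probability the sample mean is less than 1.7 hours is 0.0013 (or 0.13%).
Note: Because we are calculating a probability for a sample mean, we enter the standard
deviation of the sample means 0.5/sqrt(25) into field 3 (and not the standard deviation of the
population).
4. We want to find $P(\overline{x} > 2.2)$:
Function    1-norm.dist      Answer
Field 1     2.2              0.0228
Field 2     2
Field 3     0.5/sqrt(25)
Field 4     true
The probability the sample mean is more than 2.2 hours is 0.0228 (or 2.28%).
5. We want to find $P(1.8 < \overline{x} < 2.3)$:
Function    norm.dist        -norm.dist       Answer
Field 1     2.3              1.8              0.9759
Field 2     2                2
Field 3     0.5/sqrt(25)     0.5/sqrt(25)
Field 4     true             true
The probability the sample mean is between 1.8 hours and 2.3 hours is 0.9759 (or 97.59%).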
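The soccer example can be reproduced in Python as a cross-check. This is a sketch that assumes the standard library's NormalDist in place of Excel's norm.dist. The last line is included only to show what happens if the population standard deviation is entered by mistake instead of the standard error.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 2, 0.5, 25

# Distribution of the sample means: same mean, standard deviation sigma/sqrt(n) = 0.1.
sample_means = NormalDist(mu, sigma / sqrt(n))

p_less_1_7 = sample_means.cdf(1.7)                         # P(sample mean < 1.7)
p_more_2_2 = 1 - sample_means.cdf(2.2)                     # P(sample mean > 2.2)
p_between = sample_means.cdf(2.3) - sample_means.cdf(1.8)  # P(1.8 < sample mean < 2.3)

print(f"{p_less_1_7:.4f} {p_more_2_2:.4f} {p_between:.4f}")  # 0.0013 0.0228 0.9759

# Common mistake: using the population standard deviation (0.5) instead of 0.1.
# This answers a question about one match, not about the mean of 25 matches.
p_single_match = NormalDist(mu, sigma).cdf(1.7)
print(f"{p_single_match:.4f}")  # 0.2743
```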
TRY IT
The length of time taken on the SAT for a group of students has a mean of 2.5 hours and a standard
deviation of 0.25 hours. A sample size of 60 is drawn randomly from the population.
1. The distribution of the sample means is normal because the sample size of 60 is greater than
30.
Field 2 2.5
Field 3 0.25/sqrt(60)
Field 4 true
Field 2 2.5
Field 3 0.25/sqrt(60)
Field 4 true
EXAMPLE
In a recent study reported Oct. 29, 2012 on the Flurry Blog, the mean age of tablet users is 34 years
and the standard deviation is 15 years. Suppose a sample of 100 tablet users is taken.
1. What are the mean and standard deviation for the sample mean ages of tablet users?
2. What is the distribution of the sample means? Explain.
3. Find the probability that the sample mean age is more than 30 years.
Solution:
1. The mean of the distribution of the sample means is $\mu_{\overline{x}} = \mu = 34$. The standard deviation of the
sample means is $\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{100}} = 1.5$.
2. The distribution of the sample means is normal because the sample size of 100 is greater than
30.
3. We want to find $P(\overline{x} > 30)$:
Function    1-norm.dist      Answer
Field 1     30               0.9962
Field 2     34
Field 3     15/sqrt(100)
Field 4     true
The probability the sample mean is more than 30 years of age is 0.9962 (or 99.62%).
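The sample size enters this calculation only through the standard error, so it can be instructive to see how the answer moves as the sample size changes. In the Python sketch below, the sample size of 100 is the one from the example, while 36 and 400 are hypothetical sizes added for comparison. The function name prob_sample_mean_above is also illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 34, 15  # population mean and standard deviation of tablet user ages

def prob_sample_mean_above(value, n):
    """P(sample mean > value) for samples of size n, using standard error sigma/sqrt(n)."""
    return 1 - NormalDist(mu, sigma / sqrt(n)).cdf(value)

# 100 is the example's sample size; 36 and 400 are hypothetical comparisons.
for n in (36, 100, 400):
    standard_error = sigma / sqrt(n)
    print(n, round(standard_error, 2), round(prob_sample_mean_above(30, n), 4))
```

As the sample size grows, the standard error shrinks, the sample means cluster more tightly around 34, and a sample mean above 30 becomes closer to certain.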
TRY IT
In an article on Flurry Blog, a gaming marketing gap for men between the ages of 30 and 40 is
identified. You are researching a start-up game targeted at the 35-year-old demographic. Your idea is
to develop a strategy game that can be played by men from their late 20s through their late 30s.
Based on the article’s data, industry research shows that the average strategy player is 28 years old
with a standard deviation of 4.8 years. You take a sample of 100 randomly selected gamers. If your
target market is 29- to 35-year-olds, should you continue with your development strategy?
You need to determine the probability for men whose mean age is between 29 and 35 years of age
wanting to play a strategy game.
Function    norm.dist        -norm.dist       Answer
Field 1     35               29               0.0186
Field 2     28               28
Field 3     4.8/sqrt(100)    4.8/sqrt(100)
Field 4     true             true
There is a 1.86% chance that the mean age of men who will play your game is between 29 years and 35
years. Because this is a very low probability, you should not continue your development strategy.
EXAMPLE
The mean number of minutes for app engagement by a tablet user is 8.2 minutes with a standard
deviation of 1 minute. Suppose a sample of 60 tablet users is taken.
Solution:
1. Because the sample size of 60 is greater than 30, the distribution of the sample means also
follows a normal distribution.
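2. The mean of the distribution of the sample means is $\mu_{\overline{x}} = \mu = 8.2$. The standard deviation of the
sample means is $\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{60}} = 0.1291$.
3. We want to find $P(8 < \overline{x} < 8.5)$:
Function    norm.dist     -norm.dist    Answer
Field 1     8.5           8             0.9293
Field 2     8.2           8.2
Field 3     1/sqrt(60)    1/sqrt(60)
Field 4     true          true
The probability that the sample mean is between 8 and 8.5 minutes is 0.9293 (or 92.93%).
4. We want to find $P(\overline{x} < 8.3)$:
Function    norm.dist     Answer
Field 1     8.3           0.7807
Field 2     8.2
Field 3     1/sqrt(60)
Field 4     true
The probability that the sample mean is less than 8.3 minutes is 0.7807 (or 78.07%).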
TRY IT
Cans of a cola beverage claim to contain 16 ounces with a standard deviation of 0.143 ounces. The
amounts in a sample of 34 cans are measured and the mean is 16.01 ounces. Find the probability
that a sample of 34 cans will have an average amount greater than 16.01 ounces. Do the results
suggest that cans are filled with an amount greater than 16 ounces?
Function    1-norm.dist       Answer
Field 1     16.01             0.3417
Field 2     16
Field 3     0.143/sqrt(34)
Field 4     true
There is a 34.17% probability that the mean volume of a sample of 34 cans is greater than 16.01 ounces
when the cans contain an average of 16 ounces, as claimed. Because this probability is large, a sample
mean of 16.01 ounces is not unusual, so these results do not suggest that the cans are filled with an
amount greater than the claimed 16 ounces.
If the average were higher than 16 ounces, consumers would be glad because they would be receiving
more cola in the can than what they paid for, and the manufacturer would need to inspect its
bottling process to determine if the process is working within acceptable limits.
Watch this video: Excel Statistics 76: Sampling Distribution Of Sample Mean & Central Limit Theorem by ExcelIsFun
[24:05]
Concept Review
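The distribution of the sample means follows a normal distribution if one of the following
conditions is met:
◦ The population the samples are drawn from is normal, regardless of the sample size $n$.
◦ The sample size $n \geq 30$.
The mean of the sample means equals the population mean $\mu$. The standard deviation of the
sample means is equal to $\frac{\sigma}{\sqrt{n}}$ where $\sigma$ is the population standard deviation and $n$ is the sample
size.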
Attribution
“7.1 The Central Limit Theorem for Sample Means (Averages)” in Introductory Statistics by
OpenStax is licensed under a Creative Commons Attribution 4.0 International License.
6.3 SAMPLING DISTRIBUTION OF THE
SAMPLE PROPORTION
LEARNING OBJECTIVES
The Central Limit Theorem tells us that the distribution of the sample means follows a normal
distribution under the right conditions. This allows us to answer probability questions about the
sample mean $\overline{x}$. Now we want to investigate the sampling distribution for another important
statistic, the sample proportion. Once we know what distribution
the sample proportions follow, we can answer probability questions about sample proportions.
A proportion is the percent, fraction, or ratio of a sample or population that has a characteristic
of interest. The population proportion is denoted by $p$ and the sample proportion is denoted
by $\hat{p}$.
If the random variable is discrete, such as for categorical data, then the parameter we wish to
estimate is the population proportion. This is, of course, the probability of drawing a success in
any one random draw. Because we are interested in the number of successes, we are dealing with
the binomial distribution. The random variable $X$ is the number of successes and the parameter
we wish to know is $p$, the probability of drawing a success, which is of course the proportion of
successes in the population. What is the distribution of the sample proportion $\hat{p}$?
Suppose all samples of size $n$ are taken from a population with proportion $p$. The collection of
sample proportions forms a probability distribution called the sampling distribution of the
sample proportion.
1. The mean of the distribution of the sample proportions, denoted $\mu_{\hat{p}}$, equals the population
proportion.
\begin{eqnarray*} \mu_{\hat{p}} & = & p \end{eqnarray*}
2. The standard deviation of the sample proportions (called the standard error of the
proportion), denoted $\sigma_{\hat{p}}$, is
\begin{eqnarray*} \sigma_{\hat{p}} & = & \sqrt{\frac{p \times (1-p)}{n}} \end{eqnarray*}
3. The shape of the distribution of the sample proportions is
◦ Normal if $n \times p \geq 5$ and $n \times (1-p) \geq 5$.
◦ Binomial if one of $n \times p < 5$ or $n \times (1-p) < 5$.
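The three properties of the sampling distribution listed above can be gathered into one small helper. This is only a sketch in Python rather than Excel: the function name `describe_sample_proportion` is hypothetical, and the cutoff of 5 for $n \times p$ and $n \times (1-p)$ is treated as an assumption and exposed as a parameter.

```python
from math import sqrt

def describe_sample_proportion(n, p, threshold=5):
    """Return (mean, standard error, shape) of the sampling distribution of p-hat.

    threshold is the assumed cutoff for n*p and n*(1-p) above which the
    normal distribution is used; below it the binomial distribution applies.
    """
    mean = p                            # mean of the sample proportions
    std_error = sqrt(p * (1 - p) / n)   # standard error of the proportion
    is_normal = n * p >= threshold and n * (1 - p) >= threshold
    shape = "normal" if is_normal else "binomial"
    return mean, std_error, shape

# A population proportion of 0.3 with samples of size 150
print(describe_sample_proportion(150, 0.3))
# A population proportion of 0.03 with samples of size 60
print(describe_sample_proportion(60, 0.03))
```

Keeping the cutoff as a parameter makes it easy to see how the choice of distribution changes when a different rule of thumb is used.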
Watch this video: Sampling Distribution of the Sample Proportion by Khan Academy [9:57]
Watch this video: Sampling Distribution of the Sample Proportion by Khan Academy [4:34]
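When $n \times p \geq 5$ and $n \times (1-p) \geq 5$, the central limit theorem states that the sampling
distribution of the sample proportions follows a normal distribution. In this case the normal
distribution can be used to answer probability questions about sample proportions and the $z$-score
for the sampling distribution of the sample proportions is
\begin{eqnarray*} z & = & \frac{\hat{p}-p}{\sqrt{\frac{p \times (1-p)}{n}}} \end{eqnarray*}
When the distribution of the sample proportions follows a normal distribution (when $n \times p \geq 5$
and $n \times (1-p) \geq 5$), the norm.dist(x, $\mu_{\hat{p}}$, $\sigma_{\hat{p}}$, logic operator) function can be used to calculate
probabilities associated with a sample proportion.
• For x, enter the value of the sample proportion $\hat{p}$.
• For $\mu_{\hat{p}}$, enter the mean of the sample proportions, which is the population proportion $p$.
• For $\sigma_{\hat{p}}$, enter the standard deviation of the sample proportions, $\sqrt{\frac{p \times (1-p)}{n}}$.
• For the logic operator, enter true. Note: Because we are calculating the area under the curve,
we always enter true for the logic operator.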
NOTE
In this case, we want to calculate probabilities associated with a sample proportion. The sample
proportions follow a normal distribution (under the right conditions), which allows us to use the
norm.dist function to calculate probabilities. Because we are working with sample proportions, we
must enter the mean $\mu_{\hat{p}}$ and the standard deviation $\sigma_{\hat{p}}$ of the distribution of the sample
proportions into the norm.dist function. The mean of the sample proportions equals the
population proportion, so we enter the value of $p$ into the second field of the norm.dist
function. The standard deviation of the sample proportions equals $\sqrt{\frac{p \times (1-p)}{n}}$, so we
must enter this value into the third field of the norm.dist function.
We use the norm.dist function in the same way as we learned previously to calculate the
probability a sample proportion is less than a given value, a sample proportion is greater than a
given value, or a sample proportion is in between two given values.
EXAMPLE
A recent study asked working adults if they worked most of their time remotely. The study found
that 30% of employees spend the majority of their time working remotely. Suppose a sample of 150
working adults is taken.
Solution:
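2. The mean of the distribution of the sample proportions is $\mu_{\hat{p}} = p = 0.3$. The standard deviation
of the sample proportions is $\sigma_{\hat{p}} = \sqrt{\frac{0.3 \times (1-0.3)}{150}} \approx 0.0374$.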
Field 2 0.3
Field 3 sqrt(0.3*(1-0.3)/150)
Field 4 true
The probability the sample proportion is at most 27% is 0.2113 (or 21.13%).
Note: Because we are calculating a probability for a sample proportion, we enter the mean of
the sample proportions 0.3 (which is the population proportion) into field 2 and the standard
deviation of the sample proportions sqrt(0.3*(1-0.3)/150) into field 3.
4. In this case, 51 is not a proportion. It is the number of items in the sample that have the
characteristic of interest. As a proportion of the sample of 150, this is $\frac{51}{150} = 0.34$.
This question is asking us to find the probability that at least 34% of the workers in the sample
work remotely most of the time.
Field 2 0.3
Field 3 sqrt(0.3*(1-0.3)/150)
Field 4 true
The probability the sample proportion is at least 34% is 0.1425 (or 14.25%).
The probability the sample proportion is between 32% and 35% is 0.2058 (or 20.58%).
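The three probabilities in this example can be reproduced with a short script. This is a sketch in Python, not the book's Excel workflow; `NormalDist(mean, sd).cdf(x)` is assumed here as the counterpart of norm.dist(x, mean, sd, true), and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p = 0.3   # population proportion of remote workers
n = 150   # sample size

# Sampling distribution of the sample proportion (normal case)
dist = NormalDist(p, sqrt(p * (1 - p) / n))

at_most_27 = dist.cdf(0.27)                # P(p-hat <= 0.27)
at_least_34 = 1 - dist.cdf(51 / 150)       # 51 of 150 workers is a proportion of 0.34
between = dist.cdf(0.35) - dist.cdf(0.32)  # P(0.32 <= p-hat <= 0.35)

print(round(at_most_27, 4), round(at_least_34, 4), round(between, 4))
```

The "at least" probability is one minus the area to the left, and the "between" probability is the difference of two areas to the left, exactly as with norm.dist.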
TRY IT
According to a recent study, 17.5% of the adult population of Canada are smokers. Suppose a
random sample of 200 adult Canadians is taken.
3. What is the probability that less than 32 of the adults in the sample are smokers?
4. What is the probability that more than 20% of the adults in the sample are smokers?
5. What is the probability that between 34 and 44 of the adults in the sample are smokers?
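1. Because $n \times p = 200 \times 0.175 = 35 \geq 5$ and
$n \times (1-p) = 200 \times (1-0.175) = 165 \geq 5$, the distribution of the sample
proportions is normal.
2. The mean of the distribution of the sample proportions is $\mu_{\hat{p}} = p = 0.175$. The standard
deviation of the sample proportions is $\sigma_{\hat{p}} = \sqrt{\frac{0.175 \times (1-0.175)}{200}} \approx 0.0269$.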
Field 2 0.175
Field 3 sqrt(0.175*(1-0.175)/200)
Field 4 true
Field 2 0.175
Field 3 sqrt(0.175*(1-0.175)/200)
Field 4 true
When one of $n \times p < 5$ or $n \times (1-p) < 5$, the sampling distribution of the sample proportions
follows a binomial distribution, and so we must use the binomial distribution to answer probability
questions about sample proportions. In these cases, we are actually answering probability
questions about the number of items with the characteristic of interest, $x$. In other words, we are
answering questions about the number of successes we get in $n$ trials (the sample size) where the
probability of success is the population proportion $p$. These are exactly the same type of questions
we answered previously with the binomial distribution.
When the distribution of the sample proportions follows a binomial distribution (when one of
$n \times p < 5$ or $n \times (1-p) < 5$), the binom.dist(x,n,p,logic operator) function can be used to
calculate probabilities associated with a sample proportion.
NOTE
We use the binom.dist function in the same way as we learned previously to calculate the
probability a sample proportion is less than a given value, a sample proportion is at most a given
value, a sample proportion is greater than a given value, or a sample proportion is at least a given
value.
EXAMPLE
At the local humane society, 3% of the dogs have heartworm disease. Suppose a sample of 60 dogs at
the humane society is taken.
Solution:
Field 1 3 0.8943
Field 2 60
Field 3 0.03
Field 4 true
The probability that at most 5% of the dogs in the sample have heartworm disease is 0.8943 (or
89.43%).
3. We want to find $P(X < 7)$. Because we are using the binomial distribution, this probability
is the same as $P(X \leq 6)$.
Field 1 6 0.9979
Field 2 60
Field 3 0.03
Field 4 true
The probability that less than 7 of the dogs in the sample have heartworm disease is 0.9979 (or
99.79%).
Field 1 4 0.0340
Field 2 60
Field 3 0.03
Field 4 true
The probability that more than 8% of the dogs in the sample have heartworm disease is 0.0340
(or 3.4%).
5. We want to find $P(X \geq 6)$. Because we are using the binomial distribution, this probability
is the same as $1 - P(X \leq 5)$.
Field 1 5 0.0091
Field 2 60
Field 3 0.03
Field 4 true
The probability that at least 6 of the dogs in the sample have heartworm disease is 0.0091 (or
0.91%).
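The four binomial probabilities in this example can be verified directly from the binomial formula. This is a minimal sketch in Python instead of Excel; the helper `binom_cdf` is a hypothetical stand-in for binom.dist(x, n, p, true), which returns $P(X \leq x)$.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for a binomial distribution with n trials and success probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

n, p = 60, 0.03   # 60 dogs sampled, 3% have heartworm disease

at_most_5_percent = binom_cdf(3, n, p)        # 5% of 60 is 3 dogs: P(X <= 3)
fewer_than_7 = binom_cdf(6, n, p)             # P(X < 7) = P(X <= 6)
more_than_8_percent = 1 - binom_cdf(4, n, p)  # 8% of 60 is 4.8 dogs: P(X >= 5) = 1 - P(X <= 4)
at_least_6 = 1 - binom_cdf(5, n, p)           # P(X >= 6) = 1 - P(X <= 5)

print(round(at_most_5_percent, 4), round(fewer_than_7, 4))
print(round(more_than_8_percent, 4), round(at_least_6, 4))
```

Each percent is first converted to a count of dogs, because the binomial distribution describes the number of successes rather than the proportion.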
TRY IT
During the past tax season, 92% of tax returns were filed using an electronic filing system. Suppose a
sample of 40 tax returns is selected.
Field 2 40
Field 3 0.92
Field 4 true
Field 2 40
Field 3 0.92
Field 4 true
Field 2 40
Field 3 0.92
Field 4 true
Field 2 40
Field 3 0.92
Field 4 true
Watch this video: Excel Statistics 79: Proportions Sampling Distribution by ExcelIsFun [8:54]
Concept Review
The mean of the sample proportions equals the population proportion $p$. The standard deviation
of the sample proportions is $\sqrt{\frac{p \times (1-p)}{n}}$.
Attribution
“7.3 The Central Limit Theorem for Proportions“ in Introductory Business Statistics by OpenStax is
licensed under a Creative Commons Attribution 4.0 International License.
6.4 EXERCISES
1. Yoonie is a personnel manager in a large corporation. Each month she must review 16 of the
employees. From past experience, she has found that the reviews take her approximately four
hours each to do with a population standard deviation of 1.2 hours. Let $X$ be the random variable
representing the time it takes her to complete one review. Assume $X$ is normally distributed.
Suppose 16 reviews are selected at random.
2. Suppose that the distance of fly balls hit to the outfield (in baseball) is normally distributed with
a mean of 250 feet and a standard deviation of 50 feet. We randomly sample 49 fly balls.
a. What is the probability that the 49 balls traveled an average of less than 240 feet?
b. What is the probability that the 49 balls traveled an average of 245 feet to 255 feet?
c. What is the probability that the 49 balls traveled an average of more than 260 feet?
3. According to the Internal Revenue Service, the average length of time for an individual to
complete (keep records for, learn, prepare, copy, assemble, and send) IRS Form 1040 is 10.53 hours
(without any attached schedules) with a standard deviation of two hours. Suppose we randomly
sample 36 taxpayers.
4. Suppose that a category of world-class runners are known to run a marathon (26 miles) in an
average of 145 minutes with a standard deviation of 14 minutes. Consider 49 of the races. Find the
probability that the runner will average between 142 and 146 minutes in these 49 marathons.
5. In 1940 the average size of a U.S. farm was 174 acres. Let’s say that the standard deviation was 55
acres. Suppose we randomly survey 38 farmers from 1940.
6. The percent of fat calories that a person in America consumes each day is normally distributed
with a mean of about 36 and a standard deviation of about ten. Suppose that 16 individuals are
randomly chosen.
7. The distribution of income in some Third World countries is considered wedge shaped (many
very poor people, very few middle income people, and even fewer wealthy people). Suppose we
pick a country with a wedge shaped distribution. Let the average salary be $2,000 per year with a
standard deviation of $8,000. We randomly survey 1,000 residents of that country.
a. How is it possible for the standard deviation to be greater than the average?
b. Why is it more likely that the average of the 1,000 residents will be from $2,000 to $2,100 than
from $2,100 to $2,200?
8. NeverReady batteries has engineered a newer, longer lasting AAA battery. The company claims
this battery has an average life span of 17 hours with a standard deviation of 0.8 hours. Your
statistics class questions this claim. As a class, you randomly select 30 batteries and find that the
sample mean life span is 16.7 hours. If the process is working properly, what is the probability of
getting a random sample of 30 batteries in which the sample mean lifetime is 16.7 hours or less? Is
the company’s claim reasonable?
conditioners in a large city. Based on service records from previous years, the time that a
technician spends servicing a unit averages one hour with a standard deviation of one hour. In the
coming week, your company will service a simple random sample of 70 units in the city. You plan
to budget an average of 1.1 hours per technician to complete the work. Will this be enough time?
10. Suppose in a local Kindergarten through 12th grade (K – 12) school district, 53% of the
population favor a charter school for grades K through five. A simple random sample of 300 is
surveyed.
a. Find the probability that less than 100 favor a charter school for grades K through 5.
b. Find the probability that 170 or more favor a charter school for grades K through 5.
c. Find the probability that no more than 140 favor a charter school for grades K through 5.
d. Find the probability that there are fewer than 130 that favor a charter school for grades K
through 5.
e. Find the probability that exactly 150 favor a charter school for grades K through 5.
11. Four friends, Janice, Barbara, Kathy and Roberta, decided to carpool together to get to school.
Each day the driver would be chosen by randomly selecting one of the four names. They carpool to
school for 96 days.
12. A question is asked of a class of 200 freshmen, and 23% of the students know the correct answer.
Suppose a sample of 50 students is taken.
a. What is the mean and standard deviation of the distribution of the sample proportions?
b. What is the distribution of the sample proportions? Explain.
c. What is the probability that more than 30% of the students answered correctly?
d. What is the probability that less than 20% of the students answered correctly?
e. What is the probability that between 21% and 25% of the students answered correctly?
13. A virus attacks one in three of the people exposed to it. An entire large city is exposed. Suppose
a sample of 70 people in the city is taken.
a. What is the mean and standard deviation of the distribution of the sample proportions?
b. What is the distribution of the sample proportions? Explain.
c. What is the probability that between 21 and 40 of the people in the sample were exposed to
the virus?
d. What is the probability that more than 35% of the people in the sample were exposed to the
virus?
e. What is the probability that less than 25% of the people in the sample were exposed to the virus?
14. A game is played repeatedly. A player wins one-fifth of the time. Suppose a player plays the game
20 times.
a. What is the mean and standard deviation of the distribution of the sample proportions?
b. What is the distribution of the sample proportions? Explain.
c. What is the probability that the player wins at most 7 times?
d. What is the probability that the player wins at least 30% of the time?
e. What is the probability that the player wins less than 15% of the time?
f. What is the probability that the player wins more than 10 times?
15. A company inspects products coming through its production process, and rejects defective
products. One-tenth of the items are defective. Suppose a sample of 40 items is taken.
a. What is the mean and standard deviation of the distribution of the sample proportions?
b. What is the distribution of the sample proportions? Explain.
c. What is the probability that fewer than 7 of the items in the sample are defective?
d. What is the probability that more than 15% of the items in the sample are defective?
e. What is the probability that at least 3 of the items in the sample are defective?
f. What is the probability that at most 20% of the items in the sample are defective?
Attribution
Chapter Outline
Have you ever wondered what the average number of M&Ms in a bag at the grocery store is? You can use
confidence intervals to answer this question. Photo by comedy_nose, CC BY 4.0.
Suppose you want to determine the mean rent of a two-bedroom apartment in your town. You
might look in the classified section of the newspaper, write down several rents listed, and average
them together. You would obtain a point estimate of the true mean rent of two-bedroom
apartments in your town. If you are trying to determine the percentage of times you make a basket
when shooting a basketball, you might count the number of shots you make and divide that by
the number of shots you attempted. In this case, you would obtain a point estimate for the true
proportion of the baskets you make when shooting a basketball.
We use sample data to make generalizations about an unknown population. This part of
statistics is called inferential statistics. The sample data help us to make an estimate of a
population parameter. We realize that the point estimate is most likely not the exact value of
the population parameter, but close to it. After calculating point estimates, we construct interval
estimates, called confidence intervals.
In this chapter, you will learn to construct and interpret confidence intervals. You will also learn
a new distribution, the $t$-distribution, and how it is used with these intervals. Throughout the
chapter, it is important to keep in mind that the confidence interval is a random variable. It is the
population parameter that is fixed.
Watch this video: Understanding Confidence Intervals: Statistics Help by Dr Nic’s Math and Stats [4:02]
If you worked in the marketing department of an entertainment company, you might be interested
in the mean number of songs a consumer downloads a month from iTunes. If so, you could
conduct a survey and calculate the sample mean $\overline{x}$ and the sample standard deviation $s$. You
would use the sample mean to estimate the population mean and the sample standard deviation
to estimate the population standard deviation. The sample mean $\overline{x}$ is the point estimate for the
population mean $\mu$. The sample standard deviation $s$ is the point estimate for the population
standard deviation $\sigma$. Each of $\overline{x}$ and $s$ is called a statistic.
A confidence interval is another type of estimate but, instead of being just one number, it is an
interval of numbers. The interval of numbers is a range of values calculated from a given set of
sample data. The confidence interval is likely to include the unknown population parameter.
Suppose, for the iTunes example, we do not know the population mean $\mu$, but we do know
the population standard deviation $\sigma$ and the sample size $n$. Then, by the central
limit theorem, the standard deviation for the sample mean is $\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}$.
The empirical rule, which applies to bell-shaped distributions, says that in approximately 95% of
the samples, the sample mean $\overline{x}$ will be within two standard deviations of the population mean $\mu$.
For our iTunes example, two standard deviations is $2 \times \sigma_{\overline{x}}$. The sample mean $\overline{x}$ is likely to
be within $2 \times \sigma_{\overline{x}}$ units of $\mu$.
Because $\overline{x}$ is within $2 \times \sigma_{\overline{x}}$ units of $\mu$, which is unknown, $\mu$ is likely to be within $2 \times \sigma_{\overline{x}}$ units of
$\overline{x}$ in 95% of the samples. The population mean $\mu$ is contained in an interval whose lower number
is calculated by taking the sample mean and subtracting two standard deviations ($2 \times \sigma_{\overline{x}}$)
and whose upper number is calculated by taking the sample mean and adding two standard
deviations. In other words, $\mu$ is between $\overline{x} - 2 \times \sigma_{\overline{x}}$ and $\overline{x} + 2 \times \sigma_{\overline{x}}$ in 95% of all the samples. Suppose
that a sample produced a particular sample mean $\overline{x}$. Then the unknown population mean $\mu$ is estimated to be between
$\overline{x} - 2 \times \sigma_{\overline{x}}$ and $\overline{x} + 2 \times \sigma_{\overline{x}}$, calculated from that sample.
We say that we are 95% confident that the (unknown) population mean number of songs
downloaded from iTunes per month is between these two limits. The 95% confidence interval is the
interval with lower limit $\overline{x} - 2 \times \sigma_{\overline{x}}$ and upper limit $\overline{x} + 2 \times \sigma_{\overline{x}}$.
The 95% confidence interval implies two possibilities. Either the interval from $\overline{x} - 2 \times \sigma_{\overline{x}}$ to $\overline{x} + 2 \times \sigma_{\overline{x}}$ contains the
true mean $\mu$ or our sample produced an $\overline{x}$ that is not within $2 \times \sigma_{\overline{x}}$ units of the true mean $\mu$. Because
we are 95% confident that the true population mean is inside the interval, the second possibility,
that the population mean is not inside the interval, happens for only 5% of all the samples.
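To make the two-standard-deviation reasoning concrete, here is a small numerical sketch in Python. The values used for the population standard deviation, sample size, and sample mean are hypothetical numbers chosen only for illustration and are not taken from the text.

```python
from math import sqrt

sigma = 1.0   # hypothetical population standard deviation (songs per month)
n = 100       # hypothetical sample size
x_bar = 2.0   # hypothetical sample mean (songs per month)

std_error = sigma / sqrt(n)   # standard deviation of the sample mean
margin = 2 * std_error        # two standard deviations (empirical rule, about 95%)

lower = x_bar - margin
upper = x_bar + margin
print(f"{lower:.1f} to {upper:.1f}")   # prints "1.8 to 2.2"
```

With these illustrative numbers, the approximate 95% confidence interval runs from 1.8 to 2.2 songs per month.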
Remember that a confidence interval is created for an unknown population parameter like the
population mean $\mu$. Confidence intervals for some parameters have the form:
\begin{eqnarray*} \mbox{Lower Limit} & = & \mbox{point estimate}-\mbox{margin
of error} \\ \\ \mbox{Upper Limit} & = & \mbox{point estimate}+\mbox{margin of
error} \\ \end{eqnarray*}
The margin of error depends on the confidence level and the standard error of the mean.
When you read newspapers and journals, some reports will use the phrase “margin of error.”
Other reports will not use that phrase, but include a confidence interval as the point estimate plus
or minus the margin of error. These are two ways of expressing the same concept.
Attribution
LEARNING OBJECTIVES
• Calculate and interpret confidence intervals for estimating a population mean where the
population standard deviation is known.
A confidence interval for a population mean with a known standard deviation is based on the fact
that the sample means follow an approximately normal distribution. To construct a confidence
interval for a single unknown population mean $\mu$, where the population standard deviation $\sigma$ is
known, we need $\overline{x}$, which is the point estimate of the unknown population mean $\mu$.
The confidence interval estimate will have the form:
\begin{eqnarray*} \mbox{Lower Limit} & = & \overline{x}-\mbox{margin of error} \\
\\ \mbox{Upper Limit} & = & \overline{x}+\mbox{margin of error} \end{eqnarray*}
The margin of error depends on the confidence level. The confidence level is often considered
the probability that the calculated confidence interval estimate will contain the true population
parameter. However, it is more accurate to state that the confidence level is the percent of
confidence intervals that contain the true population parameter when repeated samples are taken.
Most often, it is the choice of the person constructing the confidence interval to choose a confidence
level of 90% or higher because that person wants to be reasonably certain of their conclusions.
EXAMPLE
Suppose we have collected data from a sample. The sample mean is 7 and the margin of error is 2.5.
If the confidence level is 95%, then we say that, “We estimate with 95% confidence that the true value
of the population mean is between 4.5 and 9.5.”
TRY IT
Suppose we have data from a sample. The sample mean is 15 and the margin of error is 3.2. What is
the confidence interval estimate for the population mean?
A confidence interval for a population mean with a known standard deviation is based on the fact
that the sample means follow an approximately normal distribution. Suppose that our sample has
a mean of $\overline{x} = 10$ and we have constructed the 90% confidence interval with a lower limit of 5 and
an upper limit of 15.
To get a 90% confidence interval, we must include the central 90% of the probability of the normal
distribution. If we include the central 90%, we leave out a total of 10% in both tails, or 5% in each
tail, of the normal distribution.
To capture the central 90%, we must go out 1.645 “standard deviations” on either side of the
calculated sample mean. The value 1.645 is the $z$-score from a standard normal probability
distribution that puts an area of 0.90 in the center, an area of 0.05 in the far left tail, and an area of
0.05 in the far right tail.
It is important that the “standard deviation” used must be appropriate for the parameter we are
estimating. So in this section we need to use the standard deviation that applies to sample means,
which is $\frac{\sigma}{\sqrt{n}}$ (the standard deviation of the sample means). The fraction $\frac{\sigma}{\sqrt{n}}$ is commonly called
the standard error of the mean in order to clearly distinguish the standard deviation for a sample
mean from the population standard deviation $\sigma$.
To construct a confidence interval estimate for an unknown population mean, we need data from a
random sample. The steps to construct and interpret the confidence interval are:
• Calculate the sample mean $\overline{x}$ from the sample data. Remember, in this section we already
know the population standard deviation $\sigma$.
• Find the $z$-score that corresponds to the confidence level $C$.
• Calculate the limits for the confidence interval.
• Write a sentence that interprets the estimate in the context of the problem. (Explain what the
confidence interval means, in the words of the problem.)
We will first examine each step in more detail, and then illustrate the process with some examples.
When we know the population standard deviation $\sigma$, we use a standard normal distribution to
calculate the margin of error and construct the confidence interval. We need to find the value of
$z$ that puts an area equal to the confidence level (in decimal form) in the middle of the standard
normal distribution. The confidence level $C$ is the area in the middle of the standard normal
distribution. The remaining area, $1-C$, is split equally between the two tails, so each of the tails
contains an area equal to $\frac{1-C}{2}$.
The $z$-score needed to construct the confidence interval is the $z$-score so that the entire area to
the left of the $z$-score equals the area in the middle (the confidence level) plus the area in the left tail
$\frac{1-C}{2}$. That is, the required $z$-score for the confidence interval is the $z$-score so that the entire
area to the left of the $z$-score is $C + \frac{1-C}{2}$.
For example, if the confidence level is 95%, then the area in the center of the standard normal
distribution is 0.95 and the area in the left tail is $\frac{1-0.95}{2} = 0.025$. We would need to find the
$z$-score so that the entire area to the left of the $z$-score equals $0.95 + 0.025 = 0.975$.
To find the $z$-score to construct a confidence interval with confidence level $C$, use the
norm.s.inv(area to the left of z) function.
• For area to the left of z, enter the entire area to the left of the $z$-score you are trying to find.
The output from the norm.s.inv function is the value of the $z$-score needed to construct the confidence
interval.
NOTE
The norm.s.inv function requires that we enter the entire area to the left of the unknown
$z$-score. This area includes the confidence level (the area in the middle of the distribution) plus the
remaining area in the left tail.
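The area calculation behind norm.s.inv can be sketched in a few lines. The example below uses Python rather than Excel, so it is an analogue and not the book's method: `NormalDist().inv_cdf` is assumed as the counterpart of norm.s.inv, and the function name `z_for_confidence` is hypothetical.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """z-score with the confidence level in the middle and equal areas in the two tails."""
    area_left = confidence + (1 - confidence) / 2   # middle area plus the left tail
    return NormalDist().inv_cdf(area_left)          # standard normal: mean 0, sd 1

print(round(z_for_confidence(0.90), 3))   # area to the left is 0.95
print(round(z_for_confidence(0.95), 3))   # area to the left is 0.975
```

For a 90% confidence level this returns approximately 1.645, and for 95% approximately 1.960.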
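The margin of error for a confidence interval with confidence level $C$ for an unknown population
mean $\mu$ when the population standard deviation $\sigma$ is known is
\begin{eqnarray*} \mbox{Margin of Error} & = & z \times \frac{\sigma}{\sqrt{n}} \end{eqnarray*}
The limits for the confidence interval with confidence level $C$ for an unknown population mean $\mu$
when the population standard deviation $\sigma$ is known are
\begin{eqnarray*} \mbox{Lower Limit} & = & \overline{x}-z \times \frac{\sigma}{\sqrt{n}} \\ \\ \mbox{Upper Limit} & = & \overline{x}+z \times \frac{\sigma}{\sqrt{n}} \end{eqnarray*}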
The interpretation should clearly state the confidence level $C$, explain what population parameter
is being estimated (in this case a population mean), and state the confidence interval (both
endpoints): “We estimate with ___% confidence that the true population mean (include the
context of the problem) is between ___ and ___ (include appropriate units).”
Watch this video: Confidence Interval for a population mean – $\sigma$ known by Joshua Emmanuel [4:30]
EXAMPLE
Suppose scores on exams in statistics are normally distributed with an unknown population mean
and a population standard deviation of 3 points. A random sample of 36 scores is taken and has a
sample mean of 68 points.
Solution:
1. To find the confidence interval, we need to find the $z$-score for the 90% confidence interval.
This means that we need to find the $z$-score so that the entire area to the left of $z$ is
$0.9 + \frac{1-0.9}{2} = 0.95$.
2. We are 90% confident that the mean exam score is between 67.18 points and 68.82
points.
3. It is not reasonable to conclude that the mean exam score is 70 points because 70 points is
outside the confidence interval. (In this case there is a 90% chance that the actual mean exam
score is in between 67.18 and 68.82 and only a 10% chance that the mean exam score is outside
this interval. So it is unlikely (but not impossible) that the actual mean exam score is a value
outside of the confidence interval.)
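The limits in this example can be reproduced end to end. What follows is a sketch in Python, offered as a cross-check and not as the book's Excel procedure; the function name `confidence_interval` is hypothetical, and `NormalDist().inv_cdf` stands in for norm.s.inv.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(x_bar, sigma, n, confidence):
    """Confidence interval for a population mean when sigma is known."""
    z = NormalDist().inv_cdf(confidence + (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)   # margin of error = z * standard error of the mean
    return x_bar - margin, x_bar + margin

# Exam scores: sample mean 68, population standard deviation 3, sample size 36, 90% confidence
lower, upper = confidence_interval(68, 3, 36, 0.90)
print(round(lower, 2), round(upper, 2))   # prints 67.18 68.82
```

Because the function keeps every decimal place of the $z$-score until the final rounding, it avoids round-off error in the limits.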
NOTES
1. When calculating the limits for the confidence interval keep all of the decimals in the $z$-score
and other values throughout the calculation. This will ensure that there is no round-off error
in the answers. You can use Excel to do the calculation of the limits, clicking on the cells
containing the $z$-score and any other values, to ensure that all of the decimal places are used
in the calculation.
2. When writing down the interpretation of the confidence interval, make sure to include the
confidence level, the actual population mean captured by the confidence interval (i.e. be
specific to the context of the question), and appropriate units for the limits.
3. 90% of all confidence intervals constructed this way contain the true mean exam score. For
example, if we constructed 100 of these confidence intervals (using 100 different samples of
size 36), we would expect 90 of them to contain the true mean exam score.
TRY IT
Suppose average pizza delivery times are normally distributed with an unknown population mean
and a population standard deviation of 6 minutes. A random sample of 28 pizza delivery restaurants
is taken and has a sample mean delivery time of 36 minutes.
2. We are 96% confident that the mean delivery time is between 33.67 minutes and
38.33 minutes.
3. It is reasonable to conclude that the mean delivery time is 35 minutes because 35 minutes is
inside the confidence interval.
EXAMPLE
The Specific Absorption Rate (SAR) for a cell phone measures the amount of radio frequency (RF)
energy absorbed by the user’s body when using the handset. Every cell phone emits RF energy.
Different phone models have different SAR measures. To receive certification from the Federal
Communications Commission (FCC) for sale in the United States, the SAR level for a cell phone
must be no more than 1.6 watts per kilogram. This table shows the highest SAR level for a random
selection of cell phone models as measured by the FCC.
Phone Model            SAR    Phone Model         SAR    Phone Model                     SAR
BlackBerry Tour 9630   1.43   LG Cosmos           1.18   Samsung Epic 4G Touch           0.4
HP/Palm Centro         1.09   LG Trax CU575       1.26   Samsung Messager III SCH-R750   0.68
HTC Touch Pro 2        1.41   Motorola Razr2 V8   0.36   Samsung SGH-A227                1.13
Huawei M835 Ideos      0.82   Motorola Razr2 V9   0.52   SGH-a107 GoPhone                0.3
Kyocera K127 Marbl     1.25   Nokia 1680          1.39   T-Mobile Concord                1.38
1. Find a 98% confidence interval for the mean of the Specific Absorption Rates (SARs)
for cell phones. Assume that the population standard deviation is
$\sigma = 0.337$.
2. Interpret the confidence interval found in part 1.
Solution:
1. To find the confidence interval, we need to find the $z$-score for the 98% confidence interval.
This means that we need to find the $z$-score so that the entire area to the left of $z$ is
$0.98 + \frac{1-0.98}{2} = 0.99$.
2. We are 98% confident that the mean of the Specific Absorption Rates is between
0.8806 watts per kilogram and 1.1839 watts per kilogram.
TRY IT
This table shows a different random sampling of 20 cell phone models. As previously, assume that
the population standard deviation is \sigma = 0.337.
1. Construct a 93% confidence interval for the mean SAR for cell phones certified for use in the
United States.
2. Interpret the confidence interval found in part 1.
2. We are 93% confident that the mean of the Specific Absorption Rates is between
0.8035 watts per kilogram and 1.0766 watts per kilogram.
Notice the difference in the confidence intervals calculated in the Example and Try It just
completed. These intervals are different for several reasons: they were calculated from different
samples, the samples were different sizes, and the intervals were calculated for different levels of
confidence. Even though the intervals are different, they do not yield conflicting information.
EXAMPLE
Suppose scores on exams in statistics are normally distributed with an unknown population mean
and a population standard deviation of 3 points. A random sample of 36 scores is taken and gives a
sample mean of 68 points. Previously we found a 90% confidence interval for the mean exam score.
Now, find a 95% confidence interval for the mean exam score. Interpret the 95% confidence interval.
Solution:
To find the confidence interval, we need to find the z-score for the 95% confidence interval. This
means that we need to find the z-score so that the entire area to the left of z is
0.95 + (1 - 0.95)/2 = 0.975.
We are 95% confident that the mean exam score is between 67.02 points and 68.98 points.
For the exam scores examples, the 90% confidence interval has a lower limit of 67.18 and an upper
limit of 68.82, and the 95% confidence interval has a lower limit of 67.02 and an upper limit of
68.98. Notice that the 95% confidence interval is wider (the distance between the limits is larger
in the 95% confidence interval). If we look at the graphs, because the area 0.95 is larger than the
area 0.90, it makes sense that the 95% confidence interval is wider. To be more confident that the
confidence interval actually does contain the true value of the population mean for all statistics
exam scores, the confidence interval necessarily needs to be wider.
• Increasing the confidence level increases the margin of error, making the confidence interval
wider.
• Decreasing the confidence level decreases the margin of error, making the confidence interval
narrower.
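The effect of the confidence level can be checked numerically. The following is a minimal Python sketch, not part of the Excel workflow used in this book. The helper name z_interval is hypothetical, and NormalDist().inv_cdf from Python's standard library is assumed to play the role of Excel's norm.s.inv function.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(sample_mean, sigma, n, confidence):
    """Confidence interval for a population mean when sigma is known."""
    # Entire area to the left of z: the confidence level plus the left-tail area.
    area_left = confidence + (1 - confidence) / 2
    z = NormalDist().inv_cdf(area_left)  # assumed stand-in for norm.s.inv
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Exam scores example: sample mean 68, sigma = 3, n = 36.
ci_90 = z_interval(68, 3, 36, 0.90)
ci_95 = z_interval(68, 3, 36, 0.95)
print([round(limit, 2) for limit in ci_90])  # [67.18, 68.82]
print([round(limit, 2) for limit in ci_95])  # [67.02, 68.98]
```

Only the confidence level changes between the two calls, so the wider 95% interval comes entirely from the larger z-score.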
EXAMPLE
Suppose scores on exams in statistics are normally distributed with an unknown population mean
and a population standard deviation of 3 points. Previously, we found a 90% confidence interval for
the mean exam score using a sample of size 36 with a sample mean of 68.
1. Suppose everything is kept the same but the sample size is 100 (instead of 36). Find the 90%
confidence interval.
2. Suppose everything is kept the same but the sample size is 25 (instead of 36). Find the 90%
confidence interval.
Solution:
For the exam scores examples, the 90% confidence interval with a sample size of 36 has a lower
limit of 67.18 and an upper limit of 68.82, with a sample size of 100 has a lower limit of 67.51 and
an upper limit of 68.49, and with a sample size of 25 has a lower limit of 67.01 and an upper limit of
68.99. When the sample size increases, the confidence interval becomes narrower. When the sample size
decreases, the confidence interval becomes wider. Generally, the smaller the sample size, the wider the
confidence interval needs to be in order to achieve the same level of confidence.
• Increasing the sample size causes the margin of error to decrease, making the confidence
interval narrower.
• Decreasing the sample size causes the margin of error to increase, making the confidence
interval wider.
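As a rough check on these two statements, the margin of error for the exam scores example can be worked out for each sample size. This sketch assumes \sigma = 3 and a 90% confidence level, with the z-score rounded to 1.6449:

```latex
\begin{eqnarray*}
n=25: \quad \mbox{margin of error} & = & 1.6449 \times \frac{3}{\sqrt{25}} \approx 0.987 \\ \\
n=36: \quad \mbox{margin of error} & = & 1.6449 \times \frac{3}{\sqrt{36}} \approx 0.822 \\ \\
n=100: \quad \mbox{margin of error} & = & 1.6449 \times \frac{3}{\sqrt{100}} \approx 0.493
\end{eqnarray*}
```

Because the sample size appears under a square root in the denominator, the sample size must be multiplied by four to cut the margin of error in half.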
Concept Review
In this section, we learned how to calculate the confidence interval for a single population mean
where the population standard deviation is known. A confidence interval has the general form:
\begin{eqnarray*}\\ \mbox{Lower Limit} & = & \overline{x}-\mbox{margin of error}
\\ \\ \mbox{Upper Limit} & = & \overline{x}+\mbox{margin of error}\\ \\
\end{eqnarray*}
The general form for a confidence interval for a single population mean, known standard
deviation is given by
\begin{eqnarray*}\\ \mbox{Lower Limit} & = & \overline{x}-z \times
\frac{\sigma}{\sqrt{n}} \\ \\ \mbox{Upper Limit} & = & \overline{x}+z \times
\frac{\sigma}{\sqrt{n}}\\ \\ \end{eqnarray*}
where z is the z-score so the area to the left of z is C + (1 - C)/2, and C is the confidence level.
The calculation of the margin of error depends on the size of the sample and the level of
confidence required. The confidence level is the percent of all possible samples that can be
expected to include the true population parameter. As the confidence level increases, the
corresponding margin of error increases as well. As the sample size increases, the margin of error
decreases.
Attribution
“8.1 A Single Population Mean using the Normal Distribution” in Introductory Statistics by
OpenStax is licensed under a Creative Commons Attribution 4.0 International License.
7.3 CONFIDENCE INTERVALS FOR A
SINGLE POPULATION MEAN WITH
UNKNOWN POPULATION STANDARD
DEVIATION
LEARNING OBJECTIVES
• Calculate and interpret confidence intervals for estimating a population mean where the
population standard deviation is unknown.
In practice, we rarely know the population standard deviation. In the past, when the sample size
was large, this did not present a problem to statisticians. They used the sample standard deviation
as an estimate for \sigma, and proceeded as before to calculate a confidence interval with close enough
results. However, statisticians ran into problems when the sample size was small. A small sample
size caused inaccuracies in the confidence interval.
William S. Gosset (1876–1937) of the Guinness brewery in Dublin, Ireland ran into this problem.
His experiments with hops and barley produced very few samples. Just replacing \sigma with s did not
produce accurate results when he tried to calculate a confidence interval. He realized that he could
not use a normal distribution for the calculation. He found that the actual distribution depends on
the sample size. This problem led him to “discover” what is called the Student’s t-distribution.
The name comes from the fact that Gosset wrote under the pen name “Student.”
Up until the mid-1970s, some statisticians used the normal distribution approximation for large
sample sizes and only used the t-distribution for sample sizes of at most 30. With technology, the
practice now is to use the t-distribution whenever s is used as an estimate for \sigma.
When a simple random sample of size n is taken from a population that has an approximately
normal distribution with mean \mu and an unknown population standard deviation, and the sample
standard deviation s is used as an estimate for the population standard deviation, the distribution
of the sample means follows a t-distribution with n-1 degrees of freedom. For each sample size
n, there is a different t-distribution. The t-score is
\begin{eqnarray*} t & = & \frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}} \end{eqnarray*}
Every t-distribution has a parameter called the degrees of freedom (df). In this case where the
t-distribution is used for the distribution of the sample means, the value of the degrees of freedom
is n-1. Here the value of n-1 used as the degrees of freedom comes from the calculation of
the sample standard deviation s. Because the sum of the deviations is zero, we can find the last
deviation once we know the other n-1 deviations. The other n-1 deviations can change or vary
freely. Note that the value or formula of the degrees of freedom for the t-distribution will vary
depending on the situation in which the t-distribution is used.
When finding a confidence interval for an unknown population mean when the population standard
deviation is unknown, we use the sample standard deviation as an estimate for the (unknown)
population standard deviation and we use a t-distribution with n-1 degrees of freedom to find the
required t-score for the confidence interval. In this case we replace the z-score with a t-score and
\sigma with s in the formulas for the limits of the confidence interval for a population mean.
To construct the confidence interval, take a random sample of size n from the population.
Calculate the sample mean \overline{x} and the sample standard deviation s. The limits for the confidence
interval with confidence level C for an unknown population mean when the population standard
deviation is unknown are
\begin{eqnarray*} \\ \mbox{Lower Limit} & = & \overline{x}-t \times \frac{s}{\sqrt{n}}
\\ \\ \mbox{Upper Limit} & = & \overline{x}+t \times \frac{s}{\sqrt{n}} \\ \\
\end{eqnarray*}
where t is the (positive) t-score of the t-distribution with n-1 degrees of freedom so the area
under the t-distribution in between -t and t is C.
To find the t-score to construct a confidence interval with confidence level C, use the t.inv.2t(area
in the tails, degrees of freedom) function.
• For area in the tails, enter the sum of the area in the tails of the t-distribution. For a
confidence interval, the area in the tails is 1-C.
• For degrees of freedom, enter the value of the degrees of freedom for the t-distribution. For
a confidence interval for a population mean, the degrees of freedom is n-1.
The output from the t.inv.2t function is the value of the t-score needed to construct the confidence
interval.
Visit the Microsoft page for more information about the t.inv.2t function.
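To see what this function computes, here is a minimal Python sketch that finds the same kind of t-score using only the standard library. The function names are hypothetical and this is not how Excel implements t.inv.2t. It assumes the tail area is strictly between 0 and 1, integrates the t-distribution with Simpson's rule, and then searches for the t-score by bisection.

```python
import math


def t_pdf(x, df):
    """Density of the Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def central_area(t, df, steps=2000):
    """Area under the t-distribution between -t and t (Simpson's rule)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Integrate from 0 to t, then double because the distribution is symmetric.
    return 2 * total * h / 3


def t_inv_2t(tail_area, df):
    """Positive t-score whose two tails together have area tail_area."""
    if not 0 < tail_area < 1:
        raise ValueError("tail_area must be strictly between 0 and 1")
    target = 1 - tail_area  # area in the middle, i.e. the confidence level
    low, high = 0.0, 1.0
    while central_area(high, df) < target:  # widen until the target is bracketed
        high *= 2
    for _ in range(60):  # bisection
        mid = (low + high) / 2
        if central_area(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# 95% confidence interval with a sample of size 15: tail area 0.05, df = 14.
t_score = t_inv_2t(0.05, 14)
print(round(t_score, 4))  # 2.1448
```

The two arguments mirror the two fields of t.inv.2t: the combined area in both tails first, then the degrees of freedom.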
NOTE
The t.inv.2t function requires that we enter the sum of the area in both tails. The area in the
middle of the distribution is the confidence level C, so the sum of the area in both tails is the
leftover area 1-C.
Watch this video: Confidence Interval for a population mean – unknown by Joshua Emmanuel [7:40]
EXAMPLE
Suppose you do a study of acupuncture to determine how effective it is in relieving pain. You
measure sensory rates for 15 subjects with the results given below.
Solution:
1. To find the confidence interval, we need to find the t-score for the 95% confidence interval.
This means that we need to find the t-score so that the area in the tails is 1 - 0.95 = 0.05.
The degrees of freedom for the t-distribution is n - 1 = 15 - 1 = 14.
Function: t.inv.2t   Field 1: 0.05   Field 2: 14
2. We are 95% confident that the mean sensory rate is between 7.30 and 9.15.
3. It is not reasonable to conclude that the mean sensory rate is 10 because 10 is outside of the
confidence interval.
NOTE
When calculating the limits for the confidence interval, keep all of the decimals in the t-score and
other values such as \overline{x} and s throughout the calculation. This will ensure that there is no round-off
error in the answers. You can use Excel to do the calculation of the limits, clicking on the cells
containing the t-score, \overline{x}, and s, to ensure that all of the decimal places are used in the calculation.
TRY IT
You do a study of hypnotherapy to determine how effective it is in increasing the number of hours of
sleep subjects get each night. You measure hours of sleep for 12 subjects with the following results.
1. Construct a 97% confidence interval for the mean number of hours slept each night.
2. Interpret the confidence interval found in part 1.
3. Is it reasonable to assume that the mean number of hours slept each night is 9 hours?
Explain.
Function: t.inv.2t   Field 1: 0.03   Field 2: 11
2. We are 97% confident that the mean number of hours slept each night is between
8.056 hours and 9.911 hours.
3. It is reasonable to assume the mean number of hours slept each night is 9 hours because 9 is
inside the confidence interval.
EXAMPLE
The Human Toxome Project (HTP) is working to understand the scope of industrial pollution in the
human body. Industrial chemicals may enter the body through pollution or as ingredients in
consumer products. In October 2008, the scientists at HTP tested cord blood samples for 20 newborn
infants in the United States. The cord blood of the “In utero/newborn” group was tested for 430
industrial compounds, pollutants, and other chemicals, including chemicals linked to brain and
nervous system toxicity, immune system toxicity, and reproductive toxicity, and fertility problems.
There are health concerns about the effects of some chemicals on the brain and nervous system. This
table shows how many of the targeted chemicals were found in each infant’s cord blood.
1. Construct a 90% confidence interval for the mean number of targeted industrial chemicals
found in an infant’s blood.
2. Interpret the confidence interval found in part 1.
Solution:
1. To find the confidence interval, we need to find the t-score for the 90% confidence interval.
This means that we need to find the t-score so that the area in the tails is 1 - 0.90 = 0.10.
The degrees of freedom for the t-distribution is n - 1 = 20 - 1 = 19.
Function: t.inv.2t   Field 1: 0.10   Field 2: 19
2. We are 90% confident that the mean number of targeted industrial chemicals found
in an infant’s blood is between 117.41 and 137.49.
TRY IT
A random sample of statistics students were asked to estimate the total number of hours they spend
watching television in an average week. The responses are recorded in this table.
0 3 1 20 9
5 10 1 10 4
14 2 4 4 5
1. Construct a 98% confidence interval for the mean number of hours statistics students will
spend watching television in one week.
2. Interpret the confidence interval found in part 1.
3. Is it reasonable to conclude that the mean number of hours statistics students spend watching
television in one week is 5? Explain.
Function: t.inv.2t   Field 1: 0.02   Field 2: 14
2. We are 98% confident that the mean number of hours statistics students will spend
watching television in one week is between 2.397 hours and 9.870 hours.
3. It is reasonable to assume the mean number of hours statistics students will spend watching
television in one week is 5 hours because 5 is inside the confidence interval.
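The arithmetic behind this interval can be retraced outside of Excel. The following Python sketch recomputes the sample mean, sample standard deviation and limits from the table of responses. The t-score is entered as a rounded constant and is assumed to be the output of t.inv.2t(0.02, 14), so the limits agree with the answer above only up to rounding.

```python
from math import sqrt
from statistics import mean, stdev

# Hours of television per week reported by the 15 students in the table.
hours = [0, 3, 1, 20, 9, 5, 10, 1, 10, 4, 14, 2, 4, 4, 5]

n = len(hours)
x_bar = mean(hours)   # sample mean
s = stdev(hours)      # sample standard deviation (divides by n - 1)

# Assumed value: rounded output of t.inv.2t(0.02, 14) for 98% confidence.
t_score = 2.6245

margin = t_score * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(f"n = {n}, mean = {x_bar:.4f}, s = {s:.4f}")
print(f"98% confidence interval: {lower:.2f} to {upper:.2f}")
```

Note that stdev is the sample standard deviation, which is the quantity the t-distribution interval requires; statistics.pstdev would divide by n instead and give limits that are too narrow.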
Concept Review
In many cases, the population standard deviation for the population being studied is unknown.
In these cases, it is common to use the sample standard deviation s as an estimate of \sigma. The normal
distribution creates accurate confidence intervals when \sigma is known, but it is not as accurate when
s is used as an estimate. In this case, the t-distribution is much better.
The general form for a confidence interval for a single population mean with unknown
population standard deviation is given by
\begin{eqnarray*} \\ \mbox{Lower Limit} & = & \overline{x}-t \times \frac{s}{\sqrt{n}}
\\ \\ \mbox{Upper Limit} & = & \overline{x}+t \times \frac{s}{\sqrt{n}} \\ \\
\end{eqnarray*}
where t is the (positive) t-score of the t-distribution with n-1 degrees of freedom so the area
under the t-distribution in between -t and t is C.
Attribution
“8.2 A Single Population Mean using the Student t Distribution” in Introductory Statistics by
OpenStax is licensed under a Creative Commons Attribution 4.0 International License.
7.4 CONFIDENCE INTERVALS FOR A
POPULATION PROPORTION
LEARNING OBJECTIVES
During an election year, we see articles in the newspaper that state confidence intervals in terms
of proportions or percentages. For example, a poll for a particular candidate running for president
might show that the candidate has 40% of the vote within three percentage points (if the sample is
large enough). Often, election polls are calculated with 95% confidence, so, the pollsters would be
95% confident that the true proportion of voters who favored the candidate would be between 37%
and 43%.
Investors in the stock market are interested in the true proportion of stocks that go up and down
each week. Businesses that sell personal computers are interested in the proportion of households
in the United States that own personal computers. Confidence intervals can be calculated for the
true proportion of stocks that go up or down each week and for the true proportion of households
in the United States that own personal computers.
A confidence interval for a population proportion is based on the fact that the sample proportions
follow an approximately normal distribution when both n \times p \geq 5 and n \times (1-p) \geq 5. Similar
to confidence intervals for population means, a confidence interval for a population proportion is
constructed by taking a sample of size n from the population, calculating the sample proportion
\hat{p}, and then adding and subtracting the margin of error from \hat{p} to get the limits of the confidence
interval.
In order to construct a confidence interval for a population proportion, we must be able to
assume the sample proportions follow a normal distribution. As we have seen previously, we
can assume the sample proportions follow a normal distribution when both n \times p \geq 5 and
n \times (1-p) \geq 5.
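The margin of error for a confidence interval with confidence level C for an unknown population
proportion p is
\begin{eqnarray*} \\ \mbox{Margin of Error} & = & z \times \sqrt{\frac{\hat{p} \times
(1-\hat{p})}{n}} \\ \\ \end{eqnarray*}
NOTE
In the margin of error formula, the sample proportion \hat{p} is used to estimate the unknown
population proportion p. The sample proportion \hat{p} is used because p is the unknown
quantity we are trying to estimate with the confidence interval. The sample proportion \hat{p} is
calculated from the sample taken to construct the confidence interval where
\begin{eqnarray*} \\ \hat{p} & = & \frac{\mbox{number of items in the sample with the
characteristic of interest}}{n} \\ \\ \end{eqnarray*}
The limits for the confidence interval with confidence level C for an unknown population
proportion p are
\begin{eqnarray*} \\ \mbox{Lower Limit} & = & \hat{p}-z \times \sqrt{\frac{\hat{p}
\times (1-\hat{p})}{n}} \\ \\ \mbox{Upper Limit} & = & \hat{p}+z \times
\sqrt{\frac{\hat{p} \times (1-\hat{p})}{n}} \\ \\ \end{eqnarray*}
where z is the z-score so the area to the left of z is C + (1 - C)/2.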
NOTE
The confidence interval can only be used if we can assume the sample proportions follow a
normal distribution. This means we must check that n \times \hat{p} \geq 5 and n \times (1-\hat{p}) \geq 5 before
constructing the confidence interval. If one of n \times \hat{p} or n \times (1-\hat{p}) is less than 5, we cannot
construct the confidence interval.
To find the z-score to construct a confidence interval with confidence level C, use the
norm.s.inv(area to the left of z) function.
• For area to the left of z, enter the entire area to the left of the z-score you are trying to find.
For a confidence interval, the area to the left of z is C + (1 - C)/2.
The output from the norm.s.inv function is the value of the z-score needed to construct the confidence
interval.
NOTE
The norm.s.inv function requires that we enter the entire area to the left of the unknown
z-score. This area includes the confidence level (the area in the middle of the distribution) plus the
remaining area in the left tail.
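The relationship between the confidence level and the area entered into norm.s.inv can be illustrated with a short Python sketch. The helper name z_for_confidence is hypothetical, and NormalDist().inv_cdf from Python's standard library is assumed to behave like norm.s.inv for the standard normal distribution.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def z_for_confidence(confidence):
    """z-score with the confidence level in the middle of the distribution."""
    # Confidence level plus the remaining area in the left tail.
    area_left = confidence + (1 - confidence) / 2
    return standard_normal.inv_cdf(area_left)


z = z_for_confidence(0.95)
# Area between -z and z should recover the confidence level.
middle_area = standard_normal.cdf(z) - standard_normal.cdf(-z)
print(round(z, 4), round(middle_area, 4))
```

Entering only the confidence level (0.95 here) as the area to the left would give a z-score that is too small, because it ignores the area in the left tail.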
EXAMPLE
Suppose that a market research firm is hired to estimate the percent of adults living in a large city
who have cell phones. Five hundred randomly selected adult residents in this city are surveyed to
determine whether they have cell phones. Of the 500 people surveyed, 421 responded yes – they own
cell phones.
1. Construct a 95% confidence interval for the proportion of adult residents of this city who have
cell phones.
2. Interpret the confidence interval found in part 1.
3. Is it reasonable to conclude that 85% of the adult residents of this city have cell phones?
Explain.
Solution:
To find the confidence interval, we need to find the z-score for the 95% confidence interval.
This means that we need to find the z-score so that the entire area to the left of z is
0.95 + (1 - 0.95)/2 = 0.975.
2. We are 95% confident that the proportion of adult residents of this city who have cell
phones is between 81% and 87.4%.
3. It is reasonable to conclude that 85% of the adult residents of this city have cell phones
because 85% is inside the confidence interval.
NOTES
1. When calculating the limits for the confidence interval, keep all of the decimals in the z-score
and other values throughout the calculation. This will ensure that there is no round-off error
in the answers. You can use Excel to do the calculation of the limits, clicking on the cells
containing the z-score and any other values, to ensure that all of the decimal places are used
in the calculation.
2. The limits for the confidence interval are percents. For example, the upper limit of 0.8740 is
the decimal form of a percent: 87.4%.
3. When writing down the interpretation of the confidence interval, make sure to include the
confidence level, the actual population proportion captured by the confidence interval (i.e. be
specific to the context of the question), and express the limits as percents.
4. 95% of all confidence intervals constructed this way contain the proportion of adult residents
in this city that have a cell phone. For example, if we constructed 100 of these confidence
intervals (using 100 different samples of size 500), we would expect 95 of them to contain the true
proportion of adult residents in this city that have a cell phone.
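The steps in this example (check the normality conditions, find the z-score, then compute the limits) can be collected into one short Python sketch. The function name proportion_interval is hypothetical, and NormalDist().inv_cdf is assumed to stand in for norm.s.inv.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence):
    """Confidence interval for a population proportion."""
    p_hat = successes / n
    # The interval is only valid when the sample proportions are roughly normal.
    if n * p_hat < 5 or n * (1 - p_hat) < 5:
        raise ValueError("n * p-hat and n * (1 - p-hat) must both be at least 5")
    z = NormalDist().inv_cdf(confidence + (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Cell phone example: 421 of 500 adults surveyed own a cell phone.
lower, upper = proportion_interval(421, 500, 0.95)
print(f"{lower:.4f} to {upper:.4f}")  # 0.8100 to 0.8740
```

The limits are returned as decimals, so they still need to be converted to percents (81% and 87.4%) when writing the interpretation.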
TRY IT
Suppose 250 randomly selected people are surveyed to determine if they own a tablet. Of the 250
surveyed, 98 reported owning a tablet.
1. Construct a 94% confidence interval for the proportion of people who own tablets.
2. Interpret the confidence interval found in part 1.
3. Is it reasonable to assume that 30% of people own tablets? Explain.
2. We are 94% confident that the proportion of people who own tablets is between
33.39% and 45.01%.
3. It is not reasonable to claim the proportion of people who own tablets is 30% because 30% is
outside the confidence interval.
EXAMPLE
For a class project, a political science student at a large university wants to estimate the percent of
students who are registered voters. He surveys 500 students and finds that 300 are registered voters.
1. Construct a 90% confidence interval for the percent of students who are registered voters.
2. Interpret the confidence interval found in part 1.
Solution:
\begin{eqnarray*} \\ n \times \hat{p} & = & 500 \times 0.6=300 \geq 5 \\ \\ n \times
(1-\hat{p}) & = & 500 \times (1-0.6)=200 \geq 5 \\ \\ \end{eqnarray*}
To find the confidence interval, we need to find the z-score for the 90% confidence interval.
This means that we need to find the z-score so that the entire area to the left of z is
0.90 + (1 - 0.90)/2 = 0.95.
2. We are 90% confident that the percent of students who are registered voters is
between 56.4% and 63.6%.
TRY IT
A student polls her school to see if students in the school district are for or against the new
legislation regarding school uniforms. She surveys 600 students and finds that 480 are against the
new legislation.
1. Construct a 98% confidence interval for the proportion of students who are against the new
legislation.
2. Interpret the confidence interval found in part 1.
3. A parents group claims that only 75% of students are against the legislation. Is it reasonable
for the group to make this claim? Explain.
2. We are 98% confident that the proportion of students who are against the new
legislation is between 76.20% and 83.80%.
3. It is not reasonable for the group to claim the proportion is 75% because 75% is outside of the
confidence interval.
Watch this video: Confidence Interval for a population proportion by Excel is Fun [8:34]
Watch this video: Confidence Interval for a population proportion by Excel is Fun [4:51]
Concept Review
Some statistical measures, like many survey questions, measure qualitative rather than quantitative
data. In this case, the population parameter being estimated is a proportion. It is possible to create
a confidence interval for the true population proportion following procedures similar to those used
in creating confidence intervals for population means. The formulas are slightly different, but they
follow the same reasoning.
The general form for a confidence interval for a single population proportion is given by
\begin{eqnarray*} \\ \mbox{Lower Limit} & = & \hat{p}-z \times \sqrt{\frac{\hat{p}
\times (1-\hat{p})}{n}} \\ \\ \mbox{Upper Limit} & = & \hat{p}+z \times
\sqrt{\frac{\hat{p} \times (1-\hat{p})}{n}} \\ \\ \end{eqnarray*}
where z is the z-score so the area to the left of z is C + (1 - C)/2, and C is the confidence level.
Attribution
7.5 CALCULATING THE SAMPLE SIZE FOR A
CONFIDENCE INTERVAL
LEARNING OBJECTIVES
Usually we have no control over the sample size of a data set. However, if we are able to set the
sample size, as in cases where we are taking a survey, it is very helpful to know just how large it
should be to provide the most information. Sampling can be very costly, in both time and product.
Simple telephone surveys will cost approximately $30.00 each, for example, and some sampling
requires the destruction of the product. Selecting a sample that is too large is expensive and time
consuming. But selecting a sample that is too small can lead to inaccurate conclusions. We want
to find the minimum sample size required to achieve the desired level of accuracy in the confidence
interval.
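The margin of error E for a confidence interval for a population mean is
\begin{eqnarray*} \\ E & = & z \times \frac{\sigma}{\sqrt{n}} \\ \\ \end{eqnarray*}
where z is the z-score so that the area under the standard normal distribution in between
-z and z is the confidence level C.
Rearranging this formula for n we get a formula for the sample size n:
\begin{eqnarray*} \\ n & = & \left(\frac{z \times \sigma}{E}\right)^2 \\ \\ \end{eqnarray*}
• The value for z is determined by the confidence level of the interval, calculated the same way
we calculate the z-score for a confidence interval.
• The value for the margin of error E is set as the predetermined acceptable error, or tolerance,
for the difference between the sample mean \overline{x} and the population mean \mu. In other words,
E is set to half of the maximum allowable width of the confidence interval.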
• An estimate for the population standard deviation can be found by one of the following
methods:
◦ Conduct a small pilot study and use the sample standard deviation from the pilot study.
◦ Use the sample standard deviation from previously collected data. Although crude, this
method of estimating the standard deviation may help reduce costs significantly.
◦ Use where is the difference between the maximum and minimum values
NOTES
1. Although we do not know the population standard deviation when calculating the sample
size, we do not use the t-distribution in the sample size formula. In order to use the
t-distribution in this situation, we need the degrees of freedom n-1. But n is the sample
size we are trying to estimate. So, we must use the normal distribution to determine the
sample size.
2. The value of n determined from the formula is the minimum sample size required to
achieve the desired level of confidence. The sample size n is a count, and so is an integer.
It would be unusual for the value of n generated by the formula to be an integer. Because
n is the minimum sample size required, we must round the output from the formula up
to the next integer. If we round the value of n down, the sample size will be below the
minimum required sample size.
3. After we have found the sample size and collected the data for the sample, we use the
appropriate confidence interval formula and the sample standard deviation from the actual
sample (assuming \sigma is unknown), and not the estimate of the standard deviation used in
the calculation of the sample size.
7.5 CALCULATING THE SAMPLE SIZE FOR A CONFIDENCE INTERVAL | 441
To find the z-score to calculate the sample size for a confidence interval with confidence level C, use
the norm.s.inv(area to the left of z) function.
• For area to the left of z, enter the entire area to the left of the z-score you are trying to find.
For a confidence interval, the area to the left of z is C + (1 - C)/2.
The output from the norm.s.inv function is the value of the z-score needed to find the sample size.
EXAMPLE
We want to estimate the mean age of Foothill College students. From previous information, an
estimate of the standard deviation of the ages of the students is 15 years. We want to be 95%
confident that the sample mean age is within two years of the population mean age. How many
randomly selected Foothill College students must be surveyed to achieve the desired level of
accuracy?
Solution:
To find the sample size, we need to find the z-score for the 95% confidence interval. This means that
we need to find the z-score so that the entire area to the left of z is 0.95 + (1 - 0.95)/2 = 0.975.
NOTE
Remember to round the value for the sample size UP to the next integer. This ensures that the
sample size is an integer and is large enough. Do not forget to include appropriate units with the
sample size.
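The sample size calculation for a population mean, including the round-up step, can be sketched in Python as follows. The function name sample_size_for_mean is hypothetical, NormalDist().inv_cdf is assumed to stand in for norm.s.inv, and the formula used is the margin of error for a mean, z times sigma over the square root of n, rearranged for n.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence):
    """Minimum sample size so the margin of error for a mean is at most margin."""
    z = NormalDist().inv_cdf(confidence + (1 - confidence) / 2)
    raw_n = (z * sigma / margin) ** 2
    # Always round UP: rounding down would fall below the required minimum.
    return ceil(raw_n)


# Foothill College example: sigma estimated as 15 years, margin of error
# 2 years, 95% confidence.
print(sample_size_for_mean(15, 2, 0.95))  # 217
```

The unrounded value here is about 216.08, so ordinary rounding would give 216 students, one fewer than the minimum needed.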
TRY IT
You want to estimate the height of all high school basketball players. You want to be 98% confident
with a margin of error of 1.5. From a small pilot study, you estimate the standard deviation to be 3
inches. How large a sample do you need to take to achieve the desired level of accuracy?
The margin of error E for a confidence interval for a population proportion is
\begin{eqnarray*} \\ E & = & z \times \sqrt{\frac{p \times (1-p)}{n}} \\ \\ \end{eqnarray*}
where z is the z-score so that the area under the standard normal distribution in between
-z and z is the confidence level C.
Rearranging this formula for n we get a formula for the sample size n:
\begin{eqnarray*} \\ n & = & p \times (1-p) \times \left(\frac{z}{E}\right)^2 \\ \\ \end{eqnarray*}
• The value for z is determined by the confidence level of the interval, calculated the same way
we calculate the z-score for a confidence interval.
• The value for the margin of error E is set as the predetermined acceptable error, or tolerance,
for the difference between the sample proportion \hat{p} and the population proportion p. In other
words, E is set to half of the maximum allowable width of the confidence interval.
• An estimate for the population proportion p. If no estimate for the population proportion is
provided, we use p = 0.5.
NOTES
1. The value of n determined from the formula is the minimum sample size required to
achieve the desired level of confidence. The sample size n is a count, and so is an integer.
It would be unusual for the value of n generated by the formula to be an integer. Because
n is the minimum sample size required, we must round the output from the formula up
to the next integer. If we round the value of n down, the sample size will be below the
minimum required sample size.
2. After we have found the sample size and collected the data for the sample, we use the
appropriate confidence interval formula and the sample proportion from the actual
sample.
3. By using p = 0.5 as an estimate for p in the sample size formula we will get the largest
required sample size for the confidence level and margin of error we selected. This is true
because of all combinations of two fractions (the values of p and 1-p) that add to one,
the largest product is when each is 0.5. Without any other information concerning the
population parameter p, this is the common practice. This may result in oversampling,
but certainly not undersampling.
There is an interesting trade-off between the level of confidence and the sample size that shows up
here when considering the cost of sampling. The table below shows the appropriate sample size at
different levels of confidence and different margins of error, assuming p = 0.5. Looking at each
row, we can see that for the same margin of error, a higher level of confidence requires a larger
sample size. Similarly, looking at each column, we can see that for the same confidence level, a
smaller margin of error requires a larger sample size.
Margin of Error    Required Sample Size (90%)    Required Sample Size (95%)
2%                 1691                          2401
3%                 752                           1067
5%                 271                           384
10%                68                            96
EXAMPLE
Suppose a mobile phone company wants to determine the current percentage of customers aged 50+
who use text messaging on their cell phones. How many customers aged 50+ should the company
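survey in order to be 90% confident with a margin of error of 3%?
Solution:
To find the sample size, we need to find the $z$-score for the 90% confidence interval. This means that
we need to find the $z$-score so that the entire area to the left of $z$ is $0.9 + \frac{1-0.9}{2} = 0.95$,
which gives $z \approx 1.645$. No estimate for the population proportion is provided, so we use $p = 0.5$.
With $E = 0.03$, the sample size formula gives
\begin{eqnarray*} n & = & p(1-p)\left(\frac{z}{E}\right)^2 \\ & = & 0.5 \times 0.5 \times \left(\frac{1.645}{0.03}\right)^2 \\ & \approx & 751.7 \end{eqnarray*}
Rounding up, 752 customers aged 50+ must be surveyed to achieve the desired accuracy.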
NOTE
Remember to round the value for the sample size UP to the next integer. This ensures that the
sample size is large enough. Do not forget to include appropriate units with the sample size.
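The calculation in the example above can also be sketched in code. The following is a minimal illustration in Python rather than Excel; the function name `required_sample_size` and its parameters are hypothetical, and it assumes the sample size formula $n = p(1-p)(z/E)^2$ with $p = 0.5$ as the default estimate when none is provided.

```python
import math
from statistics import NormalDist


def required_sample_size(confidence, margin_of_error, p_estimate=0.5):
    """Minimum sample size for a confidence interval for a proportion.

    Hypothetical helper for illustration; p_estimate defaults to 0.5,
    the value used when no estimate of the proportion is available.
    """
    # z-score with area (1 + confidence) / 2 to its left,
    # for example 0.95 for a 90% confidence level
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    n = p_estimate * (1 - p_estimate) * (z / margin_of_error) ** 2
    # The sample size is a count, so always round UP to the next integer
    return math.ceil(n)


# 90% confidence, 3% margin of error, no estimate for p
print(required_sample_size(0.90, 0.03))  # 752 customers
```

Using `math.ceil` instead of ordinary rounding mirrors the note above: rounding down would leave the sample below the minimum required size.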
TRY IT
Suppose an internet marketing company wants to determine the percentage of customers who click
on ads on their smartphones. How many customers should the company survey in order to be 94%
confident that the estimated proportion is within 5% of the population proportion of customers who
click on ads on their smartphones?
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=168#oembed-1
Watch this video: Sample Size for Confidence Intervals by ExcelIsFun [7:54]
Concept Review
In order to construct a confidence interval, a sample is taken from the population under study.
But collecting sample information is time consuming and expensive. The minimum sample size
required to achieve the desired level of accuracy is determined before collecting the sample data.
After calculating the value of $n$ from the formula, round the value of $n$ up to the next integer.
Attribution
“7.2 The Central Limit Theorem for Sums” in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
“8.4 Calculating the Sample Size n: Continuous and Binary Random Variables” in Introductory
Business Statistics by OpenStax is licensed under a Creative Commons Attribution 4.0
International License.
7.6 EXERCISES
a. Construct a 95% confidence interval for the population mean weight of newborn elephants.
b. Interpret the confidence interval found in part (a).
c. What will happen to the confidence interval obtained, if 500 newborn elephants are weighed
instead of 50? Why?
2. The U.S. Census Bureau conducts a study to determine the time needed to complete the short
form. The Bureau surveys 200 people. The sample mean is 8.2 minutes. There is a known standard
deviation of 2.2 minutes. The population distribution is assumed to be normal.
a. Construct a 90% confidence interval for the population mean time to complete the forms.
b. Interpret the confidence interval found in part (a).
c. Is it reasonable to conclude the mean time to complete the forms is 10 minutes? Explain.
d. If the Census wants to increase its level of confidence and keep the error bound the same by
taking another survey, what changes should it make?
e. If the Census did another survey, kept the error bound the same, and surveyed only 50 people
instead of 200, what would happen to the level of confidence? Why?
f. Suppose the Census needed to be 98% confident of the population mean length of time.
Would the Census have to survey more people? Why or why not?
3. A sample of 20 heads of lettuce was selected. Assume that the population distribution of head
weight is normal. The weight of each head of lettuce was then recorded. The mean weight was 2.2
pounds with a standard deviation of 0.1 pounds. The population standard deviation is known to be
0.2 pounds.
a. Construct a 90% confidence interval for the population mean weight of the heads of lettuce.
b. Interpret the confidence interval found in part (a).
c. Construct a 95% confidence interval for the population mean weight of the heads of lettuce.
d. In complete sentences, explain why the confidence interval in part (c) is larger than in part
(a).
e. What would happen if 40 heads of lettuce were sampled instead of 20, and the error bound
remained the same?
f. What would happen if 40 heads of lettuce were sampled instead of 20, and the confidence
level remained the same?
4. The mean age for all Foothill College students for a recent Fall term was 33.2. The population
standard deviation has been pretty consistent at 15. Suppose that twenty-five Winter students were
randomly selected. The mean age for the sample was 30.4. We are interested in the true mean age
for Winter Foothill College students.
a. Construct a 99% confidence interval for the mean age of students at Foothill College.
b. Interpret the confidence interval found in part (a).
c. Is it reasonable for the college to claim that the mean age of its students is 35? Explain.
d. Using the same mean, standard deviation, and level of confidence, suppose that $n$ were 69
instead of 25. Would the error bound become larger or smaller? How do you know?
e. Using the same mean, standard deviation, and sample size, how would the error bound
change if the confidence level were reduced to 90%? Why?
5. Among various ethnic groups, the standard deviation of heights is known to be approximately
three inches. We wish to construct a 95% confidence interval for the mean height of male Swedes.
Forty-eight male Swedes are surveyed. The sample mean is 71 inches. The sample standard
deviation is 2.8 inches.
a. Construct a 95% confidence interval for the population mean height of male Swedes.
b. Interpret the confidence interval found in part (a).
c. Is it reasonable to claim that the mean height of male Swedes is 75 inches? Explain.
6. Announcements for 84 upcoming engineering conferences were randomly picked from a stack
of IEEE Spectrum magazines. The mean length of the conferences was 3.94 days, with a standard
deviation of 1.28 days. Assume the underlying population is normal.
a. Construct a 97% confidence interval for the population mean length of engineering
conferences.
b. Interpret the confidence interval found in part (a).
c. Is it reasonable to claim that the mean length of the conferences is 3 days? Explain.
7. Suppose that an accounting firm does a study to determine the time needed to complete one
person’s tax forms. It randomly surveys 100 people. The sample mean is 23.6 hours. There is a
known standard deviation of 7.0 hours. The population distribution is assumed to be normal.
a. Construct a 90% confidence interval for the population mean time to complete the tax forms.
b. If the firm wished to increase its level of confidence and keep the error bound the same by
taking another survey, what changes should it make?
c. If the firm did another survey, kept the error bound the same, and only surveyed 49 people,
what would happen to the level of confidence? Why?
d. Suppose that the firm decided that it needed to be at least 96% confident of the population
mean length of time to within one hour. How would the number of people the firm surveys
change? Why?
8. A sample of 16 small bags of the same brand of candies was selected. Assume that the population
distribution of bag weights is normal. The weight of each bag was then recorded. The mean weight
was two ounces with a standard deviation of 0.12 ounces. The population standard deviation is
known to be 0.1 ounce.
a. Construct a 90% confidence interval for the population mean weight of the candies.
b. Construct a 98% confidence interval for the population mean weight of the candies.
c. In complete sentences, explain why the confidence interval in part (b) is larger than the
confidence interval in part (a).
d. In complete sentences, give an interpretation of what the interval in part (b) means.
9. What is meant by the term “90% confident” when constructing a confidence interval for a mean?
a. If we took repeated samples, approximately 90% of the samples would produce the same
confidence interval.
b. If we took repeated samples, approximately 90% of the confidence intervals calculated from
those samples would contain the sample mean.
c. If we took repeated samples, approximately 90% of the confidence intervals calculated from
those samples would contain the true value of the population mean.
d. If we took repeated samples, the sample mean would equal the population mean in
approximately 90% of the samples.
10. The average height of young adult males has a normal distribution with standard deviation of
2.5 inches. You want to estimate the mean height of students at your college or university to within
one inch with 93% confidence. How many male students must you measure?
11. A hospital is trying to cut down on emergency room wait times. It is interested in the amount
of time patients must wait before being called back to be examined. An investigation committee
randomly surveyed 70 patients. The sample mean was 1.5 hours with a sample standard deviation
of 0.5 hours.
a. Construct a 99% confidence interval for the population mean time spent waiting.
b. Interpret the confidence interval found in part (a).
c. Is it reasonable to claim that the mean time spent waiting is 2 hours? Explain.
12. 108 Americans were surveyed to determine the number of hours they spend watching television
each month. It was revealed that they watched an average of 151 hours each month with a standard
deviation of 32 hours. Assume that the underlying population distribution is normal.
a. Construct a 99% confidence interval for the population mean hours spent watching television
per month.
b. Interpret the confidence interval found in part (a).
c. Why would the error bound change if the confidence level were lowered to 95%?
13. In six packages of “The Flintstones® Real Fruit Snacks” there were five Bam-Bam snack pieces.
The total number of snack pieces in the six bags was 68.
a. Construct a 96% confidence interval for the proportion of Bam-Bam snack pieces per bag.
b. Interpret the confidence interval found in part (a).
14. A random survey of enrollment at 35 community colleges across the United States yielded the
following figures: 6,414; 1,550; 2,109; 9,350; 21,828; 4,300; 5,944; 5,722; 2,825; 2,044; 5,481; 5,200;
5,853; 2,750; 10,012; 6,357; 27,000; 9,414; 7,681; 3,200; 17,500; 9,200; 7,380; 18,314; 6,557; 13,713;
17,768; 7,493; 2,771; 2,861; 1,263; 7,285; 28,165; 5,080; 11,622. Assume the underlying population is
normal.
a. Construct a 95% confidence interval for the mean enrollment at community colleges in the
United States.
b. Interpret the confidence interval found in part (a).
c. Is it reasonable to conclude that the mean enrollment at community colleges in the U.S. is
15,000? Explain.
d. What will happen to the error bound and confidence interval if 500 community colleges were
surveyed? Why?
15. Suppose that a committee is studying whether or not there is waste of time in our judicial
system. It is interested in the mean amount of time individuals waste at the courthouse waiting to
be called for jury duty. The committee randomly surveyed 81 people who recently served as jurors.
The sample mean wait time was eight hours with a sample standard deviation of four hours.
a. Construct a 98% confidence interval for the population mean time wasted.
b. Explain in a complete sentence what the confidence interval means.
16. A pharmaceutical company makes tranquilizers. It is assumed that the distribution for the
length of time they last is approximately normal. Researchers in a hospital used the drug on a
random sample of nine patients. The effective period of the tranquilizer for each patient (in hours)
was as follows: 2.7; 2.8; 3.0; 2.3; 2.3; 2.2; 2.8; 2.1; and 2.4.
a. Construct a 95% confidence interval for the mean length of time the tranquilizers last.
b. What does it mean to be “95% confident” in this problem?
17. Suppose that 14 children, who were learning to ride two-wheel bikes, were surveyed to
determine how long they had to use training wheels. It was revealed that they used them an average
of six months with a sample standard deviation of three months. Assume that the underlying
population distribution is normal.
a. Construct a 99% confidence interval for the mean length of time children use training wheels.
b. Interpret the confidence interval found in part (a).
c. Why would the error bound change if the confidence level were lowered to 90%?
18. The Federal Election Commission (FEC) collects information about campaign contributions
and disbursements for candidates and political committees each election cycle. A political action
committee (PAC) is a committee formed to raise money for candidates and campaigns. A
Leadership PAC is a PAC formed by a federal politician (senator or representative) to raise money to
help other candidates’ campaigns. The FEC has reported financial information for 556 Leadership
PACs that operated during the 2011–2012 election cycle. The following table shows the total
receipts during this cycle for a random selection of 20 Leadership PACs.
a. Construct a 96% confidence interval for the mean amount of money raised by all Leadership
PACs during the 2011–2012 election cycle.
b. Interpret the confidence interval found in part (a).
19. Unoccupied seats on flights cause airlines to lose revenue. Suppose a large airline wants to
estimate its mean number of unoccupied seats per flight over the past year. To accomplish this, the
records of 225 flights are randomly selected and the number of unoccupied seats is noted for each
of the sampled flights. The sample mean is 11.6 seats and the sample standard deviation is 4.1 seats.
a. Construct a 92% confidence interval for the mean number of unoccupied seats per flight.
b. Interpret the confidence interval found in part (a).
c. Is it reasonable for the airlines to claim that the mean number of unoccupied seats per flight
is 20? Explain.
20. In a recent sample of 84 used car sales costs, the sample mean was $6,425 with a standard
deviation of $3,156. Assume the underlying distribution is approximately normal.
a. Construct a 95% confidence interval for the mean cost of a used car.
b. Explain what a “95% confidence interval” means for this study.
21. A survey of the mean number of cents off that coupons give was conducted by randomly
surveying one coupon per page from the coupon sections of a recent San Jose Mercury News. The
following data were collected: 20¢; 75¢; 50¢; 65¢; 30¢; 55¢; 40¢; 40¢; 30¢; 55¢; $1.50; 40¢; 65¢; 40¢.
Assume the underlying distribution is approximately normal.
22. Marketing companies are interested in knowing the percent of women who make the majority
of household purchasing decisions.
a. When designing a study to determine this population proportion, what is the minimum
number you would need to survey to be 90% confident that the population proportion is
estimated to within 0.05?
b. If it were later determined that it was important to be more than 90% confident and a new
survey were commissioned, how would it affect the minimum number you need to survey?
Why?
23. Suppose the marketing company did do a survey. They randomly surveyed 200 households
and found that in 120 of them, the woman made the majority of the purchasing decisions. We
are interested in the population proportion of households where women make the majority of the
purchasing decisions.
a. Construct a 95% confidence interval for the proportion of households where the women make
the majority of the purchasing decisions.
b. Interpret the confidence interval found in part (a).
c. Is it reasonable for the marketing company to claim that women make the majority of
purchasing decisions in 70% of households? Explain.
24. A poll of 1,200 voters asked what the most significant issue was in the upcoming election. Sixty-
five percent answered the economy. We are interested in the population proportion of voters who
feel the economy is the most important.
a. Construct a 90% confidence interval for the proportion of voters who believe the economy is
the most significant issue in the upcoming election.
b. Interpret the confidence interval found in part (a).
c. Is it reasonable to claim that 60% of voters believe the economy is the most significant issue in
the upcoming election? Explain.
d. What would happen to the confidence interval if the level of confidence were 95%?
25. The Ice Chalet offers dozens of different beginning ice-skating classes. All of the class names are
put into a bucket. The 5 P.M., Monday night, ages 8 to 12, beginning ice-skating class was picked. In
that class were 64 girls and 16 boys. Suppose that we are interested in the true proportion of girls,
ages 8 to 12, in all beginning ice-skating classes at the Ice Chalet. Assume that the children in the
selected class are a random sample of the population.
a. Construct a 92% confidence interval for the proportion of girls in the ages 8 to 12 beginning
ice-skating classes at the Ice Chalet.
b. Interpret the confidence interval found in part (a).
26. Insurance companies are interested in knowing the percent of drivers who always buckle up
before riding in a car.
a. When designing a study to determine this population proportion, what is the minimum
number you would need to survey to be 95% confident that the population proportion is
estimated to within 0.03?
b. If it were later determined that it was important to be more than 95% confident and a new
survey were commissioned, how would that affect the minimum number you would need to
survey? Why?
c. Suppose that the insurance companies did do a survey. They randomly surveyed 400 drivers
and found that 320 claimed they always buckle up. We are interested in the population
proportion of drivers who claim they always buckle up. Construct a 95% confidence interval
for the proportion who claim they always buckle up.
27. Stanford University conducted a study of whether running is healthy for men and women over
age 50. During the first eight years of the study, 1.5% of the 451 members of the 50-Plus Fitness
Association died. We are interested in the proportion of people over 50 who ran and died in the
same eight-year period.
a. Construct a 97% confidence interval for the proportion of people over 50 who ran and died in
the same eight-year period.
b. Explain what a “97% confidence interval” means for this study.
28. A telephone poll of 1,000 adult Americans was reported in an issue of Time Magazine. One of
the questions asked was “What is the main problem facing the country?” Twenty percent answered
“crime.” We are interested in the population proportion of adult Americans who feel that crime is
the main problem.
a. Construct a 93% confidence interval for the proportion of adult Americans who feel that
crime is the main problem.
b. Interpret the confidence interval found in part (a).
c. Is it reasonable to claim that 30% of Americans feel crime is the main problem? Explain.
29. According to a Field Poll, 79% of California adults (actual results are 400 out of 506 surveyed)
feel that “education and our schools” is one of the top issues facing California. We wish to construct
a 90% confidence interval for the true proportion of California adults who feel that education and
the schools is one of the top issues facing California.
a. Construct a 90% confidence interval for the proportion of California adults who feel
education and schools is one of the top issues facing California.
b. Interpret the confidence interval found in part (a).
c. Is it reasonable to claim that 90% of California adults feel education and schools is one of the
top issues facing California? Explain.
30. Public Policy Polling recently conducted a survey asking adults across the U.S. about music
preferences. When asked, 80 of the 571 participants admitted that they have illegally downloaded
music.
a. Construct a 99% confidence interval for the proportion of American adults who have illegally
downloaded music.
b. Interpret the confidence interval found in part (a).
c. Without performing any calculations, describe how the confidence interval would change if
the confidence level changed from 99% to 90%.
31. You plan to conduct a survey on your college campus to learn about the political awareness of
students. You want to estimate the true proportion of college students on your campus who voted
in the 2012 presidential election with 95% confidence and a margin of error no greater than five
percent. How many students must you interview?
Attribution
Chapter Outline
You can use a hypothesis test to decide if a dog breeder’s claim that every Dalmatian has 35 spots is
statistically sound. Photo by Robert Neff, CC BY 4.0.
One job of a statistician is to make statistical inferences about populations based on samples
taken from the population. Confidence intervals are one way to estimate a population parameter.
Another way to make a statistical inference is to make a decision about a parameter. For instance,
a car dealer advertises that its new small truck gets an average of 35 miles per gallon. A tutoring
service claims that its method of tutoring helps 90% of its students get an A or a B. A company says
that women managers in their company earn an average of $60,000 per year.
A statistician will make a decision about whether these claims are true or false. This process is
called hypothesis testing. A hypothesis test involves collecting data from a sample and evaluating
the data. From the evidence provided by the sample data, the statistician decides whether
there is sufficient evidence to reject the null hypothesis.
In this chapter, you will conduct hypothesis tests on single population means and single
population proportions. You will also learn about the errors associated with these tests.
Hypothesis testing consists of two contradictory hypotheses, a decision based on the data, and a
conclusion. To perform a hypothesis test, a statistician will:
1. Set up two contradictory hypotheses. Only one of these hypotheses is true and the hypothesis
test will determine which of the hypotheses is most likely true.
2. Collect sample data. (In homework problems, the data or summary statistics will be given to
you.)
3. Determine the correct distribution to perform the hypothesis test.
4. Analyze the sample data by performing calculations that ultimately will allow you to reject or
not reject the null hypothesis.
5. Make a decision and write a meaningful conclusion.
Attribution
LEARNING OBJECTIVES
A hypothesis test begins by considering two hypotheses. They are called the null hypothesis and
the alternative hypothesis. These hypotheses contain opposing viewpoints and only one of these
hypotheses is true. The hypothesis test determines which hypothesis is most likely true.
• The null hypothesis is denoted $H_0$. It is a statement about the population that either is
believed to be true or is used to put forth an argument unless it can be shown to be incorrect
beyond a reasonable doubt.
◦ The null hypothesis is a claim that a population parameter equals some value. For
example, $H_0: \mu = \mu_0$, where $\mu_0$ stands for the claimed value.
• The alternative hypothesis is denoted $H_a$. It is a claim about the population that is
contradictory to the null hypothesis and is what we conclude is true when we reject $H_0$.
◦ The alternative hypothesis is a claim that a population parameter is greater than, less
than, or not equal to some value. For example, $H_a: \mu > \mu_0$, $H_a: \mu < \mu_0$, or
$H_a: \mu \neq \mu_0$. The form of the alternative hypothesis depends on the wording of the
hypothesis test.
◦ An alternative notation for $H_a$ is $H_1$.
Because the null and alternative hypotheses are contradictory, we must examine the evidence to decide
whether it is strong enough to reject the null hypothesis. The
evidence is in the form of sample data. After we have determined which hypothesis the sample
data supports, we make a decision. There are two options for a decision. They are “reject $H_0$”
if the sample information favors the alternative hypothesis or “do not reject $H_0$” if the sample
information is insufficient to reject the null hypothesis.
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=175#oembed-1
Watch this video: Simple hypothesis testing | Probability and Statistics | Khan Academy by Khan Academy [6:24]
EXAMPLE
A candidate in a local election claims that 30% of registered voters voted in a recent election.
Information provided by the returning office suggests that the percentage is higher than the 30%
claimed.
Solution:
The parameter under study is the proportion of registered voters, so we use $p$ in the statements of the
hypotheses. The hypotheses are
\begin{eqnarray*} H_0: & & p=30\% \\ \\ H_a: & & p \gt 30\% \end{eqnarray*}
NOTES
1. The null hypothesis is the claim that the proportion of registered voters that voted equals
30%.
2. The alternative hypothesis is the claim that the proportion of registered voters that voted
is greater than (i.e., higher than) 30%.
TRY IT
A medical researcher believes that a new medicine reduces cholesterol by 25%. A medical trial
suggests that the percent reduction is different than claimed. State the null and alternative
hypotheses.
EXAMPLE
We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of
4.0). State the null and alternative hypotheses.
Solution:
\begin{eqnarray*} H_0: & & \mu=2 \mbox{ points} \\ \\ H_a: & & \mu \neq 2
\mbox{ points} \end{eqnarray*}
EXAMPLE
We want to test whether or not the mean height of eighth graders is 66 inches. State the null and
alternative hypotheses.
Solution:
\begin{eqnarray*} H_0: & & \mu=66 \mbox{ inches} \\ \\ H_a: & & \mu \neq 66
\mbox{ inches} \end{eqnarray*}
EXAMPLE
We want to test if college students take less than five years to graduate from college, on the average.
The null and alternative hypotheses are:
Solution:
\begin{eqnarray*} H_0: & & \mu=5 \mbox{ years} \\ \\ H_a: & & \mu \lt 5
\mbox{ years} \end{eqnarray*}
TRY IT
We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null and alternative
hypotheses.
\begin{eqnarray*} H_0: & & \mu=45 \mbox{ minutes} \\ \\ H_a: & & \mu \lt 45
\mbox{ minutes} \end{eqnarray*}
EXAMPLE
In an issue of U.S. News and World Report, an article on school standards stated that about half of all
students in France, Germany, and Israel take advanced placement exams and a third pass. The same
article stated that 6.6% of U.S. students take advanced placement exams and 4.4% pass. Test if the
percentage of U.S. students who take advanced placement exams is more than 6.6%. State the null
and alternative hypotheses.
Solution:
\begin{eqnarray*} H_0: & & p=6.6\% \\ \\ H_a: & & p \gt 6.6\% \end{eqnarray*}
TRY IT
On a state driver’s test, about 40% pass the test on the first try. We want to test if more than 40% pass
on the first try. State the null and alternative hypotheses.
\begin{eqnarray*} H_0: & & p=40\% \\ \\ H_a: & & p \gt 40\% \end{eqnarray*}
Concept Review
In a hypothesis test, sample data is evaluated in order to arrive at a decision about some type of
claim. If certain conditions about the sample are satisfied, then the claim can be evaluated for a
population. In a hypothesis test, we evaluate the null hypothesis, typically denoted with $H_0$. The
null hypothesis is not rejected unless the hypothesis test shows otherwise. The null hypothesis
always contains an equal sign ($=$). Always write the alternative hypothesis, typically denoted
with $H_a$ or $H_1$, using less than, greater than, or not equals symbols ($<$, $>$, $\neq$). If we reject the null
hypothesis, then we can assume there is enough evidence to support the alternative hypothesis.
But we can never state that a claim is proven true or false. All we can conclude from the hypothesis
test is which of the hypotheses is most likely true. Because the underlying facts about hypothesis
testing are based on probability laws, we can talk only in terms of non-absolute certainties.
Attribution
“9.1 Null and Alternative Hypotheses” in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
8.3 OUTCOMES AND THE TYPE I AND TYPE II ERRORS
LEARNING OBJECTIVES
When we perform a hypothesis test, there are four possible outcomes depending on the actual
truth (or falseness) of the null hypothesis and the decision to reject or not reject the null hypothesis.
Ideally, the hypothesis test should tell us to not reject the null hypothesis when the null hypothesis
is true and reject the null hypothesis when the null hypothesis is false. However, the outcome
of the hypothesis test is based on sample information and probabilities, so there is a chance that
the hypothesis test does not correctly identify the truth or falseness of the null hypothesis. The
outcomes are summarized in the following table:
Decision                $H_0$ is true       $H_0$ is false
Do not reject $H_0$     Correct decision    Type II error
Reject $H_0$            Type I error        Correct decision (Power of the Test)
• The decision is not to reject $H_0$ when $H_0$ is true (correct decision). That is, the test
identifies $H_0$ as true and in reality $H_0$ is true, which means the test correctly identified $H_0$
as true.
• The decision is to reject $H_0$ when $H_0$ is true (incorrect decision known as a Type I error).
That is, the test identifies $H_0$ as false but in reality $H_0$ is true, which means the test did not
correctly identify $H_0$ as true.
• The decision is not to reject $H_0$ when $H_0$ is false (incorrect decision known as a Type II
error). That is, the test identifies $H_0$ as true but in reality $H_0$ is false, which means the test
did not correctly identify $H_0$ as false.
• The decision is to reject $H_0$ when $H_0$ is false (correct decision whose probability is called
the Power of the Test). That is, the test identifies $H_0$ as false and in reality $H_0$ is false,
which means the test correctly identified $H_0$ as false.
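The four outcomes can be written as a small lookup, which may help when working through the examples in this section. This is only a sketch; the function name `classify_outcome` and its two Boolean parameters are hypothetical labels chosen for illustration.

```python
def classify_outcome(null_is_true, reject_null):
    """Name the outcome of a hypothesis test.

    null_is_true: whether the null hypothesis is true in reality.
    reject_null:  whether the test decided to reject the null hypothesis.
    """
    if reject_null and null_is_true:
        return "Type I error"       # rejected a true null hypothesis
    if not reject_null and not null_is_true:
        return "Type II error"      # failed to reject a false null hypothesis
    return "correct decision"       # the decision matches reality


print(classify_outcome(null_is_true=True, reject_null=True))    # Type I error
print(classify_outcome(null_is_true=False, reject_null=False))  # Type II error
```

In practice we never know the value of `null_is_true`, which is why the two errors are described in terms of probabilities.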
There are two types of error that can occur in hypothesis testing. Each of the errors occurs with a
particular probability.
• A Type I error occurs when the null hypothesis is rejected by the test (i.e. the test identifies
the null hypothesis as false) but in reality the null hypothesis is true. The probability of a
Type I error is denoted by $\alpha$.
• A Type II error occurs when the null hypothesis is not rejected by the test (i.e. the test
identifies the null hypothesis as true) but in reality the null hypothesis is false. The
probability of a Type II error is denoted by $\beta$.
Because they are probabilities of errors, the probabilities of a Type I and a Type II error should be
as small as possible. However, they are rarely zero.
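A simulation can make these two probabilities concrete. The sketch below is an illustration under stated assumptions, not a procedure from this text: it assumes a two-sided $z$-test for a population mean with known standard deviation, and the function name `rejection_rate` and all of the numbers (a null mean of 50, a standard deviation of 10, samples of size 25, and a true mean of 55 when the null hypothesis is false) are hypothetical. It draws many samples, runs the test on each, and counts how often the null hypothesis is rejected.

```python
import random
from statistics import NormalDist


def rejection_rate(true_mean, null_mean=50.0, sigma=10.0, n=25,
                   alpha=0.05, trials=10000, seed=1):
    """Fraction of simulated samples in which a two-sided z-test rejects H0."""
    rng = random.Random(seed)                     # fixed seed for repeatability
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    std_error = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (sum(sample) / n - null_mean) / std_error
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


# H0 is true (true mean equals the null mean): every rejection is a Type I error
type_1_rate = rejection_rate(true_mean=50.0)
# H0 is false (true mean is 55): every failure to reject is a Type II error
type_2_rate = 1 - rejection_rate(true_mean=55.0)

print(f"Estimated probability of a Type I error:  {type_1_rate:.3f}")
print(f"Estimated probability of a Type II error: {type_2_rate:.3f}")
```

The estimated Type I error rate should land close to the chosen value of `alpha`, while the Type II error rate depends on how far the true mean is from the null mean and on the sample size. Neither estimate is zero.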
EXAMPLE
• Type I error: Frank thinks his rock climbing equipment is not safe when in fact the
equipment is safe.
• Type II error: Frank thinks his rock climbing equipment is safe when in fact the equipment
is not safe.
Note that, in this case, the error with the greater consequence is the Type II error. If Frank thinks his
rock climbing equipment is safe and it actually is not safe, he will go ahead and use it.
TRY IT
• Type I error: The researcher thinks the blood cultures do contain traces of the pathogen
when, in fact, they do not.
• Type II error: The researcher thinks the blood cultures do not contain traces of the pathogen
when, in fact, they do.
EXAMPLE
• Type I error: The ER staff think that the victim is dead when in fact the victim is alive.
• Type II error: The ER staff think the victim is alive when in fact the victim is dead.
Note that, in this case, the error with the greater consequence is the Type I error. If the ER staff think
the victim is dead, then they will not treat him.
TRY IT
Which type of error has the greater consequence, Type I or Type II? Why?
The error with the greater consequence is the Type II error: the patient will be thought well when, in
fact, they are sick, and so they will not get treatment.
EXAMPLE
A genetics lab claims its product can increase the likelihood a pregnancy will result in a boy being
born. Statisticians want to test this claim. Suppose that the null hypothesis is \(H_0\): the genetics
lab’s product has no effect on gender outcome.
• Type I error: We believe the genetics lab’s product can influence gender outcome when in
fact the product has no effect.
• Type II error: We believe the genetics lab’s product cannot influence gender outcome when
in fact the product does have an effect.
Note that, in this case, the error with the greater consequence is the Type I error because couples
would use the product in hopes of increasing the chances of having a boy.
TRY IT
“Red tide” is a bloom of poison-producing algae—a few different species of a class of plankton called
dinoflagellates. When the weather and water conditions cause these blooms, shellfish such as clams
living in the area develop dangerous levels of a paralysis-inducing toxin. In Massachusetts, the
Division of Marine Fisheries (DMF) monitors levels of the toxin in shellfish by regularly sampling
shellfish along the coastline. If the mean level of toxin in clams exceeds 800 μg (micrograms) of toxin
per kg of clam meat in any area, clam harvesting is banned there until the bloom is over and levels of
toxin in clams subside. Describe both a Type I and a Type II error in this context, and state which
error has the greater consequence.
• Type I error: The DMF believes that toxin levels are still too high when, in fact, toxin levels
are at most 800 μg. The DMF continues the harvesting ban.
• Type II error: The DMF believes that toxin levels are within acceptable levels (are at most 800
μg) when, in fact, toxin levels are still too high (more than 800 μg). The DMF lifts the
harvesting ban. This error could be the most serious. If the ban is lifted and clams are still
toxic, consumers could possibly eat tainted food.
In summary, the more dangerous error would be to commit a Type II error, because this error
involves the availability of tainted clams for consumption.
EXAMPLE
A certain experimental drug claims a cure rate of at least 75% for males with prostate cancer.
Describe both the Type I and Type II errors in context. Which error is more serious?
• Type I: A cancer patient believes the cure rate for the drug is less than 75% when it actually is
at least 75%.
• Type II: A cancer patient believes the experimental drug has at least a 75% cure rate when it
has a cure rate that is less than 75%.
In this scenario, the Type II error contains the more severe consequence. If a patient believes the
drug works at least 75% of the time, this will most likely influence the patient’s (and doctor’s) choice
about whether to use the drug as a treatment option.
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=177#oembed-1
Watch this video: Type 1 errors | Inferential statistics | Probability and Statistics | Khan Academy by Khan Academy [3:23]
Concept Review
In every hypothesis test, the outcomes from the test are dependent on sample data and probabilities,
which means that the conclusion of the test may not correctly identify the actual truth state of the
null hypothesis. Such occurrences are expected. A Type I error occurs when a true null hypothesis
is rejected. A Type II error occurs when a false null hypothesis is not rejected.
Attribution
“9.2 Outcomes and the Type I and Type II Errors“ in Introductory Statistics by OpenStax is licensed
under a Creative Commons Attribution 4.0 International License.
8.4 DISTRIBUTIONS REQUIRED FOR A
HYPOTHESIS TEST
LEARNING OBJECTIVES
Earlier in the course, we discussed sampling distributions: the sampling distribution of the sample
mean and the sampling distribution of the sample proportion. These distributions play a role in
hypothesis testing.
If the hypothesis test is on a population mean, we use the distribution of the sample means
in the hypothesis test. As we learned previously, the distribution of the sample means follows a
normal distribution if the population the sample is taken from is normal or if the sample size is
large enough (\(n \geq 30\)). For a hypothesis test on a population mean we use a normal distribution
when the population standard deviation is known or a \(t\)-distribution when the population standard
deviation is unknown.
If the hypothesis test is on a population proportion, we use the distribution of the sample
proportions in the hypothesis test. As we learned previously, the distribution of the sample
proportions follows a normal distribution if \(n \times p \geq 5\) and \(n \times (1-p) \geq 5\) or a binomial
distribution if one of \(n \times p \lt 5\) or \(n \times (1-p) \lt 5\). For a hypothesis test on a population
proportion we use either a normal distribution or a binomial distribution, depending on which of
the above conditions is met.
Assumptions
When we perform a hypothesis test of a single population mean and the population standard
deviation is known, we take a simple random sample from the population. We use a normal
distribution, assuming the population is normal or the sample size is large enough (\(n \geq 30\)). The
\(z\)-score we need is the \(z\)-score from the distribution of the sample means: \(z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}\).
When we perform a hypothesis test of a single population mean and the population
standard deviation is unknown, we take a simple random sample from the population. We use a
\(t\)-distribution, assuming the population is normal or the sample size is large enough (\(n \geq 30\)). We
use the sample standard deviation to approximate the population standard deviation. The \(t\)-score
we need is: \(t = \frac{\bar{x} - \mu}{s / \sqrt{n}}\).
or .
Concept Review
In order for a hypothesis test’s results to be generalized to a population, certain requirements must
be satisfied.
Testing a population mean:
• Population standard deviation is known: use a normal distribution, assuming the population
is normal or \(n \geq 30\).
• Population standard deviation is unknown: use a \(t\)-distribution, assuming the population is
normal or \(n \geq 30\).
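The two rules above can be sketched as a small Python helper. This is only an illustration, not part of the Excel workflow used in this book: the function name `choose_distribution` is hypothetical, and the cut-off of 30 for a "large enough" sample is an assumption based on the usual rule of thumb.

```python
def choose_distribution(sigma_known: bool, population_normal: bool, n: int) -> str:
    """Suggest the distribution for a hypothesis test on a population mean."""
    # Both cases require a normal population or a large enough sample.
    # The threshold n >= 30 is an assumed rule of thumb.
    if not (population_normal or n >= 30):
        return "conditions not met"
    # Known population standard deviation -> normal; unknown -> t.
    return "normal" if sigma_known else "t"


print(choose_distribution(sigma_known=True, population_normal=True, n=10))
print(choose_distribution(sigma_known=False, population_normal=False, n=50))
```

A small sample from a non-normal population returns "conditions not met", which signals that neither distribution should be used without further checking.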
Attribution
“9.3 Distribution Needed for Hypothesis Testing“ in Introductory Statistics by OpenStax is licensed
under a Creative Commons Attribution 4.0 International License.
8.5 RARE EVENTS, THE SAMPLE, DECISION,
AND CONCLUSION
LEARNING OBJECTIVES
• Define a rare event and identify how a rare event is used in a hypothesis test.
• Define p-value and significance level and identify how they are used in determining the
outcome of a hypothesis test.
Establishing the type of distribution, sample size, and known or unknown population standard
deviation can help us figure out how to go about a hypothesis test. However, there are several other
factors we should consider when working out a hypothesis test.
Rare Events
Suppose we make an assumption about the value of a population parameter (this assumption is
the null hypothesis). We conduct the hypothesis test under the assumption that the null hypothesis
is true. Then we randomly select a sample from the population. If the sample has properties that
would be very unlikely to occur under the assumption the null hypothesis is true, then we would
conclude that our assumption about the population is probably incorrect. Remember that our
assumption is just an assumption—it is not a fact, and it may or may not be true. But the sample
data we collect is real and the information from that sample is a fact that may or may not support
the assumption we make about the null hypothesis.
For example, Didi and Ali are at the birthday party of a very wealthy friend. They hurry to be
first in line to grab a prize from a tall basket that they cannot see inside of because they will be
blindfolded. There are 200 plastic bubbles in the basket and Didi and Ali have been told that there
is only one with a $100 bill. Didi is the first person to reach into the basket and pull out a bubble.
Her bubble contains a $100 bill. The probability of this happening is \(\frac{1}{200} = 0.005\). Because this
is such an unlikely occurrence, Ali is hoping that what the two of them were told is wrong and
there are more $100 bills in the basket. In this case a “rare event” has occurred (Didi getting the
$100 bill) so Ali doubts the original assumption about only one $100 bill being in the basket.
A rare event is something we consider to be unlikely to happen (i.e. the probability of that
event happening is very small). This is what we are looking for in a hypothesis test. We want
to determine if the sample collected for the test is a rare event (unlikely to happen) under the
assumption the null hypothesis is true. To determine if the sample is a rare event, we calculate the
probability of the sample occurring, assuming that the null hypothesis is true. If the probability
of the sample occurring is small, then the sample is a “rare event” and unlikely to occur under
the assumption the null hypothesis is true. In such a case we would conclude that the original
assumption that the null hypothesis is true is probably incorrect, and so we would reject the null
hypothesis. If the probability of the sample occurring is not small, then the sample is not a “rare
event” and is actually likely to occur under the assumption the null hypothesis is true. In this case
the sample gives us no reason to doubt the original assumption that the null hypothesis is true, and so
we would not reject the null hypothesis.
Remember, a rare event is an event that is unlikely to happen. But unlikely does not mean
impossible. The probability of a rare event is very small, which means that the chance of it
happening is very small. But as long as the probability is not zero, there is still a possibility the
event could happen.
We use the sample data to calculate the actual probability of getting the selected sample, called the
p-value, under the assumption the null hypothesis is true. The p-value is the probability that, if
the null hypothesis is true, the results from another randomly selected sample will be at least as
extreme as the results obtained from the given sample.
A large p-value calculated from the sample data indicates that we should not reject the null
hypothesis. The smaller the p-value, the more unlikely the outcome, and the stronger the evidence
is against the null hypothesis. We would reject the null hypothesis if the evidence is strongly
against it.
EXAMPLE
The customers of a local bakery claim that the height of the bakery’s bread is, on average, 15 cm. The
baker believes his customers are wrong, and that the average height of the bread is more than 15 cm.
To persuade his customers that he is right, the baker decides to do a hypothesis test. He bakes 10
loaves of bread. The mean height of the sample loaves is 17 cm. The baker knows from baking
hundreds of loaves of bread that the standard deviation for the height is 1 cm and the distribution
of the heights is normal. Based on this sample, who is right: the customers or the baker?
Solution:
Here, the population under study is the height of the loaves of bread and \(\mu\) is the average height of
the loaves of bread.
The customers’ claim is the null hypothesis: \(\mu = 15\). The alternative hypothesis is the baker’s
claim: \(\mu \gt 15\). In mathematical notation, the hypotheses are
\begin{eqnarray*} H_0: & & \mu=15 \mbox{ cm} \\ H_a: & & \mu \gt 15 \mbox{ cm} \end{eqnarray*}
Because the population standard deviation is known (\(\sigma = 1\)), the distribution we would use is the
normal distribution.
Suppose the null hypothesis is true. That is, suppose \(\mu = 15\). Under this assumption, we have to
ask if the sample mean of 17 is likely or unlikely to occur. The hypothesis test works by asking the
question how unlikely is this sample mean if the null hypothesis is true. The graph shows how far
out the sample mean is on the normal curve. The p-value is the probability that, if we took another
sample of size 10, any other sample mean would fall at least as far out as 17 cm.
The p-value is the probability that a sample mean is the same or greater than 17 cm when the
population mean is, in fact, 15 cm. We can calculate this probability using the normal distribution
for sample means. In fact, we are calculating the probability that in a sample of size 10 the sample
mean is greater than 17. We learned how to calculate this type of probability when we learned about
the sampling distribution of the sample mean:
Function  1-norm.dist    Answer
Field 1   17             0.0000000001
Field 2   15
Field 3   1/sqrt(10)
Field 4   true
So p-value=0.0000000001, which tells us the probability of selecting a sample of size 10 and getting a
sample mean greater than 17 is 0.0000000001 under the assumption that the null hypothesis is true
(\(\mu = 15\)). This is a very, very small probability, which tells us that a sample mean of 17 is unlikely
to happen if the population mean is 15 cm. Because the sample mean of 17 is so unlikely (meaning it
is not happening by chance alone), we conclude that the assumption that the mean is 15 cm is
wrong. That is, the evidence provided by the sample is strongly against the claim of the null
hypothesis. So we reject the null hypothesis in favour of the alternative hypothesis. That is, based on
the test, we believe the null hypothesis is false and the alternative hypothesis is true. So there is
enough evidence to suggest that the average height of the loaves of bread is greater than 15 cm.
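For readers who want to check the bakery calculation outside Excel, here is a minimal Python sketch. It assumes Python's standard-library `statistics.NormalDist` as a stand-in for norm.dist with the logic operator set to true; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 15.0    # population mean claimed by the null hypothesis (cm)
sigma = 1.0    # known population standard deviation (cm)
n = 10         # number of loaves in the sample
x_bar = 17.0   # sample mean height (cm)

# Distribution of the sample means, assuming the null hypothesis is true.
sampling_dist = NormalDist(mu=mu_0, sigma=sigma / sqrt(n))

# Right-tail area: the probability of a sample mean of 17 cm or more.
p_value = 1 - sampling_dist.cdf(x_bar)
print(f"p-value = {p_value:.10f}")
```

The cumulative probability plays the role of norm.dist, and subtracting it from 1 gives the area in the right tail.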
TRY IT
A normal distribution has a standard deviation of 1. The original claim is that the mean of the
distribution is 12. An alternative claim is that the mean is greater than 12. The hypotheses are:
\begin{eqnarray*} H_0: & & \mu=12 \\ H_a: & & \mu \gt 12 \end{eqnarray*}
Field 2 12
Field 3 1/sqrt(36)
Field 4 true
p-value=0.0013
A systematic way to make a decision of whether to reject or not reject the null hypothesis is to
compare the p-value and a preset or preconceived value called the significance level, denoted by \(\alpha\).
The significance level is the cut-off value for likely versus unlikely when compared to the p-value.
When the p-value is greater than the significance level, the sample is likely to occur under the
assumption the null hypothesis is true, and so we would fail to reject the null hypothesis. When
the p-value is less than or equal to the significance level, the sample is unlikely to occur under the
assumption the null hypothesis is true, and so we would reject the null hypothesis in favour of the
alternative hypothesis. The preset value of \(\alpha\) is the probability of a Type I error (rejecting the null
hypothesis when the null hypothesis is true). The significance level may or may not be given at the
beginning of the problem.
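When we make a decision to reject or not reject \(H_0\), do as follows:
• If p-value \(\leq \alpha\), reject \(H_0\). The sample data provide sufficient evidence against \(H_0\).
• If p-value \(\gt \alpha\), do not reject \(H_0\). The sample data do not provide sufficient evidence
against \(H_0\).
When we “do not reject \(H_0\),” it does not mean that we should believe that \(H_0\) is true. It simply
means that the sample data have failed to provide sufficient evidence to cast serious doubt about
the truthfulness of \(H_0\).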
After comparing the p-value and significance level, write a thoughtful conclusion about the
hypotheses in terms of the given problem.
TRY IT
A genetics lab claims its product can increase the likelihood a pregnancy will result in a boy being
born. Statisticians want to test this claim. Suppose the hypotheses are
\begin{eqnarray*} H_0: & & p=50\% \\ H_a: & & p \gt 50\% \end{eqnarray*}
After conducting the hypothesis test, p-value=0.025. If the significance level is 1%, what is the
conclusion of the test?
Because the p-value is greater than the significance level (p-value \(= 0.025 \gt 0.01 = \alpha\)), we do not
reject the null hypothesis. There is not enough evidence to support the lab’s stated claim that their
procedures improve the chances of a boy being born.
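The comparison of the p-value with the significance level can be written as a short Python sketch. The function name `decide` is hypothetical and is used only to illustrate the rule described in this section.

```python
def decide(p_value: float, alpha: float) -> str:
    """Compare the p-value with the significance level alpha."""
    # Reject when the p-value is less than or equal to alpha.
    if p_value <= alpha:
        return "reject H0"
    return "do not reject H0"


# Genetics lab example: p-value = 0.025 at the 1% significance level.
print(decide(0.025, 0.01))
# The same p-value at a 5% significance level leads to the opposite outcome.
print(decide(0.025, 0.05))
```

The second call shows why the significance level should be fixed before looking at the data: the same sample can lead to different outcomes at different levels.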
Concept Review
A rare event is an event that is unlikely to occur. The probability of a rare event happening is very
small. In a hypothesis test, we want to determine if the collected sample is a rare event. The p-value
is the probability, assuming the null hypothesis is true, of getting a sample at least as extreme as
the one collected.
In a hypothesis test, the significance level is the cut-off value for likely versus unlikely. The
significance level is compared to the p-value for the test.
After determining the outcome of the test, we write a conclusion based on the specific context of the
question.
Attribution
“9.4 Rare Events, the Sample, Decision and Conclusion“ in Introductory Statistics by OpenStax is
licensed under a Creative Commons Attribution 4.0 International License.
8.6 HYPOTHESIS TESTS FOR A POPULATION
MEAN WITH KNOWN POPULATION
STANDARD DEVIATION
LEARNING OBJECTIVES
• Conduct and interpret hypothesis tests for a population mean with known population
standard deviation.
• The null hypothesis is always an “equal to.” The null hypothesis is the original claim
about the population parameter.
• The alternative hypothesis is a “less than,” “greater than,” or “not equal to.” The form of
the alternative hypothesis depends on the context of the question.
• The form of the alternative hypothesis tells us if the test is left-tail, right-tail, or two-tail. The
alternative hypothesis is the key to conducting the test and finding the correct p-value.
◦ If the alternative hypothesis is a “less than”, then the test is left-tail. The p-value is the
area in the left-tail of the distribution.
◦ If the alternative hypothesis is a “greater than”, then the test is right-tail. The p-value is
the area in the right-tail of the distribution.
◦ If the alternative hypothesis is a “not equal to”, then the test is two-tail. The p-value is
the sum of the area in the two-tails of the distribution. Each tail represents exactly half
of the p-value.
• Think about the meaning of the p-value. A data analyst (and anyone else) should have
more confidence that they made the correct decision to reject the null hypothesis with a
smaller p-value (for example, 0.001 as opposed to 0.04) even if using a significance level of
0.05. Similarly, for a large p-value such as 0.4, as opposed to a p-value of 0.056 (a significance
level of 0.05 is less than either number), a data analyst should have more confidence that they
made the correct decision in not rejecting the null hypothesis. This makes the data analyst
use judgment rather than mindlessly applying rules.
• The significance level must be identified before collecting the sample data and conducting the
test. Generally, the significance level will be included in the question. If no significance level
is given, a common standard is to use a significance level of 5%.
• An alternative approach for hypothesis testing is to use what is called the critical value
approach. In this book, we will only use the p-value approach. Some of the videos below
may mention the critical value approach, but this approach will not be used in this book.
EXAMPLE
Because the alternative hypothesis is a \(\lt\), this is a left-tailed test. The p-value is the area in the
left-tail of the distribution.
EXAMPLE
\begin{eqnarray*} H_0: & & \mu=0.5 \\ H_a: & & \mu \neq 0.5 \end{eqnarray*}
Because the alternative hypothesis is a \(\neq\), this is a two-tailed test. The p-value is the sum of the areas
in the two tails of the distribution. Each tail contains exactly half of the p-value.
EXAMPLE
\begin{eqnarray*} H_0: & & \mu=10 \\ H_a: & & \mu \lt 10 \end{eqnarray*}
Because the alternative hypothesis is a \(\lt\), this is a left-tailed test. The p-value is the area in the
left-tail of the distribution.
1. Write down the null and alternative hypotheses in terms of the population mean \(\mu\). Include
appropriate units with the values of the mean.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level \(\alpha\).
4. When the population standard deviation is known, we use a normal distribution with
\(z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}\) to find the p-value. The p-value is the area in the corresponding tail of the
normal distribution.
5. Compare the p-value to the significance level and state the outcome of the test:
◦ If p-value \(\leq \alpha\), reject \(H_0\) in favour of \(H_a\).
◦ If p-value \(\gt \alpha\), do not reject \(H_0\).
6. Write down a concluding sentence specific to the context of the question.
The p-value for a hypothesis test on a population mean is the area in the tail(s) of the distribution of
the sample mean. When the population standard deviation is known, use the normal distribution to
find the p-value.
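The p-value is the area in the tail(s) of a normal distribution, so the norm.dist(x, \(\mu\), \(\sigma\), logic
operator) function can be used to calculate the p-value.
• For x, enter the value of the sample mean \(\bar{x}\).
• For \(\mu\), enter the value of the population mean from the null hypothesis.
• For \(\sigma\), enter the standard deviation of the sample means, \(\frac{\sigma}{\sqrt{n}}\).
• For the logic operator, enter true. Note: Because we are calculating the area under the
curve, we always enter true for the logic operator.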
Use the appropriate technique with the norm.dist function to find the area in the left-tail or the area
in the right-tail.
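As a rough illustration of how the tail areas relate to one another, the following Python sketch computes the p-value for a left-tail, right-tail, or two-tail test. It assumes `statistics.NormalDist` as a substitute for norm.dist; the function name `z_test_p_value` and the numbers in the usage lines are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(x_bar: float, mu_0: float, sigma: float, n: int, tail: str) -> float:
    """p-value for a test on a mean with known population standard deviation."""
    # Distribution of the sample means under the null hypothesis.
    dist = NormalDist(mu=mu_0, sigma=sigma / sqrt(n))
    left = dist.cdf(x_bar)   # same role as norm.dist(x_bar, mu_0, sigma/sqrt(n), true)
    right = 1 - left         # same role as 1-norm.dist(...)
    if tail == "left":
        return left
    if tail == "right":
        return right
    if tail == "two":
        # The sample falls in the smaller tail; the p-value doubles that area.
        return 2 * min(left, right)
    raise ValueError("tail must be 'left', 'right', or 'two'")


# Hypothetical numbers: null mean 100, sigma 15, n = 36, sample mean 96.
print(round(z_test_p_value(96, 100, 15, 36, "left"), 4))
print(round(z_test_p_value(96, 100, 15, 36, "two"), 4))
```

Taking the smaller of the two areas in the two-tail case is equivalent to first deciding which tail the sample mean belongs to.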
EXAMPLE
Jeffrey, as an eight-year old, established a mean time of 16.43 seconds with a standard deviation of
0.8 seconds for swimming the 25-meter freestyle. His dad, Frank, thought that Jeffrey could swim
the 25-meter freestyle faster using goggles. Frank bought Jeffrey a new pair of goggles and timed
Jeffrey swimming the 25-meter freestyle 15 different times. In the sample of 15 swims, Jeffrey’s
mean time was 16 seconds. Frank thought that the goggles helped Jeffrey swim faster than 16.43
seconds. At the 5% significance level, did Jeffrey swim faster wearing the goggles? Assume that the
swim times for the 25-meter freestyle are normally distributed.
Solution:
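Hypotheses:
\begin{eqnarray*} H_0: & & \mu=16.43 \mbox{ seconds} \\ H_a: & & \mu \lt 16.43 \mbox{ seconds} \end{eqnarray*}
p-value:
This is a test on a population mean where the population standard deviation is known (\(\sigma = 0.8\)).
So we use a normal distribution to calculate the p-value. Because the alternative hypothesis is a \(\lt\),
the p-value is the area in the left-tail of the distribution.
Function  norm.dist      Answer
Field 1   16             0.0187
Field 2   16.43
Field 3   0.8/sqrt(15)
Field 4   true
So the p-value \(= 0.0187\).
Conclusion:
Because p-value \(= 0.0187 \lt 0.05 = \alpha\), we reject the null hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that Jeffrey’s mean swim
time with the goggles is less than 16.43 seconds.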
NOTES
1. The null hypothesis \(\mu = 16.43\) is the claim that Jeffrey’s mean swim time with the goggles
is 16.43 seconds (the same as it is without the goggles).
2. The alternative hypothesis \(\mu \lt 16.43\) is the claim that Jeffrey’s swim time with the
goggles is less than 16.43 seconds.
3. The p-value is the area in the left tail of the sampling distribution, to the left of \(\bar{x} = 16\). In
the calculation of the p-value:
◦ The function is norm.dist because we are finding the area in the left tail of a normal
distribution.
◦ Field 1 is the value of \(\bar{x}\).
◦ Field 2 is the value of \(\mu\) from the null hypothesis. Remember, we run the test
assuming the null hypothesis is true, so that means we assume \(\mu = 16.43\).
◦ Field 3 is the standard deviation for the sample means \(\frac{\sigma}{\sqrt{n}}\). Note that we are not
using the standard deviation from the population (\(\sigma = 0.8\)). This is because the
p-value is the area under the curve of the distribution of the sample means, not the
distribution of the population.
4. The p-value of 0.0187 tells us that under the assumption that Jeffrey’s mean swim time with
goggles is 16.43 seconds (the null hypothesis), there is only a 1.87% chance that the mean
time for the 15 sample swims is 16 seconds or less. This is a small probability, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis.
5. The Type I error for this problem is to conclude that Jeffrey swims the 25-meter freestyle, on
average, in less than 16.43 seconds (the alternative hypothesis) when, in fact, he actually
swims the 25-meter freestyle, on average, in 16.43 seconds (the null hypothesis). That is,
reject the null hypothesis when the null hypothesis is actually true.
6. The Type II error for this problem is to conclude that Jeffrey swims the 25-meter freestyle, on
average, in 16.43 seconds (the null hypothesis) when, in fact, he actually swims the 25-meter
freestyle, on average, in less than 16.43 seconds (the alternative hypothesis). That is, do not
reject the null hypothesis when the null hypothesis is actually false.
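To see what the Type I and Type II errors in notes 5 and 6 look like in practice, the following Python sketch simulates many repetitions of Jeffrey's 15 swims. Everything beyond the example's own numbers is an assumption made for illustration: the function name `rejection_rate`, the number of trials, the random seed, and especially the "true" mean of 16.0 seconds used for the Type II case, which is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_mean, mu_0=16.43, sigma=0.8, n=15,
                   alpha=0.05, trials=20000, seed=1):
    """Fraction of simulated left-tail tests that reject the null hypothesis."""
    rng = random.Random(seed)
    null_dist = NormalDist(mu_0, sigma / sqrt(n))
    rejections = 0
    for _ in range(trials):
        # Simulate n swim times from a normal population with the given true mean.
        sample_mean = fmean(rng.gauss(true_mean, sigma) for _ in range(n))
        p_value = null_dist.cdf(sample_mean)   # left-tail p-value
        if p_value <= alpha:
            rejections += 1
    return rejections / trials


# Null hypothesis true: every rejection is a Type I error.
type_1_rate = rejection_rate(true_mean=16.43)
# Null hypothesis false (hypothetical true mean of 16.0 s): every
# failure to reject is a Type II error.
type_2_rate = 1 - rejection_rate(true_mean=16.0)
print(f"Estimated Type I error rate:  {type_1_rate:.3f}")
print(f"Estimated Type II error rate: {type_2_rate:.3f}")
```

The estimated Type I error rate should land close to the 5% significance level, which matches the statement that the significance level is the probability of a Type I error. The Type II error rate depends on the assumed true mean and would change with a different choice.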
TRY IT
The mean throwing distance of a football for Marco, a high school freshman quarterback, is 40 yards
with a standard deviation of 2 yards. The team coach tells Marco to adjust his grip to get more
distance. The coach records the distances for 20 throws with the new grip. For the 20 throws,
Marco’s mean distance was 41.5 yards. The coach thought the different grip helped Marco throw
farther than 40 yards. At the 5% significance level, is Marco’s mean throwing distance higher with
the new grip? Assume the throw distances for footballs are normally distributed.
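Hypotheses:
\begin{eqnarray*} H_0: & & \mu=40 \mbox{ yards} \\ H_a: & & \mu \gt 40 \mbox{ yards} \end{eqnarray*}
p-value:
This is a test on a population mean where the population standard deviation is known (\(\sigma = 2\)). So
we use a normal distribution to calculate the p-value. Because the alternative hypothesis is a \(\gt\), the
p-value is the area in the right-tail of the distribution.
Function  1-norm.dist    Answer
Field 1   41.5           0.0004
Field 2   40
Field 3   2/sqrt(20)
Field 4   true
So the p-value \(= 0.0004\).
Conclusion:
Because p-value \(= 0.0004 \lt 0.05 = \alpha\), we reject the null hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that Marco’s mean
throwing distance with the new grip is greater than 40 yards.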
NOTES
1. The null hypothesis \(\mu = 40\) is the claim that Marco’s mean throwing distance with the new
grip is 40 yards (the same as it is without the new grip).
2. The alternative hypothesis \(\mu \gt 40\) is the claim that Marco’s mean throwing distance with
the new grip is greater than 40 yards.
3. The p-value is the area in the right tail of the normal distribution. To calculate the area in the
right tail, we use 1-norm.dist, because norm.dist gives the area to the left of \(\bar{x}\) and the total
area under the curve is 1.
4. The p-value of 0.0004 tells us that under the assumption that Marco’s mean throwing distance
with the new grip is 40 yards, there is only a 0.04% chance that the mean throwing distance
for the 20 sample throws is 41.5 yards or more. This is a small probability, and so is unlikely
to happen assuming the null hypothesis is true. This suggests that the assumption that the
null hypothesis is true is most likely incorrect, and so the conclusion of the test is to reject the
null hypothesis in favour of the alternative hypothesis.
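The p-value in Marco's example can also be reached by first converting the sample mean to a \(z\)-score and then using the standard normal distribution. The Python sketch below, which assumes `statistics.NormalDist` in place of Excel's functions, shows that both routes give the same right-tail area; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n, x_bar = 40.0, 2.0, 20, 41.5
standard_error = sigma / sqrt(n)   # standard deviation of the sample means

# Route 1: right-tail area on the distribution of the sample means.
p_direct = 1 - NormalDist(mu_0, standard_error).cdf(x_bar)

# Route 2: standardize the sample mean, then use the standard normal.
z = (x_bar - mu_0) / standard_error
p_from_z = 1 - NormalDist().cdf(z)

print(f"z = {z:.3f}")
print(f"p-value (direct) = {p_direct:.4f}")
print(f"p-value (from z) = {p_from_z:.4f}")
```

Either route is acceptable. Entering the standard deviation of the sample means directly, as this book does, avoids the separate standardization step.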
EXAMPLE
A local college states in its marketing materials that the average age of its first-year students is 18.3
years with a standard deviation of 3.4 years. But this information is based on old data and does not
take into account that more older adults are returning to college. A researcher at the college believes
that the average age of its first-year students has changed. The researcher takes a sample of 50 first-
year students and finds the average age is 19.5 years. At the 1% significance level, has the average age
of the college’s first-year students changed?
Solution:
Hypotheses:
\begin{eqnarray*} H_0: & & \mu=18.3 \mbox{ years} \\ H_a: & & \mu \neq 18.3 \mbox{ years} \end{eqnarray*}
p-value:
This is a test on a population mean where the population standard deviation is known (\(\sigma = 3.4\)).
In this case, the sample size is greater than 30. So we use a normal distribution to calculate the
p-value. Because the alternative hypothesis is a \(\neq\), the p-value is the sum of the areas in the tails of the
distribution.
Because there is only one sample, we only have information relating to one of the two tails, either the
left tail or the right tail. We need to know if the sample relates to the left tail or right tail because that
will determine how we calculate the area of that tail using the normal distribution. In this case,
the sample mean is greater than the value of the population mean in the null hypothesis
(\(\bar{x} = 19.5 \gt 18.3 = \mu\)), so the sample information relates to the right-tail of the
normal distribution. This means that we will calculate the area in the right tail using
1-norm.dist. However, this is a two-tailed test where the p-value is the sum of the area in the two
tails and the area in the right-tail is only one half of the p-value. The area in the left tail equals the
area in the right tail and the p-value is the sum of these two areas.
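Function  1-norm.dist    Answer
Field 1   19.5           0.0063
Field 2   18.3
Field 3   3.4/sqrt(50)
Field 4   true
So the area in the right tail is 0.0063 and \(\frac{1}{2}\)(p-value) \(= 0.0063\). This is also the area in the left tail,
so
p-value \(= 0.0063 + 0.0063 = 0.0126\)
Conclusion:
Because p-value \(= 0.0126 \gt 0.01 = \alpha\), we do not reject the null hypothesis. At the 1% significance
level there is not enough evidence to suggest that the average age of the college’s first-year
students has changed.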
NOTES
1. The null hypothesis \(\mu = 18.3\) is the claim that the average age of the first-year students is
still 18.3 years.
2. The alternative hypothesis \(\mu \neq 18.3\) is the claim that the average age of the first-year
students has changed from 18.3 years.
3. In a two-tailed hypothesis test that uses the normal distribution, we will only have sample
information relating to one of the two tails. We must determine which of the tails the sample
information belongs to, and then calculate the area in that tail. The area in each tail
represents exactly half of the p-value, so the p-value is the sum of the areas in the two tails.
◦ If the sample mean is less than the population mean in the null hypothesis
(\(\bar{x} \lt \mu\)), then the sample information belongs to the left tail.
▪ We use norm.dist(\(\bar{x}\), \(\mu\), \(\frac{\sigma}{\sqrt{n}}\), true) to find the area in the left tail. The
area in the right tail equals the area in the left tail, so we can find the p-value by
adding the output from this function to itself.
◦ If the sample mean is greater than the population mean in the null hypothesis
(\(\bar{x} \gt \mu\)), then the sample information belongs to the right tail.
▪ We use 1-norm.dist(\(\bar{x}\), \(\mu\), \(\frac{\sigma}{\sqrt{n}}\), true) to find the area in the right tail. The
area in the left tail equals the area in the right tail, so we can find the p-value by
adding the output from this calculation to itself.
4. The p-value of 0.0126 is greater than the 1% significance level, so a sample mean this far from
18.3 years is not unusual enough to count as a rare event under the assumption that the null
hypothesis is true. The sample does not provide enough evidence to doubt that assumption,
and so the conclusion of the test is to not reject the null hypothesis. In other words, there is
not enough evidence to conclude that the average age of first-year students has changed from
18.3 years.
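The two-tailed calculation for the college example can be sketched in Python as follows. It assumes `statistics.NormalDist` as a stand-in for norm.dist, and the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n, x_bar, alpha = 18.3, 3.4, 50, 19.5, 0.01
dist = NormalDist(mu_0, sigma / sqrt(n))   # distribution of the sample means

# Decide which tail the sample mean belongs to, then find that tail's area.
if x_bar < mu_0:
    one_tail = dist.cdf(x_bar)        # left tail: norm.dist
else:
    one_tail = 1 - dist.cdf(x_bar)    # right tail: 1-norm.dist

# Each tail is half of the p-value, so add the tail area to itself.
p_value = one_tail + one_tail
decision = "reject H0" if p_value <= alpha else "do not reject H0"

print(f"area in one tail = {one_tail:.4f}")
print(f"p-value = {p_value:.4f}")
print(decision)
```

Note that the same p-value would lead to rejecting the null hypothesis at a 5% significance level, so the outcome here depends on the stricter 1% level stated in the question.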
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=191#oembed-1
Watch this video: Hypothesis Testing: z-test, right tail by ExcelIsFun [33:47]
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=191#oembed-2
Watch this video: Hypothesis Testing: z-test, left tail by ExcelIsFun [10:57]
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=191#oembed-3
Watch this video: Hypothesis Testing: z-test, two tail by ExcelIsFun [9:56]
Concept Review
1. Write down the null and alternative hypotheses in terms of the population mean \(\mu\). Include
appropriate units with the values of the mean.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level.
4. When the population standard deviation is known, find the p-value (the area in the
corresponding tail) for the test using the normal distribution.
5. Compare the p-value to the significance level and state the outcome of the test.
6. Write down a concluding sentence specific to the context of the question.
Attribution
“9.6 Hypothesis Testing of a Single Mean and Single Proportion“ in Introductory Statistics by
OpenStax is licensed under a Creative Commons Attribution 4.0 International License.
8.7 HYPOTHESIS TESTS FOR A POPULATION
MEAN WITH UNKNOWN POPULATION
STANDARD DEVIATION
LEARNING OBJECTIVES
• Conduct and interpret hypothesis tests for a population mean with unknown population
standard deviation.
• The null hypothesis is always an “equal to.” The null hypothesis is the original claim
about the population parameter.
• The alternative hypothesis is a “less than,” “greater than,” or “not equal to.” The form of
the alternative hypothesis depends on the context of the question.
• The form of the alternative hypothesis tells us if the test is left-tail, right-tail, or two-tail. The
alternative hypothesis is the key to conducting the test and finding the correct p-value.
◦ If the alternative hypothesis is a “less than”, then the test is left-tail. The p-value is the
area in the left-tail of the distribution.
◦ If the alternative hypothesis is a “greater than”, then the test is right-tail. The p-value is
the area in the right-tail of the distribution.
◦ If the alternative hypothesis is a “not equal to”, then the test is two-tail. The p-value is
the sum of the area in the two-tails of the distribution. Each tail represents exactly half
of the p-value.
• Think about the meaning of the p-value. A data analyst (and anyone else) should have
more confidence that they made the correct decision to reject the null hypothesis with a
smaller p-value (for example, 0.001 as opposed to 0.04) even if using a significance level of
0.05. Similarly, for a large p-value such as 0.4, as opposed to a p-value of 0.056 (a significance
level of 0.05 is less than either number), a data analyst should have more confidence that they
made the correct decision in not rejecting the null hypothesis. This makes the data analyst
use judgment rather than mindlessly applying rules.
• The significance level must be identified before collecting the sample data and conducting the
test. Generally, the significance level will be included in the question. If no significance level
is given, a common standard is to use a significance level of 5%.
• An alternative approach for hypothesis testing is to use what is called the critical value
approach. In this book, we will only use the p-value approach. Some of the videos below
may mention the critical value approach, but this approach will not be used in this book.
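1. Write down the null and alternative hypotheses in terms of the population mean $\mu$. Include
appropriate units with the values of the mean.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level $\alpha$.
4. When the population standard deviation is unknown, the p-value is the area in the
corresponding tail of the $t$-distribution with:
\begin{eqnarray*} t & = & \frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}} \\ df & = & n-1 \end{eqnarray*}
5. Compare the p-value to the significance level and state the outcome of the test:
◦ If p-value $\leq \alpha$, reject the null hypothesis in favour of the alternative hypothesis.
◦ If p-value $\gt \alpha$, do not reject the null hypothesis.
6. Write down a concluding sentence specific to the context of the question.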
The p-value for a hypothesis test on a population mean is the area in the tail(s) of the distribution of
the sample mean. When the population standard deviation is unknown, use the $t$-distribution to
find the p-value.
• For a left-tail test, use the t.dist function to find the p-value. In the t.dist(t-score, degrees of
freedom, logic operator) function:
◦ For t-score, enter the value of $t$ calculated from $t=\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}$.
◦ For degrees of freedom, enter the degrees of freedom for the $t$-distribution, $n-1$.
◦ For the logic operator, enter true. Note: Because we are calculating the area under
the curve, we always enter true for the logic operator.
• The output from the t.dist function is the area under the $t$-distribution to the left of the
entered $t$-score.
• Visit the Microsoft page for more information about the t.dist function.
• For a right-tail test, use the t.dist.rt function to find the p-value. In the t.dist.rt(t-score, degrees
of freedom) function:
◦ For t-score, enter the value of $t$ calculated from $t=\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}$.
◦ For degrees of freedom, enter the degrees of freedom for the $t$-distribution, $n-1$.
• The output from the t.dist.rt function is the area under the $t$-distribution to the right of the
entered $t$-score.
• Visit the Microsoft page for more information about the t.dist.rt function.
• For a two-tail test, use the t.dist.2t function to find the p-value. In the t.dist.2t(t-score, degrees
of freedom) function:
◦ For t-score, enter the absolute value of $t$ calculated from $t=\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}$. Note: In the
t.dist.2t function, the value of the $t$-score must be a positive number. If the $t$-score is
negative, enter the absolute value of the $t$-score into the t.dist.2t function.
◦ For degrees of freedom, enter the degrees of freedom for the $t$-distribution, $n-1$.
• The output from the t.dist.2t function is the sum of areas in the tails under the $t$-distribution.
• Visit the Microsoft page for more information about the t.dist.2t function.
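For readers who want to check these three tail areas outside of a spreadsheet, the following is a minimal Python sketch. It is an illustration only, not part of the Excel workflow used in this book. The helper names t_pdf, t_dist, t_dist_rt and t_dist_2t are hypothetical, the area is approximated with Simpson's rule rather than with Excel's own algorithm, and the t-score and degrees of freedom at the bottom are made-up values.

```python
import math

def t_pdf(x, df):
    """Height of the t-distribution curve with df degrees of freedom at x."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_dist(t, df, steps=2000):
    """Area to the left of t, in the spirit of t.dist(t, df, true)."""
    a = abs(t)
    h = a / steps  # steps is even, as Simpson's rule requires
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def t_dist_rt(t, df):
    """Area to the right of t, in the spirit of t.dist.rt(t, df)."""
    return 1.0 - t_dist(t, df)

def t_dist_2t(t, df):
    """Sum of the two tail areas; like t.dist.2t, a negative t is rejected."""
    if t < 0:
        raise ValueError("enter the absolute value of the t-score")
    return 2.0 * t_dist_rt(t, df)

# Made-up values, only to show how the three areas relate to each other.
t_score, df = 1.5, 12
print("left of -t :", round(t_dist(-t_score, df), 4))
print("right of t :", round(t_dist_rt(t_score, df), 4))
print("two tails  :", round(t_dist_2t(t_score, df), 4))
```

Because the t-distribution is symmetric, the area to the left of a negative t-score equals the area to the right of its absolute value, and the two-tail area is twice either one.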
EXAMPLE
Statistics students believe that the mean score on the first statistics test is 65. A statistics instructor
thinks the mean score is higher than 65. He samples ten statistics students and obtains the following
scores:
65 67 66 68 72
65 70 63 63 71
The instructor performs a hypothesis test using a 1% level of significance. The test scores are
assumed to be from a normal distribution.
Solution:
Hypotheses:
\begin{eqnarray*} H_0: & & \mu=65 \\ H_a: & & \mu \gt 65 \end{eqnarray*}
p-value:
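This is a test on a population mean where the population standard deviation is unknown (we only
know the sample standard deviation $s=3.1972...$). So we use a $t$-distribution to calculate the
p-value. Because the alternative hypothesis is a $\gt$, the p-value is the area in the right-tail of the
distribution.
\begin{eqnarray*} t & = & \frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}=\frac{67-65}{\frac{3.1972...}{\sqrt{10}}}=1.9781... \end{eqnarray*}
Function t.dist.rt Answer
Field 1 1.9781... 0.0396
Field 2 9
So the p-value $=0.0396$.
Conclusion:
Because p-value $=0.0396 \gt 0.01=\alpha$, we do not reject the null hypothesis. At the 1%
significance level there is not enough evidence to suggest that the mean score on the test is greater
than 65.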
NOTES
1. The null hypothesis is the claim that the mean test score is 65.
2. The alternative hypothesis is the claim that the mean test score is greater than 65.
3. Keep all of the decimals throughout the calculation (i.e. in the sample standard deviation, the
$t$-score, etc.) to avoid any round-off error in the calculation of the p-value. This ensures that
we get the most accurate value for the p-value.
4. The p-value is the area in the right-tail of the $t$-distribution, to the right of $t=1.9781...$.
5. The p-value of 0.0396 tells us that under the assumption that the mean test score is 65 (the
null hypothesis), there is a 3.96% chance that the mean test score in a sample of 10 students is
67 or more. Compared to the 1% significance level, this is a large probability, and so is likely
to happen assuming the null hypothesis is true. This suggests that the assumption that the
null hypothesis is true is most likely correct, and so the conclusion of the test is to not reject
the null hypothesis.
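As a cross-check of the test-score example, here is a short Python sketch that starts from the ten raw scores and carries every decimal through to the p-value. It is only an illustration under stated assumptions: the helper name t_left_area is hypothetical and approximates the area under the t-distribution with Simpson's rule, whereas the book's own calculation uses Excel's t.dist.rt function.

```python
import math
import statistics

def t_left_area(t, df, steps=2000):
    """Approximate area under the t-distribution to the left of t (Simpson's rule)."""
    def pdf(x):
        log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))
    a = abs(t)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + math.copysign(total * h / 3, t)

scores = [65, 67, 66, 68, 72, 65, 70, 63, 63, 71]
mu_0 = 65      # mean claimed in the null hypothesis
alpha = 0.01   # 1% significance level

n = len(scores)
x_bar = statistics.mean(scores)
s = statistics.stdev(scores)                    # sample standard deviation
t_score = (x_bar - mu_0) / (s / math.sqrt(n))   # no intermediate rounding
p_value = 1 - t_left_area(t_score, n - 1)       # right-tail area

print(f"x_bar = {x_bar}, s = {s:.4f}, t = {t_score:.4f}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```

Working from the unrounded sample standard deviation is the programmatic version of the advice in note 3 about keeping all of the decimals.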
TRY IT
A company claims that the average change in the value of their stock is $3.50 per week. An investor
believes this average is too high. The investor records the changes in the company’s stock price over
30 weeks and finds the average change in the stock price is $2.60 with a standard deviation of $1.80.
At the 5% significance level, is the average change in the company’s stock price lower than the
company claims?
Hypotheses:
\begin{eqnarray*} H_0: & & \mu=\$3.50 \\ H_a: & & \mu \lt \$3.50 \end{eqnarray*}
p-value:
This is a test on a population mean where the population standard deviation is unknown (we only
know the sample standard deviation $s=\$1.80$). So we use a $t$-distribution to calculate the p-value.
Because the alternative hypothesis is a $\lt$, the p-value is the area in the left-tail of the distribution.
\begin{eqnarray*} t & = & \frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}=\frac{2.60-3.50}{\frac{1.80}{\sqrt{30}}}=-2.7386... \end{eqnarray*}
Function t.dist Answer
Field 1 -2.7386... 0.0052
Field 2 29
Field 3 true
So the p-value $=0.0052$.
Conclusion:
Because p-value $=0.0052 \lt 0.05=\alpha$, we reject the null hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that the average change
in the company’s stock price is lower than $3.50 per week.
NOTES
1. The null hypothesis is the claim that the average change in the company’s
stock is $3.50 per week.
2. The alternative hypothesis is the claim that the average change in the
company’s stock is less than $3.50 per week.
3. The p-value is the area in the left-tail of the $t$-distribution, to the left of $t=-2.7386...$.
4. The p-value of 0.0052 tells us that under the assumption that the average change in the stock
is $3.50 (the null hypothesis), there is a 0.52% chance that the average change in a sample of
30 weeks is $2.60 or less. Compared to the 5% significance level, this is a small probability,
and so is unlikely to happen assuming the null hypothesis is true. This suggests that the
assumption that the null hypothesis is true is most likely incorrect, and so the conclusion of
the test is to reject the null hypothesis in favour of the alternative hypothesis. In other words,
the average change in the company’s stock price is most likely less than $3.50 per week.
EXAMPLE
A paint manufacturer has their production line set-up so that the average volume of paint in a can is
3.78 liters. The quality control manager at the plant believes that something has happened with the
production and the average volume of paint in the cans has changed. The quality control
department takes a sample of 100 cans and finds the average volume is 3.62 liters with a standard
deviation of 0.7 liters. At the 5% significance level, has the volume of paint in a can changed?
Solution:
Hypotheses:
\begin{eqnarray*} H_0: & & \mu=3.78 \mbox{ liters} \\ H_a: & & \mu \neq 3.78
\mbox{ liters} \end{eqnarray*}
p-value:
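This is a test on a population mean where the population standard deviation is unknown (we only
know the sample standard deviation $s=0.7$). So we use a $t$-distribution to calculate the p-value.
Because the alternative hypothesis is a $\neq$, the p-value is the sum of area in the tails of the
distribution.
\begin{eqnarray*} t & = & \frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}=\frac{3.62-3.78}{\frac{0.7}{\sqrt{100}}}=-2.2857... \end{eqnarray*}
Function t.dist.2t Answer
Field 1 2.2857... 0.0244
Field 2 99
So the p-value $=0.0244$.
Conclusion:
Because p-value $=0.0244 \lt 0.05=\alpha$, we reject the null hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that the average volume
of paint in the cans has changed.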
NOTES
1. The null hypothesis is the claim that the average volume of paint in the cans is
3.78.
2. The alternative hypothesis is the claim that the average volume of paint in the
cans is not 3.78.
3. Keep all of the decimals throughout the calculation (i.e. in the $t$-score) to avoid any round-off
error in the calculation of the p-value. This ensures that we get the most accurate value for
the p-value.
4. The p-value is the sum of the area in the two tails. The output from the t.dist.2t function is
exactly the sum of the area in the two tails, and so is the p-value required for the test. No
additional calculations are required.
5. The t.dist.2t function requires that the value entered for the $t$-score is positive. A negative
$t$-score entered into the t.dist.2t function generates an error in Excel. In this case, the value of
the $t$-score is negative, so we must enter the absolute value of this $t$-score into field 1.
6. The p-value of 0.0244 is a small probability compared to the significance level, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis. In other words, the average
volume of paint in the cans has most likely changed from 3.78 liters.
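The two-tail calculation in the paint example can be sketched in Python as follows. This is a hedged illustration, not the book's method: the helper name t_left_area is hypothetical and uses Simpson's rule to approximate the area that Excel's t.dist.2t function returns directly. As in note 5, the absolute value of the t-score is used before doubling the tail area.

```python
import math

def t_left_area(t, df, steps=2000):
    """Approximate area under the t-distribution to the left of t (Simpson's rule)."""
    def pdf(x):
        log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))
    a = abs(t)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + math.copysign(total * h / 3, t)

n = 100        # cans sampled
x_bar = 3.62   # sample mean volume in liters
s = 0.7        # sample standard deviation in liters
mu_0 = 3.78    # mean claimed in the null hypothesis
alpha = 0.05   # 5% significance level

t_score = (x_bar - mu_0) / (s / math.sqrt(n))    # negative in this example
one_tail = 1 - t_left_area(abs(t_score), n - 1)  # area beyond |t| in one tail
p_value = 2 * one_tail                           # sum of the two tail areas

print(f"t = {t_score:.4f}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```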
Watch this video: Hypothesis Testing: t-test, right tail by ExcelIsFun [11:02]
Watch this video: Hypothesis Testing: t-test, left tail by ExcelIsFun [7:48]
Watch this video: Hypothesis Testing: t-test, two tail by ExcelIsFun [8:54]
Concept Review
1. Write down the null and alternative hypotheses in terms of the population mean $\mu$. Include
appropriate units with the values of the mean.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level.
4. When the population standard deviation is unknown, find the p-value (the area in the
corresponding tail) for the test using the $t$-distribution with $t=\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n}}}$ and $df=n-1$.
5. Compare the p-value to the significance level and state the outcome of the test.
6. Write down a concluding sentence specific to the context of the question.
Attribution
“9.6 Hypothesis Testing of a Single Mean and Single Proportion“ in Introductory Statistics by
OpenStax is licensed under a Creative Commons Attribution 4.0 International License.
8.8 HYPOTHESIS TESTS FOR A POPULATION
PROPORTION
LEARNING OBJECTIVES
• The null hypothesis is always an “equal to.” The null hypothesis is the original claim
about the population parameter.
• The alternative hypothesis is a “less than,” “greater than,” or “not equal to.” The form of
the alternative hypothesis depends on the context of the question.
• The form of the alternative hypothesis tells us if the test is left-tail, right-tail, or two-tail. The
alternative hypothesis is the key to conducting the test and finding the correct p-value.
◦ If the alternative hypothesis is a “less than”, then the test is left-tail. The p-value is the
area in the left-tail of the distribution.
◦ If the alternative hypothesis is a “greater than”, then the test is right-tail. The p-value is
the area in the right-tail of the distribution.
◦ If the alternative hypothesis is a “not equal to”, then the test is two-tail. The p-value is
the sum of the area in the two-tails of the distribution. Each tail represents exactly half
of the p-value.
• Think about the meaning of the p-value. A data analyst (and anyone else) should have
more confidence that they made the correct decision to reject the null hypothesis with a
smaller p-value (for example, 0.001 as opposed to 0.04) even if using a significance level of
0.05. Similarly, for a large p-value such as 0.4, as opposed to a p-value of 0.056 (a significance
level of 0.05 is less than either number), a data analyst should have more confidence that they
made the correct decision in not rejecting the null hypothesis. This makes the data analyst
use judgment rather than mindlessly applying rules.
• The significance level must be identified before collecting the sample data and conducting the
test. Generally, the significance level will be included in the question. If no significance level
is given, a common standard is to use a significance level of 5%.
EXAMPLE
Because the alternative hypothesis is a $\gt$, this is a right-tail test. The p-value is the area in the
right-tail of the distribution.
EXAMPLE
\begin{eqnarray*} H_0: & & p=50 \% \\ H_a: & & p \neq 50\% \end{eqnarray*}
Because the alternative hypothesis is a $\neq$, this is a two-tail test. The p-value is the sum of the areas in
the two tails of the distribution. Each tail contains exactly half of the p-value.
EXAMPLE
\begin{eqnarray*} H_0: & & p=10\% \\ H_a: & & p \lt 10\% \end{eqnarray*}
Because the alternative hypothesis is a $\lt$, this is a left-tail test. The p-value is the area in the left-tail
of the distribution.
1. Write down the null and alternative hypotheses in terms of the population proportion $p$.
Include appropriate units with the values of the proportion.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level.
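4. Find the p-value (the area in the corresponding tail) for the test using the appropriate
distribution:
◦ If both $n \times p \geq 5$ and $n \times (1-p) \geq 5$, use the normal distribution with mean $p$ and
standard deviation $\sqrt{\frac{p \times (1-p)}{n}}$.
◦ If at least one of $n \times p \lt 5$ or $n \times (1-p) \lt 5$, use the binomial distribution with $n$ trials
and probability of success $p$.
5. Compare the p-value to the significance level and state the outcome of the test:
◦ If p-value $\leq \alpha$, reject the null hypothesis in favour of the alternative hypothesis.
◦ If p-value $\gt \alpha$, do not reject the null hypothesis.
6. Write down a concluding sentence specific to the context of the question.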
The p-value for a hypothesis test on a population proportion is the area in the tail(s) of the
distribution of the sample proportion.
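If both $n \times p \geq 5$ and $n \times (1-p) \geq 5$:
• The p-value is the area in the tail(s) of a normal distribution, so the norm.dist(x, $\mu$, $\sigma$, logic
operator) function can be used to calculate the p-value.
◦ For x, enter the value of the sample proportion $\hat{p}$.
◦ For $\mu$, enter the value of $p$ from the null hypothesis.
◦ For $\sigma$, enter sqrt(p*(1-p)/n), using the value of $p$ from the null hypothesis.
◦ For the logic operator, enter true. Note: Because we are calculating the area under
the curve, we always enter true for the logic operator.
• Use the appropriate technique with the norm.dist function to find the area in the left-tail or
the area in the right-tail: norm.dist gives the area in the left-tail and 1-norm.dist gives the area
in the right-tail.
If at least one of $n \times p \lt 5$ or $n \times (1-p) \lt 5$:
• If the alternative hypothesis is a $\lt$, the p-value is the probability of getting at most $x$ successes
in $n$ trials where the probability of success is the claim about the population proportion in
the null hypothesis.
• If the alternative hypothesis is a $\gt$, the p-value is the probability of getting at least $x$ successes
in $n$ trials where the probability of success is the claim about the population proportion in
the null hypothesis.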
EXAMPLE
Marketers believe that 92% of adults own a cell phone. A cell phone manufacturer believes that
number is actually lower. In a sample of 200 adults, 87% own a cell phone. At the 1% significance
level, determine if the proportion of adults that own a cell phone is lower than the marketers’ claim.
Solution:
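Hypotheses:
\begin{eqnarray*} H_0: & & p=92\% \\ H_a: & & p \lt 92\% \end{eqnarray*}
p-value:
To determine the distribution, we check $n \times p$ and $n \times (1-p)$. For the value of $p$, we use the
claim from the null hypothesis ($p=0.92$).
\begin{eqnarray*} n \times p & = & 200 \times 0.92=184 \geq 5 \\ n \times (1-p) & = & 200 \times (1-0.92)=16 \geq 5 \end{eqnarray*}
Because both $n \times p \geq 5$ and $n \times (1-p) \geq 5$, we use a normal distribution to calculate the
p-value. Because the alternative hypothesis is a $\lt$, the p-value is the area in the left tail of the
distribution.
Function norm.dist Answer
Field 1 0.87 0.0046
Field 2 0.92
Field 3 sqrt(0.92*(1-0.92)/200)
Field 4 true
So the p-value $=0.0046$.
Conclusion:
Because p-value $=0.0046 \lt 0.01=\alpha$, we reject the null hypothesis in favour of the alternative
hypothesis. At the 1% significance level there is enough evidence to suggest that the proportion of
adults who own a cell phone is lower than 92%.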
NOTES
1. The null hypothesis is the claim that 92% of adults own a cell phone.
2. The alternative hypothesis is the claim that less than 92% of adults own a cell
phone.
3. The p-value is the area in the left tail of the sampling distribution, to the left of $\hat{p}=0.87$.
In the calculation of the p-value:
◦ The function is norm.dist because we are finding the area in the left tail of a normal
distribution.
◦ Field 1 is the value of $\hat{p}$.
◦ Field 2 is the value of $p$ from the null hypothesis. Remember, we run the test
assuming the null hypothesis is true, so that means we assume $p=0.92$.
◦ Field 3 is the standard deviation of the sampling distribution, $\sqrt{\frac{p \times (1-p)}{n}}$.
4. The p-value of 0.0046 tells us that under the assumption that 92% of adults own a cell phone
(the null hypothesis), there is only a 0.46% chance that the proportion of adults who own a
cell phone in a sample of 200 is 87% or less. This is a small probability, and so is unlikely to
happen assuming the null hypothesis is true. This suggests that the assumption that the null
hypothesis is true is most likely incorrect, and so the conclusion of the test is to reject the null
hypothesis in favour of the alternative hypothesis. In other words, the proportion of adults
who own a cell phone is most likely less than 92%.
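The cell phone calculation can also be reproduced with a few lines of Python. The sketch below is only an illustration: it assumes Python's standard statistics.NormalDist class as a stand-in for Excel's norm.dist function, and the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

n = 200        # sample size
p0 = 0.92      # proportion claimed in the null hypothesis
p_hat = 0.87   # sample proportion
alpha = 0.01   # 1% significance level

# Check that the normal distribution is appropriate.
normal_ok = n * p0 >= 5 and n * (1 - p0) >= 5

# Sampling distribution of the sample proportion, assuming H0 is true.
sampling_dist = NormalDist(mu=p0, sigma=sqrt(p0 * (1 - p0) / n))

# Left-tail area, playing the role of norm.dist(0.87, 0.92, sqrt(...), true).
p_value = sampling_dist.cdf(p_hat)

print(f"normal distribution appropriate: {normal_ok}")
print(f"p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```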
EXAMPLE
A consumer group claims that the proportion of households that have at least three cell phones is
30%. A cell phone company has reason to believe that the proportion of households with at least
three cell phones is much higher. Before they start a big advertising campaign based on the
proportion of households that have at least three cell phones, they want to test their claim. Their
marketing people survey 150 households with the result that 54 of the households have at least three
cell phones. At the 1% significance level, determine if the proportion of households that have at least
three cell phones is higher than 30%.
Solution:
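Hypotheses:
\begin{eqnarray*} H_0: & & p=30\% \\ H_a: & & p \gt 30\% \end{eqnarray*}
p-value:
To determine the distribution, we check $n \times p$ and $n \times (1-p)$. For the value of $p$, we use the
claim from the null hypothesis ($p=0.3$).
\begin{eqnarray*} n \times p & = & 150 \times 0.3=45 \geq 5 \\ n \times (1-p) & = & 150 \times (1-0.3)=105 \geq 5 \end{eqnarray*}
Because both $n \times p \geq 5$ and $n \times (1-p) \geq 5$, we use a normal distribution to calculate the
p-value. Because the alternative hypothesis is a $\gt$, the p-value is the area in the right tail of the
distribution. The sample proportion is $\hat{p}=\frac{54}{150}=0.36$.
Function 1-norm.dist Answer
Field 1 0.36 0.0544
Field 2 0.3
Field 3 sqrt(0.3*(1-0.3)/150)
Field 4 true
So the p-value $=0.0544$.
Conclusion:
Because p-value $=0.0544 \gt 0.01=\alpha$, we do not reject the null hypothesis. At the 1%
significance level there is not enough evidence to suggest that the proportion of households that
have at least three cell phones is higher than 30%.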
NOTES
1. The null hypothesis is the claim that 30% of households have at least three cell
phones.
2. The alternative hypothesis is the claim that more than 30% of households have at
least three cell phones.
3. The p-value is the area in the right tail of the sampling distribution, to the right of
$\hat{p}=0.36$. In the calculation of the p-value:
◦ The function is 1-norm.dist because we are finding the area in the right tail of a normal
distribution.
◦ Field 1 is the value of $\hat{p}$.
◦ Field 2 is the value of $p$ from the null hypothesis. Remember, we run the test
assuming the null hypothesis is true, so that means we assume $p=0.3$.
◦ Field 3 is the standard deviation of the sampling distribution, $\sqrt{\frac{p \times (1-p)}{n}}$.
4. The p-value of 0.0544 tells us that under the assumption that 30% of households have at least
three cell phones (the null hypothesis), there is a 5.44% chance that the proportion of
households with at least three cell phones in a sample of 150 is 36% or more. Compared to
the 1% significance level, this is a large probability, and so is likely to happen assuming the
null hypothesis is true. This suggests that the assumption that the null hypothesis is true is
most likely correct, and so the conclusion of the test is to not reject the null hypothesis. In
other words, the claim that 30% of households have at least three cell phones is most likely
correct.
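For a right-tail test the only change is that the left-tail area is subtracted from 1, just as 1-norm.dist does in Excel. The following Python sketch of the household example is an illustration under the same assumptions as before: statistics.NormalDist stands in for norm.dist, and the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

n = 150          # households surveyed
successes = 54   # households with at least three cell phones
p0 = 0.30        # proportion claimed in the null hypothesis
alpha = 0.01     # 1% significance level

p_hat = successes / n   # sample proportion, 0.36
sampling_dist = NormalDist(mu=p0, sigma=sqrt(p0 * (1 - p0) / n))

# Right-tail area: 1 minus the area to the left of the sample proportion.
p_value = 1 - sampling_dist.cdf(p_hat)

print(f"p_hat = {p_hat:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```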
TRY IT
A teacher believes that 70% of students in the class will want to go on a field trip to the local zoo. The
students in the class believe the proportion is much higher and ask the teacher to verify her claim.
The teacher samples 50 students and 39 reply that they would want to go to the zoo. At the 5%
significance level, determine if the proportion of students who want to go on the field trip is higher
than 70%.
Hypotheses:
\begin{eqnarray*} H_0: & & p = 70\% \mbox{ of students want to go on the field trip} \\
H_a: & & p \gt 70\% \mbox{ of students want to go on the field trip} \end{eqnarray*}
p-value:
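To determine the distribution, we check $n \times p$ and $n \times (1-p)$. For the value of $p$, we use the
claim from the null hypothesis ($p=0.7$).
\begin{eqnarray*} n \times p & = & 50 \times 0.7=35 \geq 5 \\ n \times (1-p) & = & 50 \times (1-0.7)=15 \geq 5 \end{eqnarray*}
Because both $n \times p \geq 5$ and $n \times (1-p) \geq 5$, we use a normal distribution to calculate the
p-value. Because the alternative hypothesis is a $\gt$, the p-value is the area in the right tail of the
distribution. The sample proportion is $\hat{p}=\frac{39}{50}=0.78$.
Function 1-norm.dist Answer
Field 1 0.78 0.1085
Field 2 0.7
Field 3 sqrt(0.7*(1-0.7)/50)
Field 4 true
So the p-value $=0.1085$.
Conclusion:
Because p-value $=0.1085 \gt 0.05=\alpha$, we do not reject the null hypothesis. At the 5%
significance level there is not enough evidence to suggest that the proportion of students who want
to go on the field trip is higher than 70%.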
NOTES
1. The null hypothesis is the claim that 70% of the students want to go on the field
trip.
2. The alternative hypothesis is the claim that more than 70% of students want to go
on the field trip.
3. The p-value of 0.1085 tells us that under the assumption that 70% of students want to go on
the field trip (the null hypothesis), there is a 10.85% chance that the proportion of students
who want to go on the field trip in a sample of 50 students is 78% or more. Compared to the
5% significance level, this is a large probability, and so is likely to happen assuming the null
hypothesis is true. This suggests that the assumption that the null hypothesis is true is most
likely correct, and so the conclusion of the test is to not reject the null hypothesis. In other
words, the teacher’s claim that 70% of students want to go on the field trip is most likely
correct.
EXAMPLE
Joan believes that 50% of first-time brides in the United States are younger than their grooms. She
performs a hypothesis test to determine if the percentage is the same or different from 50%. Joan
samples 100 first-time brides and 56 reply that they are younger than their grooms. Use a 5%
significance level.
Solution:
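Hypotheses:
\begin{eqnarray*} H_0: & & p=50\% \\ H_a: & & p \neq 50\% \end{eqnarray*}
p-value:
To determine the distribution, we check $n \times p$ and $n \times (1-p)$. For the value of $p$, we use the
claim from the null hypothesis ($p=0.5$).
\begin{eqnarray*} n \times p & = & 100 \times 0.5=50 \geq 5 \\ n \times (1-p) & = & 100 \times (1-0.5)=50 \geq 5 \end{eqnarray*}
Because both $n \times p \geq 5$ and $n \times (1-p) \geq 5$, we use a normal distribution to calculate the
p-value. Because the alternative hypothesis is a $\neq$, the p-value is the sum of area in the tails of the
distribution.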
Because there is only one sample, we only have information relating to one of the two tails, either the
left or the right. We need to know if the sample relates to the left or right tail because that will
determine how we calculate the area of that tail using the normal distribution. In this case, the
sample proportion $\hat{p}=0.56$ is greater than the value of the population proportion in the null
hypothesis ($p=0.5$), so the sample information relates to the right-tail of
the normal distribution. This means that we will calculate the area in the right tail using
1-norm.dist. However, this is a two-tailed test where the p-value is the sum of the area in the two
tails and the area in the right-tail is only one half of the p-value. The area in the left tail equals the
area in the right tail and the p-value is the sum of these two areas.
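Function 1-norm.dist Answer
Field 1 0.56 0.1151
Field 2 0.5
Field 3 sqrt(0.5*(1-0.5)/100)
Field 4 true
So the area in the right tail is 0.1151 and $\frac{1}{2}$(p-value) $=0.1151$. This is also the area in the left tail,
so
p-value $=0.1151+0.1151=0.2302$
Conclusion:
Because p-value $=0.2302 \gt 0.05=\alpha$, we do not reject the null hypothesis. At the 5%
significance level there is not enough evidence to suggest that the proportion of first-time brides that
are younger than the groom is different from 50%.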
NOTES
1. The null hypothesis is the claim that the proportion of first-time brides that are
younger than the groom is 50%.
2. The alternative hypothesis is the claim that the proportion of first-time brides
that are younger than the groom is different from 50%.
3. In a two-tailed hypothesis test that uses the normal distribution, we will only have sample
information relating to one of the two tails. We must determine which of the tails the sample
information belongs to, and then calculate the area in that tail. The area in each tail
represents exactly half of the p-value, so the p-value is the sum of the areas in the two tails.
◦ If the sample proportion is less than the population proportion in the null
hypothesis ($\hat{p} \lt p$), the sample information belongs to the left tail.
◦ If the sample proportion is greater than the population proportion in the null
hypothesis ($\hat{p} \gt p$), the sample information belongs to the right tail.
4. The p-value of 0.2302 is a large probability compared to the 5% significance level, and so is
likely to happen assuming the null hypothesis is true. This suggests that the assumption that
the null hypothesis is true is most likely correct, and so the conclusion of the test is to not
reject the null hypothesis. In other words, the claim that the proportion of first-time brides
who are younger than the groom is 50% is most likely correct.
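The two-tail reasoning in note 3 (find the tail the sample belongs to, then double its area) can be sketched in Python as follows. This is an illustration only, assuming statistics.NormalDist as a stand-in for norm.dist; the variable names are hypothetical. Because this sketch does not round the tail area before doubling, its last decimal can differ slightly from a hand calculation that doubles the rounded value 0.1151.

```python
from math import sqrt
from statistics import NormalDist

n = 100          # first-time brides sampled
successes = 56   # brides younger than their grooms
p0 = 0.50        # proportion claimed in the null hypothesis
alpha = 0.05     # 5% significance level

p_hat = successes / n
sampling_dist = NormalDist(mu=p0, sigma=sqrt(p0 * (1 - p0) / n))

left_area = sampling_dist.cdf(p_hat)
if p_hat > p0:
    one_tail = 1 - left_area   # sample information belongs to the right tail
else:
    one_tail = left_area       # sample information belongs to the left tail

p_value = 2 * one_tail         # each tail is exactly half of the p-value

print(f"area in one tail = {one_tail:.4f}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```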
Watch this video: Hypothesis Testing for Proportions: z-test by ExcelIsFun [7:27]
EXAMPLE
An online retailer believes that 93% of the visitors to its website will make a purchase. A researcher
in the marketing department thinks the actual percent is lower than claimed. The researcher
examines a sample of 50 visits to the website and finds that 45 of the visits resulted in a purchase. At
the 1% significance level, determine if the proportion of visits to the website that result in a purchase
is lower than claimed.
Solution:
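Hypotheses:
\begin{eqnarray*} H_0: & & p=93\% \\ H_a: & & p \lt 93\% \end{eqnarray*}
p-value:
To determine the distribution, we check $n \times p$ and $n \times (1-p)$. For the value of $p$, we use the
claim from the null hypothesis ($p=0.93$).
\begin{eqnarray*} n \times p & = & 50 \times 0.93=46.5 \geq 5 \\ n \times (1-p) & = & 50 \times (1-0.93)=3.5 \lt 5 \end{eqnarray*}
Because $n \times (1-p) \lt 5$, we use a binomial distribution to calculate the p-value. Because the
alternative hypothesis is a $\lt$, the p-value is the probability of getting at most 45 successes in 50 trials.
Function binom.dist Answer
Field 1 45 0.2710
Field 2 50
Field 3 0.93
Field 4 true
So the p-value $=0.2710$.
Conclusion:
Because p-value $=0.2710 \gt 0.01=\alpha$, we do not reject the null hypothesis. At the 1%
significance level there is not enough evidence to suggest that the proportion of visits to the website
that result in a purchase is lower than 93%.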
NOTES
1. The null hypothesis is the claim that 93% of visitors to the website make a
purchase.
2. The alternative hypothesis is the claim that less than 93% of visitors to the
website make a purchase.
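3. The p-value is the binomial probability of getting at most 45 successes (the number in the
sample with the characteristic of interest) in 50 trials (the sample size) with a probability of
success of 93% (the value of $p$ in the null hypothesis). In the calculation of the p-value:
◦ The function is binom.dist because we are finding the probability of at most 45
successes.
◦ Field 1 is the number of successes, 45.
◦ Field 2 is the number of trials, 50.
◦ Field 3 is the probability of success, 0.93, from the null hypothesis.
◦ Field 4 is true because we want the cumulative ("at most") probability.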
4. The p-value of 0.2710 tells us that under the assumption that 93% of visitors make a purchase
(the null hypothesis), there is a 27.10% chance that the number of visitors in a sample of 50
who make a purchase is 45 or less. This is a large probability compared to the significance
level, and so is likely to happen assuming the null hypothesis is true. This suggests that the
assumption that the null hypothesis is true is most likely correct, and so the conclusion of the
test is to not reject the null hypothesis. In other words, the proportion of visitors to the
website who make a purchase is most likely 93%.
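When the binomial distribution is required, the "at most" probability can be checked by adding up the individual binomial probabilities. The Python sketch below does this for the online retailer example. It is only an illustration: the helper name binom_cdf is hypothetical and plays the role of Excel's binom.dist function with the logic operator set to true.

```python
from math import comb

def binom_cdf(x, n, p):
    """Probability of at most x successes in n trials with success probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

n = 50           # visits sampled
successes = 45   # visits that resulted in a purchase
p0 = 0.93        # proportion claimed in the null hypothesis
alpha = 0.01     # 1% significance level

# The normal distribution is not appropriate because n * (1 - p0) is below 5.
use_binomial = n * p0 < 5 or n * (1 - p0) < 5

# Left-tail test: probability of at most 45 successes in 50 trials.
p_value = binom_cdf(successes, n, p0)

print(f"use binomial: {use_binomial}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```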
EXAMPLE
A drug company claims that only 4% of people who take their new drug experience any side effects
from the drug. A researcher believes that the percent is higher than the drug company’s claim. The
researcher takes a sample of 80 people who take the drug and finds that 10% of the people in the
sample experience side effects from the drug. At the 5% significance level, determine if the
proportion of people who experience side effects from taking the drug is higher than claimed.
Solution:
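Hypotheses:
\begin{eqnarray*} H_0: & & p=4\% \\ H_a: & & p \gt 4\% \end{eqnarray*}
p-value:
To determine the distribution, we check $n \times p$ and $n \times (1-p)$. For the value of $p$, we use the
claim from the null hypothesis ($p=0.04$).
\begin{eqnarray*} n \times p & = & 80 \times 0.04=3.2 \lt 5 \\ n \times (1-p) & = & 80 \times (1-0.04)=76.8 \geq 5 \end{eqnarray*}
Because $n \times p \lt 5$, we use a binomial distribution to calculate the p-value. Because the
alternative hypothesis is a $\gt$, the p-value is the probability of getting at least 8 successes in 80 trials.
(Note: In the sample of size 80, 10% have the characteristic of interest, so this means that
$80 \times 0.1=8$ people in the sample have the characteristic of interest.)
Function 1-binom.dist Answer
Field 1 7 0.0147
Field 2 80
Field 3 0.04
Field 4 true
So the p-value $=0.0147$.
Conclusion:
Because p-value $=0.0147 \lt 0.05=\alpha$, we reject the null hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that the proportion of
people who experience side effects from taking the drug is higher than 4%.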
NOTES
1. The null hypothesis is the claim that 4% of the people experience side effects from
taking the drug.
2. The alternative hypothesis is the claim that more than 4% of the people experience
side effects from taking the drug.
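3. The p-value is the binomial probability of getting at least 8 successes (the number in the
sample with the characteristic of interest) in 80 trials (the sample size) with a probability of
success of 4% (the value of $p$ in the null hypothesis). In the calculation of the p-value:
◦ The function is 1-binom.dist because the probability of at least 8 successes is 1 minus
the probability of at most 7 successes.
◦ Field 1 is 7, one less than the number of successes in the sample.
◦ Field 2 is the number of trials, 80.
◦ Field 3 is the probability of success, 0.04, from the null hypothesis.
◦ Field 4 is true because we want the cumulative ("at most") probability.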
4. The p-value of 0.0147 tells us that under the assumption that 4% of people experience side
effects (the null hypothesis), there is a 1.47% chance that the number of people in a sample of
80 who experience side effects is 8 or more. This is a small probability compared to the
significance level, and so is unlikely to happen assuming the null hypothesis is true. This
suggests that the assumption that the null hypothesis is true is most likely incorrect, and so
the conclusion of the test is to reject the null hypothesis in favour of the alternative
hypothesis. In other words, the proportion of people who experience side effects is most
likely greater than 4%.
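The "at least" case uses the complement rule: the probability of at least 8 successes is 1 minus the probability of at most 7 successes, which is why Field 1 in the Excel calculation is 7. A minimal Python sketch of the drug example follows. It is an illustration only, and the helper name binom_cdf is hypothetical (it stands in for binom.dist with the logic operator set to true).

```python
from math import comb

def binom_cdf(x, n, p):
    """Probability of at most x successes in n trials with success probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

n = 80                       # people sampled
successes = round(0.10 * n)  # 10% of 80 people, i.e. 8 people with side effects
p0 = 0.04                    # proportion claimed in the null hypothesis
alpha = 0.05                 # 5% significance level

# Right-tail test: P(at least 8) = 1 - P(at most 7).
p_value = 1 - binom_cdf(successes - 1, n, p0)

print(f"successes = {successes}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```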
Concept Review
1. Write down the null and alternative hypotheses in terms of the population proportion $p$.
Include appropriate units with the values of the proportion.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level.
4. Find the p-value (the area in the corresponding tail) for the test using the appropriate
distribution (normal or binomial).
5. Compare the p-value to the significance level and state the outcome of the test.
6. Write down a concluding sentence specific to the context of the question.
Attribution
“9.6 Hypothesis Testing of a Single Mean and Single Proportion” in Introductory Statistics by
OpenStax is licensed under a Creative Commons Attribution 4.0 International License.
8.9 EXERCISES
1. You are testing that the mean speed of your cable Internet connection is more than three
Megabits per second. State the null and alternative hypotheses.
2. The mean entry level salary of an employee at a company is $58,000. You believe it is higher for
IT professionals in the company. State the null and alternative hypotheses.
3. A sociologist claims the probability that a person picked at random in Times Square in New York
City is visiting the area is 0.83. You want to test to see if the claim is correct. State the null and
alternative hypotheses.
4. In a population of fish, approximately 42% are female. A test is conducted to see if, in fact, the
proportion is less. State the null and alternative hypotheses.
5. Suppose that a recent article stated that the mean time spent in jail by a first-time convicted
burglar is 2.5 years. A study was then done to see if the mean time has increased in the new
century. A random sample of 26 first-time convicted burglars in a recent year was picked. The
mean length of time in jail from the survey was 3 years with a standard deviation of 1.8 years.
Suppose that it is somehow known that the population standard deviation is 1.5. If you were
conducting a hypothesis test to determine if the mean length of jail time has increased, what
would the null and alternative hypotheses be? The distribution of the population is normal.
6. A random survey of 75 death row inmates revealed that the mean length of time on death row
is 17.4 years with a standard deviation of 6.3 years. If you were conducting a hypothesis test to
determine if the population mean time on death row could likely be 15 years, what would the null
and alternative hypotheses be?
7. The National Institute of Mental Health published an article stating that in any one-year period,
approximately 9.5 percent of American adults suffer from depression or a depressive illness.
Suppose that in a survey of 100 people in a certain town, seven of them suffered from depression
or a depressive illness. If you were conducting a hypothesis test to determine if the true proportion
of people in that town suffering from depression or a depressive illness is lower than the percent in
the general adult American population, what would the null and alternative hypotheses be?
8. Previously, an organization reported that teenagers spent 4.5 hours per week, on average, on
the phone. The organization thinks that, currently, the mean is higher. Fifteen randomly chosen
teenagers were asked how many hours per week they spend on the phone. The sample mean was
4.75 hours with a sample standard deviation of 2.0. State the null and alternative hypotheses.
9. The mean price of mid-sized cars in a region is $32,000. A test is conducted to see if the claim
is true. State the Type I and Type II errors in complete sentences.
10. A sleeping bag is tested to withstand temperatures of –15 °F. You think the bag cannot stand
temperatures that low. State the Type I and Type II errors in complete sentences.
11. A group of doctors is deciding whether or not to perform an operation. Suppose the null
hypothesis is: the surgical procedure will go well. State the Type I and Type II errors in complete
sentences.
12. A group of doctors is deciding whether or not to perform an operation. Suppose the null
hypothesis is: the surgical procedure will go well. Which is the error with the greater consequence?
13. A group of divers is exploring an old sunken ship. Suppose the null hypothesis is: the sunken
ship does not contain buried treasure. State the Type I and Type II errors in complete sentences.
14. A microbiologist is testing a water sample for E-coli. Suppose the null hypothesis is: the sample
contains E-coli. Which is the error with the greater consequence?
15. When a new drug is created, the pharmaceutical company must subject it to testing before
receiving the necessary permission from the Food and Drug Administration (FDA) to market the
drug. Suppose the null hypothesis is “the drug is unsafe.” What is the Type II error?
16. A statistics instructor believes that fewer than 20% of Evergreen Valley College (EVC) students
attended the opening midnight showing of the latest Harry Potter movie. She surveys 84 of her
students and finds that 11 of them attended the midnight showing. What is the Type I error?
17. Previously, an organization reported that teenagers spent 4.5 hours per week, on average, on
the phone. The organization thinks that, currently, the mean is higher. Fifteen randomly chosen
teenagers were asked how many hours per week they spend on the phone. The sample mean was
4.75 hours with a sample standard deviation of 2.0. What is the Type I error?
18. Which distributions can you use for hypothesis testing for this chapter?
19. Which distribution do you use when you are testing a population mean and the standard
deviation is known? Assume sample size is large.
20. Which distribution do you use when the standard deviation is not known and you are testing
one population mean? Assume sample size is large.
21. A population mean is 13. The sample mean is 12.8, and the sample standard deviation is two.
The sample size is 20. What distribution should you use to perform a hypothesis test? Assume the
underlying population is normal.
22. A population has a mean of 25 and a standard deviation of five. The sample mean is 24, and the
sample size is 108. What distribution should you use to perform a hypothesis test?
23. It is thought that 42% of respondents in a taste test would prefer Brand A. In a particular test of
100 people, 39% preferred Brand A. What distribution should you use to perform a hypothesis test?
24. You are performing a hypothesis test of a single population mean using a Student’s
t-distribution. What must you assume about the distribution of the data?
25. You are performing a hypothesis test of a single population mean using a Student’s
t-distribution. The data are not from a simple random sample. Can you accurately perform the
hypothesis test?
26. You are performing a hypothesis test of a single population proportion. What must be true about
the quantities of $n \times p$ and $n \times (1-p)$ in order to use the normal distribution?
27. You are performing a hypothesis test of a single population proportion. You find out that
$n \times p$ is less than five. What must you do to be able to perform a valid hypothesis test?
29. The probability of winning the grand prize at a particular carnival game is 0.005. Is the outcome
of winning very likely or very unlikely?
30. The probability of winning the grand prize at a particular carnival game is 0.005. Michele wins
the grand prize. Is this considered a rare or common event? Why?
31. It is believed that the mean height of high school students who play basketball on the school
team is 73 inches with a standard deviation of 1.8 inches. A random sample of 40 players is chosen.
The sample mean was 71 inches, and the sample standard deviation was 1.5 inches. Do the data
support the claim that the mean height is less than 73 inches? The p-value is almost zero. State the
null and alternative hypotheses and interpret the p-value.
32. The mean age of graduate students at a University is at most 31 years with a standard deviation
of two years. A random sample of 15 graduate students is taken. The sample mean is 32 years and
the sample standard deviation is three years. Are the data significant at the 1% level? The p-value is
0.0264. State the null and alternative hypotheses and interpret the p-value.
35. If you do not reject the null hypothesis, then it must be true. Is this statement correct? State why
or why not in complete sentences.
36. Suppose that a recent article stated that the mean time spent in jail by a first-time convicted
burglar is 2.5 years. A study was then done to see if the mean time has increased in the new century.
A random sample of 26 first-time convicted burglars in a recent year was picked. The mean length
of time in jail from the survey was three years with a standard deviation of 1.8 years. Suppose that
it is somehow known that the population standard deviation is 1.5. Conduct a hypothesis test to
determine if the mean length of jail time has increased. Assume the distribution of the jail times is
approximately normal.
37. A random survey of 75 death row inmates revealed that the mean length of time on death row
is 17.4 years with a standard deviation of 6.3 years. Conduct a hypothesis test to determine if the
population mean time on death row could likely be 15 years.
38. The National Institute of Mental Health published an article stating that in any one-year
period, approximately 9.5 percent of American adults suffer from depression or a depressive illness.
Suppose that in a survey of 100 people in a certain town, seven of them suffered from depression or
a depressive illness. Conduct a hypothesis test to determine if the true proportion of people in that
town suffering from depression or a depressive illness is lower than the percent in the general adult
American population.
42. A bottle of water is labeled as containing 16 fluid ounces of water. You believe it is less than
that. What type of test would you use?
43. Your friend claims that his mean golf score is 63. You want to show that it is higher than that.
What type of test would you use?
44. A bathroom scale claims to be able to identify correctly any weight within a pound. You think
that it cannot be that accurate. What type of test would you use?
45. You flip a coin and record whether it shows heads or tails. You know the probability of getting
heads is 50%, but you think it is less for this particular coin. What type of test would you use?
46. Assume the null hypothesis states that the mean is equal to 88. The alternative hypothesis states
that the mean is not equal to 88. Is this a left-tailed, right-tailed, or two-tailed test?
47. A particular brand of tires claims that its deluxe tire averages 50,000 miles before it needs to
be replaced. A group of owners believe this number is too high. From past studies of this tire, the
standard deviation is known to be 8,000. A survey of owners of that tire design is conducted. From
the 28 tires surveyed, the mean lifespan was 46,500 miles with a standard deviation of 9,800 miles.
At the 5% significance level, is the data highly inconsistent with the claim?
48. From generation to generation, the mean age when smokers first start to smoke is 19 years.
However, the standard deviation of that age remains constant at around 2.1 years. A survey of 40
smokers of this generation was done to see if the mean starting age is at least 19. The sample mean
was 18.1 with a sample standard deviation of 1.3. Does the data support the claim at the 5% level?
49. The cost of a daily newspaper varies from city to city. However, the variation among prices
remains steady with a standard deviation of 20¢. A study was done to test the claim that the mean
cost of a daily newspaper is $1.00. Twelve costs yield a mean cost of 95¢ with a standard deviation
of 18¢. Does the data support the claim at the 1% level?
50. An article in the San Jose Mercury News stated that students in the California state university
system take 4.5 years, on average, to finish their undergraduate degrees. Suppose you believe that
the mean time is longer. You conduct a survey of 49 students and obtain a sample mean of 5.1 with
a sample standard deviation of 1.2. Does the data support your claim at the 1% level?
51. The mean number of sick days an employee takes per year is believed to be about ten. Members
of a personnel department do not believe this figure. They randomly survey eight employees. The
number of sick days they took for the past year are as follows: 12; 4; 15; 3; 11; 8; 6; 8. At the 5%
significance level, should the personnel team believe that the mean number is ten?
52. In 1955, Life Magazine reported that the 25 year-old mother of three worked, on average, an 80
hour week. Recently, many groups have been studying whether or not the women’s movement has,
in fact, resulted in an increase in the average work week for women (combining employment and
at-home work). Suppose a study was done to determine if the mean work week has increased. 81
women were surveyed with the following results. The sample mean was 83; the sample standard
deviation was ten. Does it appear that the mean work week has increased for women at the 5%
level?
53. Your statistics instructor claims that 60 percent of the students who take her Elementary
Statistics class go through life feeling more enriched. For some reason that she can’t quite figure
out, most people don’t believe her. You decide to check this out on your own. You randomly survey
64 of her past Elementary Statistics students and find that 34 feel more enriched as a result of her
class. Now, what do you think? Use a 5% significance level.
54. A Nissan Motor Corporation advertisement read, “The average man’s I.Q. is 107. The average
brown trout’s I.Q. is 4. So why can’t man catch brown trout?” Suppose you believe that the brown
trout’s mean I.Q. is greater than four. You catch 12 brown trout. A fish psychologist determines the
I.Q.s as follows: 5; 4; 7; 3; 6; 4; 5; 3; 6; 3; 8; 5. Conduct a hypothesis test of your belief. Use a 5%
significance level.
55. The mean work week for engineers in a start-up company is believed to be about 60 hours. A
newly hired engineer hopes that it’s shorter. She asks ten engineering friends in start-ups for the
lengths of their mean work weeks. Based on the results that follow, should she count on the mean
work week to be shorter than 60 hours? Use a 5% significance level.
Data (length of mean work week): 70; 45; 55; 60; 65; 55; 55; 60; 50; 55.
56. Toastmasters International cites a report by Gallup Poll that 40% of Americans fear public
speaking. A student believes that less than 40% of students at her school fear public speaking. She
randomly surveys 361 schoolmates and finds that 135 report they fear public speaking. Conduct a
hypothesis test to determine if the percent at her school is less than 40%. Use a 1% significance
level.
57. According to an article in Bloomberg Businessweek, New York City’s most recent adult smoking
rate is 14%. Suppose that a survey is conducted to determine this year’s rate. Nine out of 70
randomly chosen N.Y. City residents reply that they smoke. Conduct a hypothesis test to determine
if the rate is still 14% or if it has decreased. Use a 1% significance level.
58. The mean age of De Anza College students in a previous term was 26.6 years old. An instructor
thinks the mean age for online students is older than 26.6. She randomly surveys 56 online students
and finds that the sample mean is 29.4 with a standard deviation of 2.1. Conduct a hypothesis test.
Use a 5% significance level.
59. Registered nurses earned an average annual salary of $69,110. For that same year, a survey was
conducted of 41 California registered nurses to determine if the annual salary is higher than $69,110
for California nurses. The sample average was $71,121 with a sample standard deviation of $7,489.
Conduct a hypothesis test. Use a 5% significance level.
60. Previously, an organization reported that teenagers spent 4.5 hours per week, on average, on
the phone. The organization thinks that, currently, the mean is higher. Fifteen randomly chosen
teenagers were asked how many hours per week they spend on the phone. The sample mean
was 4.75 hours with a sample standard deviation of 2.0. Conduct a hypothesis test. Use a 5%
significance level.
61. According to the Centers for Disease Control website, in 2011, 18% of high school students
have smoked a cigarette. An Introduction to Statistics class in Davies County, KY conducted a
hypothesis test at the local high school (a medium sized–approximately 1,200 students–small city
demographic) to determine if the local high school’s percentage was lower. One hundred fifty
students were chosen at random and surveyed. Of the 150 students surveyed, 82 have smoked. Use
a significance level of 0.05 and using appropriate statistical evidence, conduct a hypothesis test and
state the conclusions.
62. A recent survey in the N.Y. Times Almanac indicated that 48.8% of families own stock. A broker
wanted to determine if this survey could be valid. He surveyed a random sample of 250 families
and found that 142 owned some type of stock. At the 0.05 significance level, can the survey be
considered to be accurate?
63. Driver error can be listed as the cause of approximately 54% of all fatal auto accidents, according
to the American Automobile Association. Thirty randomly selected fatal accidents are examined,
and it is determined that 14 were caused by driver error. Using α = 0.05, is the AAA proportion
accurate?
64. For Americans using library services, the American Library Association claims that 67% of
patrons borrow books. The library director in Owensboro, Kentucky feels this is not true, so she
asked a local college statistics class to conduct a survey. The class randomly selected 100 patrons
and found that 82 borrowed books. Did the class demonstrate that the percentage was higher in
Owensboro, KY? Use α = 0.01 level of significance. What is the possible proportion of patrons that
do borrow books from the Owensboro Library?
65. The Weather Underground reported that the mean amount of summer rainfall for the
northeastern US is 11.52 inches. Ten cities in the northeast are randomly selected and the mean
rainfall amount is calculated to be 7.42 inches with a standard deviation of 1.3 inches. At the α =
0.05 level, can it be concluded that the mean rainfall was below the reported average? What if α =
0.01? Assume the amount of summer rainfall follows a normal distribution.
66. A survey in the N.Y. Times Almanac finds the mean commute time (one way) is 25.4 minutes
for the 15 largest US cities. The Austin, TX chamber of commerce feels that Austin’s commute
time is less and wants to publicize this fact. The mean for 25 randomly selected commuters is 22.1
minutes with a standard deviation of 5.3 minutes. At the α = 0.10 level, is the Austin, TX commute
significantly less than the mean commute time for the 15 largest US cities?
Attribution
Chapter Outline
If you want to test a claim that involves two groups (the types of breakfasts eaten east and west of the
Mississippi River) you can use a slightly different technique when conducting a hypothesis test. Photo by
Chloe Lim, CC BY 4.0.
Studies often compare two groups. For example, researchers are interested in the effect aspirin
has in preventing heart attacks. Over the last few years, newspapers and magazines have reported
various aspirin studies involving two groups. Typically, one group is given aspirin and the other
group is given a placebo. Then, the heart attack rate is studied over several years.
There are other situations that deal with the comparison of two groups. For example, studies
compare various diet and exercise programs. Politicians compare the proportion of individuals from
different income brackets who might vote for them. Students are interested in whether SAT or GRE
preparatory courses really help raise their test scores.
9.1 INTRODUCTION TO STATISTICAL INFERENCE WITH TWO POPULATIONS
Previously, we learned to construct confidence intervals and conduct hypothesis tests on single
means and single proportions. We will extend these ideas in this chapter so that we can compare two
means or two proportions to each other. The general procedures are similar to those for any
confidence interval or hypothesis test, following the same basic steps we have already learned,
just expanded to include the cases of studying two population parameters.
To compare two means or two proportions, we work with two populations. The groups are
classified either as independent or matched pairs. Independent groups consist of two samples
that are independent. That is, one population is independent of the other if the sample values
selected from one population are not related in any way to sample values selected from the other
population. Matched pairs consist of two samples that are dependent. That is, there is some
relationship between the samples selected from the two populations. In this book, independent
groups are used for either two population means or two population proportions and matched pairs
are for two population means.
Attribution
LEARNING OBJECTIVES
• Construct and interpret a confidence interval for two population means with known
population standard deviations.
• Conduct and interpret hypothesis tests for two population means with known population
standard deviations.
The comparison of two population means is very common. Often, we want to find out if the two
populations under study have the same mean or if there is some difference in the two population
means. The approach we take when studying two population means depends on whether the
samples are independent or matched. In the case where the samples are independent, we also
have to contend with whether or not we know the population standard deviations.
Two populations are independent if the sample taken from population 1 is not related in any way
to the sample taken from population 2. In this situation, any relationship between the samples or
populations is entirely coincidental.
Throughout this section, we will use subscripts to identify the values for the means, sample sizes,
and standard deviations for the two populations:
9.2 STATISTICAL INFERENCE FOR TWO POPULATION MEANS WITH KNOWN POPULATION STANDARD DEVIATIONS
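                                 Population 1        Population 2
Population Mean                  $\mu_1$             $\mu_2$
Population Standard Deviation    $\sigma_1$          $\sigma_2$
Sample Size                      $n_1$               $n_2$
Sample Mean                      $\overline{x}_1$    $\overline{x}_2$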
In order to construct a confidence interval or conduct a hypothesis test on the difference in two
population means ($\mu_1-\mu_2$), we need to use the distribution of the difference in the sample
means $\overline{x}_1-\overline{x}_2$:
• The distribution of the difference in the sample means is normal if one of the following is
true:
◦ Both populations are normally distributed.
◦ The sample sizes are large enough ($n_1 \geq 30$ and $n_2 \geq 30$).
• Assuming the distribution of the difference of the sample means is normal, the $z$-score is
\begin{eqnarray*} z & = &
\frac{(\overline{x}_1-\overline{x}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}}
\end{eqnarray*}
Suppose a sample of size $n_1$ with sample mean $\overline{x}_1$ is taken from population 1 and a
sample of size $n_2$ with sample mean $\overline{x}_2$ is taken from population 2 where the
populations are independent and the population standard deviations, $\sigma_1$ and $\sigma_2$, are
known. The limits for the confidence interval with confidence level $C$ for the difference in the
population means $\mu_1-\mu_2$ are:
\begin{eqnarray*} \\ \mbox{Lower Limit} & = & \overline{x}_1-\overline{x}_2-z \times
\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}} \\ \\ \mbox{Upper Limit} & = &
\overline{x}_1-\overline{x}_2+z \times \sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}} \\
\end{eqnarray*}
where $z$ is the positive $z$-score of the standard normal distribution so that the area under the
curve in between $-z$ and $z$ is $C$.
NOTE
In order to construct the confidence interval for the difference in two population means with
independent samples, we need to check that the distribution of the difference in the sample
means follows a normal distribution. This means that we need to check that either the
populations are normal or that the sample sizes are large enough (greater than or equal to 30).
To find the $z$-score to construct a confidence interval with confidence level $C$, use the
norm.s.inv(area to the left of z) function.
• For area to the left of z, enter the entire area to the left of the $z$-score you are trying to
find. For a confidence interval, the area to the left of $z$ is $C+\frac{1-C}{2}$.
The output from the norm.s.inv function is the value of the $z$-score needed to construct the
confidence interval.
NOTE
The norm.s.inv function requires that we enter the entire area to the left of the unknown
$z$-score. This area includes the confidence level (the area in the middle of the distribution) plus the
remaining area in the left tail.
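As a cross-check outside Excel, the same $z$-score can be sketched in Python. This is a minimal illustration that assumes Python 3.8 or later, where the standard library provides statistics.NormalDist; the function name z_for_confidence is hypothetical, and the 95% confidence level is used only as a familiar illustration.

```python
from statistics import NormalDist


def z_for_confidence(confidence_level):
    """Positive z-score with `confidence_level` of the area between -z and z."""
    # Area to the left of z: the middle area plus the remaining left tail.
    area_to_left = confidence_level + (1 - confidence_level) / 2
    # inv_cdf on the standard normal plays the role of norm.s.inv.
    return NormalDist().inv_cdf(area_to_left)


z_95 = z_for_confidence(0.95)
print(round(z_95, 2))  # 1.96
```

Passing the confidence level rather than the left area keeps the tail adjustment in one place, which avoids the common mistake of entering the confidence level itself as the area to the left.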
EXAMPLE
A consumer advocacy group wants to study consumer satisfaction with their shopping experience at
the country’s two biggest retailers. The group surveyed consumers and asked them to rate one of the
retailers in a number of different categories. An overall satisfaction score out of 100 summarized the
responses for each consumer sampled. In a sample of 35 consumers for retailer A, the average
overall satisfaction score was 79. In a sample of 30 consumers for retailer B, the average overall
satisfaction score was 71. Based on prior experience with the satisfaction rating scale, the population
standard deviation for retailer A is assumed to be 10 and the population standard deviation for
retailer B is assumed to be 12.
1. Construct a 94% confidence interval for the difference in the mean satisfaction score for the
two retailers.
2. Interpret the confidence interval found in part 1.
3. Is there evidence to suggest that the mean satisfaction score for retailer A is greater than the
mean satisfaction score for retailer B? Explain.
Solution:
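1. Let retailer A be population 1 and retailer B be population 2. From the question, we have the
following information:
Retailer A               Retailer B
$n_1=35$                 $n_2=30$
$\overline{x}_1=79$      $\overline{x}_2=71$
$\sigma_1=10$            $\sigma_2=12$
The normal distribution applies because the sample sizes are both greater than or equal to 30.
To find the confidence interval, we need to find the $z$-score for the 94% confidence interval.
This means that we need to find the $z$-score so that the entire area to the left of $z$ is
$0.94+\frac{1-0.94}{2}=0.97$. Using norm.s.inv(0.97), $z \approx 1.8808$.
\begin{eqnarray*} \mbox{Lower Limit} & = & \overline{x}_1-\overline{x}_2-z \times
\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}} \\ & = & 79-71-1.8808 \times
\sqrt{\frac{10^2}{35}+\frac{12^2}{30}} \\ & = & 2.796 \\ \\ \mbox{Upper Limit} & = &
\overline{x}_1-\overline{x}_2+z \times \sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}} \\
& = & 79-71+1.8808 \times \sqrt{\frac{10^2}{35}+\frac{12^2}{30}} \\ & = & 13.204
\end{eqnarray*}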
2. We are 94% confident that the difference in the mean satisfaction score for the two
retailers is between 2.796 and 13.204.
3. Because 0 is outside the confidence interval and both limits are positive, it suggests that the
difference in the means is greater than 0. That is, $\mu_1-\mu_2 \gt 0$ ($\mu_1 \gt \mu_2$). This
suggests that the mean for population 1 (retailer A) is greater than the mean for population 2
(retailer B). So the mean satisfaction score for retailer A is greater than the mean satisfaction
score for retailer B.
NOTES
1. When calculating the limits for the confidence interval, keep all of the decimals in the
$z$-score and other values throughout the calculation. This will ensure that there is no round-off
error in the answers. You can use Excel to do the calculation of the limits, clicking on the cells
containing the $z$-score or any other values, to ensure that all of the decimal places are used in
the calculation.
2. When writing down the interpretation of the confidence interval, make sure to include the
confidence level, the actual difference in the population means captured by the confidence
interval (i.e. be specific to the context of the question), and appropriate units for the limits.
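The confidence interval from the retailer example can also be reproduced with a short Python sketch. It assumes Python 3.8 or later for statistics.NormalDist; the function name two_mean_ci and its parameter names are illustrative and not part of the text. The $z$-score is kept at full precision, in line with the first note above.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_ci(mean1, mean2, sigma1, sigma2, n1, n2, confidence_level):
    """Confidence limits for mu1 - mu2 with known population standard deviations."""
    # z-score with the confidence level as the middle area (like norm.s.inv).
    z = NormalDist().inv_cdf(confidence_level + (1 - confidence_level) / 2)
    # Standard deviation of the difference in the sample means.
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    difference = mean1 - mean2
    return difference - z * standard_error, difference + z * standard_error


# Retailer A (population 1) and retailer B (population 2), 94% confidence.
lower, upper = two_mean_ci(79, 71, 10, 12, 35, 30, 0.94)
print(round(lower, 3), round(upper, 3))  # 2.796 13.204
```

Both limits are positive, which matches the reasoning in part 3 of the example.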
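1. Write down the null hypothesis that there is no difference in the population means:
\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \end{eqnarray*}
2. Write down the alternative hypothesis in terms of the difference in the population means. The
alternative hypothesis is one of $\mu_1-\mu_2 \lt 0$, $\mu_1-\mu_2 \gt 0$, or $\mu_1-\mu_2 \neq 0$.
3. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
4. Collect the sample information for the test and identify the significance level.
5. Assuming the population standard deviations are known, use the normal distribution to find
the p-value (the area in the corresponding tail) for the test. The $z$-score is
\begin{eqnarray*} z & = &
\frac{(\overline{x}_1-\overline{x}_2)-(\mu_1-\mu_2)}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}}
\end{eqnarray*}
6. Compare the p-value to the significance level and state the outcome of the test:
◦ If the p-value is less than or equal to the significance level $\alpha$, reject the null
hypothesis in favour of the alternative hypothesis.
◦ If the p-value is greater than $\alpha$, do not reject the null hypothesis.
7. Write down a concluding sentence specific to the context of the question.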
Assuming that the population standard deviations are known, the p-value for a hypothesis test on the
difference in two independent population means is the area in the tail(s) of a normal distribution,
so the norm.dist(x, $\mu$, $\sigma$, logic operator) function can be used to calculate the p-value.
For this test, x is the difference in the sample means $\overline{x}_1-\overline{x}_2$, $\mu$ is the
value of $\mu_1-\mu_2$ from the null hypothesis, $\sigma$ is
$\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}$, and the logic operator is true.
As with the previous chapter, use the appropriate technique with the norm.dist function to find the
area in the left tail, the area in the right tail, or the sum of the areas in the two tails.
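The three tail cases can be sketched in Python as a single helper. This is an illustration only, assuming Python 3.8 or later for statistics.NormalDist; the function name two_mean_p_value and the tail labels "left", "right", and "two" are hypothetical choices for this sketch. The two-tailed case doubles the smaller tail, which equals the sum of the two tail areas because the normal distribution is symmetric.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_p_value(mean1, mean2, sigma1, sigma2, n1, n2, tail):
    """p-value for H0: mu1 - mu2 = 0 with known population standard deviations."""
    # Standard deviation of the difference in the sample means.
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    # Under H0 the difference in sample means is centred at 0.
    sampling_dist = NormalDist(mu=0, sigma=standard_error)
    # Area to the left of the observed difference, like norm.dist(x, 0, sigma, true).
    left_area = sampling_dist.cdf(mean1 - mean2)
    if tail == "left":
        return left_area
    if tail == "right":
        return 1 - left_area
    if tail == "two":
        return 2 * min(left_area, 1 - left_area)
    raise ValueError("tail must be 'left', 'right', or 'two'")


# Hypothetical inputs (not from the text): equal sample means give a left-tail area of 0.5.
print(two_mean_p_value(10, 10, 2, 3, 40, 50, "left"))  # 0.5
```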
EXAMPLE
A floor cleaning company has been using Wax 1 to wax floors for a long time. A new floor wax, Wax
2, has recently come on the market with the claim that it is longer lasting than Wax 1. The company
wants to investigate this claim. The company waxed a sample of 20 floors with Wax 1 and found the
average number of months the wax lasted was 2.7 months. The company waxed a sample of 20 floors
with Wax 2 and found the average number of months the wax lasted was 2.9 months. Based on
previous information, the standard deviation for the length of time Wax 1 lasts is 0.33 months and
the standard deviation for the length of time Wax 2 lasts is 0.36 months. Both populations have
normal distributions. At the 5% significance level, test if Wax 2 lasts longer, on average, than Wax 1.
Solution:
Let Wax 1 be population 1 and Wax 2 be population 2. These populations are independent because
there is no relationship between the length of time each type of wax lasts. From the question, we
have the following information:
Wax 1                     Wax 2
$n_1=20$                  $n_2=20$
$\overline{x}_1=2.7$      $\overline{x}_2=2.9$
$\sigma_1=0.33$           $\sigma_2=0.36$
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \\ H_a: & & \mu_1-\mu_2 \lt 0
\end{eqnarray*}
p-value:
This is a test on the difference in two population means where the population standard deviations
are known. So we use a normal distribution to calculate the p-value. Because the alternative
hypothesis is a $\lt$, the p-value is the area in the left tail of the distribution.
Function    norm.dist                    Answer
Field 1     2.7-2.9                      0.0335
Field 2     0
Field 3     sqrt(0.33^2/20+0.36^2/20)
Field 4     true
So the p-value $=0.0335$.
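Conclusion:
Because p-value $=0.0335 \lt 0.05=\alpha$, we reject the null hypothesis in favour of the
alternative hypothesis. At the 5% significance level there is enough evidence to suggest that Wax 2
lasts longer, on average, than Wax 1.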
NOTES
1. The null hypothesis $\mu_1-\mu_2=0$ is the claim that the mean number of months for Wax 1
equals the mean number of months for Wax 2. That is, the two types of waxes have the same
mean.
2. The alternative hypothesis $\mu_1-\mu_2 \lt 0$ is the claim that the mean for Wax 1 is less than
the mean for Wax 2 ($\mu_1 \lt \mu_2$). This is the same as saying that the mean for Wax 2 is larger
than the mean for Wax 1.
3. The p-value is the area in the left tail of the normal distribution. In the calculation of the
p-value:
◦ The function is norm.dist because we are finding the area in the left tail of a normal
distribution.
◦ Field 1 is the value of $\overline{x}_1-\overline{x}_2$.
◦ Field 2 is 0, the value of $\mu_1-\mu_2$ from the null hypothesis. Remember, we run the
test assuming the null hypothesis is true, so that means we assume $\mu_1-\mu_2=0$.
◦ Field 3 is the standard deviation for the difference in the sample means,
$\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}$.
4. The p-value of 0.0335 is a small probability compared to the significance level, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis. In other words, the mean
number of months for Wax 1 is less than the mean number of months for Wax 2. For the
company, this suggests that they should switch to Wax 2 because it is longer lasting than
Wax 1.
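The left-tail p-value in the wax example can be checked with a few lines of Python. This is a minimal sketch assuming Python 3.8 or later for statistics.NormalDist; the variable names are illustrative. It follows the norm.dist calculation field by field: the observed difference in sample means, a mean of 0 from the null hypothesis, and the standard deviation of the difference in sample means.

```python
from math import sqrt
from statistics import NormalDist

# Field 3: standard deviation of the difference in the sample means.
standard_error = sqrt(0.33**2 / 20 + 0.36**2 / 20)
# Field 1: observed difference in sample means (Wax 1 minus Wax 2).
observed_difference = 2.7 - 2.9
# Field 2 is 0 (the null hypothesis); the cdf gives the area in the left tail.
p_value = NormalDist(mu=0, sigma=standard_error).cdf(observed_difference)

alpha = 0.05  # 5% significance level from the question
print(round(p_value, 4), p_value < alpha)  # 0.0335 True
```

Because the p-value is below the significance level, the sketch reaches the same decision as the notes above: reject the null hypothesis.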
EXAMPLE
A consumer advocacy group wants to compare the revolutions per minute (RPM) for two
different engines. The group believes that Engine A has a higher average RPM than
Engine B. In a sample of 40 Engine A’s, the sample mean number of RPMs was 1550. In a
sample of 30 Engine B’s, the sample mean number of RPMs was 1500. Based on previous
information, the standard deviation for the RPMs for Engine A is 75 and the standard
deviation for Engine B is 65. At the 1% significance level, is the average RPM for Engine A
higher than for Engine B?
Solution:
Let Engine A be population 1 and Engine B be population 2. These populations are independent
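because there is no relationship between the RPMs for the two engines. From the question, we have
the following information:
Engine A: $n_1=40$, $\overline{x}_1=1550$, $\sigma_1=75$
Engine B: $n_2=30$, $\overline{x}_2=1500$, $\sigma_2=65$
Significance level: $\alpha=0.01$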
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \\ H_a: & & \mu_1-\mu_2 \gt 0
\end{eqnarray*}
p-value:
This is a test on the difference in two population means where the population standard deviations
are known. So we use a normal distribution to calculate the p-value. Because the alternative
hypothesis is a $\gt$, the p-value is the area in the right tail of the distribution.
Function 1-norm.dist
Field 1 1550-1500
Field 2 0
Field 3 sqrt(75^2/40+65^2/30)
Field 4 true
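So the p-value $=0.0014$.
Conclusion:
Because the p-value $=0.0014 \lt 0.01=\alpha$, we reject the null hypothesis in favour of the
alternative hypothesis. At the 1% significance level, there is enough evidence to suggest that the
mean RPM for Engine A is greater than the mean RPM for Engine B.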
NOTES
1. The null hypothesis $\mu_1-\mu_2=0$ is the claim that the mean RPM for Engine A equals the
mean RPM for Engine B. That is, the two engines have the same average RPM.
2. The alternative hypothesis $\mu_1-\mu_2 \gt 0$ is the claim that the mean RPM for Engine A is
greater than the mean RPM for Engine B ($\mu_1 \gt \mu_2$).
3. The p-value is the area in the right tail of the normal distribution. In the calculation of the
p-value:
◦ The function is 1-norm.dist because we are finding the area in the right tail of a normal
distribution.
◦ Field 1 is the value of $\overline{x}_1-\overline{x}_2$.
◦ Field 2 is 0, the value of $\mu_1-\mu_2$ from the null hypothesis. Remember, we run the
test assuming the null hypothesis is true, so that means we assume $\mu_1-\mu_2=0$.
◦ Field 3 is the standard deviation for the difference in the sample means,
$\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$.
4. The p-value of 0.0014 is a small probability compared to the significance level, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis. In other words, the mean
RPM for Engine A is greater than the mean RPM for Engine B, just as the consumer advocacy
group claimed.
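The right-tailed calculation in the engine example can also be checked outside Excel. The following is a minimal sketch in Python, which is an assumption on our part since the text itself works only in Excel. The function name right_tail_p_value is hypothetical, and NormalDist.cdf from Python's standard library plays the role of norm.dist with the logic operator set to true.

```python
from math import sqrt
from statistics import NormalDist


def right_tail_p_value(x1_bar, x2_bar, sigma1, sigma2, n1, n2):
    """Area to the right of (x1_bar - x2_bar), assuming mu1 - mu2 = 0."""
    # Standard deviation of the difference in the sample means (Field 3).
    std_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    # Distribution of the difference in sample means under the null hypothesis.
    null_dist = NormalDist(mu=0, sigma=std_error)
    # Right tail = 1 - left tail, mirroring 1-norm.dist in Excel.
    return 1 - null_dist.cdf(x1_bar - x2_bar)


# Engine A is population 1 and Engine B is population 2.
p_value = right_tail_p_value(1550, 1500, 75, 65, 40, 30)
print(round(p_value, 4))  # 0.0014
```

Because 0.0014 is less than the 1% significance level, the sketch leads to the same decision as the Excel calculation: reject the null hypothesis.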
EXAMPLE
The student union at a local college owns two coffee shops on campus: The Study Cafe
and Coffee&Books. The student union wants to find out if there is a difference in the average
amount students spend per transaction at each of the coffee shops. In a sample of 65
transactions at the Study Cafe, the average amount spent was $9.40. In a sample of 50
transactions at Coffee&Books, the average amount spent was $10.15. Based on previous
information, the standard deviation for the amount spent at the Study Cafe is $1.35 and the
standard deviation for the amount spent at Coffee&Books is $2.70.
Solution:
Let the Study Cafe be population 1 and Coffee&Books be population 2. These populations are
independent because there is no relationship between the amount spent at each coffee shop. From
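the question, we have the following information:
Study Cafe: $n_1=65$, $\overline{x}_1=9.40$, $\sigma_1=1.35$
Coffee&Books: $n_2=50$, $\overline{x}_2=10.15$, $\sigma_2=2.70$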
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \\ H_a: & & \mu_1-\mu_2 \neq 0
\end{eqnarray*}
p-value:
This is a test on the difference in two population means where the population standard deviations
are known. So we use a normal distribution to calculate the p-value. Because the alternative
hypothesis is a $\neq$, the p-value is the sum of the area in the two tails of the distribution.
We need to know if the sample information relates to the left or right tail because that will determine
how we calculate the area of that tail using the normal distribution. In this case,
$\overline{x}_1 \lt \overline{x}_2$ ($9.40 \lt 10.15$), so the sample information relates to the left
tail of the normal distribution. This means that we will calculate the area in the left tail using
norm.dist. However, this is a two-tailed test, so the area in the left tail is only one half of the
p-value. The area in the left tail equals the area in the right tail and the p-value is the
sum of these two areas.
Function norm.dist
Field 1 9.40-10.15
Field 2 0
Field 3 sqrt(1.35^2/65+2.7^2/50)
Field 4 true
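So the area in the left tail is 0.0360, which means $\frac{1}{2}$(p-value) $=0.0360$. This is also the
area in the right tail, so
p-value $=0.0360+0.0360=0.0720$
Conclusion:
Because the p-value of 0.0720 is greater than the significance level, we do not reject the null
hypothesis.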
NOTES
1. The null hypothesis $\mu_1-\mu_2=0$ is the claim that the mean amount spent at the Study
Cafe equals the mean amount spent at Coffee&Books. That is, the average amount spent is
the same at both coffee shops.
2. The alternative hypothesis $\mu_1-\mu_2 \neq 0$ is the claim that the mean amount spent at the
Study Cafe is different from the mean amount spent at Coffee&Books ($\mu_1 \neq \mu_2$).
3. In a two-tailed hypothesis test that uses the normal distribution, we will only have sample
information relating to one of the two tails. We must determine which of the tails the sample
information belongs to, and then calculate the area in that tail. The area in each tail
represents exactly half of the p-value, so the p-value is the sum of the areas in the two tails.
◦ If the sample mean $\overline{x}_1$ is less than the sample mean $\overline{x}_2$
($\overline{x}_1 \lt \overline{x}_2$), the sample information belongs to the left tail.
▪ Use norm.dist to find the area in the
left tail. The area in the right tail equals the area in the left tail, so we can find
the p-value by adding the output from this function to itself.
◦ If the sample mean $\overline{x}_1$ is greater than the sample mean $\overline{x}_2$
($\overline{x}_1 \gt \overline{x}_2$), the sample information belongs to the right tail.
▪ Use 1-norm.dist to find the area in the
right tail. The area in the left tail equals the area in the right tail, so we can find
the p-value by adding the output from this function to itself.
4. The p-value of 0.0720 is a large probability compared to the significance level, and so is likely
to happen assuming the null hypothesis is true. The sample therefore does not provide enough
evidence against the null hypothesis, and so the conclusion of the test is to not reject
the null hypothesis. In other words, there is not enough evidence to conclude that the mean
amount spent at the Study Cafe is different from the mean amount spent at Coffee&Books.
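The tail-selection rule in Note 3 can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the text uses Excel, the function name two_tailed_p_value is hypothetical, and the Coffee&Books standard deviation of 2.70 is the value that appears in Field 3 of the calculation above.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_p_value(x1_bar, x2_bar, sigma1, sigma2, n1, n2):
    """Two-tailed p-value for H0: mu1 - mu2 = 0 with known sigmas."""
    std_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    null_dist = NormalDist(mu=0, sigma=std_error)
    # Area to the left of the observed difference in sample means.
    left_area = null_dist.cdf(x1_bar - x2_bar)
    # The sample information belongs to whichever tail is smaller:
    # the left tail when x1_bar < x2_bar, the right tail otherwise.
    one_tail = min(left_area, 1 - left_area)
    # Both tails have the same area, so add the tail area to itself.
    return 2 * one_tail


# Study Cafe is population 1 and Coffee&Books is population 2.
coffee_p = two_tailed_p_value(9.40, 10.15, 1.35, 2.70, 65, 50)
print(round(coffee_p, 4))
```

Taking the smaller of the two areas means the same function handles both cases in Note 3, so there is no need to check by hand which sample mean is larger.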
Watch this video: Confidence Intervals for Two Population Means, Sigma Known by ExcelIsFun [9:52]
(view online at https://ecampusontario.pressbooks.pub/introstats/?p=207#oembed-1)
Watch this video: Hypothesis Testing for Two Population Means, Sigma Known by ExcelIsFun [16:47]
(view online at https://ecampusontario.pressbooks.pub/introstats/?p=207#oembed-2)
Concept Review
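The general form of a confidence interval for the difference in two independent population means
with known population standard deviations is
\begin{eqnarray*} \mbox{Lower Limit} & = &
\overline{x}_1-\overline{x}_2-z \times \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \\
\mbox{Upper Limit} & = &
\overline{x}_1-\overline{x}_2+z \times \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}
\end{eqnarray*}
where $z$ is the positive $z$-score of the standard normal distribution so the area under the normal
distribution in between $-z$ and $z$ is $C$.
The hypothesis test for the difference in two independent population means with known
population standard deviations is a well-established process:
1. Write down the null and alternative hypotheses in terms of the difference in the
population means $\mu_1-\mu_2$.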
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-
tailed, or two-tailed.
3. Collect the sample information for the test and identify the significance level.
4. Find the p-value (the area in the corresponding tail) for the test. Because the population
standard deviations are known, we use the normal distribution to find the p-value.
5. Compare the p-value to the significance level and state the outcome of the test.
6. Write down a concluding sentence specific to the context of the question.
Attribution
“10.2 Two Population Means with Known Standard Deviations” in Introductory Statistics by
OpenStax is licensed under a Creative Commons Attribution 4.0 International License.
9.3 STATISTICAL INFERENCE FOR TWO
POPULATION MEANS WITH UNKNOWN
POPULATION STANDARD DEVIATIONS
LEARNING OBJECTIVES
• Construct and interpret a confidence interval for two population means with unknown
population standard deviations.
• Conduct and interpret hypothesis tests for two population means with unknown population
standard deviations.
The comparison of two population means is very common. Often, we want to find out if the two
populations under study have the same mean or if there is some difference in the two population
means. The approach we take when studying two population means depends on whether the
samples are independent or matched. When the samples are independent, we also have to
contend with whether or not we know the population standard deviations.
Two populations are independent if the sample taken from population 1 is not related in any way
to the sample taken from population 2. In this situation, any relationship between the samples or
populations is entirely coincidental.
Throughout this section, we will use subscripts to identify the values for the means, sample sizes,
and standard deviations for the two populations:
Symbol for: Population 1, Population 2
Population Mean: $\mu_1$, $\mu_2$
Population Standard Deviation: $\sigma_1$, $\sigma_2$
Sample Size: $n_1$, $n_2$
Sample Mean: $\overline{x}_1$, $\overline{x}_2$
In order to construct a confidence interval or conduct a hypothesis test on the difference in two
population means ($\mu_1-\mu_2$), we need to use the distribution of the difference in the sample means
$\overline{x}_1-\overline{x}_2$:
• The distribution of the difference in the sample means is normal if one of the following is
true:
◦ Both populations are normally distributed.
◦ The sample sizes are large enough ($n_1 \geq 30$ and $n_2 \geq 30$).
• Assuming the distribution of the difference of the sample means is normal, the $z$-score is
\displaystyle{z = \frac{(\overline{x}_1-\overline{x}_2)-
(\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}}
As we have seen previously when working with confidence intervals and hypothesis testing for
a single population, when the population standard deviation is unknown and we must use the
sample standard deviation as an estimate for the population standard deviation, we use a
$t$-distribution. We do the same thing when working with two population means. When the
population standard deviations are unknown, we use the sample standard deviations $s_1$ and $s_2$
as estimates for the population standard deviations $\sigma_1$ and $\sigma_2$, and we use a
$t$-distribution for the distribution of the difference in the sample means. The $t$-score and the
degrees of freedom are:
\begin{eqnarray*} t & = & \frac{(\overline{x}_1-\overline{x}_2)-
(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \\ \\ df & = &
\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times
\left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2}
\end{eqnarray*}
The degrees of freedom formula is somewhat complicated, but a computer makes the
calculation more manageable. The output from the degrees of freedom formula is rarely a
whole number. After calculating the value of $df$ using the above formula, round the output
down to the next whole number to get the degrees of freedom for the $t$-distribution.
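To make the rounding rule concrete, the degrees of freedom formula can be written as a small function. The sketch below uses Python rather than Excel, which is our assumption, and the name welch_df and the input values are hypothetical, chosen only for illustration and not taken from the text.

```python
import math


def welch_df(s1, n1, s2, n2):
    """Unrounded degrees of freedom from the formula above."""
    a = s1**2 / n1  # s_1^2 / n_1
    b = s2**2 / n2  # s_2^2 / n_2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))


# Hypothetical inputs for illustration only (not from the text).
raw_df = welch_df(s1=2.0, n1=10, s2=1.0, n2=10)
df = math.floor(raw_df)  # round DOWN to the next whole number
print(raw_df, df)
```

With these inputs the formula gives a value a little above 13, so the degrees of freedom used for the $t$-distribution would be 13. As a sanity check on the formula, when both samples have the same size and the same standard deviation the unrounded value equals $n_1+n_2-2$.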
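Suppose a sample of size $n_1$ with sample mean $\overline{x}_1$ and standard deviation $s_1$ is taken from
population 1 and a sample of size $n_2$ with sample mean $\overline{x}_2$ and standard deviation $s_2$ is taken from
population 2 where the populations are independent and the population standard deviations are
unknown. The limits for the confidence interval with confidence level $C$ for the difference in the
population means $\mu_1-\mu_2$ are:
\begin{eqnarray*} \mbox{Lower Limit} & = &
\overline{x}_1-\overline{x}_2-t \times \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \\
\mbox{Upper Limit} & = &
\overline{x}_1-\overline{x}_2+t \times \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}
\end{eqnarray*}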
NOTES
1. In order to construct the confidence interval for the difference in two population means
with independent samples, we need to check that the distribution of the difference in the
sample means follows a normal distribution. This means that we need to check that either
the populations are normal or that the sample sizes are large enough (greater than or equal
to 30).
2. When the population standard deviations are unknown, we must use a $t$-distribution in
the construction of the confidence interval.
3. The value of the degrees of freedom must be a whole number. After using the formula,
remember to round the value down to the next whole number to get the required degrees
of freedom for the $t$-distribution.
To find the $t$-score to construct a confidence interval with confidence level $C$, use the
t.inv.2t(area in the tails, degrees of freedom) function.
• For area in the tails, enter the sum of the area in the tails of the $t$-distribution. For a
confidence interval, the area in the tails is $1-C$.
• For degrees of freedom, enter the degrees of freedom calculated using
\displaystyle{df = \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1}
\times \left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2}}
.
The output from the t.inv.2t function is the value of the $t$-score needed to construct the confidence
interval.
NOTE
1. The t.inv.2t function requires that we enter the sum of the area in both tails. The area in the
middle of the distribution is the confidence level $C$, so the sum of the area in both tails is the
leftover area $1-C$.
2. The degrees of freedom for a $t$-distribution must be a whole number. The output from the
degrees of freedom formula
\displaystyle{df = \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1}
\times \left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times
\left(\frac{s_2^2}{n_2}\right)^2}}
is almost never a whole number. After calculating the value of $df$ using the formula, round
the value down to the next whole number. Remember to enter the rounded down
value of $df$ for the degrees of freedom in the t.inv.2t function.
EXAMPLE
A company that manufactures and services photocopiers wants to study the difference in the
average repair time for the two different models of photocopiers they make. In a sample of 60 repairs
of photocopier A, the mean repair time was 84.2 minutes with a standard deviation of 19.4 minutes.
In a sample of 70 repairs of photocopier B, the mean repair time was 91.6 minutes with a standard
deviation of 18.8 minutes.
1. Construct a 95% confidence interval for the difference in the mean repair time for the two
photocopiers.
2. Interpret the confidence interval found in part 1.
3. Is there evidence to suggest that the mean repair times for the photocopiers are the same?
Explain.
Solution:
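Photocopier A (population 1): $n_1=60$, $\overline{x}_1=84.2$, $s_1=19.4$
Photocopier B (population 2): $n_2=70$, $\overline{x}_2=91.6$, $s_2=18.8$
1. To find the confidence interval, we need to find the $t$-score for the 95% confidence interval.
This means that we need to find the $t$-score so that the area in the tails is $1-0.95=0.05$.
The degrees of freedom are
\begin{eqnarray*} df & = &
\frac{\left(\frac{19.4^2}{60}+\frac{18.8^2}{70}\right)^2}{\frac{1}{60-1} \times
\left(\frac{19.4^2}{60}\right)^2+\frac{1}{70-1} \times \left(\frac{18.8^2}{70}\right)^2}
= 123.68\ldots \Rightarrow 123 \end{eqnarray*}
Function t.inv.2t
Field 1 0.05
Field 2 123
Using the $t$-score from this function, the limits of the 95% confidence interval are
\begin{eqnarray*} \mbox{Lower Limit} & = &
84.2-91.6-t \times \sqrt{\frac{19.4^2}{60}+\frac{18.8^2}{70}} = -14.06 \\
\mbox{Upper Limit} & = &
84.2-91.6+t \times \sqrt{\frac{19.4^2}{60}+\frac{18.8^2}{70}} = -0.74 \end{eqnarray*}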
2. We are 95% confident that the difference in the mean repair time for the two
photocopiers is between -14.06 minutes and -0.74 minutes.
3. Because 0 is outside the confidence interval and both limits are negative, it suggests that the
difference in the means is less than 0. That is, $\mu_1-\mu_2 \lt 0$ ($\mu_1 \lt \mu_2$). This
suggests that the mean for population 1 (photocopier A) is less than the mean for population 2
(photocopier B). So the mean repair time for photocopier A is less than the mean repair time
for photocopier B.
NOTES
1. When calculating the limits for the confidence interval, keep all of the decimals in the $t$-score
and other values throughout the calculation. This will ensure that there is no round-off error
in the answers. You can use Excel to do the calculation of the limits, clicking on the cells
containing the $t$-score and any other values, to ensure that all of the decimal places are used
in the calculation.
2. When writing down the interpretation of the confidence interval, make sure to include the
confidence level, the actual difference in the population means captured by the confidence
interval (i.e. be specific to the context of the question), and appropriate units for the limits.
3. The value of the degrees of freedom must be a whole number. After using the formula,
remember to round the value down to the next whole number to get the required degrees of
freedom for the $t$-distribution.
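The photocopier interval can be reproduced with a short calculation. The sketch below is in Python, which is an assumption since the text works in Excel. Python's standard library has no equivalent of t.inv.2t, so the $t$-score is typed in as an assumed value, rounded to four decimals, that would come from t.inv.2t(0.05, 123). In Excel you would reference the cell instead so that no decimals are lost.

```python
import math

n1, x1_bar, s1 = 60, 84.2, 19.4  # photocopier A (population 1)
n2, x2_bar, s2 = 70, 91.6, 18.8  # photocopier B (population 2)

a, b = s1**2 / n1, s2**2 / n2

# Degrees of freedom, rounded down to the next whole number.
df = math.floor((a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1)))

# ASSUMPTION: rounded output of Excel's t.inv.2t(0.05, 123).
t_score = 1.9794

margin = t_score * math.sqrt(a + b)
lower = (x1_bar - x2_bar) - margin
upper = (x1_bar - x2_bar) + margin
print(df, round(lower, 2), round(upper, 2))  # 123 -14.06 -0.74
```

Even with the rounded $t$-score, the limits agree with the interval in the solution to two decimal places.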
The steps to conduct a hypothesis test on the difference in two independent population means with
unknown population standard deviations are:
1. Write down the null hypothesis that there is no difference in the population means:
\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \end{eqnarray*}
The null hypothesis is always the claim that the two population means are equal
($\mu_1=\mu_2$).
2. Write down the alternative hypothesis in terms of the difference in the population means.
The alternative hypothesis will be one of the following:
\begin{eqnarray*} \\ H_a: \mu_1-\mu_2 <0 & & (\mu_1 \lt \mu_2) \\ H_a:
\mu_1-\mu_2>0 & & (\mu_1 \gt \mu_2) \\ H_a: \mu_1-\mu_2 \neq 0 &
& (\mu_1 \neq \mu_2) \\ \\ \end{eqnarray*}
3. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
4. Collect the sample information for the test and identify the significance level.
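5. Assuming the population standard deviations are unknown, use a $t$-distribution to find the
p-value (the area in the corresponding tail) for the test. The $t$-score and degrees of freedom
are
\begin{eqnarray*} t & = & \frac{(\overline{x}_1-\overline{x}_2)-
(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \\ \\ df & = &
\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times
\left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2}
\end{eqnarray*}
6. Compare the p-value to the significance level and state the outcome of the test.
7. Write down a concluding sentence specific to the context of the question.
Assuming that the population standard deviations are unknown, the p-value for a hypothesis test on
the difference in two independent population means is the area in the tail(s) of the $t$-distribution.
• If the p-value is the area in the left tail, use the t.dist function to find the p-value. In the
t.dist(t-score, degrees of freedom, logic operator) function:
◦ For t-score, enter the value of $t$ calculated using
\displaystyle{t = \frac{(\overline{x}_1-\overline{x}_2)-
(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}}
.
◦ For degrees of freedom, enter the degrees of freedom calculated using
\displaystyle{df =
\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times
\left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2}}
.
◦ For the logic operator, enter true.
• If the p-value is the area in the right tail, use the t.dist.rt function to find the p-value. In the
t.dist.rt(t-score, degrees of freedom) function:
◦ For t-score, enter the value of $t$ calculated using the same $t$-score formula.
◦ For degrees of freedom, enter the degrees of freedom calculated using the same $df$ formula.
• If the p-value is the sum of the area in the two tails, use the t.dist.2t function to find the
p-value. In the t.dist.2t(t-score, degrees of freedom) function:
◦ For t-score, enter the absolute value of $t$ calculated using the same $t$-score formula. The
t.dist.2t function requires a positive $t$-score.
◦ For degrees of freedom, enter the degrees of freedom calculated using the same $df$ formula.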
NOTE
The degrees of freedom for a $t$-distribution must be a whole number. The output from the
degrees of freedom formula
\displaystyle{df = \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times
\left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2}}
is almost never a whole number. After calculating the value of $df$ using the formula, round the
value down to the next whole number. Remember to enter the rounded down value of $df$
for the degrees of freedom in the t.dist functions.
EXAMPLE
A researcher wants to study the difference between the average amount of time boys and girls aged
seven to eleven spend playing sports each day. In a sample of 9 girls, the average number of hours
spent playing sports per day is 2 hours with a standard deviation of 0.866 hours. In a sample of 16
boys, the average number of hours spent playing sports per day is 3.2 hours with a standard deviation
of 1 hour. Both populations have a normal distribution. At the 5% significance level, is there a
difference in the mean amount of time boys and girls aged seven to eleven play sports each day?
Solution:
Let girls be population 1 and boys be population 2. These populations are independent because there
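is no relationship between the two groups. From the question, we have the following information:
Girls: $n_1=9$, $\overline{x}_1=2$, $s_1=0.866$
Boys: $n_2=16$, $\overline{x}_2=3.2$, $s_2=1$
Significance level: $\alpha=0.05$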
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \\ H_a: & & \mu_1-\mu_2 \neq 0
\end{eqnarray*}
p-value:
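This is a test on the difference in two population means where the population standard deviations
are unknown. So we use a $t$-distribution to calculate the p-value. Because the alternative hypothesis
is a $\neq$, the p-value is the sum of areas in the tails of the distribution.
To use the t.dist.2t function, we need to calculate the $t$-score and the degrees of freedom:
\begin{eqnarray*} t & = & \frac{(2-3.2)-0}{\sqrt{\frac{0.866^2}{9}+\frac{1^2}{16}}} = -3.1423\ldots \\ \\
df & = & \frac{\left(\frac{0.866^2}{9}+\frac{1^2}{16}\right)^2}{\frac{1}{9-1} \times
\left(\frac{0.866^2}{9}\right)^2+\frac{1}{16-1} \times \left(\frac{1^2}{16}\right)^2}
= 18.84\ldots \Rightarrow 18 \end{eqnarray*}
Function t.dist.2t
Field 1 3.1423…
Field 2 18
So the p-value $=0.0056$.
Conclusion:
Because the p-value $=0.0056 \lt 0.05=\alpha$, we reject the null hypothesis in favour of the
alternative hypothesis. At the 5% significance level, there is enough evidence to suggest that there
is a difference in the mean amount of time boys and girls aged seven to eleven play sports each day.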
NOTES
1. The null hypothesis $\mu_1-\mu_2=0$ is the claim that there is no difference in the mean
amount of time boys and girls spend playing sports each day. That is, the two populations
have the same mean.
2. The alternative hypothesis $\mu_1-\mu_2 \neq 0$ is the claim that there is a difference in the mean
amount of time boys and girls spend playing sports each day ($\mu_1 \neq \mu_2$). That is, the two
populations have different means.
3. Keep all of the decimals throughout the calculation (i.e. in the $t$-score, etc.) to avoid any
round-off error in the calculation of the p-value. This ensures that we get the most accurate
value for the p-value. Use Excel to do the calculations, and then click on the cells in
subsequent calculations.
4. The value of the degrees of freedom must be a whole number. After using the formula,
remember to round the value down to the next whole number to get the required degrees of
freedom for the $t$-distribution.
5. The t.dist.2t function requires that the value entered for the $t$-score is positive. A negative
$t$-score entered into the t.dist.2t function generates an error in Excel. In this case, the value of
the $t$-score is negative, so we must enter the absolute value of this $t$-score into field 1.
6. The p-value of 0.0056 is a small probability compared to the significance level, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis. In other words, there is a
difference in the mean amount of time boys and girls spend playing sports each day.
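For readers who want to check the two-tailed p-value without Excel, one option is to integrate the density of the $t$-distribution numerically. The sketch below is in Python and is only a stand-in for t.dist.2t under our own assumptions: the function names t_pdf and two_tailed_p are hypothetical, and Simpson's rule with 2000 steps is an arbitrary choice that is more than accurate enough for four decimal places here.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t_score, df, steps=2000):
    """Sum of the areas in both tails beyond |t_score| (steps must be even)."""
    upper = abs(t_score)  # work with the positive t-score, as t.dist.2t does
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3  # Simpson's rule on [0, |t|]
    # Each tail is 0.5 minus the area between 0 and |t|.
    return 2 * (0.5 - area_zero_to_t)


n1, x1_bar, s1 = 9, 2.0, 0.866  # girls (population 1)
n2, x2_bar, s2 = 16, 3.2, 1.0   # boys (population 2)

a, b = s1**2 / n1, s2**2 / n2
t_score = (x1_bar - x2_bar) / math.sqrt(a + b)
df = math.floor((a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1)))
p_value = two_tailed_p(t_score, df)
print(round(t_score, 4), df, round(p_value, 4))
```

The $t$-score is negative because the girls' sample mean is smaller than the boys' sample mean. Taking the absolute value inside two_tailed_p mirrors the requirement in Note 5.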
EXAMPLE
A town has two colleges. A local community group believes that students who graduate from
College A have taken more math classes than the students who graduate from College B. In a
sample of 11 graduates from College A, the average is 4 math classes per graduate with a standard
deviation of 1.5 math classes. In a sample of 9 graduates from College B, the average is 3.5 math
classes per graduate with a standard deviation of 1 math class. Both populations have a normal
distribution. At the 1% significance level, test the community group's claim that graduates from
College A have taken more math classes than graduates from College B.
Solution:
Let College A be population 1 and College B be population 2. These populations are independent
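because there is no relationship between the two groups. From the question, we have the following
information:
College A: $n_1=11$, $\overline{x}_1=4$, $s_1=1.5$
College B: $n_2=9$, $\overline{x}_2=3.5$, $s_2=1$
Significance level: $\alpha=0.01$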
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \\ H_a: & & \mu_1-\mu_2 \gt 0
\end{eqnarray*}
p-value:
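This is a test on the difference in two population means where the population standard deviations
are unknown. So we use a $t$-distribution to calculate the p-value. Because the alternative hypothesis
is a $\gt$, the p-value is the area in the right tail of the distribution.
To use the t.dist.rt function, we need to calculate the $t$-score and the degrees of freedom:
\begin{eqnarray*} t & = & \frac{(4-3.5)-0}{\sqrt{\frac{1.5^2}{11}+\frac{1^2}{9}}} = 0.8899\ldots \\ \\
df & = & \frac{\left(\frac{1.5^2}{11}+\frac{1^2}{9}\right)^2}{\frac{1}{11-1} \times
\left(\frac{1.5^2}{11}\right)^2+\frac{1}{9-1} \times \left(\frac{1^2}{9}\right)^2}
= 17.39\ldots \Rightarrow 17 \end{eqnarray*}
Function t.dist.rt
Field 1 0.8899…
Field 2 17
So the p-value $=0.1930$.
Conclusion:
Because the p-value $=0.1930 \gt 0.01=\alpha$, we do not reject the null hypothesis. At the 1%
significance level, there is not enough evidence to suggest that graduates from College A have taken
more math classes, on average, than graduates from College B.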
NOTES
1. The null hypothesis $\mu_1-\mu_2=0$ is the claim that the average number of math classes
taken by graduates of College A equals the average number of math classes taken by
graduates of College B. That is, the two populations have the same mean.
2. The alternative hypothesis $\mu_1-\mu_2 \gt 0$ is the claim that, on average, graduates of College
A have taken more math classes than graduates of College B ($\mu_1 \gt \mu_2$).
3. Keep all of the decimals throughout the calculation (i.e. in the $t$-score, etc.) to avoid any
round-off error in the calculation of the p-value. This ensures that we get the most accurate
value for the p-value. Use Excel to do the calculations, and then click on the cells in
subsequent calculations.
4. The value of the degrees of freedom must be a whole number. After using the formula,
remember to round the value down to the next whole number to get the required degrees of
freedom for the $t$-distribution.
5. The p-value of 0.1930 is a large probability compared to the significance level, and so is likely
to happen assuming the null hypothesis is true. The sample therefore does not provide enough
evidence against the null hypothesis, and so the conclusion of the test is to not reject
the null hypothesis. In other words, there is not enough evidence to conclude that graduates of
College A take, on average, more math classes than graduates of College B.
EXAMPLE
A professor at a large community college taught both an online section and a face-to-face section of
his statistics course. The professor wants to study the difference in the average score on the final
exam, believing that the mean score for the online section would be lower than the face-to-face
section. The professor randomly selected 30 final exam scores from each section and recorded the
scores in the tables below.
Online Section:
67.6 41.2 85.3 55.9 82.4 91.2 73.5 94.1 64.7 64.7
70.6 38.2 61.8 88.2 70.6 58.8 91.2 73.5 82.4 35.5
94.1 88.2 64.7 55.9 88.2 97.1 85.3 61.8 79.4 79.4
Face-to-Face Section:
77.9 95.3 81.2 74.1 98.8 88.2 85.9 92.9 87.1 88.2
69.4 57.6 69.4 67.1 97.6 85.9 88.2 91.8 78.8 71.8
98.8 61.2 92.9 90.6 97.6 100 95.3 83.5 92.9 89.4
At the 5% significance level, is the mean of the final exam score for the online section lower than the
mean of the final exam score for the face-to-face section?
Solution:
Let the online section be population 1 and the face-to-face section be population 2. These
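populations are independent because there is no relationship between the two groups. From the
question, we have the following information:
Online: $n_1=30$, $\overline{x}_1=72.85$, $s_1=16.918\ldots$
Face-to-Face: $n_2=30$, $\overline{x}_2=84.98$, $s_2=11.714\ldots$
Significance level: $\alpha=0.05$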
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \\ H_a: & & \mu_1-\mu_2 \lt 0
\end{eqnarray*}
p-value:
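This is a test on the difference in two population means where the population standard deviations
are unknown. So we use a $t$-distribution to calculate the p-value. Because the alternative hypothesis
is a $\lt$, the p-value is the area in the left tail of the distribution.
To use the t.dist function, we need to calculate the $t$-score and the degrees of freedom from the
sample means and sample standard deviations using the $t$-score and $df$ formulas. This gives
$t=-3.2285\ldots$ and $df=51.6\ldots \Rightarrow 51$.
Function t.dist
Field 1 -3.2285…
Field 2 51
Field 3 true
So the p-value $=0.0011$.
Conclusion:
Because the p-value $=0.0011 \lt 0.05=\alpha$, we reject the null hypothesis in favour of the
alternative hypothesis. At the 5% significance level, there is enough evidence to suggest that the
mean final exam score for the online section is lower than for the face-to-face section.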
NOTES
1. The null hypothesis $\mu_1-\mu_2=0$ is the claim that the average final exam score is the
same for both sections. That is, the two populations have the same mean.
2. The alternative hypothesis $\mu_1-\mu_2 \lt 0$ is the claim that the average final exam score for the
online section is lower than the face-to-face section ($\mu_1 \lt \mu_2$).
3. Keep all of the decimals throughout the calculation (i.e. in the sample means, sample
standard deviations, in the $t$-score, etc.) to avoid any round-off error in the calculation of the
p-value. This ensures that we get the most accurate value for the p-value. Use Excel to do the
calculations, and then click on the cells in subsequent calculations.
4. The value of the degrees of freedom must be a whole number. After using the formula,
remember to round the value down to the next whole number to get the required degrees of
freedom for the $t$-distribution.
5. The p-value of 0.0011 is a small probability compared to the significance level, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis. In other words, the average
final exam score for the online section is lower than for the face-to-face section.
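Because this example starts from raw scores, the sample means and sample standard deviations have to be calculated before the $t$-score and degrees of freedom. A minimal sketch in Python (an assumption, as the text uses Excel) is shown below. It keeps every decimal in the intermediate values, in the spirit of Note 3, and uses statistics.stdev, which is the sample standard deviation.

```python
import math
from statistics import mean, stdev

online = [
    67.6, 41.2, 85.3, 55.9, 82.4, 91.2, 73.5, 94.1, 64.7, 64.7,
    70.6, 38.2, 61.8, 88.2, 70.6, 58.8, 91.2, 73.5, 82.4, 35.5,
    94.1, 88.2, 64.7, 55.9, 88.2, 97.1, 85.3, 61.8, 79.4, 79.4,
]
face_to_face = [
    77.9, 95.3, 81.2, 74.1, 98.8, 88.2, 85.9, 92.9, 87.1, 88.2,
    69.4, 57.6, 69.4, 67.1, 97.6, 85.9, 88.2, 91.8, 78.8, 71.8,
    98.8, 61.2, 92.9, 90.6, 97.6, 100, 95.3, 83.5, 92.9, 89.4,
]

n1, n2 = len(online), len(face_to_face)
x1_bar, x2_bar = mean(online), mean(face_to_face)
s1, s2 = stdev(online), stdev(face_to_face)  # sample standard deviations

a, b = s1**2 / n1, s2**2 / n2
t_score = (x1_bar - x2_bar) / math.sqrt(a + b)
# Round the degrees of freedom DOWN to the next whole number.
df = math.floor((a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1)))

print(round(x1_bar, 2), round(x2_bar, 2), round(t_score, 4), df)
```

The degrees of freedom from this calculation match the value of 51 entered in Field 2 above.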
TRY IT
A study is done to determine if Company A retains its workers longer than Company B. Company A
samples 15 workers, and their average time with the company is 5 years with a standard deviation of
1.2 years. Company B samples 20 workers, and their average time with the company is 4.5 years with
a standard deviation of 0.8 years. The populations are normally distributed. At the 5% significance
level, on average, do workers at Company A stay longer than workers at Company B?
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_1-\mu_2=0 \\ H_a: & & \mu_1-\mu_2 \gt 0
\end{eqnarray*}
p-value:
Function t.dist.rt
Field 1 1.3975…
Field 2 23
Conclusion:
Watch this video: Confidence Intervals for Two Population Means, Sigma Unknown by ExcelIsFun [16:11]
(view online at https://ecampusontario.pressbooks.pub/introstats/?p=209#oembed-1)
Watch this video: Hypothesis Testing for Two Population Means, Sigma Unknown by ExcelIsFun [17:29]
(view online at https://ecampusontario.pressbooks.pub/introstats/?p=209#oembed-2)
Concept Review
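The general form of a confidence interval for the difference in two independent population means
with unknown population standard deviations is
\begin{eqnarray*} \mbox{Lower Limit} & = &
\overline{x}_1-\overline{x}_2-t \times \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \\
\mbox{Upper Limit} & = &
\overline{x}_1-\overline{x}_2+t \times \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}
\end{eqnarray*}
where $t$ is the positive $t$-score of the $t$-distribution so the area under the $t$-distribution
in between $-t$ and $t$ is $C$.
The hypothesis test for the difference in two independent population means with unknown
population standard deviations is a well-established process:
1. Write down the null and alternative hypotheses in terms of the difference in the population
means $\mu_1-\mu_2$.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level.
4. Find the p-value (the area in the corresponding tail) for the test using the $t$-distribution with
\begin{eqnarray*} t & = & \frac{(\overline{x}_1-\overline{x}_2)-
(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \\ \\ df & = &
\frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1} \times
\left(\frac{s_1^2}{n_1}\right)^2+\frac{1}{n_2-1} \times \left(\frac{s_2^2}{n_2}\right)^2}
\end{eqnarray*}
5. Compare the p-value to the significance level and state the outcome of the test.
6. Write down a concluding sentence specific to the context of the question.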
Attribution
“10.1 Two Population Means with Unknown Standard Deviations” and “10.2 Two Population
Means with Known Standard Deviations” in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
9.4 STATISTICAL INFERENCE FOR
MATCHED SAMPLES
LEARNING OBJECTIVES
• Construct and interpret a confidence interval for the mean difference for matched samples.
• Conduct and interpret hypothesis tests for matched samples.
The comparison of two population means is very common. Often, we want to find out if the two
populations under study have the same mean or if there is some difference in the two population
means. The approach we take when studying two population means depends on whether the
samples are independent or matched.
In a matched sample experiment, there is some relationship between pairs of data in the
samples. Inferences on matched samples are typically more accurate than inferences on
independent samples because matched samples reduce the variability measures to only the ones
within the pairs.
EXAMPLE
In a clinical trial for a new drug, patients are tested before the drug is administered and then the
same group of patients is tested after being given the drug. This is a matched sample experiment
because the same group of patients is measured before and after the administration of the drug. In
this way, there is a pair of observations (a before measurement and an after measurement) for each
patient.
EXAMPLE
A manufacturing company wants to know which of two different production methods allows
employees to perform a task the fastest. The lists below illustrate the difference between an independent
sample design and a matched sample design to test the difference in the average time it takes to
perform the task using the two different methods.
Independent Sample Design
• The company randomly selects two different groups of employees.
• The employees in Group 1 perform the task using Method 1 and their times are recorded.
• The employees in Group 2 perform the task using Method 2 and their times are recorded.
Matched Sample Design
• The company randomly selects one group of employees.
• Each of the employees in the group performs the task using both methods and their times
using each method are recorded.
In the independent sample design, there is no relationship between the two groups of employees. In
the matched sample design, there is one group of employees with a pair of observations (a time from
Method 1 and a time from Method 2) for each employee.
In matched sample designs, we work with the differences in the paired observations. We combine
the two samples into a single sample by calculating the difference between each of the paired
observations. Throughout this section, we will use the following notation for the sample size,
mean, and standard deviation of the differences in the paired observations:
• \displaystyle{n_D} is the sample size (the number of paired observations).
• \displaystyle{\overline{x}_D} is the sample mean of the differences.
• \displaystyle{s_D} is the sample standard deviation of the differences.
In order to construct a confidence interval or conduct a hypothesis test on the mean of the
differences in the paired data (\displaystyle{\mu_D}), we need to use the distribution of the differences in the paired
data. In such cases, we need the distribution of the differences in the paired data to be normal,
either because the differences are assumed to be normal or because the sample size is large
enough (\displaystyle{n_D \geq 30}).
By calculating out the differences in the paired data, we combine the two samples into a single
sample consisting of the differences in the paired data. We use the differences to construct the
confidence interval and run the hypothesis test. The confidence interval on the mean difference
is a confidence interval for a single population mean. Similarly, the hypothesis test on the
mean difference is actually a hypothesis test on a single population mean. In this case, we will
follow the exact same procedures as we learned previously for a single population mean confidence
interval and hypothesis test, only now the single population consists of the differences in the paired
data.
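As a rough illustration of how the two samples collapse into a single sample of differences, here is a minimal Python sketch. Python is used only as a stand-in for the Excel workflow in this book, and the times and variable names below are made up for illustration; they are not data from the text.

```python
from statistics import mean, stdev

# Hypothetical paired times (minutes) for five employees; not data from the text.
method_1 = [5.2, 6.1, 4.8, 5.9, 6.4]
method_2 = [5.0, 6.4, 4.1, 5.5, 6.0]

# One difference per pair: the two samples become a single sample.
differences = [a - b for a, b in zip(method_1, method_2)]

n_D = len(differences)        # sample size of the differences
x_bar_D = mean(differences)   # sample mean of the differences
s_D = stdev(differences)      # sample standard deviation (n - 1 in the denominator)

print(n_D, round(x_bar_D, 4), round(s_D, 4))
```

From this point on, every calculation uses only `n_D`, `x_bar_D`, and `s_D`, exactly as in a single population mean problem.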
When working with a matched sample design and the differences in the paired data, the
population standard deviation will be unknown. So we will need to estimate the population
standard deviation with the sample standard deviation. As we have seen previously, this means
we must use a t-distribution in the confidence intervals and hypothesis test on the mean of the
differences in the paired data.
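Suppose matched samples, each of size \displaystyle{n_D}, are taken from two related populations. The sample
mean \displaystyle{\overline{x}_D} and sample standard deviation \displaystyle{s_D} for the differences in the matched pairs are calculated.
The limits for the confidence interval with confidence level \displaystyle{C} for the mean difference \displaystyle{\mu_D} are:
\begin{eqnarray*} \mbox{Lower Limit} & = & \overline{x}_D-t \times \frac{s_D}{\sqrt{n_D}} \\
\mbox{Upper Limit} & = & \overline{x}_D+t \times \frac{s_D}{\sqrt{n_D}} \end{eqnarray*}
where \displaystyle{t} is the positive t-score of the t-distribution with \displaystyle{df = n_D-1} so that the
area under the curve in between \displaystyle{-t} and \displaystyle{t} is \displaystyle{C}.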
NOTES
1. In order to construct the confidence interval for the mean difference, we need to check
that the distribution of the differences in the paired data follows a normal distribution.
This means that we need to check that either the differences follow a normal distribution
or that the sample size is large enough (greater than or equal to 30).
2. When the population standard deviations are unknown, we must use a t-distribution in
the construction of the confidence interval.
To find the t-score to construct a confidence interval with confidence level \displaystyle{C}, use the t.inv.2t(area
in the tails, degrees of freedom) function.
• For area in the tails, enter the sum of the area in the tails of the t-distribution. For a
confidence interval, the area in the tails is \displaystyle{1-C}.
• For degrees of freedom, enter the degrees of freedom \displaystyle{df = n_D-1}.
The output from the t.inv.2t function is the value of the t-score needed to construct the confidence
interval.
NOTE
The t.inv.2t function requires that we enter the sum of the area in both tails. The area in the
middle of the distribution is the confidence level \displaystyle{C}, so the sum of the area in both tails is the
leftover area \displaystyle{1-C}.
EXAMPLE
A company has two different methods that employees can use to complete a manufacturing task. A
sample of workers is taken and the time, in minutes, that each worker takes to complete the task
using each method is recorded. The data is shown in the table below. Assume the differences in the
paired times have a normal distribution.
Worker Method 1 Method 2
1 5.5 6.8
2 6.9 6.6
3 6.1 5.1
4 6 6.8
5 7 6.7
6 6.7 6.5
7 6.4 5.8
8 7 6.8
9 6.6 5.3
10 5.7 5.8
11 5.9 6.9
12 7 6.7
13 5.4 6.5
14 5.4 6.3
15 5.3 5
1. Construct a 98% confidence interval for the mean difference in the time it takes the workers to
complete the task.
2. Interpret the confidence interval found in part 1.
3. Is there evidence to suggest that the mean completion time for the two methods is the same?
Explain.
Solution:
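1. We start by calculating the differences in the paired data. We will calculate the differences
as Method 1-Method 2.
Worker Method 1 Method 2 Difference
1 5.5 6.8 -1.3
2 6.9 6.6 0.3
3 6.1 5.1 1
4 6 6.8 -0.8
5 7 6.7 0.3
6 6.7 6.5 0.2
7 6.4 5.8 0.6
8 7 6.8 0.2
9 6.6 5.3 1.3
10 5.7 5.8 -0.1
11 5.9 6.9 -1
12 7 6.7 0.3
13 5.4 6.5 -1.1
14 5.4 6.3 -0.9
15 5.3 5 0.3
From the differences, \displaystyle{n_D=15}, \displaystyle{\overline{x}_D=-0.0467}, and \displaystyle{s_D=0.7936}.
To find the confidence interval, we need to find the t-score for the 98% confidence interval.
This means that we need to find the t-score so that the sum of the area in the tails is
\displaystyle{1-0.98=0.02}. The degrees of freedom for the t-distribution is
\displaystyle{df=n_D-1=15-1=14}.
Function t.inv.2t
Field 1 0.02
Field 2 14
Answer 2.6245
So \displaystyle{t=2.6245}. The 98% confidence interval is
\begin{eqnarray*} \mbox{Lower Limit} & = & -0.0467-2.6245 \times \frac{0.7936}{\sqrt{15}}=-0.584 \\
\mbox{Upper Limit} & = & -0.0467+2.6245 \times \frac{0.7936}{\sqrt{15}}=0.491 \end{eqnarray*}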
2. We are 98% confident that the mean difference in the completion times using the two
methods is between -0.584 minutes and 0.491 minutes.
3. Because 0 is inside the confidence interval, it suggests that the mean difference is 0. That
is, \displaystyle{\mu_D=0}. This suggests that the mean completion times for the two methods are the same.
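The calculation in part 1 can be cross-checked outside Excel. The sketch below is an assumption-laden Python stand-in, not the book's method: Python's standard library has no t-distribution, so the hypothetical helper `t_inv_2t` imitates t.inv.2t by integrating the t density numerically (Simpson's rule) and searching for the t-score by bisection. All function and variable names are chosen for illustration.

```python
import math
from statistics import mean, stdev

# Times (minutes) from the example table above.
method_1 = [5.5, 6.9, 6.1, 6, 7, 6.7, 6.4, 7, 6.6, 5.7, 5.9, 7, 5.4, 5.4, 5.3]
method_2 = [6.8, 6.6, 5.1, 6.8, 6.7, 6.5, 5.8, 6.8, 5.3, 5.8, 6.9, 6.7, 6.5, 6.3, 5]
differences = [a - b for a, b in zip(method_1, method_2)]  # Method 1 - Method 2

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tail_area(t, df, steps=2000):
    """Sum of the areas in both tails, beyond -|t| and |t| (Simpson's rule)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)   # total * h / 3 is the area between 0 and |t|

def t_inv_2t(tail_area, df):
    """Positive t-score whose two-tail area equals tail_area (bisection search)."""
    lo, hi = 0.0, 1.0
    while two_tail_area(hi, df) > tail_area:   # widen until the answer is bracketed
        lo, hi = hi, hi * 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tail_area(mid, df) > tail_area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n_D = len(differences)
x_bar_D = mean(differences)
s_D = stdev(differences)
t_score = t_inv_2t(1 - 0.98, n_D - 1)          # area in the tails is 1 - C
margin = t_score * s_D / math.sqrt(n_D)
lower, upper = x_bar_D - margin, x_bar_D + margin
print(round(lower, 3), round(upper, 3))
```

Because no intermediate value is rounded, the limits agree with the interval reported in part 2 to three decimal places.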
NOTES
1. When calculating the limits for the confidence interval, keep all of the decimals in the t-score
and other values throughout the calculation. This will ensure that there is no round-off error
in the answers. You can use Excel to do the calculation of the differences, sample mean,
sample standard deviation, and the limits, clicking on the corresponding cells to ensure that
all of the decimal places are used in the calculation.
2. When writing down the interpretation of the confidence interval, make sure to include the
confidence level, the actual mean difference captured by the confidence interval (i.e. be
specific to the context of the question), and appropriate units for the limits.
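Follow these steps to perform a hypothesis test on the mean difference for matched samples:
1. Write down the null hypothesis that there is no difference in the population means:
\begin{eqnarray*} H_0: & & \mu_D=0 \end{eqnarray*}
2. Write down the alternative hypothesis in terms of the mean difference. The alternative
hypothesis will be one of the following:
\begin{eqnarray*} H_a: & & \mu_D \lt 0\\ H_a: & & \mu_D \gt 0 \\
H_a: & & \mu_D \neq 0 \\ \\\end{eqnarray*}
3. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
4. Collect the sample information for the test and identify the significance level.
5. Use a t-distribution to find the p-value (the area in the corresponding tail) for the test. The
t-score and degrees of freedom are
\begin{eqnarray*} t & = & \frac{\overline{x}_D-\mu_D}{\frac{s_D}{\sqrt{n_D}}} \\ df & = & n_D-1 \end{eqnarray*}
6. Compare the p-value to the significance level and state the outcome of the test:
◦ If p-value \displaystyle{\leq \alpha}, reject \displaystyle{H_0} in favour of \displaystyle{H_a}.
▪ The results of the sample data are significant. There is sufficient evidence to
conclude that the null hypothesis is an incorrect belief and that the alternative
hypothesis is most likely correct.
◦ If p-value \displaystyle{\gt \alpha}, do not reject \displaystyle{H_0}.
▪ The results of the sample data are not significant. There is not sufficient evidence to
conclude that the alternative hypothesis may be correct.
7. Write down a concluding sentence specific to the context of the question.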
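The p-value for a hypothesis test on the mean difference in matched samples is the area in the tail(s)
of the t-distribution.
If the p-value is the area in the left tail:
• Use the t.dist function to find the p-value. In the t.dist(t-score, degrees of freedom, logic
operator) function:
◦ For t-score, enter the value of \displaystyle{t} calculated from the sample.
◦ For degrees of freedom, enter \displaystyle{df = n_D-1}.
◦ For the logic operator, enter true.
If the p-value is the area in the right tail:
• Use the t.dist.rt function to find the p-value. In the t.dist.rt(t-score, degrees of freedom)
function:
◦ For t-score, enter the value of \displaystyle{t} calculated from the sample.
◦ For degrees of freedom, enter \displaystyle{df = n_D-1}.
If the p-value is the sum of the area in the two tails:
• Use the t.dist.2t function to find the p-value. In the t.dist.2t(t-score, degrees of freedom)
function:
◦ For t-score, enter the absolute value of \displaystyle{t} calculated from the sample (t.dist.2t does not
accept a negative t-score).
◦ For degrees of freedom, enter \displaystyle{df = n_D-1}.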
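For readers who want to see what these three Excel functions compute, the following Python sketch mirrors them under stated assumptions: the helper names `t_left_area` and `p_value` are hypothetical, the t-distribution area is approximated by Simpson's rule because the standard library has no t-distribution, and the t-score in the usage line is made up.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_left_area(t, df, steps=2000):
    """Area to the left of t; plays the role of t.dist(t, df, true)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3                      # area between 0 and |t|
    return 0.5 + half if t >= 0 else 0.5 - half

def p_value(t, df, tail):
    """p-value for a left-, right-, or two-tailed test on the mean difference."""
    if tail == "left":                        # like t.dist(t, df, true)
        return t_left_area(t, df)
    if tail == "right":                       # like t.dist.rt(t, df)
        return 1 - t_left_area(t, df)
    if tail == "two":                         # like t.dist.2t(|t|, df)
        return 2 * (1 - t_left_area(abs(t), df))
    raise ValueError("tail must be 'left', 'right', or 'two'")

# Hypothetical t-score and degrees of freedom, for illustration only.
print(round(p_value(2.1, 9, "right"), 4))
```

The two-tailed branch doubles the area beyond the absolute value of the t-score, which is why t.dist.2t is given a positive t-score.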
EXAMPLE
A study was conducted to investigate the effectiveness of hypnosis on reducing pain. Eight subjects
are randomly selected. Each subject’s pain is measured before and after being hypnotized. A lower
score indicates less pain. Assume the differences in the before and after scores have a normal
distribution. At the 5% significance level, are the pain sensory measurements, on average, lower after
hypnotism?
Subject: A B C D E F G H
Solution:
We start by calculating out the differences in the paired data. We will calculate the differences as
before-after.
Subject Before After Difference
C 9 7.4 1.6
F 8.1 6.1 2
H 11.6 2 9.6
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_D=0 \\ H_a: & & \mu_D \gt 0
\end{eqnarray*}
p-value:
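This is a test on the mean difference in matched samples, so we use a t-distribution to calculate the
p-value. Because the alternative hypothesis is a \displaystyle{\gt}, the p-value is the area in the right tail of the
distribution. We use the t.dist.rt function with the t-score calculated from the differences in Field 1
and the degrees of freedom \displaystyle{df=n_D-1=8-1=7} in Field 2.
So the p-value \displaystyle{=0.0095}.
Conclusion:
Because p-value \displaystyle{=0.0095 \lt 0.05=\alpha}, we reject the null hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that the pain sensory
measurements are, on average, lower after hypnotism.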
NOTES
1. Before writing down the hypotheses, decide on the order of subtraction for calculating the
differences. In a matched sample experiment, the form of the alternative hypothesis depends
on the order of subtraction, so we must decide on the order of subtraction before writing
down the hypotheses.
2. The null hypothesis \displaystyle{\mu_D=0} is the claim that there is no difference in the pain sensory
measurements after hypnosis. That is, the average pain sensory measurement is the same
before and after hypnosis.
3. For the alternative hypothesis, we are testing that the after score is lower than the before
score. In other words, before>after. Because we calculated the differences as before-after,
before>after means before-after>0. So the alternative hypothesis is \displaystyle{\mu_D \gt 0}, the claim
that the before score is larger than the after score (or the after score is lower than the
before score).
4. Keep all of the decimals throughout the calculation (i.e. in the t-score, etc.) to avoid any
round-off error in the calculation of the p-value. This ensures that we get the most accurate
value for the p-value. Use Excel to do the calculations, and then click on the cells in
subsequent calculations.
5. The p-value of 0.0095 is a small probability compared to the significance level, and so is
unlikely to happen assuming that the null hypothesis is true. This suggests that the
assumption that the null hypothesis is true is most likely incorrect, and so the conclusion of
the test is to reject the null hypothesis in favour of the alternative hypothesis. In other words,
the after score is, on average, lower than the before score.
EXAMPLE
A study was conducted to investigate how effective a new diet was in lowering cholesterol. Nine
patients were selected for the new diet and their cholesterol was measured before and after starting
the new diet. The results are recorded in the table below. Assume the differences have a normal
distribution. At the 5% significance level, was the new diet, on average, successful in lowering
patients’ cholesterol?
Subject A B C D E F G H I
Before 209 210 205 198 216 217 238 240 222
After 199 207 189 209 217 202 211 223 201
Solution:
We start by calculating out the differences in the paired data. We will calculate the differences as
after-before.
Subject Before After Difference
A 209 199 -10
B 210 207 -3
C 205 189 -16
D 198 209 11
E 216 217 1
F 217 202 -15
G 238 211 -27
H 240 223 -17
I 222 201 -21
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_D=0 \\ H_a: & & \mu_D \lt 0 \end{eqnarray*}
p-value:
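This is a test on the mean difference in matched samples, so we use a t-distribution to calculate the
p-value. Because the alternative hypothesis is a \displaystyle{\lt}, the p-value is the area in the left tail of the
distribution. We use the t.dist function with the t-score calculated from the differences in Field 1,
the degrees of freedom \displaystyle{df=n_D-1=9-1=8} in Field 2, and true in Field 3.
So the p-value \displaystyle{=0.0224}.
Conclusion:
Because p-value \displaystyle{=0.0224 \lt 0.05=\alpha}, we reject the null hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that the new diet was,
on average, successful in lowering patients’ cholesterol.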
NOTES
1. Before writing down the hypotheses, decide on the order of subtraction for calculating the
differences. In a matched sample experiment, the form of the alternative hypothesis depends
on the order of subtraction, so we must decide on the order of subtraction before writing
down the hypotheses.
2. The null hypothesis \displaystyle{\mu_D=0} is the claim that there is no difference in the patients’
cholesterol level. That is, the average cholesterol level is the same before and after the diet.
3. For the alternative hypothesis, we are testing that the after score is lower than the before
score. In other words, after<before. Because we calculated the differences as after-before,
after<before means after-before<0. So, the alternative hypothesis is \displaystyle{\mu_D \lt 0}, the claim
that the after score is lower than the before score.
4. Keep all of the decimals throughout the calculation (i.e. in the t-score, etc.) to avoid any
round-off error in the calculation of the p-value. This ensures that we get the most accurate
value for the p-value. Use Excel to do the calculations, and then click on the cells in
subsequent calculations.
5. The p-value of 0.0224 is a small probability compared to the significance level, and so is
unlikely to happen assuming that the null hypothesis is true. This suggests that the
assumption that the null hypothesis is true is most likely incorrect, and so the conclusion of
the test is to reject the null hypothesis in favour of the alternative hypothesis. In other words,
the after score is, on average, lower than the before score.
EXAMPLE
Seven eighth graders at Kennedy Middle School measured how far they could push the shot-put with
their dominant (writing) hand and their weaker (non-writing) hand. They thought that they could
push equal distances with either hand. The results from their throws are recorded in the table
below. Assume the differences are normally distributed. At the 5% significance level, is there a
difference in the average distance for the dominant versus weaker hand?
Dominant Hand 30 26 34 17 19 26 20
Weaker Hand 28 14 27 18 17 26 16
Solution:
We start by calculating out the differences in the paired data. We will calculate the differences as
dominant-weaker.
Student Dominant Hand Weaker Hand Difference
1 30 28 2
2 26 14 12
3 34 27 7
4 17 18 -1
5 19 17 2
6 26 26 0
7 20 16 4
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_D=0 \\ H_a: & & \mu_D \neq 0
\end{eqnarray*}
p-value:
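This is a test on the mean difference in matched samples, so we use a t-distribution to calculate the
p-value. Because the alternative hypothesis is a \displaystyle{\neq}, the p-value is the sum of area in the tails of the
distribution. We use the t.dist.2t function with the t-score calculated from the differences in Field 1
and the degrees of freedom \displaystyle{df=n_D-1=7-1=6} in Field 2.
So the p-value \displaystyle{=0.0716}.
Conclusion:
Because p-value \displaystyle{=0.0716 \gt 0.05=\alpha}, we do not reject the null hypothesis. At the 5%
significance level there is not enough evidence to suggest that there is a difference in the average
distance for the dominant versus weaker hand.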
NOTES
1. Before writing down the hypotheses, decide on the order of subtraction for calculating the
differences. In a matched sample experiment, the form of the alternative hypothesis depends
on order of subtraction, so we must decide on the order of subtraction before writing down
the hypotheses.
2. The null hypothesis \displaystyle{\mu_D=0} is the claim that there is no difference in the average distance.
That is, the average distance is the same for both hands.
3. For the alternative hypothesis, we are testing that there is a difference in the dominant hand
and weaker hand distances. In other words, dominant≠weaker. So, the alternative
hypothesis is \displaystyle{\mu_D \neq 0}, the claim that there is a difference in the distances.
4. Keep all of the decimals throughout the calculation (i.e. in the t-score, etc.) to avoid any
round-off error in the calculation of the p-value. This ensures that we get the most accurate
value for the p-value. Use Excel to do the calculations, and then click on the cells in
subsequent calculations.
5. The p-value of 0.0716 is a large probability compared to the significance level, and so is likely
to happen assuming the null hypothesis is true. This suggests that the assumption that the
null hypothesis is true is most likely correct, and so the conclusion of the test is to not reject
the null hypothesis. In other words, on average, the distances are the same for both hands.
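The t-score that goes into Field 1 of the t.dist.2t function for this example can be reproduced from the raw distances. The sketch below is a Python stand-in for the Excel cell calculations (the variable names are illustrative) and stops at the t-score and degrees of freedom, since the p-value itself comes from the t-distribution.

```python
import math
from statistics import mean, stdev

# Distances from the example table above.
dominant = [30, 26, 34, 17, 19, 26, 20]
weaker = [28, 14, 27, 18, 17, 26, 16]

# Differences calculated as dominant - weaker, matching the solution.
differences = [d - w for d, w in zip(dominant, weaker)]

n_D = len(differences)
x_bar_D = mean(differences)
s_D = stdev(differences)

mu_D = 0                                           # value claimed by the null hypothesis
t_score = (x_bar_D - mu_D) / (s_D / math.sqrt(n_D))
df = n_D - 1

print(round(t_score, 4), df)
```

Entering this t-score with 6 degrees of freedom in t.dist.2t gives the two-tailed p-value discussed in the notes above.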
One or more interactive elements has been excluded from this version of the text. You can view them
online here: https://ecampusontario.pressbooks.pub/introstats/?p=211#oembed-1
Watch this video: Hypothesis Testing for Matched/Paired Samples by ExcelIsFun [20:48]
Concept Review
The general form of a confidence interval for the mean difference of matched samples is
\begin{eqnarray*} \mbox{Lower Limit} & = & \overline{x}_D-t \times \frac{s_D}{\sqrt{n_D}} \\
\mbox{Upper Limit} & = & \overline{x}_D+t \times \frac{s_D}{\sqrt{n_D}} \end{eqnarray*}
where \displaystyle{t} is the positive t-score of the t-distribution with \displaystyle{n_D-1} degrees of freedom so the area
under the t-distribution in between \displaystyle{-t} and \displaystyle{t} is \displaystyle{C}.
The hypothesis test for matched samples is a well established process:
1. Write down the null and alternative hypotheses in terms of the mean difference \displaystyle{\mu_D}.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level.
4. Find the p-value (the area in the corresponding tail) for the test using the t-distribution.
Because the population standard deviation is unknown, we use the t-distribution to find the
p-value, with
\begin{eqnarray*} t & = & \frac{\overline{x}_D-\mu_D}{\frac{s_D}{\sqrt{n_D}}} \\ df & = & n_D-1 \end{eqnarray*}
5. Compare the p-value to the significance level and state the outcome of the test.
6. Write down a concluding sentence specific to the context of the question.
Attribution
9.5 STATISTICAL INFERENCE FOR
TWO POPULATION PROPORTIONS
LEARNING OBJECTIVES
Similar to comparing two population means, the comparison of two population proportions is very
common. Often, we want to find out if the two populations under study have the same proportion
or if there is some difference in the two population proportions. Unlike two population means,
we can only approach the comparison of two population proportions using independent samples.
Recall that two populations are independent if the sample taken from population 1 is not related
in any way to the sample taken from population 2. In this situation, any relationship between the
samples or populations is entirely coincidental.
Throughout this section, we will use subscripts to identify the values for the proportions and
sample sizes for the two populations:
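Population 1 Population 2
Population Proportion \displaystyle{p_1} \displaystyle{p_2}
Sample Size \displaystyle{n_1} \displaystyle{n_2}
Sample Proportion \displaystyle{\hat{p}_1} \displaystyle{\hat{p}_2}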
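In order to construct a confidence interval or conduct a hypothesis test on the difference in two
population proportions (\displaystyle{p_1-p_2}), we need to use the distribution of the difference in the sample
proportions \displaystyle{\hat{p}_1-\hat{p}_2}:
• The mean of the distribution of the difference in the sample proportions is \displaystyle{p_1-p_2}.
• The standard deviation of the distribution of the difference in the sample proportions
is \displaystyle{\sqrt{\frac{p_1 \times (1-p_1)}{n_1}+\frac{p_2 \times (1-p_2)}{n_2}}}.
• The distribution of the difference in the sample proportions is approximately normal when
\displaystyle{n_1 \times p_1 \geq 5}, \displaystyle{n_1 \times (1-p_1) \geq 5}, \displaystyle{n_2 \times p_2 \geq 5}, and \displaystyle{n_2 \times (1-p_2) \geq 5}.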
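Suppose a sample of size \displaystyle{n_1} with sample proportion \displaystyle{\hat{p}_1} is taken from population 1 and a sample
of size \displaystyle{n_2} with sample proportion \displaystyle{\hat{p}_2} is taken from population 2. The limits for the confidence
interval with confidence level \displaystyle{C} for the difference in the population proportions \displaystyle{p_1-p_2} are:
\begin{eqnarray*} \mbox{Lower Limit} & = & \hat{p}_1-\hat{p}_2-z \times \sqrt{\frac{\hat{p}_1 \times (1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2 \times (1-\hat{p}_2)}{n_2}} \\
\mbox{Upper Limit} & = & \hat{p}_1-\hat{p}_2+z \times \sqrt{\frac{\hat{p}_1 \times (1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2 \times (1-\hat{p}_2)}{n_2}} \end{eqnarray*}
where \displaystyle{z} is the positive z-score of the standard normal distribution so that the area under the
curve in between \displaystyle{-z} and \displaystyle{z} is \displaystyle{C}.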
NOTES
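1. In order to construct the confidence interval for the difference in two population
proportions, we need to check that the normal distribution applies. This means that we
need to check that \displaystyle{n_1 \times p_1 \geq 5}, \displaystyle{n_1 \times (1-p_1) \geq 5}, \displaystyle{n_2 \times p_2 \geq 5}, and
\displaystyle{n_2 \times (1-p_2) \geq 5}.
2. Because the population proportions \displaystyle{p_1} and \displaystyle{p_2} are often unknown, we replace the values
of the population proportions with the sample proportions \displaystyle{\hat{p}_1} and \displaystyle{\hat{p}_2} in the normal
distribution check. That is, when the population proportions are unknown, we check
\displaystyle{n_1 \times \hat{p}_1 \geq 5}, \displaystyle{n_1 \times (1-\hat{p}_1) \geq 5}, \displaystyle{n_2 \times \hat{p}_2 \geq 5}, and \displaystyle{n_2 \times (1-\hat{p}_2) \geq 5} to verify
that the normal distribution applies.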
To find the z-score to construct a confidence interval with confidence level \displaystyle{C}, use the
norm.s.inv(area to the left of z) function.
• For area to the left of z, enter the entire area to the left of the z-score you are trying to find.
For a confidence interval, the area to the left of z is \displaystyle{C+\frac{1-C}{2}}.
The output from the norm.s.inv function is the value of the z-score needed to construct the
confidence interval.
NOTE
The norm.s.inv function requires that we enter the entire area to the left of the unknown
z-score. This area includes the confidence level (the area in the middle of the distribution) plus the
remaining area in the left tail.
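The "area to the left" idea can be checked with a short Python sketch. This is only an illustrative equivalent of norm.s.inv, using `statistics.NormalDist` from Python's standard library; the 98% confidence level and the variable names are chosen here as an example.

```python
from statistics import NormalDist

confidence_level = 0.98

# Middle area (the confidence level) plus the leftover area in the left tail.
area_left = confidence_level + (1 - confidence_level) / 2

# inv_cdf returns the z-score with the given area to its left, like norm.s.inv.
z_score = NormalDist().inv_cdf(area_left)

print(round(area_left, 4), round(z_score, 4))
```

Entering the confidence level itself instead of `area_left` would give a z-score that is too small, which is the mistake the NOTE above warns against.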
EXAMPLE
A marketing company places an advertisement for a new brand of deodorant on two different
platforms: television and social media. The company wants to study the proportion of people who
remembered seeing the advertisement two hours later. In a sample of 200 people who saw the
advertisement on television, 74 remembered seeing it two hours later. In a sample of 300 people who
saw the advertisement on social media, 129 remembered seeing it two hours later.
1. Construct a 98% confidence interval for the difference in the proportion of people from the two
different platforms that remember seeing the advertisement two hours later.
2. Interpret the confidence interval found in part 1.
3. Is there evidence to suggest that the proportion of people from social media who remember
seeing the advertisement two hours later is greater than the proportion of people from
television? Explain.
Solution:
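1. Let television be population 1 and social media be population 2. From the question we have
the following information:
Television Social Media
\displaystyle{n_1=200} \displaystyle{n_2=300}
\displaystyle{\hat{p}_1=\frac{74}{200}=0.37} \displaystyle{\hat{p}_2=\frac{129}{300}=0.43}
Before constructing the confidence interval, we check that the normal distribution applies:
\begin{eqnarray*} n_1 \times \hat{p}_1 & = & 200 \times 0.37=74 \geq 5 \\ n_1 \times (1-\hat{p}_1) & = & 200 \times 0.63=126 \geq 5 \\
n_2 \times \hat{p}_2 & = & 300 \times 0.43=129 \geq 5 \\ n_2 \times (1-\hat{p}_2) & = & 300 \times 0.57=171 \geq 5 \end{eqnarray*}
To find the confidence interval, we need to find the z-score for the 98% confidence interval.
This means that we need to find the z-score so that the entire area to the left of \displaystyle{z} is
\displaystyle{0.98+\frac{1-0.98}{2}=0.99}.
Function norm.s.inv
Field 1 0.99
Answer 2.3263
So \displaystyle{z=2.3263}. The 98% confidence interval is
\begin{eqnarray*} \mbox{Lower Limit} & = & 0.37-0.43-2.3263 \times \sqrt{\frac{0.37 \times (1-0.37)}{200}+\frac{0.43 \times (1-0.43)}{300}}=-0.1636 \\
\mbox{Upper Limit} & = & 0.37-0.43+2.3263 \times \sqrt{\frac{0.37 \times (1-0.37)}{200}+\frac{0.43 \times (1-0.43)}{300}}=0.0436 \end{eqnarray*}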
2. We are 98% confident that the difference in the proportion of people from the two
platforms that remember seeing the advertisement two hours later is between
-16.36% and 4.36%.
3. Because 0 is inside the confidence interval, it suggests that the difference in the proportions
is 0. That is, \displaystyle{p_1-p_2=0}. This suggests that the two proportions are equal. So
the proportion of people from social media who remember seeing the advertisement two
hours later is not greater than the proportion of people from television.
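As a cross-check on the interval in this example, here is a minimal Python sketch of the same calculation. It is a stand-in for the Excel steps, assuming `statistics.NormalDist` in place of norm.s.inv; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Sample information from the example: television is population 1,
# social media is population 2.
n1, x1 = 200, 74
n2, x2 = 300, 129
p_hat_1, p_hat_2 = x1 / n1, x2 / n2

# Normal distribution check using the sample proportions.
checks = [n1 * p_hat_1, n1 * (1 - p_hat_1), n2 * p_hat_2, n2 * (1 - p_hat_2)]
normal_applies = all(value >= 5 for value in checks)

confidence_level = 0.98
z_score = NormalDist().inv_cdf(confidence_level + (1 - confidence_level) / 2)

standard_error = sqrt(p_hat_1 * (1 - p_hat_1) / n1 + p_hat_2 * (1 - p_hat_2) / n2)
lower = (p_hat_1 - p_hat_2) - z_score * standard_error
upper = (p_hat_1 - p_hat_2) + z_score * standard_error

print(normal_applies, round(lower, 4), round(upper, 4))
```

The limits are decimals; multiplying by 100 expresses them as the percents used in the interpretation in part 2.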
NOTES
1. Because the population proportions are unknown, we use the sample proportions in the
check for normality.
2. When calculating the limits for the confidence interval, keep all of the decimals in the z-score
and other values throughout the calculation. This will ensure that there is no round-off error
in the answers. You can use Excel to do the calculation of the limits, clicking on the cell
containing the z-score and any other values, to ensure that all of the decimal places are used
in the calculation.
3. The limits for the confidence interval are percents. For example, the upper limit of 0.0436 is
the decimal form of a percent: 4.36%.
4. When writing down the interpretation of the confidence interval, make sure to include the
confidence level, the actual difference in the population proportions captured by the
confidence interval (i.e. be specific to the context of the question), and express the limits as
percents.
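1. Write down the null hypothesis that there is no difference in the population proportions:
\begin{eqnarray*} H_0: & & p_1-p_2=0 \end{eqnarray*}
2. Write down the alternative hypothesis in terms of the difference in the population
proportions. The alternative hypothesis will be one of the following:
\begin{eqnarray*} H_a: p_1-p_2 \lt 0 & & (p_1 \lt p_2) \\ H_a: p_1-p_2 \gt 0
& & (p_1 \gt p_2) \\ H_a: p_1-p_2 \neq 0 & & (p_1 \neq p_2) \\ \\
\end{eqnarray*}
3. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
4. Collect the sample information for the test and identify the significance level.
5. Check the conditions \displaystyle{n_1 \times p_1 \geq 5}, \displaystyle{n_1 \times (1-p_1) \geq 5}, \displaystyle{n_2 \times p_2 \geq 5}, and \displaystyle{n_2 \times (1-p_2) \geq 5}
to verify that the normal distribution applies. Use the normal distribution to find the p-value
(the area in the corresponding tail) for the test. The z-score is
\begin{eqnarray*} z & = & \frac{(\hat{p}_1-\hat{p}_2)-(p_1-p_2)}{\sqrt{\overline{p} \times (1-\overline{p}) \times \left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} \\
\overline{p} & = & \frac{x_1+x_2}{n_1+n_2} \end{eqnarray*}
where \displaystyle{x_1} and \displaystyle{x_2} are the number of successes in the two samples.
6. Compare the p-value to the significance level and state the outcome of the test.
7. Write down a concluding sentence specific to the context of the question.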
NOTES
1. Because the population proportions \displaystyle{p_1} and \displaystyle{p_2} are often unknown, we replace the values
of the population proportions with the sample proportions \displaystyle{\hat{p}_1} and \displaystyle{\hat{p}_2} in the normal
distribution check. That is, when the population proportions are unknown, we check
\displaystyle{n_1 \times \hat{p}_1 \geq 5}, \displaystyle{n_1 \times (1-\hat{p}_1) \geq 5}, \displaystyle{n_2 \times \hat{p}_2 \geq 5}, and \displaystyle{n_2 \times (1-\hat{p}_2) \geq 5} to verify
that the normal distribution applies to the calculation of the p-value.
2. Because we are testing the equality of the two population proportions, the z-score for the
hypothesis test uses a pooled sample proportion \displaystyle{\overline{p}}. The pooled sample proportion
combines the sample data to create an estimate of the overall proportion of success.
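The p-value for a hypothesis test on the difference in two population proportions is the area in the
tail(s) of the normal distribution, assuming that the conditions for using a normal distribution are
met (\displaystyle{n_1 \times p_1 \geq 5}, \displaystyle{n_1 \times (1-p_1) \geq 5}, \displaystyle{n_2 \times p_2 \geq 5}, and \displaystyle{n_2 \times (1-p_2) \geq 5}).
The p-value is the area in the tail(s) of a normal distribution, so the norm.dist(x, \displaystyle{\mu}, \displaystyle{\sigma}, logic
operator) function can be used to calculate the p-value.
• For x, enter the value of \displaystyle{\hat{p}_1-\hat{p}_2}.
• For \displaystyle{\mu}, enter 0, the value of \displaystyle{p_1-p_2} from the null hypothesis.
• For \displaystyle{\sigma}, enter the value of \displaystyle{\sqrt{\overline{p} \times (1-\overline{p}) \times \left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}. Note: The
value for \displaystyle{\sigma} is the bottom part of the z-score used in the hypothesis test.
• For the logic operator, enter true. Note: Because we are calculating the area under the
curve, we always enter true for the logic operator.
As with the previous chapter, use the appropriate technique with the norm.dist function to find the
area in the left tail, the area in the right tail, or the sum of the area in the tails.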
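The norm.dist technique described above can be sketched in Python as follows. This is an illustrative stand-in rather than the book's method: the function name `two_proportion_p_value` and the counts in the usage line are hypothetical, and `statistics.NormalDist(...).cdf` plays the role of norm.dist(x, μ, σ, true).

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(x1, n1, x2, n2, tail):
    """p-value for H0: p1 - p2 = 0, using the pooled sample proportion."""
    p_hat_1, p_hat_2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # sigma is the bottom part of the z-score used in the hypothesis test.
    sigma = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    # Area to the left of the observed difference, assuming the null is true (mu = 0).
    left = NormalDist(mu=0, sigma=sigma).cdf(p_hat_1 - p_hat_2)
    if tail == "left":
        return left
    if tail == "right":
        return 1 - left
    if tail == "two":
        return 2 * min(left, 1 - left)   # double the area in the tail the sample falls in
    raise ValueError("tail must be 'left', 'right', or 'two'")

# Hypothetical counts, for illustration only: 45 of 150 versus 30 of 140.
print(round(two_proportion_p_value(45, 150, 30, 140, "right"), 4))
```

Standardizing first and using a standard normal distribution would give the same areas; working with μ = 0 and σ directly simply mirrors the fields of the norm.dist function.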
EXAMPLE
A cell phone company claimed that iPhones are more popular with adults 30 years old or younger
than with adults over 30 years old. A consumer advocacy group wants to test this claim. In a sample
of 1340 adults 30 years old or younger, 134 own an iPhone. In a sample of 250 adults over the age of
30, 15 own an iPhone. At the 5% significance level, is the proportion of adults 30 years old or
younger who own an iPhone greater than the proportion of adults over the age of 30 who own an
iPhone?
Solution:
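Let adults 30 years old or younger be population 1 and adults over 30 years old be population 2.
From the question, we have the following information:
30 or Younger Over 30
\displaystyle{n_1=1340} \displaystyle{n_2=250}
\displaystyle{\hat{p}_1=\frac{134}{1340}=0.1} \displaystyle{\hat{p}_2=\frac{15}{250}=0.06}
The pooled sample proportion is \displaystyle{\overline{p}=\frac{134+15}{1340+250}=0.0937}.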
Hypotheses:
\begin{eqnarray*} H_0: & & p_1-p_2=0 \\ H_a: & & p_1-p_2 \gt 0
\end{eqnarray*}
p-value:
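Before calculating the p-value, we check that the normal distribution applies:
\begin{eqnarray*} n_1 \times \hat{p}_1 & = & 1340 \times 0.1=134 \geq 5 \\ n_1 \times (1-\hat{p}_1) & = & 1340 \times 0.9=1206 \geq 5 \\
n_2 \times \hat{p}_2 & = & 250 \times 0.06=15 \geq 5 \\ n_2 \times (1-\hat{p}_2) & = & 250 \times 0.94=235 \geq 5 \end{eqnarray*}
Function 1-norm.dist
Field 1 0.1-0.06
Field 2 0
Field 3 sqrt(0.0937*(1-0.0937)*(1/1340+1/250))
Field 4 true
Answer 0.0232
So the p-value \displaystyle{=0.0232}.
Conclusion:
Because p-value \displaystyle{=0.0232 \lt 0.05=\alpha}, we reject the null hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that the proportion of
adults 30 years old or younger who own an iPhone is greater than the proportion of adults over the
age of 30 who own an iPhone.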
NOTES
1. The null hypothesis \displaystyle{p_1-p_2=0} is the claim that the proportion of adults 30 or younger
with an iPhone equals the proportion of adults over 30 with an iPhone. That is, the two
populations have the same proportion.
2. The alternative hypothesis \displaystyle{p_1-p_2 \gt 0} is the claim that the proportion of adults 30 or
younger with an iPhone is greater than the proportion of adults over 30 with an iPhone
(\displaystyle{p_1 \gt p_2}).
3. Make sure to keep all of the decimal places throughout the calculation to avoid any round-off
error in the p-value. Perform the calculations of the sample proportions and the pooled
sample proportion in Excel and then click on the corresponding cells when completing the
fields in the norm.dist function.
4. The p-value is the area in the right tail of the normal distribution. In the calculation of the
p-value:
◦ The function is 1-norm.dist because we are finding the area in the right tail of a normal
distribution.
◦ Field 1 is the value of \displaystyle{\hat{p}_1-\hat{p}_2}.
◦ Field 2 is 0, the value of \displaystyle{p_1-p_2} from the null hypothesis. Remember, we run the
test assuming the null hypothesis is true, so that means we assume \displaystyle{p_1-p_2=0}.
◦ Field 3 is the value of \displaystyle{\sqrt{\overline{p} \times (1-\overline{p}) \times \left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}.
5. The p-value of 0.0232 is a small probability compared to the significance level, and so is
unlikely to happen assuming that the null hypothesis is true. This suggests that the
assumption that the null hypothesis is true is most likely incorrect, and so the conclusion of
the test is to reject the null hypothesis in favour of the alternative hypothesis. In other words,
the proportion of adults 30 years old or younger who own an iPhone is greater than the
proportion of adults over the age of 30 who own an iPhone.
EXAMPLE
Two types of medication for hives are tested to determine if there is a difference in the proportions of
adult patient reactions. In a sample of 200 adults given medication A, 20 still had hives 30 minutes
after taking the medication. In a sample of 200 adults given medication B, 12 still had hives 30
minutes after taking the medication. At the 1% significance level, is there a difference in the
proportion of adults who still have hives 30 minutes after taking medications?
Solution:
Let medication A be population 1 and medication B be population 2. From the question, we have the
following information:
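Medication A Medication B
\displaystyle{n_1=200} \displaystyle{n_2=200}
\displaystyle{\hat{p}_1=\frac{20}{200}=0.1} \displaystyle{\hat{p}_2=\frac{12}{200}=0.06}
The pooled sample proportion is \displaystyle{\overline{p}=\frac{20+12}{200+200}=0.08}.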
Hypotheses:
\begin{eqnarray*} H_0: & & p_1-p_2=0 \\ H_a: & & p_1-p_2 \neq 0
\end{eqnarray*}
p-value:
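Before calculating the p-value, we check that the normal distribution applies:
\begin{eqnarray*} n_1 \times \hat{p}_1 & = & 200 \times 0.1=20 \geq 5 \\ n_1 \times (1-\hat{p}_1) & = & 200 \times 0.9=180 \geq 5 \\
n_2 \times \hat{p}_2 & = & 200 \times 0.06=12 \geq 5 \\ n_2 \times (1-\hat{p}_2) & = & 200 \times 0.94=188 \geq 5 \end{eqnarray*}
We need to know if the sample information relates to the left or right tail because that will determine
how we calculate the area of that tail using the normal distribution. In this case, \displaystyle{\hat{p}_1 \gt \hat{p}_2}
(\displaystyle{\hat{p}_1-\hat{p}_2=0.04 \gt 0}), so the sample information relates to the right tail of the normal distribution. This
means that we will calculate the area in the right tail using 1-norm.dist. However, this is a
two-tailed test where the p-value is the sum of the area in the two tails and the area in the right tail is only
one half of the p-value. The area in the right tail equals the area in the left tail and the p-value is the
sum of these two areas.
Function 1-norm.dist
Field 1 0.1-0.06
Field 2 0
Field 3 sqrt(0.08*(1-0.08)*(1/200+1/200))
Field 4 true
Answer 0.0702
So the area in the right tail is 0.0702, which means \displaystyle{\frac{1}{2}}(p-value) \displaystyle{=0.0702}. This is also the area in
the left tail, so
p-value \displaystyle{=0.0702+0.0702=0.1404}
Conclusion:
Because p-value \displaystyle{=0.1404 \gt 0.01=\alpha}, we do not reject the null hypothesis. At the 1%
significance level there is not enough evidence to suggest that there is a difference in the proportion
of adults who still have hives 30 minutes after taking the medications.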
NOTES
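1. The null hypothesis \displaystyle{p_1-p_2=0} is the claim that there is no difference in the
proportion of adults with hives 30 minutes after taking the medications. That is, the two
populations have the same proportion.
2. The alternative hypothesis \displaystyle{p_1-p_2 \neq 0} is the claim that there is a difference in the
proportion of adults with hives 30 minutes after taking the medications (\displaystyle{p_1 \neq p_2}).
3. In a two-tailed hypothesis test that uses the normal distribution, we will only have sample
information relating to one of the two tails. We must determine which of the tails the sample
information belongs to, and then calculate the area in that tail. The area in each tail
represents exactly half of the p-value, so the p-value is the sum of the areas in the two tails.
◦ If the sample information relates to the left tail (\displaystyle{\hat{p}_1-\hat{p}_2 \lt 0}):
▪ We use norm.dist(\displaystyle{\hat{p}_1-\hat{p}_2}, 0,
\displaystyle{\sqrt{\overline{p} \times (1-\overline{p}) \times \left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}, true)
to find the area in the left tail. The area in the right tail equals the area in the
left tail, so we can find the p-value by adding the output from this function to
itself.
◦ If the sample information relates to the right tail (\displaystyle{\hat{p}_1-\hat{p}_2 \gt 0}):
▪ We use 1-norm.dist(\displaystyle{\hat{p}_1-\hat{p}_2}, 0,
\displaystyle{\sqrt{\overline{p} \times (1-\overline{p}) \times \left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}, true)
to find the area in the right tail. The area in the left tail equals the area in
the right tail, so we can find the p-value by adding the output from this function
to itself.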
4. The p-value of 0.1404 is a large probability compared to the significance level, and so is likely
to happen assuming that the null hypothesis is true. This suggests that the assumption that
the null hypothesis is true is most likely correct, and so the conclusion of the test is to not
reject the null hypothesis. In other words, there is no difference in the proportion of adults
with hives 30 minutes after taking the medications.
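The "find one tail, then add it to itself" reasoning for this example can be traced in a short Python sketch. It is a stand-in for the Excel calculation, with `statistics.NormalDist(...).cdf` assumed as the equivalent of norm.dist(x, μ, σ, true); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Sample information from the example: medication A is population 1,
# medication B is population 2.
n1, x1 = 200, 20
n2, x2 = 200, 12
p_hat_1, p_hat_2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)

sigma = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
null_distribution = NormalDist(mu=0, sigma=sigma)   # mu = 0 comes from the null hypothesis
difference = p_hat_1 - p_hat_2

# Decide which tail the sample information relates to, then find that area.
if difference > 0:
    one_tail = 1 - null_distribution.cdf(difference)   # right tail
else:
    one_tail = null_distribution.cdf(difference)       # left tail

p_value = one_tail + one_tail                          # the two tails have equal area
print(round(one_tail, 4), round(p_value, 4))
```

Comparing `p_value` with the 1% significance level reproduces the decision not to reject the null hypothesis.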
EXAMPLE
A valve manufacturer recently launched a new valve, Valve A, and they want to claim that the
proportion of their valves that fail under 4500 psi is smaller than that of all the other valves on the
market. The manufacturer decides to compare Valve A with the most popular valve on the market,
Valve B. In a sample of 100 Valve As, 6 failed at 4500 psi. In a sample of 150 Valve Bs, 16 failed at
4500 psi. At the 5% significance level, is the proportion of Valve As that fail under 4500 psi less than
the proportion of Valve Bs that fail under 4500 psi?
Solution:
Let Valve A be population 1 and Valve B be population 2. From the question, we have the following
information:
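Valve A Valve B
\displaystyle{n_1=100} \displaystyle{n_2=150}
\displaystyle{\hat{p}_1=\frac{6}{100}=0.06} \displaystyle{\hat{p}_2=\frac{16}{150}=0.1067}
The pooled sample proportion is \displaystyle{\overline{p}=\frac{6+16}{100+150}=0.088}.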
Hypotheses:
\begin{eqnarray*} H_0: & & p_1-p_2=0 \\ H_a: & & p_1-p_2 \lt 0
\end{eqnarray*}
p-value:
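Before calculating the p-value, we check that the normal distribution applies:
\begin{eqnarray*} n_1 \times \hat{p}_1 & = & 100 \times 0.06=6 \geq 5 \\ n_1 \times (1-\hat{p}_1) & = & 100 \times 0.94=94 \geq 5 \\
n_2 \times \hat{p}_2 & = & 150 \times \frac{16}{150}=16 \geq 5 \\ n_2 \times (1-\hat{p}_2) & = & 150 \times \frac{134}{150}=134 \geq 5 \end{eqnarray*}
Function norm.dist
Field 1 0.06-16/150
Field 2 0
Field 3 sqrt(0.088*(1-0.088)*(1/100+1/150))
Field 4 true
Answer 0.1010
So the p-value \displaystyle{=0.1010}.
Conclusion:
Because p-value \displaystyle{=0.1010 \gt 0.05=\alpha}, we do not reject the null hypothesis. At the 5%
significance level there is not enough evidence to suggest that the proportion of Valve As that fail
under 4500 psi is less than the proportion of Valve Bs that fail under 4500 psi.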
NOTES
1. The null hypothesis \displaystyle{p_1-p_2=0} is the claim that the proportion of valves that fail under
4500 psi is the same for both valves. That is, the two populations have the same proportion.
2. The alternative hypothesis \displaystyle{p_1-p_2 \lt 0} is the claim that the proportion of Valve As that
fail under 4500 psi is less than the proportion of Valve Bs that fail under 4500 psi (\displaystyle{p_1 \lt p_2}).
3. Make sure to keep all of the decimal places throughout the calculation to avoid any round-off
error in the p-value. Perform the calculations of the sample proportions and the pooled
sample proportion in Excel and then click on the corresponding cells when completing the
fields in the norm.dist function.
4. The p-value of 0.1010 is a large probability compared to the significance level, and so is likely
to happen assuming that the null hypothesis is true. The sample therefore gives us no reason to
doubt the null hypothesis, and so the conclusion of the test is to not reject the null hypothesis.
In other words, there is not enough evidence to conclude that the proportion of Valve As that
fail under 4500 psi is less than the proportion of Valve Bs that fail under 4500 psi. For the
company, this means that they could not claim that the proportion of their valves that fail
under 4500 psi is the smallest of all the valves on the market.
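The valve calculation can also be checked outside Excel. The following is a minimal Python sketch, assuming the pooled standard error described in note 3; the function name `two_proportion_left_tail` is hypothetical, and `NormalDist(...).cdf` plays the role of norm.dist(..., true).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_left_tail(x1, n1, x2, n2):
    """Left-tailed test of H0: p1 - p2 = 0 using the pooled sample proportion."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    # Area to the left of the observed difference under a normal curve
    # centred at 0, like norm.dist(difference, 0, std_error, true) in Excel.
    p_value = NormalDist(mu=0, sigma=std_error).cdf(p1_hat - p2_hat)
    return p1_hat - p2_hat, std_error, p_value


# Valve A: 6 failures out of 100; Valve B: 16 failures out of 150.
difference, std_error, p_value = two_proportion_left_tail(6, 100, 16, 150)
print(round(difference, 4), round(std_error, 4), round(p_value, 4))
```

Keeping full precision in `difference` and `std_error` until the final rounding follows the advice in note 3 about avoiding round-off error.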
Watch this video: Excel 2013 Statistical Analysis #71: Inference About Difference Between 2 Pop. Proportions Z Method by
ExcelIsFun [28:03]
Concept Review
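The general form of a confidence interval for the difference in two population proportions is
\begin{eqnarray*} \mbox{Lower Limit} & = & \hat{p}_1-\hat{p}_2-z \times \sqrt{\frac{\hat{p}_1 \times (1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2 \times (1-\hat{p}_2)}{n_2}} \\ \\
\mbox{Upper Limit} & = & \hat{p}_1-\hat{p}_2+z \times \sqrt{\frac{\hat{p}_1 \times (1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2 \times (1-\hat{p}_2)}{n_2}} \end{eqnarray*}
where $z$ is the positive $z$-score of the standard normal distribution so the area under the normal
distribution in between $-z$ and $z$ is the confidence level $C$.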
The hypothesis test for the difference in two population proportions is a well-established
process:
1. Write down the null and alternative hypotheses in terms of the differences in the population
proportions $p_1-p_2$.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level.
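4. Check that $n_1 \times \hat{p}_1 \geq 5$, $n_1 \times (1-\hat{p}_1) \geq 5$, $n_2 \times \hat{p}_2 \geq 5$ and $n_2 \times (1-\hat{p}_2) \geq 5$ to verify
that the normal distribution applies.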
5. Find the p-value (the area in the corresponding tail) for the test using the normal distribution.
6. Compare the p-value to the significance level and state the outcome of the test.
7. Write down a concluding sentence specific to the context of the question.
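As an illustration of the confidence interval described in this review, here is a short Python sketch. It assumes the usual unpooled standard error for the interval, and it reuses the valve counts from the example above with a 95% confidence level chosen only for demonstration; the function name `two_proportion_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, confidence):
    """Confidence interval for p1 - p2 (unpooled standard error, an assumption)."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Positive z-score with `confidence` of the area between -z and z,
    # like norm.s.inv(1 - (1 - confidence) / 2) in Excel.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    difference = p1_hat - p2_hat
    return difference - margin, difference + margin


# Illustrative inputs only: the valve counts with a 95% confidence level.
lower, upper = two_proportion_ci(6, 100, 16, 150, 0.95)
print(round(lower, 4), round(upper, 4))
```

Because this interval contains zero, it is consistent with the outcome of the valve hypothesis test: the sample does not rule out equal failure proportions.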
Attribution
1. The known standard deviation in salary for all mid-level professionals in the financial industry
is $11,000. Company A and Company B are in the financial industry. Suppose samples are taken of
mid-level professionals from Company A and from Company B. The sample mean salary for mid-
level professionals in Company A is $80,000. The sample mean salary for mid-level professionals
in Company B is $96,000.
a. Construct a 99% confidence interval for the difference in the mean salary for mid-level
professionals at the two companies.
b. Interpret the confidence interval in part (a).
c. Is it reasonable to claim that the mean salary for mid-level professionals is the same at the two
companies? Explain.
2. It is believed that the average grade on an English essay in a particular school system for females
is higher than for males. A random sample of 31 females had a mean score of 82 with a standard
deviation of three, and a random sample of 25 males had a mean score of 76 with a standard
deviation of four. At the 5% significance level, test if the average grade on an English essay is higher
for females than for males.
3. In a random sample of 100 forests in the United States, 56 were coniferous or contained conifers.
In a random sample of 80 forests in Mexico, 40 were coniferous or contained conifers. At the 5%
significance level, is the proportion of conifers in the United States greater than the proportion of
conifers in Mexico?
4. A study is done to determine which of two soft drinks has more sugar. There are 13 cans of
Beverage A in a sample and six cans of Beverage B. The mean amount of sugar in Beverage A is 36
grams with a standard deviation of 0.6 grams. The mean amount of sugar in Beverage B is 38 grams
with a standard deviation of 0.8 grams. Both populations have normal distributions. At the 5%
significance level, determine if the average amount of sugar in Beverage B is greater than Beverage
A.
632 | 9.6 EXERCISES
5. The mean number of English courses taken in a two–year time period by male and female college
students is believed to be about the same. An experiment is conducted and data are collected from
29 males and 16 females. The males took an average of three English courses with a standard
deviation of 0.8. The females took an average of four English courses with a standard deviation of
1.0. Are the means statistically the same? Use a 5% significance level.
6. A student at a four-year college claims that mean enrollment at four–year colleges is higher than
at two–year colleges in the United States. Two surveys are conducted. Of the 35 two–year colleges
surveyed, the mean enrollment was 5,068 with a standard deviation of 4,777. Of the 35 four-year
colleges surveyed, the mean enrollment was 5,466 with a standard deviation of 8,191. At the 5%
significance level, is the mean enrollment at four-year colleges higher than at two-year colleges?
7. Mean entry-level salaries for college graduates with mechanical engineering degrees and
electrical engineering degrees are believed to be approximately the same. A recruiting office thinks
that the mean mechanical engineering salary is actually lower than the mean electrical engineering
salary. The recruiting office randomly surveys 50 entry level mechanical engineers and 60 entry
level electrical engineers. Their mean salaries were $46,100 and $46,700, respectively. Their
standard deviations were $3,450 and $4,210, respectively. Conduct a hypothesis test to determine if
you agree that the mean entry-level mechanical engineering salary is lower than the mean entry-
level electrical engineering salary. Use a 5% significance level.
8. Marketing companies have collected data implying that teenage girls use more ring tones on their
cellular phones than teenage boys do. In one particular study of 40 randomly chosen teenage girls
and boys (20 of each) with cellular phones, the mean number of ring tones for the girls was 3.2 with
a standard deviation of 1.5. The mean for the boys was 1.7 with a standard deviation of 0.8. Conduct
a hypothesis test to determine if the means are approximately the same or if the girls’ mean is higher
than the boys’ mean. Use a 5% significance level.
9. Researchers interviewed street prostitutes in Canada and the United States. The mean age of the
100 Canadian prostitutes upon entering prostitution was 18 with a standard deviation of six. The
mean age of the 130 United States prostitutes upon entering prostitution was 20 with a standard
deviation of eight. Is the mean age of entering prostitution in Canada lower than the mean age in
the United States? Test at a 1% significance level.
10. A powder diet is tested on 49 people, and a liquid diet is tested on 36 different people. Of interest
is whether the liquid diet yields a higher mean weight loss than the powder diet. The powder diet
group had a mean weight loss of 42 pounds with a standard deviation of 12 pounds. The liquid diet
group had a mean weight loss of 45 pounds with a standard deviation of 14 pounds. Test at a 5%
significance level.
a. Construct a 94% confidence interval for the difference in the mean weight loss for the powder
and liquid diets.
b. Interpret the confidence interval in part (a).
c. Is it reasonable to claim that the mean weight loss with the powder diet is less than the liquid
diet? Explain.
11. The mean speeds of fastball pitches from two different baseball pitchers are to be compared.
A sample of 14 fastball pitches is measured from each pitcher. The populations have normal
distributions. The table shows the result. Scouts believe that Rodriguez pitches a speedier
fastball. At the 1% significance level, what is your conclusion?
Pitcher      Sample Mean Speed of Pitches (mph)   Population Standard Deviation
Wesley       86                                   3
Rodriguez    91                                   7
12. A researcher is testing the effects of plant food on plant growth. Nine plants have been given
the plant food. Another nine plants have not been given the plant food. The heights of the plants
are recorded after eight weeks. The populations have normal distributions. The following table is
the result. The researcher thinks the food makes the plants grow taller. At the 1% significance
level, what is your conclusion?
Plant Group   Sample Mean Height of Plants (inches)   Population Standard Deviation
Food          16                                      2.5
No food       14                                      1.5
13. Two metal alloys are being considered as material for ball bearings. The mean melting point of
the two alloys is to be compared. 15 pieces of each metal are being tested. Both populations have
normal distributions. The following table is the result. It is believed that Alloy Zeta has a different
melting point. At the 1% significance level, what is your conclusion?
14. A study is done to determine if students in the California state university system take longer
to graduate, on average, than students enrolled in private universities. One hundred students from
both the California state university system and private universities are surveyed. The California
state university system students took on average 4.5 years with a standard deviation of 0.8. The
private university students took on average 4.1 years with a standard deviation of 0.3. Suppose that
from years of research, it is known that the population standard deviations are 1.5811 years and 1
year, respectively. At the 5% significance level, what is your conclusion?
15. Parents of teenage boys often complain that auto insurance costs more, on average, for teenage
boys than for teenage girls. A group of concerned parents examines a random sample of insurance
bills. The mean annual cost for 36 teenage boys was $679. For 23 teenage girls, it was $559. From
past years, it is known that the population standard deviation for each group is $180. Determine
whether or not you believe that the mean cost for auto insurance for teenage boys is greater than
that for teenage girls. Use a 5% significance level.
16. A group of transfer bound students wondered if they will spend the same mean amount on
texts and supplies each year at their four-year university as they have at their community college.
They conducted a random survey of 54 students at their community college and 66 students at
their local four-year university. The sample means were $947 and $1,011, respectively. The
population standard deviations are known to be $254 and $87, respectively.
a. Construct a 96% confidence interval for the difference in the mean amount students spend on
texts at university and community college.
b. Interpret the confidence interval in part (a).
c. Is it reasonable to claim that the mean amount students spend on texts is the same at
university and community college? Explain.
17. Some manufacturers claim that non-hybrid sedan cars have a lower mean miles-per-gallon
(mpg) than hybrid ones. Suppose that consumers test 21 hybrid sedans and get a mean of 31 mpg
with a standard deviation of seven mpg. Thirty-one non-hybrid sedans get a mean of 22 mpg with a
standard deviation of four mpg.
a. Construct a 95% confidence interval for the difference in the average miles-per-gallon in
hybrid and non-hybrid cars.
b. Interpret the confidence interval in part (a).
c. Is it reasonable to claim that the average mpg for hybrid cars is higher than non-hybrid cars?
Explain.
18. One of the questions in a study of marital satisfaction of dual-career couples was to rate the
statement “I’m pleased with the way we divide the responsibilities for childcare.” The ratings went
from one (strongly agree) to five (strongly disagree). The table below contains ten of the paired
responses for husbands and wives. Conduct a hypothesis test to see if the mean difference in the
husband’s versus the wife’s satisfaction level is negative (meaning that, within the partnership, the
husband is happier than the wife). Use a 5% significance level.
Wife’s Score 2 2 3 3 4 2 1 1 2 4
Husband’s Score 2 2 1 3 2 1 1 1 2 4
19. Two types of phone operating system are being tested to determine if there is a difference in the
proportions of system failures (crashes). Fifteen out of a random sample of 150 phones with OS1
had system failures within the first eight hours of operation. Nine out of another random sample
of 150 phones with OS2 had system failures within the first eight hours of operation. OS2 is
believed to be more stable (have fewer crashes) than OS1. At the 5% significance level, is there a
difference in the proportions of system failures?
20. A recent drug survey showed an increase in the use of drugs and alcohol among local high
school seniors as compared to the national percent. Suppose that a survey of 100 local seniors and
100 national seniors is conducted to see if the proportion of drug and alcohol use is higher locally
than nationally. Locally, 65 seniors reported using drugs or alcohol within the past month, while
60 national seniors reported using them. At the 5% significance level, is the proportion of drug and
alcohol abuse higher locally than nationally?
21. Neuroinvasive West Nile virus is a severe disease that affects a person’s nervous system. It is
spread by the Culex species of mosquito. In the United States in 2010 there were 629 reported
cases of neuroinvasive West Nile virus out of a total of 1,021 reported cases and there were 486
neuroinvasive reported cases out of a total of 712 cases reported in 2011. Is the 2011 proportion of
neuroinvasive West Nile virus cases more than the 2010 proportion of neuroinvasive West Nile
virus cases? Using a 1% level of significance, conduct an appropriate hypothesis test.
22. Adults aged 18 years old and older were randomly selected for a survey on obesity. Adults are
considered obese if their body mass index (BMI) is at least 30. The researchers wanted to determine
if the proportion of women who are obese in the south is less than the proportion of southern men
who are obese. The results are shown in table. Test at the 1% level of significance.
23. Two computer users were discussing tablet computers. A higher proportion of people ages 16 to
29 use tablets than the proportion of people age 30 and older. The table details the number of tablet
owners for each age group. Test at the 1% level of significance.
24. A group of friends debated whether more men use smartphones than women. They consulted a
research study of smartphone use among adults. The results of the survey indicate that of the 973
men randomly sampled, 379 use smartphones. For women, 404 of the 1,304 who were randomly
sampled use smartphones. Test at the 5% level of significance.
a. Construct a 93% confidence interval for the difference in the proportion of men and women
who use smartphones.
b. Interpret the confidence interval in part (a).
c. Is it reasonable to claim that the proportion of men who use smartphones is higher than the
proportion of women? Explain.
25. We are interested in whether children’s educational computer software costs less, on average,
than children’s entertainment software. Thirty-six educational software titles were randomly picked
from a catalog. The mean cost was $31.14 with a standard deviation of $4.69. Thirty-five
entertainment software titles were randomly picked from the same catalog. The mean cost was
$33.86 with a standard deviation of $10.87. At the 5% significance level, determine if children’s
educational software costs less, on average, than children’s entertainment software.
26. Joan Nguyen recently claimed that the proportion of college-age males with at least one pierced
ear is as high as the proportion of college-age females. She conducted a survey in her classes. Out
of 107 males, 20 had at least one pierced ear. Out of 92 females, 47 had at least one pierced ear.
a. Construct a 98% confidence interval for the difference in the proportion of college-age males
and females with at least one pierced ear.
b. Interpret the confidence interval in part (a).
c. Is it reasonable to claim that the proportion of college-age males with at least one pierced ear
equals the proportion of college-age females? Explain.
27. A study was conducted to test the effectiveness of a software patch in reducing system failures
over a six-month period. Results for randomly selected installations are shown in the table below.
The “before” value is matched to an “after” value, and the differences are calculated. The
differences have a normal distribution.
Installation A B C D E F G H
Before 3 6 4 2 5 8 2 6
After 1 5 2 0 1 0 2 2
a. Construct a 97% confidence interval for the mean difference in the number of failures before
and after the software patch was installed.
b. Interpret the confidence interval in part (a).
c. Is it reasonable to claim that the average number of failures did not change after the software
patch was installed? Explain.
28. A study was conducted to test the effectiveness of a juggling class. Before the class started, six
subjects juggled as many balls as they could at once. After the class, the same six subjects juggled
as many balls as they could. The differences in the number of balls are calculated. The differences
have a normal distribution.
Subject A B C D E F
Before 3 4 3 2 4 5
After 4 5 6 4 5 7
a. Construct a 99% confidence interval for the mean difference in the number of balls a subject
can juggle after the class.
b. Interpret the confidence interval in part (a).
c. Is it reasonable to claim that the average number of balls a subject can juggle is higher after the
class? Explain.
29. A doctor wants to know if a blood pressure medication is effective. Six subjects have their blood
pressures recorded. After twelve weeks on the medication, the same six subjects have their blood
pressure recorded again. For this test, only systolic pressure is of concern. At the 1% significance
level, did the medication, on average, lower the patients’ blood pressure?
Patient A B C D E F
30. Ten individuals went on a low–fat diet for 12 weeks to lower their cholesterol. The data are
recorded in the table below. Do you think that their cholesterol levels were significantly lowered?
Use a 5% significance level.
Starting cholesterol level   Ending cholesterol level
140 140
220 230
110 120
240 220
200 190
180 150
190 200
360 300
280 300
260 240
31. A local cancer support group believes that the estimate for new female breast cancer cases in the
south is higher in 2013 than in 2012. The group compared the estimates of new female breast cancer
cases by southern state in 2012 and in 2013. The results are in the table. At the 5% significance
level, determine if the average number of breast cancer cases is higher in 2013 than in 2012.
32. A traveler wanted to know if the prices of hotels are different in the ten cities that he visits the
most often. The list of the cities with the corresponding hotel prices for his two favorite hotel chains
is in the table. Test at the 1% level of significance.
Attribution
Chapter Outline
The chi-square distribution can be used to find relationships between two things, like grocery prices at
different stores. Photo by Pete, CC BY 4.0.
Have you ever wondered if lottery numbers were evenly distributed or if some numbers occurred
with a greater frequency? How about if the types of movies people preferred were different across
different age groups? What about if a coffee machine was dispensing approximately the same
amount of coffee each time? You could answer these questions by conducting a hypothesis test.
We will now study a new distribution called the $\chi^2$-distribution. Statistical inferences that use
the $\chi^2$-distribution can help us answer the types of questions posed above. In this chapter, we will
646 | 10.1 INTRODUCTION TO STATISTICAL INFERENCES USING THE CHI-SQUARE DISTRIBUTION
learn the three major applications of the $\chi^2$-distribution: testing a single population variance, the
goodness-of-fit test, and the test of independence.
Attribution
LEARNING OBJECTIVES
normal distribution.
• The total area under the graph of a $\chi^2$-distribution is 1.
• The mean of a $\chi^2$-distribution is its degrees of freedom: $\mu=df$.
• The variance of a $\chi^2$-distribution is twice its degrees of freedom: $\sigma^2=2 \times df$.
• The mode of a $\chi^2$-distribution is $df-2$ (when $df \geq 2$). The peak of the graph occurs at the mode.
• Probabilities associated with a $\chi^2$-distribution are given by the area under the curve of the
$\chi^2$-distribution.
• To find the area under a $\chi^2$-distribution to the left of a given $\chi^2$-score, use the chisq.dist($\chi^2$,
degrees of freedom, logic operator) function.
• The output from the chisq.dist function is the area to the left of the entered $\chi^2$-score.
• Visit the Microsoft page for more information about the chisq.dist function.
• To find the area under a $\chi^2$-distribution to the right of a given $\chi^2$-score, use the
chisq.dist.rt($\chi^2$, degrees of freedom) function.
• The output from the chisq.dist.rt function is the area to the right of the entered
10.2 THE CHI SQUARE DISTRIBUTION | 649
$\chi^2$-score.
• Visit the Microsoft page for more information about the chisq.dist.rt function.
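For readers who want to see what these two functions compute, the left-tail area can be sketched in Python with only the standard library. This is an illustrative approximation, not Excel's implementation: the helper names `chisq_dist` and `chisq_dist_rt` are hypothetical, the series for the incomplete gamma function is one of several possible methods, and the score of 10 with 12 degrees of freedom is made up for the demonstration.

```python
from math import exp, lgamma, log


def chisq_dist(score, df):
    """Area to the left of `score`, in the spirit of chisq.dist(score, df, true)."""
    if score <= 0:
        return 0.0
    a, x = df / 2, score / 2
    # Series expansion of the regularized lower incomplete gamma function P(a, x).
    term = 1 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return total * exp(-x + a * log(x) - lgamma(a))


def chisq_dist_rt(score, df):
    """Area to the right of `score`, in the spirit of chisq.dist.rt(score, df)."""
    return 1 - chisq_dist(score, df)


# Hypothetical inputs chosen only for illustration.
left = chisq_dist(10, 12)
right = chisq_dist_rt(10, 12)
print(round(left, 4), round(right, 4))
```

The two areas always add to 1 because the total area under the curve is 1.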
EXAMPLE
Solution:
Field 2 12
Field 3 true
Field 2 12
• To find the $\chi^2$-score for a given area under the $\chi^2$-distribution to the left of the $\chi^2$-score, use
the chisq.inv(area to the left, degrees of freedom) function.
◦ For area to the left, enter the area to the left of the required $\chi^2$-score.
◦ For degrees of freedom, enter the value of the degrees of freedom for the
$\chi^2$-distribution.
• The output from the chisq.inv function is the value of the $\chi^2$-score so that the area to the
left of the $\chi^2$-score is the entered area.
• Visit the Microsoft page for more information about the chisq.inv function.
• To find the $\chi^2$-score for a given area under the $\chi^2$-distribution to the right of the $\chi^2$-score,
use the chisq.inv.rt(area to the right, degrees of freedom) function.
◦ For area to the right, enter the area to the right of the required $\chi^2$-score.
◦ For degrees of freedom, enter the value of the degrees of freedom for the
$\chi^2$-distribution.
• The output from the chisq.inv.rt function is the value of the $\chi^2$-score so that the area to
the right of the $\chi^2$-score is the entered area.
• Visit the Microsoft page for more information about the chisq.inv.rt function.
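The inverse functions can be thought of as searching for the score whose tail area matches the area entered. A minimal Python sketch of that idea follows, using bisection on a left-tail area function. The names `chisq_dist`, `chisq_inv` and `chisq_inv_rt` are hypothetical helpers, bisection is a stand-in for whatever method Excel uses, and the entered area is assumed to lie strictly between 0 and 1.

```python
from math import exp, lgamma, log


def chisq_dist(score, df):
    """Area to the left of `score` under a chi-square curve."""
    if score <= 0:
        return 0.0
    a, x = df / 2, score / 2
    term = 1 / a          # series for the regularized lower incomplete gamma
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return total * exp(-x + a * log(x) - lgamma(a))


def chisq_inv(area_left, df):
    """Score with `area_left` to its left, in the spirit of chisq.inv."""
    low, high = 0.0, 1.0
    while chisq_dist(high, df) < area_left:   # widen until the answer is bracketed
        high *= 2
    for _ in range(200):                      # bisection on the left-tail area
        middle = (low + high) / 2
        if chisq_dist(middle, df) < area_left:
            low = middle
        else:
            high = middle
    return (low + high) / 2


def chisq_inv_rt(area_right, df):
    """Score with `area_right` to its right, in the spirit of chisq.inv.rt."""
    return chisq_inv(1 - area_right, df)


# Areas of 0.25 (left) and 0.148 (right) with 37 degrees of freedom.
score_left = chisq_inv(0.25, 37)
score_right = chisq_inv_rt(0.148, 37)
print(round(score_left, 4), round(score_right, 4))
```

Feeding either result back into `chisq_dist` should return the area that was entered, which is a quick way to check an answer.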
EXAMPLE
1. Find the $\chi^2$-score so that the area under the $\chi^2$-distribution to the left of $\chi^2$ is 0.25.
2. Find the $\chi^2$-score so that the area under the $\chi^2$-distribution to the right of $\chi^2$ is 0.148.
Solution:
1. Function chisq.inv
Field 1 0.25
Field 2 37
2. Function chisq.inv.rt
Field 1 0.148
Field 2 37
Concept Review
The $\chi^2$-distribution is a useful tool for assessment in a series of problem categories. These problem
categories include: determining if a data set fits a particular distribution, determining if the
distributions of two populations are the same, determining if two categorical variables are
independent or dependent, and determining if there is a different variability than expected within
a population.
An important parameter in a $\chi^2$-distribution is the degrees of freedom in a given problem. The
$\chi^2$-distribution curve is skewed to the right, and its shape depends on the degrees of freedom. As
the degrees of freedom increase, the curve of a $\chi^2$-distribution approaches a normal distribution.
Attribution
“11.1 Facts About the Chi-Square Distribution” in Introductory Statistics by OpenStax is licensed
under a Creative Commons Attribution 4.0 International License.
10.3 STATISTICAL INFERENCE FOR A
SINGLE POPULATION VARIANCE
LEARNING OBJECTIVES
The mean of a population is important, but in many cases the variance of the population is just as
important. In most production processes, quality is measured by how closely the process matches
the target (i.e. the mean) and by the variability (i.e. the variance) of the process. For example, if a
process is to fill bags of coffee beans, we are interested in both the average weight of the bag and
how much variation there is in the weight of the bags. The quality is considered poor if the average
weight of the bags is accurate but the variance of the weight of the bags is too high—a variance that
is too large means some bags would be too full and some bags would be almost empty.
As with other population parameters, we can construct a confidence interval to capture the
population variance and conduct a hypothesis test on the population variance. In order to construct
a confidence interval or conduct a hypothesis test on a population variance $\sigma^2$, we need to use the
distribution of $\frac{(n-1) \times s^2}{\sigma^2}$. Suppose we have a normal population with population variance $\sigma^2$
and a sample of size $n$ is taken from the population. The sampling distribution of
$\frac{(n-1) \times s^2}{\sigma^2}$ follows a $\chi^2$-distribution with $n-1$ degrees of freedom.
To construct the confidence interval, take a random sample of size $n$ from a normally distributed
population. Calculate the sample variance $s^2$. The limits for the confidence interval with
confidence level $C$ for an unknown population variance $\sigma^2$ are
\begin{eqnarray*} \mbox{Lower Limit} & = & \frac{(n-1) \times s^2}{\chi^2_R} \\ \\
\mbox{Upper Limit} & = & \frac{(n-1) \times s^2}{\chi^2_L} \\ \\ \end{eqnarray*}
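where $\chi^2_L$ is the $\chi^2$-score so that the area in the left-tail of the $\chi^2$-distribution is $\frac{1-C}{2}$,
$\chi^2_R$ is the $\chi^2$-score so that the area in the right-tail of the $\chi^2$-distribution is $\frac{1-C}{2}$ and the
$\chi^2$-distribution has $n-1$ degrees of freedom.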
NOTES
1. Like the other confidence intervals we have seen, the $\chi^2$-scores are the values that trap
$C \times 100\%$ of the observations in the middle of the distribution so that the area of each tail is
$\frac{1-C}{2}$.
2. Because the $\chi^2$-distribution is not symmetrical, the confidence interval for a population
variance requires that we calculate two different $\chi^2$-scores: one for the left tail and one
for the right tail. In Excel, we will need to use both the chisq.inv function (for the left tail)
and the chisq.inv.rt function (for the right tail) to find the two different $\chi^2$-scores.
3. The $\chi^2$-score for the left tail is part of the formula for the upper limit and the $\chi^2$-score for
the right tail is part of the formula for the lower limit. This is not a mistake. It follows
from the formula used to determine the limits for the confidence interval.
EXAMPLE
A local telecom company conducts broadband speed tests to measure how much data per second
passes between a customer’s computer and the internet compared to what the customer pays for as
part of their plan. The company needs to estimate the variance in the broadband speed. A sample of
15 ISPs is taken and the amount of data per second is recorded. The variance in the sample is 174.
1. Construct a 97% confidence interval for the variance in the amount of data per second that
passes between a customer’s computer and the internet.
2. Interpret the confidence interval found in part 1.
Solution:
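1. To find the confidence interval, we need to find the $\chi^2_L$-score for the 97% confidence interval.
This means that we need to find the $\chi^2_L$-score so that the area in the left tail is
$\frac{1-0.97}{2}=0.015$. The degrees of freedom for the $\chi^2$-distribution is $n-1=15-1=14$.
Function chisq.inv
Field 1 0.015
Field 2 14
We also need to find the $\chi^2_R$-score for the 97% confidence interval. This means that we need to
find the $\chi^2_R$-score so that the area in the right tail is $\frac{1-0.97}{2}=0.015$. The degrees of
freedom for the $\chi^2$-distribution is $n-1=15-1=14$.
Function chisq.inv.rt
Field 1 0.015
Field 2 14
So the confidence interval is
\begin{eqnarray*} \mbox{Lower Limit} & = & \frac{(n-1) \times s^2}{\chi^2_R} \\ & = & \frac{(15-1) \times 174}{27.82\ldots} \\ & = & 87.54 \\ \\
\mbox{Upper Limit} & = & \frac{(n-1) \times s^2}{\chi^2_L} \\ & = & \frac{(15-1) \times 174}{5.057\ldots} \\ & = & 481.69 \end{eqnarray*}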
2. We are 97% confident that the variance in the amount of data per second that passes
between a customer’s computer and the internet is between 87.54 and 481.69.
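The interval in this example can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: `chisq_dist` and `chisq_inv` are hypothetical standard-library stand-ins for the Excel functions chisq.dist and chisq.inv, and the right-tail score is obtained as the score with area $1-0.015$ to its left, which is what chisq.inv.rt(0.015, 14) describes.

```python
from math import exp, lgamma, log


def chisq_dist(score, df):
    """Area to the left of `score` under a chi-square curve."""
    if score <= 0:
        return 0.0
    a, x = df / 2, score / 2
    term = 1 / a          # series for the regularized lower incomplete gamma
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return total * exp(-x + a * log(x) - lgamma(a))


def chisq_inv(area_left, df):
    """Score with `area_left` to its left, found by bisection."""
    low, high = 0.0, 1.0
    while chisq_dist(high, df) < area_left:
        high *= 2
    for _ in range(200):
        middle = (low + high) / 2
        if chisq_dist(middle, df) < area_left:
            low = middle
        else:
            high = middle
    return (low + high) / 2


n, sample_variance, confidence = 15, 174, 0.97
df = n - 1
tail_area = (1 - confidence) / 2            # 0.015 in each tail
chi2_left = chisq_inv(tail_area, df)        # like chisq.inv(0.015, 14)
chi2_right = chisq_inv(1 - tail_area, df)   # like chisq.inv.rt(0.015, 14)

# The right-tail score gives the lower limit; the left-tail score gives the upper.
lower_limit = df * sample_variance / chi2_right
upper_limit = df * sample_variance / chi2_left
print(round(lower_limit, 2), round(upper_limit, 2))
```

Note that the scores are kept at full precision until the limits are rounded, which is the same practice recommended for the Excel calculation.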
NOTES
1. When calculating the limits for the confidence interval, keep all of the decimals in the
$\chi^2$-scores and other values throughout the calculation. This will ensure that there is no round-
off error in the answer. You can use Excel to do the calculations of the limits, clicking on the
cells containing the $\chi^2$-scores and any other values.
2. When writing down the interpretation of the confidence interval, make sure to include the
confidence level and the actual population variance captured by the confidence interval (i.e.
be specific to the context of the question). In this case, there are no units for the limits
because variance does not have any units.
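1. Write down the null and alternative hypotheses in terms of the population variance $\sigma^2$.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level $\alpha$.
4. Use the $\chi^2$-distribution to find the p-value (the area in the corresponding tail) for the test.
The $\chi^2$-score and degrees of freedom are
\begin{eqnarray*} \chi^2 & = & \frac{(n-1) \times s^2}{\sigma^2} \\ \\ df & = & n-1 \end{eqnarray*}
where $\sigma^2$ is the value of the population variance claimed in the null hypothesis.
5. Compare the p-value to the significance level and state the outcome of the test:
◦ If p-value $\leq \alpha$, reject the null hypothesis in favour of the alternative hypothesis.
◦ If p-value $\gt \alpha$, do not reject the null hypothesis.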
EXAMPLE
A statistics instructor at a local college claims that the variance for the final exam scores was 25.
After speaking with his classmates, one of the class’s best students thinks that the variance for the final
exam scores is higher than the instructor claims. The student challenges the instructor to prove her
claim. The instructor takes a sample of 30 final exams and finds the variance of the scores is 28. At the
5% significance level, test if the variance of the final exam scores is higher than the instructor claims.
Solution:
Hypotheses:
\begin{eqnarray*} H_0: & & \sigma^2=25 \\ H_a: & & \sigma^2 \gt 25
\end{eqnarray*}
p-value:
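Because the alternative hypothesis is a $\gt$, the p-value is the area in the right tail of the
$\chi^2$-distribution.
To use the chisq.dist.rt function, we need to calculate the $\chi^2$-score and the degrees of freedom:
\begin{eqnarray*} \chi^2 & = & \frac{(n-1) \times s^2}{\sigma^2} \\ & = & \frac{(30-1) \times 28}{25} \\ & = & 32.48 \\ \\ df & = & n-1 \\ & = & 30-1 \\ & = & 29
\end{eqnarray*}
Function chisq.dist.rt
Field 1 32.48
Field 2 29
So the p-value $=0.2992$.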
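Conclusion:
Because p-value $=0.2992 \gt 0.05=\alpha$, we do not reject the null hypothesis. At the 5%
significance level there is not enough evidence to suggest that the variance of the final exam
scores is higher than 25.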
NOTES
1. The null hypothesis $\sigma^2=25$ is the claim that the variance on the final exam is 25.
2. The alternative hypothesis $\sigma^2 \gt 25$ is the claim that the variance on the final exam is
greater than 25.
3. There are no units included with the hypotheses because variance does not have any units.
4. The p-value is the area in the right tail of the $\chi^2$-distribution, to the right of $\chi^2=32.48$.
In the calculation of the p-value:
◦ The function is chisq.dist.rt because we are finding the area in the right tail of a
$\chi^2$-distribution.
◦ Field 1 is the value of $\chi^2$.
◦ Field 2 is the degrees of freedom.
5. The p-value of 0.2992 is a large probability compared to the significance level, and so is likely
to happen assuming the null hypothesis is true. The sample therefore gives us no reason to doubt
the null hypothesis, and so the conclusion of the test is to not reject the null hypothesis. In
other words, there is not enough evidence to conclude that the variance of the scores on the
final exam is greater than 25.
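The right-tailed p-value in this example can be checked with a brief Python sketch. The helper `chisq_dist` is a hypothetical standard-library stand-in for Excel's chisq.dist, so `1 - chisq_dist(...)` plays the role of chisq.dist.rt; it is offered as an illustration rather than a replacement for the Excel steps.

```python
from math import exp, lgamma, log


def chisq_dist(score, df):
    """Area to the left of `score` under a chi-square curve."""
    if score <= 0:
        return 0.0
    a, x = df / 2, score / 2
    term = 1 / a          # series for the regularized lower incomplete gamma
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return total * exp(-x + a * log(x) - lgamma(a))


n, sample_variance, claimed_variance = 30, 28, 25
df = n - 1
chi2_score = df * sample_variance / claimed_variance

# Right-tailed test: the p-value is the area to the right of the score.
p_value = 1 - chisq_dist(chi2_score, df)
print(round(chi2_score, 2), df, round(p_value, 4))
```

Comparing `p_value` with the significance level of 0.05 then gives the same decision as the worked solution.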
EXAMPLE
With individual lines at its various windows, a post office finds that the standard deviation for
normally distributed waiting times for customers is 7.2 minutes. The post office experiments with a
single, main waiting line and finds that for a random sample of 25 customers the waiting times for
customers have a standard deviation of 4.5 minutes. At the 5% significance level, determine if the
single line changed the variation among the wait times for customers.
Solution:
Hypotheses:
\begin{eqnarray*} H_0: & & \sigma^2=51.84 \\ H_a: & & \sigma^2 \neq 51.84
\end{eqnarray*}
p-value:
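Because the alternative hypothesis is a $\neq$, the p-value is the sum of the areas in the tails of the
$\chi^2$-distribution.
The $\chi^2$-score and the degrees of freedom are:
\begin{eqnarray*} \chi^2 & = & \frac{(n-1) \times s^2}{\sigma^2} \\ & = & \frac{(25-1) \times 4.5^2}{7.2^2} \\ & = & 9.375 \\ \\ df & = & n-1 \\ & = & 25-1 \\ & = & 24
\end{eqnarray*}
Because this is a two-tailed test, we need to know which tail (left or right) we have the $\chi^2$-score for
so that we can use the correct Excel function. If $\chi^2 \gt df$, the $\chi^2$-score corresponds to the
right tail. If $\chi^2 \lt df$, the $\chi^2$-score corresponds to the left tail. In this case,
$\chi^2=9.375 \lt 24=df$, so the $\chi^2$-score corresponds to the left tail. We need to use
chisq.dist to find the area in the left tail.
Function chisq.dist
Field 1 9.375
Field 2 24
Field 3 true
So the area in the left tail is 0.0033, which means that $\frac{1}{2}$(p-value)=0.0033. This is also the area in
the right tail, so
p-value=$0.0033+0.0033=0.0066$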
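Conclusion:
Because p-value $=0.0066 \lt 0.05=\alpha$, we reject the null hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that the single line
changed the variation among the wait times for customers.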
NOTES
1. The null hypothesis is the claim that the variance in the wait times is 51.84.
Note that we were given the standard deviation ($\sigma=7.2$) in the question. But this is a test
on variance, so we must write the hypotheses in terms of the variance
$\sigma^2=7.2^2=51.84$.
2. The alternative hypothesis is the claim that the variance in the wait times has
changed from 51.84.
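3. There are no units included with the hypotheses. Variance is measured in the square of the data's units (here, minutes squared), and those squared units are conventionally left out of the hypotheses.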
4. In a two-tailed hypothesis test for population variance, we will only have sample information
relating to one of the two tails. We must determine which of the tails the sample information
belongs to, and then calculate the area in that tail. The area in each tail represents exactly
half of the p-value, so the p-value is the sum of the areas in the two tails.
▪ If the $\chi^2$-score is in the left tail, we use chisq.dist to find the area in the left tail.
The area in the right tail equals the area in the left tail, so we can find the p-value by
adding the output from this function to itself.
▪ If the $\chi^2$-score is in the right tail, we use chisq.dist.rt to find the area in the right tail.
The area in the left tail equals the area in the right tail, so we can find the p-value by
adding the output from this function to itself.
5. The p-value of 0.0066 is a small probability compared to the significance level, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis. In other words, the variance
in the wait times has most likely changed.
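As a cross-check outside Excel, the two-tailed calculation for the post office example can be sketched in Python. This is a minimal sketch rather than the textbook's method: `chi2_right_tail` is a hypothetical helper written here to play the role of chisq.dist.rt, and doubling the smaller of the two tail areas is an assumption that agrees with note 4 for this example.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area of a chi-square curve (hypothetical stand-in for chisq.dist.rt)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Closed-form starting points: df = 2 gives exp(-y), df = 1 gives erfc(sqrt(y)).
    if df % 2 == 0:
        a, area = 1.0, math.exp(-y)
    else:
        a, area = 0.5, math.erfc(math.sqrt(y))
    # Add two degrees of freedom per step: Q(a+1, y) = Q(a, y) + y^a e^(-y) / Gamma(a+1).
    while a < df / 2.0:
        area += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return area

n, s, sigma0 = 25, 4.5, 7.2           # sample size, sample sd, claimed population sd
df = n - 1                            # degrees of freedom
chi2 = df * s**2 / sigma0**2          # test statistic, 9.375 for these numbers
right = chi2_right_tail(chi2, df)     # area to the right of chi2
left = 1.0 - right                    # area to the left, like chisq.dist(chi2, df, true)
p_value = 2 * min(left, right)        # two-tailed: double the tail the sample falls in

print(f"chi2 = {chi2:.4f}, df = {df}, left tail = {left:.4f}, p-value = {p_value:.4f}")
```

Because the left-tail area is the smaller one here, the sketch doubles it, which mirrors adding the chisq.dist output to itself.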
TRY IT
A scuba instructor wants to record the collective depths each of his students dives during their
checkout. He is interested in how the depths vary, even though everyone should have been at the
same depth. He believes the standard deviation of the depths is 1.2 meters. But his assistant thinks
the standard deviation is less than 1.2 meters. The instructor wants to test this claim. The scuba
instructor uses his most recent class of 20 students as a sample and finds that the standard deviation
of the depths is 0.85 meters. At the 1% significance level, test if the variability in the depths of the
student scuba divers is less than claimed.
Hypotheses:
\begin{eqnarray*} H_0: & & \sigma^2=1.44 \\ H_a: & & \sigma^2 \lt 1.44
\end{eqnarray*}
p-value:
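Because the alternative hypothesis is a $\lt$, the p-value is the area in the left tail of the
$\chi^2$-distribution.
To use the chisq.dist function, we need to calculate the $\chi^2$-score and the degrees of
freedom:
\begin{eqnarray*} \chi^2 & = & \frac{(n-1) \times s^2}{\sigma^2} = \frac{(20-1) \times 0.85^2}{1.2^2} = 9.5329\ldots \\ df & = & n-1 = 20-1 = 19
\end{eqnarray*}
Function chisq.dist
Field 1 9.5329...
Field 2 19
Field 3 true
So the p-value is approximately 0.0365.
Conclusion:
Because the p-value is greater than $0.01=\alpha$, we do not reject the null hypothesis. At the 1%
significance level there is not enough evidence to suggest that the variability in the depths of the
student scuba divers is less than claimed.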
Watch this video: Hypothesis Tests for One Population Variance by jbstatistics [8:51]
Concept Review
freedom.
The hypothesis test for a population variance is a well-established process:
1. Write down the null and alternative hypotheses in terms of the population variance $\sigma^2$.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level.
4. Find the p-value (the area in the corresponding tail) for the test using the $\chi^2$-distribution
where $\displaystyle{\chi^2=\frac{(n-1) \times s^2}{\sigma^2}}$ and $df=n-1$.
5. Compare the p-value to the significance level and state the outcome of the test.
6. Write down a concluding sentence specific to the context of the question.
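The six steps above can be sketched as a small Python function. This is an illustrative sketch under stated assumptions: `variance_test` and `chi2_right_tail` are hypothetical names introduced here (they are not Excel or library functions), and the two-tailed branch doubles the smaller tail area. The usage lines apply it to the scuba Try It data (n = 20, s = 0.85, claimed variance 1.44, left-tailed).

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area of a chi-square curve (hypothetical stand-in for chisq.dist.rt)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, area = 1.0, math.exp(-y)                 # exact for df = 2
    else:
        a, area = 0.5, math.erfc(math.sqrt(y))      # exact for df = 1
    while a < df / 2.0:                             # step up two df at a time
        area += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return area

def variance_test(n, s, sigma0_sq, tail):
    """Return (chi2, df, p_value) for a test on one population variance.

    tail is "left", "right" or "two", matching the form of the alternative hypothesis.
    """
    df = n - 1
    chi2 = df * s**2 / sigma0_sq
    right = chi2_right_tail(chi2, df)
    left = 1.0 - right
    if tail == "left":
        p_value = left
    elif tail == "right":
        p_value = right
    elif tail == "two":
        p_value = 2 * min(left, right)
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return chi2, df, p_value

# Scuba example: H0 variance = 1.44, Ha variance < 1.44, alpha = 0.01.
chi2, df, p_value = variance_test(n=20, s=0.85, sigma0_sq=1.44, tail="left")
alpha = 0.01
decision = "reject H0" if p_value < alpha else "do not reject H0"
print(f"chi2 = {chi2:.4f}, df = {df}, p-value = {p_value:.4f}: {decision}")
```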
where is the -score so that the area in the left-tail of of the -distribution is ,
is the -score so that the area in the right-tail of of the -distribution is and the
Attribution
“11.6 Test of a Single Variance” in Introductory Statistics by OpenStax is licensed under a Creative
Commons Attribution 4.0 International License.
10.4 THE GOODNESS-OF-FIT TEST
LEARNING OBJECTIVES
Recall that a categorical (or qualitative) variable is a variable where the data can be grouped
by specific categories. Examples of categorical variables include eye colour, blood type, or brand
of car. A categorical variable is a random variable that takes on categories. Suppose we want to
determine whether the data from a categorical variable “fit” a particular distribution or not. That
is, for a categorical variable with a historical or assumed probability distribution, does a new sample
from the population support the assumed probability distribution or does the sample indicate that
there has been a change in the probability distribution?
The $\chi^2$-goodness-of-fit test allows us to test if the sample data from a categorical variable fit the
pattern of expected probabilities for the variable. In a $\chi^2$-goodness-of-fit test, we are analyzing
the distribution of the frequencies for one categorical variable. This is a hypothesis test where
the hypotheses state that the categorical variable does or does not follow an assumed probability
distribution and a $\chi^2$-distribution is used to determine the p-value for the test.
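The $\chi^2$-goodness-of-fit test is a well-established process:
1. Write down the null and alternative hypotheses. The null hypothesis is the claim that the
categorical variable follows the hypothesized distribution and the alternative hypothesis is the
claim that the categorical variable does not follow the hypothesized distribution.
2. Collect the sample information for the test and identify the significance level $\alpha$.
3. Use the $\chi^2$-distribution to find the p-value, which is the area in the right tail of the
distribution. The $\chi^2$-score and degrees of freedom are
\begin{eqnarray*} \chi^2 & = & \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}} \\ df & = & k-1
\end{eqnarray*}
where $k$ is the number of categories.
4. Compare the p-value to the significance level and state the outcome of the test.
5. Write down a concluding sentence specific to the context of the question.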
NOTES
1. The null hypothesis is the claim that the categorical variable follows the assumed
distribution. That is, the probability $p_i$ of each possible outcome of the categorical
variable equals a hypothesized probability.
2. The alternative hypothesis is the claim that the categorical variable does not follow the
assumed distribution. That is, for at least one possible outcome of the categorical variable
the probability $p_i$ does not equal the claimed probability.
3. In order to use the $\chi^2$-goodness-of-fit test, the expected frequency for each category must
be at least 5.
4. The p-value for a $\chi^2$-goodness-of-fit test is always the area in the right tail of the
$\chi^2$-distribution. So, we use chisq.dist.rt to find the p-value for a $\chi^2$-goodness-of-fit test.
5. To calculate the $\chi^2$-score:
i. Find the difference between the observed frequency (from the sample) and
the expected frequency (from the null hypothesis). The expected frequency
equals $n \times p_i$ where $n$ is the sample size and $p_i$ is the assumed probability
for the $i$th outcome claimed in the null hypothesis.
ii. Square the difference in step (i).
iii. Divide the value found in step (ii) by the expected frequency.
iv. Add up the values found in step (iii) for all of the categories.
6. We expect that there will be a discrepancy between the observed frequency and the
expected frequency. If this discrepancy is very large, the value of $\chi^2$ will be very large and
result in a small p-value.
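The calculation described in notes 3 to 5 can be sketched in Python. This is a minimal sketch on made-up data: the three-category counts and probabilities below are hypothetical, and `chi2_right_tail` is a hypothetical helper standing in for chisq.dist.rt.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area of a chi-square curve (hypothetical stand-in for chisq.dist.rt)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, area = 1.0, math.exp(-y)                 # exact for df = 2
    else:
        a, area = 0.5, math.erfc(math.sqrt(y))      # exact for df = 1
    while a < df / 2.0:                             # step up two df at a time
        area += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return area

# Hypothetical data: claimed probabilities and observed counts for three categories.
probabilities = [0.5, 0.3, 0.2]
observed = [45, 35, 20]

n = sum(observed)                                   # sample size
expected = [n * p for p in probabilities]           # expected frequency n * p_i

# Note 3: every expected frequency should be at least 5.
condition_ok = all(e >= 5 for e in expected)

# Note 5, steps (i) to (iii) for each category, then the sum over all categories.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(contributions)

df = len(observed) - 1                              # number of categories minus 1
p_value = chi2_right_tail(chi2, df)                 # always the right tail (note 4)

print(f"expected = {expected}, chi2 = {chi2:.4f}, df = {df}, p-value = {p_value:.4f}")
```

A large gap between observed and expected counts inflates the corresponding contribution, which is why a large $\chi^2$-score leads to a small right-tail area.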
EXAMPLE
Absenteeism of college students from math classes is a major concern to math instructors because
missing class appears to increase the drop rate. Suppose that a study was done to determine if the
actual student absenteeism rate follows faculty perception. The faculty believe that the distribution
of the number of absences per term is as follows:
Number of Absences per Term Expected Percent of Students
0–2 50%
3–5 30%
6–8 12%
9–11 6%
12+ 2%
At the end of the semester, a random survey of 300 students across all mathematics courses was
taken and the actual (observed) number of absences for the 300 students is recorded.
Number of Absences per Term Actual Number of Students
0–2 120
3–5 100
6–8 55
9–11 15
12+ 10
At the 5% significance level, determine if the number of absences per term follows the distribution
assumed by the faculty.
Solution:
Let $p_1$ be the probability a student has 0–2 absences, $p_2$ be the probability a student has 3–5
absences, $p_3$ be the probability a student has 6–8 absences, $p_4$ be the probability a student has 9–11
absences, and $p_5$ be the probability a student has 12 or more absences.
Hypotheses:
\begin{eqnarray*} H_0: & & p_1=50\%, p_2=30\%, p_3=12\%, p_4=6\%, p_5=2\% \\ H_a:
& & \mbox{at least one of the } p_i's \mbox{ does not equal its stated probability}
\end{eqnarray*}
p-value:
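From the question, we have $n=300$ and $k=5$. Now we need to calculate the $\chi^2$-score for
the test.
The observed frequency for each category is the number of observations in the sample that fall into
that category. This is the information provided in the sample above.
Next, we must calculate the expected frequencies. The expected frequency is the number of
observations we would expect to see in the sample, assuming the null hypothesis is true. To calculate
the expected frequency for each category, we multiply the sample size by the probability associated
with that category claimed in the null hypothesis.
Number of Absences per Term Observed Frequency Expected Frequency
0–2 120 0.5 × 300 = 150
3–5 100 0.3 × 300 = 90
6–8 55 0.12 × 300 = 36
9–11 15 0.06 × 300 = 18
12+ 10 0.02 × 300 = 6
\begin{eqnarray*} \chi^2 & = & \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}} \\ & = & \frac{(120-150)^2}{150}+\frac{(100-90)^2}{90}+\frac{(55-36)^2}{36}+\frac{(15-18)^2}{18}+\frac{(10-6)^2}{6} \\ & = & 20.3055\ldots \\ df & = & k-1=5-1=4
\end{eqnarray*}
Function chisq.dist.rt
Field 1 20.3055...
Field 2 4
Answer 0.0004
So the p-value $=0.0004$.
Conclusion:
Because p-value $=0.0004 \lt 0.05=\alpha$, we reject the null hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that the number of
absences per term does not follow the distribution assumed by the faculty.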
NOTES
1. The null hypothesis is the claim that the percent of students that fall into each category is as
stated. That is, 50% of the students miss between 0 and 2 classes, 30% of the students miss between
3 and 5 classes, etc.
2. The alternative hypothesis is the claim that at least one of the percents of students that fall into
each category is not as stated. The alternative hypothesis does not say that every $p_i$ does not
equal its stated probability, only that at least one of them does not equal its stated probability.
3. Keep all of the decimals throughout the calculation (i.e. in the calculation of the $\chi^2$-score) to
avoid any round-off error in the calculation of the p-value. This ensures that we get the most
accurate value for the p-value. You can use Excel to calculate the expected frequencies and
the $\chi^2$-score.
4. The p-value is the area in the right tail of the $\chi^2$-distribution, to the right of
$\chi^2=20.3055\ldots$. In the calculation of the p-value:
◦ The function is chisq.dist.rt because we are finding the area in the right tail of a
$\chi^2$-distribution.
◦ Field 1 is the value of $\chi^2$.
◦ Field 2 is the value of the degrees of freedom $df$.
5. The p-value of 0.0004 is a small probability compared to the significance level, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis. In other words, student
absenteeism does not fit faculty perception.
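The absenteeism calculation can be reproduced outside Excel with a short Python sketch. The data are the percentages and counts given in the example; `goodness_of_fit` and `chi2_right_tail` are hypothetical helper names introduced here, with the latter standing in for chisq.dist.rt.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area of a chi-square curve (hypothetical stand-in for chisq.dist.rt)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, area = 1.0, math.exp(-y)                 # exact for df = 2
    else:
        a, area = 0.5, math.erfc(math.sqrt(y))      # exact for df = 1
    while a < df / 2.0:                             # step up two df at a time
        area += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return area

def goodness_of_fit(observed, probabilities):
    """Return (expected, chi2, df, p_value) for a chi-square goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return expected, chi2, df, chi2_right_tail(chi2, df)

categories = ["0-2", "3-5", "6-8", "9-11", "12+"]
claimed = [0.50, 0.30, 0.12, 0.06, 0.02]            # faculty's assumed distribution
observed = [120, 100, 55, 15, 10]                   # survey of 300 students

expected, chi2, df, p_value = goodness_of_fit(observed, claimed)
for cat, o, e in zip(categories, observed, expected):
    print(f"{cat:>5}: observed {o:3d}, expected {e:6.1f}")
print(f"chi2 = {chi2:.4f}, df = {df}, p-value = {p_value:.4f}")
```

Unrounded values are carried through to the p-value, in line with note 3.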
EXAMPLE
Employers want to know which days of the week employees have the highest number of absences in
a five-day work week. Most employers would like to believe that employees are absent equally
during the week. Suppose a random sample of 60 managers are asked on which day of the week they
had the highest number of employee absences. The results are recorded in the table below. At the
5% significance level, test if the day of the week with the highest number of absences occurs with
equal frequency during a five-day work week.
Day of the Week Number of Managers
Monday 15
Tuesday 11
Wednesday 10
Thursday 9
Friday 15
Solution:
Let $p_1$ be the probability the highest number of absences occurs on Monday, $p_2$ be the probability the
highest number of absences occurs on Tuesday, $p_3$ be the probability the highest number of
absences occurs on Wednesday, $p_4$ be the probability the highest number of absences occurs on
Thursday, and $p_5$ be the probability the highest number of absences occurs on Friday.
If the day of the week with the highest number of absences occurs with equal frequency, then the
probability that any day has the highest number of absences is the same as any other day. Because
there are 5 days (categories), if the frequencies are equal then each day would have a probability of
20%.
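Hypotheses:
\begin{eqnarray*} H_0: & & p_1=p_2=p_3=p_4=p_5=20\% \\ H_a:
& & \mbox{at least one of the } p_i's \mbox{ does not equal } 20\%
\end{eqnarray*}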
p-value:
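From the question, we have $n=60$ and $k=5$. Now we need to calculate the $\chi^2$-score for the
test.
The observed frequency for each category is the number of observations in the sample that fall into
that category. This is the information provided in the sample above.
Next, we must calculate the expected frequencies. The expected frequency is the number of
observations we would expect to see in the sample, assuming the null hypothesis is true. To calculate
the expected frequency for each category, we multiply the sample size by the probability associated
with that category claimed in the null hypothesis.
Day of the Week Observed Frequency Expected Frequency
Monday 15 0.2 × 60 = 12
Tuesday 11 0.2 × 60 = 12
Wednesday 10 0.2 × 60 = 12
Thursday 9 0.2 × 60 = 12
Friday 15 0.2 × 60 = 12
\begin{eqnarray*} \chi^2 & = & \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}} \\ & = & \frac{(15-12)^2}{12}+\frac{(11-12)^2}{12}+\frac{(10-12)^2}{12}+\frac{(9-12)^2}{12}+\frac{(15-12)^2}{12} \\ & = & 2.6666\ldots \\ df & = & k-1=5-1=4
\end{eqnarray*}
Function chisq.dist.rt
Field 1 2.6666...
Field 2 4
Answer 0.6151
So the p-value $=0.6151$.
Conclusion:
Because p-value $=0.6151 \gt 0.05=\alpha$, we do not reject the null hypothesis. At the 5%
significance level there is not enough evidence to suggest that the day of the week with the highest
number of absences does not occur with equal frequency during a five-day work week.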
NOTES
1. The null hypothesis is the claim that the probability each day of the week has the highest
number of absences is 20%.
2. The alternative hypothesis is the claim that at least one of the probabilities is not 20%. The
alternative hypothesis does not say that every $p_i$ does not equal 20%, only that at least one of them
does not equal 20%.
3. Keep all of the decimals throughout the calculation (i.e. in the calculation of the $\chi^2$-score) to
avoid any round-off error in the calculation of the p-value. This ensures that we get the most
accurate value for the p-value.
4. The p-value of 0.6151 is a large probability compared to the significance level, and so is likely
to happen assuming the null hypothesis is true. This suggests that the assumption that the
null hypothesis is true is most likely correct, and so the conclusion of the test is to not reject
the null hypothesis.
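When the null hypothesis claims equal probabilities, each expected frequency is simply the sample size divided by the number of categories. The sketch below illustrates this for the day-of-the-week example; `chi2_right_tail` is a hypothetical helper standing in for chisq.dist.rt, not a library function.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area of a chi-square curve (hypothetical stand-in for chisq.dist.rt)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, area = 1.0, math.exp(-y)                 # exact for df = 2
    else:
        a, area = 0.5, math.erfc(math.sqrt(y))      # exact for df = 1
    while a < df / 2.0:                             # step up two df at a time
        area += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return area

observed = {"Monday": 15, "Tuesday": 11, "Wednesday": 10, "Thursday": 9, "Friday": 15}

n = sum(observed.values())            # 60 managers
k = len(observed)                     # 5 categories
expected = n / k                      # equal probabilities: 60 / 5 = 12 per day

chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
df = k - 1
p_value = chi2_right_tail(chi2, df)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "do not reject H0"
print(f"chi2 = {chi2:.4f}, df = {df}, p-value = {p_value:.4f}: {decision}")
```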
TRY IT
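Teachers want to know which night each week their students are doing most of their homework.
Most teachers think that students do homework equally throughout the week. Suppose a random
sample of 56 students is asked on which night of the week they did the most homework. The
results are shown in the table below. At the 5% significance level, are the nights that students do
most of their homework equally distributed?
Night of the Week Number of Students
Sunday 11
Monday 8
Tuesday 10
Wednesday 7
Thursday 10
Friday 5
Saturday 5
Let $p_1$ be the probability students do their homework on Sunday, $p_2$ be the probability students do
their homework on Monday, $p_3$ be the probability students do their homework on Tuesday, $p_4$ be
the probability students do their homework on Wednesday, $p_5$ be the probability students do their
homework on Thursday, $p_6$ be the probability students do their homework on Friday, and $p_7$ be the
probability students do their homework on Saturday.
Hypotheses:
\begin{eqnarray*} H_0: & & p_1=p_2=p_3=p_4=p_5=p_6=p_7=\frac{1}{7} \\ H_a:
& & \mbox{at least one of the } p_i's \mbox{ does not equal } \frac{1}{7}
\end{eqnarray*}
p-value:
Each night has an expected frequency of $\frac{1}{7} \times 56=8$.
\begin{eqnarray*} \chi^2 & = & \frac{(11-8)^2}{8}+\frac{(8-8)^2}{8}+\frac{(10-8)^2}{8}+\frac{(7-8)^2}{8}+\frac{(10-8)^2}{8}+\frac{(5-8)^2}{8}+\frac{(5-8)^2}{8} = 4.5 \\ df & = & k-1=7-1=6
\end{eqnarray*}
Function chisq.dist.rt
Field 1 4.5
Field 2 6
Answer 0.6093
So the p-value $=0.6093$.
Conclusion:
Because p-value $=0.6093 \gt 0.05=\alpha$, we do not reject the null hypothesis. At the 5%
significance level there is not enough evidence to suggest that the nights students do most of their
homework are not equally distributed.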
TRY IT
One study indicates that the number of televisions that American families have is distributed as
shown in this table:
Number of Televisions Percent of Families
0 10%
1 16%
2 55%
3 11%
4 or more 8%
A researcher wants to determine if the number of televisions that families in the far western part of
the U.S. have follows the same distribution as in the above study. A random sample of 600 families
in the far western U.S. is taken and the results are recorded in the following table:
Number of Televisions Number of Families
0 66
1 119
2 340
3 60
4 or more 15
At the 1% significance level, does it appear that the distribution of the number of televisions for
families in the far western U.S. is different from the distribution for the American population as a
whole?
Let $p_1$ be the probability a family owns 0 televisions, $p_2$ be the probability a family owns 1
television, $p_3$ be the probability a family owns 2 televisions, $p_4$ be the probability a family owns 3
televisions, and $p_5$ be the probability a family owns 4 or more televisions.
Hypotheses:
\begin{eqnarray*} H_0: & & p_1=10\%, p_2= 16\%, p_3= 55\%, p_4= 11\%, p_5=8\% \\
H_a: & & \mbox{at least one of the } p_i's \mbox{ does not equal its stated probability}
\end{eqnarray*}
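p-value:
Number of Televisions Observed Frequency Expected Frequency
0 66 0.1 × 600 = 60
1 119 0.16 × 600 = 96
2 340 0.55 × 600 = 330
3 60 0.11 × 600 = 66
4 or more 15 0.08 × 600 = 48
\begin{eqnarray*} \chi^2 & = & \frac{(66-60)^2}{60}+\frac{(119-96)^2}{96}+\frac{(340-330)^2}{330}+\frac{(60-66)^2}{66}+\frac{(15-48)^2}{48} = 29.6464\ldots \\ df & = & k-1=5-1=4
\end{eqnarray*}
Function chisq.dist.rt
Field 1 29.6464...
Field 2 4
So the p-value is approximately 0.000006.
Conclusion:
Because the p-value is less than $0.01=\alpha$, we reject the null hypothesis in favour of the
alternative hypothesis. At the 1% significance level there is enough evidence to suggest that the
distribution of the number of televisions for families in the far western U.S. is different from the
distribution for the American population as a whole.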
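Listing each category's contribution to the $\chi^2$-score shows which category drives the result. The sketch below does this for the television data; it is illustrative only, and `chi2_right_tail` is a hypothetical helper standing in for chisq.dist.rt.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area of a chi-square curve (hypothetical stand-in for chisq.dist.rt)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, area = 1.0, math.exp(-y)                 # exact for df = 2
    else:
        a, area = 0.5, math.erfc(math.sqrt(y))      # exact for df = 1
    while a < df / 2.0:                             # step up two df at a time
        area += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return area

categories = ["0", "1", "2", "3", "4 or more"]
claimed = [0.10, 0.16, 0.55, 0.11, 0.08]            # distribution from the national study
observed = [66, 119, 340, 60, 15]                   # 600 far-western families

n = sum(observed)
contributions = {}
for cat, p, o in zip(categories, claimed, observed):
    e = n * p                                       # expected frequency under H0
    contributions[cat] = (o - e) ** 2 / e           # (observed - expected)^2 / expected
    print(f"{cat:>9}: observed {o:3d}, expected {e:6.1f}, contribution {contributions[cat]:8.4f}")

chi2 = sum(contributions.values())
df = len(categories) - 1
p_value = chi2_right_tail(chi2, df)
largest = max(contributions, key=contributions.get)  # category with the biggest contribution

print(f"chi2 = {chi2:.4f}, df = {df}, p-value = {p_value:.6f}, largest contribution: {largest}")
```

For these data most of the $\chi^2$-score comes from the "4 or more" category, where far fewer families were observed than the claimed distribution predicts.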
TRY IT
The expected percentage of the number of pets students in the United States have in their homes is
distributed as follows:
Number of Pets Percent of Students
0 18%
1 25%
2 30%
3 18%
4 or more 9%
A researcher wants to find out if the distribution of the number of pets students in Canada have is
the same as the distribution shown in the U.S. A random sample of 1,000 students from Canada is
taken and the results are shown in the table below:
Number of Pets Number of Students
0 210
1 240
2 320
3 140
4+ 90
At the 1% significance level, is the distribution of the number of pets students in Canada have
different from the distribution for the United States?
Let $p_1$ be the probability a student owns 0 pets, $p_2$ be the probability a student owns 1 pet, $p_3$ be
the probability a student owns 2 pets, $p_4$ be the probability a student owns 3 pets, and $p_5$ be the
probability a student owns 4 or more pets.
Hypotheses:
\begin{eqnarray*} H_0: & & p_1=18\%, p_2= 25\%, p_3= 30\%, p_4= 18\%, p_5=9\% \\
H_a: & & \mbox{at least one of the } p_i's \mbox{ does not equal its stated probability}
\end{eqnarray*}
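p-value:
Number of Pets Observed Frequency Expected Frequency
0 210 0.18 × 1000 = 180
1 240 0.25 × 1000 = 250
2 320 0.3 × 1000 = 300
3 140 0.18 × 1000 = 180
4+ 90 0.09 × 1000 = 90
\begin{eqnarray*} \chi^2 & = & \frac{(210-180)^2}{180}+\frac{(240-250)^2}{250}+\frac{(320-300)^2}{300}+\frac{(140-180)^2}{180}+\frac{(90-90)^2}{90} = 15.6222\ldots \\ df & = & k-1=5-1=4
\end{eqnarray*}
Function chisq.dist.rt
Field 1 15.6222...
Field 2 4
So the p-value is approximately 0.0036.
Conclusion:
Because the p-value is less than $0.01=\alpha$, we reject the null hypothesis in favour of the
alternative hypothesis. At the 1% significance level there is enough evidence to suggest that the
distribution of the number of pets students in Canada have is different from the distribution for the
United States.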
Watch this video: Pearson’s chi square test (goodness of fit) | Probability and Statistics | Khan Academy by Khan Academy
[11:47]
Concept Review
1. Write down the null and alternative hypotheses. The null hypothesis is the claim that the
categorical variable follows the hypothesized distribution and the alternative hypothesis is the
claim that the categorical variable does not follow the hypothesized distribution.
2. Collect the sample information for the test and identify the significance level.
3. The p-value is the area in the right tail of the $\chi^2$-distribution where
\displaystyle{\chi^2 = \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}}} and
$df=k-1$, where $k$ is the number of categories.
4. Compare the p-value to the significance level and state the outcome of the test.
5. Write down a concluding sentence specific to the context of the question.
Attribution
10.5 THE TEST OF INDEPENDENCE
LEARNING OBJECTIVES
Given two categorical variables, is there some relationship between the two categorical variables
or are the two categorical variables independent? The test of independence allows us to test
if two categorical variables are independent (not related) or dependent (related). The test of
independence can only show if a relationship exists between two variables, but the test does not
show if one variable causes changes in the other variable.
The test of independence uses a contingency table to analyze the data. As we saw previously in
probability, a contingency table lists the categories of one variable as the rows and the categories
of the other variable as the columns. The frequency of a row-column combination is the number of
items that occur in both categories.
Suppose one categorical variable has $r$ possible outcomes (categories) and the other categorical
variable has $c$ possible outcomes (categories).
NOTES
1. The null hypothesis is the claim that the two categorical variables are independent. That
is, there is no relationship between the two categorical variables.
2. The alternative hypothesis is the claim that the two categorical variables are dependent.
That is, there is some relationship between the two categorical variables.
3. The test can only show if a relationship exists between the two categorical variables. The
test cannot show any type of causal relationship between the two categorical variables.
4. The formula to find the expected frequencies follows from the assumption that the null
hypothesis is true and how we calculate joint probabilities for independent events.
Assuming the null hypothesis is true means that we assume the variables are
independent. This means that we assume that the events in any row and column
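combination of the contingency tables are independent. As we saw in probability, when
two events are independent, the probability that both events occur is the product of the
probabilities of the individual events.
5. To calculate the $\chi^2$-score:
i. Find the difference between the observed frequency (from the sample) and
the expected frequency (from the null hypothesis). The expected frequency of
any cell of the contingency table when the null hypothesis is true is:
\displaystyle{\mbox{expected frequency}=\frac{\mbox{row total} \times \mbox{column total}}{n}}
where $n$ is the sample size.
ii. Square the difference in step (i).
iii. Divide the value found in step (ii) by the expected frequency.
iv. Add up the values found in step (iii) for all of the cells in the contingency table.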
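The expected-frequency and $\chi^2$-score calculations for a contingency table can be sketched in Python. This is a minimal sketch on made-up data: the 2×2 counts below are hypothetical, `chi2_right_tail` is a hypothetical helper standing in for chisq.dist.rt, and the degrees of freedom are taken as (rows − 1) × (columns − 1), the usual value for this test.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area of a chi-square curve (hypothetical stand-in for chisq.dist.rt)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, area = 1.0, math.exp(-y)                 # exact for df = 2
    else:
        a, area = 0.5, math.erfc(math.sqrt(y))      # exact for df = 1
    while a < df / 2.0:                             # step up two df at a time
        area += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return area

# Hypothetical observed counts: rows are the categories of one variable,
# columns are the categories of the other.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence: expected = (row total * column total) / n for every cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(row_totals) - 1) * (len(col_totals) - 1)
p_value = chi2_right_tail(chi2, df)

print(f"expected = {expected}")
print(f"chi2 = {chi2:.4f}, df = {df}, p-value = {p_value:.4f}")
```

The same code handles larger tables unchanged, because the row and column totals are computed from whatever table is supplied.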
EXAMPLE
A researcher is studying the relationship between the drivers who commit speeding violations and
drivers who use cell phones while driving. The researcher took a sample of 755 drivers and obtained
the information shown in the table below.
Total
At the 5% significance level, is there a relationship between drivers who commit speeding violations
and drivers who use cell phones while driving?
Solution:
Hypotheses:
\begin{eqnarray*} H_0: & & \mbox{The two variables are independent} \\ H_a: &
& \mbox{The two variables are dependent} \end{eqnarray*}
p-value:
From the question, we have and . Now we need to calculate out the -score for the
test.
The observed frequency for each cell is the number of observations in the sample that fall into that
cell. This is the information provided in the sample above.
Total
Next, we must calculate out the expected frequencies. Because we assume the null hypothesis is true
(i.e. the variables are independent), the expected frequency in each cell is
Expected Frequencies
Total
To calculate the -score, for each cell we work out the quantity
Field 2 1
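So the p-value $=0.4019$.
Conclusion:
Because p-value $=0.4019 \gt 0.05=\alpha$, we do not reject the null hypothesis. At the 5%
significance level there is not enough evidence to suggest that there is a relationship between
drivers who commit speeding violations and drivers who use cell phones while driving.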
NOTES
1. The null hypothesis is the claim that the variables are independent. That is, there is no
relationship between drivers who commit speeding violations and drivers who use cell
phones while driving.
2. The alternative hypothesis is the claim that the variables are dependent. That is, there is a
relationship between drivers who commit speeding violations and drivers who use cell
phones while driving.
3. Keep all of the decimals throughout the calculation (i.e. in the calculation of the $\chi^2$-score) to
avoid any round-off error in the calculation of the p-value. This ensures that we get the most
accurate value for the p-value.
4. The p-value is the area in the right tail of the $\chi^2$-distribution, to the right of the
$\chi^2$-score. In the calculation of the p-value:
◦ The function is chisq.dist.rt because we are finding the area in the right tail of a
$\chi^2$-distribution.
◦ Field 1 is the value of $\chi^2$.
◦ Field 2 is the value of the degrees of freedom $df$.
5. The p-value of 0.4019 is a large probability compared to the significance level, and so is likely
to happen assuming the null hypothesis is true. This suggests that the assumption that the
null hypothesis is true is most likely correct, and so the conclusion of the test is to not reject
the null hypothesis. In other words, the two variables are independent.
EXAMPLE
In a volunteer group, adults 21 and older volunteer from one to nine hours each week to spend time
with a disabled senior citizen. The program recruits among college students, university students,
and non students. The table below is a sample of the adult volunteers and the number of hours they
volunteer per week.
College Students
University
Students
Non Students
Total
At the 5% significance level, is the number of hours volunteered independent of the type of
volunteer?
Solution:
Hypotheses:
\begin{eqnarray*} H_0: & & \mbox{The two variables are independent} \\ H_a: &
& \mbox{The two variables are dependent} \end{eqnarray*}
p-value:
From the question, we have and . Now we need to calculate out the -score for the
test.
The observed frequency for each cell is the number of observations in the sample that fall into that
cell. This is the information provided in the sample above.
College Students
University
Students
Non Students
Total
Next, we must calculate out the expected frequencies. Because we assume the null hypothesis is true
(i.e. the variables are independent), the expected frequency in each cell is
Expected Frequencies
College
Students
University
Students
Non
Students
Total
To calculate the -score, for each cell we work out the quantity
Field 2 4
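So the p-value $=0.0113$.
Conclusion:
Because p-value $=0.0113 \lt 0.05=\alpha$, we reject the null hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that there is a
relationship between the number of hours volunteered and the type of volunteer.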
NOTES
1. The null hypothesis is the claim that the variables are independent. That is, there is no
relationship between the number of hours volunteered and type of volunteer.
2. The alternative hypothesis is the claim that the variables are dependent. That is, there is a
relationship between the number of hours volunteered and type of volunteer.
3. Keep all of the decimals throughout the calculation (i.e. in the calculation of the $\chi^2$-score) to
avoid any round-off error in the calculation of the p-value. This ensures that we get the most
accurate value for the p-value.
4. The p-value of 0.0113 is a small probability compared to the significance level, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis. In other words, the two
variables are dependent.
TRY IT
In a local school district, a music teacher wants to study the relationship between students who take
music and students on the honour roll. The teacher took a sample of 300 students and obtained the
information shown in the table below.
Music Student
Non-Music Student
Total
At the 5% significance level, is there a relationship between music/non-music students and honour
roll/non-honour roll students?
Hypotheses:
\begin{eqnarray*} H_0: & & \mbox{The two variables are independent} \\ H_a: &
& \mbox{The two variables are dependent} \end{eqnarray*}
p-value:
Music Student
Non-Music Student
Total
Expected Frequencies
Music Student
Non-Music Student
Total
Field 2 1
So the p-value .
Conclusion:
TRY IT
A local college is interested in the relationship between student anxiety level and the need to succeed
in school. A random sample of 400 students took a test that measured anxiety level and the need to
succeed in school. The results are shown in the table below.
High Need
Medium
Need
Low Need
Total
At the 5% significance level, is there a relationship between student anxiety level and the need to
succeed in school?
Hypotheses:
\begin{eqnarray*} H_0: & & \mbox{The two variables are independent} \\ H_a: &
& \mbox{The two variables are dependent} \end{eqnarray*}
p-value:
High Need
Medium
Need
Low Need
Total
Expected Frequencies
High Need
Medium
Need
Low Need
Total
Field 2 8
So the p-value .
Conclusion:
Concept Review
The test of independence is used to determine if two categorical variables are independent or
dependent. The test of independence is a well established process:
1. Write down the null and alternative hypotheses. The null hypothesis is the claim that the
categorical variables are independent and the alternative hypothesis is the claim that the
categorical variables are dependent.
2. Collect the sample information for the test and identify the significance level.
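3. The p-value is the area in the right tail of the $\chi^2$-distribution where
\displaystyle{\chi^2 = \sum \frac{(\mbox{observed-expected})^2}{\mbox{expected}}} and
$df=(r-1) \times (c-1)$, where $r$ and $c$ are the numbers of categories of the two variables.
4. Compare the p-value to the significance level and state the outcome of the test.
5. Write down a concluding sentence specific to the context of the question.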
Attribution
10.6 EXERCISES
1. If the number of degrees of freedom for a $\chi^2$-distribution is 25, what is the population mean and
standard deviation?
3. A teacher predicts the distribution of grades on the final exam. The predicted proportions and the
actual grade frequencies are recorded in the tables below.
Grade Proportion
A 0.25
B 0.30
C 0.35
D 0.10
Grade Frequency
A 7
B 7
C 5
D 1
At the 5% significance level, do the actual grades match the teacher’s assumed distribution?
4. The following data are real. The cumulative number of AIDS cases reported for Santa Clara
County is broken down by ethnicity as in the table below.
White 2,229
Hispanic 1,157
Black/African-American 457
Total = 4,075
The percentage of each ethnic group in Santa Clara County is as in the table below.
White 42.9%
Hispanic 26.7%
Black/African-American 2.6%
Total = 100%
At the 5% significance level, does it appear that the pattern of AIDS cases in Santa Clara County
corresponds to the distribution of ethnic groups in this county?
5. A six-sided die is rolled 120 times and the results are recorded in the table below. At the 5%
significance level, determine if the die is fair. (Hint: in a fair die, each of the faces is equally likely
to occur.)
1 15
2 29
3 16
4 15
5 30
6 15
6. The marital status distribution of the U.S. male population, ages 15 and older, is as shown in the
table below.
married 56.1
widowed 2.5
divorced/separated 10.1
Suppose that a random sample of 400 U.S. young adult males, 18 to 24 years old, yielded the
following frequency distribution. At the 5% significance level, test if this age group of males fits the
distribution of the U.S. adult population.
married 238
widowed 2
divorced/separated 20
7. The columns in the table below contain the Race/Ethnicity of U.S. Public Schools for a recent
year, the percentages for the Advanced Placement Examinee Population for that class, and the
Overall Student Population. Suppose the right column contains the result of a survey of 1,000 local
students from that year who took an AP Exam.
a. At the 5% significance level, determine if the local results follow the distribution of the U.S.
overall student population based on ethnicity.
b. At the 5% significance level, determine if the local results follow the distribution of U.S. AP
examinee population, based on ethnicity.
8. UCLA conducted a survey of more than 263,000 college freshmen from 385 colleges in fall 2005.
The results of students’ expected majors by gender were reported in THE CHRONICLE OF HIGHER
EDUCATION (2/2/2006). Suppose a survey of 5,000 graduating females and 5,000 graduating males
was done as a follow-up last year to determine what their actual majors were. The results are shown
in the tables below. The second column in each table does not add to 100% because of rounding.
a. At the 5% significance level, determine if the actual college majors of graduating females fit
the distribution of their expected majors.
Technical 0.4% 15
b. At the 5% significance level determine if the actual college majors of graduating males fit the
distribution of their expected majors.
Technical 1.8% 90
9. The table below contains information from a survey among 499 participants classified according
to their age groups. The second column shows the percentage of obese people per age class among
the study participants. The last column comes from a different study at the national level that
shows the corresponding percentages of obese people in the same age classes in the USA. At the 5%
significance level, determine whether the survey participants are a representative sample of the
USA obese population.
10. Transit Railroads is interested in the relationship between travel distance and the ticket class
purchased. A random sample of 200 passengers is taken. The table below shows the results. At the
5% significance level determine if a passenger’s choice in ticket class is independent of the distance
they must travel.
1–100 miles 21 14 6 41
101–200 miles 18 16 8 42
201–300 miles 16 17 15 48
301–400 miles 12 14 21 47
401–500 miles 6 6 10 22
Total 73 67 60 200
11. A recent debate about where in the United States skiers believe the skiing is best prompted the
following survey. At the 5% significance level, test to see if the best ski area is independent of the
level of the skier.
Tahoe 20 30 40
Utah 10 30 60
Colorado 10 40 50
12. Car manufacturers are interested in whether there is a relationship between the size of car an
individual drives and the number of people in the driver’s family (that is, whether car size and
family size are independent). To test this, suppose that 800 car owners were randomly surveyed
with the results in the table. Conduct a test of independence. Use a 5% significance level.
Family Size Sub & Compact Mid-size Full-size Van & Truck
1 20 35 40 35
2 20 50 70 80
3–4 20 50 100 90
5+ 20 30 70 70
13. College students may be interested in whether or not their majors have any effect on starting
salaries after graduation. Suppose that 300 recent graduates were surveyed as to their majors in
college and their starting salaries after graduation. The table below shows the data. Conduct a test
of independence. Use a 5% significance level.
English 5 20 5
Engineering 10 30 60
Nursing 10 15 15
Business 10 20 30
Psychology 20 30 20
14. Some travel agents claim that honeymoon hot spots vary according to age of the bride. Suppose
that 280 recent brides were interviewed as to where they spent their honeymoons. The information
is recorded in the table below. Conduct a test of independence. Use a 5% significance level.
Niagara Falls 15 25 25 20
Poconos 15 25 25 10
Europe 10 25 15 5
Virgin Islands 20 25 15 5
15. A manager of a sports club keeps information concerning the main sport in which members
participate and their ages. To test whether there is a relationship between the age of a member and
his or her choice of sport, 643 members of the sports club are randomly selected. Conduct a test of
independence. Use a 5% significance level.
racquetball 42 58 30 46
tennis 58 76 38 65
swimming 72 60 65 33
16. A major food manufacturer is concerned that the sales for its skinny french fries have been
decreasing. As a part of a feasibility study, the company conducts research into the types of fries sold
across the country to determine if the type of fries sold is independent of the area of the country. The
results of the study are shown in the table. Conduct a test of independence. Use a 5% significance
level.
skinny fries 70 50 20 25
steak fries 20 40 10 10
17. According to Dan Lenard, an independent insurance agent in the Buffalo, N.Y. area, the
following is a breakdown of the amount of life insurance purchased by males in the following age
groups. He is interested in whether the age of the male and the amount of life insurance purchased
are independent events. Conduct a test for independence. Use a 5% significance level.
20–29 40 15 40 0 5
30–39 35 5 20 20 10
40–49 20 0 30 0 30
50+ 40 30 15 15 10
18. Suppose that 600 thirty-year-olds were surveyed to determine whether or not there is a
relationship between the level of education an individual has and salary. Conduct a test of
independence. Use a 5% significance level.
< $30,000 15 25 10 5
$30,000–$40,000 20 40 70 30
$40,000–$50,000 10 20 40 55
$50,000–$60,000 5 10 20 60
$60,000+ 0 5 10 150
19. An ice cream maker performs a nationwide survey about favorite flavors of ice cream in
different geographic areas of the U.S. Based on the table, do the numbers suggest that geographic
location is independent of favorite ice cream flavors? Test at the 5% significance level.
West 12 21 22 19 15 8 97
Midwest 10 32 22 11 15 6 96
East 8 31 27 8 15 7 96
South 15 28 30 8 15 6 102
Column Total 45 112 101 46 60 27 391
20. The table provides a recent survey of the youngest online entrepreneurs whose net worth
is estimated at one million dollars or more. Their ages range from 17 to 30. Each cell in the
table illustrates the number of entrepreneurs who correspond to the specific age group and their
net worth. Are the ages and net worth independent? Perform a test of independence at the 5%
significance level.
Age Group \ Net Worth Value (in millions of US dollars) 1–5 6–24 ≥25 Row Total
17–25 8 7 5 20
26–30 6 5 9 20
Column Total 14 12 14 40
21. A 2013 poll in California surveyed people about taxing sugar-sweetened beverages. The results
are presented in the table, and are classified by ethnic group and response type. Are the poll
responses independent of the participants’ ethnic group? Conduct a test of independence at the 5%
significance level.
No opinion 16 43 16 19 84
22. An archer’s standard deviation for his hits is six (data is measured in distance from the center of
the target). An observer claims the standard deviation is less. At the 5% significance level, test the
observer’s claim.
23. The variance of heights for students in a school is 0.66. A random sample of 50 students is taken,
and the standard deviation of heights of the sample is 0.96. A researcher in charge of the study
believes the variation of heights for the school is greater than 0.66. At the 5% significance level,
determine if the variance in the heights for students in the school is greater than 0.66.
24. The average waiting time in a doctor’s office varies. A random sample of 30 patients in the
doctor’s office has a standard deviation of waiting times of 4.1 minutes. One doctor believes the
variance of waiting times is greater than originally thought.
a. Construct a 96% confidence interval for the variation in the wait times at the doctor’s office.
b. Interpret the confidence interval found in part (a).
c. One of the doctors believes that the variance in the wait times is greater than 12. Is the
doctor’s claim reasonable? Explain.
25. Suppose an airline claims that its flights are consistently on time with an average delay of at
most 15 minutes. It claims that the average delay is so consistent that the variance is no more than
150. Doubting the consistency part of the claim, a disgruntled traveler calculates the delays for his
next 25 flights. The average delay for those 25 flights is 22 minutes with a standard deviation of 15
minutes. At the 5% significance level, determine if variance in the delay times is greater than 150.
26. A plant manager is concerned her equipment may need recalibrating. It seems that the actual
weight of the 15 oz. cereal boxes it fills has been fluctuating. In order to determine if the machine
needs to be recalibrated, 84 randomly selected boxes of cereal from the next day’s production were
weighed. The standard deviation of the 84 boxes was 0.54.
a. Construct a 99% confidence interval for the variance in the weight of the cereal boxes.
b. Interpret the confidence interval found in part (a).
c. If the variance in the weight of the cereal boxes is supposed to be at most 25, does the
machine need to be recalibrated?
27. Consumers may be interested in whether the cost of a particular calculator varies from store to
store. Based on surveying 43 stores, which yielded a sample mean of $84 and a sample standard
deviation of $12, test the claim that the standard deviation is greater than $15. Use a 5% significance
level.
28. Airline companies are interested in the consistency of the number of babies on each flight, so
that they have adequate safety equipment. They are also interested in the variation of the number
of babies. Suppose that an airline executive believes the average number of babies on flights is six
with a variance of nine at most. The airline conducts a survey. The results of the 18 flights surveyed
give a sample average of 6.4 with a sample standard deviation of 3.9. Conduct a hypothesis test of
the airline executive’s belief. Use a 5% significance level.
29. According to an avid aquarist, the average number of fish in a 20-gallon tank is 10, with a
variance of four. His friend, also an aquarist, does not believe that the standard deviation is two. She
counts the number of fish in 15 other 20-gallon tanks. Based on the results that follow, do you think
that the variance is different from four? Use a 5% significance level. Data: 11; 10; 9; 10; 10; 11; 11;
10; 12; 9; 7; 9; 11; 10; 11
30. The manager of “Frenchies” is concerned that patrons are not consistently receiving the same
amount of French fries with each order. The chef claims that the variation for a ten-ounce order
of fries is 2.25, but the manager thinks that it may be higher. He randomly weighs 49 orders of
fries, which yields a mean of 11 oz. and a standard deviation of two oz. At the 5% significance level,
determine if the variation in the amount of fries per order is higher than claimed.
31. You want to buy a specific computer. A sales representative of the manufacturer claims that
retail stores sell this computer at an average price of $1,249 with a variance of 625. You find a
website that has a price comparison for the same computer at a series of stores as follows: $1,299;
$1,229.99; $1,193.08; $1,279; $1,224.95; $1,229.99; $1,269.95; $1,249. Can you argue that pricing has
a larger variation than claimed by the manufacturer? Use the 5% significance level. As a potential
buyer, what would be the practical conclusion from your analysis?
32. A company packages apples by weight. One of the weight grades is Class A apples. A batch
of apples is selected to be included in a Class A apple package.
a. Construct a 95% confidence interval for the variation in the weight of apples in the package.
b. Interpret the confidence interval found in part (a).
Weights in selected apple batch (in grams): 158; 167; 149; 169; 164; 139; 154; 150; 157; 171; 152; 161;
141; 166; 172;
Attribution
Chapter Outline
11.1 INTRODUCTION TO STATISTICAL INFERENCES USING THE F-DISTRIBUTION
One-way ANOVA is used to measure information from several groups. Photo by OpenStax, CC BY
4.0.
Many statistical applications in psychology, social science, business administration, and the
natural sciences involve several groups. For example, an environmentalist is interested in knowing
if the average amount of pollution varies in several bodies of water. A sociologist is interested in
knowing if the amount of income a person earns varies according to his or her upbringing. A
consumer looking for a new car might compare the average gas mileage of several models.
For hypothesis tests that compare averages between more than two groups, statisticians have
developed a method called “Analysis of Variance” (abbreviated ANOVA). In this chapter, we will
study the simplest form of ANOVA called single factor or one-way ANOVA. We will also study the
$F$-distribution, used in a one-way ANOVA and the test of two population variances. This is just a
very brief overview of one-way ANOVA.
Attribution
11.2 THE F-DISTRIBUTION
LEARNING OBJECTIVES
• To find the area under an $F$-distribution to the left of a given $F$-score, use the f.dist($F$,
degrees of freedom 1, degrees of freedom 2, logic operator) function.
• The output from the f.dist function is the area to the left of the entered $F$-score.
• Visit the Microsoft page for more information about the f.dist function.
• To find the area under an $F$-distribution to the right of a given $F$-score, use the f.dist.rt($F$,
degrees of freedom 1, degrees of freedom 2) function.
• The output from the f.dist.rt function is the area to the right of the entered $F$-score.
• Visit the Microsoft page for more information about the f.dist.rt function.
EXAMPLE
Solution:
Field 2 12
Field 3 27
Field 4 true
Field 2 12
Field 3 27
• To find the $F$-score for a given area under an $F$-distribution to the left of the $F$-score, use the
f.inv(area to the left, degrees of freedom 1, degrees of freedom 2) function.
◦ For area to the left, enter the area to the left of the required $F$-score.
◦ For degrees of freedom 1, enter the value of $df_1$ for the $F$-distribution.
◦ For degrees of freedom 2, enter the value of $df_2$ for the $F$-distribution.
• The output from the f.inv function is the value of the $F$-score so that the area to the left of
the $F$-score is the entered area.
• Visit the Microsoft page for more information about the f.inv function.
• To find the $F$-score for a given area under an $F$-distribution to the right of the $F$-score, use
the f.inv.rt(area to the right, degrees of freedom 1, degrees of freedom 2) function.
◦ For area to the right, enter the area to the right of the required $F$-score.
◦ For degrees of freedom 1, enter the value of $df_1$ for the $F$-distribution.
◦ For degrees of freedom 2, enter the value of $df_2$ for the $F$-distribution.
• The output from the f.inv.rt function is the value of the $F$-score so that the area to the
right of the $F$-score is the entered area.
• Visit the Microsoft page for more information about the f.inv.rt function.
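For readers who want to check spreadsheet output outside Excel, the four functions above can be mirrored in a short Python sketch. This is not how Excel computes its values. It is a minimal illustration that assumes the standard link between the $F$-distribution and the regularized incomplete beta function (evaluated here with a continued fraction) and finds $F$-scores by bisection. The names f_dist, f_dist_rt, f_inv, and f_inv_rt are chosen here only to echo the Excel names; they do not come from any library.

```python
from math import exp, lgamma, log


def _beta_cf(a, b, x, max_iter=500, eps=1e-14):
    """Continued fraction used by the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300

    def guard(v):
        # Keep denominators away from zero.
        return v if abs(v) > tiny else tiny

    c = 1.0
    d = 1.0 / guard(1.0 - (a + b) * x / (a + 1.0))
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        steps = (
            m * (b - m) * x / ((a - 1.0 + m2) * (a + m2)),             # even step
            -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2)),  # odd step
        )
        for coeff in steps:
            d = 1.0 / guard(1.0 + coeff * d)
            c = guard(1.0 + coeff / c)
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_dist(f, df1, df2):
    """Area to the left of f, like f.dist(f, df1, df2, true)."""
    if f <= 0.0:
        return 0.0
    return _reg_inc_beta(df1 / 2.0, df2 / 2.0, df1 * f / (df1 * f + df2))


def f_dist_rt(f, df1, df2):
    """Area to the right of f, like f.dist.rt(f, df1, df2)."""
    return 1.0 - f_dist(f, df1, df2)


def f_inv(area_left, df1, df2):
    """F-score with the given area to its left, like f.inv; found by bisection."""
    if not 0.0 < area_left < 1.0:
        raise ValueError("area must be strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while f_dist(hi, df1, df2) < area_left:  # widen the bracket until it contains the answer
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f_dist(mid, df1, df2) < area_left:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def f_inv_rt(area_right, df1, df2):
    """F-score with the given area to its right, like f.inv.rt."""
    return f_inv(1.0 - area_right, df1, df2)


# Degrees of freedom 37 and 15 with areas 0.413 (left) and 0.148 (right).
print(f_inv(0.413, 37, 15), f_inv_rt(0.148, 37, 15))
```

Bisection is slower than the methods a spreadsheet is likely to use, but it is easy to follow: the area to the left of an $F$-score only increases as the $F$-score increases, so halving the search interval always closes in on the required value.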
EXAMPLE
1. Find the $F$-score so that the area under the $F$-distribution to the left of $F$ is 0.413.
2. Find the $F$-score so that the area under the $F$-distribution to the right of $F$ is 0.148.
Solution:
Field 2 37
Field 3 15
Field 2 37
Field 3 15
Concept Review
The $F$-distribution is a useful tool for assessment in a series of problem categories. These problem
categories include: statistical inference for two population variances, testing the equality of three
or more population means (one-way ANOVA), and testing the overall significance of the multiple
regression model.
Important parameters in an $F$-distribution are the degrees of freedom in a given problem. The
$F$-distribution curve is skewed to the right, and its shape depends on the degrees of freedom. As
the degrees of freedom increase, the curve of an $F$-distribution approaches a normal distribution.
Attribution
“13.3 Facts About the F Distribution” in Introductory Statistics by OpenStax is licensed under
a Creative Commons Attribution 4.0 International License.
11.3 STATISTICAL INFERENCE FOR TWO
POPULATION VARIANCES
LEARNING OBJECTIVES
Sometimes we want to compare the variability between two populations instead of comparing the
means of the populations. For example, college administrators would like two college professors
grading exams to have the same variation in their grading or a supermarket might be interested in
the variability of the check-out times for two checkers.
As with comparing other population parameters, we can construct confidence intervals and
conduct hypothesis tests to study the relationship between two population variances. However,
because of the distribution we need to use, we study the ratio of two population variances, not
the difference in the variances.
Throughout this section, we will use subscripts to identify the values for the sample sizes,
variances, and standard deviations for the two populations:
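Population 1 Population 2
Population Variance $\sigma_1^2$ $\sigma_2^2$
Population Standard Deviation $\sigma_1$ $\sigma_2$
Sample Size $n_1$ $n_2$
Sample Variance $s_1^2$ $s_2^2$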
In order to construct a confidence interval or conduct a hypothesis test on the ratio of two
population variances, $\frac{\sigma_1^2}{\sigma_2^2}$, we need to use the distribution of $\frac{s_1^2}{s_2^2}$ when the population variances
are equal ($\sigma_1^2=\sigma_2^2$). Suppose we have two normal populations with equal variances $\sigma_1^2=\sigma_2^2$. A
sample of size $n_1$ with sample variance $s_1^2$ is taken from population 1 and a sample of size $n_2$ with
sample variance $s_2^2$ is taken from population 2. The sampling distribution of the ratio of the sample
variances $\frac{s_1^2}{s_2^2}$ follows an $F$-distribution with $df_1=n_1-1$ and $df_2=n_2-1$.
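Suppose a sample of size $n_1$ with sample variance $s_1^2$ is taken from population 1 and a sample of
size $n_2$ with sample variance $s_2^2$ is taken from population 2, where the populations are independent
and normally distributed. The limits for the confidence interval with confidence level $C$ for the
ratio of the population variances $\frac{\sigma_1^2}{\sigma_2^2}$ are
\begin{eqnarray*} \mbox{Lower Limit} & = & \frac{1}{F_R} \times \frac{s_1^2}{s_2^2} \\ \\
\mbox{Upper Limit} & = & \frac{1}{F_L} \times \frac{s_1^2}{s_2^2} \end{eqnarray*}
where $F_L$ is the $F$-score so that the area in the left tail of the $F$-distribution is $\frac{1-C}{2}$, $F_R$ is
the $F$-score so that the area in the right tail of the $F$-distribution is $\frac{1-C}{2}$, and the $F$-distribution
has $df_1=n_1-1$ and $df_2=n_2-1$.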
NOTES
1. Like the other confidence intervals we have seen, the $F$-scores are the values that trap $C$
of the observations in the middle of the distribution so that the area of each tail is $\frac{1-C}{2}$.
2. Because the $F$-distribution is not symmetrical, the confidence interval for the ratio of the
population variances requires that we calculate two different $F$-scores: one for the left tail
and one for the right tail. In Excel, we will need to use both the f.inv function (for the left
tail) and the f.inv.rt function (for the right tail) to find the two different $F$-scores.
3. The $F$-score for the left tail is part of the formula for the upper limit and the $F$-score for
the right tail is part of the formula for the lower limit. This is not a mistake. It follows
from the formula used to determine the limits for the confidence interval.
4. It is important that the populations are independent and normally distributed. If the
populations are not normal, the confidence interval will not give an accurate result.
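The reason the left-tail $F$-score ends up in the upper limit can be sketched in a few lines. The derivation below is an illustration, not taken from the text. It relies on the standard result that, for independent normal populations, the ratio $\frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2}$ follows an $F$-distribution with $df_1=n_1-1$ and $df_2=n_2-1$, and it writes $F_L$ and $F_R$ for the $F$-scores with area $\frac{1-C}{2}$ in the left and right tails, where $C$ is the confidence level.

\begin{eqnarray*}
P\left(F_L \leq \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} \leq F_R\right) & = & C \\ \\
P\left(F_L \leq \frac{s_1^2}{s_2^2} \times \frac{\sigma_2^2}{\sigma_1^2} \leq F_R\right) & = & C \\ \\
P\left(\frac{1}{F_R} \times \frac{s_1^2}{s_2^2} \leq \frac{\sigma_1^2}{\sigma_2^2} \leq \frac{1}{F_L} \times \frac{s_1^2}{s_2^2}\right) & = & C
\end{eqnarray*}

The last line comes from dividing through by $\frac{s_1^2}{s_2^2}$ and then taking reciprocals. Taking reciprocals of positive quantities reverses the inequalities, which is why $F_R$ moves to the lower limit and $F_L$ moves to the upper limit.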
EXAMPLE
Two local walk-in medical clinics want to determine if there is any variability in the time patients
wait to see a doctor at each clinic. In a sample of 30 patients at Clinic 1, the standard deviation for
the wait time to see a doctor was 45 minutes. In a sample of 40 patients at Clinic 2, the standard
deviation for the wait time to see a doctor was 27 minutes. Assume the population of wait times at
the two clinics are independent and normally distributed.
1. Construct a 95% confidence interval for the ratio of the variances for the wait times at the two
clinics.
2. Interpret the confidence interval found in part 1.
3. Is there evidence to suggest that there is a difference in the variances of the wait times at the
two clinics? Explain.
Solution:
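1. Let Clinic 1 be population 1 and Clinic 2 be population 2. From the question we have the
following information:
Clinic 1 Clinic 2
$n_1=30$ $n_2=40$
$s_1=45$ $s_2=27$
To find the confidence interval, we need to find the $F_L$-score for the 95% confidence interval.
This means that we need to find the $F_L$-score so that the area in the left tail is
$\frac{1-0.95}{2}=0.025$. The degrees of freedom for the $F$-distribution are $df_1=n_1-1=29$
and $df_2=n_2-1=39$.
Function f.inv
Field 1 0.025
Field 2 29
Field 3 39
We also need to find the $F_R$-score for the 95% confidence interval. This means that we need to
find the $F_R$-score so that the area in the right tail is $\frac{1-0.95}{2}=0.025$. The degrees of
freedom for the $F$-distribution are again $df_1=n_1-1=29$ and $df_2=n_2-1=39$.
Function f.inv.rt
Field 1 0.025
Field 2 29
Field 3 39
Substituting the two $F$-scores and $\frac{s_1^2}{s_2^2}=\frac{45^2}{27^2}$ into the formulas for the limits gives a
lower limit of 1.416 and an upper limit of 5.646.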
2. We are 95% confident that the ratio of the variances in the wait times at the two
clinics is between 1.416 and 5.646.
3. Because 1 is outside the confidence interval, it suggests that the ratio of the variances is
not 1. If the ratio of the variances cannot equal 1, then the variances cannot be equal. So there
is a difference in the variances of the wait times at the two clinics.
NOTES
1. When calculating the limits for the confidence interval keep all of the decimals in the
$F$-scores and other values throughout the calculation. This will ensure that there is no round-off
error in the answer. You can use Excel to do the calculations of the limits, clicking on the
cells containing the $F$-scores and any other values.
2. When writing down the interpretation of the confidence interval, make sure to include the
confidence level and the actual ratio of population variances captured by the confidence
interval (i.e. be specific to the context of the question). In this case, there are no units for the
limits because the units cancel in the ratio of the two variances.
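The arithmetic for the limits in the clinic example can be sketched in Python. The two $F$-scores below are assumptions: the text does not report them, so the values 0.4920 and 1.9617 were back-solved from the interval 1.416 to 5.646 stated in the solution and stand in for the outputs of f.inv(0.025, 29, 39) and f.inv.rt(0.025, 29, 39). The function name variance_ratio_interval is hypothetical.

```python
def variance_ratio_interval(s1_sq, s2_sq, f_left, f_right):
    """Limits of the confidence interval for the ratio of two population variances.

    f_left is the F-score with area (1 - C)/2 in the left tail,
    f_right is the F-score with area (1 - C)/2 in the right tail.
    """
    ratio = s1_sq / s2_sq
    lower = ratio / f_right  # the right-tail F-score gives the lower limit
    upper = ratio / f_left   # the left-tail F-score gives the upper limit
    return lower, upper


# Clinic example: sample standard deviations of 45 and 27 minutes.
s1, s2 = 45.0, 27.0

# Assumed F-scores (not given in the text; back-solved from the reported interval).
f_left = 0.4920
f_right = 1.9617

lower, upper = variance_ratio_interval(s1 ** 2, s2 ** 2, f_left, f_right)
contains_one = lower <= 1.0 <= upper  # is a ratio of 1 (equal variances) plausible?

print(f"{lower:.3f} to {upper:.3f}; contains 1: {contains_one}")
```

With these inputs the limits agree with the interval in the solution to three decimal places, and the check on 1 mirrors part 3 of the example. In practice the unrounded $F$-scores from Excel should be used, as the first note above recommends.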
Steps to Conduct a Hypothesis Test for Two Population Variances
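1. Write down the null hypothesis that there is no difference in the population variances:
\begin{eqnarray*} H_0: & & \sigma^2_1 = \sigma^2_2 \end{eqnarray*}
2. Write down the alternative hypothesis in terms of the population variances. The alternative
hypothesis is one of the following:
\begin{eqnarray*} H_a: & & \sigma^2_1 \lt \sigma_2^2 \\ H_a: & &
\sigma^2_1 \gt \sigma^2_2 \\ H_a: & & \sigma^2_1 \neq \sigma^2_2
\end{eqnarray*}
3. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
4. Collect the sample information for the test and identify the significance level $\alpha$.
5. Use the $F$-distribution to find the p-value (the area in the corresponding tail) for the test. The
$F$-score and degrees of freedom are
\begin{eqnarray*} F & = & \frac{s_1^2}{s_2^2} \\ df_1 & = & n_1-1 \\ df_2 & = & n_2-1 \end{eqnarray*}
6. Compare the p-value to the significance level and state the outcome of the test.
7. Write down a concluding sentence specific to the context of the question.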
EXAMPLE
Two college instructors are interested in whether or not there is any variation in the way they grade
math exams. They each grade the same set of 30 exams. The first instructor’s grades have a variance
of 52.3. The second instructor’s grades have a variance of 89.9. At the 5% significance level, test the
claim that the first instructor’s variance is smaller.
Solution:
Let the first instructor’s grades be population 1 and the second instructor’s grades be population 2.
From the question we have the following information:
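Instructor 1 Instructor 2
$n_1=30$ $n_2=30$
$s_1^2=52.3$ $s_2^2=89.9$
Hypotheses:
\begin{eqnarray*} H_0: & & \sigma_1^2=\sigma^2_2 \\ H_a: & & \sigma_1^2 \lt
\sigma^2_2 \end{eqnarray*}
p-value:
Because the alternative hypothesis is a $\lt$, the p-value is the area in the left tail of the $F$-distribution.
To use the f.dist function, we need to calculate out the $F$-score and the degrees of freedom:
\begin{eqnarray*} F & = & \frac{s_1^2}{s_2^2} = \frac{52.3}{89.9} = 0.5818 \\ df_1 & = & n_1-1 = 29 \\
df_2 & = & n_2-1 = 29 \end{eqnarray*}
Function f.dist
Field 1 0.5818
Field 2 29
Field 3 29
Field 4 true
So the p-value $=0.0753$.
Conclusion:
Because p-value $=0.0753 \gt 0.05=\alpha$, we do not reject the null hypothesis. At the 5%
significance level there is not enough evidence to suggest that the variance of the first instructor’s
grades is smaller than the variance of the second instructor’s grades.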
NOTES
1. The null hypothesis $\sigma_1^2=\sigma_2^2$ is the claim that the variances for the two instructors are
equal.
2. The alternative hypothesis $\sigma_1^2 \lt \sigma^2_2$ is the claim that the variance for the
first instructor’s grades is less than the variance for the second instructor’s grades.
3. The p-value is the area in the left tail of the $F$-distribution, to the left of $F=0.5818$.
In the calculation of the p-value:
◦ The function is f.dist because we are finding the area in the left tail of an
$F$-distribution.
◦ Field 1 is the value of $F$.
◦ Field 2 is the value of $df_1$.
◦ Field 3 is the value of $df_2$.
◦ Field 4 is true.
4. The p-value of 0.0753 is a large probability compared to the significance level, and so is likely
to happen assuming the null hypothesis is true. This suggests that the assumption that the
null hypothesis is true is most likely correct, and so the conclusion of the test is to not reject
the null hypothesis. In other words, the variances for the two instructors are most likely
equal.
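The calculation in the instructor example can be mirrored in a few lines of Python. This is only a sketch of the arithmetic: the helper names are hypothetical, the p-value is copied from the example rather than computed, and the rule "reject when the p-value is at most the significance level" is the usual p-value convention, assumed here.

```python
def f_score_two_variances(s1_sq, n1, s2_sq, n2):
    """Return the F-score and degrees of freedom for a test of two variances."""
    return s1_sq / s2_sq, n1 - 1, n2 - 1


def outcome(p_value, alpha):
    """Compare the p-value to the significance level."""
    return "reject H0" if p_value <= alpha else "do not reject H0"


# Both instructors grade the same 30 exams; sample variances 52.3 and 89.9.
f_score, df1, df2 = f_score_two_variances(52.3, 30, 89.9, 30)

# Area in the left tail as reported in the example (the f.dist output).
p_value = 0.0753

print(round(f_score, 4), df1, df2, outcome(p_value, 0.05))
```

Keeping the $F$-score calculation separate from the decision step reflects the order of the procedure: the sample information fixes $F$, $df_1$, and $df_2$, and only then is the tail area compared with the significance level.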
EXAMPLE
A local choral society divides the male singers into tenors and basses. The choral society director
wants to know if the variance in the heights of the two groups of singers is the same or different. The
director takes a sample from each group and records their height in inches. In a sample of 22 tenors,
the sample variance is 3.89. In a sample of 27 basses, the sample variance is 2.72. At the 5%
significance level, is there a difference in the variances of the heights of the two groups of singers?
Solution:
Let the tenors be population 1 and the basses be population 2. From the question we have the
following information:
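Tenors Basses
$n_1=22$ $n_2=27$
$s_1^2=3.89$ $s_2^2=2.72$
Hypotheses:
\begin{eqnarray*} H_0: & & \sigma_1^2=\sigma^2_2 \\ H_a: & & \sigma_1^2 \neq \sigma^2_2 \end{eqnarray*}
p-value:
Because the alternative hypothesis is a $\neq$, the p-value is the sum of the areas in the tails of the
$F$-distribution.
We need to calculate out the $F$-score and the degrees of freedom:
\begin{eqnarray*} F & = & \frac{s_1^2}{s_2^2} = \frac{3.89}{2.72} = 1.4301 \\ df_1 & = & n_1-1 = 21 \\
df_2 & = & n_2-1 = 26 \end{eqnarray*}
Because this is a two-tailed test, we need to know which tail (left or right) we have the $F$-score for so
that we can use the correct Excel function. If $F \gt 1$, the $F$-score corresponds to the right tail. If
$F \lt 1$, the $F$-score corresponds to the left tail. In this case $F=1.4301 \gt 1$, so the $F$-score
corresponds to the right tail. We need to use f.dist.rt to find the area in the right tail.
Function f.dist.rt
Field 1 1.4301
Field 2 21
Field 3 26
So the area in the right tail is 0.1919, which means that $\frac{1}{2}$(p-value)$=0.1919$. This is also the area in
the left tail, so
p-value $=0.1919+0.1919=0.3838$
Conclusion:
Because p-value $=0.3838 \gt 0.05=\alpha$, we do not reject the null hypothesis. At the 5%
significance level there is not enough evidence to suggest that there is a difference in the variances
of the heights of the two groups of singers.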
NOTES
1. The null hypothesis $\sigma_1^2=\sigma_2^2$ is the claim that the variances of the heights for the two
groups of singers are equal.
2. The alternative hypothesis $\sigma_1^2 \neq \sigma^2_2$ is the claim that the variances of the
heights for the two groups of singers are not equal.
3. In a two-tailed hypothesis test for two population variances, we will only have sample
information relating to one of the two tails. We must determine which of the tails the sample
information belongs to, and then calculate out the area in that tail. The area in each tail
represents exactly half of the p-value, so the p-value is the sum of the areas in the two tails.
◦ If $F \lt 1$, the sample information belongs to the left tail.
▪ We use f.dist to find the area in the left tail. The area in the right tail equals the
area in the left tail, so we can find the p-value by adding the output from this
function to itself.
◦ If $F \gt 1$, the sample information belongs to the right tail.
▪ We use f.dist.rt to find the area in the right tail. The area in the left tail equals
the area in the right tail, so we can find the p-value by adding the output from
this function to itself.
4. The p-value of 0.3838 is a large probability compared to the significance level, and so is likely
to happen assuming the null hypothesis is true. This suggests that the assumption that the
null hypothesis is true is most likely correct, and so the conclusion of the test is to not reject
the null hypothesis. In other words, the variances in the heights of the two groups of singers
are the same.
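The two-tailed logic in the notes above (pick the tail from the $F$-score, then double the tail area) can be sketched as follows. The function names are hypothetical, and the tail area 0.1919 is taken from the example rather than computed here.

```python
def tail_for_f_score(f_score):
    """Return which tail the sample F-score falls in for a two-tailed test."""
    # F > 1 belongs to the right tail (use f.dist.rt); F < 1 to the left tail (use f.dist).
    return "right" if f_score > 1 else "left"


def two_tailed_p_value(tail_area):
    """The p-value is the tail area added to itself."""
    return tail_area + tail_area


# Tenors: sample variance 3.89; basses: sample variance 2.72.
f_score = 3.89 / 2.72
tail = tail_for_f_score(f_score)

# Area in that tail as reported in the example (the f.dist.rt output).
tail_area = 0.1919
p_value = two_tailed_p_value(tail_area)

print(round(f_score, 4), tail, round(p_value, 4))
```

If the populations were labelled the other way round, the $F$-score would be the reciprocal and would fall in the left tail, but the doubled tail area, and therefore the conclusion, would be the same.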
NOTES
• When two populations have equal variances, the values of $s_1^2$ and $s_2^2$ are close in value.
So, the value of $F=\frac{s_1^2}{s_2^2}$ is close to 1. This will result in a large p-value in the hypothesis test
and the evidence favours the null hypothesis.
• When two populations have unequal variances, the values of $s_1^2$ and $s_2^2$ are not close
in value. So, the value of $F=\frac{s_1^2}{s_2^2}$ will either be larger than 1 or smaller than 1 (depending on
which sample variance is smaller and which is larger). This will result in a small p-value
in the hypothesis test and the evidence favours the alternative hypothesis.
Watch this video: Hypothesis Tests for Equality of Two Variances by jbstatistics [11:39]
Concept Review
To construct a confidence interval or conduct a hypothesis test on two population variances, we use
the sampling distribution of the ratio of the sample variances $\frac{s_1^2}{s_2^2}$, which follows an $F$-distribution
with $df_1=n_1-1$ and $df_2=n_2-1$.
The hypothesis test for two population variances is a well established process:
1. Write down the null and alternative hypotheses in terms of the population variances.
2. Use the form of the alternative hypothesis to determine if the test is left-tailed, right-tailed, or
two-tailed.
3. Collect the sample information for the test and identify the significance level.
4. Find the p-value (the area in the corresponding tail) for the test using the $F$-distribution
where $F=\frac{s_1^2}{s_2^2}$, $df_1=n_1-1$, and $df_2=n_2-1$.
5. Compare the p-value to the significance level and state the outcome of the test.
6. Write down a concluding sentence specific to the context of the question.
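The limits for the confidence interval for the ratio of the population variances are
\begin{eqnarray*} \mbox{Lower Limit} & = & \frac{1}{F_R} \times \frac{s_1^2}{s_2^2} \\ \\
\mbox{Upper Limit} & = & \frac{1}{F_L} \times \frac{s_1^2}{s_2^2} \end{eqnarray*}
where $F_L$ is the $F$-score so that the area in the left tail of the $F$-distribution is $\frac{1-C}{2}$, $F_R$ is
the $F$-score so that the area in the right tail of the $F$-distribution is $\frac{1-C}{2}$, and the $F$-distribution
has $df_1=n_1-1$ and $df_2=n_2-1$.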
Attribution
“13.4 Test of Two Variances” in Introductory Statistics by OpenStax is licensed under a Creative
Commons Attribution 4.0 International License.
11.4 ONE-WAY ANOVA AND HYPOTHESIS
TESTS FOR THREE OR MORE POPULATION
MEANS
LEARNING OBJECTIVES
• Conduct and interpret hypothesis tests for three or more population means using one-way
ANOVA.
The purpose of a one-way ANOVA (analysis of variance) test is to determine the existence of a
statistically significant difference among the means of three or more populations. The test actually
uses variances to help determine if the population means are equal or not.
Throughout this section, we will use subscripts to identify the values for the means, sample sizes,
and standard deviations for the populations:
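Population Mean $\mu_1, \mu_2, \ldots, \mu_k$
Sample Size $n_1, n_2, \ldots, n_k$
Sample Mean $\overline{x}_1, \overline{x}_2, \ldots, \overline{x}_k$
$k$ is the number of populations under study, $n$ is the total number of observations in all of the
samples combined, and $\overline{\overline{x}}$ is the mean of the sample means.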
One-Way ANOVA
A predictor variable is called a factor or independent variable. For example age, temperature,
and gender are factors. The groups or samples are often referred to as treatments. This
terminology comes from the use of ANOVA procedures in medical and psychological research to
determine if there is a difference in the effects of different treatments.
EXAMPLE
A local college wants to compare the mean GPA for players on four of its sports teams: basketball,
baseball, hockey, and lacrosse. A random sample of players was taken from each team and their GPA
recorded in the table below.
Basketball Baseball Hockey Lacrosse
Sample Size ($n_i$) 5 5 5 5
Sample Mean ($\overline{x}_i$) 3.22 3.02 3 2.94
The logic behind one-way ANOVA is to compare population means based on two independent
estimates of the (assumed) equal variance between the populations:
• One estimate of the equal variance is based on the variability among the sample means
themselves (called the between-groups estimate of population variance).
• One estimate of the equal variance is based on the variability of the data within each
sample (called the within-groups estimate of population variance).
The one-way ANOVA procedure compares these two estimates of the population variance to
determine if the population means are equal or if there is a difference in the population means.
Because ANOVA involves the comparison of two estimates of variance, an $F$-distribution is used
to conduct the ANOVA test. The test statistic is an $F$-score that is the ratio of the two estimates of
population variance:
\begin{eqnarray*} F & = & \frac{\mbox{variance between groups}}{\mbox{variance within groups}} \end{eqnarray*}
The degrees of freedom for the $F$-distribution are $df_1=k-1$ and $df_2=n-k$ where $k$ is the
number of populations and $n$ is the total number of observations in all of the samples combined.
The variance between groups estimate of the population variance is called the mean square
due to treatment, $MST$. The $MST$ is the estimate of the population variance determined by
the variance of the sample means from the overall sample mean $\overline{\overline{x}}$. When the population means
are equal, $MST$ provides an unbiased estimate of the population variance. When the population
means are not equal, $MST$ provides an overestimate of the population variance.
\begin{eqnarray*} SST & = & n_1 \times
(\overline{x}_1-\overline{\overline{x}})^2+n_2\times (\overline{x}_2-\overline{\overline{x}})^2+
\cdots +n_k \times (\overline{x}_k-\overline{\overline{x}})^2 \\ \\ MST & =&
\frac{SST}{k-1} \end{eqnarray*}
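As an illustration of the $SST$ and $MST$ formulas, the sketch below applies them to the sample sizes and sample means given for the four sports teams in the GPA example. The function name is hypothetical. The overall mean is computed as the size-weighted average of the sample means, which coincides with the plain mean of the sample means here because every sample has the same size.

```python
def mean_square_treatment(sizes, means):
    """Return (SST, MST) from the sample sizes and sample means of k groups."""
    k = len(sizes)
    n = sum(sizes)
    # Overall mean: each sample mean weighted by its sample size.
    grand_mean = sum(n_i * m_i for n_i, m_i in zip(sizes, means)) / n
    sst = sum(n_i * (m_i - grand_mean) ** 2 for n_i, m_i in zip(sizes, means))
    return sst, sst / (k - 1)


# Basketball, baseball, hockey, lacrosse (values from the GPA example).
sizes = [5, 5, 5, 5]
means = [3.22, 3.02, 3.00, 2.94]

sst, mst = mean_square_treatment(sizes, means)
print(round(sst, 4), round(mst, 4))
```

Because the four sample means are close together, $SST$ and $MST$ come out small, which is the situation in which the between-groups estimate gives little evidence of a difference in the population means.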
The variance within groups estimate of the population variance is called the mean square
due to error, $MSE$. The $MSE$ is the pooled estimate of the population variance using the
sample variances as estimates for the population variance. The $MSE$ always provides an unbiased
estimate of the population variance because it is not affected by whether or not the population
means are equal.
\begin{eqnarray*} SSE & = & (n_1-1) \times s_1^2+ (n_2-1) \times s_2^2+ \cdots +
(n_k-1) \times s_k^2\\ \\ MSE & =& \frac{SSE}{n -k} \end{eqnarray*}
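The $SSE$ and $MSE$ formulas can be sketched the same way. The sample variances below are hypothetical, chosen only to show the arithmetic, and the function name is an assumption of this sketch.

```python
def mean_square_error(sizes, variances):
    """Return (SSE, MSE) from the sample sizes and sample variances of k groups."""
    k = len(sizes)
    n = sum(sizes)
    # Each sample variance is weighted by its degrees of freedom, n_i - 1.
    sse = sum((n_i - 1) * s_sq for n_i, s_sq in zip(sizes, variances))
    return sse, sse / (n - k)


# Hypothetical data: four samples of size 5 with made-up sample variances.
sizes = [5, 5, 5, 5]
variances = [0.40, 0.50, 0.60, 0.50]

sse, mse = mean_square_error(sizes, variances)
print(round(sse, 4), round(mse, 4))
```

When all of the samples have the same size, as in this sketch, the $MSE$ is simply the average of the sample variances, which is one way to see why it is described as a pooled estimate.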
The one-way ANOVA test depends on the fact that the variance between groups $MST$ is
influenced by differences between the population means, which results in $MST$ being either an
unbiased or overestimate of the population variance. Because the variance within groups $MSE$
compares values of each group to its own group mean, $MSE$ is not affected by differences between
the population means and is always an unbiased estimate of the population variance.
The null hypothesis in a one-way ANOVA test is that the population means are all equal and the
alternative hypothesis is that there is a difference in the population means. The $F$-score for the
one-way ANOVA test is $F=\frac{MST}{MSE}$ with $df_1=k-1$ and $df_2=n-k$. The p-value for the
test is the area in the right tail of the $F$-distribution, to the right of the $F$-score.
• When the variance between groups $MST$ and variance within groups $MSE$ are close in
value, the $F$-score is close to 1 and results in a large p-value. In this case, the conclusion is
that the population means are equal.
• When the variance between groups $MST$ is significantly larger than the variability within
groups $MSE$, the $F$-score is large and results in a small p-value. In this case, the conclusion
is that there is a difference in the population means.
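Putting the pieces together, the $F$-score for a one-way ANOVA can be computed directly from raw data. The sketch below uses a small hypothetical data set (three groups of three values) picked so the numbers are easy to follow by hand. The function name is an assumption, and the p-value step is left out because it needs the area in the right tail of the $F$-distribution (f.dist.rt in Excel).

```python
from statistics import mean, variance


def one_way_anova_f(groups):
    """Return (F, df1, df2) for a one-way ANOVA on a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n

    # Between groups: spread of the sample means around the overall mean.
    sst = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    mst = sst / (k - 1)

    # Within groups: pooled sample variances.
    sse = sum((len(g) - 1) * variance(g) for g in groups)
    mse = sse / (n - k)

    return mst / mse, k - 1, n - k


# Hypothetical data, not from the text.
groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]

f_score, df1, df2 = one_way_anova_f(groups)
print(f_score, df1, df2)
```

In this made-up data set the third group sits well above the other two while the spread inside each group is small, so $MST$ is much larger than $MSE$ and the $F$-score is far from 1, the pattern described in the second bullet above.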
Steps to Conduct a One-Way ANOVA Hypothesis Test for Three or More Population Means
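1. Write down the null hypothesis that there is no difference in the population means:
\begin{eqnarray*} H_0: & & \mu_1=\mu_2=\cdots=\mu_k \end{eqnarray*}
2. Write down the alternative hypothesis that there is a difference in the population means:
\begin{eqnarray*} H_a: & & \mbox{at least one population mean is different
from the others} \end{eqnarray*}
4. Collect the sample information for the test and identify the significance level $\alpha$.
5. The p-value is the area in the right tail of the $F$-distribution. The $F$-score and degrees of
freedom are
\begin{eqnarray*} F & = & \frac{MST}{MSE} \\ df_1 & = & k-1 \\ df_2 & = & n-k \end{eqnarray*}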
EXAMPLE
A local college wants to compare the mean GPA for players on four of its sports teams: basketball,
baseball, hockey, and lacrosse. A random sample of players was taken from each team and their GPA
recorded in the table below.
Assume the populations are normally distributed and have equal variances. At the 5% significance
level, is there a difference in the average GPA between the sports teams?
Solution:
Let basketball be population 1, let baseball be population 2, let hockey be population 3, and let
lacrosse be population 4. From the question we have the following information:
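Hypotheses:
\begin{eqnarray*} H_0: & & \mu_1=\mu_2=\mu_3=\mu_4 \\ H_a: & & \mbox{at least one population mean is
different from the others} \end{eqnarray*}
p-value:
The p-value is the area in the right tail of the $F$-distribution. To use the f.dist.rt function, we need
to calculate out the $F$-score and the degrees of freedom:
\begin{eqnarray*} F & = & \frac{MST}{MSE} \\ df_1 & = & k-1 = 4-1 = 3 \\ df_2 & = & n-k = 20-4 = 16
\end{eqnarray*}
Function f.dist.rt
Field 2 3
Field 3 16
So the p-value $=0.9271$.
Conclusion:
Because p-value $=0.9271 \gt 0.05=\alpha$, we do not reject the null hypothesis. At the 5%
significance level there is enough evidence to suggest that the mean GPA for the sports teams is the
same.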
NOTES
1. The null hypothesis μ1 = μ2 = μ3 = μ4 is the claim that the mean GPAs for the sports
teams are all equal.
2. The alternative hypothesis is the claim that at least one of the population means is not equal
to the others. The alternative hypothesis does not say that all of the population means are not
equal, only that at least one of them is not equal to the others.
3. The p-value is the area in the right tail of the F-distribution, to the right of
the F-score. In the calculation of the p-value:
◦ The function is f.dist.rt because we are finding the area in the right tail of an
F-distribution.
◦ Field 1 is the value of the F-score.
◦ Field 2 is the value of df1 = k − 1 = 3.
◦ Field 3 is the value of df2 = n − k = 16.
4. The p-value of 0.9271 is a large probability compared to the significance level, so the sample
results are likely to happen assuming the null hypothesis is true. The data are therefore
consistent with the null hypothesis, and so the conclusion of the test is to not reject
the null hypothesis. In other words, there is not enough evidence to conclude that the population
means differ.
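As a worked check of the degrees of freedom entered in Fields 2 and 3, here is a short sketch. It assumes k = 4 teams and n = 20 players in total; the total sample size is not stated in the surviving text and is inferred here from df2 = 16.

```latex
\begin{eqnarray*}
df_1 & = & k-1 = 4-1 = 3 \\
df_2 & = & n-k = 20-4 = 16
\end{eqnarray*}
```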
The calculation of the variance between groups, the variance within groups, and the F-score for a
one-way ANOVA test can be time
consuming, even with the help of software like Excel. However, Excel has a built-in one-way
ANOVA summary table that not only generates the averages and variances for each group, but also
calculates the required F-score and p-value for the test.
In order to create a one-way ANOVA summary table, we need to use the Analysis ToolPak. Follow
these instructions to add the Analysis ToolPak.
This website provides additional information on using Excel to create a one-way ANOVA summary
table.
NOTE
Because we are using the p-value approach to hypothesis testing, it is not crucial that we enter the
actual significance level we are using for the test. The p-value (the area in the right tail of the
F-distribution) is not affected by the significance level. For the critical-value approach to hypothesis
testing, we must enter the correct significance level for the test because the critical value does
depend on the significance level.
EXAMPLE
A local college wants to compare the mean GPA for players on four of its sports teams: basketball,
baseball, hockey, and lacrosse. A random sample of players was taken from each team and their GPA
recorded in the table below.
Assume the populations are normally distributed and have equal variances. At the 5% significance
level, is there a difference in the average GPA among the sports teams?
Solution:
Let basketball be population 1, let baseball be population 2, let hockey be population 3, and let
lacrosse be population 4.
Hypotheses:
p-value:
SUMMARY
Groups    Count    Sum    Average    Variance
Hockey    5        15     3          0.56
ANOVA
Source of Variation    SS        df
Total                  8.0095    19
The p-value for the test is in the P-value column of the between groups row. So the p-value
= 0.9271.
Conclusion:
NOTES
1. In the top part of the ANOVA summary table (under the Summary heading), we have the
averages and variances for each of the groups (basketball, baseball, hockey, and lacrosse).
2. In the bottom part of the ANOVA summary table (under the ANOVA heading), we have
EXAMPLE
A fourth grade class is studying the environment. One of the assignments is to grow bean plants in
different soils. Tommy chose to grow his bean plants in soil found outside his classroom mixed with
dryer lint. Tara chose to grow her bean plants in potting soil bought at the local nursery. Nick chose
to grow his bean plants in soil from his mother’s garden. No chemicals were used on the plants, only
water. They were grown inside the classroom next to a large window. Each child grew five plants.
At the end of the growing period, each plant was measured, producing the data (in inches) in the
table below.
Tommy’s Plants    Tara’s Plants    Nick’s Plants
24                25               23
21                31               27
23                23               22
30                20               30
23                28               20
Assume the heights of the plants are normally distributed and have equal variance. At the 5%
significance level, does it appear that the three media in which the bean plants were grown produced
the same mean height?
Solution:
Let Tommy’s plants be population 1, let Tara’s plants be population 2, and let Nick’s plants be
population 3.
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_1=\mu_2=\mu_3 \\ H_a: & & \mbox{at least
one population mean is different from the others} \end{eqnarray*}
p-value:
SUMMARY
ANOVA
Source of Variation    SS          df
Total                  189.3333    14
So the p-value = 0.8760.
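Conclusion:
Because the p-value = 0.8760 is greater than the significance level of 0.05, we do not reject the
null hypothesis. At the 5% significance level there is not enough evidence to suggest that the mean
heights of the plants grown in the three media differ.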
NOTES
1. The null hypothesis μ1 = μ2 = μ3 is the claim that the mean heights of the plants
grown in the three different media are all equal.
2. The alternative hypothesis is the claim that at least one of the population means is not equal
to the others. The alternative hypothesis does not say that all of the population means are not
equal, only that at least one of them is not equal to the others.
3. The p-value of 0.8760 is a large probability compared to the significance level, so the sample
results are likely to happen assuming the null hypothesis is true. The data are therefore
consistent with the null hypothesis, and so the conclusion of the test is to not reject
the null hypothesis. In other words, there is not enough evidence to conclude that the population
means differ.
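The quantities behind the ANOVA summary table can also be reproduced outside Excel. The following is a minimal sketch in Python, not the Excel procedure used in this book. It assumes the three columns of the data table belong to Tommy, Tara, and Nick in that order, and the function name one_way_anova_f is hypothetical. The p-value line uses a closed form that holds only when df1 = 2; in general, use Excel's f.dist.rt function as described above.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F-score, df1, df2) for a one-way ANOVA on a list of samples."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between groups: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: spread of each observation around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df1, df2 = k - 1, n - k
    f_score = (ss_between / df1) / (ss_within / df2)
    return f_score, df1, df2

# Assumed column order: Tommy, Tara, Nick.
tommy = [24, 21, 23, 30, 23]
tara = [25, 31, 23, 20, 28]
nick = [23, 27, 22, 30, 20]

f_score, df1, df2 = one_way_anova_f([tommy, tara, nick])

# Right-tail area of the F-distribution.
# This closed form is valid ONLY because df1 == 2 here.
p_value = (1 + 2 * f_score / df2) ** (-df2 / 2)

print(round(f_score, 4), df1, df2, round(p_value, 4))
```

The F-score is well below 1 because the three sample means are much closer together than the plant heights within each group, which is why the p-value is so large.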
TRY IT
A statistics professor wants to study the average GPA of students in four different programs:
marketing, management, accounting, and human resources. The professor took a random sample of
GPAs of students in those programs at the end of the past semester. The data is recorded in the table
below.
Assume the GPAs of the students are normally distributed and have equal variance. At the 5%
significance level, is there a difference in the average GPA of the students in the different programs?
Let marketing be population 1, let management be population 2, let accounting be population 3, and
let human resources be population 4.
Hypotheses:
p-value:
SUMMARY
ANOVA
Total 8.999895 19
So the p-value .
Conclusion:
TRY IT
A manufacturing company runs three different production lines to produce one of its products. The
company wants to know if the average production rate is the same for the three lines. For each
production line, a sample of eight-hour shifts was taken and the number of items produced during
each shift was recorded in the table below.
Line 1    Line 2    Line 3
35        21        31
35        36        34
36        22        24
39        38        21
37        28        27
36        34        29
31        35        33
38        39        20
33        40        24
Assume the numbers of items produced on each line during an eight-hour shift are normally
distributed and have equal variance. At the 1% significance level, is there a difference in the average
production rate for the three lines?
Let Line 1 be population 1, let Line 2 be population 2, and let Line 3 be population 3.
Hypotheses:
\begin{eqnarray*} H_0: & & \mu_1=\mu_2=\mu_3 \\ H_a: & & \mbox{at least
one population mean is different from the others} \end{eqnarray*}
p-value:
SUMMARY
Groups    Count    Sum    Average    Variance
Line 3    9        243    27         26
ANOVA
Source of Variation    SS         df
Total                  1007.63    26
So the p-value .
Conclusion:
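As a cross-check on the Excel output, the SUMMARY row for Line 3 and the Total row of the ANOVA table can be reproduced from the data. This is a minimal sketch in Python rather than Excel. The assignment of data columns to Line 1, Line 2, and Line 3 (left to right) is an assumption, supported by the third column summing to 243.

```python
from statistics import mean, variance

# Assumed column order: Line 1, Line 2, Line 3.
lines = {
    "Line 1": [35, 35, 36, 39, 37, 36, 31, 38, 33],
    "Line 2": [21, 36, 22, 38, 28, 34, 35, 39, 40],
    "Line 3": [31, 34, 24, 21, 27, 29, 33, 20, 24],
}

# SUMMARY part: count, sum, average, and sample variance of each group.
summary = {
    name: (len(data), sum(data), mean(data), variance(data))
    for name, data in lines.items()
}

# Total row of the ANOVA part: squared deviations of every
# observation from the grand mean, with n - 1 degrees of freedom.
all_values = [x for data in lines.values() for x in data]
grand_mean = mean(all_values)
ss_total = sum((x - grand_mean) ** 2 for x in all_values)
df_total = len(all_values) - 1

for name, row in summary.items():
    print(name, *row)
print("Total", round(ss_total, 2), df_total)
```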
Concept Review
A one-way ANOVA hypothesis test determines if several population means are equal. In order to
conduct a one-way ANOVA test, the following assumptions must be met:
The analysis of variance procedure compares the variation between groups to the variation
within groups. The ratio of these two estimates of variance is the F-score from an
F-distribution with df1 = k − 1 and df2 = n − k, where k is the number of groups and n is the
total number of observations. The p-value for the test is the area in the right
tail of the F-distribution. The statistics used in an ANOVA test are summarized in the ANOVA
summary table generated by Excel.
The one-way ANOVA hypothesis test for three or more population means is a well-established
process:
1. Write down the null and alternative hypotheses in terms of the population means. The null
hypothesis is the claim that the population means are all equal and the alternative hypothesis
is the claim that at least one of the population means is different from the others.
2. Collect the sample information for the test and identify the significance level.
3. The p-value is the area in the right tail of the F-distribution. Use the ANOVA summary table
generated by Excel to find the p-value.
4. Compare the p-value to the significance level and state the outcome of the test.
5. Write down a concluding sentence specific to the context of the question.
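Step 4 can be sketched in Python as follows. The function name anova_conclusion is hypothetical, and treating a p-value exactly equal to the significance level as "do not reject" is an assumption of this sketch rather than a rule stated here.

```python
def anova_conclusion(p_value, alpha):
    """Compare the p-value to the significance level (step 4)."""
    if p_value < alpha:
        return "reject the null hypothesis"
    # Assumption of this sketch: a tie (p_value == alpha) is not rejected.
    return "do not reject the null hypothesis"

# p-values reported in the worked examples of this section, at the 5% level.
print(anova_conclusion(0.9271, 0.05))
print(anova_conclusion(0.8760, 0.05))
```

The concluding sentence in step 5 still has to be written by hand, because it depends on the context of the question.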
Attribution
“13.1 One-Way ANOVA” and “13.2 The F Distribution and the F-Ratio” in Introductory Statistics
by OpenStax is licensed under a Creative Commons Attribution 4.0 International License.
11.5 EXERCISES
1. Three different traffic routes are tested for mean driving time. The entries in the table are the
driving times in minutes on the three different routes. At the 5% significance level, test if the mean
driving times for the three routes are the same.
Route 1    Route 2    Route 3
30         27         16
32         29         41
27         28         22
35         36         31
2. Suppose a group is interested in determining whether teenagers obtain their drivers licenses at
approximately the same average age across the country. Suppose that the following data are
randomly collected from five teenagers in each region of the country. The numbers represent the
age at which teenagers obtained their drivers licenses. At the 5% significance level, determine if
the mean age is the same in the different regions of the country.
3. Groups of men from three different areas of the country are to be tested for mean weight. The
entries in the table are the weights for the different groups. At the 5% significance level, test if the
average weight for men is the same for the three groups.
4. Girls from four different soccer teams are to be tested for mean goals scored per game. The
entries in the table are the goals per game for the different teams. At the 5% significance level, test
if the mean goals scored per game is the same for the four teams.
Team 1    Team 2    Team 3    Team 4
1         2         0         3
2         3         1         4
0         2         1         4
3         4         0         3
2         4         0         2
5. Four basketball teams took a random sample of players regarding how high each player can
jump (in inches). At the 5% significance level, is there a difference in the mean jump heights
among the teams?
36 32 48 38 41
42 35 50 44 39
51 38 39 46 40
6. A video game developer is testing a new game on three different groups. Each group represents
a different target market for the game. The developer collects scores from a random sample from
each group. At the 5% significance level, are the scores among the different groups different?
98 160 198
7. Three students, Linda, Tuan, and Javier, are given five laboratory rats each for a nutritional
experiment. Each rat’s weight is recorded in grams. Linda feeds her rats Formula A, Tuan feeds his
rats Formula B, and Javier feeds his rats Formula C. At the end of a specified time period, each rat
is weighed again, and the net gain in grams is recorded. Using a significance level of 5%, determine
if the three formulas produce the same mean weight gain.
8. A grassroots group opposed to a proposed increase in the gas tax claimed that the increase would
hurt working-class people the most, since they commute the farthest to work. Suppose that the
group randomly surveyed 24 individuals and asked them their daily one-way commuting mileage.
Using a 5% significance level, test if the three mean commuting mileages are the same.
9. The following table lists the number of pages in four different types of magazines. Using a
significance level of 5%, test if the four magazine types have the same mean length. Assume that
all distributions are normal, the four population standard deviations are approximately the same,
and the data were collected independently and randomly.
172 87 82 104
163 123 87 98
10. A researcher wants to know if the mean times (in minutes) that people watch their favorite
news station are the same. At the 5% significance level, test if the mean times that people watch
their favorite news station are the same. Assume that all distributions are normal, the three
population standard deviations are approximately the same, and the data were collected
independently and randomly.
45 15 72
12 43 37
18 68 56
38 50 60
23 31 51
35 22
11. Are the means for the final exams the same for all statistics class delivery types? The table
shows the scores on final exams from several randomly selected classes that used the different
delivery types. Assume that all distributions are normal, the four population standard deviations
are approximately the same, and the data were collected independently and randomly. Use a 5%
significance level.
72 83 80
84 73 78
77 84 84
80 81 81
81 86
79
82
12. Are the mean numbers of daily visitors to a ski resort the same for the three types of snow
conditions? The table shows the results of a study. Assume that all distributions are normal,
the three population standard deviations are approximately the same, and the data were collected
independently and randomly. Use a 5% significance level.
1,528 2,233
1,382
13. Two coworkers commute from the same building. They are interested in whether or not there
is any variation in the time it takes them to drive to work. They each record their times for 20
commutes. The first worker’s times have a variance of 12.1. The second worker’s times have a
variance of 16.9. At the 5% significance level, test if the variation in the first worker’s commute time
is smaller than the second worker’s.
14. Two students are interested in whether or not there is variation in their test scores for math
class. There are 15 total math tests they have taken so far. The first student’s grades have a
standard deviation of 38.1. The second student’s grades have a standard deviation of 22.5. At the
5% significance level, determine if the variation in the second student’s scores is lower than the
first student’s.
15. Two cyclists are comparing the variances of their overall paces going uphill. Each cyclist
records his or her speeds going up 35 hills. The first cyclist has a variance of 23.8 and the second
cyclist has a variance of 32.1. At the 5% significance level, is there a difference in the variance in
the cyclists’ speeds?
16. Students Linda and Tuan are given five laboratory rats each for a nutritional experiment. Each
rat’s weight is recorded in grams. Linda feeds her rats Formula A and Tuan feeds his rats Formula
B. At the end of a specified time period, each rat is weighed again and the net gain in grams is
recorded.
Linda’s rats    Tuan’s rats
43.5            47.0
39.4            40.5
41.3            38.9
46.0            46.3
38.2            44.2
a. Construct a 98% confidence interval for the ratio of the variance in the net weight gain for
Linda’s and Tuan’s rats.
b. Interpret the confidence interval found in part (a).
c. Is there evidence to suggest that the variance in the net weight gain for Linda and Tuan’s rats
is the same? Explain.
17. A grassroots group opposed to a proposed increase in the gas tax claimed that the increase
would hurt working-class people the most, since they commute the farthest to work. Suppose that
the group randomly surveyed 16 individuals and asked them their daily one-way commuting
mileage. The results are as follows. Determine whether or not the variance in mileage driven is
statistically the same among the working class and professional (middle income) groups. Use a 5%
significance level.
Working class    Professional (middle income)
17.8             16.5
26.7             17.4
49.4             22.0
9.4              7.4
65.4             9.4
47.1             2.1
19.5             6.4
51.2             13.9
18. A researcher wants to study the amount of money, in dollars, that shoppers spend on Saturdays
and Sundays at the mall. A sample of shoppers is taken, and the amount of money they spent at
the mall on Saturday or Sunday is recorded in the table below.
75 44 62 137
18 58 0 82
150 61 124 39
94 19 50 127
62 99 31 141
73 60 118 73
89
a. Construct a 93% confidence interval for the ratio of the variances for the amount of money
spent on Saturdays and Sundays at the mall.
b. Interpret the confidence interval found in part (a).
c. Is there evidence to suggest that variance in the amount of money spent on Saturdays and
Sundays at the mall is different? Explain.
19. Are the variances for incomes on the East Coast and the West Coast the same? The table shows
the results of a study. Income is shown in thousands of dollars. Assume that both distributions are
normal. Use a 5% level of significance.
East West
38 71
47 126
30 42
82 51
75 44
52 90
115 88
67
Attribution
Chapter Outline
Linear regression and correlation can help you determine if an auto mechanic’s salary is related to his work
experience. Photo by Joshua Rothhaas, CC BY 4.0.
Professionals often want to know how two or more numeric variables are related. For example, is
there a relationship between the grade on the second math exam a student takes and the grade on
the final exam? If there is a relationship, what is the relationship and how strong is it? In another
example, the amount you pay a repair person for labor is often determined by an initial amount
plus an hourly fee.
In this chapter, we will be studying the simplest form of regression, simple linear regression,
with one independent variable, x. This involves data that fit a line in two dimensions. We will also
study correlation, which measures how strong the relationship is.
770 | 12.1 INTRODUCTION TO LINEAR REGRESSION AND CORRELATION
Attribution
LEARNING OBJECTIVES
In this chapter we will be studying simple linear regression, which models the linear relationship
between two variables x and y. A linear equation expresses y as a constant, the
y-intercept of the line, plus x multiplied by a second constant, the slope of the line.
The graph of a linear equation is a straight line.
EXAMPLE
The equation is a linear equation. The slope is and the -intercept is . The
graph of is shown below.
772 | 12.2 LINEAR EQUATIONS
TRY IT
Is the graph shown below the graph of a linear equation? Why or why not?
This is not a linear equation because the graph is not a straight line.
The slope is a number that describes the steepness of a line. The slope tells us how the value of
the y variable will change for every one unit increase in the value of the x variable.
The y-intercept is the value of the y-coordinate where the line crosses the y-axis. Algebraically,
the y-intercept is the value of y when x = 0.
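These two definitions can be checked numerically. The sketch below uses a hypothetical line with y-intercept 3 and slope 2, chosen only for illustration and not taken from the examples in this section.

```python
def line(x, intercept=3.0, slope=2.0):
    """Hypothetical linear equation: y = intercept + slope * x."""
    return intercept + slope * x

y_intercept = line(0)                  # value of y when x = 0
change_per_unit = line(5) - line(4)    # change in y for a one unit increase in x

print(y_intercept, change_per_unit)
```

Any other pair of x values that differ by one unit gives the same change in y, which is what makes the graph a straight line.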
Consider the figure below, which illustrates three different linear equations:
• In (a), the line rises from left to right across the graph. This means that the slope is a
positive number.
• In (b), the line is horizontal (parallel to the x-axis). This means that the slope is zero.
• In (c), the line falls from left to right across the graph. This means that the slope is a
negative number.
[Figure: three graphs of linear equations. In (a), the slope is positive and the line slopes upward to the right. In (b), the slope is zero and the graph is a horizontal line. In (c), the slope is negative and the line slopes downward to the right.]
EXAMPLE
• The slope is . This tells us that when the value of increases by , the value of increases
by . Because the slope is positive, the graph of rises from left to right.
• The -intercept is . This tells us that when , . On the graph of
, the line crosses the -axis at .
TRY IT
Consider the linear equation . Identify the slope and -intercept. Describe the slope
and -intercept in sentences.
• The slope is . This tells us that when the value of increases by , the value of
decreases by . Because the slope is negative, the graph of falls from left to
right.
• The -intercept is . This tells us that when , . On the graph of
, the line crosses the -axis at .
Concept Review
The most basic type of association is a linear association. This type of relationship can be defined
algebraically by the equation used, numerically with actual or predicted data values, or graphically
from a plotted curve (lines are classified as straight curves). Algebraically, a linear equation
expresses y as the y-intercept plus the slope multiplied by x.
The slope is a value that describes the rate of change of the y variable for a single unit increase
in the x variable. The y-intercept is the value of y when x = 0.
Attribution
LEARNING OBJECTIVES
An independent variable (or the x-variable) is called the explanatory or predictor variable.
The independent variable is used for prediction and provides the basis for estimation. The
independent variable may be thought of as the input value and is used to determine the output
value (the value of the dependent variable).
A dependent variable (or the y-variable) is called the response or outcome variable. The
dependent variable is the variable being predicted or estimated based on the value of the
independent variable. The dependent variable may be thought of as the output value and is
determined by the input value (the value of the independent variable).
EXAMPLE
Svetlana tutors to make extra money for college. For each tutoring session, she charges a one-time
12.3 SCATTER DIAGRAMS | 777
fee of $25 plus $25 per hour of tutoring. Here, there are two variables: the number of hours per
session and the amount of money earned per session.
• The number of hours per session is the independent variable because it can be used to predict
the value of the other variable (the amount of money earned per session).
• The amount of money earned per session is the dependent variable because its value can be
determined from the value of the other variable (the number of hours per session).
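The input and output roles of the two variables can be sketched in Python. The function name earnings is hypothetical; the $25 one-time fee and the $25 hourly rate come from the example.

```python
def earnings(hours):
    """Amount earned (in dollars) for one session: $25 fee plus $25 per hour."""
    return 25 + 25 * hours

# hours is the independent variable (the input);
# the amount earned is the dependent variable (the output).
for hours in [1, 2, 3]:
    print(hours, earnings(hours))
```

Each additional hour adds $25 to the amount earned, and a session of zero hours would still earn the $25 fee.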
Scatter Diagrams
Before we begin the discussion about correlation and linear regression, we need to consider ways
to display the relationship between the independent variable x and the dependent variable y. The
most common and easiest way to illustrate the relationship between the two variables is with a
scatter diagram.
A scatter diagram (or scatter plot) is a graphical presentation of the relationship between two
numerical variables. Each point on the scatter diagram represents the values of two variables.
The x-coordinate is the value of the independent variable and the y-coordinate is the value of the
corresponding dependent variable.
To construct a scatter diagram:
EXAMPLE
In Europe and Asia, m-commerce is popular. M-commerce users have special mobile phones that
work like electronic wallets as well as provide phone and internet services. Users can do everything
from paying for parking to buying a TV set or soda from a machine to banking to checking sports
scores on the internet. Data for the number of users from years 2000 through 2004 is given in the
table below.
Year    Number of users
2000    0.5
2002    20.0
2003    33.0
2004    47.0
Which variable is the independent variable? Which variable is the dependent variable? Construct a
scatter diagram for this data.
Solution:
• The year is the independent variable because it can be used to predict the value of the other
variable (the number of users).
• The number of users is the dependent variable because its value can be determined from the
value of the other variable (year).
TRY IT
Amelia plays basketball for her high school. She wants to improve her play so she can compete at
the college level. The table below records the number of hours she spends practicing her jump shot
before a game and the number of points she scored in the following game.
Hours spent practicing jump shot    Points scored in game
5                                   15
7                                   22
9                                   28
10                                  31
11                                  33
12                                  36
Which variable is the independent variable? Which variable is the dependent variable? Construct a
scatter diagram for this data.
• The hours spent practicing jump shot is the independent variable because it can be used to
predict the value of the other variable (points scored in game).
• The points scored in game is the dependent variable because its value can be determined from
the value of the other variable (hours spent practicing jump shot).
4. Using the chart tools, add axis titles, including both the variable names and units on the axes.
5. Using the chart tools, add a chart title. A common chart title is independent variable vs
dependent variable, using the actual names of the variables.
Visit the Microsoft page for more information about creating a scatter diagram in Excel.
A scatter diagram shows the direction of the relationship between the independent and
dependent variables. That is, a scatter diagram shows if the points are, in general, rising or falling
as we read from left to right across the graph.
We can determine the strength of the relationship by looking at the scatter diagram to see how
close the points are to a line, a power function, an exponential function, or to some other type
of function. The stronger the relationship, the better the corresponding regression model (linear,
exponential, etc.) will be at predicting values of the dependent variable.
When we look at a scatter diagram, we want to notice the overall pattern and any deviations
from the pattern. The scatter diagrams shown below illustrate these concepts.
In this chapter, we are only concerned with the strength and direction of the linear relationship
between the independent and dependent variables. In the next section, we will learn about a
numerical measure, the correlation coefficient, that measures the strength and direction of the
linear relationship.
Because linear patterns are quite common, we are interested in scatter diagrams that show a
linear pattern. The linear relationship is strong if the points are close to a straight line, except in
the case of a horizontal line where there is no relationship. If a scatter diagram shows a linear
relationship, we would like to create a model based on this apparent linear relationship. This model
is constructed through a process called simple linear regression. However, we only calculate a
regression line if one of the variables, x, helps to explain or predict the other variable, y. If x is the
independent variable and y is the dependent variable, then we can use a regression line to predict a
value for y for a given value of x.
Watch this video: Introduction to Linear Regression and Scatter Diagrams by ExcelIsFun [15:45]
Concept Review
Scatter diagrams are particularly helpful graphs when we want to see if there is a linear relationship
between two variables. They indicate both the direction of the relationship between the
independent variable x and the dependent variable y, and the strength of the relationship.
Attribution
“12.2 Scatter Plots” in Introductory Statistics by OpenStax is licensed under a Creative Commons
Attribution 4.0 International License.
12.4 CORRELATION
LEARNING OBJECTIVES
The purpose of simple linear regression is to build a linear model that can be used to make
predictions for the y variable for a given value of the x variable. Of course, we want the model to give
us good predictions—there is no point in using a model that gives bad or inaccurate predictions.
But how can we tell if the linear model will provide accurate predictions? As we have seen, we can
examine the scatter diagram for a set of data to get a sense of the strength and direction of the linear
relationship between the independent variable x and the dependent variable y. But we would like
a numerical measure of the strength and direction of the linear relationship we observe on the
scatter diagram. This numerical measure is called the correlation coefficient.
The correlation coefficient was developed by Karl Pearson in the early 1900s, and is sometimes
referred to as Pearson’s correlation coefficient. Denoted by r, the correlation coefficient is a
numerical measure of the strength and direction of the linear relationship between the independent
variable x and the dependent variable y. Although there is an algebraic formula to find the value
of r, we will perform the calculation using the built-in function in Excel.
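For reference, one standard form of the algebraic formula is sketched below. The notation is an assumption of this sketch: the n data pairs are written as (x_i, y_i), and \bar{x} and \bar{y} are the sample means of the two variables.

```latex
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
```

The numerator is positive when x and y tend to sit on the same side of their means together, and negative when they tend to sit on opposite sides, which is what gives r its sign.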
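The value of r:
• The value of r is always between −1 and 1.
• If r = 1, there is a perfect, positive correlation between x and y, in which case the points on
the scatter diagram would all lie on a straight line that rises from left to right. If r = −1,
there is a perfect, negative correlation between x and y, in which case the points on the
scatter diagram would all lie on a straight line that falls from left to right.
• Values of r close to −1 or 1 indicate a strong linear relationship between x and y.
• Values of r close to −0.5 or 0.5 indicate a moderate linear relationship between x and y.
• Values of r close to 0 indicate a weak linear relationship between x and y. If r = 0, then
there is no linear correlation between x and y.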
The sign of r:
• A positive value of r means that the points on the scatter diagram have the general tendency
to rise from left to right. In other words, when x increases, y tends to increase and when x
decreases, y tends to decrease.
• A negative value of r means that the points on the scatter diagram have the general tendency
to fall from left to right. In other words, when x increases, y tends to decrease and when x
decreases, y tends to increase.
[Figure: scatter diagrams illustrating strong correlation, moderate correlation, and weak correlation.]
To calculate the correlation coefficient, use the correl(array,array) function. Enter the cell array
containing the independent variable data into one of the arrays and enter the cell array containing
the dependent variable data into the other array.
The output from the correl function is the value of the correlation coefficient.
Visit the Microsoft page for more information about the correl function.
NOTE
The arrays containing the independent and dependent variable data may be entered into the correl
function in either order. The output from the correl function does not depend on the order in
which the arrays are entered.
EXAMPLE
A statistics professor wants to study the relationship between a student’s score on the third exam in
the course and their final exam score. The professor took a random sample of 11 students and
recorded their third exam score (out of 80) and their final exam score (out of 200). The results are
recorded in the table below.
Student    Third exam score (out of 80)    Final exam score (out of 200)
1          65                              175
2          67                              133
3          71                              185
4          71                              163
5          66                              126
6          75                              198
7          67                              153
8          70                              163
9          71                              159
10         69                              151
11         69                              159
Solution:
1. Enter the data into an Excel spreadsheet. For this example, suppose we entered the data
(without the column headings) so that the student column is in column A from A1 to A11, the
third exam score is in column B from B1 to B11, and the final exam score is in column C from
C1 to C11.
Function    correl
Field 1     B1:B11
Field 2     C1:C11
By examining the scatter diagram for this data, shown below, we can see that the points are
rising from left to right (corresponding to the fact that r is positive) and the general pattern of
points corresponds to a moderate linear relationship (corresponding to the fact that r is close
to 0.5).
2. There is a moderate, positive linear relationship between the third test score and the final
exam score.
NOTES
1. In this case the value of r is close to 0.5, so we would consider this a moderate linear
relationship.
2. When writing down the interpretation of the correlation coefficient, remember to be specific
to the question using the actual names of the independent and dependent variables. Also
make sure to include in the sentence the strength of the linear relationship (strong, moderate,
or weak) and the direction of the linear relationship (positive or negative).
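The correlation coefficient for the exam data can also be computed directly from the data. The following is a minimal sketch in Python, not the Excel function itself: the name correl is borrowed from Excel for readability, but the implementation is this sketch's own. It also computes r with the two lists swapped, to illustrate that the order of the arrays does not matter.

```python
from math import sqrt

def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

r = correl(third_exam, final_exam)
r_swapped = correl(final_exam, third_exam)   # arrays entered in the other order

print(round(r, 4), round(r_swapped, 4))
```

Both calls give the same positive value of roughly 0.66, consistent with the moderate, positive linear relationship described in the solution.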
TRY IT
SCUBA divers have maximum dive times they cannot exceed when going to different depths. The
data in the table below shows different depths with the maximum dive times in minutes.
Depth    Maximum dive time (minutes)
50       80
60       55
70       45
80       35
90       25
100      22
1.
2. There is a strong, negative linear relationship between depth and maximum dive time.
The correlation coefficient only measures the correlation between two variables, not causation.
A strong correlation between two variables does not mean that changes in one variable actually
cause changes in the other variable. The correlation coefficient can only tell us that changes in the
independent variable and dependent variable are related. In general, remember that “correlation
does not equal causation.”
Watch this video: Using Excel to Calculate a Correlation Coefficient by Matt Macarty [5:21]
Concept Review
The correlation coefficient $r$ measures the strength and direction of the linear relationship
between $x$ and $y$. The value of $r$ is between $-1$ and $1$. When $r$ is positive, the values of $x$
and $y$ will tend to increase and decrease together. When $r$ is negative, $x$ will increase and $y$
will decrease, or the opposite, $x$ will decrease and $y$ will increase.
Attribution
“12.3 The Regression Equation” and “12.4 Testing the Significance of the Correlation Coefficient”
in Introductory Statistics by OpenStax is licensed under a Creative Commons Attribution 4.0
International License.
12.5 THE REGRESSION EQUATION
LEARNING OBJECTIVES
We often want to use values of the independent variable to make predictions about the value of
the dependent variable. For example, we might want to use the amount a business spends on
advertising each quarter to make a prediction about the revenue the business will generate that
quarter. When a linear relationship exists between an independent and dependent variable, we can
build a linear model of that relationship, and then we can use that model to make predictions about
the dependent variable.
Simple linear regression is a modeling technique in which the linear relationship between one
independent variable and one dependent variable is approximated by a straight line, called the
line-of-best-fit or least squares line. It is important to note that the line-of-best-fit only models
the linear relationship between the independent and dependent variables.
The equation for the regression line is:
$$\hat{y} = b_0 + b_1 x$$
where $b_0$ is the $y$-intercept and $b_1$ is the slope.
The value of $\hat{y}$ is the estimated value of $y$. It is the value of $y$ obtained using the
regression line. It is not generally equal to the value of $y$ from the sample data. The values for
the slope $b_1$ and the $y$-intercept $b_0$ in the line-of-best-fit are calculated using the sample
data and the least squares method. Although there are formulas to calculate the values of the slope
and $y$-intercept in the regression line, we will calculate the slope and $y$-intercept using the
built-in functions in Excel.
The slope $b_1$ of the linear regression equation:
• The slope $b_1$ of the line-of-best-fit and the correlation coefficient $r$ have the same sign.
That is, $b_1$ and $r$ are either both positive or both negative.
• The slope $b_1$ of the regression equation tells us how the dependent variable $y$ changes for a
one unit increase in the independent variable $x$.
• When interpreting the slope, be specific to the context of the question, using the actual names
of the variables and correct units.
The $y$-intercept $b_0$ of the linear regression equation:
• The $y$-intercept $b_0$ of the line-of-best-fit is the predicted value of the dependent variable
$y$ when $x = 0$.
• When interpreting the $y$-intercept, be specific to the context of the question, using the actual
names of the variables and correct units.
To calculate the slope of the linear regression equation, use the slope(array for y’s,array for x’s)
function.
• For array for y’s, enter the cell array containing the dependent variable data.
• For array for x’s, enter the cell array containing the independent variable data.
Visit the Microsoft page for more information about the slope function.
To calculate the $y$-intercept of the linear regression equation, use the intercept(array for
y’s,array for x’s) function.
• For array for y’s, enter the cell array containing the dependent variable data.
• For array for x’s, enter the cell array containing the independent variable data.
Visit the Microsoft page for more information about the intercept function.
NOTE
The order in which the data is entered into these functions is important. In both the slope and
intercept functions, the data for the dependent variable is entered in the first array and the data for
the independent variable is entered in the second array. The output from the slope and
intercept functions will be different when the order of the inputs is switched.
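To see why the argument order matters, the least squares calculations can be sketched in Python. This is an illustrative sketch only: the helper names `slope` and `intercept` mirror the Excel function names and take the dependent data first, but they are assumptions written for this example, not Excel itself.

```python
# Exam data used throughout this chapter.
third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]            # x
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]  # y

def slope(ys, xs):
    """Least squares slope; dependent data first, independent data second."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def intercept(ys, xs):
    """Least squares y-intercept: mean of y minus slope times mean of x."""
    return sum(ys) / len(ys) - slope(ys, xs) * sum(xs) / len(xs)

b1 = slope(final_exam, third_exam)
b0 = intercept(final_exam, third_exam)
print(round(b1, 2), round(b0, 2))  # 4.83 -173.51

# Switching the two arrays fits a different line (x predicted from y).
swapped = slope(third_exam, final_exam)
print(round(swapped, 4))  # 0.0911
```

The swapped call divides by the variation in the final exam scores instead of the variation in the third exam scores, which is why the two results differ.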
EXAMPLE
A statistics professor wants to study the relationship between a student’s score on the third exam in
the course and their final exam score. The professor took a random sample of 11 students and
recorded their third exam score (out of 80) and their final exam score (out of 200). The results are
recorded in the table below. The professor wants to develop a linear regression model to predict a
student’s final exam score from the third exam score.
Student   Third Exam Score   Final Exam Score
1         65                 175
2         67                 133
3         71                 185
4         71                 163
5         66                 126
6         75                 198
7         67                 153
8         70                 163
9         71                 159
10        69                 151
11        69                 159
Solution:
1. Because we want to predict the final exam score from the third exam score, the independent
variable $x$ is the third exam score and the dependent variable $y$ is the final exam score. Enter
the data into an Excel spreadsheet. For this example, suppose we entered the data (without
the column headings) so that the student column is in column A from A1 to A11, the third
exam score is in column B from B1 to B11, and the final exam score is in column C from C1 to
C11.
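Function   slope
Field 1    C1:C11
Field 2    B1:B11
Answer     4.8274

Function   intercept
Field 1    C1:C11
Field 2    B1:B11
Answer     -173.5134

The linear regression equation to predict the final exam score from the third exam score is
$\hat{y} = -173.51 + 4.83x$, where $x$ is the third exam score and $\hat{y}$ is the predicted final
exam score.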
The graph below shows the scatter diagram with the line-of-best-fit.
2. The slope is $b_1 = 4.83$. Interpretation: For a one-point increase in the score on the third
exam, the predicted final exam score increases by 4.83 points.
NOTE
1. When writing down the linear regression equation, remember to define what the variables
represent in the context of the question. That is, state what $x$ and $y$ represent in relation to
the question.
2. When writing down the interpretation of the slope, remember to be specific to the question
using the actual names of the independent and dependent variables and appropriate units.
Given a specific value of the independent variable $x$, the linear regression equation may be used
to predict/estimate the value of the dependent variable $y$. To make predictions, the following
conditions must be met:
• There must be a linear relationship between the variables. The stronger the linear
relationship, the better the prediction will be.
• The linear regression equation is only valid to predict values of the dependent variable. That
is, we may only use the equation to solve for $\hat{y}$ for a given value of $x$, and not the other
way around.
• The linear regression equation should only be used to make predictions for $y$ for values of
$x$ within the domain of the $x$ values in the sample data used to construct the regression
equation. The regression equation does not provide reliable predictions for values of $x$ that
fall outside the domain of the $x$ values in the sample data.
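The last condition can be built into a prediction routine. The sketch below is one possible way to do this in Python; the function `predict`, its parameters, and the model $\hat{y} = 10 + 2x$ with sample $x$ values between 0 and 50 are all hypothetical and chosen only for illustration.

```python
def predict(x, b0, b1, x_min, x_max):
    """Return b0 + b1*x, refusing to extrapolate outside [x_min, x_max]."""
    if not x_min <= x <= x_max:
        raise ValueError(f"x = {x} is outside the sample range [{x_min}, {x_max}]")
    return b0 + b1 * x

# Hypothetical model: y-hat = 10 + 2x, built from x values between 0 and 50.
print(predict(20, b0=10.0, b1=2.0, x_min=0, x_max=50))  # 50.0

# A value of x outside the sample data is rejected instead of predicted.
try:
    predict(60, b0=10.0, b1=2.0, x_min=0, x_max=50)
except ValueError as err:
    print("refused:", err)
```

Raising an error is a design choice made for this sketch; a warning would also work. The point is that the equation itself will happily return a number for any $x$, so the check has to be made separately.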
EXAMPLE
A statistics professor wants to study the relationship between a student’s score on the third exam in
the course and their final exam score. The professor took a random sample of 11 students and
recorded their third exam score (out of 80) and their final exam score (out of 200). The results are
recorded in the table below. The professor developed the linear regression model
$\hat{y} = -173.51 + 4.83x$ to predict a student’s final exam score ($\hat{y}$) from a student’s
third exam score ($x$).
Student   Third Exam Score   Final Exam Score
1         65                 175
2         67                 133
3         71                 185
4         71                 163
5         66                 126
6         75                 198
7         67                 153
8         70                 163
9         71                 159
10        69                 151
11        69                 159
1. What is the professor’s final exam prediction for a student that scored 66 on the third exam?
2. What is the professor’s final exam prediction for a student that scored 73 on the third exam?
3. Should the professor use the linear regression model to predict the final exam score for a
student that scored 90 on the third exam? Why?
Solution:
1. $\hat{y} = -173.51 + 4.83(66) = 145.27$. A student that scored 66 on the third exam has a
predicted score of 145.27 on the final exam.
2. $\hat{y} = -173.51 + 4.83(73) = 179.08$. A student that scored 73 on the third exam has a
predicted score of 179.08 on the final exam.
3. The $x$ values (third exam score) in the sample data are between 65 and 75. An $x$ value of 90 is
outside the domain of the observed $x$ values in the data. So, we cannot reliably predict the
final exam score for a student that scored 90 on the third exam. Of course, it is possible to
enter $x = 90$ into the linear regression equation and calculate the corresponding value of
$\hat{y}$, but this value is not a reliable prediction. If we calculate out the value of $\hat{y}$ in
the regression equation for $x = 90$, we get $\hat{y} = -173.51 + 4.83(90) = 261.19$, a value that
makes no sense in the context of the question because the maximum score on the final exam is 200.
NOTES
1. The values obtained for the linear regression equation are predictions only. Here, 145.27 is
the predicted final exam score for a student that scored 66 on the third exam. This does not
mean that a student that actually scored 66 on the third exam will score 145.27 on the final
exam.
2. Remember that the linear regression equation only gives reliable predictions for values of $x$
that fall within the domain of $x$ values in the sample data.
TRY IT
SCUBA divers have maximum dive times they cannot exceed when going to different depths. The
data in the table below shows different depths with the maximum dive times in minutes.
Depth (in feet)   Maximum Dive Time (in minutes)
50                80
60                55
70                45
80                35
90                25
100               22
1. Find the linear regression equation to predict the maximum dive time from the depth.
2. Interpret the slope of the regression equation found in part 1.
3. Predict the maximum dive time for a depth of 75 feet.
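One way to check your answers to this exercise outside Excel is the short Python sketch below. It applies the usual least squares formulas directly to the dive data; the variable names are illustrative assumptions.

```python
depth = [50, 60, 70, 80, 90, 100]      # x, in feet
max_time = [80, 55, 45, 35, 25, 22]    # y, in minutes

n = len(depth)
mean_x = sum(depth) / n
mean_y = sum(max_time) / n

# Least squares slope and y-intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(depth, max_time))
sxx = sum((x - mean_x) ** 2 for x in depth)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
print(round(b0, 2), round(b1, 2))  # 127.24 -1.11

# Prediction at a depth of 75 feet, using the unrounded coefficients.
prediction = b0 + b1 * 75
print(round(prediction, 2))  # 43.67
```

If the coefficients are rounded to two decimal places before predicting, as in the worked example above, the calculation becomes $127.24 - 1.11(75) = 43.99$ minutes. The small difference is due only to rounding.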
The difference between the actual value of the dependent variable $y$ (in the sample data) and the
predicted value of the dependent variable $\hat{y}$ obtained from the linear regression equation is
called the error or residual: error $= y - \hat{y}$.
Graphically, the absolute value of the error is the vertical distance between the actual value of
$y$ (the point on the scatter diagram) and the predicted value $\hat{y}$ (the point on the linear
regression line). In other words, the absolute value of the error measures the vertical distance
between the actual data point and the line.
The slope $b_1$ and $y$-intercept $b_0$ for the linear regression equation are generated using the errors and
the least squares method. The idea behind finding the line-of-best-fit is based on the assumption
that the data are scattered about a straight line. For any line, the errors can be calculated, squared,
and then these squared errors can be added up. Of all of the possible lines, the line-of-best-fit is the
one line that minimizes this sum of the squared errors. Any other line will have a higher sum of
the squared errors compared to the sum of the squared errors for the line-of-best-fit.
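The least squares idea can be checked numerically. The following Python sketch, using the exam data from this section, computes the sum of the squared errors for the line-of-best-fit and for a few nearby lines; the function name `sse` and the particular shifts tried are assumptions made for illustration.

```python
xs = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]            # third exam
ys = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]  # final exam

def sse(b0, b1):
    """Sum of squared errors for the line y-hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Least squares slope and y-intercept.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
b0 = mean_y - b1 * mean_x

best = sse(b0, b1)
print(round(best, 1))  # 2424.3

# Any other line, such as one with a shifted intercept or slope, does worse.
shifts = [(5, 0), (-5, 0), (0, 0.5), (0, -0.5)]
print(all(sse(b0 + d0, b1 + d1) > best for d0, d1 in shifts))  # True
```

Trying four other lines does not prove that the line-of-best-fit is the minimum, but it illustrates the claim: each change to the slope or intercept increases the sum of the squared errors.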
Watch this video: Slope and Intercept for Linear Regression in Excel by ExcelIsFun [18:29]
Concept Review
A regression line, or a line-of-best-fit, can be drawn on a scatter diagram and used to predict
outcomes for the $y$ variable in a given data set or sample data. Regression lines can be used to
predict values within the given set of data, but should not be used to make predictions for values
outside the set of data.
Attribution
“12.3 The Regression Equation” and “12.5 Prediction” in Introductory Statistics by OpenStax is
licensed under a Creative Commons Attribution 4.0 International License.
12.6 COEFFICIENT OF DETERMINATION
LEARNING OBJECTIVES
Previously, we saw how to use the correlation coefficient to measure the strength and direction
of the linear relationship between the independent and dependent variables. The correlation
coefficient gives us a way to measure how well a linear regression model fits the data. The
coefficient of determination is another way to evaluate how well a linear regression model fits
the data. Denoted $r^2$, the coefficient of determination is the proportion of variation in the
dependent variable that can be explained by the regression equation based on the independent
variable. The coefficient of determination is the square of the correlation coefficient.
The coefficient of determination is a number between 0 and 1, and is the decimal form of a
percent. The closer the coefficient of determination is to 1, the better the independent variable is at
predicting the dependent variable. When we interpret the coefficient of determination, we use the
percent form. When expressed as a percent, $r^2$ represents the percent of variation in the dependent
variable that can be explained by the variation in the independent variable using the regression
line. When interpreting the coefficient of determination, remember to be specific to the context of
the question.
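The two descriptions of $r^2$ given above (the square of the correlation coefficient, and the proportion of variation explained by the regression line) lead to the same number. A minimal Python sketch using the exam data from the previous sections illustrates this; the variable names are assumptions, and the "explained proportion" is computed here as one minus the ratio of the sum of squared errors to the total variation in $y$.

```python
import math

xs = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]            # third exam
ys = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]  # final exam

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)   # total variation in y
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Route 1: square the correlation coefficient.
r = sxy / math.sqrt(sxx * syy)
r_squared_from_r = r ** 2

# Route 2: proportion of the variation in y not left over as squared error.
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
r_squared_from_sse = 1 - sse / syy

print(round(r_squared_from_r, 4), round(r_squared_from_sse, 4))  # 0.4397 0.4397
```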
EXAMPLE
A statistics professor wants to study the relationship between a student’s score on the third exam in
the course and their final exam score. The professor took a random sample of 11 students and
recorded their third exam score (out of 80) and their final exam score (out of 200). The results are
recorded in the table below. The professor wants to develop a linear regression model to predict a
student’s final exam score from the third exam score.
Student   Third Exam Score   Final Exam Score
1         65                 175
2         67                 133
3         71                 185
4         71                 163
5         66                 126
6         75                 198
7         67                 153
8         70                 163
9         71                 159
10        69                 151
11        69                 159
Solution:
1. $r^2 = 0.4397$.
2. 43.97% of the variation in the final exam score can be explained by the regression line based
on the third exam score.
TRY IT
SCUBA divers have maximum dive times they cannot exceed when going to different depths. The
data in the table below shows different depths with the maximum dive times in minutes. Previously,
we found the correlation coefficient and the regression line to predict the maximum dive time from
depth.
Depth (in feet)   Maximum Dive Time (in minutes)
50                80
60                55
70                45
80                35
90                25
100               22
1. $r^2 = 0.9273$.
2. 92.73% of the variation in the maximum dive time can be explained by the regression line
based on depth.
Concept Review
The coefficient of determination, $r^2$, is equal to the square of the correlation coefficient. When
expressed as a percent, the coefficient of determination represents the percent of variation in the
dependent variable that can be explained by the variation in the independent variable using the
regression line.
Attribution
“12.3 The Regression Equation” in Introductory Statistics by OpenStax is licensed under a Creative
Commons Attribution 4.0 International License.
12.7 STANDARD ERROR OF THE ESTIMATE
LEARNING OBJECTIVES
The difference between the actual value of the dependent variable $y$ (in the sample data) and the
predicted value of the dependent variable $\hat{y}$ obtained from the linear regression equation is
called the error or residual: error $= y - \hat{y}$.
Graphically, the absolute value of the error is the vertical distance between the actual value of
$y$ (the point on the scatter diagram) and the predicted value $\hat{y}$ (the point on the linear
regression line). In other words, the absolute value of the error measures the vertical distance
between the actual data point and the line.
The standard error of the estimate, denoted $s_e$, is a measure of the standard deviation
of the errors in a regression model. The standard error of the estimate is a measure of the
average deviation or dispersion of the points on the scatter diagram around the line-of-best-fit.
The standard error of the estimate for the linear regression model is analogous to the standard
deviation for a set of points, but instead of measuring the average distance from the mean we are
measuring the average distance from the regression line. Graphically, the standard error of the
estimate measures the average vertical distance (the absolute value of the errors) between the points
on the scatter diagram and the regression line.
When the points on the scatter diagram are close to the regression line, the errors are small, and
so the average of the dispersion of the points around the line will be small. In this case, the value of
the standard error of the estimate will be relatively small, which reflects the fact that there is little
variation between the actual data points (the points on the scatter diagram) and the linear regression
model. This implies that the linear regression model is a good fit for the data and predictions made
with the linear regression model will be fairly accurate.
Conversely, when the points on the scatter diagram are widely dispersed around the regression
line, the errors are large, and so the average dispersion of the points around the line will be large.
In this case, the value of the standard error of the estimate will be large, which reflects the greater
dispersion between the actual data points and the linear regression model. This implies that the
linear regression model is not a good fit for the data and predictions made with the linear regression
model will be inaccurate.
The value of $s_e$ tells us, on average, how much the dependent variable differs from the regression
line based on the independent variable. When interpreting the standard error of the estimate,
remember to be specific to the question, using the actual names of the dependent and independent
variables, and include appropriate units. The units of the standard error of the estimate are the
same as the units of the dependent variable.
Although there is a formula to calculate out the value of the standard error of the estimate, we
will calculate the standard error of the estimate using the built-in function in Excel.
To calculate the standard error of the estimate, use the steyx(array for y’s,array for x’s) function.
• For array for y’s, enter the cell array containing the dependent variable data.
• For array for x’s, enter the cell array containing the independent variable data.
Visit the Microsoft page for more information about the steyx function.
NOTE
The order in which the data is entered into the steyx function is important. The data for the
dependent variable is entered in the first array and the data for the independent variable is
entered in the second array. The output from the steyx function will be different when the order of
the inputs is switched.
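For readers curious about what the steyx function computes, the sketch below reproduces the calculation in Python: the square root of the sum of the squared errors divided by $n - 2$. The Python function named `steyx` here is an illustrative stand-in written for this example, taking the dependent data first like the Excel function, and the exam data is the data used throughout this chapter.

```python
import math

third_exam = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]            # x
final_exam = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]  # y

def steyx(ys, xs):
    """Standard error of the estimate: sqrt(SSE / (n - 2))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    # Squared vertical distances between the data points and the line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

print(round(steyx(final_exam, third_exam), 2))  # 16.41

# Switching the inputs measures scatter in the third exam scores instead.
print(round(steyx(third_exam, final_exam), 2))  # 2.25
```

The second result is in third-exam points rather than final-exam points, which is another way to see that the order of the inputs matters.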
EXAMPLE
A statistics professor wants to study the relationship between a student’s score on the third exam in
the course and their final exam score. The professor took a random sample of 11 students and
recorded their third exam score (out of 80) and their final exam score (out of 200). The results are
recorded in the table below. The professor wants to develop a linear regression model to predict a
student’s final exam score from the third exam score.
Student   Third Exam Score   Final Exam Score
1         65                 175
2         67                 133
3         71                 185
4         71                 163
5         66                 126
6         75                 198
7         67                 153
8         70                 163
9         71                 159
10        69                 151
11        69                 159
Solution:
1. Enter the data into an Excel spreadsheet. For this example, suppose we entered the data
(without the column headings) so that the student column is in column A from A1 to A11, the
third exam score is in column B from B1 to B11, and the final exam score is in column C from
C1 to C11.
Function   steyx
Field 1    C1:C11
Field 2    B1:B11
Answer     16.4124
2. On average, the final exam score differs by 16.41 points from the regression line based on the
third exam score.
TRY IT
SCUBA divers have maximum dive times they cannot exceed when going to different depths. The
data in the table below shows different depths with the maximum dive times in minutes. Previously
we found the regression line to predict the maximum dive time from depth.
Depth (in feet)   Maximum Dive Time (in minutes)
50                80
60                55
70                45
80                35
90                25
100               22
1. $s_e = 6.53$.
2. On average, the maximum dive time differs by 6.53 minutes from the regression line based on
depth.
Concept Review
The standard error of the estimate, $s_e$, measures the average deviation or dispersion of the points
on the scatter diagram around the line-of-best-fit. The smaller the value of the standard error of the
estimate, the better the fit of the regression line to the data.
12.8 EXERCISES
1. A vacation resort rents SCUBA equipment to certified divers. The resort charges an up-front fee
of $25 and another fee of $12.50 an hour.
a.
b.
c.
d.
4. The table below contains real data for the first two decades of AIDS reporting. Use the columns
“year” and “# AIDS cases diagnosed.” Why is “year” the independent variable and “# AIDS cases
diagnosed” the dependent variable (instead of the reverse)?
Pre-1981 91 29
5. A specialty cleaning company charges an equipment fee and an hourly labor fee. A linear
equation that expresses the total amount of the fee the company charges for each session is
.
6. Due to erosion, a river shoreline is losing several thousand pounds of soil each year. A linear
equation that expresses the total amount of soil lost per year is .
7. The price of a single issue of stock can fluctuate throughout the day. A linear equation that
represents the price of stock for Shipment Express is where is the number of hours
passed in an eight-hour day of trading.
8. For each of the following situations, state the independent variable and the dependent variable.
a. A study is done to determine if elderly drivers are involved in more motor vehicle fatalities
than other drivers. The number of fatalities per 100,000 drivers is compared to the age of
drivers.
b. A study is done to determine if the weekly grocery bill changes based on the number of family
members.
c. Insurance companies base life insurance premiums partially on the age of the applicant.
d. Utility bills vary according to power consumption.
e. A study is done to determine if a higher education reduces the crime rate in a population.
9. Does the scatter plot appear linear? Strong, moderate, or weak? Positive or negative?
10. Does the scatter plot appear linear? Strong, moderate, or weak? Positive or negative?
11. Does the scatter plot appear linear? Strong, moderate, or weak? Positive or negative?
12. The Gross Domestic Product Purchasing Power Parity is an indication of a country’s currency
value compared to another country. The table below shows the GDP PPP of Cuba as compared to
US dollars. Construct a scatter plot of the data.
2005 3,500
13. The following table shows the poverty rates and cell phone usage in the United States. Construct
a scatter plot of the data.
2007 12 84.86
2009 12 90.82
14. Does the higher cost of tuition translate into higher-paying jobs? The table lists the top ten
colleges based on mid-career salary and the associated yearly tuition costs. Construct a scatter plot
of the data.
15. A random sample of ten professional athletes produced the following data where $x$ is the
number of endorsements the player has and $y$ is the amount of money made (in millions of dollars).
x    y    x    y
0    2    5    12
3    8    4    9
2    7    3    9
1    3    0    3
5    13   4    10
17. What is the process through which we can calculate a line that goes through a scatter plot
with a linear pattern?
20. A landscaping company is hired to mow the grass for several large properties. The total area of
the properties combined is 1,345 acres. The rate at which one person can mow is
where is the number of hours and represents the number of acres left to mow.
21. The table below contains real data for the first two decades of AIDS reporting.
Pre-1981 91 29
a. Graph “year” versus “# AIDS cases diagnosed” (plot the scatter plot). Do not include pre-1981
data.
b. Calculate the correlation coefficient.
c. Interpret the correlation coefficient.
22. Recently, the annual number of driver deaths per 100,000 for the selected age groups was as
follows:
Age    Number of Driver Deaths per 100,000
17.5   38
22     36
29.5   24
44.5   20
64.5   18
80     28
a. Using “ages” as the independent variable and “Number of driver deaths per 100,000” as the
dependent variable, make a scatter plot of the data.
b. Calculate the least squares (best–fit) line.
c. Interpret the slope of the least squares line.
d. Predict the number of deaths for people aged 40.
e. Find the correlation coefficient.
f. Interpret the correlation coefficient.
g. Find the coefficient of determination.
h. Interpret the coefficient of determination.
i. Find the standard error of the estimate.
j. Interpret the standard error of the estimate.
23. The table below shows the life expectancy for an individual born in the United States in certain
years.
Year of Birth   Life Expectancy
1930            59.7
1940            62.9
1950            70.2
1965            69.7
1973            71.4
1982            74.5
1987            75
1992            75.7
2010            78.7
a. Decide which variable should be the independent variable and which should be the
dependent variable.
b. Draw a scatter plot of the ordered pairs.
c. Find the correlation coefficient.
d. Interpret the correlation coefficient.
e. Find the linear regression equation.
f. Interpret the slope of the linear regression equation.
g. What is the estimated life expectancy for someone born in 1950? Why doesn’t this value
match the life expectancy given in the table for 1950?
h. What is the estimated life expectancy for someone born in 1982?
i. Using the regression equation, find the estimated life expectancy for someone born in 1850.
Is this an accurate estimate for that year? Explain why or why not.
j. Calculate the coefficient of determination.
k. Interpret the coefficient of determination.
l. Calculate the standard error of the estimate.
m. Interpret the standard error of the estimate.
24. The maximum discount value of the Entertainment® card for the “Fine Dining” section, Edition
ten, for various pages is given in the table below.
Page Number   Maximum Value ($)
4             16
14            19
25            15
32            17
43            19
57            15
72            16
85            15
90            17
a. Decide which variable should be the independent variable and which should be the
dependent variable.
b. Draw a scatter plot of the ordered pairs.
c. Find the correlation coefficient.
d. Interpret the correlation coefficient.
e. Find the linear regression equation.
f. Interpret the slope of the linear regression equation.
g. What is the estimated maximum value for restaurants on page 10?
h. What is the estimated maximum value for restaurants on page 70?
i. Using the regression equation, find the estimated maximum value for restaurants on page
200. Is this an accurate estimate for that page? Explain why or why not.
j. Calculate the coefficient of determination.
k. Interpret the coefficient of determination.
l. Calculate the standard error of the estimate.
m. Interpret the standard error of the estimate.
25. The table below gives the gold medal times for every other Summer Olympics for the women’s
100-meter freestyle (swimming).
Year   Time (seconds)
1912   82.2
1924   72.4
1932   66.8
1952   66.8
1960   61.2
1968   60.0
1976   55.65
1984   55.92
1992   54.64
2000   53.8
2008   53.1
a. Decide which variable should be the independent variable and which should be the
dependent variable.
b. Draw a scatter plot of the ordered pairs.
c. Find the correlation coefficient.
d. Interpret the correlation coefficient.
e. Find the linear regression equation.
f. Interpret the slope of the linear regression equation.
g. What is the estimated gold medal time for 1932?
h. What is the estimated gold medal time for 1984?
i. Calculate the coefficient of determination.
j. Interpret the coefficient of determination.
k. Calculate the standard error of the estimate.
l. Interpret the standard error of the estimate.
26. The height (sidewalk to roof) of notable tall buildings in America is compared to the number of
stories of the building (beginning at street level).
Height (in feet)   Stories
1,050              57
428                28
362                26
529                40
790                60
401                22
380                38
1,454              110
1,127              100
700                46
a. Using “stories” as the independent variable and “height” as the dependent variable, draw a
scatter plot of the ordered pairs.
b. Find the correlation coefficient.
c. Interpret the correlation coefficient.
d. Find the linear regression equation.
e. Interpret the slope of the linear regression equation.
f. What is the estimated height for a 32 story building?
g. What is the estimated height for a 94 story building?
h. Using the regression equation, find the estimated height for a 6 story building. Is this an
accurate estimate for the height of a 6 story building? Explain why or why not.
i. Calculate the coefficient of determination.
j. Interpret the coefficient of determination.
k. Calculate the standard error of the estimate.
l. Interpret the standard error of the estimate.
27. The following table shows data on average per capita wine consumption and heart disease rate
in a random sample of 10 countries.
Yearly wine consumption in liters 2.5 3.9 2.9 2.4 2.9 0.8 9.1 2.7 0.8 0.7
Death from heart diseases 221 167 131 191 220 297 71 172 211 300
a. Decide which variable should be the independent variable and which should be the
dependent variable.
b. Draw a scatter plot of the ordered pairs.
c. Find the correlation coefficient.
d. Interpret the correlation coefficient.
e. Find the linear regression equation.
f. Interpret the slope of the linear regression equation.
g. Calculate the coefficient of determination.
h. Interpret the coefficient of determination.
i. Calculate the standard error of the estimate.
j. Interpret the standard error of the estimate.
28. The following table consists of one student athlete’s time (in minutes) to swim 2000 yards and
the student’s heart rate (beats per minute) after swimming on a random sample of 10 days:
Swim Time (minutes)   Heart Rate (beats per minute)
34.12                 144
35.72                 152
34.72                 124
34.05                 140
34.13                 152
35.73                 146
36.17                 128
35.57                 136
35.37                 144
35.57                 148
a. Decide which variable should be the independent variable and which should be the
dependent variable.
b. Draw a scatter plot of the ordered pairs.
c. Find the correlation coefficient.
d. Interpret the correlation coefficient.
e. Find the linear regression equation.
f. Interpret the slope of the linear regression equation.
g. What is the estimated heart rate for a swim time of 34.75 minutes?
h. Calculate the coefficient of determination.
29. The table below gives percent of workers who are paid hourly rates for the years 1979 to 1992.
Year   Percent of Workers Paid Hourly Rates
1979   61.2
1980   60.7
1981   61.3
1982   61.3
1983   61.8
1984   61.7
1985   61.8
1986   62.0
1987   62.7
1990   62.8
1992   62.9
a. Decide which variable should be the independent variable and which should be the
dependent variable.
b. Draw a scatter plot of the ordered pairs.
c. Find the correlation coefficient.
d. Interpret the correlation coefficient.
e. Find the linear regression equation.
f. Interpret the slope of the linear regression equation.
g. What is the estimated percent of workers paid hourly rates in 1988?
h. Calculate the coefficient of determination.
i. Interpret the coefficient of determination.
j. Calculate the standard error of the estimate.
k. Interpret the standard error of the estimate.
30. The table below shows the average heights for American boys in 1990.
Age (years)   Average Height (cm)
birth         50.8
2             83.8
3             91.4
5             106.6
7             119.3
10            137.1
14            157.5
a. Decide which variable should be the independent variable and which should be the
dependent variable.
b. Draw a scatter plot of the ordered pairs.
c. Find the correlation coefficient.
d. Interpret the correlation coefficient.
e. Find the linear regression equation.
f. Interpret the slope of the linear regression equation.
g. What is the estimated average height for a one-year-old?
h. Using the regression equation, find the estimated average height for a 62-year-old man. Do
you think that your answer is reasonable? Explain why or why not.
i. Calculate the coefficient of determination.
j. Interpret the coefficient of determination.
k. Calculate the standard error of the estimate.
l. Interpret the standard error of the estimate.
Attribution
Chapter Outline
Previously, we studied simple linear regression, which allowed us to build a model of the linear
relationship between one independent variable and one dependent variable. Then we could use the
model to make predictions about the value of the dependent variable. For example, a simple linear
regression model can be used to predict a person’s salary (the dependent variable) from the person’s
age (the independent variable).
But what if more than one independent variable impacts the value of the dependent variable?
For example, a person’s salary depends on more factors than just the person’s age. A person’s
salary can also be related to their experience, their education, and their profession. We want to
build a model that allows us to incorporate more than one independent variable. Because more
information can be used in the model, additional independent variables can make regression
models more accurate in predicting the dependent variable. A multiple regression model allows us
to use two or more independent variables to predict one dependent variable.
As we saw with simple linear regression, in addition to building the model, we need ways to
assess how well the multiple regression model fits the data and how well the model predicts
values of the dependent variable.
13.2 MULTIPLE REGRESSION
LEARNING OBJECTIVES
Previously, we learned about simple linear regression, which models the linear relationship
between one independent variable \(x\) and one dependent variable \(y\). The equation for the regression
line is:
\[ \hat{y} = b_0 + b_1 x \]
Multiple regression is an extension of simple linear regression where there is still only one
dependent variable \(y\) but two or more independent variables \(x_1, x_2, \ldots, x_k\). Multiple regression is
motivated by scenarios where many independent variables may be simultaneously connected to a
dependent variable. For example, the price of a product is related to demand for the product, the
time of year, and the competition’s price.
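The equation for the multiple regression model is:
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \]
where \(x_1, x_2, \ldots, x_k\) are the independent variables and \(b_0, b_1, \ldots, b_k\) are the regression
coefficients. We use a regression summary table to generate the values of the regression coefficients. As we will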
see, the regression summary table contains lots of information relating to the multiple regression
model. For now, we will use the regression summary table to find the regression coefficients to
create the multiple regression model. In later sections, we will learn about some of the other
information contained on the regression summary table.
In order to create a regression summary table, we need to use the Analysis ToolPak. Follow these
instructions to add the Analysis ToolPak.
NOTES
1. For the Input X Range, the data for the independent variables must all be together. That is,
the columns (or rows) containing the data for the independent variables must all be
consecutive. If the column (or row) containing data for the dependent variable is in between
two columns (or rows) containing independent variables, copy the dependent variable
column and paste the dependent variable column at the beginning or end of the columns (or
rows) of data. Make sure to delete the original dependent variable column after placing a
copy at the beginning or end of the data.
2. There are several other options available in the Regression input window, such as for confidence
intervals or information about residuals. We will not need any of this other information, so
leave everything else unchecked.
3. This website provides a detailed explanation of the information contained on the regression
summary table.
EXAMPLE
The human resources department at a large company wants to develop a model to predict an
employee’s job satisfaction from the number of hours of unpaid work per week the employee does,
the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken
and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job
satisfaction score is out of 10, with higher values indicating greater job satisfaction. Develop a
multiple regression model to predict the job satisfaction score from the other variables.
Job Satisfaction   Hours of Unpaid Work per Week   Age   Income ($1000s)
4 3 23 60
5 8 32 114
2 9 28 45
6 4 60 187
7 3 62 175
8 1 43 125
7 6 60 93
3 3 37 57
5 2 24 47
5 5 64 128
7 2 28 66
8 1 66 146
5 7 35 89
2 5 37 56
4 0 59 65
6 2 32 95
5 6 76 82
7 5 25 90
9 0 55 137
8 3 34 91
7 5 54 184
9 1 57 60
7 0 68 39
10 2 66 187
5 0 50 49
Solution:
There are three independent variables: hours of unpaid work per week, age, and income ($1000s).
Let \(x_1\) be the hours of unpaid work per week, let \(x_2\) be age, and let \(x_3\) be income ($1000s). The
regression summary table generated by Excel is shown below:
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.711779225
R Square            0.506629665
Adjusted R Square   0.436148189
Standard Error      1.585212784
Observations        25
ANOVA
         df    SS        MS    F    Significance F
Total    24    106.96
                                Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Hours of Unpaid Work per Week   -0.38184722    0.130750479      -2.9204269    0.008177146   -0.65375772   -0.10993671
Income ($1000s)                 0.023250418    0.007610353      3.055103771   0.006012895   0.007423823   0.039077013
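The coefficients for the multiple regression model are in the Coefficients column in the bottom part
of the table. The value of \(b_0\) is in the Intercept row. The value of \(b_1\), the
coefficient for \(x_1\), is in the Hours of Unpaid Work per Week row, so \(b_1 = -0.3818\). The value of
\(b_2\), the coefficient for \(x_2\), is in the Age row, so \(b_2 = 0.0046\). The value of \(b_3\), the
coefficient for \(x_3\), is in the Income ($1000s) row, so \(b_3 = 0.0233\). The multiple regression equation is
\[ \hat{y} = b_0 - 0.3818 x_1 + 0.0046 x_2 + 0.0233 x_3 \]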
NOTES
1. When writing down the multiple regression equation, remember to define what the variables
represent in the context of the question. That is, state what \(\hat{y}\) and the independent variables
represent in relation to the question.
2. A couple of the columns on the right side of the regression summary table generated by Excel
were deleted in order to fit the table onto the page. These columns are not necessary for the
work we will be doing.
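For readers who want to check the Excel output outside a spreadsheet, the following is a minimal sketch that fits the same model by ordinary least squares in Python, using only the standard library. It is an illustration under stated assumptions and not part of the Excel workflow described above: the helper names `solve` and `fit_ols` are hypothetical, and the data rows are transcribed from the table in the example. If the rows are transcribed correctly, the printed coefficients should be close to the values in the Coefficients column of the regression summary table.

```python
# Rows from the example: (job satisfaction, unpaid hours per week, age, income in $1000s)
DATA = [
    (4, 3, 23, 60), (5, 8, 32, 114), (2, 9, 28, 45), (6, 4, 60, 187), (7, 3, 62, 175),
    (8, 1, 43, 125), (7, 6, 60, 93), (3, 3, 37, 57), (5, 2, 24, 47), (5, 5, 64, 128),
    (7, 2, 28, 66), (8, 1, 66, 146), (5, 7, 35, 89), (2, 5, 37, 56), (4, 0, 59, 65),
    (6, 2, 32, 95), (5, 6, 76, 82), (7, 5, 25, 90), (9, 0, 55, 137), (8, 3, 34, 91),
    (7, 5, 54, 184), (9, 1, 57, 60), (7, 0, 68, 39), (10, 2, 66, 187), (5, 0, 50, 49),
]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        partial = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - partial) / m[i][i]
    return x


def fit_ols(rows):
    """Return [b0, b1, ..., bk] that minimize the sum of squared errors."""
    y = [row[0] for row in rows]
    design = [[1.0] + [float(v) for v in row[1:]] for row in rows]  # leading 1 = intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)  # normal equations (X'X) b = X'y


coefs = fit_ols(DATA)
y_hat = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row[1:])) for row in DATA]
residuals = [row[0] - pred for row, pred in zip(DATA, y_hat)]
mean_y = sum(row[0] for row in DATA) / len(DATA)
sst = sum((row[0] - mean_y) ** 2 for row in DATA)  # total variation in y
sse = sum(e ** 2 for e in residuals)               # unexplained variation
r_squared = 1 - sse / sst

print("coefficients (b0, b1, b2, b3):", [round(b, 4) for b in coefs])
print("R Square:", round(r_squared, 4))
```

Solving the normal equations directly is adequate for a small data set like this one; statistical software typically uses more numerically stable methods.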
Watch this video: Basic Excel Business Analytics #50: Introduction to Multiple Regression, Data Analysis Regression by
ExcelIsFun [13:33], available online at https://ecampusontario.pressbooks.pub/introstats/?p=294#oembed-1
Regression Coefficients
Recall that the slope \(b_1\) in the simple linear regression model \(\hat{y} = b_0 + b_1 x\) tells us how the
dependent variable changes for a single unit increase in the independent variable \(x\). In a similar
way, each regression coefficient \(b_i\) represents the change (increase or decrease) in the dependent
variable for a one unit increase in the corresponding independent variable \(x_i\), while all the other
variables are held constant. When interpreting a regression coefficient, it is important to be specific
to the question, using the actual names of the variables and correct units.
EXAMPLE
The human resources department at a large company wants to develop a model to predict an
employee’s job satisfaction from the number of hours of unpaid work per week the employee does,
the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken
and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job
satisfaction score is out of 10, with higher values indicating greater job satisfaction.
Job Satisfaction   Hours of Unpaid Work per Week   Age   Income ($1000s)
4 3 23 60
5 8 32 114
2 9 28 45
6 4 60 187
7 3 62 175
8 1 43 125
7 6 60 93
3 3 37 57
5 2 24 47
5 5 64 128
7 2 28 66
8 1 66 146
5 7 35 89
2 5 37 56
4 0 59 65
6 2 32 95
5 6 76 82
7 5 25 90
9 0 55 137
8 3 34 91
7 5 54 184
9 1 57 60
7 0 68 39
10 2 66 187
5 0 50 49
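Previously, we found the multiple regression equation to predict the job satisfaction score from the
other variables:
\[ \hat{y} = b_0 - 0.3818 x_1 + 0.0046 x_2 + 0.0233 x_3 \]
where \(b_0\) is the value in the Intercept row of the regression summary table, \(x_1\) is the hours of
unpaid work per week, \(x_2\) is age, and \(x_3\) is income ($1000s).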
1. Interpret the regression coefficient for hours of unpaid work per week.
2. Interpret the regression coefficient for age.
3. Interpret the regression coefficient for income.
Solution:
1. \(b_1 = -0.3818\). Interpretation: For a one hour increase in the hours of unpaid work per
week, the job satisfaction score decreases by 0.3818 points, while the other variables are held
constant.
2. \(b_2 = 0.0046\). Interpretation: For a one year increase in the age of the employee, the job
satisfaction score increases by 0.0046 points, while the other variables are held constant.
3. \(b_3 = 0.0233\). Interpretation: For a $1000 increase in income, the job satisfaction score
increases by 0.0233 points, while the other variables are held constant.
NOTES
1. Remember to include “while the other variables are held constant” with the interpretation of
each regression coefficient. We can only talk about how the change in one independent
variable affects the dependent variable, so the values of the other variables must be kept fixed.
2. When writing down the interpretation of each regression coefficient, remember to be specific
to the question using the actual names of the independent and dependent variables and
appropriate units.
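3. The change described by each regression coefficient is measured in the units of the dependent
variable (here, points on the job satisfaction score) per one unit increase in the corresponding
independent variable.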
4. Income is measured in $1000s, so a one unit increase in income actually corresponds to a
$1000 increase in income.
As with simple linear regression, a multiple regression model can be used to make predictions
about the dependent variable from specific values of the independent variables. To make a
prediction, substitute the corresponding values of the independent variables into the multiple
regression equation and calculate the value of \(\hat{y}\). Watch out for the units of measurement
for each variable when using the multiple regression equation: the value substituted for each
independent variable must be in the same units as that independent variable in the sample data.
EXAMPLE
The human resources department at a large company wants to develop a model to predict an
employee’s job satisfaction from the number of hours of unpaid work per week the employee does,
the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken
and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job
satisfaction score is out of 10, with higher values indicating greater job satisfaction.
Job Satisfaction   Hours of Unpaid Work per Week   Age   Income ($1000s)
4 3 23 60
5 8 32 114
2 9 28 45
6 4 60 187
7 3 62 175
8 1 43 125
7 6 60 93
3 3 37 57
5 2 24 47
5 5 64 128
7 2 28 66
8 1 66 146
5 7 35 89
2 5 37 56
4 0 59 65
6 2 32 95
5 6 76 82
7 5 25 90
9 0 55 137
8 3 34 91
7 5 54 184
9 1 57 60
7 0 68 39
10 2 66 187
5 0 50 49
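Previously, we found the multiple regression equation to predict the job satisfaction score from the
other variables:
\[ \hat{y} = b_0 - 0.3818 x_1 + 0.0046 x_2 + 0.0233 x_3 \]
where \(b_0\) is the value in the Intercept row of the regression summary table, \(x_1\) is the hours of
unpaid work per week, \(x_2\) is age, and \(x_3\) is income ($1000s).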
Predict the job satisfaction score for a 40-year-old employee who works two hours of unpaid work per
week and has an income of $75,000.
Solution:
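The values of the independent variables we need to enter into the multiple regression model are
\(x_1 = 2\), \(x_2 = 40\), and \(x_3 = 75\) (income is entered in $1000s):
\[ \hat{y} = b_0 - 0.3818(2) + 0.0046(40) + 0.0233(75) = 5.96 \]
The predicted job satisfaction score for a 40-year-old employee who works two hours of unpaid work
per week and has an income of $75,000 is 5.96.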
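The substitution step, including the unit conversion for income, can be sketched in Python as follows. This is only an illustration: the function names `slope_contribution` and `predict` are hypothetical, the coefficients are the rounded values quoted in the text, and the intercept is left as a parameter that must be read from the Intercept row of the regression summary table.

```python
# Rounded coefficients quoted in the text for x1 (hours), x2 (age), x3 (income in $1000s).
B_HOURS, B_AGE, B_INCOME = -0.3818, 0.0046, 0.0233


def slope_contribution(hours_unpaid, age, income_dollars):
    """Sum of b1*x1 + b2*x2 + b3*x3, converting income from dollars to $1000s."""
    income_thousands = income_dollars / 1000  # the model was built with income in $1000s
    return B_HOURS * hours_unpaid + B_AGE * age + B_INCOME * income_thousands


def predict(intercept, hours_unpaid, age, income_dollars):
    """Predicted job satisfaction score for a given intercept b0 (an input, not assumed here)."""
    return intercept + slope_contribution(hours_unpaid, age, income_dollars)


contribution = slope_contribution(hours_unpaid=2, age=40, income_dollars=75_000)
print(round(contribution, 4))  # 1.1679
```

Entering 75000 instead of 75 for income would inflate the income term by a factor of 1000, which is why the conversion is done inside the function.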
NOTES
The multiple regression model given above is the model we create from sample data: a sample
is taken from the population and the sample data is used to find the regression coefficients in the
model. So the regression coefficients, \(b_0, b_1, \ldots, b_k\), are estimates of the corresponding population
parameters for the regression coefficients, \(\beta_0, \beta_1, \ldots, \beta_k\).
Concept Review
In multiple regression, two or more independent variables are used to predict one dependent
variable. We can find the values of the regression coefficients for the multiple regression model by
generating a regression summary table in Excel. Each regression coefficient represents the change
in the dependent variable for a single unit increase in the corresponding independent variable,
while the other variables are held fixed. Certain assumptions about the errors in a multiple
regression model are necessary in order to test the validity of the model.
Attribution
LEARNING OBJECTIVES
• Calculate and interpret the standard error of the estimate for multiple regression.
The difference between the actual value of the dependent variable (in the sample data) and the
predicted value of the dependent variable obtained from the multiple regression model is called
the error or residual, \(y - \hat{y}\).
For the simple linear regression model, the standard error of the estimate measures the average
vertical distance (the error) between the points on the scatter diagram and the regression line.
848 | 13.3 STANDARD ERROR OF THE ESTIMATE
The standard error of the estimate, denoted \(s_e\), is the standard deviation of the errors
(residuals) in a regression model, that is, of the differences between the \(y\)-values in the sample
and the \(\hat{y}\)-values predicted by the multiple regression model.
The value of \(s_e\) tells us, on average, how much the dependent variable differs from the regression
model based on the independent variables. When interpreting the standard error of the estimate,
remember to be specific to the question, using the actual names of the dependent and independent
variables, and include appropriate units. The units of the standard error of the estimate are the
same as the units of the dependent variable.
The value of the standard error of the estimate for the regression model can be found on the
regression summary table, which we learned how to generate in Excel in the previous section.
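As a cross-check on the value Excel reports, the standard error of the estimate can be recomputed from other entries in the regression summary table. The sketch below assumes the standard relationships \(s_e = \sqrt{SSE/(n-k-1)}\) and \(SSE = (1 - R^2) \times SST\), which are not spelled out in this section; the function name is hypothetical, and the inputs are the Total SS, R Square, and Observations values from the summary table for the job satisfaction example.

```python
import math


def standard_error_of_estimate(sse, n, k):
    """Square root of the mean squared error, with n observations and k independent variables."""
    return math.sqrt(sse / (n - k - 1))


sst = 106.96             # Total SS from the ANOVA part of the summary table
r_squared = 0.506629665  # R Square from the Regression Statistics part
n, k = 25, 3             # observations and independent variables

sse = (1 - r_squared) * sst  # unexplained variation
s_e = standard_error_of_estimate(sse, n, k)
print(round(s_e, 4))  # 1.5852
```

The result agrees with the Standard Error row under the Regression Statistics heading.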
EXAMPLE
The human resources department at a large company wants to develop a model to predict an
employee’s job satisfaction from the number of hours of unpaid work per week the employee does,
the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken
and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job
satisfaction score is out of 10, with higher values indicating greater job satisfaction.
Job Satisfaction   Hours of Unpaid Work per Week   Age   Income ($1000s)
4 3 23 60
5 8 32 114
2 9 28 45
6 4 60 187
7 3 62 175
8 1 43 125
7 6 60 93
3 3 37 57
5 2 24 47
5 5 64 128
7 2 28 66
8 1 66 146
5 7 35 89
2 5 37 56
4 0 59 65
6 2 32 95
5 6 76 82
7 5 25 90
9 0 55 137
8 3 34 91
7 5 54 184
9 1 57 60
7 0 68 39
10 2 66 187
5 0 50 49
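Previously, we found the multiple regression equation to predict the job satisfaction score from the
other variables:
\[ \hat{y} = b_0 - 0.3818 x_1 + 0.0046 x_2 + 0.0233 x_3 \]
where \(b_0\) is the value in the Intercept row of the regression summary table, \(x_1\) is the hours of
unpaid work per week, \(x_2\) is age, and \(x_3\) is income ($1000s).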
Solution:
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.711779225
R Square            0.506629665
Adjusted R Square   0.436148189
Standard Error      1.585212784
Observations        25
ANOVA
         df    SS        MS    F    Significance F
Total    24    106.96
                                Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Hours of Unpaid Work per Week   -0.38184722    0.130750479      -2.9204269    0.008177146   -0.65375772   -0.10993671
Income ($1000s)                 0.023250418    0.007610353      3.055103771   0.006012895   0.007423823   0.039077013
1. The standard error of the estimate for the regression model is in the top part of the table,
under the Regression Statistics heading in the Standard Error row. The value of the
standard error of the estimate is \(s_e = 1.5852\).
2. On average, the job satisfaction score is 1.5852 points away from the regression model based
on the independent variables “hours of unpaid work per week,” “age,” and “income.”
NOTE
The standard error of the estimate for the regression model is located in the top part of the table
under the Regression Statistics heading. You will notice another standard error column at the
bottom in the rows corresponding to the independent variables. These standard errors in the
bottom part of the table are not related to the standard error of the estimate. In fact, the standard
errors in the independent variable rows are measures of the uncertainty around the estimate of the
regression coefficient for each independent variable.
Concept Review
The standard error of the estimate, \(s_e\), measures the average deviation of the errors of the regression
model. The smaller the value of the standard error of the estimate, the better the fit of the
regression model to the data.
13.4 COEFFICIENT OF MULTIPLE
DETERMINATION
LEARNING OBJECTIVES
Previously, we learned about the coefficient of determination for simple linear regression,
which is the proportion of variation in the dependent variable that can be explained by the simple
linear regression model based on the independent variable. The coefficient of determination is a
good way to measure how well the simple linear regression model fits the data.
EXAMPLE
The human resources department at a large company wants to develop a model to predict an
employee’s job satisfaction from the number of hours of unpaid work per week the employee does,
the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken
and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job
satisfaction score is out of 10, with higher values indicating greater job satisfaction.
Job Satisfaction   Hours of Unpaid Work per Week   Age   Income ($1000s)
4 3 23 60
5 8 32 114
2 9 28 45
6 4 60 187
7 3 62 175
8 1 43 125
7 6 60 93
3 3 37 57
5 2 24 47
5 5 64 128
7 2 28 66
8 1 66 146
5 7 35 89
2 5 37 56
4 0 59 65
6 2 32 95
5 6 76 82
7 5 25 90
9 0 55 137
8 3 34 91
7 5 54 184
9 1 57 60
7 0 68 39
10 2 66 187
5 0 50 49
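Previously, we found the multiple regression equation to predict the job satisfaction score from the
other variables:
\[ \hat{y} = b_0 - 0.3818 x_1 + 0.0046 x_2 + 0.0233 x_3 \]
where \(b_0\) is the value in the Intercept row of the regression summary table, \(x_1\) is the hours of
unpaid work per week, \(x_2\) is age, and \(x_3\) is income ($1000s).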
Solution:
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.711779225
R Square            0.506629665
Adjusted R Square   0.436148189
Standard Error      1.585212784
Observations        25
ANOVA
         df    SS        MS    F    Significance F
Total    24    106.96
                                Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Hours of Unpaid Work per Week   -0.38184722    0.130750479      -2.9204269    0.008177146   -0.65375772   -0.10993671
Income ($1000s)                 0.023250418    0.007610353      3.055103771   0.006012895   0.007423823   0.039077013
1. The coefficient of multiple determination for the regression model is in the top part of the
table, under the Regression Statistics heading in the R Square row. The value of the
coefficient of multiple determination is \(R^2 = 0.5066\).
2. 50.66% of the variation in the job satisfaction score can be explained by the regression model
based on the independent variables “hours of unpaid work per week,” “age,” and “income.”
The value of the coefficient of multiple determination never decreases, and almost always increases,
as more independent variables are added to the model, even if the new independent variable has no
relationship with the dependent variable. As a result, the coefficient of multiple determination is
inflated when additional independent variables add no significant information about the dependent
variable, and it overestimates the contribution of the independent variables as new independent
variables are added to the model.
Instead, we use the adjusted coefficient of multiple determination, denoted adjusted \(R^2\),
which corrects the overestimation of the coefficient of multiple determination when new
independent variables are added to the model. The adjusted coefficient of multiple determination
is interpreted in the same way as the coefficient of multiple determination. It adjusts the value
of \(R^2\) to account for the number of independent variables in the model in order to avoid
overestimating the impact of adding independent variables to the model.
The adjusted coefficient of multiple determination is calculated from the value of \(R^2\):
\[ \mbox{adjusted } R^2 = 1-\left( \frac{(n-1) \times (1-R^2)}{n-k-1}\right) \]
where \(n\) is the number of observations and \(k\) is the number of independent variables. Although
we can find the value of the adjusted coefficient of multiple determination using the above formula,
the value of the adjusted coefficient of multiple determination is also found on the regression
summary table.
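The formula above can be checked against the regression summary table with a few lines of Python. This is a minimal sketch: the function name `adjusted_r_squared` is hypothetical, and the inputs are the R Square and Observations values from the job satisfaction example with \(k = 3\) independent variables.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - ((n - 1) * (1 - r_squared)) / (n - k - 1)


adj = adjusted_r_squared(r_squared=0.506629665, n=25, k=3)
print(round(adj, 4))  # 0.4361
```

The result matches the Adjusted R Square row of the summary table. Because \((n-1)/(n-k-1)\) is greater than 1 whenever \(k \geq 1\), the adjusted value is smaller than \(R^2\) whenever \(R^2\) is less than 1.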
EXAMPLE
The human resources department at a large company wants to develop a model to predict an
employee’s job satisfaction from the number of hours of unpaid work per week the employee does,
the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken
and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job
satisfaction score is out of 10, with higher values indicating greater job satisfaction.
Job Satisfaction   Hours of Unpaid Work per Week   Age   Income ($1000s)
4 3 23 60
5 8 32 114
2 9 28 45
6 4 60 187
7 3 62 175
8 1 43 125
7 6 60 93
3 3 37 57
5 2 24 47
5 5 64 128
7 2 28 66
8 1 66 146
5 7 35 89
2 5 37 56
4 0 59 65
6 2 32 95
5 6 76 82
7 5 25 90
9 0 55 137
8 3 34 91
7 5 54 184
9 1 57 60
7 0 68 39
10 2 66 187
5 0 50 49
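Previously, we found the multiple regression equation to predict the job satisfaction score from the
other variables:
\[ \hat{y} = b_0 - 0.3818 x_1 + 0.0046 x_2 + 0.0233 x_3 \]
where \(b_0\) is the value in the Intercept row of the regression summary table, \(x_1\) is the hours of
unpaid work per week, \(x_2\) is age, and \(x_3\) is income ($1000s).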
Solution:
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.711779225
R Square            0.506629665
Adjusted R Square   0.436148189
Standard Error      1.585212784
Observations        25
ANOVA
         df    SS        MS    F    Significance F
Total    24    106.96
                                Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Hours of Unpaid Work per Week   -0.38184722    0.130750479      -2.9204269    0.008177146   -0.65375772   -0.10993671
Income ($1000s)                 0.023250418    0.007610353      3.055103771   0.006012895   0.007423823   0.039077013
1. The adjusted coefficient of multiple determination for the regression model is in the top part
of the table, under the Regression Statistics heading in the Adjusted R Square row. The
value of the adjusted coefficient of multiple determination is 0.4361.
2. 43.61% of the variation in the job satisfaction score can be explained by the regression model
based on the independent variables “hours of unpaid work per week,” “age,” and “income.”
If the addition of a new independent variable increases the value of the adjusted coefficient
of multiple determination, then it is an indication that the regression model has improved as
a result of adding the new independent variable. But, if the addition of a new independent
variable decreases the value of the adjusted coefficient of multiple determination, then the added
independent variable has not improved the overall regression model. In such cases, the new
independent variable should not be added to the model.
Concept Review
LEARNING OBJECTIVES
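Previously, we learned that the population model for the multiple regression equation is
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon \]
where \(\epsilon\) is the error variable.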
Because we do not have the population data, we cannot verify that these conditions are met. We
need to assume that the regression model has these properties in order to conduct hypothesis tests
on the model.
864 | 13.5 TESTING THE SIGNIFICANCE OF THE OVERALL MODEL
We want to test if there is a relationship between the dependent variable and the set of independent
variables. In other words, we want to determine if the regression model is valid or invalid.
• Invalid Model. There is no relationship between the dependent variable and the set of
independent variables. In this case, all of the regression coefficients \(\beta_i\) in the population
model are zero. This is the claim for the null hypothesis in the overall model test:
\(H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0\).
• Valid Model. There is a relationship between the dependent variable and the set of
independent variables. In this case, at least one of the regression coefficients \(\beta_i\) in the
population model is not zero. This is the claim for the alternative hypothesis in the overall
model test: \(H_a: \mbox{at least one } \beta_i \mbox{ is not 0}\).
The overall model test procedure compares the means of explained and unexplained variation in
the model in order to determine if the explained variation (caused by the relationship between
the dependent variable and the set of independent variables) in the model is larger than the
unexplained variation (represented by the error variable \(\epsilon\)). If the explained variation is larger than
the unexplained variation, then there is a relationship between the dependent variable and the set
of independent variables, and the model is valid. Otherwise, there is no relationship between the
dependent variable and the set of independent variables, and the model is invalid.
The logic behind the overall model test is based on two independent estimates of the variance of
the errors:
• One estimate of the variance of the errors, \(MSR\), is based on the mean amount of explained
variation in the dependent variable \(y\).
• One estimate of the variance of the errors, \(MSE\), is based on the mean amount of
unexplained variation in the dependent variable \(y\).
The overall model test compares these two estimates of the variance of the errors to determine
if there is a relationship between the dependent variable and the set of independent variables.
Because the overall model test involves the comparison of two estimates of variance, an
\(F\)-distribution is used to conduct the overall model test, where the test statistic is the ratio of the two
estimates of the variance of the errors.
The mean square due to regression, \(MSR\), is one of the estimates of the variance of the
errors. The \(MSR\) is the estimate of the variance of the errors determined by the variance of the
predicted \(\hat{y}\)-values from the regression model and the mean of the \(y\)-values in the sample, \(\overline{y}\). If
there is no relationship between the dependent variable and the set of independent variables, then
the \(MSR\) provides an unbiased estimate of the variance of the errors. If there is a relationship
between the dependent variable and the set of independent variables, then the \(MSR\) provides an
overestimate of the variance of the errors.
\begin{eqnarray*}
SSR & = & \sum \left(\hat{y}-\overline{y}\right)^2 \\
MSR & = & \frac{SSR}{k}
\end{eqnarray*}
The mean square due to error, \(MSE\), is the other estimate of the variance of the errors. The
\(MSE\) is the estimate of the variance of the errors determined by the error in using the
regression model to predict the values of the dependent variable in the sample. The \(MSE\) always
provides an unbiased estimate of the variance of errors, regardless of whether or not there is a
relationship between the dependent variable and the set of independent variables.
\begin{eqnarray*}
SSE & = & \sum \left(y-\hat{y}\right)^2 \\
MSE & = & \frac{SSE}{n-k-1}
\end{eqnarray*}
The overall model test depends on the fact that the \(MSR\) is influenced by the explained
variation in the dependent variable, which results in the \(MSR\) being either an unbiased estimate or an
overestimate of the variance of the errors. Because the \(MSE\) is based on the unexplained variation
in the dependent variable, the \(MSE\) is not affected by the relationship between the dependent
variable and the set of independent variables, and is always an unbiased estimate of the variance of
the errors.
The null hypothesis in the overall model test is that there is no relationship between the
dependent variable and the set of independent variables. The alternative hypothesis is that there is
a relationship between the dependent variable and the set of independent variables. The \(F\)-score
for the overall model test is the ratio of the two estimates of the variance of the errors,
\(F = \frac{MSR}{MSE}\), with \(df_1 = k\) and \(df_2 = n-k-1\). The p-value for the test is the area in the
right tail of the \(F\)-distribution to the right of the \(F\)-score.
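The calculation of the \(F\)-score from the two mean squares can be sketched in Python. This is an illustration under stated assumptions: the function name `overall_f_score` is hypothetical, and because the Regression and Residual sums of squares are not reproduced here, the sketch recovers them from the Total SS and R Square values of the job satisfaction example using the standard relationships \(SSR = R^2 \times SST\) and \(SSE = SST - SSR\).

```python
def overall_f_score(ssr, sse, n, k):
    """Ratio MSR / MSE for n observations and k independent variables."""
    msr = ssr / k            # mean square due to regression
    mse = sse / (n - k - 1)  # mean square due to error
    return msr / mse


sst = 106.96             # Total SS from the summary table
r_squared = 0.506629665  # R Square from the summary table
n, k = 25, 3

ssr = r_squared * sst    # explained variation
sse = sst - ssr          # unexplained variation
f_score = overall_f_score(ssr, sse, n, k)
df1, df2 = k, n - k - 1

print(round(f_score, 2), df1, df2)  # 7.19 3 21
```

An \(F\)-score well above 1, as here, indicates that the explained variation is large relative to the unexplained variation; the p-value is then the area to the right of this score under the \(F\)-distribution with 3 and 21 degrees of freedom.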
NOTES
1. If there is no relationship between the dependent variable and the set of independent
variables, both the \(MSR\) and the \(MSE\) are unbiased estimates of the variance of the
errors. In this case, the \(MSR\) and the \(MSE\) are close in value, which results in an
\(F\)-score close to 1 and a large p-value. The conclusion of the test would be to not reject the
null hypothesis.
2. If there is a relationship between the dependent variable and the set of independent
variables, the \(MSR\) is an overestimate of the variance of the errors. In this case, the
\(MSR\) is significantly larger than the \(MSE\), which results in a large \(F\)-score and a
small p-value. The conclusion of the test would be to reject the null hypothesis in favour of the
alternative hypothesis.
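1. Write down the null hypothesis that there is no relationship between the dependent variable
and the set of independent variables:
\[ H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \]
2. Write down the alternative hypothesis that there is a relationship between the dependent
variable and the set of independent variables:
\[ H_a: \mbox{at least one } \beta_i \mbox{ is not 0} \]
3. Collect the sample information for the test and identify the significance level \(\alpha\).
4. The p-value is the area in the right tail of the \(F\)-distribution. The \(F\)-score and degrees of
freedom are
\[ F = \frac{MSR}{MSE}, \qquad df_1 = k, \qquad df_2 = n-k-1 \]
5. Compare the p-value to the significance level and state the outcome of the test.
6. Write down a concluding sentence specific to the context of the question.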
The calculation of the \(MSR\), the \(MSE\), and the \(F\)-score for the overall model test can be time
consuming, even with the help of software like Excel. However, the required \(F\)-score and p-value
for the test can be found on the regression summary table, which we learned how to generate in
Excel in a previous section.
EXAMPLE
The human resources department at a large company wants to develop a model to predict an
employee’s job satisfaction from the number of hours of unpaid work per week the employee does,
the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken
and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job
satisfaction score is out of 10, with higher values indicating greater job satisfaction.
Job Satisfaction   Hours of Unpaid Work per Week   Age   Income ($1000s)
4 3 23 60
5 8 32 114
2 9 28 45
6 4 60 187
7 3 62 175
8 1 43 125
7 6 60 93
3 3 37 57
5 2 24 47
5 5 64 128
7 2 28 66
8 1 66 146
5 7 35 89
2 5 37 56
4 0 59 65
6 2 32 95
5 6 76 82
7 5 25 90
9 0 55 137
8 3 34 91
7 5 54 184
9 1 57 60
7 0 68 39
10 2 66 187
5 0 50 49
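Previously, we found the multiple regression equation to predict the job satisfaction score from the
other variables:
\[ \hat{y} = b_0 - 0.3818 x_1 + 0.0046 x_2 + 0.0233 x_3 \]
where \(b_0\) is the value in the Intercept row of the regression summary table, \(x_1\) is the hours of
unpaid work per week, \(x_2\) is age, and \(x_3\) is income ($1000s).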
At the 5% significance level, test the validity of the overall model to predict the job satisfaction score.
Solution:
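Hypotheses:
\[ H_0: \beta_1 = \beta_2 = \beta_3 = 0 \]
\[ H_a: \mbox{at least one } \beta_i \mbox{ is not 0} \]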
p-value:
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.711779225
R Square            0.506629665
Adjusted R Square   0.436148189
Standard Error      1.585212784
Observations        25
ANOVA
         df    SS        MS    F    Significance F
Total    24    106.96
                                Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Hours of Unpaid Work per Week   -0.38184722    0.130750479      -2.9204269    0.008177146   -0.65375772   -0.10993671
Income ($1000s)                 0.023250418    0.007610353      3.055103771   0.006012895   0.007423823   0.039077013
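The p-value for the overall model test is in the middle part of the table under the ANOVA heading in
the Significance F column of the Regression row. So the p-value = 0.0017.
Conclusion:
Because the p-value of 0.0017 is less than the significance level of 0.05, we reject the null hypothesis
in favour of the alternative hypothesis. At the 5% significance level, there is enough evidence to suggest
that there is a relationship between the dependent variable “job satisfaction” and the set of independent
variables “hours of unpaid work per week,” “age,” and “income.”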
NOTES
1. The null hypothesis \(H_0: \beta_1 = \beta_2 = \beta_3 = 0\) is the claim that all of the regression coefficients
are zero. That is, the null hypothesis is the claim that there is no relationship between the
dependent variable and the set of independent variables, which means that the model is not
valid.
2. The alternative hypothesis is the claim that at least one of the regression coefficients is not
zero. The alternative hypothesis is the claim that at least one of the independent variables is
linearly related to the dependent variable, which means that the model is valid. The
alternative hypothesis does not say that all of the regression coefficients are not zero, only
that at least one of them is not zero. The alternative hypothesis does not tell us which
independent variables are related to the dependent variable.
3. The p-value for the overall model test is located in the middle part of the
table under the Significance F column heading in the Regression row
(right underneath the ANOVA heading). You will notice a p-value column
heading at the bottom of the table in the rows corresponding to the
independent variables. These p-values in the bottom part of the table are
not related to the overall model test we are conducting here. These p-values
in the independent variable rows are the p-values we will need when we
conduct tests on the individual regression coefficients in the next section.
4. The p-value of 0.0017 is a small probability compared to the significance level, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis. In other words, at least one
of the regression coefficients is not zero and at least one independent variable is linearly
related to the dependent variable.
Watch this video: Basic Excel Business Analytics #51: Testing Significance of Regression Relationship with p-value by
ExcelIsFun [20:44]
Concept Review
The overall model test determines if there is a relationship between the dependent variable and the
set of independent variables. The test compares two estimates of the variance of the errors ($MSR$
and $MSE$). The ratio of these two estimates of the variance of the errors is the $F$-score from an
$F$-distribution with $df_1=k$ and $df_2=n-k-1$, where $n$ is the sample size and $k$ is the number
of independent variables. The p-value for the test is the area in the right
tail of the $F$-distribution. The p-value can be found on the regression summary table generated by
Excel.
The overall model hypothesis test is a well-established process:
1. Write down the null and alternative hypotheses in terms of the regression coefficients. The
null hypothesis is the claim that there is no relationship between the dependent variable and
the set of independent variables. The alternative hypothesis is the claim that there is a
relationship between the dependent variable and the set of independent variables.
2. Collect the sample information for the test and identify the significance level.
3. The p-value is the area in the right tail of the $F$-distribution. Use the regression summary
table generated by Excel to find the p-value.
4. Compare the p-value to the significance level and state the outcome of the test.
5. Write down a concluding sentence specific to the context of the question.
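The relationship between the $F$-score and its right-tail p-value can be sketched outside Excel. The Python sketch below is an illustration only and is not part of the Excel workflow in this book. It assumes the standard identity $F=\frac{R^2/k}{(1-R^2)/(n-k-1)}$, which is equivalent to $MSR/MSE$, and it uses the R Square and Observations values from the job satisfaction summary output. The function names and the Simpson's rule integration are choices made for this sketch.

```python
import math

def regularized_incomplete_beta(x, a, b, steps=2000):
    """Approximate I_x(a, b) with Simpson's rule.

    Sketch-level helper: assumes a >= 1, 0 <= x < 1, and an even number of steps.
    """
    if x <= 0.0:
        return 0.0
    h = x / steps

    def integrand(t):
        return t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)

    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (total * h / 3.0) / math.exp(log_beta)

def overall_f_test(r_squared, n, k):
    """Return the F-score and the right-tail p-value for the overall model test."""
    df1, df2 = k, n - k - 1
    f_score = (r_squared / df1) / ((1.0 - r_squared) / df2)
    # Right-tail area of the F-distribution written as an incomplete beta integral.
    x = df2 / (df2 + df1 * f_score)
    p_value = regularized_incomplete_beta(x, df2 / 2.0, df1 / 2.0)
    return f_score, p_value

# R Square and Observations from the job satisfaction summary output; k = 3 predictors.
f_score, p_value = overall_f_test(r_squared=0.506629665, n=25, k=3)
print(f"F = {f_score:.4f}, p-value = {p_value:.4f}")
```

The printed p-value should agree, after rounding, with the Significance F entry that Excel reports for the overall model test.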
13.6 TESTING THE REGRESSION COEFFICIENTS
LEARNING OBJECTIVES
Previously, we learned that the population model for the multiple regression equation is
\begin{eqnarray*} y & = & \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k+\epsilon \end{eqnarray*}
For an individual regression coefficient $\beta_i$, we want to test if there is a relationship between the
dependent variable $y$ and the independent variable $x_i$.
In order to conduct a hypothesis test on an individual regression coefficient $\beta_i$, we need to use the
distribution of the sample regression coefficient $b_i$:
• The mean of the distribution of the sample regression coefficient is the population regression
coefficient $\beta_i$.
• The standard deviation of the distribution of the sample regression coefficient is $\sigma_{b_i}$. Because
we do not know the population standard deviation, we must estimate $\sigma_{b_i}$ with the sample
standard deviation $s_{b_i}$.
• The distribution of the sample regression coefficient follows a normal distribution.
Because we are using a sample standard deviation to estimate a population standard deviation in a
normal distribution, we need to use a $t$-distribution with $n-k-1$ degrees of freedom to find the
p-value for the test on an individual regression coefficient, where $n$ is the sample size and $k$ is the
number of independent variables. The $t$-score for the test is $t=\frac{b_i-\beta_i}{s_{b_i}}$, which
reduces to $t=\frac{b_i}{s_{b_i}}$ under the null hypothesis $\beta_i=0$.
1. Write down the null hypothesis that there is no relationship between the dependent variable
and the independent variable $x_i$:
\begin{eqnarray*} H_0: & & \beta_i=0 \end{eqnarray*}
The required $t$-score and p-value for the test can be found on the regression summary table, which
we learned how to generate in Excel in a previous section.
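For readers who want to see where the t Stat and P-value entries come from, the following Python sketch computes a $t$-score as the sample coefficient divided by its standard error and then finds the two-tailed area of a $t$-distribution with $n-k-1$ degrees of freedom. It is a sketch under stated assumptions, not the book's method (the book reads both values from Excel). The function names are hypothetical, the tail area is approximated with Simpson's rule, and the inputs are the Hours of Unpaid Work per Week coefficient and standard error from the summary output in the example that follows.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p_value(t_score, df, steps=2000):
    """Sum of the areas in both tails beyond |t| (Simpson's rule, even steps)."""
    upper = abs(t_score)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2.0 * (total * h / 3.0)  # area between -|t| and +|t|
    return 1.0 - central_area

def coefficient_t_test(b, s_b, n, k):
    """t-score and two-tailed p-value for H0: beta_i = 0."""
    df = n - k - 1
    t_score = b / s_b
    return t_score, two_tailed_p_value(t_score, df)

# Coefficient and standard error for hours of unpaid work; n = 25 employees, k = 3 predictors.
t_score, p_value = coefficient_t_test(b=-0.38184722, s_b=0.130750479, n=25, k=3)
print(f"t = {t_score:.4f}, p-value = {p_value:.6f}")
```

The two printed values can be compared against the t Stat and P-value columns of the corresponding row in the Excel output.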
EXAMPLE
The human resources department at a large company wants to develop a model to predict an
employee’s job satisfaction from the number of hours of unpaid work per week the employee does,
the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken
and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job
satisfaction score is out of 10, with higher values indicating greater job satisfaction.
Job Satisfaction   Hours of Unpaid Work per Week   Age   Income ($1000s)
4                  3                               23    60
5                  8                               32    114
2                  9                               28    45
6                  4                               60    187
7                  3                               62    175
8                  1                               43    125
7                  6                               60    93
3                  3                               37    57
5                  2                               24    47
5                  5                               64    128
7                  2                               28    66
8                  1                               66    146
5                  7                               35    89
2                  5                               37    56
4                  0                               59    65
6                  2                               32    95
5                  6                               76    82
7                  5                               25    90
9                  0                               55    137
8                  3                               34    91
7                  5                               54    184
9                  1                               57    60
7                  0                               68    39
10                 2                               66    187
5                  0                               50    49
Previously, we found the multiple regression equation to predict the job satisfaction score from the
other variables:
At the 5% significance level, test the relationship between the dependent variable “job satisfaction”
and the independent variable “hours of unpaid work per week”.
Solution:
Hypotheses:
\begin{eqnarray*} H_0: & & \beta_1=0 \\ H_a: & & \beta_1 \neq 0
\end{eqnarray*}
p-value:
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.711779225
R Square            0.506629665
Adjusted R Square   0.436148189
Standard Error      1.585212784
Observations        25

ANOVA
        df   SS       MS   F   Significance F
Total   24   106.96

                                Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Hours of Unpaid Work per Week   -0.38184722    0.130750479      -2.9204269    0.008177146   -0.65375772   -0.10993671
Income ($1000s)                 0.023250418    0.007610353      3.055103771   0.006012895   0.007423823   0.039077013
The p-value for the test on the hours of unpaid work per week regression coefficient is in the bottom
part of the table under the P-value column of the Hours of Unpaid Work per Week row. So
the p-value is 0.0082.
Conclusion:
Because the p-value of 0.0082 is less than the significance level of 0.05, we reject the null
hypothesis in favour of the alternative
hypothesis. At the 5% significance level there is enough evidence to suggest that there is a
relationship between the dependent variable “job satisfaction” and the independent variable “hours
of unpaid work per week.”
NOTES
1. The null hypothesis $\beta_1=0$ is the claim that the regression coefficient for the independent
variable $x_1$ is zero. That is, the null hypothesis is the claim that there is no relationship
between the dependent variable and the independent variable “hours of unpaid work per
week.”
2. The alternative hypothesis $\beta_1 \neq 0$ is the claim that the regression coefficient for the independent
variable $x_1$ is not zero. The alternative hypothesis is the claim that there is a relationship
between the dependent variable and the independent variable “hours of unpaid work per
week.”
3. When conducting a test on a regression coefficient, make sure to use the correct subscript on
$\beta$ to correspond to how the independent variables were defined in the regression model and
which independent variable is being tested. Here the subscript on $\beta$ is 1 because “hours
of unpaid work per week” is defined as $x_1$ in the regression model.
4. The p-values for the tests on the regression coefficients are located in the
bottom part of the table under the P-value column heading in the
corresponding independent variable row.
5. Because the alternative hypothesis is a $\neq$, the p-value is the sum of the areas in the two tails of the
$t$-distribution. This is the value calculated by Excel in the regression summary table.
6. The p-value of 0.0082 is a small probability compared to the significance level, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis. In other words, the
regression coefficient is not zero, and so there is a relationship between the dependent
variable “job satisfaction” and the independent variable “hours of unpaid work per week.”
This means that the independent variable “hours of unpaid work per week” is useful in
predicting the dependent variable.
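The summary output also reports Lower 95% and Upper 95% columns, which the notes above do not use. As a side check, and only as a sketch, the Python code below rebuilds that interval for the hours of unpaid work coefficient as $b_1 \pm t^{*} s_{b_1}$, where $t^{*}$ is the value that leaves 2.5% in each tail of a $t$-distribution with $n-k-1=21$ degrees of freedom. The helper names, the Simpson's rule integration, and the bisection search for $t^{*}$ are assumptions of this sketch; Excel computes these values internally.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def central_area(t, df, steps=2000):
    """Area under the t density between -t and +t (Simpson's rule, even steps)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2.0 * total * h / 3.0

def critical_t(confidence, df):
    """Bisection search for t* whose central area equals the confidence level."""
    low, high = 0.0, 20.0
    for _ in range(60):
        mid = (low + high) / 2.0
        if central_area(mid, df) < confidence:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

# Coefficient and standard error for hours of unpaid work, from the summary output.
b_1, s_b1 = -0.38184722, 0.130750479
n, k = 25, 3
t_star = critical_t(0.95, n - k - 1)
lower, upper = b_1 - t_star * s_b1, b_1 + t_star * s_b1
print(f"t* = {t_star:.4f}, 95% interval = ({lower:.6f}, {upper:.6f})")
```

An interval that lies entirely below zero is consistent with rejecting the null hypothesis $\beta_1=0$ at the 5% significance level.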
EXAMPLE
The human resources department at a large company wants to develop a model to predict an
employee’s job satisfaction from the number of hours of unpaid work per week the employee does,
the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken
and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job
satisfaction score is out of 10, with higher values indicating greater job satisfaction.
Job Satisfaction   Hours of Unpaid Work per Week   Age   Income ($1000s)
4                  3                               23    60
5                  8                               32    114
2                  9                               28    45
6                  4                               60    187
7                  3                               62    175
8                  1                               43    125
7                  6                               60    93
3                  3                               37    57
5                  2                               24    47
5                  5                               64    128
7                  2                               28    66
8                  1                               66    146
5                  7                               35    89
2                  5                               37    56
4                  0                               59    65
6                  2                               32    95
5                  6                               76    82
7                  5                               25    90
9                  0                               55    137
8                  3                               34    91
7                  5                               54    184
9                  1                               57    60
7                  0                               68    39
10                 2                               66    187
5                  0                               50    49
Previously, we found the multiple regression equation to predict the job satisfaction score from the
other variables:
At the 5% significance level, test the relationship between the dependent variable “job satisfaction”
and the independent variable “age”.
Solution:
Hypotheses:
\begin{eqnarray*} H_0: & & \beta_2=0 \\ H_a: & & \beta_2 \neq 0
\end{eqnarray*}
p-value:
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.711779225
R Square            0.506629665
Adjusted R Square   0.436148189
Standard Error      1.585212784
Observations        25

ANOVA
        df   SS       MS   F   Significance F
Total   24   106.96

                                Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Hours of Unpaid Work per Week   -0.38184722    0.130750479      -2.9204269    0.008177146   -0.65375772   -0.10993671
Income ($1000s)                 0.023250418    0.007610353      3.055103771   0.006012895   0.007423823   0.039077013
The p-value for the test on the age regression coefficient is in the bottom part of the table under the
P-value column of the Age row. So the p-value is 0.8439.
Conclusion:
Because the p-value of 0.8439 is greater than the significance level of 0.05, we do not reject the
null hypothesis. At the 5%
significance level there is not enough evidence to suggest that there is a relationship between the
dependent variable “job satisfaction” and the independent variable “age.”
NOTES
1. The null hypothesis $\beta_2=0$ is the claim that the regression coefficient for the independent
variable $x_2$ is zero. That is, the null hypothesis is the claim that there is no relationship
between the dependent variable and the independent variable “age.”
2. The alternative hypothesis $\beta_2 \neq 0$ is the claim that the regression coefficient for the independent
variable $x_2$ is not zero. The alternative hypothesis is the claim that there is a relationship
between the dependent variable and the independent variable “age.”
3. When conducting a test on a regression coefficient, make sure to use the correct subscript on
$\beta$ to correspond to how the independent variables were defined in the regression model and
which independent variable is being tested. Here the subscript on $\beta$ is 2 because “age” is
defined as $x_2$ in the regression model.
4. The p-value of 0.8439 is a large probability compared to the significance level, and so a sample
result like this one is likely to happen assuming the null hypothesis is true. The sample does
not provide enough evidence against the null hypothesis, and so the conclusion of the test is
to not reject the null hypothesis. Not rejecting the null hypothesis does not prove that the
regression coefficient is zero. It means only that we cannot conclude there is a
relationship between the dependent variable “job satisfaction” and the independent variable
“age.” This means that the independent variable “age” is not particularly useful in predicting
the dependent variable.
EXAMPLE
The human resources department at a large company wants to develop a model to predict an
employee’s job satisfaction from the number of hours of unpaid work per week the employee does,
the employee’s age, and the employee’s income. A sample of 25 employees at the company is taken
and the data is recorded in the table below. The employee’s income is recorded in $1000s and the job
satisfaction score is out of 10, with higher values indicating greater job satisfaction.
Job Satisfaction   Hours of Unpaid Work per Week   Age   Income ($1000s)
4                  3                               23    60
5                  8                               32    114
2                  9                               28    45
6                  4                               60    187
7                  3                               62    175
8                  1                               43    125
7                  6                               60    93
3                  3                               37    57
5                  2                               24    47
5                  5                               64    128
7                  2                               28    66
8                  1                               66    146
5                  7                               35    89
2                  5                               37    56
4                  0                               59    65
6                  2                               32    95
5                  6                               76    82
7                  5                               25    90
9                  0                               55    137
8                  3                               34    91
7                  5                               54    184
9                  1                               57    60
7                  0                               68    39
10                 2                               66    187
5                  0                               50    49
Previously, we found the multiple regression equation to predict the job satisfaction score from the
other variables:
At the 5% significance level, test the relationship between the dependent variable “job satisfaction”
and the independent variable “income”.
Solution:
Hypotheses:
\begin{eqnarray*} H_0: & & \beta_3=0 \\ H_a: & & \beta_3 \neq 0
\end{eqnarray*}
p-value:
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.711779225
R Square            0.506629665
Adjusted R Square   0.436148189
Standard Error      1.585212784
Observations        25

ANOVA
        df   SS       MS   F   Significance F
Total   24   106.96

                                Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%
Hours of Unpaid Work per Week   -0.38184722    0.130750479      -2.9204269    0.008177146   -0.65375772   -0.10993671
Income ($1000s)                 0.023250418    0.007610353      3.055103771   0.006012895   0.007423823   0.039077013
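The p-value for the test on the income regression coefficient is in the bottom part of the table under
the P-value column of the Income row. So the p-value is 0.0060.
Conclusion:
Because the p-value of 0.0060 is less than the significance level of 0.05, we reject the null
hypothesis in favour of the alternative hypothesis. At the 5% significance level there is enough
evidence to suggest that there is a
relationship between the dependent variable “job satisfaction” and the independent variable
“income.”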
NOTES
1. The null hypothesis $\beta_3=0$ is the claim that the regression coefficient for the independent
variable $x_3$ is zero. That is, the null hypothesis is the claim that there is no relationship
between the dependent variable and the independent variable “income.”
2. The alternative hypothesis $\beta_3 \neq 0$ is the claim that the regression coefficient for the independent
variable $x_3$ is not zero. The alternative hypothesis is the claim that there is a relationship
between the dependent variable and the independent variable “income.”
3. When conducting a test on a regression coefficient, make sure to use the correct subscript on
$\beta$ to correspond to how the independent variables were defined in the regression model and
which independent variable is being tested. Here the subscript on $\beta$ is 3 because “income” is
defined as $x_3$ in the regression model.
4. The p-value of 0.0060 is a small probability compared to the significance level, and so is
unlikely to happen assuming the null hypothesis is true. This suggests that the assumption
that the null hypothesis is true is most likely incorrect, and so the conclusion of the test is to
reject the null hypothesis in favour of the alternative hypothesis. In other words, the
regression coefficient is not zero, and so there is a relationship between the dependent
variable “job satisfaction” and the independent variable “income.” This means that the
independent variable “income” is useful in predicting the dependent variable.
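As a check on the summary output, the t Stat entry for income can be reproduced by hand from the Coefficients and Standard Error columns. The calculation below assumes the $t$-score is the sample coefficient divided by its standard error, which is what the test uses when the null hypothesis is $\beta_3=0$; the symbols $b_3$ and $s_{b_3}$ are used here for the sample coefficient and its standard error.

\begin{eqnarray*} t & = & \frac{b_3}{s_{b_3}} \\ & = & \frac{0.023250418}{0.007610353} \\ & \approx & 3.0551 \end{eqnarray*}

This agrees with the t Stat value of 3.055103771 in the Income row, and the P-value of 0.0060 in that row is the two-tailed area beyond this $t$-score in a $t$-distribution with $25-3-1=21$ degrees of freedom.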
Concept Review
The test on a regression coefficient determines if there is a relationship between the dependent
variable and the corresponding independent variable. The p-value for the test is the sum of the areas
in the two tails of the $t$-distribution. The p-value can be found on the regression summary table generated
by Excel.
The hypothesis test for a regression coefficient is a well-established process:
1. Write down the null and alternative hypotheses in terms of the regression coefficient being
tested. The null hypothesis is the claim that there is no relationship between the dependent
variable and independent variable. The alternative hypothesis is the claim that there is a
relationship between the dependent variable and independent variable.
2. Collect the sample information for the test and identify the significance level.
3. The p-value is the sum of the areas in the two tails of the $t$-distribution. Use the regression
summary table generated by Excel to find the p-value.
4. Compare the p-value to the significance level and state the outcome of the test.
5. Write down a concluding sentence specific to the context of the question.
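Step 4 of the process above is the same comparison for every coefficient, so it can be written once and applied to each p-value. The short Python sketch below does this for the three p-values reported in the examples in this section at the 5% significance level. The dictionary layout and the function name decide are illustrative choices, not part of the Excel output.

```python
ALPHA = 0.05  # significance level used in the three examples

# p-values read from the regression summary output and the notes in this section
p_values = {
    "hours of unpaid work per week": 0.008177146,
    "age": 0.8439,
    "income": 0.006012895,
}

def decide(p_value, alpha):
    """Compare a p-value to the significance level and state the outcome."""
    return "reject H0" if p_value < alpha else "do not reject H0"

decisions = {name: decide(p, ALPHA) for name, p in p_values.items()}
for name, outcome in decisions.items():
    print(f"{name}: p-value = {p_values[name]:.4f} -> {outcome}")
```

The decision only states the outcome of the test; the concluding sentence in step 5 still has to be written in the context of the question.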
13.7 MULTICOLLINEARITY
LEARNING OBJECTIVES
The term independent variable applies to any variable that is used to predict or explain the value
of the dependent variable. But this does not mean that the independent variables themselves are
unrelated to each other. In fact, most independent variables in multiple regression models share
some degree of relatedness. For example, if “distance travelled” and “litres of gas consumed” are
the independent variables in a regression model to predict the dependent variable “travel time,” the
variables “distance travelled” and “litres of gas consumed” are highly correlated.
When two or more independent variables in a regression model are highly correlated with each
other, multicollinearity exists between the independent variables. Consequently, the conclusions
about the relationship between the dependent variable and the individual independent variables
may be affected when the independent variables are related to each other. In addition,
multicollinearity may affect the outcome of the tests on the individual regression coefficients. But
multicollinearity does not affect the outcome of the overall test on the regression model.
Even though the overall model test may conclude that there is a relationship between the
dependent variable and the set of independent variables, multicollinearity amongst the
independent variables may cause all of the tests on the individual regression coefficients to
conclude that none of the individual independent variables are related to the dependent variable.
One way to address the problem of multicollinearity is to avoid including independent variables
that are highly correlated or remove one of two highly correlated independent variables from the
model.
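One simple way to look for highly correlated independent variables is to compute the correlation coefficient for each pair before fitting the model. The Python sketch below does this for the three independent variables in the job satisfaction data from the previous section. It is a sketch under stated assumptions: the text does not define how large a correlation counts as "highly correlated", so the 0.7 cut-off is an arbitrary value chosen for illustration, and the function and variable names are hypothetical.

```python
import math

# (hours of unpaid work per week, age, income in $1000s) for the 25 employees
rows = [
    (3, 23, 60), (8, 32, 114), (9, 28, 45), (4, 60, 187), (3, 62, 175),
    (1, 43, 125), (6, 60, 93), (3, 37, 57), (2, 24, 47), (5, 64, 128),
    (2, 28, 66), (1, 66, 146), (7, 35, 89), (5, 37, 56), (0, 59, 65),
    (2, 32, 95), (6, 76, 82), (5, 25, 90), (0, 55, 137), (3, 34, 91),
    (5, 54, 184), (1, 57, 60), (0, 68, 39), (2, 66, 187), (0, 50, 49),
]
names = ["hours", "age", "income"]

def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

columns = list(zip(*rows))  # one tuple per independent variable
matrix = [[pearson_r(a, b) for b in columns] for a in columns]

THRESHOLD = 0.7  # assumed cut-off for "highly correlated"; not from the text
flagged = [
    (names[i], names[j], matrix[i][j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(matrix[i][j]) >= THRESHOLD
]

for i, name in enumerate(names):
    print(name, [round(r, 3) for r in matrix[i]])
print("pairs at or above the cut-off:", flagged)
```

A pair that appears in the flagged list would be a candidate for the remedy described above: keep one of the two variables and drop the other.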
Concept Review
Multicollinearity refers to the correlation that may exist between two or more independent
variables in a regression model. Although multicollinearity may affect conclusions drawn about
the individual regression coefficients, multicollinearity does not affect conclusions about the overall
model.
13.8 EXERCISES
1. A local restaurant advocacy group wants to study the relationship between a restaurant’s average
weekly profit, the restaurant’s seating capacity, and the average daily traffic that passes the restaurant’s
location. The group took a sample of restaurants and recorded their average weekly profit (in
$1000s), the restaurant’s seating capacity, and the average number of cars (in 1000s) that
pass the restaurant’s location. The data is recorded in the following table:
Seating Capacity   Traffic Count (1000s)   Average Weekly Profit ($1000s)
120                19                      23.8
180                8                       29.2
150                12                      22
180                15                      26.2
220                16                      33.5
235                10                      32
115                18                      22.4
110                12                      20.4
165                21                      23.7
220                20                      34.7
140                24                      27.1
145                24                      23.3
140                13                      20.9
200                14                      29.6
210                14                      31.4
175                12                      23.2
175                15                      31.1
190                17                      28.2
100                23                      25.2
145                20                      20.7
135                13                      37.2
25                 13                      26.3
140                25                      20
130                14                      28.2
135                10                      24.6
160                23                      23.7
a. Find the regression model to predict the average weekly profit from the other variables.
b. Interpret the coefficient for seating capacity.
c. Interpret the coefficient for traffic count.
d. Predict the average weekly profit for a restaurant with a seating capacity of 150 and a traffic
count of 25,000 cars.
e. Find the adjusted coefficient of determination.
f. Interpret the adjusted coefficient of determination.
g. Find the standard error of the estimate.
h. Interpret the standard error of the estimate.
i. At the 5% significance level, test the validity of the model.
j. At the 5% significance level, test the coefficient of seating capacity.
k. At the 5% significance level, test the coefficient of traffic count.
2. A local university wants to study the relationship between a student’s GPA, the average number
of hours they spend studying each night and the average number of nights they go out each week.
The university took a sample of students and recorded the following data:
GPA    Average Hours Studying per Night   Average Nights Out per Week
3.72   5                                  1
3.88   3                                  1
3.67   2                                  1
3.87   3                                  4
2.49   1                                  4
1.29   1                                  2
1.01   2                                  4
2.12   1                                  1
1.9    1                                  5
3.42   3                                  2
1.33   1                                  4
1.07   0                                  2
2.75   3                                  1
3.82   4                                  1
3.91   5                                  0
2.25   2                                  3
2.06   1                                  5
2.92   3                                  2
3.06   3                                  1
3.65   2                                  2
3.69   4                                  1
a. Find the regression model to predict GPA from the other variables.
b. Interpret the coefficient for the average number of hours spent studying each night.
c. Interpret the coefficient for the average number of nights a student goes out each week.
d. Predict the GPA for a student who spends an average of 4 hours a night studying and goes out
an average of 3 nights a week.
e. Find the adjusted coefficient of determination.
f. Interpret the adjusted coefficient of determination.
g. Find the standard error of the estimate.
3. A very large company wants to study the relationship between the salaries of employees in
management positions, their age, the number of years the employee spent in college, and the
number of years the employee has been with the company. A sample of management employees is
taken and the data is recorded below:
Age   Years of College   Years with Company   Salary
60    8                  29                   317.3
33    3                  5                    97.3
57    6                  27                   263.1
32    4                  5                    101.3
31    6                  3                    114.2
61    8                  19                   350.4
41    7                  8                    146.9
35    4                  2                    91.7
51    6                  21                   198.2
50    8                  10                   196.5
57    5                  15                   105.7
49    6                  18                   118.3
62    7                  27                   305.2
52    8                  26                   239.9
39    4                  8                    145.9
42    7                  5                    175.4
62    4                  24                   219.4
60    4                  22                   202.1
65    3                  21                   196.3
40    4                  10                   143.9
62    6                  29                   408.7
53    7                  5                    145.2
48    8                  5                    175.1
61    5                  6                    152.7
38    7                  3                    99.7
40    7                  12                   174.9
45    7                  7                    149.2
58    7                  14                   282.8
38    4                  3                    95.7
41    5                  18                   232.8
a. Find the regression model to predict salary from the other variables.
b. Interpret the coefficient for age.
c. Interpret the coefficient for years of college.
d. Interpret the coefficient for years with the company.
e. Predict the salary for a 47-year-old management employee who spent 5 years in college and
has been with the company for 15 years.
f. Find the adjusted coefficient of determination.
g. Interpret the adjusted coefficient of determination.
h. Find the standard error of the estimate.
i. Interpret the standard error of the estimate.
j. At the 1% significance level, test the validity of the model.
k. At the 1% significance level, test the coefficient of age.
l. At the 1% significance level, test the coefficient of the years of college.
m. At the 1% significance level, test the coefficient for the years with the company.
13.9 ANSWERS TO SELECT EXERCISES
1.
a.
b. For each additional seat in the restaurant, the average weekly profit increases by $46.
c. For each additional 1000 cars that pass the restaurant, the average weekly profit decreases by
$196.
d. $24,519.20
e. 0.2250
f. 22.50% of the variation in the average weekly profit can be explained by the regression model
based on seating capacity and traffic count.
g. 4.1675.
h. On average, the average weekly profit differs by $4167.50 from the regression model based on
seating capacity and traffic count.
i. p-value=0.0205; reject the null hypothesis.
j. p-value=0.0144; reject the null hypothesis.
k. p-value=0.2645; do not reject the null hypothesis.
2.
a.
b. For each additional hour spent studying each night, the student’s GPA increases by 0.524.
c. For each additional night a student goes out each week, the student’s GPA decreases by 0.082.
d. 3.54
e. 0.5833
f. 58.33% of the variation in GPA can be explained by the regression model based on the average
number of hours spent studying a night and the average number of nights a student goes out
each week.
g. 0.6613.
h. On average, GPA differs by 0.6613 from the regression model based on the average number of
hours spent studying a night and the average number of nights a student goes out each week.
i. p-value=0.0002; reject the null hypothesis.
j. p-value=0.0009; reject the null hypothesis.
k. p-value=0.5083; do not reject the null hypothesis.
3.
a.
The Data and Story Library. (n.d.). Retrieved May 1, 2013, from http://lib.stat.cmu.edu/DASL/
Stories/CrashTestDummies.html
Book of Odds. (n.d.). How George Gallup Picked the President. http://www.bookofodds.com/
Relationships-Society/Articles/A0374-How-George-Gallup-Picked-the-President
Gallup. (n.d.). Gallup Presidential Election Trial-Heat Trends, 1936–2008. Retrieved May 1, 2013,
from http://www.gallup.com/poll/110548/gallup-presidential-election-trialheat-
trends-19362004.aspx#4
Gallup-Healthways Well-Being Index. (n.d.). Retrieved May 1, 2013, from http://www.well-
beingindex.com/default.asp
Gallup-Healthways Well-Being Index. (n.d.). Retrieved May 1, 2013, from http://www.well-
beingindex.com/methodology.asp
Gallup-Healthways Well-Being Index. (n.d.) Retrieved May 1, 2013, from
http://www.gallup.com/poll/146822/gallup-healthways-index-questions.aspx
LBCC Distance Learning (DL) program data in 2010-2011. (n.d.). Retrieved May 1, 2013, from
http://de.lbcc.edu/reports/2010-11/future/highlights.html#focus
Lusinchi, D. (2012) “’President’ Landon and the 1936 Literary Digest Poll: Were Automobile and
Telephone Owners to Blame?” Social Science History ,36(1)1, 23-54. http://ssh.dukejournals.org/
content/36/1/23.abstract
San Jose Mercury News. (n.d.).
The Literary Digest Poll,” Virtual Laboratories in Probability and Statistics. (n.d.). Retrieved May 1,
2013, from http://www.math.uah.edu/stat/data/LiteraryDigest.html
The Data and Story Library. (n.d.). Retrieved May 1, 2013, from http://lib.stat.cmu.edu/DASL/
Datafiles/USCrime.html
902 | REFERENCES
Lane, D. 2003, June 20. Levels of Measurement. OpenStax CNX. Retrieved May 1, 2013, from
http://cnx.org/content/m10809/latest/
Levels of Measurement. (n.d.). Retrieved May 1, 2013, from http://infinity.cos.edu/faculty/
woodbury/stats/tutorial/Data_Levels.htm
State & County QuickFacts. (n.d.). Retrieved May 1, 2013, from http://quickfacts.census.gov/qfd/
download_data.html
State & County QuickFacts: Quick, easy access to facts about people, business, and geography. (n.d.).
U.S. Census Bureau. Retrieved from May 1, 2013, http://quickfacts.census.gov/qfd/index.html
Table 5: Direct hits by mainland United States Hurricanes (1851-2004). (n.d.). National Hurricane
Center. Retrieved May 1, 2013, from http://www.nhc.noaa.gov/gifs/table5.gif
Taylor, C. 2018, February 2. The Levels of Measurement in Statistics. Thoughtco.
http://statistics.about.com/od/HelpandTutorials/a/Levels-Of-Measurement.htm
Alden, L. (2013, May 1). Statistics can be Misleading. econoclass.com. Retrieved May 1, 2013, from
http://www.econoclass.com/misleadingstats.html
America’s Best Small Companies. (n.d.). Forbes. Retrieved May 1, 2013, from,
http://www.forbes.com/best-small-companies/list/
April 2013 Air Travel Consumer Report. (2013, April 11). U.S. Department of Transportation.
Retrieved, May 1, 2013, from, http://www.dot.gov/airconsumer/april-2013-air-travel-consumer-
report (accessed May 1, 2013).
Bhattacharjee, Y. (2013, April 26). The Mind of a Con Man. The New York Times Magazine.
http://www.nytimes.com/2013/04/28/magazine/diederik-stapels-audacious-academic-
fraud.html?src=dayp&_r=2&
Data. (n.d.). BusinessWeek. Retrieved May 1, 2013, from https://www.bloomberg.com/
businessweek
Data. (n.d.). Forbes. Retrieved May 1, 2013, from, https://www.forbes.com/
Earthquake Information by Year. (n.d.). U.S. Geological Survey. Retrieved, May 1, 2013, from
http://earthquake.usgs.gov/earthquakes/eqarchives/year/
Jacskon, M. L., Croft, R. J., Kennedy, G. A., Owens, K., & Howard, M. E. (2013). Cognitive
Components of Simulated Driving Performance: Sleep Loss effect and Predictors. Accident Analysis
and Prevention, Jan(50), 438-44. http://www.ncbi.nlm.nih.gov/pubmed/22721550
Levelt, W. J. M., Drenth, P., & Noort, E. (Eds.). (2012). Flawed science: The fraudulent research
practices of social psychologist Diederik Stapel. Tilburg: Commissioned by the Tilburg University,
REFERENCES | 903
Births Time Series Data. (2013). General Register Office For Scotland. Retrieved April 3, 2013, from,
http://www.gro-scotland.gov.uk/statistics/theme/vital-events/births/time-series.html
CO2 emissions (kt). (2013). The World Bank. Retrieved April 3, 2013, from,
http://databank.worldbank.org/data/home.aspx
Consumer Price Index. (n.d.). United States Department of Labor: Bureau of Labor Statistics.
Retrieved April 3, 2013, from, http://data.bls.gov/pdq/SurveyOutputServlet
Demographics: Children under the age of 5 years underweight. (n.d.). Indexmundi. Retrieved April
3, 2013, from, http://www.indexmundi.com/g/r.aspx?t=50&v=2224&aml=en
Food Security Statistics. (n.d.). Food and Agriculture Organization of the United Nations.
Retrieved April 3, 2013, from, http://www.fao.org/economic/ess/ess-fs/en/
Gunst, R., & Mason, R. (1980). Regression Analysis and Its Application: A Data-Oriented
Approach. CRC Press.
904 | REFERENCES
Overweight and Obesity: Adult Obesity Facts. (n.d.). Centers for Disease Control and Prevention.
Retrieved April 3, 2013, from, http://www.cdc.gov/obesity/data/adult.html
Presidents. (2007). Fact Monster. Retrieved April 3, 2013, from, http://www.factmonster.com/
ipka/A0194030.html
Timeline: Guide to the U.S. Presidents: Information on every president’s birthplace, political party,
term of office, and more. (2013). Scholastic. Retrieved April 3, 2013, from,
http://www.scholastic.com/teachers/article/timeline-guide-us-presidents
Data. (n.d.). The World Bank. Retrieved April 3, 2013, from, http://www.worldbank.org
Demographics: Obesity – adult prevalence rate. (n.d.). Indexmundi. Retrieved April 3, 2013, from,
http://www.indexmundi.com/g/r.aspx?t=50&v=2228&l=en
1990 Census. (n.d.). United States Department of Commerce: United States Census Bureau.
Retrieved April 3, 2013, from, http://www.census.gov/main/www/cen1990.html
Cauchon, D., & Overberg, P. (2012). Census data shows minorities now a majority of U.S.
births. USA Today. Retrieved April 3, 2013, from, http://usatoday30.usatoday.com/news/nation/
story/2012-05-17/minority-birthscensus/55029100/1
Data. (n.d.). San Jose Mercury News.
Data. (n.d.). The United States Department of Commerce: United States Census Bureau.
Retrieved April 3, 2013, from, http://www.census.gov/
Yankelovich Partners. (n.d.). Survey. Time Magazine.
Worldatlas. (2013). Countries List by Continent. In Worldatlas.com. Retrieved May 2, 2013, from,
http://www.worldatlas.com/cntycont.htm
Blood Types. (n.d.). American Red Cross. Retrieved May 3, 2013, from,
http://www.redcrossblood.org/learn-about-blood/blood-types
Data. (n.d.). National Center for Health Statistics, The United States Department of Health and
Human Services.
Data. (n.d.). United States Senate. Retrieved May 2, 2013, from, https://www.senate.gov/
Haiman, C. A., Stram, D. O., Wilkens, L. R., Pike, M. C., Kolonel, L. N., Henderson, B. E.,
& le Marchand, L. (2006, January 26). Ethnic and Racial Differences in the Smoking-Related
Risk of Lung Cancer. The New England Journal of Medicine. http://www.nejm.org/doi/full/10.1056/
NEJMoa033250
Human Blood Types. (2011). Unite Blood Services. Retrieved May 2, 2013, from,
http://www.unitedbloodservices.org/learnMore.aspx
Samuel, T. M. (2013). Strange Facts about RH Negative Blood. eHow Health. Retrieved May 2,
2013, from, http://www.ehow.com/facts_5552003_strange-rh-negative-blood.html
United States: Uniform Crime Report – State Statistics from 1960–2011. (n.d.). The Disaster Center.
Retrieved May 2, 2013, from, http://www.disastercenter.com/crime/
Rider, D. (2011, September 14). Ford support plummeting, poll suggests. The Star. Retrieved
May 2, 2013, from, http://www.thestar.com/news/gta/2011/09/14/
ford_support_plummeting_poll_suggests.html
Roulette. (n.d.). In Wikipedia. http://en.wikipedia.org/wiki/Roulette
Shin, H. B., & Kominski, R. A. (2010, April 1). Language Use in the United States: 2007. United
States Census Bureau. https://www.census.gov/library/publications/2010/acs/acs-12.html
Course Catalog. (n.d.). Florida State University. Retrieved May 15, 2013, from, https://m.fsu.edu/
default/course_catalog/index
World Earthquakes: Live Earthquake News and Highlights. (2012). World Earthquakes. Retrieved
May 15, 2013, from, http://www.world-earthquakes.com/index.php?option=ethq_prediction
Access to electricity (% of population). (2013). The World Bank. Retrieved May 15, 2015, from,
http://data.worldbank.org/indicator/
EG.ELC.ACCS.ZS?order=wbapi_data_value_2009%20wbapi_data_value%20wbapi_data_value-
first&sort=asc
Distance Education. (n.d.). In Wikipedia. Retrieved May 15, 2013, from, http://en.wikipedia.org/
wiki/Distance_education
NBA Statistics – 2013. ESPN. Retrieved May 15, 2013, from, http://espn.go.com/nba/
statistics/_/seasontype/2
Newport, F. (2013, May 9). Americans Still Enjoy Saving Rather than Spending: Few demographic
differences seen in these views other than by income. Gallup. http://www.gallup.com/poll/162368/
americans-enjoy-saving-rather-spending.aspx
Pryor, J. H., DeAngelo, L., Palucki Blake, L., Hurtado, S., & Tran, S. (2011). The American
Freshman: National Norms Fall 2011. Cooperative Institutional Research Program at the Higher
Education Research Institute at UCLA. http://heri.ucla.edu/PDFs/pubs/TFS/Norms/Monographs/
TheAmericanFreshman2011.pdf
The World FactBook. (n.d.). Central Intelligence Agency. Retrieved May 15, 2013, from,
https://www.cia.gov/library/publications/the-world-factbook/geos/af.html
What are the key statistics about pancreatic cancer? (2013). American Cancer Society. Retrieved
ATL Fact Sheet. (2013). Department of Aviation at the Hartsfield-Jackson Atlanta International
Airport. Retrieved February 18, 2019, from, http://www.atl.com/about-atl/atl-factsheet/
Children and Childrearing. (n.d.). Ministry of Health, Labour, and Welfare. Retrieved May 15,
2013, from, http://www.mhlw.go.jp/english/policy/children/children-childrearing/index.html
Daily Mail Reporter. (2011, June 9). One born every minute: The maternity unit where mothers
are THREE to a bed. Daily Mail. Retrieved May 15, 2013, from, http://www.dailymail.co.uk/news/
article-2001422/Busiest-maternity-ward-planet-averages-60-babies-day-mothers-bed.html
Eating Disorder Statistics. (2006). South Carolina Department of Mental Health. Retrieved May
15, 2013, from, http://www.state.sc.us/dmh/anorexia/statistics.htm
Giving Birth in Manila. (2011, June 8). The Guardian. Retrieved May 15, 2013, from,
http://www.theguardian.com/world/gallery/2011/jun/08/philippines-
health#/?picture=375471900&index=2
Lenhart, A. (2012, March 19). Teens, Smartphones & Texting. Pew Research Center. Retrieved
May 15, 2013, from, http://www.pewinternet.org/~/media/Files/Reports/2012/
PIP_Teens_Smartphones_and_Texting.pdf
Smith, A. (2011, September 19). How Americans Use Text Messaging. Pew Research Center.
Retrieved May 15, 2013, from, http://pewinternet.org/Reports/2011/Cell-Phone-Texting-2011/
Main-Report.aspx
Teen Drivers: Fact Sheet, Injury Prevention & Control: Motor Vehicle Safety. (2012, October 2).
Centers for Disease Control and Prevention. Retrieved May 15, 2013, from, http://www.cdc.gov/
Motorvehiclesafety/Teen_Drivers/teendrivers_factsheet.html
Vanderkam, L. (2012, October 8). Stop Checking Your Email, Now. Fortune. Retrieved May 15,
2013, from, http://management.fortune.cnn.com/2012/10/08/stop-checking-your-email-now/
World Earthquakes: Live Earthquake News and Highlights. (n.d.). World Earthquakes Live.
Retrieved May 15, 2013, from, http://www.world-earthquakes.com/
index.php?option=ethq_prediction
2012 College-Bound Seniors Total Group Profile Report. (2012). CollegeBoard. Retrieved May 14,
2013, from, http://media.collegeboard.com/digitalServices/pdf/research/TotalGroup-2012.pdf
Blood Pressure of Males and Females. (n.d.). StatCrunch. Retrieved May 14, 2013, from,
http://www.statcrunch.com/5.0/viewreport.php?reportid=11960
Data. (n.d.). National Basketball Association. Retrieved May 14, 2013, from, www.nba.com
Data. (n.d.). San Jose Mercury News.
Digest of Education Statistics: ACT score average and standard deviations by sex and race/ethnicity
and percentage of ACT test takers, by selected composite score ranges and planned fields of study:
Selected years, 1995 through 2009. (2009). National Center for Education Statistics. Retrieved May
14, 2013, from, http://nces.ed.gov/programs/digest/d09/tables/dt09_147.asp
Janssen, S. (Ed.). (n.d.). The World Almanac and Book of Facts. World Almanac Books.
List of stadiums by capacity. (n.d.). In Wikipedia. Retrieved May 14, 2013, from,
https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity
The Use of Epidemiological Tools in Conflict-affected populations: Open-access educational
resources for policy-makers: Calculation of z-scores. (2009). London School of Hygiene and Tropical
Medicine. Retrieved May 14, 2013, from, http://conflict.lshtm.ac.uk/page_125.htm
Facebook Statistics. (n.d.). Statistics Brain. Retrieved May 14, 2013, from,
http://www.statisticbrain.com/facebook-statistics/
Naegele’s rule. (n.d.). In Wikipedia. Retrieved May 14, 2013, from, http://en.wikipedia.org/wiki/
Naegele’s_rule
NUMMI. (2010, March 26). This American Life. Retrieved May 14, 2013, from,
http://www.thisamericanlife.org/radio-archives/episode/403/nummi
Scratch-Off Lottery Ticket Playing Tips. (n.d.). WinAtTheLottery.com. Retrieved May 14, 2013,
from, http://www.winatthelottery.com/public/department40.cfm
Smart Phone Users, By The Numbers. (n.d.). Visual.ly. Retrieved May 14, 2013, from,
http://visual.ly/smart-phone-users-numbers
Baran, D. (n.d.). 20 Percent of Americans Have Never Used Email. WebGuild. Retrieved May 14, 2013,
from, http://www.webguild.org/20080519/20-percent-of-americans-have-never-used-email
Data. (n.d.). The Flurry Blog. Retrieved May 17, 2013, from, http://blog.flurry.com
Data. (n.d.). The United States Department of Agriculture.
American Fact Finder. (n.d.). U.S. Census Bureau. Retrieved July 2, 2013, from,
http://factfinder2.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t
Disclosure Data Catalog: Candidate Summary Report 2012. (n.d.). U.S. Federal Election
Commission. Retrieved July 2, 2013, from, http://www.fec.gov/data/index.jsp
Headcount Enrollment Trends by Student Demographics Ten-Year Fall Trends to Most Recently
Completed Fall. (n.d.). Foothill De Anza Community College District. Retrieved September 30,
2013, from, http://research.fhda.edu/factbook/FH_Demo_Trends/
FoothillDemographicTrends.htm
Kuczmarski, R. J., Ogden, C. L., Guo, S. S., Grummer-Strawn, L. M., Flegal, K. M., Mei, Z.,
Wei, R., Curtin, L. R., Roche, A. F., & Johnson, C. L. (2002, May). Vital Health Statistics: 2000
CDC Growth Charts for the United States: Methods and Development. Centers for Disease Control
and Prevention, 11(246). Retrieved July 2, 2013, from, http://www.cdc.gov/growthcharts/
2000growthchart-us.pdf
Mean Income in the Past 12 Months (in 2011 Inflation-Adjusted Dollars): 2011 American
Community Survey 1-Year Estimates. (n.d.). American Fact Finder, U.S. Census Bureau. Retrieved
July 2, 2013, from, http://factfinder2.census.gov/faces/tableservices/jsf/pages/
productview.xhtml?pid=ACS_11_1YR_S1902&prodType=table
Metadata Description of Candidate Summary File. (n.d.). U.S. Federal Election Commission.
Retrieved July 2, 2013, from, http://www.fec.gov/finance/disclosure/metadata/
metadataforcandidatesummary.shtml
National Health and Nutrition Examination Survey. (n.d.). Centers for Disease Control and
Prevention. Retrieved July 2, 2013, from, http://www.cdc.gov/nchs/nhanes.htm
Ralph, N., & German, K. (2011, June 1). Cell phones with the highest radiation levels (pictures).
CNET. Retrieved July 2, 2013, from, http://reviews.cnet.com/cell-phone-radiation-levels/
America’s Best Small Companies. (2013). Forbes. Retrieved July 2, 2013, from,
http://www.forbes.com/best-small-companies/list/
Data. (n.d.). Businessweek. http://www.businessweek.com/.
Data. (n.d.). Forbes. http://www.forbes.com/.
Data. (n.d.). In Microsoft Bookshelf.
Disclosure Data Catalog: Leadership PAC and Sponsors Report, 2012. (n.d.). Federal Election
Commission. Retrieved July 2, 2013, from, http://www.fec.gov/data/index.jsp
Human Toxome Project: Mapping the Pollution in People. (n.d.). Environmental Working Group.
Retrieved July 2, 2013, from, http://www.ewg.org/sites/humantoxome/participants/participant-
group.php?group=in+utero%2Fnewborn
Metadata Description of Leadership PAC List. (n.d.). Federal Election Commission. Retrieved July
2, 2013, from, http://www.fec.gov/finance/disclosure/metadata/metadataLeadershipPacList.shtml
2013 Teen and Privacy Management Survey. (n.d.). Pew Research Center: Internet and American
Life Project. Retrieved July 2, 2013, from, http://www.pewinternet.org/~/media//Files/
Questionnaire/2013/Methods%20and%20Questions_Teens%20and%20Social%20Media.pdf
52% Say Big-Time College Athletics Corrupt Education Process. (2013, May 16). Rasmussen
Reports. Retrieved July 2, 2013, from, http://www.rasmussenreports.com/public_content/lifestyle/
sports/may_2013/52_say_big_time_college_athletics_corrupt_education_process
Jensen, T. (2013, May 10). Democrats, Republicans Divided on Opinion of Music Icons. Public
Policy Polling. Retrieved July 2, 2013, from, https://www.publicpolicypolling.com/polls/democrats-
republicans-divided-on-opinion-of-music-icons/
Madden, M., Lenhart, A., Cortesi, S., Gasser, U., Duggan, M., Smith, A., & Beaton, M. (2013,
May 21). Teens, Social Media, and Privacy. Pew Research Center. Retrieved July 2, 2013, from,
https://www.pewresearch.org/internet/2013/05/21/teens-social-media-and-privacy/
Saad, L. (2013, May 23). Three in Four U.S. Workers Plan to Work Past Retirement Age: Slightly
more say they will do this by choice rather than necessity. Gallup. Retrieved July 2, 2013, from,
http://www.gallup.com/poll/162758/three-four-workers-plan-work-past-retirement-age.aspx
The Field Poll. (n.d.). Field. Retrieved July 2, 2013, from, http://field.com/fieldpollonline/
subscribers/
Zogby. (2013, May 16). New SUNYIT/Zogby Analytics Poll: Few Americans Worry about Emergency
Situations Occurring in Their Community; Only one in three have an Emergency Plan; 70% Support
Infrastructure ‘Investment’ for National Security. Zogby Analytics. Retrieved July 2, 2013, from,
http://www.zogbyanalytics.com/news/299-americans-neither-worried-nor-prepared-in-case-of-a-
disaster-sunyit-zogby-analytics-poll
Allen, E. I., & Seaman, J. (2005). Growing by Degrees: Online Education in the United States, 2005.
The Sloan Consortium.
Amit Schitai, A. (n.d.). Data.
Data. (n.d.). American Automobile Association. Retrieved June 27, 2013, from, www.aaa.com
Data. (n.d.). American Library Association. Retrieved June 27, 2013, from, https://www.ala.org/
Data. (n.d.). Bureau of Labor Statistics. http://www.bls.gov/oes/current/oes291111.htm.
Data. (n.d.). Centers for Disease Control and Prevention. Retrieved June 27, 2013, from,
www.cdc.gov
Data. (n.d.). Energy.Gov. Retrieved June 27, 2013, from, http://energy.gov
Data. (n.d.). Gallup. Retrieved June 27, 2013, from https://www.gallup.com/home.aspx
Data. (n.d.). La Leche League International. http://www.lalecheleague.org/Law/BAFeb01.html
Data. (n.d.). Toastmasters International. http://toastmasters.org/artisan/
detail.asp?CategoryID=1&SubCategoryID=10&ArticleID=429&Page=1.
Data. (n.d.). United States Census Bureau. Retrieved June 27, 2013, from,
https://www.census.gov/programs-surveys/sis/resources/data-tools/quickfacts.html
Data. (n.d.). United States Census Bureau. http://www.census.gov/hhes/socdemo/language/.
Data. (n.d.). Weather Underground. Retrieved June 27, 2013, from,
https://www.wunderground.com/
Deprez, E. E. NYC Smoking Rate Falls to Record Low of 14%, Bloomberg Says. Businessweek.
Retrieved June 27, 2013, from https://www.bloomberg.com/news/articles/2011-09-15/new-york-
city-adult-smoking-rate-falls-to-all-time-low-of-14-mayor-
says#:~:text=New%20York’s%20adult%20smoking%20rate,are%20smoking%2C%20the%20mayor%
20said
FBI. (n.d.). Uniform Crime Reports and Index of Crime in Daviess in the State of Kentucky enforced
by Daviess County from 1985 to 2005. The Disaster Center. Retrieved June 27, 2013, from,
http://www.disastercenter.com/kentucky/crime/3868.htm
Baseball-Almanac. (2013). World Series History. In Baseball-Almanac, 2013. Retrieved June 17,
2013, from, http://www.baseball-almanac.com/ws/wsmenu.shtml
Data. (n.d.). Graduating Engineer + Computer Careers. http://www.graduatingengineer.com
Data. (n.d.). In Microsoft Bookshelf.
Data. (n.d.). United States Senate. Retrieved June 17, 2013, from https://www.senate.gov/
Data. (n.d.). American Cancer Society. Retrieved June 17, 2013, from, http://www.cancer.org/index
Data. (1994, November). Chancellor’s Office, California Community Colleges.
Data. (December). Educational Resources.
Data. (n.d.). Hilton Hotels. Retrieved June 17, 2013, from, http://www.hilton.com
Data. (n.d.). Hyatt Hotels. Retrieved June 17, 2013, from, http://hyatt.com
Data. (n.d.). Statistics. United States Department of Health and Human Services.
Data. (n.d.). Whitney Exhibit on loan to San Jose Museum of Art.
State of the States. (2013). Gallup. Retrieved June 17, 2013, from, http://www.gallup.com/poll/
125066/State-States.aspx?ref=interactive
West Nile Virus. Centers for Disease Control and Prevention, National Center for Emerging and
Zoonotic Infectious Diseases (NCEZID), Division of Vector-Borne Diseases (DVBD). Retrieved June
17, 2013, from, http://www.cdc.gov/ncidod/dvbid/westnile/index.htm
AppleInsider Price Guides. (n.d.). Apple Insider. Retrieved June 17, 2013, from,
http://appleinsider.com/mac_price_guide
Data. (n.d.). World Bank.
DiCamillo, M., & Field, M. (2013, February 14). Most Californians See a Direct Linkage between
Obesity and Sugary Sodas. Two in Three Voters Support Taxing Sugar-Sweetened Beverages If Proceeds
are Tied to Improving School Nutrition and Physical Activity Programs. The Field Poll. Retrieved May
24, 2013, from, http://field.com/fieldpollonline/subscribers/Rls2436.pdf
Favorite Flavor of Ice Cream. (2016, October 22). Statistic Brain Research Institute.
http://www.statisticbrain.com/favorite-flavor-of-ice-cream
Youngest Online Entrepreneurs List. (2016, June 29). Statistic Brain Research Institute.
http://www.statisticbrain.com/youngest-online-entrepreneur-list
Data. (n.d.). Fourth-grade classroom in 1994 in a private K – 12 school, San Jose, CA.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J., & Ostrowski, E. (1994). A Handbook of Small
Datasets: Data for Fruitfly Fecundity. Chapman & Hall.
MLB Standings – 2012. ESPN. http://espn.go.com/mlb/standings/_/year/2012
Mackowiak, P. A., Wasserman, S. S., & Levine, M. M. (1992). A Critical Appraisal of 98.6
Degrees F, the Upper Limit of the Normal Body Temperature, and Other Legacies of Carl Reinhold
August Wunderlich. Journal of the American Medical Association, 268, 1578-1580.
This page provides a record of edits and changes made to this book since its initial publication.
Whenever edits or updates are made in the text, we provide a record and description of those
changes here. If the change is minor, the version number increases by 0.1. If the edits involve a
number of changes, the version number increases to the next full number.
The files posted alongside this book always reflect the most recent version.