
Inference for One Population Proportion

Sections 2.3-2.8, 3.1

Point Estimates
A large part of statistics is common sense. For example, we use the sample mean to estimate the
population mean and the sample proportion to estimate the population proportion. These estimates of
our population parameters are called point estimates.

We’ve encountered several point estimates so far:


Point estimate                                Parameter
Sample mean, x̄                                Population mean, μ
Sample variance, s²                           Population variance, σ²
Sample standard deviation, s                  Population standard deviation, σ
Sample proportion, p̂                          Population proportion, p
Regression:
Estimated response in linear model, ŷ         Mean response, μ_y|x
Estimated y-intercept, b0                     Population y-intercept, β0
Estimated slope, b1                           Population slope, β1

In Section 2.4, we will be using the sample proportion 𝑝̂ to estimate the population proportion 𝑝. We
will then extend that work to looking at the difference between two population proportions.

Introduction to Inference
So far, we’ve explored various methods for describing sample data once we have it. In this chapter, we
will discuss the basics of how to use sample data to make inferences that generalize to the
population at large. We will use sample statistics (our point estimates) to estimate population
parameters. Then we examine whether the point estimate we saw in our data is a typical value (could
have happened by chance alone) or whether it seems unusual (something other than chance is going
on). We begin with another dolphin example.

Section 2.4 Simulation Case Studies


Example: Can Dolphins Communicate?1
A famous study from the 1960s explored whether two dolphins (Doris and Buzz) could communicate
abstract ideas. Researchers believed dolphins could communicate simple feelings like “Watch out!” or
“I’m happy,” but Dr. Jarvis Bastian wanted to explore whether they could also communicate in a more
abstract way, much like humans do. To investigate this, Dr. Bastian spent many years training Doris and
Buzz and exploring the limits of their communicative ability.

During a training period lasting many months, Dr. Bastian placed buttons underwater on each end of a
large pool—two buttons for Doris and two buttons for Buzz. He then used an old automobile headlight

1 This example (and much of the wording) is taken from Tintle N.L., Chance B.L., Cobb G.W., Rossman A.J., Roy S.,
Swanson T.M., & VanderStoep J.L. (2021). Introduction to Statistical Investigations (2nd edition, pp. 23-26). John Wiley &
Sons, Inc. Many thanks to Tintle et al. for sharing these materials in a 2014 workshop that Dr. Miller attended.
as his signal. When he turned on the headlight and let it shine steadily, he intended for this signal to
mean “push the button on the right.” When he let the headlight blink on and off, this was meant as a
signal to “push the button on the left.” Every time the dolphins pushed the correct button, Dr. Bastian
gave the dolphins a reward of some fish. Over time Doris and Buzz caught on and could earn their fish
reward every time.

Then Dr. Bastian made things a bit harder. Now, Buzz had to push his button before Doris. If they didn’t
push the buttons in the correct order—no fish. After a bit more training, the dolphins caught on again
and could earn their fish reward every time. The dolphins were now ready to participate in the real
study to examine whether they could communicate with each other.

Dr. Bastian placed a large canvas curtain in the middle of the pool. Doris was on one side of the curtain
and could see the headlight, whereas Buzz was on the other side of the curtain and could not see the
headlight. Dr. Bastian turned on the headlight and let it shine steadily. He then watched to see what
Doris would do. After looking at the light, Doris swam near the curtain and began to whistle loudly.
Shortly after that, Buzz whistled back and then pressed the button on the right—he got it correct and
so both dolphins got a fish. But this single attempt was not enough to convince Dr. Bastian that Doris
had communicated with Buzz through her whistling. Dr. Bastian repeated the process several times,
sometimes having the light blink (so Doris needed to let Buzz know to push the left button) and other
times having it glow steadily (so Doris needed to let Buzz know to push the right button). He kept track
of how often Buzz pushed the correct button.

In this scenario, even if Buzz and Doris can communicate, we don’t necessarily expect Buzz to push the
correct button every time. We allow for some “randomness” in the process; maybe on one trial Doris
was a bit more underwater when she whistled and the signal wasn’t as clear for Buzz. Or maybe Buzz
and Doris aren’t communicating at all and Buzz guesses which button to push every time and just
happens to guess correctly once in a while. Our goal is to get an idea of how likely Buzz is to push the
correct button in the long run.

Dr. Bastian took some time to train the dolphins in order to get them to a point where he could test a
specific research conjecture. The research conjecture is that Buzz pushes the correct button more
often than he would if he and Doris were not communicating. Let’s be skeptical and assume that Buzz
and Doris were not communicating. In that case, Buzz would have no additional information that
would make him more likely to choose one button over the other—Buzz would just be guessing which
button to push.

If Buzz was just guessing, what is the chance that he would choose the correct button? How would this
chance change if there were 3 buttons? 4 buttons?



Let p represent the true proportion of times that Buzz would push the correct button if he was just
guessing. If Buzz was just guessing which of the two buttons to push, then p = 0.50. If Buzz pushes the
correct button more often than he would if he and Doris were not communicating, then we expect p to
be greater than 0.50.

In one phase of the study, Dr. Bastian had Buzz attempt to push the correct button a total of 16
different times. These 16 trials2 are a mere snapshot of Buzz’s overall selection process. We are
interested in Buzz’s actual long-run proportion (i.e., probability) of pushing the correct button based on
Doris’s whistles. This unknown long-run proportion is a (population) parameter, and we will denote it
with p.

Note that we are assuming this parameter is not changing over time, at least for the process used by
Buzz in this phase of the study. Because we can’t observe Buzz pushing the button forever, we need to
draw conclusions (possibly incorrect, but hopefully not) about the value of the parameter based only
on the 16 attempts in this phase of the study.

It will be helpful for us to consider what we might expect to see if Buzz and Doris were not
communicating.

Chance Models
Scientists use models to help understand complicated real-world phenomena. Statisticians often
employ chance models to generate data from random processes to help them investigate such
processes. We need to decide whether the process could be Buzz simply guessing or whether the
process is something else, such as Buzz and Doris communicating.

Let us first investigate the “Buzz was simply guessing” process. Because Buzz is choosing between two
equally-likely options, the simplest chance model to consider is a coin flip. We can flip a coin to
represent, or simulate, Buzz’s choice assuming he is just guessing which button to push. To generate
this artificial data, we can let “heads” represent the outcome that Buzz pushes the correct button and
let “tails” be the outcome that Buzz pushes the incorrect button. This gives Buzz a 50% chance of
pushing the correct button. This can be used to represent the “Buzz was just guessing” or the “random-
chance-alone” explanation.

The correspondence between the real study and the physical simulation is shown in the following
table:
Assuming that Buzz was just guessing…
Coin flip         = one simulated attempt by Buzz
Heads             = Buzz pushes the correct button
Tails             = Buzz pushes the incorrect button
Chance of heads   = 0.50 (Buzz is equally likely to guess either button)
One repetition    = one simulated set of 16 attempts

2 It seems more appropriate to call these trials instead of cases or observations, but it’s the same idea.
Now that we see how flipping a coin can simulate Buzz
guessing, let’s flip some coins to simulate Buzz’s performance.3
Suppose that on the first flip we got heads. What does this
mean?

What if we keep flipping the coin? Each time we flip the coin we are simulating another attempt where
Buzz guesses which button to push. Remember that heads represents Buzz guessing correctly and tails
represents Buzz guessing incorrectly.

How many times do we flip the coin?

After the tosses, we obtained the sequence of flips shown below:

Summarize the results:

Will we get this same result every time we flip a coin 16 times?

Here are the results of two more repetitions representing Buzz’s 16 trials. Calculate the values of the
simulated statistics.

Well, that was fun. Can we learn anything from these coin tosses when the results vary between the
sets of 16 tosses?4

3 We will use the applet at http://www.isi-stats.com/isi2nd/ISIapplets2021.html (Categorical Response → One Proportion).
4 Clearly, we can. Otherwise we wouldn’t be doing this.
Using and Evaluating the Coin Flip Chance Model
Because coin flipping is a random process, we know that we won’t obtain the same number of heads
with every set of 16 flips. But are some numbers of heads more likely than others? If we continue our
repetitions of 16 tosses, we can start to see how the outcomes for the number of heads are
distributed. Does the distribution of the number of heads that result in 16 flips have a predictable long-
run pattern? In particular, how much variability is there in our simulated statistics between repetitions
(sets of 16 flips) just by random chance?

In order to investigate these questions, we need to continue to flip our coin to get many, many sets of
16 flips (or many repetitions of the 16 choices where we are modeling Buzz simply guessing each time).
We did this, and the figure below shows what we found when we graphed the number of heads from
each set of 16 coin flips.

The plot on the left shows the number of heads in 100 repetitions of 16 coin flips, while the plot on the
right shows the number of heads in 1000 repetitions of 16 coin flips.5 We chose 100 repetitions
because 100 is small enough that we can see the individual dots, and 1000 repetitions because that
number is large enough to give us a fairly accurate sense of the long-run behavior for the number of
heads in 16 tosses.
Each dot in the plot on the left indicates the number of heads in one repetition of 16 coin flips.

The resulting number of heads follows a clear pattern: 7, 8, and 9 heads happened quite a lot, 6 and 10
were pretty common also, 5 and 11 happened some of the time, and the other values had fewer
occurrences.
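
If you would like to try this kind of simulation yourself, here is a minimal R sketch (the seed and the
number of repetitions are arbitrary choices; the applet does essentially the same thing):

# Simulate many repetitions of 16 coin flips under the "Buzz is just guessing" model.
set.seed(123)                                  # arbitrary seed so the results can be reproduced
reps <- 1000                                   # number of repetitions of 16 flips
heads <- rbinom(reps, size = 16, prob = 0.5)   # number of heads in each set of 16 flips
table(heads)                                   # how often each number of heads occurred
plot(table(heads))                             # a quick picture of the simulated null distribution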

What values appear to be unusual in the figure on the right?

Note: We refer to these unusual results as being out in the “tails” of the distribution.

5 Figure 1.1.4A from ISI, page 36.
Putting It All Together
In one phase of the study, Dr. Bastian had Buzz attempt to push the correct button a total of 16
different times. In this sample of 16 attempts, Buzz pushed the correct button 15 out of 16 times.

Our sample statistic is the proportion of times Buzz was correct in the 16 trials:

These 16 observations are a mere snapshot of Buzz’s overall selection process. We are interested in
Buzz’s actual long-run proportion (i.e., probability) of pushing the correct button based on Doris’s
whistles. This unknown long-run proportion is a (population) parameter, and we will denote it with p.
Note that we are assuming this parameter is not changing over time, at least for the process used by
Buzz in this phase of the study. Because we can’t observe Buzz pushing the button forever, we need to
draw conclusions (possibly incorrect, but hopefully not) about the value of the parameter based only
on these 16 attempts.

The researchers wondered if the dolphins were communicating. Buzz certainly pushed the correct button
most of the time, so we might consider either of the following:
• Buzz is just guessing (his probability of a correct button push is 0.50, p = 0.50) and he got lucky
in these 16 attempts.
• Buzz is doing something other than just guessing (his probability of a correct button push is
larger than 0.50, p > 0.50).
These are the two possible explanations to be evaluated. Because we can’t collect more data, we have
to base our conclusions only on the data we have. It’s certainly possible that Buzz was just guessing
and got lucky!

Does the “just guessing” proposition seem like a reasonable explanation to you? How would you argue
against someone who thought this was the case?

How does the analysis above help us address the strength of evidence for our research conjecture that
Buzz was doing something other than just guessing?

Even though we expect some variability in the results for different sets of 16 tosses, the pattern shown
in this distribution indicates that an outcome of 15 heads is outside the typical chance variability we
would expect to see when Buzz is simply guessing. Our coin flip chance model tells us that we
have very strong evidence that Buzz wasn’t just guessing.

Therefore, we don’t believe the “just guessing” explanation is a good one for Buzz. That is, we don’t
think our study result (15 out of 16 correct) happened by chance alone, but rather, something other
than “random chance” was at play. We don’t believe that Buzz was just guessing.



Why Use Simulations?
What would happen to the percentage of
heads if we flipped a fair coin 10,000 times?
Although this has been done,6 tossing a coin
that many times would take a very long
time. With computers, we can simulate
tossing a coin this many times in a matter of
milliseconds. The graph to the right shows
what happened in a simulation of 10,000
flips of a fair coin. It is easy to see that,
indeed, in the long run close to 50% of coin
flips land on heads (but not exactly, due to
randomness!).
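
For example, here is one way to simulate 10,000 flips of a fair coin in R (the seed is an arbitrary
choice; your proportion of heads will differ slightly from run to run):

set.seed(1)                                                 # arbitrary seed
flips <- sample(c("H", "T"), size = 10000, replace = TRUE)  # 10,000 fair coin flips
mean(flips == "H")                                          # proportion of heads, close to 0.50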

Our Plan Forward


We will now use both simulation and graphical representations of simulated results to help us answer
questions about what we would expect to happen by chance and to evaluate whether such a model fits
observed results from collected data or experiments.

Another Example: Rock-Paper-Scissors7


Rock-paper-scissors (also known as roshambo)8 is a two-player game in
which both players simultaneously “throw” one of three hand gestures:
rock (a closed fist), paper (a flat hand), or scissors (a fist with the index
and middle fingers extended in a ‘V’). The object of the game is to throw a
gesture that beats your opponent. Rock crushes scissors, scissors cut
paper, paper covers rock. Those who have played rock-paper-scissors
develop strategies to play the game. We might expect that those new to
the game (novices) would tend to throw each of the hand gestures 1/3 of
the time.

An article9 found that novice players tend to throw scissors less than 1/3
of the time. Suppose you decide to investigate this tendency with 20 people playing rock-paper-
scissors for the first time. You explain the rules of the game to the players and have them play one
round (throw one hand gesture). Suppose only 4 of the 20 novice players throw scissors.

Note that this scenario is similar to what we had with Buzz and Doris. We have repeated outcomes (20)
from the same random process (choosing a hand gesture in each play of the game). There are two
possible outcomes (scissors, not scissors).

6 Count Buffon of France tossed a coin 4040 times (2048 heads); around 1900 Karl Pearson tossed a coin 24,000 times
(12,012 heads); John Kerrich tossed a coin 10,000 times (5067 heads) while a prisoner of war. (Source: The Basic Practice
of Statistics, 9th edition, 2021.)
7 Again, credit to Tintle et al. for this great example.
8 Rock-paper-scissors image credit: By Enzoklop - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=27958688
9 Eyler, D., Shalla, Z., Doumaux, A., and McDevitt, T. (March 2009). Winning at Rock-Paper-Scissors. The College
Mathematics Journal (v. 40, no. 2), pages 125-128.
What is our parameter of interest?

Formulating Hypothesis Statements


This study, like the one we did with Doris and Buzz, has two competing claims. We typically call these the
null hypothesis and the alternative hypothesis, respectively. You may also see the alternative hypothesis
called the research hypothesis.

The null hypothesis (often denoted by H0) is a statement that there is no change, nothing is happening,
no difference, no relationship, or no effect in the underlying population. The null hypothesis usually
assumes the status quo. It is the claim that any differences we see in the sample results (when compared
to the status quo) are due to chance alone, that is, due to naturally occurring variability.

The alternative hypothesis (often denoted by HA) is a statement that something has changed,
something is happening, there is a difference, there is a relationship, or there is an effect in the
underlying population. It is the claim that any difference between the sample results compared to the
status quo is difficult to explain away as randomness and is not due to chance alone.

Note: It is not possible for both the null and alternative hypotheses to be true at the same time, so if
the alternative is true, the null is false, and vice versa.

To see these ideas, let's write the null and alternative hypotheses for our rock-paper-scissors example.

It is important to remember that the null and alternative hypotheses are statements about the
population parameter (here, the long-run relative frequency of picking scissors), not just about what
happened in the study (here, the proportion of scissors thrown by our 20 novice players). Note too that
we determined the alternative hypothesis based on our belief that novice players choose scissors less
than one-third of the time (not based on what our 20 novices did). It is important to state the
hypotheses prior to conducting a study, before any data are gathered.

Once we have formulated our research question and hypotheses, we collect data. Then we analyze the
data, determine how likely we would see results as weird as ours when the null hypothesis is true, and
make conclusions. To get a sense of the logic of a hypothesis test, let’s draw a parallel to the U.S. Court
System.



Example: Hypothesis Testing in the U.S. Court System
1. Research question and hypotheses: Is the defendant guilty of a crime?
H0: The defendant is innocent
HA: The defendant is guilty

2. Collect data: Detectives investigate the crime

3. Analyze data: Prosecution and defense present the result of the investigations in court

4. Evaluating the evidence: A jury deliberates about whether the prosecution has provided
evidence that calls into question the innocence of the defendant.

In practice, juries and judges have to determine whether there is convincing evidence to conclude that
the defendant is guilty. When there is convincing evidence, they find the defendant “guilty.” When
there is not convincing evidence, they find the defendant “not guilty.” Note: A verdict of “not guilty”
does not mean that the defendant is innocent; rather, it means that there was not enough evidence to
convince the jury or judge that the defendant is guilty.

Simulating the Study


How many scissors do we expect to be thrown by our 20 novice players if novice players throw scissors
one-third of the time?

Let’s investigate what we would expect if scissors are thrown one-third of the time. We need our null
model10 to reflect that a novice player throws scissors 1/3 of the time under the null hypothesis.

Let’s use 3 blue and yellow poker chips to simulate the study, where a blue poker chip indicates that a
novice player throws scissors, and a yellow poker chip indicates that a novice player throws something
other than scissors. How many of each color poker chip should we put in a bag?
_______ Blue _______ Yellow

We mix the poker chips thoroughly and draw one poker chip from the bag to represent one play of the
game. We repeat the mix-and-draw process a total of 20 times, recording the color, then replacing the
poker chip and reshuffling each time.

Should we draw from the bag with or without replacement?11

10 Now that we have formalized our terms, we are using the term “null model” to indicate that we are operating under
the assumption that the null hypothesis is true.
11 With replacement means that we draw a chip, note its color, and put it back in the bag. Without replacement means
that we draw a chip, note its color, and set it aside.
The correspondence between the real study and the physical simulation is shown in the table below:
Assuming scissors will be thrown 1/3 of the time…
One draw           = one novice player’s throw
Blue poker chip    = the player throws scissors
Yellow poker chip  = the player throws something other than scissors
Chance of blue     = 1/3 (the probability of scissors under the null hypothesis)
One repetition     = one simulated set of 20 throws

Here are three example results of 20 trials:


Result #1: Blue chips: _____ ; Yellow chips: _____ Simulated statistic:

Result #2: Blue chips: _____ ; Yellow chips: _____ Simulated statistic:

Result #3: Blue chips: _____ ; Yellow chips: _____ Simulated statistic:

Using and Evaluating the Null Model


Just like we did with the Doris and Buzz study, we use a computer simulation12 to run this experiment
many times so we can see the null distribution. The simulation of drawing a poker chip (with
replacement) 20 times when the probability of getting a blue poker chip is 1/3 was repeated 100 times
in the figure on the left (a number small enough so we can see the individual dots) and 2500 times13 in
the figure on the right—a number chosen for convenience but also large enough to give us a fairly
accurate sense of the long-run behavior for the proportion of blue poker chips in 20 draws.

Notice that the distribution is centered at 1/3, which is the probability under the null hypothesis that a
novice player throws scissors one-third of the time.14
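
A minimal R sketch of this simulation (assuming, as in the applet, 20 draws with a 1/3 chance of a blue
chip on each draw; the seed and the 2500 repetitions are arbitrary choices):

set.seed(42)                                      # arbitrary seed
reps <- 2500                                      # number of simulated sets of 20 novice players
scissors <- rbinom(reps, size = 20, prob = 1/3)   # number of scissors throws in each set
phat_sim <- scissors / 20                         # simulated sample proportions
plot(table(phat_sim))                             # distribution centered near 1/3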

12 We again use the applet at http://www.isi-stats.com/isi2nd/ISIapplets2021.html.
13 Why 2500? No reason in particular. We want you to know that you don’t always need to have 1000 total simulations.
We want enough simulations so that we can see an overall pattern and suggest at least 1000 total simulations.
14 The simulated null model will always be centered at the proportion specified in the null hypothesis.
Only 4 of the 20 novice players threw scissors. What is the observed
sample statistic?

Our alternative hypothesis was that novice players throw scissors less
than one-third of the time. What proportion of the simulated results (proportions of players throwing
scissors) are at least as extreme as the proportion we observed, in the direction of the alternative
hypothesis?

Evaluating the Results


We can use the results of the simulation to help us decide whether the null hypothesis model is a good
fit for our observed results. In other words, is our observed result something that we could expect to
see if the probability a novice player throws scissors is one-third?

What does this mean?

Section 2.3 Hypothesis Testing


In our examples, we simulated results under the assumption that the null hypothesis was true. In each
case, we calculated the proportion of the simulations that were at least as or more extreme in favor of
(in the direction of) the alternative hypothesis.15 We thought Buzz was correct more often than if he
were just guessing, so we looked at the right tail of the simulated distribution. We thought novice
players would throw scissors less than 1/3 of the time, so we looked at the left tail of the simulated
distribution.

Relative frequencies can be thought of as probabilities, so we can think of this proportion as an estimate
of the probability of observing a result as favorable to the alternative hypothesis as our observed data.
The probability that Buzz would push the correct button at least 15 out of 16 times if he were just
guessing which button to push was about 0, and the probability that 4 or fewer of our 20 novices would
throw scissors is about 0.15.

p-values
When we calculate the proportion of the simulated statistics that are at least as extreme (in the
direction of the alternative hypothesis), we are calculating an estimated p-value. We can estimate
the p-value by finding the proportion of the simulated statistics in the null distribution that are at least
as extreme (in the direction of the alternative hypothesis) as the value of the statistic actually observed
in the research study.
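
In R, the estimated p-value is just this proportion. Here is a sketch using the rock-paper-scissors null
model (p = 1/3, n = 20, observed p̂ = 4/20 = 0.20, alternative “less than”); the seed and the number of
repetitions are arbitrary:

set.seed(42)
phat_sim <- rbinom(2500, size = 20, prob = 1/3) / 20   # simulated sample proportions under the null
mean(phat_sim <= 0.20)                                 # proportion at least as extreme as observed (about 0.15)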

The p-value is the probability of obtaining a value of the statistic at least as extreme as the observed
statistic when the null hypothesis is true.

15 See https://education.wiley.com/content/Tintle_Intro_2e/media/simulations/faq/c01faq_1_2_1.pdf for a nice
explanation of why we include “or more extreme” in our calculation.
“The p-value takes into account the could-have-been outcomes (assuming the null hypothesis is true)
that are as extreme or more extreme than the one we observed. This provides a direct measure of our
strength of evidence against the ‘by-chance-alone’ or null model and allows for a standard, comparable
value for all scientific research studies. Smaller p-values mean the value of the observed statistic,
under the null model, is more unlikely by chance alone. Hence, smaller p-values indicate stronger
evidence against the null model.”16

Calculating p-values
Our estimated p-value gives us a sense of how unusual our study results are. As mentioned above, we
determine the proportion of the simulated statistics in the null distribution that are at least as
extreme (in the direction of the alternative hypothesis).

Students sometimes find it difficult to figure out which “tail” of the distribution we should use to
estimate the p-value. As such, we give you the following handy guide:
Alternative hypothesis   Tail of the distribution   “At least as extreme”
Less than (<)            Left tail                  Simulated statistics smaller than our study results
                                                    provide more evidence against the null hypothesis
Greater than (>)         Right tail                 Simulated statistics larger than our study results
                                                    provide more evidence against the null hypothesis
Not equal to (≠)         Both tails                 Simulated statistics in both tails provide more
                                                    evidence against the null hypothesis

The p-value and Strength of Evidence


We can use p-values to help us evaluate how well the null hypothesis model explains or “fits” our
observed sample results:
• If the p-value is large, this indicates that our observed results look like they could be a result of
the natural variation that we expect to see when we take random samples.
• The smaller a p-value is, the less inclined we are to think that our sample results are simply due
to natural variation. In other words, small p-values give us reason to doubt that the null model
is a good explanation for our observed results.

It pains us a bit to have a rule for large and small. However, we realize that it helps students to have a
guideline, so we provide the following table. We will use a (somewhat arbitrarily) chosen scale for
evaluating the amount of evidence a p-value gives us to doubt the compatibility of the null model with
our data:
If the p-value is: In this class, we will say we have:
Greater than 0.10 (p-value > 0.10) little evidence against H0
Between 0.05 and 0.10 (0.05 < p-value < 0.10) some evidence against H0
Between 0.01 and 0.05 (0.01 < p-value < 0.05) strong evidence against H0
Less than 0.01 (p-value < 0.01) very strong evidence against H0

For example, if the p-value is between 0.01 and 0.05, we will say we have strong evidence that the null
model is not a good fit for our observed sample results.

16 Tintle et al., page 49.
Examples
What is the estimated p-value for the Buzz and Doris study? How much evidence do we have against
the null hypothesis?

What is the p-value for the rock-paper-scissors study? How much evidence do we have against the null
hypothesis?

Why Not Just Pick One Value as a Cutoff?


Outside of Stats 250, you will most likely see 0.05 as a cutoff to determine statistical significance.
Often, students studying statistics are told that they should compare p-values to 0.05. In these cases,
p-values less than or equal to 0.05 are considered to be “small” (the null hypothesis is rejected in favor
of the alternative hypothesis; the results are considered to be statistically significant) while p-values
greater than 0.05 are considered to be “large” (the null hypothesis is not rejected; the results are not
considered to be statistically significant).

This “magic cutoff” of 0.05 can be traced back to a publication by Sir Ronald A. Fisher back in the 1920s
and came to prominence in the 1960s when the “FDA began using these statistical tests in decision-
making, and the 0.05 standard became enshrined in U.S. drug development.”17 It has only become
more entrenched in clinical trials over the decades.

We say that the data provide statistically significant evidence against the null hypothesis if the p-value
is less than some reference value, usually 𝛼 = 0.05.

What does this mean for you? Outside of Stats 250, you will very likely see the following:
Decision Conclusion
p-value ≤ 0.05 Reject the null hypothesis Statistically significant results
p-value > 0.05 Fail to reject the null hypothesis Not statistically significant results

Still wondering about 0.05? Check out the video that our authors prepared at www.openintro.org/why05.

Here’s a summary18 of our overall approach to assessing statistical significance:


• We observed a sample statistic (e.g., the number of “successes” or the proportion of
“successes” in the sample).
• Then we simulated “could-have-been” outcomes for that statistic under a specific chance
model (Buzz was just guessing, a novice rock-paper-scissors player will throw scissors 1/3 of the
time).
• Then we used the information we gained about the random variation in the “just chance”
17 Kennedy-Shaffer, Lee. “When the Alpha is the Omega: P-Values, ‘Substantial Evidence,’ and the 0.05 Standard at the
FDA.” Food Drug Law J. 2017; 72(4):595–635. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6169785/
18 Adapted from Tintle et al., page 37.
values of the statistic to help us judge whether the observed value of the statistic is an unusual
or a typical outcome. If it is unusual—we say the observed statistic is statistically significant—it
provides strong evidence that the chance-alone explanation is wrong. If it is typical, we consider
the chance model plausible.

Decision Errors
Hypothesis tests are not flawless. Just think of the court system: innocent people are sometimes
wrongly convicted and the guilty sometimes walk free. Similarly, data can point to the wrong
conclusion. However, what distinguishes statistical hypothesis tests from a court system is that our
framework allows us to quantify and control how often the data lead us to the incorrect conclusion.

Type 1 and Type 2 Errors


In the context of decision making, there are four possible scenarios that could arise depending on the
truth about the population parameter(s). They are summarized in the table below:
                                      Test Conclusion
                            Decide in favor of H0      Decide in favor of HA
Truth   H0 true                      okay                   Type 1 Error
        HA true (H0 false)       Type 2 Error                   okay

A Type 1 Error occurs when we decide in favor of the alternative hypothesis (reject H0) when H0 is
actually true.
A Type 2 Error occurs when we decide in favor of the null hypothesis (fail to reject H0) when HA is
actually true.

The 0.05 significance level that we discussed above was settled on by Fisher and the FDA (and others)
because they determined that the probability of a Type 1 error should be no more than 5%. That is, we
should reject a true null hypothesis at most 5% of the time.

Example: U.S. Court System


In the U.S. criminal court system, a person who is accused of a crime is assumed to be innocent until
proven guilty and it is up to the prosecution to present evidence. We can think of this as two
competing hypotheses:
H0: The defendant is innocent
HA: The defendant is guilty
                                          Test Conclusion
                                Decide in favor of H0      Decide in favor of HA
Truth   H0 true (innocent)               okay                   Type 1 Error
        HA true (guilty)             Type 2 Error                   okay



Example: Inspecting Parachutes
When a parachute is inspected, the inspector is looking for anything that might indicate that the
parachute will not open. We can think of this in terms of a hypothesis test.
H0: The parachute will open
HA: The parachute will not open
When the null hypothesis is rejected, the parachute is set aside and not used. When the null
hypothesis is not rejected, the parachute is put into use.
                                                   Conclusion
                                         Decide in favor of H0      Decide in favor of HA
Truth   H0 true (parachute will open)             okay                   Type 1 Error
        HA true (parachute will not open)     Type 2 Error                   okay

What are the consequences of each of the errors? Which error is worse?

Reducing the Error Rate


Suppose we want to reduce the rate of Type 1 Errors in the U.S. court system, that is, we want to
reduce the chance that an innocent person will be sent to jail. To do this, we would need to make it
harder to convict someone of a crime. We would require more convincing evidence that the defendant
was guilty. This could be a good idea, but at the same time, it would increase the chance that a guilty
person would go free. Any time you decrease the Type 1 Error rate you increase the Type 2 Error rate
(and vice versa).

Choosing a Significance Level


The chance of making a Type 1 Error is also called the “significance level” of the test. Often, a
researcher will decide ahead of time what significance level/Type 1 Error rate they are comfortable
with, based on the context of the research.

As an example from medical testing, a doctor may decide they are willing to give unnecessary
treatment to 5 out of 100 patients. In this case, they would choose a 5% significance level. Then, when
making a decision based on a hypothesis test, if the p-value was less than or equal to 0.05 (5%) the
doctor would decide against the null hypothesis and conclude the patient is sick and should receive
treatment. This is called “rejecting the null” at 5% significance. If, on the other hand, the p-value was
more than 0.05, the doctor would “not reject the null” and continue to act as if the null model is valid,
that is, they would not provide treatment to the patient.



By choosing a very small significance level you minimize the chance of making a Type 1 Error but at the
same time increase the chance of making a Type 2 Error. Conversely, if making a Type 2 Error is more
costly, you could choose a larger significance level to reduce the chance of making a Type 2 Error.
The significance level should reflect the real-world consequences associated with making a Type 1 or a
Type 2 Error and will vary from situation to situation.

Usually, a significance level/Type 1 Error rate is chosen ahead of time and then the chance of making a
Type 2 error for different alternative values of the parameter can be calculated.19

One-sided and Two-sided Hypotheses and p-values


The estimated p-values for the Buzz and Doris and rock-paper-scissors examples were both calculated
by looking at just one tail of the simulated null model. Sometimes we need to consider both tails of the
simulated null model in a two-sided hypothesis test.

Example: Senator or CEO?


One of the questions in a recent Stats 250 student survey was “Would you rather suddenly be elected a
senator or suddenly become a CEO of a major company? (You won’t have any more knowledge about
how to do either job than you do right now.)” Let p be the population proportion of Stats 250 students
who would prefer to suddenly be elected a senator. A priori, I don’t think that Stats 250 students
would prefer one of the two positions over the other, so I use the following hypotheses:
H0: Stats 250 students are equally likely to choose between senator and CEO, 𝑝 = 0.50.
HA: Stats 250 students prefer one option over the other, 𝑝 ≠ 0.50.

In a random sample of 50 Stats 250 students who completed the survey, 17 chose “suddenly be
elected a senator.” Our observed proportion is p̂ = 17/50 = 0.34.

Because our alternative hypothesis specified that one option is preferred over the other, it would have
been just as unusual for
there to have been 17 students who chose “suddenly become a CEO
of a major company” (so 33 would have chosen “suddenly be elected
a senator”). Our estimated p-value should include both extremes.
For a two-sided test, we often will take the area (or count of
observations) in a single tail and double it to get the p-value. Many
applets will count observations “beyond” the observed value in the
data. Use this value as an approximate p-value for randomization
tests. Later in the course we will learn techniques that will add to our methods for calculating p-values.
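
Here is one way to estimate this two-sided p-value by simulation in R (the seed and the number of
repetitions are arbitrary; 0.66 = 33/50 is the value just as far above 0.50 as our observed 0.34 is
below it):

set.seed(7)                                            # arbitrary seed
phat_sim <- rbinom(5000, size = 50, prob = 0.50) / 50  # simulated sample proportions under the null model
mean(phat_sim <= 0.34 | phat_sim >= 0.66)              # proportion at least as far from 0.50 as observed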

Note: Unless a priori the research question indicates one side or the other, you should perform a two-
sided test.

Caution! Hypotheses should be set up before seeing the data. Switching to a one-sided test after
performing the experiment is bad statistical practice!

19 We won’t directly calculate the chance of making a Type 2 error in this course—we leave that and power to your
future statistics courses.
The basic ideas that we learned for doing inference for one population proportion will be extended to
different situations throughout the course. By spending time learning and understanding the basics of
inference, we have set ourselves up for success.

Normal Theory
Simulations serve as a good way to see how inference works. However, sometimes simulations are
expensive, so it’s helpful to be able to use theory-based methods for our inference. They are much
cheaper to use and perform well, provided that certain conditions are satisfied.

In our examples, we compared our observed statistic (the sample proportion, 𝑝̂ ) to what we would
expect to see when the null hypothesis is true. To assess the evidence against the null hypothesis, we
simulated the null distribution. It’s not a coincidence that the simulated null distributions looked
similar:
Can dolphins communicate? (Here, the number of heads represents the number of times Buzz pushes
the correct button.)
Rock-Paper-Scissors (Here, the number of successes represents the number of plays where scissors are
thrown.)

Back in the 1900s, however, no one wanted to sit around flipping a coin or shuffling decks of cards all
day long. Instead, they “focused their attention on mathematical and probabilistic rules and theories
that could predict what would happen”20 if many repetitions of a simulation were done. The Central
Limit Theorem (CLT) came out of the work that those theoretical statisticians did.

Section 2.5 The Central Limit Theorem


In this section we consider the Central Limit Theorem as it applies to one proportion.21 If we look at a
proportion and the scenario satisfies certain conditions, then the sample proportion will appear to
follow a bell-shaped curve called the normal distribution.

Central Limit Theorem (as applied to one proportion)


When the sample size is large enough, the distribution of sample proportions will be approximately
normal, centered at the long-run proportion (p), with a standard deviation of √(p(1 − p)/n).

20 Ibid, p. 77.
21 We will see how the Central Limit Theorem applies to means later in the course.
Here is a plot of a simple normal distribution. Imagine superimposing it
over each of the null distributions on the previous page and witnessing
a relatively good fit.

Mathematical theory guarantees that a sample proportion 𝑝̂ will have an approximately normal
distribution when two conditions are met:
• The observations must be independent.
• The sample is large enough. Just how large is large enough? That differs from one context to
the next, and we’ll provide guidelines as we encounter them.22
We will formalize our theory-based inference for a single proportion shortly. Before we do that, we
need to talk about the normal distribution.

Section 2.6 The Normal Distribution


“Among all the distributions we see in statistics, one is overwhelmingly the most common. The
symmetric, unimodal, bell curve is ubiquitous throughout statistics. It is so common that people often
know it as the normal curve, normal model, or normal distribution. Under certain conditions, sample
proportions, sample means, and differences can be modeled using the normal distribution.”23

“All Models Are Wrong, but Some Are Useful”24


Many summary statistics and variables are nearly normal, but none are exactly normal. Thus the
normal distribution, while not perfect for any single problem, is very useful for a variety of problems.
We will use it in data exploration and to solve important problems in statistics.

The area under a normal curve can be considered a probability. In this section we will discuss the
common features of different normal curves, and learn how to use technology to find the areas we are
interested in.

There are five characteristics that normal curves have in common:


1. Normal curves are symmetric
2. Normal curves are unimodal
3. Normal curves are bell-shaped
4. Normal curves are centered at the mean of the distribution
5. The total area under the normal curve is 1

22 It turns out that the large enough condition is not satisfied for either the dolphin or the rock-paper-scissors example.
However, the simulated null distribution is more symmetric because p = 0.50 in the dolphin example.
23 ISRS, page 85
24 Attributed to George Box
Despite these common characteristics, normal distributions can look quite different. This is because all
normal distributions can be adjusted using two parameters, the mean and the standard deviation.
• Changing the mean of a normal curve shifts the mean to the left or to the right.
• Changing the standard deviation of a normal curve stretches or constricts the curve around the
mean.

Notation: When a normal curve has mean 𝜇 and standard deviation 𝜎, we will write the distribution as
the 𝑁(𝜇, 𝜎) distribution.

Figures 2.20 and 2.21 from the text show the 𝑁(0,1) and 𝑁(19, 4) distributions so that we can
compare them.

Because the mean and standard deviation describe a normal distribution exactly, they are called the
distribution’s parameters. The mean 𝜇 specifies the center of the distribution, and the standard
deviation 𝜎 specifies the variability of the distribution.

Standard Scores: The Standard Score as a Comparative Measuring Tool


The standard deviation is a useful “yardstick” for measuring how far a typical value falls from the
mean. In particular, knowing how many standard deviations above or below the mean an observation
is can help us get a sense about how unusual that observation is.

The standard score is the distance between the observed value and the mean, measured in terms of
number of standard deviations:
standard score = (observed value − expected value) / standard deviation = (x − μ) / σ

We can interpret standard scores as quantifying the number of standard deviations an observation falls
from its mean or expected value. Values that are above the mean have positive standard scores, and
values that are below the mean have negative standard scores.
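
For example (using made-up numbers for illustration), an observation of 130 from a distribution with
mean 100 and standard deviation 15 has standard score (130 − 100)/15 = 2, so it lies 2 standard
deviations above the mean, while an observation of 85 has standard score (85 − 100)/15 = −1, one
standard deviation below the mean.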

Example: Golden Retriever Weights


Weights of adult female golden retrievers have mean 60 pounds and standard deviation 2.5 pounds.
Luna, an adult female golden retriever, weighs 53 pounds. Calculate the standard score for Luna’s
weight.



Example: Using Standard Scores to Make Comparisons
Historically, scores on the Scholastic Assessment Test (SAT) can be modeled with a normal distribution.
SAT Math scores and SAT Evidence-Based Reading and Writing (ERW) are each reported on a scale of
200 to 800. In a recent year, SAT Math scores had a mean of 523 and a standard deviation of 117. That
same year, SAT ERW scores had a mean of 528 and a standard deviation of 105. Ryan took the SAT that
year and earned 760 on the SAT Math portion and 750 on the SAT ERW portion. Relatively speaking,
did Ryan score higher on the SAT Math or the SAT ERW portion of the test?

Random Variables
A random variable assigns a number to each possible outcome. For example, if we let 𝑋 represent the
SAT Math score for a randomly selected student, 𝑋 is a random variable. Each random variable will
have a distribution that specifies how the possible values of the random variable are distributed. In the
previous example, 𝑋 has a normal distribution with mean 523 and standard deviation 117.

Notation: When a random variable 𝑋 has a normal distribution with mean 𝜇 and standard deviation 𝜎,
we use the notation 𝑋~𝑁(𝜇, 𝜎).25

When the distribution of the random variable is normal (or approximately normal), we can calculate
probabilities in addition to calculating standard scores.

Standard Scores and the Normal Model


Any time we standardize a normally distributed random variable X as Z = (X − μ) / σ,
we get a random variable that has the standard normal 𝑁(0, 1) distribution. The standard score for a
normally distributed random variable is often called the z-score.

Calculating Probabilities with the Normal Model


Probabilities for random variables that are normally distributed can be
found by calculating the area under the normal curve. For example, if the
weights of adult female golden retrievers are normally distributed with
mean 60 pounds and standard deviation 2.5 pounds, the probability that a
randomly selected adult female golden retriever weighs less than Luna’s 53
pounds is about 0.003.

25 The tilde (~) here stands for “is distributed as.” 𝑋~𝑁(𝜇, 𝜎) then is shorthand for “the random variable X has a normal
distribution with mean 𝜇 and standard deviation 𝜎.” (Sometimes notation can be very helpful!)
Because calculating areas under the normal curve is extremely painstaking, 18th-century statisticians
developed the standard normal table. The first widely produced standard normal table was Sheppard’s
1903 table.26 The following is an excerpt27

Some introductory statistics courses still use tables to calculate probabilities, but we can do even
better than that by using R to calculate probabilities. When we use R, we don’t even need to calculate
the z-score—instead we just provide R the values of the parameters 𝜇 and 𝜎.

Using R to Find Probabilities, Areas, and Quantiles


Consider a general 𝑁(𝜇, 𝜎) distribution. There are two built-in functions in R that we can use. Here
they are with their defaults:
• pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
Given a quantile28 q, the function pnorm() gives us the area to the left of q under the normal
curve.
• qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
Given a number p between 0 and 1, the function qnorm() returns the p-th quantile of the
normal distribution.
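
For example, here are a few calls with made-up numbers (not from any example in these notes), using a
standard normal and an assumed N(100, 15) distribution:

pnorm(1.5)                                            # area to the left of 1.5 under the N(0, 1) curve
pnorm(120, mean = 100, sd = 15)                       # area to the left of 120 under a N(100, 15) curve
pnorm(120, mean = 100, sd = 15, lower.tail = FALSE)   # area to the right of 120
qnorm(0.25, mean = 100, sd = 15)                      # the value with 25% of the area to its left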

Example: Golden Retriever Weights


Weights of adult female golden retrievers are normally distributed. Recall that the mean weight is 60
pounds, and the standard deviation is 2.5 pounds.
a. What R code would we use to find the proportion of adult female golden retrievers who weigh less
than Luna?

26 David, H. A. (2005). “Tables Related to the Normal Distribution: A Short History.” The American Statistician, November
2005, Vol. 59, No. 4.
27 Sheppard, W. F. (1903). “New Tables of the Probability Integral,” Biometrika, 2, 174–190.
28 Quantiles are cut points of a distribution that divide the distribution into equal areas. For example, the median divides
a distribution into 2 equal AREAS, and quartiles divide a distribution into 4 equal AREAS. When we pass q to the pnorm()
function, we are asking R for the area (probability) to the left of the value q; when we pass p to qnorm(), we are asking R
for the value that has probability p to its left.

b. What proportion of adult female golden retrievers weigh between 58 and 63 pounds?

c. What proportion of adult female golden retrievers weigh more than 64 pounds?

d. How much does an adult female golden retriever weigh if she is in the 10th percentile?

The 68-95-99.7 Rule (a.k.a. Empirical Rule)


Many data sets we encounter in the natural world follow the normal curve, so it’s good to have a
handy rule to help us make quick estimations about our data and to check if our results seem
reasonable.

For a general 𝑁(𝜇, 𝜎) distribution:29

o Approximately 68% of the data fall between 𝜇 ± 1𝜎 (and thus have a z-score between –1 and 1).
o Approximately 95% of the data fall between 𝜇 ± 2𝜎 (and thus have a z-score between –2 and 2).
o Approximately 99.7% of the data fall between 𝜇 ± 3𝜎 (and thus have a z-score between –3 and 3).

29 ISRS Figure 2.27, page 94.
Values of random variables can fall more than three standard deviations from the mean, but these
values are extremely rare if the data are nearly normal. The empirical rule gives us a quick way to think
about how unusual an observed value is.
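
You can check these percentages yourself with pnorm():

pnorm(1) - pnorm(-1)   # about 0.68
pnorm(2) - pnorm(-2)   # about 0.95
pnorm(3) - pnorm(-3)   # about 0.997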

Using the Normal Model for Inference for a Single Proportion


It will be helpful for us to have common terminology and notation. In the Doris and Buzz example, each
“guess” by Buzz can be thought of as a trial. We could label each trial a success if Buzz pushed the
correct button and a failure if Buzz pushed the incorrect button.

Trial, success, and failure


A single event that leads to an outcome can be called a trial. If the trial has two possible outcomes,
e.g., heads or tails when flipping a coin, we typically label one of those outcomes a success and the
other a failure. The choice of which outcome is labeled a success and which a failure is arbitrary, and it
will not impact the results of our analyses.

Note: We typically make the outcome of interest the success in the trial. For example, if we were
interested in the proportion of first-generation college students in Stats 250, a success would be that a
Stats 250 student is a first-generation college student, and a failure would be that a Stats 250 student
is not a first-generation college student.

When a proportion is recorded, it is common to use a 1 to represent a “success” and a 0 to represent a
“failure.” For example, in the Doris and Buzz example, there were 16 trials (15 successes and 1 failure).
The sample proportion can be computed as:
p̂ = (number of successes) / (number of trials) = 15/16 = 0.9375

When we repeatedly take a sample from a population and calculate the sample proportion, 𝑝̂ , we
generate a sampling distribution30 that resembles the normal distribution. There are conditions we
need to have to apply this normal distribution framework to the distribution of 𝑝̂ .

Using the Normal Model for a Sample Proportion


The sampling distribution for 𝑝̂ , taken from a sample of size n from a population with true proportion
p, is nearly normal when the following two conditions are met:
1. Independence: the sample observations are independent of one another
2. Success-Failure Check: there are at least 10 successes and 10 failures

When these conditions are met, then the sampling distribution of p̂ is nearly normal with mean p. The
standard deviation for this sampling distribution is called the standard error (SE) and is calculated as
SE(p̂) = √(p(1 − p)/n).

Because we typically don’t know the population proportion p,31 we need to estimate it. For hypothesis
tests, we use the hypothesized population proportion p0 to estimate p.

30 A sampling distribution is the distribution of all possible values for the sample statistic. The sampling distribution gives
us an idea of the values the sample statistic can take.
31 If we knew p, we wouldn’t be doing inference!
Wait. What’s the Difference Between the Standard Deviation and the Standard Error?
The standard deviation refers to the variability in data (sample standard deviation 𝑠) or in populations
(population standard deviation 𝜎), whereas the term “standard error” refers to the standard deviation
of an estimate.

Here’s one way to keep these two things straight:


standard Deviation refers to Data, standard Error refers to an Estimate.

The Independence Condition


The most important condition that we have is the requirement that our observations are independent
of one another. Independence between observations means that the value of any one observation has
no impact on the value of any other observation in the data set.
• If we are told that the data are from a random sample, then we can treat the observations as
independent.
• If we are not told the data are from a random sample, then we need to think about whether it
is reasonable for us to treat the observations as independent.

Computing the Standard Error of 𝑝̂


As mentioned above, we typically don’t know the population proportion p, so we need to substitute in
some value to check conditions and estimate the standard error of p̂. For hypothesis tests, we use the
null value of the population proportion, p0, to check the success-failure condition and to estimate the
standard error. We use p0 in the calculation for the standard error because the hypothesized null value
of p0 is used to create the null model.
Success-failure check for hypothesis test: The expected number of successes and failures when
the null hypothesis is true are both at least 10. That is, both np0 ≥ 10 and n(1 − p0) ≥ 10.

Hypothesis Testing for a Population Proportion


We are already familiar with hypothesis tests for a population proportion, but we formalize them now.

Basic steps of a hypothesis test:


1. Determine appropriate null and alternative hypotheses.
2. Check the conditions for performing the test.32
3. Calculate the test statistic and determine the p-value.
4. Evaluate the p-value and the compatibility of the null model.
5. Make a conclusion in the context of the problem.

Remember that our hypotheses come in the form of two competing claims. To test a particular value of
a population proportion, we have the following possible pairs of hypotheses:
H0: p ≤ p0 versus HA: p > p0
H0: p = p0 versus HA: p ≠ p0
H0: p ≥ p0 versus HA: p < p0
Sometimes it’s simpler to write the null hypothesis in all of these situations as H0: p = p0. It’s up to you
whether you want to include the inequality or not.

32 Check the success-failure condition with the expected number of successes and failures when the null hypothesis is
true: np0 ≥ 10 and n(1 − p0) ≥ 10.

Wait. What is this p0 and where does it come from?
This is the hypothesized value of the population proportion p that we will use to build the null model.
This p0 value comes from the research question, not from the data.

Earlier we talked about how, when we have a normal model for a variable, we can standardize that
variable to compute probabilities, as long as we have the mean and standard error for that statistic. In
general, we know a normal model can be used for the sample proportion p̂. The model is written as
N(p, SE(p̂)), where SE(p̂) = √(p(1 − p)/n).
Problematically, we don’t know the value of p, so we need to estimate it. For hypothesis testing, we
assume that the null hypothesis is true, so we use our hypothesized value p0 to estimate p in the
standard error formula. Our estimated standard error for p̂ is then SE(p̂) = √(p0(1 − p0)/n).

Test Statistic
A test statistic is the name for a standardized sample statistic. The test statistic tells us how our sample
statistic (p̂) compares to the hypothesized value p0, using the standard error as our “yardstick.” Since
we assume that the null hypothesis is true, we use a hypothesized value, p = p0, to build a null model.
The standardized test statistic for a sample proportion is
z = (p̂ − p0) / √(p0(1 − p0)/n)
Under the null model, this z-test statistic will have approximately the standard normal N(0, 1)
distribution, and we use this to compute the p-value for the test.
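
As a sketch of this calculation in R (the values of p̂, p0, and n below are placeholders, not from a
particular study; this version assumes a “greater than” alternative):

phat <- 0.70                                     # hypothetical observed sample proportion
p0   <- 0.50                                     # hypothetical null value
n    <- 100                                      # hypothetical sample size
z    <- (phat - p0) / sqrt(p0 * (1 - p0) / n)    # standardized test statistic
pnorm(z, lower.tail = FALSE)                     # one-sided (right-tail) p-value under the null model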

Example
Calculate the test statistic for the Buzz and Doris example.

p-value Reminder
Definition: The p-value is the probability of obtaining a value of the statistic at least as extreme as the
observed statistic when the null hypothesis is true.

We still don’t want to specify just one cutoff that determines when our results are statistically
significant. Rather, we continue to use the following table to give us a guideline about how much
evidence we have against the null hypothesis.
If the p-value is:                                     In this class, we will say we have:
Greater than 0.10 (p-value > 0.10)                     little evidence against 𝐻₀
Between 0.05 and 0.10 (0.05 < p-value < 0.10)          some evidence against 𝐻₀
Between 0.01 and 0.05 (0.01 < p-value < 0.05)          strong evidence against 𝐻₀
Less than 0.01 (p-value < 0.01)                        very strong evidence against 𝐻₀
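If it helps, the guideline in this table can be written as a small R helper (a sketch of the course’s verbal scale, nothing more):

evidence_against_H0 <- function(p_value) {
  # Translate a p-value into the wording used in this class
  if (p_value > 0.10) "little evidence against H0"
  else if (p_value > 0.05) "some evidence against H0"
  else if (p_value > 0.01) "strong evidence against H0"
  else "very strong evidence against H0"
}
evidence_against_H0(0.03)   # "strong evidence against H0"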



Example
Calculate the p-value for the Buzz and Doris example and comment on the strength of the evidence
against the null hypothesis that Buzz is just guessing.

Example: Generational Opinions about Increased Racial and Ethnic Diversity


The headline “Generation Z Looks a Lot Like Millennials on Key Social and Political Issues”33 might catch
your attention. According to Pew Research, 61% of Millennials (those born between 1981 and 1996)
think that increased racial and ethnic diversity is a good thing for our society. Is the percentage of Gen
Zers (those born between 1997 and 2010) who think increased racial and ethnic diversity is a good
thing for our society higher than 61%?

Step 1: Hypotheses
𝐻" : 𝑝 = 0.61, where the parameter 𝑝 represents the population proportion of all Gen Zers
who think increased racial and ethnic diversity is a good thing for our society
𝐻) : 𝑝 > 0.61

The Pew Research study revealed that 730 of 1178 Gen Zers surveyed said increased racial and ethnic
diversity is a good thing for our society.

Step 2: Conditions
• Independence: The question stem above does not tell us that Pew took a random
sample from the population of all Gen Zers. If we had no other information, we would
need to think about whether the observations are independent. The methodology
section of the article (https://www.pewresearch.org/social-
trends/2019/01/17/generations-methodology/) tells us that the data were collected
from two surveys that used random sampling, so the opinions of one respondent were
independent of the opinions of any other respondent.
• Success-Failure:
𝑛𝑝₀ = 1178(0.61) = 718.58 ≥ 10 and 𝑛(1 − 𝑝₀) = 1178(0.39) = 459.42 ≥ 10
We expect at least 10 successes and at least 10 failures when the null hypothesis is true, so this condition is satisfied.

Step 3: Calculations (𝑝̂, test statistic, p-value)

33
https://www.pewsocialtrends.org/2019/01/17/generation-z-looks-a-lot-like-millennials-on-key-social-and-political-
issues/
The p-value is the probability of getting results
at least as extreme as the sample results,
under the null model. Since we have a one-sided
test to the right, toward the larger values…
p-value =

Step 4: Evaluate the p-value and the compatibility of the null model with observed results.

Step 5: Make a conclusion in the context of the problem.

Note: In lab, you will learn how to do these calculations with a function called prop_test. That will
free you up to check conditions and think about the conclusion that you can make. Here is the output
from prop_test for this example:
prop_test(x = 730, n = 1178, p = 0.61, alternative = "greater")

1-sample proportions test without continuity correction

data: x out of n, null probability p


Z = 0.68218, p-value = 0.2476
alternative hypothesis: true p is greater than 0.61
95 percent confidence interval:
0.596429 1.000000
sample estimates:
p
0.6196944
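For comparison, here is a by-hand version of the same calculation in base R; it reproduces the test statistic and p-value shown in the prop_test output above.

x <- 730; n <- 1178; p0 <- 0.61
p_hat <- x / n                            # 0.6196944
se_null <- sqrt(p0 * (1 - p0) / n)        # standard error using the null value
z <- (p_hat - p0) / se_null               # 0.68218
pnorm(z, lower.tail = FALSE)              # 0.2476, the one-sided p-value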

Section 2.8 Confidence Intervals


Remember that one of our goals is to use sample statistics to estimate population parameters. A point
estimate provides a single plausible value for a population parameter using collected sample data. For
example, we use the sample mean 𝑥̅ to estimate the population mean 𝜇 and we use the sample
proportion 𝑝̂ to estimate the population proportion p.

We have also seen that samples vary. As a result, sample statistics vary. So, while a point estimate
gives us an estimate for a parameter, it is far more useful to provide a plausible range of values for that
parameter. Statisticians call this range of plausible values a confidence interval. We might construct a
confidence interval after finding evidence against a null hypothesis, or on its own when we want to
determine a reasonable range of values for our parameter of interest.



In general, the form of a confidence interval will be
point estimate ± margin of error
The margin of error is the “wiggle room” that we provide to account for the sample-to-sample
variability of the point estimate; it is the product of a multiplier and the standard error of the sample
statistic.

A confidence interval for the population proportion 𝑝 is:


𝑝̂ ± multiplier × 𝑆𝐸(𝑝̂ )

Recall that the standard error of 𝑝̂ is given by $SE(\hat{p}) = \sqrt{p(1-p)/n}$. When we worked with hypothesis
tests, we used 𝑝₀ in our calculation of 𝑆𝐸(𝑝̂). Since there is not a 𝑝₀ for confidence intervals, we need
to use our best guess for the population proportion 𝑝. The resulting formula for a confidence interval
for the population proportion 𝑝 is:
$$\hat{p} \pm \text{multiplier} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
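As a quick sketch in R (with hypothetical values, and leaving the choice of multiplier to the discussion that follows):

p_hat <- 0.62                             # hypothetical sample proportion
n <- 400                                  # hypothetical sample size
mult <- 2                                 # rough multiplier; see the discussion below
se <- sqrt(p_hat * (1 - p_hat) / n)       # estimated standard error of p-hat
p_hat + c(-1, 1) * mult * se              # lower and upper bounds of the interval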

What multiplier should we use? Well, it depends. In particular, it depends on how confident we want
to be in our interval. Consider two extremes:
• If we set multiplier = 0, the confidence “interval” is simply 𝑝̂.
We have no confidence that this “interval” contains the actual population proportion, 𝑝.
• If we set multiplier = ∞, the confidence interval is (−∞, ∞).
We can be 100% confident that this interval contains the actual population proportion, 𝑝,
because we have specified all possible values. In fact, we are 100% confident that the interval
[0, 1] contains the population proportion, 𝑝.
Neither of these is good!

What Do We Mean by “Confident”?


Understanding confidence can be tricky for students. The “confidence” we have is in the process of
constructing our confidence intervals. The confidence level tells us how sure we are that the
confidence interval we constructed contains the population parameter we are estimating.

You might recall that the 68-95-99.7 Rule tells us that, when
we have a normal distribution, about 95% of the values will
be within 2 standard deviations of the mean.34

In our context, about 95% of the 𝑝̂ values will be within
2 × 𝑆𝐸(𝑝̂) of the population proportion 𝑝. And, when about
95% of the 𝑝̂ values are within 2 × 𝑆𝐸(𝑝̂) of the population
proportion 𝑝, it is also true that about 95% of all intervals of the form
$\hat{p} \pm 2 \times \sqrt{\hat{p}(1-\hat{p})/n}$
contain the population proportion 𝑝.

34
Figure 3.2.1 from Tintle et al.’s ISI
Let’s examine this idea of “confidence” through an example.

Example: Estimating the Proportion


In a random sample of 320 undergraduate students in Stats 250, 251 returned to Ann Arbor for Fall
2020 classes. Using this information, estimate the proportion of all Stats 250 students who returned to
Ann Arbor for Fall 2020 classes.

When we construct a confidence interval, we calculate just one


set of reasonable values for the population parameter. However,
we’re interested in understanding what it means for us to have
confidence in the process. This graph shows the 𝑝̂ values for 1000
repetitions of sampling 320 students from the population of all
Fall 2020 Stats 250 students. Note that the plot of the 1000 𝑝̂
values is approximately normal. The plot is centered at the
population proportion (which, for now, is unknown).

Our sample proportion of 𝑝̂ = 251/320 = 0.784 is somewhere in this distribution, somewhat near the
center of the distribution but not at the center.

An approximate 95% confidence interval for the population proportion of Stats 250 students who
returned to Ann Arbor for Fall 2020 classes is calculated as

What’s the probability that the confidence interval we just calculated contains 𝑝?

Our confidence is in the process of constructing confidence intervals, not in any


one interval itself. The output to the right35 is a simulation of 100 95% confidence
intervals taken from a population with the actual population proportion, 𝑝 =
0.7739. Notice that the parameter is within the bounds of 96 of the intervals.
Only 4 intervals “missed” the parameter.

Key idea: Because about 95% of sample proportions are within 2 standard errors
of the parameter, approximately 95% of the intervals we create using this
method will include the parameter.

Note: The 2 we used as our multiplier is just an approximation.
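A small simulation in R makes this coverage idea concrete. This sketch uses the same population proportion (p = 0.7739) and sample size (n = 320) as the example above:

set.seed(250)
p <- 0.7739; n <- 320; reps <- 1000
p_hat <- rbinom(reps, size = n, prob = p) / n     # 1000 simulated sample proportions
se <- sqrt(p_hat * (1 - p_hat) / n)               # estimated SE for each sample
covered <- (p_hat - 2 * se <= p) & (p <= p_hat + 2 * se)
mean(covered)                                     # roughly 0.95 of the intervals cover p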

35
Generated using http://www.rossmanchance.com/applets/ConfSim.html.
The Gist of Confidence Intervals (in Five Key Points):
1. The value of the sample estimate will vary from one sample to the next. The values typically vary
around the population parameter.
2. The standard error of the sample estimate provides an idea of how far away the estimate
would tend to vary from the parameter value (on average).
3. The general format for a confidence interval is given by:
sample estimate ± (a few) standard errors
4. The “few,” or number of standard errors we go out each way from the sample estimate, depends
on what coverage rate (i.e., how confident) we want to be. The “(a few) standard errors” part is
called the margin of error.
5. The “how confident” we want to be is referred to as the confidence level. This level reflects
how confident we are in the procedure. Most of the intervals that are calculated using this
procedure will contain the true parameter value, but occasionally intervals will be produced
that do not.
Note: Each interval either contains the population parameter or it doesn’t. The confidence level
is the percentage of the time we expect the procedure to produce an interval that does contain
the population parameter in the long run.

Conditions for Constructing Confidence Intervals for 𝑝


Before constructing a confidence interval for 𝑝, we need to check two conditions to be sure that the
sampling distribution of 𝑝̂ will be approximately normal:
1. Independence: the sample observations are independent of one another
2. Success-Failure Check: there are at least 10 successes and 10 failures

For confidence intervals, the sample proportion, 𝑝̂ , is used to check the success-failure condition and
to estimate the standard error because we do not know the actual value of the parameter, 𝑝.
Success-failure check for confidence intervals: The number of successes and the number of
failures are both at least 10. When we are not given the counts, we check this condition with
𝑛𝑝̂ ≥ 10 and 𝑛(1 − 𝑝̂ ) ≥ 10.

What Multiplier Should We Use?


The 2 that we used as our multiplier was just an estimate. The
more exact multiplier for a 95% confidence interval is 1.96.
Why?



The 1.96 we just found is called the critical value and is denoted by 𝑧 ∗ . Our confidence intervals for the
population proportion 𝑝 can be written as
𝑝̂ ± 𝑧 ∗ × 𝑆𝐸(𝑝̂ )
In this calculation,
𝑝̂ is the point estimate
𝑆𝐸(𝑝̂ ) is the standard error36 for 𝑝̂
𝑧 ∗ × 𝑆𝐸(𝑝̂ ) is the margin of error

The table below gives a summary of confidence levels that are commonly seen in statistical studies,
along with their associated multipliers.
Confidence Level Multiplier
90% 1.65
95% 1.96
99% 2.58
After looking at this table, you’ll probably notice a key idea underlying confidence intervals: If you want
to be more confident in your interval of plausible values, you need to make your interval wider.37
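These multipliers come straight from the standard normal distribution; in R, qnorm returns them (note that 1.645 is often rounded to 1.65):

qnorm(0.975)   # 1.96  -> 95% confidence (2.5% in each tail)
qnorm(0.95)    # 1.645 -> 90% confidence
qnorm(0.995)   # 2.576 -> 99% confidence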

Let’s do another example to put all of the pieces together.

Example: Should Marriages Between Same-Sex Couples Be Legally Valid?


Just before the five-year anniversary of the Supreme Court ruling that all U.S. states must grant same-
sex marriages and recognize same-sex marriages granted in other states,38 a Gallup poll found that 67%
of a random sample of 1028 American adults believe that marriages between same-sex couples should
be recognized by the law as valid.39
a. Check the conditions for constructing a confidence interval for the proportion of all American adults
who believe that marriages between same-sex couples should be recognized by the law as valid.

b. Calculate a 99% confidence interval for the proportion of all American adults who believe that
marriages between same-sex couples should be recognized by the law as valid.

36
Remember that standard error is the name for the standard deviation of an estimate.
37
Imagine kicking a field goal in American football. Regulation field goal uprights are 18’6” apart. What would this distance
have to be to give a field goal kicker a better chance of making a field goal? A worse chance of making a field goal?
38
Obergefell v. Hodges, June 26, 2015
39
https://news.gallup.com/poll/311672/support-sex-marriage-matches-record-high.aspx
c. What does this confidence interval tell us?

d. What is the probability that the population proportion is in the interval we constructed?

e. A 95% confidence interval produced from the same survey results would be
a. narrower
b. wider
c. the same width as
the interval computed in (b).

f. Can you use this confidence interval to conclude that a majority of American adults believe that
marriages between same-sex couples should be recognized by the law as valid? More than 65%?
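If you want to check your answer to part (b), one way to carry out the calculation in base R (a sketch using the 67% and n = 1028 given in the question) is:

p_hat <- 0.67; n <- 1028
z_star <- qnorm(0.995)                    # multiplier for 99% confidence
se <- sqrt(p_hat * (1 - p_hat) / n)       # estimated standard error of p-hat
p_hat + c(-1, 1) * z_star * se            # 99% confidence interval bounds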

Sample Size Calculations


Statisticians frequently find themselves not only analyzing data but also helping others decide how to
collect data most effectively and how much data to collect. We can perform sample size calculations
that are helpful in planning a study. Our task will be to identify an appropriate sample size that ensures
the margin of error is no larger than some value m. That is, we want
$$ME = z^* \sqrt{\frac{p(1-p)}{n}} < m$$
Generally, we plug in a suitable value for 𝑧 ∗ for the confidence level we plan to use and then solve for
the sample size n.

There are two unknowns in the equation: p and n. If we have an estimate of p, perhaps from a similar
survey, we could use that value. If we have no such estimate, we must use some other value for p. The
margin of error for a proportion is largest when p is 0.5,40 so we typically use this worst-case estimate if
no other estimate is available.
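Rearranging the inequality gives n ≥ (z*/m)² p(1 − p). As a sketch, this can be wrapped in a small R helper; the margin of error of 0.04 below is a hypothetical value for illustration, not one from these notes:

sample_size <- function(m, conf = 0.95, p = 0.5) {
  # Smallest n whose margin of error is at most m; p = 0.5 is the worst case
  z_star <- qnorm(1 - (1 - conf) / 2)
  ceiling((z_star / m)^2 * p * (1 - p))
}
sample_size(m = 0.04)   # 601 with 95% confidence and the worst-case p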

40
Think about the value of 𝑝 that makes 𝑝(1 − 𝑝) the largest.
Example: Is Computer Science Education Important?
We plan to survey parents and guardians of 7th to 12th graders in Michigan about the importance of
learning computer science. We want to sample enough parents and guardians of 7th to 12th graders in
Michigan to estimate the true proportion who think it is very important or extremely important to
learn computer science within about 3% with a 95% confidence level.
a. How many parents and guardians should we include in our sample?

b. Would our sample size decrease or increase if we wanted to use a higher confidence level? Why?

c. What would happen to our sample size if we wanted a smaller margin of error at the same level of
confidence?

Looking Forward
In this set of notes, we focused first on understanding the concepts behind inference and formalized
the conditions for theory-based inference for one population proportion. Next, we will discuss
inference for the difference between two proportions. Then we will move on to inference for means
followed by inference for simple linear regression. We round out the semester with an introduction to
multiple regression.
