3 Inference For One Population Proportion
Point Estimates
A large part of statistics is common sense. For example, we use the sample mean to estimate the
population mean and the sample proportion to estimate the population proportion. These estimates of
our population parameters are called point estimates.
In Section 2.4, we will be using the sample proportion 𝑝̂ to estimate the population proportion 𝑝. We
will then extend that work to looking at the difference between two population proportions.
Introduction to Inference
So far, we’ve explored various methods for describing sample data once we have it. In this chapter, we
will discuss the basics of how to use sample data to make inferences about (that is, generalize to) the
population at large. We will use sample statistics (our point estimates) to estimate population
parameters. Then we examine whether the point estimate we saw in our data is a typical value (could
have happened by chance alone) or whether it seems unusual (something other than chance is going
on). We begin with another dolphin example.
During a training period lasting many months, Dr. Bastian placed buttons underwater on each end of a
large pool—two buttons for Doris and two buttons for Buzz. He then used an old automobile headlight
1 This example (and much of the wording) is taken from Tintle, N.L., Chance, B.L., Cobb, G.W., Rossman, A.J., Roy, S., Swanson, T.M., & VanderStoep, J.L. (2021). Introduction to Statistical Investigations (2nd edition, pp. 23-26). John Wiley & Sons, Inc. Many thanks to Tintle et al. for sharing these materials in a 2014 workshop that Dr. Miller attended.
as his signal. When he turned on the headlight and let it shine steadily, he intended for this signal to
mean “push the button on the right.” When he let the headlight blink on and off, this was meant as a
signal to “push the button on the left.” Every time the dolphins pushed the correct button, Dr. Bastian
gave the dolphins a reward of some fish. Over time Doris and Buzz caught on and could earn their fish
reward every time.
Then Dr. Bastian made things a bit harder. Now, Buzz had to push his button before Doris. If they didn’t
push the buttons in the correct order—no fish. After a bit more training, the dolphins caught on again
and could earn their fish reward every time. The dolphins were now ready to participate in the real
study to examine whether they could communicate with each other.
Dr. Bastian placed a large canvas curtain in the middle of the pool. Doris was on one side of the curtain
and could see the headlight, whereas Buzz was on the other side of the curtain and could not see the
headlight. Dr. Bastian turned on the headlight and let it shine steadily. He then watched to see what
Doris would do. After looking at the light, Doris swam near the curtain and began to whistle loudly.
Shortly after that, Buzz whistled back and then pressed the button on the right—he got it correct and
so both dolphins got a fish. But this single attempt was not enough to convince Dr. Bastian that Doris
had communicated with Buzz through her whistling. Dr. Bastian repeated the process several times,
sometimes having the light blink (so Doris needed to let Buzz know to push the left button) and other
times having it glow steadily (so Doris needed to let Buzz know to push the right button). He kept track
of how often Buzz pushed the correct button.
In this scenario, even if Buzz and Doris can communicate, we don’t necessarily expect Buzz to push the
correct button every time. We allow for some “randomness” in the process; maybe on one trial Doris
was a bit more underwater when she whistled and the signal wasn’t as clear for Buzz. Or maybe Buzz
and Doris aren’t communicating at all and Buzz guesses which button to push every time and just
happens to guess correctly once in a while. Our goal is to get an idea of how likely Buzz is to push the
correct button in the long run.
Dr. Bastian took some time to train the dolphins in order to get them to a point where he could test a
specific research conjecture. The research conjecture is that Buzz pushes the correct button more
often than he would if he and Doris were not communicating. Let’s be skeptical and assume that Buzz
and Doris were not communicating. In that case, Buzz would have no additional information that
would make him more likely to choose one button over the other—Buzz would just be guessing which
button to push.
If Buzz was just guessing, what is the chance that he would choose the correct button? How would this
chance change if there were 3 buttons? 4 buttons?
In one phase of the study, Dr. Bastian had Buzz attempt to push the correct button a total of 16
different times. These 16 trials2 are a mere snapshot of Buzz’s overall selection process. We are
interested in Buzz’s actual long-run proportion (i.e., probability) of pushing the correct button based on
Doris’s whistles. This unknown long-run proportion is a (population) parameter, and we will denote it
with p.
Note that we are assuming this parameter is not changing over time, at least for the process used by
Buzz in this phase of the study. Because we can’t observe Buzz pushing the button forever, we need to
draw conclusions (possibly incorrect, but hopefully not) about the value of the parameter based only
on the 16 attempts in this phase of the study.
It will be helpful for us to consider what we might expect to see if Buzz and Doris were not communicating.
Chance Models
Scientists use models to help understand complicated real-world phenomena. Statisticians often
employ chance models to generate data from random processes to help them investigate such
processes. We need to decide whether the process could be Buzz simply guessing or whether the
process is something else, such as Buzz and Doris communicating.
Let us first investigate the “Buzz was simply guessing” process. Because Buzz is choosing between two
equally-likely options, the simplest chance model to consider is a coin flip. We can flip a coin to
represent, or simulate, Buzz’s choice assuming he is just guessing which button to push. To generate
this artificial data, we can let “heads” represent the outcome that Buzz pushes the correct button and
let “tails” be the outcome that Buzz pushes the incorrect button. This gives Buzz a 50% chance of
pushing the correct button. This can be used to represent the “Buzz was just guessing” or the “random-
chance-alone” explanation.
The correspondence between the real study and the physical simulation is shown in the following
table:
Coin flip simulation                Real study (assuming that Buzz was just guessing…)
Coin flip                           _________________________________
Heads                               _________________________________
Tails                               _________________________________
Chance of heads                     _________________________________
One repetition                      _________________________________
2 It seems more appropriate to call these trials instead of cases or observations, but it’s the same idea.
Now that we see how flipping a coin can simulate Buzz
guessing, let’s flip some coins to simulate Buzz’s performance.3
Suppose that on the first flip we got heads. What does this
mean?
What if we keep flipping the coin? Each time we flip the coin we are simulating another attempt where
Buzz guesses which button to push. Remember that heads represents Buzz guessing correctly and tails
represents Buzz guessing incorrectly.
Will we get this same result every time we flip a coin 16 times?
Here are the results of two more repetitions representing Buzz’s 16 trials. Calculate the values of the
simulated statistics.
Well, that was fun. Can we learn anything from these coin tosses when the results vary between the
sets of 16 tosses?4
3 We will use the applet at http://www.isi-stats.com/isi2nd/ISIapplets2021.html (Categorical Response → One Proportion).
4 Clearly, we can. Otherwise we wouldn’t be doing this.
Using and Evaluating the Coin Flip Chance Model
Because coin flipping is a random process, we know that we won’t obtain the same number of heads
with every set of 16 flips. But are some numbers of heads more likely than others? If we continue our
repetitions of 16 tosses, we can start to see how the outcomes for the number of heads are
distributed. Does the distribution of the number of heads that result in 16 flips have a predictable long-
run pattern? In particular, how much variability is there in our simulated statistics between repetitions
(sets of 16 flips) just by random chance?
In order to investigate these questions, we need to continue to flip our coin to get many, many sets of
16 flips (or many repetitions of the 16 choices where we are modeling Buzz simply guessing each time).
We did this, and the figure below shows what we found when we graphed the number of heads from
each set of 16 coin flips.
The plot on the left shows the number of heads in 100 repetitions of 16 coin flips, while the plot on the
right shows the number of heads in 1000 repetitions of 16 coin flips.5 We chose 100 repetitions
because 100 is small enough that we can see the individual dots, and 1000 repetitions because 1000 is
large enough to give us a fairly accurate sense of the long-run behavior for the number of heads in 16 tosses.
Each dot in the plot on the left indicates the number of heads in one repetition of 16 coin flips.
The resulting number of heads follows a clear pattern: 7, 8, and 9 heads happened quite a lot, 6 and 10
were pretty common also, 5 and 11 happened some of the time, and the other values had fewer
occurrences.
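The course applet (see footnote 3) generates these repetitions for us, but the same chance model can be sketched in a few lines of base R; the object names below are ours and purely illustrative:

# Simulate the "Buzz is just guessing" chance model:
# 1000 repetitions of 16 coin flips, counting heads (correct guesses) each time
set.seed(1)                         # so the simulation can be reproduced
num_heads <- rbinom(1000, size = 16, prob = 0.5)
table(num_heads)                    # how often each number of heads occurred
plot(table(num_heads))              # a quick picture of the simulated null distribution

Counts near 8 heads show up most often, and counts far out in the tails (such as 15 or 16) are rare, matching the pattern described above.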
Note: We refer to these unusual results as being out in the “tails” of the distribution.
5 Figure 1.1.4A from ISI, page 36.
Putting It All Together
In one phase of the study, Dr. Bastian had Buzz attempt to push the correct button a total of 16
different times. In this sample of 16 attempts, Buzz pushed the correct button 15 out of 16 times.
Our sample statistic is the proportion of times Buzz was correct in the 16 trials: 𝑝̂ = 15/16 = 0.9375.
These 16 observations are a mere snapshot of Buzz’s overall selection process. We are interested in
Buzz’s actual long-run proportion (i.e., probability) of pushing the correct button based on Doris’s
whistles. This unknown long-run proportion is a (population) parameter, and we will denote it with p.
Note that we are assuming this parameter is not changing over time, at least for the process used by
Buzz in this phase of the study. Because we can’t observe Buzz pushing the button forever, we need to
draw conclusions (possibly incorrect, but hopefully not) about the value of the parameter based only
on these 16 attempts.
The researchers wondered if the dolphins were communicating. Buzz certainly pushed the correct button most of the
time, so we might consider either of the following explanations:
• Buzz is just guessing (his probability of a correct button push is 0.50, p = 0.50) and he got lucky
in these 16 attempts.
• Buzz is doing something other than just guessing (his probability of a correct button push is
larger than 0.50, p > 0.50).
These are the two possible explanations to be evaluated. Because we can’t collect more data, we have
to base our conclusions only on the data we have. It’s certainly possible that Buzz was just guessing
and got lucky!
Does the “just guessing” proposition seem like a reasonable explanation to you? How would you argue
against someone who thought this was the case?
How does the analysis above help us address the strength of evidence for our research conjecture that
Buzz was doing something other than just guessing?
Even though we expect some variability in the results for different sets of 16 tosses, the pattern shown
in this distribution indicates that an outcome of 15 heads is outside the typical chance variability we
would expect to see when Buzz is simply guessing. Our coin flip chance model tells us that we
have very strong evidence that Buzz wasn’t just guessing.
Therefore, we don’t believe the “just guessing” explanation is a good one for Buzz. That is, we don’t
think our study result (15 out of 16 correct) happened by chance alone, but rather, something other
than “random chance” was at play. We don’t believe that Buzz was just guessing.
Example: Rock-Paper-Scissors
An article9 found that novice players tend to throw scissors less than 1/3
of the time. Suppose you decide to investigate this tendency with 20 people playing rock-paper-
scissors for the first time. You explain the rules of the game to the players and have them play one
round (throw one hand gesture). Suppose only 4 of the 20 novice players throw scissors.
Note that this scenario is similar to what we had with Buzz and Doris. We have repeated outcomes (20)
from the same random process (choosing a hand gesture in each play of the game). There are two
possible outcomes (scissors, not scissors).
6 Count Buffon of France tossed a coin 4040 times (2048 heads); around 1900 Karl Pearson tossed a coin 24,000 times (12,012 heads); John Kerrich tossed a coin 10,000 times (5067 heads) while a prisoner of war. (Source: The Basic Practice of Statistics, 9th edition, 2021.)
7 Again, credit to Tintle et al. for this great example.
8 Rock-paper-scissors image credit: By Enzoklop - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=27958688
9 Eyler, D., Shalla, Z., Doumaux, A., and McDevitt, T. (March 2009). Winning at Rock-Paper-Scissors. The College Mathematics Journal, v. 40, no. 2, pages 125-128.
What is our parameter of interest?
The null hypothesis (often denoted by 𝐻₀) is a statement that there is no change, nothing is happening,
no difference, no relationship, or no effect in the underlying population. The null hypothesis usually
assumes the status quo. It is the claim that any differences we see in the sample results (when compared
to the status quo) are due to chance alone, that is, due to naturally occurring variability.
The alternative hypothesis (often denoted by 𝐻ₐ) is a statement that something has changed,
something is happening, there is a difference, there is a relationship, or there is an effect in the
underlying population. It is the claim that any difference between the sample results compared to the
status quo is difficult to explain away as randomness and is not due to chance alone.
Note: It is not possible for both the null and alternative hypotheses to be true at the same time, so if
the alternative is true, the null is false, and vice versa.
To see these ideas, let's write the null and alternative hypotheses for our rock-paper-scissors example.
It is important to remember that the null and alternative hypotheses are statements about the
population parameter (here, the long-run relative frequency of picking scissors), not just about what
happened in the study (here, the proportion of scissors thrown by our 20 novice players). Note too that
we determined the alternative hypothesis based on our belief that novice players choose scissors less
than one-third of the time (not based on what our 20 novices did). It is important to state the
hypotheses prior to conducting a study, before any data are gathered.
Once we have formulated our research question and hypotheses, we collect data. Then we analyze the
data, determine how likely we would see results as weird as ours when the null hypothesis is true, and
make conclusions. To get a sense of the logic of a hypothesis test, let’s draw a parallel to the U.S. Court
System.
1. Formulate hypotheses: The defendant is presumed innocent (the null hypothesis) unless there is convincing evidence of guilt (the alternative hypothesis).
2. Collect data: Investigators and attorneys gather evidence about the case.
3. Analyze data: Prosecution and defense present the results of the investigations in court.
4. Evaluate the evidence: A jury deliberates about whether the prosecution has provided evidence that calls into question the innocence of the defendant.
In practice, juries and judges have to determine whether there is convincing evidence to conclude that
the defendant is guilty. When there is convincing evidence, they find the defendant “guilty.” When
there is not convincing evidence, they find the defendant “not guilty.” Note: A verdict of “not guilty”
does not mean that the defendant is innocent; rather, it means that there was not enough evidence to
convince the jury or judge that the defendant is guilty.
Let’s investigate what we would expect if scissors are thrown one-third of the time. We need our null
model10 to reflect that a novice player throws scissors 1/3 of the time under the null hypothesis.
Let’s use 3 poker chips, some blue and some yellow, to simulate the study, where a blue poker chip indicates that a
novice player throws scissors, and a yellow poker chip indicates that a novice player throws something
other than scissors. How many of each color poker chip should we put in a bag?
_______ Blue _______ Yellow
We mix the poker chips thoroughly and draw one poker chip from the bag to represent one play of the
game. We repeat the mix-and-draw process a total of 20 times, recording the color, then replacing the
poker chip and reshuffling each time.
10 Now that we have formalized our terms, we are using the term “null model” to indicate that we are operating under the assumption that the null hypothesis is true.
11 With replacement means that we draw a chip, note its color, and put it back in the bag. Without replacement means that we draw a chip, note its color, and set it aside.
The correspondence between the real study and the physical simulation is shown in the table below:
Poker chip simulation               Real study (assuming scissors will be thrown 1/3 of the time…)
One draw                            _________________________________
Blue poker chip                     _________________________________
Yellow poker chip                   _________________________________
Chance of blue                      _________________________________
One repetition                      _________________________________
Result #2: Blue chips: _____ ; Yellow chips: _____ Simulated statistic:
Result #3: Blue chips: _____ ; Yellow chips: _____ Simulated statistic:
Notice that the distribution is centered at 1/3, which is the probability, under the null hypothesis, that a
novice player throws scissors.14
12 We again use the applet at http://www.isi-stats.com/isi2nd/ISIapplets2021.html.
13 Why 2500? No reason in particular. We want you to know that you don’t always need to have 1000 total simulations. We want enough simulations so that we can see an overall pattern and suggest at least 1000 total simulations.
14 The simulated null model will always be centered at the proportion specified in the null hypothesis.
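Here is a minimal base-R sketch of this chip-drawing simulation (the course itself uses the applet mentioned in footnote 12); we use the smallest bag that gives a 1/3 chance of blue, and the object names are purely illustrative:

bag <- c("blue", "yellow", "yellow")     # 1 blue chip (scissors) and 2 yellow chips (not scissors)

# One repetition: 20 draws with replacement, then the simulated statistic
one_rep <- sample(bag, size = 20, replace = TRUE)
sum(one_rep == "blue") / 20

# 2500 repetitions of the null model, keeping the simulated proportion each time
set.seed(2)
sim_props <- replicate(2500, mean(sample(bag, 20, replace = TRUE) == "blue"))
hist(sim_props)                          # centered near 1/3, as described above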
Only 4 of the 20 novice players threw scissors. What is the observed
sample statistic?
Our alternative hypothesis was that novice players throw scissors less
than one-third of the time. What proportion of the simulated results
indicated throwing scissors are at least as extreme as we observed in
the direction of the alternative hypothesis?
Relative frequencies can be thought of as probabilities, so we can think of this proportion as an estimate
of the probability of observing a result as favorable to the alternative hypothesis as our observed data.
The probability that Buzz would push the correct button at least 15 out of 16 times if he were just
guessing which button to push was about 0, and the probability that 4 or fewer of our 20 novices would
throw scissors (if each novice throws scissors with probability 1/3) is about 0.15.
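These two probabilities can be estimated with a quick simulation in R (a sketch of the same idea as the applet, not a course tool):

set.seed(4)
# Estimated probability of 15 or more correct out of 16 when Buzz is just guessing (p = 0.5)
mean(rbinom(100000, size = 16, prob = 0.5) >= 15)    # close to 0 (the exact value is about 0.0003)
# Estimated probability of 4 or fewer scissors throws out of 20 when p = 1/3
mean(rbinom(100000, size = 20, prob = 1/3) <= 4)     # about 0.15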
p-values
When we calculate the proportion of the simulated statistics that are at least as extreme (in the
direction of the alternative hypothesis), we are calculating an estimated p-value. We can estimate
the p-value by finding the proportion of the simulated statistics in the null distribution that are at least
as extreme (in the direction of the alternative hypothesis) as the value of the statistic actually observed
in the research study.
The p-value is the probability of obtaining a value of the statistic at least as extreme as the observed
statistic when the null hypothesis is true.
15 See https://education.wiley.com/content/Tintle_Intro_2e/media/simulations/faq/c01faq_1_2_1.pdf for a nice explanation of why we include “or more extreme” in our calculation.
“The p-value takes into account the could-have-been outcomes (assuming the null hypothesis is true)
that are as extreme or more extreme than the one we observed. This provides a direct measure of our
strength of evidence against the ‘by-chance-alone’ or null model and allows for a standard, comparable
value for all scientific research studies. Smaller p-values mean the value of the observed statistic,
under the null model, is more unlikely by chance alone. Hence, smaller p-values indicate stronger
evidence against the null model.”16
Calculating p-values
Our estimated p-value gives us a sense of how unusual our study results are. As mentioned above, we
determine the proportion of the simulated statistics in the null distribution that are at least as
extreme (in the direction of the alternative hypothesis).
Students sometimes find it difficult to figure out which “tail” of the distribution we should use to
estimate the p-value. As such, we give you the following handy guide:
Alternative hypothesis    Tail of the distribution    “At least as extreme”
Less than (<)             Left tail                   Simulated statistics smaller than our study results provide more evidence against the null hypothesis
Greater than (>)          Right tail                  Simulated statistics larger than our study results provide more evidence against the null hypothesis
Not equal to (≠)          Both tails                  Simulated statistics in both tails provide more evidence against the null hypothesis
It pains us a bit to have a rule for large and small. However, we realize that it helps students to have a
guideline, so we provide the following table. We will use a (somewhat arbitrarily) chosen scale for
evaluating the amount of evidence a p-value gives us to doubt the compatibility of the null model with
our data:
If the p-value is:                                 In this class, we will say we have:
Greater than 0.10 (p-value > 0.10)                 little evidence against 𝐻₀
Between 0.05 and 0.10 (0.05 < p-value < 0.10)      some evidence against 𝐻₀
Between 0.01 and 0.05 (0.01 < p-value < 0.05)      strong evidence against 𝐻₀
Less than 0.01 (p-value < 0.01)                    very strong evidence against 𝐻₀
For example, if the p-value is between 0.01 and 0.05, we will say we have strong evidence that the null
model is not a good fit for our observed sample results.
16 Tintle et al., page 49.
Examples
What is the estimated p-value for the Buzz and Doris study? How much evidence do we have against
the null hypothesis?
What is the p-value for the rock-paper-scissors study? How much evidence do we have against the null
hypothesis?
This “magic cutoff” of 0.05 can be traced back to a publication by Sir Ronald A. Fisher in the 1920s
and came to prominence in the 1960s when the “FDA began using these statistical tests in decision-
making, and the 0.05 standard became enshrined in U.S. drug development.”17 It has only become
more entrenched in clinical trials over the decades.
We say that the data provide statistically significant evidence against the null hypothesis if the p-value
is less than some reference value, usually 𝛼 = 0.05.
What does this mean for you? Outside of Stats 250, you will very likely see the following:
                  Decision                              Conclusion
p-value ≤ 0.05    Reject the null hypothesis            Statistically significant results
p-value > 0.05    Fail to reject the null hypothesis    Not statistically significant results
Still wondering about 0.05? Check out the video that our authors prepared at www.openintro.org/why05.
Decision Errors
Hypothesis tests are not flawless. Just think of the court system: innocent people are sometimes
wrongly convicted and the guilty sometimes walk free. Similarly, data can point to the wrong
conclusion. However, what distinguishes statistical hypothesis tests from a court system is that our
framework allows us to quantify and control how often the data lead us to the incorrect conclusion.
A Type 1 Error occurs when we decide in favor of the alternative hypothesis (reject 𝐻₀) when 𝐻₀ is actually true.
A Type 2 Error occurs when we decide in favor of the null hypothesis (fail to reject 𝐻₀) when 𝐻ₐ is actually true.
The 0.05 significance level that we discussed above was settled on by Fisher and the FDA (and others)
because they determined that the probability of a Type 1 error should be no more than 5%. That is, we
should reject a true null hypothesis at most 5% of the time.
What are the consequences of each of the errors? Which error is worse?
In the medical testing example above, a doctor may decide they are willing to give unnecessary
treatment to 5 out of 100 patients. In this case, they would choose a 5% significance level. Then, when
making a decision based on a hypothesis test, if the p-value was less than or equal to 0.05 (5%) the
doctor would decide against the null hypothesis and conclude the patient is sick and should receive
treatment. This is called “rejecting the null” at 5% significance. If, on the other hand, the p-value was
more than 0.05, the doctor would “not reject the null” and continue to act as if the null model is valid,
that is, they would not provide treatment to the patient.
Usually, a significance level/Type 1 Error rate is chosen ahead of time and then the chance of making a
Type 2 error for different alternative values of the parameter can be calculated.19
In a random sample of 50 Stats 250 students who completed the survey, 17 chose “suddenly be
elected a senator.” Our observed proportion is 𝑝̂ = 17/50 = 0.34.
Note: Unless a priori the research question indicates one side or the other, you should perform a two-
sided test.
Caution! Hypotheses should be set up before seeing the data. Switching to a one-sided test after
performing the experiment is bad statistical practice!
19 We won’t directly calculate the chance of making a Type 2 error in this course—we leave that and power to your future statistics courses.
The basic ideas that we learned for doing inference for one population proportion will be extended to
different situations throughout the course. By spending time learning and understanding the basics of
inference, we have set ourselves up for success.
Normal Theory
Simulations serve as a good way to see how inference works. However, sometimes simulations are
expensive, so it’s helpful to be able to use theory-based methods for our inference. They are much
cheaper to use and perform well, provided that certain conditions are satisfied.
In our examples, we compared our observed statistic (the sample proportion, 𝑝̂ ) to what we would
expect to see when the null hypothesis is true. To assess the evidence against the null hypothesis, we
simulated the null distribution. It’s not a coincidence that the simulated null distributions looked
similar:
Can dolphins communicate? Here, the number of heads represents the number of times Buzz pushes the correct button.
Rock-Paper-Scissors: Here, the number of successes represents the number of plays where scissors are thrown.
Back in the 1900s, however, no one wanted to sit around flipping a coin or shuffling decks of cards all
day long. Instead, they “focused their attention on mathematical and probabilistic rules and theories
that could predict what would happen”20 if many repetitions of a simulation were done. The Central
Limit Theorem (CLT)21 came from the work of those theoretical statisticians.
20 Ibid, p. 77.
21 We will see how the Central Limit Theorem applies to means later in the course.
Here is a plot of a simple normal distribution. Imagine superimposing it
over each of the null distributions on the previous page and witnessing
a relatively good fit.
Mathematical theory guarantees that a sample proportion 𝑝̂ will have an approximately normal
distribution when two conditions are met:
• The observations must be independent.
• The sample is large enough. Just how large is large enough? That differs from one context to
the next, and we’ll provide guidelines as we encounter them.22
We will formalize our theory-based inference for a single proportion shortly. Before we do that, we
need to talk about the normal distribution.
The area under a normal curve can be considered a probability. In this section we will discuss the
common features of different normal curves, and learn how to use technology to find the areas we are
interested in.
22 It turns out that the large enough condition is not satisfied for either the dolphin or the rock-paper-scissors example. However, the simulated null distribution is more symmetric because p = 0.50 in the dolphin example.
23 ISRS, page 85.
24 Attributed to George Box.
Despite these common characteristics, normal distributions can look quite different. This is because all
normal distributions can be adjusted using two parameters, the mean and the standard deviation.
• Changing the mean of a normal curve shifts the curve to the left or to the right.
• Changing the standard deviation of a normal curve stretches or constricts the curve around the
mean.
Notation: When a normal curve has mean 𝜇 and standard deviation 𝜎, we will write the distribution as
the 𝑁(𝜇, 𝜎) distribution.
Figures 2.20 and 2.21 from the text show the 𝑁(0,1) and 𝑁(19, 4) distributions so that we can
compare them.
Because the mean and standard deviation describe a normal distribution exactly, they are called the
distribution’s parameters. The mean 𝜇 specifies the center of the distribution, and the standard
deviation 𝜎 specifies the variability of the distribution.
The standard score is the distance between the observed value and the mean, measured in terms of
number of standard deviations:
standard score = (observed value − expected value) / standard deviation = (𝑥 − 𝜇) / 𝜎
We can interpret standard scores as quantifying the number of standard deviations an observation falls
from its mean or expected value. Values that are above the mean have positive standard scores, and
values that are below the mean have negative standard scores.
Random Variables
A random variable assigns a number to each possible outcome. For example, if we let 𝑋 represent the
SAT Math score for a randomly selected student, 𝑋 is a random variable. Each random variable will
have a distribution that specifies how the possible values of the random variable are distributed. In the
previous example, 𝑋 has a normal distribution with mean 523 and standard deviation 117.
Notation: When a random variable 𝑋 has a normal distribution with mean 𝜇 and standard deviation 𝜎,
we use the notation 𝑋~𝑁(𝜇, 𝜎).25
When the distribution of the random variable is normal (or approximately normal), we can calculate
probabilities in addition to calculating standard scores.
25 The tilde (~) here stands for “is distributed as.” 𝑋~𝑁(𝜇, 𝜎) then is shorthand for “the random variable X has a normal distribution with mean 𝜇 and standard deviation 𝜎.” (Sometimes notation can be very helpful!)
Because calculating areas under the normal curve is extremely painstaking, 18th-century statisticians
developed the standard normal table. The first widely produced standard normal table was Sheppard’s
1903 table.26 The following is an excerpt27
Some introductory statistics courses still use tables to calculate probabilities, but we can do even
better than that by using R to calculate probabilities. When we use R, we don’t even need to calculate
the z-score—instead we just provide R the values of the parameters 𝜇 and 𝜎.
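As a small sketch of what this looks like, using the SAT Math distribution from the Random Variables section, 𝑋~𝑁(523, 117) (the score of 640 below is just an illustrative value we chose):

# P(X < 640) for X ~ N(523, 117), supplying the parameters directly
pnorm(640, mean = 523, sd = 117)     # about 0.84

# The same probability via the standard score z = (640 - 523)/117 = 1
pnorm((640 - 523) / 117)             # pnorm() defaults to mean = 0 and sd = 1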
26 David, H. A. (2005). “Tables Related to the Normal Distribution: A Short History.” The American Statistician, November 2005, Vol. 59, No. 4.
27 Sheppard, W. F. (1903). “New Tables of the Probability Integral,” Biometrika, 2, 174-190.
28 Quantiles are cut points of a distribution that divide the distribution into equal areas. For example, the median divides a distribution into 2 equal AREAS, and quartiles divide a distribution into 4 equal AREAS. When we pass q to the qnorm() function, we are asking R to give us the value that has probability q to its left.
b. What proportion of adult female golden retrievers weigh between 58 and 63
pounds?
d. How much does an adult female golden retriever weigh if she is in the 10th
percentile?
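As a sketch of how these calculations look in R, using purely hypothetical values for the mean and standard deviation (substitute the values given with this example):

mu <- 60      # hypothetical mean weight in pounds, for illustration only
sigma <- 4    # hypothetical standard deviation in pounds, for illustration only

# (b) proportion of weights between 58 and 63 pounds
pnorm(63, mean = mu, sd = sigma) - pnorm(58, mean = mu, sd = sigma)

# (d) the weight at the 10th percentile
qnorm(0.10, mean = mu, sd = sigma)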
o Approximately 68% of the data fall between 𝜇 ± 1𝜎 (and thus have a z-score between –1 and 1).
o Approximately 95% of the data fall between 𝜇 ± 2𝜎 (and thus have a z-score between –2 and 2).
o Approximately 99.7% of the data fall between 𝜇 ± 3𝜎 (and thus have a z-score between –3 and 3).
29 ISRS Figure 2.27, page 94.
Values of random variables can fall more than three standard deviations from the mean, but these
values are extremely rare if the data are nearly normal. The empirical rule gives us a quick way to think
about how unusual an observed value is.
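A quick way to check the 68-95-99.7 percentages is with pnorm() for the standard normal distribution (mean 0 and standard deviation 1):

pnorm(1) - pnorm(-1)    # about 0.683
pnorm(2) - pnorm(-2)    # about 0.954
pnorm(3) - pnorm(-3)    # about 0.997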
Note: We typically make the outcome of interest the success in the trial. For example, if we were
interested in the proportion of first-generation college students in Stats 250, a success would be that a
Stats 250 student is a first-generation college student, and a failure would be that a Stats 250 student
is not a first-generation college student.
When we repeatedly take a sample from a population and calculate the sample proportion, 𝑝̂, we
generate a sampling distribution30 that resembles the normal distribution. There are conditions that
must be met before we can apply this normal distribution framework to the distribution of 𝑝̂.
When these conditions are met, the sampling distribution of 𝑝̂ is nearly normal with mean 𝑝. The
standard deviation for this sampling distribution is called the standard error (SE) and is calculated as
𝑆𝐸(𝑝̂) = √( 𝑝(1 − 𝑝) / 𝑛 )
Because we typically don’t know the population proportion 𝑝,31 we need to estimate it. For hypothesis
tests, we use the hypothesized population proportion 𝑝₀ to estimate 𝑝.
30 A sampling distribution is the distribution of all possible values for the sample statistic. The sampling distribution gives us an idea of the values the sample statistic can take.
31 If we knew p, we wouldn’t be doing inference!
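To see the standard error formula in action, here is a small sketch that simulates many sample proportions and compares their spread to the formula; the values p = 0.3 and n = 100 are arbitrary choices for illustration, not from any example above:

set.seed(3)
p <- 0.3                                           # illustrative population proportion
n <- 100                                           # illustrative sample size
p_hats <- rbinom(10000, size = n, prob = p) / n    # 10,000 simulated sample proportions
sd(p_hats)                                         # spread of the simulated sampling distribution
sqrt(p * (1 - p) / n)                              # the standard error formula, about 0.046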
Wait. What’s the Difference Between the Standard Deviation and the Standard Error?
The standard deviation refers to the variability in data (sample standard deviation 𝑠) or in populations
(population standard deviation 𝜎), whereas the term “standard error” refers to the standard deviation
of an estimate.
Remember that our hypotheses come in the form of two competing claims. To test a particular value of
a population proportion, we have the following possible pairs of hypotheses:
𝐻₀: 𝑝 ≤ 𝑝₀ versus 𝐻ₐ: 𝑝 > 𝑝₀
𝐻₀: 𝑝 = 𝑝₀ versus 𝐻ₐ: 𝑝 ≠ 𝑝₀
𝐻₀: 𝑝 ≥ 𝑝₀ versus 𝐻ₐ: 𝑝 < 𝑝₀
Sometimes it’s simpler to write the null hypothesis in all of these situations as 𝐻₀: 𝑝 = 𝑝₀. It’s up to you
whether you want to include the inequality or not.
32 Check the success-failure condition with the expected number of successes and failures when the null hypothesis is true: 𝑛𝑝₀ ≥ 10 and 𝑛(1 − 𝑝₀) ≥ 10.
Wait. What is this 𝑝₀ and where does it come from?
This is the hypothesized value of the population proportion 𝑝 that we will use to build the null model.
This 𝑝₀ value comes from the research question, not from the data.
Earlier we talked about how, when we have a normal model for a variable, we can standardize that
variable to compute probabilities, as long as we have the mean and standard error for that statistic. In
general, we know a normal model can be used for the sample proportion 𝑝̂. The model is written as
𝑁(𝑝, 𝑆𝐸(𝑝̂)), where 𝑆𝐸(𝑝̂) = √( 𝑝(1 − 𝑝) / 𝑛 ).
Problematically, we don’t know the value of 𝑝, so we need to estimate it. For hypothesis testing, we
assume that the null hypothesis is true, so we use our hypothesized value 𝑝₀ to estimate 𝑝 in the
standard error formula. Our estimated standard error for 𝑝̂ is then 𝑆𝐸(𝑝̂) = √( 𝑝₀(1 − 𝑝₀) / 𝑛 ).
Test Statistic
A test statistic is the name for a standardized sample statistic. The test statistic tells us how our sample
statistic (𝑝̂) compares to the hypothesized value 𝑝₀, using the standard error as our “yardstick.” Since
we assume that the null hypothesis is true, we use a hypothesized value, 𝑝 = 𝑝₀, to build a null model.
The standardized test statistic for a sample proportion is
𝑧 = (𝑝̂ − 𝑝₀) / √( 𝑝₀(1 − 𝑝₀) / 𝑛 )
Under the null model, this z-test statistic will have approximately the standard normal 𝑁(0, 1)
distribution, and we use this to compute the p-value for the test.
Example
Calculate the test statistic for the Buzz and Doris example.
p-value Reminder
Definition: The p-value is the probability of obtaining a value of the statistic at least as extreme as the
observed statistic when the null hypothesis is true.
We still don’t want to specify just one cutoff that determines when our results are statistically
significant. Rather, we continue to use the following table to give us a guideline about how much
evidence we have against the null hypothesis.
If the p-value is:                                 In this class, we will say we have:
Greater than 0.10 (p-value > 0.10)                 little evidence against 𝐻₀
Between 0.05 and 0.10 (0.05 < p-value < 0.10)      some evidence against 𝐻₀
Between 0.01 and 0.05 (0.01 < p-value < 0.05)      strong evidence against 𝐻₀
Less than 0.01 (p-value < 0.01)                    very strong evidence against 𝐻₀
Step 1: Hypotheses
𝐻₀: 𝑝 = 0.61, where the parameter 𝑝 represents the population proportion of all Gen Zers
who think increased racial and ethnic diversity is a good thing for our society
𝐻ₐ: 𝑝 > 0.61
The Pew Research study revealed that 730 of 1178 Gen Zers surveyed said increased racial and ethnic
diversity is a good thing for our society.
Step 2: Conditions
• Independence: The question stem above does not tell us that Pew took a random
sample from the population of all Gen Zers. If we had no other information, we would
need to think about whether the observations are independent. The methodology
section of the article (https://www.pewresearch.org/social-
trends/2019/01/17/generations-methodology/) tells us that the data were collected
from two surveys that used random sampling, so the opinions of one respondent were
independent of the opinions of any other respondent.
• Success-Failure:
𝑛𝑝₀ = 1178(0.61) = 718.58 ≥ 10 and 𝑛(1 − 𝑝₀) = 1178(0.39) = 459.42 ≥ 10
We have at least 10 successes and at least 10 failures, so this condition is satisfied.
33 https://www.pewsocialtrends.org/2019/01/17/generation-z-looks-a-lot-like-millennials-on-key-social-and-political-issues/
The p-value is the probability of getting results at least as extreme as the sample results, under the
null model. Since we have a one-sided test to the right, toward the larger values…
p-value =
Step 4: Evaluate the p-value and the compatibility of the null model with observed results.
Note: In lab, you will learn how to do these calculations with a function called prop_test. That will
free you up to check conditions and think about the conclusion that you can make. Here is the output
from prop_test for this example:
prop_test(x = 730, n = 1178, p = 0.61, alternative = "greater")
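The same calculation can be sketched in base R; this mirrors the z test statistic formula above (the lab’s prop_test function may present its results differently):

phat <- 730 / 1178                    # observed sample proportion, about 0.62
p0 <- 0.61                            # hypothesized proportion from the null hypothesis
se <- sqrt(p0 * (1 - p0) / 1178)      # standard error under the null model
z <- (phat - p0) / se                 # standardized test statistic, about 0.68
pnorm(z, lower.tail = FALSE)          # one-sided (greater than) p-value, about 0.25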
We have also seen that samples vary. As a result, sample statistics vary. So, while a point estimate
gives us an estimate for a parameter, it is far more useful to provide a plausible range of values for that
parameter. Statisticians call this range of plausible values a confidence interval. We might construct a
confidence interval after finding evidence against a null hypothesis or in and of itself if we want to
determine a reasonable range of values for our parameter of interest.
Recall that the standard error of 𝑝̂ is given by 𝑆𝐸(𝑝̂) = √( 𝑝(1 − 𝑝) / 𝑛 ). When we worked with hypothesis
tests, we used 𝑝₀ in our calculation of 𝑆𝐸(𝑝̂). Since there is not a 𝑝₀ for confidence intervals, we need
to use our best guess for the population proportion 𝑝. The resulting formula for a confidence interval
for the population proportion 𝑝 is:
𝑝̂ ± multiplier × √( 𝑝̂(1 − 𝑝̂) / 𝑛 )
What multiplier should we use? Well, it depends. In particular, it depends on how confident we want
to be in our interval. Consider two extremes:
• If we set multiplier = 0, the confidence “interval” is simply the single value 𝑝̂.
We have no confidence that this “interval” contains the actual population proportion, 𝑝.
• If we set multiplier = ∞, the confidence interval is (−∞, ∞).
We can be 100% confident that this interval contains the actual population proportion, 𝑝,
because we have specified all possible values. In fact, we are 100% confident that the interval
[0, 1] contains the population proportion, 𝑝.
Neither of these is good!
You might recall that the 68-95-99.7 Rule tells us that, when
we have a normal distribution, about 95% of the values will
be within 2 standard deviations of the mean.34
34 Figure 3.2.1 from Tintle et al.’s ISI.
Let’s examine this idea of “confidence” through an example.
Our sample proportion of 𝑝̂ = 0.784 is somewhere in this distribution, somewhat near the center of the
distribution but not at the center.
An approximate 95% confidence interval for the population proportion of Stats 250 students who
returned to Ann Arbor for Fall 2020 classes is calculated as
What’s the probability that the confidence interval we just calculated contains 𝑝?
Key idea: Because about 95% of sample proportions are within 2 standard errors
of the parameter, approximately 95% of the intervals we create using this
method will include the parameter.
35 Generated using http://www.rossmanchance.com/applets/ConfSim.html.
The Gist of Confidence Intervals (in Five Key Points):
1. The value of the sample estimate will vary from one sample to the next. The values often vary
around the population parameter, and the standard error gives an idea about how far the
sample estimates tend to be from the true population proportion on average.
2. The standard error of the sample estimate provides an idea of how far away the estimate
would tend to vary from the parameter value (on average).
3. The general format for a confidence interval is given by:
sample estimate ± (a few) standard errors
4. The “few” or number of standard errors we go out each way from the sample estimate will
depend on what coverage rate (i.e., how confident) we want to be. We call the (a few)
standard errors the margin of error.
5. The “how confident” we want to be is referred to as the confidence level. This level reflects
how confident we are in the procedure. Most of the intervals that are calculated using this
procedure will contain the true parameter value, but occasionally intervals will be produced
that do not.
Note: Each interval either contains the population parameter or it doesn’t. The confidence level
is the percentage of the time we expect the procedure to produce an interval that does contain
the population parameter in the long run.
For confidence intervals, the sample proportion, 𝑝̂ , is used to check the success-failure condition and
to estimate the standard error because we do not know the actual value of the parameter, 𝑝.
Success-failure check for confidence intervals: The number of successes and the number of
failures are both at least 10. When we are not given the counts, we check this condition with
𝑛𝑝̂ ≥ 10 and 𝑛(1 − 𝑝̂ ) ≥ 10.
The table below gives a summary of confidence levels that are commonly seen in statistical studies,
along with their associated multipliers.
Confidence Level Multiplier
90% 1.65
95% 1.96
99% 2.58
After looking at this table, you’ll probably notice a key idea underlying confidence intervals: If you want
to be more confident in your interval of plausible values, you need to make your interval wider.37
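As a sketch of how these pieces fit together in R (a generic calculation with placeholder values for 𝑝̂ and n, not the course’s prop_test function):

phat <- 0.60                                       # placeholder sample proportion
n <- 120                                           # placeholder sample size
conf_level <- 0.95
multiplier <- qnorm(1 - (1 - conf_level) / 2)      # 1.96 for 95% confidence
se <- sqrt(phat * (1 - phat) / n)                  # estimated standard error
c(phat - multiplier * se, phat + multiplier * se)  # lower and upper bounds of the interval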
b. Calculate a 99% confidence interval for the proportion of all American adults who believe that
marriages between same-sex couples should be recognized by the law as valid.
36 Remember that standard error is the name for the standard deviation of an estimate.
37 Imagine kicking a field goal in American football. Regulation field goal uprights are 18’6” apart. What would this distance have to be to give a field goal kicker a better chance of making a field goal? A worse chance of making a field goal?
38 Obergefell v. Hodges, June 26, 2015.
39 https://news.gallup.com/poll/311672/support-sex-marriage-matches-record-high.aspx
c. What does this confidence interval tell us?
d. What is the probability that the population proportion is in the interval we constructed?
e. A 95% confidence interval produced from the same survey results would be (a) narrower, (b) wider, or (c) the same width as the interval computed in (b).
f. Can you use this confidence interval to conclude that a majority of American adults believe that
marriages between same-sex couples should be recognized by the law as valid? More than 65%?
There are two unknowns in the margin of error equation: p and n. If we have an estimate of p, perhaps from a similar
survey, we could use that value. If we have no such estimate, we must use some other value for p. The
margin of error for a proportion is largest when p is 0.5,40 so we typically use this worst-case estimate if
no other estimate is available.
40 Think about the value of p that makes √(𝑝(1 − 𝑝)) the largest.
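Setting the margin of error, multiplier × √(𝑝(1 − 𝑝)/𝑛), equal to a desired value m and solving for n gives n = (multiplier/m)² × 𝑝(1 − 𝑝). A sketch of that calculation in R, using the worst-case p = 0.5 and a placeholder margin of error:

m <- 0.05                      # desired margin of error (placeholder value)
multiplier <- qnorm(0.975)     # 1.96 for 95% confidence
p <- 0.5                       # worst-case value when no other estimate is available
n_required <- (multiplier / m)^2 * p * (1 - p)
ceiling(n_required)            # round up to the next whole person, 385 here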
Example: Is Computer Science Education Important?
We plan to survey parents and guardians of 7th to 12th graders in Michigan about the importance of
learning computer science. We want to sample enough parents and guardians of 7th to 12th graders in
Michigan to estimate the true proportion who think it is very important or extremely important to
learn computer science within about 3% with a 95% confidence level.
a. How many parents and guardians should we include in our sample?
b. Would our sample size decrease or increase if we wanted to use a higher confidence level? Why?
c. What would happen to our sample size if we wanted a smaller margin of error at the same level of
confidence?
Looking Forward
In this set of notes, we focused first on understanding the concepts behind inference and formalized
the conditions for theory-based inference for one population proportion. Next, we will discuss
inference for the difference between two proportions. Then we will move on to inference for means
followed by inference for simple linear regression. We round out the semester with an introduction to
multiple regression.