DS 212 Business Statistics I
Fall 2019
Lecture Notes 8: Statistical Inference: Estimation for Single Populations1
Recall that our goal in statistical inference is to use the information gathered from a
sample to draw conclusions about the population. Throughout our studies we have learned
a few estimates, such as the sample mean (x̄), the sample proportion (p̂), and the sample
variance (s²). These can be used to estimate the population mean (µ), the population
proportion (p), and the population variance (σ²), respectively. However, these point
estimates are only good for the sample they come from. As mentioned before, if we take
a new sample, the estimates will likely be different. Hence it is preferable to have a range
of possible values, leading to the confidence interval (or CI for short).
Definition 5.2

x̄ = (Σ xᵢ)/n  (point estimate)

x̄ − zα/2 · σ/√n ≤ µ ≤ x̄ + zα/2 · σ/√n  (100(1 − α)% confidence interval)
where
α = the area under the normal curve outside the confidence interval area
α/2 = the area in one end of the distribution outside the confidence interval
¹ Last update: November 17, 2019
Remark: Confidence intervals can be two-sided (shown above) or one-sided. Here we
will focus on two-sided intervals.
Take a closer look at the confidence interval formula. It involves two terms: the point
estimate, and a second term after the plus/minus sign (zα/2 · σ/√n). This second term is
called the margin of error. This is the general structure of a confidence interval:
Figure 1: 1 − α confidence.
For example, if we are interested in a 95% CI, then α = 0.05, and α/2 = 0.025. Looking
at Figure 1, it means that the tail probability is 0.025, hence we need zα/2 such that
P (Z > zα/2 ) = 0.025. Using the z-table, we find that this value is 1.96. Figure 2 shows a
picture of this example.
The 100(1 − α)% is called the confidence level. Some common confidence levels (and
the corresponding zα/2 ) are listed in Table 1:
Confidence Level 100(1 − α)    α       α/2      zα/2
90%                            0.10    0.05     1.645
95%                            0.05    0.025    1.96
99%                            0.01    0.005    2.575
Example 1 Suppose a random sample of 400 is selected from a population with a standard
deviation of 5. If the sample mean is 25, what is the 95% confidence interval?
Answer: n = 400, x̄ = 25, σ = 5. With 95% confidence level, α = 0.05, and the corresponding
z-value is zα/2 = z0.025 = 1.96. Hence the 95% CI is
σ σ
x̄ − zα/2 √ ≤ µ ≤ x̄ + zα/2 √
n n
5 5
25 − (1.96) √ ≤ µ ≤ 25 + (1.96) √
400 400
24.51 ≤ µ ≤ 25.49.
Hence, we are 95% confident that the population mean is between 24.51 and 25.49.
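The calculation in Example 1 can be sketched as a small Python function (an illustration, not part of the original notes; the name z_interval is my own):

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, level=0.95):
    """100*level% CI for mu when sigma is known: xbar +/- z_{alpha/2}*sigma/sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    moe = z * sigma / sqrt(n)          # margin of error
    return xbar - moe, xbar + moe

lo, hi = z_interval(25, 5, 400)        # Example 1
print(round(lo, 2), round(hi, 2))      # 24.51 25.49
```

The same function with level=0.98 reproduces the next example's interval (up to the notes' rounding of the z-value).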
Example 2 A manufacturer wants to purchase a certain type of foil. The foil is stored on
rolls, each containing a varying amount of foil, with a standard deviation of 12.5. In order to
estimate the average amount of foil per roll, the manufacturer randomly selected 200 rolls
and measured the amount of foil on each roll. The sample mean was 48. Calculate the 98%
confidence interval to estimate the population mean amount of foil.
Figure 3: 95% Confidence Intervals of µ

Answer: n = 200, x̄ = 48, σ = 12.5. With 98% confidence level, α = 0.02, and the corresponding
z-value is z0.01 = 2.325. Hence the 98% CI is

48 − (2.325) · 12.5/√200 ≤ µ ≤ 48 + (2.325) · 12.5/√200
45.94 ≤ µ ≤ 50.06.
Therefore, with 98% confidence level, one can say that the average number of foil is between
45.94 and 50.06.
Remark : Notice in the above examples the sample sizes are all larger than 30, hence we
can use the normal distribution by the CLT. However, if the sample size is less than 30,
we can still use the above procedure as long as we know that the population is normally
distributed and σ² is known. If σ² is not known, however, we cannot (or shouldn't) use
the above procedure. We have the following method.
Instead of using the z-distribution (normal distribution), we will use the t-distribution.
The formula for the t statistic is

t = (x̄ − µ)/(s/√n).

Notice it is very similar to the z-score formula, except now σ is replaced by s. This also
implies that the C.I. can be calculated similarly:

x̄ − tα/2;n−1 · s/√n ≤ µ ≤ x̄ + tα/2;n−1 · s/√n

Immediately we can see this formula looks very similar to the z formula above, except
we replace σ with s and zα/2 with tα/2;n−1.
Figure 4: Comparison of t-distribution with different degrees of freedom and the normal
distribution. Image taken from Professor Soorapanth’s note.
To find the t-value using the t-table (Table A.6 of the textbook), we need to know the
degrees of freedom (d.f.). For this case, the d.f. is simply n − 1, the number of samples
minus 1. The t-value is located at the intersection of the d.f. value and the selected α/2
value.
Example 3 Below is a portion of Table A.6.
Suppose we are interested in a 95% C.I. with a sample of 9 items. The d.f. is then 9 − 1 = 8.
Hence tα/2;n−1 = t0.025;8 = 2.306.
Example 4 Calculate the 90% C.I. for the following data:
Therefore, we are 90% confident that the true mean value is between 1.15 and 1.33.
Note that we have to assume the population is normally distributed.
Example 5 In planning for a new forest road to be used for tree harvesting, planners must
select the location that will minimize tractor skidding distance. Researchers wanted to estimate
the true mean skidding distance along a new road in a European forest. The skidding distances
(in meters) were measured at 20 randomly selected road sites, and the values are given below:
488 350 457 199 285 409 435 574 439 546
385 295 184 261 273 400 311 312 141 425
Also suppose the underlying population of these distances is approximately normally distributed.
Estimate, with 99% confidence interval, the true mean skidding distance of the road, and interpret
the result.
Answer: Again since we don’t know the population variance, we need to use the t-distribution
C.I. For this data set, n = 20, x̄ = 358.45, s ≈ 117.82. A 99% confidence level means α = 0.01,
and with n = 20, the d.f. is 20 − 1 = 19. Looking up from the table, t0.005;19 = 2.861. Therefore,
the C.I. is
358.45 − (2.861) · 117.82/√20 ≤ µ ≤ 358.45 + (2.861) · 117.82/√20
283.08 ≤ µ ≤ 433.82.
With 99% confidence level, we can conclude that the true mean distance is between 283.08 and
433.82 meters.
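As a check on Example 5 (an illustrative sketch, not from the notes), the sample statistics and interval can be recomputed in Python; the t-value 2.861 is still read from the t-table, since the standard library has no t quantile:

```python
from math import sqrt
from statistics import mean, stdev

# Skidding distances from Example 5
data = [488, 350, 457, 199, 285, 409, 435, 574, 439, 546,
        385, 295, 184, 261, 273, 400, 311, 312, 141, 425]

n = len(data)
xbar = mean(data)       # 358.45
s = stdev(data)         # sample standard deviation, approximately 117.82
t_crit = 2.861          # t_{0.005;19}, read from the t-table
moe = t_crit * s / sqrt(n)
print(f"{xbar - moe:.2f} <= mu <= {xbar + moe:.2f}")
```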
In summary:
• If the population standard deviation is known and n ≥ 30, then use the z distribution
interval.
• If the population standard deviation is known but n < 30, if the population is also
known to be normally distributed, use the z distribution interval.
• If the population standard deviation is unknown, but the population is approxi-
mately normal, use the t distribution interval regardless of the sample size.
Since the unknown p appears in multiple terms in the z-score formula, solving for it
would be complicated. For C.I. purposes, we replace p in the denominator of the z-score
formula with p̂, to get
z = (p̂ − p)/√(p̂q̂/n).
Definition 5.4 If np̂ > 5 and nq̂ > 5, then the 100(1 − α)% C.I. for the population
proportion p is

p̂ − zα/2 · √(p̂q̂/n) ≤ p ≤ p̂ + zα/2 · √(p̂q̂/n),

where p̂ = x/n and q̂ = 1 − p̂.
Example 6 A random sample of size n = 144 yielded 115 positive responses. Construct a 90%
C.I. for p.
Answer: First we need to check if the sample size requirements are satisfied. p̂ = 115/144 ≈ 0.80, so
np̂ = (144)(0.80) = 115.2 > 5 and nq̂ = (144)(0.20) = 28.8 > 5. Thus we can proceed with the
calculation. With 90% confidence level, zα/2 = z0.05 ≈ 1.645, hence the 90% C.I. is
0.80 − (1.645) · √((0.80)(0.20)/144) ≤ p ≤ 0.80 + (1.645) · √((0.80)(0.20)/144)
0.7452 ≤ p ≤ 0.8548.
Hence, with 90% confidence level, we can conclude that the true proportion of yes lies between
0.7452 and 0.8548.
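The proportion interval can be wrapped in a small helper (illustrative Python, not part of the notes; prop_interval is a made-up name), reproducing Example 6:

```python
from math import sqrt
from statistics import NormalDist

def prop_interval(p_hat, n, level):
    """100*level% CI for p: p_hat +/- z_{alpha/2} * sqrt(p_hat*q_hat/n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    moe = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe, p_hat + moe

lo, hi = prop_interval(0.80, 144, 0.90)    # Example 6, with p_hat rounded to 0.80
print(round(lo, 4), round(hi, 4))          # 0.7452 0.8548
```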
Example 7 The Minneapolis Star Tribune (August 12, 2008) reported that 73% of Americans
say that Starbucks coffee is overpriced. The source of this information was a national telephone
survey of 1000 American adults conducted by Rasmussen Reports. Find and interpret the 92%
C.I. for the parameter of interest.
Answer: n = 1000 and p̂ = 0.73. The requirements are satisfied: np̂ = (1000)(0.73) = 730 > 5 and
nq̂ = (1000)(0.27) = 270 > 5. With 92% confidence level, zα/2 = z0.04 ≈ 1.751. Hence the C.I.
is
0.73 − (1.751) · √((0.73)(0.27)/1000) ≤ p ≤ 0.73 + (1.751) · √((0.73)(0.27)/1000)
0.7054 ≤ p ≤ 0.7546.
With 92% confidence level, we can conclude that between 70.54% to 75.46% of Americans believe
Starbucks coffee is overpriced.
Optional Reading: Determining the Sample Size
In this section, we show that the appropriate sample size for making an inference about
a population mean or proportion depends on the desired reliability.
Determination of Sample Size for 100(1 − α)% Confidence Intervals for µ
In order to estimate µ within a given margin of error, and with 100(1 − α)% confidence,
the required sample size is found as follows:
zα/2 · σ/√n = E.

The solution for n is given by the equation

n = (zα/2)² σ² / E².
The value of σ is usually unknown. It can be estimated by the standard deviation s
from a previous sample. Alternatively, we may approximate the range R of observations
in the population and (conservatively) estimate σ ≈ R/4 (some people like to be less
conservative and use σ ≈ R/6 instead). In any case, you should round the value of n
obtained upward to ensure that the sample size will be sufficient to achieve the specified
reliability.
Example 8 If you wish to estimate a population mean to within 0.2 with a 95% confidence
interval, and you know from previous sampling that σ² is approximately equal to 5.4, how many
observations would you have to include in your sample?
Answer: Since we have a previous estimate of σ, we can use it directly as an estimate of the
population value. Using the formula above, we have
n = (zα/2)² σ² / E² = (1.96)²(5.4)/(0.2)² = 518.62 ≈ 519.
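Rounding upward is just a ceiling operation, so the sample-size rule can be sketched as follows (illustrative Python, not part of the notes; n_for_mean is my own name):

```python
from math import ceil, sqrt

def n_for_mean(z, sigma, E):
    """Smallest n with z*sigma/sqrt(n) <= E, i.e. ceil((z*sigma/E)^2)."""
    return ceil((z * sigma / E) ** 2)

print(n_for_mean(1.96, sqrt(5.4), 0.2))    # Example 8: sigma^2 = 5.4 -> 519
print(n_for_mean(2.575, 0.1, 0.025))       # 99% confidence with sigma = 0.1 -> 107
```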
Example 9 Suppose the manufacturer of official NFL footballs uses a machine to inflate the
new balls to a pressure of 13.5 pounds. When the machine is properly calibrated, the mean
inflation pressure is 13.5 pounds, but uncontrollable factors cause the pressures of individual
footballs to vary randomly from about 13.3 to 13.7 pounds. For quality control purposes, the
manufacturer wishes to estimate the mean inflation pressure to within 0.025 pound of its true
value with a 99% confidence interval. What sample size should be specified for the experiment?
Answer: Since we are not given σ, we need to estimate it using the range. The range is R =
13.7 − 13.3 = 0.4, so an approximation of σ is R/4 = 0.4/4 = 0.1. Also, z0.005 = 2.575. Therefore,
the required sample size is

n = (2.575)²(0.1)²/(0.025)² = 106.09 ≈ 107.
Thus, the required sample size to achieve a margin of error of 0.025 with a confidence level of
99% is 107.
Determination of Sample Size for 100(1 − α)% Confidence Interval for p
In order to estimate a population proportion p with a margin of error E and with 100(1 − α)%
confidence, the required sample size is found by solving the following equation for n:

zα/2 · √(pq/n) = E

The solution for n can be written as follows:

n = (zα/2)² (pq) / E²
Since the value of the product pq is unknown, it can be estimated by the sample fraction
of successes, p̂, from a previous sample. We can also show that the value of pq is at
its maximum when p equals 0.5, so you can obtain conservatively large values of n by
approximating p by 0.5 or values close to 0.5. In any case, you should round the value of
n obtained upward to ensure that the sample size will be sufficient to achieve the specified
reliability.
Example 10 What is the approximate sample size required to construct a 95% C.I. for p that
has a margin of error of 0.06? Suppose a prior study showed the proportion is roughly 0.3.
Answer: Using the sample size equation above, we have

n = (1.96)²(0.3)(0.7)/(0.06)² = 224.09 ≈ 225.
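Similarly for the proportion case (illustrative Python, not part of the notes; n_for_prop is a made-up name):

```python
from math import ceil

def n_for_prop(z, p, E):
    """Smallest n with z*sqrt(p*(1-p)/n) <= E, i.e. ceil(z^2 * p * (1-p) / E^2)."""
    return ceil(z ** 2 * p * (1 - p) / E ** 2)

print(n_for_prop(1.96, 0.3, 0.06))     # Example 10 -> 225
print(n_for_prop(1.645, 0.5, 0.01))    # conservative p = 0.5 -> 6766
```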
Example 11 A cellular telephone manufacturer that entered the post-regulation market quickly
had initial problems with excessive customer complaints and consequent returns of cell phones for
repair or replacement. The manufacturer wants to estimate the magnitude of the problem in
order to design a quality control program. How many cellular telephones should be sampled and
checked in order to estimate the fraction defective, p, to within 0.01 with 90% confidence?
Answer: Since we are not given a prior estimate of p, we can use a conservative value, p = 0.5,
as an estimate. Hence
n = (1.645)²(0.5)(0.5)/(0.01)² = 6765.06 ≈ 6766.
However, we may think this estimate of p = 0.5 is too conservative. We can instead use a value
of 0.1, corresponding to 10% defective, and get
n = (1.645)²(0.1)(0.9)/(0.01)² = 2435.4 ≈ 2436.
Thus, the manufacturer should sample 2436 telephones in order to estimate the fraction defective,
p, to within 0.01 with 90% confidence.
DS 212 Business Statistics I
Fall 2019
Lecture Notes 9: Statistical Inference: Hypothesis Testing for Single
Populations1
Here are some motivating examples on why we want to use hypothesis testing in real life
situations.
Example 1 The administrator at your local hospital states that on weekends the average wait
time for emergency room visits is 10 minutes. Based on discussions you have had with friends
who have complained about how long they waited to be seen in the ER over a weekend, you dispute
the administrator's claim. You decide to test your hypothesis. Over the course of a few weekends
you record the wait time for 40 randomly selected patients. The average wait time for these 40
patients is 11 minutes with a standard deviation of 3 minutes. Do you have enough evidence
to support your hypothesis that the average ER wait time exceeds 10 minutes? (source:
https://onlinecourses.science.psu.edu/stat500/node/41)
Example 2 An e-commerce research company claims that 60% or more graduate students have
bought merchandise on-line. A consumer group is suspicious of the claim and thinks that the
proportion is lower than 60%. A random sample of 80 graduate students showed that only 22
students have ever done so. Is there enough evidence to show that the true proportion is lower
than 60%? (source: https://onlinecourses.science.psu.edu/stat500/node/41)
This is the general procedure for conducting a hypothesis test. It is important that the
hypotheses are set up before the data are collected, as we do not want to alter the hypotheses
based on the data.
1. Hypotheses:
Types of Hypotheses
There are three types of hypotheses we will explore here:
• Research hypotheses: a statement of what the researcher believes will be the
outcome of an experiment or a study.
• Statistical hypotheses: a formal (mathematical) structure used to statistically
test the research hypothesis.
• Substantive hypotheses: hypotheses resulting in the outcomes that are impor-
tant to decision makers.
We will put our focus on learning how to set up the statistical hypotheses.
Statistical Hypotheses
A statistical hypothesis consists of 2 parts: the null hypothesis and the alternative
hypothesis.
• Null hypothesis: Denoted H0 . The hypothesis that assumes the original parameter
value.
• Alternative hypothesis: Denoted Ha or H1 . The opposite of H0 . A new assumption/claim
that the researcher wishes to test.
The null hypothesis is always assumed to be true, unless we perform a statistical
test and conclude that Ha is true.
Example 3 Refer to Example 1. The null hypothesis is the average wait time is 10 minutes,
while the alternative hypothesis is the average wait time is more than 10 minutes.
Example 4 Refer to Example 2. The null hypothesis is 60% or more graduate students
have bought merchandise online, and the alternative hypothesis is less than 60% have done so.
Left-tailed test:  H0 : µ = µ0 , Ha : µ < µ0
Right-tailed test: H0 : µ = µ0 , Ha : µ > µ0
for some specific value µ0 . Sometimes the “=” sign is replaced by ≥ (for left-
tailed) or ≤ (for right-tailed). Notice H0 and Ha are mutually exclusive sets
and their union is the entire space. Both Examples 3 and 4 are directional
hypotheses.
• Two-sided: The hypothesis concerns the hypothesized value to be different
from the original value.
H0 : µ = µ0
Ha : µ ≠ µ0
Example 5 Refer to Example 1. Let µ denote the average waiting time, and µ0 = 10 the
claimed waiting time. The null and alternative hypotheses are then
H0 : µ = 10
Ha : µ > 10
Example 6 A manufacturer is filling 40 oz. packages with flour. The company wants
to test whether or not the package contents average 40 ounces. Let µ denote the average
package weight, and µ0 = 40. Hence,
H0 : µ = 40
Ha : µ ≠ 40
In Example 1, we would commit a Type II error if we keep the waiting time as 10
minutes when in fact it is not. The computation of β is much more difficult, so we will
not discuss it here. The complement of the Type II error is the power,
which is the probability of rejecting the null hypothesis when the null is indeed false.
Below is a chart that summarizes this:

                                    Truth
Decision               Null is true           Null is false
Fail to reject null    Correct decision       Type II error (β)
Reject null            Type I error (α)       Correct decision (power)
4. Decision rule:
We need some rules to help us determine whether the null hypothesis should be
rejected or not. The possible outcomes of a hypothesis test can be divided into two
groups:
• Reject the null (rejection region).
• Fail to reject the null (nonrejection region).
There are two approaches to drawing a conclusion: the critical value approach and
the p-value approach. The regions described above come from the critical value
approach.
• Critical Value Approach: The critical value approach works by first determining
a critical value, such that if the test statistic (to be discussed later)
exceeds (or falls below) this critical value, we will reject the null hypothesis.
The specific critical value depends on a few criteria: the type of data we
have, the hypotheses, and the α value. The three cases can be handled
in the following way:
– Left-tailed: reject H0 if the test statistic is less than the critical value
(note this critical value is negative).
– Right-tailed: reject H0 if the test statistic is larger than the critical value
(note this critical value is positive).
When the population standard deviation σ is known, we can use the z-statistic and the normal
distribution to conduct the hypothesis test. The z-statistic is

z = (x̄ − µ0)/(σ/√n)    (1)

where x̄ is the sample mean, µ0 is the hypothesized mean, σ is the population standard
deviation, and n is the sample size. We call this hypothesis test the one-sample z-test.
Oftentimes we call z the test statistic.
Example 7 The mean composite score on the ACT among the students at a large Midwestern
University is 24 with a standard deviation of 4. We wish to know whether the average composite
ACT score for business majors is different from the average for the University. We sample 100
business majors and calculate an average score of 26.
Step 1: Let µ denote the average ACT score. The null and alternative hypotheses are
H0 : µ = 24
Ha : µ ≠ 24
Step 2: Since we have more than 30 sample points and we know the population standard deviation,
we can use the z-test.
Step 3: Let’s set α = 0.05. Hence, Type I error = P(reject H0 |H0 is true) = 0.05.
Step 4: Since this is a two-sided test and α = 0.05, there is 0.05/2 = 0.025 area in each tail of
the normal distribution (see Figure 3). Hence the critical value is zα/2 = z0.025 = 1.96.
Step 5: This step calculates the test statistic using formula (1):

z = (x̄ − µ0)/(σ/√n) = (26 − 24)/(4/√100) = 5.
Step 6: Since |z| > |z0.025 | ⇒ 5 > 1.96, we can reject the null hypothesis. We can conclude that,
at 0.05 level of significance, there is enough evidence to conclude that the average ACT
score for business majors is different from the university average.
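Steps 5 and 6 can be sketched in Python (an illustration, not part of the notes; z_test is my own helper name):

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n):
    """One-sample z statistic and the upper-tail area P(Z > |z|)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    tail = 1 - NormalDist().cdf(abs(z))
    return z, tail

z, tail = z_test(26, 24, 4, 100)   # Example 7
print(round(z, 2))                 # 5.0
print(tail < 0.025)                # True: reject H0 at alpha = 0.05 (two-sided)
```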
As mentioned above, we can also make the decision using the p-value approach. The
definition of the p-value is as follows: the probability of observing a test statistic as extreme
as the one computed, given that the null hypothesis is true. In other words, if the p-value is
small, the observed data would be very unlikely under the null hypothesis, so we have
evidence against it. Hence, we will reject the null hypothesis if the p-value is less
than α (for a one-sided test) or α/2 (for a two-sided test). The following lists the ways to
calculate the p-value for the z-test:
• Left-tailed test: the p-value is the area under the standard normal distribution to
the left of the test statistic, i.e. P (Z < z), where z is the z-statistic and Z is the
standard normal random variable.
• Right-tailed test: the p-value is the area under the standard normal distribution to
the right of the test statistic, i.e. P (Z > z).
• Two-sided test: the p-value is the area under the standard normal distribution to
the right of the absolute value of the test statistic, i.e. P (Z > |z|).
Example 8 Continuing with Example 7. The p-value is then P (Z > 5) ≈ 0. Since p-value
< 0.025, we will reject H0 . This is the same conclusion as above.
Example 9 A bank manager wants to test the claim that the average daily deposit (which
is known to be normally distributed) into the savings accounts at the bank is more than $550 per
person. A random sample of 60 deposits had a mean of $560. Suppose it is known that the
population standard deviation is σ = $35. Can he prove his claim? Test this using α = 0.01.
Solution: Let µ denote the average deposit per person.
H0 : µ = 550
Ha : µ > 550
The test statistic is z = (560 − 550)/(35/√60) ≈ 2.21. With α = 0.01 and a right-tailed test,
the critical value is z0.01 = 2.33. Since 2.21 < 2.33, we fail to reject H0 : at the 0.01 level of
significance, there is not enough evidence to support the claim.
Notice that in the above example, if we choose a higher level of significance, say 0.05, we
will then be able to reject the null hypothesis. The level of significance is related to the
confidence level: the lower the level of significance, the more confident you can be in
your result.
Example 10 Now suppose someone told the bank manager that the average waiting time for
a customer to get help is 5.4 minutes with a standard deviation of 1.2 minutes. The manager
believes this is an exaggeration, and sampled 50 customers to get an average waiting time of 4.7
minutes. Can he justify his claim at the 0.05 level of significance?
Solution: Let µ denote the average waiting time.
H0 : µ = 5.4
Ha : µ < 5.4
The test statistic is z = (4.7 − 5.4)/(1.2/√50) ≈ −4.12. With α = 0.05 and a left-tailed test,
the critical value is −z0.05 = −1.645. Since −4.12 < −1.645, we reject H0 , and conclude that at the 0.05 significance
level, there is enough evidence to conclude the waiting time is less than 5.4 minutes. The p-value
for this problem is P (Z < −4.12) ≈ 0.0000, which is also smaller than α = 0.05.
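The same computation for this left-tailed test, sketched in Python (an illustration, not part of the notes; values taken from Example 10):

```python
from math import sqrt
from statistics import NormalDist

# Example 10 (left-tailed): claimed mean 5.4, sample mean 4.7, n = 50,
# standard deviation 1.2 used as sigma.
z = (4.7 - 5.4) / (1.2 / sqrt(50))
p_value = NormalDist().cdf(z)      # left-tail area P(Z < z)
print(round(z, 2))                 # -4.12
print(p_value < 0.05)              # True: reject H0 at alpha = 0.05
```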
In summary:
Note: we can also use the z-test if the sample size is smaller than 30, as long as we know
the population standard deviation and we can assume that the population is (approxi-
mately) normally distributed.
Example 12 According to CNN, in 2011, the average American spent $16,803 on housing. A
suburban community wants to know if their residents spent differently than this national average.
In a survey of 40 randomly selected residents, they found that they spent an annual average of
$15,800 with a standard deviation of $2,600. Test the hypothesis with 0.1 level of significance,
and assume the spending is approximately normally distributed.
Solution: Let µ denote the average housing spending.
H0 : µ = 16803
Ha : µ ≠ 16803
The test statistic is t = (15800 − 16803)/(2600/√40) ≈ −2.4398, with d.f. 40 − 1 = 39. Since this is a
two-sided test, we will use α/2 when looking up the critical value. Hence, the critical value is
t0.05;39 ≈ 1.684 (note this is actually the critical value for d.f. = 40; however, it is a good enough
approximation for our case of d.f. = 39). Since | − 2.4398| = 2.4398 > 1.684, we can reject H0 .
At 0.1 level of significance, we can conclude that the average spending on housing is different
from $16,803.
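The t statistic in Example 12 can be verified numerically (illustrative Python, not part of the notes; the critical value still comes from the t-table, since the standard library has no t distribution):

```python
from math import sqrt

# Example 12: the t statistic; the critical value t_{0.05;39} ~ 1.684 is
# read from the t-table.
t = (15800 - 16803) / (2600 / sqrt(40))
print(round(t, 4))        # -2.4398
print(abs(t) > 1.684)     # True: reject H0 at the 0.1 level (two-sided)
```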
In summary:
                          z-test                        t-test
Formula                   z = (x̄ − µ0)/(σ/√n)           t = (x̄ − µ0)/(s/√n)
Degrees of freedom        not used                      n − 1
Population std. dev.      known σ                       unknown; use the sample
                                                        standard deviation s
Sample size               n ≥ 30, or n < 30 if the      any sample size
                          population is normally
                          distributed
Assumption                none, unless n < 30 (then     population is (approximately)
                          assume the population is      normally distributed
                          normally distributed)
Testing Hypotheses About A Proportion
In Chapter 8 we have seen how to construct a confidence interval about the population
proportion parameter. Just like the z-test and t-test, a test about a population proportion
value can be carried out similarly. We will again use the z-test for this task, with
the test statistic

z = (p̂ − p0)/√(p0 q0 /n)

where p̂ = x/n is the estimated proportion, p0 is the hypothesized population proportion
value, and q0 = 1 − p0 . This is the z-test for proportion.
Example 13 The CEO of a large electric utility claims that 80 percent of his 1,000,000 cus-
tomers are very satisfied with the service they receive. To test this claim, the local newspaper
surveyed 100 customers. Among the sampled customers, 73 percent say they are very satisfied.
Based on these findings, can we reject the CEO’s hypothesis that 80% of the customers are very
satisfied? Use a 0.02 level of significance.
Solution: Let p denote the true proportion of customers that are very satisfied with the service.
Then,
H0 : p = 0.8
Ha : p ≠ 0.8
The test statistic is then z = (0.73 − 0.80)/√((0.8)(0.2)/100) = −1.75. With a 0.02 level of
significance and a two-tailed test, the critical value is z0.02/2 = z0.01 = 2.33. Since | − 1.75| =
1.75 < 2.33, we cannot reject H0 . Therefore, there is insufficient evidence at the 0.02 level of
significance to indicate that the proportion of very satisfied customers differs from 80%. Note
that the p-value for this problem is P (Z > 1.75) = 0.04, which is not less than α/2 = 0.01,
hence we do not reject H0 .
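Example 13's statistic and p-value can be sketched in Python (an illustration, not part of the notes; prop_z_test is a made-up name):

```python
from math import sqrt
from statistics import NormalDist

def prop_z_test(p_hat, p0, n):
    """z statistic for a proportion test, with p0 used in the standard error."""
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    tail = 1 - NormalDist().cdf(abs(z))    # P(Z > |z|)
    return z, tail

z, tail = prop_z_test(0.73, 0.80, 100)     # Example 13
print(round(z, 2))                         # -1.75
print(round(tail, 2))                      # 0.04
```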
Example 14 The reputations (and hence sales) of many businesses can be severely damaged
by shipments of manufactured items that contain a large percentage of defectives. For example,
a manufacturer of alkaline batteries may want to be reasonably certain that less than 5% of its
batteries are defective. Suppose 300 batteries are randomly selected from a very large shipment,
each is tested, and 5 defective batteries are found. Does this outcome provide sufficient evidence
for the manufacturer to conclude that the fraction defective in the entire shipment is less than
5%? Use α = 0.1.
Solution: Let p be the true proportion of defective batteries.
H0 : p = 0.05
Ha : p < 0.05
p̂ = 5/300 ≈ 0.0167. The test statistic is z = (5/300 − 0.05)/√((0.05)(0.95)/300) ≈ −2.64.
The critical value is −z0.1 = −1.28.
Since z = −2.64 < −1.28, we can reject H0 , and conclude that with 0.1 level of significance,
there is enough evidence to conclude the shipment contains less than 5% defective batteries. The
p-value for this problem is P (Z < −2.64) = 0.004, which is less than α = 0.1. Hence we will
reject H0 .
Hypothesis testing in R
2024-10-20
library(tidyverse)
library(dplyr)
library(ggplot2)
## Exploring the data set

We will start by doing some data processing.
'physician.visits',
'birthwt.grams')
birthwt$mother.race=recode_factor(birthwt$mother.race, '1' = "white", '2' = "black", '3' = "other")
birthwt$mother.smokes= recode_factor(birthwt$mother.smokes, '0' = "no", '1' = "yes")
birthwt$hypertension= recode_factor(birthwt$hypertension, '0' = "no", '1' = "yes")
birthwt$uterine.irr= recode_factor(birthwt$uterine.irr, `0` = "no", `1` = "yes")
birthwt$birthwt.below.2500= recode_factor(birthwt$birthwt.below.2500, `0` = "no", `1` = "yes")
Our focus today is to run hypothesis testing to assess whether some trends are statistically significant.
It is important that the tables and figures we provide with our data convey statistical uncertainty in
any case where it is non-negligible. Failing to account for it may produce misleading conclusions.

## Testing differences in means

A common statistical task is to compare an outcome between groups. Here we compare
birth weight between smoking and non-smoking mothers. It is always good to start with a boxplot:
[Boxplot of birthwt.grams by mother.smokes]
The plot suggests that smoking is associated with lower birth weight. How can we assess whether
the difference is statistically significant? A table of group summary statistics can help.
## # A tibble: 2 x 3
## mother.smokes mean.birthwt sd.birthwt
## <fct> <dbl> <dbl>
## 1 no 3056. 753.
## 2 yes 2772. 660.
Currently we have the standard deviation. However, it's better to have the standard error, which is the
standard deviation divided by the square root of the number of observations.
birthwt %>%
group_by(mother.smokes) %>%
summarize(num.obs = n(),
mean.birthwt = mean(birthwt.grams),
sd.birthwt = sd(birthwt.grams),
se.birthwt = sd(birthwt.grams) / sqrt(num.obs))
## # A tibble: 2 x 5
## mother.smokes num.obs mean.birthwt sd.birthwt se.birthwt
## <fct> <int> <dbl> <dbl> <dbl>
## 1 no 115 3056. 753. 70.2
## 2 yes 74 2772. 660. 76.7
birth.test = t.test(birthwt.grams ~ mother.smokes, data = birthwt)
birth.test
##
## Welch Two Sample t-test
##
## data: birthwt.grams by mother.smokes
## t = 2.7299, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## 78.57486 488.97860
## sample estimates:
## mean in group no mean in group yes
## 3055.696 2771.919
From the output, we see that the difference is statistically significant. This function returns a lot of useful
information.
names(birth.test) lists the components we can extract. For example, the p-value:
birth.test$p.value
## [1] 0.007002548
or the group means (birth.test$estimate), or the confidence interval (birth.test$conf.int).
Let's do an experiment to show how the test works. We will simulate observations that
belong to 2 different groups. We do this in 2 ways:
## y groups
## 1 -0.7087907 control
## 2 1.0803069 control
## 3 1.1390947 control
## 4 0.8728649 control
## 5 -0.7742385 control
## 6 -0.2799271 control
Have a look at the boxplot:

[Boxplot of y by groups]

Density plot of the two groups:
ggplot(data=obs.data,aes(x=y,fill=groups,alpha=0.7)) + geom_density()
[Density plot of y by groups]
Now t-test:
##
## Welch Two Sample t-test
##
## data: y by groups
## t = -0.020714, df = 85.341, p-value = 0.9835
## alternative hypothesis: true difference in means between group control and group treatment is not equ
## 95 percent confidence interval:
## -0.3952230 0.3870726
## sample estimates:
## mean in group control mean in group treatment
## 0.1331638 0.1372390
## y groups
## 1 -1.346045295 control
## 2 0.485230361 control
## 3 -0.931398683 control
## 4 0.792209118 control
## 5 0.005531094 control
## 6 2.280406312 control
Have a look at the boxplot:

[Boxplot of y by groups]

Density plot of the two groups:
ggplot(data=obs.data,aes(x=y,fill=groups,alpha=0.7)) + geom_density()
[Density plot of y by groups]
##
## Welch Two Sample t-test
##
## data: y by groups
## t = -1.7395, df = 102.85, p-value = 0.08495
## alternative hypothesis: true difference in means between group control and group treatment is not equ
## 95 percent confidence interval:
## -0.80171915 0.05251185
## sample estimates:
## mean in group control mean in group treatment
## -0.02883144 0.34577221
iter=10000
pvals=matrix(0,nrow=iter,ncol=2)
for (i in 1:iter)
{
obs.null=simulateData(n1,n2,mean.shift=0)
obs.dif=simulateData(n1,n2,mean.shift=0.5)
pvals[i,1] = t.test(y ~ groups,data=obs.null)$p.value
pvals[i,2] = t.test(y ~ groups,data=obs.dif)$p.value
}
pvals = as.data.frame(pvals)
names(pvals) = c("null","dif")
[Histograms of the simulated p-values: under the null, and with a true mean difference]
For non-Gaussian data, the sample size has to be large for the t-test to work. When in doubt, you can
use a non-parametric test. Here's a Wilcoxon rank-sum test (Mann–Whitney test):
##
## Wilcoxon rank sum test with continuity correction
##
## data: birthwt.grams by mother.smokes
## W = 5249.5, p-value = 0.006768
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 85.00004 512.00005
## sample estimates:
## difference in location
## 306.1846
## $names
## [1] "statistic" "parameter" "p.value" "null.value" "alternative"
## [6] "method" "data.name" "conf.int" "estimate"
##
## $class
## [1] "htest"
Is the data normal?
To determine whether the data is normal (so that you can stick to the t-test), you can use a Q-Q plot:
(Figure: normal Q-Q plot of birthwt.grams.)
Separate the data by smoking status:

(Figure: normal Q-Q plots of birthwt.grams, faceted by mother.smokes: no, yes.)
Test for contingency tables

It looks like low birth weight is associated with the mother's smoking status. We can test this association with a contingency table instead of a t-test. We build a 2x2 table using table():
new.df=select(birthwt,c('birthwt.below.2500','mother.smokes'))
weight.smoke.table=table(new.df)
weight.smoke.table
## mother.smokes
## birthwt.below.2500 no yes
## no 86 44
## yes 29 30
To test for a significant relationship between birth weight and smoking status, we use Fisher's exact test:
birth.fisher.test = fisher.test(weight.smoke.table)
birth.fisher.test
##
## Fisher’s Exact Test for Count Data
##
## data: weight.smoke.table
## p-value = 0.03618
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.028780 3.964904
## sample estimates:
## odds ratio
## 2.014137
attributes(birth.fisher.test)
## $names
## [1] "p.value" "conf.int" "estimate" "null.value" "alternative"
## [6] "method" "data.name"
##
## $class
## [1] "htest"
We find that there is a significant association between smoking and low birth weight. If the sample size is large, you can also use a chi-squared test via the chisq.test function.
chisq.test(weight.smoke.table)
##
## Pearson’s Chi-squared test with Yates’ continuity correction
##
## data: weight.smoke.table
## X-squared = 4.2359, df = 1, p-value = 0.03958
When one of the attributes has more than 2 levels, you can still use the chi-squared test for association between the 2 attributes.
data=c(47,35,33,10,13,27)
data=matrix(data,nrow=3,ncol=2)
data=as.table(data)
colnames(data)=c("Non-Managerial","Managerial")
rownames(data)=c("0-10 days","11-20 days","21 or more days")
Hypotheses are:
H_0: There is no association between type of employee and number of sick leave days.
H_a: There is an association between type of employee and number of sick leave days.
Chi-squared test for the multi-level table:
chisq.test(data)
##
## Pearson’s Chi-squared test
##
## data: data
## X-squared = 10.765, df = 2, p-value = 0.004595
fisher.test(data)
##
## Fisher’s Exact Test for Count Data
##
## data: data
## p-value = 0.005203
## alternative hypothesis: two.sided
Both tests indicate that there is evidence for an association between the number of sick leave days and the type of employee.
DS 212 Business Statistics I
Fall 2019
Lecture Notes 6: Continuous Distributions1
Rule number 2 is analogous to the sum of all probabilities equaling 1 for discrete distributions. The area under the graph of f(x) over a certain interval represents the probability of the interval of interest.
Last update: October 22, 2019
Example 1
This graph plots the density function f(x) = 1.4e^(−1.4x). The area of the shaded region represents the probability of the r.v. X between 1 and 2, i.e. P(1 ≤ X ≤ 2). Using a special formula (we will learn soon), it can be found that this area is 0.1858. Hence P(1 ≤ X ≤ 2) = 0.1858.
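As a quick numerical check of the shaded area (a Python sketch, not part of the notes; it uses the cumulative form of this density, which this chapter derives later for the exponential distribution):

```python
import math

# The stated density is f(x) = 1.4 * exp(-1.4 x); its CDF is
# F(x) = 1 - exp(-1.4 x), so P(1 <= X <= 2) = F(2) - F(1).
rate = 1.4

def cdf(x):
    return 1 - math.exp(-rate * x)

p = cdf(2) - cdf(1)
print(round(p, 4))  # 0.1858, matching the shaded area in the graph
```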
Notes:
1. a and b could possibly be −∞ and ∞.
2. f (x) is NOT a probability (i.e. f(1) is not the probability that X = 1), it is the
probability density.
3. P (c < X < d) is the area under the curve f (x) between c and d.
4. If X is a continuous r.v., then P (X = a) = 0 for any a. This is because the area of
a straight line under a curve is 0.
5. For a continuous random variable, the probability of any single value is zero, i.e. P(X = c) = 0. This also means that P(c ≤ X ≤ d) = P(c < X < d) = P(c ≤ X < d) = P(c < X ≤ d).
This chapter will cover three continuous distributions: the uniform distribution, the nor-
mal distribution, and the exponential distribution.
Some examples that have a uniform distribution:
• Flight time from Chicago to New York, where X is equally likely between 100 and
140 mins.
• Volume of a regular can of soft drink, is uniformly distributed between 11.97 and
12.03 ounces.
• Price of a box of cereal ranges from $2.80 to $3.14 uniformly.
• What is the probability that the duration of the flight is between 115 to 125 minutes?
• What’s the proportion of produced cans contain more than 12.014 ounces?
• How likely it is to find a cereal box cheaper than $2.95?
• If X ∼ U(a, b), the probability of an interval is
P(c ≤ X ≤ d) = (d − c)/(b − a), where a ≤ c ≤ d ≤ b.
• Likewise, the probabilities P(X ≤ c) and P(X ≥ d) are
P(X ≤ c) = (c − a)/(b − a),  P(X ≥ d) = (b − d)/(b − a).
• The mean and variance of a uniform distribution X are given by
µ = (a + b)/2,  σ² = (b − a)²/12.
Figure 1 below shows an example of a uniform density function. For this example, X ∼ U(3, 7.5). Therefore, the height of the function is 1/(7.5 − 3) ≈ 0.22. Note that the area under the curve is simply the area of the rectangle created by the boundaries (3 and 7.5) and the function f(x), which gives (7.5 − 3) × 1/(7.5 − 3) = 1.
Figure 1: The density function of X ∼ U (3, 7.5).
Example 2 Suppose the time it takes a customer service representative to complete a call is
uniformly distributed with endpoints 2 and 6 minutes.
Optional Learning.
Instead of being given a particular interval and asked to find the probability, we can also do the reverse: given the probability, find the interval endpoints that result in the given probability.
• To find the value c such that P(X ≤ c) = p, we need to solve the following equation:
P(X ≤ c) = (c − a)/(b − a) = p,
which gives c = p(b − a) + a.
• Likewise, to find the value d such that P(X ≥ d) = p, we solve
P(X ≥ d) = (b − d)/(b − a) = p,
resulting in d = b − p(b − a).
Example 3 An insurance policy is written to cover a loss X, where X has a uniform distribution on [100, 1000].
1. What is the loss amount such that 40% of the claims are less than or equal to this amount?
Answer: Let X denote the r.v. for the loss amount. The question translates to: find c such that P(X ≤ c) = (c − 100)/(1000 − 100) = 0.4. Using the equation above, we have c = 0.4(1000 − 100) + 100 = 460.
2. What is the amount such that 55% of the claims are over this amount?
Answer: Similarly, this question translates to: find d such that P(X ≥ d) = (1000 − d)/(1000 − 100) = 0.55. Hence, d = 1000 − 0.55(900) = 505.
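The two inverse formulas can be checked with a short Python sketch (not part of the notes; the helper names are ours):

```python
# Inverse problem for X ~ U(a, b).
def lower_cutoff(a, b, p):
    # c such that P(X <= c) = (c - a)/(b - a) = p  =>  c = p(b - a) + a
    return p * (b - a) + a

def upper_cutoff(a, b, p):
    # d such that P(X >= d) = (b - d)/(b - a) = p  =>  d = b - p(b - a)
    return b - p * (b - a)

print(round(lower_cutoff(100, 1000, 0.40), 2))  # 460.0
print(round(upper_cutoff(100, 1000, 0.55), 2))  # 505.0
```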
The normal distribution has p.d.f. f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²)), and it has the following properties:
• It is a continuous distribution.
• It is symmetric about the mean.
• It is asymptotic to the horizontal axis (meaning the two tails approach the horizontal axis).
• It is unimodal.
• It is a family of curves.
• Area under the curve is 1.
Notice that in the formula of its p.d.f., the curve is characterized by two parameters: µ
and σ. Different combinations of µ and σ will produce a different curve, though all these
curves will still satisfy all the properties listed above. For any continuous random variable
X that is normally distributed with mean µ and standard deviation σ (or equivalently,
variance σ 2 ), we write X ∼ N (µ, σ 2 ). Below shows the graphs of a few pairs of µ and σ 2 .
Figure 2: The graph shows three variations of the normal distribution. Note that each
curve is centered at its mean. Also notice the black and blue curves, which have the same
mean but different variance. It shows that the larger the variance, the wider the spread
of the distribution.
Calculating probabilities with the normal distribution requires finding the area under the curve. However, for any specific setting of µ and σ², there is no analytic form for the probability (you can try calculating the integral for one of these curves over a specific interval, and you will find there is no closed-form solution). Luckily, due to some nice properties of the normal distribution, we can transform any setting to a specific distribution, called the standard normal distribution, whose probabilities have been extensively tabulated.
If X is a normal random variable with mean µ and standard deviation σ, then one can convert X to the random variable Z by the formula
Z = (X − µ)/σ.
The value Z (called a z-score) describes the number of standard deviations between X and µ.
Below is a picture of this conversion. The left shows the density of X ∼ N(494, 100²), and the right shows the standard normal distribution. On the left we are interested in the probability P(494 ≤ X ≤ 600), which, if we do some numerical calculation, results in 0.3554. Using the transformation, we obtain z-scores of (494 − 494)/100 = 0 and (600 − 494)/100 = 1.06, and using the numerical tables we find that P(0 ≤ Z ≤ 1.06) = 0.3554 as well. Hence finding the probability with the transformed distribution is equivalent to finding it in the original distribution.
Figure 3: Left: A normal distribution with N (494, 1002 ). Right: the standard normal
distribution N (0, 1). Image borrowed from Professor Sada Soorapanth’s lecture notes.
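This equivalence can be verified in code. Here is a Python sketch (not part of the notes) that computes the standard normal CDF Φ via the error function instead of a z-table and checks that both routes give the same probability:

```python
import math

# Standard normal CDF via the error function; this replaces a z-table lookup.
def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 494, 100
# P(494 <= X <= 600) computed directly on the X scale ...
p_x = phi((600 - mu) / sigma) - phi((494 - mu) / sigma)
# ... equals P(0 <= Z <= 1.06) on the standard normal scale.
p_z = phi(1.06) - phi(0)
print(round(p_x, 4), round(p_z, 4))  # both 0.3554
```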
Note: We will deviate from the textbook a little bit here. Please use this notes
as reference instead of the textbook, on learning how to find the probabilities
of the Normal Distribution.
1. Convert the values of interest from x values to standard normal random variable z values by using the formula z = (x − µ)/σ.
2. Use the standard normal table (provided on iLearn) to find the areas (probability)
corresponding to the z values. If necessary, use the symmetry property of the normal
distribution and the fact that the total area on each side of the mean equals 0.5 to
convert the areas from the table to the probabilities of the event you have selected.
How do we actually find the probability in the standard normal distribution if we cannot calculate the area directly? As mentioned before, the probabilities related to the standard normal distribution have been tabulated extensively, meaning that given a particular z-score, we can just look up the probability in a numerical table. We normally call this table the z-table.
Figure 4: A section of the z-table, that gives probability from −∞ to any negative z-
score. Note this table is different from the table A5 in the textbook. We will
be using this table instead because it’s more intuitive. They can be found on
iLearn. (source: http://users.stat.ufl.edu/∼athienit/Tables/Ztable.pdf )
Figure 5: A section of the z-table, that gives probability from −∞ to any positive z-score.
(source: http://users.stat.ufl.edu/∼athienit/Tables/Ztable.pdf)
Figures 4 and 5 show sections of the z-table. The margins (left and top) give the z-score, and the body of the table gives the probability for that z-score. The image at the top of each table tells you which interval of probability
infinity (−∞) to any negative z-score (Figure 4), i.e. P (Z ≤ z) for any z ≤ 0, or to any
positive z-score (Figure 5), i.e. P (Z ≤ z) for any z ≥ 0. For example, if we look for z =
0.74 and its corresponding probability 0.7704, it is referring to P (Z ≤ 0.74) = 0.7704 from
Figure 5. Different z-tables might provide different types of probabilities depending on the image provided on top. For this class, we will be working with these 2 tables instead of
the one in the textbook (Table A5). The following examples will show you how to find all
kinds of probabilities using these two tables. Again these two tables are NOT the same
as the one in the textbook. I believe these two tables are more intuitive to understand.
problem translates to finding P (0 ≤ Z ≤ 2), which is equivalent to the shaded region for
P (Z ≤ 2) minus the region of P (Z ≤ 0) (see Figure 6). Using the positive table and the
fact that P (Z ≤ 0) = P (Z ≥ 0) = 0.5, we have P (Z ≤ 2) − P (Z ≤ 0) = 0.9772 − 0.5 =
0.4772.
4. P(42.55 ≤ X ≤ 60).
Answer: The two z-scores are (42.55 − 55)/5 = −2.49 and (60 − 55)/5 = 1, so we are looking for P(−2.49 ≤ Z ≤ 1), which can be written as P(Z ≤ 1) − P(Z ≤ −2.49). Using both the negative and positive tables, we have P(Z ≤ 1) = .8413 and P(Z ≤ −2.49) = .0064, which gives P(−2.49 ≤ Z ≤ 1) = P(Z ≤ 1) − P(Z ≤ −2.49) = 0.8413 − 0.0064 = 0.8349.
5. P (X > 59).
Answer: This is equivalent to finding P(Z > 0.8). Since the tables only give probabilities toward the lower tail, we can use the complement-event trick, which gives P(Z > 0.8) = 1 − P(Z ≤ 0.8). Now we can use the positive table to get P(Z > 0.8) = 1 − P(Z ≤ 0.8) = 1 − .7881 = 0.2119.
6. P (65 ≤ X ≤ 68).
Answer: This translates to P (2 ≤ Z ≤ 2.6). Similar to part 3, this is equivalent to finding
P (Z ≤ 2.6) − P (Z ≤ 2). Similar to problem 3 and 4, we have P (Z ≤ 2.6) − P (Z ≤ 2) =
0.9953 − 0.9772 = 0.0181.
7. P (40.25 ≤ X ≤ 48.65).
Answer: This is P (−2.95 ≤ Z ≤ −1.27), which is equivalent to P (Z ≤ −1.27) − P (Z ≤
−2.95). Using the negative table and similar to the other parts, this gives P (40.25 ≤ X ≤
48.65) = .1020 − .0016 = 0.1004.
8. P (X = 57.85).
Answer: The answer is 0, because X is a continuous r.v..
Notice in some problems I used ‘<’ and some I used ‘≤’, but the values I used from the table
remain consistent. This is because of the continuous distribution property, that P (X = c) = 0
for any value c. Figure 6 shows the region of interest for each problem.
Figure 6: The shaded region represents the probability of interest for each problem.
Example 5 Claims filed under auto insurance policies follow a normal distribution with mean
19,400 and standard deviation 5,000.
1. What is the probability that a randomly selected claim does not exceed 18,000?
Answer: Let X denote the amount of the claim. P (X ≤ 18000) = P (Z ≤ −0.28) = 0.3897.
2. What is the probability that a randomly selected claim exceeds 25,000?
Answer: We want P (X ≥ 25000) = P (Z ≥ 1.12) = 1−P (Z ≤ 1.12) = 1−0.8686 = 0.1314.
3. What is the probability that a randomly selected claim is between 19,000 and 21,000?
Answer: This is asking for P(19000 ≤ X ≤ 21000) = P(−0.08 ≤ Z ≤ 0.32) = P(Z ≤ 0.32) − P(Z ≤ −0.08) = 0.6255 − 0.4681 = 0.1574.
4. What is the probability that if 4 claims are randomly selected, all of them exceed 20,000?
Answer: There are two steps to the solution. First, we find the probability that a random claim exceeds 20,000: P(X > 20000) = P(Z > 0.12) = 1 − P(Z < 0.12) = 1 − 0.5478 = 0.4522. Next, we treat the second part of the problem as a binomial distribution problem, which corresponds to having 4 trials, all 4 of which return a success (a claim exceeding 20,000). Using what we learned in chapter 5, the probability is 4C4 (0.4522)^4 (1 − 0.4522)^0 = 0.0418.
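The two-step calculation can be replayed in a few lines of Python (a sketch, not part of the notes; p is the table value from z = 0.12):

```python
import math

p_single = 1 - 0.5478        # P(one claim > 20,000), i.e. 0.4522 from the z-table
# P(all four independent claims exceed 20,000) = C(4,4) p^4 (1-p)^0
p_all_four = math.comb(4, 4) * p_single**4 * (1 - p_single)**0
print(round(p_all_four, 4))  # 0.0418
```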
Solving the Inverse Problem
Similar to the uniform distribution, we can ask the inverse question: given a cumulative
probability, the mean and the standard deviation, can we find the value that cause a
particular probability? To do this, we will need to ‘reverse look-up’ on the z-table. Below
describes the problem in a more formal setting:
Suppose X ∼ N (µ, σ 2 ), and we want to find the value xa such that P (X ≤ xa ) = a.
1. From the z-table, find the z-score (denote Za ) such that
• If a > 0.5, find Za from the positive table.
• If a < 0.5, find Za from the negative table.
2. Transform Za to xa using the formula xa = µ + σZa.
The reason we have the formula xa = µ + σZa is that the z-score is z = (x − µ)/σ; solving this for x gives the desired formula.
Example 6 Suppose in a city A, the housing price is normally distributed with mean 830 and
standard deviation 145 (both in thousands). Only the top 2% prices are considered luxury homes.
Calculate the minimum housing price required to be considered as luxury homes.
Answer: Let X denote the housing price in city A. We are interested in finding xa such that
P (X ≥ xa ) = 0.02. Note the two step procedure above is for finding the left tail probability.
Hence we can use the complement event: P (X ≥ xa ) = 1 − P (X < xa ) = 0.02 ⇒ P (X <
xa ) = 0.98 ⇒ a = 0.98. Since a > 0.5, we want Za from the positive table that gives 0.98,
which is roughly 2.055 (i.e. P(Z < 2.055) ≈ 0.98). Transforming back to the x value gives xa = 830 + 145 × 2.055 = 1127.975. Hence a house price of at least about 1.13 million is required to qualify as a luxury home in city A.
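The 'reverse look-up' can also be done numerically. This Python sketch (not part of the notes) finds Za by bisection on the standard normal CDF, then applies xa = µ + σZa; the result differs slightly from 1127.975 because the table read-off of 2.055 is itself rounded:

```python
import math

def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_quantile(a):
    # Reverse z-table lookup: bisection for the z with phi(z) = a.
    lo, hi = -10.0, 10.0
    for _ in range(80):          # interval halves each step
        mid = (lo + hi) / 2
        if phi(mid) < a:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu, sigma = 830, 145
za = z_quantile(0.98)            # ~2.054 (the notes read ~2.055 off the table)
xa = mu + sigma * za
print(round(xa, 1))              # ~1127.8 (thousands)
```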
Example 7 We can also look for the mean if we are given the x value, the standard deviation,
and the cumulative probability. Suppose at a different city B, the housing price is also normally
distributed with a standard deviation of 150 (in thousand). Also suppose that the 30th percentile
house has a price of 700 (thousand). What is the average housing price for city B?
Answer: Again let X denote the housing price in city B. We are given P (X < 700) = 0.30 (the
30th percentile), and we want to find µ. Using the same procedure as above, we first find Za
from the negative z-table (since a = 0.3 < 0.5), to get Za ≈ −0.525, i.e. P(Z < −0.525) ≈ 0.3. Using the conversion in step 2, we have 700 = µ + 150 × (−0.525). Doing a little bit of algebra, we get µ = 778.75. So the average housing price in city B is roughly 778.75 (thousand).
The exponential distribution has p.d.f. f(x) = λe^(−λx), where x ≥ 0, λ > 0, and e = 2.71828 . . . It has the following properties:
• It is a continuous distribution.
• It is a family of distributions.
• It is skewed to the right
• The x values range from zero to infinity.
• The curve steadily decreases as x gets larger.
• Its apex is always at x = 0.
• The mean is µ = 1/λ and the variance is σ² = 1/λ².
Similar to the normal distribution, the exponential curve is characterized by the rate
parameter λ, and different values of λ will result in a different curve. If X is an exponential
r.v. with parameter λ, we write X ∼ exp(λ). Below shows the graphs for a few λ values.
Explanation of λ and calculation of probability
Calculating the probabilities with the exponential distribution is much easier compared
to the normal distribution. Using calculus, we can show that
P (X ≤ x0 ) = 1 − e−λx0
and
P (X ≥ x0 ) = e−λx0 .
λ is usually interpreted as the rate of occurrence of an event, similar to the Poisson distribution. For example, it can denote the average arrival rate of bank customers within a certain period of time, the number of calls completed within a minute, or a bus arrival rate. In fact, the Poisson distribution captures the number of arrivals within a certain period of time, while the exponential distribution captures the time between two consecutive arrivals.
1. P (X < 4):
For this problem, λ = 1.5, and x0 = 4. Therefore, P (X < 4) = 1 − e−(1.5)(4) = 0.9975.
2. P (X > 1):
x0 = 1, so using the second equation above, P (X > 1) = e−(1.5)(1) = 0.2231.
3. P (2 < X < 3.5):
We will calculate this using the same trick from the normal probability: split the probability
into two parts, and subtract the larger portion by the smaller portion: P (2 < X < 3.5) =
P (X < 3.5) − P (X < 2) = (1 − e−(1.5)(3.5) ) − (1 − e−(1.5)(2) ) = 0.9948 − 0.9502 = 0.0446.
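These three probabilities can be checked in code. A Python sketch (not part of the notes), using the CDF P(X ≤ x) = 1 − e^(−λx) with λ = 1.5:

```python
import math

lam = 1.5

def cdf(x):                    # P(X <= x) = 1 - exp(-lambda * x)
    return 1 - math.exp(-lam * x)

print(round(cdf(4), 4))                 # 0.9975
print(round(math.exp(-lam * 1), 4))     # 0.2231, i.e. P(X > 1)
# 0.0445; the notes get 0.0446 because they round the intermediates first.
print(round(cdf(3.5) - cdf(2), 4))
```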
Example 9 The lifetime of a printer is exponentially distributed with mean 2 years.
1. What is the probability that a new printer lasts less than 1 year?
Answer: First note that we are given the mean lifetime, i.e. µ = 2, so we need to calculate our parameter λ first. Using the relationship µ = 1/λ, we have λ = 1/µ = 1/2. Let X denote the lifetime; then P(X < 1) = 1 − e^(−(1/2)(1)) = 0.3935. (λ = 0.5 corresponds to a rate of 0.5 printer lifetimes per year.)
2. What is the probability that a new printer lasts at least 2.5 years?
Answer: Using the formula above, we have P(X ≥ 2.5) = e^(−(1/2)(2.5)) = 0.2865.
Note that in the above example, we have converted the rate in terms of 1 year. Similar
to the Poisson distribution, it is important that the parameter is on the same scale as the
problem time interval of interest.
Example 10 Suppose a customer service agent can complete 3 calls in 10 minutes on average. What is the probability that a particular call will last between 2 and 3 minutes?
Answer: First note that we are given a rate of 3 calls per 10-minute interval, while the question asks about a 2-to-3-minute window, so we first convert the rate to a per-minute scale to make the problem easier to work with. A simple calculation shows that the rate is λ = 3/10 calls per minute. Using this we can calculate P(2 < X < 3) = P(X < 3) − P(X < 2) = (1 − e^(−(3/10)(3))) − (1 − e^(−(3/10)(2))) = 0.1422.
1. What is the probability that exactly 30 use self-directed work teams as a management tool?
Answer: We are looking for P (X = 30), where X ∼ B(70, 0.59). We can of course
calculate the exact probability, but the computation can be difficult, hence we will use the normal approximation. A quick check shows that this satisfies the rule of thumb:
np = 70(0.59) = 41.3 > 5 and nq = 70(0.41) = 28.7 > 5. Hence we have the normally
approximated distribution, Y ∼ N (41.3, 16.933). Note that P (Y = 30) = 0 since Y
is continuous, so this is why we need the continuity correction. To do so, we will use
P (X = 30) ≈ P (29.5 ≤ Y ≤ 30.5), which wraps around the desired value 30. The answer
is P (29.5 ≤ Y ≤ 30.5) = 0.0023 (following the steps above). The exact probability using
the binomial distribution is 0.0024, so the approximation is close enough.
2. What is the probability that fewer than 40 use it?
Answer: This asks for P (X < 40). Note that this is equivalent to asking P (X ≤ 39)
because X is a discrete r.v.. When converting to the continuous version, we need to wrap
around 39, hence the correct is done by adding 0.5 to 39, resulting in P (X ≤ 39) ≈ P (Y ≤
39.5) = 0.3309.
3. What is the probability that more than 30 but at most 45 use it? This asks for P (30 <
X ≤ 45). Again we will convert it to P (31 ≤ X ≤ 45). Once again we need to wrap
around 31 and 45 to make sure we include these two values in the continuous version,
hence the continuity approximation is P (31 ≤ X ≤ 45) ≈ P (30.5 ≤ Y ≤ 45.5) ≈ 0.8420.
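The continuity correction from part 1 can be compared against the exact binomial probability in code. A Python sketch (not part of the notes), using the error function in place of a z-table:

```python
import math

n, p = 70, 0.59
mu, var = n * p, n * p * (1 - p)     # 41.3 and 16.933
sigma = math.sqrt(var)

def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Normal approximation of P(X = 30) with continuity correction
approx = phi((30.5 - mu) / sigma) - phi((29.5 - mu) / sigma)
# Exact binomial probability C(70,30) p^30 (1-p)^40
exact = math.comb(n, 30) * p**30 * (1 - p)**40
print(round(approx, 4), round(exact, 4))  # 0.0023 0.0024, as in the notes
```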
Definition 5.2 A normal probability plot for a data set is a scatterplot with the
ranked data values on one axis and their corresponding expected z-scores from a standard
normal distribution on the other axis.
Example 12 Below is a normal probability plot for the NBA heights from the 2008-9 season.
Do these data appear to follow a normal distribution?
DS 212 Business Statistics I
Fall 2019
Lecture Notes 7: Sampling and Sampling Distributions1
There are two main objectives for this chapter: learn about different sampling methods,
and the sampling distributions of sample mean and sample proportion.
Definition 7.1 A sampling frame is a list, map, or directory of the individuals in a population, from which the sample is drawn.
When drawing a sample from the population, we are actually selecting the sample from
the sampling frame. For example:
• A telemarketer calls potential customers from the phone book directory (list/directory).
• A bank collects survey data from customers at a particular branch (map).
• An airline sends out survey to customers who took the flight a week ago (list).
Immediately we see that the sampling frame might not include the entire population of interest, or it might include units beyond the population of interest. We call these cases underregistration and overregistration, respectively.
Types of Sampling
There are two types of sampling: random sampling (also called probability sampling)
and nonrandom sampling (nonprobability sampling):
Random sampling:
• Every unit of the population has the same probability of being selected into the
sample.
• A chance mechanism is used in the selection process.
Nonrandom sampling:
Last update: November 13, 2020
• Every unit doesn’t have the same probability of being selected into the sample.
• Subject to selection bias.
• Not appropriate method for most statistical analysis methods.
Random Sampling
There are four basic random sampling techniques. They are
• Simple Random Sampling (SRS)
• Stratified Random Sampling
• Cluster (or Area) Sampling
• Systematic Sampling
Simple Random Sampling
• The simplest and most elementary sampling technique.
• Every possible subset of n units has the same chance of being selected.
Advantages:
• Simple scheme.
• All units have equal opportunity of being selected.
Disadvantages:
• Requires complete list of all members in population.
• Might require higher cost.
Suppose we want a sample of size n from a population of size N . A simple random sample
can be obtained as follows:
1. Each unit in the sampling frame is numbered from 1 to N .
2. Use a table of random numbers or a random number generator to select n items
into the sample.
Example 1 Below lists the population frame of 20 stocks, and we will use SRS to sample 6
stocks.
To select 6 random samples from these 20 stocks, we first number each stock 1 through 20 (i.e.
BAC - 1, HPE - 2, etc.). Then we can use a random number generator from a computer software
(for example, the RANDBETWEEN() in Microsoft Excel) to generate 6 non-repeat numbers between
1 to 20. If we don’t have a software that does this, we can also use a random number table (for
example, Table A.1 in the textbook) to pick out random numbers.
We will pick out two-digit numbers from the table (since the largest value, 20, is a two-digit
number). The first row of the table is 12651 61646 11769 75109 ... . The first two-digit number is 12, which is within 1-20, so our first sample is the stock VALE. The next two-digit number is 65. However, it is outside the range 1-20, so we discard it and move on to the next, which is 16; hence our second sample is AZN. The next number is 16 again, but since we have already selected 16, we discard it as well and move on. Continue like this until we get all 6 unique samples, which
are VALE, AZN, F, GDDY, GE, HPE.
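In software, the same SRS is one call. A Python sketch (not part of the notes; only seven tickers are named above, so the rest of the 20-stock frame here is made up, S8 through S20):

```python
import random

# Hypothetical 20-stock sampling frame
frame = ["BAC", "HPE", "VALE", "AZN", "F", "GDDY", "GE"] + [f"S{i}" for i in range(8, 21)]

random.seed(2019)                  # fix the seed so the draw is reproducible
sample = random.sample(frame, 6)   # 6 distinct stocks; every subset of 6 is equally likely
print(sample)
```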
• Least representative of the population.
• Possibility of high sampling error.
Example 3 The same researcher now samples students from classes instead. He chooses all
the classes at 12:35 on Monday as clusters (since one student cannot be at two classes at the
same time, this is a valid clustering), randomly selects a few classes in this time slot, and takes all the students from these classes as samples.
Systematic Sampling
• Randomly pick the first subject (within the first k = N/n subjects of the population).
• Then sample every kth subject.
Advantage:
• Very simple.
• Ensures the population will be evenly sampled.
Disadvantage:
• The process of selection can interact with a hidden periodic trait within the population.
Example 4 The researcher obtains a list of all students at SFSU, and simply goes down the list, selecting every 10th student.
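The two steps above can be sketched in Python (not part of the notes; the roster is hypothetical):

```python
import random

# Systematic sampling: random start within the first k = N // n units,
# then every k-th unit after that.
def systematic_sample(frame, n):
    k = len(frame) // n
    start = random.randrange(k)
    return frame[start::k][:n]

random.seed(7)
roster = list(range(1, 101))          # hypothetical roster of N = 100 student IDs
chosen = systematic_sample(roster, 10)
print(chosen)                         # 10 IDs, evenly spaced 10 apart
```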
(Figure: (a) Simple Random Sampling; (b) Stratified Sampling.)
Nonrandom Sampling
All four sampling schemes described above use some random selection procedure to select the sample. However, there are times when sampling is not done using a random process. This is called nonrandom sampling, and it includes convenience sampling, judgment sampling, quota sampling, and snowball sampling. These methods are not desirable because they are not random, hence the sampling error cannot be determined and taken into account when making inferences.
Errors in Sampling
Sampling errors occur when the sample is not representative of the population, for example, due to selection bias. If the sample is representative, then the error can be analyzed statistically. There are also nonsampling errors, such as missing data, input errors, measurement errors, etc. These errors can be eliminated with careful experimental design.
End of section
Sampling Distribution of x̄
Recall in Chapter 1 we studied the procedure of inferential statistics:
(Diagram: population parameters µ (mean), σ² (variance), etc.; select a random sample to estimate them.)
In order to use the sample mean x̄ to make inference about the population mean µ, we
need to first study some properties of it.
Consider taking a sample of size n from a population with mean µ and variance σ 2 . We
can calculate the sample mean of this sample, denote as x̄1 . Now collect another sample
of n and calculate the sample mean x̄2 . It is very likely that the sample means will be
different, i.e. x̄1 6= x̄2 . What happens if we do this repeatedly, and have many different
sample means x̄1 , x̄2 , x̄3 , . . . ?
Example 5 Suppose we have a population of 7 numbers:
63 56 60 58 54 57 66
Suppose we sample with replacement all the samples of size 2 from this population; we will end up with 7² = 49 samples. The (unique) averages are:

63.0 59.5 61.5 60.5 58.5
60.0 64.5 56.0 58.0 57.0
55.0 56.5 61.0 59.0 57.5
62.0 54.0 55.5 66.0

Figure 3 plots the histogram of the population and the histogram of the sample means. Notice that the population is skewed to the right, but the histogram of the means shows a relatively bell-shaped pattern.
Figure 3: On the left plots the histogram of the population. On the right plots the histogram of the 49 means from all samples of size 2.
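The enumeration can be replicated in a Python sketch (not part of the notes): it lists all 49 ordered samples of size 2 drawn with replacement and confirms that the mean of the sample means equals the population mean.

```python
from itertools import product
from statistics import mean

population = [63, 56, 60, 58, 54, 57, 66]

# All 7^2 = 49 ordered samples of size 2, with replacement, and their means
sample_means = [mean(s) for s in product(population, repeat=2)]
print(len(sample_means))             # 49 samples
print(len(set(sample_means)))        # 19 unique averages, as tabulated above
print(round(mean(sample_means), 4))  # 59.1429, equal to the population mean
```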
From the above example, we see that even though the original population is skewed,
the sample means still approach a Normal distribution. This shows that the sampling
distribution of the sample means (which is the distribution of the means of repeated
samples of size n from a population) approaches the Normal distribution. This actually
applies to any distributions we have for the population data, as this is promised by the
central limit theorem.
Definition 7.3 The Central Limit Theorem (CLT). Consider taking a sample of
size n from a population with mean µ and standard deviation σ. The CLT states that, for
sufficiently large sample size (n ≥ 30), the sample means, x̄, are approximately normally
distributed regardless of the distribution of the population. The mean and standard deviation of this distribution are µx̄ = µ and σx̄ = σ/√n respectively, i.e. x̄ ∼ N(µ, σ²/n). If the population is already normally distributed, then x̄ is also normally distributed for any sample size n (i.e. even with n = 1).
In simple words, the CLT says that, if the sample size is large enough, then it doesn’t
matter what the distribution of the population is, the sample mean will be normally
distributed. The figure below shows some examples of how the sample mean approaches
the normal distribution as the sample size increases.
Figure 4: The left column shows the population distribution. The right three columns show the distribution of the sample means when the sample sizes are n = 2, 5, 30. As
one can see, as n increases, the shape of the distribution becomes closer to the Normal
distribution. Image taken from Professor Soorapanth’s lecture notes.
Applications
Normally we ask a lot of questions related to the sample mean. For example,
• What is the probability that the average sales exceed $50,000 this month?
• How likely is it that a customer, on average, will spend more than $75 at BestBuy?
Since the CLT guarantees that the sample mean is normally distributed, we can use our knowledge about the normal distribution to find the probabilities for the above situations. Recall that if X ∼ N(µ, σ²), we can calculate the z-score with z = (x − µ)/σ. Similarly, since x̄ ∼ N(µ, σ²/n), we have the z-score for the sample mean:
z = (x̄ − µ)/(σ/√n).
Example 6 Suppose a population has a mean of 870 and a variance of 1,600. If a random
sample of size 64 is drawn from the population, what is the probability that the sample mean is
between 860 and 875?
Answer: Since n = 64 ≥ 30, we can apply the CLT to this problem. First note that µ = 870 and σ² = 1600. By the CLT, we have x̄ ∼ N(870, 1600/64) = N(870, 25). Therefore, the two z-scores are (860 − 870)/5 = −2 and (875 − 870)/5 = 1, so P(860 ≤ x̄ ≤ 875) = P(−2 ≤ Z ≤ 1) = 0.8413 − 0.0228 = 0.8185.
Example 7 A manufacturer of automobile batteries claims that the distribution of the lengths
of life of its best battery has a mean of 54 months and a standard deviation of 6 months. Suppose
a consumer group decides to check the claim by purchasing a sample of 50 of the batteries and
subjecting them to tests that estimate the battery’s life.
1. Assuming that the manufacturer’s claim is true, describe the sampling distribution of the
mean lifetime of a sample of 50 batteries.
Answer: Although we don’t know the population distribution, we have a sample of 50,
hence the CLT says that the sampling distribution of the mean is normally distributed,
with mean µx̄ = 54 and σx̄ = √650 ≈ 0.85. Thus x̄ ∼ N (54, 0.852 ).
2. Assuming that the manufacturer’s claim is true, what is the probability that the consumer
group’s sample has a mean life of 52 or fewer months?
Answer: We are interested in P(x̄ ≤ 52). The z-score is then z = (52 − 54)/0.85 = −2.35, i.e. we
want P(Z ≤ −2.35). Using the z-table we have P(Z ≤ −2.35) = 0.0094.
Example 8 The average cost of a one-bedroom apartment in a town is $650 per month. If
the population standard deviation is $100, what is the probability of randomly selecting a sample
of 50 one-bedroom apartments in this town and getting a sample mean of a) less than $630, b)
more than $665?
a) Let x̄ denote the sample mean rent of the 50 one-bedroom apartments. Using the CLT, we
have x̄ ∼ N(650, 100²/50) = N(650, 200). Hence, P(x̄ < 630) = P(Z < (630 − 650)/√200) = P(Z < −1.41) = 0.0793.
b) P(x̄ > 665) = P(Z > (665 − 650)/√200) = P(Z > 1.06) = 1 − P(Z ≤ 1.06) = 1 − 0.8554 = 0.1446.
Sampling Distribution of p̂
Instead of looking at the sample mean, we can also look at the sample proportion: rather than
averaging a set of outcomes, we look at the proportion of times a particular outcome occurs
across the entire experiment. For example, the proportion of heads in 20 coin flips, or the
percentage of teenagers who smoke in a town.
One can actually treat the sample proportion as a special case of the sample mean, where each
data value is 1 if a 'positive' event occurs and 0 otherwise; the proportion is then just the
average of these 1's and 0's. Hence we can also apply the CLT to the sample proportion.
Definition 7.5 Suppose np > 5 and nq > 5 (where p is the population proportion and q = 1 − p). Then the CLT guarantees that

p̂ ∼ N(p, pq/n).

This means we also have a z-score formula for the sample proportion, i.e.

z = (p̂ − p) / √(pq/n).

The term √(pq/n) is referred to as the standard error of the proportion.
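Definition 7.5 translates directly into code. The sketch below (standard library only; `phat_dist` is a hypothetical helper name) builds the sampling distribution of p̂ and enforces the sample-size conditions:

```python
from statistics import NormalDist
from math import sqrt

def phat_dist(p, n):
    """Sampling distribution of p-hat under the CLT: N(p, pq/n)."""
    q = 1 - p
    assert n * p > 5 and n * q > 5, "CLT sample-size conditions not met"
    return NormalDist(mu=p, sigma=sqrt(p * q / n))

# e.g. the probability of at least 60% heads in 20 fair coin flips
prob = 1 - phat_dist(0.5, 20).cdf(0.6)
print(round(prob, 4))
```

The standard error √(pq/n) appears as the `sigma` of the normal distribution.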
Example 9 Given that the population proportion of success is 0.38, what is the probability that a) out
of a sample of n = 100, we obtain at most 55 successes, b) out of a sample of n = 80, we obtain
at least 30 successes?
a) First let's check the necessary sample size conditions: np = 100(0.38) = 38 > 5 and
nq = 100(0.62) = 62 > 5. Hence we can apply the CLT here. We have p̂ ∼ N(0.38, (0.38)(0.62)/100).
Since the question asks for at most 55 out of 100 successes, this translates to p̂ = 55/100 =
0.55. Therefore, P(p̂ ≤ 0.55) = P(Z ≤ (0.55 − 0.38)/√((0.38)(0.62)/100)) = P(Z ≤ 3.50) = 0.9998.
b) Again, np = (80)(0.38) = 30.4 > 5 and nq = (80)(0.62) = 49.6 > 5, so p̂ ∼ N(0.38, (0.38)(0.62)/80).
With the problem asking for at least 30 successes, we have p̂ = 30/80 = 0.375. Hence we
want P(p̂ ≥ 0.375) = P(Z ≥ (0.375 − 0.38)/√((0.38)(0.62)/80)) = P(Z ≥ −0.09) = 0.5359.
Example 10 A market research firm did a report about residential telephone numbers in
New York State, which says 29% of all residential telephone numbers in NY are unlisted. A
telemarketing company uses random digit dialing equipment that dials residential numbers
at random, regardless of whether they are listed. The firm calls 2000 numbers in NY. What is
the probability that at most 590 of the numbers called are unlisted?
Answer: A quick check shows that the sample size requirements are satisfied. Then p̂ = 590/2000 =
0.295, and by the CLT, p̂ ∼ N(0.29, (0.29)(0.71)/2000) = N(0.29, 0.000103). Hence P(p̂ ≤ 0.295) = P(Z ≤
(0.295 − 0.29)/√0.000103) = P(Z ≤ 0.49) = 0.6879.
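The arithmetic here is easy to check numerically; this sketch uses exact values rather than the rounded z-table, so minor differences in the last digits are expected:

```python
from statistics import NormalDist
from math import sqrt

p, n = 0.29, 2000
phat = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))  # variance ≈ 0.000103

prob = phat.cdf(590 / 2000)  # P(p̂ <= 0.295)
print(round(prob, 4))  # → 0.6889
```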
Example 11 If 5% of the parts produced from a manufacturing process are defective, what is
the probability, in a random sample of 150 items, a) that there is at most 1 defective, b) there
are at least 10 defectives, c) there are between 5 and 15 defectives?
a) Again, a quick check shows that the sample size requirements are met. Having at most 1
defective translates to p̂ = 1/150 = 0.0067, and p̂ ∼ N(0.05, (0.05)(0.95)/150) = N(0.05, 0.000317).
Hence, P(p̂ ≤ 0.0067) = P(Z ≤ (0.0067 − 0.05)/√0.000317) = P(Z ≤ −2.44) = 0.0073.
b) p̂ = 10/150 = 0.0667, hence P(p̂ ≥ 0.0667) = P(Z ≥ (0.0667 − 0.05)/√0.000317) = P(Z ≥ 0.94) = 0.1736.
c) p̂1 = 5/150 = 0.0333 and p̂2 = 15/150 = 0.1, hence P(0.0333 ≤ p̂ ≤ 0.1) = P((0.0333 − 0.05)/√0.000317 ≤ Z ≤
(0.1 − 0.05)/√0.000317) = P(−0.94 ≤ Z ≤ 2.81) = 0.8239.
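Part (c) can be verified the same way; a sketch using exact standard-error values (the z-table answer uses z rounded to two decimals, so the result differs slightly in the third decimal):

```python
from statistics import NormalDist
from math import sqrt

p, n = 0.05, 150
phat = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))  # standard error ≈ 0.0178

# c) P(5/150 <= p̂ <= 15/150)
prob = phat.cdf(15 / 150) - phat.cdf(5 / 150)
print(round(prob, 4))
```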