Lecture 8: Student's T-Tests: MATH E-156: Mathematical Statistics March 25, 2019
1 Introduction
So far in this course, all of our work on hypothesis testing and interval es-
timation has suffered from a deep flaw. Whenever we’ve worked with the
normal distribution, we’ve had to assume that the population variance is
known to the investigator. Of course this is wildly implausible – how could
it be that the researcher knows the population variance with complete confi-
dence, but doesn’t know the population mean? We did this in order to make
the mathematics simple, so that we could concentrate on the arguments that
justified the fundamental procedures of hypothesis testing. But now we need
to drop this assumption, and learn how to handle the real-world case where
the population variance is not known to the experimenter.
2 Student’s t Distribution
In much of our work, we’ve used an expression where we take the difference
between the sample mean and the population mean, and then divide this by
the standard deviation of the sampling distribution of the sample mean:
$$\frac{\bar{X} - \mu}{\sqrt{\operatorname{Var}[\bar{X}]}}$$
We also know that the variance of the sampling distribution of the sample
mean is:
$$\operatorname{Var}[\bar{X}] = \frac{\sigma^2}{n}$$
This result holds for all distributions (at least those for which the variance
σ 2 is defined), and does not require any assumption of normality. If we
substitute this value into the first equation, we obtain the expression:
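$$Z = \frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}}$$
The expected value of Z is:
$$E[Z] = \frac{E[\bar{X}] - \mu}{\sqrt{\sigma^2/n}} = \frac{\mu - \mu}{\sqrt{\sigma^2/n}} = 0$$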
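Similarly, the variance of Z is:
$$
\begin{aligned}
\operatorname{Var}[Z] &= \operatorname{Var}\left[\frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}}\right] \\
&= \operatorname{Var}\left[\frac{1}{\sqrt{\sigma^2/n}} \cdot \bar{X} - \frac{\mu}{\sqrt{\sigma^2/n}}\right] \\
&= \operatorname{Var}\left[\frac{1}{\sqrt{\sigma^2/n}} \cdot \bar{X}\right] \\
&= \left(\frac{1}{\sqrt{\sigma^2/n}}\right)^2 \cdot \operatorname{Var}[\bar{X}] \\
&= \frac{1}{\sigma^2/n} \cdot \sigma^2/n \\
&= 1
\end{aligned}
$$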
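The two results above (expected value 0 and variance 1 for Z) do not rely on normality, so they can be checked by simulation on a non-normal population. The following is a minimal sketch in Python, used here purely for illustration; the Exponential(1) population, the seed, the sample size, and the number of replications are all assumptions chosen for the demonstration.

```python
import math
import random
import statistics

random.seed(156)  # arbitrary seed so the run is reproducible

n = 10                  # sample size (illustrative choice)
mu, sigma2 = 1.0, 1.0   # mean and variance of an Exponential(1) population
reps = 20_000           # number of simulated samples

z_values = []
for _ in range(reps):
    sample = [random.expovariate(1.0) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    # Standardize the sample mean using the KNOWN population variance.
    z_values.append((x_bar - mu) / math.sqrt(sigma2 / n))

z_mean = statistics.fmean(z_values)
z_var = statistics.pvariance(z_values)
print(f"mean of Z ~ {z_mean:.3f}, variance of Z ~ {z_var:.3f}")
```

The simulated mean and variance should land close to 0 and 1, up to Monte Carlo error.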
estimate this quantity, and then to use $S^2$ in place of $\sigma^2$ in our expression:
$$\frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}} \longrightarrow \frac{\bar{X} - \mu}{\sqrt{S^2/n}}$$
Other than this one substitution, everything else stays the same.
We’ll start with the exponential distribution. In this simulation, we draw
1,000 samples of size n = 10 from an exponential distribution with scale
parameter λ = 1, so that the density function is:
$$f_X(x) = e^{-x}, \quad x \ge 0$$
For each sample we calculate the sample mean and the sample variance, and
then draw a point such that the x coordinate is the value of the sample mean
and the y coordinate is the value of the sample variance. The resulting graph
looks like this:
Here you can clearly see that the distribution of the sample variance depends
on the value of the sample mean:
• For samples that have a smaller value for the observed sample mean
the sample variance has small variability, but for samples with a larger
value for the observed sample mean the sample variance has greater
variability.
Clearly, for the exponential distribution, the sample mean random variable
X and the sample variance random variable S 2 are not independent.
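The dependence between the sample mean and the sample variance can be given a rough numerical summary by computing their correlation across simulated samples. The sketch below is an illustration only: the helper name `mean_variance_correlation` is hypothetical, the seed is arbitrary, and correlation is a crude proxy (a correlation near zero does not by itself prove independence). It repeats the simulation for the exponential distribution and, for comparison, for a standard normal population.

```python
import random
import statistics

def mean_variance_correlation(draw, n=10, reps=1000):
    """Correlation between sample mean and sample variance over `reps` samples.

    `draw` is a zero-argument function returning one observation.
    """
    means, variances = [], []
    for _ in range(reps):
        sample = [draw() for _ in range(n)]
        means.append(statistics.fmean(sample))
        variances.append(statistics.variance(sample))  # divisor n - 1
    m_bar = statistics.fmean(means)
    v_bar = statistics.fmean(variances)
    cov = sum((m - m_bar) * (v - v_bar) for m, v in zip(means, variances))
    ss_m = sum((m - m_bar) ** 2 for m in means)
    ss_v = sum((v - v_bar) ** 2 for v in variances)
    return cov / (ss_m * ss_v) ** 0.5

random.seed(8)  # arbitrary seed for reproducibility
corr_exponential = mean_variance_correlation(lambda: random.expovariate(1.0))
corr_normal = mean_variance_correlation(lambda: random.gauss(0.0, 1.0))
print(f"exponential: {corr_exponential:.2f}, normal: {corr_normal:.2f}")
```

For the exponential population the correlation should be clearly positive, while for the normal population it should be close to zero.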
$$f_X(x) = \frac{3 \cdot \theta^3}{(x + 10)^4}$$
Here the graph is very similar to the graph for the exponential distribution:
Here you can clearly see that the distribution of the sample variance depends
on the value of the sample mean:
• For samples that have a small value for the observed sample mean, the
sample variance is also very small, but for samples that have a larger
value for the observed sample mean the sample variance tends to also
be larger.
• For samples that have a smaller value for the observed sample mean
the sample variance has small variability, but for samples with a larger
value for the observed sample mean the sample variance has greater
variability.
Just as with the exponential distribution, the sample mean random variable
X and the sample variance random variable S 2 are not independent for this
2-parameter Pareto distribution.
Figure 2: Sample mean vs. sample variance for 2-parameter Pareto distribution
After all this, let’s finally take a look at the normal distribution. This
simulation is based on drawing 1,000 simulated samples of size n = 10 from
a normal distribution:
Figure 3: Sample mean vs. sample variance for uniform distribution
$$\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - n \cdot \bar{X} = n \cdot \bar{X} - n \cdot \bar{X} = 0$$
Figure 4: Sample mean vs. sample variance for normal distribution
Now we’re going to use a sneaky algebraic trick. We’ll start with a sum
over squared terms, and inside the squared terms we’ll add and subtract the
sample mean. Since we’re both adding and subtracting, this won’t affect the
sum, and then when we expand the expression we’ll use the little result we
just proved to get rid of one of the terms:
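$$
\begin{aligned}
\sum_{i=1}^{n} (X_i - \mu)^2 &= \sum_{i=1}^{n} \left((X_i - \bar{X}) + (\bar{X} - \mu)\right)^2 \\
&= \sum_{i=1}^{n} (X_i - \bar{X})^2 + 2 \cdot \sum_{i=1}^{n} (X_i - \bar{X}) \cdot (\bar{X} - \mu) + \sum_{i=1}^{n} (\bar{X} - \mu)^2 \\
&= \sum_{i=1}^{n} (X_i - \bar{X})^2 + 2 \cdot (\bar{X} - \mu) \cdot \sum_{i=1}^{n} (X_i - \bar{X}) + n \cdot (\bar{X} - \mu)^2 \\
&= \sum_{i=1}^{n} (X_i - \bar{X})^2 + 2 \cdot (\bar{X} - \mu) \cdot 0 + n \cdot (\bar{X} - \mu)^2 \\
&= \sum_{i=1}^{n} (X_i - \bar{X})^2 + n \cdot (\bar{X} - \mu)^2
\end{aligned}
$$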
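Because this decomposition is a purely algebraic identity, it holds for any set of numbers, and it can be confirmed numerically. The following sketch uses arbitrary illustrative values for µ, σ, and n (all assumptions) and checks both the identity and the fact that the deviations from the sample mean sum to zero.

```python
import math
import random

random.seed(1)                 # arbitrary seed
mu, sigma, n = 5.0, 2.0, 10    # illustrative values only
x = [random.gauss(mu, sigma) for _ in range(n)]
x_bar = sum(x) / n

lhs = sum((xi - mu) ** 2 for xi in x)         # sum of (X_i - mu)^2
within = sum((xi - x_bar) ** 2 for xi in x)   # sum of (X_i - X_bar)^2
between = n * (x_bar - mu) ** 2               # n * (X_bar - mu)^2
cross = sum(xi - x_bar for xi in x)           # should vanish (up to rounding)

print(lhs, within + between, cross)
```

The first two printed numbers agree up to floating-point rounding, and the third is zero up to rounding.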
We can re-write the second term on the right-hand side:
$$n \cdot \left(\frac{\bar{X} - \mu}{\sigma}\right)^2 = \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2$$
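Thus, dividing the identity through by $\sigma^2$ and using $\sum_{i=1}^{n} (X_i - \bar{X})^2 = (n - 1) \cdot S^2$, we have:
$$\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2 = \frac{(n - 1) \cdot S^2}{\sigma^2} + \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2$$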
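$$U = \sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2$$
$$V = \frac{(n - 1) \cdot S^2}{\sigma^2}$$
$$W = \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2$$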
Thus, our equation is:
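$$\underbrace{\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2}_{U} = \underbrace{\frac{(n - 1) \cdot S^2}{\sigma^2}}_{V} + \underbrace{\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2}_{W}$$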
Now what can we say about the distribution of these random variables
U, V, and W? For U, note that each $X_i$ is normally distributed with expected
value $\mu$ and variance $\sigma^2$, thus the expression inside the parentheses is
a standard normal, and since U consists of the sum of the squares of these
independent standard normal random variables, we can conclude that U is a
chi-squared random variable with n degrees of freedom:
$$U \sim \chi^2(n)$$
For W , the expression inside the parentheses is a standard normal random
variable, and since W is the square of this standard normal random variable,
it has a chi-squared distribution with 1 degree of freedom:
$$W \sim \chi^2(1)$$
We don’t yet know anything about the distribution of V, but we can say one
thing: since V is a function of the sample variance S 2 , and W is a function of
the sample mean X, and we know that for a normal population the sample
variance and the sample mean are independent, then it must be the case that
V and W are independent.
$$\frac{(n - 1) \cdot S^2}{\sigma^2} \sim \chi^2(n - 1)$$
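One way to build confidence in this result is to simulate many normal samples and compare the first two moments of (n − 1) · S²/σ² with those of a chi-squared distribution with n − 1 degrees of freedom, which has mean n − 1 and variance 2(n − 1). The sketch below is only a moment check under assumed values of µ, σ, n, and the seed; matching moments is consistent with the claim but is not a proof of it.

```python
import random
import statistics

random.seed(2019)             # arbitrary seed
mu, sigma, n = 0.0, 3.0, 10   # illustrative population and sample size
reps = 20_000

v_values = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    s2 = statistics.variance(sample)          # sample variance, divisor n - 1
    v_values.append((n - 1) * s2 / sigma ** 2)

v_mean = statistics.fmean(v_values)
v_var = statistics.variance(v_values)
print(f"mean ~ {v_mean:.2f} (theory {n - 1}), variance ~ {v_var:.2f} (theory {2 * (n - 1)})")
```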
4 The t Distribution
Suppose we have a random sample $S = \{X_1, X_2, \ldots, X_n\}$ from a normally
distributed population with mean µ and variance σ 2 . In this section we want
to determine the probability density function for the T statistic:
$$T = \frac{\bar{X} - \mu}{\sqrt{S^2/n}}$$
There are really two steps in this process:
• First, we want to define the distribution of the T statistic in a more
convenient form.
• Second, we want to use this new definition to calculate fT (t), the den-
sity function for the random variable T .
In practice, when we want to perform calculations with this probability distri-
bution, we invariably will use some form of statistical or numerical software,
so you can actually get away without learning all the details of this section. But
I encourage you to try to grasp at least the overall strategy, as the derivation
uses many of the tools that we’ve developed so far.
Next, in the denominator we’ll rewrite the quotient S 2 /σ 2 by multiplying
and dividing by (n − 1):
$$\sqrt{\frac{S^2}{\sigma^2}} = \sqrt{\frac{(n - 1) \cdot S^2 / \sigma^2}{n - 1}}$$
So when we’re all done with all of these manipulations, we have:
$$T = \frac{\bar{X} - \mu}{\sqrt{S^2/n}} = \frac{\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{(n - 1) \cdot S^2 / \sigma^2}{n - 1}}}$$
At first, this looks utterly pointless, taking a relatively simple expression and
making it vastly more complicated. But in fact, we have done the contrary,
and this is an extraordinary result. First, notice that the numerator is just
a standard normal random variable, which we’ll denote as Z:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
Notice that Z is really just a function of the sample mean random variable
X. Next, for the denominator, let’s define the random variable W :
$$W = \frac{(n - 1) \cdot S^2}{\sigma^2}$$
We know from the previous section that W is a chi-squared random variable
with n − 1 degrees of freedom. Notice that W is just a function of the sample
variance random variable S 2 . Thus, we can now write our T statistic as:
$$T = \frac{Z}{\sqrt{W/(n - 1)}}$$
The numerator is a function of the sample mean random variable X, and
the denominator is a function of the sample variance random variable S 2 .
For a normally distributed population, we’ve seen that these two random
variables are independent. Thus, the numerator of T and the denominator
of T are independent as well. So we can now formally define the probability
distribution for the T statistic:
Let Z be a standard normal random variable, let W be a chi-squared
random variable with ν degrees of freedom, and let Z and W be
independent. Then a random variable T has the t distribution with
ν degrees of freedom if:
$$T = \frac{Z}{\sqrt{W/\nu}}$$
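The definition can be turned directly into a simulation: draw Z, build W as a sum of ν squared standard normals (which is chi-squared with ν degrees of freedom), and form the ratio. The sketch below, with an arbitrary seed and replication count, checks that about 5% of the simulated values fall beyond ±2.36462, the 0.975 quantile for 7 degrees of freedom quoted in the one-sample example of these notes.

```python
import math
import random

random.seed(7)       # arbitrary seed
nu = 7               # degrees of freedom
reps = 50_000
critical = 2.36462   # Q_T(0.975, 7) as quoted in the notes

def draw_t(nu):
    """One draw of T = Z / sqrt(W / nu) with Z and W independent."""
    z = random.gauss(0.0, 1.0)
    # Sum of nu independent squared standard normals is chi-squared(nu).
    w = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return z / math.sqrt(w / nu)

tail_fraction = sum(abs(draw_t(nu)) > critical for _ in range(reps)) / reps
print(f"fraction with |T| > {critical}: {tail_fraction:.4f}")
```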
• In the second method, we could calculate a p-value, and if the p-value
is less than the pre-specified significance level, we would reject the null
hypothesis.
We also argued that all of these approaches are equivalent, in that for every
sample we will reject the null hypothesis using one of the methods if and
only if we would reject the null hypothesis using all of the methods.
$$H_0 : \mu = \mu_0$$
$$H_A : \mu \neq \mu_0$$
assume that the null hypothesis is true and then under this assumption find
a probability statement of this form:
$$\Pr\left(L \le \bar{X} \le U\right) = 1 - \alpha$$
As before, L and U are specific values that will ensure that the probability
statement is correct. Unfortunately, we don’t have any results about $\bar{X}$ that
can directly give us the values for L and U. However, we do know
something about the test statistic T. Let’s set up a probability interval for
this statistic:
$$\Pr\left(V \le \frac{\bar{X} - \mu}{\sqrt{S^2/n}} \le W\right) = 1 - \alpha$$
We know from our previous work that the test statistic in this probability
statement follows a t distribution with n − 1 degrees of freedom. Let $Q_T(p, n - 1)$
denote the pth quantile of a t distribution with n − 1 degrees of freedom,
so that:
$$\Pr\left(T \le Q_T(p, n - 1)\right) = p$$
Let’s choose for V and W the values:
$$V = Q_T(\alpha/2, n - 1)$$
$$W = Q_T(1 - \alpha/2, n - 1)$$
Notice that V will, by definition, cut off an area of α/2 in the lower tail, while
W will cut off an area of α/2 in the upper tail. Then we have:
$$\Pr\left(Q_T(\alpha/2, n - 1) \le \frac{\bar{X} - \mu}{\sqrt{S^2/n}} \le Q_T(1 - \alpha/2, n - 1)\right) = 1 - \alpha$$
Now we do some algebra, and we obtain:
$$\Pr\left(\mu + Q_T(\alpha/2, n - 1) \cdot \sqrt{\frac{S^2}{n}} \le \bar{X} \le \mu + Q_T(1 - \alpha/2, n - 1) \cdot \sqrt{\frac{S^2}{n}}\right) = 1 - \alpha$$
At this point, we have solved the problem of determining the values of L and
U:
$$L = \mu + Q_T(\alpha/2, n - 1) \cdot \sqrt{\frac{S^2}{n}}$$
$$U = \mu + Q_T(1 - \alpha/2, n - 1) \cdot \sqrt{\frac{S^2}{n}}$$
Let’s see an example of how this works. Let’s suppose that we draw
a sample of size 8 and observe a sample mean of x = 47.2 and a sample
variance of s2 = 27. As usual, we will perform our test at a significance level
of α = 0.05. Since the sample size is n = 8, then the degrees of freedom
will be df = 8 − 1 = 7. We want to perform a two-sided test for the null
hypothesis that µ = 53. Thus, our null and alternative hypotheses are:
$$H_0 : \mu = 53$$
$$H_A : \mu \neq 53$$
$$Q_T(0.025, 7) = -2.36462$$
$$Q_T(0.975, 7) = 2.36462$$
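Now we have everything we need to calculate the rejection region for this
test:
$$
\begin{aligned}
L &= \mu_0 + Q_T(0.025, 7) \cdot \sqrt{\frac{s^2}{n}} \\
&= 53 + (-2.36462) \cdot \sqrt{\frac{27}{8}} \\
&= 48.65591
\end{aligned}
$$
$$
\begin{aligned}
U &= \mu_0 + Q_T(0.975, 7) \cdot \sqrt{\frac{s^2}{n}} \\
&= 53 + (+2.36462) \cdot \sqrt{\frac{27}{8}} \\
&= 57.34409
\end{aligned}
$$
Thus the rejection region consists of all values outside the interval
(48.65591, 57.34409). Since the observed sample mean is $\bar{x} = 47.2$, which is
below L and therefore inside the rejection region, we reject the
null hypothesis and consider this to be strong evidence against the null.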
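The arithmetic of this example can be reproduced in a few lines. The sketch below takes the quantile 2.36462 from the text rather than computing it, and the variable names are illustrative choices rather than anything prescribed by the notes.

```python
import math

n, x_bar, s2, mu0 = 8, 47.2, 27.0, 53.0
q_hi = 2.36462   # Q_T(0.975, 7), copied from the text; Q_T(0.025, 7) = -q_hi

half_width = q_hi * math.sqrt(s2 / n)
lower = mu0 - half_width   # L
upper = mu0 + half_width   # U

# Reject when the observed sample mean falls outside [L, U].
reject = x_bar < lower or x_bar > upper
print(f"L = {lower:.5f}, U = {upper:.5f}, reject H0: {reject}")
```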
5.2 Method 2: The p-Value Method
With the p-value approach, we want to calculate the probability of obtaining
a test statistic that is as extreme or more extreme than what was actually
observed, given the assumption that the null hypothesis is true. For a one-
sample test, we can use the T statistic:
$$T = \frac{\bar{X} - \mu}{\sqrt{S^2/n}}$$
We know that, if the null hypothesis is true, this test statistic will follow a
t distribution with n − 1 degrees of freedom. Thus, for the p-value method,
we calculate the observed value of this test statistic, denoted t, and then
calculate the area in the tails cut off by t and −t (for a two-sided test). If
this area is less than the pre-specified significance level, then we should reject
the null hypothesis.
$$t = \frac{\bar{x} - \mu_0}{\sqrt{s^2/n}} = \frac{47.2 - 53}{\sqrt{27/8}} = -3.15712$$
Now we calculate the area under the tails cut off by t = −3.15712 and
t = +3.15712 for a t distribution with df = n − 1 = 7 degrees of freedom:
$$
\begin{aligned}
p &= \Pr(T \le -3.15712) + \Pr(T \ge 3.15712) \\
&= 0.00800 + 0.00800 \\
&= 0.01599
\end{aligned}
$$
Thus, since the p-value is less than the pre-specified significance level of
α = 0.05, we again consider this data to be strong evidence against the null
hypothesis, and we reject the null.
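In practice the tail areas come from statistical software. To keep the illustration dependency-free, the sketch below integrates the t density numerically with Simpson's rule; this is an assumed stand-in for a library routine, not the method used in the notes, and the function names are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_obs, df, steps=2000):
    """Area outside [-|t_obs|, |t_obs|], via composite Simpson's rule (steps must be even)."""
    a = abs(t_obs)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total   # symmetry: double the area on [0, a]
    return 1 - central_area

n, x_bar, s2, mu0 = 8, 47.2, 27.0, 53.0
t_obs = (x_bar - mu0) / math.sqrt(s2 / n)
p_value = two_sided_p_value(t_obs, n - 1)
print(f"t = {t_obs:.5f}, p-value = {p_value:.5f}")
```

The result can be compared with the p-value reported in the text; small differences in the last digit are possible because of rounding.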
5.3 Method 3: Confidence Intervals
Finally, we can use confidence intervals to perform a hypothesis test. We
start with the probability interval for the T statistic:
$$\Pr\left(Q_T(\alpha/2, n - 1) \le \frac{\bar{X} - \mu}{\sqrt{S^2/n}} \le Q_T(1 - \alpha/2, n - 1)\right) = 1 - \alpha$$
With a little bit of algebra, we can re-arrange this to:
$$\Pr\left(L^* \le \mu \le U^*\right) = 1 - \alpha$$
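$$
\begin{aligned}
\bar{x} &= 47.2 \\
s^2 &= 27 \\
n &= 8 \\
\alpha &= 0.05 \\
Q_T(\alpha/2, n - 1) &= -2.36462 \\
Q_T(1 - \alpha/2, n - 1) &= +2.36462
\end{aligned}
$$
The lower limit of the confidence interval is:
$$
\begin{aligned}
L^* &= \bar{x} + Q_T(\alpha/2, n - 1) \cdot \sqrt{\frac{s^2}{n}} \\
&= 47.2 + (-2.36462) \cdot \sqrt{\frac{27}{8}} \\
&= 42.85591
\end{aligned}
$$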
The upper limit of the confidence interval is:
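$$
\begin{aligned}
U^* &= \bar{x} + Q_T(1 - \alpha/2, n - 1) \cdot \sqrt{\frac{s^2}{n}} \\
&= 47.2 + 2.36462 \cdot \sqrt{\frac{27}{8}} \\
&= 51.54409
\end{aligned}
$$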
Thus, the 95% confidence interval is (42.9, 51.5), and since this does not
contain the null value µ0 = 53, we once again view this data as providing
strong evidence against the null hypothesis and we reject the null.
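As a quick check of the interval, the following sketch recomputes both limits from the summary statistics. The quantile is again copied from the text, and the variable names are illustrative.

```python
import math

n, x_bar, s2, mu0 = 8, 47.2, 27.0, 53.0
q_hi = 2.36462   # Q_T(0.975, 7), copied from the text

margin = q_hi * math.sqrt(s2 / n)
ci = (x_bar - margin, x_bar + margin)   # (L*, U*)

# The test rejects H0 exactly when the null value lies outside the interval.
contains_null = ci[0] <= mu0 <= ci[1]
print(f"95% CI = ({ci[0]:.5f}, {ci[1]:.5f}), contains mu0: {contains_null}")
```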
One of the most natural questions to ask about such a model is whether
the population means µX and µY are equal. To investigate this, consider the
test statistic D, the difference of the two sample means:
$$D = \bar{X} - \bar{Y}$$
I’m going to call this statistic D the “two-sample difference of sample means”,
and you should be aware that no one else does this. Then the expected value
of D is:
$$
\begin{aligned}
E[D] &= E[\bar{X} - \bar{Y}] \\
&= E[\bar{X}] - E[\bar{Y}] \\
&= \mu_X - \mu_Y
\end{aligned}
$$
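$$
\begin{aligned}
\operatorname{Var}[D] &= \operatorname{Var}[\bar{X} - \bar{Y}] \\
&= \operatorname{Var}[\bar{X}] + \operatorname{Var}[\bar{Y}] \\
&= \frac{\sigma^2}{n_X} + \frac{\sigma^2}{n_Y} \\
&= \sigma^2 \cdot \left(\frac{1}{n_X} + \frac{1}{n_Y}\right)
\end{aligned}
$$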
$$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\sigma^2 \cdot \left(\dfrac{1}{n_X} + \dfrac{1}{n_Y}\right)}}$$
Since the populations for X and Y are normally distributed, then X and
Y are normally distributed, and thus the random variable Z is a standard
normal random variable.
an unbiased estimator of the common variance σ 2 . However, if we did this,
we would be ignoring the data from Y , and that doesn’t seem like such a
great idea. Likewise, we could just use $S_Y^2$, the sample variance for the Y
population alone, and as before this will be an unbiased estimator for the
common variance $\sigma^2$. But this is again an inefficient procedure, because we
are not incorporating the information from the X population. Instead, the
best thing to do is to take some sort of weighted average of $S_X^2$ and $S_Y^2$,
where the weights add up to 1, because this will be an unbiased estimator
that incorporates the data from both X and Y:
$$
\begin{aligned}
E\left[\frac{a}{a+b} \cdot S_X^2 + \frac{b}{a+b} \cdot S_Y^2\right] &= \frac{a}{a+b} \cdot E\left[S_X^2\right] + \frac{b}{a+b} \cdot E\left[S_Y^2\right] \\
&= \frac{a}{a+b} \cdot \sigma^2 + \frac{b}{a+b} \cdot \sigma^2 \\
&= \sigma^2
\end{aligned}
$$
It turns out that it is very nice to select a and b to be the degrees of freedom
of $S_X^2$ and $S_Y^2$, respectively:
$$
\begin{aligned}
a &= n_X - 1 \\
b &= n_Y - 1 \\
a + b &= (n_X - 1) + (n_Y - 1) \\
&= n_X + n_Y - 2
\end{aligned}
$$
When we use these weights for a and b, the resulting estimator is called the
pooled estimator of variance, and is denoted Sp2 :
$$S_p^2 = \frac{n_X - 1}{n_X + n_Y - 2} \cdot S_X^2 + \frac{n_Y - 1}{n_X + n_Y - 2} \cdot S_Y^2$$
What is the distribution of this pooled estimator of the variance Sp2 ? Let’s
clear fractions by multiplying by $n_X + n_Y - 2$ and divide by $\sigma^2$, so that we have:
$$\frac{(n_X + n_Y - 2) \cdot S_p^2}{\sigma^2} = \frac{(n_X - 1) \cdot S_X^2}{\sigma^2} + \frac{(n_Y - 1) \cdot S_Y^2}{\sigma^2}$$
Let’s look at the two terms on the right-hand side of this equation. By our
basic result on the sampling distribution of the sample variance for a normal
population, the first term is a chi-squared random variable with nX − 1
degrees of freedom:
$$\frac{(n_X - 1) \cdot S_X^2}{\sigma^2} \sim \chi^2(n_X - 1)$$
Similarly, the second term on the right-hand side is a chi-squared random
variable with nY − 1 degrees of freedom:
$$\frac{(n_Y - 1) \cdot S_Y^2}{\sigma^2} \sim \chi^2(n_Y - 1)$$
Since the populations X and Y are independent, these two random variables
will be independent, and therefore their sum will be a chi-squared random
variable with $(n_X - 1) + (n_Y - 1) = n_X + n_Y - 2$ degrees of freedom:
$$\frac{(n_X - 1) \cdot S_X^2}{\sigma^2} + \frac{(n_Y - 1) \cdot S_Y^2}{\sigma^2} \sim \chi^2(n_X + n_Y - 2)$$
But this sum of random variables can also be expressed in terms of the pooled
variance estimator Sp2 , so we have:
$$\frac{(n_X + n_Y - 2) \cdot S_p^2}{\sigma^2} \sim \chi^2(n_X + n_Y - 2)$$
Now it’s time for the grand finale! Let’s review what we’ve done so far.
First, we showed that the random variable Z was a standard normal random
variable:
$$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\sigma^2 \cdot \left(\dfrac{1}{n_X} + \dfrac{1}{n_Y}\right)}} \sim N(0, 1)$$
Next, we found a random variable involving the pooled sample variance
that followed a chi-squared distribution with $n_X + n_Y - 2$ degrees of freedom:
$$\frac{(n_X + n_Y - 2) \cdot S_p^2}{\sigma^2} \sim \chi^2(n_X + n_Y - 2)$$
Note that these two random variables are independent, because Z depends
only on the sample means X and Y , and the pooled sample variance depends
only on the sample variances $S_X^2$ and $S_Y^2$, and we know that for normally
distributed populations the sample mean and sample variance are independent.
Now recall that the t distribution is defined as the ratio of two independent
random variables, with the numerator a standard normal random variable
Z and the denominator the square root of a chi-squared random variable W
divided by its degrees of freedom ν:
$$T = \frac{Z}{\sqrt{W/\nu}}$$
Substituting, we have:
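$$T = \frac{\dfrac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\sigma^2 \cdot \left(\dfrac{1}{n_X} + \dfrac{1}{n_Y}\right)}}}{\sqrt{\dfrac{(n_X + n_Y - 2) \cdot S_p^2 / \sigma^2}{n_X + n_Y - 2}}}$$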
This looks scary, but some simplification is possible: in the denominator, we
can cancel the term nX + nY − 2, and we can cancel a σ 2 from both the
numerator and denominator. We end up with:
$$T = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{S_p^2 \cdot \left(\dfrac{1}{n_X} + \dfrac{1}{n_Y}\right)}}$$
• Finally, we calculate the T statistic:
$$T = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{S_p^2 \cdot \left(\dfrac{1}{n_X} + \dfrac{1}{n_Y}\right)}}$$
This test statistic will have a t distribution with nX + nY − 2 degrees
of freedom.
$$H_0 : \mu_X = \mu_Y$$
$$H_A : \mu_X \neq \mu_Y$$
Note that, when the null hypothesis is true, the two population means are
equal, so that µX − µY = 0. Thus, under the null, the sampling distribution
of the two-sample difference of sample means is:
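$$
\begin{aligned}
T &= \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{S_p^2 \cdot \left(\dfrac{1}{n_X} + \dfrac{1}{n_Y}\right)}} \\
&= \frac{(\bar{X} - \bar{Y}) - (0)}{\sqrt{S_p^2 \cdot \left(\dfrac{1}{n_X} + \dfrac{1}{n_Y}\right)}} \\
&= \frac{\bar{X} - \bar{Y}}{\sqrt{S_p^2 \cdot \left(\dfrac{1}{n_X} + \dfrac{1}{n_Y}\right)}}
\end{aligned}
$$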
Now we can construct a rejection region for a pre-specified significance
level α. We start with the probability interval statement:
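$$\Pr\left(Q_T(\alpha/2, n_X + n_Y - 2) \le \frac{\bar{X} - \bar{Y}}{\sqrt{S_p^2 \cdot \left(\dfrac{1}{n_X} + \dfrac{1}{n_Y}\right)}} \le Q_T(1 - \alpha/2, n_X + n_Y - 2)\right) = 1 - \alpha$$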
If we multiply through by the denominator of the central term, we end up
with a probability statement of the form:
$$\Pr\left(L \le \bar{X} - \bar{Y} \le U\right) = 1 - \alpha$$
In this statement, we have:
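$$L = \sqrt{S_p^2 \cdot \left(\frac{1}{n_X} + \frac{1}{n_Y}\right)} \cdot Q_T(\alpha/2, n_X + n_Y - 2)$$
$$U = \sqrt{S_p^2 \cdot \left(\frac{1}{n_X} + \frac{1}{n_Y}\right)} \cdot Q_T(1 - \alpha/2, n_X + n_Y - 2)$$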
Where do we get these t quantiles from? In R, we use the function qt, so for
example to obtain the q = 2.5% quantile for a t distribution with 17 degrees
of freedom, we would use the function:
qt(0.025, 17)
In Excel, we would use the formula:
= T.INV(0.025, 17)
Using either platform, you will end up with the value $t_{0.025,17} = -2.109816$.
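For readers working without R or Excel, a t quantile can also be approximated from first principles. The sketch below is a dependency-free illustration, not a replacement for a library routine (in Python one would normally call `scipy.stats.t.ppf`): it integrates the t density with Simpson's rule to get the CDF and then inverts it by bisection. The function names `t_cdf` and `t_quantile` are hypothetical.

```python
import math

def t_cdf(t, df, steps=2000):
    """CDF of the t distribution via composite Simpson's rule on [0, |t|]."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: coef * (1 + u * u / df) ** (-(df + 1) / 2)
    a = abs(t)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    half_area = (h / 3) * total          # area between 0 and |t|
    return 0.5 + half_area if t >= 0 else 0.5 - half_area

def t_quantile(p, df, lo=-50.0, hi=50.0):
    """Invert the CDF by bisection; the bracket [-50, 50] is an assumption
    that is adequate for moderate df and non-extreme p."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q_17 = t_quantile(0.025, 17)   # analogue of qt(0.025, 17)
q_13 = t_quantile(0.025, 13)   # analogue of qt(0.025, 13)
print(f"{q_17:.6f} {q_13:.6f}")
```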
Let’s see an example. Suppose we conduct an experiment, and we observe
these values:
$$
\begin{aligned}
n_X &= 9 \\
n_Y &= 6 \\
\bar{x} &= 45 \\
\bar{y} &= 37 \\
s_X^2 &= 43 \\
s_Y^2 &= 46
\end{aligned}
$$
Now we can calculate the pooled variance estimate s2P :
$$
\begin{aligned}
s_P^2 &= \frac{9 - 1}{9 + 6 - 2} \cdot 43 + \frac{6 - 1}{9 + 6 - 2} \cdot 46 \\
&= 44.15385
\end{aligned}
$$
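The pooled estimate is just a degrees-of-freedom weighted average of the two sample variances, which makes it easy to compute. The helper name `pooled_variance` below is an illustrative choice; the inputs are the observed values listed above, with the Y sample variance equal to 46.

```python
def pooled_variance(n_x, s2_x, n_y, s2_y):
    """Weighted average of two sample variances, weights = degrees of freedom."""
    df_x, df_y = n_x - 1, n_y - 1
    return (df_x * s2_x + df_y * s2_y) / (df_x + df_y)

s2_p = pooled_variance(9, 43.0, 6, 46.0)
print(f"pooled variance = {s2_p:.5f}")
```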
$$t_{0.025,13} = -2.160369$$
$$t_{0.975,13} = +2.160369$$
Then L is:
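$$
\begin{aligned}
L &= \sqrt{S_p^2 \cdot \left(\frac{1}{n_X} + \frac{1}{n_Y}\right)} \cdot Q_T(\alpha/2, n_X + n_Y - 2) \\
&= \sqrt{44.15385 \cdot \left(\frac{1}{9} + \frac{1}{6}\right)} \cdot (-2.160369) \\
&= -7.56591
\end{aligned}
$$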
For U we have:
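$$
\begin{aligned}
U &= \sqrt{S_p^2 \cdot \left(\frac{1}{n_X} + \frac{1}{n_Y}\right)} \cdot Q_T(1 - \alpha/2, n_X + n_Y - 2) \\
&= \sqrt{44.15385 \cdot \left(\frac{1}{9} + \frac{1}{6}\right)} \cdot (+2.160369) \\
&= +7.56591
\end{aligned}
$$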
In fact, under the null hypothesis, the t distribution is symmetric with respect
to 0, hence L will be negative, and U will be positive, and they will both have
the same magnitude, so once you know one of L or U you can immediately
write down the other. To perform the hypothesis test, we calculate the
observed value of the test statistic:
$$
\begin{aligned}
d &= \bar{x} - \bar{y} \\
&= 45 - 37 \\
&= 8
\end{aligned}
$$
Now we can make our inference: the observed value of the test statistic is
greater than U , so it lies in the rejection region, and we conclude that this
data provides strong evidence against the null hypothesis that the population
means µX and µY are equal, or equivalently that this data provides strong
evidence that the population mean µX is greater than the population mean
µY .
Under the null hypothesis, this test statistic will follow a t distribution with
nX + nY − 2 degrees of freedom. Thus, for the p-value method, we calculate
the observed value of this test statistic, denoted t, and then calculate the
area in the tails cut off by t and −t (for a two-sided test). If this area is less
than the pre-specified significance level, then we reject the null hypothesis.
Let’s go back to our previous example to see how this works. In this case,
the observed test statistic is:
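$$
\begin{aligned}
t &= \frac{\bar{x} - \bar{y}}{\sqrt{s_p^2 \cdot \left(\dfrac{1}{n_X} + \dfrac{1}{n_Y}\right)}} \\
&= \frac{45 - 37}{\sqrt{44.15385 \cdot \left(\dfrac{1}{9} + \dfrac{1}{6}\right)}} \\
&= 2.28432
\end{aligned}
$$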
Now we need to calculate the upper tail cut off by the value t = 2.28432,
as well as the lower tail cut off by t = −2.28432, using a t distribution with
nX + nY − 2 = 9 + 6 − 2 = 13 degrees of freedom:
p = Pr(T ≤ −2.28432) + Pr(T ≥ 2.28432)
= 0.01990 + 0.01990
= 0.03980
Thus, the p-value is p = 0.03980, and since this is less than the pre-specified
significance level of α = 0.05, we reject the null hypothesis, and again con-
clude that this data provides strong evidence that the population mean µX
is greater than the population mean µY .
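The p-value rests on the claim that, under the null hypothesis, the pooled T statistic follows a t distribution with 13 degrees of freedom. A Monte Carlo sketch can approximate the same tail area without any distribution tables: simulate many pairs of normal samples with equal means and a common variance, and count how often |T| is at least 2.28432. The seed, replication count, and standard normal populations are assumptions made for illustration, and the simulated value carries Monte Carlo error.

```python
import math
import random

random.seed(13)          # arbitrary seed
n_x, n_y = 9, 6
t_observed = 2.28432     # observed statistic from the example
reps = 50_000

def sample_mean_and_variance(values):
    m = sum(values) / len(values)
    s2 = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, s2

def pooled_t(x, y):
    """Two-sample T statistic with pooled variance, under mu_X - mu_Y = 0."""
    mx, s2x = sample_mean_and_variance(x)
    my, s2y = sample_mean_and_variance(y)
    s2p = ((len(x) - 1) * s2x + (len(y) - 1) * s2y) / (len(x) + len(y) - 2)
    return (mx - my) / math.sqrt(s2p * (1 / len(x) + 1 / len(y)))

extreme = 0
for _ in range(reps):
    # Both samples come from the same normal population, so H0 holds.
    x = [random.gauss(0.0, 1.0) for _ in range(n_x)]
    y = [random.gauss(0.0, 1.0) for _ in range(n_y)]
    if abs(pooled_t(x, y)) >= t_observed:
        extreme += 1

p_simulated = extreme / reps
print(f"simulated two-sided p-value ~ {p_simulated:.4f}")
```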
As usual, we do some algebraic manipulation to obtain a probability interval
statement of the form:
$$\Pr\left(L \le \mu_X - \mu_Y \le U\right) = 1 - \alpha$$
Here we have:
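$$L = (\bar{X} - \bar{Y}) - \sqrt{S_p^2 \cdot \left(\frac{1}{n_X} + \frac{1}{n_Y}\right)} \cdot Q_T(1 - \alpha/2, n_X + n_Y - 2)$$
$$U = (\bar{X} - \bar{Y}) + \sqrt{S_p^2 \cdot \left(\frac{1}{n_X} + \frac{1}{n_Y}\right)} \cdot Q_T(1 - \alpha/2, n_X + n_Y - 2)$$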
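Let’s go back to our example one last time. We calculate the lower limit:
$$
\begin{aligned}
L &= (\bar{x} - \bar{y}) - \sqrt{s_p^2 \cdot \left(\frac{1}{n_X} + \frac{1}{n_Y}\right)} \cdot Q_T(1 - \alpha/2, n_X + n_Y - 2) \\
&= (45 - 37) - \sqrt{44.15385 \cdot \left(\frac{1}{9} + \frac{1}{6}\right)} \cdot 2.16037 \\
&= 0.43409
\end{aligned}
$$
Similarly, the upper limit is:
$$
\begin{aligned}
U &= (\bar{x} - \bar{y}) + \sqrt{s_p^2 \cdot \left(\frac{1}{n_X} + \frac{1}{n_Y}\right)} \cdot Q_T(1 - \alpha/2, n_X + n_Y - 2) \\
&= (45 - 37) + \sqrt{44.15385 \cdot \left(\frac{1}{9} + \frac{1}{6}\right)} \cdot 2.16037 \\
&= 15.56591
\end{aligned}
$$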
We can check these calculations by making sure that the midpoint of the
interval is x − y = 8:
$$\frac{0.43409 + 15.56591}{2} = 8$$
So our 95% confidence interval for µX − µY , the difference of the true pop-
ulation means of X and Y respectively, is (0.43409, 15.56591). Since this
does not contain the value 0, this indicates that 0 is not a plausible value
for µX − µY , or equivalently it is implausible that µX = µY , and we reject
the null hypothesis. For the third time, we conclude that this data provides
strong evidence that the population mean µX is greater than the population
mean µY .
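The confidence interval for the difference of means can be reproduced from the summary statistics alone. The sketch below copies the pooled variance and the t quantile from the text rather than recomputing them, and the variable names are illustrative.

```python
import math

n_x, n_y = 9, 6
x_bar, y_bar = 45.0, 37.0
s2_p = 44.15385      # pooled variance from the example
q_hi = 2.160369      # Q_T(0.975, 13), copied from the text

std_err = math.sqrt(s2_p * (1 / n_x + 1 / n_y))
margin = q_hi * std_err
diff = x_bar - y_bar
ci = (diff - margin, diff + margin)

midpoint = (ci[0] + ci[1]) / 2    # should equal x_bar - y_bar
excludes_zero = not (ci[0] <= 0.0 <= ci[1])
print(f"95% CI = ({ci[0]:.5f}, {ci[1]:.5f}), excludes 0: {excludes_zero}")
```

Since the interval excludes 0 exactly when |t| exceeds the critical value, this check agrees with the rejection-region and p-value conclusions for the same data.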