CHAPTER 4

ESTIMATION
4.1 Introduction
The application of the methods of probability to the analysis and interpretation of data is
known as statistical inference. In particular, we wish to make an inference about a population
based on information contained in a sample. Since populations are characterized by numerical
descriptive measures called parameters, the objective of many statistical investigations is to
make an inference about one or more population parameters. There are two broad areas of
inference: estimation (the subject of this chapter) and hypothesis-testing (the subject of the
next chapter).
When we say that we have a random sample X1 , X2 , . . . , Xn “from a random variable X”
or “from a population with distribution function F (x; θ),” we mean that X1 , X2 , . . . , Xn are
identically and independently distributed random variables each with c.d.f. F (x; θ), that is,
depending on some parameter θ. We usually assume that the form of the distribution, e.g.,
binomial, Poisson, Normal, etc. is known but the parameter is unknown. We wish to obtain
information from the data (sample) to enable us to make some statement about the parameter.
Note that θ may be a vector, e.g., θ = (μ, σ²).
The general problem of estimation is to find out something about θ using the information
in the observed values of the sample, x1 , x2 , . . . , xn . That is, we want to choose a function
H(x1 , x2 , . . . , xn ) that will give us a good estimate of the parameter θ in F (x; θ).
Next we will consider some general methods of estimation. Since different methods may
lead to different estimators for the same parameter, we will then need to consider criteria for
deciding whether one estimate is better than another.
4.2 Statistical Philosophies

Consider the following simple example. Let X = (X1, X2, X3) be three independent Bernoulli trials, each equal to 1 with probability θ and 0 with probability 1 − θ, and take as estimator of θ the sample mean θ̂ = (X1 + X2 + X3)/3. We can tabulate the distribution of θ̂ as follows:
x            P(X = x)       θ̂(x)
(0, 0, 0)    (1 − θ)³       0
(0, 0, 1)    θ(1 − θ)²      1/3
(0, 1, 0)    θ(1 − θ)²      1/3
(1, 0, 0)    θ(1 − θ)²      1/3
(0, 1, 1)    θ²(1 − θ)      2/3
(1, 0, 1)    θ²(1 − θ)      2/3
(1, 1, 0)    θ²(1 − θ)      2/3
(1, 1, 1)    θ³             1
Thus P (θ̂ = 0) = (1 − θ)3 , P (θ̂ = 1/3) = 3θ(1 − θ)2 , P (θ̂ = 2/3) = 3θ2 (1 − θ) and P (θ̂ = 1) = θ3 .
We now ask whether θ̂ is a good estimator of θ. Clearly if θ = 0 we have that P (θ̂ = θ) =
P (θ̂ = 0) = 1 which is good. Likewise if θ = 1 we also have that P (θ̂ = θ) = P (θ̂ = 1) = 1. If
θ = 1/3 then P (θ̂ = θ) = P (θ̂ = 1/3) = 3(1/3)(1 − 1/3)2 = 4/9. Likewise if θ = 2/3 we have
that P (θ̂ = θ) = P (θ̂ = 2/3) = 3(2/3)2 (1 − 2/3) = 4/9. However if the value of θ lies outside
the set {0, 1/3, 2/3, 1} we have that P (θ̂ = θ) = 0.
Since θ̂ is a random variable we might try to calculate its expected value E(θ̂), i.e. the average value we would get if we carried out an infinite number of independent repetitions of the experiment. We have that

    E(θ̂) = 0 × (1 − θ)³ + (1/3) × 3θ(1 − θ)² + (2/3) × 3θ²(1 − θ) + 1 × θ³
          = θ[(1 − θ)² + 2θ(1 − θ) + θ²] = θ[(1 − θ) + θ]² = θ.
Thus if we carried out an infinite number of independent repetitions of the experiment and calculated the value of θ̂ for each repetition, the average of the θ̂ values would be exactly θ, the
true value of the parameter! This is true no matter what the actual value of θ is. Such an
estimator is said to be unbiased.
Consider the quantity L = (θ̂ − θ)2 which might be regarded as a measure of the error or loss
involved in using θ̂ to estimate θ. The possible values for L are (0 − θ)2 , (1/3 − θ)2 , (2/3 − θ)2
and (1 − θ)². We can calculate the expected value of L as follows:

    E(L) = (0 − θ)²(1 − θ)³ + (1/3 − θ)² 3θ(1 − θ)² + (2/3 − θ)² 3θ²(1 − θ) + (1 − θ)² θ³ = θ(1 − θ)/3.
The quantity E(L) is called the mean squared error ( MSE ) of the estimator θ̂. Since the
quantity θ(1 − θ) attains its maximum value of 1/4 for θ = 1/2, the largest value E(L) can
attain is 1/12 which occurs if the true value of the parameter θ happens to be equal to 1/2; for
all other values of θ the quantity E(L) is less than 1/12. If somebody could invent a different
estimator θ̃ of θ whose MSE was less than that of θ̂ for all values of θ then we would prefer θ̃ to
θ̂.
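The calculations in this example are easy to reproduce by simulation. The following R sketch (an addition, not part of the original text) approximates E(θ̂) and the mean squared error for three Bernoulli trials; the value of θ and the number of replications are arbitrary.

# Monte Carlo check that E(theta.hat) = theta and E(L) = theta*(1-theta)/3
# for theta.hat = (X1 + X2 + X3)/3 with three Bernoulli(theta) trials.
theta.true <- 0.4
nsim <- 100000
theta.hat <- replicate(nsim, mean(rbinom(3, size = 1, prob = theta.true)))
mean(theta.hat)                          # should be close to theta.true = 0.4
mean((theta.hat - theta.true)^2)         # should be close to 0.4*0.6/3 = 0.08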
4.3 The Frequentist Approach to Estimation
This trivial example gives some idea of the kinds of calculations that we will be performing.
The basic frequentist principle is that statistical procedures should be judged in terms of their
average performance in an infinite series of independent repetitions of the experiment which
produced the data. An important point to note is that the parameter values are treated as
fixed (although unknown) throughout this infinite series of repetitions. We should be happy to
use a procedure which performs well on the average and should not be concerned with how it
performs on any one particular occasion.
Using different methods of estimation can lead to different estimators. Criteria for deciding
which are good estimators are required. Before listing the qualities of a good estimator, it is
important to understand that they are random variables. For example, suppose that we take a
sample of size 5 from a uniform distribution and calculate x̄. Each time we repeat the experiment we will probably get a different sample of 5 and therefore a different x̄. The behaviour of an
estimator for different random samples will be described by a probability distribution. The
actual distribution of the estimator is not a concern here and only its mean and variance will
be considered. As a first condition it seems reasonable to ask that the distribution of the
estimator be centered around the parameter it is estimating. If not it will tend to overestimate
or underestimate θ. A second property an estimator should possess is precision. An estimator is
precise if the dispersion of its distribution is small. These two concepts are incorporated in the
definitions of unbiasedness and efficiency below. [5]
An estimator θ̂ = θ̂(X) of θ is said to be unbiased if

    E[θ̂(X)] = E(θ̂) = θ.
There may be large number of unbiased estimators of a parameter for any given distribution
and a further criterion for choosing between all the unbiased estimators is needed. [5]
Definition 4.4 (Bias corrected estimators). If bias(θ̂) is of the form cθ, then (obviously) θ̃ = θ̂/(1 + c) is unbiased for θ. Likewise, if bias(θ̂) = c, so that E(θ̂) = θ + c, then θ̃ = θ̂ − c is unbiased for θ. In such situations we say that θ̃ is a bias-corrected version of θ̂.
Definition 4.5 (Unbiased functions). More generally ĝ(X) is said to be unbiased for a
function g(θ) if E[ĝ(X)] = g(θ).
Note that even if θ̂ is an unbiased estimator of θ, g(θ̂) will generally not be an unbiased
estimator of g(θ) unless g is linear or affine. This limits the importance of the notion of
unbiasedness. It might be at least as important that an estimator is accurate in the sense that
its distribution is highly concentrated around θ.
Is unbiasedness a good thing? Unbiasedness is important when combining estimates, as
averages of unbiased estimators are unbiased (see the review exercises at the end of this
chapter). For example, when combining standard deviations s1 , s2 , . . . , sk with degrees of
freedom df1, . . . , dfk we always average their squares,

    s̄ = √[ (df1 s1² + · · · + dfk sk²) / (df1 + · · · + dfk) ],

since the si² are unbiased estimators of the variance σ², whereas the si are not unbiased estimators of σ (see
the review exercises). Be careful when averaging biased estimators! It may well be appropriate
to make a bias-correction before averaging.
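As a small illustration (an addition to the notes, with made-up numbers), the pooled standard deviation above is computed in R by averaging the squared si with degree-of-freedom weights and then taking the square root.

# Pool standard deviations by df-weighted averaging of their squares.
s  <- c(2.1, 1.8, 2.5)          # hypothetical standard deviations s1, s2, s3
df <- c(9, 14, 11)              # corresponding degrees of freedom
sqrt(sum(df * s^2) / sum(df))   # pooled standard deviation s.bar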
4.4 Estimation by the Method of Moments

For a random sample X1, X2, . . . , Xn, define the r-th sample moment by Mr = (1/n) Σ_{i=1}^n Xi^r (with observed value mr = (1/n) Σ_{i=1}^n xi^r), and let μr = E(X^r) denote the corresponding r-th population moment.
Note that the first sample moment is just the sample mean, X.
We will first prove a property of sample moments.
    E(Mr) = μr,    r = 1, 2, 3, . . .

Proof.

    E(Mr) = E[ (1/n) Σ_{i=1}^n Xi^r ] = (1/n) Σ_{i=1}^n E(Xi^r) = (1/n) Σ_{i=1}^n μr = μr.
This theorem provides the motivation for estimation by the method of moments (with the
estimator being referred to as the method of moments estimator or MME). The sample moments,
M1 , M2 , . . ., are random variables whose means are μ1 , μ2 , . . . Since the population moments
depend on the parameters of the distribution, estimating them by the sample moments leads to
estimation of the parameters.
We will consider this method of estimation by means of 2 examples, then state the general
procedure.
Example 4.1.
In this example, the distribution only has one parameter. Given X1 , X2 , . . . , Xn is a random sample from a
U (0, θ) distribution, find the method of moments estimator (MME) of θ.
Solution of Example 4.1. Now, for the uniform distribution, f(x) = (1/θ) I[0,θ](x), so

    μ = E(X) = ∫₀^θ x (1/θ) dx = θ/2.
Using the Method of Moments, we estimate μ = θ/2 by m1. Since m1 = x̄, we set

    θ̂/2 = x̄,   so that   θ̂ = 2x̄.

Then θ̂ = 2X̄ and the MME of θ is 2X̄.
Computer Exercise 4.1. Generate 100 samples of size 10 from a uniform distribution, U (0, θ)
with θ = 10. Estimate the value of θ from your samples using the method of moments and plot
the results. Comment on the results.
In this exercise, we know a priori that θ = 10 and have generated random samples. The
samples are analysed as if θ is unknown and estimated by the method of moments. Then we can
compare the estimates with the known value.
Solution of Computer Exercise 4.1.
theta <- 10
sampsz <- 10
nsimulations <- 100
theta.estimates <- numeric(nsimulations)
for (i in 1:nsimulations){
ru <- runif(n=sampsz,min=0,max=theta)
Xbar <- mean(ru)
theta.estimates[i] <- 2*Xbar
} # end of the i loop
plot(density(theta.estimates))
Example 4.2.
Given X1, X2, . . . , Xn is a random sample from a N(μ, σ²) distribution, find the method of moments estimators of μ and σ².
Solution of Example 4.2. For the normal distribution, E(X) = μ and E(X²) = σ² + μ² (Thm. 2.2 of [5]).
Using the Method of Moments, equate E(X) to m1 and E(X²) to m2, so that μ̂ = x̄ and σ̂² + μ̂² = m2; that is, estimate μ by x̄ and estimate σ² by m2 − x̄². Then

    μ̂ = x̄,   and   σ̂² = (1/n) Σ xi² − x̄².

The latter can also be written as σ̂² = (1/n) Σ_{i=1}^n (xi − x̄)².
Computer Exercise 4.2. Generate 100 samples of size 10 from a normal distribution with
μ = 14 and σ = 4. Estimate μ and σ 2 from your samples using the method of moments. Plot
the estimated values of μ and σ 2 . Comment on your results.
Solution of Computer Exercise 4.2.
#_____________NormalMoments.R ___________
mu <- 14
sigma <- 4
sampsz <- 10
nsimulations <- 100
mu.estimates <- numeric(nsimulations)
var.estimates <- numeric(nsimulations)
for (i in 1:nsimulations){
rn <- rnorm(mean=mu,sd=sigma,n=sampsz)
mu.estimates[i] <- mean(rn)
var.estimates[i] <- mean( (rn -mean(rn))^2 )
} # end of i loop
plot(density(mu.estimates))
plot(density(var.estimates))
The plot you obtain for the estimates of μ should be centred around the true mean of 14. However,
you will notice that the plot of the variances is not centred about the true variance of 16 as you
would like. Rather it will appear to be centred about a value less than 16. The reason for this
will become evident when we study the properties of MME estimators in more detail.
General Procedure
Let X1 , X2 , . . . , Xn be a random sample from F (x : θ1 , . . . , θk ). That is, suppose that there are
k parameters to be estimated. Let μr , mr (r = 1, 2, . . . , k) denote the first k population and
sample moments respectively, and suppose that each of these population moments are certain
known functions of the parameters. That is,
μ1 = g1 (θ1 , . . . , θk ),
μ2 = g2 (θ1 , . . . , θk ),
    ⋮
μk = gk (θ1 , . . . , θk ).
The method of moments estimates of θ1, . . . , θk are then obtained by solving the k equations

    μr = gr(θ1, . . . , θk) = mr,    r = 1, 2, . . . , k,

for θ1, . . . , θk.
Problem 4.2. Let X have a binomial distribution with parameters n and θ. Show that the
sample proportion θ̂ = X/n is an unbiased estimate of θ.
Solution. X ∼ Bin(n, θ) ⇒ E(X) = nθ. Then E(θ̂) = E(X/n) = E(X)/n = nθ/n = θ. As
E(θ̂) = θ, the estimator θ̂ is unbiased.
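A quick simulation (an addition to the problem, with arbitrary n, θ and number of replications) illustrates the unbiasedness of the sample proportion.

# Simulation check that the sample proportion X/n is unbiased for theta.
theta <- 0.3; n <- 20
x <- rbinom(10000, size = n, prob = theta)   # 10000 replicates of X ~ Bin(n, theta)
mean(x/n)                                    # should be close to theta = 0.3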
4.5 Properties of an Estimator
Definition 4.6 (Mean squared error). The mean squared error of the estimator θ̂ is defined as MSE(θ̂) = E(θ̂ − θ)². Given the same set of data, θ̂1 is “better” than θ̂2 if MSE(θ̂1) ≤ MSE(θ̂2) (uniformly better if this holds for all θ).
The problem of finding minimum MSE estimators cannot be solved uniquely for all θ, but the MSE admits a useful decomposition.

Lemma. MSE(θ̂) = Var(θ̂) + [bias(θ̂)]², where bias(θ̂) = E(θ̂) − θ.

Proof. MSE(θ̂) = E(θ̂ − θ)² = E[θ̂ − E(θ̂) + E(θ̂) − θ]² = E[θ̂ − E(θ̂)]² + [E(θ̂) − θ]² = Var(θ̂) + [bias(θ̂)]², the cross term vanishing because E[θ̂ − E(θ̂)] = 0.

Note: This lemma implies that the mean squared error of an unbiased estimator is equal to the variance of the estimator.
Problem. Let X1, . . . , Xn be independent estimators of μ, where Xi has variance σi², and consider the linear combination μ̂ = w1X1 + · · · + wnXn.
(i) In the case when the estimators are unbiased, show that μ̂ is unbiased if and only if Σᵢ wi = 1.
(ii) In the case when the estimators are unbiased, show that μ̂ has minimum variance when the weights are inversely proportional to the variances σi².
(iii) Show that the variance of μ̂ for optimal weights wi is Var(μ̂) = 1/Σᵢ σi⁻².
(iv) Consider the case when estimators may be biased. Find the mean square error of the
optimal linear combination obtained above, and compare its behaviour as n → ∞ in the
biased and unbiased case, when σi2 = σ 2 , i = 1, . . . , n.
Solution. E(μ̂) = E(w1X1 + · · · + wnXn) = Σᵢ wi E(Xi) = Σᵢ wi μ = μ Σᵢ wi, so μ̂ is unbiased if and only if Σᵢ wi = 1. The variance of our estimator is Var(μ̂) = Σᵢ wi² σi², which should be minimized subject to the constraint Σᵢ wi = 1. Differentiating the Lagrangian L = Σᵢ wi² σi² − λ(Σᵢ wi − 1) with respect to wi and setting equal to zero yields 2wi σi² = λ, so wi ∝ σi⁻² and hence wi = σi⁻²/(Σⱼ σⱼ⁻²). Then, for optimal weights, we get Var(μ̂) = Σᵢ wi² σi² = (Σᵢ σi⁻⁴ σi²)/(Σᵢ σi⁻²)² = 1/(Σᵢ σi⁻²). When σi² = σ² we have that Var(μ̂) = σ²/n, which tends to zero as n → ∞, whereas bias(μ̂) = Σᵢ βi/n = β̄ is equal to the average bias and MSE(μ̂) = σ²/n + β̄². Therefore the bias tends to dominate the variance as n gets larger, which is very unfortunate.
Problem 4.7. Let X1 , . . . , Xn be an independent sample of size n from the uniform distribution
on the interval (0, θ), with density for a single observation being f (x|θ) = θ−1 for 0 < x < θ and
0 otherwise, and consider θ > 0 unknown.
(i) Find the expected value and variance of the estimator θ̂ = 2X̄.
(ii) Find the expected value of the estimator θ̃ = X(n) , i.e. the largest observation.
(iii) Find an unbiased estimator of the form θ̌ = cX(n) and calculate its variance.
Solution. θ̂ = 2X̄ has E(θ̂) = E(2X̄) = (2/n)[E(X1) + · · · + E(Xn)] = (2/n)[n(θ/2)] = θ (unbiased), and Var(θ̂) = Var(2X̄) = (4/n²)[Var(X1) + · · · + Var(Xn)] = (4/n²)[n θ²/12] = θ²/(3n). Let U = X(n); then P(U ≤ u) = Πᵢ P(Xi ≤ u) = (u/θ)ⁿ for 0 < u < θ, so differentiation yields that U has density f(u|θ) = n uⁿ⁻¹ θ⁻ⁿ for 0 < u < θ. Direct integration now yields E(θ̃) = E(U) = nθ/(n + 1) (a biased estimator). The estimator θ̌ = [(n + 1)/n] U is unbiased. Direct integration gives E(U²) = nθ²/(n + 2), so Var(θ̃) = Var(U) = nθ²/[(n + 2)(n + 1)²] and Var(θ̌) = θ²/[n(n + 2)]. As θ̂ and θ̌ are both unbiased estimators, MSE(θ̂) = Var(θ̂) and MSE(θ̌) = Var(θ̌). Clearly the mean square error of θ̂ is very large compared to the mean square error of θ̌.
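To see this difference numerically, the following R sketch (an addition, not part of the problem) simulates both unbiased estimators for a U(0, θ) sample and compares their variances with the formulas above; θ, n and the number of simulations are arbitrary choices.

# Compare theta.hat = 2*Xbar with theta.check = (n+1)/n * max(X) for U(0, theta) data.
theta <- 10; n <- 10; nsim <- 10000
samples <- matrix(runif(nsim * n, min = 0, max = theta), nrow = nsim)
theta.hat   <- 2 * rowMeans(samples)               # method of moments / 2*Xbar
theta.check <- (n + 1)/n * apply(samples, 1, max)  # bias-corrected maximum
c(var(theta.hat),   theta^2/(3*n))                 # simulated vs theoretical theta^2/(3n)
c(var(theta.check), theta^2/(n*(n + 2)))           # simulated vs theoretical theta^2/(n(n+2))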
Getting a small MSE often involves a tradeoff between variance and bias. By not insisting on θ̂
being unbiased, the variance can sometimes be drastically reduced. For unbiased estimators,
the MSE obviously equals the variance, MSE(θ̂) = Var(θ̂), so no tradeoff can be made. One
approach is to restrict ourselves to the subclass of estimators that are unbiased and minimum
variance.
We will develop a method of finding the minimum variance unbiased estimator (MVUE) when it exists. When such an estimator
does not exist we will be able to find a lower bound for the variance of an unbiased estimator in
the class of unbiased estimators, and compare the variance of our unbiased estimator with this
lower bound.
Definition 4.8 (Score function). For the (possibly vector valued) observation X = x to be
informative about θ, the density must vary with θ. If f (x|θ) is smooth and differentiable, this
change is quantified to first order by the score function
    S(θ) = (∂/∂θ) ln f(x|θ) ≡ [∂f(x|θ)/∂θ] / f(x|θ).

Under suitable regularity conditions (differentiation with respect to θ and integration with respect to x can be interchanged), we have

    E{S(θ)} = ∫ [∂f(x|θ)/∂θ / f(x|θ)] f(x|θ) dx = ∫ ∂f(x|θ)/∂θ dx = (∂/∂θ) ∫ f(x|θ) dx = (∂/∂θ) 1 = 0.
Thus the score function has expectation zero. Using the central limit theorem, the score statistic
has asymptotically a normal distribution, with mean zero. The variance will be derived below.
True frequentism evaluates the properties of estimators based on their “long-run” behaviour.
The value of x will vary from sample to sample so we have treated the score function as a
random variable and looked at its average across all possible samples.
Lemma 4.8 (Fisher information). The variance of S(θ) is the expected Fisher information about θ,

    E(I(θ)) = E{S(θ)²} ≡ E{ [(∂/∂θ) ln f(x|θ)]² }.
Variance measures lack of knowledge, so it is reasonable that the reciprocal of the variance should be defined as the amount of information carried by the (possibly vector valued) observation x about θ.
Theorem 4.9 (Cramér-Rao lower bound). Let θ̂ be an unbiased estimator of θ, and write U = θ̂ and S = S(θ). Since a squared correlation can be at most one,

    {Corr(U, S)}² = {Cov(U, S)}² / [Var(U) Var(S)] ≤ 1.

For an unbiased estimator Cov(U, S) = 1, so we get

    Var(U) Var(S) ≥ 1.

Since Var(S) = E(I(θ)), this implies

    Var(θ̂) ≥ 1 / E(I(θ)),

which is our main result. We call { E(I(θ)) }⁻¹ the Cramér-Rao lower bound (CRLB).
Sufficient conditions for the proof of the CRLB are that all the integrands are finite within the range of x. We also require that the limits of the integrals do not depend on θ; that is, the support of f(x|θ) cannot depend on θ. This second condition is violated for some density functions, e.g. the CRLB is not valid for the uniform distribution. We can make an absolute assessment of unbiased estimators by comparing their variances to the CRLB. We can also assess biased estimators: if a biased estimator has variance lower than the CRLB then it is indeed a very precise estimator, although it is biased.
Example 4.3.
Consider i.i.d. random variables Xi , i = 1, . . . , n, with
    fXi(xi|μ) = (1/μ) exp(−xi/μ),

so that

    ln f = −n ln(μ) − (1/μ) Σ_{i=1}^n xi.
The score function is the partial derivative of ln f wrt the unknown parameter μ,
    S(μ) = (∂/∂μ) ln f = −n/μ + (1/μ²) Σ_{i=1}^n xi,

and

    E{S(μ)} = E[ −n/μ + (1/μ²) Σ_{i=1}^n Xi ] = −n/μ + (1/μ²) E( Σ_{i=1}^n Xi ).

For X ∼ Exp(1/μ), we have E(X) = μ, implying E(X1 + · · · + Xn) = E(X1) + · · · + E(Xn) = nμ and E{S(μ)} = 0, as required. Also,

    E(I(μ)) = −E[ (∂/∂μ) ( −n/μ + (1/μ²) Σ_{i=1}^n Xi ) ]
            = −E[ n/μ² − (2/μ³) Σ_{i=1}^n Xi ]
            = −n/μ² + (2/μ³) E( Σ_{i=1}^n Xi )
            = −n/μ² + 2nμ/μ³ = n/μ².

Hence

    CRLB = μ²/n.
Let us propose μ̂ = X̄ as an estimator of μ. Then

    E(μ̂) = E[ (1/n) Σ_{i=1}^n Xi ] = (1/n) Σ_{i=1}^n E(Xi) = μ,
verifying that μ̂ = X̄ is indeed an unbiased estimator of μ. For X ∼ Exp(1/μ), we have E(X) = μ and Var(X) = μ², implying

    Var(μ̂) = (1/n²) Σ_{i=1}^n Var(Xi) = nμ²/n² = μ²/n.

We have therefore shown that Var(μ̂) = { E(I(μ)) }⁻¹, and therefore conclude that the unbiased estimator μ̂ = X̄ achieves its CRLB.
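The following R sketch (an addition to the example, with arbitrary values of μ, n and the number of replications) checks by simulation that the variance of X̄ for exponential data is indeed close to the bound μ²/n.

# Simulation check that Var(Xbar) for Exp(mean mu) data attains the CRLB mu^2/n.
mu <- 5; n <- 25; nsim <- 20000
mu.hat <- replicate(nsim, mean(rexp(n, rate = 1/mu)))
c(simulated.var = var(mu.hat), CRLB = mu^2/n)   # both should be close to 1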
4.5.3 Efficiency
Definition 4.9 (Efficiency). Define the efficiency of the unbiased estimator θ̂ as

    eff(θ̂) = CRLB / Var(θ̂),

where CRLB = { E(I(θ)) }⁻¹. Clearly 0 < eff(θ̂) ≤ 1. An unbiased estimator θ̂ is said to be efficient if eff(θ̂) = 1.
Let θ̂1 and θ̂2 be two unbiased estimators of θ with variances Var(θ̂1) and Var(θ̂2) respectively. We then say that θ̂1 is more efficient than θ̂2 if Var(θ̂1) < Var(θ̂2); that is, θ̂1 is more efficient than θ̂2 if it has a smaller variance.
Definition 4.11 (Relative Efficiency). The relative efficiency of θ̂2 with respect to θ̂1 is defined as Var(θ̂1)/Var(θ̂2).
It will now be useful to indicate that the estimator is based on a sample of size n by denoting it by θ̂n.
4.5.4 Consistency
In the previous subsections, we defined unbiasedness and mean-squared error of an estimator.
Both concepts were defined on a fixed sample size. In this subsection we will define two concepts
that are defined for increasing sample size.
Mean-squared-error consistency implies that both the bias and the variance of θn approaches 0,
since the mean-squared-error in the definition can be decomposed in a variance and squared-bias
component.
Theorem 4.10. If lim_{n→∞} E(θ̂n) = θ and lim_{n→∞} Var(θ̂n) = 0, then θ̂n is a consistent estimator of θ.
One popular loss function is the squared error loss function, defined as l(θ̂; θ) = (θ̂ − θ)².
Definition 4.15 (Risk function). For a given loss function l(· ; ·), the risk function Rl(θ) of an estimator θ̂ is defined to be Rl(θ) = E[l(θ̂; θ)].
The risk function is the average loss. For the squared error loss function, the risk function is
the familiar mean-squared-error.
4.6 Sufficiency
The final concept of sufficiency requires some explanation before a formal definition is
given. The random sample X1 , X2 , . . . , Xn drawn from the distribution with F (x; θ) contains
information about the parameter θ. To estimate θ, this sample is first condensed to a single
random variable by use of a statistic θ∗ = H(X1 , X2 , . . . , Xn ). The question of interest is whether
any information about θ has been lost by this condensing process. For example, a possible choice
of θ∗ is H(X1 , . . . , Xn ) = X1 in which case it seems that some of the information in the sample
has been lost since the observations X2 , . . . , Xn have been ignored. In many cases, the statistic
θ∗ does contain all the relevant information about the parameter θ that the sample contains.
Definition 4.16 (Sufficiency). Let X1, X2, . . . , Xn be a random sample from F(x; θ) and let θ* = H(X1, X2, . . . , Xn) be a statistic (a function of the Xi only). Let θ′ = H′(X1, X2, . . . , Xn) be any other statistic which is not a function of θ*. If for each such statistic θ′ the conditional density of θ′ given θ* does not involve θ, then θ* is called a sufficient statistic for θ. That is, if f(θ′|θ*) does not contain θ, then θ* is sufficient for θ.
You should think of sufficiency in the sense of using all the relevant information in the
sample. For example, to say that x̄ is sufficient for μ in a particular distribution means that
knowledge of the actual observations x1 , x2 , . . . , xn gives us no more information about μ than
does only knowing the average of the n observations.
Example 4.4.
T = t(X) = X̄ is sufficient for μ when Xi ∼ i.i.d. N(μ, σ 2 ).
To better understand the motivation behind the concept of sufficiency consider three
independent Binomial trials where θ = P (X = 1).
Event Probability Set
0 0 0 (1 − θ)3 A0
1 0 0
0 1 0 θ(1 − θ)2 A1
0 0 1
0 1 1
1 0 1 θ2 (1 − θ) A2
1 1 0
1 1 1 θ3 A3
Knowing which Ai the sample is in carries all the information about θ. Which particular sample within Ai occurred gives us no extra information about θ (extra information about other aspects of the model perhaps, but not about θ). Here T = t(X) = Σᵢ Xi equals the number of “successes” and identifies Ai. Mathematically we can express the above concept by saying that the probability
P (X = x|Ai ) does not depend on θ. i.e. P (010|A1 ; θ) = 1/3. More generally, a statistic T = t(X)
is said to be sufficient for the parameter θ if Pr (X = x|T = t) does not depend on θ. Sufficient
statistics are most easily recognized through the following fundamental result:
So

    FT(t) = tⁿ/θⁿ   ⇒   fT(t) = n tⁿ⁻¹/θⁿ.

Also,

    f(x, t) = 1/θⁿ ≡ f(x; θ).

Hence

    f(x|t) = f(x, t)/fT(t) = 1/(n tⁿ⁻¹),

which is independent of θ.
4.7 The Likelihood Approach

Given observed data x with density (or probability function) fX(x|θ), the likelihood function is defined as

    L(θ|x) = fX(x|θ),    (4.7.1)

regarded as a function of θ for the fixed observed x.
It is crucial to stress that the argument of fX (x|θ) is x, but the argument of L(θ|x) is θ. It is
therefore convenient to view the likelihood function L(θ) as the probability of the observed data
x considered as a function of θ.
As explained in [5], the term likelihood of the sample needs to be defined, and this has
to be done separately for discrete and continuous distributions.
The likelihood function for a set of n identically and independently distributed (i.i.d.)
random variables, X1 , X2 , . . . , Xn , can thus be written as:
    L(θ; x1, . . . , xn) = P(X1 = x1) · P(X2 = x2) · · · P(Xn = xn)   for X discrete,
    L(θ; x1, . . . , xn) = f(x1; θ) · f(x2; θ) · · · f(xn; θ)           for X continuous.    (4.7.2)
For the discrete case, L(θ; x1 , . . . , xn ) is the probability (or likelihood) of observing (X1 =
x1 , X2 = x2 , . . . , Xn = xn ). It would then seem that a sensible approach to selecting an
estimate of θ would be to find the value of θ which maximizes the probability of observing
(X1 = x1, X2 = x2, . . . , Xn = xn), the event which occurred. [5]
More generally, when the random continuous variables X1 , . . . , Xn are mutually independent
we can write the joint density as

    fX(x) = Π_{j=1}^n fXj(xj).
Note that we have used a bold θ to indicate that it may be a vector, although we will not be
consistent in using different notations for vectors or single parameters. When the densities
fXj (xj ) are identical, we can unambiguously write f (xj ).
Usually it is convenient to work with the natural logarithm of the likelihood called the
log-likelihood, denoted by
    ℓ(θ|x) = ln L(θ|x).

When θ ∈ ℝ we can define the score function as the first derivative of the log-likelihood,

    S(θ) = (∂/∂θ) ln L(θ),

as seen before.
The maximum likelihood estimate (MLE) θ̂ of θ is the solution to the score equation
S(θ) = 0.
Put simply [5], the maximum likelihood estimate (MLE) of θ is that value of θ which maximizes the likelihood. To state it more mathematically, the MLE of θ is that value of θ, say θ̂, such that

    L(θ̂; x1, . . . , xn) ≥ L(θ′; x1, . . . , xn),

where θ′ is any other value of θ.
At the maximum [8], the second partial derivative of the log-likelihood is negative, so we define the curvature at θ̂ as I(θ̂), where

    I(θ) = −(∂²/∂θ²) ln L(θ).
We can check that a solution θ̂ of the equation S(θ) = 0 is actually a maximum by checking that
I(θ̂) > 0. A large curvature I(θ̂) is associated with a tight or strong peak, intuitively indicating
less uncertainty about θ. In likelihood theory I(θ) is a key quantity called the observed Fisher
information, and I(θ̂) is the observed Fisher information evaluated at the MLE θ̂. Although
I(θ) is a function, I(θ̂) is a scalar. Note the difference with the concept of ‘expected Fisher
information’ (See 4.6).
The likelihood function L(θ|x) supplies an order of preference or plausibility among possible
values of θ based on the observed x. It ranks the plausibility of possible values of θ by how
probable they make the observed x. If P (x|θ = θ1 ) > P (x|θ = θ2 ) then the observed x makes
θ = θ1 more plausible than θ = θ2 , and consequently from (4.7.1), L(θ1 |x) > L(θ2 |x). The
likelihood ratio L(θ1|x)/L(θ2|x) = f(x|θ1)/f(x|θ2) is a measure of the plausibility of θ1 relative
to θ2 based on the observed data. The relative likelihood L(θ1 |x)/L(θ2 |x) = k means that the
observed value x will occur k times more frequently in repeated samples from the population
defined by the value θ1 than from the population defined by θ2 . Since only ratios of likelihoods
are meaningful, it is convenient to standardize the likelihood with respect to its maximum.
Define the relative likelihood as R(θ|x) = L(θ|x)/L(θ̂|x). The relative likelihood varies
between 0 and 1. The MLE θ̂ is the most plausible value of θ in that it makes the observed sample
most probable. The relative likelihood measures the plausibility of any particular value of θ
relative to that of θ̂.
Comments [5]
1. It is customary to use θ̂ to denote both the estimator (a random variable) and the estimate (its observed value). Recall that we used the same convention for the MME.
2. Since L(θ; x1 , x2 , . . . , xn ) is a product, and sums are usually more convenient to deal
with than products, it is customary to maximize log L(θ; x1 , . . . , xn ) which we usually
abbreviate to l(θ). This has the same effect. Since log L is a strictly increasing function of
L, it will take on its maximum at the same point.
4. The method of differentiation to find a maximum only works if the function concerned
actually has a turning point.
Define W1 = (θ̂ − θ)√E[I(θ)], W2 = (θ̂ − θ)√I(θ), W3 = (θ̂ − θ)√E[I(θ̂)] and W4 = (θ̂ − θ)√I(θ̂). Then W1, W2, W3, and W4 are all random variables and, as n → ∞, the probabilistic behaviour
of each of W1 , W2 , W3 , and W4 is well approximated by that of a N (0, 1) random variable (see
further below to see a hint of the proof). Then, since E[W1 ] ≈ 0, we have that E[θ̂] ≈ θ and so
θ̂ is approximately unbiased. Also Var[W1 ] ≈ 1 implies that Var[θ̂] ≈ (E[I(θ)])−1 and so θ̂ is
approximately efficient.
Let the data X have probability distribution g(X; θ) where θ = (θ1, θ2, . . . , θm) is a vector of m unknown parameters. Let I(θ) be the m × m information matrix as defined above and let E[I(θ)] be the m × m matrix obtained by replacing the elements of I(θ) by their expected
values. Let θ̂ be the MLE of θ. Let CRLBr be the rth diagonal element of [E[I(θ)]]⁻¹. For r = 1, 2, . . . , m, define W1r = (θ̂r − θr)/√CRLBr. Then, as n → ∞, W1r behaves like a standard normal random variable.
Suppose we define W2r by replacing CRLBr by the rth diagonal element of the matrix [I(θ)]⁻¹, W3r by replacing CRLBr by the rth diagonal element of the matrix [E I(θ̂)]⁻¹, and W4r by replacing CRLBr by the rth diagonal element of the matrix [I(θ̂)]⁻¹. Then it can be shown that, as n → ∞, W2r, W3r, and W4r all behave like standard normal random variables. These results will become very handy when developing test statistics (see the following chapter on hypothesis testing).
A detailed proof of the assertion that the Wi above follow approximately a standard normal distribution is beyond the scope of this course. However, in what follows, we give a hint of how the proof would go. Suppose that θ̂ is the MLE of θ. We then apply a Taylor expansion of S(θ̂) = l′(θ̂) about the point θ0, yielding

    0 = l′(θ̂) ≈ l′(θ0) + (θ̂ − θ0) l′′(θ0).

Therefore,

    √n (θ̂ − θ0) ≈ [ −n^{−1/2} l′(θ0) ] / [ n^{−1} l′′(θ0) ].

Based on properties we already saw, it can then be shown that the mean of the numerator is 0 and its variance is E(I(θ0)). By the law of large numbers, the expectation of the denominator converges to −E(I(θ0)). We thus have

    √n (θ̂ − θ0) ≈ n^{−1/2} l′(θ0) / E(I(θ0)).

Consequently,

    E( √n (θ̂ − θ0) ) ≈ 0,

and

    Var( √n (θ̂ − θ0) ) ≈ 1 / E(I(θ0)).
That is, our information should be invariant to the choice of parameterization. ( For the
purposes of this example we are not too concerned about how to calculate L∗ (ψ). )
Theorem 4.12 (Invariance of the MLE). If g is a one-to-one function, and θ̂ is the MLE
of θ then g(θ̂) is the MLE of g(θ).
Proof. This is trivially true: letting θ = g⁻¹(μ), the function f{y | g⁻¹(μ)} is maximized in μ exactly when μ = g(θ̂). When g is not one-to-one the discussion becomes more subtle, but we simply choose to define the MLE of g(θ) to be g(θ̂).
It seems intuitive that if θ̂ is most likely for θ and our knowledge (data) remains unchanged
then g(θ̂) is most likely for g(θ). In fact, we would find it strange if θ̂ is an estimate of θ, but θ̂2
is not an estimate of θ². In the binomial example with n = 10 and x = 8 we get θ̂ = 0.8, so the MLE of g(θ) = θ/(1 − θ) is g(θ̂) = 0.8/(1 − 0.8) = 4.
This convenient property is not necessarily true of other estimators. For example, if θ̂ is the
MVUE of θ, then g(θ̂) is generally not MVUE for g(θ).
Frequentists generally accept the invariance principle without question. This is not the case
for Bayesians. The invariance property of the likelihood ratio is incompatible with the Bayesian
habit of assigning a probability distribution to a parameter.
Examples
All the examples are coming from [8], unless stated otherwise.
Example 4.8 ([5]).
Given X is distributed bin(1, p) where p ∈ (0, 1), and a random sample x1 , x2 , . . . , xn , find the maximum likelihood
estimate of p.
Solution of Example 4.8 ([5]). The likelihood is

    L(p; x1, . . . , xn) = Π_{i=1}^n p^{xi} (1 − p)^{1−xi} = p^{Σxi} (1 − p)^{n−Σxi},

so

    log L(p) = (Σ xi) log p + (n − Σ xi) log(1 − p).

Differentiating with respect to p, we have

    (d/dp) log L(p) = (Σ xi)/p − (n − Σ xi)/(1 − p).

This is equal to zero when (Σ xi)(1 − p) = p(n − Σ xi), that is, when p = Σ xi/n. This estimate is denoted by p̂. Thus, if the random variable X is distributed bin(1, p), the MLE of p derived from a sample of size n is

    p̂ = X̄.    (4.7.3)
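As a quick numerical check (not part of the original example), the Bernoulli log-likelihood can also be maximized with R's optimize(); the data vector below is hypothetical and the numeric maximizer should agree with the closed-form MLE x̄.

# Numerical maximization of the Bernoulli log-likelihood versus the closed form p.hat = xbar.
x <- c(1, 0, 1, 1, 0, 1, 0, 1, 1, 1)           # hypothetical 0/1 data
loglik <- function(p) sum(x)*log(p) + (length(x) - sum(x))*log(1 - p)
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
mean(x)                                        # closed-form MLE; both are about 0.7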
Solution of Example 4.4 ([5]). Now f (xi ; θ) = 1/θ, xi ∈ [0, θ], i = 1, 2, . . . , n. So the
likelihood is
    L(θ; x1, x2, . . . , xn) = Π_{i=1}^n (1/θ) = 1/θⁿ.

When we come to find the maximum of this function we note that the slope is not zero anywhere, so there is no use finding dL(θ)/dθ or d log L(θ)/dθ.
Note however that L(θ) increases as θ → 0. So L(θ) is maximized by setting θ equal to the
smallest value it can take. If the observed values are x1 , . . . , xn then θ can be no smaller than
the largest of these. This is because xi ∈ [0, θ] for i = 1, . . . , n. That is, each xi ≤ θ or θ ≥ each
xi .
Thus, if X is distributed U (0, θ), the MLE of θ is
θ̂ = max(Xi).    (4.7.4)
Computer Exercise 4.3. Generate 100 random samples of size 10 from a U(0, 10) distribution. For each of the 100 samples generated calculate the MME and MLE for θ and graph the results.
1. From the graphs does it appear that the estimators are biased or unbiased? Explain.
2. Estimate the variance of the two estimators by finding the sample variance of the 100
estimates (for each estimator). Which estimator appears more efficient?
theta <- 10
sampsz <- 10
nsimulations <- 100
moment.estimates <- numeric(nsimulations)
ML.estimates <- numeric(nsimulations)
for (i in 1:nsimulations){
ru <- runif(n=sampsz,min=0,max=theta)
moment.estimates[i] <- 2*mean(ru)   # method of moments estimate
ML.estimates[i] <- max(ru)          # maximum likelihood estimate
} # end of the i loop
plot(density(moment.estimates),
xlab=" ",ylab=" ",main=" ",ylim=c(0,0.6),
las=1)
abline(v=theta,lty=3)
lines(density(ML.estimates),lty=2)
legend(11,0.5,legend=c("moment","ML"),lty=1:2,
cex=0.6)
You should see that the Method of Moments gives unbiased estimates, many of which are not in the range space, as noted in Computer Exercise 4.2. The maximum likelihood estimates are all less than 10 and so are biased.
Computer Exercise 4.4. Demonstrate that the MLE is consistent for estimating θ for a
U (0, θ) distribution.
Method: Generate the random variables one at a time. After each is generated calculate the
MLE of θ for the sample of size n generated to that point and so obtain a sequence of estimators,
{θn }. Plot the sequence.
Solution of Computer Exercise 4.4. Uniform random variables are generated and θ̂n is found as the maximum of θ̂n−1 and the uniform rvs generated at the nth step. The estimates are plotted in order.
#_________ UniformConsistency.R _____
theta <- 10
sampsz <- 10
nsimulations <- 100
ML.est <- numeric(nsimulations)
for (i in 1:nsimulations){
ru <- runif(n=sampsz,min=0,max=theta)
if(i==1) ML.est[i] <- max(ru)
else ML.est[i] <- max(ML.est[i-1],max(ru) )
}
plot(ML.est,type='l')
abline(h=theta,lty=2)
    L(θ) = θʳ (1 − θ)ⁿ⁻ʳ,

where r = Σ_{i=1}^n xi is the number of observed successes (1's) in the vector y. The log-likelihood function is then
ℓ(θ) = r ln θ + (n − r) ln(1 − θ), with score S(θ) = r/θ − (n − r)/(1 − θ), so that θ̂ = r/n and I(θ̂) > 0, guaranteeing that θ̂ is the MLE. Each Xi is a Bernoulli random variable and has expected value E(Xi) = θ, and
variance Var(Xi ) = θ(1 − θ). The MLE θ̂(y) is itself a random variable and has expected value
    E(θ̂) = E(r/n) = E[ (1/n) Σ_{i=1}^n Xi ] = (1/n) Σ_{i=1}^n E(Xi) = (1/n)(nθ) = θ,
Here, the most plausible value for θ is θ̂ = 0, implying L(θ̂) = 1. The relative likelihood is R3 (θ) = L(θ)/L(θ̂) =
L(θ). R1 (θ) is plotted as the dashed curve in figure 4.7.2. R2 (θ) is plotted as the dotted curve in figure 4.7.2.
R3 (θ) is plotted as a solid curve in figure 4.7.2.
Suppose that the geneticists had planned to stop sampling once they observed r = 10 subjects with the specified genotype, and the tenth subject with the genotype was the 100th subject analysed overall. The likelihood of θ can then be computed from the negative binomial distribution as

    L(θ) = ( (n−1) choose (r−1) ) θʳ (1 − θ)ⁿ⁻ʳ,

for n = 100, r = 10. The usual calculation will confirm that θ̂ = r/n is the MLE.
Count       0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
Observed   57  203  383  525  532  408  273  139   45   27   10    4    1    0    1
    fX(x|θ) = θˣ exp(−θ) / x!.

The likelihood function equals

    L(θ) = Πᵢ [ θ^{xi} exp(−θ) / xi! ] = θ^{Σ xi} exp(−nθ) / Πᵢ xi!.
i        0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
Oi      57  203  383  525  532  408  273  139   45   27   10    4    1    0    1
Ei      54  211  407  525  508  393  254  140   68   29   11    4    1    0    0
Oi−Ei   +3   −8  −24    0  +24  +15  +19   −1  −23   −2   −1    0   −1   +1   +1
The Poisson law agrees with the observed variation within about one-twentieth of its range.
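The fitted values above can be reproduced with a few lines of R (an addition to the notes): the MLE of θ is the frequency-weighted sample mean, and the expected counts are Ei = n·dpois(i, θ̂).

# Fit the Poisson model to the frequency table and compute expected counts.
counts   <- 0:14
observed <- c(57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 1, 0, 1)
n         <- sum(observed)                        # total number of intervals
theta.hat <- sum(counts * observed) / n           # MLE of theta (about 3.87)
expected  <- round(n * dpois(counts, theta.hat))  # compare with the Ei row above
rbind(observed, expected, difference = observed - expected)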
In order to work out the expectation and variance of θ̂ (here θ̂ = n/Σ Xi, the MLE of the exponential rate θ) we need to work out the probability distribution of Z = Σ_{i=1}^n Xi, where Xi ∼ Exp(θ). From the appendix on probability theory we have Z ∼ Ga(θ, n). Then

    E(1/Z) = ∫₀^∞ (1/z) [ θⁿ zⁿ⁻¹ exp(−θz) / Γ(n) ] dz
           = [θ²/Γ(n)] ∫₀^∞ (θz)ⁿ⁻² exp(−θz) dz
           = [θ/Γ(n)] ∫₀^∞ uⁿ⁻² exp(−u) du
           = θ Γ(n − 1)/Γ(n) = θ/(n − 1).

This implies

    E[θ̂] = E(n/Z) = n E(1/Z) = nθ/(n − 1),

which turns out to be biased. Propose the alternative estimator θ̃ = [(n − 1)/n] θ̂. Then

    E[θ̃] = E[ ((n − 1)/n) θ̂ ] = ((n − 1)/n) E[θ̂] = ((n − 1)/n) · nθ/(n − 1) = θ,
so θ̃ is unbiased. The variance of θ̃ follows from further (but straightforward) calculations:
    E(1/Z²) = [1/Γ(n)] ∫₀^∞ θⁿ zⁿ⁻³ exp(−θz) dz
            = [θ²/Γ(n)] ∫₀^∞ uⁿ⁻³ exp(−u) du
            = θ² Γ(n − 2)/Γ(n) = θ² / [(n − 1)(n − 2)]

    ⇒  Var[θ̃] = E[θ̃²] − (E[θ̃])² = (n − 1)² E(1/Z²) − θ²
              = (n − 1)² θ² / [(n − 1)(n − 2)] − θ² = θ²/(n − 2).
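A short simulation (an addition to this example; the values of θ, n and the number of replications are arbitrary) confirms both the bias of θ̂ = n/Σ Xi and the unbiasedness of θ̃ = (n − 1)/Σ Xi.

# Simulation check: n/sum(X) has mean n*theta/(n-1); (n-1)/sum(X) has mean theta.
theta <- 2; n <- 5; nsim <- 50000
sums <- replicate(nsim, sum(rexp(n, rate = theta)))
c(mean(n/sums), n*theta/(n - 1))       # biased estimator and its expectation 2.5
c(mean((n - 1)/sums), theta)           # bias-corrected estimator and theta = 2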
Clearly P (T ≤ t) = 1 − e−t/μ implies P (T > 100) = e−100/μ is the probability of a component surviving the 100
hour test. Then
    L(μ) = Π_{i=1}^m (1/μ) e^{−ti/μ} × ( e^{−100/μ} )^{n−m},

    ℓ(μ) = −m ln μ − (1/μ) Σ_{i=1}^m ti − (1/μ) 100(n − m),

    S(μ) = −m/μ + (1/μ²) Σ_{i=1}^m ti + (1/μ²) 100(n − m).

Setting S(μ̂) = 0 suggests the estimator μ̂ = [ Σ_{i=1}^m ti + 100(n − m) ] / m. Also, I(μ̂) = m/μ̂² > 0, and μ̂ is indeed
the MLE. Although failure times were recorded for just m components, this example usefully demonstrates that
all n components contribute to the estimation of the mean failure parameter μ. The n − m surviving components
are often referred to as right censored.
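In R the estimator is a one-liner; the sketch below (an addition, with hypothetical failure times and a hypothetical number of components on test) computes μ̂ from the censored data.

# Censored-data MLE: (total observed failure time + 100 hours for each survivor) / m.
fail.times <- c(12, 37, 45, 61, 80, 94)   # hypothetical observed failure times (m = 6)
n.total    <- 10                          # hypothetical total number of components tested
m          <- length(fail.times)
mu.hat <- (sum(fail.times) + 100*(n.total - m)) / m
mu.hat                                    # estimated mean failure time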
Unknown mean and known variance: as υ is known we treat this parameter as a constant when differentiating with respect to μ. Then

    S(μ) = (1/υ) Σ_{i=1}^n (xi − μ),   μ̂ = (1/n) Σ_{i=1}^n xi,   and   I(μ) = n/υ > 0 for all μ.

Known mean and unknown variance: the MLE of υ is υ̂ = (1/n) Σ_{i=1}^n (xi − μ)², which can be written as υ̂ = (υ/n) Σ_{i=1}^n Zi², where the Zi = (xi − μ)/√υ are i.i.d. N(0, 1). Then

    E[υ̂] = E[ (υ/n) Σ_{i=1}^n Zi² ] = υ,
and

    Var[υ̂] = (υ/n)² Var[ Σ_{i=1}^n Zi² ] = (υ/n)² · 2n = 2υ²/n.

Finally,

    E[I(υ)] = −n/(2υ²) + (1/υ³) E[ Σ_{i=1}^n (xi − μ)² ]
            = −n/(2υ²) + nυ/υ³
            = n/(2υ²).
Hence the CRLB = 2υ 2 /n, and so υ̂ has efficiency 1.
Our treatment of the two parameters of the Gaussian distribution in the last example was
to (i) fix the variance and estimate the mean using maximum likelihood; and then (ii) fix the
mean and estimate the variance using maximum likelihood.
In practice we would like to consider the simultaneous estimation of these parameters. In
the next section of these notes we extend MLE to multiple parameter estimation.
4.8 Properties of Sample Mean and Sample Variance

Theorem 4.13. Let X be a random variable with mean μ and variance σ². Let X̄ be the sample mean based on a random sample of size n. Then X̄ is an unbiased and consistent estimator of μ.
Proof. Now E(X̄) = μ, no matter what the sample size is, and Var(X̄) = σ²/n. The latter approaches 0 as n → ∞, satisfying Theorem 4.10.
It can also be shown that, of all linear functions of X1, X2, . . . , Xn, X̄ has minimum variance. Note that the above theorem is true no matter what distribution is sampled. Some applications are given below.
For a random sample X1, X2, . . . , Xn, X̄ is an unbiased and consistent estimator of:
Sample Variance
Recall that the sample variance is defined by

    S² = Σ_{i=1}^n (Xi − X̄)² / (n − 1).

We now show that S² is an unbiased estimator of σ².
Proof.

    (n − 1) E(S²) = E[ Σ_{i=1}^n (Xi − X̄)² ]
                  = E[ Σ_{i=1}^n {(Xi − μ) − (X̄ − μ)}² ]
                  = E[ Σ_{i=1}^n (Xi − μ)² − 2(X̄ − μ) Σ_{i=1}^n (Xi − μ) + n(X̄ − μ)² ]
                  = E[ Σ_{i=1}^n (Xi − μ)² − 2n(X̄ − μ)² + n(X̄ − μ)² ]
                  = E[ Σ_{i=1}^n (Xi − μ)² ] − n E(X̄ − μ)²
                  = Σ_{i=1}^n Var(Xi) − n Var(X̄)
                  = nσ² − n(σ²/n)
                  = (n − 1)σ².
So
E(S 2 ) = σ 2 , (4.8.1)
which proves the unbiasedness.
For any distribution that has a fourth moment,

    Var(S²) = (μ4 − 3μ2²)/n + 2μ2²/(n − 1).    (4.8.2)
Clearly lim_{n→∞} Var(S²) = 0, so from Theorem 4.10, S² is a consistent estimator of σ².
2. The number in the denominator of S 2 , that is, n − 1, is called the number of degrees
of freedom. The numerator is the sum of n deviations (from the mean) squared but the
deviations are not independent. There is one constraint on them, namely the fact that Σ(Xi − X̄) = 0. As soon as n − 1 of the Xi − X̄ are known, the nth one is determined.
or, equivalently,

    s² = ( Σ xi² − n x̄² ) / (n − 1).

The equivalence of the two forms is easily seen:

    Σ (xi − x̄)² = Σ (xi² − 2x̄ xi + x̄²) = Σ xi² − 2x̄ Σ xi + n x̄²,

where the RHS can readily be seen to be Σ xi² − (Σ xi)²/n = Σ xi² − n x̄².
4.9 Multi-parameter Estimation

Suppose now that the model has two unknown parameters, α and β say, with log-likelihood ℓ(α, β), score functions S1(α, β) = ∂ℓ/∂α and S2(α, β) = ∂ℓ/∂β, and information matrix I(α, β) whose entries are the negative second partial derivatives of ℓ. The MLEs are found by solving the simultaneous equations

    S1(α, β) = 0,
    S2(α, β) = 0.

The conditions for a value (α0, β0) satisfying S1(α0, β0) = 0 and S2(α0, β0) = 0 to be a MLE are that

    I11(α0, β0) > 0,   I22(α0, β0) > 0,

and

    det(I(α0, β0)) = I11(α0, β0) I22(α0, β0) − I12(α0, β0)² > 0.

This is equivalent to requiring that both eigenvalues of the matrix I(α0, β0) be positive.
Hence

    S1(μ, v) = ∂ℓ/∂μ = (1/v) Σ_{i=1}^n (xi − μ) = 0

implies that

    μ̂ = (1/n) Σ_{i=1}^n xi = x̄.    (4.9.1)

Also

    S2(μ, v) = ∂ℓ/∂v = −n/(2v) + (1/(2v²)) Σ_{i=1}^n (xi − μ)² = 0

implies that

    v̂ = (1/n) Σ_{i=1}^n (xi − μ̂)² = (1/n) Σ_{i=1}^n (xi − x̄)².    (4.9.2)
Calculating second derivatives and multiplying by −1 gives that the information matrix I(μ, v) equals

    I(μ, v) = [ n/v                          (1/v²) Σ_{i=1}^n (xi − μ)
                (1/v²) Σ_{i=1}^n (xi − μ)    −n/(2v²) + (1/v³) Σ_{i=1}^n (xi − μ)² ]
Lemma 4.15 (Joint distribution of the sample mean and sample variance). If X1 , . . . , Xn are
i.i.d. N (μ, v) then the sample mean X̄ and sample variance (n − 1)S 2 are independent. Also X̄
is distributed N (μ, v/n) and (n − 1)S 2 /v is a chi-squared random variable with n − 1 degrees
of freedom.
Proof. Define

    W = Σ_{i=1}^n (Xi − X̄)² = Σ_{i=1}^n (Xi − μ)² − n(X̄ − μ)²,

so that

    W/v + (X̄ − μ)²/(v/n) = Σ_{i=1}^n (Xi − μ)²/v.

The RHS is the sum of n independent standard normal random variables squared, and so is distributed χ² with n degrees of freedom. Also, X̄ ∼ N(μ, v/n), therefore (X̄ − μ)²/(v/n) is the square of a standard normal and so is distributed χ² with 1 degree of freedom. These chi-squared random variables have moment generating functions (1 − 2t)^{−n/2} and (1 − 2t)^{−1/2} respectively. Next, W/v and (X̄ − μ)²/(v/n) are independent:

    Cov(Xi − X̄, X̄) = Cov(Xi, X̄) − Cov(X̄, X̄)
                    = Cov( Xi, (1/n) Σⱼ Xj ) − Var(X̄)
                    = (1/n) Σⱼ Cov(Xi, Xj) − v/n
                    = v/n − v/n = 0.

But Cov(Xi − X̄, X̄ − μ) = Cov(Xi − X̄, X̄) = 0, hence

    Cov( Σᵢ (Xi − X̄), X̄ − μ ) = Σᵢ Cov(Xi − X̄, X̄ − μ) = 0.

As the moment generating function of the sum of independent random variables is equal to the product of their individual moment generating functions, we see

    E[ e^{t(W/v)} ] (1 − 2t)^{−1/2} = (1 − 2t)^{−n/2}
    ⇒ E[ e^{t(W/v)} ] = (1 − 2t)^{−(n−1)/2}.
Observe that

    E(ṽ) = [n/(n − 1)] E(v̂) = [n/(n − 1)] × [(n − 1)/n] v = v,

and ṽ is unbiased as suggested. We can easily show that

    Var(ṽ) = 2v²/(n − 1).
Hence

    eff(ṽ) = [2v²/n] ÷ [2v²/(n − 1)] = (n − 1)/n = 1 − 1/n.

Clearly ṽ is not efficient, but it is asymptotically efficient.
4.10 Newton-Raphson Optimization

Since I(θ) > 0 for all θ, the MLE may be found by solving the equation S(θ) = 0. It is not immediately obvious how to solve this equation, so we expand S about an initial guess θ0:

    0 = S(θ̂) = S(θ0) + (θ̂ − θ0)S′(θ0) + (θ̂ − θ0)² S′′(θ0)/2 + · · · .

Ignoring the quadratic and higher order terms, and recalling that S′(θ0) = −I(θ0), an approximate solution is θ̂ ≈ θ0 + S(θ0)/I(θ0). We now replace θ0 by this improved approximation θ0 + S(θ0)/I(θ0) and keep repeating the process until we find a value θ̂ for which |S(θ̂)| < ε, where ε is some prechosen small number such as 0.000001. This method is called Newton's method for solving a non-linear equation.
If a unique solution to S(θ) = 0 exists, Newton’s method works well regardless of the choice of θ0 . When
there are multiple solutions, the solution to which the algorithm converges depends crucially on the choice of θ0 .
In many instances it is a good idea to try various starting values just to be sure that the method is not sensitive
to the choice of θ0 .
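As a concrete illustration (an addition to the notes), the sketch below applies the update θ1 = θ0 + S(θ0)/I(θ0) to the exponential-mean model of Example 4.3, where the closed-form MLE x̄ is available as a check; the simulated data and the starting value are arbitrary.

# Newton-Raphson for the MLE of an exponential mean mu:
# score S(mu) = -n/mu + sum(x)/mu^2, observed information I(mu) = -S'(mu).
set.seed(1)
x <- rexp(50, rate = 1/5)                      # simulated data with true mean 5
S <- function(mu) -length(x)/mu + sum(x)/mu^2
I <- function(mu) -length(x)/mu^2 + 2*sum(x)/mu^3
mu <- 1                                        # starting value theta_0
for (iter in 1:100) {
  mu <- mu + S(mu)/I(mu)                       # Newton-Raphson update
  if (abs(S(mu)) < 1e-6) break                 # stop when the score is near zero
}
c(newton = mu, closed.form = mean(x))          # the two should agree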
Thus, given θ0, the update θ1 = θ0 + [I(θ0)]⁻¹ S(θ0) is a method for finding an improved guess at θ̂. We then replace θ0 by this improved guess and repeat the process. We keep repeating the process until we obtain a value θ* for which |Sr(θ*)| < ε for r = 1, 2, . . . , m, where ε is some small number like 0.0001. θ* will be an approximate solution to the set of equations S(θ) = 0. We then evaluate the matrix I(θ*) and if all m of its eigenvalues are positive we set θ̂ = θ*.
0.00, 0.23, −0.05, 0.01, −0.89, 0.19, 0.28, 0.51, −0.25 and 0.27.
⇒ θ1 = 0.2160061
θ2 = 0.2005475
θ3 = 0.2003788
θ4 = 0.2003788
Figure 4.10.1: Relative likelihood for the radioactive scatter, solved by Newton Raphson.
The profile likelihood may be used to a considerable extent as a full likelihood (but not always!). It is customary to use the curvature of the profile likelihood function as an estimate of the variability of θ. For more information about profile likelihoods, we refer to [1].
where for convenience we have written tm+1 = · · · = tn = 100. This yields score functions

    Sa(a, b) = m/a − m ln(b) + Σ_{i=1}^m ln(ti) − Σ_{i=1}^n (ti/b)ᵃ ln(ti/b),

and

    Sb(a, b) = −ma/b + (a/b) Σ_{i=1}^n (ti/b)ᵃ.    (4.10.2)

It is not obvious how to solve Sa(a, b) = Sb(a, b) = 0 for a and b.
This reduces our two parameter problem to a search over the “new” one-parameter profile log-likelihood

    ℓa(a) = ℓ(a, b̂(a)) = m ln(a) − m ln[ (1/m) Σ_{i=1}^n tiᵃ ] + (a − 1) Σ_{i=1}^m ln(ti) − m,    (4.10.4)

where b̂(a) = [ (1/m) Σ_{i=1}^n tiᵃ ]^{1/a} is the solution of Sb(a, b) = 0 for fixed a.
Given an initial guess a0 for a, an improved estimate a1 can be obtained using a single parameter Newton-
Raphson updating step a1 = a0 + S(a0 )/I(a0 ), where S(a) and I(a) are now obtained from a (a). The estimates
â = 1.924941 and b̂ = 78.12213 were obtained by applying this method to the Weibull data using starting values
a0 = 0.001 and a0 = 5.8 in 16 and 13 iterations respectively. However, the starting value a0 = 5.9 produced the
sequence of estimates a1 = −5.544163, a2 = 8.013465, a3 = −16.02908, a4 = 230.0001 and subsequently crashed.
4.10.6 Reparameterization
Negative parameter estimates can be avoided by reparameterizing the profile log-likelihood in
(4.10.4) using α = ln(a). Since a = eα we are guaranteed to obtain a > 0. The reparameterized
profile log-likelihood becomes

    ℓα(α) = mα − m ln[ (1/m) Σ_{i=1}^n ti^{exp(α)} ] + (e^α − 1) Σ_{i=1}^m ln(ti) − m,
The estimates â = 1.924941 and b̂ = 78.12213 were obtained by applying this method to the
Weibull data using starting values a0 = 0.07 and a0 = 76 in 103 and 105 iterations respectively.
However, the starting values a0 = 0.06 and a0 = 77 failed due to division by computationally
tiny (1.0e-300) values.
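The profile log-likelihood in the α parameterization is easy to evaluate directly, and the following R sketch (an addition to the notes) maximizes it with optimize() as a robustness check on the Newton-Raphson iterations; the failure times, the total number of components and the censoring at 100 hours are hypothetical stand-ins for the Weibull data used in the text.

# Maximize the reparameterized profile log-likelihood of (4.10.4) with alpha = log(a).
t.obs   <- c(55, 68, 73, 81, 92)                         # hypothetical observed failure times
n.total <- 12                                            # hypothetical total components on test
t.all   <- c(t.obs, rep(100, n.total - length(t.obs)))   # censored times recorded as 100
m       <- length(t.obs)
profile.loglik <- function(alpha) {
  a <- exp(alpha)                                        # a = exp(alpha) guarantees a > 0
  m*alpha - m*log(sum(t.all^a)/m) + (a - 1)*sum(log(t.obs)) - m
}
fit   <- optimize(profile.loglik, interval = c(-5, 5), maximum = TRUE)
a.hat <- exp(fit$maximum)
b.hat <- (sum(t.all^a.hat)/m)^(1/a.hat)                  # b(a) from solving S_b(a, b) = 0
c(a.hat = a.hat, b.hat = b.hat)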
4.11 Bayesian Estimation
The last result is Bayes' theorem and in this form shows how we can “invert” probabilities, getting P(Hn|E) from P(E|Hn). When the Hn consist of exclusive and exhaustive events,

    P(Hn|E) = P(Hn) P(E|Hn) / Σm P(Hm) P(E|Hm).    (4.11.1)
In the Bayesian framework the unknown parameters are treated as random variables,

    θ = (θ1, θ2, . . . , θk),

and a priori beliefs about their values can be expressed in terms of the pdf p(θ). Then we collect data,

    X = (X1, X2, . . . , Xn),

which have a probability distribution that depends on θ, expressed as p(X|θ). From (4.11.2),

    p(θ|X) ∝ p(X|θ) × p(θ).    (4.11.3)
The term, p(X|θ) may be considered as a function of X for fixed θ, i.e. a density of X which is
parameterized by θ.
We can also consider the same term as a function of θ for fixed X and then it is termed the
likelihood function,
    L(θ|X) = p(X|θ).

The terms in expression (4.11.3) represent the three components of Bayesian inference:
The posterior is a combination of the likelihood, where information about θ comes from the
data X and the prior p(θ) where the information is the knowledge of θ independent of X. This
knowledge may come from previous sampling, say. The posterior represents an update of the prior p(θ) with the new information at hand, i.e. the data x.
If the likelihood is weak due to insufficient sampling or wrong choice of likelihood function,
the prior can dominate so that the posterior is just an adaptation of the prior. Alternatively, if
the sample size is large so that the likelihood function is strong, the prior will not have much
impact and the Bayesian analysis is the same as the maximum likelihood.
The output is a distribution, p(θ|X), and we may interpret it using summaries such as the median. A tabular view of the difference between Bayesian and frequentist inference is given in the table below. In a frequentist world, θ is considered a fixed but unknown feature of the population from which data is being (randomly) sampled. In a Bayesian world it is the estimate θ̂ that is fixed (a function of the data available for analysis) and θ is a random variable, subject to (subjective) uncertainty.
From the output (i.e., posterior distribution of θ) we can derive intervals where the true
value of θ would lie with a certain probability. We differentiate between credible and highest
probability density regions.
                            Bayesian      Frequentist
θ                           random        fixed but unknown
θ̂                           fixed         random
‘randomness’                subjective    sampling
distribution of interest    posterior     sampling distribution
Figure 4.11.2 depicts a density with shaded areas of 0.9 in 2 cases. In frame 4.2(a), observe
that there are quantiles outside the interval (1.37, 7.75) for which the density is greater than
quantiles within the interval. Frame 4.2(b) depicts the HDR as (0.94, 6.96).
Figure 4.11.2: Comparison of 2 types of regions, a Confidence Interval and a Highest Density Region.
We may now have a notion about uncertainty, but how do we derive a Bayesian point estimate? Bayes estimates are single number summaries of a posterior density. As there are modes, medians, means, etc., the question is which summary to choose. Different loss functions rationalize different point estimates. The quadratic loss l(θ, θ̃) = (θ − θ̃)² indicates the quadratic loss from the use of the estimate θ̃ instead of θ. The posterior mean, which is the Bayes estimate under quadratic loss, is given by

    E(θ|y) = θ̃ = ∫ θ p(θ|y) dθ.    (4.11.4)
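For a concrete instance (an addition to the notes, with hypothetical prior parameters and data), the conjugate Beta-Binomial model gives the posterior mean in closed form, and (4.11.4) can also be evaluated by numerical integration.

# Posterior mean for a binomial proportion theta with a Beta(a, b) prior:
# the posterior is Beta(a + x, b + n - x), so the Bayes estimate under
# quadratic loss is (a + x)/(a + b + n).
a <- 1; b <- 1            # uniform prior on theta
x <- 8; n <- 10           # hypothetical data: 8 successes in 10 trials
(a + x)/(a + b + n)       # posterior mean, 0.75 (compare with the MLE x/n = 0.8)
# the same value by integrating theta against the posterior density, as in (4.11.4):
integrate(function(th) th * dbeta(th, a + x, b + n - x), 0, 1)$value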
CHAPTER 5

CONFIDENCE INTERVALS
5.1 Introduction
We have previously been considering point estimators of a parameter and, in the Bayesian context, we also briefly touched upon credible intervals. In the frequentist context, too, we can construct an interval to which we can attach a probability that the true value
lies within the limits of the interval. What these intervals are and how we can construct them,
will be the subject of this chapter.
Recall that by point estimator we are referring to the fact that, after the sampling has
been done and the observed value of the estimator computed, our end-product is the single
number which is hopefully a good approximation for the unknown true value of the parameter.
If the estimator is good according to some criteria (and we have seen several of these criteria in Chapter 4), then the estimate should be reasonably close to the unknown true value. But the
single number itself does not include any indication of how high the probability might be that
the estimator has taken on a value close to the true unknown value. The method of confidence
intervals gives both an idea of the actual numerical value of the parameter, by giving it a range
of possible values, and a measure of how confident we are that the true value of the parameter
is in that range. To pursue this idea further consider the following example.
Example 5.1.
Consider a random sample of size n for a normal distribution with mean μ (unknown) and known variance σ 2 .
Find a 95% CI for the unknown mean, μ.
Solution of Example 5.1. We know that the best estimator of μ is X̄, and the sampling distribution of X̄ is N(μ, σ²/n). Then, from the standard normal,

    P( |X̄ − μ| / (σ/√n) < 1.96 ) = 0.95.

The event |X̄ − μ|/(σ/√n) < 1.96 is equivalent to the event

    μ − 1.96 σ/√n < X̄ < μ + 1.96 σ/√n,

which is equivalent to the event

    X̄ − 1.96 σ/√n < μ < X̄ + 1.96 σ/√n.

Hence

    P( X̄ − 1.96 σ/√n < μ < X̄ + 1.96 σ/√n ) = 0.95.    (5.1.1)

The two statistics X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n are the endpoints of a 95% CI for μ. This is reported as:

    The 95% CI for μ is ( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n ).
Computer Exercise 5.1. Generate 100 samples of size 9 from a N (0, 1) distribution. Find
the 95% CI for μ for each of these samples and count the number that do (don’t) contain zero.
(You could repeat this say 10 times to build up the total number of CI’s generated to 1000.) You
should observe that about 5% of the intervals don’t contain the true value of μ (= 0).
Solution of Computer Exercise 5.1. Use the commands:
#___________ ConfInt.R __________
sampsz <- 9
nsimulations <- 100
non.covered <- 0
for (i in 1:nsimulations){
rn <- rnorm(mean=0,sd=1,n=sampsz)
CI <- mean(rn) + c(-1.96,1.96)/sqrt(sampsz)   # 95% CI for mu (sigma = 1 known)
if (CI[1] > 0 | CI[2] < 0) non.covered <- non.covered + 1
}
cat("Rate of non covering CI's",100*non.covered/nsimulations," % \n")
> source("ConfInt.R")
Rate of non covering CI’s 8 %
This implies that 8 of the CI's don't contain 0. With a larger number of simulations we would expect the proportion of CI's not containing zero to be closer to 5%.
We make the following definition:
Definition 5.1 (Random interval). An interval, at least one of whose endpoints is a random
variable is called a random interval.
In (5.1.1), we are saying that the probability is 0.95 that the random interval

    ( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n )

contains μ.
A CI has to be interpreted carefully. For a particular sample, where x̄ is the observed value of X̄, a 95% CI for μ is

    ( x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n ),    (5.1.2)

but the statement

    x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n

is either true or false. The parameter μ is a constant and either the interval contains it, in
which case the statement is true, or it does not contain it, in which case the statement is false.
A probability 0.95 needs to be interpreted in terms of the relative frequency with which the
indicated event will occur “in the long run” of similar sampling experiments. Each time we
take a sample of size n, a different x, and hence a different interval (5.1.2) would be obtained.
Some of these intervals will contain μ as claimed, and some will not. In fact, if we did this many
times, we’d expect that 95 times out of 100 the interval obtained would contain μ. The measure
of our confidence is then 0.95 because before a sample is drawn there is a probability of 0.95
that the confidence interval to be constructed will cover the true mean.
A statement such as P (3.5 < μ < 4.9) = 0.95 is incorrect and should be replaced by
A 95% CI for μ is (3.5, 4.9).
That is, the area under the normal curve above zα/2 is α/2. Then

    P( −zα/2 < (X̄ − μ)/(σ/√n) < zα/2 ) = 1 − α.    (5.1.4)
5.2 Exact Confidence Intervals
Obviously, confidence intervals for a given parameter are not unique. For example, we have considered a symmetric, two-sided interval, but

    ( x̄ − z2α/3 σ/√n , x̄ + zα/3 σ/√n )

is also a 100(1 − α)% CI for μ. Likewise, we could have one-sided CIs for μ. For example,

    ( −∞ , x̄ + zα σ/√n )   or   ( x̄ − zα σ/√n , ∞ ).

The second of these arises from considering P( (X̄ − μ)/(σ/√n) < zα ) = 1 − α. Also, alternatively, we
could have a CI based on say, the sample median instead of the sample mean.
As for methods to derive point estimates, methods of obtaining CIs must be judged by their
various statistical properties. For example, one desirable property is to have the length (or
expected length) of a 100(1 − α)% CI as short as possible. Note that for the CI in (5.1.5), the
length is constant for given n.
Example 5.2.
Suppose we are going to observe data x where x = (x1, x2, . . . , xn), and x1, x2, . . . , xn are the observed values of random variables X1, X2, . . . , Xn which are thought to be iid N(θ, 1) for some unknown parameter θ ∈ (−∞, ∞) = Θ. Consider the subset C(x) = [x̄ − 1.96/√n, x̄ + 1.96/√n]. If we carry out an infinite sequence of independent repetitions of the experiment then we will get an infinite sequence of x values and thereby an infinite sequence of subsets C(x). We might ask what proportion of this infinite sequence of subsets actually contain the fixed but unknown value of θ?
Solution of Example 5.2. Since C(x) depends on x only through the value of x̄, we need to know how x̄ behaves in the infinite sequence of repetitions. This follows from the fact that X̄ has a N(θ, 1/n) density, and so Z = (X̄ − θ)/(1/√n) = √n(X̄ − θ) has a N(0, 1) density. Thus even though θ is unknown we can calculate the probability that the value of Z will exceed 2.78, for example, using the standard normal tables. Remember that (from a frequentist viewpoint) the probability is the proportion of experiments in the infinite sequence of repetitions which produce a value of Z greater than 2.78.
In particular we have that P [|Z| ≤ 1.96] = 0.95. Thus 95% of the time Z will lie between
−1.96 and +1.96. But
\begin{align*}
-1.96 \le Z \le +1.96 &\;\Rightarrow\; -1.96 \le \sqrt{n}(\bar{X} - \theta) \le +1.96 \\
&\;\Rightarrow\; -1.96/\sqrt{n} \le \bar{X} - \theta \le +1.96/\sqrt{n} \\
&\;\Rightarrow\; \bar{X} - 1.96/\sqrt{n} \le \theta \le \bar{X} + 1.96/\sqrt{n} \\
&\;\Rightarrow\; \theta \in C(X)
\end{align*}
Thus we have answered the question we started with. The proportion of the infinite sequence of
subsets given by the formula C(X) which will actually include the fixed but unknown value of θ
is 0.95. For this reason the set C(X) is called a 95% confidence set or confidence interval for
the parameter θ.
It is well to bear in mind that once we have actually carried out the experiment and observed
our value of x, the resulting interval C(x) either does or does not contain the unknown parameter
θ. We do not know which is the case. All we know is that the procedure we used in constructing
C(x) is one which 95% of the time produces an interval which contains the unknown parameter.
The crucial step in the last example was finding the quantity Z = √n(X̄ − θ) whose value depended on the parameter of interest θ but whose distribution was known to be that of a standard normal variable. This leads to the following definition.
Definition. A pivotal quantity for a parameter θ is a function Q(X, θ) of the data and of θ whose probability distribution does not depend on θ (or on any other unknown parameter).
The quantity Z in the example above is a pivotal quantity for θ. The following lemma provides a method of finding pivotal quantities in general.
Lemma 5.1. Let X be a continuous random variable and define F(a) = P[X ≤ a]. Consider the random variable U = −2 log[F(X)]. Then U has a $\chi^2_2$ density. Consider the random variable V = −2 log[1 − F(X)]. Then V has a $\chi^2_2$ density.
Proof. Observe that, for a ≥ 0,
\[
P[U > a] = P\left[ -2\log F(X) > a \right] = P\left[ F(X) < e^{-a/2} \right] = e^{-a/2},
\]
since F(X) has a uniform density on (0, 1). Hence, U has density $\tfrac{1}{2}\exp(-a/2)$, which is the density of a $\chi^2_2$ variable, as required. The corresponding proof for V is left as an exercise.
Suppose that we have data X1, X2, . . . , Xn which are iid with density f(x|θ). Define $F(a|\theta) = \int_{-\infty}^{a} f(x|\theta)\,dx$ and, for i = 1, 2, . . . , n, define Ui = −2 log[F(Xi|θ)]. Then U1, U2, . . . , Un are iid,
each having a $\chi^2_2$ density. Hence $Q_1(X, \theta) = \sum_{i=1}^{n} U_i$ has a $\chi^2_{2n}$ density and so is a pivotal quantity for θ. Another pivotal quantity (also having a $\chi^2_{2n}$ density) is given by $Q_2(X, \theta) = \sum_{i=1}^{n} V_i$ where Vi = −2 log[1 − F(Xi|θ)].
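As a quick numerical check of Lemma 5.1, one could simulate U and V in R and compare them with the $\chi^2_2$ density; the exponential distribution with rate 1 below is chosen purely for illustration:
x <- rexp(10000, rate = 1)             # X with cdf F(a) = 1 - exp(-a)
u <- -2 * log(pexp(x, rate = 1))       # U = -2 log F(X)
v <- -2 * log(1 - pexp(x, rate = 1))   # V = -2 log[1 - F(X)]; here V = 2X
hist(u, freq = FALSE, breaks = 50)     # histogram of simulated U values
curve(dchisq(x, df = 2), from = 0, to = 15, add = TRUE)  # overlay the chi-squared(2) density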
According to [5], the pivotal method thus depends on finding a pivotal quantity that has two characteristics:
1. It is a function of the sample observations and the unknown parameter θ, where θ is the only unknown quantity.
2. Its probability distribution does not depend on θ.
For example, for two independent samples of sizes n1 and n2 from normal populations with means μ1 and μ2 and known variances σ1² and σ2², the quantity
\[
\frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}}
\]
is distributed as N(0, 1). So we have found a pivotal quantity which is a function of μ1 − μ2 but whose distribution does not depend on μ1 − μ2. A 95% CI for θ = μ1 − μ2 is found by considering
\[
P\left( -1.96 < \frac{\bar{X} - \bar{Y} - \theta}{\sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}} < 1.96 \right) = 0.95,
\]
which rearranges to give the interval $\bar{x} - \bar{y} \pm 1.96\sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}$.
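To see how this works numerically, one could try R commands along the following lines (the data and the values of σ1 and σ2 are simulated and chosen purely for illustration):
sigma1 <- 2; sigma2 <- 3                 # known standard deviations (illustrative)
x <- rnorm(40, mean = 10, sd = sigma1)   # sample from the first population
y <- rnorm(50, mean = 9, sd = sigma2)    # sample from the second population
est <- mean(x) - mean(y)                 # point estimate of mu1 - mu2
se  <- sqrt(sigma1^2 / length(x) + sigma2^2 / length(y))
est + c(-1.96, 1.96) * se                # 95% CI for mu1 - mu2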
Example 5.4 ([5]).
Given that X is distributed as bin(n, p), construct an approximate 100(1 − α)% confidence interval for p when n is large.
Solution of Example 5.4 ([5]). Given X is distributed as bin(n, p), an unbiased estimate of p is p̂ = X/n. For n large, X/n is approximately normally distributed. Then,
\[
E(\hat{p}) = E(X)/n = p,
\]
and
\[
\operatorname{Var}(\hat{p}) = \frac{1}{n^2}\operatorname{Var}(X) = \frac{1}{n^2}\,np(1-p) = \frac{p(1-p)}{n},
\]
so that
\[
\frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \ \text{ is distributed approximately } N(0, 1).
\]
[Note that we have found the required pivotal quantity whose distribution does not depend on p.]
An approximate 100(1 − α)% CI for p is obtained by considering
\[
P\left( -z_{\alpha/2} < \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} < z_{\alpha/2} \right) = 1 - \alpha \tag{5.2.2}
\]
where $z_{\alpha/2}$ is defined in (5.1.3). Rearranging (5.2.2), the confidence limits for p are obtained as
\[
\frac{2n\hat{p} + z_{\alpha/2}^2 \pm z_{\alpha/2}\sqrt{4n\hat{p}(1-\hat{p}) + z_{\alpha/2}^2}}{2\left(n + z_{\alpha/2}^2\right)}. \tag{5.2.3}
\]
A simpler expression can be found by dividing both numerator and denominator of (5.2.3) by 2n and neglecting terms of order 1/n. That is, a 95% CI for p is
\[
\left( \hat{p} - 1.96\sqrt{\hat{p}(1-\hat{p})/n}, \ \hat{p} + 1.96\sqrt{\hat{p}(1-\hat{p})/n} \right). \tag{5.2.4}
\]
Note that this is just the expression we would have used if we had replaced $\operatorname{Var}(\hat{p}) = p(1-p)/n$ in (5.2.2) by the estimate $\widehat{\operatorname{Var}}(\hat{p}) = \hat{p}(1-\hat{p})/n$. In practice, confidence limits for p when n is small will need to be constructed in another way. We will see more about this in future chapters.
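The limits (5.2.3) and (5.2.4) can be compared numerically in R; the values x = 12 and n = 40 below are chosen purely for illustration:
x <- 12; n <- 40; phat <- x / n; z <- qnorm(0.975)   # illustrative data
phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)    # simple limits (5.2.4)
(2 * n * phat + z^2 + c(-1, 1) * z * sqrt(4 * n * phat * (1 - phat) + z^2)) / (2 * (n + z^2))  # limits (5.2.3)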
Example 5.5 ([5]).
Construct an appropriate 90% confidence interval for λ in the Poisson distribution. Evaluate this if a sample of size 30 yields $\sum x_i = 240$.
Solution of Example 5.5 ([5]). Now X̄ is an unbiased estimator of λ for this problem, so λ can be estimated by λ̂ = x̄, with E(λ̂) = λ and Var(λ̂) = Var(X̄) = σ²/n = λ/n. By the Central Limit Theorem, for large n, the distribution of X̄ is approximately normal, so
\[
\frac{\bar{X} - \lambda}{\sqrt{\lambda/n}} \ \text{ is distributed approximately } N(0, 1).
\]
An approximate 90% CI for λ can be obtained from considering
\[
P\left( -1.645 < \frac{\bar{X} - \lambda}{\sqrt{\lambda/n}} < 1.645 \right) = 0.90, \tag{5.2.5}
\]
which, on replacing λ in the denominator by its estimate λ̂ = x̄ and substituting the observed value 240/30 = 8 for x̄, gives (7.15, 8.85).
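This calculation can be reproduced in R along the following lines:
xbar <- 240 / 30; n <- 30                        # observed mean and sample size
xbar + c(-1, 1) * qnorm(0.95) * sqrt(xbar / n)   # approximate 90% CI: (7.15, 8.85)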
Example 5.6.
Suppose that we have data X1, X2, . . . , Xn which are iid with density f(x|θ) = θ exp(−θx) for x ≥ 0, and suppose that we want to construct a 95% confidence interval for θ. We need to find a pivotal quantity for θ. Observe that
\begin{align*}
F(a|\theta) &= \int_{-\infty}^{a} f(x|\theta)\,dx \\
            &= \int_{0}^{a} \theta \exp(-\theta x)\,dx \\
            &= 1 - \exp(-\theta a).
\end{align*}
Hence
\[
Q_1(X, \theta) = -2\sum_{i=1}^{n} \log\left[ 1 - \exp(-\theta X_i) \right]
\]
has a $\chi^2_{2n}$ density and so is a pivotal quantity for θ, although it is awkward to rearrange statements about $Q_1$ into statements about θ. The alternative pivotal quantity is simpler here: $V_i = -2\log[1 - F(X_i|\theta)] = -2\log[\exp(-\theta X_i)] = 2\theta X_i$, so $Q_2(X, \theta) = 2\theta\sum_{i=1}^{n} X_i$ also has a $\chi^2_{2n}$ density. Let A and B denote the 2.5% and 97.5% quantiles of the $\chi^2_{2n}$ distribution. Then
\begin{align*}
0.95 &= P\left[ A \le Q_2(X, \theta) \le B \right] \\
     &= P\left[ A \le 2\theta \sum_{i=1}^{n} X_i \le B \right] \\
     &= P\left[ \frac{A}{2\sum_{i=1}^{n} X_i} \le \theta \le \frac{B}{2\sum_{i=1}^{n} X_i} \right],
\end{align*}
so that $\left( A\big/ 2\sum_{i=1}^{n} X_i, \ B\big/ 2\sum_{i=1}^{n} X_i \right)$ is a 95% confidence interval for θ.
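In R, the quantiles A and B and the resulting interval could be obtained along the following lines (the exponential data here are simulated purely for illustration):
x <- rexp(15, rate = 0.5)                  # illustrative data with true theta = 0.5
n <- length(x)
AB <- qchisq(c(0.025, 0.975), df = 2 * n)  # A and B: quantiles of chi-squared(2n)
AB / (2 * sum(x))                          # 95% confidence interval for theta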
Example 5.7.
Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ, σ 2 ) density where σ is known.
Define
\[
Q = \frac{\sqrt{n}(\bar{X} - \theta)}{\sigma}.
\]
Then Q has a N (0, 1) density and so is a pivotal quantity for θ. In particular we can be 95% sure that
\[
-1.96 \le \frac{\sqrt{n}(\bar{X} - \theta)}{\sigma} \le +1.96
\]
which is equivalent to
\[
\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \theta \le \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}.
\]
The R command qnorm(p=0.975, mean=0, sd=1) returns the value 1.959964, the 97.5% quantile of the standard normal distribution.
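Continuing Example 5.7 in R, with simulated data and σ = 2 treated as the known value purely for illustration:
sigma <- 2                                # known standard deviation (illustrative)
x <- rnorm(16, mean = 5, sd = sigma)      # illustrative sample
mean(x) + c(-1, 1) * qnorm(0.975) * sigma / sqrt(length(x))  # 95% CI for theta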
Example 5.8.
Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ, σ 2 ) density where θ is known.
Define
\[
Q = \frac{\sum_{i=1}^{n} (X_i - \theta)^2}{\sigma^2}.
\]
We can write $Q = \sum_{i=1}^{n} Z_i^2$ where $Z_i = (X_i - \theta)/\sigma$. Since each $Z_i$ has a N(0, 1) density, each $Z_i^2$ has a $\chi^2_1$ density. Hence, Q has a $\chi^2_n$ density and so is a pivotal quantity for σ. If n = 20 then we can be 95% sure that
\[
9.591 \le \frac{\sum_{i=1}^{n} (X_i - \theta)^2}{\sigma^2} \le 34.170,
\]
which is equivalent to
\[
\sqrt{\frac{1}{34.170}\sum_{i=1}^{n} (X_i - \theta)^2} \ \le\ \sigma\ \le\ \sqrt{\frac{1}{9.591}\sum_{i=1}^{n} (X_i - \theta)^2}.
\]
The R command qchisq(p=c(.025,.975), df=20) returns the values 9.590777 and 34.169607, the 2.5% and 97.5% quantiles of a Chi-squared distribution on 20 degrees of freedom.
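Continuing Example 5.8 in R, with θ = 0 treated as the known mean and simulated data used purely for illustration:
theta <- 0                                    # known mean (illustrative)
x <- rnorm(20, mean = theta, sd = 3)          # illustrative sample with n = 20
ss <- sum((x - theta)^2)
sqrt(ss / qchisq(c(0.975, 0.025), df = 20))   # 95% CI for sigma (note the reversed quantiles)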
Lemma 5.2 (The Student t-distribution). Suppose the random variables X and Y are independent, and X ∼ N(0, 1) and Y ∼ $\chi^2_n$. Then the ratio
\[
T = \frac{X}{\sqrt{Y/n}}
\]
is Student-t distributed with n degrees of freedom (df).
Example 5.9.
Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ, σ 2 ) density where both θ and
σ are unknown. Define
\[
Q = \frac{\sqrt{n}(\bar{X} - \theta)}{s}
\]
where
\[
s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}.
\]
We can write
\[
Q = \frac{Z}{\sqrt{W/(n - 1)}}
\]
where
\[
Z = \frac{\sqrt{n}(\bar{X} - \theta)}{\sigma}
\]
has a N(0, 1) density and
\[
W = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2}
\]
has a $\chi^2_{n-1}$ density independent of Z. Hence, by Lemma 5.2, Q has a Student-t density with n − 1 degrees of freedom and so is a pivotal quantity for θ.
The R command qchisq(p=c(.025,.975), df=30) returns the values 16.79077 and 46.97924, the 2.5% and 97.5% quantiles of a Chi-squared distribution on 30 degrees of freedom.
It is important to point out that although a probability statement involving 95% confidence has been attached to the two intervals (5.3.1) and (5.3.2) separately, this does not imply that both intervals simultaneously hold with 95% confidence.
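Continuing Example 5.9, the t-based interval for θ can be computed in R along the following lines (the data are simulated purely for illustration):
x <- rnorm(12, mean = 5, sd = 2)                               # illustrative sample
n <- length(x)
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)   # 95% CI for theta
t.test(x)$conf.int                                             # the same interval from t.test()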
Example 5.10.
Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ1 , σ 2 ) density and data
Y1 , Y2 , . . . , Ym which are iid observations from a N (θ2 , σ 2 ) density where θ1 , θ2 , and σ are unknown. Let
δ = θ1 − θ2 and define
\[
Q = \frac{(\bar{X} - \bar{Y}) - \delta}{\sqrt{s^2\left(\frac{1}{n} + \frac{1}{m}\right)}}
\]
where
\[
s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2 + \sum_{j=1}^{m} (Y_j - \bar{Y})^2}{n + m - 2}.
\]
We know that X̄ has a $N(\theta_1, \tfrac{\sigma^2}{n})$ density and that Ȳ has a $N(\theta_2, \tfrac{\sigma^2}{m})$ density. Then the difference X̄ − Ȳ has a $N\!\left(\delta, \sigma^2\left[\tfrac{1}{n} + \tfrac{1}{m}\right]\right)$ density. Hence
\[
Z = \frac{\bar{X} - \bar{Y} - \delta}{\sqrt{\sigma^2\left[\frac{1}{n} + \frac{1}{m}\right]}}
\]
has a N(0, 1) density. Let $W_1 = \sum_{i=1}^{n} (X_i - \bar{X})^2/\sigma^2$ and let $W_2 = \sum_{j=1}^{m} (Y_j - \bar{Y})^2/\sigma^2$. Then $W_1$ has a $\chi^2_{n-1}$ density, $W_2$ has a $\chi^2_{m-1}$ density, and $W = W_1 + W_2$ has a $\chi^2_{n+m-2}$ density. We can write
\[
Q = \frac{Z}{\sqrt{W/(n + m - 2)}}
\]
and so Q has a $t_{n+m-2}$ density and is a pivotal quantity for δ. Define
\[
Q_2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2 + \sum_{j=1}^{m} (Y_j - \bar{Y})^2}{\sigma^2}.
\]
Then $Q_2$ has a $\chi^2_{n+m-2}$ density and so is a pivotal quantity for σ.
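Continuing Example 5.10 in R (with simulated data, purely for illustration), the interval for δ can be computed directly or with t.test() using var.equal = TRUE:
x <- rnorm(10, mean = 5, sd = 2); y <- rnorm(14, mean = 4, sd = 2)  # illustrative samples
n <- length(x); m <- length(y)
s2 <- (sum((x - mean(x))^2) + sum((y - mean(y))^2)) / (n + m - 2)   # pooled variance estimate
se <- sqrt(s2 * (1/n + 1/m))
(mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df = n + m - 2) * se     # 95% CI for delta
t.test(x, y, var.equal = TRUE)$conf.int                             # the same interval from t.test()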
Lemma 5.3 (The F-distribution). Suppose the random variables X and Y are independent, and X ∼ $\chi^2_n$ and Y ∼ $\chi^2_m$. Then the ratio
\[
F = \frac{X/n}{Y/m}
\]
has the distribution called the Fisher, or F, distribution with parameters (degrees of freedom) n, m, or the $F_{n,m}$ distribution for short. The corresponding pdf $f_{F_{n,m}}$ is concentrated on the positive half axis:
\[
f_{F_{n,m}}(z) = \frac{\Gamma((n+m)/2)}{\Gamma(n/2)\Gamma(m/2)} \left(\frac{n}{m}\right)^{n/2} z^{\,n/2-1} \left(1 + \frac{n}{m}z\right)^{-(n+m)/2} \quad \text{for } z > 0.
\]
Suppose now that X1, X2, . . . , Xn are iid observations from a $N(\theta_X, \sigma_X^2)$ density and Y1, Y2, . . . , Ym are iid observations from a $N(\theta_Y, \sigma_Y^2)$ density, and that we want a confidence interval for the ratio $\lambda = \sigma_X/\sigma_Y$ based on the observed variance ratio $F^{*} = s_X^2/s_Y^2$, where $s_X^2$ and $s_Y^2$ are the two sample variances. Let
\[
W_X = \sum_{i=1}^{n} (X_i - \bar{X})^2/\sigma_X^2
\]
and let
\[
W_Y = \sum_{j=1}^{m} (Y_j - \bar{Y})^2/\sigma_Y^2.
\]
Then $W_X$ has a $\chi^2_{n-1}$ density and $W_Y$ has a $\chi^2_{m-1}$ density. Hence, by Lemma 5.3,
\[
Q = \frac{W_X/(n-1)}{W_Y/(m-1)} \equiv \frac{F^{*}}{\lambda^2}
\]
has an F density with n − 1 and m − 1 degrees of freedom and so is a pivotal quantity for λ. Suppose that n = 25 and m = 13. Then we can be 95% sure that 0.39 ≤ Q ≤ 3.02, which is equivalent to
\[
\sqrt{\frac{F^{*}}{3.02}} \ \le\ \lambda\ \le\ \sqrt{\frac{F^{*}}{0.39}}.
\]
To see how this might work in practice try the following R commands one at a time:
x <- rnorm(25, mean = 0, sd = 2)                # sample of size n = 25 with sd 2
y <- rnorm(13, mean = 1, sd = 1)                # sample of size m = 13 with sd 1
Fstar <- var(x)/var(y); Fstar                   # observed variance ratio F*
CutOffs <- qf(p=c(.025,.975), df1=24, df2=12)   # 2.5% and 97.5% quantiles of F(24, 12)
CutOffs; rev(CutOffs)
Fstar / rev(CutOffs)                            # 95% CI for lambda^2 = sigma_X^2/sigma_Y^2
sqrt(Fstar / rev(CutOffs))                      # 95% CI for lambda itself
var.test(x, y)                                  # built-in F test; its CI is for lambda^2
5.5 Bootstrap Confidence Intervals
The total area under the density curve (area = 1) has been subdivided into (n + 1) subregions, each with area approximately 1/(n + 1). The cumulative area under the density curve up to the i-th smallest observation is then approximated as i/(n + 1).
The following data are the numbers of typhoons in the North Pacific Ocean over 88 years; assume that they are saved in a file called TYPHOON.txt.
13 7 14 20 13 12 12 15 20 17 11 14 16 12 17 17
16 7 14 15 16 20 17 20 15 22 26 25 27 18 23 26 18 15 20 24
19 25 23 20 24 20 16 21 20 18 20 18 24 27 27 21 21 22 28 38
39 27 26 32 19 33 23 38 30 30 27 25 33 34 16 17 22 17 26 21
30 30 31 27 43 40 28 31 24 15 22 31
A plot of the ecdf is shown in Figure 5.5.2 and was generated with R code along the following lines:
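(A minimal sketch, assuming the counts are stored whitespace-separated in TYPHOON.txt.)
typhoon <- scan("TYPHOON.txt")   # read the 88 yearly typhoon counts
plot(ecdf(typhoon), main = "Empirical cdf of typhoon counts",
     xlab = "Number of typhoons per year", ylab = "Fn(x)")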
The reason why we introduced the concept of an empirical cumulative distribution function
is that empirical cumulative distributions are used when calculating bootstrap probabilities.
Here is an example. Suppose that in Example 5.5, the data were
8 6 5 10 8 12 9 9 8 11 7 3 6 7 5 8 10 7 8 8 10 8 5 10 8 6 10 6 8 14
Denote this sample by x1, x2, . . . , xn where n = 30. The summary statistics are
\[
\sum_{i=1}^{n} x_i = 240, \qquad \bar{x} = 8.
\]
We shall use this example to illustrate (a) resampling, and (b) the bootstrap distribution.
The sample, x1, x2, . . . , xn, are independently and identically distributed (i.i.d.) as Poisson(λ), which means that each observation is as important as any other for providing information about the population from which this sample is drawn. This implies that we can replace any number by one of the others and the “new” sample will still convey the same information about the population.
This is also demonstrated in Figure 5.5.3. Three “new” samples have been generated by taking samples of size n = 30 with replacement from x. The ecdf of x is shown in bold and the ecdf’s of the “new” samples are shown with different line types. We observe that there is little change in the empirical distributions or in the estimates of quantiles. If a statistic (e.g. a quantile) were estimated from this resampling process a large number of times, it would provide a reliable estimate of the corresponding population quantity. The “new” samples are termed bootstrap samples.
The bootstrap procedure for the CI for λ in the current example then goes as follows:
1. Nominate the number of bootstrap samples that will be drawn, e.g. nBS = 99.
2. Draw a bootstrap sample of size n = 30 by sampling with replacement from x, and calculate its mean; this is one bootstrap estimate of λ.
3. Repeat step 2 until nBS bootstrap estimates have been collected.
4. Form the ecdf of the bootstrap estimates and read off its 2.5% and 97.5% quantiles; these are the limits of the 95% CI (see Figure 5.5.4).
Figure 5.5.4: Deriving the 95% CI from the ecdf of bootstrap estimates of the mean
The R code that generated the graph in Figure 5.5.4 further illustrates the bootstrap approach to constructing confidence intervals.
x <- c(8,6,5,10,8,12,9,9,8,11,7,3,6,7,5,8,10,7,8,8,10,8,5,10,8,6,10,6,8,14)
n <- length(x)
nBS <- 99 # number of bootstrap simulations
BS.mean <- numeric(nBS)
i <- 1
while (i < (nBS+1) ){
BS.mean[i] <- mean(sample(x,replace=T,size=n))
i <- i + 1
} # end of the while() loop
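A short continuation (using the percentile method illustrated in Figure 5.5.4) then reads the 95% CI off the distribution of the bootstrap means:
plot(ecdf(BS.mean))                          # ecdf of the bootstrap estimates (cf. Figure 5.5.4)
quantile(BS.mean, probs = c(0.025, 0.975))   # percentile 95% CI for lambda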
R has several packages that provide bootstrap techniques, including the boot package, whose functions focus particularly on bootstrapping. The following code uses the boot package to construct the same confidence interval as before with less effort, although more effort may be required to fully understand how the boot() function works.
library(boot)
mnz <- function(z,id){mean(z[id])} # user must supply this
bs.samples <- boot(data=x,statistic=mnz,R=99)
boot.ci(bs.samples,conf=0.95,type=c("perc","bca"))
The user must supply a function (e.g. mnz here) that calculates the statistic of interest. The argument id is recognised by boot() as a vector of resampled indices from 1:length(z), so that mean(z[id]) is the mean of a bootstrap sample.