Syllabus20122013 2

Download as pdf or txt
Download as pdf or txt
You are on page 1of 61

CHAPTER 4.

ESTIMATION

CHAPTER

ESTIMATION

4.1 Introduction
The application of the methods of probability to the analysis and interpretation of data is
known as statistical inference. In particular, we wish to make an inference about a population
based on information contained in a sample. Since populations are characterized by numerical
descriptive measures called parameters, the objective of many statistical investigations is to
make an inference about one or more population parameters. There are two broad areas of
inference: estimation (the subject of this chapter) and hypothesis-testing (the subject of the
next chapter).
When we say that we have a random sample X1 , X2 , . . . , Xn “from a random variable X”
or “from a population with distribution function F (x; θ),” we mean that X1 , X2 , . . . , Xn are
identically and independently distributed random variables each with c.d.f. F (x; θ), that is,
depending on some parameter θ. We usually assume that the form of the distribution, e.g.,
binomial, Poisson, Normal, etc. is known but the parameter is unknown. We wish to obtain
information from the data (sample) to enable us to make some statement about the parameter.
Note that, θ may be a vector, e.g., θ = (μ, σ 2 ).
The general problem of estimation is to find out something about θ using the information
in the observed values of the sample, x1 , x2 , . . . , xn . That is, we want to choose a function
H(x1 , x2 , . . . , xn ) that will give us a good estimate of the parameter θ in F (x; θ).
Next we will consider some general methods of estimation. Since different methods may
lead to different estimators for the same parameter, we will then need to consider criteria for
deciding whether one estimate is better than another.

59
CHAPTER 4. ESTIMATION 4.2. STATISTICAL PHILOSOPHIES

4.2 Statistical philosophies


The dominant philosophy of inference is based on the frequentist theory of probability. According
to the frequentist theory probability statements can only be made regarding events associated
with a random experiment. Recall that events are subsets of the sample space assocatiated
with the experiment. A random experiment is an experiment which has a well defined set
of possible outcomes Ω. In addition, we must be able to envisage an infinite sequence of
independent repetitions of the experiment with the actual outcome of each repetition being some
unpredictable element of Ω. A random vector is a collection of numerical quantities associated
with each possible outcome in Ω. In performing the experiment we determine which element of
Ω has occurred and thereby the observed values of all random variables or random vectors of
interest. Since the outcome of the experiment is unpredictable so too is the value of any random
variable or random vector. Since we can envisage an infinite sequence of independent repetitions
of the experiment, we can envisage an infinite sequence of independent determinations of the
value of a random variable (or vector). The purpose of a statistical model is to describe the
unpredictability of such a sequence of determinations.
Consider the random experiment which consists of picking someone at random from the
2007 electoral register for Limerick. The outcome of such an experiment will be a human being
and the set Ω consists of all human beings whose names are on the register. We can clearly
envisage an infinite sequence of independent repetitions of such an experiment. Consider the
random variable X where X = 0 if the outcome of the experiment is a male and X = 1 if the
outcome of the experiment is a female. When we say that P (X = 1) = 0.54 we are taken to
mean that in an infinite sequence of independent repetitions of the experiment exactly 54% of
the outcomes will produce a value of X = 1.
Now consider the random experiment which consists of picking 3 people at random from the
1997 electoral register for Limerick. The outcome of such an experiment will be a collection of 3
human beings and the set Ω consists of all subsets of 3 human beings which may be formed
from the set of all human beings whose names are on the register. We can clearly envisage an
infinite sequence of independent repetitions of such an experiment. Consider the random vector
X = (X1 , X2 , X3 ) where for i = 1, 2, 3, Xi = 0 if the ith person chosen is a male and Xi = 1 if
the ith person chosen is a female. When we say that X1 , X2 , X3 are independent and identically
distributed or i.i.d. with P (Xi = 1) = θ we are taken to mean that in an infinite sequence
of independent repetitions of the experiment the proportion of outcomes which produce, for
instance, a value of X = (1, 1, 0) is given by θ2 (1 − θ).
Suppose that the value of θ is unknown and we propose to estimate it by the estimator θ̂
whose value is given by the proportion of females in the sample of size 3. Since θ̂ depends on the
value of X we sometimes write θ̂(X) to emphasise this fact. We can work out the probability

60
CHAPTER 4. ESTIMATION 4.2. STATISTICAL PHILOSOPHIES

distribution of θ̂ as follows :

X P (X = x) θ̂(x)
(0, 0, 0) (1 − θ)3 0
(0, 0, 1) θ(1 − θ)2 1/3
(0, 1, 0) θ(1 − θ)2 1/3
(1, 0, 0) θ(1 − θ)2 1/3
(0, 1, 1) θ2 (1 − θ) 2/3
(1, 0, 1) θ2 (1 − θ) 2/3
(1, 1, 0) θ2 (1 − θ) 2/3
(1, 1, 1) θ3 1

Thus P (θ̂ = 0) = (1 − θ)3 , P (θ̂ = 1/3) = 3θ(1 − θ)2 , P (θ̂ = 2/3) = 3θ2 (1 − θ) and P (θ̂ = 1) = θ3 .
We now ask whether θ̂ is a good estimator of θ? Clearly if θ = 0 we have that P (θ̂ = θ) =
P (θ̂ = 0) = 1 which is good. Likewise if θ = 1 we also have that P (θ̂ = θ) = P (θ̂ = 1) = 1. If
θ = 1/3 then P (θ̂ = θ) = P (θ̂ = 1/3) = 3(1/3)(1 − 1/3)2 = 4/9. Likewise if θ = 2/3 we have
that P (θ̂ = θ) = P (θ̂ = 2/3) = 3(2/3)2 (1 − 2/3) = 4/9. However if the value of θ lies outside
the set {0, 1/3, 2/3, 1} we have that P (θ̂ = θ) = 0.
Since θ̂ is a random variable we might try to calculate its expected value E(θ̂) i.e. the
average value we would get if we carried out an infinite number of independent repetitions of
the experiment. We have that

E(θ̂) = 0P (θ̂ = 0) + (1/3)P (θ̂ = 1/3) + (2/3)P (θ̂ = 2/3) + 1P (θ̂ = 1) ,


= 0(1 − θ)3 + (1/3)3θ(1 − θ)2 + (2/3)3θ2 (1 − θ) + 1θ3 ,
= θ.

Thus if we carried out an infinite number of independent repetitions of the experiment and
calculate the value of θ̂ for each repetition the average of the θ̂ values would be exactly θ,the
true value of the parameter! This is true no matter what the actual value of θ is. Such an
estimator is said to be unbiased.
Consider the quantity L = (θ̂ − θ)2 which might be regarded as a measure of the error or loss
involved in using θ̂ to estimate θ. The possible values for L are (0 − θ)2 , (1/3 − θ)2 , (2/3 − θ)2
and (1 − θ)2 . We can calculate the expected value of L as follows:

E(L) = (0 − θ)2 P (θ̂ = 0) + (1/3 − θ)2 P (θ̂ = 1/3)


+(2/3 − θ)2 P (θ̂ = 2/3) + (1 − θ)2 P (θ̂ = 1)
= θ2 (1 − θ)3 + (1/3 − θ)2 3θ(1 − θ)2 + (2/3 − θ)2 3θ2 (1 − θ) + (1 − θ)2 θ3 ,
= θ(1 − θ)/3 .

The quantity E(L) is called the mean squared error ( MSE ) of the estimator θ̂. Since the
quantity θ(1 − θ) attains its maximum value of 1/4 for θ = 1/2, the largest value E(L) can
attain is 1/12 which occurs if the true value of the parameter θ happens to be equal to 1/2; for
all other values of θ the quantity E(L) is less than 1/12. If somebody could invent a different
estimator θ̃ of θ whose MSE was less than that of θ̂ for all values of θ then we would prefer θ̃ to
θ̂.

61
CHAPTER 4. ESTIMATION 4.3. THE FREQUENTIST APPROACH TO ESTIMATION

This trivial example gives some idea of the kinds of calculations that we will be performing.
The basic frequentist principle is that statistical procedures should be judged in terms of their
average performance in an infinite series of independent repetitions of the experiment which
produced the data. An important point to note is that the parameter values are treated as
fixed (although unknown) throughout this infinite series of repetitions. We should be happy to
use a procedure which performs well on the average and should not be concerned with how it
performs on any one particular occasion.

4.3 The frequentist approach to estimation


Suppose that we are going to observe a value of a random vector X. Let X denote the set of
possible values X can take and, for x ∈ X , let f (x|θ) denote the probability that X takes the
value x where the parameter θ is some unknown element of the set Θ.
The problem we face is that of estimating θ. An estimator θ̂ is a procedure which for each
possible value x ∈ X specifies which element of Θ we should quote as an estimate of θ. When
we observe X = x we quote θ̂(x) as our estimate of θ. Thus θ̂ is a function of the random vector
X. Sometimes we write θ̂(X) to emphasise this point.
Given any estimator θ̂ we can calculate its expected value for each possible value of θ ∈ Θ.
An estimator is said to be unbiased if this expected value is identically equal to θ. If an estimator
is unbiased then we can conclude that if we repeat the experiment an infinite number of times
with θ fixed and calculate the value of the estimator each time then the average of the estimator
values will be exactly equal to θ. From the frequentist viewpoint this is a desireable property
and so, where possible, frequentists use unbiased estimators.

Definition 4.1 (The Frequentist philosophy). To evaluate the usefulness of an estimator


θ̂ = θ̂(x) of θ, examine the properties of the random variable θ̂ = θ̂(X).

Using different methods of estimation can lead to different estimators. Criteria for deciding
which are good estimators are required. Before listing the qualities of a good estimator, it is
important to understand that they are random variables. For example, suppose that we take a
sample of size 5 from a uniform distribution and calculate x. Each time we repeat the experiment
we will probably get a different sample of 5 and therefore a different x. The behaviour of an
estimator for different random samples will be described by a probability distribution. The
actual distribution of the estimator is not a concern here and only its mean and variance will
be considered. As a first condition it seems reasonable to ask that the distribution of the
estimator be centered around the parameter it is estimating. If not it will tend to overestimate
or underestimate θ. A second property an estimator should possess is precision. An estimator is
precise if the dispersion of its distribution is small. These two concepts are incorporated in the
definitions of unbiasedness and efficiency below. [5]

Definition 4.2 (Unbiased estimators). An estimator θ̂ = θ̂(X) is said to be unbiased for a


parameter θ if it equals θ in expectation:

E[θ̂(X)] = E(θ̂) = θ.

Intuitively, an unbiased estimator is ‘right on target’.

62
CHAPTER 4. ESTIMATION 4.4. ESTIMATION BY THE METHOD OF MOMENTS

Definition 4.3 (Bias of an estimator). The bias of an estimator θ̂ = θ̂(X) of θ is defined


as bias(θ̂) = E[θ̂(X) − θ].

There may be large number of unbiased estimators of a parameter for any given distribution
and a further criterion for choosing between all the unbiased estimators is needed. [5]

Definition 4.4 (Bias corrected estimators). If bias(θ̂) is of the form cθ, then (obviously)
θ̃ = θ̂/(1 + c) is unbiased for θ. Likewise, if bias(θ̂) = θ + c, then θ̃ = θ̂ − c is unbiased for θ. In
such situations we say that θ̃ is a biased corrected version of θ̂.

Definition 4.5 (Unbiased functions). More generally ĝ(X) is said to be unbiased for a
function g(θ) if E[ĝ(X)] = g(θ).

Note that even if θ̂ is an unbiased estimator of θ, g(θ̂) will generally not be an unbiased
estimator of g(θ) unless g is linear or affine. This limits the importance of the notion of
unbiasedness. It might be at least as important that an estimator is accurate in the sense that
its distribution is highly concentrated around θ.
Is unbiasedness a good thing? Unbiasedness is important when combining estimates, as
averages of unbiased estimators are unbiased (see the review exercises at the end of this
chapter). For example, when combining standard deviations s1 , s2 , . . . , sk with degrees of
freedom df 1 , . . . , df k we always average their squares

df 1 s21 + · · · + df k s2k
s̄ =
df 1 + · · · + df k

as s2i are unbiased estimators of the variance σ 2 , whereas si are not unbiased estimators of σ (see
the review exercises). Be careful when averaging biased estimators! It may well be appropriate
to make a bias-correction before averaging.

4.4 Estimation by the method of moments


Now that we have a better understanding about some factors that may play a role in determining
the quality of an estimator, we would like to know about some methods of finding estimators.
We will focus on finding point estimators first, i.e. for which the true value of a (function of
a) parameter is assumed to be a point. Several methods exist to compute point estimators,
including the ‘methods of moments’ and ‘maximul likelihood’ (which we will discuss in more
detail in this course), but also the ‘method of least squares’ (see the Chapter 9 on ‘Regression’),
the ‘Bayes’ method, the ‘minimum-chi-square’ method and the ‘minimum-distance’ method.
Point estimates should preferentially be accompanied by some interval about the point
estimate together with some measure of accurance that the true value of the parameter lies
within the interval. Hence, instead of making the inference of estimating the true value of the
parameter to be a point, we might make the inference of estimating that the true value of the
parameter is contained in some interval. We then speak of interval estimation, which will be the
subject of Chapter 5.

63
CHAPTER 4. ESTIMATION 4.4. ESTIMATION BY THE METHOD OF MOMENTS

4.4.1 Traditional methods of moments


The Method of Moments was first proposed near the turn of the century by the British statistician
Karl Pearson. The Method of Maximum Likelihood though goes back much further and will
be dealt with later. Both Gauss and Daniel Bernoulli made use of the technique, the latter as
early as 1777. Fisher though, in the early years of the twentieth century, was the first to make a
thorough study of the method’s properties and the procedure is often credited to him. [5]
Recall that, for a random variable X, the rth moment about the origin is μr = E(X r ) and
that for a random sample X1 , X2 , . . . , Xn , the rth sample moment about the origin is defined by

n
Mr = Xir /n, r = 1, 2, 3, . . .
i=1

and its observed value is denoted by



n
mr = xri /n.
i=1

Note that the first sample moment is just the sample mean, X.
We will first prove a property of sample moments.

Theorem 4.1. Let X1 , X2 , . . . , Xn be a random sample of X. Then

E(Mr ) = μr , r = 1, 2, 3, . . .

Proof.  
1 
n
1
n
1 
n
E(Mr ) = E Xir = E(Xir ) = μ = μr .
n i=1
n i=1 n i=1 r

This theorem provides the motivation for estimation by the method of moments (with the
estimator being referred to as the method of moments estimator or MME). The sample moments,
M1 , M2 , . . ., are random variables whose means are μ1 , μ2 , . . . Since the population moments
depend on the parameters of the distribution, estimating them by the sample moments leads to
estimation of the parameters.
We will consider this method of estimation by means of 2 examples, then state the general
procedure.
Example 4.1.
In this example, the distribution only has one parameter. Given X1 , X2 , . . . , Xn is a random sample from a
U (0, θ) distribution, find the method of moments estimator (MME) of θ.

Solution of Example 4.1. Now, for the uniform distribution (f (x) = 1θ I[0,θ] (x)),
 θ
1
μ = E(X) = x × dx
0 θ
θ
=
2

64
CHAPTER 4. ESTIMATION 4.4. ESTIMATION BY THE METHOD OF MOMENTS

θ
Using the Method of Moments we proceed to estimate μ = 2
by m1 . Thus since m1 = x we have

θ
=x
2
and
θ = 2x.
Then, θ = 2X and the MME of θ is 2X.

Computer Exercise 4.1. Generate 100 samples of size 10 from a uniform distribution, U (0, θ)
with θ = 10. Estimate the value of θ from your samples using the method of moments and plot
the results. Comment on the results.
In this exercise, we know a priori that θ = 10 and have generated random samples. The
samples are analysed as if θ is unknown and estimated by the method of moments. Then we can
compare the estimates with the known value.

Solution of Computer Exercise 4.1.

#_________ UniformMoment.R _____


theta <- 10
sampsz <- 10
nsimulations <- 100
theta.estimates <- numeric(nsimulations)

for (i in 1:nsimulations){
ru <- runif(n=sampsz,min=0,max=theta)
Xbar <- mean(ru)
theta.estimates[i] <- 2*Xbar
} # end of the i loop

plot(density(theta.estimates))

(You should do the exercise and obtain a plot for yourself.)


It should be clear from the plot that about 50% of the estimates are greater than 10 which is
outside the parameter space for a U (0, 10) distribution. This is undesirable.
Example 4.2.
In this example the distribution has two parameters. Given X1 , . . . , Xn is a random sample from the N (μ, σ 2 )
distribution, find the method of moments estimates of μ and σ 2 .

Solution of Example 4.2. For the normal distribution, E(X) = μ and E(X 2 ) = σ 2 + μ2
(Thm. 2.2 of [5]).

65
CHAPTER 4. ESTIMATION 4.4. ESTIMATION BY THE METHOD OF MOMENTS

Using the Method of Moments: Equate E(X) to m1 and E(X 2 ) to m2 so that, μ = x and
σ 2 + μ2 = m2 . that is, estimate μ by x and estimate σ 2 by m2 − x2 . Then,
1 2
μ = x, and σ 2 = xi − x2 .
n

The latter can also be written as σ 2 = n1 ni=1 (xi − x)2 .
Computer Exercise 4.2. Generate 100 samples of size 10 from a normal distribution with
μ = 14 and σ = 4. Estimate μ and σ 2 from your samples using the method of moments. Plot
the estimated values of μ and σ 2 . Comment on your results.
Solution of Computer Exercise 4.2.

#_____________NormalMoments.R ___________
mu <- 14
sigma <- 4
sampsz <- 10
nsimulations <- 100
mu.estimates <- numeric(nsimulations)
var.estimates <- numeric(nsimulations)
for (i in 1:nsimulations){
rn <- rnorm(mean=mu,sd=sigma,n=sampsz)
mu.estimates[i] <- mean(rn)
var.estimates[i] <- mean( (rn -mean(rn))^2 )
} # end of i loop

plot(density(mu.estimates))
plot(density(var.estimates))

The plot you obtain for the row means should be centred around the true mean of 14. However,
you will notice that the plot of the variances is not centred about the true variance of 16 as you
would like. Rather it will appear to be centred about a value less than 16. The reason for this
will become evident when we study the properties of MME estimators in more detail.

General Procedure
Let X1 , X2 , . . . , Xn be a random sample from F (x : θ1 , . . . , θk ). That is, suppose that there are
k parameters to be estimated. Let μr , mr (r = 1, 2, . . . , k) denote the first k population and
sample moments respectively, and suppose that each of these population moments are certain
known functions of the parameters. That is,
μ1 = g1 (θ1 , . . . , θk ),
μ2 = g2 (θ1 , . . . , θk ),
..
.

μk = gk (θ1 , . . . , θk ).

66
CHAPTER 4. ESTIMATION 4.5. PROPERTIES OF AN ESTIMATOR

Solving simultaneously the set of equations,

μr = gr (θ1 , . . . , θk ) = mr , r = 1, 2, . . . , k

gives the required estimates θ1 , . . . , θk .

4.4.2 Generalized methods of moments


Sometimes, methods of moments estimation in non-conclusive in deriving estimates. We will
show this in the context of estimating the slope of a line through the origin (Chapter 9). The
Generalized Methods of Moments provides an alternative. It offers an estimation strategy when
the number of restricted moments exceeds the number of parameters to be estimated.

4.5 Properties of an estimator


4.5.1 Unbiasedness
Recall the definition of an unbiased estimator: Definition 4.2. From a frequentist point of view,
this is a very desirable property for an estimator.

Problem 4.2. Let X have a binomial distribution with parameters n and θ. Show that the
sample proportion θ̂ = X/n is an unbiased estimate of θ.
Solution. X ∼ Bin(n, θ) ⇒ E(X) = nθ. Then E(θ̂) = E(X/n) = E(X)/n = nθ/n = θ. As
E(θ̂) = θ, the estimator θ̂ is unbiased.

Problem 4.3. Let X1 , . . . , Xn be independent and identically distributed with density


 −(x−θ)
e , for x > θ;
f (x|θ) =
0, otherwise.

Show that θ̂ = X̄ = (X1 + · · · + Xn )/n is a biased estimator of θ. Propose an unbiased estimator


θ̃ of the form θ̃ = θ̂ + c.
∞ ∞
Solution. E(X) = θ xe−(x−θ) dx = −xe−(x−θ) + e−(x−θ) dx θ = −(x + 1)e−(x−θ) |∞ θ = θ + 1.
Next, E(θ̂) = E(X̄) = n E(X1 + X2 + · · · + Xn ) = θ + 1 = θ ⇒ θ̂ is biased. Propose θ̃ = X̄ − 1.
1

Then E(θ̃) = E(X̄) − 1 = θ + 1 − 1 = θ and θ̃ is unbiased.

4.5.2 Trading off Bias and Variance


4.5.2.1 Mean-Squared Error
Although desirable from a frequentist standpoint, unbiasedness is not a property that helps us
choose between estimators. To do this we must examine some measure of loss like the mean
squared error.

67
CHAPTER 4. ESTIMATION 4.5. PROPERTIES OF AN ESTIMATOR

Definition 4.6 (Mean squared error). The mean squared error of the estimator θ̂ is defined
as MSE(θ̂) = E(θ̂ − θ)2 . Given the same set of data, θ̂1 is “better than θ̂2 if M SE(θ̂1 ) ≤ MSE(θ̂2 )
(uniformly better if true ∀ θ).

Lemma 4.4 (The MSE variance-bias tradeoff). The MSE decomposes as

MSE(θ̂) = Var(θ̂) + Bias(θ̂)2 .

Proof. The problem of finding minimum MSE estimators cannot be solved uniquely:

MSE(θ̂) = E(θ̂ − θ)2


= E{ [ θ̂ − E(θ̂) ] + [ E(θ̂) − θ ]}2
= E[θ̂ − E(θ̂)]2 + E[E(θ̂) − θ]2
+2 E [θ̂ − E(θ̂)][E(θ̂) − θ]
  
=0

= E[θ̂ − E(θ̂)] + E[E(θ̂) − θ]2


2

= Var(θ̂) + [E(θ̂) − θ]2


  
2
Bias(θ̂)

Note : This lemma implies that the mean squared error of an unbiased estimator is equal to
the variance of the estimator.

Problem 4.5. Consider


n X1 , . . . , Xn where Xi ∼ N(θ, σ 2 ) and σ is known. Three estimators of
1
θ are θ̂1 = X̄ = n i=1 Xi , θ̂2 = X1 , and θ̂3 = (X1 + X̄)/2. Pick one.

Solution. E(θ̂1 ) = n1 [E(X1 ) + · · · + E(Xn )] = n1 [θ + · · · + θ] = n1 [nθ] = θ, (unbiased).


 n+1 
Next E( θ̂ 2 ) = E(X 1 ) = θ, (unbiased).  Finally E( θ̂ 3 ) = 1
2
E n
X 1 + 1
n
[X 2 + · · · + X n ] =
1 n+1
2 n
E(X 1 ) + 1
n
[E(X 2 ) + · · · + E(X n )] = 2
{
1 n+1
n
θ + n−1
n
θ} = θ, (unbiased). All three
estimators are unbiased. For a class of estimators that are unbiased, the mean squared error
will be equal to the estimation variance. Calculate Var(θ̂1 ) = n12 [Var(X1 ) + · · · + Var(Xn )] =
1
n2
[σ 2 + · · · + σ 2 ] = n12 [nσ 2 ] = n1 σ 2 . Trivially Var(θ̂2 ) = Var(X1 ) = σ 2 . Finally Var(θ̂3 ) =
(σ 2 /n + σ 2 )/4 + 2Cov(X̄, X1 ). So X̄ appears “best” in the sense that Var(θ̂) is smallest among
these three unbiased estimators.

Problem 4.6. Consider X1 , . . . , Xn to be independent random variables with means E(Xi ) = μ+


βi and variances Var(Xi ) = σi2 . Such a situation could arise when Xi are estimators of μ obtained
from independent sources and βi is the bias of the estimator Xi . Consider pooling the estimators
of μ into a common estimator using the linear combination μ̂ = w1 X1 + w2 X2 + · · · + wn Xn .

(i) If the estimators are unbiased, show that μ̂ is unbiased if and only if wi = 1.

(ii) In the case when the estimators are unbiased, show that μ̂ has minimum variance when
the weights are inversely proportional to the variances σi2 .

(iii) Show that the variance of μ̂ for optimal weights wi is Var(μ̂) = 1/ i σi−2 .

68
CHAPTER 4. ESTIMATION 4.5. PROPERTIES OF AN ESTIMATOR

(iv) Consider the case when estimators may be biased. Find the mean square error of the
optimal linear combination obtained above, and compare its behaviour as n → ∞ in the
biased and unbiased case, when σi2 = σ 2 , i = 1, . . . , n.
  
Solution. E(μ̂) =  E(w 1 X 1 + · · · + w n X n ) = i w i E(X i ) = i w i μ = μ i wi so
 μ̂ is unbiased
2 2
if and only if i wi = 1. The variance of our  estimator is Var(μ̂) = i wi σi , which
shouldbe minimized  subject to the constraint i wi = 1. Differentiating the Lagrangian
L = w 2 2
σ
i i i − λ ( w
i i − 1) with respect
 to w i and setting equal to zero yields 2wi σi2 =
λ ⇒ wi ∝ σi−2 so that wi = σi−2 /( j σj−2 ). Then, for optimal weights we get Var(μ̂) =
 2 2  −4 2  −2 2  −2
i w i σi = ( i σi σi )/( i σi ) = 1/( i σi ). When
 σi2 = σ 2 we have that Var(μ̂) = σ 2 /n
which tends to zero for n → ∞ whereas bias(μ̂) = βi /n = β̄ is equal to the average bias and
MSE(μ̂) = σ 2 /n + β̄ 2 . Therefore the bias tends to dominate the variance as n gets larger, which
is very unfortunate.

Problem 4.7. Let X1 , . . . , Xn be an independent sample of size n from the uniform distribution
on the interval (0, θ), with density for a single observation being f (x|θ) = θ−1 for 0 < x < θ and
0 otherwise, and consider θ > 0 unknown.

(i) Find the expected value and variance of the estimator θ̂ = 2X̄.

(ii) Find the expected value of the estimator θ̃ = X(n) , i.e. the largest observation.

(iii) Find an unbiased estimator of the form θ̌ = cX(n) and calculate its variance.

(iv) Compare the mean square error of θ̂ and θ̌.

Solution. θ̂ has E(θ̂) = E(2X̄) = n2 [E(X1 )+· · ·+E(Xn )] = n2 [(θ/2)+· · ·+(θ/2)] = n2 [n(θ/2)] = θ
(unbiased), and Var(θ̂) = Var(2X̄) = n42 [Var(X1 ) + · · · + 
Var(Xn )] = n42 [(θ2 /12) + · · · + (θ2 /12)] =
4 n 2
n2 12
θ = 3n θ . Let U = X(n) , we then have P (U ≤ u) = ni P (Xi ≤ u) = (u/θ)n for 0 < u < θ so
1 2

differentiation yields that U has density f (u|θ) = nun−1 θ−n for 0 < u < θ. Direct integration now
n
yields E(θ̃) = E(U ) = n+1 θ (a biased estimator). The estimator θ̌ = n+1 n
U is unbiased. Direct
2 n 2 n 2 1
integration gives E(U ) = n+2 θ so Var(θ̃) = Var(U ) = (n+2)(n+1)2 θ and Var(θ̌) = n(n+2) θ2 . As
θ̂ and θ̌ are both unbiased estimators MSE(θ̂) = Var(θ̂) and MSE(θ̌) = Var(θ̌). Clearly the mean
square error of θ̂ is very large compared to the mean square error of θ̌.

4.5.2.2 Minimum-Variance Unbiased

Getting a small MSE often involves a tradeoff between variance and bias. By not insisting on θ̂
being unbiased, the variance can sometimes be drastically reduced. For unbiased estimators,
the MSE obviously equals the variance, MSE(θ̂) = Var(θ̂), so no tradeoff can be made. One
approach is to restrict ourselves to the subclass of estimators that are unbiased and minimum
variance.

Definition 4.7 (Minimum-variance unbiased estimator). If an unbiased estimator of g(θ)


has minimum variance among all unbiased estimators of g(θ) it is called a minimum variance
unbiased estimator (MVUE).

69
CHAPTER 4. ESTIMATION 4.5. PROPERTIES OF AN ESTIMATOR

We will develop a method of finding the MVUE when it exists. When such an estimator
does not exist we will be able to find a lower bound for the variance of an unbiased estimator in
the class of unbiased estimators, and compare the variance of our unbiased estimator with this
lower bound.

Definition 4.8 (Score function). For the (possibly vector valued) observation X = x to be
informative about θ, the density must vary with θ. If f (x|θ) is smooth and differentiable, this
change is quantified to first order by the score function
∂ f  (x|θ)
S(θ) = ln f (x|θ) ≡ .
∂θ f (x|θ)
Under suitable regularity conditions (differentiation wrt θ and integration wrt x can be
interchanged), we have
  
f (x|θ)
E{S(θ)} = f (x|θ)dx = f  (x|θ)dx ,
f (x|θ)
 
∂ ∂
= f (x|θ)dx = 1 = 0.
∂θ ∂θ
Thus the score function has expectation zero. Using the central limit theorem, the score statistic
has asymptotically a normal distribution, with mean zero. The variance will be derived below.

True frequentism evaluates the properties of estimators based on their “long-run” behaviour.
The value of x will vary from sample to sample so we have treated the score function as a
random variable and looked at its average across all possible samples.

Lemma 4.8 (Fisher information). The variance of S(θ) is the expected Fisher information
about θ  2 

E(I(θ)) = E{S(θ)2 } ≡ E ln f (x|θ)
∂θ

Proof. Using the chain rule


 
∂2 ∂ 1 ∂f
ln f =
∂θ2 ∂θ f ∂θ
 2
1 ∂f 1 ∂ 2f
= − 2 +
f ∂θ f ∂θ2
 2
∂ ln f 1 ∂ 2f
= − +
∂θ f ∂θ2
If integration and differentiation can be interchanged
 2   
1∂ f ∂ 2f ∂2 ∂2
E = dx = dx = 1 = 0,
f ∂θ2 X ∂θ
2 ∂θ2 X ∂θ2
thus    2 
∂2 ∂
−E ln f (x|θ) = E ln f (x|θ) = E(I(θ)). (4.5.1)
∂θ2 ∂θ

70
CHAPTER 4. ESTIMATION 4.5. PROPERTIES OF AN ESTIMATOR

Variance measures lack of knowledge. Reasonable that the reciprocal of the variance should
be defined as the amount of information carried by the (possibly vector valued) observation x
about θ.
Theorem 4.9 (Cramér Rao lower bound). Let θ̂ be an unbiased estimator of θ. Then

Var(θ̂) ≥ { E(I(θ)) }−1 .

Proof. Unbiasedness, E(θ̂) = θ, implies



θ̂(x)f (x|θ)dx = θ.

Assume we can differentiate wrt θ under the integral, then




θ̂(x)f (x|θ)dx = 1.
∂θ

The estimator θ̂(x) can’t depend on θ, so




θ̂(x) {f (x|θ)dx} = 1.
∂θ
For any pdf f ,
∂f ∂
=f (ln f ) ,
∂θ ∂θ
so that now 

θ̂(x)f (ln f ) dx = 1.
∂θ
Thus  

E θ̂(x) (ln f ) = 1.
∂θ
Define random variables
U = θ̂(x),
and

(ln f ) .
S=
∂θ
Then E (U S) = 1. We already know that the score function has expectation zero, E (S) = 0.
Consequently Cov(U, S) = E(U S) − E(U )E(S) = E(U S) = 1.

2 {Cov(U, S)}2
{Corr(U, S)} = ≤ 1
Var(U )Var(S)
Setting Cov(U, S) = 1 we get
Var(U )Var(S) ≥ 1
This implies
1
Var(θ̂) ≥
E(I(θ))
which is our main result. We call { E(I(θ)) }−1 the Cramér Rao lower bound (CRLB).

71
CHAPTER 4. ESTIMATION 4.5. PROPERTIES OF AN ESTIMATOR

Sufficient conditions for the proof of CRLB are that all the integrands are finite, within
the range of x. We also require that the limits of the integrals do not depend on θ. That is,
the range of x, here f (x|θ), cannot depend on θ. This second condition is violated for many
density functions, i.e. the CRLB is not valid for the uniform distribution. We can have absolute
assessment for unbiased estimators by comparing their variances to the CRLB. We can also
assess biased estimators. If its variance is lower than CRLB then it is indeed a very good
estimate, although it is biased.

Example 4.3.
Consider i.i.d. random variables Xi , i = 1, . . . , n, with
 
1 1
fXi (xi |μ) = exp − xi .
μ μ

Denote the joint distribution of X1 , . . . , Xn by


n  n  n

 1 1
f= fXi (xi |μ) = exp − xi ,
i=1
μ μ i=1

so that
n
1
ln f = −n ln(μ) − xi .
μ i=1
The score function is the partial derivative of ln f wrt the unknown parameter μ,
n
∂ n 1 
S(μ) = ln f = − + 2 xi
∂μ μ μ i=1

and  n
  n 
n 1  n 1 
E {S(μ)} = E − + 2 Xi = − + 2E Xi
μ μ i=1 μ μ i=1

For X ∼ Exp(1/μ), we have E(X) = μ implying E(X1 + · · · + Xn ) = E(X1 ) + · · · + E(Xn ) = nμ and E {S(μ)} = 0
as required.
  n

∂ n 1 
E(I(μ)) = −E − + 2 Xi
∂μ μ μ i=1
 n

n 2 
= −E − 3 Xi
μ2 μ i=1
 n 
n 2 
= − 2 + 3E Xi
μ μ i=1

n 2nμ n
= − + 3 = 2
μ2 μ μ
Hence
μ2
CRLB = .
n
Let us propose μ̂ = X̄ as an estimator of μ. Then
 n   n 
1 1 
E(μ̂) = E Xi = E Xi = μ,
n i=1 n i=1

72
CHAPTER 4. ESTIMATION 4.5. PROPERTIES OF AN ESTIMATOR

verifying that μ̂ = X̄ is indeed an unbiased estimator of μ. For X ∼ Exp(1/μ), we have E(X) = μ and
Var(X) = μ2 , implying
n
1  nμ2 μ2
Var(μ̂) = 2 Var(Xi ) = 2 = .
n i=1 n n
−1
We have therefore shown that Var(μ̂) = { E(I(θ)) } , and therefore conclude that the unbiased estimator μ̂ = x̄
achieves its CRLB.

4.5.3 Efficiency
Definition 4.9 (Efficiency). Define the efficiency of the unbiased estimator θ̂ as

CRLB
eff(θ̂) = ,
Var(θ̂)

where CRLB = { E(I(θ)) }−1 . Clearly 0 < eff(θ̂) ≤ 1. An unbiased estimator θ̂ is said to be
efficient if eff(θ̂) = 1.

Definition 4.10 (Asymptotic efficiency). The asymptotic efficiency of an unbiased estimator


θ̂ is the limit of the efficiency as n → ∞. An unbiased estimator θ̂ is said to be asymptotically
efficient if its asymptotic efficiency is equal to 1.

Let θ1 and θ2 be 2 unbiased estimators of θ with variances V ar(θ1 ), V ar(θ2 ) respectively.
We can then say that θ1 is more efficient than θ2 if

V ar(θ1 ) < V ar(θ2 ).

That is, θ1 is more efficient than θ2 if it has a smaller variance.

Definition 4.11 (Relative Efficiency). The relative efficiency of θ2 with respect to θ1 is
defined as V ar(θ1 )/V ar(θ2 )

It will now be useful to indicate that the estimator is based on a sample of size n by denoting
it by θn .

4.5.4 Consistency
In the previous subsections, we defined unbiasedness and mean-squared error of an estimator.
Both concepts were defined on a fixed sample size. In this subsection we will define two concepts
that are defined for increasing sample size.

Definition 4.12 (Mean-squared-error Consistency). Let θ1 , θ2 , . . . , θn , . . . be a sequence of


estimators of θ, where each estimator θn based on a sample of size n. This sequence of estimators
is defined to be a mean-squared-error consistent sequence of estimators of θ if and only if

n − θ)2 ] = 0, for all θ


lim E[(theta (4.5.2)
n→∞

73
CHAPTER 4. ESTIMATION 4.5. PROPERTIES OF AN ESTIMATOR

Mean-squared-error consistency implies that both the bias and the variance of θn approaches 0,
since the mean-squared-error in the definition can be decomposed in a variance and squared-bias
component.

Definition 4.13 (Simple (weakly) consistency). θn is a consistent estimator of θ if

lim P ( θn − θ > ) = 0for all  > 0. (4.5.3)


n→∞

We then say that θn converges in probability to θ as n → ∞. Equivalently,

lim P ( θn − θ < ) = lim P ( θn −  < θ < θn + ) = 1. (4.5.4)


n→∞ n→∞

If an estimator is a mean-squared-error consistent estimator, it is also a simple consistent


estimator, but not necessarily vice versa.
These are large-sample or asymptotic properties. Consistency has to do only with the limiting
behaviour of an estimator as the sample size increases without limit and does not imply that the
observed value of θ is necessarily close to θ for any specific size of sample n. if only a relatively
small sample is available, it would seem immaterial whether a consistent estimator is used or
not.
The following theorem (which will not be proven) gives a method of testing for consistency.

Theorem 4.10. If, limn→∞ E(θn ) = θ and limn→∞ V ar(θn ) = 0, then θn is a consistent
estimator of θ.

4.5.5 Loss and Risk Functions


The concept of a mean-squared-error can be seen as a measure of how ‘close’ an estimate is to
the ‘truth’. Using the language of ‘decision theory’, in which one wants to make a ‘decision’
about an estimate, on e might call the value of some estimator θ = H(X1 , . . . , Hn ) a decision
and call the estimator itself a decision function, since it tells us what decision to make. The
estimate itself may be in error. If so, some measure of severity of the error seems appropriate.
The word ‘loss’ is used in place of ‘error’, and ‘loss function’ is used as a measure of ‘error’. A
formal definition follows.

Definition 4.14 (Loss function). Consider estimating θ. Let H(x1 , . . . , xn ) denote an


estimate of θ. The loss function denoted by l(H(x1 , . . . , xn ); θ) is defined to be a real-valued
function satisfying
(i) l(H(x1 , . . . , xn ); θ) ≥ 0 for all possible estimates H(x1 , . . . , xn ) and all allowable θ
(ii) l(H(x1 , . . . , xn ); θ) = 0 for H(x1 , . . . , xn ) = θ
The function l(H(x1 , . . . , xn ); θ) equals the loss incurred if one estimates the true parameter to
be θ

One popular loss function is the squared error loss function, defined as (θ − θ)2 .

Definition 4.15 (Risk function). For a given loss function l(.; .) the risk function denoted
by Rl (θ) of an estimator θ is defined to be Rl (θ) = E(l(θ;
 θ))

74
CHAPTER 4. ESTIMATION 4.6. SUFFICIENCY

The risk function is the average loss. For the squared error loss function, the risk function is
the familiar mean-squared-error.

4.6 Sufficiency
The final concept of sufficiency requires some explanation before a formal definition is
given. The random sample X1 , X2 , . . . , Xn drawn from the distribution with F (x; θ) contains
information about the parameter θ. To estimate θ, this sample is first condensed to a single
random variable by use of a statistic θ∗ = H(X1 , X2 , . . . , Xn ). The question of interest is whether
any information about θ has been lost by this condensing process. For example, a possible choice
of θ∗ is H(X1 , . . . , Xn ) = X1 in which case it seems that some of the information in the sample
has been lost since the observations X2 , . . . , Xn have been ignored. In many cases, the statistic
θ∗ does contain all the relevant information about the parameter θ that the sample contains.

Definition 4.16 (Sufficiency). Let X1 , X2 , . . . , Xn be a random sample from F (x; θ) and let
θ = H(X1 , X2 , . . . , Xn ) be a statistic (a function of the Xi only). Let θ = H  (X1 , X2 , . . . , Xn )

be any other statistic which is not a function of θ∗ . If for each of the statistics θ , the conditional
density of θ given θ∗ does not involve θ, then θ∗ is called a sufficient statistic for θ. That is,
f (θ |θ∗ ) does not contain θ, then θ∗ is sufficient for θ.

You should think of sufficiency in the sense of using all the relevant information in the
sample. For example, to say that x is sufficient for μ in a particular distribution means that
knowledge of the actual observations x1 , x2 , . . . , xn gives us no more information about μ than
does only knowing the average of the n observations.
Example 4.4.
T = t(X) = X̄ is sufficient for μ when Xi ∼ i.i.d. N(μ, σ 2 ).

To better understand the motivation behind the concept of sufficiency consider three
independent Binomial trials where θ = P (X = 1).
Event Probability Set
0 0 0 (1 − θ)3 A0
1 0 0
0 1 0 θ(1 − θ)2 A1
0 0 1

0 1 1
1 0 1 θ2 (1 − θ) A2
1 1 0

1 1 1 θ3 A3
Knowing which Ai the sample is in carries all the information about θ. Which particular
sample within Ai gives us no extra information about 
θ. Extra information about other aspect of
the model maybe, but not about θ. Here T = t(X) = Xi equals the number of “successes, and
identifies Ai . Mathematically we can express the above concept by saying that the probability

75
CHAPTER 4. ESTIMATION 4.6. SUFFICIENCY

P (X = x|Ai ) does not depend on θ. i.e. P (010|A1 ; θ) = 1/3. More generally, a statistic T = t(X)
is said to be sufficient for the parameter θ if Pr (X = x|T = t) does not depend on θ. Sufficient
statistics are most easily recognized through the following fundamental result:

Theorem 4.11 (Neyman’s Factorization Criterion). A statistic T = t(X) is sufficient for θ


if and only if the family of densities can be factorized as
f (x; θ) = h(x)k {t(x); θ} , x ∈ X , θ ∈ Θ. (4.6.1)
i.e. into a function which does not depend on θ and one which only depends on x through t(x).
This is true in general. We will prove it in the case where X is discrete.
Example 4.5 (Poisson).
Let X = (X1 , . . . , Xn ) be independent and Poisson distributed with mean λ so that the joint density
n
 λx i λΣxi −nλ
f (x; λ) = e−λ =  e .
i=1
xi ! i xi !
  −1 
Take k { xi ; λ} = λΣxi e−nλ and h(x) = ( xi !) , then t(x) = i xi is sufficient.

Example 4.6 (Binomial).


Let X = (X1 , . . . , Xn ) be independent and Bernoulli distributed with parameter θ so that
n

f (x; θ) = θxi (1 − θ)1−xi = θΣxi (1 − θ)n−Σxi
i=1
 Σxi n−Σxi

Take k { xi ; θ} = θ (1 − θ) and h(x) = 1, then t(x) = i xi is sufficient.

Example 4.7 (Uniform).


Factorization criterion works in general but can mess up if the pdf depends on θ. Let X = (X1 , . . . , Xn ) be
independent and Uniform distributed with parameter θ so that X1 , X2 , . . . , Xn ∼ Unif(0, θ). Then
1
f (x; θ) = 0 ≤ xi ≤ θ ∀ i.
θn
It is not at all obvious but t(x) = max(xi ) is a sufficient statistic. We have to show that f (x|t) is independent
of θ. Well
f (x, t)
f (x|t) = .
fT (t)
Then
P (T ≤ t) = P (X1 ≤ t, . . . , Xn ≤ t)
n  n
t
= P (Xi ≤ t) =
i=1
θ

So
tn ntn−1
FT (t) = ⇒ fT (t) =
θn θn
Also
1
f (x, t) = ≡ f (x; θ).
θn
Hence
1
f (x|t) = ,
ntn−1
and is independent of θ.

76
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

4.7 The Likelihood approach


4.7.1 Maximum likelihood estimation
Unless stated otherwise, the material in this section is mostly extracted from [8].
Let x be a realization of the random variable X with probability density fX (x|θ) where
θ = (θ1 , θ2 , . . . , θm )T is a vector of m unknown parameters to be estimated. The set of allowable
values for θ, denoted by Φ, is called the parameter space. Define the likelihood function

L(θ|x) = fX (x|θ). (4.7.1)

It is crucial to stress that the argument of fX (x|θ) is x, but the argument of L(θ|x) is θ. It is
therefore convenient to view the likelihood function L(θ) as the probability of the observed data
x considered as a function of θ.
As explained in [5], the term likelihood of the sample needs to be defined, and this has
to be done separately for discrete and continuous distributions.

Definition 4.17 ([5]). Let x1 , x2 , . . . , xn be sample observations taken on the random


variables X1 , X2 , . . . , Xn . Then the likelihood of the sample, L(θ|x1 , x2 , . . . , xn ), is defined as:

1. the joint probability of x1 , x2 , . . . , xn if X1 , X2 , . . . , Xn are discrete, and

2. the joint probability density function of X1 , . . . , Xn evaluated at x1 , x2 , . . . , xn if the


random variables are continuous.

The likelihood function for a set of n identically and independently distributed (i.i.d.)
random variables, X1 , X2 , . . . , Xn , can thus be written as:

P (X1 = x1 ) · P (X2 = x2 ) · · · P (Xn = xn ) for X discrete,
L
 (θ; x1 , . . . , xn ) = . (4.7.2)
f (x1 ; θ) · f (x2 ; θ) · · · f (xn ; θ) for X continuous

For the discrete case, L(θ; x1 , . . . , xn ) is the probability (or likelihood) of observing (X1 =
x1 , X2 = x2 , . . . , Xn = xn ). It would then seem that a sensible approach to selecting an
estimate of θ would be to find the value of θ which maximizes the probability of observing
(X1 = x1 , X2 = x2 , . . . , Xn = xn ), (the event which occured). [5]
More generally, when the random continuous variables X1 , . . . , Xn are mutually independent
we can write the joint density as
n
fX (x) = fXj (xj )
j=1

where x = (x1 , . . . , xn ) is a realization of the random vector X = (X1 , . . . , Xn ) , and the




likelihood function becomes


n
LX (θ|x) = fXj (xj |θ).
j=1

Note that we have used a bold θ to indicate that it may be a vector, although we will not be
consistent in using different notations for vectors or single parameters. When the densities
fXj (xj ) are identical, we can unambiguously write f (xj ).

77
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

Usually it is convenient to work with the natural logarithm of the likelihood called the
log-likelihood, denoted by
(θ|x) = ln L(θ|x).
When θ ∈ R1 we can define the score function as the first derivative of the log-likelihood

S(θ) = ln L(θ),
∂θ
as seen before.
The maximum likelihood estimate (MLE) θ̂ of θ is the solution to the score equation
S(θ) = 0.
Put simply [5], the maximum likelihood estimate (MLE) of θ is that value of θ which
maximizes the likelihood. To state it more mathematically, the MLE of θ is that value of θ, say
θ such that
 x1 , . . . , xn ) > L(θ ; x1 , . . . , xn )
L(θ;
where θ is any other value of θ.
At the maximum [8], the second partial derivative of the log-likelihood is negative, so we
define the curvature at θ̂ as I(θ̂) where
∂2
I(θ) = − ln L(θ).
∂θ2
We can check that a solution θ̂ of the equation S(θ) = 0 is actually a maximum by checking that
I(θ̂) > 0. A large curvature I(θ̂) is associated with a tight or strong peak, intuitively indicating
less uncertainty about θ. In likelihood theory I(θ) is a key quantity called the observed Fisher
information, and I(θ̂) is the observed Fisher information evaluated at the MLE θ̂. Although
I(θ) is a function, I(θ̂) is a scalar. Note the difference with the concept of ‘expected Fisher
information’ (See 4.6).
The likelihood function L(θ|x) supplies an order of preference or plausibility among possible
values of θ based on the observed x. It ranks the plausibility of possible values of θ by how
probable they make the observed x. If P (x|θ = θ1 ) > P (x|θ = θ2 ) then the observed x makes
θ = θ1 more plausible than θ = θ2 , and consequently from (4.7.1), L(θ1 |x) > L(θ2 |x). The
likelihood ratio L(θ1 |x)/L(θ2 |x) = f (θ1 |x)/f (θ2 |x) is a measure of the plausibility of θ1 relative
to θ2 based on the observed data. The relative likelihood L(θ1 |x)/L(θ2 |x) = k means that the
observed value x will occur k times more frequently in repeated samples from the population
defined by the value θ1 than from the population defined by θ2 . Since only ratios of likelihoods
are meaningful, it is convenient to standardize the likelihood with respect to its maximum.
Define the relative likelihood as R(θ|x) = L(θ|x)/L(θ̂|x). The relative likelihood varies
between 0 and 1. The MLE θ̂ is most plausible value of θ in that it makes the observed sample
most probable. The relative likelihood measures the plausibility of any particular value of θ
relative to that of θ̂.

Comments [5]
1. It is customary to use θ to denote both estimator (random variable) and estimate (its
observed value). Recall that we used θ for the MME.

78
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

2. Since L(θ; x1 , x2 , . . . , xn ) is a product, and sums are usually more convenient to deal
with than products, it is customary to maximize log L(θ; x1 , . . . , xn ) which we usually
abbreviate to l(θ). This has the same effect. Since log L is a strictly increasing function of
L, it will take on its maximum at the same point.

3. In some problems, θ will be a vector in which case L(θ) has to be maximized by


differentiating with respect to 2 (or more) variables and solving simultaneously 2 (or
more) equations.

4. The method of differentiation to find a maximum only works if the function concerned
actually has a turning point.

4.7.2 Properties of MLE


The following four properties are the main reasons for recommending the use of Maximum
Likelihood Estimators.

1. The MLE is consistent.

2. The MLE has a distribution that tends to normality as n → ∞.

3. If a sufficient statistic for θ exists, then the MLE is sufficient.

4. The MLE is invariant under functional transformations. That is, if θ = H(X1 , X2 , . . . , Xn )


 is the MLE
is the MLE of θ and if u(θ) is a continuous monotone function of θ, then u(θ)
of u(θ). This is known as the invariance property of MLEs (see also 4.7.3).
For example, in the normal distribution where the ! mean is μ and the variance is σ 2 ,
(n − 1)S 2 /n is the MLE of σ 2 , so the MLE of σ is (n − 1)S 2 /n.

Furthermore, suppose that an experiment consists of measuring random variables x1 , x2 , . . . , xn


which are i.i.d. with probability distribution depending on a parameter θ. Let θ̂ be the MLE of
θ. Define
!
W1 = E[I(θ)](θ̂ − θ)
!
W2 = I(θ)(θ̂ − θ)
"
W3 = E[I(θ̂)](θ̂ − θ)
"
W4 = I(θ̂)(θ̂ − θ).

Then, W1 , W2 , W3 , and W4 are all random variables and, as n → ∞, the probabilistic behaviour
of each of W1 , W2 , W3 , and W4 is well approximated by that of a N (0, 1) random variable (see
further below to see a hint of the proof). Then, since E[W1 ] ≈ 0, we have that E[θ̂] ≈ θ and so
θ̂ is approximately unbiased. Also Var[W1 ] ≈ 1 implies that Var[θ̂] ≈ (E[I(θ)])−1 and so θ̂ is
approximately efficient.
Let the data X have probability distribution g(X; θ ) where θ = (θ1 , θ2 , . . . , θm ) is a vector
of m unknown parameters. Let I(θθ ) be the m × m information matrix as defined above and
let E[I(θθ )] be the m × m matrix obtained by replacing the elements of I(θθ ) by their expected

79
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

values. Let θ̂θ be the MLE of θ . Let CRLB √ r be the rth diagonal element of [E[I(θ θ )]]−1 . For
r = 1, 2, . . . , m, define W1r = (θ̂r − θr )/ CRLBr . Then, as n → ∞, W1r behaves like a standard
normal random variable.
Suppose we define W2r by replacing CRLBr by the rth diagonal element of the matrix
[I(θθ )]−1 , W3r by replacing CRLBr by the rth diagonal element of the matrix [EI(θ̂θ )]−1 and W4r
by replacing CRLBr by the rth diagonal element of the matrix [I(θ̂θ )]−1 . Then it can be shown
that as n → ∞, W2r , W3r , and W4r all behave like standard normal random variables. These
results will become very handy when developing test statistics (Chapter refchap:testing )
A detailed proof of the assertion that Wi above follow approximately a standard normal
distribution is beyond the scope of this course. However, in what follows, we give a hint of how
the proof would go. Suppose that θ is the MLE of θ. We then apply the Taylor expansion of
 at the point θ0 , yielding
S(θ)
 ≈ l (θ0 ) + (θ − θ0 )l (θ0 ).
0 = l (θ)
 

Therefore,

√ −n−1/2 l (θ0 )
n(θ − θ0 ) ≈
n−1 l (θ0 )
. Based on properties we already saw, it can then be shown that the mean of the numerator is 0
and its variance is E(I(θ0 )). By the law of large numbers, the expectation of the denomenator
converges to −E(I(θ0 )). We thus have

√ n−1/2 l (θ0 )

n(θ − θ0 ) ≈ .
E(I(θ0 ))
Consequently, √
E( n(θ − θ0 )) ≈ 0,
and
√ 1
Var( n(θ − θ0 )) ≈ .
E(I(θ0 ))

4.7.3 The Invariance principle


How do we deal with parameter transformation? We will assume a one-to-one transformation,
but the idea applied generally. Consider a binomial sample with n = 10 independent trials
resulting in data x = 8 successes. The likelihood ratio of θ1 = 0.8 versus θ2 = 0.3 is
L(θ1 = 0.8) θ8 (1 − θ1 )2
= 18 = 208.7 ,
L(θ2 = 0.3) θ2 (1 − θ2 )2
that is, given the data θ = 0.8 is about 200 times more likely than θ = 0.3.
Suppose we are interested in expressing θ on the logit scale as
ψ ≡ ln{θ/(1 − θ)} ,
then ‘intuitively’ our relative information about ψ1 = ln(0.8/0.2) = 1.29 versus ψ2 =
ln(0.3/0.7) = −0.85 should be
L∗ (ψ1 ) L(θ1 )

= = 208.7 .
L (ψ2 ) L(θ2 )

80
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

That is, our information should be invariant to the choice of parameterization. ( For the
purposes of this example we are not too concerned about how to calculate L∗ (ψ). )

Theorem 4.12 (Invariance of the MLE). If g is a one-to-one function, and θ̂ is the MLE
of θ then g(θ̂) is the MLE of g(θ).
Proof. This is trivially true as we let θ = g −1 (μ) then f {y|g −1 (μ)} is maximized in μ exactly
when μ = g(θ̂). When g is not one-to-one the discussion becomes more subtle, but we simply
choose to define ĝMLE (θ) = g(θ̂)

It seems intuitive that if θ̂ is most likely for θ and our knowledge (data) remains unchanged
then g(θ̂) is most likely for g(θ). In fact, we would find it strange if θ̂ is an estimate of θ, but θ̂2
is not an estimate of θ2 . In the binomial example with n = 10 and x = 8 we get θ̂ = 0.8, so the
MLE of g(θ) = θ/(1 − θ) is

g(θ̂) = θ̂/(1 − θ̂) = 0.8/0.2 = 4.

This convenient property is not necessarily true of other estimators. For example, if θ̂ is the
MVUE of θ, then g(θ̂) is generally not MVUE for g(θ).
Frequentists generally accept the invariance principle without question. This is not the case
for Bayesians. The invariance property of the likeihood ratio is incompatible with the Bayesian
habit of assigning a probability distribution to a parameter.

Examples
All the examples are coming from [8], unless stated otherwise.
Example 4.8 ([5]).
Given X is distributed bin(1, p) where p ∈ (0, 1), and a random sample x1 , x2 , . . . , xn , find the maximum likelihood
estimate of p.

Solution of Example 4.3 ([5]). The likelihood is,

L(p; x1 , x2 , . . . , xn ) = P (X1 = x1 )P (X2 = x2 ) · · · P (Xn = xn )


n  
1 xi
= p (1 − p)1−xi
i=1
x i

= px1 +x2 +···+xn (1 − p)n−x1 −x2 −···−xn


 
=p xi
(1 − p)n− xi

So  #  $
log L(p) = xi log p + n − xi log(1 − p)
Differentiating with respect to p, we have
 
log L(p) xi n − xi
d = −
dp p 1−p

81
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

  
This is equal to zero when xi (1 − p) = p (n − xi ), that is, when p = xi /n. This estimate
is denoted by p. Thus, if the random variable X is distributed bin(1, p), the MLE of p derived
from a sample of size n is
p = X. (4.7.3)

Example 4.9 (Binomial sampling).


The number of successes in n Bernoulli trials is a random variable R taking on values r = 0, 1, . . . , n with
probability mass function  
n r
P (R = r) = θ (1 − θ)n−r .
r
This is the exact same sampling scheme as in the previous example except that instead of observing the sequence
y we only observe the total number of successes r. Hence the likelihood function has the form
 
n r
LR (θ|r) = θ (1 − θ)n−r .
r
The relevant mathematical calculations are as follows:
 
n
R (θ|r) = ln + r ln(θ) + (n − r) ln(1 − θ)
r
r n−r r
S (θ) = + ⇒ θ̂ =
n 1−θ n
r n−r
I (θ) = + >0 ∀θ
θ2 (1 − θ)2
E(r) nθ
E(θ̂) = = =θ ⇒ θ̂ unbiased
n n
Var(r) nθ(1 − θ) θ(1 − θ)
Var(θ̂) = = =
n2 n2 n
E(r) n − E(r) nθ n − nθ
E [I (θ)] = 2
+ 2
= 2 +
θ (1 − θ) θ (1 − θ)2
n # $ −1
= = Var[θ̂]
θ(1 − θ)

and θ̂ attains the Cramer-Rao lower bound (CRLB).

Example 4.10 ([5]).


Given random variable X is distributed uniformly on [0, θ], find the MLE of θ based on a sample of size n.

Solution of Example 4.4 ([5]). Now f (xi ; θ) = 1/θ, xi ∈ [0, θ], i = 1, 2, . . . , n. So the
likelihood is

n
L(θ; x1 , x2 , . . . , xn ) = (1/θ) = 1/θn .
i=1
When we come to find the maximum of this function we note that the slope is not zero anywhere,
so there is no use finding dL(θ)

or d logdθL(θ) .
Note however that L(θ) increases as θ → 0. So L(θ) is maximized by setting θ equal to the
smallest value it can take. If the observed values are x1 , . . . , xn then θ can be no smaller than
the largest of these. This is because xi ∈ [0, θ] for i = 1, . . . , n. That is, each xi ≤ θ or θ ≥ each
xi .
Thus, if X is distributed U (0, θ), the MLE of θ is

θ = max(Xi ). (4.7.4)

82
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

Figure 4.7.1: L(θ) = 1/θn

Computer Exercise 4.3. Generate 100 random samples of size 10 from a U (0, 10) distribution.
For each of the 100 samples generated calculate the MME and MLE for μ and graph the results.

1. From the graphs does it appear that the estimators are biased or unbiased? Explain.

2. Estimate the variance of the two estimators by finding the sample variance of the 100
estimates (for each estimator). Which estimator appears more efficient?

Solution of Computer Exercise 4.3.

theta <- 10
sampsz <- 10
nsimulations <- 100
moment.estimates <- numeric(nsimulations)
ML.estimates <- numeric(nsimulations)

for (i in 1:nsimulations){
ru <- runif(n=sampsz,min=0,max=theta)

moment.estimates[i] <- 2*mean(ru)


ML.estimates[i] <- max(ru)
}

plot(density(moment.estimates),
xlab=" ",ylab=" ",main=" ",ylim=c(0,0.6),
las=1)
abline(v=theta,lty=3)
lines(density(ML.estimates),lty=2)
legend(11,0.5,legend=c("moment","ML"),lty=1:2,
cex=0.6)

83
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

You should see that the Method of Moments gives unbiased estimates of which many are
not in the range space as noted in Computer Example 4.2. The maximum likelihood estimates
are all less than 10 and so are biased.

Computer Exercise 4.4. Demonstrate that the MLE is consistent for estimating θ for a
U (0, θ) distribution.
Method: Generate the random variables one at a time. After each is generated calculate the
MLE of θ for the sample of size n generated to that point and so obtain a sequence of estimators,
{θn }. Plot the sequence.

Solution of Computer Exercise 4.4. Uniform random variables are generated one at a time
and θn is found as the maximum of θ%
n−1 and nth uniform rv generated. The estimates are
plotted in order.
#_________ UniformConsistency.R _____
theta <- 10
sampsz <- 10
nsimulations <- 100
ML.est <- numeric(nsimulations)
for (i in 1:nsimulations){
ru <- runif(n=sampsz,min=0,max=theta)
if(i==1) ML.est[i] <- max(ru)
else ML.est[i] <- max(ML.est[i-1],max(ru) )
}
plot(ML.est,type=’l’)
abline(h=theta,lty=2)

We can see that as n increases, θ → θ.


Example 4.11 (Bernoulli Trials).
Consider n independent Bernoulli trials. The jth observation is either a “success” or “failure” coded xj = 1 and
xj = 0 respectively, and
P (Xj = xj ) = θxj (1 − θ)1−xj
for j = 1, . . . , n. The vector of observations y = (x1 , x2 , . . . , xn )T is a sequence of ones and zeros, and is
a realization of the random vector Y = (X1 , X2 , . . . , Xn )T . As the Bernoulli outcomes are assumed to be
independent we can write the joint probability mass function of Y as the product of the marginal probabilities,
that is
n

L(θ) = P (Xj = xj )
j=1
n
= θxj (1 − θ)1−xj
j=1
 
xj
= θ (1 − θ)n− xj

r n−r
= θ (1 − θ)
n
where r = i=1 xj is the number of observed successes (1’s) in the vector y. The log-likelihood function is then

(θ) = r ln θ + (n − r) ln(1 − θ),

and the score function is


∂ r (n − r)
S(θ) = (θ) = − .
∂θ θ 1−θ

84
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

Figure 4.7.2: Relative likelihood functions for seed germinating probabilities.

Solving for S(θ̂) = 0 we get θ̂ = r/n. The information function is


r n−r
I(θ) = 2
+ >0 ∀ θ,
θ (1 − θ)2

guaranteeing that θ̂ is the MLE. Each Xi is a Bernoulli random variable and has expected value E(Xi ) = θ, and
variance Var(Xi ) = θ(1 − θ). The MLE θ̂(y) is itself a random variable and has expected value
#r$  n  n n
i=1 Xi 1 1
E(θ̂) = E =E = E (Xi ) = θ = θ,
n n n i=1 n i=1

implying that θ̂(y) is an unbiased estimator of θ. The variance of θ̂(y) is


 n  n n
i=1 Xi 1  1  (1 − θ)θ
Var(θ̂) = Var = 2
Var (X i ) = 2
(1 − θ)θ = .
n n i=1 n i=1 n

Finally, note that

E(r) n − E(r) n n n # $−1


I(θ) = E [I(θ)] = 2
+ 2
= + = = Var[θ̂] ,
θ (1 − θ) θ 1−θ θ(1 − θ)

and θ̂ attains the Cramer-Rao lower bound (CRLB).

Example 4.12 (Germinating seeds).


Suppose 25 seeds were planted and r = 5 seeds germinated. Then θ̂ = r/n = 0.2 and Var(θ̂) = 0.2 × 0.8/25 =
0.0064. The relative likelihood is  5  20
θ 1−θ
R1 (θ) = .
0.2 0.8
Suppose 100 seeds were planted and r = 20 seeds germinated. Then θ̂ = r/n = 0.2 but Var(θ̂) = 0.2 × 0.8/100 =
0.0016. The relative likelihood is  20  80
θ 1−θ
R2 (θ) = .
0.2 0.8
Suppose 25 seeds were planted and it is known only that r ≤ 5 seeds germinated. In this case the exact number of
germinating seeds is unknown. The information about θ is given by the likelihood function
5  
 25
L (θ) = P (R ≤ 5) = θr (1 − θ)25−r .
r=0
r

Here, the most plausible value for θ is θ̂ = 0, implying L(θ̂) = 1. The relative likelihood is R3 (θ) = L(θ)/L(θ̂) =
L(θ). R1 (θ) is plotted as the dashed curve in figure 4.7.2. R2 (θ) is plotted as the dotted curve in figure 4.7.2.
R3 (θ) is plotted as a solid curve in figure 4.7.2.

Example 4.13 (Prevalence of a Genotype).


Geneticists interested in the prevalence of a certain genotype, observe that the genotype makes its first appearance in
the 22nd subject analysed. If we assume that the subjects are independent, the likelihood function can be computed
based on the geometric distribution, as L(θ) = (1−θ)n−1 θ. The score function is then S(θ) = θ−1 −(n−1)(1−θ)−1 .
Setting S(θ̂) = 0 we get θ̂ = n−1 = 22−1 . The observed Fisher information equals I(θ) = θ−2 + (n − 1)(1 − θ)−2
and is greater than zero for all θ, implying that θ̂ is MLE.

85
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

Suppose that the geneticists had planned to stop sampling once they observed r = 10 subjects with the specified
genotype, and the tenth subject with the genotype was the 100th subject anaylsed overall. The likelihood of θ can
be computed based on the negative binomial distribution, as
 
n−1 r
L(θ) = θ (1 − θ)n−r
r−1

for n = 100, r = 5. The usual calculation will confirm that θ̂ = r/n is MLE.

Example 4.14 (Radioactive Decay).


In this classic set of data Rutherford and Geiger counted the number of scintillations in 72 second intervals
caused by radioactive decay of a quantity of the element polonium. Altogether there were 10097 scintillations
during 2608 such intervals:

Count 0 1 2 3 4 5 6 7
Observed 57 203 383 525 532 408 573 139
Count 8 9 10 11 12 13 14
Observed 45 27 10 4 1 0 1

The Poisson probability mass function with mean parameter θ is

θx exp (−θ)
fX (x|θ) = .
x!
The likelihood function equals

 θxi exp (−θ) θ xi
exp (−nθ)
L(θ) = =  .
xi ! xi !

The relevant mathematical calculations are

(θ) (Σxi ) ln (θ) − nθ − ln [Π(xi !)]


=

xi
S(θ) = −n

xi
⇒ θ̂ = = x̄
n
Σxi
I(θ) = > 0, ∀ θ
θ2
 
implying θ̂ is MLE. Also E(θ̂) = E(xi ) = n1 θ = θ, so θ̂ is an unbiased estimator. Next Var(θ̂) =
1
 1
n2 Var(xi ) = n θ and I(θ) = E[I(θ)] = n/θ = (Var[θ̂])−1 implying that θ̂ attains the theoretical CRLB. It is
always useful to compare the fitted values from a model against the observed values.

i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Oi 57 203 383 525 532 408 573 139 45 27 10 4 1 0 1
Ei 54 211 407 525 508 393 254 140 68 29 11 4 1 0 0
+3 −8 −24 0 +24 +15 +19 −1 −23 −2 −1 0 −1 +1 +1

The Poisson law agrees with the observed variation within about one-twentieth of its range.

86
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

Example 4.15 (Exponential distribution).


Suppose random variables X1 , . . . , Xn are i.i.d. as Exp(θ). Then
n

L(θ) = θ exp (−θxi )
i=1
#  $
= θn exp −θ xi

(θ) = n ln θ − θ xi
n
n 
S(θ) = − xi
θ i=1
n
⇒ θ̂ = 
xi
n
I(θ) = >0 ∀ θ.
θ2

In order to work out the expectation and variance of θ̂ we need to work out the probability distribution of

n
Z= Xi , where Xi ∼ Exp(θ). From the appendix on probability theory we have Z ∼ Ga(θ, n). Then
i=1

  ∞
1 1 θn z n−1 exp (−θz)
E = dz
Z z Γ(n)
0
∞
θ2
= (θz)n−2 exp (−θz)dz
Γ(n)
0
∞
θ
= un−2 exp (−u)du
Γ(n)
0
θ Γ(n − 1) θ
= = .
Γ(n) n−1

We now return to our estimator


n n
θ̂ = 
n =
Z
xi
i=1

implies  
&n' 1 n
E[θ̂] = E = nE = θ
Z Z n−1
which turns out to be biased. Propose the alternative estimator θ̃ = n−1n θ̂. Then
   
(n − 1) (n − 1) (n − 1) n
E[θ̃] = E θ̂ = E[θ̂] = θ=θ
n n n n−1

shows θ̃ is an unbiased estimator.


As this example demonstrates, maximum likelihood estimation does not automatically produce unbiased
estimates. If it is thought that this property is (in some sense) desirable, then some adjustments to the MLEs,
usually in the form of scaling, may be required. We conclude this example with the following tedious (but

87
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

straightforward) calculations.
  ∞
1 1
E = θn z n−3 exp −θzdz
Z2 Γ(n)
0
∞
θ2
= un−3 exp −θudu
Γ(n)
0
θ2 Γ(n − 2) θ2
= =
Γ(n) (n − 1)(n − 2)
# $2  
2 (n − 1)2
⇒ Var[θ̃] = E[θ̃ ] − E[θ̃] = E − θ2
Z2
(n − 1)2 θ2 θ2
= − θ2 = .
(n − 1)(n − 2) n−2

We have already calculated that


n n # $−1
I(θ) = ⇒ E [I(θ)] = 
= Var[ θ̃] .
θ2 θ2
However,
−1
(E [I(θ)]) θ2 θ2 n−2
eff(θ̃) = = ÷ =
Var[θ̃] n n−2 n

which although not equal to 1, converges to 1 as n → ∞, and θ̃ is asymptotically efficient.

Example 4.16 (Lifetime of a component).


The time to failure T of components has an exponential distributed with mean μ. Suppose n components are
tested for 100 hours and that m components failed at times t1 , . . . , tm , with n − m components surviving the 100
hour test. The likelihood function can be written
m n

1 −ti /μ
L(θ) = e P (Tj > 100) .
i=1
μ j=m+1
     
components failed components survived

Clearly P (T ≤ t) = 1 − e−t/μ implies P (T > 100) = e−100/μ is the probability of a component surviving the 100
hour test. Then
m 
1 # $n−m
−ti /μ
L(μ) = e e−100/μ ,
i=1
μ
m
1 1
(μ) = −m ln μ − ti − 100(n − m),
μ i=1 μ
m
m 1  1
S(μ) = − + 2 ti + 2 100(n − m).
μ μ i=1 μ
m
Setting S(μ̂) = 0 suggests the estimator μ̂ = [ i=1 ti + 100 (n − m)] /m. Also, I(μ̂) = m/μ̂2 > 0, and μ̂ is indeed
the MLE. Although failure times were recorded for just m components, this example usefully demonstrates that
all n components contribute to the estimation of the mean failure parameter μ. The n − m surviving components
are often referred to as right censored.

88
CHAPTER 4. ESTIMATION 4.7. THE LIKELIHOOD APPROACH

Example 4.17 (Gaussian Distribution).


Consider data X1 , X2 . . . , Xn distributed as N(μ, υ). Then the likelihood function is
⎧ n ⎫
⎪  2⎪
 n ⎪
⎨ (x i − μ) ⎪

1 i=1
L(μ, υ) = √ exp −
πυ ⎪
⎪ 2υ ⎪

⎩ ⎭

and the log-likelihood function is


n
n n 1 
(μ, υ) = − ln (2π) − ln (υ) − (xi − μ)2 (4.7.5)
2 2 2υ i=1

Unknown mean and known variance: As υ is known we treat this parameter as a constant when differentiating
wrt μ. Then
n n
1 1 n
S(μ) = (xi − μ), μ̂ = xi , and I(θ) = > 0 ∀ μ.
υ i=1 n i=1 υ

Also, E[μ̂] = nμ/n = μ, and so the MLE of μ is unbiased. Finally


 n 
1  υ −1
Var[μ̂] = 2 Var xi = = (E[I(θ)]) .
n i=1
n

Known mean and unknown variance: Differentiating (4.7.5) wrt υ returns


n
n 1 
S(υ) = − + 2 (xi − μ)2 ,
2υ 2υ i=1

and setting S(υ) = 0 implies


n
1
υ̂ = (xi − μ)2 .
n i=1
Differentiating again, and multiplying by −1 yields the information function
n
n 1 
I(υ) = − + (xi − μ)2 .
2υ 2 υ 3 i=1

Clearly υ̂ is the MLE since


n
I(υ̂) = > 0.
2υ 2
Define √
Zi = (Xi − μ)2 / υ,
so that Zi ∼ N(0, 1). From the appendix on probability
n

Zi2 ∼ χ2n ,
i=1
 
implying E[ Zi2 ] = n, and Var[ Zi2 ] = 2n. The MLE
n

υ̂ = (υ/n) Zi2 .
i=1

Then  n

υ 2
E[υ̂] = E Z = υ,
n i=1 i

89
CHAPTER 4. ESTIMATION 4.8. PROPERTIES OF SAMPLE MEAN AND SAMPLE VARIANCE

and  
# υ $2 n
 2υ 2
Var[υ̂] = Var Zi2 = .
n i=1
n
Finally,
n
n 1 
E [I(υ)] = − 2
+ 3 E (xi − μ)2
2υ υ i=1
n nυ
= − 2+ 3
2υ υ
n
= .
2υ 2
Hence the CRLB = 2υ 2 /n, and so υ̂ has efficiency 1.

Our treatment of the two parameters of the Gaussian distribution in the last example was
to (i) fix the variance and estimate the mean using maximum likelihood; and then (ii) fix the
mean and estimate the variance using maximum likelihood.
In practice we would like to consider the simultaneous estimation of these parameters. In
the next section of these notes we extend MLE to multiple parameter estimation.

4.8 Properties of Sample Mean and Sample Variance


In this section we will consider the sample mean X and the sample variance S 2 and examine
which of the above properties they have.

Theorem 4.13. Let X be a random variable with mean μ and variance σ 2 . Let X be
the sample mean based on a random sample of size n. Then X is an unbiased and consistent
estimator of μ.

Proof. Now E(X) = μ, no matter what the sample size is, and V ar(X) = σ 2 /n. The latter
approaches 0 as n → ∞, satisfying Theorem 4.10.
It can also be shown that of all linear functions of X1 , X2 , . . . , Xn , X has minimum variance.
Note that the above theorem is true no matter what distribution is sampled. Some applications
are given below.
For a random sample X1 , X2 , . . . , Xn , X is an unbiased and consistent estimator of:

1. μ when the Xi are distributed N (μ, σ 2 );

2. p when the Xi are distributed bin(1, p);

3. λ when the Xi are distributed Poisson(λ);

4. 1/α when the Xi have p.d.f. f (x) = αe−αx , x > 0.

90
CHAPTER 4. ESTIMATION 4.8. PROPERTIES OF SAMPLE MEAN AND SAMPLE VARIANCE

Sample Variance
Recall that the sample variance is defined by

n
S = 2
(Xi − X)2 /(n − 1).
i=1

Theorem 4.14. Given X1 , X2 , . . . , Xn is a random sample from a distribution with mean


μ and variance σ 2 , then S 2 is an unbiased and consistent estimator of σ 2 .

Proof.

n
(n − 1)E(S ) = E
2
(Xi − X)2
i=1
n
=E [Xi − μ − (X − μ)[2
i=1n 
 
n
=E (Xi − μ)2 − 2(X − μ) (Xi − μ) + n(X − μ)2
 i=1 i=1

 n
=E (Xi − μ)2 − 2n(X − μ)2 + n(X − μ)2
i=1

n
=E (Xi − μ)2 − nE(X − μ)2
i=1

n
= V ar(Xi ) − nV ar(X)
i=1
σ2
= nσ 2 − n
n
= (n − 1)σ 2 .

So
E(S 2 ) = σ 2 , (4.8.1)
which proves the unbiasedness.
For any distribution that has a fourth moment,

μ4 − 3μ22 2μ22
V ar(S 2 ) = − (4.8.2)
n n−1
Clearly lim V ar(S 2 ) = 0, so from Theorem 4.10, S 2 is a consistent estimator of σ 2 .
n→∞

We make the following comments.

1. In the special case of theorem 4.14


 where the Xi are distributed N (μ, σ 2 ) with both μ and
n
σ 2 unknown, the MLE of σ 2 is i=1 (Xi − X)2 /n which is (n − 1)S 2 /n. So in this case
the MLE is biased.

91
CHAPTER 4. ESTIMATION 4.9. MULTI-PARAMETER ESTIMATION

2. The number in the denominator of S 2 , that is, n − 1, is called the number of degrees
of freedom. The numerator is the sum of n deviations (from the mean) squared but the
deviations
 are not independent. There is one constraint on them, namely the fact that
(Xi − X) = 0. As soon as n − 1 of the Xi − X are known, the nth one is determined.

3. In calculating the observed value of S 2 , s2 , the following form is usually convenient.


  2/
 ( xi )
s2 = x2i − (n − 1) (4.8.3)
n

or, equivalently, 
2 x2i − nx2
s =
n−1
The equivalence of the two forms is easily seen:
   
(xi − x)2 = (x2i − 2xxi + x2 ) = x2i − 2x xi + nx2

 
( xi ) 2
where the RHS can readily be seen to be x2i − .
n

4.9 Multi-parameter Estimation


Suppose that a statistical model specifies that the data y has a probability distribution f (y; α, β)
depending on two unknown parameters α and β. In this case the likelihood function is a function
of the two variables α and β and having observed the value y is defined as L(α, β) = f (y; α, β)
with (α, β) = ln L(α, β). The MLE of (α, β) is a value (α̂, β̂) for which L(α, β) , or equivalently
(α, β) , attains its maximum value.
Define S1 (α, β) = ∂/∂α and S2 (α, β) = ∂/∂β. The MLEs (α̂, β̂) can be obtained by solving
the pair of simultaneous equations :

S1 (α, β) = 0
S2 (α, β) = 0

The observed information matrix I(α, β) is defined to be the matrix :


   2 
∂ ∂2
I11 (α, β) I12 (α, β) 2  
I(α, β) = =− ∂α
∂2
∂α∂β
∂2
I21 (α, β) I22 (α, β) ∂β∂α
 ∂β 2


The conditions for a value (α0 , β0 ) satisfying S1 (α0 , β0 ) = 0 and S2 (α0 , β0 ) = 0 to be a MLE
are that
I11 (α0 , β0 ) > 0, I22 (α0 , β0 ) > 0,
and
det(I(α0 , β0 ) = I11 (α0 , β0 )I22 (α0 , β0 ) − I12 (α0 , β0 )2 > 0.
This is equivalent to requiring that both eigenvalues of the matrix I(α0 , β0 ) be positive.

92
CHAPTER 4. ESTIMATION 4.9. MULTI-PARAMETER ESTIMATION

Example 4.18 (Gaussian distribution).


Let X1 , X2 . . . , Xn be i.i.d. observations from a N (μ, v) density in which both μ and v are unknown. The log
likelihood is
n  
1 1 2
(μ, v) = ln √ exp [− (xi − μ) ]
i=1
2πv 2v
n  
1 1 1
= − ln [2π] − ln [v] − (xi − μ)2
i=1
2 2 2v
n
n n 1 
= − ln [2π] − ln [v] − (xi − μ)2 .
2 2 2v i=1

Hence
n
∂ 1
S1 (μ, v) = = (xi − μ) = 0
∂μ v i=1
implies that
n
1
μ̂ = xi = x̄. (4.9.1)
n i=1
Also
n
∂ n 1 
S2 (μ, v) = =− + 2 (xi − μ)2 = 0
∂v 2v 2v i=1
implies that
n n
1 1
v̂ = (xi − μ̂)2 = (xi − x̄)2 . (4.9.2)
n i=1 n i=1
Calculating second derivatives and multiplying by −1 gives that the information matrix I(μ, v) equals
⎛ ⎞
n 1
n
⎜ v v2 (xi − μ) ⎟
I(μ, v) = ⎜
⎝ 1  n
i=1

n


v2 (xi − μ) − 2vn2 + v13 (xi − μ)2
i=1 i=1

Hence I(μ̂, v̂) is given by :  


n
v̂ 0
n
0 2v 2
Clearly both diagonal terms are positive and the determinant is positive and so (μ̂, v̂) are, indeed, the MLEs of
(μ, v).

Lemma 4.15 (Joint distribution of the sample mean and sample variance). If X1 , . . . , Xn are
i.i.d. N (μ, v) then the sample mean X̄ and sample variance (n − 1)S 2 are independent. Also X̄
is distributed N (μ, v/n) and (n − 1)S 2 /v is a chi-squared random variable with n − 1 degrees
of freedom.
Proof. Define

n 
n
W = (Xi − X̄) 2
= (Xi − μ)2 − n(X̄ − μ)2
i=1 i=1

W (X̄ − μ) 2 
n
(Xi − μ)2
⇒ + =
v v/n i=1
v

The RHS is the sum of n independent standard normal random variables squared, and so is
distributed χ2ndf Also, X̄ ∼ N (μ, v/n), therefore (X̄ − μ)2 /(v/n) is the square of a standard

93
CHAPTER 4. ESTIMATION 4.9. MULTI-PARAMETER ESTIMATION

normal and so is distributed χ21df These Chi-Squared random variables have moment generating
functions (1 − 2t)−n/2 and (1 − 2t)−1/2 respectively. Next, W/v and (X̄ − μ)2 /(v/n) are
independent:
Cov(Xi − X̄, X̄) = Cov(Xi , X̄) − Cov(X̄, X̄)
 
1
= Cov Xi , Xj − Var(X̄)
n
1 v
= Cov(Xi , Xj ) −
n j n
v v
= − = 0
n n
But, Cov(Xi − X̄, X̄ − μ) = Cov(Xi − X̄, X̄) = 0 , hence
 
 
Cov(Xi − X̄, X̄ − μ) = Cov (Xi − X̄), X̄ − μ = 0
i i

As the moment generating function of the sum of independent random variables is equal to the
product of their individual moment generating functions, we see
E et(W/v) (1 − 2t)−1/2 = (1 − 2t)−n/2
⇒ E et(W/v) = (1 − 2t)−(n−1)/2

But (1 − 2t)−(n−1)/2 is the moment generating function of a χ2 random variables with (n − 1)


degrees of freedom, and the moment generating function uniquely characterizes the random
variable W/v.

Example 4.19 ((Chi-squared distribition).


Go back to equation (4.9.1), and X̄ ∼ N (μ, v/n). Clearly E(X̄) = μ (unbiased) and Var(X̄) = v/n, so X̄ achieved
the CRLB. Go back to equation (4.9.2). Then from lemma 4.15 we have
nv̂
∼ χ2n−1
v
so that
 
nv̂
E = n−1
v
 
n−1
⇒ E(v̂) = v
n

Instead, propose the (unbiased) estimator


n
n 1 
ṽ = v̂ = (xi − x̄)2 (4.9.3)
n−1 n − 1 i=1

Observe that     
n n n−1
E(ṽ) = E(v̂) = v = v
n−1 n−1 n
and ṽ is unbiased as suggested. We can easily show that
2v 2
Var(ṽ) =
(n − 1)

94
CHAPTER 4. ESTIMATION 4.10. NEWTON-RAPHSON OPTIMIZATION

Hence
2v 2 2v 2 1
eff(ṽ) = ÷ =1−
n (n − 1) n
Clearly ṽ is not efficient, but is asymptotically efficient.

4.10 Newton-Raphson optimization


4.10.1 One-paramter scenario
Example 4.20 (Radioactive Scatter).
A radioactive source emits particles intermittently and at various angles. Let X denote the cosine of the angle
of emission. The angle of emission can range from 0 degrees to 180 degrees and so X takes values in [−1, 1].
Assume that X has density
1 + θx
f (x|θ) =
2
for −1 ≤ x ≤ 1 where θ ∈ [−1, 1] is unknown. Suppose the data consist of n independently identically distributed
measures of X yielding values x1 , x2 , ..., xn . Here
n
1 
L(θ) = (1 + θxi )
2n i=1
n

l(θ) = −n ln [2] + ln [1 + θxi ]
i=1
n
 xi
S(θ) =
i=1
1 + θxi
n
 x2i
I(θ) =
i=1
(1 + θxi )2

Since I(θ) > 0 for all θ, the MLE may be found by solving the equation S(θ) = 0. It is not immediately obvious
how to solve this equation.

By Taylor’s Theorem we have

0 = S(θ̂)
= S(θ0 ) + (θ̂ − θ0 )S  (θ0 ) + (θ̂ − θ0 )2 S  (θ0 )/2 + ....

So, if |θ̂ − θ0 | is small, we have that


0 ≈ S(θ0 ) + (θ̂ − θ0 )S  (θ0 ),
and hence

θ̂ ≈ θ0 − S(θ0 )/S  (θ0 )


= θ0 + S(θ0 )/I(θ0 )

We now replace θ0 by this improved approximation θ0 + S(θ0 )/I(θ0 ) and keep repeating the process until we find
a value θ̂ for which |S(θ̂)| < where is some prechosen small number such as 0.000001. This method is called
Newton’s method for solving a non-linear equation.
If a unique solution to S(θ) = 0 exists, Newton’s method works well regardless of the choice of θ0 . When
there are multiple solutions, the solution to which the algorithm converges depends crucially on the choice of θ0 .
In many instances it is a good idea to try various starting values just to be sure that the method is not sensitive
to the choice of θ0 .

95
CHAPTER 4. ESTIMATION 4.10. NEWTON-RAPHSON OPTIMIZATION

4.10.2 Two-paramter scenario


Suppose that m parameters are involved, organized in the vector θ. When the m equations Sr (θ) = 0, r = 1, . . . , m
in this scenario cannot be solved directly numerical optimization is required. Let S(θ) be the m × 1 vector whose
rth element is Sr (θ). Let θ̂ be the solution to the set of equations S(θ) = 0 and let θ 0 be an initial guess at θ̂.
Then a first order Taylor’s series approximation to the function S about the point θ 0 is given by
m
 ∂Sr
Sr (θ̂θ ) ≈ Sr (θ 0 ) + (θ̂j − θ0j ) (θ 0 )
j=1
∂θj

for r = 1, 2, . . . , m which may be written in matrix notation as

S(θ̂) ≈ S(θ 0 ) − I(θθ 0 )(θ̂ − θ 0 ).

Requiring S(θ̂) = 0, this last equation can be reorganized to give

θ̂ ≈ θ 0 + I(θ 0 )−1 S(θ 0 ). (4.10.1)

Thus given θ 0 this is a method for finding an improved guess at θ̂. We then replace θ 0 by this improved guess
and repeat the process. We keep repeating the process until we obtain a value θ ∗ for which |Sr (θ ∗ )| is less than
for r = 1, 2, . . . , m where is some small number like 0.0001. θ ∗ will be an approximate solution to the set of
equations S(θ) = 0. We then evaluate the matrix I(θ ∗ ) and if all m of its eigenvalues are positive we set θ̂ = θ ∗ .

4.10.3 Initial values


The Newton-Raphson method depends on the initial guess being close to the true value. If this requirement is not
satisfied the procedure might convergence to a minimum instead of a maximum, or just simply diverge and fail to
produce any estimates at all. Methods of finding good initial estimates depend very much on the problem at hand
and may require some ingenuity.
A more general method for finding initial values is the method of moments. Retaking the example on
‘Radioactive Scatter’ (Example 4.20), using the Methods of Moments to find an intial value, involves solving the
equation E(X) = x̄ for θ. For the previous example
 1
x(1 + θx) θ
E(X) = dx =
−1 2 3

and so θ0 = 3x̄ might yield a good choice for a starting value.


Suppose the following 10 values were recorded:

0.00, 0.23, −0.05, 0.01, −0.89, 0.19, 0.28, 0.51, −0.25 and 0.27.

Then x̄ = 0.03, and we substitute θ0 = .09 into the updating formula


 n
 n
−1
 xi  x2i
θ̂new = θold +
i=1
1 + θold xi i=1
(1 + θold xi )2

⇒ θ1 = 0.2160061
θ2 = 0.2005475
θ3 = 0.2003788
θ4 = 0.2003788

The relative likelihood function is plotted in Figure 4.10.1.

96
CHAPTER 4. ESTIMATION 4.10. NEWTON-RAPHSON OPTIMIZATION

Figure 4.10.1: Relative likelihood for the radioactive scatter, solved by Newton Raphson.

4.10.4 Fisher’s method of scoring


Several commonly applied modifications of the Newton-Raphson method exist. One such
modification to Newton’s method is to replace the matrix I(θ) in (4.10.1) with Ī(θ) = E [I(θ)] .
The matrix Ī(θ) is positive definite, thus overcoming many of the problems regarding matrix
inversion. Like Newton-Raphson, there is no guarantee that Fisher’s method of scoring (as it is
then called) will avoid producing negative parameter estimates or converging to local minima.
Unfortunately, calculating Ī(θ) often can be mathematically difficult.

4.10.5 The method of profiling


Especially when several parameters are involved, a simplification of the estimaton process may
be obtained by looking at profile likelihoods instead.
Recall that a likelihood function for a low-dimensional parameter can be conveniently
visualized by its graph. If the likelihood is smooth, then this is roughly, at least locally, a
reversed parabola with its top at the maximum likelihood estimator (MLE). The (negative)
curvature of the graph at the MLE is known as the observed information and provides an
estimate for the inverse of the variance of the MLE; steep likelihoods yield accurate estimates.

Definition 4.18 (Profile likelihood). Consider a vector of parameters (θ1 , θ2 ), with


likelihood function l(θ1 , θ2 ; x). Suppose that θ6 2.1 is the MLE of θ2 for a given value of θ1 .
Then the profile likelihood for θ1 is l(θ1 , θ2.1 ; x).

The profile likelihood may be used to a considerable extent as a full likelihood (but not
always!). It is customary to use the curvature of the profile likelihood function as an estimate of
 For more information about profile likelihoods, we refer to [1].
the variability of θ.

97
CHAPTER 4. ESTIMATION 4.10. NEWTON-RAPHSON OPTIMIZATION

Example 4.21 (Weibull).


A Weibull random variable with ‘shape’ parameter a > 0 and ‘scale’ parameter b > 0 has density
fT (t) = (a/b)(t/b)a−1 exp{−(t/b)a }
for t ≥ 0. The (cumulative) distribution function is
FT (t) = 1 − exp{−(t/b)a }
on t ≥ 0. Suppose that the time to failure T of components has a Weibull distribution and after testing n
components for 100 hours, m components fail at times t1 , . . . , tm , with n − m components surviving the 100 hour
test. The likelihood function can be written
m  a−1   a   n   a 
a ti ti 100
L(θ) = exp − exp − .
i=1
b b b j=m+1
b
     
components failed components survived

Then the log-likelihood function is


m
 n
 a
(a, b) = m ln (a) − ma ln (b) + (a − 1) ln (ti ) − (ti /b) ,
i=1 i=1

where for convenience we have written tm+1 = · · · = tn = 100. This yields score functions
m n
m a
Sa (a, b) = − m ln (b) + ln (ti ) − (ti /b) ln (ti /b) ,
a i=1 i=1

and
n
ma a  a
Sb (a, b) = − + (ti /b) . (4.10.2)
b b i=1
It is not obvious how to solve Sa (a, b) = Sb (a, b) = 0 for a and b.

Setting Sb (a, b) = 0 from equation (4.10.2) we can solve for b in terms of a as


 n
1/a
1  a
b= t . (4.10.3)
m i=1 i

This reduces our two parameter problem to a search over the “new” one-parameter profile log-likelihood
a (a) = (a, b(a)),
 n
 m
1  a 
= m ln (a) − m ln t + (a − 1) ln (ti ) − m. (4.10.4)
m i=1 i i=1

Given an initial guess a0 for a, an improved estimate a1 can be obtained using a single parameter Newton-
Raphson updating step a1 = a0 + S(a0 )/I(a0 ), where S(a) and I(a) are now obtained from a (a). The estimates
â = 1.924941 and b̂ = 78.12213 were obtained by applying this method to the Weibull data using starting values
a0 = 0.001 and a0 = 5.8 in 16 and 13 iterations respectively. However, the starting value a0 = 5.9 produced the
sequence of estimates a1 = −5.544163, a2 = 8.013465, a3 = −16.02908, a4 = 230.0001 and subsequently crashed.

4.10.6 Reparameterization
Negative parameter estimates can be avoided by reparameterizing the profile log-likelihood in
(4.10.4) using α = ln(a). Since a = eα we are guaranteed to obtain a > 0. The reparameterized
profile log-likelihood becomes
 
1  eα 
n m
α (α) = mα − m ln t + (e − 1)
α
ln (ti ) − m,
m i=1 i i=1

98
CHAPTER 4. ESTIMATION 4.11. BAYESIAN ESTIMATION

implying score function


n eα m
meα ti ln(ti )
S(α) = m − n eα
i=1
+e α
ln(ti )
i=1 ti i=1

and information function


 n n eα
meα  α t ln(t )]2
α[
I(α) = n eα ti ln(ti ) − e
e
ni eα i
i=1

i=1 ti i=1 i=1 ti




n
α

m
+eα tei [ln(ti )]2 − eα ln(ti ).
i=1 i=1

The estimates â = 1.924941 and b̂ = 78.12213 were obtained by applying this method to the
Weibull data using starting values a0 = 0.07 and a0 = 76 in 103 and 105 iterations respectively.
However, the starting values a0 = 0.06 and a0 = 77 failed due to division by computationally
tiny (1.0e-300) values.

4.10.7 The step-halving scheme


The Newton-Raphson method uses the (first and second) derivatives of (θ) to maximize the
function (θ), but the function itself is not used in the algorithm. The log-likelihood can be
incorporated into the Newton-Raphson method by modifying the updating step to
θ i+1 = θ i + λi I(θ i )−1 S(θ i ), (4.10.5)
where the search direction has been multiplied by some λi ∈ (0, 1] chosen so that the inequality
7 8
 θ i + λi I(θ i )−1 S(θ i ) >  (θ i ) (4.10.6)
holds. This requirement protects the algorithm from converging towards minima or saddle
points. At each iteration the algorithm sets λi = 1, and if (4.10.6) does not hold λi is replaced
with λi /2. The process is repeated until the inequality in (4.10.6) is satisfied. At this point the
parameter estimates are updated using (4.10.5) with the value of λi for which (4.10.6) holds.
If the function (θ) is concave and unimodal convergence is guaranteed. Finally, when Ī(θ) is
used in place of I(θ) convergence to a (local) maxima is guaranteed, even if (θ) is not concave.

4.11 Bayesian estimation


4.11.1 Bayes’ theorem for random variables
Some of the following results are presented without elaboration, intended as a revision to provide
the framework for Bayesian data analyses.
P (E|F H)P (F |H) = P (EF |H) = P (EF H|H)

P (E) = P (E|Hn )P (Hn )
n
P (Hn |E)P (E) = P (EHn ) = P (Hn )P (E|Hn )
P (Hn |E) ∝ P (Hn )P (E|Hn ).

99
CHAPTER 4. ESTIMATION 4.11. BAYESIAN ESTIMATION

The last result is Baye’s theorem and in this form shows how we can “invert” probabilities,
getting P (Hn |E) from P (E|Hn ).
When Hn consists of exclusive and exhaustive events,

P (Hn )P (E|Hn )
P (Hn |E) =  (4.11.1)
m P (Hm )P (E|Hm )

p(y|x) ∝ p(y)p(x|y) (4.11.2)


The constant of proportionality is
1 1
continuous : =
p(x) p(x|y)p(y)dy
1 1
discrete : =
p(x) y p(x|y)p(y)dy

4.11.2 Post ‘is’ prior × likelihood


Suppose we are interested in the values of k unknown quantities,

θ = (θ1 , θ2 , . . . , θk )

and apriori beliefs about their values can be expressed in terms of the pdf p(θ).
Then we collect data,
X = (X1 , X2 , . . . , Xn )
which have a probability distribution that depends on θ, expressed as

p(X|θ)

From (4.11.2),
p(θ|X) ∝ p(X|θ) × p(θ) (4.11.3)
The term, p(X|θ) may be considered as a function of X for fixed θ, i.e. a density of X which is
parameterized by θ.
We can also consider the same term as a function of θ for fixed X and then it is termed the
likelihood function,
(θ|X) = p(X|θ)
The terms in expression 4.11.3 represent the 3 components in Baeyesian inference:

• p(θ) is the prior

• (θ|X) is the likelihood

• p(θ|X) is the posterior

100
CHAPTER 4. ESTIMATION 4.11. BAYESIAN ESTIMATION

Figure 4.11.1: Posterior distribution

The Bayesian mantra therefore is:


posterior ∝ likelihood × prior
or a posterior density is proportional to the prior times the likelihood.
Note that the function p(·) is not the same in each instance but is a generic symbol to
represent the density appropriate for prior, density of the data given the parameters, and the
posterior. The form of p is understood by considering its arguments, i.e. p(θ), p(x|θ) or p(θ|x).
A diagram depicting the relationships amongst the different densities is shown in Fig. 4.11.1.

The posterior is a combination of the likelihood, where information about θ comes from the
data X and the prior p(θ) where the information is the knowledge of θ independent of X. This
knowledge may come from previous sampling, (say). The posterior represents an update on
P (θ) with the new information at hand, i.e. x.
If the likelihood is weak due to insufficient sampling or wrong choice of likelihood function,
the prior can dominate so that the posterior is just an adaptation of the prior. Alternatively, if
the sample size is large so that the likelihood function is strong, the prior will not have much
impact and the Bayesian analysis is the same as the maximum likelihood.
The output is a distribution, p(θ|X) and we may interpret it using summaries such as the
median. A tabular view on the difference between Bayesian inference and frequentist inference
is given in Table ??. In a frequentist world, θ is considered a fixed but unknown feature of the
population from which data is being (randomly) sampled. In a Bayesian world it is the estimator
θ that is fixed (a function of the data available for analysis) and θ is a random variable, subject
to (subjective) uncertainty.
From the output (i.e., posterior distribution of θ) we can derive intervals where the true
value of θ would lie with a certain probability. We differentiate between credible and highest
probability density regions.

101
CHAPTER 4. ESTIMATION 4.11. BAYESIAN ESTIMATION

Bayesian Frequentist
θ random fixed but unknown

θ fixed random
‘randomness’ subjective sampling
distribution of interest posterior sampling distribution

Definition 4.19 (Credible Region). A region C ⊆ Ω such that C p(θ)dθ = −α, 0 ≤ α ≤ 1


is a 100(1 − alpha)% credible region for θ. For single-parameter problems, if C is not a set of
disjoint intervals, then C is a credible interval. If p(θ) is a (prior/posterior) density, then C is a
(prior/posterior) credible region.

Definition 4.20 (Highest Probability Density Region). A region C ⊆ Ω is a 100(1 − α)%


highest probability density region for θ under p(θ) if P (θ ∈ C) = 1 − α and P (θ1 ) ≥ P (θ2 ), ∀θ1 ∈
C, θ2 ∈
/ C.

Figure 4.11.2 depicts a density with shaded areas of 0.9 in 2 cases. In frame 4.2(a), observe
that there are quantiles outside the interval (1.37, 7.75) for which the density is greater than
quantiles within the interval. Frame 4.2(b) depicts the HDR as (0.94, 6.96).

(a) Confidence interval (b) A HDR

Figure 4.11.2: Comparison of 2 types of regions, a Confidence Interval and a Highest Density Region.

We may now have a notion about uncertainty, but how to derive a Bayesian point estimate?
Bayes estimates are single number summaries of a posterior density. As there are modes, medians,
means, etc, it is the question which summary to choose best. Different loss functions rationalize
different point estimates. The quadratic loss l(θ, θ̃ = (θ − θ̃)2 indicates the quadratic loss from
the use of the estimate θ̃ instead of θ. The posterior mean as Bayes estimate under quadratic
loss is given by 
E(θ|y) = θ̃ = θp(θ|y)dθ. (4.11.4)
θ

102
CHAPTER 5. CONFIDENCE INTERVALS

CHAPTER

CONFIDENCE INTERVALS

5.1 Introduction
We have precedently been considering point estimators of a parameter, and in the Bayesian
context, we have also briefly touched upon ‘credibility’, credible intervals. Also in the frequentist
context we can construct an interval to which we can attach a probability that the true value
lies within the limits of the interval. What these intervals are and how we can construct them,
will be the subject of this chapter.

Recall that by point estimator we are referring to the fact that, after the sampling has
been done and the observed value of the estimator computed, our end-product is the single
number which is hopefully a good approximation for the unknown true value of the parameter.
If the estimator is good according to some criteria (and we have seen several of these criteria in
Chapter 4, then the estimate should be reasonably close to the unknown true value. But the
single number itself does not include any indication of how high the probability might be that
the estimator has taken on a value close to the true unknown value. The method of confidence
intervals gives both an idea of the actual numerical value of the parameter, by giving it a range
of possible values, and a measure of how confident we are that the true value of the parameter
is in that range. To pursue this idea further consider the following example.
Example 5.1.
Consider a random sample of size n for a normal distribution with mean μ (unknown) and known variance σ 2 .
Find a 95% CI for the unknown mean, μ.

103
CHAPTER 5. CONFIDENCE INTERVALS 5.1. INTRODUCTION

Solution of Example# 5.1.$We know that the best estimator of μ is X and the sampling
2
distribution of X is N μ, σn . Then from the standard normal,
 
X −μ
P √ < 1.96 = .95.
σ/ n

|X−μ

|
The event σ/ n
< 1.96 is equivalent to the event
1.96σ 1.96σ
μ− √ <X <μ+ √ ,
n n
which is equivalent to the event
σ σ
X − 1.96 √ < μ < X + 1.96 √ .
n n
Hence  
σ σ
P X − 1.96 √ < μ < X + 1.96 √ = .95 (5.1.1)
n n
The two statistics X − 1.96 √σn , X + 1.96 √σn are the endpoints of a 95% CI for μ. This is reported
as
# $
The 95% CI for μ is X  − 1.96 √σ , X + 1.96 √σ .
n n

Computer Exercise 5.1. Generate 100 samples of size 9 from a N (0, 1) distribution. Find
the 95% CI for μ for each of these samples and count the number that do (don’t) contain zero.
(You could repeat this say 10 times to build up the total number of CI’s generated to 1000.) You
should observe that about 5% of the intervals don’t contain the true value of μ (= 0).
Solution of Computer Exercise 5.1. Use the commands:
#___________ ConfInt.R __________
sampsz <- 9
nsimulations <- 100
non.covered <- 0
for (i in 1:nsimulations){
rn <- rnorm(mean=0,sd=1,n=sampsz)

Xbar <- mean(rn)


s <- sd(rn)
CI <- qnorm(mean=Xbar,sd=s/sqrt(sampsz),p=c(0.025,0.975) )

non.covered <- non.covered + (CI[1] > 0) + (CI[2] < 0)

}
cat("Rate of non covering CI’s",100*non.covered/nsimulations," % \n")

> source("ConfInt.R")
Rate of non covering CI’s 8 %

104
CHAPTER 5. CONFIDENCE INTERVALS 5.1. INTRODUCTION

This implies that 8 of the CI’s don’t contain 0. With a larger sample size we would expect that
about 5% of the CI’s would not contain zero.
We make the following definition:

Definition 5.1 (Random interval). An interval, at least one of whose endpoints is a random
variable is called a random interval.

In (5.1.1), we are saying that the probability is 0.95 that the random interval
 
σ σ
X − 1.96 √ , X + 1.96 √
n n
contains μ.

A CI has to be interpreted carefully. For a particular sample, where x is the observed value
of X, a 95% CI for μ is  
σ σ
x − 1.96 √ , x + 1.96 √ , (5.1.2)
n n
but the statement
σ σ
x − 1.96 √ < μ < 1.96 √
n n

is either true or false. The parameter μ is a constant and either the interval contains it in
which case the statement is true, or it does not contain it, in which case the statement is false.
A probability 0.95 needs to be interpreted in terms of the relative frequency with which the
indicated event will occur “in the long run” of similar sampling experiments. Each time we
take a sample of size n, a different x, and hence a different interval (5.1.2) would be obtained.
Some of these intervals will contain μ as claimed, and some will not. In fact, if we did this many
times, we’d expect that 95 times out of 100 the interval obtained would contain μ. The measure
of our confidence is then 0.95 because before a sample is drawn there is a probability of 0.95
that the confidence interval to be constructed will cover the true mean.
A statement such as P (3.5 < μ < 4.9) = 0.95 is incorrect and should be replaced by
A 95% CI for μ is (3.5, 4.9).

We can generalize the above as follows: Let z α2 be defined by

Φ(z α2 ) = 1 − (α/2). (5.1.3)

That is, the area under the normal curve above zα/2 is α/2. Then
 
X −μ
P −zα/2 < √ < zα/2 = 1 − α. (5.1.4)
σ/ n

So a 100(1 − α)% CI for μ is  


σ σ
x − z α2 √ , x + z α2 √ . (5.1.5)
n n
Commonly used values of α are 0.1, 0.05, 0.01.

105
CHAPTER 5. CONFIDENCE INTERVALS 5.2. EXACT CONFIDENCE INTERVALS

Obviously, confidence intervals for a given parameter are not unique. For example, we have
considered a symmetric, two-sided interval, but
 
σ σ
x − z2α/3 √ , x + zα/3 √
n n

is also a 100(1 − α)% CI for μ. Likewise, we could have one-sided CI’s for μ. For example,
   
σ σ
−∞, x + zα √ or x − zα √ , ∞ .
n n
# $
The second of these arises from considering P σ/ √ < zα = 1 − α. Also, alternatively, we
X−μ
n
could have a CI based on say, the sample median instead of the sample mean.

As for methods to derive point estimates, methods of obtaining CIs must be judged by their
various statistical properties. For example, one desirable property is to have the length (or
expected length) of a 100(1 − α)% CI as short as possible. Note that for the CI in (5.1.5), the
length is constant for given n.

5.2 Exact confidence intervals


Suppose that we are going to observe the value of a random vector X. Let X denote the set of
possible values that X can take and, for x ∈ X , let g(x|θ) denote the probability that X takes
the value x where the parameter θ is some unknown element of the set Θ. Consider the problem
of quoting a subset of θ values which are in some sense plausible in the light of the data x. We
need a procedure which for each possible value x ∈ X specifies a subset C(x) of Θ which we
should quote as a set of plausible values for θ.

Example 5.2.
Suppose we are going to observe data x where x = (x1 , x2 , . . . , xn ), and x1 , x2 , . . . , xn are the observed values of
iid N (θ, 1) for some unknown parameter θ ∈ (−∞, ∞) =
random variables X1 , X2 , . . . , Xn which are√thought to be√
Θ. Consider the subset C(x) = [x̄ − 1.96/ n, x̄ + 1.96/ n]. If we carry out an infinite sequence of independent
repetitions of the experiment then we will get an infinite sequence of x values and thereby an infinite sequence of
subsets C(x). We might ask what proportion of this infinite sequence of subsets actually contain the fixed but
unknown value of θ?

Solution of Example 5.2. Since C(x) depends on x only through the value of x̄ we need to
know how x̄ behaves in the infinite sequence√ of repetitions. This follows from the fact that X̄
has a N (θ, n1 ) density and so Z = X̄−θ
√1
= n(X̄ − θ) has a N (0, 1) density. Thus eventhough θ
n
is unknown we can calculate the probability that the value of Z will exceed 2.78, for example,
using the standard normal tables. Remember that (from a frequentist viewpoint) the probability
is the proportion of experiments in the infinite sequence of repetitions which produce a value of
Z greater than 2.78.

106
CHAPTER 5. CONFIDENCE INTERVALS 5.2. EXACT CONFIDENCE INTERVALS

In particular we have that P [|Z| ≤ 1.96] = 0.95. Thus 95% of the time Z will lie between
−1.96 and +1.96. But

−1.96 ≤ Z ≤ +1.96 ⇒ −1.96 ≤ n(X̄ − θ) ≤ +1.96
√ √
⇒ −1.96/ n ≤ X̄ − θ ≤ +1.96/ n
√ √
⇒ X̄ − 1.96/ n ≤ θ ≤ X̄ + 1.96/ n
⇒ θ ∈ C(X)

Thus we have answered the question we started with. The proportion of the infinite sequence of
subsets given by the formula C(X) which will actually include the fixed but unknown value of θ
is 0.95. For this reason the set C(X) is called a 95% confidence set or confidence interval for
the parameter θ.

It is well to bear in mind that once we have actually carried out the experiment and observed
our value of x, the resulting interval C(x) either does or does not contain the unknown parameter
θ. We do not know which is the case. All we know is that the procedure we used in constructing
C(x) is one which 95% of the time produces an interval which contains the unknown parameter.

The crucial step in the last example was finding the quantity Z = n(X̄ − θ) whose value
depended on the parameter of interest θ but whose distribution was known to be that of a
standard normal variable. This leads to the following definition.

Definition 5.2 (Pivotal Quantity). A pivotal quantity for a parameter θ is a random


variable Q(X|θ) whose value depends both on (the data) X and on the value of the unknown
parameter θ but whose distribution is known.

The quantity Z in the example above is a pivotal quantity for θ. The following lemma
provides a method of finding pivotal quantities in general.

Lemma 5.1. Let X be a random variable and define F (a) = P [X ≤ a]. Consider the
random variable U = −2 log [F (X)]. Then U has a χ22 density. Consider the random variable
V = −2 log [1 − F (X)]. Then V has a χ22 density.
Proof. Observe that, for a ≥ 0,

P [U ≤ a] = P [F (X) ≥ exp (−a/2)]


= 1 − P [F (X) ≤ exp (−a/2)]
= 1 − P [X ≤ F −1 (exp (−a/2))]
= 1 − F [F −1 (exp (−a/2))]
= 1 − exp (−a/2).

Hence, U has distribution 12 exp (−a/2) which corresponds to the distribution of a χ22 variable
as required. The corresponding proof for V is left as an exercise.

This lemma has an immediate, and very important, application.

Suppose that we have data X1 , X2 , . . . , Xn which are iid with density f (x|θ). Define F (a|θ) =
a
−∞
f (x|θ)dx and, for i = 1, 2, . . . , n, define Ui = −2 log[F (Xi |θ)]. Then U1 , U2 , . . . , Un are iid

107
CHAPTER 5. CONFIDENCE INTERVALS 5.2. EXACT CONFIDENCE INTERVALS


each having a χ22 density. Hence Q1 (X, θ) = ni=1 Ui has a χ22n density and so is a pivotal quantity

for θ. Another pivotal quantity ( also having a χ22n density ) is given by Q2 (X, θ) = ni=1 Vi
where Vi = −2 log[1 − F (Xi |θ)].
According to [5], the pivotal method thus depends on finding a pivotal quantitity that has 2
characteristics:

1. It is a function of the sample observations and the unknown parameter θ, say


H(X1 , X2 , . . . , Xn ; θ) where θ is the only unknown quantity,

2. It has a probability distribution that does not depend on θ.

Then any probability statement of the form

P (a < H(X1 , X2 , . . . , Xn ; θ) < b) = 1 − α

will give rise to a probability statement about θ.


Example 5.3 ([5]).
Given X1 , X2 , . . . , Xn1 from N (μ1 , σ12 ) and Y1 , Y2 , . . . , Yn2 from N (μ2 , σ22 ) where σ12 , σ22 are known, find a
symmetric 95% CI for μ1 − μ2 .

Solution of Example 5.3 ([5]). Consider μ1 − μ2 (= θ, say) as a single parameter. Then


X is distributed N (μ1 , σ12 /n1 ) and Y is distributed N (μ2 , σ22 /n2 ) and further, X and Y are
independent. It follows that X − Y is normally distributed, and writing it in standardized form,

X − Y − (μ1 − μ2 )
! is distributed as N (0, 1).
(σ12 /n1 ) + (σ22 /n2 )

So we have found the pivotal quantity which is a function of μ1 − μ2 but whose distribution does
not depend on μ1 − μ2 . A 95% CI for θ = μ1 − μ2 is found by considering
 
X −Y −θ
P −1.96 < ! 2 < 1.96 = .95,
(σ1 /n1 ) + (σ22 /n2 )

which, on rearrangement, gives the appropriate CI for μ1 − μ2 . That is,


⎛   ⎞
2 2 2 2
⎝x − y − 1.96 σ1 + σ2 , x − y + 1.96 σ1 + σ2 ⎠ . (5.2.1)
n1 n2 n1 n2

Example 5.4 ([5]).


In many problems where we need to estimate proportions, it is reasonable to assume that sampling is from a
binomial population, and hence that the problem is to estimate p in the bin(n, p) distribution, where p is unknown.
Find a 100(1 − α)% CI for p, making use of the fact that for large sample sizes, the binomial distribution can be
approximated by the normal.

Solution of Example 5.4 ([5]). Given X is distributed as bin(n, p), an unbiased estimate of
p is p = X/n. For n large, X/n is approximately normally distributed. Then,

E(
p) = E(X)/n = p,

108
CHAPTER 5. CONFIDENCE INTERVALS 5.2. EXACT CONFIDENCE INTERVALS

and
1 1 p(1 − p)
V ar(
p) = 2
V ar(X) = 2 np(1 − p) =
n n n
so that
p − p
! is distributed approximately N (0, 1).
p(1 − p)/n
[Note that we have found the required pivotal quantity whose distribution does not depend on p.]
An approximate 100(1 − α)% CI for p is obtained by considering
 
p − p
P −zα/2 < ! < zα/2 = 1 − α (5.2.2)
p(1 − p)/n

where zα/2 is defined in (5.1.3). Rearranging (5.2.2), the confidence limits for p are obtained as
"
2n 2
p + zα/2 ± zα/2 4np(1 − p) + zα/2
2

2
. (5.2.3)
2(n + zα/2 )

A simpler expression can be found by dividing both numerator and denominator of (5.2.3) by 2n
and neglecting terms of order 1/n. That is, a 95% CI for p is
# ! ! $
p − 1.96 p(1 − p)/n, p + 1.96 p(1 − p)/n . (5.2.4)

Note that this is just the expression we would have used if we replaced V ar( p) = p(1 − p)/n in
 p) = p(1 − p)/n. In practice, confidence limits for p when n is small will need to
(5.2.2) by V ar(
be constructed in another way. We will see more about this in future chapters.
Example 5.5 ([5]).
Construct an appropriate
 90% confidence interval for λ in the Poisson distribution. Evaluate this if a sample of
size 30 yields xi = 240.

Solution of Example 5.5 ([5]). Now X is an unbiased estimator of λ for this problem, so λ
 = x with E(λ)
can be estimated by λ  = λ and V ar(λ) = V ar(X) = σ 2 /n = λ/n. By the Central
Limit theorem, for large n, the distribution of X is approximately normal, so
X −λ
! is distributed approximately N (0, 1).
λ/n
An approximate 90% CI for λ can be obtained from considering
 
X −λ
P −1.645 < ! < 1.645 = .90. (5.2.5)
λ/n

Rearrangement of the inequality in (5.2.5) to give an inequality for λ, is similar to that in


Example 5.4 where it was necessary to solve a quadratic. But, noting the comment following

(5.2.4), replace the variance of X by its estimate nλ = Xn , giving for the 90% CI for λ,
# ! ! $
x − 1.645 x/nx + 1.645 x/n

which on substitution of the observed value 240/30 = 8 for x gives (7.15, 8.85).

109
CHAPTER 5. CONFIDENCE INTERVALS 5.3. PIV. QTIES FOR USE WITH N. DATA

Example 5.6.
Suppose that we have data X1 , X2 , . . . , Xn which are iid with density

f (x|θ) = θ exp (−θx)

for x ≥ 0 and suppose that we want to construct a 95% confidence interval for θ. We need to find a pivotal
quantity for θ. Observe that
 a
F (a|θ) = f (x|θ)dx
−∞
 a
= θ exp (−θx)dx
0
= 1 − exp (−θa).

Hence
n

Q1 (X, θ) = −2 log [1 − exp (−θXi )]
i=1

is a pivotal quantity for θ having a χ22n density. Also


n
 n

Q2 (X, θ) = −2 log [exp (−θXi )] = 2θ Xi
i=1 i=1

is another pivotal quantity for θ having a χ22n density.


Using the tables, find A < B such that P [χ22n < A] = P [χ22n > B] = 0.025. Then

0.95 = P [A ≤ Q2 (X, θ) ≤ B]
 n


= P A ≤ 2θ Xi ≤ B
i=1
 
A B
= P n ≤θ≤ n
2 i=1 Xi 2 i=1 Xi

and so the interval  


A B
n , n
2 i=1 Xi 2 i=1 Xi

is a 95% confidence interval for θ.


The other pivotal quantity Q1 (X, θ) is more awkward in this example since it is not straightforward to
determine the set of θ values which satisfy A ≤ Q1 (X, θ) ≤ B.

5.3 Pivotal quantities for use with normal data


Many exact pivotal quantities have been developed for use with Gaussian data.

Example 5.7.
Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ, σ 2 ) density where σ is known.
Define √
n(X̄ − θ)
Q= .
σ
Then Q has a N (0, 1) density and so is a pivotal quantity for θ. In particular we can be 95% sure that

n(X̄ − θ)
−1.96 ≤ ≤ +1.96
σ

110
CHAPTER 5. CONFIDENCE INTERVALS 5.3. PIV. QTIES FOR USE WITH N. DATA

which is equivalent to
σ σ
X̄ − 1.96 √ ≤ θ ≤ X̄ + 1.96 √ .
n n

The R command qnorm(p=0.975,mean=0,sd=1) returns the value 1.959964 as the 97 12 % quantile from the
standard normal distribution.

Example 5.8.
Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ, σ 2 ) density where θ is known.
Define
n
(Xi − θ)2
i=1
Q=
σ2
 2
n
We can write Q = Zi where Zi = (Xi − θ)/σ. If Zi has a N (0, 1) density then Zi2 has a χ21 density. Hence,
i=1
Q has a χ2n density and so is a pivotal quantity for σ. If n = 20 then we can be 95% sure that

n
(Xi − θ)2
i=1
9.591 ≤ ≤ 34.170
σ2
which is equivalent to 9 9
: n : n
:
; 1  : 1 
(Xi − θ) ≤ σ ≤ ;
2 (Xi − θ)2 .
34.170 i=1 9.591 i=1

The R command qchisq(p=c(.025,.975),df=20) returns the values 9.590777 and 34.169607 as the 2 12 % and
97 12 % quantiles from a Chi-squared distribution on 20 degrees of freedom.

Lemma 5.2 (The Student t-distribution). Suppose the random variables X and Y are
independent, and X ∼ N (0, 1) and Y ∼ χ2n . Then the ratio
X
T =!
Y /n
is Student-t distributed with n degrees of freedom (df).
Example 5.9.
Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ, σ 2 ) density where both θ and
σ are unknown. Define √
n(X̄ − θ)
Q=
s
where
n
(Xi − X̄)2
i=1
s2 = .
n−1
We can write
Z
Q= !
W/(n − 1)
where √
n(X̄ − θ)
Z=
σ
has a N (0, 1) density and

n
(Xi − X̄)2
i=1
W =
σ2

111
CHAPTER 5. CONFIDENCE INTERVALS 5.3. PIV. QTIES FOR USE WITH N. DATA

has a χ2n−1 density ( see lemma 4.15 ).


In lemma 5.2 we show that Q has a tn−1 density, and so is a pivotal quantity for θ. If n = 31 then we can
be 95% sure that √
n(X̄ − θ)
−2.042 ≤ ≤ +2.042
s
which is equivalent to
s s
X̄ − 2.042 √ ≤ θ ≤ X̄ + 2.042 √ . (5.3.1)
n n
The R command qt(p=.975,df=30) returns the value 2.042272 as the 97 12 % quantile from a Student t-distribution
on 30 degrees of freedom.
It also follows immediately that W is a pivotal quantity for σ. If n = 31 then we can be 95% sure that

n
(Xi − X̄)2
i=1
16.79077 ≤ ≤ 46.97924
σ2
which is equivalent to
9 9
: n : n
: 1  : 1 
; (Xi − X̄) ≤ σ ≤ ;
2 (Xi − X̄)2 . (5.3.2)
46.97924 i=1 16.79077 i=1

The R command qchisq(p=c(.025,.975),df=30) returns the values 16.79077 and 46.97924 as the 2 12 % and
97 12 % quantiles from a Chi-squared distribution on 30 degrees of freedom.
It is important to point out that although a probability statement involving 95% confidence has been attached
the two intervals (5.3.1) and (5.3.2) separately, this does not imply that both intervals simultaneously hold with
95% confidence. )

Example 5.10.
Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θ1 , σ 2 ) density and data
Y1 , Y2 , . . . , Ym which are iid observations from a N (θ2 , σ 2 ) density where θ1 , θ2 , and σ are unknown. Let
δ = θ1 − θ2 and define
(X̄ − Ȳ ) − δ
Q= "
s2 ( n1 + m 1
)
where n m
2 i=1 (Xi − X̄)2 + j=1 (Yj − Ȳ )2
s = .
n+m−2
2 2
We know that X̄ has a N (θ1 , σn ) density and that Ȳ has a N (θ2 , σm ) density. Then the difference X̄ − Ȳ has a
N (δ, σ 2 [ n1 + m
1
]) density. Hence
X̄ − Ȳ − δ
Z="
σ 2 [ n1 + m
1
]
n m
has a N (0, 1) density. Let W1 = i=1 (Xi − X̄)2 /σ 2 and let W2 = j=1 (Yj − Ȳ )2 /σ 2 . Then, W1 has a χ2n−1
density and W2 has a χ2m−1 density, and W = W1 + W2 has a χ2n+m−2 density. We can write
!
Q1 = Z/ W/(n + m − 2)

and so, Q1 has a tn+m−2 density and so is a pivotal quantity for δ. Define
n 2
m 2
i=1 (Xi − X̄) + j=1 (Yj − Ȳ )
Q2 = .
σ2
Then Q2 has a χ2n+m−2 density and so is a pivotal quantity for σ.

112
CHAPTER 5. CONFIDENCE INTERVALS 5.3. PIV. QTIES FOR USE WITH N. DATA

Lemma 5.3 (The Fisher F-distribution). Let X1 , X2 , . . . , Xn and Y1 , Y2 , . . . , Ym be iid N (0, 1)


random variables. The ratio

n
Xi2 /n
i=1
Z=  m
Yi2 /m
i=1

has the distribution called Fisher, or F distribution with parameters (degrees of freedom) n, m,
or the Fn,m distribution for short. The corresponding pdf fFn,m is concentrated on the positive
half axis:
Γ((n + m)/2) # n $n/2 n/2−1 # n $−(n+m)/2
fFn,m (z) = z 1+ z for z > 0.
Γ(n/2)Γ(m/2) m m

Observe that if T ∼ tm , then T 2 = Z ∼ F1,m , and if Z ∼ Fn,m , then Z −1 ∼ Fm,n .


If W1 ∼ χ2n and W2 ∼ χ2m , then Z = (mW1 )/(nW2 ) ∼ Fn,m .
Example 5.11.
2
Suppose that we have data X1 , X2 , . . . , Xn which are iid observations from a N (θX , σX ) density and data
2
Y1 , Y2 , . . . , Ym which are iid observations from a N (θY , σY ) density where θX , θY , σX , and σY are all unknown.
Let
λ = σX /σY
and define n
∗ŝ2 i=1 (Xi − X̄)2 (m − 1)
F = X = m .
ŝ2Y (n − 1) j=1 (Yj − Ȳ )
2

Let
n

WX = (Xi − X̄)2 /σX
2

i=1

and let
m

WY = (Yj − Ȳ )2 /σY2 .
j=1

Then, WX has a χ2n−1 density and WY has a χ2m−1 density. Hence, by lemma 5.3,

WX /(n − 1) F∗
Q= ≡ 2
WY /(m − 1) λ

has an F density with n − 1 and m − 1 degrees of freedom and so is a pivotal quantity for λ. Suppose that n = 25
and m = 13. Then we can be 95% sure that 0.39 ≤ Q ≤ 3.02 which is equivalent to
< <
F∗ F∗
≤λ≤ .
3.02 0.39
To see how this might work in practice try the following R commands one at a time:

x = rnorm(25, mean = 0, sd = 2)
y = rnorm(13, mean = 1, sd = 1)
Fstar = Var(x)/Var(y); Fstar
CutOffs = qf(p=c(.025,.975), df1=24, df2=12)
CutOffs; rev(CutOffs)
Fstar / rev(CutOffs)
var.test(x, y)

113
CHAPTER 5. CONFIDENCE INTERVALS 5.4. APPROXIMATE CONFIDENCE INTERVALS

The search for a nice pivotal quantity for δ = θX − θX (both variances σX 2


and σY2 unknown
and not known to be equal) continues and is one of the great unsolved problems in statistics,
referred to as the Behrens-Fisher Problem. One reason for its fame is that it can be proven that
there is no exact solution satisfying the classical criteria for good tests. First-best solutions that
are uniformly most powerful and invariant either do not exist or have strange properties. One
needs to look for second-best solutions...

5.4 Approximate confidence intervals


Let X1 , X2 , . . . , Xn be iid with density f (x|θ). Let θ̂ be the MLE of
" θ. We saw before
! !
that the quantities W1 = E(I(θ))(θ̂ − θ), W2 = I(θ)(θ̂ − θ), W3 = E(I(θ̂))(θ̂ − θ), and
"
W4 = I(θ̂)(θ̂ − θ) all had densities which were approximately N (0, 1). Hence they are all
approximate pivotal quantities for θ. W3 and W4 are the simplest to use in general.
"
For W3 the approximate 95% confidence interval is given by [θ̂ − 1.96/ EI(θ̂), θ̂ +
" "
1.96/ EI(θ̂)]. For W4 the approximate 95% confidence interval is given by [θ̂ − 1.96/ I(θ̂), θ̂ +
" " "
1.96/ I(θ̂)]. The quantity 1/ E(I(θ̂)) ( or 1/ I(θ̂)) is often referred to as the approximate
standard error of the MLE θ̂.

Let X1 , X2 , . . . , Xn be iid with density f (x|θ) where θ = (θ1 , θ2 , . . . , θm ) consists of m


unknown parameters. Let θ = (θ̂1 , θ̂2 , . . . , θ̂m ) √
be the MLE of θ. We saw before that for
r = 1, 2, . . . , m the quantities W1r = (θ̂r − θr )/ CRLBr where CRLBr is the lower bound
for Var(θ̂r ) given in the generalisation of the Cramer-Rao theorem had a density which was
approximately N (0, 1). Recall that CRLBr is the rth diagonal element of the matrix [E(I(θ))]−1 .
In certain cases CRLBr may depend on the values of unknown parameters other than θr and in
those cases W1r will not be an approximate pivotal quantity for θr .
We also saw that if we define W2r by replacing CRLBr by the rth diagonal element of the
matrix [I(θ)]−1 , W3r by replacing CRLBr by the rth diagonal element of the matrix [EI(θ̂)]−1
and W4r by replacing CRLBr by the rth diagonal element of the matrix [I(θ̂)]−1 we get three
more quantities all of whom have a density which is approximately N (0, 1). W3r and W4r only
depend on the unknown parameter θr and so are approximate pivotal quantities for θr . However
in certain cases the rth diagonal element of the matrix [I(θ)]−1 may depend on the values of
unknown parameters other than θr and in those cases W2r will not be an approximate pivotal
quantity for θr . Generally W3r and W4r are most commonly used.
We now examine the use of approximate pivotal quantities based on the MLE in a series of
examples :

Example 5.12 (Poisson sampling continued).


n
Recall that θ̂ = x̄ and I(θ) = i=1 xi /θ2 = nθ̂/θ2 with E[I(θ)] = n/θ. Hence E[I(θ̂)] = I(θ̂) = n/θ̂ and the usual
approximate 95% confidence interval is given by
⎡   ⎤
⎣ θ̂ − 1.96 θ̂ , θ̂ + 1.96 θ̂ ⎦ .
n n

114
CHAPTER 5. CONFIDENCE INTERVALS 5.5. BOOTSTRAP CONFIDENCE INTERVALS

Example 5.13 (Bernoulli trials continued).


Recall that θ̂ = x̄ and n n
i=1 xi n − i=1 xi
I(θ) = +
θ2 (1 − θ)2
with
n
E[I(θ)] = .
θ(1 − θ)
Hence
n
E[I(θ̂)] = I(θ̂) =
θ̂(1 − θ̂)
and the usual approximate 95% confidence interval is given by
 
θ̂(1 − θ̂) θ̂(1 − θ̂)
[ θ̂ − 1.96 , θ̂ + 1.96 ].
n n

5.5 Bootstrap confidence intervals


The Bootstrap is a Monte-Carlo method which uses (computer) simulation in lieu of mathematical
theory. It is not necessarily simpler. Exercises with the bootstrap are mostly numerical although
the underlying theory follows much of the analytical methods.

5.5.1 The empirical cumulative distribution function


An important tool in non-parametric statistics is the empirical cumulative distribution function
1
(acronym ecdf) which uses the ordered data as quantiles and probabilities are steps of (n+1) . We
have used the word empirical for this plot because it uses only the information in the sample.
Recall that the values for cumulative area that are associated with each datum are determined
by the following argument: The sample as collected is denoted by x1 , x2 , . . . , xn . The subscript
represents the position in the list or row in the data file. Bracketed subscripts denote ordering
of the data, where x(1) is the smallest, x(2) is the second smallest, x(n) is the largest. In general
xi is not the same datum x(i) but of course this correspondence could happen. The n sample
points are considered to divide the sampling interval into (n + 1) subintervals,

(0, x(1) ), (x(1) , x(2) ), (x(2) , x(3) ), . . . , (x(n−1) , x(n) ), (x(n) , ∞)

The total area under the density curve (area = 1) has been subdivided into (n + 1) subregions
1
with individual areas approximated as (n+1) . The values of the cumulative area under the density
curve is then approximated as

Interval (0, x1 ) (x1 , x2 ) ... (x(n−1) , xn ) (xn , ∞)


0 1 n
Cumulative area (n+1) (n+1)
... (n+1)
1

A diagram of this is shown in Fig. 5.5.1.


k
If there are tied data, say k of them, the step size is (n+1) . Recall the notion of quantiles
and quantile-quantile (Q-Q) plots discussed in Chapter 3.

The following data are the numbers of typhoons in the North Pacific Ocean over 88 years
and assume that they are saved in a file called TYPHOON.txt

115
CHAPTER 5. CONFIDENCE INTERVALS 5.5. BOOTSTRAP CONFIDENCE INTERVALS

Figure 5.5.1: The ecdf is a step function with step size 1


(n+1) between data points.

13 7 14 20 13 12 12 15 20 17 11 14 16 12 17 17
16 7 14 15 16 20 17 20 15 22 26 25 27 18 23 26 18 15 20 24
19 25 23 20 24 20 16 21 20 18 20 18 24 27 27 21 21 22 28 38
39 27 26 32 19 33 23 38 30 30 27 25 33 34 16 17 22 17 26 21
30 30 31 27 43 40 28 31 24 15 22 31

A plot of the ecdf shown in Figure 5.5.2 and was generated with the following R code:

typhoons <- scan("TYPHOON.txt")


plot(ecdf(typhoons),las=1)

Figure 5.5.2: An empirical distribution function for typhoon data.

The reason why we introduced the concept of an empirical cumulative distribution function
is that empirical cumulative distributions are used when calculating bootstrap probabilities.
Here is an example. Suppose that in Example 5.5, the data were

116
CHAPTER 5. CONFIDENCE INTERVALS 5.5. BOOTSTRAP CONFIDENCE INTERVALS

8 6 5 10 8 12 9 9 8 11 7 3 6 7 5 8 10 7 8 8 10 8 5 10 8 6 10 6 8 14
Denote this sample by x1 , x2 , . . . , xn where n = 30. The summary statistics are

n
xi = 240 X = 8
i=1

We shall use this example to illustrate (a) resampling, and (b) the bootstrap distribution.
The sample, x1 , x2 , . . . , xn , are independently and identically distributed (i.i.d.) as Poisson(λ)
which means that each observation is as important as any other for providing information about
the population from which this sample is drawn. That infers we can replace any number by one
of the others and the “new” sample will still convey the same information about the population.
This is also demonstrated in Figure 5.5.3. Three “new” samples have been generated by taking
samples of size n = 30 with replacement from x. The ecdf of x is shown in bold and the ecdf’s
of the “new” samples are shown with different line types. We observe that there is little change
in the empirical distributions or estimates of quantiles. If a statistic (e.g. a quantile) were
estimated from this process a large number of times, it would be a reliable estimate of the
population parameter. The “new” samples are termed bootstrap samples.

Figure 5.5.3: Resampling with replacement from original sample.

The bootstrap procedure for the CI for λ in the current example then goes as follows:
1. Nominate the number of bootstrap samples that will be drawn, e.g. nBS=99.

2. Sample with replacement from x a bootstrap sample of size n, x1 .

3. For each bootstrap sample, calculate the statistic of interest, λ1 .

4. Repeat steps 2 and 3 nBS times.

5. Use the empirical cumulative distribution function of λ1 , λ2 , . . . , λ



nBS to get the Confidence
Interval.

117
CHAPTER 5. CONFIDENCE INTERVALS 5.5. BOOTSTRAP CONFIDENCE INTERVALS

Figure 5.5.4: Deriving the 95% CI from the ecdf of bootstrap estimates of the mean

This is shown in Fig. 5.5.4.


The bootstrap estimate of the 95% CI for λ is (7.22, 8.73). Note that although there is a
great deal of statistical theory underpinning this (the ecdf, i.i.d., order statistics, etc.), there is
no theoretical formula for the CI and it is determined numerically from the sample.

The R code that generated the graph in Figure 5.5.4, further illustrates the bootstrap
approach to construct confidence intervals.

x <- c(8,6,5,10,8,12,9,9,8,11,7,3,6,7,5,8,10,7,8,8,10,8,5,10,8,6,10,6,8,14)
n <- length(x)
nBS <- 99 # number of bootstrap simulations
BS.mean <- numeric(nBS)
i <- 1
while (i < (nBS+1) ){
BS.mean[i] <- mean(sample(x,replace=T,size=n))
i <- i + 1
} # end of the while() loop

Quantiles <- quantile(BS.mean,p = c(0.025,0.975))


cat(" 95\% CI = ",Quantiles,"\n")
plot(ecdf(BS.mean),las=1)

R has several packages that incorporate bootstrap techniques, including the boot package
with functions particularly focusing on bootstrapping. The following code uses the boot package
to construct the same confidence interval as before, but with less effort involved..., although
more effort may be required to fully understand how the ‘boot’ function works...

library(boot)
mnz <- function(z,id){mean(z[id])} # user must supply this
bs.samples <- boot(data=x,statistic=mnz,R=99)
boot.ci(bs.samples,conf=0.95,type=c("perc","bca"))

118
CHAPTER 5. CONFIDENCE INTERVALS 5.5. BOOTSTRAP CONFIDENCE INTERVALS

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS


Based on 99 bootstrap replicates
CALL :
boot.ci(boot.out = bs.samples, conf = 0.95)
Intervals :
Level Percentile BCa
95% ( 7.206, 8.882 ) ( 7.106, 8.751 )

It seems that the user must supply a function (e.g. mnz here) to generate the bootstrap samples.
The variable id is recogised by R as a vector 1:length(z) so that it can draw the samples.

119

You might also like