Cramer-Rao
I will build up the Cramer-Rao lower bound in a series of steps, showing the general
formulas for each concept and then examples using the normal distribution and the
binomial distribution.
The Score
For the estimator of the mean of the normal distribution, the parameter θ is the mean, μ.
The likelihood (which is just the joint normal probability density of the data) is:

$$L(\mu \mid x_1,\dots,x_n,\sigma) = \frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^{n}} \exp\left(-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu)^{2}\right)$$

and the log-likelihood is:

$$l(\mu \mid x_1,\dots,x_n,\sigma) = -n\log\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu)^{2}.$$
The score is the derivative of the log-likelihood with respect to the parameter. I will use the letter g for the score.

$$g(\mu) = \frac{\partial l}{\partial\mu} = \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu) = \frac{n}{\sigma^{2}}\left(\bar{x}-\mu\right)$$
Notice that the score is a linear function of the sample mean, and is directly proportional
to the difference between the sample mean and the population mean. Let’s look at what
the score means for some set of data. If the data points are on average below the true
population mean, then the score is negative. The score is a derivative with respect to the
mean, so the negative score implies that the log-likelihood decreases as the mean
increases. This makes sense – we are moving the mean farther away from the data,
making the probability lower. On the other hand, if the data points are on average above
the population mean, the score is positive. The log-likelihood increases as the population
mean increases, which brings it closer to the data.
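As a quick numerical illustration of this sign behavior, here is a minimal sketch; the sample values and the assumed μ and σ² below are made up for demonstration.

```python
# Sign of the normal score: (1/sigma^2) * sum(x_i - mu)
def normal_score(xs, mu, sigma2):
    return sum(x - mu for x in xs) / sigma2

mu, sigma2 = 10.0, 4.0                   # assumed population mean and variance
below = [8.1, 9.4, 8.7, 9.9, 8.5]        # data on average below mu
above = [10.8, 11.3, 10.2, 12.0, 10.9]   # data on average above mu

print(normal_score(below, mu, sigma2))   # negative: log-likelihood falls as mu increases
print(normal_score(above, mu, sigma2))   # positive: log-likelihood rises as mu increases
```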
The binomial probability for sample size n and population probability π is:
$$P(X = k) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k}.$$
The log-likelihood is:

$$l(\pi; k) = \log\left(\frac{n!}{k!\,(n-k)!}\right) + k\log\pi + (n-k)\log(1-\pi),$$

and the score is:

$$g(\pi; k) = \frac{\partial l}{\partial\pi} = \frac{k}{\pi} - \frac{(n-k)}{1-\pi}.$$
You should confirm that if k / n is lower than π, the score will be negative. If k / n is
higher than π, the score will be positive.
Here are the score values for n = 10 and π = 0.35 (the same case used in the tables below):

k     score
0     -15.3846
1     -10.9890
2      -6.5934
3      -2.1978
4       2.1978
5       6.5934
6      10.9890
7      15.3846
8      19.7802
9      24.1758
10     28.5714
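The score column can be reproduced with a couple of lines of code; this is a minimal sketch, assuming the same n = 10 and π = 0.35 as in the table.

```python
# Binomial score g(pi; k) = k/pi - (n - k)/(1 - pi), evaluated at k = 0, 1, ..., n
n, pi = 10, 0.35

for k in range(n + 1):
    score = k / pi - (n - k) / (1 - pi)
    print(k, round(score, 4))    # reproduces the score column above
```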
You will notice that the sign is as predicted, and also that the values of the score are
farther away from 0 when the data is far away from the population parameter. You might
imagine the score as a measure of “force” that the data is exerting on the population
parameter, trying to bring it more in line with the data. A negative score is trying to
lower the population parameter, and a positive score is trying to raise it. (With a negative
score, lowering the population parameter will raise the probability.)
You can see that the score takes on both positive and negative values for different values
of the data. In fact, the expectation of the score, taken over all possible data values, is exactly 0:

$$E\left(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right) = 0$$
Proof:
Recall that the derivative of a logarithm is of the form

$$\frac{d}{dx}\log\big(g(x)\big) = \frac{1}{g(x)}\,\frac{d}{dx}g(x).$$

So

$$\frac{\partial}{\partial\theta}\log f(x\mid\theta) = \frac{1}{f(x\mid\theta)}\,\frac{\partial}{\partial\theta}f(x\mid\theta).$$
Taking expectations:
$$E\left(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right) = E\left(\frac{1}{f(x\mid\theta)}\,\frac{\partial}{\partial\theta}f(x\mid\theta)\right)$$
To take an expectation, we integrate or sum over all possible values for all the
observations, weighted by the probability or the probability density.
$$E\left(\frac{1}{f(x\mid\theta)}\,\frac{\partial}{\partial\theta}f(x\mid\theta)\right) = \int \left(\frac{1}{f(x\mid\theta)}\,\frac{\partial}{\partial\theta}f(x\mid\theta)\right) f(x\mid\theta)\,dx = \int \left(\frac{\partial}{\partial\theta}f(x\mid\theta)\right) dx$$

or

$$E\left(\frac{1}{f(x\mid\theta)}\,\frac{\partial}{\partial\theta}f(x\mid\theta)\right) = \sum_{\text{all }x\text{'s}} \left(\frac{1}{f(x\mid\theta)}\,\frac{\partial}{\partial\theta}f(x\mid\theta)\right) f(x\mid\theta) = \sum_{\text{all }x\text{'s}} \left(\frac{\partial}{\partial\theta}f(x\mid\theta)\right)$$
Notice how using the log-likelihood, and then taking the derivative, allows us to “cancel
out” the probability or probability density. This is a reason why the log-likelihood is of
theoretical as well as practical interest. Also notice that even if f is a discrete probability,
and so the x’s only take on discrete values, typically the parameter θ will be continuous,
so we can take the derivative.
$$\int \left(\frac{\partial}{\partial\theta}f(x\mid\theta)\right) dx = \frac{\partial}{\partial\theta}\int f(x\mid\theta)\,dx = \frac{\partial}{\partial\theta}(1) = 0$$

or

$$\sum_{\text{all }x\text{'s}} \left(\frac{\partial}{\partial\theta}f(x\mid\theta)\right) = \frac{\partial}{\partial\theta}\sum_{\text{all }x\text{'s}} f(x\mid\theta) = \frac{\partial}{\partial\theta}(1) = 0.$$

QED.
This probably looks like pulling a rabbit out of a hat, so I’ll give a couple of examples.
$$g(\mu) = \frac{\partial l}{\partial\mu} = \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu).$$

$$E\big(g(\mu)\big) = E\left[\frac{1}{\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu)\right] = 0, \text{ since each } x_i \text{ has expectation } \mu.$$
Intuitively, each xi has a symmetric distribution around μ, so the positive and negative
contributions to the score have to exactly average out to 0.
For the binomial example, this can be seen most easily by using a numeric table. To within the precision of
computer arithmetic, the expectation (the sum of score_k × prob_k) is 0.
Given n = 10, π = 0.35:

k     likelihood     score       likelihood*score
0     0.013462743    -15.3846    -0.2071
1     0.072491695    -10.9890    -0.7966
2     0.175652953     -6.5934    -1.1582
3     0.252219625     -2.1978    -0.5543
4     0.237668493      2.1978     0.52235
5     0.153570411      6.5934     1.01255
6     0.0689098       10.9890     0.75725
7     0.021203015     15.3846     0.3262
8     0.004281378     19.7802     0.08469
9     0.000512302     24.1758     0.01239
10    2.75855E-05     28.5714     0.00079
                                Sum:  -8E-16
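The table calculation can be reproduced directly in code; this is a minimal sketch using the same n and π, with the binomial probabilities computed via math.comb.

```python
from math import comb

n, pi = 10, 0.35

expected_score = sum(
    comb(n, k) * pi**k * (1 - pi)**(n - k)      # binomial probability of k
    * (k / pi - (n - k) / (1 - pi))             # score at k
    for k in range(n + 1)
)
print(expected_score)   # ~0, on the order of 1e-15 (floating-point rounding)
```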
I described the score as a measure of “force” that the data is exerting on the population
parameter, trying to bring it more in line with the data. Alternatively, it could be seen as
a measure of force that the population parameter is exerting on the data, or more
symmetrically, as the force driving them to be close to each other. To switch metaphors,
we can also think of it as a measure of “information” that the data is providing about the
parameter. The stronger the force, the closer the data is likely to be to the parameter, and
the more information about the parameter we are getting from the data.
Let’s take a look at the score function for the normal distribution again:
$$g(\mu; x) = \frac{\partial l}{\partial\mu} = \frac{n}{\sigma^{2}}\left(\bar{x}-\mu\right)$$
As described above, the farther apart the sample mean and the population mean, the
greater the absolute value of the score. Also, the score is proportional to n/σ². That
says that the score goes up as the number of data points goes up, and goes up as the
variance (of the x values) goes down.
The values of the score will get more extreme as the observed data is farther away from
the population parameter. We would like some measure of that “force” that doesn’t
depend on the specific data points, but is still in some sense a measure of how strongly
the data is pulled toward the population parameter. We can’t just take an expectation
(which would reflect the information from all possible outcomes), because we have seen
the expectation is 0. Instead, we take the expectation of the square. This expectation is
called the Fisher Information.
$$I(\theta) = E\Big(\big(g(\theta;x)\big)^{2}\Big) = E\left(\left(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right)^{2}\right) = \mathrm{Var}\left(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right)$$
Note that the expectation of the square is equal to the variance, because the expectation of
the score is 0. Also note that the Fisher information is an expectation taken over possible
data values, so it is not a function of the x values, but it may be a function of θ.
$$I = E\Big(\big(g(\mu;x)\big)^{2}\Big) = \frac{1}{\left(\sigma^{2}\right)^{2}}\sum_{i=1}^{n}E\big((x_i-\mu)^{2}\big) = \frac{n\sigma^{2}}{\left(\sigma^{2}\right)^{2}} = \frac{n}{\sigma^{2}}$$
Notice that for the normal distribution, the information is not a function of the mean, but
it is a function of the variance. Also notice that the information is directly proportional to
the number of data points, and inversely proportional to the variance. Hopefully, this
makes sense for something called “information.” If we have more data points, the data
provides more information about the population mean. If the variance of the values is
higher, each data point provides less information about the population mean.
For the binomial, it turns out that the information is $\dfrac{n}{\pi(1-\pi)}$. This is derived later. For
now, you can confirm it for the specific case we've been working with,

$$\frac{n}{\pi(1-\pi)} = \frac{10}{0.35 \times 0.65} = 43.956:$$
Given n = 10, π = 0.35:

k     likelihood     score^2     likelihood*score^2
0     0.013462743    236.686     3.18645
1     0.072491695    120.758     8.75398
2     0.175652953     43.473     7.63616
3     0.252219625      4.83033   1.21831
4     0.237668493      4.83033   1.14802
5     0.153570411     43.473     6.67617
6     0.0689098      120.758     8.32143
7     0.021203015    236.686     5.01847
8     0.004281378    391.257     1.67512
9     0.000512302    584.470     0.29943
10    2.75855E-05    816.327     0.02252
                              Sum:  43.956
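The same kind of weighted sum, now using the squared score, reproduces the information; a minimal sketch:

```python
from math import comb

n, pi = 10, 0.35

fisher_info = sum(
    comb(n, k) * pi**k * (1 - pi)**(n - k)      # binomial probability of k
    * (k / pi - (n - k) / (1 - pi)) ** 2        # squared score at k
    for k in range(n + 1)
)
print(fisher_info)              # ~43.956
print(n / (pi * (1 - pi)))      # 43.956..., matching n / (pi * (1 - pi))
```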
Again, the information is proportional to the sample size n, and is inversely proportional
to the variance (recall that the variance of a Bernoulli variable with probability π is π(1-
π)).
The Cramer-Rao bound says that for any unbiased estimator of a population parameter,
the lowest possible variance is 1/I, where I is the Fisher information. There are some
technical regularity conditions, but these hold for the usual probability distributions and
estimators that we use in statistics.1 Before proving this, I will use the same two
examples.

1 There are also versions of the Cramer-Rao lower bound for estimators that are not unbiased. I won't
cover those here.
From above:
$$I = \frac{n}{\sigma^{2}}, \qquad \frac{1}{I} = \frac{\sigma^{2}}{n}$$
Notice that 1/I is exactly the variance of the sample mean. This tells us that the sample
mean is the “best” unbiased estimator, i.e. there can be no unbiased estimator with a
lower variance.
From above, the information is $\dfrac{n}{\pi(1-\pi)}$. The Cramer-Rao lower bound is $\dfrac{\pi(1-\pi)}{n}$.
This is the variance of a binomial proportion (the binomial count k divided by n). So if
we take the sample proportion k / n as the natural estimator of π, its variance matches the
Cramer-Rao lower bound.
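A small simulation makes both comparisons concrete; this is a minimal sketch, and the parameter values, sample sizes, and number of replications below are arbitrary choices for illustration.

```python
import random

random.seed(1)
reps = 100_000

# Normal: variance of the sample mean compared with the bound sigma^2 / n
mu, sigma, n = 5.0, 2.0, 25
sample_means = [sum(random.gauss(mu, sigma) for _ in range(n)) / n for _ in range(reps)]
var_of_mean = sum((m - mu) ** 2 for m in sample_means) / reps
print(var_of_mean, sigma**2 / n)            # both close to 0.16

# Binomial: variance of the sample proportion k/n compared with pi * (1 - pi) / n
pi, n = 0.35, 10
proportions = [sum(random.random() < pi for _ in range(n)) / n for _ in range(reps)]
var_of_prop = sum((p - pi) ** 2 for p in proportions) / reps
print(var_of_prop, pi * (1 - pi) / n)       # both close to 0.02275
```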
Proof
The proof is very clever, but it’s another one of those “pulling a rabbit out of a hat”
derivations. Recall that the score, being a function of the data as well as the parameter, is
a random variable. Also, the estimator, being a function of the data, is a random variable.
I pointed out that for our two examples, the score is positive when the “data” is higher
than the parameter. To be more precise, if we use x̄ as an estimator for μ, then the score
is positive when x̄ > μ, and if we use k / n as an estimator for π, then the score is
positive when k / n > π. That implies that the score is positively correlated with these
estimators. That’s the best I can come up with as a motivation for the trick used in the
proof.
The trick is, we look at the covariance between the score and the estimator. Consider any
estimator of the parameter θ. It is a function of the data, and I will call it W(x). I will
refer to the score as g ( x;θ ) in this derivation, to emphasize the fact that it is a function
of the data. Recall that E ( g ( x;θ ) ) = 0 .
1) for any two random variables X and Y, if either random variable has 0 expectation,
then Cov ( X , Y ) = E ( XY )
2) the correlation between two random variables is

$$\rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$

and $\left|\rho(X,Y)\right| \le 1$.
Combining these two facts,

$$\mathrm{Var}(X) \ge \big(\rho(X,Y)\big)^{2}\,\mathrm{Var}(X) = \frac{\big(\mathrm{Cov}(X,Y)\big)^{2}}{\mathrm{Var}(Y)}.$$

Letting X be W(x) and Y be g(x;θ) (which has expectation 0), and using fact 1,

$$\mathrm{Var}\big(W(x)\big) \ge \frac{\Big(E\big(W(x)\,g(x;\theta)\big)\Big)^{2}}{\mathrm{Var}\big(g(x;\theta)\big)}.$$
To evaluate the expectation in the numerator, we use a technique similar to the one used
to show that the expectation of the score was 0. I will write this out using an integral, but
a similar derivation can be done with discrete random variables using summations.
$$E\big(W(x)\,g(x;\theta)\big) = E\left(W(x)\,\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right) \quad \text{(using the definition of } g)$$

$$= E\left(\frac{W(x)}{f(x\mid\theta)}\,\frac{\partial}{\partial\theta}f(x\mid\theta)\right) \quad \text{(derivative of a log)}$$

$$= \int \left(\frac{W(x)}{f(x\mid\theta)}\,\frac{\partial}{\partial\theta}f(x\mid\theta)\right) f(x\mid\theta)\,dx \quad \text{(definition of expectation as an integral)}$$

$$= \int W(x)\,\frac{\partial}{\partial\theta}f(x\mid\theta)\,dx \quad \text{(cancelling } f(x\mid\theta))$$
Now W(x) is not (directly) a function of θ, so we can move it inside the derivative.
Under “regularity conditions”, we can also take the derivative outside the integral giving
$$\int W(x)\,\frac{\partial}{\partial\theta}f(x\mid\theta)\,dx = \frac{\partial}{\partial\theta}\int W(x)\,f(x\mid\theta)\,dx = \frac{\partial}{\partial\theta}E\big(W(x)\big).$$

Since W(x) is an unbiased estimator of θ,

$$\frac{\partial}{\partial\theta}E\big(W(x)\big) = \frac{\partial}{\partial\theta}\,\theta = 1.$$
Putting it all together,

$$\mathrm{Var}\big(W(x)\big) \ge \frac{\Big(E\big(W(x)\,g(x;\theta)\big)\Big)^{2}}{\mathrm{Var}\big(g(x;\theta)\big)} = \frac{1}{\mathrm{Var}\big(g(x;\theta)\big)}.$$

Since the variance of the score is the Fisher information I(θ), this says that Var(W(x)) ≥ 1/I(θ), which is the Cramer-Rao lower bound.
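The key quantity E(W(x) g(x;θ)) can be checked numerically for the binomial example, taking W = k/n as the unbiased estimator of π; a minimal sketch, reusing n = 10 and π = 0.35:

```python
from math import comb

n, pi = 10, 0.35

expected_W_times_score = sum(
    comb(n, k) * pi**k * (1 - pi)**(n - k)      # probability of observing k
    * (k / n)                                   # the estimator W = k / n
    * (k / pi - (n - k) / (1 - pi))             # the score at k
    for k in range(n + 1)
)
print(expected_W_times_score)   # 1.0, up to floating-point rounding
```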
There is a second, often more convenient, expression for the variance of the score:

$$\mathrm{Var}\big(g(x;\theta)\big) = -E\left(\frac{\partial^{2}}{\partial\theta^{2}}\log\big(f(x;\theta)\big)\right).$$
This is a remarkable result. It says that the variance of the score is the negative of the expected
curvature of the log-likelihood function. I will put the proof of this at the end. For now, I will
demonstrate with our two examples.
For the normal distribution,

$$g(\mu; x) = \frac{\partial}{\partial\mu}\log\big(f(x;\mu)\big) = \frac{n}{\sigma^{2}}\left(\bar{x}-\mu\right).$$

The second derivative is

$$\frac{\partial^{2}}{\partial\mu^{2}}\log\big(f(x;\mu)\big) = -\frac{n}{\sigma^{2}}.$$

In this case, the second derivative is not a function of the data, so its expectation is
simply $-\dfrac{n}{\sigma^{2}}$, and

$$-E\left(\frac{\partial^{2}}{\partial\mu^{2}}\log\big(f(x;\mu)\big)\right) = \frac{n}{\sigma^{2}}.$$
For the binomial distribution,

$$\frac{\partial^{2}}{\partial\pi^{2}}\log f(\pi; k) = -\frac{k}{\pi^{2}} - \frac{(n-k)}{(1-\pi)^{2}}$$

$$-E\left(\frac{\partial^{2}}{\partial\pi^{2}}\log f(\pi; k)\right) = E\left(\frac{k}{\pi^{2}}\right) + E\left(\frac{(n-k)}{(1-\pi)^{2}}\right) = \frac{n\pi}{\pi^{2}} + \frac{n(1-\pi)}{(1-\pi)^{2}} = n\left(\frac{1}{\pi} + \frac{1}{1-\pi}\right) = \frac{n}{\pi(1-\pi)}$$
This is the formula that was stated but not derived earlier.
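This expectation calculation can also be checked symbolically; here is a minimal sketch using sympy, relying on the fact that the second derivative is linear in k, so substituting E(k) = nπ gives its exact expectation.

```python
import sympy as sp

k, n, p = sp.symbols('k n pi', positive=True)

# Binomial log-likelihood, dropping the combinatorial term (it does not involve pi)
loglik = k * sp.log(p) + (n - k) * sp.log(1 - p)

second_deriv = sp.diff(loglik, p, 2)                 # -k/pi^2 - (n - k)/(1 - pi)^2
expected_info = sp.simplify(-second_deriv.subs(k, n * p))   # expectation: replace k by E(k) = n*pi

print(expected_info)   # simplifies to n/(pi*(1 - pi)), possibly written in an equivalent form
```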
Notice that for the normal distribution, we didn’t need to calculate an expectation when
we used the second derivative formula for the Fisher information, although if we don’t
know σ2, we will need to estimate it. Also notice that, although we calculated an
expectation for the binomial distribution, it turns out that if we plugged in k = n π in the
second derivative, we would get the same formula. This suggests that we might calculate
$n/\hat{\sigma}^{2}$ as an estimate for the information in the normal distribution, and $n/\big(\hat{p}(1-\hat{p})\big)$ for the
binomial. This calculation is called the observed information. The formula based on
taking expectations is called the expected information. Given the observed information,
we can calculate an estimated variance without taking an expectation and without
knowing the true population parameter. For the normal, the estimated variance is $\hat{\sigma}^{2}/n$;
for the binomial, the estimated variance is $\hat{p}(1-\hat{p})/n$.
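As a small illustration of using the observed information to get an estimated variance and standard error, here is a minimal sketch for the binomial (Bernoulli) case; the sample below is made up.

```python
from math import sqrt

# Hypothetical Bernoulli sample: 1 = success, 0 = failure
data = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

n = len(data)
p_hat = sum(data) / n                       # maximum likelihood estimate of pi

observed_info = n / (p_hat * (1 - p_hat))   # observed information n / (p_hat * (1 - p_hat))
est_variance = 1 / observed_info            # estimated variance p_hat * (1 - p_hat) / n
std_error = sqrt(est_variance)              # standard error reported with the estimate

print(p_hat, est_variance, std_error)       # 0.4, 0.024, ~0.155
```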
When there is more than one population parameter (for example, the β coefficients in
multiple regression), there is a form of the Cramer-Rao bound that applies to all the
parameters. In this case, we need to look at the variance-covariance matrix of all the
estimates, and the Cramer-Rao bound is a matrix. There is a way in which the magnitude
of matrices can be compared,2 and the Cramer-Rao bound is the smallest possible
variance-covariance matrix. In this case, the Fisher information is called the information
matrix. As described in the previous paragraph, we can have both an expected
information matrix and an observed information matrix.

2 A matrix A is "greater than" a matrix B if A-B is a positive definite matrix, and is "greater than or equal
to" B if A-B is non-negative definite.
In addition to its theoretical importance, the Cramer-Rao bound plays a role in maximum
likelihood estimation. It turns out that the maximum likelihood estimator is consistent
even in cases where it is not unbiased. It is also asymptotically efficient, meaning that as
the sample size gets larger and larger, its variance is as low as possible, i.e. it meets the
Cramer-Rao lower bound. Although this is only an asymptotic result, in practice the
variance of the maximum likelihood estimator is calculated by using the Cramer-Rao
formula (the observed information matrix, in particular), so when standard errors are
reported for a maximum likelihood estimate, they are based on the Cramer-Rao formula.
We need to prove:

$$\mathrm{Var}\big(g(x;\theta)\big) = -E\left(\frac{\partial^{2}}{\partial\theta^{2}}\log\big(f(x;\theta)\big)\right)$$
(To reduce the amount of notation, I will use l for the log-likelihood, f for the probability
density, and g for the score, without writing in the dependence on x and θ.) Looking at
the second derivative of the log-likelihood:
$$\frac{\partial^{2}l}{\partial\theta^{2}} = \frac{\partial}{\partial\theta}\frac{\partial l}{\partial\theta} = \frac{\partial}{\partial\theta}\left(\frac{1}{f}\frac{\partial f}{\partial\theta}\right) = \left[\frac{\partial}{\partial\theta}\left(\frac{1}{f}\right)\right]\frac{\partial f}{\partial\theta} + \frac{1}{f}\left[\frac{\partial}{\partial\theta}\left(\frac{\partial f}{\partial\theta}\right)\right] \quad \text{(product rule)}$$

$$= \left[\left(\frac{-1}{f^{2}}\right)\frac{\partial f}{\partial\theta}\right]\frac{\partial f}{\partial\theta} + \frac{1}{f}\left(\frac{\partial^{2} f}{\partial\theta^{2}}\right) \quad \text{(derivative formulas)}$$

$$= -\left[\left(\frac{1}{f}\right)\frac{\partial f}{\partial\theta}\right]^{2} + \frac{1}{f}\left(\frac{\partial^{2} f}{\partial\theta^{2}}\right) \quad \text{(algebra)}$$

$$= -g^{2} + \frac{1}{f}\left(\frac{\partial^{2} f}{\partial\theta^{2}}\right) \quad \text{(definition of score)}$$
Taking expectations,
$$-E\left(\frac{\partial^{2}l}{\partial\theta^{2}}\right) = E\big(g^{2}\big) - E\left[\frac{1}{f}\left(\frac{\partial^{2} f}{\partial\theta^{2}}\right)\right]$$
We can prove the equivalence of the two expressions if we can show that the last
expectation is 0. Using the same trick as before,

$$E\left[\frac{1}{f}\left(\frac{\partial^{2} f}{\partial\theta^{2}}\right)\right] = \int \left[\frac{1}{f}\left(\frac{\partial^{2} f}{\partial\theta^{2}}\right)\right] f\,dx = \int \left(\frac{\partial^{2} f}{\partial\theta^{2}}\right) dx = \frac{\partial^{2}}{\partial\theta^{2}}\int f\,dx = \frac{\partial^{2}}{\partial\theta^{2}}(1) = 0.$$

Since $E\big(g^{2}\big) = \mathrm{Var}(g)$ (the expectation of the score is 0), this completes the proof.