
CS 57800 Statistical Machine Learning, Prof. Anuran Makur, Spring 2022

Lecture 1.4 Minimax Estimation: Cramér-Rao Bounds


Anuran Makur

Note: These notes are for your personal educational use only. Please do not distribute them.

1 Introduction
In this section, we consider the problem of non-Bayesian estimation under the mean squared error (MSE)
criterion. Suppose we are given the parametrized family of likelihoods {Pθ : θ ∈ [0, 1]} over R, where each
likelihood Pθ is a probability density function (PDF) with respect to the Lebesgue measure. We will assume
that the following regularity conditions hold:

1. For all θ ∈ [0, 1], Pθ has support R, i.e., it is strictly positive everywhere.

2. For all x ∈ R, the map [0, 1] ∋ θ ↦ P_θ(x) is continuously differentiable.

In a non-Bayesian setting, the parameter θ ∈ [0, 1] is unknown and deterministic. However, for convenience,
we interpret each Pθ as the conditional probability distribution of an observation random variable X given
that the true parameter value is θ. Our goal is to infer the unknown parameter θ from the observation X.
Since we do not have a prior distribution on the parameter space, we employ a minimax formulation and
try to find an estimator θ̂ : R → [0, 1], θ̂(X) of θ based on X that minimizes the worst-case MSE:
\[
R_m \triangleq \inf_{\hat{\theta}:\,\mathbb{R}\to[0,1]}\ \sup_{\theta\in[0,1]}\ \underbrace{\mathbb{E}_\theta\!\left[\big(\hat{\theta}(X)-\theta\big)^2\right]}_{\text{MSE risk}} , \qquad (1)
\]

where the infimum is over all (possibly randomized) estimators θ̂ which do not depend on θ, Eθ [·] denotes the
expectation with respect to the likelihood Pθ (as well as any randomness in the estimator), and Rm is known
as the minimax MSE risk. Although minimax risk can be defined using more general loss functions, we focus
on the MSE case because it is perhaps the simplest and most popular choice, and Cramér-Rao bounds are
most suited to analyzing MSE risk. To show that an estimator θ̂ is minimax optimal, we typically analyze
the extremal MSE risk of the estimator to obtain an upper bound on (1), and then derive a lower bound on
(1) that matches this upper bound (usually up to constants if there are multiple i.i.d. observations). This
section develops the standard tools used to analyze MSE risk (1).
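To make the minimax formulation in (1) concrete, the following Python sketch numerically approximates the worst-case MSE of one particular (and generally suboptimal) estimator over a grid of parameter values. The Gaussian likelihood P_θ = N(θ, σ²), the noise level σ, and the clipping estimator are illustrative assumptions only, not part of the general setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                            # assumed noise level (illustrative)
thetas = np.linspace(0.0, 1.0, 101)    # grid over the parameter space [0, 1]

def theta_hat(x):
    """A simple estimator: clip the raw observation X to the parameter space [0, 1]."""
    return np.clip(x, 0.0, 1.0)

worst_mse = 0.0
for theta in thetas:
    x = rng.normal(theta, sigma, size=100_000)     # draws from P_theta = N(theta, sigma^2)
    mse = np.mean((theta_hat(x) - theta) ** 2)     # Monte Carlo estimate of E_theta[(theta_hat - theta)^2]
    worst_mse = max(worst_mse, mse)

print(f"approximate sup_theta MSE of the clipped estimator: {worst_mse:.4f}")
```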

2 Fisher Information
The most famous notion of “information” in classical statistics is arguably Fisher information, which turns
out to be crucial in lower bounding minimax MSE risk. To introduce this concept, we need some preliminaries.
For any x ∈ R, define the score function:


\[
\forall \theta \in [0,1], \qquad S(\theta; x) \triangleq \frac{\partial}{\partial\theta} \log(P_\theta(x)) , \qquad (2)
\]
where log(·) denotes the natural logarithm. The score function captures how the log-likelihood of the
observation X = x fluctuates with small perturbations of the parameter. Moreover, we will assume the
additional regularity conditions below:

3. The Lipschitz constant of [0, 1] ∋ θ ↦ P_θ(x) is Lebesgue integrable, i.e.,
\[
\int_{-\infty}^{+\infty} \max_{\phi\in[0,1]} \left| \frac{\partial}{\partial\theta} P_\phi(x) \right| dx < +\infty . \qquad (3)
\]


In particular, this implies via the dominated convergence theorem that for all θ ∈ [0, 1], the score
function has zero mean:
\[
\mathbb{E}_\theta[S(\theta; X)] = \mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta} \log(P_\theta(X))\right] = 0 . \qquad (4)
\]
This can be proved in a manner analogous to the argument for (13) in the second proof of Theorem 2.
4. For all θ ∈ [0, 1], we have
\[
\mathbb{E}_\theta\!\left[\max_{\phi\in[0,1]} \left(\frac{\frac{\partial}{\partial\theta} P_\phi(X)}{P_\theta(X)}\right)^{\!2}\right] < +\infty . \qquad (5)
\]

The next definition formally presents Fisher information, cf. [1, Section 8].
Definition 1 (Fisher Information). Given the family of likelihoods {P_θ : θ ∈ [0, 1]}, the Fisher information
in X about θ is the variance of the score function:
\[
\forall \theta \in [0,1], \qquad J_X(\theta) \triangleq \mathbb{E}_\theta\big[S(\theta; X)^2\big] = \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta} \log(P_\theta(X))\right)^{\!2}\right] = \mathbb{E}_\theta\!\left[\left(\frac{\frac{\partial}{\partial\theta} P_\theta(X)}{P_\theta(X)}\right)^{\!2}\right],
\]
where X ∼ P_θ, and we often drop the subscript and write J(θ) = J_X(θ).
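For a concrete feel for Definition 1, the sketch below estimates J(θ) by Monte Carlo in an assumed Gaussian location family P_θ = N(θ, σ²), whose score is S(θ; x) = (x − θ)/σ² and whose exact Fisher information is 1/σ²; the family and parameter values are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, theta = 0.5, 0.3                # illustrative parameter values

def score(theta, x, sigma):
    """Score S(theta; x) = d/dtheta log N(x; theta, sigma^2) = (x - theta) / sigma^2."""
    return (x - theta) / sigma**2

x = rng.normal(theta, sigma, size=200_000)       # X ~ P_theta
J_hat = np.mean(score(theta, x, sigma) ** 2)     # E_theta[S(theta; X)^2]

print(f"Monte Carlo J(theta) ~ {J_hat:.3f}  (exact value 1/sigma^2 = {1/sigma**2:.3f})")
```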
Intuitively, Fisher information indicates how useful the observation X is to estimate the unknown param-
eter θ. To better understand this, the next proposition shows that Fisher information captures the curvature
of χ2 -divergence. Recall that for any two PDFs P and Q with support R, their χ2 -divergence is defined as
\[
\chi^2(P \,\|\, Q) = \int_{-\infty}^{+\infty} \frac{(P(x) - Q(x))^2}{Q(x)}\, dx . \qquad (6)
\]

Proposition 1 (Local Approximation of χ²-Divergence). For any θ, θ′ ∈ [0, 1] with θ ≠ θ′, we have
\[
\chi^2(P_{\theta'} \,\|\, P_\theta) = J(\theta)(\theta' - \theta)^2 + o\big((\theta' - \theta)^2\big)
\]
as θ′ → θ, where o(·) denotes the asymptotic little-o notation.


Proof. We follow the proof in [2]. Observe that
\[
\begin{aligned}
\chi^2(P_{\theta'} \,\|\, P_\theta) &= \int_{-\infty}^{+\infty} \frac{(P_{\theta'}(x) - P_\theta(x))^2}{P_\theta(x)}\, dx \\
&= \int_{-\infty}^{+\infty} \left(\int_\theta^{\theta'} \frac{\partial}{\partial\theta} P_t(x)\, dt\right)^{\!2} \frac{1}{P_\theta(x)}\, dx \\
&= \int_{-\infty}^{+\infty} (\theta' - \theta)^2 \left(\int_0^1 \frac{\partial}{\partial\theta} P_{(\theta' - \theta)t + \theta}(x)\, dt\right)^{\!2} \frac{1}{P_\theta(x)}\, dx \\
&= (\theta' - \theta)^2 \int_{-\infty}^{+\infty} \int_0^1 \int_0^1 \frac{\frac{\partial}{\partial\theta} P_{(\theta' - \theta)\tau + \theta}(x)\, \frac{\partial}{\partial\theta} P_{(\theta' - \theta)t + \theta}(x)}{P_\theta(x)}\, dt\, d\tau\, dx ,
\end{aligned}
\]
where the second equality follows from regularity condition 2 in Section 1 and the fundamental theorem of
calculus, and the third equality follows from a change of variables. Since, by the continuity in regularity
condition 2, we have
\[
\lim_{\theta' \to \theta} \frac{\partial}{\partial\theta} P_{(\theta' - \theta)\tau + \theta}(x)\, \frac{\partial}{\partial\theta} P_{(\theta' - \theta)t + \theta}(x) = \left(\frac{\partial}{\partial\theta} P_\theta(x)\right)^{\!2}
\]
for every x ∈ R and t, τ ∈ [0, 1], the dominated convergence theorem yields
\[
\lim_{\theta' \to \theta} \int_{-\infty}^{+\infty} \int_0^1 \int_0^1 \frac{\frac{\partial}{\partial\theta} P_{(\theta' - \theta)\tau + \theta}(x)\, \frac{\partial}{\partial\theta} P_{(\theta' - \theta)t + \theta}(x)}{P_\theta(x)}\, dt\, d\tau\, dx = J(\theta) ,
\]
where we use the regularity condition in (5) and Definition 1. This completes the proof.
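As a quick numerical sanity check of Proposition 1 (under the illustrative assumption of a unit-variance Gaussian location family, for which J(θ) = 1), one can compute χ²(P_{θ'} ‖ P_θ) by quadrature and compare it with J(θ)(θ' − θ)² as θ' → θ.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

theta = 0.5                                            # illustrative base parameter
for gap in [0.2, 0.1, 0.05, 0.01]:                     # gap = theta' - theta
    p = lambda x, g=gap: norm.pdf(x, loc=theta + g)    # P_{theta'}, unit variance
    q = lambda x: norm.pdf(x, loc=theta)               # P_theta, unit variance
    chi2, _ = quad(lambda x: (p(x) - q(x)) ** 2 / q(x), -np.inf, np.inf)
    # J(theta) = 1 for the unit-variance Gaussian location family
    print(f"theta'-theta = {gap:5.2f}:  chi^2 = {chi2:.6f},  J(theta)*(theta'-theta)^2 = {gap**2:.6f}")
```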


We remark that Proposition 1 can be appropriately generalized for other f -divergences. This result shows
that when the Fisher information in X about θ is large, the “distance” between any two likelihoods of X in
the neighborhood of θ is also large. So, these likelihoods are easier to distinguish statistically. In this sense,
the observation X carries more “information” about the parameter θ. The next proposition presents two
other important properties of Fisher information, cf. [3], [4, Section 7.2].

Proposition 2 (Properties of Fisher Information). The following are true:

1. (Data Processing Inequality) Consider any Markov kernel defined by a measurable function PY |X :
R × R → [0, ∞), where y ↦ P_{Y|X}(y|x) is a PDF for every x ∈ R. Construct the parametrized family
of likelihoods {Qθ : θ ∈ [0, 1]} of a random variable Y such that for every θ ∈ [0, 1]:
\[
\forall y \in \mathbb{R}, \qquad Q_\theta(y) = \int_{-\infty}^{+\infty} P_{Y|X}(y|x)\, P_\theta(x)\, dx .
\]

Assume further that {Qθ : θ ∈ [0, 1]} satisfies regularity conditions 1 and 2 in Section 1 and (5). Then,
we have for all θ ∈ [0, 1],
\[
J_Y(\theta) = \int_{-\infty}^{+\infty} \frac{\left(\frac{\partial}{\partial\theta} Q_\theta(y)\right)^{2}}{Q_\theta(y)}\, dy \leq J_X(\theta) .
\]

2. (Tensorization) Suppose the random variables X1 , . . . , Xn are drawn i.i.d. from the likelihood Pθ with
θ ∈ [0, 1]. Then, we have for all θ ∈ [0, 1],

\[
J_{X_1, \ldots, X_n}(\theta) = n\, J(\theta) .
\]

Proof.
Part 1: Although more direct proofs of this result exist with milder conditions, cf. [3], we prove this in
a manner that illustrates how the data processing inequality for Fisher information is inherited from that
of χ²-divergence. Indeed, for any θ, θ′ ∈ [0, 1] with θ ≠ θ′, the data processing inequality for χ²-divergence
immediately yields
\[
\chi^2(Q_{\theta'} \,\|\, Q_\theta) \leq \chi^2(P_{\theta'} \,\|\, P_\theta) .
\]
Using Proposition 1, this implies that
\[
J_Y(\theta)(\theta' - \theta)^2 + o\big((\theta' - \theta)^2\big) \leq J_X(\theta)(\theta' - \theta)^2 + o\big((\theta' - \theta)^2\big) .
\]
Dividing by (θ′ − θ)² and letting θ′ → θ produces the desired inequality.


Part 2: From Definition 1, we have
\[
\begin{aligned}
J_{X_1, \ldots, X_n}(\theta) &= \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta} \log\!\left(\prod_{i=1}^n P_\theta(X_i)\right)\right)^{\!2}\right] \\
&= \mathbb{E}\!\left[\left(\sum_{i=1}^n \frac{\partial}{\partial\theta} \log(P_\theta(X_i))\right)^{\!2}\right] \\
&= \sum_{i=1}^n \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta} \log(P_\theta(X_i))\right)^{\!2}\right] + \sum_{i \neq j} \mathbb{E}\!\left[\frac{\partial}{\partial\theta} \log(P_\theta(X_i))\right] \mathbb{E}\!\left[\frac{\partial}{\partial\theta} \log(P_\theta(X_j))\right] \\
&= n\, J(\theta) ,
\end{aligned}
\]
where the expectations are with respect to the product measure P_θ ⊗ · · · ⊗ P_θ of X_1, . . . , X_n, and the final
equality follows from the regularity condition in (4).
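The tensorization property can also be checked by simulation: in an assumed Gaussian location family (illustrative only), the empirical second moment of the joint score Σ_i S(θ; X_i) grows linearly with the number of i.i.d. observations.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma = 0.3, 1.0              # illustrative; per-sample Fisher information J(theta) = 1/sigma^2 = 1

for n in [1, 5, 10, 50]:
    x = rng.normal(theta, sigma, size=(100_000, n))          # n i.i.d. observations per trial
    joint_score = np.sum((x - theta) / sigma**2, axis=1)     # sum of the per-sample scores
    J_joint = np.mean(joint_score ** 2)                      # E[(sum_i S(theta; X_i))^2]
    print(f"n = {n:2d}:  J_(X1,...,Xn)(theta) ~ {J_joint:6.3f}   (theory: n * J(theta) = {float(n):.3f})")
```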


3 Hammersley-Chapman-Robbins Bound
We now turn to presenting a generalization of the Cramér-Rao bound known as the Hammersley-Chapman-
Robbins bound. To prove this, we will require the following variational characterization of χ2 -divergence,
cf. [4, Section 6.1].
Proposition 3 (Variational Characterization of χ2 -Divergence). For any pair of PDFs P and Q with support
R such that χ2 (P ||Q) < +∞, we have
\[
\chi^2(P \,\|\, Q) = \sup_{g:\,\mathbb{R}\to\mathbb{R}}\ \mathbb{E}_P[g(X)] - \frac{1}{4}\mathbb{E}_Q\big[g(X)^2\big] - 1 ,
\]
where EP [·] denotes expectation with respect to the PDF P , and the supremum is over all measurable functions
g : R → R such that EP [g(X)] and EQ [g(X)2 ] are well-defined and finite.
Proof. Recall that χ²-divergence is an f-divergence defined by the convex function f : R → R, f(t) = t² − 1.
Indeed, for any PDFs P and Q with support R, we have
\[
\begin{aligned}
\chi^2(P \,\|\, Q) &= \int_{-\infty}^{+\infty} \frac{(P(x) - Q(x))^2}{Q(x)}\, dx \\
&= \int_{-\infty}^{+\infty} \frac{P(x)^2}{Q(x)} - 2 P(x) + Q(x)\, dx \\
&= \int_{-\infty}^{+\infty} \frac{P(x)^2}{Q(x)}\, dx - 1 \\
&= \mathbb{E}_Q\!\left[f\!\left(\frac{P(X)}{Q(X)}\right)\right] .
\end{aligned}
\]
The Legendre-Fenchel transformation (or convex conjugate) of f is the function f* : R → R given by:
\[
\forall s \in \mathbb{R}, \qquad f^*(s) = \sup_{t \in \mathbb{R}}\ ts - f(t) = \sup_{t \in \mathbb{R}}\ ts - t^2 + 1 = \frac{s^2}{4} + 1 ,
\]
where the last equality follows from maximizing the quadratic function t ↦ ts − t² + 1, which happens at
t = s/2. It is clear that f and f* satisfy the Young-Fenchel inequality:
\[
\forall t, s \in \mathbb{R}, \qquad f(t) \geq st - f^*(s) = st - \frac{s^2}{4} - 1 .
\]
Hence, for any PDFs P and Q with support R such that χ²(P‖Q) < +∞, and any measurable function
g : R → R satisfying the conditions of the proposition statement, we have
\[
f\!\left(\frac{P(x)}{Q(x)}\right) \geq g(x)\, \frac{P(x)}{Q(x)} - \frac{g(x)^2}{4} - 1
\]
for all x ∈ R, by setting t = P(x)/Q(x) and s = g(x). Taking expectations of both sides of this inequality with
respect to Q, and then taking the supremum over all valid g, we obtain
\[
\chi^2(P \,\|\, Q) \geq \sup_{g:\,\mathbb{R}\to\mathbb{R}}\ \mathbb{E}_P[g(X)] - \frac{1}{4}\mathbb{E}_Q\big[g(X)^2\big] - 1 ,
\]
where the first term on the right hand side experiences a change of measure. It remains to prove that this
inequality is an equality. To this end, let g(x) = 2 P(x)/Q(x). Then, we get
\[
2\, \mathbb{E}_P\!\left[\frac{P(X)}{Q(X)}\right] - \mathbb{E}_Q\!\left[\left(\frac{P(X)}{Q(X)}\right)^{\!2}\right] - 1 = \chi^2(P \,\|\, Q)
\]
via change of measure. This completes the proof.
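The sketch below illustrates Proposition 3 numerically for two assumed Gaussian PDFs (an illustrative choice): the objective E_P[g(X)] − (1/4)E_Q[g(X)²] − 1 evaluated at the optimal witness g = 2P/Q recovers χ²(P‖Q), while a hand-picked affine witness only yields a lower bound, which is exactly the restriction exploited in the proof of the Hammersley-Chapman-Robbins bound below.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

P = lambda x: norm.pdf(x, loc=0.4, scale=1.0)      # illustrative PDFs P and Q
Q = lambda x: norm.pdf(x, loc=0.0, scale=1.0)

chi2, _ = quad(lambda x: (P(x) - Q(x)) ** 2 / Q(x), -np.inf, np.inf)

def objective(g):
    """E_P[g(X)] - E_Q[g(X)^2]/4 - 1, the quantity maximized in Proposition 3."""
    term_P, _ = quad(lambda x: g(x) * P(x), -np.inf, np.inf)
    term_Q, _ = quad(lambda x: g(x) ** 2 * Q(x), -np.inf, np.inf)
    return term_P - term_Q / 4.0 - 1.0

g_opt = lambda x: 2.0 * P(x) / Q(x)                # the optimal witness g = 2 P / Q
g_affine = lambda x: 0.8 * x + 2.0                 # a hand-picked affine witness

print(f"chi^2(P||Q)            = {chi2:.6f}")
print(f"objective at g = 2P/Q  = {objective(g_opt):.6f}   (attains chi^2)")
print(f"objective at affine g  = {objective(g_affine):.6f}   (only a lower bound)")
```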


It is worth mentioning that this proposition actually holds for general probability measures P and Q
on standard Borel spaces. Moreover, this strategy of using the Legendre-Fenchel transformation of the
convex function that defines an f -divergence can be used to derive variational characterizations of all f -
divergences. In fact, in the special case of TV distance, the resulting variational characterization is precisely
the Kantorovich-Rubinstein dual characterization.
We next establish the Hammersley-Chapman-Robbins bound, which lower bounds the MSE risk of any
estimator θ̂(X) of θ, as introduced in Section 1. We provide two proofs of this result; the first utilizes
Proposition 3 and sheds light on how the variational characterization of χ2 -divergence is weakened to obtain
the Hammersley-Chapman-Robbins bound, while the second is the standard (and potentially less insightful)
proof via the Cauchy-Schwarz inequality.

Theorem 1 (Hammersley-Chapman-Robbins Bound). For any (possibly randomized) estimator θ̂ : R → [0, 1]
of the parameter θ ∈ [0, 1], its MSE risk is lower bounded by
\[
\mathbb{E}_\theta\big[(\hat{\theta}(X) - \theta)^2\big] \geq \mathrm{var}_\theta\big(\hat{\theta}(X)\big) \geq \sup_{\theta' \in [0,1] \setminus \{\theta\}} \frac{\left(\mathbb{E}_{\theta'}[\hat{\theta}(X)] - \mathbb{E}_\theta[\hat{\theta}(X)]\right)^2}{\chi^2(P_{\theta'} \,\|\, P_\theta)}
\]
for all θ ∈ [0, 1], where var_θ(·) denotes the variance with respect to P_θ (and any randomness in the estimator).

Proof 1. The first inequality is the standard fact that for any square integrable random variable Y ∈ R,
\[
\mathrm{var}(Y) = \min_{C \in \mathbb{R}} \mathbb{E}\big[(Y - C)^2\big] . \qquad (7)
\]

So, it suffices to establish the second inequality. We use the proof outline in [4, Section 6.2].
To this end, we first specialize Proposition 3 by restricting its supremum to affine functions g(x) = ax + b
with a, b ∈ R. So, for any pair of distinct probability measures P and Q over R with finite second moments,
we obtain
\[
\begin{aligned}
\chi^2(P \,\|\, Q) &\geq \sup_{a, b \in \mathbb{R}}\ \mathbb{E}_P[aX + b] - \frac{1}{4}\mathbb{E}_Q\big[(aX + b)^2\big] - 1 \\
&= \sup_{a, b \in \mathbb{R}}\ a\, \mathbb{E}_P[X] - \frac{a^2}{4}\mathbb{E}_Q\big[X^2\big] - \frac{ab}{2}\mathbb{E}_Q[X] + b - \frac{b^2}{4} - 1 \\
&= \sup_{v \in \mathbb{R}^2} \phi(v) , \qquad (8)
\end{aligned}
\]
where we let v = [a\ b]^T, and we define the function φ : R² → R as
\[
\phi(v) = -\frac{1}{4}\, v^T \underbrace{\begin{bmatrix} \mathbb{E}_Q[X^2] & \mathbb{E}_Q[X] \\ \mathbb{E}_Q[X] & 1 \end{bmatrix}}_{= M} v + \big[\mathbb{E}_P[X] \ \ 1\big]\, v - 1 .
\]
We assume without loss of generality that var_Q(X) > 0, where the variance is computed with respect to Q.
(Indeed, if var_Q(X) = 0, then Q is a Dirac measure. So, P is not absolutely continuous with respect to Q, and
we must have χ²(P‖Q) = +∞, in which case (9) holds trivially.) Since E_Q[X²] ≥ det(M) = var_Q(X) > 0,
all leading principal minors of M are strictly positive. Hence, M is positive definite, which implies that φ
is a strictly concave function. This means that to determine the unique global maximum of φ, it suffices to
find a stationary point of φ. So, we compute its gradient and set it equal to 0 to obtain
\[
\frac{1}{2} \begin{bmatrix} \mathbb{E}_Q[X^2] & \mathbb{E}_Q[X] \\ \mathbb{E}_Q[X] & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \mathbb{E}_P[X] \\ 1 \end{bmatrix} ,
\]
which we solve to get
\[
v^* = \begin{bmatrix} a^* \\ b^* \end{bmatrix} = \frac{2}{\mathrm{var}_Q(X)} \begin{bmatrix} \mathbb{E}_P[X] - \mathbb{E}_Q[X] \\ \mathbb{E}_Q[X^2] - \mathbb{E}_Q[X]\, \mathbb{E}_P[X] \end{bmatrix} .
\]


Plugging this stationary point back into (8), we get χ²(P‖Q) ≥ max_{v∈R²} φ(v) = φ(v*), or equivalently,
\[
\chi^2(P \,\|\, Q) \geq \frac{\big(\mathbb{E}_P[X] - \mathbb{E}_Q[X]\big)^2}{\mathrm{var}_Q(X)} , \qquad (9)
\]
after some tedious but simple calculations to simplify φ(v*).


Next, consider any (possibly randomized) estimator θ̂ : R → [0, 1], which defines a Markov kernel from
the observation X to the random variable θ̂ = θ̂(X). Then, for any θ, θ′ ∈ [0, 1] with θ ≠ θ′, we have
\[
\chi^2(P_{\theta'} \,\|\, P_\theta) \geq \chi^2\big(Q'_{\hat{\theta}} \,\big\|\, Q_{\hat{\theta}}\big) \geq \frac{\left(\mathbb{E}_{\theta'}[\hat{\theta}(X)] - \mathbb{E}_\theta[\hat{\theta}(X)]\right)^2}{\mathrm{var}_\theta\big(\hat{\theta}(X)\big)} ,
\]
where Q_θ̂ and Q′_θ̂ are the output probability distributions of θ̂ after the likelihoods P_θ and P_θ′ are pushed
forward through the estimator kernel, the first inequality follows from the data processing inequality for
χ²-divergence, and the second inequality follows from (9). (Note that means and variances of θ̂(X) are
well-defined and finite since θ̂(X) ∈ [0, 1].) Rearranging this inequality and taking the supremum over all
θ′ ≠ θ produces the desired result.

Proof 2. We next present the more classical proof in the setting of deterministic estimators (for simplicity).
As before, it suffices to prove the second inequality. Fix any deterministic estimator θ̂ : R → [0, 1]. Then,
for any θ, θ′ ∈ [0, 1] with θ ≠ θ′, we get
\[
\begin{aligned}
\left(\mathbb{E}_{\theta'}[\hat{\theta}(X)] - \mathbb{E}_\theta[\hat{\theta}(X)]\right)^2 &= \inf_{C \in \mathbb{R}} \left(\mathbb{E}_{\theta'}\big[\hat{\theta}(X) - C\big] - \mathbb{E}_\theta\big[\hat{\theta}(X) - C\big]\right)^2 \\
&= \inf_{C \in \mathbb{R}} \left(\int_{-\infty}^{+\infty} \big(\hat{\theta}(x) - C\big)\sqrt{P_\theta(x)}\, \frac{P_{\theta'}(x) - P_\theta(x)}{\sqrt{P_\theta(x)}}\, dx\right)^{\!2} \\
&\leq \inf_{C \in \mathbb{R}} \left(\int_{-\infty}^{+\infty} \big(\hat{\theta}(x) - C\big)^2 P_\theta(x)\, dx\right) \left(\int_{-\infty}^{+\infty} \frac{(P_{\theta'}(x) - P_\theta(x))^2}{P_\theta(x)}\, dx\right) \\
&= \min_{C \in \mathbb{R}} \mathbb{E}_\theta\big[(\hat{\theta}(X) - C)^2\big]\, \chi^2(P_{\theta'} \,\|\, P_\theta) \\
&= \mathrm{var}_\theta\big(\hat{\theta}(X)\big)\, \chi^2(P_{\theta'} \,\|\, P_\theta) ,
\end{aligned}
\]
where the first equality holds because the difference of expectations is invariant to translations by a constant
C ∈ R, the inequality follows from the Cauchy-Schwarz inequality, and the last equality follows from (7).
Rearranging this inequality and taking the supremum over all θ′ ≠ θ produces the desired result.
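To see Theorem 1 in a simple closed-form setting, consider the illustrative (and not part of the bounded-parameter setup above) Gaussian location family P_θ = N(θ, σ²) with the identity estimator θ̂(X) = X, for which E_{θ'}[θ̂] − E_θ[θ̂] = θ' − θ, var_θ(θ̂) = σ², and χ²(P_{θ'}‖P_θ) = exp((θ' − θ)²/σ²) − 1. The sketch below evaluates the Hammersley-Chapman-Robbins ratio for several θ' and shows that it approaches σ² as θ' → θ, so the bound is tight here.

```python
import numpy as np

# Illustrative closed-form setting: P_theta = N(theta, sigma^2), estimator theta_hat(X) = X.
sigma = 1.0
gaps = np.array([0.5, 0.2, 0.1, 0.01])                 # candidate values of theta' - theta

hcr = gaps**2 / (np.exp(gaps**2 / sigma**2) - 1.0)     # (E_theta'[X] - E_theta[X])^2 / chi^2(P_theta' || P_theta)
print("HCR ratio for each theta':", np.round(hcr, 4))
print("variance of the estimator:", sigma**2)
# The ratio increases toward sigma^2 as theta' -> theta, so the supremum over theta'
# equals var_theta(theta_hat) and the Hammersley-Chapman-Robbins bound is tight here.
```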

4 Cramér-Rao Bound
We are now in a position to derive the renowned Cramér-Rao bound (or information bound ), cf. [1, Section
8.3], [4, Section 6.3]. Recall that an estimator θ̂ : R → [0, 1] of the parameter θ based on the observation X
is said to be unbiased if it satisfies the condition:
\[
\forall \theta \in [0,1], \qquad \mathbb{E}_\theta\big[\hat{\theta}(X)\big] = \theta . \qquad (10)
\]

The Cramér-Rao bound is a lower bound on the MSE risk of such unbiased estimators in terms of Fisher
information. We will again present two proofs of this result; the first will demonstrate how it is a corollary
of the Hammersley-Chapman-Robbins bound, and the second is the more classical proof.


Theorem 2 (Cramér-Rao Bound). For every (possibly randomized) unbiased estimator θ̂ : R → [0, 1] and
every parameter θ ∈ [0, 1], we have
\[
\mathbb{E}_\theta\big[(\hat{\theta}(X) - \theta)^2\big] = \mathrm{var}_\theta\big(\hat{\theta}(X)\big) \geq \frac{1}{J(\theta)} .
\]
Proof 1. The equality is obvious since the estimators under consideration are unbiased. So, it suffices to
prove the inequality; we use the proof idea in [4, Section 6.3]. For any (possibly randomized) unbiased
estimator θ̂ : R → [0, 1] and any θ ∈ [0, 1], Theorem 1 and (10) give
\[
\mathrm{var}_\theta\big(\hat{\theta}(X)\big) \geq \sup_{\theta' \in [0,1] \setminus \{\theta\}} \frac{(\theta' - \theta)^2}{\chi^2(P_{\theta'} \,\|\, P_\theta)} \geq \lim_{\theta' \to \theta} \frac{(\theta' - \theta)^2}{\chi^2(P_{\theta'} \,\|\, P_\theta)} . \qquad (11)
\]
Now, using Proposition 1, we get
\[
\mathrm{var}_\theta\big(\hat{\theta}(X)\big) \geq \lim_{\theta' \to \theta} \frac{(\theta' - \theta)^2}{J(\theta)(\theta' - \theta)^2 + o\big((\theta' - \theta)^2\big)} = \frac{1}{J(\theta)} ,
\]
which completes the proof.
Proof 2. As before, we also show the classical proof in the setting of deterministic estimators, cf. [1, Section
8.3]. For any unbiased estimator θ̂ : R → [0, 1] and any θ ∈ [0, 1], we have
\[
\begin{aligned}
\mathbb{E}_\theta\!\left[\big(\hat{\theta}(X) - \theta\big)\frac{\partial}{\partial\theta} \log(P_\theta(X))\right] &= \mathbb{E}_\theta\!\left[\hat{\theta}(X)\, \frac{\partial}{\partial\theta} \log(P_\theta(X))\right] - \theta\, \mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta} \log(P_\theta(X))\right] \\
&= \mathbb{E}_\theta\!\left[\hat{\theta}(X)\, \frac{\partial}{\partial\theta} \log(P_\theta(X))\right] \\
&= \int_{-\infty}^{+\infty} \hat{\theta}(x)\, \frac{\partial}{\partial\theta} P_\theta(x)\, dx , \qquad (12)
\end{aligned}
\]
where the second equality follows from the regularity condition in (4). We next prove that
\[
\int_{-\infty}^{+\infty} \hat{\theta}(x)\, \frac{\partial}{\partial\theta} P_\theta(x)\, dx = \frac{\partial}{\partial\theta} \int_{-\infty}^{+\infty} \hat{\theta}(x)\, P_\theta(x)\, dx . \qquad (13)
\]
To see this, for any θ′ ∈ [0, 1]\{θ} and any x ∈ R, notice that
\[
\left| \frac{\hat{\theta}(x) P_{\theta'}(x) - \hat{\theta}(x) P_\theta(x)}{\theta' - \theta} \right| \leq \left| \frac{P_{\theta'}(x) - P_\theta(x)}{\theta' - \theta} \right| \leq \max_{\phi\in[0,1]} \left| \frac{\partial}{\partial\theta} P_\phi(x) \right| ,
\]
where we use the fact that |θ̂(x)| ≤ 1. Since we have assumed the regularity condition in (3), we must have
\[
\begin{aligned}
\int_{-\infty}^{+\infty} \hat{\theta}(x)\, \frac{\partial}{\partial\theta} P_\theta(x)\, dx &= \int_{-\infty}^{+\infty} \lim_{\theta' \to \theta} \frac{\hat{\theta}(x) P_{\theta'}(x) - \hat{\theta}(x) P_\theta(x)}{\theta' - \theta}\, dx \\
&= \lim_{\theta' \to \theta} \int_{-\infty}^{+\infty} \frac{\hat{\theta}(x) P_{\theta'}(x) - \hat{\theta}(x) P_\theta(x)}{\theta' - \theta}\, dx = \frac{\partial}{\partial\theta} \int_{-\infty}^{+\infty} \hat{\theta}(x)\, P_\theta(x)\, dx
\end{aligned}
\]
by the dominated convergence theorem. Therefore, combining (13) with (12), we get
\[
\mathbb{E}_\theta\!\left[\big(\hat{\theta}(X) - \theta\big)\frac{\partial}{\partial\theta} \log(P_\theta(X))\right] = \frac{\partial}{\partial\theta}\, \mathbb{E}_\theta\big[\hat{\theta}(X)\big] = 1 ,
\]
where the last equality uses (10). Finally, using the Cauchy-Schwarz inequality, we have
\[
1 = \mathbb{E}_\theta\!\left[\big(\hat{\theta}(X) - \theta\big)\frac{\partial}{\partial\theta} \log(P_\theta(X))\right]^2 \leq \mathrm{var}_\theta\big(\hat{\theta}(X)\big)\, J(\theta) .
\]
This completes the proof.
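As a concrete instance of Theorem 2 (with an assumed Gaussian location model rather than the bounded parameter space above), the sample mean of n i.i.d. N(θ, σ²) observations is unbiased with variance σ²/n, which matches the Cramér-Rao bound 1/J_{X_1,...,X_n}(θ) = σ²/n exactly; the short simulation below verifies this numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n = 0.4, 1.0, 20               # illustrative values

x = rng.normal(theta, sigma, size=(200_000, n))
theta_hat = x.mean(axis=1)                   # unbiased sample-mean estimator of theta

print(f"empirical variance of the sample mean: {theta_hat.var():.5f}")
print(f"Cramer-Rao bound sigma^2 / n         : {sigma**2 / n:.5f}")
```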


We make a few relevant remarks at this stage. Firstly, when an unbiased estimator achieves the Cramér-
Rao bound, it is called an efficient estimator in the literature. Moreover, if an efficient estimator exists for
a problem, it is the maximum likelihood estimator (under regularity conditions), cf. [1, Section 8.3]:

\[
\forall x \in \mathbb{R}, \qquad \hat{\theta}_{\mathrm{ML}}(x) = \arg\max_{\theta \in [0,1]}\ P_\theta(x) . \qquad (14)
\]

Secondly, it is known under regularity conditions that an efficient estimator exists in a problem if and only if
the likelihood model {Pθ : θ ∈ [0, 1]} is an exponential family whose natural parameter is proportional to the
integral of Fisher information, cf. [1, Section 9.5]. Furthermore, the efficient estimator is an affine function
of the sufficient statistic of the exponential family. Thirdly, vector versions of the Hammersley-Chapman-
Robbins and Cramér-Rao bounds can be derived in an analogous fashion to the arguments presented here,
cf. [1, Section 8.5]. Finally, we note that the Hammersley-Chapman-Robbins bound is tighter than the
Cramér-Rao bound, as can be seen from (11), and is applicable to a broader range of problems since it makes
milder assumptions. However, it is often more difficult to analytically evaluate compared to the Cramér-Rao
bound. This trade-off between tightness, generality, and ease of computation pervades the literature on lower
bounds on MSE risk, e.g., Barankin bound, Ziv-Zakai bound, Bhattacharyya bound, Weiss-Weinstein bound,
etc. The tightest and most general bounds, such as the Barankin bound, are usually the most difficult to
compute and apply in actual problems.
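As an illustration of (14) (again in an assumed N(θ, 1) model with θ restricted to [0, 1], for demonstration only), a simple grid search over the log-likelihood recovers the sample mean clipped to [0, 1]; in the unrestricted Gaussian location model, the sample mean itself is the classical example of an efficient estimator.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
theta_true, n = 0.3, 50
x = rng.normal(theta_true, 1.0, size=n)                 # illustrative data from N(theta, 1)

grid = np.linspace(0.0, 1.0, 1001)                      # candidate parameter values in [0, 1]
log_lik = np.array([norm.logpdf(x, loc=t).sum() for t in grid])
theta_ml = grid[np.argmax(log_lik)]                     # arg max of the likelihood, as in (14)

print(f"grid-search MLE: {theta_ml:.3f}   clipped sample mean: {np.clip(x.mean(), 0.0, 1.0):.3f}")
```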

5 van Trees Inequality


Although the Cramér-Rao bound is easy to analytically compute, its main drawback is that it applies only to
unbiased estimators. In most non-parametric statistics and machine learning problems, e.g., density estima-
tion or regression, and high-dimensional statistics and machine learning problems, e.g., matrix completion,
the minimax optimal estimators are biased (and in fact, they are carefully chosen to optimize the bias-
variance trade-off). To overcome this limitation, the Bayesian Cramér-Rao bound or van Trees inequality
has been developed in the literature. We now present this inequality and conclude with an illustration of
how it may be used to calculate minimax MSE risk lower bounds.
Consider any prior PDF λ with support on the parameter space [0, 1]. We will assume that this prior
PDF satisfies the following regularity conditions:
5. The function λ : [0, 1] → [0, ∞) is continuously differentiable.
6. λ(0) = λ(1) = 0. (So, strictly speaking, the support of λ is (0, 1) and the closure of the support is
[0, 1].)
For such a PDF λ over R, we can define a location family {λ_φ : φ ∈ R}, where each λ_φ is a PDF given by
translation of λ = λ_0:
\[
\forall t \in \mathbb{R}, \qquad \lambda_\phi(t) = \lambda(t - \phi) . \qquad (15)
\]
According to Definition 1, the Fisher information of this location family is defined as:
\[
J(\lambda) \triangleq \mathbb{E}_{\lambda_\phi}\!\left[\left(\frac{\frac{\partial}{\partial\phi} \lambda_\phi(\boldsymbol{\theta})}{\lambda_\phi(\boldsymbol{\theta})}\right)^{\!2}\right] = \int_\phi^{\phi+1} \frac{\left(\frac{\partial}{\partial\phi} \lambda(t - \phi)\right)^{2}}{\lambda(t - \phi)}\, dt = \int_\phi^{\phi+1} \frac{\lambda'(t - \phi)^2}{\lambda(t - \phi)}\, dt = \int_0^1 \frac{\lambda'(t)^2}{\lambda(t)}\, dt , \qquad (16)
\]
where we use the notation J(λ) since this quantity does not depend on the location parameter φ, λ′ : [0, 1] →
R denotes the derivative of λ, and θ denotes a parameter random variable with distribution λ_φ (it is bolded
to distinguish it from deterministic parameter values θ). We refer to J(λ) as the Fisher information of the
prior PDF λ.
Until now, we have considered estimating the parameter θ from the observation X. We next generalize
this setting slightly and impose another useful regularity condition. Given the family of likelihoods {Pθ : θ ∈
[0, 1]} and X ∼ Pθ , suppose we seek to estimate the unknown, deterministic quantity g(θ) for some function
g : [0, 1] → R satisfying the condition:
7. g : [0, 1] → R is continuously differentiable.


Then, akin to (1) in Section 1, the minimax MSE formulation for this problem is
\[
R_m = \inf_{\hat{g}:\,\mathbb{R}\to\mathbb{R}}\ \sup_{\theta\in[0,1]}\ \mathbb{E}_\theta\big[(\hat{g}(X) - g(\theta))^2\big] , \qquad (17)
\]

where the infimum is over all (possibly randomized) estimators ĝ : R → R of g(θ) based on X. Using the
above regularity condition, the next theorem derives a lower bound on the minimax MSE risk in (17) (and
hence, also (1)) by first lower bounding it with the Bayes risk obtained by using the aforementioned prior
PDF λ on the parameter space [0, 1], cf. [5, Section 2], [4, Sections 6.5, 7.4, and 8.2], [6, Section 2.7.3].
Theorem 3 (van Trees Inequality). For any prior PDF λ on [0, 1] satisfying the regularity conditions 5 and
6 above, the minimax MSE risk is lower bounded by
\[
R_m \geq \underbrace{\inf_{\hat{g}:\,\mathbb{R}\to\mathbb{R}}\ \mathbb{E}\big[(\hat{g}(X) - g(\boldsymbol{\theta}))^2\big]}_{\text{Bayes risk}} \geq \frac{\mathbb{E}[g'(\boldsymbol{\theta})]^2}{\mathbb{E}[J(\boldsymbol{\theta})] + J(\lambda)} ,
\]
where E[·] denotes expectation with respect to the joint law of (θ, X), with θ ∼ λ and X ∼ P_θ given θ = θ,
and any randomness in the estimator, the second term is the Bayes risk when the parameter random variable
θ has prior PDF λ, and g′ : [0, 1] → R denotes the derivative of g. In the special case where g(θ) = θ for all
θ ∈ [0, 1], we have
\[
R_m \geq \inf_{\hat{\theta}:\,\mathbb{R}\to[0,1]}\ \mathbb{E}\big[(\hat{\theta}(X) - \boldsymbol{\theta})^2\big] \geq \frac{1}{\mathbb{E}[J(\boldsymbol{\theta})] + J(\lambda)} .
\]
Proof. First, observe that the first inequality holds because for any (possibly randomized) estimator ĝ : R →
R, we have
\[
\sup_{\theta\in[0,1]} \mathbb{E}_\theta\big[(\hat{g}(X) - g(\theta))^2\big] \geq \int_0^1 \mathbb{E}_\theta\big[(\hat{g}(X) - g(\theta))^2\big]\, \lambda(\theta)\, d\theta = \mathbb{E}\big[(\hat{g}(X) - g(\boldsymbol{\theta}))^2\big] .
\]
Hence, taking the infimum over all estimators ĝ on both sides yields the desired inequality.
To prove the second inequality, assume without loss of generality that E[J (θ)] < +∞, J (λ) < +∞, and
the Bayes risk is finite. Indeed, if any of these quantities is infinite, then the result holds trivially (with
the convention that 1/∞ = 0). Note that by the regularity condition 7, g′ : [0, 1] → R is bounded by the
boundedness theorem, and E[g′(θ)] is well-defined and finite. Furthermore, since Bayes risk can be achieved
by deterministic estimators, we restrict our attention to deterministic estimators ĝ : R → R below. We now
follow the proof in [5, Section 2].
Observe that for all x ∈ R, we have
\[
\int_0^1 \frac{\partial}{\partial\theta}\big(P_\theta(x)\lambda(\theta)\big)\, d\theta = P_1(x)\lambda(1) - P_0(x)\lambda(0) = 0 ,
\]
where we use regularity conditions 2 (from Section 1) and 5 for the first equality, and regularity condition 6
for the second equality. Furthermore, for all x ∈ R, we also have
\[
\int_0^1 g(\theta)\, \frac{\partial}{\partial\theta}\big(P_\theta(x)\lambda(\theta)\big)\, d\theta = \big[g(\theta) P_\theta(x)\lambda(\theta)\big]_0^1 - \int_0^1 g'(\theta) P_\theta(x)\lambda(\theta)\, d\theta = -\int_0^1 g'(\theta) P_\theta(x)\lambda(\theta)\, d\theta ,
\]
using integration by parts and regularity conditions 2, 5, 6, and 7. Hence, for any deterministic estimator
ĝ : R → R, we get
\[
\begin{aligned}
\int_{-\infty}^{+\infty} \int_0^1 (\hat{g}(x) - g(\theta))\, \frac{\partial}{\partial\theta}\big(P_\theta(x)\lambda(\theta)\big)\, d\theta\, dx &= \int_{-\infty}^{+\infty} \hat{g}(x) \int_0^1 \frac{\partial}{\partial\theta}\big(P_\theta(x)\lambda(\theta)\big)\, d\theta\, dx \\
&\qquad - \int_{-\infty}^{+\infty} \int_0^1 g(\theta)\, \frac{\partial}{\partial\theta}\big(P_\theta(x)\lambda(\theta)\big)\, d\theta\, dx \\
&= \int_{-\infty}^{+\infty} \int_0^1 g'(\theta) P_\theta(x)\lambda(\theta)\, d\theta\, dx \\
&= \mathbb{E}[g'(\boldsymbol{\theta})] .
\end{aligned}
\]


Next, notice that
\[
\begin{aligned}
\mathbb{E}[g'(\boldsymbol{\theta})]^2 &= \mathbb{E}\!\left[(\hat{g}(X) - g(\boldsymbol{\theta}))\, \frac{\partial}{\partial\theta} \log(P_{\boldsymbol{\theta}}(X)\lambda(\boldsymbol{\theta}))\right]^2 \\
&\leq \mathbb{E}\big[(\hat{g}(X) - g(\boldsymbol{\theta}))^2\big]\, \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta} \log(P_{\boldsymbol{\theta}}(X)\lambda(\boldsymbol{\theta}))\right)^{\!2}\right] \\
&= \mathbb{E}\big[(\hat{g}(X) - g(\boldsymbol{\theta}))^2\big]\, \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta} \log(P_{\boldsymbol{\theta}}(X)) + \frac{\partial}{\partial\theta} \log(\lambda(\boldsymbol{\theta}))\right)^{\!2}\right] \\
&= \mathbb{E}\big[(\hat{g}(X) - g(\boldsymbol{\theta}))^2\big] \left(\mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta} \log(P_{\boldsymbol{\theta}}(X))\right)^{\!2}\right] + 2\, \mathbb{E}\!\left[\frac{\lambda'(\boldsymbol{\theta})\, \frac{\partial}{\partial\theta} P_{\boldsymbol{\theta}}(X)}{\lambda(\boldsymbol{\theta}) P_{\boldsymbol{\theta}}(X)}\right] + \mathbb{E}\!\left[\left(\frac{\lambda'(\boldsymbol{\theta})}{\lambda(\boldsymbol{\theta})}\right)^{\!2}\right]\right) \\
&= \mathbb{E}\big[(\hat{g}(X) - g(\boldsymbol{\theta}))^2\big] \left(\mathbb{E}[J(\boldsymbol{\theta})] + J(\lambda) + 2\, \mathbb{E}\!\left[\frac{\lambda'(\boldsymbol{\theta})\, \frac{\partial}{\partial\theta} P_{\boldsymbol{\theta}}(X)}{\lambda(\boldsymbol{\theta}) P_{\boldsymbol{\theta}}(X)}\right]\right) \\
&= \mathbb{E}\big[(\hat{g}(X) - g(\boldsymbol{\theta}))^2\big] \big(\mathbb{E}[J(\boldsymbol{\theta})] + J(\lambda)\big) ,
\end{aligned}
\]
where the inequality follows from the Cauchy-Schwarz inequality, the fifth line follows from the tower prop-
erty, Definition 1, and (16), and the last line holds because
\[
\mathbb{E}\!\left[\frac{\lambda'(\boldsymbol{\theta})\, \frac{\partial}{\partial\theta} P_{\boldsymbol{\theta}}(X)}{\lambda(\boldsymbol{\theta}) P_{\boldsymbol{\theta}}(X)}\right] = \int_0^1 \lambda'(\theta) \int_{-\infty}^{+\infty} \frac{\partial}{\partial\theta} P_\theta(x)\, dx\, d\theta = 0 ,
\]
where we use the regularity condition in (4). Rearranging this inequality and taking the infimum over all
estimators ĝ produces the desired result. The special case where g(θ) = θ for all θ ∈ [0, 1] then follows from
direct calculation. This completes the proof.
It is worth mentioning that while the choice of prior PDF λ is arbitrary in Theorem 3, in many problems,
a convenient option turns out to be
\[
\lambda(\theta) = \begin{cases} 2\cos^2\!\big(\pi\big(\theta - \tfrac{1}{2}\big)\big) , & \theta \in [0, 1] , \\ 0 , & \text{otherwise} . \end{cases} \qquad (18)
\]

This choice of prior PDF λ has the smallest Fisher information J (λ) among all PDFs on [0, 1] satisfying
regularity conditions 5 and 6 [6, Section 2.7.3]. Specifically, the Fisher information of (18) is

\[
\begin{aligned}
J(\lambda) &= \int_0^1 \frac{16\pi^2 \sin^2\!\big(\pi\big(t - \tfrac{1}{2}\big)\big) \cos^2\!\big(\pi\big(t - \tfrac{1}{2}\big)\big)}{2\cos^2\!\big(\pi\big(t - \tfrac{1}{2}\big)\big)}\, dt \\
&= \int_0^1 8\pi^2 \sin^2\!\big(\pi\big(t - \tfrac{1}{2}\big)\big)\, dt \\
&= \int_0^1 8\pi^2 \left(1 - \cos^2\!\big(\pi\big(t - \tfrac{1}{2}\big)\big)\right) dt \\
&= 8\pi^2 - 4\pi^2 \int_0^1 2\cos^2\!\big(\pi\big(t - \tfrac{1}{2}\big)\big)\, dt \\
&= 4\pi^2 , \qquad (19)
\end{aligned}
\]
where the first equality follows from (16), and the third equality follows from the standard Pythagorean
identity sin²(t) + cos²(t) = 1 for all t ∈ R.
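A quick numerical integration (purely as a check, with no new assumptions beyond (18)) confirms (19): J(λ) evaluates to 4π² ≈ 39.48.

```python
import numpy as np
from scipy.integrate import quad

lam = lambda t: 2.0 * np.cos(np.pi * (t - 0.5)) ** 2             # prior PDF from (18) on [0, 1]
dlam = lambda t: -2.0 * np.pi * np.sin(2.0 * np.pi * (t - 0.5))  # its derivative lambda'(t)

J_lam, _ = quad(lambda t: dlam(t) ** 2 / lam(t), 0.0, 1.0)       # J(lambda) = int lambda'^2 / lambda
print(f"numerical J(lambda) = {J_lam:.4f},   4*pi^2 = {4.0 * np.pi**2:.4f}")
```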

5.1 Application to Minimax Estimation


Finally, we present an example from [5, Section 2] that demonstrates how to use Theorem 3 to provide
minimax MSE risk guarantees. Consider the Gaussian location model in which several observation random


variables X1 , . . . , Xn are drawn i.i.d. from Pθ = N (θ, 1) for some unknown, deterministic location (or
mean) parameter θ ≥ 0, where N (a, b) denotes a Gaussian PDF with mean a ∈ R and variance b > 0.
Let g : [0, ∞) → [0, ∞) be the map g(θ) = θ^α for some known α ∈ (0, 1). Suppose we seek to estimate
g(θ) from X1 , . . . , Xn under the minimax MSE risk criterion in (17). The next proposition establishes the
minimax MSE rate, i.e., the scaling of minimax MSE risk with n, for this estimation problem. (Although this
problem does not strictly fall within the formal setup of this section, readers may verify that the minimax
MSE estimation setup and associated regularity conditions can be modified so that Theorem 3 continues to
hold for this problem.)

Proposition 4 (Minimax MSE Rate). For the Gaussian location model problem above, we have the following
minimax bounds:
\[
\frac{C(\alpha)}{n^\alpha} \leq \inf_{\hat{g}:\,\mathbb{R}^n \to [0,\infty)}\ \sup_{\theta \geq 0}\ \mathbb{E}_\theta\big[(\hat{g}(X_1, \ldots, X_n) - \theta^\alpha)^2\big] \leq \sup_{\theta \geq 0}\ \mathbb{E}_\theta\!\left[\left(\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^{\!\alpha} - \theta^\alpha\right)^{\!2}\right] \leq \frac{1}{n^\alpha} ,
\]
where C(α) > 0 is a constant that does not depend on n, the infimum is over all (possibly randomized)
estimators of g(θ) = θ^α based on X_1, . . . , X_n, and the estimator that achieves this minimax rate is
g((1/n) Σ_{i=1}^n X_i).

Proof. We first prove the minimax lower bound using Theorem 3. For any θ ≥ 0, we have
\[
J(\theta) = \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta} \log(P_\theta(X))\right)^{\!2}\right] = \mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta} \frac{(X - \theta)^2}{2}\right)^{\!2}\right] = \mathbb{E}_\theta\big[(X - \theta)^2\big] = 1 ,
\]
where the second and fourth equalities hold because P_θ = N(θ, 1). Hence, using part 2 of Proposition 2, we
have:
\[
\forall \theta \geq 0, \qquad J_{X_1, \ldots, X_n}(\theta) = n . \qquad (20)
\]

Fix any prior PDF λ on [0, ∞) satisfying the requisite regularity conditions such that J(λ) < +∞ and
E_λ[g′(θ)] ≠ 0 is finite, where E_λ[·] denotes the expectation when θ ∼ λ. Then, for any γ > 0, consider the
prior PDF λ_γ on [0, ∞) given by:
\[
\forall t \geq 0, \qquad \lambda_\gamma(t) = \frac{1}{\gamma}\, \lambda\!\left(\frac{t}{\gamma}\right) .
\]
Clearly, we have
\[
J(\lambda_\gamma) = \int_0^{+\infty} \frac{\lambda_\gamma'(t)^2}{\lambda_\gamma(t)}\, dt = \int_0^{+\infty} \frac{\frac{1}{\gamma^4}\lambda'\!\big(\frac{t}{\gamma}\big)^2}{\frac{1}{\gamma}\lambda\!\big(\frac{t}{\gamma}\big)}\, dt = \frac{1}{\gamma^2} \int_0^{+\infty} \frac{\lambda'(t)^2}{\lambda(t)}\, dt = \frac{J(\lambda)}{\gamma^2} , \qquad (21)
\]
\[
\mathbb{E}_{\lambda_\gamma}[g'(\boldsymbol{\theta})] = \int_0^{+\infty} \alpha t^{\alpha-1} \lambda_\gamma(t)\, dt = \alpha \int_0^{+\infty} t^{\alpha-1}\, \frac{1}{\gamma}\lambda\!\left(\frac{t}{\gamma}\right) dt = \alpha\gamma^{\alpha-1} \int_0^{+\infty} t^{\alpha-1} \lambda(t)\, dt = \gamma^{\alpha-1}\, \mathbb{E}_\lambda[g'(\boldsymbol{\theta})] , \qquad (22)
\]

where the final equality in the first line uses (16). Substituting (20), (21), and (22) into Theorem 3, we get

\[
\inf_{\hat{g}:\,\mathbb{R}^n \to [0,\infty)}\ \sup_{\theta \geq 0}\ \mathbb{E}_\theta\big[(\hat{g}(X_1, \ldots, X_n) - \theta^\alpha)^2\big] \geq \frac{\gamma^{2\alpha-2}\, \mathbb{E}_\lambda[g'(\boldsymbol{\theta})]^2}{n + \frac{J(\lambda)}{\gamma^2}} = \frac{\gamma^{2\alpha}\, \mathbb{E}_\lambda[g'(\boldsymbol{\theta})]^2}{n\gamma^2 + J(\lambda)} .
\]


This implies that
\[
\begin{aligned}
\inf_{\hat{g}:\,\mathbb{R}^n \to [0,\infty)}\ \sup_{\theta \geq 0}\ \mathbb{E}_\theta\big[(\hat{g}(X_1, \ldots, X_n) - \theta^\alpha)^2\big] &\geq \sup_{\gamma > 0} \frac{\gamma^{2\alpha}\, \mathbb{E}_\lambda[g'(\boldsymbol{\theta})]^2}{n\gamma^2 + J(\lambda)} \\
&= \frac{\left(\frac{\alpha J(\lambda)}{n(1-\alpha)}\right)^{\!\alpha} \mathbb{E}_\lambda[g'(\boldsymbol{\theta})]^2}{\frac{\alpha J(\lambda)}{1-\alpha} + J(\lambda)} \\
&= \frac{\alpha^\alpha (1-\alpha)^{1-\alpha}\, \mathbb{E}_\lambda[g'(\boldsymbol{\theta})]^2}{n^\alpha\, J(\lambda)^{1-\alpha}} \\
&= \frac{C(\alpha)}{n^\alpha} ,
\end{aligned}
\]
where the second equality follows from a straightforward exercise in calculus which shows that the supremum
is achieved at γ = \sqrt{\alpha J(\lambda)/(n(1-\alpha))}, and we set the constant C(α) as follows in the final equality:
\[
C(\alpha) = \frac{\alpha^\alpha (1-\alpha)^{1-\alpha}\, \mathbb{E}_\lambda[g'(\boldsymbol{\theta})]^2}{J(\lambda)^{1-\alpha}} > 0 .
\]
Lastly, we prove the minimax upper bound by analyzing the intuitively reasonable deterministic estimator
\[
g\!\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^{\!\alpha} .
\]
Since θ ≥ 0 and α ∈ (0, 1), observe via the c_r-inequality that
\[
\left| \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^{\!\alpha} - \theta^\alpha \right| = \left| \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^{\!\alpha} - |\theta|^\alpha \right| \leq \left| \frac{1}{n}\sum_{i=1}^n X_i - \theta \right|^\alpha .
\]
(This follows from the fact that the map 0 ≤ x ↦ x^α is concave.) Therefore, we obtain
\[
\begin{aligned}
\sup_{\theta \geq 0}\ \mathbb{E}_\theta\!\left[\left(\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^{\!\alpha} - \theta^\alpha\right)^{\!2}\right] &\leq \sup_{\theta \geq 0}\ \mathbb{E}_\theta\!\left[\left| \frac{1}{n}\sum_{i=1}^n X_i - \theta \right|^{2\alpha}\right] \\
&\leq \sup_{\theta \geq 0}\ \mathbb{E}_\theta\!\left[\left(\frac{1}{n}\sum_{i=1}^n X_i - \theta\right)^{\!2}\right]^\alpha \\
&= \frac{1}{n^\alpha} ,
\end{aligned}
\]
where the second inequality follows from Jensen's inequality and the concavity and continuity of 0 ≤ x ↦ x^α,
and the final equality holds because X_1, . . . , X_n are i.i.d. N(θ, 1). This completes the proof.
Proposition 4 illustrates that the estimator g((1/n) Σ_{i=1}^n X_i) achieves the optimal minimax rate of Θ(n^{−α})
for the problem of estimating g(θ) = θ^α in the Gaussian location model.
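The following simulation (with illustrative parameter choices) corroborates the rate in Proposition 4: at θ = 0, which captures the worst-case regime, the MSE of the plug-in estimator decays proportionally to n^{−α}. The sample mean is drawn directly from its N(θ, 1/n) distribution to avoid simulating all n observations, and it is clipped at 0 so that the power α is well-defined; both choices are simulation conveniences only.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, theta = 0.5, 0.0                      # illustrative; theta = 0 exhibits the worst-case rate
trials = 500_000

for n in [10, 100, 1000, 10_000]:
    # The sample mean of n i.i.d. N(theta, 1) draws has law N(theta, 1/n), so sample it directly.
    means = rng.normal(theta, 1.0 / np.sqrt(n), size=trials)
    est = np.clip(means, 0.0, None) ** alpha                 # plug-in estimator, clipped so the power is real
    mse = np.mean((est - theta ** alpha) ** 2)
    print(f"n = {n:6d}:  MSE = {mse:.6f},  n**(-alpha) = {n ** -alpha:.6f}")
```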

References
[1] G. W. Wornell, “Inference and information,” May 2017, Department of Electrical Engineering and Com-
puter Science, MIT, Cambridge, MA, USA, Lecture Notes 6.437.
[2] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” May 2019, Department of Electrical
Engineering and Computer Science, MIT, Cambridge, MA, USA, Lecture Notes 6.441.
[3] R. Zamir, “A proof of the Fisher information inequality via a data processing argument,” IEEE Trans-
actions on Information Theory, vol. 44, no. 3, pp. 1246–1250, May 1998.


[4] Y. Wu, “Information-theoretic methods for high-dimensional statistics,” January 2020, Department of
Statistics and Data Science, Yale University, New Haven, CT, USA, Lecture Notes S&DS 677.
[5] R. D. Gill and B. Y. Levit, “Applications of the van Trees inequality: a Bayesian Cramér-Rao bound,”
Bernoulli, vol. 1, no. 1/2, pp. 59–79, March 1995.

[6] A. B. Tsybakov, Introduction to Nonparametric Estimation, ser. Springer Series in Statistics. New York,
NY, USA: Springer, 2009.

