Metrika
DOI 10.1007/s00184-009-0246-3
Estimation of Fisher information using model selection
Jan Mielniczuk · Małgorzata Wojtyś
Received: 16 March 2008
© Springer-Verlag 2009
Abstract In the paper the problem of estimating the Fisher information I_f of a
univariate density supported on [0, 1] is discussed. The starting point is the observation
that when the density belongs to an exponential family of known dimension, an
explicit formula for I_f allows for its simple estimation. In the general case, for
a given random sample, the dimension of an exponential family which approximates
it best is sought and an estimator of I_f is then constructed for the chosen family. As a
measure of quality of fit a modified Bayes Information Criterion is used. The estimator, which is an instance of the Post Model Selection Estimation method, is proved to
be consistent and asymptotically normal when the density belongs to the exponential
family. Its consistency is also proved under misspecification when the number of exponential models under consideration increases in a suitable way. Moreover, we provide
evidence that in most of the considered parametric cases the small sample performance
of the proposed estimator is superior to that of kernel estimators.
Keywords Fisher information · Post model selection estimation · Bayes information
criterion (BIC) · Kernel estimator of a density · Sheather–Jones bandwidth
J. Mielniczuk (B) · M. Wojtyś
Faculty of Mathematics and Information Science, Warsaw University of Technology,
Plac Politechniki 1, 00-661 Warsaw, Poland
e-mail: [email protected]

J. Mielniczuk
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland

1 Introduction

Consider a univariate probability distribution having a density f and such that its
Fisher information
$$I_f = \int_{\mathbb{R}} \frac{[f'(x)]^2}{f(x)}\, dx \qquad (1)$$

is finite. Obviously, I_f = E[(f'/f)^2(X)], where X is a random variable distributed according to the density f. The Fisher information is a quantity appearing in many contexts in statistics and its nonparametric estimation is essential in the construction of efficient estimators and tests, see, e.g. Bickel et al. (1998).
In the paper we deal with the problem of nonparametric estimation of this quantity.
An obvious method would be to estimate the density and its derivative by one
of numerous nonparametric functional estimation methods and then use the definition
of I_f for constructing a plug-in type estimator. This is the approach used, e.g., in
Stone (1975) and van der Vaart (2000, Section 25.8.1). However, such an estimator of
f'(x)/f(x) can behave unstably for x corresponding to small values of the density. A
remark by the last cited author (p. 397), who used truncation of both density and derivative
estimators, that this construction 'is not necessarily recommended for use in practice'
is telling. Moreover, with such an approach one faces the delicate problem of choosing
smoothing parameters for estimators of both the density and its derivative. Note also that
I_f equals the integral of the squared optimal score J(y) = (f'/f)(F^{-1}(y)). Thus
another possibility is to apply this representation of I_f in terms of the optimal score
function and use a direct estimator of it which does not involve separate estimation of
f and its derivative. An example of such a direct estimator is the NN estimator proposed
by Csörgő and Révész (1986) and defined in Sect. 2.3.
Here we take a different route. Assume that f is supported on [0, 1] and consider
a k-dimensional exponential family M_k pertaining to orthonormal functions
b_1(x), …, b_k(x) defined in (2)–(3). If f ∈ M_k then I_f = E{∑_{j=1}^k θ_j b'_j(X)}^2 (cf. (14)), where
X is distributed according to the density f. This equality is used to construct an estimator of I_f. In general, when it is not known whether f belongs to an exponential
family, we construct a parsimonious exponential model M_k well approximating f
in terms of a modified Bayesian Information Criterion (BIC) due to Schwarz (1978)
and use the estimator constructed for this very model. The procedure, known as post
model selection estimation, is described in detail below. An important point is that
nonparametric estimation of the density f and its derivative is entirely avoided and
the whole procedure relies on estimation of the dimension k and parametric estimation
of θ by the maximum likelihood method in the exponential family. We also note that the
introduced method can be applied to estimate other functionals of a density, such as
its entropy or the integral of its squared derivative of a given order.
In the paper we prove consistency and asymptotic normality of the introduced estimator when the underlying distribution belongs to one of the parametric models under
consideration and discuss some properties of the related selection rule. Its consistency
is also proved under misspecification when the underlying density is sufficiently well
approximated by the models on the list (cf. Theorem 5). Moreover, we compare its performance with plug-in estimators and with an estimator based on the Csörgő and Révész
estimator of the score function.
The paper is structured as follows. Section 2 contains necessary preliminary information on exponential families as well as post model selection estimators (PMSE) and
introduces the proposed estimator of I_f together with its competitors. In Sect. 3 the main
results of the paper are stated and proved. The last section is devoted to a discussion of
the results of the simulation study, in which the case of a density with an unknown, possibly
unbounded, support is also addressed.
2 Preliminaries
2.1 Exponential family
Let M_k, for k ≥ 1, be the k-dimensional exponential family

$$M_k = \{ f(\cdot, \theta) : \theta \in \mathbb{R}^k \}, \qquad (2)$$

where f(·, θ) = f_θ(·) is a density function with respect to Lebesgue measure λ on [0, 1] defined as

$$f(x, \theta) = \exp\{\theta \circ b(x) - \psi_k(\theta)\}, \qquad (3)$$

where θ = (θ_1, …, θ_k)^T ∈ R^k and b(x) = (b_1(x), …, b_k(x))^T is a vector of known bounded and differentiable functions such that {1, b_1(x), …, b_k(x)} form an orthonormal system in (L^2([0, 1]), λ). The symbol ◦ denotes throughout the inner product in R^k; for x ∈ R^k, y ∈ R^j, when j ≤ k, we define x ◦ y = ∑_{i=1}^j x_i y_i.

Since f is a density on [0, 1], the normalizing constant ψ_k(θ) is equal to

$$\psi_k(\theta) = \log \int_0^1 \exp\{\theta \circ b(x)\}\, dx. \qquad (4)$$
Observe that ψk (θ ) is finite for any θ ∈ Rk as b(·) is bounded and f (·, θ ) has a compact
support. Thus Mk is regular, which by definition means that the natural parameter set
is open.
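For concreteness, the construction (2)–(4) can be sketched in a few lines of code. The snippet below is our own illustration, not code from the paper: it takes the shifted Legendre polynomials on [0, 1] employed in Sect. 4 as the orthonormal system and computes ψ_k(θ) by numerical quadrature; all function names are ours.

```python
import numpy as np
from numpy.polynomial import legendre
from scipy.integrate import quad

def legendre_basis(k):
    """b_1, ..., b_k: shifted Legendre polynomials on [0, 1]; together with
    the constant 1 they form an orthonormal system in L^2([0, 1])."""
    basis = []
    for j in range(1, k + 1):
        c = np.zeros(j + 1)
        c[j] = 1.0  # coefficient vector selecting P_j
        # P_j on [-1, 1] mapped to [0, 1]; the factor sqrt(2j + 1) makes the norm 1
        basis.append(lambda x, j=j, c=c:
                     np.sqrt(2 * j + 1) * legendre.legval(2 * x - 1, c))
    return basis

def psi(theta, basis):
    """Normalizing constant psi_k(theta) of (4), by numerical quadrature."""
    integrand = lambda x: np.exp(sum(t * b(x) for t, b in zip(theta, basis)))
    value, _ = quad(integrand, 0.0, 1.0)
    return np.log(value)

def density(x, theta, basis):
    """f(x, theta) = exp{theta o b(x) - psi_k(theta)} as in (3)."""
    return np.exp(sum(t * b(x) for t, b in zip(theta, basis)) - psi(theta, basis))
```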
We assume throughout that the family M_k is of full rank, i.e. no linear combination ∑_{j=1}^k λ_j b_j(X), for (λ_1, …, λ_k) ≠ 0, is constant with probability 1 for any X ∼ f ∈ M_k. This is equivalent to the condition that the covariance matrix Cov_θ b(X) of the vector b(X) is positive definite for every θ ∈ R^k. Observe that the families M_k are nested, i.e. M_k ⊂ M_l for k < l when the parameter space of M_k is embedded in R^l by appending θ ∈ R^k with l − k zeros.
Below we summarize some basic facts concerning maximum likelihood estimators (MLE) of the parameter θ in an exponential family (cf. van der Vaart 2000, Sect. 4.2). Let X_1, …, X_n be i.i.d. observations whose law belongs to a k-dimensional exponential family. Then for θ ∈ R^k the log-likelihood function 𝓛_k(θ) is of the form

$$\mathcal{L}_k(\theta) = \log \prod_{i=1}^n f(X_i, \theta) = n\{\theta \circ Y_n - \psi_k(\theta)\}, \qquad (5)$$

where Y_n = (1/n) ∑_{i=1}^n b(X_i).
For θ ∈ R^k define a vector-valued function

$$e_k(\theta) = (e_{k,1}(\theta), \ldots, e_{k,k}(\theta))^T = \mathrm{E}_\theta\, b(X),$$

where X is a random variable distributed according to the density f(·, θ) ∈ M_k. In view of ψ'_k(θ) = e_k(θ) and ψ''_k(θ) = Cov_θ b(X) (van der Vaart 2000, p. 38), we have

$$\mathcal{L}'_k(\theta) = n[Y_n - e_k(\theta)], \qquad \mathcal{L}''_k(\theta) = -n\,\mathrm{Cov}_\theta\, b(X). \qquad (6)$$
For k ∈ {1, …, K}, where K ∈ N is some fixed integer, define

$$\mathcal{L}_k = \sup_{\theta \in \mathbb{R}^k} \mathcal{L}_k(\theta), \qquad (7)$$

and let θ̂ = θ̂_n^k ∈ R^k (with the superscript indicating the dimension of the vector) be a MLE of the parameter θ = θ^k ∈ R^k in the k-dimensional exponential family M_k. As θ̂_n^k is a point at which the maximum of the log-likelihood 𝓛_k(θ) is attained, it is the solution of the equation 𝓛'_k(θ) = 0, which in the exponential family is equivalent to the condition Y_n = e_k(θ). Moreover, we have

$$e'_k(\theta) = \mathrm{Cov}_\theta\, b(X), \qquad (8)$$

where e'_k(θ) = [∂e_{k,i}(θ)/∂θ_j]_{1≤i,j≤k}. Since M_k is assumed to be of full rank, e_k is one-to-one and if θ̂_n^k exists then

$$\hat{\theta}_n^k = e_k^{-1}(Y_n). \qquad (9)$$
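Continuing the sketch above (again our own illustration, not the authors' code), the MLE can be computed by maximizing the concave function θ ◦ Y_n − ψ_k(θ); by (9) this is equivalent to inverting e_k at Y_n.

```python
from scipy.optimize import minimize

def mle(sample, basis):
    """MLE in M_k via (5): maximize n{theta o Y_n - psi_k(theta)}.
    Returns theta_hat = e_k^{-1}(Y_n) and the supremum of the log-likelihood (7)."""
    Yn = np.array([np.mean([b(x) for x in sample]) for b in basis])  # Y_n
    neg_loglik = lambda theta: psi(theta, basis) - theta @ Yn  # concave problem
    res = minimize(neg_loglik, x0=np.zeros(len(basis)), method="BFGS")
    return res.x, -len(sample) * res.fun  # theta_hat and sup_theta L_k(theta)
```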
For j = 1, …, k define a vector a_j = [b_j(X_i)]_{1≤i≤n}. We assume that a_1, …, a_k are linearly independent with probability 1, which is a sufficient condition for the existence of the MLE θ̂_n^k (see Bogdan and Bogdan 2000). This condition also implies that the family M_k is of full rank so that θ̂_n^k is an argument of the global maximum of the log-likelihood. Thus 𝓛_k(θ̂_n^k) = 𝓛_k. Moreover, if θ ∈ R^k is the true value of the parameter, then we have θ̂_n^k →^P θ and

$$\sqrt{n}(\hat{\theta}_n^k - \theta) \xrightarrow{D} N\big(0, [\mathrm{Cov}_\theta\, b(X)]^{-1}\big), \qquad (10)$$

where →^P and →^D denote convergence in probability and in distribution, respectively.
2.2 Schwarz rule and post model selection estimators
In the preceding section we assumed that the density pertaining to the sample X 1 , X 2 ,
. . . , X n belongs to an exponential family of known dimension k. If this is not the case,
i.e. either an exponential model is misspecified or k is not known, we may first want to estimate the dimension of a model which, in a sense specified below, describes the sample X_1, X_2, …, X_n well, and then compute an estimator θ̂ of the parameter θ in the family whose dimension equals the estimated value of k. An estimator of θ constructed in this manner is known as a PMSE (see, e.g. Leeb and Pötscher 2006 for a general introduction to the subject).
One of the methods for estimating the dimension of a model is Schwarz's (1978) BIC. According to it we should choose the family M_k, k ∈ {1, …, K}, where K is some preassigned integer, for which the quantity

$$L_k = \mathcal{L}_k - \frac{1}{2}k \log n = n \sup_{\theta \in \mathbb{R}^k}\{\theta \circ Y_n - \psi_k(\theta)\} - \frac{1}{2}k \log n \qquad (11)$$

is largest. Equivalently, the chosen dimension S of the model is given by

$$S = \min\{k : 1 \le k \le K,\ L_k \ge L_j,\ j = 1, \ldots, K\}.$$
The selection rule S has been applied in Ledwina (1994) in order to construct an adaptive version of Neyman's smooth goodness-of-fit test and has been intensively studied thereafter. In the context of estimation problems we found it advantageous to modify its definition slightly by adding the value 0 to the set of possible values. For this purpose we modify S in the following way. The symbol M_0 will denote the model consisting of the uniform density on [0, 1]:

$$M_0 = \{ f \in M_1 : \theta = 0 \}.$$

Let 𝓛_0 = 0. Observe that as M_0 consists solely of the uniform density, 𝓛_0 formally equals the supremum of the log-likelihood for this model, which is consistent with (7) for k = 0. Moreover, we extend the definition of L_k in (11) to k = 0 by letting L_0 = 0. We introduce the estimator S̃ of the dimension k of the model M_k, k ∈ {0, 1, …, K}, by

$$\tilde{S} = \min\{k : 0 \le k \le K,\ L_k \ge L_j,\ j = 0, \ldots, K\}. \qquad (12)$$
Then the PMSE θ̂ of the parameter θ is defined as

$$\hat{\theta} = \hat{\theta}^{\tilde{S}},$$

where θ̂^S̃ is the MLE of the parameter θ in the S̃-dimensional exponential family M_S̃. Note that if S̃ = 0 then we infer that the random sample pertains to the uniform distribution. Moreover,

$$\tilde{S} = \begin{cases} S & \text{if } L_S > 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (13)$$
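As an illustration of (11)–(13), the following sketch (our own, building on the helpers above) computes S̃ by scanning k = 0, 1, …, K; k = 0 encodes the uniform model M_0 with L_0 = 0. The optional factor C rescaling the penalty anticipates the experiments of Sect. 4.4 and is our own parametrization.

```python
def select_dimension(sample, K, C=1.0):
    """Modified Schwarz rule (12); C = 1 corresponds to the original criterion."""
    n = len(sample)
    scores = [0.0]  # L_0 = 0 for the uniform model M_0
    for k in range(1, K + 1):
        _, sup_loglik = mle(sample, legendre_basis(k))
        scores.append(sup_loglik - C * 0.5 * k * np.log(n))  # penalized value (11)
    return int(np.argmax(scores))  # argmax returns the smallest maximizer
```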
Let KL(f, f_θ) = ∫ log(f/f_θ) f be the Kullback–Leibler distance between f and f_θ. We note finally that as E_f 𝓛_k(θ)/n equals −KL(f, f_θ) + ∫ log(f) f, maximization of L_k with respect to k can be viewed as choosing a model for which a penalized empirical KL distance between f and the model is smallest.
2.3 Estimators of Fisher information
The Fisher information pertaining to the density f is defined as in (1) and we assume throughout that I_f is finite. Consider the situation when f(·) = f(·, θ) ∈ M_k, 1 ≤ k ≤ K, where M_k is defined in (2). Then it is easy to see that for X ∼ f(·, θ)

$$I_f = \int_0^1 \Bigg\{\sum_{j=1}^k \theta_j b'_j(x)\Bigg\}^2 f(x, \theta)\, dx = \mathrm{E}\Bigg\{\sum_{j=1}^k \theta_j b'_j(X)\Bigg\}^2. \qquad (14)$$
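As a concrete instance of (14), take k = 1 with b_1 the first Legendre polynomial on [0, 1] used in the simulations of Sect. 4; the sum under the expectation is then constant:

$$b_1(x) = \sqrt{3}(2x - 1), \qquad b_1'(x) = 2\sqrt{3}, \qquad I_f = \mathrm{E}\{\theta_1 b_1'(X)\}^2 = 12\,\theta_1^2.$$

In particular, for the density l1(0.4) of Sect. 4.2 (θ_1 = 0.4) this gives I_f = 12 · 0.16 = 1.92, the value appearing in Table 1 and underlying Fig. 1.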
Thus in the case when k is assumed known, a natural estimator of I_f is

$$\hat{I}_f^k = \frac{1}{n}\sum_{i=1}^n \Bigg\{\sum_{j=1}^k \hat{\theta}_j b'_j(X_i)\Bigg\}^2 = \frac{1}{n}\sum_{i=1}^n (\hat{\theta} \circ b'(X_i))^2, \qquad (15)$$

where X_1, X_2, …, X_n is an i.i.d. sample pertaining to f(·, θ) and θ̂ = θ̂_n^k is a MLE of θ in M_k based on X_1, X_2, …, X_n. Moreover, we define Î_f^k = 0 for k = 0. Note that θ̂ and Î_f are estimated using the same sample.
In the general case, when no assumption is made that f belongs to one of the parametric models M_k, 0 ≤ k ≤ K, we propose the following estimator of I_f:

$$\hat{I}_f = \hat{I}_f^{\tilde{S}}, \qquad (16)$$

i.e. Î_f = Î_f^k when S̃ = k. Thus we choose the parametric model M_k with k = S̃, which yields the best parsimonious fit to the data with respect to the modified Schwarz criterion (12), and then we estimate I_f in the chosen family M_S̃ using (15). Thus Î_f is an instance of the PMSE discussed in Sect. 2.2.
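Putting the pieces together, a minimal sketch of Î_f (our own illustration) combines the selection rule with (15); for simplicity the derivatives b'_j are obtained here by numerical differentiation, although for the Legendre basis they are available in closed form.

```python
def basis_derivatives(basis, eps=1e-6):
    """Central-difference approximations to b_1', ..., b_k'."""
    return [lambda x, b=b: (b(x + eps) - b(x - eps)) / (2 * eps) for b in basis]

def fisher_info_estimate(sample, K):
    """Post model selection estimator (16): I_hat_f = I_hat_f^{S_tilde}."""
    k = select_dimension(sample, K)
    if k == 0:
        return 0.0  # uniform model selected: I_hat_f^0 = 0
    basis = legendre_basis(k)
    theta_hat, _ = mle(sample, basis)
    db = basis_derivatives(basis)
    scores = [sum(t * d(x) for t, d in zip(theta_hat, db)) for x in sample]
    return float(np.mean(np.square(scores)))  # (1/n) sum (theta_hat o b'(X_i))^2
```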
Estimator (16) is the main object studied in this paper; in the next section we prove its consistency and asymptotic normality. In Sect. 4 we will compare the small sample behaviour of Î_f with two other estimators which are now briefly discussed.
Kernel estimator of I_f. An obvious class of estimators of I_f is defined as follows:

$$\tilde{I}_f = \frac{1}{n}\sum_{i=1}^n \left(\frac{\hat{f}'(X_i)}{\hat{f}(X_i)}\right)^2, \qquad (17)$$
where fˆ and fˆ′ are some estimators of the density f and its derivative f ′ , respectively.
In the following we will use as fˆ a kernel estimator with boundary adjustment (see,
e.g. Schuster 1985)
$$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^n \big[K_h(x - X_i) + K_h(x + X_i) + K_h(x + X_i - 2)\big], \qquad (18)$$

where K_h(x) = h^{-1}K(x/h), K is a chosen probability density and h = h_n is a sequence of bandwidths such that h_n → 0 and nh_n → ∞ as n → ∞. Denote the bandwidth used for density estimation by h_1 = h_{1,n}. Moreover, f' is estimated by the derivative of f̂(x) with bandwidth h_2 = h_{2,n}. The considered choices of bandwidths are discussed in Sect. 4.1.
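The competing plug-in estimator can be sketched as follows (our own illustration, with a Gaussian kernel as in Sect. 4.1; the sample is assumed to be a one-dimensional NumPy array of observations in [0, 1]).

```python
from scipy.stats import norm

def f_hat(x, sample, h):
    """Boundary-adjusted kernel estimator (18): kernels placed at the
    data points and at their reflections -X_i and 2 - X_i."""
    pts = np.concatenate([sample, -sample, 2.0 - sample])
    return norm.pdf((x - pts) / h).sum() / (len(sample) * h)

def f_hat_prime(x, sample, h):
    """Derivative of (18); for the Gaussian kernel K'(u) = -u K(u)."""
    pts = np.concatenate([sample, -sample, 2.0 - sample])
    u = (x - pts) / h
    return (-u * norm.pdf(u)).sum() / (len(sample) * h ** 2)

def kernel_fisher_info(sample, h1, h2):
    """Plug-in estimator (17) with bandwidths h1 (density) and h2 (derivative)."""
    ratios = [f_hat_prime(x, sample, h2) / f_hat(x, sample, h1) for x in sample]
    return float(np.mean(np.square(ratios)))
```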
NN-estimator of I_f. The second estimator is based on an estimator of the optimal score function J(y) defined as the derivative of f(F^{-1}(y)). It is easy to check that I_f = ∫_0^1 J^2(y) dy. The NN-estimator of J(y) was introduced by Csörgő and Révész (1986) and is motivated as follows. Let k_n be a sequence of positive even integers and let X_{i:n} denote the ith order statistic in the X-sequence. Observe that f(F^{-1}(u)) may be estimated by (k_n/n)(X_{i+k_n/2:n} − X_{i−k_n/2:n})^{-1}, where i = [nu]. Moreover, J(y) is approximated by

$$\int K_h(y - u)\, d f(F^{-1}(u)) = \int K'_h(y - u)\, f(F^{-1}(u))\, du.$$

Plugging in the estimator of f(F^{-1}(u)) and letting h_n = k_n/n we arrive at

$$\hat{J}_n(y) = \frac{1}{k_n} \sum_{i=k_n/2}^{n-k_n/2} K'\left(\frac{y - i/n}{h_n}\right) \frac{1}{X_{i+k_n/2:n} - X_{i-k_n/2:n}}. \qquad (19)$$

Finally, the estimator of I_f is defined as n^{-1} ∑_{i=1}^n Ĵ_n^2(i/n).
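A sketch of the NN estimator (again our own illustration) reads as follows; the Beta(3, 3) kernel on [−1, 1] used in Sect. 4.1, K(u) = (15/16)(1 − u²)², is hard-coded, k_n is assumed even, and the summation range is trimmed so that all order statistics involved exist — the boundary convention is ours.

```python
def J_hat(y, xs, kn):
    """NN score estimator (19); xs is the sorted sample, h_n = k_n / n."""
    n = len(xs)
    h = kn / n
    total = 0.0
    for i in range(kn // 2, n - kn // 2):   # the paper's index i, shifted to 0-based xs
        u = (y - (i + 1) / n) / h
        if abs(u) < 1.0:                    # K'(u) vanishes outside [-1, 1]
            k_prime = -(15 / 4) * u * (1 - u * u)    # derivative of (15/16)(1-u^2)^2
            gap = xs[i + kn // 2] - xs[i - kn // 2]  # X_{i+kn/2:n} - X_{i-kn/2:n}
            total += k_prime / gap
    return total / kn

def nn_fisher_info(sample, kn):
    """Resulting estimate of I_f: n^{-1} sum_i J_hat(i/n)^2."""
    xs = np.sort(sample)
    n = len(xs)
    return float(np.mean([J_hat((i + 1) / n, xs, kn) ** 2 for i in range(n)]))
```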
3 Main results
We first prove an auxiliary result on the consistency of the selection rule S. The result has been stated in Ledwina (1994) and the proof used Haughton's (1988) result for general k-dimensional exponential families. We give a direct proof here.

Lemma 1 If X_1, …, X_n are i.i.d. with density f(·, θ) ∈ M_k, where 1 ≤ k ≤ K, and θ_k ≠ 0, then lim_{n→∞} P(S = k) = 1.
Proof First we show that lim_{n→∞} P(S < k) = 0. We have

$$P(S < k) = \sum_{l=1}^{k-1} P(S = l) \le \sum_{l=1}^{k-1} P(L_l \ge L_k).$$
Note that

$$\sup_{t \in \mathbb{R}^k}\{t \circ Y_n - \psi_k(t)\} \xrightarrow{P} \sup_{t \in \mathbb{R}^k}\{t \circ (e_{k,1}(\theta), \ldots, e_{k,k}(\theta)) - \psi_k(t)\} \qquad (20)$$

and for l ∈ {1, …, k − 1}

$$\sup_{t \in \mathbb{R}^l}\{t \circ Y_n - \psi_l(t)\} \xrightarrow{P} \sup_{t \in \mathbb{R}^l}\{t \circ (e_{k,1}(\theta), \ldots, e_{k,l}(\theta)) - \psi_l(t)\}, \qquad (21)$$

which follows from the fact that the function Y → sup_{t∈R^i}{t ◦ Y − ψ_i(t)} is convex and hence continuous, i = l, k. Moreover, with (t, 0) ∈ R^k denoting t appended with k − l zero coordinates, we have

$$\sup_{t \in \mathbb{R}^l}\{t \circ (e_{k,1}(\theta), \ldots, e_{k,l}(\theta)) - \psi_l(t)\} = \sup_{(t,0) \in \mathbb{R}^k}\{(t, 0) \circ (e_{k,1}(\theta), \ldots, e_{k,k}(\theta)) - \psi_k((t, 0))\} < \sup_{t \in \mathbb{R}^k}\{t \circ (e_{k,1}(\theta), \ldots, e_{k,k}(\theta)) - \psi_k(t)\}$$

as the last supremum is uniquely attained at t = θ and θ_k ≠ 0. From this inequality together with (20) and (21) we have

$$\sup_{t \in \mathbb{R}^k}\{t \circ Y_n - \psi_k(t)\} - \sup_{t \in \mathbb{R}^l}\{t \circ Y_n - \psi_l(t)\} - (k - l)\frac{\log n}{2n} \xrightarrow{P} a > 0,$$

where a = sup_{t∈R^k}{t ◦ (e_{k,1}(θ), …, e_{k,k}(θ)) − ψ_k(t)} − sup_{t∈R^l}{t ◦ (e_{k,1}(θ), …, e_{k,l}(θ)) − ψ_l(t)}.
Hence

$$P(L_l \ge L_k) = P\Big(\sup_{t \in \mathbb{R}^k}\{t \circ Y_n - \psi_k(t)\} - \sup_{t \in \mathbb{R}^l}\{t \circ Y_n - \psi_l(t)\} - (k - l)\frac{\log n}{2n} \le 0\Big) \to 0$$

and lim_{n→∞} P(S < k) = 0. Now, for l ∈ {k + 1, …, K} we have

$$P(S = l) \le P(L_l > L_k) = P\Big(\mathcal{L}_l - \mathcal{L}_k > \frac{1}{2}(l - k)\log n\Big).$$

Note that 𝓛_l − 𝓛_k ≤ 𝓛_l − 𝓛_l(θ^l), where θ^l ∈ R^l equals θ appended with l − k zero coordinates. Now the result follows from a two-term Taylor expansion of 𝓛_l − 𝓛_l(θ^l), the fact that √n(θ̂_n^l − θ^l) = O_P(1) and the continuity of 𝓛''_l(·), which together imply that 𝓛_l − 𝓛_l(θ^l) = O_P(1). ⊓⊔
We now prove that the selection rule S̃ is consistent.
Theorem 1 Let X_1, …, X_n be i.i.d. with density f(·, θ) ∈ M_k, 0 ≤ k ≤ K. If k ≥ 1 and θ_k ≠ 0 then lim_{n→∞} P(S̃ = k) = 1. If θ = 0 then lim_{n→∞} P(S̃ = 0) = 1.

Proof Assume first that θ_k ≠ 0. Then sup_{t∈R^k}{t ◦ (e_{k,1}(θ), …, e_{k,k}(θ)) − ψ_k(t)} = sup_{t∈R^k} E_θ log f(X_1, t) = E_θ log f(X_1, θ) > E_θ log f(X_1, 0) = 0 and, arguing as in the proof of Lemma 1, we obtain that

$$\sup_{t \in \mathbb{R}^k}\{t \circ Y_n - \psi_k(t)\} - k \log n/(2n) \xrightarrow{P} \sup_{t \in \mathbb{R}^k}\{t \circ (e_{k,1}(\theta), \ldots, e_{k,k}(\theta)) - \psi_k(t)\} > 0.$$

Hence L_k →^P ∞, which in view of (13) and Lemma 1 implies that lim_{n→∞} P(S̃ = k) = lim_{n→∞} P(S = k) = 1. Now assume that θ = 0. Then E_θ b(X_1) = 0 and, similarly to the proof of Theorem 2, we can show that L_l →^P −∞ for l ∈ {1, …, K}. Hence lim_{n→∞} P(L_l > 0) = 0 for l ∈ {1, …, K} and lim_{n→∞} P(S̃ = 0) = 1. ⊓⊔
Theorem 2 If X_1, …, X_n are i.i.d. with an arbitrary distribution P on [0, 1] and E_P b_1(X_1) = · · · = E_P b_{k−1}(X_1) = 0, E_P b_k(X_1) ≠ 0 for some 1 ≤ k ≤ K, then lim_{n→∞} P(S̃ ≥ k) = 1.
Proof Kallenberg and Ledwina (1995, p. 1600) proved that the above assumptions imply L_k →^P ∞. Thus to complete the proof it suffices to show that L_l →^P −∞ for l ∈ {1, …, k − 1}. However, as b_1(X), …, b_k(X) may be correlated, Prohorov's inequality, as suggested in the last reference, cannot be used, and we use Yurinskii's inequality instead. Let M > 0 and 0 < ε < 2/3. We have

$$P(L_l \ge -M) = P\Big(\sup_{t \in \mathbb{R}^l}\{t \circ Y_n - \psi_l(t)\} \ge (l \log n - 2M)/(2n)\Big).$$

Using Theorem 7.4 in Inglot and Ledwina (1996) we get for sufficiently large n

$$P\Big(\sup_{t \in \mathbb{R}^l}\{t \circ Y_n - \psi_l(t)\} \ge (l \log n - 2M)/(2n)\Big) \le P\big(\|Y_n\|^2 \ge (2 - \varepsilon)(l \log n - 2M)/(2n)\big) = P\Big(\Big\|\sum_{i=1}^n b(X_i)\Big\| \ge [n(2 - \varepsilon)(l \log n - 2M)/2]^{1/2}\Big).$$
The random vector b(X_1) = (b_1(X_1), …, b_l(X_1))^T satisfies Cramér's condition:

$$\mathrm{E}_P\, b(X_1) = 0, \qquad \mathrm{E}_P \|b(X_1)\|^s \le \frac{s!}{2}\, b^2 K^{s-2}, \quad s = 2, 3, \ldots,$$

with K = √l V_l and b^2 = l V_l^2, where V_l = max_{1≤j≤l} sup_{x∈[0,1]} |b_j(x)|.
Thus we can apply Yurinskii's (1976) inequality (see also Inglot and Ledwina 1996, p. 2011):

$$P\Big(\Big\|\sum_{i=1}^n b(X_i)\Big\| \ge [n(2 - \varepsilon)(l \log n - 2M)/2]^{1/2}\Big) \le 2\exp\Big\{-\frac{x_n^2}{2}\Big(1 + 1.62\,\frac{x_n K}{B_n}\Big)^{-1}\Big\},$$

where x_n = [(2 − ε)(l log n − 2M)/(2lV_l^2)]^{1/2} and B_n = (nb^2)^{1/2} = V_l(nl)^{1/2}. Since x_n → ∞ and x_n/B_n → 0 as n → ∞, we have lim_{n→∞} P(L_l ≥ −M) = 0. Thus L_l →^P −∞ as n → ∞. ⊓⊔
Remark 1 Another possible method to choose the dimension of a model that is suggested in the literature is the so-called simplified Schwarz selection rule, which uses Neyman's smooth test statistic T_k defined below instead of the supremum 𝓛_k of the log-likelihood function. Namely, let

$$T_k = \sum_{j=1}^k \Bigg(\frac{1}{\sqrt{n}}\sum_{i=1}^n b_j(X_i)\Bigg)^2.$$

Then the selection rule introduced in Kallenberg and Ledwina (1997) is defined by

$$S2 = \min\{k : 1 \le k \le K,\ T_k - k \log n \ge T_j - j \log n,\ j = 1, \ldots, K\}.$$

An obvious advantage of S2 over S is its computational simplicity, as calculation of the MLE is avoided. In Ledwina (2000) it is shown that under the assumptions of Theorem 2, lim_{n→∞} P(S2 ≥ k) = 1. Thus a similar criterion based on the augmented family of models including M_0 can be defined as

$$\tilde{S}2 = \min\{k : 0 \le k \le K,\ T_k - k \log n \ge T_j - j \log n,\ j = 0, \ldots, K\},$$
where T_0 = 0. If the assumptions of Theorem 2 are satisfied then lim_{n→∞} P(S̃2 ≥ k) = lim_{n→∞} P(S2 ≥ k) = 1. When X_1, …, X_n are independent and uniformly distributed on [0, 1] then lim_{n→∞} P(S2 = 1) = 1 (see Inglot and Ledwina 2001, p. 814). This together with the fact that T_1 →^D χ_1^2 yields T_j − j log n →^P −∞ for j = 1, 2, …, K. Hence lim_{n→∞} P(S̃2 = 0) = 1. Note, however, that in the case of Fisher information we use MLE estimators for computing Î_f and the advantage of S̃2 over S̃ is lost.
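For completeness, a sketch of the simplified rule (our own illustration, reusing legendre_basis from the earlier snippet); no MLE computation is needed here.

```python
def select_dimension_simplified(sample, K):
    """S_tilde_2 of Remark 1, based on Neyman's statistics T_k with T_0 = 0."""
    n = len(sample)
    basis = legendre_basis(K)
    col_sums = np.array([np.sum([b(x) for x in sample]) for b in basis])
    Tk = np.cumsum((col_sums / np.sqrt(n)) ** 2)          # T_1, ..., T_K
    scores = np.concatenate(([0.0], Tk - np.arange(1, K + 1) * np.log(n)))
    return int(np.argmax(scores))
```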
Theorem 3 If X_1, …, X_n are i.i.d. with density f(·, θ) ∈ M_k, 0 ≤ k ≤ K, and E‖b'(X_1)‖^2 < ∞, then Î_f →^P I_f.
Proof If f ∈ M_0 then P(Î_f = I_f) ≥ P(S̃ = 0) → 1 as n → ∞ in view of Theorem 1. Let θ ∈ R^k, θ_k ≠ 0 and θ̂_n^k = (θ̂^k_{1,n}, …, θ̂^k_{k,n}). Note that P(Î_f → I_f) ≥ P({Î_f^k → I_f} ∩ {S̃ = k}) and since P(S̃ = k) → 1 as n → ∞ it suffices to show that Î_f^k →^P I_f. We have

$$\hat{I}_f^k = \frac{1}{n}\sum_{i=1}^n (\hat{\theta}^k \circ b'(X_i))^2 = \frac{1}{n}\sum_{i=1}^n \Bigg(\sum_{j=1}^k (\hat{\theta}^k_{j,n} - \theta_j) b'_j(X_i) + \sum_{j=1}^k \theta_j b'_j(X_i)\Bigg)^2 = I_{n,1} + I_{n,2} + I_{n,3},$$

where

$$I_{n,1} = \frac{1}{n}\sum_{i=1}^n \Bigg(\sum_{j=1}^k (\hat{\theta}^k_{j,n} - \theta_j) b'_j(X_i)\Bigg)^2, \qquad I_{n,2} = \frac{1}{n}\sum_{i=1}^n \Bigg(\sum_{j=1}^k \theta_j b'_j(X_i)\Bigg)^2,$$

$$I_{n,3} = \frac{2}{n}\sum_{i=1}^n \Bigg[\Bigg(\sum_{j=1}^k (\hat{\theta}^k_{j,n} - \theta_j) b'_j(X_i)\Bigg)\Bigg(\sum_{j=1}^k \theta_j b'_j(X_i)\Bigg)\Bigg].$$

By the weak law of large numbers we have I_{n,2} →^P I_f. Using the Schwarz inequality we obtain

$$I_{n,1} \le \frac{k}{n}\sum_{i=1}^n \sum_{j=1}^k (\hat{\theta}^k_{j,n} - \theta_j)^2 (b'_j(X_i))^2 = k \sum_{j=1}^k (\hat{\theta}^k_{j,n} - \theta_j)^2\, \frac{1}{n}\sum_{i=1}^n (b'_j(X_i))^2.$$

We have θ̂^k_{j,n} − θ_j →^P 0 and by the weak law of large numbers n^{-1} ∑_{i=1}^n (b'_j(X_i))^2 →^P E_θ(b'_j(X))^2. Hence I_{n,1} →^P 0. By the Schwarz inequality I_{n,3} ≤ 2(I_{n,1} · I_{n,2})^{1/2}, so I_{n,3} →^P 0. Hence Î_f^k →^P I_f. ⊓⊔
Theorem 4 If X, X_1, …, X_n are i.i.d. with density f(·, θ) ∈ M_k, 0 ≤ k ≤ K, and E‖b'(X_1)‖^4 < ∞, then

$$\sqrt{n}(\hat{I}_f - I_f) \xrightarrow{D} N(0, \sigma^2),$$

where

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 + 4\,\mathrm{Cov}_\theta(\theta^T B' \theta, b)(\mathrm{Cov}_\theta\, b)^{-1} \Delta\, \theta, \qquad (22)$$

with B' = b'(X)(b'(X))^T, Δ = E_θ(B'), σ_1^2 = 4θ^T Δ(Cov_θ b(X))^{-1} Δθ, σ_2^2 = Var_θ(θ^T B' θ).
Proof Using the reasoning from the proof of Theorem 3, it is enough to prove the result for Î_f^k. Define B' as the k × k matrix consisting of all terms b'_i(X) b'_j(X), i, j ∈ {1, …, k}:

$$B' = B'(X) = [b'_i(X) b'_j(X)]_{1 \le i,j \le k} = b'(X)(b'(X))^T$$

and let

$$\bar{B}' = \frac{1}{n}\sum_{i=1}^n B'(X_i).$$

Î_f^k can be written as

$$\hat{I}_f^k = \frac{1}{n}\sum_{i=1}^n (\hat{\theta} \circ b'(X_i))^2 = \hat{\theta}^T \Bigg(\frac{1}{n}\sum_{i=1}^n b'(X_i)(b'(X_i))^T\Bigg) \hat{\theta} = \hat{\theta}^T \bar{B}' \hat{\theta} = (\hat{\theta} - \theta)^T \bar{B}'(\hat{\theta} + \theta) + \theta^T \bar{B}' \theta$$

with θ̂ = θ̂^k. Thus

$$\sqrt{n}(\hat{I}_f - I_f) = \sqrt{n}(\hat{\theta}^T - \theta^T)\bar{B}'(\hat{\theta} + \theta) + \sqrt{n}(\check{I}_f - I_f), \qquad (23)$$

where Ǐ_f = θ^T B̄'θ. Note that since θ^T B̄'θ = n^{-1} ∑_{i=1}^n (θ ◦ b'(X_i))^2, Ǐ_f can be regarded as an estimator of I_f when θ is known. In view of (23) we have

$$\sqrt{n}(\hat{I}_f - I_f) = g\big(\sqrt{n}(\hat{\theta} - \theta),\ \sqrt{n}(\check{I}_f - I_f),\ \bar{B}'(\hat{\theta} + \theta)\big), \qquad (24)$$

where g(x, y, z) = x ◦ z + y for x, z ∈ R^k, y ∈ R, is a continuous mapping. We will apply the delta method (cf. van der Vaart 2000, p. 25) and (24) to prove the result.

By (10) and the weak law of large numbers, which implies B̄'(θ̂ + θ) →^P 2Δθ, where Δ = E_θ(B'), we have

$$\sqrt{n}(\hat{\theta}^T - \theta^T)\bar{B}'(\hat{\theta} + \theta) \xrightarrow{D} N(0, \sigma_1^2),$$

where σ_1^2 = 4θ^T Δ(Cov_θ b(X))^{-1} Δθ. Since Ǐ_f = n^{-1} ∑_{i=1}^n θ^T B'(X_i) θ then, by the Central Limit Theorem,

$$\sqrt{n}(\check{I}_f - I_f) \xrightarrow{D} N(0, \sigma_2^2),$$

where σ_2^2 = Var(θ^T B' θ). Define auxiliary variables

$$S = S(X) = \begin{pmatrix} b(X) \\ \theta^T B' \theta \end{pmatrix} \in \mathbb{R}^{k+1} \quad \text{and} \quad \bar{S} = \frac{1}{n}\sum_{i=1}^n S(X_i).$$

Then by the Central Limit Theorem

$$\sqrt{n}(\bar{S} - \mathrm{E}_\theta S) \xrightarrow{D} N(0, \mathrm{Cov}_\theta S),$$

where Cov_θ S is the covariance matrix of the vector S:

$$\mathrm{Cov}_\theta S = \begin{pmatrix} \mathrm{Cov}_\theta\, b & \mathrm{Cov}_\theta(b, \theta^T B' \theta) \\ \mathrm{Cov}_\theta(\theta^T B' \theta, b) & \sigma_2^2 \end{pmatrix}.$$

Recall that θ̂_n^k = e_k^{-1}(Y_n) (see (9)) and define the function ν : R^{k+1} → R^{k+1} as follows:

$$\nu(u_1, \ldots, u_k, u_{k+1}) = [e_k^{-1}(u_1, \ldots, u_k)^T, u_{k+1}]^T.$$

In view of (8) the Jacobian matrix of the function ν is equal to

$$\nabla\nu = \begin{pmatrix} (\mathrm{Cov}_\theta\, b)^{-1} & 0 \\ 0 & 1 \end{pmatrix}.$$

By the delta method

$$\sqrt{n}(\nu(\bar{S}) - \nu(\mathrm{E}_\theta S)) \xrightarrow{D} N(0, \Sigma_1),$$

where ν(S̄) = (θ̂^T, Ǐ_f)^T, ν(E_θ S) = (θ^T, I_f)^T and

$$\Sigma_1 = \nabla\nu\, \mathrm{Cov}_\theta S\, (\nabla\nu)^T = \begin{pmatrix} (\mathrm{Cov}_\theta\, b)^{-1} & (\mathrm{Cov}_\theta\, b)^{-1}\mathrm{Cov}_\theta(b, \theta^T B' \theta) \\ \mathrm{Cov}_\theta(\theta^T B' \theta, b)(\mathrm{Cov}_\theta\, b)^{-1} & \sigma_2^2 \end{pmatrix}.$$

Thus the random vector (√n(θ̂ − θ), √n(Ǐ_f − I_f), B̄'(θ̂ + θ))^T converges in law to the distribution of (Z, 2Δθ), where Z ∼ N(0, Σ_1). Hence in view of (24) √n(Î_f − I_f) →^D g((Z, 2Δθ)), i.e.

$$\sqrt{n}(\hat{I}_f - I_f) \xrightarrow{D} N(0, \sigma^2),$$

where

$$\sigma^2 = [2\theta^T \Delta, 1]\, \Sigma_1 \begin{pmatrix} 2\Delta\theta \\ 1 \end{pmatrix} = 4\theta^T \Delta(\mathrm{Cov}_\theta\, b(X))^{-1}\Delta\theta + 4\,\mathrm{Cov}_\theta(\theta^T B'\theta, b)(\mathrm{Cov}_\theta\, b)^{-1}\Delta\theta + \mathrm{Var}(\theta^T B'\theta) = \sigma_1^2 + \sigma_2^2 + 4\,\mathrm{Cov}_\theta(\theta^T B'\theta, b)(\mathrm{Cov}_\theta\, b)^{-1}\Delta\theta. \qquad \square$$
Fig. 1 Histogram of √n(Î_f − I_f)/σ for n = 3,000 with the N(0, 1) density overlaid. The density f belongs to M_1 with θ_1 = 0.4
Remark 2 Note that the asymptotic variance in (22) corresponds to the decomposition in (23), with the term 4Cov_θ(θ^T B'θ, b)(Cov_θ b)^{-1}Δθ being the asymptotic covariance of √n(θ̂^T − θ^T)B̄'(θ̂ + θ) and √n(Ǐ_f − I_f).

The asymptotic behaviour of the estimator Î_f = Î_f^S̃ is exemplified in Fig. 1. We computed the value of Î_f for 10^4 random samples X_1, …, X_n ∼ f(·, θ) ∈ M_k with k = 1, θ = 0.4 and n = 3,000. In Fig. 1 the standard normal density function is compared to the histogram of √n(Î_f − I_f)/σ, where I_f and σ are the theoretical values defined in (14) and (22), respectively. The visible conformity of the histogram and the theoretical curve illustrates the result of Theorem 4.
Remark 3 Note that Theorems 1–4 remain valid when the term log n in the penalty is replaced with an arbitrary k_n such that k_n → ∞ and k_n = o(n). In the following section we investigate by means of simulations the influence of a larger penalty on the performance of Î_f.
We now consider the general case when the underlying density f does not necessarily belong to an exponential family. Specifically, we assume that f has the following form:

$$f(x) = \exp\Bigg\{\sum_{j=1}^{\infty} \theta_j b_j(x) - \psi(\theta)\Bigg\}, \qquad (25)$$

where θ = (θ_1, θ_2, …) ∈ R^∞, ||θ|| = (∑_{j=1}^∞ |θ_j|^2)^{1/2} < ∞, and {1, b_1(x), b_2(x), …} form an orthonormal system in (L^2([0, 1]), λ). Define ||a||_∞ = sup_{x∈[0,1]} |a(x)| and let

$$V_m = \max_{j=1,\ldots,m} \|b_j\|_\infty, \qquad V'_m = \max_{j=1,\ldots,m} \|b'_j\|_\infty.$$

Let f_m(x) = exp{∑_{j=1}^m θ_j b_j(x) − ψ_m(θ^m)} ∈ M_m, where θ^m = (θ_1, …, θ_m)^T, be the approximation of f. We assume that for m → ∞

$$\|f - f_m\|_{L^2} = O(m^{-r}) \quad \text{for some } r > 0. \qquad (26)$$
Moreover, we impose the condition that

$$\sum_{j=1}^{\infty} |\theta_j|\, \|b_j\|_\infty < \infty. \qquad (27)$$

In the case when the density of a random sample is of the form (25) with the series ∑_{j=1}^∞ θ_j b'_j(x) converging uniformly on [0, 1], the Fisher information equals

$$I_f = \mathrm{E}_f(\rho(X))^2, \qquad (28)$$

where ρ(x) = f'(x)/f(x) = ∑_{j=1}^∞ θ_j b'_j(x). In particular (28) is satisfied when

$$\sum_{j=1}^{\infty} |\theta_j|\, \|b'_j\|_\infty < \infty. \qquad (29)$$
Let m = m(n) be a deterministic sequence such that m(n) → ∞ when n → ∞. The following result on the consistency of Î_n^m, the estimator (15) computed in the family M_m, holds.
Theorem 5 Assume that V_m = O(m^{ω_1}), V'_m = O(m^{ω_2}) for some ω_1, ω_2 ≥ 0 and r > 1/2 + max(ω_1, ω_2). If m = O(n^{1/k}), where k > 2 + 4max(ω_1, ω_2), and conditions (26)–(27) and (29) hold, then Î_n^m →^P I_f.
In order to prove Theorem 5, we state two lemmas. Their proofs are given at the end
of the section. The first lemma is a version of Lemma 5 in Barron and Sheu (1991).
Lemma 2 Let θ_0 ∈ R^m, α_0 = ∫ b f_{θ_0} and α ∈ R^m be given. Let c = e^{||log f_{θ_0}||_∞}. If

$$\|\alpha_0 - \alpha\| \le \frac{1}{4ce\sqrt{m}\,V_m} \qquad (30)$$

then the solution θ(α) to ∫ b f_θ = α exists and satisfies

$$\|\theta_0 - \theta(\alpha)\| \le 2ce\|\alpha - \alpha_0\|.$$

Moreover,

$$\|\log f_{\theta_0}/f_{\theta(\alpha)}\|_\infty \le 4ce\sqrt{m}\,V_m \|\alpha - \alpha_0\| \le 1.$$
Let θ*_m ∈ R^m be the vector that satisfies the equation ∫ b f_{θ*_m} = ∫ b f. Then f*_m := f_{θ*_m} is called the information projection of f onto the exponential family M_m. In the next lemma properties of θ*_m are used to establish consistency of the MLE θ̂_n^m when m → ∞.

Lemma 3 If V_m = O(m^{ω_1}) for some ω_1 ≥ 0, r > 1/2 + ω_1 and (26)–(27) are satisfied, then for sufficiently large m the vector θ*_m exists and

$$\|\theta^*_m - \theta^m\| = O(m^{-r}). \qquad (31)$$

If additionally m(n) = O(n^{1/k}) and k > 2 + 4ω_1, then the MLE θ̂_n^m exists with probability tending to 1 as n → ∞ and

$$\|\hat{\theta}_n^m - \theta^*_m\| = O_P\big((m^{1+2\omega_1}/n)^{1/2}\big). \qquad (32)$$
Proof of Theorem 5 |Î_n^m − I_f| is bounded by

$$\Bigg|\frac{1}{n}\sum_{i=1}^n (\hat{\rho}_{n,m}(X_i))^2 - \int (\hat{\rho}_{n,m}(x))^2 f(x)\, dx\Bigg| + \Bigg|\int (\hat{\rho}_{n,m}(x))^2 f(x)\, dx - \int (\rho(x))^2 f(x)\, dx\Bigg|,$$

where ρ̂_{n,m}(x) = ∑_{j=1}^m (θ̂_n^m)_j b'_j(x) = θ̂_n^m ◦ b'(x). The first term of the rhs can be written as |(θ̂_n^m)^T (n^{-1} ∑_{i=1}^n B'(X_i) − EB'(X)) θ̂_n^m|, where B'(x) = b'(x)(b'(x))^T, and whence is bounded by ||θ̂_n^m||^2 ||n^{-1} ∑_{i=1}^n Y^i||, where Y^i = (Y^i_{j,l})_{1≤j,l≤m} ∈ R^{m^2}, Y^i_{j,l} = b'_j(X_i) b'_l(X_i) − ∫ b'_j b'_l f. Yurinskii's (1976) inequality implies that

$$P\Bigg(\Bigg\|\frac{1}{n}\sum_{i=1}^n Y^i\Bigg\| > \varepsilon\Bigg) \le 2\exp\Bigg\{-\frac{x^2}{2}\Big(1 + 1.62\,\frac{x\kappa}{B_n}\Big)^{-1}\Bigg\},$$

where κ = 2m(V'_m)^2, B_n = (nκ)^{1/2}, x = nε/B_n. Whence for some positive constant C

$$x = \frac{\varepsilon}{2}\Big(\frac{n}{m^2 (V'_m)^4}\Big)^{1/2} \ge C\Big(\frac{n}{m^{2+4\omega_2}}\Big)^{1/2} \to \infty \quad \text{as } n \to \infty,$$

and as |xκ/B_n| is bounded it implies that ||n^{-1} ∑_{i=1}^n Y^i|| converges to 0 in probability. Moreover,

$$|\hat{\rho}_{n,m}(x) - \rho(x)| \le \|\hat{\theta}_n^m - \theta^m\| \sqrt{m}\,V'_m + \sum_{j=m+1}^{\infty} |\theta_j|\,\|b'_j\|_\infty$$

and using Lemma 3 we obtain

$$\|\hat{\theta}_n^m - \theta^m\|\sqrt{m}\,V'_m \le (\|\theta^m - \theta^*_m\| + \|\hat{\theta}_n^m - \theta^*_m\|)\sqrt{m}\,V'_m = O(m^{1/2+\omega_2-r}) + O_P\big((m^{2+2\omega_1+2\omega_2}/n)^{1/2}\big),$$

which implies that ∫(ρ̂_{n,m}(x))^2 f(x) dx − ∫(ρ(x))^2 f(x) dx converges to 0 in probability. ⊓⊔
Proof of Lemma 2 The proof proceeds similarly to the proof of Lemma 5 in Barron and Sheu (1991) with q(x) = 1 for x ∈ [0, 1] and τ = 1, the only change being that instead of using the result of Lemma 4 therein we apply the inequality

$$D(f_{\theta_1} \,\|\, f_{\theta_2}) \ge \frac{1}{2}\, e^{-\|\log q/f_{\theta_1}\|_\infty}\, e^{-2\|\theta_1 - \theta_2\|\sqrt{m}V_m}\, \|\theta_1 - \theta_2\|^2,$$

which follows from the fact that

$$\|\log f_{\theta_1}/f_{\theta_2}\|_\infty \le 2\|(\theta_1 - \theta_2) \circ b\|_\infty \le 2\|\theta_1 - \theta_2\|\sqrt{m}\,V_m$$

(cf. the proof of Lemma 4 in Barron and Sheu 1991). ⊓⊔
Proof of Lemma 3 To establish equality (31) we use Lemma 2 with α_0 = ∫ b f_m, α = ∫ b f and c = e^{||log f_m||_∞}. Note that

$$\|\log f_m\|_\infty \le \|\log f\|_\infty + 2\sum_{j=1}^{\infty} |\theta_j| \|b_j\|_\infty, \qquad (33)$$

so ||log f_m||_∞ is bounded in m. Since {b_j}_{j=1}^∞ form an orthonormal system, the Bessel inequality yields

$$\Bigg\|\int b f_m - \int b f\Bigg\| = \Bigg\|\int b(f_m - f)\Bigg\| \le \|f_m - f\|_{L^2}.$$

Thus the fact that m^{1/2+ω_1} ||f_m − f||_{L^2} → 0 as m → ∞ implies (30) for sufficiently large m and hence the existence of θ*_m. Moreover,

$$\|\theta^*_m - \theta^m\| \le 2ce\|f_m - f\|_{L^2},$$

which proves (31). It also follows from Lemma 2 that

$$\|\log f^*_m / f_m\|_\infty \le 1. \qquad (34)$$

For the equality (32) we apply Lemma 2 once again with α_0 = ∫ b f_{θ*_m}, α = n^{-1} ∑_{i=1}^n b(X_i) and c = e^{||log f*_m||_∞}. Using (34) and (33) we obtain that ||log f*_m||_∞ is bounded since ||log f*_m||_∞ ≤ ||log f*_m/f_m||_∞ + ||log f_m||_∞ ≤ 1 + ||log f_m||_∞. Yurinskii's (1976) inequality yields

$$\Bigg\|\frac{1}{n}\sum_{i=1}^n b(X_i) - \mathrm{E}b(X)\Bigg\| = O_P\big((m^{1+2\omega_1}/n)^{1/2}\big). \qquad (35)$$

Thus √m V_m ||n^{-1} ∑_{i=1}^n b(X_i) − Eb(X)|| = O_P((m^{2+4ω_1}/n)^{1/2}) and the assumption (30) of Lemma 2 is satisfied with probability tending to 1 as n → ∞. Whence

$$\|\hat{\theta}_n^m - \theta^*_m\| \le 2ce\, \Bigg\|\frac{1}{n}\sum_{i=1}^n b(X_i) - \mathrm{E}b(X)\Bigg\|$$

and (32) follows. ⊓⊔
Remark 4 It is shown in Barron and Sheu (1991, pp. 1351–1352) that when f ∈ W_2^r, the Sobolev space of functions on [0, 1] such that f^{(r−1)} is absolutely continuous with square integrable derivative, then f satisfies (26) when {b_j}_{j=0}^∞ is the Legendre system. Moreover, in this case V_m = (2m + 1)^{1/2} and V'_m = m(m + 1)(2m + 1)^{1/2}, i.e. ω_1 = 1/2, ω_2 = 5/2, k > 12 and f ∈ W_2^3 in Theorem 5.
Consider a family of exponential models {M_k}, k ∈ {1, …, m(n)}, and the corresponding S̃ defined as in (12) with K replaced by m(n). The following corollary states the consistency of Î_n^S̃. Note that its last assumption is in particular satisfied under the conditions of Lemma 3 in view of (31).

Corollary 1 Assume that m(n) satisfies the conditions of Theorem 5, there is an infinite number of nonzero θ_i in the representation (25), and for sufficiently large m the vector θ*_m exists and ||θ*_m − θ^m|| → 0 when m → ∞. Then Î_n^S̃ →^P I_f.
Proof The proof follows by examination of the proof of Theorem 5, which indicates that its conclusion remains valid for a random sequence m(n) provided m(n) →^P ∞ and m(n) = O_P(n^{1/k}). In the case of S̃ the second condition is trivially satisfied as S̃ ≤ m. To prove the first assertion observe that it is impossible that, starting from a certain m_0, all coordinates (θ*_m)_l of θ*_m for m_0 ≤ l ≤ m and m ≥ m_0 would be 0. Namely, considering k ≥ m_0 such that θ_k ≠ 0, this would imply ||θ*_m − θ^m|| ≥ |θ_k| > 0 for all m ≥ m_0, which contradicts the assumptions. Thus for every M > 0 there exists m_1 ≥ M such that (θ*_{m_1})_{m_2} ≠ 0 for some m_2 ≥ M. Let m̃_1 be the smallest integer having this property. For 0 ≤ l < m̃_1 we have

$$\sup_{t \in \mathbb{R}^l}(t \circ \mathrm{E}b - \psi_l(t)) = \theta^*_l \circ \mathrm{E}b - \psi_l(\theta^*_l) < \theta^*_{\tilde{m}_1} \circ \mathrm{E}b - \psi_{\tilde{m}_1}(\theta^*_{\tilde{m}_1}) = \sup_{t \in \mathbb{R}^{\tilde{m}_1}}(t \circ \mathrm{E}b - \psi_{\tilde{m}_1}(t)).$$

Reasoning as in the proof of Lemma 1 leads to

$$P(\tilde{S} < M) \le P(\tilde{S} < \tilde{m}_1) \le \sum_{l=0}^{\tilde{m}_1 - 1} P(L_l \ge L_{\tilde{m}_1}) \le \sum_{l=0}^{\tilde{m}_1 - 1} P\Bigg(\sup_{t \in \mathbb{R}^l}\{t \circ Y_n - \psi_l(t)\} - \sup_{t \in \mathbb{R}^{\tilde{m}_1}}\{t \circ Y_n - \psi_{\tilde{m}_1}(t)\} + (\tilde{m}_1 - l)\frac{\log n}{2n} \ge 0\Bigg) \to 0$$

and thus S̃ →^P ∞. ⊓⊔
4 Simulations
In this section we discuss the results of a simulation study in which the performance of Î_f = Î_f^S̃ has been compared with the performance of several competing estimators described in Sect. 2.3. In all numerical experiments the first k Legendre polynomials on [0, 1] have been taken as the functions b_1(x), …, b_k(x). The chosen sample sizes were n = 100 and n = 500. The number K of dimensions considered in (11) depended on the sample size, namely K(100) = 4 and K(500) = 10. The number m of repetitions of each experiment in the study equaled 10^4.
4.1 Estimators under consideration
We consider two kernel-type estimators Ĩ_f with the kernel function K equal to the density of the standard normal law N(0, 1), for various choices of the bandwidths h_1 and h_2 used for estimating the density and its derivative, respectively. The kernel estimator of the derivative equals the derivative of the kernel estimator of f defined in (18).

Let h_SJ be the bandwidth of a kernel density estimator computed using the method of Sheather and Jones (1991), a commonly used method of data-dependent bandwidth choice. We consider two estimators of I_f that use h_SJ: for the first one, called Ĩ_SJ, we have taken h_1 = h_2 = h_SJ. For the second one, we calculated a bandwidth h_{2,opt} which minimizes the asymptotic Mean Integrated Squared Error of f̂' for the normal N(µ, σ^2) distribution, and expressed it in terms of h_{1,opt}, the analogously defined asymptotically optimal bandwidth for f̂. This yields

$$h_{2,opt} = (3n/4)^{2/35}(3/5)^{1/7}\, h_{1,opt}.$$
The estimator Ĩ'_SJ uses h_1 = h_SJ and h_2 calculated from the above formula with h_{1,opt} replaced by h_SJ. Additionally, we define a 'prophet' kernel estimator Ĩ_best for which we choose h_1 = h_2 such that it minimizes the integrated squared error (ISE) of the corresponding estimator of f'/f. The ISE is defined as m^{-1} ∑_{i=1}^m [(f̂'/f̂)(x_i) − (f'/f)(x_i)]^2, where (x_i) is a grid of m equidistant points in [0, 1]. Note that here we use the theoretical density of the observations, which is not available in practice.
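In code the bandwidth rule reads simply as below (our own sketch; h_SJ is assumed to be supplied by any implementation of the Sheather–Jones selector).

```python
def derivative_bandwidth(h_sj, n):
    """h_2 for the derivative estimator, obtained from the display above
    with h_{1,opt} replaced by the Sheather-Jones bandwidth h_SJ."""
    return (3 * n / 4) ** (2 / 35) * (3 / 5) ** (1 / 7) * h_sj
```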
We also consider the 'prophet' NN estimator Ĩ_NN^best with the density of the Beta(3, 3) distribution on [−1, 1] as the kernel function K and with the choice of the parameter h_n that minimizes the ISE of the estimated function Ĵ_n defined in (19). To the best of our knowledge no proposal of data-dependent bandwidth choice for the NN estimator is available.
4.2 Employed distributions
The following distributions supported on [0, 1] have been considered in the numerical experiments:

– the uniform distribution;
– the truncated normal distribution with mean 0.5 and σ = 0.25, 0.5, 0.75, 1;
– the Beta(a, b) distribution with a = b = 3, 4, 5, 6, 7, 8, 9;
– the distribution h4(a) with density f_{4a}(x) = a 2^{a−1}(min(x, 1 − x))^{a−1} for a = 3, 4, 5, 6, 7;
– the distribution l1(θ) belonging to M_1 with density f(x) = exp(θ b_1(x) − ψ_1(θ)), where θ = 0.4 and b_1(x) = √3(2x − 1) is the first Legendre polynomial on [0, 1].
We also considered two distributions with unbounded support:

– the normal distribution with mean 0 and σ = 0.25, 0.5, 0.75, 1;
– the exponential distribution with λ = 0.5, 1, 1.5.
In the last two cases, the following modification of the original definition of Î_f has been made. The range [X_{1:n}, X_{n:n}] of the data is mapped onto [0, 1] by a linear transformation and Î_f is calculated based on the transformed data. Then it is multiplied by (X_{n:n} − X_{1:n})^{-2}, which corresponds to the change of the Fisher information when the density is transformed back from [0, 1] to [X_{1:n}, X_{n:n}].
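A sketch of this modification (our own illustration, reusing fisher_info_estimate from the snippet in Sect. 2.3):

```python
def fisher_info_unbounded(sample, K):
    """Estimate I_f for data with unknown, possibly unbounded support."""
    x = np.asarray(sample)
    lo, hi = x.min(), x.max()
    rescaled = (x - lo) / (hi - lo)  # map [X_{1:n}, X_{n:n}] linearly onto [0, 1]
    # I_f scales by the squared inverse length of the range under a linear map
    return fisher_info_estimate(rescaled, K) / (hi - lo) ** 2
```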
4.3 Comparison of performance
We compared the estimator Î_f with the kernel estimators Ĩ_SJ, Ĩ'_SJ and their 'prophet' counterparts Ĩ_best, Ĩ_NN^best for the families described above. The results are summarized in Tables 1, 2, 3, 4 and 5, where empirical means, standard deviations and mean squared errors based on m = 10^4 repetitions are given. Also, the means of the empirical distributions of Î_f, Ĩ_SJ and Ĩ'_SJ are plotted in Fig. 2 as a function of the pertaining parameter (for the Beta distribution the range of the parameter a is larger than that given in the corresponding table). The plots clearly indicate that in most displayed cases Î_f is the least biased estimator among those considered. It follows from the tables that when the distribution belongs to an exponential family (i.e. the case of the uniform, l1 and truncated normal distributions), Î_f performed much better in terms of MSE than Ĩ_SJ and Ĩ'_SJ. In particular, for the uniform distribution the ratios of MSEs are larger than 3 for n = 100. Surprisingly, Î_f fares better than the 'prophet' estimators Ĩ_best and Ĩ_NN^best in some cases, e.g. for the l1 distribution with n = 500. The same conclusion is true in most cases for the Beta and h4 distributions (the results are not shown in the last case).
Table 1 Performance of estimators for the uniform (left, I_f = 0) and l1(0.4) (right, I_f = 1.92) distributions: empirical means, standard deviations (SD) and mean squared errors (MSE) of Î_f, Ĩ'_SJ, Ĩ_SJ, Ĩ_best and Ĩ_NN^best, for n = 100 (K = 4) and n = 500 (K = 10)
Table 2 Performance of estimators for the Beta(a, a) distribution with a = 3 (I_f = 40), 4 (I_f = 42), 5 (I_f = 48) and 6 (I_f = 55): empirical means, SD and MSE of Î_f, Ĩ'_SJ, Ĩ_SJ, Ĩ_best and Ĩ_NN^best, for n = 100 (K = 4) and n = 500 (K = 10)
The superiority of Î_f is more apparent for n = 100 and larger values of the parameter a in the cases of the Beta and h4 distributions, and of the parameter σ for the truncated normal distributions. There is only one case, namely that of the truncated normal distribution with σ = 0.5, when the MSE of Ĩ'_SJ is smaller than the MSE of Î_f. The 'prophet' methods Ĩ_best and Ĩ_NN^best, which availed themselves of the knowledge of the true density, performed best in some cases (e.g. for the truncated normal and the uniform distributions), but in general they performed on a par with or even worse than Î_f.
Table 3 Performance of estimators for the truncated normal distribution on [0, 1] with mean 0.5 and standard deviation σ = 0.25 (I_f = 12.38), 0.5 (I_f = 1.164), 0.75 (I_f = 0.248) and 1 (I_f = 0.081): empirical means, SD and MSE of Î_f, Ĩ'_SJ, Ĩ_SJ, Ĩ_best and Ĩ_NN^best, for n = 100 (K = 4) and n = 500 (K = 10)
Table 4 Performance of estimators for the normal distribution N(0, σ) with σ = 0.25 (I_f = 16), 0.5 (I_f = 4), 0.75 (I_f = 1.778) and 1 (I_f = 1): empirical means, SD and MSE of Î_f, Ĩ'_SJ, Ĩ_SJ, Ĩ_best and Ĩ_NN^best, for n = 100 (K = 4) and n = 500 (K = 10)
distributions) but in general they performed on par or even worse than Iˆ f . One of
the reasons is possibly that the criterion which has been used to find a bandwidth
best involved f ′ / f and not directly I , the other is possible superiorfor I˜best and I˜NN
f
Table 5 Performance of estimators for the exponential distribution E(λ) with λ = 0.5 (I_f = 0.25), 1 (I_f = 1) and 1.5 (I_f = 2.25): empirical means, SD and MSE of Î_f, Ĩ'_SJ, Ĩ_SJ, Ĩ_best and Ĩ_NN^best, for n = 100 (K = 4) and n = 500 (K = 10)
In the case of the two densities supported on unbounded sets the estimator Î_f was always less biased and had much smaller MSE than the kernel estimators. For the normal distribution this was true even in comparison with the 'prophet' estimators. Also, for the exponential distribution the estimator Î_f had in all cases a smaller standard deviation than the kernel estimators. Thus the proposed method seems promising also for the cases when the support of f is unknown and possibly unbounded.
4.4 A choice of penalty and comparison of Î_f^S̃ and Î_f^S

We also studied the effect of adding the family M_0 to the list and of changing the penalty in the BIC criterion. Namely, we compared the MSEs for Î_f = Î_f^S̃ and Î_f^S when the penalty k log n/2 in (11) is multiplied by C. Obviously, C = 1 corresponds to the original criterion. Due to space constraints only selected results for the uniform and the truncated normal distributions are shown, for C = 1, 2 and n = 100, in Tables 6 and 7, respectively. In general, for both estimators taking C > 2 does not lead to a decrease of MSE and in some cases (e.g. for the truncated normal with σ = 0.25 or Beta(3, 3) for n = 100 in the case of Î_f) causes its substantial increase. This also happens in the case of the l1 distribution, where adoption of a larger penalty results in a larger probability of k = 0, i.e. of a wrong decision, in the case of Î_f^S̃, whereas in the case of Î_f^S the probability of k = 1 (the correct decision) becomes larger simply because the model M_0 is excluded from the list of models. It turned out that the choice of a large penalty (C ≥ 4) in the case of the uniform and the truncated normal distributions resulted in the choice of k = 0 for all 10^4 repetitions.
Fig. 2 Empirical means of estimators for selected parametric distributions and for n = 100: (a) Beta(a, a) distribution, (b) h4(a) distribution, (c) truncated normal distribution, (d) normal distribution. Solid line Î_f, dashed line Ĩ_SJ, dotted-dashed line Ĩ'_SJ, grey line I_f
Table 6 An influence of penalty on Î_f^S (left) and Î_f^S̃ (right) for the uniform distribution on [0, 1] for n = 100, K = 4 (I_f = 0)

         Î_f^S                        Î_f^S̃
         Mean    SD      MSE          Mean    SD      MSE
C = 1    0.391   2.400   5.911        0.091   1.178   1.395
C = 2    0.146   0.572   0.349        0.002   0.056   0.003
Î_f^S̃ for C = 2 performs better than Î_f^S for the uniform distribution and for the truncated normal in the case σ ≥ 0.5, i.e. in cases when the distribution is close to the uniform. For the Beta, h4 and exponential distributions both estimators performed exactly the same. For the l1 and normal distributions Î_f^S̃ performed worse. Î_f^S̃ with C = 2 performed in general better than with C = 1. Thus it should be kept in mind that changing C from 1 to 2 in the comparisons with the other estimators discussed above would underline the superiority of Î_f^S̃ over the kernel estimators even more.
Table 7 An influence of penalty on Î^S and Î^S̃ for the truncated normal distribution on [0, 1] with mean 0.5 and standard deviation σ, for n = 100, K = 4

σ        0.25 (I_f = 12.38)       0.5 (I_f = 1.164)       0.75 (I_f = 0.248)       1 (I_f = 0.081)
         Mean   SD    MSE         Mean  SD    MSE         Mean   SD    MSE         Mean   SD    MSE
Î^S
C = 1    13.33  4.74  23.4        1.36  2.92  8.55        0.55   2.41  5.89        0.39   1.88  3.63
C = 2    12.96  4.25  18.4        0.48  1.52  2.79        0.17   0.58  0.35        0.14   0.52  0.27
Î^S̃
C = 1    13.3   4.81  24          0.58  2.31  5.66        0.13   1.23  1.53        0.12   1.29  1.68
C = 2    11.8   6.16  38.3        0.03  0.51  1.56        0.006  0.18  0.09        0.004  0.15  0.03
Acknowledgments We thank T. Ledwina for suggesting the subject of this paper, insightful comments
and supplying us with a computer program for calculating ML estimators in exponential families created
under her Ministry of National Education Grant 341 046.
References
Barron AR, Sheu CH (1991) Approximation of density functions by sequences of exponential families.
Ann Stat 19:1347–1369
Bickel P, Klassen C, Ritov Y, Wellner J (1998) Semiparametrics. In: Kotz S, Read C, Banks D (eds)
Encyclopedia of statistical sciences, vol 2. Wiley, New York, pp 602–614
Bogdan K, Bogdan M (2000) On existence of maximum likelihood estimators in exponential families.
Statistics 34:137–149
Csörgő M, Révész P (1986) A nearest neighbour-estimator for the score function. Prob Theory Relat Fields
71:293–305
Haughton DMA (1988) On the choice of a model to fit data from an exponential family. Ann Stat 16:342–
355
Inglot T, Ledwina T (1996) Asymptotic optimality of data-driven Neyman’s tests for uniformity. Ann Stat
24:1982–2019
Inglot T, Ledwina T (2001) Intermediate approach to comparison of some goodness-of-fit tests. Ann Inst
Stat Math 53:810–834
Kallenberg WCM, Ledwina T (1995) Consistency and Monte Carlo simulation of a data driven version of
smooth goodness-of-fit tests. Ann Stat 23:1594–1608
Kallenberg WCM, Ledwina T (1997) Data driven smooth test when the hypothesis is composite. J Am Stat
Assoc 92:1094–1104
Ledwina T (1994) Data-driven version of Neyman’s smooth test of fit. J Am Stat Assoc 89:1000–1005
Ledwina T (2000) Limiting behaviour of data driven Neyman’s statistic under fixed alternatives. Manuscript
Leeb H, Pötscher BM (2006) Model Selection, manuscript. http://www.stat.yale.edu/~hl284/handbook.pdf
Schuster EF (1985) Incorporating support constraints into nonparametric estimation of densities. Commun
Stat Theory Methods 14:1123–1136
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Sheather SJ, Jones MC (1991) A reliable data-based bandwidth selection method for kernel density estimation. J R Stat Soc Ser B 53:683–690
Stone CJ (1975) Adaptive maximum likelihood estimator of a location parameter. Ann Stat 3:267–284
van der Vaart AW (2000) Asymptotic statistics. Cambridge University Press, Cambridge
Yurinskii VV (1976) Exponential inequalities for sums of random vectors. J Mult Anal 6:473–499