
Estimation of Fisher Information Using Model Selection

Metrika

DOI 10.1007/s00184-009-0246-3

Jan Mielniczuk · Małgorzata Wojtyś

Received: 16 March 2008
© Springer-Verlag 2009

J. Mielniczuk (B) · M. Wojtyś
Faculty of Mathematics and Information Science, Warsaw University of Technology, Plac Politechniki 1, 00-661 Warsaw, Poland
e-mail: [email protected]

J. Mielniczuk
Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland

Abstract  In the paper the problem of estimation of the Fisher information $I_f$ for a univariate density supported on [0, 1] is discussed. A starting point is the observation that when the density belongs to an exponential family of known dimension, an explicit formula for $I_f$ allows for its simple estimation. In the general case, for a given random sample, the dimension of an exponential family which approximates it best is sought and an estimator of $I_f$ is then constructed for the chosen family. As a measure of quality of fit a modified Bayes Information Criterion is used. The estimator, which is an instance of the post model selection estimation method, is proved to be consistent and asymptotically normal when the density belongs to the exponential family. Its consistency is also proved under misspecification, when the number of exponential models under consideration increases in a suitable way. Moreover, we provide evidence that in most of the considered parametric cases the small sample performance of the proposed estimator is superior to that of kernel estimators.

Keywords  Fisher information · Post model selection estimation · Bayes information criterion (BIC) · Kernel estimator of a density · Sheather–Jones bandwidth

1 Introduction

Consider a univariate probability distribution having a density f and such that its Fisher information

$$ I_f = \int \frac{[f'(x)]^2}{f(x)}\,dx \qquad (1) $$

is finite. Obviously, $I_f = E[(f'/f)^2(X)]$, where X is a random variable distributed according to the density f. The Fisher information is a quantity appearing in many contexts in statistics and its nonparametric estimation is essential in the construction of efficient estimators and tests, see, e.g. Bickel et al. (1998). In the paper we deal with the problem of nonparametric estimation of this quantity.

An obvious method to do this would be to estimate the density and its derivative by one of numerous nonparametric functional estimation methods and then use the definition of $I_f$ to construct a plug-in type estimator. This is the approach used, e.g. in Stone (1975) and van der Vaart (2000, Section 25.8.1). However, such an estimator of $f'(x)/f(x)$ could behave unstably for x corresponding to small values of the density. A remark by the last cited author (p. 397), who used truncation of both the density and the derivative estimators, that this construction 'is not necessarily recommended for use in practice' is telling. Moreover, with such an approach one faces the delicate problem of choosing smoothing parameters for the estimators of both the density and its derivative. Note also that $I_f$ equals the integral of the squared optimal score $J(y) = (f'/f)(F^{-1}(y))$. Thus another possibility is to apply this representation of $I_f$ in terms of the optimal score function and use a direct estimator of it which does not involve separate estimation of f and its derivative. An example of such a direct estimator is the NN estimator proposed by Csörgő and Révész (1986) and defined in Sect. 2.3.
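As a simple point of reference for the estimators discussed in this paper, definition (1) can be evaluated by numerical quadrature whenever f is known in closed form. The sketch below (plain Python with SciPy, using the Beta(3,3) density, one of the test cases of Sect. 4.2) is only such a reference computation under these assumptions and is not part of the estimation procedures studied here.

```python
from scipy import integrate

def fisher_information(pdf, dpdf, support=(0.0, 1.0)):
    """Evaluate I_f = int (f'(x))^2 / f(x) dx of (1) by adaptive quadrature."""
    integrand = lambda x: dpdf(x) ** 2 / pdf(x)
    value, _ = integrate.quad(integrand, *support)
    return value

# Beta(3,3) density on [0,1]: f(x) = 30 x^2 (1-x)^2, with derivative in closed form.
pdf = lambda x: 30.0 * x ** 2 * (1.0 - x) ** 2
dpdf = lambda x: 60.0 * x * (1.0 - x) * (1.0 - 2.0 * x)

print(fisher_information(pdf, dpdf))  # about 40, the value of I_f reported for Beta(3,3) in Table 2
```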
Here we take a different route. Assume that f is supported on [0, 1] and consider functions $b_1(x),\ldots,b_k(x)$ defined in (2)–(3), pertaining to the k-dimensional exponential family $\mathcal{M}_k$. If $f\in\mathcal{M}_k$ then $I_f = E\{\sum_{j=1}^k \theta_j b_j'(X)\}^2$ (cf. (14)), where X is distributed according to the density f. This equality is used to construct an estimator of $I_f$. In general, when it is not known whether f belongs to an exponential family, we construct a parsimonious exponential model $\mathcal{M}_k$ well approximating f in terms of a modified Bayesian Information Criterion (BIC) due to Schwarz (1978) and use the estimator constructed for this very model. The procedure, known as post model selection estimation, is described in detail below. An important point is that nonparametric estimation of the density f and its derivative is entirely avoided and the whole procedure relies on estimation of the dimension k and parametric estimation of θ by the maximum likelihood method in the exponential family. We also note that the introduced method can be applied to estimate other functionals of a density, such as its entropy or the integral of its squared derivative of a given order.

In the paper we prove consistency and asymptotic normality of the introduced estimator when the underlying distribution belongs to one of the parametric models under consideration and discuss some properties of the related selection rule. Its consistency is also proved under misspecification, when the underlying density is sufficiently well approximated by the models on the list (cf. Theorem 5). Moreover, we compare its performance with plug-in estimators and with an estimator based on the Csörgő and Révész estimator of the score function.

The paper is structured as follows. Section 2 contains the necessary preliminary information on exponential families as well as on post model selection estimators (PMSE) and introduces the proposed estimator of $I_f$ together with its competitors. In Sect. 3 the main results of the paper are stated and proved. The last section is devoted to a discussion of the results of a simulation study, in which the case of a density with an unknown, possibly unbounded, support is also addressed.

2 Preliminaries

2.1 Exponential family

Let $\mathcal{M}_k$, for k ≥ 1, be the k-dimensional exponential family

$$ \mathcal{M}_k = \{ f(\cdot,\theta) : \theta \in \mathbb{R}^k \}, \qquad (2) $$

where $f(\cdot,\theta) = f_\theta(\cdot)$ is a density function with respect to Lebesgue measure λ on [0, 1] defined as

$$ f(x,\theta) = \exp\{\theta \circ b(x) - \psi_k(\theta)\}, \qquad (3) $$

where $\theta = (\theta_1,\ldots,\theta_k)^T \in \mathbb{R}^k$ and $b(x) = (b_1(x),\ldots,b_k(x))^T$ is a vector of known bounded and differentiable functions such that $\{1, b_1(x),\ldots,b_k(x)\}$ form an orthonormal system in $(L^2([0,1]),\lambda)$. The symbol ∘ denotes throughout the inner product in $\mathbb{R}^k$; for $x\in\mathbb{R}^k$, $y\in\mathbb{R}^j$, when j ≤ k, we define $x\circ y = \sum_{i=1}^j x_i y_i$. Since f is a density on [0, 1], the normalizing constant $\psi_k(\theta)$ is equal to

$$ \psi_k(\theta) = \log \int_0^1 \exp\{\theta\circ b(x)\}\,dx. \qquad (4) $$

Observe that $\psi_k(\theta)$ is finite for any $\theta\in\mathbb{R}^k$ as b(·) is bounded and f(·,θ) has compact support. Thus $\mathcal{M}_k$ is regular, which by definition means that the natural parameter set is open. We assume throughout that the family $\mathcal{M}_k$ is of full rank, i.e. no linear combination $\sum_{j=1}^k \lambda_j b_j(X)$, for $(\lambda_1,\ldots,\lambda_k)\neq 0$, is constant with probability 1 for any $X\sim f\in\mathcal{M}_k$. This is equivalent to the condition that the covariance matrix $\mathrm{Cov}_\theta\, b(X)$ of the vector b(X) is positive definite for every $\theta\in\mathbb{R}^k$. Observe that the families $\mathcal{M}_k$ are nested, i.e. $\mathcal{M}_k\subset\mathcal{M}_l$ for k < l when the parameter space of $\mathcal{M}_k$ is embedded in $\mathbb{R}^l$ by appending $\theta\in\mathbb{R}^k$ with l − k zeros.
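The family (2)–(3) is straightforward to handle numerically. Below is a minimal sketch, assuming the Legendre polynomial basis that is actually used in the simulations of Sect. 4 (any other bounded, differentiable orthonormal basis could be substituted); the normalizing constant (4) is obtained by quadrature.

```python
import numpy as np
from numpy.polynomial import legendre
from scipy import integrate

def legendre_basis(k):
    """b_1, ..., b_k: Legendre polynomials rescaled to [0,1] and normalised so that
    {1, b_1, ..., b_k} is orthonormal in L^2([0,1], dx); b_1(x) = sqrt(3)(2x - 1)."""
    funcs = []
    for j in range(1, k + 1):
        coef = np.zeros(j + 1)
        coef[j] = 1.0
        p = legendre.Legendre(coef, domain=[0.0, 1.0])   # P_j composed with x -> 2x - 1
        funcs.append(lambda x, c=np.sqrt(2 * j + 1), p=p: c * p(x))
    return funcs

def psi(theta, basis):
    """Normalising constant psi_k(theta) of (4), computed by quadrature."""
    lin = lambda x: sum(t * b(x) for t, b in zip(theta, basis))
    value, _ = integrate.quad(lambda x: np.exp(lin(x)), 0.0, 1.0)
    return np.log(value)

def density(x, theta, basis):
    """f(x, theta) = exp{theta o b(x) - psi_k(theta)} of (3)."""
    lin = sum(t * b(x) for t, b in zip(theta, basis))
    return np.exp(lin - psi(theta, basis))
```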
Below we summarize some basic facts concerning maximum likelihood estimators (MLE) of the parameter θ in an exponential family (cf. van der Vaart 2000, Sect. 4.2). Let $X_1,\ldots,X_n$ be i.i.d. observations whose law belongs to a k-dimensional exponential family. Then for $\theta\in\mathbb{R}^k$ the log-likelihood function $\mathcal{L}_k(\theta)$ is of the form

$$ \mathcal{L}_k(\theta) = \log\prod_{i=1}^n f(X_i,\theta) = n\{\theta\circ Y_n - \psi_k(\theta)\}, \qquad (5) $$

where $Y_n = (1/n)\sum_{i=1}^n b(X_i)$. For $\theta\in\mathbb{R}^k$ define a vector-valued function $e_k(\theta) = (e_{k,1}(\theta),\ldots,e_{k,k}(\theta))^T = E_\theta b(X)$, where X is a random variable distributed according to the density $f(\cdot,\theta)\in\mathcal{M}_k$. In view of $\psi_k'(\theta) = e_k(\theta)$ and $\psi_k''(\theta) = \mathrm{Cov}_\theta\, b(X)$ (van der Vaart 2000, p. 38), we have

$$ \mathcal{L}_k'(\theta) = n[Y_n - e_k(\theta)], \qquad \mathcal{L}_k''(\theta) = -n\,\mathrm{Cov}_\theta\, b(X). \qquad (6) $$

For $k\in\{1,\ldots,K\}$, where $K\in\mathbb{N}$ is some fixed integer, define

$$ \mathcal{L}_k = \sup_{\theta\in\mathbb{R}^k} \mathcal{L}_k(\theta), \qquad (7) $$

and let $\hat\theta = \hat\theta_n^k\in\mathbb{R}^k$ (with the superscript indicating the dimension of the vector) be an MLE of the parameter $\theta = \theta^k\in\mathbb{R}^k$ in the k-dimensional exponential family $\mathcal{M}_k$. As $\hat\theta_n^k$ is a point at which the maximum of the log-likelihood $\mathcal{L}_k(\theta)$ is attained, it is the solution of the equation $\mathcal{L}_k'(\theta)=0$, which in the exponential family is equivalent to the condition $Y_n = e_k(\theta)$. Moreover, we have

$$ e_k'(\theta) = \mathrm{Cov}_\theta\, b(X), \qquad (8) $$

where $e_k'(\theta) = \big(\partial e_{k,i}(\theta)/\partial\theta_j\big)_{1\le i,j\le k}$. Since $\mathcal{M}_k$ is assumed to be of full rank, $e_k$ is one-to-one and if $\hat\theta_n^k$ exists then

$$ \hat\theta_n^k = e_k^{-1}(Y_n). \qquad (9) $$

For $j=1,\ldots,k$ define a vector $a_j = [b_j(X_i)]_{1\le i\le n}$. We assume that $a_1,\ldots,a_k$ are linearly independent with probability 1, which is a sufficient condition for the existence of the MLE $\hat\theta_n^k$ (see Bogdan and Bogdan 2000). This condition also implies that the family $\mathcal{M}_k$ is of full rank, so that $\hat\theta_n^k$ is an argument of the global maximum of the log-likelihood. Thus $\mathcal{L}_k(\hat\theta_n^k) = \mathcal{L}_k$. Moreover, if $\theta\in\mathbb{R}^k$ is the true value of the parameter, then we have $\hat\theta_n^k\xrightarrow{P}\theta$ and

$$ \sqrt n(\hat\theta_n^k - \theta) \xrightarrow{D} N\big(0, [\mathrm{Cov}_\theta\, b(X)]^{-1}\big), \qquad (10) $$

where $\xrightarrow{P}$ and $\xrightarrow{D}$ denote convergence in probability and in distribution, respectively.

2.2 Schwarz rule and post model selection estimators

In the preceding section we assumed that the density pertaining to the sample $X_1,X_2,\ldots,X_n$ belongs to an exponential family of known dimension k. If this is not the case, i.e. either the exponential model is misspecified or k is not known, we may first want to estimate the dimension of a model which, in a sense specified below, describes the sample $X_1,X_2,\ldots,X_n$ well, and then compute an estimator $\hat\theta$ of the parameter θ in the family with dimension equal to the found value of the estimator of k. An estimator of θ constructed in this manner is known as a PMSE (see, e.g. Leeb and Pötscher 2006 for a general introduction to the subject). One of the methods for estimating the dimension of a model is Schwarz's (1978) BIC. According to it we should choose the family $\mathcal{M}_k$, $k\in\{1,\ldots,K\}$, where K is some preassigned integer, for which the quantity

$$ \bar L_k = \mathcal{L}_k - \frac12 k\log n = n\sup_{\theta\in\mathbb{R}^k}\{\theta\circ Y_n - \psi_k(\theta)\} - \frac12 k\log n \qquad (11) $$

is largest. Equivalently, the chosen dimension S of the model is given by $S = \min\{k : 1\le k\le K,\ \bar L_k \ge \bar L_j,\ j=1,\ldots,K\}$. The selection rule S has been applied in Ledwina (1994) in order to construct an adaptive version of Neyman's smooth goodness-of-fit test and has been intensively studied thereafter.
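In view of (5) the MLE can be obtained by maximizing $\theta\circ Y_n - \psi_k(\theta)$ numerically, and the penalized criterion (11) then follows directly. A minimal sketch, assuming the `psi` and `legendre_basis` helpers from the previous snippet and a generic optimizer; the value k = 0 in the second function anticipates the modification $\tilde S$ introduced in (12) below. It is meant only to illustrate the computations, not to reproduce the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def mle(sample, k, basis):
    """Maximise L_k(theta) = n{theta o Y_n - psi_k(theta)} of (5); psi is the helper
    from the previous sketch.  Returns (theta_hat, sup_theta L_k(theta))."""
    n = len(sample)
    Yn = np.array([np.mean([b(x) for x in sample]) for b in basis[:k]])
    objective = lambda th: psi(th, basis[:k]) - th @ Yn     # equals -L_k(theta)/n
    res = minimize(objective, x0=np.zeros(k), method="BFGS")
    return res.x, -n * res.fun

def schwarz_dimension(sample, K, basis):
    """Penalised criterion (11) over k = 0, ..., K; k = 0 stands for the uniform model
    M_0 with bar L_0 = 0, i.e. the modified rule (12) below."""
    n = len(sample)
    crit, thetas = [0.0], [np.zeros(0)]
    for k in range(1, K + 1):
        th, Lk = mle(sample, k, basis)
        crit.append(Lk - 0.5 * k * np.log(n))
        thetas.append(th)
    s = int(np.argmax(crit))            # argmax returns the smallest maximiser
    return s, thetas[s]
```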
In the context of estimation problems we found it advantageous to modify the definition of S slightly by adding the value 0 to the set of possible values. For this purpose we modify S in the following way. The symbol $\mathcal{M}_0$ will denote the model consisting of the uniform density on [0, 1],

$$ \mathcal{M}_0 = \{ f\in\mathcal{M}_1 : \theta = 0 \}. $$

Let $\mathcal{L}_0 = 0$. Observe that as $\mathcal{M}_0$ consists solely of the uniform density, $\mathcal{L}_0$ formally equals the supremum of the log-likelihood for this model, which is consistent with (7) for k = 0. Moreover, we extend the definition of $\bar L_k$ in (11) to k = 0 by letting $\bar L_0 = 0$. We introduce the estimator $\tilde S$ of the dimension k of the model $\mathcal{M}_k$, $k\in\{0,1,\ldots,K\}$, by

$$ \tilde S = \min\{k : 0\le k\le K,\ \bar L_k \ge \bar L_j,\ j=0,\ldots,K\}. \qquad (12) $$

Then the PMSE $\hat\theta$ of the parameter θ is defined as $\hat\theta = \hat\theta^{\tilde S}$, where $\hat\theta^{\tilde S}$ is the MLE of the parameter θ in the $\tilde S$-dimensional exponential family $\mathcal{M}_{\tilde S}$. Note that if $\tilde S = 0$ then we infer that the random sample pertains to the uniform distribution. Moreover,

$$ \tilde S = \begin{cases} S & \text{if } \bar L_S > 0; \\ 0 & \text{otherwise}. \end{cases} \qquad (13) $$

Let $\mathrm{KL}(f, f_\theta) = \int \log(f/f_\theta)\, f$ be the Kullback–Leibler distance between f and $f_\theta$. We note finally that as $E_f \mathcal{L}_k(\theta)/n$ equals $-\mathrm{KL}(f, f_\theta) + \int \log(f)\, f$, maximization of $\bar L_k$ with respect to k can be viewed as choosing a model such that a penalized empirical KL distance between f and the model is smallest.

2.3 Estimators of Fisher information

The Fisher information pertaining to the density f is defined as in (1) and we assume throughout that $I_f$ is finite. Consider the situation when $f(\cdot) = f(\cdot,\theta)\in\mathcal{M}_k$, 1 ≤ k ≤ K, where $\mathcal{M}_k$ is defined in (2). Then it is easy to see that for $X\sim f(\cdot,\theta)$

$$ I_f = \int_0^1 \Big\{\sum_{j=1}^k \theta_j b_j'(x)\Big\}^2 f(x,\theta)\,dx = E\Big\{\sum_{j=1}^k \theta_j b_j'(X)\Big\}^2. \qquad (14) $$

Thus in the case when k is assumed known, a natural estimator of $I_f$ is

$$ \hat I_f^k = \frac1n\sum_{i=1}^n \Big\{\sum_{j=1}^k \hat\theta_j b_j'(X_i)\Big\}^2 = \frac1n\sum_{i=1}^n (\hat\theta\circ b'(X_i))^2, \qquad (15) $$

where $X_1,X_2,\ldots,X_n$ is an i.i.d. sample pertaining to $f(\cdot,\theta)$ and $\hat\theta = \hat\theta_n^k$ is the MLE of θ in $\mathcal{M}_k$ based on $X_1,X_2,\ldots,X_n$. Moreover, we define $\hat I_f^k = 0$ for k = 0. Note that $\hat\theta$ and $\hat I_f$ are estimated using the same sample. In the general case, when no assumption about f belonging to one of the parametric models $\mathcal{M}_k$, 0 ≤ k ≤ K, is made, we propose the following estimator of $I_f$:

$$ \hat I_f = \hat I_f^{\tilde S}, \qquad (16) $$

i.e. $\hat I_f = \hat I_f^k$ when $\tilde S = k$. Thus we choose the parametric model $\mathcal{M}_k$ with $k = \tilde S$, which yields the best parsimonious fit to the data with respect to the modified Schwarz criterion (12), and then we estimate $I_f$ in the chosen family $\mathcal{M}_{\tilde S}$ using (15). Thus $\hat I_f$ is an instance of the PMSE discussed in Sect. 2.2. Estimator (16) is the main object studied in this paper; in the next section we prove its consistency and asymptotic normality. In Sect. 4 we will compare the small sample behaviour of $\hat I_f$ with two other estimators, which are now briefly discussed.
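A minimal sketch of (15)–(16), again assuming the Legendre basis of Sect. 4 and the `schwarz_dimension` helper sketched above; the derivatives of the basis functions are obtained from numpy's Legendre class. It is an illustration of the construction, not the authors' code.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_basis_deriv(k):
    """Derivatives b_1', ..., b_k' of the orthonormal Legendre basis on [0,1]."""
    funcs = []
    for j in range(1, k + 1):
        coef = np.zeros(j + 1)
        coef[j] = 1.0
        dp = legendre.Legendre(coef, domain=[0.0, 1.0]).deriv()   # d/dx on the domain [0,1]
        funcs.append(lambda x, c=np.sqrt(2 * j + 1), dp=dp: c * dp(x))
    return funcs

def fisher_info_pmse(sample, s_tilde, theta_hat):
    """hat I_f = hat I_f^{S_tilde} of (15)-(16); s_tilde and theta_hat may be taken
    from schwarz_dimension above.  S_tilde = 0 gives hat I_f = 0."""
    if s_tilde == 0:
        return 0.0
    db = legendre_basis_deriv(s_tilde)
    scores = [sum(t * d(x) for t, d in zip(theta_hat, db)) for x in sample]
    return float(np.mean(np.square(scores)))
```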
Kernel estimator of $I_f$. An obvious class of estimators of $I_f$ is defined as follows:

$$ \tilde I_f = \frac1n\sum_{i=1}^n \left(\frac{\hat f'(X_i)}{\hat f(X_i)}\right)^2, \qquad (17) $$

where $\hat f$ and $\hat f'$ are some estimators of the density f and its derivative f′, respectively. In the following we will use as $\hat f$ a kernel estimator with boundary adjustment (see, e.g. Schuster 1985)

$$ \hat f(x) = \frac1n\sum_{i=1}^n \big[ K_h(x - X_i) + K_h(x + X_i) + K_h(x + X_i - 2) \big], \qquad (18) $$

where $K_h(x) = h^{-1}K(x/h)$, K is a chosen probability density and $h = h_n$ is a sequence of bandwidths such that $h_n\to 0$ and $nh_n\to\infty$ as $n\to\infty$. Denote the bandwidth used for density estimation by $h_1 = h_{1,n}$. Moreover, f′ is estimated by the derivative of $\hat f(x)$ with bandwidth $h_2 = h_{2,n}$. The considered choices of bandwidths are discussed in Sect. 4.1.

NN-estimator of $I_f$. The second estimator is based on an estimator of the optimal score function J(y), defined as the derivative of $f(F^{-1}(y))$. It is easy to check that $I_f = \int_0^1 J^2(y)\,dy$. The NN-estimator of J(y) was introduced by Csörgő and Révész (1986) and is motivated as follows. Let $k_n$ be a sequence of positive even integers and let $X_{i:n}$ denote the i-th order statistic in the X-sequence. Observe that $f(F^{-1}(u))$ may be estimated by $(k_n/n)(X_{i+k_n/2:n} - X_{i-k_n/2:n})^{-1}$, where i = [nu]. Moreover, J(y) is approximated by

$$ \int K_h(y-u)\, d f(F^{-1}(u)) = \int K_h'(y-u)\, f(F^{-1}(u))\, du. $$

Plugging in the estimator of $f(F^{-1}(u))$ and letting $h_n = k_n/n$ we arrive at

$$ \hat J_n(y) = \frac1{k_n} \sum_{i=k_n/2}^{n-k_n/2} K'\!\left(\frac{y - i/n}{h_n}\right) \frac{1}{X_{i+k_n/2:n} - X_{i-k_n/2:n}}. \qquad (19) $$

Finally, the estimator of $I_f$ is defined as $n^{-1}\sum_{i=1}^n \hat J_n^2(i/n)$.
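A minimal sketch of (17)–(18) with the Gaussian kernel used in Sect. 4; the bandwidths h1 and h2 are left as arguments, and the Sheather–Jones and 'prophet' choices of Sect. 4.1 are not reproduced here.

```python
import numpy as np

def kde_reflect(x, sample, h):
    """Boundary-corrected kernel density estimate (18) with the Gaussian kernel."""
    sample = np.asarray(sample)
    pts = np.concatenate([sample, -sample, 2.0 - sample])   # reflection at 0 and at 1
    u = (np.asarray(x)[:, None] - pts[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(sample) * h * np.sqrt(2.0 * np.pi))

def kde_reflect_deriv(x, sample, h):
    """Derivative of (18); each Gaussian bump K_h is differentiated analytically."""
    sample = np.asarray(sample)
    pts = np.concatenate([sample, -sample, 2.0 - sample])
    u = (np.asarray(x)[:, None] - pts[None, :]) / h
    return (-u * np.exp(-0.5 * u ** 2)).sum(axis=1) / (len(sample) * h ** 2 * np.sqrt(2.0 * np.pi))

def fisher_info_plugin(sample, h1, h2):
    """Plug-in estimator (17) with bandwidth h1 for the density and h2 for its derivative."""
    x = np.asarray(sample)
    return float(np.mean((kde_reflect_deriv(x, x, h2) / kde_reflect(x, x, h1)) ** 2))
```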
3 Main results

We first prove an auxiliary result on the consistency of the selection rule S. The result has been stated in Ledwina (1994) and the proof there used Haughton's (1988) result for general k-dimensional exponential families. We give a direct proof here.

Lemma 1  If $X_1,\ldots,X_n$ are i.i.d. with density $f(\cdot,\theta)\in\mathcal{M}_k$, where 1 ≤ k ≤ K, and $\theta_k\neq 0$, then $\lim_{n\to\infty} P(S = k) = 1$.

Proof  First we show that $\lim_{n\to\infty} P(S < k) = 0$. We have

$$ P(S<k) = \sum_{l=1}^{k-1} P(S=l) \le \sum_{l=1}^{k-1} P(\bar L_l \ge \bar L_k). $$

Note that

$$ \sup_{t\in\mathbb{R}^k}\{t\circ Y_n - \psi_k(t)\} \xrightarrow{P} \sup_{t\in\mathbb{R}^k}\{t\circ (e_{k,1}(\theta),\ldots,e_{k,k}(\theta)) - \psi_k(t)\} \qquad (20) $$

and for $l\in\{1,\ldots,k-1\}$

$$ \sup_{t\in\mathbb{R}^l}\{t\circ Y_n - \psi_l(t)\} \xrightarrow{P} \sup_{t\in\mathbb{R}^l}\{t\circ (e_{k,1}(\theta),\ldots,e_{k,l}(\theta)) - \psi_l(t)\}, \qquad (21) $$

which follows from the fact that the function $Y\mapsto \sup_{t\in\mathbb{R}^i}\{t\circ Y - \psi_i(t)\}$ is convex and hence continuous, i = l, k. Moreover, with $(t,0)\in\mathbb{R}^k$ denoting t appended with k − l zero coordinates, we have

$$ \sup_{t\in\mathbb{R}^l}\{t\circ(e_{k,1}(\theta),\ldots,e_{k,l}(\theta)) - \psi_l(t)\} = \sup_{(t,0)\in\mathbb{R}^k}\{(t,0)\circ(e_{k,1}(\theta),\ldots,e_{k,k}(\theta)) - \psi_k((t,0))\} < \sup_{t\in\mathbb{R}^k}\{t\circ(e_{k,1}(\theta),\ldots,e_{k,k}(\theta)) - \psi_k(t)\}, $$

as the last supremum is uniquely attained at t = θ and $\theta_k\neq 0$. From this inequality together with (20) and (21) we have

$$ \sup_{t\in\mathbb{R}^k}\{t\circ Y_n - \psi_k(t)\} - \sup_{t\in\mathbb{R}^l}\{t\circ Y_n - \psi_l(t)\} - (k-l)\frac{\log n}{2n} \xrightarrow{P} a > 0, $$

where $a = \sup_{t\in\mathbb{R}^k}\{t\circ(e_{k,1}(\theta),\ldots,e_{k,k}(\theta)) - \psi_k(t)\} - \sup_{t\in\mathbb{R}^l}\{t\circ(e_{k,1}(\theta),\ldots,e_{k,l}(\theta)) - \psi_l(t)\}$. Hence

$$ P(\bar L_l \ge \bar L_k) = P\left( \sup_{t\in\mathbb{R}^k}\{t\circ Y_n - \psi_k(t)\} - \sup_{t\in\mathbb{R}^l}\{t\circ Y_n - \psi_l(t)\} - (k-l)\frac{\log n}{2n} \le 0 \right) \to 0 $$

and $\lim_{n\to\infty} P(S<k) = 0$. Now, for $l\in\{k+1,\ldots,K\}$ we have

$$ P(S=l) \le P(\bar L_l > \bar L_k) = P\left( \mathcal{L}_l - \mathcal{L}_k > \tfrac12 (l-k)\log n \right). $$

Note that $\mathcal{L}_l - \mathcal{L}_k \le \mathcal{L}_l - \mathcal{L}_l(\theta^l)$, where $\theta^l\in\mathbb{R}^l$ equals θ appended with l − k zero coordinates. Now the result follows from a two-term Taylor expansion of $\mathcal{L}_l - \mathcal{L}_l(\theta^l)$, the fact that $\sqrt n(\hat\theta_n^l - \theta^l) = O_P(1)$ and the continuity of $\mathcal{L}_l''(\cdot)$, which together imply that $\mathcal{L}_l - \mathcal{L}_l(\theta^l) = O_P(1)$. ⊓⊔

We now prove that the selection rule $\tilde S$ is consistent.

Theorem 1  Let $X_1,\ldots,X_n$ be i.i.d. with density $f(\cdot,\theta)\in\mathcal{M}_k$, 0 ≤ k ≤ K. If k ≥ 1 and $\theta_k\neq 0$ then $\lim_{n\to\infty} P(\tilde S = k) = 1$. If θ = 0 then $\lim_{n\to\infty} P(\tilde S = 0) = 1$.

Proof  Assume first that $\theta_k\neq 0$. Then $\sup_{t\in\mathbb{R}^k}\{t\circ(e_{k,1}(\theta),\ldots,e_{k,k}(\theta)) - \psi_k(t)\} = \sup_{t\in\mathbb{R}^k} E_\theta \log f(X_1,t) = E_\theta\log f(X_1,\theta) > E_\theta\log f(X_1,0) = 0$ and, arguing as in the proof of Lemma 1, we obtain that

$$ \sup_{t\in\mathbb{R}^k}\{t\circ Y_n - \psi_k(t)\} - k\log n/(2n) \xrightarrow{P} \sup_{t\in\mathbb{R}^k}\{t\circ(e_{k,1}(\theta),\ldots,e_{k,k}(\theta)) - \psi_k(t)\} > 0. $$

Hence $\bar L_k \xrightarrow{P} \infty$, which in view of (13) and Lemma 1 implies that $\lim_{n\to\infty} P(\tilde S = k) = \lim_{n\to\infty} P(S=k) = 1$. Now assume that θ = 0. Then $E_\theta b(X_1) = 0$ and, similarly to the proof of Theorem 2, we can show that $\bar L_l \xrightarrow{P} -\infty$ for $l\in\{1,\ldots,K\}$. Hence $\lim_{n\to\infty} P(\bar L_l > 0) = 0$ for $l\in\{1,\ldots,K\}$ and $\lim_{n\to\infty} P(\tilde S = 0) = 1$. ⊓⊔

Theorem 2  If $X_1,\ldots,X_n$ are i.i.d. with an arbitrary distribution P on [0, 1] and $E_P b_1(X_1) = \cdots = E_P b_{k-1}(X_1) = 0$, $E_P b_k(X_1)\neq 0$ for some 1 ≤ k ≤ K, then $\lim_{n\to\infty} P(\tilde S \ge k) = 1$.

Proof  Kallenberg and Ledwina (1995, p. 1600) proved that the above assumptions imply $\bar L_k \xrightarrow{P} \infty$. Thus to complete the proof it suffices to show that $\bar L_l \xrightarrow{P} -\infty$ for $l\in\{1,\ldots,k-1\}$. However, as $b_1(X),\ldots,b_k(X)$ may be correlated, Prohorov's inequality, as suggested in the last reference, cannot be used, and we use Yurinskii's inequality instead. Let M > 0 and 0 < ε < 2/3. We have

$$ P(\bar L_l \ge -M) = P\left( \sup_{t\in\mathbb{R}^l}\{t\circ Y_n - \psi_l(t)\} \ge (l\log n - 2M)/(2n) \right). $$

Using Theorem 7.4 in Inglot and Ledwina (1996) we get, for sufficiently large n,

$$ P\left( \sup_{t\in\mathbb{R}^l}\{t\circ Y_n - \psi_l(t)\} \ge (l\log n - 2M)/(2n) \right) \le P\left( \|Y_n\|^2 \ge (2-\varepsilon)(l\log n - 2M)/(2n) \right) = P\left( \Big\|\sum_{i=1}^n b(X_i)\Big\| \ge [n(2-\varepsilon)(l\log n - 2M)/2]^{1/2} \right). $$

The random vector $b(X_1) = (b_1(X_1),\ldots,b_l(X_1))^T$ satisfies Cramér's condition:

$$ E_P b(X_1) = 0, \qquad E_P\|b(X_1)\|^s \le \frac{s!}{2}\, b^2 K^{s-2}, \quad s = 2,3,\ldots, $$

with $K = \sqrt l\, V_l$ and $b^2 = l V_l^2$, where $V_l = \max_{1\le j\le l}\sup_{x\in[0,1]}|b_j(x)|$. Thus we can apply Yurinskii's (1976) inequality (see also Inglot and Ledwina 1996, p. 2011):

$$ P\left( \Big\|\sum_{i=1}^n b(X_i)\Big\| \ge [n(2-\varepsilon)(l\log n - 2M)/2]^{1/2} \right) \le 2\exp\left\{ -\frac{x_n^2}{2}\left(1 + 1.62\,\frac{x_n K}{B_n}\right)^{-1} \right\}, $$

where $x_n = [(2-\varepsilon)(l\log n - 2M)/(2lV_l^2)]^{1/2}$ and $B_n = (nb^2)^{1/2} = V_l(nl)^{1/2}$. Since $x_n\to\infty$ and $x_n/B_n\to 0$ as $n\to\infty$, we have $\lim_{n\to\infty} P(\bar L_l \ge -M) = 0$. Thus $\bar L_l \xrightarrow{P} -\infty$ as $n\to\infty$. ⊓⊔

Remark 1  Another possible method to choose the dimension of a model, suggested in the literature, is the so-called simplified Schwarz selection rule, which uses Neyman's smooth test statistic $T_k$ defined below instead of the supremum $\mathcal{L}_k$ of the log-likelihood function. Namely, let

$$ T_k = \sum_{j=1}^k \left( \frac1{\sqrt n}\sum_{i=1}^n b_j(X_i) \right)^2. $$

Then the selection rule introduced in Kallenberg and Ledwina (1997) is defined by $S_2 = \min\{k : 1\le k\le K,\ T_k - k\log n \ge T_j - j\log n,\ j=1,\ldots,K\}$. An obvious advantage of $S_2$ over S is its computational simplicity, as calculation of the MLE is avoided. In Ledwina (2000) it is shown that under the assumptions of Theorem 2, $\lim_{n\to\infty} P(S_2\ge k) = 1$. Thus a similar criterion based on the augmented family of models including $\mathcal{M}_0$ can be defined as $\tilde S_2 = \min\{k : 0\le k\le K,\ T_k - k\log n \ge T_j - j\log n,\ j=0,\ldots,K\}$, where $T_0 = 0$. If the assumptions of Theorem 2 are satisfied then $\lim_{n\to\infty} P(\tilde S_2 \ge k) = \lim_{n\to\infty} P(S_2 \ge k) = 1$. When $X_1,\ldots,X_n$ are independent and uniformly distributed on [0, 1] then $\lim_{n\to\infty} P(S_2 = 1) = 1$ (see Inglot and Ledwina 2001, p. 814). This, together with the fact that $T_1\xrightarrow{D}\chi_1^2$, yields $T_j - j\log n \xrightarrow{P} -\infty$ for $j=1,2,\ldots,K$. Hence $\lim_{n\to\infty} P(\tilde S_2 = 0) = 1$. Note, however, that in the case of Fisher information we use ML estimators for computing $\hat I_f$, and the advantage of $\tilde S_2$ over $\tilde S$ is lost.
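A minimal sketch of the simplified rule of Remark 1, again assuming the `legendre_basis` helper from the sketches in Sect. 2 (the names are illustrative only):

```python
import numpy as np

def neyman_statistic(sample, k, basis):
    """Neyman's smooth statistic T_k of Remark 1."""
    n = len(sample)
    comps = [sum(b(x) for x in sample) / np.sqrt(n) for b in basis[:k]]
    return float(np.sum(np.square(comps)))

def simplified_schwarz(sample, K, basis):
    """Simplified rule S_tilde_2 of Remark 1 (T_0 = 0 corresponds to the uniform model M_0)."""
    n = len(sample)
    crit = [0.0] + [neyman_statistic(sample, k, basis) - k * np.log(n) for k in range(1, K + 1)]
    return int(np.argmax(crit))           # the smallest maximiser
```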
Theorem 3  If $X_1,\ldots,X_n$ are i.i.d. with density $f(\cdot,\theta)\in\mathcal{M}_k$, 0 ≤ k ≤ K, and $E\|b'(X_1)\|^2 < \infty$, then $\hat I_f\xrightarrow{P} I_f$.

Proof  If $f\in\mathcal{M}_0$ then $P(\hat I_f^k = I_f) \ge P(\tilde S = 0)\to 1$ as $n\to\infty$ in view of Theorem 1. Let $\theta\in\mathbb{R}^k$, $\theta_k\neq 0$ and $\hat\theta_n^k = (\hat\theta_{1,n}^k,\ldots,\hat\theta_{k,n}^k)$. Note that $P(\hat I_f\to I_f) \ge P(\{\hat I_f^k\to I_f\}\cap\{\tilde S = k\})$ and since $P(\tilde S = k)\to 1$ as $n\to\infty$ it suffices to show that $\hat I_f^k\xrightarrow{P} I_f$. We have

$$ \hat I_f^k = \frac1n\sum_{i=1}^n (\hat\theta^k\circ b'(X_i))^2 = \frac1n\sum_{i=1}^n \Big( \sum_{j=1}^k (\hat\theta_{j,n}^k - \theta_j) b_j'(X_i) + \sum_{j=1}^k \theta_j b_j'(X_i) \Big)^2 = I_{n,1} + I_{n,2} + I_{n,3}, $$

where

$$ I_{n,1} = \frac1n\sum_{i=1}^n \Big( \sum_{j=1}^k (\hat\theta_{j,n}^k - \theta_j) b_j'(X_i) \Big)^2, \qquad I_{n,2} = \frac1n\sum_{i=1}^n \Big( \sum_{j=1}^k \theta_j b_j'(X_i) \Big)^2, $$

$$ I_{n,3} = \frac2n\sum_{i=1}^n \Big[ \Big( \sum_{j=1}^k (\hat\theta_{j,n}^k - \theta_j) b_j'(X_i) \Big) \Big( \sum_{j=1}^k \theta_j b_j'(X_i) \Big) \Big]. $$

By the weak law of large numbers we have $I_{n,2}\xrightarrow{P} I_f$. Using the Schwarz inequality we obtain

$$ I_{n,1} \le \frac kn\sum_{i=1}^n \sum_{j=1}^k (\hat\theta_{j,n}^k - \theta_j)^2 (b_j'(X_i))^2 = k\sum_{j=1}^k (\hat\theta_{j,n}^k - \theta_j)^2\, \frac1n\sum_{i=1}^n (b_j'(X_i))^2. $$

We have $\hat\theta_{j,n}^k - \theta_j\xrightarrow{P} 0$ and, by the weak law of large numbers, $n^{-1}\sum_{i=1}^n (b_j'(X_i))^2 \xrightarrow{P} E_\theta (b_j'(X))^2$. Hence $I_{n,1}\xrightarrow{P} 0$. By the Schwarz inequality $|I_{n,3}| \le 2(I_{n,1} I_{n,2})^{1/2}$, so $I_{n,3}\xrightarrow{P} 0$. Hence $\hat I_f^k\xrightarrow{P} I_f$. ⊓⊔

Theorem 4  If $X, X_1,\ldots,X_n$ are i.i.d. with density $f(\cdot,\theta)\in\mathcal{M}_k$, 0 ≤ k ≤ K, and $E\|b'(X_1)\|^4 < \infty$, then

$$ \sqrt n(\hat I_f - I_f) \xrightarrow{D} N(0, \sigma^2), \qquad \text{where} \qquad \sigma^2 = \sigma_1^2 + \sigma_2^2 + 4\,\mathrm{Cov}_\theta(\theta^T B'\theta,\, b)\,(\mathrm{Cov}_\theta\, b)^{-1}\Delta\,\theta, \qquad (22) $$

with $B' = b'(X)(b'(X))^T$, $\Delta = E_\theta(B')$, $\sigma_1^2 = 4\theta^T\Delta(\mathrm{Cov}_\theta\, b(X))^{-1}\Delta\,\theta$ and $\sigma_2^2 = \mathrm{Var}_\theta(\theta^T B'\theta)$.

Proof  Using the reasoning from the proof of Theorem 3, it is enough to prove the result for $\hat I_f^k$. Define B′ as the k × k matrix consisting of all terms $b_i'(X) b_j'(X)$, where $i,j\in\{1,\ldots,k\}$:

$$ B' = B'(X) = [b_i'(X) b_j'(X)]_{1\le i,j\le k} = b'(X)(b'(X))^T, \qquad \text{and let} \qquad \bar B' = \frac1n\sum_{i=1}^n B'(X_i). $$

$\hat I_f^k$ can be written as

$$ \hat I_f^k = \frac1n\sum_{i=1}^n (\hat\theta\circ b'(X_i))^2 = \hat\theta^T \Big( \frac1n\sum_{i=1}^n b'(X_i)(b'(X_i))^T \Big) \hat\theta = \hat\theta^T\bar B'\hat\theta = (\hat\theta - \theta)^T\bar B'(\hat\theta + \theta) + \theta^T\bar B'\theta, $$

with $\hat\theta = \hat\theta_n^k$. Thus

$$ \sqrt n(\hat I_f - I_f) = \sqrt n(\hat\theta^T - \theta^T)\bar B'(\hat\theta + \theta) + \sqrt n(\check I_f - I_f), \qquad (23) $$

where $\check I_f = \theta^T\bar B'\theta$. Note that since $\theta^T\bar B'\theta = n^{-1}\sum_{i=1}^n (\theta\circ b'(X_i))^2$, $\check I_f$ can be regarded as an estimator of $I_f$ when θ is known. In view of (23) we have

$$ \sqrt n(\hat I_f - I_f) = g\big( \sqrt n(\hat\theta - \theta),\ \sqrt n(\check I_f - I_f),\ \bar B'(\hat\theta + \theta) \big), \qquad (24) $$

where $g(x,y,z) = x\circ z + y$ for $x,z\in\mathbb{R}^k$, $y\in\mathbb{R}$, is a continuous mapping. We will apply the delta method (cf. van der Vaart 2000, p. 25) and (24) to prove the result. By (10) and the weak law of large numbers, which implies $\bar B'(\hat\theta + \theta)\xrightarrow{P} 2\Delta\theta$, where $\Delta = E_\theta(B')$, we have

$$ \sqrt n(\hat\theta^T - \theta^T)\bar B'(\hat\theta + \theta) \xrightarrow{D} N(0,\sigma_1^2), \qquad \text{where } \sigma_1^2 = 4\theta^T\Delta(\mathrm{Cov}_\theta\, b(X))^{-1}\Delta\theta. $$

Since $\check I_f = n^{-1}\sum_{i=1}^n \theta^T B'(X_i)\theta$, then, by the Central Limit Theorem,

$$ \sqrt n(\check I_f - I_f) \xrightarrow{D} N(0,\sigma_2^2), \qquad \text{where } \sigma_2^2 = \mathrm{Var}(\theta^T B'\theta). $$

Define auxiliary variables

$$ S = S(X) = \begin{pmatrix} b(X) \\ \theta^T B'\theta \end{pmatrix} \in\mathbb{R}^{k+1} \qquad \text{and} \qquad \bar S = \frac1n\sum_{i=1}^n S(X_i). $$

Then by the Central Limit Theorem $\sqrt n(\bar S - E_\theta S)\xrightarrow{D} N(0,\mathrm{Cov}_\theta S)$, where $\mathrm{Cov}_\theta S$ is the covariance matrix of the vector S:

$$ \mathrm{Cov}_\theta S = \begin{pmatrix} \mathrm{Cov}_\theta\, b & \mathrm{Cov}_\theta(b,\, \theta^T B'\theta) \\ \mathrm{Cov}_\theta(\theta^T B'\theta,\, b) & \sigma_2^2 \end{pmatrix}. $$

Recall that $\hat\theta_n^k = e_k^{-1}(Y_n)$ (see (9)) and define the function $\nu:\mathbb{R}^{k+1}\to\mathbb{R}^{k+1}$ by $\nu(u_1,\ldots,u_k,u_{k+1}) = [e_k^{-1}(u_1,\ldots,u_k)^T, u_{k+1}]^T$. In view of (8) the Jacobian matrix of the function ν is equal to

$$ \nabla\nu = \begin{pmatrix} (\mathrm{Cov}_\theta\, b)^{-1} & 0 \\ 0 & 1 \end{pmatrix}. $$

By the delta method $\sqrt n(\nu(\bar S) - \nu(E_\theta S))\xrightarrow{D} N(0,\Sigma_1)$, where $\nu(\bar S) = (\hat\theta^T, \check I_f)^T$, $\nu(E_\theta S) = (\theta^T, I_f)^T$ and

$$ \Sigma_1 = \nabla\nu\, \mathrm{Cov}_\theta S\, (\nabla\nu)^T = \begin{pmatrix} (\mathrm{Cov}_\theta\, b)^{-1} & (\mathrm{Cov}_\theta\, b)^{-1}\mathrm{Cov}_\theta(b,\,\theta^T B'\theta) \\ \mathrm{Cov}_\theta(\theta^T B'\theta,\, b)(\mathrm{Cov}_\theta\, b)^{-1} & \sigma_2^2 \end{pmatrix}. $$

Thus the random vector $(\sqrt n(\hat\theta - \theta),\ \sqrt n(\check I_f - I_f),\ \bar B'(\hat\theta + \theta))^T$ converges in law to the distribution of $(Z, 2\Delta\theta)$, where $Z\sim N(0,\Sigma_1)$. Hence, in view of (24), $\sqrt n(\hat I_f - I_f)\xrightarrow{D} g((Z, 2\Delta\theta))$, i.e. $\sqrt n(\hat I_f - I_f)\xrightarrow{D} N(0,\sigma^2)$, where

$$ \sigma^2 = [2\theta^T\Delta,\, 1]\,\Sigma_1 \begin{pmatrix} 2\Delta\theta \\ 1 \end{pmatrix} = 4\theta^T\Delta(\mathrm{Cov}_\theta\, b(X))^{-1}\Delta\theta + 4\,\mathrm{Cov}_\theta(\theta^T B'\theta,\, b)(\mathrm{Cov}_\theta\, b)^{-1}\Delta\theta + \mathrm{Var}(\theta^T B'\theta) = \sigma_1^2 + \sigma_2^2 + 4\,\mathrm{Cov}_\theta(\theta^T B'\theta,\, b)(\mathrm{Cov}_\theta\, b)^{-1}\Delta\theta. \qquad ⊓⊔ $$
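The quantities in (14) and (22) can be written out explicitly for the one-dimensional family with the Legendre basis function $b_1(x) = \sqrt3(2x-1)$, the setting of Remark 2 below and of the l1(0.4) distribution of Sect. 4; the following short check uses only these assumptions. For k = 1, $b_1'(x) = 2\sqrt3$ is constant, so (14) gives

$$ I_f = E\{\theta_1 b_1'(X)\}^2 = 12\,\theta_1^2, $$

and for $\theta_1 = 0.4$ this equals 1.92, the value of $I_f$ quoted for the l1(0.4) distribution in Table 1. Moreover, $B' = (b_1'(X))^2 = 12$ is then non-random, hence $\sigma_2^2 = 0$ and $\mathrm{Cov}_\theta(\theta^T B'\theta, b) = 0$ in (22), so that

$$ \sigma^2 = \sigma_1^2 = \frac{4\theta_1^2\,\Delta^2}{\mathrm{Var}_\theta\, b_1(X)} = \frac{576\,\theta_1^2}{\mathrm{Var}_\theta\, b_1(X)}, \qquad \Delta = E_\theta (b_1'(X))^2 = 12. $$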
Fig. 1  Histogram of $\sqrt n(\hat I_f - I_f)/\sigma$ for n = 3,000 with the N(0, 1) density overlaid. The density f belongs to $\mathcal{M}_1$ with $\theta_1 = 0.4$

Remark 2  Note that the asymptotic variance in (22) corresponds to the decomposition in (23), with the term $4\,\mathrm{Cov}_\theta(\theta^T B'\theta, b)(\mathrm{Cov}_\theta\, b)^{-1}\Delta\theta$ being the asymptotic covariance of $\sqrt n(\hat\theta^T - \theta^T)\bar B'(\hat\theta + \theta)$ and $\sqrt n(\check I_f - I_f)$. The asymptotic behaviour of the estimator $\hat I_f = \hat I_f^{\tilde S}$ is exemplified in Fig. 1. We computed the value of $\hat I_f$ for $10^4$ random samples $X_1,\ldots,X_n\sim f(\cdot,\theta)\in\mathcal{M}_k$ with k = 1, θ = 0.4 and n = 3,000. In Fig. 1 the standard normal density function is compared to the histogram of $\sqrt n(\hat I_f - I_f)/\sigma$, where $I_f$ and σ are the theoretical values defined in (14) and (22), respectively. The visible conformity of the histogram and the theoretical curve illustrates the result of Theorem 4.

Remark 3  Note that Theorems 1–4 remain valid when the term log n in the penalty is replaced with an arbitrary $k_n$ such that $k_n\to\infty$ and $k_n = o(n)$. In the following section we investigate by means of simulations the influence of a larger penalty on the performance of $\hat I_f$.
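The experiment of Remark 2 can be imitated directly, since the one-dimensional density admits an explicit inverse distribution function. A minimal sketch of a single replication, reusing the `legendre_basis`, `schwarz_dimension` and `fisher_info_pmse` helpers sketched in Sect. 2 (the paper's experiment uses $10^4$ replications and the theoretical σ):

```python
import numpy as np

def sample_l1(theta, n, rng):
    """Inverse-CDF sampling from f(x, theta) = exp{theta b_1(x) - psi_1(theta)} with
    b_1(x) = sqrt(3)(2x - 1): f is proportional to exp(2cx), c = sqrt(3) * theta, so
    F^{-1}(u) = log(1 + u (exp(2c) - 1)) / (2c)."""
    c = np.sqrt(3.0) * theta
    u = rng.random(n)
    return np.log1p(u * np.expm1(2.0 * c)) / (2.0 * c)

rng = np.random.default_rng(1)
x = sample_l1(0.4, 3000, rng)              # n = 3,000 as in Fig. 1
basis = legendre_basis(4)
s, th = schwarz_dimension(x, 4, basis)     # S_tilde is typically 1 here
print(s, fisher_info_pmse(x, s, th))       # to be compared with I_f = 1.92
```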
We consider now the general case when the underlying density f does not necessarily belong to an exponential family. Specifically, we assume that f has the form

$$ f(x) = \exp\Big\{ \sum_{j=1}^\infty \theta_j b_j(x) - \psi(\theta) \Big\}, \qquad (25) $$

where $\theta = (\theta_1,\theta_2,\ldots)\in\mathbb{R}^\infty$, $\|\theta\| = (\sum_{j=1}^\infty |\theta_j|^2)^{1/2} < \infty$, and $\{1, b_1(x), b_2(x),\ldots\}$ form an orthonormal system in $(L^2([0,1]),\lambda)$. Define $\|a\|_\infty = \sup_{x\in[0,1]}|a(x)|$ and let

$$ V_m = \max_{j=1,\ldots,m}\|b_j\|_\infty, \qquad V_m' = \max_{j=1,\ldots,m}\|b_j'\|_\infty. $$

Let $f_m(x) = \exp\{\sum_{j=1}^m \theta_j b_j(x) - \psi_m(\theta^m)\}\in\mathcal{M}_m$, where $\theta^m = (\theta_1,\ldots,\theta_m)^T$, be the approximation of f. We assume that for $m\to\infty$

$$ \|f - f_m\|_{L^2} = O(m^{-r}) \quad \text{for some } r > 0. \qquad (26) $$

Moreover, we impose the condition that

$$ \sum_{j=1}^\infty |\theta_j|\,\|b_j\|_\infty < \infty. \qquad (27) $$

In the case when the density of the random sample is of the form (25), with the series $\sum_{j=1}^\infty \theta_j b_j'(x)$ converging uniformly on [0, 1], the Fisher information equals

$$ I_f = E_f(\rho(X))^2, \qquad \text{where } \rho(x) = f'(x)/f(x) = \sum_{j=1}^\infty \theta_j b_j'(x). \qquad (28) $$

In particular (28) is satisfied when

$$ \sum_{j=1}^\infty |\theta_j|\,\|b_j'\|_\infty < \infty. \qquad (29) $$

Let m = m(n) be a deterministic sequence such that $m(n)\to\infty$ when $n\to\infty$. The following result on the consistency of $\hat I_n^m$ holds.

Theorem 5  Assume that $V_m = O(m^{\omega_1})$, $V_m' = O(m^{\omega_2})$ for some $\omega_1,\omega_2\ge 0$ and $r > \frac12 + \max(\omega_1,\omega_2)$. If $m = O(n^{1/k})$, where $k > 2 + 4\max(\omega_1,\omega_2)$, and conditions (26)–(27) and (29) hold, then $\hat I_n^m\xrightarrow{P} I_f$.

In order to prove Theorem 5 we state two lemmas. Their proofs are given at the end of the section. The first lemma is a version of Lemma 5 in Barron and Sheu (1991).

Lemma 2  Let $\theta_0\in\mathbb{R}^m$, $\alpha_0 = \int b f_{\theta_0}$ and $\alpha\in\mathbb{R}^m$ be given. Let $c = e^{\|\log f_{\theta_0}\|_\infty}$. If

$$ \|\alpha_0 - \alpha\| \le \frac1{4ce\sqrt m\, V_m}, \qquad (30) $$

then the solution $\theta(\alpha)$ of $\int b f_\theta = \alpha$ exists and satisfies $\|\theta_0 - \theta(\alpha)\| \le 2ce\|\alpha - \alpha_0\|$. Moreover,

$$ \|\log f_{\theta_0}/f_{\theta(\alpha)}\|_\infty \le 4ce\sqrt m\, V_m\|\alpha - \alpha_0\| \le 1. $$

Let $\theta_m^*\in\mathbb{R}^m$ be the vector that satisfies the equation $\int b f_{\theta_m^*} = \int b f$. Then $f_m^* := f_{\theta_m^*}$ is called the information projection of f onto the exponential family $\mathcal{M}_m$. In the next lemma properties of $\theta_m^*$ are used to establish consistency of the MLE $\hat\theta_n^m$ when $m\to\infty$.

Lemma 3  If $V_m = O(m^{\omega_1})$ for some $\omega_1\ge 0$, $r > \frac12 + \omega_1$ and (26)–(27) are satisfied, then for sufficiently large m the vector $\theta_m^*$ exists and

$$ \|\theta_m^* - \theta^m\| = O(m^{-r}). \qquad (31) $$

If additionally $m(n) = O(n^{1/k})$ and $k > 2 + 4\omega_1$ then the MLE $\hat\theta_n^m$ exists with probability tending to 1 as $n\to\infty$ and

$$ \|\hat\theta_n^m - \theta_m^*\| = O_P\big( (m^{1+2\omega_1}/n)^{1/2} \big). \qquad (32) $$

Proof of Theorem 5  $|\hat I_n^m - I_f|$ is bounded by

$$ \Big| \frac1n\sum_{i=1}^n (\hat\rho_{n,m}(X_i))^2 - \int (\hat\rho_{n,m}(x))^2 f(x)\,dx \Big| + \Big| \int (\hat\rho_{n,m}(x))^2 f(x)\,dx - \int (\rho(x))^2 f(x)\,dx \Big|, $$

where $\hat\rho_{n,m}(x) = \sum_{j=1}^m (\hat\theta_n^m)_j b_j'(x) = \hat\theta_n^m\circ b'(x)$. The first term on the right-hand side can be written as $|(\hat\theta_n^m)^T (n^{-1}\sum_{i=1}^n B'(X_i) - EB'(X))\hat\theta_n^m|$, where $B'(x) = b'(x)(b'(x))^T$, and hence is bounded by $\|\hat\theta_n^m\|^2\, \|n^{-1}\sum_{i=1}^n Y^i\|$, where $Y^i = (Y^i_{j,l})_{1\le j,l\le m}\in\mathbb{R}^{m^2}$ and $Y^i_{j,l} = b_j'(X_i)b_l'(X_i) - \int b_j' b_l' f$. Yurinskii's (1976) inequality implies that

$$ P\left( \Big\| \frac1n\sum_{i=1}^n Y^i \Big\| > \varepsilon \right) \le 2\exp\left\{ -\frac{x^2}{2}\left(1 + 1.62\,\frac{x\kappa}{B_n}\right)^{-1} \right\}, $$

where $\kappa = 2m(V_m')^2$, $B_n = \sqrt n\,\kappa$ and $x = n\varepsilon/B_n$. Whence for some positive constant C

$$ x = \frac\varepsilon2 \Big( \frac{n}{m^2(V_m')^4} \Big)^{1/2} \ge C\Big( \frac{n}{m^{2+4\omega_2}} \Big)^{1/2} \to\infty \quad \text{as } n\to\infty, $$

and as $|x\kappa/B_n|$ is bounded this implies that $\|n^{-1}\sum_{i=1}^n Y^i\|$ converges to 0 in probability. Moreover,

$$ |\hat\rho_{n,m}(x) - \rho(x)| \le \|\hat\theta_n^m - \theta^m\|\sqrt m\, V_m' + \sum_{j=m+1}^\infty |\theta_j|\,\|b_j'\|_\infty $$

and using Lemma 3 we obtain

$$ \|\hat\theta_n^m - \theta^m\|\sqrt m\, V_m' \le (\|\theta^m - \theta_m^*\| + \|\hat\theta_n^m - \theta_m^*\|)\sqrt m\, V_m' = O(m^{1/2+\omega_2-r}) + O_P\big( (m^{2+2\omega_1+2\omega_2}/n)^{1/2} \big), $$

which implies that $\int (\hat\rho_{n,m}(x))^2 f(x)\,dx - \int (\rho(x))^2 f(x)\,dx$ converges to 0 in probability. ⊓⊔

Proof of Lemma 2  The proof proceeds similarly to the proof of Lemma 5 in Barron and Sheu (1991) with q(x) = 1 for $x\in[0,1]$ and τ = 1, the only change being that instead of using the result of Lemma 4 therein we apply the inequality

$$ D(f_{\theta_1}\|f_{\theta_2}) \ge \frac12\, e^{-\|\log q/f_{\theta_1}\|_\infty}\, e^{-2\|\theta_1-\theta_2\|\sqrt m\, V_m}\, \|\theta_1 - \theta_2\|^2, $$

which follows from the fact that

$$ \|\log f_{\theta_1}/f_{\theta_2}\|_\infty \le 2\|(\theta_1 - \theta_2)\circ b\|_\infty \le 2\|\theta_1 - \theta_2\|\sqrt m\, V_m $$

(cf. the proof of Lemma 4 in Barron and Sheu 1991). ⊓⊔

Proof of Lemma 3  To establish equality (31) we use Lemma 2 with $\alpha_0 = \alpha = \int b f$ and $c = e^{\|\log f_m\|_\infty}$. Note that

$$ \|\log f_m\|_\infty \le \|\log f\|_\infty + 2\sum_{j=1}^\infty |\theta_j|\,\|b_j\|_\infty, \qquad (33) $$

so $\|\log f_m\|_\infty$ is bounded in m. Since $\{b_j\}_{j=1}^\infty$ form an orthonormal system, the Bessel inequality yields

$$ \Big\| \int b f_m - \int b f \Big\| = \Big\| \int b(f_m - f) \Big\| \le \|f_m - f\|_{L^2}. $$

Thus the fact that $m^{\frac12+\omega_1}\|f_m - f\|_{L^2}\to 0$ as $m\to\infty$ implies (30) for sufficiently large m and hence the existence of $\theta_m^*$. Moreover, $\|\theta_m^* - \theta^m\| \le 2ce\|f_m - f\|_{L^2}$, which proves (31). It also follows from Lemma 2 that

$$ \|\log f_m^*/f_m\|_\infty \le 1. \qquad (34) $$

For the equality (32) we apply Lemma 2 once again with $\alpha_0 = \int b f_{\theta_m^*}$, $\alpha = n^{-1}\sum_{i=1}^n b(X_i)$ and $c = e^{\|\log f_m^*\|_\infty}$. Using (34) and (33) we obtain that $\|\log f_m^*\|_\infty$ is bounded, since $\|\log f_m^*\|_\infty \le \|\log f_m^*/f_m\|_\infty + \|\log f_m\|_\infty \le 1 + \|\log f_m\|_\infty$. Yurinskii's (1976) inequality yields

$$ \Big\| \frac1n\sum_{i=1}^n b(X_i) - Eb(X) \Big\| = O_P\big( (m^{1+2\omega_1}/n)^{1/2} \big). \qquad (35) $$

Thus $\sqrt m\, V_m\,\|n^{-1}\sum_{i=1}^n b(X_i) - Eb(X)\| = O_P\big( (m^{2+4\omega_1}/n)^{1/2} \big)$ and the assumption (30) of Lemma 2 is satisfied with probability tending to 1 as $n\to\infty$. Whence

$$ \|\hat\theta_n^m - \theta_m^*\| \le 2ce\,\Big\| \frac1n\sum_{i=1}^n b(X_i) - Eb(X) \Big\| $$

and (32) follows. ⊓⊔
Remark 4  It is shown in Barron and Sheu (1991, pp. 1351–1352) that when $f\in W_2^r$, the Sobolev space of functions on [0, 1] such that $f^{(r-1)}$ is absolutely continuous with square integrable derivative, then f satisfies (26) when $\{b_j\}_{j=0}^\infty$ is the Legendre system. Moreover, in this case $V_m = (2m+1)^{1/2}$ and $V_m' = m(m+1)(2m+1)^{1/2}$, i.e. $\omega_1 = 1/2$, $\omega_2 = 5/2$, $k > 12$ and $f\in W_2^3$ in Theorem 5.

Consider the family of exponential models $\{\mathcal{M}_k\}$, $k\in\{1,\ldots,m(n)\}$, and the corresponding $\tilde S$ defined as in (12) with K replaced by m(n). The following corollary states the consistency of $\hat I_n^{\tilde S}$. Note that its last assumption is in particular satisfied under the conditions of Lemma 3 in view of (31).

Corollary 1  Assume that m(n) satisfies the conditions of Theorem 5, there is an infinite number of nonzero $\theta_i$ in the representation (25), and for sufficiently large m, $\theta_m^*$ exists and $\|\theta_m^* - \theta^m\|\to 0$ when $m\to\infty$. Then $\hat I_n^{\tilde S}\xrightarrow{P} I_f$.

Proof  The proof follows by examination of the proof of Theorem 5, which indicates that its conclusion remains valid for a random sequence m(n) provided $m(n)\xrightarrow{P}\infty$ and $m(n) = O_P(n^{1/k})$. In the case of $\tilde S$ the second condition is trivially satisfied as $\tilde S\le m$. To prove the first assertion, observe that it is impossible that, starting from a certain $m_0$, all coordinates $(\theta_m^*)_l$ of $\theta_m^*$ for $m_0\le l\le m$ and $m\ge m_0$ would be 0. Namely, considering $k\ge m_0$ such that $\theta_k\neq 0$, this would imply $\|\theta_m^* - \theta^m\|\ge|\theta_k|\neq 0$ for all $m\ge m_0$, which contradicts the assumptions. Thus for every M > 0 there exists $m_1\ge M$ such that $(\theta_{m_1}^*)_{m_2}\neq 0$ for some $m_2\ge M$. Let $\tilde m_1$ be the smallest integer having this property. For $0\le l<\tilde m_1$ we have

$$ \sup_{t\in\mathbb{R}^l}(t\circ Eb - \psi_l(t)) = \theta_l^*\circ Eb - \psi_l(\theta_l^*) < \theta_{\tilde m_1}^*\circ Eb - \psi_{\tilde m_1}(\theta_{\tilde m_1}^*) = \sup_{t\in\mathbb{R}^{\tilde m_1}}(t\circ Eb - \psi_{\tilde m_1}(t)). $$

Reasoning as in the proof of Lemma 1 leads to

$$ P(\tilde S < M) \le P(\tilde S < \tilde m_1) \le \sum_{l=0}^{\tilde m_1 - 1} P(\bar L_l \ge \bar L_{\tilde m_1}) \le \sum_{l=0}^{\tilde m_1 - 1} P\left( \sup_{t\in\mathbb{R}^l}\{t\circ Y_n - \psi_l(t)\} - \sup_{t\in\mathbb{R}^{\tilde m_1}}\{t\circ Y_n - \psi_{\tilde m_1}(t)\} + (\tilde m_1 - l)\frac{\log n}{2n} \ge 0 \right) \to 0, $$

and thus $\tilde S\xrightarrow{P}\infty$. ⊓⊔

4 Simulations

In this section we discuss the results of a simulation study in which the performance of $\hat I_f = \hat I_f^{\tilde S}$ has been compared with the performance of several competing estimators described in Sect. 2.3. In all numerical experiments the first k Legendre polynomials on [0, 1] have been used as the functions $b_1(x),\ldots,b_k(x)$. The chosen sample sizes were n = 100 and n = 500. The number K of dimensions considered in (11) depended on the sample size, namely K(100) = 4 and K(500) = 10. The number m of repetitions of each experiment in the study equalled $10^4$.
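The design just described can be imitated with the helpers sketched earlier; the fragment below is a hypothetical driver that reuses `legendre_basis`, `schwarz_dimension` and `fisher_info_pmse` from Sect. 2, uses far fewer repetitions than the $10^4$ of the study, and makes no attempt to reproduce the reported figures.

```python
import numpy as np
from scipy import stats

def simulate_mse(sampler, i_true, n, K, reps=500, seed=0):
    """Monte Carlo mean, SD and MSE of the PMSE estimator, in the spirit of Tables 1-5."""
    rng = np.random.default_rng(seed)
    basis = legendre_basis(K)
    vals = []
    for _ in range(reps):
        x = sampler(rng, n)
        s, th = schwarz_dimension(x, K, basis)   # modified BIC rule (12)
        vals.append(fisher_info_pmse(x, s, th))  # estimator (15)-(16)
    vals = np.array(vals)
    return vals.mean(), vals.std(), float(np.mean((vals - i_true) ** 2))

# Truncated normal on [0,1] with mean 0.5 and sigma = 0.5 (I_f = 1.164, cf. Table 3).
sigma = 0.5
sampler = lambda rng, n: stats.truncnorm.rvs(-0.5 / sigma, 0.5 / sigma, loc=0.5,
                                             scale=sigma, size=n, random_state=rng)
print(simulate_mse(sampler, i_true=1.164, n=100, K=4))
```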
4.1 Estimators under consideration

We consider two kernel-type estimators $\tilde I_f$ with the kernel function K equal to the density of the standard normal law N(0, 1), for various choices of the bandwidths $h_1$ and $h_2$ used for estimating the density and its derivative, respectively. The kernel estimator of the derivative equals the derivative of the kernel estimator of f defined in (18). Let $h_{SJ}$ be the bandwidth of a kernel density estimator computed using the method of Sheather and Jones (1991), which is a commonly used method of data-dependent bandwidth choice. We consider two estimators of $I_f$ that use $h_{SJ}$: for the first one, called $\tilde I_{SJ}$, we have taken $h_1 = h_2 = h_{SJ}$. For the second one, we calculated a bandwidth $h_{2,opt}$, the minimizer of the asymptotic Mean Integrated Squared Error of $\hat f'$ for the normal N(µ, σ²) distribution, and expressed it in terms of $h_{1,opt}$, the analogously defined asymptotically optimal bandwidth for $\hat f$. This yields

$$ h_{2,opt} = (3n/4)^{2/35}(3/5)^{1/7}\, h_{1,opt}. $$

The estimator $\tilde I_{SJ}'$ uses $h_1 = h_{SJ}$ and $h_2$ calculated from the above formula with $h_{1,opt}$ replaced by $h_{SJ}$. Additionally, we define a 'prophet' kernel estimator $\tilde I_{best}$, for which we choose $h_1 = h_2$ minimizing the integrated squared error (ISE) of the corresponding estimator of f′/f. The ISE is defined as $m^{-1}\sum_{i=1}^m [(\hat f'/\hat f)(x_i) - (f'/f)(x_i)]^2$, where $(x_i)$ is a grid of m equidistant points in [0, 1]. Note that we use here the theoretical density of the observations, which is not available in practice. We also consider a 'prophet' NN estimator $\tilde I_{NN}^{best}$, with the density of the Beta(3, 3) distribution on [−1, 1] as the kernel function K and with the choice of the parameter $h_n$ that minimizes the ISE of the estimated function $\hat J_n$ defined in (19). To the best of our knowledge no proposal of data-dependent bandwidth choice for the NN estimator is available.

4.2 Employed distributions

The following distributions supported on [0, 1] have been considered in the numerical experiments:

– the uniform distribution;
– the truncated normal distribution with mean 0.5 and σ = 0.25, 0.5, 0.75, 1;
– the Beta(a, b) distribution with a = b = 3, 4, 5, 6, 7, 8, 9;
– the distribution h4(a) with density $f_{4a}(x) = a\,2^{a-1}(\min(x, 1-x))^{a-1}$ for a = 3, 4, 5, 6, 7;
– the distribution l1(θ) belonging to $\mathcal{M}_1$ with density $f(x) = \exp(\theta b_1(x) - \psi_1(\theta))$, where θ = 0.4 and $b_1(x) = \sqrt3(2x-1)$ is the first Legendre polynomial on [0, 1].

We considered also two distributions with unbounded support:

– the normal distribution with mean 0 and σ = 0.25, 0.5, 0.75, 1;
– the exponential distribution with λ = 0.5, 1, 1.5.

In the last two cases, the following modification of the original definition of $\hat I_f$ has been made. The range $[X_{1:n}, X_{n:n}]$ of the data is mapped onto [0, 1] by a linear transformation and $\hat I_f$ is calculated based on the transformed data. Then it is multiplied by $(X_{n:n} - X_{1:n})^{-2}$, which corresponds to the change of the Fisher information when the density is transformed back from [0, 1] to $[X_{1:n}, X_{n:n}]$.
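A minimal sketch of this rescaling device (the helper names are illustrative, and the PMSE helpers from Sect. 2 are assumed):

```python
import numpy as np

def fisher_info_unbounded(sample, K):
    """Rescaling device of Sect. 4.2 for densities with unknown, possibly unbounded support:
    map [X_(1:n), X_(n:n)] onto [0,1], apply the PMSE estimator there, and rescale the
    result by (X_(n:n) - X_(1:n))^{-2}."""
    x = np.asarray(sample, dtype=float)
    lo, hi = x.min(), x.max()
    u = (x - lo) / (hi - lo)                 # linear map onto [0, 1]
    basis = legendre_basis(K)
    s, th = schwarz_dimension(u, K, basis)
    return fisher_info_pmse(u, s, th) / (hi - lo) ** 2
```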
4.3 Comparison of performance

We compared the estimator $\hat I_f$ with the kernel estimators $\tilde I_{SJ}$, $\tilde I_{SJ}'$ and their 'prophet' counterparts $\tilde I_{best}$, $\tilde I_{NN}^{best}$ for the families described above. The results are summarized in Tables 1, 2, 3, 4 and 5, where empirical means, standard deviations and mean squared errors based on $m = 10^4$ repetitions are given. Also, the means of the empirical distributions of $\hat I_f$, $\tilde I_{SJ}$ and $\tilde I_{SJ}'$ are plotted in Fig. 2 as a function of the pertaining parameter (for the Beta distribution the range of the parameter a is larger than that given in the corresponding table). The plots clearly indicate that in most displayed cases $\hat I_f$ is the least biased of the considered estimators. It follows from the tables that when the distribution belongs to an exponential family (i.e. the case of the uniform, l1 and truncated normal distributions), $\hat I_f$ performed much better in terms of MSE than $\tilde I_{SJ}$ and $\tilde I_{SJ}'$. In particular, for the uniform distribution the ratios of MSEs are larger than 3 for n = 100. Surprisingly, $\hat I_f$ fares better than the 'prophet' estimators $\tilde I_{best}$ and $\tilde I_{NN}^{best}$ in some cases, e.g. for the l1 distribution with n = 500. The same conclusion is true in most cases for the Beta and h4 distributions (the results are not shown in the last case). The superiority of $\hat I_f$ is more apparent for n = 100 and for larger values of the parameter a in the cases of Beta and h4, and of the parameter σ for the truncated normal distributions. There is only one case, namely the truncated normal distribution with σ = 0.5, when the MSE of $\tilde I_{SJ}'$ is smaller than the MSE of $\hat I_f$.

Table 1  Performance of estimators for the uniform (left) and l1(0.4) (right) distributions ($I_f$ = 0 and 1.92, respectively): empirical mean, SD and MSE of $\hat I_f$, $\tilde I_{SJ}'$, $\tilde I_{SJ}$, $\tilde I_{best}$ and $\tilde I_{NN}^{best}$ for n = 100 (K = 4) and n = 500 (K = 10)

Table 2  Performance of estimators for the Beta(a, a) distribution, a = 3, 4, 5, 6 ($I_f$ = 40, 42, 48, 55): empirical mean, SD and MSE of the same estimators for n = 100 (K = 4) and n = 500 (K = 10)
Table 3  Performance of estimators for the truncated normal distribution on [0, 1] with mean 0.5 and standard deviation σ = 0.25, 0.5, 0.75, 1 ($I_f$ = 12.38, 1.164, 0.248, 0.081): empirical mean, SD and MSE of $\hat I_f$, $\tilde I_{SJ}'$, $\tilde I_{SJ}$, $\tilde I_{best}$ and $\tilde I_{NN}^{best}$ for n = 100 (K = 4) and n = 500 (K = 10)

Table 4  Performance of estimators for the normal distribution N(0, σ), σ = 0.25, 0.5, 0.75, 1 ($I_f$ = 16, 4, 1.778, 1): empirical mean, SD and MSE of the same estimators for n = 100 (K = 4) and n = 500 (K = 10)

Table 5  Performance of estimators for the exponential distribution E(λ), λ = 0.5, 1, 1.5 ($I_f$ = 0.25, 1, 2.25): empirical mean, SD and MSE of the same estimators for n = 100 (K = 4) and n = 500 (K = 10)

The 'prophet' methods $\tilde I_{best}$ and $\tilde I_{NN}^{best}$, which availed themselves of the knowledge of the true density, performed best in some cases (e.g. for the truncated normal and the uniform distributions), but in general they performed on par with or even worse than $\hat I_f$. One of the reasons is possibly that the criterion used to find a bandwidth for $\tilde I_{best}$ and $\tilde I_{NN}^{best}$ involved f′/f and not directly $I_f$; the other is the possible superiority of the log-linear PMS estimator, which avoids the problem of choosing two different bandwidths.

In the case of the two densities supported on unbounded sets the estimator $\hat I_f$ was always less biased and had much smaller MSE than the kernel estimators. For the normal distribution this was true even in comparison with the 'prophet' estimators. Also, for the exponential distribution the estimator $\hat I_f$ had in all cases smaller standard deviation than the kernel estimators. Thus the proposed method seems promising also for the cases when the support of f is unknown and possibly unbounded.

4.4 A choice of penalty and comparison of $\hat I_f^{\tilde S}$ and $\hat I_f^S$
We also studied the effect of adding the family $\mathcal{M}_0$ to the list and of changing the penalty in the BIC criterion. Namely, we compared the MSEs for $\hat I_f = \hat I_f^{\tilde S}$ and $\hat I_f^S$ when the penalty $k\log n/2$ in (11) is multiplied by C. Obviously, C = 1 corresponds to the original criterion. Due to space constraints only selected results for the uniform and the truncated normal distribution are shown, for C = 1, 2 and n = 100, in Tables 6 and 7, respectively. In general, for both estimators taking C > 2 does not lead to a decrease of MSE and in some cases (e.g. for the truncated normal with σ = 0.25, or Beta(3, 3) for n = 100 in the case of $\hat I_f$) causes its substantial increase. This also happens in the case of the l1 distribution, when adoption of a larger penalty results in a larger probability of k = 0, i.e. of a wrong decision, in the case of $\hat I_f^{\tilde S}$, whereas in the case of $\hat I_f^S$ the probability of k = 1 (the correct decision) becomes larger simply because the model $\mathcal{M}_0$ is excluded from the list of models. It turned out that the choice of a large penalty (C ≥ 4) in the case of the uniform and the truncated normal distribution resulted in the choice of k = 0 for all $10^4$ repetitions. $\hat I_f^{\tilde S}$ for C = 2 performs better than $\hat I_f^S$ for the uniform distribution and for the truncated normal with σ ≥ 0.5, i.e. in cases when the distribution is close to the uniform. For the Beta, h4 and exponential distributions both estimators performed exactly the same. For the l1 and normal distributions $\hat I_f^{\tilde S}$ performed worse. $\hat I_f^{\tilde S}$ with C = 2 performed in general better than for C = 1. Thus it should be kept in mind that changing C from 1 to 2 in the comparison with the other estimators discussed above would underline the superiority of $\hat I_f^{\tilde S}$ over the kernel estimators even more.

Fig. 2  Empirical means of estimators for selected parametric distributions and for n = 100; panels: (a) Beta(a, a) distribution, (b) h4(a) distribution, (c) truncated normal distribution, (d) normal distribution. Solid line $\hat I_f$, dashed line $\tilde I_{SJ}$, dotted-dashed line $\tilde I_{SJ}'$. Grey line $I_f$

Table 6  An influence of the penalty on $\hat I_f^S$ (left) and $\hat I_f^{\tilde S}$ (right) for the uniform distribution on [0, 1] ($I_f$ = 0) for n = 100, K = 4

            $\hat I_f^S$               $\hat I_f^{\tilde S}$
            Mean    SD      MSE        Mean    SD      MSE
  C = 1     0.391   2.400   5.911      0.091   1.178   1.395
  C = 2     0.146   0.572   0.349      0.002   0.056   0.003

Table 7  An influence of the penalty on $\hat I_f^S$ and $\hat I_f^{\tilde S}$ for the truncated normal distribution on [0, 1] with mean 0.5 and standard deviation σ for n = 100, K = 4

                   σ = 0.25 ($I_f$ = 12.38)   σ = 0.5 ($I_f$ = 1.164)   σ = 0.75 ($I_f$ = 0.248)   σ = 1 ($I_f$ = 0.081)
                   Mean   SD    MSE           Mean  SD    MSE           Mean   SD    MSE            Mean   SD    MSE
  $\hat I_f^S$
    C = 1          13.33  4.74  23.4          1.36  2.92  8.55          0.55   2.41  5.89           0.39   1.88  3.63
    C = 2          12.96  4.25  18.4          0.48  1.52  2.79          0.17   0.58  0.35           0.14   0.52  0.27
  $\hat I_f^{\tilde S}$
    C = 1          13.3   4.81  24            0.58  2.31  5.66          0.13   1.23  1.53           0.12   1.29  1.68
    C = 2          11.8   6.16  38.3          0.03  0.51  1.56          0.006  0.18  0.09           0.004  0.15  0.03

Acknowledgments  We thank T. Ledwina for suggesting the subject of this paper, for insightful comments, and for supplying us with a computer program for calculating ML estimators in exponential families created under her Ministry of National Education Grant 341 046.

References

Barron AR, Sheu CH (1991) Approximation of density functions by sequences of exponential families. Ann Stat 19:1347–1369
Bickel P, Klassen C, Ritov Y, Wellner J (1998) Semiparametrics. In: Kotz S, Read C, Banks D (eds) Encyclopedia of statistical sciences, vol 2. Wiley, New York, pp 602–614
Bogdan K, Bogdan M (2000) On existence of maximum likelihood estimators in exponential families. Statistics 34:137–149
Csörgő M, Révész P (1986) A nearest neighbour-estimator for the score function. Prob Theory Relat Fields 71:293–305
Haughton DMA (1988) On the choice of a model to fit data from an exponential family. Ann Stat 16:342–355
Inglot T, Ledwina T (1996) Asymptotic optimality of data-driven Neyman's tests for uniformity. Ann Stat 24:1982–2019
Inglot T, Ledwina T (2001) Intermediate approach to comparison of some goodness-of-fit tests. Ann Inst Stat Math 53:810–834
Kallenberg WCM, Ledwina T (1995) Consistency and Monte Carlo simulation of a data driven version of smooth goodness-of-fit tests. Ann Stat 23:1594–1608
Kallenberg WCM, Ledwina T (1997) Data driven smooth tests when the hypothesis is composite. J Am Stat Assoc 92:1094–1104
Ledwina T (1994) Data-driven version of Neyman's smooth test of fit. J Am Stat Assoc 89:1000–1005
Ledwina T (2000) Limiting behaviour of data driven Neyman's statistic under fixed alternatives. Manuscript
Leeb H, Pötscher BM (2006) Model selection, manuscript. http://www.stat.yale.edu/~hl284/handbook.pdf
Schuster EF (1985) Incorporating support constraints into nonparametric estimation of densities. Commun Stat Theory Methods 14:1123–1136
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Sheather SJ, Jones MC (1991) A reliable data-based bandwidth selection method for kernel density estimation. J R Stat Soc Ser B 53:683–690
Stone CJ (1975) Adaptive maximum likelihood estimator of a location parameter. Ann Stat 3:267–284
van der Vaart AW (2000) Asymptotic statistics. Cambridge University Press, Cambridge
Yurinskii VV (1976) Exponential inequalities for sums of random vectors. J Mult Anal 6:473–499