Technical Report 73, IMSV, University of Bern
Adaptive Confidence Sets for the Optimal Approximating Model
Angelika Rohde
Universität Hamburg
Department Mathematik
Bundesstraße 55
D-20146 Hamburg
Germany
e-mail: [email protected]
and
Lutz Dümbgen
Universität Bern
Institut für Mathematische Statistik
und Versicherungslehre
Sidlerstrasse 5
CH-3012 Bern
Switzerland
e-mail: [email protected]
Universität Hamburg and Universität Bern
This work was supported by the Swiss National Science Foundation.
1. Introduction
When dealing with a high-dimensional observation vector, the natural question arises
whether the data-generating process can be approximated by a model of substantially lower
dimension. Rather than the true model, the focus here is on smaller models which still
contain the essential information and allow for interpretation. Typically, the models under
consideration are characterized by the non-zero components of some parameter vector.
Estimating the true model requires the rather idealistic situation that each component
either equals zero or has sufficiently large modulus: a tiny perturbation of the parameter
vector may result in the biggest model, so the question about the true model does not seem
to be adequate in general. Alternatively, the model which is optimal in terms of risk appears
as a target of many model selection strategies. Within a specified class of competing models,
this paper is concerned with confidence regions for those approximating models which are
optimal in terms of risk.
Suppose that we observe a random vector $X_n = (X_{in})_{i=1}^n$ with distribution $\mathcal{N}_n(\theta_n, \sigma^2 I_n)$,
together with an estimator $\hat\sigma_n$ of the standard deviation $\sigma > 0$. Often the signal $\theta_n$
represents coefficients of an unknown smooth function with respect to a given orthonormal
basis of functions.
There is a vast amount of literature on point estimation of $\theta_n$. For a given estimator
$\hat\theta_n = \hat\theta_n(X_n, \hat\sigma_n)$ of $\theta_n$, let
$$L(\hat\theta_n, \theta_n) := \|\hat\theta_n - \theta_n\|^2 \quad\text{and}\quad R(\hat\theta_n, \theta_n) := E\, L(\hat\theta_n, \theta_n)$$
be its quadratic loss and the corresponding risk, respectively. Here $\|\cdot\|$ denotes the standard
Euclidean norm of vectors. Various adaptivity results are known for this setting, often in
terms of oracle inequalities. A typical result reads as follows: Let $(\hat\theta_n^{(c)})_{c \in \mathcal{C}_n}$ be a family
of candidate estimators $\hat\theta_n^{(c)} = \hat\theta_n^{(c)}(X_n)$ for $\theta_n$, where $\sigma > 0$ is temporarily assumed to be
known. Then there exist estimators $\hat\theta_n$ and constants $A_n, B_n = O(\log(n)^{\gamma})$ with $\gamma \ge 0$
such that for arbitrary $\theta_n$ in a certain set $\Theta_n \subset \mathbb{R}^n$,
$$R(\hat\theta_n, \theta_n) \le A_n \min_{c \in \mathcal{C}_n} R(\hat\theta_n^{(c)}, \theta_n) + B_n\, \sigma^2. \eqno(1)$$
Results of this type are provided, for instance, by Polyak and Tsybakov (1991) and Donoho
and Johnstone (1994, 1995, 1998), and in the framework of Gaussian model selection by Birgé
and Massart (2001). The latter article copes in particular with the fact that a model is
not necessarily true. Further results of this type, partly in different settings, have been
provided by Stone (1984), Lepski et al. (1997), Efromovich (1998), Cai (1999, 2002), to
mention just a few.
By way of contrast, when aiming at adaptive confidence sets one faces severe limitations.
Here is a result of Li (1989), slightly rephrased: Suppose that $\Theta_n$ contains a closed
Euclidean ball $B(\theta_n^o, c\, n^{1/4})$ around some vector $\theta_n^o \in \mathbb{R}^n$ with radius $c\, n^{1/4} > 0$. Still
assuming $\sigma$ to be known, let $D_n = D_n(X_n) \subset \Theta_n$ be a $(1-\alpha)$-confidence set for $\theta_n \in \Theta_n$.
Such a confidence set may be used as a test of the (Bayesian) null hypothesis that $\theta_n$ is
uniformly distributed on the sphere $\partial B(\theta_n^o, c\, n^{1/4})$ versus the alternative that $\theta_n = \theta_n^o$: We
reject this null hypothesis at level $\alpha$ if $\|\eta - \theta_n^o\| < c\, n^{1/4}$ for all $\eta \in D_n$. Since this test
cannot have larger power than the corresponding Neyman-Pearson test,
$$P_{\theta_n^o}\Bigl( \sup_{\eta \in D_n} \|\eta - \theta_n^o\| < c\, n^{1/4} \Bigr) \le P\bigl( S_n^2 \le \chi^2_{n;\alpha}(c^2 n^{1/2}/\sigma^2) \bigr) \qquad (S_n^2 \sim \chi^2_n)$$
$$= 1 - \Phi\bigl( \Phi^{-1}(1-\alpha) - 2^{-1/2}\, c^2/\sigma^2 \bigr) + o(1),$$
where $\chi^2_{n;\alpha}(\delta^2)$ stands for the $\alpha$-quantile of the noncentral chi-squared distribution with
$n$ degrees of freedom and noncentrality parameter $\delta^2$. Throughout this paper, asymptotic
statements refer to $n \to \infty$. The previous inequality entails that no reasonable confidence
set has a diameter of order $o_p(n^{1/4})$ uniformly over the parameter space $\Theta_n$, as long as the
latter is sufficiently large. Despite these limitations, there is some literature on confidence
sets in the present or similar settings; see for instance Beran (1996, 2000), Beran and
Dümbgen (1998) and Genovese and Wasserman (2005).
Improving the rate $O_p(n^{1/4})$ is only possible via additional constraints on $\theta_n$, i.e. by con-
sidering substantially smaller sets $\Theta_n$. For instance, Baraud (2004) developed nonasymp-
totic confidence regions which perform well on finitely many linear subspaces. Robins and
van der Vaart (2006) construct confidence balls via sample splitting which adapt to some
extent to the unknown smoothness of $\theta_n$. In their context, $\Theta_n$ corresponds to a Sobolev
smoothness class with given parameters $(\beta, L)$. However, adaptation in this context is possi-
ble only within a range $[\beta, 2\beta]$. Independently, Cai and Low (2006) treat the same problem
in the special case of the Gaussian white noise model, obtaining the same kind of adap-
tivity in the broader scale of Besov bodies. Other possible constraints on $\theta_n$ are so-called
shape constraints; see for instance Cai and Low (2007), Dümbgen (2003) or Hengartner
and Stark (1995).
The question is whether one can bridge this gap between confidence sets and point esti-
mators. More precisely, we would like to understand the possibility of adaptation for point
estimators in terms of some confidence region for the set of all optimal candidate estima-
tors $\hat\theta_n^{(c)}$. That means, we want to construct a confidence region $\hat{K}_{n,\alpha} = \hat{K}_{n,\alpha}(X_n, \hat\sigma_n) \subset \mathcal{C}_n$
for the set
$$K_n(\theta_n) := \operatorname*{argmin}_{c \in \mathcal{C}_n} R(\hat\theta_n^{(c)}, \theta_n) = \bigl\{ c \in \mathcal{C}_n : R(\hat\theta_n^{(c)}, \theta_n) \le R(\hat\theta_n^{(c')}, \theta_n) \text{ for all } c' \in \mathcal{C}_n \bigr\}$$
such that, with probability at least $1 - \alpha$, $\hat{K}_{n,\alpha} \supset K_n(\theta_n)$, and
$$\max_{c \in \hat{K}_{n,\alpha}} R(\hat\theta_n^{(c)}, \theta_n),\; \max_{c \in \hat{K}_{n,\alpha}} L(\hat\theta_n^{(c)}, \theta_n) \;=\; O_p(A_n) \min_{c \in \mathcal{C}_n} R(\hat\theta_n^{(c)}, \theta_n) + O_p(B_n)\, \sigma^2. \eqno(2)$$
Solving this problem means that statistical inference about differences in the performance
of estimators is possible, although inference about their risk and loss is severely limited.
In some settings, selecting estimators out of a class of competing estimators entails implicitly
estimating an unknown regularity or smoothness class for the underlying signal $\theta_n$.
Computing a confidence region for good estimators is particularly suitable in situations
in which several good candidate estimators fit the data equally well although they look
different. This aspect of exploring various candidate estimators is not covered by the usual
theory of point estimation.
Note that our confidence region $\hat K_{n,\alpha}$ is required to contain the whole set $K_n(\theta_n)$, not
just one element of it, with probability at least $1 - \alpha$. The same requirement is used by
Futschik (1999) for inference about the argmax of a regression function.
The remainder of this paper is organized as follows. For the reader's convenience our
approach is first described in a simple toy model in Section 2. In Section 3 we develop and
analyze an explicit confidence region $\hat{K}_{n,\alpha}$ related to $\mathcal{C}_n := \{0, 1, \ldots, n\}$ with candidate
estimators
$$\hat\theta_n^{(k)} := \bigl( 1\{i \le k\}\, X_{in} \bigr)_{i=1}^n.$$
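To fix ideas, the following minimal simulation sketch (Python with NumPy; the decaying signal, noise level and seed are hypothetical illustrations, not quantities from the paper) computes the nested candidate estimators $\hat\theta_n^{(k)}$ together with the exact risks $R_n(k) = \sum_{i>k} \theta_{in}^2 + k\sigma^2$, whose minimizers form the target set $K_n(\theta_n)$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma = 200, 1.0
    theta = 5.0 * np.exp(-0.05 * np.arange(1, n + 1))   # hypothetical decaying signal
    X = theta + sigma * rng.standard_normal(n)          # X_n ~ N_n(theta_n, sigma^2 I_n)

    def theta_hat(k):
        """Candidate estimator theta_hat^(k): keep the first k coordinates of X_n."""
        est = np.zeros(n)
        est[:k] = X[:k]
        return est

    # Exact risks R_n(k) = sum_{i>k} theta_in^2 + k sigma^2; computable here only
    # because the simulation knows theta_n.
    tail = np.concatenate((np.cumsum((theta ** 2)[::-1])[::-1], [0.0]))
    risks = tail + sigma ** 2 * np.arange(n + 1)
    K_n = np.flatnonzero(risks == risks.min())          # the target set K_n(theta_n)

In practice $\theta_n$ is unknown, which is precisely why the paper constructs a confidence region for $K_n(\theta_n)$ instead of estimating a single optimal index.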
2. A toy problem
regardless of $F$. To construct such a confidence set we regard $Y(s) - Y(t)$ for arbitrary
different $s, t \in [0,1]$ as a test statistic for the null hypothesis that $s \in S(F)$, i.e. large
values of $Y(s) - Y(t)$ give evidence for $s \notin S(F)$.
A first naive proposal is the set
$$\hat{S}_{\alpha}^{\mathrm{naive}} := \Bigl\{ s \in [0,1] : Y(s) \le \min_{[0,1]} Y + \kappa_{\alpha}^{\mathrm{naive}} \Bigr\}$$
with $\kappa_{\alpha}^{\mathrm{naive}}$ denoting the $(1-\alpha)$-quantile of $\max_{[0,1]} W - \min_{[0,1]} W$.
Here is a refined version based on results of Dümbgen and Spokoiny (2001): Let $\kappa_\alpha$ be
the $(1-\alpha)$-quantile of
$$\sup_{s,t \in [0,1]} \Bigl( \frac{|W(s) - W(t)|}{\sqrt{|s-t|}} - \sqrt{2 \log\bigl( e/|s-t| \bigr)} \Bigr). \eqno(4)$$
Then constraint (3) is satisfied by the confidence region $\hat{S}_\alpha$ which consists of all $s \in [0,1]$
such that
$$Y(s) \le Y(t) + \sqrt{|s-t|}\, \Bigl( \sqrt{2 \log\bigl( e/|s-t| \bigr)} + \kappa_\alpha \Bigr) \quad\text{for all } t \in [0,1].$$
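As a concrete illustration, here is a small numerical sketch (Python with NumPy). It discretizes $[0,1]$, takes $W$ for definiteness to be a random-walk approximation of Brownian motion, obtains $\kappa_\alpha$ by Monte Carlo, and returns the grid points belonging to $\hat S_\alpha$. Grid size, number of simulations and noise model are ad-hoc choices for illustration, not prescriptions from the paper:

    import numpy as np

    rng = np.random.default_rng(1)
    m = 200                                    # grid size (kept small for speed)
    grid = (np.arange(m) + 1.0) / m
    D = np.abs(grid[:, None] - grid[None, :])  # |s - t| for all grid pairs
    off = D > 0                                # pairs with s != t
    pen = np.zeros_like(D)
    pen[off] = np.sqrt(2.0 * np.log(np.e / D[off]))   # additive correction from (4)

    def brownian():
        """Random-walk approximation of Brownian motion on the grid."""
        return np.cumsum(rng.standard_normal(m)) / np.sqrt(m)

    def kappa_alpha(alpha=0.1, sims=500):
        """Monte Carlo (1 - alpha)-quantile of the multiscale statistic (4)."""
        stats = np.empty(sims)
        for b in range(sims):
            W = brownian()
            ratio = np.abs(W[:, None] - W[None, :])[off] / np.sqrt(D[off])
            stats[b] = np.max(ratio - pen[off])
        return np.quantile(stats, 1.0 - alpha)

    def S_hat(Y, kappa):
        """Grid points s with Y(s) <= Y(t) + sqrt|s-t| (pen + kappa) for all t."""
        allow = np.zeros_like(D)
        allow[off] = np.sqrt(D[off]) * (pen[off] + kappa)
        keep = (Y[:, None] <= Y[None, :] + allow).all(axis=1)
        return grid[keep]

For data $Y = F + W$ on the grid, S_hat(Y, kappa_alpha()) then covers the minimizing set $S(F)$ with simulated probability about $1 - \alpha$.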
To illustrate the power of this method, consider for instance a sequence of functions
$F = F_n = c_n F_o$ with positive constants $c_n \to \infty$ and a fixed continuous function $F_o$ with
unique minimizer $s_o$. Suppose that
$$\lim_{t \to s_o} \frac{F_o(t) - F_o(s_o)}{|t - s_o|^{\gamma}} = 1$$
for some $\gamma > 1/2$. Then the naive confidence region satisfies only
$$\max_{t \in \hat{S}_{\alpha}^{\mathrm{naive}}} |t - s_o| = O_p\bigl( c_n^{-1/\gamma} \bigr), \eqno(5)$$
whereas
$$\max_{t \in \hat{S}_{\alpha}} |t - s_o| = O_p\bigl( \log(c_n)^{1/(2\gamma - 1)}\, c_n^{-2/(2\gamma - 1)} \bigr). \eqno(6)$$
$m \hat\sigma_n^2/\sigma^2 \sim \chi^2_m$, and
$$\Delta_n^2 := \frac{2n}{m} = O(1).$$
Important for our analysis is the behavior of the centered and rescaled difference process
$D_n = \bigl( D_n(j,k) \bigr)_{0 \le j < k \le n}$ with
$$D_n(j,k) := \bigl( 4\|\theta_n/\sigma\|^2 + 2n \bigr)^{-1/2} \sum_{i=j+1}^{k} \Bigl( 2(\theta_{in}/\sigma)(\varepsilon_{in}/\sigma) + (\varepsilon_{in}/\sigma)^2 - 1 \Bigr), \eqno(7)$$
$$V_n(j,k) := \bigl( 4\|\theta_n/\sigma\|^2 + 2n \bigr)^{-1/2}\; 2(k-j)\bigl( 1 - \hat\sigma_n^2/\sigma^2 \bigr). \eqno(8)$$
This representation shows that the distribution of $\hat D_n$ depends on the degrees of free-
dom, $m$, and the unknown signal-to-noise vector $\theta_n/\sigma$. The process $D_n$ consists of par-
tial sums of the independent, but in general non-identically distributed random variables
$2(\theta_{in}/\sigma)(\varepsilon_{in}/\sigma) + (\varepsilon_{in}/\sigma)^2 - 1$. The standard deviation of $D_n(j,k)$ is given by
$$\rho_n(j,k) := \bigl( 4\|\theta_n/\sigma\|^2 + 2n \bigr)^{-1/2} \Bigl( \sum_{i=j+1}^{k} \bigl( 4\theta_{in}^2/\sigma^2 + 2 \bigr) \Bigr)^{1/2}.$$
Note that $\rho_n(0,n) = 1$ by construction. To imitate the more powerful confidence region of
Section 2 based on the multiscale approach, one needs a refined analysis of the increment
process $D_n$. Since this process does not have subgaussian tails, the standardization is more
involved than the correction in (4).
Theorem 1. Define $\Gamma_n(j,k) := \bigl( 2 \log\bigl( e/\rho_n(j,k)^2 \bigr) \bigr)^{1/2}$ for $0 \le j < k \le n$. Then

The limiting distribution indicates that the additive correction term in the definition of
$\hat d_n$ cannot be chosen essentially smaller. It will play a crucial role for the efficiency of the
confidence region.
To construct a confidence set for $K_n(\theta_n)$ by means of $\hat d_n$, we are facing the problem that
the auxiliary function $\rho_n(\cdot,\cdot)$ depends on the unknown signal-to-noise vector $\theta_n/\sigma$. In fact,
knowing $\theta_n$ would imply knowledge of $K_n(\theta_n)$ already. A natural approach is to replace
the quantities which depend on the unknown parameter by suitable estimates. A
common estimator of the variance $\rho_n(j,k)^2$, $j < k$, is given by
$$\hat\rho_n(j,k)^2 := \Bigl( \sum_{i=1}^{n} \bigl( 4(X_{in}^2/\hat\sigma_n^2 - 1) + 2 \bigr) \Bigr)^{-1} \sum_{i=j+1}^{k} \bigl( 4(X_{in}^2/\hat\sigma_n^2 - 1) + 2 \bigr).$$
as $n$ goes to infinity. This can be verified by noting that the (rescaled) numerator of
$\bigl( \hat\rho_n(j,k)^2 \bigr)_{0 \le j < k \le n}$ is, up to centering, essentially of the same structure as the rescaled
difference process $D_n$ itself.
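In code, these estimated variances are cheap to obtain from cumulative sums. A sketch (Python with NumPy; X and sigma_hat are hypothetical inputs, and individual summands may be negative in small samples, which only the asymptotic analysis justifies ignoring):

    import numpy as np

    def rho_hat_sq(X, sigma_hat):
        """Matrix of estimated variances rho_hat_n(j,k)^2 for 0 <= j < k <= n."""
        terms = 4.0 * (X ** 2 / sigma_hat ** 2 - 1.0) + 2.0  # estimates 4 theta^2/sigma^2 + 2
        csum = np.concatenate(([0.0], np.cumsum(terms)))     # csum[k] = sum of first k terms
        return (csum[None, :] - csum[:, None]) / csum[-1]    # entry (j,k), meaningful for j < k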
The problem of estimating the set $\operatorname{argmin}_k R_n(k)$ can be cast into our toy model, where
$Y(t)$, $F(t)$ and $W(t)$ correspond to $\hat R_n(k)$, $R_n(k)$ and the difference $\hat R_n(k) - R_n(k)$, re-
spectively. One may expect that the more distinctive the global minima are, the easier it is
to identify their location. Hence the case of constant risks appears to be least favourable,
corresponding to the signal
$$\theta_n^o := (\sigma)_{i=1}^n.$$
In this situation, each candidate estimator $\hat\theta_n^{(k)}$ has the same risk of $n\sigma^2$.
A related consideration leading to an explicit procedure is as follows: For fixed indices
$0 \le j < k \le n$,
$$R_n(j) - R_n(k) = \sum_{i=j+1}^{k} \theta_{in}^2 - (k-j)\,\sigma^2,$$
so that $R_n(j) \ge R_n(k)$ if and only if $\sum_{i=j+1}^k \theta_{in}^2 \ge (k-j)\sigma^2$. This suggests considering
$$T_{jkn} := \frac{\sum_{i=j+1}^{k} X_{in}^2}{(k-j)\,\hat\sigma_n^2},$$
a noncentral F-type statistic with $k - j$ and $m$ degrees of freedom. Thus large or small values of $T_{jkn}$ give evidence for
$R_n(j)$ being larger or smaller, respectively, than $R_n(k)$. Precisely,
$$\mathcal{L}_{\theta_n}(T_{jkn}) \begin{cases} \le_{\mathrm{st}}\; \mathcal{L}_{\theta_n^o}(T_{jkn}) & \text{whenever } j \in K_n(\theta_n), \\ \ge_{\mathrm{st}}\; \mathcal{L}_{\theta_n^o}(T_{jkn}) & \text{whenever } k \in K_n(\theta_n). \end{cases}$$
Note that this stochastic ordering remains valid if $\hat\sigma_n^2$ is just independent from $X_n$, i.e. also
under the more general requirement of the remark at the end of this section. Via suitable
coupling of Poisson mixtures of central $\chi^2$-distributed random variables, this observation
is extended to a coupling for the whole process $\bigl( T_{jkn} \bigr)_{0 \le j < k \le n}$:
Proposition 2 (Coupling). For any $\theta_n \in \mathbb{R}^n$ there exists a probability space with random
variables $\bigl( \hat T_{jkn} \bigr)_{0 \le j < k \le n}$ and $\bigl( \hat T_{jkn}^{\,o} \bigr)_{0 \le j < k \le n}$ such that
$$\mathcal{L}\bigl( (\hat T_{jkn})_{0 \le j < k \le n} \bigr) = \mathcal{L}_{\theta_n}\bigl( (T_{jkn})_{0 \le j < k \le n} \bigr),$$
$$\mathcal{L}\bigl( (\hat T_{jkn}^{\,o})_{0 \le j < k \le n} \bigr) = \mathcal{L}_{\theta_n^o}\bigl( (T_{jkn})_{0 \le j < k \le n} \bigr),$$
and such that $\hat T_{jkn} \le \hat T_{jkn}^{\,o}$ whenever $j \in K_n(\theta_n)$, while $\hat T_{jkn} \ge \hat T_{jkn}^{\,o}$ whenever
$k \in K_n(\theta_n)$.
simplicity $c := 3$ in the definition of $\hat d_n$. Note also that $\rho_n(j,k)^2 = (k-j)/n$ in case of
$\theta_n = \theta_n^o$,

with
$$\tilde c_{jkn} = \tilde c_{jkn,\alpha} := \sqrt{\frac{6}{|k-j|}}\, \Bigl( \Gamma\bigl( \tfrac{|k-j|}{n} \bigr) + \kappa_{n,\alpha} \Bigr) + \frac{3}{|k-j|}\, \Bigl( \Gamma\bigl( \tfrac{|k-j|}{n} \bigr) + \kappa_{n,\alpha} \Bigr)^2,$$
where $\Gamma(u) := \bigl( 2\log(e/u) \bigr)^{1/2}$.
|k j| n |k j| n
In case of n 0 (i.e. n/m 0), the critical values n, converge to the critical value
introduced in Section 2. In general, n, = O(1), and the confidence regions Kn, satisfy
the oracle inequalities
r
max Rn (k) min Rn (j) + 4 3 + op (1) 2 log(n) min Rn (j) (10)
kKn, jCn jCn
+ Op 2 log n
and
r
max Ln (k) min Ln (j) + Op 2 log(n) min Ln (j) (11)
kKn, jCn jCn
+ Op 2 log n .
uniformly in (n).
Remark (Variance estimation). Instead of Condition (A), one may require more gen-
erally that $\hat\sigma_n^2$ and $X_n$ are independent with
$$\sqrt{n}\, \Bigl( \frac{\hat\sigma_n^2}{\sigma^2} - 1 \Bigr) \to_D \mathcal{N}(0, \tau^2)$$
for a given $\tau \ge 0$. This covers, for instance, estimators used in connection with wavelets.
There $\sigma$ is estimated by the median of some high-frequency wavelet coefficients, divided by
the normal quantile $\Phi^{-1}(3/4)$. Theorem 1 continues to hold, and the coupling extends to
this situation, too, with $S^2$ in the proof being distributed as $n\hat\sigma_n^2$. Under this assumption
on the external variance estimator, the confidence region $\hat K_{n,\alpha}$, defined with $m := 2n/\tau^2$,
is at least asymptotically valid and satisfies the above oracle inequalities as well.
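For concreteness, a standard form of the wavelet-based estimator alluded to above reads as follows (a sketch; it assumes the finest-level wavelet coefficients are already available and is not code from the paper):

    import numpy as np
    from scipy.stats import norm

    def sigma_hat_mad(finest_coeffs):
        """Median absolute high-frequency wavelet coefficient, divided by
        Phi^{-1}(3/4); robust because only few coefficients carry signal."""
        return np.median(np.abs(finest_coeffs)) / norm.ppf(0.75)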
The previous result relies strongly on the assumption of nested models. It is possible to
obtain confidence sets for the optimal approximating models in a more general setting,
although the resulting oracle property is not as strong as in the nested case. In particular, we
can no longer rely on a coupling result but need a different construction. For the reader's
convenience, we focus on the case of known $\sigma$, i.e. $m = \infty$; see also the remark at the end
of this section.
Let $\mathcal{C}_n$ be a family of index sets $C \subset \{1, 2, \ldots, n\}$ with candidate estimators
$$\hat\theta^{(C)} := \bigl( 1\{i \in C\}\, X_{in} \bigr)_{i=1}^n$$
and risks
$$R_n(C) := R(\hat\theta^{(C)}, \theta_n) = \sum_{i \notin C} \theta_{in}^2 + |C|\, \sigma^2,$$
where $|S|$ denotes the cardinality of a set $S$. For two index sets $C$ and $D$,
$$\sigma^{-2}\bigl( R_n(D) - R_n(C) \bigr) = \delta_n^2(C \setminus D) - \delta_n^2(D \setminus C) + |D| - |C|,$$
where
$$\delta_n^2(J) := \sum_{i \in J} \theta_{in}^2 / \sigma^2, \qquad J \subset \{1, 2, \ldots, n\}.$$
Note that
$$T_n(J) := \frac{1}{\sigma^2} \sum_{i \in J} X_{in}^2$$
has a $\chi^2_{|J|}\bigl( \delta_n^2(J) \bigr)$-distribution. We denote the distribution function of $\chi^2_k(\delta^2)$ by $F_k(\cdot \mid \delta^2)$.
Now let $M_n := |\mathcal{M}_n| \le |\mathcal{C}_n|(|\mathcal{C}_n| - 1)$ be the number of nonvoid index sets $J \in \mathcal{M}_n :=
\{ C \setminus D : C, D \in \mathcal{C}_n \}$. Then with probability at least $1 - \alpha$,
$$\frac{\alpha}{2M_n} \le F_{|J|}\bigl( T_n(J) \mid \delta_n^2(J) \bigr) \le 1 - \frac{\alpha}{2M_n} \quad\text{for all nonvoid } J \in \mathcal{M}_n. \eqno(12)$$
Since $F_{|J|}(T_n(J) \mid \delta^2)$ is strictly decreasing in $\delta^2$ with limit 0 as $\delta^2 \to \infty$, (12) entails the
simultaneous $(1-\alpha)$-confidence intervals $\bigl[ \delta_{n,\alpha,l}^2(J), \delta_{n,\alpha,u}^2(J) \bigr]$ for all parameters $\delta_n^2(J)$ as
follows: We set $\delta_{n,\alpha,l}^2(\emptyset) := \delta_{n,\alpha,u}^2(\emptyset) := 0$, while for nonvoid $J$,
$$\delta_{n,\alpha,l}^2(J) := \min\bigl\{ \delta^2 \ge 0 : F_{|J|}\bigl( T_n(J) \mid \delta^2 \bigr) \le 1 - \alpha/(2M_n) \bigr\}, \eqno(13)$$
$$\delta_{n,\alpha,u}^2(J) := \max\bigl\{ \delta^2 \ge 0 : F_{|J|}\bigl( T_n(J) \mid \delta^2 \bigr) \ge \alpha/(2M_n) \bigr\}. \eqno(14)$$
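Numerically, (13) and (14) are one-dimensional root-finding problems in the noncentrality parameter, since $F_{|J|}(x \mid \delta^2)$ is continuous and strictly decreasing in $\delta^2$. A sketch (Python with SciPy; the doubling search for a bracket is an ad-hoc implementation detail, not part of the paper's procedure):

    from scipy.stats import chi2, ncx2
    from scipy.optimize import brentq

    def delta_bounds(T, df, level):
        """Bounds (13) and (14) for delta^2, given T = T_n(J), df = |J| and
        level = alpha/(2 M_n)."""
        def F(d2):                          # F_{|J|}(T | d2), decreasing in d2
            return chi2.cdf(T, df) if d2 == 0.0 else ncx2.cdf(T, df, d2)
        def bracket(target):                # grow the upper end until F drops below target
            hi = 1.0
            while F(hi) >= target:
                hi *= 2.0
            return hi
        # (13): smallest d2 with F(T | d2) <= 1 - level
        lower = 0.0 if F(0.0) <= 1.0 - level else \
            brentq(lambda d2: F(d2) - (1.0 - level), 0.0, bracket(1.0 - level))
        # (14): largest d2 with F(T | d2) >= level
        upper = 0.0 if F(0.0) <= level else \
            brentq(lambda d2: F(d2) - level, 0.0, bracket(level))
        return lower, upper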
By means of these bounds, we may claim with confidence $1 - \alpha$ that for arbitrary $C, D \in \mathcal{C}_n$
the normalized difference $\sigma^{-2}\bigl( R_n(D) - R_n(C) \bigr)$ is at most $\delta_{n,\alpha,u}^2(C \setminus D) - \delta_{n,\alpha,l}^2(D \setminus C) + |D| - |C|$. Thus a $(1-\alpha)$-confidence set for $K_n(\theta_n) = \operatorname{argmin}_{C \in \mathcal{C}_n} R_n(C)$ is given by
$$\hat K_{n,\alpha} := \bigl\{ C \in \mathcal{C}_n : \delta_{n,\alpha,u}^2(C \setminus D) - \delta_{n,\alpha,l}^2(D \setminus C) + |D| - |C| \ge 0 \text{ for all } D \in \mathcal{C}_n \bigr\}.$$
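Combining the bounds with this defining inequality yields the following brute-force sketch of $\hat K_{n,\alpha}$ (Python, quadratic in $|\mathcal{C}_n|$; it uses 0-based indices, hypothetical inputs X and sigma, and delta_bounds from the previous sketch):

    def K_hat(X, sigma, C_family, alpha):
        """Keep C unless for some D the upper confidence bound on
        sigma^{-2} (R_n(D) - R_n(C)) is negative."""
        sets = [frozenset(C) for C in C_family]
        diffs = {C - D for C in sets for D in sets if C - D}   # nonvoid differences
        level = alpha / (2 * max(len(diffs), 1))
        bnd = {J: delta_bounds(sum(X[i] ** 2 for i in J) / sigma ** 2, len(J), level)
               for J in diffs}
        lo = lambda J: bnd[J][0] if J else 0.0
        up = lambda J: bnd[J][1] if J else 0.0
        return [C for C in sets
                if all(up(C - D) - lo(D - C) + len(D) - len(C) >= 0 for D in sets)]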
Theorem 4. Let $(\theta_n)_{n \in \mathbb{N}}$ be arbitrary, and suppose that $\log |\mathcal{C}_n| = o(n)$. Then
$$\max_{C \in \hat K_{n,\alpha}} R_n(C) \le \min_{D \in \mathcal{C}_n} R_n(D) + O_p\Bigl( \sqrt{\sigma^2 \log(|\mathcal{C}_n|) \min_{D \in \mathcal{C}_n} R_n(D)} \Bigr) + O_p\bigl( \sigma^2 \log |\mathcal{C}_n| \bigr)$$
and
$$\max_{C \in \hat K_{n,\alpha}} L_n(C) \le \min_{D \in \mathcal{C}_n} L_n(D) + O_p\Bigl( \sqrt{\sigma^2 \log(|\mathcal{C}_n|) \min_{D \in \mathcal{C}_n} L_n(D)} \Bigr) + O_p\bigl( \sigma^2 \log |\mathcal{C}_n| \bigr).$$
Both bounds are of the form $r_n + O_p\bigl( \sqrt{\sigma^2 \log(|\mathcal{C}_n|)\, r_n} \bigr) + O_p\bigl( \sigma^2 \log |\mathcal{C}_n| \bigr)$
with $r_n$ denoting minimal risk or minimal loss. Thus Theorem 4 entails that the maximal
risk (loss) over $\hat K_{n,\alpha}$ exceeds the minimal risk (loss) only by a factor close to one, provided
that the minimal risk (loss) is substantially larger than $\sigma^2 \log |\mathcal{C}_n|$.
The intrinsic reason is that the general procedure does not assume any structure on the
family of candidate estimators. Hence advanced multiscale theory is not applicable.
The latter inequalities entail that $(\sigma/\hat\sigma_n)^2$ lies between $\gamma_{n,\alpha,l} := m/\chi^2_{m;1-\alpha/2}$ and $\gamma_{n,\alpha,u} :=
m/\chi^2_{m;\alpha/2}$. Then we obtain simultaneous $(1-\alpha)$-confidence bounds $\delta_{n,\alpha,l}^2(J)$ and $\delta_{n,\alpha,u}^2(J)$
as in (13) and (14) by replacing $\alpha$ with $\alpha/2$ and $T_n(J)$ with
$$\frac{\gamma_{n,\alpha,l}}{\hat\sigma_n^2} \sum_{i \in J} X_{in}^2 \quad\text{and}\quad \frac{\gamma_{n,\alpha,u}}{\hat\sigma_n^2} \sum_{i \in J} X_{in}^2,$$
respectively.
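In the notation of the earlier sketches, this adjustment only rescales the statistic fed into the bounds; a hedged fragment (Python, assuming Condition (A), i.e. $m\hat\sigma_n^2/\sigma^2 \sim \chi^2_m$):

    from scipy.stats import chi2

    def gamma_bounds(m, alpha):
        """gamma_{n,alpha,l} <= (sigma/sigma_hat_n)^2 <= gamma_{n,alpha,u}
        with probability 1 - alpha."""
        return m / chi2.ppf(1.0 - alpha / 2.0, m), m / chi2.ppf(alpha / 2.0, m)

    # The two scaled statistics replacing T_n(J) in (13) and (14) are then
    # gamma_l / sigma_hat**2 * sum(X[i]**2 for i in J)  and
    # gamma_u / sigma_hat**2 * sum(X[i]**2 for i in J).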
5. Proofs
Note first that $\min_{[0,1]} Y$ lies between $F_n(s_o) + \min_{[0,1]} W$ and $F_n(s_o) + W(s_o)$. Hence for
any $\alpha \in (0,1)$,
$$\hat S_\alpha \subset \Bigl\{ s \in [0,1] : F_n(s) \le F_n(s_o) + \sqrt{|s - s_o|}\, \bigl( 2\sqrt{2\log(e/|s-s_o|)} + O_p(1) \bigr) \Bigr\}$$
and
$$\hat S_\alpha^{\mathrm{naive}} \subset \bigl\{ s \in [0,1] : F_n(s) \le F_n(s_o) + O_p(1) \bigr\}$$
with $O_p(1)$ not depending on $s$. Now the expansion of $F_o$ near $s_o$ entails claims (5) and (6).
An essential ingredient for our main results is an exponential inequality for quadratic
functions of a Gaussian random vector. It extends inequalities of Dahlhaus and Polonik
(2006) for quadratic forms and is of independent interest.
Proposition 5. Let $Z_1, Z_2, \ldots, Z_n$ be independent standard Gaussian random variables,
and let $\gamma_1, \ldots, \gamma_n$ and $\mu_1, \ldots, \mu_n$ be real constants. Define $\gamma_{\max} := \max_i |\gamma_i|$ and
$\Lambda := \bigl( \sum_{i=1}^n \gamma_i^2 (4\mu_i^2 + 2) \bigr)^{1/2}$. Then for arbitrary $\eta \ge 0$,
$$P\Bigl( \sum_{i=1}^{n} \gamma_i \bigl( (Z_i + \mu_i)^2 - (1 + \mu_i^2) \bigr) \ge \Lambda\eta \Bigr) \le \exp\Bigl( - \frac{\eta^2}{2 + 4\gamma_{\max}\eta/\Lambda} \Bigr) \le e^{1/4} \exp\bigl( - \eta/\sqrt{8} \bigr),$$
where the second inequality holds whenever $\gamma_{\max} \le \Lambda/\sqrt{2}$, because then
$$\frac{\eta^2}{2 + 4\gamma_{\max}\eta/\Lambda} \ge \frac{\eta^2}{2 + \sqrt{8}\,\eta} \ge \frac{\eta}{\sqrt{8}} - \frac{1}{4}.$$
Throughout this section we assume without loss of generality that $\sigma = 1$. Further let
$\mathcal{S}_n := \{0, 1, \ldots, n\}$ and $\mathcal{T}_n := \bigl\{ (j,k) : 0 \le j < k \le n \bigr\}$.
Proof of Theorem 1.
Step I. We first analyze $D_n$ in place of $\hat D_n$. To collect the necessary ingredients, let the
metric $\rho_n$ on $\mathcal{T}_n$ be defined pointwise by
$$\rho_n\bigl( (j,k), (j',k') \bigr) := \sqrt{ \rho_n(j,j')^2 + \rho_n(k,k')^2 }.$$
We need bounds for the capacity numbers $D(u, \mathcal{T}, \rho_n)$ (cf. Section 6) for certain $u > 0$
and $\mathcal{T} \subset \mathcal{T}_n$. The proof of Theorem 2.1 of Dümbgen and Spokoiny (2001) entails that
$$D\bigl( u\delta, \{ t \in \mathcal{T}_n : \rho_n(t) \le \delta \}, \rho_n \bigr) \le \frac{12}{u^4 \delta^2} \quad\text{for all } u, \delta \in (0,1]. \eqno(21)$$
with
$$\gamma_i = \gamma_{in}(j,k) := \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2}\, I_{(j,k]}(i),$$
so $|\gamma_i| \le \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2}$. Hence it follows from Proposition 5 that
$$P\bigl( |D_n(t)| \ge \rho_n(t)\,\eta \bigr) \le 2 \exp\Bigl( - \frac{\eta^2}{2 + 4\bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2} \eta / \rho_n(t)} \Bigr) \eqno(22)$$
and
$$P\bigl( |D_n(s) - D_n(t)| \ge \sqrt{8}\, \rho_n(s,t)\, \eta \bigr) \le Q \exp(-\eta) \eqno(24)$$
for a suitable constant $Q > 0$ (here one may take $Q = 2e^{1/4}$). Since $D_n(j,k) = D_n(0,k) - D_n(0,j)$ and $\rho_n(j,k) =
\rho_n\bigl( (0,j), (0,k) \bigr)$, this entails the stochastic equicontinuity of $D_n$ with respect to $\rho_n$.
For $0 < \delta \le 1$ define
$$T_n(\delta, \beta) := \sup_{t \in \mathcal{T}_n : \delta < \rho_n(t) \le \beta} \rho_n(t) \Bigl( \frac{|D_n(t)|}{\rho_n(t)} - \Gamma_n(t) - \frac{c\, \Gamma_n(t)^2}{\rho_n(t) \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2}} \Bigr)^{\!+}$$
with a constant $c > 0$ to be specified later. Recall that $\Gamma_n(t) := \bigl( 2 \log\bigl( e/\rho_n(t)^2 \bigr) \bigr)^{1/2}$.
Starting from (21), (22) and (25), Theorem 8 of Dümbgen and Walther (2007) and its
subsequent remark imply that

provided that $c > 2$. On the other hand, (21), (23) and (25) entail that
Now we are ready to prove the first assertion about $\hat D_n$. Recall that $\hat D_n = \hat\sigma_n^{-2}(D_n + V_n)$
and
$$\frac{V_n(j,k)}{\rho_n(j,k)} = \frac{2(k-j)}{\rho_n(j,k) \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2}} \cdot \frac{\Delta_n}{\sqrt{n}}\, Z_n$$
with $Z_n$ being asymptotically standard normal. Since $\rho_n(j,k) \ge \sqrt{2(k-j)}\, \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2}$,
$$|V_n(j,k)| \le \rho_n(j,k)\, \sqrt{\frac{2(k-j)}{n}}\, \Delta_n |Z_n| \le \sqrt{2}\, \rho_n(j,k)\, \Delta_n |Z_n|, \eqno(28)$$
so the maximum of $|V_n|/\rho_n$ over $\mathcal{T}_n$ is bounded by $\sqrt{2}\, \Delta_n |Z_n| = O_p(1)$. Furthermore, since
$|\mathcal{T}_n| \le n^2/2$, one can easily deduce from (23) that the maximum of $|D_n|/\rho_n$ over $\mathcal{T}_n$ exceeds
$\sqrt{32 \log n} + \eta$ with probability at most $e^{1/4} \exp\bigl( -\eta/\sqrt{8} \bigr)$. Since $\hat\sigma_n = 1 + O_p(n^{-1/2})$, these
considerations show that
$$\max_{t \in \mathcal{T}_n} \frac{|(D_n + V_n)(t)|}{\rho_n(t)} \le \sqrt{32 \log n} + O_p(1)$$
and
$$\max_{t \in \mathcal{T}_n} \frac{\bigl| \hat D_n(t) - (D_n + V_n)(t) \bigr|}{\rho_n(t)} = O_p(n^{-1/2} \log n).$$
This proves our first assertion about $\hat D_n/\rho_n$.
Step II. Because $\hat\sigma_n^2 \to_p 1$, it is sufficient for the proof of the weak approximation to show that
$$d_w\Bigl( \bigl( D_n(t) \bigr)_{t \in \mathcal{T}_n},\; \bigl( W_n(t) \bigr)_{t \in \mathcal{T}_n} \Bigr) \to 0 \quad\text{as } n \to \infty. \eqno(29)$$
To this end we write $D_n(j,k) = D_{n,1}(j,k) + D_{n,2}(j,k) + D_{n,3}(j,k)$ with
$$D_{n,1}(j,k) := \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2} \sum_{i=j+1}^{k} 1\{|\theta_{in}| \le \lambda_n\} \bigl( 2\theta_{in}\varepsilon_{in} + \varepsilon_{in}^2 - 1 \bigr),$$
$$D_{n,2}(j,k) := \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2} \sum_{i=j+1}^{k} 1\{|\theta_{in}| > \lambda_n\}\, 2\theta_{in}\varepsilon_{in},$$
$$D_{n,3}(j,k) := \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2} \sum_{i=j+1}^{k} 1\{|\theta_{in}| > \lambda_n\} \bigl( \varepsilon_{in}^2 - 1 \bigr)$$
and arbitrary numbers $\lambda_n > 0$ such that $\lambda_n \to \infty$ but $\lambda_n \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2} \to 0$. These
three random variables $D_{n,s}(j,k)$ are uncorrelated and have mean zero. The number $a_n :=
\#\{i : |\theta_{in}| > \lambda_n\}$ satisfies the inequality $\|\theta_n\|^2 \ge a_n \lambda_n^2$, whence
$$E\, D_{n,3}(j,k)^2 \le \frac{2 a_n}{2n + 4\|\theta_n\|^2} \le \frac{1}{2\lambda_n^2} \to 0.$$
Moreover, $D_{n,1}(j,k)$ and $D_{n,2}(j,k)$ are stochastically independent, where $D_{n,1}(j,k)$ is
asymptotically Gaussian by virtue of Lindeberg's CLT, while $D_{n,2}(j,k)$ is exactly Gaus-
sian. These findings entail (29).
$$\max_{t : \rho_n(t) \ge \delta} \frac{\Gamma_n(t)^2}{\bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2} \rho_n(t)} \to 0 \quad\text{while}\quad \max_{t : \rho_n(t) \ge \delta} |\Gamma_n(t)| = O(1)$$
for any fixed $\delta \in (0,1)$. Consequently it follows from step II that
$$d_w\bigl( S_n(\delta, 1),\; \tilde S_n(\delta, 1) \bigr) \to 0 \eqno(32)$$
provided that $\|\theta_n\|^2 = O(n)$. For $\Delta \in (0, \infty)$ this claim follows, for instance, with the same
arguments as (26). Moreover, $S_n(0, \delta)$ is not greater than
$$T_n(0,\delta) + \sup_{t \in \mathcal{T}_n : \rho_n(t) \le \delta} \frac{|V_n(t)|}{\rho_n(t)} \le T_n(0,\delta) + \delta\, \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2} \frac{\Delta_n |Z_n|}{\sqrt{n}},$$
according to (28). Thus our claim follows from (26) and $\|\theta_n\|^2 = O(n)$.
$$\chi^2_k(\delta^2) \;=\; \sum_{j=0}^{\infty} e^{-\delta^2/2}\, \frac{(\delta^2/2)^j}{j!}\; \chi^2_{k+2j},$$
as can be proved via Laplace transforms. Now we define time points
$$t_{kn} := \sum_{i=1}^{k} \theta_{in}^2 \quad\text{and}\quad \tilde t_{kn} := t_{j(n)n} + \bigl( k - j(n) \bigr)$$
with $j(n)$ any fixed index in $K_n(\theta_n)$. This construction entails that $\tilde t_{kn} \ge t_{kn}$ with equality
if, and only if, $k \in K_n(\theta_n)$.
Figure 1 illustrates this construction. It shows the time points $t_{kn}$ (crosses) and $\tilde t_{kn}$
(dots and line) versus $k$ for a hypothetical signal $\theta_n \in \mathbb{R}^{40}$. Note that in this example,
$K_n(\theta_n)$ is given by $\{10, 11, 20, 21\}$.
Let $\Pi$, $G_1, G_2, \ldots, G_n$, $Z_1, Z_2, Z_3, \ldots$ and $S^2$ be stochastically independent random
variables, where $\Pi = (\Pi(t))_{t \ge 0}$ is a standard Poisson process, the $G_i$ and $Z_j$ are standard
Gaussian random variables, and $S^2 \sim \chi^2_m$. Then one can easily verify that
$$\hat T_{jkn} := \frac{m}{(k-j)\, S^2} \Bigl( \sum_{i=j+1}^{k} G_i^2 + \sum_{s = 2\Pi(t_{jn}/2) + 1}^{2\Pi(t_{kn}/2)} Z_s^2 \Bigr),$$
$$\hat T_{jkn}^{\,o} := \frac{m}{(k-j)\, S^2} \Bigl( \sum_{i=j+1}^{k} G_i^2 + \sum_{s = 2\Pi(\tilde t_{jn}/2) + 1}^{2\Pi(\tilde t_{kn}/2)} Z_s^2 \Bigr)$$
define random variables $(\hat T_{jkn})_{0 \le j < k \le n}$ and $(\hat T_{jkn}^{\,o})_{0 \le j < k \le n}$ with the desired properties.
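This construction is straightforward to simulate, which can serve as a sanity check of the coupling; a sketch (Python with NumPy, hypothetical inputs, 0-based indices) that follows the displayed representation, with one Poisson path $\Pi$ shared by both time scales:

    import numpy as np

    def coupled_T(theta, m, rng):
        """One joint draw of (T_hat_jkn) and (T_hat_jkn^o), 0 <= j < k <= n."""
        n = len(theta)
        t = np.concatenate(([0.0], np.cumsum(theta ** 2)))   # t_kn
        jn = int(np.argmin(np.arange(n + 1) - t))            # an index in K_n(theta_n)
        t_o = t[jn] + (np.arange(n + 1) - jn)                # tilde t_kn >= t_kn
        S2 = rng.chisquare(m)
        G2 = np.concatenate(([0.0], np.cumsum(rng.standard_normal(n) ** 2)))
        horizon = max(t[-1], t_o[-1]) / 2.0
        arrivals = np.cumsum(rng.exponential(size=int(3.0 * horizon) + 50))
        Pi = lambda s: int(np.searchsorted(arrivals, s))     # standard Poisson process
        Z2 = np.concatenate(([0.0], np.cumsum(rng.standard_normal(2 * Pi(horizon) + 2) ** 2)))
        chi_part = lambda tt, j, k: Z2[2 * Pi(tt[k] / 2.0)] - Z2[2 * Pi(tt[j] / 2.0)]
        T = np.full((n + 1, n + 1), np.nan)
        T_o = np.full((n + 1, n + 1), np.nan)
        for j in range(n + 1):
            for k in range(j + 1, n + 1):
                base = m / ((k - j) * S2)
                T[j, k] = base * (G2[k] - G2[j] + chi_part(t, j, k))
                T_o[j, k] = base * (G2[k] - G2[j] + chi_part(t_o, j, k))
        return T, T_o

Because $\tilde t_{kn} \ge t_{kn}$, with equality exactly on $K_n(\theta_n)$, the second sum in $\hat T^{\,o}_{jkn}$ dominates the one in $\hat T_{jkn}$ when $j \in K_n(\theta_n)$ and is dominated by it when $k \in K_n(\theta_n)$, reproducing the stochastic orderings stated before Proposition 2.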
Proof of Lemma 7. The inequality $y \le x + \sqrt{b(x+y)} + c$ entails that either $y < x + c$
or
$$(y - x - c)^2 \le b(x+y) = 2bx + b(y-x).$$
Since $y < x + c$ is stronger than the assertions of part (i), we only consider the displayed
quadratic inequality. The latter is equivalent to
$$\bigl( y - x - (b/2 + c) \bigr)^2 \le 2bx + (b/2 + c)^2 - c^2 = 2bx + b^2/4 + bc.$$
Hence the standard inequality $\sqrt{\sum_i z_i} \le \sum_i \sqrt{z_i}$ for nonnegative numbers $z_i$ leads to
$$y \le x + \sqrt{2bx + b^2/4 + bc} + b/2 + c \le x + \sqrt{2bx} + b + \sqrt{bc} + c.$$
Finally, $0 \le b \le c$ entails that $\sqrt{bc} \le (b+c)/2$.
As to part (ii), the definition of $h(x)$ entails that
$$h(h(x)) = x + \sqrt{a + bx} + \sqrt{ a + bx + b\sqrt{a + bx} + bc } + 2c$$
$$\le x + \sqrt{a + bx} + \sqrt{ (a + bx) + b\sqrt{a + bx} } + \sqrt{bc} + 2c$$
$$= x + \sqrt{a + bx} + \sqrt{a + bx}\, \sqrt{1 + b/\sqrt{a + bx}} + \sqrt{bc} + 2c$$
$$\le x + 2\sqrt{a + bx} + b/2 + \sqrt{bc} + 2c,$$
because $\sqrt{1 + d} \le 1 + d/2$ for arbitrary $d \ge 0$.
Proof of Theorem 3. The definition of $\hat K_{n,\alpha}$ and Proposition 2 together entail that $\hat K_{n,\alpha}$
contains $K_n(\theta_n)$ with probability at least $1 - \alpha$. The assertions about $\kappa_{n,\alpha}$ are immediate
consequences of Theorem 1 applied to $\theta_n = \theta_n^o$.
Now we verify the oracle inequalities (10) and (11). Let $\Lambda_n := \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2}$.
With $\tilde\rho_n$ we denote the function $\rho_n$ on $\mathcal{T}_n$ corresponding to $\theta_n^o$, and $\tilde\Lambda_n^2 := 6n$. Throughout this proof
we use the shorthand notation $M_n(\ell, k) := M_n(\ell) - M_n(k)$ for $M_n = R_n, \hat R_n, L_n, \hat L_n$ and
arbitrary $\ell, k \in \mathcal{C}_n$. Furthermore, $\rho_n(\ell, k) := \rho_n(k, \ell)$ if $\ell > k$, and $\rho_n(k,k) := 0$.
In the subsequent arguments, $k_n := \min\bigl( K_n(\theta_n) \bigr)$, while $j$ stands for a generic index in
$\hat K_{n,\alpha}$. The definition of the set $\hat K_{n,\alpha}$ entails that
$$\hat R_n(j, k_n) \le \hat\sigma_n^2 \Bigl[ \tilde\Lambda_n \tilde\rho_n(j, k_n) \bigl( \tilde\Gamma_n(j, k_n) + \kappa_{n,\alpha} \bigr) + O(\log n) \Bigr]. \eqno(33)$$
n
Here and subsequently, O(rn ) and Op (rn ) denote a generic number and random variable,
respectively, depending on n but neither on any other indices in Cn nor on (0, 1).
Precisely, in view of our remark on dependence of , we consider all (n) with (n) > 0
such that n,(n) = O(n1/6 ). Note that n2 = 1 + Op (n1/2 ). Moreover, n (j, kn )2 (j
2
kn )/n equals 12nx log(e/x) 12n with x := |j kn |/n [0, 1]. Thus we may rewrite
(33) as
j k
n
Rn (j, kn ) n (j, kn ) + n, + Op (log n). (34)
n
Combining this with the equation Rn (j, kn ) = Rn (j, kn ) Dn (j, kn ) yields
j k
n
Rn (j, kn ) n (j, kn ) + n, + Op (log n) + |Dn (j, kn )|. (35)
n
Since n (j, kn )2 6n and maxtTn |Dn (t)|/n (t) = Op (log n), (35) yields
Rn (j, kn ) 12n + 6n n, + Op (log n)n (j, kn ).
This preliminary result allows us to restrict our attention to indices $j$ in a certain subset
of $\mathcal{C}_n$: Since $0 \le R_n(n, k_n) = n - k_n - \sum_{i=k_n+1}^{n} \theta_{in}^2$,
$$\sum_{i=k_n+1}^{n} \theta_{in}^2 \le n - k_n.$$
On the other hand, in case of $j < k_n$, $R_n(j, k_n) = \sum_{i=j+1}^{k_n} \theta_{in}^2 - (k_n - j)$, so
$$\sum_{i=j+1}^{n} \theta_{in}^2 \le n + O_p\bigl( \sqrt{n}\, (\log n + \kappa_{n,\alpha}) \bigr).$$
Thus if $j_n$ denotes the smallest index $j \in \mathcal{C}_n$ such that $\sum_{i=j+1}^{n} \theta_{in}^2 \le 2n$, then $k_n \ge j_n$, and
$\hat K_{n,\alpha} \subset \{j_n, \ldots, n\}$ with asymptotic probability one, uniformly in $\alpha \ge \alpha(n)$. This allows
us to restrict our attention to indices $j$ in $\{j_n, \ldots, n\} \cap \hat K_{n,\alpha}$. For any $\ell \ge j_n$, $D_n(\ell, k_n)$
involves only the restricted signal vector $(\theta_{in})_{i=j_n+1}^n$, and the proof of Theorem 1 entails
that
$$\max_{j_n \le \ell \le n} \Bigl( \frac{|D_n(\ell, k_n)|}{\rho_n(\ell, k_n)} - \sqrt{2 \log n} - \frac{\sqrt{2 c \log n}}{\rho_n(\ell, k_n)} \Bigr)^{\!+} = O_p(1).$$
Thus we may deduce from (35) the simpler statement that with asymptotic probability
one,
$$R_n(j, k_n) \le \bigl( \Lambda_n \rho_n(j, k_n) + \tilde\Lambda_n \tilde\rho_n(j, k_n) \bigr) \Bigl( \sqrt{2 \log n} + \kappa_{n,\alpha} + O_p(1) \Bigr) + O_p(\log n). \eqno(38)$$
Now we need reasonable bounds for $\rho_n(j, k_n)^2$ in terms of $R_n(j)$ and the minimal risk
$r_n = R_n(k_n)$, where we start from the equation in (36): If $j < k_n$, then $\Lambda_n^2 \rho_n(j, k_n)^2 =
\tilde\Lambda_n^2 \tilde\rho_n(j, k_n)^2 + 4 R_n(j, k_n)$ and $\tilde\Lambda_n^2 \tilde\rho_n(j, k_n)^2 = 6(k_n - j) \le 6 r_n$. If $j > k_n$, then $\tilde\Lambda_n^2 \tilde\rho_n(j, k_n)^2 =
\Lambda_n^2 \rho_n(j, k_n)^2 + 4 R_n(j, k_n)$ and
$$\Lambda_n^2 \rho_n(j, k_n)^2 = \sum_{i=k_n+1}^{j} \bigl( 4\theta_{in}^2 + 2 \bigr) \le 4 r_n + 2 R_n(j) = 6 r_n + 2 R_n(j, k_n).$$
Thus
$$\Lambda_n \rho_n(j, k_n) + \tilde\Lambda_n \tilde\rho_n(j, k_n) \le 2\sqrt{6 r_n} + \bigl( 2 + \sqrt{6} \bigr) \sqrt{R_n(j, k_n)},$$
for all $j \in \hat K_{n,\alpha}$. Again we may employ Lemma 7 with $x = 0$ and $y = R_n(j, k_n)$ to conclude
that
$$\max_{j \in \hat K_{n,\alpha}} R_n(j, k_n) \le \bigl( 4\sqrt{3 \log n} + 2\sqrt{6}\, \kappa_{n,\alpha} + O_p(1) \bigr) \sqrt{r_n} + O_p\Bigl( \bigl( \log(n)^{3/4} + \kappa_{n,\alpha(n)}^{3/2} \bigr)\, r_n^{1/4} \Bigr) + O_p\bigl( \log n + \kappa_{n,\alpha(n)}^2 \bigr)$$
uniformly in $\alpha \ge \alpha(n) > 0$.
If $\log(n)^3 + \kappa_{n,\alpha(n)}^6 = O(r_n)$, then the previous bound for $R_n(j, k_n) = R_n(j) - r_n$ reads
$$\max_{j \in \hat K_{n,\alpha}} R_n(j) \le r_n + \bigl( 4\sqrt{3 \log n} + 2\sqrt{6}\, \kappa_{n,\alpha} + O_p(1) \bigr) \sqrt{r_n}$$
uniformly in $\alpha \ge \alpha(n)$. On the other hand, if we consider just a fixed $\alpha > 0$, then
$\kappa_{n,\alpha} = O(1)$, and the previous considerations yield
$$\max_{j \in \hat K_{n,\alpha}} R_n(j) \le r_n + \bigl( 4\sqrt{3} + o_p(1) \bigr) \sqrt{\log(n)\, r_n} + O_p\bigl( \log(n)^{3/4}\, r_n^{1/4} + \log n \bigr)$$
$$\le r_n + \bigl( 4\sqrt{3} + o_p(1) \bigr) \sqrt{\log(n)\, r_n} + O_p(\log n).$$
To verify the latter step, note that for any fixed $\epsilon > 0$,
$$\log(n)^{3/4}\, r_n^{1/4} \le \begin{cases} \epsilon^{-1} \log n & \text{if } r_n \le \epsilon^{-4} \log n, \\ \epsilon\, \sqrt{\log(n)\, r_n} & \text{if } r_n \ge \epsilon^{-4} \log n. \end{cases}$$
It remains to prove claim (11) about the losses. From now on, $j$ denotes a generic index
in $\mathcal{C}_n$. Note first that
$$L_n(j, k_n) - R_n(j, k_n) = \sum_{i=j+1}^{k_n} \bigl( 1 - \varepsilon_{in}^2 \bigr) = R_n(k_n, j) - L_n(k_n, j) \quad\text{if } j < k_n,$$

where
$$\sigma_n^+(j, k_n) := \sqrt{2 |k_n - j|} \le \sqrt{2 r_n + 2 |R_n(j, k_n)|}.$$
Now we restrict our attention to indices $j \in \hat K_{n,\alpha}$ again. Here it follows from our result
about the maximal risk over $\hat K_{n,\alpha}$ that $L_n(j) - r_n$ equals
$$R_n(j, k_n) + O_p\Bigl( \sqrt{\log(n)\, r_n} \Bigr) + O_p\Bigl( \sqrt{\log(n)\, R_n(j, k_n)} \Bigr) + O_p(\log n) \le 2 R_n(j, k_n) + O_p\Bigl( \sqrt{\log(n)\, r_n} + \log n \Bigr).$$
for all $J \in \mathcal{M}_n$. Here and throughout this proof, $K$ denotes a generic constant not depend-
ing on $n$. Its value may be different in different expressions. It follows from the definition
of $\delta_n^2(\cdot)$ that
$$\delta_{n,\alpha,u}^2(C \setminus D) - \delta_{n,\alpha,l}^2(D \setminus C) + |D| - |C| - \sigma^{-2}\bigl( R_n(D) - R_n(C) \bigr) = \bigl( \delta_n^2 - \delta_{n,\alpha,l}^2 \bigr)(D \setminus C) + \bigl( \delta_{n,\alpha,u}^2 - \delta_n^2 \bigr)(C \setminus D)$$
$$\le \sqrt{8 \bigl( R_n(C) + R_n(D) \bigr) \log M_n}\, \bigl( 1 + o_p(1) \bigr) + (K + o_p(1)) \log M_n.$$
Note that
$$(L_n - R_n)(D) = \sum_{i \in D} \varepsilon_{in}^2 - |D|$$
for arbitrary $D \in \mathcal{C}_n$. Hence we may utilize (17-18), replacing the triple $(n, \delta^2, u)$ with
$(|D|, 0, \alpha/(2M_n))$, to complement (39) with the following observation:
$$- A \sqrt{|D| \log M_n} \le L_n(D) - R_n(D) \le A \sqrt{|D| \log M_n} + A \log M_n \eqno(40)$$
by Lemma 7 (i). Assuming that both (39) and (40) hold for some large but fixed $A$, we
may conclude that for arbitrary $C \in \hat K_{n,\alpha}$ and $D \in \mathcal{C}_n$,
$$L_n(C) - L_n(D)$$
6. Auxiliary results
This section collects some results from the vicinity of empirical process theory which are
used in the present paper.
For any pseudo-metric space $(\mathcal{X}, d)$ and $u > 0$, we define the capacity number
$$D(u, \mathcal{X}, d) := \max\bigl\{ |\mathcal{X}_o| : \mathcal{X}_o \subset \mathcal{X} \text{ with } d(x,y) > u \text{ for different } x, y \in \mathcal{X}_o \bigr\}.$$
$$d_w(X, Y) := \sup_{f \in \mathcal{H}(\mathcal{T})} \bigl( E^* f(X) - E^* f(Y) \bigr),$$
where $P^*$ and $E^*$ denote outer probabilities and expectations, respectively, while $\mathcal{H}(\mathcal{T})$ is
the family of all functionals $f : \ell_\infty(\mathcal{T}) \to \mathbb{R}$ such that

$$d_w(X_n, Y_n) \to 0$$
Proof. For any fixed number $\delta > 0$ let $\mathcal{T}_{n,o}$ be a maximal subset of $\mathcal{T}_n$ such that $\rho_n(s,t) > \delta$
for different $s, t \in \mathcal{T}_{n,o}$. Then $|\mathcal{T}_{n,o}| = O(1)$ by Assumption (iii). Moreover, for any $t \in \mathcal{T}_n$
there exists a $t_o \in \mathcal{T}_{n,o}$ such that $\rho_n(t, t_o) \le \delta$. Hence there exists a partition of $\mathcal{T}_n$ into
sets $B_n(t_o)$, $t_o \in \mathcal{T}_{n,o}$, satisfying $t_o \in B_n(t_o) \subset \bigl\{ t \in \mathcal{T}_n : \rho_n(t, t_o) \le \delta \bigr\}$. For any function $x$
in $\ell_\infty(\mathcal{T}_n)$ or $\ell_\infty(\mathcal{T}_{n,o})$ let $\Pi_n x \in \ell_\infty(\mathcal{T}_n)$ be given by
$$\Pi_n x(t) := \sum_{t_o \in \mathcal{T}_{n,o}} 1\{t \in B_n(t_o)\}\, x(t_o).$$
Then $\Pi_n x$ is linear in $x\big|_{\mathcal{T}_{n,o}}$ with $\|\Pi_n x\|_\infty = \bigl\| x\big|_{\mathcal{T}_{n,o}} \bigr\|_\infty$. Moreover, any $x \in \ell_\infty(\mathcal{T}_n)$ satisfies
the inequality $\|x - \Pi_n x\|_\infty \le w(x, \delta \mid \rho_n)$. Hence for $Z_n = X_n, Y_n$,
$$d_w(Z_n, \Pi_n Z_n) \le E^* \min\bigl\{ w(Z_n, \delta \mid \rho_n), 1 \bigr\},$$
and this is arbitrarily small for sufficiently small $\delta > 0$ and sufficiently large $n$, according
to Assumption (ii).
and the latter distance converges to zero, because of $|\mathcal{T}_{n,o}| = O(1)$ and Assumption (i).
Since
For this lemma it is important that we consider random variables rather than just
stochastic processes with bounded sample paths. Note that a stochastic process on $\mathcal{T}$
is automatically a random variable with values in $\bigl( \ell_\infty(\mathcal{T}), \|\cdot\|_\infty \bigr)$ if (a) the index set
$\mathcal{T}$ is finite, or (b) the process has uniformly continuous sample paths with respect to a
pseudo-metric $d$ on $\mathcal{T}$ such that $N(u, \mathcal{T}, d) < \infty$ for all $u > 0$.
Proof of Lemma 9. Without loss of generality let the four random variables $X_1$, $X_2$,
$Y_1$ and $Y_2$ be defined on a common probability space and stochastically independent. Let
$f$ be an arbitrary functional in $\mathcal{H}(\mathcal{T})$. Then it follows from Fubini's theorem that
$$\bigl| E f(X_1 + X_2) - E f(Y_1 + Y_2) \bigr| \le \bigl| E f(X_1 + X_2) - E f(Y_1 + X_2) \bigr| + \bigl| E f(Y_1 + X_2) - E f(Y_1 + Y_2) \bigr|$$
$$\le E^* \bigl| E\bigl( f(X_1 + X_2) \mid X_2 \bigr) - E\bigl( f(Y_1 + X_2) \mid X_2 \bigr) \bigr| + E^* \bigl| E\bigl( f(Y_1 + X_2) \mid Y_1 \bigr) - E\bigl( f(Y_1 + Y_2) \mid Y_1 \bigr) \bigr|$$
$$\le d_w(X_1, Y_1) + d_w(X_2, Y_2).$$
The latter inequality follows from the fact that the functionals $x \mapsto f(x + X_2)$ and $x \mapsto
f(Y_1 + x)$ belong to $\mathcal{H}(\mathcal{T})$, too. Thus $d_w(X, Y) \le d_w(X_1, Y_1) + d_w(X_2, Y_2)$.
References
[1] Baraud, Y. (2004). Confidence balls in Gaussian regression. Ann. Statist. 32,
528-551.
[2] Beran, R. (1996). Confidence sets centered at Cp estimators. Ann. Inst. Statist.
Math. 48, 1-15.
[3] Beran, R. (2000). REACT scatterplot smoothers: superefficiency through basis
economy. J. Amer. Statist. Assoc. 95, 155-169.
[4] Beran, R. and Dümbgen, L. (1998). Modulation of estimators and confidence sets.
Ann. Statist. 26, 1826-1856.
[5] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc.
3, 203-268.
[6] Cai, T.T. (1999). Adaptive wavelet estimation: a block thresholding and oracle
inequality approach. Ann. Statist. 27, 898-924.
[7] Cai, T.T. (2002). On block thresholding in wavelet regression: adaptivity, block size,
and threshold level. Statistica Sinica 12, 1241-1273.
[8] Cai, T.T. and Low, M.G. (2006). Adaptive confidence balls. Ann. Statist. 34,
202-228.
[9] Cai, T.T. and Low, M.G. (2007). Adaptive estimation and confidence intervals for
convex functions and monotone functions. Manuscript in preparation.
[10] Dahlhaus, R. and Polonik, W. (2006). Nonparametric quasi-maximum likelihood
estimation for Gaussian locally stationary processes. Ann. Statist. 34, 2790-2824.
[11] Donoho, D.L. and Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet
shrinkage. Biometrika 81, 425-455.
[12] Donoho, D.L. and Johnstone, I.M. (1995). Adapting to unknown smoothness via
wavelet shrinkage. J. Amer. Statist. Assoc. 90, 1200-1224.
[13] Donoho, D.L. and Johnstone, I.M. (1998). Minimax estimation via wavelet
shrinkage. Ann. Statist. 26, 879-921.
[14] Dümbgen, L. (2002). Application of local rank tests to nonparametric regression. J.
Nonpar. Statist. 14, 511-537.
[15] Dümbgen, L. (2003). Optimal confidence bands for shape-restricted curves. Bernoulli
9, 423-449.
[16] Dümbgen, L. and Spokoiny, V.G. (2001). Multiscale testing of qualitative hy-
potheses. Ann. Statist. 29, 124-152.
[17] Dümbgen, L. and Walther, G. (2007). Multiscale inference about a density. Tech-
nical report 56, IMSV, University of Bern.
[18] Efromovich, S. (1998). Simultaneous sharp estimation of functions and their deriva-
tives. Ann. Statist. 26, 273-278.
[19] Futschik, A. (1999). Confidence regions for the set of global maximizers of non-
parametrically estimated curves. J. Statist. Plann. Inf. 82, 237-250.
[20] Genovese, C.R. and Wasserman, L. (2005). Confidence sets for nonparametric
wavelet regression. Ann. Statist. 33, 698-729.
[21] Hengartner, N.W. and Stark, P.B. (1995). Finite-sample confidence envelopes
for shape-restricted densities. Ann. Statist. 23, 525-550.
[22] Hoffmann, M. and Lepski, O. (2002). Random rates in anisotropic regression
(with discussion). Ann. Statist. 30, 325-396.
[23] Lepski, O.V., Mammen, E. and Spokoiny, V.G. (1997). Optimal spatial adap-
tation to inhomogeneous smoothness: an approach based on kernel estimates with
variable bandwidth selectors. Ann. Statist. 25, 929-947.
[24] Li, K.-C. (1989). Honest confidence regions for nonparametric regression. Ann.
Statist. 17, 1001-1008.
[25] Polyak, B.T. and Tsybakov, A.B. (1991). Asymptotic optimality of the Cp -test
for the orthogonal series estimation of regression. Theory Probab. Appl. 35, 293-306.
[26] Robins, J. and van der Vaart, A. (2006). Adaptive nonparametric confidence
sets. Ann. Statist. 34, 229-253.
[27] Stone, C.J. (1984). An asymptotically optimal window selection rule for kernel
density estimates. Ann. Statist. 12, 1285-1297.