arXiv:1804.07566v2 [math.ST] 22 Nov 2018
Electronic Journal of Statistics
ISSN: 1935-7524

On the Post Selection Inference constant under Restricted Isometry Properties

François Bachoc
Institut de Mathématiques de Toulouse; UMR 5219; Université de Toulouse; CNRS UPS, F-31062 Toulouse Cedex 9, France
e-mail: [email protected]

Gilles Blanchard
Universität Potsdam, Institut für Mathematik, Karl-Liebknecht-Straße 24-25, 14476 Potsdam, Germany
e-mail: [email protected]

Pierre Neuvial
Institut de Mathématiques de Toulouse; UMR 5219; Université de Toulouse; CNRS UPS, F-31062 Toulouse Cedex 9, France
e-mail: [email protected]

Abstract: Uniformly valid confidence intervals post model selection in regression can be constructed based on Post-Selection Inference (PoSI) constants. PoSI constants are minimal for orthogonal design matrices, and can be upper bounded as a function of the sparsity of the set of models under consideration, for generic design matrices. In order to improve on these generic sparse upper bounds, we consider design matrices satisfying a Restricted Isometry Property (RIP) condition. We provide a new upper bound on the PoSI constant in this setting. This upper bound is an explicit function of the RIP constant of the design matrix, thereby giving an interpolation between the orthogonal setting and the generic sparse setting. We show that this upper bound is asymptotically optimal in many settings by constructing a matching lower bound.

MSC 2010 subject classifications: 62J05, 62J15, 62F25.
Keywords and phrases: Inference post model-selection, Confidence intervals, PoSI constants, Linear regression, High-dimensional inference, Sparsity, Restricted Isometry Property.

1. Introduction

Fitting a statistical model to data is often preceded by a model selection step. The construction of valid statistical procedures in such post model selection situations is quite challenging (cf. [21, 22, 23], [17] and [25], and the references given in that literature), and has recently attracted a considerable amount of attention. Among various recent references in this context, we can mention those addressing sparse high-dimensional settings with a focus on lasso-type model selection procedures [4, 5, 29, 31], those aiming for conditional coverage properties for polyhedral-type model selection procedures [14, 19, 20, 27, 28], and those achieving valid post selection inference universally over the model selection procedure [1, 2, 6].

In this paper, we shall focus on the latter type of approach and adopt the setting introduced in [6]. In that work, a linear Gaussian regression model is considered, based on an n × p design matrix X. A model M ⊂ {1, ..., p} is defined as a subset of indices of the p covariates. For a family M ⊂ {M | M ⊂ {1, ..., p}} of admissible models, it is shown in [6] that a universal coverage property is achievable (see Section 2) by using a family of confidence intervals whose sizes are proportional to a constant K(X, M) > 0. This constant K(X, M) is called a PoSI (Post-Selection Inference) constant in [6]. This setting was later extended to prediction problems in [1] and to misspecified non-linear settings in [2].

The focus of this paper is on the order of magnitude of the PoSI constant K(X, M) for large p. We shall consider n ≥ p for simplicity of exposition in the rest of this section (and asymptotics n, p → ∞).
It is shown in [6] that K(X, M) = Ω(√(log p)); this rate is reached in particular when X has orthogonal columns. On the other hand, in full generality K(X, M) = O(√p) for all X. It can also be shown, as discussed in an intermediary version of [32], that when M is composed of s-sparse submodels, the sharper upper bound K(X, M) = O(√(s log(p/s))) holds. Hence, intuitively, design matrices that are close to orthogonal and consideration of sparse models yield smaller PoSI constants.

In this paper, we obtain additional quantitative insights for this intuition, by considering design matrices X satisfying restricted isometry property (RIP) conditions. RIP conditions have become central in high-dimensional statistics and compressed sensing [8, 10, 15]. In the s-sparse setting and for design matrices X that satisfy a RIP property of order s with RIP constant δ → 0, we show that K(X, M) = O(√(log p) + δ√(s log(p/s))). This corresponds to the intuition that for such matrices, any subset of s columns of X is "approximately orthogonal". Thus, under the RIP condition we improve the upper bound of [32] for the s-sparse case, by up to a factor δ → 0. We show that our upper bound is complementary to the bounds recently proposed in [18]. In addition, we obtain lower bounds on K(X, M) for a class of design matrices that extends the equicorrelated design matrix in [6]. From these lower bounds, we show that the new upper bound we provide is optimal in a large range of situations.

While the main interest of our results is theoretical, our suggested upper bound can be practically useful in cases where it is computable whereas the PoSI constant K(X, M) is not. The only challenge for computing our upper bound is to find a value δ for which the design matrix X satisfies a RIP property. While this is currently challenging in general for large p, we discuss, in this paper, specific cases where it is feasible.

The rest of the paper is organized as follows. In Section 2 we introduce in more detail the setting and the PoSI constant K(X, M). In Section 3 we introduce the RIP condition, provide the upper bound on K(X, M) and discuss its theoretical comparison with [18] and its applicability. In Section 4 we provide the lower bound and the optimality result for the upper bound. All the proofs are given in the appendix.

2. Settings and notation

2.1. PoSI confidence intervals

We consider and review briefly the framework introduced by [6], for which the so-called PoSI constant plays a central role. The goal is to construct post-model-selection confidence intervals that are agnostic with respect to the model selection method used. The authors of [6] assume a Gaussian vector of observations

    Y = µ + ε,    (1)

where the n × 1 mean vector µ is fixed and unknown, and ε follows the N(0, σ^2 I_n) distribution, where σ^2 > 0 is unknown. Consider an n × p fixed design matrix X, whose columns correspond to explanatory variables for µ. It is not necessarily assumed that µ belongs to the image of X or that n ≥ p. A model M corresponds to a subset of selected variables in {1, ..., p}. A set of models of interest M ⊂ M_all = {M | M ⊂ {1, ..., p}} is supposed to be given.
Following [6], for any M ∈ M, the projection-based vector of regression coefficients β_M is a target of inference, with

    β_M := argmin_{β ∈ R^{|M|}} ||µ − X_M β||^2 = (X_M^t X_M)^{-1} X_M^t µ,    (2)

where X_M is the submatrix of X formed of the columns of X with indices in M, and where we assume that for each M ∈ M, X_M has full rank and M is non-empty. We refer to [6] for an interpretation of the vector β_M and a justification for considering it as a target of inference.

In [6], a family of confidence intervals (CI_{i,M}; i ∈ M ∈ M) for β_M is introduced, containing the targets (β_M)_{M ∈ M} simultaneously with probability at least 1 − α. The confidence intervals take the form

    CI_{i,M} := (β̂_M)_{i.M} ± σ̂ ||v_{M,i}|| K(X, M, α, r);    (3)

the different quantities involved, which we now define, are standard ingredients for univariate confidence intervals for regression coefficients in the Gaussian model, except for the last factor (the "PoSI constant"), which will account for the multiplicity of covariates and models, and their simultaneous coverage. The confidence interval is centered at β̂_M := (X_M^t X_M)^{-1} X_M^t Y, the ordinary least squares estimator of β_M; also, if M = {j_1, ..., j_{|M|}} with j_1 < ... < j_{|M|}, for i ∈ M we denote by i.M the number k ∈ N for which j_k = i, that is, the rank of the i-th element in the subset M. The quantity σ̂^2 is an unbiased estimator of σ^2; more specifically, it is assumed that it is an observable random variable such that σ̂^2 is independent of P_X Y and is distributed as σ^2/r times a chi-square distributed random variable with r degrees of freedom (P_X denoting the orthogonal projection onto the column space of X). We allow for r = ∞, corresponding to σ̂ = σ, i.e., the case of known variance (also called the Gaussian limiting case). In [6], it is assumed that σ̂ exists, and it is shown that this indeed holds in some specific situations. A further analysis of the existence of σ̂ is provided in [1, 2]. The next quantity to define is

    v_{M,i} := ((e_{i.M}^{|M|})^t G_M^{-1} X_M^t)^t ∈ R^n,    (4)

where e_a^b is the a-th base column vector of R^b, and G_M := X_M^t X_M is the |M| × |M| Gram matrix formed from the columns of X_M. Observe that v_{M,i} is nothing more than the row corresponding to covariate i in the estimation matrix G_M^{-1} X_M^t; in other words, (β̂_M)_{i.M} = v_{M,i}^t Y.

Finally, K(X, M, α, r) is called a PoSI constant and we turn to its definition. We shall occasionally write for simplicity K(X, M, α, r) = K(X, M). Furthermore, if the value of r is not specified in K(X, M), it is implicit that r = ∞.

Definition 2.1. Let M ⊂ M_all be such that each M ∈ M is non-empty and X_M has full rank. Let also

    w_{M,i} = v_{M,i}/||v_{M,i}||  if ||v_{M,i}|| ≠ 0;    w_{M,i} = 0 ∈ R^n  otherwise.

Let ξ be a Gaussian vector with zero mean vector and identity covariance matrix on R^n. Let N be a random variable, independent of ξ, and such that rN^2 follows a chi-square distribution with r degrees of freedom. If r = ∞, then we let N = 1. For α ∈ (0, 1), K(X, M, α, r) is defined as the 1 − α quantile of

    γ_{M,r} := (1/N) max_{M ∈ M, i ∈ M} |w_{M,i}^t ξ|.    (5)

We remark that K(X, M, α, r) is the same as in [6]. For j = 1, ..., p, let X_j be the column j of X. We also remark, from [6], that the vector v_{M,i}/||v_{M,i}||^2 in (4) is the residual of the regression of X_i with respect to the variables {j | j ∈ M \ {i}}; in other words, it is the component of the vector X_i orthogonal to Span{X_j | j ∈ M \ {i}}.
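The following short R sketch (not part of the original paper) illustrates Definition 2.1 on a toy design: it enumerates all non-empty models of size at most s, forms the unit vectors w_{M,i}, and approximates K(X, M_s, α, ∞) and E[γ_{M_s,∞}] by Monte Carlo. All object names (w_vectors, K_hat, E_gam) are ours, and the approach is only practical for very small p and s.

## Monte Carlo illustration of Definition 2.1 on a toy design (sketch, base R only)
set.seed(1)
n <- 50; p <- 6; s <- 3; alpha <- 0.05
X <- matrix(rnorm(n * p), n, p)

## unit residual vectors w_{M,i} of Definition 2.1, for all models of size <= s
w_vectors <- function(X, s) {
  p <- ncol(X)
  ws <- list()
  for (size in 1:s) {
    for (M in combn(p, size, simplify = FALSE)) {
      XM <- X[, M, drop = FALSE]
      vM <- XM %*% solve(crossprod(XM))        # columns are the v_{M,i}, i in M
      for (k in seq_along(M)) {
        v <- vM[, k]
        ws[[length(ws) + 1]] <- v / sqrt(sum(v^2))
      }
    }
  }
  do.call(cbind, ws)
}

W <- w_vectors(X, s)
## draws of gamma_{M_s, infinity} = max_i |w_i' xi| with xi ~ N(0, I_n)
gam <- replicate(2000, max(abs(crossprod(W, rnorm(n)))))
K_hat <- quantile(gam, 1 - alpha)   # approximates K(X, M_s, alpha, Inf)
E_gam <- mean(gam)                  # approximates E[gamma_{M_s, infinity}]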
It is shown in [6] that we have, with probability larger than 1 − α,

    ∀M ∈ M, ∀i ∈ M, (β_M)_{i.M} ∈ CI_{i,M}.    (6)

Hence, the PoSI confidence intervals guarantee a simultaneous coverage of all the projection-based regression coefficients, over all models M in the set M.

For a square symmetric non-negative matrix A, we let corr(A) = (diag(A)^†)^{1/2} A (diag(A)^†)^{1/2}, where diag(A) is obtained by setting all the non-diagonal elements of A to zero and where B^† is the Moore-Penrose pseudo-inverse of B. Then we show in the following lemma that K(X, M) depends on X only through corr(X^t X).

Lemma 2.2. Let X and Z be two n × p and m × p matrices satisfying the relation corr(X^t X) = corr(Z^t Z). Then K(X, M, α, r) = K(Z, M, α, r).

2.2. Order of magnitude of the PoSI constant

The confidence intervals in (3) are similar in form to the standard confidence intervals that one would use for a single fixed model M and a fixed i ∈ M. For a standard interval, K(X, M) would be replaced by a standard Gaussian or Student quantile. Of course, the standard intervals do not account for multiplicity and do not have uniform coverage over i ∈ M ∈ M (see [1, 2]). Hence K(X, M) is the inflation factor, or correction over standard intervals, needed to get uniform coverage; it must go to infinity as p → ∞ [6]. Studying the asymptotic order of magnitude of K(X, M) is thus an important problem, as this order of magnitude corresponds to the price one has to pay in order to obtain universally valid post model selection inference.

We now present the existing results on the asymptotic order of magnitude of K(X, M). Let us define

    γ_{M,∞} := max_{M ∈ M, i ∈ M} |w_{M,i}^t ξ|,    (7)

so that γ_{M,r} = γ_{M,∞}/N, where we recall that rN^2 follows a chi-square distribution with r degrees of freedom. We can relate the quantiles of γ_{M,r} (which coincide with the PoSI constants K(X, M)) to the expectation E[γ_{M,∞}] by the following argument based on Gaussian concentration (see Appendix A):

Proposition 2.3. Let T(µ, r, α) denote the α-quantile of a noncentral T distribution with r degrees of freedom and noncentrality parameter µ. Then K(X, M, α, r) ≤ T(E[γ_{M,∞}], r, 1 − α/2).

To be more concrete, we observe that we can get a rough estimate of the latter quantile via

    T(E[γ_{M,∞}], r, 1 − α/2) ≤ (E[γ_{M,∞}] + √(2 log(4/α))) / (1 − 2√(2 log(4/α)/r))_+ ;

furthermore, as r → +∞, this quantile reduces to the (1 − α/2) quantile of a Gaussian distribution with mean E[γ_{M,∞}] and unit variance. The point of the above estimate is that the dependence on the set of models M is only present through E[γ_{M,∞}]. Therefore, we will focus in this paper on the problem of bounding E[γ_{M,∞}], which is nothing more than the Gaussian width [15, Chapter 9] of the set Γ_M = {±w_{M,i} | M ∈ M, i ∈ M}.
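As a small illustration of Proposition 2.3 (ours, not from the paper), the Monte Carlo estimate E_gam from the sketch of Section 2.1 can be turned into an upper bound on the PoSI constant via a noncentral t quantile; the degrees of freedom r below is an illustrative value.

## Proposition 2.3 in code (sketch): K(X, M_s, alpha, r) <= T(E[gamma], r, 1 - alpha/2)
r <- 20
K_bound <- qt(1 - alpha / 2, df = r, ncp = E_gam)
## for r = infinity the bound reduces to the shifted Gaussian quantile:
K_bound_inf <- E_gam + qnorm(1 - alpha / 2)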
When n ≥ p, it is shown in [6] that E[γ_{M,∞}] is no smaller than √(2 log(2p)) and asymptotically no larger than √p. These lower and upper bounds are reached respectively by orthogonal design matrices and equicorrelated design matrices (see [6]).

We now concentrate on s-sparse models. For s ≤ p, let us define M_s = {M | M ⊂ {1, ..., p}, |M| ≤ s}. In this case, using a direct argument based on cardinality, one gets the following generic upper bound (proved in Appendix B).

Lemma 2.4. For any s, n, p ∈ N, with s ≤ n, we have

    E[γ_{M_s,∞}] ≤ √(2s log(6p/s)).    (8)

We remark that an asymptotic version of the bound in Lemma 2.4 (as p and s go to infinity) appears in an intermediary version of [32].

3. Upper bound under RIP conditions

3.1. Main result

We recall the definition and a property of the RIP constant κ(X, s) associated to a design matrix X and a sparsity level s, given in [15, Chap. 6]:

    κ(X, s) = sup_{|M| ≤ s} ||X_M^t X_M − I_{|M|}||_op.    (9)

Letting κ = κ(X, s), we have for any subset M ⊂ {1, ..., p} such that |M| ≤ s:

    ∀β ∈ R^{|M|}, (1 − κ)_+ ||β||^2 ≤ ||X_M β||^2 ≤ (1 + κ) ||β||^2.    (10)

Remark 3.1. The RIP condition may also be stated between norms instead of squared norms in (10). Following [15, Chap. 6] we will consider the formulation in terms of squared norms, which is more convenient here.

Since the PoSI constant K(X, M) only depends on corr(X^t X) (see Lemma 2.2), we shall rather consider the RIP constant associated to corr(X^t X). We let

    δ(X, s) = sup_{|M| ≤ s} ||corr(X_M^t X_M) − I_{|M|}||_op.    (11)

Any upper bound for κ(X, s) yields an upper bound for δ(X, s), as shown in the following lemma.

Lemma 3.2. Let κ = κ(X, s). If κ ∈ [0, 1), then δ(X, s) ≤ 2κ/(1 − κ).

The next theorem is the main result of the paper. It provides a new upper bound on the PoSI constant, under RIP conditions and with sparse submodels. We remark that in this theorem, we do not necessarily assume that n ≥ p.

Theorem 3.3. Let X be an n × p matrix with n, p ∈ N. Let δ = δ(X, s). We have

    E[γ_{M_s,∞}] ≤ √(2 log(2p)) + 2δ (√(1 + δ)/(1 − δ)) √(2s log(6p/s)).

This upper bound is of the form U_RIP(p, s, δ) = U_orth(p) + 2δ c(δ) U_sparse(p, s), where:

• U_orth(p) = √(2 log(2p)) is the upper bound in the orthogonal case;
• U_sparse(p, s) is the right-hand side of (8), corresponding to the cardinality-based upper bound in the sparse case;
• c(δ) = √(1 + δ)/(1 − δ) satisfies: c(δ) ≥ 0, c(δ) → 1 as δ → 0, and c is increasing.

We observe that if δ → 0, our bound U_RIP is o(U_sparse). Moreover, when δ√s √(1 − log s/log p + 1/log p) → 0, then U_RIP is even asymptotically equivalent to U_orth. In particular, this is the case if δ√s → 0.

We now consider the specific case where X is a subgaussian random matrix, that is, X has independent subgaussian entries [15, Definition 9.1]. We discuss in which situations δ = δ(X, s) → 0. The estimate of κ in [15, Theorem 9.2] combined with Lemma 3.2 yields

    δ = O_P(√(s log(ep/s)/n)),    (12)

so that δ → 0 as soon as n/(s log(ep/s)) → +∞.
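The sketch below (our own illustration, reusing the toy X, p, s and alpha from the Section 2 example) computes δ(X, s) by brute-force enumeration of subsets and evaluates the three quantities U_orth, U_sparse and U_RIP appearing in Theorem 3.3; the function names are ours, and the U_RIP formula is only meaningful when the computed δ is smaller than 1.

## Quantities of Theorem 3.3 on a toy design (sketch; brute-force, small s only)
op_norm <- function(A) max(svd(A)$d)

delta_Xs <- function(X, s) {
  Xc <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))  # unit-norm columns
  p <- ncol(X)
  d <- 0
  for (size in 1:s) {
    for (M in combn(p, size, simplify = FALSE)) {
      G <- crossprod(Xc[, M, drop = FALSE])      # corr(X_M' X_M)
      d <- max(d, op_norm(G - diag(size)))
    }
  }
  d
}

U_orth   <- function(p) sqrt(2 * log(2 * p))
U_sparse <- function(p, s) sqrt(2 * s * log(6 * p / s))
U_RIP    <- function(p, s, delta)                # bound on E[gamma_{M_s, infinity}], needs delta < 1
  U_orth(p) + 2 * delta * sqrt(1 + delta) / (1 - delta) * U_sparse(p, s)

delta <- delta_Xs(X, s)
c(delta = delta, RIP_bound = U_RIP(p, s, delta), sparse_bound = U_sparse(p, s))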
3.2. Comparison with upper bounds based on Euclidean norms

We now compare our upper bound in Theorem 3.3 to upper bounds recently and independently obtained in [18]. Recall the notation Y, µ, β_M and β̂_M from Section 2, and let r = ∞ for simplicity of exposition. The authors in [18] address the case where X is random (random design) and consider deviations of β̂_M from β̄_M = E[X_M^t X_M]^{-1} E[X_M^t Y], the population version of the regression coefficients β_M, assuming that the rows of X are independent random vectors in dimension p. They derive uniform bounds over M ∈ M_s for ||β̄_M − β̂_M||_2. They also consider briefly (Remark 4.3 in [18]) the fixed design case with β_M = (X_M^t X_M)^{-1} X_M^t µ, as in the present paper. This target β_M can be interpreted as the random design target conditional on X.

They assume that the individual coordinates of X and Y have exponential moments bounded by a constant independently of n, p (thus their setting is more general than the Gaussian regression setting, but for the purpose of this discussion we assume Gaussian noise). Let us additionally assume that the RIP property κ(X/√n, s) ≤ κ is satisfied (on an event of probability tending to 1) and for κ restricted to a compact subset of [0, 1) independently of n, p; note that we used the rescaling of X by √n, which is natural in the random design case. Then some simple estimates obtained as a consequence of Theorems 3.1 and 4.1 in [18] (whose technical conditions imply a slightly weaker version of the RIP property κ(X/√n, s) ≤ κ < 1) lead to

    sup_{M ∈ M_s} ||β_M − β̂_M||_2 = O_P(σ √(s log(ep/s)/n)),    (13)

as p, n → ∞ and assuming s log^2 p = o(n). On our side, under the same assumptions we have that

    sup_{M ∈ M_s, i ∈ M} ((X_M^t X_M / n)^{-1})_{i.M, i.M}

is bounded on an event of probability tending to 1. This leads to ||v_{M,i}|| = O_P(1/√n) uniformly for all M ∈ M_s, i ∈ M. Hence, from Theorem 3.3, (3) and (6), we obtain

    sup_{M ∈ M_s} ||β_M − β̂_M||_∞ = O_P(σ (√(log(p)/n) + δ √(s log(ep/s)/n))).    (14)

Thus, if δ = Ω(1), since the Euclidean norm upper bounds the supremum norm, the results of [18] imply ours (at least in the sense of these asymptotic considerations). On the other hand, in the case where δ → 0, which is the case we are specifically interested in, we obtain a sharper bound (in the weaker supremum norm). In particular, if X is a subgaussian random matrix (as discussed in the previous section), due to (12) we obtain

    sup_{M ∈ M_s} ||β_M − β̂_M||_∞ = O_P(σ (√(log(p)/n) + s log(ep/s)/n)).    (15)

This improves over the estimate deduced from (13) as soon as s log(ep/s) = o(n), which corresponds to the case where (13) tends to 0. Conversely, in this situation our bound (15) yields, for the Euclidean norm (using ||w||_2 ≤ √(||w||_0) ||w||_∞):

    sup_{M ∈ M_s} ||β_M − β̂_M||_2 = O_P(σ (√(s log(p)/n) + s^{3/2} log(ep/s)/n)).    (16)

Assuming s = O(p^λ) for some λ < 1 for ease of interpretation, we see that (16) is of the same order as (13) when s^2 log(p) = O(n), and is of a strictly larger order otherwise. In this sense, it seems that (14) and (13) are complementary to each other, since we are using a weaker norm but obtain a sharper bound in the case δ → 0.

3.3. Applicability

While the main interest of our results is theoretical, we now discuss the applicability of our bound. For any δ ≥ δ(X, s), Theorem 3.3 combined with Proposition 2.3 provides a bound of the form Ū_RIP(p, s, δ) ≥ K(X, M_s), with

    Ū_RIP(p, s, δ) = T(√(2 log(2p)) + 2δ (√(1 + δ)/(1 − δ)) √(2s log(6p/s)), r, 1 − α/2).

This bound can be used in practice in situations where δ(X, s) (or an upper bound of it) can be computed, whereas K(X, M_s) cannot because the number of inner products in (5) is too large. Indeed, for a given δ, it is immediate to compute Ū_RIP(p, s, δ).

Upper bounding the RIP constant. When n ≥ p, we have δ(X, s) ≤ δ(X, p), and δ(X, p) can be computed in practice for a given X. Specifically, δ(X, p) is the largest eigenvalue of corr(X^t X) − I_p in absolute value. When X is a subgaussian random matrix, δ(X, p) ∼ √(p/n) [3, 24].
Thus, if n is large enough compared to p, the computable upper bound Ū_RIP(p, s, δ(X, p)) will improve on the sparsity-based upper bound Ū_sparse(p, s) = T(√(2s log(6p/s)), r, 1 − α/2) ≥ K(X, M_s); see Proposition 2.3 and Lemma 2.4. On the other hand, when n < p, it is typically too costly to compute δ(X, s) (or an upper bound of it) for a large p. Nevertheless, if one knows that X is a subgaussian random matrix, one can compute an upper bound δ̃ satisfying δ̃ ≥ δ(X, s) with high probability, as in [15, Chapter 9]. We remark that, using the values of δ̃ currently available in the literature, one would need n to be very large for Ū_RIP(p, s, δ̃) to improve on Ū_sparse(p, s).

Alternative upper bound on the PoSI constant. For any δ ≥ δ(X, s), we now show how to compute an alternative bound of the form Ũ_RIP(p, s, δ) ≥ K(X, M_s). Our numerical experiments suggest that this alternative bound is generally sharper than Ū_RIP(p, s, δ). For q, r, ρ ∈ N and ℓ ∈ (0, 1), let B_ℓ(q, r, ρ) be defined as the smallest t > 0 such that

    H_{q,ρ}(t) := E_G[ min(1, ρ (1 − F_{Beta,1/2,(q−1)/2}(t^2/G^2))) ] ≤ ℓ,

where G^2/q follows a Fisher distribution with q and r degrees of freedom, and F_{Beta,a,b} denotes the cumulative distribution function of the Beta(a, b) distribution. In the case r = +∞, B_ℓ is also defined and further described in [2, Section 2.5.2]. It can be seen from the proof of Theorem 3.3 (see specifically (22), which also holds without the expectation operators), and from the arguments in [1], that we have

    K(X, M_s, α) ≤ B_{tα}(n ∧ p, r, p) + 2δ c(δ) B_{(1−t)α}(n ∧ p, r, |M_s|)

for any t ∈ (0, 1). This upper bound can be minimized with respect to t, yielding Ũ_RIP(p, s, δ). The quantity B_ℓ(q, r, ρ) can be easily approximated numerically, as it is simply the quantile of the tail distribution H_{q,ρ}, which only involves standard distributions. Algorithm E.3 in the supplementary materials of [1] can be used to compute B_ℓ(q, r, ρ). An implementation of this algorithm in R [26] is available in Appendix C. Hence, the upper bound Ũ_RIP(p, s, δ) can be computed for large values of p for a given δ.
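As an illustration of this section (our own wrapper code, not from the paper), the two computable ingredients can be sketched as follows: δ(X, p) via an eigendecomposition of corr(X^t X) − I_p when n ≥ p, and Ũ_RIP(p, s, δ) by a grid search over t using the function Bl() reproduced in Appendix C (assumed to be sourced). Counting |M_s| as the number of non-empty models of size at most s is our reading of the definition of M_s.

## Sketch of the computable bounds of Section 3.3 (assumes Bl() from Appendix C)
delta_Xp <- function(X) {
  R <- cov2cor(crossprod(X))            # corr(X^t X), columns assumed non-degenerate
  max(abs(eigen(R - diag(ncol(X)), symmetric = TRUE, only.values = TRUE)$values))
}

U_RIP_tilde <- function(n, p, s, r, delta, alpha,
                        t_grid = seq(0.05, 0.95, by = 0.05)) {
  c_delta  <- sqrt(1 + delta) / (1 - delta)
  n_models <- sum(choose(p, 1:s))       # |M_s|, non-empty models only
  bounds <- sapply(t_grid, function(t)
    Bl(min(n, p), r, p, t * alpha) +
      2 * delta * c_delta * Bl(min(n, p), r, n_models, (1 - t) * alpha))
  min(bounds)                           # minimization over t on the grid
}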
4. Lower bound

4.1. Equi-correlated design matrices

The goal of this section is to find a matching lower bound for Theorem 3.3. For this we extend ideas of [6, Example 6.2] and, following that reference, we restrict our study to design matrices X for which n ≥ p. The lower bound is based on the p × p matrix

    Z^(c,k) = (e_1^p, e_2^p, ..., e_{p−1}^p, x_k(c)),  where  x_k(c) = (c, ..., c, 0, ..., 0, √(1 − kc^2))^t

has its first k entries equal to c, followed by p − 1 − k zeros and a last entry equal to √(1 − kc^2). We assume k < p, and the constant c satisfies c^2 < 1/k, so that Z^(c,k) has full rank. By definition, the correlation between any of the first k columns of Z^(c,k) and the last one is c, and Z^(c,k) restricted to its first p − 1 columns is the identity matrix I_{p−1}. The case where k = p − 1 is studied in [6, Example 6.2]: Theorem 6.2 in [6] implies that the PoSI constant K(X, M), where X is an n × p matrix such that X^t X = (Z^(c,k))^t Z^(c,k), is of the order of √p when k = p − 1 and M = M_all. The Gram matrix of Z^(c,k) is the 3 × 3 block matrix with sizes (k, p − k − 1, 1) × (k, p − k − 1, 1) defined by

    (Z^(c,k))^t Z^(c,k) = [ I_k    [0]          [c] ]
                          [ [0]    I_{p−k−1}    [0] ]    (17)
                          [ [c]    [0]           1  ],

where [a] means that all the entries of the corresponding block are identical to a.

We begin by studying the RIP coefficient δ(X, s) for design matrices X yielding the Gram matrix (17). Since this Gram matrix has full rank p, there exists a design matrix satisfying this condition if and only if n ≥ p.

Lemma 4.1. Let X be an n × p matrix for which X^t X is given by (17) with kc^2 < 1. Then for s ≤ k ≤ p − 1, we have κ(X, s) = δ(X, s) ≤ c√(s − 1).

4.2. A matching lower bound

In the following proposition, we provide a lower bound on K(X, M_s) for matrices X yielding the Gram matrix (17).

Proposition 4.2. For any s ≤ k < p, c^2 < 1/k and α ≤ 1/2, let X be an n × p matrix for which X^t X is given by (17). We have

    K(X, M_s, α, ∞) ≥ A (c(s − 1)/√(1 − (s − 1)c^2)) √(log⌊k/s⌋) − √(2 log 2),

where A > 0 is a universal constant.

From the previous proposition, we now show that the upper bound of Theorem 3.3 is optimal (up to a multiplicative constant) for a large range of behaviors of s and δ relative to p. As discussed after Theorem 3.3, in the case where δ√s √(1 − log s/log p + 1/log p) = O(1), the upper bound we obtain is optimal, since it can be written as O(√(log p)). In the next corollary, we show that the upper bound of Theorem 3.3 is also optimal when δ√s √(1 − log s/log p + 1/log p) tends to +∞, and when δ = O(p^{−λ}) for some λ > 0.

Corollary 4.3 (Optimality of the RIP-PoSI bound). Let (s_p, δ_p)_{p ≥ 0} be sequences of values such that s_p < p, δ_p > 0, δ_p → 0, and satisfying

    lim_{p → ∞} δ_p √(s_p) √(1 − log s_p/log p + 1/log p) = +∞.

Then Theorem 3.3 implies

    sup_{n ∈ N, s ≤ s_p, X ∈ R^{n×p} s.t. δ(X, s) ≤ δ_p} K(X, M_s) ≤ B δ_p √(s_p) √(log(6p/s_p)),    (18)

where B is a constant. Moreover, there exists a sequence of design matrices X_p such that δ(X_p, s_p) ≤ δ_p and

    K(X_p, M_{s_p}) ≥ A δ_p √(s_p) √(log min(1/δ_p^2, ⌊(p − 1)/s_p⌋)),    (19)

where A is a constant.

In particular, if δ_p = O(p^{−λ}) for some λ > 0 and if ⌊(p − 1)/s_p⌋ ≥ 2, then the above upper and lower bounds have the same rate. Therefore, the upper bound in Theorem 3.3 is optimal in most configurations of s_p and δ_p, except if δ_p goes to 0 slower than any inverse power of p.
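A quick numerical check of Lemma 4.1 (our own sketch) consists in building Z^(c,k) explicitly and comparing the brute-force RIP constant with c√(s − 1); it reuses the hypothetical helper delta_Xs() from the Section 3 sketch, and the parameter values below are illustrative only.

## Numerical check of Lemma 4.1 on the equi-correlated design (sketch)
Z_ck <- function(p, k, c) {
  stopifnot(k < p, k * c^2 < 1)
  x <- c(rep(c, k), rep(0, p - 1 - k), sqrt(1 - k * c^2))
  cbind(diag(p)[, 1:(p - 1)], x)        # first p-1 base vectors, then x_k(c)
}

pz <- 12; kz <- 8; sz <- 4; c0 <- 0.2
Z <- Z_ck(pz, kz, c0)
c(delta = delta_Xs(Z, sz),              # delta(Z, s) by enumeration
  lemma_bound = c0 * sqrt(sz - 1))      # bound of Lemma 4.1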
5. Concluding remarks

In this paper, we have proposed an upper bound on PoSI constants in s-sparse situations where the n × p design matrix X satisfies a RIP condition. As the value of the RIP constant δ increases from 0, this upper bound provides an interpolation between the case of an orthogonal X and an existing upper bound only based on sparsity and cardinality. We have shown that our upper bound is asymptotically optimal for many configurations of (s, δ, p) by giving a matching lower bound. In the case of random design matrices with independent entries, since δ decreases with n, our upper bound compares increasingly more favorably to the cardinality-based upper bound as n gets larger. It is also complementary to the bounds recently proposed in [18]. The interest and various applications of the RIP property are well known in the high-dimensional statistics literature, in particular for statistical risk analysis or support recovery. Our analysis highlights an additional interest of the RIP property, namely for agnostic post-selection inference (uncertainty quantification).

The PoSI constant corresponds to confidence intervals on β_M in (2). In Section 3.2 we also mention another target of interest in the case of random X, β̄_M = E[X_M^t X_M]^{-1} E[X_M^t Y]. This quantity depends on the distribution of X rather than on its realization, which is a desirable property as discussed in [1, 18], where the same target has also been considered. In [1], it is shown that valid confidence intervals for β_M are also asymptotically valid for β̄_M, provided that p is fixed. These results require that µ belongs to the column space of X and hold for models M such that µ is close to the column space of X_M. It would be interesting to study whether assuming RIP conditions on X makes it possible to alleviate these assumptions.

The purpose of post-selection inference based on the PoSI constant K(X, M) is to achieve the coverage guarantee (6). The guarantee (6) implies that, for any model selection procedure M̂ : R^n → M, with probability larger than 1 − α, for all i ∈ M̂, (β_{M̂})_{i.M̂} ∈ CI_{i,M̂}. Hence, there is in general no need to make assumptions about the model selection procedure when using PoSI constants. On the other hand, the RIP condition that we study here is naturally associated to specific model selection procedures, namely the lasso or the Dantzig selector [9, 10, 30, 33]. Hence, it is natural to ask whether the results in this paper could help post-selection inference specifically for such procedures. We believe that the answer could be positive in some situations. Indeed, if the lasso model selector is used in conjunction with a design matrix X satisfying a RIP property, then asymptotic guarantees exist on the sparsity of the selected model [8]. Thus, one could investigate the combination of bounds on the size of selected models (of the form |M̂| ≤ S and holding with high probability) with our upper bound, by replacing s by S.

In the case of the lasso model selector, we have referred, in the introduction, to the post-selection intervals achieving conditional coverage [19], specifically for the lasso model selector. These intervals are simple to compute (when the conditioning is on the signs, see [19]). Generally speaking, in comparison with confidence intervals based on PoSI constants, the confidence intervals of [19] have the benefit of guaranteeing a coverage level conditionally on the selected model. On the other hand, the confidence intervals in [19] can be large, and can provide small coverage rates when the regularization parameter of the lasso is data-dependent [1]. It would be interesting to study whether these general conclusions would be modified in the special case of design matrices satisfying RIP properties.

Finally, the focus of this paper is on PoSI constants in the context of linear regression. Recently, [2] extended the PoSI approach to more general settings (for instance generalized linear models), provided a joint asymptotic normality property holds between model-dependent targets and estimators. This extension was suggested in the case of asymptotics for fixed dimension and fixed number of models. In the high-dimensional case, an interesting direction would be to apply the results of [12], which provide Gaussian approximations for maxima of sums of high-dimensional random vectors. This opens the perspective of applying our results to various high-dimensional post model selection settings, beyond linear regression.
Acknowledgements

This work has been supported by ANR-16-CE40-0019 (SansSouci). The second author acknowledges the support of the German DFG, under the Research Unit FOR-1735 "Structural Inference in Statistics - Adaptation and Efficiency", and under the Collaborative Research Center SFB-1294 "Data Assimilation".

Appendix A: Gaussian concentration

To relate the expectation of a supremum of Gaussian variables to its quantiles, we use the following classical Gaussian concentration inequality [13] (see e.g. [16], Section B.2.2, for a short exposition):

Theorem A.1 (Cirel'son, Ibragimov, Sudakov). Assume that F : R^d → R is a 1-Lipschitz function (w.r.t. the Euclidean norm of its input) and Z follows the N(0, σ^2 I_d) distribution. Then there exist two one-dimensional standard Gaussian variables ζ, ζ′ such that

    E[F(Z)] − σ|ζ′| ≤ F(Z) ≤ E[F(Z)] + σ|ζ|.    (20)

It is known that in certain situations one can expect an even tighter concentration, through the phenomenon known as superconcentration [11]. While such situations are likely to be relevant for the setting considered in this paper, we leave such improvements as an open issue for future work.

We use the previous property in our setting as follows:

Proposition A.2. Let C be a finite family of unit vectors of R^n, ξ a standard Gaussian vector in R^n, and N an independent nonnegative random variable such that rN^2 follows a chi-squared distribution with r degrees of freedom. Define the random variable

    γ_{C,r} := (1/N) max_{v ∈ C} |v^t ξ|.

Then the (1 − α) quantile of γ_{C,r} is upper bounded by the (1 − α/2) quantile of a noncentral T distribution with r degrees of freedom and noncentrality parameter E[max_{v ∈ C} |v^t ξ|].

Proof. Observe that ξ ↦ max_{v ∈ C} |v^t ξ| is 1-Lipschitz since the vectors of C are unit vectors. Therefore we conclude by Theorem A.1 that there exists a standard normal variable ζ (which is independent of N since N is independent of ξ) so that the following holds:

    γ_{C,r} ≤ (1/N) (E[max_{v ∈ C} |v^t ξ|] + |ζ|).

We can represent the above right-hand side as max(T_+, T_−), where

    T_± = (1/N) (E[max_{v ∈ C} |v^t ξ|] ± ζ),

i.e. T_+, T_− are two (dependent) random variables with noncentral t distributions with r degrees of freedom and noncentrality parameter E[max_{v ∈ C} |v^t ξ|]. Finally, since

    P[max(T_+, T_−) > t] ≤ P[T_+ > t] + P[T_− > t] = 2 P[T_+ > t],

we obtain the claim.

Since a noncentral T distribution is (stochastically) increasing in its noncentrality parameter, any bound obtained for E[max_{v ∈ C} |v^t ξ|] will result in a corresponding bound on the quantiles of the corresponding noncentral T distribution, and therefore on those of γ_{C,r}. In the limit r → ∞, the quantiles of the noncentral T distribution reduce to those of a shifted Gaussian distribution with unit variance. Here is a naive bound on (some) quantiles of a noncentral T:

Lemma A.3. The 1 − α quantile of a noncentral T distribution with r degrees of freedom and noncentrality parameter µ ≥ 0 is upper bounded by

    (µ + √(2 log(2/α))) / (1 − 2√(2 log(2/α)/r))_+ .

Proof. Let

    T = (µ + ζ) / √(V/r),

where ζ ∼ N(0, 1) and V ∼ χ^2(r). We have (as a consequence of e.g. [7], Lemma 8.1), for any η ∈ (0, 1]:

    P[√V ≤ √r − 2√(2 log η^{−1})] ≤ η,

as well as the classical bound

    P[ζ ≥ √(2 log η^{−1})] ≤ η.

It follows that

    P[T ≥ (µ + √(2 log η^{−1})) / (1 − 2√(2 log(η^{−1})/r))_+] ≤ 2η.

The claimed estimate follows.
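For completeness, a small numerical sanity check of Lemma A.3 (ours, not in the paper) can compare the exact noncentral t quantile, available in R via qt() with an ncp argument, to the naive bound; the values of µ, α and r below are arbitrary illustrations.

## Numerical check of the naive quantile bound of Lemma A.3 (sketch)
naive_bound <- function(mu, r, alpha) {
  (mu + sqrt(2 * log(2 / alpha))) / pmax(1 - 2 * sqrt(2 * log(2 / alpha) / r), 0)
}
mu <- 3; a <- 0.05
for (r in c(50L, 200L, 1000L))
  cat(sprintf("r = %4d: exact = %.3f, naive bound = %.3f\n",
              r, qt(1 - a, df = r, ncp = mu), naive_bound(mu, r, a)))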
Appendix B: Proofs

Proof of Lemma 2.2. With the notation of Definition 2.1, K(X, M, α, r) is the 1 − α quantile of (1/N)||z||_∞, where z = (z_{M,i}, M ∈ M, i ∈ M) is a Gaussian vector, independent of N, with mean vector zero and covariance matrix corr(Σ), where Σ is defined by, for i ∈ M ∈ M and i′ ∈ M′ ∈ M,

    Σ_{(M,i),(M′,i′)} = v_{M,i}^t v_{M′,i′}
                      = (e_{i.M}^{|M|})^t (X_M^t X_M)^{−1} X_M^t X_{M′} (X_{M′}^t X_{M′})^{−1} e_{i′.M′}^{|M′|}.

Hence, Σ depends on X only through X^t X. Also, if X is replaced by XD, where D is a diagonal matrix with positive components, Σ becomes the matrix Λ with, for i ∈ M ∈ M and i′ ∈ M′ ∈ M,

    Λ_{(M,i),(M′,i′)} = (e_{i.M}^{|M|})^t D_{M,M}^{−1} (X_M^t X_M)^{−1} X_M^t X_{M′} (X_{M′}^t X_{M′})^{−1} D_{M′,M′}^{−1} e_{i′.M′}^{|M′|}
                      = D_{i,i}^{−1} D_{i′,i′}^{−1} Σ_{(M,i),(M′,i′)}.

Hence, corr(Σ) = corr(Λ). This shows that corr(Σ) depends on X only through corr(X^t X) (we remark that, because ∪_{M ∈ M} M = {1, ..., p} and each X_M^t X_M is invertible, we have ||X_i|| > 0 for i = 1, ..., p). Hence K(X, M, α, r) depends on X only through corr(X^t X).

Proof of Lemma 2.4. Using a direct cardinality-based bound, we have the well-known inequality E[γ_{M_s,∞}] ≤ √(2 log(2|M_s|)), hence

    E[γ_{M_s,∞}] ≤ √(2 log(2 Σ_{i=1}^s C(p, i))),

where C(p, i) denotes the binomial coefficient. Moreover,

    Σ_{i=1}^s C(p, i) ≤ s max_{1 ≤ i ≤ s} C(p, i) ≤ s (pe/s)^s,

the last inequality being classical and due to

    C(p, i) (s/p)^i ≤ Σ_{i=0}^s C(p, i) (s/p)^i ≤ Σ_{i=0}^p C(p, i) (s/p)^i = (1 + s/p)^p ≤ e^s.

Since log x ≤ x/e for all x > 0, and using e^{1+2/e} ≤ 6, we obtain

    log(2 Σ_{i=1}^s C(p, i)) ≤ log(2s) + s log(pe/s) ≤ s log(e^{1+2/e} p/s) ≤ s log(6p/s),

implying (8).

Proof of Lemma 3.2. Put κ = κ(X, s) < 1. Then ||X_i|| ≥ (1 − κ)^{1/2} for i = 1, ..., p, so that for i ∈ M ∈ M_s, corr(X_M^t X_M) = D_M X_M^t X_M D_M, where D_M is a diagonal |M| × |M| matrix defined by [D_M]_{i.M, i.M} = 1/||X_i||. Hence ||D_M||_op ≤ 1/√(1 − κ). We have, by applications of the triangle inequality and since ||·||_op is a matrix norm,

    ||corr(X_M^t X_M) − I_{|M|}||_op
      = ||(D_M − I_{|M|}) X_M^t X_M D_M + X_M^t X_M (D_M − I_{|M|}) + X_M^t X_M − I_{|M|}||_op
      ≤ ||D_M − I_{|M|}||_op ||X_M^t X_M||_op ||D_M||_op + ||D_M − I_{|M|}||_op ||X_M^t X_M||_op + ||X_M^t X_M − I_{|M|}||_op
      = ||D_M − I_{|M|}||_op ||X_M^t X_M||_op (||D_M||_op + 1) + ||X_M^t X_M − I_{|M|}||_op.    (21)

From (9)-(10), we have for all M ∈ M_s: ||X_M^t X_M||_op ≤ 1 + κ, as well as

    ||D_M − I_{|M|}||_op ≤ max_{i=1,...,p} |1/||X_i|| − 1| ≤ max(1 − 1/√(1 + κ), 1/√(1 − κ) − 1) = 1/√(1 − κ) − 1.

Plugging this into (21), we obtain

    δ(X, s) ≤ (1/√(1 − κ) − 1)(1 + κ)(1/√(1 − κ) + 1) + κ = 2κ/(1 − κ).

Proof of Theorem 3.3. From Lemma 2.2, it is sufficient to treat the case where, for any M, G_M = X_M^t X_M has ones on the diagonal; in that case δ(X, s) = κ(X, s). We have

    v_{M,i}^t = (e_{i.M}^{|M|})^t G_M^{−1} X_M^t
              = (e_{i.M}^{|M|})^t I_{|M|} X_M^t + (e_{i.M}^{|M|})^t (G_M^{−1} − I_{|M|}) X_M^t
              = X_i^t + r_{M,i}^t,

say. We have

    r_{M,i}^t r_{M,i} = (e_{i.M}^{|M|})^t (G_M^{−1} − I_{|M|}) G_M (G_M^{−1} − I_{|M|}) e_{i.M}^{|M|}
                      ≤ ||e_{i.M}^{|M|}||^2 ||G_M^{−1} − I_{|M|}||_op^2 ||G_M||_op.

From (10), the eigenvalues of G_M are all between (1 − δ) and (1 + δ), hence we have

    r_{M,i}^t r_{M,i} ≤ (δ/(1 − δ))^2 (1 + δ),

so that, letting c(δ) = √(1 + δ)/(1 − δ),

    ||r_{M,i}|| ≤ δ c(δ),

and

    ||w_{M,i} − X_i|| = || v_{M,i}/||v_{M,i}|| − X_i || = || (1 − ||v_{M,i}||) v_{M,i}/||v_{M,i}|| + v_{M,i} − X_i || ≤ 2||r_{M,i}||,

from two applications of the triangle inequality, and using that ||X_i|| = 1 since we assumed that G_M has ones on its diagonal for all M.
Hence, we have

    E[γ_{M_s,∞}] = E[ sup_{M ∈ M_s; i ∈ M} |w_{M,i}^t ξ| ]
                 ≤ E[ sup_{M ∈ M_s; i ∈ M} |X_i^t ξ| ] + E[ sup_{M ∈ M_s; i ∈ M} |(w_{M,i} − X_i)^t ξ| ]
                 ≤ E[ sup_{i=1,...,p} |X_i^t ξ| ] + 2δ c(δ) E[ sup_{M ∈ M_s; i ∈ M} |((w_{M,i} − X_i)/||w_{M,i} − X_i||)^t ξ| ]
                 ≤ √(2 log(2p)) + 2δ c(δ) √(2s log(6p/s)),    (22)

where in the last step we have used Lemma 2.4.

Proof of Lemma 4.1. Since ||X_i|| = 1 for i = 1, ..., p, we have corr(X^t X) = X^t X and so κ(X, s) = δ(X, s). The Gram matrix in (17) can be written as I_p + c U_{p,k}, where U_{p,k} is the 3 × 3 block matrix with sizes (k, p − k − 1, 1) × (k, p − k − 1, 1) defined by

    U_{p,k} = [ [0] [0] [1] ]
              [ [0] [0] [0] ]
              [ [1] [0]  0  ].

Consider a model M with |M| = s ≤ k ≤ p − 1, and denote by G_M its Gram matrix. If p ∉ M, then G_M = I_s and ||G_M − I_s||_op = 0. If p ∈ M, then G_M = I_s + c U_{s,m}, where m = m(M) = |(M \ {p}) ∩ {1, ..., k}| ≤ s − 1. The operator norm of G_M − I_s is the square root of the largest eigenvalue of (c U_{s,m})^2, where U_{s,m}^2 is the 3 × 3 block matrix with sizes (m, s − m − 1, 1) × (m, s − m − 1, 1) defined by

    U_{s,m}^2 = [ [1] [0] [0] ]
                [ [0] [0] [0] ]
                [ [0] [0]  m  ].

The first block is an m × m matrix with all entries equal to 1, hence its only non-null eigenvalue is m. This is also the (only) eigenvalue of the last block (a 1 × 1 matrix). Thus, the largest eigenvalue of U_{s,m}^2 is m. Therefore, as m ≤ s − 1, we have ||G_M − I_s||_op ≤ c√(s − 1) for all M such that |M| = s ≤ k ≤ p − 1, which concludes the proof.

Proof of Proposition 4.2. Without loss of generality (by Lemma 2.2), we can assume that X = Z^(c,k), where Z^(c,k) is the p × p matrix defined at the beginning of Section 4.1. The proof is an extension of the proof of [6, Theorem 6.2]. For m ≥ 0, consider a model M such that M ∋ p, M ∩ {k + 1, ..., p − 1} = ∅, and |M| = m + 1; in other words, M = {i_1, ..., i_m, p} such that i_1, ..., i_m are elements of {1, ..., k}. Denote by M^{+p}_{m:k} the set of all such models. Let u_{M,p} = Z_p − P_{M\{p}}(Z_p), where Z_p is the last column of Z^(c,k), and where P_{M\{p}}(Z_p) is the orthogonal projection of Z_p onto the span of the columns with indices in M \ {p}. Observe that the column i_j of Z^(c,k) is the i_j-th base column vector of R^p, which we write e_{i_j}, therefore

    P_{M\{p}}(Z_p) = Σ_{j=1}^m (e_{i_j}^t Z_p) e_{i_j} = c(e_{i_1} + ... + e_{i_m}).

Hence, we have, for M ∈ M^{+p}_{m:k},

    [u_{M,p}]_j = 0              for j = k + 1, ..., p − 1,
                = 0              for j = 1, ..., k with j ∈ M,
                = c              for j = 1, ..., k with j ∉ M,
                = √(1 − kc^2)    for j = p.

Recall that we have w_{M,p} = u_{M,p}/||u_{M,p}||. Hence, for M ∈ M^{+p}_{m:k},

    [w_{M,p}]_j = 0                            for j = k + 1, ..., p − 1,
                = 0                            for j = 1, ..., k with j ∈ M,
                = c/√(1 − mc^2)                for j = 1, ..., k with j ∉ M,
                = √(1 − kc^2)/√(1 − mc^2)      for j = p.

Hence, we have

    E[γ_{M_s,∞}] = E[ max_{|M| ≤ s, i ∈ M} |w_{M,i}^t ξ| ]
                 ≥ E[ max_{M ∈ M^{+p}_{(s−1):k}} w_{M,p}^t ξ ]
                 = E[ (√(1 − kc^2)/√(1 − (s − 1)c^2)) ξ_p + (c/√(1 − (s − 1)c^2)) Σ_{j=1}^{k−s+1} ξ_{k−j+1:k} ],

where ξ_{1:k} ≤ ... ≤ ξ_{k:k} are the order statistics of ξ_1, ..., ξ_k. Hence, since s − 1 < k, we obtain

    E[γ_{M_s,∞}] ≥ 0 + (c/√(1 − (s − 1)c^2)) E[ Σ_{j=1}^k ξ_j − Σ_{j=1}^{s−1} ξ_{j:k} ]
                 = (c/√(1 − (s − 1)c^2)) E[ Σ_{j=1}^{s−1} ξ_{k−j+1:k} ]
                 ≥ (c/√(1 − (s − 1)c^2)) E[ Σ_{j=1}^{s−1} max_{l=1,...,⌊k/s⌋} ξ_{(j−1)⌊k/s⌋+l} ].

In the above display, each maximum has mean value larger than A√(log⌊k/s⌋), with A > 0 a universal constant (see e.g. Lemma A.3 in [11]). Hence, we have
    E[γ_{M_s,∞}] ≥ A (c(s − 1)/√(1 − (s − 1)c^2)) √(log⌊k/s⌋).

Finally, a consequence of Gaussian concentration (Theorem A.1) is that the mean and the median of γ_{M_s,∞} are within √(2 log 2) of each other. Since we assumed α ≤ 1/2,

    K(Z^(c,k), M_s, α, ∞) ≥ E[γ_{M_s,∞}] − √(2 log 2),

which concludes the proof.

Proof of Corollary 4.3. When δ_p √(s_p) √(1 − log s_p/log p + 1/log p) → ∞, one can see that in Theorem 3.3 the first term is negligible compared to the second one. Since δ_p → 0, the first result (18) follows from Theorem 3.3.

We now apply Proposition 4.2 with c_p = δ_p/√(s_p − 1) and k_p = min(p − 1, ⌊1/c_p^2 − 1⌋). From Lemma 4.1, δ(Z^{(c_p,k_p)}, s_p) ≤ c_p √(s_p − 1) = δ_p. We then have, with two positive constants A′ and A,

    K(Z^{(c_p,k_p)}, M_{s_p}, α, ∞) ≥ A′ δ_p √(s_p) √(log(min(p − 1, ⌊1/c_p^2 − 1⌋)/s_p))
                                    ≥ A δ_p √(s_p) √(log min(⌊(p − 1)/s_p⌋, 1/δ_p^2)).

This concludes the proof of (19).

Appendix C: Code for computing B_ℓ(q, r, ρ)

Bl <- function(q, r, rho, l, I = 1000) {
  ##
  ## Compute an upper bound for the quantile 1-l of
  ##   max_{i=1,...,rho} (1/N) | w_i' V |
  ## where:
  ## - the w_1,...,w_{rho} are unit vectors
  ## - V follows N(0, I_q)
  ## - r N^2 follows X^2(r)
  ##
  ## Adapted from K4 in Bachoc, Leeb, Poetscher 2018
  ##
  ## Parameters:
  ## q.......: dimension of the Gaussian vector
  ## r.......: degrees of freedom for the variance estimator
  ## rho.....: number of unit vectors
  ## l.......: type I error rate (1 - confidence level)
  ## I.......: numerical precision
  ##
  ## Value:
  ## A numerical approximation of the upper bound
  ##
  ## vector of quantiles of the Beta distribution:
  vC <- qbeta(p = seq(from = 0, to = 1/rho, length = I),
              shape1 = 1/2, shape2 = (q-1)/2, lower.tail = FALSE)
  ## evaluation of the confidence level for a constant K
  fconfidence <- function(K) {
    prob <- pf(q = K^2/vC/q, df1 = q, df2 = r, lower.tail = FALSE)
    mean(prob) - l
  }
  quant <- qf(p = l, df1 = q, df2 = r, lower.tail = FALSE)
  Kmax <- sqrt(quant) * sqrt(q)
  uniroot(fconfidence, interval = c(1, 2*Kmax))$root
}
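For illustration (the numerical values are ours, not from the paper), a call such as the following returns an upper bound for the 0.95-quantile of max_{i ≤ ρ} |w_i^t V|/N with q = 50, r = 100 and ρ = 10^4 unit vectors:

## Example use of Bl() with illustrative parameter values
Bl(q = 50, r = 100, rho = 1e4, l = 0.05)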
References

[1] F. Bachoc, H. Leeb, and B. M. Pötscher. Valid confidence intervals for post-model-selection predictors. The Annals of Statistics (forthcoming), 2018.
[2] F. Bachoc, D. Preinerstorfer, and L. Steinberger. Uniformly valid confidence intervals post-model-selection. arXiv:1611.01043, 2016.
[3] Z. Bai and J. W. Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010.
[4] A. Belloni, V. Chernozhukov, and C. Hansen. Inference for high-dimensional sparse econometric models. Advances in Economics and Econometrics. 10th World Congress of the Econometric Society, Volume III, pages 245–295, 2011.
[5] A. Belloni, V. Chernozhukov, and C. Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81:608–650, 2014.
[6] R. Berk, L. Brown, A. Buja, K. Zhang, and L. Zhao. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.
[7] L. Birgé. An alternative point of view on Lepski's method. In State of the art in probability and statistics (Leiden, 1999), volume 36 of IMS Lecture Notes Monogr. Ser., pages 113–133. Inst. Math. Statist., 2001.
[8] P. Bühlmann and S. Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
[9] E. Candes, T. Tao, et al. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
[10] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
[11] S. Chatterjee. Superconcentration and related topics. Springer, 2014.
[12] V. Chernozhukov, D. Chetverikov, and K. Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013.
[13] B. S. Cirel'son, I. A. Ibragimov, and V. N. Sudakov. Norm of Gaussian sample functions. In Proceedings of the 3rd Japan-U.S.S.R. Symposium on Probability Theory (Tashkent, 1975), volume 550 of Lecture Notes in Mathematics, pages 20–41. Springer, 1976.
[14] W. Fithian, D. Sun, and J. Taylor. Optimal inference after model selection. arXiv:1410.2597, 2015.
[15] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing. Basel: Birkhäuser, 2013.
[16] C. Giraud. Introduction to high-dimensional statistics, volume 139 of Monographs on Statistics and Applied Probability. CRC Press, 2015.
[17] P. Kabaila and H. Leeb. On the large-sample minimal coverage probability of confidence intervals after model selection. Journal of the American Statistical Association, 101:619–629, 2006.
[18] A. K. Kuchibhotla, L. D. Brown, A. Buja, E. I. George, and L. Zhao. A model free perspective for linear regression: Uniform-in-model bounds for post selection inference. arXiv preprint arXiv:1802.05801, 2018.
[19] J. D. Lee, D. L. Sun, Y. Sun, and J. E. Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
[20] J. D. Lee and J. E. Taylor. Exact post model selection inference for marginal screening. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 136–144. Curran Associates, Inc., 2014.
[21] H. Leeb and B. M. Pötscher. Model selection and inference: Facts and fiction. Econometric Theory, 21:21–59, 2005.
[22] H. Leeb and B. M. Pötscher. Performance limits for estimators of the risk or distribution of shrinkage-type estimators, and some general lower risk-bound results. Econometric Theory, 22:69–97, 2006.
[23] H. Leeb and B. M. Pötscher. Model selection. In T. G. Andersen, R. A. Davis, J.-P. Kreiß, and T. Mikosch, editors, Handbook of Financial Time Series, pages 785–821, New York, NY, 2008. Springer.
[24] V. A. Marčenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.
[25] B. M. Pötscher. Confidence sets based on sparse estimators are necessarily large. Sankhya, 71:1–18, 2009.
[26] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018.
[27] R. J. Tibshirani, A. Rinaldo, R. Tibshirani, and L. Wasserman. Uniform asymptotic inference and the bootstrap after model selection. The Annals of Statistics, forthcoming, 2015.
[28] R. J. Tibshirani, J. Taylor, R. Lockhart, and R. Tibshirani. Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620, 2016.
[29] S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42:1166–1202, 2014.
[30] S. A. Van De Geer, P. Bühlmann, et al. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
[31] C.-H. Zhang and S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society B, 76:217–242, 2014.
[32] K. Zhang. Spherical cap packing asymptotics and rank-extreme detection. IEEE Transactions on Information Theory, 63(7), 2017.
[33] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov):2541–2563, 2006.