
High-Dimensional Regression with Unknown Variance

2012, Statistical Science

We review recent results for high-dimensional sparse linear regression in the practical case of unknown variance. Different sparsity settings are covered, including coordinate-sparsity, group-sparsity and variation-sparsity. The emphasis is put on nonasymptotic analyses and feasible procedures. In addition, a small numerical study compares the practical performance of three schemes for tuning the lasso estimator, and references are collected for more general models, including multivariate regression and nonparametric regression.

Statistical Science 2012, Vol. 27, No. 4, 500–518. DOI: 10.1214/12-STS398. © Institute of Mathematical Statistics, 2012.

High-Dimensional Regression with Unknown Variance

Christophe Giraud, Sylvie Huet and Nicolas Verzelen

Christophe Giraud is Professor, CMAP, UMR CNRS 7641, Ecole Polytechnique, Route de Saclay, 91128 Palaiseau Cedex, France (e-mail: [email protected]). Sylvie Huet is Research Scientist, UR341 MIA, INRA, F-78350 Jouy-en-Josas, France (e-mail: [email protected]). Nicolas Verzelen is Research Scientist, UMR729 MISTEA, INRA, F-34060 Montpellier, France (e-mail: [email protected]).

Key words and phrases: Linear regression, high-dimension, unknown variance.

1. INTRODUCTION

In the present paper, we mainly focus on the linear regression model

(1) Y = Xβ0 + ε,

where Y is an n-dimensional response vector, X is a fixed n × p design matrix, and the vector ε is made of n i.i.d. Gaussian random variables with N(0, σ²) distribution. In the sequel, X(i) stands for the ith row of X. Our interest is in the high-dimensional setting, where the dimension p of the unknown parameter β0 is large, possibly larger than n.

The analysis of the high-dimensional linear regression model has attracted a lot of attention in the last decade. Nevertheless, there is a longstanding gap between the theory, where the variance σ² is generally assumed to be known, and the practice, where it is often unknown. The present paper is mainly devoted to reviewing recent results on linear regression in high-dimensional settings with unknown variance σ². A few additional results for multivariate regression and for the nonparametric regression model

(2) Yi = f(X(i)) + εi,  i = 1, …, n,

will also be mentioned.

1.1 Sparsity Assumptions

In a high-dimensional linear regression model, accurate estimation is unfeasible unless it relies on some special properties of the parameter β0. The most common assumption on β0 is that it is sparse in some sense. We will consider in this paper the three following classical sparsity assumptions.

Coordinate-sparsity. Most of the coordinates of β0 are assumed to be zero (or approximately zero). This is the most common meaning of sparsity in linear regression.

Structured-sparsity. The pattern of zeros of the coordinates of β0 is assumed to have an a priori known structure. For instance, in group-sparsity [80], the covariates are clustered into M groups, and when the coefficient β0,i corresponding to the covariate Xi (the ith column of X) is nonzero, then it is likely that all the coefficients β0,j with variables Xj in the same cluster as Xi are nonzero.

Variation-sparsity. The (p−1)-dimensional vector β0^V of variations of β0 is defined by β0,j^V = β0,j+1 − β0,j. Sparsity in variation means that most of the components of β0^V are equal to zero (or approximately zero). When p = n and X = I_n, variation-sparse linear regression corresponds to signal segmentation.
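To fix ideas, the coordinate-sparse setting can be simulated directly from model (1). The short R sketch below (R being the software used for the numerical experiments of Section 4.5) generates a design with normalized columns, a k-sparse vector β0 and a response Y; the values of n, p, k and the signal strength are illustrative choices, not quantities taken from the paper.

# Simulate the coordinate-sparse linear model Y = X beta0 + eps of Section 1.1
set.seed(1)
n <- 100; p <- 500; k <- 5              # sample size, dimension, sparsity (illustrative)
sigma <- 1                              # noise level, unknown to the statistician
X <- matrix(rnorm(n * p), n, p)
X <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))  # columns normalized to norm 1
beta0 <- c(rep(2, k), rep(0, p - k))    # k nonzero coordinates
Y <- as.vector(X %*% beta0 + sigma * rnorm(n))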
1.2 Statistical Objectives

In the linear regression model, there are roughly two kinds of estimation objectives. In the prediction problem, the goal is to estimate Xβ0, whereas in the inverse problem it is to estimate β0. When the vector β0 is sparse, a related objective is to estimate the support of β0 (model identification problem), which is the set of indices j corresponding to the nonzero coefficients β0,j. Inverse problems and prediction problems are not equivalent in general. When the Gram matrix of the design is poorly conditioned, the former can be much more difficult than the latter. Since there are only a few results on inverse problems with unknown variance, we will focus on the prediction problem, the support estimation problem being briefly discussed in the course of the paper.

In the sequel, E_{β0}[·] stands for the expectation with respect to Y ∼ N(Xβ0, σ²I_n), and ‖·‖₂ is the Euclidean norm. The prediction objective amounts to building estimators β̂ so that the risk

(3) R[β̂; β0] := E_{β0}[ ‖X(β̂ − β0)‖²₂ ]

is as small as possible.

1.3 Approaches

Most procedures that handle high-dimensional linear models [22, 26, 62, 72, 73, 81, 83, 85] rely on tuning parameters whose optimal value depends on σ. For example, the results of Bickel et al. [17] suggest choosing the tuning parameter λ of the lasso of the order of 2σ√(2 log(p)). As a consequence, all these procedures cannot be directly applied when σ² is unknown. A straightforward approach is to replace σ² by an estimate of the variance in the optimal value of the tuning parameter(s). Nevertheless, the variance σ² is difficult to estimate in high-dimensional settings (see Proposition 2.3 below), so a plug-in of the variance does not necessarily yield good results. There are basically two approaches to build on this body of work on high-dimensional estimation with known variance.

(1) Ad hoc estimation. There has been some recent work [16, 68, 71] to modify procedures like the lasso in such a way that the tuning parameter no longer depends on σ²; see Section 4.2. The challenge is to find a smart modification of the procedure, so that the resulting estimator β̂ is computationally feasible and has a risk R[β̂; β0] as small as possible.

(2) Estimator selection. Given a collection (β̂λ)_{λ∈Λ} of estimators, the objective of estimator selection is to pick an index λ̂ such that the risk of β̂_λ̂ is as small as possible, ideally as small as the risk R[β̂_{λ*}; β0] of the so-called oracle estimator

(4) β̂_{λ*} := argmin_{β̂λ, λ∈Λ} R[β̂λ; β0].

Efficient estimator selection procedures can then be applied to tune the aforementioned estimation methods [22, 26, 62, 72, 73, 81, 83, 85]. Among the most famous methods for estimator selection, we mention V-fold cross-validation (Geisser [32]) and the AIC (Akaike [1]) and BIC (Schwarz [64]) criteria.

The objective of this survey is to describe state-of-the-art procedures for high-dimensional linear regression with unknown variance. We will review both automatic tuning methods and ad hoc methods. There are some procedures that we will leave aside. For example, Baraud [11] provides a versatile estimator selection scheme, but the procedure is computationally intractable in large dimensions. Linear or convex aggregation of estimators is also a valuable alternative to estimator selection when the goal is to perform estimation, but only a few theoretical works have addressed the aggregation problem when the variance is unknown [33, 34].
For these reasons, we will not review these approaches in the sequel. 1.4 Why Care about Nonasymptotic Analyses? AIC [1], BIC [64] and V -fold cross-validation [32] are probably the most popular criteria for estimator selection. The use of these criteria relies on some classical asymptotic optimality results. These results focus on the setting where the collection of estimators (βλ )λ∈ and the dimension p are fixed and consider the limit behavior of the criteria when the sample size n goes to infinity. For example, under some suitable conditions, Shibata [67], Li [53] and Shao [66] prove that the risk of the estimator selected by AIC or V -fold CV (with V = Vn → ∞) is asymptotically equivalent to the oracle risk R[βλ∗ ; β0 ]. Similarly, Nishii [59] shows that the BIC criterion is consistent for model selection. All these asymptotic results can lead to misleading conclusions in modern statistical settings where the sample size remains small and the parameter’s dimension becomes large. For instance it is proved in [12], Section 3.3.2, and illustrated in [12], Section 6.2, that BIC (and thus AIC) can strongly overfit and should not be used for p larger than n. Additional examples are provided in Appendix A. A nonasymptotic analysis takes into account all the characteristics of the selection problem (sample size n, parameter dimension p, number of models per dimension, design X, etc.). It treats n 502 C. GIRAUD, S. HUET AND N. VERZELEN and p as they are, and it highlights important features hidden by the asymptotic theory. For these reasons, we will restrict this review to nonasymptotic results. 1.5 Organization of the Paper In Section 2, we investigate how the ignorance of the variance affects the minimax risk bounds. In Section 3, some “generic” schemes for selecting estimators are presented. The coordinate-sparse setting is addressed in Section 4 where some theoretical results are collected, and a small numerical experiment compares different lasso-based procedures. The group-sparse and variation-sparse settings are reviewed in Sections 5 and 6, and Section 7 is devoted to some more general models such as multivariate regression or nonparametric regression. In the sequel, C, C1 , . . . refer to numerical constants whose value may vary from line to line, while β0 stands for the number of nonzero components of β and |J | for the cardinality of a set J . 2. THEORETICAL LIMITS The goal of this section is to address the intrinsic difficulty of a coordinate-sparse linear regression problem. We will answer the following questions: Which range of p can we reasonably consider? When the variance is unknown, can we hope to do as well as when the variance is known? In practice, the number of nonzero coordinates of β0 is unknown. The fact that an estimator β is minimax over B[k, p] for some specific k > 0 does not imply that β performs well when β0 has a number of nonzero components different from k. Indeed, β can be strongly biased when β0 has more than k nonzero components or the variance of β can be too large compared to R[k0 , X] when β0 has k0 nonzero components with k0 less than k. A good estimation procedure β should not require the knowledge of the sparsity k of β0 and should perform as well as if this sparsity were known. An estimator β that achieves [up to a possible multiplicative constant C(X)] the minimax risk over B[k, p] for a range of k is said to be adaptive to the sparsity. 
Similarly, an estimator β̂ is adaptive to the variance σ² if it does not require the knowledge of σ² and nearly achieves the minimax risk for all σ² > 0. When possible, the main challenge is to build adaptive procedures.

In the following subsections, we review sharp bounds on the minimax prediction risks for both known and unknown sparsity, known and unknown variance. The big picture is summed up in Figure 1. Roughly, it says that adaptation is possible as long as 2k log(p/k) < n. In contrast, the situation becomes more complex for the ultra-high-dimensional¹ setting where 2k log(p/k) ≥ n. The rest of this section is devoted to explaining this big picture.

¹ In some papers, the expression ultra-high-dimensional has been used to characterize problems such that log(p) = O(n^θ) with θ < 1. We argue here that as soon as k log(p)/n goes to 0, the case log(p) = O(n^θ) is not intrinsically more difficult than conditions such as p = O(n^δ) with δ > 0.

2.1 Minimax Adaptation

A classical way to assess the performance of an estimator β̂ is to measure its maximal risk over a class B ⊂ R^p. This is the minimax point of view. As we are interested in coordinate-sparsity for β0, we will consider the sets B[k, p] of vectors that contain at most k nonzero coordinates, for some k > 0. Given an estimator β̂, its maximal prediction risk over B[k, p] for a fixed design X and a variance σ² is defined by sup_{β0∈B[k,p]} R[β̂; β0], where the risk function R[·; β0] is defined by (3). Taking the infimum of the maximal risk over all possible estimators β̂, we obtain the minimax risk

(5) R[k, X] = inf_{β̂} sup_{β0∈B[k,p]} R[β̂; β0].

Minimax bounds are convenient results to assess the range of problems that are statistically feasible and the optimality of particular procedures. Below, we say that an estimator β̂ is "minimax" over B[k, p] if its maximal prediction risk equals [up to a possible multiplicative constant C(X)] the minimax risk.

FIG. 1. Minimal prediction risk over B[k, p] as a function of k.

2.2 Minimax Risks under Known Sparsity and Known Variance

The minimax risk R[k, X] depends on the form of the design X. In order to grasp this dependency, we define, for any k > 0, the largest and the smallest sparse eigenvalues of order k of X*X by

Φ_{k,+}(X) := sup_{β∈B[k,p]\{0_p}} ‖Xβ‖²₂ / ‖β‖²₂  and  Φ_{k,−}(X) := inf_{β∈B[k,p]\{0_p}} ‖Xβ‖²₂ / ‖β‖²₂.

PROPOSITION 2.1. Assume that k and σ are known. There exist positive numerical constants C1, C1′, C2 and C2′ such that the following holds. For any (k, n, p) such that k ≤ n/2 and any design X, we have

(6) C1 [Φ_{2k,−}(X)/Φ_{2k,+}(X)] k log(p/k) σ² ≤ R[k, X] ≤ C1′ [k log(p/k) ∧ n] σ².

For any (k, n, p) such that k ≤ n/2, we have

(7) C2 [k log(p/k) ∧ n] σ² ≤ sup_X R[k, X] ≤ C2′ [k log(p/k) ∧ n] σ².

The minimax lower bound (6) was first proved in [61, 62, 79], while (7) is stated in [77]. Let us first comment on bound (7). If the vector β0 has k nonzero components, and if these components are a priori known, then one may build estimators that achieve a risk bound of the order of kσ². In a (non-ultra) high-dimensional setting [2k log(p/k) ≤ n], the minimax risk is of the order of k log(p/k)σ². The logarithmic term is the price to pay to cope with the fact that we do not know the positions of the nonzero components in β0. The situation is quite different in an ultra-high-dimensional setting [2k log(p/k) > n].
Indeed, the minimax risk remains of the order of nσ 2 , which corresponds to the minimax risk of estimation of the vector Xβ0 without any sparsity assumption; see the blue curve in Figure 1. In other terms, the sparsity index k does not play a role anymore. Dependency of R[k, X] on the design X. It follows from (6) that supX R[k, X] is nearly achieved by designs X satisfying Φ2k,− (X)/Φ2k,+ (X) ≈ 1, when the setting is not ultra-high-dimensional. For some designs such that Φ2k,− (X)/Φ2k,+ (X) is small, the minimax prediction risk R[k, X] is possibly faster; see [77] for a discussion. In an ultra-high-dimensional setting, the form of the minimax risk (nσ 2 ) is related to the fact that no designs can satisfy Φ2k,− (X)/Φ2k,+ (X) ≈ 1; see, for example, [10]. More precisely, the lower bound (6) enforces the following geometrical constrain: Φ2k,− (X) n ≤C Φ2k,+ (X) k log(p/k) for any design X. The lower bound R[k, X] ≥ C[k · log(p/k) ∧ n]σ 2 in (7) is, for instance, achieved by realizations of a standard Gaussian design, that is, designs X whose components follow independent standard normal distributions. See [77] for more details. 2.3 Adaptation to the Sparsity and to the Variance Adaptation to the sparsity when the variance is known. When σ 2 is known, there exist both model selection and aggregation procedures that achieve this [k log(p/k)∧n]σ 2 risk simultaneously for all k and for all designs X. Such procedures derive from the work of Birgé and Massart [18] and Leung and Barron [52]. However, these methods are intractable for large p except for specific forms of the design. We refer to the supplementary material [38] for more details. Simultaneous adaptation to the sparsity and the variance. We first restrict to the nonultra high-dimensional setting, where the number of nonzero components k is unknown but satisfies 2k log(p/k) < n. In this setting, some procedures based on penalized log-likelihood [12] are simultaneous adaptive to the unknown sparsity and to the unknown variance and this for all designs X. Again such procedures are intractable for large p. See the supplementary material [38] for more details. If we want to cover all k (including ultra-high-dimensional settings), the situation is different as shown in the next proposition (from [77]). P ROPOSITION 2.2 (Simultaneous adaptation is impossible). There exist positive constants C, C ′ , C1 , C2 , C3 , C1′ , C2′ and C3′ , such that the following holds. Consider any p ≥ n ≥ C and k ≤ p 1/3 ∧ n/2 such that k log(p/k) ≥ C ′ n. There exist designs X of size n × p  we have either such that for any estimator β,  0p ] R[β; > C1 n σ2 σ 2 >0 sup  β0 ] R[β; σ2 β0 ∈B[k,p],σ 2 >0 or sup > C2 k log      k p p exp C3 log k n k . 504 C. GIRAUD, S. HUET AND N. VERZELEN Conversely, there exist two estimators β(n) and βBGH (defined in the supplementary material [38]) that respectively satisfy sup X sup X R[β(n) ; β0 ] ≤ C1′ n, 2 σ p 2 β0 ∈R ,σ >0 sup R[βBGH ; β0 ] σ2 β0 ∈B[k,p],σ 2 >0 ≤ C2′ k log    p k p exp C3′ log k n k for all 1 ≤ k ≤ [(n − 1) ∧ p]/4. As a consequence, simultaneous adaptation to the sparsity and to the variance is impossible in an ultrahigh-dimensional setting. Indeed, any estimator β that does not rely on σ 2 has to pay at least one of these two prices: (1) The estimator β does not use the sparsity of the true parameter β0 , and its risk for estimating X0p is of the same order as the minimax risk over Rn . (2) For any 1 ≤ k ≤ p 1/3 , the risk of β fulfills  β0 ] R[β; sup sup σ2 σ >0 β0 ∈B[k,p]  k ≥ C1 k log(p) exp C2 log(p) . 
n It follows that the maximal risk of β increases exponentially fast in an ultra-high-dimensional setting (red curve in Figure 1), while the minimax risk is stuck to n (blue curve in Figure 1). The designs that satisfy the minimax lower bounds of Proposition 2.2 include realizations of a standard Gaussian design. In an ultra-high-dimensional setting, the prediction problem becomes extremely difficult under unknown variance because the variance estimation itself is inconsistent as shown in the next proposition (from [77]). P ROPOSITION 2.3. There exist positive constants C, C1 and C2 such that the following holds. Assume that p ≥ n ≥ C. For any 1 ≤ k ≤ p 1/3 , there exist designs X such that sup  σ σ >0,β ∈B[k,p] 0 E β0    σ2 σ2  −  σ2 σ2   β0 ] ≤ C1 β0 0 log(p)σ 2 , R[β; for any σ > 0 and any vector β0 ∈ Rp such that 1 ≤ β0 0 log(p) ≤ C2 n. Such risk bounds prove that the estimator is approximately [up to a possible log(β0 0 ) additional term] minimax adaptive to the unknown variance and the unknown sparsity. The condition β0 0 log(p) ≤ C2 n ensures that the setting is not ultra-high-dimensional. As stated above, some procedures achieve (8) for all designs X, but they are intractable for large p; see [38]. One purpose of this review is to present fast procedures that achieve these kind of bounds under possible restrictive assumptions on the design matrix X. For some procedures, (8) can be improved into a bound of the form (9)  β0 ] R[β; ≤ C1 inf {X(β − β0 )22 + β0 log(p)σ 2 }, β=0  inf Let us consider an estimator β that does not depend on σ 2 . Relying on the previous minimax bounds, we will say that β achieves an optimal risk bound (with respect to the sparsity) if (8) sup   2.4 What Should We Expect from a Good Estimation Procedure?    k k p p ≥ C1 log exp C2 log n k n k . with C1 close to one. Again, the dimension β0 0 is restricted to be smaller than Cn/ log(p) to ensure that the setting is not ultra-high-dimensional. This kind of bound makes a clear trade-off between a bias and a variance term. For instance, when β0 contains many components that are nearly equal to zero, the bound (9) can be much smaller than (8). 2.5 Other Statistical Problems in an Ultra-High-Dimensional Setting We have seen that adaptation becomes impossible for the prediction problem in an ultra-high-dimensional setting. For other statistical problems, including the prediction problem with random design, the inverse problem (estimation of β0 ), the variable selection problem (estimation of the support of β0 ), the dimension reduction problem [47, 77, 78], the minimax risks increase exponentially fast in an ultra-high-dimensional setting. This kind of phase transition has been observed in a wide range of random geometry problems [28], suggesting some universality in this limitation. In practice, the sparsity index k is not known, but given (n, p), we can compute k ∗ := max{k : 2k log(p/k) ≥ n}. One may interpret that the problem is still reasonably difficult as long as k ≤ k ∗ . This gives a simple rule of 505 UNKNOWN VARIANCE thumb to know what we can hope from a given regression problem. For example, setting p = 5000 and n = 50 leads to k ∗ = 3, implying that the prediction problem becomes extremely difficult when there are more than 4 relevant covariates; see the simulations in [77]. 3. SOME GENERIC SELECTION SCHEMES Among the selection schemes not requiring the knowledge of the variance σ 2 , some are very specific to a particular algorithm, while some others are more generic. 
We describe in this section three versatile selection principles and refer to the examples for the more specific schemes.

3.1 Cross-Validation Procedures

The cross-validation schemes are nearly universal in the sense that they can be implemented in most statistical frameworks and for most estimation procedures. The principle of the cross-validation schemes is to split the data into a training set and a validation set: the estimators are built on the training set, and the validation set is used for estimating their prediction risk. This training/validation splitting is eventually repeated several times. The most popular cross-validation schemes are:

• Hold-out [27, 57], which is based on a single split of the data for training and validation.
• V-fold CV [32]. The data is split into V subsamples. Each subsample is successively removed for validation, the remaining data being used for training.
• Leave-one-out [69], which corresponds to n-fold CV.
• Leave-q-out (also called delete-q-CV) [65], where every possible subset of cardinality q of the data is removed for validation, the remaining data being used for training.

We refer to Arlot and Célisse [5] for a review of the cross-validation schemes and their theoretical properties.

3.2 Penalized Empirical Loss

Penalized empirical loss criteria form another class of versatile selection schemes, yet less universal than CV procedures. The principle is to select among a family (β̂λ)_{λ∈Λ} of estimators by minimizing a criterion of the generic form

(10) Crit(λ) = L_X(Y, β̂λ) + pen(λ),

where L_X(Y, β̂λ) is a measure of the distance between Y and Xβ̂λ, and pen is a function from Λ to R+. The penalty function sometimes depends on the data.

Penalized log-likelihood. The most famous criteria of the form (10) are AIC and BIC. They have been designed to select among estimators β̂λ obtained by maximizing the likelihood of (β, σ) with the constraint that β lies in a linear space Sλ (called a model). In the Gaussian case, these estimators are given by Xβ̂λ = Π_{Sλ} Y, where Π_{Sλ} denotes the orthogonal projector onto the model Sλ. For AIC and BIC, the function L_X corresponds to twice the negative log-likelihood, L_X(Y, β̂λ) = n log(‖Y − Xβ̂λ‖²₂), and the penalties are pen(λ) = 2 dim(Sλ) and pen(λ) = dim(Sλ) log(n), respectively. We recall that these two criteria can perform very poorly in a high-dimensional setting. In the same setting, Baraud et al. [12] propose alternative penalties built from a nonasymptotic perspective. The resulting criterion can handle the high-dimensional setting where p is possibly larger than n, and the risk of the selection procedure is controlled by a bound of the form (9); see Theorem 2 in [12].

Plug-in criteria. Many other penalized-empirical-loss criteria have been developed in the last decades. Several selection criteria [14, 18] have been designed from a nonasymptotic point of view to handle the case where the variance is known. These criteria usually involve the residual least-squares L_X(Y, β̂λ) = ‖Y − Xβ̂λ‖²₂ and a penalty pen(λ) depending on the variance σ². A common practice is then to plug an estimate σ̂² of the variance into the penalty in place of the true variance. For linear regression, when the design matrix X has rank less than n, a classical choice for σ̂² is

σ̂² = ‖Y − Π_X Y‖²₂ / (n − rank(X)),

with Π_X the orthogonal projector onto the range of X. In the Gaussian setting, this estimator σ̂² has the nice feature of being independent of Π_X Y, on which the estimators β̂λ usually rely. Nevertheless, the variance of σ̂² is of order σ⁴/(n − rank(X)), which is small only when the sample size n is quite large compared to the rank of X. This situation is unfortunately not likely to happen in a high-dimensional setting where p can be larger than n.
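By contrast, the cross-validation schemes of Section 3.1 do not require any estimate of σ². As an illustration of the splitting principle, the sketch below implements a plain V-fold cross-validation loop in base R for a generic family of estimators; fit_fun, the quadratic validation loss and V = 10 are illustrative placeholders rather than the exact procedures compared later in this review.

# Generic V-fold cross-validation over a grid of tuning parameters.
# fit_fun(X, Y, lambda) is assumed to return a coefficient vector; it stands for
# any of the estimation procedures discussed in this survey.
vfold_cv <- function(X, Y, lambdas, fit_fun, V = 10) {
  n <- nrow(X)
  folds <- sample(rep(seq_len(V), length.out = n))        # random partition into V folds
  cv_err <- sapply(lambdas, function(lambda) {
    mean(sapply(seq_len(V), function(v) {
      train <- folds != v
      beta_hat <- fit_fun(X[train, , drop = FALSE], Y[train], lambda)
      mean((Y[!train] - X[!train, , drop = FALSE] %*% beta_hat)^2)  # validation error
    }))
  })
  lambdas[which.min(cv_err)]                               # selected tuning parameter
}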
3.3 Approximation Versus Complexity Penalization: LinSelect

The criterion proposed by Baraud et al. [12] can handle high-dimensional settings, but it suffers from two rigidities. First, it can only handle fixed collections of models (Sλ)_{λ∈Λ}. In some situations, the size of Λ is huge (e.g., for complete variable selection) and the estimation procedure can then be computationally intractable. In this case, we may want to work with a subcollection of models (Sλ)_{λ∈Λ̂}, where Λ̂ ⊂ Λ may depend on the data. For example, for complete variable selection, the subset Λ̂ could be generated by efficient algorithms like Lars [30]. The second rigidity of the procedure of Baraud et al. [12] is that it can only handle constrained-maximum-likelihood estimators. This procedure then does not help for selecting among arbitrary estimators such as the lasso or elastic-net.

These two rigidities have been addressed recently by Baraud et al. [13]. They propose a selection procedure, LinSelect, which can handle both data-dependent collections of models and arbitrary estimators β̂λ. The procedure is based on a collection S of linear spaces which gives a collection of possible "approximative" supports for the estimators (Xβ̂λ)_{λ∈Λ}. A measure of complexity on S is provided by a weight function Δ : S → R+. We refer to Sections 4.1 and 5 for examples of the collection S and the weight Δ in the context of coordinate-sparse and group-sparse regression.

We present below a simplified version of the LinSelect procedure. For a suitable, possibly data-dependent, subset Ŝ ⊂ S (depending on the statistical problem), the estimator β̂_λ̂ is selected by minimizing the criterion

(11) Crit(β̂λ) = inf_{S∈Ŝ} { ‖Y − Π_S Xβ̂λ‖²₂ + (1/2)‖Xβ̂λ − Π_S Xβ̂λ‖²₂ + pen(S) σ̂²_S },

where Π_S is the orthogonal projector onto S,

σ̂²_S = ‖Y − Π_S Y‖²₂ / (n − dim(S)),

and pen(S) is a penalty depending on Δ. In the cases we will consider here, the penalty pen(S) is roughly of the order of Δ(S), and therefore it penalizes S according to its complexity. We refer to Appendix B for a precise definition of this penalty and more details on its characteristics. We emphasize that criterion (11) and the family of estimators {β̂λ, λ ∈ Λ} are based on the same data Y and X. In other words, there is no data-splitting occurring in the LinSelect procedure. The first term in (11) quantifies the fit of the projected estimator to the data, the second term evaluates the approximation quality of the space S and the last term penalizes S according to its complexity. We refer to Proposition B.1 in Appendix B and Theorem 1 in [12] for risk bounds on the selected estimator. Instantiations of the procedure and more specific risk bounds are given in Sections 4 and 5 in the context of coordinate-sparsity and group-sparsity.

From a computational point of view, the algorithmic complexity of LinSelect is at most proportional to |Λ| × |Ŝ|, and in many cases there is no need to scan the whole set Ŝ for each λ ∈ Λ to minimize (11). In the examples of Sections 4 and 5, the whole procedure is computationally less intensive than V-fold CV; see Table 3. Finally, we mention that for the constrained least-squares estimators Xβ̂λ = Π_{Sλ} Y, the LinSelect procedure with Ŝ = {Sλ : λ ∈ Λ} simply coincides with the procedure of Baraud et al. [12].
4. COORDINATE-SPARSITY

In this section, we focus on the high-dimensional linear regression model Y = Xβ0 + ε where the vector β0 itself is assumed to be sparse. This setting has attracted a lot of attention in the last decade, and many estimation procedures have been developed. Most of them require the choice of tuning parameters which depend on the unknown variance σ². This is, for instance, the case for the lasso [24, 72], the Dantzig Selector [22], the Elastic Net [85], MC+ [81], aggregation techniques [21, 26], etc. We first discuss how the generic schemes introduced in the previous section can be instantiated for tuning these procedures and for selecting among them. Then, we pay special attention to the calibration of the lasso. Finally, we discuss the problem of support estimation and present a small numerical study.

4.1 Automatic Tuning Methods

Cross-validation. Arguably, V-fold cross-validation is the most popular technique for tuning the above-mentioned procedures. To our knowledge, there are no theoretical results for V-fold CV in large dimensional settings. In practice, V-fold CV seems to give rather good results. The problem of choosing the best V has not yet been solved ([5], Section 10), but it is often reported that a good choice for V is between 5 and 10. Indeed, the statistical performance does not increase for larger values of V, and averaging over 10 splits remains computationally feasible ([42], Section 7.10).

LinSelect. The procedure LinSelect can be used for selecting among a collection (β̂λ)_{λ∈Λ} of sparse regressors as follows. For J ⊂ {1, …, p}, we define X_J as the matrix [X_{ij}]_{i=1,…,n, j∈J} obtained by only keeping the columns of X with index in J. We recall that the collection S gives some possible "approximative" supports for the estimators (Xβ̂λ)_{λ∈Λ}. For sparse linear regression, a possible collection S and measure of complexity Δ are

S = { range(X_J) : J ⊂ {1, …, p}, 1 ≤ |J| ≤ n/(3 log p) }  and  Δ(S) = log (p choose dim(S)) + log(dim(S)).

Let us introduce the spaces Ŝλ = range(X_{supp(β̂λ)}) and the subcollection of S

Ŝ = { Ŝλ, λ ∈ Λ̂ },  where Λ̂ = { λ ∈ Λ : Ŝλ ∈ S }.

The following proposition gives a risk bound when selecting λ̂ with LinSelect with the above choice of Ŝ and Δ.

PROPOSITION 4.1. There exists a numerical constant C > 1 such that, for any minimizer λ̂ of the criterion (11), we have

(12) R[β̂_λ̂; β0] ≤ C E[ inf_{λ∈Λ̂} { ‖Xβ̂λ − Xβ0‖²₂ + inf_{S∈Ŝ} ( ‖Xβ̂λ − Π_S Xβ̂λ‖²₂ + dim(S) log(p) σ² ) } ]
≤ C E[ inf_{λ∈Λ̂} { ‖Xβ̂λ − Xβ0‖²₂ + ‖β̂λ‖₀ log(p) σ² } ].

Proposition 4.1 is a simple corollary of Proposition B.1 in Appendix B. The first bound involves three terms: the loss of the estimator β̂λ, an approximation loss, and a variance term. Hence, LinSelect chooses an estimator β̂λ that achieves a trade-off between the loss of β̂λ and the closeness of Xβ̂λ to some small dimensional subspace S. Bound (12) cannot be formulated in the form (9) due to the random nature of the set Λ̂. Nevertheless, a bound similar to (8) can be deduced from (12) when the estimators β̂λ are least-squares estimators; see Corollary 4 in [13]. Furthermore, we note that increasing the size of Λ leads to a better risk bound for β̂_λ̂. It is then advisable to consider a family of candidate estimators {β̂λ, λ ∈ Λ} as large as possible. Proposition 4.1 is valid for any family of estimators {β̂λ, λ ∈ Λ}. For the specific family of lasso estimators {β̂λ^L, λ > 0}, we provide a refined bound in Proposition 4.3, Section 4.3.
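As an illustration, the simplified criterion (11) can be evaluated along a path of sparse estimators with the coordinate-sparse collection S of Section 4.1, taking Ŝλ = range(X_{supp(β̂λ)}). The R sketch below is only a naive transcription of the formulas: pen_fun is assumed to be supplied by the user (the exact penalty of Appendix B involves Fisher quantiles and is not reproduced here), and X_J is assumed to have full column rank so that dim(Ŝλ) = |supp(β̂λ)|.

# Evaluate the simplified LinSelect criterion (11) for a list of candidate
# coefficient vectors beta_list (e.g., a lasso path); returns the selected index.
linselect_crit <- function(X, Y, beta_list, pen_fun) {
  n <- nrow(X); p <- ncol(X)
  crit <- sapply(beta_list, function(beta) {
    J <- which(beta != 0)
    if (length(J) == 0 || length(J) > n / (3 * log(p))) return(Inf)   # S_J outside S
    XJ <- X[, J, drop = FALSE]
    P  <- XJ %*% solve(crossprod(XJ), t(XJ))      # projector onto S = range(X_J)
    fit <- as.vector(X %*% beta)
    sigma2_S <- sum((Y - P %*% Y)^2) / (n - length(J))
    sum((Y - P %*% fit)^2) + 0.5 * sum((fit - P %*% fit)^2) + pen_fun(length(J)) * sigma2_S
  })
  which.min(crit)
}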
4.2 Lasso-Type Estimation under Unknown Variance

The lasso is certainly one of the most popular methods for variable selection in a high-dimensional setting. Given λ > 0, the lasso estimator β̂λ^L is defined by

β̂λ^L := argmin_{β∈R^p} { ‖Y − Xβ‖²₂ + λ‖β‖₁ }.

A sensible choice of λ must be homogeneous with the square root of the variance σ². As explained above, when the variance σ² is unknown, one may apply V-fold CV or LinSelect to select λ. Some alternative approaches have also been developed for tuning the lasso. Their common idea is to modify the ℓ1 criterion so that the tuning parameter becomes pivotal with respect to σ². This means that the method remains valid for any σ > 0 and that the choice of the tuning parameter does not depend on σ. For the sake of simplicity, we assume throughout this subsection and the next one that the columns of X are normalized to one.

ℓ1-Penalized log-likelihood. In low-dimensional regression, it is classical to consider a penalized log-likelihood criterion instead of a penalized least-squares criterion to handle the unknown variance. Following this principle, Städler et al. [68] propose to minimize the ℓ1-penalized log-likelihood criterion

(13) (β̂λ^LL, σ̂λ^LL) := argmin_{β∈R^p, σ′>0} { n log(σ′) + ‖Y − Xβ‖²₂/(2σ′²) + λ‖β‖₁/σ′ }.

By reparametrizing (β, σ), Städler et al. [68] obtain a convex criterion that can be efficiently minimized. Interestingly, the penalty level λ is pivotal with respect to σ. Under suitable conditions on the design matrix X, Sun and Zhang [70] show that the choice λ = c√(2 log p), with c > 1, yields optimal risk bounds in the sense of (8).

Square-root lasso and scaled lasso. Sun and Zhang [71], following an idea of Antoniadis [3], propose to minimize a penalized Huber loss ([45], page 179),

(14) (β̂λ^SR, σ̂λ^SR) := argmin_{β∈R^p, σ′>0} { nσ′/2 + ‖Y − Xβ‖²₂/(2σ′) + λ‖β‖₁ }.

This convex criterion can be minimized with roughly the same computational complexity as a Lars-lasso path [30]. Interestingly, their procedure (called the scaled lasso in [71]) is equivalent to the square-root lasso estimator previously introduced by Belloni et al. [16]. The square-root lasso of Belloni et al. is obtained by replacing the residual sum of squares in the lasso criterion by its square root:

(15) β̂λ^SR = argmin_{β∈R^p} { ‖Y − Xβ‖₂ + (λ/√n)‖β‖₁ }.

The equivalence between the two definitions follows from the minimization of the criterion in (14) with respect to σ′. In (14) and (15), the penalty level λ is again pivotal with respect to σ. Sun and Zhang [71] state sharp oracle inequalities for the estimator β̂λ^SR with λ = c√(2 log(p)), with c > 1; see Proposition 4.2 below. Their empirical results suggest that criterion (15) provides slightly better results than the ℓ1-penalized log-likelihood. In the sequel, we shall refer to β̂λ^SR as the square-root lasso estimator.

Bayesian lasso. The Bayesian paradigm allows us to put prior distributions on the variance σ² and the tuning parameter λ, as in the Bayesian lasso [60]. Bayesian procedures straightforwardly handle the case of unknown variance, but no frequentist analysis of these procedures is so far available.
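The equivalence between (14) and (15) also suggests a simple way of computing the scaled lasso: alternate a lasso step with penalty level proportional to the current variance estimate and the update σ̂ = ‖Y − Xβ̂‖₂/√n obtained by minimizing (14) in σ′. The R sketch below implements this alternation with a basic coordinate-descent lasso solver; it is a minimal illustration of the principle under the column normalization assumed in this subsection, not the algorithm of [71], and the iteration counts are arbitrary.

# Coordinate descent for the lasso criterion ||Y - Xb||_2^2 + lam * ||b||_1,
# assuming the columns of X have Euclidean norm one.
soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)
lasso_cd <- function(X, Y, lam, n_iter = 200) {
  b <- rep(0, ncol(X)); r <- Y
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      zj <- sum(X[, j] * r) + b[j]                # partial correlation, using ||X_j||_2 = 1
      bj <- soft(zj, lam / 2)
      r  <- r - X[, j] * (bj - b[j])
      b[j] <- bj
    }
  }
  b
}

# Scaled lasso: for fixed sigma', minimizing (14) in beta is a lasso with penalty
# 2 * lambda * sigma'; minimizing (14) in sigma' gives sigma' = ||Y - X beta||_2 / sqrt(n).
scaled_lasso <- function(X, Y, lambda = sqrt(2 * log(ncol(X))), n_outer = 20) {
  n <- nrow(X)
  sigma_hat <- sqrt(mean(Y^2))                    # crude initialization
  b <- rep(0, ncol(X))
  for (it in seq_len(n_outer)) {
    b <- lasso_cd(X, Y, lam = 2 * lambda * sigma_hat)
    sigma_hat <- sqrt(sum((Y - X %*% b)^2) / n)
  }
  list(beta = b, sigma = sigma_hat)
}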
4.3 Risk Bounds for Square-Root Lasso and Lasso-LinSelect

Let us state a bound on the prediction error for the square-root lasso (also called scaled lasso). For the sake of conciseness, we only present a simplified version of Theorem 1 in [71]. Consider some number ξ > 0 and some subset T ⊂ {1, …, p}. The compatibility constant κ[ξ, T] is defined by

κ[ξ, T] = min_{u∈C(ξ,T)} { |T|^{1/2} ‖Xu‖₂ / ‖u_T‖₁ },  where C(ξ, T) = { u : ‖u_{T^c}‖₁ < ξ‖u_T‖₁ }.

PROPOSITION 4.2. There exist positive numerical constants C1, C2 and C3 such that the following holds. Let us consider the square-root lasso with the tuning parameter λ = 2√(2 log(p)). If we assume that:
(1) p ≥ C1;
(2) ‖β0‖₀ ≤ C2 κ²[4, supp(β0)] n/log(p),
then, with high probability,

‖X(β̂^SR − β0)‖²₂ ≤ inf_{β≠0} { ‖X(β0 − β)‖²₂ + C3 ‖β‖₀ log(p) σ² / κ²[4, supp(β)] }.

This bound is comparable to the general objective (9) stated in Section 2.4. Interestingly, the constant before the bias term ‖X(β0 − β)‖²₂ equals one. If ‖β0‖₀ = k, the square-root lasso achieves the minimax loss k log(p)σ² as long as k log(p)/n is small and κ[4, supp(β0)] is bounded away from zero. This last condition ensures that the design X is not too far from orthogonality on the cone C(4, supp(β0)). State-of-the-art results for the classical lasso with known variance [17, 49, 74] all involve this condition.

In what follows, we call lasso-LinSelect the lasso estimator of β0 obtained by choosing the parameter λ in Λ = R+ with LinSelect. We next state a risk bound for this procedure. For J ⊂ {1, …, p}, we define φ_J as the largest eigenvalue of X_J^T X_J. The following proposition involves the restricted eigenvalue φ* = max{ φ_J : Card(J) ≤ n/(3 log p) }.

PROPOSITION 4.3. There exist positive numerical constants C, C1, C2 and C3 such that the following holds. Take Λ = R+, and assume that

‖β0‖₀ ≤ C (κ²[5, supp(β0)]/φ*) × n/log(p).

Then, with probability at least 1 − C1 p^{−C2}, the lasso estimator β̂_λ̂^L selected according to the LinSelect procedure described in Section 4.1 fulfills

(16) ‖X(β0 − β̂_λ̂^L)‖²₂ ≤ C3 inf_{β≠0} { ‖X(β0 − β)‖²₂ + φ* ‖β‖₀ log(p) σ² / κ²[5, supp(β)] }.

The bound (16) is similar to the bound stated above for the square-root lasso, the most notable differences being the constant larger than 1 in front of the bias term and the quantity φ* in front of the variance term. We refer to the supplementary material [38] for a proof of Proposition 4.3.

4.4 Support Estimation and Inverse Problem

Until now, we have only discussed estimation methods that perform well in prediction. Little is known when the objective is to infer β0 or its support under unknown variance.

Inverse problem. The square-root lasso [16, 71] is proved to achieve a near optimal risk bound for the inverse problem under suitable assumptions on the design X.

Support estimation. To our knowledge, there are no nonasymptotic results on support estimation for the aforementioned procedures in the unknown variance setting. Nevertheless, some related results and heuristics have been developed for the cross-validation scheme. If the tuning parameter λ is chosen to minimize the prediction error [i.e., take λ = λ* as defined in (4)], the lasso is not consistent for support estimation; see [51, 56] for results in a random design setting. One idea to overcome this problem is to choose the parameter λ that minimizes the risk of the so-called Gauss-lasso estimator β̂λ^GL, which is the least-squares estimator over the support of the lasso estimator β̂λ^L:

(17) β̂λ^GL := argmin_{β∈R^p : supp(β)⊂supp(β̂λ^L)} ‖Y − Xβ‖²₂.

When the objective is support estimation, some numerical simulations [62] suggest that it may be more advisable not to apply the selection schemes based on prediction risk (such as V-fold CV or LinSelect) to the lasso estimators, but rather to the Gauss-lasso estimators. Similar remarks also apply for the Dantzig Selector [22].
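The Gauss-lasso refit (17) is straightforward to implement once a lasso solution is available. The sketch below uses base R's lm.fit for the restricted least-squares step; beta_lasso stands for the output of any lasso solver and is not tied to a particular package.

# Gauss-lasso estimator (17): least squares restricted to the support of a lasso solution.
gauss_lasso <- function(X, Y, beta_lasso) {
  J <- which(beta_lasso != 0)
  beta_gl <- rep(0, ncol(X))
  if (length(J) > 0) {
    fit <- lm.fit(X[, J, drop = FALSE], Y)   # assumes X_J has full column rank
    beta_gl[J] <- fit$coefficients
  }
  beta_gl
}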
4.5 Numerical Experiments

We present two numerical experiments to illustrate the behavior of some of the above-mentioned procedures for high-dimensional sparse linear regression. The first one concerns the problem of tuning the parameter λ of the lasso algorithm for estimating Xβ0. The procedures will be compared on the basis of the prediction risk. The second one concerns the problem of support estimation with lasso-type estimators. We will focus on the false discovery rate (FDR) and the proportion of true discoveries (Power).

Simulation design. The simulation design is the same as the one described in Sections 6.1 and 8.2 of [13], except that we restrict to the case n = p = 100. Therefore, 165 examples are simulated. They are inspired by examples found in [43, 72, 84, 85] and cover a large variety of situations. The simulations were carried out with R (www.r-project.org), using the library elasticnet.

Experiment 1: Tuning the lasso for prediction. In the first experiment, we compare 10-fold CV [32], LinSelect [13] and the square-root lasso [16, 71] (also called scaled lasso) for tuning the lasso. The lasso-LinSelect requires one to minimize criterion (11) over Λ = R+. In fact, the minimum of (11) is achieved at some λ corresponding to a Lars-lasso [30] step. Consequently, only the steps of the regularization path of the lasso have to be considered, so that the lasso-LinSelect estimator can be computed with roughly the same computational complexity as a Lars-lasso path [30]. The square-root lasso is computed using the algorithm described in Sun and Zhang [71]. We set λ = √(2 log(p)), which corresponds to the value recommended by the authors.²

² More precisely, Sun and Zhang advocate λ0 = √(2 log(p)/n) in their paper. This choice is equivalent to λ = √(2 log(p)) in (14) because the normalizations of X and of (14) are different from those in [71].

For each tuning procedure ℓ ∈ {10-fold CV, LinSelect, square-root lasso}, we focus on the prediction risk R[β̂^L_{λ̂ℓ}; β0] of the selected lasso estimator β̂^L_{λ̂ℓ}. For each simulated example e = 1, …, 165, we estimate on the basis of 400 runs:
• the risk of the oracle (4): Re = R[β̂_{λ*}; β0];
• the risk when selecting λ with procedure ℓ: Rℓ,e = R[β̂_{λ̂ℓ}; β0].
The comparison between the procedures is based on the comparison of the means, standard deviations and quantiles of the risk ratios Rℓ,e/Re computed over all the simulated examples e = 1, …, 165. The results are displayed in Table 1.

TABLE 1
For each procedure ℓ, mean, standard-error and quantiles of the ratios {Rℓ,e/Re, e = 1, …, 165}

Procedure            Mean   Std-err    0%     50%    75%    90%    95%
Lasso 10-fold CV     1.13    0.08      1.03   1.11   1.15   1.19   1.24
Lasso LinSelect      1.19    0.48      0.97   1.03   1.06   1.19   2.52
Square-root lasso    5.15    6.74      1.32   2.61   3.37   11.2   17

For 10-fold CV and LinSelect, the risk ratios are close to one. For 90% of the examples, the risk of the lasso-LinSelect is smaller than the risk of the lasso-CV, but there are a few examples where the risk of the lasso-LinSelect is significantly larger than the risk of the lasso-CV. For the square-root lasso procedure, the risk ratios are clearly larger than for the two others. An inspection of the results reveals that the square-root lasso selects estimators with supports of small size. This feature can be interpreted as follows. Due to the bias of the lasso estimator, the residual variance tends to overestimate the variance, leading the square-root lasso to select a lasso estimator β̂λ^L with large λ. Consequently the risk is high.
Experiment 2: Variable selection with Gauss-lasso and square-root lasso. We consider now the problem of support estimation, sometimes referred to as the problem of variable selection. We implement three procedures: the Gauss-lasso procedure tuned by either 10-fold CV or LinSelect, and the square-root lasso. The support of β0 is estimated by the support of the selected estimator. For each simulated example, the FDR and the Power are estimated on the basis of 400 runs. The results are given in Table 2. It appears that the Gauss-lasso CV procedure gives greater values of the FDR than the two others. The Gauss-lasso LinSelect and the square-root lasso behave similarly for the FDR, but the values of the power are more variable for the LinSelect procedure.

TABLE 2
For each procedure ℓ, mean, standard-error and quantiles of FDR and Power values

False discovery rate
Procedure                 Mean   Std-err   0%      25%     50%     75%    90%
Gauss-lasso 10-fold CV    0.28    0.26     0       0.08    0.22    0.35   0.74
Gauss-lasso LinSelect     0.12    0.25     0       0.002   0.02    0.13   0.33
Square-root lasso         0.13    0.26     0       0.009   0.023   0.07   0.32

Power
Gauss-lasso 10-fold CV    0.67    0.18     0.4     0.52    0.65    0.71   1
Gauss-lasso LinSelect     0.56    0.33     0.002   0.23    0.56    0.93   1
Square-root lasso         0.59    0.28     0.013   0.41    0.57    0.80   1

Computation time. Let us conclude this numerical section with a comparison of the computation times of the methods. For all methods the computation time depends on the maximum number of steps in the lasso algorithm, and for the LinSelect method, it depends on the cardinality of Ŝ or, equivalently, on the maximum number of nonzero components of β̂. The results are shown in Table 3. The square-root lasso is the least time-consuming method, closely followed by the lasso-LinSelect method. The V-fold CV, carried out with the function cv.enet of the R package elasticnet, pays the price of several calls to the lasso algorithm.

5. GROUP-SPARSITY

In the previous section, we have made no prior assumptions on the form of β0. In some applications, there are some known structures between the covariates. As an example, we treat the now classical case of group sparsity. The covariates are assumed to be clustered into M groups, and when the coefficient β0,i corresponding to the covariate Xi is nonzero, then it is likely that all the coefficients β0,j with variables Xj in the same group as Xi are nonzero. We refer to the introduction of [8] for practical examples of this so-called group-sparsity assumption.

Let G1, …, GM form a given partition of {1, …, p}. For λ = (λ1, …, λM), the group-lasso estimator β̂λ is defined as the minimizer of the convex optimization criterion

(18) ‖Y − Xβ‖²₂ + Σ_{k=1}^M λk ‖β^{Gk}‖₂,

where β^{Gk} = (βj)_{j∈Gk}. Criterion (18) promotes solutions where all the coordinates of β^{Gk} are either zero or nonzero, leading to group selection [80]. Under some assumptions on X, Huang and Zhang [44] or Lounici et al. [54] provide a suitable choice of λ = (λ1, …, λM) that leads to near optimal prediction bounds. As expected, this choice of λ = (λ1, …, λM) is proportional to σ. As for the lasso, V-fold CV is widely used in practice to tune the penalty parameter λ = (λ1, …, λM). To our knowledge, there is not yet any extension of the procedures described in Section 4.2 to the group lasso. An alternative to cross-validation is to use LinSelect.

Tuning the Group-Lasso with LinSelect

For any K ⊂ {1, . . .
, M}, we define the submatrix X(K) of X by only keeping the columns of X with in dex in k∈K Gk . We also write XGk for the submatrix of X, built from the columns with index in Gk . The 511 UNKNOWN VARIANCE TABLE 3 For each procedure computation time for different values of n and p. The maximum number of steps in the lasso algorithm is taken as  denoted kmax is taken as max.steps = min{n, p}. For the LinSelect procedure, the maximum number of nonzero components of β, kmax = min{p, n/ log(p)} n p max.steps kmax 100 100 500 100 500 500 100 100 500 21 16 80 collection S and the function     k∈K  |Gk | ≤ n/2 − 1      = λ ∈ , range X  ∈ S . with  (K λ ) Proposition B.1 in Appendix B ensures that we have for some constant C > 1, R[βλ ; β0 ] ≤ CE inf Xβλ − Xβ0 22  λ∈    λ | log(M) σ 2 . + βλ 0 ∨ |K In the following, we provide a more explicit bound. For simplicity, we restrict to the specific case where each group Gk has the same cardinality T . For K ⊂ {1, . . . , M}, we define φ(K) as the largest eigenvalue of XT(K) X(K) , and we set (19)  φ∗ = max φ(K) : 1 ≤ |K| ≤  n−2 . 2T ∨ 3 log(M) We assume that all the columns of X are normalized to 1, and following Lounici et al. [54], we introduce for 1 ≤ s ≤ M, (20) Square-root lasso 4s 4.8 s 300 s 0.21 s 0.43 s 11 s 0.18 s 0.4 s 6.3 s 1 ≤ |K0 | ≤ C K| ]. For a given  ⊂ and (range(X(K) )) = log[|K| |M M λ = {k : R+ , similarly to Section 4.1, we define K βλGk 2 = 0} and     , S = range X(K λ ) , λ ∈  Lasso LinSelect P ROPOSITION 5.1. There exist positive numerical constants C, C1 , C2 and C3 such  that the following holds. Assume that  contains λ∈R+ {(λ, . . . , λ)}, that T ≤ (n − 2)/4 and that are given by S = range X(K) : 1 ≤ |K| ≤ n/(3 log(M)) and Lasso 10-fold CV Xu2 , min κG [ξ, s] = min 1≤|K|≤s u∈Ŵ(ξ,K) u(K) 2 where Ŵ(ξ, K) is the cone of vectors u ∈ RM \ {0}   such that k∈Kc λk uGk 2 ≤ ξ k∈K λk uGk 2 . In the sequel, K0 stands for the set of groups containing nonzero components of β0 . 2 [3, |K |] κG n−2 0 . × φ∗ log(M) ∨ T Then, with probability larger than 1 − C1 M −C2 , we have   φ∗ Xβλ − Xβ0 22 ≤ C3 2 |K0 | T ∨ log(M) . κG [3, |K0 |] This proposition provides a bound comparable to the bounds of Lounici et al. [54], without requiring the knowledge of the variance. Its proof can be found in the supplementary material [38]. 6. VARIATION-SPARSITY We focus in this section on the variation-sparse regression. We recall that the vector β V ∈ Rp−1 of the variations of β has for coordinates βjV = βj +1 − βj and that the variation-sparse setting corresponds to the setting where the vector of variations β0V is coordinatesparse. In the following, we restrict to the case where n = p, and X is the identity matrix. In this case, the problem of variation-sparse regression coincides with the problem of segmentation of the mean of the vector Y = β0 + ε. For any subset I ⊂ {1, . . . , n − 1}, we define SI = {β ∈ Rn : supp(β V ) ⊂ I } and βI = SI Y . For any integer q ∈ {0, . . . , n − 1}, we define also the “best” subset of size q by q = argmin Y − β I 2 . I 2 |I |=q Though the number of subsets I ⊂ {1, . . . , n − 1} of cardinality q is of order nq+1 , this minimization can be performed using dynamic programming with a com=I  q in plexity of order n2 [40]. To select I q with  {0, . . . , n − 1}, any of the generic selection schemes of Section 3 can be applied. Below, we instantiate these schemes and present some alternatives. 512 C. GIRAUD, S. HUET AND N. 
6.1 Penalized Empirical Loss

When the variance σ² is known, penalized log-likelihood model selection amounts to selecting a subset Î which minimizes a criterion of the form ‖Y − β̂_I‖²₂ + pen(Card(I)). This is equivalent to selecting Î = Î_q̂ with q̂ minimizing

(21) Crit(q) = ‖Y − β̂_{Îq}‖²₂ + pen(q).

Following the work of Birgé and Massart [18], Lebarbier [50] considers the penalty

pen(q) = (q + 1)[ c1 log( n/(q + 1) ) + c2 ] σ²

and determines the constants c1 = 2, c2 = 5 by extensive numerical experiments (see also Comte and Rozenholc [25] for a similar approach in a more general setting). With this choice of the penalty, the procedure satisfies a bound of the form

(22) R[β̂_Î; β0] ≤ C inf_{I⊂{1,…,n−1}} { ‖β0 − Π_{S_I}β0‖²₂ + (1 + |I|) log( n/(1 + |I|) ) σ² }.

When σ² is unknown, several approaches have been proposed.

Plug-in estimator. The idea is to replace σ² in pen(q) by an estimator of the variance such as σ̂² = Σ_{i=1}^{n/2} (Y_{2i} − Y_{2i−1})²/n, or one of the estimators proposed by Hall et al. [41]. No theoretical results are proved in a nonasymptotic framework.

Estimating the variance by the residual least-squares. Baraud et al. [12], Section 5.4.2, propose to select q by minimizing a penalized log-likelihood criterion. This criterion can be written in the form Crit(q) = ‖Y − β̂_{Îq}‖²₂ (1 + K pen(q)), with K > 1 and the penalty pen(q) solving

E[ (U − pen(q) V)₊ ] = 1 / [ (q + 1) ( (n−1) choose q ) ],

where (·)₊ = max(·, 0), and U, V are two independent χ² variables with respectively q + 2 and n − q − 2 degrees of freedom. The resulting estimator β̂_Î, with Î = Î_q̂, satisfies a nonasymptotic risk bound similar to (22) for all K > 1. The choice K = 1.1 is suggested for the practice.

Slope heuristic. Lebarbier [50] implements the slope heuristic introduced by Birgé and Massart [19] for handling the unknown variance σ². The method consists in calibrating the penalty directly, without estimating σ². It is based on the following principle. First, there exists a so-called minimal penalty pen_min(q) such that choosing pen(q) = K pen_min(q) in (21) with K < 1 can lead to a strong overfit, whereas for K > 1, the bound (22) is met. Second, it can be shown that there exists a dimension jump around the minimal penalty, allowing one to estimate pen_min(q) from the data. The slope heuristic then proposes to select q̂ by minimizing the criterion Crit(q) = ‖Y − β̂_{Îq}‖²₂ + 2 p̂en_min(q). Arlot and Massart [7] provide a nonasymptotic risk bound for this procedure. Their results are proved in a general regression model with heteroscedastic and non-Gaussian errors, but with a constraint on the number of models per dimension which is not met for the family of models (S_I)_{I⊂{1,…,n−1}}. Nevertheless, the authors indicate how to generalize their results to the problem of signal segmentation. Finally, for practical issues, different procedures for estimating the minimal penalty are compared and implemented in Baudry et al. [15].

6.2 CV Procedure

In a recent paper, Arlot and Célisse [6] consider the problem of signal segmentation using cross-validation. Their results apply in the heteroscedastic case. They consider several CV methods, the leave-one-out, leave-p-out and V-fold CV, for estimating the quadratic loss. They propose two cross-validation schemes. The first one, denoted Procedure 5, aims to estimate directly E[‖β0 − β̂_{Îq}‖²₂], while the second one, denoted Procedure 6, relies on two steps, where the cross-validation is used first for choosing the best partition of dimension q, then the best dimension q.
They show that the leavep-out CV method can be implemented with a complexity of order n2 , and they give a control of the expected CV risk. The use of CV leads to some restrictions on the subsets I that compete for estimating β0 . This problem is discussed in [6], Section 3 of the supplementary material. 6.3 Alternative for Very High-Dimensional Settings When n is very large, the dynamic programming optimization can become computationally too intensive. An attractive alternative is based on the fused lasso proposed by Tibshirani et al. [73]. The estimator βλT V is 513 UNKNOWN VARIANCE defined by minimizing the convex criterion Y − β22 + λ n−1  j =1 |βj +1 − βj |,  where the total-variation norm j |βj +1 − βj | promotes solutions which are variation-sparse. The family (βλT V )λ≥0 can be computed very efficiently with the Lars-algorithm; see Vert and Bleakley [75]. A sensible choice of the parameter λ must be proportional to σ . When the variance σ 2 is unknown, the parameter λ can be selected either by V -fold CV or by LinSelect (see Section 5.1 in [13] for details). 7. EXTENSIONS 7.1 Gaussian Design and Graphical Models Assume that the design X is now random and that the n rows X(i) are independent observations of a Gaussian vector with mean 0p and unknown covariance matrix . This setting is mainly motivated by applications in compressed sensing [29] and in Gaussian graphical modeling. Indeed, Meinshausen and Bühlmann [56] have proved that it is possible to estimate the graph of a Gaussian graphical model by studying linear regression with Gaussian design and unknown variance. If we work conditionally on the observed X design, then all the results and methodologies described in this survey still apply. Nevertheless, these prediction results do not really take into account the fact that the design is random. In this setting, it is more natural to consider the integrated prediction risk E[ 1/2 (β− β0 )22 ] rather than the risk (3). Some procedures [35, 76] have been proved to achieve optimal risk bounds with respect to this risk, but they are computationally intractable in a high-dimensional setting. In the context of Gaussian graphical modeling, the procedure GGMselect [39] is designed to select among any collection of graph estimators, and it is proved to achieve near optimal risk bounds in terms of the integrated prediction risk. 7.3 Multivariate Regression Multivariate regression deals with T simultaneous linear regression models yk = Xβk + εk , k = 1, . . . , T . Stacking the yk ’s in a n × T matrix Y , we obtain the model Y = XB0 + E, where B0 is a p × T matrix with columns given by βk and E is a n × T matrix with i.i.d. entries. The classical structural assumptions on B0 are either that most rows of B0 are identically zero, or the rank of B0 is small. The first case is a simple case of group sparsity and can be handled by the group-lasso as in Section 5. The second case, first considered by Anderson [2] and Izenman [46], is much more nonlinear. Writing  · F for the Frobenius (or Hilbert–Schmidt) norm, the problem of selecting among the estimators r = argmin Y − XB2 , B F B:rank(B)≤r r ∈ {1, . . . , min(T , rank(X))} has been investigated recently from a nonasymptotic point of view by Bunea et al. [20] and Giraud [36]. r is of order of The prediction risk of B r − XB0 2 ] ≍ E[XB F A few results do not require that the noise ε follows a Gaussian distribution. 
The lasso-type procedures, such as the square-root lasso [16, 71], do not require the normality of the noise and extend to other distributions. In practice, it seems that cross-validation procedures still work well for other distributions of the noise. sk2 (XB0 ) k≥r   + r n + rank(X) σ 2 , where sk (M) denotes the kth largest singular value of the matrix M. Therefore, a sensible choice of r depends on σ 2 . The first selection criterion introduced by Bunea et al. [20] requires the knowledge of the variance σ 2 . To handle the case of unknown variance, Bunea et al. [20] propose to plug an estimate of the variance in their selection criterion [which works when rank(X) < n], whereas Giraud [36] introduces a penalized log-likelihood criterion independent of the variance. Both papers provide oracle risk bounds for the resulting estimators showing rate-minimax adaptation. Several recent papers [9, 20, 49, 58, 63] have investigated another strategy for the low-rank setting. For a positive λ, the matrix B0 is estimated by  λ ∈ argmin Y − XB2 + λ B F B∈Rp×T 7.2 Non-Gaussian Noise   k  sk (B) . Translating the work on trace regression of Koltchinskii et al. [49] into the setting of multivariate regression provides (under some conditions on X) an √ oraλ∗ with λ∗ = 3s1 (X)( T + cle bound on the risk of B √ rank(X))σ . We also refer to Giraud [37] for a slight variation of this result requiring no condition on the design X. Again, the value of λ∗ is proportional to σ . 514 C. GIRAUD, S. HUET AND N. VERZELEN To handle the case of unknown variance, Klopp [48] adapts the concept of the square-root lasso [16] to this setting and provides an oracle risk bound for the resulting procedure. 7.4 Nonparametric Regression In the nonparametric regression model (2), classical estimation procedures include local-polynomial estimators, kernel estimators, basis-projection estimators, k-nearest neighbors, etc. All these procedures depend on one (or several) tuning parameter(s), whose optimal value(s) scales with the variance σ 2 . V -fold CV is widely used in practice for choosing these parameters, but little is known on its theoretical performance. The class of linear estimators (including spline smoothing, Nadaraya estimators, k-nearest neighbors, low-pass filters, kernel ridge regression, etc.) has attracted some attention in the last years. Some papers have investigated the tuning of some specific family of estimators. For example, Cao and Golubev [23] provides a tuning procedure for spline smoothing while Zhang [82] provides a sharp analysis of kernel ridge regression. Recently, two papers have focused on the tuning of arbitrary linear estimators when the variance σ 2 is unknown. Arlot and Bach [4] generalize the slope heuristic to symmetric linear estimators with spectrum in [0, 1] and prove an oracle bound for the resulting estimator. Baraud et al. [13], in Section 4, shows that LinSelect can be used for selecting among a (almost) completely arbitrary collection of linear estimators (possibly nonsymmetric and/or singular). An oracle bound for the selected estimator is given by Corollary 2 in [13]. APPENDIX A: A NOTE ON BIC TYPE CRITERIA The BIC criterion has been initially introduced [64] to select an estimator among a collection of constrained maximum likelihood estimators. Nevertheless, modified versions of this criterion are often used for tuning more general estimation procedures. The purpose of this appendix is to illustrate why we advise against this approach in a high-dimensional setting. 
7.4 Nonparametric Regression

In the nonparametric regression model (2), classical estimation procedures include local-polynomial estimators, kernel estimators, basis-projection estimators, k-nearest neighbors, etc. All these procedures depend on one (or several) tuning parameter(s), whose optimal values scale with the variance σ². V-fold CV is widely used in practice for choosing these parameters, but little is known on its theoretical performance. The class of linear estimators (including spline smoothing, Nadaraya estimators, k-nearest neighbors, low-pass filters, kernel ridge regression, etc.) has attracted some attention in the last years. Some papers have investigated the tuning of specific families of estimators. For example, Cao and Golubev [23] provide a tuning procedure for spline smoothing, while Zhang [82] provides a sharp analysis of kernel ridge regression. Recently, two papers have focused on the tuning of arbitrary linear estimators when the variance σ² is unknown. Arlot and Bach [4] generalize the slope heuristic to symmetric linear estimators with spectrum in [0, 1] and prove an oracle bound for the resulting estimator. Baraud et al. [13], Section 4, show that LinSelect can be used for selecting among an (almost) completely arbitrary collection of linear estimators (possibly nonsymmetric and/or singular). An oracle bound for the selected estimator is given by Corollary 2 in [13].

APPENDIX A: A NOTE ON BIC TYPE CRITERIA

The BIC criterion was initially introduced [64] to select an estimator among a collection of constrained maximum likelihood estimators. Nevertheless, modified versions of this criterion are often used for tuning more general estimation procedures. The purpose of this appendix is to illustrate why we advise against this approach in a high-dimensional setting.

DEFINITION A.1 (A modified BIC criterion). Suppose we are given a collection (β̂_λ)_{λ∈Λ} of estimators depending on a tuning parameter λ ∈ Λ. For any λ ∈ Λ, we consider σ̂_λ² = ‖Y − Xβ̂_λ‖₂²/n, and define the modified BIC selection

    (A.1)   λ̂ ∈ argmin_{λ∈Λ̂} { −2 L_n(β̂_λ, σ̂_λ) + log(n) ‖β̂_λ‖₀ },

where L_n is the log-likelihood and Λ̂ = {λ ∈ Λ : ‖β̂_λ‖₀ ≤ n/2}. Sometimes, the log(n) term is replaced by log(p). Replacing Λ by Λ̂ allows us to avoid trivial estimators.

First, we would like to emphasize that there is no theoretical warranty that the selected estimator does not overfit in a high-dimensional setting. In practice, using this criterion often leads to overfitting. Let us illustrate this with a simple experiment.

Setting. We consider the model

    (A.2)   Y_i = β_{0,i} + ε_i,   i = 1, ..., n,

with ε ∼ N(0, σ²I_n), so that p = n and X = I_n. Here, we fix n = 10,000, σ = 1 and β0 = 0_n.

Methods. We apply the modified BIC criterion to tune the lasso [72], SCAD [31] and the hard thresholding estimator. The hard thresholding estimator β̂_λ^{HT} is defined for any λ > 0 by [β̂_λ^{HT}]_i = Y_i 1_{|Y_i|≥λ}. Given λ > 0 and a > 2, the SCAD estimator β̂_{λ,a} is defined as the minimizer of the penalized criterion ‖Y − Xβ‖₂² + ∑_{i=1}^n p_λ(|β_i|), where for x > 0, p′_λ(x) = λ 1_{x≤λ} + (aλ − x)₊ 1_{x>λ}/(a − 1). For the sake of simplicity we fix a = 3. We write β̂_{L;BIC}, β̂_{SCAD;BIC} and β̂_{HT;BIC} for the lasso, SCAD and hard thresholding estimators selected by the modified BIC criterion.

Results. We have realized N = 200 experiments. For each of these experiments, the estimators β̂_{L;BIC}, β̂_{SCAD;BIC} and β̂_{HT;BIC} are computed. The mean number of nonzero components and the estimated risk R̂[β̂_{*;BIC}; 0_n] are reported in Table 4.

TABLE 4
Estimated risk and estimated number of nonzero components for β̂_{L;BIC}, β̂_{SCAD;BIC} and β̂_{HT;BIC}

                            Lasso        SCAD        Hard thres.
  R̂[β̂_{*;BIC}; 0_p]        4.6×10⁻²     1.6×10¹     3.0×10²
  Mean of ‖β̂_{*;BIC}‖₀     0.025        86.9        28.2

Obviously, the SCAD and hard thresholding methods select too many irrelevant variables when they are tuned with BIC. Moreover, their risks are quite high. Intuitively, this is due to the fact that the log(n) [or log(p)] term in the BIC penalty is too small in this high-dimensional setting (n = p). For the lasso estimator, a very specific phenomenon occurs due to the soft thresholding effect. In the discussion of [30], Loubes and Massart advocate that soft thresholding estimators penalized by Mallows' Cp [55] penalties should yield good results, while hard thresholding estimators penalized by Mallows' Cp are known to highly overfit. This strange behavior is due to the bias of the soft thresholding estimator. Nevertheless, Loubes and Massart's arguments have been developed for an orthogonal design. In fact, there is no nonasymptotic justification that the lasso tuned by BIC or AIC performs well for general designs X.

Conclusion. The use of the modified BIC criterion to tune estimation procedures in a high-dimensional setting is not supported by theoretical results. It is proved to overfit in the case of thresholding estimators ([12], Section 3.2.2). Empirically, BIC seems to overfit except for the lasso. We advise the practitioner to avoid BIC (and AIC) when p is at least of the same order as n. For instance, LinSelect is supported by nonasymptotic arguments and by empirical results [13], in contrast to BIC.
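For readers who want to reproduce the flavour of this experiment, the following Python sketch re-runs the hard thresholding part only (one realization, no averaging over N = 200 repetitions, and the lasso and SCAD estimators are omitted); the helper name and the threshold grid are ad hoc choices, and the Gaussian log-likelihood is written out explicitly so that criterion (A.1) can be evaluated directly.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 10_000
    Y = rng.standard_normal(n)        # p = n, X = I_n, beta_0 = 0_n, sigma = 1

    def modified_bic(Y, beta_hat):
        """-2 L_n(beta_hat, sigma_hat) + log(n) ||beta_hat||_0, with the Gaussian
        log-likelihood and sigma_hat^2 = ||Y - beta_hat||_2^2 / n, so that
        -2 L_n = n log(2 pi sigma_hat^2) + n."""
        m = Y.size
        sigma2_hat = max(np.mean((Y - beta_hat) ** 2), 1e-12)
        return m * np.log(2 * np.pi * sigma2_hat) + m + np.log(m) * np.count_nonzero(beta_hat)

    best_crit, lam_hat = np.inf, None
    for lam in np.arange(0.5, 5.01, 0.1):                 # grid of thresholds
        beta_hat = np.where(np.abs(Y) >= lam, Y, 0.0)     # hard thresholding
        if np.count_nonzero(beta_hat) > n / 2:            # restriction defining Lambda_hat
            continue
        crit = modified_bic(Y, beta_hat)
        if crit < best_crit:
            best_crit, lam_hat = crit, lam

    n_false = int(np.sum(np.abs(Y) >= lam_hat))
    print(f"selected threshold {lam_hat:.1f}: {n_false} falsely selected coordinates out of {n}")

On pure noise the selected threshold typically lies near √(log n), so a few dozen coordinates are falsely selected, in line with the order of magnitude reported in Table 4.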
APPENDIX B: COMPLEMENTS ON LINSELECT

B.1 More Details on the Selection Procedure

The penalty pen(S) involved in the LinSelect criterion (11) is defined by pen(S) = 1.1 pen_Δ(S), where pen_Δ(S) is the unique solution in x of

    E[ (U − x V/(n − dim(S)))₊ ] = e^{−Δ(S)},

where U and V are two independent chi-square random variables with dim(S) + 1 and n − dim(S) − 1 degrees of freedom, respectively. It is also the solution in x of

    e^{−Δ(S)} = (D + 1) P( F_{D+3,N−1} ≥ x(N − 1)/[N(D + 3)] ) − x (N − 1)/N · P( F_{D+1,N+1} ≥ x(N + 1)/[N(D + 1)] ),

where D = dim(S), N = n − dim(S) and F_{d,r} is a Fisher random variable with d and r degrees of freedom.

Proposition 4 in [12] ensures the following upper bound on pen_Δ(S): for any 0 < κ < 1, there exists a constant C_κ > 1 such that for any S ∈ S fulfilling 1 ≤ dim(S) ∨ Δ(S) ≤ κn, we have

    pen_Δ(S) ≤ C_κ [dim(S) ∨ Δ(S)].

Conversely, Lemma 2.3 in the supplement [38] ensures that pen_Δ(S) ≥ 2Δ(S) + dim(S) − C for some constant C ≥ 0.

B.2 A General Risk Bound for LinSelect

We set

    (B.1)   Σ = σ² ∑_{S∈S} e^{−Δ(S)}.

The following proposition gives a risk bound when selecting λ̂ by minimizing (11).

PROPOSITION B.1. Assume that 1 ≤ dim(S) ≤ n/2 − 1 and Δ(S) ≤ 2n/3 for all S ∈ S. Then, there exists a constant C > 1 such that for any minimizer λ̂ of criterion (11), we have

    (B.2)   C^{−1} R[β̂_λ̂; β0] ≤ E[ inf_{λ∈Λ} { ‖Xβ̂_λ − Xβ0‖₂² + inf_{S∈S} { ‖Xβ̂_λ − Π_S Xβ̂_λ‖₂² + [Δ(S) ∨ dim(S)] σ² } } ] + Σ.

Furthermore, with probability larger than 1 − e^{−C₀n} − C₁ ∑_{S∈S} e^{−C₂[Δ(S)∧n]} e^{−Δ(S)}, we have for some C > 1

    C^{−1} ‖Xβ0 − Xβ̂_λ̂‖₂² ≤ inf_{λ∈Λ} { ‖Xβ̂_λ − Xβ0‖₂² + inf_{S∈S} { ‖Xβ̂_λ − Π_S Xβ̂_λ‖₂² + [Δ(S) ∨ dim(S)] σ² } }.

The first part of Proposition B.1 is a slight variation of Theorem 1 in [13]. We refer to the supplementary material [38] for a proof of these two results.
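Since pen_Δ(S) is only defined implicitly, the following Python sketch solves the first characterization above numerically, by combining a crude Monte Carlo approximation of the expectation with a root finder. It is only an illustration of the definition, not necessarily how the penalty is computed in practice: the function name, the values of n, dim(S) and Δ(S), and the Monte Carlo sample size are arbitrary, and the accuracy degrades when e^{−Δ(S)} is very small.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(3)

    def pen_delta(dim_S, delta_S, n, n_mc=500_000):
        """Solve E[(U - x V / (n - dim(S)))_+] = exp(-Delta(S)) in x, with
        U ~ chi^2(dim(S) + 1) and V ~ chi^2(n - dim(S) - 1) independent, by a
        Monte Carlo approximation of the left-hand side and root finding."""
        U = rng.chisquare(dim_S + 1, size=n_mc)
        V = rng.chisquare(n - dim_S - 1, size=n_mc)
        target = np.exp(-delta_S)

        def gap(x):
            return np.mean(np.clip(U - x * V / (n - dim_S), 0.0, None)) - target

        # The left-hand side decreases from about dim(S) + 1 at x = 0 towards 0,
        # so the root can be bracketed between 0 and a crude upper bound.
        return brentq(gap, 0.0, 50.0 * (dim_S + delta_S + 1))

    n = 100
    for dim_S, delta_S in [(1, 2.0), (5, 5.0), (10, 8.0)]:
        pen = 1.1 * pen_delta(dim_S, delta_S, n)   # pen(S) = 1.1 * pen_Delta(S)
        print(f"dim(S) = {dim_S:2d}, Delta(S) = {delta_S:4.1f}: pen(S) ~ {pen:6.1f}")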
SUPPLEMENTARY MATERIAL

Supplement to "High-dimensional regression with unknown variance" (DOI: 10.1214/12-STS398SUPP; .pdf). This supplement contains a description of estimation procedures that are minimax adaptive to the sparsity for all designs X. It also contains the proofs of the risk bounds for LinSelect and lasso-LinSelect.

REFERENCES

[1] AKAIKE, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971) 267–281. Akadémiai Kiadó, Budapest. MR0483125
[2] ANDERSON, T. W. (1951). Estimating linear restrictions on regression coefficients for multivariate normal distributions. Ann. Math. Statistics 22 327–351. MR0042664
[3] ANTONIADIS, A. (2010). Comments on: ℓ1-penalization for mixture regression models. TEST 19 257–258. MR2677723
[4] ARLOT, S. and BACH, F. (2009). Data-driven calibration of linear estimators with minimal penalties. In Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams and A. Culotta, eds.) 46–54. Curran Associates, New York.
[5] ARLOT, S. and CELISSE, A. (2010). A survey of cross-validation procedures for model selection. Stat. Surv. 4 40–79. MR2602303
[6] ARLOT, S. and CELISSE, A. (2011). Segmentation of the mean of heteroscedastic data via cross-validation. Stat. Comput. 21 613–632. MR2826696
[7] ARLOT, S. and MASSART, P. (2010). Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res. 10 245–279.
[8] BACH, F. R. (2008). Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res. 9 1179–1225. MR2417268
[9] BACH, F. R. (2008). Consistency of trace norm minimization. J. Mach. Learn. Res. 9 1019–1048. MR2417263
[10] BARANIUK, R., DAVENPORT, M., DEVORE, R. and WAKIN, M. (2008). A simple proof of the restricted isometry property for random matrices. Constr. Approx. 28 253–263. MR2453366
[11] BARAUD, Y. (2011). Estimator selection with respect to Hellinger-type risks. Probab. Theory Related Fields 151 353–401. MR2834722
[12] BARAUD, Y., GIRAUD, C. and HUET, S. (2009). Gaussian model selection with an unknown variance. Ann. Statist. 37 630–672. MR2502646
[13] BARAUD, Y., GIRAUD, C. and HUET, S. (2010). Estimator selection in the Gaussian setting. Available at arXiv:1007.2096v2.
[14] BARRON, A., BIRGÉ, L. and MASSART, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413. MR1679028
[15] BAUDRY, J.-P., MAUGIS, C. and MICHEL, B. (2012). Slope heuristics: Overview and implementation. Statist. Comput. 22 455–470.
[16] BELLONI, A., CHERNOZHUKOV, V. and WANG, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 98 791–806. MR2860324
[17] BICKEL, P. J., RITOV, Y. and TSYBAKOV, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732. MR2533469
[18] BIRGÉ, L. and MASSART, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3 203–268. MR1848946
[19] BIRGÉ, L. and MASSART, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Related Fields 138 33–73. MR2288064
[20] BUNEA, F., SHE, Y. and WEGKAMP, M. H. (2011). Optimal selection of reduced rank estimators of high-dimensional matrices. Ann. Statist. 39 1282–1309. MR2816355
[21] BUNEA, F., TSYBAKOV, A. B. and WEGKAMP, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697. MR2351101
[22] CANDES, E. and TAO, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 2313–2351. MR2382644
[23] CAO, Y. and GOLUBEV, Y. (2006). On oracle inequalities related to smoothing splines. Math. Methods Statist. 15 398–414. MR2301659
[24] CHEN, S. S., DONOHO, D. L. and SAUNDERS, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61. MR1639094
[25] COMTE, F. and ROZENHOLC, Y. (2004). A new algorithm for fixed design regression and denoising. Ann. Inst. Statist. Math. 56 449–473. MR2095013
[26] DALALYAN, A. and TSYBAKOV, A. (2008). Aggregation by exponential weighting, sharp oracle inequalities and sparsity. Machine Learning 72 39–61.
[27] DEVROYE, L. P. and WAGNER, T. J. (1979). The L1 convergence of kernel density estimates. Ann. Statist. 7 1136–1139. MR0536515
[28] DONOHO, D. and TANNER, J. (2009). Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 4273–4293. MR2546388
[29] DONOHO, D. L. (2006). Compressed sensing. IEEE Trans. Inform. Theory 52 1289–1306. MR2241189
[30] EFRON, B., HASTIE, T., JOHNSTONE, I. and TIBSHIRANI, R. (2004). Least angle regression. Ann. Statist. 32 407–499. MR2060166
[31] FAN, J. and LI, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360. MR1946581
[32] GEISSER, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc. 70 320–328.
[33] GERCHINOVITZ, S. (2011). Sparsity regret bounds for individual sequences in online linear regression. In Proceedings of COLT 2011. Microtome Publishing, Brookline, MA.
[34] GIRAUD, C. (2008). Mixing least-squares estimators when the variance is unknown. Bernoulli 14 1089–1107. MR2543587
[35] GIRAUD, C. (2008). Estimation of Gaussian graphs by model selection. Electron. J. Stat. 2 542–563. MR2417393
[36] GIRAUD, C. (2011). Low rank multivariate regression. Electron. J. Stat. 5 775–799. MR2824816
[37] GIRAUD, C. (2011). A pseudo-RIP for multivariate regression. Available at arXiv:1106.5599v1.
[38] GIRAUD, C., HUET, S. and VERZELEN, N. (2012). Supplement to "High-dimensional regression with unknown variance." DOI:10.1214/12-STS398SUPP.
[39] GIRAUD, C., HUET, S. and VERZELEN, N. (2012). Graph selection with GGMselect. Stat. Appl. Genet. Mol. Biol. 11 1–50.
[40] GUTHERY, S. B. (1974). A transformation theorem for one-dimensional F-expansions. J. Number Theory 6 201–210. MR0342484
[41] HALL, P., KAY, J. W. and TITTERINGTON, D. M. (1990). Asymptotically optimal difference-based estimation of variance in nonparametric regression. Biometrika 77 521–528. MR1087842
[42] HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2009). The Elements of Statistical Learning. Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR2722294
[43] HUANG, J., MA, S. and ZHANG, C.-H. (2008). Adaptive Lasso for sparse high-dimensional regression models. Statist. Sinica 18 1603–1618. MR2469326
[44] HUANG, J. and ZHANG, T. (2010). The benefit of group sparsity. Ann. Statist. 38 1978–2004. MR2676881
[45] HUBER, P. J. (1981). Robust Statistics. Wiley, New York. MR0606374
[46] IZENMAN, A. J. (1975). Reduced-rank regression for the multivariate linear model. J. Multivariate Anal. 5 248–264. MR0373179
[47] JI, P. and JIN, J. (2010). UPS delivers optimal phase diagram in high dimensional variable selection. Available at http://arxiv.org/abs/1010.5028.
[48] KLOPP, O. (2011). High dimensional matrix estimation with unknown variance of the noise. Available at arXiv:1112.3055v1.
[49] KOLTCHINSKII, V., LOUNICI, K. and TSYBAKOV, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist. 39 2302–2329. MR2906869
[50] LEBARBIER, E. (2005). Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Processing 85 717–736.
[51] LENG, C., LIN, Y. and WAHBA, G. (2006). A note on the lasso and related procedures in model selection. Statist. Sinica 16 1273–1284. MR2327490
[52] LEUNG, G. and BARRON, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory 52 3396–3410. MR2242356
[53] LI, K.-C. (1987). Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: Discrete index set. Ann. Statist. 15 958–975. MR0902239
[54] LOUNICI, K., PONTIL, M., VAN DE GEER, S. and TSYBAKOV, A. B. (2011). Oracle inequalities and optimal inference under group sparsity. Ann. Statist. 39 2164–2204. MR2893865
[55] MALLOWS, C. L. (1973). Some comments on Cp. Technometrics 15 661–675.
[56] MEINSHAUSEN, N. and BÜHLMANN, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. MR2278363
[57] MOSTELLER, F. and TUKEY, J. W. (1968). Data analysis, including statistics. In Handbook of Social Psychology, Vol. 2 (G. Lindsey and E. Aronson, eds.). Addison-Wesley, Reading, MA.
[58] NEGAHBAN, S. and WAINWRIGHT, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Statist. 39 1069–1097. MR2816348
[59] NISHII, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist. 12 758–765. MR0740928
[60] PARK, T. and CASELLA, G. (2008). The Bayesian lasso. J. Amer. Statist. Assoc. 103 681–686. MR2524001
[61] RASKUTTI, G., WAINWRIGHT, M. J. and YU, B. (2011). Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Trans. Inform. Theory 57 6976–6994. MR2882274
[62] RIGOLLET, P. and TSYBAKOV, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771. MR2816337
[63] ROHDE, A. and TSYBAKOV, A. B. (2011). Estimation of high-dimensional low-rank matrices. Ann. Statist. 39 887–930. MR2816342
[64] SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464. MR0468014
[65] SHAO, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88 486–494. MR1224373
[66] SHAO, J. (1997). An asymptotic theory for linear model selection. Statist. Sinica 7 221–264. With comments and a rejoinder by the author. MR1466682
[67] SHIBATA, R. (1981). An optimal selection of regression variables. Biometrika 68 45–54. MR0614940
[68] STÄDLER, N., BÜHLMANN, P. and VAN DE GEER, S. (2010). ℓ1-penalization for mixture regression models. TEST 19 209–256. MR2677722
[69] STONE, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B 36 111–147. MR0356377
[70] SUN, T. and ZHANG, C.-H. (2010). Comments on: ℓ1-penalization for mixture regression models. TEST 19 270–275. MR2677726
[71] SUN, T. and ZHANG, C. H. (2011). Scaled sparse linear regression. Available at arXiv:1104.4595.
[72] TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
[73] TIBSHIRANI, R., SAUNDERS, M., ROSSET, S., ZHU, J. and KNIGHT, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 91–108. MR2136641
[74] VAN DE GEER, S. A. and BÜHLMANN, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3 1360–1392. MR2576316
[75] VERT, J. P. and BLEAKLEY, K. (2010). Fast detection of multiple change-points shared by many signals using group LARS. In Advances in Neural Information Processing Systems 23 (J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel and A. Culotta, eds.) 2343–2351. Curran Associates, New York.
[76] VERZELEN, N. (2010). High-dimensional Gaussian model selection on a Gaussian design. Ann. Inst. H. Poincaré Probab. Stat. 46 480–524. MR2667707
[77] VERZELEN, N. (2012). Minimax risks for sparse regressions: Ultra-high-dimensional phenomenons. Electron. J. Stat. 6 38–90.
[78] WAINWRIGHT, M. J. (2009). Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans. Inform. Theory 55 5728–5741. MR2597190
[79] YE, F. and ZHANG, C.-H. (2010). Rate minimaxity of the Lasso and Dantzig selector for the ℓq loss in ℓr balls. J. Mach. Learn. Res. 11 3519–3540. MR2756192
[80] YUAN, M. and LIN, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67. MR2212574
[81] ZHANG, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942. MR2604701
[82] ZHANG, T. (2005). Learning bounds for kernel regression using effective data dimensionality. Neural Comput. 17 2077–2098. MR2175849
[83] ZHANG, T. (2011). Adaptive forward–backward greedy algorithm for learning sparse representations. IEEE Trans. Inform. Theory 57 4689–4708. MR2840485
[84] ZOU, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. MR2279469
[85] ZOU, H. and HASTIE, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301–320. MR2137327