Statistical Science
2012, Vol. 27, No. 4, 500–518
DOI: 10.1214/12-STS398
© Institute of Mathematical Statistics, 2012
High-Dimensional Regression with
Unknown Variance
Christophe Giraud, Sylvie Huet and Nicolas Verzelen
Abstract. We review recent results for high-dimensional sparse linear regression in the practical case of unknown variance. Different sparsity settings
are covered, including coordinate-sparsity, group-sparsity and variationsparsity. The emphasis is put on nonasymptotic analyses and feasible procedures. In addition, a small numerical study compares the practical performance of three schemes for tuning the lasso estimator and some references
are collected for some more general models, including multivariate regression and nonparametric regression.
Key words and phrases:
ance.
Linear regression, high-dimension, unknown vari-
additional results for multivariate regression and the
nonparametric regression model
1. INTRODUCTION
In the present paper, we mainly focus on the linear
regression model
(1)
(2)
Yi = f X(i) + εi ,
i = 1, . . . , n,
will also be mentioned.
Y = Xβ0 + ε,
1.1 Sparsity Assumptions
where Y is a n-dimensional response vector, X is a
fixed n × p design matrix, and the vector ε is made
of n i.i.d. Gaussian random variables with N (0, σ 2 )
distribution. In the sequel, X(i) stands for the ith row
of X. Our interest is on the high-dimensional setting,
where the dimension p of the unknown parameter β0
is large, possibly larger than n.
The analysis of the high-dimensional linear regression model has attracted a lot of attention in the last
decade. Nevertheless, there is a longstanding gap between the theory where the variance σ 2 is generally
assumed to be known and the practice where it is often unknown. The present paper is mainly devoted to
reviewing recent results on linear regression in highdimensional settings with unknown variance σ 2 . A few
In a high-dimensional linear regression model, accurate estimation is unfeasible unless it relies on some
special properties of the parameter β0 . The most common assumption on β0 is that it is sparse in some sense.
We will consider in this paper the three following classical sparsity assumptions.
Coordinate-sparsity. Most of the coordinates of β0
are assumed to be zero (or approximately zero). This
is the most common acceptation for sparsity in linear
regression.
Structured-sparsity. The pattern of zero(s) of the coordinates of β0 is assumed to have an a priori known
structure. For instance, in group-sparsity [80], the covariates are clustered into M groups, and when the coefficient β0,i corresponding to the covariate Xi (the ith
column of X) is nonzero, then it is likely that all the
coefficients β0,j with variables Xj in the same cluster
as Xi are nonzero.
Variation-sparsity. The p −1-dimensional vector β0V
V =β
of variation of β0 is defined by β0,j
0,j +1 − β0,j .
Sparsity in variation means that most of the components of β0V are equal to zero (or approximately zero).
When p = n and X = In , variation-sparse linear regression corresponds to signal segmentation.
Christophe Giraud is Professor, CMAP, UMR CNRS 7641,
Ecole Polytechnique, Route de Saclay, 91128 Palaiseau
Cedex, France (e-mail:
[email protected]). Sylvie Huet is
Research Scientist, UR341 MIA, INRA, F-78350
Jouy-en-Josas, France (e-mail:
[email protected]).
Nicolas Verzelen is Research Scientist, UMR729 MISTEA,
INRA, F-34060 Montpellier, France (e-mail:
[email protected]).
500
501
UNKNOWN VARIANCE
1.2 Statistical Objectives
In the linear regression model, there are roughly two
kinds of estimation objectives. In the prediction problem, the goal is to estimate Xβ0 , whereas in the inverse
problem it is to estimate β0 . When the vector β0 is
sparse, a related objective is to estimate the support of
β0 (model identification problem) which is the set of
the indices j corresponding to the nonzero coefficients
β0,j . Inverse problems and prediction problems are not
equivalent in general. When the Gram matrix XX∗ is
poorly conditioned, the former problems can be much
more difficult than the latter. Since there are only a few
results on inverse problems with unknown variance, we
will focus on the prediction problem, the support estimation problem being shortly discussed in the course
of the paper.
In the sequel, Eβ0 [·] stands for the expectation with
respect to Y ∼ N (Xβ0 , σ 2 In ), and · 2 is the Euclidean norm. The prediction objective amounts to
build estimators β so that the risk
(3)
β0 ] := Eβ [X(β
− β0 )2 ]
R[β;
2
0
is as small as possible.
1.3 Approaches
Most procedures that handle high-dimensional linear models [22, 26, 62, 72, 73, 81, 83, 85] rely on
tuning parameters whose optimal value depends on σ .
For example, the results of Bickel et al. [17] suggest
to choose√the tuning parameter λ of the lasso of the order of 2σ 2 log(p). As a consequence, all these procedures cannot be directly applied when σ 2 is unknown.
A straightforward approach is to replace σ 2 by an estimate of the variance in the optimal value of the tuning
parameter(s). Nevertheless, the variance σ 2 is difficult
to estimate in high-dimensional settings (see Proposition 2.3 below), so a plug-in of the variance does not
necessarily yield good results. There are basically two
approaches to build on this amount of work on highdimensional estimation with known variance.
(1) Ad-hoc estimation. There has been some recent
work [16, 68, 71] to modify procedures like the lasso in
such a way that the tuning parameter does not depend
anymore on σ 2 ; see Section 4.2. The challenge is to
find a smart modification of the procedure, so that the
resulting estimator β is computationally feasible and
β0 ] as small as possible.
has a risk R[β;
(2) Estimator selection. Given a collection (βλ )λ∈
of estimators, the objective of estimator selection is to
λ such that the risk of βλ is as small as
pick an index
possible, ideally as small as the risk R[βλ∗ ; β0 ] of the
so-called oracle estimator
(4)
βλ∗ := argmin R[βλ ; β0 ].
{βλ ,λ∈}
Efficient estimator selection procedures can then be applied to tune the aforementioned estimation methods
[22, 26, 62, 72, 73, 81, 83, 85]. Among the most famous methods for estimator selection, we mention V fold cross-validation (Geisser [32]), AIC (Akaike [1])
and BIC (Schwarz [64]) criteria.
The objective of this survey is to describe state-ofthe-art procedures for high-dimensional linear regression with unknown variance. We will review both automatic tuning methods and ad-hoc methods. There
are some procedures that we will let aside. For example, Baraud [11] provides a versatile estimator selection scheme, but the procedure is computationally
intractable in large dimensions. Linear or convex aggregation of estimators are also valuable alternatives to
estimator selection when the goal is to perform estimation, but only a few theoretical works have addressed
the aggregation problem when the variance is unknown
[33, 34]. For these reasons, we will not review these
approaches in the sequel.
1.4 Why Care about Nonasymptotic Analyses?
AIC [1], BIC [64] and V -fold cross-validation [32]
are probably the most popular criteria for estimator selection. The use of these criteria relies on some classical asymptotic optimality results. These results focus on the setting where the collection of estimators
(βλ )λ∈ and the dimension p are fixed and consider
the limit behavior of the criteria when the sample size n
goes to infinity. For example, under some suitable conditions, Shibata [67], Li [53] and Shao [66] prove that
the risk of the estimator selected by AIC or V -fold CV
(with V = Vn → ∞) is asymptotically equivalent to
the oracle risk R[βλ∗ ; β0 ]. Similarly, Nishii [59] shows
that the BIC criterion is consistent for model selection.
All these asymptotic results can lead to misleading
conclusions in modern statistical settings where the
sample size remains small and the parameter’s dimension becomes large. For instance it is proved in [12],
Section 3.3.2, and illustrated in [12], Section 6.2, that
BIC (and thus AIC) can strongly overfit and should not
be used for p larger than n. Additional examples are
provided in Appendix A. A nonasymptotic analysis
takes into account all the characteristics of the selection
problem (sample size n, parameter dimension p, number of models per dimension, design X, etc.). It treats n
502
C. GIRAUD, S. HUET AND N. VERZELEN
and p as they are, and it highlights important features
hidden by the asymptotic theory. For these reasons, we
will restrict this review to nonasymptotic results.
1.5 Organization of the Paper
In Section 2, we investigate how the ignorance of
the variance affects the minimax risk bounds. In Section 3, some “generic” schemes for selecting estimators are presented. The coordinate-sparse setting is addressed in Section 4 where some theoretical results are
collected, and a small numerical experiment compares
different lasso-based procedures. The group-sparse and
variation-sparse settings are reviewed in Sections 5
and 6, and Section 7 is devoted to some more general
models such as multivariate regression or nonparametric regression.
In the sequel, C, C1 , . . . refer to numerical constants
whose value may vary from line to line, while β0
stands for the number of nonzero components of β and
|J | for the cardinality of a set J .
2. THEORETICAL LIMITS
The goal of this section is to address the intrinsic
difficulty of a coordinate-sparse linear regression problem. We will answer the following questions: Which
range of p can we reasonably consider? When the variance is unknown, can we hope to do as well as when
the variance is known?
In practice, the number of nonzero coordinates of
β0 is unknown. The fact that an estimator β is minimax over B[k, p] for some specific k > 0 does not
imply that β performs well when β0 has a number of
nonzero components different from k. Indeed, β can be
strongly biased when β0 has more than k nonzero components or the variance of β can be too large compared
to R[k0 , X] when β0 has k0 nonzero components with
k0 less than k. A good estimation procedure β should
not require the knowledge of the sparsity k of β0 and
should perform as well as if this sparsity were known.
An estimator β that achieves [up to a possible multiplicative constant C(X)] the minimax risk over B[k, p]
for a range of k is said to be adaptive to the sparsity.
Similarly, an estimator β is adaptive to the variance σ 2
if it does not require the knowledge of σ 2 and nearly
achieves the minimax risk for all σ 2 > 0. When possible, the main challenge is to build adaptive procedures.
In the following subsections, we review sharp bounds
on the minimax prediction risks for both known and
unknown sparsity, known and unknown variance. The
big picture is summed up in Figure 1. Roughly, it says
that adaptation is possible as long as 2k log(p/k) < n.
In contrast, the situation becomes more complex for the
ultra-high-dimensional1 setting where 2k log(p/k) ≥
n. The rest of this section is devoted to explain this big
picture.
2.1 Minimax Adaptation
A classical way to assess the performance of an estimator β is to measure its maximal risk over a class
B ⊂ Rp . This is the minimax point of view. As we are
interested in coordinate-sparsity for β0 , we will consider the sets B[k, p] of vectors that contain at most k
nonzero coordinates for some k > 0.
the maximal prediction risk
Given an estimator β,
of β over B[k, p] for a fixed design X and a variance
β0 ] where the risk
σ 2 is defined by supβ0 ∈B[k,p] R[β;
function R[·, β0 ] is defined by (3). Taking the infimum
we
of the maximal risk over all possible estimators β,
obtain the minimax risk
(5)
R[k, X] = inf
sup
β β0 ∈B[k,p]
β0 ].
R[β;
Minimax bounds are convenient results to assess the
range of problems that are statistically feasible and the
optimality of particular procedures. Below, we say that
an estimator β is “minimax” over B[k, p] if its maximal prediction risk equals [up to a possible multiplicative constant C(X)] the minimax risk.
F IG . 1.
Minimal prediction risk over B[k, p] as a function of k.
1 In some papers, the expression ultra-high-dimensional has been
used to characterize problems such that log(p) = O(nθ ) with
θ < 1. We argue here that as soon as k log(p)/n goes to 0, the case
log(p) = O(nθ ) is not intrinsically more difficult than conditions
such as p = O(nδ ) with δ > 0.
503
UNKNOWN VARIANCE
2.2 Minimax Risks under Known Sparsity and
Known Variance
The minimax risk R[k, X] depends on the form of
the design X. In order to grasp this dependency, we
define for any k > 0, the largest and the smallest sparse
eigenvalues of order k of X∗ X by
Φk,+ (X) :=
Xβ2n
2
β∈B[k,p]\{0p } βp
sup
and
Xβ2n
.
inf
Φk,− (X) :=
β∈B[k,p]\{0p } β2
p
P ROPOSITION 2.1. Assume that k and σ are
known. There exist positive numerical constants C1 ,
C1′ , C2 and C2′ such that the following holds. For any
(k, n, p) such that k ≤ n/2 and any design X, we have
C1
(6)
p 2
Φ2k,− (X)
σ
k log
Φ2k,+ (X)
k
≤ R[k, X] ≤ C1′
p
∧ n σ 2.
k log
k
For any (k, n, p) such that k ≤ n/2, we have
C2 k log
(7)
p
∧ n σ2
k
p
∧ n σ 2.
≤ sup R[k, X] ≤ C2′ k log
k
X
The minimax lower bound (6) has been first proved
in [61, 62, 79] while (7) is stated in [77]. Let us
first comment on bound (7). If the vector β0 has
k nonzero components, and if these components are
a priori known, then one may build estimators that
achieve a risk bound of the order k. In a (nonultra)
high-dimensional setting [2k log(p/k) ≤ n], the minimax risk is of the order k log(p/k)σ 2 . The logarithmic term is the price to pay to cope with the fact that
we do not know the position of the nonzero components in β0 . The situation is quite different in an ultrahigh-dimensional setting [2k log(p/k) > n]. Indeed,
the minimax risk remains of the order of nσ 2 , which
corresponds to the minimax risk of estimation of the
vector Xβ0 without any sparsity assumption; see the
blue curve in Figure 1. In other terms, the sparsity index k does not play a role anymore.
Dependency of R[k, X] on the design X. It follows
from (6) that supX R[k, X] is nearly achieved by designs X satisfying Φ2k,− (X)/Φ2k,+ (X) ≈ 1, when the
setting is not ultra-high-dimensional. For some designs
such that Φ2k,− (X)/Φ2k,+ (X) is small, the minimax
prediction risk R[k, X] is possibly faster; see [77] for
a discussion. In an ultra-high-dimensional setting, the
form of the minimax risk (nσ 2 ) is related to the fact
that no designs can satisfy Φ2k,− (X)/Φ2k,+ (X) ≈ 1;
see, for example, [10]. More precisely, the lower bound
(6) enforces the following geometrical constrain:
Φ2k,− (X)
n
≤C
Φ2k,+ (X)
k log(p/k)
for any design X. The lower bound R[k, X] ≥ C[k ·
log(p/k) ∧ n]σ 2 in (7) is, for instance, achieved by
realizations of a standard Gaussian design, that is, designs X whose components follow independent standard normal distributions. See [77] for more details.
2.3 Adaptation to the Sparsity and to the Variance
Adaptation to the sparsity when the variance is
known. When σ 2 is known, there exist both model selection and aggregation procedures that achieve this
[k log(p/k)∧n]σ 2 risk simultaneously for all k and for
all designs X. Such procedures derive from the work
of Birgé and Massart [18] and Leung and Barron [52].
However, these methods are intractable for large p except for specific forms of the design. We refer to the
supplementary material [38] for more details.
Simultaneous adaptation to the sparsity and the variance. We first restrict to the nonultra high-dimensional
setting, where the number of nonzero components k is
unknown but satisfies 2k log(p/k) < n. In this setting,
some procedures based on penalized log-likelihood
[12] are simultaneous adaptive to the unknown sparsity
and to the unknown variance and this for all designs X.
Again such procedures are intractable for large p. See
the supplementary material [38] for more details. If we
want to cover all k (including ultra-high-dimensional
settings), the situation is different as shown in the next
proposition (from [77]).
P ROPOSITION 2.2 (Simultaneous adaptation is impossible). There exist positive constants C, C ′ , C1 ,
C2 , C3 , C1′ , C2′ and C3′ , such that the following holds.
Consider any p ≥ n ≥ C and k ≤ p 1/3 ∧ n/2 such that
k log(p/k) ≥ C ′ n. There exist designs X of size n × p
we have either
such that for any estimator β,
0p ]
R[β;
> C1 n
σ2
σ 2 >0
sup
β0 ]
R[β;
σ2
β0 ∈B[k,p],σ 2 >0
or
sup
> C2 k log
k
p
p
exp C3 log
k
n
k
.
504
C. GIRAUD, S. HUET AND N. VERZELEN
Conversely, there exist two estimators β(n) and βBGH
(defined in the supplementary material [38]) that respectively satisfy
sup
X
sup
X
R[β(n) ; β0 ]
≤ C1′ n,
2
σ
p
2
β0 ∈R ,σ >0
sup
R[βBGH ; β0 ]
σ2
β0 ∈B[k,p],σ 2 >0
≤ C2′ k log
p
k
p
exp C3′ log
k
n
k
for all 1 ≤ k ≤ [(n − 1) ∧ p]/4.
As a consequence, simultaneous adaptation to the
sparsity and to the variance is impossible in an ultrahigh-dimensional setting. Indeed, any estimator β that
does not rely on σ 2 has to pay at least one of these two
prices:
(1) The estimator β does not use the sparsity of the
true parameter β0 , and its risk for estimating X0p is of
the same order as the minimax risk over Rn .
(2) For any 1 ≤ k ≤ p 1/3 , the risk of β fulfills
β0 ]
R[β;
sup sup
σ2
σ >0 β0 ∈B[k,p]
k
≥ C1 k log(p) exp C2 log(p) .
n
It follows that the maximal risk of β increases exponentially fast in an ultra-high-dimensional setting (red
curve in Figure 1), while the minimax risk is stuck to
n (blue curve in Figure 1). The designs that satisfy the
minimax lower bounds of Proposition 2.2 include realizations of a standard Gaussian design.
In an ultra-high-dimensional setting, the prediction
problem becomes extremely difficult under unknown
variance because the variance estimation itself is inconsistent as shown in the next proposition (from [77]).
P ROPOSITION 2.3. There exist positive constants
C, C1 and C2 such that the following holds. Assume
that p ≥ n ≥ C. For any 1 ≤ k ≤ p 1/3 , there exist designs X such that
sup
σ σ >0,β ∈B[k,p]
0
E β0
σ2
σ2
−
σ2 σ2
β0 ] ≤ C1 β0 0 log(p)σ 2 ,
R[β;
for any σ > 0 and any vector β0 ∈ Rp such that
1 ≤ β0 0 log(p) ≤ C2 n. Such risk bounds prove
that the estimator is approximately [up to a possible
log(β0 0 ) additional term] minimax adaptive to the
unknown variance and the unknown sparsity. The condition β0 0 log(p) ≤ C2 n ensures that the setting is
not ultra-high-dimensional. As stated above, some procedures achieve (8) for all designs X, but they are intractable for large p; see [38]. One purpose of this review is to present fast procedures that achieve these
kind of bounds under possible restrictive assumptions
on the design matrix X.
For some procedures, (8) can be improved into a
bound of the form
(9)
β0 ]
R[β;
≤ C1 inf {X(β − β0 )22 + β0 log(p)σ 2 },
β=0
inf
Let us consider an estimator β that does not depend
on σ 2 . Relying on the previous minimax bounds, we
will say that β achieves an optimal risk bound (with
respect to the sparsity) if
(8)
sup
2.4 What Should We Expect from a Good
Estimation Procedure?
k
k
p
p
≥ C1 log
exp C2 log
n
k
n
k
.
with C1 close to one. Again, the dimension β0 0 is
restricted to be smaller than Cn/ log(p) to ensure that
the setting is not ultra-high-dimensional. This kind of
bound makes a clear trade-off between a bias and a
variance term. For instance, when β0 contains many
components that are nearly equal to zero, the bound
(9) can be much smaller than (8).
2.5 Other Statistical Problems in an
Ultra-High-Dimensional Setting
We have seen that adaptation becomes impossible
for the prediction problem in an ultra-high-dimensional
setting. For other statistical problems, including the
prediction problem with random design, the inverse
problem (estimation of β0 ), the variable selection problem (estimation of the support of β0 ), the dimension
reduction problem [47, 77, 78], the minimax risks increase exponentially fast in an ultra-high-dimensional
setting. This kind of phase transition has been observed
in a wide range of random geometry problems [28],
suggesting some universality in this limitation. In practice, the sparsity index k is not known, but given (n, p),
we can compute k ∗ := max{k : 2k log(p/k) ≥ n}. One
may interpret that the problem is still reasonably difficult as long as k ≤ k ∗ . This gives a simple rule of
505
UNKNOWN VARIANCE
thumb to know what we can hope from a given regression problem. For example, setting p = 5000 and
n = 50 leads to k ∗ = 3, implying that the prediction
problem becomes extremely difficult when there are
more than 4 relevant covariates; see the simulations in
[77].
3. SOME GENERIC SELECTION SCHEMES
Among the selection schemes not requiring the
knowledge of the variance σ 2 , some are very specific
to a particular algorithm, while some others are more
generic. We describe in this section three versatile selection principles and refer to the examples for the
more specific schemes.
3.1 Cross-Validation Procedures
The cross-validation schemes are nearly universal in
the sense that they can be implemented in most statistical frameworks and for most estimation procedures.
The principle of the cross-validation schemes is to split
the data into a training set and a validation set: the estimators are built on the training set, and the validation set is used for estimating their prediction risk. This
training/validation splitting is eventually repeated several times. The most popular cross-validation schemes
are:
• Hold-out [27, 57] which is based on a single split of
the data for training and validation.
• V -fold CV [32]. The data is split into V subsamples.
Each subsample is successively removed for validation, the remaining data being used for training.
• Leave-one-out [69] which corresponds to n-fold CV.
• Leave-q-out (also called delete-q-CV) [65] where
every possible subset of cardinality q of the data
is removed for validation, the remaining data being
used for training.
We refer to Arlot and Célisse [5] for a review of the
cross-validation schemes and their theoretical properties.
3.2 Penalized Empirical Loss
Penalized empirical loss criteria form another class
of versatile selection schemes, yet less universal than
CV procedures. The principle is to select among a family (βλ )λ∈ of estimators by minimizing a criterion of
the generic form
(10)
Crit(λ) = LX (Y, βλ ) + pen(λ),
where LX (Y, βλ ) is a measure of the distance between
Y and Xβλ , and pen is a function from to R+ . The
penalty function sometimes depends on data.
Penalized log-likelihood. The most famous criteria
of the form (10) are AIC and BIC. They have been
designed to select among estimators βλ obtained by
maximizing the likelihood of (β, σ ) with the constraint that β lies on a linear space Sλ (called model).
In the Gaussian case, these estimators are given by
Xβλ = Sλ Y , where Sλ denotes the orthogonal projector onto the model Sλ . For AIC and BIC, the function
LX corresponds to twice the negative log-likelihood
LX (Y, βλ ) = n log(Y − Xβλ 22 ) and the penalties are
pen(λ) = 2 dim(Sλ ) and pen(λ) = dim(Sλ ) log(n), respectively. We recall that these two criteria can perform
very poorly in a high-dimensional setting.
In the same setting, Baraud et al. [12] propose alternative penalties built from a nonasymptotic perspective. The resulting criterion can handle the highdimensional setting where p is possibly larger than n,
and the risk of the selection procedure is controlled by
a bound of the form (9); see Theorem 2 in [12].
Plug-in criteria. Many other penalized-empiricalloss criteria have been developed in the last decades.
Several selection criteria [14, 18] have been designed
from a nonasymptotic point of view to handle the case
where the variance is known. These criteria usually
involve the residual least-square LX (Y, βλ ) = Y −
Xβλ 22 and a penalty pen(λ) depending on the variance
σ 2 . A common practice is then to plug in the penalty
σ 2 of the variance in place of the variance.
an estimate
For linear regression, when the design matrix X has a
rank less than n, a classical choice for
σ 2 is
σ2 =
Y − X Y 22
,
n − rank(X)
with X the orthogonal projector onto the range of X.
In the Gaussian setting, this estimator
σ 2 has the nice
feature to be independent of X Y on which usually
rely the estimators βλ . Nevertheless, the variance of
σ2
is of order σ 4 /(n − rank(X)) which is small only when
the sample size n is quite large in front of the rank of X.
This situation is unfortunately not likely to happen in a
high-dimensional setting where p can be larger than n.
3.3 Approximation Versus Complexity
Penalization: LinSelect
The criterion proposed by Baraud et al. [12] can
handle high-dimensional settings, but it suffers from
two rigidities. First, it can only handle fixed collections
of models (Sλ )λ∈ . In some situations, the size of
is huge (e.g., for complete variable selection) and the
estimation procedure can then be computationally intractable. In this case, we may want to work with a
506
C. GIRAUD, S. HUET AND N. VERZELEN
⊂ may
subcollection of models (Sλ )λ∈
, where
depend on data. For example, for complete variable se could be generated by efficient
lection, the subset
algorithms like Lars [30]. The second rigidity of the
procedure of Baraud et al. [12] is that it can only handle constrained-maximum-likelihood estimators. This
procedure then does not help for selecting among arbitrary estimators such as the lasso or elastic-net.
These two rigidities have been addressed recently
by Baraud et al. [13]. They propose a selection procedure, LinSelect, which can handle both data-dependent
collections of models and arbitrary estimators βλ . The
procedure is based on a collection S of linear spaces
which gives a collection of possible “approximative”
supports for the estimators (Xβλ )λ∈ . A measure of
complexity on S is provided by a weight function
: S → R+ . We refer to Sections 4.1 and 5 for examples of collection S and weight
in the context
of coordinate-sparse and group-sparse regression. We
present below a simplified version of the LinSelect procedure. For a suitable, possibly data-dependent, subset
S ⊂ S (depending on the statistical problem), the estimator βλ is selected by minimizing the criterion
Crit(βλ ) = inf Y −
S∈
S
(11)
where
S
2
S Xβλ 2
1
+ Xβλ −
2
2
σS2 ,
S Xβλ 2 + pen(S)
is the orthogonal projector onto S,
Y − S Y 22
σS2 =
n − dim(S)
and pen(S) is a penalty depending on . In the cases
we will consider here, the penalty pen(S) is roughly
of the order of (S), and therefore it penalizes S according to its complexity. We refer to Appendix B for
a precise definition of this penalty and more details on
its characteristics. We emphasize that criterion (11) and
the family of estimators {βλ , λ ∈ } are based on the
same data Y and X. In other words, there is no datasplitting occurring in the LinSelect procedure. The first
term in (11) quantifies the fit of the projected estimator to the data, the second term evaluates the approximation quality of the space S and the last term penalizes S according to its complexity. We refer to Proposition B.1 in Appendix B and Theorem 1 in [12] for risk
bounds on the selected estimator. Instantiations of the
procedure and more specific risks bounds are given in
Sections 4 and 5 in the context of coordinate-sparsity
and group-sparsity.
From a computational point of view, the algorithmic complexity of LinSelect is at most proportional to
S|, and in many cases there is no need to scan
|| × |
S for each λ ∈ to minimize (11). In the
the whole set
examples of Sections 4 and 5, the whole procedure is
computationally less intensive than V -fold CV; see Table 3. Finally, we mention that for the constrained leastsquare estimators Xβλ = Sλ Y , the LinSelect proceS = {Sλ : λ ∈ } simply coincides with the
dure with
procedure of Baraud et al. [12].
4. COORDINATE-SPARSITY
In this section, we focus on the high-dimensional linear regression model Y = Xβ0 + ε where the vector β0
itself is assumed to be sparse. This setting has attracted
a lot of attention in the last decade, and many estimation procedures have been developed. Most of them require the choice of tuning parameters which depend on
the unknown variance σ 2 . This is, for instance, the case
for the lasso [24, 72], Dantzig Selector [22], Elastic Net
[85], MC+ [81], aggregation techniques [21, 26], etc.
We first discuss how the generic schemes introduced
in the previous section can be instantiated for tuning
these procedures and for selecting among them. Then,
we pay a special attention to the calibration of the lasso.
Finally, we discuss the problem of support estimation
and present a small numerical study.
4.1 Automatic Tuning Methods
Cross-validation. Arguably, V -fold cross-validation
is the most popular technique for tuning the abovementioned procedures. To our knowledge, there are no
theoretical results for V -fold CV in large dimensional
settings.
In practice, V -fold CV seems to give rather good results. The problem of choosing the best V has not yet
been solved [5], Section 10, but it is often reported that
a good choice for V is between 5 and 10. Indeed, the
statistical performance does not increase for larger values of V , and averaging over 10 splits remains computationally feasible [42], Section 7.10.
LinSelect. The procedure LinSelect can be used for
selecting among a collection (βλ )λ∈ of sparse regressors as follows. For J ⊂ {1, . . . , p}, we define XJ as
the matrix [Xij ]i=1,...,n,j ∈J obtained by only keeping
the columns of X with index in J . We recall that the
collection S gives some possible “approximative” supports for the estimators (Xβλ )λ∈ . For sparse linear re-
507
UNKNOWN VARIANCE
gression, a possible collection S and measure of complexity are
S = S = range(XJ ), J ⊂ {1, . . . , p},
1 ≤ |J | ≤ n/(3 log p)
and
p
+ log(dim(S)).
(S) = log
dim(S)
λ = range(X
Let us introduce the spaces S
supp(βλ ) ) and
the subcollection of S
λ , λ ∈ }
S = {S
λ ∈ S}.
= {λ ∈ : S
where
The following proposition gives a risk bound when seλ with LinSelect with the above choice of
S
lecting
and .
P ROPOSITION 4.1. There exists a numerical conλ of the critestant C > 1 such that for any minimizer
rion (11), we have
R[βλ ; β0 ]
≤ CE inf Xβλ − Xβ0 22
λ∈
+ inf {Xβλ −
S∈
S
(12)
2
+ dim(S) log(p)σ }
≤ CE inf {Xβλ − Xβ0 22
λ∈
2
S Xβλ 2
+ βλ 0 log(p)σ 2 } .
Proposition 4.1 is a simple corollary of Proposition B.1 in Appendix B. The first bound involves three
terms: the loss of the estimator βλ , an approximation
loss, and a variance term. Hence, LinSelect chooses an
estimator βλ that achieves a trade-off between the loss
of βλ and the closeness of Xβλ to some small dimensional subspace S. Bound (12) cannot be formulated in
Nevthe form (9) due to the random nature of the set .
ertheless, a bound similar to (8) can be deduced from
(12) when the estimators βλ are least-squares estimators; see Corollary 4 in [13]. Furthermore, we note that
increasing the size of leads to a better risk bound
for βλ . It is then advisable to consider a family of
candidate estimators {βλ , λ ∈ } as large as possible.
Proposition 4.1 is valid for any family of estimators
{βλ , λ ∈ }. For the specific family of lasso estimators {βλL , λ > 0} we provide a refined bound in Proposition 4.3, Section 4.3.
4.2 Lasso-Type Estimation under Unknown
Variance
The lasso is certainly one of the most popular methods for variable selection in a high-dimensional setting. Given λ > 0, the lasso estimator βλL is defined
by βλL := argminβ∈Rp Y − Xβ22 + λβ1 . A sensible
choice of λ must be homogeneous with the square-root
of the variance σ 2 . As explained above, when the variance σ 2 is unknown, one may apply V -fold CV or LinSelect to select λ. Some alternative approaches have
also been developed for tuning the lasso. Their common idea is to modify the ℓ1 criterion so that the tuning parameter becomes pivotal with respect to σ 2 . This
means that the method remains valid for any σ > 0 and
that the choice of the tuning parameter does not depend
on σ . For the sake of simplicity, we assume throughout
this subsection and the next one that the columns of X
are normalized to one.
ℓ1 -Penalized log-likelihood. In low-dimensional regression, it is classical to consider a penalized loglikelihood criterion instead of a penalized least-square
criterion to handle the unknown variance. Following
this principle, Städler et al. [68] propose to minimize
the ℓ1 -penalized log-likelihood criterion
(13)
βλLL ,
σλLL
:= argmin
β∈Rp ,σ ′ >0
n log(σ ′ )
Y − Xβ22
β1
+λ ′ .
′2
2σ
σ
By reparametrizing (β, σ ), Städler et al. [68] obtain
a convex criterion that can be efficiently minimized.
Interestingly, the penalty level λ is pivotal with respect to σ . Under suitable conditions on the design
matrix√X, Sun and Zhang [70] show that the choice
λ = c 2 log p, with c > 1 yields optimal risk bounds
in the sense of (8).
+
Square-root lasso and scaled lasso. Sun and Zhang
[71], following an idea of Antoniadis [3], propose to
minimize a penalized Huber loss [45], page 179,
(14)
βλSR ,
σλSR
:= argmin
β∈Rp ,σ ′ >0
nσ ′ Y − Xβ22
+
2
2σ ′
+ λβ1 .
This convex criterion can be minimized with roughly
the same computational complexity as a Lars-lasso
508
C. GIRAUD, S. HUET AND N. VERZELEN
path [30]. Interestingly, their procedure (called the
scaled lasso in [71]) is equivalent to the square-root
lasso estimator previously introduced by Belloni et al.
[16]. The square-root lasso of Belloni et al. is obtained
by replacing the residual sum of squares in the lasso
criterion by its square-root
(15)
βλSR = argmin
β∈Rp
− Xβ22
Y
λ
+ √ β1 .
n
The equivalence between the two definitions follows
from the minimization of the criterion in (14) with respect to σ ′ . In (14) and (15), the penalty level λ is again
pivotal with respect to σ . Sun and Zhang [71] state
sharp √oracle inequalities for the estimator βλSR with
λ = c 2 log(p), with c > 1; see Proposition 4.2 below.
Their empirical results suggest that criterion (15) provides slightly better results than the ℓ1 -penalized loglikelihood. In the sequel, we shall refer to βλSR as the
square-root lasso estimator.
Bayesian lasso. The Bayesian paradigm allows us
to put prior distributions on the variance σ 2 and the
tuning parameter λ, as in the Bayesian lasso [60].
Bayesian procedures straightforwardly handle the case
of unknown variance, but no frequentist analysis of
these procedures is so far available.
4.3 Risk Bounds for Square-Root Lasso and
Lasso-LinSelect
Let us state a bound on the prediction error for the
square-root lasso (also called scaled lasso). For the
sake of conciseness, we only present a simplified version of Theorem 1 in [71]. Consider some number
ξ > 0 and some subset T ⊂ {1, . . . , p}. The compatibility constant κ[ξ, T ] is defined by
κ[ξ, T ] = min
u∈C (ξ,T )
|T |1/2 Xu2
uT 1
where C (ξ, T ) = {u : uT c 1 < ξ uT 1 }.
P ROPOSITION 4.2. There exist positive numerical
constants C1 , C2 and C3 such that the following holds.
Let us consider √
the square-root lasso with the tuning
parameter λ = 2 2 log(p). If we assume that:
(1) p ≥ C1 ;
n
(2) β0 0 ≤ C2 κ 2 [4, supp(β0 )] log(p)
then, with high probability,
X(βSR − β0 )22
≤ inf
β=0
X(β0 − β)22
β0 log(p) 2
+ C3 2
σ .
κ [4, supp(β)]
This bound is comparable to the general objective
(9) stated in Section 2.4. Interestingly, the constant
before the bias term X(β0 − β)22 equals one. If
β0 0 = k, the square-root lasso achieves the minimax
loss k log(p)σ 2 as long as k log(p)/n is small, and
κ[4, supp(β0 )] is away from zero. This last condition
ensures that the design X is not too far from orthogonality on the cone C (4, supp(β0 )). State of the art results for the classical lasso with known variance [17,
49, 74] all involve this condition.
In what follows, we call lasso-LinSelect the lasso
estimator of β0 obtained by choosing the parameter
λ in = R+ with LinSelect. We next state a risk
bound for this procedure. For J ⊂ {1, . . . , p}, we define φJ as the largest eigenvalue of XJT XJ . The following proposition involves the restricted eigenvalue
φ∗ = max{φJ : Card(J ) ≤ n/(3 log p)}.
P ROPOSITION 4.3. There exist positive numerical
constants C, C1 , C2 and C3 such that the following
holds. Take = R+ , and assume that
β0 0 ≤ C
n
κ 2 [5, supp(β0 )]
×
.
φ∗
log(p)
Then, with probability at least 1 − C1 p−C2 , the lasso
estimator βλL selected according to the LinSelect procedure described in Section 4.1 fulfills
(16)
X(β0 − βλL )22
≤ C3 inf X(β0 − β)22
β=0
φ∗ β0 log(p) 2
+ 2
σ .
κ [5, supp(β)]
The bound (16) is similar to the bound stated above
for the square-root lasso, the most notable differences
being the constant larger than 1 in front of the bias term
and the quantity φ∗ in front of the variance term. We
refer to the supplementary material [38] for a proof of
Proposition 4.3.
4.4 Support Estimation and Inverse Problem
Until now, we only discussed estimation methods
that perform well in prediction. Little is known when
the objective is to infer β0 or its support under unknown variance.
Inverse problem. The square-root lasso [16, 71] is
proved to achieve near optimal risk bound for the inverse problems under suitable assumptions on the design X.
509
UNKNOWN VARIANCE
Support estimation. Up to our knowledge, there are
no nonasymptotic results on support estimation for
the aforementioned procedures in the unknown variance setting. Nevertheless, some related results and
heuristics have been developed for the cross-validation
scheme. If the tuning parameter λ is chosen to minimize the prediction error [i.e., take λ = λ∗ as defined
in (4)], the lasso is not consistent for support estimation; see [51, 56] for results in a random design setting.
One idea to overcome this problem, is to choose the
parameter λ that minimizes the risk of the so-called
Gauss-lasso estimator βλGL which is the least square
estimator over the support of the lasso estimator βλL
(17) βλGL :=
argmin
β∈Rp :supp(β)⊂supp(βλL )
Y − Xβ22 .
When the objective is support estimation, some numerical simulations [62] suggest that it may be more advisable not to apply the selection schemes based on
prediction risk (such as V -fold CV or LinSelect) to the
lasso estimators, but rather to the Gauss-lasso estimators. Similar remarks also apply for the Dantzig Selector [22].
4.5 Numerical Experiments
We present two numerical experiments to illustrate
the behavior of some of the above mentioned procedures for high-dimensional sparse linear regression.
The first one concerns the problem of tuning the parameter λ of the lasso algorithm for estimating Xβ0 .
The procedures will be compared on the basis of the
prediction risk. The second one concerns the problem
of support estimation with lasso-type estimators. We
will focus on the false discovery rates (FDR) and the
proportion of true discoveries (Power).
Simulation design. The simulation design is the
same as the one described in Sections 6.1 and 8.2 of
[13], except that we restrict to the case n = p = 100.
Therefore, 165 examples are simulated. They are inspired by examples found in [43, 72, 84, 85] and cover
a large variety of situations. The simulation were carried out with R www.r-project.org, using the library
elasticnet.
Experiment 1: Tuning the lasso for prediction. In the
first experiment, we compare 10-fold CV [32], LinSelect [13] and the square-root lasso [16, 71] (also called
scaled lasso) for tuning the lasso. The lasso-LinSelect
requires one to minimize criterion (11) over = R + .
In fact, the minimum of (11) is achieved at some λ
corresponding to a Lars-lasso [30] step. Consequently,
only the steps of the regularization path of the lasso
have to be considered so that the lasso-LinSelect estimator can be computed with roughly the same computational complexity as a Lars-lasso path [30]. The
square-root lasso is computed using the algorithm
de√
scribed in Sun and Zhang [71]. We set λ = 2 log(p)
which corresponds to the value recommended by the
authors.2
For each tuning procedure ℓ ∈ {10-fold CV,
LinSelect, square-root lasso}, we focus on the prediction risk R[βλL ; β0 ] of the selected lasso estimator βλL .
ℓ
ℓ
For each simulated example e = 1, . . . , 165, we estimate on the basis of 400 runs:
• the risk of the oracle (4): Re = R[βλ∗ ; β0 ];
• the risk when selecting λ with procedure ℓ: Rℓ,e =
R[βλℓ ; β0 ].
The comparison between the procedures is based on
the comparison of the means, standard deviations and
quantiles of the risk ratios Rℓ,e /Re computed over all
the simulated examples e = 1, . . . , 165. The results are
displayed in Table 1.
For 10-fold CV and LinSelect, the risk ratios are close
to one. For 90% of the examples, the risk of the lassoLinSelect is smaller than the risk of the lasso-CV, but
there are a few examples where the risk of the lassoLinSelect is significantly larger than the risk of the
lasso-CV. For the square-root lasso procedure, the risk
ratios are clearly larger than for the two others. An inspection of the results reveals that the square-root lasso
selects estimators with supports of small size. This feature can be interpreted as follows. Due to the bias of
the lasso-estimator, the residual variance tends to overestimate the variance, leading the square-root lasso to
TABLE 1
For each procedure ℓ, mean, standard-error and quantiles of the
ratios {Rℓ,e /Re , e = 1, . . . , 165}
Quantiles
Procedure
Mean Std-err 0% 50% 75% 90% 95%
Lasso 10-fold CV 1.13
1.19
Lasso LinSelect
Square-root lasso 5.15
0.08
0.48
6.74
1.03 1.11 1.15 1.19 1.24
0.97 1.03 1.06 1.19 2.52
1.32 2.61 3.37 11.2 17
2 More precisely, Sun and Zhang advocate λ = √2 log(p)/n in
√0
their paper. This choice is equivalent to λ = 2 log(p) in (14) because the normalizations of X and of (14) are different from [71].
510
C. GIRAUD, S. HUET AND N. VERZELEN
TABLE 2
For each procedure ℓ, mean, standard-error and quantiles of FDR and Power values
Quantiles
Procedure
Mean
Std-err
Gauss-lasso 10-fold CV
Gauss-lasso LinSelect
Square-root lasso
0.28
0.12
0.13
0.26
0.25
0.26
Gauss-lasso 10-fold CV
Gauss-lasso LinSelect
Square-root lasso
0.67
0.56
0.59
0.18
0.33
0.28
0%
False discovery rate
0
0
0
select a lasso estimator βλL with large λ. Consequently
the risk is high.
Experiment 2: Variable selection with Gauss-lasso
and square-root lasso. We consider now the problem
of support estimation, sometimes referred as the problem of variable selection. We implement three procedures. The Gauss-lasso procedure tuned by either 10fold CV or LinSelect and the square-root lasso. The
support of β0 is estimated by the support of the selected
estimator.
For each simulated example, the FDR and the Power
are estimated on the basis of 400 runs. The results are
given on Table 2.
It appears that the Gauss-lasso CV procedure gives
greater values of the FDR than the two others. The
Gauss-lasso LinSelect and the square-root lasso behave
similarly for the FDR, but the values of the power are
more variable for the LinSelect procedure.
Computation time. Let us conclude this numerical
section with the comparison of the computation times
between the methods. For all methods the computation
time depends on the maximum number of steps in the
lasso algorithm, and for the LinSelect method, it depends on the cardinality of S or equivalently on the
The
maximum number of nonzero components of β.
results are shown at Table 3. The square-root lasso is
the less time consuming method, closely followed by
the lasso LinSelect method. The V -fold CV carried out
with the function cv.enet of the R package elasticnet, pays the price of several calls to the lasso
algorithm.
5. GROUP-SPARSITY
In the previous section, we have made no prior assumptions on the form of β0 . In some applications,
Power
0.4
0.002
0.013
25%
50%
75%
90%
0.08
0.002
0.009
0.22
0.02
0.023
0.35
0.13
0.07
0.74
0.33
0.32
0.52
0.23
0.41
0.65
0.56
0.57
0.71
0.93
0.80
1
1
1
there are some known structures between the covariates. As an example, we treat the now classical case
of group sparsity. The covariates are assumed to be
clustered into M groups, and when the coefficient β0,i
corresponding to the covariate Xi is nonzero, then it is
likely that all the coefficients β0,j with variables Xj in
the same group as Xi are nonzero. We refer to the introduction of [8] for practical examples of this so-called
group-sparsity assumption. Let G1 , . . . , GM form a
given partition of {1, . . . , p}. For λ = (λ1 , . . . , λM ), the
group-lasso estimator βλ is defined as the minimizer of
the convex optimization criterion
(18)
Y − Xβ22 +
M
k=1
λk β Gk 2 ,
where β Gk = (βj )j ∈Gk . Criterion (18) promotes solutions where all the coordinates of β Gk are either
zero or nonzero, leading to group selection [80]. Under some assumptions on X, Huang and Zhang [44]
or Lounici et al. [54] provide a suitable choice of
λ = (λ1 , . . . , λM ) that leads to near optimal prediction
bounds. As expected, this choice of λ = (λ1 , . . . , λM )
is proportional to σ .
As for the lasso, V -fold CV is widely used in practice
to tune the penalty parameter λ = (λ1 , . . . , λM ). To our
knowledge, there is not yet any extension of the procedures described in Section 4.2 to the group lasso. An
alternative to cross-validation is to use LinSelect.
Tuning the Group-Lasso with LinSelect
For any K ⊂ {1, . . . , M}, we define the submatrix
X(K) of X by only keeping the columns of X with in
dex in k∈K Gk . We also write XGk for the submatrix
of X, built from the columns with index in Gk . The
511
UNKNOWN VARIANCE
TABLE 3
For each procedure computation time for different values of n and p. The maximum number of steps in the lasso algorithm is taken as
denoted kmax is taken as
max.steps = min{n, p}. For the LinSelect procedure, the maximum number of nonzero components of β,
kmax = min{p, n/ log(p)}
n
p
max.steps
kmax
100
100
500
100
500
500
100
100
500
21
16
80
collection S and the function
k∈K
|Gk | ≤ n/2 − 1
= λ ∈ , range X ∈ S .
with
(K λ )
Proposition B.1 in Appendix B ensures that we have
for some constant C > 1,
R[βλ ; β0 ] ≤ CE inf Xβλ − Xβ0 22
λ∈
λ | log(M) σ 2 .
+ βλ 0 ∨ |K
In the following, we provide a more explicit bound.
For simplicity, we restrict to the specific case where
each group Gk has the same cardinality T . For K ⊂
{1, . . . , M}, we define φ(K) as the largest eigenvalue of
XT(K) X(K) , and we set
(19)
φ∗ = max φ(K) : 1 ≤ |K| ≤
n−2
.
2T ∨ 3 log(M)
We assume that all the columns of X are normalized to
1, and following Lounici et al. [54], we introduce for
1 ≤ s ≤ M,
(20)
Square-root lasso
4s
4.8 s
300 s
0.21 s
0.43 s
11 s
0.18 s
0.4 s
6.3 s
1 ≤ |K0 | ≤ C
K|
]. For a given ⊂
and (range(X(K) )) = log[|K| |M
M
λ = {k :
R+ , similarly to Section 4.1, we define K
βλGk 2 = 0} and
,
S = range X(K
λ ) , λ ∈
Lasso LinSelect
P ROPOSITION 5.1. There exist positive numerical
constants C, C1 , C2 and C3 such
that the following holds. Assume that contains λ∈R+ {(λ, . . . , λ)},
that T ≤ (n − 2)/4 and that
are given by
S = range X(K) : 1 ≤ |K| ≤ n/(3 log(M))
and
Lasso 10-fold CV
Xu2
,
min
κG [ξ, s] = min
1≤|K|≤s u∈Ŵ(ξ,K) u(K) 2
where Ŵ(ξ, K) is the cone of vectors u ∈ RM \ {0}
such that k∈Kc λk uGk 2 ≤ ξ k∈K λk uGk 2 . In the
sequel, K0 stands for the set of groups containing
nonzero components of β0 .
2 [3, |K |]
κG
n−2
0
.
×
φ∗
log(M) ∨ T
Then, with probability larger than 1 − C1 M −C2 , we
have
φ∗
Xβλ − Xβ0 22 ≤ C3 2
|K0 | T ∨ log(M) .
κG [3, |K0 |]
This proposition provides a bound comparable to the
bounds of Lounici et al. [54], without requiring the
knowledge of the variance. Its proof can be found in
the supplementary material [38].
6. VARIATION-SPARSITY
We focus in this section on the variation-sparse regression. We recall that the vector β V ∈ Rp−1 of the
variations of β has for coordinates βjV = βj +1 − βj
and that the variation-sparse setting corresponds to the
setting where the vector of variations β0V is coordinatesparse. In the following, we restrict to the case where
n = p, and X is the identity matrix. In this case, the
problem of variation-sparse regression coincides with
the problem of segmentation of the mean of the vector
Y = β0 + ε.
For any subset I ⊂ {1, . . . , n − 1}, we define SI =
{β ∈ Rn : supp(β V ) ⊂ I } and βI = SI Y . For any integer q ∈ {0, . . . , n − 1}, we define also the “best” subset of size q by
q = argmin Y − β
I 2 .
I
2
|I |=q
Though the number of subsets I ⊂ {1, . . . , n − 1} of
cardinality q is of order nq+1 , this minimization can be
performed using dynamic programming with a com=I
q in
plexity of order n2 [40]. To select I
q with
{0, . . . , n − 1}, any of the generic selection schemes of
Section 3 can be applied. Below, we instantiate these
schemes and present some alternatives.
512
C. GIRAUD, S. HUET AND N. VERZELEN
6.1 Penalized Empirical Loss
When the variance σ 2 is known, penalized log
likelihood model selection amounts to select a subset I
2
which minimizes a criterion of the form Y − βI 2 +
= I
pen(Card(I )). This is equivalent to select I
q with
q minimizing
(21)
Crit(q) = Y − βIq 22 + pen(q).
Following the work of Birgé and Massart [18],
Lebarbier [50] considers the penalty
pen(q) = (q + 1) c1 log n/(q + 1) + c2 σ 2
and determines the constants c1 = 2, c2 = 5 by extensive numerical experiments (see also Comte and
Rozenholc [25] for a similar approach in a more general setting). With this choice of the penalty, the procedure satisfies a bound of the form
(22)
R[βI, β0 ]
≤C
inf
I ⊂{1,...,n−1}
βI − β0 22
+ (1 + |I |) log n/(1 + |I |) σ 2 .
When σ 2 is unknown, several approaches have been
proposed.
Plug-in estimator. The idea is to replace σ 2 in
σ2 =
pen(q) by an estimator of the variance such as
n/2
2
i=1 (Y2i − Y2i−1 ) /n, or one of the estimators proposed by Hall et al. [41]. No theoretical results are
proved in a nonasymptotic framework.
Estimating the variance by the residual leastsquares. Baraud et al. [12], in Section 5.4.2, propose
to select q by minimizing a penalized log-likelihood
criterion. This criterion can be written in the form
Crit(q) = Y − βIq 22 (1 + K pen(q)), with K > 1 and
the penalty pen(q) solving
E U − pen(q)V
+
=
1
n−1 ,
(q + 1)
q
where (·)+ = max(·, 0), and U , V are two independent
χ 2 variables with respectively q + 2 and n − q − 2
degrees of freedom. The resulting estimator βI, with
= I
I
q , satisfies a nonasymptotic risk bound similar to
(22) for all K > 1. The choice K = 1.1 is suggested
for the practice.
Slope heuristic. Lebarbier [50] implements the slope
heuristic introduced by Birgé and Massart [19] for handling the unknown variance σ 2 . The method consists
in calibrating the penalty directly, without estimating
σ 2 . It is based on the following principle. First, there
exists a so-called minimal penalty penmin (q) such that
choosing pen(q) = K penmin (q) in (21) with K < 1
can lead to a strong overfit, whereas for K > 1, the
bound (22) is met. Second, it can be shown that there
exists a dimension jump around the minimal penalty,
allowing one to estimate penmin (q) from the data. The
slope heuristic then proposes to select q by minimizing the criterion Crit(q) = Y − βIq 22 + 2 p
enmin (q).
Arlot and Massart [7] provide a nonasymptotic risk
bound for this procedure. Their results are proved in a
general regression model with heteroscedatic and nonGaussian errors, but with a constraint on the number of
models per dimension which is not met for the family
of models (SI )I ⊂{1,...,n−1} . Nevertheless, the authors
indicate how to generalize their results for the problem
of signal segmentation.
Finally, for practical issues, different procedures for
estimating the minimal penalty are compared and implemented in Baudry et al. [15].
6.2 CV Procedure
In a recent paper, Arlot and Célisse [6] consider the
problem of signal segmentation using cross-validation.
Their results apply in the heteroscedastic case. They
consider several CV-methods, the leave-one-out, leavep-out and V -fold CV for estimating the quadratic loss.
They propose two cross-validation schemes. The first
one, denoted Procedure 5, aims to estimate directly
E[β0 − βIq 22 ], while the second one, denoted Procedure 6, relies on two steps, where the cross-validation
is used first for choosing the best partition of dimension
q, then the best dimension q. They show that the leavep-out CV method can be implemented with a complexity of order n2 , and they give a control of the expected
CV risk. The use of CV leads to some restrictions on the
subsets I that compete for estimating β0 . This problem
is discussed in [6], Section 3 of the supplementary material.
6.3 Alternative for Very High-Dimensional Settings
When n is very large, the dynamic programming optimization can become computationally too intensive.
An attractive alternative is based on the fused lasso proposed by Tibshirani et al. [73]. The estimator βλT V is
513
UNKNOWN VARIANCE
defined by minimizing the convex criterion
Y − β22 + λ
n−1
j =1
|βj +1 − βj |,
where the total-variation norm j |βj +1 − βj | promotes solutions which are variation-sparse. The family
(βλT V )λ≥0 can be computed very efficiently with the
Lars-algorithm; see Vert and Bleakley [75]. A sensible
choice of the parameter λ must be proportional to σ .
When the variance σ 2 is unknown, the parameter λ can
be selected either by V -fold CV or by LinSelect (see
Section 5.1 in [13] for details).
7. EXTENSIONS
7.1 Gaussian Design and Graphical Models
Assume that the design X is now random and that the
n rows X(i) are independent observations of a Gaussian vector with mean 0p and unknown covariance matrix . This setting is mainly motivated by applications
in compressed sensing [29] and in Gaussian graphical
modeling. Indeed, Meinshausen and Bühlmann [56]
have proved that it is possible to estimate the graph of
a Gaussian graphical model by studying linear regression with Gaussian design and unknown variance. If
we work conditionally on the observed X design, then
all the results and methodologies described in this survey still apply. Nevertheless, these prediction results
do not really take into account the fact that the design
is random. In this setting, it is more natural to consider the integrated prediction risk E[ 1/2 (β− β0 )22 ]
rather than the risk (3). Some procedures [35, 76] have
been proved to achieve optimal risk bounds with respect to this risk, but they are computationally intractable in a high-dimensional setting. In the context of Gaussian graphical modeling, the procedure
GGMselect [39] is designed to select among any collection of graph estimators, and it is proved to achieve
near optimal risk bounds in terms of the integrated prediction risk.
7.3 Multivariate Regression
Multivariate regression deals with T simultaneous
linear regression models yk = Xβk + εk , k = 1, . . . , T .
Stacking the yk ’s in a n × T matrix Y , we obtain the
model Y = XB0 + E, where B0 is a p × T matrix with
columns given by βk and E is a n × T matrix with i.i.d.
entries. The classical structural assumptions on B0 are
either that most rows of B0 are identically zero, or the
rank of B0 is small. The first case is a simple case of
group sparsity and can be handled by the group-lasso as
in Section 5. The second case, first considered by Anderson [2] and Izenman [46], is much more nonlinear.
Writing · F for the Frobenius (or Hilbert–Schmidt)
norm, the problem of selecting among the estimators
r = argmin Y − XB2 ,
B
F
B:rank(B)≤r
r ∈ {1, . . . , min(T , rank(X))}
has been investigated recently from a nonasymptotic
point of view by Bunea et al. [20] and Giraud [36].
r is of order of
The prediction risk of B
r − XB0 2 ] ≍
E[XB
F
A few results do not require that the noise ε follows a
Gaussian distribution. The lasso-type procedures, such
as the square-root lasso [16, 71], do not require the normality of the noise and extend to other distributions. In
practice, it seems that cross-validation procedures still
work well for other distributions of the noise.
sk2 (XB0 )
k≥r
+ r n + rank(X) σ 2 ,
where sk (M) denotes the kth largest singular value of
the matrix M. Therefore, a sensible choice of r depends on σ 2 . The first selection criterion introduced
by Bunea et al. [20] requires the knowledge of the
variance σ 2 . To handle the case of unknown variance,
Bunea et al. [20] propose to plug an estimate of the
variance in their selection criterion [which works when
rank(X) < n], whereas Giraud [36] introduces a penalized log-likelihood criterion independent of the variance. Both papers provide oracle risk bounds for the
resulting estimators showing rate-minimax adaptation.
Several recent papers [9, 20, 49, 58, 63] have investigated another strategy for the low-rank setting. For a
positive λ, the matrix B0 is estimated by
λ ∈ argmin Y − XB2 + λ
B
F
B∈Rp×T
7.2 Non-Gaussian Noise
k
sk (B) .
Translating the work on trace regression of Koltchinskii et al. [49] into the setting of multivariate regression provides (under some conditions on X) an oracle bound on the risk of B̂_{λ*} with λ* = 3 s_1(X)(√T + √rank(X)) σ. We also refer to Giraud [37] for a slight variation of this result requiring no condition on the design X. Again, the value of λ* is proportional to σ.
To handle the case of unknown variance, Klopp [48]
adapts the concept of the square-root lasso [16] to this
setting and provides an oracle risk bound for the resulting procedure.
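For illustration, B̂_λ can be computed by proximal gradient descent, the proximal operator of the penalty λ Σ_k s_k(B) being soft thresholding of the singular values. The sketch below uses synthetic data, a fixed number of iterations and, purely to have a numerical value for λ, plugs σ = 1 into the expression of λ*; it is a generic illustration of the computation, not the procedure analyzed in [48] or [49].

import numpy as np

rng = np.random.default_rng(2)
n, p, T = 60, 15, 10
X = rng.standard_normal((n, p))
B0 = rng.standard_normal((p, 2)) @ rng.standard_normal((2, T))     # low-rank truth
Y = X @ B0 + rng.standard_normal((n, T))

s1 = np.linalg.norm(X, 2)                                          # largest singular value of X
lam = 3.0 * s1 * (np.sqrt(T) + np.sqrt(np.linalg.matrix_rank(X)))  # lambda* with sigma = 1 plugged in
step = 1.0 / (2.0 * s1 ** 2)                                       # step size 1/L for the smooth part

def svt(M, tau):
    """Singular value soft-thresholding: proximal operator of tau * (sum of singular values)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

B = np.zeros((p, T))
for _ in range(500):
    grad = 2.0 * X.T @ (X @ B - Y)           # gradient of ||Y - XB||_F^2
    B = svt(B - step * grad, step * lam)     # proximal step on the singular value penalty

print(np.linalg.matrix_rank(B, tol=1e-8), round(np.linalg.norm(X @ B - X @ B0, "fro"), 2))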
7.4 Nonparametric Regression
In the nonparametric regression model (2), classical estimation procedures include local-polynomial estimators, kernel estimators, basis-projection estimators, k-nearest neighbors, etc. All these procedures depend on one (or several) tuning parameter(s), whose optimal values scale with the variance σ². V-fold CV is widely used in practice for choosing these parameters, but little is known about its theoretical performance.
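As an illustration of this common practice, the following sketch selects the bandwidth of a Nadaraya–Watson estimator by V-fold CV on synthetic data of our own making; note that the procedure never uses the value of σ².

import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x) + 0.3 * rng.standard_normal(n)   # synthetic data; the noise level is never used below

def nw_predict(x_train, y_train, x_test, h):
    """Nadaraya-Watson estimator with a Gaussian kernel and bandwidth h."""
    w = np.exp(-0.5 * ((x_test[:, None] - x_train[None, :]) / h) ** 2)
    return (w @ y_train) / np.maximum(w.sum(axis=1), 1e-12)

V = 5
folds = np.array_split(rng.permutation(n), V)                # the same folds for every bandwidth

def cv_error(h):
    err = 0.0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        pred = nw_predict(x[train], y[train], x[test_idx], h)
        err += np.sum((y[test_idx] - pred) ** 2)
    return err / n

bandwidths = np.logspace(-2, 0, 20)
h_cv = bandwidths[int(np.argmin([cv_error(h) for h in bandwidths]))]
print(h_cv)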
The class of linear estimators (including spline smoothing, Nadaraya estimators, k-nearest neighbors, low-pass filters, kernel ridge regression, etc.) has attracted some attention in recent years. Some papers have investigated the tuning of specific families of estimators. For example, Cao and Golubev [23] provide a tuning procedure for spline smoothing, while Zhang [82] provides a sharp analysis of kernel ridge regression. Recently, two papers have focused on the tuning of arbitrary linear estimators when the variance σ² is unknown. Arlot and Bach [4] generalize the slope heuristic to symmetric linear estimators with spectrum in [0, 1] and prove an oracle bound for the resulting estimator. Baraud et al. [13], in Section 4, show that LinSelect can be used for selecting among an (almost) completely arbitrary collection of linear estimators (possibly nonsymmetric and/or singular). An oracle bound for the selected estimator is given by Corollary 2 in [13].
APPENDIX A: A NOTE ON BIC TYPE CRITERIA
The BIC criterion was initially introduced [64] to select an estimator among a collection of constrained maximum likelihood estimators. Nevertheless, modified versions of this criterion are often used for tuning more general estimation procedures. The purpose of this appendix is to illustrate why we advise against this approach in a high-dimensional setting.
DEFINITION A.1 (A modified BIC criterion). Suppose we are given a collection (β̂_λ)_{λ∈Λ} of estimators depending on a tuning parameter λ ∈ Λ. For any λ ∈ Λ, we consider σ̂_λ² = ‖Y − Xβ̂_λ‖_2²/n, and define the modified BIC

(A.1)    λ̂ ∈ argmin_{λ∈Λ̃} { −2 L_n(β̂_λ, σ̂_λ) + log(n) ‖β̂_λ‖_0 },

where L_n is the log-likelihood and Λ̃ = {λ ∈ Λ : ‖β̂_λ‖_0 ≤ n/2}. Sometimes, the log(n) term is replaced by log(p). Replacing Λ by Λ̃ allows us to avoid trivial estimators.
First, we would like to emphasize that there is no theoretical guarantee that the selected estimator does not overfit in a high-dimensional setting. In practice, using this criterion often leads to overfitting. Let us illustrate this with a simple experiment.
Setting
We consider the model

(A.2)    Y_i = β_{0,i} + ε_i,    i = 1, ..., n,

with ε ∼ N(0, σ²I_n), so that p = n and X = I_n. Here, we fix n = 10,000, σ = 1 and β_0 = 0_n.
Methods
We apply the modified BIC criterion to tune the lasso [72], SCAD [31] and the hard thresholding estimator. The hard thresholding estimator β̂_λ^{HT} is defined for any λ > 0 by [β̂_λ^{HT}]_i = Y_i 1_{|Y_i| ≥ λ}. Given λ > 0 and a > 2, the SCAD estimator β̂_{λ,a}^{SCAD} is defined as the minimizer of the penalized criterion ‖Y − Xβ‖_2² + Σ_{i=1}^n p_λ(|β_i|), where for x > 0,

p_λ′(x) = λ 1_{x≤λ} + (aλ − x)_+ 1_{x>λ}/(a − 1).

For the sake of simplicity we fix a = 3. We write β̂_{L;BIC}, β̂_{SCAD;BIC} and β̂_{HT;BIC} for the lasso, SCAD and hard thresholding estimators selected by the modified BIC criterion.
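For concreteness, the sketch below implements criterion (A.1) together with the lasso (soft thresholding, since X = I_n) and the hard thresholding estimator in the setting of model (A.2); SCAD is omitted for brevity, and the grid of tuning parameters is an arbitrary choice of ours, not the one used to produce Table 4.

import numpy as np

rng = np.random.default_rng(4)
n = 10_000
sigma = 1.0
beta0 = np.zeros(n)
Y = beta0 + sigma * rng.standard_normal(n)        # model (A.2) with p = n and X = I_n

def modified_bic(beta_hat):
    """Criterion (A.1): -2 * Gaussian log-likelihood at (beta_hat, sigma_hat) + log(n) * ||beta_hat||_0."""
    sigma2_hat = np.sum((Y - beta_hat) ** 2) / n
    neg2loglik = n * np.log(2.0 * np.pi * sigma2_hat) + n
    return neg2loglik + np.log(n) * np.count_nonzero(beta_hat)

def lasso(lam):                                   # for X = I_n the lasso is soft thresholding
    return np.sign(Y) * np.maximum(np.abs(Y) - lam, 0.0)

def hard_thresholding(lam):
    return Y * (np.abs(Y) >= lam)

lambdas = np.linspace(0.1, 5.0, 100)              # an arbitrary grid of tuning parameters
for name, estimator in [("lasso", lasso), ("hard thresholding", hard_thresholding)]:
    crits = []
    for lam in lambdas:
        beta_hat = estimator(lam)
        # restrict to the set Lambda-tilde, i.e. ||beta_hat||_0 <= n / 2
        crits.append(modified_bic(beta_hat) if np.count_nonzero(beta_hat) <= n // 2 else np.inf)
    beta_selected = estimator(lambdas[int(np.argmin(crits))])
    print(name, np.count_nonzero(beta_selected))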
Results
We carried out N = 200 experiments. For each experiment, the estimators β̂_{L;BIC}, β̂_{SCAD;BIC} and β̂_{HT;BIC} are computed. The mean number of nonzero components and the estimated risk R[β̂_{∗;BIC}; 0_n] are reported in Table 4.
Obviously, the SCAD and hard thresholding methods select too many irrelevant variables when they are
tuned with BIC. Moreover, their risks are quite high.
Intuitively, this is due to the fact that the log(n) [or
log(p)] term in the BIC penalty is too small in this
high-dimensional setting (n = p).
TABLE 4
Estimated risk and estimated number of nonzero components for β̂_{L;BIC}, β̂_{SCAD;BIC} and β̂_{HT;BIC}

                            Lasso         SCAD        Hard thres.
R[β̂_{∗;BIC}; 0_p]          4.6 × 10⁻²    1.6 × 10¹   3.0 × 10²
Mean of ‖β̂_{∗;BIC}‖_0      0.025         86.9        28.2

For the lasso estimator, a very specific phenomenon occurs due to the soft thresholding effect. In the discussion of [30], Loubes and Massart advocate that
soft thresholding estimators penalized by Mallows' C_p [55] should yield good results, while hard thresholding estimators penalized by Mallows' C_p are known to overfit severely. This contrasting behavior is due to the bias of the soft thresholding estimator. Nevertheless, Loubes and Massart's arguments were developed for an orthogonal design. In fact, there is no nonasymptotic justification that the lasso tuned by BIC or AIC performs well for general designs X.
Conclusion
The use of the modified BIC criterion to tune estimation procedures in a high-dimensional setting is not supported by theoretical results. It has been proved to overfit in the case of thresholding estimators ([12], Section 3.2.2). Empirically, BIC seems to overfit except for the lasso. We advise practitioners to avoid BIC (and AIC) when p is at least of the same order as n. In contrast to BIC, LinSelect is supported by nonasymptotic arguments and by empirical results [13].
APPENDIX B: COMPLEMENTS ON LINSELECT
B.1 More Details on the Selection Procedure
The penalty pen(S) involved in the LinSelect criterion (11) is defined by pen(S) = 1.1 pen*(S), where pen*(S) is the unique solution of

E[ ( U − pen*(S) V/(n − dim(S)) )_+ ] = e^{−Δ(S)},

where U and V are two independent chi-square random variables with dim(S) + 1 and n − dim(S) − 1 degrees of freedom, respectively. It is also the solution in x of

e^{−Δ(S)} = (D + 1) P( F_{D+3,N−1} ≥ x(N − 1)/(N(D + 3)) )
                  − x(N − 1)/N · P( F_{D+1,N+1} ≥ x(N + 1)/(N(D + 1)) ),

where D = dim(S), N = n − dim(S) and F_{d,r} is a Fisher random variable with d and r degrees of freedom.
Proposition 4 in [12] ensures the following upper bound on pen*(S): for any 0 < κ < 1, there exists a constant C_κ > 1 such that, for any S ∈ S fulfilling 1 ≤ dim(S) ∨ Δ(S) ≤ κn, we have

pen*(S) ≤ C_κ [dim(S) ∨ Δ(S)].

Conversely, Lemma 2.3 in the supplement [38] ensures that pen*(S) ≥ 2Δ(S) + dim(S) − C for some constant C ≥ 0.
B.2 A General Risk Bound for LinSelect
We set

(B.1)    R̄_Δ = σ² Σ_{S∈S} e^{−Δ(S)}.

The following proposition gives a risk bound when selecting λ̂ by minimizing (11).
PROPOSITION B.1. Assume that 1 ≤ dim(S) ≤ n/2 − 1 and Δ(S) ≤ 2n/3 for all S ∈ S. Then, there exists a constant C > 1 such that, for any minimizer λ̂ of criterion (11), we have

(B.2)    C^{−1} R[β̂_λ̂; β_0] ≤ E[ inf_{λ∈Λ} ( ‖Xβ̂_λ − Xβ_0‖_2² + inf_{S∈S} { ‖Xβ̂_λ − Π_S Xβ̂_λ‖_2² + [Δ(S) ∨ dim(S)]σ² } ) ] + R̄_Δ.

Furthermore, with probability larger than 1 − e^{−C_0 n} − C_1 Σ_{S∈S} e^{−C_2[Δ(S)∧n]} e^{−Δ(S)}, we have, for some C > 1,

C^{−1} ‖Xβ_0 − Xβ̂_λ̂‖_2² ≤ inf_{λ∈Λ} ( ‖Xβ̂_λ − Xβ_0‖_2² + inf_{S∈S} { ‖Xβ̂_λ − Π_S Xβ̂_λ‖_2² + [Δ(S) ∨ dim(S)]σ² } ).
The first part of Proposition B.1 is a slight variation
of Theorem 1 in [13]. We refer to the supplementary
material [38] for a proof of these two results.
SUPPLEMENTARY MATERIAL
Supplement to “High-dimensional regression with unknown variance” (DOI: 10.1214/12-STS398SUPP; .pdf). This supplement contains a description of estimation procedures that are minimax adaptive to the sparsity for all designs X. It also contains the proofs of the risk bounds for LinSelect and lasso-LinSelect.
REFERENCES
[1] A KAIKE , H. (1973). Information theory and an extension of
the maximum likelihood principle. In Second International
Symposium on Information Theory (Tsahkadsor, 1971) 267–
281. Akadémiai Kiadó, Budapest. MR0483125
[2] A NDERSON , T. W. (1951). Estimating linear restrictions on
regression coefficients for multivariate normal distributions.
Ann. Math. Statistics 22 327–351. MR0042664
[3] A NTONIADIS , A. (2010). Comments on: ℓ1 -penalization for
mixture regression models. TEST 19 257–258. MR2677723
[4] A RLOT, S. and BACH , F. (2009). Data-driven calibration of
linear estimators with minimal penalties. In Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams and A. Culotta, eds.)
46–54. Curran Associates, New York.
[5] A RLOT, S. and C ELISSE , A. (2010). A survey of crossvalidation procedures for model selection. Stat. Surv. 4 40–79.
MR2602303
[6] A RLOT, S. and C ELISSE , A. (2011). Segmentation of the
mean of heteroscedastic data via cross-validation. Stat. Comput. 21 613–632. MR2826696
[7] A RLOT, S. and M ASSART, P. (2010). Data-driven calibration
of penalties for least-squares regression. J. Mach. Learn. Res.
10 245–279.
[8] BACH , F. R. (2008). Consistency of the group lasso and
multiple kernel learning. J. Mach. Learn. Res. 9 1179–1225.
MR2417268
[9] BACH , F. R. (2008). Consistency of trace norm minimization.
J. Mach. Learn. Res. 9 1019–1048. MR2417263
[10] BARANIUK , R., DAVENPORT, M., D E VORE , R. and
WAKIN , M. (2008). A simple proof of the restricted isometry
property for random matrices. Constr. Approx. 28 253–263.
MR2453366
[11] BARAUD , Y. (2011). Estimator selection with respect to
Hellinger-type risks. Probab. Theory Related Fields 151 353–
401. MR2834722
[12] BARAUD , Y., G IRAUD , C. and H UET, S. (2009). Gaussian
model selection with an unknown variance. Ann. Statist. 37
630–672. MR2502646
[13] BARAUD , Y., G IRAUD , C. and H UET, S. (2010). Estimator selection in the Gaussian setting. Available at
arXiv:1007.2096v2.
[14] BARRON , A., B IRGÉ , L. and M ASSART, P. (1999). Risk
bounds for model selection via penalization. Probab. Theory
Related Fields 113 301–413. MR1679028
[15] BAUDRY, J.-P., M AUGIS , C. and M ICHEL , B. (2012). Slope
heuristics: Overview and implementation. Statist. Comput. 22
455–470.
[16] B ELLONI , A., C HERNOZHUKOV, V. and WANG , L. (2011).
Square-root lasso: Pivotal recovery of sparse signals via conic
programming. Biometrika 98 791–806. MR2860324
[17] B ICKEL , P. J., R ITOV, Y. and T SYBAKOV, A. B. (2009).
Simultaneous analysis of lasso and Dantzig selector. Ann.
Statist. 37 1705–1732. MR2533469
[18] B IRGÉ , L. and M ASSART, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3 203–268. MR1848946
[19] B IRGÉ , L. and M ASSART, P. (2007). Minimal penalties for
Gaussian model selection. Probab. Theory Related Fields 138
33–73. MR2288064
[20] B UNEA , F., S HE , Y. and W EGKAMP, M. H. (2011). Optimal selection of reduced rank estimators of high-dimensional
matrices. Ann. Statist. 39 1282–1309. MR2816355
[21] B UNEA , F., T SYBAKOV, A. B. and W EGKAMP, M. H.
(2007). Aggregation for Gaussian regression. Ann. Statist. 35
1674–1697. MR2351101
[22] C ANDES , E. and TAO , T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist.
35 2313–2351. MR2382644
[23] C AO , Y. and G OLUBEV, Y. (2006). On oracle inequalities
related to smoothing splines. Math. Methods Statist. 15 398–
414. MR2301659
[24] C HEN , S. S., D ONOHO , D. L. and S AUNDERS , M. A.
(1998). Atomic decomposition by basis pursuit. SIAM J. Sci.
Comput. 20 33–61. MR1639094
[25] C OMTE , F. and ROZENHOLC , Y. (2004). A new algorithm
for fixed design regression and denoising. Ann. Inst. Statist.
Math. 56 449–473. MR2095013
[26] DALALYAN , A. and T SYBAKOV, A. (2008). Aggregation by
exponential weighting, sharp oracle inequalities and sparsity.
Machine Learning 72 39–61.
[27] D EVROYE , L. P. and WAGNER , T. J. (1979). The L1 convergence of kernel density estimates. Ann. Statist. 7 1136–1139.
MR0536515
[28] D ONOHO , D. and TANNER , J. (2009). Observed universality of phase transitions in high-dimensional geometry, with
implications for modern data analysis and signal processing.
Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367
4273–4293. MR2546388
[29] D ONOHO , D. L. (2006). Compressed sensing. IEEE Trans.
Inform. Theory 52 1289–1306. MR2241189
[30] E FRON , B., H ASTIE , T., J OHNSTONE , I. and T IBSHI RANI , R. (2004). Least angle regression. Ann. Statist. 32 407–
499. MR2060166
[31] FAN , J. and L I , R. (2001). Variable selection via nonconcave
penalized likelihood and its oracle properties. J. Amer. Statist.
Assoc. 96 1348–1360. MR1946581
[32] G EISSER , S. (1975). The predictive sample reuse method
with applications. J. Amer. Statist. Assoc. 70 320–328.
[33] G ERCHINOVITZ , S. (2011). Sparsity regret bounds for individual sequences in online linear regression. In Proceedings
of COLT 2011. Microtome Publishing, Brookline, MA.
[34] G IRAUD , C. (2008). Mixing least-squares estimators
when the variance is unknown. Bernoulli 14 1089–1107.
MR2543587
[35] G IRAUD , C. (2008). Estimation of Gaussian graphs by model
selection. Electron. J. Stat. 2 542–563. MR2417393
[36] G IRAUD , C. (2011). Low rank multivariate regression. Electron. J. Stat. 5 775–799. MR2824816
[37] G IRAUD , C. (2011). A pseudo-RIP for multivariate regression. Available at arXiv:1106.5599v1.
[38] G IRAUD , C., H UET, S. and V ERZELEN , N. (2012). Supplement to “High-dimensional regression with unknown variance.” DOI:10.1214/12-STS398SUPP.
[39] G IRAUD , C., H UET, S. and V ERZELEN , N. (2012). Graph
selection with GGMselect. Stat. Appl. Genet. Mol. Biol. 11
1–50.
[40] G UTHERY, S. B. (1974). A transformation theorem for onedimensional F -expansions. J. Number Theory 6 201–210.
MR0342484
[41] H ALL , P., K AY, J. W. and T ITTERINGTON , D. M. (1990).
Asymptotically optimal difference-based estimation of variance in nonparametric regression. Biometrika 77 521–528.
MR1087842
[42] H ASTIE , T., T IBSHIRANI , R. and F RIEDMAN , J. (2009). The
Elements of Statistical Learning. Data Mining, Inference, and
Prediction, 2nd ed. Springer, New York. MR2722294
[43] H UANG , J., M A , S. and Z HANG , C.-H. (2008). Adaptive
Lasso for sparse high-dimensional regression models. Statist.
Sinica 18 1603–1618. MR2469326
[44] H UANG , J. and Z HANG , T. (2010). The benefit of group sparsity. Ann. Statist. 38 1978–2004. MR2676881
[45] H UBER , P. J. (1981). Robust Statistics. Wiley, New York.
MR0606374
[46] I ZENMAN , A. J. (1975). Reduced-rank regression for the
multivariate linear model. J. Multivariate Anal. 5 248–264.
MR0373179
[47] J I , P. and J IN , J. (2010). UPS delivers optimal phase diagram in high dimensional variable selection. Available at
http://arxiv.org/abs/1010.5028.
[48] K LOPP, O. (2011). High dimensional matrix estimation with unknown variance of the noise. Available at
arXiv:1112.3055v1.
[49] KOLTCHINSKII , V., L OUNICI , K. and T SYBAKOV, A. B.
(2011). Nuclear-norm penalization and optimal rates for noisy
low-rank matrix completion. Ann. Statist. 39 2302–2329.
MR2906869
[50] L EBARBIER , E. (2005). Detecting multiple change-points in
the mean of Gaussian process by model selection. Signal Processing 85 717–736.
[51] L ENG , C., L IN , Y. and WAHBA , G. (2006). A note on the
lasso and related procedures in model selection. Statist. Sinica
16 1273–1284. MR2327490
[52] L EUNG , G. and BARRON , A. R. (2006). Information theory
and mixing least-squares regressions. IEEE Trans. Inform.
Theory 52 3396–3410. MR2242356
[53] L I , K.-C. (1987). Asymptotic optimality for Cp , CL , crossvalidation and generalized cross-validation: Discrete index
set. Ann. Statist. 15 958–975. MR0902239
[54] L OUNICI , K., P ONTIL , M., VAN DE G EER , S. and T SYBAKOV, A. B. (2011). Oracle inequalities and optimal inference under group sparsity. Ann. Statist. 39 2164–2204.
MR2893865
[55] M ALLOWS , C. L. (1973). Some comments on Cp . Technometrics 15 661–675.
[56] M EINSHAUSEN , N. and B ÜHLMANN , P. (2006). Highdimensional graphs and variable selection with the lasso. Ann.
Statist. 34 1436–1462. MR2278363
[57] M OSTELLER , F. and T UKEY, J. W. (1968). Data analysis,
including statistics. In Handbook of Social Psychology, Vol. 2
(G. Lindsey and E. Aronson, eds.). Addison-Wesley, Reading,
MA.
[58] N EGAHBAN , S. and WAINWRIGHT, M. J. (2011). Estimation
of (near) low-rank matrices with noise and high-dimensional
scaling. Ann. Statist. 39 1069–1097. MR2816348
[59] N ISHII , R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist. 12
758–765. MR0740928
[60] PARK , T. and C ASELLA , G. (2008). The Bayesian lasso.
J. Amer. Statist. Assoc. 103 681–686. MR2524001
[61] R ASKUTTI , G., WAINWRIGHT, M. J. and Y U , B. (2011).
Minimax rates of estimation for high-dimensional linear regression over ℓq -balls. IEEE Trans. Inform. Theory 57 6976–
6994. MR2882274
[62] R IGOLLET, P. and T SYBAKOV, A. (2011). Exponential
screening and optimal rates of sparse estimation. Ann. Statist.
39 731–771. MR2816337
[63] ROHDE , A. and T SYBAKOV, A. B. (2011). Estimation of
high-dimensional low-rank matrices. Ann. Statist. 39 887–
930. MR2816342
[64] S CHWARZ , G. (1978). Estimating the dimension of a model.
Ann. Statist. 6 461–464. MR0468014
[65] S HAO , J. (1993). Linear model selection by cross-validation.
J. Amer. Statist. Assoc. 88 486–494. MR1224373
[66] S HAO , J. (1997). An asymptotic theory for linear model selection. Statist. Sinica 7 221–264. With comments and a rejoinder by the author. MR1466682
[67] S HIBATA , R. (1981). An optimal selection of regression variables. Biometrika 68 45–54. MR0614940
[68] S TÄDLER , N., B ÜHLMANN , P. and VAN DE G EER , S.
(2010). ℓ1 -penalization for mixture regression models. TEST
19 209–256. MR2677722
[69] S TONE , M. (1974). Cross-validatory choice and assessment
of statistical predictions. J. Roy. Statist. Soc. Ser. B 36 111–
147. MR0356377
[70] S UN , T. and Z HANG , C.-H. (2010). Comments on: ℓ1 penalization for mixture regression models. TEST 19 270–
275. MR2677726
[71] S UN , T. and Z HANG , C. H. (2011). Scaled sparse linear regression. Available at arXiv:1104.4595.
[72] T IBSHIRANI , R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
MR1379242
[73] T IBSHIRANI , R., S AUNDERS , M., ROSSET, S., Z HU , J. and
K NIGHT, K. (2005). Sparsity and smoothness via the fused
lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 91–108.
MR2136641
[74] VAN DE G EER , S. A. and B ÜHLMANN , P. (2009). On the
conditions used to prove oracle results for the Lasso. Electron.
J. Stat. 3 1360–1392. MR2576316
[75] V ERT, J. P. and B LEAKLEY, K. (2010). Fast detection of
multiple change-points shared by many signals using group
LARS. In Advances in Neural Information Processing Systems 23 (J. Lafferty, C. K. I. Williams, J. Shawe-Taylor,
R. S. Zemel and A. Culotta, eds.) 2343–2351. Curran Associates, New York.
[76] V ERZELEN , N. (2010). High-dimensional Gaussian model
selection on a Gaussian design. Ann. Inst. H. Poincaré
Probab. Stat. 46 480–524. MR2667707
[77] V ERZELEN , N. (2012). Minimax risks for sparse regressions:
Ultra-high-dimensional phenomenons. Electron. J. Stat. 6
38–90.
[78] WAINWRIGHT, M. J. (2009). Information-theoretic limits on
sparsity recovery in the high-dimensional and noisy setting.
IEEE Trans. Inform. Theory 55 5728–5741. MR2597190
[79] Y E , F. and Z HANG , C.-H. (2010). Rate minimaxity of the
Lasso and Dantzig selector for the ℓq loss in ℓr balls. J. Mach.
Learn. Res. 11 3519–3540. MR2756192
[80] Y UAN , M. and L IN , Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser.
B Stat. Methodol. 68 49–67. MR2212574
[81] Z HANG , C.-H. (2010). Nearly unbiased variable selection
under minimax concave penalty. Ann. Statist. 38 894–942.
MR2604701
[82] Z HANG , T. (2005). Learning bounds for kernel regression using effective data dimensionality. Neural Comput. 17 2077–
2098. MR2175849
[83] Z HANG , T. (2011). Adaptive forward–backward greedy algorithm for learning sparse representations. IEEE Trans. Inform.
Theory 57 4689–4708. MR2840485
[84] Z OU , H. (2006). The adaptive lasso and its oracle properties.
J. Amer. Statist. Assoc. 101 1418–1429. MR2279469
[85] Z OU , H. and H ASTIE , T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat.
Methodol. 67 301–320. MR2137327