

A Closer Look at Sparse Regression

Ryan Tibshirani
(amended by Larry Wasserman)

1 Introduction
In these notes we take a closer look at sparse linear regression. Throughout, we
make the very strong assumption that Y_i = β^T X_i + ε_i where E[ε_i | X_i] = 0 and
Var(ε_i | X_i) = σ². These assumptions are highly unrealistic but they permit a more de-
tailed analysis. There are several books on high-dimensional estimation: Hastie, Tib-
shirani & Wainwright (2015), Buhlmann & van de Geer (2011), Wainwright (2017).

2 Best subset selection, ridge regression, and the lasso


2.1 Three norms: ℓ0, ℓ1, ℓ2
In terms of regularization, we typically choose the constraint set C to be a sublevel set
of a norm (or seminorm), and equivalently, the penalty function P(·) to be a multiple
of a norm (or seminorm).
Let's consider three canonical choices: the ℓ0, ℓ1, and ℓ2 norms:
\[
\|\beta\|_0 = \sum_{j=1}^p 1\{\beta_j \neq 0\}, \qquad \|\beta\|_1 = \sum_{j=1}^p |\beta_j|, \qquad \|\beta\|_2 = \Big(\sum_{j=1}^p \beta_j^2\Big)^{1/2}.
\]
(Truthfully, calling it "the ℓ0 norm" is a misnomer, since it is not a norm: it does not
satisfy positive homogeneity, i.e., ‖aβ‖₀ ≠ a‖β‖₀ whenever a ≠ 0, 1.)
In constrained form, this gives rise to the problems:
\[
\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 \ \text{ subject to } \ \|\beta\|_0 \le k \qquad \text{(Best subset selection)} \tag{1}
\]
\[
\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 \ \text{ subject to } \ \|\beta\|_1 \le t \qquad \text{(Lasso regression)} \tag{2}
\]
\[
\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 \ \text{ subject to } \ \|\beta\|_2^2 \le t \qquad \text{(Ridge regression)} \tag{3}
\]
where k, t ≥ 0 are tuning parameters. Note that it makes sense to restrict k to be
an integer; in best subset selection, we are quite literally finding the best subset of
variables of size k, in terms of the achieved training error.
Though it is likely the case that these ideas were around earlier in other contexts, in
statistics we typically attribute subset selection to Beale et al. (1967), Hocking & Leslie (1967),
ridge regression to Hoerl & Kennard (1970), and the lasso to Tibshirani (1996), Chen
et al. (1998).

In penalized form, the use of ℓ0, ℓ1, ℓ2 norms gives rise to the problems:
\[
\min_{\beta \in \mathbb{R}^p} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_0 \qquad \text{(Best subset selection)} \tag{4}
\]
\[
\min_{\beta \in \mathbb{R}^p} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \qquad \text{(Lasso regression)} \tag{5}
\]
\[
\min_{\beta \in \mathbb{R}^p} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 \qquad \text{(Ridge regression)} \tag{6}
\]

with λ ≥ 0 the tuning parameter. In fact, problems (2), (5) are equivalent. By this,
we mean that for any t ≥ 0 and solution β̂ in (2), there is a value of λ ≥ 0 such
that β̂ also solves (5), and vice versa. The same equivalence holds for (3), (6). (The
factors of 1/2 multiplying the squared loss above are inconsequential, and just for
convenience.)
It means, roughly speaking, that computing solutions of (2) over a sequence of
t values and performing cross-validation (to select an estimate) should be basically
the same as computing solutions of (5) over some sequence of λ values and perform-
ing cross-validation (to select an estimate). Strictly speaking, this isn't quite true,
because the precise correspondence between equivalent t, λ depends on the data X, y.
Notably, problems (1), (4) are not equivalent. For every value of λ ≥ 0 and
solution β̂ in (4), there is a value of t ≥ 0 such that β̂ also solves (1), but the converse
is not true.

2.2 A Toy Example


It is helpful to first consider a toy example. Suppose that Y ∼ N(µ, 1). Let's consider
the three different estimators we get using the following three different loss functions:
\[
\tfrac{1}{2}(Y-\mu)^2 + \lambda\|\mu\|_0, \qquad \tfrac{1}{2}(Y-\mu)^2 + \lambda|\mu|, \qquad \tfrac{1}{2}(Y-\mu)^2 + \lambda\mu^2.
\]
You should verify that the solutions are
\[
\hat\mu = H(Y; \sqrt{2\lambda}), \qquad \hat\mu = S(Y; \lambda), \qquad \hat\mu = \frac{Y}{1+2\lambda},
\]
where H(y; a) = y·1(|y| > a) is the hard-thresholding operator, and
\[
S(y; a) = \begin{cases} y - a & \text{if } y > a \\ 0 & \text{if } -a \le y \le a \\ y + a & \text{if } y < -a. \end{cases}
\]

Hard thresholding creates a "zone of sparsity" but it is discontinuous. Soft thresh-
olding also creates a "zone of sparsity" but it is continuous. The L2 loss creates a
nice smooth estimator but it is never sparse. (You can verify the solution to the L1
problem using sub-differentials if you know convex analysis, or by doing three cases
separately: µ > 0, µ = 0, µ < 0.)
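To make this concrete, here is a minimal numerical sketch (Python/NumPy; the helper names hard_threshold and soft_threshold are ours, chosen for illustration) that simply evaluates the three closed-form estimators above on a grid of observed values Y.

import numpy as np

def hard_threshold(y, a):
    # H(y; a) = y * 1(|y| > a): keep y if it is large, otherwise set it to zero.
    return y * (np.abs(y) > a)

def soft_threshold(y, a):
    # S(y; a): shrink y toward zero by a, and set it to zero if |y| <= a.
    return np.sign(y) * np.maximum(np.abs(y) - a, 0.0)

lam = 1.0
Y = np.linspace(-3, 3, 13)

mu_subset = hard_threshold(Y, np.sqrt(2 * lam))  # l0-penalized solution
mu_lasso  = soft_threshold(Y, lam)               # l1-penalized solution
mu_ridge  = Y / (1 + 2 * lam)                    # l2-penalized solution

# The first two are exactly zero in a neighborhood of Y = 0 (the "zone of sparsity"),
# while the ridge estimator merely shrinks Y and is never exactly zero.
print(np.column_stack([Y, mu_subset, mu_lasso, mu_ridge]))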

2.3 Sparsity
The best subset selection and the lasso estimators have a special, useful property:
their solutions are sparse, i.e., at a solution we will have β̂_j = 0 for many components
j ∈ {1, . . . , p}. In problem (1), this is obviously true, where k ≥ 0 controls the sparsity
level. In problem (2), it is less obviously true, but we get a higher degree of sparsity
the smaller the value of t ≥ 0. In the penalized forms, (4), (5), we get more sparsity
the larger the value of λ ≥ 0.
This is not true of ridge regression, i.e., the solution of (3) or (6) generically has
all nonzero components, no matter the value of t or λ. Note that sparsity is desirable,
for two reasons: (i) it corresponds to performing variable selection in the constructed
linear model, and (ii) it provides a level of interpretability (beyond sheer accuracy).
Table 3.4 of Hastie et al. (2009) (reproduced in the original notes) summarizes the
estimators of β_j in the case of orthonormal columns of X, where M and λ are constants
chosen by the corresponding techniques, sign denotes the sign of its argument (±1), and
x_+ denotes the positive part of x:

  Estimator               Formula
  Best subset (size M)    β̂_j · 1(|β̂_j| ≥ |β̂_(M)|)
  Ridge                   β̂_j / (1 + λ)
  Lasso                   sign(β̂_j) (|β̂_j| − λ)_+

That the ℓ0 norm induces sparsity is obvious. But, why does the ℓ1 norm induce
sparsity and not the ℓ2 norm? There are different ways to look at it; let's stick
with intuition from the constrained problem forms (2), (5). Figure 1 shows the
"classic" picture, contrasting the way the contours of the squared error loss hit the
two constraint sets, the ℓ1 and ℓ2 balls. As the ℓ1 ball has sharp corners (aligned with
the coordinate axes), we get sparse solutions.

Figure 1: The "classic" illustration comparing lasso and ridge constraints, from
Chapter 3 of Hastie et al. (2009): estimation picture for the lasso (left) and ridge
regression (right). Shown are contours of the error and constraint functions. The solid
blue areas are the constraint regions |β1| + |β2| ≤ t and β1² + β2² ≤ t², respectively,
while the red ellipses are the contours of the least squares error function.
Intuition can also be drawn from the orthogonal case. When X is orthogonal, it
is not hard to show that the solutions of the penalized problems (4), (5), (6) are
\[
\hat\beta_{\mathrm{subset}} = H_{\sqrt{2\lambda}}(X^T y), \qquad \hat\beta_{\mathrm{lasso}} = S_{\lambda}(X^T y), \qquad \hat\beta_{\mathrm{ridge}} = \frac{X^T y}{1 + 2\lambda},
\]
respectively, where H_t(·), S_t(·) are the componentwise hard- and soft-thresholding
functions at the level t. We see several revealing properties: subset selection and
lasso solutions exhibit sparsity when the componentwise least squares coefficients
(inner products X^T y) are small enough; the lasso solution exhibits shrinkage, in
that large enough least squares coefficients are shrunken towards zero by λ; the ridge
regression solution is never sparse and, compared to the lasso, preferentially shrinks
the larger least squares coefficients even more.
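As a quick numerical check of these closed forms (a NumPy-only sketch), we can build an X with orthonormal columns via a QR decomposition and verify that the soft-thresholded coefficients satisfy the lasso optimality condition X^T(y − Xβ̂) = λs, with s a valid subgradient of the ℓ1 norm; this condition is stated formally in (9), (10) below.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # orthonormal columns: X^T X = I
y = rng.standard_normal(n)
lam = 0.5

z = X.T @ y                                                  # componentwise least squares coefficients
beta_lasso = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # S_lambda(X^T y)

# Optimality check: X^T (y - X beta) should equal lam * sign(beta_j) on the
# active components, and lie in [-lam, lam] on the inactive ones.
grad = X.T @ (y - X @ beta_lasso)
active = beta_lasso != 0
print(np.allclose(grad[active], lam * np.sign(beta_lasso[active])))  # True
print(bool(np.all(np.abs(grad[~active]) <= lam + 1e-12)))            # True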

2.4 Convexity
The lasso and ridge regression problems (2), (3) have another very important prop-
erty: they are convex optimization problems. Best subset selection (1) is not; in fact,
it is very far from being convex. Consider using the norm ‖β‖_p as a penalty. Sparsity
requires p ≤ 1 and convexity requires p ≥ 1. The only norm that gives both sparsity and
convexity is p = 1. The appendix has a brief review of convexity.

2.5 Theory For Subset Selection


Despite its computational intractability, best subset selection has some attractive risk
properties. A classic result is due to Foster & George (1994), on the in-sample risk of
best subset selection in penalized form (4), which we will paraphrase here. First, we
raise a very simple point: if A denotes the support (also called the active set) of the
subset selection solution β̂ in (4)—meaning that β̂_j = 0 for all j ∉ A, and denoted
A = supp(β̂)—then we have
\[
\hat\beta_A = (X_A^T X_A)^{-1} X_A^T y, \qquad \hat\beta_{-A} = 0. \tag{7}
\]
Here and throughout we write X_A for the columns of matrix X in a set A, and x_A for
the components of a vector x in A. We will also use X_{−A} and x_{−A} for the columns
or components not in A. The observation in (7) follows from the fact that, given the
support set A, the ℓ0 penalty term in the subset selection criterion doesn't depend
on the actual magnitudes of the coefficients (it contributes a constant, λ|A|), so the
problem reduces to least squares.
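Equation (7) suggests an (exponential-time) way to actually compute best subset selection for tiny problems: enumerate all supports of size k and do least squares on each. The sketch below (plain NumPy plus itertools; the helper name best_subset is ours) is only meant to illustrate (7) and why the problem is intractable at scale.

import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    # Solve (1) by brute force: least squares on every support of size k.
    n, p = X.shape
    best_rss, best_beta = np.inf, None
    for A in combinations(range(p), k):
        A = list(A)
        XA = X[:, A]
        bA, *_ = np.linalg.lstsq(XA, y, rcond=None)   # (X_A^T X_A)^{-1} X_A^T y, as in (7)
        rss = np.sum((y - XA @ bA) ** 2)
        if rss < best_rss:
            best_rss = rss
            best_beta = np.zeros(p)
            best_beta[A] = bA
    return best_beta

rng = np.random.default_rng(1)
n, p, k = 40, 10, 3
X = rng.standard_normal((n, p))
beta0 = np.zeros(p); beta0[:k] = [3.0, -2.0, 1.5]
y = X @ beta0 + 0.5 * rng.standard_normal(n)

print(np.nonzero(best_subset(X, y, k))[0])   # typically recovers {0, 1, 2}
# The number of candidate supports, C(p, k), explodes combinatorially in p.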
Now, consider the standard linear model from the introduction, with X fixed and ε ∼ N(0, σ²I). Sup-
pose that the underlying coefficients have support S = supp(β0), and s0 = |S|. Then,
the estimator given by least squares on S, i.e.,
\[
\hat\beta^{\,\mathrm{oracle}}_S = (X_S^T X_S)^{-1} X_S^T y, \qquad \hat\beta^{\,\mathrm{oracle}}_{-S} = 0,
\]
is called the oracle estimator, and as we know from our previous calculations, it has
in-sample risk
\[
\frac{1}{n}\,\mathbb{E}\|X\hat\beta^{\,\mathrm{oracle}} - X\beta_0\|_2^2 = \sigma^2\,\frac{s_0}{n}.
\]
Foster & George (1994) consider this setup, and compare the risk of the best
subset selection estimator β̂ in (4) to the oracle risk of σ²s0/n. They show that, if
we choose λ ≍ σ² log p, then the best subset selection estimator satisfies
\[
\frac{\mathbb{E}\|X\hat\beta - X\beta_0\|_2^2 / n}{\sigma^2 s_0 / n} \ \le\ 4\log p + 2 + o(1), \tag{8}
\]
as n, p → ∞. This holds without any conditions on the predictor matrix X. Moreover,
they prove the lower bound
\[
\inf_{\hat\beta}\,\sup_{X,\beta_0}\ \frac{\mathbb{E}\|X\hat\beta - X\beta_0\|_2^2 / n}{\sigma^2 s_0 / n} \ \ge\ 2\log p - o(\log p),
\]
where the infimum is over all estimators β̂, and the supremum is over all predictor
matrices X and underlying coefficients with ‖β0‖₀ = s0. Hence, in terms of rate, best
subset selection achieves the optimal risk inflation over the oracle risk.
Returning to what was said above, the kicker is that we can't really compute
the best subset selection estimator for even moderately-sized problems. As we will see
in the following, the lasso provides a similar risk inflation guarantee, though under
considerably stronger assumptions.
Lastly, it is worth remarking that even if we could compute the subset selection
estimator at scale, it's not at all clear that we would want to use it in place of the
lasso. (Many people assume that we would.) We must remind ourselves that theory
provides us an understanding of the performance of various estimators under typically
idealized conditions, and it doesn't tell the complete story. It could be the case that
the lack of shrinkage in the subset selection coefficients ends up being harmful in
practical situations, in a low signal-to-noise regime, and yet the lasso could still perform
favorably in such settings.
Update. Some nice recent work in optimization (Bertsimas et al. 2016) shows
that we can cast best subset selection as a mixed integer quadratic program, and
proposes to solve it (in general this means approximately, though with a certified
bound on the duality gap) with an industry-standard mixed integer optimization
package like Gurobi. However, in a recent paper, Hastie, Tibshirani and Tibshirani
(arXiv:1707.08692) show that best subset selection does not do well statistically unless
there is an extremely high signal to noise ratio.

3 Basic properties and geometry of the lasso


3.1 Ridge regression and the elastic net
A quick refresher: the ridge regression problem (6) is always strictly convex (assuming
λ > 0), due to the presence of the squared ℓ2 penalty ‖β‖₂². To be clear, this is true
regardless of X, and so the ridge regression solution is always well-defined, and is in
fact given in closed form by β̂ = (X^T X + 2λI)^{-1} X^T y.
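A quick numerical sanity check of this closed form (a NumPy sketch; the factor 2λ comes from the 1/2 in front of the squared loss in (6)): the gradient of the ridge criterion should vanish at β̂.

import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 100, 20, 3.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Closed-form ridge solution for (1/2)||y - X beta||_2^2 + lam * ||beta||_2^2.
beta_ridge = np.linalg.solve(X.T @ X + 2 * lam * np.eye(p), X.T @ y)

# Gradient of the criterion: -X^T (y - X beta) + 2 lam beta, which should be ~0.
grad = -X.T @ (y - X @ beta_ridge) + 2 * lam * beta_ridge
print(np.max(np.abs(grad)))   # numerically zero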

3.2 Lasso
Now we turn to subgradient optimality (sometimes called the KKT conditions) for
the lasso problem in (5). They tell us that any lasso solution β̂ must satisfy
\[
X^T(y - X\hat\beta) = \lambda s, \tag{9}
\]
where s ∈ ∂‖β̂‖₁, a subgradient of the ℓ1 norm evaluated at β̂. Precisely, this means
that
\[
s_j \in \begin{cases} \{+1\} & \hat\beta_j > 0 \\ \{-1\} & \hat\beta_j < 0 \\ [-1, 1] & \hat\beta_j = 0, \end{cases} \qquad j = 1, \ldots, p. \tag{10}
\]

From (9) we can read off a straightforward but important fact: even though the
solution β̂ may not be uniquely determined, the optimal subgradient s is a function
of the unique fitted value Xβ̂ (assuming λ > 0), and hence is itself unique.
Now from (10), note that the uniqueness of s implies that any two lasso solutions
must have the same signs on the overlap of their supports. That is, it cannot happen
that we find two different lasso solutions β̂ and β̃ with β̂_j > 0 but β̃_j < 0 for some
j, and hence we have no problem interpreting the signs of components of lasso
solutions.
Let's assume henceforth that the columns of X are in general position (and we
are looking at a nontrivial end of the path, with λ > 0), so the lasso solution β̂ is
unique. Let A = supp(β̂) be the lasso active set, and let s_A = sign(β̂_A) be the signs
of active coefficients. From the subgradient conditions (9), (10), we know that
\[
X_A^T(y - X_A\hat\beta_A) = \lambda s_A,
\]
and solving for β̂_A gives
\[
\hat\beta_A = (X_A^T X_A)^{-1}(X_A^T y - \lambda s_A), \qquad \hat\beta_{-A} = 0 \tag{11}
\]
(where recall we know that X_A^T X_A is invertible because X has columns in general
position). We see that the active coefficients β̂_A are given by taking the least squares
coefficients on X_A, (X_A^T X_A)^{-1} X_A^T y, and shrinking them by an amount λ(X_A^T X_A)^{-1} s_A.
Contrast this to, e.g., the subset selection solution in (7), where there is no such
shrinkage.
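Here is a short numerical check of (11), a sketch assuming scikit-learn is available; note that sklearn's Lasso minimizes (1/(2n))‖y − Xβ‖₂² + α‖β‖₁, so its α corresponds to λ/n in the notation of (5).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 1.0]) + rng.standard_normal(n)

lam = 20.0                                   # lambda in the notation of (5)
fit = Lasso(alpha=lam / n, fit_intercept=False, tol=1e-12, max_iter=100000).fit(X, y)
beta_hat = fit.coef_

A = np.flatnonzero(beta_hat)                 # active set
sA = np.sign(beta_hat[A])                    # active signs
XA = X[:, A]

# Formula (11): least squares on X_A, shrunken by lam * (X_A^T X_A)^{-1} s_A.
beta_formula = np.linalg.solve(XA.T @ XA, XA.T @ y - lam * sA)
print(np.max(np.abs(beta_hat[A] - beta_formula)))   # ~0, up to solver tolerance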
Now, how about this so-called shrinkage term λ(X_A^T X_A)^{-1}s_A? Does it always
act by moving each one of the least squares coefficients (X_A^T X_A)^{-1}X_A^T y towards zero?
Indeed, this is not always the case, and one can find empirical examples where a
lasso coefficient is actually larger (in magnitude) than the corresponding least squares
coefficient on the active set. Of course, we also know that this is due to the correlations

between active variables, because when X is orthogonal, as we’ve already seen, this
never happens.
On the other hand, it is always the case that the lasso solution has a strictly
smaller ℓ1 norm than the least squares solution on the active set, and in this sense,
we are (perhaps) justified in referring to λ(X_A^T X_A)^{-1}s_A as a shrinkage term.
To see this, note that, for any vector b, ‖b‖₁ = s^T b where s is the vector of signs of
b. So ‖β̂‖₁ = s^T β̂ = s_A^T β̂_A, and so
\[
\|\hat\beta\|_1 = s_A^T (X_A^T X_A)^{-1} X_A^T y - \lambda s_A^T (X_A^T X_A)^{-1} s_A < \|(X_A^T X_A)^{-1} X_A^T y\|_1. \tag{12}
\]
The first term is less than or equal to ‖(X_A^T X_A)^{-1} X_A^T y‖₁, and the term we are sub-
tracting, λ s_A^T (X_A^T X_A)^{-1} s_A, is strictly positive (because (X_A^T X_A)^{-1} is positive definite).

4 Theoretical analysis of the lasso


4.1 Slow rates
There has been an enormous amount of theoretical work analyzing the performance of
the lasso. Some references (warning: a highly incomplete list) are Greenshtein &
Ritov (2004), Fuchs (2005), Donoho (2006), Candes & Tao (2006), Meinshausen &
Buhlmann (2006), Zhao & Yu (2006), Candes & Plan (2009), Wainwright (2009); a
helpful text for these kinds of results is Buhlmann & van de Geer (2011).
We begin by stating what are called slow rates for the lasso estimator. Most of
the proofs are simple enough that they are given below. These results don't place
any real assumptions on the predictor matrix X, but deliver slow(er) rates for the
risk of the lasso estimator than what we would get under more assumptions, hence
their name.
We will assume the standard linear model with X fixed, and ε ∼ N(0, σ²I). We
will also assume that ‖X_j‖₂² ≤ n, for j = 1, . . . , p. That the errors are Gaussian can
be easily relaxed to sub-Gaussianity.
The lasso estimator in bound form (2) is particularly easy to analyze. Suppose that
we choose t = ‖β0‖₁ as the tuning parameter. Then, simply by virtue of optimality
of the solution β̂ in (2), we find that
\[
\|y - X\hat\beta\|_2^2 \le \|y - X\beta_0\|_2^2,
\]
or, expanding and rearranging,
\[
\|X\hat\beta - X\beta_0\|_2^2 \le 2\langle \epsilon,\ X\hat\beta - X\beta_0 \rangle.
\]
Here we denote ⟨a, b⟩ = a^T b. The above is sometimes called the basic inequality
(for the lasso in bound form). Now, rearranging the inner product, using Holder's
inequality, and recalling the choice of bound parameter:
\[
\|X\hat\beta - X\beta_0\|_2^2 \le 2\langle X^T\epsilon,\ \hat\beta - \beta_0 \rangle \le 4\|\beta_0\|_1 \|X^T\epsilon\|_\infty.
\]
Notice that ‖X^Tε‖_∞ = max_{j=1,...,p} |X_j^T ε| is a maximum of p Gaussians, each with
mean zero and variance upper bounded by σ²n. By a standard maximal inequality
for Gaussians, for any δ > 0,
\[
\max_{j=1,\ldots,p} |X_j^T \epsilon| \ \le\ \sigma\sqrt{2n\log(ep/\delta)},
\]
with probability at least 1 − δ. Plugging this into the second-to-last display and dividing
by n, we get the finite-sample result for the lasso estimator
\[
\frac{1}{n}\|X\hat\beta - X\beta_0\|_2^2 \ \le\ 4\sigma\|\beta_0\|_1\sqrt{\frac{2\log(ep/\delta)}{n}}, \tag{13}
\]
with probability at least 1 − δ.
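The Gaussian maximal inequality used here is easy to check by simulation (a small NumPy sketch; the column scaling ‖X_j‖₂² ≤ n and the failure level δ are the assumptions stated above).

import numpy as np

rng = np.random.default_rng(4)
n, p, sigma, delta = 200, 500, 1.0, 0.05
X = rng.standard_normal((n, p))
X = X / np.sqrt((X ** 2).sum(axis=0)) * np.sqrt(n)   # enforce ||X_j||_2^2 = n

bound = sigma * np.sqrt(2 * n * np.log(np.e * p / delta))
reps = 1000
hits = 0
for _ in range(reps):
    eps = sigma * rng.standard_normal(n)
    if np.max(np.abs(X.T @ eps)) <= bound:
        hits += 1

# The bound should hold in at least a (1 - delta) fraction of repetitions.
print(hits / reps)   # typically well above 0.95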
The high-probability result (13) implies an in-sample risk bound of
\[
\frac{1}{n}\,\mathbb{E}\|X\hat\beta - X\beta_0\|_2^2 \ \lesssim\ \|\beta_0\|_1 \sqrt{\frac{\log p}{n}}.
\]
Compare this with the risk bound (8) for best subset selection, which is on the
(optimal) order of s0 log p/n when β0 has s0 nonzero components. If each of the
nonzero components here has constant magnitude, then the above risk bound for the
lasso estimator is on the order of s0√(log p/n), which is much slower.
Predictive risk. Instead of in-sample risk, we might also be interested in out-
of-sample risk, as after all that reflects actual (out-of-sample) predictions. In least
squares, recall, we saw that out-of-sample risk was generally higher than in-sample
risk. The same is true for the lasso. Chatterjee (2013) gives a nice, simple analysis of
out-of-sample risk for the lasso. He assumes that x0, xi, i = 1, . . . , n are i.i.d. from
an arbitrary distribution supported on a compact set in R^p, and shows that the lasso
estimator in bound form (2) with t = ‖β0‖₁ has out-of-sample risk satisfying
\[
\mathbb{E}(x_0^T\hat\beta - x_0^T\beta_0)^2 \ \lesssim\ \|\beta_0\|_1^2 \sqrt{\frac{\log p}{n}}.
\]
The proof is not much more complicated than the above, for the in-sample risk, and
reduces to a clever application of Hoeffding's inequality, though we omit it for brevity.
Note here the dependence on ‖β0‖₁², rather than ‖β0‖₁ as in the in-sample risk. This
agrees with the analysis we did in the previous set of notes where we did not assume
the linear model. (Only the interpretation changes.)

Oracle inequality. If we don't want to assume linearity of the mean then we
can still derive an oracle inequality that characterizes the risk of the lasso estimator
in excess of the risk of the best linear predictor. For this part only, assume the more
general model
\[
y = \mu(X) + \epsilon,
\]

with an arbitrary mean function µ(X), and normal errors ε ∼ N(0, σ²I). We will
analyze the bound form lasso estimator (2) for simplicity. By optimality of β̂, for any
other β̃ feasible for the lasso problem in (2), it holds that
\[
\langle X^T(y - X\hat\beta),\ \tilde\beta - \hat\beta \rangle \ \le\ 0. \tag{14}
\]
(To see this, consider minimizing a convex function f(x) over a convex set C, and let
x̂ be a minimizer. If we move from x̂ toward any other point z ∈ C we can only increase
f; in other words, ⟨∇f(x̂), z − x̂⟩ ≥ 0. Applied to the lasso criterion, whose gradient
at β̂ is −X^T(y − Xβ̂), this gives (14).) Rearranging gives
\[
\langle \mu(X) - X\hat\beta,\ X\tilde\beta - X\hat\beta \rangle \ \le\ \langle X^T\epsilon,\ \hat\beta - \tilde\beta \rangle. \tag{15}
\]

Now using the polarization identity ‖a‖₂² + ‖b‖₂² − ‖a − b‖₂² = 2⟨a, b⟩,
\[
\|X\hat\beta - \mu(X)\|_2^2 + \|X\hat\beta - X\tilde\beta\|_2^2 \ \le\ \|X\tilde\beta - \mu(X)\|_2^2 + 2\langle X^T\epsilon,\ \hat\beta - \tilde\beta\rangle,
\]
and from the exact same arguments as before, it holds that
\[
\frac{1}{n}\|X\hat\beta - \mu(X)\|_2^2 + \frac{1}{n}\|X\hat\beta - X\tilde\beta\|_2^2 \ \le\ \frac{1}{n}\|X\tilde\beta - \mu(X)\|_2^2 + 4\sigma t\sqrt{\frac{2\log(ep/\delta)}{n}},
\]
with probability at least 1 − δ. This holds simultaneously over all β̃ with ‖β̃‖₁ ≤ t.
Thus, we may write, with probability 1 − δ,
\[
\frac{1}{n}\|X\hat\beta - \mu(X)\|_2^2 \ \le\ \inf_{\|\tilde\beta\|_1 \le t}\ \frac{1}{n}\|X\tilde\beta - \mu(X)\|_2^2 \ +\ 4\sigma t\sqrt{\frac{2\log(ep/\delta)}{n}}.
\]
Also, if we write Xβ̃_best for the best linear predictor with ℓ1 norm at most t, achieving
the infimum on the right-hand side (which we know exists, as we are minimizing a
continuous function over a compact set), then
\[
\frac{1}{n}\|X\hat\beta - X\tilde\beta_{\mathrm{best}}\|_2^2 \ \le\ 4\sigma t\sqrt{\frac{2\log(ep/\delta)}{n}},
\]
with probability at least 1 − δ.

4.2 Fast rates


Under very strong assumptions we can get faster rates. For example, if we assume
that X satisfies the restricted eigenvalue condition with constant φ0 > 0, i.e.,
\[
\frac{1}{n}\|Xv\|_2^2 \ \ge\ \phi_0^2\|v\|_2^2 \quad \text{for all subsets } J \subseteq \{1,\ldots,p\} \text{ with } |J| = s_0 \text{ and all } v \in \mathbb{R}^p \text{ such that } \|v_{J^c}\|_1 \le 3\|v_J\|_1, \tag{16}
\]

then
\[
\|\hat\beta - \beta_0\|_2^2 \ \lesssim\ \frac{s_0\log p}{n\phi_0^2}, \tag{17}
\]
with probability tending to 1. (This condition can be slightly weakened, but not
much.) The condition is unlikely to hold in any real problem. Nor is it checkable.
The proof is in the appendix.

4.3 Support recovery


Here we discuss results on support recovery of the lasso estimator. There are a few
versions of support recovery results and again Buhlmann & van de Geer (2011) is
a good place to look for a thorough coverage. Here we describe a result due to
Wainwright (2009), who introduced a proof technique called the primal-dual witness
method. The assumptions are even stronger (and less believable) than in the previous
section. In addition to the previous assumptions we need:

Mutual incoherence: for some γ > 0, we have
\[
\|(X_S^T X_S)^{-1} X_S^T X_j\|_1 \ \le\ 1 - \gamma, \quad \text{for } j \notin S.
\]

Minimum eigenvalue: for some C > 0, we have
\[
\Lambda_{\min}\Big(\frac{1}{n} X_S^T X_S\Big) \ \ge\ C,
\]
where Λ_min(A) denotes the minimum eigenvalue of a matrix A.

Minimum signal:
\[
\beta_{0,\min} = \min_{j \in S} |\beta_{0,j}| \ \ge\ \lambda\|(X_S^T X_S)^{-1}\|_\infty + \frac{4\gamma\lambda}{\sqrt{C}},
\]
where ‖A‖_∞ = max_{i=1,...,m} Σ_{j=1}^q |A_ij| denotes the ℓ∞ norm of an m × q matrix A.

Under these assumptions, one can show that, if λ is chosen just right, then
\[
P\big(\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta_0)\big) \to 1. \tag{18}
\]
The proof is in the appendix.

References
Beale, E. M. L., Kendall, M. G. & Mann, D. W. (1967), ‘The discarding of variables
in multivariate analysis’, Biometrika 54(3/4), 357–366.

Bertsimas, D., King, A. & Mazumder, R. (2016), ‘Best subset selection via a modern
optimization lens’, The Annals of Statistics 44(2), 813–852.

Buhlmann, P. & van de Geer, S. (2011), Statistics for High-Dimensional Data,
Springer.

Candes, E. J. & Plan, Y. (2009), ‘Near ideal model selection by ℓ1 minimization’,
Annals of Statistics 37(5), 2145–2177.

Candes, E. J. & Tao, T. (2006), ‘Near optimal signal recovery from random projec-
tions: Universal encoding strategies?’, IEEE Transactions on Information Theory
52(12), 5406–5425.

Chatterjee, S. (2013), Assumptionless consistency of the lasso. arXiv: 1303.5817.

Chen, S., Donoho, D. L. & Saunders, M. (1998), ‘Atomic decomposition for basis
pursuit’, SIAM Journal on Scientific Computing 20(1), 33–61.

Donoho, D. L. (2006), ‘Compressed sensing’, IEEE Transactions on Information The-


ory 52(12), 1289–1306.

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), ‘Least angle regression’,
Annals of Statistics 32(2), 407–499.

Foster, D. & George, E. (1994), ‘The risk inflation criterion for multiple regression’,
The Annals of Statistics 22(4), 1947–1975.

Fuchs, J. J. (2005), ‘Recovery of exact sparse representations in the presence of
bounded noise’, IEEE Transactions on Information Theory 51(10), 3601–3608.

Greenshtein, E. & Ritov, Y. (2004), ‘Persistence in high-dimensional linear predictor


selection and the virtue of overparametrization’, Bernoulli 10(6), 971–988.

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning;
Data Mining, Inference and Prediction, Springer. Second edition.

Hastie, T., Tibshirani, R. & Wainwright, M. (2015), Statistical Learning with Sparsity:
the Lasso and Generalizations, Chapman & Hall.

Hocking, R. R. & Leslie, R. N. (1967), ‘Selection of the best subset in regression


analysis’, Technometrics 9(4), 531–540.

Hoerl, A. & Kennard, R. (1970), ‘Ridge regression: biased estimation for nonorthog-
onal problems’, Technometrics 12(1), 55–67.

Meinshausen, N. & Buhlmann, P. (2006), ‘High-dimensional graphs and variable se-


lection with the lasso’, The Annals of Statistics 34(3), 1436–1462.

Osborne, M., Presnell, B. & Turlach, B. (2000a), ‘A new approach to variable selection
in least squares problems’, IMA Journal of Numerical Analysis 20(3), 389–404.

Osborne, M., Presnell, B. & Turlach, B. (2000b), ‘On the lasso and its dual’, Journal
of Computational and Graphical Statistics 9(2), 319–337.

Raskutti, G., Wainwright, M. J. & Yu, B. (2011), ‘Minimax rates of estimation for
high-dimensional linear regression over ℓq-balls’, IEEE Transactions on Information
Theory 57(10), 6976–6994.

Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of
the Royal Statistical Society: Series B 58(1), 267–288.

van de Geer, S. & Buhlmann, P. (2009), ‘On the conditions used to prove oracle
results for the lasso’, Electronic Journal of Statistics 3, 1360–1392.

Wainwright, M. (2017), High-Dimensional Statistics: A Non-Asymptotic View, Cam-


bridge University Press. To appear.

Wainwright, M. J. (2009), ‘Sharp thresholds for high-dimensional and noisy sparsity
recovery using ℓ1-constrained quadratic programming (lasso)’, IEEE Transactions
on Information Theory 55(5), 2183–2202.

Zhao, P. & Yu, B. (2006), ‘On model selection consistency of lasso’, Journal of Ma-
chine Learning Research 7, 2541–2564.

5 Appendix: Convexity
It is convexity that allows us to equate (2), (5), and (3), (6) (and yes, the penalized forms
are convex problems too). It is also convexity that allows us to both efficiently solve,
and in some sense, precisely understand the nature of the lasso and ridge regression
solutions.
Here is a (far too quick) refresher/introduction to basic convex analysis and convex
optimization. Recall that a set C ⊆ R^n is called convex if for any x, y ∈ C and
t ∈ [0, 1], we have
\[
tx + (1-t)y \in C,
\]
i.e., the line segment joining x, y lies entirely in C. A function f : R^n → R is called
convex if its domain dom(f) is convex, and for any x, y ∈ dom(f) and t ∈ [0, 1],
\[
f\big(tx + (1-t)y\big) \le tf(x) + (1-t)f(y),
\]
i.e., the function lies below the line segment joining its evaluations at x and y. A
function is called strictly convex if this same inequality holds strictly for x ≠ y and
t ∈ (0, 1).
E.g., lines, rays, line segments, linear spaces, affine spaces, hyperplanes, halfspaces,
polyhedra, and norm balls are all convex sets.

E.g., affine functions a^T x + b are convex and concave, quadratic functions x^T Qx +
b^T x + c are convex if Q ⪰ 0 and strictly convex if Q ≻ 0, and norms are convex.
Formally, an optimization problem is of the form
\[
\min_{x \in D}\ f(x) \quad \text{subject to } h_i(x) \le 0,\ i = 1,\ldots,m, \qquad \ell_j(x) = 0,\ j = 1,\ldots,r.
\]
Here D = dom(f) ∩ ⋂_{i=1}^m dom(h_i) ∩ ⋂_{j=1}^r dom(ℓ_j) is the common domain of all func-
tions. A convex optimization problem is an optimization problem in which all functions
f, h1, . . . , hm are convex, and all functions ℓ1, . . . , ℓr are affine. (Think: why affine?)
Hence, we can express it as
\[
\min_{x \in D}\ f(x) \quad \text{subject to } h_i(x) \le 0,\ i = 1,\ldots,m, \qquad Ax = b.
\]

Why is a convex optimization problem so special? The short answer: because any
local minimizer is a global minimizer. To see this, suppose that x is feasible for the
convex problem formulation above and there exists some R > 0 such that
\[
f(x) \le f(y) \quad \text{for all feasible } y \text{ with } \|x - y\|_2 \le R.
\]
Such a point x is called a local minimizer. For the sake of contradiction, suppose that
x was not a global minimizer, i.e., there exists some feasible z such that f(z) < f(x).
By convexity of the constraints (and the domain D), the point tz + (1 − t)x is feasible
for any 0 ≤ t ≤ 1. Furthermore, by convexity of f,
\[
f\big(tz + (1-t)x\big) \le tf(z) + (1-t)f(x) < f(x)
\]
for any 0 < t < 1. Lastly, we can choose t > 0 small enough so that ‖x − (tz + (1 −
t)x)‖₂ = t‖x − z‖₂ ≤ R, and we obtain a contradiction.
Algorithmically, this is a very useful property, because it means if we keep "going
downhill", i.e., reducing the achieved criterion value, and we stop when we can't do
so anymore, then we've hit the global solution.
Convex optimization problems are also special because they come with a beautiful
theory of convex duality and optimality, which gives us a way of understand-
ing the solutions. We won't have time to cover any of this, but we'll mention what
subgradient optimality looks like for the lasso.
Just based on the definitions, it is not hard to see that (2), (3), (5), (6) are convex
problems, but (1), (4) are not. In fact, the latter two problems are known to be
NP-hard, so they are in a sense even the worst kind of nonconvex problem.
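As a concrete illustration that the lasso (5) is a tractable convex program, here is a minimal sketch using the cvxpy modeling library (an assumption: cvxpy and a default solver are installed); it simply declares the penalized criterion and hands it to a generic convex solver.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 100, 20, 50.0
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.standard_normal(n)

beta = cp.Variable(p)
# The penalized lasso criterion (5): (1/2)||y - X beta||_2^2 + lam * ||beta||_1.
objective = cp.Minimize(0.5 * cp.sum_squares(y - X @ beta) + lam * cp.norm1(beta))
prob = cp.Problem(objective)
prob.solve()

print(np.round(beta.value, 3))   # most coordinates are (numerically) zero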

6 Appendix: Geometry of the solutions
One undesirable feature of the best subset selection solution (7) is the fact that
it behaves discontinuously with y. As we change y, the active set A must change
at some point, and the coefficients will jump discontinuously, because we are just
doing least squares onto the active set. So, does the same thing happen with the
lasso solution (11)? The answer is not immediately clear. Again, as we change y,
the active set A must change at some point; but if the shrinkage term were defined
"just right", then perhaps the coefficients of variables to leave the active set would
gracefully and continuously drop to zero, and coefficients of variables to enter the
active set would continuously move from zero. This would make the whole lasso
solution continuous. Fortuitously, this is indeed the case, and the lasso solution
β̂ is continuous as a function of y. It might seem a daunting task to prove this,
but a certain perspective using convex geometry provides a very simple proof. The
geometric perspective in fact proves that the lasso fit Xβ̂ is nonexpansive in y, i.e.,
1-Lipschitz continuous, which is a very strong form of continuity. Define the convex
polyhedron C = {u : ‖X^T u‖_∞ ≤ λ} ⊆ R^n. Some simple manipulations of the KKT
conditions show that the lasso fit is given by
\[
X\hat\beta = (I - P_C)(y),
\]
the residual from projecting y onto C. A picture to show this (just look at the left
panel for now) is given in Figure 2.
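This projection identity is easy to verify numerically (a sketch, again leaning on cvxpy as an assumed dependency): solve the lasso (5) and, separately, project y onto C = {u : ‖X^T u‖_∞ ≤ λ}; the lasso fit should match the projection residual.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 40, 10, 5.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Lasso fit X beta_hat from the penalized problem (5).
beta = cp.Variable(p)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - X @ beta) + lam * cp.norm1(beta))).solve()
lasso_fit = X @ beta.value

# Projection of y onto the polyhedron C = {u : ||X^T u||_inf <= lam}.
u = cp.Variable(n)
cp.Problem(cp.Minimize(cp.sum_squares(u - y)), [cp.norm(X.T @ u, "inf") <= lam]).solve()

# The lasso fit should equal the projection residual y - P_C(y).
print(np.max(np.abs(lasso_fit - (y - u.value))))   # ~0, up to solver accuracy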
The projection onto any convex set is nonexpansive, i.e., ‖P_C(y) − P_C(y′)‖₂ ≤
‖y − y′‖₂ for any y, y′. This should be visually clear from the picture. Actually, the
same is true with the residual map: I − P_C is also nonexpansive, and hence the lasso
fit is 1-Lipschitz continuous. Viewing the lasso fit as the residual from projection
onto a convex polyhedron is actually an even more fruitful perspective. Write this
polyhedron as
\[
C = (X^T)^{-1}\{v : \|v\|_\infty \le \lambda\},
\]
where (X^T)^{-1} denotes the preimage operator under the linear map X^T. The set
{v : ‖v‖_∞ ≤ λ} is a hypercube in R^p. Every face of this cube corresponds to a subset
A ⊆ {1, . . . , p} of dimensions (that achieve the maximum value λ in absolute value) and
signs s_A ∈ {−1, 1}^{|A|} (that tell which side of the cube the face will lie on, for each dimension).
Now, the faces of C are just faces of {v : ‖v‖_∞ ≤ λ} run through the (linear) preimage
transformation, so each face of C can also be indexed by a set A ⊆ {1, . . . , p} and signs
s_A ∈ {−1, 1}^{|A|}. The picture in Figure 2 attempts to convey this relationship with
the colored black face in each of the panels.
Now imagine projecting y onto C; it will land on some face. We have just argued
that this face corresponds to a set A and signs s_A. One can show that this set A is
exactly the active set of the lasso solution at y, and s_A are exactly the active signs.
The size of the active set |A| is the co-dimension of the face. Looking at the picture:
we can see that as we wiggle y around, it will project to the same face.
Figure 2: A geometric picture of the lasso solution. The left panel shows the polyhe-
dron C = {u : ‖X^T u‖_∞ ≤ λ} ⊆ R^n underlying all lasso fits, where each face corresponds
to a particular combination of active set A and signs s_A; the right panel displays the
"inverse" polyhedron {v : ‖v‖_∞ ≤ λ} ⊆ R^p, where the dual solutions live.

From the correspondence between faces and active sets and signs of lasso solutions, this means
that A, s_A do not change as we perturb y, i.e., they are locally constant. But this isn't
true for all points y, e.g., if y lies on one of the rays emanating from the lower right
corner of the polyhedron in the picture, then we can see that small perturbations of
y do actually change the face that it projects to, which invariably changes the active
set and signs of the lasso solution. However, this is somewhat of an exceptional case,
in that such points form a set of Lebesgue measure zero, and therefore we can
assure ourselves that the active set and signs A, s_A are locally constant for almost
every y.
From the lasso KKT conditions (9), (10), it is possible to compute the lasso
solution in (5) as a function of λ, which we will write as β̂(λ), for all values of the
tuning parameter λ ∈ [0, ∞]. This is called the regularization path or solution path of
the problem (5). Path algorithms like the one we will describe below are not always
possible; the reason that this ends up being feasible for the lasso problem (5) is that
the solution path β̂(λ), λ ∈ [0, ∞] turns out to be a piecewise linear, continuous
function of λ. Hence, we only need to compute and store the knots in this path,
which we will denote by λ1 ≥ λ2 ≥ · · · ≥ λr ≥ 0, and the lasso solution at these
knots. From this information, we can then compute the lasso solution at any value
of λ by linear interpolation.
The knots λ1 ≥ · · · ≥ λr in the solution path correspond to λ values at which
the active set A(λ) = supp(β̂(λ)) changes. As we decrease λ from ∞ to 0, the knots
typically correspond to the point at which a variable enters the active set; this con-
nects the lasso to an incremental variable selection procedure like forward stepwise
regression. Interestingly though, as we decrease λ, a knot in the lasso path can also
correspond to the point at which a variable leaves the active set. See Figure 3.
Figure 3: An example of the lasso path, plotting coefficients against λ. Each colored line
denotes a component of the lasso solution β̂_j(λ), j = 1, . . . , p as a function of λ. The
gray dotted vertical lines mark the knots λ1 ≥ λ2 ≥ · · ·.

The lasso solution path was described by Osborne et al. (2000a,b) and Efron et al.
(2004). Like the construction of all other solution paths that followed these seminal
works, the lasso path is essentially given by an iterative or inductive verification of the
KKT conditions; if we can maintain that the KKT conditions hold as we decrease
λ, then we know we have a solution. The trick is to start at a value of λ at which the
solution is trivial; for the lasso, this is λ = ∞, at which point we know the solution
must be β̂(∞) = 0.
Why would the path be piecewise linear? The construction of the path from the

KKT conditions is actually rather technical (not difficult conceptually, but somewhat
tedious), and doesn’t shed insight onto this matter. But we can actually see it clearly
from the projection picture in Figure 2.
As λ decreases from ∞ to 0, we are shrinking (by a multiplicative factor λ) the
polyhedron onto which y is projected; let's write C_λ = {u : ‖X^T u‖_∞ ≤ λ} = λC_1 to
make this clear. Now suppose that y projects onto the relative interior of a certain
face F of C_λ, corresponding to an active set A and signs s_A. As λ decreases, the
point on the boundary of C_λ onto which y projects, call it û(λ) = P_{C_λ}(y), will move
along the face F, and change linearly in λ (because we are equivalently just tracking
the projection of y onto an affine space that is being scaled by λ). Thus, the lasso
fit Xβ̂(λ) = y − û(λ) will also behave linearly in λ. Eventually, as we continue to
decrease λ, the projected point û(λ) will move to the relative boundary of the face F;
then, decreasing λ further, it will lie on a different, neighboring face F′. This face will
correspond to an active set A′ and signs s_{A′} that (each) differ by only one element from
A and s_A, respectively. It will then move linearly across F′, and so on.
Now we will walk through the technical derivation of the lasso path, starting
at λ = ∞ and β̂(∞) = 0, as indicated above. Consider decreasing λ from ∞, and
continuing to set β̂(λ) = 0 as the lasso solution. The KKT conditions (9) read
\[
X^T y = \lambda s,
\]
where s is a subgradient of the ℓ1 norm evaluated at 0, i.e., s_j ∈ [−1, 1] for every j =
1, . . . , p. For large enough values of λ, this is satisfied, as we can choose s = X^T y/λ.
But this ceases to be a valid subgradient if we decrease λ past the point at which
λ = |X_j^T y| for some variable j = 1, . . . , p. In short, β̂(λ) = 0 is the lasso solution for
all λ ≥ λ1, where
\[
\lambda_1 = \max_{j=1,\ldots,p} |X_j^T y|. \tag{19}
\]

What happens next? As we decrease λ from λ1, we know that we're going to have to
change β̂(λ) from 0 so that the KKT conditions remain satisfied. Let j1 denote the
variable that achieves the maximum in (19). Since the subgradient was |s_{j1}| = 1 at
λ = λ1, we see that we are "allowed" to make β̂_{j1}(λ) nonzero. Consider setting
\[
\hat\beta_{j_1}(\lambda) = (X_{j_1}^T X_{j_1})^{-1}(X_{j_1}^T y - \lambda s_{j_1}), \qquad \hat\beta_j(\lambda) = 0 \ \text{ for all } j \neq j_1, \tag{20}
\]
as λ decreases from λ1, where s_{j1} = sign(X_{j1}^T y). Note that this makes β̂(λ) a piecewise
linear and continuous function of λ, so far. The KKT conditions are then
\[
X_{j_1}^T\big(y - X_{j_1}(X_{j_1}^T X_{j_1})^{-1}(X_{j_1}^T y - \lambda s_{j_1})\big) = \lambda s_{j_1},
\]
which can be checked with simple algebra, and
\[
\big|X_j^T\big(y - X_{j_1}(X_{j_1}^T X_{j_1})^{-1}(X_{j_1}^T y - \lambda s_{j_1})\big)\big| \ \le\ \lambda,
\]
for all j ≠ j1. Recall that the above held with strict inequality at λ = λ1 for all j ≠ j1,
and by continuity of the constructed solution β̂(λ), it should continue to hold as we
decrease λ for at least a little while. In fact, it will hold until one of the piecewise
linear paths
\[
X_j^T\big(y - X_{j_1}(X_{j_1}^T X_{j_1})^{-1}(X_{j_1}^T y - \lambda s_{j_1})\big), \qquad j \neq j_1,
\]
becomes equal to ±λ, at which point we have to modify the solution because otherwise
the implicit subgradient
\[
s_j = \frac{X_j^T\big(y - X_{j_1}(X_{j_1}^T X_{j_1})^{-1}(X_{j_1}^T y - \lambda s_{j_1})\big)}{\lambda}
\]
will cease to be in [−1, 1]. It helps to draw yourself a picture of this.
Thanks to linearity, we can compute the critical "hitting time" explicitly; a short
calculation shows that the lasso solution continues to be given by (20) for all λ1 ≥
λ ≥ λ2, where
\[
\lambda_2 = \max^{+}_{j \neq j_1,\ s_j \in \{-1,1\}} \ \frac{X_j^T\big(I - X_{j_1}(X_{j_1}^T X_{j_1})^{-1}X_{j_1}^T\big)y}{s_j - X_j^T X_{j_1}(X_{j_1}^T X_{j_1})^{-1}s_{j_1}}, \tag{21}
\]
and max⁺ denotes the maximum over all of its arguments that are < λ1.
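The first two knots and the first path segment are easy to compute directly (a NumPy sketch of (19)–(21); it then verifies that the one-variable solution (20) satisfies the KKT conditions (9), (10) at a λ between λ2 and λ1).

import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 8
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# First knot (19): lambda_1 = max_j |X_j^T y|.
inner = X.T @ y
j1 = int(np.argmax(np.abs(inner)))
lam1 = np.abs(inner[j1])
s1 = np.sign(inner[j1])

Xj1 = X[:, j1]
g = Xj1 @ Xj1                       # X_{j1}^T X_{j1} (a scalar here)
resid0 = y - Xj1 * (Xj1 @ y) / g    # (I - X_{j1}(X_{j1}^T X_{j1})^{-1} X_{j1}^T) y

# Second knot via the hitting-time formula (21).
cands = []
for j in range(p):
    if j == j1:
        continue
    num = X[:, j] @ resid0
    for s in (-1.0, 1.0):
        val = num / (s - (X[:, j] @ Xj1) / g * s1)
        if 0 <= val < lam1:
            cands.append(val)
lam2 = max(cands) if cands else 0.0

# For lambda_2 < lambda < lambda_1, only j1 is active and (20) gives the solution.
lam = 0.5 * (lam1 + lam2)
beta = np.zeros(p)
beta[j1] = (Xj1 @ y - lam * s1) / g
r = y - X @ beta

print(np.isclose(Xj1 @ r, lam * s1))                                  # equality on j1
print(bool(np.all(np.abs(np.delete(X.T @ r, j1)) <= lam + 1e-10)))    # inequality elsewhere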
To keep going: let j2, s_{j2} achieve the maximum in (21). Let A = {j1, j2}, s_A =
(s_{j1}, s_{j2}), and consider setting
\[
\hat\beta_A(\lambda) = (X_A^T X_A)^{-1}(X_A^T y - \lambda s_A), \qquad \hat\beta_{-A}(\lambda) = 0, \tag{22}
\]
as λ decreases from λ2. Again, we can verify the KKT conditions for a stretch of
decreasing λ, but will have to stop when one of
\[
X_j^T\big(y - X_A(X_A^T X_A)^{-1}(X_A^T y - \lambda s_A)\big), \qquad j \notin A,
\]
becomes equal to ±λ. By linearity, we can compute this next "hitting time" explic-
itly, just as before. Furthermore, though, we will have to check whether the active
components of the computed solution in (22) are going to cross through zero, because
past such a point, s_A will no longer be a proper subgradient over the active compo-
nents. We can again compute this next "crossing time" explicitly, due to linearity.
Therefore, we maintain that (22) is the lasso solution for all λ2 ≥ λ ≥ λ3, where λ3 is
the maximum of the next hitting time and the next crossing time. This, repeated from
knot to knot, is the lasso path algorithm.
As we decrease λ from a knot λ_k, we can rewrite the lasso coefficient update (as in (22))
as
\[
\hat\beta_A(\lambda) = \hat\beta_A(\lambda_k) + (\lambda_k - \lambda)(X_A^T X_A)^{-1}s_A, \qquad \hat\beta_{-A}(\lambda) = 0. \tag{23}
\]
We can see that we are moving the active coefficients in the direction (λ_k − λ)(X_A^T X_A)^{-1}s_A
for decreasing λ. In other words, the lasso fitted values proceed as
\[
X\hat\beta(\lambda) = X\hat\beta(\lambda_k) + (\lambda_k - \lambda)X_A(X_A^T X_A)^{-1}s_A,
\]
for decreasing λ. Efron et al. (2004) call X_A(X_A^T X_A)^{-1}s_A the equiangular direction,
because this direction, in R^n, takes an equal angle with all X_j ∈ R^n, j ∈ A.
For this reason, the lasso path algorithm described above is also often referred
to as the least angle regression path algorithm in "lasso mode", though we have not
mentioned this yet to avoid confusion. Least angle regression is considered as another
algorithm by itself, where we skip the crossing-time check altogether. In words, that
check disallows any component path to cross through zero. The left side of the plot in
Figure 3 visualizes the distinction between least angle regression and lasso estimates:
the dotted black line displays the least angle regression component path, crossing
through zero, while the lasso component path remains at zero.
Lastly, an alternative expression for the coefficient update in (23) is
\[
\hat\beta_A(\lambda) = \hat\beta_A(\lambda_k) + \frac{\lambda_k - \lambda}{\lambda_k}(X_A^T X_A)^{-1}X_A^T r(\lambda_k), \qquad \hat\beta_{-A}(\lambda) = 0, \tag{24}
\]
where r(λ_k) = y − X_A β̂_A(λ_k) is the residual (from the fitted lasso model) at λ_k. This
follows because, recall, λ_k s_A are simply the inner products of the active variables
with the residual at λ_k, i.e., λ_k s_A = X_A^T(y − X_A β̂_A(λ_k)). In words, we can see that
the update for the active lasso coefficients in (24) is in the direction of the least
squares coefficients of the residual r(λ_k) on the active variables X_A.

7 Appendix: Fast Rates


Here is a proof of (17). There are many flavors of fast rates, and the conditions
required are all very closely related. van de Geer & Buhlmann (2009) provide a nice
review and discussion. Here we just discuss two such results, for simplicity.
Compatibility result. Assume that X satisfies the compatibility condition with
respect to the true support set S, i.e., for some compatibility constant φ0 > 0,
\[
\frac{1}{n}\|Xv\|_2^2 \ \ge\ \frac{\phi_0^2}{s_0}\|v_S\|_1^2 \quad \text{for all } v \in \mathbb{R}^p \text{ such that } \|v_{-S}\|_1 \le 3\|v_S\|_1. \tag{25}
\]
While this may look like an odd condition, we will see it being useful in the proof
below, and we will also have some help interpreting it when we discuss the restricted
eigenvalue condition shortly. Roughly, it means the (truly active) predictors can't be
too correlated.

Recall from our previous analysis for the lasso estimator in penalized form (5), we
showed on an event E_δ of probability at least 1 − δ,
\[
\|X\hat\beta - X\beta_0\|_2^2 \ \le\ 2\sigma\sqrt{2n\log(ep/\delta)}\,\|\hat\beta - \beta_0\|_1 + 2\lambda\big(\|\beta_0\|_1 - \|\hat\beta\|_1\big).
\]
Choosing λ large enough and applying the triangle inequality then gave us the slow
rate we derived before. Now we choose λ just slightly larger (by a factor of 2):
λ ≥ 2σ√(2n log(ep/δ)). The remainder of the analysis will be performed on the event
E_δ and we will no longer make this explicit until the very end. Then
\[
\begin{aligned}
\|X\hat\beta - X\beta_0\|_2^2 &\le \lambda\|\hat\beta - \beta_0\|_1 + 2\lambda\big(\|\beta_0\|_1 - \|\hat\beta\|_1\big) \\
&\le \lambda\|\hat\beta_S - \beta_{0,S}\|_1 + \lambda\|\hat\beta_{-S}\|_1 + 2\lambda\big(\|\beta_0\|_1 - \|\hat\beta\|_1\big) \\
&\le \lambda\|\hat\beta_S - \beta_{0,S}\|_1 + \lambda\|\hat\beta_{-S}\|_1 + 2\lambda\big(\|\beta_{0,S} - \hat\beta_S\|_1 - \|\hat\beta_{-S}\|_1\big) \\
&= 3\lambda\|\hat\beta_S - \beta_{0,S}\|_1 - \lambda\|\hat\beta_{-S}\|_1,
\end{aligned}
\]

where the two inequalities both followed from the triangle inequality, one application
for each of the two terms, and we have used that βb0,−S = 0. As kX βb − Xβ0 k22 ≥ 0,
we have shown
kβb−S − βb0,−S k1 ≤ 3kβbS − β0,S k1 ,
and thus we may apply the compatibility condition (25) to the vector v = βb − β0 .
This gives us two bounds: one on the fitted values, and the other on the coefficients.
Both start with the key inequality (from the second-to-last display)

kX βb − Xβ0 k22 ≤ 3λkβbS − β0,S k1 . (26)

For the fitted values, we upper bound the right-hand side of the key inequality (26) using (25),
\[
\|X\hat\beta - X\beta_0\|_2^2 \ \le\ 3\lambda\sqrt{\frac{s_0}{n\phi_0^2}}\,\|X\hat\beta - X\beta_0\|_2,
\]
or, dividing through both sides by ‖Xβ̂ − Xβ0‖₂, then squaring both sides, and di-
viding by n,
\[
\frac{1}{n}\|X\hat\beta - X\beta_0\|_2^2 \ \le\ \frac{9 s_0\lambda^2}{n^2\phi_0^2}.
\]
Plugging in λ = 2σ√(2n log(ep/δ)), we have shown that
\[
\frac{1}{n}\|X\hat\beta - X\beta_0\|_2^2 \ \le\ \frac{72\sigma^2 s_0\log(ep/\delta)}{n\phi_0^2}, \tag{27}
\]
with probability at least 1 − δ. Notice the similarity between (27) and (8): both
provide us in-sample risk bounds on the order of s0 log p/n, but the bound for the
lasso requires a strong compatibility assumption on the predictor matrix X, which
roughly means the predictors can't be too correlated.

For the coefficients, we lower bound the left-hand side of the key inequality (26) using (25),
\[
\frac{n\phi_0^2}{s_0}\|\hat\beta_S - \beta_{0,S}\|_1^2 \ \le\ 3\lambda\|\hat\beta_S - \beta_{0,S}\|_1,
\]
so dividing through both sides by ‖β̂_S − β_{0,S}‖₁, and recalling ‖β̂_{−S}‖₁ ≤ 3‖β̂_S − β_{0,S}‖₁,
which implies by the triangle inequality that ‖β̂ − β0‖₁ ≤ 4‖β̂_S − β_{0,S}‖₁,
\[
\|\hat\beta - \beta_0\|_1 \ \le\ \frac{12 s_0\lambda}{n\phi_0^2}.
\]
Plugging in λ = 2σ√(2n log(ep/δ)), we have shown that
\[
\|\hat\beta - \beta_0\|_1 \ \le\ \frac{24\sigma s_0}{\phi_0^2}\sqrt{\frac{2\log(ep/\delta)}{n}}, \tag{28}
\]
with probability at least 1 − δ. This is an error bound on the order of s0√(log p/n) for
the lasso coefficients (in ℓ1 norm).
Restricted eigenvalue result. Instead of compatibility, we may assume that
X satisfies the restricted eigenvalue condition with constant φ0 > 0, i.e.,
\[
\frac{1}{n}\|Xv\|_2^2 \ \ge\ \phi_0^2\|v\|_2^2 \quad \text{for all subsets } J \subseteq \{1,\ldots,p\} \text{ with } |J| = s_0 \text{ and all } v \in \mathbb{R}^p \text{ such that } \|v_{J^c}\|_1 \le 3\|v_J\|_1. \tag{29}
\]
This produces essentially the same results as in (27), (28), but additionally, in the ℓ2
norm,
\[
\|\hat\beta - \beta_0\|_2^2 \ \lesssim\ \frac{s_0\log p}{n\phi_0^2}
\]
with probability tending to 1.
Note the similarity between (29) and the compatibility condition (25). The former
is actually stronger, i.e., it implies the latter, because ‖β‖₂² ≥ ‖β_J‖₂² ≥ ‖β_J‖₁²/s0. We
may interpret the restricted eigenvalue condition roughly as follows: the requirement
(1/n)‖Xv‖₂² ≥ φ0²‖v‖₂² for all v ∈ R^p would be a lower bound of φ0² on the smallest
eigenvalue of (1/n)X^T X; we don't require this (as this would of course mean that X
had full column rank, which couldn't happen when p > n), but instead require
that the same inequality hold for v that are "mostly" supported on small subsets J
of variables, with |J| = s0.

8 Appendix: Support Recovery


Again we assume the standard linear model y = Xβ0 + ε, with X fixed, subject to the scaling
‖X_j‖₂² ≤ n for j = 1, . . . , p, and ε ∼ N(0, σ²I). Denote by S = supp(β0) the true
support set, and s0 = |S|. Assume that X_S has full column rank.

We aim to show that, at some value of λ, the lasso solution β̂ in (5) has an active
set that exactly equals the true support set,
\[
A = \mathrm{supp}(\hat\beta) = S,
\]
with high probability. We actually aim to show that the signs also match,
\[
\mathrm{sign}(\hat\beta_S) = \mathrm{sign}(\beta_{0,S}),
\]
with high probability. The primal-dual witness method basically plugs the true
support S into the KKT conditions for the lasso (9), (10), and checks when they can
be verified.
We start by breaking up (9) into two blocks, over S and S^c. Suppose that
supp(β̂) = S at a solution β̂. Then the KKT conditions become
\[
X_S^T(y - X_S\hat\beta_S) = \lambda s_S, \tag{30}
\]
\[
X_{-S}^T(y - X_S\hat\beta_S) = \lambda s_{-S}. \tag{31}
\]
Hence, if we can satisfy the two conditions (30), (31) with a proper subgradient
s, such that
\[
s_S = \mathrm{sign}(\beta_{0,S}) \quad \text{and} \quad \|s_{-S}\|_\infty = \max_{j \notin S}|s_j| < 1,
\]
then we have met our goal: we have recovered a (unique) lasso solution whose active
set is S, and whose active signs are sign(β_{0,S}).
So, let's solve for β̂_S in the first block (30). Just as we did in the work on basic
properties of the lasso estimator, this yields
\[
\hat\beta_S = (X_S^T X_S)^{-1}\big(X_S^T y - \lambda\,\mathrm{sign}(\beta_{0,S})\big), \tag{32}
\]
where we have substituted s_S = sign(β_{0,S}). From (31), this implies that s_{−S} must
satisfy
\[
s_{-S} = \frac{1}{\lambda}X_{-S}^T\big(I - X_S(X_S^T X_S)^{-1}X_S^T\big)y + X_{-S}^T X_S(X_S^T X_S)^{-1}\mathrm{sign}(\beta_{0,S}). \tag{33}
\]
To lay it out, for concreteness, the primal-dual witness method proceeds as follows
(a numerical sketch of these steps is given below):

1. Solve for the lasso solution over the S components, β̂_S, as in (32), and set
β̂_{−S} = 0.

2. Solve for the subgradient over the S^c components, s_{−S}, as in (33).

3. Check that sign(β̂_S) = sign(β_{0,S}), and that ‖s_{−S}‖_∞ < 1. If these two checks
pass, then we have certified that there is a (unique) lasso solution that exactly re-
covers the true support and signs.
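Here is a minimal numerical sketch of these three steps (NumPy only; the problem dimensions, signal strengths, and the choice λ = 2σ√(2n log p) below are illustrative assumptions, not prescriptions from the result).

import numpy as np

rng = np.random.default_rng(8)
n, p, s0, sigma = 500, 30, 3, 1.0
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s0] = [4.0, -3.0, 3.5]                  # true support S = {0, 1, 2}
y = X @ beta0 + sigma * rng.standard_normal(n)

S = np.arange(s0)
notS = np.arange(s0, p)
lam = 2 * sigma * np.sqrt(2 * n * np.log(p))   # an illustrative choice of lambda
signS = np.sign(beta0[S])

XS, XnotS = X[:, S], X[:, notS]
G_inv = np.linalg.inv(XS.T @ XS)

# Step 1: candidate solution on S, equation (32).
beta_S = G_inv @ (XS.T @ y - lam * signS)

# Step 2: implied subgradient off S, equation (33).
P_S = XS @ G_inv @ XS.T
s_notS = XnotS.T @ (np.eye(n) - P_S) @ y / lam + XnotS.T @ XS @ G_inv @ signS

# Step 3: the primal-dual witness checks.
print(np.array_equal(np.sign(beta_S), signS))      # sign consistency on S
print(float(np.max(np.abs(s_notS))) < 1)           # strict dual feasibility off S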

The success of the primal-dual witness method hinges on Step 3. We can plug in y =
Xβ0 + ε, and rewrite the required conditions, sign(β̂_S) = sign(β_{0,S}) and ‖s_{−S}‖_∞ < 1,
as
\[
\mathrm{sign}(\beta_{0,j} + \Delta_j) = \mathrm{sign}(\beta_{0,j}), \quad \text{where } \Delta_j = e_j^T(X_S^T X_S)^{-1}\big(X_S^T\epsilon - \lambda\,\mathrm{sign}(\beta_{0,S})\big), \ \text{ for all } j \in S, \tag{34}
\]
and
\[
\Big\|\frac{1}{\lambda}X_{-S}^T\big(I - X_S(X_S^T X_S)^{-1}X_S^T\big)\epsilon + X_{-S}^T X_S(X_S^T X_S)^{-1}\mathrm{sign}(\beta_{0,S})\Big\|_\infty < 1. \tag{35}
\]
As ε ∼ N(0, σ²I), we see that the two required conditions have been reduced to
statements about Gaussian random variables. The arguments we need to check these
conditions are actually quite simple, but we will need to make assumptions on X and
β0: these are precisely the mutual incoherence, minimum eigenvalue, and minimum
signal conditions stated in Section 4.3.
With these assumptions in place on X and β0, let's first consider verifying (34),
and examine Δ_S, whose components Δ_j, j ∈ S are as defined in (34). We have
\[
\|\Delta_S\|_\infty \ \le\ \|(X_S^T X_S)^{-1}X_S^T\epsilon\|_\infty + \lambda\|(X_S^T X_S)^{-1}\|_\infty.
\]
Note that w = (X_S^T X_S)^{-1}X_S^T ε is Gaussian with mean zero and covariance σ²(X_S^T X_S)^{-1},
so the variances of components of w are bounded by
\[
\sigma^2\Lambda_{\max}\big((X_S^T X_S)^{-1}\big) \ \le\ \frac{\sigma^2 n}{C},
\]
where we have used the minimum eigenvalue assumption. By a standard result on
the maximum of Gaussians, for any δ > 0, it holds with probability at least 1 − δ
that
\[
\begin{aligned}
\|\Delta_S\|_\infty &\le \frac{\sigma}{\sqrt{C}}\sqrt{2n\log(es_0/\delta)} + \lambda\|(X_S^T X_S)^{-1}\|_\infty \\
&\le \beta_{0,\min} + \frac{\gamma}{\sqrt{C}}\underbrace{\Big(\frac{\sigma}{\gamma}\sqrt{2n\log(es_0/\delta)} - 4\lambda\Big)}_{a},
\end{aligned}
\]
where in the second line we used the minimum signal condition. As long as a < 0,
we can see that the sign condition (34) is verified.
Now, let's consider verifying (35). Using the mutual incoherence condition, we
have
\[
\Big\|\frac{1}{\lambda}X_{-S}^T\big(I - X_S(X_S^T X_S)^{-1}X_S^T\big)\epsilon + X_{-S}^T X_S(X_S^T X_S)^{-1}\mathrm{sign}(\beta_{0,S})\Big\|_\infty \ \le\ \|z\|_\infty + (1 - \gamma),
\]
where z = (1/λ)X_{−S}^T(I − X_S(X_S^T X_S)^{-1}X_S^T)ε = (1/λ)X_{−S}^T P_{X_S}^{\perp}ε, with P_{X_S}^{\perp} = I − X_S(X_S^T X_S)^{-1}X_S^T
the projection matrix onto the orthogonal complement of the column space of X_S. Notice that z is Gaussian with mean zero
and covariance (σ²/λ²)X_{−S}^T P_{X_S}^{\perp} X_{−S}, so the components of z have variances bounded
by
\[
\frac{\sigma^2 n}{\lambda^2}\Lambda_{\max}\big(P_{X_S}^{\perp}\big) \ \le\ \frac{\sigma^2 n}{\lambda^2}.
\]
Therefore, again by the maximal Gaussian inequality, for any δ > 0, it holds with
probability at least 1 − δ that
\[
\begin{aligned}
\Big\|\frac{1}{\lambda}X_{-S}^T\big(I - X_S(X_S^T X_S)^{-1}X_S^T\big)\epsilon + X_{-S}^T X_S(X_S^T X_S)^{-1}\mathrm{sign}(\beta_{0,S})\Big\|_\infty
&\le \frac{\sigma}{\lambda}\sqrt{2n\log\big(e(p-s_0)/\delta\big)} + (1 - \gamma) \\
&= 1 + \underbrace{\Big(\frac{\sigma}{\lambda}\sqrt{2n\log\big(e(p-s_0)/\delta\big)} - \gamma\Big)}_{b}.
\end{aligned}
\]
Thus, as long as b < 0, we can see that the subgradient condition (35) is verified.
So it remains to choose λ so that a, b < 0. For λ ≥ (2σ/γ)√(2n log(ep/δ)), we can
see that
\[
a \le 2\lambda - 4\lambda < 0, \qquad b \le \gamma/2 - \gamma < 0,
\]
so (34), (35) are verified—and hence the lasso estimator recovers the correct support and
signs—with probability at least 1 − 2δ.

8.1 A note on the conditions


As we moved from the slow rates, to fast rates, to support recovery, the assumptions
we used just got stronger and stronger. For the slow rates, we essentially assumed
nothing about the predictor matrix X except for column normalization. For the
fast rates, we had to additionally assume a compatibility or restricted eigenvalue
condition, which roughly speaking limits the correlations of the predictor variables
(particularly concentrated over the underlying support S). For support recovery, we
needed a whole lot more. The minimum eigenvalue condition on (1/n)X_S^T X_S is
somewhat like the restricted eigenvalue condition on X. But the mutual incoherence
condition is even stronger; it requires the regression coefficients
\[
\eta_j(S) = (X_S^T X_S)^{-1}X_S^T X_j,
\]
given by regressing each X_j on the truly active variables X_S, to be small (in ℓ1 norm)
for all j ∉ S. In other words, no truly inactive variable can be highly correlated with
(or well explained by, in a linear projection sense) the truly active variables.
Finally, the minimum signal condition ensures that the nonzero entries of the true
coefficient vector β0 are big enough to detect. This is quite restrictive and is not
needed for risk bounds, but it is crucial to support recovery.

8.2 Minimax bounds
Under the standard linear model with X fixed, subject to the scaling ‖X_j‖₂² ≤ n for
j = 1, . . . , p, and ε ∼ N(0, σ²I), Raskutti et al. (2011) derive upper and lower bounds
on the minimax prediction error
\[
M(s_0, n, p) = \inf_{\hat\beta}\,\sup_{\|\beta_0\|_0 \le s_0}\ \frac{1}{n}\|X\hat\beta - X\beta_0\|_2^2.
\]
(Their analysis is actually considerably more broad than this and covers the coefficient
error ‖β̂ − β0‖₂, as well as ℓq constraints on β0, for q ∈ [0, 1].) They prove that, under
no additional assumptions on X,
\[
M(s_0, n, p) \ \lesssim\ \frac{s_0\log(p/s_0)}{n},
\]
with probability tending to 1.
They also prove that, under a type of restricted eigenvalue condition in which
\[
c_0 \ \le\ \frac{(1/n)\|Xv\|_2^2}{\|v\|_2^2} \ \le\ c_1 \quad \text{for all } v \in \mathbb{R}^p \text{ such that } \|v\|_0 \le 2s_0,
\]
for some constants c0 > 0 and c1 < ∞, it holds that
\[
M(s_0, n, p) \ \gtrsim\ \frac{s_0\log(p/s_0)}{n},
\]
with probability at least 1/2.
The implication is that, for some X, minimax optimal prediction may be performed
at a faster rate than s0 log(p/s0)/n; but for low correlations, this
is the rate we should expect. (This is consistent with the worst-case-X analysis of
Foster & George (1994), who actually show the worst-case behavior is attained in the
orthogonal X case.)

