A Closer Look at Sparse Regression
Ryan Tibshirani
(amended by Larry Wasserman)
1 Introduction
In these notes we take a closer look at sparse linear regression. Throughout, we make the very strong assumption that Y_i = β^T X_i + ε_i where E[ε_i | X_i] = 0 and Var(ε_i | X_i) = σ². These assumptions are highly unrealistic but they permit a more detailed analysis. There are several books on high-dimensional estimation: Hastie, Tibshirani & Wainwright (2015), Buhlmann & van de Geer (2011), Wainwright (2017).
2.1 Three norms: ℓ0, ℓ1, ℓ2
Given a coefficient vector β ∈ R^p, the three penalties we consider are the ℓ0, ℓ1, and ℓ2 norms: ‖β‖_0 = Σ_{j=1}^p 1{β_j ≠ 0}, ‖β‖_1 = Σ_{j=1}^p |β_j|, and ‖β‖_2 = (Σ_{j=1}^p β_j²)^{1/2}. (Truthfully, calling ‖·‖_0 "the ℓ0 norm" is a misnomer, since it is not a norm: it does not satisfy positive homogeneity, i.e., ‖aβ‖_0 ≠ |a| ‖β‖_0 whenever a ≠ 0, 1.)
In constrained form, this gives rise to the problems:

    min_{β ∈ R^p} (1/2) ‖y − Xβ‖²_2   subject to ‖β‖_0 ≤ k      (Best subset selection)   (1)
    min_{β ∈ R^p} (1/2) ‖y − Xβ‖²_2   subject to ‖β‖_1 ≤ t      (Lasso regression)        (2)
    min_{β ∈ R^p} (1/2) ‖y − Xβ‖²_2   subject to ‖β‖_2 ≤ t      (Ridge regression)        (3)

where k, t ≥ 0 are tuning parameters. In penalized form, the use of the ℓ0, ℓ1, ℓ2 norms gives rise to the problems:

    min_{β ∈ R^p} (1/2) ‖y − Xβ‖²_2 + λ‖β‖_0      (Best subset selection)   (4)
    min_{β ∈ R^p} (1/2) ‖y − Xβ‖²_2 + λ‖β‖_1      (Lasso regression)        (5)
    min_{β ∈ R^p} (1/2) ‖y − Xβ‖²_2 + λ‖β‖²_2     (Ridge regression)        (6)
with λ ≥ 0 the tuning parameter. In fact, problems (2), (5) are equivalent. By this, we mean that for any t ≥ 0 and solution β̂ in (2), there is a value of λ ≥ 0 such that β̂ also solves (5), and vice versa. The same equivalence holds for (3), (6). (The factors of 1/2 multiplying the squared loss above are inconsequential, and just for convenience.)
It means, roughly speaking, that computing solutions of (2) over a sequence of t values and performing cross-validation (to select an estimate) should be basically the same as computing solutions of (5) over some sequence of λ values and performing cross-validation (to select an estimate). Strictly speaking, this isn't quite true, because the precise correspondence between equivalent t, λ depends on the data X, y.
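To make the equivalence concrete, here is a minimal numerical sketch (not from the notes; it assumes the CVXPY package and a small synthetic data set): solve the penalized lasso (5) at some λ, read off t = ‖β̂‖_1, and check that the constrained lasso (2) at that t recovers the same solution.

```python
# Minimal sketch (not from the notes): check the equivalence of the penalized
# lasso (5) and the constrained lasso (2) on a small synthetic problem, using
# CVXPY (an assumption of this sketch; any convex solver would do).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.standard_normal(n)

lam = 5.0
b_pen = cp.Variable(p)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - X @ b_pen)
                       + lam * cp.norm1(b_pen))).solve()

t = np.linalg.norm(b_pen.value, 1)          # matched constraint level
b_con = cp.Variable(p)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - X @ b_con)),
           [cp.norm1(b_con) <= t]).solve()

# The two solutions should agree up to solver tolerance.
print(np.max(np.abs(b_pen.value - b_con.value)))
```

Going in the other direction (from a given t to a matching λ) is also possible, but, as noted above, the correspondence depends on the data X, y.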
Notably, problems (1), (4) are not equivalent. For every value of λ ≥ 0 and solution β̂ in (4), there is a value of t ≥ 0 such that β̂ also solves (1), but the converse is not true.
2.3 Sparsity
The best subset selection and the lasso estimators have a special, useful property: their solutions are sparse, i.e., at a solution we will have β̂_j = 0 for many components j ∈ {1, . . . , p}. In problem (1), this is obviously true, where k ≥ 0 controls the sparsity level. In problem (2), it is less obviously true, but we get a higher degree of sparsity the smaller the value of t ≥ 0. In the penalized forms (4), (5), we get more sparsity the larger the value of λ ≥ 0.
This is not true of ridge regression, i.e., the solution of (3) or (6) generically has all nonzero components, no matter the value of t or λ. Note that sparsity is desirable for two reasons: (i) it corresponds to performing variable selection in the constructed linear model, and (ii) it provides a level of interpretability (beyond sheer accuracy).
The table below (Table 3.4 of Hastie et al. (2009)) gives the estimators of β_j in the case of orthonormal columns of X. M and λ are constants chosen by the corresponding techniques; sign denotes the sign of its argument (±1), and x_+ denotes the positive part of x.

    Estimator               Formula
    Best subset (size M)    β̂_j · I(|β̂_j| ≥ |β̂_(M)|)
    Ridge                   β̂_j / (1 + λ)
    Lasso                   sign(β̂_j)(|β̂_j| − λ)_+

[In the source figure accompanying this table, the three estimators are shown as broken red lines against the 45° line in gray, which is the unrestricted estimate.]
That the ℓ0 norm induces sparsity is obvious. But why does the ℓ1 norm induce sparsity and not the ℓ2 norm? There are different ways to look at it; let's stick with intuition from the constrained problem forms (2), (3). Figure 1 shows the "classic" picture, contrasting the way the contours of the squared error loss hit the two constraint sets, the ℓ1 and ℓ2 balls. As the ℓ1 ball has sharp corners (aligned with the coordinate axes), we get sparse solutions.
Figure 1: The "classic" illustration comparing lasso and ridge constraints, from Chapter 3 of Hastie et al. (2009): estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |β1| + |β2| ≤ t and β1² + β2² ≤ t², respectively, while the red ellipses are the contours of the least squares error function.
Intuition can also be drawn from the orthogonal case. When X has orthonormal columns (X^T X = I), it is not hard to show that the solutions of the penalized problems (4), (5), (6) are

    β̂_subset = H_{√(2λ)}(X^T y),    β̂_lasso = S_λ(X^T y),    β̂_ridge = X^T y / (1 + 2λ),

respectively, where H_t(·), S_t(·) are the componentwise hard- and soft-thresholding functions at the level t. We see several revealing properties: subset selection and lasso solutions exhibit sparsity when the componentwise least squares coefficients (inner products X_j^T y) are small enough; the lasso solution exhibits shrinkage, in that large enough least squares coefficients are shrunken towards zero by λ; the ridge regression solution is never sparse and, compared to the lasso, preferentially shrinks the larger least squares coefficients even more.
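Below is a small numpy sketch of these closed forms (the helper names hard_threshold and soft_threshold are mine). Assuming X has orthonormal columns, the penalized problems separate across coordinates into one-dimensional problems of the form (1/2)(b − z_j)² + penalty, with z_j = X_j^T y, so each closed form can be sanity-checked against the best value on a fine grid.

```python
# A sketch of the orthogonal-case closed forms: hard thresholding (best subset),
# soft thresholding (lasso), and proportional shrinkage (ridge). Helper names
# are mine; the grid comparison below is only a numerical sanity check.
import numpy as np

def hard_threshold(z, t):
    return z * (np.abs(z) >= t)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(1)
n, p, lam = 30, 5, 0.7
X, _ = np.linalg.qr(rng.standard_normal((n, p)))    # orthonormal columns
y = rng.standard_normal(n)
z = X.T @ y                                         # componentwise least squares coefficients

beta_subset = hard_threshold(z, np.sqrt(2 * lam))
beta_lasso = soft_threshold(z, lam)
beta_ridge = z / (1 + 2 * lam)

# With X^T X = I, each problem separates: minimize 0.5*(b - z_j)^2 + penalty(b).
# The closed forms should do at least as well as the best point on a fine grid.
grid = np.linspace(-5, 5, 20001)
for j in range(p):
    sq = 0.5 * (grid - z[j]) ** 2
    assert 0.5 * (beta_subset[j] - z[j]) ** 2 + lam * (beta_subset[j] != 0) \
        <= np.min(sq + lam * (grid != 0)) + 1e-9
    assert 0.5 * (beta_lasso[j] - z[j]) ** 2 + lam * abs(beta_lasso[j]) \
        <= np.min(sq + lam * np.abs(grid)) + 1e-9
    assert 0.5 * (beta_ridge[j] - z[j]) ** 2 + lam * beta_ridge[j] ** 2 \
        <= np.min(sq + lam * grid ** 2) + 1e-9
```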
2.4 Convexity
The lasso and ridge regression problems (2), (3) have another very important property: they are convex optimization problems. Best subset selection (1) is not; in fact, it is very far from being convex. Consider using ‖β‖_p as a penalty. Sparsity requires p ≤ 1 and convexity requires p ≥ 1. The only value that gives both sparsity and convexity is p = 1. The appendix has a brief review of convexity.
3.2 Lasso
Now we turn to subgradient optimality (sometimes called the KKT conditions) for the lasso problem in (5). They tell us that any lasso solution β̂ must satisfy

    X^T (y − Xβ̂) = λ s,    (9)

where s ∈ ∂‖β̂‖_1, a subgradient of the ℓ1 norm evaluated at β̂. Precisely, this means that

    s_j ∈ {+1} if β̂_j > 0,    s_j ∈ {−1} if β̂_j < 0,    s_j ∈ [−1, 1] if β̂_j = 0,    for j = 1, . . . , p.    (10)
From (9) we can read off a straightforward but important fact: even though the solution β̂ may not be uniquely determined, the optimal subgradient s is a function of the unique fitted value Xβ̂ (assuming λ > 0), and hence is itself unique.
Now from (10), note that the uniqueness of s implies that any two lasso solutions must have the same signs on the overlap of their supports. That is, it cannot happen that we find two different lasso solutions β̂ and β̃ with β̂_j > 0 but β̃_j < 0 for some j, and hence we have no problem interpreting the signs of components of lasso solutions.
Let's assume henceforth that the columns of X are in general position (and we are looking at a nontrivial end of the path, with λ > 0), so the lasso solution β̂ is unique. Let A = supp(β̂) be the lasso active set, and let s_A = sign(β̂_A) be the signs of active coefficients. From the subgradient conditions (9), (10), we know that

    β̂_A = (X_A^T X_A)^{-1} (X_A^T y − λ s_A),    β̂_{−A} = 0,    (11)

(where recall we know that X_A^T X_A is invertible because X has columns in general position). We see that the active coefficients β̂_A are given by taking the least squares coefficients on X_A, (X_A^T X_A)^{-1} X_A^T y, and shrinking them by an amount λ(X_A^T X_A)^{-1} s_A. Contrast this to, e.g., the subset selection solution in (7), where there is no such shrinkage.
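As a quick numerical illustration (my own sketch, not part of the notes), the following checks the subgradient conditions (9), (10) and the closed form (11) for a lasso solution computed with CVXPY; the tolerances and thresholds are ad hoc.

```python
# Check the KKT conditions (9)-(10) and the active-set formula (11) numerically.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n, p, lam = 100, 10, 10.0
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.standard_normal(n)

b = cp.Variable(p)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - X @ b) + lam * cp.norm1(b))).solve()
beta = b.value

# (9): X^T (y - X beta) = lam * s, for a subgradient s of the l1 norm at beta.
s = X.T @ (y - X @ beta) / lam
active = np.abs(beta) > 1e-4
print("s = sign(beta) on the support:",
      np.allclose(s[active], np.sign(beta[active]), atol=1e-3))
print("|s| <= 1 off the support:", np.all(np.abs(s[~active]) <= 1 + 1e-3))

# (11): active coefficients are shrunken least squares coefficients on X_A.
XA, sA = X[:, active], np.sign(beta[active])
betaA = np.linalg.solve(XA.T @ XA, XA.T @ y - lam * sA)
print("matches (11):", np.allclose(betaA, beta[active], atol=1e-3))
```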
Now, how about this so-called shrinkage term λ(X_A^T X_A)^{-1} s_A? Does it always act by moving each one of the least squares coefficients (X_A^T X_A)^{-1} X_A^T y towards zero? Interestingly, this is not always the case, and one can find empirical examples where a lasso coefficient is actually larger (in magnitude) than the corresponding least squares coefficient on the active set. Of course, we also know that this is due to correlations between active variables, because when X is orthogonal, as we've already seen, this never happens.
On the other hand, it is always the case that the lasso solution has a strictly smaller ℓ1 norm than the least squares solution on the active set, and in this sense, we are (perhaps) justified in always referring to λ(X_A^T X_A)^{-1} s_A as a shrinkage term. To see this, note that, for any vector b, ‖b‖_1 = s^T b where s is the vector of signs of b. So ‖β̂‖_1 = s^T β̂ = s_A^T β̂_A, and so

    ‖β̂‖_1 = s_A^T (X_A^T X_A)^{-1} X_A^T y − λ s_A^T (X_A^T X_A)^{-1} s_A.

The first term is less than or equal to ‖(X_A^T X_A)^{-1} X_A^T y‖_1, and the term we are subtracting is strictly positive (because (X_A^T X_A)^{-1} is positive definite).
Notice that ‖X^T ε‖_∞ = max_{j=1,...,p} |X_j^T ε| is a maximum of p Gaussians, each with mean zero and variance upper bounded by σ²n. By a standard maximal inequality for Gaussians, for any δ > 0,

    max_{j=1,...,p} |X_j^T ε| ≤ σ √(2n log(ep/δ)),

with probability at least 1 − δ. Plugging this into the second-to-last display and dividing by n, we get the finite-sample result for the lasso estimator

    (1/n) ‖Xβ̂ − Xβ0‖²_2 ≤ 4σ ‖β0‖_1 √(2 log(ep/δ) / n),    (13)

with probability at least 1 − δ.
The high-probability result (13) implies an in-sample risk bound of

    (1/n) E‖Xβ̂ − Xβ0‖²_2 ≲ ‖β0‖_1 √(log p / n).

Compare this with the risk bound (8) for best subset selection, which is on the (optimal) order of s0 log p / n when β0 has s0 nonzero components. If each of the nonzero components here has constant magnitude, then the above risk bound for the lasso estimator is on the order of s0 √(log p / n), which is much slower.
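A rough simulation (my own, not from the notes) of the bound-form lasso (2) with t = ‖β0‖_1, solved with CVXPY, gives a feel for how loose the slow-rate bound (13) typically is:

```python
# Compare the in-sample error of the bound-form lasso (2), with t = ||beta_0||_1,
# to the right-hand side of the slow-rate bound (13).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, p, s0, sigma, delta = 200, 500, 5, 1.0, 0.1
beta0 = np.zeros(p)
beta0[:s0] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta0 + sigma * rng.standard_normal(n)

b = cp.Variable(p)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - X @ b)),
           [cp.norm1(b) <= np.linalg.norm(beta0, 1)]).solve()

err = np.sum((X @ (b.value - beta0)) ** 2) / n
bound = 4 * sigma * np.linalg.norm(beta0, 1) * np.sqrt(2 * np.log(np.e * p / delta) / n)
print(f"in-sample error: {err:.3f}   slow-rate bound (13): {bound:.3f}")
```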
Predictive risk. Instead of in-sample risk, we might also be interested in out-of-sample risk, as after all that reflects actual (out-of-sample) predictions. In least squares, recall, we saw that out-of-sample risk was generally higher than in-sample risk. The same is true for the lasso. Chatterjee (2013) gives a nice, simple analysis of out-of-sample risk for the lasso. He assumes that x0, xi, i = 1, . . . , n are i.i.d. from an arbitrary distribution supported on a compact set in R^p, and shows that the lasso estimator in bound form (2) with t = ‖β0‖_1 has out-of-sample risk satisfying

    E(x_0^T β̂ − x_0^T β0)² ≲ ‖β0‖²_1 √(log p / n).

The proof is not much more complicated than the above, for the in-sample risk, and reduces to a clever application of Hoeffding's inequality, though we omit it for brevity. Note here the dependence on ‖β0‖²_1, rather than ‖β0‖_1 as in the in-sample risk. This agrees with the analysis we did in the previous set of notes, where we did not assume the linear model. (Only the interpretation changes.)
Suppose now that y = µ(X) + ε, with an arbitrary mean function µ(X), and normal errors ε ∼ N(0, σ²I). We will analyze the bound form lasso estimator (2) for simplicity. By optimality of β̂, for any other β̃ feasible for the lasso problem in (2), it holds that¹

    ⟨X^T (y − Xβ̂), β̃ − β̂⟩ ≤ 0.    (14)

Rearranging gives

    ⟨µ(X) − Xβ̂, Xβ̃ − Xβ̂⟩ ≤ ⟨X^T ε, β̂ − β̃⟩.    (15)

Now using the polarization identity ‖a‖²_2 + ‖b‖²_2 − ‖a − b‖²_2 = 2⟨a, b⟩,

    ‖Xβ̂ − µ(X)‖²_2 + ‖Xβ̂ − Xβ̃‖²_2 ≤ ‖Xβ̃ − µ(X)‖²_2 + 2⟨X^T ε, β̂ − β̃⟩.
Also, if we write Xβ̃_best as the best linear predictor of ℓ1 norm at most t, achieving the infimum on the right-hand side (which we know exists, as we are minimizing a continuous function over a compact set), then

    (1/n) ‖Xβ̂ − Xβ̃_best‖²_2 ≤ 4σt √(2 log(ep/δ) / n),

with probability at least 1 − δ.
If we assume that X satisfies the restricted eigenvalue condition with constant φ0 > 0, i.e.,

    (1/n) ‖Xv‖²_2 ≥ φ0² ‖v‖²_2   for all subsets J ⊆ {1, . . . , p} such that |J| = s0
    and all v ∈ R^p such that ‖v_{J^c}‖_1 ≤ 3‖v_J‖_1,    (16)

then

    ‖β̂ − β0‖²_2 ≲ s0 log p / (nφ0²)    (17)

with probability tending to 1. (This condition can be slightly weakened, but not much.) The condition is unlikely to hold in any real problem, nor is it checkable. The proof is in the appendix.

¹ To see this, consider minimizing a convex function f(x) over a convex set C. Let x̂ be a minimizer, and let z ∈ C be any other point in C. If we move away from the solution x̂ toward z, we can only increase f. In other words, ⟨∇f(x̂), z − x̂⟩ ≥ 0.
References
Beale, E. M. L., Kendall, M. G. & Mann, D. W. (1967), ‘The discarding of variables
in multivariate analysis’, Biometrika 54(3/4), 357–366.
Bertsimas, D., King, A. & Mazumder, R. (2016), ‘Best subset selection via a modern
optimization lens’, The Annals of Statistics 44(2), 813–852.
Buhlmann, P. & van de Geer, S. (2011), Statistics for High-Dimensional Data,
Springer.
Candes, E. J. & Tao, T. (2006), ‘Near optimal signal recovery from random projec-
tions: Universal encoding strategies?’, IEEE Transactions on Information Theory
52(12), 5406–5425.
Chen, S., Donoho, D. L. & Saunders, M. (1998), ‘Atomic decomposition for basis
pursuit’, SIAM Journal on Scientific Computing 20(1), 33–61.
Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), ‘Least angle regression’,
Annals of Statistics 32(2), 407–499.
Foster, D. & George, E. (1994), ‘The risk inflation criterion for multiple regression’,
The Annals of Statistics 22(4), 1947–1975.
Hastie, T., Tibshirani, R. & Friedman, J. (2009), The Elements of Statistical Learning;
Data Mining, Inference and Prediction, Springer. Second edition.
Hastie, T., Tibshirani, R. & Wainwright, M. (2015), Statistical Learning with Sparsity:
the Lasso and Generalizations, Chapman & Hall.
Hoerl, A. & Kennard, R. (1970), ‘Ridge regression: biased estimation for nonorthog-
onal problems’, Technometrics 12(1), 55–67.
Osborne, M., Presnell, B. & Turlach, B. (2000a), ‘A new approach to variable selection
in least squares problems’, IMA Journal of Numerical Analysis 20(3), 389–404.
Osborne, M., Presnell, B. & Turlach, B. (2000b), ‘On the lasso and its dual’, Journal
of Computational and Graphical Statistics 9(2), 319–337.
Raskutti, G., Wainwright, M. J. & Yu, B. (2011), ‘Minimax rates of estimation for
high-dimensional linear regression over `q -balls’, IEEE Transactions on Information
Theory 57(10), 6976–6994.
Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of
the Royal Statistical Society: Series B 58(1), 267–288.
van de Geer, S. & Buhlmann, P. (2009), ‘On the conditions used to prove oracle
results for the lasso’, Electronic Journal of Statistics 3, 1360–1392.
Zhao, P. & Yu, B. (2006), ‘On model selection consistency of lasso’, Journal of Ma-
chine Learning Research 7, 2541–2564.
5 Appendix: Convexity
It is convexity that allows us to equate (2), (5), and (3), (6) (and yes, the penalized forms are convex problems too). It is also convexity that allows us to both efficiently solve, and in some sense, precisely understand the nature of the lasso and ridge regression solutions.
Here is a (far too quick) refresher/introduction to basic convex analysis and convex optimization. Recall that a set C ⊆ R^n is called convex if for any x, y ∈ C and t ∈ [0, 1], we have

    tx + (1 − t)y ∈ C,

i.e., the line segment joining x, y lies entirely in C. A function f : R^n → R is called convex if its domain dom(f) is convex, and for any x, y ∈ dom(f) and t ∈ [0, 1],

    f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y),

i.e., the function lies below the line segment joining its evaluations at x and y. A function is called strictly convex if this same inequality holds strictly for x ≠ y and t ∈ (0, 1).
E.g., lines, rays, line segments, linear spaces, affine spaces, hyperplanes, halfspaces, polyhedra, and norm balls are all convex sets.
E.g., affine functions a^T x + b are both convex and concave, quadratic functions x^T Q x + b^T x + c are convex if Q ⪰ 0 and strictly convex if Q ≻ 0, and norms are convex.
Formally, an optimization problem is of the form

    min_{x ∈ D}   f(x)
    subject to    h_i(x) ≤ 0,  i = 1, . . . , m
                  ℓ_j(x) = 0,  j = 1, . . . , r

Here D = dom(f) ∩ ∩_{i=1}^m dom(h_i) ∩ ∩_{j=1}^r dom(ℓ_j) is the common domain of all functions. A convex optimization problem is an optimization problem in which all functions f, h_1, . . . , h_m are convex, and all functions ℓ_1, . . . , ℓ_r are affine. (Think: why affine?) Hence, we can express it as

    min_{x ∈ D}   f(x)
    subject to    h_i(x) ≤ 0,  i = 1, . . . , m
                  Ax = b
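For what it's worth, this standard form maps directly onto modern modeling tools. A toy instance in CVXPY (my choice of tool, not something used in the notes): a convex quadratic objective, one convex inequality constraint, and one affine equality constraint.

```python
# A toy convex optimization problem in standard form: convex f, convex h_i <= 0,
# affine equality Ax = b.
import numpy as np
import cvxpy as cp

x = cp.Variable(3)
f = cp.sum_squares(x - np.array([1.0, 2.0, 3.0]))   # convex objective
ineq = [cp.norm1(x) <= 2.0]                         # convex inequality constraint
A, b = np.ones((1, 3)), np.array([1.0])             # affine equality constraint A x = b
prob = cp.Problem(cp.Minimize(f), ineq + [A @ x == b])
prob.solve()
print(prob.value, x.value)
```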
Why is a convex optimization problem so special? The short answer: because any local minimizer is a global minimizer. To see this, suppose that x is feasible for the convex problem formulation above and there exists some R > 0 such that

    f(x) ≤ f(y)   for all feasible y with ‖x − y‖_2 ≤ R.

Such a point x is called a local minimizer. For the sake of contradiction, suppose that x was not a global minimizer, i.e., there exists some feasible z such that f(z) < f(x). By convexity of the constraints (and the domain D), the point tz + (1 − t)x is feasible for any 0 ≤ t ≤ 1. Furthermore, by convexity of f,

    f(tz + (1 − t)x) ≤ t f(z) + (1 − t) f(x) < f(x)

for any 0 < t < 1. Lastly, we can choose t > 0 small enough so that ‖x − (tz + (1 − t)x)‖_2 = t‖x − z‖_2 ≤ R, and we obtain a contradiction.
Algorithmically, this is a very useful property, because it means that if we keep "going downhill", i.e., reducing the achieved criterion value, and we stop when we can't do so anymore, then we've hit the global solution.
Convex optimization problems are also special because they come with a beautiful theory of convex duality and optimality, which gives us a way of understanding the solutions. We won't have time to cover any of this, but we'll mention what subgradient optimality looks like for the lasso.
Just based on the definitions, it is not hard to see that (2), (3), (5), (6) are convex problems, but (1), (4) are not. In fact, the latter two problems are known to be NP-hard, so they are in a sense even the worst kind of nonconvex problem.
6 Appendix: Geometry of the solutions
One undesirable feature of the best subset selection solution (7) is the fact that
it behaves discontinuously with y. As we change y, the active set A must change
at some point, and the coefficients will jump discontinuously, because we are just
doing least squares onto the active set. So, does the same thing happen with the
lasso solution (11)? The answer it not immediately clear. Again, as we change y,
the active set A must change at some point; but if the shrinkage term were defined
“just right”, then perhaps the coefficients of variables to leave the active set would
gracefully and continously drop to zero, and coefficients of variables to enter the
active set would continuously move form zero. This would make whole the lasso
solution continuous. Fortuitously, this is indeed the case, and the lasso solution
βb is continuous as a function of y. It might seem a daunting task to prove this,
but a certain perspective using convex geometry provides a very simple proof. The
geometric perspective in fact proves that the lasso fit X βb is nonexpansive in y, i.e.,
1-Lipschitz continuous, which is a very strong form of continuity. Define the convex
polyhedron C = {u : kX T uk∞ ≤ λ} ⊆ Rn . Some simple manipulations of the KKT
conditions show that the lasso fit is given by
X βb = (I − PC )(y),
the residual from projecting y onto C. A picture to show this (just look at the left
panel for now) is given in Figure 2.
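Here is a numerical check of this identity (my own sketch, assuming CVXPY): we solve the lasso and, separately, project y onto C = {u : ‖X^T u‖_∞ ≤ λ} (itself a small convex program), and confirm that the lasso fit equals the projection residual.

```python
# Verify X beta_hat = (I - P_C)(y) for C = {u : ||X^T u||_inf <= lambda}.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, p, lam = 40, 8, 3.0
X = rng.standard_normal((n, p))
y = X[:, :2] @ np.array([2.0, -1.0]) + rng.standard_normal(n)

b = cp.Variable(p)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - X @ b) + lam * cp.norm1(b))).solve()

u = cp.Variable(n)                                  # P_C(y): the closest point to y in C
cp.Problem(cp.Minimize(cp.sum_squares(y - u)),
           [cp.norm_inf(X.T @ u) <= lam]).solve()

print(np.max(np.abs(X @ b.value - (y - u.value))))  # should be ~0 (solver tolerance)
```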
The projection onto any convex set is nonexpansive, i.e., ‖P_C(y) − P_C(y′)‖_2 ≤ ‖y − y′‖_2 for any y, y′. This should be visually clear from the picture. Actually, the same is true of the residual map: I − P_C is also nonexpansive, and hence the lasso fit is 1-Lipschitz continuous. Viewing the lasso fit as the residual from projection onto a convex polyhedron is actually an even more fruitful perspective. Write this polyhedron as

    C = (X^T)^{-1} {v : ‖v‖_∞ ≤ λ},

where (X^T)^{-1} denotes the preimage operator under the linear map X^T. The set {v : ‖v‖_∞ ≤ λ} is a hypercube in R^p. Every face of this cube corresponds to a subset A ⊆ {1, . . . , p} of dimensions (those that achieve the maximum value λ in absolute value) and signs s_A ∈ {−1, 1}^{|A|} (which tell us which side of the cube the face lies on, in each dimension). Now, the faces of C are just faces of {v : ‖v‖_∞ ≤ λ} run through the (linear) preimage transformation, so each face of C can also be indexed by a set A ⊆ {1, . . . , p} and signs s_A ∈ {−1, 1}^{|A|}. The picture in Figure 2 attempts to convey this relationship with the black face highlighted in each of the panels.
Now imagine projecting y onto C; it will land on some face. We have just argued that this face corresponds to a set A and signs s_A. One can show that this set A is exactly the active set of the lasso solution at y, and s_A are exactly the active signs. The size of the active set |A| is the co-dimension of the face. Looking at the picture, we can see that as we wiggle y around, it will project to the same face.
Figure 2: A geometric picture of the lasso solution. The left panel shows the polyhedron C = {u : ‖X^T u‖_∞ ≤ λ} ⊆ R^n underlying all lasso fits, where each face corresponds to a particular combination of active set A and signs s_A; y projects to û = P_C(y) and the lasso fit is Xβ̂ = y − û. The right panel displays the "inverse" polyhedron {v : ‖v‖_∞ ≤ λ} ⊆ R^p, related by the preimage map (X^T)^{-1}, where the dual solutions live.
From the correspondence between faces and active sets and signs of lasso solutions, this means that A, s_A do not change as we perturb y, i.e., they are locally constant. But this isn't true for all points y; e.g., if y lies on one of the rays emanating from the lower right corner of the polyhedron in the picture, then we can see that small perturbations of y do actually change the face that it projects to, which invariably changes the active set and signs of the lasso solution. However, this is somewhat of an exceptional case, in that such points form a set of Lebesgue measure zero, and therefore we can assure ourselves that the active set and signs A, s_A are locally constant for almost every y.
From the lasso KKT conditions (9), (10), it is possible to compute the lasso solution in (5) as a function of λ, which we will write as β̂(λ), for all values of the tuning parameter λ ∈ [0, ∞]. This is called the regularization path or solution path of the problem (5). Path algorithms like the one we will describe below are not always possible; the reason that this ends up being feasible for the lasso problem (5) is that the solution path β̂(λ), λ ∈ [0, ∞] turns out to be a piecewise linear, continuous function of λ. Hence, we only need to compute and store the knots in this path, which we will denote by λ1 ≥ λ2 ≥ . . . ≥ λr ≥ 0, and the lasso solution at these knots. From this information, we can then compute the lasso solution at any value
of λ by linear interpolation.
The knots λ1 ≥ . . . ≥ λr in the solution path correspond to λ values at which the active set A(λ) = supp(β̂(λ)) changes. As we decrease λ from ∞ to 0, the knots typically correspond to the point at which a variable enters the active set; this connects the lasso to an incremental variable selection procedure like forward stepwise regression. Interestingly though, as we decrease λ, a knot in the lasso path can also correspond to the point at which a variable leaves the active set. See Figure 3.
Figure 3: An example of the lasso path. Each colored line denotes a component of the lasso solution β̂_j(λ), j = 1, . . . , p as a function of λ. The gray dotted vertical lines mark the knots λ1 ≥ λ2 ≥ . . .
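The path and its knots are easy to compute in practice. Here is a sketch using scikit-learn's lars_path in "lasso" mode (note that scikit-learn parametrizes the penalty as alpha ≈ λ/n, a scaling detail worth double-checking against your own convention); we also spot-check the piecewise linearity by comparing a solution between two knots to the linear interpolation of the solutions at the knots.

```python
# Compute the lasso path (knots and solutions at the knots) and spot-check
# piecewise linearity between two knots.
import numpy as np
from sklearn.linear_model import lars_path, Lasso

rng = np.random.default_rng(5)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.5 * rng.standard_normal(n)

alphas, active, coefs = lars_path(X, y, method="lasso")
print("knots:", alphas)

k = 1
a_mid = 0.5 * (alphas[k] + alphas[k + 1])            # a point between two knots
w = (alphas[k] - a_mid) / (alphas[k] - alphas[k + 1])
interp = (1 - w) * coefs[:, k] + w * coefs[:, k + 1]
direct = Lasso(alpha=a_mid, fit_intercept=False).fit(X, y).coef_
print("max gap from linear interpolation:", np.max(np.abs(interp - direct)))
```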
The lasso solution path was described by Osborne et al. (2000a,b) and Efron et al. (2004). Like the construction of all other solution paths that followed these seminal works, the lasso path is essentially given by an iterative or inductive verification of the KKT conditions; if we can maintain that the KKT conditions hold as we decrease λ, then we know we have a solution. The trick is to start at a value of λ at which the solution is trivial; for the lasso, this is λ = ∞, at which point we know the solution must be β̂(∞) = 0.
Why would the path be piecewise linear? The construction of the path from the
KKT conditions is actually rather technical (not difficult conceptually, but somewhat
tedious), and doesn’t shed insight onto this matter. But we can actually see it clearly
from the projection picture in Figure 2.
As λ decreases from ∞ to 0, we are shrinking (by a multiplicative factor λ) the polyhedron onto which y is projected; let's write C_λ = {u : ‖X^T u‖_∞ ≤ λ} = λC_1 to make this clear. Now suppose that y projects onto the relative interior of a certain face F of C_λ, corresponding to an active set A and signs s_A. As λ decreases, the point on the boundary of C_λ onto which y projects, call it û(λ) = P_{C_λ}(y), will move along the face F, and change linearly in λ (because we are equivalently just tracking the projection of y onto an affine space that is being scaled by λ). Thus, the lasso fit Xβ̂(λ) = y − û(λ) will also behave linearly in λ. Eventually, as we continue to decrease λ, the projected point û(λ) will move to the relative boundary of the face F; then, decreasing λ further, it will lie on a different, neighboring face F′. This face will correspond to an active set A′ and signs s_{A′} that (each) differ by only one element from A and s_A, respectively. It will then move linearly across F′, and so on.
Now we will walk through the technical derivation of the lasso path, starting at λ = ∞ and β̂(∞) = 0, as indicated above. Consider decreasing λ from ∞, and continuing to set β̂(λ) = 0 as the lasso solution. The KKT conditions (9) read

    X^T y = λ s,

where each s_j must lie in [−1, 1], since each β̂_j(λ) = 0. For large enough λ this is satisfiable (take s = X^T y / λ), but it ceases to be once λ drops below

    λ1 = max_{j=1,...,p} |X_j^T y|,    (19)

the first knot in the lasso path.
What happens next? As we decrease λ from λ1, we know that we're going to have to change β̂(λ) from 0 so that the KKT conditions remain satisfied. Let j1 denote the variable that achieves the maximum in (19). Since the subgradient was |s_{j1}| = 1 at λ = λ1, we see that we are "allowed" to make β̂_{j1}(λ) nonzero. Consider setting

    β̂_{j1}(λ) = (X_{j1}^T X_{j1})^{-1} (X_{j1}^T y − λ s_{j1}),    β̂_j(λ) = 0 for all j ≠ j1,    (20)

as λ decreases from λ1, where s_{j1} = sign(X_{j1}^T y). Note that this makes β̂(λ) a piecewise linear and continuous function of λ, so far. The KKT conditions are then

    X_{j1}^T (y − X_{j1} (X_{j1}^T X_{j1})^{-1} (X_{j1}^T y − λ s_{j1})) = λ s_{j1},

which is satisfied by construction, and

    |X_j^T (y − X_{j1} (X_{j1}^T X_{j1})^{-1} (X_{j1}^T y − λ s_{j1}))| ≤ λ,

for all j ≠ j1. Recall that the above held with strict inequality at λ = λ1 for all j ≠ j1, and by continuity of the constructed solution β̂(λ), it should continue to hold as we decrease λ for at least a little while. In fact, it will hold until one of the piecewise linear paths

    X_j^T (y − X_{j1} (X_{j1}^T X_{j1})^{-1} (X_{j1}^T y − λ s_{j1})),    j ≠ j1,

becomes equal to ±λ, at which point we have to modify the solution, because otherwise the implicit subgradient s_j = (1/λ) X_j^T (y − X_{j1} (X_{j1}^T X_{j1})^{-1} (X_{j1}^T y − λ s_{j1})) would exceed 1 in absolute value. It is not hard to see that this happens at the next knot

    λ2 = max⁺_{j ≠ j1, s_j ∈ {−1, 1}}  X_j^T (I − X_{j1} (X_{j1}^T X_{j1})^{-1} X_{j1}^T) y / (s_j − X_j^T X_{j1} (X_{j1}^T X_{j1})^{-1} s_{j1}),    (21)

and max⁺ denotes the maximum over all of its arguments that are < λ1.
To keep going: let j2, s_{j2} achieve the maximum in (21). Let A = {j1, j2}, s_A = (s_{j1}, s_{j2}), and consider setting

    β̂_A(λ) = (X_A^T X_A)^{-1} (X_A^T y − λ s_A),    β̂_{−A}(λ) = 0,    (22)

as λ decreases from λ2. Again, we can verify the KKT conditions for a stretch of decreasing λ, but will have to stop when one of the paths

    X_j^T (y − X_A (X_A^T X_A)^{-1} (X_A^T y − λ s_A)),    j ∉ A,

becomes equal to ±λ. By linearity, we can compute this next "hitting time" explicitly, just as before. Furthermore, though, we will have to check whether the active components of the computed solution in (22) are going to cross through zero, because past such a point, s_A will no longer be a proper subgradient over the active components. We can again compute this next "crossing time" explicitly, due to linearity. Therefore, we maintain that (22) is the lasso solution for all λ2 ≥ λ ≥ λ3, where λ3 is the maximum of the next hitting time and the next crossing time. For convenience, the lasso path algorithm is summarized as follows: starting from λ1 and A = {j1}, at each knot λk we move the active coefficients as in Step 1 (the update given in (23) below), compute the next hitting time (Step 2) and the next crossing time (Step 3), set λ_{k+1} to be the larger of the two, and update A and s_A accordingly (adding the hitting variable, or removing the crossing variable).
As we decrease λ from a knot λk, we can rewrite the lasso coefficient update in Step 1 as

    β̂_A(λ) = β̂_A(λk) + (λk − λ)(X_A^T X_A)^{-1} s_A,
    β̂_{−A}(λ) = 0.    (23)
We can see that we are moving the active coefficients in the direction (λk − λ)(X_A^T X_A)^{-1} s_A for decreasing λ. In other words, the lasso fitted values proceed as

    Xβ̂(λ) = Xβ̂(λk) + (λk − λ) X_A (X_A^T X_A)^{-1} s_A,

for decreasing λ. Efron et al. (2004) call X_A (X_A^T X_A)^{-1} s_A the equiangular direction, because this direction, in R^n, takes an equal angle with all X_j ∈ R^n, j ∈ A.
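This is easy to see numerically (a quick sketch of my own): by construction u = X_A (X_A^T X_A)^{-1} s_A satisfies X_A^T u = s_A, so if the active columns are standardized to unit norm, u makes the same angle (up to sign) with each of them.

```python
# Check the equiangular property of u = X_A (X_A^T X_A)^{-1} s_A.
import numpy as np

rng = np.random.default_rng(6)
n = 50
XA = rng.standard_normal((n, 4))
XA = XA / np.linalg.norm(XA, axis=0)                # unit-norm active columns
sA = np.array([1.0, -1.0, 1.0, 1.0])

u = XA @ np.linalg.solve(XA.T @ XA, sA)
print(XA.T @ u)                                     # equals s_A (entries +/- 1)
cos = (XA.T @ u) / np.linalg.norm(u)                # cosines of the angles with each X_j
print(np.abs(cos))                                  # all equal
```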
For this reason, the lasso path algorithm described above is also often referred to as the least angle regression path algorithm in "lasso mode", though we have not mentioned this yet to avoid confusion. Least angle regression is considered as another algorithm by itself, where we skip Step 3 altogether. In words, Step 3 disallows any component path to cross through zero. The left side of the plot in Figure 3 visualizes the distinction between least angle regression and lasso estimates: the dotted black line displays the least angle regression component path, crossing through zero, while the lasso component path remains at zero.
Lastly, an alternative expression for the coefficient update in (23) (the update in Step 1) is

    β̂_A(λ) = β̂_A(λk) + ((λk − λ)/λk) (X_A^T X_A)^{-1} X_A^T r(λk),
    β̂_{−A}(λ) = 0,    (24)

where r(λk) = y − X_A β̂_A(λk) is the residual (from the fitted lasso model) at λk. This follows because, recall, λk s_A are simply the inner products of the active variables with the residual at λk, i.e., λk s_A = X_A^T (y − X_A β̂_A(λk)). In words, we can see that the update for the active lasso coefficients in (24) is in the direction of the least squares coefficients of the residual r(λk) on the active variables X_A.
Compatibility result. Assume that X satisfies the compatibility condition with respect to the true support set S, with compatibility constant φ0 > 0; that is,

    (1/n) ‖Xv‖²_2 ≥ (φ0²/s0) ‖v_S‖²_1   for all v ∈ R^p such that ‖v_{−S}‖_1 ≤ 3‖v_S‖_1.    (25)

While this may look like an odd condition, we will see it being useful in the proof below, and we will also have some help interpreting it when we discuss the restricted eigenvalue condition shortly. Roughly, it means the (truly active) predictors can't be too correlated.
Recall from our previous analysis for the lasso estimator in penalized form (5), we showed that on an event E_δ of probability at least 1 − δ,

    ‖Xβ̂ − Xβ0‖²_2 ≤ 2σ √(2n log(ep/δ)) ‖β̂ − β0‖_1 + 2λ(‖β0‖_1 − ‖β̂‖_1).

Choosing λ large enough and applying the triangle inequality then gave us the slow rate we derived before. Now we choose λ just slightly larger (by a factor of 2): λ ≥ 2σ √(2n log(ep/δ)). The remainder of the analysis will be performed on the event E_δ, and we will no longer make this explicit until the very end. Then

    ‖Xβ̂ − Xβ0‖²_2 ≤ λ‖β̂ − β0‖_1 + 2λ(‖β0‖_1 − ‖β̂‖_1)
                   ≤ λ(‖β̂_S − β_{0,S}‖_1 + ‖β̂_{−S}‖_1) + 2λ(‖β̂_S − β_{0,S}‖_1 − ‖β̂_{−S}‖_1)
                   = 3λ‖β̂_S − β_{0,S}‖_1 − λ‖β̂_{−S}‖_1,

where the first inequality used our choice of λ, the second followed from the triangle inequality (one application for each of the two terms), and we have used that β_{0,−S} = 0. As ‖Xβ̂ − Xβ0‖²_2 ≥ 0, we have shown

    ‖β̂_{−S} − β_{0,−S}‖_1 ≤ 3‖β̂_S − β_{0,S}‖_1,

and thus we may apply the compatibility condition (25) to the vector v = β̂ − β0.
This gives us two bounds: one on the fitted values, and the other on the coefficients. Both start with the key inequality (from the second-to-last display)

    ‖Xβ̂ − Xβ0‖²_2 ≤ 3λ ‖β̂_S − β_{0,S}‖_1.    (26)

For the fitted values, we upper bound the right-hand side of the key inequality (26) using the compatibility condition,

    ‖Xβ̂ − Xβ0‖²_2 ≤ 3λ √(s0/(nφ0²)) ‖Xβ̂ − Xβ0‖_2,

or, dividing through both sides by ‖Xβ̂ − Xβ0‖_2, then squaring both sides, and dividing by n,

    (1/n) ‖Xβ̂ − Xβ0‖²_2 ≤ 9 s0 λ² / (n²φ0²).
Plugging in λ = 2σ √(2n log(ep/δ)), we have shown that

    (1/n) ‖Xβ̂ − Xβ0‖²_2 ≤ 72 σ² s0 log(ep/δ) / (nφ0²),    (27)

with probability at least 1 − δ. Notice the similarity between (27) and (8): both provide in-sample risk bounds on the order of s0 log p / n, but the bound for the lasso requires a strong compatibility assumption on the predictor matrix X, which roughly means the predictors can't be too correlated.
For the coefficients, we lower bound the left-hand side of the key inequality (26) using the compatibility condition,

    (nφ0²/s0) ‖β̂_S − β_{0,S}‖²_1 ≤ 3λ ‖β̂_S − β_{0,S}‖_1,

so, dividing through both sides by ‖β̂_S − β_{0,S}‖_1, and recalling ‖β̂_{−S}‖_1 ≤ 3‖β̂_S − β_{0,S}‖_1, which implies by the triangle inequality that ‖β̂ − β0‖_1 ≤ 4‖β̂_S − β_{0,S}‖_1, we get

    ‖β̂ − β0‖_1 ≤ 12 s0 λ / (nφ0²).

Plugging in λ = 2σ √(2n log(ep/δ)), we have shown that

    ‖β̂ − β0‖_1 ≤ (24 σ s0 / φ0²) √(2 log(ep/δ) / n),    (28)

with probability at least 1 − δ. This is an error bound on the order of s0 √(log p / n) for the lasso coefficients (in the ℓ1 norm).
Restricted eigenvalue result. Instead of compatibility, we may assume that X satisfies the restricted eigenvalue condition with constant φ0 > 0, i.e.,

    (1/n) ‖Xv‖²_2 ≥ φ0² ‖v‖²_2   for all subsets J ⊆ {1, . . . , p} such that |J| = s0
    and all v ∈ R^p such that ‖v_{J^c}‖_1 ≤ 3‖v_J‖_1.    (29)

This produces essentially the same results as in (27), (28), but additionally, in the ℓ2 norm,

    ‖β̂ − β0‖²_2 ≲ s0 log p / (nφ0²),

with probability tending to 1.
Note the similarity between (29) and the compatibility condition (25). The former is actually stronger, i.e., it implies the latter, because ‖β‖²_2 ≥ ‖β_J‖²_2 ≥ ‖β_J‖²_1 / s0. We may interpret the restricted eigenvalue condition roughly as follows: the requirement (1/n)‖Xv‖²_2 ≥ φ0² ‖v‖²_2 for all v ∈ R^p would be a lower bound of φ0² on the smallest eigenvalue of (1/n)X^T X; we don't require this (as it would of course mean that X has full column rank, which cannot happen when p > n), but instead require that the same inequality hold for v that are "mostly" supported on small subsets J of variables, with |J| = s0.
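The condition is hard to verify in practice. A brute-force partial check (my own sketch, and only a necessary condition): restricting attention to vectors exactly supported on a subset J of size s0, the inequality reduces to a minimum eigenvalue of (1/n) X_J^T X_J over all such J, which upper bounds the best achievable φ0²; the full condition also involves vectors with small mass off J.

```python
# Upper bound on phi_0^2 from exactly s0-sparse vectors (necessary check only).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n, p, s0 = 50, 12, 3
X = rng.standard_normal((n, p))

phi2_upper = min(
    np.linalg.eigvalsh(X[:, list(J)].T @ X[:, list(J)] / n)[0]   # smallest eigenvalue
    for J in combinations(range(p), s0)
)
print("upper bound on phi_0^2:", phi2_upper)
```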
We aim to show that, at some value of λ, the lasso solution β̂ in (5) has an active set that exactly equals the true support set,

    A = supp(β̂) = S,

with high probability. We actually aim to show that the signs also match,

    sign(β̂_S) = sign(β_{0,S}),

with high probability. The primal-dual witness method basically plugs the true support S into the KKT conditions for the lasso (9), (10), and checks when they can be verified.
We start by breaking up (9) into two blocks, over S and S^c. Suppose that supp(β̂) = S at a solution β̂. Then the KKT conditions become

    X_S^T (y − X_S β̂_S) = λ s_S,    (30)
    X_{−S}^T (y − X_S β̂_S) = λ s_{−S}.    (31)

Hence, if we can satisfy the two conditions (30), (31) with a proper subgradient s, such that

    s_S = sign(β_{0,S})   and   ‖s_{−S}‖_∞ = max_{j ∉ S} |s_j| < 1,

then we have met our goal: we have recovered a (unique) lasso solution whose active set is S, and whose active signs are sign(β_{0,S}).
So, let's solve for β̂_S in the first block (30). Just as we did in the work on basic properties of the lasso estimator, this yields

    β̂_S = (X_S^T X_S)^{-1} (X_S^T y − λ sign(β_{0,S})),    (32)

where we have substituted s_S = sign(β_{0,S}). From (31), this implies that s_{−S} must satisfy

    s_{−S} = (1/λ) X_{−S}^T (I − X_S (X_S^T X_S)^{-1} X_S^T) y + X_{−S}^T X_S (X_S^T X_S)^{-1} sign(β_{0,S}).    (33)
To lay it out, for concreteness, the primal-dual witness method proceeds as follows:
1. Solve for the lasso solution over the S components, β̂_S, as in (32), and set β̂_{−S} = 0.
2. Solve for the subgradient over the S^c components, s_{−S}, as in (33).
3. Check that sign(β̂_S) = sign(β_{0,S}), and that ‖s_{−S}‖_∞ < 1. If these two checks pass, then we have certified there is a (unique) lasso solution that exactly recovers the true support and signs.
A small numerical sketch of these steps is given below.
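The sketch below is my own code, following steps 1-3; the data, λ, and signal strengths are arbitrary choices, and with weaker signal or smaller λ the checks can fail.

```python
# Primal-dual witness construction: beta_S via (32), s_{-S} via (33), then the
# two checks of Step 3.
import numpy as np

rng = np.random.default_rng(8)
n, p, s0, sigma, lam = 200, 30, 4, 1.0, 60.0
beta0 = np.zeros(p)
beta0[:s0] = np.array([3.0, -2.5, 2.0, 1.5])
S = np.arange(p) < s0
X = rng.standard_normal((n, p))
y = X @ beta0 + sigma * rng.standard_normal(n)

XS, Xm = X[:, S], X[:, ~S]
sgn = np.sign(beta0[S])

# Step 1: beta_S as in (32)
betaS = np.linalg.solve(XS.T @ XS, XS.T @ y - lam * sgn)
# Step 2: s_{-S} as in (33)
P_perp = np.eye(n) - XS @ np.linalg.solve(XS.T @ XS, XS.T)
s_minus = Xm.T @ (P_perp @ y) / lam + Xm.T @ XS @ np.linalg.solve(XS.T @ XS, sgn)
# Step 3: sign consistency and strict dual feasibility
print("signs recovered:", np.array_equal(np.sign(betaS), sgn))
print("max |s_j| off the support:", np.max(np.abs(s_minus)), "(need < 1)")
```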
The success of the primal-dual witness method hinges on Step 3. We can plug in y = Xβ0 + ε, and rewrite the required conditions, sign(β̂_S) = sign(β_{0,S}) and ‖s_{−S}‖_∞ < 1, as

    sign(β_{0,j} + Δ_j) = sign(β_{0,j}) for all j ∈ S,  where Δ_S = (X_S^T X_S)^{-1} X_S^T ε − λ (X_S^T X_S)^{-1} sign(β_{0,S}),    (34)

and

    (1/λ) ‖X_{−S}^T (I − X_S (X_S^T X_S)^{-1} X_S^T) ε + X_{−S}^T X_S (X_S^T X_S)^{-1} sign(β_{0,S})‖_∞ < 1.    (35)
As ε ∼ N(0, σ²I), we see that the two required conditions have been reduced to statements about Gaussian random variables. The arguments we need to check these conditions are actually quite simple, but we will need to make assumptions on X and β0. These are: a mutual incoherence condition, max_{j ∉ S} ‖(X_S^T X_S)^{-1} X_S^T X_j‖_1 ≤ 1 − γ for some γ > 0; a minimum eigenvalue condition on X_S^T X_S, with constant C > 0; and a minimum signal condition on β_{0,min} = min_{j ∈ S} |β_{0,j}|.
With these assumptions in place on X and β0, let's first consider verifying (34), and examine Δ_S, whose components Δ_j, j ∈ S, are as defined in (34). We have

    ‖Δ_S‖_∞ ≤ ‖(X_S^T X_S)^{-1} X_S^T ε‖_∞ + λ ‖(X_S^T X_S)^{-1}‖_∞.

Note that w = (X_S^T X_S)^{-1} X_S^T ε is Gaussian with mean zero and covariance σ²(X_S^T X_S)^{-1}, so the variances of the components of w are bounded by

    σ² Λ_max((X_S^T X_S)^{-1}) ≤ σ²n / C,

where we have used the minimum eigenvalue assumption. By a standard result on the maximum of Gaussians, for any δ > 0, it holds with probability at least 1 − δ that

    ‖Δ_S‖_∞ ≤ (σ/√C) √(2n log(es0/δ)) + λ ‖(X_S^T X_S)^{-1}‖_∞ ≤ β_{0,min} + a,

where in the second inequality we used the minimum signal condition, writing a for the leftover term. As long as a < 0, the sign condition (34) is verified.
Now, let's consider verifying (35). Using the mutual incoherence condition, we have

    (1/λ) ‖X_{−S}^T (I − X_S (X_S^T X_S)^{-1} X_S^T) ε + X_{−S}^T X_S (X_S^T X_S)^{-1} sign(β_{0,S})‖_∞ ≤ ‖z‖_∞ + (1 − γ),

where z = (1/λ) X_{−S}^T (I − X_S (X_S^T X_S)^{-1} X_S^T) ε = (1/λ) X_{−S}^T P_{X_S}^⊥ ε, with P_{X_S}^⊥ the projection matrix onto the orthogonal complement of the column space of X_S. Notice that z is Gaussian with mean zero and covariance (σ²/λ²) X_{−S}^T P_{X_S}^⊥ X_{−S}, so the components of z have variances bounded by

    (σ²n/λ²) Λ_max(P_{X_S}^⊥) ≤ σ²n/λ².
Therefore, again by the maximal Gaussian inequality, for any δ > 0, it holds with probability at least 1 − δ that

    (1/λ) ‖X_{−S}^T (I − X_S (X_S^T X_S)^{-1} X_S^T) ε + X_{−S}^T X_S (X_S^T X_S)^{-1} sign(β_{0,S})‖_∞
        ≤ (σ/λ) √(2n log(e(p − s0)/δ)) + (1 − γ)
        = 1 + ((σ/λ) √(2n log(e(p − s0)/δ)) − γ),

writing b for the term in parentheses. As long as b < 0, the second condition (35) is verified.
As for the conditions themselves: the mutual incoherence condition requires the regression coefficients (X_S^T X_S)^{-1} X_S^T X_j, given by regressing each X_j on the truly active variables X_S, to be small (in ℓ1 norm) for all j ∉ S. In other words, no truly inactive variables can be highly correlated with (or well-explained, in a linear projection sense, by) the truly active variables. Finally, the minimum signal condition ensures that the nonzero entries of the true coefficient vector β0 are big enough to detect. This is quite restrictive and is not needed for risk bounds, but it is crucial for support recovery.
8.2 Minimax bounds
Under the linear model y = Xβ0 + ε with X fixed, subject to the scaling ‖X_j‖²_2 ≤ n, for j = 1, . . . , p, and ε ∼ N(0, σ²I), Raskutti et al. (2011) derive upper and lower bounds on the minimax prediction error

    M(s0, n, p) = inf_{β̂} sup_{‖β0‖_0 ≤ s0} (1/n) ‖Xβ̂ − Xβ0‖²_2.

(Their analysis is actually considerably more broad than this, and covers the coefficient error ‖β̂ − β0‖_2, as well as ℓq constraints on β0, for q ∈ [0, 1].) They prove that, under no additional assumptions on X,

    M(s0, n, p) ≲ s0 log(p/s0) / n,

with probability tending to 1.
They also prove that, under a type of restricted eigenvalue condition in which

    c0 ≤ (1/n) ‖Xv‖²_2 / ‖v‖²_2 ≤ c1   for all v ∈ R^p such that ‖v‖_0 ≤ 2s0,

it holds that

    M(s0, n, p) ≳ s0 log(p/s0) / n,

with probability at least 1/2.
The implication is that, for some X, minimax optimal prediction may be possible at a faster rate than s0 log(p/s0)/n; but for low correlations, this is the rate we should expect. (This is consistent with the worst-case-X analysis of Foster & George (1994), who actually show the worst-case behavior is attained in the orthogonal X case.)