The Bayesian Elastic Net
Q. Li and N. Lin, Department of Mathematics, Washington University in St. Louis, St. Louis, MO (qli@math.wustl.edu, nlin@math.wustl.edu)
Abstract. The elastic net (Zou and Hastie 2005) is a flexible regularization and variable selection method that uses a mixture of L1 and L2 penalties. It is particularly useful when the number of predictors greatly exceeds the sample size. This paper proposes a Bayesian method that solves the elastic net model using a Gibbs sampler. While the marginal posterior mode of the regression coefficients is equivalent to the estimate given by the non-Bayesian elastic net, the Bayesian elastic net has two major advantages. Firstly, as a Bayesian method, it yields distributional results on the estimates directly, making statistical inference easier. Secondly, it chooses the two penalty parameters simultaneously, avoiding the “double shrinkage problem” of the elastic net method. Real data examples and simulation studies show that the Bayesian elastic net is comparable in prediction accuracy but performs better in variable selection.
Keywords: Bayesian analysis, elastic net, Gibbs sampler, regularization, variable
selection.
1 Introduction
Regression regularization methods have been developed to carry out parameter estimation
and variable selection simultaneously. Tibshirani (1996) proposed the lasso estimator
which estimates the linear regression coefficients through an L1 -penalized least squares
criterion. Due to the nature of the L1 -penalty, the lasso often yields solutions with some
components being exactly 0. The lasso estimates can be computed efficiently by the lars algorithm proposed by Efron et al. (2004), whose computational cost is of the same order as that of a single ordinary least squares (ols) fit. While demonstrating promising
performance for many problems, the lasso estimator does have some shortcomings. Zou
and Hastie (2005) emphasized three inherent drawbacks of the lasso estimator. Firstly,
due to the nature of the convex optimization problem, the lasso method cannot select
more predictors than the sample size. However, in practice there are often studies that involve far more predictors than observations, e.g. microarray data analysis (Guyon et al.
2002). Secondly, when there is some group structure among the predictors, the lasso
estimator usually selects only one predictor from a group while ignoring others. Thirdly,
when the predictors are highly correlated, the lasso estimator performs unsatisfactorily.
Zou and Hastie (2005) proposed the elastic net (en) estimator to achieve improved
performance in these cases. The en estimator can also be viewed as a penalized least
squares method where the penalty term is a convex combination of the lasso penalty
and the ridge penalty. Simulations and biological data examples showed that the en
estimator performs well in the three scenarios where the lasso estimator falls short. A brief review of en is as follows.
Suppose that we have n observations and p predictors. Denote the response vari-
able by y = (y1 , . . . , yn )T and the design matrix by X = (x1 , . . . , xp ), where xj =
(x1j , . . . , xnj )T , j = 1, . . . , p. Applying a linear transformation if necessary, we assume that

\sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} x_{ij}^2 = 1, \qquad \text{for } j = 1, \ldots, p.
For any fixed nonnegative penalty parameters λ1 and λ2 , the en loss function is
defined as
L(\lambda_1, \lambda_2, \beta) = (y - X\beta)^T (y - X\beta) + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2,

where

\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|, \qquad \text{and} \qquad \|\beta\|_2 = \Big( \sum_{j=1}^{p} \beta_j^2 \Big)^{1/2}.
Then the naïve en estimator β̂en is defined as β̂en = argmin_β L(λ1, λ2, β). The name “naïve” comes from the fact that β̂en suffers from the “double shrinkage problem” (Zou and Hastie 2005). To correct the “double shrinkage problem”, (1 + λ2)β̂en instead of β̂en is used as the en estimator. β̂en can be computed efficiently through the lars-en algorithm (Zou and Hastie 2005).
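To make the objective above concrete, here is a minimal coordinate-descent sketch for the naïve en estimator under the standardization assumed earlier (each column of X centered with unit sum of squares), followed by the (1 + λ2) rescaling. It is an illustration, not the lars-en algorithm used in the paper, and the data in the usage lines are synthetic.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def naive_elastic_net(X, y, lam1, lam2, n_iter=200):
    """Coordinate descent for (y - Xb)'(y - Xb) + lam1*||b||_1 + lam2*||b||_2^2.

    Assumes y is centered and each column of X is centered with unit sum of squares,
    so that x_j' x_j = 1.
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                      # current residual
    for _ in range(n_iter):
        for j in range(p):
            r_j = r + X[:, j] * beta[j]   # partial residual excluding predictor j
            z = X[:, j] @ r_j
            beta_j_new = soft_threshold(z, lam1 / 2.0) / (1.0 + lam2)
            r = r_j - X[:, j] * beta_j_new
            beta[j] = beta_j_new
    return beta

# Example with synthetic data; the rescaling corrects the double shrinkage.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
X -= X.mean(axis=0); X /= np.sqrt((X**2).sum(axis=0))
y = X @ np.array([3, 1.5, 0, 0, 2, 0, 0, 0]) + rng.standard_normal(50)
y -= y.mean()
beta_naive = naive_elastic_net(X, y, lam1=1.0, lam2=0.5)
beta_en = (1 + 0.5) * beta_naive
```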
However, there are two limitations with the en method:
1. The lars-en algorithm only computes point estimates of the regression coefficients, so statistical inference for the coefficients is difficult.
2. The two penalty parameters (λ1, λ2) are not chosen simultaneously, which gives rise to the “double shrinkage problem” described above.
The first limitation is common to most regression regularization methods. A number of authors have considered standard error estimation for the lasso estimator. Tibshirani (1996) suggested using a ridge regression approximation or the bootstrap method. The problem with the ridge regression approximation is that the standard error estimate is 0 whenever the parameter estimate is 0. The sandwich estimator proposed by Fan and Li (2001) suffers from the same problem. Knight and Fu (2000) studied the bootstrap method for standard error estimation and pointed out that the bootstrap estimates are asymptotically biased when some true parameters are 0 or close to 0. Osborne et al. (2000) gave an alternative approximation that leads to better standard error estimates,
but it is based on the implicit assumption that the estimators are approximately linear, which holds only for very small λ1. All the aforementioned studies take a frequentist point of view. In contrast, statistical inference for Bayesian methods is more straightforward. The Bayesian lasso (bl) (Park and Casella 2008; Hans 2009a,b) not only gives interval estimates but also provides the posterior distribution of the lasso estimator. bl assigns a double exponential prior to β, which is essentially a scale mixture of normals where the mixing is through an exponential distribution. Related works and extensions of bl include the horseshoe estimator by Carvalho et al. (2008), where the mixing is through a half-Cauchy distribution, and the normal-gamma prior by Griffin and Brown (2009), where the mixing is through a gamma distribution. The latter is further applied and explored by Scheipl and Kneib (2009).
In this paper, we propose a Bayesian analysis of the en problem and solve it using a Gibbs sampler. As an automatic byproduct of the Markov chain Monte Carlo procedure, inference on the estimates is straightforward. In addition, our method chooses the penalty parameters λ1 and λ2 simultaneously by a Monte Carlo em algorithm proposed by Casella (2001), thus avoiding the “double shrinkage problem”. Section 2 describes the model and the Gibbs sampler, as well as criteria for variable selection. Section 3 presents simulation studies comparing the Bayesian elastic net (ben) method with other regression regularization methods. Sections 4 and 5 examine the performance of ben on real data examples. Some discussion is provided in Section 6. An appendix contains technical proofs and derivations.
2 The model and the Gibbs sampler

Consider the linear regression model

y \mid X, \beta \sim N(X\beta, \sigma^2 I_n).
We assume that all analysis hereafter is conditional on X. Zou and Hastie (2005) pointed
out that, under these assumptions, solving the en problem is equivalent to finding the
marginal posterior mode of β | y when the prior distribution of β is given by
\pi(\beta) \propto \exp\left\{ -\lambda_1 \|\beta\|_1 - \lambda_2 \|\beta\|_2^2 \right\}.  (1)
However, under (1), neither the conditional posterior mode of β | σ 2 , y nor the marginal
posterior mode of β | y would be equivalent to the en estimator β̂en unless the analysis
is conditional on σ 2 or σ 2 is given a point-mass prior. Instead we propose a prior for β,
which is conditional on σ 2 , as
\pi(\beta \mid \sigma^2) \propto \exp\left\{ -\frac{1}{2\sigma^2} \left( \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \right) \right\}.  (2)
A noninformative prior is then assigned for σ 2 , i.e. π(σ 2 ) ∝ 1/σ 2 . In this setup, the
marginal posterior distribution for β | y has the pdf
f(\beta \mid y) = \int_0^{\infty} \frac{C(\lambda_1, \lambda_2, \sigma^2)}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{\|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2}{2\sigma^2} \right\} \pi(\sigma^2)\, d\sigma^2,  (3)
where C(λ1, λ2, σ²) is the normalizing constant that appears in Lemma 1. From (3), we can see that the en estimator β̂en maximizes the integrand for each σ², and thus equals the marginal posterior mode of β | y. This conditional prior specification is also used by Park and Casella (2008) to accelerate the convergence of their Gibbs sampler.
Based on the discussion above, we have the following hierarchical model.
y \mid \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_n),
\pi(\beta \mid \sigma^2) \propto \exp\left\{ -\frac{1}{2\sigma^2}\left( \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2 \right) \right\},  (4)
\pi(\sigma^2) \propto \frac{1}{\sigma^2}.
To better understand this prior specification, it is informative to see how the hy-
perparameters (λ1 , λ2 ) affect the shape of the prior. Figure 1 shows the prior density
function (2) with different values of λ1 and λ2 , assuming σ 2 = 1 and p = 1. Three
different choices of (λ1 , λ2 ) are considered, (5,0), (1,5) and (0,5). These three choices
correspond to the lasso prior (Park and Casella 2008), the en prior and the ridge prior,
respectively. It is worth noting that the lasso prior and en prior are not differentiable at
0, but the ridge prior is. This is the essential difference between sparse model priors and
non-sparse model priors, as pointed out by Fan and Li (2001). To yield sparse solutions
that are continuous in the data, the prior must be non-differentiable at 0. Furthermore,
from the plot we can see that the en prior is less “sharp” than the lasso prior. In fact, it is a compromise between the lasso prior and the ridge prior. Qualitatively, this suggests that the en method is not as “aggressive” as the lasso method in excluding predictors.
[Figure 1: the prior density (2) with σ² = 1 and p = 1 for three choices of (λ1, λ2): (5, 0), the lasso prior; (1, 5), the en prior; and (0, 5), the ridge prior.]
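As a quick way to reproduce the qualitative comparison in Figure 1, the sketch below numerically normalizes the univariate prior (2) for σ² = 1 under the three (λ1, λ2) settings. It is an illustrative check, not code from the paper.

```python
import numpy as np
from scipy.integrate import quad

def en_prior_density(beta, lam1, lam2, sigma2=1.0):
    """Normalized univariate density proportional to exp{-(lam1*|b| + lam2*b^2)/(2*sigma2)}."""
    kernel = lambda b: np.exp(-(lam1 * np.abs(b) + lam2 * b**2) / (2.0 * sigma2))
    norm_const, _ = quad(kernel, -np.inf, np.inf)
    return kernel(beta) / norm_const

grid = np.linspace(-2, 2, 401)
settings = {"lasso (5,0)": (5.0, 0.0), "en (1,5)": (1.0, 5.0), "ridge (0,5)": (0.0, 5.0)}
densities = {name: en_prior_density(grid, l1, l2) for name, (l1, l2) in settings.items()}
# The lasso and en curves have a kink at 0; the ridge curve is smooth there.
```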
The proof for this lemma is in the Appendix. This lemma implies that we can treat
βj | σ 2 as a mixture of normal distributions, N (0, σ 2 (t − 1)/(λ2 t)), where the mixing
distribution is over the variance σ 2 (t − 1)/(λ2 t) and is given by a truncated gamma
distribution with shape parameter 1/2, scale parameter 8λ2σ²/λ1² and support (1, ∞), denoted tg(1/2, 8λ2σ²/λ1², (1, ∞)).
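This representation can also be checked numerically: integrating the N(0, σ²(t − 1)/(λ2 t)) density against the truncated gamma mixing density should recover a function proportional to exp{−(λ1|β| + λ2β²)/(2σ²)}. A small sketch of the check (an illustration under arbitrary test values, not part of the paper) is below.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma, norm

lam1, lam2, sigma2 = 1.0, 2.0, 1.5
theta = 8.0 * lam2 * sigma2 / lam1**2           # scale of the truncated gamma

def tg_pdf(t):
    """Gamma(shape=1/2, scale=theta) density truncated to (1, inf)."""
    return gamma.pdf(t, 0.5, scale=theta) / gamma.sf(1.0, 0.5, scale=theta)

def mixture_density(b):
    """Marginal density of beta_j | sigma^2 obtained by mixing the normal over t."""
    integrand = lambda t: norm.pdf(b, 0.0, np.sqrt(sigma2 * (t - 1.0) / (lam2 * t))) * tg_pdf(t)
    val, _ = quad(integrand, 1.0, np.inf)
    return val

def en_kernel(b):
    return np.exp(-(lam1 * abs(b) + lam2 * b**2) / (2.0 * sigma2))

# The ratio should be (numerically) constant in b, equal to the normalizing constant.
ratios = [mixture_density(b) / en_kernel(b) for b in (0.5, 1.0, 2.0)]
```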
By Lemma 1, we have the following hierarchical model.
y \mid \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_n),
\beta \mid \tau, \sigma^2 \sim \prod_{j=1}^{p} N\!\left(0, \left(\frac{\lambda_2}{\sigma^2}\,\frac{\tau_j}{\tau_j - 1}\right)^{-1}\right),
\tau \mid \sigma^2 \sim \prod_{j=1}^{p} \mathrm{tg}\!\left(\frac{1}{2}, \frac{8\lambda_2\sigma^2}{\lambda_1^2}, (1, \infty)\right),
\pi(\sigma^2) \propto \frac{1}{\sigma^2}.
After introducing τ , the model becomes computationally easier to solve since the full
conditional distributions now are
\beta \mid y, \sigma^2, \tau \sim N\!\left(A^{-1}X^T y,\; \sigma^2 A^{-1}\right), \quad \text{where } A = X^T X + \lambda_2\,\mathrm{diag}\!\left(\frac{\tau_1}{\tau_1 - 1}, \cdots, \frac{\tau_p}{\tau_p - 1}\right),

(\tau - 1_p) \mid y, \sigma^2, \beta \sim \prod_{j=1}^{p} \mathrm{gig}\!\left(\lambda = \frac{1}{2},\; \psi = \frac{\lambda_1^2}{4\lambda_2\sigma^2},\; \chi = \frac{\lambda_2\beta_j^2}{\sigma^2}\right),  (5)

\pi(\sigma^2 \mid y, \beta, \tau) \propto \left(\frac{1}{\sigma^2}\right)^{\frac{n}{2}+p+1} \left\{\Gamma_U\!\left(\frac{1}{2}, \frac{\lambda_1^2}{8\sigma^2\lambda_2}\right)\right\}^{-p} \exp\left\{ -\frac{1}{2\sigma^2}\left( \|y - X\beta\|_2^2 + \lambda_2\sum_{j=1}^{p}\frac{\tau_j}{\tau_j - 1}\beta_j^2 + \frac{\lambda_1^2}{4\lambda_2}\sum_{j=1}^{p}\tau_j \right) \right\},  (6)

where 1_p is a p-dimensional vector of 1's, \Gamma_U(\alpha, x) = \int_x^{\infty} t^{\alpha-1}e^{-t}\,dt is the upper incomplete gamma function (Armido and Alfred 1986), and gig(λ, ψ, χ) denotes the generalized inverse Gaussian distribution (Jørgensen 1982), whose pdf is

f(x \mid \lambda, \chi, \psi) = \frac{(\psi/\chi)^{\lambda/2}}{2K_\lambda(\sqrt{\psi\chi})}\, x^{\lambda-1} \exp\left\{ -\frac{1}{2}\left( \chi x^{-1} + \psi x \right) \right\},  (7)

for x > 0, where K_λ(·) is the modified Bessel function of the third kind with order λ.
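A minimal Gibbs-sampler sketch based on the full conditionals above follows, with σ² held fixed for simplicity; the paper additionally updates σ² from (6), whose nonstandard density is omitted here. The gig draws use scipy's geninvgauss parameterization, mapped from (λ, ψ, χ) as p = λ, b = √(ψχ), scale = √(χ/ψ); this mapping, the fixed-σ² simplification, and all function names are our assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import geninvgauss

def ben_gibbs_fixed_sigma(X, y, lam1, lam2, sigma2, n_iter=2000, seed=0):
    """Gibbs sampler for beta and tau from the full conditionals, with sigma^2 fixed."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    tau = np.full(p, 2.0)                               # initial values tau_j > 1
    draws = np.empty((n_iter, p))
    for it in range(n_iter):
        # beta | y, sigma^2, tau ~ N(A^{-1} X'y, sigma^2 A^{-1})
        A = XtX + lam2 * np.diag(tau / (tau - 1.0))
        A_inv = np.linalg.inv(A)
        beta = rng.multivariate_normal(A_inv @ Xty, sigma2 * A_inv)
        # (tau_j - 1) | y, sigma^2, beta ~ gig(1/2, psi, chi_j), as in (5)
        psi = lam1**2 / (4.0 * lam2 * sigma2)
        chi = np.maximum(lam2 * beta**2 / sigma2, 1e-12)  # guard against chi = 0
        b = np.sqrt(psi * chi)
        scale = np.sqrt(chi / psi)
        tau = 1.0 + geninvgauss.rvs(0.5, b, scale=scale, random_state=rng)
        draws[it] = beta
    return draws
```

The resulting draws can then be summarized by posterior medians and credible intervals, as is done for the real data examples in Sections 4 and 5.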
3 Simulation studies
We carry out Monte Carlo simulations to compare the performance of ben, bl, en and
lasso in prediction accuracy and variable selection. For en, instead of the penalty parameters (λ1, λ2), it is equivalent to use (s, λ2), where s = ‖β‖1/‖βols‖1 with βols being the ols estimate (Zou and Hastie 2005). Similarly, s instead of λ1 is used for lasso.
The data are simulated from the following linear model,
y = Xβ + ε, ε ∼ N (0, σ 2 I).
Each simulated sample is partitioned into a training set, a validation set and a testing set. The validation set is used to select s for lasso and (s, λ2) for en. After the penalty parameters are selected, we combine the training and validation sets to estimate β. For the Bayesian methods, we directly combine the training and validation sets for estimation.
For the first two simulation studies, we simulate 50 data sets, each of which has
20 observations for the training set, 20 for the validation set, and 200 for the testing
set. In Simulation 1, we set β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T and σ² = 9. The design matrix X is generated from the multivariate normal distribution with mean 0, variance 1 and pairwise correlations between xi and xj equal to 0.5^{|i−j|} for all i and j. In Simulation 2, we set β = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)^T, and leave other setups the same as in Simulation 1. In Simulation 3, we first generate Z1, Z2 and Z3 independently from N(0, 1). Then let xi = Z1 + εi, i = 1, . . . , 5, xi = Z2 + εi, i = 6, . . . , 10, xi = Z3 + εi, i = 11, . . . , 15, and xi ∼ N(0, 1), i = 16, . . . , 30, where εi ∼ N(0, 0.01), i = 1, . . . , 15. We
perform 50 simulations, in each of which we have a training set of size 100, a validation
set of size 100 and a testing set of size 400. The parameters are set as σ 2 = 225 and
\beta = (\underbrace{3, \ldots, 3}_{5}, \underbrace{3, \ldots, 3}_{5}, \underbrace{3, \ldots, 3}_{5}, \underbrace{0, \ldots, 0}_{15})^T.
In Simulation 4, we set the sizes of the training set and the validation set both to 200,
while leaving other setups the same as in Simulation 3. In Simulation 5, we set the sizes
of the training set and the validation set both to 20, and the true parameter value to
\beta = (\underbrace{3, \ldots, 3}_{10}, \underbrace{0, \ldots, 0}_{10}, \underbrace{3, \ldots, 3}_{10})^T.
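For concreteness, a sketch of the data-generating step for Simulation 1 (pairwise correlation 0.5^{|i−j|}, β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T, σ² = 9, and the 20/20/200 split described above) is given below; it is illustrative rather than the authors' simulation script.

```python
import numpy as np

def simulate_study1(seed=0):
    """Generate one training/validation/testing sample for Simulation 1."""
    rng = np.random.default_rng(seed)
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    p, sigma = len(beta), 3.0                          # sigma^2 = 9
    cov = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    sizes = {"train": 20, "validate": 20, "test": 200}
    data = {}
    for name, n in sizes.items():
        X = rng.multivariate_normal(np.zeros(p), cov, size=n)
        y = X @ beta + sigma * rng.standard_normal(n)
        data[name] = (X, y)
    return data

data = simulate_study1()
X_train, y_train = data["train"]
```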
Figure 2: The ROC curves for the four simulation studies. ben + ci (solid line), bl +
ci (dotted line), ben + sn (dot-dashed line), bl + sn (dashed line), en (+) and lasso
(×).
Figure 3: The power curves for the five simulation studies. ben + ci (solid line), bl
+ ci (dotted line), ben + sn (dot-dashed line), bl + sn (dashed line), en (horizontal
solid line) and lasso (horizontal dashed line).
Table 1: Comparison of the four methods (ben, bl, en, and lasso) on variable selection
accuracy with α = 0.5 and p = 0.5 for the first two simulation studies.
simulation studies. However, for small samples under a less sparse model (Simulation 5), ben performs best. Figure 2 also shows that the ROC curve for variable selection of ben dominates that of bl, in that the correct inclusion rate is always higher at any given false inclusion rate, although bl has higher power (Figure 3). Secondly, when
the true model structure is relatively simple (Simulations 1 and 2), the four methods
perform comparably in prediction accuracy. But when the true model has a complex
structure (Simulations 3, 4 and 5), ben and bl outperform en and lasso significantly.
By comparing ben and bl in Simulation 3 with en and lasso in Simulation 4 (the bolded numbers in Table 2), we can see that even with only half as much data, the Bayesian methods achieve better prediction accuracy than the non-Bayesian methods. One possible reason is that complicated models lead to highly variable estimators, and the Bayesian methods use prior information and integrate over the uncertainty to reduce the variance, which leads to a smaller mean squared error. In this sense, the Bayesian methods deliver better results with less data when the true model is complicated. Furthermore, with the sample size doubled from Simulation 3 to Simulation 4, the mmse of the Bayesian methods decreases by about 15 while that of the non-Bayesian methods decreases by about 40.
Table 2: Comparison of the four methods (ben, bl, en, and lasso) on prediction accu-
racy.
six blood serum measurements (tc, ldl, hdl, tch, ltg, glu). The response of interest is a
quantitative measure of disease progression one year after baseline. There are n = 442
diabetes patients in the data set. Four different methods are applied to this data set:
ben, bl, en and lasso. The estimation results are summarized in Figure 4. In this
example, the posterior distributions of β | y are very close to normal distributions, so
posterior medians instead of the posterior modes are reported for ben and bl.
Figure 4: The estimates of the predictor effects for the diabetes data using different
methods: posterior median ben estimates (+), posterior median bl estimates (◦), en
estimates (4), lasso estimates (∇) and the ols estimates (×).
The penalty parameters are selected as (s, λ2 ) = (0.78, 0.01) for en and s = 0.57 for
lasso. Figure 4 shows that these two methods give almost identical estimates of β for this
data set. For the ben method, the penalty parameters estimated by the Monte Carlo
em algorithm are (λ1 , λ2 ) = (0.996, 14.89) while for the bl method, λ1 is estimated as
4.536. These two Bayesian methods also behave similarly for this data set. From Figure
4 we can see that the Bayesian methods (ben and bl) tend to achieve milder shrinkage
compared to the non-Bayesian methods (en and lasso), i.e. their estimates usually lie
between the ols estimates and the estimates given by en and lasso.
For variable selection, if we use the credible interval criterion with level α = 0.5, two variables are excluded (for both ben and bl): age and ldl. These are consistent with the variables excluded by en and lasso. If we use the scaled neighborhood criterion with probability threshold p = 0.5, then three variables are excluded (for both ben and bl): age, ldl and tch. Table 3 summarizes the posterior probabilities for the scaled neighborhood criterion. Since a larger posterior probability implies stronger evidence for exclusion, Table 3 shows that ben and bl give the same ordering for variable exclusion.
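Both criteria can be applied directly to the posterior draws. Since the section defining them is not reproduced above, the sketch below uses one common reading as a working assumption: the credible interval criterion excludes a predictor whose equal-tailed level-α credible interval contains 0, and the scaled neighborhood criterion excludes a predictor whose posterior probability of lying within one posterior standard deviation of 0 exceeds the threshold p.

```python
import numpy as np

def credible_interval_criterion(beta_draws, alpha=0.5):
    """Exclude predictor j if the equal-tailed level-alpha credible interval contains 0.
    (Working assumption about the criterion's definition; see the text above.)"""
    lo = np.quantile(beta_draws, (1 - alpha) / 2, axis=0)
    hi = np.quantile(beta_draws, (1 + alpha) / 2, axis=0)
    return (lo < 0) & (hi > 0)                      # True = exclude

def scaled_neighborhood_criterion(beta_draws, threshold=0.5):
    """Exclude predictor j if P(|beta_j| <= sd(beta_j) | y) exceeds the threshold.
    (Also a working assumption about the exact definition.)"""
    sd = beta_draws.std(axis=0)
    prob = (np.abs(beta_draws) <= sd).mean(axis=0)
    return prob > threshold, prob

# beta_draws would be the (n_iter, p) array returned by a Gibbs run such as the sketch above.
```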
very different. One possible reason is that the predictors in this data set are more corre-
lated than those in the diabetes data. Table 5 also reports the prediction mean-squared
error from the testing data using four different methods. The two Bayesian methods
have slightly larger prediction error than the non-Bayesian methods. One possible rea-
son is that the true model structure is not complicated, so that not much variance is
reduced by introducing the prior, but the increased bias leads to a larger prediction
error.
Figure 5: The estimates of the predictor effects for the prostate cancer data using
different methods: posterior median ben estimates (+), posterior median bl estimates
(◦), en estimates (4), lasso estimates (∇) and the ols estimates (×).
Table 5: Variable selection results and prediction mean-squared error for the prostate
cancer example using different methods: ben, bl, en and lasso.
Then the resulting conjugacy leads to an easy extension of the Gibbs sampler. This approach does not update the penalty parameters after each Markov chain run, as the Monte Carlo em algorithm does, but instead runs a single longer chain that updates the penalty parameters at every step, so it is computationally much faster. However, our experience is that the results depend heavily on the hyperparameters (a, b) and (c, d), which need to be specified by users in advance and are less intuitive to specify than the credible interval level or the probability threshold. In Park and Casella (2008), the specification of the parameters in the prior for λ1 leads to results similar to those given by the Monte Carlo em method.
Another desirable property for regression regularization methods is the oracle property (Fan and Li 2001). Zou and Zhang (2009) proposed the adaptive elastic net to achieve this goal. Adjusting the ben method to attain the oracle property deserves further investigation.
Appendix
1. Proof of Lemma 1
Firstly, note that

\exp\left\{ -\frac{1}{2\sigma^2}\left( \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2 \right) \right\} = \prod_{j=1}^{p} \exp\left\{ -\frac{1}{2\sigma^2}\left( \lambda_1|\beta_j| + \lambda_2\beta_j^2 \right) \right\}.

In the following, \propto_{\sigma^2} denotes that its two sides differ by a multiplicand involving \sigma^2 and possibly some constants. Then, we have

\exp\left\{ -\frac{1}{2\sigma^2}\left( \lambda_1|\beta_j| + \lambda_2\beta_j^2 \right) \right\}
\propto_{\sigma^2} \int_0^{\infty} \frac{1}{\sqrt{s}} \exp\left\{ -\frac{\beta_j^2}{2}\left( \frac{1}{s} + \frac{\lambda_2}{\sigma^2} \right) \right\} \exp\left\{ -\frac{1}{2}\left( \frac{\lambda_1}{2\sigma^2} \right)^{2} s \right\} ds
= \int_0^{\infty} \sqrt{\frac{1}{s} + \frac{\lambda_2}{\sigma^2}}\; \exp\left\{ -\frac{\beta_j^2}{2}\left( \frac{1}{s} + \frac{\lambda_2}{\sigma^2} \right) \right\} \frac{1}{\sqrt{1 + \frac{\lambda_2}{\sigma^2}s}} \exp\left\{ -\frac{1}{2}\left( \frac{\lambda_1}{2\sigma^2} \right)^{2} s \right\} ds
\propto_{\sigma^2} \int_1^{\infty} \sqrt{\frac{t}{t-1}}\; \exp\left\{ -\frac{\beta_j^2}{2}\,\frac{\lambda_2}{\sigma^2}\,\frac{t}{t-1} \right\} t^{-\frac{1}{2}} \exp\left\{ -\frac{\lambda_1^2}{8\sigma^2\lambda_2}\, t \right\} dt,

where the last step uses the change of variable t = 1 + \lambda_2 s / \sigma^2.
So the log-likelihood is

p\log(\lambda_1) - \left(\frac{n}{2} + p + 1\right)\log(\sigma^2) - p\log\Gamma_U\!\left(\frac{1}{2}, \frac{\lambda_1^2}{8\sigma^2\lambda_2}\right) - \frac{1}{2}\sum_{j=1}^{p}\log(\tau_j - 1)
\quad - \frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta) - \frac{\lambda_2}{2\sigma^2}\sum_{j=1}^{p}\frac{\tau_j}{\tau_j - 1}\beta_j^2 - \frac{1}{2\sigma^2}\frac{\lambda_1^2}{4\lambda_2}\sum_{j=1}^{p}\tau_j
= p\log(\lambda_1) - p\log\Gamma_U\!\left(\frac{1}{2}, \frac{\lambda_1^2}{8\sigma^2\lambda_2}\right) - \frac{\lambda_2}{2\sigma^2}\sum_{j=1}^{p}\frac{\tau_j}{\tau_j - 1}\beta_j^2 - \frac{1}{2\sigma^2}\frac{\lambda_1^2}{4\lambda_2}\sum_{j=1}^{p}\tau_j + \text{terms not involving } \lambda_1 \text{ and } \lambda_2.
Q(\lambda \mid \lambda^{(k-1)})
= p\log(\lambda_1) - p\,E\!\left[ \log\Gamma_U\!\left(\frac{1}{2}, \frac{\lambda_1^2}{8\sigma^2\lambda_2}\right) \,\Big|\, \lambda^{(k-1)}, Y \right] - \frac{\lambda_2}{2}\sum_{j=1}^{p} E\!\left[ \frac{\tau_j\beta_j^2}{(\tau_j - 1)\sigma^2} \,\Big|\, \lambda^{(k-1)}, Y \right]
\quad - \frac{\lambda_1^2}{8\lambda_2}\sum_{j=1}^{p} E\!\left[ \frac{\tau_j}{\sigma^2} \,\Big|\, \lambda^{(k-1)}, Y \right] + \text{terms not involving } \lambda_1 \text{ and } \lambda_2,

where \varphi(t) = t^{-\frac{1}{2}} e^{-t}.
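In a Monte Carlo em step, the conditional expectations in Q(λ | λ^{(k−1)}) are replaced by averages over posterior draws generated under λ^{(k−1)}, and the resulting surrogate is maximized over (λ1, λ2). A minimal sketch is below; it maximizes the surrogate numerically with a generic optimizer rather than using any closed-form updates, which is our simplification, and the array inputs are assumed to come from a Gibbs run such as the earlier sketch.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaincc, gamma as gamma_fn

def log_upper_incomplete_gamma(a, x):
    """log Gamma_U(a, x) via the regularized upper incomplete gamma function."""
    return np.log(gammaincc(a, x)) + np.log(gamma_fn(a))

def mc_em_update(beta_draws, tau_draws, sigma2_draws, lam_init=(1.0, 1.0)):
    """One Monte Carlo EM update of (lam1, lam2) by numerically maximizing Q-hat.

    beta_draws, tau_draws: arrays of shape (M, p); sigma2_draws: array of shape (M,).
    """
    p = beta_draws.shape[1]
    e_tb = np.mean(tau_draws * beta_draws**2 /
                   ((tau_draws - 1.0) * sigma2_draws[:, None]), axis=0)  # E[tau_j b_j^2 / ((tau_j-1) s^2)]
    e_t = np.mean(tau_draws / sigma2_draws[:, None], axis=0)             # E[tau_j / s^2]

    def neg_q(log_lam):
        lam1, lam2 = np.exp(log_lam)                                     # optimize on the log scale
        x = lam1**2 / (8.0 * sigma2_draws * lam2)
        q = (p * np.log(lam1)
             - p * np.mean(log_upper_incomplete_gamma(0.5, x))
             - 0.5 * lam2 * e_tb.sum()
             - lam1**2 / (8.0 * lam2) * e_t.sum())
        return -q

    res = minimize(neg_q, np.log(np.asarray(lam_init)), method="Nelder-Mead")
    return tuple(np.exp(res.x))
```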
References
Andrews, D. F. and Mallows, C. L. (1974). “Scale mixtures of normal distributions.” J. R. Statist. Soc. B, 36: 99–102.
Berger, J. O. (1993). Statistical Decision Theory and Bayesian Analysis. Springer, 2nd edition.
Carvalho, C. M., Polson, N. G., and Scott, J. G. (2008). “The horseshoe estimator for sparse signals.” Discussion Paper 2008-31, Duke University Department of Statistical Science.
Chhikara, R. and Folks, L. (1988). The Inverse Gaussian Distribution: Theory, Methodology, and Applications. Marcel Dekker, Inc, 1st edition.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). “Least angle regression.” Ann. Stat., 32: 407–499.
Fan, J. and Li, R. (2001). “Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties.” J. Amer. Statist. Assoc., 96: 1348–1360.
Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002). “Gene Selection for Cancer Classification Using Support Vector Machines.” Mach. Learn., 46: 389–422.
Hans, C. (2009b). “Model uncertainty and variable selection in Bayesian Lasso.” Stat. Comput. To appear.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2nd edition.
Knight, K. and Fu, W. (2000). “Asymptotics for lasso-type estimators.” Ann. Stat., 28: 1356–1378.
Michael, J. R., Schucany, W. R., and Haas, R. W. (1976). “Generating random variates using transformations with multiple roots.” Am. Stat., 30(2): 88–90.
Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). “On the LASSO and its dual.” J. Comput. Graph. Stat., 9: 319–337.
Park, T. and Casella, G. (2008). “The Bayesian Lasso.” J. Amer. Statist. Assoc., 103: 681–686.
Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., and Yang, N. (1989). “Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate II: radical prostatectomy treated patients.” J. Urol., 16: 1076–1083.
Tibshirani, R. (1996). “Regression shrinkage and selection via the Lasso.” J. R. Statist. Soc. B, 58: 267–288.
Zou, H. and Hastie, T. (2005). “Regularization and variable selection via the elastic net.” J. R. Statist. Soc. B, 67: 301–320.
Zou, H. and Zhang, H. H. (2009). “On the adaptive elastic-net with a diverging number of parameters.” Ann. Stat., 37: 1733–1751.