
Bayesian Analysis (2010) 5, Number 1, pp. 151–170

The Bayesian Elastic Net


Qing Li∗ and Nan Lin†

Abstract. The elastic net (Zou and Hastie 2005) is a flexible regularization and variable
selection method that uses a mixture of L1 and L2 penalties. It is particularly
useful when there are many more predictors than observations. This paper proposes
a Bayesian method that solves the elastic net model using a Gibbs sampler.
While the marginal posterior mode of the regression coefficients is equivalent to
the estimate given by the non-Bayesian elastic net, the Bayesian elastic net has two
major advantages. Firstly, as a Bayesian method, it yields distributional results on the
estimates directly, making statistical inference easier. Secondly, it
chooses the two penalty parameters simultaneously, avoiding the “double shrinkage
problem” of the elastic net method. Real data examples and simulation studies
show that the Bayesian elastic net is comparable in prediction accuracy but
performs better in variable selection.
Keywords: Bayesian analysis, elastic net, Gibbs sampler, regularization, variable
selection.

1 Introduction
Regression regularization methods are developed to carry out parameter estimation
and variable selection simultaneously. Tibshirani (1996) proposed the lasso estimator
which estimates the linear regression coefficients through an L1 -penalized least squares
criterion. Due to the nature of the L1 -penalty, the lasso often yields solutions with some
components being exactly 0. The lasso estimates can be computed efficiently by the lars
algorithm proposed by Efron et al. (2004), whose computational cost is of the same
order as that of a single ordinary least squares (ols) fit. While demonstrating promising
performance for many problems, the lasso estimator does have some shortcomings. Zou
and Hastie (2005) emphasized three inherent drawbacks of the lasso estimator. Firstly,
due to the nature of the convex optimization problem, the lasso method cannot select
more predictors than the sample size. In practice, however, studies often involve
many more predictors than the sample size, e.g. microarray data analysis (Guyon et al.
2002). Secondly, when there is some group structure among the predictors, the lasso
estimator usually selects only one predictor from a group while ignoring the others. Thirdly,
when the predictors are highly correlated, the lasso estimator performs unsatisfactorily.
Zou and Hastie (2005) proposed the elastic net (en) estimator to achieve improved
performance in these cases. The en estimator can also be viewed as a penalized least
squares method where the penalty term is a convex combination of the lasso penalty
and the ridge penalty. Simulations and biological data examples showed that the en
∗ Department of Mathematics, Washington University in St. Louis, St. Louis, MO, qli@math.wustl.edu
† Department of Mathematics, Washington University in St. Louis, St. Louis, MO, nlin@math.wustl.edu

© 2010 International Society for Bayesian Analysis    DOI: 10.1214/10-BA506

estimator does well in the three scenarios where the lasso estimator falls short. A
brief review of en follows.
Suppose that we have n observations and p predictors. Denote the response variable
by y = (y_1, \ldots, y_n)^T and the design matrix by X = (x_1, \ldots, x_p), where
x_j = (x_{1j}, \ldots, x_{nj})^T, j = 1, \ldots, p. Applying a linear transformation if necessary, we assume
that
\[
\sum_{i=1}^{n} y_i = 0, \qquad \sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} x_{ij}^2 = 1, \qquad \text{for } j = 1, \ldots, p.
\]
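These conventions amount to centering y and scaling each column of X to unit sum of squares; a minimal numpy sketch (function and variable names are ours):

```python
import numpy as np

def standardize(X, y):
    """Center y and standardize each column of X so that its entries sum
    to 0 and its squared entries sum to 1, matching the paper's setup."""
    y_c = y - y.mean()
    Xc = X - X.mean(axis=0)                  # column sums -> 0
    Xs = Xc / np.linalg.norm(Xc, axis=0)     # column sums of squares -> 1
    return Xs, y_c

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 5)), rng.normal(size=20)
Xs, yc = standardize(X, y)
```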

For any fixed nonnegative penalty parameters λ1 and λ2, the en loss function is
defined as
\[
L(\lambda_1, \lambda_2, \beta) = (y - X\beta)^T (y - X\beta) + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2,
\]
where
\[
\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|, \qquad \|\beta\|_2 = \Big( \sum_{j=1}^{p} \beta_j^2 \Big)^{1/2}.
\]
Then the naïve en estimator \hat\beta_{en} is defined as \hat\beta_{en} = \arg\min_\beta L(\lambda_1, \lambda_2, \beta). The name
“naïve” comes from the fact that \hat\beta_{en} suffers from the “double shrinkage problem” (Zou
and Hastie 2005). To correct it, (1 + \lambda_2)\hat\beta_{en} instead of \hat\beta_{en}
is used as the en estimator. \hat\beta_{en} can be computed efficiently by the lars-en algorithm
(Zou and Hastie 2005). However, the en method has two limitations:

1. The lars-en algorithm only computes point estimates of the regression coefficients, so
statistical inference on the coefficients is difficult.

2. The lars-en algorithm assumes that the L1-penalty coefficient λ1 and the L2-penalty coefficient
λ2 are given, so the tuning parameters λ1 and λ2 are selected by cross-validation (Hastie et al.
2009). To avoid intensive computation, a grid of values for λ2 is first specified. For
example, Zou and Hastie (2005) used (0, 0.01, 0.1, 1, 10, 100, 1000). For each λ2, a
10-fold cross-validation is then used to choose λ1. This procedure
selects λ1 and λ2 sequentially rather than simultaneously and causes the “double
shrinkage problem”.
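For concreteness, the en loss and the rescaling that corrects the double shrinkage can be written down directly (a sketch in numpy; this is only the loss, not the lars-en algorithm):

```python
import numpy as np

def en_loss(beta, X, y, lam1, lam2):
    """The elastic net loss L(lambda1, lambda2, beta) =
    ||y - X beta||^2 + lam1 * ||beta||_1 + lam2 * ||beta||_2^2."""
    r = y - X @ beta
    return r @ r + lam1 * np.abs(beta).sum() + lam2 * beta @ beta

def rescale(beta_naive, lam2):
    """The en estimator rescales the naive minimizer to undo double shrinkage."""
    return (1 + lam2) * beta_naive
```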

The first limitation is common to most regression regularization methods. A number
of authors have considered standard error estimation for the lasso estimator. Tibshirani
(1996) suggested using a ridge regression approximation or the bootstrap method.
The problem with the ridge regression approximation is that the standard error estimate
is 0 whenever the parameter estimate is 0. The sandwich estimator proposed by Fan
and Li (2001) suffers from the same problem. Knight and Fu (2000) studied the bootstrap
method for standard error estimation and pointed out that the bootstrap estimates are
asymptotically biased when some true parameters are 0 or close to 0. Osborne et al.
(2000) gave an alternative approximation that leads to better standard error estimates,
but it is based on the implicit assumption that the estimators are approximately linear,
which holds only for very small λ1. All the studies above take a
frequentist point of view. In contrast, statistical inference for Bayesian methods is
more straightforward. The Bayesian lasso (bl) (Park and Casella 2008; Hans 2009a,b)
not only gives interval estimates but also provides the posterior distribution of
the lasso estimator. bl assigns a double exponential prior on β, which is essentially a
scale mixture of normals where the mixing is through an exponential distribution. Related
work and extensions of bl include the horseshoe estimator of Carvalho et al.
(2008), where the mixing is through a half-Cauchy distribution, and the normal-gamma
prior of Griffin and Brown (2009), where the mixing is through a gamma distribution.
The latter is further applied and explored by Scheipl and Kneib (2009).
In this paper, we propose a Bayesian analysis of the en problem and solve it
using a Gibbs sampler. As an automatic byproduct of the Markov chain Monte
Carlo procedure, inference on the estimates is straightforward. In addition, our
method chooses the penalty parameters λ1 and λ2 simultaneously by a Monte Carlo
em algorithm proposed by Casella (2001), thus avoiding the “double shrinkage problem”.
Section 2 describes the model and the Gibbs sampler, as well as criteria for variable
selection. Section 3 presents simulation studies comparing the Bayesian elastic
net (ben) method with other regression regularization methods. Sections 4 and
5 examine the performance of ben on real data examples. Some discussion is
provided in Section 6. An appendix contains technical proofs and derivations.

2 The Bayesian elastic net


2.1 Model hierarchy and prior distributions
Consider the linear model E(y | X, β) = Xβ, where we assume that the response
variables are conditionally normally distributed, i.e.
\[
y \mid X, \beta \sim N(X\beta, \sigma^2 I_n).
\]

We assume that all analysis hereafter is conditional on X. Zou and Hastie (2005) pointed
out that, under these assumptions, solving the en problem is equivalent to finding the
marginal posterior mode of β | y when the prior distribution of β is given by
\[
\pi(\beta) \propto \exp\{ -\lambda_1 \|\beta\|_1 - \lambda_2 \|\beta\|_2^2 \}, \tag{1}
\]
a compromise between Gaussian and Laplacian priors. Specifically, the conditional
posterior distribution has the probability density function (pdf)
\[
f(\beta \mid \sigma^2, y) \propto \exp\Big\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) - \lambda_1 \|\beta\|_1 - \lambda_2 \|\beta\|_2^2 \Big\}
= \exp\Big\{ -\frac{1}{2\sigma^2} \big[ (y - X\beta)^T (y - X\beta) + (2\sigma^2 \lambda_1) \|\beta\|_1 + (2\sigma^2 \lambda_2) \|\beta\|_2^2 \big] \Big\}.
\]


However, under (1), neither the conditional posterior mode of β | σ², y nor the marginal
posterior mode of β | y would be equivalent to the en estimator \hat\beta_{en} unless the analysis
is conditional on σ² or σ² is given a point-mass prior. Instead we propose a prior for β,
conditional on σ², as
\[
\pi(\beta \mid \sigma^2) \propto \exp\Big\{ -\frac{1}{2\sigma^2} \big( \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \big) \Big\}. \tag{2}
\]
A noninformative prior is then assigned to σ², i.e. π(σ²) ∝ 1/σ². In this setup, the
marginal posterior distribution of β | y has the pdf
\[
f(\beta \mid y) = \int_0^\infty \frac{C(\lambda_1, \lambda_2, \sigma^2)}{(2\pi\sigma^2)^{n/2}} \exp\Big\{ -\frac{\|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2}{2\sigma^2} \Big\} \pi(\sigma^2)\, d\sigma^2, \tag{3}
\]
where C(λ1, λ2, σ²) is the normalizing constant mentioned later in Lemma 1. From (3),
we can see that the en estimator \hat\beta_{en} maximizes the integrand for each σ², and thus
equals the marginal posterior mode of β | y. This conditional prior specification is also
used by Park and Casella (2008) to accelerate the convergence of the
Gibbs sampler in that paper.
Based on the discussion above, we have the following hierarchical model.
\[
\begin{aligned}
y \mid \beta, \sigma^2 &\sim N(X\beta, \sigma^2 I_n), \\
\pi(\beta \mid \sigma^2) &\propto \exp\Big\{ -\frac{1}{2\sigma^2} \big( \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \big) \Big\}, \qquad (4) \\
\pi(\sigma^2) &\propto \frac{1}{\sigma^2}.
\end{aligned}
\]
To better understand this prior specification, it is informative to see how the hyperparameters
(λ1, λ2) affect the shape of the prior. Figure 1 shows the prior density
function (2) for different values of λ1 and λ2, assuming σ² = 1 and p = 1. Three
choices of (λ1, λ2) are considered: (5, 0), (1, 5) and (0, 5). These
correspond to the lasso prior (Park and Casella 2008), the en prior and the ridge prior,
respectively. It is worth noting that the lasso prior and the en prior are not differentiable at
0, but the ridge prior is. This is the essential difference between sparse model priors and
non-sparse model priors, as pointed out by Fan and Li (2001): to yield sparse solutions
that are continuous in the data, the prior must be non-differentiable at 0. Furthermore,
the plot shows that the en prior is less “sharp” than the lasso prior; in fact,
it is a compromise between the lasso prior and the ridge prior. Qualitatively,
this suggests that the en method is not as “aggressive” as the lasso method in
excluding predictors.
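The three prior shapes compared in Figure 1 can be computed directly from (2) (a numpy sketch; densities are normalized numerically on a finite grid, which is adequate here because the mass beyond ±2 is small for these parameter values):

```python
import numpy as np

def en_prior(beta, lam1, lam2, sigma2=1.0):
    """Unnormalized univariate prior (2): exp{-(lam1|b| + lam2 b^2)/(2 sigma^2)}."""
    return np.exp(-(lam1 * np.abs(beta) + lam2 * beta**2) / (2 * sigma2))

b = np.linspace(-2.0, 2.0, 2001)
dx = b[1] - b[0]
settings = {"lasso": (5.0, 0.0), "en": (1.0, 5.0), "ridge": (0.0, 5.0)}
dens = {}
for name, (l1, l2) in settings.items():
    f = en_prior(b, l1, l2)
    dens[name] = f / (f.sum() * dx)   # normalize numerically on the grid
```

At 0 the lasso prior is the most peaked and the ridge prior the least, with the en prior in between, matching the qualitative discussion above.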

2.2 The full conditional distributions and the Gibbs sampler


Computationally, it is difficult to solve model (4) directly through a Gibbs sampler
because the absolute values |βj| in the prior would yield unfamiliar full conditional
distributions. However, the following lemma suggests a solution that introduces
another level into the model hierarchy.

Figure 1: A comparison of prior (2) with different λ1 ’s and λ2 ’s.

Lemma 1. The density π(β | σ²) in (2) can be written as
\[
C(\lambda_1, \lambda_2, \sigma^2) \prod_{j=1}^{p} \int_1^\infty \sqrt{\frac{t}{t-1}} \exp\Big\{ -\frac{\beta_j^2}{2} \frac{\lambda_2}{\sigma^2} \frac{t}{t-1} \Big\}\, t^{-1/2} \exp\Big\{ -\frac{1}{2\sigma^2} \frac{\lambda_1^2}{4\lambda_2}\, t \Big\}\, dt,
\]
where C(λ1, λ2, σ²) is the normalizing constant.

The proof of this lemma is in the Appendix. The lemma implies that we can treat
βj | σ² as a mixture of normal distributions, N(0, σ²(t − 1)/(λ2 t)), where the mixing
is over the variance σ²(t − 1)/(λ2 t) and the mixing distribution is a truncated gamma
distribution with shape parameter 1/2, scale parameter 8λ2σ²/λ1² and support (1, ∞),
denoted tg(1/2, 8λ2σ²/λ1², (1, ∞)).
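Lemma 1 suggests a two-stage way to draw from the prior: first τj from the truncated gamma, then βj from the induced normal. A minimal numpy sketch (truncation handled by simple rejection; the λ and σ² values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
lam1, lam2, sigma2, n = 2.0, 1.0, 1.0, 50_000

# Draw tau ~ tg(1/2, 8*lam2*sigma2/lam1^2, (1, inf)) by rejection:
# sample untruncated gamma variates and keep those exceeding 1.
scale = 8 * lam2 * sigma2 / lam1**2
tau = np.empty(0)
while tau.size < n:
    t = rng.gamma(shape=0.5, scale=scale, size=2 * n)
    tau = np.concatenate([tau, t[t > 1.0]])
tau = tau[:n]

# Given tau, beta_j | sigma2 is N(0, sigma2 * (tau - 1) / (lam2 * tau)).
beta = rng.normal(0.0, np.sqrt(sigma2 * (tau - 1.0) / (lam2 * tau)))
```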
By Lemma 1, we have the following hierarchical model.
\[
\begin{aligned}
y \mid \beta, \sigma^2 &\sim N(X\beta, \sigma^2 I_n), \\
\beta \mid \tau, \sigma^2 &\sim \prod_{j=1}^{p} N\Big( 0, \Big( \frac{\lambda_2}{\sigma^2} \frac{\tau_j}{\tau_j - 1} \Big)^{-1} \Big), \\
\tau \mid \sigma^2 &\sim \prod_{j=1}^{p} \mathrm{tg}\Big( \frac{1}{2}, \frac{8\lambda_2 \sigma^2}{\lambda_1^2}, (1, \infty) \Big), \\
\pi(\sigma^2) &\propto \frac{1}{\sigma^2}.
\end{aligned}
\]

After introducing τ, the model becomes computationally easier to handle since the full
conditional distributions now are
\[
\beta \mid y, \sigma^2, \tau \sim N\big( A^{-1} X^T y,\; \sigma^2 A^{-1} \big),
\quad \text{where } A = X^T X + \lambda_2\, \mathrm{diag}\Big( \frac{\tau_1}{\tau_1 - 1}, \cdots, \frac{\tau_p}{\tau_p - 1} \Big),
\]
\[
(\tau - 1_p) \mid y, \sigma^2, \beta \sim \prod_{j=1}^{p} \mathrm{gig}\Big( \lambda = \frac{1}{2},\; \psi = \frac{\lambda_1^2}{4\lambda_2 \sigma^2},\; \chi = \frac{\lambda_2 \beta_j^2}{\sigma^2} \Big), \tag{5}
\]
\[
\sigma^2 \mid y, \beta, \tau \sim \Big( \frac{1}{\sigma^2} \Big)^{\frac{n}{2} + p + 1} \Big\{ \Gamma_U \Big( \frac{1}{2}, \frac{\lambda_1^2}{8\sigma^2 \lambda_2} \Big) \Big\}^{-p}
\exp\Big\{ -\frac{1}{2\sigma^2} \Big[ \|y - X\beta\|_2^2 + \lambda_2 \sum_{j=1}^{p} \frac{\tau_j}{\tau_j - 1} \beta_j^2 + \frac{\lambda_1^2}{4\lambda_2} \sum_{j=1}^{p} \tau_j \Big] \Big\}, \tag{6}
\]
where 1_p is a p-dimensional vector of 1's and \Gamma_U(\alpha, x) = \int_x^\infty t^{\alpha-1} e^{-t}\, dt is the upper incomplete
gamma function (Armido and Alfred 1986). gig(λ, ψ, χ) denotes the generalized
inverse Gaussian distribution (Jørgensen 1982), whose pdf is
\[
f(x \mid \lambda, \chi, \psi) = \frac{(\psi/\chi)^{\lambda/2}}{2 K_\lambda(\sqrt{\psi\chi})}\, x^{\lambda - 1} \exp\Big\{ -\frac{1}{2} \big( \chi x^{-1} + \psi x \big) \Big\}, \tag{7}
\]
for x > 0, where K_λ(·) is the modified Bessel function of the third kind with order λ.
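Drawing from the β full conditional above is a standard multivariate normal draw once A is formed; a minimal numpy sketch (the data and parameter values are illustrative):

```python
import numpy as np

def draw_beta(X, y, sigma2, tau, lam2, rng):
    """One Gibbs draw of beta | y, sigma2, tau ~ N(A^{-1} X'y, sigma2 * A^{-1}),
    with A = X'X + lam2 * diag(tau_j / (tau_j - 1))."""
    A = X.T @ X + lam2 * np.diag(tau / (tau - 1.0))
    A_inv = np.linalg.inv(A)
    mean = A_inv @ X.T @ y
    # Draw via the Cholesky factor of the covariance sigma2 * A_inv.
    L = np.linalg.cholesky(sigma2 * A_inv)
    return mean + L @ rng.standard_normal(len(mean))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 0.0, -1.0, 0.5]) + rng.normal(size=30)
beta = draw_beta(X, y, sigma2=1.0, tau=np.full(4, 2.0), lam2=0.5, rng=rng)
```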

2.3 Sampling from the full conditional distributions


It is straightforward to sample from β | y, σ², τ, which follows a multivariate normal
distribution. Details on sampling from (5) and (6) are as follows.
To sample generalized inverse Gaussian random numbers for (τ − 1_p) | y, σ², β, we
may use the rgig() function in the “HyperbolicDist” package (Scott 2008) in R (R Development
Core Team 2005), which implements the algorithm of Dagpunar (1989). This
algorithm is based on the ratio method together with rejection sampling on the minimal
enclosing rectangle. However, when the product of parameters χψ in (7) is small, the
rejection rate can be extremely high and the algorithm becomes very inefficient. To
expedite our algorithm, we consider the full conditional of a transformation of τ − 1_p.
It is easy to show that 1/(τ_j − 1) | y, σ², β, j = 1, \ldots, p, are independent and follow
the inverse Gaussian distribution (Chhikara and Folks 1988) with μ = λ1/(2λ2|β_j|) and
λ = λ1²/(4λ2σ²). The pdf of an inverse Gaussian distribution is
\[
f(x \mid \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi x^3}} \exp\Big\{ \frac{-\lambda (x - \mu)^2}{2\mu^2 x} \Big\}.
\]
Sampling from an inverse Gaussian distribution uses a more efficient algorithm based
on multiple roots selection (Michael et al. 1976) and is much faster than the rgig()
function.
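The Michael, Schucany and Haas (1976) transformation method mentioned here is short enough to sketch directly (a hedged numpy version; the demo parameter values are ours):

```python
import numpy as np

def rinvgauss(mu, lam, rng):
    """Sample from the inverse Gaussian(mu, lam) distribution using the
    multiple-roots transformation method of Michael, Schucany and Haas (1976)."""
    nu = rng.standard_normal()
    w = nu * nu
    x = mu + mu * mu * w / (2 * lam) - (mu / (2 * lam)) * np.sqrt(
        4 * mu * lam * w + mu * mu * w * w)
    # Choose between the two roots x and mu^2/x with the correct probability.
    if rng.uniform() <= mu / (mu + x):
        return x
    return mu * mu / x

rng = np.random.default_rng(3)
draws = np.array([rinvgauss(2.0, 4.0, rng) for _ in range(20_000)])
```

The sample mean should be close to μ (here 2.0), since the inverse Gaussian mean is μ.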

Sampling from σ² | y, β, τ can be done by the acceptance-rejection algorithm.
Denote the function of σ² on the right-hand side of (6) by f(σ²). Then by the definition
of incomplete gamma functions,
\[
f(\sigma^2) \le \Gamma\Big(\frac{1}{2}\Big)^{-p} \Big(\frac{1}{\sigma^2}\Big)^{a+1} \exp\Big\{ -\frac{1}{\sigma^2}\, b \Big\}
= \frac{\Gamma(a)\, \Gamma(\frac{1}{2})^{-p}}{b^a}\, h(\sigma^2),
\]
where h(·) is the pdf of the inverse-gamma(a, b) distribution and
\[
a = \frac{n}{2} + p, \qquad b = \frac{1}{2}\Big[ (y - X\beta)^T (y - X\beta) + \lambda_2 \sum_{j=1}^{p} \frac{\tau_j}{\tau_j - 1}\beta_j^2 + \frac{\lambda_1^2}{4\lambda_2} \sum_{j=1}^{p} \tau_j \Big].
\]
To draw σ² from f(σ²), we first generate a candidate Z from h and a u from uniform(0, 1),
and then accept Z if u ≤ Γ(1/2)^p b^a f(Z)/(Γ(a) h(Z)) or, equivalently, if
log(u) ≤ p log Γ(1/2) − p log Γ_U(1/2, λ1²/(8Zλ2)).
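The envelope draw for this step can be sketched as follows: an inverse-gamma(a, b) candidate is the reciprocal of a gamma(a, scale = 1/b) variate, with a and b computed as above (the data below are illustrative). The subsequent accept/reject test then only needs the identity Γ_U(1/2, x) = √π · erfc(√x).

```python
import numpy as np

def sigma2_candidate(X, y, beta, tau, lam1, lam2, rng):
    """Draw a candidate for sigma^2 from the inverse-gamma(a, b) envelope h.
    An inverse-gamma(a, b) variate is the reciprocal of a gamma(a, scale=1/b)
    variate; a and b are as defined in the text."""
    n, p = X.shape
    r = y - X @ beta
    a = n / 2 + p
    b = 0.5 * (r @ r
               + lam2 * np.sum(tau / (tau - 1.0) * beta**2)
               + lam1**2 / (4.0 * lam2) * np.sum(tau))
    return 1.0 / rng.gamma(shape=a, scale=1.0 / b)

rng = np.random.default_rng(7)
X = rng.normal(size=(25, 3))
beta = np.array([1.0, -0.5, 0.0])
y = X @ beta + rng.normal(size=25)
s2 = sigma2_candidate(X, y, beta, tau=np.full(3, 1.5), lam1=2.0, lam2=1.0, rng=rng)
```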

2.4 Choosing the penalty parameters by empirical Bayes


As suggested by Park and Casella (2008), we use empirical Bayes estimates of the
penalty parameters λ1 and λ2, obtained by maximizing the marginal likelihood of the data.
Specifically, by treating the parameters β, τ and σ² as missing data, we can use
the Monte Carlo em algorithm of Casella (2001) to estimate λ1 and λ2. In each step of
the Monte Carlo em algorithm, any conditional expectations that are hard to compute
are replaced by Monte Carlo estimates. The details of the algorithm are described
in the Appendix.

2.5 Variable selection


In lasso and en, some regression coefficient estimates are forced to be zero, so vari-
able selection is straightforward. However, the Bayesian approaches need some ad hoc
treatment for variable selection.
In general, variable selection can be viewed as a hypothesis testing problem. Since
the posterior probability of H0 : βj = 0, j = 1, . . . , p would always be 0, a common
practice is to consider the posterior probability in some neighborhood [−rj , rj ] of 0, for
some constants rj > 0, by viewing H0 : βj = 0 as an approximation to H0 : βj ∈ [−rj , rj ]
(Berger 1993). The choice of rj ’s usually demands expert knowledge about the data.
In this paper, instead of choosing an r_j for each β_j, j = 1, \ldots, p, we circumvent this
choice by suggesting two automatic criteria for variable selection.
The first criterion is the credible interval criterion. A predictor x_j is excluded
if the credible interval of β_j covers 0 and is retained otherwise. Simulation results show
that 95% credible intervals are usually too wide, so most predictors would consequently
be excluded. Empirical guidance on choosing an appropriate level is discussed in
detail in Section 3.

The second criterion is the scaled neighborhood criterion. We consider the
posterior probability of the interval [−√var(β_j | y), +√var(β_j | y)]. A predictor is excluded if this
posterior probability exceeds a certain probability threshold and is retained otherwise.
We explore the choice of the probability threshold in the next section.
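Both criteria operate directly on posterior draws of β. A sketch, assuming a matrix of posterior samples with one column per coefficient (our reading of the paper: α is the total tail mass, so α = 0.5 gives a 50% interval):

```python
import numpy as np

def select_ci(samples, alpha=0.5):
    """Credible interval criterion: exclude predictor j if the equal-tailed
    credible interval of beta_j covers 0; returns True where the predictor
    is retained."""
    lo = np.quantile(samples, alpha / 2, axis=0)
    hi = np.quantile(samples, 1 - alpha / 2, axis=0)
    return (lo > 0) | (hi < 0)

def select_sn(samples, p_thresh=0.5):
    """Scaled neighborhood criterion: exclude predictor j if
    P(|beta_j| <= sd(beta_j | y)) exceeds the threshold."""
    sd = samples.std(axis=0)
    prob = (np.abs(samples) <= sd).mean(axis=0)
    return prob <= p_thresh

rng = np.random.default_rng(4)
# Fake posterior draws: beta_1 centred at 3, beta_2 centred at 0.
samples = np.column_stack([rng.normal(3, 0.5, 5000), rng.normal(0, 0.5, 5000)])
```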

3 Simulation studies
We carry out Monte Carlo simulations to compare the performance of ben, bl, en and
lasso in prediction accuracy and variable selection. For en, instead of using the penalty
parameters (λ1, λ2), it is equivalent to use (s, λ2), where s = \|\beta\|_1 / \|\hat\beta_{ols}\|_1 with \hat\beta_{ols}
being the ols estimate (Zou and Hastie 2005). Similarly, s instead of λ1 is used for lasso.
The data are simulated from the following linear model,

y = Xβ + ε, ε ∼ N (0, σ 2 I).

Each simulated sample was partitioned into a training set, a validation set and a testing
set. The validation set is used to select s for lasso and (s, λ2 ) for en. After the penalty
parameter is selected, we combine the training set and the validation set together to
estimate β. For Bayesian methods, we directly combine the training and validation sets
together for estimation.
For the first two simulation studies, we simulate 50 data sets, each of which has
20 observations in the training set, 20 in the validation set, and 200 in the testing
set. In Simulation 1, we set β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T and σ² = 9. The design matrix
X is generated from the multivariate normal distribution with mean 0, variance 1 and
pairwise correlations between x_i and x_j equal to 0.5^{|i−j|} for all i and j. In Simulation 2,
we set β = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)^T, leaving the other settings the same
as in Simulation 1. In Simulation 3, we first generate Z1, Z2 and Z3 independently from
N(0, 1). Then let x_i = Z1 + ε_i, i = 1, \ldots, 5, x_i = Z2 + ε_i, i = 6, \ldots, 10, x_i = Z3 + ε_i,
i = 11, \ldots, 15 and x_i ∼ N(0, 1), i = 16, \ldots, 30, where ε_i ∼ N(0, 0.01), i = 1, \ldots, 15. We
perform 50 simulations, in each of which we have a training set of size 100, a validation
set of size 100 and a testing set of size 400. The parameters are set as σ² = 225 and
\[
\beta = (\underbrace{3, \ldots, 3}_{5}, \underbrace{3, \ldots, 3}_{5}, \underbrace{3, \ldots, 3}_{5}, \underbrace{0, \ldots, 0}_{15})^T.
\]

In Simulation 4, we set the sizes of the training set and the validation set both to 200,
leaving the other settings the same as in Simulation 3. In Simulation 5, we set the sizes
of the training set and the validation set both to 20, and the true parameter value to
\[
\beta = (\underbrace{3, \ldots, 3}_{10}, \underbrace{0, \ldots, 0}_{10}, \underbrace{3, \ldots, 3}_{10})^T,
\]
leaving the other settings the same as in Simulation 3.
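For concreteness, one data set from the design of Simulation 1 can be generated as follows (a numpy sketch; the function name is ours):

```python
import numpy as np

def simulate_1(n, rng):
    """One data set from Simulation 1: 8 predictors with
    cor(x_i, x_j) = 0.5**|i - j|, beta = (3, 1.5, 0, 0, 2, 0, 0, 0), sigma^2 = 9."""
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    idx = np.arange(8)
    cov = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # AR(1)-type correlation
    X = rng.multivariate_normal(np.zeros(8), cov, size=n)
    y = X @ beta + rng.normal(0.0, 3.0, size=n)        # sd = 3, so sigma^2 = 9
    return X, y, beta

rng = np.random.default_rng(5)
X, y, beta = simulate_1(40, rng)   # training + validation observations
```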



3.1 The variable selection accuracy


Denote by α the level in the credible interval criterion and by p the probability threshold
in the scaled neighborhood criterion. The variable selection accuracy of the Bayesian
methods depends on α and p. To understand how these choices affect the variable
selection results, we draw the receiver operating characteristic (ROC) curve and the
power curve over different α's and p's. The ROC curve plots the correct
inclusion rate, i.e. the sensitivity, against the false inclusion rate, i.e. 1 − specificity, along
different α's or p's. The power curve is similar to the ROC curve but uses α or p as the
horizontal axis. The ROC curves and power curves for the five simulation studies are
shown in Figures 2 and 3. For Simulation 2, as there cannot be any false inclusions, only the
power curve is plotted. In these plots, ben + ci and ben + sn stand for the ben method
with the credible interval criterion and the ben method with the scaled neighborhood
criterion, respectively; similarly for bl + ci and bl + sn.
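The two rates that define these curves can be computed as follows (a sketch with toy selection results and scores, not the paper's actual simulation output):

```python
import numpy as np

def inclusion_rates(kept, truth):
    """Sensitivity (correct inclusion rate) and 1 - specificity
    (false inclusion rate) for one variable selection result."""
    kept, truth = np.asarray(kept), np.asarray(truth)
    sens = kept[truth].mean()    # included among truly nonzero coefficients
    fir = kept[~truth].mean()    # included among truly zero coefficients
    return sens, fir

# Sweeping a threshold over a score traces out the ROC curve; toy example:
truth = np.array([True, True, False, False])
scores = np.array([0.9, 0.6, 0.4, 0.1])   # e.g. posterior evidence for inclusion
roc = [inclusion_rates(scores > t, truth) for t in np.linspace(0.0, 1.0, 11)]
```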
Figures 2 and 3 show that the two variable selection criteria give similar ROC curves
for both ben and bl but the credible interval criterion tends to have higher sensitivity
for the same α or p. This implies that the scaled neighborhood criterion has higher
specificity for the same α or p. So depending on the importance of sensitivity and
specificity of a particular study, the researcher may choose the appropriate variable
selection criterion. In the usual hypothesis testing setup where specificity is more valued,
the scaled neighborhood criterion would be preferred. Secondly, the power curves show
that ben and en generally have higher sensitivity than bl and lasso, i.e. they are not
as “aggressive” in excluding predictors. This validates the observation in the discussion
of Figure 1 in Section 2.1. Although en and ben also give sparse solutions, they tend to
give less sparse ones than those given by lasso and bl. Thirdly, from the power curves
for the five simulation studies, we can see that α = 0.5 and p = 0.5 would generally
guarantee that ben and bl have similar or higher power than the non-Bayesian methods.
Based on this observation, together with the discussion in Section 2.5, we would suggest
using α = 0.5 and p = 0.5 in practice. In principle, a higher level α or threshold value p
results in higher sensitivity but lower specificity. Table 1 shows some detailed
variable selection results for the first two simulation studies.
Let N (βj ), j = 1, . . . , 8, denote the frequency of exclusions for the predictor xj . For
Simulation 1, the smaller N (βj ) for j = 1, 2, 5 and the larger N (βj ) for j = 3, 4, 6, 7, 8,
the better the method performs. For Simulation 2, we want all N (βj ) to be as small as
possible. As there are too many predictors in Simulations 3, 4 and 5 to put in a table,
we do not show those results.

3.2 The prediction accuracy


We compare the prediction accuracy of the four methods using the median of the pre-
diction mean-squared errors (mmse). Also, 500 bootstrap resampling on the 50 mean-
squared errors are drawn to give the standard error for the mmse. The results are
summarized in Table 2.
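The mmse and its bootstrap standard error can be computed as follows (a sketch; the mse values below are synthetic stand-ins, not the paper's results):

```python
import numpy as np

def mmse_with_se(mse, n_boot=500, rng=None):
    """Median of the prediction mean-squared errors, with a bootstrap
    standard error from resampling the per-dataset mse values."""
    if rng is None:
        rng = np.random.default_rng(0)
    mse = np.asarray(mse, dtype=float)
    medians = np.array([np.median(rng.choice(mse, size=mse.size, replace=True))
                        for _ in range(n_boot)])
    return np.median(mse), medians.std(ddof=1)

rng = np.random.default_rng(6)
mse = rng.normal(11.0, 2.0, size=50)   # stand-in for 50 simulation mse values
mmse, se = mmse_with_se(mse, rng=rng)
```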
Table 2 shows that bl has a better prediction accuracy than other methods in most

Figure 2: The ROC curves for the four simulation studies. ben + ci (solid line), bl +
ci (dotted line), ben + sn (dot-dashed line), bl + sn (dashed line), en (+) and lasso
(×).

Figure 3: The power curves for the five simulation studies. ben + ci (solid line), bl
+ ci (dotted line), ben + sn (dot-dashed line), bl + sn (dashed line), en (horizontal
solid line) and lasso (horizontal dashed line).

Method     N(β1)  N(β2)  N(β3)  N(β4)  N(β5)  N(β6)  N(β7)  N(β8)
ben + ci    0      0      9     13      0     22     30     26
ben + sn    0      0     16     19      0     30     37     41
en          0      1     12     10      0     22     27     31
bl + ci     0      3     26     31      0     27     27     25
bl + sn     0      7     40     41      0     38     36     36
lasso       0      2     22     24      0     23     24     25
(a) Simulation 1

Method     N(β1)  N(β2)  N(β3)  N(β4)  N(β5)  N(β6)  N(β7)  N(β8)
ben + ci    2      0      1      0      0      1      2      3
ben + sn    3      1      2      0      0      2      4      7
en          9      1      2      1      0      1      5      8
bl + ci     6      6      8      8     10      6      7     10
bl + sn    13     15     16     19     22     14     14     18
lasso       6      3      5      5      5      5      5      9
(b) Simulation 2

Table 1: Comparison of the four methods (ben, bl, en, and lasso) on variable selection
accuracy with α = 0.5 and p = 0.5 for the first two simulation studies.

simulation studies. However, for small samples under a less sparse model (Simulation
5), ben performs best. Figure 2 also shows that the ROC curve for variable selection
of ben dominates that of bl, in that the correct inclusion rate is always higher at any
given false inclusion rate, although bl has higher power (Figure 3). Secondly, when
the true model structure is relatively simple (Simulations 1 and 2), the four methods
perform comparably in prediction accuracy. But when the true model has a complex
structure (Simulations 3, 4 and 5), ben and bl outperform en and lasso significantly.
Comparing ben and bl in Simulation 3 with en and lasso in Simulation 4 (the bolded
numbers in Table 2), we can see that even with only half as much data, the Bayesian
methods achieve better prediction accuracy than the non-Bayesian methods. One
possible reason is that complicated models result in highly variable estimators,
and the Bayesian methods use prior information to integrate across uncertainty
and reduce the variance, which leads to a smaller mean squared error. In this sense, the
Bayesian methods give better results with less data when the true model is complicated.
Furthermore, with the sample size doubled from Simulation 3 to Simulation 4,
the mmse of the Bayesian methods decreases by about 15 while that of the non-Bayesian
methods decreases by about 40.

4 The diabetes example


The data in this example are from Efron et al. (2004), where the predictors are ten
baseline variables: age, sex, body mass index (bmi), average blood pressure (map), and

Method   Simulation 1   Simulation 2   Simulation 3   Simulation 4   Simulation 5
         mmse(se)       mmse(se)       mmse(se)       mmse(se)       mmse(se)
ben      11.99(0.28)    11.39(0.29)    232.3(3.83)    215.1(1.95)    335.3(4.17)
en       11.29(0.53)    11.10(0.29)    281.4(6.80)    243.0(4.24)    376.9(9.81)
bl       10.45(0.24)    10.55(0.21)    227.1(4.20)    211.4(2.36)    352.6(13.1)
lasso    10.99(0.20)    11.41(0.33)    280.1(8.00)    243.0(4.34)    384.9(9.61)

Table 2: Comparison of the four methods (ben, bl, en, and lasso) on prediction accuracy.

six blood serum measurements (tc, ldl, hdl, tch, ltg, glu). The response of interest is a
quantitative measure of disease progression one year after baseline. There are n = 442
diabetes patients in the data set. Four methods are applied to this data set:
ben, bl, en and lasso. The estimation results are summarized in Figure 4. In this
example, the posterior distributions of β | y are very close to normal, so
posterior medians rather than posterior modes are reported for ben and bl.

Figure 4: The estimates of the predictor effects for the diabetes data using different
methods: posterior median ben estimates (+), posterior median bl estimates (◦), en
estimates (4), lasso estimates (∇) and the ols estimates (×).

The penalty parameters are selected as (s, λ2) = (0.78, 0.01) for en and s = 0.57 for
lasso. Figure 4 shows that these two methods give almost identical estimates of β for this
data set. For the ben method, the penalty parameters estimated by the Monte Carlo
em algorithm are (λ1, λ2) = (0.996, 14.89), while for the bl method λ1 is estimated as
4.536. These two Bayesian methods also behave similarly on this data set. From Figure
4 we can see that the Bayesian methods (ben and bl) tend to apply milder shrinkage
than the non-Bayesian methods (en and lasso), i.e. their estimates usually lie
between the ols estimates and the estimates given by en and lasso.
For variable selection, if we use the credible interval criterion with level α = 0.5,
two variables are excluded (for both ben and bl): age and ldl. This is consistent with
the variables excluded by en and lasso. If we use the scaled neighborhood criterion
with probability threshold p = 0.5, then three variables are excluded (for both ben
and bl): age, ldl and tch. Table 3 summarizes the posterior probabilities for the scaled
neighborhood criterion. Since a larger posterior probability implies stronger evidence for
exclusion, Table 3 shows that ben and bl give the same ordering for variable exclusion.

predictor   posterior probability
            ben    bl
bmi         0.00   0.00
ltg         0.00   0.00
map         0.00   0.00
sex         0.00   0.00
hdl         0.37   0.46
glu         0.42   0.46
tc          0.49   0.48
tch         0.51   0.57
age         0.68   0.69
ldl         0.69   0.71

Table 3: Diabetes example: posterior probabilities in the scaled neighborhood.

5 The prostate example


The data in this example are from a prostate cancer study (Stamey et al. 1989) and
were analyzed by Zou and Hastie (2005). The predictors are eight clinical measures:
the logarithm of cancer volume, the logarithm of prostate weight, age, the logarithm
of the amount of benign prostatic hyperplasia, seminal vesicle invasion, the logarithm
of capsular penetration, the Gleason score and the percentage of Gleason scores 4 or 5.
The response of interest is the logarithm of the prostate-specific antigen. In total there are
97 observations in the data set; the training data have 67 observations and the testing
data have 30. The estimates given by the different methods are summarized in
Figure 5. The estimates given by the four methods differ much more here
than in the diabetes data. In particular, some bl estimates are large compared
to the corresponding estimates from the other three methods.
The scaled neighborhood criterion with threshold p = 0.5 is used for variable selection
in ben and bl. Table 4 gives the posterior probabilities in the scaled neighborhood,
and Table 5 gives the variable selection results for the four methods. In contrast
to the diabetes example, the variable selection results for ben and bl are
very different. One possible reason is that the predictors in this data set are more correlated
than those in the diabetes data. Table 5 also reports the prediction mean-squared
error on the testing data for the four methods. The two Bayesian methods
have slightly larger prediction errors than the non-Bayesian methods. One possible
reason is that the true model structure is not complicated, so not much variance is
reduced by introducing the prior, while the increased bias leads to a larger prediction
error.

Figure 5: The estimates of the predictor effects for the prostate cancer data using
different methods: posterior median ben estimates (+), posterior median bl estimates
(◦), en estimates (4), lasso estimates (∇) and the ols estimates (×).

6 Conclusion and discussion


We propose a Bayesian formulation of the en problem in which posterior inference can
be obtained efficiently using Gibbs sampling. Real data examples and simulation studies
show that ben behaves comparably to en in prediction accuracy. en does slightly better
than ben for simple models, but ben performs significantly better than en for more
complicated models. Simulation studies suggest that ben outperforms en in variable
selection when coupled with the scaled neighborhood criterion with a proper probability
threshold, and ben gives better prediction accuracy than bl for small samples from less
sparse models.
An alternative way of choosing the penalty parameters λ1 and λ2 is to introduce
hyper-priors on them. This is also mentioned by Park and Casella (2008). Specifically,
we may set a gamma prior gamma(a, b) for λ21 and a gig prior gig(1, c, d) for λ2 .
predictor    posterior probability
                 ben      bl
lcavol           0.00     0.00
lweight          0.05     0.17
age              0.68     0.37
lbph             0.06     0.09
svi              0.69     0.62
lcp              0.54     0.63
gleason          0.20     0.41
pgg45            0.36     0.60

Table 4: Prostate cancer example: posterior probabilities in the scaled neighborhood.

method    selected variables                              prediction mean-squared error
ben       lcavol, lweight, lbph, gleason, pgg45           0.47
bl        lcavol, lweight, age, lbph, gleason             0.44
en        lcavol, lweight, lbph, lcp, gleason, pgg45      0.39
lasso     lcavol, lweight, lbph, gleason                  0.38

Table 5: Variable selection results and prediction mean-squared error for the prostate
cancer example using different methods: ben, bl, en and lasso.

Then the resulting conjugacy leads to an easy extension of the Gibbs sampler. This
approach does not require rerunning the Markov chain after each update of the penalty
parameters; instead, it runs a single, longer chain that updates the penalty parameters
at every step, and is therefore much faster computationally. Our experience, however,
is that the results depend heavily on the parameters (a, b) and (c, d), which users must
specify in advance and which are less intuitive to specify than the credible interval
level or the probability threshold. In Park and Casella (2008), the specification of the
parameters in the prior for λ1 leads to results similar to those given by the Monte
Carlo em method.

Another desirable property for a regression regularization method is the oracle
property (Fan and Li 2001). Zou and Zhang (2009) proposed the adaptive elastic
net to achieve this goal. Adjusting the ben method to attain the oracle property
deserves further consideration.
Appendix
1. Proof of Lemma 1

Firstly, note that
$$
\exp\left\{-\frac{1}{2\sigma^2}\left(\lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2\right)\right\}
= \prod_{j=1}^{p} \exp\left\{-\frac{1}{2\sigma^2}\left(\lambda_1|\beta_j| + \lambda_2\beta_j^2\right)\right\}.
$$

Secondly, as suggested by Andrews and Mallows (1974), for $a > 0$ we have
$$
\exp(-a|z|) = \frac{a}{2}\int_0^\infty \frac{1}{\sqrt{2\pi s}}\,
\exp\left(-\frac{z^2}{2s}\right)\exp\left(-\frac{a^2}{2}s\right)\,ds.
$$

Let $a = \frac{\lambda_1}{2\sigma^2}$ and $z = \beta_j$, and we have
$$
\exp\left(-\frac{\lambda_1}{2\sigma^2}|\beta_j|\right)
\propto_{\sigma^2} \int_0^\infty \frac{1}{\sqrt{s}}
\exp\left(-\frac{\beta_j^2}{2s}\right)
\exp\left\{-\frac{1}{2}\left(\frac{\lambda_1}{2\sigma^2}\right)^2 s\right\}\,ds,
$$
where $\propto_{\sigma^2}$ denotes that its two sides differ by a multiplicand involving $\sigma^2$ and possibly
some constants. Then, we have
$$
\begin{aligned}
\exp\left\{-\frac{1}{2\sigma^2}\left(\lambda_1|\beta_j| + \lambda_2\beta_j^2\right)\right\}
&\propto_{\sigma^2} \int_0^\infty \frac{1}{\sqrt{s}}
\exp\left\{-\frac{\beta_j^2}{2}\left(\frac{1}{s}+\frac{\lambda_2}{\sigma^2}\right)\right\}
\exp\left\{-\frac{1}{2}\left(\frac{\lambda_1}{2\sigma^2}\right)^2 s\right\}\,ds \\
&= \int_0^\infty \sqrt{\frac{1}{s}+\frac{\lambda_2}{\sigma^2}}\,
\exp\left\{-\frac{\beta_j^2}{2}\left(\frac{1}{s}+\frac{\lambda_2}{\sigma^2}\right)\right\}
\frac{1}{\sqrt{1+\frac{\lambda_2}{\sigma^2}s}}
\exp\left\{-\frac{1}{2}\left(\frac{\lambda_1}{2\sigma^2}\right)^2 s\right\}\,ds \\
&\propto_{\sigma^2} \int_1^\infty \sqrt{\frac{t}{t-1}}\,
\exp\left\{-\frac{\beta_j^2}{2}\left(\frac{\lambda_2}{\sigma^2}\,\frac{t}{t-1}\right)\right\}
t^{-\frac{1}{2}}
\exp\left(-\frac{1}{2\sigma^2}\,\frac{\lambda_1^2}{4\lambda_2}\,t\right)\,dt,
\end{aligned}
$$
where the last step uses the change of variable $t = 1 + \frac{\lambda_2}{\sigma^2}s$.

The result of Lemma 1 then follows.

2. Choosing the tuning parameters (λ1, λ2)

The penalty parameters $(\lambda_1, \lambda_2)$ are chosen by the Monte Carlo em algorithm. By
treating $\beta$, $\tau$, $\sigma^2$ as missing data and $(\lambda_1, \lambda_2)$ as fixed parameters, the "complete data"
likelihood is (after ignoring the constants not involving $\lambda_1$ and $\lambda_2$)
$$
\lambda_1^p \left(\frac{1}{\sigma^2}\right)^{\frac{n}{2}+p+1}
\left\{\Gamma_U\left(\frac{1}{2}, \frac{\lambda_1^2}{8\sigma^2\lambda_2}\right)\right\}^{-p}
\prod_{j=1}^{p}\left(\frac{1}{\tau_j-1}\right)^{1/2}
\times
\exp\left\{-\frac{1}{2\sigma^2}\left[(y-X\beta)^T(y-X\beta)
+ \lambda_2\sum_{j=1}^{p}\frac{\tau_j}{\tau_j-1}\beta_j^2
+ \frac{\lambda_1^2}{4\lambda_2}\sum_{j=1}^{p}\tau_j\right]\right\}.
$$
So the log-likelihood is
$$
\begin{aligned}
&\ p\log(\lambda_1) - \left(\frac{n}{2}+p+1\right)\log(\sigma^2)
- p\log\Gamma_U\left(\frac{1}{2}, \frac{\lambda_1^2}{8\sigma^2\lambda_2}\right)
- \frac{1}{2}\sum_{j=1}^{p}\log(\tau_j-1) \\
&\quad - \frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)
- \frac{\lambda_2}{2\sigma^2}\sum_{j=1}^{p}\frac{\tau_j}{\tau_j-1}\beta_j^2
- \frac{1}{2\sigma^2}\,\frac{\lambda_1^2}{4\lambda_2}\sum_{j=1}^{p}\tau_j \\
&= p\log(\lambda_1)
- p\log\Gamma_U\left(\frac{1}{2}, \frac{\lambda_1^2}{8\sigma^2\lambda_2}\right)
- \frac{\lambda_2}{2\sigma^2}\sum_{j=1}^{p}\frac{\tau_j}{\tau_j-1}\beta_j^2
- \frac{\lambda_1^2}{8\sigma^2\lambda_2}\sum_{j=1}^{p}\tau_j \\
&\quad + \text{terms not involving } \lambda_1 \text{ and } \lambda_2.
\end{aligned}
$$

So at the $k$th step of the Monte Carlo em algorithm, the conditional log-likelihood
given $\lambda^{(k-1)} = \left(\lambda_1^{(k-1)}, \lambda_2^{(k-1)}\right)$ and $Y$ is
$$
\begin{aligned}
Q(\lambda \mid \lambda^{(k-1)})
&= p\log(\lambda_1)
- p\,E\left[\log\Gamma_U\left(\frac{1}{2}, \frac{\lambda_1^2}{8\sigma^2\lambda_2}\right)
\,\Big|\, \lambda^{(k-1)}, Y\right]
- \frac{\lambda_2}{2}\sum_{j=1}^{p}
E\left[\frac{\tau_j}{\tau_j-1}\,\frac{\beta_j^2}{\sigma^2}
\,\Big|\, \lambda^{(k-1)}, Y\right] \\
&\quad - \frac{\lambda_1^2}{8\lambda_2}\sum_{j=1}^{p}
E\left[\frac{\tau_j}{\sigma^2} \,\Big|\, \lambda^{(k-1)}, Y\right]
+ \text{terms not involving } \lambda_1 \text{ and } \lambda_2 \\
&= R(\lambda \mid \lambda^{(k-1)})
+ \text{terms not involving } \lambda_1 \text{ and } \lambda_2.
\end{aligned}
$$
Then the M-step is to find $\lambda_1$ and $\lambda_2$ that maximize $R(\lambda \mid \lambda^{(k-1)})$.


To facilitate the maximization problem, we may need the gradient of $R(\lambda \mid \lambda^{(k-1)})$,
which is given by
$$
\begin{aligned}
\frac{\partial R}{\partial \lambda_1}
&= \frac{p}{\lambda_1}
+ \frac{p\lambda_1}{4\lambda_2}
E\left[\left\{\Gamma_U\left(\frac{1}{2}, \frac{\lambda_1^2}{8\sigma^2\lambda_2}\right)\right\}^{-1}
\varphi\left(\frac{\lambda_1^2}{8\sigma^2\lambda_2}\right)\frac{1}{\sigma^2}
\,\Big|\, \lambda^{(k-1)}, Y\right]
- \frac{\lambda_1}{4\lambda_2}\sum_{j=1}^{p}
E\left[\frac{\tau_j}{\sigma^2} \,\Big|\, \lambda^{(k-1)}, Y\right], \\
\frac{\partial R}{\partial \lambda_2}
&= -\frac{p\lambda_1^2}{8\lambda_2^2}
E\left[\left\{\Gamma_U\left(\frac{1}{2}, \frac{\lambda_1^2}{8\sigma^2\lambda_2}\right)\right\}^{-1}
\varphi\left(\frac{\lambda_1^2}{8\sigma^2\lambda_2}\right)\frac{1}{\sigma^2}
\,\Big|\, \lambda^{(k-1)}, Y\right] \\
&\quad - \frac{1}{2}\sum_{j=1}^{p}
E\left[\frac{\tau_j}{\tau_j-1}\,\frac{\beta_j^2}{\sigma^2}
\,\Big|\, \lambda^{(k-1)}, Y\right]
+ \frac{\lambda_1^2}{8\lambda_2^2}\sum_{j=1}^{p}
E\left[\frac{\tau_j}{\sigma^2} \,\Big|\, \lambda^{(k-1)}, Y\right],
\end{aligned}
$$
where $\varphi(t) = t^{-\frac{1}{2}}e^{-t}$.
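As an illustration, $R$ and $\partial R/\partial\lambda_1$ can be approximated from posterior draws by replacing the expectations with Monte Carlo averages. The sketch below is our own (function and variable names are assumptions, not the authors' code); it uses the closed form $\Gamma_U(1/2, x) = \sqrt{\pi}\,\mathrm{erfc}(\sqrt{x})$, so the analytic gradient can be checked against a finite difference of $R$:

```python
import math

SQRT_PI = math.sqrt(math.pi)

def gamma_u_half(x):
    # Upper incomplete gamma Gamma_U(1/2, x) = sqrt(pi) * erfc(sqrt(x)).
    return SQRT_PI * math.erfc(math.sqrt(x))

def phi(t):
    # phi(t) = t^(-1/2) * exp(-t), as in the gradient expressions.
    return math.exp(-t) / math.sqrt(t)

def R_mc(lam1, lam2, beta, tau, sig2):
    # Monte Carlo estimate of R(lambda | lambda^(k-1)) from posterior
    # draws: beta[m][j], tau[m][j] (> 1), sig2[m], m = 1..M, j = 1..p.
    M, p = len(sig2), len(beta[0])
    out = p * math.log(lam1)
    out -= p * sum(math.log(gamma_u_half(lam1**2 / (8.0 * s * lam2)))
                   for s in sig2) / M
    out -= (lam2 / 2.0) * sum(
        tau[m][j] / (tau[m][j] - 1.0) * beta[m][j]**2 / sig2[m]
        for m in range(M) for j in range(p)) / M
    out -= lam1**2 / (8.0 * lam2) * sum(
        tau[m][j] / sig2[m] for m in range(M) for j in range(p)) / M
    return out

def dR_dlam1(lam1, lam2, beta, tau, sig2):
    # Analytic partial derivative of R with respect to lambda_1,
    # with the expectations replaced by the same Monte Carlo averages.
    M, p = len(sig2), len(beta[0])
    x = [lam1**2 / (8.0 * s * lam2) for s in sig2]
    e_gamma = sum(phi(xm) / (gamma_u_half(xm) * sig2[m])
                  for m, xm in enumerate(x)) / M
    e_tau = sum(tau[m][j] / sig2[m]
                for m in range(M) for j in range(p)) / M
    return (p / lam1 + p * lam1 / (4.0 * lam2) * e_gamma
            - lam1 / (4.0 * lam2) * e_tau)
```

With the expectations fixed at their sample averages, the analytic gradient matches a central finite difference of `R_mc`, which is a convenient sanity check before handing both to a gradient-based optimizer for the M-step.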
References
Andrews, D. F. and Mallows, C. L. (1974). "Scale mixtures of normal distributions."
J. R. Statist. Soc. B, 36: 99–102. 167

Berger, J. O. (1993). Statistical Decision Theory and Bayesian Analysis. Springer, 2nd
edition. 157

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2008). "The horseshoe estimator for
sparse signals." Discussion Paper 2008-31, Duke University Department of Statistical
Science. 153

Casella, G. (2001). "Empirical Bayes Gibbs sampling." Biostatistics, 2: 485–500. 153,
157

Chhikara, R. and Folks, L. (1988). The Inverse Gaussian Distribution: Theory, Method-
ology, and Applications. Marcel Dekker, Inc, 1st edition. 156

Dagpunar, J. S. (1989). "An easily implemented generalised inverse Gaussian genera-
tor." Commun. Statist. - Simula., 18: 703–710. 156

DiDonato, A. R. and Morris, A. H. (1986). "Computation of the incomplete gamma
function ratios and their inverse." ACM Trans. Math. Soft., 12: 377–393. 156

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). "Least angle regression."
Ann. Stat., 32: 407–499. 151, 162

Fan, J. and Li, R. (2001). "Variable selection via nonconcave penalized likelihood and
its oracle properties." J. Amer. Statist. Assoc., 96: 1348–1360. 152, 154, 166

Griffin, J. E. and Brown, P. J. (2009). "Inference with Normal-Gamma prior distribu-
tions in regression problems." Technical report, Institute of Mathematics, Statistics
and Actuarial Science, University of Kent. 153

Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002). "Gene selection for cancer
classification using support vector machines." Mach. Learn., 46: 389–422. 151

Hans, C. (2009a). "Bayesian lasso regression." Biometrika, 96: 835–845. 153

— (2009b). "Model uncertainty and variable selection in Bayesian lasso." Stat. Com-
put. To appear. 153

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learn-
ing: Data Mining, Inference and Prediction. Springer, 2nd edition. 152

Jørgensen, B. (1982). "Statistical properties of the generalized inverse Gaussian distribu-
tion." Lecture Notes in Statistics, 9. 156

Knight, K. and Fu, W. (2000). "Asymptotics for lasso-type estimators." Ann. Stat.,
28: 1356–1378. 152

Michael, J. R., Schucany, W. R., and Haas, R. W. (1976). "Generating random variates
using transformations with multiple roots." Am. Stat., 30(2): 88–90. 156

Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). "On the LASSO and its dual."
J. Comput. Graph. Stat., 9: 319–337. 152

Park, T. and Casella, G. (2008). "The Bayesian lasso." J. Amer. Statist. Assoc., 103:
681–686. 153, 154, 157, 165, 166

R Development Core Team (2005). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
URL http://www.R-project.org 156

Scheipl, F. and Kneib, T. (2009). "Locally adaptive Bayesian P-splines with a Normal-
Exponential-Gamma prior." Computational Statistics & Data Analysis, 53(10):
3533–3552. 153

Scott, D. (2008). HyperbolicDist: The hyperbolic distribution. R package version 0.5-2.
URL http://www.r-project.org 156

Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., and Yang, N.
(1989). "Prostate specific antigen in the diagnosis and treatment of adenocarcinoma
of the prostate II: radical prostatectomy treated patients." J. Urol., 16: 1076–1083.
164

Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso." J. R. Statist.
Soc. B, 58: 267–288. 151, 152

Zou, H. and Hastie, T. (2005). "Regularization and variable selection via the elastic
net." J. R. Statist. Soc. B, 67: 301–320. 151, 152, 153, 158, 164

Zou, H. and Zhang, H. H. (2009). "On the adaptive elastic-net with a diverging number
of parameters." Ann. Stat., 37: 1733–1751. 166
