Overview
Variable selection in linear models
Yuqi Chen,1 Pang Du2 and Yuedong Wang1*
1 Department of Statistics and Applied Probability, University of California – Santa Barbara, Santa Barbara, CA, USA
2 Department of Statistics, Virginia Tech, Blacksburg, VA, USA
*Correspondence to: [email protected]
Conflict of interest: The authors have declared no conflicts of interest for this article.
Variable selection in linear models is essential for improved inference and
interpretation, an activity that has become even more critical for high
dimensional data. In this article, we provide a selective review of some classical
methods including the Akaike information criterion, Bayesian information criterion,
Mallows' Cp, and risk inflation criterion, as well as regularization methods including
the Lasso, bridge regression, smoothly clipped absolute deviation, minimax concave
penalty, adaptive Lasso, elastic-net, and group Lasso. We discuss how to select the
penalty parameters. We also provide a review of some screening procedures for
ultra high dimensions. © 2013 Wiley Periodicals, Inc.
How to cite this article:
WIREs Comput Stat 2014, 6:1–9. doi: 10.1002/wics.1284
Keywords: elastic-net; generalized information criterion; Lasso; regularization
method; smoothly clipped absolute deviation
INTRODUCTION
Consider the linear model

y = Xβ + ε,   (1)
where y is a vector of n observations from a response variable Y, X is an n × p design matrix from p predictors X1, X2, . . . , Xp, β = (β1, . . . , βp)ᵀ is a vector of p unknown coefficients, and ε ∼ N(0, σ²In) is a vector of n independent and identically distributed random errors. Without loss of generality, we assume that the response variable is centered and the predictors are standardized. That is, yᵀ1 = 0 and Xᵀ1 = 0, where 1 is a vector of ones of length n, 0 denotes a vector of zeros of conforming dimension, and the diagonal elements of XᵀX equal 1. As a consequence, the linear model Eq. (1) does not contain the intercept.
One inevitable issue when building a linear
model is which predictors to include in the model.
Modern applications of the linear model often involve
a large number of predictors and it is likely that not
all predictors are important. Simpler models enhance
the efficiency of statistical inference and model
interpretability. Sometimes it is desirable to select
the most important predictors without losing too
much prediction accuracy. For high-dimensional data
where p > n, unique least squares estimates of the
parameters do not exist, and variable selection is
necessary in this situation.
Variable selection for linear models has attracted
a great deal of research. Many classical methods such
as Akaike information criterion (AIC),1 Bayesian
information criterion (BIC),2 Mallows' Cp,3 and
risk inflation criterion (RIC)4 have been developed
through the years. Various regularization methods
have been developed in recent decades.
There has been a considerable amount of
research on variable selection for high-dimensional
data.5 It is usually assumed for high-dimensional
data that the p-dimensional parameter vector β is
sparse with many components being zero.6 Classical
methods such as BIC and RIC have been extended for
high dimensional data.7–9 Regularization methods are
especially powerful and flexible for high dimensional
data.10
As there is a vast amount of literature on the
topic, this review focuses on two general approaches:
the generalized information criterion (GIC) and
the regularization method. We also review some
screening procedures for ultra high dimensional data.
GENERALIZED INFORMATION
CRITERION
To illustrate the basic concepts behind variable
selection, let us first consider two nested linear models
M1: y = X1β1 + ε, and M2: y = X1β1 + X2β2 + ε. We
will have the situation of over-fitting when we fit
model M2 while the true model is M1 and the situation
of under-fitting when we fit model M1 while the true
model is M2. The consequence of over-fitting is a loss of precision in parameter estimation through larger variances, while the consequence of under-fitting is biased estimates of the parameters.11 Intuitively speaking,
including more variables in a linear model reduces
potential bias, but at the same time, makes estimation
more difficult.12 Therefore, variable selection is
essentially a compromise between bias and variance.
Occam's razor suggests that the model that fits the
observations sufficiently well in the least complex
way should be preferred. For linear models, a natural
choice for goodness of fit is the residual sum of
squares, ||y − Xβ̂||², and a natural choice for model
complexity is the degrees of freedom p. Therefore, a
direct compromise between goodness of fit and model
complexity is the following GIC13

||y − Xβ̂||² + ξσ²p,   (2)
where ξ is a positive number that controls the trade-off
between two conflicting aspects of a model: goodness
of fit and model complexity. The GIC contains
several well-known criteria as special cases: AIC and
Mallows' Cp with ξ = 2, BIC with ξ = log n, and RIC
with ξ = 2 log p.
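To make the criterion concrete, the following sketch (ours, not from the cited references) evaluates the GIC in Eq. (2) over all subsets of a small number of predictors, with ξ set to 2 (AIC/Cp), log n (BIC), or 2 log p (RIC); the simulated data and the assumption that σ² is known are purely illustrative.

```python
import numpy as np
from itertools import combinations

def gic_best_subset(X, y, sigma2, xi):
    """Exhaustive search minimizing ||y - X_S beta_hat_S||^2 + xi * sigma2 * |S|, as in Eq. (2)."""
    n, p = X.shape
    best_subset, best_score = (), float(np.sum(y ** 2))  # empty model: RSS = ||y||^2, 0 parameters
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            Xs = X[:, subset]
            beta_hat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta_hat) ** 2)
            score = rss + xi * sigma2 * k
            if score < best_score:
                best_subset, best_score = subset, score
    return best_subset

rng = np.random.default_rng(0)
n, p, sigma2 = 100, 8, 1.0
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.5, 0, 0, 1.0, 0, 0, 0])
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

for name, xi in [("AIC/Cp", 2.0), ("BIC", np.log(n)), ("RIC", 2 * np.log(p))]:
    print(name, "selects", gic_best_subset(X, y, sigma2, xi))
```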
For the purpose of variable selection, it is
often desirable to have selection methods that can
identify predictors with nonzero coefficients in Eq. (1)
correctly.
Denote β* as the true coefficient vector and π* = {j : β*j ≠ 0} as the index set of its nonzero components. A variable selection method is (selection) consistent if the subset it selects, π̂, satisfies

Pr(π̂ = π*) → 1 as n → ∞.
In the remainder of this article, for simplicity,
consistency means selection consistency. For the
purpose of estimation, it is often desirable to have
nonzero coefficients to be estimated efficiently. A
selection and estimation procedure is said to have
an oracle property if, in addition to being consistent,
the nonzero coefficients are estimated as well as when
the correct submodel is known, that is, the asymptotic
covariance of the estimates of true nonzero coefficients
is the same as the one when the true model is known.6
Shao13 and Kim et al.14 studied the asymptotic
properties of the GIC. Under regularity conditions,
Kim et al.14 have found sufficient conditions for
consistency of GIC. In particular, they have shown
that the BIC is consistent when p is fixed or p = n^γ
where 0 < γ < 1/2. The AIC is not consistent.
To select the best model using the GIC, one
may compare all 2^p possible submodels. This is
a combinatorial problem with NP-complexity.15
Therefore, the best subset selection approach is computationally intensive or even prohibitive when p is
large. Sequential methods such as forward/backward
stepwise selection are often used as alternatives.
However, due to the myopic nature of stepwise
algorithms, the result is likely to be trapped in a local
optimum. To reduce the computational burden and maintain the consistency of the GIC method, one approach is to
select the best model among a sequence of submodels
that includes the true model with probability converging to one.14 Under the irrepresentable condition,
the Lasso solution path provides such a sequence.16
For high dimensional data where the irrepresentable
condition is hardly satisfied, the smoothly clipped
absolute deviation or minimax concave penalty
solution path provides such a sequence.14
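As an illustration of selecting the best model along a nested sequence of submodels, the sketch below (our own, not the algorithm of Kim et al.) builds the sequence by forward stepwise selection and then minimizes the GIC with ξ = log n over the sequence; the error variance is replaced by the full-model residual variance estimate, which assumes n > p.

```python
import numpy as np

def forward_stepwise_path(X, y, max_steps=None):
    """Greedy forward selection; returns the nested sequence of (subset, RSS) pairs."""
    n, p = X.shape
    max_steps = max_steps if max_steps is not None else min(n - 1, p)
    active, path = [], []
    for _ in range(max_steps):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in active:
                continue
            Xs = X[:, active + [j]]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        active = active + [best_j]
        path.append((tuple(active), best_rss))
    return path

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.standard_normal(n)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = np.sum((y - X @ beta_full) ** 2) / (n - p)   # full-model variance estimate

path = forward_stepwise_path(X, y)
xi = np.log(n)                                            # BIC choice of xi
scores = [rss + xi * sigma2_hat * len(subset) for subset, rss in path]
print("selected predictors:", path[int(np.argmin(scores))][0])
```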
It has been shown that, for high dimensional
data, classical model selection criteria such as AIC and
BIC tend to select more variables than necessary.17,18
Therefore, classical methods need to be modified for
high-dimensional data. The modified BIC proposed by
Wang et al.8 is consistent when p diverges slower than
n. The extended BIC proposed by Chen and Chen7
and corrected RIC proposed by Zhang and Shen9 are
consistent even when p > n.
The penalty on model complexity in the GIC Eq. (2),
ξσ²p, is linear in p. Nonlinear penalties have been
considered by Abramovich and Grinshtein19 (see
references therein).
REGULARIZATION METHODS
One potential problem with variable selection is its
discrete nature: each variable is either included in
or excluded from a model. This may lead to completely
different estimates with a small change in the data. Consequently, subset selection is often unstable and highly
variable.20 One possible remedy is to penalize the
coefficients rather than the number of parameters.
Specifically, consider the penalized least squares (PLS)

||y − Xβ||² + Jλ(β),   (3)
where Jλ(β) is a penalty function on the coefficients
with penalty parameter(s) λ. Many regularization
methods assume the following form of the penalty
function:

Jλ(β) = Σ_{j=1}^p pλ(|βj|),   (4)

where pλ is a penalty function on each individual
coefficient.
Bridge Regression and Lasso
A popular choice of the penalty function pλ in
Eq. (4) is the Lq norm: pλ(t) = λ|t|^q for q > 0.
When 0 < q ≤ 2, the resulting penalized regression
is referred to as the bridge regression.21 The Lq
penalty shrinks coefficients toward zero. In particular,
q = 2 corresponds to the ridge regression, a traditional
method to deal with ill-conditioned design matrices.22
For variable selection, it is desirable to have a penalty
function that shrinks some small coefficients all the
way to zero. In this case the PLS method provides
simultaneous variable selection and estimation. When
q ≤ 1, the limiting distributions of bridge estimators
can have positive probability mass at 0 when the
true value of the parameter is zero.23 Therefore, the
Lq penalty with q ≤ 1 has the desirable property
that it produces coefficients exactly equal to zero. In
particular, the popular Lasso (least absolute shrinkage
and selection operator) corresponds to q = 1, for which the
PLS is a convex optimization problem.24,25 Note that
as q → 0, pλ(t) → λI(t ≠ 0). Therefore, the GIC is a
limiting case of Eq. (3) with the L0 norm.
Asymptotic properties of Lasso have been
well-studied.16,26–32 Zhao and Yu16 introduced an
irrepresentable condition and showed that this
condition is almost necessary and sufficient for Lasso
to be consistent. The irrepresentable condition is
quite restrictive, especially for high dimensional data.
Therefore Lasso is in general not consistent. On the
other hand, the bridge estimators with q < 1 satisfy
the oracle property.27
The early success of Lasso was limited by its
computational intricacy due to the nondifferentiable
L1 norm. Efron et al.33 provided an ingenious
geometric view of the Lasso penalty that yields the
LARS algorithm for computing the whole solution
path of the Lasso. Utilizing the fact that the Lasso
solution path is piecewise linear in the penalty parameter, the LARS algorithm only
requires the same computational cost as the full least
squares fit on the data. For large Lasso problems,
the cyclical coordinate descent algorithm is very
efficient.34,35 See also Fu,36 Osborne et al.37 and Wu
and Lange38 for computational methods for Lasso.
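The core update of cyclical coordinate descent for the Lasso is a univariate soft-thresholding step. The sketch below is a bare-bones illustration (not the implementation of Friedman et al.) for the objective ||y − Xβ||² + λΣ|βj|, assuming the columns of X are standardized so that diag(XᵀX) = 1 as in Eq. (1).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclical coordinate descent for ||y - X beta||^2 + lam * sum_j |beta_j|.

    Assumes the columns of X are standardized so that diag(X^T X) = 1, as in Eq. (1).
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                      # current residual
    for _ in range(n_iter):
        for j in range(p):
            r_j = r + X[:, j] * beta[j]   # partial residual excluding predictor j
            z = X[:, j] @ r_j             # univariate least squares update (X_j^T X_j = 1)
            beta[j] = soft_threshold(z, lam / 2.0)
            r = r_j - X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(2)
n, p = 150, 20
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)
X = X / np.sqrt(np.sum(X ** 2, axis=0))   # standardize so diag(X^T X) = 1
beta_true = np.zeros(p)
beta_true[:3] = [4.0, -3.0, 2.0]
y = X @ beta_true + 0.5 * rng.standard_normal(n)
y = y - y.mean()                          # center the response

print(np.round(lasso_cd(X, y, lam=1.0), 2))
```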
Improved Penalty Functions
Fan and Li6 considered three properties for the
penalty function: sparsity, the resulting estimator
automatically sets small estimated coefficients to
zero; unbiasedness, the resulting estimator is nearly
unbiased; and continuity, the resulting estimator is
continuous in the data to reduce instability in model
prediction. None of the Lq penalties satisfies all three
properties simultaneously. In particular, the Lasso
produces biased estimates for large coefficients.6 This
has motivated Fan and Li6 to propose the smoothly
clipped absolute deviation (SCAD) penalty
p′λ(t) = λ { I(t ≤ λ) + (aλ − t)_+ / ((a − 1)λ) · I(t > λ) },   (5)
where pλ (0) = 0 and a > 2 is a constant. The SCAD
penalty satisfies all three properties. The resulting
estimator possesses the oracle property.6,39–41
However, since the SCAD penalty is nonconvex,
the involved computation is more difficult. Zou and
Li42 proposed a local linear approximation algorithm
that borrows the strength of LARS. The most recent,
and possibly also the most efficient, algorithm for
solving the SCAD problem is the iterative coordinate
ascent algorithm proposed by Fan and Lv.39
Zhang43 proposed the following minimax
concave penalty (MCP), defined through its derivative

p′λ(t) = (aλ − t)_+ / a,   (6)
and showed that the resulting procedure possesses
the oracle property. A penalized linear unbiased
selection (PLUS) algorithm was proposed for the
MCP procedure.43
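The derivative formulas in Eqs (5) and (6) translate directly into code. The following short sketch (ours, for illustration only) evaluates p′λ(t) for SCAD and MCP on a grid of nonnegative t, using a = 3.7 for SCAD and a = 2 for MCP as in Figure 1.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD derivative, Eq. (5): lam * { I(t <= lam) + (a*lam - t)_+ / ((a-1)*lam) * I(t > lam) }."""
    t = np.asarray(t, dtype=float)
    return lam * np.where(t <= lam, 1.0,
                          np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam))

def mcp_deriv(t, lam, a=2.0):
    """MCP derivative, Eq. (6): (a*lam - t)_+ / a."""
    t = np.asarray(t, dtype=float)
    return np.maximum(a * lam - t, 0.0) / a

t = np.linspace(0.0, 6.0, 7)
print("SCAD p':", np.round(scad_deriv(t, lam=1.5), 3))
print("MCP  p':", np.round(mcp_deriv(t, lam=1.5), 3))
```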
To overcome the lack of oracle property
of the Lasso, Zou44,45 proposed a simple but
important modification, called the adaptive Lasso,
which replaces the L1 penalty by a weighted version:
Jλ(β) = λ Σ_{j=1}^p |βj| / |β̂init,j|^γ,   (7)

where β̂init are root-n-consistent initial estimates of
β and γ > 0 is a preselected constant. The adaptive
Lasso has the oracle property under some regularity
conditions.45,44 However, the adaptive Lasso has an
undesirable property that the penalty is infinite at
zero.5 The adaptive Lasso estimates can be calculated
using the same algorithms for Lasso.
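A minimal sketch of the adaptive Lasso (ours, not the authors' code): compute initial estimates (ordinary least squares here, which assumes n > p), form the weights 1/|β̂init,j|^γ, absorb them into the design by rescaling the columns, solve an ordinary Lasso, and rescale back. scikit-learn's Lasso solver is used purely for convenience; note that its objective divides the squared error by 2n, so its alpha is not on the same scale as λ in Eq. (7).

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, alpha, gamma=1.0):
    """Adaptive Lasso via column rescaling, with OLS initial estimates (assumes n > p)."""
    beta_init, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = 1.0 / np.maximum(np.abs(beta_init), 1e-8) ** gamma   # weights 1 / |beta_init_j|^gamma
    X_tilde = X / w                                          # rescale column j by 1 / w_j
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(X_tilde, y)
    return fit.coef_ / w                                     # map back to the original scale

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.standard_normal(n)
print(np.round(adaptive_lasso(X, y, alpha=0.1), 2))
```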
To overcome the problem of discontinuity in the L0 penalty, Dicker et al.46 proposed the following SELO penalty (seamless-L0)

pλ(t) = (λ1 / log 2) log( |t| / (|t| + λ2) + 1 ).   (8)

While SCAD and MCP mimic the L1 penalty, SELO mimics the L0 penalty. The SELO procedure has the oracle property when p = o(n).46 The SELO estimators can be obtained by the same iterative coordinate ascent algorithm proposed by Fan and Lv.39

Combinations of the L1 Penalty With Another Penalty
Several authors have considered combinations of different penalties. The Lasso is unstable when the predictors are highly correlated.47 For high dimensional data, sample correlation can be large even when predictors are independent.5 To overcome this problem, Zou and Hastie47 proposed the elastic-net (ENet) that combines the L1 and L2 penalties:

pλ(t) = λ1|t| + λ2t²,   (9)

where the L1 penalty encourages sparsity in the coefficients and the L2 penalty encourages similar coefficient estimates among highly correlated predictors. The penalty Eq. (9) corresponds to the naïve ENet, which produces estimates β̂(naïve). The ENet estimates are β̂(ENet) = (1 + λ2)β̂(naïve). Consistency of the ENet has been studied by Yuan and Lin48 and Jia and Yu.49 Zou and Hastie47 proposed an efficient algorithm called LARS-EN to solve the ENet efficiently. To overcome the lack of oracle property of the ENet, Zou and Zhang50 later proposed the adaptive ENet that combines the strengths of the quadratic regularization and the adaptively weighted Lasso shrinkage. Under weak regularity conditions, they established the oracle property of the adaptive ENet when 0 ≤ lim_{n→∞}(log p / log n) < 1.

Liu and Wu51 combined the L0 and L1 penalties:

pλ(t) = (1 − λ1) min{|t|/λ2, 1} + λ1|t|,   (10)

where min{|t|/λ2, 1} is a continuous approximation of the L0 norm. The penalty Eq. (10) will be referred to as the L0L1 penalty. The L0L1 penalty overcomes disadvantages of the L0 and L1 penalties. Liu and Wu51 developed a global optimization algorithm using mixed integer programming to implement the L0L1 penalty.

Wu et al.52 proposed a procedure that combines the L1 and L∞ penalties:

Jλ(β) = λ1 Σ_{j=1}^p |βj| + λ∞ ||β||∞,   (11)

where ||β||∞ = max_{1≤j≤p} |βj|. While the L1 penalty leads to sparsity, the L∞ penalty encourages grouping among highly correlated predictors. The resulting procedure is adaptive to both sparse and nonsparse situations. Wu et al.52 developed a homotopy algorithm for efficient computation.

For illustration, Figure 1 shows the penalty functions of Lasso, SCAD, MCP, SELO, ENet, and L0L1.

FIGURE 1 | Penalty functions of Lasso, SCAD, MCP, SELO, ENet, and L0L1. The tuning parameters are selected as follows: λ = 1.5 for Lasso; a = 3.7 and λ = 1.5 for SCAD; a = 2 and λ = 1.5 for MCP; λ1 = 1.5 and λ2 = 2 for SELO; λ1 = 1 and λ2 = 0.1 for ENet; and λ1 = 1.5 and λ2 = 2 for L0L1.
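The penalties in Eqs (8)–(10) are straightforward to evaluate; the sketch below (ours) computes them on a grid of t with the tuning parameters listed in the Figure 1 caption, which is one way to reproduce a comparison like that in the figure.

```python
import numpy as np

def selo(t, lam1=1.5, lam2=2.0):
    """SELO penalty, Eq. (8)."""
    t = np.abs(t)
    return (lam1 / np.log(2.0)) * np.log(t / (t + lam2) + 1.0)

def enet(t, lam1=1.0, lam2=0.1):
    """Elastic-net penalty, Eq. (9)."""
    return lam1 * np.abs(t) + lam2 * t ** 2

def l0l1(t, lam1=1.5, lam2=2.0):
    """L0L1 penalty, Eq. (10)."""
    t = np.abs(t)
    return (1.0 - lam1) * np.minimum(t / lam2, 1.0) + lam1 * t

# Grid of coefficient values, with tuning parameters taken from the Figure 1 caption.
t = np.linspace(-5.0, 5.0, 11)
print("SELO:", np.round(selo(t), 2))
print("ENet:", np.round(enet(t), 2))
print("L0L1:", np.round(l0l1(t), 2))
```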
Group Lasso
When predictors are grouped, such as the dummy variables for a multilevel categorical variable, one may wish to select groups of predictors rather than individual predictors. Suppose there are J groups. Without loss of generality, denote β = (β_(1)ᵀ, . . . , β_(J)ᵀ)ᵀ as the partition of the coefficients according to the J groups. The group Lasso proposed by Yuan and Lin53 assumed the following penalty in Eq. (3):

Jλ(β) = λ Σ_{j=1}^J ||β_(j)||_{Kj},   (12)

where ||β_(j)||_{Kj} = (β_(j)ᵀ Kj β_(j))^{1/2} for j = 1, . . . , J, and K1, . . . , KJ are positive definite matrices. Asymptotic properties of the group Lasso have been studied by Bach54 and Nardi and Rinaldo.55
Other group-level procedures were developed by Kim et al.,56 Wang et al.,57 and Zhao et al.58 Penalties at both the group level and the individual predictor level were considered by Huang et al.,59 Breheny and Huang,60 Friedman et al.,61 Zhou and Zhu,62 and Geng.63 Huang et al.64 provided a selective review of group variable selection procedures.
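To convey the blockwise shrinkage behind Eq. (12), the sketch below (ours, not Yuan and Lin's algorithm) evaluates the group penalty in the common special case Kj = I and applies the group soft-thresholding operator, which is the proximal step for a single group under an orthonormal within-group design.

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """Group Lasso penalty, Eq. (12) with K_j = I: lam * sum_j ||beta_(j)||_2."""
    return lam * sum(np.linalg.norm(beta[idx]) for idx in groups)

def group_soft_threshold(z, lam):
    """Proximal operator of lam * ||.||_2: shrink the whole block toward zero."""
    norm = np.linalg.norm(z)
    if norm <= lam:
        return np.zeros_like(z)
    return (1.0 - lam / norm) * z

beta = np.array([0.5, -0.2, 3.0, 1.0, 0.1, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]   # three groups of size 2
print("penalty:", round(group_lasso_penalty(beta, groups, lam=1.0), 3))
print("shrunken groups:",
      [np.round(group_soft_threshold(beta[idx], 1.0), 3) for idx in groups])
```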
SELECTION OF PENALTY
PARAMETERS
The regularization methods often involve parameters
controlling the amount of penalization. Proper tuning
of these parameters is critical to the performance of
these methods. As an all-purpose option, K-fold
cross-validation has been a popular choice,
especially in the early years.
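A generic K-fold cross-validation loop for choosing λ might look like the following sketch (ours); the scikit-learn Lasso fit is only a placeholder for whichever regularization method is being tuned, and the grid of λ values is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

def kfold_cv_lambda(X, y, lambdas, K=5, seed=0):
    """Return the lambda minimizing the K-fold cross-validated prediction error."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    cv_error = []
    for lam in lambdas:
        errs = []
        for k in range(K):
            test = folds[k]
            train = np.setdiff1d(np.arange(n), test)
            fit = Lasso(alpha=lam, fit_intercept=False).fit(X[train], y[train])
            errs.append(np.mean((y[test] - X[test] @ fit.coef_) ** 2))
        cv_error.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_error))]

rng = np.random.default_rng(4)
X = rng.standard_normal((120, 15))
y = 2 * X[:, 0] - X[:, 7] + rng.standard_normal(120)
print("selected lambda:", kfold_cv_lambda(X, y, lambdas=np.logspace(-3, 0, 20)))
```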
Classical methods such as the GIC may also be
used. Since the error variance σ² is usually unknown
in practice, we consider the GIC proposed by Nishii65

n log ||y − Xβ̂(λ)||² + ξ d(λ),   (13)

where β̂(λ) is an estimate of β based on the
model chosen with fixed penalty parameter λ, d(λ)
is an appropriate ‘degrees of freedom’ that measures
complexity of the model with fixed penalty parameter
λ, and ξ controls the trade-off between goodness of
fit and model complexity. AIC and BIC correspond to
the special cases when ξ = 2 and ξ = log n.
An alternative criterion is the generalized cross-validation (GCV):

GCV(λ) = ||y − Xβ̂(λ)||² / (n − d(λ))².   (14)
GIC and GCV choices of λ are the minimizers
of the GIC and the GCV criterion, respectively. To
be able to use these criteria, we need an appropriate
measure of model complexity d(λ). When there is no
variable selection, the number of parameters in the
model is a logical choice for the degrees of freedom
d(λ) which was used in the standard GIC in Eq. (2).
When variable selection is involved, the choice of d(λ)
is not always clear since the cost of the selection should
be taken into account.66–68
For some regularization methods it is possible to
derive good estimates for the degrees of freedom d(λ).
For the Lasso procedure, Zou et al.69 showed that
the number of nonzero coefficients is an unbiased and
consistent estimator of d(λ). Wang et al.8 proposed
a modified BIC with ξ = C_n log n for the situation
when p → ∞, and showed that the modified BIC is
consistent when C_n → ∞.
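Using the number of nonzero coefficients as d(λ), the criterion in Eq. (13) with ξ = log n can be evaluated along a grid of λ values. The sketch below (ours) does this for the Lasso with scikit-learn as the fitting routine; recall that scikit-learn's alpha is on a different scale from the λ used in this article.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bic_select_lambda(X, y, lambdas):
    """Minimize n*log(RSS) + log(n)*d(lambda), with d(lambda) = number of nonzero coefficients."""
    n = X.shape[0]
    scores = []
    for lam in lambdas:
        coef = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
        rss = np.sum((y - X @ coef) ** 2)
        d = np.count_nonzero(coef)
        scores.append(n * np.log(rss) + np.log(n) * d)
    return lambdas[int(np.argmin(scores))]

rng = np.random.default_rng(5)
X = rng.standard_normal((150, 12))
y = 2.5 * X[:, 1] - X[:, 4] + rng.standard_normal(150)
print("BIC-selected lambda:", bic_select_lambda(X, y, np.logspace(-3, 0, 25)))
```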
For the SCAD procedure, Fan and Li6 proposed
to use the generalized degrees of freedom defined as
d(λ) = tr{ X (XᵀX + nΣ(λ))⁻¹ Xᵀ },

where Σ(λ) = diag{ p′λ(|β̂1(λ)|)/|β̂1(λ)|, . . . , p′λ(|β̂p(λ)|)/|β̂p(λ)| }. The d(λ) is calculated based
on submatrices of X and Σ(λ) corresponding to
the selected covariates. Wang et al.70 showed that
the model selected by GCV contains all important
variables but includes some unimportant variables
with nonzero probability, whereas the BIC can identify
the true model consistently.
For the SELO procedure, Dicker et al.46 proposed to estimate d(λ) by the number of nonzero coefficients. They showed that the modified BIC proposed
by Wang et al.8 is consistent for the SELO procedure.
Generalized degrees of freedom (GDF) is
a generic measure of model complexity for any
modeling procedure, viewed as a map
from observations to fitted values.66,68 It accounts for
the cost due to both model selection and parameter
estimation. Therefore, it may be used to estimate
d(λ) for a regularization procedure when a simple
estimate of the degrees of freedom is not available.
Denote y = (y1, . . . , yn)ᵀ and μ = (μ1, . . . , μn)ᵀ,
where μi = E(yi) for i = 1, . . . , n. For a regularization
method with fixed penalty parameter λ, denote the
resulting fitted values as μ̂i(λ) for i = 1, . . . , n. The
GDF is defined as66,68

GDF(λ) = Σ_{i=1}^n ∂E_μ(μ̂i(λ))/∂μi = (1/σ²) Σ_{i=1}^n cov(μ̂i(λ), yi).   (15)
Extending the degrees of freedom to general
modeling procedures, the GDF can be viewed as the
sum of the sensitivities of the fitted values to a small
change in the response. GDF(λ) cannot be used directly
since it depends on the unknown true mean values
μ. One may estimate GDF(λ) using Monte Carlo
methods such as the perturbation technique described
in Ye68 and the bootstrap technique described in
Tibshirani and Knight71 and Efron.66 The estimate of
GDF(λ) may be used as an estimate of d(λ).
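A crude Monte Carlo estimate of GDF(λ) can be formed by perturbing the response and estimating the covariances in Eq. (15). The sketch below (our simplification, loosely in the spirit of the perturbation idea rather than Ye's exact algorithm) works for any fitting routine that returns fitted values; for ordinary least squares it should recover approximately the number of parameters.

```python
import numpy as np

def estimate_gdf(fit_fn, X, y, tau=0.5, B=200, seed=0):
    """Monte Carlo estimate of the generalized degrees of freedom.

    fit_fn(X, y) must return fitted values; GDF is approximated by
    sum_i cov(mu_hat_i(y + delta), delta_i) / tau^2 over B perturbations.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    deltas = rng.normal(scale=tau, size=(B, n))
    fits = np.array([fit_fn(X, y + d) for d in deltas])      # B x n fitted values
    # covariance between each fitted value and its own perturbation
    cov = np.mean((fits - fits.mean(axis=0)) * (deltas - deltas.mean(axis=0)), axis=0)
    return float(np.sum(cov) / tau ** 2)

def ols_fit(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

rng = np.random.default_rng(6)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, 0.0, 2.0, 0.0, -1.0]) + rng.standard_normal(100)
print("estimated GDF for OLS (about 5 expected):", round(estimate_gdf(ols_fit, X, y), 2))
```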
More research is necessary on the choice of
penalty parameters. Theoretical properties of the
estimates of β with data-driven selection of penalty
parameters have received scant attention.
SCREENING PROCEDURES FOR ULTRA
HIGH DIMENSIONS
The regularization methods in the last section can
comfortably deal with high dimensional cases when p
is almost as large as n but may have difficulty when
applied to data with p ≫ n. For example, a genetic
study can have thousands of genes with only a few
hundred observations and filtering out only tens of
important genes can be a daunting task for these
regularization methods. This difficulty has motivated
research on the ultra high dimensional cases when
p can increase in an exponential order exp{O(n^α)},
α > 0, of the sample size n.
The sure independence screening (SIS)
procedure72 is among the first approaches to
tackling such ultra high dimensional problems. Let
the p-vector ω = Xᵀy be obtained from the componentwise
regressions of Y against each Xj. Then the p componentwise
magnitudes of ω are sorted in decreasing order to define a submodel

M_d = { 1 ≤ j ≤ p : |ωj| is among the first d largest of all },

where a conservative practical choice of d suggested
in the article is [n/log n], with [x] denoting the integer
part of x. The SIS is a hard-thresholding approach.
Through the use of the marginal information on the
correlation between each predictor and the response,
it can reduce the dimensionality from exp{O(n^α)}
to o(n) in a fast and efficient way. The procedure
is shown to achieve the sure screening property,
that is, all the important variables survive after
variable screening with probability tending to 1. An
iteration of the procedure is needed when the features
are marginally unrelated but jointly related to the
response variable. After reducing from an ultra high
dimensional problem to a much lower dimension of
o(n), if needed, a variable selection procedure such
as SCAD and adaptive Lasso can be applied to the
selected variables from the screening procedure.
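The SIS screening step itself is simple to code; the sketch below (ours) ranks predictors by |ωj| = |Xjᵀy| and keeps the top d = [n/log n] of them, after which a penalized method could be applied to the retained columns. The columns are standardized first, as assumed in Eq. (1).

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Sure independence screening: keep the d predictors with the largest |X_j^T y|."""
    n = X.shape[0]
    d = d if d is not None else int(n / np.log(n))     # conservative choice [n / log n]
    omega = np.abs(X.T @ y)                            # componentwise regression magnitudes
    return np.sort(np.argsort(omega)[::-1][:d])        # indices of the d largest |omega_j|

rng = np.random.default_rng(7)
n, p = 100, 2000
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)               # standardize predictors as in Eq. (1)
y = 3 * X[:, 10] - 2 * X[:, 500] + rng.standard_normal(n)
y = y - y.mean()

keep = sis_screen(X, y)
print("kept", len(keep), "predictors; includes 10 and 500:", {10, 500} <= set(keep.tolist()))
```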
Huang73 studied the screening property of the
forward regression (FR) method with log(p) = O(n^ξ)
for some 0 < ξ < 1. The size p0 of the true model
T can diverge at the order of O(n^{ξ0}) for some
0 < ξ0 < 1. The FR algorithm starts with the null
model S(0) = ∅ and iterates to update S(k−1) to S(k)
by adding the predictor that gives the smallest residual
sum of squares among all the outside predictors
when it is augmented to the model S(k−1). The
algorithm stops at k = n and yields the solution path
S = {S(k) : 1 ≤ k ≤ n}. The extended BIC7 is then
used to select the final candidate model Ŝ. Huang
showed that Ŝ has the sure screening property.
A three-stage screening and variable selection
procedure was proposed in Wasserman and Roeder74
with log p = O(n^{c2}), 0 ≤ c2 < 1. At stage 1, a suite
of candidate models, each depending on a tuning
parameter λ, are fitted. Particularly, the candidate
models considered in the article can be from the
Lasso method with the regularization parameter
λ, the forward stepwise regression after λ steps,
or the marginal regression with a threshold λ on
the magnitudes of the regression coefficients. At
stage 2, one model is selected from the candidates
through cross-validation. At stage 3, the model is
further cleaned through eliminating some variables by
hypothesis testing. Theoretical properties such as the
selection consistency are established there.
For the ultra high dimensional case with
log(p) = O(n^ν), 0 < ν < 1, Huang et al.75 showed the
oracle property for the adaptive Lasso, although their
result requires a consistent initial estimator, which is
often unavailable in ultra high dimensional problems.
Kim et al.41 revisited the SCAD procedure and
established its oracle property in the ultra high
dimensional case with log(p) = O(n).
CONCLUSION
In this article we have focused on two approaches
for variable selection in linear models: the classical
approach unified under the GIC and the regularization
approach unified under the PLS. For simplicity, we have
assumed that the random errors in model Eq. (1) follow
a normal distribution. We note that many methods
only require the weaker assumption that the random errors
are independently and identically distributed with zero
mean and a finite variance.
Many important approaches are not reviewed
due to space limitations. For example, nonnegative
garrote76 and Dantzig selector77 are both popular
procedures for high dimensional variable selection.
However, they do not quite fit under the PLS framework and
thus are not described in detail here. More references
about them and other approaches can be found in the
Further Reading section.
REFERENCES
1. Akaike H. Information theory and an extension of the
maximum likelihood principle. In: Second International
Symposium on Information Theory, vol. 1. Budapest:
Akademiai Kiado; 1973, 267–281.
2. Schwarz G. Estimating the dimension of a model. Ann
Stat 1978, 6:461–464.
3. Mallows CL. Some comments on Cp. Technometrics
1973, 15:661–675.
4. Foster DP, George EI. The risk inflation criterion for multiple regression. Ann Stat 1994, 22:
1947–1975.
5. Fan J, Lv J. A selective overview of variable selection
in high dimensional feature space. Stat Sinica 2010,
20:101–148.
26. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous
analysis of Lasso and Dantzig selector. Ann Stat 2009,
37:1705–1732.
6. Fan J, Li R. Variable selection via nonconcave penalized
likelihood and its oracle properties. J Am Stat Assoc
2001, 96:1348–1360.
27. Huang J, Horowitz JL, Ma S. Asymptotic properties of
bridge estimators in sparse high-dimensional regression
models. Ann Stat 2008, 36:587–613.
7. Chen J, Chen Z. Extended Bayesian information criteria
for model selection with large model spaces. Biometrika
2008, 95:759–771.
28. Leng C, Lin Y, Wahba G. A note on the Lasso and
related procedures in model selection. Stat Sinica 2006,
16:1273–1284.
8. Wang H, Li B, Leng C. Shrinkage tuning parameter
selection with a diverging number of parameters. J R
Stat Soc Ser B 2009, 71:671–683.
29. Meinshausen N, Bühlmann P. High-dimensional graphs
and variable selection with the Lasso. Ann Stat 2006,
34:1436–1462.
9. Zhang Y, Shen X. Model selection procedure for high-dimensional data. Stat Anal Data Min 2010, 3:350–358.
30. Van de Geer SA. High-dimensional generalized linear
models and the Lasso. Ann Stat 2008, 36:614–645.
10. Bühlmann P, Van de Geer SA. Statistics for High-Dimensional Data. New York: Springer; 2011.
31. Wainwright M. Sharp thresholds for noisy and high-dimensional recovery of sparsity using L1-constrained
quadratic programming (Lasso). IEEE Trans Inform
Theory 2009, 55:2183–2202.
11. Seber GAF, Lee AJ. Linear Regression Analysis. New
York: Wiley; 2003.
12. Yang Y. Model selection for nonparametric regression.
Stat Sinica 1999, 9:475–499.
13. Shao J. An asymptotic theory for linear model selection
(with discussion). Stat Sinica 1997, 7:221–264.
14. Kim Y, Kwon S, Choi H. Consistent model selection
criteria on high dimensions. J Mach Learn Res 2012,
13:1037–1057.
15. Huo X, Ni XS. When do stepwise algorithms meet
subset selection criteria? Ann Stat 2007, 35:870–887.
16. Zhao P, Yu B. On model selection consistency of Lasso.
J Mach Learn Res 2006, 7:2541–2563.
17. Broman KW, Speed TP. A model selection approach
for the identification of quantitative trait loci in
experimental crosses. J R Stat Soc Ser B 2002,
64:641–656.
18. Casella G, Girón FJ, Martínez ML, Moreno E.
Consistency of Bayesian procedures for variable
selection. Ann Stat 2009, 37:1207–1228.
19. Abramovich F, Grinshtein V. Model selection in
Gaussian regression for high-dimensional data. In:
Inverse Problems and High-Dimensional Estimation,
vol. 203. Berlin: Springer; 2011, 159–170.
20. Breiman L. Heuristics of instability and stabilization in
model selection. Ann Stat 1996, 24:2350–2383.
21. Frank LE, Friedman JH. A statistical view of some
chemometrics regression tools. Technometrics 1993,
35:109–135.
22. Hoerl AE, Kennard RW. Ridge regression: biased
estimation for nonorthogonal problems. Technometrics
1970, 12:55–67.
32. Zhang CH, Huang J. The sparsity and bias of the Lasso
selection in high-dimensional linear regression. Ann Stat
2008, 36:1567–1594.
33. Efron B, Hastie T, Johnstone I, Tibshirani R. Least
angle regression (with discussion). Ann Stat 2004,
32:407–499.
34. Friedman J, Hastie T, Höfling H, Tibshirani R.
Pathwise coordinate optimization. Ann Appl Stat 2007,
1:302–332.
35. Friedman J, Hastie T, Tibshirani R. Regularization
paths for generalized linear models via coordinate
descent. J Stat Softw 2010, 33:1–22.
36. Fu W. Penalized regressions: the bridge vs. the Lasso. J
Comput Graph Stat 1998, 7:397–416.
37. Osborne M, Presnell B, Turlach B. A new approach
to variable selection in least squares problems. IMA J
Numer Anal 2000, 20:389–404.
38. Wu T, Lange K. Coordinate descent procedures for
Lasso penalized regression. Ann Appl Stat 2008,
2:224–244.
39. Fan J, Lv J. Nonconcave penalized likelihood with
NP-dimensionality. IEEE Trans Inform Theory 2011,
57:5467–5484.
40. Fan J, Peng H. Nonconcave penalized likelihood with
a diverging number of parameters. Ann Stat 2004,
32:928–961.
41. Kim Y, Choi H, Oh HS. Smoothly clipped absolute
deviation on high dimensions. J Am Stat Assoc 2008,
103:1665–1673.
23. Knight K, Fu W. Asymptotics for Lasso-type estimators.
Ann Stat 2000, 28:1356–1378.
42. Zou H, Li R. One-step sparse estimates in
nonconcave penalized likelihood models. Ann Stat
2008, 36:1509–1533.
24. Hastie T, Tibshirani R, Friedman J. The Elements of
Statistical Learning. New York: Springer; 2002.
43. Zhang CH. Nearly unbiased variable selection under
minimax concave penalty. Ann Stat 2010, 38:894–942.
25. Tibshirani R. Regression shrinkage and selection via the
Lasso. J R Stat Soc Ser B 1996, 58:267–288.
44. Zou H. The adaptive Lasso and its oracle properties. J
Am Stat Assoc 2006, 101:1418–1429.
45. Huang J, Ma S, Zhang CH. Adaptive Lasso for sparse
high-dimensional regression models. Stat Sinica 2008,
18:1603–1618.
62. Zhou N, Zhu J. Group variable selection via a
hierarchical Lasso and its oracle property. Stat Interface
2010, 3:557–574.
46. Dicker L, Huang B, Lin X. Variable selection and
estimation with the seamless-l0 penalty. Stat Sinica
2012, 23:929–962.
63. Geng Z. Group variable selection via convex Log-ExpSum penalty with application to a breast cancer survivor
study. PhD thesis, University of Wisconsin; 2013.
47. Zou H, Hastie T. Regularization and variable selection
via the elastic net. J R Stat Soc Ser B 2005, 67:301–320.
64. Huang J, Breheny P, Ma S. A selective review of group
selection in high-dimensional models. Stat Sci 2012,
27:481–499.
65. Nishii R. Asymptotic properties of criteria for selection
of variables in multiple regression. Ann Stat 1984,
12:758–765.
48. Yuan M, Lin Y. On the non-negative garrotte estimator.
J R Stat Soc Ser B 2007, 69:143–161.
49. Jia J, Yu B. On model selection consistency of the elastic
net when p ≫ n. Stat Sinica 2010, 20:595–611.
50. Zou H, Zhang HH. On the adaptive elastic-net with
a diverging number of parameters. Ann Stat 2009,
37:1733–1751.
51. Liu Y, Wu Y. Variable selection via a combination of
the L0 and L1 penalties. J Comput Graph Stat 2007,
16:782–798.
52. Wu S, Shen X, Geyer CJ. Adaptive regularization
using the entire solution surface. Biometrika 2009,
96:513–527.
53. Yuan M, Lin Y. Model selection and estimation in
regression with grouped variables. J R Stat Soc Ser B
2006, 68:49–67.
54. Bach F. Consistency of the group Lasso and multiple
kernel learning. J Mach Learn Res 2008, 9:1179–1225.
55. Nardi Y, Rinaldo A. On the asymptotic properties of
the group Lasso estimator for linear models. Electron J
Stat 2008, 2:605–633.
66. Efron B. The estimation of prediction error: covariance
penalties and cross-validation (with discussion). J Am
Stat Assoc 2004, 99:619–632.
67. Sklar JC, Wu J, Meiring W, Wang Y. Non-parametric
regression with basis selection from multiple libraries.
Technometrics 2013, 55:189–201.
68. Ye JM. On measuring and correcting the effects of data
mining and model selection. J Am Stat Assoc 1998,
93:120–131.
69. Zou H, Hastie T, Tibshirani R. On the ‘‘degrees of
freedom’’ of the Lasso. Ann Stat 2007, 35:2173–2192.
70. Wang H, Li R, Tsai CL. Tuning parameter selectors
for the smoothly clipped absolute deviation method.
Biometrika 2007, 94:553–558.
71. Tibshirani R, Knight K. The covariance inflation
criterion for adaptive model selection. J R Stat Soc
Ser B 1999, 61:529–546.
56. Kim Y, Kim J, Kim Y. Blockwise sparse regression. Stat
Sinica 2006, 16:375–390.
72. Fan J, Lv J. Sure independence screening for ultrahigh
dimensional feature space. J R Stat Soc Ser B 2008,
70:849–911.
57. Wang L, Chen G, Li H. Group SCAD regression
analysis for microarray time course gene expression.
Bioinformatics 2007, 23:1486–1494.
73. Huang H. Forward regression for ultra-high dimensional variable screening. J Am Stat Assoc 2009,
104:1512–1524.
58. Zhao P, Rocha G, Yu B. Grouped and hierarchical
model selection through composite absolute penalties.
Ann Stat 2009, 37:3468–3497.
74. Wasserman L, Roeder K. High-dimensional variable
selection. Ann Stat 2009, 37:2178–2201.
59. Huang J, Ma S, Xie H, Zhang CH. A group bridge
approach for variable selection. Biometrika 2009,
96:339–355.
75. Huang J, Ma S, Zhang CH. The iterated Lasso for high-dimensional logistic regression. Technical Report No.
392, The University of Iowa, Department of Statistics
and Actuarial Science, 2008.
60. Breheny P, Huang J. Penalized methods for bi-level
variable selection. Stat Interface 2009, 2:369–380.
76. Breiman L. Better subset regression using the
nonnegative garrote. Technometrics 1995, 37:373–384.
61. Friedman J, Hastie T, Tibshirani R. A note on the
group Lasso and a sparse group Lasso. Technical report,
Department of Statistics, Stanford University; 2010.
77. Candes E, Tao T. The Dantzig selector: statistical
estimation when p is much larger than n. Ann Stat
2007, 35:2313–2351.
FURTHER READING
Baraud Y, Giraud C, Huet S. Gaussian model selection with an unknown variance. Ann Stat 2009, 37:630–672.
Birgé L, Massart P. Minimal penalties for Gaussian model selection. Probab Theory Rel Fields 2007, 138:33–73.
George EI, McCulloch RE. Approaches for Bayesian variable selection. Stat Sinica 1997, 7:339–373.
James GM, Radchenko P, Lv J. DASSO: connections between the Dantzig selector and Lasso. J R Stat Soc Ser B 2009,
71:127–142.
McQuarrie ADR, Tsai CL. Regression and Time Series Model Selection. River Edge: World Scientific Publishing; 1998.
Miller A. Subset Selection in Regression. Boca Raton, FL: Chapman & Hall/CRC; 2002.
O'Hara RB, Sillanpää MJ. A review of Bayesian variable selection methods: what, how and which. Bayesian Anal 2009, 4:85–118.