
Variable selection in linear models

Variable selection in linear models is essential for improved inference and interpretation, an activity which has become even more critical for high dimensional data. In this article, we provide a selective review of some classical methods including Akaike information criterion, Bayesian information criterion, Mallow's Cp and risk inflation criterion, as well as regularization methods including Lasso, bridge regression, smoothly clipped absolute deviation, minimax concave penalty, adaptive Lasso, elastic-net, and group Lasso. We discuss how to select the penalty parameters. We also provide a review of some screening procedures for ultra high dimensions.

Yuqi Chen,1 Pang Du2 and Yuedong Wang1*

1 Department of Statistics and Applied Probability, University of California – Santa Barbara, Santa Barbara, CA, USA
2 Department of Statistics, Virginia Tech, Blacksburg, VA, USA
* Correspondence to: [email protected]

Conflict of interest: The authors have declared no conflicts of interest for this article.

How to cite this article: WIREs Comput Stat 2014, 6:1–9. doi: 10.1002/wics.1284

Keywords: elastic-net; generalized information criterion; Lasso; regularization method; smoothly clipped absolute deviation

INTRODUCTION

Consider the linear model

y = Xβ + ε,  (1)

where y is a vector of n observations from a response variable Y, X is an n × p design matrix from p predictors X1, X2, ..., Xp, β = (β1, ..., βp)ᵀ is a vector of p unknown coefficients, and ε ~ N(0, σ²In) is a vector of n independent and identically distributed random errors. Without loss of generality, we assume that the response variable is centered and the predictors are standardized. That is, yᵀ1 = 0 and Xᵀ1 = 0, where 0 and 1 are vectors of dimension n with all elements equal to zero and one, respectively, and the diagonal elements of XᵀX equal 1. As a consequence, the linear model Eq. (1) does not contain an intercept.

One inevitable issue when building a linear model is which predictors to include in the model. Modern applications of the linear model often involve a large number of predictors, and it is likely that not all of them are important. Simpler models enhance the efficiency of statistical inference and model interpretability. Sometimes it is desirable to select the most important predictors without losing too much prediction accuracy. For high-dimensional data where p > n, unique least squares estimates of the parameters do not exist, and variable selection is necessary in this situation.

Variable selection for linear models has attracted a great deal of research. Many classical methods such as the Akaike information criterion (AIC),1 Bayesian information criterion (BIC),2 Mallow's Cp,3 and risk inflation criterion (RIC)4 have been developed through the years. Various regularization methods have been developed in recent decades. There has been a considerable amount of research on variable selection for high-dimensional data.5 It is usually assumed for high-dimensional data that the p-dimensional parameter vector β is sparse, with many components being zero.6 Classical methods such as BIC and RIC have been extended for high dimensional data.7–9 Regularization methods are especially powerful and flexible for high dimensional data.10 As there is a vast amount of literature on the topic, this review focuses on two general approaches: the generalized information criterion (GIC) and the regularization method. We also review some screening procedures for ultra high dimensional data.
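The centering and standardization convention above is easy to impose in practice. The following sketch in Python/NumPy (illustrative only, not from the article) simulates data from model (1) and applies the convention: the response is centered and each predictor column is scaled so that the diagonal elements of XᵀX equal 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10

# Simulate raw data from a sparse linear model: only the first 3 coefficients are nonzero.
X_raw = rng.normal(size=(n, p))
beta_true = np.concatenate([[3.0, -2.0, 1.5], np.zeros(p - 3)])
y_raw = 1.0 + X_raw @ beta_true + rng.normal(scale=1.0, size=n)

# Center the response: y^T 1 = 0.
y = y_raw - y_raw.mean()

# Center each predictor (X^T 1 = 0) and scale columns so diag(X^T X) = 1.
X = X_raw - X_raw.mean(axis=0)
X = X / np.sqrt((X ** 2).sum(axis=0))

print(np.allclose(y.sum(), 0), np.allclose((X ** 2).sum(axis=0), 1))
```

After this transformation the working model has no intercept, which is the form assumed throughout the article.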
GENERALIZED INFORMATION CRITERION

To illustrate the basic concepts behind variable selection, let us first consider two nested linear models

M1: y = X1β1 + ε and M2: y = X1β1 + X2β2 + ε.

We will have over-fitting when we fit model M2 while the true model is M1, and under-fitting when we fit model M1 while the true model is M2. The consequence of over-fitting is a loss of precision in parameter estimation through larger variances, while the consequence of under-fitting is biased parameter estimates.11 Intuitively speaking, including more variables in a linear model reduces potential bias but, at the same time, makes estimation more difficult.12 Therefore, variable selection is essentially a compromise between bias and variance. Occam's razor suggests that the model fitting the observations sufficiently well in the least complex way should be preferred. For linear models, a natural choice for goodness of fit is the residual sum of squares, ||y − Xβ̂||², and a natural choice for model complexity is the degrees of freedom p. Therefore a direct compromise between goodness of fit and model complexity is the following GIC13

||y − Xβ̂||² + ξσ²p,  (2)

where ξ is a positive number that controls the trade-off between two conflicting aspects of a model: goodness of fit and model complexity. The GIC contains several well-known criteria as special cases: AIC and Mallow's Cp with ξ = 2, BIC with ξ = log n, and RIC with ξ = 2 log p.

For the purpose of variable selection, it is often desirable to have selection methods that can correctly identify the predictors with nonzero coefficients in Eq. (1). Denote β* as the true coefficient vector and π* = {j: βj* ≠ 0} as the index set of all nonzero coefficients. A variable selection method is (selection) consistent if the subset it selects, π̂, satisfies Pr(π̂ = π*) → 1 as n → ∞. In the remainder of this article, for simplicity, consistency means selection consistency. For the purpose of estimation, it is often desirable to have the nonzero coefficients estimated efficiently. A selection and estimation procedure is said to have an oracle property if, in addition to being consistent, the nonzero coefficients are estimated as well as when the correct submodel is known; that is, the asymptotic covariance of the estimates of the true nonzero coefficients is the same as when the true model is known.6

Shao13 and Kim et al.14 studied the asymptotic properties of the GIC. Under regularity conditions, Kim et al.14 found sufficient conditions for consistency of the GIC. In particular, they showed that the BIC is consistent when p is fixed or p = n^γ with 0 < γ < 1/2. The AIC is not consistent.

To select the best model using the GIC, one may compare all 2^p possible submodels. This is a combinatorial problem with NP-complexity.15 Therefore, the best subset selection approach is computationally intensive or even prohibitive when p is large. Sequential methods such as forward/backward stepwise selection are often used as alternatives. However, due to the myopic property of stepwise algorithms, the result is likely to be trapped in local optima.
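To make Eq. (2) concrete, the sketch below (Python/NumPy, illustrative only) enumerates all 2^p submodels of a small simulated problem and scores each with the GIC for a given ξ. Since Eq. (2) treats σ² as given, we plug in the usual estimate from the full least squares fit, which is an assumption on our part. Setting ξ = 2, log n, or 2 log p recovers AIC/Cp, BIC, or RIC, respectively.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 6
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

# Plug-in estimate of sigma^2 from the full least squares fit (an assumed choice).
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = np.sum((y - X @ beta_full) ** 2) / (n - p)

def gic(subset, xi):
    """Residual sum of squares of the submodel plus the complexity penalty xi * sigma2 * |subset|."""
    if len(subset) == 0:
        rss = np.sum(y ** 2)
    else:
        Xs = X[:, subset]
        b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ b) ** 2)
    return rss + xi * sigma2_hat * len(subset)

all_subsets = [list(c) for k in range(p + 1) for c in combinations(range(p), k)]
for name, xi in [("AIC/Cp", 2.0), ("BIC", np.log(n)), ("RIC", 2.0 * np.log(p))]:
    best = min(all_subsets, key=lambda s: gic(s, xi))
    print(name, "selects predictors", best)
```

The exhaustive loop over all_subsets is exactly the 2^p enumeration discussed above, which is why it is only feasible for small p.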
To reduce the computational burden and maintain consistency of the GIC method, one approach is to select the best model among a sequence of submodels that includes the true model with probability converging to one.14 Under the irrepresentable condition, the Lasso solution path provides such a sequence.16 For high dimensional data, where the irrepresentable condition is hardly satisfied, the smoothly clipped absolute deviation or minimax concave penalty solution path provides such a sequence.14

It has been shown that, for high dimensional data, classical model selection criteria such as AIC and BIC tend to select more variables than necessary.17,18 Therefore, classical methods need to be modified for high dimensional data. The modified BIC proposed by Wang et al.8 is consistent when p diverges more slowly than n. The extended BIC proposed by Chen and Chen7 and the corrected RIC proposed by Zhang and Shen9 are consistent even when p > n. The penalty on model complexity in the GIC Eq. (2), ξσ²p, is linear in p. Nonlinear penalties have been considered by Abramovich and Grinshtein19 (see references therein).

REGULARIZATION METHODS

One potential problem with variable selection is its discrete nature: each variable is either included in or excluded from a model. This may lead to completely different estimates with small changes in the data. Consequently, subset selection is often unstable and highly variable.20 One possible remedy is to penalize the coefficients rather than the number of parameters. Specifically, consider the penalized least squares (PLS)

||y − Xβ||² + Jλ(β),  (3)

where Jλ(β) is a penalty function on the coefficients with penalty parameter(s) λ. Many regularization methods assume the following form of the penalty function

Jλ(β) = Σ_{j=1}^{p} pλ(βj),  (4)

where pλ is a penalty function on an individual coefficient.

Bridge Regression and Lasso

A popular choice of the penalty function pλ in Eq. (4) is the Lq norm: pλ(t) = λ|t|^q for q > 0. When 0 < q ≤ 2, the resulting penalized regression is referred to as bridge regression.21 The Lq penalty shrinks coefficients toward zero. In particular, q = 2 corresponds to ridge regression, a traditional method for dealing with ill-conditioned design matrices.22 For variable selection, it is desirable to have a penalty function that shrinks some small coefficients all the way to zero. In this case the PLS method provides simultaneous variable selection and estimation. When q ≤ 1, the limiting distributions of bridge estimators can have positive probability mass at 0 when the true value of the parameter is zero.23 Therefore, the Lq penalty with q ≤ 1 has the desirable property of producing coefficients exactly equal to zero. In particular, the popular Lasso (least absolute shrinkage and selection operator) corresponds to q = 1, for which the PLS is a convex optimization problem.24,25 Note that as q → 0, pλ(t) → λI(t ≠ 0). Therefore, the GIC is a limiting case of Eq. (3) with the L0 norm.

Asymptotic properties of the Lasso have been well studied.16,26–32 Zhao and Yu16 introduced an irrepresentable condition and showed that this condition is almost necessary and sufficient for the Lasso to be consistent. The irrepresentable condition is quite restrictive, especially for high dimensional data. Therefore the Lasso is in general not consistent. On the other hand, the bridge estimators with q < 1 satisfy the oracle property.27
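The contrast between q = 2 (ridge, which only shrinks) and q = 1 (Lasso, which sets coefficients exactly to zero) is easy to see numerically. The sketch below is illustrative only and uses scikit-learn's Lasso and Ridge as off-the-shelf solvers of the PLS problem (3); note that scikit-learn scales the residual sum of squares by 1/(2n) in its Lasso objective, so its alpha corresponds to the penalty parameter only up to that factor, and the particular alpha values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.concatenate([[3.0, -2.0, 1.5], np.zeros(p - 3)])
y = X @ beta_true + rng.normal(size=n)

# q = 1: the L1 penalty sets some estimated coefficients exactly to zero.
lasso = Lasso(alpha=0.5).fit(X, y)
# q = 2: the ridge penalty only shrinks; estimates are essentially never exactly zero.
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Exact zeros -- Lasso:", int(np.sum(lasso.coef_ == 0.0)),
      " Ridge:", int(np.sum(ridge.coef_ == 0.0)))
```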
The early success of the Lasso was limited by its computational intricacy due to the nondifferentiable L1 norm. Efron et al.33 provided an ingenious geometric view of the Lasso penalty that yields the LARS algorithm for computing the whole solution path of the Lasso. Utilizing the fact that the Lasso solution path is piecewise linear, the LARS algorithm requires only the same order of computational cost as a full least squares fit on the data. For large Lasso problems, the cyclical coordinate descent algorithm is very efficient.34,35 See also Fu,36 Osborne et al.,37 and Wu and Lange38 for computational methods for the Lasso.

Improved Penalty Functions

Fan and Li6 considered three desirable properties for the penalty function: sparsity, the resulting estimator automatically sets small estimated coefficients to zero; unbiasedness, the resulting estimator is nearly unbiased; and continuity, the resulting estimator is continuous in the data, which reduces instability in model prediction. None of the Lq penalties satisfies all three properties simultaneously. In particular, the Lasso produces biased estimates for large coefficients.6 This motivated Fan and Li6 to propose the smoothly clipped absolute deviation (SCAD) penalty, defined through its derivative

p′λ(t) = λ[I(t ≤ λ) + (aλ − t)₊/((a − 1)λ) I(t > λ)] for t > 0,  (5)

where pλ(0) = 0 and a > 2 is a constant. The SCAD penalty satisfies all three properties, and the resulting estimator possesses the oracle property.6,39–41 However, the SCAD penalty is not convex, so the involved computation is more difficult. Zou and Li42 proposed a local linear approximation algorithm that borrows the strength of LARS. The most recent, and possibly also the most efficient, algorithm for solving the SCAD problem is the iterative coordinate ascent algorithm proposed by Fan and Lv.39

Zhang43 proposed the minimax concave penalty (MCP), defined through its derivative

p′λ(t) = (aλ − t)₊/a for t > 0,  (6)

and showed that the resulting procedure possesses the oracle property. A penalized linear unbiased selection (PLUS) algorithm was proposed for the MCP procedure.43

To overcome the lack of the oracle property of the Lasso, Zou44,45 proposed a simple but important modification, called the adaptive Lasso, which replaces the L1 penalty by a weighted version:

Jλ(β) = λ Σ_{j=1}^{p} |βj|/|β̂init,j|^γ,  (7)

where β̂init is a root-n-consistent initial estimate of β and γ > 0 is a preselected constant. The adaptive Lasso has the oracle property under some regularity conditions.44,45 However, the adaptive Lasso has the undesirable property that the penalty is infinite at zero.5 The adaptive Lasso estimates can be calculated using the same algorithms as for the Lasso.

To overcome the problem of discontinuity in the L0 penalty, Dicker et al.46 proposed the following SELO (seamless-L0) penalty

pλ(t) = (λ1/log 2) log(|t|/(|t| + λ2) + 1).  (8)

While SCAD and MCP mimic the L1 penalty, SELO mimics the L0 penalty. The SELO procedure has the oracle property when p = o(n).46 The SELO estimators can be obtained by the same iterative coordinate descent algorithm proposed by Fan and Lv.39
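For intuition about how these penalties treat small versus large coefficients, the sketch below (illustrative, not from the article) evaluates the derivative forms in Eqs (5) and (6) and the SELO penalty in Eq. (8) on a grid of coefficient magnitudes, using the same tuning parameters as in Figure 1. A constant derivative equal to λ, as SCAD has for t ≤ λ, gives Lasso-like shrinkage, while a derivative that drops to zero for t beyond aλ leaves large coefficients nearly unbiased.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD derivative p'_lambda(t) for t >= 0, as in Eq. (5)."""
    t = np.asarray(t, dtype=float)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

def mcp_deriv(t, lam, a=2.0):
    """MCP derivative p'_lambda(t) for t >= 0, as in Eq. (6)."""
    t = np.asarray(t, dtype=float)
    return np.maximum(a * lam - t, 0.0) / a

def selo(t, lam1, lam2):
    """SELO penalty p_lambda(t), as in Eq. (8)."""
    t = np.abs(np.asarray(t, dtype=float))
    return lam1 / np.log(2.0) * np.log(t / (t + lam2) + 1.0)

grid = np.linspace(0.0, 8.0, 5)
print("SCAD derivative:", np.round(scad_deriv(grid, lam=1.5), 3))   # equals lambda for small t, then decays to 0
print("MCP derivative: ", np.round(mcp_deriv(grid, lam=1.5), 3))    # decays linearly, reaching 0 at t = a*lambda
print("SELO penalty:   ", np.round(selo(grid, lam1=1.5, lam2=2.0), 3))  # levels off, mimicking the L0 penalty
```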
Combinations of the L1 Penalty With Another Penalty

Several authors have considered combinations of different penalties. The Lasso is unstable when the predictors are highly correlated.47 For high dimensional data, sample correlations can be large even when predictors are independent.5 To overcome this problem, Zou and Hastie47 proposed the elastic-net (ENet), which combines the L1 and L2 penalties:

pλ(t) = λ1|t| + λ2t²,  (9)

where the L1 penalty encourages sparsity in the coefficients and the L2 penalty encourages similar coefficient estimates among highly correlated predictors. The penalty Eq. (9) corresponds to the naïve ENet, which produces estimates β̂(naïve). The ENet estimates are β̂(ENet) = (1 + λ2)β̂(naïve). Consistency of the ENet has been studied by Yuan and Lin48 and Jia and Yu.49 Zou and Hastie47 proposed an efficient algorithm called LARS-EN to compute the ENet.

To overcome the lack of the oracle property of the ENet, Zou and Zhang50 later proposed the adaptive ENet, which combines the strengths of the quadratic regularization and the adaptively weighted Lasso shrinkage. Under weak regularity conditions, they established the oracle property of the adaptive ENet when 0 ≤ lim_{n→∞} (log p/log n) < 1.

Liu and Wu51 combined the L0 and L1 penalties:

pλ(t) = (1 − λ1) min{|t|/λ2, 1} + λ1|t|,  (10)

where min{|t|/λ2, 1} is a continuous approximation of the L0 norm. The penalty Eq. (10) will be referred to as the L0L1 penalty. The L0L1 penalty overcomes the disadvantages of the L0 and L1 penalties. Liu and Wu51 developed a global optimization algorithm using mixed integer programming to implement the L0L1 penalty.

Wu et al.52 proposed a procedure that combines the L1 and L∞ penalties:

Jλ(β) = λ1 Σ_{j=1}^{p} |βj| + λ∞||β||∞,  (11)

where ||β||∞ = max_{1≤j≤p} |βj|. While the L1 penalty leads to sparsity, the L∞ penalty encourages grouping among highly correlated predictors. The resulting procedure is adaptive to both sparse and nonsparse situations. Wu et al.52 developed a homotopy algorithm for efficient computation.

For illustration, Figure 1 shows the penalty functions of Lasso, SCAD, MCP, SELO, ENet, and L0L1.

FIGURE 1 | Penalty functions of Lasso, SCAD, MCP, SELO, ENet, and L0L1. The tuning parameters are selected as follows: λ = 1.5 for Lasso; a = 3.7 and λ = 1.5 for SCAD; a = 2 and λ = 1.5 for MCP; λ1 = 1.5 and λ2 = 2 for SELO; λ1 = 1 and λ2 = 0.1 for ENet; and λ1 = 1.5 and λ2 = 2 for L0L1.

Group Lasso

When predictors are grouped, such as the dummy variables for a multilevel categorical variable, one may wish to select groups of predictors rather than individual predictors. Suppose there are J groups. Without loss of generality, denote β = (β(1)ᵀ, ..., β(J)ᵀ)ᵀ as the partition of the coefficients according to the J groups. The group Lasso proposed by Yuan and Lin53 assumes the following penalty in Eq. (3):

Jλ(β) = λ Σ_{j=1}^{J} ||β(j)||_{Kj},  (12)

where ||β(j)||_{Kj} = (β(j)ᵀ Kj β(j))^{1/2} for j = 1, ..., J, and K1, ..., KJ are positive definite matrices. Asymptotic properties of the group Lasso have been studied by Bach54 and Nardi and Rinaldo.55 Other group-level procedures were developed by Kim et al.,56 Wang et al.,57 and Zhao et al.58 Penalties at both the group level and the individual predictor level were considered by Huang et al.,59 Breheny and Huang,60 Friedman et al.,61 Zhou and Zhu,62 and Geng.63 Huang et al.64 provided a selective review of group variable selection procedures.
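As a small illustration of Eq. (12), the sketch below evaluates the group Lasso penalty for a coefficient vector partitioned into three groups. Taking each Kj to be the identity matrix is a simple choice assumed here for illustration (the article only requires the Kj to be positive definite); with that choice ||β(j)||_{Kj} reduces to the Euclidean norm of the group, so whole groups are either penalized together or contribute nothing.

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam, K=None):
    """Evaluate Eq. (12): lam * sum_j sqrt(beta_(j)^T K_j beta_(j)).

    `groups` is a list of index arrays defining the partition; `K` is an optional
    list of positive definite matrices (the identity is used for each group if omitted).
    """
    total = 0.0
    for j, idx in enumerate(groups):
        b = beta[idx]
        Kj = np.eye(len(idx)) if K is None else K[j]
        total += np.sqrt(b @ Kj @ b)
    return lam * total

beta = np.array([1.0, -2.0, 0.0, 0.0, 0.0, 3.0])
groups = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5])]
print(group_lasso_penalty(beta, groups, lam=1.0))
# The second group contributes nothing: the penalty encourages dropping whole groups at once.
```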
SELECTION OF PENALTY PARAMETERS

The regularization methods often involve parameters controlling the amount of penalization. Proper tuning of these parameters is critical to the performance of these methods. As an all-round option, K-fold cross-validation has always been a popular choice, especially in the early years. Classical methods such as the GIC may also be used. Since the error variance σ² is usually unknown in practice, we consider the GIC proposed by Nishii65

n log ||y − Xβ̂(λ)||² + ξd(λ),  (13)

where β̂(λ) is an estimate of β based on the model chosen with fixed penalty parameter λ, d(λ) is an appropriate 'degrees of freedom' that measures the complexity of the model with fixed penalty parameter λ, and ξ controls the trade-off between goodness of fit and model complexity. AIC and BIC correspond to the special cases ξ = 2 and ξ = log n. An alternative criterion is the generalized cross-validation (GCV)

GCV(λ) = ||y − Xβ̂(λ)||² / (n − d(λ))².  (14)

The GIC and GCV choices of λ are the minimizers of the GIC and GCV criteria, respectively.

To use these criteria, we need an appropriate measure of model complexity d(λ). When there is no variable selection, the number of parameters in the model is a logical choice for the degrees of freedom d(λ), which was used in the standard GIC in Eq. (2). When variable selection is involved, the choice of d(λ) is not always clear since the cost of the selection should be taken into account.66–68

For some regularization methods it is possible to derive good estimates of the degrees of freedom d(λ). For the Lasso procedure, Zou et al.69 showed that the number of nonzero coefficients is an unbiased and consistent estimator of d(λ). Wang et al.8 proposed a modified BIC with ξ = Cn log n for the situation when p → ∞, and showed that the modified BIC is consistent when Cn → ∞. For the SCAD procedure, Fan and Li6 proposed to use the generalized degrees of freedom defined as

d(λ) = tr{X(XᵀX + nΣλ(β̂))⁻¹Xᵀ},

where Σλ(β̂) = diag{p′λ(|β̂1(λ)|)/|β̂1(λ)|, ..., p′λ(|β̂p(λ)|)/|β̂p(λ)|}. The d(λ) is calculated based on the submatrices of X and Σλ(β̂) corresponding to the selected covariates. Wang et al.70 showed that the model selected by GCV contains all important variables but includes some unimportant variables with nonzero probability, whereas the BIC can identify the true model consistently. For the SELO procedure, Dicker et al.46 proposed to estimate d(λ) by the number of nonzero coefficients. They showed that the modified BIC proposed by Wang et al.8 is consistent for the SELO procedure.

The generalized degrees of freedom (GDF) is a generic measure of model complexity for any modeling procedure that is viewed as a map from observations to fitted values.66,68 It accounts for the cost of both model selection and parameter estimation. Therefore, it may be used to estimate d(λ) for a regularization procedure when a simple estimate of the degrees of freedom is not available. Denote y = (y1, ..., yn)ᵀ and μ = (μ1, ..., μn)ᵀ, where μi = E(yi) for i = 1, ..., n. For a regularization method with fixed penalty parameter λ, denote the resulting fitted values by μ̂i(λ) for i = 1, ..., n. The GDF is defined as66,68

GDF(λ) = Σ_{i=1}^{n} ∂Eμ(μ̂i)/∂μi = (1/σ²) Σ_{i=1}^{n} cov(μ̂i, yi).  (15)

Extending the degrees of freedom to general modeling procedures, the GDF can be viewed as the sum of the sensitivities of the fitted values to a small change in the response. GDF(λ) cannot be used directly since it depends on the unknown true mean values μ. One may estimate GDF(λ) using Monte Carlo methods such as the perturbation technique described in Ye68 and the bootstrap technique described in Tibshirani and Knight71 and Efron.66 The estimate of GDF(λ) may then be used as an estimate of d(λ).

More research is necessary on the choice of penalty parameters. Theoretical properties of the estimates of β with data-driven selection of penalty parameters have received scant attention.
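The sketch below (illustrative; scikit-learn's Lasso is used as the solver, and its alpha differs from λ by the 1/(2n) factor in its objective) tunes the Lasso penalty parameter with the criterion in Eq. (13), plugging in the number of nonzero coefficients for d(λ) as justified for the Lasso by Zou et al.69 Setting ξ = log n gives the BIC-type choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.concatenate([[3.0, -2.0, 1.5], np.zeros(p - 3)])
y = X @ beta_true + rng.normal(size=n)

alphas = np.logspace(-2, 0.5, 30)   # candidate penalty parameters (scikit-learn scale)
xi = np.log(n)                      # BIC-type choice of xi in Eq. (13)

scores = []
for a in alphas:
    fit = Lasso(alpha=a, max_iter=10000).fit(X, y)
    rss = np.sum((y - fit.predict(X)) ** 2)
    d = np.sum(fit.coef_ != 0)      # d(lambda): number of nonzero coefficients
    scores.append(n * np.log(rss) + xi * d)   # criterion (13)

best = alphas[int(np.argmin(scores))]
print("selected alpha:", best)
print("nonzero coefficients at the selected alpha:",
      np.flatnonzero(Lasso(alpha=best, max_iter=10000).fit(X, y).coef_))
```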
SCREENING PROCEDURES FOR ULTRA HIGH DIMENSIONS

The regularization methods in the last section can comfortably deal with high dimensional cases when p is almost as large as n, but they may have difficulty when applied to data with p ≫ n. For example, a genetic study can have thousands of genes with only a few hundred observations, and filtering out only tens of important genes can be a daunting task for these regularization methods. This difficulty has motivated research on ultra high dimensional cases, in which p can increase at an exponential rate exp{O(n^α)}, α > 0, in the sample size n.

The sure independence screening (SIS) procedure72 is among the first approaches to tackling such ultra high dimensional problems. Let the p-vector ω = Xᵀy be obtained from the componentwise regressions of Y against each Xj. The p componentwise magnitudes of ω are sorted in decreasing order to define the submodel

Md = {1 ≤ j ≤ p: |ωj| is among the first d largest of all},

where a conservative practical choice of d suggested in the article is [n/log n], with [x] denoting the integer part of x. SIS is a hard-thresholding approach. Through the use of the marginal information in the correlation between each predictor and the response, it can reduce the dimensionality from exp{O(n^α)} to o(n) in a fast and efficient way. The procedure is shown to achieve the sure screening property, that is, all the important variables survive the screening step with probability tending to 1. An iterated version of the procedure is needed when features are marginally unrelated but jointly related to the response variable. After reducing an ultra high dimensional problem to a much lower dimension of o(n), if needed, a variable selection procedure such as SCAD or the adaptive Lasso can be applied to the variables retained by the screening procedure.

Huang73 studied the screening property of the forward regression (FR) method with log(p) = O(n^ξ) for some 0 < ξ < 1. The size p0 of the true model can diverge at the rate O(n^{ξ0}) for some 0 < ξ0 < 1. The FR algorithm starts with the null model S(0) = ∅ and iterates, updating S(k−1) to S(k) by adding the predictor that gives the smallest residual sum of squares among all predictors outside S(k−1) when it is added to the model. The algorithm stops at k = n and yields the solution path S = {S(k): 1 ≤ k ≤ n}. Then the extended BIC7 is used to select the final candidate model Ŝ from this path. Huang showed that Ŝ has the sure screening property.
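A bare-bones version of the SIS ranking step described above is sketched below (illustrative only): it computes ω = Xᵀy for centered and standardized data, keeps the d = [n/log n] variables with the largest |ωj|, and would then hand the retained set to a method such as SCAD or the adaptive Lasso.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 5000                      # ultra high dimensional setting: p >> n
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[10, 200, 3000]] = [4.0, -3.0, 2.5]
y = X @ beta_true + rng.normal(size=n)

# Center the response and standardize predictor columns as in the article's convention.
y = y - y.mean()
X = X - X.mean(axis=0)
X = X / np.sqrt((X ** 2).sum(axis=0))

# Componentwise regression coefficients and the SIS submodel M_d.
omega = X.T @ y
d = int(n / np.log(n))                # conservative choice d = [n / log n]
M_d = np.argsort(np.abs(omega))[::-1][:d]

print("screened submodel size:", d)
print("important variables retained:", sorted(set([10, 200, 3000]) & set(M_d.tolist())))
```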
A three-stage screening and variable selection procedure was proposed by Wasserman and Roeder74 for the case log p = O(n^{c2}), 0 ≤ c2 < 1. At stage 1, a suite of candidate models, each indexed by a tuning parameter λ, is fitted; the candidate models considered in the article can come from the Lasso with regularization parameter λ, forward stepwise regression after λ steps, or marginal regression with a threshold λ on the magnitudes of the regression coefficients. At stage 2, one model is selected from the candidates through cross-validation. At stage 3, the model is further cleaned by eliminating some variables through hypothesis testing. Theoretical properties such as selection consistency are established there.

For the ultra high dimensional case with log(p) = O(n^ν), 0 < ν < 1, Huang et al.75 showed the oracle property for the adaptive Lasso, although their result requires a consistent initial estimator, which is often unavailable in ultra high dimensional problems. Kim et al.41 revisited the SCAD procedure and established its oracle property in the ultra high dimensional case with log(p) = O(n).

CONCLUSION

In this article we have focused on two approaches to variable selection in linear models: the classical approach unified under the GIC and the regularization approach unified under the PLS. For simplicity, we have assumed that the random errors in model Eq. (1) follow a normal distribution. We note that many methods only require the weaker assumption that the random errors are independent and identically distributed with zero mean and a finite variance. Many important approaches are not reviewed due to space limitations. For example, the nonnegative garrote76 and the Dantzig selector77 are both popular procedures for high dimensional variable selection, but they do not quite fit under the PLS framework and thus are not described in detail here. More references about them and other approaches can be found in the further reading section.

REFERENCES

1. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory, vol. 1. Budapest: Akademiai Kiado; 1973, 267–281.
2. Schwarz G. Estimating the dimension of a model. Ann Stat 1978, 6:461–464.
3. Mallows CL. Some comments on Cp. Technometrics 1973, 15:661–675.
4. Foster DP, George EI. The risk inflation criterion for multiple regression. Ann Stat 1994, 22:1947–1975.
5. Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Stat Sinica 2010, 20:101–148.
6. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 2001, 96:1348–1360.
7. Chen J, Chen Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 2008, 95:759–771.
8. Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. J R Stat Soc Ser B 2009, 71:671–683.
9. Zhang Y, Shen X. Model selection procedure for high-dimensional data. Stat Anal Data Min 2010, 3:350–358.
10. Bühlmann P, Van de Geer SA. Statistics for High-Dimensional Data. New York: Springer; 2011.
11. Seber GAF, Lee AJ. Linear Regression Analysis. New York: Wiley; 2003.
12. Yang Y. Model selection for nonparametric regression. Stat Sinica 1999, 9:475–499.
13. Shao J. An asymptotic theory for linear model selection (with discussion). Stat Sinica 1997, 7:221–264.
14. Kim Y, Kwon S, Choi H. Consistent model selection criteria on high dimensions. J Mach Learn Res 2012, 13:1037–1057.
15. Huo X, Ni XS. When do stepwise algorithms meet subset selection criteria? Ann Stat 2007, 35:870–887.
16. Zhao P, Yu B. On model selection consistency of Lasso. J Mach Learn Res 2006, 7:2541–2563.
17. Broman KW, Speed TP. A model selection approach for the identification of quantitative trait loci in experimental crosses. J R Stat Soc Ser B 2002, 64:641–656.
18. Casella G, Girón FJ, Martínez ML, Moreno E. Consistency of Bayesian procedures for variable selection. Ann Stat 2009, 37:1207–1228.
19. Abramovich F, Grinshtein V. Model selection in Gaussian regression for high-dimensional data. In: Inverse Problems and High-Dimensional Estimation, vol. 203. Berlin: Springer; 2011, 159–170.
20. Breiman L. Heuristics of instability and stabilization in model selection. Ann Stat 1996, 24:2350–2383.
21. Frank LE, Friedman JH. A statistical view of some chemometrics regression tools. Technometrics 1993, 35:109–135.
22. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1970, 12:55–67.
23. Knight K, Fu W. Asymptotics for Lasso-type estimators. Ann Stat 2000, 28:1356–1378.
24. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York: Springer; 2002.
25. Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B 1996, 58:267–288.
26. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 2009, 37:1705–1732.
27. Huang J, Horowitz JL, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Stat 2008, 36:587–613.
28. Leng C, Lin Y, Wahba G. A note on the Lasso and related procedures in model selection. Stat Sinica 2006, 16:1273–1284.
29. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the Lasso. Ann Stat 2006, 34:1436–1462.
30. Van de Geer SA. High-dimensional generalized linear models and the Lasso. Ann Stat 2008, 36:614–645.
31. Wainwright M. Sharp thresholds for noisy and high-dimensional recovery of sparsity using L1-constrained quadratic programming (Lasso). IEEE Trans Inform Theory 2009, 55:2183–2202.
32. Zhang CH, Huang J. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann Stat 2008, 36:1567–1594.
33. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression (with discussion). Ann Stat 2004, 32:407–499.
34. Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. Ann Appl Stat 2007, 1:302–332.
35. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010, 33:1–22.
36. Fu W. Penalized regressions: the bridge vs. the Lasso. J Comput Graph Stat 1998, 7:397–416.
37. Osborne M, Presnell B, Turlach B. A new approach to variable selection in least squares problems. IMA J Numer Anal 2000, 20:389–404.
38. Wu T, Lange K. Coordinate descent procedures for Lasso penalized regression. Ann Appl Stat 2008, 2:224–244.
39. Fan J, Lv J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Trans Inform Theory 2011, 57:5467–5484.
40. Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann Stat 2004, 32:928–961.
41. Kim Y, Choi H, Oh HS. Smoothly clipped absolute deviation on high dimensions. J Am Stat Assoc 2008, 103:1665–1673.
42. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Ann Stat 2008, 36:1509–1533.
43. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Stat 2010, 38:894–942.
44. Zou H. The adaptive Lasso and its oracle properties. J Am Stat Assoc 2006, 101:1418–1429.
45. Huang J, Ma S, Zhang CH. Adaptive Lasso for sparse high-dimensional regression models. Stat Sinica 2008, 18:1603–1618.
46. Dicker L, Huang B, Lin X. Variable selection and estimation with the seamless-L0 penalty. Stat Sinica 2012, 23:929–962.
47. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B 2005, 67:301–320.
48. Yuan M, Lin Y. On the non-negative garrotte estimator. J R Stat Soc Ser B 2007, 69:143–161.
49. Jia J, Yu B. On model selection consistency of the elastic net when p ≫ n. Stat Sinica 2010, 20:595–611.
50. Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Ann Stat 2009, 37:1733–1751.
51. Liu Y, Wu Y. Variable selection via a combination of the L0 and L1 penalties. J Comput Graph Stat 2007, 16:782–798.
52. Wu S, Shen X, Geyer CJ. Adaptive regularization using the entire solution surface. Biometrika 2009, 96:513–527.
53. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 2006, 68:49–67.
54. Bach F. Consistency of the group Lasso and multiple kernel learning. J Mach Learn Res 2008, 9:1179–1225.
55. Nardi Y, Rinaldo A. On the asymptotic properties of the group Lasso estimator for linear models. Electron J Stat 2008, 2:605–633.
56. Kim Y, Kim J, Kim Y. Blockwise sparse regression. Stat Sinica 2006, 16:375–390.
57. Wang L, Chen G, Li H. Group SCAD regression analysis for microarray time course gene expression. Bioinformatics 2007, 23:1486–1494.
58. Zhao P, Rocha G, Yu B. Grouped and hierarchical model selection through composite absolute penalties. Ann Stat 2009, 37:3468–3497.
59. Huang J, Ma S, Xie H, Zhang CH. A group bridge approach for variable selection. Biometrika 2009, 96:339–355.
60. Breheny P, Huang J. Penalized methods for bi-level variable selection. Stat Interface 2009, 2:369–380.
61. Friedman J, Hastie T, Tibshirani R. A note on the group Lasso and a sparse group Lasso. Technical report, Department of Statistics, Stanford University; 2010.
62. Zhou N, Zhu J. Group variable selection via a hierarchical Lasso and its oracle property. Stat Interface 2010, 3:557–574.
63. Geng Z. Group variable selection via convex Log-Exp-Sum penalty with application to a breast cancer survivor study. PhD thesis, University of Wisconsin; 2013.
64. Huang J, Breheny P, Ma S. A selective review of group selection in high-dimensional models. Stat Sci 2012, 27:481–499.
65. Nishii R. Asymptotic properties of criteria for selection of variables in multiple regression. Ann Stat 1984, 12:758–765.
66. Efron B. The estimation of prediction error: covariance penalties and cross-validation (with discussion). J Am Stat Assoc 2004, 99:619–632.
67. Sklar JC, Wu J, Meiring W, Wang Y. Non-parametric regression with basis selection from multiple libraries. Technometrics 2013, 55:189–201.
68. Ye JM. On measuring and correcting the effects of data mining and model selection. J Am Stat Assoc 1998, 93:120–131.
69. Zou H, Hastie T, Tibshirani R. On the "degrees of freedom" of the Lasso. Ann Stat 2007, 35:2173–2192.
70. Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 2007, 94:553–558.
71. Tibshirani R, Knight K. The covariance inflation criterion for adaptive model selection. J R Stat Soc Ser B 1999, 61:529–546.
72. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B 2008, 70:849–911.
73. Huang H. Forward regression for ultra-high dimensional variable screening. J Am Stat Assoc 2009, 104:1512–1524.
74. Wasserman L, Roeder K. High-dimensional variable selection. Ann Stat 2009, 37:2178–2201.
75. Huang J, Ma S, Zhang CH. The iterated Lasso for high-dimensional logistic regression. Technical Report No. 392, Department of Statistics and Actuarial Science, The University of Iowa; 2008.
76. Breiman L. Better subset regression using the nonnegative garrote. Technometrics 1995, 37:373–384.
77. Candes E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 2007, 35:2313–2351.

FURTHER READING

Baraud Y, Giraud C, Huet S. Gaussian model selection with an unknown variance. Ann Stat 2009, 37:630–672.
Birgé L, Massart P. Minimal penalties for Gaussian model selection. Probab Theory Rel Fields 2007, 138:33–73.
George EI, McCulloch RE. Approaches for Bayesian variable selection. Stat Sinica 1997, 7:339–373.
James GM, Radchenko P, Lv J. DASSO: connections between the Dantzig selector and Lasso. J R Stat Soc Ser B 2009, 71:127–142.
McQuarrie ADR, Tsai CL. Regression and Time Series Model Selection. River Edge: World Scientific Publishing; 1998.
Miller A. Subset Selection in Regression. Boca Raton, FL: Chapman & Hall/CRC; 2002.
O'Hara RB, Sillanpää MJ. A review of Bayesian variable selection methods: what, how and which. Bayesian Anal 2009, 4:85–118.