A Regularization Approach to Nonlinear Variable Selection
L. Rosasco
CBCL, McGovern Institute, MIT, 43 Vassar Street, Cambridge, MA 02139, United States

M. Santoro, S. Mosci, A. Verri
DISI, Università di Genova, via Dodecaneso 35, 16146 Genova, Italy

S. Villa
DIMA, Università di Genova, via Dodecaneso 35, 16146 Genova, Italy
Abstract
In this paper we consider a regularization approach to variable selection when the regression function depends nonlinearly on a few input variables. The proposed method is based
on a regularized least-squares estimator penalizing large values of the partial derivatives.
An efficient iterative procedure is proposed
to solve the underlying variational problem,
and its convergence is proved. The empirical properties of the obtained estimator are
tested both for prediction and variable selection. The algorithm compares favorably to
more standard ridge regression and ℓ1 regularization schemes.
1 Introduction
In this work we are interested in the variable selection problem within the supervised learning framework, when the input-output dependence is described
by a nonlinear model.
Given a set of input-output pairs z = (xi , yi )ni=1 sampled i.i.d. with respect to an unknown probability
measure ρ, the task of supervised learning is to estimate the regression function fρ (x) = E[y|x]. The
output space Y can be either a subset of R or simply {+1, −1} in binary classification. In general, the
input space is X ⊂ Rp . The problem we have in mind
is motivated by the assumption that fρ depends on
d ≪ p variables only. The key question is how to turn
this assumption into a learning algorithm. We study
this problem in the context of regularization where –
given the empirical risk E – one aims at designing a
regularizer (penalty) R, so that
$$\operatorname*{argmin}_{f \in \mathcal{H}} \{\mathcal{E}(f) + \tau R(f)\}$$
provides us with an estimator that is a (nonlinear) function of a few variables. Towards this end we propose a new penalty which enforces the partial derivatives of the learned estimator to be small in an appropriate sense. The size of the partial derivatives of
the obtained estimator indicates which variables can
be discarded.
The variational problem corresponding to this new
functional cannot be solved using simple gradient descent methods. To this end we develop an iterative projection procedure based on a forward-backward splitting approach (Daubechies et al., 2007;
Lions and Mercier, 1979; Combettes and Wajs, 2005;
Hale et al., 2008; Rosasco et al., 2009). Convergence
of the method can be proved and the entire regularization path for the algorithm can be efficiently computed
using a simple continuation method (Hale et al., 2008).
The performance of the proposed algorithm is assessed
on toy data as well as on a benchmark data set.
The problem of variable selection has recently received
considerable attention, motivated by high dimensional
learning problems such as microarray data analysis
(Golub et al., 1999). The last few years witnessed
substantial progress in the case where the regression
function is described by a linear model – see Lal et al.
(2006) for a survey – and regularization with a sparsity
penalty (ℓ1 and its variations) proved to be a suitable
strategy (Tibshirani, 1996; Efron et al., 2004). In the
case of nonlinear models the situation is much less understood. A number of authors consider a dictionary
of nonlinear features and use a sparsity-based algorithm to select the relevant elements of the dictionary.
The simplest of such approaches is that of considering
a generalized linear model where the regression function is assumed to be a linear combination of nonlinear functions, each one depending on a single variable
only, see for example Ravikumar et al. (2008). A more
advanced approach would be to consider dictionaries
encoding more complex interactions among the variables (Lin and Zhang, 2006). For example, one could
consider the features defined by a second-degree polynomial kernel. The shortcoming of this approach is
that the size of the dictionary grows more than exponentially as one considers high order interactions.
Recently, Bach (2008) showed that it is still possible
to learn with such dictionaries if the subsets of variables that can be selected (the sparsity patterns) are
suitably restricted (e.g. the atoms of the dictionary
can be embedded into a directed acyclic graph). Our
approach differs from these latter works since we do
not try to design dictionaries encoding variables interactions but we use partial derivatives to derive a new
regularizer that induces a different form of sparsity.
We recall that a number of heuristics for nonlinear
feature selection are surveyed by Guyon et al. (2006).
We also mention a series of related works considering
the problem of nonlinear dimensionality reduction, either in the supervised setting (see Wu et al. (2008)
and references therein) or in the unsupervised setting
– see KPCA (Schölkopf and Smola, 2002) and manifold learning (Belkin and Niyogi, 2008). More recently,
a method based on estimating the gradient of the regression function was proposed by Mukherjee and Zhou
(2006). We note that the problem of subset selection
that we consider in this paper is somewhat different,
because rather than extracting new features we just
aim at selecting the most relevant variables. Indeed,
in all the dimensionality reduction methods, variable
selection is not embedded in the training phase and
can be achieved only as a postprocessing.
The paper is organized as follows. In Section 2
we present a new regularized functional, where the
penalty is based on partial derivatives, and propose
an iterative algorithm for finding its minimizer. In
Section 3 we describe the derivation of the proposed
algorithm. Finally in Section 4 the proposed scheme is
empirically evaluated and compared with benchmark
subset selection techniques as well as with standard ridge regression on synthetic and real data.
Notation: In the following, we will use lower indices
to indicate different elements of a space – e.g. xi ∈ X
– and upper indices to indicate different components
of a vector – e.g. x = (x^j)_{j=1}^p ∈ X. We use indices i, s ∈ {1, . . . , n} for the examples and j, l ∈ {1, . . . , p}
for coordinates.
2 Regularization for Nonlinear Subset Selection

In the following we study the minimization problem
$$f_\tau = \operatorname*{argmin}_{f \in \mathcal{H}}\ \frac{1}{n}\sum_{i=1}^n \bigl(y_i - f(x_i)\bigr)^2 + 2\tau \sum_{j=1}^p \sqrt{\frac{1}{n}\sum_{i=1}^n |\partial_j f(x_i)|^2}, \qquad (1)$$
where ∂_j f(x) denotes the j-th partial derivative of f at
some point x and the hypotheses space, H, is assumed
to be a reproducing kernel Hilbert space of smooth
functions. Recall that our primary goals are to estimate the true underlying fρ as well as to detect the
subset of relevant variables, given the training set z.
The above penalty searches for an estimator for which
each partial derivative is small when evaluated on most
points of the training set. This choice is suggested by
the fact that, if a function does not depend on a variable, the corresponding partial derivative will be zero
(or small) for all (or most) points, and such a variable can be considered irrelevant. In theory, a natural
choice in order to enforce the j th derivative to be small
“on average” would be to penalize the L2 (X, ρX ) norm
$$\|\partial_j f\|_\rho = \sqrt{\int_X |\partial_j f(x)|^2 \, d\rho_X}, \qquad (2)$$
where ρX is the probability distribution of the input
points. Since in practice ρX is unknown, and we only
have an i.i.d. sample x = (x1 , . . . , xn ), a feasible alternative to (2) is its empirical counterpart, which is the
one used in (1). To build a regularizer we then have
to consider the different partial derivatives at once,
and different regularizers can be obtained considering
different norms for the vector $\vec{R}(f) \in \mathbb{R}^p$, whose j-th element is
$$\vec{R}_j(f) = \sqrt{\frac{1}{n}\sum_{i=1}^n |\partial_j f(x_i)|^2}.$$
The most natural choice is probably
$$R_0(f) = \|\vec{R}(f)\|_0 = \#\{\, j \mid \vec{R}_j(f) \neq 0 \,\}.$$
When considering linear functions, H = {f : f(x) = β · x, β ∈ R^p}, the above penalty is the so-called ℓ0 norm, ‖β‖_0, which counts the number of non-zero coefficients. It is well known that there are no efficient algorithms to minimize E + τ R0, and ℓ1 regularization – i.e. using the ℓ1 norm ‖β‖_1 – is an efficient convex relaxation of this problem. Reasoning along the same line we replace R0 with
$$R_1(f) = \|\vec{R}(f)\|_1 = \sum_{j=1}^p |\vec{R}_j(f)|,$$
so that the functional in (1) is given by E + 2τ R1. Note that, indeed, the proposed penalty reduces to ℓ1 regularization in the case of linear models. Though convexity makes the corresponding problem more tractable, the solution of the underlying variational problem is not straightforward due to nonsmoothness, and we propose an efficient optimization procedure in the next section.
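To make the reduction to ℓ1 regularization explicit, here is the one-line check behind the claim above (using the definition of $\vec{R}_j$): for a linear function every partial derivative is a constant coefficient, so each term of the penalty is just its absolute value.

$$f(x) = \beta \cdot x \ \Longrightarrow\ \partial_j f(x_i) = \beta^j \ \ \forall i, \qquad \vec{R}_j(f) = \sqrt{\frac{1}{n}\sum_{i=1}^n |\beta^j|^2} = |\beta^j|, \qquad R_1(f) = \sum_{j=1}^p |\beta^j| = \|\beta\|_1.$$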
2.1 An Iterative Projection Algorithm
We will show (see the beginning of Subsec. 3.3) that
if H is a RKHS of sufficiently smooth functions, a
straightforward generalization of the representer theorem ensures that a generic solution of problem (1) can
be conveniently written as a linear combination:
$$f_\tau(x) = \frac{1}{n}\sum_{i=1}^n \alpha_i\, k(x_i, x) + \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^p \beta_i^j\, z_{j,x_i}(x), \qquad (3)$$
where α, β^j ∈ R^n for all j = 1, . . . , p, and
$$z_{j,x_i}(x') = \left.\frac{\partial k(x', x)}{\partial x^j}\right|_{x = x_i}. \qquad (4)$$
In order to compute the coefficients, let us introduce the two vectors β = (β^1, . . . , β^p) and v = (v^1, . . . , v^p), where v^j ∈ R^n for all j = 1, . . . , p. Also, let us define the matrices K, Z_j, L_j as
$$(K)_{i,s} = \frac{1}{n}\, k(x_i, x_s), \qquad (Z_j)_{i,s} = \frac{1}{n}\, z_{j,x_i}(x_s), \qquad (L_j \beta)_i = \sum_{l=1}^p \sum_{s=1}^n L_{j,l}(x_i, x_s)\, \beta_s^l,$$
where
$$L_{j,l}(x_i, x_s) = \left.\frac{1}{n^2}\, \frac{\partial^2 k(x, x')}{\partial x^j\, \partial x'^l}\right|_{x = x_i,\ x' = x_s}.$$
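As a concrete illustration, here is a NumPy sketch assembling these matrices for a Gaussian kernel (the kernel choice, the function name, and the array layouts are our assumptions; the paper's own code was a Matlab prototype):

```python
import numpy as np

def gaussian_kernel_blocks(X, gamma=1.0):
    """Assemble K, Z_j and the second-derivative block L_{j,l} for the Gaussian
    kernel k(x, x') = exp(-||x - x'||^2 / (2 * gamma**2)).  Returned layouts:
      K : (n, n)        K[i, s]       = k(x_i, x_s) / n
      Z : (p, n, n)     Z[j, i, s]    = z_{j, x_i}(x_s) / n
      L : (p, n, p, n)  L[j, i, l, s] = (1/n^2) d^2 k / (dx^j dx'^l) at (x_i, x_s)."""
    n, p = X.shape
    diff = X[:, None, :] - X[None, :, :]                   # diff[i, s, :] = x_i - x_s
    k = np.exp(-(diff ** 2).sum(-1) / (2 * gamma ** 2))    # k(x_i, x_s)

    K = k / n

    # z_{j, x_i}(x_s) = d k(x', x)/d x^j at x = x_i, x' = x_s
    #                 = k(x_s, x_i) * (x_s^j - x_i^j) / gamma^2 for the Gaussian kernel
    Z = np.transpose(-diff, (2, 0, 1)) * k / (n * gamma ** 2)

    # d^2 k / (dx^j dx'^l) = k(x, x') * (delta_{jl}/gamma^2 - diff_j * diff_l / gamma^4)
    outer = diff[:, :, :, None] * diff[:, :, None, :]      # indexed [i, s, j, l]
    second = k[:, :, None, None] * (np.eye(p) / gamma ** 2 - outer / gamma ** 4)
    L = np.transpose(second, (2, 0, 3, 1)) / n ** 2        # reorder to [j, i, l, s]

    return K, Z, L
```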
As shown in the next section, the coefficients can be
computed using the iterative Algorithm 1.
Algorithm 1 Iterative Projection Algorithm
  Require: σ, η, τ > 0
  Initialize: α_0 = 0, β_0 = 0, t = 0
  while convergence not reached do
      t := t + 1
      α_t = α_{t−1} − (2σ)^{−1} (K α_{t−1} + Z β_{t−1} − y)
      set v_0 = 0, q = 0
      for j = 1, . . . , p do
          while convergence not reached do
              v^j_{q+1} = [ v^j_q − η ( L_j (v_q − (σ/τ) β_{t−1}) − (σ/τ) Z_j α_t ) ] / [ 1 + η ‖ L_j (v_q − (σ/τ) β_{t−1}) − (σ/τ) Z_j α_t ‖_n ]
              q := q + 1
          end while
          set v̄^j_t = v^j_q
      end for
      β_t = β_{t−1} − (τ/σ) v̄_t
  end while
  return (α_t, β_t)
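A minimal NumPy sketch of these updates is given below. The matrices are treated as precomputed linear maps (their assembly depends on the kernel and on the normalization conventions of Section 2.1); variable names, the stacked layout of β and v, and the simple norm-based stopping tests are ours, not the paper's:

```python
import numpy as np

def iterative_projection(K, Z, Zj, Lj, y, sigma, eta, tau,
                         tol=1e-6, max_outer=500, max_inner=100):
    """Sketch of the updates of Algorithm 1.  Assumed shapes:
      K  : (n, n),   Z : (n, n*p),   Zj : (p, n, n),   Lj : (p, n, n*p).
    beta and v are stored as stacked vectors of length n*p, block j occupying
    positions [j*n, (j+1)*n)."""
    n, p = y.shape[0], Zj.shape[0]
    alpha, beta = np.zeros(n), np.zeros(n * p)
    for _ in range(max_outer):
        # forward (gradient) step on the data term
        alpha_new = alpha - (K @ alpha + Z @ beta - y) / (2 * sigma)
        # inner loop: projection coefficients v_bar, computed component-wise
        v = np.zeros(n * p)
        for _ in range(max_inner):
            v_new = np.empty_like(v)
            for j in range(p):
                r = Lj[j] @ (v - (sigma / tau) * beta) - (sigma / tau) * (Zj[j] @ alpha_new)
                v_new[j * n:(j + 1) * n] = (v[j * n:(j + 1) * n] - eta * r) \
                    / (1 + eta * np.sqrt((r ** 2).mean()))    # ||.||_n in the denominator
            if np.linalg.norm(v_new - v) < tol:
                v = v_new
                break
            v = v_new
        # backward step and convergence check
        beta_new = beta - (tau / sigma) * v
        if np.linalg.norm(alpha_new - alpha) + np.linalg.norm(beta_new - beta) < tol:
            return alpha_new, beta_new
        alpha, beta = alpha_new, beta_new
    return alpha, beta
```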
Before describing the derivation of this algorithm and
discussing its convergence properties, we add several
remarks. First, the above algorithm follows recent
works studying optimization procedures for ℓ1 -based
regularization and related algorithms – see Hale et al.
(2008), Daubechies et al. (2007), Figueiredo et al.
(2007) – and more general learning schemes – see for
example Rosasco et al. (2009).
Second, the proposed procedure requires the choice of
an appropriate stopping rule, which will be discussed
later, and of the step sizes σ, η. Simple a priori choices
ensuring convergence are discussed in the following
section and are the ones we used in our experiments. We expect more sophisticated step-size choices to considerably speed up the iteration, at the price of a more
We expect more sophisticated step size choices to considerably speed up the iteration, at the price of a more
complicated convergence analysis, which is outside the
scope of the paper.
Third, while computing solutions corresponding to different regularization parameters we use the continuation method suggested by Hale et al. (2008). Starting from a large regularization parameter value, the
obtained solution is used to initialize the algorithm
for the next smaller regularization parameter value
and the same strategy is iterated for all the other
parameter values. We found this warm restart procedure to greatly improve the computational requirement needed to calculate the whole regularization
path.
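In code, the continuation strategy amounts to a simple loop over a decreasing grid of regularization parameters; the sketch below is ours, with the callback standing for one run of Algorithm 1 from a given initialization:

```python
import numpy as np

def regularization_path(run_algorithm, taus):
    """Continuation / warm-restart strategy: solve for a decreasing sequence of
    regularization parameters, initializing each run with the previous solution.
    run_algorithm(tau, alpha0, beta0) is assumed to execute Algorithm 1 from the
    given initialization (None meaning the zero initialization) and return (alpha, beta)."""
    path = []
    alpha, beta = None, None
    for tau in np.sort(np.asarray(taus, dtype=float))[::-1]:   # largest tau first
        alpha, beta = run_algorithm(tau, alpha, beta)
        path.append((tau, alpha, beta))
    return path
```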
Finally, a critical issue is related to the choice of a
suitable selection criterion that enables one to identify
the variables that are more relevant to the specific regression problem. In fact, while in standard ℓ1 -based
regularization the relevant variables correspond to the
non-zero entries of the regression coefficients, with our approach this is no longer the case. Nevertheless, a natural
way to select the relevant variables is to set a threshold on the size of the partial derivatives evaluated on
the training set points, that is:
$$\frac{1}{n}\sum_{i=1}^n |\partial_j f_\tau(x_i)|^2 = \|Z_j \alpha + L_j \beta\|_n^2, \qquad j = 1, \ldots, p. \qquad (5)$$
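A sketch of the resulting selection rule, with the regularization parameter τ used as the threshold (as discussed in Section 4) and the array layout of the Algorithm 1 sketch above:

```python
import numpy as np

def selected_variables(alpha, beta, Zj, Lj, threshold):
    """Selection rule based on (5): keep variable j when the squared empirical norm
    of the j-th partial derivative of the estimator exceeds the threshold.
    Assumed layout: Zj is (p, n, n), Lj is (p, n, n*p), beta has length n*p."""
    scores = np.array([((Zj[j] @ alpha + Lj[j] @ beta) ** 2).mean()   # ||Z_j a + L_j b||_n^2
                       for j in range(Zj.shape[0])])
    return np.flatnonzero(scores > threshold), scores
```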
3 Derivation of the Iterative Projection Procedure

In this section we derive the procedure proposed in the previous section. We start by rewriting problem (1) in a more convenient way, allowing for a useful characterization of the regularized solution fτ.

3.1 RKHS and derivatives

A key observation is that partial derivatives have a particularly useful representation if we choose the hypotheses space to be a RKHS. In fact, if the kernel is sufficiently smooth it is possible to prove (Zhou, 2008) that z_{j,x} ∈ H for all x ∈ X, where z_{j,x} is defined in (4), and the following reproducing property for derivatives, which will be the main tool towards deriving Algorithm 1, holds:
$$\partial_j f(x_0) = \langle f, z_{j,x_0} \rangle_{\mathcal{H}}.$$
In the following it will be useful to view partial derivatives and gradients as linear operators. Consider R^n with the standard inner product normalized by a factor 1/n, denoted by ⟨·,·⟩_n, and the corresponding norm ‖·‖_n. We define D̂_j : H → R^n as
$$\hat{D}_j f = \bigl((\partial_j f)(x_1), \ldots, (\partial_j f)(x_n)\bigr),$$
and its adjoint D̂_j^* : R^n → H as
$$\hat{D}_j^* v = \frac{1}{n}\sum_{i=1}^n v_i\, z_{j,x_i}.$$
It is also useful to view the gradient of f as an operator. Let R^{np} be p times the Cartesian product of R^n, so that if v = (v^1, . . . , v^p) and w = (w^1, . . . , w^p) belong to R^{np} then ⟨v, w⟩_{R^{np}} = Σ_{j=1}^p ⟨v^j, w^j⟩_n. The empirical gradient ∇̂ : H → R^{np} and its adjoint ∇̂^* : R^{np} → H are defined by
$$\hat{\nabla} f = \bigl(\hat{D}_j f\bigr)_{j=1}^p \quad\text{and}\quad \hat{\nabla}^* v = \sum_{j=1}^p \hat{D}_j^* v^j,$$
respectively. Note that ∇̂f can be viewed as a p × n matrix, so that R_1(f) = Σ_{j=1}^p ‖D̂_j f‖_n can be seen as the (2,1)-norm of the gradient, obtained by summing up the Euclidean norms of its rows.

For convenience, let us introduce the sampling operator S_n : H → R^n and its adjoint S_n^*, defined by S_n(f) = (f(x_1), . . . , f(x_n)) and S_n^* v = (1/n) Σ_{i=1}^n v_i k(x_i, ·), respectively. The kernel matrix is simply K = S_n S_n^* (De Vito et al., 2005). Using the above notation, solving problem (1) amounts to finding
$$f_\tau = \operatorname*{argmin}_{f \in \mathcal{H}}\ \underbrace{\|S_n f - y\|_n^2}_{\mathcal{E}(f)} \;+\; 2\tau\, \underbrace{\sum_{j=1}^p \|\hat{D}_j f\|_n}_{R_1(f)}. \qquad (6)$$

3.2 Iterative Projected Gradient

The functional under study is not differentiable, and standard minimization techniques cannot be used. Minimizing non-differentiable objective functions is, however, quite common in sparse approximation, and a number of authors have already proposed suitable solutions. A popular technique employs proximity operators, a powerful generalization of the notion of projection operators. Following the approach proposed by Combettes and Wajs (2005), in a previous work – focused on structured sparsity for linear variable selection – we proposed an optimization framework consisting of an iterative projection, based on subgradient calculations, to solve an appropriate fixed point equation satisfied by fτ; see Rosasco et al. (2009) for details. Indeed, both Theorem 1 below, which represents the theoretical basis for Algorithm 1, and equation (11) generalize analogous results derived in that paper.

Theorem 1. Given σ > 0, and assuming E + 2τ R_1 to be strictly convex, fτ is the unique solution of the fixed point equation
$$f = \Bigl(I - P_{\frac{\tau}{\sigma}C}\Bigr)\Bigl(f - (2\sigma)^{-1} S_n^*(S_n f - y)\Bigr), \qquad (7)$$
where P_C is a suitable projection defined below.

The solution of the above problem can be calculated by the following iteration
$$f_{t+1} = \Bigl(I - P_{\frac{\tau}{\sigma}C}\Bigr)\Bigl(f_t - (2\sigma)^{-1} S_n^*(S_n f_t - y)\Bigr), \qquad (8)$$
with f_0 = 0 (Combettes and Wajs, 2005). Under suitable assumptions, it is possible to prove that
$$\lim_{t \to \infty} \|f_t - f_\tau\|_{\mathcal{H}} = 0, \qquad (9)$$
if σ is appropriately chosen. Following Proposition 1 in Rosasco et al. (2009), it turns out that the best a priori choice of the step size σ depends only on the smallest and biggest eigenvalues of the kernel matrix K.

Remark. Note that in order to have a unique solution, strict convexity of the functional is required.
Since in general this assumption is too restrictive, it
is possible to avoid it by adding a further strictly convex regularization term to R1 . We also remark that
strong convergence holds also without requiring strict
convexity – see Theorem 3.4 in Combettes and Wajs
(2005).
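For readability, iteration (8) can be read as the two steps of a forward-backward scheme, a gradient (forward) step on the data term followed by a projection (backward) step:

$$g_t = f_t - (2\sigma)^{-1} S_n^*(S_n f_t - y), \qquad f_{t+1} = g_t - P_{\frac{\tau}{\sigma}C}\, g_t.$$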
The projection PC is the core of the above algorithm,
and is related to the subgradient of the regularizer R1
at 0. If we let
$$B_{np} = \{\, v = (v^1, \ldots, v^p) \mid \|v^j\|_n \le 1,\ j = 1, \ldots, p \,\},$$
then we can use Proposition 2 in Rosasco et al. (2009) to show that
$$C = \partial R_1(0) = \{\, f \in \mathcal{H} \mid f = \hat{\nabla}^* v \text{ with } v \in B_{np} \,\}.$$
Moreover, it is possible to show that ∂R_1(0) is a closed convex subset of H, and the projection P_{(τ/σ)C} of an f ∈ H onto (τ/σ)C is given by
$$P_{\frac{\tau}{\sigma}C}\, f = \frac{\tau}{\sigma}\, \hat{\nabla}^* \bar{v}, \qquad (10)$$
where
$$\bar{v} = \operatorname*{argmin}_{v \in B_{np}} \Bigl\{ \bigl\| f - \tfrac{\tau}{\sigma}\, \hat{\nabla}^* v \bigr\|_{\mathcal{H}}^2 \Bigr\}.$$
Finally, it can be shown that v̄ can be calculated component-wise using the iteration
$$v^j_{q+1} = \frac{v^j_q - \eta\, \hat{D}_j\bigl(\hat{\nabla}^* v_q - \tfrac{\sigma}{\tau} f\bigr)}{1 + \eta\, \bigl\| \hat{D}_j\bigl(\hat{\nabla}^* v_q - \tfrac{\sigma}{\tau} f\bigr) \bigr\|_n}, \qquad (11)$$
with v^j_0 = 0. As long as η = 1/‖∇̂∇̂^*‖ it is possible to prove that
$$\lim_{q \to \infty} \Bigl\| \frac{\tau}{\sigma}\, \hat{\nabla}^* v_q - P_{\frac{\tau}{\sigma}C}\, f \Bigr\|_{\mathcal{H}} = 0. \qquad (12)$$

3.3 Towards an Implementation

In order to implement the above algorithm we need to find a finite dimensional representation of f. Note that from (8) and (10) the minimizer of (6) satisfies fτ ∈ Range(S_n^*) + Range(∇̂^*). Hence it satisfies the following form of the representer theorem
$$f_\tau = S_n^* \alpha + \hat{\nabla}^* \beta = \frac{1}{n}\sum_{i=1}^n \alpha_i\, k(x_i, \cdot) + \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^p \beta_i^j\, z_{j,x_i}, \qquad (13)$$
with α ∈ R^n and β = (β^1, . . . , β^p), where β^j ∈ R^n for j = 1, . . . , p. Note that Z_j, L_j defined in Section 2.1 are the matrices associated to the operators S_n D̂_j^* : R^n → R^n and D̂_j ∇̂^* : R^{np} → R^n, respectively. Moreover, we define Z and L as the matrices associated to the operators S_n ∇̂^* : R^{np} → R^n and ∇̂∇̂^* : R^{np} → R^{np}, respectively. Then we have the following result.

Proposition 1. For f_0 = 0, the solution at step t + 1 is given by
$$f_{t+1} = S_n^* \alpha_{t+1} + \hat{\nabla}^* \beta_{t+1}, \qquad (14)$$
with
$$\alpha_{t+1} = \alpha_t - \frac{1}{2\sigma}(K\alpha_t + Z\beta_t - y) \quad\text{and}\quad \beta_{t+1} = \beta_t - \frac{\tau}{\sigma}\, \bar{v}_{t+1}, \qquad (15)$$
where v̄_{t+1} is such that
$$\frac{\tau}{\sigma}\, \hat{\nabla}^* \bar{v}_{t+1} = P_{\frac{\tau}{\sigma}C}\bigl( S_n^* \alpha_{t+1} + \hat{\nabla}^* \beta_t \bigr). \qquad (16)$$

Proof. We proceed by induction. The base case is clear. Then, by the inductive hypothesis we have that f_t = S_n^* α_t + ∇̂^* β_t. Therefore, using (8) it follows that f_{t+1} can be expressed as
$$\Bigl(I - P_{\frac{\tau}{\sigma}C}\Bigr)\bigl( S_n^* \alpha_t - (2\sigma)^{-1}(K\alpha_t + Z\beta_t - y) + \hat{\nabla}^* \beta_t \bigr),$$
and the proposition is proved, letting α_{t+1}, β_{t+1} and v̄_{t+1} be as in Equations (14), (15) and (16).

The following result shows how to explicitly calculate the projection.

Proposition 2. Let η = 1/‖L‖. If f ∈ H can be written as S_n^* α + ∇̂^* β, then the projection P_{(τ/σ)C} f is given by (τ/σ) ∇̂^* v̄, where v̄ can be calculated starting from v_0 = 0 and using the following iteration
$$v^j_{q+1} = \frac{v^j_q - \eta\,\Bigl( L_j\bigl(v_q - \tfrac{\sigma}{\tau}\beta\bigr) - \tfrac{\sigma}{\tau} Z_j \alpha \Bigr)}{1 + \eta\,\Bigl\| L_j\bigl(v_q - \tfrac{\sigma}{\tau}\beta\bigr) - \tfrac{\sigma}{\tau} Z_j \alpha \Bigr\|_n}. \qquad (17)$$

Proof. In order to get (17) we can plug the representation (13) into (11) and use the fact that D̂_j S_n^* = S_n D̂_j^* = Z_j.

In practical implementations, Algorithm 1 requires the definition of a suitable stopping rule. Equations (9) and (12) suggest the choice
$$\|f_t - f_{t-1}\|_{\mathcal{H}} \le \mathrm{tol} \quad\text{and}\quad \|\hat{\nabla}^*(v_q - v_{q-1})\|_{\mathcal{H}} \le \mathrm{tol}.$$
Note that these quantities are computable, since
$$\|f_t - f_{t-1}\|_{\mathcal{H}}^2 = \|S_n^* \delta\alpha + \hat{\nabla}^* \delta\beta\|_{\mathcal{H}}^2 = \langle \delta\alpha, K\delta\alpha \rangle_n + 2\, \langle \delta\alpha, Z\delta\beta \rangle_n + \langle \delta\beta, L\delta\beta \rangle_n,$$
where δα = αt −αt−1 and δβ = βt −βt−1 , and similarly
$$\|\hat{\nabla}^*(v_q - v_{q-1})\|_{\mathcal{H}}^2 = \langle (v_q - v_{q-1}),\, L\,(v_q - v_{q-1}) \rangle_n.$$
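A sketch of this computation (shapes and names are our convention: K is n×n, Z is n×np, L is np×np, with β-type quantities stacked as length-np arrays); the analogous check for the inner iteration is simply d_v @ (L @ d_v) / n:

```python
def rkhs_sq_norm_change(d_alpha, d_beta, K, Z, L):
    """Computable form of ||f_t - f_{t-1}||_H^2 used in the stopping rule, with
    d_alpha = alpha_t - alpha_{t-1} and d_beta = beta_t - beta_{t-1}.  The final
    division by n accounts for the 1/n-normalized inner products <.,.>_n."""
    n = d_alpha.shape[0]
    return (d_alpha @ (K @ d_alpha)
            + 2.0 * d_alpha @ (Z @ d_beta)
            + d_beta @ (L @ d_beta)) / n
```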
4 Experimental Analysis
In this section we present some preliminary experiments conducted in order to assess empirically the effectiveness of the proposed regularization approach on
both synthetic data sets and a standard benchmark
comprising real data. The aims of the experiments
are: (i) to verify that the proposed algorithm selects
the correct subset of relevant variables when the underlying regression function is nonlinear; (ii) to assess
the prediction performance of the estimator learned
with our algorithm, and to compare it with a number
of more standard alternatives for both sparse and non-sparse regression.
We compare Algorithm 1 with: non-sparse kernel-based ridge regression with a Gaussian kernel (denoted G-ridge hereinafter), and sparse ℓ1 regularization on a linear model, both with respect to the input variables (denoted simply ℓ1) and with respect to a nonlinear combination of the input variables (denoted ℓ1feat). Specifically, for ℓ1feat we consider functions which are linear combinations of the terms of a 4th-degree polynomial expansion defined over
the input variables. Note that we do not compare
experimentally our approach with the ones based on
sparse nonlinear additive models proposed by Ravikumar et al. (2008). This is mainly due to the fact that
the additivity assumption required by such models
is too restrictive and is not satisfied by the regression
functions we are interested in (see Subsection 4.1).
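For concreteness, a possible implementation of the ℓ1feat baseline described above, using off-the-shelf tools (library and parameter choices are ours; the paper used a Matlab prototype):

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

def l1_feat_baseline(X_train, y_train, reg):
    """Sketch of the l1-feat baseline: l1 (lasso) regression on the terms of a
    4th-degree polynomial expansion of the input variables."""
    expand = PolynomialFeatures(degree=4, include_bias=False)
    model = Lasso(alpha=reg, max_iter=10000)
    model.fit(expand.fit_transform(X_train), y_train)
    return model, expand
```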
As for the actual deployment of Algorithm 1, we considered two different settings. In the first we use the algorithm as described so far (this setting is denoted Alg1). In the second, we rely on a
is denoted Alg1). Furthermore, we decided to rely on a
fairly standard approach based on a double optimization – see Candès and Tao (2005), De Mol et al. (2009)
for ℓ1 and ℓ1 -ℓ2 regularization – where Algorithm 1 is
used for selecting the variables only, and subsequently
a second optimization is performed using the G-ridge
algorithm on the selected variables in order to provide accurate variable selection and good prediction
performance at the same time. This second setting is
referred to as Alg1GR . As pointed out in Section 2,
an important aspect of the algorithm is the choice of a threshold value for the quantity in (5) to discriminate between relevant and irrelevant variables. In the
following experiments we used as threshold the regularization parameter τ , which empirically proved to be
a very effective solution. In order to guarantee a fair
comparison among the different methods, we employ
a common validation protocol for all considered algo-
rithms, based on different data sets for training, validation and test, where the validation set is used for
choosing the value of τ minimizing the error on such a
set, and the test set for estimating the prediction accuracy. All the experiments presented below have been
performed using a prototype Matlab implementation.
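A sketch of this validation protocol (function names are placeholders; fit trains a given method for a fixed τ and predict evaluates it on new inputs):

```python
def select_tau(fit, predict, taus, train, valid):
    """Common validation protocol: for each candidate tau, train on the training
    set and keep the value minimizing the validation error."""
    best_err, best_tau, best_model = None, None, None
    for tau in taus:
        model = fit(tau, train[0], train[1])
        err = float(((predict(model, valid[0]) - valid[1]) ** 2).mean())
        if best_err is None or err < best_err:
            best_err, best_tau, best_model = err, tau, model
    return best_tau, best_model, best_err
```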
4.1 Synthetic Data
In this set of experiments, we generated a number
of synthetic data sets as follows. Given the interval
[−3, 3]6 ⊆ R6 we sampled n = 80 training points randomly drawn from the uniform distribution and two
separate validation and test sets of size m = 200. According to our initial assumption, the output labels
are computed using a noise-corrupted regression function fρ that depends nonlinearly on {x1, x2} only, i.e. y = fρ(x1, x2) + w, where xi denotes the i-th component of the vector x. For each x, w is white noise sampled from a normal distribution with zero mean and variance equal to a small fraction of the average output of fρ over the input domain. Specifically, we focused on the following examples: $f^1 = (x_1^4 - x_1^2)\,(3 + x_2)$; $f^2 = 2(x_1^3 - x_1)(2x_2 - 1)(x_2 + 1) + (x_2^3 - x_2 + 3)$; and $f^3 = -2(2x_1^2 - 1)\, x_2\, e^{-x_1^2 - x_2^2}$.
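A sketch of the data generation just described; the text only specifies that the noise variance is a small fraction of the average output, so noise_frac (and the use of the absolute mean) is our assumption:

```python
import numpy as np

F = {  # the three regression functions of Subsection 4.1 (they depend on x^1, x^2 only)
    "f1": lambda x1, x2: (x1**4 - x1**2) * (3 + x2),
    "f2": lambda x1, x2: 2*(x1**3 - x1)*(2*x2 - 1)*(x2 + 1) + (x2**3 - x2 + 3),
    "f3": lambda x1, x2: -2*(2*x1**2 - 1) * x2 * np.exp(-x1**2 - x2**2),
}

def make_dataset(f, n=80, p=6, noise_frac=0.05, seed=None):
    """Inputs uniform on [-3, 3]^p, outputs depending on x^1, x^2 only plus white noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3.0, 3.0, size=(n, p))
    clean = f(X[:, 0], X[:, 1])
    y = clean + rng.normal(0.0, np.sqrt(noise_frac * np.abs(clean).mean()), size=n)
    return X, y
```

For instance, `X, y = make_dataset(F["f1"], seed=0)` produces one training set for the first example.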
In order to evaluate the stability of the results, for each example we ran the five algorithms several times, keeping the test set fixed and letting the training examples vary.
The ability to select correctly the relevant variables has
been assessed by computing true (TP) and false (FP)
positive rates. For lack of space we do not report a table summarizing the results of all the tests. However, the experimental evidence is that the Alg1GR algorithm consistently has a TP rate equal to 1, and an FP rate below 0.05. With Alg1 the FP rate increases since
the optimization procedure leads to selecting a higher
number of variables in order to keep a low prediction
error. The selection performance of ℓ1 is extremely poor, since the algorithm almost always selects all the variables. The ℓ1feat approach shows a high variance in the selection of the variables, and the TP rate is less than 0.8.
The prediction errors of the learned estimators are reported in Table 1. From the results therein it emerges that the ℓ1 algorithm is not appropriate for predicting the output values of the above functions. Furthermore, a linear selection of nonlinear features computed from the original input variables is viable as long as the true regression function actually lies within the span of the basis features (as for f1 and f2); otherwise the performance degrades quite dramatically (as for f3).
Table 1: Average generalization errors associated to the considered algorithms.

  Learning          Regression Function
  Algorithm         f1                 f2                f3
  Alg1              0.030 ± 0.006      0.23 ± 0.06       0.071 ± 0.012
  Alg1GR            0.008 ± 0.003      0.10 ± 0.04       0.003 ± 0.002
  ℓ1                0.096 ± 0.003      3.32 ± 0.07       0.103 ± 0.003
  ℓ1feat            0.026 ± 0.007      0.17 ± 0.04       0.079 ± 0.003
  G-ridge           0.030 ± 0.003      0.21 ± 0.03       0.015 ± 0.003

Table 2: Results obtained with the Boston Housing data set. We report a comparison of the average squared test errors relative to the different algorithms.

  Learning Algorithm    Test Error
  ℓ1                    30 ± 4
  G-ridge               22 ± 5
  Alg1                  23 ± 5
  Alg1GR                21 ± 3

Figure 1: Results obtained with the Boston Housing data set. We report the selection frequency of each variable.
Overall, the best algorithmic approach to nonlinear feature selection is to run Algorithm 1 for selecting the relevant features and then perform a second optimization for estimating the regression function. This clearly leads to better results than those obtained by G-ridge on all the input variables. Finally, in most cases, the prediction performance of Algorithm 1 alone is competitive with the other methods.
4.2 Real Data
In this experiment, we use the standard Boston
Housing data, i.e.
the housing data for 506
census tracts of Boston, available from the
UCI Machine Learning Database Repository:
http://archive.ics.uci.edu/ml/.
Each census tract represents a data-point, described by 13
features, whereas the output is the housing price. In
our experiment, we randomly partition the data into
50 training, 228 validation and 228 test points. For
comparison we use the four algorithms: ℓ1 , G-ridge,
Alg1 and Alg1GR . We perform the experiments
20 times. For each repetition the data points are
centered and normalized with respect to the training
set points. In Figure 1 we report a table with the
average squared test errors of all the tested algorithms
and the selection frequencies for the Alg1GR and ℓ1 .
From Figure 1 we can observe that the most frequently
selected features are the same in both algorithms, that
is average number of rooms per dwelling (RM), pupil-teacher ratio by town (PTRATIO), and % lower status
of the population (LSTAT). Nevertheless we can notice
that Alg1GR presents a larger gap between the features
that are most frequently selected and the other features, indicating that selection performed via Alg1GR
is more stable than via ℓ1 -regularization. A number
of previous studies have analyzed this data set. In
particular Zhang (2009) applies different subset selection techniques and compares their prediction accuracy. Indeed, the prediction errors provided by these
techniques are comparable with our results for ℓ1 regularization and are higher than those achieved by kernel methods, with and without subset selection. This
behavior may suggest that, on this data set, nonlinearity is stronger than sparsity. In fact, G-ridge, which
does not perform subset selection, provides better estimates than ℓ1 regularization, which trades off nonlinearity for sparsity. Only Alg1GR, which combines both approaches, is able to perform a more stable subset selection while maintaining, and even improving, the prediction accuracy of the solution.
5 Conclusions
We have proposed a new regularization scheme for
nonlinear variable selection, promoting estimators
which depend on a small number of the original variables.
To the best of our knowledge this is the first direct
approach to nonlinear variable selection, which on one
hand does not require the use of nonlinear features,
and on the other hand directly promotes sparsity in the
original variables. We described an iterative procedure
to solve the underlying variational problem, whose convergence to the optimal solution is proved.
The initial experimental results on synthetic and real
data are promising. Indeed, they show that our
method outperforms linear subset selection techniques
as well as nonlinear regression methods, both in terms
of prediction and variable selection, since it performs selection like the former while maintaining nonlinearity like the latter.
Acknowledgements
This paper describes a joint research work done at
the Center for Biological and Computational Learning (within the McGovern Institute for Brain Research
at MIT), at the Department of Brain and Cognitive
Sciences (affiliated with the Computer Sciences and
Artificial Intelligence Laboratory), and at the Departments of Computer Science and Mathematics of the
University of Genoa. The authors have been partially supported by the Integrated Project Health-eChild IST-2004-027749 and by grants from DARPA
(IPTO and DSO), National Science Foundation (NSF0640097, NSF-0827427), and Compagnia di San Paolo,
Torino. Additional support was provided by: Adobe,
Honda Research Institute USA, King Abdullah University Science and Technology grant to B. DeVore,
NEC, Sony and especially by the Eugene McDermott
Foundation.
References
F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, Sep 2008.
M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74
(8):1289–1308, 2008.
E. J. Candès and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n.
Annals of Statistics, 35:2313–2351, 2005.
P. L. Combettes and V. R. Wajs. Signal recovery
by proximal forward-backward splitting. Multiscale
Model. Simul., 4(4):1168–1200 (electronic), 2005.
I. Daubechies, G. Teschke, and L. Vese. Iteratively
solving linear inverse problems under general convex
constraints. Inverse Problems and Imaging, 1(1):29–
46, 2007.
C. De Mol, S. Mosci, M. Traskine, and A. Verri. A regularized method for selecting nested groups of relevant genes from microarray data. Journal of Computational Biology, 16(5):677–690, May 2009.
E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone. Learning from examples as
an inverse problem. Journal of Machine Learning
Research, 6:883–904, May 2005.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani.
Least angle regression. Annals of Statistics, 32:407–
499, 2004.
M.A.T. Figueiredo, R.D. Nowak, and S.J. Wright.
Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse
problems. IEEE Journal of Selected Topics in Signal Processing, 2007.
T. Golub, D. Slonim, P. Tamayo, C. Huard,
M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh,
J. Downing, M. Caligiuri, C. Bloomfield, and
E. Lander. Molecular classification of cancer: Class
discovery and class prediction by gene expression
monitoring. Science, 286:531–537, 1999.
I. Guyon, H. Bitter, Z. Ahmed, M. Brown, and
J. Heller. Multivariate Non-Linear Feature Selection with Kernel Methods, pages 313–326. Springer
Berlin / Heidelberg, 2006.
E. T. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for l1-minimization: Methodology and convergence. SIOPT, 19(3):1107–1130, 2008.
T. N. Lal, O. Chapelle, J. Weston, and A. Elisseeff.
Embedded methods. In Feature Extraction: Foundations and Applications,
Studies in Fuzziness and Soft Computing, 207:137–
165, 2006.
Y. Lin and H. H. Zhang. Component selection and
smoothing in multivariate nonparametric regression.
Annals of Statistics, 34(5):2272–2297, 2006.
P.-L. Lions and B. Mercier. Splitting algorithms for the
sum of two nonlinear operators. SIAM J. Numer.
Anal., 16(6):964–979, 1979.
S. Mukherjee and D. Zhou. Learning coordinate covariances via gradients. Journal of Machine Learning Research, 7:319–349, 2006.
P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. SpAM: sparse additive models. In J. C. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems.
2008.
L. Rosasco, S. Mosci, M. Santoro, A. Verri, and
S. Villa. Iterative projection methods for structured
sparsity regularization. Technical Report MITCSAIL-TR-2009-50 / CBCL-282, Massachusetts Institute of Technology, Cambridge, MA, 2009.
B. Schölkopf and A. Smola. Learning with Kernels.
MIT Press, Cambridge (MA), 2002.
R. Tibshirani. Regression shrinkage and selection via
the lasso. Journal of the Royal Statistical Society,
Series B, 58:267–288, 1996.
Q. Wu, F. Liang, and S. Mukherjee. Consistency of
regularized sliced inverse regression for kernel models. Technical report. Duke University and University of Illinois Urbana-Champaign, 2008.
Tong Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Advances in Neural Information Processing Systems,
pages 1921–1928. 2009.
Ding-Xuan Zhou. Derivative reproducing properties
for kernel methods in learning theory. J. Comput.
Appl. Math., 220:456–463, 2008.