Easy Multiple Kernel Learning
1 Introduction
Multiple Kernel Learning (MKL) is a general paradigm used to learn kernels. The kernels computed are (generally linear) combinations of previously defined weak kernels, trained using the available data; see for example [1] for a quite exhaustive survey. In this paper, we focus on MKL with positive linear combination parameters, that is, kernels of the form K = Σ_{r=1}^{R} η_r K_r, with η_r ≥ 0.
to [1], the majority of existing MKL approaches can be divided into the fol-
lowing two categories. Fixed or heuristic rule based techniques typically result
scalable with respect to the number of kernels combined but their effectiveness
will critically depend on the domain at hand. On the other side, optimization
based approaches learn the combination parameters by solving an optimization
problem. Generally speaking, best performing MKL approaches necessitate of
complex optimization (e.g. SDP, QCQP) which will soon exhaust memory when
copying with hundreds of examples and hundreds of weak kernels. Here, we
propose an algorithm able to effectively and efficiently optimize the separation
between positive and negative examples. We empirically demonstrate the effec-
tiveness of the method proposed which is almost always more accurate than the
baselines.
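To fix notation, the following minimal numpy sketch shows the kind of combination considered throughout the paper; the function name and the representation of kernels as precomputed dense matrices are our own choices for illustration, not part of the original text.

```python
import numpy as np

def combine_kernels(weak_kernels, eta):
    """Form K = sum_r eta_r * K_r from a list of precomputed kernel matrices,
    with non-negative combination parameters eta_r."""
    eta = np.asarray(eta, dtype=float)
    assert np.all(eta >= 0.0), "only non-negative combination parameters are considered"
    return sum(e * K for e, K in zip(eta, weak_kernels))
```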
2 The KOMD Algorithm
We consider a classification problem with training examples {(x_1, y_1), . . . , (x_l, y_l)} and test examples {(x_{l+1}, y_{l+1}), . . . , (x_L, y_L)}, with x_i ∈ R^m and y_i ∈ {−1, +1}. We use X ∈ R^{L×m} to denote the matrix whose rows are the examples and y ∈ R^L the vector of labels. The matrix K ∈ R^{L×L} denotes the complete kernel matrix containing the kernel values of each (training and test) pair of examples. Further, we indicate with a hat, as in X̂ ∈ R^{l×m}, ŷ ∈ R^l, and K̂ ∈ R^{l×l}, the submatrices (or subvectors) obtained by considering the training examples only.
Given a training set, we consider the domain Γ̂ of probability distributions γ ∈ R^l_+ defined over the sets of positive and negative training examples. More formally:

Γ̂ = {γ ∈ R^l_+ | Σ_{i∈⊕} γ_i = 1, Σ_{i∈⊖} γ_i = 1}.
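As an illustration of this definition, the short sketch below (our own, not from the paper) builds the uniform element of Γ̂, i.e. a γ that distributes the unit mass evenly over the positive examples and evenly over the negative examples.

```python
import numpy as np

def uniform_gamma(y_hat):
    """Uniform distribution in Gamma_hat: gamma_i = 1/n+ on positives, 1/n- on negatives."""
    y_hat = np.asarray(y_hat)
    gamma = np.zeros(len(y_hat), dtype=float)
    gamma[y_hat > 0] = 1.0 / np.sum(y_hat > 0)
    gamma[y_hat < 0] = 1.0 / np.sum(y_hat < 0)
    return gamma  # gamma >= 0 and sums to 1 separately on each class
```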
In [2], a game-theoretic interpretation of the problem of margin maximization and an algorithm called “Kernel method for the Optimization of the Margin Distribution” (KOMD) have been proposed. The classification task is posed as a two-player zero-sum game, and it is shown that this game is equivalent to a hard-margin SVM, which can be solved efficiently by optimizing a simple linearly constrained convex function of the variables γ ∈ Γ̂, namely

minimize_{γ∈Γ̂} D(γ) := γ^⊤ Ŷ K̂ Ŷ γ,

where Ŷ = diag(ŷ).
The vector γ* ∈ Γ̂ that minimizes D(γ) identifies the two nearest points in the convex hulls of the positive and the negative examples, in the feature space of the kernel K. Moreover, a quadratic regularizer over γ, namely R_γ(γ) = γ^⊤ γ, is introduced, which makes the player prefer optimal distributions (strategies) with low variance. The final best strategy for γ is obtained by solving the optimization problem min_{γ∈Γ̂} (1 − λ)D(γ) + λR_γ(γ). A correct selection of the parameter λ ∈ [0, 1] (usually made by validation on training data) is fundamental. In KOMD, the evaluation of a new generic example x is obtained by f(x) = Σ_i y_i γ_i K(x_i, x) = K_tr(x) Ŷ γ, where K_tr(x) = [K(x_1, x), . . . , K(x_l, x)]^⊤.
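The following is a minimal numpy sketch of this optimization, not the implementation used in [2]: it minimizes (1 − λ)γ^⊤ Ŷ K̂ Ŷ γ + λ||γ||² by projected gradient, exploiting the fact that the projection onto Γ̂ splits into two independent simplex projections, one per class. The step size, iteration count and function names are our own assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex {x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def komd_gamma(K_hat, y_hat, lam=0.1, n_iter=1000, step=0.01):
    """Projected-gradient minimization of (1-lam) g'YKYg + lam ||g||^2 over Gamma_hat."""
    pos, neg = y_hat > 0, y_hat < 0
    YKY = K_hat * np.outer(y_hat, y_hat)       # entrywise y_i K_ij y_j, i.e. Y K Y
    gamma = np.zeros(len(y_hat))
    gamma[pos] = 1.0 / pos.sum()               # uniform feasible starting point
    gamma[neg] = 1.0 / neg.sum()
    for _ in range(n_iter):
        grad = 2.0 * (1.0 - lam) * YKY.dot(gamma) + 2.0 * lam * gamma
        gamma = gamma - step * grad
        gamma[pos] = project_simplex(gamma[pos])   # project back onto Gamma_hat,
        gamma[neg] = project_simplex(gamma[neg])   # one simplex per class
    return gamma

def komd_score(K_test_train, y_hat, gamma):
    """f(x) = sum_i y_i gamma_i K(x_i, x), one score per row of K_test_train."""
    return K_test_train.dot(y_hat * gamma)
```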
3 EasyMKL
In MKL we want to find the best combination parameters for a set of predefined kernel matrices. In our context, this is done by learning a coefficient vector η that forms the combined kernel according to K = Σ_{r=1}^{R} η_r K_r, with η_r ≥ 0. Clearly, we must restrict the possible choices of such a matrix K, and this can be done by regularizing the learning process. So, we pose the problem of learning the kernel combination as a min-max problem over the variables γ and η. Specifically,
we propose to maximize the distance between positive and negative examples using a unit-norm vector η as the weak-kernel combination vector, that is,

max_{||η||=1} min_{γ∈Γ̂} (1 − λ) γ^⊤ Ŷ (Σ_r η_r K̂_r) Ŷ γ + λ||γ||² = min_{γ∈Γ̂} max_{||η||=1} Q(η, γ),

where Q(η, γ) = (1 − λ) η^⊤ d(γ) + λ||γ||² and the r-th entry of d(γ) is d_r(γ) = γ^⊤ Ŷ K̂_r Ŷ γ.
Now, it is possible to see that the vector η* maximizing the function Q(η, γ) above has a simple analytic solution: η* = d(γ)/||d(γ)||. Plugging this solution into the min-max problem, we obtain

min_{γ} Q(η*, γ) = min_{γ} (1 − λ)||d(γ)|| + λ||γ||².
The optimal γ will be the (regularized) minimizer of the 2-norm of the vector of distances. This minimizer is not difficult to find, as the objective is convex, though not quadratic. In order to simplify the problem further, we prefer to minimize an upper bound instead, corresponding to the 1-norm of the vector of distances:

min_{γ} (1 − λ)||d(γ)||_1 + λ||γ||² = min_{γ} (1 − λ) γ^⊤ Ŷ (Σ_r K̂_r) Ŷ γ + λ||γ||².
Interestingly, this minimization problem is the same as the KOMD problem where the kernel matrix has been substituted with the simple sum of the weak kernel matrices. This turns out to be very important for the efficiency (especially in terms of space) of the method, as will be explained later. In the following, we refer to this algorithm as EasyMKL. A final consideration we can make here is that the quality of a kernel does not change if it is multiplied by a positive constant. However, this formulation makes it particularly clear that, if kernels with different traces are present in the combination, they have an unequal impact on the MKL optimization problem. Thus, unless otherwise motivated, this should be avoided.
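Putting the pieces together, a sketch of the procedure suggested by the derivation above could look as follows (reusing komd_gamma from the KOMD sketch; names and defaults are our own, and this is not the authors' code): solve the KOMD problem on the plain sum of the weak training kernels, then set η = d(γ)/||d(γ)||.

```python
import numpy as np

def easymkl(weak_K_hats, y_hat, lam=0.1):
    """Learn the combination weights eta and the distribution gamma.
    weak_K_hats: list of l x l weak kernel matrices on the training set."""
    K_sum = sum(weak_K_hats)                   # the sum kernel used in place of K_hat
    gamma = komd_gamma(K_sum, y_hat, lam=lam)  # KOMD on the summed kernel (sketch above)
    yg = y_hat * gamma                         # the vector Y_hat gamma
    d = np.array([yg.dot(K_r).dot(yg) for K_r in weak_K_hats])  # d_r = g' Y K_r Y g
    eta = d / np.linalg.norm(d)                # analytic maximizer eta* = d / ||d||
    return eta, gamma
```

One way to read the space efficiency mentioned above is that only the running sum of the weak kernels needs to be kept in memory during the KOMD step, while the individual K̂_r can be visited one at a time when computing d(γ); this is our reading of the claim, not a statement made explicitly in the text.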
4 Experiments and Results
Experiments have been performed comparing the proposed method with other state-of-the-art MKL algorithms with respect to both accuracy and computational performance. A total of 4 datasets and 9 different methods have been tested.
4.1 Datasets
Benchmark datasets with different characteristics have been used: the Splice dataset [3] (60 features, 1,000 examples), the Batch2 dataset [4] (128 features, 1,244 examples), the Mush dataset [3] (112 features, 2,000 examples) and the Gisette dataset [5] (5,000 features, 1,000 examples). All data have been scaled to the [−1, +1] interval, and approximately 10% of each dataset has been selected for training. We decided to use relatively small training sets because in this case the final accuracy is less influenced by the classifier (to which the combined kernel is applied) and more by the real quality of the kernel. Further, we preferred having larger test sets and many different splits of a single dataset in order to improve the significance of the overall evaluation.
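For concreteness, here is a small sketch of the preprocessing just described (feature-wise scaling to [−1, +1] and a roughly 10% training split). The random-permutation split and the seed are our own assumptions; the paper only states the scaling range and the proportion.

```python
import numpy as np

def scale_and_split(X, y, train_frac=0.10, seed=0):
    """Scale each feature to [-1, +1] and keep ~10% of the examples for training."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)     # guard against constant features
    Xs = 2.0 * (X - lo) / span - 1.0
    idx = np.random.default_rng(seed).permutation(len(y))
    n_tr = int(round(train_frac * len(y)))
    return Xs[idx[:n_tr]], y[idx[:n_tr]], Xs[idx[n_tr:]], y[idx[n_tr:]]
```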
4.2 Methods
We have considered three baseline methods for our experiments. The first is the classical SVM trained with all the features. This is reported only for the sake of completeness, to give an idea of the difficulty of the datasets; note that the feature space in this case is different from that of the MKL methods and the results are not directly comparable. The second method, called Random Kernel (Random), consists of picking one of the available kernels at random, which is equivalent to setting a single η_r to 1 and all the others to 0. We decided to include this baseline as an indicator of how much information is brought by individual kernels; by design, its performance can be quite poor. Finally, the last baseline is the Average Kernel (Average), where the same weight η_r is given to all the available weak kernels. Despite its simplicity, it is known (see [6]) that this is a strong baseline that quite often beats other, more advanced techniques and can obtain very good results. We have then considered the following state-of-the-art MKL algorithms, used in combination with SVM; they all belong to the optimization-based family of MKL methods. A MATLAB implementation¹ of these methods by Mehmet Gönen and Ethem Alpaydin [1] has been used.
• Simple MKL (SMKL): an iterative algorithm by Rakotomamonjy et al. [7] that implements a linear approach with the kernel weights constrained to a simplex.
• Generalized MKL (GMKL): an algorithm by Varma and Babu [8] that exploits a nonlinear approach, different from SMKL, which in general gives better results [1].
• Lasso-based MKL (GLMKL): an algorithm by Kloft et al. [9] and Xu et al. [10] that applies a 1-norm regularization to the kernel weights and finds them by solving a small QP problem at each iteration.
• SMO Projected Gradient Descent Generalized MKL (SPG-GMKL): this algorithm [11] is a state-of-the-art method with respect to computational performance.
3. Rank test examples using K and SVM (KOMD for the proposed method)
and evaluate the quality of the ranking with the AUC metric.
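The AUC of such a ranking can be computed directly as the probability that a randomly drawn positive test example is scored above a randomly drawn negative one (ties counted as 1/2). The small sketch below (our own, not part of the paper) does exactly that and can be fed the scores produced by komd_score from the earlier sketch.

```python
import numpy as np

def auc(scores, y_true):
    """AUC of the ranking induced by `scores` on labels in {-1, +1}."""
    pos = scores[y_true > 0]
    neg = scores[y_true < 0]
    diff = pos[:, None] - neg[None, :]   # pairwise positive-vs-negative comparisons
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)
```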
The same experiments have been repeated 1,000 times, averaging the AUC results in order to improve the significance of the evaluation. To keep the comparison as fair as possible, the KOMD algorithm has been used with the kernel produced by our MKL method, while SVM has been used with all the other methods; however, in [2] it has been shown that these two methods have in fact similar accuracy.
¹ http://www.cmpe.boun.edu.tr/~gonen/mkl
RBF kernels computed on all the features have been used to train the SVM baseline, which obtained an AUC of 86.12% on the Splice dataset, 95.20% on the Batch2 dataset and 99.52% on the Mush dataset. The results obtained with the MKL algorithms are summarized in Table 1 (d = 5 and d = 10), which shows that standard MKL methods do not perform significantly better than simple kernel averaging. On the other hand, EasyMKL has significantly better AUC on two out of three datasets. This trend is confirmed in Table 2, where the proposed method is always significantly better than the average baseline. Unfortunately, given the relatively large number of features of the Gisette dataset, it was not possible to run the state-of-the-art MKL methods on this data without running out of memory.
Table 1: AUC results on three datasets of various MKL methods and baselines
(RBF feature subset d = 5 and d = 10).
References
[1] Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, 2011.
[2] Fabio Aiolli, Giovanni Da San Martino, and Alessandro Sperduti. A kernel method for
the optimization of the margin distribution. In ICANN (1), pages 305–314, 2008.
[3] K. Bache and M. Lichman. UCI Machine Learning Repository, 2013.
[4] Alexander Vergara, Shankar Vembu, Tuba Ayhan, Margaret A. Ryan, Margie L. Homer, and Ramón Huerta. Chemical gas sensor drift compensation using classifier ensembles, 2012.
[5] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the nips
2003 feature selection challenge. In Lawrence K. Saul, Yair Weiss, and Léon Bottou,
editors, Advances in NIPS 17, pages 545–552. MIT Press, Cambridge, MA, 2005.
[6] Xinxing Xu, Ivor W. Tsang, and Dong Xu. Soft margin multiple kernel learning. IEEE
Trans. Neural Netw. Learning Syst., 24(5):749–761, 2013.
[7] Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu, and Yves Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
[8] Manik Varma and Bodla Rakesh Babu. More generality in efficient multiple kernel learning. In Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, editors, Proceedings of ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, volume 382 of ACM International Conference Proceeding Series. ACM, 2009.
[9] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. Non-sparse regular-
ization and efficient training with multiple kernels. CoRR, abs/1003.0079, 2010.
[10] Zenglin Xu, Rong Jin, Haiqin Yang, Irwin King, and Michael R. Lyu. Simple and efficient
multiple kernel learning by group lasso. In ICML, pages 1175–1182, 2010.
[11] A. Jain, S. V. N. Vishwanathan, and M. Varma. SPG-GMKL: Generalized multiple kernel learning with a million kernels. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2012.