Efficient Large Scale Linear Programming Support Vector Machines

Suvrit Sra
1 Introduction
The Linear Programming Chunking (LPC) method [2] decomposes a linear program into smaller chunks and solves them using any LP solver. However, despite its efficiencies, LPC can be prohibitively slow for large problems. Thus, an efficient method for solving LP-based SVM formulations is needed, and this paper presents such a method. Our approach yields a scalable and efficient decomposition method for solving the ℓ1-SVM, which is several orders of magnitude faster than the LPC approach and makes learning large-scale ℓ1-SVMs practical. Furthermore, our method is simple to implement, provably convergent, easily extensible to other SVM variants, and yields accuracies competitive with well-established SVM software.
Much of the speed of our algorithm lies in the fact that we solve the primal, as opposed to the dual, formulation of the ℓ1-SVM.² Recently, Chapelle [5] has provided motivation for reconsidering the solution of the primal formulation of the SVM. For many years, owing to its simplicity and extensibility to nonlinear cases, researchers have focused on solving the dual problem for SVMs. There are, however, settings in which the primal is the better choice. For example, when the number of training points greatly exceeds the dimensionality of a single point, it is advantageous to solve the primal rather than the dual (despite the novel efficiencies introduced in [14]).
With f bounding |w| componentwise (so that 1ᵀf plays the role of ‖w‖₁), the ℓ1-SVM is the linear program

\[
  \min_{w,\,b,\,f,\,\xi}\;\; \mathbf{1}^T (f + C\xi)
  \quad\text{subject to}\quad
  y_i\bigl(\langle x_i, w\rangle + b\bigr) \ge 1 - \xi_i,\quad 1 \le i \le N,
  \tag{2.2}
\]
\[
  -w - f \le 0,\qquad w - f \le 0,\qquad \xi \ge 0.
\]
² Our approach has a more primal-dual flavor, but since we never form the dual, we continue referring to it as a primal approach.
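To make the formulation concrete, the following is a small sketch (not the paper's implementation) of how (2.2) can be assembled and handed to a generic off-the-shelf LP solver. It assumes SciPy's `linprog` and dense constraint matrices, which is precisely the naive approach that becomes expensive at scale; the function and variable names are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm_lp(X, y, C=1.0):
    """Assemble and solve the l1-SVM LP (2.2) with a generic LP solver.

    Variables are stacked as g = [w (M); b (1); f (M); xi (N)], so the
    objective 1^T f + C 1^T xi corresponds to c = [0; 0; 1; C*1].
    """
    N, M = X.shape
    c = np.concatenate([np.zeros(M + 1), np.ones(M), C * np.ones(N)])

    # Margin constraints y_i(<x_i, w> + b) >= 1 - xi_i,
    # rewritten in "<=" form: -y_i(<x_i, w> + b) - xi_i <= -1.
    A_margin = np.hstack([-y[:, None] * X, -y[:, None],
                          np.zeros((N, M)), -np.eye(N)])
    b_margin = -np.ones(N)

    # -w - f <= 0 and w - f <= 0, i.e. f bounds |w| componentwise.
    I, Z = np.eye(M), np.zeros((M, 1))
    A_abs = np.vstack([np.hstack([-I, Z, -I, np.zeros((M, N))]),
                       np.hstack([ I, Z, -I, np.zeros((M, N))])])
    b_abs = np.zeros(2 * M)

    # w, b, f are free; xi >= 0 is handled via simple bounds.
    bounds = [(None, None)] * (2 * M + 1) + [(0, None)] * N
    res = linprog(c, A_ub=np.vstack([A_margin, A_abs]),
                  b_ub=np.concatenate([b_margin, b_abs]), bounds=bounds)
    w, b = res.x[:M], res.x[M]
    return w, b
```

Even for moderately sized data the constraint matrix is large and dense in this naive form, which is exactly why a decomposition approach such as the one in Section 3 is preferable.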
For large-scale data, solving the LP (2.2) using off-the-shelf software can be very
expensive. Bradley and Mangasarian [2] described a method called Linear Pro-
gramming Chunking for efficiently solving (2.2). However, despite their “chunk-
ing” approach, their method can still be extremely slow for large data sets. In
Section 3 below, we describe a fast decomposition procedure for solving (2.2). We
remark that our techniques can be easily adapted to solve the ℓ∞-norm SVM.
Furthermore, nonlinear SVMs via the LP-machines [8] can also be handled with
equal ease. We omit the details due to space limitations (these will be published
elsewhere).
3 Algorithm
It may not seem obvious how to solve (2.2) using a simple decomposition method.
Problem (2.2) lacks strict convexity, a necessary ingredient for the application
of many decomposition techniques. Fortunately, we can exploit a very useful
result of Mangasarian [12, Theorem 2.1-a-i] (adapted as Theorem 1 below) that
permits us to transform (2.2) into an equivalent quadratic program that has the
necessary strict convexity.
Theorem 1 (ℓ1-SVM). Let g = [w; b; f; ξ] and c = [0; 0; 1; C1] be partitioned conformally. If (2.2) has a solution, then there exists an ε₀ > 0 such that for all ε ≤ ε₀,

\[
  \operatorname*{arg\,min}_{g \in G}\; \|g + \epsilon^{-1} c\|_2^2
  \;=\;
  \operatorname*{arg\,min}_{g \in G^{\ast}}\; \|g\|_2^2 ,
  \tag{3.1}
\]

where G is the feasible set of (2.2) and G* is the set of optimal solutions to (2.2). The minimizer of (3.1) is unique.
Theorem 1 essentially states that the solution of (3.1) yields the minimum ℓ2-norm solution out of all the possible solutions of (2.2). This seemingly counterintuitive replacement of a linear program by a corresponding quadratic program lies at the heart of building a decomposition method for the ℓ1-SVM.
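To see what (3.1) is doing, it helps to expand the squared norm (this is only a restatement of the theorem's content in terms of the LP objective cᵀg = 1ᵀf + C1ᵀξ):

\[
  \|g + \epsilon^{-1} c\|_2^2
  \;=\; \|g\|_2^2 + 2\epsilon^{-1} c^T g + \epsilon^{-2}\|c\|_2^2 .
\]

Minimizing this over the feasible set G is therefore equivalent, after scaling by ε/2 and dropping the constant term, to minimizing

\[
  c^T g + \tfrac{\epsilon}{2}\,\|g\|_2^2
  \;=\; \mathbf{1}^T f + C\,\mathbf{1}^T \xi + \tfrac{\epsilon}{2}\,\|g\|_2^2 ,
\]

i.e., the LP objective of (2.2) plus a small strictly convex regularizer. Theorem 1 guarantees that once ε ≤ ε₀ this regularizer no longer perturbs optimality; it simply selects the least ℓ2-norm optimizer of (2.2), while supplying the strict convexity that the decomposition below exploits.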
3.1 Decomposition
Following Theorem 1, instead of the LP we solve the (strictly convex) quadratic program

\[
  \min_{g = [w;\,b;\,f;\,\xi]}\;\; \tfrac{1}{2}\,\bigl\| g - (-\epsilon^{-1})\,c \bigr\|^2 ,
  \tag{3.2}
\]

subject to the constraints of (2.2), namely the margin constraints, the constraints coupling w and f, and the soft-margin (ξ) constraints. Further, let A denote the matrix of all these constraints put together.
Let L(g, z) denote the Lagrangian for (3.2). A first-order necessary condition of optimality is

\[
  \frac{\partial}{\partial g} L(g, z) \;=\; g + \epsilon^{-1} c + A^T z \;=\; 0,
  \qquad z \ge 0 ,
  \tag{3.3}
\]

or equivalently g = −ε⁻¹c − Aᵀz for some nonnegative multiplier vector z.
The decomposition procedure that we use consists of the following main steps:
1. Start with a dual feasible solution, and obtain a corresponding primal solution so that the first-order necessary conditions (3.3) are satisfied. For example, z = 0 and g = −ε⁻¹c is a valid initialization.
2. Go through each constraint individually and enforce it. Enforcing each con-
straint is equivalent to updating the corresponding dual variable zj (1 ≤ j ≤
4N ) so that zj ≥ 0 is maintained, while recomputing g to ensure that (3.3)
remains satisfied.
3. Repeat Step 2 until some convergence condition is satisfied (such as a small net violation of all the KKT constraints, or a change in the objective function below a certain threshold); a sketch of this loop is given below.
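For concreteness, here is one natural instantiation of Steps 1–3 as a row-action update in the spirit of Hildreth's procedure [9]. It is a sketch, not the author's C++ solver: it assumes the constraints of (3.2) have been collected row-wise into a matrix A and right-hand side b, so that feasibility reads Ag ≤ b, and it maintains (3.3) in the form g = −ε⁻¹c − Aᵀz with z ≥ 0.

```python
import numpy as np

def decompose_solve(A, b, c, eps=1e-3, max_sweeps=100, tol=1e-6):
    """Sketch of a constraint-by-constraint solver for
    min 0.5 * ||g + (1/eps) * c||^2   subject to   A g <= b.

    Maintains g = -(1/eps)*c - A^T z with z >= 0, cf. (3.3).
    """
    n_cons = A.shape[0]
    z = np.zeros(n_cons)                      # Step 1: dual-feasible start
    g = -c / eps                              # matching primal point
    sq_norms = np.einsum('ij,ij->i', A, A)    # ||a_j||^2 for every row

    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(n_cons):               # Step 2: one constraint at a time
            if sq_norms[j] == 0.0:
                continue
            viol = A[j] @ g - b[j]            # positive => constraint j violated
            z_new = max(0.0, z[j] + viol / sq_norms[j])
            delta = z_new - z[j]
            if delta != 0.0:
                g -= delta * A[j]             # keep (3.3) satisfied
                z[j] = z_new
                max_change = max(max_change, abs(delta))
        if max_change < tol:                  # Step 3: crude stopping rule
            break
    return g, z
```

Each inner iteration touches only a single row of A, which is one reason such row-action schemes remain attractive when the full constraint matrix does not fit in memory.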
4 Experimental Results
leading SVM implementations. The latter set of experiments shows two things: i) the efficiency of our ℓ1-norm SVM implementation, and ii) the importance of solving the primal problem for data with highly skewed dimensions.
We implemented our algorithm in C++ using the sparse matrix library
SSLib [18]. The experiments reported in Section 4.1 were performed on a Pentium
4, 3GHz Linux machine equipped with 1GB RAM, whereas those in Section 4.2
were performed on a Pentium Xeon 3.2GHz Linux machine with 8GB RAM.
Table 1. Training accuracies and associated running times for our ℓ1-SVM implementation with both soft and hard margins (C = ∞), and comparative numbers for SVMlight and LIBSVM. The running times reported are exclusive of the time spent in I/O. We added timing computation code to LIBSVM. For the data sets that came with a separate test collection, we also report test accuracies and test times. Our implementation was overall faster in both training and testing, with a marginal sacrifice in accuracy.
For all the implementations tested, we used the same value for the cost parameter C. Further, for both SVMlight and LIBSVM we used linear kernels. Observe that our ℓ1-SVM outperforms both SVMlight and LIBSVM in terms of training and testing speed, with a small drop in accuracy, except for the RCV1 dataset, where the ℓ1-norm SVM has higher training accuracy.
[Figure 1 plot: running time (secs) on the y-axis against the number of training points N on the x-axis, for N up to 2 × 10⁷.]
Fig. 1. Running time of our ℓ1-SVM, demonstrating its scalability. The plot ranges from N = 100,000 to 20 million points, and the corresponding times range from 0.89s to 181s. The run time can be seen to grow linearly in the number of training points (because N ≫ M, no effect of dimensionality is discernible).
⁴ This estimate is conservative because if one compares the performance of the cluster with that of a single Xeon-based machine, the cluster is a few times faster, as can be ascertained by going through CPU/System comparison benchmarks.
method of choice. Our speed gains come from two sources: i) the decomposition
approach, and ii) solving the primal problem instead of the dual.
We mention in passing that Mangasarian and Musicant [14] solve a particular version of the ℓ2-norm SVM problem by attacking the dual formulation using an active set approach. Our ℓ1-norm implementation can be modified to solve the primal version of their problem, and once again it outperforms the dual approach. Mangasarian and Musicant [14] reported running times (on a 400 MHz Pentium II Xeon processor with 2 GB RAM) of 38 minutes for a problem of 4 million points (in ℝ³²), and 96.57 minutes for 7 million points. Comparative numbers for our implementation can be obtained from Figure 1. Interpolation yields estimates of 36 seconds for 4 million points and 63 seconds for 7 million points.
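As a rough sanity check on these interpolated figures, and assuming (as Figure 1 suggests) linear growth between the two reported endpoints,

\[
  t(N) \;\approx\; 0.89 \;+\; \frac{181 - 0.89}{2\times 10^{7} - 10^{5}}\,\bigl(N - 10^{5}\bigr)\ \text{seconds},
\]

which gives t(4 × 10⁶) ≈ 36 s and t(7 × 10⁶) ≈ 63 s, consistent with the estimates quoted above.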
5 Discussion
Acknowledgments
References
[1] K. Bennett. Combining support vector and mathematical programming methods
for classification. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in
Kernel Methods, pages 307–326. MIT Press, 1999.
[2] P. S. Bradley and O. L. Mangasarian. Massive data discrimination via linear
support vector machines. Optimization Methods and Software, 13(1), 2000.
[3] Y. Censor and S. A. Zenios. Parallel Optimization: Theory, Algorithms, and Ap-
plications. Oxford University Press, 1997.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[5] O. Chapelle. Training a support vector machine in the primal. Technical report,
Max Planck Institute for Biological Cybernetics, 2006.
[6] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines
and other kernel-based learning methods. Cambridge University Press, 2000.
[7] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
[8] T. Graepel, R. Herbrich, B. Schölkopf, A. Smola, P. Bartlett, K.-R. Müller,
K. Obermayer, and R. Williamson. Classification on proximity data with lp-
machines. In 9th Int. Conf. on Artificial Neural Networks: ICANN, 1999.
[9] C. Hildreth. A quadratic programming procedure. Naval Res. Logist. Quarterly,
4, 1957.
[10] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf,
C. Burges, and A. Smola, editors, Advances in Kernel Methods, pages 42–56. MIT
Press, 1999.
[11] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection
for text categorization research. Journal of Machine Learning Research, 5:361–
397, 2004.
[12] O. L. Mangasarian. Normal solutions of linear programs. Mathematical Program-
ming Study, 22:206–216, 1984.
[13] O. L. Mangasarian. Exact 1-norm support vector machines via unconstrained
convex differentiable minimization. Journal of Machine Learning Research, 2006.
[14] O. L. Mangasarian and D. R. Musicant. Active support vector machine classifi-
cation. In NIPS, 2001.
[15] K. V. Mardia and P. Jupp. Directional Statistics. John Wiley and Sons Ltd.,
second edition, 2000.
[16] J. Platt. Fast training of support vector machines using sequential minimal opti-
mization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel
Methods, pages 185–208. MIT Press, 1999.
[17] D. Prokhorov. IJCNN 2001 neural network competition. Slide presentation at IJCNN'01, Ford Research Laboratory, 2001.
[18] S. Sra and S. S. Jegelka. SSLib: Sparse Matrix Manipulation Library.
http://www.cs.utexas.edu/users/suvrit/work/progs/sparselib.html, 2006.
[19] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines.
In NIPS, 2004.