
ISE503 Term Project

By:
Muhammad Dzulqarnain Al Firdausi
g202210120

Supervised by:
Dr. Syed Mujahid

1. Introduction

This term project reviews the algorithm proposed by Zeng et al. (2008) for fast training of
support vector machines (SVMs) using parallel sequential minimal optimization (SMO). Training an
SVM amounts to solving a quadratic programming (QP) problem with linear constraints. With the
massive data sets available today, training an SVM on all available data requires a large amount of
memory for the kernel matrix, since the memory requirement grows quadratically, 𝑂(𝑁²), where 𝑁 is
the number of training points. Zeng et al. proposed a parallel SMO algorithm for Symmetric
Multiprocessor (SMP) machines that speeds up training on problems with large training sets and
high-dimensional spaces without reducing the generalization performance of the SVM.
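
For example, storing a dense kernel matrix for 𝑁 = 100,000 training points in double precision
already requires roughly 100,000² × 8 bytes ≈ 80 GB of memory, far more than the main memory of a
typical workstation.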

Several algorithms exist for training SVMs quickly, and SMO (Keerthi et al., 2001; Lin,
2002) is the most popular of them because of its speed, simplicity, and scalability. Although SMO
reduces the difficulty of the QP problem through a decomposition strategy, the memory problem of
training an SVM on a large number of training points still exists and needs to be addressed further.

2. SMO review

The formula for the output 𝑢 of a linear support vector machine is:

𝑢 = 𝒘𝑇 𝒙 + 𝑏

Where 𝒙 ∈ ℝ𝑚 is the input, and the objective is to find 𝒘 and 𝑏 that maximize the margin. The
support vector method finds the maximum margin through the optimal solution of optimization
problem (1):

min:  ½‖𝒘‖²                                                          (1)
𝑠𝑡:  𝑦𝑖(𝒘𝑇𝒙𝑖 + 𝑏) ≥ 1   ∀𝑖
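
As a concrete illustration of the output formula and the constraints of (1), the following small
Python/NumPy snippet (a sketch written for this report; the data and the values of 𝒘 and 𝑏 are made
up for illustration only) evaluates 𝑢 = 𝒘𝑇𝒙 + 𝑏 for every training point and checks whether the
margin constraints hold:

    import numpy as np

    # Made-up toy data: two 2-D points with labels +1 and -1.
    X = np.array([[2.0, 2.0],
                  [0.0, 0.0]])
    y = np.array([1.0, -1.0])

    # Hypothetical separating hyperplane parameters (illustration only).
    w = np.array([0.5, 0.5])
    b = -1.0

    u = X @ w + b                  # u_i = w^T x_i + b for every point
    feasible = y * u >= 1 - 1e-12  # margin constraints of problem (1)
    print(u, feasible)             # [ 1. -1.] [ True  True]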

To solve problem (1) we can use the Lagrangian, introducing 𝛼𝑖 ≥ 0 as the Lagrange multipliers:

ℒ(𝒘, 𝑏, 𝜶) = ½‖𝒘‖² − ∑ᵢ₌₁ᴺ 𝛼𝑖(𝑦𝑖(𝒘𝑇𝒙𝑖 + 𝑏) − 1)


Where 𝑁 is the number of training data points for the SVM. Expanding the Lagrangian gives

ℒ(𝒘, 𝑏, 𝜶) = ½𝒘𝑇𝒘 − ∑ᵢ₌₁ᴺ 𝛼𝑖𝑦𝑖𝒘𝑇𝒙𝑖 − 𝑏(𝜶𝑇𝒚) + 𝒆𝑇𝜶                      (2)

Taking the partial derivatives with respect to 𝒘 and 𝑏 and setting them to zero, we obtain

𝜕ℒ/𝜕𝒘 = 𝒘 − ∑ᵢ₌₁ᴺ 𝛼𝑖𝑦𝑖𝒙𝑖 = 0   ⟹   𝒘 = ∑ᵢ₌₁ᴺ 𝛼𝑖𝑦𝑖𝒙𝑖                     (3)

𝜕ℒ/𝜕𝑏 = −𝜶𝑇𝒚 = 0   ⟹   𝜶𝑇𝒚 = 0                                        (4)

Where 𝒚 ∈ ℝᴺ is the label vector with 𝑦𝑖 ∈ {1, −1}, and 𝒆 ∈ ℝᴺ is a vector of all ones.
SMO solves the smallest possible optimization problem at every iteration. The smallest possible
sub-problem involves two 𝛼’s, since the multipliers must obey the linear equality constraint 𝜶𝑇𝒚 = 0.
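
The QP that SMO actually works on is the dual of (1), obtained by substituting (3) and (4) back
into ℒ. We state it here with the numbering (5)–(6) so that the step-by-step description below is
self-contained; the upper bound 𝐶 on the multipliers comes from the soft-margin version of (1)
with slack variables:

max:  ∑ᵢ₌₁ᴺ 𝛼𝑖 − ½ ∑ᵢ₌₁ᴺ ∑ⱼ₌₁ᴺ 𝛼𝑖𝛼𝑗𝑦𝑖𝑦𝑗𝒙𝑖𝑇𝒙𝑗                             (5)

𝑠𝑡:  𝜶𝑇𝒚 = 0,   0 ≤ 𝛼𝑖 ≤ 𝐶   ∀𝑖                                        (6)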
There are two components in SMO:

1. A heuristic to choose which 2 𝜶’s to optimize.


2. An analytic method to solve for the 2 𝜶’s.

And in every SMO iteration:

1. SMO chooses 2 𝜶’s to optimize together.


2. Finds the optimal value of those 2 𝜶’s.
3. Uses the new values to update the SVM.

The bounds for the two 𝛼’s are:

If 𝑦1 ≠ 𝑦2 the following bounds apply to 𝛼2 :

• Lower bound = max(0, 𝛼2 − 𝛼1 )


• Upper bound = min(𝐶, 𝐶 + 𝛼2 − 𝛼1 )

If 𝑦1 = 𝑦2 the following bounds apply to 𝛼2 :

• Lower bound = max(0, 𝛼2 + 𝛼1 − 𝐶)


• Upper bound = min(𝐶, 𝛼2 + 𝛼1 )

Where 𝐶 is the upper bound on all the 𝛼 variables (Zeng et al., 2008).
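
For illustration, a small helper that computes these clipping bounds could look as follows (a
sketch in Python; the function name compute_bounds is ours, not from the paper):

    def compute_bounds(alpha_i, alpha_j, y_i, y_j, C):
        """Return (lower, upper) clipping bounds for the second multiplier alpha_j."""
        if y_i != y_j:
            return max(0.0, alpha_j - alpha_i), min(C, C + alpha_j - alpha_i)
        return max(0.0, alpha_j + alpha_i - C), min(C, alpha_j + alpha_i)

    # Example: y_i != y_j, alpha_i = 0.3, alpha_j = 0.8, C = 1  ->  (0.5, 1.0)
    print(compute_bounds(0.3, 0.8, +1, -1, C=1.0))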

The Karush-Kuhn-Tucker (KKT) conditions for each 𝛼𝑖 are:

• 𝛼𝑖 = 0 means that the input is correctly classified with room to spare (it lies outside the margin)

• 𝛼𝑖 = 𝐶 means that the input is inside the margin or misclassified

• 0 < 𝛼𝑖 < 𝐶 means that the input lies exactly on the margin and is a support vector
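
For illustration, the check "does 𝛼𝑖 violate the KKT conditions by more than a tolerance 𝜀?",
which drives the selection of the first multiplier in the steps below, can be written as a small
sketch (the function name is ours):

    def violates_kkt(alpha_i, y_i, E_i, C, eps=1e-3):
        """True if the i-th multiplier violates the KKT conditions beyond tolerance eps.

        E_i = u_i - y_i is the prediction error of point i: a point with alpha_i < C
        must satisfy y_i * E_i >= -eps, and a point with alpha_i > 0 must satisfy
        y_i * E_i <= eps.
        """
        r_i = y_i * E_i
        return (r_i < -eps and alpha_i < C) or (r_i > eps and alpha_i > 0)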

The steps of SMO are as follows (a code sketch of these steps is given after the list):

1. Initialization: set 𝐶 as the upper bound for all 𝛼 variables, 𝜀 as the error tolerance, index 𝑖 = 0
for the first 𝛼, 𝑏 = 0, and 𝑘 = 0. Select 𝜶𝑘 as the initial feasible solution, i.e., set 𝜶0 = 𝟎,
where 𝜶 ∈ ℝᴺ has one entry per training point.
2. If 𝜶𝑘 is an optimal solution of (5) with constraint (6), i.e., it satisfies the KKT conditions, stop.
Otherwise, go to step 3.
3. For a given input 𝒙𝑖, calculate the output error 𝐸𝑖 by using (3):
𝑢𝑖 = 𝒘𝑇𝒙𝑖 + 𝑏 = ∑ₗ₌₁ᴺ 𝛼ₗ𝑦ₗ𝒙ₗ𝑇𝒙𝑖 + 𝑏
𝐸𝑖 = 𝑢𝑖 − 𝑦𝑖
4. If (𝛼𝑖 < 𝐶 and 𝑦𝑖𝐸𝑖 < −𝜀) or (𝛼𝑖 > 0 and 𝑦𝑖𝐸𝑖 > 𝜀), i.e., 𝛼𝑖 violates the KKT conditions, go to
step 5, where 𝛼𝑖 is the 𝑖-th element of 𝜶. Otherwise, set 𝑖 = 𝑖 + 1 and go to step 3.
5. Choose the index 𝑗 of the second 𝛼 at random, i.e., draw 𝑗 uniformly from {0, …, 𝑁 − 1} with
𝑗 ≠ 𝑖. For the input 𝒙𝑗, calculate the output error 𝐸𝑗 in the same way:
𝐸𝑗 = ∑ₗ₌₁ᴺ 𝛼ₗ𝑦ₗ𝒙ₗ𝑇𝒙𝑗 + 𝑏 − 𝑦𝑗
6. Set the bounds (bound1, bound2) for 𝛼𝑗:
(bound1, bound2) = (max(0, 𝛼𝑗 + 𝛼𝑖 − 𝐶), min(𝐶, 𝛼𝑗 + 𝛼𝑖))   𝑓𝑜𝑟 𝑦𝑖 = 𝑦𝑗
(bound1, bound2) = (max(0, 𝛼𝑗 − 𝛼𝑖), min(𝐶, 𝛼𝑗 − 𝛼𝑖 + 𝐶))   𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
7. Set 𝜂 = 2𝒙T𝑖 𝒙𝑗 − 𝒙T𝑖 𝒙𝑖 − 𝒙𝑗T 𝒙𝑗


8. If bound1 ≠ bound2 and 𝜂 < 0, compute the update 𝛼̂𝑗 = 𝛼𝑗 − 𝑦𝑗(𝐸𝑖 − 𝐸𝑗)/𝜂 and clip it to the
bounds: 𝛼̂𝑗 = bound1 if 𝛼̂𝑗 < bound1, and 𝛼̂𝑗 = bound2 if 𝛼̂𝑗 > bound2.
Otherwise, go to step 3.
9. Set a minimum change for the optimized alpha (𝑚𝑖𝑛α = 0.0001).
If |𝛼̂𝑗 − 𝛼𝑗| ≥ 𝑚𝑖𝑛α :
• 𝛼̂𝑖 = 𝛼𝑖 + 𝑦𝑖𝑦𝑗(𝛼𝑗 − 𝛼̂𝑗)
• 𝑏1 = 𝑏 − 𝐸𝑖 − 𝑦𝑖(𝛼̂𝑖 − 𝛼𝑖)(𝒙𝑖𝑇𝒙𝑖) − 𝑦𝑗(𝛼̂𝑗 − 𝛼𝑗)(𝒙𝑖𝑇𝒙𝑗)
• 𝑏2 = 𝑏 − 𝐸𝑗 − 𝑦𝑖(𝛼̂𝑖 − 𝛼𝑖)(𝒙𝑖𝑇𝒙𝑗) − 𝑦𝑗(𝛼̂𝑗 − 𝛼𝑗)(𝒙𝑗𝑇𝒙𝑗)
• 𝛼𝑖 = 𝛼̂𝑖 ; 𝛼𝑗 = 𝛼̂𝑗

Otherwise, go to step 3.

10. Set 𝑏 = 𝑏1 if 0 < 𝛼𝑖 < 𝐶; 𝑏 = 𝑏2 if 0 < 𝛼𝑗 < 𝐶; otherwise 𝑏 = (𝑏1 + 𝑏2)/2.
11. 𝑘 = 𝑘 + 1 and go to step 2.
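
Putting the eleven steps together, a minimal single-threaded sketch in Python/NumPy could look as
follows. This is an illustrative re-implementation of the simplified procedure above, not the
parallel SMP implementation of Zeng et al. (2008); the function and variable names are ours, and a
linear kernel 𝒙𝑖𝑇𝒙𝑗 is assumed throughout.

    import numpy as np

    def smo_train(X, y, C=1.0, eps=1e-3, min_alpha=1e-4, max_iter=50, seed=0):
        """Simplified SMO for a linear SVM (steps 1-11 above). Returns (alpha, w, b)."""
        rng = np.random.default_rng(seed)
        N = X.shape[0]
        K = X @ X.T                     # linear kernel matrix (O(N^2) memory)
        alpha = np.zeros(N)             # step 1: alpha^0 = 0
        b = 0.0

        def error(i):                   # step 3: E_i = sum_l alpha_l y_l x_l^T x_i + b - y_i
            return (alpha * y) @ K[:, i] + b - y[i]

        for _ in range(max_iter):       # step 2, approximated by an iteration cap
            changed = 0
            for i in range(N):
                E_i = error(i)
                # step 4: skip i unless alpha_i violates the KKT conditions
                if not ((y[i] * E_i < -eps and alpha[i] < C) or
                        (y[i] * E_i > eps and alpha[i] > 0)):
                    continue
                j = int(rng.integers(N - 1))   # step 5: random second index j != i
                if j >= i:
                    j += 1
                E_j = error(j)
                if y[i] == y[j]:               # step 6: clipping bounds for alpha_j
                    low, high = max(0.0, alpha[j] + alpha[i] - C), min(C, alpha[j] + alpha[i])
                else:
                    low, high = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
                eta = 2.0 * K[i, j] - K[i, i] - K[j, j]       # step 7
                if low == high or eta >= 0:                   # step 8: skip degenerate pairs
                    continue
                a_j = min(max(alpha[j] - y[j] * (E_i - E_j) / eta, low), high)
                if abs(a_j - alpha[j]) < min_alpha:           # step 9
                    continue
                a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)
                b1 = b - E_i - y[i] * (a_i - alpha[i]) * K[i, i] - y[j] * (a_j - alpha[j]) * K[i, j]
                b2 = b - E_j - y[i] * (a_i - alpha[i]) * K[i, j] - y[j] * (a_j - alpha[j]) * K[j, j]
                alpha[i], alpha[j] = a_i, a_j
                if 0 < alpha[i] < C:                          # step 10: update the threshold b
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2.0
                changed += 1
            if changed == 0:            # no KKT violation found in a full pass: stop
                break
        w = (alpha * y) @ X             # recover w from (3)
        return alpha, w, b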

3. Experiment

We implemented the SMO algorithm on two different data sets: the first is a simulated
interview-score data set and the second is a clustering data set. The experiments were carried out
on an HP Pavilion 14 laptop with an Intel i5-8250U CPU @ 1.60 GHz (8 logical CPUs) and 8 GB RAM.
The operating system is Windows 11 Pro 64-bit. The training parameters are: maximum number of
iterations 50, 𝜀 = 0.001, 𝐶 = 1, and 𝑚𝑖𝑛α = 0.0001. The training results of the algorithm
are shown in Table 1 and Table 2.
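
For orientation, these settings correspond to a call of the illustrative smo_train sketch given at
the end of Section 2, e.g.

    alpha, w, b = smo_train(X_train, y_train, C=1.0, eps=0.001,
                            min_alpha=0.0001, max_iter=50)

where X_train and y_train are placeholders for the data matrix and label vector of the respective
data set.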

Table 1 Experiment time

No.  Data set      # data  # features  Run time (s)
1    Interview     18      2           0.839
2    Clustering-2  1000    2           65


Table 2 Parameters after training

Data set  Support vector         Label  Alpha                    𝑤                          𝑏
1         [5.123456, 5.123456]   -1     0.07892416962267639      [0.83937782 0.32124392]    [-6.946395]
          [8.654321, 2.123456]    1     0.3144515640190321
          [7.123456, 6.123456]    1     0.7644931830761286
2         [-4.83 -7.67]          -1     1.3552527156068805e-20   [0.05419894 0.09979857]    [0.02902828]
          [-3.39 -8.47]          -1     0.0015174344953036586
          [-4. -7.9]             -1     0.0003312351054359984
          [-3.89 -7.71]          -1     0.0018399777928126578
          [-4.4 -7.61]           -1     0.0007884073985461118
          [-4.36 -8.27]          -1     0.0018830933020893476
          [6.1 6.52]              1     0.0018449629732119996
          [3.91 8.1 ]             1     0.004511208969107187

References

Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., Murthy, K.R.K., 2001. Improvements to Platt’s
SMO Algorithm for SVM Classifier Design. Neural Computation 13, 637–649.
https://doi.org/10.1162/089976601300014493
Lin, C.-J., 2002. Asymptotic convergence of an SMO algorithm without any assumptions. IEEE
Transactions on Neural Networks 13, 248–250. https://doi.org/10.1109/72.977319
Zeng, Z.-Q., Yu, H.-B., Xu, H.-R., Xie, Y.-Q., Gao, J., 2008. Fast training Support Vector
Machines using parallel sequential minimal optimization, in: 2008 3rd International
Conference on Intelligent System and Knowledge Engineering (ISKE 2008). IEEE, Xiamen,
China, pp. 997–1001. https://doi.org/10.1109/ISKE.2008.4731075
